Controlling pronunciation in TTS

SSML, phonemes, lexicons, and other pronunciation controls

Updated June 29, 2026

A voice reads "Siobhan" as "see-OH-ban," "Reyes" as "rays," and your product name as something you do not recognize. The model is not malfunctioning; it is guessing, because spelling does not predict sound and these are exactly the words its grapheme-to-phoneme step was never going to get right. Names, brands, acronyms, and numbers are where every TTS deployment eventually has to override the model.[1][2]

The tools for overriding it run from a crude rewrite to an exact phonetic spec. The sections below begin at the crude end, so the lightest tool that fixes the problem comes first.

Respelling

The crudest fix rewrites the text phonetically and lets the model read the rewrite: "Reyes" becomes "ray-ess," "Nguyen" becomes "win." It costs nothing and sometimes works.

Respelling is also fragile: it counters the model's spelling rules with more spelling, the result varies between voices and languages, and it corrupts the real text, which causes problems wherever that string is later displayed or logged. It is a stopgap rather than a permanent fix.
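If respelling is used as a stopgap, one way to limit the damage is to apply it only to the copy of the text sent to the voice and leave the displayed or logged string alone. The following is a minimal sketch under that assumption; `RESPELLINGS` and `spoken_form` are hypothetical names, and the respellings themselves are illustrative and would need checking against each voice.

```python
import re

# Hypothetical respelling table: written form -> phonetic respelling.
# Entries are illustrative; each one must be tested on the actual voice.
RESPELLINGS = {
    "Siobhan": "shi-VAWN",
    "Nguyen": "win",
}

def spoken_form(text: str, respellings: dict[str, str] = RESPELLINGS) -> str:
    """Return a copy of `text` with whole-word respellings applied.

    The input string is not modified, so it can still be displayed or logged.
    """
    if not respellings:
        return text
    # Longest keys first, so a longer term wins over a shorter one it contains.
    keys = sorted(respellings, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda m: respellings[m.group(1)], text)

display_text = "Siobhan Nguyen is on line two."
tts_text = spoken_form(display_text)  # send this to the voice, keep display_text for the UI
```

The table still has to be maintained per voice and per language, which is the fragility described above; the sketch only keeps the rewrite out of the stored text.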

SSML say-as

SSML's <say-as> is the first proper tool. It tells the voice how to interpret a string whose written form is ambiguous: is "1234" a number, a year, or digits to read one at a time?

<say-as interpret-as="telephone">555-0142</say-as>
<say-as interpret-as="characters">SKU</say-as>      <!-- S-K-U, spelled out -->
<say-as interpret-as="date" format="md">3/4</say-as> <!-- month/day: March fourth -->
<say-as interpret-as="cardinal">23</say-as>          <!-- twenty-three -->

This tool applies when the issue is what kind of thing a string is, not how a word sounds. It resolves the ambiguity that text normalization alone would have to guess.
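When SSML is generated by code rather than written by hand, a small helper keeps the markup well-formed. This is a sketch only: `say_as` is a hypothetical helper, and the `interpret-as` and `format` values a synthesizer accepts have to be taken from its own documentation, since SSML itself does not fix them.

```python
from typing import Optional
from xml.sax.saxutils import escape, quoteattr

def say_as(text: str, interpret_as: str, fmt: Optional[str] = None) -> str:
    """Wrap `text` in an SSML <say-as> element, escaping XML special characters."""
    attrs = f"interpret-as={quoteattr(interpret_as)}"
    if fmt is not None:
        attrs += f" format={quoteattr(fmt)}"
    return f"<say-as {attrs}>{escape(text)}</say-as>"

# Hypothetical usage: build one utterance from typed fields.
utterance = (
    "<speak>Call "
    + say_as("555-0142", "telephone")
    + " before "
    + say_as("3/4/2026", "date", "mdy")
    + ".</speak>"
)
```

Escaping matters because a raw "&" or "<" in the input would otherwise break the XML, and a synthesizer may reject or misread the whole request.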

SSML sub

When you want the voice to say something different from what is written, <sub> substitutes an alias:

<sub alias="Doctor">Dr.</sub> <sub alias="Saint">St.</sub> Mary's
<sub alias="World Health Organization">WHO</sub>

This handles abbreviations that expand differently by context ("St." as "Street" or "Saint") and acronyms you want spoken as words or expanded, without altering the displayed text.
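The "St." case can be sketched with a deliberately crude rule: treat "St." as "Saint" when a capitalized word follows it, and as "Street" otherwise. The function name `tag_st` and the rule itself are assumptions for illustration, not a general solution; real text will contain cases the rule gets wrong.

```python
import re
from xml.sax.saxutils import escape

def tag_st(text: str) -> str:
    """Wrap each "St." in <sub>, choosing the alias from what follows it.

    Heuristic (an assumption): "St." before a capitalized word means "Saint";
    anywhere else it means "Street".
    """
    def repl(match: re.Match) -> str:
        following = match.string[match.end():]
        alias = "Saint" if re.match(r"\s+[A-Z]", following) else "Street"
        return f'<sub alias="{alias}">St.</sub>'

    # Escape first so "&", "<" and ">" in the input cannot break the markup.
    return re.sub(r"\bSt\.", repl, escape(text))

tagged = tag_st("Meet at St. Mary's on Main St.")
```

The written "St." survives inside the element, so the displayed text is unchanged while the voice receives the expansion.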

Phonemes

When you need a specific pronunciation and nothing else will do, specify the sounds directly with the <phoneme> tag, using a phonetic alphabet, usually IPA or X-SAMPA.

<phoneme alphabet="ipa" ph="ʃɪˈvɔːn">Siobhan</phoneme>
<phoneme alphabet="ipa" ph="ˈnɡwɛn">Nguyen</phoneme>

This is the most reliable tool, because it specifies the exact sequence of sounds instead of leaving the reading to the model. It does require knowing the phonetic transcription and the alphabet the system accepts, and IPA versus X-SAMPA support varies. Phonemes suit a fixed set of names and terms that must be correct every time.
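For a fixed set of names, the tagging can be driven from a table instead of being typed by hand. The sketch below assumes a hypothetical `IPA_TABLE` mapping written forms to IPA strings and a synthesizer that accepts `alphabet="ipa"`; it also parses the result to confirm the markup is well-formed.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Hypothetical pronunciation table: written form -> IPA transcription.
IPA_TABLE = {"Siobhan": "ʃɪˈvɔːn"}

def phoneme(word: str, ph: str, alphabet: str = "ipa") -> str:
    """Build one SSML <phoneme> element for `word`."""
    return (
        f"<phoneme alphabet={quoteattr(alphabet)} ph={quoteattr(ph)}>"
        f"{escape(word)}</phoneme>"
    )

def annotate(text: str, table: dict[str, str] = IPA_TABLE) -> str:
    """Escape `text` for XML and wrap every whole-word table entry in <phoneme>."""
    if not table:
        return escape(text)
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    parts, pos = [], 0
    for match in pattern.finditer(text):
        parts.append(escape(text[pos:match.start()]))  # plain text before the term
        parts.append(phoneme(match.group(1), table[match.group(1)]))
        pos = match.end()
    parts.append(escape(text[pos:]))  # plain text after the last term
    return "".join(parts)

ssml = "<speak>" + annotate("Ask Siobhan & her team.") + "</speak>"
ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed
```

The transcriptions still have to come from someone who knows the intended pronunciation; the code only guarantees that each one is applied the same way every time.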

Breaks, emphasis, and prosody

Pronunciation covers more than sounds. It also includes pacing and stress, the territory of prosody. SSML offers tags for these too:

Take a deep breath <break time="500ms"/> and continue.
I said the <emphasis level="strong">red</emphasis> one.
<prosody rate="slow" pitch="-2st">This part is important.</prosody>

These tags should be used sparingly. They suit a specific line that must be delivered a particular way; good default prosody handles ordinary text.

Lexicons

Annotating every occurrence of a term is tedious and error-prone. A pronunciation lexicon (the W3C format is PLS) is a dictionary of terms and their pronunciations that applies across all your text at once, so "Soniox" or a drug name is said correctly everywhere without per-instance tags.[5] For any deployment with a stable vocabulary, a lexicon is the maintainable option, with inline SSML reserved for one-off cases.
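A PLS file is plain XML: a `lexicon` root containing `lexeme` entries, each pairing a `grapheme` with a `phoneme`. The sketch below generates one from a dictionary using Python's standard library. The function name `build_pls` and the sample entry are assumptions for illustration, and how a lexicon is uploaded or referenced differs between synthesizers, so that step is left out.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"
XML_NS = "http://www.w3.org/XML/1998/namespace"

def build_pls(entries: dict[str, str], lang: str = "en-US") -> str:
    """Serialize {written form: IPA transcription} as a PLS 1.0 lexicon."""
    # Emit the PLS namespace as the default namespace (note: a global setting).
    ET.register_namespace("", PLS_NS)
    lexicon = ET.Element(
        f"{{{PLS_NS}}}lexicon",
        {"version": "1.0", "alphabet": "ipa", f"{{{XML_NS}}}lang": lang},
    )
    for written, ipa in entries.items():
        lexeme = ET.SubElement(lexicon, f"{{{PLS_NS}}}lexeme")
        ET.SubElement(lexeme, f"{{{PLS_NS}}}grapheme").text = written
        ET.SubElement(lexeme, f"{{{PLS_NS}}}phoneme").text = ipa
    return ET.tostring(lexicon, encoding="unicode")

# Hypothetical entry; a real lexicon would hold the deployment's own terms.
pls_xml = build_pls({"Siobhan": "ʃɪˈvɔːn"})
```

Keeping the source of truth in a dictionary or spreadsheet and generating the PLS file from it makes the vocabulary reviewable by people who do not read XML.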

| Tool | Fixes | Precision |
| --- | --- | --- |
| Respelling | One-off mispronunciations | Low, fragile |
| say-as | Numbers, dates, codes, spell-out | Medium |
| sub | Abbreviations, acronyms | Medium |
| phoneme | Exact sound of a word | High |
| break / emphasis / prosody | Pacing, stress, pitch | Targeted |
| Lexicon (PLS) | A term, everywhere, consistently | High, scalable |

The lightest tool that fixes the problem is preferable. Precision and required effort both rise down the list.

Pronunciation control in neural TTS

SSML originated with the concatenative and parametric systems of the 2000s, and it assumes a pipeline that can be instructed tag by tag. The newest token-based neural models do not always work that way. Some support a subset of SSML, some ignore it, and some replace it with other mechanisms: surrounding context from which the model infers the right reading, an example supplied in the prompt, or a custom dictionary in place of inline phonemes.[3][6][7]

Because an implementation may honor only a subset of tags,[4][7] pronunciation behavior should be tested against the selected synthesizer, not inferred from the specification. Production systems also need a documented method for correcting names, numbers, and domain terms.
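One cheap check before any audio is generated is to compare the tags a document uses against the tags the target system is known to honor. The sketch below assumes a hypothetical `SUPPORTED_TAGS` set that you would fill in from the vendor's documentation and your own tests; it does not call any synthesizer.

```python
import xml.etree.ElementTree as ET

# Hypothetical capability set; fill it in from vendor docs and your own tests.
SUPPORTED_TAGS = {"speak", "break", "say-as", "sub"}

def unsupported_tags(ssml: str, supported: set[str] = SUPPORTED_TAGS) -> set[str]:
    """Return the element names used in `ssml` that are not in `supported`."""
    root = ET.fromstring(ssml)
    # Strip any "{namespace}" prefix so namespaced SSML is handled too.
    used = {element.tag.rsplit("}", 1)[-1] for element in root.iter()}
    return used - supported

sample = (
    "<speak>Ask "
    '<phoneme alphabet="ipa" ph="ʃɪˈvɔːn">Siobhan</phoneme>'
    '<break time="500ms"/> to call back.</speak>'
)
missing = unsupported_tags(sample)  # tags that need a fallback on this system
```

A tag that passes this check can still be ignored or rendered badly, so the result is a filter for obvious problems and not a substitute for listening to the output.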

Common questions

What is SSML?

Speech Synthesis Markup Language, a W3C XML format for annotating text with how to speak it, covering everything from interpreting numbers and dates to specifying exact phonemes.[4] It is the traditional vehicle for pronunciation control, but support is one of the least standardized parts of TTS, so never assume a given system honors a given tag.

How do I make a TTS voice say a name correctly?

For a one-off, reach for the lightest tool that works, ending at a phoneme tag in IPA or X-SAMPA when nothing softer holds. For a name that recurs, put it in a pronunciation lexicon (PLS) so it is correct everywhere at once, instead of re-tagging every occurrence and getting it wrong somewhere.

Why does the same word get read two different ways?

Because it is a homograph like "read," "lead," "bass," or "live," and the model picks a reading from context, sometimes wrongly. When it guesses wrong, pin the reading with a phoneme tag or a substitution rather than hoping more context fixes it.

Do modern neural TTS systems still use SSML?

Inconsistently: some support a subset, some ignore it, some replace it with context or custom dictionaries. The need to override pronunciation persists even as the syntax changes, so test your real names and numbers on your actual system rather than assuming SSML works.

References

  1. Rao, K., Peng, F., Sak, H., & Beaufays, F. (2015). Grapheme-to-phoneme conversion using long short-term memory recurrent neural networks. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4584–4588.
  2. Shirali-Shahreza, S., Luitjens, P., Morcos, N., Xiao, W., & Penn, G. (2017). Crowdsourcing the Pronunciation of Out-of-Vocabulary Words. The AAAI-17 Workshop on Crowdsourcing, Deep Learning, and Artificial Intelligence Agents.
  3. Taylor, J., & Richmond, K. (2019). Analysis of pronunciation learning in end-to-end speech synthesis. Proceedings of Interspeech 2019, 2070–2074.
  4. W3C (2010). Speech Synthesis Markup Language (SSML) Version 1.1. World Wide Web Consortium (W3C).
  5. W3C (2008). Pronunciation Lexicon Specification (PLS) Version 1.0. World Wide Web Consortium (W3C).
  6. Ould Ouali, N., Sani, A. H., Bueno, R., & Dauvet, J. (2025). Improving French Synthetic Speech Quality via SSML Prosody Control. Proceedings of the 8th International Conference on Natural Language and Speech Processing (ICNLSP).
  7. Behr, M. (2021). Fine-Grained Prosody Control in Neural TTS Systems. Bachelor's thesis, Karlsruhe Institute of Technology.