Controlling pronunciation in TTS: SSML, phonemes, and beyond

A voice reads "Siobhan" as "see-OH-ban," "Reyes" as "rays," and your product name as something you do not recognize. The model is not malfunctioning; it is guessing, because spelling does not predict sound and these are exactly the words its grapheme-to-phoneme step was never going to get right. Names, brands, acronyms, and numbers are where every TTS deployment eventually has to override the model.^[1]^[2]

The tools for overriding it run from a crude rewrite to an exact phonetic spec. The sections below begin at the crude end, so the lightest tool that fixes the problem comes first.

Respelling

The crudest fix rewrites the text phonetically and lets the model read the rewrite: "Reyes" becomes "rayes," "Nguyen" becomes "win." It costs nothing and sometimes works.

Respelling is also fragile. It fights spelling with more spelling, so the result varies between voices and languages, and it corrupts the real text, which causes problems wherever that string is later displayed or logged. It is a stopgap rather than a permanent fix.
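If respelling is used as a stopgap, one way to limit the damage is to apply it only to the copy of the text sent to the synthesizer. The sketch below assumes a small hand-maintained table; `RESPELLINGS` and `respell_for_tts` are illustrative names, not part of any TTS library.

```python
import re
from typing import Mapping

# Hypothetical respelling table: written form -> phonetic respelling.
RESPELLINGS = {
    "Reyes": "rayes",
    "Nguyen": "win",
}

def respell_for_tts(text: str, respellings: Mapping[str, str]) -> str:
    """Return a respelled copy of `text` for the synthesizer; the input is not modified."""
    if not respellings:
        return text
    # Longest keys first, so a longer name wins over a shorter one it contains.
    keys = sorted(respellings, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda m: respellings[m.group(1)], text)

display_text = "Dr. Reyes will see Ms. Nguyen now."
spoken_text = respell_for_tts(display_text, RESPELLINGS)

print(display_text)  # Dr. Reyes will see Ms. Nguyen now.
print(spoken_text)   # Dr. rayes will see Ms. win now.
```

Keeping the two strings separate means logs and on-screen text still show the real spelling. Whether a given respelling actually sounds right still has to be checked per voice.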

SSML say-as

SSML's <say-as> is the first proper tool. It tells the voice how to interpret a string whose reading is ambiguous: is "1234" a number, a year, or digits to read one at a time?

<say-as interpret-as="telephone">555-0142</say-as>   <!-- read as a phone number -->
<say-as interpret-as="characters">SKU</say-as>      <!-- S-K-U, spelled out -->
<say-as interpret-as="date" format="md">3/4</say-as> <!-- month and day only -->
<say-as interpret-as="cardinal">23</say-as>          <!-- twenty-three -->

This tool applies when the issue is what kind of thing a string is, not how a word sounds. It resolves an ambiguity that text normalization would otherwise have to guess at.
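When the type of a value is already known in application code, the tag can be generated rather than typed by hand. This is a minimal sketch using only the Python standard library; `say_as` is an illustrative helper, and the `interpret-as` and `format` values follow the usage shown above. Which of them a given engine honors is something to verify.

```python
from xml.sax.saxutils import escape, quoteattr

def say_as(text, interpret_as, fmt=None):
    """Wrap `text` in a <say-as> element, escaping it for XML."""
    attrs = f"interpret-as={quoteattr(interpret_as)}"
    if fmt is not None:
        attrs += f" format={quoteattr(fmt)}"
    return f"<say-as {attrs}>{escape(text)}</say-as>"

# The same four characters, three different intended readings.
print(say_as("1234", "cardinal"))       # as a number
print(say_as("1234", "date", fmt="y"))  # as a year
print(say_as("1234", "characters"))     # one character at a time
```

Escaping matters here because values such as product codes can contain `&` or `<`, which would otherwise break the surrounding SSML.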

SSML sub

When you want the voice to say something different from what is written, <sub> substitutes an alias:

<sub alias="Doctor">Dr.</sub> <sub alias="Saint">St.</sub> Mary's
<sub alias="World Health Organization">WHO</sub>

This handles abbreviations that expand differently by context ("St." as "Street" or "Saint") and acronyms you want spoken as words or expanded, without altering the displayed text.
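Choosing between "Street" and "Saint" has to happen somewhere before synthesis. The sketch below uses a deliberately crude positional rule (an assumption for illustration, not a robust disambiguator): "St." directly before a capitalized word is read as "Saint", anything else as "Street". The names `sub` and `expand_st` are hypothetical.

```python
import re
from xml.sax.saxutils import escape, quoteattr

def sub(alias, written):
    """Build a <sub> element that speaks `alias` in place of `written`."""
    return f"<sub alias={quoteattr(alias)}>{written}</sub>"

def expand_st(text):
    """Tag each "St." as Saint or Street using a crude positional heuristic."""
    escaped = escape(text)  # make the plain text XML-safe before adding tags

    def choose(match):
        following = escaped[match.end():]
        # Capitalized word right after "St." -> "Saint"; otherwise "Street".
        alias = "Saint" if re.match(r"\s+[A-Z]", following) else "Street"
        return sub(alias, match.group(0))

    return re.sub(r"\bSt\.", choose, escaped)

print(expand_st("Turn onto Main St. near St. Mary's."))
# Turn onto Main <sub alias="Street">St.</sub> near <sub alias="Saint">St.</sub> Mary's.
```

The rule misfires when "St." ends a sentence and the next one starts with a capital, which is the kind of case where a hand-placed tag remains the safer choice.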

Phonemes

When you need a specific pronunciation and nothing else will do, specify the sounds directly with the <phoneme> tag, written in a phonetic alphabet, usually IPA or X-SAMPA.

<phoneme alphabet="ipa" ph="ʃɪˈvɔːn">Siobhan</phoneme>
<phoneme alphabet="ipa" ph="wɪn">Nguyen</phoneme>

This is the most reliable tool, because it specifies the exact sequence of sounds instead of leaving the reading to the model. It does require knowing the phonetic transcription and the alphabet the system accepts: SSML 1.1 names only "ipa" as a built-in alphabet value, and X-SAMPA, where available, is a vendor extension. Phonemes suit a fixed set of names and terms that must be correct every time.
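For a fixed set of names, the tagging can be done from a table and the result checked for well-formedness before it is sent anywhere. This is a sketch under assumptions: `IPA_BY_NAME`, `tag_phonemes`, and `to_speak` are illustrative names, the keys are assumed to be plain words without XML special characters, and the `<speak>` wrapper uses the SSML 1.1 namespace.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Hypothetical table of names and their IPA transcriptions.
IPA_BY_NAME = {"Siobhan": "ʃɪˈvɔːn"}

def tag_phonemes(text, ipa_by_name):
    """Escape `text` for XML and wrap each known name in a <phoneme> element."""
    escaped = escape(text)
    if not ipa_by_name:
        return escaped
    names = sorted(ipa_by_name, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, names)) + r")\b")

    def wrap(match):
        name = match.group(1)
        return f'<phoneme alphabet="ipa" ph={quoteattr(ipa_by_name[name])}>{name}</phoneme>'

    return pattern.sub(wrap, escaped)

def to_speak(body, lang="en-US"):
    """Wrap an SSML fragment in a <speak> root element."""
    return (
        '<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang="{lang}">{body}</speak>'
    )

ssml = to_speak(tag_phonemes("Please welcome Siobhan & her team.", IPA_BY_NAME))
ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed
print(ssml)
```

Parsing the result only proves the XML is well-formed; it says nothing about whether the engine accepts the alphabet or renders the transcription as intended.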

Breaks, emphasis, and prosody

Pronunciation covers more than sounds. It also includes pacing and stress, the territory of prosody. SSML offers tags for these too:

Take a deep breath <break time="500ms"/> and continue.
I said the <emphasis level="strong">red</emphasis> one.
<prosody rate="slow" pitch="-2st">This part is important.</prosody>

These tags should be used sparingly. They suit a specific line that must be delivered a particular way; good default prosody handles ordinary text.

Lexicons

Annotating every occurrence of a term is tedious and error-prone. A pronunciation lexicon (the W3C format is PLS) is a dictionary of terms and their pronunciations that applies across all your text at once, so "Soniox" or a drug name is said correctly everywhere without per-instance tags.^[5] For any deployment with a stable vocabulary, a lexicon is the maintainable option, with inline SSML reserved for one-off cases.
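A lexicon file is small enough to generate from whatever table of terms a team already maintains. The sketch below writes a PLS 1.0 document with Python's standard `xml.etree.ElementTree`; `build_pls` and the sample entry are illustrative, and how a lexicon is loaded (a `<lexicon>` element in SSML, or an upload through a vendor API) depends on the engine.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"
XML_NS = "http://www.w3.org/XML/1998/namespace"

def build_pls(entries, lang="en-US"):
    """Serialize a {grapheme: IPA} mapping as a PLS 1.0 lexicon string."""
    ET.register_namespace("", PLS_NS)  # emit PLS as the default namespace
    lexicon = ET.Element(
        f"{{{PLS_NS}}}lexicon",
        {"version": "1.0", "alphabet": "ipa", f"{{{XML_NS}}}lang": lang},
    )
    for grapheme, ipa in entries.items():
        lexeme = ET.SubElement(lexicon, f"{{{PLS_NS}}}lexeme")
        ET.SubElement(lexeme, f"{{{PLS_NS}}}grapheme").text = grapheme
        ET.SubElement(lexeme, f"{{{PLS_NS}}}phoneme").text = ipa
    return ET.tostring(lexicon, encoding="unicode")

# Hypothetical entry; a real lexicon would hold the deployment's own vocabulary.
pls_xml = build_pls({"Siobhan": "ʃɪˈvɔːn"})
print(pls_xml)
```

Each `<lexeme>` pairs a written form with its pronunciation, so adding a term is a one-line change to the source table instead of an edit to every script that mentions it.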

| Tool | Fixes | Precision |
| --- | --- | --- |
| Respelling | One-off mispronunciations | Low, fragile |
| `say-as` | Numbers, dates, codes, spell-out | Medium |
| `sub` | Abbreviations, acronyms | Medium |
| `phoneme` | Exact sound of a word | High |
| `break` / `emphasis` / `prosody` | Pacing, stress, pitch | Targeted |
| Lexicon (PLS) | A term, everywhere, consistently | High, scalable |

The lightest tool that fixes the problem is preferable. Precision and required effort both rise down the list.

Pronunciation control in neural TTS

SSML originated with the concatenative and parametric systems of the 2000s, and it assumes a pipeline that can be instructed tag by tag. The newest token-based neural models do not always work that way. Some support a subset of SSML, some ignore it, and some replace it with other mechanisms: surrounding context from which the model infers the right reading, a prompt that contains an example, or a custom dictionary in place of inline phonemes.^[3]^[6]^[7]

SSML support is one of the least standardized corners of TTS: an implementation may honor a subset of tags and silently ignore the rest.^[4]^[7] So test pronunciation against the synthesizer you actually chose, not the spec, and keep a documented way to correct names, numbers, and domain terms. Every production deployment ends up needing one.
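Because an unsupported tag may be dropped without any error, a cheap pre-flight check is to compare the tags a document uses against the set the chosen engine documents. This is a sketch only: `SUPPORTED` is a made-up list for a hypothetical engine, and `unsupported_tags` is an illustrative helper. It complements listening tests and does not replace them.

```python
import xml.etree.ElementTree as ET

# Hypothetical: the tags one particular engine documents as supported.
SUPPORTED = {"speak", "break", "say-as", "sub"}

def unsupported_tags(ssml, supported):
    """Return the sorted element names used in `ssml` that are not in `supported`."""
    root = ET.fromstring(ssml)
    found = set()
    for element in root.iter():
        # Drop a "{namespace}" prefix if the document declares one.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name not in supported:
            found.add(local_name)
    return sorted(found)

ssml = (
    '<speak>Call <say-as interpret-as="telephone">555-0142</say-as> and ask for '
    '<phoneme alphabet="ipa" ph="ʃɪˈvɔːn">Siobhan</phoneme>.</speak>'
)
print(unsupported_tags(ssml, SUPPORTED))  # ['phoneme']
```

A non-empty result is a prompt to pick another tool for that term, such as a lexicon entry or a substitution, before the audio ships.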

Common questions

What is SSML?

Speech Synthesis Markup Language, a W3C XML format for annotating text with how to speak it, covering everything from interpreting numbers and dates to specifying exact phonemes.^[4] It is the traditional vehicle for pronunciation control, but support is one of the least standardized parts of TTS, so never assume a given system honors a given tag.

How do I make a TTS voice say a name correctly?

For a one-off, reach for the lightest tool that works, ending at a phoneme tag in IPA or X-SAMPA when nothing softer holds. For a name that recurs, put it in a pronunciation lexicon (PLS) so it is correct everywhere at once, instead of re-tagging every occurrence and getting it wrong somewhere.

Why does the same word get read two different ways?

Because it is a homograph like "read," "lead," "bass," or "live," and the model picks a reading from context, sometimes wrongly. When it guesses wrong, pin the reading with a phoneme tag or a substitution rather than hoping more context fixes it.

Do modern neural TTS systems still use SSML?

Inconsistently: some support a subset, some ignore it, some replace it with context or custom dictionaries. The need to override pronunciation persists even when the syntax changes, so test your real names and numbers on your actual system rather than assuming SSML works.

References

[1] Rao, K., Peng, F., Sak, H., & Beaufays, F. (2015). Grapheme-to-phoneme conversion using long short-term memory recurrent neural networks. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4584–4588.
[2] Shirali-Shahreza, S., Luitjens, P., Morcos, N., Xiao, W., & Penn, G. (2017). Crowdsourcing the Pronunciation of Out-of-Vocabulary Words. The AAAI-17 Workshop on Crowdsourcing, Deep Learning, and Artificial Intelligence Agents.
[3] Taylor, J., & Richmond, K. (2019). Analysis of pronunciation learning in end-to-end speech synthesis. Proceedings of Interspeech 2019, 2070–2074.
[4] W3C (2010). Speech Synthesis Markup Language (SSML) Version 1.1. World Wide Web Consortium (W3C).
[5] W3C (2008). Pronunciation Lexicon Specification (PLS) Version 1.0. World Wide Web Consortium (W3C).
[6] Ould Ouali, N., Sani, A. H., Bueno, R., & Dauvet, J. (2025). Improving French Synthetic Speech Quality via SSML Prosody Control. Proceedings of the 8th International Conference on Natural Language and Speech Processing (ICNLSP).
[7] Behr, M. (2021). Fine-Grained Prosody Control in Neural TTS Systems. Bachelor's thesis, Karlsruhe Institute of Technology.
