What is text-to-speech? How neural TTS turns text into voice

It is tempting to picture TTS as a lookup: somewhere a recording of every word, played back in order. Nothing shipped this decade works that way. A modern system has never heard most of the sentences it speaks. It generates the sound wave from scratch, the way a language model generates sentences it has never read, and that one design choice, generate rather than replay, is why synthetic voices stopped sounding synthetic around 2016.^[1]

This page covers the pipeline at a high level; each stage has its own page that goes deeper.^[6]

Text normalization

A naive view of TTS is that you hand the model a string and it reads the letters. Reading the letters is straightforward; the harder problem is deciding what the letters say out loud, and that happens in text normalization: the stage that rewrites text into the words a person would speak.

Consider "Dr. Stone lives at 5 Main St. and paid $5 for it on 3/4." A human reads that without thinking. A machine has to make a string of judgment calls. "Dr." is "Doctor" and "St." is "Street," but "St." can also be "Saint," and only context decides. "$5" is "five dollars," not "dollar five," because currency symbols are spoken in an order the writing does not show. "3/4" is "March fourth" in a date and "three quarters" in a recipe. The "5" in "5 Main" stays "five," but a reference code such as "5A" might be "five A" or "five ay" depending on what kind of code it is.

This is the inverse of a problem the speech-to-text world calls inverse text normalization, where "twenty three dollars" gets turned back into "$23" (see punctuation and ITN for that direction). TTS runs it forward: symbols and abbreviations go in, full spoken words come out. Because the voice is fluent, a normalization error carries no audible warning: "dollar five" is spoken as naturally as the correct "five dollars," which makes such mistakes easy to overlook.

Normalization is also where pronunciation gets settled for words that spelling does not predict: names, borrowed terms, alphanumerics. That control surface (lexicons, phonemes, SSML) is covered in pronunciation control. By the end of normalization the text is unambiguous: a sequence of words, and often phonemes, with nothing left to guess.
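The judgment calls in the "Dr. Stone" sentence can be sketched as a handful of ordered rewrite rules. This is a minimal illustration, not how any particular engine does it: the function names, the `slash_is_date` flag, the tiny lookup tables, and the "capitalized word follows, so it is Saint" heuristic are all assumptions made for the example, and a production normalizer relies on much richer context.

```python
import re

# Hypothetical, deliberately tiny tables; real normalizers cover far more.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
IRREGULAR_ORDINALS = {"one": "first", "two": "second", "three": "third",
                      "five": "fifth", "eight": "eighth", "nine": "ninth",
                      "twelve": "twelfth"}
FRACTIONS = {(1, 2): "one half", (1, 4): "one quarter", (3, 4): "three quarters"}


def cardinal(n):
    """Spell out 0-99; larger numbers fall back to digit-by-digit reading."""
    if n >= 100:
        return " ".join(ONES[int(d)] for d in str(n))
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def ordinal(n):
    """Turn the last word of the cardinal into its ordinal form."""
    head, sep, last = cardinal(n).rpartition("-")
    if last in IRREGULAR_ORDINALS:
        last = IRREGULAR_ORDINALS[last]
    elif last.endswith("y"):
        last = last[:-1] + "ieth"
    else:
        last += "th"
    return head + sep + last


def _dollars(match):
    amount = int(match.group(1))
    # The symbol is written first but spoken last.
    return f"{cardinal(amount)} dollar{'' if amount == 1 else 's'}"


def _slash(match, slash_is_date):
    a, b = int(match.group(1)), int(match.group(2))
    if slash_is_date and 1 <= a <= 12 and 1 <= b <= 31:
        return f"{MONTHS[a - 1]} {ordinal(b)}"  # assumes US month/day order
    return FRACTIONS.get((a, b), f"{cardinal(a)} over {cardinal(b)}")


def normalize(text, slash_is_date=True):
    """Rewrite symbols and abbreviations into spoken words (crude sketch)."""
    text = re.sub(r"\$(\d+)", _dollars, text)
    text = re.sub(r"\b(\d{1,2})/(\d{1,2})\b",
                  lambda m: _slash(m, slash_is_date), text)
    text = re.sub(r"\bDr\.(?=\s+[A-Z])", "Doctor", text)
    # Crude context rule: "St." before a capitalized word is read as "Saint".
    text = re.sub(r"\bSt\.(?=\s+[A-Z])", "Saint", text)
    text = re.sub(r"\bSt\.", "Street", text)
    # Remaining standalone digits are read as plain cardinals; "5A" is left alone.
    return re.sub(r"\b\d+\b", lambda m: cardinal(int(m.group())), text)


spoken = normalize("Dr. Stone lives at 5 Main St. and paid $5 for it on 3/4.")
recipe = normalize("Add 3/4 cup near St. Louis", slash_is_date=False)
```

The rule order matters: currency and slashes are handled before bare digits so that "$5" and "3/4" are not split into lone numbers first. The `slash_is_date` flag stands in for the context decision the text describes; the sketch makes the caller supply it rather than pretending a regex can infer it.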

The acoustic model and the vocoder

Once the text is clean, the system does not jump straight to a waveform. It predicts a middle representation first, almost always a mel-spectrogram: a picture of how acoustic energy is distributed across frequencies over time, with the frequency axis warped to match human hearing.^[5] The model that draws it is the acoustic model, and this is where the expressive decisions get made. The same sentence maps to many valid spectrograms, and the differences are exactly what prosody studies: where the pitch rises, which word carries the stress, how long the pause before "but" lasts. A spectrogram encodes one particular reading of the words.
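The frequency warping can be made concrete with a short sketch. It assumes the widely used formula mel = 2595 · log10(1 + f/700), which is one of several mel conventions in circulation and is not specified by this page; the function names and the 8 kHz upper limit are likewise illustrative.

```python
import math


def hz_to_mel(f_hz):
    """Convert frequency in Hz to mels (assumed 2595*log10(1 + f/700) form)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(n_bands, f_min=0.0, f_max=8000.0):
    """Edges, in Hz, of n_bands bands spaced evenly on the mel axis."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / n_bands
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 1)]


edges = mel_band_edges(8)
# Width of each band in Hz: narrow at low frequencies, wide at high ones.
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
```

Bands that are equally wide in mels grow steadily wider in Hz, so a mel-spectrogram spends most of its rows on the low frequencies where hearing resolves pitch finely and only a few on the high end.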

A spectrogram is still not sound, so a second model, the vocoder, invents a waveform that matches it. The vocoder is where neural TTS was born and where it was slow. WaveNet (2016) generated audio one sample at a time, gorgeous and far too expensive to ship, and the years that followed were largely spent keeping its quality while making it fast enough to answer in real time.^[1]^[2]^[4] The stage-by-stage mechanics of both models, and the newer designs that fuse them into one, are in how neural TTS works.
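A rough count shows why sample-at-a-time generation was expensive. The sample rate and hop size below are assumptions chosen for illustration, not figures from the cited papers.

```python
import math

SAMPLE_RATE_HZ = 24_000  # assumption: audio samples per second
HOP_SAMPLES = 256        # assumption: audio samples covered by one spectrogram frame


def sequential_steps(duration_s):
    """Return (waveform samples, spectrogram frames) for a clip of this length."""
    samples = round(duration_s * SAMPLE_RATE_HZ)
    frames = math.ceil(samples / HOP_SAMPLES)
    return samples, frames


samples, frames = sequential_steps(3.0)
```

Under these assumptions a three-second clip needs a few hundred spectrogram frames but tens of thousands of waveform samples, and a strictly sample-by-sample vocoder has to run its network once for each of those samples in order.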

Raw text ($5, Dr.) → Text normalization → Acoustic model (Tacotron-style) → Mel-spectrogram → Vocoder (WaveNet-style) → Waveform (audio out)

The neural TTS pipeline. The mel-spectrogram is the seam between the two models, which let early systems train them separately.

Requirements for streaming TTS

Everything above describes generating a finished clip. That is fine for an audiobook or a voicemail greeting, where no one is waiting for the audio in real time. It becomes too slow once a person is waiting for a reply.

The metric that matters for live use is time-to-first-audio: how long after the text arrives before the first sound comes out. Humans start answering within a few hundred milliseconds, often before they have decided how their sentence ends. A TTS system that generates the whole utterance, then plays it, has to finish the slowest stage of the pipeline before the listener hears anything, and on a long sentence that wait reads as the system being slow or broken.

Streaming TTS reorganizes the pipeline to emit audio in chunks while the rest of the sentence is still being generated: produce a slice of spectrogram, vocode it, send it, repeat. The first words leave while the last words are still being computed.^[7] This is required for voice agents, where the TTS sits at the end of a chain that has already spent its time budget on recognition and a language model, and it is its own subject in streaming TTS.
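The produce, vocode, send loop can be sketched as a generator, with a simulated clock in place of real models. The per-chunk costs, the three-word chunk size, and the function names are hypothetical; the sketch also runs the two stages strictly one after the other, whereas a real system might overlap them.

```python
# Hypothetical per-chunk costs in milliseconds; real values depend on model and hardware.
ACOUSTIC_MS_PER_CHUNK = 40
VOCODER_MS_PER_CHUNK = 60


def synthesize_chunks(text, words_per_chunk=3):
    """Yield (ready_at_ms, words) for each chunk as soon as it has been vocoded."""
    words = text.split()
    clock_ms = 0
    for i in range(0, len(words), words_per_chunk):
        clock_ms += ACOUSTIC_MS_PER_CHUNK  # produce a slice of spectrogram
        clock_ms += VOCODER_MS_PER_CHUNK   # vocode that slice
        yield clock_ms, " ".join(words[i:i + words_per_chunk])  # send it


def time_to_first_audio(text, streaming):
    """Simulated delay in ms before the listener hears anything."""
    ready_times = [ready_at for ready_at, _ in synthesize_chunks(text)]
    if not ready_times:
        return 0
    # Streaming plays the first chunk when it is ready;
    # generate-then-play waits for the last one.
    return ready_times[0] if streaming else ready_times[-1]


sentence = "the first words leave while the last words are still being computed"
ttfa_streaming = time_to_first_audio(sentence, streaming=True)
ttfa_batch = time_to_first_audio(sentence, streaming=False)
```

Total compute is identical in both modes. What changes is when the first chunk is released, which is why time-to-first-audio stays roughly constant under streaming while it grows with sentence length under generate-then-play.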

One more capability rides on the neural design. Because speaker identity in these models is a separate learned representation, a voice can be copied from a short sample rather than recorded from scratch. That is voice cloning, and it is why consent and watermarking are now central to the TTS conversation.

Common questions

What is the difference between TTS and a voice assistant?

TTS is only the speaking half: text in, audio out. A voice assistant is a full loop that also listens (speech recognition), decides what to say (usually a language model), and then uses TTS to say it. TTS is one component of that stack, covered as a whole under what is voice AI.

Is text-to-speech just playing back recorded words?

No, not in any modern system. Older concatenative engines glued together recorded fragments, with audible joins between them. Neural TTS generates the waveform from scratch, so it can speak words, names, and even languages that were never in its recordings.

What is a mel-spectrogram and why does TTS use one?

It is a time-frequency picture of sound, with the frequency axis scaled to match human hearing. TTS uses it as a halfway point: predicting a spectrogram from text is easier than predicting raw audio, and a separate vocoder then turns the spectrogram into the waveform. Many newer end-to-end models keep it as an internal step rather than a visible handoff.

Why does TTS sometimes mispronounce numbers and abbreviations?

The failure usually happens before any sound is generated, in text normalization. Deciding that "St." is "Street" not "Saint," or that "1995" is a year not a quantity, is a context judgment, and when the system guesses wrong the voice still sounds perfectly natural saying the wrong thing.

How is TTS quality actually measured?

The long-standing instrument is the Mean Opinion Score, where listeners rate samples from 1 to 5. It worked for decades but saturates once systems cluster near 4.5, so evaluation has moved toward head-to-head comparisons and task-specific tests. See evaluating TTS.
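A small sketch of the arithmetic behind a Mean Opinion Score shows why saturation is a problem. The ratings are made up for illustration, and the 1.96 multiplier is a normal-approximation shortcut that is rough for panels this small.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mean, half-width of an approximate 95% confidence interval)."""
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("MOS ratings must be between 1 and 5")
    mean = statistics.mean(ratings)
    if len(ratings) < 2:
        return mean, float("nan")  # no spread estimate from a single rating
    standard_error = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, 1.96 * standard_error


# Hypothetical listener ratings for two systems that both sound good.
system_a = [5, 4, 5, 4, 5, 4, 5, 5, 4, 4]
system_b = [5, 5, 4, 5, 4, 5, 5, 4, 5, 4]

mos_a, ci_a = mean_opinion_score(system_a)
mos_b, ci_b = mean_opinion_score(system_b)

# If the intervals overlap, the panel cannot tell the systems apart.
intervals_overlap = (mos_a + ci_a) >= (mos_b - ci_b) and (mos_b + ci_b) >= (mos_a - ci_a)
```

When both means sit near the top of the scale, the gap between them is smaller than the uncertainty of either, which is the practical meaning of saturation and the motivation for head-to-head comparisons.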

References

[1] van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., et al. (2016). WaveNet: A Generative Model for Raw Audio. arXiv preprint arXiv:1609.03499.
[2] van den Oord, A., Li, Y., Babuschkin, I., Simonyan, K., Vinyals, O., Kavukcuoglu, K., et al. (2017). Parallel WaveNet: Fast High-Fidelity Speech Synthesis. arXiv preprint arXiv:1711.10433.
[3] Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., et al. (2017). Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. arXiv preprint arXiv:1712.05884.
[4] Kong, J., Kim, J., & Bae, J. (2020). HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. Advances in Neural Information Processing Systems 33.
[5] Stevens, S. S., Volkmann, J., & Newman, E. B. (1937). A Scale for the Measurement of the Psychological Magnitude of Pitch. The Journal of the Acoustical Society of America, 8(3), 185–190.
[6] Tan, X., Qin, T., Soong, F., & Liu, T.-Y. (2021). A Survey on Neural Speech Synthesis. arXiv preprint arXiv:2106.15561.
[7] Soniox (2026). Real-time Text-to-Speech generation. Soniox.