What is text-to-speech?

How neural TTS turns text into voice

Updated June 14, 2026

In 2016 a DeepMind paper reported a result that changed the field: WaveNet scored 4.21 on a 5-point opinion scale for US English, against 3.86 for the best concatenative system of the day and 4.55 for a real human. One model had roughly halved the gap to human speech by generating audio sample by sample instead of replaying recorded fragments.[1] Modern TTS systems descend from that result, and from the architecture that arrived a year later to make such models practical to train.[7]
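The "roughly halved" claim can be checked with one line of arithmetic on the three scores quoted above. The function name below is illustrative and does not come from the paper.

```python
def fraction_of_gap_closed(baseline: float, model: float, human: float) -> float:
    """Share of the baseline-to-human gap that the model covers."""
    return (model - baseline) / (human - baseline)


# Scores quoted above: concatenative 3.86, WaveNet 4.21, human 4.55.
closed = fraction_of_gap_closed(baseline=3.86, model=4.21, human=4.55)
print(f"{closed:.0%} of the gap to human speech closed")
```

The result is close to one half, which is where "roughly halved" comes from.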

This page covers the whole TTS pipeline at a high level; each stage below has its own page that goes deeper.

Text normalization

A naive view of TTS is that you hand the model a string and it reads the letters. Reading the letters is straightforward; the harder problem is deciding what the letters say out loud, and that happens in text normalization: the stage that rewrites text into the words a person would speak.

Consider "Dr. Stone lives at 5 Main St. and paid $5 for it on 3/4." A human reads that without thinking. A machine has to make a string of judgment calls. "Dr." is "Doctor" here, though after a street name it would be "Drive." "St." is "Street," but "St." can also be "Saint," and only context decides. "$5" is "five dollars," not "dollar five," because currency symbols are spoken in an order the writing does not show. "3/4" is "March fourth" in a date and "three quarters" in a recipe. "5 Main" stays "five," but the reference code "5A" might be "five A" or "five ay" depending on what kind of code it is.
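A few of these judgment calls can be sketched as rewrite rules. This is a minimal illustration, not how any particular engine works: the rule set, the function names, and the "capitalized word follows" heuristic are all assumptions, and production normalizers use far larger grammars or learned models. Dates such as "3/4" are deliberately left untouched because they cannot be resolved without more context.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer from 0 to 99 (enough for this sketch)."""
    if not 0 <= n <= 99:
        raise ValueError("this sketch only handles 0..99")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def _currency(match: re.Match) -> str:
    # "$5" -> "five dollars": the symbol is written first but spoken last.
    n = int(match.group(1))
    return f"{number_to_words(n)} dollar{'' if n == 1 else 's'}"


def normalize(text: str) -> str:
    """Apply a handful of hypothetical normalization rules in order."""
    text = re.sub(r"\$(\d{1,2})\b", _currency, text)
    # Heuristic: "Dr." followed by a capitalized word is treated as a title.
    text = re.sub(r"\bDr\.(?=\s+[A-Z])", "Doctor", text)
    # Heuristic: "St." before a capitalized word is "Saint", otherwise "Street".
    text = re.sub(r"\bSt\.(?=\s+[A-Z])", "Saint", text)
    text = re.sub(r"\bSt\.", "Street", text)
    # Spell out standalone numbers; skip anything touching "/" (e.g. "3/4").
    text = re.sub(r"(?<![\d/])\b(\d{1,2})\b(?![\d/])",
                  lambda m: number_to_words(int(m.group(1))), text)
    return text


print(normalize("Dr. Stone lives at 5 Main St. and paid $5 for it on 3/4."))
```

The heuristics fail in exactly the ways the paragraph above describes: "Main St." at the end of a sentence, followed by a capitalized word, would be read as "Saint." That is why real systems rely on wider context rather than a single lookahead.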

This is the inverse of a problem the speech-to-text world calls inverse text normalization, where "twenty three dollars" gets turned back into "$23" (see punctuation and ITN for that direction). TTS runs it forward: symbols and abbreviations go in, full spoken words come out. Because the voice is fluent, a normalization error carries no audible warning: "dollar five" is spoken as naturally as the correct "five dollars," which makes such mistakes easy to overlook.

Normalization is also where pronunciation gets settled for words that spelling does not predict: names, borrowed terms, alphanumerics. That control surface (lexicons, phonemes, SSML) is covered in pronunciation control. By the end of normalization the text is unambiguous: a sequence of words, and often phonemes, with nothing left to guess.

The acoustic model

Once the text is clean, the system does not jump straight to a waveform. It predicts a middle representation first, almost always a mel-spectrogram: a picture of how acoustic energy is distributed across frequencies over time, with the frequency axis warped to match human hearing.[6] Left to right is time; bottom to top is frequency, from low to high, which is where pitch and timbre show up; brightness is the energy at that time and frequency, heard as loudness. The vocoder, described below, reads this picture to produce the audio.
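The warping of the frequency axis can be made concrete with a conversion formula. The sketch below uses a widely used analytic approximation of the mel scale (2595 · log10(1 + f/700)); it is a later curve fit rather than the 1937 formulation cited above, and speech toolkits differ in which variant they implement. The function names are illustrative.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Convert frequency in Hz to mels (common analytic approximation)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(n_bands: int, f_min: float, f_max: float) -> list[float]:
    """Band edges in Hz that are evenly spaced on the mel axis."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 2)]


edges = mel_band_edges(n_bands=8, f_min=0.0, f_max=8000.0)
widths = [b - a for a, b in zip(edges, edges[1:])]
print([round(e) for e in edges])
```

Bands that are equally wide in mels become progressively wider in Hz, so the representation spends more resolution on low frequencies, where hearing is most discriminating.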

The model that produces it is the acoustic model. The reference design is Tacotron 2, published by Google in 2017, which used an attention mechanism to line up input characters with output spectrogram frames: for each slice of sound it produces, the model decides which part of the text it is "looking at."[3] Expressive choices are made at this stage. The same sentence can map to many valid spectrograms, and the differences between them are what prosody describes: where the pitch rises, which word gets stressed, how long the pause before "but" lasts. Each spectrogram therefore encodes one particular reading of the words.
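The idea of "looking at" part of the text can be sketched with plain dot-product attention. This is a simplification for illustration: Tacotron 2 itself uses a location-sensitive attention variant with learned projections, and the vectors below are made-up toy values rather than real encoder outputs.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: list[float], encoder_states: list[list[float]]):
    """Return (weights over input positions, weighted-average context)."""
    scores = [sum(q * k for q, k in zip(query, state))
              for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * state[d] for w, state in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context


# Three toy "character" encodings and one decoder query for the current frame.
states = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weights, context = attend(query=[4.0, 0.0], encoder_states=states)
print([round(w, 3) for w in weights])
```

The weights say how much each input position contributes to the frame being generated. In a trained model they tend to move steadily from the start of the text to the end as the frames advance.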

The vocoder

A spectrogram is not audio. It throws away the fine phase detail of the original wave, so something has to invent a plausible waveform that would produce that spectrogram. That something is the vocoder.

The breakthrough vocoder was WaveNet (DeepMind, 2016), which modeled the raw waveform directly, predicting each audio sample from the samples before it.[1] At the 16,000 to 24,000 samples per second that speech systems typically use, that is a lot of predictions, and the original was famously slow: minutes of computation per second of speech. In the decade since, work has focused on keeping WaveNet's quality while making it fast, through parallel and GAN-based vocoders like HiFi-GAN (2020) that enable real-time synthesis on consumer devices.[2][4]
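The cost of sample-by-sample generation follows from the loop structure: every sample needs the ones before it, so the steps cannot run in parallel. The sketch below shows that dependency with a stand-in predictor. The function names, the `receptive_field` parameter, and the toy predictor are assumptions for illustration, not WaveNet's actual network.

```python
from typing import Callable, List


def sequential_steps(seconds: float, sample_rate: int = 24_000) -> int:
    """Number of one-at-a-time predictions needed for a clip."""
    return int(round(seconds * sample_rate))


def generate_autoregressive(
    predict_next: Callable[[List[float]], float],
    n_samples: int,
    receptive_field: int,
) -> List[float]:
    """Generate samples one by one, each conditioned on the recent past."""
    if receptive_field < 1:
        raise ValueError("receptive_field must be at least 1")
    samples: List[float] = []
    for _ in range(n_samples):
        context = samples[-receptive_field:]   # only past samples are visible
        samples.append(predict_next(context))  # step t must wait for step t-1
    return samples


def toy_predictor(context: List[float]) -> float:
    """Stand-in for a neural network: a decaying echo of the last sample."""
    return 1.0 if not context else 0.5 * context[-1]


print(sequential_steps(5.0))  # predictions needed for a five-second clip
print(generate_autoregressive(toy_predictor, n_samples=4, receptive_field=3))
```

Parallel and GAN-based vocoders remove this loop by producing many samples in a single forward pass, which is where the speedup comes from.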

So the classic pipeline is two learned models in series: an acoustic model that produces the spectrogram, and a vocoder that turns it into a waveform. The full mechanics of each, with the math, live in how neural TTS works.

flowchart LR
    A[Raw text<br/>$5, Dr.] --> B[Text<br/>normalization]
    B --> C[Acoustic model<br/>Tacotron-style]
    C --> D[Mel-<br/>spectrogram]
    D --> E[Vocoder<br/>WaveNet-style]
    E --> F[Waveform<br/>audio out]
The neural TTS pipeline. The mel-spectrogram is the seam between the two models, allowing early systems to train them separately.

Requirements for streaming TTS

Everything above describes generating a finished clip. That is fine for an audiobook or a voicemail greeting, where no one is waiting for the audio in real time. It becomes too slow once a person is waiting for a reply.

The metric that matters for live use is time-to-first-audio: how long after the text arrives before the first sound comes out. Humans start answering within a few hundred milliseconds, often before they have decided how their sentence ends. A TTS system that generates the whole utterance, then plays it, has to run every stage of the pipeline over the entire utterance before the listener hears anything, and on a long sentence that wait reads as the system being slow or broken.
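A back-of-the-envelope model shows why utterance length matters. It assumes a constant real-time factor (seconds of compute per second of audio produced), a common way to describe synthesis speed that this page does not otherwise use. All numbers are hypothetical and the model ignores network and queuing delays.

```python
def ttfa_whole_utterance(utterance_s: float, real_time_factor: float) -> float:
    """Wait before playback when the full clip is synthesized first."""
    return utterance_s * real_time_factor


def ttfa_chunked(first_chunk_s: float, real_time_factor: float) -> float:
    """Wait before playback when only the first chunk must be ready."""
    return first_chunk_s * real_time_factor


# Hypothetical: an 8-second sentence on a system 4x faster than real time.
rtf = 0.25
print(ttfa_whole_utterance(8.0, rtf))  # grows with sentence length
print(ttfa_chunked(0.4, rtf))          # depends only on the chunk size
```

Under these assumptions the whole-utterance wait scales with the sentence, while the chunked wait stays fixed no matter how long the reply is.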

Streaming TTS reorganizes the pipeline to emit audio in chunks while the rest of the sentence is still being generated: produce a slice of spectrogram, vocode it, send it, repeat. The first words leave while the last words are still being computed.[8] This is required for voice agents, where the TTS sits at the end of a chain that has already spent its time budget on recognition and a language model, and it is its own subject in streaming TTS.
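The produce, vocode, send loop maps naturally onto a generator. The sketch below is a simplification under stated assumptions: `acoustic_model` and `vocoder` are hypothetical callables supplied by the caller, and chunks are cut at word boundaries. Real streaming systems typically chunk at the spectrogram-frame level and carry model state across chunk boundaries to avoid audible seams.

```python
from typing import Callable, Iterator


def stream_tts(
    text: str,
    acoustic_model: Callable[[str], object],
    vocoder: Callable[[object], bytes],
    chunk_words: int = 3,
) -> Iterator[bytes]:
    """Yield audio chunk by chunk instead of returning one finished clip."""
    words = text.split()
    for start in range(0, len(words), chunk_words):
        chunk = " ".join(words[start:start + chunk_words])
        mel = acoustic_model(chunk)   # produce a slice of spectrogram
        yield vocoder(mel)            # vocode it and hand it to the caller


# Stand-in models so the control flow can be run without a real TTS stack.
def fake_acoustic_model(chunk: str) -> str:
    return chunk.upper()


def fake_vocoder(mel: object) -> bytes:
    return str(mel).encode("utf-8")


for audio in stream_tts("the first words leave early", fake_acoustic_model,
                        fake_vocoder):
    print(audio)  # a real client would start playback here
```

Because the generator is lazy, the second chunk is not synthesized until the caller asks for it, so the first chunk can be playing while later ones are still being computed.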

One more capability rides on the neural design. Because identity in these models is a separate learned representation, a voice can be copied from a short sample rather than recorded from scratch. That is voice cloning, and it is why consent and watermarking are now central to the TTS conversation.

Common questions

What is the difference between TTS and a voice assistant?

TTS is only the speaking half: text in, audio out. A voice assistant is a full loop that also listens (speech recognition), decides what to say (usually a language model), and then uses TTS to say it. TTS is one component of that stack, covered as a whole under what is voice AI.

Is text-to-speech just playing back recorded words?

No, not in any modern system. Older concatenative engines glued together recorded fragments, with audible joins between them. Neural TTS generates the waveform from scratch, so it can speak words, names, and even languages that were never in its recordings.

What is a mel-spectrogram and why does TTS use one?

It is a time-frequency picture of sound, with the frequency axis scaled to match human hearing. TTS uses it as a halfway point: predicting a spectrogram from text is easier than predicting raw audio, and a separate vocoder then turns the spectrogram into the waveform. Many newer end-to-end models keep it as an internal step rather than a visible handoff.

Why does TTS sometimes mispronounce numbers and abbreviations?

The failure usually happens before any sound is generated, in text normalization. Deciding that "St." is "Street" not "Saint," or that "1995" is a year not a quantity, is a context judgment, and when the system guesses wrong the voice still sounds perfectly natural saying the wrong thing.

How is TTS quality actually measured?

The long-standing instrument is the Mean Opinion Score, where listeners rate samples from 1 to 5. It worked for decades but saturates once systems cluster near 4.5, so evaluation has moved toward head-to-head comparisons and task-specific tests. See evaluating TTS.
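Computing a Mean Opinion Score is simple averaging, usually reported with a confidence interval. The sketch below uses a normal-approximation 95% interval, which is an assumption here: evaluation protocols differ in how they compute intervals and screen listeners. The ratings are made up.

```python
import math
import statistics


def mean_opinion_score(ratings: list[int]) -> tuple[float, float]:
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    if not ratings:
        raise ValueError("need at least one rating")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    mos = statistics.mean(ratings)
    if len(ratings) < 2:
        return mos, float("nan")  # no spread estimate from a single rating
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical listener ratings for two systems.
mos_a, ci_a = mean_opinion_score([4, 5, 4, 5, 4, 4, 5, 4])
mos_b, ci_b = mean_opinion_score([5, 4, 5, 4, 4, 5, 4, 5])
print(f"A: {mos_a:.2f} +/- {ci_a:.2f}   B: {mos_b:.2f} +/- {ci_b:.2f}")
```

When two systems both sit near the top of the scale, their intervals overlap and the averages stop separating them. This is the saturation problem that pushes evaluation toward head-to-head comparisons.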

References

  1. van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., et al. (2016). WaveNet: A Generative Model for Raw Audio. arXiv preprint arXiv:1609.03499.
  2. van den Oord, A., Li, Y., Babuschkin, I., Simonyan, K., Vinyals, O., Kavukcuoglu, K., et al. (2017). Parallel WaveNet: Fast High-Fidelity Speech Synthesis. arXiv preprint arXiv:1711.10433.
  3. Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., et al. (2017). Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. arXiv preprint arXiv:1712.05884.
  4. Kong, J., Kim, J., & Bae, J. (2020). HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. Advances in Neural Information Processing Systems 33.
  5. Kim, J., Kong, J., & Son, J. (2021). Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech. International Conference on Machine Learning (ICML).
  6. Stevens, S. S., Volkmann, J., & Newman, E. B. (1937). A Scale for the Measurement of the Psychological Magnitude of Pitch. The Journal of the Acoustical Society of America, 8(3), 185–190.
  7. Tan, X., Qin, T., Soong, F., & Liu, T.-Y. (2021). A Survey on Neural Speech Synthesis. arXiv preprint arXiv:2106.15561.
  8. Soniox (2026). Real-time Text-to-Speech generation. Soniox.