How neural text-to-speech works: from text to waveform

Given the text "Dr. Reyes owes me $5," a neural TTS system produces a voice that speaks it a fraction of a second later. In that interval the system expands an abbreviation, reads a price out as words, maps every letter to a sound, chooses a melody, and generates tens of thousands of audio samples to draw the resulting pressure wave. None of these steps is externally visible.

The path from characters to a waveform is the counterpart of speech-to-text: roughly the same journey, run in reverse.

Text and phoneme processing

Raw text is not pronounceable as written. The front-end cleans it up and converts it to sound units.^[7]

Text normalization expands anything that is not a plain word into the words a person would say: "Dr." becomes "Doctor," "$5" becomes "five dollars," "2:30" becomes "two thirty." The step is full of context judgments ("Dr." is "Doctor" in "Dr. Reyes" and "Drive" in "Elm Dr."). When those judgments fail, they always fail silently; the failure modes are cataloged in the article on what text-to-speech is.

Grapheme-to-phoneme conversion (G2P) then maps the normalized words to phonemes, the abstract sound units of the language, because spelling does not reliably predict pronunciation. "Through," "though," and "tough" share letters and almost no sounds.

"Dr. Reyes owes me $5"
  -> normalize -> "Doctor Reyes owes me five dollars"
  -> G2P       -> /ˈdɑktər ˈreɪɛs oʊz mi faɪv ˈdɑlərz/
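
The two front-end steps can be sketched in a few lines of Python. This is a minimal illustration, not how a production front-end works: the regular expressions, the `NUMBER_WORDS` and `LEXICON` tables, and the function names `normalize` and `g2p` are all assumptions made for this example. Real systems rely on much larger lexicons and on trained models for words the lexicon lacks.

```python
import re

# Hypothetical toy tables, just large enough for the example sentence.
NUMBER_WORDS = {"1": "one", "2": "two", "3": "three", "4": "four", "5": "five"}
LEXICON = {
    "doctor": "ˈdɑktər",
    "reyes": "ˈreɪɛs",
    "owes": "oʊz",
    "me": "mi",
    "five": "faɪv",
    "dollars": "ˈdɑlərz",
}


def normalize(text: str) -> str:
    """Expand a few non-word tokens into the words a person would say."""
    # "Dr." followed by a capitalized word is treated as a title.
    text = re.sub(r"\bDr\.(?=\s+[A-Z])", "Doctor", text)
    # Any remaining "Dr." after a lowercase letter and a space is treated
    # as a street suffix. This rule is deliberately naive.
    text = re.sub(r"(?<=[a-z] )Dr\.", "Drive", text)

    def dollars(match: re.Match) -> str:
        amount = match.group(1)
        unit = "dollar" if amount == "1" else "dollars"
        # Amounts missing from the toy table are left as digits.
        return f"{NUMBER_WORDS.get(amount, amount)} {unit}"

    return re.sub(r"\$(\d+)", dollars, text)


def g2p(text: str) -> str:
    """Look each word up in the lexicon; mark unknown words explicitly."""
    words = re.findall(r"[a-z']+", text.lower())
    phonemes = [LEXICON.get(word, f"<unk:{word}>") for word in words]
    return "/" + " ".join(phonemes) + "/"


spoken = normalize("Dr. Reyes owes me $5")
print(spoken)
print(g2p(spoken))
```

The second "Dr." rule shows why normalization is hard: it works for "Elm Dr." at the end of a sentence, but a slightly different context would fool it without raising any error.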

The front-end also predicts the first hints of prosody: where stress falls, where pitch rises or falls, where to pause. That prediction is the difference between a sentence read with meaning and one read like a list; the article on prosody covers it in detail.

Acoustic modeling

The sequence of phonemes, plus a speaker embedding in a multi-voice system to say which voice to use, goes into the acoustic model. It predicts a spectrogram: a picture of how the sound's energy is distributed across frequencies over time. A mel spectrogram, the usual choice, describes the sound frame by frame but is not yet the sound itself.
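
To make "frame by frame" and "mel" concrete, the sketch below computes where mel bands sit in frequency and how many frames a clip produces. It uses one common form of the mel formula, 2595 · log10(1 + f/700). The specific settings (80 bands, a 22,050 Hz sample rate, a hop of 256 samples, centered framing) are assumptions chosen because they are typical of published setups, not values taken from this article.

```python
import math


def hz_to_mel(hz: float) -> float:
    """Convert frequency in Hz to mels (one common variant of the formula)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(n_mels: int, f_min: float, f_max: float) -> list[float]:
    """Center frequencies (Hz) of bands spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]


def frame_count(n_samples: int, hop_length: int) -> int:
    """Number of frames, assuming centered (padded) framing."""
    return 1 + n_samples // hop_length


centers = mel_band_centers(n_mels=80, f_min=0.0, f_max=8000.0)
print(f"lowest band spacing:  {centers[1] - centers[0]:.1f} Hz")
print(f"highest band spacing: {centers[-1] - centers[-2]:.1f} Hz")
print("frames for one second:", frame_count(22050, hop_length=256))
```

Bands are packed tightly at low frequencies and spread out at high ones, which mirrors how hearing resolves pitch. Under these assumed settings the acoustic model predicts tens of frames per second of speech, far fewer values than the raw samples the vocoder must later produce.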

Tacotron 2 (2017) generated the spectrogram one frame at a time, autoregressively, each frame conditioned on the frames before it.^[1] This produces natural-sounding speech but runs slowly, because the frames cannot be computed in parallel. FastSpeech and its successors generate all frames at once, in parallel, which is far faster and gives explicit control over timing, at a cost in naturalness that later versions largely closed.^[2]
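
The difference between the two decoding styles is structural, and a toy sketch shows it. Nothing here is a real model: `step` and `frame_fn` are hypothetical stand-ins for neural networks, and the function names are assumptions for this example. The `length_regulate` function mimics, in simplified form, the idea behind FastSpeech's duration-based expansion, where each phoneme is repeated for a predicted number of frames.

```python
from typing import Callable, Sequence


def autoregressive_decode(
    inputs: Sequence[float], step: Callable[[float, float], float]
) -> list[float]:
    """Each frame depends on the previous one, so the loop is sequential."""
    frames, prev = [], 0.0
    for x in inputs:
        prev = step(x, prev)
        frames.append(prev)
    return frames


def parallel_decode(
    inputs: Sequence[float], frame_fn: Callable[[float], float]
) -> list[float]:
    """No frame depends on another, so all could be computed at once."""
    return [frame_fn(x) for x in inputs]


def length_regulate(
    phonemes: Sequence[str], durations: Sequence[int], speed: float = 1.0
) -> list[str]:
    """Repeat each phoneme for its duration in frames, scaled by speed."""
    expanded = []
    for phoneme, duration in zip(phonemes, durations):
        n_frames = max(1, round(duration / speed))
        expanded.extend([phoneme] * n_frames)
    return expanded


phonemes = ["f", "aɪ", "v"]
durations = [4, 6, 2]  # hypothetical predicted frame counts
print(len(length_regulate(phonemes, durations)))             # normal pace
print(len(length_regulate(phonemes, durations, speed=2.0)))  # twice as fast
```

Because durations are explicit numbers rather than something buried inside a sequential decoder, changing speaking rate is a matter of scaling them, which is the "explicit control over timing" the parallel design offers.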

Waveform generation with a vocoder

A spectrogram is still not audio. The vocoder turns it into the actual waveform, the long sequence of samples (16,000 or more every second) that a speaker can play.

For years this stage was the bottleneck of the field. WaveNet, in 2016, generated the waveform one sample at a time, conditioning each on all the samples before it.^[3] The result was highly natural and far too slow for real-time use: producing a second of audio took much longer than a second. The parallel vocoders that followed, such as HiFi-GAN, which is built on generative adversarial networks, generate the whole waveform at once with quality close to WaveNet's at a tiny fraction of the cost. This made real-time neural TTS practical.^[4]
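
"Too slow for real time" has a simple arithmetic meaning, usually expressed as a real-time factor: synthesis time divided by the duration of the audio produced. The sketch below works through it. The function names are assumptions for this example, and the timings are invented for illustration; they are not measurements of WaveNet, HiFi-GAN, or any other system.

```python
def samples_needed(sample_rate_hz: int, duration_s: float) -> int:
    """How many samples the vocoder must produce for a clip."""
    return int(sample_rate_hz * duration_s)


def real_time_factor(synthesis_time_s: float, audio_duration_s: float) -> float:
    """Below 1.0 means audio is produced faster than it plays back."""
    return synthesis_time_s / audio_duration_s


def is_real_time(synthesis_time_s: float, audio_duration_s: float) -> bool:
    return real_time_factor(synthesis_time_s, audio_duration_s) < 1.0


clip_s = 2.0
print(samples_needed(16_000, clip_s), "samples at 16 kHz")

# Hypothetical timings for two vocoders on the same 2-second clip.
for name, seconds in [("sample-by-sample", 60.0), ("parallel", 0.1)]:
    rtf = real_time_factor(seconds, clip_s)
    print(f"{name}: RTF {rtf:.2f}, real time: {is_real_time(seconds, clip_s)}")
```

A sample-by-sample model has to run its network once per sample, tens of thousands of times per second of audio, which is why its real-time factor is so hard to bring below 1.0.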

End-to-end architectures

The three-stage pipeline describes the problem clearly, but the newest systems blur it. End-to-end models like VITS (2021) go from text to waveform in a single trained system, folding the acoustic model and vocoder together so the boundary between them no longer introduces artifacts.^[5]

A second, fast-growing approach treats audio like language. An audio codec compresses a waveform into a sequence of discrete tokens, and a model learns to predict those tokens from text the way a language model predicts words.^[6] A decoder then turns the tokens back into sound. Because audio is modeled as language, these systems inherit language-model behavior: a short voice sample works as a prompt and copies the voice (see voice cloning), and tokens come out left to right, which is exactly the shape live streaming needs.
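
The token idea can be shown with a toy codec. The sketch below is a heavy simplification under stated assumptions: the four-entry `CODEBOOK` is fixed by hand, whereas real neural codecs learn their codebooks and typically stack several quantizers; and `next_token` is a hypothetical stand-in for a trained model, not a real API.

```python
from typing import Callable, Sequence

# Hypothetical codebook: each token stands for a 2-sample snippet of audio.
CODEBOOK = [(-0.5, -0.5), (-0.5, 0.5), (0.5, -0.5), (0.5, 0.5)]
FRAME = 2


def encode(samples: Sequence[float]) -> list[int]:
    """Map each 2-sample chunk to the index of its nearest codebook entry."""
    tokens = []
    for i in range(0, len(samples) - FRAME + 1, FRAME):
        chunk = samples[i : i + FRAME]
        nearest = min(
            range(len(CODEBOOK)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(chunk, CODEBOOK[k])),
        )
        tokens.append(nearest)
    return tokens


def decode(tokens: Sequence[int]) -> list[float]:
    """Turn tokens back into (approximate) samples."""
    return [sample for token in tokens for sample in CODEBOOK[token]]


def generate(
    prompt: Sequence[int], next_token: Callable[[list[int]], int], n_new: int
) -> list[int]:
    """Left-to-right generation: each new token sees the prompt and all
    tokens produced so far, as a language model does with words."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(next_token(tokens))
    return tokens[len(prompt):]


voice_prompt = encode([0.4, 0.6, -0.3, 0.2])  # stands in for a voice sample
new_tokens = generate(voice_prompt, lambda ts: (ts[-1] + 1) % 4, n_new=3)
print(voice_prompt, new_tokens, decode(new_tokens))
```

The prompt tokens are never re-synthesized; they only condition what comes next. That is the mechanism by which a short recording can steer the generated voice, and since tokens are appended one at a time, each can be decoded to audio as soon as it exists.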

flowchart TB
  A[Text] --> B[Front-end<br/>normalize + phonemes]
  B --> C[Acoustic model<br/>spectrogram]
  C --> D[Vocoder<br/>waveform]
  A -.->|end-to-end<br/>or token model| D

The classic three stages, and the two ways modern systems collapse them.

Requirements for streaming synthesis

To start audio before the whole sentence is generated, every stage has to work on partial input and emit partial output. That rules out any design that requires the full sentence first. Parallel and token-based models therefore matter beyond raw speed: they can emit audio while the rest is still being produced. This is the subject of the article on streaming TTS, and the reason conversational voices feel responsive.^[8]
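
The "partial input in, partial output out" requirement maps naturally onto a generator. The sketch below is an assumption-laden illustration: `synthesize_chunk` is a hypothetical stand-in for a whole TTS pipeline, and splitting text into chunks ahead of time sidesteps the real difficulty of deciding where chunk boundaries can safely fall.

```python
from typing import Callable, Iterable, Iterator


def stream_synthesize(
    text_chunks: Iterable[str],
    synthesize_chunk: Callable[[str], list[float]],
) -> Iterator[list[float]]:
    """Yield audio for each text chunk as soon as it is ready, without
    waiting for later chunks to arrive or be synthesized."""
    for chunk in text_chunks:
        yield synthesize_chunk(chunk)


synthesized = []  # records which chunks have actually been processed


def fake_synthesize(chunk: str) -> list[float]:
    """Placeholder 'model': one silent sample per character."""
    synthesized.append(chunk)
    return [0.0] * len(chunk)


stream = stream_synthesize(
    iter(["Doctor Reyes", "owes me", "five dollars"]), fake_synthesize
)
first_audio = next(stream)  # playback could begin here
print("chunks synthesized before playback starts:", synthesized)
remaining = list(stream)
print("chunks synthesized in total:", len(synthesized))
```

The listener's wait is set by the time to produce the first chunk, not the whole sentence. A design that needs the full sentence up front cannot be wrapped this way, because its first output only exists after all the input has been consumed.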

Common questions

What are the stages of a neural TTS system?

Three: a front-end that normalizes text and converts it to phonemes, an acoustic model that predicts a spectrogram from those phonemes, and a vocoder that turns the spectrogram into a waveform. Many modern systems merge these into an end-to-end model or a token-based model, but the same three jobs still have to be done somewhere inside.

What is a vocoder?

The component that turns a spectrogram, a frame-by-frame description of the sound, into the actual waveform a speaker can play. It was long the slowest part: WaveNet generated audio one sample at a time. Parallel vocoders like HiFi-GAN produce the whole waveform at once with similar quality, which made real-time synthesis practical.

Why do TTS systems convert text to phonemes first?

Because spelling does not reliably predict pronunciation: "through," "though," and "tough" look alike and sound nothing alike. Mapping words to phonemes gives the acoustic model a consistent input, and it is where pronunciation can be corrected for names and unusual words.

How are the newest TTS models different from older ones?

Many treat audio like a language. They compress sound into discrete tokens, predict those tokens from text the way a language model predicts words, then decode them back to audio. Because they work like language models, they stream naturally and can clone a voice from a short sample used as a prompt, which older multi-stage pipelines could not do as easily.

References

[1] Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Chen, Z., et al. (2018). Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[2] Ren, Y., Ruan, Y., Tan, X., Qin, T., Zhao, S., et al. (2019). FastSpeech: Fast, Robust and Controllable Text to Speech. Advances in Neural Information Processing Systems, 32.
[3] van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., et al. (2016). WaveNet: A Generative Model for Raw Audio. arXiv preprint arXiv:1609.03499.
[4] Kong, J., Kim, J., & Bae, J. (2020). HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. Advances in Neural Information Processing Systems, 33.
[5] Kim, J., Kong, J., & Son, J. (2021). Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech. International Conference on Machine Learning (ICML).
[6] Wang, C., Chen, S., Wu, Y., Zhang, Z., Zhou, L., Liu, S., et al. (2023). Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers. arXiv preprint arXiv:2301.02111.
[7] Taylor, P. (2009). Text-to-Speech Synthesis. Cambridge University Press.
[8] Ma, M., Zhang, X., Li, X., & Liu, Y. (2020). A Streaming End-to-End Framework for Neural Text-to-Speech. Proc. Interspeech.