Given the text "Dr. Reyes owes me $5," a neural TTS system produces a voice that speaks it a fraction of a second later. In that interval the system expands an abbreviation, reads a price out as words, maps each word's spelling to its sounds, chooses a melody, and generates tens of thousands of audio samples to draw the resulting pressure wave. None of these steps is externally visible.
The path from characters to a waveform is, roughly, the speech-to-text pipeline run in reverse.
Text and phoneme processing
Raw text is not pronounceable as written. The front-end cleans it up and converts it to sound units.[7]
Text normalization expands anything that is not a plain word into the words a person would say: "Dr." becomes "Doctor," "$5" becomes "five dollars," "2:30" becomes "two thirty." This is the reverse of the inverse text normalization a recognizer performs, and it inherits the same ambiguity. "Dr." is "Doctor" in "Dr. Reyes" and "Drive" in "Elm Dr.," and only context tells them apart.
Grapheme-to-phoneme conversion (G2P) then maps the normalized words to phonemes, the abstract sound units of the language, because spelling does not reliably predict pronunciation. "Through," "though," and "tough" share letters and almost no sounds.
"Dr. Reyes owes me $5"
-> normalize -> "Doctor Reyes owes me five dollars"
-> G2P -> /ˈdɑktər ˈreɪɛs oʊz mi faɪv ˈdɑlərz/
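The two steps above can be sketched in a few lines of Python. This is a toy under stated assumptions, not how a production front-end works: the `NUMBER_WORDS` and `LEXICON` tables, the function names, and the rule that "Dr." means "Doctor" only when a capitalized word follows are all invented for illustration. Real systems rely on large pronunciation lexicons and context-aware models.

```python
import re

# Hypothetical, tiny lookup tables for illustration only.
NUMBER_WORDS = {"1": "one", "2": "two", "3": "three", "4": "four", "5": "five"}
LEXICON = {
    "doctor": "ˈdɑktər",
    "reyes": "ˈreɪɛs",
    "owes": "oʊz",
    "me": "mi",
    "five": "faɪv",
    "dollars": "ˈdɑlərz",
}


def normalize(text: str) -> str:
    """Expand a title abbreviation and small dollar amounts into words."""
    # Assumed heuristic: "Dr." directly before a capitalized word is a title.
    text = re.sub(r"\bDr\.(?=\s+[A-Z])", "Doctor", text)

    def money(match: re.Match) -> str:
        digit = match.group(1)
        unit = "dollar" if digit == "1" else "dollars"
        return f"{NUMBER_WORDS[digit]} {unit}"

    # Only single digits 1-5 are handled in this sketch.
    return re.sub(r"\$([1-5])\b", money, text)


def g2p(text: str) -> str:
    """Look up each word in the toy lexicon; unknown words are flagged."""
    words = re.findall(r"[a-z']+", text.lower())
    return " ".join(LEXICON.get(word, f"<?{word}>") for word in words)


normalized = normalize("Dr. Reyes owes me $5")
phonemes = g2p(normalized)
```

Note that `normalize("Elm Dr.")` leaves the abbreviation untouched, because the heuristic has no rule for the "Drive" reading. That gap is the ambiguity described above.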
The front-end also predicts the first hints of prosody: where stress falls, where pitch rises or falls, where to pause. That prediction is the difference between a sentence read with meaning and one read like a list, a topic covered under prosody.
Acoustic modeling
The sequence of phonemes, plus a speaker embedding in a multi-voice system to say which voice to use, goes into the acoustic model. It predicts a spectrogram: a picture of how the sound's energy is distributed across frequencies over time. A mel spectrogram, the usual choice, describes the sound frame by frame but is not yet the sound itself.
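To make "frame by frame" and "across frequencies" concrete, the sketch below computes the two dimensions of a mel spectrogram using only arithmetic. It uses the widely used HTK-style mel formula. The specific values (80 bands, 0 to 8,000 Hz, a hop of 200 samples at 16 kHz) are assumptions chosen for illustration, and the frame-count rule is one common padding convention among several.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """HTK-style mel scale: roughly linear at low and logarithmic at high frequencies."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(n_mels: int, f_min: float, f_max: float) -> list:
    """Band edges equally spaced on the mel scale, returned in Hz (n_mels + 2 values)."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_mels + 2)]


def num_frames(n_samples: int, hop_length: int) -> int:
    """Frame count when the signal is padded so a frame is centered on every hop."""
    return 1 + n_samples // hop_length


# Assumed configuration, for illustration only.
edges = mel_band_edges(n_mels=80, f_min=0.0, f_max=8000.0)
frames_per_second = num_frames(n_samples=16_000, hop_length=200)
```

Under these assumptions, one second of audio becomes a grid of 80 bands by `frames_per_second` frames. The low bands are narrow and the high bands are wide.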
Tacotron 2 (2017) generated the spectrogram autoregressively, one frame at a time, with each frame conditioned on the frames before it.[1] This produces natural-sounding speech but runs slowly, because the frames cannot be generated in parallel at inference time. FastSpeech and its successors generate all frames at once, in parallel, which is far faster and gives explicit control over timing, at a cost in naturalness that later versions largely closed.[2]
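The explicit timing control in FastSpeech-style models comes from predicting a duration (in frames) for each phoneme and expanding the phoneme sequence to that length before the frames are generated. A minimal sketch of that expansion step follows. The function name, the `speed` parameter, and the example durations are illustrative assumptions, and plain phoneme strings stand in for the hidden vectors a real model would repeat.

```python
def length_regulate(phonemes: list, durations: list, speed: float = 1.0) -> list:
    """Repeat each phoneme once per frame it should occupy.

    `durations` holds a frame count per phoneme (in a real model these are
    predicted). `speed` > 1 shortens every phoneme, `speed` < 1 lengthens it.
    """
    if len(phonemes) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    if speed <= 0:
        raise ValueError("speed must be positive")
    frames = []
    for phoneme, duration in zip(phonemes, durations):
        n_frames = max(1, round(duration / speed))  # keep at least one frame
        frames.extend([phoneme] * n_frames)
    return frames


# Hypothetical durations for the word "five".
normal = length_regulate(["f", "aɪ", "v"], [4, 6, 2])
fast = length_regulate(["f", "aɪ", "v"], [4, 6, 2], speed=2.0)
```

Because the output length is fixed before any frame is produced, every frame can then be computed in parallel, and changing `speed` changes the timing without touching the rest of the model.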
Waveform generation with a vocoder
A spectrogram is still not audio. The vocoder turns it into the actual waveform, the long sequence of samples (16,000 or more every second) that a speaker can play.
For years this stage was the bottleneck of the field. WaveNet, in 2016, generated the waveform one sample at a time, conditioning each on all the samples before it.[3] The result was highly natural and far too slow for real-time use: producing a second of audio took much longer than a second. The parallel vocoders that followed, like HiFi-GAN built on generative adversarial networks, generate the whole waveform at once with quality close to WaveNet's at a tiny fraction of the cost. This made real-time neural TTS practical.[4]
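The cost gap can be made concrete with a little arithmetic. The sketch below counts the sequential network steps each approach needs and converts that into a real-time factor (synthesis time divided by audio duration, where values above 1 are slower than real time). The per-step cost of 1 ms is an invented number. It is kept the same in both cases only to isolate the effect of step count, which understates what a full parallel pass costs in practice.

```python
import math


def sequential_steps(audio_seconds: float, sample_rate: int, samples_per_step: int) -> int:
    """Number of steps that must run one after another to cover the audio."""
    total_samples = round(audio_seconds * sample_rate)
    return math.ceil(total_samples / samples_per_step)


def real_time_factor(steps: int, seconds_per_step: float, audio_seconds: float) -> float:
    """Synthesis time divided by audio duration; below 1.0 is faster than real time."""
    return steps * seconds_per_step / audio_seconds


SAMPLE_RATE = 16_000
ASSUMED_STEP_SECONDS = 0.001  # hypothetical cost of one network step

# Sample-by-sample generation: one step per sample.
autoregressive_steps = sequential_steps(1.0, SAMPLE_RATE, samples_per_step=1)
# Whole-waveform generation: one step covers every sample.
parallel_steps = sequential_steps(1.0, SAMPLE_RATE, samples_per_step=SAMPLE_RATE)

autoregressive_rtf = real_time_factor(autoregressive_steps, ASSUMED_STEP_SECONDS, 1.0)
parallel_rtf = real_time_factor(parallel_steps, ASSUMED_STEP_SECONDS, 1.0)
```

With these assumed numbers the sample-by-sample approach needs 16,000 dependent steps for one second of audio, which no amount of parallel hardware can overlap.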
End-to-end architectures
The three-stage pipeline describes the problem clearly, but the newest systems blur it. End-to-end models like VITS (2021) go from text to waveform in a single trained system, folding the acoustic model and vocoder together so the boundary between them no longer introduces artifacts.[5]
A second, fast-growing approach treats audio like language. An audio codec compresses a waveform into a sequence of discrete tokens, and a model learns to predict those tokens from text the way a language model predicts words.[6] A decoder then turns the tokens back into sound. Because audio is modeled as language, the latest TTS systems behave like large language models, can be prompted with a short voice sample to clone it (see voice cloning), and stream naturally, with tokens produced left to right.
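The idea of turning audio into discrete tokens can be illustrated with a toy vector quantizer: each frame is replaced by the index of its nearest codebook entry, and decoding is a lookup. This is only a stand-in for a learned neural codec. The three-entry, two-dimensional `CODEBOOK` and the function names are invented here, and real codecs learn much larger codebooks and typically stack several of them.

```python
# Hypothetical codebook: each entry is a prototype frame.
CODEBOOK = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)]


def encode(frames: list, codebook: list) -> list:
    """Map each frame to the index of its nearest codebook entry (squared distance)."""
    tokens = []
    for frame in frames:
        distances = [
            sum((a - b) ** 2 for a, b in zip(frame, code)) for code in codebook
        ]
        tokens.append(distances.index(min(distances)))
    return tokens


def decode(tokens: list, codebook: list) -> list:
    """Turn tokens back into (approximate) frames by table lookup."""
    return [codebook[token] for token in tokens]


frames = [(0.9, 1.1), (0.1, -0.2), (-0.8, 0.7)]
tokens = encode(frames, CODEBOOK)
reconstructed = decode(tokens, CODEBOOK)
```

Once audio is a sequence of integers like `tokens`, predicting the next token from text and earlier tokens is the same kind of problem a language model solves for words.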
Requirements for streaming synthesis
To start audio before the whole sentence is generated, every stage has to work on partial input and emit partial output. That rules out any design that requires the full sentence first. Parallel and token-based models therefore matter beyond raw speed: they emit audio while the rest is still being produced, the subject of streaming TTS and the reason conversational voices feel responsive.[8]
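The partial-input, partial-output requirement maps naturally onto a generator. In the sketch below, `synthesize_chunk` is a hypothetical placeholder that returns silence of a plausible length, not a real synthesizer, and the 50 ms per character figure is an arbitrary assumption. The point is only the control flow: audio for the first text chunk is available before later chunks have been read.

```python
from typing import Iterable, Iterator, List


def synthesize_chunk(text: str, sample_rate: int = 16_000) -> List[float]:
    """Hypothetical stand-in for a model: silence, 50 ms per character."""
    samples_per_char = sample_rate // 20
    return [0.0] * (samples_per_char * len(text))


def stream_tts(text_chunks: Iterable[str]) -> Iterator[List[float]]:
    """Yield audio for each text chunk as soon as that chunk arrives."""
    for chunk in text_chunks:
        if chunk.strip():  # skip empty or whitespace-only chunks
            yield synthesize_chunk(chunk)


def text_source(log: list) -> Iterator[str]:
    """Simulated upstream producer that records what has been consumed."""
    for piece in ["Doctor Reyes ", "owes me ", "five dollars"]:
        log.append(piece)
        yield piece


consumed: list = []
stream = stream_tts(text_source(consumed))
first_audio = next(stream)  # playback could begin here
consumed_at_first_audio = list(consumed)
remaining_audio = list(stream)
```

At the moment `first_audio` is ready, only the first text chunk has been pulled from the source. A design that needed the full sentence would have to exhaust `text_source` before yielding anything.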
A neural TTS system takes text as input and produces a waveform, with three jobs in between: preparing the text, predicting the acoustics, and generating the samples. The difficulty is doing all three fast enough to answer without a noticeable delay and well enough that the result sounds like a person.
Common questions
What are the stages of a neural TTS system?
Three: a front-end that normalizes text and converts it to phonemes, an acoustic model that predicts a spectrogram from those phonemes, and a vocoder that turns the spectrogram into a waveform. Many modern systems merge some or all of these into an end-to-end or token-based model, but the same three jobs still happen internally.
What is a vocoder?
The component that turns a spectrogram, a frame-by-frame description of the sound, into the actual waveform a speaker can play. It was long the slowest stage: WaveNet generated audio one sample at a time. Parallel vocoders like HiFi-GAN produce the whole waveform at once with similar quality, which made real-time synthesis practical.
Why do TTS systems convert text to phonemes first?
Because spelling does not reliably predict pronunciation: "through," "though," and "tough" look alike and sound nothing alike. Mapping words to phonemes gives the acoustic model a consistent input, and it is where pronunciation can be corrected for names and unusual words.
How are the newest TTS models different from older ones?
Many treat audio like a language. They compress sound into discrete tokens, predict those tokens from text the way a language model predicts words, then decode them back to audio. That lets them stream naturally, behave like large language models, and clone a voice from a short sample, which older multi-stage pipelines could not do as easily.
Related concepts
- What is text-to-speech?
- Prosody
- TTS voices
- Streaming TTS
- Voice cloning
- A brief history of text-to-speech
References
- [1] Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Chen, Z., et al. (2018). Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
- [2] Ren, Y., Ruan, Y., Tan, X., Qin, T., Zhao, S., et al. (2019). FastSpeech: Fast, Robust and Controllable Text to Speech. Advances in Neural Information Processing Systems, 32.
- [3] van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., et al. (2016). WaveNet: A Generative Model for Raw Audio. arXiv preprint arXiv:1609.03499.
- [4] Kong, J., Kim, J., & Bae, J. (2020). HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. Advances in Neural Information Processing Systems, 33.
- [5] Kim, J., Kong, J., & Son, J. (2021). Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech. International Conference on Machine Learning (ICML).
- [6] Wang, C., Chen, S., Wu, Y., Zhang, Z., Zhou, L., Liu, S., et al. (2023). Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers. arXiv preprint arXiv:2301.02111.
- [7] Taylor, P. (2009). Text-to-Speech Synthesis. Cambridge University Press.
- [8] Ma, M., Zhang, X., Li, X., & Liu, Y. (2020). A Streaming End-to-End Framework for Neural Text-to-Speech. Proc. Interspeech.