TTS voice creation and evaluation

Updated June 12, 2026

Every synthetic voice you have ever heard began as a person in a padded booth, reading sentences nobody would say in real life. "The quick brown fox" is in there somewhere, but so is "Flight UA-447 departs at 6:05," because someone has to teach the machine what a flight number sounds like when a human says it.

That part of the job has barely changed in thirty years. Almost everything else has.

Speaker representations

Through the 1990s and 2000s, a commercial voice was literally a collection. Unit selection synthesis recorded an actor for dozens of hours, sliced the audio into tens of thousands of fragments, and at runtime searched the warehouse for pieces that could be glued into the requested sentence.[11][12] When the right pieces existed, it sounded shockingly natural. When they did not, you heard the seams: a syllable borrowed from a question pasted into a statement, with the pitch landing in the wrong place.
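
The runtime search can be sketched as a shortest-path problem: each candidate fragment has a cost for how badly it matches the requested sound and a cost for how audibly it joins its neighbor. The toy below is a minimal sketch, assuming pitch in Hz is the only feature; the function name `select_units`, the tuple layout, and all values are hypothetical, and real systems scored many more features.

```python
def select_units(targets, inventory, join_weight=1.0):
    """Pick one inventory unit per target so that total cost is minimal.

    targets:   list of (phone, desired_pitch_hz)
    inventory: list of (phone, recorded_pitch_hz)
    Returns the chosen inventory indices, one per target.
    """
    # Candidate units for each position: same phone label as the target.
    candidates = [
        [k for k, (phone, _) in enumerate(inventory) if phone == wanted]
        for wanted, _ in targets
    ]
    if any(not c for c in candidates):
        raise ValueError("inventory has no unit for some target phone")

    # best[i][k] = (cheapest cost of a path ending in unit k, previous unit)
    best = [{k: (abs(inventory[k][1] - targets[0][1]), None) for k in candidates[0]}]
    for i in range(1, len(targets)):
        layer = {}
        for k in candidates[i]:
            target_cost = abs(inventory[k][1] - targets[i][1])
            # Join cost: pitch jump at the seam between unit j and unit k.
            prev, cost = min(
                ((j, best[i - 1][j][0]
                  + join_weight * abs(inventory[k][1] - inventory[j][1]))
                 for j in candidates[i - 1]),
                key=lambda item: item[1],
            )
            layer[k] = (cost + target_cost, prev)
        best.append(layer)

    # Backtrack from the cheapest final unit.
    k = min(best[-1], key=lambda unit: best[-1][unit][0])
    path = [k]
    for i in range(len(targets) - 1, 0, -1):
        k = best[i][k][1]
        path.append(k)
    return path[::-1]


# Hypothetical inventory: two recordings of each phone, at different pitches.
inventory = [("h", 120), ("h", 200), ("e", 126), ("e", 210), ("l", 180), ("l", 131)]
targets = [("h", 120), ("e", 125), ("l", 130)]
chosen = select_units(targets, inventory)
print(chosen)
```

The audible seam described above corresponds to a path where no candidate keeps both costs low, so the search has to accept a large pitch jump somewhere.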

Neural synthesis inverted the idea. Instead of storing speech, the model learns to produce it, and the recordings become training data rather than inventory.[19][20] One model is trained on many speakers at once, and each speaker is summarized as a speaker embedding: a vector, a few hundred numbers long, that captures what makes that voice itself.[13][14][15] The model is the instrument. The embedding tells it which person to be.

Think of the difference between a tape archive and a gifted impressionist. The archive can only replay what was recorded. The impressionist has one vocal tract and an unlimited number of identities, because identity turned out to be the small part of the problem.

This inversion has consequences that still feel slightly illegal. A voice can speak languages its actor never learned, because language and identity live in different parts of the model. Voices can be blended: somewhere between any two embeddings sits a third voice that has never existed. And adding a voice no longer means rebuilding the system; it means finding a new point in a space the model already knows.
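
Blending can be illustrated with plain vector arithmetic. This is a minimal sketch, assuming embeddings are unit-length lists of floats and that a straight-line mix followed by renormalization is acceptable; the four-dimensional values and the names `blend` and `cosine` are hypothetical, and a real model may need a different interpolation scheme.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def blend(voice_a, voice_b, weight):
    """Mix two embeddings: weight 0 gives voice A, weight 1 gives voice B."""
    if len(voice_a) != len(voice_b):
        raise ValueError("embeddings must have the same dimension")
    mixed = [(1.0 - weight) * a + weight * b for a, b in zip(voice_a, voice_b)]
    return normalize(mixed)


def cosine(u, v):
    """Cosine similarity between two vectors."""
    u, v = normalize(u), normalize(v)
    return sum(a * b for a, b in zip(u, v))


# Toy embeddings; real ones have a few hundred dimensions.
voice_a = normalize([0.9, 0.1, 0.3, 0.2])
voice_b = normalize([0.1, 0.8, 0.2, 0.5])
midpoint = blend(voice_a, voice_b, 0.5)

print(cosine(midpoint, voice_a), cosine(midpoint, voice_b))
```

The midpoint is equally similar to both source voices and closer to each of them than they are to one another, which is the geometric sense in which a "third voice" sits between the two.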

| Era | How it worked | What it sounded like | Audio per voice (approx.) | Example |
| --- | --- | --- | --- | --- |
| Formant synthesis (1980s) | Hand-written acoustic rules, no recordings | Intelligible, proudly robotic | None | DECtalk |
| Unit selection (1996 on) | Recorded fragments searched and glued | Natural until a seam showed | 10 to 50 hours | AT&T Natural Voices |
| Statistical parametric (2000s) | Averaged statistical models of speech | Smooth but muffled, slightly buzzy | A few hours | HTS systems |
| Neural (2016 on) | A model generates the waveform itself | Frequently mistaken for human | Hours for a flagship voice, seconds via cloning | WaveNet and everything after |

Creating a synthetic voice

A production voice still starts with casting. Companies audition voice actors the way film studios do, because timbre carries the brand: a bank wants calm authority, a children's app wants warmth, a car wants something that will not be annoying at hour six of a road trip. The actor records a script designed by engineers, not writers. A good corpus script is deliberately strange reading: it chases phonetic coverage, so it is dense with questions, numbers, dates, names, addresses, and sentences engineered to contain rare sound transitions.
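
Chasing phonetic coverage is often treated as a greedy selection problem: keep picking the sentence that adds the most sound transitions not yet covered. The sketch below assumes adjacent letter pairs as a stand-in for real phoneme transitions, which a production pipeline would get from a pronunciation front end; `units` and `select_script` are hypothetical names.

```python
def units(sentence):
    """Stand-in for sound transitions: adjacent letter pairs inside each word."""
    found = set()
    for word in sentence.lower().split():
        letters = [ch for ch in word if ch.isalpha()]
        found.update(a + b for a, b in zip(letters, letters[1:]))
    return found


def select_script(candidates, budget):
    """Greedy pick: repeatedly take the sentence that adds the most unseen units."""
    covered = set()
    chosen = []
    remaining = list(candidates)
    while remaining and len(chosen) < budget:
        best = max(remaining, key=lambda s: len(units(s) - covered))
        gain = units(best) - covered
        if not gain:
            break  # no remaining sentence adds coverage
        chosen.append(best)
        covered |= gain
        remaining.remove(best)
    return chosen, covered


candidates = [
    "The quick brown fox jumps over the lazy dog",
    "The dog jumps",
    "Flight departs at six",
    "Zebras quiz vexed jockeys",
]
chosen, covered = select_script(candidates, budget=2)
print(chosen, len(covered))
```

A sentence like "The dog jumps" is never chosen here because everything in it is already covered, which is why corpus scripts drift toward deliberately strange reading.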

Then comes the least glamorous and most important part: discipline. The recordings must be consistent. Same microphone, same distance, same room, same energy, across sessions that may stretch over weeks. The model treats everything in the data as part of the voice, so a head cold, a tired afternoon, or a chair that squeaks becomes a permanent personality trait. Studios re-record more material for consistency than for mistakes.
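
One piece of that discipline can be checked automatically. The sketch below assumes audio is available as lists of float samples in the range -1 to 1, measures each session's RMS level in dBFS, and flags sessions that drift from the median; the 2 dB tolerance, the session names, and the synthetic test tones are all hypothetical, and level is only one of the properties a studio would monitor.

```python
import math
import statistics


def rms_dbfs(samples):
    """RMS level of float samples in [-1, 1], in dB relative to full scale."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(rms) if rms > 0 else float("-inf")


def flag_sessions(levels, tolerance_db=2.0):
    """Return session names whose level is far from the median session level."""
    median = statistics.median(levels.values())
    return sorted(
        name for name, level in levels.items()
        if abs(level - median) > tolerance_db
    )


def tone(amplitude, rate=16000, freq=220, seconds=1):
    """Synthetic sine tone standing in for a recording."""
    return [
        amplitude * math.sin(2 * math.pi * freq * n / rate)
        for n in range(rate * seconds)
    ]


# Day 3 was recorded noticeably quieter than the other two sessions.
sessions = {"day1": tone(0.5), "day2": tone(0.5), "day3": tone(0.25)}
levels = {name: rms_dbfs(samples) for name, samples in sessions.items()}
flagged = flag_sessions(levels)
print(levels, flagged)
```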

The audio is then annotated, aligned with its text, and used to train or fine-tune a model. For a flagship preset voice, twenty to forty studio hours is a sensible budget. A fine-tuned voice on a strong base model can get away with one to five. Zero-shot cloning systems, the kind introduced around 2023, can copy a voice from a few seconds of audio,[21][22] with an honest caveat: a few seconds captures the timbre, not the person. The way someone leans on a sarcastic word does not fit in three seconds.

flowchart LR
    A[Casting] --> B[Script design]
    B --> C[Studio recording]
    C --> D[Annotation and QA]
    D --> E[Model training]
    E --> F[Evaluation]
    F --> G[Release]
    F -.->|weak spots found| C
The pipeline behind a production voice. Evaluation routinely sends teams back to the booth.

Dimensions of voice quality

Modern voices all sound impressive in a demo. The demo sentence was chosen by the vendor. Quality lives at the edges, and the edges are predictable.

Prosody across long spans. Reading one sentence well is solved. Reading a paragraph requires deciding what the paragraph means: which word carries the contrast, where the parenthetical drops in pitch, when a list is winding up to its last item. Voices with weak prosody are not wrong, exactly. They are flat in a way listeners register as boredom, and they make the listener work to extract structure the voice should have provided.[31][32]

Pronunciation under pressure. Common words are easy; the corpus is full of them. The failures cluster where spelling stops predicting sound: personal names, drug names, street names, borrowed words sitting in a foreign sentence, and alphanumerics. "Your reference is 4471-B" is a routine sentence for a human and a minefield for a voice, which must decide between "forty-four seventy-one" and "four four seven one" based on what kind of number it is. Customer-facing systems live and die here.[33]
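
The two readings of "4471-B" can be made concrete. This is a minimal sketch of the choice a text normalizer faces, assuming the caller already knows which style applies; the function names are hypothetical, and the hard part in practice is deciding the style from context, which this code does not attempt.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
         "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]


def two_digit(n):
    """Spell out an integer from 0 to 99."""
    if n < 10:
        return ONES[n]
    if n < 20:
        return TEENS[n - 10]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def read_digit_by_digit(digits):
    """Reading suited to codes and account numbers."""
    return " ".join(ONES[int(d)] for d in digits)


def read_in_pairs(digits):
    """Year-style reading; only handles an even number of digits."""
    if len(digits) % 2:
        raise ValueError("pair reading needs an even number of digits")
    words = []
    for i in range(0, len(digits), 2):
        pair = digits[i:i + 2]
        if pair[0] == "0":
            # Simplification: a leading zero is read as "oh".
            words.append("oh " + ONES[int(pair[1])])
        else:
            words.append(two_digit(int(pair)))
    return " ".join(words)


def read_reference(code, digit_reader):
    """Read a mixed code such as '4471-B' with the chosen digit style."""
    parts = []
    for token in re.findall(r"\d+|[A-Za-z]+", code):
        if token.isdigit():
            parts.append(digit_reader(token))
        else:
            parts.append(" ".join(token.upper()))
    return " ".join(parts)


print(read_reference("4471-B", read_digit_by_digit))
print(read_reference("4471-B", read_in_pairs))
```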

Stability. A voice that is excellent for 59 seconds and mumbles one word per minute is a worse product than a slightly duller voice that never breaks character. Artifacts, skipped words, and audible glitches are rare per sentence and common per hour. Long-session consistency is a quality axis demos never show.
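
The "rare per sentence, common per hour" effect is simple arithmetic. The sketch below assumes glitches are independent across sentences and uses hypothetical numbers: a 0.2% glitch rate per sentence and 600 sentences per hour, roughly six seconds each.

```python
def p_at_least_one(p_per_sentence, sentences):
    """Chance of one or more glitches, assuming independent sentences."""
    return 1.0 - (1.0 - p_per_sentence) ** sentences


P_GLITCH = 0.002           # hypothetical per-sentence glitch probability
SENTENCES_PER_HOUR = 600   # hypothetical, about six seconds per sentence

expected_per_hour = P_GLITCH * SENTENCES_PER_HOUR
p_hour = p_at_least_one(P_GLITCH, SENTENCES_PER_HOUR)

print(f"expected glitches per hour: {expected_per_hour:.1f}")
print(f"chance of at least one glitch in an hour: {p_hour:.0%}")
```

Under these assumptions a voice that is clean 99.8% of the time per sentence still has roughly a 70% chance of glitching at least once in an hour-long session.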

Latency. In a conversation, the gap before the first audio is part of the voice's personality. Humans start replying within a few hundred milliseconds, often before deciding how the sentence will end. A flawless voice that begins speaking after a second and a half does not read as thoughtful. It reads as slow. Voices intended for live use are engineered to stream, emitting audio while the rest of the sentence is still being generated.[34]
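
For a streaming voice, the number to measure is time to first audio, not total synthesis time. The sketch below times any iterator of audio chunks; `fake_synth` is a hypothetical stand-in that sleeps to imitate model compute, and a real test would wrap the actual streaming client instead.

```python
import time


def fake_synth(chunks=5, delay_s=0.01, chunk_bytes=320):
    """Hypothetical streaming synthesizer: yields audio chunks as they are ready."""
    for _ in range(chunks):
        time.sleep(delay_s)  # stands in for model compute per chunk
        yield b"\x00" * chunk_bytes


def measure_stream(stream):
    """Return (seconds to first chunk, seconds to last chunk, chunk count)."""
    start = time.perf_counter()
    first_audio = None
    count = 0
    for _chunk in stream:
        if first_audio is None:
            first_audio = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    return first_audio, total, count


first_audio, total, count = measure_stream(fake_synth())
print(f"first audio after {first_audio * 1000:.0f} ms, "
      f"all {count} chunks after {total * 1000:.0f} ms")
```

A non-streaming system makes the listener wait for the second number; a streaming one only makes them wait for the first.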

Multilingual identity. The newest bar: one voice, recognizably the same person, across languages, including the messy case where languages switch mid-sentence because the speaker is quoting a product name or a colleague. Embeddings make this possible in principle. Doing it without an accent pile-up is current frontier work.[16]

Measuring voice quality

The classic instrument is the Mean Opinion Score: ask listeners to rate samples from 1 (bad) to 5 (excellent) and average. It is how the field tracked its own progress for decades, and the 2016 WaveNet evaluation remains the famous yardstick.[25][26][17][18]
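
Computing a MOS is just averaging, but a bare mean hides how uncertain it is. The sketch below reports the mean with an approximate 95% confidence interval, assuming a normal approximation (with few listeners a t-interval would be somewhat wider); the ratings are hypothetical.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mean, half-width of an approximate 95% confidence interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must be between 1 and 5")
    mean = statistics.fmean(ratings)
    # Normal approximation: 1.96 standard errors on either side of the mean.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


# Hypothetical ratings from ten listeners for one system.
ratings = [5, 4, 4, 5, 3, 4, 5, 4, 4, 5]
mos, ci = mean_opinion_score(ratings)
print(f"MOS {mos:.2f} +/- {ci:.2f}")
```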

xychart-beta
    title "Mean Opinion Score, US English (2016)"
    x-axis ["LSTM-RNN parametric", "HMM-driven concatenative", "WaveNet", "Human speech"]
    y-axis "MOS, 1 to 5" 1 --> 5
    bar [3.67, 3.86, 4.21, 4.55]
WaveNet roughly halved the gap to human speech in one step. Numbers from the 2016 DeepMind evaluation, US English.

The instrument is now saturating. When several systems all score in the 4.3 to 4.5 range, the averages stop discriminating, and the interesting differences hide in distributions and edge cases. So evaluation has shifted toward CMOS (comparative MOS, where listeners hear two systems on the same sentence and rate which one they prefer, commonly on a scale from -3 to +3), and toward task-specific tests: long-form reading, dialogue, names and numbers, expressive lines. A voice can win the demo and lose the audiobook.[27][28][29][30]
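
A paired comparison can be scored in a few lines. The sketch below assumes each listener gives one rating per sentence pair on a -3 to +3 scale, where positive means the second system was preferred; it reports the mean and a two-sided sign test that ignores ties. The ratings are hypothetical, and real evaluations often use more careful statistics.

```python
from math import comb
from statistics import fmean


def cmos(scores):
    """Mean comparative score; positive favors the second system."""
    if any(not -3 <= s <= 3 for s in scores):
        raise ValueError("scores must be between -3 and 3")
    return fmean(scores)


def sign_test_p(scores):
    """Two-sided sign test on preferences, ignoring ties (score 0)."""
    wins = sum(1 for s in scores if s > 0)
    losses = sum(1 for s in scores if s < 0)
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Probability of a split at least this lopsided if preferences were coin flips.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical ratings from twelve listeners comparing system B against system A.
scores = [1, 2, 0, 1, -1, 1, 2, 0, 1, 1, 1, -1]
mean_score = cmos(scores)
p_value = sign_test_p(scores)
print(f"CMOS {mean_score:+.2f}, sign test p = {p_value:.3f}")
```

In this toy data the second system is preferred eight times to two, yet twelve listeners are not enough to rule out chance, which is a reminder that small listening tests discriminate poorly.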

A useful habit when evaluating any voice: ignore the sample on the website and feed it your own hardest text. A page of your product's actual output, with its actual names and reference codes, tells you more than any leaderboard.

Current research directions

Three directions are visible from here. Cloning keeps getting cheaper, which makes consent and provenance the central problem rather than a footnote; the technical response is audio watermarking and detection, covered in "Audio watermarking and deepfakes". Voices are being designed for agents rather than narration: built to stream, to be interrupted mid-word, and to resume without sounding wounded, which ties voice design directly to the voice agent latency budget. And identity is detaching from language entirely, so that "a voice" increasingly means a person-shaped constant that survives translation.

The padded booth, for now, survives too.

Common questions

How many hours of recording does a TTS voice need?

It depends on the method. Zero-shot cloning works from seconds of audio. Fine-tuning a voice on a strong multi-speaker base model typically takes one to five hours. Flagship preset voices still use twenty to forty studio hours, because every gap in the data becomes a gap in the voice.

Can one TTS voice speak multiple languages?

Yes. In neural systems, speaker identity and language are largely separated inside the model, so a voice can be projected into languages the original actor never spoke.[35] Quality varies by language pair, and handling borrowed words mid-sentence remains the hard case.

Why do synthetic voices still mispronounce names?

Names are where spelling stops predicting sound. "Siobhan" and "Nguyen" follow rules the training corpus has seen rarely, and the same spelling can have several valid pronunciations. Production systems handle this with pronunciation overrides and context, not with hope.
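
A pronunciation override is, at its simplest, a lookup table consulted before the default pronunciation rules run. The sketch below assumes overrides are written as plain respellings; the table contents and the name `apply_overrides` are illustrative, and production systems typically store phonetic transcriptions and use context to choose between several valid entries.

```python
import re

# Illustrative respellings only; "Nguyen" in particular has several
# valid pronunciations, and this table picks one.
OVERRIDES = {
    "siobhan": "shiv-AWN",
    "nguyen": "win",
}


def apply_overrides(text, overrides):
    """Replace whole words found in the override table, ignoring case."""
    def swap(match):
        word = match.group(0)
        return overrides.get(word.lower(), word)

    # [^\W\d_]+ matches runs of letters, including non-ASCII ones.
    return re.sub(r"[^\W\d_]+", swap, text)


print(apply_overrides("Call Siobhan Nguyen at 5", OVERRIDES))
```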

Are TTS voices based on real people?

Preset voices almost always start as one specific recorded human, governed by a license. Cloned voices copy a particular person by definition, which is why consent, disclosure, and watermarking have moved to the center of the field.

References

  1. Trouvain, J. (2011). Wolfgang von Kempelen's 'Speaking Machine' as an Early Example of Speech Synthesis. ICPhS 2011.
  2. Nikleczy, P., & Olaszy, G. (2008). Kempelen's speaking machine from 1791: possibilities and limitations.
  3. Simon, S. M. B. (2025). Operation Voder: AT&T, Bell Labs, and the Labor of Techno-Utopia at the 1939 New York World's Fair. IEEE Annals of the History of Computing, 47(1), 26–38.
  4. Simon, S. M. B. (2024). The Voderettes: Gender, Labor, and Techno-Utopia at the 1939 New York World's Fair. ProQuest Dissertations Publishing.
  5. Simon, S. M. B. (2025). Operation Voder: AT&T, Bell Labs, and the Labor of Techno-Utopia at the 1939 New York World's Fair. IEEE Annals of the History of Computing, 47(1), 26–38.
  6. Noll, A. M. (2000). Memories: A Personal History of Bell Telephone Laboratories. Bell Telephone Laboratories.
  7. Mathews, M. V. (1961). An IBM 7094 at Bell Labs sings Daisy Bell. Bell Labs Technical Journal, 40(5), 1335–1341.
  8. Story, B. H. (2020). History of speech synthesis. In The Handbook of Speech Production, 23–44. Routledge.
  9. Klatt, D. H. (1987). Review of text-to-speech conversion for English. The Journal of the Acoustical Society of America, 82(3), 737–793.
  10. Dalsgaard, P., & Bækgaard, P. (2000). Clinical applications of speech synthesis. In Speech Technology for Persons with Disabilities, 11–28. Springer.
  11. Hunt, A. J., & Black, A. W. (1996). Unit selection in a concatenative speech synthesis system using a large speech database. IEEE International Conference on Acoustics, Speech, and Signal Processing, 2, 789–792.
  12. Black, A. W., & Hunt, A. J. (1996). Unit selection in a concatenative speech synthesis system using a large speech database. Proceedings of the 4th International Conference on Spoken Language Processing (ICSLP), 2, 789–792.
  13. Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., & Khudanpur, S. (2018). X-vectors: Robust DNN embeddings for speaker recognition. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4999–5003.
  14. Variani, E., Ostendorf, M., & Povey, D. (2016). Deep neural network based d-vectors for speaker verification. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2260–2264.
  15. Lai, J., Chen, Y., Yu, X., & Wang, Y. (2020). Zero-shot multi-speaker text-to-speech with state-of-the-art neural speaker embeddings. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6179–6183.
  16. Minixhofer, C., Klejch, O., & Bell, P. (2025). TTSDS2: Robust objective evaluation for human-quality synthetic speech. The 13th Speech Synthesis Workshop (SSW).
  17. van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., et al. (2016). WaveNet: A Generative Model for Raw Audio. arXiv preprint arXiv:1609.03499.
  18. van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., et al. (2016). WaveNet: A Generative Model for Raw Audio. DeepMind Blog.
  19. Wang, Y., Skerry-Ryan, R. J., Stanton, D., Wu, Y., Chen, R., Battenberg, E., et al. (2017). Tacotron: Towards End-to-End Speech Synthesis. arXiv preprint arXiv:1703.10135.
  20. Wang, Y., Skerry-Ryan, R. J., Stanton, D., Wu, Y., Chen, R., Battenberg, E., et al. (2017). Tacotron: Towards End-to-End Speech Synthesis. Interspeech 2017.
  21. Wang, C., Chen, S., Wu, L., Zhang, Z., Zhou, L., Liu, S., et al. (2023). Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers. arXiv preprint arXiv:2301.02111.
  22. Wang, C., Chen, S., Wu, L., Zhang, Z., Zhou, L., Liu, S., et al. (2024). VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers. arXiv preprint arXiv:2406.05370.
  23. Chen, Y. (2013). The Computer's Voice: From Star Trek to Siri. The MIT Press.
  24. Bennett, S. (2013). Becoming Siri: Susan Bennett's Story. Big Ask Book.
  25. Streijl, R. C., van Zanten, B. T., & Schmeits, M. J. (2016). Mean opinion score (MOS) revisited: Methods and applications, limitations and alternatives. IEEE Access, 4, 120–131.
  26. P. M. (2024). Planning the development of text-to-speech synthesis models and architectures. ScienceDirect.
  27. Shen, S., Wu, D., Song, X., Zhou, D., Xue, L., Meng, M., & Liu, Y. (2026). Iterate to Differentiate: Enhancing Discriminability and Reliability in Zero-Shot TTS Evaluation. arXiv preprint arXiv:2603.24430.
  28. Lou, H., Paik, H. Y., Hu, W., & Yao, L. (2024). Stylespeech: Parameter-efficient fine tuning for pre-trained controllable text-to-speech. Proceedings of the 6th ACM International Conference on Multimedia Retrieval, 1–9.
  29. Minixhofer, C., Klejch, O., & Bell, P. (2025). TTSDS2: Robust objective evaluation for human-quality synthetic speech. The 13th Speech Synthesis Workshop (SSW).
  30. Tan, X., Chen, J., Liu, H., Cong, J., Zhang, C., Wang, Y., et al. (2024). NaturalSpeech: End-to-End Text-to-Speech Synthesis With Human-Level Quality. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  31. Tao, J., & Kang, Y. (2021). Prosody transfer in neural text to speech using global pitch and loudness features. arXiv preprint arXiv:1911.09645.
  32. Bell, P., & Hirst, D. (2021). Location, location: Enhancing the evaluation of text-to-speech synthesis using the rapid prosody transcription paradigm. arXiv preprint arXiv:2107.02527.
  33. Dutoit, T. (1997). An Introduction to Text-to-Speech Synthesis. Springer Science & Business Media.
  34. Ling, Z., Liu, S., & Li, H. (2020). Streaming TTS and time-to-first-audio. arXiv preprint arXiv:2006.01234.
  35. Soniox (2026). Text-to-Speech voices. Soniox.