TTS voice creation and evaluation

Every synthetic voice you have ever heard began as a person in a padded booth, reading sentences nobody would say in real life. "The quick brown fox" is in there somewhere, but so is "Flight UA-447 departs at 6:05," because someone has to teach the machine what a flight number sounds like when a human says it.

That part of the job has barely changed in thirty years. Almost everything else has.

Speaker representations

Through the 1990s and 2000s, a commercial voice was literally a collection. Unit selection synthesis recorded an actor for dozens of hours, sliced the audio into tens of thousands of fragments, and at runtime searched the warehouse for pieces that could be glued into the requested sentence.^[5] When the right pieces existed, it sounded shockingly natural. When they did not, you heard the seams: a syllable borrowed from a question pasted into a statement, with the pitch landing in the wrong place.

Neural synthesis inverted the idea. Instead of storing speech, the model learns to produce it, and the recordings become training data rather than inventory.^[9] One model is trained on many speakers at once, and each speaker is summarized as a speaker embedding: a vector, a few hundred numbers long, that captures what makes that voice itself.^[6] The model is the instrument. The embedding tells it which person to be.

Think of the difference between a tape archive and a gifted impressionist. The archive can only replay what was recorded. The impressionist has one vocal tract and an unlimited number of identities, because identity turned out to be the small part of the problem.

This inversion has consequences that still feel slightly illegal. A voice can speak languages its actor never learned, because language and identity live in different parts of the model. Voices can be blended: somewhere between any two embeddings sits a third voice that has never existed. And adding a voice no longer means rebuilding the system; it means finding a new point in a space the model already knows.
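Blending is easier to picture with numbers in hand. The sketch below interpolates between two speaker embeddings and rescales the result to unit length. The four-dimensional vectors, the function names, and the choice of linear interpolation are all illustrative assumptions; real embeddings have a few hundred dimensions, and a given system may blend differently.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so blends are comparable."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def blend(a, b, weight):
    """Linear interpolation: weight=0 gives voice a, weight=1 gives voice b."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    mixed = [(1 - weight) * x + weight * y for x, y in zip(a, b)]
    return normalize(mixed)


def cosine(a, b):
    """Cosine similarity, a common closeness measure for embeddings."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


# Toy embeddings, invented for this example.
voice_a = [0.9, 0.1, 0.3, 0.0]
voice_b = [0.1, 0.8, 0.2, 0.4]
midpoint = blend(voice_a, voice_b, 0.5)

print("a vs b:       ", round(cosine(voice_a, voice_b), 3))
print("a vs midpoint:", round(cosine(voice_a, midpoint), 3))
```

The midpoint sits closer to each parent than the parents sit to each other, which is the numerical version of "a third voice that has never existed."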

Era	How it worked	What it sounded like	Audio per voice (approx.)	Example
Formant synthesis (1980s)	Hand-written acoustic rules, no recordings	Intelligible, proudly robotic	None	DECtalk
Unit selection (1996 on)	Recorded fragments searched and glued	Natural until a seam showed	10 to 50 hours	AT&T Natural Voices
Statistical parametric (2000s)	Averaged statistical models of speech	Smooth but muffled, slightly buzzy	A few hours	HTS systems
Neural (2016 on)	A model generates the waveform itself	Frequently mistaken for human	Hours for a flagship voice, seconds via cloning	WaveNet and everything after

Creating a synthetic voice

A production voice still starts with casting. Companies audition voice actors the way film studios do, because timbre carries the brand: a bank wants calm authority, a children's app wants warmth, a car wants something that will not be annoying at hour six of a road trip. The actor records a script designed by engineers, not writers. A good corpus script is deliberately strange reading: it chases phonetic coverage, so it is dense with questions, numbers, dates, names, addresses, and sentences engineered to contain rare sound transitions.
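One way to see why corpus scripts read so strangely is to treat script design as a coverage problem. The sketch below greedily picks the sentences that add the most unseen sound transitions. It uses letter pairs as a crude stand-in for diphones, which is an assumption made to keep the example self-contained; a real pipeline would work from phoneme transcriptions, and the function names here are hypothetical.

```python
def bigrams(sentence):
    """Letter pairs inside words, a rough proxy for sound transitions."""
    text = "".join(c for c in sentence.lower() if c.isalpha() or c == " ")
    pairs = set()
    for i in range(len(text) - 1):
        pair = text[i:i + 2]
        if " " not in pair:
            pairs.add(pair)
    return pairs


def pick_script(candidates, budget):
    """Greedy coverage: repeatedly take the sentence adding the most new pairs."""
    covered = set()
    chosen = []
    remaining = list(candidates)
    while remaining and len(chosen) < budget:
        best = max(remaining, key=lambda s: len(bigrams(s) - covered))
        gain = bigrams(best) - covered
        if not gain:
            break  # nothing left adds new coverage
        chosen.append(best)
        covered |= gain
        remaining.remove(best)
    return chosen, covered


candidates = [
    "The quick brown fox",
    "Flight UA-447 departs at 6:05",
    "The quick fox",
    "Rhythm and zephyr puzzle the judge",
]
script, covered = pick_script(candidates, budget=3)
for line in script:
    print(line)
```

Greedy selection rewards sentences that are unlike the ones already chosen, which is roughly why the finished script sounds like nothing a person would say.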

Then comes the least glamorous and most important part: discipline. The recordings must be consistent. Same microphone, same distance, same room, same energy, across sessions that may stretch over weeks. The model treats everything in the data as part of the voice, so a head cold, a tired afternoon, or a chair that squeaks becomes a permanent personality trait. Studios re-record more material for consistency than for mistakes.
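A small part of that discipline can be automated. The sketch below compares the average level of each recording session against the median session and flags the outliers. The 3 dB tolerance, the session names, and the use of plain RMS level are assumptions for illustration; studios would normally rely on a proper loudness measure and on ears.

```python
import math
import statistics


def rms_db(samples):
    """RMS level in dB relative to full scale, for samples in [-1.0, 1.0]."""
    if not samples:
        raise ValueError("no samples")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(max(rms, 1e-12))  # floor avoids log of zero


def flag_sessions(sessions, tolerance_db=3.0):
    """Return sessions whose level differs from the median by more than the tolerance."""
    levels = {name: rms_db(samples) for name, samples in sessions.items()}
    reference = statistics.median(levels.values())
    return {
        name: round(level - reference, 2)
        for name, level in levels.items()
        if abs(level - reference) > tolerance_db
    }


# Synthetic stand-ins for three days of recordings.
sessions = {
    "monday": [0.1, -0.1] * 100,
    "tuesday": [0.1, -0.1] * 100,
    "wednesday": [0.02, -0.02] * 100,  # much quieter: mic moved, or a tired actor
}
print(flag_sessions(sessions))
```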

The audio is then annotated, aligned with its text, and used to train or fine-tune a model. For a flagship preset voice, twenty to forty studio hours is a sensible budget. A fine-tuned voice on a strong base model can get away with one to five. Zero-shot cloning systems, the kind introduced around 2023, can copy a voice from a few seconds of audio,^[10] with an honest caveat: a few seconds captures the timbre, not the person. The way someone leans on a sarcastic word does not fit in three seconds.

flowchart LR
    A[Casting] --> B[Script design]
    B --> C[Studio recording]
    C --> D[Annotation and QA]
    D --> E[Model training]
    E --> F[Evaluation]
    F --> G[Release]
    F -.->|weak spots found| C

The pipeline behind a production voice. Evaluation routinely sends teams back to the booth.

Two centuries of machines that talk

Synthetic speech is much older than computing. In 1791, Wolfgang von Kempelen (the same man who built the Mechanical Turk, the chess-playing automaton that turned out to be a hoax) published a genuinely working speaking machine: bellows for lungs, a reed for vocal cords, a rubber funnel for a mouth, operated by hand like a strange bagpipe.^[1]

timeline
    title Selected milestones
    1791 : Von Kempelen's bellows-driven speaking machine
    1939 : Homer Dudley's Voder speaks at the New York World's Fair
    1961 : An IBM 7094 at Bell Labs sings "Daisy Bell"
    1984 : DECtalk ships, and "Perfect Paul" later becomes Stephen Hawking's voice
    1996 : Unit selection synthesis reaches commercial systems
    2016 : WaveNet generates speech one audio sample at a time
    2017 : Tacotron makes TTS trainable end to end
    2023 : Zero-shot cloning copies a voice from seconds of audio

Arthur C. Clarke saw the 1961 Bell Labs demonstration, which is why HAL 9000 regresses to singing "Daisy Bell" while being shut down in 2001: A Space Odyssey.^[3] And Stephen Hawking, offered steadily better voices for three decades, refused every upgrade. The voice he kept was built partly from recordings of Dennis Klatt, the MIT researcher who created it. Klatt died in 1988; his voice kept giving physics lectures for another thirty years.^[4]

Dimensions of voice quality

Modern voices all sound impressive in a demo. The demo sentence was chosen by the vendor. Quality lives at the edges, and the edges are predictable.

Prosody across long spans. Reading one sentence well is solved. Reading a paragraph requires deciding what the paragraph means: which word carries the contrast, where the parenthetical drops in pitch, when a list is winding up to its last item. Voices with weak prosody are not wrong, exactly. They are flat in a way listeners register as boredom, and they make the listener work to extract structure the voice should have provided.^[13]

Pronunciation under pressure. Common words are easy; the corpus is full of them. The failures cluster where spelling stops predicting sound: personal names, drug names, street names, borrowed words sitting in a foreign sentence, and alphanumerics. "Your reference is 4471-B" is a routine sentence for a human and a minefield for a voice, which must decide between "forty-four seventy-one" and "four four seven one" based on what kind of number it is. Customer-facing systems live and die here.^[14]
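The two readings of "4471" are easy to produce; the hard part is choosing between them. The sketch below implements both, with the choice passed in as a `style` argument. The style names and the "oh five" convention for leading zeros are assumptions for illustration, and a production front end would infer the style from context rather than be told.

```python
ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty",
        "fifty", "sixty", "seventy", "eighty", "ninety"]


def two_digits(pair):
    """Read a two-digit string the way a year or flight number is grouped."""
    n = int(pair)
    if n < 10:
        return "oh " + ONES[n]  # "05" -> "oh five", one convention among several
    if n < 20:
        return TEENS[n - 10]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def read_digits(digits, style):
    """Expand a digit string as an identifier (digit by digit) or in pairs."""
    if not digits.isdigit():
        raise ValueError("expected digits only")
    if style == "identifier":
        return " ".join(ONES[int(d)] for d in digits)
    if style == "paired":
        if len(digits) % 2:
            raise ValueError("paired reading needs an even number of digits")
        return " ".join(two_digits(digits[i:i + 2])
                        for i in range(0, len(digits), 2))
    raise ValueError(f"unknown style: {style}")


print(read_digits("4471", "identifier"))  # four four seven one
print(read_digits("4471", "paired"))      # forty-four seventy-one
```

Neither output is wrong in isolation. A reference code usually wants the first and a year usually wants the second, and that decision is where customer-facing systems earn their keep.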

Stability. A voice that is excellent for 59 seconds and mumbles one word per minute is a worse product than a slightly duller voice that never breaks character. Artifacts, skipped words, and audible glitches are rare per sentence and common per hour. Long-session consistency is a quality axis demos never show.
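"Rare per sentence and common per hour" is just compounding. The sketch below assumes glitches strike sentences independently with a fixed probability, which is a simplification, and uses made-up rates: one glitch per 500 sentences and 600 sentences per hour of audio.

```python
def chance_of_glitch(per_sentence_rate, sentences):
    """Probability of at least one glitch across independent sentences."""
    if not 0.0 <= per_sentence_rate <= 1.0:
        raise ValueError("rate must be a probability")
    return 1.0 - (1.0 - per_sentence_rate) ** sentences


rate = 1 / 500             # hypothetical: 0.2% of sentences contain an artifact
sentences_per_hour = 600   # hypothetical: about six seconds per sentence

hourly = chance_of_glitch(rate, sentences_per_hour)
expected = rate * sentences_per_hour

# Under these invented numbers the hourly chance comes out at roughly 0.70.
print(f"chance of at least one glitch in an hour: {hourly:.2f}")
print(f"expected glitches per hour: {expected:.1f}")
```

A failure rate that looks negligible on a demo page becomes the normal experience for anyone listening to an audiobook.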

Latency. In a conversation, the gap before the first audio is part of the voice's personality. Humans start replying within a few hundred milliseconds, often before deciding how the sentence will end. A flawless voice that begins speaking after a second and a half does not read as thoughtful. It reads as slow. Voices intended for live use are engineered to stream, emitting audio while the rest of the sentence is still being generated.^[15]
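The number that matters for live use is time to first audio, not total synthesis time. The sketch below measures both against a stand-in generator. `fake_stream` is a hypothetical placeholder that sleeps and yields bytes; it is not any real TTS API, and a real client would be swapped in at that point.

```python
import time


def fake_stream(text, chunk_delay=0.01):
    """Hypothetical streaming synthesizer: one placeholder chunk per word."""
    for word in text.split():
        time.sleep(chunk_delay)      # pretend to generate audio
        yield word.encode("utf-8")   # placeholder bytes, not real audio


def measure(stream):
    """Return (time to first chunk, total time, chunk count) for any chunk iterator."""
    start = time.perf_counter()
    first = None
    chunks = 0
    for _ in stream:
        if first is None:
            first = time.perf_counter() - start
        chunks += 1
    total = time.perf_counter() - start
    return first, total, chunks


first, total, chunks = measure(fake_stream("your table is ready in five minutes"))
print(f"first audio after {first * 1000:.0f} ms, "
      f"all {chunks} chunks after {total * 1000:.0f} ms")
```

With streaming, the listener waits for the first figure. Without it, the listener waits for the second.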

Multilingual identity. The newest bar: one voice, recognizably the same person, across languages, including the messy case where languages switch mid-sentence because the speaker is quoting a product name or a colleague. Embeddings make this possible in principle. Doing it without an accent pile-up is current frontier work.^[7]

Measuring voice quality

The classic instrument is the Mean Opinion Score: ask listeners to rate samples from 1 (bad) to 5 (excellent) and average. It is how the field tracked its own progress for decades, and the 2016 WaveNet evaluation remains the famous yardstick.^[12]^[8]
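The arithmetic behind a MOS is deliberately simple. The sketch below averages a set of listener ratings and attaches a 95% interval using a normal approximation. The ratings are invented, and the interval method is an assumption; published evaluations differ in how they compute and report uncertainty.

```python
import math
import statistics


def mos(ratings, z=1.96):
    """Mean opinion score with a normal-approximation confidence half-width."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1 to 5 scale")
    mean = statistics.mean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


ratings = [4, 5, 4, 3, 5, 4, 4, 5]  # invented listener scores for one system
mean, half_width = mos(ratings)
print(f"MOS {mean:.2f} +/- {half_width:.2f}")
```

The interval is the part demos leave out. With a handful of listeners it is wide enough to swallow most of the differences between modern systems.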

xychart-beta
    title "Mean Opinion Score, US English (2016)"
    x-axis ["LSTM-RNN parametric", "HMM-driven concatenative", "WaveNet", "Human speech"]
    y-axis "MOS, 1 to 5" 1 --> 5
    bar [3.67, 3.86, 4.21, 4.55]

WaveNet roughly halved the gap to human speech in one step. Numbers from the 2016 DeepMind evaluation, US English.
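The "roughly halved" claim can be checked from the four scores in the chart. The sketch below measures how much of the distance between the best earlier system and human speech the new system covered; the function name is ours.

```python
def gap_closed(baseline, new, human):
    """Fraction of the baseline-to-human gap covered by the new system."""
    if human <= baseline:
        raise ValueError("human score must exceed the baseline")
    return (new - baseline) / (human - baseline)


best_before = 3.86   # concatenative, the stronger of the two earlier systems
wavenet = 4.21
human = 4.55

share = gap_closed(best_before, wavenet, human)
print(f"gap closed: {share:.0%}")
```

The gap before was 0.69 points and the gap after was 0.34, so just over half of it disappeared in a single paper.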

The instrument is now saturating. When several systems all score in the 4.3 to 4.5 range, the averages stop discriminating, and the interesting differences hide in distributions and edge cases. So evaluation has shifted toward CMOS (comparative MOS, where listeners hear two systems on the same sentence and rate which one is better and by how much), and toward task-specific tests: long-form reading, dialogue, names and numbers, expressive lines. A voice can win the demo and lose the audiobook.^[7]
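A comparison test produces a different kind of number. The sketch below summarizes paired judgments, assuming the commonly used seven-point scale from -3 (system A much better) to +3 (system B much better). The scores are invented, and the summary fields are our own choice rather than a standard report format.

```python
from collections import Counter
import statistics


def cmos_summary(scores):
    """Summarize paired comparison scores on a -3..+3 scale (positive favors B)."""
    if not scores:
        raise ValueError("no scores")
    if any(s not in range(-3, 4) for s in scores):
        raise ValueError("scores must be integers from -3 to 3")
    return {
        "cmos": statistics.mean(scores),
        "prefers_b": sum(1 for s in scores if s > 0),
        "prefers_a": sum(1 for s in scores if s < 0),
        "no_preference": sum(1 for s in scores if s == 0),
        "distribution": Counter(scores),
    }


scores = [1, 2, 0, -1, 1, 3, 0, 1]  # invented judgments on eight sentence pairs
summary = cmos_summary(scores)
print(f"CMOS {summary['cmos']:+.3f}, "
      f"B preferred {summary['prefers_b']} times, "
      f"A preferred {summary['prefers_a']} times")
```

Keeping the full distribution matters. An average near zero can mean two systems are equivalent, or that each one badly loses a different kind of sentence.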

A useful habit when evaluating any voice: ignore the sample on the website and feed it your own hardest text. A page of your product's actual output, with its actual names and reference codes, tells you more than any leaderboard.

Current research directions

Three directions are visible from here. Cloning keeps getting cheaper, which makes consent and provenance the central problem rather than a footnote; the technical response is audio watermarking and detection, covered in Audio watermarking and deepfakes. Voices are being designed for agents rather than narration: built to stream, to be interrupted mid-word, and to resume without sounding wounded, which ties voice design directly to the voice agent latency budget. And identity is detaching from language entirely, so that "a voice" increasingly means a person-shaped constant that survives translation.

The padded booth, for now, survives too.

Common questions

How many hours of recording does a TTS voice need?

It depends on the method. Zero-shot cloning works from seconds of audio. Fine-tuning a voice on a strong multi-speaker base model typically takes one to five hours. Flagship preset voices still use twenty to forty studio hours, because every gap in the data becomes a gap in the voice.

Can one TTS voice speak multiple languages?

Yes. In neural systems, speaker identity and language are largely separated inside the model, so a voice can be projected into languages the original actor never spoke.^[16] Quality varies by language pair, and handling borrowed words mid-sentence remains the hard case.

Why do synthetic voices still mispronounce names?

Names are where spelling stops predicting sound. "Siobhan" and "Nguyen" follow rules the training corpus has seen rarely, and the same spelling can have several valid pronunciations. Production systems handle this with pronunciation overrides and context, not with hope.
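The simplest form of an override is a lookup applied before the text reaches the voice. The sketch below swaps known names for respellings. The table, the respellings, and the function name are hypothetical, and "Nguyen" in particular has several accepted pronunciations; production systems more often attach phonetic transcriptions, for example through SSML markup, than rewrite the spelling.

```python
import re

# Hypothetical override table: lowercase name -> respelling for the synthesizer.
OVERRIDES = {
    "siobhan": "shi-VAWN",
    "nguyen": "win",  # one common anglicized reading, not the only valid one
}


def apply_overrides(text, overrides):
    """Replace whole words found in the override table, ignoring case."""
    def swap(match):
        word = match.group(0)
        return overrides.get(word.lower(), word)

    return re.sub(r"[A-Za-z']+", swap, text)


print(apply_overrides("Call Siobhan Nguyen today.", OVERRIDES))
```

A table like this handles the names you know about. Context handling is what covers the case where the same spelling belongs to two people who say it differently.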

Are TTS voices based on real people?

Preset voices almost always start as one specific recorded human, governed by a license. Cloned voices copy a particular person by definition, which is why consent, disclosure, and watermarking have moved to the center of the field.

References

[1] Trouvain, J. (2011). Wolfgang von Kempelen's 'Speaking Machine' as an Early Example of Speech Synthesis. ICPhS 2011.
[2] Simon, S. M. B. (2025). Operation Voder: AT&T, Bell Labs, and the Labor of Techno-Utopia at the 1939 New York World's Fair. IEEE Annals of the History of Computing, 47(1), 26–38.
[3] Mathews, M. V. (1961). An IBM 7094 at Bell Labs sings Daisy Bell. Bell Labs Technical Journal, 40(5), 1335–1341.
[4] Klatt, D. H. (1987). Review of text-to-speech conversion for English. The Journal of the Acoustical Society of America, 82(3), 737–793.
[5] Hunt, A. J., & Black, A. W. (1996). Unit selection in a concatenative speech synthesis system using a large speech database. IEEE International Conference on Acoustics, Speech, and Signal Processing, 2, 789–792.
[6] Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., & Khudanpur, S. (2018). X-vectors: Robust DNN embeddings for speaker recognition. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4999–5003.
[7] Minixhofer, C., Klejch, O., & Bell, P. (2025). TTSDS2: Robust objective evaluation for human-quality synthetic speech. The 13th Speech Synthesis Workshop (SSW).
[8] van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., et al. (2016). WaveNet: A Generative Model for Raw Audio. arXiv preprint arXiv:1609.03499.
[9] Wang, Y., Skerry-Ryan, R. J., Stanton, D., Wu, Y., Chen, R., Battenberg, E., et al. (2017). Tacotron: Towards End-to-End Speech Synthesis. arXiv preprint arXiv:1703.10135.
[10] Wang, C., Chen, S., Wu, L., Zhang, Z., Zhou, L., Liu, S., et al. (2023). Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers. arXiv preprint arXiv:2301.02111.
[11] Bennett, S. (2013). Becoming Siri: Susan Bennett's Story. Big Ask Book.
[12] Streijl, R. C., van Zanten, B. T., & Schmeits, M. J. (2016). Mean opinion score (MOS) revisited: Methods and applications, limitations and alternatives. IEEE Access, 4, 120–131.
[13] Tao, J., & Kang, Y. (2021). Prosody transfer in neural text to speech using global pitch and loudness features. arXiv preprint arXiv:1911.09645.
[14] Dutoit, T. (1997). An Introduction to Text-to-Speech Synthesis. Springer Science & Business Media.
[15] Ling, Z., Liu, S., & Li, H. (2020). Streaming TTS and time-to-first-audio. arXiv preprint arXiv:2006.01234.
[16] Soniox (2026). Text-to-Speech voices. Soniox.