A brief history of text-to-speech: from mechanical talkers to neural voices

In 1791, in Vienna, a Hungarian named Wolfgang von Kempelen published a book describing a machine he had been building for more than twenty years. It had a bellows for lungs, a vibrating reed for vocal cords, and a flexible leather tube for a mouth, which the operator reshaped by hand while pumping. Played well, it spoke whole words and short phrases in a small, breathy, unmistakably human voice, and listeners described it as uncanny.^[1]

The goal across two centuries stayed the same: make a machine talk, and make it talk like a person. What changed, each time the field jumped forward, was the material the machine was built from.^[11]

Timeline: two centuries of synthetic speech
1791: Von Kempelen's bellows speaking machine
1846: Faber's Euphonia, a talking head
1939: The Voder speaks at the World's Fair
1950: Pattern Playback turns pictures into sound
1984: DECtalk ships; later, Hawking's voice
1996: Unit selection glues recorded speech
2016: WaveNet generates the waveform itself
2017: Tacotron learns text-to-speech end to end
2023: Cloning copies a voice from seconds of audio

The long arc: the milestones where teaching a machine to talk got meaningfully easier or more convincing.

Mechanical speech synthesis

Von Kempelen's instrument was, literally, a model of you. Speech is air pushed past something that vibrates (your vocal cords), with the resulting buzz shaped into vowels and consonants by the changing geometry of your throat, mouth, and lips. He built exactly that in wood, leather, and reeds, then learned to play it like a difficult woodwind. To make the sounds a mouth makes, he built a mouth.^[2]

The same man, twenty years earlier, had built the Mechanical Turk, the chess-playing "automaton" that toured Europe beating opponents and was secretly operated by a human hidden in the cabinet.^[3] So one inventor produced both a machine that faked thinking and a machine that genuinely spoke. The fake one made him famous; the real one was a footnote for a century.

The idea did not die. In the 1840s a German immigrant named Joseph Faber unveiled Euphonia, the product of decades of work: a talking head with a keyboard, a bellows, and an artificial tongue and jaw, which recited sentences in a slow, ghostly monotone and could, by most accounts, sing.^[4]

Electronic speech synthesis

The wood-and-leather approach had a ceiling: a human vocal tract is a precision instrument, and carving a good one by hand is extremely difficult. The breakthrough was to stop copying the vocal tract itself and instead model, electronically, the sound it produces.

At Bell Labs in the 1930s, an engineer named Homer Dudley had an idea that still underlies most of what came after. Speech, he noticed, is a slow signal riding on a fast one. The fast part is the raw buzz from the vocal cords, a few hundred times a second; the slow part is how the mouth shapes that buzz, which changes only a handful of times a second as you move from sound to sound. Capture just the slow part, a few numbers describing the energy in each frequency band, and you could send speech down a wire cheaply and rebuild it at the other end. He called the analysis machine the vocoder, for voice coder.^[5]
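Dudley's analysis step can be sketched in a few lines. This is a toy illustration, not a reconstruction of his circuit: it uses a naive Fourier transform where the vocoder used analog filters, and the sample rate, the 20-millisecond frame, and the three band edges are assumed values chosen for readability.

```python
import math

SAMPLE_RATE = 8000   # Hz; assumed for this sketch
FRAME_LEN = 160      # 20 ms per frame at 8 kHz; assumed


def band_energies(frame, band_edges_hz, sample_rate=SAMPLE_RATE):
    """Energy of one frame inside each [low, high) band, via a naive DFT."""
    n = len(frame)
    energies = [0.0] * (len(band_edges_hz) - 1)
    for k in range(1, n // 2):
        freq = k * sample_rate / n
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        im = sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        power = (re * re + im * im) / n
        for b in range(len(energies)):
            if band_edges_hz[b] <= freq < band_edges_hz[b + 1]:
                energies[b] += power
                break
    return energies


def analyze(signal, band_edges_hz, frame_len=FRAME_LEN):
    """Reduce a signal to one short list of band energies per frame."""
    return [
        band_energies(signal[start:start + frame_len], band_edges_hz)
        for start in range(0, len(signal) - frame_len + 1, frame_len)
    ]


# Demo signal: one frame of a 300 Hz tone, then one frame of a 2000 Hz tone.
tone = lambda hz, i: math.sin(2 * math.pi * hz * i / SAMPLE_RATE)
signal = [tone(300, i) for i in range(FRAME_LEN)] + [tone(2000, i) for i in range(FRAME_LEN)]

BAND_EDGES = [0, 1000, 2000, 4000]   # three illustrative bands
frames = analyze(signal, BAND_EDGES)
```

Each frame of 160 samples collapses to three numbers, and the band carrying the most energy moves from the lowest to the highest as the tone changes. That reduction, many samples in and a few slowly changing values out, is the part of the idea that made cheap transmission plausible.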

Dudley then ran the idea backwards. If a vocoder could take speech apart into a few control signals, a machine could put speech together from control signals supplied by a person. That machine was the Voder, demonstrated at the 1939 New York World's Fair. It was played like an organ: a keyboard of ten keys for the different sounds, a wrist bar to switch between a buzz and a hiss, and a foot pedal for pitch. Trained operators made it speak full sentences, live, to crowds who had never imagined such a thing; what that training took is a story told in how TTS voices are made.^[6]

A decade later, at Haskins Laboratories, Franklin Cooper built the Pattern Playback: it shone light through a painted spectrogram, a picture of sound with time across and frequency up, and turned that picture back into audio.^[8] Researchers could hand-draw a pattern, hear it, and so discover which acoustic shapes the ear uses to tell "ba" from "ga." For a long stretch, building speech synthesizers was a way of studying how human hearing works.
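The principle behind the Pattern Playback, reading a picture column by column and sounding one tone per painted row, can be imitated in software. The sketch below is only an analogy to the optical machine: the sample rate, the 120 Hz spacing between rows, the column duration, and the tiny three-row pattern are all assumptions made for illustration.

```python
import math

SAMPLE_RATE = 8000      # Hz; assumed
ROW_SPACING_HZ = 120    # row h sounds a tone at (h + 1) * 120 Hz; assumed


def play_pattern(pattern, column_seconds=0.05, sample_rate=SAMPLE_RATE):
    """Turn a painted picture into audio.

    pattern[t][h] is the brightness (0.0 to 1.0) of row h in time column t.
    Every painted cell contributes a sine tone for the duration of its column.
    """
    samples_per_column = round(column_seconds * sample_rate)
    audio = []
    for t, column in enumerate(pattern):
        for i in range(samples_per_column):
            time_s = (t * samples_per_column + i) / sample_rate
            audio.append(sum(
                brightness * math.sin(2 * math.pi * ROW_SPACING_HZ * (h + 1) * time_s)
                for h, brightness in enumerate(column)
            ))
    return audio


# A hand-drawn pattern: a low tone, a gap of silence, then a higher tone.
pattern = [
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
]
audio = play_pattern(pattern)
```

Editing a cell in `pattern` and listening again is the software equivalent of repainting the spectrogram, which is how a researcher could test one acoustic cue at a time.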

Rule-based synthesis

By the 1960s the question sharpened. Could a machine turn written text into those control signals on its own, with no operator at the keyboard and no recordings to lean on? The answer that took hold was formant synthesis, and it needs one piece of vocabulary.

A formant is a resonance of your vocal tract, a band of frequencies that the shape of your throat and mouth amplifies. Vowels are, to a first approximation, patterns of two or three formants: move them around and "ee" becomes "ah." The Swedish researcher Gunnar Fant set out the acoustic theory in 1960^[9], and engineers then built synthesizers that generated formants directly from rules about how letters map to sounds.
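To make "move the formants and the vowel changes" concrete, here is a minimal sketch of a formant synthesizer: a buzz passed through two resonators in series. It is a simplification under stated assumptions, not Klatt's or anyone's production design. The formant frequencies are rough textbook values for an adult male voice, the bandwidths and the 100 Hz pitch are assumed, and `strength_at` is a hypothetical helper added only to inspect the result.

```python
import math

SAMPLE_RATE = 16000  # Hz; assumed

# (center frequency, bandwidth) in Hz for the first two formants.
# Frequencies are approximate textbook values; bandwidths are assumptions.
VOWELS = {
    "ee": [(270, 60), (2290, 100)],
    "ah": [(730, 80), (1090, 100)],
}


def buzz(duration_s, pitch_hz=100, sample_rate=SAMPLE_RATE):
    """Impulse train standing in for the vocal cords."""
    period = sample_rate // pitch_hz
    return [1.0 if i % period == 0 else 0.0
            for i in range(round(duration_s * sample_rate))]


def resonator(signal, center_hz, bandwidth_hz, sample_rate=SAMPLE_RATE):
    """Two-pole digital resonator that boosts frequencies near center_hz."""
    t = 1.0 / sample_rate
    c = -math.exp(-2 * math.pi * bandwidth_hz * t)
    b = 2 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2 * math.pi * center_hz * t)
    a = 1 - b - c  # keeps the gain at 0 Hz equal to 1
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y1, y2 = y, y1
    return out


def synthesize_vowel(name, duration_s=0.2):
    """Shape the buzz with one resonator per formant, in series."""
    signal = buzz(duration_s)
    for center_hz, bandwidth_hz in VOWELS[name]:
        signal = resonator(signal, center_hz, bandwidth_hz)
    return signal


def strength_at(signal, freq_hz, sample_rate=SAMPLE_RATE):
    """Magnitude of the signal's component at a single frequency."""
    re = sum(x * math.cos(2 * math.pi * freq_hz * i / sample_rate) for i, x in enumerate(signal))
    im = sum(x * math.sin(2 * math.pi * freq_hz * i / sample_rate) for i, x in enumerate(signal))
    return math.hypot(re, im) / len(signal)


ee = synthesize_vowel("ee")
ah = synthesize_vowel("ah")
```

The source buzz is identical for both vowels; only the resonator settings differ. Comparing `strength_at(ee, 300)` with `strength_at(ee, 700)`, and the same pair for `ah`, shows the energy shifting from around 300 Hz to around 700 Hz as the first formant moves up.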

The masterwork was Dennis Klatt's, at MIT, whose synthesizer became the system known as MITalk and then, in 1984, the commercial DECtalk.^[10] DECtalk shipped with a handful of preset voices, and one of them, nicknamed "Perfect Paul" and built partly from recordings of Klatt's own voice, became the voice of Stephen Hawking. Formant synthesis never passed for human, but it never ran out of things to say either, because it worked from rules rather than a library of clips. That trade, character for coverage, is what the next chapter turns on. (For how a voice's identity later became a thing you could separate from the words, see how TTS voices are made.)

Concatenative synthesis

Rules could capture the skeleton of speech but not its texture, the tiny irregularities that make a real voice real. The field then moved in the opposite direction: instead of generating speech, it replayed recordings.

Concatenative synthesis recorded a human and chopped the audio into small units, often diphones, the transitions from the middle of one sound to the middle of the next, then stitched them together for whatever sentence you asked for. Unit selection, which reached commercial systems around 1996, scaled the idea up: record an actor for many hours, cut the audio into tens of thousands of fragments, and assemble each requested sentence from the best-fitting pieces.^[13] When the fit was good, the result was startlingly natural, because it was real human speech. When it was not, you heard the join: two fragments recorded on different days meeting mid-word at slightly different pitches.
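The search for "best-fitting pieces" is commonly framed as minimizing two costs at once: how well each fragment matches what the sentence needs, and how smoothly neighboring fragments join. The sketch below shows that framing with a small dynamic-programming search. It is illustrative only: the `Unit` fields, the pitch-only cost functions, and the three-diphone inventory are hypothetical stand-ins for the much richer features a real system would use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    diphone: str    # which sound transition this fragment contains
    pitch: float    # average pitch in Hz
    session: str    # recording session it came from


def target_cost(unit, wanted_pitch):
    """How far the fragment is from what the sentence asks for."""
    return abs(unit.pitch - wanted_pitch)


def join_cost(left, right):
    """How audible the seam between two neighboring fragments would be."""
    return abs(left.pitch - right.pitch)


def select_units(targets, inventory, join_weight=1.0):
    """Pick one unit per target so total target + join cost is minimal.

    targets is a list of (diphone, wanted_pitch) pairs.
    Returns (chosen_units, total_cost).
    """
    columns = [[u for u in inventory if u.diphone == d] for d, _ in targets]
    # best[i][j] = (cheapest cost of a path ending in columns[i][j], backpointer)
    best = [[(target_cost(u, targets[0][1]), None) for u in columns[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in columns[i]:
            cost, back = min(
                (best[i - 1][j][0] + join_weight * join_cost(p, u), j)
                for j, p in enumerate(columns[i - 1])
            )
            row.append((cost + target_cost(u, targets[i][1]), back))
        best.append(row)
    j = min(range(len(best[-1])), key=lambda idx: best[-1][idx][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(columns[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total


inventory = [
    Unit("h-e", 120.0, "monday"),
    Unit("e-l", 125.0, "monday"),
    Unit("e-l", 141.0, "tuesday"),
    Unit("l-o", 121.0, "monday"),
]
targets = [("h-e", 120.0), ("e-l", 140.0), ("l-o", 120.0)]

chosen, total = select_units(targets, inventory)
greedy, greedy_total = select_units(targets, inventory, join_weight=0.0)
```

With join costs counted, the search prefers the Monday fragment for the middle slot even though the Tuesday one is closer to the requested pitch, because the Tuesday fragment would create two audible jumps. Setting `join_weight=0.0` ignores the seams and picks the Tuesday fragment, which is the mid-word join the paragraph above describes.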

Statistical parametric synthesis

A quieter line of work ran alongside unit selection and pointed at the future. Statistical parametric synthesis, usually built on hidden Markov models, stored no audio at all.^[14] It learned a statistical average of how each sound is produced and generated the control parameters on demand. The payoff was flexibility and size: a whole voice fit in a few megabytes, and you could nudge its speed or pitch freely. The cost was that an average of many recordings is blurrier than any single one, so it sounded smooth but muffled, with a faint buzz, and rarely won on quality. What it did was teach the field to treat synthesis as prediction, a problem of guessing the right sound from the text, which is exactly the frame the next wave needed.
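Why averaging blurs can be shown with a toy that leaves out the hidden Markov machinery entirely and keeps only the averaging. Everything here is assumed for illustration: the pitch contours are invented, and real systems model many more parameters than pitch.

```python
from statistics import mean


def train(recordings_by_sound):
    """Store one frame-by-frame average contour per sound, and no audio."""
    return {
        sound: [mean(frame) for frame in zip(*takes)]
        for sound, takes in recordings_by_sound.items()
    }


def generate(model, sounds):
    """Produce a parameter track for any sequence of known sounds."""
    track = []
    for sound in sounds:
        track.extend(model[sound])
    return track


def contour_range(contour):
    """How much the contour moves: a crude measure of its liveliness."""
    return max(contour) - min(contour)


# Three invented takes of the same sound. Each has a sharp pitch bump,
# but the bump lands on a different frame every time.
takes = [
    [100, 100, 140, 100, 100],
    [100, 140, 100, 100, 100],
    [100, 100, 100, 140, 100],
]
model = train({"ah": takes})
track = generate(model, ["ah", "ah"])
```

Every take moves by 40 Hz, but the averaged contour moves by only about 13 Hz: the bumps partly cancel because they never line up. The model can generate a track for any sequence of sounds from a handful of stored numbers, and what it generates is flatter than anything the speaker actually said.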

Approach	The core idea	What it sounded like
Mechanical (1791)	Build a physical vocal tract	Breathy, human, barely controllable
Formant / rules (1980s)	Generate the sound from rules	Clear, and unmistakably a machine
Unit selection (1996)	Glue together recorded fragments	Real speech, until the joins showed
Statistical parametric (2000s)	Predict an average of real speech	Averaged, soft, faintly buzzing
Neural (2016 on)	Generate the waveform itself	Routinely mistaken for a person

Neural speech synthesis

In 2016, DeepMind's WaveNet achieved what had been widely assumed impractical. Instead of predicting compact parameters and handing them to a synthesizer, it modeled the raw waveform directly, generating each of the roughly sixteen thousand audio samples per second one at a time, each conditioned on the ones before.^[15] On listener tests it jumped most of the way from the best concatenative systems toward human recordings in a single step. The limitation was speed: the original was far slower than real time, and the work of the next few years was to make it fast while preserving quality, through faster network designs and vocoders like HiFi-GAN.
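The speed problem follows directly from the shape of the generation loop, which a few lines can show. This is a sketch of the loop only: `predict_next` is a hypothetical hand-written rule standing in for the neural network, and it produces a plain 440 Hz tone rather than speech. The real model predicts a probability distribution over the next sample and draws from it.

```python
import math

SAMPLE_RATE = 16000                         # samples per second, as in the text
TONE_HZ = 440                               # assumed; chosen for the demo
OMEGA = 2 * math.pi * TONE_HZ / SAMPLE_RATE


def predict_next(history):
    """Stand-in for the network: next sample from the two before it."""
    return 2 * math.cos(OMEGA) * history[-1] - history[-2]


def generate(n_samples, seed):
    """Autoregressive loop: each new sample depends on earlier output."""
    samples = list(seed)
    steps = 0
    while len(samples) < n_samples:
        samples.append(predict_next(samples))  # cannot start until the previous step ends
        steps += 1
    return samples, steps


seed = [0.0, math.sin(OMEGA)]
one_second, steps = generate(SAMPLE_RATE, seed)
```

One second of audio takes 15,998 sequential calls after the two seed samples, and none of them can run in parallel because each needs the output of the one before. Replace the one-line rule with a deep network and the cost of that dependency becomes clear.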

A year later, Tacotron (2017) closed the loop. Feed it text and recordings, and it learned, end to end, to predict the spectrogram the words should produce, with no hand-written pronunciation rules in the middle.^[16] A vocoder then turned that spectrogram into sound: the Griffin-Lim algorithm in the original paper, neural vocoders in its successors. Identity became a learned representation, a handful of numbers locating a voice in a space the model already understood, rather than a warehouse of clips. That is why a modern system can conjure a voice that never recorded a word,^[12] and why, from around 2023, zero-shot cloning can copy a specific person from a few seconds of audio.^[17] The mechanics, and the consent questions that now sit at the center of the field, are covered in how TTS voices are made and voice cloning.

Von Kempelen wanted a machine that could say a few words in something like a human voice. Making a machine sound human is now largely solved; the open problem is telling synthetic speech apart from the real thing, which is where audio watermarking and deepfakes picks up.

Common questions

What was the first speech synthesizer?

The first machine known to speak whole words and short phrases was Wolfgang von Kempelen's bellows-driven device, described in 1791, a mechanical model of the vocal tract. The first fully electronic one was Homer Dudley's Voder, demonstrated at the 1939 World's Fair, though it was played by a human operator rather than driven from text.

What is a vocoder, and why does it matter?

A vocoder, invented by Homer Dudley at Bell Labs in the 1930s, analyzes speech into a small set of slowly-changing control signals (roughly, how much energy sits in each frequency band, plus the pitch). That insight, that speech is a slow control signal on top of a simple sound source, made it possible to transmit, compress, and synthesize speech, and it underlies most of the synthesis that followed.

Why did old text-to-speech sound so robotic?

Because it was built from rules, not recordings. Formant synthesis, the dominant approach behind systems like DECtalk in the 1980s, generated speech sounds from acoustic rules about how letters map to formants. It was perfectly intelligible and never ran out of words, but it could not reproduce the fine, irregular texture of a real voice, so it sounded clear and unmistakably synthetic.

When did text-to-speech start sounding human?

The turning point was 2016 to 2017, with WaveNet generating the audio waveform directly and Tacotron learning the whole text-to-speech mapping end to end. Before that, the most natural systems replayed glued-together recordings; after it, systems generated convincingly human speech from scratch.

How does this connect to how voices are made today?

The neural shift turned a voice from a collection of recordings into a learned representation. For how a production voice is cast, recorded, and built on top of that representation, and what separates a good one from a passable one, see how TTS voices are made.

References

1. Ramsay, G. J. (2019). Mechanical Speech Synthesis in Early Talking Automata. Acoustics Today.
2. Brackhane, R. (2011). Wolfgang von Kempelen's 'Speaking Machine' as an Acoustic Model of the Vocal Tract. Proceedings of the 17th International Congress of Phonetic Sciences.
3. Standage, T. (2002). The Turk: The Life and Times of the Famous Eighteenth-Century Chess-Playing Machine. Walker & Company.
4. Lindsay, D. (1997). Talking Head. Invention & Technology Magazine.
5. Dudley, H. (1939). The Vocoder. Bell Laboratories Record.
6. Simon, S. M. (2025). Operation Voder: AT&T, Bell Labs, and the Labor of Techno-Utopia at the 1939 New York World's Fair. IEEE Annals of the History of Computing.
7. Weadon, P. D. (2000). SIGSALY: The Start of the Digital Revolution. National Security Agency.
8. Cooper, F. S., Liberman, A. M., & Borst, J. M. (1951). The interconversion of audible and visible patterns as a basis for research in the perception of speech. Proceedings of the National Academy of Sciences.
9. Fant, G. (1960). Acoustic Theory of Speech Production. Mouton.
10. Klatt, D. H. (1987). Review of text-to-speech conversion for English. The Journal of the Acoustical Society of America.
11. Story, B. H. (2019). History of speech synthesis. In The Routledge Handbook of Phonetics. Taylor & Francis.
12. Napolitano, D. (2023). The Shaping of a Standard Voice: Sonic and Sociotechnical Imaginaries in Smart Speakers. Im@go. A Journal of the Social Imaginary.
13. Hunt, A. J., & Black, A. W. (1996). Unit selection in a concatenative speech synthesis system using a large speech database. 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing.
14. Zen, H., Tokuda, K., & Black, A. W. (2009). Statistical parametric speech synthesis. Speech Communication.
15. van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., et al. (2016). WaveNet: A Generative Model for Raw Audio. arXiv preprint arXiv:1609.03499.
16. Wang, Y., Skerry-Ryan, R. J., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., et al. (2017). Tacotron: Towards End-to-End Speech Synthesis. Interspeech 2017.
17. Wang, C., Chen, S., Wu, Y., Zhang, Z., Zhou, L., Liu, S., et al. (2023). Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers. arXiv preprint arXiv:2301.02111.