A brief history of speech recognition: from Audrey to transformers

In 1952, Bell Labs engineers Kingsbury H. Davis, Rulon S. Biddulph, and Stephen Balashek built Audrey, short for Automatic Digit Recognizer: a relay rack the height of a person, expensive in vacuum tubes, that could identify the ten spoken digits.^[5]^[4] Calibrated to a single speaker who paused between words, Audrey got 97 to 99 percent of digits right. Handed a new voice, it had to be adjusted before it worked at all.^[3] That dependence on one known voice was the wall the field would spend its next thirty years climbing.

Audrey and the template era (1952)

Audrey worked by comparison: reduce telephone-quality speech to a few acoustic measurements per digit, then match them against a stored pattern of one person saying that digit.^[3] There was no notion of language anywhere in the machine, no sense that "nine" and "five" are words rather than shapes. It was matching shapes.

Template matching got smarter through the 1960s and 1970s. Dynamic time warping, given its standard dynamic-programming formulation by Hiroaki Sakoe and Seibi Chiba in 1978, aligned an utterance against a stored template even when the two were spoken at different speeds.^[7] But every word still needed its own reference pattern, and connected speech refuses to announce where one word ends and the next begins. Templates could not scale to large vocabularies, and the limitation was widely recognized by the mid-1970s.
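The alignment idea is easier to see in code. What follows is a minimal Python sketch of dynamic time warping, not Sakoe and Chiba's optimized algorithm: it omits their slope constraints and adjustment window, and it assumes each frame is a single number compared by absolute difference. The name `dtw_distance` and the toy sequences are illustrative only.

```python
def dtw_distance(a, b):
    """Cost of the cheapest monotonic alignment between sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of the first i frames of a
    # with the first j frames of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed frame distance
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats a frame
                cost[i][j - 1],      # b advances, a repeats a frame
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


template = [1.0, 3.0, 4.0, 3.0, 1.0]
slow = [1.0, 1.0, 3.0, 3.0, 4.0, 4.0, 3.0, 3.0, 1.0, 1.0]  # same shape, half speed
other = [4.0, 1.0, 1.0, 4.0, 4.0]                          # a different shape

print(dtw_distance(template, slow))   # 0.0
print(dtw_distance(template, other))  # greater than 0
```

The half-speed utterance matches the template at zero cost because each template frame is allowed to absorb two input frames. A recognizer built this way would compute one such distance per stored word and pick the smallest, which is why the approach grows expensive as the vocabulary grows.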

DARPA, Harpy, and connected speech (1971 to 1976)

In 1971, the U.S. Advanced Research Projects Agency (ARPA, renamed DARPA in 1972) began the five-year Speech Understanding Research program. Its principal technical goal was connected-speech recognition with a vocabulary of approximately 1,000 words. Harpy, developed at Carnegie Mellon University by Bruce Lowerre within Raj Reddy's research group, met that goal in 1976.^[6]

Harpy recognized connected speech from a vocabulary of 1,011 words. It compiled every sentence its grammar permitted, with pronunciations, into one large search graph, then hunted for the path that best matched the acoustic input. Beam search kept the hunt affordable by retaining only the strongest partial hypotheses at each step. Harpy needed expensive hardware and a constrained grammar, but it was the proof that connected speech at real vocabulary sizes could be done.^[6]
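Beam pruning can be illustrated with a toy example. The sketch below is not Harpy's code: the four-word graph, the "heard" frames, and the scoring function `toy_score` are hypothetical stand-ins chosen so the example runs on its own. Only the pruning step, which keeps the best `beam_width` partial paths after each frame, reflects the technique described above.

```python
def toy_score(state, frame):
    """Hypothetical acoustic score: minus the number of mismatched letters."""
    mismatches = sum(x != y for x, y in zip(state, frame))
    return -mismatches - abs(len(state) - len(frame))


def beam_search(graph, start, frames, score, beam_width):
    """Return the best (score, path) found while pruning to beam_width paths."""
    beam = [(0.0, [start])]  # each entry is (total score, path so far)
    for frame in frames:
        candidates = []
        for total, path in beam:
            for nxt in graph[path[-1]]:
                candidates.append((total + score(nxt, frame), path + [nxt]))
        # Pruning step: keep only the strongest partial hypotheses.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_width]
    return beam[0]


# Toy grammar compiled into a graph: every permitted sentence is a path.
graph = {
    "<s>": ["call", "tall"],
    "call": ["home", "hope"],
    "tall": ["home"],
    "home": [],
    "hope": [],
}
frames = ["call", "hume"]  # pretend acoustic input, one noisy token per word

best_score, best_path = beam_search(graph, "<s>", frames, toy_score, beam_width=2)
print(best_path)  # ['<s>', 'call', 'home']
```

A narrow beam is what makes the search affordable, and it is also a gamble: a path that scores poorly early on is discarded even if it would have recovered later.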

Statistical speech recognition (late 1970s to 1988)

The deepest break in this history was a change of philosophy rather than a new machine. Through the 1970s, most researchers treated speech recognition as an expert-systems problem: encode what linguists know about phonetics and grammar into rules. A group at IBM, led by Frederick Jelinek starting in 1972, treated recognition as a statistical decoding problem instead. Given the acoustic evidence, find the word sequence that is most probable, with the probabilities estimated from data.
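That decoding problem is conventionally written with Bayes' rule. The symbols here are notation assumed for illustration, not taken from the sources cited in this article: A stands for the acoustic evidence, W for a candidate word sequence, and Ŵ for the recognizer's output.

```latex
\hat{W} = \arg\max_{W} P(W \mid A)
        = \arg\max_{W} \frac{P(A \mid W)\, P(W)}{P(A)}
        = \arg\max_{W} P(A \mid W)\, P(W)
```

The denominator P(A) drops out because it does not depend on W. What remains are two quantities that can each be estimated from data: P(A | W), usually called the acoustic model, and P(W), the language model.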

The principal model was the hidden Markov model (HMM), which represents an observed acoustic sequence as the output of unobserved states such as phonetic units. James Baker introduced an HMM-based approach at Carnegie Mellon University in the early 1970s, and research groups at IBM and Bell Labs developed related statistical systems during the following decade.^[9] Combined with n-gram language models, HMM systems estimated acoustic and word-sequence probabilities from recorded data rather than relying entirely on hand-written rules.

In 1988, Kai-Fu Lee, working with Raj Reddy and Roberto Bisiani at Carnegie Mellon University, developed Sphinx, a real-time, speaker-independent continuous-speech recognizer. Sphinx showed that HMM-based recognition could handle previously unseen speakers and a vocabulary of approximately 1,000 words on the computing hardware then available.^[9]

Timeline: Speech recognition, the turning points
1952: Audrey at Bell Labs; isolated digits, one speaker
1976: Harpy at CMU; connected speech, 1,011 words
1988: Sphinx (Kai-Fu Lee); speaker-independent HMMs
2009: Deep nets on acoustics; error rates drop
2014: End-to-end (CTC, attention); drop the pipeline
2017: Transformers; attention is all you need
2022: Whisper; weak supervision at scale

The decisive episodes: each one replaced the method beneath it rather than refining it.

Incremental progress (1990s and 2000s)

Here the story slows down. HMMs plus n-gram language models plus a feature representation called MFCCs (mel-frequency cepstral coefficients, a compact representation of the audio spectrum tuned to human hearing) became the standard recipe, and for about fifteen years the field improved it at the margins. Consumer products shipped: Dragon NaturallySpeaking arrived in 1997 as the first general-purpose continuous dictation software for ordinary PCs, and each user trained it by reading to it for an hour.

Accuracy climbed slowly and then stalled. On hard benchmarks like conversational telephone speech, word error rates sat stubbornly high through the 2000s. The recipe was mature and the gains were small, and a reasonable observer in 2008 might have concluded that speech recognition was a solved-enough engineering problem with a hard ceiling. Within a year that conclusion would look badly mistaken.

Deep neural networks (2009 to 2012)

Around 2009, researchers including George Dahl, Abdel-rahman Mohamed, and Geoffrey Hinton at the University of Toronto, working with groups at Microsoft and IBM, swapped out the part of the HMM system that scores acoustics and put a deep neural network in its place. The sequence machinery stayed and only the scoring changed, yet word error rates fell by relative margins large enough that every major lab switched within about three years.^[10] The 2012 paper reporting the results was signed jointly by four rival groups, Microsoft, Google, IBM, and Toronto, which tells you how quickly the argument was over. "How speech-to-text works" describes what an acoustic model actually scores.

End-to-end recognition (2006 to 2017)

The deep-net hybrid still carried the old machinery: separate acoustic model, pronunciation dictionary, language model, all stitched together. The next idea was to drop those stages and train one model directly.^[1]

In 2006, Alex Graves and coauthors introduced connectionist temporal classification (CTC), a training objective that removes the need for a frame-level alignment between an input sequence and its labels.^[8] Baidu's Deep Speech later demonstrated a large CTC-based recognizer trained on varied speech data.^[12] In parallel, attention-based encoder-decoder systems such as Listen, Attend and Spell learned the recognition components jointly and emitted characters directly from acoustic input.^[13]
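The reason CTC needs no frame-level alignment is its output rule: the network emits one label per frame, including a special blank, and many different frame sequences collapse to the same text. The sketch below shows only that collapsing rule, not the training objective, which sums probability over all frame sequences that collapse to the target. The blank symbol `_` and the name `ctc_collapse` are choices made for this example.

```python
from itertools import groupby

BLANK = "_"  # assumed blank symbol for this example


def ctc_collapse(frame_labels, blank=BLANK):
    """Merge consecutive repeated labels, then remove blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != blank)


print(ctc_collapse("hh_e_ll_lo__"))  # hello
print(ctc_collapse("hheelllloo"))    # helo
```

The second call shows why the blank exists: without a blank between them, the two l's of "hello" merge into one.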

In 2017, Ashish Vaswani and coauthors introduced the transformer, an architecture based on attention rather than recurrent layers.^[14] The paper addressed machine translation, but transformer encoders and decoders were soon adopted for speech recognition because they can model relationships across long sequences.
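For reference, the core operation of that architecture is scaled dot-product attention, as defined in the cited paper. Q, K, and V are matrices of queries, keys, and values derived from the input sequence, and d_k is the dimension of the keys.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

Every position's query is compared with every other position's key in a single matrix product, so two distant frames of audio are related directly rather than through a chain of recurrent steps.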

Classical pipeline: Audio → Acoustic model → Pronunciation dictionary → Language model → Text
End-to-end: Audio → One end-to-end network → Text

Classical speech recognition uses separate acoustic, pronunciation, and language models. End-to-end recognition trains a single network to map audio to text.

Weak supervision at scale (2022)

In 2022, OpenAI released Whisper, a transformer model trained on approximately 680,000 hours of multilingual and multitask audio collected from the web with weak supervision.^[2] The architecture was five years old; the training scale was the news. That scale bought tolerance for accents and recording conditions that break models trained on clean corpora, and it also put a new failure mode on the map, ASR hallucinations: fluent output with little or no support in the input audio.^[15]

Today, multilingual neural models transcribe open conversation across dozens of languages in real time. The numbers people argue about now, measured with word error rate and its discontents, would have been science fiction to the team standing in front of Audrey in 1952.
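Word error rate is conventionally the number of word substitutions, deletions, and insertions needed to turn the recognizer's output into the reference transcript, divided by the number of reference words. A minimal sketch follows, assuming whitespace tokenization, no text normalization, and a non-empty reference. The function name and example sentences are illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the reference words consumed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                         # deletion
                curr[j - 1] + 1,                     # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution (home -> hope) and one deletion (please): 2 errors / 4 words.
print(word_error_rate("call home now please", "call hope now"))  # 0.5
```

Because insertions count as errors, the rate can exceed 1.0 when the output is much longer than the reference, which is one way hallucinated text shows up in the metric.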

Common questions

Who invented speech recognition?

There is no single inventor. The first working digit recognizer was Audrey, built by Kingsbury H. Davis, Rulon S. Biddulph, and Stephen Balashek at Bell Labs in 1952.^[3]^[4] Frederick Jelinek's group at IBM and James Baker's work at Carnegie Mellon University established important statistical methods during the 1970s. Deep neural acoustic models produced major accuracy improvements around 2009 to 2012.^[10]

What was the first speech recognition system?

Audrey, built in 1952, is generally identified as the first automatic spoken-digit recognizer. It matched acoustic measurements from isolated digits against stored reference patterns and required calibration for each speaker.^[3] It could not recognize connected speech or an unrestricted vocabulary.

When did speech recognition get good?

"Good" depends on the task. Speaker-independent continuous recognition arrived with Sphinx in 1988. The accuracy that powers today's products came from the deep-learning shift around 2009 to 2012 and the end-to-end and transformer models that followed. For the underlying mechanics, see "What is speech recognition".

Did Whisper invent modern speech recognition?

No. Whisper is a transformer-based system trained on a large weakly labeled dataset.^[2] The transformer architecture appeared in 2017,^[14] while CTC and attention-based recognizers established earlier forms of end-to-end speech recognition.^[11]^[13] Whisper's main contribution was training scale across multilingual and multitask data.

Why did rule-based speech recognition lose to statistics?

Hand-written phonetic and grammatical rules could not cover the variation in real speech: accents, speed, coarticulation, noise. Jelinek's IBM group showed that estimating probabilities from data, rather than encoding expert rules, produced lower error rates, and the gap only widened as data and computing grew.

References

[1] Prabhavalkar, R., Hori, T., Sainath, T. N., Schlüter, R., & Watanabe, S. (2023). End-to-End Speech Recognition: A Survey. arXiv preprint arXiv:2303.03329.
[2] Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust Speech Recognition via Large-Scale Weak Supervision. arXiv preprint arXiv:2212.04356.
[3] Davis, K. H., Biddulph, R., & Balashek, S. (1952). Automatic Recognition of Spoken Digits. Journal of the Acoustical Society of America, 24(6), 637–642.
[4] Li, X., & Mills, M. (2019). Vocal Features: From Voice Identification to Speech Recognition by Machine. Technology and Culture, 60(2S), S129–S160.
[5] Massachusetts Institute of Technology News Office (1952). Conference on Speech Analysis. MIT Institute Archives and Special Collections.
[6] Lowerre, B. T. (1976). The Harpy Speech Recognition System. PhD thesis, Carnegie Mellon University.
[7] Sakoe, H., & Chiba, S. (1978). Dynamic Programming Algorithm Optimization for Spoken Word Recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 26(1), 43–49.
[8] Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. (2006). Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. Proceedings of the 23rd International Conference on Machine Learning (ICML), 369–376.
[9] Singh, R. (2003). A History of the Sphinx Speech Recognition Systems. Carnegie Mellon University.
[10] Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., et al. (2012). Deep Neural Networks for Acoustic Modeling in Speech Recognition. IEEE Signal Processing Magazine, 29(6), 82–97.
[11] Graves, A., Mohamed, A.-r., & Hinton, G. (2013). Speech Recognition with Deep Recurrent Neural Networks. arXiv preprint arXiv:1303.5778.
[12] Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., et al. (2014). Deep Speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567.
[13] Chan, W., Jaitly, N., Le, Q. V., & Vinyals, O. (2015). Listen, Attend and Spell. arXiv preprint arXiv:1508.01211.
[14] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., et al. (2017). Attention Is All You Need. arXiv preprint arXiv:1706.03762.
[15] Koenecke, A., Choi, A. S. G., Mei, K. X., Schellmann, H., & Sloane, M. (2024). Careless Whisper: Speech-to-Text Hallucination Harms. arXiv preprint arXiv:2402.08021.