A brief history of speech recognition

From Audrey to transformers, seventy years in

Updated June 14, 2026

In 1952, Bell Labs engineers Kingsbury H. Davis, Rulon S. Biddulph, and Stephen Balashek built Audrey, short for Automatic Digit Recognizer.[5][4] After calibration to one person's voice, the analog machine could identify the ten spoken digits, zero through nine, when they were pronounced separately. Audrey divided telephone-quality speech into frequency bands, extracted acoustic measurements, and compared them with stored digit patterns. The original evaluation reported 97–99% accuracy for a single speaker, but the system required adjustment before it could recognize another speaker.[3] This dependence on a known voice remained a central limitation of early speech recognizers.

Audrey and the template era (1952)

Audrey established the basic shape of the problem: take an acoustic signal, reduce it to a few features, compare those features to references. The reference here was a stored pattern of one person saying one digit. There was no notion of language, no grammar, no concept that "nine" and "five" are words in a system. The machine was matching shapes.

Through the 1960s and 1970s, template-matching methods became more sophisticated. Dynamic time warping aligned an utterance with a stored template even when the two were spoken at different speeds. Hiroaki Sakoe and Seibi Chiba published the standard dynamic-programming formulation for spoken-word recognition in 1978.[7] The method remained difficult to scale because each word required its own reference pattern, and connected speech did not provide clear boundaries between words.
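The alignment idea can be illustrated with a minimal Python sketch. It uses the basic unconstrained dynamic-programming recurrence on one-dimensional feature values; it is not the Sakoe and Chiba formulation itself, which adds slope constraints and weighting, and the function name and sample sequences are made up for illustration.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # frame-to-frame mismatch
            # Extend the cheapest of: stretch a, stretch b, or advance both.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

# Hypothetical feature tracks: a stored template, the same pattern spoken
# at half speed, and a different pattern of the same length.
template = [1.0, 3.0, 4.0, 2.0]
slow = [1.0, 1.0, 3.0, 3.0, 4.0, 4.0, 2.0, 2.0]
other = [4.0, 2.0, 1.0, 3.0]

print(dtw_distance(template, slow))   # the slow version warps onto the template
print(dtw_distance(template, other))  # a different pattern leaves residual cost
```

The slowed-down sequence aligns with the template at zero cost, while a plain frame-by-frame comparison would not even be defined for sequences of different lengths.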

DARPA, Harpy, and connected speech (1971–1976)

In 1971, the U.S. Defense Advanced Research Projects Agency began the five-year Speech Understanding Research program. Its principal technical goal was connected-speech recognition with a vocabulary of approximately 1,000 words. Harpy, developed at Carnegie Mellon University by Bruce Lowerre within Raj Reddy's research group, met that goal in 1976.[6]

Harpy recognized connected speech from a vocabulary of 1,011 words. It represented permitted sentences, pronunciations, and grammar as a search graph, then searched for the path that best matched the acoustic input. Beam search reduced computation by retaining only the strongest partial hypotheses at each step. Although Harpy required expensive hardware and a constrained grammar, it demonstrated practical connected-speech recognition at a substantially larger vocabulary.[6]
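A small Python sketch shows the pruning idea. This is not Harpy's implementation: the word graph, the per-step scores, and the fixed-width pruning rule are all assumptions chosen to keep the example short.

```python
def beam_search(graph, start, score_fn, n_steps, beam_width):
    """Keep only the `beam_width` best partial paths after each step."""
    beam = [(0.0, [start])]  # (cumulative log score, path so far)
    for t in range(n_steps):
        candidates = []
        for score, path in beam:
            # Only successors permitted by the graph are considered.
            for nxt in graph.get(path[-1], []):
                candidates.append((score + score_fn(t, nxt), path + [nxt]))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_width]  # prune the weaker hypotheses
    return beam[0]

# Hypothetical grammar graph and per-step acoustic log scores.
graph = {
    "<s>": ["call", "tell"],
    "call": ["me", "him"],
    "tell": ["me", "him"],
    "me": ["now"],
    "him": ["now"],
}
acoustic = [
    {"call": -0.2, "tell": -1.8},
    {"me": -0.4, "him": -1.2},
    {"now": -0.1},
]

best_score, best_path = beam_search(
    graph, "<s>", lambda t, word: acoustic[t][word], n_steps=3, beam_width=2
)
print(best_path, best_score)
```

With a beam width of 2, the paths beginning with "tell" are discarded after the second step, so the final step expands two hypotheses instead of four. The trade-off is that a path pruned early can never be recovered, even if later evidence would have favored it.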

Statistical speech recognition (1970s–1988)

The deepest break in this history was a change of philosophy rather than a new machine. Through the 1970s, most researchers treated speech recognition as an expert-systems problem: encode what linguists know about phonetics and grammar into rules. A group at IBM, led by Frederick Jelinek starting in 1972, treated recognition as a statistical decoding problem instead. Given the acoustic evidence, find the word sequence that is most probable, with the probabilities estimated from data.
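This decoding rule is commonly written with Bayes' rule. The symbols below are notation introduced here for illustration (W for a candidate word sequence, A for the acoustic evidence), not taken from the original IBM papers.

```latex
\begin{aligned}
\hat{W} &= \arg\max_{W} P(W \mid A) \\
        &= \arg\max_{W} \frac{P(A \mid W)\, P(W)}{P(A)} \\
        &= \arg\max_{W} P(A \mid W)\, P(W)
\end{aligned}
```

The denominator P(A) does not depend on W, so it drops out of the maximization. What remains is the product of two terms that can be estimated separately from data: P(A | W), the acoustic model, and P(W), the language model.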

The principal model was the hidden Markov model (HMM), which represents an observed acoustic sequence as the output of unobserved states such as phonetic units. James Baker introduced an HMM-based approach at Carnegie Mellon University in the early 1970s, and research groups at IBM and Bell Labs developed related statistical systems during the following decade.[7] Combined with n-gram language models, HMM systems estimated acoustic and word-sequence probabilities from recorded data rather than relying entirely on hand-written rules.
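Finding the most likely hidden-state sequence in an HMM is usually done with the Viterbi algorithm. The sketch below is a minimal Python version under stated assumptions: the two phone-like states, the coarse "low" and "high" acoustic symbols, and every probability are invented for illustration and do not come from any historical system.

```python
import math

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return (log probability, state path) of the most likely hidden path."""
    # best[s] holds the best-scoring path that ends in state s so far.
    best = {s: (log_start[s] + log_emit[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        step = {}
        for s in states:
            # Choose the predecessor that gives the highest score into s.
            score, path = max(
                (best[p][0] + log_trans[p][s], best[p][1]) for p in states
            )
            step[s] = (score + log_emit[s][obs], path + [s])
        best = step
    return max(best.values())

def logs(table):
    """Convert a dict of probabilities to log probabilities."""
    return {key: math.log(value) for key, value in table.items()}

# Hypothetical two-state model with made-up probabilities.
states = ["n", "ay"]
log_start = logs({"n": 0.9, "ay": 0.1})
log_trans = {"n": logs({"n": 0.6, "ay": 0.4}), "ay": logs({"n": 0.1, "ay": 0.9})}
log_emit = {
    "n": logs({"low": 0.8, "high": 0.2}),
    "ay": logs({"low": 0.3, "high": 0.7}),
}

log_prob, path = viterbi(
    ["low", "low", "high", "high"], states, log_start, log_trans, log_emit
)
print(path, math.exp(log_prob))
```

For this toy input the decoder assigns the first two frames to "n" and the last two to "ay". Working in log probabilities avoids numerical underflow on long sequences, which matters once utterances run to hundreds of frames.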

In 1988, Kai-Fu Lee, working with Raj Reddy and Roberto Bisiani at Carnegie Mellon University, developed Sphinx, a real-time, speaker-independent continuous-speech recognizer. Sphinx showed that HMM-based recognition could handle previously unseen speakers and a vocabulary of approximately 1,000 words on the computing hardware then available.[7]

Speech recognition, the turning points

  1952: Audrey at Bell Labs (isolated digits, one speaker)
  1976: Harpy at CMU (connected speech, 1,011 words)
  1988: Sphinx, Kai-Fu Lee (speaker-independent HMMs)
  2009: Deep nets on acoustics (error rates drop)
  2014: End-to-end, CTC and attention (drop the pipeline)
  2017: Transformers ("Attention Is All You Need")
  2022: Whisper (weak supervision at scale)

The decisive episodes: each one replaced the method beneath it rather than refining it.

Incremental progress (1990s–2000s)

Here the story slows down. HMMs, n-gram language models, and features called MFCCs (mel-frequency cepstral coefficients, a compact representation of the audio spectrum tuned to human hearing) became the standard recipe, and for about fifteen years the field improved it at the margins. Consumer products shipped: Dragon NaturallySpeaking arrived in 1997 as the first general-purpose continuous dictation software for ordinary PCs, and each user trained it by reading to it for an hour.

Accuracy climbed slowly and then stalled. On hard benchmarks like conversational telephone speech, word error rates sat stubbornly high through the 2000s. The recipe was mature and the gains were small, and a reasonable observer in 2008 might have concluded that speech recognition was a solved-enough engineering problem with a hard ceiling. Within a year that conclusion would look badly mistaken.
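Word error rate, the metric referred to above, is conventionally the word-level edit distance between a reference transcript and the recognizer's output, divided by the number of reference words. A minimal Python sketch follows; the function name and sample sentences are illustrative, and real scoring tools also normalize case, punctuation, and numerals before comparing.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                          # deletion
                curr[j - 1] + 1,                      # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)

# One substitution ("nine" -> "five") and one deletion ("tonight").
print(word_error_rate("call me at nine tonight", "call me at five"))
```

Because insertions count as errors, the rate can exceed 100% when the output is much longer than the reference.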

Deep neural networks (2009–2012)

Around 2009, researchers including George Dahl, Abdel-rahman Mohamed, and Geoffrey Hinton at the University of Toronto, working with groups at Microsoft and IBM, replaced the part of the HMM system that scores acoustics, a Gaussian mixture model, with a deep neural network. The hybrid kept the HMM's sequence structure, and error rates dropped by relative margins large enough that every major lab adopted the approach within about three years.

A 2012 review by researchers from Microsoft, Google, IBM, and the University of Toronto reported large relative reductions in word error rate across several recognition tasks when deep neural networks replaced Gaussian mixture models for acoustic scoring.[8] The systems still used HMMs for temporal structure, but neural acoustic models quickly became the standard design. For the underlying processing stages, see "How speech-to-text works".

End-to-end recognition (2006–2017)

The deep-net hybrid still carried the old machinery: separate acoustic model, pronunciation dictionary, language model, all stitched together. The next idea was to drop those stages and train one model directly.

In 2006, Alex Graves and coauthors introduced connectionist temporal classification (CTC), a training objective that removes the need for a frame-level alignment between an input sequence and its labels. In 2013, Graves, Mohamed, and Hinton applied CTC-trained deep recurrent networks to speech recognition.[9] Baidu's Deep Speech later demonstrated a large CTC-based recognizer trained on varied speech data.[10] In parallel, attention-based encoder-decoder systems such as Listen, Attend and Spell learned the recognition components jointly and emitted characters directly from acoustic input.[11]
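CTC avoids frame-level alignment by letting the network emit one label per frame, including a special blank, and then collapsing that frame sequence into the final transcript. The Python sketch below shows only this collapse rule, not the training objective; the blank symbol `_` and the sample frame strings are assumptions made for illustration.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol; real systems use a reserved label index

def ctc_collapse(frame_labels):
    """Merge consecutive repeated labels, then remove blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != BLANK)

# Different frame-level outputs collapse to the same transcript.
print(ctc_collapse("__nn_ii_nn__ee"))
print(ctc_collapse("n_iinne"))

# A blank between repeated letters is what keeps a genuine double letter.
print(ctc_collapse("hel_lo"))
print(ctc_collapse("hello"))
```

Many frame sequences map to the same text, so training can sum over all of them instead of requiring a single hand-aligned one. The last two calls show why the blank is needed: without it, the two l's in "hello" merge into one.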

In 2017, Ashish Vaswani and coauthors introduced the transformer, an architecture based on attention rather than recurrent layers.[12] The paper addressed machine translation, but transformer encoders and decoders were soon adopted for speech recognition because they can model relationships across long sequences.

Classical pipeline: Audio -> Acoustic model -> Pronunciation dictionary -> Language model -> Text
End-to-end: Audio -> One end-to-end network -> Text

Classical speech recognition uses separate acoustic, pronunciation, and language models. End-to-end recognition trains a single network to map audio to text.

Weak supervision at scale (2022)

In 2022, OpenAI released Whisper, a transformer model trained on approximately 680,000 hours of multilingual and multitask audio collected from the web with weak supervision.[2] Training at this scale produced robust performance across languages, accents, and recording conditions. Whisper also drew attention to hallucinations in automatic speech recognition (ASR): fluent output with little or no support in the input audio.[13]

Today, multilingual neural models transcribe open conversation across dozens of languages in real time. The numbers people argue about now, measured with word error rate and its discontents, would have been science fiction to the team standing in front of Audrey in 1952.

Common questions

Who invented speech recognition?

There is no single inventor. The first working digit recognizer was Audrey, built by Kingsbury H. Davis, Rulon S. Biddulph, and Stephen Balashek at Bell Labs in 1952.[3][4] Frederick Jelinek's group at IBM and James Baker's work at Carnegie Mellon University established important statistical methods during the 1970s. Deep neural acoustic models produced major accuracy improvements around 2009–2012.[8]

What was the first speech recognition system?

Audrey, built in 1952, is generally identified as the first automatic spoken-digit recognizer. It matched acoustic measurements from isolated digits against stored reference patterns and required calibration for each speaker.[3] It could not recognize connected speech or an unrestricted vocabulary.

When did speech recognition get good?

"Good" depends on the task. Speaker-independent continuous recognition arrived with Sphinx in 1988. The accuracy that powers today's products came from the deep-learning shift around 2009–2012 and from the end-to-end and transformer models that followed from 2014 and 2017 onward. For the underlying mechanics, see "What is speech recognition".

Did Whisper invent modern speech recognition?

No. Whisper is a transformer-based system trained on a large weakly labeled dataset.[2] The transformer architecture appeared in 2017,[12] while CTC and attention-based recognizers had already established earlier forms of end-to-end speech recognition.[9][11] Whisper's main contribution was training scale across multilingual and multitask data.

Why did rule-based speech recognition lose to statistics?

Hand-written phonetic and grammatical rules could not cover the variation in real speech: accents, speed, coarticulation, noise. Jelinek's IBM group showed that estimating probabilities from data, rather than encoding expert rules, produced lower error rates, and the gap only widened as data and computing grew.

References

  1. Prabhavalkar, R., Hori, T., Sainath, T. N., Schlüter, R., & Watanabe, S. (2023). End-to-End Speech Recognition: A Survey. arXiv preprint arXiv:2303.03329.
  2. Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust Speech Recognition via Large-Scale Weak Supervision. arXiv preprint arXiv:2212.04356.
  3. Davis, K. H., Biddulph, R., & Balashek, S. (1952). Automatic Recognition of Spoken Digits. Journal of the Acoustical Society of America, 24(6), 637–642.
  4. Li, X., & Mills, M. (2019). Vocal Features: From Voice Identification to Speech Recognition by Machine. Technology and Culture, 60(2S), S129–S160.
  5. Massachusetts Institute of Technology News Office (1952). Conference on Speech Analysis. MIT Institute Archives and Special Collections.
  6. Lowerre, B. T. (1976). The Harpy Speech Recognition System. PhD thesis, Carnegie Mellon University.
  7. Singh, R. (2003). A History of the Sphinx Speech Recognition Systems. Carnegie Mellon University.
  8. Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., et al. (2012). Deep Neural Networks for Acoustic Modeling in Speech Recognition. IEEE Signal Processing Magazine, 29(6), 82–97.
  9. Graves, A., Mohamed, A.-r., & Hinton, G. (2013). Speech Recognition with Deep Recurrent Neural Networks. arXiv preprint arXiv:1303.5778.
  10. Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., et al. (2014). Deep Speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567.
  11. Chan, W., Jaitly, N., Le, Q. V., & Vinyals, O. (2015). Listen, Attend and Spell. arXiv preprint arXiv:1508.01211.
  12. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., et al. (2017). Attention Is All You Need. arXiv preprint arXiv:1706.03762.
  13. Koenecke, A., Choi, A. S. G., Mei, K. X., Schellmann, H., & Sloane, M. (2024). Careless Whisper: Speech-to-Text Hallucination Harms. arXiv preprint arXiv:2402.08021.