In 1952, three engineers at Bell Labs built a machine they called Audrey. It filled a relay rack roughly the height of a person, cost a small fortune in vacuum tubes, and could recognize exactly one thing: the spoken digits zero through nine, from a single speaker who paused between each word.[1] Seventy years later your phone transcribes a stranger's voicemail in a noisy car. This page traces how the field crossed that distance.
Digital representation of speech
Sound is pressure variation in air. A microphone measures that pressure many thousands of times per second and stores each measurement as a number. Phone-quality audio is sampled at 8,000 samples per second; most modern automatic speech recognition (ASR) systems work at 16,000.[6][9] So one second of speech is a list of 16,000 numbers, a waveform. Plotted, it looks like a jagged line: tall where the speech is loud, flat during silence.
The raw waveform is a terrible thing to feed a recognizer directly. It is high-dimensional, and the same word spoken twice never produces the same numbers. The trick, borrowed from the human ear, is to ask not "what is the pressure right now" but "which frequencies are present right now." You slide a short window (around 25 milliseconds) across the audio, step it forward in small hops (around 10 milliseconds), and run a Fourier transform on each window.[7] The result is a spectrogram: a picture with time on one axis, frequency on the other, and brightness showing how much energy sits at each frequency at each moment.
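The windowing described above can be sketched in a few lines. This is a minimal illustration rather than production code: it uses only the Python standard library and a naive Fourier transform so that every step is visible, and the Hann taper, the function names, and the synthetic 1 kHz test tone are all assumptions made for the example.

```python
import cmath
import math

SAMPLE_RATE = 16_000                 # samples per second
WINDOW = SAMPLE_RATE * 25 // 1000    # 25 ms -> 400 samples
HOP = SAMPLE_RATE * 10 // 1000       # 10 ms -> 160 samples


def frames(samples, window=WINDOW, hop=HOP):
    """Yield overlapping windows; a trailing partial window is dropped."""
    for start in range(0, len(samples) - window + 1, hop):
        yield samples[start:start + window]


def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0..N/2 for one Hann-tapered frame (naive O(N^2) DFT)."""
    n = len(frame)
    tapered = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
               for i, x in enumerate(frame)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(tapered)))
            for k in range(n // 2 + 1)]


def spectrogram(samples):
    """One row per frame (time), one column per frequency bin."""
    return [magnitude_spectrum(f) for f in frames(samples)]


# Illustrative input: 0.1 s of a pure 1 kHz tone.
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE)
        for t in range(SAMPLE_RATE // 10)]
spec = spectrogram(tone)

first = spec[0]
peak_bin = max(range(len(first)), key=first.__getitem__)
peak_hz = peak_bin * SAMPLE_RATE / WINDOW          # bins are 40 Hz apart
frames_per_second = (SAMPLE_RATE - WINDOW) // HOP + 1

print(len(spec), "frames,", len(first), "bins, peak at", peak_hz, "Hz")
print(frames_per_second, "frames in one second of audio")
```

Real systems replace the naive transform with a fast Fourier transform from a numerical library, but the framing arithmetic is the same.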
A spectrogram is where speech starts to look legible. Vowels show up as horizontal bands called formants; an /s/ shows up as a hiss of high-frequency energy; a stop consonant like /t/ shows up as a brief silence followed by a burst. Classical systems compressed each spectrogram frame further into about 13 numbers called MFCCs (mel-frequency cepstral coefficients), tuned so the spacing matched human hearing, which is finer at low frequencies than high.[8] Modern neural systems often skip MFCCs and learn their own features straight from the spectrogram, but the framing is the same: turn one second of audio into roughly 100 short feature vectors, then recognize words from those.
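To see what "finer at low frequencies than high" means in numbers, the sketch below places band edges evenly on a mel scale and converts them back to hertz. The conversion formula is one widely used convention and is an assumption here, since the text does not specify one; the function names and the choice of 10 bands up to 8,000 Hz are also illustrative.

```python
import math


def hz_to_mel(hz):
    """Hertz to mels, using a common convention (an assumption, not from the text)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10 ** (mel / 2595.0) - 1.0)


def mel_band_edges(n_bands, low_hz=0.0, high_hz=8000.0):
    """Edges of n_bands bands that are evenly spaced in mels."""
    lo, hi = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 2)]


edges = mel_band_edges(10)
widths = [b - a for a, b in zip(edges, edges[1:])]

# Equal steps in mels become wider and wider steps in hertz.
for a, b in zip(edges, edges[1:]):
    print(f"{a:8.1f} Hz to {b:8.1f} Hz")
```

Each successive band covers more hertz than the one before it, so the low end of the spectrum gets more bands per hertz than the high end.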
Acoustic and language modeling
Every speech recognizer, old or new, balances two kinds of knowledge. The acoustic model answers "given this slice of audio, which speech sounds does it most likely contain?" The language model answers "given the words so far, which word probably comes next?" You need both, because acoustics alone are ambiguous. "Recognize speech" and "wreck a nice beach" produce nearly identical audio. The acoustic model cannot reliably separate them; the language model knows which one people say.[13]
The decoder searches. It considers many possible word sequences, scores each by combining the acoustic evidence and the language plausibility, and returns the best one. In a classical system these were separate components you could swap out, so a good language model could rescue a mediocre acoustic model and vice versa. This separation is also why a system trained on news broadcasts stumbles on medical dictation: the acoustics are fine, but the language model has never seen "metoprolol" and ranks it as noise.
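The scoring step can be shown with a toy example. This is a sketch under stated assumptions: the two hypotheses come from the text, but the log-probability values, the `lm_weight` parameter, and the `decode` function are invented for illustration, and a real decoder searches a far larger space than two candidates.

```python
# Hypothetical log-probabilities; the numbers are made up for illustration.
acoustic_logp = {
    "recognize speech": -12.1,
    "wreck a nice beach": -11.9,   # acoustically a hair more likely
}
language_logp = {
    "recognize speech": -6.0,      # a phrase people actually say
    "wreck a nice beach": -14.5,
}


def decode(hypotheses, acoustic, language, lm_weight=1.0):
    """Return the hypothesis with the best combined log score."""
    return max(hypotheses, key=lambda h: acoustic[h] + lm_weight * language[h])


hypotheses = list(acoustic_logp)
acoustics_only = decode(hypotheses, acoustic_logp, language_logp, lm_weight=0.0)
combined = decode(hypotheses, acoustic_logp, language_logp, lm_weight=1.0)

print("acoustics only:", acoustics_only)
print("with language model:", combined)
```

With the language model switched off, the slightly better acoustic score wins; with it switched on, the plausible phrase does. Swapping in a different `language_logp` table is the toy equivalent of swapping language models between domains.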
Classical and end-to-end architectures
For about three decades, roughly 1980 to 2010, the dominant design was the HMM-GMM pipeline. A hidden Markov model represented each word as a chain of states (rough stand-ins for the sounds inside it), and a Gaussian mixture model scored how well each audio frame matched each state. It worked, but it was a tower of hand-built parts: a pronunciation dictionary mapping words to phonemes, a separately trained language model, an acoustic model, and a decoder stitching them together. Each part was trained on its own objective, so improving the whole thing meant improving pieces and hoping they composed.
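A stripped-down version of the idea fits in a short sketch. It makes several simplifying assumptions that are not in the text: each frame is a single number instead of a feature vector, each state is scored by one Gaussian instead of a mixture, the stay and advance probabilities are both fixed at 0.5, and the observation values are invented. The Viterbi search then finds the best way to assign frames to a left-to-right chain of states.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian: how well a frame matches a state."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def viterbi_align(observations, states, log_stay=math.log(0.5), log_next=math.log(0.5)):
    """Best state for each frame in a left-to-right chain.

    The path must start in the first state and end in the last one,
    so it needs at least as many frames as states.
    """
    n, s = len(observations), len(states)
    neg_inf = float("-inf")
    score = [[neg_inf] * s for _ in range(n)]
    back = [[0] * s for _ in range(n)]
    score[0][0] = log_gauss(observations[0], *states[0])
    for t in range(1, n):
        for k in range(s):
            stay = score[t - 1][k] + log_stay
            move = score[t - 1][k - 1] + log_next if k > 0 else neg_inf
            best, back[t][k] = (stay, k) if stay >= move else (move, k - 1)
            score[t][k] = best + log_gauss(observations[t], *states[k])
    # Trace back from the final state to recover the alignment.
    path = [s - 1]
    for t in range(n - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Three states as (mean, variance); seven made-up one-dimensional frames.
states = [(0.0, 1.0), (3.0, 1.0), (6.0, 1.0)]
observations = [0.1, -0.2, 0.0, 2.9, 3.1, 6.2, 5.8]
path = viterbi_align(observations, states)
print(path)
```

The returned list says which state each frame was assigned to, which is the frame-to-sound alignment that the classical pipeline had to compute explicitly.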
The break came in two waves. Around 2009 to 2012, researchers replaced the Gaussian mixtures with deep neural networks, and word error rates on hard benchmarks dropped by a fifth or more, the largest jump the field had seen in years.[4] Then the architecture changed. CTC (connectionist temporal classification, introduced by Graves and colleagues in 2006 and widely adopted after 2014) let a single network map audio frames directly to characters without anyone pre-aligning which frame belonged to which sound.[3] Attention and the transformer architecture (2017) went further, letting the model decide which parts of the audio matter for the next output token.[5]
These are called end-to-end models because one network learns the whole audio-to-text mapping, trained on one objective. The pronunciation dictionary, the separate acoustic and language models, the alignment bookkeeping: much of it dissolved into weights learned from data. You give the network audio and the correct text, and it figures out the rest. The cost is that you need a great deal of transcribed audio, and the model is harder to inspect when it goes wrong.
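The reason CTC needs no frame-by-frame alignment is easiest to see from its output rule: the network emits one label per frame, including a special blank, and the final text is obtained by merging repeated labels and then dropping the blanks. The sketch below shows only that collapsing rule plus a greedy per-frame choice, not the training loss; the `_` blank symbol, the function names, and the probability table are assumptions made for the example.

```python
from itertools import groupby

BLANK = "_"  # stand-in for CTC's blank symbol


def ctc_collapse(frame_labels, blank=BLANK):
    """Merge runs of repeated labels, then drop blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != blank)


def greedy_decode(frame_probs, alphabet):
    """Pick the most probable label in each frame, then collapse."""
    best = [alphabet[max(range(len(p)), key=p.__getitem__)] for p in frame_probs]
    return ctc_collapse(best)


with_blank = ctc_collapse("hh_e_ll_lo__")
without_blank = ctc_collapse("hheelllloo")
print(with_blank)
print(without_blank)

alphabet = ["_", "a", "b"]
frame_probs = [
    [0.10, 0.80, 0.10],   # most likely "a"
    [0.20, 0.70, 0.10],   # most likely "a" again
    [0.90, 0.05, 0.05],   # most likely blank
    [0.10, 0.10, 0.80],   # most likely "b"
]
decoded = greedy_decode(frame_probs, alphabet)
print(decoded)
```

The second call shows why the blank exists: without a blank between them, the two l's of a doubled letter merge into one.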
Streaming and batch recognition
There are two ways to run a recognizer. In batch (or async) mode you hand it a complete recording and wait for the transcript. The model can look at the whole utterance, including audio that comes after a given word, before deciding what that word was. Hearing the end of a sentence resolves ambiguity, so this is the easy case, where accuracy is highest. See real-time vs async transcription for when each mode fits.
In streaming mode the audio arrives a little at a time and you must emit words almost immediately, before the speaker has finished. The model cannot peek at future audio because it does not exist yet, and it has to commit: an attention model that wants to attend to the whole utterance has to be redesigned to attend only to what it has heard so far. This is why streaming systems often show a guess and then revise it as more audio arrives, which is the distinction between partial and final results. Building a recognizer that is both fast and accurate under these constraints is most of the engineering in streaming speech recognition, and it is what powers responsive systems like a voice agent.
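One common way to restrict attention to audio already heard is a causal mask: when computing attention for frame i, every frame after i is excluded before the softmax. The sketch below is a simplified illustration, assuming self-attention over four audio frames with equal raw scores so that the effect of the mask is easy to read; the function name and the score values are invented, and real streaming models often allow a small bounded look-ahead instead of none.

```python
import math


def attention_weights(scores, causal):
    """Softmax each row of scores; row i is frame i, column j is the frame attended to.

    With causal=True, frame i may only attend to frames 0..i.
    """
    weights = []
    for i, row in enumerate(scores):
        allowed = [j for j in range(len(row)) if not causal or j <= i]
        top = max(row[j] for j in allowed)
        exps = {j: math.exp(row[j] - top) for j in allowed}
        total = sum(exps.values())
        weights.append([exps.get(j, 0.0) / total for j in range(len(row))])
    return weights


# Four frames with equal raw scores, so only the mask shapes the result.
scores = [[0.0] * 4 for _ in range(4)]
batch = attention_weights(scores, causal=False)
streaming = attention_weights(scores, causal=True)

for row in batch:
    print("batch    ", row)
for row in streaming:
    print("streaming", row)
```

In the batch case every frame spreads its attention across the whole utterance. In the streaming case the first frame can only attend to itself, and each later frame sees one more frame than the one before.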
Common questions
Is speech recognition the same as voice recognition?
No, though the terms get mixed up constantly. Speech recognition figures out what was said (the words). Voice recognition usually means speaker recognition: figuring out who is speaking, for identity or verification.[11] Telling apart multiple speakers in one recording is speaker diarization.[12]
How is accuracy measured?
The standard metric is word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the system's output into a human reference transcript, divided by the number of words in the reference.[10] Lower is better. A WER of 5 percent means roughly one word in twenty is off. Always check what audio a reported number was measured on, since clean read speech and noisy conversation are very different difficulties.
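Word error rate is a word-level edit distance divided by the length of the reference. The sketch below is a minimal version that splits on whitespace and compares words exactly; real evaluations usually normalize case, punctuation, and number formatting first, which this example does not attempt. The function name and the sample sentences are illustrative.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "book a flight to denver"
hypothesis = "book flight to denver now"
wer = word_error_rate(reference, hypothesis)
print(f"WER = {wer:.0%}")
```

Here the system dropped "a" and added "now", so two errors against five reference words give 40 percent. Because insertions count as errors but do not lengthen the reference, the rate can exceed 100 percent on very poor output.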
Does speech recognition understand what I mean?
No. ASR produces text, not meaning. Recognizing the words "book a flight to Denver" is separate from understanding the intent and acting on it. Voice assistants chain ASR to language understanding and dialogue components downstream.
Why does it still make mistakes on names and jargon?
Names, product terms, and specialized vocabulary appear rarely in training data, so the model's language knowledge is weak there, and many are acoustically close to common words. Most production systems let you supply a custom vocabulary or context to bias recognition toward the terms you expect.[14]
Related concepts
- How speech-to-text works
- Streaming speech recognition
- Real-time vs async transcription
- Word error rate
- Speaker diarization
- History of speech recognition
References
[1] Davis, K. H., Biddulph, R., & Balashek, S. (1952). Automatic Recognition of Spoken Digits. Journal of the Acoustical Society of America, 24(6), 637–642.
[2] Lowerre, B. T. (1976). The HARPY Speech Recognition System. PhD Dissertation, Carnegie Mellon University.
[3] Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. (2006). Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. Proceedings of the 23rd International Conference on Machine Learning (ICML), 369–376.
[4] Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A., Jaitly, N., et al. (2012). Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Processing Magazine, 29(6), 82–97.
[5] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems (NIPS), 30.
[6] Picone, J. (1993). Signal Modeling Techniques in Speech Recognition. Proceedings of the IEEE, 81(9), 1215–1247.
[7] Huang, X., Acero, A., & Hon, H. W. (2001). Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice Hall.
[8] Davis, S., & Mermelstein, P. (1980). Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(4), 357–366.
[9] Li, J., Deng, L., Haeb-Umbach, R., & Gong, Y. (2014). Robust Automatic Speech Recognition: A Bridge to Practical Applications. Academic Press.
[10] Hunt, M. J. (1990). Figures of Merit for Assessing Connected-Word Recognisers. Speech Communication, 9(4), 329–336.
[11] Campbell, J. P. (1997). Speaker Recognition: A Tutorial. Proceedings of the IEEE, 85(9), 1437–1462.
[12] Anguera, X., Bozonnet, S., Evans, N., Fredouille, C., Friedland, G., & Vinyals, O. (2012). Speaker Diarization: A Review of Recent Research. IEEE Transactions on Audio, Speech, and Language Processing, 20(2), 356–370.
[13] Reddy, D. R. (1976). Speech Recognition by Machine: A Review. Proceedings of the IEEE, 64(4), 501–531.
[14] Soniox (2026). Soniox Speech-to-Text documentation. Soniox.