What is speech recognition? ASR explained from audio to text

Say "recognize speech" out loud, at a normal pace. Now say "wreck a nice beach." The two phrases produce nearly identical audio, and no amount of careful listening at the waveform level can reliably tell them apart. What separates them is knowledge about language: which of the two an English speaker would plausibly say.^[5] That is the field in miniature. Speech recognition is inference: the audio narrows the possibilities, and knowledge of the language picks among them.

From sound to evidence

A microphone measures air pressure thousands of times a second, so a second of speech reaches the recognizer as a list of numbers, 16,000 of them at the sample rate most systems use.^[1] The list itself is a poor thing to recognize from, because the same word spoken twice never produces the same numbers. Every system therefore converts it into a frequency picture first, a spectrogram, in which speech has visible structure a model can learn to read. The conversion, window by window, with the actual numbers at every stage, is the subject of how speech-to-text works.
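As a rough illustration of that conversion, the sketch below turns a synthetic tone into a small spectrogram using only the standard library. The 25 ms window and 10 ms step are assumptions chosen because they are common, not values taken from this article, and the naive transform here is far slower than the FFT a real system would use.

```python
import cmath
import math

SAMPLE_RATE = 16_000   # samples per second, the rate mentioned in the text
FRAME_LEN = 400        # 25 ms window (assumed, a common choice)
HOP_LEN = 160          # 10 ms step between windows (assumed)


def sine_wave(freq_hz, seconds):
    """Generate a pure tone as a list of floats in [-1, 1]."""
    n = int(SAMPLE_RATE * seconds)
    return [math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE) for t in range(n)]


def spectrogram(samples):
    """Return one row of magnitudes per frame, FRAME_LEN // 2 + 1 bins each."""
    # A Hann window tapers each frame so its edges add less spurious energy.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / (FRAME_LEN - 1))
              for i in range(FRAME_LEN)]
    n_bins = FRAME_LEN // 2 + 1
    # Precomputed complex exponentials for a naive discrete Fourier transform.
    twiddle = [cmath.exp(-2j * math.pi * k / FRAME_LEN) for k in range(FRAME_LEN)]
    rows = []
    for start in range(0, len(samples) - FRAME_LEN + 1, HOP_LEN):
        frame = [s * w for s, w in zip(samples[start:start + FRAME_LEN], window)]
        row = []
        for k in range(n_bins):
            total = sum(x * twiddle[(k * n) % FRAME_LEN] for n, x in enumerate(frame))
            row.append(abs(total))
        rows.append(row)
    return rows


tone = sine_wave(1000, 0.1)   # 0.1 s of a 1 kHz tone
spec = spectrogram(tone)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_LEN
print(len(tone), "samples ->", len(spec), "frames x", len(spec[0]), "bins")
print("strongest frequency in first frame:", peak_hz, "Hz")
```

With these settings each frequency bin spans 40 Hz (16,000 / 400), so a steady tone shows up as one bright row across the frames, while real speech produces bands that move over time.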

Two kinds of knowledge

Every speech recognizer, old or new, balances two kinds of knowledge. The acoustic model answers "given this slice of audio, which speech sounds is it likely to be?" The language model answers "given the words so far, which word probably comes next?" You need both. Acoustics alone are ambiguous, as the beach example shows, and language knowledge alone is deaf.

The decoder is the negotiator. It considers many possible word sequences, scores each by combining the acoustic evidence and the language plausibility, and returns the best one. In a classical system these were separate components you could swap out, so a good language model could rescue a mediocre acoustic model and vice versa. The separation is also why a system trained on news broadcasts stumbles on medical dictation: the acoustics are fine, but the language model has never seen "metoprolol" and ranks it as noise.

flowchart LR
    A[Acoustic model<br/>what the audio suggests] --> C[Decoder<br/>search]
    B[Language model<br/>what people plausibly say] --> C
    C --> D[Transcript]

Neither knowledge source can produce a transcript alone; the decoder searches for the sequence both can live with.
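A toy version of that negotiation can be sketched in a few lines. The probabilities below are invented for illustration, and `combined_score`, `decode`, and `lm_weight` are hypothetical names. A real decoder searches a very large space of sequences instead of comparing two finished candidates, but the scoring idea is the same: add the acoustic log-probability to a weighted language log-probability.

```python
import math

# Hypothetical scores for illustration only: log-probabilities an acoustic
# model and a language model might assign to two competing transcripts.
HYPOTHESES = {
    "recognize speech": {"acoustic": math.log(0.48), "language": math.log(0.020)},
    "wreck a nice beach": {"acoustic": math.log(0.52), "language": math.log(0.0001)},
}


def combined_score(scores, lm_weight=1.0):
    """Log-linear combination: acoustic evidence plus weighted language plausibility."""
    return scores["acoustic"] + lm_weight * scores["language"]


def decode(hypotheses, lm_weight=1.0):
    """Return the transcript with the highest combined score."""
    return max(hypotheses, key=lambda text: combined_score(hypotheses[text], lm_weight))


acoustics_only = decode(HYPOTHESES, lm_weight=0.0)
with_language = decode(HYPOTHESES, lm_weight=1.0)
print("acoustics only:", acoustics_only)
print("with language model:", with_language)
```

With the language weight at zero the slightly better acoustic match wins. Once language plausibility counts, the phrase an English speaker would actually say takes over.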

Classical and end-to-end architectures

For about three decades the dominant design was a pipeline of separate parts: an acoustic model scoring sounds, a pronunciation dictionary mapping words to phonemes, a language model ranking word sequences, and a decoder stitching them together. Each part was trained on its own objective, so improving the whole thing meant improving pieces and hoping they composed.

Modern systems are end-to-end: a single network learns the entire audio-to-text mapping from transcribed examples. The dictionary, the separate language model, the alignment bookkeeping, and most of the hand assembly dissolved into weights learned from data. What the field bought and what it paid show up cleanly side by side.

	Classical pipeline	End-to-end model
Parts	Acoustic model, dictionary, language model, decoder	One network
Training	Each part separately, on its own objective	The whole mapping at once
Transcribed audio needed	Modest	Enormous
Fixing a weak spot	Retune or swap the offending part	Fine-tune, and hope nothing else moves
Teaching it your vocabulary	Edit the dictionary and language model	Context biasing at request time

How the field crossed from the left column to the right is the story told in history of speech recognition.

Streaming and batch recognition

There are two ways to run a recognizer. In batch (or async) mode you hand it a complete recording and wait for the transcript. The model can look at the whole utterance, including audio that comes after a given word, before deciding what that word was. Hearing the end of a sentence resolves ambiguity, so this is the easy case, where accuracy is highest. See real-time vs async transcription for when each mode fits.

In streaming mode the audio arrives a little at a time and you must emit words almost immediately, before the speaker has finished. The model cannot peek at future audio because it does not exist yet, and it has to commit: an attention model that wants to attend to the whole utterance has to be redesigned to attend only to what it has heard so far. This is why streaming systems often show a guess and then revise it as more audio arrives, which is the distinction between partial vs final results. Building a recognizer that is both fast and accurate under these constraints is most of the engineering in streaming speech recognition, and it is what powers responsive systems like a voice agent.
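One way to picture the difference is as an attention mask over audio frames. The sketch below is an illustration and not any particular model's implementation: `full_mask` lets every frame see the whole utterance, while `causal_mask` limits each frame to what has already arrived. The `lookahead` parameter is an added assumption, standing in for the small amount of future context some streaming designs allow in exchange for extra latency.

```python
def full_mask(n_frames):
    """Batch mode: every frame may attend to every other frame."""
    return [[True] * n_frames for _ in range(n_frames)]


def causal_mask(n_frames, lookahead=0):
    """Streaming mode: frame i may attend only to frames 0..i+lookahead."""
    return [[j <= i + lookahead for j in range(n_frames)] for i in range(n_frames)]


def show(mask):
    """Render a mask as text: 'x' means visible, '.' means hidden."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


print("batch:")
print(show(full_mask(4)))
print("streaming:")
print(show(causal_mask(4)))
```

Each row is one frame deciding what it may look at. The empty upper triangle in the streaming mask is the future audio that does not exist yet.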

Common questions

Is speech recognition the same as voice recognition?

No, though the terms get mixed up constantly. Speech recognition figures out what was said (the words). Voice recognition usually means speaker recognition: figuring out who is speaking, for identity or verification.^[3] Telling apart multiple speakers in one recording is speaker diarization.^[4]

How is accuracy measured?

The standard metric is word error rate (WER): the number of substitutions, deletions, and insertions needed to turn the system's output into a human reference transcript, divided by the number of words in the reference.^[2] Lower is better. A WER of 5 percent means roughly one word in twenty is off. Always check what audio a reported number was measured on, since clean read speech and noisy conversation are very different difficulties.
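The calculation is a word-level edit distance divided by the reference length. Here is a minimal sketch, with `word_error_rate` as a hypothetical helper name. It only lowercases and splits on whitespace, whereas real scoring tools usually apply more careful text normalization first.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing from the output
            insertion = curr[j - 1] + 1   # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("book a flight to Denver", "book flight to Denver please"))
print(word_error_rate("recognize speech", "wreck a nice beach"))
```

The first pair has one deletion and one insertion against five reference words. The second pair shows that the rate can exceed 100 percent, because insertions count as errors while the denominator stays fixed at the reference length.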

Does speech recognition understand what I mean?

No. ASR produces text, not meaning. Recognizing the words "book a flight to Denver" is separate from understanding the intent and acting on it. Voice assistants chain ASR to language understanding and dialogue components downstream.

Why does it still make mistakes on names and jargon?

Names, product terms, and specialized vocabulary appear rarely in training data, so the model's language knowledge is weak there, and many are acoustically close to common words. Most production systems let you supply a custom vocabulary or context to bias recognition toward the terms you expect.^[6]
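The effect of biasing can be sketched as a score bonus for hypotheses that contain an expected term. This is a simplified illustration and not any vendor's API: `bias_scores`, the `bonus` value, and the candidate scores are all hypothetical, and production systems typically apply the bias inside the decoder instead of rescoring finished transcripts.

```python
def bias_scores(hypotheses, context_terms, bonus=2.0):
    """Add a fixed bonus to a hypothesis score for each context term it contains."""
    terms = {term.lower() for term in context_terms}
    biased = {}
    for text, score in hypotheses.items():
        hits = sum(1 for word in text.lower().split() if word in terms)
        biased[text] = score + bonus * hits
    return biased


# Hypothetical log-scores from a recognizer that has rarely seen the drug name.
candidates = {
    "take metoprolol twice daily": -9.1,
    "take me to pro lol twice daily": -8.4,
}

best_plain = max(candidates, key=candidates.get)
biased = bias_scores(candidates, ["metoprolol"])
best_biased = max(biased, key=biased.get)
print("without context:", best_plain)
print("with context:", best_biased)
```

The bonus is a blunt instrument: set too high, it makes the recognizer hear the supplied terms where they were never spoken, which is why context lists tend to work best when kept short and specific.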

References

1. Picone, J. (1993). Signal Modeling Techniques in Speech Recognition. Proceedings of the IEEE, 81(9), 1215–1247.
2. Hunt, M. J. (1990). Figures of Merit for Assessing Connected-Word Recognisers. Speech Communication, 9(4), 329–336.
3. Campbell, J. P. (1997). Speaker Recognition: A Tutorial. Proceedings of the IEEE, 85(9), 1437–1462.
4. Anguera, X., Bozonnet, S., Evans, N., Fredouille, C., Friedland, G., & Vinyals, O. (2012). Speaker Diarization: A Review of Recent Research. IEEE Transactions on Audio, Speech, and Language Processing, 20(2), 356–370.
5. Reddy, D. R. (1976). Speech Recognition by Machine: A Review. Proceedings of the IEEE, 64(4), 501–531.
6. Soniox (2026). Soniox Speech-to-Text documentation. Soniox.