How speech-to-text works: from waveform to words

Say the word "four" into a microphone. The diaphragm inside it moves with the air, and a converter measures that motion thousands of times a second. What lands on disk is not a word and not even a sound. It is a long list of numbers describing how the pressure rose and fell, and nowhere in that list does anything spell out "four." A speech recognizer has to recover the word from those numbers. Because the audio for "four," "for," and "fore" is nearly identical, the system is always estimating the most probable word rather than reading off a certain one.

Recognition does not work by matching your voice against stored clips. There is no recording of you saying "four" filed away to compare against. The model has learned, from many thousands of hours of other people's speech, a function from sound to text, and runs that function on audio it has never heard from a person it has never met.

This page follows one waveform through that function, one stage at a time. For the bigger picture, start with the pillar page, "What is speech recognition."

Sampling

Sound is continuous; computers are not. Sampling measures the air pressure at fixed intervals and stores each measurement as a number, called a sample. The sample rate is how many measurements per second. Speech recognition commonly uses 16,000 samples per second (16 kHz).^[1] Higher rates preserve frequencies that most speech recognizers do not use.

Opened up, our half-second recording of "four" is nothing but a list of 8,000 floating-point numbers, most of them small and hovering near zero. That list is the entire input. Everything the system will ever know about the word is in there.
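
A minimal sketch of what sampling produces, assuming a synthetic 220 Hz tone stands in for the recording (a real clip would be read from an audio file). The function name `sample_tone` and the amplitude are illustrative choices, not part of any particular recognizer.

```python
import math

SAMPLE_RATE = 16_000  # samples per second (16 kHz)

def sample_tone(freq_hz: float, duration_s: float, rate: int = SAMPLE_RATE,
                amplitude: float = 0.1) -> list[float]:
    """Measure a pure tone at fixed intervals, one float per measurement."""
    n_samples = int(round(duration_s * rate))
    return [amplitude * math.sin(2 * math.pi * freq_hz * n / rate)
            for n in range(n_samples)]

samples = sample_tone(freq_hz=220.0, duration_s=0.5)
print(len(samples))   # 8000: half a second at 16 kHz
print(samples[:4])    # the first few measurements, all small floats
```

The length of the list is just the sample rate times the duration, which is where the 8,000 comes from.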

Framing

The recognizer does not look at all 8,000 samples at once. Speech changes fast (a vowel and the consonant after it are only tens of milliseconds apart), so the signal is cut into short overlapping frames. A common choice: a 25 ms window that moves forward 10 ms each step (the hop).^[2] The window is short enough that the sound inside it is roughly steady, and the overlap means no transition falls in a crack between frames.

At 16 kHz the arithmetic is plain: the window holds 400 samples, the hop is 160, and the half-second clip becomes roughly 48 frames. From here on, the unit of work is the frame, not the sample, and the model produces one set of scores per frame.
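
The framing arithmetic can be checked with a short sketch. It assumes no padding at the ends of the clip and drops any trailing partial frame; `frame_signal` is a hypothetical helper name.

```python
SAMPLE_RATE = 16_000
WINDOW_MS = 25
HOP_MS = 10

def frame_signal(samples, rate=SAMPLE_RATE, window_ms=WINDOW_MS, hop_ms=HOP_MS):
    """Cut a list of samples into overlapping frames; a trailing partial frame is dropped."""
    window = rate * window_ms // 1000   # 400 samples at 16 kHz
    hop = rate * hop_ms // 1000         # 160 samples at 16 kHz
    if len(samples) < window:
        return []
    n_frames = 1 + (len(samples) - window) // hop
    return [samples[i * hop : i * hop + window] for i in range(n_frames)]

clip = [0.0] * 8000                  # stand-in for the half-second clip
frames = frame_signal(clip)
print(len(frames), len(frames[0]))   # 48 400
```

Implementations that pad the start and end of the signal produce a few more frames for the same clip, which is why the count is best read as "roughly 48."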

Spectrogram and features

A frame of raw samples is hard for a model to read. What distinguishes "f" from "or" is which frequencies are present and how loud each one is. The short-time Fourier transform (STFT) takes each frame and reports its frequency content: how much energy sits at low pitches versus high. If you do this for every frame and stack the results, you get a spectrogram, a picture of frequency (vertical) over time (horizontal).
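
To make "frequency content of one frame" concrete, here is a sketch that applies a Hann window and a naive discrete Fourier transform to a single 400-sample frame. It assumes a synthetic 1,000 Hz tone as input; real systems use a fast Fourier transform from a numerical library rather than this slow loop, and the function names are illustrative.

```python
import cmath
import math

SAMPLE_RATE = 16_000

def hann(n_points):
    """Hann window: tapers the frame edges so the cut itself adds less false frequency content."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / (n_points - 1)) for n in range(n_points)]

def power_spectrum(frame):
    """Energy in each frequency bin from 0 Hz up to half the sample rate (naive DFT)."""
    n = len(frame)
    tapered = [s * w for s, w in zip(frame, hann(n))]
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(tapered))
        spectrum.append(abs(coeff) ** 2)
    return spectrum

# One 25 ms frame (400 samples) of a 1,000 Hz tone.
frame = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(400)]
spec = power_spectrum(frame)
peak_bin = max(range(len(spec)), key=spec.__getitem__)
print(peak_bin * SAMPLE_RATE / len(frame))   # 1000.0
```

Each bin covers 40 Hz here (16,000 / 400), and the list returned for one frame is one column of the spectrogram.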

The spectrogram is where speech first looks legible. Vowels appear as stacked horizontal bands called formants, the frequencies your throat and mouth amplify. An /s/ is a wash of high-frequency hiss. A stop consonant like the /t/ in "stop" is a moment of silence followed by a burst. A trained phonetician can read words straight off a spectrogram, which is a fair description of what the model is about to learn to do.

Most systems do not feed the raw spectrogram to the model. They warp it onto the mel scale, which spaces frequencies the way human hearing does (fine detail down low, coarser up high), and sometimes compress it further into MFCCs. The result is a compact feature vector per frame, typically 40 to 80 numbers for mel features and fewer for MFCCs, summarizing what the sound is doing in that 25 ms slice.^[3]
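
The mel warp can be sketched with one widely used conversion formula. Other variants exist, and which one a given system uses is an assumption here; the band count of 40 and the 0 to 8,000 Hz range are illustrative.

```python
import math

def hz_to_mel(freq_hz):
    """One common Hz-to-mel formula; toolkits differ in the exact variant."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_center_frequencies(n_bands, f_min=0.0, f_max=8000.0):
    """Centre frequencies (in Hz) of n_bands filters spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_bands)]

centers = mel_center_frequencies(40)
gaps = [b - a for a, b in zip(centers, centers[1:])]
print(round(gaps[0]), round(gaps[-1]))   # narrow spacing at the bottom, wide at the top
```

Even steps in mel become growing steps in Hz, which is the "fine detail down low, coarser up high" behaviour.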

The acoustic and language models

Now the learned part. A neural network reads the sequence of feature frames and, for each frame, outputs a score for every possible output symbol: each character, or each subword token, plus a blank symbol meaning "nothing new here." Two designs dominate.^[4]^[5] CTC scores each frame without reference to the symbols chosen for other frames and uses the blank to handle a sound that spans many frames.^[6] Attention and transducer models condition each new token on the tokens already emitted, and attention models can also look across the whole sequence as they do so. In the CTC picture this page follows, the output is a grid: frames across, symbol scores down.

For one frame somewhere in the middle of our word, that row of the grid looks like this (illustrative):

{ "f": 0.03, "o": 0.81, "u": 0.05, "r": 0.02, "<blank>": 0.09 }

One such distribution per frame, forty-eight frames for the half-second clip. That grid of numbers is everything the acoustic model has to say.
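
A row like the one above is typically produced by passing the network's raw scores through a softmax. The sketch below assumes made-up raw scores for one frame and a five-symbol vocabulary; a real model would have a much larger symbol set.

```python
import math

SYMBOLS = ["f", "o", "u", "r", "<blank>"]

def softmax(logits):
    """Turn raw scores into probabilities that are positive and sum to 1."""
    peak = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for one frame; a real network would produce these.
logits = [0.2, 3.5, 0.7, -0.2, 1.3]
probs = softmax(logits)
row = dict(zip(SYMBOLS, probs))
best = max(row, key=row.get)
print(best)                   # o
print(round(sum(probs), 6))   # 1.0
```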

The acoustic model scores which symbols match the sound. It does not track that "four o'clock" is ordinary English and "fore o'clock" is golf terminology. That job belongs to the language model, which scores word sequences by plausibility and is folded in during decoding: the acoustic model proposes a reading of the sound, and the language model scores the resulting word sequence.^[7]
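
One simple way to fold a language model in is a weighted sum of log-probabilities, often called shallow fusion. The text above does not say which method a given system uses, so treat this as an assumption; the probabilities and the weight of 0.5 are invented for illustration.

```python
import math

def fused_score(acoustic_prob, lm_prob, lm_weight=0.5):
    """Log-linear combination: acoustic log-probability plus weighted LM log-probability."""
    return math.log(acoustic_prob) + lm_weight * math.log(lm_prob)

# Hypothetical numbers: the audio slightly favours "fore",
# while the language model strongly favours "four".
candidates = {
    "four o'clock": {"acoustic": 0.45, "lm": 0.0200},
    "fore o'clock": {"acoustic": 0.55, "lm": 0.0001},
}
scores = {text: fused_score(c["acoustic"], c["lm"]) for text, c in candidates.items()}
best = max(scores, key=scores.get)
print(best)   # four o'clock
```

The weight is a tuning knob: at zero the acoustic score decides alone, and larger values let the language model overrule the sound more readily.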

Decoding

The grid of per-frame scores is not text yet. Decoding searches it for the single most plausible string. Greedy decoding takes the winner of every frame and collapses the run: if the winners read f, f, f, blank, o, o, blank, u, r, r, merging repeats and deleting blanks leaves "four." It is fast, and it is sometimes wrong, because a symbol that never wins a single frame outright can still belong to the best word. Beam search keeps the top few candidate strings alive at once, scoring each with both the acoustic grid and the language model, and lets a candidate that starts weak win if it ends more plausible.
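
The collapse rule, and the way greedy decoding can miss the best string, can both be shown in a few lines. The second half is not beam search: it is a brute-force sum over every path, usable only on toy grids, standing in for what beam search approximates. The two-frame grid is invented for illustration.

```python
from itertools import groupby, product

BLANK = "<blank>"

def collapse(path):
    """CTC collapse rule: merge consecutive repeats, then delete blanks."""
    merged = [symbol for symbol, _ in groupby(path)]
    return "".join(s for s in merged if s != BLANK)

def greedy_decode(grid):
    """Take the top-scoring symbol in every frame, then collapse."""
    return collapse([max(row, key=row.get) for row in grid])

def exhaustive_decode(grid):
    """Sum the probability of every frame-by-frame path that collapses to the same string."""
    symbols = list(grid[0])
    totals = {}
    for path in product(symbols, repeat=len(grid)):
        prob = 1.0
        for row, symbol in zip(grid, path):
            prob *= row[symbol]
        text = collapse(path)
        totals[text] = totals.get(text, 0.0) + prob
    return max(totals, key=totals.get), totals

winners = ["f", "f", "f", BLANK, "o", "o", BLANK, "u", "r", "r"]
print(collapse(winners))              # four

# Toy two-frame grid: blank wins both frames, yet "a" is the likelier string.
toy = [{"a": 0.4, BLANK: 0.6}, {"a": 0.4, BLANK: 0.6}]
print(repr(greedy_decode(toy)))       # ''
best, totals = exhaustive_decode(toy)
print(best, round(totals[best], 2))   # a 0.64
```

Three different paths collapse to "a" and together outweigh the single all-blank path. The exhaustive sum grows exponentially with the number of frames, which is why practical decoders keep only a handful of candidates alive.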

Decoding is where batch and streaming split. Batch decoding sees the whole grid and revises freely. Streaming decoding has to emit text while audio is still arriving, so it commits to words before the sentence ends, then sometimes corrects them.^[10] That trade-off is the subject of streaming speech recognition and partial vs final results.

Waveform (16 kHz samples) → Frames (25 ms window, 10 ms hop) → Features (mel / MFCC) → Acoustic model (per-frame scores) → Decode (beam search + language model) → Text ("four")

The full pipeline: one waveform from sound pressure to text.

Common questions

Why 16 kHz and not higher?

Most of the information that distinguishes one speech sound from another lies below about 8 kHz, and sampling theory says you need a rate at least twice the highest frequency you want to capture. 16 kHz captures everything up to that 8 kHz limit. Higher rates add data and cost without adding much information the model can use for speech.
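
The factor of two can be seen directly in the samples. In this sketch, a tone above half the sample rate produces exactly the same numbers as a lower tone with its sign flipped, so the sampled data cannot distinguish them. The 9 kHz and 7 kHz tones are illustrative choices.

```python
import math

SAMPLE_RATE = 16_000

def sampled_sine(freq_hz, n_samples, rate=SAMPLE_RATE):
    """Samples of a unit-amplitude sine wave at the given frequency."""
    return [math.sin(2 * math.pi * freq_hz * n / rate) for n in range(n_samples)]

nyquist_hz = SAMPLE_RATE / 2   # highest frequency that 16 kHz sampling can represent

# A 9 kHz tone lies above the limit. Its samples match a 7 kHz tone with the sign
# flipped, so after sampling the two cannot be told apart.
above = sampled_sine(9_000, 64)
alias = sampled_sine(7_000, 64)
max_diff = max(abs(a + b) for a, b in zip(above, alias))
print(nyquist_hz)        # 8000.0
print(max_diff < 1e-9)   # True
```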

Is a spectrogram strictly required?

No. Some recent models learn features directly from raw samples with convolutional front ends, skipping the hand-built mel step.^[8] Mel features stay common because they are compact and work well, but the spectrogram is a design choice, not a law.

What turns character scores into real words with spacing?

The decoder, guided by the language model. The acoustic model emits characters or subword tokens; beam search assembles them into words and applies the language model's sense of which sequences are plausible. Punctuation and casing are usually separate learned steps on top.

Where do confidence scores come from?

From the per-frame probabilities. After decoding, the model can report how sure it was about each word, derived from those scores. See confidence scores for how to read them.
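
How the scores are combined varies by system and is not specified here. As one illustrative possibility, a word's confidence could be the geometric mean of the probabilities of its tokens; the function name and the numbers below are assumptions.

```python
import math

def word_confidence(token_probs):
    """Geometric mean of the token probabilities that make up one word."""
    if not token_probs:
        raise ValueError("a word needs at least one token probability")
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(log_sum / len(token_probs))

# Hypothetical per-token probabilities for two recognized words.
words = {"four": [0.97, 0.81, 0.90, 0.95], "o'clock": [0.60, 0.55, 0.70]}
for word, probs in words.items():
    print(word, round(word_confidence(probs), 2))
```

A geometric mean keeps the result between the weakest and strongest token, so one shaky token lowers the word's score without zeroing it.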

How is accuracy measured?

By aligning the output text with a human reference transcript and counting the insertions, deletions, and substitutions needed to turn one into the other. That count, divided by the number of words in the reference transcript, is the word error rate.^[9]
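
A minimal sketch of the calculation, assuming whitespace-separated words and no text normalization (real evaluations usually normalize case and punctuation first). It uses the standard edit-distance recurrence over words.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] = fewest edits turning the reference words seen so far into the first j output words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing from the output
            insertion = curr[j - 1] + 1   # extra word in the output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)

print(word_error_rate("meet me at four o'clock", "meet me at fore o'clock"))   # 0.2
```

One wrong word out of five gives 0.2. Because insertions count as errors, the rate can exceed 1.0 when the output is much longer than the reference.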

References

[1] Hirsch, H. G. (2001). Speech Recognition at Multiple Sampling Rates. Eurospeech.
[2] Huang, X., Acero, A., & Hon, H. W. (2001). Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice Hall.
[3] Davis, S., & Mermelstein, P. (1980). Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(4), 357–366.
[4] Watanabe, S., Hori, T., Kim, S., Hershey, J. R., & Yu, D. (2017). Hybrid CTC/Attention Architecture for End-to-End Speech Recognition. IEEE Journal of Selected Topics in Signal Processing, 11(8), 1240–1253.
[5] Prabhavalkar, R., Sainath, T. N., Li, B., Bruguier, A., & Kannan, A. (2017). A Comparison of Sequence-to-Sequence Models for Speech Recognition. Interspeech.
[6] Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. (2006). Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. Proceedings of the 23rd International Conference on Machine Learning (ICML), 369–376.
[7] Reddy, D. R. (1976). Speech Recognition by Machine: A Review. Proceedings of the IEEE, 64(4), 501–531.
[8] Zeghidour, N., Usunier, N., Bottou, L., Kokkinos, I., & Synnaeve, G. (2018). End-to-End Speech Recognition From the Raw Waveform. Interspeech.
[9] Morris, A. C., & Maier, V. (2004). From WER and RIL to MER and WIL: Improved Evaluation Measures for Connected Speech Recognition. Interspeech.
[10] Soniox (2026). Real-time Speech-to-Text. Soniox.