Say the word "four" into a microphone. The diaphragm inside it moves with the air, and a converter measures that motion thousands of times a second. What lands on disk is not a word and not even a sound. It is a long list of numbers describing how the pressure rose and fell, and nowhere in that list does anything spell out "four." A speech recognizer has to recover the word from those numbers. Because the audio for "four," "for," and "fore" is nearly identical, the system is always estimating the most probable word rather than reading off a certain one.
Recognition does not work by matching your voice against stored clips. There is no recording of you saying "four" filed away to compare against. The model has learned, from many thousands of hours of other people's speech, a function from sound to text, and runs that function on audio it has never heard from a person it has never met.
This page follows one waveform through that function, one stage at a time. For the bigger picture, start with the pillar page, What is speech recognition.
Sampling
Sound is continuous; computers are not. Sampling measures the air pressure at fixed intervals and stores each measurement as a number, called a sample. The sample rate is how many measurements per second. Speech recognition commonly uses 16,000 samples per second (16 kHz).[1] Higher rates preserve frequencies that most speech recognizers do not use.
import soundfile as sf
audio, sample_rate = sf.read("four.wav")
print(sample_rate) # 16000
print(audio.shape) # (e.g.) (8000,) -> 0.5 seconds of audio
print(audio[:5]) # small floats near zero, e.g. [-0.001 0.004 ...]
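To make the link between sample rate, sample count, and duration concrete, here is a minimal sketch that synthesizes a clip instead of reading a file. The 440 Hz tone and the names `tone` and `t` are assumptions for illustration; they are not part of any recognizer.

```python
import numpy as np

sample_rate = 16_000                 # samples per second
duration_s = 0.5                     # length of the synthetic clip
num_samples = int(sample_rate * duration_s)

# Time stamp of every sample: sample n is taken at n / sample_rate seconds.
t = np.arange(num_samples) / sample_rate

# A 440 Hz sine stands in for the microphone's pressure signal.
tone = 0.1 * np.sin(2 * np.pi * 440.0 * t)

print(tone.shape)                    # one number per sample
print(len(tone) / sample_rate)       # duration recovered from the sample count
print(t[1] - t[0])                   # spacing between samples, in seconds
```

Dividing the sample count by the sample rate always gives the duration, which is why a file's declared sample rate has to be right before anything downstream can work.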
Framing
The recognizer does not process a whole second of audio (16,000 samples) in one piece. Speech changes fast (a vowel and the consonant after it can be only tens of milliseconds apart), so the signal is cut into short overlapping frames. A common choice is a 25 ms window that moves forward 10 ms each step (the hop).[2][3][4] The window is short enough that the sound inside it is roughly steady, and the overlap means no transition falls into a gap between frames.
frame_len = int(0.025 * sample_rate) # 25 ms -> 400 samples
hop_len = int(0.010 * sample_rate) # 10 ms -> 160 samples
frames = []
# The +1 keeps the last full frame when the clip length lines up exactly with a hop.
for start in range(0, len(audio) - frame_len + 1, hop_len):
    frames.append(audio[start:start + frame_len])
print(len(frames), "frames of", frame_len, "samples each")
A half-second clip becomes roughly 48 frames. From here on, the unit of work is the frame, not the sample, and the model produces one set of scores per frame.
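The frame count can be checked without a loop. The sketch below is one possible vectorized framing, using NumPy's `sliding_window_view`; the helper name `frame_signal` and the silent stand-in clip are assumptions made for this example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def frame_signal(signal, frame_len, hop_len):
    """Return a (num_frames, frame_len) view of overlapping frames."""
    if len(signal) < frame_len:
        return np.empty((0, frame_len), dtype=signal.dtype)
    # Build every possible window, then keep one window per hop.
    return sliding_window_view(signal, frame_len)[::hop_len]

frame_len = 400                      # 25 ms at 16 kHz
hop_len = 160                        # 10 ms at 16 kHz
clip = np.zeros(8000)                # 0.5 s of silence as a stand-in

framed = frame_signal(clip, frame_len, hop_len)

# Closed form: one frame, plus one more for every whole hop that still fits.
expected = 1 + (len(clip) - frame_len) // hop_len
print(framed.shape, expected)        # (48, 400) 48
```

Libraries that pad the signal before framing will report a slightly different count for the same clip, so a frame total should always be read together with the framing convention that produced it.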
Spectrogram and features
A frame of raw samples is hard for a model to read. What distinguishes "f" from "or" is which frequencies are present and how loud each one is. The short-time Fourier transform (STFT) takes each frame and reports its frequency content: how much energy sits at low pitches versus high. Do this for every frame, stack the results, and you get a spectrogram, a picture of frequency (vertical) over time (horizontal).
import numpy as np
import librosa
# magnitude spectrogram: rows = frequencies, cols = frames
stft = librosa.stft(audio, n_fft=400, hop_length=hop_len)
spectrogram = np.abs(stft)
print(spectrogram.shape) # (201, ~num_frames)
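The 201 rows come from the transform itself: a real FFT over 400 samples yields 400 // 2 + 1 frequency bins. As a sketch of what happens to a single frame, the example below windows one frame and transforms it with plain NumPy. The 1 kHz test tone and the Hann window are assumptions chosen to keep the result easy to check.

```python
import numpy as np

sample_rate = 16_000
n_fft = 400                                   # one 25 ms frame

# A 1 kHz tone stands in for one frame of speech.
n = np.arange(n_fft)
frame = np.sin(2 * np.pi * 1000.0 * n / sample_rate)

# Taper the edges so the frame's abrupt ends do not smear energy across bins.
windowed = frame * np.hanning(n_fft)

# Real FFT: n_fft // 2 + 1 bins from 0 Hz up to sample_rate / 2.
magnitudes = np.abs(np.fft.rfft(windowed))
bin_hz = np.fft.rfftfreq(n_fft, d=1 / sample_rate)

peak = int(np.argmax(magnitudes))
print(magnitudes.shape)      # (201,)
print(peak, bin_hz[peak])    # bin 25, which sits at about 1000 Hz
```

Each bin is 40 Hz wide here (16,000 / 400), so a 25 ms frame cannot tell apart two tones closer together than that. Longer frames give finer frequency detail at the cost of coarser timing.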
Most systems do not feed the raw spectrogram to the model. They warp it onto the mel scale, which spaces frequencies the way human hearing does (fine detail down low, coarser up high), and often compress it further into MFCCs. The result is a compact feature vector per frame, typically 40 to 80 numbers for log-mel features and fewer for MFCCs, summarizing what the sound is doing in that 25 ms slice.[5][6][7]
# 40 mel features per frame, a common input to the acoustic model
mel = librosa.feature.melspectrogram(y=audio, sr=sample_rate,
                                     n_fft=400, hop_length=hop_len, n_mels=40)
log_mel = librosa.power_to_db(mel)
print(log_mel.shape) # (40, ~num_frames)
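To see what "fine detail down low, coarser up high" means in numbers, the sketch below places 40 band centers evenly on the mel scale and converts them back to hertz. It uses the HTK-style formula as an assumption; other variants exist, and librosa's default is the Slaney variant, so exact values will differ from what the call above uses internally.

```python
import numpy as np

def hz_to_mel(hz):
    # HTK-style mel formula (an assumption; not librosa's default variant).
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

n_mels = 40
# n_mels + 2 points equally spaced in mel: two outer edges plus 40 centers.
mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), n_mels + 2)
hz_points = mel_to_hz(mel_points)

centers = hz_points[1:-1]
spacing = np.diff(hz_points)

print(np.round(centers[:3]))     # closely packed at low frequencies
print(np.round(centers[-3:]))    # spread far apart near 8 kHz
print(np.round(spacing[0]), np.round(spacing[-1]))
```

The gap between neighboring bands grows steadily with frequency, which is how 201 spectrogram rows collapse to 40 features without losing much of what separates one speech sound from another.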
The acoustic and language models
Now the learned part. A neural network reads the sequence of feature frames and outputs scores over the possible output symbols: each character, or each subword token. Two designs dominate.[8][9][10] CTC produces one score vector per frame, adds a blank symbol meaning "nothing new here," and treats frames as independent given the audio; blanks and repeats let one sound span many frames.[11][12] Attention-based models emit one token at a time while looking across the whole sequence, and transducers condition each per-frame prediction on the tokens already emitted. The rest of this page uses the CTC picture, in which the output is a grid: frames across, symbol scores down.
# Conceptual, not a real API. `model` is your trained acoustic network.
from scipy.special import softmax
logits = model(log_mel.T) # shape: (num_frames, num_symbols)
probs = softmax(logits, axis=-1) # per-frame probability over symbols
# probs[t] might favor 'f' early, then 'o', 'r', and lots of blanks
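The conceptual snippet cannot run without a trained network. As a stand-in, the sketch below fills a grid with random logits and applies a hand-written softmax, so the shape and the "each frame sums to 1" property can be inspected. The 29-symbol alphabet (blank, 26 letters, space, apostrophe) is a hypothetical choice, and the random values carry no acoustic meaning.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the per-row max first so exp() cannot overflow.
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
num_frames, num_symbols = 48, 29      # hypothetical alphabet size
logits = rng.normal(size=(num_frames, num_symbols))   # stand-in for model output

probs = softmax(logits, axis=-1)
print(probs.shape)                    # (48, 29)
print(probs.sum(axis=-1)[:3])         # each frame's scores sum to 1
```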
The acoustic model scores which symbols match the sound. It does not track that "four o'clock" is ordinary English and "fore o'clock" is golf terminology. That job belongs to the language model, which scores word sequences by plausibility and is folded in during decoding: the acoustic model proposes a reading of the sound, and the language model scores the resulting word sequence.[13]
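One common way to fold the language model in is to add its log-probability, scaled by a weight, to the acoustic log-probability of each candidate (often called shallow fusion). The sketch below shows the arithmetic only. Every number in it is invented for illustration, and `LM_WEIGHT` is an assumed tuning knob, not a recommended value.

```python
import math

# Hypothetical log-probabilities, invented for illustration only.
acoustic_logp = {
    "four o'clock": math.log(0.30),
    "fore o'clock": math.log(0.37),
    "for o'clock": math.log(0.33),
}
lm_logp = {
    "four o'clock": math.log(0.02),
    "fore o'clock": math.log(0.00001),
    "for o'clock": math.log(0.0005),
}

LM_WEIGHT = 0.5   # assumed; real systems tune this on held-out data

def fused_score(text):
    # Acoustic evidence plus weighted language-model plausibility.
    return acoustic_logp[text] + LM_WEIGHT * lm_logp[text]

best_acoustic = max(acoustic_logp, key=acoustic_logp.get)
best_fused = max(acoustic_logp, key=fused_score)
print(best_acoustic)   # fore o'clock
print(best_fused)      # four o'clock
```

In this toy case the acoustic scores alone slightly prefer the golf reading, and the language model's strong preference for the ordinary phrase flips the result.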
Decoding
The grid of per-frame scores is not text yet. Decoding searches it for the single most plausible string. Greedy decoding takes the top symbol at each frame and collapses repeats and blanks: fast, sometimes wrong. Beam search keeps the top few candidate strings alive at once, scoring each with both the acoustic grid and the language model, and lets a candidate that starts weak win if it ends more plausible.
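# Greedy: top symbol per frame, then collapse repeats and blanks.
# Assumes BLANK is the index of the blank symbol in the model's output.
ids = probs.argmax(axis=-1)
out, prev = [], None
for i in ids:
    if i != prev and i != BLANK:
        out.append(i)
    prev = i  # update on every frame, including blanks and repeats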
text = "".join(id_to_char[i] for i in out) # -> "four"
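To show how greedy decoding can be "sometimes wrong," here is a sketch of CTC prefix beam search next to the greedy rule, run on a two-frame toy grid. It is a simplified version under stated assumptions: index 0 is the blank, no language model is applied, and the function names are made up for this example.

```python
import math
from collections import defaultdict
import numpy as np

BLANK = 0
NEG_INF = -math.inf

def logsumexp(*xs):
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))

def greedy_decode(log_probs):
    out, prev = [], None
    for i in log_probs.argmax(axis=-1):
        if i != prev and i != BLANK:
            out.append(int(i))
        prev = i
    return tuple(out)

def prefix_beam_search(log_probs, beam_size=4):
    # Each prefix keeps two scores: paths ending in blank, paths ending in a symbol.
    beams = {(): (0.0, NEG_INF)}
    for frame in log_probs:
        nxt = defaultdict(lambda: (NEG_INF, NEG_INF))
        for prefix, (p_b, p_nb) in beams.items():
            for s, p in enumerate(frame):
                p = float(p)
                if s == BLANK:
                    b, nb = nxt[prefix]
                    nxt[prefix] = (logsumexp(b, p_b + p, p_nb + p), nb)
                elif prefix and s == prefix[-1]:
                    # A repeat after a blank starts a new copy of the symbol...
                    b, nb = nxt[prefix + (s,)]
                    nxt[prefix + (s,)] = (b, logsumexp(nb, p_b + p))
                    # ...a repeat with no blank collapses into the same prefix.
                    b, nb = nxt[prefix]
                    nxt[prefix] = (b, logsumexp(nb, p_nb + p))
                else:
                    b, nb = nxt[prefix + (s,)]
                    nxt[prefix + (s,)] = (b, logsumexp(nb, p_b + p, p_nb + p))
        ranked = sorted(nxt.items(), key=lambda kv: logsumexp(*kv[1]), reverse=True)
        beams = dict(ranked[:beam_size])
    best_prefix, scores = max(beams.items(), key=lambda kv: logsumexp(*kv[1]))
    return best_prefix, math.exp(logsumexp(*scores))

# Two frames, two symbols: index 0 is blank, index 1 is "a".
toy = np.log(np.array([[0.6, 0.4],
                       [0.6, 0.4]]))
print(greedy_decode(toy))          # ()  -> empty string
print(prefix_beam_search(toy))     # ((1,), ~0.64) -> "a"
```

Greedy picks blank in both frames and outputs nothing, with path probability 0.36. The string "a" can be reached three ways (a-blank, blank-a, a-a), which together total 0.64, and the beam search finds it because it sums over paths that collapse to the same text.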
Decoding is where batch and streaming split. Batch decoding sees the whole grid and revises freely. Streaming decoding has to emit text while audio is still arriving, so it commits to words before the sentence ends, then sometimes corrects them.[20] That trade-off is the subject of streaming speech recognition and partial vs final results.
Common questions
Why 16 kHz and not higher?
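Most of the information that distinguishes one speech sound from another sits below about 8 kHz, and the sampling theorem says the sample rate must be at least twice the highest frequency you want to capture. A 16 kHz rate captures everything up to 8 kHz, so it covers that range. Higher rates add data and cost without adding much information the model can use for speech.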
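The factor of two can be checked numerically. In the sketch below, a tone under 8 kHz survives 16 kHz sampling intact, while a tone above 8 kHz shows up at the wrong frequency (aliasing). The test tones and the helper name `loudest_frequency` are assumptions for this example.

```python
import numpy as np

sample_rate = 16_000
n = np.arange(1600)                       # 0.1 s worth of sample indices

def loudest_frequency(tone_hz):
    """Sample a pure tone at 16 kHz and report where its energy lands."""
    samples = np.cos(2 * np.pi * tone_hz * n / sample_rate)
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=1 / sample_rate)
    return float(freqs[np.argmax(spectrum)])

print(loudest_frequency(3000))   # below 8 kHz: comes back at about 3000 Hz
print(loudest_frequency(9000))   # above 8 kHz: folds down to about 7000 Hz
```

This is why audio is low-pass filtered before it is downsampled to 16 kHz: anything left above 8 kHz would be folded into the band the recognizer actually reads.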
Is a spectrogram strictly required?
No. Some recent models learn features directly from raw samples with convolutional front ends, skipping the hand-built mel step.[16][17] Mel features stay common because they are compact and work well, but the spectrogram is a design choice, not a law.
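As a rough sketch of what a convolutional front end computes, the example below applies a bank of filters directly to raw samples with the same 25 ms width and 10 ms stride as the framing step. The filter weights are random here; in a trained model they would be learned. The sizes and names are assumptions, and real front ends usually stack several such layers.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
samples = rng.normal(scale=0.01, size=8000)    # stand-in for 0.5 s of audio

kernel_len, stride, num_filters = 400, 160, 40
# Random weights as placeholders for learned filters.
filters = rng.normal(size=(num_filters, kernel_len))

# A strided 1-D convolution is a dot product of each window with each filter.
windows = sliding_window_view(samples, kernel_len)[::stride]   # (48, 400)
features = np.maximum(windows @ filters.T, 0.0)                # ReLU activation

print(features.shape)    # (48, 40): one 40-number vector per step
```

The output has the same shape as a 40-band mel feature matrix for the same clip, which is why the two approaches can feed the same downstream network.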
What turns character scores into real words with spacing?
The decoder, guided by the language model. The acoustic model emits characters or subword tokens; beam search assembles them into words and applies the language model's sense of which sequences are plausible. Punctuation and casing are usually separate learned steps on top.
Where do confidence scores come from?
From the per-frame probabilities. After decoding, the model can report how sure it was about each word, derived from those scores. See confidence scores for how to read them.
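One simple heuristic, shown here only as a sketch, is to keep the peak probability of each emitted symbol during greedy decoding and take the weakest symbol as the word's confidence. Production systems may compute and calibrate confidence quite differently. The toy grid, the three-symbol alphabet, and the function name are assumptions for this example.

```python
import numpy as np

BLANK = 0

def greedy_with_confidence(probs):
    """Return (symbol_id, confidence) pairs from a (frames, symbols) grid."""
    ids = probs.argmax(axis=-1)
    results, prev = [], None
    for t, i in enumerate(ids):
        p = float(probs[t, i])
        if i != BLANK and i != prev:
            results.append([int(i), p])                # a new symbol starts here
        elif i != BLANK and i == prev:
            results[-1][1] = max(results[-1][1], p)    # same symbol: keep its peak
        prev = i
    return [(i, c) for i, c in results]

# Four frames over three symbols; column 0 is the blank.
probs = np.array([[0.10, 0.80, 0.10],
                  [0.20, 0.70, 0.10],
                  [0.90, 0.05, 0.05],
                  [0.30, 0.10, 0.60]])

tokens = greedy_with_confidence(probs)
word_confidence = min(c for _, c in tokens)   # the weakest symbol sets the score
print(tokens)             # [(1, 0.8), (2, 0.6)]
print(word_confidence)    # 0.6
```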
How is accuracy measured?
By comparing the output text to a human transcript and counting the word insertions, deletions, and substitutions needed to turn one into the other. That count, divided by the number of words in the human transcript, is the word error rate.[18][19]
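The calculation is a word-level edit distance. The sketch below is a minimal version that splits on whitespace and does no text normalization; real evaluations usually normalize casing and punctuation first. The function name is an assumption for this example.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # d[i][j] = fewest edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                          # delete every reference word
    for j in range(len(hyp) + 1):
        d[0][j] = j                          # insert every hypothesis word

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)

    return d[-1][-1] / len(ref)

print(word_error_rate("it is four o'clock", "it is for o'clock"))    # 0.25
print(word_error_rate("it is four o'clock", "it is four o clock"))   # 0.5
```

Because insertions count as errors but the divisor is the reference length, the rate can exceed 1.0 when the output is much longer than the transcript.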
Related concepts
- What is speech recognition
- Streaming speech recognition
- Partial vs final results
- Word error rate
- Audio formats
- Confidence scores
References
[1] Hirsch, H. G. (2001). Speech Recognition at Multiple Sampling Rates. Eurospeech.
[2] Jakuš, P., & Džapo, H. (2025). Implementing Keyword Spotting on the MCXU947 Microcontroller with Integrated NPU. arXiv preprint arXiv:2506.08911.
[3] Dey, S., & Saha, G. (2024). Spoken Language Identification Using Rhythmic Categorization: Syllable-Timed and Stress-Timed. 2024 International Conference on Signal Processing and Communications (SPCOM).
[4] Thienpondt, J., & Demuynck, K. (2022). Transfer Learning for Robust Low-Resource Children's Speech ASR with Transformers and Source-Filter Warping. arXiv preprint arXiv:2206.09396.
[5] Davis, S., & Mermelstein, P. (1980). Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(4), 357–366.
[6] Mel Frequency Cepstral Coefficient and Its Applications: A Review. IEEE.
[7] The Impact of MFCC, Spectrogram, and Mel-Spectrogram on Deep Learning Models for Amazigh Speech Recognition System. International Journal of Speech Technology (Springer).
[8] Watanabe, S., Hori, T., Kim, S., Hershey, J. R., & Yu, D. (2017). Hybrid CTC/Attention Architecture for End-to-End Speech Recognition. IEEE Journal of Selected Topics in Signal Processing, 11(8), 1240–1253.
[9] Prabhavalkar, R., Sainath, T. N., Li, B., Bruguier, A., & Kannan, A. (2017). A Comparison of Sequence-to-Sequence Models for Speech Recognition. Interspeech.
[10] Ren, Z., et al. (2022). Improving Hybrid CTC/Attention Architecture for Agglutinative Languages Speech Recognition. PMC.
[11] Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. (2006). Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. Proceedings of the 23rd International Conference on Machine Learning (ICML), 369–376.
[12] Variational Connectionist Temporal Classification. SpringerLink.
[13] Reddy, D. R. (1976). Speech Recognition by Machine: A Review. Proceedings of the IEEE, 64(4), 501–531.
[14] Pushing the Limits of Beam Search Decoding for Transducer-Based ASR Models. arXiv.
[15] End-to-End Advanced Visual Speech Recognition Using 3D-CNN and BiLSTM with Beam Search Decoding. IEEE Xplore.
[16] Zeghidour, N., Usunier, N., Bottou, L., Kokkinos, I., & Synnaeve, G. (2018). End-to-End Speech Recognition From the Raw Waveform. Interspeech.
[17] End-to-End Speech Recognition From Raw Speech: Multi Time-Frequency Resolution CNN Architecture for Efficient Representation Learning. IEEE Xplore.
[18] Morris, A. C., & Maier, V. (2004). From WER and RIL to MER and WIL: Improved Evaluation Measures for Connected Speech Recognition. Interspeech.
[19] Speech Models Training Technologies Comparison Using Word Error Rate. Science.lpnu.ua.
[20] Soniox (2026). Real-time Speech-to-Text. Soniox.