Hallucinations in speech recognition

Transcribed text without supporting speech in the audio

Updated June 29, 2026

Some recognizers transcribe silence as phrases such as "Thank you for watching" or "Subtitles by the Amara.org community." The output may include conventional punctuation and capitalization despite having no acoustic support. Its fluency makes the error difficult to distinguish from correctly recognized text by inspection alone.

Silence transcribed as words

The signature failure is text where the audio is silent. A gap in the audio (a held line, a muted speaker, a pause between sentences) comes back not as empty space but as text.

What went wrong: many modern recognizers are sequence-to-sequence models, trained to always produce fluent language, with no strong way to say "I heard nothing." Faced with audio that supports no words, the language side of the model takes over and emits whatever is most probable in the absence of evidence, often a generic, well-formed sentence.

Phrases from the training data

The invented text is frequently the same text: "Thank you for watching," channel sign-offs, "Please subscribe," subtitle-credit lines. This is too specific to be coincidence.

What went wrong: large recognizers are often trained on enormous amounts of weakly labeled web audio, including video subtitles scraped at scale. Those subtitles are littered with boilerplate that has little to do with the audio (sign-offs, credits, calls to subscribe) sitting over music or silence. The model learned that music or silence is often paired with "thanks for watching," and reproduces that association on similar input. The hallucination is a residue of the training set.

Repeated-text hallucinations

Another mode: the recognizer latches onto a phrase and repeats it, sometimes for many lines. The result is "okay okay okay okay," or a whole clause echoing down the transcript.

What went wrong: generative decoders predict each word partly from the words they just produced. If the model drifts into a region where the most likely next word is the one it just emitted, it can fall into a self-reinforcing loop, especially over long stretches of low-information audio where nothing in the signal pulls it back out. The same mechanism that makes the output fluent lets it get stuck.
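The loop described above leaves a visible trace in the output: the same word or clause repeated back to back. A minimal sketch of a post-hoc check follows. The function names, the n-gram limit of 5, and the threshold of 3 repeats are illustrative assumptions, not values from any particular recognizer.

```python
def longest_repeat_run(words, max_ngram=5):
    """Return (ngram, count) for the n-gram repeated most times back to back.

    Returns ((), 1) when nothing repeats consecutively.
    """
    best_gram, best_count = (), 1
    for n in range(1, max_ngram + 1):
        for i in range(len(words) - 2 * n + 1):
            gram = tuple(words[i:i + n])
            count, j = 1, i + n
            # Count how many times the same n-gram follows itself.
            while tuple(words[j:j + n]) == gram:
                count += 1
                j += n
            if count > best_count:
                best_gram, best_count = gram, count
    return best_gram, best_count


def looks_like_loop(text, min_repeats=3, max_ngram=5):
    """Flag a transcript segment whose longest consecutive repeat is suspicious.

    min_repeats is an assumed threshold; tune it on real transcripts.
    """
    _, count = longest_repeat_run(text.lower().split(), max_ngram)
    return count >= min_repeats
```

A check like this only flags candidates for review. Genuine speech can repeat too ("no, no, no"), so a flagged segment still needs to be compared against the audio.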

Noise and music, transcribed as speech

Point a recognizer at applause, traffic, a fan, or instrumental music, and it may hand back sentences. Not "[music]," but words.

What went wrong: the model is trained to find speech, and given speech-shaped noise it finds speech that is not there, much as the eye reads faces into random clouds. Without a reliable way to classify a segment as non-speech before transcribing it, the recognizer treats every input as if it must contain language.

Comparison with recognition errors

A misheard word usually announces itself. It is often slightly wrong in a way a reader can catch, and it may carry a low confidence score. A hallucination does the opposite. It is grammatical, on-topic enough to pass a skim, and frequently emitted with ordinary or high confidence, because the model is doing what it does well, generating fluent language, with no internal signal that the audio failed to justify it.

That combination defeats the reader's defenses. In a medical note, a legal transcript, or a compliance record, a confidently fabricated sentence is far more harmful than an obvious garble, because nobody flags it. It is also invisible to the most common quality metric: a transcript can post a respectable word error rate while containing an invented clause, since the metric averages over many words and the hallucination is only a few. This is part of why a single accuracy number is not enough, which is the argument made in beyond WER.
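The arithmetic behind that point can be made concrete. The sketch below computes word error rate as word-level edit distance divided by reference length, then compares two hypotheses against a synthetic 200-word reference: one with four scattered garbled words, one with a four-word invented clause appended. The reference text and the helper name `word_error_rate` are made up for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Synthetic 200-word reference (placeholder tokens, not real speech).
ref_words = [f"w{i}" for i in range(200)]
reference = " ".join(ref_words)

# Hypothesis A: four obvious garbles scattered through the transcript.
garbled = list(ref_words)
for idx in (10, 60, 110, 160):
    garbled[idx] = "??"

# Hypothesis B: a perfect transcript plus one invented four-word clause.
hallucinated = ref_words + ["thank", "you", "for", "watching"]

wer_garbled = word_error_rate(reference, " ".join(garbled))
wer_hallucinated = word_error_rate(reference, " ".join(hallucinated))
print(wer_garbled, wer_hallucinated)  # both 0.02
```

Both hypotheses score 2% WER. The metric counts four word-level edits in each case and cannot tell four isolated slips from one fabricated sentence.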

Methods for reducing hallucinations

The defenses attack the problem at several points. Gating with voice activity detection, so that silence and non-speech are never handed to the recognizer, removes the most common trigger. Explicit no-speech detection, where the model can output "nothing here" as a real answer, gives it the option it otherwise lacks. Decoding constraints (repetition penalties and limits on generating text unsupported by the audio) break the loops. And architecture matters: models whose output is tied more tightly to the audio frame by frame, such as transducer and CTC-style recognizers, have less freedom to wander off into pure language generation than attention-only sequence-to-sequence models, and tend to hallucinate less.
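As a rough illustration of the gating idea, the sketch below keeps only stretches of audio whose frame energy stays above a threshold, so that silent spans never reach the recognizer. This is a plain energy gate, a stand-in for a real voice activity detector: it removes silence but would pass loud music or noise, which a trained VAD is meant to reject. The function names and the default frame size, threshold, and minimum run length are assumptions chosen for the example.

```python
import math


def frame_rms(samples, frame_len):
    """Yield the RMS energy of each consecutive, non-overlapping frame."""
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        yield math.sqrt(sum(x * x for x in frame) / frame_len)


def speech_regions(samples, sample_rate=16000, frame_ms=30,
                   threshold=0.01, min_frames=3):
    """Return (start_sec, end_sec) spans where energy stays above threshold.

    samples: floats in [-1.0, 1.0]. Runs shorter than min_frames are dropped.
    """
    frame_len = sample_rate * frame_ms // 1000
    regions = []
    run_start = None
    # A trailing sentinel below any threshold closes a run that reaches the end.
    energies = list(frame_rms(samples, frame_len)) + [float("-inf")]
    for idx, energy in enumerate(energies):
        if energy >= threshold:
            if run_start is None:
                run_start = idx
        elif run_start is not None:
            if idx - run_start >= min_frames:
                regions.append((run_start * frame_len / sample_rate,
                                idx * frame_len / sample_rate))
            run_start = None
    return regions
```

In use, only the returned spans would be cut out and sent for transcription. Everything outside them is treated as non-speech and produces no text at all.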

None of these eliminates the problem on its own. The only way to know whether a system hallucinates on your audio is to test it on your audio, including the silent and noisy parts you might otherwise trim. That is the discipline of benchmarking speech-to-text yourself.
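One way to run that test is to collect clips known to contain no speech (silence, hold music, room noise) and count how many come back with words. The sketch below assumes a hypothetical `transcribe` callable that takes a clip and returns a string; it stands in for whatever recognizer is under test and is not a real API. Bracketed tags such as "[music]" are treated as acceptable non-speech labels, which is also an assumption.

```python
import re

# Bracketed annotations like "[music]" or "[applause]" are not counted as words.
TAG = re.compile(r"\[[^\]]*\]")


def silence_report(transcribe, non_speech_clips):
    """Run a recognizer over clips with no speech and report invented text.

    transcribe: hypothetical callable, clip -> transcript string.
    non_speech_clips: dict mapping a clip name to whatever transcribe accepts.
    """
    details = []
    for name, audio in non_speech_clips.items():
        raw = transcribe(audio)
        words = TAG.sub(" ", raw).split()
        details.append({"clip": name, "text": raw,
                        "invented_words": len(words)})
    flagged = sum(1 for d in details if d["invented_words"] > 0)
    return {"clips": len(details),
            "clips_with_text": flagged,
            "details": details}
```

Since every clip in the set is speech-free, any word in the output is invented, so the share of clips with text is a direct measure of how often the system hallucinates on that kind of audio.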

Common questions

What is the difference between a hallucination and a normal recognition error?

A normal error is mishearing a word that was spoken. A hallucination is producing words for audio that contained no corresponding speech, usually silence or noise. The first is a wrong guess about real speech; the second is fluent fiction, harder to catch because it looks correct.

Why does my transcript say "Thank you for watching" over silence?

Because some recognizers were trained on large amounts of video subtitles, which are full of sign-off boilerplate sitting over music and silence. The model learned to associate that kind of low-information audio with those phrases and reproduces them when it has nothing real to transcribe.

Are hallucinations flagged by low confidence?

Often not. Hallucinations are generated by the part of the model that produces fluent language, which has no signal that the audio failed to support the words, so they can arrive with ordinary or high confidence. You cannot rely on confidence alone to catch them.

How do I stop a recognizer from hallucinating?

Reduce the triggers and constrain the output: filter silence and non-speech with voice activity detection before transcribing, prefer systems that can explicitly report "no speech," apply repetition limits, and favor architectures whose output stays tied to the audio. Then verify on your own silent and noisy audio, because behavior varies a lot by system.

References

  1. Koenecke, A., Choi, A. S. G., Mei, K. X., Schellmann, H., & Sloane, M. (2024). Careless Whisper: Speech-to-Text Hallucination Harms. arXiv preprint arXiv:2402.08021.