Confidence scores in speech recognition: what they actually tell you

A token reported with a confidence of 0.97 reads like a promise: this word is 97 percent likely to be right. It is not a promise. Uncalibrated, the number mostly reflects how strongly the model preferred this token over its own alternatives, which is a statement about the model's inner contest, not about the world. The score can point at relative uncertainty. What it cannot do, out of the box, is measure accuracy.

How confidence scores are calculated

A recognizer does not output words directly. It outputs probabilities over many possible words and picks among them, and the probability of the word it picked, after some processing, becomes the confidence score.^[1] The score is a real internal quantity, not decoration. It reflects how dominant the chosen word was over its competitors at that moment.
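A toy sketch of that pipeline may make the origin concrete. It assumes the model emits one raw score (a logit) per candidate word and that the "processing" is a plain softmax; real recognizers typically work over subword tokens and combine scores in system-specific ways, so the vocabulary, logits, and function names here are purely illustrative.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pick_with_confidence(vocab, logits):
    """Return the winning word and the probability it won with."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return vocab[best], probs[best]

# Hypothetical candidates and scores, invented for illustration.
vocab = ["fifteen", "fifty", "fifth"]

# Clean audio: one candidate dominates, so confidence is high.
clear_word, clear_conf = pick_with_confidence(vocab, [6.0, 1.0, 0.5])

# Ambiguous audio: two candidates are nearly tied, so confidence drops.
torn_word, torn_conf = pick_with_confidence(vocab, [2.1, 2.0, 0.1])
```

Both calls pick "fifteen". Only the margin over the competitors differs, and the margin is all the score reports.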

That origin explains its character. When the audio is clean and one word clearly beats the alternatives, confidence is high and deserved. When two words were acoustically close, a heard "fifteen" that might have been "fifty," confidence drops, because the model was torn. The score is an honest report of internal uncertainty, but internal certainty is not the same as correctness.

Calibration

A confidence score is well-calibrated if, across all the words it scores 0.9, about 90 percent are actually correct. Calibration is the property that lets you read the number as a probability. Some systems are reasonably calibrated. Many are not.
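Checking this on your own data takes little code. The sketch below assumes you already have a list of (confidence, is_correct) pairs from aligning transcripts against a human reference; that alignment step is not shown. It groups words into confidence bins and compares each bin's average confidence with its observed accuracy. The summary number is expected calibration error, a commonly used metric: the size-weighted average gap between the two. The function names and the bin count are choices made for this example.

```python
def calibration_table(scored_words, n_bins=10):
    """Bin (confidence, is_correct) pairs and compare confidence to accuracy.

    Returns one row per non-empty bin:
    (bin_low, bin_high, count, mean_confidence, accuracy)
    """
    bins = [[0, 0, 0.0] for _ in range(n_bins)]  # count, correct, confidence sum
    for conf, ok in scored_words:
        i = min(int(conf * n_bins), n_bins - 1)  # a score of 1.0 goes in the top bin
        bins[i][0] += 1
        bins[i][1] += int(ok)
        bins[i][2] += conf
    rows = []
    for i, (count, correct, conf_sum) in enumerate(bins):
        if count:
            rows.append((i / n_bins, (i + 1) / n_bins, count,
                         conf_sum / count, correct / count))
    return rows

def expected_calibration_error(rows):
    """Size-weighted average of |mean confidence - accuracy| over the bins."""
    total = sum(row[2] for row in rows)
    return sum(row[2] / total * abs(row[3] - row[4]) for row in rows)

# Made-up sample: 20 words all scored 0.95, of which 17 are correct.
sample = [(0.95, True)] * 17 + [(0.95, False)] * 3
rows = calibration_table(sample)
ece = expected_calibration_error(rows)
```

In this made-up sample the words scored 0.95 are right 85 percent of the time, so the gap is about 0.10. A well-calibrated system would show gaps near zero in every bin.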

Neural models have a well-documented tendency to be overconfident. They assign high scores more freely than their accuracy justifies, so a population of 0.95 words might be right only 85 percent of the time. This is not a quirk of speech; it shows up across modern classifiers, and it has gotten worse as networks have grown.^[2] The consequence is concrete: you cannot assume the raw number is a true probability, and you cannot compare it across two systems, whose scales may mean entirely different things.

Causes of incorrect high-confidence output

The divergence is not uniform. It is worst on exactly the words you care about most.

Alphanumerics are the clearest case. A digit string has no grammar to anchor it, so the model can settle on a wrong digit with high confidence because nothing competed. ASR hallucinations, where a recognizer invents fluent words from silence or noise, arrive with ordinary-looking confidence, because the model is generating plausible language and has no inner alarm that the audio did not support it. Rare names the model has quietly replaced with common words score high, since the common word was its best bet.

Confidence is most trustworthy on easy words, where you did not need help, and least trustworthy on hard words, where you did.

Appropriate uses of confidence scores

None of this makes the score useless. It makes it a tool for triage, something that tells you where to look rather than a verdict on what is right.

The strong use is relative, within a single transcript from a single system: low-confidence regions deserve more scrutiny than high-confidence ones. That is enough to power real features. Route low-confidence segments to human review instead of checking everything. Have a voice agent ask "sorry, was that fifteen or fifty?" when a critical number comes back shaky. Flag uncertain words for an editor. Gate an irreversible action on the confidence of the words that trigger it.
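Two of those features fit in a few lines. This is a minimal sketch, assuming each word arrives with a confidence and that your application marks which words are critical (amounts, names, commands). The Word class, the review budget, and the 0.8 cutoff are placeholders invented for the example, not recommended values.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    confidence: float
    critical: bool = False  # set by the application, e.g. for amounts or names

def review_queue(words, budget=3):
    """Rank words within one transcript and return the least confident first."""
    return sorted(words, key=lambda w: w.confidence)[:budget]

def needs_clarification(words, cutoff=0.8):
    """Critical words shaky enough to ask the speaker about before acting."""
    return [w for w in words if w.critical and w.confidence < cutoff]

# Hypothetical transcript, invented for illustration.
transcript = [
    Word("transfer", 0.98),
    Word("fifteen", 0.55, critical=True),
    Word("dollars", 0.97),
    Word("to", 0.99),
    Word("Ana", 0.70),
]

to_review = review_queue(transcript, budget=2)
to_confirm = needs_clarification(transcript)
```

The review queue uses only the ordering of scores inside one transcript, which is the part of the signal that holds up. The clarification check does use an absolute cutoff, so that number should come from your own data.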

The weak-but-real use is aggregate. A transcript whose confidence collapses across a whole stretch is telling you something about the audio, usually that it got noisy, distant, or overlapping. That is a useful quality flag even when individual scores are unreliable.
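One way to surface such a stretch is a rolling average over per-word confidence. The sketch below flags any run of words covered by a window whose mean falls under a floor. The window size and floor are arbitrary illustrations and would need tuning against your own audio.

```python
def low_confidence_spans(confidences, window=5, floor=0.6):
    """Return (start, end) word-index pairs, end exclusive, for stretches
    covered by at least one window whose mean confidence is below `floor`."""
    flagged = [False] * len(confidences)
    for start in range(len(confidences) - window + 1):
        chunk = confidences[start:start + window]
        if sum(chunk) / window < floor:
            for i in range(start, start + window):
                flagged[i] = True

    # Merge consecutive flagged indices into spans.
    spans, span_start = [], None
    for i, bad in enumerate(flagged):
        if bad and span_start is None:
            span_start = i
        elif not bad and span_start is not None:
            spans.append((span_start, i))
            span_start = None
    if span_start is not None:
        spans.append((span_start, len(flagged)))
    return spans

# Made-up scores: clean start, a collapse in the middle, then recovery.
scores = [0.95, 0.90, 0.92, 0.40, 0.35, 0.50, 0.45, 0.93, 0.96, 0.94]
bad_stretches = low_confidence_spans(scores, window=3, floor=0.6)
```

Because whole windows are flagged, a span can include a high-scoring word at its edge. That is acceptable for a quality flag, whose job is to point at a region of audio rather than judge single words.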

Use confidence to	Do not use confidence to
Rank which words to review first	Prove a word is correct
Trigger a clarifying question	Compare quality across two systems
Flag uncertain spans for an editor	Replace measuring accuracy on real data
Detect audio that went bad	Treat 0.9 as "90% right" without checking

A working split: confidence triages effort, it does not certify words. The right column is where people get burned.

Practical use of confidence thresholds

Two habits keep confidence honest. First, calibrate the threshold against your own data: transcribe real audio, see what confidence level separates your correct and incorrect words, and set the cutoff there rather than at a number that sounded reasonable. The right threshold is a property of your audio and your model together, not a universal constant. Second, never let confidence replace measurement. The only way to know how accurate a system is on your audio is to measure it on a real sample, with word error rate and with entity-level checks that go beyond WER.
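The first habit can be sketched as a sweep over candidate cutoffs. It assumes a labeled sample of (confidence, is_correct) pairs from your own audio. For each cutoff it reports how much of the transcript would be flagged and what share of the real errors that flagging would catch. The 80 percent target and the function names are placeholders for the example.

```python
def sweep_thresholds(scored_words, cutoffs):
    """For each cutoff, measure review load and the share of errors caught
    when every word scoring below the cutoff is flagged."""
    total = len(scored_words)
    total_errors = sum(1 for _, ok in scored_words if not ok)
    rows = []
    for cutoff in cutoffs:
        flagged = [(conf, ok) for conf, ok in scored_words if conf < cutoff]
        caught = sum(1 for _, ok in flagged if not ok)
        rows.append({
            "cutoff": cutoff,
            "review_load": len(flagged) / total,
            "errors_caught": caught / total_errors if total_errors else 0.0,
        })
    return rows

def pick_cutoff(rows, min_errors_caught=0.8):
    """Lowest cutoff that catches the target share of errors, or None."""
    for row in sorted(rows, key=lambda r: r["cutoff"]):
        if row["errors_caught"] >= min_errors_caught:
            return row["cutoff"]
    return None

# Made-up labeled sample; note the confident error at 0.95.
labeled = [
    (0.99, True), (0.97, True), (0.95, False), (0.90, True), (0.85, True),
    (0.70, False), (0.60, True), (0.55, False), (0.40, False), (0.30, False),
]
rows = sweep_thresholds(labeled, cutoffs=[0.5, 0.75, 0.9])
chosen = pick_cutoff(rows, min_errors_caught=0.8)
```

In this sample the error scored 0.95 is never caught by any of the cutoffs tried. No threshold rescues a confident mistake, which is why the second habit, measuring accuracy directly, still matters.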

Common questions

Does a confidence of 0.95 mean the word is 95% likely to be correct?

Only if the system is well-calibrated, and neural recognizers are documented to be overconfident: a population scored 0.95 might be right only 85 percent of the time. Read it as the model's self-estimate, good for ranking words within one transcript, never as a literal probability you can trust without checking it on your own data.

Can a recognizer be confident and wrong at the same time?

Yes, and the most dangerous errors are often the most confident ones. Confidence measures agreement among the model's own hypotheses, not agreement with reality, so a misheard name or a hallucinated phrase scores high precisely because nothing in its candidates disagreed. Confidence is most trustworthy on easy words and least trustworthy on the hard ones where you needed it.

Should I drop or hide words below a confidence threshold?

No, not silently, and not on a borrowed cutoff. The right threshold is a property of your audio and model together, found by transcribing real audio and seeing what level separates your correct and incorrect words. Use confidence to route uncertain spans to review or trigger a clarifying question, where the score's relative ranking within one transcript actually holds.

Can I compare confidence scores between two speech-to-text providers?

No. Each system computes and scales confidence differently, so a 0.9 from one means nothing relative to a 0.9 from another, and the scores are honest only within a single transcript from a single system. Compare providers by measuring accuracy on the same audio with word error rate and entity-level checks.
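For reference, word error rate is the word-level edit distance between a human reference and the system's output, divided by the number of reference words. Below is a minimal sketch that assumes both strings are already normalized the same way (casing, punctuation, number formatting), which in practice is most of the work. The provider outputs are invented.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

# Same audio, same reference, two hypothetical provider outputs.
reference = "send fifteen dollars to ana"
wer_a = word_error_rate(reference, "send fifty dollars to ana")
wer_b = word_error_rate(reference, "send fifteen dollars to anna now")
```

Here provider A scores lower on WER, yet its single error changes the amount. That is the kind of mistake entity-level checks exist to catch.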

References

[1] Ogawa, A., Tawara, N., Kano, T., & Delcroix, M. (2023). BLSTM-Based Confidence Estimation for End-to-End Speech Recognition. arXiv preprint arXiv:2312.14609.
[2] Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On Calibration of Modern Neural Networks. Proceedings of the 34th International Conference on Machine Learning (ICML), PMLR 70.