Speech recognition evaluation beyond WER

Entity, severity, and task measures beyond aggregate word errors

Updated June 29, 2026

A transcript with 3% WER may nevertheless contain an incorrect dosage, a one-digit error in an account number, or a deleted negation in "the contract is not binding." WER assigns the same unit cost to each word error and therefore does not represent the unequal consequences of these cases.

Equal weighting of word errors

This is the root failure, and the others grow from it. WER computes the word-level edit distance between the transcript and a reference (substitutions, deletions, and insertions), then divides by the number of words in the reference. A dropped "um" and a mangled drug name each count as one error. But words are not worth the same. In most transcripts a handful of words (the names, the numbers, the key terms) carry nearly all the meaning, and the rest is connective tissue. WER weighs the load-bearing beam and the wallpaper on the same scale.

So the metric optimizes for the average word while your users care about the critical word. A system tuned to drive WER down will trade a correct rare name for a correct common word, because both move the number by the same amount and the rare name is harder to get.
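A minimal sketch of the calculation makes the equal weighting concrete. The sentences are made up for illustration, and the function is a plain word-level Levenshtein distance rather than any particular toolkit's scorer.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the ref words processed so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # match (cost 0) or substitution (cost 1)
            ))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences (assumed, not from a real transcript)
reference = "um the patient takes metoprolol twice daily"
dropped_filler = "the patient takes metoprolol twice daily"
wrong_drug = "um the patient takes metformin twice daily"

print(wer(reference, dropped_filler))
print(wer(reference, wrong_drug))
```

Both hypotheses contain exactly one word error against a seven-word reference, so they receive the same score even though only one of them changes the medication.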

Errors in names, numbers, and entities

The words that matter most are usually entities: people, places, organizations, dates, quantities, identifiers. They are also a small fraction of the total word count, so getting all of them wrong barely moves WER.

A transcript can be 97 percent accurate overall and 0 percent accurate on phone numbers, because the phone number was ten words out of a thousand and getting all ten wrong costs a single point of WER. The metric drowns the entity errors in correct filler. This is why the alphanumerics problem is invisible to WER and obvious to anyone who tries to use the transcript.
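The arithmetic behind that example can be written out directly. The counts below are assumptions chosen to match the figures in the text, and every error is treated as a single substitution so that one wrong word costs one edit.

```python
# Hypothetical 1,000-word transcript containing one ten-digit phone number.
total_words = 1000
entity_words = 10    # the spoken digits of the phone number
entity_errors = 10   # every digit transcribed wrong
other_errors = 20    # assumed errors scattered through the remaining 990 words

wer = (entity_errors + other_errors) / total_words
entity_error_rate = entity_errors / entity_words

print(f"WER: {wer:.1%}")
print(f"Entity error rate: {entity_error_rate:.0%}")
```

Under these assumptions the phone number contributes one percentage point to WER while its own error rate is total.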

How WER represents hallucinations

A recognizer that invents a fluent clause out of silence (see ASR hallucinations) adds a few wrong words to a long transcript.

Those few words are a rounding error in WER, so a metric-driven evaluation can wave a hallucinating system through as "accurate." But the danger of a hallucination has nothing to do with its word count. It is fluent, confident fiction dropped into a record people will later trust. WER has no concept of "this sentence was never spoken," only "these N words differ from the reference," and N is small.
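One rough way to surface this is to look for runs of hypothesis words that align to nothing in the reference. The sketch below uses Python's `difflib.SequenceMatcher` as a stand-in aligner, which is not the same alignment a WER scorer produces. The `inserted_runs` helper, the `min_len` threshold, and the placeholder tokens are assumptions for illustration, and the sketch only catches pure insertions.

```python
from difflib import SequenceMatcher


def inserted_runs(ref_words, hyp_words, min_len=3):
    """Return runs of at least min_len hypothesis words with no counterpart in the reference."""
    # autojunk=False disables difflib's popularity heuristic on long sequences
    matcher = SequenceMatcher(None, ref_words, hyp_words, autojunk=False)
    runs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert" and (j2 - j1) >= min_len:
            runs.append(" ".join(hyp_words[j1:j2]))
    return runs


# Synthetic 500-word reference built from placeholder tokens (w0, w1, ...)
ref_words = [f"w{i}" for i in range(500)]
invented = "please subscribe to my channel".split()
hyp_words = ref_words[:250] + invented + ref_words[250:]

runs = inserted_runs(ref_words, hyp_words)
insertion_rate = len(invented) / len(ref_words)

print(runs)
print(f"Contribution to WER: {insertion_rate:.1%}")
```

Five invented words in a 500-word reference add one point of WER, yet the flagged run is a whole clause that nobody said. Reporting such runs separately keeps them from disappearing into the average.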

Errors in reference transcripts

WER assumes there is one correct transcript to measure against. Spoken language does not cooperate.

"Going to" versus "gonna," "OK" versus "okay," whether to transcribe the false start in "I think, I mean, yes," whether "twenty three" should be words or "23": all legitimate choices, and the reference picks one. A system that makes a different but equally valid choice is penalized as if it erred. So part of any raw WER is not error at all. It is disagreement with the reference's conventions around formatting, contractions, and disfluencies. Comparing two systems on WER without controlling for this measures formatting styles as much as accuracy.

Effects of text formatting

Run WER on fully normalized text (lowercased, punctuation stripped, numbers spelled out) and you erase real differences: "$23" and "23 dollars" and "twenty three dollars" all collapse to the same thing, so a system that formats badly looks fine. Run WER on raw formatted text and you punish a system for a comma, inflating the number with punctuation and inverse text normalization (ITN) disagreements that may not matter for your use case.

There is no neutral choice. Whether formatting should count depends on what you are building, and a single WER number buries that decision instead of exposing it.
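One way to expose that decision is to score the same pair twice, once on raw text and once after normalization, and report both. The sketch below assumes a tiny hand-written convention table (`CONVENTIONS`) and a simple `normalize` helper. A real normalizer would need far more rules, and the sentences are made up.

```python
import re

# Hypothetical convention table; each project would maintain its own.
CONVENTIONS = {"gonna": "going to", "ok": "okay", "23": "twenty three"}


def normalize(text):
    """Lowercase, strip punctuation, and map variant forms onto one convention."""
    words = re.sub(r"[^\w\s]", "", text.lower()).split()
    out = []
    for word in words:
        out.extend(CONVENTIONS.get(word, word).split())
    return out


def word_errors(ref, hyp):
    """Word-level edit distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


reference = "OK, I'm going to send twenty three forms."
hypothesis = "okay i'm gonna send 23 forms"

raw_errors = word_errors(reference.split(), hypothesis.split())
normalized_errors = word_errors(normalize(reference), normalize(hypothesis))

print(f"raw: {raw_errors} errors in {len(reference.split())} reference words")
print(f"normalized: {normalized_errors} errors")
```

Every raw error in this pair is a convention or formatting difference, and all of them vanish after normalization. Reporting the two numbers side by side shows how much of a score is style and how much is recognition.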

Additional evaluation measures

Do not abandon WER. Surround it. Keep it as the average and add the measurements that catch what the average buries.

Measure entities directly. Compute an entity error rate, or precision and recall on the names, numbers, and terms that matter, so the metric weights them the way your users do. Separate the layers: report WER on normalized text to isolate recognition from formatting, and score formatting on its own. Then measure the task, not the transcript. If the audio feeds a voice agent or an extraction pipeline, the honest metric is whether the downstream job succeeded. Did the agent take the right action? Did the right fields get filled? Last, read the failures. A sample of real transcripts, read by a person, surfaces problems no aggregate will.
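A sketch of the first and third measurements, assuming the entities and fields have already been extracted from the reference and the hypothesis by some upstream step that is not shown. The function names, the field names, and the values are hypothetical.

```python
from collections import Counter


def entity_precision_recall(ref_entities, hyp_entities):
    """Precision and recall over entity strings, counting duplicates separately."""
    ref_counts, hyp_counts = Counter(ref_entities), Counter(hyp_entities)
    matched = sum((ref_counts & hyp_counts).values())  # multiset intersection
    precision = matched / sum(hyp_counts.values()) if hyp_counts else 0.0
    recall = matched / sum(ref_counts.values()) if ref_counts else 0.0
    return precision, recall


def fields_correct(expected, extracted):
    """Fraction of expected fields that the downstream pipeline filled correctly."""
    if not expected:
        raise ValueError("expected must contain at least one field")
    return sum(extracted.get(key) == value for key, value in expected.items()) / len(expected)


# Made-up values for illustration
ref_entities = ["metoprolol", "50 mg", "4471"]
hyp_entities = ["metformin", "50 mg", "4417"]
precision, recall = entity_precision_recall(ref_entities, hyp_entities)

expected = {"drug": "metoprolol", "dose": "50 mg", "account": "4471"}
extracted = {"drug": "metformin", "dose": "50 mg", "account": "4417"}
task_score = fields_correct(expected, extracted)

print(f"entity precision {precision:.2f}, recall {recall:.2f}")
print(f"fields correct {task_score:.2f}")
```

Exact string matching is the simplest possible comparison. Whether "50 mg" and "fifty milligrams" should count as the same entity is the formatting decision discussed earlier, and it has to be made explicitly here as well.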

Most of all, measure on your own audio. A leaderboard WER on someone else's clean test set tells you little about your noisy, accented, jargon-heavy calls. That is the entire premise of benchmarking speech-to-text yourself.

Common questions

Is word error rate a bad metric?

No, it is necessary but not sufficient. WER weights every word equally, so it optimizes for the average word while your users care about the critical word, the name or number a system will trade away for an easier common word. Use it as the starting point, then surround it.

How can a transcript have low WER but still be unusable?

Because a transcript can hit 5 percent WER and 100 percent entity error rate at the same time. The words that carry the meaning are a small fraction of the total, so getting every account number, dosage, and proper noun wrong barely moves the score while making every fact you extract wrong.

What should I measure besides WER?

Measure the task, not the transcript: where the audio feeds a voice agent or an extraction pipeline, the honest metric is whether the agent took the right action or the right fields got filled. Add an entity error rate for names and numbers, formatting scored apart from recognition, and a human read of real samples. And measure it all on your own audio, not a leaderboard test set.

Why do two providers report very different WER on the same data?

Often the reference and the formatting rules account for the gap as much as actual accuracy does. WER assumes one correct transcript, but "gonna" versus "going to" and "$23" versus "twenty three dollars" are legitimate choices. The reference picks one of them, and whether the other counts as an error is a scoring decision. Run the comparison on your own data with your own choices and it stops measuring formatting styles.
