A system that transcribes "recognize speech" as "wreck a nice beach" has a WER of 200%: both reference words are substituted and two extra words are inserted. Every word is wrong, yet the output is a perfectly fluent English phrase. That is the first thing to understand about WER: it does not care whether the output is plausible, only whether it matches the reference word for word.
Interpreting WER requires attention to alignment, text normalization, and evaluation conditions.
Calculating word error rate
The formula looks trivial: WER = (S + D + I) / N, where S, D, and I are the counts of substituted, deleted, and inserted words, and N is the number of words in the reference (the human-written "correct" transcript). The work is in getting S, D, and I, because before you can count errors you have to decide which output word corresponds to which reference word. That decision is called alignment, and it is where all the difficulty lives.
Alignment is solved with edit distance, specifically the Levenshtein distance computed at the level of whole words rather than characters. The algorithm finds the smallest number of single-word edits (substitute, delete, insert) that transforms the reference into the hypothesis. The direction matters for the labels: a deletion is a reference word missing from the output, and an insertion is an output word with no counterpart in the reference. "Smallest" matters too: there is usually more than one way to line up two sentences, and WER is defined by the cheapest one. A dynamic-programming table fills in the minimum cost cell by cell, and the path back through it tells you which words were substituted, dropped, or invented.
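The table-filling and backtrace described above can be sketched in Python as follows. This is a minimal illustration, not a reference scorer: the names `EditCounts` and `align_counts` are made up for this example, words are split on whitespace with no normalization, and ties between equally cheap alignments are broken in favor of substitution, then deletion, then insertion, which is an arbitrary choice.

```python
from typing import NamedTuple


class EditCounts(NamedTuple):
    """Hypothetical container for the three error counts and N."""
    substitutions: int
    deletions: int
    insertions: int
    ref_len: int

    @property
    def wer(self) -> float:
        total = self.substitutions + self.deletions + self.insertions
        return total / self.ref_len


def align_counts(reference: str, hypothesis: str) -> EditCounts:
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)
    if n == 0:
        raise ValueError("WER is undefined for an empty reference")

    # cost[i][j] = fewest word edits turning ref[:i] into hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i          # delete every reference word
    for j in range(1, m + 1):
        cost[0][j] = j          # insert every hypothesis word
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Walk back from the bottom-right cell to label each edit.
    subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        mismatch = i > 0 and j > 0 and ref[i - 1] != hyp[j - 1]
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + mismatch:
            subs += mismatch    # a match adds 0, a substitution adds 1
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            dels += 1           # reference word with no output word
            i -= 1
        else:
            ins += 1            # output word with no reference word
            j -= 1
    return EditCounts(subs, dels, ins, n)


counts = align_counts("the cat sat on the mat", "the cat on a mat quietly")
print(counts)      # EditCounts(substitutions=1, deletions=1, insertions=1, ref_len=6)
print(counts.wer)  # 0.5
```

The total edit count is fixed by the minimum cost, but when several alignments tie, the split between S, D, and I can depend on the tie-breaking order, which is one reason per-category counts from different tools do not always agree.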
A worked micro-example. The reference, "the cat sat on the mat," has six words.
| Position | Reference | Hypothesis | Edit |
|---|---|---|---|
| 1 | the | the | match |
| 2 | cat | cat | match |
| 3 | sat | (none) | deletion |
| 4 | on | on | match |
| 5 | the | a | substitution |
| 6 | mat | mat | match |
| extra | (none) | quietly | insertion |
One deletion, one substitution, one insertion. S + D + I = 3, and N = 6, so WER = 3 / 6 = 50%. The hypothesis was "the cat on a mat quietly," which reads fine and is half wrong.
Why WER can exceed 100%
New engineers assume WER is capped at 100%, because "you can't get more than all the words wrong." You can. The denominator is the number of reference words, but insertions are not bounded by it. If the reference is three words and the system hallucinates a forty-word sentence, the insertions alone blow past N. A noisy clip where the model invents speech during silence can score 200% or more. The metric is working correctly: it is reporting that the output is much longer than, and unrelated to, the truth. If your evaluation script clips WER at 100%, it is lying to you about your worst cases.
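A short Python sketch makes the effect of clipping concrete. The helper names `word_edit_distance` and `wer`, and the example sentences, are invented for illustration; the distance is a plain word-level Levenshtein computed with two rows, with no text normalization applied.

```python
def word_edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance, keeping only two rows of the table."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        curr = [i]
        for j, h in enumerate(hyp_words, start=1):
            curr.append(min(
                prev[j - 1] + (r != h),  # match or substitution
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("WER is undefined for an empty reference")
    return word_edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# A three-word reference and a nine-word hypothesis with no words in common.
reference = "call me back"
hypothesis = "thank you for watching please subscribe to the channel"

raw = wer(reference, hypothesis)
clipped = min(raw, 1.0)  # what a script that caps WER at 100% would report

print(raw)      # 3.0
print(clipped)  # 1.0
```

The clipped value makes this clip look identical to one where the system simply got three words wrong, which is exactly the information a worst-case analysis needs to keep.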
WER is not accuracy
People casually report "95% accuracy" by subtracting WER from 100%. That works as a rough gloss, but fails as a definition. Word accuracy is sometimes written as 1 minus WER, but because WER can exceed 100%, accuracy defined this way can go negative, which no honest scoreboard wants to print. WER is an error rate, not an accuracy rate, and the two only line up when insertions are rare. Treat "WER" and "accuracy" as cousins, not synonyms, and never let a marketing page convince you that 4% WER and 96% accuracy are the same well-defined claim.
Information not captured by WER
WER weights every word error equally. Transcribing "their" as "there" costs exactly one error. Transcribing the drug name "Klonopin" as "clonidine" also costs one error, even though one is a homophone nobody will misread and the other could change a prescription. The metric has no concept of which words matter.
It is also blind, by default, to almost everything that makes a transcript usable. Standard WER scoring lowercases the text, strips punctuation, and removes formatting before comparing, because otherwise every difference in capitalization and every stray comma would count as an error and the numbers would not be comparable across systems. So the reported WER usually says nothing about whether the system got the capital letters, the sentence boundaries, the dollar signs, the phone-number grouping, or the spelling of a person's name right. Those are handled downstream, in punctuation and inverse text normalization and in alphanumerics, and they are frequently the errors your users notice. A 6% WER transcript that mangles every account number is worse, for a banking app, than an 8% WER transcript that gets them all right.
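To see how much formatting disappears before scoring, here is a deliberately simplistic normalizer in Python. It is an assumption about what "lowercase and strip punctuation" might look like, not the rule set of any particular scoring tool; real normalizers usually treat apostrophes, hyphens, and numbers more carefully. The name `normalize` and the example sentences are hypothetical.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> list[str]:
    """Lowercase, drop ASCII punctuation, and split on whitespace."""
    return text.lower().translate(_PUNCT_TABLE).split()


formatted = "Wire $1,200 to Dr. Smith."
unformatted = "wire 1200 to dr smith"

print(normalize(formatted))    # ['wire', '1200', 'to', 'dr', 'smith']
print(normalize(unformatted))  # ['wire', '1200', 'to', 'dr', 'smith']
```

After normalization the two transcripts are identical, so a scorer built on this step would report 0% WER for either one, even though only the first is something a user could read as written.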
WER alone misleads. For real products, entity errors, formatting, and whether the system invents words it never heard (see ASR hallucinations) usually matter more than a half-point of overall WER. The full argument, and what to measure instead, lives in beyond WER.
There is one more soft spot, upstream of the math entirely: the reference transcript is a judgment call. Someone decided whether "gonna" is transcribed as "gonna" or "going to," whether "uh" and "um" are kept or dropped, whether "twenty twenty four" or "2024" is correct, and whether the speaker who trailed off said "I think so" or "I think, so." Change the reference convention and the WER changes without the audio or the model changing at all. Two vendors reporting WER on "the same" dataset may be scoring against subtly different ground truths, which is one reason cross-vendor WER comparisons deserve suspicion.
The bars above carry no real measurements. Quoting a single WER for a system is like quoting a single speed for a car without saying whether it was going downhill; always name the dataset and the conditions. To produce numbers you can trust on your own audio, the procedure is in how to benchmark STT yourself.
Common questions
What is a good WER?
There is no universal threshold, because WER depends entirely on the audio. On clean, scripted, single-speaker read speech, modern systems land in the low single digits. On noisy, accented, overlapping, or domain-heavy audio (medical, legal, names and numbers), the same system can be several times worse. A WER quoted without its dataset and conditions is close to meaningless. Judge a system on audio that resembles yours, not a vendor's best-case demo set.
Can WER be more than 100%?
Yes. Because insertions are not limited by the number of reference words, a system that produces far more words than the truth, or that hallucinates speech during silence, can score above 100%. A WER of 150% means the edit distance is 1.5 times the reference length. Any tool that caps WER at 100% is hiding your worst failures.
Is WER the same as accuracy?
Not exactly. Accuracy is loosely defined as 1 minus WER, but since WER can exceed 100%, that subtraction can produce a negative "accuracy," which is nonsense. WER is an error rate. Treat "96% accuracy" as a loose restatement of "4% WER" rather than a separately measured quantity, and remember that both numbers are blind to capitalization, punctuation, and whether the right names came out.
Why do two vendors report different WER on the same dataset?
Usually because they are not scoring against quite the same reference. Decisions about contractions, filler words, numbers written as digits versus words, and text normalization all change the error count without changing the model. Different scoring scripts and normalization rules produce different numbers from identical transcripts. This is why your own benchmark, scored consistently across systems, beats any published comparison.
Does WER measure punctuation and capitalization?
By default, no. Standard WER scoring lowercases text and strips punctuation before comparing, so the headline number says nothing about formatting. Formatting quality is measured separately, and for many applications it matters more than raw WER. See punctuation and inverse text normalization.
Related concepts
- Beyond WER: what to measure instead
- What is speech recognition (ASR)?
- Punctuation and inverse text normalization
- ASR hallucinations
- Confidence scores in speech recognition
Building with Soniox? See how transcription output and tokens are structured in the Speech-to-Text documentation.