What is Word Error Rate (WER)? How STT accuracy is measured

A recognizer that hears "euthanasia" and writes "youth in Asia" made one mistake by ear and three by the arithmetic: a substitution plus two insertions against a one-word reference, for a WER of 300 percent. That is the first thing to understand about the metric. It does not care whether the output is plausible, only whether it matches the reference word for word, and its arithmetic has sharp corners.

Calculating word error rate

The formula looks trivial: WER = (S + D + I) / N, where S, D, and I are the counts of substituted, deleted, and inserted words, and N is the number of words in the reference (the human-written "correct" transcript). The work is in getting S, D, and I, because before you can count errors you have to decide which output word corresponds to which reference word. That decision is called alignment, and it is where all the difficulty lives.

Alignment is solved with edit distance, specifically the Levenshtein distance computed at the level of whole words rather than characters. The algorithm finds the smallest number of single-word edits (substitute, delete, insert) that transforms the reference into the hypothesis, so a deletion is a reference word the system dropped and an insertion is a word it added. "Smallest" matters: there is usually more than one way to line up two sentences, and WER is defined by the cheapest one. A dynamic-programming table fills in the minimum cost cell by cell, and the path back through it tells you which words were substituted, dropped, or invented.

Figure: the reference transcript and the hypothesis from STT are aligned by edit distance; S, D, and I are counted from that alignment; WER = (S + D + I) / N.

Alignment is a shortest-path problem: the cheapest set of word edits that turns the reference into the hypothesis.
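The alignment step can be sketched in a few lines of Python. This is a minimal illustration, not the code of any particular scoring tool: the names `count_errors` and `WerCounts` are made up for this example, words are split on whitespace with no normalization, every edit costs 1, and ties between equally cheap alignments are broken in favor of the diagonal (match or substitution) move. Real scorers may use different tie-breaking and edit weights, which can change the S/D/I split.

```python
from typing import NamedTuple


class WerCounts(NamedTuple):
    """Hypothetical container for the three error counts and N."""
    substitutions: int
    deletions: int
    insertions: int
    ref_words: int

    @property
    def wer(self) -> float:
        # Deliberately not clipped: insertions can push this above 1.0.
        errors = self.substitutions + self.deletions + self.insertions
        return errors / self.ref_words


def count_errors(reference: str, hypothesis: str) -> WerCounts:
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # cost[i][j] = minimum number of edits between ref[:i] and hyp[:j].
    cost = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        cost[i][0] = i  # only deletions can empty the reference prefix
    for j in range(1, len(hyp) + 1):
        cost[0][j] = j  # only insertions can build the hypothesis prefix
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            diagonal = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diagonal, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Walk back from the bottom-right cell to recover which edits were used.
    subs = dels = ins = 0
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        mismatch = i > 0 and j > 0 and ref[i - 1] != hyp[j - 1]
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + mismatch:
            subs += mismatch  # a match adds 0, a substitution adds 1
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return WerCounts(subs, dels, ins, len(ref))


counts = count_errors("euthanasia", "youth in asia")
print(counts, counts.wer)
# → WerCounts(substitutions=1, deletions=0, insertions=2, ref_words=1) 3.0
```

Run on the opening example, the sketch reproduces the one substitution and two insertions, and the unclipped rate of 3.0 (300%).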

A worked micro-example. The reference has six words.

Position	Reference	Hypothesis	Edit
1	the	the	match
2	cat	cat	match
3	sat	(none)	deletion
4	on	on	match
5	the	a	substitution
6	mat	mat	match
extra	(none)	quietly	insertion

One deletion, one substitution, one insertion. S + D + I = 3, and N = 6, so WER = 3 / 6 = 50%. The hypothesis was "the cat on a mat quietly," which reads fine and is half wrong.

Why WER can exceed 100%

New engineers assume WER is capped at 100%, because "you can't get more than all the words wrong." You can. The denominator is the number of reference words, but insertions are not bounded by it. If the reference is three words and the system hallucinates a forty-word sentence, the insertions alone blow past N. A noisy clip where the model invents speech during silence can score 200% or more. The metric is working correctly: it is reporting that the output is much longer than, and unrelated to, the truth. If your evaluation script clips WER at 100%, it is lying to you about your worst cases.
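The arithmetic behind that claim can be made explicit. Writing H for the number of words in the hypothesis (a symbol introduced here for illustration, not part of the standard formula), every hypothesis word is either aligned to a reference word or counted as an insertion. That gives a floor on WER that depends only on the two lengths:

```latex
\[
H = N - D + I
\;\Longrightarrow\;
I = H - N + D \;\ge\; H - N
\;\Longrightarrow\;
\mathrm{WER} = \frac{S + D + I}{N} \;\ge\; \frac{H - N}{N}
\]
\[
N = 3,\quad H = 40:\qquad
\mathrm{WER} \;\ge\; \frac{40 - 3}{3} = \frac{37}{3} \approx 12.33 = 1233\%
\]
```

So for the three-word reference and forty-word hallucination, the length mismatch alone guarantees a WER above 1200%, before a single substitution is counted.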

WER is not accuracy

People casually report "95% accuracy" by subtracting WER from 100%. That works as a rough gloss, but fails as a definition. Word accuracy is sometimes written as 1 minus WER, but because WER can exceed 100%, accuracy defined this way can go negative, which no honest scoreboard wants to print. WER is an error rate, not an accuracy rate, and the two only line up when insertions are rare. Treat "WER" and "accuracy" as cousins, not synonyms, and never let a marketing page convince you that 4% WER and 96% accuracy are the same well-defined claim.

Information not captured by WER

WER counts word substitutions as equal. Transcribing "their" as "there" costs exactly one error. Transcribing the drug name "Klonopin" as "clonidine" also costs one error, even though one is a homophone nobody will misread and the other could change a prescription. The metric has no concept of which words matter.

It is also blind, by default, to almost everything that makes a transcript usable. Standard WER scoring lowercases the text, strips punctuation, and removes formatting before comparing, because otherwise every difference in capitalization or every comma would count as an error and the numbers would be unusable across systems. So the reported WER usually says nothing about whether the system got the capital letters, the sentence boundaries, the dollar signs, the phone-number grouping, or the spelling of a person's name right. Those are handled downstream (see punctuation and inverse text normalization, and alphanumerics), and they are frequently the errors your users notice. A 6% WER transcript that mangles every account number is worse, for a banking app, than an 8% WER transcript that gets them all right.
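A rough sketch of what that pre-scoring normalization does, assuming the simplest possible rules: lowercase everything, delete ASCII punctuation, split on whitespace. The function name `normalize_for_wer` is hypothetical, and real scoring pipelines apply their own, usually more elaborate, rule sets (contractions, number formats, filler words), so treat this as an illustration of the effect rather than a reference implementation.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def normalize_for_wer(text: str) -> list[str]:
    """Lowercase, drop ASCII punctuation, and split into words (assumed rules)."""
    return text.lower().translate(_STRIP_PUNCT).split()


reference = "Dr. Smith paid $1,200."
hypothesis = "dr smith paid 1200"

print(normalize_for_wer(reference))   # → ['dr', 'smith', 'paid', '1200']
print(normalize_for_wer(hypothesis))  # → ['dr', 'smith', 'paid', '1200']
```

After normalization the two strings are identical, so this pair scores 0% WER even though the hypothesis lost the capitalization, the period, the dollar sign, and the digit grouping.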

WER alone misleads. For real products, entity errors, formatting, and whether the system invents words it never heard (see ASR hallucinations) usually matter more than a half-point of overall WER [1]. There is also a soft spot upstream of the math entirely: the reference transcript is a judgment call, and scoring conventions can move the number without the audio or the model changing at all. The full argument, and what to measure instead, lives in beyond WER.

Figure: "WER by condition (illustrative)". Bar chart of WER percent (y-axis, 0 to 40) by condition: clean read 4, accented 11, noisy 18, overlapping 35.

Illustrative only. WER is not one number per system; it is a number per condition, and the spread between conditions is usually larger than the gap between vendors.

The bars above carry no real measurements. Quoting a single WER for a system is like quoting a single speed for a car without saying whether it was going downhill; always name the dataset and the conditions. The only numbers you can fully trust are the ones you produce yourself, on your own audio, scored the same way across every system you compare.
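One way to keep conditions separate in your own benchmark is to tag each utterance with its condition and report one WER per tag. The sketch below assumes the error counts have already been produced by whatever scorer you use. The function name `wer_by_condition`, the row layout, and the sample numbers are all invented for illustration. It also assumes pooling (total errors over total reference words within each condition) rather than averaging per-utterance rates, which is a choice you should state alongside any result.

```python
from collections import defaultdict
from typing import Iterable


def wer_by_condition(rows: Iterable[tuple[str, int, int]]) -> dict[str, float]:
    """rows: (condition, error_count, reference_word_count) per utterance."""
    errors: dict[str, int] = defaultdict(int)
    words: dict[str, int] = defaultdict(int)
    for condition, n_errors, n_ref_words in rows:
        errors[condition] += n_errors
        words[condition] += n_ref_words
    # Pooled WER per condition; skip conditions with no reference words.
    return {c: errors[c] / words[c] for c in words if words[c] > 0}


# Made-up counts, purely to show the shape of the output.
sample_rows = [
    ("clean", 2, 50),
    ("clean", 2, 50),
    ("noisy", 9, 50),
]
for condition, rate in wer_by_condition(sample_rows).items():
    print(f"{condition}: {rate:.1%}")
```

Feeding every system you compare through the same function, on the same utterances, is what makes the resulting numbers comparable.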

Where WER came from

WER did not appear with deep learning. It comes from the speech recognition evaluations that the U.S. Defense Advanced Research Projects Agency (DARPA) funded and the National Institute of Standards and Technology (NIST) ran starting in the late 1980s. To compare competing research systems fairly, NIST needed a single mechanical scoring rule nobody could argue with, and word-level edit distance was it.

Timeline: WER as a scoring standard
1988: DARPA Resource Management evaluations use word scoring
1990s: NIST runs Switchboard and Broadcast News benchmarks
1990s: NIST sclite becomes the de facto WER scoring tool
2016: Switchboard human-parity claims, scored the same way

NIST's scoring toolkit, sclite (part of the SCTK package), is still in use, and a surprising amount of modern reporting traces back to its alignment conventions. The tool also produces per-word substitution and deletion breakdowns that tell you not just how wrong a system is, but how it is wrong, which is far more useful than the headline rate.

Common questions

What is a good WER?

There is no universal threshold, because WER depends entirely on the audio. On clean, scripted, single-speaker read speech, modern systems land in the low single digits. On noisy, accented, overlapping, or domain-heavy audio (medical, legal, names and numbers), the same system can be several times worse. A WER quoted without its dataset and conditions is close to meaningless. Judge a system on audio that resembles yours, not a vendor's best-case demo set.

Can WER be more than 100%?

Yes. Because insertions are not limited by the number of reference words, a system that produces far more words than the truth, or that hallucinates speech during silence, can score above 100%. A WER of 150% means the edit distance is one and a half times the reference length. Any tool that caps WER at 100% is hiding your worst failures.

Is WER the same as accuracy?

Not exactly. Accuracy is often loosely defined as 1 minus WER, but since WER can exceed 100%, that subtraction can produce a negative "accuracy," which is nonsense. WER is an error rate. Treat "96% accuracy" as a loose restatement of "4% WER," not a separately defined measurement, and remember that both numbers are blind to capitalization, punctuation, and whether the right names came out.

Why do two vendors report different WER on the same dataset?

Usually because they are not scoring against quite the same reference. Decisions about contractions, filler words, numbers written as digits versus words, and text normalization all change the error count without changing the model. Different scoring scripts and normalization rules produce different numbers from identical transcripts. This is why your own benchmark, scored consistently across systems, beats any published comparison.

Does WER measure punctuation and capitalization?

By default, no. Standard WER scoring lowercases text and strips punctuation before comparing, so the headline number says nothing about formatting. Formatting quality is measured separately, and for many applications it matters more than raw WER. See punctuation and inverse text normalization.

References

[1] Sasindran, Z., Yelchuri, H., Prabhakar, T. V., & Rao, S. (2022). H_eval: A new hybrid evaluation metric for automatic speech recognition tasks. arXiv preprint arXiv:2211.01722.