Why speech recognition fails on phone numbers, IDs, and emails

A spoken confirmation number such as "B as in boy, four, seven, two, oh, nine" packs a spelling alphabet, plain digits, and an alternative name for zero into one short breath. A recognizer will happily render it as ordinary prose, or silently drop a character. The failures are not random. They follow from what an alphanumeric string is, and they repeat so predictably that each kind deserves its own entry.

Phone numbers

You say a ten-digit number as "five-five-five, one-two-three, four-five-six-seven," with the grouping a human hears as area code, prefix, line. The recognizer hears a flat run of digits with some pauses, and it has to decide where the groups are, whether a pause was a grouping cue or a hesitation, and how to format the result.

What went wrong: there is no grammar to lean on. In prose, "the" is almost never followed by "the," so the language model's prior can repair a misheard word. In a digit string, any digit is about as likely as any other to come next, so the prior that normally rescues a recognizer has nothing to work with, and a misheard digit stays misheard. Spoken conventions make it worse: "double seven" means "77," "treble five" means "555" in British usage, and "oh" means zero. A system that does not know these conventions transcribes "double seven" as the literal words.
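The spoken conventions above can be handled with a small post-processing pass. What follows is a minimal sketch, not a description of any particular recognizer: the function name `spoken_digits_to_string` and both lookup tables are hypothetical, and it assumes the input span is already known to be a digit run.

```python
# Hypothetical lookup tables for illustration only; real systems cover more
# conventions and more languages.
DIGIT_WORDS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}
REPEAT_WORDS = {"double": 2, "triple": 3, "treble": 3}


def spoken_digits_to_string(phrase: str) -> str:
    """Convert a spoken digit run to a digit string, expanding repeat words."""
    tokens = phrase.lower().replace("-", " ").replace(",", " ").split()
    digits = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token in REPEAT_WORDS:
            # "double seven" -> "77": the repeat word applies to the next digit.
            if i + 1 >= len(tokens) or tokens[i + 1] not in DIGIT_WORDS:
                raise ValueError(f"{token!r} is not followed by a digit word")
            digits.append(DIGIT_WORDS[tokens[i + 1]] * REPEAT_WORDS[token])
            i += 2
        elif token in DIGIT_WORDS:
            digits.append(DIGIT_WORDS[token])
            i += 1
        else:
            # Refuse to guess: anything else means this was not a digit run.
            raise ValueError(f"not a digit word: {token!r}")
    return "".join(digits)


print(spoken_digits_to_string("five-five-five, one-two-three, four-five-six-seven"))
# -> 5551234567
print(spoken_digits_to_string("double seven oh treble five"))
# -> 770555
```

The sketch fixes the convention problem only. It cannot repair a digit that was misheard in the first place, and it discards the pauses that would tell you how to group the result.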

Account and reference IDs

"Your reference is A-X-4-9-Q-2." Now the recognizer has to switch, mid-string, between hearing letters and hearing digits, and the spoken letters are easy to confuse. The names of the letters are short, acoustically similar, and easily masked: B, C, D, E, G, P, T, V, and Z all rhyme, and over a real microphone they are easy to swap for one another.

What went wrong: spoken letter names carry very little acoustic information to tell them apart, which is why humans invented "B as in boy." A recognizer that does not interpret the spelling-alphabet convention treats "B as in boy" as four ordinary words and may transcribe the whole phrase instead of the single letter it encodes. And because an ID mixes letters and digits freely, the model cannot use word context to recover: there is no word.
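One way to interpret the spelling-alphabet convention after recognition is to trust the keyword rather than the letter: in "B as in boy," the word "boy" is much harder to mishear than "B." Here is a minimal sketch under that assumption. The name `decode_spelled_id` is hypothetical, only the "as in" phrasing is handled, and "oh" is treated as zero even though in a real ID it could also be the letter O.

```python
import re

DIGIT_WORDS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}
# Matches "<anything> as in <keyword>"; only the keyword is captured.
AS_IN = re.compile(r"\b\w+ as in (\w+)")


def decode_spelled_id(transcript: str) -> str:
    """Collapse a spelled-out ID into its characters."""
    text = transcript.lower()
    # Replace the whole phrase with the keyword's initial, so a misrecognized
    # "be as in boy" still decodes to "B".
    text = AS_IN.sub(lambda match: match.group(1)[0].upper(), text)
    chars = []
    for token in re.findall(r"[A-Za-z0-9]+", text):
        if token in DIGIT_WORDS:
            chars.append(DIGIT_WORDS[token])
        elif len(token) == 1:
            chars.append(token.upper())  # a bare letter or a bare digit
        else:
            raise ValueError(f"cannot interpret {token!r} as one character")
    return "".join(chars)


print(decode_spelled_id("B as in boy, four, seven, two, oh, nine"))  # -> B47209
print(decode_spelled_id("be as in boy four"))                        # -> B4
print(decode_spelled_id("A-X-4-9-Q-2"))                              # -> AX49Q2
```

Bare letters still pass through unchecked, so an ID spelled without keywords keeps all of its letter-confusion risk.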

Email addresses

"jane dot doe at example dot com, all lowercase." A human writes jane.doe@example.com without thinking. The recognizer has to turn the spoken words "dot" and "at" into symbols, glue the parts together with no spaces, and resist transcribing "Jane Doe at Example dot com" as an English sentence about a person named Jane.

What went wrong: the same sounds are sometimes symbols and sometimes words. "Dot" is a period inside an address and an ordinary word outside one; "at" is @ in an email and a preposition everywhere else. The recognizer has to know it is inside an email to make the right call, and those boundaries are not marked. Spelled-out local parts ("j-a-n-e") pile the letter-confusion problem on top.
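The symbol substitution itself is easy once the boundaries are known, which is the point. Below is a minimal sketch that assumes some earlier step has already isolated the span containing the address. The name `spoken_email_to_address`, the symbol table, and the deliberately loose shape check are all hypothetical, and the check is not a full email validator.

```python
import re
from typing import Optional

# Hypothetical spoken-symbol table for illustration.
SPOKEN_SYMBOLS = {"dot": ".", "at": "@", "underscore": "_", "dash": "-", "hyphen": "-"}
# Loose sanity check on the result, not a complete address grammar.
ADDRESS_SHAPE = re.compile(r"[a-z0-9._-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)+")


def spoken_email_to_address(span: str) -> Optional[str]:
    """Glue a spoken email span into an address, or return None if it has no address shape."""
    tokens = re.findall(r"[a-z0-9]+", span.lower())
    # Symbol words become symbols; everything else is concatenated with no spaces.
    candidate = "".join(SPOKEN_SYMBOLS.get(token, token) for token in tokens)
    return candidate if ADDRESS_SHAPE.fullmatch(candidate) else None


print(spoken_email_to_address("jane dot doe at example dot com"))   # -> jane.doe@example.com
print(spoken_email_to_address("j a n e at example dot com"))        # -> jane@example.com
print(spoken_email_to_address("Jane Doe called"))                   # -> None
```

Feed the same function an ordinary sentence that happens to contain "at" and "dot" and it will glue the words into something address-shaped. The conversion is only safe inside a span that is known to be an email, and finding that span is the hard part.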

Amounts, dates, and times

You say "fifteen hundred." Did you mean the number 1500, the year, or the words "fifteen hundred"? You say "the third of the fourth." Is that April 3rd or March 4th, and should it render as 3/4, 03/04, or April 3? You say "two thirty." Is that the time 2:30, the fraction 2/30, or the number 230?

What went wrong: this failure is different in kind from the others. The recognizer may hear the sounds perfectly and still produce the wrong text, because turning recognized words into formatted digits is a separate step, called inverse text normalization, and that step is ambiguous [1]. "Twenty-three" should become "23" in "flight twenty-three" but stay as words in styles that spell out small numbers, and only context decides. Even a flawless acoustic model inherits this ambiguity.
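To see why formatting is a separate decision, consider a minimal sketch in which the words are already recognized correctly and only the rendering varies. The names `cardinal_to_int` and `render`, and the three context labels, are hypothetical. In a real system the context would have to be inferred, which is where the ambiguity lives, and the parser here handles plain cardinals only (it would misread "two thirty").

```python
SMALL = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6, "seven": 7,
    "eight": 8, "nine": 9, "ten": 10, "eleven": 11, "twelve": 12, "thirteen": 13,
    "fourteen": 14, "fifteen": 15, "sixteen": 16, "seventeen": 17,
    "eighteen": 18, "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}


def cardinal_to_int(phrase: str) -> int:
    """Parse a simple spoken cardinal such as 'fifteen hundred' or 'twenty-three'."""
    total, current = 0, 0
    for token in phrase.lower().replace("-", " ").split():
        if token in SMALL:
            current += SMALL[token]
        elif token in TENS:
            current += TENS[token]
        elif token == "hundred":
            current = max(current, 1) * 100
        elif token == "thousand":
            total += max(current, 1) * 1000
            current = 0
        else:
            raise ValueError(f"not a number word: {token!r}")
    return total + current


def render(phrase: str, context: str) -> str:
    """Format the same recognized words differently depending on an assumed context label."""
    if context == "verbatim":
        return phrase
    value = cardinal_to_int(phrase)
    if context == "amount":
        return f"{value:,}"
    if context == "year":
        return str(value)
    raise ValueError(f"unknown context: {context!r}")


for context in ("amount", "year", "verbatim"):
    print(context, "->", render("fifteen hundred", context))
# amount -> 1,500
# year -> 1500
# verbatim -> fifteen hundred
```

The same input produces three defensible outputs, and nothing in the audio says which one the speaker wanted.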

Why alphanumeric strings are difficult

If you strip away the specifics, the same thing is wrong every time. Ordinary speech recognition leans hard on language being predictable: words are real, grammar constrains what comes next, and that lets the model correct what it half-heard. Alphanumerics remove all of it. The "words" are not words, any symbol can follow any symbol, the spoken forms are short and confusable, and the mapping from sound to written form is ambiguous on top.

So the model loses its usual guidance exactly where precision matters most: a phone number, an account ID, or a dosage is useless if one character is wrong. A transcript can have a word error rate of 1 percent and still get every reference code wrong, which is one of the central arguments in beyond WER: the average hides the errors that cost you.
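The arithmetic is easy to check. The sketch below computes word error rate with the standard word-level edit distance and compares it with exact-match accuracy on the codes alone. The 100-word call transcript is made up for illustration, and `code_accuracy` is a hypothetical helper, not an established metric name.

```python
def word_error_rate(reference: list, hypothesis: list) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        previous = current
    return previous[-1] / len(reference)


def code_accuracy(reference_codes: list, hypothesis_codes: list) -> float:
    """Fraction of codes transcribed exactly right."""
    correct = sum(r == h for r, h in zip(reference_codes, hypothesis_codes))
    return correct / len(reference_codes)


# Made-up example: 96 words of filler plus a 4-word sentence containing the code.
filler = ("thank you for calling " * 24).split()
reference = filler + "your reference is B47209".split()
hypothesis = filler + "your reference is B47299".split()  # one character wrong

print(word_error_rate(reference, hypothesis))   # -> 0.01
print(code_accuracy(["B47209"], ["B47299"]))    # -> 0.0
```

One wrong character in one word costs a single point of word error rate and the entire value of the call.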

You say	You mean	A naive transcript returns
"double seven"	77	double seven
"B as in boy"	B	be as in boy
"jane dot doe at example dot com"	jane.doe@example.com	Jane Doe at example dot com
"fifteen hundred"	1500	fifteen hundred

Four failures, one root: the recognizer returns the words it heard instead of the string you meant.

Methods for improving accuracy

The fixes map onto the failures. Specialized handling for alphanumeric strings, where the model is trained or tuned to recognize digit runs, spelled letters, and the spelling-alphabet convention as their own kind of input, addresses the acoustic half. Good inverse text normalization addresses the formatting half, turning recognized digits and symbols into the conventional written form. And context biasing helps where the strings are partly predictable, such as a fixed set of product codes. None of these is optional for a system that takes phone numbers from callers.
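Context biasing proper happens inside the recognizer, but the idea behind it can be illustrated with a simpler stand-in: snapping a recognized string to the nearest entry in a known list after the fact. The sketch below uses `difflib.get_close_matches` from the Python standard library. The code list, the function name `snap_to_known_code`, and the 0.8 cutoff are assumptions chosen for the example.

```python
import difflib
from typing import Optional, Sequence

# Hypothetical fixed set of product codes.
KNOWN_CODES = ["BX49Q2", "TR1000", "KM7730"]


def snap_to_known_code(
    heard: str, known: Sequence[str] = KNOWN_CODES, cutoff: float = 0.8
) -> Optional[str]:
    """Return the closest known code, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(heard.upper(), known, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(snap_to_known_code("DX49Q2"))  # "D" misheard for "B" -> BX49Q2
print(snap_to_known_code("ZZZZZZ"))  # nothing close -> None
```

This only helps when the set of valid strings is small and known in advance. With a loose cutoff or a dense code space it can snap to the wrong entry with full confidence, so a miss is often safer than a forced match.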

Common questions

Why does speech recognition get phone numbers wrong so often?

Because a digit string has no grammar to fall back on. In ordinary speech, the language model corrects half-heard words using context; in a phone number, any digit can follow any other, so a misheard digit stays wrong. Spoken conventions like "double seven" and "oh" for zero multiply the failure modes further.

Can a recognizer understand "B as in boy"?

Only if it is built to. The spelling-alphabet convention encodes a single letter in a memorable phrase because spoken letter names are easy to confuse. A system that does not interpret the convention transcribes the literal words instead of the letter. Handling it correctly is a specific capability, not a default.

Why does "fifteen hundred" sometimes come out as words and sometimes as 1500?

Because converting recognized speech into formatted numbers is a separate, ambiguous step called inverse text normalization. The same spoken phrase can be a quantity, a year, or literal words, and only context decides. Even a recognizer that heard the audio perfectly can format it the way you did not want.

Does high overall accuracy mean my codes and numbers are safe?

No. Word error rate averages over all words, so a transcript can score well while getting every account number and reference code wrong, because those strings are a small fraction of the words but carry most of the meaning. This is why a single accuracy number can mislead; see beyond WER.

References

[1] Sunkara, M., Shivade, C., Bodapati, S., & Kirchhoff, K. (2021). Neural Inverse Text Normalization. arXiv preprint arXiv:2102.06380.