Punctuation, capitalization, and inverse text normalization in ASR

Here is what a recognizer gives you before any cleanup:

i'll wire twenty three hundred to account four four seven one on tuesday the third at two thirty

And here is what you wanted:

I'll wire $2,300 to account 4471 on Tuesday the 3rd at 2:30.

Both lines contain the same recognized words. Every difference between them is formatting, a separate job from recognition, done by a chain of steps that each decide something the audio alone cannot settle.

Raw recognition output

A recognizer's native output is a flat run of lowercase words with no punctuation and every number spelled out, because that is what was spoken. Capital letters, periods, dollar signs, and digit grouping are written conventions, none of which exist in the sound. Early research systems shipped essentially this: single-case, unpunctuated word streams, often in all capitals. Everything readable about a modern transcript is added after the words are already known.

Restore punctuation

The first cleanup step decides where sentences end and where commas belong. Speech carries cues, such as falling pitch at the end of a sentence or a breath before a clause, but they are soft and inconsistent, so restoring punctuation is a prediction rather than a transcription. A model reads the word sequence, and sometimes the audio timing, and guesses where the marks go.

in:  "the order shipped it should arrive friday did you get the email"
out: "The order shipped. It should arrive Friday. Did you get the email?"

Sometimes the audio does not decide it. "Let's eat grandma" and "Let's eat, grandma" are both valid sentences with the same spoken form, separated only by the comma. Punctuation restoration gets the easy cases right and the hard ones sometimes, which is why a stray period is one of the more common transcript blemishes.
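As a rough illustration of how timing can feed the prediction, the sketch below places marks purely from the silence between words. It is a toy heuristic, not how production punctuation models work: the `Word` record, the function name, and both gap thresholds are assumptions made for this example, and a pause-only rule can never produce a question mark.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the audio
    end: float


def punctuate_by_pause(words, sentence_gap=0.6, comma_gap=0.25):
    """Toy rule: a long silence becomes a period, a medium one a comma.

    Both thresholds are illustrative assumptions, not tuned values.
    """
    out = []
    for i, word in enumerate(words):
        token = word.text
        if i + 1 < len(words):
            gap = words[i + 1].start - word.end
            if gap >= sentence_gap:
                token += "."
            elif gap >= comma_gap:
                token += ","
        else:
            token += "."  # always close the final sentence
        out.append(token)
    return " ".join(out)


shipped = [
    Word("the", 0.0, 0.2), Word("order", 0.2, 0.6), Word("shipped", 0.6, 1.1),
    Word("it", 1.9, 2.0), Word("should", 2.0, 2.3),
    Word("arrive", 2.3, 2.7), Word("friday", 2.7, 3.2),
]
grandma = [Word("let's", 0.0, 0.3), Word("eat", 0.3, 0.6), Word("grandma", 0.9, 1.4)]

print(punctuate_by_pause(shipped))
print(punctuate_by_pause(grandma))
```

If the speaker in the second example does not pause before "grandma", the comma disappears, which is the kind of soft, inconsistent cue that makes this a prediction problem.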

Restore capitalization

Truecasing restores capitals: the start of each sentence, the word "I," and proper nouns. Sentence starts are easy once punctuation exists. Proper nouns are not, because the sound carries no capital letter and the same spoken word can be a name or not. "Rose" is a flower or a person, "apple" is a fruit or a trillion-dollar company, and the audio is identical either way. The model guesses from context and from what it knows about likely names, which is one reason context biasing helps here too.

in:  "i told mr reyes that apple stock rose"
out: "I told Mr. Reyes that Apple stock rose"
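A minimal sketch of the idea, assuming a tiny hand-written lexicon in place of a trained model: words that are always capitalized get a fixed form, while an ambiguous word takes its name reading only when a nearby word supports it. The lexicon contents, the two-word context window, and the function name are all hypothetical choices for this example.

```python
PUNCT = ".?!,"

# Hypothetical mini-lexicon; real truecasers learn casing from formatted text.
FIXED_CASE = {"i": "I", "mr": "Mr.", "reyes": "Reyes"}
# Ambiguous words: use the name reading only when a nearby word supports it.
CONTEXT_CUES = {"apple": {"stock", "shares"}}


def truecase(text):
    """Capitalize sentence starts, fixed-case words, and context-backed names."""
    words = text.split()
    out = []
    sentence_start = True
    for i, word in enumerate(words):
        bare = word.rstrip(PUNCT)
        trailing = word[len(bare):]
        # Words within two positions of the current one, punctuation stripped.
        nearby = {
            w.rstrip(PUNCT)
            for j, w in enumerate(words)
            if j != i and abs(j - i) <= 2
        }
        if bare in FIXED_CASE:
            cased = FIXED_CASE[bare]
        elif CONTEXT_CUES.get(bare, set()) & nearby:
            cased = bare.capitalize()
        else:
            cased = bare
        if sentence_start:
            cased = cased[:1].upper() + cased[1:]
        out.append(cased + trailing)
        # The next word starts a sentence only after terminal punctuation.
        sentence_start = trailing.endswith((".", "?", "!"))
    return " ".join(out)


print(truecase("i told mr reyes that apple stock rose"))
print(truecase("she ate an apple. the rose wilted."))
```

The same word "apple" comes out capitalized in the first sentence and lowercase in the second, because only the first has a supporting cue nearby. A list of expected names supplied by the caller would slot in naturally as extra entries in the fixed-case lexicon.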

Inverse text normalization

Inverse text normalization (ITN) turns spoken quantities into their written form. This is where most of the formatting intelligence lives.

"twenty three"              -> 23
"twenty three hundred"      -> 2,300
"two thirty"                -> 2:30
"march third"               -> March 3
"four four seven one"       -> 4471
"twenty three dollars"      -> $23
"three point one four"      -> 3.14
"fifty percent"             -> 50%

Every line above is a decision, and the same words resolve differently depending on what surrounds them. "Two thirty" is a time in a calendar sentence and a score in a sports report, and "four four seven one" is an account number on a bank call and a lottery draw on the radio. ITN has to infer the kind of number from context first and then apply the matching rule; this is the same ambiguity that makes alphanumerics so hard. The recognizer can hear "two thirty" perfectly, and ITN can still write down the wrong thing.
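The two-step shape (decide the kind, then apply the rule) can be sketched as follows. This is an illustrative toy, not a real ITN grammar: the `kind` argument stands in for the context inference, which is the hard part and is simply supplied by the caller here, and the function names and the four supported kinds are assumptions for the example.

```python
UNITS = {w: n for n, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * n for n, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}


def cardinal(words):
    """Read the words as one quantity: 'twenty three hundred' -> 2300."""
    total, current = 0, 0
    for w in words:
        if w in UNITS:
            current += UNITS[w]
        elif w in TENS:
            current += TENS[w]
        elif w == "hundred":
            current = max(current, 1) * 100
        elif w == "thousand":
            total += max(current, 1) * 1000
            current = 0
        else:
            raise ValueError(f"not a number word: {w!r}")
    return total + current


def digit_string(words):
    """Read the words digit by digit: 'four four seven one' -> '4471'."""
    if any(UNITS.get(w, 10) > 9 for w in words):
        raise ValueError("expected single-digit words only")
    return "".join(str(UNITS[w]) for w in words)


def itn(phrase, kind):
    """Apply the rule for `kind`; choosing `kind` is the caller's job here."""
    words = phrase.split()
    if kind == "cardinal":
        return f"{cardinal(words):,}"
    if kind == "money":
        return f"${cardinal(words):,}"
    if kind == "digits":
        return digit_string(words)
    if kind == "time":
        return f"{cardinal(words[:1])}:{cardinal(words[1:]):02d}"
    raise ValueError(f"unknown kind: {kind!r}")


print(itn("twenty three hundred", "cardinal"))
print(itn("two thirty", "time"))
print(itn("four four seven one", "digits"))
```

Passing the wrong `kind` reproduces the failure described above: the words are read correctly, yet the written form is wrong.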

Domain formatting

The last step applies conventions that depend on what kind of string this is. A run of ten digits becomes a phone number with its local grouping. "Jane dot doe at example dot com" becomes an email address. A postal code, a credit-card number, and a measurement each have their own written shape. This overlaps heavily with the alphanumeric problem, and it is where deployments add their own rules for the specific strings their users speak.
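Two such conventions, sketched under stated assumptions: the email rule treats the spoken words "dot" and "at" as symbols and lowercases the result, and the phone rule uses North American grouping as one example of a local convention. Both function names and the validation patterns are hypothetical; each formatter returns `None` when the input does not have the expected shape, so the caller can leave the text untouched.

```python
import re

SPOKEN_SYMBOLS = {"dot": ".", "at": "@"}


def format_email(spoken):
    """'jane dot doe at example dot com' -> 'jane.doe@example.com'."""
    tokens = spoken.lower().split()
    if tokens.count("at") != 1:
        return None
    joined = "".join(SPOKEN_SYMBOLS.get(t, t) for t in tokens)
    # Accept only something that looks like local-part@domain.tld.
    if re.fullmatch(r"[\w.+-]+@[\w-]+(\.[\w-]+)+", joined):
        return joined
    return None


def format_phone(digits):
    """Group exactly ten digits as (AAA) BBB-CCCC, an assumed local style."""
    if not re.fullmatch(r"\d{10}", digits):
        return None
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"


print(format_email("Jane dot doe at example dot com"))
print(format_phone("5551234567"))
print(format_email("meet me at noon"))
```

The last call returns `None`: the word "at" is present, but the result does not look like an address, so the phrase stays as ordinary words.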

flowchart TB
    A[Raw words<br/>lowercase, spelled out] --> B[Punctuation]
    B --> C[Capitalization]
    C --> D[Inverse text<br/>normalization]
    D --> E[Domain formatting]
    E --> F[Readable transcript]

The post-processing chain. Recognition produces words; everything readable is added in the stages after.

Rule-based and learned methods

The older approach makes every stage above a separate module: a punctuation model, a truecaser, an ITN grammar, and a formatter, run in sequence starting from the recognizer's raw output. This design is transparent and tunable, and it explains why a transcript's formatting can fail independently of its words, often in a way that reveals which stage made the mistake.

The newer way lets the recognizer produce formatted text directly, having learned punctuation, casing, and normalization as part of recognition from data that was already formatted. This is more fluent and context-aware, because the model that knows the words also decides their written form, but it leaves you less of a seam to reach in and adjust. Most production systems sit between the two: a strong learned core, plus targeted rules for the formats a given domain cares about.
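The hybrid arrangement can be pictured as a thin rule layer applied to text the learned model has already formatted. The sketch below assumes one hypothetical deployment rule, that account numbers should never carry digit grouping, and the rule list and function name are illustrative rather than any particular system's interface.

```python
import re

# Hypothetical deployment rules: each is (pattern, replacement function).
DOMAIN_RULES = [
    # After the word "account", drop digit grouping: "4,471" -> "4471".
    (
        re.compile(r"\b(account )(\d[\d,]*\d)", re.IGNORECASE),
        lambda m: m.group(1) + m.group(2).replace(",", ""),
    ),
]


def apply_domain_rules(formatted_text):
    """Post-edit already-formatted text with targeted, domain-specific rules."""
    for pattern, fix in DOMAIN_RULES:
        formatted_text = pattern.sub(fix, formatted_text)
    return formatted_text


model_output = "I'll wire $2,300 to account 4,471 on Tuesday."
print(apply_domain_rules(model_output))
```

The dollar amount keeps its comma because the rule is anchored to the word "account". Keeping such rules narrow is what lets them coexist with a learned core without undoing its other decisions.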

Either way, a formatting error is not always a recognition error: the system can have heard you perfectly and still written down something you did not mean.

Common questions

What is the difference between text normalization and inverse text normalization?

Text normalization goes from written to spoken form, turning "$5" into "five dollars," to prepare text for text-to-speech. Inverse text normalization goes the other way, from the spoken words a recognizer produces back to written form, turning "five dollars" into "$5." Speech recognition uses the inverse direction.

Why does my transcript get the words right but format numbers wrong?

Formatting is a separate, ambiguous step. The recognizer hears "two thirty" correctly, but whether that is a time (2:30), a quantity, or literal words is an inference ITN makes from context. It can choose wrong even when recognition was perfect.

Is punctuation actually in the audio?

Only as soft cues. Falling pitch, pauses, and breath hint at sentence and clause boundaries, but punctuation marks are written conventions that are not pronounced. Punctuation restoration predicts them from the words and timing, which is why it handles clear cases well and genuinely ambiguous ones imperfectly.

Can I turn formatting off and get raw words?

Usually. Many systems can return the unformatted, lowercase, spelled-out token stream, which is useful when you want to apply your own normalization rules or need the literal spoken form. Whether and how you can request it depends on the specific API.

References

Sunkara, M., Shivade, C., Bodapati, S., & Kirchhoff, K. (2021). Neural Inverse Text Normalization. arXiv preprint arXiv:2102.06380.