Partial vs final results in live transcription, explained

Say "I sent it to" and pause. A streaming recognizer might show "too," then "to," then "2," flickering through spellings of the same sound while it waits to learn what comes next. The instant you add "the printer," it locks in "to" and moves on. The earlier guesses were not errors; they were the recognizer's best reading of an unfinished phrase, shown before the phrase was done.

Showing text that early is why partial results exist, and why live captions feel responsive instead of laggy. The recognizer could wait until it was certain, but then you would stare at silence for a second after every phrase.^[1]^[2]

Why partial results change

Spoken language resolves backward. The sound /tuː/ could be "to," "too," or "two," and which one is right often depends on words not yet spoken. A real-time system cannot wait for the end of the sentence to show you the beginning, so it commits to a provisional reading and revises when later audio settles the question.^[3]

The rewrites cluster where language resolves late. Homophones wait on grammar. Numbers wait on their neighbors, because "fifteen" might be the whole number or the start of "fifteen hundred." Even the boundaries between words can shift, so "ice cream" and "I scream" trade places until the surrounding words break the tie.

Audio heard so far	Displayed text	State
"I owe you"	I owe you	partial
"I owe you two"	I owe you too	partial
"I owe you two hun-"	I owe you two	partial
"I owe you two hundred."	I owe you two hundred	final

Token-level output

Streaming output usually arrives not as whole sentences but as tokens: small pieces of text, often a word or a word-fragment, each carrying its own metadata. A token typically tells you its text, whether it is final, when it occurred (its timestamps), and how sure the recognizer is (its confidence score).^[4]^[5]

The stream is a growing list of tokens, where a trailing run is still provisional and can be replaced, while everything before a certain point is frozen. Each server message either appends new partial tokens, revises the current partial tail, or promotes some partials to final. On the client, keep one current view of the text and update it as these messages arrive, rather than appending blindly.
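A minimal client-side sketch of that "one current view" idea follows. The `Token` and `TranscriptView` names and the message shape are assumptions made for illustration, not any provider's API. In particular, it assumes that the non-final tokens in each message describe the whole current tail, so the old tail is replaced rather than appended to; check how your recognizer actually delivers revisions.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    text: str
    is_final: bool


@dataclass
class TranscriptView:
    """One current view of the transcript: a frozen prefix plus a replaceable tail."""

    final_tokens: list[Token] = field(default_factory=list)
    partial_tokens: list[Token] = field(default_factory=list)

    def apply(self, message: list[Token]) -> None:
        # Final tokens extend the frozen prefix and are never touched again.
        self.final_tokens.extend(t for t in message if t.is_final)
        # Assumption: non-final tokens in a message are the entire current tail,
        # so the previous tail is discarded instead of appended to.
        self.partial_tokens = [t for t in message if not t.is_final]

    def text(self) -> str:
        return "".join(t.text for t in self.final_tokens + self.partial_tokens)


# Hypothetical message sequence: append, promote, revise, finalize.
messages = [
    [Token("I owe you", False)],
    [Token("I owe you", True), Token(" too", False)],
    [Token(" two", False)],
    [Token(" two", True), Token(" hundred", True)],
]

view = TranscriptView()
for message in messages:
    view.apply(message)
    print(view.text())
```

Appending every message blindly would have left both "too" and "two" on screen; replacing the tail keeps only the latest reading.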

Result finalization

Two forces promote partials to finals. The ordinary one is time and context: once enough later audio has arrived that a word is no longer in doubt, the recognizer commits it, often well before the speaker stops. The other is the end of the turn: when endpoint detection decides the utterance is over, everything still provisional is finalized at once.^[6]
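The two forces can be imitated with a toy model. The sketch below is not how any particular recognizer decides stability; it assumes a deliberately simple stand-in rule (a word is promoted once two consecutive hypotheses agree on it) plus an `endpoint` flag that flushes everything still provisional. The `Finalizer` class and its method names are hypothetical.

```python
def common_prefix_len(a: list[str], b: list[str]) -> int:
    """Number of leading words on which two hypotheses agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class Finalizer:
    def __init__(self) -> None:
        self.committed: list[str] = []  # words already promoted to final
        self.previous: list[str] = []   # the last full hypothesis seen

    def update(self, hypothesis: list[str], endpoint: bool = False) -> list[str]:
        """Return the words newly promoted to final by this update."""
        if endpoint:
            # End of turn: everything still provisional is finalized at once.
            stable = len(hypothesis)
        else:
            # Time and context (toy rule): words that two consecutive
            # hypotheses agree on are treated as no longer in doubt.
            stable = common_prefix_len(self.previous, hypothesis)
        # The frozen prefix only grows; finals are never taken back.
        stable = max(stable, len(self.committed))
        newly_final = hypothesis[len(self.committed):stable]
        self.committed.extend(newly_final)
        self.previous = hypothesis
        return newly_final


finalizer = Finalizer()
promoted = [
    finalizer.update("I owe you".split()),
    finalizer.update("I owe you too".split()),
    finalizer.update("I owe you two".split()),
    finalizer.update("I owe you two hundred".split(), endpoint=True),
]
```

In this run, "I owe you" is committed mid-utterance once a second hypothesis confirms it, while "two hundred" is committed only when the endpoint arrives.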

There is also a manual lever. Sometimes the application knows the turn is over before the recognizer's silence timer would fire, because the user pressed "send" or released a push-to-talk button. Manual finalization is a control message that tells the recognizer to commit its current partials now rather than wait. It trades a little potential accuracy (it stops waiting for stabilizing context) for control over exactly when finals land.
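As a rough sketch, a push-to-talk release handler might look like the following. The `{"type": "finalize"}` payload and the `on_push_to_talk_release` name are assumptions for illustration; the real control message depends on your provider, so check its documentation. The `send` callable stands in for whatever writes to your open streaming connection.

```python
import json
from typing import Callable


def on_push_to_talk_release(send: Callable[[str], None]) -> None:
    """Ask the recognizer to commit its current partials now."""
    # Hypothetical control message; the actual shape is provider-specific.
    send(json.dumps({"type": "finalize"}))


# Stand-in for a real connection: collect outgoing messages in a list.
sent: list[str] = []
on_push_to_talk_release(sent.append)
```

Passing `send` in as a parameter keeps the handler independent of any particular WebSocket library and easy to test.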

Safe use of partial results

One rule prevents most live-transcription bugs: render partials, but act on finals.^[7]

Display partial text immediately, styled lightly so the user understands it is not settled. Do not take an irreversible action on a partial: do not save it to the record, send it to a downstream language model, translate it, or trigger a command on a word that may still change. For voice agents, the usual pattern is to use partials only for responsiveness (to show the user that the agent is listening) and to wait for finals before deciding what was actually said.^[4]
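One way to enforce the rule in code is to give irreversible work a single entry point that only finals can reach. The sketch below assumes tokens arrive as dictionaries with `text` and `is_final` keys and that non-final tokens replace the previous tail; the `LiveCaption` class is hypothetical, and the square brackets merely stand in for light styling.

```python
from typing import Callable


class LiveCaption:
    def __init__(self, on_final: Callable[[str], None]) -> None:
        self.on_final = on_final  # the only path to irreversible work
        self.final_text = ""
        self.partial_text = ""

    def handle(self, tokens: list[dict]) -> str:
        """Apply one message and return the text to render."""
        new_final = "".join(t["text"] for t in tokens if t["is_final"])
        # Assumption: non-final tokens replace the previous provisional tail.
        self.partial_text = "".join(t["text"] for t in tokens if not t["is_final"])
        if new_final:
            self.final_text += new_final
            self.on_final(new_final)  # store, translate, or dispatch here
        # Partials are rendered but never passed to on_final.
        tail = f"[{self.partial_text}]" if self.partial_text else ""
        return self.final_text + tail


stored: list[str] = []
caption = LiveCaption(on_final=stored.append)

first = caption.handle([{"text": "turn off", "is_final": False}])
second = caption.handle([
    {"text": "turn on", "is_final": True},
    {"text": " the lights", "is_final": False},
])
```

The user briefly saw "turn off" on screen, but only "turn on" reached storage, because the misheard partial never passed through `on_final`.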

There is a tempting exception. For the absolute lowest latency, you can act speculatively on a stable partial and be prepared to roll back if it changes. Treat that as deliberate engineering, not a default: it only pays off when the rollback is cheap and the milliseconds matter, as in the tightest latency budgets.
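A sketch of what that deliberate engineering might involve, assuming "stable" means the same partial text has been seen in a set number of consecutive updates. That criterion, the `SpeculativeAction` class, and its `start` and `rollback` callbacks are all hypothetical choices for illustration.

```python
from typing import Callable, Optional


class SpeculativeAction:
    def __init__(
        self,
        start: Callable[[str], None],
        rollback: Callable[[str], None],
        stable_after: int = 2,
    ) -> None:
        self.start = start
        self.rollback = rollback
        self.stable_after = stable_after  # consecutive identical partials required
        self.last_text: Optional[str] = None
        self.repeats = 0
        self.speculated: Optional[str] = None  # text currently being acted on

    def on_partial(self, text: str) -> None:
        self.repeats = self.repeats + 1 if text == self.last_text else 1
        self.last_text = text
        # The partial changed under us: undo the speculative work.
        if self.speculated is not None and text != self.speculated:
            self.rollback(self.speculated)
            self.speculated = None
        # Speculate only once the partial has held steady long enough.
        if self.speculated is None and self.repeats >= self.stable_after:
            self.start(text)
            self.speculated = text

    def on_final(self, text: str) -> None:
        if self.speculated != text:
            if self.speculated is not None:
                self.rollback(self.speculated)
            self.start(text)  # no usable head start; do the work now
        # Reset for the next utterance.
        self.speculated = None
        self.last_text = None
        self.repeats = 0


log: list[tuple[str, str]] = []
action = SpeculativeAction(
    start=lambda t: log.append(("start", t)),
    rollback=lambda t: log.append(("rollback", t)),
)
action.on_partial("call mom")
action.on_partial("call mom")   # stable: speculative start
action.on_partial("call Tom")   # changed: rollback
action.on_final("call Tom")     # final differs from speculation: start fresh
```

Here the speculation lost: the work was started, undone, and started again. That wasted cycle is why the approach only pays off when rollback is cheap.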

Common questions

What is the difference between a partial and a final result?

A partial result is the recognizer's current best guess for what it has heard, and it can still change as more audio arrives. A final result is committed and will not change. Both stream during the utterance; you show partials for responsiveness and rely on finals for anything you cannot take back.

Why does my live transcript change words after showing them?

Because the early version was partial. Spoken language often cannot be resolved until later words arrive, so the recognizer displays its best provisional reading immediately, revises it once context settles the ambiguity, then freezes it as final. The visible rewriting is the system being responsive rather than slow.

Should I send partial results to my language model or database?

No. Partials can still change, so acting on them irreversibly means acting on text that may be wrong a moment later. Wait for finals before storing, translating, or sending text downstream. The exception is deliberate speculative execution with rollback, used only when latency is critical.

What is manual finalization?

It is a control message that asks the recognizer to commit its current partial tokens immediately instead of waiting for its own silence-based endpointing. It is useful when your application already knows the turn is over, such as a push-to-talk release, and wants the finals without the extra wait.

References

[1] Liu, X., Zhang, J., Ferrer, L., et al. (2023). Modeling and Improving Text Stability in Live Captions. Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems.
[2] Yu, J., Chiu, C.-C., Li, B., et al. (2021). FastEmit: Low-Latency Streaming ASR with Sequence-Level Emission Regularization. ICASSP 2021 (IEEE).
[3] Albesano, D., Gemello, R., & Mana, F. (2000). Hybrid HMM–NN Modeling of Stationary–Transitional Units for Continuous Speech Recognition. Information Sciences.
[4] Addlesee, A., et al. (2020). A Comprehensive Evaluation of Incremental Speech Recognition and Diarization for Conversational AI. Proceedings of COLING 2020.
[5] Soniox (2026). Real-time Transcription. Soniox Docs.
[6] Ramezani, E., Giahi, M. M., Zarabadipour, M. E., et al. (2026). WhisperPipe: A Resource-Efficient Streaming Architecture for Real-Time Automatic Speech Recognition. arXiv preprint arXiv:2604.25611.
[7] Luz, S., Masoodian, M., & Rogers, B. (2008). Interactive Visualisation Techniques for Dynamic Speech Transcription, Correction and Training. CHINZ 2008 (ACM SIGCHI New Zealand).