Final vs non-final tokens
Learn about final and non-final tokens in real-time transcription.
Overview
In real-time processing, Soniox Speech-to-Text AI returns a stream of tokens as the audio is being transcribed. These tokens are classified as either final or non-final, depending on whether the AI considers them stable or potentially subject to change.
Understanding the difference between final and non-final tokens is essential for building responsive, accurate real-time applications.
Final tokens
Final tokens are considered complete and will not change in future results.
Non-final tokens
Non-final tokens are provisional and may change as more audio is received. They are recognized instantaneously and give an early indication of what’s being said, but:
- Their text may change in future results
- They may disappear entirely
- They may be replaced by different tokens
Token stream behavior
In each real-time result received from the API:
- Final tokens always appear first, followed by any non-final tokens.
- A final token is emitted exactly once. Once a token is marked as final, it will not be sent again in future responses.
- A token may be returned multiple times as a non-final token before eventually becoming final.
- There is no guaranteed relationship between non-final tokens across updates: they may change, disappear, or be replaced entirely.
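The rules above suggest a simple client-side pattern: append final tokens to a growing transcript, and overwrite the provisional tail on every response. The following is a minimal sketch of that pattern, not an official client. It assumes each response has been parsed into a dict with a `tokens` list, and that token texts carry their own spacing; the class name `TranscriptState` is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TranscriptState:
    """Tracks confirmed text plus the latest provisional text."""

    final_text: str = ""
    non_final_text: str = ""

    def update(self, response: dict) -> str:
        # Drop the provisional text from the previous response: non-final
        # tokens have no guaranteed relationship across updates.
        self.non_final_text = ""
        for token in response.get("tokens", []):  # "tokens" key is an assumption
            if token["is_final"]:
                # A final token is sent exactly once, so appending is safe.
                self.final_text += token["text"]
            else:
                self.non_final_text += token["text"]
        # Text to display: confirmed part followed by the provisional part.
        return self.final_text + self.non_final_text
```

Calling `update` once per received response returns the text to display. Keeping the two parts separate also lets a UI style provisional text differently, for example in a lighter color.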
Output format
Each token in the API response includes:
- text: The recognized token (word or sub-word).
- is_final: Whether the token is final or non-final.
Example behavior
Here's how a token may evolve through multiple real-time responses:
Step 1: Initial recognition (non-final)
Step 2: Updated recognition (still non-final)
Step 3: Further update (still non-final)
Step 4: Finalized tokens
Step 5: New non-final tokens begin
In this flow, the phrase "How are you" was first streamed as non-final tokens, updated as context improved, and finally confirmed as final tokens. Once marked as is_final: true, the tokens will not appear again in future results.
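A hypothetical sequence of five responses consistent with this flow could look like the sketch below. The token texts at each step (including the misrecognized " or") and the top-level `tokens` key are invented for illustration; only the field names `text` and `is_final` come from the output format described earlier.

```python
# Hypothetical responses, one per step; the token texts are illustrative only.
responses = [
    {"tokens": [{"text": "How", "is_final": False}]},
    {"tokens": [{"text": "How", "is_final": False},
                {"text": " or", "is_final": False}]},
    {"tokens": [{"text": "How", "is_final": False},
                {"text": " are", "is_final": False},
                {"text": " you", "is_final": False}]},
    {"tokens": [{"text": "How", "is_final": True},
                {"text": " are", "is_final": True},
                {"text": " you", "is_final": True}]},
    {"tokens": [{"text": " doing", "is_final": False}]},
]


def render_steps(responses):
    """Return (final_text, non_final_text) as it stands after each response."""
    final_text = ""
    snapshots = []
    for response in responses:
        tokens = response["tokens"]
        # Final tokens accumulate; non-final tokens are rebuilt every time.
        final_text += "".join(t["text"] for t in tokens if t["is_final"])
        provisional = "".join(t["text"] for t in tokens if not t["is_final"])
        snapshots.append((final_text, provisional))
    return snapshots


for step, (final_text, provisional) in enumerate(render_steps(responses), start=1):
    print(f"Step {step}: final={final_text!r} non-final={provisional!r}")
```

From step 4 onward the final text stays fixed, while the non-final part starts over with the new tokens.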
Audio processing durations
In real-time transcription, the API response includes two fields that indicate how much of the audio has been processed:
- audio_final_proc_ms: The duration, in milliseconds, of audio that has been processed and resulted in final tokens.
- audio_total_proc_ms: The duration, in milliseconds, of audio that has been processed and resulted in both final and non-final tokens.
These values represent the offset (from the start of the audio stream) up to which Soniox has processed the audio for final and total results, respectively.
Example
For a response with audio_final_proc_ms set to 4800 and audio_total_proc_ms set to 5250, this means:
- Audio up to 4.8 seconds has been fully processed and yielded final tokens
- Audio up to 5.25 seconds has been partially processed and may have produced non-final tokens
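The gap between the two values is the stretch of audio currently covered only by non-final tokens. A small sketch of that arithmetic follows, assuming the response has been parsed into a dict; the helper name `pending_audio_ms` is hypothetical.

```python
def pending_audio_ms(response: dict) -> int:
    """Milliseconds of processed audio not yet covered by final tokens."""
    return response["audio_total_proc_ms"] - response["audio_final_proc_ms"]


# Values matching the example above: 4.8 s finalized, 5.25 s processed in total.
response = {"audio_final_proc_ms": 4800, "audio_total_proc_ms": 5250}
print(pending_audio_ms(response))  # 450
```

Here 450 ms of audio has been processed but may still be revised before its tokens become final.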