Final vs non-final tokens

Overview

In real-time processing, Soniox Speech-to-Text AI returns a stream of tokens as the audio is being transcribed. These tokens are classified as either final or non-final, depending on whether the AI considers them stable or potentially subject to change.

Understanding the difference between final and non-final tokens is essential for building responsive, accurate real-time applications.

Final tokens

Final tokens are considered complete and will not change in future results.

Non-final tokens

Non-final tokens are provisional and may change as more audio is received. They are returned as soon as they are recognized and give an early indication of what is being said, but:

Their text may change in future results
They may disappear entirely
They may be replaced by different tokens

Token stream behavior

In each real-time result received from the API:

Final tokens always appear first, followed by any non-final tokens.
A final token is emitted exactly once. Once a token is marked as final, it will not be sent again in future responses.
A token may be returned multiple times as a non-final token before eventually becoming final.
There is no guaranteed relationship between non-final tokens across updates — they may change, disappear, or be replaced entirely.
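
Because final tokens always come first, a client can split each result at the first non-final token. The following is a minimal Python sketch, assuming the result has already been decoded into a dict with a `tokens` list. The helper name `split_tokens` is hypothetical and not part of any Soniox SDK.

```python
from typing import Any


def split_tokens(result: dict[str, Any]) -> tuple[list[dict], list[dict]]:
    """Split one decoded result into (final, non_final) token lists.

    Hypothetical helper: relies on the documented rule that final tokens
    come first, and raises if a final token follows a non-final one.
    """
    tokens = result.get("tokens", [])
    # Index of the first non-final token, or len(tokens) if all are final.
    boundary = next(
        (i for i, tok in enumerate(tokens) if not tok["is_final"]),
        len(tokens),
    )
    final, non_final = tokens[:boundary], tokens[boundary:]
    if any(tok["is_final"] for tok in non_final):
        raise ValueError("final token found after a non-final token")
    return final, non_final
```

The explicit check is a defensive choice: it turns a violated ordering assumption into an immediate error instead of a silently wrong transcript.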

Output format

Each token in the API response includes:

text: The recognized token (word or sub-word)
is_final: Whether the token is final or non-final
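
As an illustration, the two fields above can be mapped onto a small typed record. This is a sketch, assuming the response arrives as a JSON string with a top-level `tokens` array, as in the examples on this page. `Token` and `parse_tokens` are illustrative names, and any other fields a response may carry are ignored.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str       # recognized word or sub-word
    is_final: bool  # True once the token can no longer change


def parse_tokens(raw: str) -> list[Token]:
    """Decode a JSON response string into Token objects (illustrative helper)."""
    payload = json.loads(raw)
    return [
        Token(text=item["text"], is_final=bool(item["is_final"]))
        for item in payload.get("tokens", [])
    ]
```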

Example behavior

Here's how tokens may evolve across multiple real-time responses:

Step 1: Initial recognition (non-final)

{
  "tokens": [
    {"text": "How", "is_final": false}
  ]
}

Step 2: Updated recognition (still non-final)

{
  "tokens": [
    {"text": "How", "is_final": false},
    {"text": " ", "is_final": false},
    {"text": "are", "is_final": false}
  ]
}

Step 3: Further update (still non-final)

{
  "tokens": [
    {"text": "How", "is_final": false},
    {"text": " ", "is_final": false},
    {"text": "are", "is_final": false},
    {"text": " ", "is_final": false},
    {"text": "you", "is_final": false}
  ]
}

Step 4: Finalized tokens

{
  "tokens": [
    {"text": "How", "is_final": true},
    {"text": " ", "is_final": true},
    {"text": "are", "is_final": true},
    {"text": " ", "is_final": true},
    {"text": "you", "is_final": true}
  ]
}

Step 5: New non-final tokens begin

{
  "tokens": [
    {"text": " ", "is_final": false},
    {"text": "doing", "is_final": false}
  ]
}

In this flow, the phrase "How are you" was first streamed as non-final tokens, updated as context improved, and finally confirmed as final tokens. Once marked as is_final: true, the tokens will not appear again in future results.
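
The flow above suggests a simple client-side pattern: append final tokens to a growing transcript and replace the non-final tail on every result. The sketch below replays the five steps in Python. The `TranscriptBuffer` class and the `tok` helper are hypothetical names used for illustration, and the responses are assumed to be already decoded into dicts.

```python
class TranscriptBuffer:
    """Keeps confirmed text plus the latest provisional tail."""

    def __init__(self) -> None:
        self.final_text = ""      # only grows: each final token arrives once
        self.non_final_text = ""  # overwritten by every new result

    def update(self, result: dict) -> str:
        tokens = result.get("tokens", [])
        # Final tokens are never resent, so appending cannot duplicate text.
        self.final_text += "".join(t["text"] for t in tokens if t["is_final"])
        # Non-final tokens carry no continuity guarantee, so replace them all.
        self.non_final_text = "".join(
            t["text"] for t in tokens if not t["is_final"]
        )
        return self.final_text + self.non_final_text


def tok(text: str, is_final: bool) -> dict:
    return {"text": text, "is_final": is_final}


# The five responses from the example, in order.
steps = [
    {"tokens": [tok("How", False)]},
    {"tokens": [tok("How", False), tok(" ", False), tok("are", False)]},
    {"tokens": [tok(w, False) for w in ["How", " ", "are", " ", "you"]]},
    {"tokens": [tok(w, True) for w in ["How", " ", "are", " ", "you"]]},
    {"tokens": [tok(" ", False), tok("doing", False)]},
]

buffer = TranscriptBuffer()
displays = [buffer.update(step) for step in steps]
for text in displays:
    print(repr(text))
```

After the last step the displayed text is "How are you doing", of which only "How are you" is confirmed. Keeping the two parts separate lets a UI style provisional text differently from final text.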

Audio processing durations

In real-time transcription, the API response includes two fields that indicate how much of the audio has been processed:

audio_final_proc_ms: The duration, in milliseconds, of audio that has been processed and resulted in final tokens.
audio_total_proc_ms: The duration, in milliseconds, of audio that has been processed into tokens of either kind, final or non-final.

These values represent the offset (from the start of the audio stream) up to which Soniox has processed the audio for final and total results, respectively.

Example

{
  "audio_final_proc_ms": 4800,
  "audio_total_proc_ms": 5250
}

This means:

Audio up to 4.8 seconds has been fully processed and yielded final tokens
Audio up to 5.25 seconds has been processed at least provisionally; the span from 4.8 to 5.25 seconds may so far be covered only by non-final tokens
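
The gap between the two fields indicates how much processed audio is still provisional. The following is a small Python sketch of that arithmetic. The function name `pending_audio_ms` is an assumption for illustration; only the two field names come from the API response.

```python
def pending_audio_ms(result: dict) -> int:
    """Milliseconds of processed audio not yet covered by final tokens."""
    return result["audio_total_proc_ms"] - result["audio_final_proc_ms"]


example = {"audio_final_proc_ms": 4800, "audio_total_proc_ms": 5250}
print(pending_audio_ms(example))  # 450
```

Here 450 ms of audio has been processed but could still be revised in later results.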
