Final vs non-final tokens

Learn about final and non-final tokens in real-time transcription.

Overview

In real-time processing, Soniox Speech-to-Text AI returns a stream of tokens as the audio is being transcribed. These tokens are classified as either final or non-final, depending on whether the AI considers them stable or potentially subject to change.

Understanding the difference between final and non-final tokens is essential for building responsive, accurate real-time applications.


Final tokens

Final tokens are considered complete and will not change in future results.


Non-final tokens

Non-final tokens are provisional and may change as more audio is received. They are recognized instantaneously and give an early indication of what’s being said, but:

  • Their text may change in future results
  • They may disappear entirely
  • They may be replaced by different tokens

Token stream behavior

In each real-time result received from the API:

  • Final tokens always appear first, followed by any non-final tokens.
  • A final token is emitted exactly once. Once a token is marked as final, it will not be sent again in future responses.
  • A token may be returned multiple times as a non-final token before eventually becoming final.
  • There is no guaranteed relationship between non-final tokens across updates — they may change, disappear, or be replaced entirely.

Output format

Each token in the API response includes:

  • text: The recognized token (word or sub-word)
  • is_final: Whether the token is final (true) or non-final (false).
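
As a minimal sketch of reading these two fields, the Python below parses a single response and separates its text by is_final. The function name split_tokens and the sample payload are hypothetical and not part of any Soniox SDK; the only assumption about the response is the tokens, text, and is_final shape described above.

```python
import json


def split_tokens(response_json: str) -> tuple[str, str]:
    """Return (final_text, non_final_text) for one real-time response."""
    response = json.loads(response_json)
    final_parts = []
    non_final_parts = []
    for token in response.get("tokens", []):
        # Route each token by its is_final flag.
        if token["is_final"]:
            final_parts.append(token["text"])
        else:
            non_final_parts.append(token["text"])
    return "".join(final_parts), "".join(non_final_parts)


# Hypothetical payload: one final token followed by two non-final tokens.
payload = (
    '{"tokens": ['
    '{"text": "How", "is_final": true}, '
    '{"text": " ", "is_final": false}, '
    '{"text": "are", "is_final": false}]}'
)
final_text, non_final_text = split_tokens(payload)
print(repr(final_text), repr(non_final_text))
```

Keeping the two strings separate lets a UI style the provisional part differently (for example, in a lighter color) from the confirmed part.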

Example behavior

Here's how a token may evolve through multiple real-time responses:

Step 1: Initial recognition (non-final)

{
  "tokens": [
    {"text": "How", "is_final": false}
  ]
}

Step 2: Updated recognition (still non-final)

{
  "tokens": [
    {"text": "How", "is_final": false},
    {"text": " ", "is_final": false},
    {"text": "are", "is_final": false}
  ]
}

Step 3: Further update (still non-final)

{
  "tokens": [
    {"text": "How", "is_final": false},
    {"text": " ", "is_final": false},
    {"text": "are", "is_final": false},
    {"text": " ", "is_final": false},
    {"text": "you", "is_final": false}
  ]
}

Step 4: Finalized tokens

{
  "tokens": [
    {"text": "How", "is_final": true},
    {"text": " ", "is_final": true},
    {"text": "are", "is_final": true},
    {"text": " ", "is_final": true},
    {"text": "you", "is_final": true}
  ]
}

Step 5: New non-final tokens begin

{
  "tokens": [
    {"text": " ", "is_final": false},
    {"text": "doing", "is_final": false}
  ]
}

In this flow, the phrase "How are you" was first streamed as non-final tokens, updated as context improved, and finally confirmed as final tokens. Once marked as is_final: true, the tokens will not appear again in future results.
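
One way a client might turn this flow into displayed text is sketched below in Python: final tokens are appended to a growing transcript, while non-final tokens are discarded and rebuilt on every response. The helper names replay and tokens are hypothetical, and the five responses are the ones from the steps above, represented as plain dictionaries rather than data read from a live connection.

```python
def replay(responses):
    """Yield the text a client would display after each response."""
    final_text = ""  # only grows: each final token is sent exactly once
    for response in responses:
        non_final_text = ""  # rebuilt from scratch on every response
        for token in response["tokens"]:
            if token["is_final"]:
                final_text += token["text"]
            else:
                non_final_text += token["text"]
        yield final_text + non_final_text


def tokens(texts, is_final):
    """Build a response in the shape shown in the steps above."""
    return {"tokens": [{"text": t, "is_final": is_final} for t in texts]}


steps = [
    tokens(["How"], False),                           # Step 1
    tokens(["How", " ", "are"], False),               # Step 2
    tokens(["How", " ", "are", " ", "you"], False),   # Step 3
    tokens(["How", " ", "are", " ", "you"], True),    # Step 4
    tokens([" ", "doing"], False),                    # Step 5
]

displayed = list(replay(steps))
for text in displayed:
    print(text)
```

Because non-final text is never carried over between responses, the display stays correct even when a provisional token changes or disappears in a later update.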


Audio processing durations

In real-time transcription, the API response includes two fields that indicate how much of the audio has been processed:

  • audio_final_proc_ms: The duration, in milliseconds, of audio that has been processed and resulted in final tokens.
  • audio_total_proc_ms: The duration, in milliseconds, of audio that has been processed into either final or non-final tokens.

These values represent the offset (from the start of the audio stream) up to which Soniox has processed the audio for final and total results, respectively.

Example

{
  "audio_final_proc_ms": 4800,
  "audio_total_proc_ms": 5250
}

This means:

  • Audio up to 4.8 seconds has been fully processed and yielded final tokens
  • Audio up to 5.25 seconds has been partially processed and may have produced non-final tokens
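
The gap between the two values is the span of audio whose tokens are still provisional. A small Python sketch of that arithmetic follows; the function name pending_audio_ms is hypothetical, and the response is the example above as a plain dictionary.

```python
def pending_audio_ms(response: dict) -> int:
    """Milliseconds of processed audio not yet covered by final tokens."""
    return response["audio_total_proc_ms"] - response["audio_final_proc_ms"]


response = {"audio_final_proc_ms": 4800, "audio_total_proc_ms": 5250}
pending = pending_audio_ms(response)
print(pending)  # 450
```

Here the audio between 4.8 and 5.25 seconds (450 ms) has been processed but may still be revised.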
