# Real-time latency

Learn how to control the real-time latency of final tokens.
## Overview
In real-time transcription, there is a natural trade-off between latency and accuracy.
Soniox Speech-to-Text AI allows you to control how quickly final tokens are returned
after speech is detected, using the `max_non_final_tokens_duration_ms` parameter.
This parameter enables fine-grained control over the delay between when a word is spoken and when it is finalized in the transcription response.
## Description
The `max_non_final_tokens_duration_ms` parameter sets the maximum delay (in milliseconds)
between the end of a spoken token and the point at which that token is returned as final in the API response.
Allowed range:

- Minimum: 700 milliseconds
- Maximum: 6000 milliseconds
- Default: 4000 milliseconds
## How it works
- When a token is first recognized, it is returned as non-final.
- After the delay specified by `max_non_final_tokens_duration_ms`, the token is returned as final (unless the model has revised it due to additional context).
- A shorter value reduces finalization latency but may slightly reduce accuracy.
- A longer value gives the model more time and context to finalize tokens, improving accuracy at the cost of increased latency.
## Example
If `max_non_final_tokens_duration_ms` is set to `1000`:
- A token spoken at 3.0 seconds may be finalized and returned by 4.0 seconds.
- This ensures low-latency display, useful for live captions or voice interfaces.
If set to `6000`:

- The same token may not be finalized until up to 9.0 seconds, allowing the model to use more future context for higher accuracy.
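The arithmetic behind these examples is simple: the latest finalization time is the token's end time plus the configured delay. A minimal sketch (the `finalization_deadline_s` helper is illustrative, and it ignores early finalization or model revisions):

```python
def finalization_deadline_s(token_end_s: float, max_non_final_ms: int) -> float:
    """Latest time (in seconds) by which a token ending at token_end_s
    should be returned as final, given max_non_final_tokens_duration_ms."""
    return token_end_s + max_non_final_ms / 1000.0

# A token ending at 3.0 s with a 1000 ms limit is final by 4.0 s.
print(finalization_deadline_s(3.0, 1000))  # 4.0
# With the 6000 ms maximum, the same token may stay non-final until 9.0 s.
print(finalization_deadline_s(3.0, 6000))  # 9.0
```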
## Setting the parameter
You can set `max_non_final_tokens_duration_ms` when initiating a real-time transcription session via the API:
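A minimal sketch of a session configuration message. Only `max_non_final_tokens_duration_ms` comes from this page; the endpoint URL and the `api_key` and `model` fields are placeholder assumptions modeled on typical real-time STT WebSocket APIs, so check the Soniox API reference for the exact request shape:

```python
import json

# Assumed WebSocket endpoint for real-time transcription (illustrative).
WSS_URL = "wss://stt-rt.soniox.com/transcribe-websocket"

config = {
    "api_key": "<YOUR_API_KEY>",               # placeholder credential field
    "model": "<REALTIME_MODEL>",               # placeholder model name
    "max_non_final_tokens_duration_ms": 1000,  # finalize tokens within ~1 s
}

# In a typical real-time API, the config is sent as the first JSON
# message after the WebSocket connection is opened.
message = json.dumps(config)
print(message)
```

Lower values such as `1000` suit live captioning; leaving the parameter unset keeps the 4000 ms default, which favors accuracy.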
## Final tokens & token stability
For more details on how tokens progress from non-final to final, refer to Final vs. non-final tokens.