
Real-time latency

Learn how to control the real-time latency of final tokens.

Overview

In real-time transcription, there is a natural trade-off between latency and accuracy. Soniox Speech-to-Text AI allows you to control how quickly final tokens are returned after speech is detected, using the max_non_final_tokens_duration_ms parameter.

This parameter enables fine-grained control over the delay between when a word is spoken and when it is finalized in the transcription response.


Description

The max_non_final_tokens_duration_ms parameter sets the maximum delay (in milliseconds) between the end of a spoken token and the point at which that token is returned as final in the API response.

Allowed range:

  • Minimum: 700 milliseconds
  • Maximum: 6000 milliseconds
  • Default: 4000 milliseconds

How it works

  • When a token is first recognized, it is returned as non-final.
  • After the delay specified by max_non_final_tokens_duration_ms, the token is returned as final (unless the model has revised it due to additional context).
  • A shorter value reduces finalization latency but may slightly reduce accuracy.
  • A longer value gives the model more time/context to finalize tokens, improving accuracy at the cost of increased latency.
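The behavior above can be sketched with a small helper. This is illustrative code, not an official Soniox client: the function names are hypothetical, and only the 700–6000 ms range, the 4000 ms default, and the "token end + delay" timing come from this page.

```python
# Allowed range and default, as documented for max_non_final_tokens_duration_ms.
MIN_MS = 700
MAX_MS = 6000
DEFAULT_MS = 4000

def clamp_non_final_duration(ms: int = DEFAULT_MS) -> int:
    """Clamp a requested value into the allowed 700-6000 ms range."""
    return max(MIN_MS, min(MAX_MS, ms))

def worst_case_final_time(token_end_s: float, max_non_final_ms: int = DEFAULT_MS) -> float:
    """Latest time (seconds) at which a token ending at token_end_s
    should be returned as final, per the documented delay semantics."""
    return token_end_s + clamp_non_final_duration(max_non_final_ms) / 1000.0
```

For instance, `worst_case_final_time(3.0, 1000)` gives 4.0, matching the example below.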

Example

If max_non_final_tokens_duration_ms is set to 1000:

  • A token spoken at 3.0 seconds may be finalized and returned by 4.0 seconds.
  • This ensures low-latency display, useful for live captions or voice interfaces.

If set to 6000:

  • The same token may not be finalized until up to 9.0 seconds, allowing the model to use more future context for higher accuracy.

Setting the parameter

You can set max_non_final_tokens_duration_ms when initiating a real-time transcription session via the API:

{
  "max_non_final_tokens_duration_ms": 1000
}
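In a client, this configuration is typically sent as JSON when the session is opened. A minimal sketch follows; only max_non_final_tokens_duration_ms is documented on this page, while the "api_key" and "model" fields are hypothetical placeholders — consult the API reference for the actual required session fields.

```python
import json

# Build the session-start configuration payload.
config = {
    "api_key": "<YOUR_API_KEY>",  # placeholder -- not documented on this page
    "model": "<MODEL_NAME>",      # placeholder -- not documented on this page
    "max_non_final_tokens_duration_ms": 1000,  # finalize tokens within ~1 s
}

payload = json.dumps(config)  # send this as the first message of the session
```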

Final tokens & token stability

For more details on how tokens progress from non-final to final, refer to Final vs. non-final tokens.
