Real-time transcription

Learn about real-time transcription with low latency and high accuracy for all 60+ languages.

Overview

Soniox Speech-to-Text AI lets you transcribe audio in real time with low latency and high accuracy in over 60 languages. This is ideal for use cases like live captions, voice assistants, streaming analytics, and conversational AI.

Real-time transcription is provided through our WebSocket API, which streams results back to you as the audio is processed.


How processing works

As audio flows into the API, Soniox returns a continuous stream of tokens (small pieces of text, like words or spaces).

Each token has a status:

  • Non-final token: Provisional text. It appears instantly but may change, disappear, or be replaced as more audio comes in.
  • Final token: Confirmed text. Once a token is marked as final, it will never change in future responses.

This means you get text right away (non-final), and then shortly after, you get the confirmed final version.


Token behavior

  • Each API response lists any final tokens first, followed by any non-final tokens.
  • A final token is sent once — it won’t appear again.
  • Non-final tokens may show up multiple times (changing slightly) until they stabilize into a final token.
  • Don’t rely on non-final tokens being consistent across responses.

Each token carries an is_final flag that indicates its status:

{ "text": "hello", "is_final": true }

Example token evolution

Here’s how "How are you" might arrive over time:

First guess (non-final):

{ "tokens": [ {"text": "How", "is_final": false} ] }

Expanding guess (non-final):

{ "tokens": [ {"text": "How", "is_final": false}, {"text": " ", "is_final": false}, {"text": "are", "is_final": false} ] }

Full phrase recognized (non-final):

{ "tokens": [ {"text": "How", "is_final": false}, {"text": " ", "is_final": false}, {"text": "are", "is_final": false}, {"text": " ", "is_final": false}, {"text": "you", "is_final": false} ] }

Confirmed final tokens:

{ "tokens": [ {"text": "How", "is_final": true}, {"text": " ", "is_final": true}, {"text": "are", "is_final": true}, {"text": " ", "is_final": true}, {"text": "you", "is_final": true} ] }

New non-final tokens start again:

{ "tokens": [ {"text": " ", "is_final": false}, {"text": "doing", "is_final": false} ] }

Bottom line: “How are you” first appears as a guess (non-final), then gets locked in (final), and then transcription continues with the next words.
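The token rules above suggest a simple client-side pattern: append final tokens permanently, and rebuild the non-final text from scratch on every response. The sketch below is one way to do that, not an official client. The Transcript class and its update method are hypothetical names chosen for illustration, and the responses are the five example messages from this section.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    """Hypothetical helper: keeps final text and the latest non-final text."""
    final_text: str = ""
    non_final_text: str = ""

    def update(self, response: dict) -> str:
        # Non-final tokens are provisional, so drop the previous ones and
        # rebuild them from the current response only.
        self.non_final_text = ""
        for token in response.get("tokens", []):
            if token["is_final"]:
                # A final token is sent once, so append it permanently.
                self.final_text += token["text"]
            else:
                self.non_final_text += token["text"]
        # Text to display: confirmed part plus the current guess.
        return self.final_text + self.non_final_text


def tokens(words, is_final):
    """Build a response in the shape shown in the examples above."""
    return {"tokens": [{"text": w, "is_final": is_final} for w in words]}


responses = [
    tokens(["How"], False),
    tokens(["How", " ", "are"], False),
    tokens(["How", " ", "are", " ", "you"], False),
    tokens(["How", " ", "are", " ", "you"], True),
    tokens([" ", "doing"], False),
]

transcript = Transcript()
for response in responses:
    print(repr(transcript.update(response)))
```

After the last response, final_text holds "How are you" and the displayed text is "How are you doing", with " doing" still open to revision.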


Audio progress tracking

Each response also tells you how much audio has been processed:

  • audio_final_proc_ms — audio fully processed into final tokens
  • audio_total_proc_ms — audio processed into final + non-final tokens

Example:

{
  "audio_final_proc_ms": 4800,
  "audio_total_proc_ms": 5250
}

This means:

  • Audio up to 4.8s has been finalized
  • Audio up to 5.25s has been processed, but the portion from 4.8s to 5.25s is covered only by non-final tokens that may still change
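The gap between the two fields tells you how much audio is still provisional. A minimal sketch of that arithmetic, using the example values above (the function name non_final_window_ms is a hypothetical helper, not part of the API):

```python
def non_final_window_ms(response: dict) -> int:
    """Milliseconds of audio currently covered only by non-final tokens."""
    return response["audio_total_proc_ms"] - response["audio_final_proc_ms"]


progress = {"audio_final_proc_ms": 4800, "audio_total_proc_ms": 5250}
print(non_final_window_ms(progress))  # 450
```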

Controlling latency vs accuracy

By default, Soniox waits briefly before finalizing tokens. The delay improves accuracy but adds some lag. You control it with:

max_non_final_tokens_duration_ms
  • Range: 360–6000 ms (recommended: 4000 ms)
  • Shorter delay = faster final text, less accurate
  • Longer delay = more accurate, higher latency

Example:

  • 1000 ms: word at 3.0s → finalized by ~4.0s
  • 6000 ms: word at 3.0s → finalized by ~9.0s

{ "max_non_final_tokens_duration_ms": 1000 }
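The two examples above follow from simple addition: a word is finalized roughly no later than its position in the audio plus the configured delay. The sketch below encodes that estimate, assuming the delay acts as an approximate upper bound. The function name is hypothetical, and the range check mirrors the 360–6000 ms range stated above.

```python
def latest_finalization_ms(word_time_ms: int,
                           max_non_final_tokens_duration_ms: int) -> int:
    """Approximate latest audio time by which a word should be final."""
    if not 360 <= max_non_final_tokens_duration_ms <= 6000:
        raise ValueError("delay must be between 360 and 6000 ms")
    return word_time_ms + max_non_final_tokens_duration_ms


print(latest_finalization_ms(3000, 1000))  # 4000
print(latest_finalization_ms(3000, 6000))  # 9000
```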

Getting final tokens sooner

You don’t always have to wait for the timeout — there are two faster options:

  1. Endpoint detection — the model can detect when a speaker has stopped talking and finalize tokens immediately.
  2. Manual finalization — you can send a "type": "finalize" message over the WebSocket to force all pending tokens to finalize.
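For manual finalization, the message is a small JSON object. The sketch below only builds and hands off that message. The send parameter is a hypothetical stand-in for the text-send method of whatever WebSocket client you use, and any fields beyond "type" are not assumed here.

```python
import json
from typing import Callable


def request_finalize(send: Callable[[str], None]) -> None:
    """Ask the server to finalize all pending non-final tokens."""
    # `send` is a placeholder for your WebSocket client's send method.
    send(json.dumps({"type": "finalize"}))


# Demonstration with a list standing in for the socket.
sent = []
request_finalize(sent.append)
print(sent[0])
```

Passing the send function in keeps the helper independent of any particular WebSocket library.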

Code example