Real-time transcription
Learn about real-time transcription with low latency and high accuracy in all 60+ supported languages.
Overview
Soniox Speech-to-Text AI lets you transcribe audio in real time with low latency and high accuracy in over 60 languages. This is ideal for use cases like live captions, voice assistants, streaming analytics, and conversational AI.
Real-time transcription is provided through our WebSocket API, which streams results back to you as the audio is processed.
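A minimal streaming sketch is shown below. The endpoint URL, configuration fields (`api_key`, `model`, `audio_format`), end-of-audio signal, and token fields (`text`, `is_final`) are assumptions made for illustration; check the WebSocket API reference for the exact values and schema.

```python
import asyncio
import json

import websockets

# Assumed endpoint URL; replace with the value from the WebSocket API reference.
SONIOX_WS_URL = "wss://stt-rt.soniox.com/transcribe-websocket"

async def transcribe(audio_chunks):
    """Stream audio chunks and print tokens as they arrive."""
    async with websockets.connect(SONIOX_WS_URL) as ws:
        # First message: session configuration (field names are assumptions).
        await ws.send(json.dumps({
            "api_key": "<SONIOX_API_KEY>",
            "model": "<realtime-model>",
            "audio_format": "auto",
        }))

        async def send_audio():
            for chunk in audio_chunks:   # raw audio bytes
                await ws.send(chunk)
            await ws.send("")            # assumed end-of-audio signal

        async def receive_tokens():
            async for message in ws:     # loop ends when the server closes the connection
                response = json.loads(message)
                for token in response.get("tokens", []):
                    status = "final" if token.get("is_final") else "non-final"
                    print(f"{status}: {token['text']!r}")

        await asyncio.gather(send_audio(), receive_tokens())

# asyncio.run(transcribe(read_audio_chunks()))  # read_audio_chunks() is a hypothetical audio source
```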
How processing works
As audio flows into the API, Soniox returns a continuous stream of tokens (small pieces of text, like words or spaces).
Each token has a status:
- Non-final token: Provisional text. It appears instantly but may change, disappear, or be replaced as more audio comes in.
- Final token: Confirmed text. Once a token is marked as final, it will never change in future responses.
This means you get text right away (non-final), and then shortly after, you get the confirmed final version.
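As a sketch, a single response might carry tokens like the following; the field names (`tokens`, `text`, `is_final`) are assumptions here, so check the API reference for the actual schema:

```json
{
  "tokens": [
    { "text": "How", "is_final": true },
    { "text": " are", "is_final": true },
    { "text": " you", "is_final": false }
  ]
}
```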
Token behavior
- Each API response includes final tokens followed by non-final tokens.
- A final token is sent once — it won’t appear again.
- Non-final tokens may show up multiple times (changing slightly) until they stabilize into a final token.
- Don’t rely on non-final tokens being consistent across responses.
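A common way to consume the stream, then, is to append final tokens to a growing transcript and treat non-final tokens as a provisional tail that is rebuilt from each response. A minimal sketch, assuming each token exposes `text` and `is_final` fields:

```python
import json

class TranscriptBuilder:
    """Accumulates final tokens and keeps only the latest non-final tail."""

    def __init__(self):
        self.final_text = ""      # confirmed text; only ever grows
        self.non_final_text = ""  # provisional tail; rebuilt on every response

    def handle_message(self, message: str) -> str:
        """Merge one WebSocket response and return the text to display."""
        self.non_final_text = ""  # drop the previous provisional tail
        for token in json.loads(message).get("tokens", []):
            if token.get("is_final"):
                self.final_text += token["text"]       # final tokens arrive exactly once
            else:
                self.non_final_text += token["text"]   # may still change later
        return self.final_text + self.non_final_text
```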
Example token evolution
Here’s how "How are you" might arrive over time:
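(Timings and exact token boundaries below are illustrative.)
- Response 1: non-final: "How"
- Response 2: non-final: "How are"
- Response 3: non-final: "How are you"
- Response 4: final: "How are you"
- Later responses: final tokens for the next words, plus new non-final text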
Bottom line: “How are you” first appears as a guess (non-final), then gets locked in (final), and then transcription continues with the next words.
Audio progress tracking
Each response also tells you how much audio has been processed:
- `audio_final_proc_ms`: audio fully processed into final tokens
- `audio_total_proc_ms`: audio processed into final + non-final tokens
Example:
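For instance, a response reporting this progress might include the following fields (values chosen to match the explanation below; token data omitted):

```json
{
  "audio_final_proc_ms": 4800,
  "audio_total_proc_ms": 5250
}
```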
This means:
- Audio up to 4.8s has been finalized
- Audio up to 5.25s has been partially processed (non-final tokens)
Controlling latency vs accuracy
By default, Soniox applies a short delay before finalizing tokens. This improves accuracy but adds a bit of latency. You can control this delay with a configuration parameter (see the configuration sketch after the examples below):
- Range: 360–6000 ms (recommended: 4000)
- Shorter delay = faster final text, less accurate
- Longer delay = more accurate, higher latency
Example:
- 1000 ms: word at 3.0s → finalized by ~4.0s
- 6000 ms: word at 3.0s → finalized by ~9.0s
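The delay is set when you start the session. A minimal sketch of the relevant configuration; the parameter name `max_non_final_tokens_duration_ms` is an assumption on this page, so confirm it against the WebSocket API reference:

```python
import json

# Assumed session configuration; only the delay-related field is shown in detail.
config = {
    "api_key": "<SONIOX_API_KEY>",
    "model": "<realtime-model>",
    # Finalization delay: 360-6000 ms. Lower values finalize text sooner but less
    # accurately; higher values are more accurate but add latency.
    "max_non_final_tokens_duration_ms": 4000,
}
message = json.dumps(config)  # sent as the first WebSocket message
```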
Getting final tokens sooner
You don’t always have to wait for this delay to elapse; there are two faster options:
- Endpoint detection: the model can detect when a speaker has stopped talking and finalize tokens immediately.
- Manual finalization: you can send a `"type": "finalize"` message over the WebSocket to force all pending tokens to finalize (see the sketch below).
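For manual finalization, the control message goes over the same WebSocket connection that carries the audio. A minimal sketch, assuming `ws` is the open connection from the earlier example:

```python
import json

async def finalize_pending_tokens(ws):
    """Ask the server to finalize all currently pending non-final tokens."""
    await ws.send(json.dumps({"type": "finalize"}))
```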