Real-time transcription
Learn about real-time transcription with low latency and high accuracy in all 60+ supported languages.
Overview
Soniox Speech-to-Text AI lets you transcribe audio in real time with low latency and high accuracy in over 60 languages. This is ideal for use cases like live captions, voice assistants, streaming analytics, and conversational AI.
Real-time transcription is provided through our WebSocket API, which streams results back to you as the audio is processed.
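A minimal streaming sketch is shown below. The endpoint URL, configuration fields (`api_key`, `model`, `audio_format`), end-of-audio signal, and token fields (`text`, `is_final`) are assumptions made for illustration; check the WebSocket API reference for the exact values and schema.

```python
import asyncio
import json

import websockets

# Assumed endpoint URL; replace with the value from the WebSocket API reference.
SONIOX_WS_URL = "wss://stt-rt.soniox.com/transcribe-websocket"

async def transcribe(audio_chunks):
    """Stream audio chunks and print tokens as they arrive."""
    async with websockets.connect(SONIOX_WS_URL) as ws:
        # First message: session configuration (field names are assumptions).
        await ws.send(json.dumps({
            "api_key": "<SONIOX_API_KEY>",
            "model": "<realtime-model>",
            "audio_format": "auto",
        }))

        async def send_audio():
            for chunk in audio_chunks:   # raw audio bytes
                await ws.send(chunk)
            await ws.send("")            # assumed end-of-audio signal

        async def receive_tokens():
            async for message in ws:     # loop ends when the server closes the connection
                response = json.loads(message)
                for token in response.get("tokens", []):
                    status = "final" if token.get("is_final") else "non-final"
                    print(f"{status}: {token['text']!r}")

        await asyncio.gather(send_audio(), receive_tokens())

# asyncio.run(transcribe(read_audio_chunks()))  # read_audio_chunks() is a hypothetical audio source
```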
How processing works
As audio flows into the API, Soniox returns a continuous stream of tokens (small pieces of text, like words or spaces).
Each token has a status:
- Non-final token: Provisional text. It appears instantly but may change, disappear, or be replaced as more audio comes in.
- Final token: Confirmed text. Once a token is marked as final, it will never change in future responses.
This means you get text right away (non-final), and then shortly after, you get the confirmed final version.
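As a sketch, a single response might carry tokens like the following; the field names (`tokens`, `text`, `is_final`) are assumptions here, so check the API reference for the actual schema:

```json
{
  "tokens": [
    { "text": "How", "is_final": true },
    { "text": " are", "is_final": true },
    { "text": " you", "is_final": false }
  ]
}
```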
Token behavior
- Each API response includes final tokens followed by non-final tokens.
- A final token is sent once — it won’t appear again.
- Non-final tokens may show up multiple times (changing slightly) until they stabilize into a final token.
- Don’t rely on non-final tokens being consistent across responses.
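A common way to consume the stream, then, is to append final tokens to a growing transcript and treat non-final tokens as a provisional tail that is rebuilt from each response. A minimal sketch, assuming each token exposes `text` and `is_final` fields:

```python
import json

class TranscriptBuilder:
    """Accumulates final tokens and keeps only the latest non-final tail."""

    def __init__(self):
        self.final_text = ""      # confirmed text; only ever grows
        self.non_final_text = ""  # provisional tail; rebuilt on every response

    def handle_message(self, message: str) -> str:
        """Merge one WebSocket response and return the text to display."""
        self.non_final_text = ""  # drop the previous provisional tail
        for token in json.loads(message).get("tokens", []):
            if token.get("is_final"):
                self.final_text += token["text"]       # final tokens arrive exactly once
            else:
                self.non_final_text += token["text"]   # may still change later
        return self.final_text + self.non_final_text
```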
Example token evolution
Here’s how "How are you" might arrive over time:
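(Timings and exact token boundaries below are illustrative.)
- Response 1: non-final: "How"
- Response 2: non-final: "How are"
- Response 3: non-final: "How are you"
- Response 4: final: "How are you"
- Later responses: final tokens for the next words, plus new non-final text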
Bottom line: “How are you” first appears as a guess (non-final), then gets locked in (final), and then transcription continues with the next words.
Audio progress tracking
Each response also tells you how much audio has been processed:
- `audio_final_proc_ms`: audio fully processed into final tokens
- `audio_total_proc_ms`: audio processed into final + non-final tokens
Example:
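For instance, a response reporting this progress might include the following fields (values chosen to match the explanation below; token data omitted):

```json
{
  "audio_final_proc_ms": 4800,
  "audio_total_proc_ms": 5250
}
```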
This means:
- Audio up to 4.8s has been finalized
- Audio up to 5.25s has been partially processed (non-final tokens)
Controlling latency vs accuracy
By default, Soniox applies a short delay before finalizing tokens. This improves accuracy but adds a bit of latency. You can control this delay with a configuration parameter (see the configuration sketch after the examples below):
- Range: 360–6000 ms (recommended: 4000)
- Shorter delay = faster final text, less accurate
- Longer delay = more accurate, higher latency
Example:
- 1000 ms: word at 3.0s → finalized by ~4.0s
- 6000 ms: word at 3.0s → finalized by ~9.0s
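The delay is set when you start the session. A minimal sketch of the relevant configuration; the parameter name `max_non_final_tokens_duration_ms` is an assumption on this page, so confirm it against the WebSocket API reference:

```python
import json

# Assumed session configuration; only the delay-related field is shown in detail.
config = {
    "api_key": "<SONIOX_API_KEY>",
    "model": "<realtime-model>",
    # Finalization delay: 360-6000 ms. Lower values finalize text sooner but less
    # accurately; higher values are more accurate but add latency.
    "max_non_final_tokens_duration_ms": 4000,
}
message = json.dumps(config)  # sent as the first WebSocket message
```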
Getting final tokens sooner
You don’t always have to wait for this delay to elapse; there are two faster options:
- Endpoint detection: the model can detect when a speaker has stopped talking and finalize tokens immediately.
- Manual finalization: you can send a `"type": "finalize"` message over the WebSocket to force all pending tokens to finalize (see the sketch below).
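For manual finalization, the control message goes over the same WebSocket connection that carries the audio. A minimal sketch, assuming `ws` is the open connection from the earlier example:

```python
import json

async def finalize_pending_tokens(ws):
    """Ask the server to finalize all currently pending non-final tokens."""
    await ws.send(json.dumps({"type": "finalize"}))
```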