Real-time transcription
Learn about real-time transcription with low latency and high accuracy for all 60+ languages.
Overview
Soniox Speech-to-Text AI lets you transcribe audio in real time with low latency and high accuracy in over 60 languages. This is ideal for use cases like live captions, voice assistants, streaming analytics, and conversational AI.
Real-time transcription is provided through our WebSocket API, which streams results back to you as the audio is processed.
How processing works
As audio is streamed into the API, Soniox returns a continuous stream of tokens — small units of text such as subwords, words, or spaces.
Each token carries a status flag (is_final) that tells you whether the token is provisional or confirmed:
- Non-final token (is_final: false) → Provisional text. Appears instantly but may change, disappear, or be replaced as more audio arrives.
- Final token (is_final: true) → Confirmed text. Once marked final, it will never change in future responses.
This means you get text right away (non-final for instant feedback), followed by the confirmed version (final for stable output).
Non-final tokens may appear multiple times and change slightly until they stabilize into a final token. Final tokens are sent only once and never repeated.
Example token evolution
Here’s how "How are you doing?" might arrive over time:
Initial guess (non-final):
Refined guess (non-final):
Mixed output (final + non-final):
Mixed output (final + non-final):
Confirmed tokens (final):
Bottom line: The model may start with a shorthand guess like “How’re”, then refine it into “How are you”, and finally extend it into “How are you doing?”. Non-final tokens update instantly, while final tokens never change once confirmed.
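The rule above (final tokens arrive once and accumulate, non-final tokens are replaced on every response) can be sketched in Python as follows. This is a minimal illustration, not the official client: the tokens and text field names and the sample responses are assumptions made for the example; only is_final comes from the description above.

```python
from typing import Iterable


def render_transcript(responses: Iterable[dict]) -> list[str]:
    """Return the text a client would display after each response."""
    final_text = ""  # confirmed text: only ever grows
    snapshots = []
    for response in responses:
        non_final_text = ""  # provisional text: rebuilt from scratch each time
        # "tokens" and "text" are assumed field names for this sketch.
        for token in response.get("tokens", []):
            if token["is_final"]:
                final_text += token["text"]
            else:
                non_final_text += token["text"]
        snapshots.append(final_text + non_final_text)
    return snapshots


# Hypothetical responses, loosely following the evolution described above.
responses = [
    {"tokens": [{"text": "How", "is_final": False},
                {"text": "'re", "is_final": False}]},
    {"tokens": [{"text": "How", "is_final": False},
                {"text": " are", "is_final": False},
                {"text": " you", "is_final": False}]},
    {"tokens": [{"text": "How", "is_final": True},
                {"text": " are", "is_final": True},
                {"text": " you", "is_final": False},
                {"text": " doing", "is_final": False}]},
    {"tokens": [{"text": " you", "is_final": True},
                {"text": " doing", "is_final": True},
                {"text": "?", "is_final": True}]},
]

snapshots = render_transcript(responses)
for line in snapshots:
    print(line)
```

Because non-final text is discarded and rebuilt on every response, the displayed line can be corrected in place, while the final portion is append-only.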
Audio progress tracking
Each response also tells you how much audio has been processed:
- audio_final_proc_ms: audio processed into final tokens.
- audio_total_proc_ms: audio processed into final + non-final tokens.
Example:
This means:
- Audio up to 4.8s has been processed and finalized (final tokens).
- Audio up to 5.25s has been processed in total (final + non-final tokens).
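As a rough sketch, a client could turn these two fields into a progress summary like this. The response dictionary below is a hypothetical payload that matches the figures above (4800 ms and 5250 ms); describe_progress is a helper invented for the example.

```python
def describe_progress(response: dict) -> str:
    """Summarize how much audio is finalized versus still provisional."""
    final_ms = response["audio_final_proc_ms"]
    total_ms = response["audio_total_proc_ms"]
    pending_ms = total_ms - final_ms  # audio covered only by non-final tokens
    return (
        f"finalized up to {final_ms / 1000:.2f}s, "
        f"processed up to {total_ms / 1000:.2f}s, "
        f"{pending_ms} ms still provisional"
    )


# Hypothetical response carrying only the two progress fields.
response = {"audio_final_proc_ms": 4800, "audio_total_proc_ms": 5250}
summary = describe_progress(response)
print(summary)
```

The difference between the two values is the span of audio whose text may still change.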
Getting final tokens sooner
There are two ways to obtain final tokens more quickly:
- Endpoint detection: the model can detect when a speaker has stopped talking and finalize tokens immediately.
- Manual finalization: you can send a "type": "finalize" message over the WebSocket to force all pending tokens to finalize.
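A minimal sketch of manual finalization in Python, assuming the message is a JSON object sent as a text frame. The send_text callable stands in for whatever send method your WebSocket client exposes; it is a placeholder for this example, not part of the Soniox API.

```python
import json
from typing import Callable


def request_finalization(send_text: Callable[[str], None]) -> str:
    """Serialize the finalize message and hand it to the WebSocket sender."""
    message = json.dumps({"type": "finalize"})
    send_text(message)
    return message


# Stand-in for a real connection: collect outgoing frames in a list.
sent_frames: list[str] = []
finalize_message = request_finalization(sent_frames.append)
print(finalize_message)
```

In a real client you would pass the connection's send method instead of sent_frames.append.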
Audio formats
Soniox supports both auto-detected formats (no configuration required) and raw audio formats (manual configuration required).
Auto-detected formats
Soniox can automatically detect common container formats from stream headers. No configuration needed — just set:
Supported auto formats:
Raw audio formats
For raw audio streams without headers, you must provide:
- audio_format → encoding type.
- sample_rate → sample rate in Hz.
- num_channels → number of channels (e.g. 1 (mono) or 2 (stereo)).
Supported encodings:
- PCM (signed): pcm_s8, pcm_s16, pcm_s24, pcm_s32 (le/be).
- PCM (unsigned): pcm_u8, pcm_u16, pcm_u24, pcm_u32 (le/be).
- Float PCM: pcm_f32, pcm_f64 (le/be).
- Companded: mulaw, alaw.
Example: raw PCM (16-bit, 16 kHz, mono)
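A minimal Python sketch of such a configuration. It assumes the byte-order suffix is appended directly to the encoding name (pcm_s16le for little-endian 16-bit PCM); check the API reference for the exact spelling. The helper functions and the small lookup table are illustrative only.

```python
import json

# Bytes per sample for a few of the encodings listed above.
# The "le"/"be" suffixed spellings are an assumption for this sketch.
BYTES_PER_SAMPLE = {
    "pcm_s16le": 2,
    "pcm_s16be": 2,
    "pcm_f32le": 4,
    "mulaw": 1,
    "alaw": 1,
}


def raw_audio_config(audio_format: str, sample_rate: int, num_channels: int) -> dict:
    """Build the three fields required for a headerless raw audio stream."""
    if audio_format not in BYTES_PER_SAMPLE:
        raise ValueError(f"encoding not covered by this sketch: {audio_format}")
    if sample_rate <= 0 or num_channels <= 0:
        raise ValueError("sample_rate and num_channels must be positive")
    return {
        "audio_format": audio_format,
        "sample_rate": sample_rate,
        "num_channels": num_channels,
    }


def bytes_per_second(config: dict) -> int:
    """Raw data rate of the stream, useful for sizing audio chunks."""
    return (
        config["sample_rate"]
        * config["num_channels"]
        * BYTES_PER_SAMPLE[config["audio_format"]]
    )


# 16-bit, 16 kHz, mono
config = raw_audio_config("pcm_s16le", 16000, 1)
config_json = json.dumps(config)
print(config_json)
print(bytes_per_second(config))
```

At 2 bytes per sample, this stream carries 32,000 bytes of audio per second, so a 100 ms chunk is 3,200 bytes.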
Code example
Prerequisite: Complete the steps in Get Started.