Real-time transcription

Learn how to transcribe audio in real time with low latency and high accuracy across 60+ languages.

Overview

Soniox Speech-to-Text AI lets you transcribe audio in real time with low latency and high accuracy in over 60 languages. This is ideal for use cases like live captions, voice assistants, streaming analytics, and conversational AI.

Real-time transcription is provided through our WebSocket API, which streams results back to you as the audio is processed.

How processing works

As audio flows into the API, Soniox returns a continuous stream of tokens (small pieces of text, like words or spaces).

Each token has a status:

Non-final token: provisional text. It appears instantly but may change, disappear, or be replaced as more audio comes in.
Final token: confirmed text. Once a token is marked as final, it will never change in future responses.

This means you get text right away (non-final), and then shortly after, you get the confirmed final version.

Token behavior

Within each API response, final tokens (if any) come first, followed by non-final tokens.
A final token is sent once — it won’t appear again.
Non-final tokens may show up multiple times (changing slightly) until they stabilize into a final token.
Don’t rely on non-final tokens being consistent across responses.

Each token has a flag is_final:

{"text": "hello", "is_final": true}

Example token evolution

Here’s how "How are you" might arrive over time:

First guess (non-final):

{"tokens": [{"text": "How", "is_final": false}]}

Expanding guess (non-final):

{"tokens": [{"text": "How", "is_final": false},
            {"text": " ",   "is_final": false},
            {"text": "are", "is_final": false}]}

Full phrase recognized (final + non-final):

{"tokens": [{"text": "How", "is_final": true},
            {"text": " ",   "is_final": true},
            {"text": "are", "is_final": false},
            {"text": " ",   "is_final": false},
            {"text": "you", "is_final": false}]}

Remaining final tokens:

{"tokens": [{"text": "are", "is_final": true},
            {"text": " ",   "is_final": true},
            {"text": "you", "is_final": true}]}

New non-final tokens start again:

{"tokens": [{"text": " ",     "is_final": false},
            {"text": "doing", "is_final": false}]}

Bottom line: “How are you” first appears as a guess (non-final), then “How ” gets locked in (final), followed by “are you” (final), and then transcription continues with the next words.
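
One way to consume this stream on the client is to append final tokens to a permanent buffer and rebuild the non-final text from scratch on every response. The sketch below is illustrative rather than an official client: render_transcript is a hypothetical helper name, and it assumes each response has already been parsed from JSON into a dict shaped like the examples above.

```python
from typing import Iterable


def render_transcript(responses: Iterable[dict]) -> list[tuple[str, str]]:
    """Return a (final_text, non_final_text) snapshot after each response."""
    final_text = ""
    snapshots = []
    for response in responses:
        non_final_text = ""
        for token in response.get("tokens", []):
            if token["is_final"]:
                # Final tokens are sent once, so append them permanently.
                final_text += token["text"]
            else:
                # Non-final tokens are provisional: rebuild them on every
                # response instead of carrying them over.
                non_final_text += token["text"]
        snapshots.append((final_text, non_final_text))
    return snapshots


# The five responses from the example above, as parsed JSON.
responses = [
    {"tokens": [{"text": "How", "is_final": False}]},
    {"tokens": [{"text": "How", "is_final": False},
                {"text": " ", "is_final": False},
                {"text": "are", "is_final": False}]},
    {"tokens": [{"text": "How", "is_final": True},
                {"text": " ", "is_final": True},
                {"text": "are", "is_final": False},
                {"text": " ", "is_final": False},
                {"text": "you", "is_final": False}]},
    {"tokens": [{"text": "are", "is_final": True},
                {"text": " ", "is_final": True},
                {"text": "you", "is_final": True}]},
    {"tokens": [{"text": " ", "is_final": False},
                {"text": "doing", "is_final": False}]},
]

for final_text, non_final_text in render_transcript(responses):
    print(repr(final_text), repr(non_final_text))
```

A live-caption UI could display final_text followed by non_final_text in a lighter style, since only the second part can still change.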

Audio progress tracking

Each response also tells you how much audio has been processed:

audio_final_proc_ms — audio fully processed into final tokens.
audio_total_proc_ms — audio processed into final + non-final tokens.

Example:

{
  "audio_final_proc_ms": 4800,
  "audio_total_proc_ms": 5250
}

This means:

Audio up to 4.8s has been finalized.
Audio up to 5.25s has been processed into final or non-final tokens, so the last 450 ms is covered only by non-final tokens.
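
As a small worked example, the span of audio that is currently covered only by non-final tokens is the difference between the two fields. The helper name pending_audio_ms is hypothetical, and the response is assumed to be already parsed into a dict.

```python
def pending_audio_ms(response: dict) -> int:
    """Milliseconds of audio processed but not yet finalized."""
    return response["audio_total_proc_ms"] - response["audio_final_proc_ms"]


# Values from the example response above.
progress = {"audio_final_proc_ms": 4800, "audio_total_proc_ms": 5250}
print(pending_audio_ms(progress))
```

For the example values this is 5250 - 4800 = 450 ms.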

Getting final tokens sooner

There are two ways to obtain final tokens more quickly:

Endpoint detection — the model can detect when a speaker has stopped talking and finalize tokens immediately.
Manual finalization — you can send a "type": "finalize" message over the WebSocket to force all pending tokens to finalize.
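
A minimal sketch of manual finalization, assuming the message body consists only of the "type" field named above (check the API reference for any additional fields). The send parameter is a hypothetical stand-in for whatever send method your WebSocket client exposes; here a list's append is used so the sketch runs without a connection.

```python
import json
from typing import Callable


def request_finalize(send: Callable[[str], None]) -> None:
    """Ask the server to finalize all pending tokens."""
    # Assumption: the finalize message carries only the "type" field.
    send(json.dumps({"type": "finalize"}))


# Stand-in for a real WebSocket connection: collect outgoing messages.
sent_messages: list[str] = []
request_finalize(sent_messages.append)
print(sent_messages[0])
```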

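Audio formats

If your audio is in a container format with headers, set audio_format to auto so the format is detected from the stream:

{
  "audio_format": "auto"
}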

Supported auto formats:

aac, aiff, amr, asf, flac, mp3, ogg, wav, webm

Raw audio formats

For raw audio streams without headers, you must provide:

audio_format → encoding type.
sample_rate → in Hz.
num_channels → 1 (mono) or 2 (stereo).

Supported encodings:

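PCM (signed): pcm_s8, pcm_s16, pcm_s24, pcm_s32 (le/be).
PCM (unsigned): pcm_u8, pcm_u16, pcm_u24, pcm_u32 (le/be).
Float PCM: pcm_f32, pcm_f64 (le/be).
Companded: mulaw, alaw.

Here (le/be) denotes the byte-order suffix (little-endian or big-endian) appended to the format name, as in pcm_s16le.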

Example: raw PCM (16-bit, 16kHz, mono)

{
  "audio_format": "pcm_s16le",
  "sample_rate": 16000,
  "num_channels": 1
}
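
With a raw format, the three parameters above also determine how many bytes each slice of audio occupies, which is useful when cutting a stream into chunks before sending. The sketch below is an illustration under stated assumptions: chunk_size_bytes and the BYTES_PER_SAMPLE table are hypothetical helpers covering only a few of the encodings listed above, and the 120 ms chunk duration is an arbitrary choice, not a documented requirement.

```python
# Bytes per sample for a few of the encodings listed above.
# pcm_s16le is 16-bit; mulaw and alaw are 8-bit companded formats.
BYTES_PER_SAMPLE = {"pcm_s16le": 2, "mulaw": 1, "alaw": 1}


def chunk_size_bytes(config: dict, chunk_ms: int) -> int:
    """Number of bytes in chunk_ms milliseconds of raw audio."""
    bytes_per_sample = BYTES_PER_SAMPLE[config["audio_format"]]
    bytes_per_second = (
        config["sample_rate"] * config["num_channels"] * bytes_per_sample
    )
    return bytes_per_second * chunk_ms // 1000


# The raw PCM configuration from the example above.
raw_config = {
    "audio_format": "pcm_s16le",
    "sample_rate": 16000,
    "num_channels": 1,
}

# 120 ms per chunk is an assumed value for illustration.
print(chunk_size_bytes(raw_config, 120))
```

For 16-bit mono audio at 16 kHz this is 16000 × 1 × 2 = 32000 bytes per second, so a 120 ms chunk is 3840 bytes.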

Code example

Prerequisite: Complete the steps in Get Started.

Node