WebSocket API

Learn how to use and integrate the Soniox Speech-to-Text WebSocket API.

Real-time transcription over WebSocket

The Soniox Speech-to-Text WebSocket API enables low-latency transcription of live audio streams. It supports features such as automatic speaker diarization and context customization, all over a single persistent WebSocket connection.

This API is ideal for live transcription scenarios such as meetings, broadcasts, voice interfaces, and real-time voice applications.


WebSocket endpoint

To connect to the WebSocket API, use:

wss://stt-rt.soniox.com/transcribe-websocket

Authentication and configuration

Before sending audio, you must authenticate and configure the transcription session by sending a JSON message like this:

{
  "api_key": "<SONIOX_API_KEY|SONIOX_TEMPORARY_API_KEY>",
  "model": "stt-rt-preview",
  "audio_format": "auto"
}

Configuration parameters

api_key (required, string)

Your Soniox API key. You can create keys in the Soniox Console.
For client-side integrations, use a temporary API key generated on the server to avoid exposing secrets.

model (required, string)

The transcription model to use. Example: "stt-rt-preview".
Use the GET /models endpoint to retrieve a list of available models.

audio_format (required, string)

The format of the streamed audio (e.g., "auto", "s16le").
See Supported audio formats for details.

num_channels (number)

Required for raw PCM formats.
Typically 1 for mono audio.

sample_rate (number)

Required for raw PCM formats.
Common value: 16000.

language_hints (array<string>)

Hints that guide transcription toward specific languages.
See Supported languages for the list of available ISO language codes.

context (string)

Domain-specific terms or phrases that improve recognition accuracy.
Max length: 10,000 characters.

enable_non_final_tokens (boolean, default: true)

If true, partial non-final tokens will be streamed before they are finalized.

max_non_final_tokens_duration_ms (number, default: 4000)

Maximum delay (in milliseconds) between a spoken word and its finalization.
Valid range: 700 to 6000.

enable_speaker_diarization (boolean)

Enables automatic speaker separation.

client_reference_id (string)

A client-defined identifier to track this stream. Can be any string. If not provided, it will be auto-generated.
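
As a rough illustration of how these parameters fit together, the Python sketch below assembles the initial configuration message. The helper name `build_config` and the `RAW_PCM_FORMATS` set are assumptions made for this example and are not part of the Soniox API; only "s16le" is treated as raw PCM because it is the one raw format named on this page.

```python
import json

# Assumption: formats treated as raw PCM in this sketch. Only "s16le" is
# named on this page; see Supported audio formats for the full list.
RAW_PCM_FORMATS = {"s16le"}


def build_config(api_key, model="stt-rt-preview", audio_format="auto", **options):
    """Return the JSON text of the first message sent on the connection."""
    config = {"api_key": api_key, "model": model, "audio_format": audio_format}
    config.update(options)  # e.g. language_hints, context, enable_speaker_diarization

    # Raw PCM has no header, so channel count and sample rate must be given.
    if audio_format in RAW_PCM_FORMATS:
        missing = [k for k in ("num_channels", "sample_rate") if k not in config]
        if missing:
            raise ValueError(f"raw PCM requires: {', '.join(missing)}")

    # The context parameter is limited to 10,000 characters.
    context = config.get("context")
    if context is not None and len(context) > 10_000:
        raise ValueError("context exceeds 10,000 characters")

    return json.dumps(config)
```

For example, `build_config(key, audio_format="s16le", num_channels=1, sample_rate=16000)` produces a message for 16 kHz mono PCM, while calling it with `audio_format="s16le"` alone raises `ValueError` before anything is sent.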


Audio streaming

After sending the initial configuration, begin streaming audio data:

  • Audio can be sent as binary WebSocket frames (preferred)
  • Alternatively, Base64-encoded audio can be sent as text messages (if binary is not supported)

The server expects audio to be streamed in real time — not significantly faster or slower than the actual rate of speech.
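
One way to approximate real-time pacing for pre-recorded raw audio is to cut it into fixed-duration chunks and wait one chunk duration between sends. The sketch below assumes 16-bit PCM ("s16le"); the function names `chunk_pcm` and `as_text_frame` are hypothetical helpers, not Soniox APIs.

```python
import base64


def chunk_pcm(pcm, sample_rate=16000, num_channels=1, chunk_ms=100):
    """Yield (chunk_bytes, duration_seconds) pairs for 16-bit PCM audio."""
    bytes_per_second = sample_rate * num_channels * 2  # 2 bytes per s16le sample
    frames_per_chunk = sample_rate * chunk_ms // 1000
    chunk_size = frames_per_chunk * num_channels * 2
    for start in range(0, len(pcm), chunk_size):
        chunk = pcm[start:start + chunk_size]
        # The last chunk may be shorter, so derive its duration from its size.
        yield chunk, len(chunk) / bytes_per_second


def as_text_frame(chunk):
    """Base64-encode a chunk for clients that cannot send binary frames."""
    return base64.b64encode(chunk).decode("ascii")
```

A sender would transmit each chunk and then sleep for the returned duration (for instance with `time.sleep`), so that audio leaves the client at roughly the rate it was spoken. Audio captured live from a microphone is already paced and needs no extra delay.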


Limitations

  • The maximum duration of a stream is 65 minutes
  • Max concurrent connections per organization: 10 (can be increased via the Soniox Console)
  • Streaming too slowly may result in the connection being closed
  • Audio must be sent at real-time speed — not faster or buffered

Ending the stream

To gracefully end a transcription session:

  • Send an empty WebSocket message (empty binary or text frame)
  • The server will return any final results, send a completion message, and close the connection
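
The message order described so far (configuration, audio, empty frame, then results until completion) can be sketched as follows. The `conn` argument is hypothetical: any object with `send()` and `recv()` methods, such as a synchronous WebSocket client connection. The function name `run_session` is likewise an assumption of this example.

```python
import json


def run_session(conn, config_json, audio_chunks):
    """Send config and audio, end the stream, and collect server responses."""
    conn.send(config_json)        # 1. authenticate and configure
    for chunk in audio_chunks:    # 2. stream audio as binary frames
        conn.send(chunk)
    conn.send(b"")                # 3. an empty frame signals the end of audio

    responses = []
    while True:                   # 4. read results until the session ends
        message = json.loads(conn.recv())
        responses.append(message)
        if message.get("finished") or "error_code" in message:
            return responses
```

This simplified version sends all audio before it reads anything, which keeps the ordering easy to follow. A live application would normally read responses concurrently with sending, and would pace `audio_chunks` at real-time speed.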

Response format

Soniox will send transcription responses in JSON format. Successful transcription responses follow this format:

{
  "tokens": [
    {
      "text": "Hello",
      "start_ms": 600,
      "end_ms": 760,
      "confidence": 0.97,
      "is_final": true,
      "speaker": "1",
      "language_code": "en",
      "is_audio_event": false
    }
  ],
  "final_audio_proc_ms": 760,
  "total_audio_proc_ms": 880
}

Field descriptions

tokens (array<object>)

The list of transcribed tokens (words or subwords).

Each token may include:

  text (string)

  Token text

  start_ms (number)

  Start timestamp of the token (in milliseconds)

  end_ms (number)

  End timestamp of the token (in milliseconds)

  confidence (number)

  Confidence score (0.0 to 1.0)

  is_final (boolean)

  Whether the token is finalized

  speaker (optional, string)

  Speaker label (if diarization enabled)

  language_code (optional, string)

  Detected language

  is_audio_event (optional, boolean)

  True if the token represents a non-verbal audio event

final_audio_proc_ms (number)

Amount of audio processed and finalized (in ms)

total_audio_proc_ms (number)

Total audio processed (in ms), including non-final tokens
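
A client typically keeps finalized tokens and redraws the non-final ones on each response. The sketch below rests on two assumptions that this page does not state outright: that each finalized token is delivered once, and that token text carries its own spacing so tokens can be joined without a separator. The function name `apply_response` is hypothetical.

```python
def apply_response(final_tokens, response):
    """Append newly finalized tokens; return (final_text, non_final_text)."""
    non_final = []
    for token in response.get("tokens", []):
        if token.get("is_final"):
            final_tokens.append(token)       # assumption: sent only once
        else:
            non_final.append(token["text"])  # replaced by the next response
    # Assumption: token text includes any leading space, so join with "".
    final_text = "".join(t["text"] for t in final_tokens)
    return final_text, "".join(non_final)
```

The caller owns the `final_tokens` list and passes the same list on every call, so the finalized transcript grows while the non-final tail is rebuilt from the latest response only.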


Finished response

At the end of the stream, Soniox will send a final message indicating the session is complete:

{
  "tokens": [],
  "final_audio_proc_ms": 1560,
  "total_audio_proc_ms": 1680,
  "finished": true
}

The server will then close the WebSocket connection.


Error response

If an error occurs, the server will send an error response and immediately close the connection:

{
  "tokens": [],
  "error_code": 503,
  "error_message": "Service is currently overloaded. Please retry your request..."
}

error_code (number)

Standard HTTP status code.

error_message (string)

A description of the error encountered.
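
Because an error response arrives as an ordinary JSON message, a client needs to check for `error_code` before treating a message as a transcription result. A minimal sketch, in which `TranscriptionError` and `parse_response` are hypothetical names introduced for this example:

```python
import json


class TranscriptionError(Exception):
    """Hypothetical exception type carrying the server's error fields."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


def parse_response(raw):
    """Decode one server message, raising if it is an error response."""
    response = json.loads(raw)
    if "error_code" in response:
        raise TranscriptionError(
            response["error_code"], response.get("error_message", "")
        )
    return response
```

Since the server closes the connection right after an error, a caller that catches `TranscriptionError` must open a new connection and resend the configuration before retrying.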

Possible error codes and their descriptions:
