WebSocket API

Real-time transcription over WebSocket

The Soniox Speech-to-Text WebSocket API enables low-latency transcription of live audio streams. It supports features such as automatic speaker diarization and context customization, all over a persistent WebSocket connection.

This API is ideal for live transcription scenarios such as meetings, broadcasts, voice interfaces, and real-time voice applications.

WebSocket endpoint

To connect to the WebSocket API, use:

wss://stt-rt.soniox.com/transcribe-websocket

Authentication and configuration

Before sending audio, you must authenticate and configure the transcription session by sending a JSON message like this:

{
  "api_key": "<SONIOX_API_KEY|SONIOX_TEMPORARY_API_KEY>",
  "model": "stt-rt-preview-v2",
  "audio_format": "auto"
}
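As a minimal sketch, the first message could be assembled in Python as shown here. The helper name build_config is hypothetical and not part of the API; only the field names and example values come from this page.

```python
import json


def build_config(api_key: str, model: str = "stt-rt-preview-v2",
                 audio_format: str = "auto", **options) -> str:
    """Serialize the initial configuration message as a JSON string.

    Extra keyword arguments (for example language_hints) are passed
    through unchanged; options whose value is None are left out.
    """
    message = {"api_key": api_key, "model": model, "audio_format": audio_format}
    message.update({k: v for k, v in options.items() if v is not None})
    return json.dumps(message)
```

The returned string is intended to be sent as the first text message on the connection, before any audio.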

Configuration parameters

api_key (required, string)

Your Soniox API key. You can create keys in the Soniox Console. For client-side integrations, use a temporary API key generated on the server to avoid exposing secrets.

model (required, string)

The transcription model to use. Use the GET /models endpoint to retrieve a list of available models.

Example: "stt-rt-preview-v2"

audio_format (required, string)

The format of the streamed audio. See Supported audio formats for details.

Examples: "auto", "pcm_s16le"

num_channels (number)

Number of audio channels. Required for raw PCM formats.

Common values: 1 for mono audio, 2 for stereo audio

sample_rate (number)

Audio sample rate in Hz. Required for raw PCM formats.

Common value: 16000

language_hints (array<string>)

Expected languages in the audio. If not specified, languages are detected automatically. See supported languages for the list of available ISO language codes.

context (string)

Domain-specific terms or phrases that improve recognition accuracy.

Maximum length: 10000

enable_speaker_diarization (boolean)

When true, speakers are identified and separated in the transcription output.

enable_language_identification (boolean)

When true, the language is detected automatically at the token level.

enable_non_final_tokens (boolean)

When true, partial non-final tokens are streamed before they are finalized. See Final vs non-final tokens for more information.

Default: true

max_non_final_tokens_duration_ms (number)

Maximum delay (in milliseconds) between a spoken word and its finalization.

Default: 4000
Minimum: 360
Maximum: 6000

enable_endpoint_detection (boolean)

When true, endpoint detection is enabled.

client_reference_id (string)

Optional tracking identifier. Does not need to be unique.

Maximum length: 256

translation (object)

Configures real-time translation. See the Real-time transcription page for more information.

One-way translation

type (required, string)

Must be set to one_way to enable one-way translation.

target_language (required, string)

The target language for translation.

Two-way translation

type (required, string)

Must be set to two_way to enable two-way translation.

language_a (required, string)

One language for the two-way translation.

language_b (required, string)

The other language for the two-way translation.
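The constraints listed for these parameters can be checked on the client before connecting. The sketch below is illustrative only: the function names are hypothetical, the "pcm_" prefix test for raw PCM formats is an assumption based on the "pcm_s16le" example, and the maximum lengths are interpreted as character counts, which this page does not state.

```python
def one_way(target_language: str) -> dict:
    """Translation object for one-way translation."""
    return {"type": "one_way", "target_language": target_language}


def two_way(language_a: str, language_b: str) -> dict:
    """Translation object for two-way translation."""
    return {"type": "two_way", "language_a": language_a, "language_b": language_b}


def config_problems(config: dict) -> list[str]:
    """Return a list of problems found in a config dict (empty if none)."""
    problems = []
    for field in ("api_key", "model", "audio_format"):
        if not config.get(field):
            problems.append(f"{field} is required")
    # Assumption: raw PCM format names start with "pcm_", as in "pcm_s16le".
    if str(config.get("audio_format", "")).startswith("pcm_"):
        for field in ("num_channels", "sample_rate"):
            if field not in config:
                problems.append(f"{field} is required for raw PCM formats")
    # Assumption: the documented maximum lengths are counted in characters.
    if len(config.get("context", "")) > 10000:
        problems.append("context is longer than 10000")
    if len(config.get("client_reference_id", "")) > 256:
        problems.append("client_reference_id is longer than 256")
    delay = config.get("max_non_final_tokens_duration_ms")
    if delay is not None and not 360 <= delay <= 6000:
        problems.append("max_non_final_tokens_duration_ms must be between 360 and 6000")
    return problems
```

For example, a config with "audio_format": "pcm_s16le" and "translation": one_way("es") but no num_channels or sample_rate would be reported as having two problems.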

Audio streaming

After sending the initial configuration, begin streaming audio data:

Audio can be sent as binary WebSocket frames (preferred)
Alternatively, Base64-encoded audio can be sent as text messages (if binary is not supported)
The maximum duration of a stream is 65 minutes
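The two framing options above might be prepared as in the following sketch. The function name audio_frames and the default chunk size are assumptions made for illustration (3840 bytes is 120 ms of 16 kHz mono 16-bit PCM); this page does not prescribe a chunk size.

```python
import base64
from typing import Iterator, Union


def audio_frames(audio: bytes, chunk_size: int = 3840,
                 binary: bool = True) -> Iterator[Union[bytes, str]]:
    """Split audio into frames ready to send over the WebSocket.

    With binary=True each frame is a bytes object for a binary message.
    With binary=False each frame is a Base64 string for a text message.
    Empty input yields no frames, so no empty message is produced here.
    """
    for start in range(0, len(audio), chunk_size):
        chunk = audio[start:start + chunk_size]
        yield chunk if binary else base64.b64encode(chunk).decode("ascii")
```

Binary frames avoid the roughly one-third size overhead of Base64, which is one reason to prefer them when the client supports them.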

Ending the stream

To gracefully end a transcription session:

Send an empty WebSocket message (empty binary or text frame)
The server will return any final results, send a completion message, and close the connection
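Put together, the client side of a session is: configuration first, then audio, then one empty frame. The sketch below expresses only that ordering. The send parameter is a stand-in for the send method of whichever WebSocket client library is in use, and run_session is a hypothetical name.

```python
from typing import Callable, Iterable, Union

Frame = Union[bytes, str]


def run_session(send: Callable[[Frame], None], config_json: str,
                chunks: Iterable[bytes]) -> None:
    """Send the config, then the audio, then the end-of-stream frame."""
    send(config_json)      # configuration goes first, as a text message
    for chunk in chunks:
        if chunk:          # skip empty chunks: an empty frame mid-stream
            send(chunk)    # would be read as the end of the stream
    send(b"")              # empty frame asks the server to finish
```

After the empty frame is sent, the client should keep reading until the server has delivered its remaining results and closed the connection.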

Response format

Soniox will send transcription responses in JSON format. Successful transcription responses follow this format:

{
  "tokens": [
    {
      "text": "Hello",
      "start_ms": 600,
      "end_ms": 760,
      "confidence": 0.97,
      "is_final": true,
      "speaker": "1"
    }
  ],
  "final_audio_proc_ms": 760,
  "total_audio_proc_ms": 880
}

Field descriptions

tokens (array<object>)

The list of transcribed tokens (words or subwords).

Each token may include:

text (string)

Token text.

start_ms (optional, number)

Start timestamp of the token (in milliseconds). Not included if translation_status is translation.

end_ms (optional, number)

End timestamp of the token (in milliseconds). Not included if translation_status is translation.

confidence (number)

Confidence score (0.0–1.0).

is_final (boolean)

Whether the token is finalized.

speaker (optional, string)

Speaker label (if diarization is enabled).

translation_status (optional, string)

Status of the translation. Included if translation is configured. The value will be "none" if the current token will not be translated.

Possible values: "original" | "translation" | "none"

language (optional, string)

Language of the transcription. Included if translation is configured.

source_language (optional, string)

Source language of the translation. Included if translation is configured and translation_status is translation.

final_audio_proc_ms (number)

Amount of audio processed and finalized (in ms).

total_audio_proc_ms (number)

Total audio processed (in ms), including non-final tokens.
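One way a client might turn these responses into a running transcript is sketched below. It rests on three assumptions that this page does not confirm: each final token is delivered once, the non-final tokens in a response replace those from the previous response, and tokens carry their own whitespace so they can be joined directly. The class name Transcript is hypothetical.

```python
import json


class Transcript:
    """Accumulates final tokens and keeps the latest non-final ones."""

    def __init__(self) -> None:
        self.final_tokens = []
        self.non_final_tokens = []

    def update(self, raw_message: str) -> None:
        """Apply one JSON response received from the server."""
        response = json.loads(raw_message)
        # Assumption: non-final tokens are superseded by every new response.
        self.non_final_tokens = []
        for token in response.get("tokens", []):
            if token.get("is_final"):
                self.final_tokens.append(token)
            else:
                self.non_final_tokens.append(token)

    def text(self) -> str:
        """Current best transcript: final text followed by non-final text."""
        tokens = self.final_tokens + self.non_final_tokens
        # Assumption: token text includes any spacing it needs.
        return "".join(token["text"] for token in tokens)
```

A user interface could render the final part in its normal style and the non-final part in a lighter one, since the latter may still change.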

Finished response

At the end of the stream, Soniox will send a final message indicating the session is complete:

{
  "tokens": [],
  "final_audio_proc_ms": 1560,
  "total_audio_proc_ms": 1680,
  "finished": true
}

The server will then close the WebSocket connection.
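A receive loop can use the finished flag as its stop condition. In this sketch, messages stands for any iterable of JSON text messages read from the connection, and the function name is hypothetical.

```python
import json
from typing import Iterable, Iterator


def responses_until_finished(messages: Iterable[str]) -> Iterator[dict]:
    """Yield parsed responses, stopping after the one marked finished."""
    for raw in messages:
        response = json.loads(raw)
        yield response
        if response.get("finished"):
            return
```

The finished response itself is yielded too, so the caller can read its final_audio_proc_ms and total_audio_proc_ms values.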

Error response

If an error occurs, the server will send an error response and immediately close the connection:

{
  "tokens": [],
  "error_code": 503,
  "error_message": "Cannot continue request (code N). Please restart the request. ..."
}

error_code (number)

Standard HTTP status code.

error_message (string)

A description of the error encountered.

Possible error codes and error messages:

400: Bad request

401: Unauthorized

402: Payment required

408: Request timeout

429: Too many requests

500: Internal server error

503: Service unavailable
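Because the server closes the connection after an error response, a client may want to surface it as an exception. The sketch below uses hypothetical names, and the set of codes treated as retryable is a client-side policy assumption, not something this page specifies.

```python
class TranscriptionError(Exception):
    """Raised when a response carries error_code and error_message."""

    def __init__(self, code: int, message: str) -> None:
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message

    @property
    def retryable(self) -> bool:
        # Assumption: timeouts, rate limits and server-side failures may
        # succeed on a fresh connection; the other codes need a client fix.
        return self.code in {408, 429, 500, 503}


def raise_for_error(response: dict) -> None:
    """Raise TranscriptionError if the parsed response is an error."""
    if "error_code" in response:
        raise TranscriptionError(response["error_code"],
                                 response.get("error_message", ""))
```

A caller that retries would need to open a new connection and resend the configuration, since the failed session is already closed.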
