Soniox

WebSocket API

Learn how to use and integrate the Soniox Speech-to-Text WebSocket API.

Overview

The Soniox WebSocket API provides real-time transcription and translation of live audio with ultra-low latency. It supports advanced features such as speaker diarization, context customization, and manual finalization, all over a single persistent WebSocket connection. It is well suited to live scenarios such as meetings, broadcasts, multilingual communication, and voice interfaces.


WebSocket endpoint

Connect to the API using:

wss://stt-rt.soniox.com/transcribe-websocket

Configuration

Before streaming audio, configure the transcription session by sending a JSON message such as:

{
  "api_key": "<SONIOX_API_KEY|SONIOX_TEMPORARY_API_KEY>",
  "model": "stt-rt-preview",
  "audio_format": "auto",
  "language_hints": ["en", "es"],
  "context": {
    "general": [
      { "key": "domain", "value": "Healthcare" },
      { "key": "topic", "value": "Diabetes management consultation" },
      { "key": "doctor", "value": "Dr. Martha Smith" },
      { "key": "patient", "value": "Mr. David Miller" },
      { "key": "organization", "value": "St John's Hospital" }
    ],
    "text": "Mr. David Miller visited his healthcare provider last month for a routine follow-up related to diabetes care. The clinician reviewed his recent test results, noted improved glucose levels, and adjusted his medication schedule accordingly. They also discussed meal planning strategies and scheduled the next check-up for early spring.",
    "terms": [
      "Celebrex",
      "Zyrtec",
      "Xanax",
      "Prilosec",
      "Amoxicillin Clavulanate Potassium"
    ],
    "translation_terms": [
      { "source": "Mr. Smith", "target": "Sr. Smith" },
      { "source": "St John's", "target": "St John's" },
      { "source": "stroke", "target": "ictus" }
    ]
  },
  "enable_speaker_diarization": true,
  "enable_language_identification": true,
  "translation": {
    "type": "two_way",
    "language_a": "en",
    "language_b": "es"
  }
}
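As a rough illustration, the configuration message can be assembled in Python and serialized with the standard json module. The helper name build_config, its arguments, and the SONIOX_API_KEY environment variable are assumptions made for this sketch; only the field names come from the example above.

```python
import json
import os


def build_config(api_key, model="stt-rt-preview", audio_format="auto", **options):
    """Return the JSON configuration message as a string.

    build_config is a hypothetical helper, not part of any Soniox SDK.
    Extra keyword arguments (for example language_hints or translation)
    are copied into the message; options set to None are omitted.
    """
    if not api_key:
        raise ValueError("api_key is required")
    config = {"api_key": api_key, "model": model, "audio_format": audio_format}
    config.update({k: v for k, v in options.items() if v is not None})
    return json.dumps(config)


if __name__ == "__main__":
    # SONIOX_API_KEY is an assumed environment variable name.
    message = build_config(
        os.environ.get("SONIOX_API_KEY", "<SONIOX_API_KEY>"),
        language_hints=["en", "es"],
        enable_speaker_diarization=True,
    )
    print(message)
```

The resulting string is what would be sent as the first message on the connection, before any audio.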

Parameters

api_key (string, required)

Your Soniox API key. Create API keys in the Soniox Console. For client apps, generate a temporary API key from your server to keep secrets secure.

model (string, required)

Real-time model to use. See models.

Example: "stt-rt-preview"

audio_format (string, required)

Audio format of the stream. See audio formats.

num_channels (number)

Required for raw audio formats. See audio formats.

sample_rate (number)

Required for raw audio formats. See audio formats.

language_hints (array<string>)

See language hints.

context (object)

See context.

enable_speaker_diarization (boolean)

See speaker diarization.

enable_language_identification (boolean)

See language identification.

enable_endpoint_detection (boolean)

See endpoint detection.

client_reference_id (string)

Optional identifier to track this request (client-defined).

translation (object)

See real-time translation.

One-way translation

type (string, required)

Must be set to one_way.

target_language (string, required)

Language to translate the transcript into.

Two-way translation

type (string, required)

Must be set to two_way.

language_a (string, required)

First language for two-way translation.

language_b (string, required)

Second language for two-way translation.
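The two shapes of the translation object can be checked before the configuration is sent. The sketch below is only an illustration: the function names one_way, two_way, and validate_translation are hypothetical, and the required fields are taken from the parameter list above.

```python
# Required fields per translation type, as listed in the parameters above.
REQUIRED_FIELDS = {
    "one_way": {"target_language"},
    "two_way": {"language_a", "language_b"},
}


def one_way(target_language):
    """Build a one-way translation object (hypothetical helper)."""
    return {"type": "one_way", "target_language": target_language}


def two_way(language_a, language_b):
    """Build a two-way translation object (hypothetical helper)."""
    return {"type": "two_way", "language_a": language_a, "language_b": language_b}


def validate_translation(translation):
    """Raise ValueError if the type is unknown or a required field is missing."""
    kind = translation.get("type")
    if kind not in REQUIRED_FIELDS:
        raise ValueError(f"unknown translation type: {kind!r}")
    missing = REQUIRED_FIELDS[kind] - set(translation)
    if missing:
        raise ValueError(f"missing fields for {kind}: {sorted(missing)}")
    return translation
```

For example, validate_translation(two_way("en", "es")) returns the same object that appears in the configuration example, while a two_way object without language_b is rejected locally.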


Audio streaming

After configuration, start streaming audio:

  • Send audio as binary WebSocket frames.
  • Each stream supports up to 60 minutes of audio. Support for 300-minute streams is coming soon.
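One way to prepare audio for binary frames is to read it in fixed-size chunks. This is a minimal sketch: the 4096-byte frame size is an arbitrary choice, the helper names are hypothetical, and the duration helper assumes raw 16-bit PCM, which is only one possible audio format.

```python
import io

MAX_STREAM_SECONDS = 60 * 60  # 60-minute per-stream limit stated above


def iter_audio_frames(stream, frame_size=4096):
    """Yield non-empty binary chunks read from a file-like object.

    Empty chunks are never yielded, because an empty frame is what
    signals the end of the stream.
    """
    if frame_size <= 0:
        raise ValueError("frame_size must be positive")
    while True:
        chunk = stream.read(frame_size)
        if not chunk:
            return
        yield chunk


def pcm_duration_seconds(num_bytes, sample_rate, num_channels, bytes_per_sample=2):
    """Duration of raw PCM audio; bytes_per_sample=2 assumes 16-bit samples."""
    return num_bytes / (sample_rate * num_channels * bytes_per_sample)


if __name__ == "__main__":
    audio = io.BytesIO(bytes(32000))  # one second of 16 kHz mono silence
    frames = list(iter_audio_frames(audio))
    seconds = pcm_duration_seconds(sum(map(len, frames)), 16000, 1)
    print(len(frames), seconds, seconds <= MAX_STREAM_SECONDS)
```

Each yielded chunk would be sent as one binary WebSocket frame.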

Ending the stream

To gracefully close a streaming session:

  • Send an empty WebSocket frame (binary or text).
  • The server will return one or more responses, including a finished response, and then close the connection.
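The order of frames in a session (configuration, audio, empty frame, then responses until the server closes) can be sketched without committing to a particular WebSocket client. The ws argument below is an assumption: any object with an awaitable send() that also supports async iteration over incoming messages, which is the shape of connections in the third-party websockets package. The function name run_session is hypothetical.

```python
import asyncio
import json


async def run_session(ws, config, frames):
    """Send the config, stream audio, end the stream, and collect responses."""
    await ws.send(json.dumps(config))  # configuration as a JSON message
    for frame in frames:
        if frame:                      # skip empty chunks mid-stream
            await ws.send(frame)       # audio as a binary frame
    await ws.send(b"")                 # empty frame: graceful end of stream
    responses = []
    async for raw in ws:               # ends when the server closes the connection
        responses.append(json.loads(raw))
    return responses
```

This version sends all audio before reading anything, purely to keep the sequence visible. A live client would normally read responses concurrently with sending so that tokens arrive while audio is still being streamed.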

Response

Soniox returns responses in JSON format. A typical successful response looks like:

{
  "tokens": [
    {
      "text": "Hello",
      "start_ms": 600,
      "end_ms": 760,
      "confidence": 0.97,
      "is_final": true,
      "speaker": "1"
    }
  ],
  "final_audio_proc_ms": 760,
  "total_audio_proc_ms": 880
}
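A response like the one above can be decoded into typed objects. The Token dataclass and parse_tokens function are hypothetical names for this sketch; fields that may be absent from a token are modeled as optional and default to None.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    text: str
    is_final: bool
    confidence: Optional[float] = None
    start_ms: Optional[int] = None   # may be absent, e.g. for translated tokens
    end_ms: Optional[int] = None
    speaker: Optional[str] = None    # present only when diarization is enabled


def parse_tokens(raw):
    """Decode one JSON response and return its tokens as Token objects."""
    message = json.loads(raw)
    return [
        Token(
            text=t["text"],
            is_final=t.get("is_final", False),
            confidence=t.get("confidence"),
            start_ms=t.get("start_ms"),
            end_ms=t.get("end_ms"),
            speaker=t.get("speaker"),
        )
        for t in message.get("tokens", [])
    ]
```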

Field descriptions

tokens (array<object>)

List of processed tokens (words or subwords).

Each token may include:

text (string)

Token text.

start_ms (number, optional)

Start timestamp of the token (in milliseconds). Not included if translation_status is translation.

end_ms (number, optional)

End timestamp of the token (in milliseconds). Not included if translation_status is translation.

confidence (number)

Confidence score (0.0 to 1.0).

is_final (boolean)

Whether the token is finalized.

speaker (string, optional)

Speaker label (if diarization is enabled).

translation_status (string, optional)

See real-time translation.

language (string, optional)

Language of the token text.

source_language (string, optional)

See real-time translation.

final_audio_proc_ms (number)

Duration of audio, in milliseconds, processed into final tokens.

total_audio_proc_ms (number)

Duration of audio, in milliseconds, processed into final and non-final tokens.
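The is_final flag suggests a simple way to maintain a running transcript. The sketch below rests on two assumptions that are not stated in this section: each final token is delivered once, and non-final tokens are provisional and superseded by later responses. It also assumes token text carries its own spacing. The function name update_transcript is hypothetical.

```python
def update_transcript(final_text, response):
    """Apply one response and return (final_text, preview).

    final_text grows only with finalized tokens; preview is final_text
    plus the provisional, non-final tokens of this response.
    """
    pending = ""
    for token in response.get("tokens", []):
        if token.get("is_final"):
            final_text += token["text"]
        else:
            pending += token["text"]
    return final_text, final_text + pending


if __name__ == "__main__":
    text = ""
    for response in (
        {"tokens": [{"text": "Hello", "is_final": True}, {"text": " wor", "is_final": False}]},
        {"tokens": [{"text": " world", "is_final": True}]},
    ):
        text, preview = update_transcript(text, response)
        print(repr(preview))
```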


Finished response

At the end of a stream, Soniox sends a final message to indicate the session is complete:

{
  "tokens": [],
  "final_audio_proc_ms": 1560,
  "total_audio_proc_ms": 1680,
  "finished": true
}

After this, the server closes the WebSocket connection.
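A receive loop can use the finished flag to tell a complete session from a connection that ended early. This is a sketch over an iterable of raw JSON strings; the function name collect_until_finished is hypothetical.

```python
import json


def collect_until_finished(raw_messages):
    """Gather final tokens until a response with "finished": true arrives.

    Returns (tokens, finished). finished is False if the messages ran
    out before a finished response was seen.
    """
    tokens = []
    for raw in raw_messages:
        message = json.loads(raw)
        tokens.extend(t for t in message.get("tokens", []) if t.get("is_final"))
        if message.get("finished"):
            return tokens, True
    return tokens, False
```

A False second value would indicate that the stream ended without the finished response described above, so the transcript may be incomplete.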


Error response

If an error occurs, the server returns an error message and immediately closes the connection:

{
  "tokens": [],
  "error_code": 503,
  "error_message": "Cannot continue request (code N). Please restart the request. ..."
}

error_code (number)

Standard HTTP status code.

error_message (string)


A description of the error encountered.
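Error responses can be turned into exceptions as soon as they are decoded. In this sketch the TranscriptionError class and check_response function are hypothetical names, and treating 5xx codes as retryable is an assumption suggested by the "Please restart the request" wording in the example, not a documented rule.

```python
import json


class TranscriptionError(Exception):
    """Raised when a response carries error_code (hypothetical class)."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message

    @property
    def retryable(self):
        # Assumption: server-side (5xx) errors are worth retrying.
        return 500 <= self.code < 600


def check_response(raw):
    """Decode a response; raise TranscriptionError if it reports an error."""
    message = json.loads(raw)
    if "error_code" in message:
        raise TranscriptionError(message["error_code"], message.get("error_message", ""))
    return message
```

Because the server closes the connection after an error, a caller that catches a retryable TranscriptionError would need to open a new connection and send the configuration again.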

Full list of possible error codes and messages: