Real-time API

Learn how to use and integrate the Speech-to-Text Real-time API.

The Speech-to-Text Real-time API lets developers transcribe audio streams over a WebSocket connection. It supports multiple audio formats and delivers real-time transcription with precise timestamps and a confidence score for each token.

To connect to the Speech-to-Text Real-time API, use the following WebSocket URL:

wss://stt-rt.soniox.com/transcribe-websocket

Authentication and transcription configuration

Before transmitting audio data, authenticate the client and define the transcription configuration by sending a JSON object with the following structure (a Python connection sketch follows the parameter list below):

{
    "api_key": "<SONIOX_API_KEY>",
    "audio_format": "auto",
    "model": "stt-rt-preview",
    "language_hints": ["en"],
    "enable_speaker_tags": false,
    "context": "string"
}
api_key (string, required)

You can create your API key in Soniox Console.

If you are using a WebSocket library on the client side, be careful not to expose your API key in the client-side code. Instead, generate a temporary API key for each connection on the server side and send it to the client to authenticate the WebSocket connection.

audio_format (string, required)

The format of the audio data. Use auto for automatic detection.

num_channels (string)

Number of channels in the audio (required for PCM formats).

sample_rate (string)

The sample rate of the audio data (required for PCM formats).

model (string, required)

Speech-to-Text model to use (e.g. stt-rt-preview). You can get an up-to-date list of available models from the GET models endpoint.

language_hints (array<string>)

Provide language hints to enhance speech recognition.

enable_speaker_tags (boolean)

Enabling speaker tags will separate speakers in the transcription output.

context (string)

Context can help correctly transcribe uncommon spoken words, such as names, jargon, or abbreviations. The maximum length of the context is 10,000 characters.

client_reference_id (string)

A string provided by the client to track the stream. It can be an ID, a JSON string, or any other text. This value can be used for reference in future API requests or for internal mapping within the client's systems. The value does not have to be unique. If not provided, it will be auto-generated.
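
As an illustration, here is a minimal Python sketch of opening the connection and sending the configuration message. It assumes the third-party websockets package and an API key stored in a SONIOX_API_KEY environment variable; neither is required by the API itself.

import json
import os

import websockets  # third-party client: pip install websockets

SONIOX_WSS_URL = "wss://stt-rt.soniox.com/transcribe-websocket"

async def start_session():
    # Open the WebSocket connection to the real-time endpoint.
    ws = await websockets.connect(SONIOX_WSS_URL)

    # The first message must be the JSON configuration object.
    config = {
        "api_key": os.environ["SONIOX_API_KEY"],  # keep keys out of source code
        "audio_format": "auto",
        "model": "stt-rt-preview",
        "language_hints": ["en"],
    }
    await ws.send(json.dumps(config))
    return ws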

Audio streaming

After sending the initial configuration message, you can begin streaming audio data. The API supports multiple audio formats, including raw PCM and live microphone streams from all major web browsers. Audio data can be sent as binary WebSocket messages, or as Base64-encoded text messages for WebSocket clients that do not support binary messaging.
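
For example, a sketch of forwarding a local raw PCM file in binary messages; the file handling and chunk size are illustrative, not prescribed by the API:

CHUNK_SIZE = 3200  # illustrative chunk size in bytes

async def stream_audio(ws, path):
    # Read the audio file and forward it as binary WebSocket messages.
    # Pace the sends in real time; see Limitations below.
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            await ws.send(chunk)
            # For clients without binary support, send Base64 text instead:
            # await ws.send(base64.b64encode(chunk).decode("ascii"))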

Limitations

The server sends transcribed text in real time with minimal latency.

The maximum duration of a real-time audio stream is 65 minutes. After that, the stream is ended with the error 400: Audio is too long. If you would like to transcribe longer audio, you can reconnect after receiving this error.

The maximum number of concurrent connections per organization is 10 and can be increased in the Soniox Console.

Ensure the data is sent in real time. Audio data must not be sent at a rate faster or slower than real time. Sending data too slowly may result in the server terminating the connection.
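
For pre-recorded PCM audio, one way to respect real-time pacing is to sleep for the duration each chunk represents. The sample rate, sample width, and channel count below are assumptions for illustration:

import asyncio

SAMPLE_RATE = 16000   # assumption: 16 kHz
BYTES_PER_SAMPLE = 2  # assumption: 16-bit PCM
NUM_CHANNELS = 1      # assumption: mono
BYTES_PER_SECOND = SAMPLE_RATE * BYTES_PER_SAMPLE * NUM_CHANNELS

async def paced_send(ws, chunk):
    await ws.send(chunk)
    # Sleep for the audio duration this chunk represents so data
    # arrives no faster than real time.
    await asyncio.sleep(len(chunk) / BYTES_PER_SECOND)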

Ending the stream

To end the streaming transcription, send an empty WebSocket message, either as an empty binary or text message. Upon receiving this, the server will initiate the closing process, send any remaining text responses, and close the WebSocket connection.
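
Continuing the Python sketch, ending the stream is a single empty message:

async def end_stream(ws):
    # An empty message (binary or text) tells the server to finish;
    # keep receiving until the final response arrives.
    await ws.send(b"")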

Response format

The WebSocket server sends responses in JSON format.

Successful responses

Successful transcription responses follow this format (the example shows four successive messages that together transcribe "Hello world."):

{
  "text": "H",
  "tokens": [
    {
      "text": "H",
      "start_ms": 700,
      "end_ms": 760,
      "confidence": 0.5742499828338623
    }
  ],
  "audio_proc_ms": 760
}
{
  "text": "ello",
  "tokens": [
    {
      "text": "ello",
      "start_ms": 880,
      "end_ms": 940,
      "confidence": 0.9998869299888611
    }
  ],
  "audio_proc_ms": 1000
}
{
  "text": " world",
  "tokens": [
    {
      "text": " world",
      "start_ms": 1120,
      "end_ms": 1180,
      "confidence": 0.8351306319236755
    }
  ],
  "audio_proc_ms": 1240
}
{
  "text": ".",
  "tokens": [
    {
      "text": ".",
      "start_ms": 1420,
      "end_ms": 1480,
      "confidence": 0.6271795630455017
    }
  ],
  "audio_proc_ms": 1480
}
text (string)

The transcribed text.

tokens (array<object>)

A list of tokens, each containing a part of the transcribed text along with its start timestamp, end timestamp, and confidence level.

text (string)

The token text.

start_ms (number)

The start timestamp of the token in milliseconds.

end_ms (number)

The end timestamp of the token in milliseconds.

confidence (number)

The confidence level of the token (between 0 and 1).

audio_proc_ms (number)

The amount of processed audio in milliseconds.
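
Putting the fields together, a receiving loop might accumulate the transcript and stop on the final response. This sketch continues the earlier examples; the finished and error fields it checks are described below:

import json

async def receive_transcript(ws):
    transcript = ""
    async for message in ws:  # iterate incoming WebSocket messages
        res = json.loads(message)
        if res.get("error_code"):
            raise RuntimeError(f"{res['error_code']}: {res['error_message']}")
        transcript += res["text"]  # append the newly transcribed fragment
        if res.get("finished"):
            break  # final response; the server closes the connection
    return transcript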

Speaker tags

If the enable_speaker_tags flag is set to true, speaker tags will be included in the tokens as follows: a single token indicates the speaker (e.g., spk:1, spk:2). Speaker tokens are also included in the text field of the response.
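
As a sketch, such marker tokens can be folded into per-speaker segments; note that the exact spk:N token text is inferred from the example tags above:

def group_by_speaker(tokens):
    # Fold a token list containing spk:N marker tokens into
    # [speaker, text] segments.
    segments = []
    speaker = None
    for tok in tokens:
        text = tok["text"]
        if text.startswith("spk:"):
            speaker = text  # e.g. "spk:1"
            segments.append([speaker, ""])
        elif segments:
            segments[-1][1] += text
        else:
            segments.append([speaker, text])  # text before any tag
    return segments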

Final response

At the end of the transcription, the server will send a final response indicating completion:

{
  "text": "",
  "tokens": [],
  "audio_proc_ms": 1560,
  "finished": true
}

After sending this response, the server will close the WebSocket connection.

Error response

In the event of an error, the server will return an error response and terminate the WebSocket connection.

{
  "text": "",
  "tokens": [],
  "error_code": 503,
  "error_message": "Service is currently overloaded. Please retry your request..."
}
error_code (number)

The error code, following the HTTP status code convention.

error_message (string)

A description of the error encountered.
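
Since error codes follow HTTP semantics, a client might treat 503 as transient and retry with backoff. A sketch built on the receiving loop above, which raises RuntimeError on error responses:

import asyncio
import random

async def run_with_retry(run_session, max_attempts=3):
    # run_session performs one connect/stream/receive cycle and
    # raises RuntimeError on an error response.
    for attempt in range(max_attempts):
        try:
            return await run_session()
        except RuntimeError as err:
            # Retry only transient errors such as 503 (overloaded).
            if not str(err).startswith("503") or attempt == max_attempts - 1:
                raise
            await asyncio.sleep(2 ** attempt + random.random())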

Here is a list of possible error codes and their descriptions:
