WebSocket API
Learn how to use and integrate the Soniox Speech-to-Text WebSocket API.
Real-time transcription over WebSocket
The Soniox Speech-to-Text WebSocket API enables low-latency transcription of live audio streams. It supports advanced features such as automatic speaker diarization and context customization, all over a single persistent WebSocket connection.
This API is ideal for live transcription scenarios such as meetings, broadcasts, voice interfaces, and real-time voice applications.
WebSocket endpoint
To connect to the WebSocket API, use:
Authentication and configuration
Before sending audio, you must authenticate and configure the transcription session by sending a JSON message containing the parameters described below.
Configuration parameters
api_key
string (required): Your Soniox API key. You can create keys in the Soniox Console.
For client-side integrations, use a temporary API key generated on the server to avoid exposing secrets.
model
string (required): The transcription model to use. Example: "stt-rt-preview".
Use the GET /models endpoint to retrieve a list of available models.
audio_format
string (required): The format of the streamed audio (e.g., "auto", "s16le").
See Supported audio formats for details.
num_channels
number: Required for raw PCM formats. Typically 1 for mono audio.
sample_rate
number: Required for raw PCM formats. Common value: 16000.
language_hints
array<string>: Hints to guide transcription toward specific languages.
See supported languages for the list of available ISO language codes.
context
string: Provide domain-specific terms or phrases to improve recognition accuracy.
Max length: 10,000 characters.
enable_non_final_tokens
boolean (default: true): If true, partial non-final tokens will be streamed before they are finalized.
max_non_final_tokens_duration_ms
number (default: 4000): Maximum delay (in milliseconds) between a spoken word and its finalization.
Valid range: 700–6000.
enable_speaker_diarization
boolean: Enables automatic speaker separation.
client_reference_id
string: A client-defined identifier to track this stream. Can be any string. If not provided, it will be auto-generated.
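To illustrate how these parameters combine, the following Python sketch assembles a configuration message and checks the constraints stated above (the extra fields needed for raw PCM, the 10,000-character context limit, and the 700–6000 ms range). The helper name `build_config` and the `RAW_PCM_FORMATS` set are hypothetical; only the parameter names and limits come from this page.

```python
import json

# Assumption: "s16le" is the only raw PCM format named on this page; extend as needed.
RAW_PCM_FORMATS = {"s16le"}


def build_config(api_key, model, audio_format, *, num_channels=None,
                 sample_rate=None, language_hints=None, context=None,
                 enable_non_final_tokens=None,
                 max_non_final_tokens_duration_ms=None,
                 enable_speaker_diarization=None, client_reference_id=None):
    """Return the JSON text of the initial configuration message."""
    if audio_format in RAW_PCM_FORMATS and (num_channels is None or sample_rate is None):
        raise ValueError("raw PCM formats require num_channels and sample_rate")
    if context is not None and len(context) > 10_000:
        raise ValueError("context is limited to 10,000 characters")
    if max_non_final_tokens_duration_ms is not None and not (
            700 <= max_non_final_tokens_duration_ms <= 6000):
        raise ValueError("max_non_final_tokens_duration_ms must be between 700 and 6000")

    config = {"api_key": api_key, "model": model, "audio_format": audio_format}
    optional = {
        "num_channels": num_channels,
        "sample_rate": sample_rate,
        "language_hints": language_hints,
        "context": context,
        "enable_non_final_tokens": enable_non_final_tokens,
        "max_non_final_tokens_duration_ms": max_non_final_tokens_duration_ms,
        "enable_speaker_diarization": enable_speaker_diarization,
        "client_reference_id": client_reference_id,
    }
    # Omit parameters the caller did not set so the server applies its defaults.
    config.update({key: value for key, value in optional.items() if value is not None})
    return json.dumps(config)
```

For example, `build_config(key, "stt-rt-preview", "s16le", num_channels=1, sample_rate=16000)` produces a message suitable for 16 kHz mono PCM, while omitting `num_channels` or `sample_rate` for that format raises `ValueError` before anything is sent.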
Audio streaming
After sending the initial configuration, begin streaming audio data:
- Audio can be sent as binary WebSocket frames (preferred)
- Alternatively, Base64-encoded audio can be sent as text messages (if binary is not supported)
The server expects audio to be streamed in real time, that is, neither significantly faster nor significantly slower than the rate at which the audio would play back.
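When the audio comes from a file or buffer rather than a live microphone, the client has to pace itself. The sketch below shows one way to do that for raw PCM: split the data into fixed-duration chunks and wait one chunk duration between sends. All function names are hypothetical, and the 2-bytes-per-sample default assumes 16-bit PCM such as "s16le".

```python
import base64
import time


def pcm_chunk_size(sample_rate, num_channels, chunk_ms, bytes_per_sample=2):
    """Number of raw PCM bytes that cover chunk_ms milliseconds of audio."""
    return sample_rate * num_channels * bytes_per_sample * chunk_ms // 1000


def iter_chunks(audio, chunk_size):
    """Yield consecutive slices of at most chunk_size bytes."""
    for offset in range(0, len(audio), chunk_size):
        yield audio[offset:offset + chunk_size]


def paced(chunks, chunk_ms, sleep=time.sleep):
    """Yield chunks, pausing one chunk duration after each to approximate real time."""
    for chunk in chunks:
        yield chunk
        sleep(chunk_ms / 1000)


def as_text_frame(chunk):
    """Base64-encode a chunk for transports that cannot send binary frames."""
    return base64.b64encode(chunk).decode("ascii")
```

With 16 kHz mono 16-bit audio, `pcm_chunk_size(16000, 1, 100)` gives 3200 bytes per 100 ms chunk. A live capture source already delivers audio at real-time speed, so `paced` is only needed for pre-recorded input.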
Limitations
- The maximum duration of a stream is 65 minutes
- Max concurrent connections per organization: 10 (can be increased via the Soniox Console)
- Streaming too slowly may result in the connection being closed
- Audio must be sent at real-time speed, not faster and not in buffered bursts
Ending the stream
To gracefully end a transcription session:
- Send an empty WebSocket message (empty binary or text frame)
- The server will return any final results, send a completion message, and close the connection
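The message order for a whole session (configuration, audio, then the empty end-of-stream frame) can be sketched independently of any particular WebSocket library. In the example below, `send` is a hypothetical callable standing in for a client's send method, where a `str` argument goes out as a text frame and `bytes` as a binary frame.

```python
def run_session(send, config_json, audio_chunks):
    """Send the configuration, the audio, and finally the empty end-of-stream frame."""
    send(config_json)              # 1. authenticate and configure (text frame)
    for chunk in audio_chunks:     # 2. stream audio as binary frames
        if chunk:                  # an empty frame would end the stream early, so skip it
            send(chunk)
    send(b"")                      # 3. empty frame asks the server to finish gracefully
```

After the empty frame is sent, the client should keep reading until the server closes the connection, since final results can still arrive. Skipping empty chunks matters because a zero-length read from an audio source would otherwise be interpreted as the end of the stream.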
Response format
Soniox sends transcription responses as JSON messages. A successful transcription response contains the fields described below.
Field descriptions
tokens
array<object>: The list of transcribed tokens (words or subwords).
Each token may include:
text
string: Token text
start_ms
number: Start timestamp of the token (in milliseconds)
end_ms
number: End timestamp of the token (in milliseconds)
confidence
number: Confidence score (0.0–1.0)
is_final
boolean: Whether the token is finalized
speaker
string (optional): Speaker label (if diarization is enabled)
language_code
string (optional): Detected language
is_audio_event
boolean (optional): True if the token represents a non-verbal audio event
final_audio_proc_ms
number: Amount of audio processed and finalized (in ms)
total_audio_proc_ms
number: Total audio processed (in ms), including non-final tokens
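A client typically has to merge the tokens from successive responses into a running transcript. The sketch below shows one possible approach using only the `tokens`, `text`, and `is_final` fields. It rests on assumptions not stated on this page: that each final token is delivered exactly once, that the non-final tokens in a response replace those from the previous response, and that token text carries its own spacing. The `Transcript` class is hypothetical.

```python
import json


class Transcript:
    """Accumulates final tokens and tracks the latest non-final tokens."""

    def __init__(self):
        self.final_text = ""
        self.non_final_text = ""

    def update(self, message):
        """Apply one JSON response and return the current best transcript."""
        response = json.loads(message)
        non_final = []
        for token in response.get("tokens", []):
            if token.get("is_final"):
                # Assumption: final tokens are sent once, so append them permanently.
                self.final_text += token["text"]
            else:
                non_final.append(token["text"])
        # Assumption: non-final tokens are provisional and replaced by each response.
        self.non_final_text = "".join(non_final)
        return self.final_text + self.non_final_text
```

Keeping the two parts separate lets a user interface render finalized text normally and provisional text in a lighter style, since the latter may still change.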
Finished response
At the end of the stream, Soniox will send a final message indicating that the session is complete.
The server will then close the WebSocket connection.
Error response
If an error occurs, the server will send an error response with the following fields and immediately close the connection:
error_code
number: Standard HTTP status code.
error_message
string: A description of the error encountered.
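Because the connection closes right after an error response, it helps to detect errors as each message is parsed. The following sketch turns an error response into a Python exception using the `error_code` and `error_message` fields. The names `TranscriptionError` and `parse_response` are hypothetical, and the sketch assumes that a successful response does not contain an `error_code` field.

```python
import json


class TranscriptionError(Exception):
    """Raised when the server reports an error for the stream."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


def parse_response(raw):
    """Parse one JSON message; raise TranscriptionError if it is an error response."""
    response = json.loads(raw)
    if "error_code" in response:
        raise TranscriptionError(response["error_code"],
                                 response.get("error_message", ""))
    return response
```

A caller can catch `TranscriptionError`, inspect `code`, and decide whether to reconnect or surface the problem to the user.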
Possible error codes and their descriptions: