WebSocket API
Learn how to use and integrate the Soniox Speech-to-Text WebSocket API.
Overview
The Soniox WebSocket API provides real-time transcription and translation of live audio with ultra-low latency. It supports advanced features like speaker diarization, context customization, and manual finalization, all over a persistent WebSocket connection. It is well suited to live scenarios such as meetings, broadcasts, multilingual communication, and voice interfaces.
WebSocket endpoint
Connect to the API using:
Configuration
Before streaming audio, configure the transcription session by sending a JSON message such as:
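As a rough illustration, the configuration message could be assembled in Python as shown here. The keys are the parameters documented in this section. The helper name `build_config` and all values (`pcm_s16le`, one channel, 16000 Hz, the `en` hint, the reference id) are assumptions chosen for the sketch, so check the audio formats and language hints pages for what your setup needs. The live API may also expect fields that this section does not list.

```python
import json


def build_config(api_key: str) -> str:
    """Return the JSON text sent as the first WebSocket message (sketch)."""
    config = {
        "api_key": api_key,
        # Assumed value for a raw PCM stream; see the audio formats page.
        "audio_format": "pcm_s16le",
        # Both are required when a raw audio format is used.
        "num_channels": 1,
        "sample_rate": 16000,
        "language_hints": ["en"],  # assumed language code
        "enable_speaker_diarization": True,
        "enable_endpoint_detection": True,
        "client_reference_id": "meeting-42",  # any client-defined string
    }
    return json.dumps(config)


# Placeholder key; in client apps use a short-lived key from your server.
config_message = build_config("<SONIOX_API_KEY>")
```

The resulting string would be sent as a text frame before any audio.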
Parameters
api_key
Required, string. Your Soniox API key. Create keys in the Soniox Console. For client apps, use a short-lived key generated on your server to keep secrets safe.
audio_format
Required, string. Audio format of the stream. See audio formats.
num_channels
number. Required for raw audio formats. See audio formats.
sample_rate
number. Required for raw audio formats. See audio formats.
language_hints
array<string>. See language hints.
context
string. See context.
enable_speaker_diarization
boolean. See speaker diarization.
enable_language_identification
boolean.
enable_non_final_tokens
boolean.
enable_endpoint_detection
boolean. See endpoint detection.
client_reference_id
string. Optional identifier to track this request (client-defined).
translation
object.
One-way translation
type
Required, string. Must be set to one_way.
target_language
Required, string. Language to translate the transcript into.
Two-way translation
type
Required, string. Must be set to two_way.
language_a
Required, string. First language for two-way translation.
language_b
Required, string. Second language for two-way translation.
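The two shapes of the translation object can be sketched as small Python helpers. The function names are hypothetical, and the language codes (`en`, `es`) are assumed examples. Only the keys `type`, `target_language`, `language_a`, and `language_b` come from the parameter list above.

```python
def one_way_translation(target_language: str) -> dict:
    """Build a translation object that translates everything into one language."""
    return {"type": "one_way", "target_language": target_language}


def two_way_translation(language_a: str, language_b: str) -> dict:
    """Build a translation object for translating between two languages."""
    return {
        "type": "two_way",
        "language_a": language_a,
        "language_b": language_b,
    }


# Either object would be placed under the "translation" key of the config.
config_fragment = {"translation": two_way_translation("en", "es")}
```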
Audio streaming
After configuration, start streaming audio:
- Send audio as binary WebSocket frames.
- Each stream supports up to 60 minutes of audio.
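One way to picture the streaming step is to split the audio into fixed-size chunks and send each chunk as one binary frame. In this sketch, `send_binary` stands in for whatever send method your WebSocket client exposes. The 3840-byte chunk size is an arbitrary assumption (120 ms of 16 kHz, mono, 16-bit PCM), not a documented requirement.

```python
from typing import Callable, Iterator


def iter_chunks(audio: bytes, chunk_size: int = 3840) -> Iterator[bytes]:
    """Yield consecutive slices of `audio`, each at most `chunk_size` bytes."""
    for offset in range(0, len(audio), chunk_size):
        yield audio[offset:offset + chunk_size]


def stream_audio(send_binary: Callable[[bytes], None], audio: bytes) -> int:
    """Send every chunk as one binary frame and return the number of frames."""
    frames = 0
    for chunk in iter_chunks(audio):
        send_binary(chunk)
        frames += 1
    return frames


# Demonstration with a list standing in for the socket's send method.
sent_frames = []
frame_count = stream_audio(sent_frames.append, bytes(10000))
```

With live microphone input the chunks would come from the capture device instead of a byte string, but the framing is the same.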
Ending the stream
To gracefully close a streaming session:
- Send an empty WebSocket frame (binary or text).
- The server will return one or more responses, including a finished response, and then close the connection.
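The shutdown sequence above might be sketched as follows. `send` and `recv` are hypothetical stand-ins for a WebSocket client's methods, and the convention that `recv` returns `None` once the server has closed the connection is an assumption of this sketch. The stand-in messages only exercise the function and are not real server output.

```python
import json
from typing import Callable, List, Optional


def finish_stream(
    send: Callable[[bytes], None],
    recv: Callable[[], Optional[str]],
) -> List[dict]:
    """Send the empty end-of-stream frame, then read responses until close."""
    send(b"")  # an empty frame signals that no more audio will follow
    responses = []
    while True:
        message = recv()
        if message is None:  # assumed convention: None means "connection closed"
            break
        responses.append(json.loads(message))
    return responses


# Stand-ins for a real socket, used only to exercise the function.
outgoing = []
incoming = iter(['{"tokens": []}', '{"tokens": []}', None])
remaining = finish_stream(outgoing.append, lambda: next(incoming))
```

Reading until the connection closes, and not stopping right after the empty frame, keeps any trailing responses from being lost.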
Response
Soniox returns responses in JSON format. A typical successful response looks like:
Field descriptions
tokens
array<object>. List of processed tokens (words or subwords).
Each token may include:
text
string. Token text.
start_ms
Optional, number. Start timestamp of the token (in milliseconds). Not included if translation_status is translation.
end_ms
Optional, number. End timestamp of the token (in milliseconds). Not included if translation_status is translation.
confidence
number. Confidence score (0.0–1.0).
is_final
boolean. Whether the token is finalized.
speaker
Optional, string. Speaker label (if diarization enabled).
translation_status
Optional, string.
language
Optional, string. Language of the token.text.
source_language
Optional, string.
final_audio_proc_ms
number. Audio processed into final tokens.
total_audio_proc_ms
number. Audio processed into final + non-final tokens.
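To show how these fields fit together, the sketch below separates final from non-final token text in one response. The sample payload is made up from the documented field names purely for illustration and is not real server output. `split_tokens` is a hypothetical helper name.

```python
import json

# Illustrative payload built from the documented fields; not real output.
sample_response = json.dumps({
    "tokens": [
        {"text": "Hello", "start_ms": 0, "end_ms": 400,
         "confidence": 0.98, "is_final": True, "speaker": "1"},
        {"text": " wor", "start_ms": 400, "end_ms": 650,
         "confidence": 0.71, "is_final": False, "speaker": "1"},
    ],
    "final_audio_proc_ms": 400,
    "total_audio_proc_ms": 650,
})


def split_tokens(message: str):
    """Return (final_text, non_final_text) for one response message."""
    response = json.loads(message)
    final_parts, non_final_parts = [], []
    for token in response.get("tokens", []):
        target = final_parts if token["is_final"] else non_final_parts
        target.append(token["text"])
    return "".join(final_parts), "".join(non_final_parts)


final_text, pending_text = split_tokens(sample_response)
```

A client would typically append the final text to its running transcript and treat the non-final text as provisional until a later response finalizes it.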
Finished response
At the end of a stream, Soniox sends a final message to indicate the session is complete:
After this, the server closes the WebSocket connection.
Error response
If an error occurs, the server returns an error message and immediately closes the connection:
error_code
number. Standard HTTP status code.
error_message
string. A description of the error encountered.
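A client can check each incoming message for these two fields before treating it as a normal response. The following is a minimal sketch. `SonioxError` and `raise_for_error` are hypothetical names, and the error text in the demonstration is a placeholder, not an actual server message.

```python
import json


class SonioxError(Exception):
    """Hypothetical exception type carrying the documented error fields."""

    def __init__(self, code: int, message: str):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


def raise_for_error(message: str) -> dict:
    """Parse one server message and raise if it contains error_code."""
    response = json.loads(message)
    if "error_code" in response:
        raise SonioxError(response["error_code"], response.get("error_message", ""))
    return response


# Demonstration with a placeholder error payload.
try:
    raise_for_error('{"error_code": 400, "error_message": "example error text"}')
except SonioxError as exc:
    caught = (exc.code, exc.message)
```

Because the server closes the connection right after an error, a handler like this is also a natural place to decide whether to reconnect.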
Full list of possible error codes and messages: