WebSocket API
Learn how to use and integrate the Soniox Speech-to-Text WebSocket API.
Real-time transcription over WebSocket
The Soniox Speech-to-Text WebSocket API enables low-latency transcription of live audio streams. It supports advanced features such as automatic speaker diarization, context customization, and more, all over a persistent WebSocket connection.
This API is ideal for live transcription scenarios such as meetings, broadcasts, voice interfaces, and real-time voice applications.
WebSocket endpoint
To connect to the WebSocket API, use:
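As a minimal sketch, a client can open the connection with Python's websockets package; SONIOX_WS_URL below is only a placeholder for the actual endpoint URL:

```python
import asyncio
import websockets

# Placeholder only: substitute the Soniox WebSocket endpoint URL from this section.
SONIOX_WS_URL = "wss://<soniox-stt-websocket-endpoint>"

async def main():
    # Open a persistent WebSocket connection for the transcription session.
    async with websockets.connect(SONIOX_WS_URL) as ws:
        ...  # send the configuration message and audio as described below

asyncio.run(main())
```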
Authentication and configuration
Before sending audio, you must authenticate and configure the transcription session by sending a JSON message like this:
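A minimal configuration message might look like the following (a sketch with illustrative values; see the parameter reference below):

```json
{
  "api_key": "<SONIOX_API_KEY>",
  "model": "stt-rt-preview",
  "audio_format": "auto",
  "language_hints": ["en"]
}
```

In the connection sketch above, this would be sent as the first message, for example with `await ws.send(json.dumps(config))`.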
Configuration parameters
api_key (string, required)
Your Soniox API key. You can create keys in the Soniox Console. For client-side integrations, use a temporary API key generated on the server to avoid exposing secrets.

model (string, required)
The transcription model to use. Use the GET /models endpoint to retrieve a list of available models.
Example: "stt-rt-preview"

audio_format (string, required)
The format of the streamed audio. See Supported audio formats for details.
Examples: "auto", "pcm_s16le"

num_channels (number)
Required for raw PCM formats. Use 1 for mono audio, 2 for stereo audio.

sample_rate (number)
Required for raw PCM formats.
Example: 16000

language_hints (array<string>)
Expected languages in the audio. If not specified, languages are detected automatically. See supported languages for the list of available ISO language codes.

context (string)
Domain-specific terms or phrases to improve recognition accuracy.
Maximum length: 10000

enable_speaker_diarization (boolean)
When true, speakers are identified and separated in the transcription output.

enable_non_final_tokens (boolean)
When true, partial non-final tokens are streamed before they are finalized. See Final vs non-final tokens for more information.
Default: true

max_non_final_tokens_duration_ms (number)
Maximum delay (in milliseconds) between a spoken word and its finalization.
Default: 4000. Minimum: 360. Maximum: 6000

enable_endpoint_detection (boolean)
When true, endpoint detection is enabled.

translation (object)
Configures real-time translation. See the Real-time transcription page for more info.

translation.target_language (string, required)
The target language for translation. Required if translation is set.

translation.source_languages (array<string>, required)
List of source languages to translate. Use ["*"] to include all.

translation.exclude_source_languages (array<string>)
Languages to exclude from translation. Only allowed when source_languages is ["*"].

translation.two_way_target_language (string)
Enables two-way translation for conversations. All speech is translated between the two languages. Cannot be used with exclude_source_languages.

client_reference_id (string)
Optional tracking identifier string. Does not need to be unique.
Maximum length: 256
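To illustrate how these parameters combine, a configuration for raw PCM audio with speaker diarization and translation enabled might look like this (values are examples only):

```json
{
  "api_key": "<SONIOX_API_KEY>",
  "model": "stt-rt-preview",
  "audio_format": "pcm_s16le",
  "num_channels": 1,
  "sample_rate": 16000,
  "enable_speaker_diarization": true,
  "enable_non_final_tokens": true,
  "translation": {
    "target_language": "en",
    "source_languages": ["*"]
  },
  "client_reference_id": "demo-session"
}
```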
Audio streaming
After sending the initial configuration, begin streaming audio data:
- Audio can be sent as binary WebSocket frames (preferred)
- Alternatively, Base64-encoded audio can be sent as text messages (if binary is not supported)
- The maximum duration of a stream is 65 minutes
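For example, assuming the websockets connection and configuration message from the sketches above, audio chunks can be streamed as binary frames, with Base64 text messages as the fallback:

```python
import asyncio
import base64

async def stream_audio(ws, path: str, chunk_size: int = 4096, use_base64: bool = False):
    # Stream a local audio file over the open WebSocket connection `ws`.
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            if use_base64:
                # Fallback: send Base64-encoded audio as a text message.
                await ws.send(base64.b64encode(chunk).decode("ascii"))
            else:
                # Preferred: send raw audio bytes as a binary frame.
                await ws.send(chunk)
            await asyncio.sleep(0.1)  # roughly pace the upload in real time
```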
Ending the stream
To gracefully end a transcription session:
- Send an empty WebSocket message (empty binary or text frame)
- The server will return any final results, send a completion message, and close the connection
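Continuing the sketch above, the stream can be ended and the remaining results drained like this:

```python
async def finish_stream(ws):
    # An empty frame signals that no more audio will follow.
    await ws.send(b"")
    # The server flushes final results, sends its completion message,
    # and closes the connection; iterate until it does.
    async for message in ws:
        print(message)
```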
Response format
Soniox sends transcription responses in JSON. Successful responses follow this format:
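A successful response might look roughly like this (illustrative values; the field reference follows below):

```json
{
  "tokens": [
    {
      "text": "Hello",
      "start_ms": 0,
      "end_ms": 320,
      "confidence": 0.97,
      "is_final": true,
      "speaker": "1"
    }
  ],
  "final_audio_proc_ms": 1200,
  "total_audio_proc_ms": 1800
}
```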
Field descriptions
tokens (array<object>)
The list of transcribed tokens (words or subwords). Each token may include:

text (string)
Token text.

start_ms (number, optional)
Start timestamp of the token (in milliseconds). Not included if translation_status is "translation".

end_ms (number, optional)
End timestamp of the token (in milliseconds). Not included if translation_status is "translation".

confidence (number)
Confidence score (0.0 to 1.0).

is_final (boolean)
Whether the token is finalized.

speaker (string, optional)
Speaker label (if diarization is enabled).

translation_status (string, optional)
Status of the translation. Included if translation is configured.
Possible values: "original", "translation"

language (string, optional)
Language of the transcription. Included if translation is configured.

source_language (string, optional)
Source language of the translation. Included if translation is configured and translation_status is "translation".
In addition to tokens, each response includes:

final_audio_proc_ms (number)
Amount of audio processed and finalized (in milliseconds).

total_audio_proc_ms (number)
Total audio processed (in milliseconds), including non-final tokens.
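As a sketch of how a client might consume these responses (assuming the JSON shape described above), the handler below accumulates finalized text and appends the current non-final tail:

```python
import json

final_text = ""

def handle_response(message: str) -> str:
    """Return the current transcript: finalized text plus the non-final tail."""
    global final_text
    response = json.loads(message)
    non_final = ""
    for token in response.get("tokens", []):
        if token.get("is_final"):
            final_text += token["text"]   # finalized tokens never change
        else:
            non_final += token["text"]    # non-final tokens may still be revised
    return final_text + non_final
```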
Finished response
At the end of the stream, Soniox sends a final message indicating that the session is complete.
The server will then close the WebSocket connection.
Error response
If an error occurs, the server will send an error response and immediately close the connection:
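An error response carries the two fields described below; for illustration only (the exact values will vary), it might look like this:

```json
{
  "error_code": 401,
  "error_message": "Invalid API key."
}
```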
error_code (number)
Standard HTTP status code.

error_message (string)
A description of the error encountered.
Possible error codes and their descriptions: