WebSocket API
Learn how to use and integrate the Soniox Speech-to-Text WebSocket API.
Overview
The Soniox WebSocket API provides real-time, low-latency transcription and translation of live audio. It supports speaker diarization, context customization, and manual finalization, all over a persistent WebSocket connection. It is suited to live scenarios such as meetings, broadcasts, multilingual communication, and voice interfaces.
WebSocket endpoint
Connect to the API using:
Configuration
Before streaming audio, configure the transcription session by sending a JSON message such as:
Parameters
api_key (string, required): Your Soniox API key. Create API keys in the Soniox Console. For client apps, generate a temporary API key from your server to keep secrets secure.
audio_format (string, required): Audio format of the stream. See audio formats.
num_channels (number): Required for raw audio formats. See audio formats.
sample_rate (number): Required for raw audio formats. See audio formats.
language_hints (array<string>): See language hints.
context (object): See context.
enable_speaker_diarization (boolean): See speaker diarization.
enable_language_identification (boolean)
enable_endpoint_detection (boolean): See endpoint detection.
client_reference_id (string): Optional identifier to track this request (client-defined).
translation (object): Takes one of the two forms below.
One-way translation
- type (string, required): Must be set to one_way.
- target_language (string, required): Language to translate the transcript into.
Two-way translation
- type (string, required): Must be set to two_way.
- language_a (string, required): First language for two-way translation.
- language_b (string, required): Second language for two-way translation.
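As a rough illustration of the rules above, the following Python sketch assembles a configuration message and checks the required fields before anything is sent. The helper name `build_config` and the `raw_audio` flag are hypothetical (the caller decides whether the chosen format counts as raw), and the audio format and language values in the usage lines are placeholder assumptions, not values confirmed on this page.

```python
import json

# Optional top-level parameters listed in the Parameters section.
OPTIONAL_FIELDS = {
    "language_hints",
    "context",
    "enable_speaker_diarization",
    "enable_language_identification",
    "enable_endpoint_detection",
    "client_reference_id",
}

# Required keys for each translation type described above.
TRANSLATION_KEYS = {
    "one_way": ["target_language"],
    "two_way": ["language_a", "language_b"],
}


def build_config(api_key, audio_format, *, raw_audio=False,
                 num_channels=None, sample_rate=None,
                 translation=None, **optional):
    """Return the JSON configuration message as a string (hypothetical helper)."""
    if not api_key or not audio_format:
        raise ValueError("api_key and audio_format are required")
    config = {"api_key": api_key, "audio_format": audio_format}

    # num_channels and sample_rate are required for raw audio formats.
    if raw_audio and (num_channels is None or sample_rate is None):
        raise ValueError("raw audio requires num_channels and sample_rate")
    if num_channels is not None:
        config["num_channels"] = num_channels
    if sample_rate is not None:
        config["sample_rate"] = sample_rate

    if translation is not None:
        kind = translation.get("type")
        if kind not in TRANSLATION_KEYS:
            raise ValueError("translation type must be one_way or two_way")
        missing = [k for k in TRANSLATION_KEYS[kind] if k not in translation]
        if missing:
            raise ValueError(f"translation is missing: {missing}")
        config["translation"] = dict(translation)

    unknown = set(optional) - OPTIONAL_FIELDS
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    config.update(optional)
    return json.dumps(config)


# Usage: the format name and language code are assumed example values.
message = build_config(
    "YOUR_API_KEY", "pcm_s16le", raw_audio=True,
    num_channels=1, sample_rate=16000,
    translation={"type": "one_way", "target_language": "es"},
    enable_speaker_diarization=True,
)
```

Validating on the client is optional; it only surfaces mistakes earlier than the server's error response would.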
Audio streaming
After configuration, start streaming audio:
- Send audio as binary WebSocket frames.
- Each stream supports up to 60 minutes of audio. Support for 300-minute streams is coming soon.
Ending the stream
To gracefully close a streaming session:
- Send an empty WebSocket frame (binary or text).
- The server returns one or more responses, including a finished response, and then closes the connection.
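The message order described so far (configuration as text, audio as binary frames, then an empty frame) can be sketched in Python as follows. This is a minimal illustration, not a reference client: `run_session` and `FakeConnection` are hypothetical names, and the connection is assumed to expose `send(data)` and to yield incoming text messages when iterated until the server closes it. A connection object from a library such as `websockets` (sync client) should fit that shape, but that is an assumption rather than something this page specifies. For simplicity the sketch sends all audio before reading; a real-time client would normally read responses while still sending.

```python
import json


def run_session(ws, config_json, audio_chunks):
    """Send config, stream audio, end the stream, then collect responses."""
    ws.send(config_json)           # str -> text frame carrying the configuration
    for chunk in audio_chunks:
        if chunk:                  # skip empty chunks: an empty frame ends the stream
            ws.send(bytes(chunk))  # bytes -> binary frame carrying audio
    ws.send(b"")                   # empty frame: graceful end of stream

    responses = []
    for message in ws:             # iteration stops once the server closes
        responses.append(json.loads(message))
    return responses


class FakeConnection:
    """In-memory stand-in for a WebSocket connection (illustration only)."""

    def __init__(self, replies):
        self.sent = []
        self._replies = list(replies)

    def send(self, data):
        self.sent.append(data)

    def __iter__(self):
        return iter(self._replies)


# Usage with the fake connection; a real client would pass a live connection.
ws = FakeConnection(['{"tokens": []}'])
responses = run_session(ws, '{"api_key": "YOUR_API_KEY"}', [b"\x00\x01", b"", b"\x02"])
```

Filtering out empty chunks matters because an empty frame in the middle of the audio would end the session early.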
Response
Soniox returns responses in JSON format. A typical successful response looks like:
Field descriptions
tokens (array<object>): List of processed tokens (words or subwords).
Each token may include:
- text (string): Token text.
- start_ms (number, optional): Start timestamp of the token, in milliseconds. Not included if translation_status is translation.
- end_ms (number, optional): End timestamp of the token, in milliseconds. Not included if translation_status is translation.
- confidence (number): Confidence score (0.0–1.0).
- is_final (boolean): Whether the token is finalized.
- speaker (string, optional): Speaker label (if diarization is enabled).
- translation_status (string, optional)
- language (string, optional): Language of the token text.
- source_language (string, optional)
final_audio_proc_ms (number): Audio processed into final tokens.
total_audio_proc_ms (number): Audio processed into final + non-final tokens.
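One way to turn a sequence of responses into running text is sketched below in Python. It relies on an assumption this page does not state: that the non-final tokens in each response replace the non-final tokens from earlier responses, while final tokens are sent once and can be appended. The `Transcript` class is a hypothetical helper. Tokens whose translation_status is translation are skipped so that only the spoken-language text is accumulated.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    """Accumulates final tokens and tracks the latest non-final tail."""

    final_text: str = ""
    pending_text: str = ""

    def update(self, response):
        """Apply one parsed response and return the current best text."""
        tokens = [
            t for t in response.get("tokens", [])
            if t.get("translation_status") != "translation"
        ]
        # Final tokens are appended permanently.
        self.final_text += "".join(t["text"] for t in tokens if t.get("is_final"))
        # Assumption: non-final tokens supersede those from earlier responses.
        self.pending_text = "".join(
            t["text"] for t in tokens if not t.get("is_final")
        )
        return self.final_text + self.pending_text


# Usage with two hand-written responses (illustrative data only).
transcript = Transcript()
first = transcript.update({"tokens": [
    {"text": "Hel", "is_final": True},
    {"text": "lo wor", "is_final": False},
]})
second = transcript.update({"tokens": [
    {"text": "lo", "is_final": True},
    {"text": " world", "is_final": False},
]})
```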
Finished response
At the end of a stream, Soniox sends a final message to indicate the session is complete:
After this, the server closes the WebSocket connection.
Error response
If an error occurs, the server returns an error message and immediately closes the connection:
error_code (number): Standard HTTP status code.
error_message (string): A description of the error encountered.
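A client might separate error messages from normal responses along these lines. This is a small Python sketch: `StreamError` and `parse_response` are hypothetical names, and the error text in the usage lines is invented for illustration. It treats any message containing error_code as an error, which follows from the two fields above but is otherwise an assumption about the message shape.

```python
import json


class StreamError(Exception):
    """Raised when the server reports an error (hypothetical helper class)."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


def parse_response(raw):
    """Decode one text message; raise StreamError if it carries error_code."""
    data = json.loads(raw)
    if "error_code" in data:
        raise StreamError(data["error_code"], data.get("error_message", ""))
    return data


# Usage: the error text here is made up for the example.
try:
    parse_response('{"error_code": 401, "error_message": "Invalid API key."}')
except StreamError as err:
    caught = (err.code, err.message)
```

Because the server closes the connection right after an error, the caller should stop sending audio once `StreamError` is raised.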
Full list of possible error codes and messages: