WebSocket API
Learn how to use and integrate the Soniox Speech-to-Text WebSocket API.
Real-time transcription over WebSocket
The Soniox Speech-to-Text WebSocket API enables low-latency transcription of live audio streams. It supports advanced features such as automatic speaker diarization and context customization, all over a single persistent WebSocket connection.
This API is ideal for live transcription scenarios such as meetings, broadcasts, voice interfaces, and real-time voice applications.
WebSocket endpoint
To connect to the WebSocket API, use:
Authentication and configuration
Before sending audio, you must authenticate and configure the transcription session by sending a JSON message like this:
Configuration parameters
api_key (string, required)
Your Soniox API key. You can create keys in the Soniox Console. For client-side integrations, use a temporary API key generated on the server to avoid exposing secrets.
model (string, required)
The transcription model to use. Use the GET /models endpoint to retrieve a list of available models.
Example: "stt-rt-preview"
audio_format (string, required)
The format of the streamed audio. See Supported audio formats for details.
Examples: "auto", "pcm_s16le"
num_channels (number)
Required for raw PCM formats. Use 1 for mono audio, 2 for stereo audio.
sample_rate (number)
Required for raw PCM formats.
Example: 16000
language_hints (array<string>)
Expected languages in the audio. If not specified, languages are detected automatically. See supported languages for the list of available ISO language codes.
context (string)
Provide domain-specific terms or phrases to improve recognition accuracy.
Maximum length: 10000
enable_speaker_diarization (boolean)
When true, speakers are identified and separated in the transcription output.
enable_non_final_tokens (boolean)
When true, partial non-final tokens are streamed before they are finalized. See Final vs non-final tokens for more information.
Default: true
max_non_final_tokens_duration_ms (number)
Maximum delay (in milliseconds) between a spoken word and its finalization.
Default: 4000
Minimum: 700
Maximum: 6000
client_reference_id (string)
Optional tracking identifier string. Does not need to be unique.
Maximum length: 256
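The parameters above combine into the first message of a session. The following Python sketch shows one way to assemble it. The helper name `build_config`, the placeholder key, and the `"en"` language hint are illustrative and not part of the Soniox API. The check treats any format whose name starts with `pcm_` as raw PCM, which is an assumption based on the `pcm_s16le` example.

```python
import json

# Bounds for max_non_final_tokens_duration_ms, taken from the parameter list above.
MIN_NON_FINAL_MS = 700
MAX_NON_FINAL_MS = 6000


def build_config(api_key, model, audio_format="auto", **options):
    """Return the JSON text for the initial configuration message.

    build_config is a hypothetical helper; only the key names come from
    the parameter list above.
    """
    config = {"api_key": api_key, "model": model, "audio_format": audio_format}
    config.update(options)

    # Assumption: raw PCM formats are the ones named "pcm_...".
    if audio_format.startswith("pcm_"):
        for key in ("num_channels", "sample_rate"):
            if key not in config:
                raise ValueError(f"{key} is required for raw PCM formats")

    delay = config.get("max_non_final_tokens_duration_ms")
    if delay is not None and not MIN_NON_FINAL_MS <= delay <= MAX_NON_FINAL_MS:
        raise ValueError("max_non_final_tokens_duration_ms must be 700..6000")

    return json.dumps(config)


message = build_config(
    "<SONIOX_API_KEY>",  # placeholder, not a real key
    "stt-rt-preview",
    audio_format="pcm_s16le",
    num_channels=1,
    sample_rate=16000,
    language_hints=["en"],  # illustrative value
    enable_speaker_diarization=True,
)
```

Validating on the client side is optional; it only surfaces obvious mistakes before a connection is opened.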
Audio streaming
After sending the initial configuration, begin streaming audio data:
- Audio can be sent as binary WebSocket frames (preferred)
- Alternatively, Base64-encoded audio can be sent as text messages (if binary is not supported)
The server expects audio to be streamed in real time — not significantly faster or slower than the actual rate of speech.
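As a rough illustration of the two framing options, the Python sketch below splits raw PCM audio into fixed-duration chunks and prepares each one either as bytes (for a binary frame) or as Base64 text. All function names are hypothetical, and the default of 2 bytes per sample assumes a 16-bit format such as `pcm_s16le`.

```python
import base64


def bytes_per_chunk(sample_rate, num_channels, chunk_ms, bytes_per_sample=2):
    """Size in bytes of chunk_ms milliseconds of raw PCM audio.

    bytes_per_sample=2 matches 16-bit formats such as pcm_s16le.
    """
    return sample_rate * num_channels * bytes_per_sample * chunk_ms // 1000


def split_audio(audio, chunk_size):
    """Yield consecutive chunk_size-byte slices of audio (last may be shorter)."""
    for offset in range(0, len(audio), chunk_size):
        yield audio[offset:offset + chunk_size]


def encode_frame(chunk, binary_supported=True):
    """Return the payload for one WebSocket message.

    bytes -> send as a binary frame (preferred)
    str   -> Base64 text, for transports that cannot send binary
    """
    if binary_supported:
        return chunk
    return base64.b64encode(chunk).decode("ascii")


# 100 ms of 16 kHz mono 16-bit audio per chunk; bytes(8000) is silent test data.
size = bytes_per_chunk(sample_rate=16000, num_channels=1, chunk_ms=100)
frames = [encode_frame(chunk) for chunk in split_audio(bytes(8000), size)]
```

Knowing how many milliseconds each chunk represents is what makes it possible to pace the stream at the rate of speech.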
Limitations
- The maximum duration of a stream is 65 minutes
- Max concurrent connections per organization: 10 (can be increased via the Soniox Console)
- Streaming too slowly may result in the connection being closed
- Audio must be sent at real-time speed, not faster than real time and not buffered
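One way to respect the pacing and duration limits is to delay each chunk until the wall clock has caught up with the audio already sent. The Python sketch below does this with an injectable clock so it can be tested. `send_paced` and its parameters are hypothetical names, and it interprets the 65-minute limit as a cap on the amount of audio sent, which is an assumption.

```python
import time

MAX_STREAM_SECONDS = 65 * 60  # maximum stream duration from the list above


def send_paced(chunks, chunk_ms, send, clock=time.monotonic, sleep=time.sleep):
    """Send chunks so that chunk i leaves no earlier than i * chunk_ms.

    Returns the number of chunks sent. Stops before the audio sent would
    exceed MAX_STREAM_SECONDS.
    """
    start = clock()
    sent = 0
    for chunk in chunks:
        audio_seconds = sent * chunk_ms / 1000
        if audio_seconds + chunk_ms / 1000 > MAX_STREAM_SECONDS:
            break  # the stream limit would be exceeded
        # Wait until the wall clock catches up with the audio already sent.
        wait = audio_seconds - (clock() - start)
        if wait > 0:
            sleep(wait)
        send(chunk)
        sent += 1
    return sent
```

With live microphone input the capture device already paces the audio; a helper like this matters mainly when replaying a file through the real-time API.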
Ending the stream
To gracefully end a transcription session:
- Send an empty WebSocket message (empty binary or text frame)
- The server will return any final results, send a completion message, and close the connection
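Putting the steps together, the order of messages in a session can be sketched as follows in Python. The `send` and `recv` callables are hypothetical adapters around whatever WebSocket client is in use, and `recv` is assumed to return `None` once the server has closed the connection; neither is part of the Soniox API.

```python
import json


def run_session(send, recv, config, audio_chunks):
    """Drive one session over an already-open WebSocket.

    send(data) and recv() are hypothetical adapters around a WebSocket
    client; recv() is assumed to return None once the server has closed
    the connection.
    """
    send(json.dumps(config))      # 1. authenticate and configure
    for chunk in audio_chunks:    # 2. stream audio as binary frames
        send(chunk)
    send(b"")                     # 3. an empty frame ends the stream

    responses = []
    while True:                   # 4. read until the server closes
        raw = recv()
        if raw is None:
            break
        responses.append(json.loads(raw))
    return responses
```

This sequential version only shows the message order. A real client would normally read responses concurrently while it is still streaming audio, so that transcription results arrive with low latency.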
Response format
Soniox will send transcription responses in JSON format. Successful transcription responses follow this format:
Field descriptions
tokens (array<object>)
The list of transcribed tokens (words or subwords).
Each token may include:
- text (string): Token text
- start_ms (number): Start timestamp of the token (in milliseconds)
- end_ms (number): End timestamp of the token (in milliseconds)
- confidence (number): Confidence score (0.0 to 1.0)
- is_final (boolean): Whether the token is finalized
- speaker (string, optional): Speaker label (if diarization is enabled)
- language_code (string, optional): Detected language
- is_audio_event (boolean, optional): True if the token represents a non-verbal audio event
final_audio_proc_ms (number)
Amount of audio processed and finalized (in ms).
total_audio_proc_ms (number)
Total audio processed (in ms), including non-final tokens.
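To show how these fields might be consumed, here is a small Python sketch that accumulates a running transcript from successive responses. The `Transcript` class is hypothetical and the sample values are made up for illustration. It rests on two assumptions: non-final tokens are re-sent with each response, so the previous provisional text is replaced rather than extended, and token texts carry their own spacing, so they can be concatenated directly.

```python
import json


class Transcript:
    """Keep finalized text and the latest non-final (provisional) text."""

    def __init__(self):
        self.final_text = ""
        self.non_final_text = ""

    def update(self, raw_message):
        response = json.loads(raw_message)
        # Assumption: non-final tokens are re-sent with each response, so
        # the previous provisional text is discarded rather than extended.
        self.non_final_text = ""
        for token in response.get("tokens", []):
            if token.get("is_final"):
                self.final_text += token["text"]
            else:
                self.non_final_text += token["text"]

    def text(self):
        return self.final_text + self.non_final_text


# Made-up sample response using only the fields described above.
sample = json.dumps({
    "tokens": [
        {"text": "Hello", "start_ms": 0, "end_ms": 320,
         "confidence": 0.98, "is_final": True},
        {"text": " wor", "start_ms": 340, "end_ms": 500,
         "confidence": 0.71, "is_final": False},
    ],
    "final_audio_proc_ms": 320,
    "total_audio_proc_ms": 500,
})
transcript = Transcript()
transcript.update(sample)
```

Keeping the two strings separate lets a user interface render finalized text as stable and provisional text as still subject to change.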
Finished response
At the end of the stream, Soniox will send a final message indicating the session is complete:
The server will then close the WebSocket connection.
Error response
If an error occurs, the server will send an error response and immediately close the connection:
error_code (number)
Standard HTTP status code.
error_message (string)
A description of the error encountered.
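A client can tell an error response apart from a transcription response by checking for the `error_code` field. The Python sketch below turns an error response into an exception; `TranscriptionError` and `parse_message` are hypothetical names chosen for this example.

```python
import json


class TranscriptionError(Exception):
    """Hypothetical exception wrapping an error response."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


def parse_message(raw_message):
    """Return the decoded response, or raise if it is an error response."""
    response = json.loads(raw_message)
    if "error_code" in response:
        raise TranscriptionError(
            response["error_code"], response.get("error_message", "")
        )
    return response
```

Because the server closes the connection right after an error, the caller should treat this exception as the end of the session and reconnect if it wants to continue.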
Possible error codes and their descriptions: