Real-time API
Learn how to use and integrate Speech-to-Text Real-time API.
Speech-to-Text Real-time API allows developers to transcribe audio streams seamlessly via a WebSocket connection. It supports multiple audio formats and delivers real-time transcription with precise timestamps and confidence scores for each token, ensuring accuracy and reliability.
To connect to the Speech-to-Text Real-time API, use the following WebSocket URL:
Authentication and transcription configuration
Before transmitting audio data messages, ensure client authentication and define the transcription configuration by sending a JSON object with the following structure:
api_key
Required, string. You can create your API key in the Soniox Console.
If you are using a WebSocket library on the client side, be careful not to expose your API key in client-side code. Instead, generate a temporary API key for each connection on the server side and send it to the client to authenticate the WebSocket connection.
audio_format
Required, string. The audio format of the audio data. Use auto for automatic detection.
num_channels
string. The number of channels in the audio (required for PCM formats).
sample_rate
string. The sample rate of the audio data (required for PCM formats).
model
Required, string. The Speech-to-Text model to use (e.g. stt-rt-preview). You can get an up-to-date list of available models from the GET models endpoint.
language_hints
array<string>. Provide language hints to enhance speech recognition.
enable_speaker_tags
boolean. Enabling speaker tags will separate speakers in the transcription output.
context
string. Context can help correctly transcribe uncommon spoken words, such as names, jargon, or abbreviations. The maximum length of the context is 10,000 characters.
client_reference_id
string. A string provided by the client to track the transcription session. It can be an ID, a JSON string, or any other text. This value can be used for reference in future API requests or for internal mapping within the client's systems. The value does not have to be unique. If not provided, it will be auto-generated.
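Putting the fields above together, the first WebSocket message might look like the following sketch. All values are illustrative; the field names follow the parameter list above, and in practice the api_key would be a temporary key minted server-side.

```python
import json

# Configuration/authentication message, sent as the first WebSocket message
# before any audio data. Values here are placeholders for illustration.
start_message = {
    "api_key": "<SONIOX_API_KEY>",   # ideally a temporary, server-issued key
    "audio_format": "auto",          # PCM formats also require num_channels
                                     # and sample_rate
    "model": "stt-rt-preview",
    "language_hints": ["en"],
    "enable_speaker_tags": True,
    "context": "Soniox, WebSocket, PCM",
    "client_reference_id": "session-42",
}

# Serialize to JSON; this string would be sent as a text WebSocket message.
payload = json.dumps(start_message)
print(payload)
```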
Audio streaming
After sending the start message, you can begin streaming audio data. The API supports multiple audio formats, including raw PCM and live microphone streams from all major web browsers. Audio data can be sent as binary WebSocket messages or as Base64-encoded text messages for WebSocket clients that do not support binary messaging.
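For clients without binary messaging, the Base64 fallback amounts to encoding each chunk before sending it as text. A minimal sketch (the chunk bytes are a stand-in for real audio):

```python
import base64

# A chunk of raw audio bytes (stand-in for real PCM or encoded audio data).
chunk = b"\x00\x01\x02\x03" * 4

# Preferred path: send `chunk` directly as a binary WebSocket message.
# Fallback path: Base64-encode the chunk and send it as a text message.
text_message = base64.b64encode(chunk).decode("ascii")

# The server decodes the text message back to the original audio bytes.
assert base64.b64decode(text_message) == chunk
```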
Limitations
The server will send the transcribed text in real time with minimal latency.
The maximum duration of a real-time audio stream is 65 minutes. After that, the stream will be ended with error 400: Audio is too long. If you would like to transcribe longer audio, you can reconnect after receiving this error.
The max number of concurrent connections per organization is 10 and can be increased in the Soniox Console.
Send audio data in real time: no faster and no slower than the rate at which it is captured. Sending data too slowly may result in the server terminating the connection.
Ending the stream
To end the streaming transcription, send an empty WebSocket message, either as an empty binary or text message. Upon receiving this, the server will initiate the closing process, send any remaining text responses, and close the WebSocket connection.
Response format
The WebSocket server sends all responses in JSON format.
Successful responses
Successful transcription responses follow this format:
text
string. The transcribed text.
tokens
array<object>. A list of tokens, each containing a part of the transcribed text along with its start timestamp, end timestamp, and confidence level.
text
string. The token text.
start_ms
number. The start timestamp of the token in milliseconds.
end_ms
number. The end timestamp of the token in milliseconds.
confidence
number. The confidence level of the token (between 0 and 1).
audio_proc_ms
number. The amount of processed audio in milliseconds.
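A response assembled from the fields documented above can be consumed like this (the transcript and timestamp values are illustrative, not actual server output):

```python
import json

# An example successful response using the documented fields;
# the concrete values are made up for illustration.
raw = json.dumps({
    "text": "hello world",
    "tokens": [
        {"text": "hello", "start_ms": 0, "end_ms": 420, "confidence": 0.98},
        {"text": " world", "start_ms": 420, "end_ms": 900, "confidence": 0.95},
    ],
    "audio_proc_ms": 900,
})

# Parse the JSON text received from the WebSocket and walk the tokens.
response = json.loads(raw)
for token in response["tokens"]:
    print(f'{token["text"]!r}: {token["start_ms"]}-{token["end_ms"]} ms '
          f'(confidence {token["confidence"]:.2f})')
```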
Speaker tags
If the enable_speaker_tags flag is set to true, speaker tags will be included in the tokens as a single token indicating the speaker (e.g., spk:1, spk:2). Speaker tokens are also included in the text field of the response.
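The spk:N marker tokens can be used to attribute text to speakers, for example by grouping token text under the most recent speaker tag. A sketch with an illustrative token list:

```python
# Illustrative token stream: spk:N markers interleaved with text tokens.
tokens = [
    {"text": "spk:1"}, {"text": "Hello"}, {"text": " there"},
    {"text": "spk:2"}, {"text": "Hi"},
]

# Accumulate each token's text under the most recently seen speaker tag.
by_speaker: dict[str, str] = {}
speaker = "unknown"
for token in tokens:
    if token["text"].startswith("spk:"):
        speaker = token["text"]
    else:
        by_speaker[speaker] = by_speaker.get(speaker, "") + token["text"]

print(by_speaker)  # text grouped per speaker tag
```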
Final response
At the end of the transcription, the server will send a final response indicating completion:
After sending this response, the server will close the WebSocket connection.
Error response
In the event of an error, the server will return an error response and terminate the WebSocket connection.
error_code
number. The error code, following HTTP status code conventions.
error_message
string. A description of the error encountered.
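Since the 400 "Audio is too long" error is expected at the 65-minute limit and the documentation suggests reconnecting after it, a client can treat it differently from other errors. A sketch (the JSON value mirrors the documented error fields):

```python
import json

# Example error response built from the documented fields.
raw = json.dumps({"error_code": 400, "error_message": "Audio is too long."})

message = json.loads(raw)
action = "abort"
if "error_code" in message:
    if message["error_code"] == 400 and "too long" in message["error_message"]:
        # The stream hit the maximum duration; reconnect to continue
        # transcribing longer audio.
        action = "reconnect"
print(action)
```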
Here is a list of possible error codes and their descriptions: