# Speech-to-Text
Use Soniox Speech-to-Text in LiveKit Agents.
## Overview

`soniox.STT` provides real-time speech-to-text transcription using the Soniox WebSocket STT API, with support for 60+ languages, context customization, multilingual audio, and speaker diarization.
## Basic usage

Use Soniox STT in an `AgentSession` or as a standalone transcription service:
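A minimal sketch of the `AgentSession` case, using the constructor defaults documented below (the API key is read from the `SONIOX_API_KEY` environment variable; the commented-out components stand in for whatever LLM/TTS/VAD your agent uses):

```python
from livekit.agents import AgentSession
from livekit.plugins import soniox

# Soniox as the STT component of an agent session.
# With no arguments, the API key comes from SONIOX_API_KEY
# and all STTOptions defaults apply.
session = AgentSession(
    stt=soniox.STT(),
    # llm=..., tts=..., vad=...
)
```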
## Configuration

`soniox.STT` accepts top-level constructor arguments (for the connection) and an `STTOptions` object passed via `params` (for transcription parameters).
### Constructor arguments
| Argument | Type | Default | Description |
|---|---|---|---|
| `api_key` | `str \| None` | `SONIOX_API_KEY` env var | Soniox API key. |
| `base_url` | `str` | `wss://stt-rt.soniox.com/transcribe-websocket` | Soniox WebSocket endpoint. See regional endpoints. |
| `params` | `STTOptions \| None` | `STTOptions()` | Transcription parameters (see below). |
| `http_session` | `aiohttp.ClientSession \| None` | `None` | Optional aiohttp session to reuse for the WebSocket connection. |
### STTOptions

Pass via `params=soniox.STTOptions(...)`.
| Setting | Type | Default | Description |
|---|---|---|---|
| `model` | `str` | `stt-rt-v4` | Model to use for transcription. |
| `language_hints` | `list[str] \| None` | `None` | Language hints to bias recognition toward expected languages. |
| `language_hints_strict` | `bool` | `False` | If true, restrict recognition to the provided languages. |
| `context` | `ContextObject \| str \| None` | `None` | Context customization to improve transcription accuracy. |
| `num_channels` | `int` | `1` | Number of audio channels: `1` (mono) or `2` (stereo). |
| `sample_rate` | `int` | `16000` | Audio sample rate in Hz. |
| `enable_speaker_diarization` | `bool` | `False` | Annotate tokens with speaker IDs. |
| `enable_language_identification` | `bool` | `True` | Annotate tokens with language IDs. |
| `max_endpoint_delay_ms` | `int` | `500` | Maximum delay in ms between the end of speech and endpoint detection. Range: 500–3000. |
| `client_reference_id` | `str \| None` | `None` | Client reference ID for tracking. |
| `translation` | `TranslationConfig \| None` | `None` | Translation configuration. See translation. |
## Advanced usage

### Regional endpoints

To use a region other than the default (US), pass a regional `base_url` to the STT constructor:
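For example (the endpoint URL below is a placeholder, not a verified hostname; substitute one from the regional endpoints list):

```python
from livekit.plugins import soniox

# Hypothetical EU endpoint -- replace with a real hostname
# from the Soniox regional endpoints list.
stt = soniox.STT(
    base_url="wss://stt-rt.eu.soniox.com/transcribe-websocket",
)
```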
See the list of available regional endpoints.
### Language hints
There is no need to pre-select a language. The model automatically detects and transcribes any supported language and handles multilingual audio seamlessly, even when multiple languages are mixed within a single sentence or conversation.
When you have prior knowledge of the languages likely to appear in your audio, language hints help the model prioritize them for greater accuracy:
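For example, biasing recognition toward English and Spanish (a sketch; hint codes follow the Soniox supported-languages list, and setting `language_hints_strict=True` would restrict recognition to these languages instead of merely prioritizing them):

```python
from livekit.plugins import soniox

# Prioritize English and Spanish without excluding other languages.
stt = soniox.STT(
    params=soniox.STTOptions(
        language_hints=["en", "es"],
    ),
)
```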
See the list of supported languages and learn more about language hints.
### Customization with context
By providing context, you help the model better understand and anticipate the language in your audio, even when some terms do not appear clearly or completely.
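For example, passing domain vocabulary as a plain string (the medical terms below are illustrative; `context` also accepts a structured `ContextObject`, per the table above):

```python
from livekit.plugins import soniox

# Free-form context biases the model toward domain-specific terms
# that might otherwise be transcribed incorrectly.
stt = soniox.STT(
    params=soniox.STTOptions(
        context="Medical consultation. Terms: metoprolol, atorvastatin, HbA1c.",
    ),
)
```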
Learn more about customizing with context.
### Speaker diarization

Enable speaker diarization to annotate tokens with speaker IDs:
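A minimal sketch, using the `enable_speaker_diarization` option from the table above:

```python
from livekit.plugins import soniox

# Each transcribed token is annotated with the ID of the speaker
# who produced it.
stt = soniox.STT(
    params=soniox.STTOptions(
        enable_speaker_diarization=True,
    ),
)
```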
Learn more about speaker diarization.
### Endpoint detection

Soniox detects natural pauses in speech to finalize transcripts. Tune the maximum delay before endpoint detection fires via `max_endpoint_delay_ms` (range 500–3000 ms):
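For example, allowing longer pauses before a transcript is finalized (a sketch; 1500 ms is an arbitrary value within the documented 500–3000 ms range):

```python
from livekit.plugins import soniox

# Wait up to 1.5 s of silence before finalizing a transcript,
# which reduces premature endpointing for slow or deliberate speakers.
stt = soniox.STT(
    params=soniox.STTOptions(
        max_endpoint_delay_ms=1500,
    ),
)
```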
Learn more about endpoint detection.