Soniox

Speech-to-Text

Use Soniox Speech-to-Text in LiveKit Agents.

Overview

soniox.STT provides real-time speech-to-text transcription using the Soniox WebSocket STT API, with support for 60+ languages, context customization, multilingual audio, and speaker diarization.

Basic usage

Use Soniox STT in an AgentSession or as a standalone transcription service:

from livekit.agents import AgentSession
from livekit.plugins import soniox

session = AgentSession(
    stt=soniox.STT(),
    # ... llm, tts, etc.
)

Configuration

soniox.STT accepts top-level constructor arguments (for connection) and an STTOptions object passed via params (for transcription parameters).

Constructor arguments

Argument      Type                          Default                                       Description
------------  ----------------------------  --------------------------------------------  -----------
api_key       str | None                    SONIOX_API_KEY env var                        Soniox API key.
base_url      str                           wss://stt-rt.soniox.com/transcribe-websocket  Soniox WebSocket endpoint. See regional endpoints.
params        STTOptions | None             STTOptions()                                  Transcription parameters (see below).
http_session  aiohttp.ClientSession | None  None                                          Optional aiohttp session to reuse for the WebSocket connection.
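
The api_key default above implies an environment-variable fallback. As a minimal sketch of that lookup order, the helper below (resolve_api_key is a hypothetical name, not part of the plugin) returns an explicit key when one is given and otherwise reads SONIOX_API_KEY. The plugin performs its own lookup, so a helper like this is only useful if you want to fail fast at startup with a clear message.

```python
import os
from typing import Optional


def resolve_api_key(api_key: Optional[str] = None) -> str:
    """Hypothetical helper: prefer an explicit key, else SONIOX_API_KEY."""
    key = api_key or os.environ.get("SONIOX_API_KEY")
    if not key:
        # Neither source provided a key, so stop before any connection attempt.
        raise ValueError("No Soniox API key: pass api_key or set SONIOX_API_KEY")
    return key
```

The result can then be passed as soniox.STT(api_key=resolve_api_key()).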

STTOptions

Pass via params=soniox.STTOptions(...).

Setting                         Type                        Default    Description
------------------------------  --------------------------  ---------  -----------
model                           str                         stt-rt-v4  Model to use for transcription.
language_hints                  list[str] | None            None       Language hints to bias recognition toward expected languages.
language_hints_strict           bool                        False      If true, restrict recognition to the provided languages.
context                         ContextObject | str | None  None       Context customization to improve transcription accuracy.
num_channels                    int                         1          Number of audio channels, e.g. 1 (mono) or 2 (stereo).
sample_rate                     int                         16000      Audio sample rate in Hz.
enable_speaker_diarization      bool                        False      Annotate tokens with speaker IDs.
enable_language_identification  bool                        True       Annotate tokens with language IDs.
max_endpoint_delay_ms           int                         500        Maximum delay in ms between the end of speech and endpoint detection. Range: 500–3000.
client_reference_id             str | None                  None       Client reference ID for tracking.
translation                     TranslationConfig | None    None       Translation configuration. See translation.

Advanced usage

Regional endpoints

To use a region other than the default (US), pass a regional base_url to the STT constructor:

from livekit.plugins import soniox

stt = soniox.STT(base_url="wss://stt-rt.eu.soniox.com/transcribe-websocket")

See the list of regional endpoints for all available regions.
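
If the region is chosen per deployment, a small lookup keeps the URLs in one place. The sketch below is an illustration only: SONIOX_ENDPOINTS and soniox_base_url are hypothetical names, and the table contains just the two endpoints that appear on this page (the US default and the EU example). Any other region would need to be added from the official list.

```python
# Hypothetical lookup table; only the two endpoints shown on this page.
SONIOX_ENDPOINTS = {
    "us": "wss://stt-rt.soniox.com/transcribe-websocket",
    "eu": "wss://stt-rt.eu.soniox.com/transcribe-websocket",
}


def soniox_base_url(region: str = "us") -> str:
    """Return the WebSocket endpoint for a region key, case-insensitively."""
    key = region.strip().lower()
    if key not in SONIOX_ENDPOINTS:
        known = ", ".join(sorted(SONIOX_ENDPOINTS))
        raise ValueError(f"Unknown Soniox region {region!r}; known regions: {known}")
    return SONIOX_ENDPOINTS[key]
```

The returned string can be passed directly, for example soniox.STT(base_url=soniox_base_url("eu")).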

Language hints

There is no need to pre-select a language. The model automatically detects and transcribes any supported language and handles multilingual audio seamlessly, even when multiple languages are mixed within a single sentence or conversation.

When you have prior knowledge of the languages likely to appear in your audio, language hints help the model prioritize them for greater accuracy:

from livekit.plugins import soniox

stt = soniox.STT(
    params=soniox.STTOptions(
        language_hints=["en", "es"],
    ),
)

See the list of supported languages and learn more about language hints.
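
Applications often know a user's locale (such as "en-US") rather than a bare language code. The following sketch, assuming that hints take the short form used in the example above ("en", "es"), reduces locale tags to unique base codes while preserving order. language_hints_from_locales is a hypothetical helper, and it does not check whether a language is actually supported.

```python
def language_hints_from_locales(locales):
    """Hypothetical helper: map locale tags to unique base language codes."""
    hints = []
    for locale in locales:
        # Accept both "en-US" and "en_US" style tags; keep only the language part.
        code = locale.replace("_", "-").split("-")[0].strip().lower()
        if code and code not in hints:
            hints.append(code)
    return hints


hints = language_hints_from_locales(["en-US", "es_MX", "en-GB"])
print(hints)  # ['en', 'es']
```

The resulting list can be passed as language_hints in soniox.STTOptions.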

Customization with context

Providing context helps the model anticipate the vocabulary in your audio, even when some terms are spoken unclearly or incompletely. The example below supplies general key-value context, a list of domain terms, and translation term mappings:

from livekit.plugins import soniox
from livekit.plugins.soniox import (
    ContextObject,
    ContextGeneralItem,
    ContextTranslationTerm,
)

stt = soniox.STT(
    params=soniox.STTOptions(
        context=ContextObject(
            general=[ContextGeneralItem(key="domain", value="Healthcare")],
            terms=["Celebrex", "Zyrtec", "Xanax", "Prilosec", "Amoxicillin Clavulanate Potassium"],
            translation_terms=[ContextTranslationTerm(source="Mr. Smith", target="Sr. Smith")],
        ),
    ),
)

Learn more about customizing with context.

Speaker diarization

Enable speaker diarization to annotate tokens with speaker IDs:

from livekit.plugins import soniox

stt = soniox.STT(
    params=soniox.STTOptions(
        enable_speaker_diarization=True,
    ),
)

Learn more about speaker diarization.
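
Once tokens carry speaker IDs, a common next step is to merge consecutive tokens from the same speaker into turns. The sketch below assumes each token is a dict with "text" and "speaker" keys; that shape is an assumption for illustration, and how the plugin exposes speaker IDs on its events is not covered on this page. group_by_speaker is a hypothetical helper.

```python
from itertools import groupby


def group_by_speaker(tokens):
    """Merge consecutive tokens with the same speaker ID into (speaker, text) turns."""
    turns = []
    # groupby only merges adjacent items, which is what a turn-by-turn transcript needs.
    for speaker, run in groupby(tokens, key=lambda tok: tok.get("speaker")):
        text = "".join(tok["text"] for tok in run).strip()
        if text:
            turns.append((speaker, text))
    return turns


# Assumed token shape, for illustration only.
tokens = [
    {"text": "Hello", "speaker": "1"},
    {"text": " there", "speaker": "1"},
    {"text": "Hi", "speaker": "2"},
]
print(group_by_speaker(tokens))  # [('1', 'Hello there'), ('2', 'Hi')]
```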

Endpoint detection

Soniox detects natural pauses in speech to finalize transcripts. Tune the maximum delay before endpoint detection fires via max_endpoint_delay_ms (range 500–3000 ms):

from livekit.plugins import soniox

stt = soniox.STT(
    params=soniox.STTOptions(
        max_endpoint_delay_ms=1000,
    ),
)

Learn more about endpoint detection.
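
When the delay comes from user configuration, it can help to keep it inside the documented 500–3000 ms range before constructing the options. This page does not say how out-of-range values are handled, so the clamp below is a defensive assumption, and clamp_endpoint_delay is a hypothetical helper rather than plugin API.

```python
MIN_ENDPOINT_DELAY_MS = 500   # lower bound of the documented range
MAX_ENDPOINT_DELAY_MS = 3000  # upper bound of the documented range


def clamp_endpoint_delay(ms: int) -> int:
    """Hypothetical helper: force a delay into the documented 500-3000 ms range."""
    return max(MIN_ENDPOINT_DELAY_MS, min(MAX_ENDPOINT_DELAY_MS, int(ms)))


print(clamp_endpoint_delay(100))   # 500
print(clamp_endpoint_delay(1000))  # 1000
print(clamp_endpoint_delay(5000))  # 3000
```

The clamped value can be passed as max_endpoint_delay_ms in soniox.STTOptions. Lower values finalize transcripts sooner after a pause, while higher values tolerate longer pauses within an utterance.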

Reference