Best practices
Key best practices when working with Soniox Speech-to-Text
Overview
Automatically recognizing languages, words, and speakers, in real time — with low latency and no prior context — is an extremely difficult problem, even for humans.
As humans, we naturally rely on a rich set of contextual cues when listening and responding:
- Who is speaking?
- What language are they likely to use?
- What's the topic?
These cues dramatically improve our ability to understand speech, especially in noisy or fast-paced environments.
Soniox Speech-to-Text AI allows you to do the same — by supplying the model with optional context and configuration, you can improve accuracy, latency, and robustness across a wide range of audio environments and use cases.
Key best practices
Bring context where you can.
Soniox is context-aware. Supplying relevant hints and data helps the model behave more like a human listener.
1. Use language_hints for known languages
If you know what languages are likely to be spoken in a session, pass them in the language_hints parameter:
- Improves accuracy of transcription
- Especially useful in real-time mode and for less common languages (e.g., Slovenian, Hungarian)
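As a minimal sketch of the idea above, the helper below adds a cleaned-up language_hints list to a request configuration. Only the language_hints name comes from this page. The helper name, the "model" value, and the use of two-letter codes such as "sl" and "hu" are assumptions for illustration, so check the API reference for the exact request shape.

```python
import json


def with_language_hints(config: dict, hints: list[str]) -> dict:
    """Return a copy of config with a normalized language_hints list.

    Hypothetical helper: only the "language_hints" key is taken from the
    guide; everything else here is an illustrative assumption.
    """
    cleaned: list[str] = []
    for hint in hints:
        code = hint.strip().lower()
        # Skip blanks and duplicates while keeping the original order.
        if code and code not in cleaned:
            cleaned.append(code)

    updated = dict(config)  # leave the caller's dict untouched
    if cleaned:
        updated["language_hints"] = cleaned
    return updated


# "example-realtime-model" is a placeholder, not a real model name.
base = {"model": "example-realtime-model"}
request = with_language_hints(base, ["sl", "hu", "SL "])
print(json.dumps(request))
```

Omitting the key when no hints are known keeps the request identical to the default configuration.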
2. Customize with the context parameter
Provide custom session- or user-specific context whenever possible:
- Helps recognize names, brands, industry-specific terms, and unusual words
- Works in both real-time and async modes
- Especially valuable for technical, medical, legal, and enterprise use cases
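One way to assemble session-specific context is sketched below. It assumes that context accepts free-form text, which this page does not state, so treat the string format as a placeholder. The function name, the "Terms:" prefix, and the sample terms are all hypothetical.

```python
def build_context(terms: list[str], note: str = "") -> str:
    """Join an optional note and a de-duplicated term list into one string.

    Assumption: the context parameter takes plain text. The layout used
    here ("<note> Terms: a, b, c") is illustrative only.
    """
    seen: set[str] = set()
    unique: list[str] = []
    for term in terms:
        key = term.strip()
        # Compare case-insensitively so "Soniox" and "soniox" count once.
        if key and key.casefold() not in seen:
            seen.add(key.casefold())
            unique.append(key)

    parts: list[str] = []
    if note.strip():
        parts.append(note.strip())
    if unique:
        parts.append("Terms: " + ", ".join(unique))
    return " ".join(parts)


# Placeholder session data for a medical use case.
config = {
    "context": build_context(
        ["Soniox", "metoprolol", "soniox"],
        note="Cardiology consult.",
    )
}
print(config["context"])
```

Building the string per session or per user keeps the context specific, in line with the advice above.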
3. Tune real-time latency for accuracy
Real-time transcription is a balance between latency and accuracy. For best results in complex or noisy audio, set the max_non_final_tokens_duration_ms parameter:
- Allows more time for the model to analyze speech context
- Improves performance of speaker diarization
- Use this setting if your application allows slight delays in final tokens
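The sketch below shows one way to apply this setting to a real-time request configuration. Only the max_non_final_tokens_duration_ms name comes from this page. The helper name, the "model" value, and the 4000 ms figure are placeholders rather than recommended values, and any valid range should be taken from the API reference.

```python
def with_non_final_duration(config: dict, duration_ms: int) -> dict:
    """Return a copy of config with max_non_final_tokens_duration_ms set.

    Hypothetical helper: it checks only that the value is a positive
    integer, because the accepted range is not stated in this guide.
    """
    if isinstance(duration_ms, bool) or not isinstance(duration_ms, int):
        raise TypeError("duration_ms must be an integer number of milliseconds")
    if duration_ms <= 0:
        raise ValueError("duration_ms must be positive")

    updated = dict(config)  # leave the caller's dict untouched
    updated["max_non_final_tokens_duration_ms"] = duration_ms
    return updated


# Both values below are placeholders for illustration.
base = {"model": "example-realtime-model"}
realtime_config = with_non_final_duration(base, 4000)
print(realtime_config)
```

Keeping the value in one helper makes it easy to tune per application, depending on how much delay in final tokens is acceptable.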
Summary
| Practice | Benefit |
|---|---|
| Use language_hints | Guides model toward expected languages |
| Provide context | Boosts recognition of domain-specific or uncommon terms |
| Set max_non_final_tokens_duration_ms | Improves real-time accuracy for language and speakers |
Final note
While Soniox works exceptionally well out of the box, great transcription results often come from great input. Giving the model even a little of the context you already know (expected languages, domain terms, a suitable latency setting) can improve accuracy and robustness, especially in real-time, low-latency scenarios.