Best practices
Key best practices when working with Soniox Speech-to-Text
Overview
Automatically recognizing languages, words, speakers, and audio events in real time—with low latency and no prior context—is an extremely difficult problem, even for humans.
As humans, we naturally rely on a rich set of contextual cues when listening and responding:
- Who is speaking?
- What language are they likely to use?
- What's the topic?
These cues dramatically improve our ability to understand speech, especially in noisy or fast-paced environments.
Soniox Speech-to-Text AI lets you do the same. By supplying the model with optional context and configuration, you can improve accuracy, latency, and robustness across a wide range of audio environments and use cases.
Key best practices
Bring context where you can.
Soniox is context-aware. Supplying relevant hints and data helps the model behave more like a human listener.
1. Use language_hints for known languages
If you know which languages are likely to be spoken in a session, pass them in the language_hints parameter:
- Improves accuracy of transcription
- Especially useful in real-time mode and for less common languages (e.g., Slovene, Hungarian)
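As a rough sketch, a session configuration carrying language hints might be assembled as shown below. Only the language_hints field name comes from this page. The build_config helper, the plain JSON object shape, and the ISO 639-1 codes ("sl" for Slovene, "hu" for Hungarian) are assumptions for illustration, so check the API reference for the exact request format.

```python
import json


def build_config(language_hints):
    """Return a hypothetical session config that carries language hints.

    Only the `language_hints` key is taken from this page; the overall
    shape of the config is an assumption.
    """
    hints = []
    for code in language_hints:
        code = code.strip().lower()
        if code and code not in hints:
            hints.append(code)  # normalize case, drop duplicates, keep order
    return {"language_hints": hints}


# Assumed ISO 639-1 codes for Slovene and Hungarian; "SL" is a duplicate.
config = build_config(["sl", "hu", "SL"])
print(json.dumps(config))
```

Normalizing and de-duplicating the codes before sending them keeps the hint list short, so it reflects only the languages you actually expect in the session.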
2. Customize with the context parameter
Provide custom session- or user-specific context whenever possible:
- Helps recognize names, brands, industry-specific terms, and unusual words
- Works in both real-time and async modes
- Especially valuable for technical, medical, legal, and enterprise use cases
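One way to prepare session-specific context is sketched below. It assumes the context parameter accepts free-form text. The build_context helper, the "Terms:" layout, and the sample terms are hypothetical and not part of the Soniox API, so confirm the accepted format and any size limits in the API reference.

```python
import json


def build_context(terms, background=""):
    """Join optional background text and a list of terms into one string.

    Hypothetical helper: the layout of the resulting text is an assumption.
    """
    parts = []
    if background.strip():
        parts.append(background.strip())
    cleaned = [t.strip() for t in terms if t.strip()]  # skip blank entries
    if cleaned:
        parts.append("Terms: " + ", ".join(cleaned))
    return "\n".join(parts)


context = build_context(
    ["Soniox", "atorvastatin", "HbA1c"],
    background="Cardiology follow-up call between a doctor and a patient.",
)
config = {"context": context}  # assumed: `context` is sent as a text field
print(json.dumps(config, indent=2))
```

Building the context from data you already hold (a customer record, a meeting agenda, a product catalog) keeps it specific to the session rather than generic.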
3. Tune real-time latency for accuracy
Real-time transcription is a balance between latency and accuracy. For best results in complex or noisy audio, set a higher max_non_final_tokens_duration_ms:
- Allows more time for the model to analyze speech context
- Improves performance of speaker diarization
- Use this setting if your application allows slight delays in final tokens
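The trade-off can be made explicit in application code, as in this minimal sketch. The parameter name max_non_final_tokens_duration_ms is taken from the summary table on this page. The with_latency_budget helper, the dict-based config, and the value 4000 are illustrative assumptions and not documented recommendations.

```python
def with_latency_budget(config, duration_ms):
    """Return a copy of `config` with the non-final token window set.

    Hypothetical helper; valid ranges and defaults are not stated on this
    page, so only a basic sanity check is applied here.
    """
    is_int = isinstance(duration_ms, int) and not isinstance(duration_ms, bool)
    if not is_int or duration_ms <= 0:
        raise ValueError("duration_ms must be a positive integer")
    updated = dict(config)  # leave the caller's config untouched
    updated["max_non_final_tokens_duration_ms"] = duration_ms
    return updated


base = {"language_hints": ["en"]}
# 4000 is an arbitrary example value, not a recommended setting.
tuned = with_latency_budget(base, 4000)
print(tuned)
```

Keeping the setting in one helper makes it easy to use a larger window for noisy, multi-speaker sessions and a smaller one where fast final tokens matter more.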
Summary
Practice | Benefit |
---|---|
Use language_hints | Guides model toward expected languages |
Provide context | Boosts recognition of domain-specific or uncommon terms |
Set max_non_final_tokens_duration_ms | Improves real-time accuracy for language, speakers, and audio events |
Final note
While Soniox works exceptionally well out of the box, great transcription results often come from great input. By giving the model a bit of the context you already know, you can improve performance, especially in real-time, low-latency scenarios.