Best practices

Overview

Automatically recognizing languages, words, and speakers in real time, with low latency and no prior context, is an extremely difficult problem, even for humans.

As humans, we naturally rely on a rich set of contextual cues when listening and responding:

Who is speaking?
What language are they likely to use?
What's the topic?

These cues dramatically improve our ability to understand speech, especially in noisy or fast-paced environments.

Soniox Speech-to-Text AI lets you give the model the same kind of help: by supplying optional context and configuration, you can improve accuracy, latency, and robustness across a wide range of audio environments and use cases.

Key best practices

Bring context where you can.

Soniox is context-aware. Supplying relevant hints and data helps the model behave more like a human listener.

1. Use `language_hints` for known languages

If you know which languages are likely to be spoken in a session, pass them in the `language_hints` parameter:

{
  "language_hints": ["en", "sl"]
}

Improves accuracy of transcription
Especially useful in real-time mode and for less common languages (e.g., Slovenian, Hungarian)
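
As a rough illustration, the hints can be assembled in application code before the request is sent. The sketch below is Python and assumes the codes are short lowercase strings like the `"en"` and `"sl"` shown above; `with_language_hints` is a hypothetical helper written for this example, not part of any Soniox SDK.

```python
import json


def with_language_hints(config: dict, languages: list[str]) -> dict:
    """Return a copy of config with a normalized language_hints list.

    Hypothetical helper: it trims whitespace, lowercases each code,
    and drops duplicates while keeping the original order.
    """
    hints = []
    for code in languages:
        code = code.strip().lower()
        if code and code not in hints:
            hints.append(code)

    updated = dict(config)
    if hints:
        # Only add the key when there is something to hint at.
        updated["language_hints"] = hints
    return updated


config = with_language_hints({}, ["EN", "sl", "en "])
print(json.dumps(config))  # {"language_hints": ["en", "sl"]}
```

When the languages are not known, the helper leaves the key out, so the request goes without hints.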

2. Customize with the `context` parameter

Provide custom session- or user-specific context whenever possible:

{
  "context": "John Doe, HyperNova Tech, Q3 roadmap, NeuroBridge Project"
}

Helps recognize names, brands, industry-specific terms, and unusual words
Works in both real-time and async modes
Especially valuable for technical, medical, legal, and enterprise use cases
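
In practice the context string is often built from data the application already has, such as participant names or a project glossary. The Python sketch below assumes the comma-separated free-text form used in the example above; `build_context` is a hypothetical helper, and the source does not state any format or length requirements, so treat both as assumptions.

```python
def build_context(*term_groups: list[str]) -> str:
    """Join several lists of terms into one comma-separated context string.

    Hypothetical helper: blank entries are skipped and case-insensitive
    duplicates are dropped, keeping the first spelling seen.
    """
    seen = set()
    terms = []
    for group in term_groups:
        for term in group:
            term = term.strip()
            key = term.casefold()
            if term and key not in seen:
                seen.add(key)
                terms.append(term)
    return ", ".join(terms)


people = ["John Doe"]
organizations = ["HyperNova Tech"]
topics = ["Q3 roadmap", "NeuroBridge Project", "q3 roadmap"]

config = {"context": build_context(people, organizations, topics)}
print(config["context"])
# John Doe, HyperNova Tech, Q3 roadmap, NeuroBridge Project
```

Keeping the terms in separate lists makes it easy to refresh one source (for example, the attendee list) per session without touching the others.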

3. Tune real-time latency for accuracy

Real-time transcription is a balance between latency and accuracy. For best results in complex or noisy audio, set:

{
  "max_non_final_tokens_duration_ms": 6000
}

Allows more time for the model to analyze speech context
Improves performance of speaker diarization
Use this setting if your application can tolerate a slightly longer wait before tokens become final
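
Because the trade-off depends on the application, one option is to apply the setting per use case rather than globally. The Python sketch below uses the 6000 ms value from the example above; `realtime_options` and the two use-case names are hypothetical, and leaving the key out is assumed to fall back to the service default.

```python
import json

# Value taken from the example above.
ACCURACY_FIRST_MS = 6000


def realtime_options(base: dict, can_tolerate_delay: bool) -> dict:
    """Return a copy of base, adding the longer non-final window if allowed.

    Hypothetical helper: when the application cannot tolerate extra delay,
    the parameter is omitted entirely.
    """
    options = dict(base)
    if can_tolerate_delay:
        options["max_non_final_tokens_duration_ms"] = ACCURACY_FIRST_MS
    return options


# Meeting transcription can wait a little longer for final tokens.
meeting_notes = realtime_options({"language_hints": ["en"]}, can_tolerate_delay=True)

# A voice agent needs final tokens quickly, so the key is left unset.
voice_agent = realtime_options({"language_hints": ["en"]}, can_tolerate_delay=False)

print(json.dumps(meeting_notes, indent=2))
print(json.dumps(voice_agent, indent=2))
```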

Summary

Practice	Benefit
Use `language_hints`	Guides model toward expected languages
Provide `context`	Boosts recognition of domain-specific or uncommon terms
Set `max_non_final_tokens_duration_ms`	Improves real-time accuracy of language identification and speaker diarization

Final note

While Soniox works exceptionally well out of the box, great transcription results often come from great input. Giving the model even a little of the context you already know can noticeably improve accuracy, especially in real-time, low-latency scenarios.