Best practices

Key best practices when working with Soniox Speech-to-Text

Overview

Recognizing languages, words, and speakers automatically in real time, with low latency and no prior context, is an extremely difficult problem, even for humans.

As humans, we naturally rely on a rich set of contextual cues when listening and responding:

  • Who is speaking?
  • What language are they likely to use?
  • What's the topic?

These cues dramatically improve our ability to understand speech, especially in noisy or fast-paced environments.

Soniox Speech-to-Text AI lets you give the model the same kind of help. By supplying optional context and configuration, you can improve accuracy, latency, and robustness across a wide range of audio environments and use cases.


Key best practices

Bring context where you can.

Soniox is context-aware. Supplying relevant hints and data helps the model behave more like a human listener.

1. Use language_hints for known languages

If you know what languages are likely to be spoken in a session, pass them in the language_hints parameter:

{
  "language_hints": ["en", "sl"]
}

  • Improves transcription accuracy by guiding the model toward the expected languages
  • Especially useful in real-time mode and for less common languages (e.g., Slovenian, Hungarian)
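
As a minimal sketch of how an application might prepare this parameter, the Python snippet below normalizes a list of language codes and builds the fragment shown above. The helper name build_language_hints is hypothetical, and the lowercasing and de-duplication are assumptions based on the example values ("en", "sl"), not documented requirements. Only the language_hints key comes from this guide.

```python
import json


def build_language_hints(codes):
    """Return a config fragment with cleaned, de-duplicated language hints.

    Hypothetical helper: trimming, lowercasing and de-duplicating are
    assumptions for illustration, not documented Soniox requirements.
    """
    hints = []
    for code in codes:
        normalized = code.strip().lower()
        # Skip empty entries and repeats while preserving the caller's order.
        if normalized and normalized not in hints:
            hints.append(normalized)
    return {"language_hints": hints}


# Example: English and Slovenian are expected in this session.
fragment = build_language_hints(["EN", "sl", "en "])
print(json.dumps(fragment))
```

The resulting dictionary can be merged into whatever configuration object your integration already sends.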

2. Customize with the context parameter

Provide custom session- or user-specific context whenever possible:

{
  "context": "John Doe, HyperNova Tech, Q3 roadmap, NeuroBridge Project"
}

  • Helps recognize names, brands, industry-specific terms, and unusual words
  • Works in both real-time and async modes
  • Especially valuable for technical, medical, legal, and enterprise use cases
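
One possible way to assemble the context value from data you already hold (participant names, company names, agenda items) is sketched below. The build_context helper is hypothetical, and the comma-separated format simply mirrors the example above. This guide does not state a required format or a length limit, so treat both as assumptions to verify.

```python
def build_context(terms):
    """Join session-specific terms into a single context string.

    Hypothetical helper: the ", " separator copies the example in this
    guide and is an assumption, not a documented format.
    """
    unique_terms = []
    for term in terms:
        # Collapse stray whitespace so "John  Doe " and "John Doe" match.
        cleaned = " ".join(term.split())
        if cleaned and cleaned not in unique_terms:
            unique_terms.append(cleaned)
    return {"context": ", ".join(unique_terms)}


# Example: terms gathered from a meeting invite and a CRM record.
session_terms = [
    "John Doe",
    "HyperNova Tech",
    "Q3 roadmap",
    "NeuroBridge Project",
    "John  Doe ",  # duplicate with messy spacing, dropped by the helper
]
print(build_context(session_terms))
```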

3. Tune real-time latency for accuracy

Real-time transcription is a balance between latency and accuracy. For best results in complex or noisy audio, set:

{
  "max_non_final_tokens_duration_ms": 6000
}

  • Gives the model more time to analyze speech context
  • Improves speaker diarization performance
  • Use this setting if your application can tolerate a slight delay before tokens become final
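
The trade-off can be expressed as a small switch in application code. In the sketch below, the helper name apply_latency_preference and the tolerate_delay flag are hypothetical. The value 6000 is the one suggested above. When the application is latency-sensitive, the key is left out so the service default applies. This guide does not state that default, so the snippet does not assume one.

```python
# Value suggested in this guide for complex or noisy audio.
ACCURACY_FIRST_DURATION_MS = 6000


def apply_latency_preference(config, tolerate_delay):
    """Return a copy of config tuned for accuracy or for low latency.

    Hypothetical helper: when tolerate_delay is False the key is removed
    so that the service's own default (not stated in this guide) applies.
    """
    updated = dict(config)  # copy so the caller's dict is not mutated
    if tolerate_delay:
        updated["max_non_final_tokens_duration_ms"] = ACCURACY_FIRST_DURATION_MS
    else:
        updated.pop("max_non_final_tokens_duration_ms", None)
    return updated


base = {"language_hints": ["en", "sl"]}

# A note-taking app can wait a little longer for final tokens.
print(apply_latency_preference(base, tolerate_delay=True))

# A live-captioning app keeps the default behaviour.
print(apply_latency_preference(base, tolerate_delay=False))
```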

Summary

Practice                                Benefit
Use language_hints                      Guides model toward expected languages
Provide context                         Boosts recognition of domain-specific or uncommon terms
Set max_non_final_tokens_duration_ms    Improves real-time accuracy for language and speakers

Final note

While Soniox works exceptionally well out of the box, great transcription results often come from great input. Giving the model even a little of the context you already know can improve performance, especially in real-time, low-latency scenarios.
