Best practices

Key best practices when working with Soniox Speech-to-Text

Overview

Automatically recognizing languages, words, speakers, and audio events in real time—with low latency and no prior context—is an extremely difficult problem, even for humans.

As humans, we naturally rely on a rich set of contextual cues when listening and responding:

  • Who is speaking?
  • What language are they likely to use?
  • What's the topic?

These cues dramatically improve our ability to understand speech, especially in noisy or fast-paced environments.

Soniox Speech-to-Text AI lets you give the model the same kind of cues. By supplying optional context and configuration, you can improve accuracy, latency, and robustness across a wide range of audio environments and use cases.


Key best practices

Bring context where you can.

Soniox is context-aware. Supplying relevant hints and data helps the model behave more like a human listener.

1. Use language_hints for known languages

If you know what languages are likely to be spoken in a session, pass them in the language_hints parameter:

{
  "language_hints": ["en", "sl"]
}

  • Improves transcription accuracy
  • Especially useful in real-time mode and for less common languages (e.g., Slovene, Hungarian)
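As a minimal sketch, the hint list can be assembled from whatever the application already knows about a session. The helper name build_language_hints and its inputs are hypothetical and used only for illustration; only the language_hints key comes from the text above.

```python
import json


def build_language_hints(*codes):
    """Return a de-duplicated, order-preserving list of lowercase language codes."""
    hints = []
    for code in codes:
        if not code:
            continue  # skip unknown or empty values
        normalized = code.strip().lower()
        if normalized and normalized not in hints:
            hints.append(normalized)
    return hints


# Hypothetical inputs: a profile language, a UI locale, and a missing value.
config = {"language_hints": build_language_hints("EN", "sl", "en", None)}
print(json.dumps(config))
```

Normalizing and de-duplicating before sending keeps the hint list short and focused on the languages actually expected.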

2. Customize with the context parameter

Provide custom session- or user-specific context whenever possible:

{
  "context": "John Doe, HyperNova Tech, Q3 roadmap, NeuroBridge Project"
}

  • Helps recognize names, brands, industry-specific terms, and unusual words
  • Works in both real-time and async modes
  • Especially valuable for technical, medical, legal, and enterprise use cases
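One possible way to build the context value is to collect the terms an application already has (participant names, company names, agenda items) and join them. The sketch below assumes the comma-separated layout used in the example above; the helper build_context and the sample session data are hypothetical.

```python
def build_context(terms):
    """Join non-empty, de-duplicated terms into one comma-separated string."""
    cleaned = []
    for term in terms:
        term = term.strip()
        if term and term not in cleaned:
            cleaned.append(term)
    return ", ".join(cleaned)


# Hypothetical session data gathered from a calendar invite and a CRM record.
session_terms = [
    "John Doe",
    "HyperNova Tech",
    "Q3 roadmap",
    "NeuroBridge Project",
    "John Doe",  # duplicates are dropped
    "  ",        # blank entries are ignored
]

config = {"context": build_context(session_terms)}
print(config["context"])
```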

3. Tune real-time latency for accuracy

Real-time transcription is a balance between latency and accuracy. For best results in complex or noisy audio, set:

{
  "max_non_final_tokens_duration_ms": 6000
}

  • Allows more time for the model to analyze speech context
  • Improves performance of speaker diarization
  • Use this setting if your application allows slight delays in final tokens
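The trade-off can be made explicit in application code. This is a sketch under stated assumptions: the function latency_settings and its tolerates_delay flag are hypothetical, and the 6000 value is simply the one suggested above. When the application cannot accept later final tokens, the sketch leaves the parameter out rather than guessing a value.

```python
def latency_settings(tolerates_delay):
    """Return latency-related settings for a real-time session."""
    if tolerates_delay:
        # 6000 ms is the value suggested above for complex or noisy audio.
        return {"max_non_final_tokens_duration_ms": 6000}
    # Otherwise leave the parameter unset so the service's own default applies.
    return {}


# Combine with the other hints discussed in this guide.
config = {
    "language_hints": ["en", "sl"],
    **latency_settings(tolerates_delay=True),
}
print(config)
```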

Summary

Practice                                Benefit
Use language_hints                      Guides model toward expected languages
Provide context                         Boosts recognition of domain-specific or uncommon terms
Set max_non_final_tokens_duration_ms    Improves real-time accuracy for language, speakers, and audio events

Final note

While Soniox works exceptionally well out of the box, great transcription results often come from great input. By giving the model even a little of the context you already have, you can improve results, especially in real-time, low-latency scenarios.
