
Real-time translation

Learn how real-time translation works.

Overview

Soniox Speech-to-Text AI introduces a new kind of translation designed for low-latency applications. Unlike traditional systems that wait until the end of a sentence before producing a translation, Soniox translates mid-sentence, as words are spoken. This enables a completely new experience: you can follow conversations across languages in real time, without delays.


How it works

  • Always transcribes speech: every spoken word is transcribed, regardless of translation settings.
  • Translation: choose between:
    • One-way translation → translate all speech into a single target language.
    • Two-way translation → translate back and forth between two languages.
  • Low latency: translations are streamed in chunks, balancing speed and accuracy.
  • Unified token stream: transcriptions and translations arrive together, labeled for easy handling.

Example

Speaker says:

"Hello everyone, thank you for joining us today."

The token stream unfolds like this:

[Transcription] Hello everyone,
[Translation]   Bonjour à tous,

[Transcription] thank you
[Translation]   merci

[Transcription] for joining us
[Translation]   de nous avoir rejoints

[Transcription] today.
[Translation]   aujourd'hui.

Notice how:

  • Transcription tokens arrive first, as soon as words are recognized.
  • Translation tokens follow, chunk by chunk, without waiting for the full sentence.
  • Developers can display tokens immediately for low-latency transcription and translation.
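The stream above can be handled with a small dispatcher that routes each token by its translation_status field. The sketch below is illustrative only (plain dicts standing in for received tokens), not part of any official SDK:

```python
# Minimal sketch: route incoming tokens into separate transcription and
# translation buffers so both can be rendered as soon as they arrive.
def route_tokens(tokens):
    transcript, translation = [], []
    for token in tokens:
        if token["translation_status"] == "translation":
            translation.append(token["text"])
        else:  # "none" and "original" are both spoken text
            transcript.append(token["text"])
    return "".join(transcript), "".join(translation)

tokens = [
    {"text": "Hello everyone,", "translation_status": "original"},
    {"text": "Bonjour à tous,", "translation_status": "translation"},
    {"text": " thank you", "translation_status": "original"},
    {"text": " merci", "translation_status": "translation"},
]
spoken, translated = route_tokens(tokens)
```

In a real UI you would append each token to the screen as it arrives rather than buffering the whole utterance.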

Translation modes

Soniox provides two translation modes: one-way translation into a single target language, or two-way translation back and forth between two languages.

One-way translation

Translate all spoken languages into a single target language.

Example: translate everything into French

{
  "translation": {
    "type": "one_way",
    "target_language": "fr"
  }
}
  • All speech is transcribed.
  • All speech is translated into French.

Two-way translation

Translate back and forth between two specified languages.

Example: Japanese ⟷ Korean

{
  "translation": {
    "type": "two_way",
    "language_a": "ja",
    "language_b": "ko"
  }
}
  • All speech is transcribed.
  • Japanese speech is translated into Korean.
  • Korean speech is translated into Japanese.
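Both request fragments can be produced from one small helper. The function below is an illustrative sketch (not part of any official SDK); the field names come directly from the two examples above:

```python
import json

def translation_config(mode, *langs):
    """Build the "translation" request fragment for either mode.
    Illustrative helper; field names mirror the documented examples."""
    if mode == "one_way":
        (target,) = langs
        body = {"type": "one_way", "target_language": target}
    elif mode == "two_way":
        lang_a, lang_b = langs
        body = {"type": "two_way", "language_a": lang_a, "language_b": lang_b}
    else:
        raise ValueError(f"unknown translation mode: {mode}")
    return {"translation": body}

print(json.dumps(translation_config("two_way", "ja", "ko"), indent=2))
```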

Token format

Each result (transcription or translation) is returned as a token with clear metadata.

Field                Description
-----                -----------
text                 Token text
translation_status   One of "none" (not translated), "original" (spoken text), or "translation" (translated text)
language             Language of the token
source_language      Original language (only present on translated tokens)
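The token metadata maps naturally onto a small typed structure. The class below is a parsing sketch: the field names are exactly those in the table, while the class itself is illustrative rather than an official SDK type:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Token:
    # Fields mirror the documented token metadata.
    text: str
    translation_status: str                # "none", "original", or "translation"
    language: str
    source_language: Optional[str] = None  # only present on translated tokens

    @classmethod
    def from_json(cls, obj: dict) -> "Token":
        return cls(
            text=obj["text"],
            translation_status=obj["translation_status"],
            language=obj["language"],
            source_language=obj.get("source_language"),
        )

t = Token.from_json({"text": "Gu", "translation_status": "translation",
                     "language": "de", "source_language": "en"})
```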

Example: two-way translation

Two-way translation between English (en) and German (de).

Config

{
  "translation": {
    "type": "two_way",
    "language_a": "en",
    "language_b": "de"
  }
}

Text

[en] Good morning
[de] Guten Morgen

[de] Wie geht’s?
[en] How are you?

[fr] Bonjour à tous
(fr is only transcribed, not translated)

[en] I’m fine, thanks.
[de] Mir geht’s gut, danke.

Tokens

// ===== (1) =====
// Transcription tokens to be translated
{
  "text": "Good",
  "translation_status": "original",
  "language": "en"
}
{
  "text": " morn",
  "translation_status": "original",
  "language": "en"
}
{
  "text": "ing",
  "translation_status": "original",
  "language": "en"
}
// Translation tokens of previous transcription tokens
{
  "text": "Gu",
  "translation_status": "translation",
  "language": "de",
  "source_language": "en"
}
{
  "text": "ten",
  "translation_status": "translation",
  "language": "de",
  "source_language": "en"
}
{
  "text": " Morgen",
  "translation_status": "translation",
  "language": "de",
  "source_language": "en"
}

// ===== (2) =====
// Transcription tokens to be translated
{
  "text": "Wie",
  "translation_status": "original",
  "language": "de"
}
{
  "text": " geht’s?",
  "translation_status": "original",
  "language": "de"
}
// Translation tokens of previous transcription tokens
{
  "text": "How",
  "translation_status": "translation",
  "language": "en",
  "source_language": "de"
}
{
  "text": " are",
  "translation_status": "translation",
  "language": "en",
  "source_language": "de"
}
{
  "text": " you",
  "translation_status": "translation",
  "language": "en",
  "source_language": "de"
}
{
  "text": "?",
  "translation_status": "translation",
  "language": "en",
  "source_language": "de"
}

// ===== (3) =====
// Transcription tokens NOT to be translated
{
  "text": "Bon",
  "translation_status": "none",
  "language": "fr"
}
{
  "text": "jour",
  "translation_status": "none",
  "language": "fr"
}
{
  "text": " à",
  "translation_status": "none",
  "language": "fr"
}
{
  "text": " tous",
  "translation_status": "none",
  "language": "fr"
}

// ===== (4) =====
// Transcription tokens to be translated
{
  "text": "I’m",
  "translation_status": "original",
  "language": "en"
}
{
  "text": " fine,",
  "translation_status": "original",
  "language": "en"
}
{
  "text": " thanks.",
  "translation_status": "original",
  "language": "en"
}
// Translation tokens of previous transcription tokens
{
  "text": "Mir",
  "translation_status": "translation",
  "language": "de",
  "source_language": "en"
}
{
  "text": " geht’s",
  "translation_status": "translation",
  "language": "de",
  "source_language": "en"
}
{
  "text": " gut,",
  "translation_status": "translation",
  "language": "de",
  "source_language": "en"
}
{
  "text": " dan",
  "translation_status": "translation",
  "language": "de",
  "source_language": "en"
}
{
  "text": "ke.",
  "translation_status": "translation",
  "language": "de",
  "source_language": "en"
}

Transcription chunks and their translation chunks follow each other in the stream, but individual tokens are not mapped 1-to-1 and may not align.
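Because tokens are not mapped 1-to-1, a practical display strategy is to merge consecutive tokens that share the same translation_status and language into chunks. A sketch (token dicts are illustrative samples in the documented format):

```python
def group_chunks(tokens):
    """Merge consecutive tokens with the same status and language
    into display chunks. Illustrative only."""
    chunks = []
    for token in tokens:
        key = (token["translation_status"], token["language"])
        if chunks and chunks[-1][0] == key:
            chunks[-1] = (key, chunks[-1][1] + token["text"])
        else:
            chunks.append((key, token["text"]))
    return chunks

tokens = [
    {"text": "Wie", "translation_status": "original", "language": "de"},
    {"text": " geht's?", "translation_status": "original", "language": "de"},
    {"text": "How", "translation_status": "translation", "language": "en"},
    {"text": " are", "translation_status": "translation", "language": "en"},
    {"text": " you?", "translation_status": "translation", "language": "en"},
]
chunks = group_chunks(tokens)
```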


Supported languages

All language pairs are supported: you can translate between any two supported languages.


Timestamps

  • Spoken tokens (translation_status: "none" or "original") include timestamps (start_ms, end_ms) that align to the exact position in the audio.
  • Translated tokens do not include timestamps, since they are generated immediately after the spoken tokens and directly follow their timing.

This way, you can always align transcripts to the original audio, while translations stream naturally in sequence.
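One consequence is that audio alignment should be computed over spoken tokens only. The sketch below extracts time-aligned words and skips translated tokens, which carry no timestamps; start_ms and end_ms are the documented field names, everything else is illustrative:

```python
def aligned_words(tokens):
    """Return (text, start_ms, end_ms) for spoken tokens only;
    translated tokens have no timestamps and are skipped."""
    return [
        (t["text"], t["start_ms"], t["end_ms"])
        for t in tokens
        if t["translation_status"] in ("none", "original")
    ]

tokens = [
    {"text": "Good", "translation_status": "original",
     "language": "en", "start_ms": 0, "end_ms": 300},
    {"text": " morning", "translation_status": "original",
     "language": "en", "start_ms": 300, "end_ms": 800},
    {"text": "Guten Morgen", "translation_status": "translation",
     "language": "de", "source_language": "en"},
]
words = aligned_words(tokens)
```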


Code example

Prerequisite: Complete the steps in Get Started.