Speech-to-text translation

Overview

Soniox speech-to-text translation turns spoken audio into a transcript plus its translation, both delivered as a single token stream. The same translation config block and token format work on both real-time and async delivery.

Want translated spoken output instead of text? See Speech-to-speech translation.

Translation modes

Soniox supports two translation modes: one-way and two-way. Both use the same translation config block for real-time and async delivery.

One-way translation

Translate detected speech into a single target language.

{
  "translation": {
    "type": "one_way",
    "target_language": "fr"
  }
}

Use this for live captions, multilingual meetings, broadcasts, lectures, events, customer calls, and other workflows where every speaker's words should be displayed in one language.

Two-way translation

Translate back and forth between two specified languages. Each side speaks naturally; your application can display the other side's translated text.

{
  "translation": {
    "type": "two_way",
    "language_a": "ja",
    "language_b": "ko"
  }
}

Use this for bilingual conversations, customer support, travel assistants, and voice agents that need translated text.

See Supported languages for the language list and coverage notes.
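The two config blocks above can also be built programmatically. The following is a minimal sketch: the helper names `one_way` and `two_way` are hypothetical (they are not part of any Soniox SDK), and the check that the two languages differ is a local sanity check, not documented API behavior.

```python
import json


def one_way(target_language: str) -> dict:
    """Return a one-way translation config block."""
    return {"translation": {"type": "one_way", "target_language": target_language}}


def two_way(language_a: str, language_b: str) -> dict:
    """Return a two-way translation config block."""
    # Local sanity check only (an assumption, not documented API behavior).
    if language_a == language_b:
        raise ValueError("language_a and language_b should differ")
    return {
        "translation": {
            "type": "two_way",
            "language_a": language_a,
            "language_b": language_b,
        }
    }


# Reproduce the two example blocks shown above.
print(json.dumps(one_way("fr"), indent=2))
print(json.dumps(two_way("ja", "ko"), indent=2))
```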

Context and translation terms

Use context.translation_terms to control how specific words or phrases are translated. This is useful for:

Technical terminology.
Entity names.
Words with ambiguous domain-specific translations.
Idioms and figurative speech with non-literal meaning.

Example: English → Spanish translation

{
  "context": {
    "translation_terms": [
      { "source": "Mr. Smith", "target": "Sr. Smith" },
      { "source": "MRI", "target": "RM" },
      { "source": "St John's", "target": "St John's" },
      { "source": "stroke", "target": "ictus" }
    ]
  }
}

You can combine context with either one_way or two_way translation configuration in the same request.
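As a rough illustration of combining the two blocks, the sketch below merges a translation block and a list of term pairs into one config object. The function name `build_config` is hypothetical, and any other fields a real request may need (for example model or audio settings) are deliberately left out because this page does not specify them.

```python
import json


def build_config(translation: dict, terms: list) -> dict:
    """Merge a translation block with optional translation terms.

    `terms` is a list of (source, target) pairs.
    """
    config = {"translation": translation}
    if terms:
        # Only add the context block when there is at least one term.
        config["context"] = {
            "translation_terms": [{"source": s, "target": t} for s, t in terms]
        }
    return config


config = build_config(
    {"type": "one_way", "target_language": "es"},
    [("MRI", "RM"), ("stroke", "ictus")],
)
print(json.dumps(config, indent=2))
```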

Token format

Each translation result is returned as a token with clear metadata. The same shape is used for real-time and async delivery.

Field	Description
`text`	Token text
`translation_status`	`"none"` (spoken text that is not translated), `"original"` (spoken text that is translated), or `"translation"` (translated text)
`language`	Language of the token
`source_language`	Language of the original speech (present only on translation tokens)
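One way to consume these fields is to route each token by its `translation_status`. The sketch below assumes tokens arrive as plain dictionaries with the fields listed in the table; the `Token` type and the `split_lanes` helper are illustrative names, not part of the API.

```python
from typing import Iterable, Tuple, TypedDict


class Token(TypedDict, total=False):
    text: str
    translation_status: str  # "none", "original", or "translation"
    language: str
    source_language: str  # present only on translation tokens


def split_lanes(tokens: Iterable[Token]) -> Tuple[str, str]:
    """Return (spoken_text, translated_text) for a token sequence."""
    spoken, translated = [], []
    for tok in tokens:
        if tok["translation_status"] == "translation":
            translated.append(tok["text"])
        else:
            # "none" and "original" are both spoken (transcribed) text.
            spoken.append(tok["text"])
    return "".join(spoken), "".join(translated)


sample = [
    {"text": "Good", "translation_status": "original", "language": "en"},
    {"text": " morn", "translation_status": "original", "language": "en"},
    {"text": "ing", "translation_status": "original", "language": "en"},
    {"text": "Gu", "translation_status": "translation", "language": "de", "source_language": "en"},
    {"text": "ten", "translation_status": "translation", "language": "de", "source_language": "en"},
    {"text": " Morgen", "translation_status": "translation", "language": "de", "source_language": "en"},
]
print(split_lanes(sample))
```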

Example: two-way translation

Two-way translation between English (en) and German (de).

Config

{
  "translation": {
    "type": "two_way",
    "language_a": "en",
    "language_b": "de"
  }
}

Text

[en] Good morning
[de] Guten Morgen

[de] Wie geht’s?
[en] How are you?

[fr] Bonjour à tous
(fr is only transcribed, not translated)

[en] I’m fine, thanks.
[de] Mir geht’s gut, danke.

Full token stream (JSON)

// ===== (1) =====
// Transcription tokens to be translated
{ "text": "Good",    "translation_status": "original", "language": "en" }
{ "text": " morn",   "translation_status": "original", "language": "en" }
{ "text": "ing",     "translation_status": "original", "language": "en" }
// Translation tokens of previous transcription tokens
{ "text": "Gu",      "translation_status": "translation", "language": "de", "source_language": "en" }
{ "text": "ten",     "translation_status": "translation", "language": "de", "source_language": "en" }
{ "text": " Morgen", "translation_status": "translation", "language": "de", "source_language": "en" }

// ===== (2) =====
// Transcription tokens to be translated
{ "text": "Wie",      "translation_status": "original", "language": "de" }
{ "text": " geht’s?", "translation_status": "original", "language": "de" }
// Translation tokens of previous transcription tokens
{ "text": "How",      "translation_status": "translation", "language": "en", "source_language": "de" }
{ "text": " are",     "translation_status": "translation", "language": "en", "source_language": "de" }
{ "text": " you",     "translation_status": "translation", "language": "en", "source_language": "de" }
{ "text": "?",        "translation_status": "translation", "language": "en", "source_language": "de" }

// ===== (3) =====
// Transcription tokens NOT to be translated (fr is outside the configured pair)
{ "text": "Bon",   "translation_status": "none", "language": "fr" }
{ "text": "jour",  "translation_status": "none", "language": "fr" }
{ "text": " à",    "translation_status": "none", "language": "fr" }
{ "text": " tous", "translation_status": "none", "language": "fr" }

// ===== (4) =====
// Transcription tokens to be translated
{ "text": "I’m",      "translation_status": "original", "language": "en" }
{ "text": " fine,",   "translation_status": "original", "language": "en" }
{ "text": " thanks.", "translation_status": "original", "language": "en" }
// Translation tokens of previous transcription tokens
{ "text": "Mir",      "translation_status": "translation", "language": "de", "source_language": "en" }
{ "text": " geht’s",  "translation_status": "translation", "language": "de", "source_language": "en" }
{ "text": " gut",     "translation_status": "translation", "language": "de", "source_language": "en" }
{ "text": " dan",     "translation_status": "translation", "language": "de", "source_language": "en" }
{ "text": "ke.",      "translation_status": "translation", "language": "de", "source_language": "en" }

Each chunk of transcription tokens is followed by its translation tokens, but the two are not mapped one-to-one and their token boundaries may not align.
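Because tokens do not map one-to-one, it is usually easier to work with whole chunks. A minimal sketch, assuming tokens are dictionaries shaped like the ones above: consecutive tokens with the same `translation_status` and `language` are joined into one text chunk. The function name `to_chunks` is hypothetical.

```python
from itertools import groupby


def to_chunks(tokens):
    """Join consecutive tokens sharing status and language into text chunks."""
    def key(tok):
        return (tok["translation_status"], tok["language"])

    return [
        (status, language, "".join(tok["text"] for tok in group))
        for (status, language), group in groupby(tokens, key=key)
    ]


stream = [
    {"text": "Good", "translation_status": "original", "language": "en"},
    {"text": " morn", "translation_status": "original", "language": "en"},
    {"text": "ing", "translation_status": "original", "language": "en"},
    {"text": "Gu", "translation_status": "translation", "language": "de", "source_language": "en"},
    {"text": "ten", "translation_status": "translation", "language": "de", "source_language": "en"},
    {"text": " Morgen", "translation_status": "translation", "language": "de", "source_language": "en"},
    {"text": "Bon", "translation_status": "none", "language": "fr"},
    {"text": "jour", "translation_status": "none", "language": "fr"},
]

for chunk in to_chunks(stream):
    print(chunk)
```

Grouping on both fields keeps an untranslated chunk (status `"none"`) separate from the translated pairs around it.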

Timestamps

Spoken tokens (translation_status: "none" or "original") include timestamps (start_ms, end_ms) that align to the exact position in the audio.
Translated tokens do not include timestamps. They are generated after their spoken tokens and follow the same sequence.

This way you can align transcripts to the original audio, while translations stream naturally in sequence.
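To illustrate, the sketch below builds caption-style cues: each cue takes its time span from the spoken tokens and collects the translation tokens that follow. The `build_cues` helper and the timestamp values in the sample are made up for illustration; only the field names `start_ms` and `end_ms` come from this page.

```python
def build_cues(tokens):
    """Pair each spoken chunk (with its audio span) with its translation."""
    cues = []
    current = None
    for tok in tokens:
        if tok["translation_status"] == "translation":
            # Translated tokens carry no timestamps; attach to the open cue.
            if current is not None:
                current["translation"] += tok["text"]
            continue
        # Spoken token: open a new cue when the previous one already has a
        # translation or when the spoken language changes.
        if current is None or current["translation"] or current["language"] != tok["language"]:
            current = {
                "start_ms": tok["start_ms"],
                "end_ms": tok["end_ms"],
                "language": tok["language"],
                "original": "",
                "translation": "",
            }
            cues.append(current)
        current["original"] += tok["text"]
        current["end_ms"] = tok["end_ms"]
    return cues


# Timestamps below are illustrative values, not real API output.
timed = [
    {"text": "Good", "translation_status": "original", "language": "en", "start_ms": 0, "end_ms": 200},
    {"text": " morn", "translation_status": "original", "language": "en", "start_ms": 200, "end_ms": 450},
    {"text": "ing", "translation_status": "original", "language": "en", "start_ms": 450, "end_ms": 600},
    {"text": "Gu", "translation_status": "translation", "language": "de", "source_language": "en"},
    {"text": "ten", "translation_status": "translation", "language": "de", "source_language": "en"},
    {"text": " Morgen", "translation_status": "translation", "language": "de", "source_language": "en"},
    {"text": "Wie", "translation_status": "original", "language": "de", "start_ms": 900, "end_ms": 1100},
    {"text": " geht's?", "translation_status": "original", "language": "de", "start_ms": 1100, "end_ms": 1500},
    {"text": "How", "translation_status": "translation", "language": "en", "source_language": "de"},
    {"text": " are", "translation_status": "translation", "language": "en", "source_language": "de"},
    {"text": " you", "translation_status": "translation", "language": "en", "source_language": "de"},
    {"text": "?", "translation_status": "translation", "language": "en", "source_language": "de"},
]

for cue in build_cues(timed):
    print(cue["start_ms"], cue["end_ms"], cue["original"], "->", cue["translation"])
```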

Pick a delivery mode

Translation uses the same config block and token format in both delivery modes. Choose based on whether your audio is live or recorded and on your latency requirements.

Real-time speech-to-text translation

Live captions and translated text over a WebSocket, with mid-sentence streaming.

Async speech-to-text translation

Translate recorded audio files (URL or upload) in a single API call. No live connection required.
