Speech-to-text translation
Soniox speech-to-text translation, including translation modes, token format, and how to pick between real-time and async delivery.
Overview
Soniox speech-to-text translation turns spoken audio into a transcript plus
its translation, both delivered as a single token stream. The same
translation config block and token format work on both real-time and async
delivery.
Want translated spoken output instead of text? See Speech-to-speech translation.
Translation modes
Soniox supports two translation modes: one-way and two-way. Both use the same
translation config block for real-time and async delivery.
One-way translation
Translate detected speech into a single target language.
Use this for live captions, multilingual meetings, broadcasts, lectures, events, customer calls, and other workflows where every speaker's words should be readable in one language.
Two-way translation
Translate back and forth between two specified languages. Each side speaks naturally; your application can display the other side's translated text.
Use this for bilingual conversations, customer support, travel assistants, and voice agents that need translated text.
See Supported languages for the language list and coverage notes.
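As a rough illustration of the two modes, the sketch below builds one translation config block for each. Only the mode names one_way and two_way come from this page; the key names type, target_language, language_a, and language_b, and the helper functions themselves, are assumptions made for the example and should be checked against the API reference.

```python
import json


def one_way_config(target_language):
    """Translate all detected speech into one target language.

    The keys "type" and "target_language" are assumed names.
    """
    return {"type": "one_way", "target_language": target_language}


def two_way_config(language_a, language_b):
    """Translate back and forth between two specified languages.

    The keys "language_a" and "language_b" are assumed names.
    """
    return {"type": "two_way", "language_a": language_a, "language_b": language_b}


if __name__ == "__main__":
    print(json.dumps({"translation": one_way_config("es")}, indent=2))
    print(json.dumps({"translation": two_way_config("en", "de")}, indent=2))
```

Because the same block is used for both delivery modes, a helper like this can be shared between a real-time client and an async job submitter.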
Context and translation terms
Use context.translation_terms to control how specific words or phrases are
translated. This is useful for:
- Technical terminology.
- Entity names.
- Words with ambiguous domain-specific translations.
- Idioms and figurative speech with non-literal meaning.
Example: English → Spanish translation
You can combine context with either one_way or two_way translation
configuration in the same request.
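A minimal sketch of a request that combines translation terms with one-way English to Spanish translation might look like the following. The path context.translation_terms and the mode name one_way come from this page; the source/target shape of each term, the translation key names, and the sample terms are assumptions for illustration only.

```python
import json

# Hypothetical request config: key names other than "context",
# "translation_terms", and the "one_way" mode are assumed.
request_config = {
    "translation": {"type": "one_way", "target_language": "es"},
    "context": {
        "translation_terms": [
            # Keep a product-style term untranslated (illustrative).
            {"source": "pull request", "target": "pull request"},
            # Pin a domain-specific translation (illustrative).
            {"source": "machine learning", "target": "aprendizaje automático"},
        ]
    },
}

if __name__ == "__main__":
    # ensure_ascii=False keeps accented characters readable in the output.
    print(json.dumps(request_config, indent=2, ensure_ascii=False))
```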
Token format
Each translation result is returned as a token with metadata that identifies its language and whether it is spoken or translated text. The same shape is used for real-time and async delivery.
| Field | Description |
|---|---|
| text | Token text |
| translation_status | "none" (not translated), "original" (spoken text), or "translation" (translated text) |
| language | Language of the token |
| source_language | Original language (only for translated tokens) |
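To show how these fields can be used, here is a small sketch that separates a token stream into the spoken transcript and its translation by checking translation_status. The field names are the ones in the table above; the sample tokens and the split_transcript helper are hypothetical, and the sketch assumes each token's text already carries its own leading whitespace.

```python
def split_transcript(tokens):
    """Return (spoken_text, translated_text) from a list of token dicts."""
    spoken, translated = [], []
    for token in tokens:
        if token["translation_status"] == "translation":
            translated.append(token["text"])
        else:
            # "none" and "original" both mark spoken text.
            spoken.append(token["text"])
    return "".join(spoken), "".join(translated)


# Hypothetical sample tokens, not real API output.
tokens = [
    {"text": "Hello", "translation_status": "original", "language": "en"},
    {"text": " world", "translation_status": "original", "language": "en"},
    {"text": "Hola", "translation_status": "translation",
     "language": "es", "source_language": "en"},
    {"text": " mundo", "translation_status": "translation",
     "language": "es", "source_language": "en"},
]

if __name__ == "__main__":
    print(split_transcript(tokens))
```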
Example: two-way translation
Two-way translation between English (en) and German (de).
Config
Text
Transcription and translation chunks follow each other, but tokens are not 1-to-1 mapped and may not align.
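Since tokens are not mapped 1-to-1, one way to display a two-way conversation is to group consecutive tokens into chunks rather than pairing individual tokens. The sketch below does this with itertools.groupby, keyed on translation_status and language. The sample tokens and the group_chunks helper are hypothetical.

```python
from itertools import groupby


def group_chunks(tokens):
    """Merge consecutive tokens that share status and language into chunks."""
    def key(token):
        return token["translation_status"], token["language"]

    return [
        {
            "status": status,
            "language": language,
            "text": "".join(t["text"] for t in run),
        }
        for (status, language), run in groupby(tokens, key=key)
    ]


# Hypothetical two-way (en/de) sample, not real API output.
sample_tokens = [
    {"text": "Good", "translation_status": "original", "language": "en"},
    {"text": " morning", "translation_status": "original", "language": "en"},
    {"text": "Guten", "translation_status": "translation",
     "language": "de", "source_language": "en"},
    {"text": " Morgen", "translation_status": "translation",
     "language": "de", "source_language": "en"},
    {"text": "Danke", "translation_status": "original", "language": "de"},
    {"text": "Thank", "translation_status": "translation",
     "language": "en", "source_language": "de"},
    {"text": " you", "translation_status": "translation",
     "language": "en", "source_language": "de"},
]

if __name__ == "__main__":
    for chunk in group_chunks(sample_tokens):
        print(f'[{chunk["status"]}/{chunk["language"]}] {chunk["text"]}')
```

Grouping by run keeps each spoken chunk next to the translated chunk that follows it, without assuming the two contain the same number of tokens.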
Timestamps
- Spoken tokens (translation_status: "none" or "original") include timestamps (start_ms, end_ms) that align to the exact position in the audio.
- Translated tokens do not include timestamps. They are generated after their spoken tokens and follow the same sequence.
This way you can align transcripts to the original audio, while translations stream naturally in sequence.
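The timestamp rule can be handled with a small guard, sketched below: the helper computes the audio span covered by spoken tokens and skips any token without start_ms and end_ms. It assumes the timestamp fields are simply absent (or null) on translated tokens; the sample values and the spoken_span helper are hypothetical.

```python
def spoken_span(tokens):
    """Return (start_ms, end_ms) covering all timestamped tokens, or None."""
    timed = [
        t for t in tokens
        if t.get("start_ms") is not None and t.get("end_ms") is not None
    ]
    if not timed:
        return None
    return min(t["start_ms"] for t in timed), max(t["end_ms"] for t in timed)


# Hypothetical sample: spoken tokens carry timestamps, translated ones do not.
timed_tokens = [
    {"text": "Hello", "translation_status": "original",
     "language": "en", "start_ms": 120, "end_ms": 480},
    {"text": " world", "translation_status": "original",
     "language": "en", "start_ms": 500, "end_ms": 900},
    {"text": "Hola", "translation_status": "translation",
     "language": "es", "source_language": "en"},
    {"text": " mundo", "translation_status": "translation",
     "language": "es", "source_language": "en"},
]

if __name__ == "__main__":
    print(spoken_span(timed_tokens))
```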
Pick a delivery mode
Translation uses the same config block and token format in both delivery modes. Pick by the shape of your audio and your latency requirements.