Real-time translation
Learn how real-time translation works.
Overview
Soniox Speech-to-Text AI provides translation designed for low-latency applications. Unlike traditional systems that wait until the end of a sentence before producing a translation, Soniox translates mid-sentence, as words are spoken. This lets you follow conversations across languages in real time, without waiting for each sentence to finish.
How it works
- Always transcribes speech: every spoken word is transcribed, regardless of translation settings.
- Translation: choose between:
  - One-way translation → translate all speech into a single target language.
  - Two-way translation → translate back and forth between two languages.
- Low latency: translations are streamed in chunks, balancing speed and accuracy.
- Unified token stream: transcriptions and translations arrive together, labeled for easy handling.
Example
Speaker says:
The token stream unfolds like this:
Notice how:
- Transcription tokens arrive first, as soon as words are recognized.
- Translation tokens follow, chunk by chunk, without waiting for the full sentence.
- Developers can display tokens as soon as they arrive for low-latency transcription and translation.
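The handling described above can be sketched in a few lines of Python. This is a minimal illustration, not the official client: the function name `split_stream` and the sample tokens are hypothetical, and it assumes each token is a dict carrying the `text` and `translation_status` fields described under Token format, with any leading space already included in `text`.

```python
from typing import Iterable

def split_stream(tokens: Iterable[dict]) -> tuple[str, str]:
    """Build (transcript, translation) text from one mixed token stream."""
    transcript_parts: list[str] = []
    translation_parts: list[str] = []
    for token in tokens:
        if token.get("translation_status") == "translation":
            translation_parts.append(token["text"])
        else:
            # "none" and "original" both mark spoken (transcribed) text.
            transcript_parts.append(token["text"])
    # Assumption: token text already carries its own spacing.
    return "".join(transcript_parts), "".join(translation_parts)

# Hypothetical tokens, made up for illustration only.
sample = [
    {"text": "Good", "translation_status": "original", "language": "en"},
    {"text": " morning", "translation_status": "original", "language": "en"},
    {"text": "Guten", "translation_status": "translation",
     "language": "de", "source_language": "en"},
    {"text": " Morgen", "translation_status": "translation",
     "language": "de", "source_language": "en"},
]

transcript, translation = split_stream(sample)
print(transcript)
print(translation)
```

In a live application the same branching would run once per incoming token, so each pane of the display can be updated immediately instead of after the whole stream.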
Translation modes
Soniox provides two translation modes: translate all speech into a single target language, or enable seamless two-way conversations between languages.
One-way translation
Translate all spoken languages into a single target language.
Example: translate everything into French
- All speech is transcribed.
- All speech is translated into French.
Two-way translation
Translate back and forth between two specified languages.
Example: Japanese ⟷ Korean
- All speech is transcribed.
- Japanese speech is translated into Korean.
- Korean speech is translated into Japanese.
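The routing rules for the two modes can be written down as a small function. This is a sketch of the logic only: the `OneWay` and `TwoWay` classes and the `translation_target` function are hypothetical names introduced for illustration and are not the Soniox configuration format. It also assumes that, in two-way mode, speech in a language outside the configured pair is left untranslated.

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass(frozen=True)
class OneWay:
    target_language: str

@dataclass(frozen=True)
class TwoWay:
    language_a: str
    language_b: str

def translation_target(spoken_language: str,
                       mode: Union[OneWay, TwoWay]) -> Optional[str]:
    """Return the language a spoken token would be translated into."""
    if isinstance(mode, OneWay):
        # One-way: all speech goes to the single target language.
        return mode.target_language
    if spoken_language == mode.language_a:
        return mode.language_b
    if spoken_language == mode.language_b:
        return mode.language_a
    # Assumption: speech outside the pair is transcribed but not translated.
    return None

print(translation_target("en", OneWay("fr")))
print(translation_target("ja", TwoWay("ja", "ko")))
print(translation_target("ko", TwoWay("ja", "ko")))
```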
Token format
Each result (transcription or translation) is returned as a token with clear metadata.
Field | Description |
---|---|
text | Token text |
translation_status | "none" (not translated), "original" (spoken text), or "translation" (translated text) |
language | Language of the token |
source_language | Original language (only for translated tokens) |
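On the client side, the fields in the table map naturally onto a small typed record. The sketch below is one possible shape, not part of any SDK: `Token` and `parse_token` are hypothetical names, and treating a missing `translation_status` as "none" is an assumption made here for robustness.

```python
from dataclasses import dataclass
from typing import Optional

VALID_STATUSES = {"none", "original", "translation"}

@dataclass(frozen=True)
class Token:
    text: str
    translation_status: str
    language: Optional[str] = None
    source_language: Optional[str] = None  # only set on translated tokens

    @property
    def is_translation(self) -> bool:
        return self.translation_status == "translation"

def parse_token(raw: dict) -> Token:
    """Validate a raw token dict and convert it to a Token."""
    # Assumption: a token without a status is treated as not translated.
    status = raw.get("translation_status", "none")
    if status not in VALID_STATUSES:
        raise ValueError(f"unexpected translation_status: {status!r}")
    return Token(
        text=raw["text"],
        translation_status=status,
        language=raw.get("language"),
        source_language=raw.get("source_language"),
    )

# Hypothetical token, made up for illustration only.
token = parse_token({
    "text": "Hallo",
    "translation_status": "translation",
    "language": "de",
    "source_language": "en",
})
print(token.is_translation, token.language, token.source_language)
```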
Example: two-way translation
Two-way translation between English (en) and German (de).
Config
Text
Tokens
Transcription and translation chunks follow each other, but tokens are not 1-to-1 mapped and may not align.
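Because tokens are not mapped 1-to-1, it is usually easier to work with runs of consecutive tokens than with individual pairs. The following sketch groups a stream into alternating spoken and translated chunks; `chunk_tokens` and the sample data are hypothetical, and the token dicts are assumed to carry the `text` and `translation_status` fields from the table above.

```python
from itertools import groupby

def chunk_tokens(tokens):
    """Group consecutive tokens into ("spoken" | "translation", text) runs."""
    chunks = []
    runs = groupby(
        tokens,
        key=lambda t: t.get("translation_status") == "translation",
    )
    for is_translation, run in runs:
        text = "".join(t["text"] for t in run)
        chunks.append(("translation" if is_translation else "spoken", text))
    return chunks

# Hypothetical tokens: three spoken tokens are followed by two translated ones,
# showing that the counts on each side need not match.
sample_tokens = [
    {"text": "How", "translation_status": "original"},
    {"text": " are", "translation_status": "original"},
    {"text": " you", "translation_status": "original"},
    {"text": "Wie geht", "translation_status": "translation"},
    {"text": " es dir", "translation_status": "translation"},
    {"text": " today", "translation_status": "original"},
    {"text": " heute", "translation_status": "translation"},
]

chunks = chunk_tokens(sample_tokens)
for kind, text in chunks:
    print(kind, repr(text))
```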
Supported languages
All pairs are supported: you can translate between any two supported languages.
Timestamps
- Spoken tokens (translation_status: "none" or "original") include timestamps (start_ms, end_ms) that align to the exact position in the audio.
- Translated tokens do not include timestamps, since they are generated immediately after the spoken tokens and directly follow their timing.
This way, you can always align transcripts to the original audio, while translations stream naturally in sequence.
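If an application needs an approximate time for translated text (for example, to place subtitles), one option is to anchor each translated token to the end of the most recent spoken token. The sketch below shows that idea; `anchor_times` is a hypothetical helper, the sample values are invented, and the anchoring rule is a design choice made here rather than something the API provides.

```python
def anchor_times(tokens):
    """Return (text, anchor_ms) pairs for a mixed token stream.

    Spoken tokens use their own start_ms. Translated tokens carry no
    timestamps, so they inherit the end_ms of the latest spoken token.
    """
    anchored = []
    last_spoken_end_ms = None  # stays None until a spoken token is seen
    for token in tokens:
        if token.get("translation_status") == "translation":
            anchored.append((token["text"], last_spoken_end_ms))
        else:
            anchored.append((token["text"], token["start_ms"]))
            last_spoken_end_ms = token["end_ms"]
    return anchored

# Hypothetical tokens, made up for illustration only.
timed_tokens = [
    {"text": "Hello", "translation_status": "original",
     "start_ms": 0, "end_ms": 400},
    {"text": " there", "translation_status": "original",
     "start_ms": 400, "end_ms": 800},
    {"text": "Hallo", "translation_status": "translation"},
    {"text": " zusammen", "translation_status": "translation"},
]

anchored = anchor_times(timed_tokens)
for text, anchor_ms in anchored:
    print(repr(text), anchor_ms)
```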
Code example
Prerequisite: Complete the steps in Get Started.