Real-time translation
Learn how real-time translation works.
Overview
Soniox Speech-to-Text AI supports real-time speech translation with high accuracy and ultra-low latency. As audio is streamed, Soniox can transcribe spoken language and translate it into another language in real time, with both the transcription and translation returned in a single unified stream.
The translation system supports two modes:
- One-way translation: Translate from one or more source languages into a single target language.
- Two-way translation: Translate bi-directionally between two specific languages, which is ideal for conversational use cases.
How it works
Soniox transcribes and optionally translates speech in real time. Both transcription and translation are returned in the same unified token stream.
General behavior
- All speech is transcribed
  Transcription always occurs for all spoken audio, regardless of translation configuration.
- Only configured languages are translated
  - One-way translation: Translation output is always in target_language. Translation is only applied to languages specified in source_languages. Use "*" to translate all languages (only supported when translating to English). exclude_source_languages can be used to exclude specific languages from translation.
  - Two-way translation: The system translates between language_a and language_b. Other languages are not translated.
- Translations are streamed in real time
  Translations are returned in variable-sized chunks, based on model-determined context windows. This balances translation quality with latency.
- Unified stream of tokens
  Transcribed and translated tokens are returned in the same stream. Each token includes:
  - translation_status:
    - "none": token will not be translated
    - "original": token is original speech
    - "translation": token is translated output
  - language: Language of the token
  - source_language: Present only for translated tokens
One-way translation
Use one-way translation to convert speech from specific source languages into a single target language.
Example 1: Translate all languages to English
- source_languages: ["*"] is required when the target language is English and is not allowed for any other target. You can't specify individual source_languages in this case.
Example 2: Translate all languages to English, excluding Spanish and Portuguese
- exclude_source_languages is only allowed when source_languages is ["*"].
Example 3: Translate English to German
- source_languages: ["*"] is not allowed for non-English targets. You must specify all desired source_languages individually.
Example 4: Translate Chinese and Japanese to Korean
- You can specify multiple source_languages for non-English targets. Refer to the Supported translation pairs section below.
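The rules attached to the four examples above can be collected into one small helper. This is a minimal sketch, not the official request schema: the field names target_language, source_languages, and exclude_source_languages come from this page, while the "type" key, the helper name one_way_config, and the two-letter language codes ("en", "de", and so on) are assumptions made for illustration.

```python
from typing import Optional


def one_way_config(
    target_language: str,
    source_languages: list[str],
    exclude_source_languages: Optional[list[str]] = None,
) -> dict:
    """Build a one-way translation config, enforcing the rules described above."""
    wildcard = source_languages == ["*"]

    # English target: the wildcard is mandatory, individual sources are rejected.
    if target_language == "en" and not wildcard:
        raise ValueError('an English target requires source_languages == ["*"]')
    # Non-English target: the wildcard is rejected, sources must be listed.
    if target_language != "en" and "*" in source_languages:
        raise ValueError('"*" is only allowed when the target language is English')
    # Exclusions only make sense together with the wildcard.
    if exclude_source_languages and not wildcard:
        raise ValueError('exclude_source_languages requires source_languages == ["*"]')

    config = {
        "type": "one_way",  # assumed discriminator, not taken from this page
        "target_language": target_language,
        "source_languages": source_languages,
    }
    if exclude_source_languages:
        config["exclude_source_languages"] = exclude_source_languages
    return config


# The four examples above, expressed with the hypothetical helper.
example_1 = one_way_config("en", ["*"])
example_2 = one_way_config("en", ["*"], exclude_source_languages=["es", "pt"])
example_3 = one_way_config("de", ["en"])
example_4 = one_way_config("ko", ["zh", "ja"])
```

Validating on the client side like this only mirrors the constraints stated here; the service remains the authority on which configurations it accepts.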
Two-way translation
Two-way translation enables bi-directional translation between two specified languages. This is ideal for live conversations between speakers of different languages.
Example 1: English ⟷ Spanish:
- Speech in English will be translated to Spanish.
- Speech in Spanish will be translated to English.
- Other languages will be transcribed but not translated.
- The order of language_a and language_b does not matter.
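The two-way behavior listed above can be modeled in a few lines. This is a sketch under assumptions: language_a and language_b are the parameter names used on this page, but the "type" key, the function names, and the two-letter language codes are hypothetical.

```python
from typing import Optional


def two_way_config(language_a: str, language_b: str) -> dict:
    """Build a two-way translation config from the two conversation languages."""
    return {
        "type": "two_way",  # assumed discriminator, not taken from this page
        "language_a": language_a,
        "language_b": language_b,
    }


def translation_target(config: dict, spoken_language: str) -> Optional[str]:
    """Return the language that speech in spoken_language is translated into.

    Returns None for any third language, which is transcribed but not translated.
    """
    a, b = config["language_a"], config["language_b"]
    if spoken_language == a:
        return b
    if spoken_language == b:
        return a
    return None


config = two_way_config("en", "es")
swapped = two_way_config("es", "en")
```

Both config and swapped route English to Spanish and Spanish to English, which is what "the order does not matter" means in practice.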
Output format
Transcribed and translated tokens are streamed in real time, with clear labels for handling downstream.
Token fields
Field | Description |
---|---|
text | Token text |
translation_status | "none", "original", or "translation" |
language | Language of the token itself |
source_language | Language that the translated token was derived from |
Example tokens
Original transcription:
Translation to English:
Original transcription not translated:
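The three cases listed above could look like the tokens in the following sketch, which also shows one way to split the unified stream into a transcript and a translation. The field names come from the Token fields table; the token values, the language codes, and the split_stream helper are made up for illustration and assume an English ⟷ Spanish session in which French is heard but not translated.

```python
# Illustrative tokens only; real responses may carry additional fields.
tokens = [
    # Original transcription
    {"text": "Hola", "translation_status": "original", "language": "es"},
    # Translation to English
    {
        "text": "Hello",
        "translation_status": "translation",
        "language": "en",
        "source_language": "es",
    },
    # Original transcription not translated
    {"text": "Bonjour", "translation_status": "none", "language": "fr"},
]


def split_stream(tokens: list[dict]) -> tuple[list[str], list[str]]:
    """Separate spoken-language tokens from translated tokens."""
    transcript, translation = [], []
    for token in tokens:
        if token["translation_status"] == "translation":
            translation.append(token["text"])
        else:
            # "original" and "none" are both transcriptions of what was said.
            transcript.append(token["text"])
    return transcript, translation


transcript, translation = split_stream(tokens)
print(transcript)   # ['Hola', 'Bonjour']
print(translation)  # ['Hello']
```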
Supported translation pairs
- To English: All supported languages can be translated to English.
- From English: All supported languages can be used as translation targets from English.
- The following non-English ⟷ non-English translations are supported:
  - Any translation between French, German, Italian, Spanish, Chinese, Japanese, and Korean. For example:
    - Chinese ⟷ Japanese
    - French ⟷ German
    - Korean ⟷ Spanish
  - Other supported non-English translation pairs:
    - Portuguese ⟷ Spanish
    - Slovenian ⟷ Croatian
    - Slovenian ⟷ French
    - Slovenian ⟷ German
    - Slovenian ⟷ Italian
    - Slovenian ⟷ Serbian
    - Slovenian ⟷ Spanish
See the list of all Supported languages. To obtain the list of all supported languages and supported translation pairs via API, you can use the Get models endpoint.
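For a quick client-side check, the pair rules in this section can be encoded as a lookup. This is a sketch that hard-codes the list as written on this page, so it can go stale; the Get models endpoint is the authoritative source. It assumes ISO 639-1 codes, assumes both arguments are already known to be supported languages, and treats a language paired with itself as unsupported, which is an assumption of this example. The name is_supported_pair is hypothetical.

```python
# French, German, Italian, Spanish, Chinese, Japanese, Korean: any pair works.
FULL_MESH = {"fr", "de", "it", "es", "zh", "ja", "ko"}

# Additional non-English pairs; frozenset makes each pair order-independent.
EXTRA_PAIRS = {
    frozenset(pair)
    for pair in [
        ("pt", "es"),
        ("sl", "hr"),
        ("sl", "fr"),
        ("sl", "de"),
        ("sl", "it"),
        ("sl", "sr"),
        ("sl", "es"),
    ]
}


def is_supported_pair(lang_a: str, lang_b: str) -> bool:
    """Check a translation pair against the rules listed in this section."""
    if lang_a == lang_b:
        return False  # assumption: translating a language into itself is not a pair
    if "en" in (lang_a, lang_b):
        return True  # every supported language pairs with English
    if lang_a in FULL_MESH and lang_b in FULL_MESH:
        return True
    return frozenset((lang_a, lang_b)) in EXTRA_PAIRS


print(is_supported_pair("zh", "ja"))  # True
print(is_supported_pair("es", "pt"))  # True
print(is_supported_pair("pt", "fr"))  # False
```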
Speaker separation with translation
Soniox real-time translation fully supports speaker diarization. When enabled, the model will automatically separate different speakers in the audio stream and assign them distinct speaker labels.
This means that in multi-speaker conversations, you will receive:
- Transcription tokens labeled with the speaker
- Translated tokens that correspond to the original speaker
Example
If two people are speaking different languages in the same session, you'll see:
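As an illustration of such a session, the sketch below groups a handful of made-up tokens by speaker and keeps each speaker's original words next to their translation. The "speaker" field name and its values are assumptions, as are the token contents; for simplicity each token here holds a whole phrase, whereas a real stream may deliver smaller pieces.

```python
from collections import defaultdict

# Hypothetical tokens from an English <-> Spanish conversation.
tokens = [
    {"text": "How are you?", "speaker": "1",
     "translation_status": "original", "language": "en"},
    {"text": "¿Cómo estás?", "speaker": "1",
     "translation_status": "translation", "language": "es", "source_language": "en"},
    {"text": "Bien, gracias.", "speaker": "2",
     "translation_status": "original", "language": "es"},
    {"text": "Fine, thanks.", "speaker": "2",
     "translation_status": "translation", "language": "en", "source_language": "es"},
]


def by_speaker(tokens: list[dict]) -> dict:
    """Collect original and translated text per speaker label."""
    parts = defaultdict(lambda: {"original": [], "translation": []})
    for token in tokens:
        # Anything that is not translated output counts as the speaker's own words.
        if token["translation_status"] == "translation":
            key = "translation"
        else:
            key = "original"
        parts[token["speaker"]][key].append(token["text"])
    return {
        speaker: {key: " ".join(texts) for key, texts in groups.items()}
        for speaker, groups in parts.items()
    }


for speaker, lines in by_speaker(tokens).items():
    print(f"Speaker {speaker}: {lines['original']} -> {lines['translation']}")
```

Because translated tokens carry the same speaker label as the speech they were derived from, a single pass over the stream is enough to keep each translation attached to the right person.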
To enable speaker separation, include the following in your request:
Example
This example demonstrates how to perform real-time two-way translation between a Spanish and an English speaker, with speaker diarization enabled.
Output