Real-time translation
Learn how real-time translation works.
Overview
Soniox Speech-to-Text AI supports real-time speech translation in addition to multilingual transcription. With translation enabled, the model transcribes speech in any supported language and can translate it into another language in real time.
The translation system is highly flexible and supports:
- Translation from one or more source languages into a single target language
- Optional exclusion of specific languages from translation
- Conversational translation for two-way interactions between languages
How it works
Soniox Speech-to-Text AI processes all incoming speech in real time, transcribes it, and optionally translates it into a specified target language. The translation system is designed to balance accuracy, latency, and contextual quality, and operates as follows:
- All spoken languages are transcribed.
  Transcription always happens for all detected speech, regardless of translation configuration.
- Translation is applied only to configured source languages.
  You control which languages are translated using the source_languages list and, if applicable, exclude_source_languages.
- Only one target language per session.
  All translations in a session are directed to a single target_language.
- Translations are streamed in real time.
  Translations are returned in variable-sized chunks, based on when the model determines there is enough speech context to produce a high-quality translation.
- Translated tokens are included in the same token stream.
  Each token includes a translation_status flag, so you can distinguish translated output from the original transcription.
- Two-way translation mode translates in both directions.
  When two_way_target_language is specified, the model translates all speech between the two languages, allowing for natural back-and-forth conversation. In this mode, all languages are translated, so exclude_source_languages is not allowed.
Configuration
Translation is controlled using the translation block in your API request. All fields are optional unless otherwise specified.
Example
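A minimal sketch of a translation block, written as a Python dictionary and serialized to JSON. Only the field names come from this page. The rest of the request (endpoint, authentication, model, audio settings) is omitted, and nesting the block under a top-level translation key is an assumption based on the sentence above.

```python
import json

# Hypothetical request fragment: only the "translation" block is shown.
# Field names follow the Fields table in this section; the rest of the
# request is intentionally left out.
request_fragment = {
    "translation": {
        "target_language": "en",             # translate into English
        "source_languages": ["*"],           # from every spoken language
        "exclude_source_languages": ["es"],  # except Spanish (optional field)
    }
}

payload = json.dumps(request_fragment, indent=2)
print(payload)
```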
Fields
Field | Type | Description |
---|---|---|
target_language | string (required) | The target language for translation (ISO 639-1 code). |
source_languages | string[] (required) | List of source languages to translate. Use ["*"] to include all. |
exclude_source_languages | string[] (optional) | Languages to exclude from translation. Only allowed when source_languages is ["*"]. |
two_way_target_language | string (optional) | Enables two-way translation for conversations. All speech is translated between the two languages. Cannot be used with exclude_source_languages. |
Translation rules
Target language is English
- You must use "source_languages": ["*"] to translate from all languages to English.
- You may exclude specific source languages using exclude_source_languages.
- You cannot specify a limited list of source languages; only "*" is allowed.
- All supported languages can be translated to English.
Target language is not English
- You must explicitly specify which source languages to translate using source_languages.
- All other spoken languages will be transcribed but not translated.
- Most non-English targets support only English as a source language.
- All supported languages can be translated from English.
Special source/target pairs
These target languages support additional source languages:
Target language | Supported source languages |
---|---|
pt | en, es |
es | en, pt |
de | en, fr |
fr | en, de |
zh | en, ja, ko |
ja | en, zh, ko |
ko | en, zh, ja |
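The one-way rules and the table above can be turned into a client-side pre-check. The sketch below is one reading of those rules, not the service's own validation: the function name check_one_way and its messages are hypothetical, and the service remains the authority on what it accepts.

```python
# Special source/target pairs copied from the table above.
SPECIAL_SOURCES = {
    "pt": {"en", "es"},
    "es": {"en", "pt"},
    "de": {"en", "fr"},
    "fr": {"en", "de"},
    "zh": {"en", "ja", "ko"},
    "ja": {"en", "zh", "ko"},
    "ko": {"en", "zh", "ja"},
}


def check_one_way(translation: dict) -> list[str]:
    """Return rule violations for a one-way translation block (empty if none)."""
    problems = []
    target = translation.get("target_language")
    sources = translation.get("source_languages")
    excluded = translation.get("exclude_source_languages")

    if not target:
        problems.append("target_language is required")
    if not sources:
        problems.append("source_languages is required")
    if problems:
        return problems

    if excluded and sources != ["*"]:
        problems.append('exclude_source_languages requires source_languages == ["*"]')

    if target == "en":
        # English target: only the wildcard source list is allowed.
        if sources != ["*"]:
            problems.append('English target requires source_languages == ["*"]')
    elif sources == ["*"]:
        problems.append("non-English target requires an explicit source list")
    else:
        # Targets missing from the table accept English only.
        allowed = SPECIAL_SOURCES.get(target, {"en"})
        unsupported = sorted(set(sources) - allowed)
        if unsupported:
            problems.append(f"unsupported sources for {target}: {unsupported}")
    return problems


print(check_one_way({"target_language": "ko", "source_languages": ["en", "zh"]}))  # []
print(check_one_way({"target_language": "de", "source_languages": ["es"]}))
```

The second call reports Spanish as an unsupported source for German, since the table lists only en and fr for de.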
Two-way translation (conversational)
Two-way translation enables real-time, bidirectional translation — ideal for conversational interfaces between two different languages.
In this mode, the system:
- Translates any spoken language to a primary target_language
- Also translates the target_language back into a specified two_way_target_language
Current supported configuration
We currently support two-way translation in the following setup:
- target_language must be English ("en")
- two_way_target_language can be any supported non-English language (e.g., "es", "de", "zh")
Example
The following configuration will:
- Translate any language to English
- Translate English to Spanish
When using two_way_target_language, you must use source_languages: ["*"] and cannot use exclude_source_languages.
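A configuration matching the description above might look like the following sketch. The field names come from the Fields table. The helper check_two_way is hypothetical and only restates the two-way constraints from this section as a local check.

```python
def check_two_way(translation: dict) -> list[str]:
    """Return violations of the two-way rules in this section (empty if none)."""
    problems = []
    if translation.get("target_language") != "en":
        problems.append('target_language must be "en" in two-way mode')
    if translation.get("two_way_target_language") in (None, "", "en"):
        problems.append("two_way_target_language must be a non-English language")
    if translation.get("source_languages") != ["*"]:
        problems.append('source_languages must be ["*"] in two-way mode')
    if "exclude_source_languages" in translation:
        problems.append("exclude_source_languages is not allowed in two-way mode")
    return problems


# Any language -> English, and English -> Spanish.
english_spanish = {
    "target_language": "en",
    "source_languages": ["*"],
    "two_way_target_language": "es",
}

print(check_two_way(english_spanish))  # []
```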
Notes
When two_way_target_language is set:
- exclude_source_languages is not allowed
- All speech is automatically translated, so there is no need to list specific sources
- Only one two-way target language is supported per session
Speaker separation with translation
Soniox real-time translation fully supports speaker diarization. When enabled, the model will automatically separate different speakers in the audio stream and assign them distinct speaker labels.
This means that in multi-speaker conversations, you will receive:
- Transcription tokens labeled with the correct speaker
- Translated tokens that correspond to the original speaker
Example
If two people are speaking different languages in the same session, you'll see:
This makes it easy to build voice applications where who said what is just as important as what was said — such as multilingual meetings, interviews, or assistants serving multiple users at once.
To enable speaker separation, include the following in your request:
Speaker labels are included in each token with the speaker field.
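As an illustration of how the speaker field can be used, the sketch below groups spoken and translated text by speaker. The sample tokens are hypothetical and not real service output. The speaker label values and the assumption that each token's text carries its own spacing are also illustrative.

```python
from collections import defaultdict

# Hypothetical tokens for illustration only; real responses may carry more fields.
tokens = [
    {"text": "Hola", "speaker": "1", "language": "es",
     "translation_status": "original"},
    {"text": "Hello", "speaker": "1", "language": "en",
     "translation_status": "translation", "source_language": "es"},
    {"text": "Hi", "speaker": "2", "language": "en",
     "translation_status": "original"},
    {"text": "Hola", "speaker": "2", "language": "es",
     "translation_status": "translation", "source_language": "en"},
]


def group_by_speaker(tokens):
    """Collect spoken and translated text per speaker label."""
    lines = defaultdict(lambda: {"spoken": "", "translated": ""})
    for token in tokens:
        key = "translated" if token["translation_status"] == "translation" else "spoken"
        lines[token["speaker"]][key] += token["text"]
    return dict(lines)


for speaker, line in group_by_speaker(tokens).items():
    print(f"Speaker {speaker}: {line['spoken']!r} -> {line['translated']!r}")
```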
Examples
Translate all to English, exclude Spanish and Portuguese
Translate English to German
Translate English and Chinese to Korean
Conversational English ↔ Spanish
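The four scenarios above can be written as translation blocks using the fields documented on this page. This is a sketch: the language codes are inferred from the scenario titles, and nesting each block under a translation key is an assumption.

```python
import json

# Translation blocks matching the four scenarios listed above.
EXAMPLES = {
    "Translate all to English, exclude Spanish and Portuguese": {
        "target_language": "en",
        "source_languages": ["*"],
        "exclude_source_languages": ["es", "pt"],
    },
    "Translate English to German": {
        "target_language": "de",
        "source_languages": ["en"],
    },
    "Translate English and Chinese to Korean": {
        "target_language": "ko",
        "source_languages": ["en", "zh"],
    },
    "Conversational English ↔ Spanish": {
        "target_language": "en",
        "source_languages": ["*"],
        "two_way_target_language": "es",
    },
}

for title, translation in EXAMPLES.items():
    print(title)
    print(json.dumps({"translation": translation}, indent=2))
```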
Output format
Translated tokens are returned alongside original transcribed tokens in the stream. Each token includes a translation_status field indicating whether it is original speech, a translation, or a token that will not be translated.
Example output tokens
Fields
Field | Description |
---|---|
text | Token text |
confidence | Confidence score (0-1) |
is_final | Whether the token is finalized |
language | Detected language of the token |
translation_status | "original", "translation", or "none" |
source_language | Original language if the token is a translation |
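The fields above are enough to separate a token stream into a transcript and a translation. The sketch below keeps only finalized tokens and treats both "original" and "none" as transcribed speech. The function name and the sample tokens are hypothetical, and joining token text without adding spaces assumes that tokens carry their own whitespace.

```python
def split_final_text(tokens):
    """Return (transcript, translation) built from finalized tokens only."""
    transcript, translation = [], []
    for token in tokens:
        if not token.get("is_final"):
            continue  # skip tokens that are not finalized yet
        if token.get("translation_status") == "translation":
            translation.append(token["text"])
        else:
            # "original" and "none" are both transcribed speech
            transcript.append(token["text"])
    return "".join(transcript), "".join(translation)


# Hypothetical tokens, not real service output.
sample = [
    {"text": "Bonjour", "is_final": True, "language": "fr",
     "translation_status": "original"},
    {"text": "Hello", "is_final": True, "language": "en",
     "translation_status": "translation", "source_language": "fr"},
    {"text": " the", "is_final": False, "language": "en",
     "translation_status": "none"},
]

print(split_final_text(sample))  # ('Bonjour', 'Hello')
```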
Example
This example demonstrates how to perform real-time two-way translation between a Spanish and an English speaker, with speaker diarization enabled.
Output