Real-time translation

Learn how real-time translation works.

Overview

Soniox Speech-to-Text AI supports real-time speech translation in addition to multilingual transcription. With translation enabled, the model transcribes speech in any supported language and can translate it into another language in real time.

The translation system is highly flexible and supports:

  • Translation from one or more source languages into a single target language
  • Optional exclusion of specific languages from translation
  • Conversational translation for two-way interactions between languages

How it works

Soniox Speech-to-Text AI processes all incoming speech in real time, transcribes it, and optionally translates it into a specified target language. The translation system is designed to balance accuracy, latency, and contextual quality, and operates as follows:

  • All spoken languages are transcribed.
    Transcription always happens for all detected speech, regardless of translation configuration.

  • Translation is applied only to configured source languages.
    You control which languages are translated using the source_languages list, and (if applicable) exclude_source_languages.

  • Only one target language per session.
    All translations in a session are directed to a single target_language.

  • Translations are streamed in real time.
    Translations are returned in variable-sized chunks, based on when the model determines there is enough speech context to produce a high-quality translation.

  • Translated tokens are included in the same token stream.
    Each token includes a translation_status flag, so you can distinguish translated output from the original transcription.

  • Two-way translation mode translates in both directions.
    When two_way_target_language is specified, the model translates all speech between the two languages, allowing for natural back-and-forth conversation. In this mode, all languages are translated — exclude_source_languages is not allowed.

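As an illustration of the shared token stream described above, the sketch below (a minimal example; the helper name is illustrative and the token shape follows the Output format section later on this page) splits incoming tokens into original and translated text:

def split_tokens(tokens):
    # Separate the original transcription from translated output using the
    # translation_status flag carried by each token.
    original_text = ""
    translated_text = ""
    for token in tokens:
        if token.get("translation_status") == "translation":
            translated_text += token["text"]
        else:
            # "original" and "none" (not translated) both belong to the transcript
            original_text += token["text"]
    return original_text, translated_text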

Configuration

Translation is controlled using the translation block in your API request. All fields are optional unless otherwise specified.

Example

{
  "translation": {
    "target_language": "en",
    "source_languages": ["*"],
    "exclude_source_languages": ["es", "pt"]
  }
}

Fields

Field                     | Type                | Description
target_language           | string (required)   | The target language for translation (ISO 639-1 code).
source_languages          | string[] (required) | List of source languages to translate. Use ["*"] to include all.
exclude_source_languages  | string[] (optional) | Languages to exclude from translation. Only allowed when source_languages is ["*"].
two_way_target_language   | string (optional)   | Enables two-way translation for conversations. All speech is translated between the two languages. Cannot be used with exclude_source_languages.
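
For reference, a minimal sketch of how the translation block slots into the start request sent over the WebSocket. The surrounding session fields (api_key, model, audio_format, sample_rate, num_channels) mirror the full example at the end of this page; adjust them for your setup:

import json

start_request = {
    "api_key": "<SONIOX_API_KEY>",
    "model": "stt-rt-preview",
    "audio_format": "pcm_s16le",
    "sample_rate": 16000,
    "num_channels": 1,
    "translation": {
        "target_language": "en",             # required: single target per session
        "source_languages": ["*"],           # required: ["*"] = all supported languages
        "exclude_source_languages": ["es"],  # optional: only valid with ["*"]
    },
}

# Once the WebSocket connection is open:
# ws.send(json.dumps(start_request))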

Translation rules

Target language is English

  • You must use "source_languages": ["*"] to translate from all languages to English.
  • You may exclude specific source languages using exclude_source_languages.
  • You cannot specify a limited list of source languages — only "*" is allowed.
  • All supported languages can be translated to English.

Target language is not English

  • You must explicitly specify which source languages to translate using source_languages.
  • All other spoken languages will be transcribed but not translated.
  • Most non-English targets support only English as a source language.
  • All supported languages can be translated from English.
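
The rules above can be expressed as a small client-side check. This is an illustrative sketch only, with a hypothetical helper name; it simply restates the constraints listed in this section:

def check_translation_config(t):
    # Illustrative validation of the translation rules described above.
    target = t["target_language"]
    sources = t["source_languages"]

    if target == "en" and sources != ["*"]:
        raise ValueError('An English target requires source_languages to be ["*"]')
    if target != "en" and sources == ["*"]:
        raise ValueError("A non-English target requires an explicit source_languages list")
    if "exclude_source_languages" in t and sources != ["*"]:
        raise ValueError('exclude_source_languages is only allowed when source_languages is ["*"]')
    if "two_way_target_language" in t and "exclude_source_languages" in t:
        raise ValueError("two_way_target_language cannot be combined with exclude_source_languages")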

Special source/target pairs

These target languages support additional source languages:

Target language | Supported source languages
pt              | en, es
es              | en, pt
de              | en, fr
fr              | en, de
zh              | en, ja, ko
ja              | en, zh, ko
ko              | en, zh, ja
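
Expressed as data, the table above might look like the following mapping (a convenience lookup for client-side checks, not an API field):

# Non-English target languages and the source languages they support,
# as listed in the table above.
SPECIAL_SOURCE_LANGUAGES = {
    "pt": ["en", "es"],
    "es": ["en", "pt"],
    "de": ["en", "fr"],
    "fr": ["en", "de"],
    "zh": ["en", "ja", "ko"],
    "ja": ["en", "zh", "ko"],
    "ko": ["en", "zh", "ja"],
}

def supported_sources(target_language):
    # Other non-English targets support English as the only source language.
    return SPECIAL_SOURCE_LANGUAGES.get(target_language, ["en"])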

Two-way translation (conversational)

Two-way translation enables real-time, bidirectional translation — ideal for conversational interfaces between two different languages.

In this mode, the system:

  • Translates any spoken language to a primary target_language

  • And also translates the target_language back into a specified two_way_target_language

Current supported configuration

We currently support two-way translation in the following setup:

  • target_language must be English ("en")

  • two_way_target_language can be any supported non-English language (e.g., "es", "de", "zh")

Example

The following configuration will:

  • Translate any language to English

  • Translate English to Spanish

{
  "translation": {
    "target_language": "en",
    "source_languages": ["*"],
    "two_way_target_language": "es"
  }
}

When using two_way_target_language, you must use source_languages: ["*"] and cannot use exclude_source_languages.

Notes

When two_way_target_language is set:

  • exclude_source_languages is not allowed

  • All speech is automatically translated — no need to list specific sources

  • Only one two-way target language is supported per session

Speaker separation with translation

Soniox real-time translation fully supports speaker diarization. When enabled, the model will automatically separate different speakers in the audio stream and assign them distinct speaker labels.

This means that in multi-speaker conversations, you will receive:

  • Transcription tokens labeled with the correct speaker
  • Translated tokens that correspond to the original speaker

Example

If two people are speaking different languages in the same session, you'll see:

Speaker 1 (Original): Bonjour comment ça va ?
Speaker 1 (Translation): Hello, how are you?

Speaker 2 (Original): I’m good, thanks.
Speaker 2 (Translation): Estoy bien, gracias.

This makes it easy to build voice applications where who said what is just as important as what was said — such as multilingual meetings, interviews, or assistants serving multiple users at once.

To enable speaker separation, include the following in your request:

{
  "enable_speaker_diarization": true
}

Speaker labels are included in each token with the speaker field.
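
A minimal sketch of how the speaker and translation_status fields might be combined to render output like the example above (the helper name is illustrative and assumes finalized tokens):

def format_by_speaker(tokens):
    # Group tokens into lines by (speaker, translation_status), so each speaker's
    # original transcription and its translation appear on separate lines.
    lines = []
    current_key = None
    for token in tokens:
        key = (token.get("speaker"), token.get("translation_status"))
        if key != current_key:
            label = "Translation" if key[1] == "translation" else "Original"
            lines.append(f"Speaker {key[0]} ({label}): {token['text'].lstrip()}")
            current_key = key
        else:
            lines[-1] += token["text"]
    return "\n".join(lines)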

Examples

Translate all to English, exclude Spanish and Portuguese

{
  "translation": {
    "target_language": "en",
    "source_languages": ["*"],
    "exclude_source_languages": ["es", "pt"]
  }
}

Translate English to German

{
  "translation": {
    "target_language": "de",
    "source_languages": ["en"]
  }
}

Translate English and Chinese to Korean

{
  "translation": {
    "target_language": "ko",
    "source_languages": ["en", "zh"]
  }
}

Conversational English ↔ Spanish

{
  "translation": {
    "target_language": "en",
    "source_languages": ["*"],
    "two_way_target_language": "es"
  }
}

Output format

Translated tokens are returned alongside original transcribed tokens in the stream. Each token includes a translation_status field indicating whether it is original speech that will be translated, a translation, or speech that will not be translated.

Example output tokens

{
  "text": "Hello",
  "start_ms": 1020,
  "end_ms": 1080,
  "confidence": 0.981,
  "is_final": true,
  "language": "en",
  "translation_status": "original"
}
{
  "text": "Hola",
  "confidence": 0.849,
  "is_final": true,
  "language": "es",
  "translation_status": "translation",
  "source_language": "en"
}
{
  "text": "Hallo",
  "start_ms": 3260,
  "end_ms": 1380,
  "confidence": 0.947,
  "is_final": true,
  "language": "de",
  "translation_status": "none"
}

Fields

Field              | Description
text               | Token text
confidence         | Confidence score (0–1)
is_final           | Whether the token is finalized
language           | Detected language of the token
translation_status | "original", "translation", or "none"
source_language    | Original language if the token is a translation
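
As a sketch of how these fields might be consumed on the client (the helper name is illustrative; the token shape is as documented above):

def describe_token(token):
    # Build a human-readable label from the fields above.
    status = token.get("translation_status")
    if status == "translation":
        # Translated output; source_language identifies the original language.
        return f'{token["source_language"]} -> {token["language"]} (translation): {token["text"]}'
    if status == "none":
        # Transcribed speech that will not be translated.
        return f'{token["language"]} (not translated): {token["text"]}'
    # "original": transcription of speech that will also be translated.
    return f'{token["language"]} (original): {token["text"]}'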

Example

This example demonstrates how to perform real-time two-way translation between a Spanish and an English speaker, with speaker diarization enabled.

import json
import os
import threading
import time
 
from websockets import ConnectionClosedOK
from websockets.sync.client import connect
 
# Retrieve the API key from environment variable (ensure SONIOX_API_KEY is set)
api_key = os.environ.get("SONIOX_API_KEY")
websocket_url = "wss://stt-rt.soniox.com/transcribe-websocket"
file_to_transcribe = "two_way_translation.pcm_s16le"
 
 
def stream_audio(ws):
    with open(file_to_transcribe, "rb") as fh:
        while True:
            data = fh.read(3840)
            if len(data) == 0:
                break
            ws.send(data)
            time.sleep(0.12)  # sleep for 120 ms
    ws.send("")  # signal end of stream
 
 
def render_tokens(final_tokens, non_final_tokens):
    # Render the tokens in the terminal using ANSI escape codes.
    text = ""
    text += "\033[2J\033[H"  # clear the screen, move to top-left corner
    is_final = True
    speaker = ""
    language = ""
    for token in final_tokens + non_final_tokens:
        token_text = token["text"]
        if not token["is_final"] and is_final:
            text += "\033[34m"  # change text color to blue
            is_final = False
        if token.get("speaker") and token["speaker"] != speaker:
            if speaker:
                text += "\n\n"
            speaker = token["speaker"]
            text += f"Speaker {speaker}: "
            token_text = token_text.lstrip()
            language = ""
        if token.get("language") and token["language"] != language:
            text += "\n"
            language = token["language"]
            text += f"[{language}] "
            token_text = token_text.lstrip()
        text += token_text
    text += "\033[39m"  # reset text color
    print(text)
 
 
def main():
    print("Opening WebSocket connection...")
 
    with connect(websocket_url) as ws:
        # Send start request
        ws.send(
            json.dumps(
                {
                    "api_key": api_key,
                    "audio_format": "pcm_s16le",
                    "sample_rate": 16000,
                    "num_channels": 1,
                    "model": "stt-rt-preview",
                    "language_hints": ["en", "es"],
                    "enable_speaker_diarization": True,
                    "translation": {
                        "target_language": "en",
                        "source_languages": ["*"],
                        "two_way_target_language": "es",
                    },
                }
            )
        )
 
        # Start streaming audio in background
        threading.Thread(target=stream_audio, args=(ws,), daemon=True).start()
 
        print("Transcription started")
 
        final_tokens = []
 
        try:
            while True:
                message = ws.recv()
                res = json.loads(message)
 
                if res.get("error_code"):
                    print(f"Error: {res['error_code']} - {res['error_message']}")
                    break
 
                non_final_tokens = []
 
                for token in res.get("tokens", []):
                    if token.get("text"):
                        if token.get("is_final"):
                            final_tokens.append(token)
                        else:
                            non_final_tokens.append(token)
 
                render_tokens(final_tokens, non_final_tokens)
 
                if res.get("finished"):
                    print("\nTranscription complete.")
        except ConnectionClosedOK:
            pass
        except Exception as e:
            print(f"Error: {e}")
 
 
if __name__ == "__main__":
    main()
View example on GitHub
