
Real-time translation

Learn how real-time translation works.

Overview

Soniox Speech-to-Text AI supports real-time speech translation with high accuracy and ultra-low latency. As audio is streamed, Soniox can transcribe spoken language and translate it into another language in real time, with both the transcription and translation returned in a single unified stream.

The translation system supports two modes:

  • One-way translation: Translate from one or more source languages into a single target language.

  • Two-way translation: Translate bi-directionally between two specific languages — ideal for conversational use cases.


How it works

Soniox transcribes and optionally translates speech in real time. Both transcription and translation are returned in the same unified token stream.

General behavior

  • All speech is transcribed

    Transcription always occurs for all spoken audio, regardless of translation configuration.

  • Only configured languages are translated

    • One-way translation: Translation output is always in target_language. Translation is only applied to languages specified in source_languages. Use "*" to translate all languages (only supported when translating to English). exclude_source_languages can be used to exclude specific languages from translation.

    • Two-way translation: The system translates between language_a and language_b. Other languages are not translated.

  • Translations are streamed in real time

    Translations are returned in variable-sized chunks, based on model-determined context windows. This balances translation quality with latency.

  • Unified stream of tokens

    Transcribed and translated tokens are returned in the same stream. Each token includes:

    • translation_status:
      • "none": the token will not be translated
      • "original": the token is original speech
      • "translation": the token is translated output
    • language: Language of the token
    • source_language: Present only for translated tokens
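
For example, a downstream handler can branch on translation_status to label each token. A minimal sketch in Python (route_token is a hypothetical helper; tokens are assumed to arrive as dicts with the fields above, as in the full example at the end of this page):

def route_token(token):
    # Hypothetical helper: label one token from the unified stream.
    if token.get("translation_status") == "translation":
        # Translated output; source_language tells you what it was derived from.
        print(f"[{token['source_language']} -> {token['language']}] {token['text']}")
    else:
        # "original" speech (a translation will follow) or "none"
        # (transcribed but never translated).
        print(f"[{token['language']}] {token['text']}")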

One-way translation

Use one-way translation to convert speech from specific source languages into a single target language.

Example 1: Translate all languages to English

{
  "translation": {
    "type": "one_way",
    "target_language": "en",
    "source_languages": ["*"]
  }
}
  • source_languages: ["*"] is only allowed and must be used when target language is English. You can't specify individual source_languages in this case.

Example 2: Translate all languages to English, excluding Spanish and Portuguese

{
  "translation": {
    "type": "one_way",
    "target_language": "en",
    "source_languages": ["*"],
    "exclude_source_languages": ["es", "pt"]
  }
}
  • exclude_source_languages is only allowed when source_languages is ["*"].

Example 3: Translate English to German

{
  "translation": {
    "type": "one_way",
    "target_language": "de",
    "source_languages": ["en"]
  }
}
  • source_languages: ["*"] is not allowed for non-English targets. You must specify all desired source_languages individually.

Example 4: Translate Chinese and Japanese to Korean

{
  "translation": {
    "type": "one_way",
    "target_language": "ko",
    "source_languages": ["zh", "ja"]
  }
}
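
The translation object is sent as part of the WebSocket start request, together with the audio and model options. A minimal sketch (the surrounding request fields are copied from the full example at the end of this page; other combinations are possible):

import json
import os

from websockets.sync.client import connect

with connect("wss://stt-rt.soniox.com/transcribe-websocket") as ws:
    # The start request carries the translation config alongside the other options.
    ws.send(
        json.dumps(
            {
                "api_key": os.environ.get("SONIOX_API_KEY"),
                "audio_format": "pcm_s16le",
                "sample_rate": 16000,
                "num_channels": 1,
                "model": "stt-rt-preview",
                "translation": {
                    "type": "one_way",
                    "target_language": "ko",
                    "source_languages": ["zh", "ja"],
                },
            }
        )
    )
    # ...then stream audio and read tokens as in the full example below.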

Two-way translation

Two-way translation enables bi-directional translation between two specified languages. This is ideal for live conversations between speakers of different languages.

Example 1: English ⟷ Spanish

{
  "translation": {
    "type": "two_way",
    "language_a": "en",
    "language_b": "es"
  }
}
  • Speech in English will be translated to Spanish.
  • Speech in Spanish will be translated to English.
  • Other languages will be transcribed but not translated.
  • The order of language_a and language_b does not matter.

Output format

Transcribed and translated tokens are streamed in real time and clearly labeled for downstream handling.

Token fields

Field                 Description
text                  Token text
translation_status    "none", "original", or "translation"
language              Language of the token itself
source_language       Language the translated token was derived from (translated tokens only)

Example tokens

Original transcription:

{
  "text": "Bonjour",
  "start_ms": 1020,
  "end_ms": 1080,
  "translation_status": "original",
  "language": "fr"
}

Translation to English:

{
  "text": "Hello",
  "translation_status": "translation",
  "language": "en",
  "source_language": "fr"
}

Original transcription not translated:

{
  "text": "Hallo",
  "start_ms": 3260,
  "end_ms": 3380,
  "translation_status": "none",
  "language": "de"
}
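
To assemble the translated transcript downstream, you can collect translated tokens and join their text. A sketch (final_tokens is assumed to be a list of token dicts like those above):

# Build the translated transcript from collected final tokens.
translated_text = "".join(
    token["text"]
    for token in final_tokens
    if token.get("translation_status") == "translation"
)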

Supported translation pairs

  • To English: All supported languages can be translated to English.
  • From English: All supported languages can be used as translation targets from English.
  • The following non-English ⟷ non-English translations are supported:
    • Any translation between French, German, Italian, Spanish, Chinese, Japanese, and Korean. For example:
      • Chinese ⟷ Japanese
      • French ⟷ German
      • Korean ⟷ Spanish
    • Other supported non-English translation pairs:
      • Portuguese ⟷ Spanish
      • Slovenian ⟷ Croatian
      • Slovenian ⟷ French
      • Slovenian ⟷ German
      • Slovenian ⟷ Italian
      • Slovenian ⟷ Serbian
      • Slovenian ⟷ Spanish

See the list of all Supported languages. To obtain supported languages and translation pairs programmatically, use the Get models endpoint.
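
For example, a request to the Get models endpoint might look like the sketch below; the exact URL and response shape here are assumptions, so check the Get models API reference:

import os

import requests

# Assumed endpoint and Bearer auth; verify against the Get models reference.
response = requests.get(
    "https://api.soniox.com/v1/models",
    headers={"Authorization": f"Bearer {os.environ['SONIOX_API_KEY']}"},
)
response.raise_for_status()
print(response.json())  # expected to list languages and translation pairs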


Speaker separation with translation

Soniox real-time translation fully supports speaker diarization. When enabled, the model will automatically separate different speakers in the audio stream and assign them distinct speaker labels.

This means that in multi-speaker conversations, you will receive:

  • Transcription tokens labeled with the speaker
  • Translated tokens that correspond to the original speaker

Example

If two people are speaking different languages in the same session, you'll see:

Speaker 1 (Original): Bonjour comment ça va ?
Speaker 1 (Translation): Hello, how are you?

Speaker 2 (Original): I’m good, thanks.
Speaker 2 (Translation): Estoy bien, gracias.

To enable speaker separation, include the following in your request:

{
  "enable_speaker_diarization": true
}

Example

This example demonstrates how to perform real-time two-way translation between a Spanish and an English speaker, with speaker diarization enabled.

import json
import os
import threading
import time
 
from websockets import ConnectionClosedOK
from websockets.sync.client import connect
 
# Retrieve the API key from environment variable (ensure SONIOX_API_KEY is set)
api_key = os.environ.get("SONIOX_API_KEY")
websocket_url = "wss://stt-rt.soniox.com/transcribe-websocket"
file_to_transcribe = "two_way_translation.pcm_s16le"
 
 
def stream_audio(ws):
    with open(file_to_transcribe, "rb") as fh:
        while True:
            data = fh.read(3840)
            if len(data) == 0:
                break
            ws.send(data)
            time.sleep(0.12)  # sleep for 120 ms
    ws.send("")  # signal end of stream
 
 
def render_tokens(final_tokens, non_final_tokens):
    # Render the tokens in the terminal using ANSI escape codes.
    text = ""
    text += "\033[2J\033[H"  # clear the screen, move to top-left corner
    is_final = True
    speaker = ""
    language = ""
    for token in final_tokens + non_final_tokens:
        token_text = token["text"]
        if not token["is_final"] and is_final:
            text += "\033[34m"  # change text color to blue
            is_final = False
        if token.get("speaker") and token["speaker"] != speaker:
            if speaker:
                text += "\n\n"
            speaker = token["speaker"]
            text += f"Speaker {speaker}: "
            token_text = token_text.lstrip()
            language = ""
        if token.get("language") and token["language"] != language:
            text += "\n"
            language = token["language"]
            text += f"[{language}] "
            token_text = token_text.lstrip()
        text += token_text
    text += "\033[39m"  # reset text color
    print(text)
 
 
def main():
    print("Opening WebSocket connection...")
 
    with connect(websocket_url) as ws:
        # Send start request
        ws.send(
            json.dumps(
                {
                    "api_key": api_key,
                    "audio_format": "pcm_s16le",
                    "sample_rate": 16000,
                    "num_channels": 1,
                    "model": "stt-rt-preview",
                    "language_hints": ["en", "es"],
                    "enable_speaker_diarization": True,
                    "translation": {
                        "type": "two_way",
                        "language_a": "en",
                        "language_b": "es",
                    },
                }
            )
        )
 
        # Start streaming audio in background
        threading.Thread(target=stream_audio, args=(ws,), daemon=True).start()
 
        print("Transcription started")
 
        final_tokens = []
 
        try:
            while True:
                message = ws.recv()
                res = json.loads(message)
 
                if res.get("error_code"):
                    print(f"Error: {res['error_code']} - {res['error_message']}")
                    break
 
                non_final_tokens = []
 
                for token in res.get("tokens", []):
                    if token.get("text"):
                        if token.get("is_final"):
                            final_tokens.append(token)
                        else:
                            non_final_tokens.append(token)
 
                render_tokens(final_tokens, non_final_tokens)
 
                if res.get("finished"):
                    print("\nTranscription complete.")
        except ConnectionClosedOK:
            pass
        except Exception as e:
            print(f"Error: {e}")
 
 
if __name__ == "__main__":
    main()