Real-time speech-to-speech translation

Overview

Soniox real-time speech-to-speech translation takes spoken audio in one language and plays it back as spoken audio in another while the speaker is still talking. Build the pipeline by chaining two Soniox products:

Real-time speech-to-text translation: Soniox recognizes speech and translates it in real time across supported languages.
Real-time text-to-speech: Soniox speaks the translated text in the target language with a chosen voice, with streaming output.

Both APIs are streaming and low-latency by design, so audio for the first translated words can play before the speaker finishes their sentence. Typical use cases:

Live interpreters for meetings, conversations, and business communication.
Bilingual voice agents for support, sales, scheduling, healthcare, and other multilingual workflows.
Travel assistants and customer support that translate calls while preserving names, numbers, and verification codes.
Real-time multilingual communication: anywhere two people who don't share a language need to speak naturally.

Need translated text output instead? See Speech-to-text translation.

How it works

The pipeline has two Soniox WebSocket connections and a small piece of application logic between them:

STT + translation receives audio and streams transcription and translation tokens back. Your app forwards the translation tokens to TTS.
TTS receives the translated text chunks and streams audio chunks back. Your app decodes and plays those chunks as they arrive.

Because both APIs stream, your app can start sending translated text to TTS before the speaker has finished the full utterance.
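The application logic between the two connections can be very small. The following Python sketch shows one way to pick out the text that should go to TTS from a batch of STT tokens. The token fields used here (`text`, `is_final`, `translation_status`) and the helper name `translated_text_chunks` are assumptions made for illustration, so check them against the API reference before relying on them. Forwarding only finalized tokens is also a design choice of this sketch, made so that TTS never speaks text that is later revised.

```python
from typing import Iterable, Iterator


def translated_text_chunks(tokens: Iterable[dict]) -> Iterator[str]:
    """Yield the text of finalized translation tokens, in arrival order."""
    for token in tokens:
        # Assumed token shape (not confirmed by this page):
        # {"text": str, "is_final": bool,
        #  "translation_status": "original" | "translation" | ...}
        if token.get("translation_status") != "translation":
            continue  # skip the source-language transcript
        if not token.get("is_final", False):
            continue  # skip provisional tokens that may still change
        text = token.get("text", "")
        if text:
            yield text


# Made-up example payload, for illustration only.
message_tokens = [
    {"text": "Hello", "is_final": True, "translation_status": "original"},
    {"text": "Hola", "is_final": True, "translation_status": "translation"},
    {"text": " mundo", "is_final": False, "translation_status": "translation"},
]
chunks = list(translated_text_chunks(message_tokens))
```

Each yielded chunk would then be written to the open TTS connection as soon as it is produced, rather than collected into a full sentence first.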

Check out the Soniox speech-to-speech translation demo, which pairs a FastAPI backend with a vanilla JS frontend to wire STT with translation into TTS.

Pipeline configuration

You combine two configs, one for each API. Pick a translation mode for the STT side and a voice and audio output format for the TTS side.

Real-time STT with one-way translation into Spanish:

{
  "model": "stt-rt-v5",
  "audio_format": "auto",
  "enable_endpoint_detection": true,
  "max_endpoint_delay_ms": 500,
  "translation": {
    "type": "one_way",
    "target_language": "es"
  }
}

For two-way conversations (e.g. English ⟷ Spanish), use {"type": "two_way", "language_a": "en", "language_b": "es"} so each speaker hears the other's language back.
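If your app switches between the two modes, it may help to build the STT config in code. A minimal Python sketch follows, using only the fields and values shown above. The helper names `one_way`, `two_way`, and `build_stt_config` are hypothetical and not part of any Soniox SDK.

```python
import json


def one_way(target_language: str) -> dict:
    """Translation block for one-way mode."""
    return {"type": "one_way", "target_language": target_language}


def two_way(language_a: str, language_b: str) -> dict:
    """Translation block for two-way mode."""
    return {"type": "two_way", "language_a": language_a, "language_b": language_b}


def build_stt_config(translation: dict) -> dict:
    """Combine the shared STT settings from this page with a translation block."""
    return {
        "model": "stt-rt-v5",
        "audio_format": "auto",
        "enable_endpoint_detection": True,
        "max_endpoint_delay_ms": 500,
        "translation": translation,
    }


# Serialize the English <-> Spanish config for sending to the STT connection.
config_json = json.dumps(build_stt_config(two_way("en", "es")))
```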

Real-time TTS opens one stream per utterance in the target language:

{
  "model": "tts-rt-v1",
  "voice": "Maya",
  "audio_format": "pcm_s16le",
  "sample_rate": 24000
}

Voices are multilingual, so the same voice ID works across supported languages.
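With `pcm_s16le` at 24000 Hz, each sample is a signed 16-bit little-endian integer, so one second of single-channel audio is 24000 × 2 = 48000 bytes. The Python sketch below turns that arithmetic into two small helpers for sizing playback buffers. It assumes mono output and that chunks reach your code as raw bytes after any transport-level decoding. Neither assumption is stated on this page, and the function names are hypothetical.

```python
import struct

BYTES_PER_SAMPLE = 2  # pcm_s16le: signed 16-bit little-endian


def pcm_duration_ms(num_bytes: int, sample_rate: int = 24000, channels: int = 1) -> float:
    """Playback duration of a raw PCM chunk, in milliseconds."""
    frames = num_bytes / (BYTES_PER_SAMPLE * channels)
    return 1000.0 * frames / sample_rate


def decode_pcm_s16le(chunk: bytes) -> tuple:
    """Unpack raw bytes into signed 16-bit samples.

    A trailing odd byte is dropped here for simplicity; a real player
    would carry it over and prepend it to the next chunk.
    """
    usable = len(chunk) - (len(chunk) % BYTES_PER_SAMPLE)
    count = usable // BYTES_PER_SAMPLE
    return struct.unpack("<%dh" % count, chunk[:usable])
```

For example, a 4800-byte chunk holds 2400 samples, which is 100 ms of audio at 24000 Hz.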

Things to consider

Latency: total end-to-end latency is roughly STT translation latency + TTS time-to-first-audio. Keep TTS streams short (one utterance each) and start them eagerly.
Utterance boundaries: enable endpoint detection on the STT side and use the final <end> token to close the current TTS stream.
Voice consistency: Soniox voices work with all 60+ supported languages, so you can keep the same voice across translation targets.
Two-way mode: for bilingual conversations, you can maintain separate TTS streams per direction and pick which to play based on the translation token's language field.
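The utterance-boundary and two-way points above can be combined into one routing step. The Python sketch below decides, for each STT token, whether to open a TTS stream, send text to it, or close it. It is an illustration under assumptions: the token fields (`text`, `is_final`, `translation_status`, `language`), the literal `<end>` text, and the action tuples are not confirmed by this page, and `plan_tts_actions` is a hypothetical helper. Your app would map each action onto real TTS connection calls.

```python
from typing import List, Set, Tuple

Action = Tuple[str, str, str]  # (kind, language, text)
END_TOKEN = "<end>"  # assumed text of the endpoint token


def plan_tts_actions(token: dict, open_streams: Set[str]) -> List[Action]:
    """Return the TTS actions implied by one STT token.

    open_streams holds the languages that currently have an open TTS
    stream and is updated in place.
    """
    if token.get("text") == END_TOKEN:
        # Utterance finished: close every stream that is still open.
        actions = [("close", lang, "") for lang in sorted(open_streams)]
        open_streams.clear()
        return actions
    if token.get("translation_status") != "translation" or not token.get("is_final"):
        return []  # ignore source-language and provisional tokens
    lang = token.get("language", "und")
    actions: List[Action] = []
    if lang not in open_streams:
        # Start the stream eagerly, on the first translated token.
        open_streams.add(lang)
        actions.append(("open", lang, ""))
    actions.append(("send", lang, token["text"]))
    return actions


# Made-up token sequence: one Spanish utterance, then the start of an English one.
streams: Set[str] = set()
planned: List[Action] = []
for tok in [
    {"text": "Hola", "is_final": True, "translation_status": "translation", "language": "es"},
    {"text": " mundo", "is_final": True, "translation_status": "translation", "language": "es"},
    {"text": "<end>", "is_final": True},
    {"text": "Hi", "is_final": True, "translation_status": "translation", "language": "en"},
]:
    planned.extend(plan_tts_actions(tok, streams))
```

Keeping the decision logic separate from the network code makes it easy to test the routing without opening any connections.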
