Build a real-time speech-to-speech translator with Soniox

May 15, 2026 by Soniox Team

Voiceover translation is a pattern most people already recognise from TV documentaries and news interviews: a translator speaks over the original audio, slightly delayed, in the viewer's language. You hear a voice in your own language while the speaker keeps talking. In broadcast, that is a human translator, with delay, and only for a handful of language pairs at a time. In software, the same pattern has historically meant batch translation or sentence-by-sentence apps that insert a pause between every turn, which is what breaks the conversation for meetings, support calls, voice agents, and travel.

The bar for a useful product is voice in, voice out, in real time, across the languages people actually speak. Speech-to-speech translation, automated, in every language. That requires three key pieces: real-time speech-to-text transcription, real-time translation, and real-time text-to-speech. Soniox provides all three under the same API.

This post walks through a small reference demo that wires them together over raw WebSockets, with no SDK, no LLM, and no audio post-processing. It uses a Python backend and a vanilla HTML/JS frontend, with the full demo code available.

Translate speech to text

The first half of the loop is turning audio into translated text. Soniox real-time STT transcribes streaming audio in 60+ languages with native-speaker accuracy, automatic language identification, mid-sentence language switching, speaker awareness, and accurate recognition of names, numbers, emails, addresses, IDs, and other alphanumerics. Those are the details that decide whether a transcript is useful in production or not.

Soniox real-time translation is not a separate model or endpoint. It is an extension of the same real-time STT streaming API, enabled with a single translation field on the config message. Once enabled, every result Soniox returns includes both the transcribed text in the original language and the translated text in the target language, tagged with translation_status of "original" or "translation". All within the same WebSocket stream.

Translation tokens stream in mid-sentence. Soniox does not wait for an end-of-sentence boundary to produce them, so the translated transcript updates word by word as the speaker talks. That is the property that makes voiceover-style audio output feel responsive. The target language is picked per session and covers 60+ languages out of the box (view API reference docs for configuration params).
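Concretely, a result with translation enabled carries both kinds of tokens. The shape below is illustrative only; field names other than text and translation_status are assumptions to check against the API reference:

# Illustrative shape of one STT result with translation enabled.
result = {
    "tokens": [
        {"text": "Hello", "translation_status": "original", "language": "en"},
        {"text": "Hola", "translation_status": "translation", "language": "es"},
    ]
}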

In the demo, the STT side is a single config message followed by audio bytes. The field that does the heavy lifting is translation:

stt_config = {
    "api_key": SONIOX_API_KEY,
    "model": "stt-rt-v4",
    "audio_format": "auto",             # let Soniox detect the incoming format
    "enable_endpoint_detection": True,  # emit <end> tokens at utterance boundaries
    "max_endpoint_delay_ms": 500,
    "translation": {"type": "one_way", "target_language": target_lang},
}
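The surrounding connection logic is only a few lines. Here is a minimal sketch using the websockets package; the endpoint URL is an assumption to confirm in the API reference, and audio_chunks stands in for the audio bytes forwarded from the browser:

import asyncio
import json
import websockets

# Hypothetical endpoint URL; confirm the real one in the Soniox docs.
STT_URL = "wss://stt-rt.soniox.com/transcribe-websocket"

async def run_stt(stt_config, audio_chunks):
    async with websockets.connect(STT_URL) as stt_ws:
        # One config message first, then raw audio bytes.
        await stt_ws.send(json.dumps(stt_config))

        async def send_audio():
            async for chunk in audio_chunks:
                await stt_ws.send(chunk)

        async def read_tokens():
            async for message in stt_ws:
                for token in json.loads(message).get("tokens", []):
                    # "translation" tokens carry the target-language text;
                    # "original" tokens mirror the speaker's language.
                    print(token.get("translation_status"), token.get("text"))

        await asyncio.gather(send_audio(), read_tokens())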

Translate speech to speech

We now have a WebSocket connection streaming transcribed and translated text. For a real-time voiceover in the listener's language, that text has to become spoken audio quickly. Soniox real-time TTS is the piece that closes the loop.

Soniox TTS generates high-fidelity speech in 60+ languages with native-speaker quality, hallucination-free generation, correct pronunciation of names, foreign words, and alphanumerics, and streaming generation that begins before a sentence is complete. The streaming behaviour is what makes voiceover-style real-time output viable: as soon as the first translated tokens arrive, TTS starts producing audio for them, instead of waiting for the full sentence.

Wired together, the result is voiceover-style speech-to-speech translation. As the translated transcript streams in, it is piped into a TTS stream, and the listener hears the translation in the chosen voice while the speaker is still talking. Same shape as a TV news voiceover, except it runs automatically, and the same pattern scales to every one of the 60+ languages Soniox supports. The full set of supported voices and languages and the TTS API reference live in the docs.

Paired with a microphone input, the same loop powers a remote conversation with someone who speaks a different language. They hear you in their language with minimal delay, without ever reading subtitles while you talk.

Because real-time STT, translation, and TTS all run on Soniox, the loop uses one API key, one billing surface, and one consistent set of supported languages. The same platform handles high-concurrency real-time workloads, regional deployments, and the infrastructure side of running voice at scale, so a demo like the one below has a credible path from local script to production system without swapping vendors.

Demo and code walkthrough

What the demo does

You open the page, pick a target language and a voice, and click Start. The audio transcript appears on the left. The translation appears on the right. The translation plays through your speakers as a voiceover in the chosen voice, while the original audio plays in the background (or while you are talking into the microphone). Audio output starts before you finish the phrase, and the translated text updates word by word as Soniox streams tokens back.

Why raw WebSockets and not the SDK

Soniox publishes Python, Node, Web, React, and React Native SDKs that would shrink this demo by a couple of files. We deliberately did not use them here.

The point of the project is to show how the two real-time APIs connect to each other: what messages get sent, when streams open and close, how text from STT translation gets piped into TTS. That is exactly the wiring an SDK abstracts away. If you want to understand what your SDK is doing under the hood, or you are building on a stack the SDKs do not cover, this is the version of the code you want to read.

For production work, use the SDKs.

Architecture

The browser is a thin client: capture the mic, stream audio bytes to our backend, render incoming text, play incoming audio. All Soniox protocol code lives in Python.

Browser              Python backend              Soniox
   │                       │                        │
   ├─ WS /ws/translate ──▶ │                        │
   │  audio bytes ───────▶ │ forwards to STT ─────▶ │
   │                       │  ◀── token JSON ────── │
   │  ◀── token JSON ──    │                        │
   │                       │ ─ WS tts-rt ─────────▶ │
   │                       │  text chunks ────────▶ │
   │                       │  ◀── audio (base64) ── │
   │  ◀── PCM binary ──    │  (decoded)             │

The backend is a single FastAPI WebSocket endpoint, translation_websocket, that owns one browser session end-to-end. When a browser connects, it accepts the WebSocket, reads the target language, voice, and settings from the query string, opens the outbound WebSockets to Soniox (STT always, TTS only if spoken translation is enabled), sends the initial config messages, and hands off to five concurrent coroutines that run for the lifetime of the session via asyncio.gather (a condensed sketch follows below):

  • pipe_browser_audio_to_stt: forwards mic audio bytes from browser to STT.
  • handle_stt: reads STT results, forwards them to the browser for display, and pushes translation tokens onto a queue for TTS.
  • tts_sender: pulls tokens from the queue, manages the TTS stream lifecycle, and sends text chunks.
  • pipe_tts_to_browser: reads audio from TTS, base64-decodes it, and forwards it to the browser as binary frames.
  • tts_keepalive: sends a {"keep_alive": true} message to the TTS connection on a fixed interval so idle streams do not time out.

When the browser disconnects, translation_websocket's finally block closes both Soniox WebSockets, which cleanly terminates any in-flight streams server-side.
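Condensed, the handler looks roughly like this. The endpoint URLs and stt_config are as in the STT sketch above, the coroutine bodies are elided, and the exact signatures are assumptions; only the coroutine names come from the demo:

from fastapi import FastAPI, WebSocket
import asyncio
import json
import websockets

app = FastAPI()

@app.websocket("/ws/translate")
async def translation_websocket(browser: WebSocket):
    await browser.accept()
    target_lang = browser.query_params.get("target_language", "en")

    stt_ws = await websockets.connect(STT_URL)  # hypothetical URL, see docs
    tts_ws = await websockets.connect(TTS_URL)  # opened only if spoken translation is on
    tts_queue: asyncio.Queue = asyncio.Queue()
    tts_state = {"current_stream_id": None}
    tts_terminated = asyncio.Event()
    tts_terminated.set()  # no TTS stream in flight yet
    try:
        await stt_ws.send(json.dumps(stt_config))
        await asyncio.gather(
            pipe_browser_audio_to_stt(browser, stt_ws),
            handle_stt(stt_ws, browser, tts_queue),
            tts_sender(tts_ws, tts_queue, tts_state, tts_terminated),
            pipe_tts_to_browser(tts_ws, browser, tts_terminated),
            tts_keepalive(tts_ws),
        )
    finally:
        # Closing the Soniox sockets terminates any in-flight streams.
        await stt_ws.close()
        await tts_ws.close()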

One TTS stream per utterance

The Soniox real-time TTS WebSocket supports up to 5 concurrent streams per connection, each identified by a stream_id. Each stream has its own config message and its own end-of-input handshake (text_end: true → audio_end: true → terminated: true).
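In terms of messages, one stream's lifecycle looks roughly like this. Only stream_id, text, text_end, audio_end, and terminated come from the walkthrough above; the config fields are assumptions to check against the TTS API reference:

# Client → Soniox: fresh config opens the stream (config fields assumed).
open_stream = {"stream_id": 1, "voice": "<voice>", "language": "es"}
# Client → Soniox: translated text, sent in chunks as tokens arrive.
text_chunk = {"stream_id": 1, "text": "Hola, "}
# Client → Soniox: end of input for this utterance.
end_of_input = {"stream_id": 1, "text_end": True}
# Soniox → client: base64 audio chunks for stream 1, then
# {"stream_id": 1, "audio_end": true} and finally {"stream_id": 1, "terminated": true}.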

We use one stream per utterance: open on the first translation token, close when the STT side emits an <end> token, and wait for the previous stream to fully drain before opening the next one. Sequential streams mean utterance N+1's audio never starts until utterance N's audio has finished, which prevents overlap.

async def tts_sender(tts_ws, tts_queue, tts_state, tts_terminated):
    while True:
        kind, payload = await tts_queue.get()
        if kind == "text":
            if tts_state["current_stream_id"] is None:
                # wait for the previous stream to fully drain
                await tts_terminated.wait()
                # open new stream with a fresh config message
                ...
            await tts_ws.send(json.dumps({"text": payload}))  # stream_id etc. elided
        elif kind == "end":
            # send text_end: true, reset state
            ...

The coordination between tts_sender and pipe_tts_to_browser is a single asyncio.Event that flips set/cleared as streams drain.
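A sketch of the drain side, with message fields assumed to match the handshake above:

import base64

async def pipe_tts_to_browser(tts_ws, browser, tts_terminated):
    async for message in tts_ws:
        msg = json.loads(message)
        if msg.get("audio"):
            # Soniox returns audio base64-encoded; the browser receives raw binary.
            await browser.send_bytes(base64.b64decode(msg["audio"]))
        if msg.get("terminated"):
            # Current stream fully drained: tts_sender may open the next one.
            tts_terminated.set()

tts_sender clears the event when it opens a new stream, so the next utterance's audio is held back until this handler sees terminated.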

Check out the demo source code and read the reference docs to get started.

Happy building!