Speech-to-speech translation

Overview

The Soniox Speech-to-speech Translation demo app shows how to combine the Soniox real-time speech-to-text with translation API and the real-time text-to-speech API, both served over WebSocket, into a complete speech-to-speech translation pipeline: speak in one language and hear the translation in another, in real time.

This is a reference implementation for developers who want to learn how to wire the two real-time APIs together at the protocol level, without an SDK. The backend is a small FastAPI service; the frontend is a vanilla HTML/JS page.

Features

Stream audio from an audio file or your mic to Soniox in real time
Live transcription in 60+ languages, with automatic source-language detection
Mid-sentence speech translation to 60+ languages - translation tokens stream as you talk
Live spoken translation through one of the Soniox voices, played back to you
Optional speaker diarization and language identification
Toggle for text-only translation mode that skips TTS entirely

Usage flow

Pick a target language and a voice in the sidebar
Tap Start talking to begin streaming from your mic
The original transcript appears on the left; the translation appears on the right, word by word
The translated speech plays through your speakers in the chosen voice
Tap Stop to end the session
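
The two-column display in the steps above can be sketched as a small routing function. This is a minimal illustration, not the demo's actual code: the message shape (a `tokens` list whose items carry `text` and a `translation_status` field) is an assumption made for this sketch, and `route_tokens` is a hypothetical helper name.

```python
import json


def route_tokens(message: str, columns: dict[str, list[str]]) -> None:
    """Append each token's text to the 'original' or 'translation' column."""
    payload = json.loads(message)
    for token in payload.get("tokens", []):
        # Assumed field: "translation_status" == "translation" marks translated text.
        is_translation = token.get("translation_status") == "translation"
        side = "translation" if is_translation else "original"
        columns[side].append(token.get("text", ""))


# Example message with one original token and one translated token.
columns = {"original": [], "translation": []}
sample = json.dumps({
    "tokens": [
        {"text": "Hola", "translation_status": "original"},
        {"text": "Hello", "translation_status": "translation"},
    ]
})
route_tokens(sample, columns)
print("".join(columns["original"]), "|", "".join(columns["translation"]))
```

Because tokens are appended as they arrive, each column grows word by word, which matches the behavior described in the usage flow.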

Architecture

Server (Python / FastAPI): Holds your Soniox API key, accepts a WebSocket connection from the browser, and proxies audio and tokens between the browser and Soniox. It also manages the per-utterance TTS stream lifecycle, pre-warming, and connection keepalive.
Frontend (vanilla HTML / JS): Captures audio from a file or the microphone with MediaRecorder, streams the bytes to the backend over WebSocket, renders incoming token JSON into the transcript columns, and plays incoming PCM audio through the Web Audio API.
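
The proxying role of the server can be illustrated with a small relay sketch. It is not the demo's implementation: `asyncio.Queue` objects stand in for the browser and Soniox WebSocket connections, and the names `pump`, `relay`, and the `END` sentinel are hypothetical. The sketch shows only the general pattern of forwarding both directions concurrently.

```python
import asyncio

END = None  # sentinel marking the end of a stream (assumption for this sketch)


async def pump(source: asyncio.Queue, sink: asyncio.Queue) -> None:
    """Forward items from source to sink until the END sentinel arrives."""
    while True:
        item = await source.get()
        await sink.put(item)
        if item is END:
            return


async def relay(browser_in, upstream_out, upstream_in, browser_out) -> None:
    """Run both directions at once: audio goes up, tokens and audio come down."""
    await asyncio.gather(
        pump(browser_in, upstream_out),
        pump(upstream_in, browser_out),
    )


def drain(queue: asyncio.Queue) -> list:
    """Collect everything currently waiting in a queue."""
    items = []
    while not queue.empty():
        items.append(queue.get_nowait())
    return items


async def demo() -> tuple[list, list]:
    browser_in, upstream_out = asyncio.Queue(), asyncio.Queue()
    upstream_in, browser_out = asyncio.Queue(), asyncio.Queue()
    # Simulated microphone chunks from the browser.
    for chunk in (b"audio-1", b"audio-2", END):
        browser_in.put_nowait(chunk)
    # Simulated token message coming back from the upstream service.
    for message in ('{"tokens": []}', END):
        upstream_in.put_nowait(message)
    await relay(browser_in, upstream_out, upstream_in, browser_out)
    return drain(upstream_out), drain(browser_out)


sent_upstream, sent_to_browser = asyncio.run(demo())
print(sent_upstream, sent_to_browser)
```

In a real service each queue would be replaced by a WebSocket read or write, and the API key would be attached only on the server side of the upstream connection, so it never reaches the browser.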

The source code is available in our GitHub examples repo.
