Speech-to-speech translation: the full pipeline explained

You speak a sentence in English, and a beat later the same sentence comes back in Mandarin, in a synthesized voice, ready to play to someone who does not speak a word of English. That is speech-to-speech translation, the closest thing yet to the universal translator of science fiction. Under the hood it is three hard real-time problems bolted together, each adding its own delay and failure modes.

This walkthrough builds the pipeline stage by stage, then covers the choices that separate a clumsy build from a good one.

Recognize the source

The pipeline starts with streaming recognition of the source language. Audio flows in, and provisional words flow out, as in any real-time transcription.

The key choice is when to pass words downstream. Wait for whole final sentences and you add latency. Pass partial results and translation starts sooner, but you risk feeding it words that still change. This is the first place the latency-versus-accuracy trade-off appears, and it returns at every stage.
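One common middle ground is to release a word only after it has stayed unchanged across several consecutive partial results. The sketch below is a minimal illustration of that idea, not any particular recognizer's API: `commit_stable_words` and the `agree` parameter are hypothetical names, and partial results are assumed to arrive as plain word lists.

```python
from typing import Iterable, Iterator, List


def stable_prefix(hypotheses: List[List[str]]) -> List[str]:
    """Longest word prefix shared by every hypothesis in the list."""
    if not hypotheses:
        return []
    prefix = hypotheses[0]
    for hyp in hypotheses[1:]:
        n = 0
        while n < min(len(prefix), len(hyp)) and prefix[n] == hyp[n]:
            n += 1
        prefix = prefix[:n]
    return prefix


def commit_stable_words(
    partials: Iterable[List[str]], agree: int = 2
) -> Iterator[List[str]]:
    """Yield newly released words once `agree` consecutive partials share them."""
    window: List[List[str]] = []
    committed = 0  # number of words already passed downstream
    for hyp in partials:
        window = (window + [hyp])[-agree:]
        if len(window) < agree:
            continue
        prefix = stable_prefix(window)
        if len(prefix) > committed:
            yield prefix[committed:]
            committed = len(prefix)


# Assumed partial results; the recognizer briefly mishears "to" as "too".
partials = [
    ["I"],
    ["I", "want"],
    ["I", "want", "too"],
    ["I", "want", "to", "go"],
    ["I", "want", "to", "go", "home"],
]
released = list(commit_stable_words(partials))
print(released)
```

Here the misheard "too" never reaches translation, at the cost of holding each word back by one partial result. Raising `agree` trades more latency for fewer retractions. A real system would also flush the remaining words when the recognizer marks the sentence final, which this sketch leaves out.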

Translate to the target language

The recognized words go into translation, which produces target-language text. Running live, it inherits the word-order problem from real-time speech translation: languages order their words differently, so a word needed early in the target sentence may arrive late in the source. The translator therefore waits for a clause to resolve before committing, and revises earlier output as later words arrive.
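The "wait for a clause to resolve" half of that behavior can be sketched as a buffer that only calls the translator at a clause boundary. This is an illustration under stated assumptions: the recognizer is assumed to attach punctuation to words, `translate_by_clause` is a hypothetical helper, and `fake_translate` is a stand-in rather than a real translation model.

```python
from typing import Callable, Iterable, Iterator, List

# Assumption: the recognizer emits punctuation attached to the preceding word.
CLAUSE_END = (",", ".", "?", "!", ";")


def translate_by_clause(
    words: Iterable[str], translate: Callable[[str], str]
) -> Iterator[str]:
    """Buffer source words and translate each clause once it has resolved."""
    buffer: List[str] = []
    for word in words:
        buffer.append(word)
        if word.endswith(CLAUSE_END):
            yield translate(" ".join(buffer))
            buffer = []
    if buffer:  # flush whatever is left when the stream ends
        yield translate(" ".join(buffer))


def fake_translate(clause: str) -> str:
    """Stand-in for a real translation model; it only tags the clause."""
    return f"<zh:{clause}>"


source_words = "When the train arrives, call me.".split()
clauses_out = list(translate_by_clause(source_words, fake_translate))
print(clauses_out)
```

Committing per clause keeps the output stable but delays it by the length of the clause. Systems that commit earlier have to support revising what they already emitted, which this sketch does not attempt.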

Whether recognition and translation are two separate models or one fused model is the cascaded versus end-to-end question. The chain shown here is cascaded; a direct model would collapse the two into one step.

Synthesize the target speech

The translated text flows into streaming TTS, which emits target-language audio as it arrives, so playback can begin before the sentence is complete.

Now the chain is complete: source audio in at recognition, target audio out at synthesis, with translation in the middle.

flowchart LR
    A[Source speech] --> B[Recognize]
    B --> C[Translate]
    C --> D[Synthesize]
    D --> E[Target speech]

The cascaded speech-to-speech pipeline. Each stage streams into the next; latency is the sum, minus what overlaps.
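One way to picture "each stage streams into the next" is as a chain of lazy generators. The sketch below uses stand-in stages only: `recognize`, `translate`, and `synthesize` are hypothetical placeholders that pass text through, not real models, and each audio chunk is assumed to hold exactly one clause.

```python
from typing import Iterable, Iterator, List


def recognize(audio_chunks: Iterable[bytes]) -> Iterator[str]:
    """Stand-in recognizer: pretends each audio chunk decodes to one clause."""
    for chunk in audio_chunks:
        yield chunk.decode("utf-8")


def translate(clauses: Iterable[str]) -> Iterator[str]:
    """Stand-in translator: tags the clause instead of translating it."""
    for clause in clauses:
        yield f"[zh] {clause}"


def synthesize(clauses: Iterable[str]) -> Iterator[bytes]:
    """Stand-in synthesizer: returns bytes where real audio would be."""
    for clause in clauses:
        yield clause.encode("utf-8")


def speech_to_speech(audio_chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Cascaded chain: recognition feeds translation, which feeds synthesis."""
    return synthesize(translate(recognize(audio_chunks)))


trace: List[str] = []


def microphone() -> Iterator[bytes]:
    """Simulated live input that records when each clause is captured."""
    for text in ("hello there", "how are you"):
        trace.append(f"captured: {text}")
        yield text.encode("utf-8")


for audio in speech_to_speech(microphone()):
    trace.append(f"played: {audio.decode('utf-8')}")

print("\n".join(trace))
```

The trace shows the first clause being played before the second is even captured, which is the property the chain depends on. A production system would run the stages concurrently rather than pulling one item at a time, but the data flow is the same.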

Latency

Three real-time stages in series mean three latencies in series. Recognition, translation, and synthesis each add their own delay, and the naive total is their sum, which can push past the delay a conversation tolerates. This is the same arithmetic as the voice agent latency budget.

The defense is pipelining. Every stage streams, so translation renders the first clause while recognition is still on the second, and synthesis speaks the first clause while translation works the next. Done well, the stages overlap and perceived latency sits near the longest single stage plus a fixed lag, well below the full sum. Done badly, with each stage waiting for the previous to finish, the delays add up and it feels like a slow relay.
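The difference between the two cases can be worked through with a small scheduling calculation. The numbers below are illustrative assumptions, not measurements: three clauses, each costing 300 ms of recognition, 200 ms of translation, and 400 ms of synthesis, with all source audio treated as already available (a simplification, since live speech arrives over time).

```python
from typing import List, Sequence


def serial_finish_time(clauses: Sequence[Sequence[int]]) -> int:
    """Total time when each stage waits for the previous one to finish everything."""
    return sum(sum(durations) for durations in clauses)


def pipelined_finish_times(clauses: Sequence[Sequence[int]]) -> List[int]:
    """Time at which each clause leaves the last stage when stages overlap."""
    n_stages = len(clauses[0])
    stage_free = [0] * n_stages  # when each stage finishes its current clause
    finished: List[int] = []
    for durations in clauses:
        ready = 0  # when this clause becomes available to the next stage
        for stage, duration in enumerate(durations):
            start = max(ready, stage_free[stage])
            ready = start + duration
            stage_free[stage] = ready
        finished.append(ready)
    return finished


# Assumed per-clause costs in milliseconds: (recognize, translate, synthesize).
clauses = [(300, 200, 400)] * 3

serial_total = serial_finish_time(clauses)
pipelined = pipelined_finish_times(clauses)
print(serial_total)  # 2700
print(pipelined)     # [900, 1300, 1700]
```

Under these assumptions the first clause comes out after 900 ms, the one-time fill delay, and each later clause follows 400 ms after the previous one, which is the duration of the slowest stage. The serial version takes 2700 ms to get through the same three clauses.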

Speaker identity

One choice shapes the whole experience: the target speech can come out in a generic TTS voice, or in a voice that resembles the original speaker. Cross-lingual voice transfer carries the speaker's vocal identity across the language boundary, so the Mandarin output sounds like you speaking Mandarin rather than a stranger reading your words. This uses the same identity-versus-language separation behind voice cloning, and raises the same consent questions. Reproducing anyone's voice in another language, even the speaker's own, is a capability to use deliberately, not by default.
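In configuration terms, "deliberately, not by default" can be read as an opt-in flag guarded by a consent check. The sketch below is one possible shape for that, with every name hypothetical: `VoiceConfig`, `select_voice`, the `generic-zh` voice label, and the `consent_record_id` field are invented for illustration and do not belong to any real TTS API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VoiceConfig:
    generic_voice: str = "generic-zh"        # hypothetical stock voice name
    clone_speaker: bool = False              # voice transfer is opt-in
    consent_record_id: Optional[str] = None  # hypothetical proof of consent


def select_voice(config: VoiceConfig, speaker_id: str) -> str:
    """Return the voice label the synthesizer should use for this speaker."""
    if not config.clone_speaker:
        return config.generic_voice
    if not config.consent_record_id:
        raise PermissionError("voice transfer requested without recorded consent")
    return f"cloned:{speaker_id}"


default_voice = select_voice(VoiceConfig(), "alice")
cloned_voice = select_voice(
    VoiceConfig(clone_speaker=True, consent_record_id="consent-001"), "alice"
)
print(default_voice, cloned_voice)
```

With no configuration at all the generic voice is used, and asking for the speaker's voice without a consent record fails loudly instead of silently falling back.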

Prosody preservation

Recognition discards how something was said and keeps only the words, so by the synthesis step the emphasis, urgency, and emotion of the original are gone, and the synthesized voice invents its own prosody from text. For flat informational content this is fine. For anything expressive, a heated negotiation or a tearful message, it flattens the delivery, translating the words but not the feeling. Preserving prosody across the pipeline, carrying emphasis and emotion from source to target, is an active research frontier and one of the main reasons direct speech-to-speech models are pursued: they still have the acoustic signal when they decide the output.

When the stages overlap properly and the voice carries across, two people who share no language can hold a conversation in real time, each hearing the other in the other's own voice. That is the live case. Point the same pipeline at recorded media, add timing and lip constraints, and you get AI dubbing.

Common questions

What is speech-to-speech translation?

A system that takes spoken input in one language and produces spoken output in another: voice in, voice out. It is usually built by chaining three real-time stages (recognition, translation, and synthesis), though direct speech-to-speech models that skip the intermediate text also exist. The result lets two people who share no language converse in real time.

Why does speech-to-speech translation feel laggy?

It runs three real-time stages in series, and their latencies add up. The fix is pipelining: each stage streams, so translation works on the first clause while recognition handles the next, and synthesis speaks the first clause while translation continues with the second. Overlapping the stages keeps perceived latency near the longest single stage rather than the full sum.

Can the translated speech keep my own voice?

Yes, with cross-lingual voice transfer, which carries your vocal identity into the target language so the output sounds like you speaking it. It uses the same technology as voice cloning and raises the same consent considerations, so it is a deliberate choice, not a default. Otherwise the output uses a standard synthetic voice.

Why does the translated voice sound flat compared to the original?

Recognition keeps only the words and discards how they were said, so by the synthesis stage the original emphasis and emotion are gone, and the voice generates its own prosody from text. Preserving delivery across the pipeline is an active research problem and one reason direct speech-to-speech models, which keep the audio, are pursued.
