Speech translation explained: how it works, end to end

Imagine you are interpreting a meeting from German into English, live, and the speaker begins: "Ich habe den Vertrag gestern Abend nicht ..." You have heard "I did not ... the contract yesterday evening," but the verb that says what they did not do has not arrived yet. German puts it at the end. They could be about to say unterschrieben (signed), gelesen (read), or gefunden (found). Until that final word lands, you do not have a sentence you can speak. You can wait, and fall behind, or you can guess the verb, and risk telling the room the wrong action: that the contract was not signed when the speaker meant it was not read or not found.

That gap is the central problem of speech translation. For a machine it shows up as a scheduling conflict between three components (speech recognition, translation, and speech synthesis) that each want to run as soon as possible and cannot, because the information they need arrives in the wrong order.

Cascaded speech translation

The traditional and still widely used design runs the three steps as separate stages, each handing its output to the next. Audio goes into a speech-to-text recognizer, the transcript into a machine translation model, and the translated text into a text-to-speech synthesizer. This is the cascaded pipeline, and its appeal is that every box is a mature, separately trainable system you can improve or swap on its own.

flowchart LR
    A[Audio lang A] --> B[STT recognize]
    B --> C[Text lang A]
    C --> D[MT translate]
    D --> E[Text lang B]
    E --> F[TTS speak]
    F --> G[Audio lang B]

The cascade: each stage is a finished system, and the seams between them are where errors compound.
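
As a rough illustration of that hand-off, the sketch below wires three stand-in functions together. The names `cascade`, `toy_stt`, `toy_mt`, and `toy_tts` are hypothetical and do not come from any real library; each stand-in returns canned output so that only the plumbing between stages is on show.

```python
from typing import Callable, Optional

Stage = Callable[[str], str]

def toy_stt(audio: str) -> str:
    # Stand-in recognizer: returns a canned transcript for a known clip id.
    return {"clip-1": "guten Morgen"}.get(audio, "")

def toy_mt(text: str) -> str:
    # Stand-in German-to-English translator backed by a lookup table.
    return {"guten Morgen": "good morning"}.get(text, text)

def toy_tts(text: str) -> str:
    # Stand-in synthesizer: tags the text instead of producing audio.
    return f"<audio:{text}>"

def cascade(audio: str, stt: Stage, mt: Stage, tts: Optional[Stage] = None) -> str:
    transcript = stt(audio)        # audio in language A -> text in language A
    translation = mt(transcript)   # text in language A -> text in language B
    # Without a synthesizer the pipeline stops at translated text.
    return tts(translation) if tts else translation

print(cascade("clip-1", toy_stt, toy_mt, toy_tts))  # <audio:good morning>
print(cascade("clip-1", toy_stt, toy_mt))           # good morning
```

Because each stage is only a function from string to string here, any one of them can be replaced without touching the others, which mirrors the swap-one-box appeal of the cascade. Note that `toy_mt` receives only the transcript and never the audio.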

The weakness is the seams. If the recognizer mishears "ICU" as "I see you," the translator faithfully translates the wrong words, and the synthesizer says them in a confident voice; the translation model never sees the original audio that might have disambiguated the homophone. Tighter integration and richer recognizer output, such as n-best lists and confidence scores, can reduce this without erasing it.^[1]^[14]
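
One way to picture how richer recognizer output helps is a rescoring step between recognition and translation. The sketch below is an assumption-laden toy, not a description of any cited system: `Hypothesis`, `pick_hypothesis`, the `domain_terms` set, and the multiplicative `boost` are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Set

@dataclass(frozen=True)
class Hypothesis:
    text: str
    confidence: float  # recognizer's own score in the range 0..1

def pick_hypothesis(nbest: Iterable[Hypothesis],
                    domain_terms: Set[str],
                    boost: float = 2.0) -> Hypothesis:
    """Choose from an n-best list, favoring hypotheses that contain a domain term."""
    def score(h: Hypothesis) -> float:
        in_domain = any(term in h.text.split() for term in domain_terms)
        # Assumed scoring rule: multiply confidence by a fixed boost.
        return h.confidence * (boost if in_domain else 1.0)
    return max(nbest, key=score)

nbest = [Hypothesis("I see you", 0.55), Hypothesis("ICU", 0.45)]
print(pick_hypothesis(nbest, {"ICU"}).text)  # ICU
print(pick_hypothesis(nbest, set()).text)    # I see you
```

With only the single best transcript, the second-ranked "ICU" would never reach the translator. Keeping the list and the scores gives a later stage a chance to recover, although a wrong boost can just as easily pick the wrong hypothesis.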

The alternative is an end-to-end model that maps source audio to target text (or target audio) in one trained network, with no transcript in the middle. It can use acoustic cues the cascade throws away, and it has fewer seams to leak errors through.^[11]^[12]^[9] The cost is that it needs a large amount of paired audio-to-translation data, far rarer than the separate datasets the cascade is built from.^[3]^[15]^[16] Which design wins depends on the language pair and the use case, covered in cascaded vs end-to-end translation.^[13]

Text and speech output

Stop the pipeline at the translated text and you have speech-to-text translation: you speak German, you read English. This is what captions and live subtitles need, and it is usually the cheaper, lower-latency path because you skip synthesis entirely.

Run the last box and you get speech-to-speech translation: German audio in, English audio out. Now you owe the listener a voice, and a flat robotic one breaks the illusion that they are hearing the original speaker. The hard version preserves the speaker's voice, so the English sounds like them, carrying voice characteristics across a language boundary the words have already crossed.^[17]^[2] The full pipeline, including where the voice gets cloned, is laid out in speech-to-speech translation.

One-way and two-way translation

A lecture broadcast to a foreign audience is one-way: a single source language flowing to one or more targets, and the system can specialize on that direction. A conversation between two people who do not share a language is two-way: the system must detect which language is being spoken at any moment and translate in the right direction without being told, then flip when the other person answers. Two-way is harder because language detection becomes part of the live path, and a wrong guess sends the translation in the wrong direction.^[10]^[19] The modes, and how the source language gets tracked at the utterance, segment, or token level depending on whether code-switching is supported, are covered in one-way vs two-way translation.
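
The direction flip in a two-way session can be sketched as a small routing function. Everything here is hypothetical: `pick_direction`, the `min_confidence` threshold, and the policy of keeping the previous direction when detection is unsure are assumptions for illustration, and the detected language and its confidence are taken as inputs from a detector that is not shown.

```python
from typing import Optional, Tuple

Direction = Tuple[str, str]  # (source language, target language)

def pick_direction(detected_lang: str,
                   confidence: float,
                   pair: Tuple[str, str],
                   last_direction: Optional[Direction],
                   min_confidence: float = 0.6) -> Optional[Direction]:
    """Route an utterance in a two-language conversation."""
    a, b = pair
    # Assumed fallback: when detection is weak or outside the pair,
    # keep translating in the direction used for the previous utterance.
    if detected_lang not in pair or confidence < min_confidence:
        return last_direction
    target = b if detected_lang == a else a
    return (detected_lang, target)

direction = pick_direction("de", 0.9, ("de", "en"), None)
print(direction)                                            # ('de', 'en')
print(pick_direction("en", 0.9, ("de", "en"), direction))   # ('en', 'de')
print(pick_direction("en", 0.3, ("de", "en"), direction))   # ('de', 'en')
```

The last call shows the failure mode described above: a low-confidence detection leaves the system translating in the old direction even though the speaker has changed.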

Word-order constraints in real-time translation

Now back to that verb stuck at the end. In batch translation, where you have the whole recording, this is no problem: the system reads to the end, sees unterschrieben, and translates a complete thought. Streaming uncertainty disappears because the whole utterance is available, though recognition and translation can still be wrong.

Real time removes that luxury. The audio arrives a little at a time, the listener wants the translation now, and the word that fixes the meaning has not been spoken yet. You face a direct trade: wait for more context and the translation is right but late, or emit early and risk retracting it when the verb finally lands. Word-order differences make this unavoidable, and the wider the reordering between source and target, the worse the bind.^[4] The mechanics of translating before the sentence ends, including how systems revise an early guess, are the subject of real-time speech translation.
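
The cost of emitting early can be made concrete by counting how much already-shown output has to be taken back when a better hypothesis arrives. The function below is a minimal sketch under that framing; `revise` is a hypothetical name, and comparing word lists by their common prefix is an assumed simplification of how a live display might be updated.

```python
from typing import List, Tuple

def revise(shown: List[str], new: List[str]) -> Tuple[int, List[str]]:
    """Return (words_to_retract, words_to_append) that turn `shown` into `new`."""
    keep = 0
    for old_word, new_word in zip(shown, new):
        if old_word != new_word:
            break
        keep += 1
    # Everything after the shared prefix must be removed and re-emitted.
    return len(shown) - keep, new[keep:]

guess = "I did not sign the contract".split()
final = "I did not read the contract last night".split()
print(revise(guess, final))
# (3, ['read', 'the', 'contract', 'last', 'night'])

safe = "I did not".split()
print(revise(safe, final)[0])  # 0
```

Guessing the verb early forces three words to be retracted once the real verb arrives, while holding back after "I did not" retracts nothing but leaves the listener waiting.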

Machines borrow two tricks from human simultaneous interpreters. They chunk, holding output until a stretch of input forms a unit safe to commit, and they predict, letting a strong language model guess the likely completion so they can start speaking sooner. Both commit before the evidence is complete, and how well a real-time translator makes those calls largely determines its quality.^[7]^[8]
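
The chunking half of that can be sketched as a generator that holds words back until a boundary appears. This is a toy under stated assumptions: `chunk_stream`, the `is_boundary` predicate, and the `max_hold` cap are hypothetical, and punctuation stands in for whatever signal a real system uses to decide that a unit is safe to commit. The prediction half is not shown.

```python
from typing import Callable, Iterable, Iterator, List

def chunk_stream(words: Iterable[str],
                 is_boundary: Callable[[str], bool],
                 max_hold: int = 8) -> Iterator[List[str]]:
    """Hold incoming words and release them as a chunk at each boundary."""
    held: List[str] = []
    for word in words:
        held.append(word)
        # Commit at a boundary, or when the buffer is full so lag stays bounded.
        if is_boundary(word) or len(held) >= max_hold:
            yield held
            held = []
    if held:
        yield held  # flush whatever remains when the stream ends

source = "der Vertrag ist fertig , aber ich habe ihn nicht unterschrieben .".split()
for chunk in chunk_stream(source, lambda w: w in {",", "."}):
    print(" ".join(chunk))
# der Vertrag ist fertig ,
# aber ich habe ihn nicht unterschrieben .
```

The `max_hold` cap is where the trade shows up: a small cap keeps latency low but can cut a clause before its final verb, and a large cap waits for the verb at the price of lag.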

Common questions

Is speech translation just running a translator on a transcript?

That is the cascaded version, and it works, but it is not the only way. On a transcript the translator never hears the audio, so any acoustic detail that would have resolved a homophone or a name is already gone. End-to-end models keep the audio in the loop, and which approach is better depends on your languages and your latency budget.

Why is real-time translation harder than translating a recording?

The meaning-deciding words often arrive late, and a recording lets you wait for all of them while a live stream does not. With the full audio in hand, the system translates a complete sentence. In real time it must choose between waiting (accurate but laggy) and guessing (fast but sometimes wrong), a trade that gets worse the more the word order differs between the two languages.

Does speech-to-speech translation keep the speaker's voice?

It can, if the synthesis stage is built to. A basic system speaks the translation in a generic voice; a voice-preserving system carries the original speaker's vocal characteristics into the target language so the output sounds like them. The second is harder and raises consent questions, since you reproduce a person's voice saying words they never spoke.

How many language pairs does a translation system support?

It depends on whether the system pairs languages directly or routes through a shared representation. A system that handles N source languages and M targets can in principle cover N times M pairs, which is how a few dozen languages turn into thousands of directions.^[18]
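
The arithmetic is easy to check. In the sketch below, `direction_count` and `shared_components` are hypothetical helpers; the `shared` argument, which subtracts same-language pairs when a language appears on both sides, and the one-encoder-per-source, one-decoder-per-target picture of a shared representation are simplifying assumptions rather than a description of any particular system.

```python
def direction_count(n_sources: int, n_targets: int, shared: int = 0) -> int:
    """Translation directions when every source can reach every target.

    `shared` is the number of languages present on both sides, each of
    which would otherwise be counted once as a same-language pair.
    """
    return n_sources * n_targets - shared

def shared_components(n_sources: int, n_targets: int) -> int:
    """Assumed cost with a shared representation: one part per language side."""
    return n_sources + n_targets

# 50 languages, each usable as source and as target.
print(direction_count(50, 50, shared=50))  # 2450
print(shared_components(50, 50))           # 100
```

Under these assumptions, fifty languages give 2,450 directions, yet a design that routes through a shared representation needs on the order of a hundred language-specific parts rather than one model per direction.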

References

[1] Sperber, M., & Paulik, M. (2020). Speech Translation and the End-to-End Promise: Taking Stock of Where We Are. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 7409–7421.
[2] Jia, Y., Ramanovich, M. T., Remez, T., et al. (2022). Translatotron 2: High-quality direct speech-to-speech translation with voice preservation. Proceedings of the 39th International Conference on Machine Learning, PMLR 162:10120–10134.
[3] Pino, J., Puzon, L., Gu, J., Ma, X., McCarthy, A. D., et al. (2019). Harnessing indirect training data for end-to-end automatic speech translation: Tricks of the trade. Proceedings of the 16th International Workshop on Spoken Language Translation (IWSLT), 155–164.
[4] Grissom II, A., et al. (2014). Don't until the final verb wait: Reinforcement learning for simultaneous machine translation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1753–1763.
[5] Lee, J. (2002). Ear voice span in English into Korean simultaneous interpretation. Meta: Journal des traducteurs, 47(4), 586–601.
[6] Guo, Y., et al. (2023). From manual to machine: Evaluating automated ear–voice span measurement in simultaneous interpreting. Journal of Bilingualism.
[7] Seeber, K. G. (2011). Cognitive load in simultaneous interpreting: Existing theories—new models. Interpreting: International Journal of Research and Practice in Interpreting, 13(2), 176–204.
[8] Seleskovitch, D. (1978). Interpreting for International Conferences: Problems of Language and Communication. Pen and Booth.
[9] Manakul, P., Gan, W. H., Bartelds, M., Sun, G., Held, W., et al. (2026). Scaling Open Discrete Audio Foundation Models with Interleaved Semantic, Acoustic, and Text Tokens. arXiv preprint arXiv:2602.16687.
[10] Seamless Communication, Barrault, L., Chung, Y.-A., Cora Meglioli, M., Dale, D., et al. (2023). SeamlessM4T: Massively Multilingual & Multimodal Machine Translation. arXiv preprint arXiv:2308.11596.
[11] Bérard, A., et al. (2016). Listen and Translate: A Proof of Concept for End-to-End Speech-to-Text Translation. arXiv preprint arXiv:1612.01744.
[12] Bansal, S., et al. (2017). Towards Speech-to-Text Translation without Speech Recognition. Proceedings of EACL 2017.
[13] Bentivogli, L., et al. (2021). Cascade versus Direct Speech Translation: Do the Differences Still Make a Difference? Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL) 2021.
[14] Bahar, P., et al. (2020). Tight Integrated End-to-End Training for Cascaded Speech Translation. arXiv preprint arXiv:2011.12167.
[15] Di Gangi, M. A., et al. (2019). MuST-C: a Multilingual Speech Translation Corpus. Proceedings of NAACL-HLT 2019.
[16] Wang, C., Wu, A., & Pino, J. (2021). CoVoST 2 and Massively Multilingual Speech Translation. Interspeech 2021.
[17] Jia, Y., et al. (2019). Direct Speech-to-Speech Translation with a Sequence-to-Sequence Model. Interspeech 2019.
[18] Johnson, M., et al. (2017). Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. Transactions of the Association for Computational Linguistics, 5, 339–351.
[19] Soniox (2026). Speech-to-text translation. Soniox.