Real-time speech translation

Translation under partial context and word-order constraints

Updated June 29, 2026

German often puts the main verb at the end of the clause. "Ich habe gestern das Buch, das du mir empfohlen hast, gelesen" reveals only at its final word, gelesen ("read"), that the whole sentence was about reading. A translator working into English, where the verb comes early, cannot reliably produce "I read..." until that last word lands. Waiting for it adds delay; guessing it risks a correction once the real verb arrives.[3][4][5][6]

Human simultaneous interpreters work in this gap every day, and so does a real-time translation system. This page is about how a machine translates a sentence whose final, decisive word has not arrived.

Limits of word-by-word translation

Translating finished text is hard but bounded: you have the whole sentence. Translating live removes that, and the reason it hurts is word order. If two languages shared an order, you could translate word by word as words arrived. They do not. German and Japanese put verbs late; many languages reorder subjects, objects, and modifiers freely. So the correct translation of the beginning of a sentence can depend on its end, and a system that has heard only the beginning is sometimes unable to commit.

This is the same shape as partial versus final results in recognition, raised a level. There, later audio revised which words you heard. Here, later words revise how the whole clause translates, which can rewrite text already shown, not just extend it. A live translation that updates itself mid-sentence is honoring information that arrived late.
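The difference between extending shown text and rewriting it can be made concrete with a small sketch. It compares what is already on screen with a new translation hypothesis and reports how many displayed tokens must be retracted and which tokens must be appended. The function name `diff_update` and the token lists are illustrative assumptions, not taken from any particular system.

```python
from typing import List, Tuple


def diff_update(shown: List[str], new: List[str]) -> Tuple[int, List[str]]:
    """Return (number of shown tokens to retract, tokens to append).

    Hypothetical helper: keeps the longest common prefix of the two
    token lists and replaces everything after it.
    """
    common = 0
    for old_tok, new_tok in zip(shown, new):
        if old_tok != new_tok:
            break
        common += 1
    return len(shown) - common, new[common:]


# Pure extension: nothing already shown has to change.
print(diff_update(["I", "have"], ["I", "have", "the", "book"]))

# Revision: a late verb changes the English verb phrase, so text
# that was already displayed is retracted and rewritten.
print(diff_update(["I", "have", "the", "book"], ["I", "read", "the", "book"]))
```

In the first call nothing is retracted; in the second, everything after "I" is replaced, which is the mid-sentence rewrite a viewer sees.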

Choosing an output delay

Every real-time translator comes down to choosing one number: how much of the source to hear before producing the target. Wait longer and the translation is more accurate, because more of the sentence's structure is known. Wait less and it is faster but more likely to commit to something it must revise.

The textbook version is the wait-k policy: begin translating after the first k source words, then stay k words behind the speaker for the rest. A small k is responsive and error-prone; a large k is accurate and laggy.[7][8][9][10] Real systems are more adaptive than a fixed k, holding output when the sentence is structurally open (a verb is clearly still coming) and releasing it when a clause has resolved, but the underlying tension never goes away: waiting longer buys accuracy at the cost of latency.
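The fixed wait-k schedule can be written down in a few lines. The sketch below follows the usual formulation, in which target word t (counting from 1) is emitted after min(k + t - 1, source length) source words have been read. It counts whole words for readability, which is a simplifying assumption since real systems typically operate on subword tokens, and the function name `wait_k_schedule` is hypothetical.

```python
from typing import List


def wait_k_schedule(k: int, source_len: int, target_len: int) -> List[int]:
    """For each target word t (1-indexed), return how many source words
    have been read before it is emitted: min(k + t - 1, source_len)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return [min(k + t - 1, source_len) for t in range(1, target_len + 1)]


# Compare a responsive, a moderate, and a fully waiting policy
# on an 8-word source producing an 8-word target.
for k in (1, 3, 8):
    print(k, wait_k_schedule(k, source_len=8, target_len=8))
```

With k = 1 each target word follows one source word; with k = 8 on an 8-word sentence, nothing is emitted until the whole source has been heard, which is ordinary full-sentence translation.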

System architectures

Underneath, a real-time translator uses one of two architectures, which are the subject of cascaded vs end-to-end translation. A cascaded system chains streaming recognition into machine translation: words are recognized live, then translated live, two streaming stages back to back. An end-to-end system translates speech to target text directly, without writing down the source words in between. Both have to solve the same timing problem; they just package it differently.[1][2][17][18][19][20]
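As a rough structural sketch, the two architectures differ mainly in their interfaces: a cascade composes an audio-to-words stage with a words-to-words stage, while an end-to-end system is a single audio-to-words function. All type names and the toy `fake_recognize` and `fake_translate` stages below are stand-ins invented for illustration; they do no real recognition or translation.

```python
from typing import Callable, Iterable, Iterator

AudioChunk = bytes
Recognizer = Callable[[Iterable[AudioChunk]], Iterator[str]]        # audio -> source words
Translator = Callable[[Iterable[str]], Iterator[str]]               # source words -> target words
SpeechTranslator = Callable[[Iterable[AudioChunk]], Iterator[str]]  # audio -> target words


def cascade(recognize: Recognizer, translate: Translator) -> SpeechTranslator:
    """Chain two streaming stages. Generators keep the pipeline lazy,
    so translation can start before recognition has finished."""
    def run(audio: Iterable[AudioChunk]) -> Iterator[str]:
        return translate(recognize(audio))
    return run


def fake_recognize(audio: Iterable[AudioChunk]) -> Iterator[str]:
    # Stand-in: pretend each audio chunk decodes to exactly one word.
    for chunk in audio:
        yield chunk.decode("utf-8")


def fake_translate(words: Iterable[str]) -> Iterator[str]:
    # Stand-in: uppercase each word instead of translating it.
    for word in words:
        yield word.upper()


pipeline = cascade(fake_recognize, fake_translate)
print(list(pipeline([b"guten", b"morgen"])))
```

An end-to-end system would satisfy the same `SpeechTranslator` signature with one model, with no intermediate stream of source words to inspect.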

Whichever the architecture, the live output behaves like recognition's provisional stream. The translation appears incrementally, marked as still-changing or settled, and can revise as the source sentence completes. Systems that expose this often tag each piece with its source language and whether it is provisional or final, so a client can render firm translation solidly and tentative translation lightly, exactly as it would for partial recognition results.[21]
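A client that renders such a stream might look like the sketch below. The field names (`is_final`, `source_language`) and the update convention are assumptions for illustration, not any vendor's actual API: each update is assumed to carry the newly settled pieces plus the current provisional tail, which replaces the previous tail.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Piece:
    text: str
    is_final: bool        # hypothetical flag: settled vs. still-changing
    source_language: str  # hypothetical tag, e.g. "de"


@dataclass
class Caption:
    final: List[str] = field(default_factory=list)
    provisional: List[str] = field(default_factory=list)

    def apply(self, pieces: List[Piece]) -> None:
        """Append newly final pieces; replace the whole provisional tail."""
        self.provisional = []
        for piece in pieces:
            if piece.is_final:
                self.final.append(piece.text)
            else:
                self.provisional.append(piece.text)

    def render(self) -> str:
        """Show settled text plainly and tentative text in brackets."""
        firm = " ".join(self.final)
        soft = " ".join(self.provisional)
        return f"{firm} [{soft}]".strip() if soft else firm


caption = Caption()
caption.apply([Piece("I", True, "de"), Piece("have the book", False, "de")])
print(caption.render())
caption.apply([Piece("read the book", True, "de")])
print(caption.render())
```

The brackets stand in for whatever lighter styling a real interface would use. Because only the provisional tail is ever replaced, text that has been marked final never changes on screen.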

Text and speech output

Real-time speech translation splits by what it produces. If the output is text in the target language (captions for a live talk, subtitles on a call), this is speech-to-text translation, which is what most "real-time translation" features mean. If the output is speech in the target language (a translated voice), you have added synthesis on the end and entered speech-to-speech translation, which stacks a streaming TTS latency budget on top of everything above. This page is about the text-out case; the spoken case is a pipeline built from it.

flowchart LR
    A[Speech in] --> B[Recognize<br/>live]
    B --> C[Translate<br/>k words behind]
    C --> D[Target text,<br/>revised as it settles]
Live translation lags the speaker by a controlled gap, and revises as late words resolve the sentence.

Deployment considerations

A working real-time translator lets you speak and see your meaning appear in another language, continuously, a few seconds behind, while you are still talking. The lag and the occasional mid-sentence rewrite are the cost of the word-order problem, not signs that the system is broken. The rewrite is the system staying correct about late-arriving information, and the main decision left to you is where to set the latency knob for what your use case can tolerate.

Common questions

How can a system translate before the sentence is finished?

By producing a provisional translation from the words heard so far and revising it as more arrive. It stays a controlled distance behind the speaker, releasing translation when a clause has resolved and holding it when the sentence is incomplete. Some lag is unavoidable, because some target words depend on source words not yet spoken.

Why does the translation change while I am still speaking?

Because languages order words differently, the correct translation of the start of a sentence can depend on its end. When a late word arrives, such as a verb that comes last in German, the system updates the earlier translation to match. The revision is the system staying faithful to information that arrived late, not an error.

What is wait-k in simultaneous translation?

A policy where the system waits for the first k words of the source before starting to translate, then stays k words behind. A small k is fast but error-prone, and a large k is more accurate but laggier. Real systems adapt this gap to the sentence rather than fixing it, but the latency-versus-quality trade-off it captures does not go away.

Is real-time translation the same as live captions in another language?

That is one form of it: speech-to-text translation, where the output is target-language text. If instead the output is a spoken voice in the target language, you have speech-to-speech translation, which adds streaming synthesis and its own latency on top. The text-out and speech-out cases share the same hard timing problem.

References

  1. Sperber, M., & Paulik, M. (2020). Speech Translation and the End-to-End Promise: Taking Stock of Where We Are. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 7306–7317.
  2. Higuchi, H., & Takeda, K. (2025). End-to-End Speech Translation Guided by Robust Alignment. ISCA Archive.
  3. Bevilacqua, A. (n.d.). The Position of the Verb in Germanic Languages and Simultaneous Interpreting. Semantic Scholar.
  4. Chmiel, A. (2021). Effects of simultaneous interpreting experience and training on anticipation, as measured by word-translation latencies. Interpreting, 23(1), 51–77.
  5. Seeber, K. G. (2011). Cognitive load in simultaneous interpreting: Existing theories—new models. Interpreting, 13(2), 175–202.
  6. Seeber, K. G., & Kerzel, D. (2012). Cognitive load in simultaneous interpreting: Model meets data. International Journal of Bilingualism, 16(2), 209–222.
  7. Ma, X., Ma, J., & Xia, Y. (2019). STACL: Simultaneous Translation with Implicit Anticipation and Controllable Latency Using Prefix-to-Prefix Framework. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2908–2918.
  8. Elbayad, M., Besacier, L., & Barrault, L. (2020). Efficient Wait-k Models for Simultaneous Machine Translation. Proceedings of Interspeech 2020.
  9. Ma, X., Ma, J., & Xia, Y. (2019). STACL: Simultaneous Translation with Implicit Anticipation and Controllable Latency Using Prefix-to-Prefix Framework. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
  10. Elbayad, M., Besacier, L., & Barrault, L. (2020). Efficient Wait-k Models for Simultaneous Machine Translation. Proceedings of Interspeech 2020.
  11. Lee, T. H. (2002). Ear voice span in English into Korean simultaneous interpretation. Meta: Journal des Traducteurs, 47(4), 589–601.
  12. Janikowski, P., & Chmiel, A. (2025). Ear–voice span in simultaneous interpreting: Text-specific factors, interpreter-specific factors and individual variation. Interpreting, 27(1), 1–28.
  13. Moser-Mercer, B. (1997). Process models in simultaneous interpretation. Machine Translation and Translation Theory, 1(3), 3–18.
  14. AIIC (n.d.). AIIC Professional Standards. International Association of Conference Interpreters (AIIC).
  15. Shamil Translation (2026). How Many Interpreters Do You Need for a Conference?. Shamil Translation.
  16. Moser-Mercer, B. (n.d.). Prolonged turns in interpreting: Effects on quality, physiological and psychological stress (Pilot study). University of Geneva.
  17. Sperber, M., & Paulik, M. (2020). Speech Translation and the End-to-End Promise: Taking Stock of Where We Are. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
  18. Sperber, M., Neubig, G., Niehues, J., & Waibel, A. (2019). Attention-passing models for robust and data-efficient end-to-end speech translation. Transactions of the Association for Computational Linguistics, 7, 539–554.
  19. Slator (2025). Cascaded Speech Translation Systems Outperform End-to-End Models, Research Finds. Slator.
  20. NICT (2024). NICT's Cascaded and End-To-End Speech Translation Systems for IWSLT 2024 Indic Track. Proceedings of IWSLT 2024.
  21. Soniox (2026). Real-time speech-to-text translation. Soniox.