Speech-to-text latency: what sub-200ms actually means

A vendor's page says "sub-200ms latency." The figure has at least five possible meanings, and the gap between the flattering interpretation and the one you will experience can be a factor of five. Which clock starts, which clock stops, and what got left out?

Defining the start and end events

Latency is the time between two events. For streaming recognition there are several candidates, and they measure different things.

First-token latency is the time from when a word is spoken to when its first provisional guess appears on screen. This makes captions feel alive, and it is usually the smallest and most flattering number, which is why it is the one quoted.

Finalization latency is the time from speaking a word to that word becoming final, no longer subject to revision. It is larger than first-token latency, because the recognizer waits for a little future audio before it stops revising the word. If your application acts on finals, and most should, this is the number you live with.

Endpointing delay is the time the system waits after you stop talking before it declares the turn over. It is barely "model latency"; it is a tuning decision. But it lands on the user as delay all the same, and is frequently the single biggest contributor.

A "sub-200ms" claim almost always means first-token latency and almost never includes endpointing, so it can be accurate and still leave you with a system that feels a second slow.
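The three definitions above can be made concrete with a few timestamps. The following is a minimal sketch, not any vendor's API: the `WordTimeline` type, its field names, and the sample timestamps are all assumptions made for illustration, and every time is assumed to be read from one shared clock.

```python
from dataclasses import dataclass


@dataclass
class WordTimeline:
    """Timestamps in seconds on one shared clock (hypothetical fields)."""
    spoken_at: float         # when the word was spoken
    first_partial_at: float  # when its first provisional guess arrived
    final_at: float          # when the word stopped being revised


def first_token_latency_ms(w: WordTimeline) -> float:
    # Speaking -> first provisional guess.
    return (w.first_partial_at - w.spoken_at) * 1000.0


def finalization_latency_ms(w: WordTimeline) -> float:
    # Speaking -> word no longer subject to revision.
    return (w.final_at - w.spoken_at) * 1000.0


def endpointing_delay_ms(speech_ended_at: float, turn_ended_at: float) -> float:
    # End of speech -> system declares the turn over.
    return (turn_ended_at - speech_ended_at) * 1000.0


# Made-up timestamps for one word at the end of a turn.
word = WordTimeline(spoken_at=10.00, first_partial_at=10.18, final_at=10.55)
print(round(first_token_latency_ms(word)))        # 180
print(round(finalization_latency_ms(word)))       # 550
print(round(endpointing_delay_ms(10.00, 10.80)))  # 800
```

With these made-up numbers the same utterance is "sub-200ms" by the first definition and 800 milliseconds by the third.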

Components of total latency

Model latency is only part of the wall-clock delay. Two more clocks add to it.

Transport adds the network round trip between your audio and the recognizer. A streaming connection across a continent can add tens to over a hundred milliseconds before any recognition happens, paid on the audio going up and the text coming down. Where the server sits matters as much as how fast it is.

Buffering adds the time spent collecting audio before it can be processed. Audio is sent in frames, and a system that needs, say, 100 milliseconds of audio before it emits a result has built 100 milliseconds of latency into its floor by design. Some accuracy comes from lookahead, peeking at upcoming audio, which adds latency of its own.

flowchart TB
    A[Word spoken] --> B[Buffering<br/>frame + lookahead]
    B --> C[Network up]
    C --> D[Recognition<br/>first token]
    D --> E[Finalization wait]
    E --> F[Endpointing silence wait]
    F --> G[Usable, final text]

The clocks that add up to perceived latency. The quoted figure usually covers only one of these stages: recognition up to the first token.
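To see how the stages in the flowchart accumulate, here is a rough sketch of a latency budget. Every number in it is an illustrative assumption, not a measurement of any real system, and the stages are treated as strictly sequential, which is a simplification. The return trip for the text is included as its own line because transport is paid in both directions.

```python
# Illustrative per-stage delays in milliseconds. All values are assumptions.
stages_ms = {
    "buffering (frame + lookahead)": 100,
    "network up": 30,
    "recognition to first token": 180,
    "finalization wait": 200,
    "endpointing silence wait": 400,
    "network down": 30,
}

# The figure a marketing page would most likely quote.
quoted_ms = stages_ms["recognition to first token"]

# What the user waits for if the stages run one after another.
perceived_ms = sum(stages_ms.values())

for name, ms in stages_ms.items():
    print(f"{name:32s} {ms:4d} ms")
print(f"{'quoted':32s} {quoted_ms:4d} ms")
print(f"{'perceived':32s} {perceived_ms:4d} ms")
print(f"ratio: {perceived_ms / quoted_ms:.1f}x")  # ratio: 5.2x
```

In this made-up budget the quoted figure is accurate, yet it accounts for less than a fifth of the delay.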

Latency in asynchronous transcription

For batch transcription, people quote a different number: the real-time factor (RTF), how fast the system processes relative to the audio's length. An RTF of 0.1 means an hour of audio is transcribed in six minutes.

This is throughput, not latency. RTF tells you how quickly a recording is processed; it says nothing about how soon a word appears while someone is speaking, because in batch nobody is waiting on a live word. A system can have a wonderful RTF and still be unsuitable for anything live. For "how responsive does this feel in conversation," RTF is the wrong instrument.
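The arithmetic behind RTF is a single ratio. A minimal sketch, with function names chosen here for illustration:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def processing_minutes(audio_minutes: float, rtf: float) -> float:
    """How long a recording of the given length takes to process at this RTF."""
    return audio_minutes * rtf


# One hour of audio processed in six minutes.
print(real_time_factor(processing_seconds=360, audio_seconds=3600))  # 0.1
print(round(processing_minutes(audio_minutes=60, rtf=0.1), 2))       # 6.0
```

Neither function takes a timestamp for any individual word, which is the point: RTF describes the whole recording and carries no information about per-word delay.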

Latency and accuracy trade-offs

Latency is not free to reduce, because it trades against two other things.

It trades against accuracy. The future audio a recognizer waits for is what lets it resolve ambiguity, so cutting finalization and lookahead to the bone means committing to words sooner, with less context, and being wrong more often.^[1] This is a large part of why asynchronous transcription, which can use all the surrounding audio, tends to be more accurate than real-time. It also trades against stability: show words faster and they are more provisional, so the visible text rewrites itself more, which users read as flicker. Every latency setting is a position on these curves, not a free parameter.
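Stability can be given a rough number by counting how often words that were already on screen change between successive partial results. The sketch below uses a simple position-by-position comparison; this is an ad hoc measure chosen for illustration, not a standard metric, and the sequence of partials is invented.

```python
def rewritten_words(previous: list[str], current: list[str]) -> int:
    """Count words shown in `previous` that differ or vanish in `current`."""
    # Words at the same position whose text changed.
    changed = sum(1 for old, new in zip(previous, current) if old != new)
    # Words that were displayed and then dropped also count as rewrites.
    dropped = max(0, len(previous) - len(current))
    return changed + dropped


# Invented partial hypotheses for one utterance, in arrival order.
partials = [
    "I".split(),
    "I scream".split(),
    "ice cream".split(),
    "ice cream is".split(),
    "ice cream is cold".split(),
]

total = sum(rewritten_words(a, b) for a, b in zip(partials, partials[1:]))
print(total)  # 2
```

A recognizer tuned to show words sooner would tend to produce more steps like the second-to-third one here, where text the user had already read is replaced.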

Measuring application latency

To cut through a latency claim, define the two events yourself and measure end to end. Start the clock when a sound is produced (a clap, a tone, a known word) and stop it when the text you depend on, usually the final word, is in your application, over your real network, with your endpointing settings. That number, measured your way, is worth more than any figure on a marketing page, because it includes the transport, buffering, and endpointing the marketing figure left out.
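One way to structure such a measurement is sketched below. It assumes you can supply two callables of your own: one that produces the sound and one that blocks until the final text reaches your application. Both are hypothetical hooks, and the stand-ins used here only sleep so that the example runs without audio hardware or a recognizer.

```python
import statistics
import time
from typing import Callable


def measure_once(emit_sound: Callable[[], None],
                 wait_for_final: Callable[[], str]) -> float:
    """Milliseconds from starting the sound to having the final text."""
    start = time.perf_counter()
    emit_sound()       # your hook: play a clap, tone, or known word
    wait_for_final()   # your hook: block until the final text arrives
    return (time.perf_counter() - start) * 1000.0


def summarize(samples_ms: list[float]) -> dict[str, float]:
    """Median and 95th percentile; needs at least two samples."""
    cuts = statistics.quantiles(samples_ms, n=20, method="inclusive")
    return {"p50": statistics.median(samples_ms), "p95": cuts[18]}


# Stand-ins so the sketch runs on its own; replace with real hooks.
def fake_emit_sound() -> None:
    pass


def fake_wait_for_final() -> str:
    time.sleep(0.02)  # pretend the pipeline took about 20 ms
    return "hello"


samples = [measure_once(fake_emit_sound, fake_wait_for_final) for _ in range(5)]
print(summarize(samples))
```

Reporting a high percentile alongside the median is a deliberate choice: a conversation is disrupted by the slow turns, not the typical ones.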

For conversational systems, this latency is one line item in the voice agent latency budget, where recognition shares the clock with the language model and speech synthesis. 200 milliseconds is a meaningful target because humans take their own turns in roughly that window,^[2] and anything much slower stops feeling like a conversation.

Common questions

What does "sub-200ms latency" usually mean for speech-to-text?

Almost always first-token latency: the time from a word being spoken to its first provisional guess appearing. It typically excludes finalization, endpointing, network round trip, and buffering, all of which add to what the user actually experiences. Ask which clock the number measures before comparing it to anything.

What is the difference between latency and real-time factor?

Latency is the delay before a word is available while someone is speaking; it matters for live use. Real-time factor is how fast a system processes a recording relative to its length; it matters for batch throughput. A system can be excellent at one and poor at the other, so do not use real-time factor to judge live responsiveness.

Why does my live transcription feel slow even though first words appear fast?

Usually because of the endpointing wait. The system may show words quickly but then sit through a silence timer before deciding your turn is over and finalizing. That wait is a tuning choice, not model speed, and it is often the largest single component of perceived delay.

Can I just lower latency to the minimum?

Not without cost. Lower latency means committing to words with less future context, which reduces accuracy, and showing words sooner, which makes the text rewrite itself more. The right setting balances speed against accuracy and stability for your specific use, rather than minimizing one number.

References

[1] Shinohara, Y., & Watanabe, S. (2022). Minimum Latency Training of Sequence Transducers for Streaming End-to-End Speech Recognition. arXiv preprint arXiv:2211.02333.
[2] Stivers, T., Enfield, N. J., Brown, P., et al. (2009). Universals and Cultural Variation in Turn-Taking in Conversation. Proceedings of the National Academy of Sciences, 106(26).