Eight hundred milliseconds is about the time it takes to say two or three words at a calm pace, and it is roughly all the time a voice agent has before a caller feels something is wrong.
The number is not arbitrary. Human conversation runs on a tight clock. Studies of turn-taking across ten languages (Stivers et al., 2009) put the typical gap between speakers near 200 milliseconds, and people read delays past about a second as hesitation or a dropped call. An agent does not have to match human speed; it has to stay in the zone where the delay reads as thinking rather than broken. Treat 800 ms as the line where it starts to feel slow.
Latency by processing stage
Lay the stages end to end, the way a naive implementation runs them, and add up each cost. The numbers here are illustrative: ranges teams commonly see, not measurements of a specific system.
Take a typical value for each stage: 350 ms for endpoint detection, 120 ms for STT finalization, 400 ms for the model's first token, 150 ms for the first synthesized audio, and 160 ms for the network. That is 350 + 120 + 400 + 150 + 160 = 1,180 milliseconds, and that is with friendly numbers. Run sequentially, a competent agent built from fast parts lands well over the line. The budget does not close when the stages run in order, because the problem is the arrangement, not the speed of any one part.
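The arithmetic above can be written down as a small budget check. This is a minimal sketch: the five costs are the illustrative figures from the sum, the stage labels assume they map in order onto the stages in the summary table, and `BUDGET_MS` is the 800 ms working target rather than a hard limit.

```python
# Illustrative per-stage costs in milliseconds, taken from the sum above.
# The stage labels are names chosen for this sketch, not measured stages.
SEQUENTIAL_STAGES_MS = {
    "endpoint_detection": 350,
    "stt_finalization": 120,
    "model_first_token": 400,
    "tts_first_audio": 150,
    "network": 160,
}
BUDGET_MS = 800  # working target, not a law


def sequential_latency_ms(stages):
    """Total delay when each stage waits for the previous one to finish."""
    return sum(stages.values())


total = sequential_latency_ms(SEQUENTIAL_STAGES_MS)
print(f"sequential total: {total} ms")
print(f"over budget by:   {total - BUDGET_MS} ms")

# List the stages from most to least expensive to see where the time goes.
for name, cost in sorted(SEQUENTIAL_STAGES_MS.items(), key=lambda kv: -kv[1]):
    print(f"{name:>20}: {cost:4d} ms ({cost / total:.0%} of total)")
```

Sorting by cost puts endpoint detection and the model's first token at the top, which is where the rest of the discussion concentrates.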
The two largest line items get the least attention. Endpoint detection spends its budget on purpose: it waits to be sure you stopped talking, and every millisecond of that silence timer adds straight to response time (the failure modes are catalogued in endpoint detection). The model's first token dominates the decision stage, because perceived latency turns on how long until the first word is ready to speak, not on how long the full answer takes to generate.
Overlapping processing stages
The budget closes only when the stages overlap in time. A live agent does not wait for each box to finish before starting the next; it pipelines them.
Recognition emits words while you are still speaking, so the transcript is nearly final the instant you stop (this is the partial vs final results mechanism). The decision model can be primed with the partial transcript and start before endpointing confirms the turn. Synthesis is the biggest lever: a streaming TTS engine speaks the first clause of the reply while the model is still generating the rest, so time-to-first-audio is governed by the first few words rather than the whole sentence.
Overlapping makes no stage faster; it changes what you measure. Sequential thinking asks how long until everything is done, but a conversation cares only about when the first sound comes out. Optimize time-to-first-audio and the same components that overshot at 1,180 ms can land under 800 ms.
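To make the streaming-synthesis lever concrete, the sketch below compares two ways of scheduling the same reply: waiting for the whole answer before synthesizing, and synthesizing as soon as the first clause is complete. It is a deterministic toy model, not a real TTS integration. The 400 ms first token and 150 ms synthesis latency reuse the illustrative figures from this article; the 40 ms gap between tokens, the `Token` type, and the punctuation-based clause test are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    ready_at_ms: int  # time after the model request at which this token exists


CLAUSE_ENDINGS = (",", ".", "?", "!", ";", ":")


def first_audio_blocking(tokens, tts_latency_ms):
    """Synthesis starts only after the last token of the reply arrives."""
    return tokens[-1].ready_at_ms + tts_latency_ms


def first_audio_streaming(tokens, tts_latency_ms):
    """Synthesis starts as soon as the first clause is complete."""
    for token in tokens:
        if token.text.endswith(CLAUSE_ENDINGS):
            return token.ready_at_ms + tts_latency_ms
    # No clause boundary in the reply: fall back to waiting for all of it.
    return first_audio_blocking(tokens, tts_latency_ms)


# Hypothetical reply: first token at 400 ms, then one token every 40 ms.
words = "Sure, I can help with that today.".split()
tokens = [Token(word, 400 + 40 * i) for i, word in enumerate(words)]

print(first_audio_blocking(tokens, tts_latency_ms=150))   # 790
print(first_audio_streaming(tokens, tts_latency_ms=150))  # 550
```

Neither the model nor the synthesizer got faster between the two calls. The only change is when synthesis is allowed to start, and the gap widens as the reply gets longer.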
Network latency
Network latency is the cost you cannot hide. Unlike the other stages it overlaps with nothing useful: it is dead time on the wire, paid on every leg. Audio travels from the user to your server, possibly onward to a model provider, and back. If your recognizer is in Virginia and your caller is in Frankfurt, physics adds round-trip time before any computation happens, and a multi-vendor architecture pays that toll more than once.
So serious deployments care about where the compute runs as much as how fast it is. Co-locate recognition, the model, and synthesis in one region, keep the audio path short, and you can recover on the order of a hundred milliseconds that no amount of model tuning would give back. Telephony agents have it worse: the call crosses the phone network before it reaches your stack.
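The toll can be tallied leg by leg. The sketch below follows this article's framing that network legs add up rather than overlap. Every round-trip time in it is a hypothetical placeholder chosen for illustration, as are the leg names; substitute measurements from your own deployment.

```python
def total_network_ms(legs):
    """Sum round-trip times over every leg the audio or text crosses.

    Legs are treated as paid one after another, so they add.
    """
    return sum(rtt_ms for _, rtt_ms in legs)


# Hypothetical round-trip times in milliseconds, for illustration only.
multi_vendor = [
    ("caller <-> app server (cross-continent)", 90),
    ("app server <-> STT vendor", 30),
    ("app server <-> model vendor", 30),
    ("app server <-> TTS vendor", 30),
]
co_located = [
    ("caller <-> server in the caller's region", 30),
    ("STT -> model -> TTS inside one region", 15),
]

before = total_network_ms(multi_vendor)
after = total_network_ms(co_located)
print(f"multi-vendor: {before} ms, co-located: {after} ms, saved: {before - after} ms")
```

The point of listing legs explicitly is that each vendor boundary shows up as its own line, which makes it easier to see which hops a placement change would remove.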
| Stage | Illustrative cost | Can it overlap? | Biggest lever |
|---|---|---|---|
| Endpoint detection | 200-400 ms | Partly | Semantic endpointing |
| STT finalization | 50-150 ms | Yes (streaming) | Stream partials |
| Model first token | 300-500 ms | Yes (prime on partials) | First token, not full answer |
| TTS first audio | 100-200 ms | Yes (streaming) | Time-to-first-audio |
| Network (both legs) | 100-200 ms | No | Co-locate, shorten path |
Architecture requirements
The budget reframes every architecture choice. A single speech-to-speech model is attractive here because it collapses several of these stages and their seams into one network, cutting hops and handoffs (the trade-offs against the pipeline are weighed in speech-to-speech models vs pipelines). The cascaded pipeline keeps the budget closeable through aggressive overlap and careful placement.
Either way, the discipline is the same: measure time-to-first-audio rather than total turnaround, and treat the silence before the agent speaks as a line item you pay for. The agent that feels fast is usually the one that started talking soonest, not the one with the fastest model.
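A minimal way to put that discipline into practice is to timestamp two events per turn: the end of the caller's speech and the first audio chunk the agent sends back. The `TurnTimer` class below is a hypothetical helper sketched for this purpose, not part of any particular SDK. It assumes your pipeline can tell you when the last speech frame arrived and when the first reply chunk goes out.

```python
import time


class TurnTimer:
    """Records time-to-first-audio for one conversational turn."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so tests can supply a fake clock
        self._user_stopped_at = None
        self._first_audio_at = None

    def user_stopped_speaking(self, at=None):
        """Mark the end of the caller's speech.

        Pass `at` (the timestamp of the last speech frame) when it is known,
        so that the endpointer's silence wait is counted as latency.
        """
        self._user_stopped_at = self._clock() if at is None else at
        self._first_audio_at = None

    def agent_audio_sent(self):
        """Mark an outgoing audio chunk; only the first one per turn counts."""
        if self._user_stopped_at is not None and self._first_audio_at is None:
            self._first_audio_at = self._clock()

    def time_to_first_audio_ms(self):
        """Return the measured delay in ms, or None if the turn is incomplete."""
        if self._user_stopped_at is None or self._first_audio_at is None:
            return None
        return (self._first_audio_at - self._user_stopped_at) * 1000.0
```

Starting the clock at the last speech frame, rather than at the moment the endpointer fires, keeps the silence timer inside the measurement, which is the line item this section says to account for.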
Common questions
Why 800 milliseconds specifically?
It is a working target, not a law. It sits between the roughly 200 ms gap of human conversation and the one-second mark where a delay starts to read as a problem. In that range, a machine that is allowed to seem like it is thinking still feels responsive. Some use cases tolerate more; a few demand less.
Which stage should I optimize first?
Endpointing. It is often the largest item and it is pure waiting. Switch from a fixed silence timeout to content-aware endpointing that finalizes the moment a sentence is complete, then confirm recognition and synthesis are both streaming.
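As a rough illustration of the difference, the sketch below replaces a single fixed timeout with one that depends on the partial transcript. It is a crude punctuation heuristic standing in for content-aware endpointing, not how any specific product does it; production systems typically use learned models. The function names and the 200 ms and 700 ms thresholds are hypothetical, and the approach assumes the recognizer emits punctuation in its partial results.

```python
SENTENCE_ENDINGS = (".", "?", "!")


def silence_timeout_ms(partial_transcript, short_ms=200, long_ms=700):
    """Pick how long to wait in silence before declaring the turn over.

    If the partial transcript already looks like a finished sentence, wait
    briefly; otherwise give the speaker more room to continue.
    """
    text = partial_transcript.strip()
    if text.endswith(SENTENCE_ENDINGS):
        return short_ms
    return long_ms


def turn_is_over(partial_transcript, silence_so_far_ms):
    """Return True once the silence has outlasted the chosen timeout."""
    return silence_so_far_ms >= silence_timeout_ms(partial_transcript)


print(turn_is_over("I'd like to book a table for two.", 250))  # True
print(turn_is_over("I'd like to book a table for", 250))       # False
```

With a fixed 700 ms timeout both utterances would cost the full wait. Here the complete sentence finalizes after the short timeout, while the trailing fragment keeps the longer one so the caller is not cut off mid-thought.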
Does a faster language model fix a slow agent?
Not on its own. The model is often a third of the budget, and what counts is time to its first token, not total generation. Slow endpointing or non-streaming synthesis keeps an agent sluggish behind the quickest model you can find.
Why does the same agent feel slower on the phone?
The path got longer and the signal got worse. Telephony adds network legs through the phone system before audio reaches your stack, and the narrowband 8 kHz audio typical of phone codecs makes recognition and endpointing harder, lengthening the silence you wait through. The compute did not change.
Related concepts
- What is a voice agent?
- Endpoint detection
- Turn-taking and barge-in
- Streaming TTS
- Partial vs final results
- Speech-to-speech models vs pipelines
Building with Soniox? Low time-to-first-token in the real-time recognizer is what gives the rest of this budget room to breathe; see the Soniox documentation.