The voice agent latency budget: where your 800 ms goes

Eight hundred milliseconds is about the length of this sentence read at a calm pace, and roughly all the time a voice agent has before a caller feels something is wrong.

The number is not arbitrary. Human conversation runs on a tight clock. Studies of turn-taking across ten languages (Stivers et al., 2009) put the typical gap between speakers near 200 milliseconds, and people read delays past about a second as hesitation or a dropped call. An agent does not have to match human speed; it has to stay in the zone where the delay reads as thinking rather than broken. Treat 800 ms as the line where it starts to feel slow.

Latency by processing stage

Lay the stages end to end, the way a naive implementation runs them, and add up each cost. The numbers below are illustrative: they sit in the ranges teams commonly see and are not measurements of a specific system.

xychart-beta horizontal
    title "Where the milliseconds go (illustrative, sequential)"
    x-axis ["Endpoint", "STT final", "LLM first token", "TTS first audio", "Network x2"]
    y-axis "Milliseconds" 0 --> 500
    bar [350, 120, 400, 150, 160]

An illustrative 800 ms budget run sequentially. The four stages alone overshoot before you add a single network hop.

Add those bars: 350 + 120 + 400 + 150 + 160 is 1,180 milliseconds, and that is with friendly numbers. Run sequentially, a competent agent built from fast parts lands well over the line. The budget cannot close while the stages run one after another, because the problem is the arrangement, not the speed of any one part.
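The same arithmetic as a few lines of Python, for anyone who wants to plug in their own numbers. The stage names are labels chosen for this sketch and the values are the illustrative ones from the chart above, not measurements.

```python
# Illustrative per-stage costs in milliseconds, copied from the chart above.
STAGE_COST_MS = {
    "endpoint": 350,
    "stt_final": 120,
    "llm_first_token": 400,
    "tts_first_audio": 150,
    "network_x2": 160,
}
BUDGET_MS = 800


def sequential_total(costs):
    """Total delay when every stage waits for the previous one to finish."""
    return sum(costs.values())


total = sequential_total(STAGE_COST_MS)
overshoot = total - BUDGET_MS
compute_only = total - STAGE_COST_MS["network_x2"]

print(f"sequential total: {total} ms")           # 1180
print(f"over budget by:   {overshoot} ms")       # 380
print(f"without network:  {compute_only} ms")    # 1020
```

Even with the network line removed, the four compute stages sum to 1,020 ms, which is the overshoot the chart caption describes.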

The two largest line items get the least attention. Endpoint detection spends its budget on purpose: it waits to be sure you stopped talking, and every millisecond of that silence timer adds straight to response time (the failure modes are catalogued in endpoint detection). The model's first token dominates the decision stage, because perceived latency turns on how long until the first word is ready to speak, not on how long the full answer takes to generate.
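To see why the silence timer adds straight to response time, here is a minimal sketch of a fixed-timeout endpointer. It assumes a voice activity detector that emits one boolean per 20 ms frame; the frame length, the function name, and the sample utterance are all hypothetical and stand in for whatever your stack provides.

```python
FRAME_MS = 20  # assumed frame length; real detectors vary


def endpoint_delay_ms(frames, silence_timeout_ms):
    """Milliseconds between the last speech frame and the endpoint decision.

    `frames` is a sequence of booleans (True = speech) standing in for the
    output of a voice activity detector. Returns None if the silence
    timeout never elapses after speech has been heard.
    """
    heard_speech = False
    silent_ms = 0
    for is_speech in frames:
        if is_speech:
            heard_speech = True
            silent_ms = 0          # any speech resets the timer
        elif heard_speech:
            silent_ms += FRAME_MS
            if silent_ms >= silence_timeout_ms:
                return silent_ms   # the turn is declared over here
    return None


# One second of speech followed by one second of silence.
utterance = [True] * 50 + [False] * 50

for timeout in (200, 350, 500):
    print(timeout, endpoint_delay_ms(utterance, timeout))
```

In this toy model the delay is the timeout rounded up to a whole frame, so a 350 ms timer costs 360 ms on every turn no matter how fast the rest of the pipeline is.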

Overlapping processing stages

The budget closes only when the stages overlap in time. A live agent does not wait for each box to finish before starting the next; it pipelines them.

Recognition emits words while you are still speaking, so the transcript is nearly final the instant you stop (this is the partial vs final results mechanism). The decision model can be primed with the partial transcript and start before endpointing confirms the turn. Synthesis is the biggest lever: a streaming TTS engine speaks the first clause of the reply while the model is still generating the rest, so time-to-first-audio is governed by the first few words rather than the whole sentence.
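The handoff from model to synthesis can be sketched with plain generators. This is an assumption-laden illustration, not any vendor's API: `fake_llm_stream` stands in for a streaming model, the tokens are made up, and splitting on punctuation is the simplest possible clause detector.

```python
from typing import Iterable, Iterator

CLAUSE_END = (",", ".", "?", "!", ";", ":")


def clauses(tokens: Iterable[str]) -> Iterator[str]:
    """Yield a speakable clause as soon as the token stream completes one."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(CLAUSE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush whatever is left when the stream ends


def fake_llm_stream() -> Iterator[str]:
    # Stand-in for a streaming model API; the tokens are invented.
    yield from ["Sure", ",", " I", " can", " help", " with", " that", ".",
                " What", " is", " your", " order", " number", "?"]


spoken = []
for clause in clauses(fake_llm_stream()):
    # A real agent would hand each clause to a streaming TTS engine here.
    spoken.append(clause)

print(spoken)
```

The first clause is available after two tokens, long before the reply is complete. A production splitter would need more care, since this one would also break on the decimal point in a number like 3.5.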

flowchart LR
    subgraph Sequential
        direction LR
        S1[STT done] --> S2[LLM done] --> S3[TTS starts]
    end
    subgraph Overlapped
        direction LR
        O1[STT streaming] -.partial.-> O2[LLM streaming]
        O2 -.first words.-> O3[TTS speaking]
    end

Sequential versus overlapped. The same stages, the same per-stage costs, but the overlapped pipeline starts speaking far earlier because nothing waits for a stage to fully finish.

Overlapping makes no stage faster; it changes what you measure. Sequential thinking asks how long until everything is done, but a conversation cares only about when the first sound comes out. Optimize time-to-first-audio and the same components that overshot at 1,180 ms can land under 800 ms.
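A toy model makes the difference concrete. It reuses the illustrative costs from the chart and rests on assumptions that are labeled in the code: recognition, the endpoint timer, and a model primed on the partial transcript all start when the user stops speaking, synthesis starts once all three are ready, and the network legs are never hidden. The function names and the `llm_start_ms` parameter are invented for this sketch.

```python
# Illustrative per-stage costs in milliseconds, same values as the chart.
STAGE_COST_MS = {
    "endpoint": 350,
    "stt_final": 120,
    "llm_first_token": 400,
    "tts_first_audio": 150,
    "network_x2": 160,
}


def sequential_ttfa(c):
    """Every stage waits for the one before it."""
    return sum(c.values())


def overlapped_ttfa(c, llm_start_ms=0):
    """Time-to-first-audio when the early stages run concurrently.

    Assumptions (not measurements): t=0 is the moment the user stops
    speaking; STT finalization and the endpoint timer both start at t=0;
    the model is primed at `llm_start_ms`; TTS needs all three to finish
    before it can produce audio; the network cost is paid in full.
    """
    stt_done = c["stt_final"]
    endpoint_done = c["endpoint"]
    first_token = llm_start_ms + c["llm_first_token"]
    ready_to_speak = max(stt_done, endpoint_done, first_token)
    return ready_to_speak + c["tts_first_audio"] + c["network_x2"]


print("sequential:", sequential_ttfa(STAGE_COST_MS))                 # 1180
print("overlapped:", overlapped_ttfa(STAGE_COST_MS))                 # 710
print("primed late:", overlapped_ttfa(STAGE_COST_MS, llm_start_ms=120))  # 830
```

Under these assumptions the pipeline speaks at 710 ms. The third line shows how sensitive the result is to priming: if the model only starts once recognition finalizes at 120 ms, the same toy model gives 830 ms, back over the line.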

Network latency

Network latency is the cost you cannot hide. Unlike the other stages it overlaps with nothing useful: it is dead time on the wire, paid on every leg. Audio travels from the user to your server, possibly onward to a model provider, and back. If your recognizer is in Virginia and your caller is in Frankfurt, physics adds round-trip time before any computation happens, and a multi-vendor architecture pays that toll more than once.
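The physical floor is easy to estimate. The sketch below assumes light in optical fiber covers about 200 km per millisecond (roughly two thirds of its speed in a vacuum) and uses approximate coordinates for Frankfurt and for Ashburn, Virginia, picked here as a stand-in for "Virginia"; both locations and the helper names are illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0
FIBER_KM_PER_MS = 200.0  # approximation: light in glass, about 2/3 of c


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def min_round_trip_ms(distance_km):
    """Lower bound on round-trip time over fiber, ignoring all equipment."""
    return 2 * distance_km / FIBER_KM_PER_MS


FRANKFURT = (50.11, 8.68)      # approximate
ASHBURN_VA = (39.04, -77.49)   # approximate, assumed data center location

distance = great_circle_km(*FRANKFURT, *ASHBURN_VA)
rtt_floor = min_round_trip_ms(distance)
print(f"{distance:.0f} km, at least {rtt_floor:.0f} ms round trip")
```

The floor comes out at roughly 65 ms for a single round trip. Real routes are longer than the great circle and pass through routers and switches, so measured round trips are higher, and each extra vendor hop pays its own version of this cost.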

So serious deployments care about where the compute runs as much as how fast it is. Co-locate recognition, the model, and synthesis in one region, keep the audio path short, and you can recover on the order of a hundred milliseconds that no amount of model tuning would. Telephony agents have it worse: the call crosses the phone network before it reaches your stack.

Stage	Illustrative cost	Can it overlap?	Biggest lever
Endpoint detection	200-400 ms	Partly	Semantic endpointing
STT finalization	50-150 ms	Yes (streaming)	Stream partials
Model first token	300-500 ms	Yes (prime on partials)	First token, not full answer
TTS first audio	100-200 ms	Yes (streaming)	Time-to-first-audio
Network (both legs)	100-200 ms	No	Co-locate, shorten path
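The table can also be read as data. This sketch encodes its rows, sums the low and high ends of each range, and separates out the part that cannot overlap. The `Stage` type and field names are invented for the example; the numbers are the table's illustrative ranges.

```python
from typing import NamedTuple


class Stage(NamedTuple):
    name: str
    low_ms: int
    high_ms: int
    overlap: str  # "yes", "partly", or "no", as in the table


STAGES = [
    Stage("Endpoint detection", 200, 400, "partly"),
    Stage("STT finalization", 50, 150, "yes"),
    Stage("Model first token", 300, 500, "yes"),
    Stage("TTS first audio", 100, 200, "yes"),
    Stage("Network (both legs)", 100, 200, "no"),
]

best_case = sum(s.low_ms for s in STAGES)      # 750
worst_case = sum(s.high_ms for s in STAGES)    # 1450

# Stages that overlap with nothing are paid in full on every turn.
fixed = [s for s in STAGES if s.overlap == "no"]
fixed_low = sum(s.low_ms for s in fixed)       # 100
fixed_high = sum(s.high_ms for s in fixed)     # 200

print(f"sequential range: {best_case}-{worst_case} ms")
print(f"cannot be hidden: {fixed_low}-{fixed_high} ms")
```

Run sequentially, even the low end of every range leaves only 50 ms of headroom under an 800 ms target, and the high end nearly doubles it.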

Architecture requirements

The budget reframes every architecture choice. A single speech-to-speech model is attractive here because it collapses several of these stages and their seams into one model, cutting hops and handoffs (the trade-offs against the pipeline are weighed in speech-to-speech models vs pipelines). The cascaded pipeline keeps the budget closeable through aggressive overlap and careful placement.

Either way, the discipline is the same: measure time-to-first-audio rather than total turnaround, and treat the silence before the agent speaks as a line item you pay for. The agent that feels fast is usually the one that started talking soonest, not the one with the fastest model.
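Measuring time-to-first-audio needs only two timestamps per turn. The class below is a minimal sketch with hypothetical method names; it assumes your pipeline can call one hook when the user's speech ends and another each time an audio chunk is sent.

```python
import time


class TurnTimer:
    """Records time-to-first-audio for one conversational turn."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock          # injectable so tests can fake time
        self._speech_end = None
        self._first_audio = None

    def user_stopped_speaking(self):
        self._speech_end = self._clock()
        self._first_audio = None     # start a fresh measurement

    def audio_chunk_sent(self):
        # Only the first chunk of the reply matters for perceived latency.
        if self._speech_end is not None and self._first_audio is None:
            self._first_audio = self._clock()

    @property
    def time_to_first_audio_ms(self):
        if self._speech_end is None or self._first_audio is None:
            return None
        return (self._first_audio - self._speech_end) * 1000.0


# Usage with a fake clock so the example is deterministic.
fake_times = iter([10.0, 10.65])
timer = TurnTimer(clock=lambda: next(fake_times))
timer.user_stopped_speaking()
timer.audio_chunk_sent()
timer.audio_chunk_sent()   # later chunks are ignored
print(round(timer.time_to_first_audio_ms))
```

Two caveats apply to any measurement like this. The end of speech is only known after endpointing fires, so a real system would timestamp the last speech frame rather than the moment of the decision. And a server-side timer does not see the network leg back to the listener.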

Common questions

Why 800 milliseconds specifically?

A working target, not a law. It sits between the 200 ms gap of human conversation and the roughly one-second mark where a delay starts to read as a problem, marking where a machine that is allowed to seem like it is thinking still feels responsive. Some use cases tolerate more; a few demand less.

Which stage should I optimize first?

Endpointing. It is often the largest item and it is pure waiting. Switch from a fixed silence timeout to content-aware endpointing that finalizes the moment a sentence is complete, then confirm recognition and synthesis are both streaming.
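As a rough illustration of the idea, the sketch below picks a silence timeout from the partial transcript instead of using one fixed value. It is a toy heuristic, not how production endpointers work: it assumes the recognizer emits punctuation, the word list and both timeout values are invented, and real systems typically rely on trained models.

```python
# Words that suggest the speaker is mid-thought. Hypothetical list.
INCOMPLETE_ENDINGS = {"and", "but", "or", "so", "because",
                      "the", "a", "to", "um", "uh"}


def silence_timeout_ms(partial_transcript, short_ms=150, long_ms=600):
    """Choose how long to wait in silence before ending the turn.

    Waits briefly when the transcript looks like a finished sentence and
    longer when it is empty, unpunctuated, or trails off on a filler or
    connective. Both timeout values are illustrative defaults.
    """
    text = partial_transcript.strip().lower()
    words = text.rstrip(".?!,").split()
    if not words:
        return long_ms
    if words[-1] in INCOMPLETE_ENDINGS:
        return long_ms
    if text.endswith((".", "?", "!")):
        return short_ms
    return long_ms


for sample in ("What time do you open?",
               "I want to book a flight to",
               "My number is five five five and"):
    print(repr(sample), silence_timeout_ms(sample))
```

The complete question gets the short wait and the two trailing fragments get the long one, which is the trade a fixed timer cannot make: it is either slow on finished sentences or interrupts unfinished ones.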

Does a faster language model fix a slow agent?

Not on its own. The model is often a third of the budget, and what counts is time to its first token, not total generation. Slow endpointing or non-streaming synthesis keeps an agent sluggish behind the quickest model you can find.

Why does the same agent feel slower on the phone?

The path got longer and the signal got worse. Telephony adds network legs through the phone system before audio reaches your stack, and the narrowband 8 kHz audio of a typical phone call makes recognition and endpointing harder, lengthening the silence you wait through. The compute did not change.
