A support agent that keeps talking after the caller starts to interrupt is failing at turn management, not at response content. Voice-agent performance depends on the latency and coordination of the whole interaction loop, including endpoint detection, barge-in, and synthesis cancellation; language-model capability alone does not determine how the conversation feels.
Voice-agent processing loop
Every voice agent is three jobs wired into a circle: it listens, it decides, it speaks, then it listens again. The classic build is the cascaded pipeline, which assigns those jobs to three separate, swappable components.
Listening is speech recognition, often called STT or ASR. It turns the incoming audio stream into text, ideally word by word as you speak rather than after you stop. The agent also has to decide when you are done speaking, a deep problem of its own called endpoint detection. End the turn too eagerly and you cut the user off; wait too long and the agent feels asleep.
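The simplest form of endpoint detection is a trailing-silence timer, sketched below. This is an illustration only, not how any particular product does it: the class name, the 20 ms frame size, the 600 ms silence window, and the energy threshold are all assumptions, and production systems often use a trained voice-activity or turn-taking model instead of raw energy.

```python
from dataclasses import dataclass


@dataclass
class SilenceEndpointer:
    """Declare end of turn after enough trailing silence follows speech."""

    frame_ms: int = 20               # duration of one audio frame (assumed)
    silence_ms: int = 600            # trailing silence that ends the turn (assumed)
    energy_threshold: float = 0.01   # mean-square energy counted as speech (assumed)
    heard_speech: bool = False
    silent_ms: int = 0

    def push(self, frame: list[float]) -> bool:
        """Feed one frame of samples in [-1, 1]; return True when the turn ends."""
        energy = sum(s * s for s in frame) / max(len(frame), 1)
        if energy >= self.energy_threshold:
            # Speech resets the silence timer.
            self.heard_speech = True
            self.silent_ms = 0
            return False
        if not self.heard_speech:
            # Leading silence: the user has not started talking yet.
            return False
        self.silent_ms += self.frame_ms
        return self.silent_ms >= self.silence_ms
```

The trade-off from the paragraph above lives in `silence_ms`: a smaller value responds sooner but cuts off speakers who pause mid-thought, and a larger value adds its full length to every response.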
Deciding is the language model, plus the scaffolding around it: the system prompt, the conversation history, function calls to your booking system or CRM, and the guardrails that stop it from promising a refund it cannot give. This is where the word "agent" earns its keep: a chatbot answers a question, an agent takes an action on your behalf.
Speaking is text-to-speech. For a live conversation it has to be streaming TTS, producing audio from the first few words while the model is still writing the rest of the sentence. If TTS waits for the full reply, the agent is already too slow.
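One common way to feed streaming TTS is to cut the model's token stream at clause boundaries and hand each clause to the synthesizer as soon as it is complete. The sketch below assumes tokens arrive as plain strings; the function name, the punctuation set, and the `min_chars` floor are illustrative choices, not part of any specific TTS API.

```python
from typing import Iterable, Iterator

CLAUSE_END = tuple(".!?,;:")  # punctuation treated as a clause boundary (assumed)


def clauses(tokens: Iterable[str], min_chars: int = 12) -> Iterator[str]:
    """Group streamed tokens into clauses that can be synthesized right away."""
    buf = ""
    for tok in tokens:
        buf += tok
        text = buf.rstrip()
        # Flush only at a boundary, and only once the chunk is long enough
        # to be worth a separate synthesis request.
        if len(text) >= min_chars and text.endswith(CLAUSE_END):
            yield buf.strip()
            buf = ""
    if buf.strip():
        # Whatever is left when the model stops is the final chunk.
        yield buf.strip()
```

For example, `list(clauses(["Sure", ",", " I", " can", " help", "."]))` yields a single chunk, because the early comma falls below the `min_chars` floor.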
End-to-end latency
Humans are intolerant of silence in conversation. Studies of natural turn-taking (Stivers et al., 2009, across ten languages) put the typical gap between speakers around 200 ms, and people read a delay as hesitation or a bad line well before a full second.
For a voice agent, the working target most teams aim for is roughly 800 ms of perceived response latency, measured from the moment you stop talking to the moment you hear the agent's first sound. Split that across endpoint detection, transcription, the language model, synthesis, and the network, and each stage is left with very little.
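To see how quickly the budget runs out, it helps to write it down as arithmetic. The per-stage numbers below are illustrative placeholders chosen to add up to the 800 ms target; they are not measurements of any real system, and the stage names are assumptions.

```python
TARGET_MS = 800  # perceived-latency target from the text

# Hypothetical per-stage allocations, in milliseconds.
budget_ms = {
    "endpoint_detection": 300,
    "stt_finalization": 100,
    "llm_first_token": 200,
    "tts_first_audio": 120,
    "network_round_trips": 80,
}


def remaining_ms(budget: dict[str, int], target: int = TARGET_MS) -> int:
    """Headroom left after every stage takes its share (negative = over budget)."""
    return target - sum(budget.values())


def worst_stage(budget: dict[str, int]) -> str:
    """The stage that consumes the largest share of the budget."""
    return max(budget, key=budget.get)
```

With these placeholder values `remaining_ms(budget_ms)` is 0, so any stage that runs long pushes the whole response past the target.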
So the pipeline streams at every stage instead of running step by step. STT emits partial words while you speak. The LLM starts generating before the transcript is final. TTS speaks the first clause while the model finishes the last. The loop is drawn as a circle, but in a good agent the stages overlap in time. Getting that overlap right is most of the latency budget work.
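The difference between overlapping and step-by-step execution can be shown with chained generators. In this sketch `llm_tokens` and `tts_audio` are hypothetical stand-ins that only record when each stage does its work; a real agent would use asynchronous streams to actual model and TTS services.

```python
from typing import Iterable, Iterator


def llm_tokens(reply: str, log: list[str]) -> Iterator[str]:
    """Stand-in for a streaming model: yields the reply one word at a time."""
    for word in reply.split():
        log.append(f"llm wrote {word!r}")
        yield word


def tts_audio(words: Iterable[str], log: list[str]) -> Iterator[bytes]:
    """Stand-in for streaming TTS: 'synthesizes' each word as it arrives."""
    for word in words:
        log.append(f"tts spoke {word!r}")
        yield word.encode()


def run_streamed(reply: str) -> list[str]:
    """TTS consumes the model's output while the model is still producing it."""
    log: list[str] = []
    for _ in tts_audio(llm_tokens(reply, log), log):
        pass
    return log


def run_step_by_step(reply: str) -> list[str]:
    """TTS starts only after the model has finished the whole reply."""
    log: list[str] = []
    words = list(llm_tokens(reply, log))  # blocks until the reply is complete
    for _ in tts_audio(words, log):
        pass
    return log
```

In `run_streamed` the log alternates between model and TTS entries, so the first audio exists after the first word. In `run_step_by_step` every model entry comes before any TTS entry, which is the delay the streaming design removes.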
Turn-taking and barge-in
An agent that only listens or only talks at any one moment works like a walkie-talkie, where one side waits for the other to finish. A good one runs full-duplex: it keeps listening while it speaks, so the instant you interrupt, it can stop. This makes an agent feel like a conversation rather than a menu.
That is barge-in, and it is hard because the agent has to hear you over the sound of itself. The microphone picks up the agent's own voice from the speaker (echo), so the system needs echo cancellation and a turn-taking and barge-in policy to tell the interrupting user apart from its own audio bouncing back. Get this wrong and the agent either talks over people or flinches at its own echo.
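A barge-in policy can be as small as a debounce on the microphone signal that remains after echo cancellation. The sketch below assumes an upstream echo canceller already supplies a per-frame residual energy; the class name, threshold, and three-frame debounce are illustrative assumptions, and `cancel_playback` is a hypothetical callback that stops TTS output.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class BargeInPolicy:
    """Cancel playback once the user has clearly started talking over the agent."""

    cancel_playback: Callable[[], None]  # hypothetical hook that stops TTS audio
    speech_threshold: float = 0.02       # residual energy counted as user speech (assumed)
    min_speech_frames: int = 3           # consecutive frames required (assumed)
    agent_speaking: bool = False
    speech_run: int = 0

    def start_speaking(self) -> None:
        self.agent_speaking = True
        self.speech_run = 0

    def on_mic_frame(self, residual_energy: float) -> bool:
        """Process one echo-cancelled frame; return True if it triggered barge-in."""
        if not self.agent_speaking or residual_energy < self.speech_threshold:
            # Nothing to interrupt, or the frame is just leftover echo or noise.
            self.speech_run = 0
            return False
        self.speech_run += 1
        if self.speech_run < self.min_speech_frames:
            return False
        self.cancel_playback()
        self.agent_speaking = False
        self.speech_run = 0
        return True
```

The two failure modes in the text map onto the two parameters: a threshold or debounce set too high lets the agent talk over people, and one set too low makes it flinch at its own echo.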
Cascaded and end-to-end architectures
The cascade is not the only way. A newer architecture, the speech-to-speech model, swallows the whole loop into one neural network: audio in, audio out, with no text transcript in the middle. OpenAI's Realtime API and Google's Gemini Live are the well-known examples.
Speech-to-speech wins on latency and carries tone and emotion that text throws away. The cascaded pipeline still dominates production because you can see inside it: when a cascade misbehaves, you read the transcript, inspect the model's tool calls, swap in a better STT, and log every stage. A single end-to-end model is a black box. Which one wins for your use case is laid out in speech-to-speech models vs pipelines.
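The inspectability argument is easy to see in code. In the sketch below each stage is a plain callable, so any one can be swapped out, and every intermediate result is recorded for later debugging. The names `TurnTrace` and `run_turn` are hypothetical, and the turn runs step by step for brevity where a real agent would stream.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TurnTrace:
    """Everything a cascade exposes for one turn."""

    transcript: str = ""
    reply: str = ""
    audio_bytes: int = 0


def run_turn(
    audio: bytes,
    stt: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> tuple[bytes, TurnTrace]:
    """Run one listen-decide-speak turn and keep each stage's output."""
    trace = TurnTrace()
    trace.transcript = stt(audio)         # what the agent heard
    trace.reply = llm(trace.transcript)   # what it decided to say
    speech = tts(trace.reply)             # what it actually spoke
    trace.audio_bytes = len(speech)
    return speech, trace
```

Replacing the `stt` argument with a different recognizer changes one stage without touching the other two, and the returned trace shows which stage produced a bad result.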
Common questions
Is a voice agent the same thing as a chatbot?
No. A chatbot exchanges text turns, with no clock and no audio. A voice agent runs over live audio with strict real-time constraints: detect when you stop speaking, respond in well under a second, handle being interrupted. A chatbot can take three seconds and nobody minds; a voice agent that pauses three seconds sounds broken.
What is the difference between STT and a voice agent?
STT (speech-to-text, also called ASR) is one component inside a voice agent: it converts incoming audio into text. The agent is the full loop that wraps STT together with a decision-making model and text-to-speech, plus turn-taking and barge-in. STT alone produces a transcript; the agent uses it to carry on a conversation.
Why do voice agents feel slow even when the AI is fast?
The delay is the sum of every stage, not just the language model. The agent has to decide you finished (endpointing) and finalize the transcript before the model sees your words, then start producing audio after it replies, with network latency on both ends. A fast model behind slow endpointing or non-streaming TTS still feels sluggish.
What is barge-in and why does it matter?
Barge-in is interrupting the agent mid-sentence and having it stop and listen, the way a person would. Without it the agent feels rigid: you wait for it to finish before you can correct it or change the subject. It requires the agent to keep listening while it talks and cancel its own speech the instant you start.
Cascaded pipeline or speech-to-speech model: which should I use?
For most production systems today, the pipeline. You can inspect, debug, and swap each stage independently. Speech-to-speech is lower latency and carries more vocal nuance, but it is harder to debug and steer. Need tight control, auditability, or the freedom to change vendors? Start with a pipeline.
Related concepts
- Voice agent architecture
- The voice-agent latency budget
- Turn-taking and barge-in
- Endpoint detection
- Streaming TTS
- Speech-to-speech models vs pipelines