What is a voice agent? Architecture of AI that talks back

Call a support line and interrupt the agent mid-sentence. If it keeps talking over you, nothing was wrong with its answer; something was wrong with its ears. That is the thing to understand about voice agents: how good one feels depends less on the language model everyone obsesses over than on the loop around it, the timing, the turn-taking, the ability to stop mid-word when you cut in.

Voice-agent processing loop

Every voice agent is three jobs wired into a circle: it listens, it decides, it speaks, then it listens again. The classic build is the cascaded pipeline, which assigns those jobs to three separate, swappable components.^[1]

flowchart LR
    User[User speech] --> STT[STT<br/>transcribe]
    STT --> LLM[LLM<br/>decide + tools]
    LLM --> TTS[TTS<br/>synthesize]
    TTS --> Out[Agent speech]
    Out -.barge-in.-> STT
    STT -.turn end.-> LLM

The canonical voice-agent loop. Audio streams in, gets transcribed, a model decides, speech streams out, and the cycle repeats. Interruption short-circuits the speaking stage back to listening.

Listening is speech recognition, often called STT or ASR. It turns the incoming audio stream into text, ideally word by word as you speak rather than after you stop. The agent also has to decide when you are done speaking, a deep problem of its own called endpoint detection. If it declares the end of your turn too eagerly, it cuts you off; if it waits too long, the agent feels asleep.
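The trade-off in endpoint detection can be sketched with the simplest possible rule: declare the turn over after a fixed stretch of silence. This is a minimal sketch, assuming a per-frame speech/silence flag is already available from some voice-activity detector. The class name, the 20 ms frame size, and the thresholds are illustrative assumptions, and real endpointers may use richer signals than silence alone.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Endpointer:
    """Ends the turn after `silence_ms` of continuous silence that follows speech."""
    silence_ms: int = 600   # hypothetical threshold, tune per use case
    frame_ms: int = 20      # assumed duration of one audio frame
    _heard_speech: bool = False
    _silent_for: int = 0

    def push(self, is_speech: bool) -> bool:
        """Feed one frame's speech flag; return True when the turn has ended."""
        if is_speech:
            self._heard_speech = True
            self._silent_for = 0
            return False
        if not self._heard_speech:
            return False  # leading silence: the user has not started yet
        self._silent_for += self.frame_ms
        if self._silent_for >= self.silence_ms:
            self._heard_speech = False  # reset for the next turn
            self._silent_for = 0
            return True
        return False


def frames_until_endpoint(frames: List[bool], silence_ms: int) -> Optional[int]:
    """Return the index of the frame at which the endpoint fires, or None."""
    ep = Endpointer(silence_ms=silence_ms)
    for i, is_speech in enumerate(frames):
        if ep.push(is_speech):
            return i
    return None


# Speech, a 300 ms thinking pause, more speech, then real silence.
frames = [True] * 10 + [False] * 15 + [True] * 10 + [False] * 40

eager = frames_until_endpoint(frames, silence_ms=200)    # fires inside the pause
patient = frames_until_endpoint(frames, silence_ms=600)  # waits for the real end
```

With the 200 ms threshold the endpoint fires during the mid-sentence pause and cuts the speaker off. With 600 ms it waits for the real end of the turn, at the cost of 600 ms of added delay before the model can start.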

Deciding is the language model, plus the scaffolding around it: the system prompt, the conversation history, function calls to your booking system or CRM, and the guardrails that stop it from promising a refund it cannot give. This is where the word "agent" earns its keep: a chatbot answers a question, an agent takes an action on your behalf.
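The guardrail idea can be sketched as a thin layer between the model's requested function call and the code that actually runs it. This is a minimal sketch under stated assumptions: the tool names, the refund limit, and the `run_tool_call` helper are all hypothetical, and a real agent would call a booking system or CRM where the stubs are.

```python
from typing import Any, Callable, Dict


# Hypothetical tools standing in for a booking system and a payments backend.
def book_appointment(day: str) -> str:
    return f"booked {day}"


def issue_refund(amount: float) -> str:
    return f"refunded {amount:.2f}"


TOOLS: Dict[str, Callable[..., str]] = {
    "book_appointment": book_appointment,
    "issue_refund": issue_refund,
}

MAX_REFUND = 50.0  # hypothetical policy limit


def run_tool_call(name: str, args: Dict[str, Any]) -> str:
    """Execute a model-requested tool call only if it passes the guardrails."""
    if name not in TOOLS:
        return "error: unknown tool"
    if name == "issue_refund" and args.get("amount", 0) > MAX_REFUND:
        return "error: refund exceeds limit, escalate to a human"
    return TOOLS[name](**args)
```

The point of the design is that the limit lives in code rather than in the prompt, so the model can ask for a large refund but cannot grant one.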

Speaking is text-to-speech. For a live conversation it has to be streaming TTS, producing audio from the first few words while the model is still writing the rest of the sentence. If TTS waits for the full reply, the agent is already too slow.
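One common way to feed streaming TTS is to cut the model's token stream at clause boundaries and hand each clause to the synthesizer as soon as it is complete. The following is a minimal sketch, assuming tokens arrive as plain strings. The `clauses` function and the punctuation set are illustrative choices, not any particular vendor's API.

```python
from typing import Iterable, Iterator, List

CLAUSE_END = (".", ",", "!", "?", ";", ":")


def clauses(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed LLM tokens into clauses that can be sent to TTS early."""
    buf: List[str] = []
    for tok in tokens:
        buf.append(tok)
        if tok.rstrip().endswith(CLAUSE_END):
            yield "".join(buf).strip()  # a speakable chunk is ready
            buf = []
    tail = "".join(buf).strip()
    if tail:
        yield tail  # flush whatever is left when the model stops


tokens = ["Sure", ",", " I", " can", " move", " that", ".", " Which", " day", "?"]
chunks = list(clauses(tokens))
```

Because `clauses` is a generator, the first chunk is available after only two tokens, so synthesis can begin while the rest of the reply is still being generated.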

End-to-end latency

Humans are intolerant of silence in conversation: people leave about 200 ms between turns (Stivers et al., 2009) and read anything approaching a second as hesitation or a bad line.

For a voice agent, the working target most teams aim for is roughly 800 ms of perceived response latency, measured from the moment you stop talking to the moment you hear the agent's first sound. Where that number comes from, and how it gets spent, is the whole subject of the latency budget. Split those 800 ms across endpointing, transcription, the model, synthesis, and the network, and no single stage gets much.
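A quick way to see how tight the target is: add up one plausible figure per stage. Only the 800 ms target comes from the text. Every per-stage number below is a hypothetical placeholder for illustration, not a measurement, and should be replaced with figures from your own system.

```python
from typing import Dict, Tuple

TARGET_MS = 800  # working target from the text

# Hypothetical per-stage figures, for illustration only.
budget_ms: Dict[str, int] = {
    "endpointing (silence wait)": 300,
    "STT finalization": 100,
    "LLM time to first token": 250,
    "TTS time to first audio": 100,
    "network round trips": 100,
}


def summarize(budget: Dict[str, int], target: int) -> Tuple[int, int]:
    """Return (total latency, headroom left against the target)."""
    total = sum(budget.values())
    return total, target - total


total, headroom = summarize(budget_ms, TARGET_MS)
for stage, ms in budget_ms.items():
    print(f"{stage:<28}{ms:>5} ms")
print(f"{'total':<28}{total:>5} ms (headroom {headroom} ms)")
```

Even with these fairly optimistic placeholders the total lands over the target, so every stage has to be trimmed or overlapped with its neighbours.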

So the pipeline streams at every stage instead of running step by step. STT emits partial words while you speak. The LLM starts generating before the transcript is final. TTS speaks the first clause while the model finishes the last. The diagram above is a circle, but in a good agent the stages overlap in time. Getting that overlap right is most of the latency budget work.
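The overlap can be illustrated with chained generators, where each stage pulls from the previous one and emits output as soon as it has any. This is a minimal sketch: `llm_clauses` and `tts` are hypothetical stand-ins that only record what they did, and lazy generators interleave the stages rather than running them truly in parallel as a production system would.

```python
from typing import Iterator, List

log: List[str] = []


def llm_clauses() -> Iterator[str]:
    """Stand-in for a streaming LLM that emits its reply one clause at a time."""
    for clause in ["Sure,", "I can move that.", "Which day works?"]:
        log.append(f"llm wrote: {clause}")
        yield clause


def tts(clauses: Iterator[str]) -> Iterator[str]:
    """Stand-in for streaming TTS: handles each clause as soon as it arrives."""
    for clause in clauses:
        log.append(f"tts spoke: {clause}")
        yield f"<audio:{clause}>"


audio = list(tts(llm_clauses()))

for line in log:
    print(line)
```

The log alternates between the two stages: the first clause is spoken before the model has written the last one. A step-by-step pipeline would show all three `llm wrote` lines before any `tts spoke` line.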

Turn-taking and barge-in

An agent that can only listen or talk, never both at once, makes you wait for it to finish before you can say anything. A good one runs full-duplex: it keeps listening while it speaks, so the instant you interrupt, it can stop. This makes an agent feel like a conversation rather than a menu.

That is barge-in, and it is hard because the agent has to hear you over the sound of itself. The microphone picks up the agent's own voice from the speaker (echo), so the system needs echo cancellation, plus a turn-taking and barge-in policy, to tell the interrupting user apart from its own audio bouncing back.^[2] If this goes wrong, the agent either talks over people or flinches at its own echo.
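One simple barge-in policy is to require sustained energy in the microphone signal after echo cancellation before stopping playback, so that a brief leak of the agent's own voice does not count as an interruption. The sketch below assumes an upstream echo canceller already supplies a per-frame residual energy value. The class name, threshold, and frame count are hypothetical, and this energy rule is just one possible policy, not the method of any cited system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BargeInDetector:
    """Decides when sound heard during agent playback counts as an interruption."""
    energy_threshold: float = 0.1  # hypothetical residual-energy threshold
    min_frames: int = 5            # consecutive loud frames required
    _run: int = 0

    def push(self, residual_energy: float, agent_speaking: bool) -> bool:
        """Feed one frame of post-echo-cancellation energy.

        Returns True when the agent should stop speaking and listen.
        """
        if not agent_speaking:
            self._run = 0
            return False
        if residual_energy >= self.energy_threshold:
            self._run += 1
        else:
            self._run = 0  # a short blip, likely leftover echo
        if self._run >= self.min_frames:
            self._run = 0
            return True
        return False


detector = BargeInDetector()
# Two loud frames of leftover echo, then the user actually starts talking.
energies: List[float] = [0.02, 0.3, 0.3, 0.02] + [0.4] * 6
decisions = [detector.push(e, agent_speaking=True) for e in energies]
first_stop = decisions.index(True)
```

Raising `min_frames` makes the agent less likely to flinch at its own echo but slower to yield when someone really does cut in, which is the same tension the paragraph above describes.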

Cascaded and end-to-end architectures

The cascade is not the only way. A newer architecture, the speech-to-speech model, swallows the whole loop into one neural network: audio in, audio out, with no text transcript in the middle. OpenAI's Realtime API and Google's Gemini Live are the well-known examples.

Speech-to-speech wins on latency and carries tone and emotion that text throws away. The cascaded pipeline still dominates production because you can see inside it: when a cascade misbehaves, you read the transcript, inspect the model's tool calls, swap in a better STT, and log every stage. A single end-to-end model is a black box. Which one wins for your use case is laid out in speech-to-speech models vs pipelines.

Common questions

Is a voice agent the same thing as a chatbot?

No. A chatbot exchanges text turns, with no clock and no audio. A voice agent runs over live audio with strict real-time constraints: detect when you stop speaking, respond within a few hundred milliseconds, handle being interrupted. A chatbot can take three seconds and nobody minds; a voice agent that pauses three seconds sounds broken.

What is the difference between STT and a voice agent?

STT (speech-to-text, also called ASR) is one component inside a voice agent: it converts incoming audio into text. The agent is the full loop that wraps STT together with a decision-making model and text-to-speech, plus turn-taking and barge-in. STT alone produces a transcript; the agent uses it to carry on a conversation.

Why do voice agents feel slow even when the AI is fast?

The delay is the sum of every stage, not just the language model. The agent has to decide you have finished (endpointing) and finalize the transcript before the model sees your words; then TTS has to start producing audio from the reply, with network latency on both ends. A fast model behind slow endpointing or non-streaming TTS still feels sluggish.

What is barge-in and why does it matter?

Barge-in is interrupting the agent mid-sentence and having it stop and listen, the way a person would. Without it the agent feels rigid: you wait for it to finish before you can correct it or change the subject. It requires the agent to keep listening while it talks and cancel its own speech the instant you start.

Cascaded pipeline or speech-to-speech model: which should I use?

For most production systems today, the pipeline. You can inspect, debug, and swap each stage independently. Speech-to-speech is lower latency and carries more vocal nuance, but it is harder to debug and steer. Need tight control, auditability, or the freedom to change vendors? Start with a pipeline.

References

[1] Soniox (2026). Soniox Voice Agent. Soniox documentation.
[2] Ekstedt, E., & Skantze, G. (2022). Voice Activity Projection: Self-supervised Learning of Turn-taking Events. arXiv preprint arXiv:2205.09812.