Voice agent architecture: STT → LLM → TTS, explained

The conventional three-stage diagram leaves out most of the runtime system. Something has to decide when an utterance is over, cope with interruption, keep the dialogue state, and conduct three streaming processes at once. That something is the orchestrator, and it shapes how the agent behaves more than any single box in the diagram does.

Transport

Before any intelligence, the agent needs a path for audio in both directions. In the browser that is the microphone and speaker over WebRTC. On the phone it is a telephony connection carrying 8 kHz call audio. This layer captures the user's speech as a stream and plays the agent's speech back, and it sets hard constraints on everything above it: the audio format, the latency floor of the network, and whether you get clean separate audio or a noisy phone line.

Recognition and turn detection

The incoming audio goes to streaming recognition, which emits partial and final words, and to the turn logic that decides when the user has finished. That second job, endpoint detection sitting on top of voice activity detection, tells the agent it is now its turn to think.^[2] It is a timing decision, and getting it wrong either cuts the user off or leaves them in silence.

The orchestrator

The orchestrator is the conductor the three-box diagram omits, the state machine that runs the conversation. It holds the agent's state, listening, thinking, speaking, interrupted, decides when a turn is over, assembles what to send the language model, starts and stops synthesis, and handles the messy real-time events that do not fit a clean pipeline. This is the part that makes talking to an agent feel like a conversation instead of a sequence of requests.

It is what voice agent frameworks like Pipecat and LiveKit Agents exist to provide, because writing it well is most of the work of building an agent.^[1]

The language model

The language model is the decision-maker, but stateless on its own, so the orchestrator feeds it the conversation's context every turn: the system prompt that defines the agent's role, the running history of what was said, and any retrieved knowledge or tool results. Managing this context, keeping it complete enough to be coherent and short enough to stay fast, is an ongoing job, because the history grows with every turn and the model's speed depends on its length.

The model's output is streamed token by token rather than produced all at once, so synthesis can begin on the first words while the rest are still being generated. An agent that waits for the complete reply before speaking throws away the time the model spent streaming.

Synthesis

The model's tokens flow into streaming TTS, which emits audio as the text arrives, so the agent starts speaking with minimal time-to-first-audio. The synthesis must be interruptible: if the user barges in, the orchestrator stops it mid-word and discards what was queued, which is why the audio is generated in small chunks instead of one clip.

flowchart TB T[Transport<br/>audio in/out] --> S[STT + endpointing] S --> O[Orchestrator<br/>state, context, turns] O --> L[Language model] L --> O O --> TTS[Streaming TTS] TTS --> T T -.->|barge-in| O

The real architecture: three capabilities around a central orchestrator that runs the conversation and handles interruption.

Interruption and feedback handling

The dotted arrow above is what lets the agent be interrupted. While the agent is speaking, the transport keeps listening, because the user might interrupt. When they do, the orchestrator detects it, stops the TTS, drops the reply it was giving, and swings back to listening, all within a couple hundred milliseconds. This loop, the agent monitoring for the user even mid-sentence, makes a voice agent a cycle with the orchestrator at its center rather than a straight pipeline.

It is also where the latency budget is spent and lost. Each component adds delay, and the orchestrator's job includes overlapping them, recognition feeding the model feeding synthesis, so the stages run concurrently rather than in series. A correct architecture also has to be fast, and being fast is mostly a matter of not running the boxes one at a time.

Common questions

What are the components of a voice agent?

Five: a transport for audio in and out, streaming speech-to-text with endpointing, a language model, streaming text-to-speech, and the orchestrator that ties them together. The orchestrator is the one the three-box diagram leaves out, and it is where most of the engineering lives, because the conversation state, context, turn-taking, and interruption handling all sit there.

Is a voice agent just STT, an LLM, and TTS connected in a row?

No. If you wire them in a plain row, you get turn-by-turn request and response, not a conversation. The orchestrator between them decides when a turn ends, what context to send the model, when to stream output, and how to stop on an interruption. The diagram is correct as far as it goes; the arrows hide most of the work.

What does the orchestrator do in a voice agent?

It runs the conversation as a state machine: listening, thinking, speaking, interrupted. It decides when the user's turn is over, assembles the prompt and history each turn, starts and stops synthesis, and on barge-in stops the agent mid-word within a couple hundred milliseconds. Writing this well is most of the work, which is why Pipecat and LiveKit Agents exist to supply it.

Why does the agent keep listening while it is talking?

So the user can barge in. The transport keeps capturing audio during playback and the orchestrator watches for it, so a mid-sentence interruption swings the agent back to listening fast instead of letting it talk over the user. That listening-while-speaking loop is the dotted arrow in the diagram above.

References

Soniox (2026). Soniox Voice Agent. Soniox documentation.
Ekstedt, E., & Skantze, G. (2022). Voice Activity Projection: Self-supervised Learning of Turn-taking Events. arXiv preprint arXiv:2205.09812.