Voice agent frameworks: Pipecat, LiveKit Agents, and friends

Every team that builds a voice agent writes the same code first: capture the audio, detect the turns, stream the stages into each other, survive a dropped connection. None of it is the product, and all of it has to work before the product can exist. Frameworks package that layer, so the build starts where your agent is actually different.

Framework components

You get the orchestrator, the conductor that runs the conversation, plus the connective tissue around it. A framework handles the transport: audio in and out over WebRTC for the browser or a telephony connection for phones, plus the connection lifecycle when networks misbehave. The turn logic comes built in, integrating voice activity detection and endpointing so the agent knows when to talk, and barge-in so it stops when interrupted. So does the pipeline plumbing, which streams partial recognition into the model and the model's tokens into synthesis so the stages overlap instead of blocking one another. On top sit the provider plugins, swappable adapters for different STT, LLM, and TTS services, and the application hooks: tool calling, conversation state and context, and observability.
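To make the overlap concrete, here is a minimal sketch in plain Python asyncio. It is not any framework's API: `recognize`, `generate`, and `synthesize` are hypothetical stand-ins for the three stages, and the "model" just uppercases each word. What it shows is the shape of the plumbing, where each stage consumes the previous one as a stream rather than waiting for it to finish.

```python
import asyncio
from typing import AsyncIterator, List


async def recognize(events: List[str]) -> AsyncIterator[str]:
    """Hypothetical recognizer: yields partial words one at a time."""
    for word in ["what", "time", "is", "it"]:
        await asyncio.sleep(0)  # yield to the event loop, as a network read would
        events.append(f"stt:{word}")
        yield word


async def generate(words: AsyncIterator[str], events: List[str]) -> AsyncIterator[str]:
    """Hypothetical model: emits a token per word, without waiting for the full transcript."""
    async for word in words:
        token = word.upper()
        events.append(f"llm:{token}")
        yield token


async def synthesize(tokens: AsyncIterator[str], events: List[str]) -> None:
    """Hypothetical synthesizer: 'speaks' each token as soon as it arrives."""
    async for token in tokens:
        events.append(f"tts:{token}")


def run_pipeline() -> List[str]:
    """Chain the three stages and return the order in which they did their work."""
    events: List[str] = []
    asyncio.run(synthesize(generate(recognize(events), events), events))
    return events
```

Calling `run_pipeline()` returns an event log in which the first token is spoken before the last word is recognized. A blocking design would finish all recognition first, then all generation, then all synthesis.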

A framework gives you everything in the architecture diagram except the three core models, which you supply.

Major framework types

Two open-source frameworks anchor the space, with managed platforms above them.

Pipecat is an open-source Python framework originated by Daily. It models a voice agent as a pipeline of processors through which audio and text flow, with built-in handling for the real-time concerns above and a wide set of provider plugins. You self-host it, so you keep control of the whole stack.
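The pipeline-of-processors idea can be sketched generically. The code below is an illustration of the pattern only, not Pipecat's actual classes or signatures: `Packet`, `transcribe`, `respond`, `speak`, and `run_pipeline` are all hypothetical names, and the stages do trivial string work in place of real models.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List


@dataclass(frozen=True)
class Packet:
    """Hypothetical unit of data moving through the pipeline."""
    kind: str      # "audio" or "text"
    payload: str


# A processor takes one packet and yields zero or more packets.
Processor = Callable[[Packet], Iterable[Packet]]


def transcribe(packet: Packet) -> Iterator[Packet]:
    # Stand-in for STT: turn an audio packet into a text packet.
    if packet.kind == "audio":
        yield Packet("text", packet.payload)
    else:
        yield packet


def respond(packet: Packet) -> Iterator[Packet]:
    # Stand-in for the LLM: produce a reply for each text packet.
    if packet.kind == "text":
        yield Packet("text", f"You said: {packet.payload}")
    else:
        yield packet


def speak(packet: Packet) -> Iterator[Packet]:
    # Stand-in for TTS: turn a text packet back into an audio packet.
    if packet.kind == "text":
        yield Packet("audio", f"spoken({packet.payload})")
    else:
        yield packet


def run_pipeline(processors: List[Processor], packets: Iterable[Packet]) -> List[Packet]:
    """Push every packet through each processor in order."""
    current = list(packets)
    for process in processors:
        current = [out for packet in current for out in process(packet)]
    return current
```

With these stand-ins, `run_pipeline([transcribe, respond, speak], [Packet("audio", "hello")])` returns a single audio packet. Reordering or replacing an entry in the list changes the agent without touching the other stages, which is the appeal of the model.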

LiveKit Agents builds on LiveKit's WebRTC infrastructure, which gives it strong transport and telephony out of the box, and adds an agents layer for turn-taking, tool use, and multi-agent workflows. Pick it when you want the media plumbing and the agent logic from one place.
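Turn-taking includes barge-in: the agent stops talking when the caller starts. A minimal sketch of that one behavior, in generic asyncio rather than LiveKit Agents' API, might look like the following. `play_reply` and `reply_with_barge_in` are hypothetical names, and a timed sleep stands in for voice activity detection hearing the caller.

```python
import asyncio
from typing import List


async def play_reply(chunks: List[str], played: List[str]) -> None:
    """Hypothetical playback loop: emit one audio chunk at a time."""
    for chunk in chunks:
        played.append(chunk)
        await asyncio.sleep(0.05)  # stand-in for playing one chunk of audio


async def reply_with_barge_in(chunks: List[str], interrupt_after: float) -> List[str]:
    """Start speaking, then cancel playback when the caller interrupts."""
    played: List[str] = []
    playback = asyncio.create_task(play_reply(chunks, played))
    await asyncio.sleep(interrupt_after)  # stand-in for VAD detecting caller speech
    playback.cancel()  # barge-in: stop the reply mid-stream
    try:
        await playback
    except asyncio.CancelledError:
        pass  # cancellation is the expected outcome here
    return played
```

Running `asyncio.run(reply_with_barge_in(chunks, 0.1))` on a long reply returns only the chunks played before the interruption. A real framework also has to flush audio already buffered downstream, which this sketch leaves out.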

Above these sit managed platforms, hosted services that run the entire agent for you, where you configure behavior rather than operate infrastructure. The roster changes quickly; the category is what matters.

Framework and platform trade-offs

For most teams the open-source framework is the right call. You keep control of your data, providers, and logic, and you skip the plumbing everyone else also has to write. Pipecat and LiveKit Agents both put you there.

The two ends are for narrower cases. Roll your own orchestrator only when your requirements do not fit existing tools, because you then maintain the hard real-time loop forever. A managed platform gets you live fastest and is the right choice when your needs are standard and you do not want to own the internals at all. Everything else, which is most of it, sits in the middle.

	Roll your own	Open-source framework	Managed platform
You control	Everything	Providers, logic, data	Configuration
You maintain	The real-time loop, forever	Your agent code	Almost nothing
Time to first call	Longest	Short	Shortest
Fits	Requirements nothing else meets	Most teams	Standard needs, fast launch

The middle column is the default for a reason: control without the plumbing.

Provider interfaces

The open frameworks do not lock you to a particular recognizer, model, or voice. Each is a swappable plugin, so you pick the best component for each job and change it as the field moves. A single speech-to-speech model sits at the other end, where the whole agent is one vendor's system. The swappable design is also why a recognition and synthesis provider integrates with a framework rather than competing with it: the framework supplies the conductor, the provider supplies the ears and the voice, and you choose both.^[1]
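One way to picture the swappable design is as a small interface that every provider adapter satisfies. The sketch below assumes nothing about any real framework or vendor: `SpeechToText`, `TextToSpeech`, `VendorAStt`, `VendorBStt`, `EchoTts`, and `Agent` are hypothetical, and the adapters fake recognition and synthesis with byte and string conversions.

```python
from typing import Protocol


class SpeechToText(Protocol):
    """Hypothetical interface every recognizer adapter must satisfy."""
    def transcribe(self, audio: bytes) -> str: ...


class TextToSpeech(Protocol):
    """Hypothetical interface every voice adapter must satisfy."""
    def synthesize(self, text: str) -> bytes: ...


class VendorAStt:
    # Stand-in recognizer: treats the audio bytes as UTF-8 text.
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class VendorBStt:
    # A second stand-in recognizer with different output (lowercased).
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8").lower()


class EchoTts:
    # Stand-in voice: encodes the text back to bytes.
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


class Agent:
    """The conductor depends on the interfaces, never on a concrete vendor."""

    def __init__(self, stt: SpeechToText, tts: TextToSpeech) -> None:
        self.stt = stt
        self.tts = tts

    def handle(self, audio: bytes) -> bytes:
        text = self.stt.transcribe(audio)
        return self.tts.synthesize(f"heard: {text}")
```

Swapping recognizers is then a one-argument change: `Agent(VendorAStt(), EchoTts())` and `Agent(VendorBStt(), EchoTts())` run the same agent logic against different ears.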

Common questions

What does a voice agent framework do?

It builds the conductor for you: the orchestrator and the real-time plumbing around it. You supply only the three core models, the speech-to-text, the language model, and the text-to-speech, and the framework wires them into a responsive conversation. It gives you everything in the architecture diagram except those three.

What is the difference between Pipecat and LiveKit Agents?

LiveKit Agents bundles the media infrastructure with the agent logic; Pipecat does not. Pick LiveKit when you want transport and telephony from the same source as your turn-taking and tool use, since it builds on LiveKit's WebRTC layer. Pick Pipecat when you want to self-host a pipeline of processors and pick your own transport. Both keep providers swappable, so neither locks you in.

Should I use a framework or a managed voice agent platform?

Default to the open-source framework: you keep control of data, providers, and logic without rebuilding the plumbing. Reach for a managed platform only when your needs are standard and you want to configure rather than operate, and roll your own orchestrator only when neither fits, because you then maintain the real-time loop forever.

Do frameworks lock me into specific STT, LLM, or TTS providers?

No, every component is a swappable plugin. You choose the best recognizer, model, and voice for each job and change any one as the field moves. A single speech-to-speech model works the opposite way, with the whole agent as one vendor's system and nothing to swap.

References

[1] Soniox (2026). Build a voice agent with Pipecat and Soniox. Soniox documentation.