Build a voice agent with Pipecat and Soniox
Compose Soniox STT and TTS into a complete voice agent running on the Pipecat framework, from a minimal chat bot to a structured appointment-booking flow.
Overview
The STT and TTS integration pages already cover how Soniox APIs run with Pipecat. This page combines them into a voice agent and grows the bot from a chat-only setup to a structured booking assistant.
The walkthrough builds a dentist receptionist in three stages:
- The pipeline shape, then the small additions that make it actually run as a chat bot.
- Adding tools the LLM can call to look up and book appointments.
- Structuring with Pipecat Flows, where the conversation follows a deterministic node graph.
Why Soniox for voice agents
Reasons that matter when shipping a real voice agent:
- One API key for both ends. Soniox covers STT and TTS through the unified speech platform. No second vendor to integrate, monitor, or scale.
- Real multilingual support. STT supports 60+ languages with automatic language identification and handles code-switched speech. TTS speaks 60+ languages.
- Names, numbers, and IDs. STT recognizes names, phone numbers, emails, and alphanumerics accurately, and TTS pronounces them back the same way. General-purpose providers usually mangle one or the other.
- Low STT latency. Soniox leads the Pipecat STT benchmark on time-to-final-transcript, so the LLM picks up the moment the user stops talking.
- Production scaling with good pricing. Soniox supports high-concurrency real-time workloads and regional endpoints.
Setup
Install Pipecat with the extras for Soniox STT and TTS, OpenAI as the LLM, the development runner, and the transports the examples support (browser WebRTC and Daily):
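A hedged example; the exact extras names can shift between Pipecat releases, so check the Pipecat install docs if an extra is not found:

```bash
# Extras names follow recent Pipecat releases; adjust if your version differs.
pip install "pipecat-ai[soniox,openai,webrtc,daily,runner]"
```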
For the Pipecat Flows stage at the end, also install:
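```bash
pip install pipecat-ai-flows
```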
Set your API keys. The same Soniox key works for STT and TTS. Create one in the Soniox Console:
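The examples below read the keys from these environment variable names (assumed here; use whatever names your bot code reads):

```bash
export SONIOX_API_KEY="your-soniox-api-key"
export OPENAI_API_KEY="your-openai-api-key"
```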
Or put them in a .env next to your bot file. The examples below call load_dotenv() to pick that up automatically.
Running the bot
Each bot file uses Pipecat's development runner, which picks the transport at startup. The fastest local test is the prebuilt WebRTC UI:
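Assuming your bot file is named bot.py (the filename is up to you), and that your runner version defaults to the browser WebRTC transport for local runs:

```bash
python bot.py            # prebuilt WebRTC UI on http://localhost:7860
# or pick the transport explicitly:
python bot.py -t webrtc
```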
Open http://localhost:7860, click connect, and start talking. The prebuilt UI hands the bot your microphone and plays its replies through your speakers. Metrics are enabled in every example, so the same page shows per-stage latency, token usage, and function calls live as the conversation runs.
The same files also support -t daily and -t twilio for cloud rooms and telephony. See the Pipecat development runner guide for the credentials and tunneling those transports need.
The pipeline shape
A voice agent is a Pipecat pipeline with five stages in order:
- A transport captures audio from the user.
- An STT service turns that audio into text.
- An LLM produces a reply.
- A TTS service turns the reply back into audio.
- The transport again plays the audio back to the user.
The transport is whatever you use to move audio between the user and the bot. Pipecat ships transports for telephony providers (Twilio, Telnyx, and others), web and WebRTC stacks (Daily, LiveKit, browser WebRTC), and local audio for development. The rest of the pipeline does not change when you swap one for another.
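A minimal sketch of that shape with Soniox on both ends and OpenAI in the middle. The class and module names follow the Soniox integration pages and recent Pipecat releases, so adjust them to your installed version; the transport object comes from the development runner (see Running the bot above):

```python
import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.openai.llm import OpenAILLMService
from pipecat.services.soniox.stt import SonioxSTTService
from pipecat.services.soniox.tts import SonioxTTSService

# "transport" is assumed to be created by the development runner's create_transport.
stt = SonioxSTTService(
    api_key=os.getenv("SONIOX_API_KEY"),
    vad_force_turn_endpoint=False,  # rely on Soniox endpoint detection, not a local VAD
    # The STT params also accept a domain "context"; see the STT integration page.
)
tts = SonioxTTSService(api_key=os.getenv("SONIOX_API_KEY"))
llm = OpenAILLMService(api_key=os.getenv("OPENAI_API_KEY"))

pipeline = Pipeline([
    transport.input(),   # audio in from the user
    stt,                 # speech -> text
    llm,                 # text -> reply
    tts,                 # reply -> audio
    transport.output(),  # audio back to the user
])
```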
vad_force_turn_endpoint=False tells Pipecat to use Soniox's built-in endpoint detection instead of running a separate local VAD. See endpoint detection for details.
The context field tunes STT to your domain. List the brand names, jargon, and identifiers your users will say. See STT context for examples and more details.
Adding turn-taking and history
The pipeline above runs, but it has no memory between turns. LLMContext solves that: it is the conversation history, and every LLM request includes it. That is how the bot remembers names, resolves pronouns, and answers follow-ups.
Two aggregators keep the context current. The user aggregator buffers STT fragments and appends one {"role": "user", "content": "..."} message when the user's turn ends. The assistant aggregator buffers the LLM's streaming tokens and appends one {"role": "assistant", "content": "..."} message when the response completes. They come as a pair because they share the same context.
After a few turns, that context looks like this:
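An illustrative snapshot (the message wording is made up):

```python
[
    {"role": "system", "content": "You are the receptionist at a dental office. Keep replies short."},
    {"role": "user", "content": "Hi, I'd like to book a cleaning."},
    {"role": "assistant", "content": "Of course. What day works best for you?"},
    {"role": "user", "content": "Do you have anything next Tuesday?"},
]
```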
Wiring this into the pipeline takes two things: a fresh LLMContext and the aggregator pair built from it, slotted in around the LLM. The full bot looks like this:
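A hedged sketch, reusing the service objects from the previous section. The import paths below follow recent Pipecat releases (older releases build the pair with llm.create_context_aggregator instead):

```python
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import LLMContextAggregatorPair

messages = [
    {
        "role": "system",
        "content": "You are the receptionist at a dental office. Keep replies short and speakable.",
    }
]

context = LLMContext(messages)
context_aggregator = LLMContextAggregatorPair(context)

pipeline = Pipeline([
    transport.input(),
    stt,
    context_aggregator.user(),       # appends the user's finished turn to the context
    llm,
    tts,
    transport.output(),
    context_aggregator.assistant(),  # appends the assistant's finished reply
])

task = PipelineTask(pipeline, params=PipelineParams(enable_metrics=True))
```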
To start the conversation before the user speaks, append a message to the context and queue an LLMRunFrame:
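A sketch of that kickoff, assuming the transport, messages, and task objects above:

```python
from pipecat.frames.frames import LLMRunFrame

@transport.event_handler("on_client_connected")
async def on_client_connected(transport, client):
    # Add a one-off instruction, then trigger a single LLM turn so the bot speaks first.
    # This relies on the context keeping a reference to the same messages list.
    messages.append({"role": "system", "content": "Greet the caller and ask how you can help."})
    await task.queue_frames([LLMRunFrame()])
```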
That is a complete voice agent, chat-only for now. Tools come next.
Adding tools
Function calling lets the LLM trigger Python code during a conversation. Each tool has two parts:
- A handler is the Python function that runs when the tool is called. It receives the LLM's parsed arguments and returns a result.
- A schema is the declarative description the LLM sees: the tool's name, what it does, and what arguments it expects. The LLM uses the schema to decide when to call a tool and how to fill in its arguments.
Pipecat's FunctionSchema is what you build for the schema. A single FunctionSchema is translated into the right wire format for whichever LLM provider you use, so the same tool definition works across OpenAI, Anthropic, Google, and others.
The dentist needs two tools to start: one to look up open slots, one to book an appointment.
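A hedged sketch of both tools. The handlers fake a tiny in-memory schedule (FAKE_SCHEDULE is illustrative, standing in for a real booking system), and the schemas are plain FunctionSchemas registered on the LLM service; import paths follow recent Pipecat releases:

```python
from pipecat.adapters.schemas.function_schema import FunctionSchema
from pipecat.adapters.schemas.tools_schema import ToolsSchema
from pipecat.services.llm_service import FunctionCallParams

# Illustrative in-memory "schedule" standing in for a real booking system.
FAKE_SCHEDULE = {"2026-05-12": ["09:00", "11:30", "15:00"]}

async def check_availability(params: FunctionCallParams):
    date = params.arguments["date"]
    await params.result_callback({"date": date, "open_slots": FAKE_SCHEDULE.get(date, [])})

async def book_appointment(params: FunctionCallParams):
    date = params.arguments["date"]
    time = params.arguments["time"]
    name = params.arguments["patient_name"]
    if time in FAKE_SCHEDULE.get(date, []):
        FAKE_SCHEDULE[date].remove(time)
        await params.result_callback({"status": "booked", "date": date, "time": time, "patient": name})
    else:
        await params.result_callback({"status": "unavailable", "date": date, "time": time})

check_availability_schema = FunctionSchema(
    name="check_availability",
    description="List open appointment slots for a given date.",
    properties={"date": {"type": "string", "description": "Date in YYYY-MM-DD format."}},
    required=["date"],
)

book_appointment_schema = FunctionSchema(
    name="book_appointment",
    description="Book an appointment. Only call after the patient confirms the date and time.",
    properties={
        "date": {"type": "string", "description": "Date in YYYY-MM-DD format."},
        "time": {"type": "string", "description": "Start time, for example 11:30."},
        "patient_name": {"type": "string", "description": "Patient's full name."},
    },
    required=["date", "time", "patient_name"],
)

# The context now carries the tool definitions so every LLM request includes them.
tools = ToolsSchema(standard_tools=[check_availability_schema, book_appointment_schema])
context = LLMContext(messages, tools)

llm.register_function("check_availability", check_availability)
llm.register_function("book_appointment", book_appointment)
```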
How invocation works
The LLM decides when to call a tool, based on the conversation and the tool's description. There is no fixed user phrase that maps to a tool. A user asking "is there anything Tuesday?" causes the LLM to call check_availability with date="2026-05-12", the same as one asking "what slots are open this week?".
If a required parameter is missing, the LLM should ask the user a clarifying question rather than fill a placeholder, and call the tool once the answer arrives. That is why required arguments are useful: they push the LLM to gather data conversationally before triggering the side effect.
Useful patterns
Audible filler keeps the call from going silent during a tool's network round trip. The calls argument is a list of FunctionCallFromLLM entries with function_name, tool_call_id, and arguments, so you can pick the filler per tool:
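A sketch of that pattern, assuming the llm and tts services defined above and the on_function_calls_started event that Pipecat's LLM services emit:

```python
from pipecat.frames.frames import TTSSpeakFrame

@llm.event_handler("on_function_calls_started")
async def on_function_calls_started(service, calls):
    # calls is a list of FunctionCallFromLLM entries; pick the filler per tool.
    for call in calls:
        if call.function_name == "book_appointment":
            await tts.queue_frame(TTSSpeakFrame("One moment while I book that."))
        else:
            await tts.queue_frame(TTSSpeakFrame("Let me check the schedule."))
```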
When prompt rules stop being enough
This is enough for a small bot, but two real problems show up as the dentist flow grows:
- Accidental writes. Tools that change real state are easy to fire by accident. A system-prompt rule like "always read back the date and time before calling book_appointment" helps, but the LLM follows prompt rules unevenly. One bad turn and the booking goes through unconfirmed.
- More rules to enforce. Requirements like "collect insurance before booking" or "urgent cases skip slot proposal" pile into the system prompt as natural-language instructions. The longer that prompt, the more often the LLM ignores parts of it.
Pipecat Flows fixes both structurally. Each tool is scoped to a specific node, so book_appointment cannot be called until the conversation reaches the confirm node. Step ordering and branching live in Python instead of the prompt, so the LLM cannot skip them.
Structuring with Pipecat Flows
Pipecat Flows is a separate package that models a conversation as a graph of nodes. Each node has its own focused prompt and its own subset of tools. Each handler decides which node comes next. The LLM still phrases everything, but it cannot skip steps.
The mental model
A flow is a graph of nodes. At any moment, the bot is in exactly one node, and that node defines two things: what the LLM should be doing (its prompt) and which tools it is allowed to call. The LLM never sees tools from other nodes, so it cannot accidentally jump ahead and book an appointment while it is still collecting the patient's name.
Transitions between nodes happen in code, not in the prompt. Each tool is wired to a Python handler, and that handler returns the next node along with the tool's result. The LLM phrases the conversation, but the handler decides where the conversation goes next. Branching becomes an if statement: send urgent cases to triage and everyone else to slot proposal.
Flows does not replace the pipeline. It is a layer on top of the LLM service that swaps the prompt and the tool list whenever the conversation moves to a new node. A FlowManager object hooks into the LLM and the context aggregators, and you start the conversation by handing it the initial node.
Handlers and schemas
Each tool still needs a handler and a schema, the same as in the previous stage. Two things change:
- Handlers in Flows return tuple[result, next_node] instead of calling a result_callback. The first element is what the LLM sees; the second element is the node the conversation moves to next.
- Schemas use FlowsFunctionSchema instead of FunctionSchema. The shape is identical (name, description, properties, required), with one extra field, handler, that ties the schema to its Python function.
Branching happens inside the handler. Pick the next node with normal Python, based on the LLM's arguments or external state:
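A hedged handler sketch using that signature. The node factories and the available_slots helper are illustrative names, not part of pipecat-ai-flows:

```python
from pipecat_flows import FlowArgs, FlowManager, FlowResult, NodeConfig

async def collect_reason(args: FlowArgs, flow_manager: FlowManager) -> tuple[FlowResult, NodeConfig]:
    reason = args["reason"]
    urgent = args.get("urgent", False)
    flow_manager.state["reason"] = reason  # stash data for later nodes

    if urgent:
        # Urgent cases skip slot proposal and go straight to triage.
        return {"status": "noted", "urgent": True}, create_triage_node()

    name = flow_manager.state.get("patient_name", "the patient")
    return {"status": "noted", "urgent": False}, create_propose_slots_node(name, available_slots())
```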
Cycles work the same way. A handler that returns a node the conversation has already visited just sends the user back to that step.
A schema looks the same as a regular FunctionSchema plus a handler reference:
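For example, a schema for the handler above might look like this (field values are illustrative):

```python
from pipecat_flows import FlowsFunctionSchema

collect_reason_schema = FlowsFunctionSchema(
    name="collect_reason",
    description="Record why the patient is calling and whether it is urgent.",
    properties={
        "reason": {"type": "string", "description": "Reason for the visit."},
        "urgent": {"type": "boolean", "description": "True if the patient is in pain or bleeding."},
    },
    required=["reason"],
    handler=collect_reason,  # ties the schema to its Python function
)
```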
The dentist flow has five handlers and five schemas in total. The full set is in the bot file at the end of this section.
Nodes
Each node-creator function returns a NodeConfig for one step. Parameters such as name or slots flow from one step to the next through these factories. The initial node is plain:
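A sketch of the greet node, assuming a collect_name_schema defined alongside the other schemas; the NodeConfig field names follow current pipecat-ai-flows releases:

```python
from pipecat_flows import NodeConfig

def create_greet_node() -> NodeConfig:
    return NodeConfig(
        name="greet",
        role_messages=[{
            "role": "system",
            "content": "You are the receptionist at a dental office. Keep replies short and speakable.",
        }],
        task_messages=[{
            "role": "system",
            "content": "Greet the caller, ask for their name, and ask how you can help.",
        }],
        functions=[collect_name_schema],
    )
```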
A later node accepts data from the previous step and bakes it into its prompt:
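For example, a slot-proposal node might take the patient's name and the open slots from the previous handler and bake them into its task prompt (select_slot_schema is another illustrative schema):

```python
def create_propose_slots_node(patient_name: str, slots: list[str]) -> NodeConfig:
    return NodeConfig(
        name="propose_slots",
        task_messages=[{
            "role": "system",
            "content": (
                f"Offer {patient_name} these open slots: {', '.join(slots)}. "
                "Ask which one works, or whether to check a different day."
            ),
        }],
        functions=[select_slot_schema, check_availability_schema],
    )
```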
The dentist flow has seven nodes (greet, reason, triage, propose_slots, confirm, no_availability, end). Each one follows the same shape. The full set is in the bot file below.
Wiring it into the pipeline
The pipeline is the same as in stage 1. The two new pieces are a FlowManager built on top of the aggregator pair, and an on_client_connected that hands it the initial node:
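A sketch assuming the pipeline, task, llm, and context_aggregator from stage 1; the constructor and initialize call follow the pipecat-ai-flows dynamic-flow examples, so check the Flows reference if your version differs:

```python
from pipecat_flows import FlowManager

flow_manager = FlowManager(
    task=task,
    llm=llm,
    context_aggregator=context_aggregator,
)

@transport.event_handler("on_client_connected")
async def on_client_connected(transport, client):
    # Hand the flow its first node; Flows manages the prompt and tool list from here on.
    await flow_manager.initialize(create_greet_node())
```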
There are no register_function calls and no tools argument on the context. Flows manages the tool surface per node, so the LLM only sees book_appointment once the conversation reaches the confirm node.
What this buys you
- Tool isolation. book_appointment is not in the LLM's tool list during the greet node, so it cannot be invoked there. With a single prompt, that constraint depends on the LLM choosing to follow instructions.
- Deterministic transitions. Step ordering is a Python expression in a handler, not a sentence in a prompt. The LLM still phrases the conversation, but it cannot skip steps.
- Branchable logic. The urgent field routes to triage instead of propose_slots from the same handler. Adding a new branch is one extra if and one extra node.
- Cycles. no_availability returns the patient to propose_slots until they pick an open day or end the call. Cycles are common in voice flows.
What to read next
- Soniox STT in Pipecat: constructor arguments, settings, language hints, context customization, and endpoint detection.
- Soniox TTS in Pipecat: constructor arguments, settings, voices, sample rates, and text aggregation modes.
- Pipecat Flows: the full Flows reference, including static flows, direct functions, and end-to-end examples.
- Example bot: an end-to-end Pipecat voice agent using Soniox STT and TTS, ready to run.