Build a voice agent with LiveKit and Soniox
Compose Soniox STT and TTS into a complete voice agent running on the LiveKit Agents framework, from a minimal chat bot to a structured appointment-booking flow.
Overview
The STT and TTS integration pages already cover how Soniox APIs run with LiveKit. This page combines them into a voice agent built around a dentist receptionist scenario.
The walkthrough covers:
- The agent shape — a minimal LiveKit `Agent` running Soniox at both ends.
- Turn-taking and context — Soniox endpoint detection and domain context.
- Adding tools the LLM can call to look up and book appointments.
- When you need more structure — a brief on multi-agent workflows for complex flows.
LiveKit framework concepts (Agent, AgentSession, lifecycle hooks, multi-agent handoffs) are covered in LiveKit's docs. This page focuses on the Soniox-specific pieces.
Why Soniox for voice agents
- One API key for both ends. Soniox covers STT and TTS through the unified speech platform. No second vendor to integrate, monitor, or scale.
- Real multilingual support. STT supports 60+ languages with automatic language identification and code-switched speech. TTS speaks 60+ languages.
- Names, numbers, and IDs. STT recognizes names, phone numbers, emails, and alphanumerics accurately, and TTS pronounces them back the same way.
- Low STT latency. Soniox leads on time-to-final-transcript, so the LLM picks up the moment the user stops talking.
- Production scaling with good pricing. High-concurrency real-time workloads and regional endpoints.
Setup
Install LiveKit Agents with the extras for Soniox, OpenAI as the LLM, and Silero for local voice activity detection:
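A sketch of the install command, assuming the extras are named after the plugins (`soniox`, `openai`, `silero`); check LiveKit's plugin list if your version names them differently:

```shell
# Extras names assumed to match the plugin names.
pip install "livekit-agents[soniox,openai,silero]"
```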
LiveKit's console mode also needs the PortAudio runtime to access your microphone.
On Linux:
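On Debian/Ubuntu the runtime library is packaged as `libportaudio2` (package name may differ on other distributions):

```shell
sudo apt-get install libportaudio2
```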
On macOS:
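With Homebrew:

```shell
brew install portaudio
```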
Set your API keys. The same Soniox key works for STT and TTS. Create one in the Soniox Console:
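The environment variable names below follow the plugins' usual convention and are an assumption; verify them against the plugin docs:

```shell
# One Soniox key covers both STT and TTS.
export SONIOX_API_KEY=your-soniox-key
export OPENAI_API_KEY=your-openai-key
```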
For this example our agent will run in console mode (no LiveKit server required — local mic and speakers):
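Assuming the agent lives in a file named `agent.py` (filename illustrative), the LiveKit CLI's `console` subcommand runs it against your local mic and speakers:

```shell
python agent.py console
```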
For dev and start modes and full deployment options, see LiveKit's running an agent docs.
The agent shape
A LiveKit voice agent is an Agent subclass running inside an AgentSession. The session wires four services: VAD, STT, LLM, and TTS. See the Voice AI quickstart for the framework basics.
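A minimal sketch of that shape, assuming LiveKit Agents 1.x and the Soniox plugin exposed as `livekit.plugins.soniox` (class name and LLM model choice are illustrative):

```python
from livekit.agents import Agent, AgentSession, JobContext, WorkerOptions, cli
from livekit.plugins import openai, silero, soniox


class Receptionist(Agent):
    def __init__(self) -> None:
        super().__init__(
            instructions="You are a friendly receptionist at a dental clinic."
        )


async def entrypoint(ctx: JobContext):
    await ctx.connect()
    session = AgentSession(
        vad=silero.VAD.load(),            # local voice activity detection
        stt=soniox.STT(),                 # defaults: stt-rt-v4, 16 kHz, language ID on
        llm=openai.LLM(model="gpt-4o-mini"),
        tts=soniox.TTS(voice="Maya"),     # one of the Soniox voices
    )
    await session.start(agent=Receptionist(), room=ctx.room)


if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```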
The two Soniox-specific lines:
- `stt=soniox.STT()` — uses defaults (model `stt-rt-v4`, 16 kHz, language identification on).
- `tts=soniox.TTS(voice="Maya")` — one of the Soniox voices.
Conversation history is managed by AgentSession automatically — no manual aggregator wiring needed.
Turn-taking and context
The minimal bot uses Silero VAD for turn detection. Soniox emits its own end-of-speech events from the STT WebSocket, and they arrive earlier than VAD's silence timer. Switching to Soniox endpoint detection makes the conversation feel snappier.
Domain context is the second tuning knob: Soniox STT accepts a list of terms (proper nouns, jargon, identifiers) and a set of general key/value facts about the domain. Both bias transcription accuracy for the session.
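A sketch of the tuned session, using the keyword arguments named in this section; where exactly each kwarg attaches (`AgentSession` vs. `soniox.STT`) is an assumption that may vary by plugin version:

```python
from livekit.agents import AgentSession
from livekit.plugins import openai, silero, soniox

# Sketch only: exact kwarg placement may differ across versions.
session = AgentSession(
    vad=silero.VAD.load(),                    # kept for interruption detection
    stt=soniox.STT(
        max_endpoint_delay_ms=1000,           # raise silence threshold from 500 ms
    ),
    llm=openai.LLM(model="gpt-4o-mini"),
    tts=soniox.TTS(voice="Maya"),
    turn_handling={"turn_detection": "stt"},  # Soniox end-of-speech ends the turn
    interruption={"mode": "vad"},             # local Silero handles barge-in
)
```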
Three changes:
- `turn_handling={"turn_detection": "stt", ...}` switches turn detection from VAD silence to Soniox's STT end-of-speech events. See endpoint detection.
- `interruption={"mode": "vad"}` keeps interruption local. LiveKit's default uses a cloud ML model that requires real LiveKit Cloud credentials — `"vad"` is the right choice for console mode.
- `max_endpoint_delay_ms=1000` raises Soniox's silence threshold from 500 ms. Anything shorter logs `stt end of speech received while user is speaking` warnings as Soniox declares an endpoint before Silero VAD agrees the speaker has stopped. A second of patience eliminates the desync.
Silero VAD stays in the pipeline even though Soniox now owns turn detection. AgentSession uses VAD for a second job: interruption detection — catching when the caller starts speaking while the agent is still talking. Since `interruption.mode` accepts only `"adaptive"` (a LiveKit Cloud ML model) or `"vad"`, local Silero is the practical choice for that job. The labor split lines up with each tool's strengths: Soniox decides when a turn ends, Silero decides when one starts. For more, see LiveKit's turn-taking docs.
The `context` field tunes STT to your domain. List the brand names, jargon, and identifiers your users will say in `terms`. Use `general` for structured facts that bias what the model expects to hear. See STT context for examples.
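An illustrative shape for the `context` payload, following the description above (a list of terms plus general key/value facts); the exact schema is an assumption — consult the Soniox STT context docs:

```python
from livekit.plugins import soniox

# Illustrative values for the dentist scenario; key names assumed.
stt = soniox.STT(
    context={
        "terms": ["Invisalign", "Dr. Novak", "Brightsmile Dental"],
        "general": [
            {"key": "business", "value": "dental clinic"},
            {"key": "task", "value": "appointment booking"},
        ],
    },
)
```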
Adding tools
Function calling in LiveKit is done via @function_tool methods on the Agent subclass. The decorator reads the signature and docstring to build the LLM-facing schema — no separate schema object to maintain.
The dentist needs two tools: one to look up open slots, one to book.
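A sketch of the two tools, with a hypothetical in-memory calendar (`FAKE_SLOTS`) standing in for a real scheduling backend:

```python
from livekit.agents import Agent, function_tool

# Hypothetical in-memory calendar standing in for a real backend.
FAKE_SLOTS = {"2025-03-14": ["09:00", "11:30", "15:00"]}


class Receptionist(Agent):
    def __init__(self) -> None:
        super().__init__(
            instructions="You are a receptionist at a dental clinic. "
            "Offer open slots before booking."
        )

    @function_tool
    async def lookup_slots(self, date: str) -> list[str]:
        """Look up open appointment slots.

        Args:
            date: Requested day in YYYY-MM-DD format.
        """
        return FAKE_SLOTS.get(date, [])

    @function_tool
    async def book_appointment(self, date: str, time: str, patient_name: str) -> str:
        """Book an appointment at a previously offered slot."""
        slots = FAKE_SLOTS.get(date, [])
        if time not in slots:
            return "That slot is no longer available."
        slots.remove(time)
        return f"Booked {patient_name} for {date} at {time}."
```

The decorator derives each tool's schema from the signature and the docstring, so keeping both accurate is what keeps the LLM calling them correctly.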
The AgentSession and entrypoint are unchanged.
Useful patterns
Audible filler. Tools that hit the network can stall the conversation. Make the agent speak before the work starts:
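One way to do this, assuming the tool accepts an injected `RunContext` and speaks via `session.say` (the `calendar_api` backend call is hypothetical):

```python
from livekit.agents import RunContext, function_tool

@function_tool
async def lookup_slots(self, context: RunContext, date: str) -> list[str]:
    """Look up open appointment slots for a date (YYYY-MM-DD)."""
    # Speak first so the caller isn't listening to dead air
    # while the backend request runs.
    await context.session.say("One moment while I check the calendar.")
    return await calendar_api.open_slots(date)  # hypothetical backend call
```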
Date awareness. GPT models don't reliably know today's date, so relative phrases like next Tuesday produce wrong dates. Inject today's date into the system prompt:
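A minimal stdlib-only helper for that injection (helper name illustrative):

```python
from datetime import date

def receptionist_instructions() -> str:
    """Build the system prompt with today's date injected."""
    today = date.today()
    return (
        "You are a receptionist at a dental clinic. "
        f"Today is {today.strftime('%A')}, {today.isoformat()}. "
        "Resolve relative dates like 'next Tuesday' against today's date."
    )
```

Rebuild the prompt when the session starts rather than at import time, so a long-running worker doesn't pin a stale date.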
Ending the call. Shut the worker down after a successful booking. get_job_context().shutdown() is synchronous — schedule it from a delayed task so the final TTS can finish playing:
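A sketch of the delayed-shutdown pattern; the two-second grace period is an illustrative value sized to let the farewell finish:

```python
import asyncio

from livekit.agents import function_tool, get_job_context


async def _finish_call(delay: float = 2.0) -> None:
    # Let the final TTS response play out before tearing the worker down.
    await asyncio.sleep(delay)
    get_job_context().shutdown()  # synchronous; safe to call from a task


@function_tool
async def book_appointment(self, date: str, time: str, patient_name: str) -> str:
    """Book the appointment and end the call."""
    # ... perform the booking ...
    asyncio.create_task(_finish_call())
    return f"Booked {patient_name} for {date} at {time}. Goodbye!"
```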
When you need more structure
A single Agent with all the tools works fine for the dentist demo. It starts to fall apart when:
- you have many tools and some shouldn't coexist (e.g. a write-side tool firing during greeting),
- the system prompt grows into a wall of don't do X until Y rules,
- you want distinct personas or phases (triage → specialist, sales → support).
LiveKit supports breaking the conversation into multiple Agent subclasses, each holding its own subset of tools. A tool returns the next Agent instance to hand off control. The shared chat_ctx carries forward, so the new agent sees the conversation so far.
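A sketch of the split for the dentist flow: a triage agent hands off to `BookingAgent` by returning it from a tool (the `TriageAgent` name and how `chat_ctx` is forwarded are assumptions that may vary by version):

```python
from livekit.agents import Agent, function_tool


class BookingAgent(Agent):
    def __init__(self, chat_ctx=None) -> None:
        super().__init__(
            instructions="Offer open slots and book the appointment.",
            chat_ctx=chat_ctx,  # conversation so far carries forward
        )

    @function_tool
    async def book_appointment(self, date: str, time: str, patient_name: str) -> str:
        """Book an appointment at a confirmed slot."""
        return f"Booked {patient_name} for {date} at {time}."


class TriageAgent(Agent):
    def __init__(self) -> None:
        super().__init__(
            instructions="Greet the caller and find out what they need."
        )

    @function_tool
    async def start_booking(self):
        """Hand off once the caller wants an appointment."""
        # Returning an Agent instance hands control to it.
        return BookingAgent(chat_ctx=self.chat_ctx)
```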
Pass the initial agent to session.start(...):
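Assuming a `TriageAgent` class as the entry point (name illustrative):

```python
await session.start(agent=TriageAgent(), room=ctx.room)
```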
book_appointment cannot be called until the conversation reaches BookingAgent — the LLM physically does not see it before then. Branching (urgent vs. non-urgent), cycles (no availability → retry), and multi-step flows extend the same pattern.
For the full API, see LiveKit's workflow docs.