Soniox

Build a voice agent with LiveKit and Soniox

Compose Soniox STT and TTS into a complete voice agent running on the LiveKit Agents framework, from a minimal chat bot to a structured appointment-booking flow.

Overview

The STT and TTS integration pages already cover how Soniox APIs run with LiveKit. This page combines them into a voice agent built around a dentist receptionist scenario.

The walkthrough covers:

  1. The agent shape — a minimal LiveKit Agent running Soniox at both ends.
  2. Turn-taking and context — Soniox endpoint detection and domain context.
  3. Adding tools the LLM can call to look up and book appointments.
  4. When you need more structure — a brief on multi-agent workflows for complex flows.

LiveKit framework concepts (Agent, AgentSession, lifecycle hooks, multi-agent handoffs) are covered in LiveKit's docs. This page focuses on the Soniox-specific pieces.

Why Soniox for voice agents

  • One API key for both ends. Soniox covers STT and TTS through the unified speech platform. No second vendor to integrate, monitor, or scale.
  • Real multilingual support. STT supports 60+ languages with automatic language identification and code-switched speech. TTS speaks 60+ languages.
  • Names, numbers, and IDs. STT recognizes names, phone numbers, emails, and alphanumerics accurately, and TTS pronounces them back the same way.
  • Low STT latency. Soniox leads on time-to-final-transcript, so the LLM picks up the moment the user stops talking.
  • Production scaling with good pricing. High-concurrency real-time workloads and regional endpoints.

Setup

Install LiveKit Agents with the extras for Soniox, OpenAI as the LLM, and Silero for local voice activity detection:

pip install "livekit-agents[soniox,openai,silero]~=1.5" python-dotenv

LiveKit's console mode also needs the PortAudio runtime to access your microphone.

On Linux:

sudo apt install libportaudio2

On macOS:

brew install portaudio

Set your API keys. The same Soniox key works for STT and TTS. Create one in the Soniox Console:

SONIOX_API_KEY=...
OPENAI_API_KEY=...

# Placeholders for console mode. Replace with real LiveKit Cloud
# credentials for dev or prod runs.
LIVEKIT_URL=ws://localhost:7880
LIVEKIT_API_KEY=devkey
LIVEKIT_API_SECRET=devsecret

For this example, the agent runs in console mode (no LiveKit server required: local mic and speakers):

python agent.py console

For dev and start modes and full deployment options, see LiveKit's running an agent docs.

The agent shape

A LiveKit voice agent is an Agent subclass running inside an AgentSession. The session wires four services: VAD, STT, LLM, and TTS. See the Voice AI quickstart for the framework basics.

import logging

from dotenv import load_dotenv

from livekit.agents import (
    Agent,
    AgentSession,
    JobContext,
    WorkerOptions,
    cli,
)
from livekit.plugins import openai, silero, soniox

load_dotenv()
logger = logging.getLogger("voice-agent")


class Receptionist(Agent):
    def __init__(self) -> None:
        super().__init__(
            instructions=(
                "You are a friendly receptionist at Bright Smile Dental. "
                "Keep replies short and natural. They will be spoken aloud."
            ),
        )

    async def on_enter(self) -> None:
        self.session.generate_reply(
            instructions="Greet the caller and ask how you can help."
        )


async def entrypoint(ctx: JobContext) -> None:
    await ctx.connect()

    session = AgentSession(
        vad=silero.VAD.load(),
        stt=soniox.STT(),
        llm=openai.LLM(model="gpt-4o-mini"),
        tts=soniox.TTS(voice="Maya"),
    )

    await session.start(agent=Receptionist(), room=ctx.room)


if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))

The two Soniox-specific lines:

  • stt=soniox.STT() — uses defaults (model stt-rt-v4, 16 kHz, language identification on).
  • tts=soniox.TTS(voice="Maya") — one of the Soniox voices.

Conversation history is managed by AgentSession automatically — no manual aggregator wiring needed.

Turn-taking and context

The minimal bot uses Silero VAD for turn detection. Soniox emits its own end-of-speech events from the STT WebSocket, and they arrive earlier than VAD's silence timer. Switching to Soniox endpoint detection makes the conversation feel snappier.

Domain context is the second tuning knob: Soniox STT accepts a list of terms (proper nouns, jargon, identifiers) and a set of general key/value facts about the domain. Both bias transcription accuracy for the session.

from livekit.agents import TurnHandlingOptions
from livekit.plugins.soniox import (
    ContextGeneralItem,
    ContextObject,
    STTOptions,
)

CLINIC_CONTEXT = ContextObject(
    terms=["Bright Smile Dental", "checkup", "cavity", "crown", "X-ray"],
    general=[
        ContextGeneralItem(key="domain", value="Dental practice"),
        ContextGeneralItem(key="topic", value="Booking an appointment"),
    ],
)

session = AgentSession(
    vad=silero.VAD.load(),
    stt=soniox.STT(
        params=STTOptions(
            language_hints=["en", "es"],
            context=CLINIC_CONTEXT,
            max_endpoint_delay_ms=1000,
        ),
    ),
    llm=openai.LLM(model="gpt-4o-mini"),
    tts=soniox.TTS(voice="Maya"),
    turn_handling=TurnHandlingOptions(
        turn_detection="stt",
        interruption={"mode": "vad"},
    ),
)

Three changes:

  • turn_detection="stt" inside TurnHandlingOptions switches turn detection from VAD silence to Soniox's STT end-of-speech events. See endpoint detection.
  • interruption={"mode": "vad"} keeps interruption local. LiveKit's default uses a cloud ML model that requires real LiveKit Cloud credentials — "vad" is the right choice for console mode.
  • max_endpoint_delay_ms=1000 raises Soniox's silence threshold from the default 500 ms. Anything shorter logs "stt end of speech received while user is speaking" warnings, because Soniox declares the endpoint before Silero VAD agrees the speaker has stopped. A second of patience eliminates the desync.

Silero VAD stays in the pipeline even though Soniox now owns turn detection. AgentSession uses VAD for a second job: interruption detection, catching when the caller starts speaking while the agent is still talking. interruption.mode accepts "adaptive" (a LiveKit Cloud ML model) or "vad", so local Silero is the practical choice for that job. The labor split lines up with each tool's strengths: Soniox decides when a turn ends, Silero decides when one starts. To learn more, see LiveKit's turn-taking docs.

The context field tunes STT to your domain. List the brand names, jargon, and identifiers your users will say in terms. Use general for structured facts that bias what the model expects to hear about. See STT context for examples.

import logging

from dotenv import load_dotenv

from livekit.agents import (
    Agent,
    AgentSession,
    JobContext,
    TurnHandlingOptions,
    WorkerOptions,
    cli,
)
from livekit.plugins import openai, silero, soniox
from livekit.plugins.soniox import (
    ContextGeneralItem,
    ContextObject,
    STTOptions,
)

load_dotenv()
logger = logging.getLogger("voice-agent")


CLINIC_CONTEXT = ContextObject(
    terms=["Bright Smile Dental", "checkup", "cavity", "crown", "X-ray"],
    general=[
        ContextGeneralItem(key="domain", value="Dental practice"),
        ContextGeneralItem(key="topic", value="Booking an appointment"),
    ],
)


class Receptionist(Agent):
    def __init__(self) -> None:
        super().__init__(
            instructions=(
                "You are a friendly receptionist at Bright Smile Dental. "
                "Keep replies short and natural. They will be spoken aloud."
            ),
        )

    async def on_enter(self) -> None:
        self.session.generate_reply(
            instructions="Greet the caller and ask how you can help."
        )


async def entrypoint(ctx: JobContext) -> None:
    await ctx.connect()

    session = AgentSession(
        vad=silero.VAD.load(),
        stt=soniox.STT(
            params=STTOptions(
                language_hints=["en"],
                context=CLINIC_CONTEXT,
                max_endpoint_delay_ms=1000,
            ),
        ),
        llm=openai.LLM(model="gpt-4o-mini"),
        tts=soniox.TTS(voice="Maya"),
        turn_handling=TurnHandlingOptions(
            turn_detection="stt",
            interruption={"mode": "vad"},
        ),
    )

    await session.start(agent=Receptionist(), room=ctx.room)


if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))

Adding tools

Function calling in LiveKit is done via @function_tool methods on the Agent subclass. The decorator reads the signature and docstring to build the LLM-facing schema — no separate schema object to maintain.
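
To make the mechanism concrete, here is an illustrative sketch (not LiveKit's actual implementation) of how a schema can be derived from a signature and docstring with the standard inspect module:

```python
import inspect


def tool_schema(fn) -> dict:
    """Derive a minimal LLM tool schema from a function's signature and docstring."""
    sig = inspect.signature(fn)
    doc = inspect.getdoc(fn) or ""
    # A real implementation maps type annotations to JSON Schema types;
    # here every parameter is treated as a string for brevity.
    params = {
        name: {"type": "string"}
        for name in sig.parameters
        if name != "self"
    }
    return {
        "name": fn.__name__,
        "description": doc.split("\n")[0],  # first docstring line
        "parameters": {"type": "object", "properties": params},
    }


async def book_appointment(self, name: str, slot: str, reason: str) -> str:
    """Book a confirmed appointment."""


schema = tool_schema(book_appointment)
# schema["name"] == "book_appointment"; properties: name, slot, reason
```

This is why the docstring and parameter names in a @function_tool matter: they are the only description the LLM sees.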

The dentist needs two tools: one to look up open slots, one to book.

from livekit.agents import Agent, function_tool


class DentalCalendar:
    """Mock calendar. Replace with your real booking system."""

    # The same four times are offered every day.
    DAILY_SCHEDULE = ["09:00", "10:30", "14:00", "15:30"]

    def __init__(self) -> None:
        # ISO datetimes that are already booked.
        self.booked: set[str] = {"2026-05-12T10:30"}
        self._next_confirmation = 1000

    async def find_slots(self, date: str) -> list[str]:
        # Return open ISO datetimes for the given date.
        slots_for_day = [f"{date}T{time}" for time in self.DAILY_SCHEDULE]
        return [slot for slot in slots_for_day if slot not in self.booked]

    async def book(self, name: str, slot: str, reason: str) -> str:
        # Mark a slot as booked and return a confirmation ID.
        self.booked.add(slot)
        self._next_confirmation += 1
        return f"DENT-{self._next_confirmation}"


calendar = DentalCalendar()


class Receptionist(Agent):
    def __init__(self) -> None:
        super().__init__(
            instructions=(
                "You are a friendly receptionist at Bright Smile Dental. "
                "Keep replies short and natural. "
                "They will be spoken aloud. "
                "When booking, confirm name, date, time, and reason "
                "before calling the tool."
            ),
        )

    async def on_enter(self) -> None:
        self.session.generate_reply(
            instructions="Greet the caller and ask how you can help."
        )

    @function_tool
    async def check_availability(self, date: str) -> str:
        """Look up open appointment slots for a given date.

        Args:
            date: ISO date, e.g. "2026-05-12".
        """
        slots = await calendar.find_slots(date)
        if not slots:
            return f"No slots available on {date}; fully booked."
        return f"Open slots on {date}: {', '.join(slots)}."

    @function_tool
    async def book_appointment(self, name: str, slot: str, reason: str) -> str:
        """Book a confirmed appointment.

        Only call after the patient has confirmed the date, time, and reason.

        Args:
            name: Patient's full name.
            slot: ISO datetime, e.g. "2026-05-12T10:30".
            reason: Reason for the visit.
        """
        cid = await calendar.book(name=name, slot=slot, reason=reason)
        return f"Booked. Confirmation: {cid}."

The AgentSession and entrypoint are unchanged.
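
Because the calendar is plain Python, it can be sanity-checked on its own before the session is wired up. A quick self-contained check (repeating the mock class from above):

```python
import asyncio


class DentalCalendar:
    """Mock calendar, same as above."""

    DAILY_SCHEDULE = ["09:00", "10:30", "14:00", "15:30"]

    def __init__(self) -> None:
        self.booked: set[str] = {"2026-05-12T10:30"}
        self._next_confirmation = 1000

    async def find_slots(self, date: str) -> list[str]:
        slots = [f"{date}T{t}" for t in self.DAILY_SCHEDULE]
        return [s for s in slots if s not in self.booked]

    async def book(self, name: str, slot: str, reason: str) -> str:
        self.booked.add(slot)
        self._next_confirmation += 1
        return f"DENT-{self._next_confirmation}"


async def main() -> None:
    cal = DentalCalendar()
    # 10:30 is pre-booked, so three slots remain on that day.
    slots = await cal.find_slots("2026-05-12")
    assert slots == ["2026-05-12T09:00", "2026-05-12T14:00", "2026-05-12T15:30"]
    cid = await cal.book("Ada", slots[0], "checkup")
    assert cid == "DENT-1001"
    assert len(await cal.find_slots("2026-05-12")) == 2


asyncio.run(main())
```

Keeping the booking logic out of the Agent subclass like this makes it easy to swap in a real backend later without touching the tool signatures.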

Useful patterns

Audible filler. Tools that hit the network can stall the conversation. Make the agent speak before the work starts:

@function_tool
async def check_availability(self, date: str) -> str:
    """..."""
    self.session.say("Let me check the schedule.")
    ...

Date awareness. GPT models don't reliably know today's date, so relative phrases like next Tuesday produce wrong dates. Inject today's date into the system prompt:

from datetime import date

class Receptionist(Agent):
    def __init__(self) -> None:
        super().__init__(
            instructions=(
                f"Today's date is {date.today().isoformat()}. "
                "Resolve relative dates based on today. "
                "..."
            ),
        )

Ending the call. Shut the worker down after a successful booking. get_job_context().shutdown() is synchronous — schedule it from a delayed task so the final TTS can finish playing:

import asyncio
from livekit.agents import get_job_context

@function_tool
async def book_appointment(self, name: str, slot: str, reason: str) -> str:
    """..."""
    cid = await calendar.book(name=name, slot=slot, reason=reason)

    async def _shutdown() -> None:
        await asyncio.sleep(8.0)
        get_job_context().shutdown(reason="appointment_booked")

    asyncio.create_task(_shutdown())

    return f"Booked. Confirmation: {cid}."

import asyncio
import logging
from datetime import date

from dotenv import load_dotenv

from livekit.agents import (
    Agent,
    AgentSession,
    JobContext,
    TurnHandlingOptions,
    WorkerOptions,
    cli,
    function_tool,
    get_job_context,
)
from livekit.plugins import openai, silero, soniox
from livekit.plugins.soniox import (
    ContextGeneralItem,
    ContextObject,
    STTOptions,
)

load_dotenv()
logger = logging.getLogger("voice-agent")

TODAY = date.today().isoformat()

CLINIC_CONTEXT = ContextObject(
    terms=["Bright Smile Dental", "checkup", "cavity", "crown", "X-ray"],
    general=[
        ContextGeneralItem(key="domain", value="Dental practice"),
        ContextGeneralItem(key="topic", value="Booking an appointment"),
    ],
)


class DentalCalendar:
    """Mock calendar. Replace with your real booking system."""

    # The same four times are offered every day.
    DAILY_SCHEDULE = ["09:00", "10:30", "14:00", "15:30"]

    def __init__(self) -> None:
        # ISO datetimes that are already booked.
        self.booked: set[str] = {"2026-05-12T10:30"}
        self._next_confirmation = 1000

    async def find_slots(self, date: str) -> list[str]:
        """Return open ISO datetimes for the given date."""
        slots_for_day = [f"{date}T{time}" for time in self.DAILY_SCHEDULE]
        return [slot for slot in slots_for_day if slot not in self.booked]

    async def book(self, name: str, slot: str, reason: str) -> str:
        """Mark a slot as booked and return a confirmation ID."""
        self.booked.add(slot)
        self._next_confirmation += 1
        return f"DENT-{self._next_confirmation}"


calendar = DentalCalendar()


class Receptionist(Agent):
    def __init__(self) -> None:
        super().__init__(
            instructions=(
                f"Today's date is {TODAY}. "
                "You are a friendly receptionist at Bright Smile Dental. "
                "Keep replies short and natural. They will be spoken aloud. "
                "Resolve relative dates based on today. "
                "When booking, confirm name, date, time, and reason before "
                "calling the tool."
            ),
        )

    async def on_enter(self) -> None:
        self.session.generate_reply(
            instructions="Greet the caller and ask how you can help."
        )

    @function_tool
    async def check_availability(self, date: str) -> str:
        """Look up open appointment slots for a given date.

        Args:
            date: ISO date, e.g. "2026-05-12".
        """
        self.session.say("Let me check the schedule.")
        slots = await calendar.find_slots(date)
        if not slots:
            return f"No slots available on {date}; fully booked."
        return f"Open slots on {date}: {', '.join(slots)}."

    @function_tool
    async def book_appointment(self, name: str, slot: str, reason: str) -> str:
        """Book a confirmed appointment.

        Only call after the patient has confirmed the date, time, and reason.

        Args:
            name: Patient's full name.
            slot: ISO datetime, e.g. "2026-05-12T10:30".
            reason: Reason for the visit.
        """
        cid = await calendar.book(name=name, slot=slot, reason=reason)

        async def _shutdown() -> None:
            await asyncio.sleep(8.0)
            get_job_context().shutdown(reason="appointment_booked")

        asyncio.create_task(_shutdown())

        return f"Booked. Confirmation: {cid}."


async def entrypoint(ctx: JobContext) -> None:
    await ctx.connect()

    session = AgentSession(
        vad=silero.VAD.load(),
        stt=soniox.STT(
            params=STTOptions(
                language_hints=["en"],
                context=CLINIC_CONTEXT,
                max_endpoint_delay_ms=1000,
            ),
        ),
        llm=openai.LLM(model="gpt-4o-mini"),
        tts=soniox.TTS(voice="Maya"),
        turn_handling=TurnHandlingOptions(
            turn_detection="stt",
            interruption={"mode": "vad"},
        ),
    )

    await session.start(agent=Receptionist(), room=ctx.room)


if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))

When you need more structure

A single Agent with all the tools works fine for the dentist demo. It starts to fall apart when:

  • you have many tools and some shouldn't coexist (e.g. a write-side tool firing during greeting),
  • the system prompt grows into a wall of don't do X until Y rules,
  • you want distinct personas or phases (triage → specialist, sales → support).

LiveKit supports breaking the conversation into multiple Agent subclasses, each holding its own subset of tools. A tool returns the next Agent instance to hand off control. The shared chat_ctx carries forward, so the new agent sees the conversation so far.

class GreetAgent(Agent):
    def __init__(self) -> None:
        super().__init__(
            instructions="Greet the caller and ask for their full name.",
        )

    async def on_enter(self) -> None:
        self.session.generate_reply()

    @function_tool
    async def collect_name(self, name: str) -> Agent:
        """Record the caller's name."""
        return BookingAgent(name=name)


class BookingAgent(Agent):
    def __init__(self, name: str) -> None:
        super().__init__(
            instructions=(
                f"You are speaking with {name}. Help them book an appointment. "
                "Confirm date, time, and reason before calling book_appointment."
            ),
        )
        self._name = name

    @function_tool
    async def book_appointment(self, slot: str, reason: str) -> str:
        """Book a confirmed appointment.

        Args:
            slot: ISO datetime, e.g. "2026-05-12T10:30".
            reason: Reason for the visit.
        """
        cid = await calendar.book(name=self._name, slot=slot, reason=reason)
        return f"Booked. Confirmation: {cid}."

Pass the initial agent to session.start(...):

await session.start(agent=GreetAgent(), room=ctx.room)

book_appointment cannot be called until the conversation reaches BookingAgent: the LLM simply does not see it before then. Branching (urgent vs. non-urgent), cycles (no availability → retry), and multi-step flows extend the same pattern.

For the full API, see LiveKit's workflow docs.