Soniox Voice Agent
Demo app showing how to implement a voice-to-voice AI agent with Soniox voice solutions
Overview
Soniox Voice Agent is a demo app that shows how to build a complete voice-to-voice conversational AI assistant. It demonstrates how to integrate streaming speech-to-text, a large language model (LLM), and streaming text-to-speech (TTS) into a seamless, low-latency application.
The demo bot is pre-configured as an appointment booking assistant for a fictional car repair shop, "Soniox AutoWorks." It can book appointments for services such as oil changes and car repairs, collect customer names and vehicle information, provide available appointment slots, and interactively guide users through the booking process.
The entire voice bot codebase is designed for easy customization and extension to other domains. You can quickly adapt the bot to different business needs, integrate new tools, or change its persona, making it a flexible starting point for any conversational AI application.
Features
- End-to-end real-time: Fully streaming architecture (voice-in, voice-out) for natural, low-latency conversations
- Multilingual: Understands and responds to users in multiple languages
- Customizable AI: The bot's persona and business logic are defined in a single, easy-to-edit file
- Extensible tools: Connect the LLM to your own APIs and databases to perform real-world actions
- Multiple ways to interact: Web frontend, Twilio phone call, or any other WebSocket-based connection
Usage flow
- Connect via web browser or phone call
- Speak naturally in any language to the AI agent
- The bot transcribes your speech, understands intent, and generates a response
- Listen to the AI's spoken response in real-time
- The conversation continues with full context awareness
Architecture
- Server (Python): Orchestrates the conversation with modular processors (VAD, STT, LLM, TTS)
- Frontend (React): Captures microphone audio, streams it to the backend, and plays back responses
- Twilio proxy (Python): Optional bridge to connect phone calls to the voice bot backend
We provide all the implementations with links to GitHub:
- Python server
- React frontend (web)
- Twilio proxy (phone integration)
How it works
The system is built on a modular, asynchronous architecture. When a user connects, a session is created to orchestrate the entire conversation, managing the flow of data between four core processors:
Voice Activity Detection (VAD) Processor
Uses Silero VAD to detect speech boundaries in incoming audio. Emits events to interrupt TTS when the user starts speaking.
Speech-to-Text (STT) Processor
The user's voice is captured by a client (web app or phone call) and streamed to the backend. The STT Processor uses the Soniox API to transcribe the audio into text in real-time.
Language Model (LLM) Processor
The transcribed text is sent to the LLM Processor. It maintains the conversation history, determines the user's intent, and decides whether to generate a direct response or use a predefined tool (like checking available slots).
Text-to-Speech (TTS) Processor
The LLM's final text response is sent to the TTS Processor, which uses the Soniox API to convert it back into audio and streams it to the user, completing the conversational turn.
Next steps
Use this project as a starting point to build your own voice assistant:
- Customize the bot's persona and instructions in
server/tools.py - Implement your own tools to connect to external APIs and databases
- Adapt the frontend or Twilio integration to your needs