Soniox
Demo apps

Soniox Voice Agent

Demo app showing how to implement a voice-to-voice AI agent with Soniox voice solutions

Overview

Soniox Voice Agent is a demo app that shows how to build a complete voice-to-voice conversational AI assistant. It demonstrates how to integrate streaming speech-to-text, a large language model (LLM), and streaming text-to-speech (TTS) into a seamless, low-latency application.

The demo bot is pre-configured as an appointment booking assistant for a fictional car repair shop, "Soniox AutoWorks." It can book appointments for services such as oil changes and car repairs, collect customer names and vehicle information, provide available appointment slots, and interactively guide users through the booking process.

The entire voice bot codebase is designed for easy customization and extension to other domains. You can quickly adapt the bot to different business needs, integrate new tools, or change its persona, making it a flexible starting point for any conversational AI application.

Features

  • End-to-end real-time: Fully streaming architecture (voice-in, voice-out) for natural, low-latency conversations
  • Multilingual: Understands and responds to users in multiple languages
  • Customizable AI: The bot's persona and business logic are defined in a single, easy-to-edit file
  • Extensible tools: Connect the LLM to your own APIs and databases to perform real-world actions
  • Multiple ways to interact: Web frontend, Twilio phone call, or any other WebSocket-based connection

Usage flow

  1. Connect via web browser or phone call
  2. Speak naturally in any language to the AI agent
  3. The bot transcribes your speech, understands intent, and generates a response
  4. Listen to the AI's spoken response in real-time
  5. The conversation continues with full context awareness

Architecture

  • Server (Python): Orchestrates the conversation with modular processors (VAD, STT, LLM, TTS)
  • Frontend (React): Captures microphone audio, streams it to the backend, and plays back responses
  • Twilio proxy (Python): Optional bridge to connect phone calls to the voice bot backend

We provide all the implementations with links to GitHub:

How it works

The system is built on a modular, asynchronous architecture. When a user connects, a session is created to orchestrate the entire conversation, managing the flow of data between four core processors:

Voice Activity Detection (VAD) Processor

Uses Silero VAD to detect speech boundaries in incoming audio. Emits events to interrupt TTS when the user starts speaking.

Speech-to-Text (STT) Processor

The user's voice is captured by a client (web app or phone call) and streamed to the backend. The STT Processor uses the Soniox API to transcribe the audio into text in real-time.

Language Model (LLM) Processor

The transcribed text is sent to the LLM Processor. It maintains the conversation history, determines the user's intent, and decides whether to generate a direct response or use a predefined tool (like checking available slots).

Text-to-Speech (TTS) Processor

The LLM's final text response is sent to the TTS Processor, which uses the Soniox API to convert it back into audio and streams it to the user, completing the conversational turn.

Next steps

Use this project as a starting point to build your own voice assistant:

  • Customize the bot's persona and instructions in server/tools.py
  • Implement your own tools to connect to external APIs and databases
  • Adapt the frontend or Twilio integration to your needs