How to build a multilingual voice bot with Pipecat and Soniox

Voice AI is moving fast.

In English, it is now possible to build a working voice agent in a weekend. You can combine a speech-to-text provider, an LLM, a text-to-speech provider, and a voice-agent framework, then ship something that feels natural enough for a demo.

But the real world does not speak only English.

As soon as you build for global users, the problems start to show up. Speech recognition accuracy drops. Latency gets worse. Language switching breaks. Names, numbers, emails, addresses, and IDs are transcribed incorrectly. The TTS voice may sound unnatural or robotic, or may not be available at all in the language you need. And because STT and TTS often come from different vendors, you end up managing two APIs, two bills, two dashboards, two sets of limits, and two different definitions of “supported languages.”

That is exactly the problem Soniox solves.

Soniox is now available as a first-class integration in Pipecat, with native support for both Soniox Speech-to-Text and Soniox Text-to-Speech. You can now build real-time voice agents with multilingual STT and multilingual TTS through one speech AI platform, one API key, and one consistent set of languages.

What is Pipecat?

Pipecat is an open-source Python framework for building real-time voice and multimodal agents.

It gives developers a clean way to compose the core pieces of a voice agent:

Transport, such as WebRTC or telephony
Speech-to-text
LLM
Text-to-speech
Voice activity detection
Turn-taking
Interruption handling
Streaming audio pipelines

Instead of wiring all of this from scratch, you drop services into a Pipecat pipeline. Pipecat handles the real-time orchestration so you can focus on the behavior, intelligence, and product experience of your agent.

Teams use Pipecat to build browser voice assistants, phone bots, customer support agents, AI companions, healthcare assistants, internal tools, and other real-time speech applications.

Why Soniox for Pipecat?

Voice agents depend on one thing above all else: the conversation must feel natural.

That requires much more than converting speech to text and text back to speech. The system needs to understand users quickly, finalize turns at the right moment, generate speech with low latency, and work reliably across languages.

Soniox gives Pipecat developers a single speech stack for both sides of the conversation.

One API for speech in and speech out

Most voice-agent stacks split speech-to-text and text-to-speech across separate vendors.

That means separate integrations, separate billing, separate rate limits, separate dashboards, and often inconsistent language support between STT and TTS.

With Soniox, both STT and TTS run through one platform. The same API key gives you real-time transcription and real-time speech generation across 60+ languages.

Built for multilingual voice agents

Soniox was built for the non-English world.

Soniox STT delivers native-speaker accuracy across 60+ languages. It supports automatic language identification and mixed-language speech, and it transcribes in real time without forcing every call into a single English-first model.

That matters for global voice agents.

A customer may start in English, switch to Spanish, mention a French name, read an email address, then give a confirmation code. Soniox is designed for that kind of real-world speech.

Accurate on the details that break production systems

Voice agents fail when they get important details wrong.

Names. Phone numbers. Email addresses. Addresses. Product codes. Confirmation IDs. Dates. Foreign words. Mixed-language phrases.

These are not edge cases in production. They are the core of real customer interactions.

Soniox STT is designed to recognize these details accurately, and Soniox TTS is designed to speak them correctly. That gives developers a much better foundation for building reliable voice agents in real workflows.

Low latency for natural conversations

Latency defines how alive a voice agent feels.

Soniox real-time STT is built for low-latency transcription and fast finalization. Soniox TTS supports streaming generation so speech can begin quickly instead of waiting for the full response to be completed.

Together with Pipecat’s real-time pipeline, this gives developers the foundation for fast, natural turn-taking.

Scales from prototype to production

The same integration can be used for a local demo, a browser-based assistant, a phone bot, or a large production deployment.

Soniox supports high-concurrency real-time workloads, regional deployments, and production-grade scaling for companies building voice agents globally.

What the integration includes

The Pipecat integration includes two native Soniox services.

SonioxSTTService

SonioxSTTService connects Pipecat to the Soniox real-time Speech-to-Text API.

It provides:

Real-time transcription
Native-speaker accuracy in 60+ languages
Automatic language identification
Support for mixed-language speech
Accurate recognition of names, numbers, emails, IDs, and other alphanumerics
Low-latency final transcripts for responsive voice agents

SonioxTTSService

SonioxTTSService connects Pipecat to the Soniox real-time Text-to-Speech API.

It provides:

Streaming speech generation
Natural voices across 60+ languages
Correct pronunciation of names, foreign words, and alphanumerics
Low first-audio latency for real-time agents
One consistent TTS layer for global voice applications

Together, these services let you build a complete voice loop using Soniox for both speech recognition and speech generation.

How the pipeline works

A typical Pipecat voice agent using Soniox looks like this:

User speech
  -> Transport
  -> VAD
  -> Soniox STT
  -> LLM
  -> Soniox TTS
  -> Transport
  -> User hears response

Pipecat manages the real-time flow between services. Soniox handles the speech layer on both sides of the conversation.

You can use Soniox with Daily for WebRTC transport, OpenAI or Anthropic for the LLM, Silero for VAD, and Pipecat’s built-in pipeline orchestration.
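
To make the ordering of the stages concrete, here is a toy model of the flow written in plain Python. It is only an illustration and does not use Pipecat's API: make_stage, build_pipeline, source, and run are hypothetical names, and each placeholder stage simply wraps the item it receives so you can see the order in which the stages are applied.

```python
import asyncio
from typing import AsyncIterator, Callable, Iterable, List

# A stage consumes an async stream of items and yields a transformed stream.
Stage = Callable[[AsyncIterator[str]], AsyncIterator[str]]


def make_stage(label: str) -> Stage:
    """Build a placeholder stage that wraps each item in its own label."""

    async def stage(upstream: AsyncIterator[str]) -> AsyncIterator[str]:
        async for item in upstream:
            yield f"{label}({item})"

    return stage


async def source(items: Iterable[str]) -> AsyncIterator[str]:
    """Stand-in for the transport delivering audio chunks."""
    for item in items:
        yield item


def build_pipeline(stages: List[Stage], upstream: AsyncIterator[str]) -> AsyncIterator[str]:
    """Chain the stages so each one consumes the output of the previous one."""
    for stage in stages:
        upstream = stage(upstream)
    return upstream


async def run(items: Iterable[str]) -> List[str]:
    stages = [make_stage(name) for name in ("vad", "stt", "llm", "tts")]
    return [out async for out in build_pipeline(stages, source(items))]


if __name__ == "__main__":
    print(asyncio.run(run(["chunk-1"])))
    # prints ['tts(llm(stt(vad(chunk-1))))']
```

In a real agent, the stages are Pipecat services and the items are audio and text frames rather than strings, but the idea is the same: each service consumes what the previous one produces.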

Install Pipecat with Soniox

For Soniox STT and TTS:

pip install "pipecat-ai[soniox]"

For a complete voice-agent stack with Daily, OpenAI, Silero VAD, and the Pipecat runner:

pip install "pipecat-ai[soniox,daily,openai,silero,runner]"

Set your API keys

Create the following environment variables:

export SONIOX_API_KEY="your_soniox_api_key"
export OPENAI_API_KEY="your_openai_api_key"
export DAILY_API_KEY="your_daily_api_key"

You can create a Soniox API key in the Soniox Console.
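
A missing key usually shows up as a confusing connection error at runtime, so it can help to check the environment before starting the agent. The snippet below is a minimal sketch using only the standard library. REQUIRED_KEYS and missing_keys are hypothetical helpers, not part of Pipecat or Soniox, and you can trim the tuple to the keys your setup actually needs.

```python
import os
from typing import List, Mapping, Sequence

# Hypothetical list of keys for the stack described in this post.
REQUIRED_KEYS = ("SONIOX_API_KEY", "OPENAI_API_KEY", "DAILY_API_KEY")


def missing_keys(
    env: Mapping[str, str] = os.environ,
    required: Sequence[str] = REQUIRED_KEYS,
) -> List[str]:
    """Return the names of required variables that are unset or blank."""
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
    print("All API keys are set.")
```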

Add Soniox Speech-to-Text

import os

from pipecat.services.soniox.stt import SonioxSTTService
from pipecat.transcriptions.language import Language

stt = SonioxSTTService(
    api_key=os.getenv("SONIOX_API_KEY"),
    settings=SonioxSTTService.Settings(
        language_hints=[Language.EN],
    ),
)

Use language_hints when you know which language or languages to expect. Soniox can also identify languages automatically, which is useful for multilingual agents and global user bases.
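
If your application already knows a user's preferred locales (for example from a browser or account profile), you may want to turn them into a short, de-duplicated list of language codes before building the hints. The helper below is a sketch under that assumption. hint_codes is a hypothetical function, not part of Pipecat or Soniox, and it only does string handling.

```python
from typing import Iterable, List


def hint_codes(locales: Iterable[str]) -> List[str]:
    """Reduce locale tags such as 'es-MX' to unique base codes, keeping order."""
    codes: List[str] = []
    for locale in locales:
        # Accept both 'en-US' and 'en_US' and keep only the language part.
        code = locale.replace("_", "-").split("-")[0].strip().lower()
        if code and code not in codes:
            codes.append(code)
    return codes


if __name__ == "__main__":
    print(hint_codes(["en-US", "es_MX", "EN-gb"]))
    # prints ['en', 'es']
```

You would then map each code to the matching Language member (for example "es" to Language.ES) and pass the resulting list as language_hints.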

Add Soniox Text-to-Speech

import os

from pipecat.services.soniox.tts import SonioxTTSService
from pipecat.transcriptions.language import Language

tts = SonioxTTSService(
    api_key=os.getenv("SONIOX_API_KEY"),
    settings=SonioxTTSService.Settings(
        voice="Nina",
        language=Language.EN,
    ),
)

The TTS service streams generated speech back into the Pipecat pipeline, so your agent can start speaking quickly and maintain a natural conversational rhythm.
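
To see why streaming helps, consider what happens when speech is generated sentence by sentence instead of after the full LLM response. The sketch below is a simplified illustration in plain Python, not a description of how SonioxTTSService works internally. sentences_from_tokens is a hypothetical helper, and the sentence detection is deliberately naive.

```python
from typing import Iterable, Iterator

SENTENCE_ENDINGS = (".", "!", "?")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as it is complete, without waiting for the rest."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_ENDINGS):
            yield buffer.strip()
            buffer = ""
    # Flush any trailing text that did not end with punctuation.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    llm_tokens = ["Hello", " there", ".", " How", " can", " I", " help", "?"]
    for sentence in sentences_from_tokens(llm_tokens):
        print(sentence)
    # prints "Hello there." and then "How can I help?"
```

The first sentence is available after three tokens, so speech generation can start while the LLM is still producing the rest of the reply.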

Run the complete example

We published a runnable Pipecat example that uses:

Soniox STT for real-time transcription
OpenAI for the LLM
Soniox TTS for speech generation
Silero for voice activity detection
Daily for WebRTC transport

Clone the Pipecat repository:

git clone https://github.com/pipecat-ai/pipecat.git

Run the Soniox voice example:

cd pipecat/examples/voice
python voice-soniox.py -t webrtc

Open:

http://localhost:7860

Click "Connect" and start talking.

To test the same agent in a Daily room, run:

python voice-soniox.py -t daily

The full source is in the Pipecat repository at examples/voice/voice-soniox.py.

Build global voice agents with one speech platform

Pipecat makes it easy to build real-time voice agents.

Soniox makes those agents work across languages.

With native Soniox STT and TTS support in Pipecat, developers can now build multilingual voice agents with one speech platform, one API key, and one consistent experience across speech recognition and speech generation.

Whether you are building a customer support bot, phone agent, browser assistant, AI companion, healthcare workflow, or internal voice interface, Soniox gives you the speech layer for real-time global conversations.

Build with Pipecat. Power the speech with Soniox.

Happy building.