Pipecat
Integrate Soniox Speech-to-Text into a Pipecat pipeline.
Pipecat overview
Pipecat is a framework for building voice-enabled, real-time, multimodal AI applications. Pipecat's pipeline for real-time voice applications looks like this:
- Send Audio - Transmit and capture streamed audio from the user
- Transcribe Speech - Convert speech to text as the user is talking
- Process with LLM - Generate responses using a large language model
- Convert to Speech - Transform text responses into natural speech
- Play Audio - Stream the audio response back to the user
At each step, there are multiple options for services to use. Soniox provides `SonioxSTTService`, which handles the Transcribe Speech step. For more details on how Pipecat works, see the Pipecat documentation.
Installation
To use `SonioxSTTService` in Pipecat projects, install the Soniox dependencies:
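A typical installation, assuming Pipecat ships the Soniox integration as an optional extra of the `pipecat-ai` package:

```bash
pip install "pipecat-ai[soniox]"
```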
You'll also need to set your Soniox API key in the `SONIOX_API_KEY` environment variable. You can obtain an API key by signing up at the Soniox Console.
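For example, in a POSIX shell (replace the placeholder with your own key):

```bash
export SONIOX_API_KEY=<your_api_key>
```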
Usage example
To integrate SonioxSTTService
into a Pipecat pipeline for real-time speech-to-text transcription,
you can simply create an instance of the service and add it to your pipeline:
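A minimal sketch, assuming the import path `pipecat.services.soniox.stt` used by recent Pipecat releases; `transport`, `llm`, and `tts` stand in for services defined elsewhere in your app:

```python
import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.soniox.stt import SonioxSTTService

# Create the Soniox STT service; the API key is read from the environment.
stt = SonioxSTTService(api_key=os.getenv("SONIOX_API_KEY"))

# Place the service between the transport input and the LLM stage,
# mirroring the pipeline steps described above.
pipeline = Pipeline(
    [
        transport.input(),   # Send Audio: capture streamed audio from the user
        stt,                 # Transcribe Speech: Soniox speech-to-text
        llm,                 # Process with LLM: generate a response
        tts,                 # Convert to Speech: synthesize the response
        transport.output(),  # Play Audio: stream the response back to the user
    ]
)
```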
Complete examples
The following examples demonstrate how to use `SonioxSTTService` in Pipecat projects:
- Server-side bot that listens to the user's voice and responds with a spoken response.
- Transcribe an audio stream within the Pipecat architecture.
- Chatbot agent using Soniox STT for Pipecat Cloud.
Advanced usage
Language hints
There is no need to pre-select a language — the model automatically detects and transcribes any supported language. It also handles multilingual audio seamlessly, even when multiple languages are mixed within a single sentence or conversation.
However, when you have prior knowledge of the languages likely to be spoken in your audio, you can use language hints to guide the model toward those languages for even greater recognition accuracy.
Language variants are ignored; for example, `Language.EN_GB` is treated the same as `Language.EN`. See supported languages for the full list.
You can learn more about language hints here.
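A sketch of passing language hints, assuming the `SonioxInputParams` class and Pipecat's `Language` enum (verify the exact names against your Pipecat version):

```python
import os

from pipecat.services.soniox.stt import SonioxInputParams, SonioxSTTService
from pipecat.transcriptions.language import Language

# Hint that the audio is most likely English or Spanish. Hints only bias
# the model; other supported languages are still recognized.
stt = SonioxSTTService(
    api_key=os.getenv("SONIOX_API_KEY"),
    params=SonioxInputParams(
        language_hints=[Language.EN, Language.ES],
    ),
)
```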
Customization with context
By providing context, you help the AI model better understand and anticipate the language in your audio, even if some terms do not appear clearly or completely.
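A sketch of supplying context as a free-form string, assuming a `context` field on `SonioxInputParams`:

```python
import os

from pipecat.services.soniox.stt import SonioxInputParams, SonioxSTTService

# Domain-specific context helps the model resolve rare or ambiguous terms.
stt = SonioxSTTService(
    api_key=os.getenv("SONIOX_API_KEY"),
    params=SonioxInputParams(
        context="Cardiology visit: the patient discusses atorvastatin dosage.",
    ),
)
```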
Endpoint Detection and VAD
The `SonioxSTTService` processes your speech and has two ways of knowing when to finalize the text.
Automatic Pause Detection
By default, the service listens for natural pauses in your speech. When it detects that you've likely finished a sentence, it finalizes the transcription. You can learn more about Endpoint Detection here.
Using Voice Activity Detection (VAD)
For more explicit control, you can use a dedicated Voice Activity Detection (VAD) component within your Pipecat pipeline. The VAD's job is to detect when a user has completely stopped talking.
To enable this behavior, set `vad_force_turn_endpoint` to `True`. This disables the automatic endpoint detection and forces the service to return transcription results as soon as the user stops talking.
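A sketch assuming `vad_force_turn_endpoint` is accepted by the service constructor and that a VAD analyzer (for example, Pipecat's `SileroVADAnalyzer`) is attached to your transport; parameter placement may differ between Pipecat versions:

```python
import os

from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.services.soniox.stt import SonioxSTTService

# Attach the VAD analyzer to your transport's params (transport setup
# not shown) so the pipeline knows when the user stops talking.
vad_analyzer = SileroVADAnalyzer()

# With vad_force_turn_endpoint=True, the service finalizes transcription
# when the VAD signals end of speech instead of relying on Soniox's
# automatic endpoint detection.
stt = SonioxSTTService(
    api_key=os.getenv("SONIOX_API_KEY"),
    vad_force_turn_endpoint=True,
)
```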