Real-time transcription
Learn about real-time transcription with low latency and high accuracy for all 60+ languages.
Overview
Soniox Speech-to-Text AI supports real-time transcription with low latency and high accuracy for all 60+ languages. It's designed for responsive applications like live captioning, streaming analytics, and conversational interfaces.
Real-time transcription is provided through our WebSocket API. You can also use our Web library, which makes it easy to integrate real-time transcription directly into browser-based applications.
Streaming expectations
Real-time cadence
Send audio data to Soniox at real-time or near-real-time speed. Small deviations, such as brief buffering or network jitter, are tolerated, but prolonged bursts or lags may result in disconnection.
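As a rough illustration of real-time cadence, the sketch below paces pre-recorded PCM chunks so they are released no faster than they would be captured live. The helper names (`chunk_bytes`, `paced`) and the 16-bit mono format are assumptions made for this example and are not part of the Soniox API; the yielded chunks would be handed to whatever WebSocket client you use.

```python
import time
from typing import Callable, Iterable, Iterator


def chunk_bytes(sample_rate_hz: int, chunk_ms: int,
                bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Number of raw PCM bytes that cover chunk_ms of audio."""
    return sample_rate_hz * chunk_ms // 1000 * bytes_per_sample * channels


def paced(chunks: Iterable[bytes], chunk_ms: int,
          clock: Callable[[], float] = time.monotonic,
          sleep: Callable[[float], None] = time.sleep) -> Iterator[bytes]:
    """Yield chunks no faster than one per chunk_ms.

    Each chunk's due time is measured from a fixed start time, so small
    scheduling delays do not accumulate into a growing lag.
    """
    start = clock()
    for i, chunk in enumerate(chunks):
        due = start + i * chunk_ms / 1000.0
        delay = due - clock()
        if delay > 0:
            sleep(delay)
        yield chunk
```

For 16 kHz, 16-bit mono audio sent in 100 ms chunks, `chunk_bytes(16000, 100)` gives 3200 bytes per chunk. The `clock` and `sleep` parameters exist only so the pacing can be tested without waiting.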
Handling pauses
To implement pause or mute functionality without disconnecting the session, stream zero-valued PCM samples (silence) at real-time cadence.
This ensures that session-level context — such as speaker diarization or language tracking — is maintained throughout the stream.
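A minimal sketch of the mute approach described above, assuming signed 16-bit PCM (where a zero sample is silence; unsigned 8-bit PCM would use 128 instead). The function names are hypothetical helpers for this example, not Soniox API calls.

```python
def silence_chunk(sample_rate_hz: int, chunk_ms: int,
                  bytes_per_sample: int = 2, channels: int = 1) -> bytes:
    """Zero-valued PCM covering chunk_ms of audio (assumes signed samples)."""
    n = sample_rate_hz * chunk_ms // 1000 * bytes_per_sample * channels
    return bytes(n)  # bytes(n) is n zero bytes


def next_chunk(mic_chunk: bytes, muted: bool) -> bytes:
    """While muted, replace captured audio with silence of the same length.

    Keeping the length identical preserves the real-time cadence of the
    stream, so the session stays open while nothing audible is sent.
    """
    return bytes(len(mic_chunk)) if muted else mic_chunk
```

While paused with no microphone input at all, `silence_chunk` can be sent on a timer at the same interval as regular audio chunks.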
Key concepts
We recommend reading the following real-time concepts before integrating:
- Understand how tokens evolve during streaming and when you can consider them stable.
- Learn how to configure latency settings to control the tradeoff between speed and accuracy.
Integration guides
Choose one of the following integration patterns based on your app architecture:
- Send audio directly from your client (e.g., browser, mobile app) to Soniox.
  Best for:
  - Web/mobile apps
  - Lowest latency
  - Client-managed sessions
- Stream audio from your client to your backend, and forward it from there to Soniox.
  Best for:
  - Centralized session control
  - Audio preprocessing or archiving
  - Use cases involving multiple clients
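To make the backend-forwarding pattern more concrete, here is a transport-agnostic sketch of the relay loop. It is an illustration under assumptions: `relay`, `client_audio`, and `send_upstream` are hypothetical names, and the actual client and upstream WebSocket connections are left to whichever library your backend uses.

```python
import asyncio  # used by callers to run relay(), e.g. asyncio.run(relay(...))
from typing import AsyncIterator, Awaitable, Callable, Optional


async def relay(client_audio: AsyncIterator[bytes],
                send_upstream: Callable[[bytes], Awaitable[None]],
                archive: Optional[bytearray] = None) -> int:
    """Forward each client audio chunk upstream as soon as it arrives.

    client_audio:  async iterator of raw audio chunks from the client.
    send_upstream: coroutine function that sends one chunk to the
                   transcription connection.
    archive:       optional buffer that receives a copy of the audio.
    Returns the total number of bytes forwarded.
    """
    total = 0
    async for chunk in client_audio:
        if archive is not None:
            archive.extend(chunk)  # keep a copy before forwarding
        await send_upstream(chunk)
        total += len(chunk)
    return total
```

Forwarding chunk by chunk, instead of buffering the whole stream, keeps the upstream cadence close to the client's, which matters given the real-time expectations described earlier.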
Example: Transcribe a live audio stream
See our example demonstrating how to transcribe a live audio stream (such as a radio broadcast) using the WebSocket API.
The example shows how to:
- Open a WebSocket connection
- Stream audio in real time
- Handle final and non-final tokens
- Display low-latency live transcripts
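The token-handling step can be sketched as follows. The message shape used here, a `tokens` list whose items carry `text` and `is_final`, is an assumption for illustration, as is the premise that each message repeats the current non-final tokens while final tokens arrive only once; check the WebSocket API reference for the actual schema.

```python
from dataclasses import dataclass


@dataclass
class LiveTranscript:
    """Accumulates final tokens and tracks the latest non-final tokens."""
    final_text: str = ""
    pending_text: str = ""

    def update(self, message: dict) -> str:
        """Apply one response message and return the text to display."""
        pending = []
        for token in message.get("tokens", []):
            if token.get("is_final"):
                # Final tokens are stable: append them permanently.
                self.final_text += token["text"]
            else:
                # Non-final tokens may still change: keep only the latest set.
                pending.append(token["text"])
        self.pending_text = "".join(pending)
        return self.final_text + self.pending_text
```

A caller would decode each incoming message (for example with `json.loads`) and pass it to `update`, rendering `final_text` as settled and `pending_text` as provisional.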