Real-time transcription
Learn about real-time transcription with low latency and high accuracy for all 60+ languages.
Overview
Soniox Speech-to-Text AI supports real-time transcription with low latency and high accuracy for all 60+ languages. It's designed for responsive applications like live captioning, streaming analytics, and conversational interfaces.
Real-time transcription is provided through our WebSocket API. You can also use our Web library, which makes it easy to integrate real-time transcription directly into browser-based applications.
Streaming expectations
Real-time cadence
Send audio data to Soniox at real-time or near real-time speed. Small deviations, such as brief buffering or network jitter, are tolerated, but prolonged bursts or lags may result in disconnection.
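When the audio comes from a file or an in-memory buffer rather than a live source, one way to keep to real-time cadence is to split it into fixed-duration chunks and schedule each send against the stream's start time. The following is a minimal sketch, assuming raw 16 kHz, 16-bit, mono PCM and 100 ms chunks. Those values, the function names, and the `send` callable are illustrative assumptions, not Soniox requirements.

```python
import time
from typing import Callable, Iterator


def chunk_size_bytes(sample_rate: int, bytes_per_sample: int,
                     channels: int, chunk_ms: int) -> int:
    """Number of raw PCM bytes that correspond to chunk_ms of audio."""
    return sample_rate * bytes_per_sample * channels * chunk_ms // 1000


def iter_chunks(audio: bytes, size: int) -> Iterator[bytes]:
    """Yield consecutive slices of at most `size` bytes."""
    for start in range(0, len(audio), size):
        yield audio[start:start + size]


def stream_paced(audio: bytes,
                 send: Callable[[bytes], None],
                 chunk_ms: int = 100,
                 sample_rate: int = 16000,
                 bytes_per_sample: int = 2,
                 channels: int = 1,
                 sleep: Callable[[float], None] = time.sleep,
                 clock: Callable[[], float] = time.monotonic) -> int:
    """Send `audio` through `send` (a hypothetical callable, e.g. a wrapper
    around a WebSocket send) at roughly real-time speed.
    Returns the number of chunks sent."""
    size = chunk_size_bytes(sample_rate, bytes_per_sample, channels, chunk_ms)
    started = clock()
    sent = 0
    for index, chunk in enumerate(iter_chunks(audio, size)):
        # Schedule against the start time so small per-send delays
        # do not accumulate into a growing lag.
        due = started + index * chunk_ms / 1000
        delay = due - clock()
        if delay > 0:
            sleep(delay)
        send(chunk)
        sent += 1
    return sent
```

The `sleep` and `clock` parameters are injectable only so that the pacing logic can be exercised without waiting in real time. A live microphone or network source already produces audio at real-time speed and needs no extra pacing.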
Handling pauses
To implement pause or mute functionality without disconnecting the session, you should use manual finalization with connection keepalive.
This keeps the connection alive and preserves session-level context, such as speaker diarization and language tracking, throughout the stream.
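A pause could be sketched as follows: send one `{"type": "finalize"}` message when the user pauses, then send `{"type": "keepalive"}` messages at a regular interval until audio resumes. The two message bodies come from this page. The `PauseController` class, the `send_text` callable, and the 5-second interval are hypothetical and should be checked against the connection keepalive documentation.

```python
import json
from typing import Callable

FINALIZE = json.dumps({"type": "finalize"})
KEEPALIVE = json.dumps({"type": "keepalive"})


class PauseController:
    """Tracks pause state and decides which control message to send.

    `send_text` is a hypothetical callable that writes a text frame to an
    already-open WebSocket. The interval is an assumed value.
    """

    def __init__(self, send_text: Callable[[str], None],
                 keepalive_interval_s: float = 5.0) -> None:
        self._send = send_text
        self._interval = keepalive_interval_s
        self._last_sent = 0.0
        self.paused = False

    def pause(self, now: float) -> None:
        """Finalize the audio streamed so far and enter the paused state."""
        if self.paused:
            return
        self.paused = True
        self._send(FINALIZE)
        self._last_sent = now

    def tick(self, now: float) -> None:
        """Call periodically; sends a keepalive once the interval has elapsed."""
        if self.paused and now - self._last_sent >= self._interval:
            self._send(KEEPALIVE)
            self._last_sent = now

    def resume(self) -> None:
        """Leave the paused state; the caller resumes sending audio."""
        self.paused = False
```

Passing `now` explicitly (for example from `time.monotonic()`) keeps the timing logic independent of any particular event loop or WebSocket client.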
Key concepts
We recommend reading the following real-time concepts before integrating:
- Understand how tokens evolve during streaming and when you can consider them stable.
- Learn how to configure latency settings to control the trade-off between speed and accuracy.
- Configure the model to automatically detect when a speaker has stopped speaking.
- Explicitly trigger finalization of all streamed audio at any time using a {"type": "finalize"} message.
- Prevent the WebSocket from timing out during silence by sending a {"type": "keepalive"} message.
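To illustrate the first concept, the sketch below keeps a running transcript in which stable (final) tokens are appended once, while provisional (non-final) tokens are replaced on every update. It assumes each response carries a list of tokens with `text` and `is_final` fields, and that the non-final tokens in a response supersede those from the previous one. Both are assumptions to verify against the API reference, and `TranscriptBuffer` is a hypothetical helper.

```python
from typing import Iterable, List, Mapping


class TranscriptBuffer:
    """Accumulates final tokens and tracks the latest non-final tokens."""

    def __init__(self) -> None:
        self.final_text = ""    # stable text, never rewritten
        self.pending_text = ""  # provisional text, replaced on each update

    def update(self, tokens: Iterable[Mapping[str, object]]) -> None:
        """Apply the tokens from one response (assumed field names)."""
        pending: List[str] = []
        for token in tokens:
            text = str(token["text"])
            if token.get("is_final"):
                self.final_text += text
            else:
                pending.append(text)
        # Non-final tokens may still change, so keep only the newest set.
        self.pending_text = "".join(pending)

    def display(self) -> str:
        """Text to show the user: stable part followed by provisional part."""
        return self.final_text + self.pending_text
```

A user interface might render `final_text` normally and `pending_text` in a lighter style so that viewers can tell which words may still change.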
Integration guides
Choose one of the following integration patterns based on your app architecture:
- Send audio directly from your client (e.g., browser, mobile app) to Soniox.
  Best for:
  - Web/mobile apps
  - Lowest latency
  - Client-managed sessions
- Stream audio from your client to your backend, and forward it from there to Soniox.
  Best for:
  - Centralized session control
  - Audio preprocessing or archiving
  - Use cases involving multiple clients
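The second pattern amounts to a relay loop on your backend: read each audio chunk from the client, optionally archive or preprocess it, and forward it upstream. A minimal sketch using only `asyncio` follows. `relay_audio`, `client_chunks`, `forward`, and `archive` are hypothetical names; in a real backend `forward` would wrap the send call of whatever WebSocket client you use for the Soniox connection.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, List, Optional


async def relay_audio(client_chunks: AsyncIterator[bytes],
                      forward: Callable[[bytes], Awaitable[None]],
                      archive: Optional[List[bytes]] = None) -> int:
    """Forward every chunk received from the client to the upstream
    connection, optionally keeping a copy. Returns total bytes forwarded."""
    forwarded = 0
    async for chunk in client_chunks:
        if archive is not None:
            # Keep a copy before forwarding, e.g. for later storage.
            archive.append(chunk)
        await forward(chunk)
        forwarded += len(chunk)
    return forwarded
```

Because chunks are forwarded as they arrive, the relay preserves the client's real-time cadence as long as the backend does not buffer or batch them.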
Example: Transcribe a live audio stream
See our example demonstrating how to transcribe a live audio stream (such as a radio broadcast) using the WebSocket API.
The example shows how to:
- Open a WebSocket connection
- Stream audio in real time
- Handle final and non-final tokens
- Display low-latency live transcripts