Real-time transcription

Learn about real-time transcription with low latency and high accuracy for all 60+ languages.

Overview

Soniox Speech-to-Text AI supports real-time transcription with low latency and high accuracy for all 60+ languages. It's designed for responsive applications like live captioning, streaming analytics, and conversational interfaces.

Real-time transcription is provided through our WebSocket API. You can also use our Web library, which makes it easy to integrate real-time transcription directly into browser-based applications.

Streaming expectations

Real-time cadence

Send audio data to Soniox at real-time or near-real-time speed. Small deviations, such as brief buffering or network jitter, are tolerated, but prolonged bursts or lags may result in disconnection.
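When the source is a file or buffer rather than a live feed, the sender has to pace itself. The following is a minimal sketch of one way to do that for raw PCM, where the byte rate is constant. The helper name `paced_chunks`, the 100 ms chunk length, and the injectable `sleep` and `clock` parameters are illustrative assumptions, not part of the Soniox API.

```python
import time
from typing import Callable, Iterator


def paced_chunks(
    audio: bytes,
    bytes_per_second: int,
    chunk_ms: int = 100,  # assumed chunk length, not a Soniox requirement
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[bytes]:
    """Yield fixed-duration chunks of raw audio at roughly real-time speed."""
    chunk_size = max(1, bytes_per_second * chunk_ms // 1000)
    start = clock()
    for i, offset in enumerate(range(0, len(audio), chunk_size)):
        # Chunk i is due i * chunk_ms after the start; sleep only if we are early.
        due = start + i * chunk_ms / 1000
        delay = due - clock()
        if delay > 0:
            sleep(delay)
        yield audio[offset:offset + chunk_size]
```

For 16 kHz, 16-bit mono PCM the byte rate is 32000, so `for chunk in paced_chunks(pcm, 32000): ws.send(chunk)` sends ten chunks per second. Scheduling each chunk against the start time, rather than sleeping a fixed amount per chunk, keeps small delays from accumulating into drift. For compressed formats the byte count does not map linearly to duration, so this approach does not apply directly.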

Handling pauses

To implement pause or mute functionality without disconnecting the session, you should use manual finalization with connection keepalive.

This ensures that session-level context — such as speaker diarization or language tracking — is maintained throughout the stream, and keeps the connection alive.
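A pause handler along these lines could look like the sketch below. The `finalize` and `keepalive` message shapes are the ones listed under Key concepts. The function name `hold_while_paused`, the use of a `threading.Event` to signal resume, and the 5-second interval are assumptions made for illustration; check the keepalive documentation for the interval the service actually expects.

```python
import json
import threading


def hold_while_paused(ws, resume: threading.Event, keepalive_interval_s: float = 5.0) -> int:
    """Finalize pending audio, then send keepalives until `resume` is set.

    `ws` is any object with a send() method, such as the sync websockets
    client. Returns the number of keepalive messages sent.
    """
    # Flush everything streamed so far into final tokens before going quiet.
    ws.send(json.dumps({"type": "finalize"}))
    sent = 0
    # Event.wait returns False on timeout and True once the event is set.
    while not resume.wait(timeout=keepalive_interval_s):
        ws.send(json.dumps({"type": "keepalive"}))
        sent += 1
    return sent
```

The audio-sending thread would call this when the user mutes, and another thread would call `resume.set()` on unmute, after which audio streaming continues on the same connection.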

Termination of real-time transcription requests

Soniox aims to maintain all active real-time transcription sessions on a best-effort basis, but we cannot guarantee that every session will continue uninterrupted for the full duration supported by the model.

In some situations, the service may terminate a request early, before the maximum supported audio duration is reached.

When this occurs, you will receive an error message like:

Cannot continue request (code N). Please restart the request.

Your application should be designed to handle such errors and start a new request as needed.
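One way to structure that handling is a small supervisor loop around the session, sketched below. `run_session` is a hypothetical callable that you would write yourself: it opens one request, streams audio, and returns `None` on normal completion or the error message if the request ended early. The restart limit and linear backoff are likewise assumptions, not documented Soniox behavior.

```python
import time
from typing import Callable, Optional


def run_with_restarts(
    run_session: Callable[[], Optional[str]],
    max_restarts: int = 5,
    backoff_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call run_session until it finishes cleanly; return the number of restarts."""
    restarts = 0
    while True:
        error = run_session()
        if error is None:
            return restarts
        if restarts >= max_restarts:
            raise RuntimeError(f"giving up after {restarts} restarts: {error}")
        restarts += 1
        sleep(backoff_s * restarts)  # linear backoff before opening a new request
```

In practice `run_session` should return an error only for conditions worth retrying, such as the early-termination message above. Errors like an invalid API key will fail again on every restart and are better raised immediately.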

Key concepts

We recommend reading the following real-time concepts before integrating:

Final vs non-final tokens

Understand how tokens evolve during streaming and when you can consider them stable.

Real-time latency

Learn how to configure latency settings to control the trade-off between speed and accuracy.

Endpoint detection

Configure the model to automatically detect when a speaker has stopped speaking.

Manual finalization

Explicitly trigger finalization of all streamed audio at any time using a {"type": "finalize"} message.

Connection keepalive

Prevent the WebSocket from timing out during silence by sending a {"type": "keepalive"} message.
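To make the first concept above concrete, the sketch below shows one way to fold a response's tokens into a running transcript. It relies on the `text` and `is_final` token fields used in the example later on this page; the helper name `apply_tokens` is an assumption. Final tokens are appended permanently, while non-final tokens are rebuilt from scratch on every response because they may still change.

```python
from typing import Iterable, Tuple


def apply_tokens(final_text: str, tokens: Iterable[dict]) -> Tuple[str, str]:
    """Append final tokens to final_text; return (final_text, non_final_text)."""
    non_final_text = ""
    for token in tokens:
        text = token.get("text")
        if not text:
            continue
        if token.get("is_final"):
            final_text += text  # stable: will not be revised
        else:
            non_final_text += text  # provisional: replaced by the next response
    return final_text, non_final_text
```

A display loop would keep `final_text` across responses and render `final_text + non_final_text` each time, discarding the previous non-final part.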

Integration guides

Choose one of the following integration patterns based on your app architecture:

Direct stream

Send audio directly from your client (e.g., browser, mobile app) to Soniox.

Best for:
- Web/mobile apps
- Lowest latency
- Client-managed sessions

Proxy stream

Stream audio from your client to your backend, and forward it from there to Soniox.

Best for:
- Centralized session control
- Audio preprocessing or archiving
- Use cases involving multiple clients
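For the proxy pattern, the backend's core job is a relay loop. The sketch below is transport-agnostic and purely illustrative: `relay_audio`, `client_chunks`, `send_upstream`, and `archive` are hypothetical names, and how chunks arrive from your client depends on your own server framework. The empty frame sent at the end mirrors the end-of-stream signal used in the example later on this page.

```python
from typing import Callable, Iterable, Optional


def relay_audio(
    client_chunks: Iterable[bytes],
    send_upstream: Callable[[object], None],
    archive: Optional[object] = None,
) -> int:
    """Forward client audio chunks upstream, optionally archiving them.

    `archive` is any object with a write(bytes) method, such as an open
    binary file. Returns the total number of audio bytes forwarded.
    """
    total = 0
    for chunk in client_chunks:
        if not chunk:
            continue  # skip empty chunks so they are not mistaken for end of stream
        if archive is not None:
            archive.write(chunk)
        send_upstream(chunk)
        total += len(chunk)
    send_upstream("")  # empty frame signals end of stream
    return total
```

With the sync `websockets` client, `send_upstream` could simply be `ws.send` on the connection your backend holds to Soniox.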

Example: Transcribe a live audio stream

See our example demonstrating how to transcribe a live audio stream (such as a radio broadcast) using the WebSocket API.

The example shows how to:

- Open a WebSocket connection
- Stream audio in real time
- Handle final and non-final tokens
- Display low-latency live transcripts

import json
import os
import threading
 
import requests
from websockets import ConnectionClosedOK
from websockets.sync.client import connect
 
# Retrieve the API key from environment variable (ensure SONIOX_API_KEY is set)
api_key = os.environ.get("SONIOX_API_KEY")
websocket_url = "wss://stt-rt.soniox.com/transcribe-websocket"
audio_url = "https://npr-ice.streamguys1.com/live.mp3?ck=1742897559135"
 
 
def stream_audio(ws):
    with requests.get(audio_url, stream=True) as res:
        res.raise_for_status()
        for chunk in res.iter_content(chunk_size=4096):
            if chunk:
                ws.send(chunk)
    ws.send("")  # signal end of stream
 
 
def main():
    print("Opening WebSocket connection...")
 
    with connect(websocket_url) as ws:
        # Send start request
        ws.send(
            json.dumps(
                {
                    "api_key": api_key,
                    "audio_format": "auto",  # server detects the format
                    "model": "stt-rt-preview-v2",
                    "language_hints": ["en", "es"],
                }
            )
        )
 
        # Start streaming audio in background
        threading.Thread(target=stream_audio, args=(ws,), daemon=True).start()
 
        print(f"Transcription started from {audio_url}")
 
        final_text = ""
 
        try:
            while True:
                message = ws.recv()
                res = json.loads(message)
 
                if res.get("error_code"):
                    print(f"Error: {res['error_code']} - {res['error_message']}")
                    break
 
                non_final_text = ""
 
                for token in res.get("tokens", []):
                    if token.get("text"):
                        if token.get("is_final"):
                            final_text += token["text"]
                        else:
                            non_final_text += token["text"]
 
                print(
                    "\033[2J\033[H"  # clear the screen, move to top-left corner
                    + final_text  # write final text
                    + "\033[34m"  # change text color to blue
                    + non_final_text  # write non-final text
                    + "\033[39m"  # reset text color
                )
 
                if res.get("finished"):
                    print("\nTranscription complete.")
                    break  # stop reading once the server reports completion
        except ConnectionClosedOK:
            pass
        except Exception as e:
            print(f"Error: {e}")
 
 
if __name__ == "__main__":
    main()
