
Real-time transcription

Learn about real-time transcription with low latency and high accuracy for all 60+ languages.

Overview

Soniox Speech-to-Text AI supports real-time transcription with low latency and high accuracy for all 60+ languages. It's designed for responsive applications like live captioning, streaming analytics, and conversational interfaces.

Real-time transcription is provided through our WebSocket API. You can also use our Web library, which makes it easy to integrate real-time transcription directly into browser-based applications.


Streaming expectations

Real-time cadence

Send audio data to Soniox at real-time or near real-time speed. Small deviations, such as brief buffering or network jitter, are tolerated, but prolonged bursts or lags may result in disconnection.
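
For example, when replaying pre-recorded audio, pace your sends so they match playback speed. The following sketch assumes 16 kHz, 16-bit, mono PCM and an already-open WebSocket connection; the helper name and chunk size are illustrative, not part of the Soniox API.

import time


def stream_pcm_at_real_time(ws, path, sample_rate=16000, chunk_ms=120):
    """Send a pre-recorded raw PCM file at roughly real-time cadence."""
    bytes_per_second = sample_rate * 2  # 16-bit mono samples
    chunk_bytes = bytes_per_second * chunk_ms // 1000
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_bytes)
            if not chunk:
                break
            ws.send(chunk)
            time.sleep(chunk_ms / 1000)  # pace sends to match playback speed
    ws.send("")  # signal end of stream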

Handling pauses

To implement pause or mute functionality without disconnecting the session, stream zero-valued PCM samples (silence) at real-time cadence.

This ensures that session-level context — such as speaker diarization or language tracking — is maintained throughout the stream.
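
As a minimal sketch, assuming 16 kHz, 16-bit, mono PCM, a pause can be bridged by sending buffers of zero bytes at the same cadence as live audio. The helper name and chunk duration below are illustrative.

import time


def send_silence(ws, duration_s, sample_rate=16000, chunk_ms=100):
    """Keep the session alive by streaming zero-valued PCM samples (silence)."""
    chunk_bytes = sample_rate * 2 * chunk_ms // 1000  # 16-bit mono PCM
    silence = bytes(chunk_bytes)  # all samples are zero
    elapsed = 0.0
    while elapsed < duration_s:
        ws.send(silence)
        time.sleep(chunk_ms / 1000)  # keep real-time cadence
        elapsed += chunk_ms / 1000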


Key concepts

We recommend reading the following real-time concepts before integrating:

  • Final vs Non-Final Tokens

    Understand how tokens evolve during streaming and when you can consider them stable (a short handling sketch follows this list).

  • Real-time latency

    Learn how to configure latency settings to control the tradeoff between speed and accuracy.
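
As a quick illustration of the token lifecycle, a common pattern is to append final tokens to a stable transcript and rebuild the non-final text on every message, because non-final tokens may still change. The helper below is only a sketch; the full example at the end of this page applies the same pattern.

def apply_tokens(response, final_text):
    """Append final tokens to the stable transcript; rebuild non-final text."""
    non_final_text = ""
    for token in response.get("tokens", []):
        text = token.get("text", "")
        if token.get("is_final"):
            final_text += text  # stable, will not change again
        else:
            non_final_text += text  # provisional, may be revised later
    return final_text, non_final_text

On each received message, display final_text followed by non_final_text to get a live transcript whose stable prefix only ever grows.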


Integration guides

Choose one of the following integration patterns based on your app architecture:

  • Direct stream

    Send audio directly from your client (e.g., browser, mobile app) to Soniox.

    Best for:

    • Web/mobile apps
    • Lowest latency
    • Client-managed sessions
  • Proxy stream

    Stream audio from your client to your backend, and forward it from there to Soniox (see the proxy sketch after this list).

    Best for:

    • Centralized session control
    • Audio preprocessing or archiving
    • Use cases involving multiple clients
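
As a rough sketch of the proxy pattern (not a complete implementation), the backend below accepts a WebSocket connection from a client, forwards incoming audio chunks to Soniox, and relays transcription messages back. It assumes the async API of the websockets package (version 10.1 or newer); the port and handler name are illustrative.

import asyncio
import json
import os

import websockets

SONIOX_WS_URL = "wss://stt-rt.soniox.com/transcribe-websocket"


async def handle_client(client_ws):
    """Bridge one client connection to a Soniox real-time session."""
    async with websockets.connect(SONIOX_WS_URL) as soniox_ws:
        # The backend owns the API key and sends the start request.
        await soniox_ws.send(
            json.dumps(
                {
                    "api_key": os.environ["SONIOX_API_KEY"],
                    "audio_format": "auto",
                    "model": "stt-rt-preview",
                }
            )
        )

        async def pump_audio():
            # Forward raw audio chunks from the client to Soniox.
            async for chunk in client_ws:
                await soniox_ws.send(chunk)
            await soniox_ws.send("")  # signal end of stream

        async def pump_results():
            # Relay transcription messages back to the client.
            async for message in soniox_ws:
                await client_ws.send(message)

        await asyncio.gather(pump_audio(), pump_results())


async def main():
    async with websockets.serve(handle_client, "0.0.0.0", 8765):
        await asyncio.Future()  # run until cancelled


if __name__ == "__main__":
    asyncio.run(main())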

Example: Transcribe a live audio stream

The following example demonstrates how to transcribe a live audio stream (such as a radio broadcast) using the WebSocket API.

The example shows how to:

  • Open a WebSocket connection
  • Stream audio in real time
  • Handle final and non-final tokens
  • Display low-latency live transcripts

import json
import os
import threading
 
import requests
from websockets import ConnectionClosedOK
from websockets.sync.client import connect
 
# Retrieve the API key from environment variable (ensure SONIOX_API_KEY is set)
api_key = os.environ.get("SONIOX_API_KEY")
websocket_url = "wss://stt-rt.soniox.com/transcribe-websocket"
audio_url = "https://npr-ice.streamguys1.com/live.mp3?ck=1742897559135"
 
 
def stream_audio(ws):
    with requests.get(audio_url, stream=True) as res:
        res.raise_for_status()
        for chunk in res.iter_content(chunk_size=4096):
            if chunk:
                ws.send(chunk)
    ws.send("")  # signal end of stream
 
 
def main():
    print("Opening WebSocket connection...")
 
    with connect(websocket_url) as ws:
        # Send start request
        ws.send(
            json.dumps(
                {
                    "api_key": api_key,
                    "audio_format": "auto",  # server detects the format
                    "model": "stt-rt-preview",
                    "language_hints": ["en", "es"],
                }
            )
        )
 
        # Start streaming audio in background
        threading.Thread(target=stream_audio, args=(ws,), daemon=True).start()
 
        print(f"Transcription started from {audio_url}")
 
        final_text = ""
 
        try:
            while True:
                message = ws.recv()
                res = json.loads(message)
 
                if res.get("error_code"):
                    print(f"Error: {res['error_code']} - {res['error_message']}")
                    break
 
                non_final_text = ""
 
                for token in res.get("tokens", []):
                    if token.get("text"):
                        if token.get("is_final"):
                            final_text += token["text"]
                        else:
                            non_final_text += token["text"]
 
                print(
                    "\033[2J\033[H"  # clear the screen, move to top-left corner
                    + final_text  # write final text
                    + "\033[34m"  # change text color to blue
                    + non_final_text  # write non-final text
                    + "\033[39m"  # reset text color
                )
 
                if res.get("finished"):
                    print("\nTranscription complete.")
        except ConnectionClosedOK:
            pass
        except Exception as e:
            print(f"Error: {e}")
 
 
if __name__ == "__main__":
    main()

View example on GitHub
