Real-time transcription

Learn about real-time transcription with low latency and high accuracy for all 60+ languages.

Overview

Soniox Speech-to-Text AI supports real-time transcription with low latency and high accuracy for all 60+ languages. It's designed for responsive applications like live captioning, streaming analytics, and conversational interfaces.

Real-time transcription is provided through our WebSocket API. You can also use our Web library, which makes it easy to integrate real-time transcription directly into browser-based applications.


Streaming expectations

Real-time cadence

Send audio data to Soniox at real-time or near-real-time speed. Small deviations, such as brief buffering or network jitter, are tolerated, but prolonged bursts or lags may result in disconnection.
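
When the source is a file or buffer rather than a live feed, one way to hold this cadence is to wait until the wall clock catches up with the audio already sent. The sketch below assumes raw 16-bit mono PCM at 16 kHz. `paced_chunks`, its parameters, and the 100 ms chunk size are illustrative choices, not part of the Soniox API.

```python
import time
from typing import Callable, Iterator


def paced_chunks(
    audio: bytes,
    sample_rate: int = 16000,    # assumption: 16 kHz
    bytes_per_sample: int = 2,   # assumption: 16-bit PCM
    channels: int = 1,           # assumption: mono
    chunk_ms: int = 100,         # illustrative chunk duration
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[bytes]:
    """Yield fixed-duration chunks of raw PCM at roughly real-time speed."""
    bytes_per_second = sample_rate * bytes_per_sample * channels
    chunk_size = bytes_per_second * chunk_ms // 1000
    if chunk_size <= 0:
        raise ValueError("chunk_ms is too small for this audio format")

    start = clock()
    sent = 0
    for offset in range(0, len(audio), chunk_size):
        # How far the audio already sent runs ahead of elapsed wall-clock time.
        ahead = sent / bytes_per_second - (clock() - start)
        if ahead > 0:
            sleep(ahead)
        chunk = audio[offset:offset + chunk_size]
        yield chunk
        sent += len(chunk)
```

Each yielded chunk would be passed to `ws.send(chunk)`. Measuring against the start time, rather than sleeping a fixed amount per chunk, keeps small delays from accumulating into drift. A live source such as a microphone or radio stream already sets the cadence itself, so it needs no pacing.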

Handling pauses

To implement pause or mute functionality without disconnecting the session, you should use manual finalization with connection keepalive.

Finalization flushes the audio streamed so far, and keepalive messages hold the connection open during the pause, so session-level context (such as speaker diarization or language tracking) is preserved throughout the stream.
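
A minimal sketch of a pause handler along these lines: it sends one `{"type": "finalize"}` message, then repeats `{"type": "keepalive"}` until the caller signals resume. The function name `hold_paused`, the `resume` event, and the 5-second interval are assumptions made for illustration; check the connection keepalive guide for the interval the service actually expects.

```python
import json
import threading


def hold_paused(ws, resume: threading.Event, keepalive_interval: float = 5.0) -> int:
    """Finalize pending audio, then send keepalives until `resume` is set.

    `ws` is any object with a `send(str)` method, such as an open
    WebSocket connection. Returns the number of keepalive messages sent.
    """
    # Ask the server to finalize everything streamed so far.
    ws.send(json.dumps({"type": "finalize"}))

    keepalives = 0
    # Event.wait returns False on timeout and True once the event is set.
    while not resume.wait(timeout=keepalive_interval):
        ws.send(json.dumps({"type": "keepalive"}))
        keepalives += 1
    return keepalives
```

The caller would stop sending audio, run `hold_paused(ws, resume)` in the sending thread, and call `resume.set()` from the UI or control thread when the user unmutes.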


Key concepts

We recommend reading the following real-time concepts before integrating:

  • Final vs non-final tokens

    Understand how tokens evolve during streaming and when you can consider them stable.

  • Real-time latency

    Learn how to configure latency settings to control the trade-off between speed and accuracy.

  • Endpoint detection

    Configure the model to automatically detect when a speaker has stopped speaking.

  • Manual finalization

    Explicitly trigger finalization of all streamed audio at any time using a {"type": "finalize"} message.

  • Connection keepalive

    Prevent the WebSocket from timing out during silence by sending a {"type": "keepalive"} message.
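
To illustrate the first concept, the sketch below keeps final tokens permanently and rebuilds the non-final portion from each new response, since non-final tokens may still change. The `tokens`, `text`, and `is_final` fields match the response shape used in the example on this page; the `TranscriptBuffer` class itself is a hypothetical helper, not part of any Soniox library.

```python
from dataclasses import dataclass


@dataclass
class TranscriptBuffer:
    """Accumulates final text and tracks the latest non-final text."""

    final_text: str = ""
    non_final_text: str = ""

    def update(self, response: dict) -> str:
        """Apply one server response and return the current transcript."""
        # Non-final tokens are provisional, so discard the previous ones.
        self.non_final_text = ""
        for token in response.get("tokens", []):
            text = token.get("text")
            if not text:
                continue
            if token.get("is_final"):
                self.final_text += text
            else:
                self.non_final_text += text
        return self.final_text + self.non_final_text
```

Calling `update` once per received message gives a transcript whose stable prefix only grows, while the tail reflects the model's latest guess.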


Integration guides

Choose one of the following integration patterns based on your app architecture:

  • Direct stream

    Send audio directly from your client (e.g., browser, mobile app) to Soniox.

    Best for:

    • Web/mobile apps
    • Fastest latency
    • Client-managed sessions

  • Proxy stream

    Stream audio from your client to your backend, and forward it from there to Soniox.

    Best for:

    • Centralized session control
    • Audio preprocessing or archiving
    • Use cases involving multiple clients

Example: Transcribe a live audio stream

See our example demonstrating how to transcribe a live audio stream (such as a radio broadcast) using the WebSocket API.

The example shows how to:

  • Open a WebSocket connection
  • Stream audio in real time
  • Handle final and non-final tokens
  • Display low-latency live transcripts

import json
import os
import threading
 
import requests
from websockets import ConnectionClosedOK
from websockets.sync.client import connect
 
# Retrieve the API key from environment variable (ensure SONIOX_API_KEY is set)
api_key = os.environ.get("SONIOX_API_KEY")
websocket_url = "wss://stt-rt.soniox.com/transcribe-websocket"
audio_url = "https://npr-ice.streamguys1.com/live.mp3?ck=1742897559135"
 
 
def stream_audio(ws):
    with requests.get(audio_url, stream=True) as res:
        res.raise_for_status()
        for chunk in res.iter_content(chunk_size=4096):
            if chunk:
                ws.send(chunk)
    ws.send("")  # signal end of stream
 
 
def main():
    print("Opening WebSocket connection...")
 
    with connect(websocket_url) as ws:
        # Send start request
        ws.send(
            json.dumps(
                {
                    "api_key": api_key,
                    "audio_format": "auto",  # server detects the format
                    "model": "stt-rt-preview",
                    "language_hints": ["en", "es"],
                }
            )
        )
 
        # Start streaming audio in background
        threading.Thread(target=stream_audio, args=(ws,), daemon=True).start()
 
        print(f"Transcription started from {audio_url}")
 
        final_text = ""
 
        try:
            while True:
                message = ws.recv()
                res = json.loads(message)
 
                if res.get("error_code"):
                    print(f"Error: {res['error_code']} - {res['error_message']}")
                    break
 
                non_final_text = ""
 
                for token in res.get("tokens", []):
                    if token.get("text"):
                        if token.get("is_final"):
                            final_text += token["text"]
                        else:
                            non_final_text += token["text"]
 
                print(
                    "\033[2J\033[H"  # clear the screen, move to top-left corner
                    + final_text  # write final text
                    + "\033[34m"  # change text color to blue
                    + non_final_text  # write non-final text
                    + "\033[39m"  # reset text color
                )
 
                if res.get("finished"):
                    print("\nTranscription complete.")
                    break  # stop reading; leaving the with-block closes the socket
        except ConnectionClosedOK:
            pass
        except Exception as e:
            print(f"Error: {e}")
 
 
if __name__ == "__main__":
    main()
