Endpoint detection
Learn how speech endpoint detection works.
Overview
Soniox Speech-to-Text AI supports endpoint detection — the ability to detect when a speaker has finished speaking. This is especially useful for voice AI assistants, command-and-response systems, or any application where you want to reduce latency and act as soon as the user stops talking.
What it does
When endpoint detection is enabled:
- The model listens for natural pauses and identifies when the utterance has ended
- When this happens, it emits a special `<end>` token
- All preceding tokens are finalized immediately
- The `<end>` token itself is always final
This allows you to:
- Know exactly when the speaker has finished
- Immediately use all final tokens for downstream processing, e.g., sending to an LLM (see the sketch after this list)
- Reduce delay in conversational systems
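As a concrete illustration, here is a minimal sketch of that consumption pattern. The token field names (`text`, `is_final`) are assumptions about the response shape; check the real-time API reference for the exact format.

```python
# Minimal sketch: buffer final tokens and act when <end> arrives.
# Field names ("text", "is_final") are assumptions about the token
# format; adapt them to the actual API reference.
def handle_tokens(tokens, buffer, on_utterance):
    for token in tokens:
        if not token.get("is_final"):
            continue  # ignore non-final (interim) tokens
        if token["text"] == "<end>":
            # Endpoint detected: everything buffered so far is final.
            on_utterance("".join(buffer).strip())
            buffer.clear()
        else:
            buffer.append(token["text"])


# Example: three final tokens ending with <end> yield one utterance.
utterances = []
handle_tokens(
    [
        {"text": "Hello", "is_final": True},
        {"text": " there", "is_final": True},
        {"text": "<end>", "is_final": True},
    ],
    buffer=[],
    on_utterance=utterances.append,  # e.g., send to an LLM instead
)
# utterances == ["Hello there"]
```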
How to enable
Set the following flag in your real-time transcription request:
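For example, as part of the config object sent at the start of a real-time session. This is a sketch: the flag is `enable_endpoint_detection`, and the surrounding fields (`api_key`, `model`, `audio_format`) are illustrative placeholders.

```python
# Sketch of a real-time transcription config with endpoint detection
# enabled. Fields other than enable_endpoint_detection are
# illustrative placeholders.
config = {
    "api_key": "YOUR_API_KEY",
    "model": "stt-rt-preview",
    "audio_format": "auto",
    "enable_endpoint_detection": True,
}
```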
You can use this with WebSocket and streaming SDK integrations.
Output format
When the model detects that the speaker has stopped speaking, it returns a special token:
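In the response stream, the token might look like this (a sketch; the exact set of fields is an assumption, though `text` and `is_final` follow the behavior described above):

```python
# Sketch of the <end> token as it might appear among returned tokens.
{"text": "<end>", "is_final": True}
```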
Important notes
- The `<end>` token is treated like a regular token in the stream
- It will never appear as non-final
- You can use it as a reliable signal that the speaker has stopped or paused for an extended period
Example use case
- User speaks: "What's the weather in San Francisco tomorrow?"
- Soniox returns all tokens as final, followed by the `<end>` token
- Your system can now send the full final transcript to a text-based LLM
Example
This example demonstrates how to use endpoint detection.
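Below is a minimal end-to-end sketch in Python using the `websockets` library. The endpoint URL, config fields, and token format are assumptions modeled on the Soniox real-time WebSocket API; substitute your own API key and audio file.

```python
# Sketch of real-time transcription with endpoint detection.
# Assumptions (verify against the API reference): the WebSocket URL,
# the config fields, and the token format ("text", "is_final").
import asyncio
import json

import websockets

SONIOX_API_KEY = "YOUR_API_KEY"
WEBSOCKET_URL = "wss://stt-rt.soniox.com/transcribe-websocket"


async def transcribe() -> None:
    async with websockets.connect(WEBSOCKET_URL) as ws:
        # First message: the session config, with endpoint detection on.
        await ws.send(json.dumps({
            "api_key": SONIOX_API_KEY,
            "model": "stt-rt-preview",
            "audio_format": "auto",
            "enable_endpoint_detection": True,
        }))

        async def send_audio() -> None:
            # Stream a local file in small chunks to simulate a live
            # microphone feed; "audio.pcm" is a placeholder.
            with open("audio.pcm", "rb") as f:
                while chunk := f.read(3840):
                    await ws.send(chunk)
                    await asyncio.sleep(0.12)
            await ws.send("")  # empty message: end of audio

        sender = asyncio.create_task(send_audio())

        buffer: list[str] = []
        async for message in ws:
            for token in json.loads(message).get("tokens", []):
                if not token.get("is_final"):
                    continue
                if token["text"] == "<end>":
                    # Endpoint detected: act on the finished utterance,
                    # e.g., hand it to an LLM.
                    print("Utterance:", "".join(buffer).strip())
                    buffer.clear()
                else:
                    buffer.append(token["text"])

        await sender


asyncio.run(transcribe())
```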
Output
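With the sketch above, a single spoken question would print something along these lines (illustrative, not captured from a live run):

```
Utterance: What's the weather in San Francisco tomorrow?
```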