
Endpoint detection

Learn how speech endpoint detection works.

Overview

Endpoint detection lets you know when a speaker has finished speaking. This is critical for real-time voice AI assistants, command-and-response systems, and conversational apps where you want to respond immediately without waiting for long silences.

Unlike traditional endpoint detection based on voice activity detection (VAD), Soniox provides semantic endpointing: the speech model listens for intonation, pauses, and conversational context to determine when an utterance has ended. The result is lower latency, fewer false triggers, and a noticeably smoother product experience.


How it works

When enable_endpoint_detection is enabled:

  • Soniox monitors pauses in speech to determine the end of an utterance.
  • As soon as speech ends:
    • All preceding tokens are marked as final.
    • A special <end> token is returned.
  • The <end> token:
    • Always appears once at the end of the segment.
    • Is always final.
    • Can be treated as a reliable signal to trigger downstream logic (e.g., calling an LLM or executing a command).
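
As an illustration of that last point, here is a minimal Python sketch of checking for the <end> token. It assumes each streamed response carries a list of token objects shaped like the ones in the example below; the helper name is illustrative.

def utterance_ended(tokens: list[dict]) -> bool:
    # True once the final <end> token appears in a response,
    # i.e. the utterance is complete and downstream logic can run.
    return any(
        token["text"] == "<end>" and token["is_final"]
        for token in tokens
    )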

Enabling endpoint detection

Add the flag in your real-time request:

{
  "enable_endpoint_detection": true
}
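
If your integration opens a WebSocket session and sends the configuration as the first JSON message, it might look like the following sketch. Only enable_endpoint_detection comes from this page; the URL, api_key, and model fields are placeholders to adapt to your own setup.

import asyncio
import json

import websockets  # pip install websockets


async def start_session() -> None:
    # Placeholder URL; use the real-time endpoint from your Soniox account/docs.
    async with websockets.connect("wss://<SONIOX_REALTIME_URL>") as ws:
        # First message: session configuration, including the endpoint flag.
        await ws.send(json.dumps({
            "api_key": "<YOUR_API_KEY>",    # placeholder
            "model": "<REALTIME_MODEL>",    # placeholder
            "enable_endpoint_detection": True,
        }))
        # ...stream audio chunks and read transcription responses here...


asyncio.run(start_session())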

Example

User says

What's the weather in San Francisco?

Soniox stream

Non-final tokens (still being processed)

First response arrives:

{"text": "What's",    "is_final": false}
{"text": "the",       "is_final": false}
{"text": "weather",   "is_final": false}

Second response arrives:

{"text": "What's",    "is_final": false}
{"text": "the",       "is_final": false}
{"text": "weather",   "is_final": false}
{"text": "in",        "is_final": false}
{"text": "San",       "is_final": false}
{"text": "Francisco", "is_final": false}
{"text": "?",         "is_final": false}

Final tokens (endpoint detected, tokens are finalized)

{"text": "What's",    "is_final": true}
{"text": "the",       "is_final": true}
{"text": "weather",   "is_final": true}
{"text": "in",        "is_final": true}
{"text": "San",       "is_final": true}
{"text": "Francisco", "is_final": true}
{"text": "?",         "is_final": true}
{"text": "<end>",     "is_final": true}

Explanation

  1. Streaming phase: tokens are delivered in real time as the user speaks. They are marked is_final: false, meaning the transcript is still being processed and may change.
  2. Endpoint detection: once the speaker stops, the model recognizes the end of the utterance.
  3. Finalization phase: previously non-final tokens are re-emitted with is_final: true, followed by the <end> token (also final).
  4. Usage tip: display non-final tokens immediately for live captions, then replace them with the final tokens once <end> arrives, and only then trigger any downstream actions (see the sketch below).
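
A minimal Python sketch of that flow, assuming each response delivers a list of token objects shaped as in the example above; on_utterance stands in for whatever downstream action you trigger, and the space-joined caption is a simplification of real punctuation handling.

final_words: list[str] = []


def handle_response(tokens: list[dict]) -> None:
    partial_words = []
    endpoint_reached = False

    for token in tokens:
        if token["text"] == "<end>":
            endpoint_reached = True
        elif token["is_final"]:
            final_words.append(token["text"])
        else:
            partial_words.append(token["text"])

    # Live caption: confirmed words plus the still-changing tail.
    print("caption:", " ".join(final_words + partial_words))

    if endpoint_reached:
        utterance = " ".join(final_words)
        final_words.clear()
        on_utterance(utterance)


def on_utterance(text: str) -> None:
    # Placeholder downstream action, e.g. call an LLM or execute a command.
    print("utterance complete:", text)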

Controlling endpoint delay

In addition to semantic endpoint detection, you can also control the maximum delay between the end of speech and the returned endpoint using max_endpoint_delay_ms. Lower values cause the endpoint to be returned sooner.

Allowed values for maximum delay are between 500ms and 3000ms. The default value is 2000ms.

Example configuration:

{
  "max_endpoint_delay_ms": 500
}
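
Combined with endpoint detection, the relevant part of the configuration might look like the sketch below; fields other than the two parameters shown on this page are placeholders. Lower values typically make responses feel snappier, but leave less room for natural pauses before the endpoint is returned.

# Sketch: both endpoint settings together in one session configuration.
# Only these two fields come from this page; anything else your request
# needs (model, audio format, API key, ...) is unchanged.
config = {
    "enable_endpoint_detection": True,
    "max_endpoint_delay_ms": 500,  # return the endpoint at most ~500 ms after speech ends
    # ...other request fields...
}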