Endpoint detection
Learn how speech endpoint detection works.
Overview
Endpoint detection lets you know when a speaker has finished speaking. This is critical for real-time voice AI assistants, command-and-response systems, and conversational apps where you want to respond immediately without waiting for long silences.
Unlike traditional endpoint detection based on voice activity detection (VAD), Soniox provides semantic endpointing: the speech model listens to intonation, pauses, and conversational context to determine when an utterance has ended. This yields lower latency, fewer false triggers, and a noticeably smoother product experience.
How it works
When `enable_endpoint_detection` is enabled:
- Soniox monitors pauses in speech to determine the end of an utterance.
- As soon as speech ends:
  - All preceding tokens are marked as final.
  - A special `<end>` token is returned.
- The `<end>` token:
  - Always appears once at the end of the segment.
  - Is always final.
  - Can be treated as a reliable signal to trigger downstream logic (e.g., calling an LLM or executing a command).
Enabling endpoint detection
Add the flag in your real-time request:
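A minimal sketch of a start-of-stream configuration message with the flag set. Only `enable_endpoint_detection` is taken from this page; the other field names (`api_key`, `model`, `audio_format`) are illustrative placeholders, not confirmed parameter names:

```python
import json

# Hypothetical initial configuration message for the real-time stream.
# enable_endpoint_detection is the documented flag; everything else
# is an illustrative placeholder.
config = {
    "api_key": "<YOUR_API_KEY>",        # placeholder
    "model": "stt-rt-preview",          # assumed model name
    "audio_format": "auto",             # assumed field
    "enable_endpoint_detection": True,  # the flag described above
}

# Typically sent as the first message over the WebSocket connection.
print(json.dumps(config))
```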
Example
A user speaks and Soniox streams tokens back in two phases:
- Non-final tokens (still being processed): a first response arrives, then a second response with updated tokens.
- Final tokens: the endpoint is detected and the tokens are finalized.
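As a concrete illustration, the stream might carry payloads like the following. The utterance and the response schema (a list of `tokens`, each with `text` and `is_final`) are simplified assumptions, not the exact wire format:

```python
# Simplified, hypothetical response sequence for one utterance.
responses = [
    # First response: partial, still subject to change.
    {"tokens": [{"text": "What", "is_final": False},
                {"text": " is", "is_final": False}]},
    # Second response: more non-final tokens.
    {"tokens": [{"text": "What", "is_final": False},
                {"text": " is", "is_final": False},
                {"text": " the", "is_final": False},
                {"text": " weather", "is_final": False}]},
    # Endpoint detected: tokens re-emitted as final, plus <end>.
    {"tokens": [{"text": "What", "is_final": True},
                {"text": " is", "is_final": True},
                {"text": " the", "is_final": True},
                {"text": " weather", "is_final": True},
                {"text": "<end>", "is_final": True}]},
]

def endpoint_reached(response):
    """True once the final <end> token appears in a response."""
    return any(t["text"] == "<end>" and t["is_final"]
               for t in response["tokens"])
```

Here `endpoint_reached` is `False` for the first two responses and `True` for the last one.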
Explanation
- Streaming phase: tokens are delivered in real time as the user speaks. They are marked `is_final: false`, meaning the transcript is still being processed and may change.
- Endpoint detection: once the speaker stops, the model recognizes the end of the utterance.
- Finalization phase: previously non-final tokens are re-emitted with `is_final: true`, followed by the `<end>` token (also final).
- Usage tip: display non-final tokens immediately for live captions, but wait until `<end>` arrives and switch to final tokens before triggering any downstream actions.
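That usage tip can be sketched as a small handler that maintains a live caption from non-final tokens and only fires downstream logic on `<end>`, assuming a simplified response shape of `{"tokens": [{"text", "is_final"}]}`:

```python
def handle_response(response, state):
    """Process one streamed response: accumulate final tokens, update
    the live caption, and fire downstream logic when <end> arrives.
    The response shape used here is an assumption for illustration."""
    for token in response["tokens"]:
        if token["is_final"]:
            if token["text"] == "<end>":
                utterance = "".join(state["final"]).strip()
                state["utterances"].append(utterance)  # e.g., call an LLM here
                state["final"].clear()
            else:
                state["final"].append(token["text"])
    # Live caption: confirmed final text plus the current non-final tail.
    tail = "".join(t["text"] for t in response["tokens"] if not t["is_final"])
    state["caption"] = "".join(state["final"]) + tail
```

For example, feeding a non-final `"Hi"` updates `state["caption"]` without triggering anything; a later final `"Hi"` followed by `<end>` appends `"Hi"` to `state["utterances"]`.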
Controlling endpoint delay
In addition to semantic endpoint detection, you can control the maximum delay between the end of speech and the returned endpoint using `max_endpoint_delay_ms`. Lower values return the endpoint sooner. Allowed values range from 500 ms to 3000 ms; the default is 2000 ms.
Example configuration:
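A sketch of both options together; field names other than `enable_endpoint_detection` and `max_endpoint_delay_ms` are illustrative placeholders:

```python
import json

# Hypothetical configuration with a tighter endpoint delay.
config = {
    "api_key": "<YOUR_API_KEY>",        # placeholder
    "model": "stt-rt-preview",          # assumed model name
    "enable_endpoint_detection": True,
    "max_endpoint_delay_ms": 1000,      # allowed range: 500-3000, default 2000
}
print(json.dumps(config))
```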