# Endpoint detection

Learn how speech endpoint detection works.

## Overview
Endpoint detection lets you know when a speaker has finished speaking. This is critical for real-time voice AI assistants, command-and-response systems, and conversational apps where you want to respond immediately without waiting for long silences.
Unlike traditional endpoint detection based on voice activity detection (VAD), Soniox uses the speech model itself to listen for intonations, pauses, and conversational context to determine when an utterance has ended. This makes it far more advanced, delivering lower latency, fewer false triggers, and a noticeably smoother product experience.
## How it works
When `enable_endpoint_detection` is enabled:

- Soniox monitors pauses in speech to determine the end of an utterance.
- As soon as speech ends:
  - All preceding tokens are marked as final.
  - A special `<end>` token is returned.
- The `<end>` token:
  - Always appears once at the end of the segment.
  - Is always final.
  - Can be treated as a reliable signal to trigger downstream logic (e.g., calling an LLM or executing a command).
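The flow above can be sketched as a small consumer loop that collects final tokens and fires downstream logic when `<end>` arrives. This is an illustrative sketch, not the official SDK: the token dictionaries with `text` and `is_final` fields follow the response shape described in this page, but treat the exact field names as assumptions.

```python
# Minimal sketch of consuming a real-time token stream with endpoint
# detection enabled. Token field names (text, is_final) are assumptions
# based on the response shape described above.

END_TOKEN = "<end>"

def process_tokens(token_stream, on_utterance):
    """Accumulate final tokens and invoke on_utterance when <end> arrives."""
    final_text = []
    for token in token_stream:
        if not token["is_final"]:
            continue  # non-final tokens may still change; ignore them here
        if token["text"] == END_TOKEN:
            # Endpoint detected: the utterance is complete.
            on_utterance("".join(final_text).strip())
            final_text = []
        else:
            final_text.append(token["text"])

# Example usage with a simulated stream:
utterances = []
stream = [
    {"text": "Hello", "is_final": False},
    {"text": "Hello", "is_final": True},
    {"text": " world", "is_final": True},
    {"text": "<end>", "is_final": True},
]
process_tokens(stream, utterances.append)
print(utterances)  # → ['Hello world']
```

Because `<end>` is always final and always closes the segment, keying downstream actions off that single token is simpler and more reliable than tuning a silence timeout yourself.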
## Enabling endpoint detection
Add the flag in your real-time request:
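A minimal configuration might look like the following. The `enable_endpoint_detection` flag is the setting described on this page; the surrounding fields (`api_key`, `model`) are illustrative placeholders for a real-time request and should be taken from your own setup.

```json
{
  "api_key": "<SONIOX_API_KEY>",
  "model": "<your-realtime-model>",
  "enable_endpoint_detection": true
}
```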
## Example

User says:

Soniox stream:

Non-final tokens (still being processed):

First response arrives:

Second response arrives:

Final tokens (endpoint detected, tokens are finalized):
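As an illustration, the sequence of responses might look like the payloads below. The utterance, the field set, and the exact response structure are hypothetical and simplified; real responses include additional fields such as timestamps and confidence scores.

```json
[
  {"tokens": [{"text": "Turn", "is_final": false}, {"text": " on", "is_final": false}]},
  {"tokens": [{"text": "Turn", "is_final": false}, {"text": " on", "is_final": false},
              {"text": " the", "is_final": false}, {"text": " lights", "is_final": false}]},
  {"tokens": [{"text": "Turn", "is_final": true}, {"text": " on", "is_final": true},
              {"text": " the", "is_final": true}, {"text": " lights", "is_final": true},
              {"text": "<end>", "is_final": true}]}
]
```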
### Explanation

- **Streaming phase:** tokens are delivered in real time as the user speaks. They are marked `is_final: false`, meaning the transcript is still being processed and may change.
- **Endpoint detection:** once the speaker stops, the model recognizes the end of the utterance.
- **Finalization phase:** previously non-final tokens are re-emitted with `is_final: true`, followed by the `<end>` token (also final).
- **Usage tip:** display non-final tokens immediately for live captions, but wait for the final tokens and the `<end>` token before triggering any downstream actions.
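The usage tip can be sketched as a small caption renderer: show committed (final) text plus the in-flight (non-final) tail on every response, but only hand an utterance to downstream logic once `<end>` arrives. As before, the response and token field shape is an assumption, not the exact Soniox schema.

```python
# Sketch of the usage tip: render live captions from final + non-final
# tokens, but only trigger downstream actions after the <end> token.
# The response/field shape here is an assumption, not the exact schema.

def render_caption(responses):
    """Replay responses; return (last caption, utterances acted upon)."""
    committed = ""   # text confirmed by final tokens
    actions = []     # utterances handed to downstream logic
    caption = ""
    for response in responses:
        pending = ""
        saw_end = False
        for token in response["tokens"]:
            if token["text"] == "<end>":
                saw_end = True
            elif token["is_final"]:
                committed += token["text"]
            else:
                pending += token["text"]
        caption = committed + pending  # display immediately for live captions
        if saw_end:
            actions.append(committed.strip())  # safe to act on now
            committed = ""
    return caption, actions

responses = [
    {"tokens": [{"text": "Turn on", "is_final": False}]},
    {"tokens": [{"text": "Turn on the lights", "is_final": False}]},
    {"tokens": [
        {"text": "Turn on the lights", "is_final": True},
        {"text": "<end>", "is_final": True},
    ]},
]
caption, actions = render_caption(responses)
print(actions)  # → ['Turn on the lights']
```

Separating "what the user sees" (updated on every response) from "what the app acts on" (updated only at `<end>`) keeps captions responsive without risking actions on a transcript that may still change.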