Endpoint detection
Learn how speech endpoint detection works.
Overview
Soniox Speech-to-Text AI supports endpoint detection — the ability to detect when a speaker has finished speaking. This is especially useful for voice AI assistants, command-and-response systems, or any application where you want to reduce latency and act as soon as the user stops talking.
What it does
When endpoint detection is enabled:
- The model listens for natural pauses and identifies when the utterance has ended
- When this happens, it emits a special `<end>` token
- All preceding tokens are finalized immediately
- The `<end>` token itself is always final
This allows you to:
- Know exactly when the speaker has finished
- Immediately use all final tokens for downstream processing, e.g., sending to an LLM (see the sketch after this list)
- Reduce delay in conversational systems
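As a concrete illustration, here is a minimal sketch of that consumption pattern. The token field names (`text`, `is_final`) are assumptions about the response shape; check the real-time API reference for the exact format.

```python
# Minimal sketch: buffer final tokens and act when <end> arrives.
# Field names ("text", "is_final") are assumptions about the token
# format; adapt them to the actual API reference.
def handle_tokens(tokens, buffer, on_utterance):
    for token in tokens:
        if not token.get("is_final"):
            continue  # ignore non-final (interim) tokens
        if token["text"] == "<end>":
            # Endpoint detected: everything buffered so far is final.
            on_utterance("".join(buffer).strip())
            buffer.clear()
        else:
            buffer.append(token["text"])


# Example: three final tokens ending with <end> yield one utterance.
utterances = []
handle_tokens(
    [
        {"text": "Hello", "is_final": True},
        {"text": " there", "is_final": True},
        {"text": "<end>", "is_final": True},
    ],
    buffer=[],
    on_utterance=utterances.append,  # e.g., send to an LLM instead
)
# utterances == ["Hello there"]
```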
How to enable
Set the following flag in your real-time transcription request:
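For example, as part of the config object sent at the start of a real-time session. This is a sketch: the flag is `enable_endpoint_detection`, and the surrounding fields (`api_key`, `model`, `audio_format`) are illustrative placeholders.

```python
# Sketch of a real-time transcription config with endpoint detection
# enabled. Fields other than enable_endpoint_detection are
# illustrative placeholders.
config = {
    "api_key": "YOUR_API_KEY",
    "model": "stt-rt-preview",
    "audio_format": "auto",
    "enable_endpoint_detection": True,
}
```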
You can use this with WebSocket and streaming SDK integrations.
Output format
When the model detects that the speaker has stopped speaking, it returns a special token:
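In the response stream, the token might look like this (a sketch; the exact set of fields is an assumption, though `text` and `is_final` follow the behavior described above):

```python
# Sketch of the <end> token as it might appear among returned tokens.
{"text": "<end>", "is_final": True}
```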
Important notes
- The `<end>` token is treated like a regular token in the stream
- It will never appear as non-final
- You can use it as a reliable signal that the speaker has stopped or paused for an extended period
Example use case
- User speaks: "What's the weather in San Francisco tomorrow?"
- Soniox returns all tokens as final, followed by the `<end>` token
- Your system can now send the full final transcript to a text-based LLM
Example
This example demonstrates how to use endpoint detection.
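Below is a minimal end-to-end sketch in Python using the `websockets` library. The endpoint URL, config fields, and token format are assumptions modeled on the Soniox real-time WebSocket API; substitute your own API key and audio file.

```python
# Sketch of real-time transcription with endpoint detection.
# Assumptions (verify against the API reference): the WebSocket URL,
# the config fields, and the token format ("text", "is_final").
import asyncio
import json

import websockets

SONIOX_API_KEY = "YOUR_API_KEY"
WEBSOCKET_URL = "wss://stt-rt.soniox.com/transcribe-websocket"


async def transcribe() -> None:
    async with websockets.connect(WEBSOCKET_URL) as ws:
        # First message: the session config, with endpoint detection on.
        await ws.send(json.dumps({
            "api_key": SONIOX_API_KEY,
            "model": "stt-rt-preview",
            "audio_format": "auto",
            "enable_endpoint_detection": True,
        }))

        async def send_audio() -> None:
            # Stream a local file in small chunks to simulate a live
            # microphone feed; "audio.pcm" is a placeholder.
            with open("audio.pcm", "rb") as f:
                while chunk := f.read(3840):
                    await ws.send(chunk)
                    await asyncio.sleep(0.12)
            await ws.send("")  # empty message: end of audio

        sender = asyncio.create_task(send_audio())

        buffer: list[str] = []
        async for message in ws:
            for token in json.loads(message).get("tokens", []):
                if not token.get("is_final"):
                    continue
                if token["text"] == "<end>":
                    # Endpoint detected: act on the finished utterance,
                    # e.g., hand it to an LLM.
                    print("Utterance:", "".join(buffer).strip())
                    buffer.clear()
                else:
                    buffer.append(token["text"])

        await sender


asyncio.run(transcribe())
```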
Output
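With the sketch above, a single spoken question would print something along these lines (illustrative, not captured from a live run):

```
Utterance: What's the weather in San Francisco tomorrow?
```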