Endpoint detection
Learn how real-time endpoint detection works and how to tune it for your application.
Overview
Endpoint detection lets you know when a speaker has finished an utterance.
This is critical for real-time voice AI assistants, command-and-response systems, live translation, dictation, and conversational applications where you want to respond quickly without waiting for long silences.
Soniox provides semantic endpointing. Instead of relying only on silence or voice activity detection, the speech model uses pauses, intonation, speech patterns, and conversational context to decide when the speaker has likely finished speaking.
Semantic endpointing helps produce a smoother user experience because the model can distinguish between:
- A speaker who has finished their thought.
- A speaker who paused briefly but is likely to continue.
- A speaker who is hesitating, thinking, or mid-sentence.
When an endpoint is detected, Soniox finalizes the current segment and returns a
special <end> token.
Important tradeoff
Endpoint detection finalizes speech earlier. This reduces latency, but it can also slightly reduce word recognition accuracy because the model has less time to revise the transcript.
More aggressive endpoint settings can also produce more endpoints, which means longer speech may be split into more segments.
For best results, tune endpoint detection for your application instead of always choosing the lowest possible latency.
Speaker diarization accuracy
If speaker diarization is enabled, endpoint detection reduces diarization accuracy because it forces earlier finalization.
For the highest speaker diarization accuracy, do not use endpoint detection.
How endpoint detection works
When enable_endpoint_detection is enabled:
- Soniox streams non-final tokens while the user is speaking.
- The model continuously evaluates whether the current utterance has ended.
When an endpoint is detected:
- All preceding tokens in the segment are finalized.
- A special <end> token is returned.
- Your application can use the <end> token as the signal to trigger downstream logic.
The <end> token:
- Always appears once at the end of the finalized segment.
- Is always final.
- Can be used to trigger an LLM, execute a command, submit a user turn, or start a response.
Enable endpoint detection
Add enable_endpoint_detection to your real-time request:
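A minimal sketch in Python, assuming the request configuration is sent as a JSON object. Only enable_endpoint_detection is taken from this page; any other fields your request needs (credentials, model, audio format) are left out here and should come from the API reference.

```python
import json

# Only enable_endpoint_detection comes from this page. Other request
# fields (credentials, model, audio format) are intentionally omitted.
request_config = {
    "enable_endpoint_detection": True,
}

# Serialize the configuration to JSON before sending it with the request.
payload = json.dumps(request_config)
print(payload)
```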
Example
User says
Soniox stream
Non-final tokens are streamed while the user is speaking:
As more speech arrives, the transcript continues updating:
When Soniox detects the endpoint, the segment is finalized and <end> is returned:
How to use this
- Display non-final tokens immediately for real-time captions or live UI feedback.
- Use final tokens after <end> arrives for actions that require stable text, such as calling an LLM, executing a command, submitting a form, or storing the final transcript.
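The pattern above can be sketched in Python as follows. This assumes each token arrives as a dict with a text string and an is_final boolean, and that each response carries the current non-final tokens in full. The TurnCollector class and the sample tokens are hypothetical and only illustrate the control flow.

```python
END_TOKEN = "<end>"


class TurnCollector:
    """Accumulates final tokens and emits a turn when <end> arrives."""

    def __init__(self):
        self.final_text = []  # stable text for the current segment
        self.preview = ""     # final + non-final text, for live display
        self.turns = []       # completed segments

    def feed(self, tokens):
        """Process the tokens from one response (assumed token shape)."""
        non_final = []
        for token in tokens:
            if token["is_final"]:
                if token["text"] == END_TOKEN:
                    # Endpoint reached: the segment text is now stable.
                    self.turns.append("".join(self.final_text))
                    self.final_text = []
                else:
                    self.final_text.append(token["text"])
            else:
                non_final.append(token["text"])
        # Non-final text may still change, so it only feeds the preview.
        self.preview = "".join(self.final_text) + "".join(non_final)


collector = TurnCollector()
# Hypothetical responses, not real service output.
collector.feed([{"text": "Turn on ", "is_final": False}])
collector.feed([{"text": "Turn on ", "is_final": True},
                {"text": "the lights", "is_final": False}])
collector.feed([{"text": "the lights", "is_final": True},
                {"text": "<end>", "is_final": True}])
print(collector.turns)
```

In this sketch, collector.preview is what a caption UI would display after each response, while entries in collector.turns are the stable text to hand to downstream logic.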
Endpoint controls
Soniox provides three parameters for controlling endpoint behavior:
These parameters work together:
- endpoint_latency_adjustment_level reduces endpoint latency compared to the default behavior.
- endpoint_sensitivity controls how likely the model is to emit an endpoint.
- max_endpoint_delay_ms guarantees that no endpoint is emitted later than the selected maximum delay after speech has ended.
Together, these settings let you control the endpointing experience for your application.
endpoint_latency_adjustment_level
endpoint_latency_adjustment_level reduces endpoint latency compared to the
default endpointing behavior.
Allowed values: 0, 1, 2, 3
Default value: 0
You do not need to specify this parameter when using the default behavior.
Higher values reduce endpoint latency more aggressively:
| Value | Behavior |
|---|---|
| 0 | Default semantic endpointing behavior |
| 1 | Lower latency than default |
| 2 | Even lower latency |
| 3 | Most aggressive latency reduction |
Example:
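A minimal sketch in Python, assuming the parameter is passed in the same request configuration as enable_endpoint_detection. The value 2 is only an illustration, not a recommendation.

```python
# Illustrative value: 2 asks for lower endpoint latency than the default (0).
config = {
    "enable_endpoint_detection": True,
    "endpoint_latency_adjustment_level": 2,
}

# The allowed values are 0, 1, 2, and 3.
assert config["endpoint_latency_adjustment_level"] in (0, 1, 2, 3)
print(config)
```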
Increasing endpoint_latency_adjustment_level usually means:
- Endpoints are returned sooner.
- More endpoints will be emitted.
- Long speech may be split into more segments.
- Word recognition accuracy may slightly decrease because speech is finalized earlier.
This is still semantic endpointing. Even with a higher latency adjustment level, Soniox may not emit an endpoint immediately every time. If the speech indicates that the user is likely to continue, the model may wait before finalizing.
endpoint_sensitivity
endpoint_sensitivity controls how likely Soniox is to emit an endpoint.
Allowed values: -1.0 to 1.0
Default value: 0.0
Higher values make endpoints more likely. This can reduce latency and create more endpoint events.
Lower values make endpoints less likely. This can help Soniox wait longer before finalizing, which is useful when users pause frequently or speak slowly.
Example:
Use a positive value when you want more responsive endpointing:
Use a negative value when endpoints are happening too early:
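Both cases can be sketched in Python as follows, assuming the parameter sits in the same request configuration as enable_endpoint_detection. The values 0.5 and -0.5 are illustrative choices inside the allowed range, not recommended settings.

```python
# More responsive endpointing: a positive value makes endpoints more likely.
responsive = {
    "enable_endpoint_detection": True,
    "endpoint_sensitivity": 0.5,   # illustrative value
}

# Endpoints arriving too early: a negative value makes them less likely.
patient = {
    "enable_endpoint_detection": True,
    "endpoint_sensitivity": -0.5,  # illustrative value
}

for name, cfg in (("responsive", responsive), ("patient", patient)):
    # Both values must stay inside the allowed range of -1.0 to 1.0.
    assert -1.0 <= cfg["endpoint_sensitivity"] <= 1.0
    print(name, cfg["endpoint_sensitivity"])
```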
Using endpoint_sensitivity with endpoint_latency_adjustment_level
When endpoint_latency_adjustment_level is greater than 0, use
endpoint_sensitivity to control how often endpoints are emitted at that lower
latency.
For example, this configuration reduces latency and makes endpoints more likely:
Setting endpoint_sensitivity to a negative value while also setting
endpoint_latency_adjustment_level above 0 is not recommended. These settings
work against each other:
- endpoint_latency_adjustment_level reduces endpoint latency.
- Negative endpoint_sensitivity makes endpoints less likely.
If you want higher latency or fewer endpoints, reduce
endpoint_latency_adjustment_level first.
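The guidance in this section can be turned into a small pre-flight check. This is a sketch in Python that assumes the configuration is a plain dict. The has_conflicting_settings helper is hypothetical, and the sample values (level 2, sensitivity 0.5 and -0.5) are illustrative.

```python
def has_conflicting_settings(config):
    """True when a latency level above 0 is paired with negative sensitivity."""
    level = config.get("endpoint_latency_adjustment_level", 0)  # default 0
    sensitivity = config.get("endpoint_sensitivity", 0.0)       # default 0.0
    return level > 0 and sensitivity < 0


# Lower latency and more likely endpoints: the settings reinforce each other.
aligned = {
    "enable_endpoint_detection": True,
    "endpoint_latency_adjustment_level": 2,
    "endpoint_sensitivity": 0.5,
}
# Lower latency but less likely endpoints: the settings pull apart.
conflicting = dict(aligned, endpoint_sensitivity=-0.5)

print(has_conflicting_settings(aligned))      # False
print(has_conflicting_settings(conflicting))  # True
```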
max_endpoint_delay_ms
max_endpoint_delay_ms sets the maximum time Soniox can wait before returning an
endpoint after speech has ended.
Allowed values: 500 to 3000
Default value: 2000
Example:
Use this parameter when your application needs a strict upper bound on endpoint latency.
For example, if you want to guarantee that no endpoint is emitted later than 1500 milliseconds after speech has ended:
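A sketch of that configuration in Python, assuming the parameter is set in the same request configuration as enable_endpoint_detection:

```python
# Cap the wait after speech has ended at 1500 ms (the default is 2000 ms).
config = {
    "enable_endpoint_detection": True,
    "max_endpoint_delay_ms": 1500,
}

# The allowed range is 500 to 3000 milliseconds.
assert 500 <= config["max_endpoint_delay_ms"] <= 3000
print(config["max_endpoint_delay_ms"])
```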
Lower values create a stricter latency limit. Higher values give the model more time to decide whether the user has really finished speaking.
Default configuration
By default, endpoint detection uses semantic endpointing with no latency
adjustment. You only need to set enable_endpoint_detection; the other
parameters fall back to these values:
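Written out as a Python sketch, using the default values and allowed ranges given in the parameter sections above. The build_endpoint_config helper is hypothetical: it fills in defaults and rejects out-of-range values, and it assumes the configuration is a plain dict.

```python
DEFAULTS = {
    "endpoint_latency_adjustment_level": 0,  # allowed: 0, 1, 2, 3
    "endpoint_sensitivity": 0.0,             # allowed: -1.0 to 1.0
    "max_endpoint_delay_ms": 2000,           # allowed: 500 to 3000
}


def build_endpoint_config(**overrides):
    """Merge overrides into the defaults and validate the allowed ranges."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    cfg = {"enable_endpoint_detection": True, **DEFAULTS, **overrides}
    if cfg["endpoint_latency_adjustment_level"] not in (0, 1, 2, 3):
        raise ValueError("endpoint_latency_adjustment_level must be 0, 1, 2, or 3")
    if not -1.0 <= cfg["endpoint_sensitivity"] <= 1.0:
        raise ValueError("endpoint_sensitivity must be between -1.0 and 1.0")
    if not 500 <= cfg["max_endpoint_delay_ms"] <= 3000:
        raise ValueError("max_endpoint_delay_ms must be between 500 and 3000")
    return cfg


# With no overrides, the result is the default configuration.
print(build_endpoint_config())
```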
Recommended configuration for lower latency
For lower latency in many voice AI applications, start with:
This provides a good starting point for responsive turn-taking while still using semantic endpointing.
Then tune based on your application.
How to tune endpoint detection
If endpoints are too slow
Increase endpoint_latency_adjustment_level or endpoint_sensitivity:
If you need a stricter maximum delay, reduce max_endpoint_delay_ms:
If endpoints are too early
Use a lower latency adjustment level:
Or return to the default behavior:
You can also reduce endpoint_sensitivity:
For slower speakers, dictation, or users who pause often mid-sentence, you may use a negative sensitivity value:
If you want more endpoints at the selected latency
Increase endpoint_sensitivity:
If you want fewer endpoints
Reduce endpoint_sensitivity or use a lower endpoint_latency_adjustment_level:
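The tuning steps in this section can be summarized as two small adjustments. This Python sketch assumes a plain dict configuration. The helper names, the 0.25 step size, and the exact order of adjustments are hypothetical. They follow the advice to change the latency level before sensitivity and to avoid pairing a level above 0 with negative sensitivity.

```python
STEP = 0.25  # hypothetical step size for sensitivity changes


def _read(cfg):
    # Missing keys fall back to the default values (0 and 0.0).
    return (cfg.get("endpoint_latency_adjustment_level", 0),
            cfg.get("endpoint_sensitivity", 0.0))


def _write(cfg, level, sensitivity):
    # Return a copy so the caller's configuration is not mutated.
    return dict(cfg, endpoint_latency_adjustment_level=level,
                endpoint_sensitivity=sensitivity)


def respond_later(cfg):
    """Endpoints are too early: lower the level first, then sensitivity."""
    level, sensitivity = _read(cfg)
    if level > 0:
        level -= 1
    else:
        sensitivity = max(-1.0, sensitivity - STEP)
    return _write(cfg, level, sensitivity)


def respond_sooner(cfg):
    """Endpoints are too slow: undo negative sensitivity, then raise the level."""
    level, sensitivity = _read(cfg)
    if sensitivity < 0:
        sensitivity = min(0.0, sensitivity + STEP)
    elif level < 3:
        level += 1
    else:
        sensitivity = min(1.0, sensitivity + STEP)
    return _write(cfg, level, sensitivity)


cfg = {"enable_endpoint_detection": True, "endpoint_latency_adjustment_level": 1}
cfg = respond_later(cfg)  # level 1 -> 0
cfg = respond_later(cfg)  # sensitivity 0.0 -> -0.25
print(cfg)
```

Apply one step at a time and re-test with real conversations before stepping again.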
Best practices
- Start with the default configuration, then tune based on real user conversations.
- Do not reduce latency to the maximum unless your application truly needs it. More aggressive endpointing can reduce recognition accuracy and may split speech into more segments.
- Use endpoint_latency_adjustment_level first to choose the overall latency profile.
- Use endpoint_sensitivity to make endpoints more or less likely within that latency profile.
- Use max_endpoint_delay_ms when you need a hard upper bound on endpoint latency.