Endpoint detection

Learn how real-time endpoint detection works and how to tune it for your application.

Overview

Endpoint detection lets you know when a speaker has finished an utterance.

This is critical for real-time voice AI assistants, command-and-response systems, live translation, dictation, and conversational applications where you want to respond quickly without waiting for long silences.

Soniox provides semantic endpointing. Instead of relying only on silence or voice activity detection, the speech model uses pauses, intonation, speech patterns, and conversational context to decide when the speaker has likely finished speaking.

Semantic endpointing helps produce a smoother user experience because the model can distinguish between:

  • A speaker who has finished their thought.
  • A speaker who paused briefly but is likely to continue.
  • A speaker who is hesitating, thinking, or mid-sentence.

When an endpoint is detected, Soniox finalizes the current segment and returns a special <end> token.

Important tradeoff

Endpoint detection finalizes speech earlier. This reduces latency, but it can also slightly reduce word recognition accuracy because the model has less time to revise the transcript.

More aggressive endpoint settings can also produce more endpoints, which means longer speech may be split into more segments.

For best results, tune endpoint detection for your application instead of always choosing the lowest possible latency.

Speaker diarization accuracy

If speaker diarization is enabled, endpoint detection reduces diarization accuracy because it forces earlier finalization.

For the highest speaker diarization accuracy, do not use endpoint detection.


How endpoint detection works

When enable_endpoint_detection is enabled:

  • Soniox streams non-final tokens while the user is speaking.
  • The model continuously evaluates whether the current utterance has ended.

When an endpoint is detected:

  • All preceding tokens in the segment are finalized.
  • A special <end> token is returned.
  • Your application can use the <end> token as the signal to trigger downstream logic.

The <end> token:

  • Always appears once at the end of the finalized segment.
  • Is always final.
  • Can be used to trigger an LLM, execute a command, submit a user turn, or start a response.

Enable endpoint detection

Add enable_endpoint_detection to your real-time request:

{
  "enable_endpoint_detection": true
}

Example

User says

What's the weather in San Francisco?

Soniox stream

Non-final tokens are streamed while the user is speaking:

{"text": "What's",  "is_final": false}
{"text": "the",     "is_final": false}
{"text": "weather", "is_final": false}

As more speech arrives, the transcript continues updating:

{"text": "What's",    "is_final": false}
{"text": "the",       "is_final": false}
{"text": "weather",   "is_final": false}
{"text": "in",        "is_final": false}
{"text": "San",       "is_final": false}
{"text": "Francisco", "is_final": false}
{"text": "?",         "is_final": false}

When Soniox detects the endpoint, the segment is finalized and <end> is returned:

{"text": "What's",    "is_final": true}
{"text": "the",       "is_final": true}
{"text": "weather",   "is_final": true}
{"text": "in",        "is_final": true}
{"text": "San",       "is_final": true}
{"text": "Francisco", "is_final": true}
{"text": "?",         "is_final": true}
{"text": "<end>",     "is_final": true}

How to use this

  • Display non-final tokens immediately for real-time captions or live UI feedback.
  • Use final tokens after <end> arrives for actions that require stable text, such as calling an LLM, executing a command, submitting a form, or storing the final transcript.
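
The pattern above can be sketched in a few lines of Python. This is a minimal illustration, not Soniox client code: it assumes each response has already been parsed into a list of token dictionaries with the text and is_final fields shown in the example, and that the non-final tokens of one response are replaced by the next response. TurnAccumulator and its methods are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass, field

END_TOKEN = "<end>"


@dataclass
class TurnAccumulator:
    """Collects tokens until <end> arrives (hypothetical helper)."""

    final_texts: list[str] = field(default_factory=list)
    non_final_texts: list[str] = field(default_factory=list)

    def feed(self, tokens: list[dict]) -> list[list[str]]:
        """Process one batch of tokens.

        Returns the utterances completed by this batch, one list of
        token texts per <end> token seen. Usually empty or length 1.
        """
        completed: list[list[str]] = []
        # Assumption: non-final tokens are provisional, so the previous
        # batch's non-final tokens are discarded on every new batch.
        self.non_final_texts = []
        for token in tokens:
            if not token["is_final"]:
                self.non_final_texts.append(token["text"])
            elif token["text"] == END_TOKEN:
                # Endpoint reached: hand over the stable text and reset.
                completed.append(self.final_texts)
                self.final_texts = []
            else:
                self.final_texts.append(token["text"])
        return completed

    def display_texts(self) -> list[str]:
        """Token texts to show in a live UI: stable ones, then provisional."""
        return self.final_texts + self.non_final_texts
```

A caller would refresh the live caption from display_texts() after every batch and trigger downstream logic, such as an LLM call, only for the utterances that feed() returns.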

Endpoint controls

In addition to enable_endpoint_detection, Soniox provides three parameters for controlling endpoint behavior:

{
  "enable_endpoint_detection": true,
  "endpoint_latency_adjustment_level": 2,
  "endpoint_sensitivity": 0.3,
  "max_endpoint_delay_ms": 1500
}

These parameters work together:

  • endpoint_latency_adjustment_level reduces endpoint latency compared to the default behavior.
  • endpoint_sensitivity controls how likely the model is to emit an endpoint.
  • max_endpoint_delay_ms guarantees that no endpoint is emitted later than the selected maximum delay after speech has ended.

Together, these settings let you control the endpointing experience for your application.


endpoint_latency_adjustment_level

endpoint_latency_adjustment_level reduces endpoint latency compared to the default endpointing behavior.

Allowed values: 0, 1, 2, 3
Default value: 0

You do not need to specify this parameter when using the default behavior.

Higher values reduce endpoint latency more aggressively:

Value  Behavior
0      Default semantic endpointing behavior
1      Lower latency than default
2      Even lower latency
3      Most aggressive latency reduction

Example:

{
  "enable_endpoint_detection": true,
  "endpoint_latency_adjustment_level": 2
}

Increasing endpoint_latency_adjustment_level usually means:

  • Endpoints are returned sooner.
  • More endpoints will be emitted.
  • Long speech may be split into more segments.
  • Word recognition accuracy may slightly decrease because speech is finalized earlier.

This is still semantic endpointing. Even with a higher latency adjustment level, Soniox may not emit an endpoint immediately every time. If the speech indicates that the user is likely to continue, the model may wait before finalizing.


endpoint_sensitivity

endpoint_sensitivity controls how likely Soniox is to emit an endpoint.

Allowed values: -1.0 to 1.0
Default value: 0.0

Higher values make endpoints more likely. This can reduce latency and create more endpoint events.

Lower values make endpoints less likely. This can help Soniox wait longer before finalizing, which is useful when users pause frequently or speak slowly.

Example:

{
  "enable_endpoint_detection": true,
  "endpoint_sensitivity": 0.3
}

Use a positive value when you want more responsive endpointing:

{
  "endpoint_sensitivity": 0.3
}

Use a negative value when endpoints are happening too early:

{
  "endpoint_sensitivity": -0.3
}

Using endpoint_sensitivity with endpoint_latency_adjustment_level

When endpoint_latency_adjustment_level is greater than 0, use endpoint_sensitivity to control how often endpoints are emitted at that lower latency.

For example, this configuration reduces latency and makes endpoints more likely:

{
  "enable_endpoint_detection": true,
  "endpoint_latency_adjustment_level": 2,
  "endpoint_sensitivity": 0.3
}

Setting endpoint_sensitivity to a negative value while also setting endpoint_latency_adjustment_level above 0 is not recommended. These settings work against each other:

  • endpoint_latency_adjustment_level reduces endpoint latency.
  • Negative endpoint_sensitivity makes endpoints less likely.

If you want higher latency or fewer endpoints, reduce endpoint_latency_adjustment_level first.
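
A small client-side check can flag this combination before a configuration is used. The sketch below is illustrative only: find_conflicts is a hypothetical helper, not a Soniox API, and it assumes that omitted parameters take the documented defaults of 0 and 0.0.

```python
from __future__ import annotations


def find_conflicts(config: dict) -> list[str]:
    """Return warnings for settings that work against each other."""
    # Assumption: missing keys fall back to the documented defaults.
    level = config.get("endpoint_latency_adjustment_level", 0)
    sensitivity = config.get("endpoint_sensitivity", 0.0)

    warnings: list[str] = []
    if level > 0 and sensitivity < 0:
        warnings.append(
            "endpoint_latency_adjustment_level > 0 reduces latency, but a "
            "negative endpoint_sensitivity makes endpoints less likely; "
            "consider lowering the level instead."
        )
    return warnings
```

An empty list means the check found nothing to report; it does not validate the individual value ranges.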


max_endpoint_delay_ms

max_endpoint_delay_ms sets the maximum time Soniox can wait before returning an endpoint after speech has ended.

Allowed values: 500 to 3000
Default value: 2000

Example:

{
  "enable_endpoint_detection": true,
  "max_endpoint_delay_ms": 1500
}

Use this parameter when your application needs a strict upper bound on endpoint latency.

For example, if you want to guarantee that no endpoint is emitted later than 1500 milliseconds after speech has ended:

{
  "max_endpoint_delay_ms": 1500
}

Lower values create a stricter latency limit. Higher values give the model more time to decide whether the user has really finished speaking.
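
The allowed values and defaults of the three parameters can be checked on the client before a request is sent. The sketch below shows one possible way to do that in Python. EndpointConfig is a hypothetical class, not part of any Soniox SDK, and treating the documented ranges as inclusive is an assumption.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EndpointConfig:
    """Endpoint settings with the documented defaults (hypothetical helper)."""

    # Set to True here for convenience, since this class only makes sense
    # when endpoint detection is in use.
    enable_endpoint_detection: bool = True
    endpoint_latency_adjustment_level: int = 0
    endpoint_sensitivity: float = 0.0
    max_endpoint_delay_ms: int = 2000

    def __post_init__(self) -> None:
        if self.endpoint_latency_adjustment_level not in (0, 1, 2, 3):
            raise ValueError("endpoint_latency_adjustment_level must be 0, 1, 2, or 3")
        # Assumption: both range bounds are inclusive.
        if not -1.0 <= self.endpoint_sensitivity <= 1.0:
            raise ValueError("endpoint_sensitivity must be between -1.0 and 1.0")
        if not 500 <= self.max_endpoint_delay_ms <= 3000:
            raise ValueError("max_endpoint_delay_ms must be between 500 and 3000")

    def to_json(self) -> str:
        """Serialize the settings as a JSON object string."""
        return json.dumps(asdict(self))
```

For example, EndpointConfig(endpoint_latency_adjustment_level=2, max_endpoint_delay_ms=1500).to_json() yields a JSON object with the same four fields used throughout this page, while an out-of-range value raises ValueError at construction time.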


Default configuration

By default, endpoint detection uses semantic endpointing with no latency adjustment. You only need to set enable_endpoint_detection; the other parameters fall back to these values:

{
  "enable_endpoint_detection": true,
  "endpoint_latency_adjustment_level": 0,
  "endpoint_sensitivity": 0.0,
  "max_endpoint_delay_ms": 2000
}

For lower latency in many voice AI applications, start with:

{
  "enable_endpoint_detection": true,
  "endpoint_latency_adjustment_level": 2,
  "endpoint_sensitivity": 0.3,
  "max_endpoint_delay_ms": 1500
}

This provides a good starting point for responsive turn-taking while still using semantic endpointing.

Then tune based on your application.


How to tune endpoint detection

If endpoints are too slow

Increase endpoint_latency_adjustment_level or endpoint_sensitivity:

{
  "endpoint_latency_adjustment_level": 2,
  "endpoint_sensitivity": 0.3
}

If you need a stricter maximum delay, reduce max_endpoint_delay_ms:

{
  "max_endpoint_delay_ms": 1000
}

If endpoints are too early

Use a lower endpoint_latency_adjustment_level:

{
  "endpoint_latency_adjustment_level": 1
}

Or return to the default behavior:

{
  "endpoint_latency_adjustment_level": 0
}

You can also reduce endpoint_sensitivity, for example back to the default:

{
  "endpoint_sensitivity": 0.0
}

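For slower speakers, dictation, or users who often pause mid-sentence, you can use a negative sensitivity value. Keep endpoint_latency_adjustment_level at 0 in this case, because a higher level works against negative sensitivity: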

{
  "endpoint_sensitivity": -0.3
}

If you want more endpoints at the selected latency

Increase endpoint_sensitivity:

{
  "endpoint_latency_adjustment_level": 2,
  "endpoint_sensitivity": 0.5
}

If you want fewer endpoints

Reduce endpoint_sensitivity or use a lower endpoint_latency_adjustment_level:

{
  "endpoint_latency_adjustment_level": 1,
  "endpoint_sensitivity": 0.0
}
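
The tuning guidance in this section can be expressed as a small step function. This is only a sketch of one possible policy: the adjust function, the symptom names, and the 0.1 sensitivity step are all assumptions made for illustration. It changes endpoint_latency_adjustment_level first and touches endpoint_sensitivity only once the level cannot move further.

```python
SENSITIVITY_STEP = 0.1  # Assumed step size, not a documented value.


def adjust(config: dict, symptom: str) -> dict:
    """Return a copy of config nudged one step for the observed symptom.

    symptom is "too_slow" or "too_early" (hypothetical labels).
    """
    level = config.get("endpoint_latency_adjustment_level", 0)
    sensitivity = config.get("endpoint_sensitivity", 0.0)

    if symptom == "too_slow":
        if level < 3:
            level += 1  # Pick a faster latency profile first.
        else:
            sensitivity = min(1.0, round(sensitivity + SENSITIVITY_STEP, 2))
    elif symptom == "too_early":
        if level > 0:
            level -= 1  # Lower the level before reducing sensitivity.
        else:
            sensitivity = max(-1.0, round(sensitivity - SENSITIVITY_STEP, 2))
    else:
        raise ValueError(f"unknown symptom: {symptom!r}")

    updated = dict(config)
    updated["endpoint_latency_adjustment_level"] = level
    updated["endpoint_sensitivity"] = sensitivity
    return updated
```

Because the level is always lowered to 0 before sensitivity goes negative, this policy never produces the discouraged combination of a raised level with negative sensitivity when it starts from the defaults.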

Best practices

  • Start with the default configuration, then tune based on real user conversations.
  • Do not reduce latency to the maximum unless your application truly needs it. More aggressive endpointing can reduce recognition accuracy and may split speech into more segments.
  • Use endpoint_latency_adjustment_level first to choose the overall latency profile.
  • Use endpoint_sensitivity to make endpoints more or less likely within that latency profile.
  • Use max_endpoint_delay_ms when you need a hard upper bound on endpoint latency.
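
When tuning against recorded conversations, it can help to measure how often a configuration splits speech into segments. The sketch below counts segments and their lengths from a list of tokens collected during one session. It assumes tokens are dictionaries with the text and is_final fields shown earlier; segment_lengths and summarize are hypothetical helper names.

```python
from __future__ import annotations

END_TOKEN = "<end>"


def segment_lengths(tokens: list[dict]) -> list[int]:
    """Number of final tokens in each segment closed by an <end> token."""
    lengths: list[int] = []
    current = 0
    for token in tokens:
        if not token["is_final"]:
            continue  # Provisional tokens are not counted.
        if token["text"] == END_TOKEN:
            lengths.append(current)
            current = 0
        else:
            current += 1
    # Final tokens after the last <end> belong to an unfinished segment
    # and are deliberately left out.
    return lengths


def summarize(tokens: list[dict]) -> dict:
    """Segment count and average segment length for one session."""
    lengths = segment_lengths(tokens)
    average = sum(lengths) / len(lengths) if lengths else 0.0
    return {"segments": len(lengths), "avg_tokens_per_segment": average}
```

Running the same recordings through two configurations and comparing the summaries gives a rough view of how much more often the more aggressive one splits speech.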