What is endpoint detection? How machines know you stopped talking

You ask your phone to "set a timer for ten" and pause for half a second to decide between minutes and seconds. By the time you say "minutes," the assistant has already read the half-second pause as the end of your turn and started a ten-second countdown.

That gap, between the moment you stop making sound and the moment you are finished with your thought, is where endpointing operates.^[1] Most frustrating moments in voice interfaces trace back to a decision made in that gap.

Pauses mistaken for utterance endings

The classic failure is the thinking pause. You say "I'd like to book a flight to, um... let me think... Lisbon," and somewhere around "let me think" the system decides you are done, ships an incomplete query, and starts talking over you.

What went wrong: the endpointer watched for silence and nothing else. A natural mid-sentence pause, the kind you take while retrieving a word, can run 600 to 900 milliseconds.^[2]^[3] If your silence threshold is tighter than that, every hesitation looks like a finished sentence. Speakers who think out loud, older speakers, and non-native speakers all pause longer^[4]^[5]^[3], so a threshold tuned on fast confident talkers fails exactly the people it should serve.
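A minimal sketch of that failure, assuming a VAD that emits one speech-or-silence flag per 20 ms frame. The frame size, the segment durations, and the names `first_endpoint_ms` and `frames` are illustrative assumptions, not taken from any particular system.

```python
from typing import Iterable, List, Optional, Tuple

FRAME_MS = 20  # assumed duration of one VAD frame


def first_endpoint_ms(vad_frames: Iterable[bool], silence_timeout_ms: int) -> Optional[int]:
    """Return the time (ms) at which a silence-only endpointer fires, or None.

    vad_frames holds True for a speech frame and False for a silent one.
    The endpointer fires once speech has been heard and is followed by
    silence_timeout_ms of uninterrupted silence.
    """
    heard_speech = False
    silence_ms = 0
    for i, is_speech in enumerate(vad_frames):
        if is_speech:
            heard_speech = True
            silence_ms = 0  # any speech frame resets the silence timer
        elif heard_speech:
            silence_ms += FRAME_MS
            if silence_ms >= silence_timeout_ms:
                return (i + 1) * FRAME_MS
    return None


def frames(*segments: Tuple[bool, int]) -> List[bool]:
    """Build a frame list from (is_speech, duration_ms) segments."""
    out: List[bool] = []
    for is_speech, duration_ms in segments:
        out.extend([is_speech] * (duration_ms // FRAME_MS))
    return out


# Speech, a 700 ms thinking pause, the final word, then trailing silence.
utterance = frames((True, 1600), (False, 700), (True, 500), (False, 1200))

tight = first_endpoint_ms(utterance, silence_timeout_ms=500)   # fires inside the pause
loose = first_endpoint_ms(utterance, silence_timeout_ms=1000)  # waits for the real end
print(tight, loose)
```

With the 500 ms timeout the endpointer fires at 2100 ms, partway through the thinking pause and before the final word. With the 1000 ms timeout it survives the pause and fires at 3800 ms, a full second after the speaker actually finished.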

Delayed endpoint detection

The opposite failure feels worse because it is so awkward. You finish a question, the line goes quiet, and you sit there wondering if anyone heard you. After a beat you start to repeat yourself, and that is when the system finally responds, now on top of your second attempt.

What went wrong: the silence timeout was set long to avoid the first failure, trading false positives for latency. Every endpointer makes this trade. A conservative timeout that never interrupts adds its full duration to your response time, pure overhead on top of recognition and model latency.^[6] See the voice-agent latency budget for where those hundreds of milliseconds go.
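The trade can be made concrete with a small sweep. The sketch below assumes you have measured a set of mid-turn pause durations for your own users; the numbers in `pauses` are made-up placeholders for illustration, not measurements, and `sweep_timeouts` is a hypothetical helper name.

```python
def sweep_timeouts(mid_turn_pauses_ms, timeouts_ms):
    """For each timeout, return (timeout, cut_off_rate, added_latency_ms).

    A mid-turn pause at least as long as the timeout is counted as a
    premature endpoint. The added latency is the timeout itself, because
    the endpointer waits that long after the speaker's true final word.
    """
    rows = []
    for timeout in timeouts_ms:
        cut_offs = sum(1 for pause in mid_turn_pauses_ms if pause >= timeout)
        rows.append((timeout, cut_offs / len(mid_turn_pauses_ms), timeout))
    return rows


# Placeholder pause durations in milliseconds (illustrative only).
pauses = [150, 200, 300, 450, 600, 700, 850, 900]

table = sweep_timeouts(pauses, [300, 500, 800, 1000])
for timeout, cut_rate, added_ms in table:
    print(f"timeout={timeout:>4} ms  cut-off rate={cut_rate:.0%}  added latency={added_ms} ms")
```

On these placeholder values, moving the timeout from 300 ms to 1000 ms takes the cut-off rate from 75% to 0% while adding 700 ms to every response. The same sweep over real pause data shows where a given deployment sits on that curve.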

Handling filler words and backchannels

"Uh," "um," "you know," "right," "mm-hm." Humans pepper speech with these, and they wreck naive endpointers in two ways.

A trailing "um" holds the channel open with sound, so a silence-based detector keeps waiting even though you finished your real point three words ago. And a backchannel from the other side ("mm-hm," "go on") can register as speech and reset the timer when it should not.

What went wrong: the system treated all acoustic energy as equally meaningful. A filler is sound without semantic content, and a backchannel is the listener's signal, not the speaker's turn. Neither should drive the endpoint decision the way a real word does.
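One way to act on this, sketched under simplifying assumptions: if the recognizer provides timed tokens tagged by channel, the silence timer can be anchored on the last content word from the user rather than on the last sound. The `Token` shape, the `FILLERS` set, and the function name are hypothetical. Note that a mid-sentence "um" often signals that the speaker is holding the turn, so a real system would need context before discounting it.

```python
from dataclasses import dataclass
from typing import List, Optional

FILLERS = {"uh", "um", "er", "hmm"}  # illustrative list, not exhaustive


@dataclass
class Token:
    text: str
    end_ms: int
    speaker: str  # "user" or "agent" (assumed channel label)


def last_content_end_ms(tokens: List[Token]) -> Optional[int]:
    """End time of the last user token that is not a filler, or None."""
    last = None
    for tok in tokens:
        if tok.speaker != "user":
            continue  # sound from the other side should not reset the timer
        if tok.text.lower().strip(".,!?") in FILLERS:
            continue  # sound without semantic content
        last = tok.end_ms
    return last


tokens = [
    Token("book", 900, "user"),
    Token("a", 1000, "user"),
    Token("table", 1400, "user"),
    Token("um", 2100, "user"),      # trailing filler
    Token("mm-hm", 2300, "agent"),  # backchannel from the other side
]

naive_anchor = max(tok.end_ms for tok in tokens)  # all sound treated alike
content_anchor = last_content_end_ms(tokens)      # last real user word
print(naive_anchor, content_anchor)
```

Here the naive anchor sits at 2300 ms, while the content-based anchor sits at 1400 ms, so a silence timer started from the latter would fire 900 ms sooner.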

Combining silence with linguistic context

This is the distinction people most often skip, and skipping it does the most damage.

Acoustic silence tells you the sound stopped, while semantic completeness tells you the thought stopped. An endpointer that measures only the first cuts people off whenever they pause mid-thought.

"My account number is four, seven, two..." has a clear acoustic pause after "two," but no human would think you were finished: the sentence is grammatically and semantically open. A model that understands the words can tell that a digit string is mid-stream, or that "I want to..." demands a verb, and hold the turn open through a pause that a silence timer would have cut.^[7]^[8] This is why the field has moved from fixed timeouts toward semantic endpointing, covered in the aside below.

flowchart LR
    A[Audio in] --> B[VAD speech<br/>or silence]
    B --> C[Silence timer]
    B --> D[Semantic check]
    C --> E{End of turn}
    D --> E
    E -->|yes| F[Finalize turn]
    E -->|no| B

Endpoint detection pipeline: audio feeds a silence timer and a semantic check, and the turn finalizes only when both agree.

Voice activity detection sits upstream of the endpoint decision rather than replacing it: the endpointer consumes VAD's speech-or-silence stream and adds timing and, increasingly, the words themselves.^[9]^[10] The three-way comparison with turn detection lives in VAD vs endpointing vs turn detection.
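The combination can be sketched as a decision function that takes the running transcript and the current silence duration. This is a toy under loud assumptions: `looks_complete` is a word-list heuristic standing in for a real language model, the word lists and thresholds are invented for illustration, and the long fallback timeout is an addition of this sketch so that a turn cannot stay open forever.

```python
DANGLING = {"to", "and", "or", "but", "the", "a", "is", "my", "for", "of"}
DIGITS = {"zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"}


def looks_complete(transcript: str) -> bool:
    """Toy completeness check: a turn ending in a dangling function word
    or a spoken digit is treated as still open."""
    words = transcript.lower().replace(",", " ").replace(".", " ").split()
    if not words:
        return False
    return words[-1] not in DANGLING and words[-1] not in DIGITS


def should_finalize(transcript: str, silence_ms: int,
                    short_ms: int = 400, long_ms: int = 2000) -> bool:
    """Finalize when silence and semantics agree, or when the fallback expires."""
    if silence_ms >= long_ms:
        return True  # assumed safety net: never hold the turn open indefinitely
    return silence_ms >= short_ms and looks_complete(transcript)


print(should_finalize("My account number is four, seven, two...", 700))  # held open
print(should_finalize("What is my balance", 700))                        # finalized
print(should_finalize("I want to", 2500))                                # fallback fires
```

The digit rule is deliberately crude: it would also hold open a finished sentence that happens to end in a digit, which is the kind of case a model trained on real dialog is meant to handle better than a word list.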

Endpoint detection in telephony audio

Phone audio is sampled at 8 kHz and band-limited to roughly 300 to 3400 Hz, the old telephone passband. That throws away the high-frequency energy that distinguishes a soft trailing consonant from background hiss, so the boundary between "still talking quietly" and "stopped" gets blurry right where you need it sharp.^[11]

What went wrong: a VAD and endpointer tuned on clean 16 kHz microphone audio inherit a quieter, narrower signal over the phone. Line noise, comfort noise that carriers inject into silent stretches, and codec artifacts all push the silence detector toward false readings.^[12] Endpointing that works in a demo booth can misfire on a real call, finalizing on noise or holding through quiet speech, which is why telephony deployments almost always need their own thresholds.
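One common way to avoid hard-coding a level is to track the noise floor and threshold relative to it. The sketch below is one such approach under stated assumptions, not a description of any specific product: frame levels arrive in dB, the first frame is noise, and the margin and smoothing values are illustrative.

```python
def adaptive_vad(levels_db, margin_db=6.0, alpha=0.1):
    """Label a frame as speech when it exceeds a tracked noise floor by margin_db.

    The floor is an exponential moving average updated only on frames
    labeled non-speech, so speech does not drag the floor upward.
    """
    floor = levels_db[0]  # assumption: the call starts with noise, not speech
    labels = []
    for level in levels_db:
        is_speech = level > floor + margin_db
        if not is_speech:
            floor = (1 - alpha) * floor + alpha * level
        labels.append(is_speech)
    return labels


# Illustrative phone call: comfort noise near -50 dB, quiet speech near -40 dB.
levels = [-50.0] * 10 + [-40.0] * 5 + [-50.0] * 10

fixed_clean = [lvl > -35.0 for lvl in levels]  # threshold tuned on clean mic audio
fixed_low = [lvl > -55.0 for lvl in levels]    # threshold lowered to catch quiet speech
adaptive = adaptive_vad(levels)

print(sum(fixed_clean), sum(fixed_low), sum(adaptive))
```

On this toy signal the clean-audio threshold hears no speech at all, the lowered threshold labels every frame of comfort noise as speech, and the floor-relative detector picks out only the five speech frames.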

Interaction with barge-in

In a real conversation you interrupt. You start answering before the agent finishes its question, and a good system stops talking and listens. That is barge-in, and it pressures endpointing from the other direction: the system has to detect that you started, decide where your new turn ends, and not mistake the tail of its own playback for your speech.

What went wrong: without echo handling, the agent's own audio leaks into the microphone, the endpointer sees "speech," and the turn logic gets confused about who is talking.^[15] The interaction between interruption and turn boundaries is its own topic, handled in turn-taking and barge-in.
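A crude illustration of the idea, explicitly not a substitute for real acoustic echo cancellation: while the agent is playing audio, require the microphone level to exceed the echo you would expect from that playback. The function name, the assumed echo loss, and the margins are all hypothetical values chosen for the example.

```python
from typing import Optional


def user_is_speaking(mic_db: float, playback_db: Optional[float],
                     echo_loss_db: float = 20.0, margin_db: float = 6.0,
                     noise_gate_db: float = -45.0) -> bool:
    """Level-based gate that tries not to mistake the agent's echo for the user.

    playback_db is the level of the agent's outgoing audio, or None when
    the agent is silent. echo_loss_db is the assumed attenuation between
    the loudspeaker and the microphone.
    """
    if playback_db is None:
        return mic_db > noise_gate_db
    expected_echo_db = playback_db - echo_loss_db
    return mic_db > max(expected_echo_db + margin_db, noise_gate_db)


print(user_is_speaking(mic_db=-30.0, playback_db=-10.0))  # echo only
print(user_is_speaking(mic_db=-15.0, playback_db=-10.0))  # user talks over the agent
print(user_is_speaking(mic_db=-30.0, playback_db=None))   # agent silent, user speaks
```

The first call returns False because -30 dB is what the echo alone would produce, and the other two return True. The same -30 dB reading counts as user speech once playback stops, which is why the endpointer needs to know what the agent is currently playing.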

Common questions

Is endpoint detection the same as voice activity detection?

No. Voice activity detection labels each chunk of audio as speech or non-speech. Endpoint detection uses that signal, plus timing and often the recognized words, to decide that the speaker's turn is finished. VAD is a building block; endpointing is the decision built on top of it. The two are compared directly in VAD vs endpointing vs turn detection.

Why does my voice assistant cut me off when I pause?

Its silence threshold is shorter than your natural mid-sentence pause. Pauses for word retrieval commonly run past 600 ms, and a system tuned tighter than that reads the gap as the end of your turn. A semantic endpointer that checks whether the sentence is complete will hold the turn open through such pauses instead of finalizing on silence alone.

What is a good silence timeout for endpointing?

There is no universal number, because the right value depends on whether you optimize for never interrupting or for fast response. Fixed timeouts in the few-hundred-millisecond to roughly one-second range are common starting points, but a single threshold always trades one failure for the other. Model-based endpointing reduces the dependence on any one number by reading the content of the utterance.

Does endpointing work over the phone?

It works, but it is harder. Telephone audio is 8 kHz and band-limited to about 300 to 3400 Hz, which removes high-frequency cues and blurs the boundary between quiet speech and silence. Carrier comfort noise and codec artifacts add false signals, so telephony deployments typically need silence thresholds and VAD settings tuned separately from clean microphone audio.

Can I just wait longer to be safe?

You can, and you will stop cutting people off, but every extra millisecond of waiting adds directly to your response latency. A turn that always waits a full second before responding feels sluggish even when everything else is fast. The goal is to finalize as early as you safely can, which is why semantic signals beat a longer timer.

References

Chang, S.-Y., Li, B., Sainath, T. N., et al. (2017). Endpoint Detection Using Grid Long Short-Term Memory Networks for Streaming Speech Recognition. Interspeech 2017.
Campione, E., & Véronis, J. (2002). A Large-Scale Multilingual Study of Silent Pause Duration. Speech Prosody 2002 (ISCA).
Huensch, A. (2023). Effects of Speaking Task and Proficiency on the Midclause Pausing Characteristics of L1 and L2 Speech from the Same Speakers. Studies in Second Language Acquisition, 45(4).
Betz, S., Bryhadyr, N., Türk, O., & Wagner, P. (2023). Cognitive Load Increases Spoken and Gestural Hesitation Frequency. Languages, 8(1), 71.
Lee, J., Huber, J., Jenkins, J., & Fredrick, J. (2019). Language Planning and Pauses in Story Retell: Evidence from Aging and Parkinson's Disease. Journal of Communication Disorders, 79.
Heldner, M., & Edlund, J. (2010). Pauses, Gaps and Overlaps in Conversations. Journal of Phonetics, 38(4).
de Ruiter, J. P., Mitterer, H., & Enfield, N. J. (2006). Projecting the End of a Speaker's Turn: A Cognitive Cornerstone of Conversation. Language, 82(3).
Ekstedt, E., & Skantze, G. (2020). TurnGPT: A Transformer-Based Language Model for Predicting Turn-Taking in Spoken Dialog. Findings of the Association for Computational Linguistics: EMNLP 2020.
Sohn, J., Kim, N. S., & Sung, W. (1999). A Statistical Model-Based Voice Activity Detection. IEEE Signal Processing Letters, 6(1).
Chang, S.-Y., Prabhavalkar, R., He, Y., et al. (2019). Joint Endpointing and Decoding with End-to-End Models. ICASSP 2019 (IEEE).
Bauer, P., Scheler, D., & Fingscheidt, T. (2010). WTIMIT: The TIMIT Speech Corpus Transmitted Over the 3G AMR Wideband Mobile Network. Proceedings of LREC 2010.
Benyassine, A., Shlomot, E., Su, H.-Y., et al. (1997). ITU-T Recommendation G.729 Annex B: A Silence Compression Scheme for Use with G.729 Optimized for V.70 Digital Simultaneous Voice and Data Applications. IEEE Communications Magazine, 35(9).
Shannon, M., Simko, G., Chang, S.-Y., & Parada, C. (2017). Improved End-of-Query Detection for Streaming Speech Recognition. Interspeech 2017.
Soniox (2026). Endpoint Detection. Soniox Docs.
Ström, N., & Seneff, S. (2000). Intelligent Barge-In in Conversational Systems. ICSLP 2000.
Stivers, T., Enfield, N. J., Brown, P., et al. (2009). Universals and Cultural Variation in Turn-Taking in Conversation. Proceedings of the National Academy of Sciences, 106(26).