VAD vs endpoint detection vs turn detection: stop confusing them

If you read enough product docs, you will see all three terms used as if they meant the same thing. An engineer says "the VAD cut me off," when an endpointer did. A product labeled "turn detection" turns out to run a silence timer. The words have drifted into one fuzzy term that means "the thing that decides when the machine talks."

That confusion is the source of a remarkable number of bugs, because the three jobs fail in different ways and have different fixes. Pull them apart and most "the assistant interrupts me" problems become obvious.

VAD

Voice activity detection is the lowest-level question. Frame by frame, usually every 10 to 30 milliseconds, it answers a binary: speech, or not speech.^[1]^[2] It does not know what was said, and it has no opinion on whether the speaker is finished. It separates voice from silence, noise, music, and the hum of an air conditioner.

VAD is old and cheap. Classic implementations measured energy and zero-crossing rate, modern ones use small neural networks, but the output is the same yes/no stream.^[3]^[2] Everything above it depends on it: if VAD reports "speech" during a fan's drone, the layers above inherit the mistake. The dedicated page is voice activity detection.
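To make the "classic" version concrete, here is a minimal sketch of an energy plus zero-crossing-rate VAD. The function names, the 20 ms frame size, and both thresholds are illustrative assumptions rather than values from any cited system, and the rule is deliberately crude: it would miss unvoiced sounds such as "s", which have a high zero-crossing rate.

```python
import math

def frame_features(frame):
    """Return (rms_energy, zero_crossing_rate) for one frame of float samples."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return rms, crossings / (len(frame) - 1)

def classic_vad(samples, sample_rate=16000, frame_ms=20,
                energy_threshold=0.02, max_zcr=0.25):
    """Label each full frame True (speech) or False (not speech).

    Both thresholds are assumed, untuned values chosen for the demo below.
    """
    frame_len = sample_rate * frame_ms // 1000
    decisions = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        rms, zcr = frame_features(samples[start:start + frame_len])
        # Speech-like: loud enough, and not dominated by rapid sign flips.
        decisions.append(rms > energy_threshold and zcr < max_zcr)
    return decisions

# Synthetic check: 100 ms each of silence, a 200 Hz tone standing in for
# voiced speech, and a loud buzz that flips sign on every sample.
rate = 16000
n = rate // 10
silence = [0.0] * n
tone = [0.5 * math.sin(2 * math.pi * 200 * i / rate) for i in range(n)]
buzz = [0.1 if i % 2 == 0 else -0.1 for i in range(n)]
decisions = classic_vad(silence + tone + buzz, sample_rate=rate)
print(decisions)
```

The buzz is the interesting case: it is loud enough to pass the energy test, and only the zero-crossing check rejects it. Whatever the internals, the output is one boolean per frame, which is all the layers above ever see.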

Endpoint detection

Endpoint detection sits one level up and asks a temporal question about a single speaker: has this person finished their utterance, so we can finalize the transcript and act on it?^[4]^[5]

This is not the same as "is there speech right now." A speaker pauses in the middle of a sentence to think, and VAD correctly reports silence, but the utterance is not over. A good endpointer has to tell a thinking pause from a finished thought, which is why the better ones read the words so far rather than just the silence.^[6] If you declare the endpoint too early, you talk over the user; if too late, you leave dead air. The full treatment, with the failure cases, is in endpoint detection.

In a layered pipeline the dependency runs one way: endpointing consumes VAD output.^[7] You cannot decide a speaker is done without first knowing when they were speaking, and knowing they went quiet is not enough to know they are finished.
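A rough sketch of that dependency: the endpointer below takes the per-frame VAD booleans and fires after a run of silence, and a second helper stretches the required silence when the partial transcript looks unfinished. The names `find_endpoint` and `required_silence_ms`, the 500 ms and 1500 ms values, and the word list are all hypothetical. The word list in particular is a crude stand-in for the language-model signals that better endpointers use.

```python
# Hypothetical list of words that rarely end a finished sentence.
UNFINISHED_ENDINGS = {"and", "but", "to", "the", "um", "uh", "so", "because"}

def find_endpoint(vad_frames, frame_ms=20, min_silence_ms=500):
    """Return the frame index where the utterance is declared over, or None.

    Fires once speech has been heard and `min_silence_ms` of consecutive
    non-speech frames has then accumulated.
    """
    needed = min_silence_ms // frame_ms
    heard_speech = False
    silent_run = 0
    for i, is_speech in enumerate(vad_frames):
        if is_speech:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= needed:
                return i
    return None

def required_silence_ms(partial_transcript, base_ms=500, extended_ms=1500):
    """Ask for a longer silence when the last word suggests more is coming."""
    words = partial_transcript.lower().split()
    if words and words[-1].strip(",.") in UNFINISHED_ENDINGS:
        return extended_ms
    return base_ms

# 200 ms of speech, a 600 ms thinking pause, 200 ms more speech, then silence.
frames = [True] * 10 + [False] * 30 + [True] * 10 + [False] * 80

silence_only = find_endpoint(frames, min_silence_ms=500)
text_aware = find_endpoint(
    frames, min_silence_ms=required_silence_ms("I want to book a flight to")
)
print(silence_only, text_aware)
```

With silence alone, the endpoint lands inside the thinking pause and the user is cut off. With the transcript hint, the same VAD stream yields an endpoint only after the second burst of speech. A real system would re-evaluate the hint as new words arrive instead of fixing it once.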

Turn detection

Turn detection is the conversational question, and the one most tangled up with the other two. It governs the exchange between participants: should the machine start speaking now, keep listening, or stop talking because the human just barged in?^[8]^[9]

Turn detection is broader than endpointing because it is about the dialogue rather than one utterance. It handles the human finishing (which uses endpointing), the human interrupting mid-reply (barge-in)^[10], and backchannels like "mm-hm" that signal "keep going" rather than "your turn."^[11] An endpointer that fires perfectly can still produce terrible turn-taking if the system treats every "uh-huh" as a bid for the floor. This layer is where a voice agent feels polite or rude, and it is covered in turn-taking and barge-in.
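As an illustration only, the three cases can be written as one small decision function. Everything here is assumed for the sketch: the `Action` names, the `decide_turn` signature, and the backchannel list. Production systems typically use trained models for this, not a lookup.

```python
from enum import Enum

class Action(Enum):
    START_SPEAKING = "start_speaking"
    KEEP_LISTENING = "keep_listening"
    KEEP_SPEAKING = "keep_speaking"
    STOP_SPEAKING = "stop_speaking"

# Hypothetical list; real backchannels depend on language and speaker.
BACKCHANNELS = {"mm-hm", "uh-huh", "yeah", "right", "ok", "okay"}

def decide_turn(agent_speaking, user_speaking, endpoint_fired, user_words=""):
    """Pick the agent's next action from the lower layers' outputs.

    agent_speaking: the agent is currently playing audio.
    user_speaking:  VAD currently reports user speech.
    endpoint_fired: the endpointer has declared the user's utterance over.
    user_words:     partial transcript of what the user just said.
    """
    text = user_words.lower().strip(" .,!?")
    if agent_speaking:
        # Barge-in: yield only to recognized words that are not a backchannel.
        # Waiting for words is an assumption that trades latency for fewer
        # false interruptions.
        if user_speaking and text and text not in BACKCHANNELS:
            return Action.STOP_SPEAKING
        return Action.KEEP_SPEAKING
    if endpoint_fired:
        return Action.START_SPEAKING
    return Action.KEEP_LISTENING

print(decide_turn(True, True, False, "Uh-huh."))
print(decide_turn(True, True, False, "wait, stop"))
print(decide_turn(False, False, True, "book a flight"))
```

Note that `endpoint_fired` is only one of the inputs. The same "uh-huh" that should be ignored while the agent is talking can be a complete answer when the agent has just asked a yes/no question, which is why this decision belongs to the dialogue layer and not to the endpointer.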

Audio frames → VAD (speech or not?) → Endpoint detection (is this speaker done?) → Turn detection (whose turn now?) → Agent speaks, keeps listening, or stops

The three jobs are layered, not interchangeable. Each one consumes the answer below it and asks a larger question.

Comparison of the three tasks

If you remember nothing else, remember this table.

	VAD	Endpoint detection	Turn detection
Question	Is this speech?	Is the speaker done?	Whose turn is it?
Scope	One audio frame	One utterance	The whole dialogue
Time scale	10-30 ms	Hundreds of ms	The conversation
Needs the words?	No	Helps a lot	Yes
Fails as	Noise counted as speech	Cutting off or dead air	Talking over, or freezing

The progression is the point: each column builds on the one to its left. VAD knows nothing about utterances, endpointing knows nothing about who should speak next, and turn detection needs both underneath it to work. A vendor that calls all three "VAD" has collapsed three distinct jobs into one word.

Diagnosing common failures

A telephony deployment shows all three failing at once. The 8 kHz phone codec and carrier comfort noise push VAD toward false positives^[13]^[14], so the layer meant to detect speech keeps mislabeling line noise. That corrupts endpointing, which now thinks the speaker is active during noise and never finalizes. And that wrecks turn detection, which cannot decide it is the agent's turn because the human "never stopped." One root cause produces three symptoms, and a team that does not separate the layers will spend a week tuning the wrong one.
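The cascade is easy to reproduce on paper. The sketch below assumes a silence-timer endpointer that needs 25 quiet frames (500 ms at 20 ms per frame) and a made-up noise pattern in which line noise is mislabeled as speech once every 20 frames. Both numbers are hypothetical, picked only to show the mechanism.

```python
def endpoint_index(vad_frames, frames_of_silence_needed=25):
    """Frame index where a silence-timer endpointer fires, or None if never."""
    heard_speech, silent_run = False, 0
    for i, is_speech in enumerate(vad_frames):
        if is_speech:
            heard_speech, silent_run = True, 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= frames_of_silence_needed:
                return i
    return None

def agent_may_take_turn(vad_frames):
    """Simplified turn rule: the agent may speak only after an endpoint."""
    return endpoint_index(vad_frames) is not None

# One second of speech followed by two seconds in which the caller is silent.
clean_line = [True] * 50 + [False] * 100
# Same call, but one frame in every 20 of the silence is a VAD false positive.
noisy_line = [True] * 50 + [(i % 20 == 0) for i in range(100)]

print(endpoint_index(clean_line), agent_may_take_turn(clean_line))
print(endpoint_index(noisy_line), agent_may_take_turn(noisy_line))
```

On the noisy line the longest quiet run is 19 frames, so the timer keeps resetting, the endpoint never fires, and the turn rule never releases the floor. Raising the endpointer's patience or rewriting the turn logic would not help here; the fix is in the VAD layer.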

When the machine cuts you off, suspect the endpointer. When it responds to a cough or a slammed door, that is VAD letting noise through. And when it talks over you or refuses to yield when you interrupt, look at turn detection. Name the layer before you tune it.

Common questions

Is VAD the same as endpoint detection?

No. VAD labels each short frame of audio as speech or non-speech. Endpoint detection uses that stream, plus timing and often the recognized words, to decide that a speaker has finished an utterance. VAD is a building block; endpointing is a decision built on top of it. A pause produces "non-speech" from VAD without meaning the utterance is over.

Is endpoint detection the same as turn detection?

They overlap but are not the same. Endpoint detection is about one speaker finishing one utterance. Turn detection is about the whole conversation: when the machine should start, keep listening, or stop because it was interrupted. Endpointing is one input to turn detection, which also handles barge-in and backchannels.

Which one causes a voice assistant to interrupt me?

Almost always endpoint detection (or a turn-detection layer that is really just a silence timer). It decided your turn was over during a natural pause. Better endpointers read the partial transcript to tell an unfinished sentence from a finished one, instead of finalizing on silence alone.

Do I need all three?

For simple transcription, VAD plus endpointing is usually enough: you only need to know when speech is present and when an utterance ends. Full turn detection matters when the machine talks back in real time, like a voice agent, where interruption and yielding the floor become part of the experience.

References

[1] Sohn, J., Kim, N. S., & Sung, W. (1999). A Statistical Model-Based Voice Activity Detection. IEEE Signal Processing Letters, 6(1).
[2] Hughes, T., & Mierle, K. (2013). Recurrent Neural Networks for Voice Activity Detection. ICASSP 2013 (IEEE).
[3] Graf, S., Herbig, T., Buck, M., & Schmidt, G. (2015). Features for Voice Activity Detection: A Comparative Analysis. EURASIP Journal on Advances in Signal Processing, 2015:91.
[4] Chang, S.-Y., Li, B., Sainath, T. N., et al. (2017). Endpoint Detection Using Grid Long Short-Term Memory Networks for Streaming Speech Recognition. Interspeech 2017.
[5] Soniox (2026). Endpoint Detection. Soniox Docs.
[6] Ekstedt, E., & Skantze, G. (2020). TurnGPT: A Transformer-Based Language Model for Predicting Turn-Taking in Spoken Dialog. Findings of the Association for Computational Linguistics: EMNLP 2020.
[7] Chang, S.-Y., Prabhavalkar, R., He, Y., et al. (2019). Joint Endpointing and Decoding with End-to-End Models. ICASSP 2019 (IEEE).
[8] Sacks, H., Schegloff, E. A., & Jefferson, G. (1974). A Simplest Systematics for the Organization of Turn-Taking for Conversation. Language, 50(4).
[9] Skantze, G. (2021). Turn-Taking in Conversational Systems and Human-Robot Interaction: A Review. Computer Speech & Language, 67.
[10] Ström, N., & Seneff, S. (2000). Intelligent Barge-In in Conversational Systems. ICSLP 2000.
[11] Ruede, R., Müller, M., Stüker, S., & Waibel, A. (2017). Yeah, Right, Uh-Huh: A Deep Learning Backchannel Predictor. IWSDS 2017.
[12] Shannon, M., Simko, G., Chang, S.-Y., & Parada, C. (2017). Improved End-of-Query Detection for Streaming Speech Recognition. Interspeech 2017.
[13] Bauer, P., Scheler, D., & Fingscheidt, T. (2010). WTIMIT: The TIMIT Speech Corpus Transmitted Over the 3G AMR Wideband Mobile Network. Proceedings of LREC 2010.
[14] Benyassine, A., Shlomot, E., Su, H.-Y., et al. (1997). ITU-T Recommendation G.729 Annex B: A Silence Compression Scheme for Use with G.729 Optimized for V.70 Digital Simultaneous Voice and Data Applications. IEEE Communications Magazine, 35(9).