What is voice activity detection (VAD) and how does it work?

Each direction of a phone line is silent roughly 60 percent of the time during a call, because conversation is mostly one side waiting for the other.^[1] Telephone engineers noticed this decades ago and built equipment that detected the silence and stopped transmitting during it; that equipment is the ancestor of every VAD running today.^[2] The problem has not changed: before you can do anything intelligent with audio, you have to know when the audio is worth listening to.

VAD answers that question. It is easy to describe and genuinely hard to do well.

Purpose of voice activity detection

The output of a VAD is small, a stream of speech and not-speech labels, but a lot hangs off it.

It gates the expensive work. Running a full recognizer on silence wastes compute and can invent words out of noise^[3], so VAD decides what the recognizer even sees. It saves bandwidth, which is why it was born in telephony: do not transmit frames that carry no speech. It segments audio into utterances for batch processing. And it feeds the timing decisions above it, including endpoint detection, which uses VAD's speech and silence signal as one input among several.^[4]
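The gating role can be sketched in a few lines. This is a toy illustration, not any particular system's pipeline: `gate` and `toy_detector` are hypothetical names, and the peak-amplitude check only stands in for a real detector.

```python
def gate(frames, is_speech):
    """Yield only the frames the detector labels as speech."""
    for frame in frames:
        if is_speech(frame):
            yield frame

def toy_detector(frame):
    # Assumption for the example: any sample above 0.1 counts as "speech".
    return max(abs(sample) for sample in frame) > 0.1

frames = [
    [0.0, 0.0, 0.0, 0.0],
    [0.2, -0.3, 0.25, -0.1],
    [0.01, 0.0, -0.01, 0.0],
    [0.4, -0.5, 0.3, -0.2],
    [0.0, 0.0, 0.0, 0.0],
]

kept = list(gate(frames, toy_detector))   # what the recognizer would see
skipped = len(frames) - len(kept)         # frames never sent or transcribed
print(f"kept {len(kept)} frames, skipped {skipped}")
```

Everything downstream only ever receives `kept`, which is why a detector mistake at this stage cannot be repaired later.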

VAD answers "is there speech in this frame?" It does not answer "is the speaker finished?" The first is a per-frame acoustic question; the second is a turn-level decision built on top of it.^[5]^[6] The two are pulled apart in detail in VAD vs endpointing vs turn detection.
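To make the split concrete, here is a deliberately naive endpointer layered on per-frame VAD labels. It is a sketch under stated assumptions: the function name, the 20 ms frame size, and the silence timeouts are all illustrative, and a trailing-silence timer is only the simplest possible turn-level rule, not how any specific product decides.

```python
def first_endpoint(vad_labels, frame_ms=20, min_silence_ms=600):
    """Return the index of the frame where the turn is declared finished.

    The turn ends once speech has been heard and is followed by at least
    min_silence_ms of consecutive non-speech frames. Returns None if that
    never happens.
    """
    needed = min_silence_ms // frame_ms
    heard_speech = False
    silent_run = 0
    for i, is_speech in enumerate(vad_labels):
        if is_speech:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= needed:
                return i
    return None

# Speech, a 200 ms mid-sentence pause, more speech, then real silence.
labels = [False] * 5 + [True] * 10 + [False] * 10 + [True] * 10 + [False] * 40

patient = first_endpoint(labels, min_silence_ms=600)
hasty = first_endpoint(labels, min_silence_ms=200)
print(patient, hasty)
```

The VAD labels are identical in both calls. Only the turn-level rule changes, and the shorter timeout cuts the speaker off during the pause, which is the distinction between the two questions.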

Voice activity detection methods

The simplest VAD is a volume gate: measure the energy in each 10-to-30-millisecond frame, and call it speech if the energy is above a threshold. This works in a quiet room and fails the moment there is a fan or a second person, because loud audio is not the same as speech.^[7]
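A volume gate of this kind can be sketched as follows. The 20 ms frame size, the 16 kHz sample rate, and the -35 dB threshold are assumptions chosen for the example, and samples are assumed to be floats in the range -1.0 to 1.0.

```python
import math

def frame_energy_db(frame):
    """Mean-square energy of one frame in dB relative to full scale (1.0)."""
    mean_square = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(mean_square) if mean_square > 0 else -math.inf

def energy_vad(samples, sample_rate=16000, frame_ms=20, threshold_db=-35.0):
    """Label each non-overlapping frame True (speech) or False (not speech)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    labels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        labels.append(frame_energy_db(frame) > threshold_db)
    return labels

def sine(freq_hz, amplitude, n_samples, sample_rate=16000):
    """Synthetic test signal; a stand-in for real audio."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(n_samples)]

quiet = sine(200, 0.001, 1600)   # 100 ms at roughly -63 dB
loud = sine(200, 0.5, 1600)      # 100 ms at roughly -9 dB
labels = energy_vad(quiet + loud)
print(labels)
```

The loud half passes the gate even though a pure 200 Hz tone is not speech, which is exactly the weakness described above: the gate measures loudness and nothing else.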

Better VADs look at the shape of the sound rather than its loudness alone. Speech has structure that noise usually lacks: energy concentrated in certain frequency bands, a pitch that rises and falls, the rhythm of syllables, the brief silences of stop consonants. Classic detectors hand-built features for these (zero-crossing rate, spectral flatness, band energies) and combined them with simple statistical models.^[8]^[9] Modern detectors learn the distinction from data with a small neural network, which is far more resilient to noise because it has heard noise before.^[6]
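Two of the hand-built features named above can be computed with the standard library alone. This is a minimal sketch: the naive DFT is only practical for a single short frame, and the synthetic signals are crude stand-ins (a steady tone for a pitched, voiced sound and uniform random samples for broadband hiss), not real recordings.

```python
import math
import random

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def power_spectrum(frame):
    """Naive DFT power for bins 1 to N/2 - 1 (skips the DC bin)."""
    n = len(frame)
    powers = []
    for k in range(1, n // 2):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        im = sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        powers.append(re * re + im * im)
    return powers

def spectral_flatness(frame, eps=1e-12):
    """Geometric mean over arithmetic mean of the power spectrum.

    Near 0 when energy sits in a few bins, closer to 1 when it is spread evenly.
    """
    powers = [p + eps for p in power_spectrum(frame)]
    log_mean = sum(math.log(p) for p in powers) / len(powers)
    return math.exp(log_mean) / (sum(powers) / len(powers))

rng = random.Random(0)  # fixed seed so the example is repeatable
tone = [0.5 * math.sin(2 * math.pi * 200 * i / 16000) for i in range(160)]
noise = [rng.uniform(-0.5, 0.5) for _ in range(160)]

for name, frame in (("tone", tone), ("noise", noise)):
    print(name, zero_crossing_rate(frame), spectral_flatness(frame))
```

Both frames are about equally loud, yet the two features separate them clearly. A classic detector would feed numbers like these into a statistical model rather than thresholding energy directly.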

flowchart LR
    A[Audio frames] --> B[VAD]
    B -->|speech| C[Recognizer]
    B -->|silence| D[Skip / save]
    C --> E[Endpointing]

VAD sits at the front, gating everything downstream. It classifies frames; it does not decide turns.

Every VAD faces the trade-off common to all detectors: false accepts versus false rejects.^[7] If you tune it to never miss speech, it will pass through coughs, keyboard clicks, door slams, and background chatter. If you tune it to reject all that noise, it will start clipping the quiet ends of real words. The right setting depends on what sits downstream and what a mistake costs there.
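The trade-off shows up as soon as you sweep a decision threshold over per-frame scores. The scores and ground-truth labels below are made up for illustration, and `error_rates` is a hypothetical helper that assumes both speech and non-speech frames are present.

```python
def error_rates(scores, truth, threshold):
    """Return (false_accept_rate, false_reject_rate) at one threshold.

    A frame is accepted as speech when its score is >= threshold.
    """
    false_accepts = sum(1 for s, t in zip(scores, truth) if s >= threshold and not t)
    false_rejects = sum(1 for s, t in zip(scores, truth) if s < threshold and t)
    n_nonspeech = sum(1 for t in truth if not t)
    n_speech = sum(1 for t in truth if t)
    return false_accepts / n_nonspeech, false_rejects / n_speech

# Hypothetical detector scores: six non-speech frames, then six speech frames.
scores = [0.05, 0.10, 0.30, 0.45, 0.20, 0.55, 0.35, 0.60, 0.80, 0.90, 0.70, 0.40]
truth = [False] * 6 + [True] * 6

for threshold in (0.25, 0.50, 0.75):
    fa, fr = error_rates(scores, truth, threshold)
    print(f"threshold={threshold:.2f}  false accepts={fa:.2f}  false rejects={fr:.2f}")
```

Raising the threshold drives false accepts down and false rejects up. No setting removes both, so the choice is about which mistake is cheaper downstream.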

Sources of VAD error

The failures cluster in predictable places, mostly about the gap between "loud" and "speech."

Low signal-to-noise audio is the classic problem: a quiet talker in a noisy car, where the noise is as loud as the voice.^[7] Far-field audio, captured by a microphone across a room, arrives faint and smeared with reflections.^[10] Babble noise, a room full of other people talking, is the cruelest case, because the background is speech, just not the speech you care about, and a VAD trained to detect "speech" has no reason to reject it.^[11] Music with vocals fools detectors for the same reason.^[12]

Then there is the boundary problem. Speech does not start and stop cleanly. A word can begin with a soft consonant that rises slowly out of the noise floor, and a VAD with a high threshold will clip that onset, costing the recognizer the first phoneme. Trailing sounds fade out the same way.^[7] The edges of speech are where VAD is least certain and its errors most audible.

VAD was invented to save phone bandwidth

The pressure to detect silence came from cost, not intelligence. Digital telephone networks in the 1990s needed to carry more calls over the same lines, and since each side of a conversation is quiet much of the time, a system that stopped transmitting during silence could pack more calls together.

timeline
    title Voice activity detection, selected milestones
    1990s : Telephony VAD with discontinuous transmission (DTX) and comfort noise
    1996 : G.729 Annex B standardizes VAD for low-bitrate telephony
    2010s : WebRTC ships a lightweight open-source VAD used everywhere in real-time apps
    Late 2010s : Small neural VADs become the default for noisy, real-world audio

The WebRTC VAD, released as part of Google's open real-time communication stack, became so widely embedded in voice apps that for years it was the default meaning of "a VAD" for most developers, despite being a fairly simple model by modern standards.

Evaluation measures

If you are choosing or tuning a VAD, start with accuracy in noise, usually reported as false-accept and false-reject rates at a given signal-to-noise ratio. It tells you whether the detector can find speech in the conditions you actually have. Then look at latency, the delay between sound and decision: a VAD that needs 200 milliseconds of lookahead to be confident adds that delay to everything downstream. Finally, check boundary precision, how tightly the detector brackets the true start and end of speech, which governs whether the recognizer gets clean onsets or clipped ones.^[6]^[8]
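Boundary precision is easy to measure once you have reference labels. The sketch below assumes 10 ms frames and compares only the first speech segment in each label sequence; the function names and the sign convention are choices made for this example.

```python
def segments(labels):
    """Turn per-frame labels into (start, end_exclusive) runs of speech."""
    runs, start = [], None
    for i, is_speech in enumerate(labels):
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(labels)))
    return runs

def boundary_errors_ms(reference, hypothesis, frame_ms=10):
    """Signed onset and offset error of the first segment, in milliseconds.

    Positive onset error: speech detected late (onset clipped).
    Negative offset error: speech ended early (tail clipped).
    """
    ref_start, ref_end = segments(reference)[0]
    hyp_start, hyp_end = segments(hypothesis)[0]
    return (hyp_start - ref_start) * frame_ms, (hyp_end - ref_end) * frame_ms

reference = [0, 0, 1, 1, 1, 1, 1, 1, 0, 0]    # hand-labeled truth
hypothesis = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]   # what the detector said

onset_error, offset_error = boundary_errors_ms(reference, hypothesis)
print(onset_error, offset_error)
```

Here the detector starts one frame late and stops one frame early, so the recognizer would lose 10 ms at each edge of the word.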

A VAD that is accurate but slow, or fast but noise-blind, will degrade the system above it. So even though VAD produces only one bit per frame, choose and tune it carefully rather than dropping it in and forgetting it.

Common questions

Is voice activity detection the same as endpoint detection?

No, though vendor docs blur them constantly. VAD makes a per-frame acoustic call: speech, or not. Endpointing decides the speaker is done, using VAD's output plus timing and often the words. A long pause is "not speech" to VAD and "not finished yet" to a good endpointer, and that difference is the whole point of VAD vs endpointing vs turn detection.

Why does VAD struggle in noisy rooms?

Because loudness alone does not separate speech from noise, and the hardest noise (other people talking, music with vocals) actually is speech-like. Energy-based detectors fail first; neural detectors do better because they have learned the finer structure of speech, but a room full of background voices remains genuinely hard.

How small a slice of audio does VAD work on?

Typically 10 to 30 milliseconds per frame, short enough to react quickly but long enough to measure the structure of the sound. Shorter frames lower latency but give each decision less evidence, so the output is often smoothed over several frames so that a single noisy frame does not flip it.
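One simple way to do that smoothing is a majority vote over a short window. This is a sketch, not a recommendation: the window size is an assumption, and because the window is centered it needs a little lookahead, which adds latency.

```python
def smooth_labels(labels, window=3):
    """Majority vote over a centered window of per-frame labels.

    The window is truncated at the edges of the sequence. With window=3
    the output for a frame depends on one future frame.
    """
    half = window // 2
    smoothed = []
    for i in range(len(labels)):
        lo = max(0, i - half)
        hi = min(len(labels), i + half + 1)
        votes = labels[lo:hi]
        smoothed.append(sum(votes) * 2 > len(votes))
    return smoothed

raw = [False, False, True, False, False, False,    # one-frame false alarm
       True, True, True, False, True, True, True,  # one-frame dropout
       False, False, False]

smoothed = smooth_labels(raw)
print(smoothed)
```

The isolated false alarm disappears and the one-frame dropout inside the speech run is filled in, at the cost of deciding each frame one frame later.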

Do I still need VAD if my recognizer is good?

Usually yes, because VAD does a different job: it decides what the recognizer should spend effort on, segments audio into utterances, saves bandwidth and compute, and feeds endpointing. Even an excellent recognizer benefits from not being asked to transcribe silence, where it tends to hallucinate words.^[13]

References

[1] Brady, P. T. (1968). A Statistical Analysis of On-Off Patterns in 16 Conversations. Bell System Technical Journal, 47(1).
[2] Benyassine, A., Shlomot, E., Su, H.-Y., et al. (1997). ITU-T Recommendation G.729 Annex B: A Silence Compression Scheme for Use with G.729 Optimized for V.70 Digital Simultaneous Voice and Data Applications. IEEE Communications Magazine, 35(9).
[3] Barański, M., Jasiński, J., Bartolewska, J., et al. (2025). Investigation of Whisper ASR Hallucinations Induced by Non-Speech Audio. ICASSP 2025 (IEEE).
[4] Soniox (2026). Endpoint Detection. Soniox Docs.
[5] Sohn, J., Kim, N. S., & Sung, W. (1999). A Statistical Model-Based Voice Activity Detection. IEEE Signal Processing Letters, 6(1).
[6] Hughes, T., & Mierle, K. (2013). Recurrent Neural Networks for Voice Activity Detection. ICASSP 2013 (IEEE).
[7] Ramírez, J., Segura, J. C., Benítez, C., et al. (2004). Efficient Voice Activity Detection Algorithms Using Long-Term Speech Information. Speech Communication, 42(3-4).
[8] Graf, S., Herbig, T., Buck, M., & Schmidt, G. (2015). Features for Voice Activity Detection: A Comparative Analysis. EURASIP Journal on Advances in Signal Processing, 2015:91.
[9] Shin, J. W., Chang, J.-H., & Kim, N. S. (2010). Voice Activity Detection Based on Statistical Models and Machine Learning Approaches. Computer Speech & Language, 24(3).
[10] Ivry, A., Cohen, I., & Berdugo, B. (2020). Evaluation of Deep-Learning-Based Voice Activity Detectors and Room Impulse Response Models in Reverberant Environments. ICASSP 2020 (IEEE).
[11] Ma, Y., & Nishihara, A. (2013). Efficient Voice Activity Detection Algorithm Using Long-Term Spectral Flatness Measure. EURASIP Journal on Audio, Speech, and Music Processing, 2013:21.
[12] Grundhuber, P., Halimeh, M. M., Strauß, M., & Habets, E. A. P. (2025). Robust Speech Activity Detection in the Presence of Singing Voice. WASPAA 2025 (IEEE).
[13] Koenecke, A., Choi, A. S. G., Mei, K. X., et al. (2024). Careless Whisper: Speech-to-Text Hallucination Harms. ACM FAccT 2024.