What is voice activity detection (VAD)?

Detection of speech and non-speech regions in audio

Updated June 29, 2026

A phone line is silent roughly 60 percent of the time a person is "on a call," because conversation is mostly one side waiting for the other.[1] Telephone engineers noticed this decades ago and built machines to detect the silence and stop transmitting during it; those machines are the ancestors of every VAD running today.[2] The problem has not changed: before you can do anything intelligent with audio, you have to know when the audio is worth listening to.

VAD answers that question. It is easy to describe and genuinely hard to do well.

Purpose of voice activity detection

The output of a VAD is small, a stream of speech and not-speech labels, but a lot hangs off it.

It gates the expensive work. Running a full recognizer on silence wastes compute and can invent words out of noise,[3] so VAD decides what the recognizer even sees. It saves bandwidth, which is why it was born in telephony: do not transmit frames that carry no speech. It segments audio into utterances for batch processing. And it feeds the timing decisions above it, including endpoint detection, which uses VAD's speech and silence signal as one input among several.[4]
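The segmentation role can be sketched in a few lines. This is a minimal illustration, not any particular product's method: the function name `labels_to_segments` and the 20 ms frame length are assumptions made for the example. It turns a stream of per-frame speech flags into time ranges that a recognizer could be run on, skipping everything else.

```python
from typing import List, Tuple

def labels_to_segments(labels: List[bool], frame_ms: int = 20) -> List[Tuple[float, float]]:
    """Turn per-frame speech flags into (start_s, end_s) segments."""
    segments = []
    start = None  # index of the first frame of the current speech run
    for i, is_speech in enumerate(labels):
        if is_speech and start is None:
            start = i  # a speech run begins
        elif not is_speech and start is not None:
            segments.append((start * frame_ms / 1000, i * frame_ms / 1000))
            start = None  # the speech run has ended
    if start is not None:  # the audio ended while speech was still active
        segments.append((start * frame_ms / 1000, len(labels) * frame_ms / 1000))
    return segments

# Hypothetical VAD output for eight 20 ms frames.
labels = [False, True, True, True, False, False, True, True]
print(labels_to_segments(labels))
```

Only the frames inside the returned ranges would be passed downstream; the silent frames are the ones a telephony system would decline to transmit.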

VAD answers "is there speech in this frame?" It does not answer "is the speaker finished?" The first is a per-frame acoustic question; the second is a turn-level decision built on top of it.[5][6] The two are pulled apart in detail in "VAD vs endpointing vs turn detection."

Voice activity detection methods

The simplest VAD is a volume gate: measure the energy in each 10-to-30-millisecond frame, and call it speech if the energy is above a threshold. This works in a quiet room and fails the moment there is a fan, a road, or a second person, because loud audio is not the same as speech.[7]
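A volume gate of this kind might look like the sketch below. The 16 kHz sample rate, the 20 ms frame, the -40 dB threshold, and the assumption that samples are floats in [-1, 1] are all illustrative choices, not values from any standard. The test signal is a synthetic tone rather than speech, which also demonstrates the weakness: the gate passes anything loud.

```python
import math
from typing import List

def frame_energy_db(frame: List[float]) -> float:
    """Mean-square energy of one frame in dB relative to full scale (1.0)."""
    mean_sq = sum(s * s for s in frame) / len(frame)
    return 10 * math.log10(mean_sq + 1e-12)  # epsilon avoids log10(0)

def energy_vad(samples: List[float], sample_rate: int = 16000,
               frame_ms: int = 20, threshold_db: float = -40.0) -> List[bool]:
    """Flag each non-overlapping frame whose energy exceeds the threshold."""
    frame_len = sample_rate * frame_ms // 1000
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        flags.append(frame_energy_db(frame) > threshold_db)
    return flags

# Synthetic input: 40 ms of a very quiet 200 Hz tone, then 40 ms of a loud one.
sr = 16000
quiet = [0.001 * math.sin(2 * math.pi * 200 * n / sr) for n in range(640)]
loud = [0.5 * math.sin(2 * math.pi * 200 * n / sr) for n in range(640)]
flags = energy_vad(quiet + loud, sr)
print(flags)
```

The loud tone is flagged even though it contains no speech at all, which is the same mistake the gate makes with a fan or road noise.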

Better VADs look at the shape of the sound rather than its loudness alone. Speech has structure that noise usually lacks: energy concentrated in certain frequency bands, a pitch that rises and falls, the rhythm of syllables, the brief silences of stop consonants. Classic detectors hand-built features for these (zero-crossing rate, spectral flatness, band energies) and combined them with simple statistical models.[8][9] Modern detectors learn the distinction from data with a small neural network, which is far more resilient to noise because it has heard noise before.[6]
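Two of the hand-built features named above can be computed with nothing but the standard library. The sketch below is illustrative only: it uses a naive O(n²) DFT for clarity, an 8 kHz sample rate and 20 ms frame chosen for the example, and a pure tone as a crude stand-in for a strongly structured sound. A real detector would use an FFT and combine several features with a statistical model.

```python
import cmath
import math
import random
from typing import List

def zero_crossing_rate(frame: List[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def power_spectrum(frame: List[float]) -> List[float]:
    """Naive DFT power spectrum; slow, but fine for one short demo frame."""
    n = len(frame)
    spectrum = []
    for k in range(1, n // 2):  # skip the DC and Nyquist bins
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum

def spectral_flatness(frame: List[float]) -> float:
    """Geometric mean over arithmetic mean of the power spectrum, in (0, 1]."""
    power = [p + 1e-12 for p in power_spectrum(frame)]  # epsilon avoids log(0)
    log_mean = sum(math.log(p) for p in power) / len(power)
    return math.exp(log_mean) / (sum(power) / len(power))

random.seed(0)
n, sr = 160, 8000  # one 20 ms frame at 8 kHz
tone = [math.sin(2 * math.pi * 200 * t / sr) for t in range(n)]
noise = [random.uniform(-1, 1) for _ in range(n)]

for name, frame in (("tone", tone), ("noise", noise)):
    print(name, round(zero_crossing_rate(frame), 3), spectral_flatness(frame))
```

The tone concentrates its energy in one band, so its flatness is close to zero and its zero-crossing rate is low; the noise spreads energy across the spectrum and scores much higher on both.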

flowchart LR
  A[Audio frames] --> B[VAD]
  B -->|speech| C[Recognizer]
  B -->|silence| D[Skip / save]
  C --> E[Endpointing]
VAD sits at the front, gating everything downstream. It classifies frames; it does not decide turns.

Every VAD lives with the trade-off every detector lives with: false accepts versus false rejects.[7] Tune it to never miss speech and it will pass through coughs, keyboard clicks, and background chatter. Tune it to reject all that noise and it will start clipping the quiet ends of real words. The right setting depends on what sits downstream and what a mistake costs there.
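The trade-off is easy to see by sweeping a threshold over scored frames. In the sketch below the per-frame scores and the ground-truth labels are invented for illustration, and `error_rates` is a hypothetical helper, not part of any library.

```python
from typing import List, Tuple

def error_rates(scores: List[float], truth: List[bool],
                threshold: float) -> Tuple[float, float]:
    """Return (false_accept_rate, false_reject_rate) at one threshold."""
    decisions = [s >= threshold for s in scores]
    non_speech = sum(1 for t in truth if not t)
    speech = sum(1 for t in truth if t)
    false_accepts = sum(1 for d, t in zip(decisions, truth) if d and not t)
    false_rejects = sum(1 for d, t in zip(decisions, truth) if not d and t)
    return false_accepts / non_speech, false_rejects / speech

# Hypothetical speech scores in [0, 1] with hand-made ground truth.
scores = [0.05, 0.10, 0.35, 0.60, 0.30, 0.55, 0.80, 0.90, 0.45, 0.20]
truth = [False, False, False, False, True, True, True, True, True, False]

for threshold in (0.25, 0.50, 0.75):
    far, frr = error_rates(scores, truth, threshold)
    print(f"threshold={threshold:.2f}  false accepts={far:.2f}  false rejects={frr:.2f}")
```

On this toy data, raising the threshold from 0.25 to 0.75 drives the false-accept rate from 0.40 down to 0.00 while the false-reject rate climbs from 0.00 to 0.60. Neither end is right in general; the operating point has to be chosen for the downstream system.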

Sources of VAD error

The failures cluster in predictable places, mostly about the gap between "loud" and "speech."

Low signal-to-noise audio is the classic problem: a quiet talker in a noisy car, where the noise is as loud as the voice.[7] Far-field audio, captured by a microphone across a room, arrives faint and smeared with reflections.[10] Babble noise, a room full of other people talking, is the cruelest case, because the background is speech, just not the speech you care about, and a VAD trained to detect "speech" has no reason to reject it.[11] Music with vocals fools detectors for the same reason.[12]

Then there is the boundary problem. Speech does not start and stop cleanly. A word can begin with a soft consonant that rises slowly out of the noise floor, and a VAD with a high threshold will clip that onset, costing the recognizer the first phoneme. Trailing sounds fade out the same way.[7] The edges of speech are where VAD is least certain and its errors most audible.

Evaluation measures

If you are choosing or tuning a VAD, three measures describe it. Accuracy in noise, usually framed as the false-accept and false-reject rates at a given signal-to-noise ratio, tells you whether it can find speech in the conditions you have. Latency, the delay between sound and decision, matters because a VAD that needs 200 milliseconds of lookahead to be confident adds that delay to everything downstream. Boundary precision, how tightly it brackets the true start and end of speech, governs whether the recognizer gets clean onsets or clipped ones.[8][6]
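Boundary precision and lookahead delay both reduce to simple arithmetic. The helpers below are hypothetical names introduced for this sketch, and the segment times are invented; times are kept as integer milliseconds to avoid floating-point rounding.

```python
from typing import Tuple

Segment = Tuple[int, int]  # (start_ms, end_ms)

def boundary_errors_ms(reference: Segment, detected: Segment) -> Tuple[int, int]:
    """Signed onset and offset errors; positive means the detector was late."""
    onset_error = detected[0] - reference[0]
    offset_error = detected[1] - reference[1]
    return onset_error, offset_error

def lookahead_delay_ms(frame_ms: int, lookahead_frames: int) -> int:
    """Delay added by waiting for future frames before deciding."""
    return frame_ms * lookahead_frames

# Hand-labeled speech from 1200 ms to 2600 ms; the detector reported 1250 to 2580.
onset_error, offset_error = boundary_errors_ms((1200, 2600), (1250, 2580))
print(onset_error, offset_error)   # 50 -20
print(lookahead_delay_ms(20, 10))  # 200
```

A positive onset error means the start of the word was clipped; a negative offset error means the tail was cut. Ten frames of lookahead at 20 ms each reproduces the 200-millisecond delay mentioned above.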

A VAD that is accurate but slow, or fast but noise-blind, will degrade the system above it. So even though VAD produces only one bit per frame, choose and tune it carefully rather than dropping it in and forgetting it.

Common questions

Is voice activity detection the same as endpoint detection?

No. VAD classifies each frame of audio as speech or not-speech. Endpoint detection uses that signal, plus timing and often the recognized words, to decide that a speaker's turn is over. VAD is a per-frame building block; endpointing is a turn-level decision built on top of it. The two are compared directly in "VAD vs endpointing vs turn detection."
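To make the layering concrete, here is a deliberately simple endpointer built only from VAD flags and a timer. The rule (speech followed by 500 ms of continuous non-speech) and the name `find_endpoint` are assumptions for illustration; as noted above, real endpointers typically weigh additional inputs such as the recognized words.

```python
from typing import List, Optional

def find_endpoint(labels: List[bool], frame_ms: int = 20,
                  min_silence_ms: int = 500) -> Optional[int]:
    """Return the frame index at which the turn is declared over, or None.

    Illustrative rule: some speech has been heard, and it has been followed
    by at least min_silence_ms of consecutive non-speech frames.
    """
    needed = min_silence_ms // frame_ms
    heard_speech = False
    silent_run = 0
    for i, is_speech in enumerate(labels):
        if is_speech:
            heard_speech = True
            silent_run = 0  # any speech frame resets the silence timer
        else:
            silent_run += 1
            if heard_speech and silent_run >= needed:
                return i
    return None

# 200 ms of speech, a 100 ms pause, 200 ms more speech, then 600 ms of silence.
labels = [True] * 10 + [False] * 5 + [True] * 10 + [False] * 30
print(find_endpoint(labels))
```

The short mid-utterance pause does not end the turn because it never reaches the silence budget; only the long trailing silence does. The VAD supplied the per-frame flags, and the turn decision was made one layer up.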

Why does VAD struggle in noisy rooms?

Because loudness alone does not separate speech from noise, and the hardest noise (other people talking, music with vocals) actually is speech-like. Energy-based detectors fail first; neural detectors do better because they have learned the finer structure of speech, but a room full of background voices remains genuinely hard.

How small a slice of audio does VAD work on?

Typically 10 to 30 milliseconds per frame, which is short enough to react quickly but long enough to measure the structure of the sound. Smaller frames lower latency; the decision is then often smoothed over several frames so a single noisy frame does not flip the output.
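One simple form of that smoothing is a majority vote over a short window. The sketch below assumes a centered five-frame window, which is an illustrative choice; other schemes are possible.

```python
from typing import List

def smooth_majority(labels: List[bool], window: int = 5) -> List[bool]:
    """Majority vote over a centered window, truncated at the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(labels)):
        lo, hi = max(0, i - half), min(len(labels), i + half + 1)
        votes = labels[lo:hi]
        smoothed.append(sum(votes) * 2 > len(votes))  # strict majority is speech
    return smoothed

# Raw flags with one isolated false alarm (frame 2) and one dropout (frame 9).
raw = [bool(x) for x in (0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1)]
print([int(x) for x in smooth_majority(raw)])
```

The isolated spike disappears and the one-frame dropout is filled. The cost is delay: a centered window has to wait for `window // 2` future frames, so a five-frame window at 20 ms per frame adds 40 ms of lookahead.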

Do I still need VAD if my recognizer is good?

Usually yes, because VAD does a different job: it decides what the recognizer should spend effort on, segments audio into utterances, saves bandwidth and compute, and feeds endpointing. Even an excellent recognizer benefits from not being asked to transcribe silence, where it tends to hallucinate words.[13]

References

  1. Brady, P. T. (1968). A Statistical Analysis of On-Off Patterns in 16 Conversations. Bell System Technical Journal, 47(1).
  2. Benyassine, A., Shlomot, E., Su, H.-Y., et al. (1997). ITU-T Recommendation G.729 Annex B: A Silence Compression Scheme for Use with G.729 Optimized for V.70 Digital Simultaneous Voice and Data Applications. IEEE Communications Magazine, 35(9).
  3. Barański, M., Jasiński, J., Bartolewska, J., et al. (2025). Investigation of Whisper ASR Hallucinations Induced by Non-Speech Audio. ICASSP 2025 (IEEE).
  4. Soniox (2026). Endpoint Detection. Soniox Docs.
  5. Sohn, J., Kim, N. S., & Sung, W. (1999). A Statistical Model-Based Voice Activity Detection. IEEE Signal Processing Letters, 6(1).
  6. Hughes, T., & Mierle, K. (2013). Recurrent Neural Networks for Voice Activity Detection. ICASSP 2013 (IEEE).
  7. Ramírez, J., Segura, J. C., Benítez, C., et al. (2004). Efficient Voice Activity Detection Algorithms Using Long-Term Speech Information. Speech Communication, 42(3-4).
  8. Graf, S., Herbig, T., Buck, M., & Schmidt, G. (2015). Features for Voice Activity Detection: A Comparative Analysis. EURASIP Journal on Advances in Signal Processing, 2015:91.
  9. Shin, J. W., Chang, J.-H., & Kim, N. S. (2010). Voice Activity Detection Based on Statistical Models and Machine Learning Approaches. Computer Speech & Language, 24(3).
  10. Ivry, A., Cohen, I., & Berdugo, B. (2020). Evaluation of Deep-Learning-Based Voice Activity Detectors and Room Impulse Response Models in Reverberant Environments. ICASSP 2020 (IEEE).
  11. Ma, Y., & Nishihara, A. (2013). Efficient Voice Activity Detection Algorithm Using Long-Term Spectral Flatness Measure. EURASIP Journal on Audio, Speech, and Music Processing, 2013:21.
  12. Grundhuber, P., Halimeh, M. M., Strauß, M., & Habets, E. A. P. (2025). Robust Speech Activity Detection in the Presence of Singing Voice. WASPAA 2025 (IEEE).
  13. Koenecke, A., Choi, A. S. G., Mei, K. X., et al. (2024). Careless Whisper: Speech-to-Text Hallucination Harms. ACM FAccT 2024.

Building with Soniox? Voice activity is folded into how the real-time API finalizes turns; see the endpoint detection documentation.