Keyword spotting and wake words: always-listening AI explained

A smart speaker sits on the kitchen counter for a year, hears every argument and every birthday, and stays inert until you say its name. The answer to "is it always listening?" is yes, but the device listens without transcribing, and in the normal case sends nothing anywhere, because a tiny model is doing one narrow job: waiting for a single phrase.

Keyword spotting is that job, a far smaller problem than full speech recognition.

Why wake words use a separate model

Full recognition is expensive in every way that matters for an always-on device. It burns compute and battery, and when it runs in the cloud it means streaming your audio off the device around the clock, a non-starter for both privacy and bandwidth. A phone that ran a full recognizer day and night would drain its battery in hours.

So the design splits in two. A tiny model runs locally, always on, doing nothing but listening for the trigger. Only when it fires does the big system (full recognition, the cloud, the voice agent) wake up and start work. The wake word is a gate that stays shut almost all of the time.

	Wake-word model	Full recognition
Job	One phrase, yes or no	Every word
Size	Tens to hundreds of kilobytes	Orders of magnitude larger
Runs	Continuously, on the device	On demand, often in the cloud
Audio leaves the device	Only after a trigger	While active

The split that makes always-on listening affordable and private: the gate is tiny, and the system behind it is not.

Wake-word model operation

A wake-word detector is a small neural network, often only tens to hundreds of kilobytes, light enough to run continuously on a low-power chip or DSP [1]. It is trained on one specific phrase and does one thing: score, continuously, how likely it is that the most recent audio contained that phrase. When the score crosses a threshold, it triggers.
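
The score-and-threshold loop can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: it assumes the model already emits one probability per audio frame, and the names and values (`detect_wake_word`, `smooth_frames`, `refractory_frames`, a threshold of 0.8) are hypothetical. Averaging over a few frames and pausing after a trigger are common-sense additions so that a single noisy frame does not fire the detector and one utterance does not fire it twice.

```python
from collections import deque
from typing import Deque, Iterable, Iterator


def detect_wake_word(
    frame_scores: Iterable[float],
    threshold: float = 0.8,        # illustrative value, not a recommendation
    smooth_frames: int = 5,        # frames averaged before comparing
    refractory_frames: int = 50,   # frames to ignore after a trigger
) -> Iterator[int]:
    """Yield the frame index of each trigger.

    frame_scores are assumed to be per-frame wake-word probabilities
    (0..1) produced by some small on-device model.
    """
    window: Deque[float] = deque(maxlen=smooth_frames)
    cooldown = 0
    for i, score in enumerate(frame_scores):
        window.append(score)
        if cooldown > 0:
            cooldown -= 1
            continue
        smoothed = sum(window) / len(window)
        # Trigger only once the window is full and the average is high.
        if len(window) == smooth_frames and smoothed >= threshold:
            yield i
            cooldown = refractory_frames
```

Fed a stream in which the score stays high for several consecutive frames, the generator yields one index; a single isolated spike is averaged away and yields nothing.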

That threshold is the entire design tension, the same false-accept versus false-reject trade-off that governs voice activity detection. Set it loose and the device wakes at the TV, at a rhyming word, at nothing at all; a false accept is annoying and a privacy risk, because a false wake can start streaming audio. Set it tight and the device misses you when you call from across the room or with a cold. Every wake-word system lives at a chosen point on that curve.
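
To make the trade-off concrete, here is a small sketch that computes both error rates at a given threshold. The function name `error_rates` and the scores in the usage note are made up for illustration; real evaluations use large labeled sets and often report false accepts per hour of audio rather than as a fraction.

```python
from typing import Sequence, Tuple


def error_rates(
    positive_scores: Sequence[float],  # scores for clips that contain the wake word
    negative_scores: Sequence[float],  # scores for clips that do not
    threshold: float,
) -> Tuple[float, float]:
    """Return (false_accept_rate, false_reject_rate) at the threshold."""
    # A negative clip scoring at or above the threshold is a false accept.
    false_accepts = sum(s >= threshold for s in negative_scores)
    # A positive clip scoring below the threshold is a false reject.
    false_rejects = sum(s < threshold for s in positive_scores)
    return (
        false_accepts / len(negative_scores),
        false_rejects / len(positive_scores),
    )
```

With invented scores `[0.95, 0.9, 0.7, 0.55]` for real wake words and `[0.1, 0.3, 0.6, 0.2, 0.05]` for everything else, a threshold of 0.5 accepts one of the five negatives and rejects no positives, while a threshold of 0.8 accepts no negatives but rejects two of the four positives. Moving the threshold trades one error for the other; it does not remove both.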

The usual fix is to split the decision into two stages. The tiny on-device model triggers cheaply and a little too eagerly, tuned so that it almost never turns away the real wake word. A larger model, sometimes in the cloud, then re-checks the captured snippet and throws out the false accepts before the system fully activates.
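
A two-stage check of this kind might look like the following sketch. Both scorers are passed in as plain functions because the actual models are outside the scope of this article; `cascade_decision`, `loose_threshold`, and `strict_threshold` are hypothetical names with illustrative defaults.

```python
from typing import Callable, Sequence

Audio = Sequence[float]  # placeholder type for a captured audio snippet


def cascade_decision(
    snippet: Audio,
    first_stage: Callable[[Audio], float],   # tiny, cheap, always-on scorer
    second_stage: Callable[[Audio], float],  # larger, stricter verifier
    loose_threshold: float = 0.5,            # illustrative values only
    strict_threshold: float = 0.9,
) -> bool:
    """Return True only if both stages accept the snippet."""
    if first_stage(snippet) < loose_threshold:
        # Cheap rejection: the expensive second stage never runs.
        return False
    return second_stage(snippet) >= strict_threshold
```

The design point is the early return: the expensive verifier runs only on the rare snippets the cheap stage lets through, so its cost is paid per trigger rather than per second of audio.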

Local audio processing and privacy

The privacy architecture follows from the split. The always-on model processes audio locally and keeps only a short rolling buffer of recent sound, overwriting it continuously, sending nothing anywhere. Audio leaves the device only after the wake word fires, when the buffered moment plus what follows is streamed to the full system. "Always listening" is true in the literal sense and misleading in the feared sense: the device hears constantly but, by design, retains and transmits almost nothing.
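
The rolling buffer is essentially a fixed-size queue. The sketch below, using Python's `collections.deque`, is an illustration under simplifying assumptions: audio arrives as discrete frames, `triggered` stands in for the wake-word model's decision, and `pre_roll` and `post_roll` are hypothetical parameters. A real device would keep streaming until the request ends rather than for a fixed number of frames.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List


class RollingAudioBuffer:
    """Keeps only the most recent max_frames frames."""

    def __init__(self, max_frames: int) -> None:
        self._frames: Deque[bytes] = deque(maxlen=max_frames)

    def push(self, frame: bytes) -> None:
        # When the deque is full, appending discards the oldest frame.
        self._frames.append(frame)

    def snapshot(self) -> List[bytes]:
        return list(self._frames)


def frames_to_send(
    frames: Iterable[bytes],
    triggered: Callable[[bytes], bool],  # stand-in for the wake-word model
    pre_roll: int = 3,
    post_roll: int = 2,
) -> List[bytes]:
    """Return the frames that would leave the device.

    That is the buffered frames up to and including the trigger frame,
    plus post_roll frames after it. Returns [] if nothing triggers.
    """
    buffer = RollingAudioBuffer(pre_roll)
    stream = iter(frames)
    for frame in stream:
        buffer.push(frame)
        if triggered(frame):
            following = [f for _, f in zip(range(post_roll), stream)]
            return buffer.snapshot() + following
    return []
```

Everything older than the buffer's capacity is gone by the time the trigger fires, and if no trigger ever occurs the function returns an empty list: nothing is sent.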

The caveat is the false accept. When the detector triggers by mistake, it captures and sends a few seconds you never meant to share. The exposure is an accidental wake rather than a secret recording, which is why the false-accept rate is a privacy number and not only a usability one.

Other uses of keyword spotting

The same technique serves other jobs once you see it as "detect specific phrases cheaply." Command spotting lets a device respond to a small fixed vocabulary ("stop," "next," "volume up") without running full recognition, useful on constrained hardware. Spoken-term detection searches audio archives for a keyword, finding every mention of a product name across thousands of recorded calls. Compliance keyword flagging watches for required or forbidden phrases ("this call may be recorded," or risk language) in monitored conversations.
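
Command spotting, the first of these, reduces to picking the best-scoring phrase from a small fixed set and rejecting it if the score is too low. The sketch below assumes a hypothetical model that outputs one score per command; `spot_command` and the 0.7 threshold are illustrative only.

```python
from typing import Mapping, Optional


def spot_command(
    scores: Mapping[str, float],  # assumed per-command scores from a small model
    threshold: float = 0.7,       # illustrative value
) -> Optional[str]:
    """Return the best-scoring command, or None if nothing is confident enough."""
    if not scores:
        return None
    best = max(scores, key=lambda name: scores[name])
    return best if scores[best] >= threshold else None
```

Returning `None` is the equivalent of the wake-word gate staying shut: audio that matches no command well enough is ignored rather than forced into the nearest one.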

These shade into a related capability inside full recognition: keyword boosting, where you tell a recognizer which terms to expect so it catches them more reliably. That is a different mechanism, biasing a full transcription rather than spotting a phrase in isolation, but it serves the same need: not missing the words that matter.

Position in the voice-system pipeline

A wake-word model is not a small recognizer. A recognizer is built to transcribe every word; this one is built to ignore almost all of them, tuned for a single phrase and traded down to fit a power budget a full recognizer could never meet. That narrowness is why the detector can run for months on a chip that would never hold a transcriber.

When you say a device's name and it lights up, the tiny model that was listening the whole time has hit its threshold and handed off to the full system.

Common questions

Is my smart speaker always listening to me?

The wake-word model is always listening, but it is not transcribing or, normally, sending anything. It processes audio locally, keeps only a short rolling buffer it continuously overwrites, and streams to the cloud only after it detects the trigger phrase. The exception is a false trigger, which can start capturing audio you did not intend.

How is wake-word detection different from speech recognition?

Recognition transcribes everything said and is computationally heavy. Wake-word detection is a tiny model that does one narrow job, deciding whether the recent audio was a specific trigger phrase, so it runs continuously at very low power. The wake word activates the full recognizer; it does not replace it.

Why are wake words like "Alexa" chosen the way they are?

To minimize accidental triggers. A good wake word is several syllables long and rare in everyday speech, distinctive enough for the model to pick out and unlikely to occur by chance. A short, common word would set the device off constantly during normal conversation.

What is the trade-off in tuning a wake-word detector?

False accepts versus false rejects. A loose threshold wakes the device too easily, at the TV or similar-sounding words, which is annoying and a privacy risk. A tight threshold misses you when you call from far away or with a hoarse voice. Many systems use a cheap on-device trigger followed by a stricter second check to balance the two.

References

[1] Jose, C., Mishchenko, Y., Senechal, T., Shah, A., Escott, A., & Vitaladevuni, S. (2020). Accurate Detection of Wake Word Start and End Using a CNN. arXiv preprint arXiv:2008.03790.