Audio event detection: when AI hears more than speech

A security system that listens for breaking glass will, out of the box, also fire on a dropped plate and a popped balloon, and it can stay silent when a real window breaks under traffic noise. None of that is surprising once you see what the detector is up against: real sounds arrive overlapped, disguised as one another, and, for the ones that matter most, almost never at all.

Overlapping sounds

Real environments are not one sound at a time. A street has traffic, footsteps, a siren, and talking, all layered, and the event you care about is buried in the mix.

Polyphony, several overlapping sound sources, is the cocktail-party problem for events. A detector trained mostly on clean, isolated examples of each sound fails when three arrive together, because the spectrogram of three sounds at once is not the sum of three spectrograms you can read off. They interfere. The sound you want is smeared into everything else happening at the same instant, as hard to lift out as a single voice at a crowded party.
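A small numeric sketch can make the interference concrete. It is a toy, not part of any real detector: it assumes two pure tones in the same frequency bin of a single 64-sample frame, and the helper names (`dft_magnitude`, `tone`) are invented for the illustration. The point is only that the magnitude of a mixture depends on the relative phase of its sources, so it is not the sum of the individual magnitudes.

```python
import cmath
import math

N = 64  # frame length in samples (toy value)
K = 4   # every tone below falls in DFT bin 4


def dft_magnitude(signal, k):
    """Magnitude of DFT bin k for a real-valued frame."""
    n = len(signal)
    acc = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(signal))
    return abs(acc)


def tone(phase):
    """One frame of a unit-amplitude sine in bin K with the given phase."""
    return [math.sin(2 * math.pi * K * i / N + phase) for i in range(N)]


a = tone(0.0)
b = tone(math.pi)      # same frequency, opposite phase
c = tone(math.pi / 2)  # same frequency, quarter-cycle shift

mag_a = dft_magnitude(a, K)
mag_b = dft_magnitude(b, K)
mag_c = dft_magnitude(c, K)

# Mix in the time domain, as a microphone would, then measure the same bin.
mag_ab = dft_magnitude([x + y for x, y in zip(a, b)], K)
mag_ac = dft_magnitude([x + y for x, y in zip(a, c)], K)

print(f"sum of magnitudes a+b: {mag_a + mag_b:.2f}, mixture: {mag_ab:.2f}")
print(f"sum of magnitudes a+c: {mag_a + mag_c:.2f}, mixture: {mag_ac:.2f}")
```

In the first mixture the two tones cancel almost completely; in the second they partly reinforce. Real sounds are broadband and their phases are not under anyone's control, which is why a mixed spectrogram cannot simply be subtracted back into its parts.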

Sounds outside the training classes

A detector is trained on a fixed taxonomy of sound classes. The world contains far more sounds than any taxonomy.

Faced with a novel sound outside its training set, an open-set input, the model does what classifiers do: it assigns the closest known label, confidently, instead of admitting it has never heard this before. A new alarm, an unusual mechanical failure, a sound the taxonomy never anticipated, comes back labeled as whatever it most resembles. The detector cannot flag what it has no category for, so the novel events, often the interesting ones, are misclassified silently.
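One common baseline for giving a classifier a way to say "unknown" is to reject predictions whose top softmax probability falls below a threshold. The sketch below assumes that baseline; the labels, logits, and the 0.7 cutoff are made up for illustration, and `classify_with_reject` is a hypothetical helper, not an API from any library.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_with_reject(logits, labels, min_confidence=0.7):
    """Return (label, probability), or ("unknown", probability) when unsure."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < min_confidence:
        return "unknown", probs[best]
    return labels[best], probs[best]


LABELS = ["glass_break", "door_slam", "dog_bark"]  # hypothetical taxonomy

# A clip that clearly matches one class, and one that matches none well.
known = classify_with_reject([6.0, 1.0, 0.5], LABELS)
novel = classify_with_reject([1.2, 1.0, 0.9], LABELS)

print(known)
print(novel)
```

This only catches novel sounds that leave the model undecided. A novel sound that happens to resemble one class strongly still clears the threshold, which is the silent misclassification described above.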

Acoustically similar sounds

A gunshot, a door slamming, a balloon popping, and a hammer strike are all short, loud, broadband impulses. A cough and a bark, a baby's cry and a cat's, are acoustically close.

Classes that share an acoustic profile are hard to separate, especially from a short clip, and the cost of confusing them is not symmetric. A gunshot detector that fires on every door slam is a false-alarm machine; tune it to suppress door slams and it misses the gunshot. The model is asked to draw fine distinctions between sounds that, in the moment they occur, do look alike.
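The trade-off shows up as soon as you sweep a decision threshold over scores from the two classes. The numbers below are invented for illustration, not measurements from any detector, and `count_errors` is a hypothetical helper.

```python
# Made-up "gunshot" scores the detector assigned to clips of each true class.
gunshot_scores = [0.92, 0.81, 0.64]
door_slam_scores = [0.88, 0.70, 0.55, 0.41, 0.30]


def count_errors(threshold):
    """Return (missed gunshots, door slams flagged as gunshots)."""
    misses = sum(1 for s in gunshot_scores if s < threshold)
    false_alarms = sum(1 for s in door_slam_scores if s >= threshold)
    return misses, false_alarms


results = {t: count_errors(t) for t in (0.5, 0.75, 0.9)}

for threshold, (misses, false_alarms) in results.items():
    print(f"threshold {threshold}: {misses} missed, {false_alarms} false alarms")
```

Because the two score distributions overlap, no threshold in this toy set gives zero misses and zero false alarms at once. Raising the threshold only trades one error for the other.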

Rare high-stakes events

The sounds with the highest stakes, a gunshot, breaking glass, a smoke alarm, a fall, are the ones that almost never occur, in the world and therefore in training data.

Rare events mean scarce examples, so the detector is least practiced on the classes it most needs to get right. Rarity also turns any false-alarm rate into a flood, because nearly all of the audio contains no event at all. A glass-break detector running around the clock hears 86,400 seconds a day, over 31 million a year, nearly all of them not-glass-breaking. Even a tiny per-second false-positive rate, multiplied by all those seconds, buries the one true break under a pile of false ones. That is why high-stakes detectors are judged on false-alarm rate as much as on detection rate: a detector that catches every break but also fires dozens of times a day on nothing gets switched off.
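The arithmetic is worth running once. The sketch below assumes the detector makes one decision per second, a 0.1% per-second false-positive rate, a perfect detection rate, and one real break per year. All four numbers are assumptions chosen for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # one decision per second, around the clock


def false_alarms_per_day(fp_rate_per_second):
    """Expected false alerts per day at a given per-second false-positive rate."""
    return fp_rate_per_second * SECONDS_PER_DAY


def alert_precision(true_events_per_day, detection_rate, fp_rate_per_second):
    """Fraction of all alerts that correspond to a real event."""
    true_alerts = true_events_per_day * detection_rate
    false_alerts = false_alarms_per_day(fp_rate_per_second)
    return true_alerts / (true_alerts + false_alerts)


# Assumed values: 0.1% false positives per second, one real break per year.
daily_false_alarms = false_alarms_per_day(0.001)
precision = alert_precision(
    true_events_per_day=1 / 365,
    detection_rate=1.0,
    fp_rate_per_second=0.001,
)

print(f"false alarms per day: about {daily_false_alarms:.0f}")
print(f"share of alerts that are real: {precision:.6f}")
```

Under these assumptions a detector that is right 99.9% of the time per second still raises dozens of false alerts a day, and only a few alerts in every hundred thousand are real.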

The role of context

A scream at a party is fun; a scream in an empty parking garage is an emergency. Applause is expected at a concert and strange in a server room.

The sound alone does not carry its meaning; context does, and a detector that classifies the sound in isolation cannot supply it. Acoustic scene classification, identifying the environment (street, office, restaurant, home), exists partly to fill that gap, so the same detected sound can be read differently depending on where it occurred. Without scene context, the detector gets the sound right and the meaning wrong, and that combination scores well on a benchmark while failing in the field.
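One simple way to combine the two outputs is a lookup keyed on both the detected event and the classified scene. The sketch below assumes both labels already come from upstream models. The table entries, the severity names, and the `interpret` helper are all hypothetical.

```python
# Hypothetical policy table: (event label, scene label) -> how to treat it.
SEVERITY = {
    ("scream", "party"): "ignore",
    ("scream", "parking_garage"): "emergency",
    ("applause", "concert"): "ignore",
    ("applause", "server_room"): "review",
}


def interpret(event, scene, default="review"):
    """Read a detected event in light of the scene it occurred in."""
    return SEVERITY.get((event, scene), default)


print(interpret("scream", "party"))
print(interpret("scream", "parking_garage"))
print(interpret("scream", "office"))  # pair not in the table, falls back to the default
```

Sending unlisted pairs to a human-review default, not to "ignore", is a design choice for this sketch: an unanticipated combination is more likely to deserve a look than a known-benign one.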

System architecture

Mechanically, audio event detection (AED) resembles other modern audio tasks: turn the sound into a spectrogram, feed it to a neural network, and classify.[1] Two flavors exist. Tagging says which sounds are present in a clip. Detection also says when each one started and stopped, the event-level counterpart of word timestamps in a transcript. The timing version powers alerting and indexing, because "a gunshot occurred" matters less than "a gunshot occurred at 2:47."
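The difference between the two flavors is easy to see in post-processing. The sketch below assumes a network has already produced one probability per frame for a single sound class. The frame hop, the 0.5 threshold, and the function names are assumptions for illustration, and real systems often add smoothing or a minimum event duration on top.

```python
def tag_clip(frame_probs, threshold=0.5):
    """Tagging: is the sound present anywhere in the clip?"""
    return any(p >= threshold for p in frame_probs)


def detect_events(frame_probs, hop_seconds, threshold=0.5):
    """Detection: (onset, offset) in seconds for each run of active frames."""
    events = []
    start = None
    for i, p in enumerate(frame_probs):
        active = p >= threshold
        if active and start is None:
            start = i  # a run of active frames begins
        elif not active and start is not None:
            events.append((start * hop_seconds, i * hop_seconds))
            start = None
    if start is not None:  # the clip ended while the event was still active
        events.append((start * hop_seconds, len(frame_probs) * hop_seconds))
    return events


# Made-up per-frame probabilities for one class, one frame every 0.5 s.
probs = [0.1, 0.2, 0.8, 0.9, 0.7, 0.2, 0.1, 0.6, 0.6, 0.1]

present = tag_clip(probs)
events = detect_events(probs, hop_seconds=0.5)

print(present)
print(events)
```

Tagging collapses the whole clip to a single yes or no, while detection keeps the two separate occurrences and their start and stop times.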

The question	The task
Is this speech at all?	Voice activity detection
What words were said?	Speech recognition
What sound just happened?	Audio event detection

Three questions about the same audio. Event detection works the territory the other two discard.

That territory is the coughs, alarms, and applause that recognition treats as noise. Event detection is also why captions can include "[applause]" and "[music]": those tags come from detecting events, not from transcribing speech. As an audio-intelligence capability it is less mature than transcription, which is why the overlapping, novel, confusable, and rare-but-critical sounds above are still where the open problems sit.

Common questions

What is audio event detection?

Recognizing non-speech sounds: the glass break, the cough, the applause, the gunshot. Its detection form also says when the sound started and stopped, which is what turns "a gunshot occurred" into "a gunshot occurred at 2:47" and makes alerting and indexing possible.

What is audio event detection used for?

Alerting on and indexing the sounds that recognition throws away. Security cameras listen for breaking glass and gunshots, monitors listen for coughs and falls, and captions mark "[applause]" and "[music]" for accessibility. The detection form, with its start and stop times, is what makes alerting and indexing work: knowing a gunshot happened matters less than knowing when.

Why does a sound detector confuse a gunshot with a door slam?

Because in a short clip they look nearly identical: both are short, loud, broadband impulses. The cost of confusing them is uneven, and no setting avoids it: tune the detector to suppress every door slam and it misses the gunshot; leave it sensitive and it false-alarms on every slam. Rare high-stakes sounds also have the least training data, so the detector is least practiced on exactly the class it most needs to get right.

How is audio event detection different from speech recognition?

They work opposite halves of the signal. Speech recognition transcribes words; audio event detection identifies the non-speech sounds recognition treats as background noise. There is no transcript, because a cough, an alarm, or applause was never a word.

References

[1] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., & Plumbley, M. D. (2019). PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. arXiv preprint arXiv:1912.10211.