Abstract
Imagine standing on a street corner in the city. With your
eyes closed, you can hear and recognize a succession of
sounds: cars passing by, people speaking, their footsteps when
they walk by, and the continuously falling rain. Recognition
of all these sounds and interpretation of the perceived scene
as a city street soundscape comes naturally to humans. It is,
however, the result of years of “training”: encountering and
learning associations between the vast variety of sounds in
everyday life, the sources producing these sounds, and the
names given to them.
Our everyday environment consists of many sound sources that together produce a complex mixture signal. Human auditory perception is highly specialized in segregating these sources and directing attention to the source of interest.
This phenomenon is called the cocktail party effect, by analogy to the ability to focus on a single conversation in a noisy room.
Perception groups the spectro-temporal information in acoustic
signals into auditory objects such that sounds or groups of
sounds are perceived as a coherent whole [1]. This is why, for example, a complex sequence of sounds is perceived as a single sound event instance, be it “bird singing” or “footsteps”.
The goal of automatic sound event detection (SED) methods is to recognize what is happening in an audio signal and when it is happening. In practice, this means determining the temporal instances at which different sounds are active within the signal.
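The kind of output this implies can be sketched as a simple post-processing step: frame-wise activity decisions for each sound class are converted into event instances with onset and offset times. The class names, frame length, and helper function below are illustrative, not part of any specific SED system.

```python
# Minimal sketch of SED-style output: per-class frame activity decisions
# are grouped into (onset, offset) event instances. Values are illustrative.

FRAME_LEN = 0.02  # assumed analysis frame length in seconds


def frames_to_events(activity, frame_len=FRAME_LEN):
    """Convert a binary frame-activity sequence into (onset, offset) pairs."""
    events = []
    onset = None
    for i, active in enumerate(activity):
        if active and onset is None:
            onset = i * frame_len          # event starts at this frame
        elif not active and onset is not None:
            events.append((onset, i * frame_len))  # event ended
            onset = None
    if onset is not None:                  # event still active at the end
        events.append((onset, len(activity) * frame_len))
    return events


# Toy frame-wise decisions for two overlapping sound classes.
detections = {
    "footsteps":    [0, 1, 1, 1, 0, 0, 0, 1, 1, 0],
    "bird singing": [0, 0, 0, 0, 1, 1, 1, 1, 0, 0],
}
for label, activity in detections.items():
    print(label, frames_to_events(activity))
```

Note that both classes are active simultaneously in some frames, reflecting the polyphonic nature of everyday soundscapes described above.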