Abstract
Environmental sounds typically occur in complex mixtures. Recognizing, isolating and interpreting different environmental sounds are easy and natural tasks for human listeners but remain challenging problems for machines. With the proliferation of smart audio devices, it is becoming ever more important to develop robust algorithms for the analysis of such environmental sounds. In this thesis, we investigate methods based on Non-negative Matrix Factorization (NMF) for several tasks of environmental audio analysis. As a dictionary learning method, NMF has been shown to learn compact representations of sounds that can be used for tasks such as source separation and classification. The experiments presented in this thesis show how NMF can be adapted via regularization for the recognition and detection of environmental sounds.
Firstly, we address the task of Audio Event Detection (AED), which aims to automatically recognize and label sound events in continuous audio signals and to estimate their position in time. Using a carefully labelled dataset of real-life recordings, we show how enforcing sparse representations and modelling temporal context in Coupled Sparse NMF improves the accuracy of polyphonic AED.
Secondly, we tackle the problem of binary classification of weakly labelled audio. By weak labels we mean that each audio file carries only a tag denoting the presence or absence of a sound of interest, without its specific location in time. We address this challenge by adding a constraint to basic NMF, which yields a novel Masked NMF for learning on weakly labelled audio data. We show that Masked NMF performs well on the Bird Audio Detection task.
Finally, we propose an orthogonality regularizer for Masked NMF to perform AED on weakly labelled audio data. By adding this regularization term, we show how Masked NMF can be used for monophonic AED. We conduct experiments on a dataset of rare audio events. Our results show that Orthogonality-Regularized Masked NMF is a promising method for monophonic detection of impact sounds.
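For orientation, the basic NMF that the methods above extend via regularization can be sketched as follows. This is a minimal illustration, not the Coupled Sparse, Masked or Orthogonality-Regularized variants developed in the thesis: it assumes a Euclidean cost with the standard multiplicative updates and an optional L1 penalty on the activations as one common way to encourage sparsity. The function name `nmf` and its parameters (`rank`, `n_iter`, `sparsity`) are hypothetical choices made for this sketch.

```python
import numpy as np


def nmf(V, rank, n_iter=200, sparsity=0.0, eps=1e-9, seed=0):
    """Approximate a non-negative matrix V (e.g. a magnitude spectrogram,
    frequency bins x time frames) as W @ H with W, H >= 0.

    W holds the dictionary atoms (spectral templates) and H their
    activations over time. `sparsity` is the weight of an assumed L1
    penalty on H; 0.0 gives plain Euclidean NMF.
    """
    rng = np.random.default_rng(seed)
    n_bins, n_frames = V.shape
    W = rng.random((n_bins, rank)) + eps
    H = rng.random((rank, n_frames)) + eps

    for _ in range(n_iter):
        # Multiplicative update of the activations; the L1 weight enters
        # the denominator and shrinks small activations towards zero.
        H *= (W.T @ V) / (W.T @ W @ H + sparsity + eps)
        # Multiplicative update of the dictionary.
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        # Rescale so each atom sums to one; the product W @ H is unchanged.
        scale = W.sum(axis=0) + eps
        W /= scale
        H *= scale[:, None]

    return W, H
```

Because the updates are multiplicative and the factors are initialised with positive values, non-negativity is preserved throughout without an explicit projection step.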