Abstract
Supervised learning has been used to solve the monaural speech enhancement problem, offering state-of-the-art performance. However, clean training data are difficult or expensive to obtain in real room environments, which limits the training of supervised learning-based methods. In addition, mismatched conditions, e.g., noise types in the testing stage that are unseen in the training stage, present a common challenge. In this paper, we propose a self-supervised learning-based monaural speech enhancement method that uses two autoencoders, i.e., a speech autoencoder (SAE) and a mixture autoencoder (MAE), with a shared layer, which helps mitigate mismatched conditions by learning a shared latent space between speech and mixtures. To further improve the enhancement performance, we also propose phase-aware training and multi-resolution spectral losses. The latent representations of the amplitude and phase are learned independently in the two decoders of the proposed SAE with only a very limited set of clean speech signals. Moreover, the multi-resolution spectral losses help extract rich feature information. Experimental results on a benchmark dataset demonstrate that the proposed method outperforms state-of-the-art self-supervised and supervised approaches. The source code is available at https://github.com/Yukino-3/Complex-SSL-SE.