Abstract
This thesis investigates the phase estimation problem in speech enhancement and dereverberation.
Noise and reverberation are ubiquitous in everyday environments and have detrimental effects on
the quality and intelligibility of speech for human listeners. Speech enhancement and
dereverberation are therefore critical when processing noisy and reverberant speech. Many
enhancement and dereverberation approaches modify only the magnitude spectrum, while the noisy
phase is left untouched and used directly when reconstructing speech from the enhanced magnitude.
However, it has been argued that phase is important in speech enhancement, and that reusing the
noisy phase to reconstruct speech may degrade its quality and intelligibility. This thesis
therefore focuses on phase-aware methods.
In our initial speech dereverberation experiments, we found an issue in the perceptual evaluation
of speech quality (PESQ) measure. When measuring reverberant speech, the cross-correlation-based
time alignment process in PESQ may incorrectly align the reverberant speech with the reference
speech, especially when the reverberation is heavy. In this time alignment process, the speech is
first split into utterances based on speech activity, and these utterances are further split into
small clips. We proposed a modified PESQ, called time alignment restricted PESQ (TAR-PESQ), in
which the utterance-splitting step is removed. We conducted experiments using both simulated and
real room impulse responses (RIRs) to compare the time alignment performance of the proposed
TAR-PESQ with that of the original PESQ. We also computed the variance of the scores produced by
both measures to assess their consistency. The results showed that TAR-PESQ correctly aligned more
speech segments than the original PESQ and produced more consistent scores.
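The core idea of cross-correlation-based time alignment can be illustrated with a minimal sketch. This is not PESQ's actual alignment algorithm (which operates on utterances and clips with additional refinement steps); it only shows how a delay estimate is obtained from the peak of a cross-correlation, and the function name and signal are our own for illustration:

```python
import numpy as np

def estimate_delay(reference: np.ndarray, degraded: np.ndarray) -> int:
    """Estimate the delay (in samples) of `degraded` relative to
    `reference` from the peak of their cross-correlation."""
    corr = np.correlate(degraded, reference, mode="full")
    # Lag 0 sits at index len(reference) - 1 of the full correlation.
    return int(np.argmax(corr)) - (len(reference) - 1)

# A clean pulse and a copy delayed by 5 samples.
ref = np.zeros(32)
ref[10] = 1.0
deg = np.roll(ref, 5)
print(estimate_delay(ref, deg))  # 5
```

With heavy reverberation, the degraded signal contains smeared copies of the reference, so the correlation peak can shift toward a strong reflection rather than the direct path, which is precisely the misalignment problem described above.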
When estimating complex speech spectrograms or masks in the complex domain, the mean square
error (MSE) loss function is commonly used for training. However, the MSE loss function
sometimes puts more weight on phase than on magnitude, which may degrade the performance of the
trained model. To balance the estimation of magnitude and phase, we proposed a weighted
magnitude-phase (WMP) loss function for speech dereverberation. Objective measures showed that
the WMP loss function improved performance in both speech denoising and dereverberation when a
suitable weight was chosen. We also conducted a listening test to assess the perceptual quality
of the produced speech, and the results showed that the WMP loss function was slightly preferred
for speech dereverberation.
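One plausible form of such a loss is sketched below. The weighting scheme, the balance parameter `alpha`, and the magnitude-weighted phase term are our assumptions for illustration, not necessarily the exact formulation used in this thesis:

```python
import numpy as np

def weighted_mag_phase_loss(est: np.ndarray, ref: np.ndarray,
                            alpha: float = 0.5) -> float:
    """Illustrative weighted magnitude-phase loss on complex spectrograms.

    alpha = 1 reduces to magnitude-only MSE; alpha = 0 keeps only the
    (magnitude-weighted) phase term. A sketch of the general idea only.
    """
    mag_est, mag_ref = np.abs(est), np.abs(ref)
    mag_term = np.mean((mag_est - mag_ref) ** 2)
    # Weight the phase error by the reference magnitude so that
    # low-energy T-F units contribute little to the phase term.
    phase_err = np.angle(est) - np.angle(ref)
    phase_term = np.mean((2.0 * mag_ref * np.sin(phase_err / 2.0)) ** 2)
    return float(alpha * mag_term + (1.0 - alpha) * phase_term)

# Identical spectrograms give zero loss.
x = np.fft.rfft(np.random.default_rng(0).standard_normal(256))
print(weighted_mag_phase_loss(x, x))  # 0.0
```

Choosing `alpha` trades off the two terms, which mirrors the finding above that the loss helps when a suitable weight is chosen.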
To better exploit the temporal and frequency structure of phase, we proposed to use deep neural
network (DNN) models to estimate phase derivative features, as well as the phase itself, from the
magnitude and phase of the noisy input. Assuming that time-frequency (T-F) units with low energy
have little impact on the quality of speech, we proposed a masked loss function that masks out
low-energy T-F units when computing the loss. The experiments showed that this loss function
increased speech quality. We then proposed a non-iterative and an iterative phase reconstruction
approach to reconstruct the phase from the estimated phase features. The experiments showed that
these reconstruction approaches achieved speech quality comparable to using the directly
estimated phase.
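A minimal sketch of such a masked loss is given below. The energy threshold, its value in dB, and the squared wrapped-phase error are our assumptions for illustration; the thesis's formulation may differ:

```python
import numpy as np

def masked_phase_loss(phase_est: np.ndarray, phase_ref: np.ndarray,
                      mag_ref: np.ndarray,
                      threshold_db: float = -40.0) -> float:
    """Illustrative masked loss: phase error averaged only over T-F units
    whose reference magnitude is within `threshold_db` of the maximum."""
    threshold = np.max(mag_ref) * 10.0 ** (threshold_db / 20.0)
    mask = mag_ref >= threshold
    # Wrap phase differences to (-pi, pi] via the complex exponential.
    err = np.angle(np.exp(1j * (phase_est - phase_ref)))
    return float(np.mean(err[mask] ** 2))

mag = np.array([[1.0, 1e-4], [0.5, 1e-5]])
ref = np.zeros_like(mag)
est = np.full_like(mag, 0.1)  # constant 0.1 rad phase error everywhere
print(round(masked_phase_loss(est, ref, mag), 4))  # 0.01
```

Only the two high-energy units pass the mask here, so the loss is the squared error on those units alone; errors in the near-silent units are ignored, matching the assumption that they have little perceptual impact.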