IEEE Access (Jan 2021)
Efficient Audio-Visual Speech Enhancement Using Deep U-Net With Early Fusion of Audio and Video Information and RNN Attention Blocks
Abstract
Speech enhancement (SE) aims to improve speech quality and intelligibility by removing acoustic corruption. While various deep-learning-based audio-only (AO) SE models have been developed and successfully suppress non-speech background noise, audio-visual SE (AVSE) models have been studied to effectively remove competing speech. In this paper, we propose an AVSE model that estimates spectral masks for the real and imaginary components so that the phase is enhanced as well. The model is based on the U-net structure, whose skip connections allow the decoder to restore information by leveraging intermediate features from the encoding process and mitigate the vanishing-gradient problem by providing direct paths to the encoder's layers. In the proposed model, we introduce early fusion, which processes audio and video with a single encoder; this yields fused features that are easy to decode for SE while reducing the number of encoder and decoder parameters. Moreover, we extend the U-net with the proposed recurrent-neural-network (RNN) attention (RA) blocks and Res paths (RPs) in the encoder and the skip connections. The RPs are introduced to bridge the semantic gap between low-level and high-level features, while the RA blocks are designed to learn efficient representations that capture the frequency-specific characteristics inherent in speech as time-series data. Experimental results on the LRS2-BBC dataset demonstrated that AV models successfully removed competing speech and that the proposed model efficiently estimated complex spectral masks for SE. Compared with a conventional U-net model with a comparable number of parameters, the proposed model achieved relative improvements of about 7.23% and 5.21% in signal-to-distortion ratio and perceptual evaluation of speech quality, respectively, and a 22.9% reduction in FLOPS.
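To make the two central ideas of the abstract concrete, the following minimal sketch (not the authors' implementation; all layer choices, dimensions, and names such as `EarlyFusionMaskNet`, `A_DIM`, and `V_DIM` are illustrative assumptions) shows early fusion of audio and video features into a single encoder and the estimation of real/imaginary masks applied to a complex spectrogram:

```python
import torch
import torch.nn as nn

# Illustrative sketch of (1) early fusion: audio and time-aligned video features
# are concatenated and fed to a single encoder, and (2) complex mask estimation:
# the network predicts real/imaginary masks applied to the noisy spectrogram.
# Dimensions below are assumed, not taken from the paper.
A_DIM, V_DIM, HID = 257, 512, 256  # freq bins, visual embedding size, hidden size

class EarlyFusionMaskNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Single encoder over the fused audio-visual input (a GRU stands in for
        # the paper's U-net encoder with RA blocks).
        self.encoder = nn.GRU(A_DIM * 2 + V_DIM, HID, batch_first=True)
        self.mask_head = nn.Linear(HID, A_DIM * 2)  # real + imaginary masks

    def forward(self, spec_ri, video_emb):
        # spec_ri:   (B, T, 2*A_DIM) noisy spectrogram, real and imaginary parts
        # video_emb: (B, T, V_DIM) per-frame visual features aligned to audio frames
        fused = torch.cat([spec_ri, video_emb], dim=-1)   # early fusion
        h, _ = self.encoder(fused)
        masks = torch.tanh(self.mask_head(h))             # bounded complex masks
        m_r, m_i = masks.chunk(2, dim=-1)
        s_r, s_i = spec_ri.chunk(2, dim=-1)
        # Complex multiplication: (s_r + j*s_i) * (m_r + j*m_i)
        enh_r = s_r * m_r - s_i * m_i
        enh_i = s_r * m_i + s_i * m_r
        return torch.cat([enh_r, enh_i], dim=-1)

# Usage with dummy tensors
net = EarlyFusionMaskNet()
spec = torch.randn(1, 100, A_DIM * 2)
vid = torch.randn(1, 100, V_DIM)
enhanced = net(spec, vid)  # (1, 100, 2*A_DIM)
```

Because both real and imaginary components are masked, the enhanced spectrogram carries a modified phase as well as magnitude, which is the motivation for complex mask estimation stated in the abstract.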
Keywords