IEEE Access (Jan 2019)

Multimodal Voice Conversion Under Adverse Environment Using a Deep Convolutional Neural Network

  • Jian Zhou,
  • Yuting Hu,
  • Hailun Lian,
  • Huabin Wang,
  • Liang Tao,
  • Hon Keung Kwan

DOI
https://doi.org/10.1109/ACCESS.2019.2955982
Journal volume & issue
Vol. 7
pp. 170878 – 170887

Abstract


This paper presents a voice conversion (VC) technique for noisy environments. Typical VC methods rely on audio information alone and assume a noiseless environment; as a result, existing conversion methods do not always achieve satisfactory results under adverse acoustic conditions. To address this problem, we propose a multimodal voice conversion model based on a deep convolutional neural network (MDCNN), built by combining two convolutional neural networks (CNNs) and a deep neural network (DNN), for VC in noisy environments. In the MDCNN, both acoustic and visual information are incorporated into the voice conversion to improve its robustness under adverse acoustic conditions. The two CNNs extract acoustic and visual features, respectively, while the DNN captures the nonlinear mapping between source and target speech. Experimental results indicate that the proposed MDCNN outperforms two existing approaches in noisy environments.
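The abstract describes a two-branch architecture: one CNN per modality (acoustic and visual), with the pooled features fused and passed through a DNN that regresses target-speech features. The following is a minimal numpy sketch of that data flow, not the paper's implementation; all layer sizes, the MFCC-like/lip-region feature dimensions, and the single hidden layer are illustrative assumptions.

```python
import numpy as np

def conv1d_features(x, kernels):
    """Valid 1-D convolution, ReLU, then global average pooling.

    x: (channels, time) feature map; kernels: (num_filters, channels, width).
    Returns a (num_filters,) feature vector, one value per learned filter.
    """
    num_filters, _, width = kernels.shape
    out_len = x.shape[1] - width + 1
    feats = np.empty(num_filters)
    for f in range(num_filters):
        conv = np.array([np.sum(kernels[f] * x[:, t:t + width])
                         for t in range(out_len)])
        feats[f] = np.mean(np.maximum(conv, 0.0))  # ReLU + global pooling
    return feats

def mdcnn_forward(audio, video, params):
    """One forward pass: audio CNN + visual CNN -> concatenate -> DNN regressor."""
    a = conv1d_features(audio, params["audio_kernels"])   # acoustic branch
    v = conv1d_features(video, params["video_kernels"])   # visual branch
    h = np.concatenate([a, v])                            # multimodal fusion
    h = np.maximum(params["W1"] @ h + params["b1"], 0.0)  # DNN hidden layer
    return params["W2"] @ h + params["b2"]                # target speech features

# Hypothetical dimensions for illustration only.
rng = np.random.default_rng(0)
params = {
    "audio_kernels": rng.standard_normal((8, 24, 5)) * 0.1,  # 24 spectral channels
    "video_kernels": rng.standard_normal((8, 32, 3)) * 0.1,  # 32 lip-region features
    "W1": rng.standard_normal((16, 16)) * 0.1, "b1": np.zeros(16),
    "W2": rng.standard_normal((24, 16)) * 0.1, "b2": np.zeros(24),
}
audio = rng.standard_normal((24, 50))  # noisy source-speech frames
video = rng.standard_normal((32, 25))  # synchronized visual frames
y = mdcnn_forward(audio, video, params)
print(y.shape)  # predicted target-speech feature vector: (24,)
```

The key design point the abstract emphasizes is the fusion step: because the visual branch is unaffected by acoustic noise, the concatenated representation stays informative even when the audio branch degrades.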

Keywords