IEEE Access (Jan 2022)

CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile Application

  • Yu-Wen Chen,
  • Kuo-Hsuan Hung,
  • You-Jin Li,
  • Alexander Chao-Fu Kang,
  • Ya-Hsin Lai,
  • Kai-Chun Liu,
  • Szu-Wei Fu,
  • Syu-Siang Wang,
  • Yu Tsao

DOI
https://doi.org/10.1109/ACCESS.2022.3153469
Journal volume & issue
Vol. 10
pp. 46082 – 46099

Abstract

This study presents CITISEN, a deep learning-based speech signal-processing mobile application. CITISEN provides three functions: speech enhancement (SE), model adaptation (MA), and background noise conversion (BNC), which together allow it to serve as a platform for deploying and evaluating SE models and for flexibly extending those models to new noise environments and users. For SE, CITISEN downloads pretrained SE models from a cloud server and uses them to effectively reduce the noise components in prerecorded or instantly recorded speech provided by users. When CITISEN encounters noisy speech from unknown speakers or noise types, the MA function can effectively improve SE performance: a few audio clips of the unseen speaker or noise type are recorded, uploaded to the cloud server, and used to adapt the pretrained SE model. Finally, for BNC, CITISEN removes the original background noise with an SE model and then mixes the processed speech signal with new background noise. This novel BNC function can be used to evaluate SE performance under specific conditions, to conceal a speaker's actual surroundings, and for entertainment. The experimental results confirmed the effectiveness of the SE, MA, and BNC functions. Compared with the noisy speech signals, the speech enhanced by SE achieved improvements of approximately 6% and 33% in short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ), respectively. With MA, STOI and PESQ improved by a further 6% and 11%, respectively. Note that the SE model and MA method are not limited to those described in this study and can be replaced with any SE model and MA method. Finally, the BNC experiments indicated that speech signals with the original and converted backgrounds yield similar scene-identification accuracy and similar embeddings in an acoustic scene classification model. Therefore, the proposed BNC can effectively convert the background noise of a speech signal and can serve as a data augmentation method when clean speech signals are unavailable.

Keywords