Egyptian Informatics Journal (Dec 2022)

A speech separation system in video sequence using dilated inception network and U-Net

  • Ghada Dahy,
  • Mohammed A.A. Refaey,
  • Reda Alkhoribi,
  • M. Shoman

Journal volume & issue
Vol. 23, no. 4
pp. 121 – 131

Abstract

In this paper, an audio-visual model for separating the speech of a target speaker from a mixture of other speakers' speech is proposed. It can be used in speech separation, in automatic speech recognition (ASR) systems, and in creating single-speaker speech databases. Speech separation is a difficult problem when using audio information alone, so visual and auditory signals are combined to perform the separation. The proposed model consists of four modules: two for the audio signal, one for the visual features, and a final one that concatenates the features produced by the previous three modules to generate the separated signals. The proposed model improved Short-Time Objective Intelligibility (STOI) by 11%, Perceptual Evaluation of Speech Quality (PESQ) by 24%, and Frequency-weighted Segmental SNR (fwSNRseg) by 16% compared with previous works. It also improved 'Csig', the predicted rating of speech distortion, by 13%, and 'Covl', the predicted rating of overall quality, by 18% compared with previous audio-visual models.
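The abstract does not specify the internals of the dilated inception network, but the general idea behind such blocks is to run several convolutions with different dilation rates in parallel over the same input and concatenate their outputs, capturing context at multiple time scales. A minimal NumPy sketch of that idea (the kernel values and dilation rates here are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation=1):
    """1-D convolution with 'same' padding at the given dilation rate."""
    k = len(kernel)
    pad = dilation * (k - 1) // 2          # keep output length equal to input length
    xp = np.pad(x, (pad, pad))
    return np.array([sum(kernel[i] * xp[t + i * dilation] for i in range(k))
                     for t in range(len(x))])

def dilated_inception_block(x, kernel, dilations=(1, 2, 4)):
    """Inception-style block: parallel dilated convolutions over the same
    input, with the branch outputs stacked as separate feature channels."""
    return np.stack([dilated_conv1d(x, kernel, d) for d in dilations])

# Toy 8-sample signal through three parallel branches.
x = np.arange(8, dtype=float)
feats = dilated_inception_block(x, kernel=[0.25, 0.5, 0.25])
print(feats.shape)  # one row of features per dilation rate: (3, 8)
```

In a real model each branch would learn its own kernels; stacking the branch outputs lets later layers mix fine-grained (dilation 1) and long-range (dilation 4) temporal context.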

Keywords