IEEE Access (Jan 2023)

Audio-Visual Overlapped Speech Detection for Spontaneous Distant Speech

  • Minyoung Kyoung,
  • Hyungbae Jeon,
  • Kiyoung Park

DOI
https://doi.org/10.1109/ACCESS.2023.3254529
Journal volume & issue
Vol. 11
pp. 27426 – 27432

Abstract

Although advances in deep learning have brought remarkable improvements to Overlapped Speech Detection (OSD), performance in far-field environments is still limited owing to the lack of real-world overlapped speech data and a low signal-to-noise ratio. In this paper, we present an end-to-end audio-visual OSD system based on decision fusion between the audio and video modalities. First, we propose a simple yet powerful audio data augmentation method for spontaneous distant speech data. Second, to maximize the effectiveness of the video modality, we design a video OSD system based on a cross-speaker attention module that explores the visual correlation between multiple speakers. Finally, we present a cross-modality attention module to make the final decision more accurate. Our experimental results demonstrate that our approach outperforms current state-of-the-art methods on a real-world distant speech dataset. Moreover, our approach detects overlapped speech more robustly than its counterpart, which uses the audio modality alone.
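
To make the idea of decision fusion concrete, the sketch below combines per-frame overlap probabilities from an audio model and a video model after each has made its own prediction. It is an illustrative assumption only: the abstract does not specify the fusion rule, and the paper's cross-modality attention module is not reproduced here. The function name `fuse_decisions`, the fixed `audio_weight`, and the `threshold` are hypothetical.

```python
def fuse_decisions(audio_probs, video_probs, audio_weight=0.5, threshold=0.5):
    """Late (decision-level) fusion of per-frame overlap probabilities.

    Hypothetical illustration: a fixed convex combination stands in for
    whatever learned weighting a real system would use.
    """
    if len(audio_probs) != len(video_probs):
        raise ValueError("both modalities must cover the same frames")
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")

    # Weighted average of the two per-frame posteriors.
    fused = [
        audio_weight * a + (1.0 - audio_weight) * v
        for a, v in zip(audio_probs, video_probs)
    ]
    # A frame is labeled as overlapped speech when the fused score
    # reaches the decision threshold.
    labels = [p >= threshold for p in fused]
    return fused, labels


if __name__ == "__main__":
    # Made-up scores for three frames, purely for demonstration.
    audio = [0.9, 0.4, 0.2]
    video = [0.7, 0.8, 0.1]
    fused, labels = fuse_decisions(audio, video)
    print(labels)
```

In this toy example the second frame is labeled as overlapped only because the video score compensates for a weak audio score, which is the kind of case where a second modality can help under low signal-to-noise conditions.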

Keywords