IEEE Access (Jan 2024)
Active Speaker Detection Using Audio, Visual, and Depth Modalities: A Survey
Abstract
The rapid progress of multimodal signal processing in recent years has paved the way for novel applications in human-computer interaction, surveillance, and telecommunication. Active Speaker Detection (ASD) is a critical pre-processing step for numerous applications such as voice recognition, speaker diarization, and noise reduction. This paper comprehensively reviews ASD, covering methods and datasets based on three modalities (audio, visual, and depth), used alone or in combination. ASD methods fall into two broad categories: single-modality ASD and multi-modality ASD. The review covers the most common configurations, namely audio-based ASD (A-ASD), visual-based ASD (V-ASD), audio-visual ASD (AV-ASD), and audio-visual-depth ASD (AVD-ASD). Each is described in detail, including model-based and neural network-based approaches. Finally, challenges and future research opportunities are highlighted to support the broader use of ASD.
Keywords