Speaker Diarization and Identification From Single Channel Classroom Audio Recordings Using Virtual Microphones

Antonio Gomez; Marios S. Pattichis; Sylvia Celedon-Pattichis

doi:10.1109/ACCESS.2022.3177584

IEEE Access (Jan 2022)

Affiliations

Antonio Gomez: Department of Electrical and Computer Engineering, The University of New Mexico, Albuquerque, NM, USA
Marios S. Pattichis: Department of Electrical and Computer Engineering, The University of New Mexico, Albuquerque, NM, USA
Sylvia Celedon-Pattichis: Department of Curriculum and Instruction, The University of Texas, Austin, TX, USA

DOI: https://doi.org/10.1109/ACCESS.2022.3177584
Journal volume & issue: Vol. 10, pp. 56256–56266

Abstract

Speaker diarization refers to methods for identifying speakers from audio recordings. An important application is assessing student interactions in collaborative learning environments. Diarization is very difficult in these environments, where a single microphone records multiple voices. Despite advances in the field, little progress has been made for noisy and disorganized multi-speaker environments. Current state-of-the-art methods based on deep learning require large training databases and can suffer from significant noise interference and from bias due to the speaker’s accent, age, and gender. In this paper, we propose a new method for identifying speakers that does not require large training sets. To this end, we use a virtual array of microphones. The signal at the virtual microphones is simulated by extracting the spatial information of the speakers from a single-channel audio recording, using approximate speaker geometry observed from a video recording. The Room Impulse Responses (RIRs) at the virtual microphones are then estimated using acoustic scene simulations and used to compute a cross-correlation matrix of possible audio sources. Speaker diarization is performed using the cross-correlation matrices as input to a classifier. For the task of identifying active student speakers in classroom audio, the proposed method significantly outperformed the diarization services provided by Google Cloud and Amazon AWS.
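
The abstract gives the pipeline only in outline, so the following is a loose illustrative sketch and not the authors' implementation. It assumes a toy free-field impulse response (a single delayed, attenuated spike) in place of the acoustic scene simulation, a zero-lag normalized cross-correlation, and hypothetical names (`free_field_rir`, `virtual_mic_signals`, `cross_correlation_matrix`). The classifier stage is omitted.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def free_field_rir(src, mic, fs, length=512):
    """Toy impulse response (assumption): one delayed, 1/r-attenuated spike.

    A real acoustic scene simulation would also model wall reflections.
    """
    dist = float(np.linalg.norm(np.asarray(src, float) - np.asarray(mic, float)))
    delay = int(round(dist / SPEED_OF_SOUND * fs))
    h = np.zeros(length)
    if delay < length:
        h[delay] = 1.0 / max(dist, 1e-3)
    return h


def virtual_mic_signals(mono, src, mics, fs):
    """Simulate what each virtual microphone would record if the mono
    segment had been emitted from the candidate speaker position `src`."""
    return np.stack(
        [np.convolve(mono, free_field_rir(src, m, fs))[: len(mono)] for m in mics]
    )


def cross_correlation_matrix(signals):
    """Zero-lag normalized cross-correlation between all virtual-mic pairs."""
    return np.corrcoef(signals)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fs = 16000
    mono = rng.standard_normal(4000)  # stand-in for a single-channel segment
    mics = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (0.0, 0.5, 0.0)]  # hypothetical layout
    speaker = (1.0, 1.0, 0.0)  # approximate position, e.g. read off a video frame
    features = cross_correlation_matrix(virtual_mic_signals(mono, speaker, mics, fs))
    print(features.shape)
```

In a sketch like this, one matrix would be computed per candidate speaker position, and the flattened matrices would serve as the feature vectors passed to a classifier of the reader's choice.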

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
