IEEE Access (Jan 2022)

Speaker Diarization and Identification From Single Channel Classroom Audio Recordings Using Virtual Microphones

  • Antonio Gomez,
  • Marios S. Pattichis,
  • Sylvia Celedon-Pattichis

DOI
https://doi.org/10.1109/ACCESS.2022.3177584
Journal volume & issue
Vol. 10
pp. 56256 – 56266

Abstract

Speaker diarization refers to methods for identifying speakers from audio recordings. An important application comes from the need to assess student interactions in collaborative learning environments. Diarization is very difficult in these environments, where a single microphone is used to record multiple voices. Although there have been advancements in this field, little progress has been made in the case of noisy and disorganized multi-speaker environments. Current state-of-the-art methods based on Deep Learning require large training databases and can suffer from significant noise interference and bias due to the speaker’s accent, age, and gender. In this paper, we propose a new method to identify speakers that does not require large training sets. To this end, we use a virtual array of microphones. The signal at the virtual microphones is simulated by extracting the spatial information of the speakers from a single-channel audio recording, using approximate speaker geometry observed from a video recording. The Room Impulse Responses (RIRs) at the virtual microphones are then estimated using acoustic scene simulations. The RIRs are then used to compute a cross-correlation matrix of possible audio sources. Speaker diarization is performed using the cross-correlation matrices as input to a classifier. For the task of identifying active student speakers in classroom audio, the proposed method significantly outperformed the diarization services offered by Google Cloud and Amazon AWS.
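The pipeline summarized in the abstract (virtual microphones, simulated RIRs, correlation matrices as classifier input) can be illustrated with a small toy sketch. This is not the authors' implementation: the abstract does not specify the acoustic simulator or the exact correlation definition, so the sketch assumes a direct-path-only impulse response (a single delayed, attenuated impulse) and a zero-lag normalized correlation matrix computed with `np.corrcoef`. The names `toy_rir` and `correlation_features`, the microphone and speaker coordinates, and the synthetic test signal are all hypothetical.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def toy_rir(source_xy, mic_xy, fs, length=512):
    """Hypothetical direct-path-only impulse response (no wall reflections)."""
    dist = float(np.linalg.norm(np.subtract(source_xy, mic_xy)))
    delay = int(round(dist / SPEED_OF_SOUND * fs))  # propagation delay in samples
    h = np.zeros(length)
    if delay < length:
        h[delay] = 1.0 / max(dist, 1e-3)  # simple 1/r attenuation
    return h


def correlation_features(audio, source_xy, mics_xy, fs):
    """Filter the mono recording through each toy RIR, then correlate the results."""
    virtual = np.stack(
        [np.convolve(audio, toy_rir(source_xy, m, fs)) for m in mics_xy]
    )
    # Zero-lag normalized correlation between every pair of virtual microphones.
    return np.corrcoef(virtual)


# Synthetic stand-in for one second of single-channel audio (smoothed noise).
fs = 16000
rng = np.random.default_rng(0)
audio = np.convolve(rng.standard_normal(fs), np.ones(32) / 32, mode="same")

# Assumed geometry in metres: three virtual microphones, two candidate speakers.
mics = [(0.0, 0.0), (0.2, 0.0), (0.4, 0.0)]
feat_a = correlation_features(audio, (1.0, 0.5), mics, fs)
feat_b = correlation_features(audio, (-1.0, 2.0), mics, fs)
```

Under these assumptions, each candidate speaker position produces a different pattern of inter-microphone delays and therefore a different correlation matrix. Flattening such matrices gives feature vectors that a classifier could, in principle, use to decide which speaker is active.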

Keywords