IEEE Access (Jan 2024)

Identification of Non-Speaking and Minimal-Speaking Individuals Using Nonverbal Vocalizations

  • Van-Thuan Tran,
  • Wei-Ho Tsai

DOI
https://doi.org/10.1109/ACCESS.2024.3398584
Journal volume & issue
Vol. 12
pp. 68954 – 68967

Abstract

Speech remains a prevalent mode of communication and powers various intelligent functions in human-computer interaction applications, notably Speaker/Person Identification (PID) systems. However, a considerable population of Non-speaking and Minimal-speaking (NMS) individuals relies heavily on nonverbal vocalizations for communication, and existing speech-based PID systems may not be suitable for users from this community. This study investigates the use of nonverbal vocalizations to identify NMS subjects, termed NMS-PID, and explores the feasibility of an identification system, S-NMS-PID, that accommodates both speaking users (with speech input) and NMS users (with nonverbal-vocalization input). Using the recently published ReCANVo dataset of NMS nonverbal vocalizations and our own speech dataset, experiments with multiple networks and acoustic features show promising results for NMS-PID and S-NMS-PID, with average accuracies ranging from 70% to 92%. The proposed convolutional recurrent neural network (CRNN)-based model, despite its smaller size, achieves results nearly on par with much deeper models such as VGG16 and ResNet50. Our findings also highlight the efficacy of Mel-frequency cepstral coefficient (MFCC) features compared with spectrogram features. Furthermore, a two-step training strategy, in which supervised contrastive learning is used for representation learning and is followed by fine-tuning with cross-entropy loss, significantly improves robustness and accuracy, particularly when classifying data from minority classes. These outcomes hold potential for tailoring human-computer interaction applications specifically to NMS users. Implementing NMS-PID and S-NMS-PID in security and authentication processes could support secure and reliable user identification across diverse platforms, without relying solely on speech-based methods.
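
To illustrate the first step of the two-step training strategy described in the abstract, the following is a minimal sketch of a supervised contrastive loss in plain Python. It follows the commonly used formulation in which each anchor is pulled toward all other samples that share its label and pushed away from the rest. The abstract does not give the authors' implementation, so the function names (`l2_normalize`, `supcon_loss`), the temperature of 0.1, and the toy 2-D embeddings are all assumptions made for illustration only.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length (assumes v is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss over one batch (illustrative sketch).

    embeddings: list of equal-length float lists, one per sample
    labels:     list of class ids (here: hypothetical speaker ids)
    Anchors with no same-label partner in the batch are skipped.
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, n_anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue
        # Temperature-scaled cosine similarity to every other sample.
        sims = {
            a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
            for a in range(n)
            if a != i
        }
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        # Average negative log-probability over this anchor's positives.
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        n_anchors += 1
    return total / n_anchors if n_anchors else 0.0


# Toy example: two tight clusters in 2-D.
toy = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]
loss_matching = supcon_loss(toy, [0, 0, 1, 1])  # labels agree with clusters
loss_mismatch = supcon_loss(toy, [0, 1, 0, 1])  # labels cut across clusters
```

In this toy batch, `loss_matching` comes out far smaller than `loss_mismatch`, which is the behaviour the loss is designed to reward. In a two-step setup of the kind the abstract describes, an encoder would first be trained with a loss of this form and then fine-tuned together with a classifier head under cross-entropy; that second step is not shown here.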

Keywords