Natural Language Processing Journal (Dec 2024)
Recent advancements in automatic disordered speech recognition: A survey paper
Abstract
Automatic Speech Recognition (ASR) technology has recently witnessed a paradigm shift in recognition accuracy. Nevertheless, impaired speech remains a significant challenge, as evidenced by the inadequate accuracy of existing ASR solutions on such speech. This shortcoming is documented in various research reports. While it has motivated new directions in Automatic Disordered Speech Recognition (ADSR), the gap between ASR accuracy and ADSR accuracy remains significant. In this paper, we present a consolidated account of the research conducted to date to address this gap, highlighting the root causes of the performance discrepancy and discussing prominent research directions in this area. The paper raises some fundamental issues and challenges that ADSR research faces today. First, we discuss how adequately impaired speech is represented in existing datasets, in terms of the diversity of speech impairments, speech continuity, speech style, vocabulary, age group, and data collection environment. We argue that disordered speech is poorly represented in existing datasets, so several fundamental components needed for training ADSR models are likely absent. Most open-access databases of impaired speech focus on adult dysarthric speakers, ignoring a wide spectrum of speech disorders and age groups. Furthermore, the paper reviews prominent research directions adopted by the ADSR research community in its effort to advance speech recognition technology for impaired speakers. We categorize this research effort into directions such as personalized models, model adaptation, data augmentation, and multi-modal learning. Although these research directions have advanced the performance of ADSR models, we believe there is still potential for further advancement, since current efforts, in essence, make the false assumption that the distribution shift between the source and target data is limited.
Finally, we stress the need to investigate performance measures other than Word Error Rate (WER), namely measures that can reliably capture how much each erroneous output token affects the final uttered message.
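To make the concern about WER concrete, the following is a minimal sketch of the standard word-level definition, WER = (substitutions + deletions + insertions) / number of reference words, computed with a Levenshtein alignment. The function name `word_error_rate` and the example sentences are illustrative assumptions and are not taken from the surveyed work.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: two outputs with one word error each.
reference = "please call my doctor now"
dropped_word = "please call my doctor"          # "now" is deleted
changed_word = "please call my daughter now"    # "doctor" is substituted

print(word_error_rate(reference, dropped_word))
print(word_error_rate(reference, changed_word))
```

Both hypothetical outputs contain a single word error against a five-word reference, so they receive the same WER, even though only the second changes who is to be called. This is the kind of difference that a measure sensitive to the contribution of each erroneous token would need to reflect.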