IEEE Access (Jan 2023)
Dementia Speech Dataset Creation and Analysis in Indic Languages—A Pilot Study
Abstract
The paper describes the creation, analysis and validation of a multilingual Dementia Speech dataset for Indic languages. Three popular Indian languages viz. Telugu, Tamil and Hindi are considered for the pilot study. Dementia and associated Alzheimers disease affect a large section of Asian population. Though there are promising studies in dementia detection focussed on Western ethnicity, the absence of a clinical dementia dataset for Indian languages forms the primary motivation for this study. This pilot study aims to overcome the challenges associated with data collection and validation in a clinical setting and deal with situations wherein clinical data is not readily available. The Indic dementia dataset is an enacted non-clinical dataset created from the manual translations of the benchmark clinical English DementiaBank dataset. The dataset created is validated using features extracted from the benchmark. The feature evaluation revealed a similarity of 92.6% for silences, 92% for mean pitch (Hz), 84.7% for jitter and 90.3% for shimmer. Subjective evaluation was also conducted based on clarity and similarity of utterances with DementiaBank data. An average MOS of 3.9 for clarity of speech and 3.76 for similarity with respect to DementiaBank was obtained across all three languages. A baseline classification using state-of-art deep network architecture gave a maximum of 78% accuracy in dementia detection using the Indic dementia dataset. The pilot experimentation in this work gives promising insights into the development of a multilingual dataset for analysis of clinical speech patterns in early dementia in the Indian population.
Keywords