EURASIP Journal on Audio, Speech, and Music Processing (Oct 2024)

UTran-DSR: a novel transformer-based model using feature enhancement for dysarthric speech recognition

  • Usama Irshad,
  • Rabbia Mahum,
  • Ismaila Ganiyu,
  • Faisal Shafique Butt,
  • Lotfi Hidri,
  • Tamer G. Ali,
  • Ahmed M. El-Sherbeeny

DOI
https://doi.org/10.1186/s13636-024-00368-0
Journal volume & issue
Vol. 2024, no. 1
pp. 1 – 18

Abstract

Over the past decade, the prevalence of neurological diseases has risen significantly due to population growth and aging. Individuals with spastic paralysis, stroke, and idiopathic Parkinson’s disease (PD), among other neurological illnesses, commonly suffer from dysarthria. Early detection and treatment of dysarthria in these patients are essential for effectively managing the progression of their disease. This paper presents UTran-DSR, a novel encoder-decoder architecture that analyzes mel-spectrograms (generated from audio recordings) and classifies speech as healthy or dysarthric. Our model employs transformer encoder features based on a hybrid design that includes the feature enhancement block (FEB) and vision transformer (ViT) encoders. This combination effectively extracts global and local pixel information regarding localization while optimizing the mel-spectrogram feature extraction process. We retain the original class-token grouping sequence of the vision transformer while generating a new, equivalent expanding path. More specifically, two distinct expanding paths use a deep-supervision approach to improve spatial data recovery and speed up model convergence. We add consecutive residual connections to the system to reduce feature loss while improving spatial data recovery. Our technique is based on identifying the gaps in mel-spectrograms that distinguish normal from dysarthric speech. We conducted several experiments on UTran-DSR using the UA-Speech and TORGO datasets, and it outperformed the existing top models. The model performed strongly in extracting localized pixel-level and spatial features, effectively detecting and classifying spectral gaps. UTran-DSR outperforms models from previous research, achieving an accuracy of 97.75%.

Keywords