IEEE Transactions on Neural Systems and Rehabilitation Engineering (Jan 2023)

Multi-Stage Audio-Visual Fusion for Dysarthric Speech Recognition With Pre-Trained Models

  • Chongchong Yu,
  • Xiaosu Su,
  • Zhaopeng Qian

DOI: https://doi.org/10.1109/tnsre.2023.3262001
Journal volume & issue: Vol. 31, pp. 1912–1921

Abstract


Dysarthric speech recognition helps speakers with dysarthria communicate more effectively. However, dysarthric speech is difficult to collect, so machine learning models cannot be trained on it sufficiently. To further improve the accuracy of dysarthric speech recognition, we propose a Multi-stage AV-HuBERT (MAV-HuBERT) framework that fuses the visual and acoustic information of dysarthric speech. In the first stage, we use a convolutional neural network to encode motor information from all facial speech-function areas, unlike traditional audio-visual fusion frameworks that rely solely on lip movement. In the second stage, we use the AV-HuBERT framework to pre-train the recognition architecture that fuses the audio and visual information of dysarthric speech; the knowledge gained by the pre-trained model is then applied to mitigate overfitting. Experiments on UASpeech were designed to evaluate the proposed method. Compared with the baseline, the best word error rate (WER) of our method was reduced by 13.5% on moderate dysarthric speech. For mild dysarthric speech, our method achieves the best result, with a WER of 6.05%. Even for extremely severe dysarthric speech, our method reaches a WER of 63.98%, reductions of 2.72% and 4.02% compared with the WERs of wav2vec and HuBERT, respectively. The proposed method can thus effectively further reduce the WER on dysarthric speech.
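
Below is a minimal PyTorch sketch of the two-stage idea the abstract describes: a CNN that encodes per-frame motion from full-face video (stage one), whose features are fused frame by frame with acoustic features inside a Transformer backbone (stage two). All module names, layer sizes, and the concatenation-based fusion are illustrative assumptions made for exposition; this is not the authors' implementation, which builds on the actual pre-trained AV-HuBERT model.

    # Illustrative sketch only; dimensions and fusion strategy are assumptions.
    import torch
    import torch.nn as nn

    class VisualMotionEncoder(nn.Module):
        """Stage 1 (sketch): a small 3D CNN over full-face video crops,
        rather than lip-only crops, producing one feature per frame."""
        def __init__(self, embed_dim=256):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv3d(3, 32, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
                nn.ReLU(),
                nn.Conv3d(32, 64, kernel_size=3, stride=(1, 2, 2), padding=1),
                nn.ReLU(),
                nn.AdaptiveAvgPool3d((None, 1, 1)),  # pool space, keep the time axis
            )
            self.proj = nn.Linear(64, embed_dim)

        def forward(self, video):            # video: (B, 3, T, H, W)
            feats = self.conv(video)         # (B, 64, T, 1, 1)
            feats = feats.squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, 64)
            return self.proj(feats)          # (B, T, embed_dim)

    class AudioVisualFusion(nn.Module):
        """Stage 2 stand-in: concatenate per-frame audio and visual features
        and pass them through a Transformer encoder. In the real system this
        backbone would be initialized from AV-HuBERT pre-training."""
        def __init__(self, audio_dim=80, embed_dim=256, vocab_size=1000):
            super().__init__()
            self.audio_proj = nn.Linear(audio_dim, embed_dim)
            self.visual_enc = VisualMotionEncoder(embed_dim)
            layer = nn.TransformerEncoderLayer(d_model=2 * embed_dim, nhead=8, batch_first=True)
            self.backbone = nn.TransformerEncoder(layer, num_layers=4)
            self.head = nn.Linear(2 * embed_dim, vocab_size)

        def forward(self, audio, video):     # audio: (B, T, audio_dim)
            a = self.audio_proj(audio)       # (B, T, embed_dim)
            v = self.visual_enc(video)       # (B, T, embed_dim)
            fused = torch.cat([a, v], dim=-1)        # frame-level early fusion
            return self.head(self.backbone(fused))  # per-frame logits

    # Toy shapes: 2 clips, 40 frames, 96x96 face crops, 80-dim filterbanks.
    logits = AudioVisualFusion()(torch.randn(2, 40, 80), torch.randn(2, 3, 40, 96, 96))
    print(logits.shape)  # torch.Size([2, 40, 1000])

The sketch highlights the design point the abstract makes: the visual branch sees the whole face, so articulatory motion outside the lips can contribute to recognition, and fusion happens before the shared backbone so pre-training can shape the joint representation.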

Keywords