IEEE Transactions on Neural Systems and Rehabilitation Engineering (Jan 2025)

End-to-End Mandarin Speech Reconstruction Based on Ultrasound Tongue Images Using Deep Learning

  • Fengji Li,
  • Fei Shen,
  • Ding Ma,
  • Jie Zhou,
  • Shaochuan Zhang,
  • Li Wang,
  • Fan Fan,
  • Tao Liu,
  • Xiaohong Chen,
  • Tomoki Toda,
  • Haijun Niu

DOI
https://doi.org/10.1109/TNSRE.2024.3520498
Journal volume & issue
Vol. 33
pp. 140 – 149

Abstract

The loss of speech function following a laryngectomy usually causes laryngectomees severe physiological and psychological distress. In clinical practice, most laryngectomees retain intact upper-tract articulatory organs, which underscores the value of speech rehabilitation that uses articulatory motion information to restore speech. This study proposes a deep learning-based end-to-end method for reconstructing speech from ultrasound tongue images. First, ultrasound tongue images and speech data were collected simultaneously using a purpose-designed Mandarin corpus. Next, a speech reconstruction model was built based on adversarial neural networks. The model comprises a pretrained feature extractor that processes the ultrasound images, an upsampling block that generates speech, and discriminators that ensure the similarity and fidelity of the reconstructed speech. Finally, the reconstructed speech was evaluated both objectively and subjectively. It showed high intelligibility for both Mandarin phonemes and tones: the character error rate of phonemes in automatic speech recognition was 0.2605, and the tone error rate obtained from dictation tests was 0.1784. Objective results showed high similarity between the reconstructed and ground-truth speech, and subjective perception results indicated an acceptable level of naturalness. The proposed method can therefore reconstruct tonal Mandarin speech from ultrasound tongue images. Future research should focus on the specific conditions of laryngectomees and on improving model performance by enlarging the training datasets, investigating the impact of ultrasound tongue imaging parameters, and further refining the method.
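
The character and tone error rates reported above are edit-distance metrics. The abstract does not state which recognizer or scoring tool was used, so the following is only a minimal sketch of the standard definition (total Levenshtein distance between reference and hypothesis sequences, divided by total reference length). The function names and the toy sequences are hypothetical and are not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Insertions, deletions, and substitutions each cost 1.
    Works on strings (characters) or lists (e.g. tone labels).
    """
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j items of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(references, hypotheses):
    """Corpus-level error rate: total edits / total reference length."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have equal length")
    total_len = sum(len(r) for r in references)
    if total_len == 0:
        raise ValueError("references must contain at least one symbol")
    total_edits = sum(edit_distance(r, h)
                      for r, h in zip(references, hypotheses))
    return total_edits / total_len


# Toy example (hypothetical data): one substitution in four characters.
cer = error_rate(["abcd"], ["abxd"])

# The same function applies to tone sequences (Mandarin tones 1 to 4),
# here with one wrong tone out of four.
ter = error_rate([[1, 2, 3, 4]], [[1, 2, 4, 4]])
```

Under this definition, a character error rate of 0.2605 corresponds to roughly 26 character edits per 100 reference characters.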

Keywords