IEEE Access (Jan 2023)

Multi-Feature and Multi-Modal Mispronunciation Detection and Diagnosis Method Based on the Squeezeformer Encoder

  • Shen Guo,
  • Zaokere Kadeer,
  • Aishan Wumaier,
  • Liejun Wang,
  • Cong Fan

DOI
https://doi.org/10.1109/ACCESS.2023.3278837
Journal volume & issue
Vol. 11
pp. 66245 – 66256

Abstract

Advances in deep learning in recent years have spurred research on end-to-end mispronunciation detection and diagnosis (MDD). Most end-to-end MDD methods are based on the CNN-RNN-CTC network structure. To improve the performance of end-to-end MDD systems, this paper proposes an end-to-end multi-feature and multi-modal MDD method based on the Squeezeformer encoder. The model uses Squeezeformer as the audio encoder, a Bi-LSTM network as the phoneme encoder, and a Transformer as the decoder. The model fuses phoneme information before speech encoding and again before decoding; incorporating phoneme information in the encoding process lets the model learn the intrinsic characteristics of the speaker’s pronunciation content. Decoding uses a secondary decoding mechanism: the sequence decoded by the model is sent back to the decoder and decoded again, which addresses the lack of a priori knowledge at the decoder in the first decoding stage and thereby improves MDD performance. Experiments were conducted on the PSC-Reading Mandarin mispronunciation detection and diagnosis dataset. Compared with the baseline model, the F1 score improved from 0.4060 to 0.7943, and the diagnostic accuracy improved from 83.93% to 88.45%.
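For readers unfamiliar with the two reported metrics, the sketch below shows how an F1 score and a diagnostic accuracy are typically computed in MDD evaluation. It assumes the hierarchical convention common in MDD work, in which a "true rejection" is a mispronounced phone correctly flagged, a "false rejection" is a correct phone wrongly flagged, and a "false acceptance" is a mispronounced phone that was missed; the abstract does not state the paper's exact definitions. The function name, parameter names, and counts are hypothetical and are not results from the paper.

```python
def mdd_metrics(true_rejections: int, false_rejections: int,
                false_acceptances: int, correct_diagnoses: int) -> dict:
    """Compute detection F1 and diagnostic accuracy from phone-level counts.

    Assumed convention (not confirmed by the abstract):
      precision = TR / (TR + FR)
      recall    = TR / (TR + FA)
      diagnostic accuracy = correct diagnoses / TR
    """
    tr, fr, fa = true_rejections, false_rejections, false_acceptances
    # Guard against empty denominators so the function never divides by zero.
    precision = tr / (tr + fr) if (tr + fr) else 0.0
    recall = tr / (tr + fa) if (tr + fa) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    # Diagnosis is only scored on phones correctly flagged as mispronounced.
    diagnostic_accuracy = correct_diagnoses / tr if tr else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "diagnostic_accuracy": diagnostic_accuracy,
    }


# Illustrative, made-up counts (not taken from the paper).
example = mdd_metrics(true_rejections=80, false_rejections=20,
                      false_acceptances=20, correct_diagnoses=60)
```

Because F1 is the harmonic mean of precision and recall, it can rise sharply when the weaker of the two improves, while diagnostic accuracy, which is scored only on detected errors, may move much less.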

Keywords