Fine-Tuning Self-Supervised Learning Models for End-to-End Pronunciation Scoring

Ahmed I. Zahran; Aly A. Fahmy; Khaled T. Wassif; Hanaa Bayomi

doi:10.1109/access.2023.3317236

IEEE Access (Jan 2023)

Affiliations

Ahmed I. Zahran: Faculty of Computers and Artificial Intelligence, Cairo University, Giza, Orman, Egypt
Aly A. Fahmy: Faculty of Computers and Artificial Intelligence, Cairo University, Giza, Orman, Egypt
Khaled T. Wassif: Faculty of Computers and Artificial Intelligence, Cairo University, Giza, Orman, Egypt
Hanaa Bayomi: Faculty of Computers and Artificial Intelligence, Cairo University, Giza, Orman, Egypt

Journal volume & issue: Vol. 11, pp. 112650–112663

Abstract

Automatic pronunciation assessment models are regularly used in language learning applications. Common methodologies for pronunciation assessment use feature-based approaches, such as the Goodness-of-Pronunciation (GOP) approach, or deep learning speech recognition models to perform speech assessment. With the rise of transformers, pre-trained self-supervised learning (SSL) models have been used to extract contextual speech representations, showing improvements in various downstream tasks. In this study, we propose the end-to-end regressor (E2E-R) model for pronunciation scoring. E2E-R is trained using a two-step process. In the first step, the pre-trained SSL model is fine-tuned on a phoneme recognition task to obtain better representations of the pronounced phonemes. In the second step, transfer learning is used to build a pronunciation scoring model that uses a Siamese neural network to compare the pronounced phoneme representations to embeddings of the canonical phonemes and produce the final pronunciation scores. E2E-R achieves a Pearson correlation coefficient (PCC) of 0.68, which is comparable to that of the state-of-the-art GOPT-PAII model, while eliminating the need for training on additional native speech data, feature engineering, or external forced alignment modules. To our knowledge, this work presents the first use of a pre-trained SSL model for end-to-end phoneme-level pronunciation scoring on raw speech waveforms. The code is available at https://github.com/ai-zahran/E2E-R.
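
As a rough illustration of two ideas named in the abstract, the following Python sketch shows (a) how a pronounced phoneme representation might be compared to a canonical phoneme embedding and (b) how a PCC between predicted and human scores is computed. It is not the authors' implementation: the abstract does not state which comparison function the Siamese network uses, so cosine similarity is an assumption here, and the function names and toy vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors.

    Assumption: used here as a stand-in for the comparison step of a
    Siamese-style scorer; the abstract does not specify the actual function.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def pearson_correlation(predicted, reference):
    """Pearson correlation coefficient between two score sequences."""
    if len(predicted) != len(reference) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_r = sum(reference) / n
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    if var_p == 0 or var_r == 0:
        raise ValueError("PCC is undefined when either sequence is constant")
    return cov / math.sqrt(var_p * var_r)


if __name__ == "__main__":
    # Hypothetical toy vectors standing in for a pronounced phoneme
    # representation and a canonical phoneme embedding.
    pronounced = [0.9, 0.1, 0.3]
    canonical = [1.0, 0.0, 0.2]
    print("similarity:", cosine_similarity(pronounced, canonical))

    # Hypothetical predicted scores versus human reference scores.
    predicted_scores = [1.8, 1.1, 0.4, 1.9]
    human_scores = [2.0, 1.0, 0.0, 2.0]
    print("PCC:", pearson_correlation(predicted_scores, human_scores))
```

In this sketch a higher similarity stands for a pronunciation closer to the canonical phoneme, and the PCC measures how closely a set of predicted scores tracks human ratings, which is the metric the abstract reports.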

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639