IEEE Access (Jan 2023)
Using Character-Level Sequence-to-Sequence Model for Word Level Text Generation to Enhance Arabic Speech Recognition
Abstract
Owing to the linguistic richness of Arabic, which has more than 6000 roots, building a reliable language model for Arabic speech recognition systems faces many challenges. This paper introduces a language-model-free automatic speech recognition (ASR) system for Modern Standard Arabic, based on the end-to-end Deep Speech architecture developed by Mozilla. The proposed system uses a character-level sequence-to-sequence model to map the character alignment produced by the recognizer onto the corresponding words. The developed system outperformed recent studies on both single-speaker and multi-speaker Arabic speech recognition using two state-of-the-art datasets. The first was the Arabic Multi-Genre Broadcast (MGB2) corpus, which contains 1200 h of audio from multiple speakers. On this corpus, the system set a new milestone in the MGB2 challenge with a word error rate (WER) of 3.2%, reducing word errors by 17% compared with related work using the same corpus. In a second experiment, the 7-hour Saudi Accent Single Speaker Corpus (SASSC) was used to build a single-speaker (male) Arabic speech recognition model with the same network architecture. This single-speaker model outperformed related experiments with a WER of 4.25%, a relative improvement of 33.8%.
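The results above are reported as WER and as relative improvements, but the abstract does not say how they were scored. The sketch below assumes the standard definition, WER = (S + D + I) / N, where S, D, and I are the word substitutions, deletions, and insertions in a minimum edit-distance alignment and N is the number of reference words. It also assumes that "relative improvement" means (baseline WER - new WER) / baseline WER. The function names `word_error_rate` and `relative_reduction` are hypothetical illustrations, not the authors' scoring code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Assumed standard WER: (S + D + I) / N via word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the first i-1 reference words and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution (cost 1) or match (cost 0)
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, new: float) -> float:
    """Assumed definition: relative WER reduction as a fraction of the baseline."""
    return (baseline - new) / baseline


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the): 2 errors / 6 words
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because WER is normalized by the reference length only, insertions can push it above 100%. This is one reason results such as the 3.2% and 4.25% above are usually compared through relative reductions rather than raw differences.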
Keywords