Journal of Electrical and Computer Engineering (Jan 2024)

Convolutional Neural Networks to Facilitate the Continuous Recognition of Arabic Speech with Independent Speakers

  • Sally A. Sayed,
  • Rania Ahmed Abdel Azeem Abul Seoud,
  • Howida Y. Abdel Naby

DOI
https://doi.org/10.1155/2024/4976944
Journal volume & issue
Vol. 2024

Abstract

Automatic speech recognition (ASR) is the field of research concerned with enabling computers to process and interpret human speech with the highest possible recognition accuracy. Speech is one of the simplest ways to convey a message, and as human-to-machine interfaces continue to evolve, speech recognition is expected to become increasingly important. However, Arabic has distinct features that set it apart from other languages, such as its dialects and the pronunciation of words. Until now, insufficient attention has been devoted to continuous Arabic speech recognition for independent speakers with a limited database. This research proposes two techniques for the recognition of Arabic speech. The first combines convolutional neural network (CNN) and long short-term memory (LSTM) encoders with an attention-based decoder. The second is based on the Sphinx-4 recognizer, which includes PocketSphinx, SphinxBase, and SphinxTrain, with various types and numbers of extracted features (filter bank and mel frequency cepstral coefficients (MFCC)), and relies on the CMU Sphinx tool, which generates a language model for different sentences spoken by different speakers. Both approaches were tested on a dataset containing 7 hours of spoken Arabic from 11 Arab countries covering the Levant, Gulf, and African regions of the Arab world, and achieved promising results. The CNN-LSTM model achieved a word error rate (WER) of 3.63% using 120 filter bank features and 4.04% using 39 MFCC features, while the Sphinx-4 recognizer achieved a WER of 8.17% (an accuracy of 91.83%) using 25 MFCC features and 8 Gaussian mixtures when tested on the same benchmark dataset.
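
The results above are reported as word error rate, with accuracy given as its complement (100% - 8.17% = 91.83%). As a minimal sketch of how WER is conventionally computed, the following Python function counts word-level substitutions, deletions, and insertions via edit distance and divides by the reference length. The abstract does not describe the authors' scoring tool, so the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative only: tokenization is a plain whitespace split (an
    assumption), with no text normalization applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Accuracy taken as 1 - WER, matching how the abstract pairs the two figures."""
    return 1.0 - word_error_rate(reference, hypothesis)
```

For example, scoring the hypothesis "a x c" against the reference "a b c d" gives one substitution and one deletion over four reference words, so the WER is 0.5. Because insertions are counted, WER can exceed 1, in which case accuracy defined this way becomes negative.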