e-Prime: Advances in Electrical Engineering, Electronics and Energy (Mar 2024)

Improved spell corrector algorithm and deepspeech2 model for enhancing end-to-end Gujarati language ASR performance

  • Bhavesh Bhagat,
  • Mohit Dua

Journal volume & issue
Vol. 7
p. 100441

Abstract


Automatic Speech Recognition (ASR) is the process of converting audio signals into text representations of spoken words. In recent years, advances in deep learning have produced intricate architectures that considerably improve the performance of End-to-End (E2E) ASR systems. However, obtaining large quantities of training data can be difficult, particularly for low-resource languages such as Gujarati. This article describes a novel method for improving ASR performance without additional training data. The proposed method combines an enhanced spell corrector algorithm with a DeepSpeech2 model architecture that employs Bidirectional Encoder Representations from Transformers (BERT) and Gated Recurrent Units (GRU). The algorithm improves on existing decoding strategies, such as greedy or prefix beam search, by applying post-processing techniques designed specifically for the Gujarati language. The model is trained on high-quality, multi-speaker (male and female) Gujarati speech data gathered via crowd-sourcing, using the best-performing parameter values. The method reduces the overall Word Error Rate (WER) by 17.20%. In addition, the study investigates various analytic techniques for identifying errors arising from diacritics, consonants, independents, homophones, and half-conjugates. Applying these techniques, together with a deeper understanding of the Gujarati language, improves the overall performance of the ASR system.
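For readers unfamiliar with the metric, the following is a minimal sketch of how WER is conventionally computed: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the recognizer output, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the abstract does not describe the authors' own scoring procedure.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name); tokens are split on whitespace,
    which is an assumption and not taken from the paper.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("હું ઘરે જાઉં છું", "હું ઘર જાઉં છું"))
```

Under this definition, a post-processing step such as a spell corrector lowers WER whenever it turns a misrecognized word back into the reference word without changing words that were already correct.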

Keywords