IEEE Access (Jan 2024)
Improving Automatic Forced Alignment for Phoneme Segmentation in Quranic Recitation
Abstract
Segmentation plays a crucial role in speech processing applications, where high accuracy is essential. The quest for improved accuracy in automatic segmentation, particularly in the context of the Arabic language, has garnered substantial attention. However, the differences between Qur’an recitation and normal Arabic speech, especially with regard to intonation rules affecting the lengthening of long vowels, pose challenges in segmentation especially for Qur’an recitation. This research endeavors to address these challenges by delving into the domain of automatic segmentation for Qur’an recitation recognition. The proposed scheme employs a hidden Markov models (HMMs) forced alignment algorithm. To enhance the precision of segmentation, several refinements have been introduced, with a primary emphasis on the phonetic model of the Qur’an and Tajweed, particularly the intricate rules governing elongation. These enhancements encompass the adaptation of an acoustic model tailored for Qur’anic recitation as preprocessing and culminate in the development of an algorithm aimed at refining forced alignment based on the phonetic nuances of the Qur’an. These enhancements are seamlessly integrated as post-processing components for the classic HMM-based forced alignment. The research utilizes a comprehensive database featuring recordings from 100 renowned Qur’an reciters, encompassing the recitation of 21 Qur’anic verses (Ayat). Additionally, 30 reciters were asked to record the same verses, incorporating various recitation speed patterns. To facilitate the evaluation process, a Random sample of the Qur’anic database was manually segmented, comprised 21 Ayats, totaling 19,800 words, with 89 unique words (14 verses x 3 recitation levels: fast, slow and normal x 6 readers). The outcomes of this study manifest notable advancements in the alignment of long vowels within Qur’an recitation, all while maintaining the precise alignment of vowels and consonants. Objective comparisons between the proposed automatic methods and manual segmentation were conducted to ascertain the superior approach. The findings affirm that the classic forced alignment method produces satisfactory outcomes when employed on verses lacking long vowels. However, its performance diminishes when confronted with verses containing long vowels. Therefore, the test samples were categorized into three groups based on the presence of long vowels, resulting in a Correct Classification Rate (CCR) that ranged from 6% to 57%, contingent on whether the verse includes long vowels or not. The average CCR across all test samples was 23%. In contrast, the proposed algorithm significantly enhances audio segmentation. It achieved CCR values ranging from 16% to 70% within the same database categories, with an average CCR of 45% across all test samples. This marks a notable advancement of 22% in segmented speech accuracy, particularly within a 30 ms tolerance, for verses containing long vowels.
Keywords