Journal of King Saud University: Computer and Information Sciences (Apr 2024)

Weight Averaging and re-adjustment ensemble for QRCD

  • Esha Aftab,
  • Muhammad Kamran Malik

Journal volume & issue
Vol. 36, no. 4
p. 102037

Abstract


Question Answering (QA) is a prominent task in the field of Natural Language Processing (NLP) with extensive applications. Recently, there has been a notable surge in research interest concerning the development of QA systems for the Holy Qur’an, an Islamic religious text. The Qur’an Reading Comprehension Dataset (QRCD) (Malhas and Elsayed, 2020) is a highly commendable effort in this respect. It stands as the first benchmark dataset specifically designed for a set of directly answerable questions from the Qur’an. Each question in the dataset is meticulously labeled with all potential answers sourced from the Holy Qur’an. From our perspective, the main challenge in QRCD stems from the limited volume of training data it offers. As a solution, we propose an innovative approach to building a Deep Neural Network (DNN) ensemble, centered around the AraELECTRA model (Antoun et al., 2021), which we call the Weight Averaging and Re-adjustment (WAR) model. The model is constructed by computing running averages of all model states that evolve during a single training session and by re-adjusting the model weights prior to each training epoch, in order to hold it back from overfitting the training data. The scheme results in a single standalone model that exhibits the benefits of multi-model ensembles. This distinguishes it from other ensembles proposed for QRCD, which accumulate outputs from multiple expert models and employ classic techniques such as hard voting or score averaging over output probabilities to produce a unified result; each such expert model incurs its own training time and compute resources. The WAR model outperforms existing systems with improved generalization over unseen data. It achieves F1, partial Reciprocal Rank (pRR), and exact-match (EM) scores of 0.567, 0.60, and 0.29, exceeding the best previously reported QRCD scores by 3%, 1.5%, and 0.69%, respectively. Notably, we compare our results against the top scores achieved by different models, highlighting our model’s consistent performance across all three metrics.
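To make the training scheme concrete, below is a minimal PyTorch sketch of the weight-averaging-and-re-adjustment loop as the abstract describes it: a running average of the weight states visited during training is maintained, and the live model is reset to that average before each epoch. All names here (war_train, train_loader, loss_fn) are illustrative assumptions, and the paper's exact averaging and re-adjustment schedule may differ from this sketch.

    import torch

    def war_train(model, train_loader, optimizer, loss_fn, epochs):
        """Hypothetical sketch of Weight Averaging and Re-adjustment (WAR)."""
        # Running average of every weight state seen so far.
        avg_state = {k: v.detach().clone() for k, v in model.state_dict().items()}
        n_states = 1

        for epoch in range(epochs):
            # Re-adjustment: start each epoch from the averaged weights,
            # pulling the model back from any overfit trajectory.
            model.load_state_dict(avg_state)

            for inputs, targets in train_loader:
                optimizer.zero_grad()
                loss = loss_fn(model(inputs), targets)
                loss.backward()
                optimizer.step()

                # Fold the updated weights into the running average:
                # avg <- avg + (w - avg) / n
                n_states += 1
                with torch.no_grad():
                    for k, v in model.state_dict().items():
                        if v.is_floating_point():
                            avg_state[k] += (v - avg_state[k]) / n_states

        # The averaged weights form a single standalone "ensemble" model.
        model.load_state_dict(avg_state)
        return model

Because the output is a single set of averaged weights, inference costs the same as one model, which is the property the abstract contrasts with multi-expert voting or score-averaging ensembles.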

Keywords