IEEE Access (Jan 2025)
Machine Reading Comprehension for the Tamil Language With Translated SQuAD
Abstract
Machine Reading Comprehension (MRC) is a challenging task in Natural Language Processing (NLP), crucial for automated customer support, enabling chatbots and virtual assistants to accurately understand and respond to queries. It also enhances question-answering systems, benefiting educational tools, search engines, and help desks. The introduction of attention-based transformer models has significantly boosted MRC performance, especially for well-resourced languages such as English. However, MRC for low-resourced languages (LRL) remains an ongoing research area. Although Large Language Models show exceptional NLP performance, they are less effective for LRL and are expensive to train and deploy. Consequently, simpler, task-specific language models remain viable for these languages. This research examines high-performing language models on the Tamil MRC task, detailing the preparation of a Tamil-translated and processed SQuAD dataset to establish a standard dataset for Tamil MRC. The study analyzes the performance of multilingual transformer models on the Tamil MRC task, including Multilingual DistilBERT, Multilingual BERT, XLM-RoBERTa, MuRIL, and RemBERT. Additionally, it explores improving these models' performance by fine-tuning them with English SQuAD, Tamil SQuAD, and a newly developed Tamil Short Story (TSS) dataset for MRC. Tamil's agglutinative nature, which expresses grammatical information through suffixation, results in a high degree of word inflection. Given this characteristic, BERTScore was chosen as the evaluation metric for MRC performance. The analysis shows that the XLM-RoBERTa model outperformed the others for Tamil MRC, achieving a BERTScore of 86.29% on the TSS MRC test set. This superior performance is attributed to the model's cross-lingual learning capability and the larger number of data records used for fine-tuning. The research underscores the necessity of language-specific fine-tuning of multilingual models to enhance NLP task performance for low-resourced languages.
Keywords