IEEE Access (Jan 2024)
Development of a Geographical Question-Answering System in the Kazakh Language
Abstract
This study presents a framework for developing a question-answering (QA) system for the Kazakh language, a low-resource language (LRL) for which few digital text-processing tools exist. The system targets geographical questions about Kazakhstan, with the aim of making information about the country's geography more accessible. The work addresses the challenges of LRL text processing in three phases: building a question-answer corpus, training a model based on Bidirectional Encoder Representations from Transformers (BERT), and evaluating the system with the Bilingual Evaluation Understudy (BLEU) metric. In the first phase, a corpus of 50,000 questions is compiled to support the subsequent phases. In the second phase, a BERT model with 91,821,056 parameters is trained on this corpus to capture the linguistic characteristics of Kazakh. In the final phase, the system is evaluated with BLEU and achieves an average score of 0.9576, indicating close agreement between the system-generated answers and the reference answers on queries about Kazakh geography. The study contributes a systematic approach to QA system development for a low-resource language and supports the model's effectiveness through evaluation and comparative analysis.
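The averaged BLEU score reported above can be illustrated with a minimal sketch. The abstract does not say which BLEU variant, tokenization, or smoothing was used, so everything here is an assumption: whitespace tokenization, a single reference per answer, uniform n-gram weights, and an n-gram order capped at the candidate length so that one-word answers are not scored as zero. The function names and the sample answers are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference, candidate, max_n=4):
    """BLEU for one candidate against one reference (no smoothing)."""
    ref, cand = reference.split(), candidate.split()
    if not cand:
        return 0.0
    # Assumption: cap the order at the candidate length so short answers
    # are still scorable; standard BLEU-4 does not do this.
    order = min(max_n, len(cand))
    log_precision_sum = 0.0
    for n in range(1, order + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clipped overlap: each candidate n-gram counts at most as often
        # as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if overlap == 0:
            return 0.0
        log_precision_sum += math.log(overlap / sum(cand_counts.values()))
    # Brevity penalty for candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_precision_sum / order)


def average_bleu(pairs):
    """Mean sentence-level BLEU over (reference, candidate) pairs."""
    return sum(sentence_bleu(r, c) for r, c in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical reference/system answer pairs, for illustration only.
    pairs = [("Астана", "Астана"), ("Алматы", "Астана")]
    print(average_bleu(pairs))
```

Averaging sentence-level scores is only one possible reading of "average score"; a corpus-level BLEU, which pools n-gram counts over all answers before computing precision, would generally give a different number.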
Keywords