An Efficient Document Retrieval for Korean Open-Domain Question Answering Based on ColBERT

Byungha Kang; Yeonghwa Kim; Youhyun Shin

doi:10.3390/app132413177

Applied Sciences (Dec 2023)

Affiliations

Byungha Kang: Department of Computer Science and Engineering, Incheon National University, Incheon 22012, Republic of Korea
Yeonghwa Kim: Department of Computer Science and Engineering, Incheon National University, Incheon 22012, Republic of Korea
Youhyun Shin: Department of Computer Science and Engineering, Incheon National University, Incheon 22012, Republic of Korea

DOI: https://doi.org/10.3390/app132413177
Journal volume & issue: Vol. 13, no. 24, p. 13177

Abstract

Open-domain question answering requires retrieving documents highly relevant to a query from a large-scale corpus. Deep learning-based dense retrieval methods have become the primary approach for finding related documents. Although these methods improve search accuracy over traditional techniques, they also considerably increase the computational burden. Research is therefore needed on efficient models and methods that optimize the trade-off between search accuracy and search time. In this paper, we propose a Korean document retrieval method that uses ColBERT’s late interaction paradigm to efficiently calculate the relevance between questions and documents. For document retrieval in open-domain Korean question answering, we construct a Korean dataset from various AI-Hub corpora. We conduct experiments comparing the search accuracy and inference time of the traditional information retrieval (IR) model BM25, a dense retrieval approach that uses BERT-based models for Korean, and our proposed method. The experimental results demonstrate that our approach achieves higher accuracy than BM25 and requires less search time than the dense retrieval method employing KoBERT. Moreover, the best performance is observed when using KoSBERT, a pre-trained Korean language model trained to position semantically similar sentences close together in vector space.
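
The late interaction scoring mentioned in the abstract can be illustrated with a minimal sketch. In ColBERT-style retrieval, the query and the document are encoded separately into per-token embeddings, and relevance is the sum, over query tokens, of each token's maximum similarity to any document token. The code below is not the authors' implementation: the function names (`late_interaction_score`, `rank_documents`) are hypothetical, and the small hand-written vectors stand in for the token embeddings a BERT-based Korean encoder would produce.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def l2_normalize(vec: Vector) -> List[float]:
    """Scale a vector to unit length so that dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return [0.0 for _ in vec]
    return [x / norm for x in vec]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def late_interaction_score(query_embs: Sequence[Vector],
                           doc_embs: Sequence[Vector]) -> float:
    """Sum over query tokens of the maximum cosine similarity to any document token."""
    if not doc_embs:
        return 0.0
    q = [l2_normalize(v) for v in query_embs]
    d = [l2_normalize(v) for v in doc_embs]
    return sum(max(dot(qv, dv) for dv in d) for qv in q)


def rank_documents(query_embs: Sequence[Vector],
                   corpus_embs: Sequence[Sequence[Vector]]) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs sorted from most to least relevant."""
    scored = [(i, late_interaction_score(query_embs, doc))
              for i, doc in enumerate(corpus_embs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy 2-dimensional "token embeddings" (assumed values, for illustration only).
    query = [[1.0, 0.0], [0.0, 1.0]]
    doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # covers both query tokens
    doc_b = [[1.0, 0.0]]                          # covers only the first query token
    for index, score in rank_documents(query, [doc_a, doc_b]):
        print(index, round(score, 3))
```

Because document token embeddings do not depend on the query, they can be computed and stored offline; at search time only the query is encoded and the inexpensive maximum-similarity step is run, which is how this paradigm reduces search time.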

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci
