IEEE Access (Jan 2024)

Improving Performance of Neural IR Models by Using a Keyword-Extraction-Based Weak-Supervision Method

  • Suehyun Chang,
  • Geun-Jin Ahn,
  • Sungbum Park

DOI
https://doi.org/10.1109/ACCESS.2024.3382190
Journal volume & issue
Vol. 12
pp. 46851 – 46863

Abstract

Recently, the efficiency of neural information retrieval (IR) models has improved significantly. However, technical challenges remain, such as the data bottleneck problem: in real-world scenarios, only documents without related queries are available for training neural IR models. Existing studies propose synthetic queries derived from target passages using trained query generation models, which themselves require query-document (q-d) pair data from other domains for training. Our research introduces an integrated keyword-extraction-driven data augmentation method with weakly supervised learning. We derive keywords from passages in a corpus to generate pseudo-queries. Using established weakly supervised learning methods, we then estimate the relevance between these pseudo-queries and passages to produce pseudo-labels. Our approach demonstrates that keyword extraction techniques can efficiently formulate queries and train neural IR systems, outperforming the existing synthetic query generation method. Specifically, the performance of models trained with pseudo-labels closely approximates that of models trained with ground-truth data, underscoring the potential of pseudo-labeling approaches as effective alternatives in the absence of extensive ground-truth data. Code and related materials are available on GitHub at https://github.com/guenjinahn/hoseo-cedr.
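
The two-step idea in the abstract (keywords become pseudo-queries, then a weak scorer assigns pseudo-labels) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the keyword extractor or the weak supervision method, so the sketch assumes TF-IDF term ranking for extraction and a simple term-overlap count as the weak labeler. The function names `extract_keywords` and `weak_label`, the toy corpus, and the `top_k`/`top_n` parameters are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def extract_keywords(passages, top_k=3):
    """Build one pseudo-query per passage from its top_k TF-IDF terms.

    Assumption: TF-IDF stands in for whatever keyword extractor is used
    in the paper.
    """
    docs = [tokenize(p) for p in passages]
    n = len(docs)
    # Document frequency: in how many passages each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    queries = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF stays positive because df[t] <= n.
        scores = {t: tf[t] * math.log((n + 1) / (df[t] + 0.5)) for t in tf}
        # Highest score first; ties broken alphabetically for determinism.
        top = sorted(scores, key=lambda t: (-scores[t], t))[:top_k]
        queries.append(" ".join(top))
    return queries


def weak_label(query, passages, top_n=1):
    """Return a 0/1 pseudo-label per passage for the given pseudo-query.

    Assumption: a plain count of query-term occurrences stands in for an
    established weak supervision signal. The top_n passages are labeled 1.
    """
    q_terms = set(tokenize(query))
    scored = []
    for idx, passage in enumerate(passages):
        tf = Counter(tokenize(passage))
        scored.append((sum(tf[t] for t in q_terms), idx))
    # Rank by score descending, then by passage index for stable ties.
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    relevant = {idx for _, idx in ranked[:top_n]}
    return [1 if i in relevant else 0 for i in range(len(passages))]


# Toy corpus (hypothetical): documents with no human-written queries.
passages = [
    "BM25 ranks documents with term frequency and inverse document frequency.",
    "Neural rankers use transformer encoders to score query passage pairs.",
    "Keyword extraction selects salient terms from a passage.",
]

pseudo_queries = extract_keywords(passages)
pseudo_labels = [weak_label(q, passages) for q in pseudo_queries]
```

Each `(pseudo_queries[i], passages[j], pseudo_labels[i][j])` triple could then serve as a training example for a neural ranker in place of human-labeled q-d pairs.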

Keywords