IEEE Access (Jan 2024)
Improving Performance of Neural IR Models by Using a Keyword-Extraction-Based Weak-Supervision Method
Abstract
The efficiency of neural information retrieval (IR) models has improved significantly in recent years. However, technical challenges remain, such as the data bottleneck problem: in real-world scenarios, only documents without related queries are available for training neural IR models. Existing studies generate synthetic queries from target passages using trained query generation models, which in turn require query-document (q-d) pair data from other domains for their training. Our research introduces a data augmentation method that integrates keyword extraction with weakly supervised learning. We extract keywords from the passages in a corpus to generate pseudo-queries. Using established weakly supervised learning methods, we then estimate the relevance between these pseudo-queries and passages to produce pseudo-labels. Our approach demonstrates that keyword extraction techniques can efficiently formulate queries and train neural IR systems, outperforming the existing synthetic query generation method. Specifically, the performance of models trained with pseudo-labels closely approximates that of models trained with ground truth data, underscoring the potential of pseudo-labeling as an effective alternative when extensive ground truth data are unavailable. Code and related materials are available on GitHub at https://github.com/guenjinahn/hoseo-cedr.
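The pipeline summarized above (keywords to pseudo-queries, then weak relevance scores to pseudo-labels) can be illustrated with a minimal sketch. The abstract does not name the keyword extractor or the weak supervisor, so this sketch assumes TF-IDF term ranking for extraction and BM25 for relevance scoring; the function names `extract_keywords`, `bm25_scores`, and `build_pseudo_labels` are hypothetical and are not taken from the authors' repository.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def extract_keywords(passages, top_k=3):
    """Return the top_k TF-IDF terms of each passage (assumed extractor)."""
    docs = [tokenize(p) for p in passages]
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    keywords = []
    for doc in docs:
        tf = Counter(doc)
        scores = {
            t: (c / len(doc)) * math.log((1 + n) / (1 + df[t]))
            for t, c in tf.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords.append(ranked[:top_k])
    return keywords


def bm25_scores(query_terms, passages, k1=1.2, b=0.75):
    """Score every passage against the query with BM25 (assumed weak supervisor)."""
    if not passages:
        return []
    docs = [tokenize(p) for p in passages]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            # Non-negative IDF variant, as used in common BM25 implementations.
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores


def build_pseudo_labels(passages, top_k=3):
    """Return (pseudo_query, passage_id, weak_score) triples for training."""
    triples = []
    for terms in extract_keywords(passages, top_k):
        query = " ".join(terms)
        for pid, score in enumerate(bm25_scores(terms, passages)):
            triples.append((query, pid, score))
    return triples
```

Under these assumptions, each triple pairs a pseudo-query with a passage and a weak relevance score, which could then serve as a training signal for a neural ranker in place of human-labeled q-d pairs.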
Keywords