Journal of Universal Computer Science (Oct 2024)
A Study of Word Bigrams for Pseudo-relevance Feedback in Information Retrieval
Abstract
Traditional information retrieval models mostly adopt a term independence assumption and are based on single terms or unigrams. Past efforts have attempted to go beyond this assumption, for example by using contiguous terms (i.e. word n-grams) or terms appearing in proximity. One such approach employs pseudo-relevance feedback (PRF) in an extended BM25 model, with an expanded query containing bigrams and proximity word pairs in addition to unigrams. However, the benefit of this approach over traditional unigram PRF remains inconclusive. We speculate that the uncertain effectiveness of bigram PRF in this past work is due to two factors: (1) the new bigrams obtained for the expanded query may be formed by pairing unigrams drawn from different documents, and such pairs are potentially noise rather than relevant concepts; (2) the collection statistics of n-grams needed to compute the document ranking functions, such as their document frequencies, are not available at retrieval time, so only estimates of these quantities are used. We suggest that these issues may be overcome by extracting word n-grams as single units in query expansion and by employing a document index that contains both unigrams and word n-grams. We demonstrate the approach for the case of bigram PRF in an extended BM25 model. Retrieval experiments are conducted on a range of standard test collections. For the majority of tested collections, the difference between the values of the evaluation metrics (Mean Average Precision and the precision-oriented NDCG@20) obtained by our bigram PRF and the unigram PRF baseline is not statistically significant. Thus, our bigram PRF fails to improve over unigram PRF robustly across collections. An analysis of our results reveals 'query drift', caused by bigram query expansion terms that represent overly broad topics, as a reason for the failure of our approach.
Keywords