Natural Language Processing Journal (Sep 2024)

A modified Vector Space Model for semantic information retrieval

  • Callistus Ireneous Nakpih

Journal volume & issue
Vol. 8
p. 100081

Abstract

In this research, we present a modified Vector Space Model (VSM) that focuses on the semantic relevance of words for retrieving documents. The modified VSM addresses a limitation of the classical model, which performs only lexical matching of query terms to document terms and therefore cannot retrieve documents that lack an exact match of the query terms, even when they are semantically relevant to the query. In the modified model, we introduce a Query Relevance Update technique, which pads the original query set with semantically relevant document terms to optimise semantic retrieval results. The modified model also includes a novel tf-p technique for computing term frequency weights, which replaces the tf-idf technique of the classical VSM. Replacing tf-idf resolves the problem of the classical model penalising terms that occur across documents on the assumption that they are stop words; in practice, such words often carry semantic information relevant to document retrieval. We also extend the cosine similarity function with a proportionality weight pqd, which moderates the bias towards high term frequencies in longer documents. The pqd weight ensures that the frequency of query terms, including the updated ones, is accounted for in proportion to document size in the overall ranking of documents. The simulated results show that the modified VSM achieves semantic retrieval of documents beyond lexical matching of query and document terms.
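
The two limitations of the classical model described above can be illustrated with a minimal sketch of a textbook tf-idf and cosine similarity ranking. This is only the classical baseline under simple assumptions (raw term counts, idf = log(N/df), whitespace tokenisation); it is not the modified model, whose tf-p and pqd formulas are not given in this abstract. The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Build classical tf-idf vectors for a list of tokenised documents.

    Assumption: raw term counts for tf and idf = log(N / df).
    """
    n = len(docs)
    vocab = sorted({t for d in docs for t in d})
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    idf = {t: math.log(n / df[t]) for t in vocab}
    vectors = []
    for d in docs:
        counts = Counter(d)
        vectors.append([counts[t] * idf[t] for t in vocab])
    return vocab, idf, vectors


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, docs):
    """Score every document against the query with tf-idf and cosine."""
    vocab, idf, vectors = tf_idf_vectors(docs)
    q_counts = Counter(query)
    q_vec = [q_counts[t] * idf[t] for t in vocab]
    return [cosine(q_vec, d_vec) for d_vec in vectors]


# Hypothetical toy corpus: documents 0 and 1 are about the same topic,
# but only document 0 contains the literal query term.
docs = [
    "the car was repaired at the garage".split(),
    "the automobile was serviced by a mechanic".split(),
    "the recipe needs flour and sugar".split(),
]
scores = rank("car".split(), docs)
_, idf, _ = tf_idf_vectors(docs)

print(scores)      # only document 0 receives a non-zero score
print(idf["the"])  # a term present in every document gets zero weight
```

In this sketch the document about an "automobile" scores zero for the query "car" because there is no lexical overlap, and any term occurring in every document is weighted to zero by idf. These are the behaviours that the Query Relevance Update and the tf-p weighting are intended to address.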

Keywords