IEEE Access (Jan 2024)

Transparent, Low Resource, and Context-Aware Information Retrieval From a Closed Domain Knowledge Base

  • Shubham Rateria
  • Sanjay Singh

DOI
https://doi.org/10.1109/ACCESS.2024.3380006
Journal volume & issue
Vol. 12
pp. 44233 – 44243

Abstract

In large-scale enterprises, vast amounts of textual information are shared across corporate repositories and intranet websites. Traditional search techniques that lack context sensitivity often fail to retrieve pertinent data efficiently. Modern techniques that use distributed representations of words require considerable training data and computation, which imposes financial and operational burdens. Generative models for information search suffer from problems of transparency and hallucination, which can be detrimental, especially for organizations and their stakeholders who rely on these results for critical business operations. This paper presents a non-goal-oriented conversational agent, based on a collection of finite state machines and an information search model, for text search over an extensive collection of stored corporate documents and intranet websites. We used a distributed representation of words derived from the BERT model, which allows for contextual searching. We minimally fine-tuned a BERT model on a multi-label text classification task specific to a closed-domain knowledge base. Based on DCG metrics, our information retrieval model, which uses distributed embeddings from the minimally trained BERT model and Word Mover's Distance to calculate topic similarity, returns results that are more relevant to user queries than BERT embeddings with cosine similarity or BM25. Our architecture promises to significantly expedite information retrieval and improve its accuracy in closed-domain systems, without the need for a massive training dataset or expensive computing, while maintaining transparency.
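
The abstract compares rankers by DCG (discounted cumulative gain) but does not say which variant was used. The sketch below assumes the common linear-gain form, in which the relevance grade at rank i is divided by log2(i + 1). The function names `dcg` and `ndcg` and the relevance grades in the usage note are illustrative assumptions, not taken from the paper.

```python
import math
from typing import Optional, Sequence


def dcg(relevances: Sequence[float], k: Optional[int] = None) -> float:
    """Discounted cumulative gain of graded relevances given in ranked order.

    Assumes the linear-gain form: sum of rel_i / log2(i + 1), ranks from 1.
    """
    top = relevances if k is None else relevances[:k]
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(top, start=1))


def ndcg(relevances: Sequence[float], k: Optional[int] = None) -> float:
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical relevance grades for the top results of two rankers
    # answering the same query (higher grade = more relevant).
    ranker_a = [3, 2, 0, 1]
    ranker_b = [1, 0, 3, 2]
    print("ranker A DCG:", dcg(ranker_a))
    print("ranker B DCG:", dcg(ranker_b))
```

Under this measure, a ranker that places highly relevant documents near the top gets a higher DCG than one that returns the same documents lower in the list. This is the sense in which one retrieval model can be called more relevant than another.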

Keywords