Combining Semantic Matching, Word Embeddings, Transformers, and LLMs for Enhanced Document Ranking: Application in Systematic Reviews

Goran Mitrov; Boris Stanoev; Sonja Gievska; Georgina Mirceva; Eftim Zdravevski

doi:10.3390/bdcc8090110

Big Data and Cognitive Computing (Sep 2024)

Affiliations

Goran Mitrov: Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, Rugjer Boshkovik 16, 1000 Skopje, North Macedonia
Boris Stanoev: Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, Rugjer Boshkovik 16, 1000 Skopje, North Macedonia
Sonja Gievska: Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, Rugjer Boshkovik 16, 1000 Skopje, North Macedonia
Georgina Mirceva: Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, Rugjer Boshkovik 16, 1000 Skopje, North Macedonia
Eftim Zdravevski: Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, Rugjer Boshkovik 16, 1000 Skopje, North Macedonia

DOI: https://doi.org/10.3390/bdcc8090110
Journal volume & issue: Vol. 8, no. 9, p. 110

Abstract

The rapid growth in scientific publications has made it difficult to keep up with the latest advances. Conducting systematic reviews with traditional methods is both time-consuming and laborious. In response, new review formats such as rapid and scoping reviews have been introduced, reflecting an urgent need for efficient information retrieval. This challenge extends beyond academia to many organizations where large numbers of documents must be reviewed against specific user queries. This paper focuses on improving document ranking to enhance the retrieval of relevant articles, thereby reducing the time and effort required of researchers. By applying a range of natural language processing (NLP) techniques, including rule-based matching, statistical text analysis, word embeddings, and transformer- and LLM-based approaches such as the Mistral LLM, we assess each article’s similarity to user-specific inputs and prioritize the articles according to relevance. We propose a novel methodology, Weighted Semantic Matching (WSM) + MiniLM, which combines the strengths of the different methodologies. For validation, we employ global metrics such as precision at K, recall at K, average rank, and median rank, as well as pairwise comparison metrics, including higher rank count, average rank difference, and median rank difference. Our proposed algorithm achieves optimal performance, with an average recall at 1000 of 95% and an average median rank of 185 for selected articles across the five datasets evaluated. These results are promising for pinpointing relevant articles and reducing manual work.
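
The validation metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the toy document IDs, and in particular the definitions used for the pairwise metrics (rank in ranking A minus rank in ranking B for each relevant document, with "higher rank count" taken as the number of relevant documents that B places nearer the top) are assumptions made for illustration; the paper's exact definitions may differ.

```python
from statistics import mean, median


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def relevant_ranks(ranked_ids, relevant_ids):
    """1-based positions of the relevant documents in the ranking."""
    return [rank for rank, doc_id in enumerate(ranked_ids, start=1)
            if doc_id in relevant_ids]


def pairwise_comparison(ranked_a, ranked_b, relevant_ids):
    """Compare two rankings on the relevant documents (assumed definitions).

    A positive difference means ranking B places the document nearer the top.
    """
    pos_a = {doc_id: rank for rank, doc_id in enumerate(ranked_a, start=1)}
    pos_b = {doc_id: rank for rank, doc_id in enumerate(ranked_b, start=1)}
    diffs = [pos_a[d] - pos_b[d] for d in relevant_ids
             if d in pos_a and d in pos_b]
    return {
        "higher_rank_count_b": sum(1 for x in diffs if x > 0),
        "average_rank_difference": mean(diffs) if diffs else 0.0,
        "median_rank_difference": median(diffs) if diffs else 0.0,
    }


# Toy data (hypothetical): two rankings of five documents, two of them relevant.
ranking_a = ["d3", "d1", "d7", "d2", "d5"]
ranking_b = ["d1", "d5", "d3", "d7", "d2"]
relevant = {"d1", "d5"}

ranks_a = relevant_ranks(ranking_a, relevant)
print(precision_at_k(ranking_a, relevant, 2))   # 0.5
print(recall_at_k(ranking_a, relevant, 2))      # 0.5
print(mean(ranks_a), median(ranks_a))           # 3.5 3.5
print(pairwise_comparison(ranking_a, ranking_b, relevant))
```

In this toy case, ranking B places both relevant documents nearer the top than ranking A, so its higher rank count is 2 and the average rank difference is 2.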

Published in Big Data and Cognitive Computing

ISSN: 2504-2289 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology
Website: http://www.mdpi.com/journal/BDCC
