Building a Multilevel Inflection Handling Stemmer to Improve Search Effectiveness for Urdu Language

Abdul Jabbar; Sajid Iqbal; Abdullah Abdulrhman Alaulamie; Manzoor Ilahi

doi:10.1109/ACCESS.2024.3373714

IEEE Access (Jan 2024)

Affiliations

Abdul Jabbar: Department of Computer Science, COMSATS University Islamabad, Main Campus, Islamabad, Pakistan
Sajid Iqbal: Department of Information Systems, College of Computer Science and Information Technology, King Faisal University, Hofuf, Saudi Arabia
Abdullah Abdulrhman Alaulamie: Department of Information Systems, College of Computer Science and Information Technology, King Faisal University, Hofuf, Saudi Arabia
Manzoor Ilahi: Department of Computer Science, COMSATS University Islamabad, Main Campus, Islamabad, Pakistan

DOI: https://doi.org/10.1109/ACCESS.2024.3373714
Journal volume & issue: Vol. 12
pp. 39313 – 39329

Abstract

Stemming is an essential step in many Natural Language Processing (NLP) applications. It reduces different variants of query words to a standard form, which avoids the vocabulary mismatch problem in Information Retrieval (IR) systems. Because of Urdu's specific grammatical rules and complex morphological structure, designing an effective stemming algorithm for Urdu is challenging. Although several stemming algorithms have been proposed for Urdu text, none of them extracts the stem from multilevel inflected forms. To the best of our knowledge, this is the first effort to propose and evaluate a novel Urdu Text Stemmer (UTS) that can handle multilevel inflected forms in Urdu text. The proposed scheme was evaluated experimentally on custom-developed text-based and word-based corpora and compared with state-of-the-art stemming algorithms. The results show that UTS outperforms existing Urdu stemmers, achieving an accuracy of 94.92% on the word corpus and 91.8% on the text corpus. We also evaluated the proposed system in an Information Retrieval application for Urdu, using the Collection for Urdu Retrieval Evaluation (CURE) dataset, where it improved both recall and precision.
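
To make the idea of multilevel inflection concrete, the following Python sketch strips suffixes repeatedly until no rule applies, so that a word carrying two stacked suffixes is reduced to the same stem as its simpler variants. It is only a toy illustration under stated assumptions: the suffix list uses English examples, and the names `SUFFIXES`, `MIN_STEM_LEN`, `strip_one_level`, and `stem_multilevel` are hypothetical. It is not the UTS algorithm, whose rules are not given in this abstract.

```python
# Toy illustration only: these suffixes are hypothetical English examples,
# not the rules of the Urdu Text Stemmer (UTS) described in the abstract.
SUFFIXES = ("ness", "less", "ful", "ing", "ly", "s")
MIN_STEM_LEN = 3  # assumed guard so that very short words are left alone


def strip_one_level(word):
    """Remove the longest matching suffix once, or return the word unchanged."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word


def stem_multilevel(word):
    """Apply single-level stripping repeatedly until the word stops changing."""
    current = word
    while True:
        reduced = strip_one_level(current)
        if reduced == current:
            return current
        current = reduced


if __name__ == "__main__":
    # A single pass leaves an inflected form; repeated passes reach the stem.
    print(strip_one_level("carelessness"))  # careless
    print(stem_multilevel("carelessness"))  # care
    print(stem_multilevel("cares"))         # care
```

In a retrieval setting, a query term and a document term would both be passed through the same stemming function before matching, so that `cares` and `carelessness` map to one index entry in this toy example. Whether such aggressive conflation helps or hurts depends on the language and rule set.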

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
