An Analytical Analysis of Text Stemming Methodologies in Information Retrieval and Natural Language Processing Systems

Abdul Jabbar; Sajid Iqbal; Manzoor Ilahi Tamimy; Amjad Rehman; Saeed Ali Bahaj; Tanzila Saba

doi:10.1109/ACCESS.2023.3332710

IEEE Access (Jan 2023)

Affiliations

Abdul Jabbar: Department of Computer Science, COMSATS University Islamabad (CUI), Main Campus, Tarlai Kalan, Islamabad, Pakistan
Sajid Iqbal: Department of Information Systems, College of Computer Science and Information Technology, King Faisal University, Al Hofuf, Saudi Arabia
Manzoor Ilahi Tamimy: Department of Computer Science, COMSATS University Islamabad (CUI), Main Campus, Tarlai Kalan, Islamabad, Pakistan
Amjad Rehman: Artificial Intelligence & Data Analytics Laboratory (AIDA), CCIS, Prince Sultan University, Riyadh, Saudi Arabia
Saeed Ali Bahaj: MIS Department, College of Business Administration, Prince Sattam bin Abdulaziz University, Al-Kharj, Saudi Arabia
Tanzila Saba: Artificial Intelligence & Data Analytics Laboratory (AIDA), CCIS, Prince Sultan University, Riyadh, Saudi Arabia

DOI: https://doi.org/10.1109/ACCESS.2023.3332710
Journal volume & issue: Vol. 11
pp. 133681 – 133702

Abstract

The exponential growth of unstructured digital text creates significant demand for advanced, smart stemming systems. As a preprocessing stage, stemming is applied in research areas such as information retrieval (IR), domain vocabulary analysis, and feature reduction in many natural language processing (NLP) tasks. Text stemming (TS) is an important step that can significantly improve performance in such systems. The text-stemming methods developed so far leave room for improvement: they produce errors of different types, which degrade the performance of the applications that use them. This work presents a systematic study with an in-depth review of selected stemming works published from 1968 to 2023. It reviews the studied stemming algorithms along multiple dimensions, namely methodology, data source, performance, and evaluation methods. For this study, we have chosen stemmers that can be categorized as 1) linguistic knowledge-based, 2) statistical, 3) corpus-based, 4) context-sensitive, and 5) hybrid. The study shows that linguistic knowledge-based stemming techniques have been widely used for highly inflected languages (such as Arabic, Hindi, and Urdu) and have reported higher accuracy than other techniques. We compare and analyze the performance of various state-of-the-art TS approaches, including their issues and challenges, which are summarized as research gaps. This work also analyzes different NLP applications that use stemming methods. Finally, we list future work directions for interested researchers.
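
As a rough illustration of the kind of errors the abstract refers to, the sketch below implements a toy suffix-stripping stemmer in Python. It is not an algorithm from the paper: the suffix list, the minimum stem length, and the names toy_stem and conflation_classes are all assumptions made for this example. It shows two commonly discussed error types, over-stemming (unrelated words reduced to the same stem) and under-stemming (related words left with different stems).

```python
# Toy suffix-stripping stemmer (illustrative only; not an algorithm from the paper).
# The suffix list and the minimum stem length are assumptions chosen for the demo.
SUFFIXES = ("ational", "ization", "ness", "ing", "ies", "ed", "es", "s")
MIN_STEM_LEN = 3


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least MIN_STEM_LEN characters."""
    word = word.lower()
    for suffix in SUFFIXES:  # listed longest first so longer suffixes are tried first
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word


def conflation_classes(words):
    """Group words by the stem they reduce to."""
    groups = {}
    for w in words:
        groups.setdefault(toy_stem(w), []).append(w)
    return groups


if __name__ == "__main__":
    # Intended behaviour: inflected forms share one stem.
    print(conflation_classes(["connected", "connecting", "connects"]))
    # Over-stemming: "news" and "new" are unrelated but end up with the same stem.
    print(toy_stem("news"), toy_stem("new"))
    # Under-stemming: "running" and "run" are related but get different stems.
    print(toy_stem("running"), toy_stem("run"))
```

Even this small rule set conflates "news" with "new" while failing to conflate "running" with "run", which is why rule-based stemmers typically need language-specific rules and exception handling.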

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
