IEEE Access (Jan 2023)

An Analytical Analysis of Text Stemming Methodologies in Information Retrieval and Natural Language Processing Systems

  • Abdul Jabbar,
  • Sajid Iqbal,
  • Manzoor Ilahi Tamimy,
  • Amjad Rehman,
  • Saeed Ali Bahaj,
  • Tanzila Saba

DOI
https://doi.org/10.1109/ACCESS.2023.3332710
Journal volume & issue
Vol. 11
pp. 133681 – 133702

Abstract

The exponential increase in unstructured digital text creates significant demand for advanced and smart stemming systems. As a preprocessing stage, stemming is applied in various research fields such as information retrieval (IR), domain vocabulary analysis, and feature reduction in many natural language processing (NLP) tasks. Text stemming (TS), an important step, can significantly improve performance in such systems. The text-stemming methods developed so far leave room for improvement and can produce errors of different types, which degrade the performance of the applications that use them. This work presents a systematic study with an in-depth review of selected stemming works published from 1968 to 2023. It reviews the studied stemming algorithms along multiple dimensions, namely methodology, data source, performance, and evaluation methods. For this study, we have chosen different stemmers, which can be categorized as 1) linguistic knowledge-based, 2) statistical, 3) corpus-based, 4) context-sensitive, and 5) hybrid stemmers. The study shows that linguistic knowledge-based stemming techniques have been widely used for highly inflected languages (such as Arabic, Hindi, and Urdu) and have reported higher accuracy than other techniques. We compare and analyze the performance of various state-of-the-art TS approaches, including their issues and challenges, which are summarized as research gaps. This work also analyzes different NLP applications that use stemming methods. Finally, we list future work directions for interested researchers.

Keywords