Journal of Information Systems Engineering and Business Intelligence (Jun 2024)

Text Stemming and Lemmatization of Regional Languages in Indonesia: A Systematic Literature Review

  • Zaenal Abidin,
  • Akmal Junaidi,
  • Wamiliana

DOI
https://doi.org/10.20473/jisebi.10.2.217-231
Journal volume & issue
Vol. 10, no. 2
pp. 217–231

Abstract

Background: Stemming is essential in natural language processing (NLP) because it reduces word variations to their base forms. This procedure facilitates the analysis of textual data and improves the precision of classification and information retrieval.

Objective: No systematic literature review has previously been conducted on stemming and lemmatization in the regional languages of Indonesia. Therefore, this study aims to conduct a systematic literature review to capture the latest developments in stemming and lemmatization in these languages.

Methods: This study was carried out using the Kitchenham method, analyzing 35 studies selected from 740 that were obtained from Scopus, IEEE Xplore, and Google Scholar and published between 2014 and 2023.

Results: The results showed that research on stemming has the potential to keep developing each year. Additionally, the main element in stemming and lemmatization studies was found to be the availability of digital dictionaries in regional languages, because a larger base vocabulary contributes more positively to stemming or lemmatization. The availability of word morphology information in regional languages would be useful for building rule-based stemmers. Meanwhile, corpus-based stemming and lemmatization studies could only be conducted for languages with a large corpus, to ensure there were varied affixed words to process.

Conclusion: Based on this SLR, stemming and lemmatization in the regional languages of Indonesia developed significantly from 2014 to 2023. The two main strategies applied were using available digital dictionaries and using language morphology information. However, the main challenges encountered were the limited vocabulary in the dictionaries and the need to test various rule-based methods.

Keywords: Lemmatization, Morphology, Rule-based, Stemming, Systematic Literature Review.
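To illustrate how a digital dictionary and morphology information can work together in a rule-based stemmer, the following is a minimal sketch in Python. It is not taken from any of the reviewed studies. The root-word set, the affix lists, and the function name `stem` are hypothetical placeholders chosen for illustration, and a real stemmer for a regional language would need that language's own dictionary and affix rules.

```python
# Minimal sketch of dictionary-guided affix stripping.
# ASSUMPTION: ROOT_WORDS, PREFIXES and SUFFIXES are toy, illustrative values,
# not the dictionary or rules of any stemmer covered by the review.
ROOT_WORDS = {"makan", "baca", "tulis"}
PREFIXES = ("di", "ter")
SUFFIXES = ("kan", "an", "i")


def stem(word, roots=ROOT_WORDS):
    """Return a dictionary root for `word`, or `word` unchanged if none is found."""
    word = word.lower()

    # A word already in the dictionary is never stripped; this prevents
    # over-stemming roots that merely end in something that looks like a suffix.
    if word in roots:
        return word

    # Try every combination of (no prefix or one prefix) and
    # (no suffix or one suffix), accepting the first candidate in the dictionary.
    for prefix in ("",) + PREFIXES:
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for suffix in ("",) + SUFFIXES:
            if suffix and not rest.endswith(suffix):
                continue
            candidate = rest[:len(rest) - len(suffix)] if suffix else rest
            if candidate in roots:
                return candidate

    # No rule produced a known root: leave the word as it is.
    return word


if __name__ == "__main__":
    for w in ["dibaca", "tulisan", "dituliskan", "makan"]:
        print(w, "->", stem(w))
```

In this sketch a candidate is accepted only if it appears in the dictionary, so coverage depends directly on vocabulary size: calling `stem("dibaca", roots={"makan"})` returns the word unchanged because the root is missing from the smaller dictionary. This is consistent with the limited-vocabulary challenge noted in the conclusion.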