International Journal of Information Science and Management (Oct 2022)
The Effectiveness of Arabic Stemmers Using Arabized Word Removal
Abstract
Other languages have influenced Arabic through several factors, such as geographical proximity, trade, past Islamic conquests, science and technology, new devices, brand names, models, and fashion. As a result, foreign words are used in Arabic text; these are known as Arabised words. Arabised words affect Arabic natural language processing (NLP) tasks because they make it more difficult to identify the correct stem or root of an Arabic word. A more efficient Arabic NLP pipeline can therefore be developed if Arabised word removal is part of pre-processing. In this paper, we propose an algorithm for detecting and extracting Arabised words as a pre-processing step for Arabic stemming. The algorithm combines lexicon-based and rule-based approaches. The lexicon has been compiled from various Arabic text resources, and the rule-based component has been designed to handle Arabised words that carry definite articles by using pattern matching on prefixes and suffixes. To evaluate the effect of the proposed Arabised word removal algorithm on an Arabic NLP task, we use it as a pre-processing step in Arabic stemmers. Three Arabic stemmers are used in our evaluation, namely light stemming, condition light and ARLS, on three types of standard Arabic datasets. Comparisons were made by measuring precision, recall and IFC for each stemmer with and without our Arabised word removal pre-processing. Results show that the performance of all the stemmers improves when Arabised word removal is included in the stemming pre-processing. Therefore, a more efficient Arabic NLP application or task, Arabic stemming in particular, can be developed if Arabised word removal is included in the pre-processing stage.
https://dorl.net/dor/20.1001.1.20088302.2022.20.4.6.5
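
The combination of lexicon lookup and prefix/suffix rules described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the lexicon entries, the list of definite-article prefixes, the suffix list, the minimum-length guard, and the function names (`candidate_forms`, `is_arabised`, `remove_arabised`) are all assumptions made for illustration.

```python
# Hypothetical mini-lexicon of Arabised (loan) words. The lexicon described in
# the abstract is compiled from several Arabic text resources and is not
# reproduced here.
ARABISED_LEXICON = {"كمبيوتر", "تلفزيون", "إنترنت"}

# Assumed definite-article forms: the article alone or fused with a proclitic.
# Longer forms come first so they are matched before the bare article.
ARTICLE_PREFIXES = ("وال", "بال", "كال", "فال", "لل", "ال")

# Assumed suffixes, chosen only for illustration.
SUFFIXES = ("ات", "ين", "ون", "ية")


def candidate_forms(token):
    """Return the token plus variants with an article prefix and/or a suffix stripped."""
    forms = [token]
    for prefix in ARTICLE_PREFIXES:
        # Assumed guard: keep at least three characters after stripping.
        if token.startswith(prefix) and len(token) > len(prefix) + 2:
            forms.append(token[len(prefix):])
            break
    for form in list(forms):
        for suffix in SUFFIXES:
            if form.endswith(suffix) and len(form) > len(suffix) + 2:
                forms.append(form[:-len(suffix)])
    return forms


def is_arabised(token):
    """Rule step generates candidate forms; lexicon step checks each one."""
    return any(form in ARABISED_LEXICON for form in candidate_forms(token))


def remove_arabised(tokens):
    """Drop tokens detected as Arabised before they reach a stemmer."""
    return [t for t in tokens if not is_arabised(t)]


tokens = ["الكمبيوتر", "كتاب", "تلفزيونات", "والإنترنت", "المدرسة"]
filtered = remove_arabised(tokens)
print(filtered)
```

In this sketch the filtered token list would then be passed to a stemmer, so that loan words never reach the root or stem extraction step.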