International Journal of Information Science and Management (Jan 2025)
Analytical Comparison of Stop Word Recognition Methods in Persian Texts
Abstract
Stop words are largely non-significant words that connect other words in a sentence. Because they carry little specific information about the text, they are typically removed during text processing, which makes stop word identification an essential preprocessing step. The difficulty is that a word's status depends on context: words that are usually insignificant can become significant in some situations, while words that are usually important can sometimes behave as stop words. This problem is particularly pronounced in Persian because of the complexities inherent in the language. Given the importance of stop word identification in Persian, we reviewed and compared several approaches, including a dictionary-based approach, a POS tagging-based approach, a Word2Vec-based approach, and a FastText-based approach, using a corpus of 50,000 Persian sentences from the Hamshahri dataset. Our findings indicate that the FastText-based approach outperformed the others, with a detection accuracy of 96.98%, suggesting that this method can support the development of an automatic, reliable, and efficient stop word detection system.
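To make the simplest of the compared approaches concrete, the following is a minimal sketch of dictionary-based stop word removal. The word list, the function names, and the whitespace tokenizer are illustrative assumptions, not the dictionary or pipeline used in this study. Because the lookup is static, it cannot capture the context dependence described above, which is the limitation that motivates the tagging-based and embedding-based approaches.

```python
# Hypothetical, deliberately tiny stop word list for illustration only;
# it is NOT the dictionary used in the study.
PERSIAN_STOP_WORDS = {"و", "در", "به", "از", "که", "را", "با", "این"}


def tokenize(sentence):
    # Assumption: whitespace splitting. Real Persian text usually needs
    # normalization and half-space handling before tokenization.
    return sentence.split()


def remove_stop_words(tokens, stop_words=PERSIAN_STOP_WORDS):
    # Keep only the tokens that do not appear in the stop word dictionary.
    return [token for token in tokens if token not in stop_words]


sentence = "این کتاب را از کتابخانه گرفتم"
content_tokens = remove_stop_words(tokenize(sentence))
print(content_tokens)
```

A word such as "این" is always dropped by this lookup, whether or not it matters in a given sentence, so the method's quality depends entirely on how well the dictionary fits the corpus.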
Keywords