Applied Sciences (Dec 2024)
Exploring the Effects of Pre-Processing Techniques on Topic Modeling of an Arabic News Article Data Set
Abstract
This research investigates the impacts of pre-processing techniques on the effectiveness of topic modeling algorithms for Arabic texts, focusing on a comparison between BERTopic, Latent Dirichlet Allocation (LDA), and Non-Negative Matrix Factorization (NMF). Using the Single-label Arabic News Article Data set (SANAD), which includes 195,174 Arabic news articles, this study explores pre-processing methods such as cleaning, stemming, normalization, and stop word removal, which are crucial processes given the complex morphology of Arabic. Additionally, the influence of six different embedding models on the topic modeling performance was assessed. The originality of this work lies in addressing the lack of previous studies that optimize BERTopic through adjusting the n-gram range parameter and combining it with different embedding models for effective Arabic topic modeling. Pre-processing techniques were fine-tuned to improve data quality before applying BERTopic, LDA, and NMF, and the performance was assessed using metrics such as topic coherence and diversity. Coherence was measured using Normalized Pointwise Mutual Information (NPMI). The results show that the Tashaphyne stemmer significantly enhanced the performance of LDA and NMF. BERTopic, optimized with pre-processing and bi-grams, outperformed LDA and NMF in both coherence and diversity. The CAMeL-Lab/bert-base-arabic-camelbert-da embedding yielded the best results, emphasizing the importance of pre-processing in Arabic topic modeling.
Keywords