Journal of Umm Al-Qura University for Language Sciences and Literature (Feb 2022)

Evaluating Topic Modeling for Saudi Newspapers Texts Using LDA: A Computational Linguistics Study

  • Afrah Altamimi

DOI
https://doi.org/10.54940/ll16145890
Journal volume & issue
no. 29
pp. 24 – 33

Abstract

This paper belongs to the field of natural language processing. It applies an unsupervised machine learning approach to identify the latent topics in Saudi newspapers, using one of the most important unsupervised topic modeling algorithms, Latent Dirichlet Allocation (LDA). I built a corpus from Saudi newspapers; after the preprocessing stage, it contained 4,781 texts and 649,734 tokens. The results of training 20 models, with ten words per topic, showed that the optimal number of topics for these texts is 7. The 7-topic model achieved a good coherence score of 0.6723. Each topic was inferred from the ten words with the highest probabilities in that topic. I interpreted the seven topics, in order, as: surveillance and awareness, development and improvement, sports, health, economics, domestic affairs, and international politics. The 7-topic model was evaluated qualitatively by manually reviewing the coherence of the words in each topic. I also reviewed the first fifty texts of each topic to make sure that each text belongs to the topic LDA assigned it to. The qualitative evaluation was supported by running the algorithm again on the texts of each of the seven topics to obtain more detail on each topic separately. Although the topic modeling results have some shortcomings, they can be optimized and then studied in discourse analysis as an alternative to traditional approaches.
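The model-selection step described in the abstract (train several candidate models, score each by topic coherence, keep the best) can be illustrated with a minimal sketch. The abstract does not name the coherence measure or the software used, so everything below is an assumption: a UMass-style coherence, averaged over word pairs, stands in for the paper's measure, and its values are not comparable to the reported 0.6723. The toy documents, the candidate topics, and the function names `umass_coherence` and `pick_best_model` are hypothetical and are not taken from the paper.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """Average UMass-style coherence over all pairs of a topic's top words.

    Assumption: this is a stand-in measure, not the one used in the paper.
    top_words must be ordered from most to least probable.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every word in `words`.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    scores = []
    for i, j in combinations(range(len(top_words)), 2):
        higher, lower = top_words[i], top_words[j]
        d_higher = doc_freq(higher)
        if d_higher == 0:
            continue  # skip words that never occur in the corpus
        # log of smoothed co-document frequency over the higher-ranked word's frequency
        scores.append(math.log((doc_freq(higher, lower) + 1) / d_higher))
    return sum(scores) / len(scores) if scores else 0.0


def pick_best_model(candidates, documents):
    """Return the topic count whose topics have the highest mean coherence.

    candidates maps a number of topics to that model's topics,
    each topic given as a list of its top words.
    """
    mean_scores = {
        k: sum(umass_coherence(t, documents) for t in topics) / len(topics)
        for k, topics in candidates.items()
    }
    best_k = max(mean_scores, key=mean_scores.get)
    return best_k, mean_scores


# Hypothetical tokenized corpus (the real corpus has 4,781 texts).
documents = [
    ["match", "goal", "team"],
    ["team", "goal", "league"],
    ["bank", "market", "stock"],
    ["stock", "market", "price"],
]

# Hypothetical top words from two already-trained candidate models.
candidates = {
    1: [["goal", "market", "team", "stock"]],
    2: [["goal", "team"], ["market", "stock"]],
}

best_k, mean_scores = pick_best_model(candidates, documents)
print(best_k, mean_scores)
```

In this toy setting the two-topic candidate scores higher because its top words co-occur in the same documents, whereas the one-topic candidate mixes sports and economics vocabulary. Under the assumptions above, the same comparison applied to the trained candidate models would use each topic's ten most probable words.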