IEEE Access (Jan 2020)
ProSOUL: A Framework to Identify Propaganda From Online Urdu Content
Abstract
The rapid dissemination of information on digital platforms has been accompanied by the emergence of information pollution such as misinformation, disinformation, fake news, and different types of propaganda. Information pollution has become a serious threat to the online digital world and poses several challenges to social media platforms and governments around the world. In this article, we propose Propaganda Spotting in Online Urdu Language (ProSOUL), a framework to identify the content and sources of propaganda spread in the Urdu language. First, we develop a labelled dataset of 11,574 Urdu news items to train machine learning classifiers. Next, we develop a Linguistic Inquiry and Word Count (LIWC) dictionary to extract psycho-linguistic features from Urdu text. We evaluate the performance of different classifiers by varying n-gram, News Landscape (NELA), Word2Vec, and Bidirectional Encoder Representations from Transformers (BERT) features. Our results show that the combination of NELA, word n-gram, and character n-gram features performs best, with 0.91 accuracy for Urdu text classification. In addition, Word2Vec embeddings outperform BERT features in the classification of Urdu text, with 0.87 accuracy. Moreover, we develop and classify large-scale Urdu content repositories to identify web sources spreading propaganda. Our results show that the ProSOUL framework performs better for propaganda detection on online Urdu news content than on general web content. To the best of our knowledge, this is the first study on the detection of propaganda content in the Urdu language.
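As an illustration of the word and character n-gram features mentioned above, the following minimal sketch counts both kinds of n-gram in a single sparse feature dictionary. It is not the ProSOUL implementation: the function names, the whitespace tokenization, and the n-gram ranges (`word_n`, `char_n`) are assumptions made for the example, and the NELA, Word2Vec, and BERT features are not covered.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the word n-grams of text as space-joined strings."""
    tokens = text.split()  # assumption: simple whitespace tokenization
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return the character n-grams of text, spaces included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def combined_features(text, word_n=(1, 2), char_n=(2, 3)):
    """Count word and character n-grams in one feature dictionary.

    Keys are (kind, n, gram) tuples, where kind is "w" for word n-grams
    and "c" for character n-grams. The default ranges are illustrative.
    """
    features = Counter()
    for n in word_n:
        for gram in word_ngrams(text, n):
            features[("w", n, gram)] += 1
    for n in char_n:
        for gram in char_ngrams(text, n):
            features[("c", n, gram)] += 1
    return features


# Hypothetical Urdu sample sentence, used only to show the call.
sample = "یہ ایک خبر ہے"
feats = combined_features(sample)
```

Because Python strings are Unicode, the same functions apply to Urdu text without modification. A feature dictionary of this form could then be vectorized and passed to a classifier of choice.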
Keywords