ProSOUL: A Framework to Identify Propaganda From Online Urdu Content

Soufia Kausar; Bilal Tahir; Muhammad Amir Mehmood

doi:10.1109/ACCESS.2020.3028131

IEEE Access (Jan 2020)

Affiliations

Soufia Kausar: Al-Khawarizmi Institute of Computer Science, University of Engineering and Technology, Lahore, Pakistan
Bilal Tahir: Al-Khawarizmi Institute of Computer Science, University of Engineering and Technology, Lahore, Pakistan
Muhammad Amir Mehmood: Al-Khawarizmi Institute of Computer Science, University of Engineering and Technology, Lahore, Pakistan

Journal volume & issue: Vol. 8, pp. 186039–186054

Abstract

The rapid dissemination of information on digital platforms has led to the emergence of information pollution such as misinformation, disinformation, fake news, and various types of propaganda. Information pollution has become a serious threat to the online digital world and poses several challenges to social media platforms and governments around the world. In this article, we propose Propaganda Spotting in Online Urdu Language (ProSOUL), a framework to identify content and sources of propaganda spread in the Urdu language. First, we develop a labelled dataset of 11,574 Urdu news items to train machine learning classifiers. Next, we develop a Linguistic Inquiry and Word Count (LIWC) dictionary to extract psycho-linguistic features of Urdu text. We evaluate the performance of different classifiers by varying n-gram, News Landscape (NELA), Word2Vec, and Bidirectional Encoder Representations from Transformers (BERT) features. Our results show that the combination of NELA, word n-gram, and character n-gram features performs best, with 0.91 accuracy for Urdu text classification. In addition, Word2Vec embeddings outperform BERT features in the classification of Urdu text, with 0.87 accuracy. Moreover, we develop and classify large-scale Urdu content repositories to identify web sources spreading propaganda. Our results show that the ProSOUL framework performs better for propaganda detection in online Urdu news content than in general web content. To the best of our knowledge, this is the first study on the detection of propaganda content in the Urdu language.
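As an illustration of the word and character n-gram features mentioned in the abstract, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the n-gram ranges, the `w`/`c` key prefixes, and the sample sentence are all assumptions made for this example, and the NELA, LIWC, Word2Vec, and BERT features are not covered.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the word n-grams of a whitespace-tokenized text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return the character n-grams of a text (spaces included)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def combined_features(text, word_n=(1, 2), char_n=(2, 3)):
    """Count word and character n-grams in one sparse feature map.

    The default n-gram ranges are illustrative assumptions, not values
    taken from the paper.
    """
    feats = Counter()
    for n in word_n:
        feats.update(f"w{n}:{g}" for g in word_ngrams(text, n))
    for n in char_n:
        feats.update(f"c{n}:{g}" for g in char_ngrams(text, n))
    return feats


# Hypothetical sample sentence ("this is a news item"), not from the dataset.
sample = "یہ ایک خبر ہے"
features = combined_features(sample)
```

A feature map of this kind could then be passed to any standard classifier; which classifiers and feature weights the paper uses is not stated in the abstract.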

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
