Hate Speech and Target Community Detection in Nastaliq Urdu Using Transfer Learning Techniques

Muhammad Shahid Iqbal Malik; Aftab Nawaz; Mona Mamdouh Jamjoom

doi:10.1109/ACCESS.2024.3444188

IEEE Access (Jan 2024)

Affiliations

Muhammad Shahid Iqbal Malik: Department of Computer Science, University of Wah, Wah Cantt, Pakistan
Aftab Nawaz: Department of Computer Science, COMSATS University Islamabad, Attock Campus, Attock, Pakistan
Mona Mamdouh Jamjoom: Department of Computer Sciences, College of Computer and Information Sciences, Princess Nourah Bint Abdulrahman University, Riyadh, Saudi Arabia

DOI: https://doi.org/10.1109/ACCESS.2024.3444188
Journal volume & issue: Vol. 12
pp. 116875 – 116890

Abstract

Freedom of expression on social media has given oppressed people many opportunities to raise their voices against violence and injustice, but this freedom is being misused to spread various forms of hate speech. Several studies have addressed hate speech detection in high-resource languages; however, work on under-resourced languages is very limited, especially for Nastaliq Urdu. Pakistan has been dealing with hateful and violence-inciting content for the last two decades. Therefore, this study addresses the problem of detecting hate speech and identifying fine-grained, multi-class target communities in Nastaliq Urdu. Using the transfer learning paradigm, two benchmark Urdu transformer models are explored with fine-tuning. A Nastaliq Urdu Hate Speech and Target Community (HSTC) corpus is built by collecting posts from Pakistani Facebook accounts. In particular, the strengths of the Urdu Robustly Optimized BERT Pre-Training Approach (Urdu-RoBERTa) and Urdu Distilled Bidirectional Encoder Representations from Transformers (Urdu-DistilBERT) are explored to design an automated system that does not rely on hand-crafted features. The proposed framework consists of four steps: 1) data cleaning and preprocessing; 2) data transformation; 3) grid search for the fine-tuning process; and 4) classification (binary and multi-class). The results on the Nastaliq Urdu corpus show that the proposed system achieves benchmark performance for the binary classification task (hate speech) and for target community detection (multi-class classification) on hateful Facebook posts. In particular, fine-tuned DistilBERT achieved 86.58% accuracy and an 86.52% F1-score for binary classification and outperformed sixteen baselines. Furthermore, it achieved 84.17% accuracy and an 83.91% F1-score for target community (religious, political, and gender-based) identification and outperformed all baselines. The findings of this study can help detect and filter hate speech in Nastaliq Urdu on the Facebook platform.
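
Steps 3 and 4 of the framework (grid search over fine-tuning settings, then scoring the classifier by accuracy and F1) can be illustrated with a minimal, dependency-free sketch. Everything named here is an assumption made for illustration: the abstract does not state which hyperparameters were searched, how the F1-score was averaged (macro averaging is assumed below), or how the models were trained. The `fit_and_predict` callable is a hypothetical stand-in for fine-tuning a transformer and predicting labels on a validation set.

```python
from itertools import product
from typing import Any, Callable, Dict, List, Sequence, Tuple


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of posts whose predicted label matches the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 (assumed averaging; not stated in the abstract)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores: List[float] = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def grid_search(
    grid: Dict[str, Sequence[Any]],
    fit_and_predict: Callable[[Dict[str, Any]], Sequence[str]],
    y_val: Sequence[str],
) -> Tuple[Dict[str, Any], float]:
    """Try every combination in `grid` and keep the one with the best macro F1.

    `fit_and_predict` is a hypothetical callable: given one hyperparameter
    setting, it trains a model and returns predicted labels for the
    validation posts, in the same order as `y_val`.
    """
    names = list(grid)
    best_params: Dict[str, Any] = {}
    best_score = -1.0
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = macro_f1(y_val, fit_and_predict(params))
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

In practice, `fit_and_predict` would wrap the fine-tuning loop of whichever transformer library is used, and the grid keys (for example a learning rate or an epoch count) would be whatever settings that loop exposes. The same two metric functions apply unchanged to the binary task (hateful versus not hateful) and to the three-class target community task.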

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
