IEEE Access (Jan 2024)
Abusive Language Detection in Urdu Text: Leveraging Deep Learning and Attention Mechanism
Abstract
The widespread use of the Internet and the tremendous growth of social media have enabled people to connect with each other worldwide. Individuals are free to express themselves online, sharing their photos, videos, and text messages globally. However, such freedom sometimes leads to misuse, as some individuals exploit these platforms by posting hateful and abusive comments on forums. The proliferation of abusive language on social media negatively impacts individuals and groups, leading to emotional distress and affecting mental health. It is crucial to automatically detect and filter such abusive content to tackle this challenging issue effectively. Detecting abusive language in text messages is challenging due to intentional word concealment and contextual complexity. To counter abusive speech on social media, we need to explore the potential of machine learning (ML) and deep learning (DL) models, particularly those equipped with attention mechanisms. In this study, we utilized popular ML and DL models integrated with an attention mechanism to detect abusive language in Urdu text. Our methodology involved employing Count Vectorizer and Term Frequency-Inverse Document Frequency (TF-IDF) to extract n-grams at the word level: Unigrams (Uni), Bigrams (Bi), Trigrams (Tri), and their combination (Uni + Bi + Tri). Initially, we evaluated four traditional ML models—Logistic Regression (LR), Gaussian Naïve Bayes (NB), Support Vector Machine (SVM), and Random Forest (RF)—on both the proposed and established datasets. The results showed that the RF model outperformed the other conventional models in terms of accuracy, precision, recall, and F1-measure on both datasets. For the deep learning experiments, we employed several models with custom fastText and Word2Vec embeddings, each equipped with an attention layer except for the Convolutional Neural Network (CNN). Our findings indicated that the Bidirectional Long Short-Term Memory (Bi-LSTM) + attention model, utilizing custom Word2Vec embeddings, achieved improved performance in detecting abusive language on both datasets.
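To make the classical feature pipeline concrete, the following is a minimal sketch (not the authors' implementation) of word-level TF-IDF n-gram extraction (Uni + Bi + Tri) combined with a Random Forest classifier, using scikit-learn; the `texts` and `labels` variables are hypothetical placeholders standing in for the Urdu datasets.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Hypothetical placeholder corpus; the actual data are Urdu posts/comments.
texts = [
    "placeholder abusive comment 1", "placeholder abusive comment 2",
    "placeholder neutral comment 1", "placeholder neutral comment 2",
    "placeholder abusive comment 3", "placeholder neutral comment 3",
]
labels = [1, 1, 0, 0, 1, 0]  # 1 = abusive, 0 = non-abusive

# Word-level n-grams: ngram_range=(1, 3) produces the Uni + Bi + Tri feature set.
vectorizer = TfidfVectorizer(analyzer="word", ngram_range=(1, 3))
X = vectorizer.fit_transform(texts)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.33, random_state=42
)

# Random Forest was the strongest of the four conventional models in the study.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test), zero_division=0))
```

Swapping `TfidfVectorizer` for `CountVectorizer`, or restricting `ngram_range`, reproduces the other feature configurations compared in the study.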
Keywords