IEEE Access (Jan 2024)
Abusive Language Detection in Urdu Text: Leveraging Deep Learning and Attention Mechanism
Abstract
The widespread use of the Internet and the tremendous growth of social media have enabled people to connect with each other worldwide. Individuals are free to express themselves online, sharing their photos, videos, and text messages globally. However, such freedom sometimes leads to misuse, as some individuals exploit these platforms by posting hateful and abusive comments on forums. The proliferation of abusive language on social media negatively impacts individuals and groups, leading to emotional distress and affecting mental health. It is crucial to automatically detect and filter such abusive content to tackle this challenging issue effectively. Detecting abusive language in text messages is challenging due to intentional word concealment and contextual complexity. To counter abusive speech on social media, we need to explore the potential of machine learning (ML) and deep learning (DL) models, particularly those equipped with attention mechanisms. In this study, we utilized popular ML and DL models integrated with an attention mechanism to detect abusive language in Urdu text. Our methodology involved employing Count Vectorizer and Term Frequency-Inverse Document Frequency (TF-IDF) to extract n-grams at the word level: Unigrams (Uni), Bigrams (Bi), Trigrams (Tri), and their combination (Uni + Bi + Tri). Initially, we evaluated four traditional ML models—Logistic Regression (LR), Gaussian Naïve Bayes (NB), Support Vector Machine (SVM), and Random Forest (RF)—on both the proposed and established datasets. The results showed that the RF model outperformed the other conventional models in terms of accuracy, precision, recall, and F1-measure on both datasets. For the deep learning experiments, we employed several models with custom fastText and Word2Vec embeddings, each equipped with an attention layer except for the Convolutional Neural Network (CNN). Our findings indicated that the Bidirectional Long Short-Term Memory (Bi-LSTM) + attention model, utilizing custom Word2Vec embeddings, achieved improved performance in detecting abusive language on both datasets.
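To make the classical feature pipeline concrete, the following is a minimal sketch (not the authors' implementation) of word-level TF-IDF n-gram extraction (Uni + Bi + Tri) combined with a Random Forest classifier, using scikit-learn; the `texts` and `labels` variables are hypothetical placeholders standing in for the Urdu datasets.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Hypothetical placeholder corpus; the actual data are Urdu posts/comments.
texts = [
    "placeholder abusive comment 1", "placeholder abusive comment 2",
    "placeholder neutral comment 1", "placeholder neutral comment 2",
    "placeholder abusive comment 3", "placeholder neutral comment 3",
]
labels = [1, 1, 0, 0, 1, 0]  # 1 = abusive, 0 = non-abusive

# Word-level n-grams: ngram_range=(1, 3) produces the Uni + Bi + Tri feature set.
vectorizer = TfidfVectorizer(analyzer="word", ngram_range=(1, 3))
X = vectorizer.fit_transform(texts)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.33, random_state=42
)

# Random Forest was the strongest of the four conventional models in the study.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test), zero_division=0))
```

Swapping `TfidfVectorizer` for `CountVectorizer`, or restricting `ngram_range`, reproduces the other feature configurations compared in the study.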
Keywords