IEEE Access (Jan 2024)
Multilingual Detection of Cyberbullying in Mixed Urdu, Roman Urdu, and English Social Media Conversations
Abstract
Automatic cyberbullying detection in social media is increasingly vital because social networks play an integral role in people’s lives and the impact of cyberbullying is severe. Cyberbullying involves intentional, repetitive, aggressive behaviour aimed at harming others online. Among Urdu-speaking communities worldwide, it is common to use Urdu, Roman Urdu, and English in social media conversations. Existing research and detection methods overlook this multilingual usage and do not address cyberbullying across the three languages comprehensively. Additionally, no dataset in Urdu and Roman Urdu covers the repetition and intent-to-harm components of cyberbullying. This research addresses the gap by developing and annotating a comprehensive dataset that captures linguistic variation in cyberbullying instances across Urdu, Roman Urdu, and English and incorporates all aspects of cyberbullying. In addition to the dataset, a framework for detecting cyberbullying is proposed. The framework classifies text messages as aggressive or non-aggressive and introduces novel quantitative measures of repetition and of the level of intent to cause harm. It then classifies cyberbullying by applying thresholds to the measures of aggression, repetition, and intent to harm, thereby integrating all three aspects. Results are reported on the proposed dataset for aggression detection using fine-tuned m-BERT and MuRIL, combined with the measures of repetition and intent to harm. Additionally, experiments demonstrate the impact of repetition and intent to harm on cyberbullying classification. The best results on the dataset are achieved with fine-tuned MuRIL when the quantitative measures of repetition and intent to harm are incorporated: a precision of 0.93, a recall of 0.92, and an F-measure of 0.92.
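As a rough illustration of the thresholding step described above, the following minimal sketch combines three per-conversation scores into a single cyberbullying decision. The abstract does not define the measures or the threshold values, so everything here is an assumption made for illustration: the names `Thresholds` and `is_cyberbullying`, the [0, 1] score range, the default value of 0.5, and the requirement that all three thresholds be met at once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical cut-offs; the abstract does not state the actual values.
    aggression: float = 0.5
    repetition: float = 0.5
    intent_to_harm: float = 0.5


def is_cyberbullying(
    aggression: float,
    repetition: float,
    intent_to_harm: float,
    thresholds: Thresholds = Thresholds(),
) -> bool:
    """Flag a conversation only if all three scores reach their thresholds.

    Each score is assumed to lie in [0, 1]. Combining them with a logical
    AND is an assumption of this sketch, not a rule confirmed by the source.
    """
    return (
        aggression >= thresholds.aggression
        and repetition >= thresholds.repetition
        and intent_to_harm >= thresholds.intent_to_harm
    )


# Example: aggressive and intentional, but not repeated, so not flagged.
one_off_insult = is_cyberbullying(aggression=0.9, repetition=0.1, intent_to_harm=0.8)
sustained_attack = is_cyberbullying(aggression=0.9, repetition=0.7, intent_to_harm=0.8)
```

Under these assumptions, a single aggressive message with no repetition is not labelled as cyberbullying, which mirrors the definition of cyberbullying as intentional, repetitive, aggressive behaviour.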
Keywords