Journal of King Saud University: Computer and Information Sciences (Jul 2023)
Contextual Embeddings based on Fine-tuned Urdu-BERT for Urdu threatening content and target identification
Abstract
Identifying threatening text on social media platforms is a challenging task. In contrast to high-resource languages, the Urdu language has very few such approaches, and the existing benchmark approach suffers from inappropriate data annotation. We therefore present robust threatening content and target identification as a hierarchical classification model for Urdu tweets. This study investigates the potential of the Urdu-BERT (Bidirectional Encoder Representations from Transformers) language model to learn universal contextualized representations, showcasing its usefulness for the binary classification tasks of threatening content and target identification. We exploit a pre-trained Urdu-BERT as a transfer learning model after fine-tuning its parameters on a newly designed Urdu corpus collected from Twitter. The proposed dataset contains 2,400 tweets manually annotated as threatening or non-threatening at the first level; threatening tweets are further categorized as targeting an individual or a group at the second level. The performance of the fine-tuned Urdu-BERT is compared with the benchmark study and other feature models. Experimental results show that the fine-tuned Urdu-BERT model achieves state-of-the-art performance, obtaining 87.5% accuracy and an 87.8% F1-score for threatening content identification, and 82.5% accuracy and an 83.2% F1-score for target identification. Furthermore, the proposed model outperforms the benchmark study.
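The two-level pipeline described above can be sketched as follows. This is a minimal illustration of the hierarchical decision flow only: the two classifier functions are hypothetical keyword-based stand-ins, whereas in the study each level is a fine-tuned Urdu-BERT binary classifier.

```python
# Sketch of the hierarchical (two-level) classification pipeline.
# The classifier bodies below are placeholders for the fine-tuned
# Urdu-BERT models; only the control flow mirrors the paper's design.

def level1_is_threatening(tweet: str) -> bool:
    # Stand-in for the first binary classifier
    # (threatening vs. non-threatening).
    return "THREAT" in tweet  # placeholder logic, not the real model

def level2_target(tweet: str) -> str:
    # Stand-in for the second binary classifier
    # (individual vs. group), run only on threatening tweets.
    return "group" if "GROUP" in tweet else "individual"  # placeholder

def classify(tweet: str) -> dict:
    """Hierarchical prediction: level 2 runs only if level 1 fires."""
    if not level1_is_threatening(tweet):
        return {"threatening": False, "target": None}
    return {"threatening": True, "target": level2_target(tweet)}
```

The key design point is that the target-identification model never sees non-threatening tweets, matching the annotation scheme in which only threatening tweets carry an individual/group label.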