Offensive Language Detection in Arabic Social Networks Using Evolutionary-Based Classifiers Learned From Fine-Tuned Embeddings

Fatima Shannaq; Bassam Hammo; Hossam Faris; Pedro A. Castillo-Valdivieso

doi:10.1109/ACCESS.2022.3190960

IEEE Access (Jan 2022)

Affiliations

Fatima Shannaq: King Abdullah II School of Information Technology, The University of Jordan, Amman, Jordan
Bassam Hammo: King Abdullah II School of Information Technology, The University of Jordan, Amman, Jordan
Hossam Faris: King Abdullah II School of Information Technology, The University of Jordan, Amman, Jordan
Pedro A. Castillo-Valdivieso: Department of Computer Architecture and Technology, ETSIIT-CITIC, University of Granada, Granada, Spain

DOI: https://doi.org/10.1109/ACCESS.2022.3190960
Journal volume & issue: Vol. 10
pp. 75018 – 75039

Abstract

Social networks facilitate communication between people all over the world. Unfortunately, the excessive use of social networks has led to a rise in antisocial behaviors such as the spread of online offensive language, cyberbullying (CB), and hate speech (HS). Detecting abusive/offensive language and hate speech has therefore become a crucial part of combating cyberharassment. Manual detection of cyberharassment is cumbersome, slow, and infeasible for rapidly growing data. In this study, we address the challenges of automatically detecting offensive tweets in Arabic. The main contribution of this study is the design and implementation of an intelligent prediction system that uses a two-stage optimization approach to distinguish offensive from non-offensive text. In the first stage, the proposed approach fine-tunes pre-trained word embedding models by training them for several epochs on the training dataset; embeddings for the vocabulary of the new dataset are trained and added to the existing embeddings. In the second stage, it combines two classifiers, namely XGBoost and SVM, with a genetic algorithm (GA) that searches for optimal hyperparameter values, mitigating the classifiers' difficulty in finding them. We tested the proposed approach on the Arabic Cyberbullying Corpus (ArCybC), which contains tweets collected from four Twitter domains: gaming, sports, news, and celebrities. The ArCybC dataset has four categories: sexual, racial, intelligence, and appearance. The proposed approach produced superior results: the SVM algorithm with the AraVec SkipGram word embedding model achieved an accuracy of 88.2% and an F1-score of 87.8%.
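The second stage described in the abstract, a genetic algorithm searching classifier hyperparameters, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the GA operators, population size, or search ranges, so the tournament selection, uniform crossover, Gaussian mutation, elitism, and the bounds on the SVM parameters C and gamma below are all assumptions. The function `toy_fitness` is a hypothetical stand-in for the real objective, which in practice would be something like the cross-validated F1-score of an SVM trained on the tweet embeddings.

```python
import math
import random

# Assumed search space in log10 units: C in [1e-2, 1e3], gamma in [1e-4, 1e1].
BOUNDS = [(-2.0, 3.0), (-4.0, 1.0)]


def toy_fitness(genes):
    """Hypothetical stand-in for a cross-validated score.

    It peaks at C = 10 and gamma = 0.01; replace it with a function that
    trains the classifier and returns its validation score.
    """
    log_c, log_gamma = genes
    return math.exp(-((log_c - 1.0) ** 2 + (log_gamma + 2.0) ** 2))


def clip(value, low, high):
    """Keep a mutated gene inside its allowed range."""
    return max(low, min(high, value))


def tournament(population, fitness, rng):
    """Binary tournament selection: the fitter of two random individuals wins."""
    a, b = rng.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b


def run_ga(fitness, bounds, pop_size=30, generations=40,
           mutation_rate=0.2, sigma=0.3, seed=0):
    """Return (best_genes, best_score) found by a simple elitist GA."""
    rng = random.Random(seed)
    population = [[rng.uniform(lo, hi) for lo, hi in bounds]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        children = [best[:]]  # elitism: the best individual always survives
        while len(children) < pop_size:
            p1 = tournament(population, fitness, rng)
            p2 = tournament(population, fitness, rng)
            # Uniform crossover: each gene comes from one parent at random.
            child = [g1 if rng.random() < 0.5 else g2
                     for g1, g2 in zip(p1, p2)]
            # Gaussian mutation, clipped to the search bounds.
            for i, (lo, hi) in enumerate(bounds):
                if rng.random() < mutation_rate:
                    child[i] = clip(child[i] + rng.gauss(0.0, sigma), lo, hi)
            children.append(child)
        population = children
        best = max(population, key=fitness)
    return best, fitness(best)


best_genes, best_score = run_ga(toy_fitness, BOUNDS)
best_c, best_gamma = 10 ** best_genes[0], 10 ** best_genes[1]
print(f"C={best_c:.4g}, gamma={best_gamma:.4g}, score={best_score:.4f}")
```

Encoding the parameters on a log scale is a design choice of this sketch, made because useful values of C and gamma typically span several orders of magnitude. The same loop would apply to XGBoost by widening `BOUNDS` to cover its hyperparameters and swapping in the corresponding fitness function.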

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
