Exploratory Data Analysis and Classification of a New Arabic Online Extremism Dataset

Saja Aldera; Ahmed Emam; Muhammad Al-Qurishi; Majed Alrubaian; Abdulrahman Alothaim

doi:10.1109/ACCESS.2021.3132651

IEEE Access (Jan 2021)

Affiliations

Saja Aldera: Management Information Systems Department, College of Business Administration, King Saud University, Riyadh, Saudi Arabia
Ahmed Emam: Information Systems Department, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
Muhammad Al-Qurishi: Information Systems Department, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
Majed Alrubaian: Information Systems Department, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
Abdulrahman Alothaim: Information Systems Department, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia

DOI: https://doi.org/10.1109/ACCESS.2021.3132651
Journal volume & issue: Vol. 9
pp. 161613 – 161626

Abstract

The dissemination of extremist ideas and causes online has intensified over the last decade. Extremist organizations use social media to gain publicity and new recruits, often with little interference from network providers. New techniques are being developed to identify extremist content so that it can be promptly removed and its authors blocked from network access. However, most techniques are only compatible with the English language, even though extremist propaganda is frequently shared in other languages, including Arabic. Since the most effective methods for automated linguistic analysis use deep learning and require large, high-quality datasets, creating specialised data samples containing examples of extremist communication is an essential step toward a practical solution. In this paper, we present a dataset compiled for this purpose and discuss the classification methods that can be used for extremism detection. The manually annotated Arabic Twitter dataset consists of 89,816 tweets published between 2011 and 2021. Following a set of annotation guidelines, three expert annotators labelled the tweets as extremist or non-extremist. Exploratory data analysis was performed to understand the dataset’s features. Several classification algorithms were applied to the dataset, including logistic regression, support vector machine, multinomial naïve Bayes, random forest, and BERT. Among the traditional machine learning models, support vector machine with term frequency-inverse document frequency features achieved the highest accuracy (0.9729). However, BERT outperformed the traditional models with an accuracy of 0.9749. This dataset is expected to enhance the accuracy of Arabic online extremism classification in future research, and so we have made it publicly available.
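The term frequency-inverse document frequency (TF-IDF) features mentioned in the abstract can be illustrated with a minimal sketch. It assumes the plain weighting tf × log(N / df) and whitespace tokenisation. The abstract does not state which TF-IDF variant or preprocessing the authors used, so the function name `tfidf` and the placeholder strings below are hypothetical and are not drawn from the dataset. Real Arabic text would likely need normalisation and a proper tokeniser before this step, and the resulting vectors would then be passed to a classifier such as a support vector machine.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Assumed weighting: (term count / document length) * log(N / df).
    Tokenisation is a plain whitespace split, a simplification.
    """
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Placeholder strings for illustration only.
docs = ["alpha beta beta", "alpha gamma", "alpha delta delta delta"]
weights = tfidf(docs)
```

In this sketch a term that occurs in every document (here `alpha`) gets weight zero, while terms concentrated in one document receive positive weights, which is what makes such features useful for separating classes.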

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
