SOD: A Corpus for Saudi Offensive Language Detection Classification

Afefa Asiri; Mostafa Saleh

doi:10.3390/computers13080211

Computers (Aug 2024)

Affiliations

Afefa Asiri: Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia
Mostafa Saleh: Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia

DOI: https://doi.org/10.3390/computers13080211
Journal volume & issue: Vol. 13, no. 8, p. 211

Abstract

Social media platforms like X (formerly known as Twitter) are integral to modern communication, enabling the sharing of news, emotions, and ideas. However, they also facilitate the spread of harmful content, and manual moderation of these platforms is impractical. Automated moderation tools, predominantly developed for English, are insufficient for addressing online offensive language in Arabic, a language that is rich in dialects and is used informally on social media. This gap underscores the need for dedicated, dialect-specific resources. This study introduces the Saudi Offensive Dialectal dataset (SOD), consisting of over 24,000 tweets annotated at three hierarchical levels. At the first level, each tweet is labeled as offensive or non-offensive. At the second level, offensive tweets are categorized as general insults, hate speech, or sarcasm. At the third level, hate speech is divided into subtypes related to sports, religion, politics, race, and violence. A comprehensive descriptive analysis of the SOD is also provided to offer deeper insights into its composition. Using machine learning, traditional deep learning, and transformer-based deep learning models, particularly AraBERT, our research achieves an F1-score of 87% in identifying offensive language. This score improves to 91% with data augmentation techniques that address dataset imbalances. These results, which surpass those reported in many existing studies, demonstrate that a specialized dialectal dataset enhances detection efficacy compared to mixed-language datasets.

Published in Computers

ISSN: 2073-431X (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: http://www.mdpi.com/journal/computers
