Applied Sciences (Dec 2024)

A Multi-Architecture Approach for Offensive Language Identification Combining Classical Natural Language Processing and BERT-Variant Models

  • Ashok Yadav,
  • Farrukh Aslam Khan,
  • Vrijendra Singh

DOI: https://doi.org/10.3390/app142311206
Journal volume & issue: Vol. 14, no. 23, p. 11206

Abstract

Offensive content is a complex and multifaceted form of harmful material that targets individuals or groups. In recent years, offensive language (OL) has become increasingly harmful, as it incites violence and intolerance. The automatic identification of OL on social networks is essential to curtail the spread of harmful content. We address this problem by developing architectures that effectively respond to and mitigate the impact of offensive content on society. In this paper, we use the Davidson dataset, which contains 24,783 tweet samples, and propose three different architectures for detecting OL on social media platforms. The first architecture concatenates features (TF-IDF, Word2Vec, sentiment scores, and FKRA/FRE readability scores) and uses a baseline machine learning model for classification. The second explores the effectiveness of GloVe embeddings of different dimensions in conjunction with deep learning models for classifying OL. The third utilizes advanced transformer models such as BERT, ALBERT, and ELECTRA for pre-processing and encoding, with 1D CNN and neural network layers serving as the classification components. We achieve the highest precision, recall, and F1 score, i.e., 0.89, 0.90, and 0.90, respectively, for both the “bert encased preprocess/1 + small bert/L4H512A8/1 + neural network layers” model and the “bert encased preprocess/1 + electra small/2 + cnn” architecture.
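
The third architecture described above (a BERT-variant preprocessing/encoding stage followed by a 1D CNN or dense classification head) could be sketched with TensorFlow Hub roughly as follows. This is a minimal illustrative sketch, not the authors' released code: the hub handles, convolution width, dropout rate, learning rate, and the three-class label set are assumptions made for the example.

```python
# Minimal sketch of a TF Hub BERT-variant encoder with a 1D CNN head,
# assuming TensorFlow 2.x, tensorflow_hub, and tensorflow_text are installed.
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401  (registers ops required by the preprocess model)

# Illustrative hub handles (assumed, not taken from the paper's code).
PREPROCESS_HANDLE = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
ENCODER_HANDLE = (
    "https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/2"
)  # swap for https://tfhub.dev/google/electra_small/2 for an ELECTRA variant


def build_model(num_classes: int = 3) -> tf.keras.Model:
    # Raw tweet strings go in; the preprocess layer handles tokenization and padding.
    text_in = tf.keras.layers.Input(shape=(), dtype=tf.string, name="tweet")
    encoder_inputs = hub.KerasLayer(PREPROCESS_HANDLE, name="preprocess")(text_in)
    encoder_outputs = hub.KerasLayer(ENCODER_HANDLE, trainable=True, name="encoder")(
        encoder_inputs
    )

    # Token-level embeddings feed the 1D CNN; "pooled_output" could instead feed
    # a plain dense (neural network layer) head.
    seq = encoder_outputs["sequence_output"]  # (batch, seq_len, hidden)
    x = tf.keras.layers.Conv1D(128, 3, activation="relu")(seq)
    x = tf.keras.layers.GlobalMaxPooling1D()(x)
    x = tf.keras.layers.Dropout(0.2)(x)
    out = tf.keras.layers.Dense(num_classes, activation="softmax")(x)

    model = tf.keras.Model(text_in, out)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(2e-5),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


# Example usage (train_texts: list of tweet strings, train_labels: integer class ids):
# model = build_model()
# model.fit(train_texts, train_labels, validation_split=0.1, epochs=3, batch_size=32)
```

Switching the encoder handle between the small BERT and ELECTRA models, or replacing the Conv1D/pooling pair with dense layers, gives variants analogous to the two top-scoring configurations quoted in the abstract.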

Keywords