Applied Sciences (Dec 2024)

A Multi-Architecture Approach for Offensive Language Identification Combining Classical Natural Language Processing and BERT-Variant Models

  • Ashok Yadav,
  • Farrukh Aslam Khan,
  • Vrijendra Singh

DOI: https://doi.org/10.3390/app142311206
Journal volume & issue: Vol. 14, no. 23, p. 11206

Abstract

Offensive content is a complex and multifaceted form of harmful material that targets individuals or groups. In recent years, offensive language (OL) has become increasingly harmful, as it incites violence and intolerance. The automatic identification of OL on social networks is essential to curtail the spread of harmful content. We address this problem by developing architectures that effectively respond to and mitigate the impact of offensive content on society. In this paper, we use the Davidson dataset, which contains 24,783 tweet samples, and propose three different architectures for detecting OL on social media platforms. The first architecture concatenates features (TF-IDF, Word2Vec, sentiment scores, and FKRA/FRE readability scores) and uses a baseline machine learning model for classification. The second explores the effectiveness of GloVe embeddings of different dimensions in conjunction with deep learning models for classifying OL. The third utilizes advanced transformer models such as BERT, ALBERT, and ELECTRA for pre-processing and encoding, with 1D CNN and neural network layers serving as the classification components. We achieve the highest precision, recall, and F1 score, i.e., 0.89, 0.90, and 0.90, respectively, for both the “bert encased preprocess/1 + small bert/L4H512A8/1 + neural network layers” model and the “bert encased preprocess/1 + electra small/2 + cnn” architecture.
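
The third architecture described above (a BERT-variant preprocessing/encoding stage followed by a 1D CNN or dense classification head) could be sketched with TensorFlow Hub roughly as follows. This is a minimal illustrative sketch, not the authors' released code: the hub handles, convolution width, dropout rate, learning rate, and the three-class label set are assumptions made for the example.

```python
# Minimal sketch of a TF Hub BERT-variant encoder with a 1D CNN head,
# assuming TensorFlow 2.x, tensorflow_hub, and tensorflow_text are installed.
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401  (registers ops required by the preprocess model)

# Illustrative hub handles (assumed, not taken from the paper's code).
PREPROCESS_HANDLE = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
ENCODER_HANDLE = (
    "https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/2"
)  # swap for https://tfhub.dev/google/electra_small/2 for an ELECTRA variant


def build_model(num_classes: int = 3) -> tf.keras.Model:
    # Raw tweet strings go in; the preprocess layer handles tokenization and padding.
    text_in = tf.keras.layers.Input(shape=(), dtype=tf.string, name="tweet")
    encoder_inputs = hub.KerasLayer(PREPROCESS_HANDLE, name="preprocess")(text_in)
    encoder_outputs = hub.KerasLayer(ENCODER_HANDLE, trainable=True, name="encoder")(
        encoder_inputs
    )

    # Token-level embeddings feed the 1D CNN; "pooled_output" could instead feed
    # a plain dense (neural network layer) head.
    seq = encoder_outputs["sequence_output"]  # (batch, seq_len, hidden)
    x = tf.keras.layers.Conv1D(128, 3, activation="relu")(seq)
    x = tf.keras.layers.GlobalMaxPooling1D()(x)
    x = tf.keras.layers.Dropout(0.2)(x)
    out = tf.keras.layers.Dense(num_classes, activation="softmax")(x)

    model = tf.keras.Model(text_in, out)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(2e-5),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


# Example usage (train_texts: list of tweet strings, train_labels: integer class ids):
# model = build_model()
# model.fit(train_texts, train_labels, validation_split=0.1, epochs=3, batch_size=32)
```

Switching the encoder handle between the small BERT and ELECTRA models, or replacing the Conv1D/pooling pair with dense layers, gives variants analogous to the two top-scoring configurations quoted in the abstract.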

Keywords