PeerJ Computer Science (Dec 2024)

Leveraging deep learning for toxic comment detection in cursive languages

  • Muhammad Shahid,
  • Muhammad Umair,
  • Muhammad Amjad Iqbal,
  • Muhammad Rashid,
  • Sheeraz Akram,
  • Muhammad Zubair

DOI
https://doi.org/10.7717/peerj-cs.2486
Journal volume & issue
Vol. 10
p. e2486

Abstract


Social media platforms enable individuals to publicly express opinions, support, and criticism. Influencers can launch campaigns to promote ideas. Most people can now share their views and feelings through visual or textual comments, which can range from appreciation to hate speech, potentially inciting societal violence and hatred. Detecting these noxious comments and thoughts is critical to protecting our communities from their negative social, psychological, and political impact. Although Urdu (a low-resource language) is one of the most popular Asian languages around the globe, no standard tool exists to detect toxic comments posted in this language. Tokenizing and then categorizing cursive text is challenging due to its complex nature, especially when dealing with toxic comments, which are often ungrammatical and very brief. This study proposes a novel model to identify salient features in Urdu sentences. It uses transformers to identify and flag toxic comments through deep learning-based binary classification of the text. Statistically, the proposed fine-tuned model outperforms the existing ones by achieving a precision of 88.38%. Among the models evaluated, bidirectional encoder representations from transformers (BERT) demonstrated superior performance, with an accuracy of 85.45%, precision of 85.71%, recall of 85.45%, F1 score of 85.41%, and Cohen's kappa of 70.83% on the full feature set. Conversely, GPT-2 was identified as the lowest-performing model. The outcomes of this research represent a noteworthy advancement in the broader endeavor to improve and optimize content moderation mechanisms across diverse languages and platforms.
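
For readers unfamiliar with the evaluation metrics named in the abstract, the following is a minimal sketch of how accuracy, precision, recall, F1 score, and Cohen's kappa can be computed for a binary toxic/non-toxic classifier. It is not the authors' evaluation code: the function name `binary_metrics`, the label encoding (1 = toxic, 0 = non-toxic), and the toy labels are assumptions made for illustration, and the abstract does not state whether the reported precision and recall are per-class or averaged across classes.

```python
from collections import Counter


def binary_metrics(y_true, y_pred):
    """Compute common binary-classification metrics for 0/1 labels.

    Assumption for this sketch: 1 marks the positive (toxic) class,
    and precision/recall/F1 are reported for that class only.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")

    # Count each (true, predicted) pair to build the confusion matrix.
    counts = Counter(zip(y_true, y_pred))
    tp, tn = counts[(1, 1)], counts[(0, 0)]
    fp, fn = counts[(0, 1)], counts[(1, 0)]
    n = tp + tn + fp + fn

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # Cohen's kappa: observed agreement corrected for chance agreement,
    # where chance agreement comes from the marginal label frequencies.
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance != 1 else 0.0

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "kappa": kappa}


# Hypothetical toy labels, not data from the study.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
print(binary_metrics(y_true, y_pred))
```

Cohen's kappa is useful alongside accuracy because it discounts the agreement a classifier would reach by chance given the class proportions, which is why it can be noticeably lower than accuracy.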

Keywords