IEEE Access (Jan 2024)

Detection of Hate Speech and Offensive Language CodeMix Text in Dravidian Languages Using Cost-Sensitive Learning Approach

  • K. Sreelakshmi,
  • B. Premjith,
  • Bharathi Raja Chakravarthi,
  • K. P. Soman

DOI
https://doi.org/10.1109/ACCESS.2024.3358811
Journal volume & issue
Vol. 12
pp. 20064 – 20090

Abstract

The emergence of social media has opened the way for online harassment in the form of hate speech and offensive language, making automated detection of such content indispensable. The task is especially challenging for social media posts and comments written in low-resourced CodeMix languages. This paper investigates the efficacy of various multilingual transformer-based embedding models, combined with machine learning classifiers, for detecting hate speech and offensive language (HOS) content in social media posts in CodeMix Dravidian languages, which belong to the low-resource language group. Experiments were conducted on six openly available datasets in Kannada-English, Malayalam-English and Tamil-English. The objective is to identify a single pre-trained embedding model that works well for HOS tasks across all of these languages. To this end, we conducted a comprehensive study of multilingual transformer embedding models, namely BERT, DistilBERT, LaBSE, MuRIL, XLM, IndicBERT, and FNET, for HOS detection. Our experiments revealed that the MuRIL pre-trained embedding performed consistently well on all six datasets when used with a Support Vector Machine (SVM) with a Radial Basis Function (RBF) kernel. The highest accuracy obtained on each dataset was as follows: on DravidianLangTech 2021, 96% for Malayalam, 72% for Tamil, and 66% for Kannada; on HASOC 2021, 76% for Tamil and 68% for Malayalam; and on HASOC 2020, 92% for Malayalam. We also performed an in-depth error analysis and a comparative study, presenting a tabulated summary of our work against other top-performing studies.
In addition, we employed a cost-sensitive learning approach to address the class imbalance problem in the dataset, in which minority classes receive higher classification weights than majority classes. The weights were initialized and then fine-tuned to obtain the best balance across all classes. The results showed that incorporating the cost-sensitive learning strategy avoided class bias in the trained model. A further significant contribution of this paper is a novel annotated test set for Malayalam-English CodeMix. This new dataset serves as an extension to our existing data, known as the Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages (HASOC) 2021 Malayalam-English dataset.
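
The abstract does not state how the class weights were initialized, so the following is only a minimal sketch of one common starting point: inverse-frequency weighting, where each class weight is n_samples / (n_classes * class_count). The label names, the toy label list, and the helper functions `inverse_frequency_weights` and `weighted_error` are hypothetical and are not taken from the paper.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return {class: weight}, weight = n_samples / (n_classes * class_count).

    Assumption: this inverse-frequency heuristic is an illustrative
    initialization, not necessarily the one used by the authors.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def weighted_error(y_true, y_pred, weights):
    """Cost-sensitive error: each mistake is charged the weight of its true class."""
    total = sum(weights[t] for t in y_true)
    wrong = sum(weights[t] for t, p in zip(y_true, y_pred) if t != p)
    return wrong / total


# Hypothetical, imbalanced toy labels (8 majority, 2 minority examples).
labels = ["not_offensive"] * 8 + ["offensive"] * 2
weights = inverse_frequency_weights(labels)

# A degenerate classifier that always predicts the majority class.
majority_only = ["not_offensive"] * len(labels)

# Unweighted error hides the fact that the minority class is missed entirely.
plain_error = sum(t != p for t, p in zip(labels, majority_only)) / len(labels)

# The weighted error penalizes the missed minority examples more heavily.
cost_error = weighted_error(labels, majority_only, weights)

print(weights)
print(plain_error, cost_error)
```

In this toy case the minority class receives a larger weight than the majority class, so a model that ignores the minority class is penalized more under the weighted error than under plain error. If scikit-learn is used, a dictionary of this shape can be passed as the `class_weight` argument of `sklearn.svm.SVC(kernel="rbf")` and then tuned further; whether the authors proceeded this way is not stated in the abstract.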

Keywords