Telugu language hate speech detection using deep learning transformer models: Corpus generation and evaluation

Namit Khanduja; Nishant Kumar; Arun Chauhan

Systems and Soft Computing (Dec 2024)

Affiliations

Namit Khanduja: Department of Computer Science & Engineering, Faculty of Engineering & Technology, Gurukula Kangri Deemed to be University, Haridwar, Uttarakhand, India; Corresponding author.
Nishant Kumar: Department of Computer Science & Engineering, Faculty of Engineering & Technology, Gurukula Kangri Deemed to be University, Haridwar, Uttarakhand, India
Arun Chauhan: Department of Computer Science & Engineering, Graphic Era Deemed to be University, Dehradun, Uttarakhand, India

Journal volume & issue: Vol. 6, p. 200112

Abstract

In today's digital era, social media has become a new tool for communication and information sharing, and the availability of high-speed internet lets it reach the masses much faster. A lack of regulation and ethics has fueled the proliferation of abusive language, and hate speech has become a growing concern on social media platforms, where it appears in posts, replies, and comments directed at individuals, groups, religions, and communities. However, manually classifying hate speech on online platforms is cumbersome and impractical because of the sheer volume of data being generated. Therefore, it is crucial to automatically filter online content to identify and eliminate hate speech from social media. Research has been driven by widely spoken, resource-rich languages such as English, which have achieved the desired results thanks to the accessibility of large corpora, annotated datasets, and tools. Resource-constrained languages have not been able to benefit from these advances because they lack corpora and annotated datasets. India has diverse languages that vary with demographics, many of which have limited data availability and differ semantically. Telugu is one of the low-resource Dravidian languages spoken in the southern part of India. In this paper, we present a monolingual Telugu corpus consisting of tweets posted on Twitter and annotated with hate and non-hate labels, along with experiments that compare state-of-the-art fine-tuned deep learning models (mBERT, DistilBERT, IndicBERT, NLLB, MuRIL, RNN+LSTM, XLM-RoBERTa, and IndicBART). Through transfer learning and hyperparameter tuning, the models are compared for their effectiveness in classifying hate speech in Telugu text. The fine-tuned mBERT model outperformed all other fine-tuned models, achieving an accuracy of 98.2%. We also propose a deployment model for social media accounts.
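
The comparison described in the abstract ranks models by accuracy on a binary hate/non-hate task. As a rough illustration only, the following Python sketch computes accuracy for several sets of predictions and sorts them. The label encoding, the function names, the model names `model_a` and `model_b`, and the toy data are all assumptions made for this example; none of it comes from the paper.

```python
from typing import Dict, List, Tuple

# Assumed label encoding for illustration: 1 = hate, 0 = non-hate.
HATE, NON_HATE = 1, 0


def accuracy(gold: List[int], pred: List[int]) -> float:
    """Fraction of predictions that match the gold labels."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")
    correct = sum(g == p for g, p in zip(gold, pred))
    return correct / len(gold)


def rank_models(gold: List[int],
                predictions: Dict[str, List[int]]) -> List[Tuple[str, float]]:
    """Return (model name, accuracy) pairs sorted from best to worst."""
    scores = {name: accuracy(gold, pred) for name, pred in predictions.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy gold labels and hypothetical model outputs (not real results).
gold = [HATE, NON_HATE, NON_HATE, HATE, NON_HATE]
predictions = {
    "model_a": [HATE, NON_HATE, NON_HATE, HATE, NON_HATE],
    "model_b": [HATE, NON_HATE, HATE, HATE, NON_HATE],
}

ranking = rank_models(gold, predictions)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Accuracy alone can be misleading when hate and non-hate examples are unevenly distributed, so a fuller comparison would typically also report precision, recall, and F1.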

Published in Systems and Soft Computing

ISSN: 2772-9419 (Online)
Publisher: Elsevier
Country of publisher: Netherlands
LCC subjects: Technology: Technology (General): Industrial engineering. Management engineering: Information technology; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://www.sciencedirect.com/journal/systems-and-soft-computing
