Data in Brief (Aug 2024)
Bengali & Banglish: A monolingual dataset for emotion detection in linguistically diverse contexts
Abstract
Advances in information technology continue to reshape global communication, making emotion detection an increasingly important task in natural language processing. Interpreting emotion remains difficult in linguistically diverse contexts, particularly in low-resource languages such as Bengali, and the emergence of Banglish compounds the problem. To address this gap, we present “Bengali & Banglish,” a dataset of 80,098 labelled samples across six emotion classes. The dataset addresses the lack of fine-grained emotion classification resources for Bengali and is a pioneering resource for emotion detection in Banglish. Using BanglaBERT, we obtain a weighted F1 score of 71.30% for Bengali and 64.59% for Banglish. The dataset also supports Bengali-to-Banglish machine translation, contributing to the advancement of language processing models. Annotation reliability is high, with a Cohen's Kappa score of 93.5%, which affirms the consistency of our annotations. This research underscores the importance of linguistic diversity in NLP and provides a valuable resource for improving emotion detection in Bengali and Banglish across digital platforms.
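For readers less familiar with the two metrics reported in the abstract, the following is a minimal sketch of how a weighted F1 score and Cohen's Kappa are commonly computed. It is not the authors' evaluation code. The function names, the emotion labels ("joy", "sad", "anger"), and the toy label lists are hypothetical and chosen only for illustration; they do not come from the dataset.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 scores, weighted by each class's share of y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score


def cohen_kappa(a, b):
    """Agreement between two annotators, corrected for chance agreement.

    Assumes chance agreement is below 1 (i.e. more than one label is in use).
    """
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical toy labels (not taken from the dataset).
gold = ["joy", "joy", "sad", "sad", "anger", "anger"]
pred = ["joy", "sad", "sad", "sad", "anger", "joy"]

print(round(weighted_f1(gold, pred), 4))   # 0.6556
print(round(cohen_kappa(gold, pred), 4))   # 0.5
```

In the toy example, the per-class F1 scores are 0.5, 0.8, and about 0.667, each weighted by 2/6. For Kappa, the observed agreement of 4/6 is corrected by a chance agreement of 1/3, which gives 0.5. Both functions return a fraction in this sketch, whereas the abstract reports the metrics as percentages.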