Data in Brief (Oct 2023)
MABSA: A curated Malayalam aspect based sentiment analysis dataset on movie reviews
Abstract
Regional languages are increasingly used on online platforms as digital technology spreads. Understanding user opinions expressed in Indian regional languages on social media platforms, forums, blogs, and other digital platforms has become important because these opinions feed a range of applications. Research on sentiment analysis of Indian regional language texts is held back by the lack of available regional language datasets. The curated Malayalam Aspect Based Sentiment Analysis (MABSA) dataset is a labeled dataset for Aspect Based Sentiment Analysis (ABSA) in the Indian regional language Malayalam, covering the movie review domain. Malayalam movie reviews, a rich source of text data for ABSA, were collected through an online survey run with Google Forms and by manually gathering reviews from three social media platforms: IMDb, Facebook, and YouTube. Nine target aspects were identified, and three annotators labeled the sentiment polarity of each aspect. In total, 4000 reviews were collected and 7507 aspects were identified in them. Spearman's correlation and the Fleiss Kappa Index were used to measure agreement among the annotators. The high correlation between the annotators indicates that the MABSA dataset is of gold-standard quality.
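As an illustration of the agreement check named in the abstract, the following is a minimal sketch of Fleiss' kappa for three annotators assigning one polarity label per aspect. The label names (`pos`, `neg`, `neu`), the toy ratings, and the function name `fleiss_kappa` are assumptions made for this example; they are not taken from the MABSA dataset or from the authors' analysis code.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a list of items, each rated by the same number of raters.

    ratings: list of lists; ratings[i] holds the labels given to item i.
    categories: iterable of all possible labels.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])

    # counts[i][c] = number of raters who assigned category c to item i
    counts = [Counter(item) for item in ratings]

    # Observed agreement for each item, then averaged over items
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items

    # Expected agreement by chance, from overall category proportions
    p_cats = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_e = sum(p * p for p in p_cats)

    return (p_bar - p_e) / (1 - p_e)


# Hypothetical toy data: 4 aspect mentions, 3 annotators each
toy_ratings = [
    ["pos", "pos", "pos"],
    ["neg", "neg", "pos"],
    ["neu", "neu", "neu"],
    ["neg", "neg", "neg"],
]
kappa = fleiss_kappa(toy_ratings, ["pos", "neg", "neu"])
print(round(kappa, 4))
```

Values close to 1 indicate near-complete agreement among annotators, while values near 0 indicate agreement no better than chance. The same ratings could be encoded as ordinal scores and compared pairwise with Spearman's correlation.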