Toward Machine Learning Based Binary Sentiment Classification of Movie Reviews for Resource Restraint Language (RRL)—Hindi

Ankita Sharma; Udayan Ghose

doi:10.1109/ACCESS.2023.3283461

IEEE Access (Jan 2023)

Affiliations

Ankita Sharma: University School of Information, Communication and Technology, Guru Gobind Singh Indraprastha University, New Delhi, India
Udayan Ghose: University School of Information, Communication and Technology, Guru Gobind Singh Indraprastha University, New Delhi, India

DOI: https://doi.org/10.1109/ACCESS.2023.3283461
Journal volume & issue: Vol. 11
pp. 58546 – 58564

Abstract

Sentiment analysis has progressed significantly for English, whereas research on Hindi is still nascent. Despite being the third most spoken language worldwide, Hindi remains a resource restraint language (RRL). Movie reviews are a rich source of opinionated content, fueled by people's passionate engagement with the film industry. The growing use of Hindi in review writing motivated us to devise an approach for binary sentiment classification of movie reviews. We compiled and manually annotated a Hindi Language Movie Review (HLMR) dataset comprising 10K reviews for our experiments, and we also identified the challenges associated with Hindi. In addition to HLMR, we used two publicly available IIT-P datasets, one of movie reviews and one of product reviews. After preprocessing the datasets, we explored TF-ISF with word-level N-gram features for text representation. Studies suggest that the performance of machine learning approaches can be enhanced by hyperparameter tuning and ensemble learning. Several baseline classifiers were applied first, and their hyperparameters were tuned using grid search. Ensemble-based classifiers were then applied individually. Lastly, we propose a simple yet powerful stacked ensemble-based architecture (SEBA), which classifies Hindi reviews effectively by combining the strengths of the tuned baseline classifiers and ensemble learning. Comprehensive experiments were conducted on all three datasets. Empirical results demonstrate that SEBA outperformed the individual baselines and performed best with unigrams and TF-ISF as features across the datasets. On the HLMR dataset, SEBA achieved an accuracy, precision, and recall of 0.808 and an F1-score of 0.807. These findings support the effectiveness of the proposed solution and indicate its suitability for online deployment in binary review classification tasks.
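The feature step described in the abstract, word-level N-grams weighted by TF-ISF, can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not give the weighting formula, so the sketch assumes that each review is the counting unit and that the weight is the relative term frequency multiplied by log(N / uf) + 1, where N is the number of reviews and uf is the number of reviews containing the N-gram. The function names `word_ngrams` and `weighted_features` and the three toy reviews are hypothetical.

```python
import math
from collections import Counter


def word_ngrams(text, n=1):
    """Return the word-level n-grams of a whitespace-tokenized text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def weighted_features(reviews, n=1):
    """Map each review to a dict of n-gram weights.

    Assumed weighting (not taken from the paper):
        weight = (count / review length) * (log(N / uf) + 1)
    where N is the number of reviews and uf is the number of reviews
    that contain the n-gram at least once.
    """
    grams_per_review = [word_ngrams(r, n) for r in reviews]

    # Count in how many reviews each n-gram occurs.
    unit_freq = Counter()
    for grams in grams_per_review:
        unit_freq.update(set(grams))

    total = len(reviews)
    features = []
    for grams in grams_per_review:
        counts = Counter(grams)
        length = len(grams) or 1  # guard against empty reviews
        features.append({
            g: (c / length) * (math.log(total / unit_freq[g]) + 1.0)
            for g, c in counts.items()
        })
    return features


# Toy reviews, invented for illustration only.
reviews = [
    "फिल्म बहुत अच्छी है",
    "फिल्म बहुत खराब है",
    "कहानी अच्छी है",
]
features = weighted_features(reviews, n=1)
```

Under this assumed weighting, an N-gram that occurs in every review gets the smallest weight for its frequency, while one that occurs in a single review gets the largest. Passing `n=2` or `n=3` yields bigram or trigram features; the resulting dictionaries would still need to be converted to vectors before being passed to a classifier.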

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
