IEEE Access (Jan 2023)

SETAR: Stacking Ensemble Learning for Thai Sentiment Analysis Using RoBERTa and Hybrid Feature Representation

  • Pree Thiengburanathum,
  • Phasit Charoenkwan

DOI
https://doi.org/10.1109/ACCESS.2023.3308951
Journal volume & issue
Vol. 11
pp. 92822–92837

Abstract

Sentiment classification of social media posts is among the most challenging and time-consuming tasks for analysts. This is particularly true for languages written in scriptio continua, such as Thai, which has no spaces between written words and no end-of-sentence punctuation. Thai is also considered a low-resource language, as few datasets are available to researchers. Although machine-learning (ML) and deep-learning (DL) algorithms can identify sentiment polarity, the performance of existing classification models is still inadequate. This study proposes SETAR, a novel stacking ensemble learning technique for identifying sentiment polarity in the Thai language. Our stacking ensemble strategy uses the pre-trained Thai language model WangChanBERTa, based on the Robustly Optimized BERT Pretraining Approach (RoBERTa) architecture, to form a feature vector. This vector is combined with three distinct feature vectors obtained from three well-known categories, namely Word2Vec, TF-IDF, and bag-of-words, to produce a new hybrid sentence representation. The base learners were trained using seven heterogeneous ML algorithms, namely support vector machine (SVM), random forest (RF), extremely randomized trees (ET), light gradient boosting machine (LGBM), multi-layer perceptron (MLP), partial least squares (PLS), and logistic regression (LR), to enable the development of the final meta-learner. The results reveal that our proposed stacking ensemble model outperformed the baseline models on all classification metrics across the training and test sets, as determined by extensive benchmarking on four datasets, including our own sentiment corpus annotated by domain experts.
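To make the stacking idea concrete, the sketch below shows one way such a pipeline could be wired up with scikit-learn. It is a minimal illustration, not the authors' exact method: the dense vectors are random placeholders standing in for WangChanBERTa and Word2Vec sentence embeddings, character n-grams stand in for Thai word segmentation, and only a subset of the seven base learners is included.

```python
# Illustrative stacking ensemble over a hybrid sentence representation.
# Assumptions: scikit-learn is available; dense embeddings are placeholders.
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC

texts = ["อร่อยมาก", "บริการแย่", "ชอบที่นี่", "ไม่ประทับใจ"]  # toy Thai posts
labels = [1, 0, 1, 0]                                      # 1 = positive, 0 = negative

# Sparse lexical features: TF-IDF and bag-of-words (character n-grams used
# here only as a stand-in for proper Thai word segmentation).
tfidf = TfidfVectorizer(analyzer="char", ngram_range=(1, 3))
bow = CountVectorizer(analyzer="char", ngram_range=(1, 3))
X_tfidf = tfidf.fit_transform(texts)
X_bow = bow.fit_transform(texts)

# Placeholder dense block standing in for WangChanBERTa / Word2Vec embeddings.
rng = np.random.default_rng(0)
X_dense = rng.normal(size=(len(texts), 16))

# Hybrid sentence representation: concatenate all feature blocks.
X = hstack([X_tfidf, X_bow, csr_matrix(X_dense)]).tocsr()

# Heterogeneous base learners feeding a logistic-regression meta-learner.
base_learners = [
    ("svm", LinearSVC()),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("et", ExtraTreesClassifier(n_estimators=50, random_state=0)),
    ("mlp", MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)),
]
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(),
                           cv=2)
stack.fit(X, labels)
print(stack.predict(X))
```

In a real pipeline, the placeholder dense block would be replaced by sentence embeddings extracted from the pre-trained WangChanBERTa model and a trained Word2Vec model, and the remaining base learners (LGBM, PLS, LR) would be added to the estimator list.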

Keywords