Cross-domain sentiment analysis model on Indonesian YouTube comment

Agus Sasmito Aribowo; Halizah Basiron; Noor Fazilla Abd Yusof; Siti Khomsah

doi:10.26555/ijain.v7i1.554

IJAIN (International Journal of Advances in Intelligent Informatics) (Mar 2021)

Affiliations

Agus Sasmito Aribowo: Universitas Pembangunan Nasional "Veteran" Yogyakarta, Indonesia
Halizah Basiron: Universiti Teknikal Malaysia Melaka
Noor Fazilla Abd Yusof: Universiti Teknikal Malaysia Melaka
Siti Khomsah: Institut Teknologi Telkom Purwokerto

DOI: https://doi.org/10.26555/ijain.v7i1.554
Journal volume & issue: Vol. 7, no. 1
pp. 12 – 25

Abstract

Cross-domain sentiment analysis (CDSA) in the Indonesian language using tree-based ensemble machine learning remains an open and interesting research topic. CDSA can support the labeling of sentiment across domains and reduce dependence on human experts; however, how it behaves on opinions made unstructured by stop words, informal expressions, and Indonesian slang has not yet been identified. This study aimed to obtain the best CDSA model for Indonesian-language opinions, which are commonly full of stop words and dialect slang. Specifically, it examined the benefit of stop-word cleaning and slang-word conversion for Indonesian CDSA and identified which machine learning method suits the model. The study began by crawling five datasets of YouTube comments from five different domains. Each dataset was copied into two groups: one without stop-word cleaning and slang-word conversion, and one with both steps applied. A CDSA model was built for each dataset group and tested with two tree-based ensemble classifiers, Random Forest (RF) and Extra Tree (ET), and, for comparison, with three non-ensemble classifiers: Naïve Bayes (NB), SVM, and Decision Tree (DT). The results suggest that the accuracy of Indonesian CDSA increased when stop words were removed and slang words were converted. The best classifier was built with tree-based ensemble machine learning, particularly ET, which achieved the highest accuracy in this study, 91.19%. This model is expected to serve as an alternative CDSA technique for the Indonesian language.
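The two dataset groups described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The stop-word set, the slang dictionary, the sample comment, and the function name `preprocess` are all hypothetical, since the abstract does not list the resources the study used. The sketch also assumes that slang is converted before stop words are removed, so that a slang form of a stop word (such as "yg" for "yang") is also dropped.

```python
import re

# Hypothetical, tiny lexicons for illustration only; the abstract does not
# specify the stop-word list or slang dictionary used in the study.
STOP_WORDS = {"yang", "dan", "di", "itu", "ini"}
SLANG_MAP = {"gk": "tidak", "ga": "tidak", "bgt": "sangat", "yg": "yang", "bgs": "bagus"}


def preprocess(comment, clean=True):
    """Lowercase and tokenize a comment.

    With clean=True, slang tokens are converted to their standard form and
    stop words are then removed (assumed order). With clean=False, the raw
    tokens are returned, mirroring the group left without either step.
    """
    tokens = re.findall(r"[a-z]+", comment.lower())
    if not clean:
        return tokens
    tokens = [SLANG_MAP.get(t, t) for t in tokens]     # slang conversion
    return [t for t in tokens if t not in STOP_WORDS]  # stop-word cleaning


comment = "Videonya bgs bgt, yg ini gk membosankan"  # made-up example comment
raw_group = preprocess(comment, clean=False)
clean_group = preprocess(comment, clean=True)
```

Each group's token lists would then be vectorized and passed to the classifiers being compared, training on some domains and testing on another.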

Published in IJAIN (International Journal of Advances in Intelligent Informatics)

ISSN: 2442-6571 (Print); 2548-3161 (Online)
Publisher: Universitas Ahmad Dahlan
Country of publisher: Indonesia
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: http://ijain.org/index.php/IJAIN/index
