Data Balancing Approach Using Combine Sampling on Sentiment Analysis With K-Nearest Neighbor

Evlyn Pricilia Kondy; Siswanto Siswanto; Nirwan Ilyas

doi:10.32520/stmsi.v13i5.4013

Sistemasi: Jurnal Sistem Informasi (Sep 2024)

Affiliations

Evlyn Pricilia Kondy: Hasanuddin University
Siswanto Siswanto: Hasanuddin University
Nirwan Ilyas: Hasanuddin University

Journal volume & issue: Vol. 13, no. 5
pp. 1836 – 1851

Abstract

One topic discussed on Twitter is the policy on removing masks. However, data collected from Twitter may have unequal class sizes, and such class imbalance can degrade classification performance. Combine sampling is a data-balancing strategy that combines oversampling and undersampling techniques. The data in this research consist of Indonesian tweets using the hashtag "The Policy of Removing Masks." In this study, the classification method was K-Nearest Neighbor, with SMOTE as the oversampling technique and Tomek Links as the undersampling technique. The purpose of this research is to classify sentiment using the K-Nearest Neighbor algorithm and to use combine sampling to balance the training data across the two imbalanced classes. After the data were split, the training set contained 234 samples with positive sentiment and 652 samples with negative sentiment, so the positive class is the minority class and the negative class is the majority class. After combine sampling, the training set contained 613 samples in the positive class and 613 in the negative class. Sentiment classification on the balanced data yielded an accuracy of 60.4%, a precision of 78.5%, and a recall of 65%. The accuracy of 60.4% is attributed to the model misinterpreting tweets about Indonesia's mask removal policy, leading to incorrect classifications.
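
The pipeline summarized in the abstract (SMOTE oversampling, Tomek Links undersampling, then K-Nearest Neighbor classification) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The two-dimensional toy points stand in for vectorized tweets, and the function names, the values of k, the random seed, and the choice to drop both members of each Tomek link are all assumptions made for illustration.

```python
import math
import random

def nearest(idx, points, k):
    """Indices of the k points closest to points[idx], excluding idx itself."""
    others = [j for j in range(len(points)) if j != idx]
    return sorted(others, key=lambda j: math.dist(points[idx], points[j]))[:k]

def smote(minority, n_new, k=2, seed=0):
    """SMOTE-style oversampling: interpolate between a minority point
    and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        j = rng.choice(nearest(i, minority, k))
        gap = rng.random()  # position along the segment from point i to point j
        synthetic.append(tuple(a + gap * (b - a)
                               for a, b in zip(minority[i], minority[j])))
    return synthetic

def remove_tomek_links(X, y):
    """Drop both members of every Tomek link, i.e. every pair of mutual
    nearest neighbours that carry different labels (an assumed variant)."""
    nn = [nearest(i, X, 1)[0] for i in range(len(X))]
    drop = {i for i, j in enumerate(nn) if nn[j] == i and y[i] != y[j]}
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]

def knn_predict(X, y, query, k=3):
    """Majority vote among the k training points closest to the query."""
    order = sorted(range(len(X)), key=lambda i: math.dist(X[i], query))[:k]
    votes = [y[i] for i in order]
    return max(set(votes), key=votes.count)

# Hypothetical toy data: "neg" is the majority class, "pos" the minority.
neg = [(0.0, 0.0), (0.5, 0.2), (1.0, 0.1), (0.2, 1.0),
       (0.8, 0.9), (1.2, 1.1), (0.4, 0.6), (2.6, 2.6)]
pos = [(3.0, 3.0), (3.5, 3.2), (3.2, 3.8)]

# Step 1: oversample the minority class up to the majority size.
pos_balanced = pos + smote(pos, len(neg) - len(pos))

# Step 2: clean the class boundary by removing Tomek links.
X, y = remove_tomek_links(neg + pos_balanced,
                          ["neg"] * len(neg) + ["pos"] * len(pos_balanced))

# Step 3: classify new points with K-Nearest Neighbor.
print(y.count("neg"), y.count("pos"))
print(knn_predict(X, y, (0.3, 0.3)), knn_predict(X, y, (3.3, 3.4)))
```

Because each Tomek link pairs one sample from each class, removing both members keeps the two classes the same size. That is consistent with the equal class counts reported above, although the abstract does not state which removal variant was used.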

Published in Sistemasi: Jurnal Sistem Informasi

ISSN: 2302-8149 (Print); 2540-9719 (Online)
Publisher: Islamic University of Indragiri
Country of publisher: Indonesia
LCC subjects: Technology: Technology (General): Industrial engineering. Management engineering: Information technology
Website: http://sistemasi.ftik.unisi.ac.id/index.php/stmsi
