Jurnal Teknologi dan Sistem Informasi (Jan 2024)

Improving Multi-label Classification Performance on Imbalanced Datasets Through SMOTE Technique and Data Augmentation Using IndoBERT Model

  • Leno Dwi Cahya,
  • Ardytha Luthfiarta,
  • Julius Immanuel Theo Krisna,
  • Sri Winarno,
  • Adhitya Nugraha

DOI
https://doi.org/10.25077/TEKNOSI.v9i3.2023.290-298
Journal volume & issue
Vol. 9, no. 3
pp. 290 – 298

Abstract


Sentiment and emotion analysis is a common classification task aimed at improving the benefit and comfort that consumers derive from a product. However, the collected data often lacks balance across the classes or aspects to be analyzed, a condition commonly known as an imbalanced dataset. Imbalanced datasets frequently pose a challenge in machine learning tasks, particularly with text data. Our research tackles imbalanced datasets using two techniques, namely SMOTE and data augmentation. For the SMOTE technique, the text dataset must first be converted into a numerical representation using TF-IDF. The classification model employed is IndoBERT. Both oversampling techniques address data imbalance by generating synthetic or new data, and the newly created dataset enhances the classification model's performance. With the augmentation technique, the classification model's performance improves by up to 20%, with accuracy reaching 78%, precision at 85%, recall at 82%, and an F1-score of 83%. The SMOTE technique, on the other hand, achieves the best results of the two, raising the model's accuracy to 82%, with precision at 87%, recall at 85%, and an F1-score of 86%.
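To make the SMOTE step concrete, the sketch below (not the authors' code) shows the pipeline the abstract describes: text is first turned into TF-IDF vectors, then SMOTE generates synthetic minority-class samples to balance the data. The toy corpus, labels, and parameter choices are illustrative assumptions, and the example uses a single-label binary setup for brevity, whereas the paper addresses a multi-label task and fine-tunes IndoBERT on the balanced data afterwards.

```python
# Minimal sketch of TF-IDF + SMOTE oversampling (illustrative only).
from collections import Counter

from sklearn.feature_extraction.text import TfidfVectorizer
from imblearn.over_sampling import SMOTE

# Hypothetical imbalanced corpus: many more positive than negative texts.
texts = ["produk bagus"] * 8 + ["produk buruk"] * 2
labels = [1] * 8 + [0] * 2

# Step 1: numerical representation with TF-IDF (required before SMOTE,
# since SMOTE interpolates between feature vectors, not raw text).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Step 2: SMOTE synthesizes new minority-class vectors until classes balance.
smote = SMOTE(k_neighbors=1, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, labels)

print("before:", Counter(labels))        # e.g. Counter({1: 8, 0: 2})
print("after: ", Counter(y_resampled))   # e.g. Counter({1: 8, 0: 8})
```

The balanced feature set (or, for the augmentation variant, the enlarged text corpus) would then be used to train the downstream classifier reported in the abstract.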

Keywords