Jurnal Informatika (May 2023)

Resampling Technique for Imbalanced Class Handling on Educational Dataset

  • Anief Fauzan Rozi,
  • Adi Wibowo,
  • Budi Warsito

DOI
https://doi.org/10.30595/juita.v11i1.15498
Journal volume & issue
Vol. 11, no. 1
pp. 77 – 85

Abstract


Educational data mining is an emerging field in data mining. Accurately identifying student accomplishment in a current or upcoming course can help institutions build better technology-aided education. Educational data mining is becoming a more important field of study because of its potential to produce a knowledge base model that can help teachers and lecturers. Like other classification tasks, educational data mining has a common and frequently encountered problem: the imbalanced class problem. An imbalanced class is a condition in which the classes are not represented in equal proportions. In this research, the class distribution is found to be severely imbalanced, and the dataset is multiclass, consisting of more than two class labels. Accordingly, this paper focuses on imbalanced class handling and classification, using several methods for each: Linear Regression, Random Forest, and Stacking for classification, and SMOTE, ADASYN, and SMOTE-ENN for resampling. The methods are evaluated using 10-fold cross-validation and an 80-20 splitting ratio. The results show that the best performance comes from Stacking classification on the ADASYN-resampled dataset, evaluated using an 80-20 splitting ratio, with an F1 score of 0.97. The results also show that the resampling technique improves classification performance. Classification without resampling also produced decent results, which may be due to several factors, such as the general pattern of the data for each class already being good from the start. Thus, there are no real drawbacks if the original data is processed.
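
The idea behind the SMOTE family of resampling methods named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation or dataset. It uses only the Python standard library, a hypothetical three-class toy dataset (labels "A", "B", "C"), and a simplified SMOTE-style rule: each synthetic sample is placed on the line segment between a minority sample and one of its nearest neighbours from the same class. The function name, the parameter k, and the data values are all assumptions made for illustration.

```python
import random
from collections import Counter


def smote_like_oversample(X, y, k=2, seed=0):
    """Simplified SMOTE-style oversampling (illustrative sketch only).

    Every class smaller than the largest class is topped up with synthetic
    points interpolated between a sample and one of its k nearest
    same-class neighbours.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())  # size of the majority class
    X_out, y_out = list(X), list(y)

    for label, n in counts.items():
        members = [x for x, lab in zip(X, y) if lab == label]
        if len(members) < 2:
            continue  # a single sample has no neighbour to interpolate with
        for _ in range(target - n):
            base = rng.choice(members)
            others = [m for m in members if m is not base]
            # nearest same-class neighbours by squared Euclidean distance
            neighbours = sorted(
                others,
                key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)),
            )[:k]
            neighbour = rng.choice(neighbours)
            gap = rng.random()  # position along the segment, in [0, 1)
            X_out.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
            y_out.append(label)
    return X_out, y_out


# Hypothetical imbalanced multiclass toy data: 6 / 3 / 2 samples per class.
X = [
    [1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.8], [1.3, 1.0],
    [4.0, 4.2], [4.1, 3.9], [3.8, 4.0],
    [7.0, 1.0], [7.2, 1.1],
]
y = ["A"] * 6 + ["B"] * 3 + ["C"] * 2

X_res, y_res = smote_like_oversample(X, y)
print("before:", dict(Counter(y)))
print("after: ", dict(Counter(y_res)))
```

In an evaluation such as the 80-20 split or 10-fold cross-validation described above, resampling of this kind is normally applied to the training portion only, so that synthetic points do not leak into the data used for scoring.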

Keywords