IEEE Access (Jan 2020)

A Synthetic Minority Based on Probabilistic Distribution (SyMProD) Oversampling for Imbalanced Datasets

  • Intouch Kunakorntum,
  • Woranich Hinthong,
  • Phond Phunchongharn

DOI
https://doi.org/10.1109/ACCESS.2020.3003346
Journal volume & issue
Vol. 8
pp. 114692 – 114704

Abstract

Handling class imbalance is a challenging task in real-world applications. Skewed data biases many prediction models toward the majority class, so they fail to identify the minority class. Oversampling is one of the promising solutions to the imbalanced class problem. However, several existing oversampling methods do not consider the distribution of the target variable and cause an overlapping class problem. Therefore, this study introduces a new oversampling technique, namely Synthetic Minority based on Probabilistic Distribution (SyMProD), to handle skewed datasets. Our technique normalizes the data using Z-scores and removes noisy data. Then, the proposed method selects minority samples based on the probability distribution of both classes. The synthetic instances are generated from the selected points and several of their minority nearest neighbors. Our technique aims to create synthetic instances that cover the minority class distribution, avoid generating noise, and reduce the likelihood of overlapping classes and overgeneralization. The proposed technique is validated using 14 benchmark datasets and three classifiers. Moreover, we compare its performance with that of seven other conventional oversampling algorithms. The empirical results show that our method achieves better performance than the other oversampling techniques.
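
The steps named in the abstract (Z-score normalization, noise removal, probability-based selection of minority samples, and generation from a selected point plus several minority nearest neighbors) can be illustrated with a minimal Python sketch. This is not the authors' published algorithm: the abstract does not give the formulas, so the noise rule (`noise_z`), the inverse-distance `closeness` score, the random convex weights, and all function and parameter names below are assumptions chosen only to make the idea concrete.

```python
import numpy as np

def zscore(X):
    """Standardize each feature to zero mean and unit variance."""
    std = X.std(axis=0)
    std[std == 0] = 1.0  # guard against constant features
    return (X - X.mean(axis=0)) / std

def closeness(A, B, exclude_self=False):
    """Sum of 1 / (distance + 1) from each row of A to all rows of B."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    total = (1.0 / (d + 1.0)).sum(axis=1)
    if exclude_self:
        total -= 1.0  # remove each point's contribution to itself
    return total

def oversample_sketch(X_min, X_maj, n_new, k=3, noise_z=3.0, seed=0):
    """Return n_new synthetic minority points in Z-score space (assumed design)."""
    rng = np.random.default_rng(seed)

    # 1. Normalize both classes together with Z-scores.
    Z = zscore(np.vstack([X_min, X_maj]))
    Z_min, Z_maj = Z[:len(X_min)], Z[len(X_min):]

    # 2. Assumed noise rule: drop minority points with any |z| above noise_z.
    Z_min = Z_min[(np.abs(Z_min) <= noise_z).all(axis=1)]

    # 3. Assumed selection probability: favor points that are close to the
    #    minority class and far from the majority class.
    c_min = closeness(Z_min, Z_min, exclude_self=True) / max(len(Z_min) - 1, 1)
    c_maj = closeness(Z_min, Z_maj) / len(Z_maj)
    score = c_min / (c_maj + 1e-12)
    prob = score / score.sum()

    # 4. For each selected point, mix it with its k nearest minority
    #    neighbors using random convex weights.
    d = np.linalg.norm(Z_min[:, None, :] - Z_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(d, axis=1)[:, :k]
    picks = rng.choice(len(Z_min), size=n_new, p=prob)
    synthetic = np.empty((n_new, Z_min.shape[1]))
    for i, p in enumerate(picks):
        pts = np.vstack([Z_min[p], Z_min[neighbors[p]]])
        weights = rng.dirichlet(np.ones(len(pts)))
        synthetic[i] = weights @ pts
    return synthetic
```

Because each synthetic point is a convex combination of minority points, it stays inside the region already spanned by the minority class, which is one simple way to limit overlap with the majority class. The sketch assumes more than `k` minority points remain after the noise filter and returns points in Z-score space rather than in the original feature units.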

Keywords