An Empirical Assessment of Performance of Data Balancing Techniques in Classification Task

Anil Jadhav; Samih M. Mostafa; Hela Elmannai; Faten Khalid Karim

doi:10.3390/app12083928

Applied Sciences (Apr 2022)

Affiliations

Anil Jadhav: Symbiosis Centre for Information Technology, Symbiosis International (Deemed University), Pune 411057, India
Samih M. Mostafa: Faculty of Computers and Information, South Valley University, Qena 83523, Egypt
Hela Elmannai: Department of Information Technology, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh 11671, Saudi Arabia
Faten Khalid Karim: Department of Computer Sciences, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh 11671, Saudi Arabia

DOI: https://doi.org/10.3390/app12083928
Journal volume & issue: Vol. 12, no. 8, p. 3928

Abstract

Many real-world classification problems, such as fraud detection, intrusion detection, churn prediction, and anomaly detection, suffer from the problem of imbalanced datasets. Therefore, in all such classification tasks, we need to balance the imbalanced datasets before building classifiers for prediction purposes. Several data-balancing techniques (DBT) have been discussed in the literature to address this issue. However, not much work has been conducted to assess the performance of DBT. Therefore, in this research paper we empirically assess the performance of the data-preprocessing-level data-balancing techniques, namely: Under Sampling (US), Over Sampling (OS), Hybrid Sampling (HS), Random Over Sampling Examples (ROSE), Synthetic Minority Over Sampling Technique (SMOTE), and Clustering-Based Under Sampling (CBUS). We used six different classifiers and twenty-five different datasets with varying levels of imbalance ratio (IR) to assess the performance of DBT. The experimental results indicate that DBT helps to improve the performance of the classifiers. However, no significant difference was observed in the performance of US, OS, HS, SMOTE, and CBUS. It was also observed that the performance of DBT was not consistent across varying levels of IR in the dataset or across different classifiers.
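
To make the two simplest techniques named in the abstract concrete, the following is a minimal Python sketch of random under sampling and random over sampling, together with one common definition of the imbalance ratio (majority-class count divided by minority-class count). It is an illustration only and not the authors' implementation: the abstract does not state how IR is computed or which software was used, and the function names `imbalance_ratio`, `random_under_sample`, and `random_over_sample` are hypothetical.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count.

    Assumption: this is one common definition of IR; the abstract
    does not say which definition the paper uses.
    """
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def _group_by_class(rows, labels):
    """Collect rows into a dict keyed by class label."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return groups


def random_under_sample(rows, labels, seed=0):
    """Randomly keep only as many rows per class as the smallest class has."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = min(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        for row in rng.sample(group, target):  # sampling without replacement
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def random_over_sample(rows, labels, seed=0):
    """Randomly duplicate rows until every class matches the largest class."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = max(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        # Draw the missing rows with replacement from the same class.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


if __name__ == "__main__":
    # Toy data: 90 majority rows (label 0) and 10 minority rows (label 1).
    rows = list(range(100))
    labels = [0] * 90 + [1] * 10
    print("IR before:", imbalance_ratio(labels))
    _, under_labels = random_under_sample(rows, labels)
    _, over_labels = random_over_sample(rows, labels)
    print("class counts after US:", Counter(under_labels))
    print("class counts after OS:", Counter(over_labels))
```

Under sampling discards majority-class rows, while over sampling repeats minority-class rows; a hybrid approach would combine the two. Synthetic methods such as SMOTE and ROSE generate new minority examples instead of duplicating existing ones and are not covered by this sketch.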

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci
