A Method for Analyzing the Performance Impact of Imbalanced Binary Data on Machine Learning Models

Ming Zheng; Fei Wang; Xiaowen Hu; Yuhao Miao; Huo Cao; Mingjing Tang

doi:10.3390/axioms11110607

Axioms (Nov 2022)

Affiliations

Ming Zheng: School of Computer and Information, Anhui Normal University, Wuhu 241002, China
Fei Wang: School of Computer and Information, Anhui Normal University, Wuhu 241002, China
Xiaowen Hu: School of Computer and Information, Anhui Normal University, Wuhu 241002, China
Yuhao Miao: Affiliated Institution of Anhui Normal University, Wuhu 241002, China
Huo Cao: School of Computer and Information, Anhui Normal University, Wuhu 241002, China
Mingjing Tang: School of Life Science, Yunnan Normal University, Kunming 650500, China

DOI: https://doi.org/10.3390/axioms11110607
Journal volume & issue: Vol. 11, no. 11, p. 607

Abstract

In machine learning and data mining, models may fail to learn effectively from imbalanced data or to predict well on it. This study proposes a method for analyzing the performance impact of imbalanced binary data on machine learning models. It systematically analyzes (1) the relationship between model performance and the imbalance rate (IR), and (2) the performance stability of machine learning models on imbalanced binary data. In the proposed method, imbalanced data augmentation algorithms are first designed to obtain imbalanced datasets with gradually varying IR. Then, to obtain more objective classification results, the evaluation metric AFG, the arithmetic mean of the area under the receiver operating characteristic curve (AUC), the F-measure and the G-mean, is used to evaluate the classification performance of the models. Finally, a performance stability evaluation method based on AFG and the coefficient of variation (CV) is proposed. Experiments with eight widely used machine learning models on 48 different imbalanced datasets demonstrate that, on the same imbalanced data, classification performance decreases as IR increases. Meanwhile, the classification performance of LR, DT and SVC is unstable, while GNB, BNB, KNN, RF and GBDT are relatively stable and not susceptible to imbalanced data. In particular, BNB has the most stable classification performance. The Friedman and Nemenyi post hoc statistical tests confirm this result. SMOTE is used for the oversampling-based imbalanced data augmentation; whether other oversampling methods yield consistent results needs further research. In the future, imbalanced data augmentation algorithms based on undersampling and hybrid sampling should be used to analyze the performance impact of imbalanced binary data on machine learning models.
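
The two quantities named in the abstract, AFG and CV, can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the minority class is the positive label 1, that predictions come from thresholding scores at 0.5, that the F-measure is the balanced F1 score, and that CV is the population standard deviation divided by the mean. The function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def auc_score(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f_measure_and_g_mean(y_true, y_pred):
    """Return (F1, G-mean) from binary labels and binary predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0       # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    g_mean = math.sqrt(recall * specificity)
    return f1, g_mean


def afg(y_true, y_score, threshold=0.5):
    """Arithmetic mean of AUC, F-measure and G-mean.

    The 0.5 decision threshold is an assumption of this sketch.
    """
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    f1, g_mean = f_measure_and_g_mean(y_true, y_pred)
    return (auc_score(y_true, y_score) + f1 + g_mean) / 3.0


def coefficient_of_variation(values):
    """CV = standard deviation / mean (population std assumed here)."""
    return pstdev(values) / mean(values)


if __name__ == "__main__":
    # Toy data: 2 minority (positive) and 6 majority (negative) samples.
    y_true = [1, 1, 0, 0, 0, 0, 0, 0]
    y_score = [0.9, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1]
    print("AFG:", afg(y_true, y_score))
    # CV over hypothetical AFG values of one model at several IR levels.
    print("CV:", coefficient_of_variation([0.80, 0.76, 0.71, 0.69]))
```

Under these assumptions, a model whose AFG values vary little across datasets with different IR yields a small CV, which corresponds to the sense in which the abstract calls a model stable.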

Published in Axioms

ISSN: 2075-1680 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Science: Mathematics
Website: http://www.mdpi.com/journal/axioms
