A Comparative Performance Analysis of Data Resampling Methods on Imbalance Medical Data

Matloob Khushi; Kamran Shaukat; Talha Mahboob Alam; Ibrahim A. Hameed; Shahadat Uddin; Suhuai Luo; Xiaoyan Yang; Maranatha Consuelo Reyes

doi:10.1109/access.2021.3102399

IEEE Access (Jan 2021)

Affiliations

Matloob Khushi: ORCiD; School of Computer Science, The University of Sydney, Sydney, NSW, Australia
Kamran Shaukat: ORCiD; School of Information and Physical Sciences, The University of Newcastle, Callaghan, NSW, Australia
Talha Mahboob Alam: ORCiD; Department of Computer Science and Information Technology, Virtual University of Pakistan, Lahore, Pakistan
Ibrahim A. Hameed: Department of ICT and Natural Sciences, Norwegian University of Science and Technology, Trondheim, Norway
Shahadat Uddin: ORCiD; School of Project Management, The University of Sydney, Sydney, NSW, Australia
Suhuai Luo: School of Information and Physical Sciences, The University of Newcastle, Callaghan, NSW, Australia
Xiaoyan Yang: School of Computer Science, The University of Sydney, Sydney, NSW, Australia
Maranatha Consuelo Reyes: ORCiD; School of Computer Science, The University of Sydney, Sydney, NSW, Australia

DOI: https://doi.org/10.1109/access.2021.3102399
Journal volume & issue: Vol. 9
pp. 109960 – 109975

Abstract

Medical datasets are usually imbalanced, with negative cases severely outnumbering positive cases, so it is essential to address this data skew when training machine learning algorithms. This study uses two representative lung cancer datasets, PLCO and NLST, with imbalance ratios (the ratio of majority-class samples to minority-class samples) of 24.7 and 25.0, respectively, to predict lung cancer incidence. It compares the performance of 23 class imbalance methods (resampling and hybrid systems) with three classical classifiers (logistic regression, random forest, and LinearSVC) to identify the imbalance techniques best suited to medical datasets. The resampling methods comprise ten under-sampling methods (RUS, etc.), seven over-sampling methods (SMOTE, etc.), and two integrated sampling methods (SMOTEENN, SMOTE-Tomek). The hybrid systems include Balanced Bagging, among others. The results show that class imbalance learning can improve the classification ability of the model. Compared with the other imbalance techniques, under-sampling techniques have the highest standard deviation (SD), and over-sampling techniques have the lowest SD. Over-sampling is a stable method, and its AUC is generally higher than that of the other approaches. Random forest combined with ROS achieves the best predictive performance and is more suitable for the lung cancer datasets used in this study. The code is available at https://mkhushi.github.io/
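
To make the imbalance ratio and the idea behind random over-sampling (ROS) concrete, the following is a minimal sketch in plain Python. It is not the authors' code: the function names `imbalance_ratio` and `random_over_sample`, the toy data, and the seed are assumptions made for illustration, and the sketch duplicates minority rows until every class matches the majority count.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def random_over_sample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until
    every class has as many samples as the majority class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Candidate rows to duplicate for this class
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Hypothetical toy data: 50 negative cases and 2 positive cases
labels = [0] * 50 + [1] * 2
rows = [[float(i)] for i in range(len(labels))]

ratio_before = imbalance_ratio(labels)
new_rows, new_labels = random_over_sample(rows, labels)
ratio_after = imbalance_ratio(new_labels)
print(ratio_before, ratio_after)
```

In practice, resampling of this kind is normally applied only to the training split, so that evaluation data keeps the original class distribution.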

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
