The stratified K-folds cross-validation and class-balancing methods with high-performance ensemble classifiers for breast cancer classification

Mahesh T R; Vinoth Kumar V; Dhilip Kumar V; Oana Geman; Martin Margala; Manisha Guduri

Healthcare Analytics (Dec 2023)

Affiliations

Mahesh T R: Department of Computer Science and Engineering, Faculty of Engineering and Technology, JAIN (Deemed-to-be University), Bangalore, 562112, India; Corresponding author.
Vinoth Kumar V: School of Computer Science Engineering & Information Systems (SCORE), Vellore Institute of Technology (VIT), 632014, India
Dhilip Kumar V: School of Computing, Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology, Chennai, 600062, India
Oana Geman: Department of Computers, Electronics and Automation, Faculty of Electrical Engineering and Computer Science, Stefan cel Mare University of Suceava, 720229, Suceava, Romania
Martin Margala: School of Computing and Informatics, University of Louisiana at Lafayette, USA
Manisha Guduri: School of Computing and Informatics, University of Louisiana at Lafayette, USA

Journal volume & issue: Vol. 4, p. 100247

Abstract

Breast cancer is one of the most common causes of death among women, and early diagnosis is vital for reducing the fatality rate. This study evaluates the most widely used machine-learning methods for breast cancer prediction and diagnosis. We use synthetic minority over-sampling to handle class imbalance in the breast cancer diagnosis dataset obtained from the Wisconsin Machine Learning Repository. We apply a variety of machine learning algorithms to breast cancer classification, including Logistic Regression (LR), Support Vector Machine (SVM), K-Nearest Neighbours (KNN), Classification and Regression Tree (CART), and Naive Bayes (NB), as well as well-known ensemble methods such as Majority-Voting, eXtreme Gradient Boosting (XGBoost), and Random Forest (RF). The findings show that the Majority-Voting ensemble, built on the top three classifiers (LR, SVM, and CART), outperforms every individual classifier and offers the highest accuracy of 99.3%.
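The three mechanics named in the title and abstract (stratified K-fold splitting, synthetic minority over-sampling, and majority voting) can be sketched in plain Python. This is only an illustrative sketch, not the authors' implementation: the function names `stratified_folds`, `smote_like_point`, and `majority_vote` are hypothetical, the over-sampling function shows only the interpolation step (it assumes a minority-class neighbour has already been chosen), and the voting is assumed to be hard voting over class labels. In practice, libraries such as scikit-learn (`StratifiedKFold`, `VotingClassifier`) and imbalanced-learn (`SMOTE`) provide full versions of these steps.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign each sample index to one of k folds, keeping class ratios similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's samples round-robin so every fold gets its share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def smote_like_point(sample, neighbour, rng):
    """Interpolation step only: a synthetic point between a minority sample
    and one of its minority-class neighbours (neighbour search not shown)."""
    gap = rng.random()  # interpolation factor in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbour)]


def majority_vote(predictions):
    """Hard voting: return the label predicted by most base classifiers."""
    return Counter(predictions).most_common(1)[0][0]
```

For example, with ten benign ("B") and five malignant ("M") labels, `stratified_folds(labels, k=5)` places two "B" and one "M" sample in every fold, and `majority_vote(["M", "B", "M"])` returns "M". Over-sampling would normally be applied only to the training folds, so that synthetic points do not leak into the held-out fold; whether the study did so is not stated in the abstract.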

Published in Healthcare Analytics

ISSN: 2772-4425 (Online)
Publisher: Elsevier
Country of publisher: United States
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics
Website: https://www.journals.elsevier.com/healthcare-analytics
