Impact of Dataset Size on Classification Performance: An Empirical Evaluation in the Medical Domain

Alhanoof Althnian; Duaa AlSaeed; Heyam Al-Baity; Amani Samha; Alanoud Bin Dris; Najla Alzakari; Afnan Abou Elwafa; Heba Kurdi

doi:10.3390/app11020796

Applied Sciences (Jan 2021)

Affiliations

Alhanoof Althnian: Information Technology Department, College of Computer and Information Sciences, King Saud University, Riyadh 11451, Saudi Arabia
Duaa AlSaeed: Information Technology Department, College of Computer and Information Sciences, King Saud University, Riyadh 11451, Saudi Arabia
Heyam Al-Baity: Information Technology Department, College of Computer and Information Sciences, King Saud University, Riyadh 11451, Saudi Arabia
Amani Samha: Management Information Systems Department, College of Business Administration, King Saud University, Riyadh 11451, Saudi Arabia
Alanoud Bin Dris: National Center for Cyber Security Technology, King Abdulaziz City for Science and Technology, Riyadh 11442, Saudi Arabia
Najla Alzakari: National Center for Cyber Security Technology, King Abdulaziz City for Science and Technology, Riyadh 11442, Saudi Arabia
Afnan Abou Elwafa: Computer Science Department, College of Computer and Information Sciences, King Saud University, Riyadh 11451, Saudi Arabia
Heba Kurdi: Computer Science Department, College of Computer and Information Sciences, King Saud University, Riyadh 11451, Saudi Arabia

DOI: https://doi.org/10.3390/app11020796
Journal volume & issue: Vol. 11, no. 2, p. 796

Abstract

Dataset size is a major concern in the medical domain, where lack of data is common. This study investigates the impact of dataset size on the overall performance of supervised classification models. We examined the performance of six models widely used in the medical field, namely support vector machine (SVM), neural networks (NN), C4.5 decision tree (DT), random forest (RF), AdaBoost (AB), and naïve Bayes (NB), on eighteen small medical UCI datasets. We further implemented three dataset size reduction scenarios on two large datasets and analyzed the performance of the models when trained on each resulting dataset with respect to accuracy, precision, recall, F-score, specificity, and area under the ROC curve (AUC). Our results indicate that the overall performance of classifiers depends on how well a dataset represents the original distribution rather than on its size. Moreover, we found that the most robust models for limited medical data are AB and NB, followed by SVM, and then RF and NN, while the least robust model is DT. Furthermore, an interesting observation is that a model's robustness to limited data does not necessarily imply that it provides the best performance compared to other models.
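
For readers unfamiliar with the six evaluation measures named in the abstract, the following is a minimal sketch of how they are commonly computed for a binary task. It is not the authors' implementation: the function names `binary_metrics` and `auc`, the 0/1 label encoding, and the choice to return 0.0 when a denominator is zero are all assumptions made for illustration. AUC is computed here through its pairwise-ranking interpretation (the fraction of positive/negative pairs in which the positive example receives the higher score).

```python
def binary_metrics(y_true, y_pred):
    """Confusion-matrix metrics for binary labels (assumed: 1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def safe_div(num, den):
        # Assumption: report 0.0 instead of raising when a denominator is zero.
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)  # also called sensitivity
    return {
        "accuracy": safe_div(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "specificity": safe_div(tn, tn + fp),
        "f_score": safe_div(2 * precision * recall, precision + recall),
    }


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    # A positive ranked above a negative counts 1, a tie counts 0.5.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy labels, predictions, and scores (made up for illustration only).
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
    scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]
    print(binary_metrics(y_true, y_pred))
    print(auc(y_true, scores))
```

The pairwise AUC is quadratic in the number of examples, which is acceptable for small datasets of the kind the study considers; a sort-based computation would be the usual choice for larger ones.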

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci
