The impact of imputation quality on machine learning classifiers for datasets with missing values

Tolou Shadbahr; Michael Roberts; Jan Stanczuk; Julian Gilbey; Philip Teare; Sören Dittmer; Matthew Thorpe; Ramon Viñas Torné; Evis Sala; Pietro Lió; Mishal Patel; Jacobus Preller; AIX-COVNET Collaboration; James H. F. Rudd; Tuomas Mirtti; Antti Sakari Rannikko; John A. D. Aston; Jing Tang; Carola-Bibiane Schönlieb

doi:10.1038/s43856-023-00356-z

Communications Medicine (Oct 2023)

Affiliations

Tolou Shadbahr: Research Program in Systems Oncology, Faculty of Medicine, University of Helsinki
Michael Roberts: Department of Applied Mathematics and Theoretical Physics, University of Cambridge
Jan Stanczuk: Department of Applied Mathematics and Theoretical Physics, University of Cambridge
Julian Gilbey: Department of Applied Mathematics and Theoretical Physics, University of Cambridge
Philip Teare: Data Science & Artificial Intelligence, AstraZeneca
Sören Dittmer: Department of Applied Mathematics and Theoretical Physics, University of Cambridge
Matthew Thorpe: Department of Mathematics, University of Manchester
Ramon Viñas Torné: Department of Computer Science and Technology, University of Cambridge
Evis Sala: Department of Radiology, University of Cambridge
Pietro Lió: Department of Computer Science and Technology, University of Cambridge
Mishal Patel: Data Science & Artificial Intelligence, AstraZeneca
Jacobus Preller: Addenbrooke’s Hospital, Cambridge University Hospitals NHS Trust
AIX-COVNET Collaboration
James H. F. Rudd: Department of Medicine, University of Cambridge
Tuomas Mirtti: Research Program in Systems Oncology, Faculty of Medicine, University of Helsinki
Antti Sakari Rannikko: Research Program in Systems Oncology, Faculty of Medicine, University of Helsinki
John A. D. Aston: Department of Pure Mathematics and Mathematical Statistics, University of Cambridge
Jing Tang: Research Program in Systems Oncology, Faculty of Medicine, University of Helsinki
Carola-Bibiane Schönlieb: Department of Applied Mathematics and Theoretical Physics, University of Cambridge

DOI: https://doi.org/10.1038/s43856-023-00356-z
Journal volume & issue: Vol. 3, no. 1, pp. 1–15

Abstract

Background: Classifying samples in incomplete datasets is a common aim for machine learning practitioners, but is non-trivial. Missing data is found in most real-world datasets, and these missing values are typically imputed using established methods, followed by classification of the now-complete samples. The focus of the machine learning researcher is to optimise the classifier’s performance.

Methods: We utilise three simulated and three real-world clinical datasets with different feature types and missingness patterns. Initially, we evaluate how the downstream classifier performance depends on the choice of classifier and imputation methods. We employ ANOVA to quantitatively evaluate how the choice of missingness rate, imputation method, and classifier method influences the performance. Additionally, we compare commonly used methods for assessing imputation quality and introduce a class of discrepancy scores based on the sliced Wasserstein distance. We also assess the stability of the imputations and the interpretability of models built on the imputed data.

Results: The performance of the classifier is most affected by the percentage of missingness in the test data, with a considerable performance decline observed as the test missingness rate increases. We also show that the commonly used measures for assessing imputation quality tend to lead to imputed data which poorly matches the underlying data distribution, whereas our new class of discrepancy scores performs much better on this measure. Furthermore, we show that the interpretability of classifier models trained using poorly imputed data is compromised.

Conclusions: It is imperative to consider the quality of the imputation when performing downstream classification, as the effects on the classifier can be considerable.
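To make the idea of a distribution-level discrepancy score more concrete, the following is a minimal sketch of a sliced Wasserstein distance between two samples, for example original data and imputed data. It is not the authors' implementation: the function name `sliced_wasserstein`, its parameters, and the quantile-based approximation of the one-dimensional distance are all assumptions made for illustration.

```python
import numpy as np


def sliced_wasserstein(x, y, n_projections=100, n_quantiles=200, seed=0):
    """Approximate sliced 2-Wasserstein distance between two samples.

    Hypothetical helper for illustration only. x and y are arrays of
    shape (n_samples, n_features) with the same number of features.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n_features = x.shape[1]

    # Draw random directions uniformly on the unit sphere.
    directions = rng.normal(size=(n_projections, n_features))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)

    # Project both samples onto every direction: (n_samples, n_projections).
    x_proj = x @ directions.T
    y_proj = y @ directions.T

    # In one dimension the 2-Wasserstein distance compares quantile
    # functions, so evaluate both on a shared grid of quantile levels.
    levels = (np.arange(n_quantiles) + 0.5) / n_quantiles
    x_q = np.quantile(x_proj, levels, axis=0)
    y_q = np.quantile(y_proj, levels, axis=0)

    # Squared 1D distance per direction, then average over directions.
    w2_squared = np.mean((x_q - y_q) ** 2, axis=0)
    return float(np.sqrt(np.mean(w2_squared)))


# Toy usage: mean imputation shrinks the spread of a feature, which a
# distribution-level score can detect even when pointwise error looks small.
rng = np.random.default_rng(42)
complete = rng.normal(size=(1000, 3))
mask = rng.random(complete.shape) < 0.3        # 30% missing at random
column_means = np.nanmean(np.where(mask, np.nan, complete), axis=0)
mean_imputed = np.where(mask, column_means, complete)
score = sliced_wasserstein(complete, mean_imputed)
```

A larger `score` indicates that the imputed sample deviates more from the reference distribution. Because the quantile grid is shared, the two samples do not need to have the same number of rows.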

Published in Communications Medicine

ISSN: 2730-664X (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Medicine
Website: https://www.nature.com/commsmed/
