Frontiers in Big Data (Oct 2022)
Application of convex hull analysis for the evaluation of data heterogeneity between patient populations of different origin and implications of hospital bias in downstream machine-learning-based data processing: A comparison of 4 critical-care patient datasets
Abstract
Machine learning (ML) models are developed on a learning dataset covering only a small part of the data of interest. If model predictions are accurate for the learning dataset but fail for unseen data, the generalization error is considered high. This problem manifests itself within all major sub-fields of ML but is especially relevant in medical applications. Clinical data structures, patient cohorts, and clinical protocols may vary considerably among hospitals, so that sampling representative learning datasets for training ML models remains a challenge. Because ML models exhibit poor predictive performance over data ranges that are sparsely covered or not covered at all by the learning dataset, we propose in this study a novel method to assess their generalization capability among different hospitals based on the convex hull (CH) overlap between multivariate datasets. To reduce dimensionality effects, we used a two-step approach. First, CH analysis was applied to determine the mean CH coverage between each pair of datasets, yielding an upper bound of the prediction range. Second, 4 types of ML models were trained to classify the dataset of origin (i.e., from which hospital a sample came) and thereby to estimate differences between the datasets' underlying distributions. To demonstrate the applicability of our method, we used 4 critical-care patient datasets from different hospitals in Germany and the USA. We estimated the similarity of these populations and investigated whether ML models developed on one dataset can be reliably applied to another. We show that the strongest drop in performance was associated with poor overlap between the convex hulls of the corresponding hospitals' datasets and with high performance of the ML methods in discriminating between datasets. Hence, we suggest the application of our pipeline as a first tool to assess the transferability of trained models. We emphasize that datasets from different hospitals represent heterogeneous data sources, and the transfer from one database to another should be performed with utmost care to avoid adverse consequences during real-world application of the developed models. Further research is needed to develop methods for the adaptation of ML models to new hospitals. In addition, more work should be aimed at the creation of gold-standard datasets that are large and diverse, with data from varied application sites.
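The two-step idea described in the abstract (pairwise CH coverage followed by an origin classifier) can be illustrated with a short sketch. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation: it uses scipy's Delaunay triangulation to test hull membership, a random forest as a stand-in for the 4 ML model types used in the study, and synthetic Gaussian data in place of the clinical features; the function names `ch_coverage` and `origin_discrimination_score` are hypothetical.

```python
# Minimal sketch (not the authors' code): estimate convex-hull coverage between
# two datasets and train a classifier to discriminate their origin.
# Assumes both datasets are NumPy arrays with identical, low-dimensional feature columns.
import numpy as np
from scipy.spatial import Delaunay
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def ch_coverage(reference: np.ndarray, query: np.ndarray) -> float:
    """Fraction of query points that fall inside the convex hull of the reference set."""
    hull = Delaunay(reference)              # triangulation spans the convex hull
    inside = hull.find_simplex(query) >= 0  # find_simplex returns -1 for points outside
    return float(inside.mean())


def origin_discrimination_score(a: np.ndarray, b: np.ndarray) -> float:
    """Cross-validated accuracy of a classifier that predicts which dataset a sample came from.
    Values near 0.5 suggest similar distributions; values near 1.0 suggest strong dataset bias."""
    X = np.vstack([a, b])
    y = np.concatenate([np.zeros(len(a)), np.ones(len(b))])
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    return float(cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hospital_a = rng.normal(0.0, 1.0, size=(500, 3))  # stand-in for one hospital's features
    hospital_b = rng.normal(0.5, 1.2, size=(500, 3))  # shifted/scaled stand-in for another

    # Mean CH coverage in both directions: a rough upper bound on the range over
    # which a model trained on one dataset can interpolate rather than extrapolate.
    cov_ab = ch_coverage(hospital_a, hospital_b)
    cov_ba = ch_coverage(hospital_b, hospital_a)
    print(f"mean CH coverage: {(cov_ab + cov_ba) / 2:.3f}")
    print(f"origin discrimination accuracy: {origin_discrimination_score(hospital_a, hospital_b):.3f}")
```

In a real application the inputs would be preprocessed clinical features from two hospitals; note that exact hull-membership tests become expensive and unstable in high dimensions, which is why the study's dimensionality-reduction step matters before computing CH coverage.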
Keywords