Effect of data leakage in brain MRI classification using 2D convolutional neural networks

Ekin Yagis; Selamawet Workalemahu Atnafu; Alba García Seco de Herrera; Chiara Marzi; Riccardo Scheda; Marco Giannelli; Carlo Tessa; Luca Citi; Stefano Diciotti

doi:10.1038/s41598-021-01681-w

Scientific Reports (Nov 2021)

Affiliations

Ekin Yagis: School of Computer Science and Electronic Engineering, University of Essex
Selamawet Workalemahu Atnafu: Department of Electrical, Electronic, and Information Engineering “Guglielmo Marconi”, University of Bologna
Alba García Seco de Herrera: School of Computer Science and Electronic Engineering, University of Essex
Chiara Marzi: Department of Electrical, Electronic, and Information Engineering “Guglielmo Marconi”, University of Bologna
Riccardo Scheda: Department of Electrical, Electronic, and Information Engineering “Guglielmo Marconi”, University of Bologna
Marco Giannelli: Unit of Medical Physics, Pisa University Hospital “Azienda Ospedaliero-Universitaria Pisana”
Carlo Tessa: Division of Radiology, Versilia Hospital
Luca Citi: School of Computer Science and Electronic Engineering, University of Essex
Stefano Diciotti: Department of Electrical, Electronic, and Information Engineering “Guglielmo Marconi”, University of Bologna

DOI: https://doi.org/10.1038/s41598-021-01681-w
Journal volume & issue: Vol. 11, no. 1
pp. 1 – 13

Abstract

In recent years, 2D convolutional neural networks (CNNs) have been extensively used to diagnose neurological diseases from magnetic resonance imaging (MRI) data because of their potential to discern subtle and intricate patterns. Despite the high performance reported in numerous studies, developing CNN models that generalize well remains challenging because data leakage can be introduced during cross-validation (CV). In this study, we quantitatively assessed the effect of the data leakage that arises when 3D MRI data are split at the 2D slice level, using three 2D CNN models to classify patients with Alzheimer’s disease (AD) and Parkinson’s disease (PD). Our experiments showed that slice-level CV erroneously boosted the average slice-level accuracy on the test set by 30% on the Open Access Series of Imaging Studies (OASIS), 29% on the Alzheimer’s Disease Neuroimaging Initiative (ADNI), 48% on the Parkinson’s Progression Markers Initiative (PPMI) and 55% on a local de novo PD Versilia dataset. Further tests on a randomly labeled OASIS-derived dataset produced an (erroneous) accuracy of about 96% with a slice-level split and an accuracy of 50% with a subject-level split, the latter being the value expected from a randomized experiment. Overall, the effect of an erroneous slice-based CV is severe, especially for small datasets.
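The difference between the two splitting strategies named in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the toy cohort size (20 subjects, 10 slices each), and the 20% test fraction are all assumptions chosen for illustration, and no images or CNNs are involved. It only shows that shuffling individual slices places slices from the same subject on both sides of the split, whereas splitting by subject does not.

```python
import random

def make_slices(n_subjects, slices_per_subject):
    # Each 2D slice is represented only by (subject_id, slice_index);
    # this is a hypothetical stand-in for real MRI slices.
    return [(s, k) for s in range(n_subjects) for k in range(slices_per_subject)]

def slice_level_split(slices, test_fraction, seed=0):
    # Leaky strategy: shuffle individual slices, ignoring which subject they come from.
    rng = random.Random(seed)
    shuffled = slices[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def subject_level_split(slices, test_fraction, seed=0):
    # Leak-free strategy: shuffle subjects, then assign all slices of a subject together.
    rng = random.Random(seed)
    subjects = sorted({s for s, _ in slices})
    rng.shuffle(subjects)
    n_test = max(1, int(len(subjects) * test_fraction))
    test_subjects = set(subjects[:n_test])
    train = [x for x in slices if x[0] not in test_subjects]
    test = [x for x in slices if x[0] in test_subjects]
    return train, test

def shared_subjects(train, test):
    # Subjects that contribute slices to both the training and the test set.
    return {s for s, _ in train} & {s for s, _ in test}

slices = make_slices(n_subjects=20, slices_per_subject=10)
leaky_train, leaky_test = slice_level_split(slices, 0.2)
clean_train, clean_test = subject_level_split(slices, 0.2)

print("subjects shared (slice-level split):", len(shared_subjects(leaky_train, leaky_test)))
print("subjects shared (subject-level split):", len(shared_subjects(clean_train, clean_test)))
```

With a slice-level split, nearly every test subject also appears in the training set, so a model can score well by recognizing subjects rather than disease. The subject-level split reports zero shared subjects by construction.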

Published in Scientific Reports

ISSN: 2045-2322 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Medicine; Science
Website: https://www.nature.com/srep/
