Effect of data harmonization of multicentric dataset in ASD/TD classification

Giacomo Serra; Francesca Mainas; Bruno Golosio; Alessandra Retico; Piernicola Oliva

doi:10.1186/s40708-023-00210-x

Brain Informatics (Nov 2023)

Affiliations

Giacomo Serra: Department of Physics, University of Cagliari
Francesca Mainas: Department of Physics, University of Cagliari
Bruno Golosio: Department of Physics, University of Cagliari
Alessandra Retico: National Institute for Nuclear Physics (INFN), Pisa Division
Piernicola Oliva: National Institute for Nuclear Physics (INFN), Cagliari Division

DOI: https://doi.org/10.1186/s40708-023-00210-x
Journal volume & issue: Vol. 10, no. 1
pp. 1 – 11

Abstract

Machine Learning (ML) is now an essential tool in the analysis of Magnetic Resonance Imaging (MRI) data, in particular for identifying brain correlates of neurological and neurodevelopmental disorders. ML requires datasets of appropriate size for training, which in neuroimaging are typically obtained by collecting data from multiple acquisition centers. However, analyzing large multicentric datasets can introduce bias due to differences between acquisition centers. ComBat harmonization is commonly used to address batch effects, but it can lead to data leakage when the entire dataset is used to estimate the model parameters. In this study, structural and functional MRI data from the Autism Brain Imaging Data Exchange (ABIDE) collection were used to classify subjects with Autism Spectrum Disorders (ASD) versus Typically Developing controls (TD). We compared the classical approach (external harmonization), in which harmonization is performed before the train/test split, with a harmonization estimated only on the train set (internal harmonization) and with the non-harmonized dataset. The results showed that harmonization using the whole dataset achieved higher discrimination performance, while non-harmonized data and harmonization using only the train set showed similar results, for both structural and connectivity features. We also showed that the higher performance of external harmonization is not due to the larger sample available for estimating the model; hence the improved performance obtained with the entire dataset may be ascribed to data leakage. To prevent this leakage, we recommend defining the harmonization model using the train set only.
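
The difference between external and internal harmonization can be illustrated with a minimal sketch. It is not the authors' pipeline and not ComBat itself: it uses a simplified per-site location/scale adjustment (no empirical Bayes step, no covariates) on synthetic data, and the function names `fit_site_model` and `apply_site_model` are hypothetical. The point it shows is only where the harmonization parameters are estimated relative to the train/test split.

```python
import numpy as np

def fit_site_model(X, sites):
    """Estimate pooled and per-site mean/std for each feature (simplified stand-in for ComBat)."""
    grand_mean = X.mean(axis=0)
    pooled_std = X.std(axis=0) + 1e-12
    params = {}
    for s in np.unique(sites):
        Xs = X[sites == s]
        params[s] = (Xs.mean(axis=0), Xs.std(axis=0) + 1e-12)
    return grand_mean, pooled_std, params

def apply_site_model(X, sites, model):
    """Map each site onto the pooled mean/std using previously estimated parameters.

    Assumes every site in `sites` was present when the model was fitted.
    """
    grand_mean, pooled_std, params = model
    out = np.empty_like(X, dtype=float)
    for s in np.unique(sites):
        mask = sites == s
        mu, sd = params[s]
        out[mask] = (X[mask] - mu) / sd * pooled_std + grand_mean
    return out

# Synthetic multicentric data: 3 sites, each with its own additive offset.
rng = np.random.default_rng(0)
n, p = 200, 5
sites = rng.integers(0, 3, size=n)
site_shift = np.array([0.0, 2.0, -1.5])
X = rng.normal(size=(n, p)) + site_shift[sites][:, None]

idx = rng.permutation(n)
train, test = idx[:150], idx[150:]

# Internal harmonization: parameters come from the train set only,
# then the same parameters are applied to the held-out test set.
model_int = fit_site_model(X[train], sites[train])
X_train_int = apply_site_model(X[train], sites[train], model_int)
X_test_int = apply_site_model(X[test], sites[test], model_int)

# External harmonization: parameters come from the whole dataset,
# so test subjects influence the transformation applied to the train set.
model_ext = fit_site_model(X, sites)
X_ext = apply_site_model(X, sites, model_ext)
X_train_ext, X_test_ext = X_ext[train], X_ext[test]
```

In the internal variant the test subjects never contribute to the estimated site parameters, which is the property the abstract recommends preserving. A classifier would then be trained on `X_train_int` and evaluated on `X_test_int`.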

Published in Brain Informatics

ISSN: 2198-4018 (Print); 2198-4026 (Online)
Publisher: SpringerOpen
Country of publisher: Germany
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Mathematics: Instruments and machines: Electronic computers. Computer science: Computer software
Website: http://www.springer.com/40708
