Informatics in Medicine Unlocked (Jan 2022)
Application of data collaboration analysis to distributed data with misaligned features
Abstract
The types of metabolites measured in metabolomics studies vary depending on many factors, including differences in measurement methods. Centralizing the distributed raw data is also often difficult due to confidentiality issues. These difficulties prevent the integrated analysis of metabolomic data from multiple studies. In this study, we extend data collaboration analysis, an integrated data analysis method that shares dimensionality-reduced intermediate representations instead of the raw data, so that it can be applied to distributed data in which the samples are completely different and the features are only partially common. We then evaluated the improvement in performance obtained by using the non-common features in the data collaboration analysis. To perform this evaluation, we created four artificial datasets and two datasets generated from public metabolomics data, in all of which the samples are completely different and the features are partially common. For each of these datasets, we compared the classification performance, including the area under the receiver operating characteristic curve (ROC-AUC), across the following three cases: (i) training on local data only, (ii) data collaboration analysis using only the features common to the distributed datasets, and (iii) data collaboration analysis using all features, including the non-common ones. In most cases, data collaboration analysis using all features gave better results than data collaboration analysis using only the common features (by 1.3–4.8 ROC-AUC points on average for each dataset) or training on only one of the datasets (by 1.8–2.9 ROC-AUC points on average for each dataset). These results confirm that data collaboration analysis can integrate and analyze distributed data in which the samples are completely different and the features are partially common, and that doing so can improve classification accuracy in machine learning without sharing the raw data.
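The ROC-AUC comparison between cases described above can be illustrated with a minimal sketch. It assumes that a "point" means one hundredth of the ROC-AUC scale (so 0.80 versus 0.82 is a gain of 2 points), which the abstract does not state explicitly. The function names `roc_auc` and `auc_gain_points` and the toy labels and scores are hypothetical, chosen only for illustration; they are not results or code from the study.

```python
def roc_auc(labels, scores):
    """ROC-AUC computed as the probability that a randomly chosen positive
    sample receives a higher score than a randomly chosen negative sample
    (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auc_gain_points(labels, baseline_scores, candidate_scores):
    """Difference in ROC-AUC between two classifiers evaluated on the same
    test samples, expressed in points (assumed here to be ROC-AUC x 100)."""
    return 100.0 * (roc_auc(labels, candidate_scores)
                    - roc_auc(labels, baseline_scores))


# Hypothetical toy test set: 1 = positive class, 0 = negative class.
labels = [1, 1, 1, 0, 0, 0]
# Hypothetical scores standing in for case (i), local data only.
local_scores = [0.9, 0.4, 0.35, 0.5, 0.3, 0.2]
# Hypothetical scores standing in for case (iii), all features.
all_feature_scores = [0.9, 0.6, 0.35, 0.5, 0.3, 0.2]

gain = auc_gain_points(labels, local_scores, all_feature_scores)
print(f"ROC-AUC gain: {gain:.1f} points")
```

The pairwise formulation is quadratic in the number of samples, which is acceptable for a small illustration; a rank-based computation would be preferable for large test sets.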