Biology (Feb 2022)

Integration of Multimodal Data from Disparate Sources for Identifying Disease Subtypes

  • Kaiyue Zhou,
  • Bhagya Shree Kottoori,
  • Seeya Awadhut Munj,
  • Zhewei Zhang,
  • Sorin Draghici,
  • Suzan Arslanturk

DOI
https://doi.org/10.3390/biology11030360
Journal volume & issue
Vol. 11, no. 3
p. 360

Abstract

Studies over the past decade have generated a wealth of molecular data that can be leveraged to better understand cancer risk, progression, and outcomes. However, understanding progression risk and differentiating long- and short-term survivors cannot be achieved by analyzing data from a single modality, due to the heterogeneity of the disease. Using a scientifically developed and tested deep-learning approach that leverages aggregate information collected from multiple repositories with multiple modalities (e.g., mRNA, DNA methylation, miRNA) could lead to a more accurate and robust prediction of disease progression. Here, we propose an autoencoder-based multimodal data fusion system, in which a fusion encoder flexibly integrates collective information available through multiple studies with partially coupled data. Our results on a fully controlled simulation-based study show that inferring the missing data through the proposed data fusion pipeline yields a predictor that is superior to baseline predictors with missing modalities. Results further show that short- and long-term survivors of glioblastoma multiforme, acute myeloid leukemia, and pancreatic adenocarcinoma can be successfully differentiated with AUCs of 0.94, 0.75, and 0.96, respectively.

Keywords