Integration of Multimodal Data from Disparate Sources for Identifying Disease Subtypes

Kaiyue Zhou; Bhagya Shree Kottoori; Seeya Awadhut Munj; Zhewei Zhang; Sorin Draghici; Suzan Arslanturk

doi:10.3390/biology11030360

Biology (Feb 2022)

Affiliations

Kaiyue Zhou: Department of Computer Science, Wayne State University, Detroit, MI 48201, USA
Bhagya Shree Kottoori: Department of Computer Science, Wayne State University, Detroit, MI 48201, USA
Seeya Awadhut Munj: Department of Computer Science, Wayne State University, Detroit, MI 48201, USA
Zhewei Zhang: Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
Sorin Draghici: Department of Computer Science, Wayne State University, Detroit, MI 48201, USA
Suzan Arslanturk: Department of Computer Science, Wayne State University, Detroit, MI 48201, USA

DOI: https://doi.org/10.3390/biology11030360
Journal volume & issue: Vol. 11, no. 3, p. 360

Abstract

Studies over the past decade have generated a wealth of molecular data that can be leveraged to better understand cancer risk, progression, and outcomes. However, because the disease is heterogeneous, analyzing data from a single modality is not enough to understand progression risk or to differentiate long- and short-term survivors. A scientifically developed and tested deep-learning approach that leverages aggregate information collected from multiple repositories with multiple modalities (e.g., mRNA, DNA methylation, miRNA) could lead to more accurate and robust prediction of disease progression. Here, we propose an autoencoder-based multimodal data fusion system in which a fusion encoder flexibly integrates the collective information available through multiple studies with partially coupled data. Results from a fully controlled simulation-based study show that inferring the missing data through the proposed data fusion pipeline yields a predictor that outperforms baseline predictors with missing modalities. Results further show that short- and long-term survivors of glioblastoma multiforme, acute myeloid leukemia, and pancreatic adenocarcinoma can be differentiated with AUCs of 0.94, 0.75, and 0.96, respectively.
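
For readers unfamiliar with the AUC metric reported in the abstract, the following is a minimal sketch of how an area under the ROC curve can be computed from predicted scores, using the standard pairwise (Mann-Whitney) formulation. It is not the authors' evaluation code. The function name `auc`, the 0/1 label encoding (1 = long-term survivor), and the toy data are all assumptions made for illustration.

```python
def auc(labels, scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney) statistic.

    labels: sequence of 0/1 class labels (hypothetical encoding:
            1 = long-term survivor, 0 = short-term survivor)
    scores: predicted scores; higher means more likely class 1
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    # Count positive/negative pairs that are ranked correctly;
    # a tie between a positive and a negative counts as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy, made-up data for illustration only (not taken from the study).
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
toy_auc = auc(toy_labels, toy_scores)  # 8 of 9 pairs ranked correctly
```

Under this reading, an AUC of 0.94 means that a randomly chosen long-term survivor receives a higher score than a randomly chosen short-term survivor about 94% of the time, while 0.5 corresponds to chance-level ranking.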

Published in Biology

ISSN: 2079-7737 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Science: Biology (General)
Website: https://www.mdpi.com/journal/biology
