AI (Sep 2024)

Generative Models Utilizing Padding Can Efficiently Integrate and Generate Multi-Omics Data

  • Hyeon-Su Lee,
  • Seung-Hwan Hong,
  • Gwan-Heon Kim,
  • Hye-Jin You,
  • Eun-Young Lee,
  • Jae-Hwan Jeong,
  • Jin-Woo Ahn,
  • June-Hyuk Kim

DOI
https://doi.org/10.3390/ai5030078
Journal volume & issue
Vol. 5, no. 3
pp. 1614 – 1632

Abstract


Technological advances in information-processing capacity have enabled integrated (multi-omics) analyses of different omics data types, improving target discovery and clinical diagnosis. This study proposes novel artificial intelligence (AI) learning strategies for incomplete datasets, which are common in omics research. The model comprises (1) a multi-omics generative model based on a variational auto-encoder that learns tumor genetic patterns from different omics data types and (2) an expanded classification model that predicts cancer phenotypes. Padding was applied to replace missing data with virtual data. The embedding data generated by the model accurately classified cancer phenotypes while addressing the class imbalance issue (weighted F1 score: cancer type > 0.95, primary site > 0.92, sample type > 0.97). Classification performance was maintained when omics data were absent, and the virtual data resembled actual omics data (cosine similarity: mRNA gene expression > 0.96, mRNA isoform expression > 0.95, DNA methylation > 0.96). Meanwhile, when omics data were present, high-quality omics data that were not originally available were generated (cosine similarity: mRNA gene expression 0.9702, mRNA isoform expression 0.9546, DNA methylation 0.9687). The model can therefore classify cancer phenotypes effectively from incomplete omics data, is robust to data sparsity, and generates omics data through deep learning, supporting precision medicine.
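
The abstract names three ingredients: padding for missing omics data, cosine similarity between generated and actual profiles, and a weighted F1 score for imbalanced classes. The following is only a minimal sketch of those ideas, not the authors' implementation. The function names, the zero padding value, the observation mask, and the toy feature sizes are all assumptions made for illustration; the abstract does not specify how the padding was constructed.

```python
import numpy as np


def pad_missing_omics(sample, dims, pad_value=0.0):
    """Concatenate per-omics vectors, padding any missing omics type.

    sample: dict mapping omics name -> 1-D sequence, or None when missing.
    dims:   dict mapping omics name -> expected feature count (ordered).
    Returns (features, mask); mask is 1 for observed entries, 0 for padded.
    NOTE: constant padding and the mask are assumptions for this sketch.
    """
    parts, mask = [], []
    for name, dim in dims.items():
        values = sample.get(name)
        if values is None:
            # Missing omics type: fill the whole block with the pad value.
            parts.append(np.full(dim, pad_value, dtype=float))
            mask.append(np.zeros(dim))
        else:
            values = np.asarray(values, dtype=float)
            if values.shape != (dim,):
                raise ValueError(f"{name}: expected {dim} features")
            parts.append(values)
            mask.append(np.ones(dim))
    return np.concatenate(parts), np.concatenate(mask)


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors (0.0 if either is all zero)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    total = 0.0
    for label in np.unique(y_true):
        tp = np.sum((y_pred == label) & (y_true == label))
        fp = np.sum((y_pred == label) & (y_true != label))
        fn = np.sum((y_pred != label) & (y_true == label))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * np.sum(y_true == label)
    return float(total / len(y_true))


# Toy example: the mRNA isoform block is missing and gets padded.
dims = {"mrna_gene": 3, "mrna_isoform": 2, "dna_methylation": 2}
sample = {"mrna_gene": [1.0, 2.0, 3.0],
          "mrna_isoform": None,
          "dna_methylation": [0.5, 0.25]}
features, mask = pad_missing_omics(sample, dims)
```

In a setup like this, the padded vector could be fed to an encoder, and the mask could be used to exclude padded positions from a reconstruction loss. Whether the published model uses a mask at all is not stated in the abstract.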

Keywords