Dimensionality reduction by UMAP reinforces sample heterogeneity analysis in bulk transcriptomic data
Yang Yang,
Hongjian Sun,
Yu Zhang,
Tiefu Zhang,
Jialei Gong,
Yunbo Wei,
Yong-Gang Duan,
Minglei Shu,
Yuchen Yang,
Di Wu,
Di Yu
Affiliations
Yang Yang
The University of Queensland Diamantina Institute, Faculty of Medicine, The University of Queensland, Translational Research Institute, Brisbane, QLD, Australia; Shandong Artificial Intelligence Institute, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China
Hongjian Sun
Shandong Artificial Intelligence Institute, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China; School of Microelectronics, Shandong University, Jinan, China
Yu Zhang
Laboratory of Immunology for Environment and Health, School of Pharmaceutical Sciences, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China
Tiefu Zhang
University of Electronic Science and Technology of China, Chengdu, China
Jialei Gong
Shenzhen Key Laboratory of Fertility Regulation, Center of Assisted Reproduction and Embryology, University of Hong Kong, Shenzhen Hospital, Shenzhen, China
Yunbo Wei
Laboratory of Immunology for Environment and Health, School of Pharmaceutical Sciences, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China
Yong-Gang Duan
Shenzhen Key Laboratory of Fertility Regulation, Center of Assisted Reproduction and Embryology, University of Hong Kong, Shenzhen Hospital, Shenzhen, China
Minglei Shu
Shandong Artificial Intelligence Institute, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China
Yuchen Yang
Department of Pathology and Laboratory Medicine, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA; McAllister Heart Institute, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Di Wu
Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA; Division of Oral and Craniofacial Health Science, Adams School of Dentistry, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA; Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, NC, USA; Corresponding author
Di Yu
The University of Queensland Diamantina Institute, Faculty of Medicine, The University of Queensland, Translational Research Institute, Brisbane, QLD, Australia; Shandong Artificial Intelligence Institute, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China; Laboratory of Immunology for Environment and Health, School of Pharmaceutical Sciences, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China; Corresponding author
Summary: Transcriptomic analysis plays a key role in biomedical research. Linear dimensionality reduction methods, especially principal-component analysis (PCA), are widely used to detect sample-to-sample heterogeneity, while recently developed non-linear methods, such as t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP), can efficiently cluster heterogeneous samples in single-cell RNA sequencing analysis. However, t-SNE and UMAP have not been applied to bulk transcriptomic analysis and compared with conventional methods. We compare four major dimensionality reduction methods (PCA, multidimensional scaling [MDS], t-SNE, and UMAP) in analyzing 71 large bulk transcriptomic datasets. UMAP is superior to PCA and MDS and shows some advantages over t-SNE in differentiating batch effects, identifying pre-defined biological groups, and revealing in-depth clusters in two-dimensional space. Importantly, UMAP generates sample clusters that uncover biological features and clinical meaning. We recommend deploying UMAP for visualizing and analyzing sizable bulk transcriptomic datasets to reinforce sample heterogeneity analysis.
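To make the four-way comparison concrete, the following is a minimal sketch, not the authors' pipeline. It assumes scikit-learn for PCA, MDS, and t-SNE, and the optional umap-learn package for UMAP (skipped if it is not installed). The toy expression matrix, the helper names `make_toy_expression` and `embed_all`, the parameter values, and the use of the silhouette score as a separation measure are all illustrative assumptions rather than details taken from the study.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import MDS, TSNE
from sklearn.metrics import silhouette_score

try:
    import umap  # optional dependency: the umap-learn package
except ImportError:
    umap = None


def make_toy_expression(n_per_group=30, n_genes=200, n_groups=3, seed=0):
    """Simulate a samples-by-genes matrix with known biological groups.

    Hypothetical stand-in for a normalized bulk expression matrix.
    """
    rng = np.random.default_rng(seed)
    blocks, labels = [], []
    for group in range(n_groups):
        # Each group gets its own mean expression profile.
        center = rng.normal(0.0, 2.0, size=n_genes)
        noise = rng.normal(0.0, 1.0, size=(n_per_group, n_genes))
        blocks.append(center + noise)
        labels += [group] * n_per_group
    return np.vstack(blocks), np.array(labels)


def embed_all(X, seed=0):
    """Project samples into two dimensions with each available method."""
    reducers = {
        "PCA": PCA(n_components=2, random_state=seed),
        "MDS": MDS(n_components=2, random_state=seed),
        "t-SNE": TSNE(n_components=2, perplexity=15, random_state=seed),
    }
    if umap is not None:
        reducers["UMAP"] = umap.UMAP(
            n_components=2, n_neighbors=15, random_state=seed
        )
    return {name: reducer.fit_transform(X) for name, reducer in reducers.items()}


if __name__ == "__main__":
    X, groups = make_toy_expression()
    for name, embedding in embed_all(X).items():
        # Higher silhouette means the known groups are better separated in 2D.
        score = silhouette_score(embedding, groups)
        print(f"{name}: silhouette = {score:.3f}")
```

On real data, `X` would be a normalized samples-by-genes matrix and `groups` would hold the pre-defined biological or batch labels. The neighborhood parameters (`perplexity`, `n_neighbors`) must stay below the number of samples and usually need tuning for each dataset.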