Performance determinants of unsupervised clustering methods for microbiome data

Yushu Shi; Liangliang Zhang; Christine B. Peterson; Kim-Anh Do; Robert R. Jenq

doi:10.1186/s40168-021-01199-3

Microbiome (Feb 2022)

Affiliations

Yushu Shi: Department of Statistics, The University of Missouri, Columbia
Liangliang Zhang: Department of Population and Quantitative Health Sciences, Case Western Reserve University
Christine B. Peterson: Department of Biostatistics, The University of Texas MD Anderson Cancer Center
Kim-Anh Do: Department of Biostatistics, The University of Texas MD Anderson Cancer Center
Robert R. Jenq: Department of Genomic Medicine, The University of Texas MD Anderson Cancer Center

DOI: https://doi.org/10.1186/s40168-021-01199-3
Journal volume & issue: Vol. 10, no. 1, pp. 1–12

Abstract

Background: In microbiome data analysis, unsupervised clustering is often used to identify naturally occurring clusters, which can then be assessed for associations with characteristics of interest. In this work, we systematically compared beta diversity and clustering methods commonly used in microbiome analyses. We applied these to four published datasets where highly distinct microbiome profiles could be seen between sample groups, as well as a clinical dataset with less clear separation between groups.

Results: Although no single method consistently outperformed the others, we identified key scenarios in which certain methods can underperform. Specifically, the Bray-Curtis (BC) metric resulted in poor clustering in a dataset where high-abundance OTUs were relatively rare. In contrast, the unweighted UniFrac (UU) metric clustered poorly on a dataset with a high prevalence of low-abundance OTUs. To explore these hypotheses about BC and UU, we systematically modified the properties of the poorly performing datasets and found that this approach resulted in improved BC and UU performance. Based on these observations, we rationally combined BC and UU to generate a novel metric. We tested its performance while varying the relative contributions of each metric and also compared it with another combined metric, the generalized UniFrac distance. The proposed metric showed high performance across all datasets.

Conclusions: Our systematic evaluation of clustering performance in these five datasets demonstrates that no existing clustering method universally performs best across all datasets. We propose a combined metric of BC and UU that capitalizes on the complementary strengths of the two metrics.
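To make the idea of a combined metric with adjustable relative contributions concrete, the sketch below blends a Bray-Curtis matrix with an unweighted UniFrac matrix. It is an illustration only: the abstract does not give the authors' formula, so the convex combination `weight * BC + (1 - weight) * UU`, the function names `bray_curtis` and `combined_distance`, the use of relative abundances, and the toy numbers are all assumptions. The UU matrix is taken as precomputed, since it requires a phylogenetic tree.

```python
import numpy as np


def bray_curtis(counts):
    """Pairwise Bray-Curtis dissimilarity between samples (rows).

    Assumption: counts are converted to relative abundances first, so each
    row sums to 1 and BC reduces to half the L1 distance between rows.
    """
    x = np.asarray(counts, dtype=float)
    rel = x / x.sum(axis=1, keepdims=True)
    return np.abs(rel[:, None, :] - rel[None, :, :]).sum(axis=2) / 2.0


def combined_distance(bc, uu, weight=0.5):
    """Hypothetical combined metric: a convex combination of BC and UU.

    weight is the relative contribution of BC; (1 - weight) goes to UU.
    This form is an assumption, not necessarily the published definition.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    bc = np.asarray(bc, dtype=float)
    uu = np.asarray(uu, dtype=float)
    if bc.shape != uu.shape:
        raise ValueError("distance matrices must have the same shape")
    return weight * bc + (1.0 - weight) * uu


# Toy OTU table: 3 samples (rows) x 4 OTUs (columns). Values are made up.
counts = np.array([
    [90, 10, 0, 0],
    [80, 15, 5, 0],
    [0, 5, 15, 80],
])

# Made-up unweighted UniFrac matrix, assumed to come from a separate
# tree-based computation on the same three samples.
uu = np.array([
    [0.0, 0.2, 0.7],
    [0.2, 0.0, 0.6],
    [0.7, 0.6, 0.0],
])

bc = bray_curtis(counts)

# Vary the relative contribution of each metric.
for w in (0.0, 0.5, 1.0):
    print(f"weight={w}")
    print(np.round(combined_distance(bc, uu, weight=w), 3))
```

With `weight=1.0` the result is pure BC and with `weight=0.0` pure UU. Intermediate values trade off sensitivity to abundance against sensitivity to presence or absence along the tree. The resulting matrix could be passed to any clustering routine that accepts precomputed distances.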

Published in Microbiome

ISSN: 2049-2618 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Science: Microbiology: Microbial ecology
Website: https://microbiomejournal.biomedcentral.com
