StrainPanDA: Linked reconstruction of strain composition and gene content profiles via pangenome‐based decomposition of metagenomic data
Han Hu,
Yuxiang Tan,
Chenhao Li,
Junyu Chen,
Yan Kou,
Zhenjiang Zech Xu,
Yang‐Yu Liu,
Yan Tan,
Lei Dai
Affiliations
Han Hu
CAS Key Laboratory of Quantitative Engineering Biology, Shenzhen Institute of Synthetic Biology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
Yuxiang Tan
CAS Key Laboratory of Quantitative Engineering Biology, Shenzhen Institute of Synthetic Biology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
Chenhao Li
Center for Computational and Integrative Biology, Massachusetts General Hospital and Harvard Medical School, Richard B. Simches Research Center, Boston, Massachusetts, USA
Junyu Chen
CAS Key Laboratory of Quantitative Engineering Biology, Shenzhen Institute of Synthetic Biology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
Yan Kou
Bioinformatics Department, Xbiome, Scientific Research Building, Tsinghua High‐Tech Park, Shenzhen, China
Zhenjiang Zech Xu
Department of Food Science and Technology, State Key Laboratory of Food Science and Technology, Nanchang University, Nanchang, China
Yang‐Yu Liu
Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, Massachusetts, USA
Yan Tan
Bioinformatics Department, Xbiome, Scientific Research Building, Tsinghua High‐Tech Park, Shenzhen, China
Lei Dai
CAS Key Laboratory of Quantitative Engineering Biology, Shenzhen Institute of Synthetic Biology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
Abstract
Microbial strains with different functional capacities coexist in microbiomes. Current bioinformatics methods for strain analysis cannot directly link strain composition to strain gene content from metagenomic data. Here we present Strain‐level Pangenome Decomposition Analysis (StrainPanDA), a novel method that uses the pangenome coverage profile of multiple metagenomic samples to simultaneously reconstruct the composition and gene content variation of coexisting strains in microbial communities. We systematically validate the accuracy and robustness of StrainPanDA using synthetic data sets. To demonstrate the power of gene‐centric strain profiling, we then apply StrainPanDA to gut microbiome samples from infants and from patients treated with fecal microbiota transplantation. We show that the linked reconstruction of strain composition and gene content profiles is critical for understanding the relationship between microbial adaptation and strain‐specific functions (e.g., nutrient utilization and pathogenicity). Finally, StrainPanDA has minimal computing resource requirements and can be scaled to process multiple species in a community in parallel. In short, StrainPanDA can be applied to metagenomic data sets to detect associations between molecular functions and microbial or host phenotypes, to formulate testable hypotheses, and to gain novel biological insights at the strain or subspecies level.
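To make the idea of a linked decomposition concrete, the following is a minimal sketch and not the StrainPanDA implementation. It assumes that the pangenome coverage profile is a nonnegative matrix X (gene families by samples) and that it can be approximated as X ≈ P S, where P (gene families by strains) holds gene content and S (strains by samples) holds strain abundance. The abstract does not name the factorization algorithm, so the multiplicative-update nonnegative matrix factorization used here is an assumption. The function names `decompose_coverage` and `relative_abundance` and the toy data are hypothetical.

```python
import numpy as np


def decompose_coverage(X, n_strains, n_iter=2000, seed=0, eps=1e-9):
    """Factor X (gene families x samples) as P @ S with nonnegative P and S.

    Illustrative assumption: plain NMF with multiplicative updates that
    reduce the squared reconstruction error. Not the authors' algorithm.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_samples = X.shape
    P = rng.random((n_genes, n_strains)) + eps    # gene content per strain
    S = rng.random((n_strains, n_samples)) + eps  # strain signal per sample
    for _ in range(n_iter):
        # Update strain abundances with gene content held fixed.
        S *= (P.T @ X) / (P.T @ P @ S + eps)
        # Update gene content with strain abundances held fixed.
        P *= (X @ S.T) / (P @ S @ S.T + eps)
    return P, S


def relative_abundance(S, eps=1e-12):
    """Normalize each sample (column) so strain proportions sum to 1."""
    return S / (S.sum(axis=0, keepdims=True) + eps)


# Hypothetical toy data: 5 gene families, 2 strains, 4 samples.
true_P = np.array([[1, 1], [1, 0], [0, 1], [1, 1], [0, 1]], dtype=float)
true_S = 20.0 * np.array([[0.9, 0.5, 0.1, 0.3],
                          [0.1, 0.5, 0.9, 0.7]])
X = true_P @ true_S  # simulated pangenome coverage profile

P, S = decompose_coverage(X, n_strains=2)
rel_error = np.linalg.norm(X - P @ S) / np.linalg.norm(X)
composition = relative_abundance(S)
print("relative reconstruction error:", rel_error)
print("estimated strain composition per sample:")
print(composition)
```

Because both factors come from the same factorization, each column of P is tied to the matching row of S. This is the sense in which gene content and strain composition are reconstructed together rather than by separate tools. The recovered factors are unique only up to strain ordering and scaling, so the composition is reported as per-sample proportions.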