Bioconductor workflow for microbiome data analysis: from raw reads to community analyses [version 1; referees: 2 approved]

Ben J. Callahan; Kris Sankaran; Julia A. Fukuyama; Paul J. McMurdie; Susan P. Holmes

doi:10.12688/f1000research.8986.1

F1000Research (Jun 2016)

Affiliations

Ben J. Callahan: Statistics Department, Stanford University, Stanford, CA, 94305, USA
Kris Sankaran: Statistics Department, Stanford University, Stanford, CA, 94305, USA
Julia A. Fukuyama: Statistics Department, Stanford University, Stanford, CA, 94305, USA
Paul J. McMurdie: Whole Biome Inc., San Francisco, CA, 94107, USA
Susan P. Holmes: Statistics Department, Stanford University, Stanford, CA, 94305, USA

DOI: https://doi.org/10.12688/f1000research.8986.1
Journal volume & issue: Vol. 5

Abstract

High-throughput sequencing of PCR-amplified taxonomic markers (like the 16S rRNA gene) has enabled a new level of analysis of complex bacterial communities known as microbiomes. Many tools exist to quantify and compare abundance levels or microbial composition of communities in different conditions. The sequencing reads have to be denoised and assigned to the closest taxa from a reference database. Common approaches use a notion of 97% similarity and normalize the data by subsampling to equalize library sizes. In this paper, we show that statistical models allow more accurate abundance estimates. By providing a complete workflow in R, we enable the user to do sophisticated downstream statistical analyses, including both parametric and nonparametric methods. We provide examples of using the R packages dada2, phyloseq, DESeq2, ggplot2 and vegan to filter, visualize and test microbiome data. We also provide examples of supervised analyses using random forests, partial least squares and linear models as well as nonparametric testing using community networks and the ggnetwork package.

Published in F1000Research

ISSN: 2046-1402 (Online)
Publisher: F1000 Research Ltd
Country of publisher: United Kingdom
LCC subjects: Medicine; Science
Website: https://f1000research.com