Analysis and correction of compositional bias in sparse sequencing count data

M. Senthil Kumar; Eric V. Slud; Kwame Okrah; Stephanie C. Hicks; Sridhar Hannenhalli; Héctor Corrada Bravo

doi:10.1186/s12864-018-5160-5

BMC Genomics (Nov 2018)

Affiliations

M. Senthil Kumar: Graduate Program in Bioinformatics, University of Maryland
Eric V. Slud: Department of Mathematics, University of Maryland
Kwame Okrah: GRED Oncology Biostatistics, Genentech
Stephanie C. Hicks: Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Harvard University
Sridhar Hannenhalli: Center for Bioinformatics and Computational Biology, University of Maryland
Héctor Corrada Bravo: Center for Bioinformatics and Computational Biology, University of Maryland

DOI: https://doi.org/10.1186/s12864-018-5160-5
Journal volume & issue: Vol. 19, no. 1
pp. 1 – 23

Abstract

Background: Count data derived from high-throughput deoxyribonucleic acid (DNA) sequencing is frequently used in quantitative molecular assays. Due to properties inherent to the sequencing process, unnormalized count data is compositional, measuring relative and not absolute abundances of the assayed features. This compositional bias confounds inference of absolute abundances. Commonly used count data normalization approaches such as library size scaling, rarefaction, and subsampling cannot correct for compositional bias or for any other relevant technical bias that is uncorrelated with library size.

Results: We demonstrate that existing techniques for estimating compositional bias fail with sparse metagenomic 16S count data and propose an empirical Bayes normalization approach to overcome this problem. In addition, we clarify the assumptions underlying frequently used scaling normalization methods in light of compositional bias, including scaling methods that were not designed directly to address it.

Conclusions: Compositional bias, induced by the sequencing machine, confounds inferences of absolute abundances. We present a normalization technique for compositional bias correction in sparse sequencing count data, and demonstrate its improved performance in metagenomic 16S survey data. Based on the distribution of technical bias estimates arising from several publicly available large-scale 16S count datasets, we argue that detailed experiments specifically addressing the influence of compositional bias in metagenomics are needed.
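As an illustration of the compositional effect described in the abstract, the following is a minimal toy sketch, not the authors' method. It assumes a noise-free model in which a fixed sequencing depth is shared among features in proportion to their absolute abundance. All names and numbers (`absolute_a`, `absolute_b`, `expected_counts`, `library_size_scale`, the depth of 10,000) are hypothetical and chosen only for the example.

```python
import numpy as np

# Hypothetical absolute abundances for four features in two samples.
# Only feature 0 truly changes (a 10x increase); the rest are constant.
absolute_a = np.array([100.0, 100.0, 100.0, 100.0])
absolute_b = np.array([1000.0, 100.0, 100.0, 100.0])


def expected_counts(absolute, depth):
    """Expected reads when a fixed depth is split in proportion to abundance."""
    return depth * absolute / absolute.sum()


def library_size_scale(counts):
    """Library size scaling: divide each count by the sample's total count."""
    return counts / counts.sum()


depth = 10_000  # assumed identical sequencing depth for both samples
prop_a = library_size_scale(expected_counts(absolute_a, depth))
prop_b = library_size_scale(expected_counts(absolute_b, depth))

true_fold = absolute_b / absolute_a   # what we would like to infer
observed_fold = prop_b / prop_a       # what scaled counts actually show

# Every feature's observed fold change is off by the same factor,
# equal to the ratio of total absolute abundances of the two samples.
bias = observed_fold / true_fold
print("true fold changes:    ", true_fold)
print("observed fold changes:", observed_fold)
print("per-feature bias:     ", bias)
```

In this toy setting the three unchanged features appear to decrease, and every observed fold change is distorted by the same sample-wise factor (here 400/1300, about 0.31), even though both samples have the same library size. That common factor is what library size scaling cannot recover, because it does not depend on sequencing depth.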

Published in BMC Genomics

ISSN: 1471-2164 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Technology: Chemical technology: Biotechnology; Science: Biology (General): Genetics
Website: http://bmcgenomics.biomedcentral.com
