Naught all zeros in sequence count data are the same

Justin D. Silverman; Kimberly Roche; Sayan Mukherjee; Lawrence A. David

Computational and Structural Biotechnology Journal (Jan 2020)

Affiliations

Justin D. Silverman: College of Information Science and Technology, Pennsylvania State University, State College, PA 16802, United States; Institute for Computational and Data Science, Pennsylvania State University, State College, PA 16802, United States; Department of Medicine, Pennsylvania State University, Hershey, PA 17033, United States
Kimberly Roche: Program in Computational Biology and Bioinformatics, Duke University, Durham, NC 27708, United States
Sayan Mukherjee: Program in Computational Biology and Bioinformatics, Duke University, Durham, NC 27708, United States; Departments of Statistical Science, Mathematics, Computer Science, Biostatistics & Bioinformatics, Duke University, Durham, NC 27708, United States; Center for Genomic and Computational Biology, Duke University, Durham, NC 27708, United States; Co-Corresponding author.
Lawrence A. David: Program in Computational Biology and Bioinformatics, Duke University, Durham, NC 27708, United States; Center for Genomic and Computational Biology, Duke University, Durham, NC 27708, United States; Department of Molecular Genetics and Microbiology, Duke University, Durham, NC 27708, United States; Co-Corresponding author.

Journal volume & issue: Vol. 18, pp. 2789–2798

Abstract

Genomic studies feature multivariate count data from high-throughput DNA sequencing experiments, which often contain many zero values. These zeros can cause artifacts in statistical analyses, and multiple modeling approaches have been developed in response. Here, we apply different zero-handling models to gene-expression and microbiome datasets and show that models can disagree substantially in identifying the most differentially expressed sequences. Next, to rationally examine how different zero-handling models behave, we developed a conceptual framework outlining four types of processes that may give rise to zero values in sequence count data. Last, we performed simulations to test how zero-handling models behave in the presence of these different zero-generating processes. Our simulations showed that simple count models are sufficient across multiple processes, even when the true underlying process is unknown. On the other hand, a common zero-handling technique known as “zero-inflation” was only suitable under a zero-generating process associated with an unlikely set of biological and experimental conditions. In concert, our work suggests several specific guidelines for developing and choosing state-of-the-art models for analyzing sparse sequence count data.
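
To make the contrast between a simple count model and "zero-inflation" concrete, the following is a minimal sketch, not the authors' code. It assumes the standard textbook zero-inflated Poisson, in which a count is forced to zero with probability `pi` and otherwise drawn from a Poisson distribution with mean `lam`. The function names and the values `lam = 0.5`, `pi = 0.3` are illustrative assumptions and do not come from the paper.

```python
import math
import random


def poisson_zero_prob(lam):
    """P(count = 0) under a plain Poisson model with mean lam."""
    return math.exp(-lam)


def zip_zero_prob(lam, pi):
    """P(count = 0) under a zero-inflated Poisson.

    With probability pi the count is a structural zero; otherwise it is
    drawn from Poisson(lam), which can also produce a zero by chance.
    """
    return pi + (1.0 - pi) * math.exp(-lam)


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate (Knuth's method; adequate for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_zip(lam, pi, rng):
    """Draw one zero-inflated Poisson variate."""
    return 0 if rng.random() < pi else sample_poisson(lam, rng)


def zero_fraction(samples):
    """Fraction of observations equal to zero."""
    return sum(1 for x in samples if x == 0) / len(samples)


if __name__ == "__main__":
    rng = random.Random(0)      # fixed seed so the run is repeatable
    lam, pi, n = 0.5, 0.3, 20000  # illustrative values, not from the paper
    plain = [sample_poisson(lam, rng) for _ in range(n)]
    inflated = [sample_zip(lam, pi, rng) for _ in range(n)]
    print("Poisson: theoretical", poisson_zero_prob(lam),
          "empirical", zero_fraction(plain))
    print("ZIP:     theoretical", zip_zero_prob(lam, pi),
          "empirical", zero_fraction(inflated))
```

Under these assumptions a low-mean Poisson already produces many zeros through sampling alone, and the zero-inflated model adds a separate mixture component on top of that. A high proportion of zeros is therefore not, by itself, evidence that a zero-inflation component is needed.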

Published in Computational and Structural Biotechnology Journal

ISSN: 2001-0370 (Online)
Publisher: Elsevier
Country of publisher: Netherlands
LCC subjects: Technology: Chemical technology: Biotechnology
Website: https://www.journals.elsevier.com/computational-and-structural-biotechnology-journal
