Sequence count data are poorly fit by the negative binomial distribution.

Stijn Hawinkel; J C W Rayner; Luc Bijnens; Olivier Thas

doi:10.1371/journal.pone.0224909

PLoS ONE (Jan 2020)

Journal volume & issue: Vol. 15, no. 4, p. e0224909

Abstract

Sequence count data are commonly modelled using the negative binomial (NB) distribution. Several empirical studies, however, have demonstrated that methods based on the NB-assumption do not always succeed in controlling the false discovery rate (FDR) at its nominal level. In this paper, we propose a dedicated statistical goodness of fit test for the NB distribution in regression models and demonstrate that the NB-assumption is violated in many publicly available RNA-Seq and 16S rRNA microbiome datasets. The zero-inflated NB distribution was not found to give a substantially better fit. We also show that the NB-based tests perform worse on the features for which the NB-assumption was violated than on the features for which no significant deviation was detected. This gives an explanation for the poor behaviour of NB-based tests in many published evaluation studies. We conclude that nonparametric tests should be preferred over parametric methods.
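
To make the NB-assumption concrete, the sketch below fits an NB distribution to the counts of a single feature by the method of moments and compares the observed fraction of zeros with the fraction the fitted NB predicts. This is not the goodness-of-fit test proposed in the paper, which is designed for regression models; it is only a crude illustrative diagnostic. The function names and the toy counts are hypothetical, and the mean/size parametrisation (variance = mean + mean^2 / size) is an assumption of this sketch.

```python
import math


def nb_moment_fit(counts):
    """Method-of-moments NB fit for one feature (hypothetical helper).

    Returns (mean, size) under the parametrisation var = mean + mean**2 / size.
    size is None when the sample shows no overdispersion (var <= mean).
    """
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / (n - 1)
    if var <= mean:
        return mean, None
    return mean, mean ** 2 / (var - mean)


def nb_pmf(k, mean, size):
    """NB probability mass at count k, computed on the log scale."""
    log_p = (
        math.lgamma(k + size) - math.lgamma(k + 1) - math.lgamma(size)
        + size * math.log(size / (size + mean))
        + k * math.log(mean / (size + mean))
    )
    return math.exp(log_p)


# Toy counts for a single feature (made up for illustration, not real data).
counts = [0, 0, 0, 0, 1, 2, 3, 5, 8, 21]
mean, size = nb_moment_fit(counts)

observed_zero_fraction = counts.count(0) / len(counts)
expected_zero_fraction = nb_pmf(0, mean, size)

print(f"mean={mean:.3f} size={size:.3f}")
print(f"observed zeros={observed_zero_fraction:.3f} "
      f"expected under NB={expected_zero_fraction:.3f}")
```

A large gap between the two zero fractions would hint at a poor NB fit for that feature, but a formal conclusion requires a proper test that accounts for covariates and estimation uncertainty.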

Published in PLoS ONE

ISSN: 1932-6203 (Online)
Publisher: Public Library of Science (PLoS)
Country of publisher: United States
LCC subjects: Medicine; Science
Website: https://journals.plos.org/plosone/
