Modeling and cleaning RNA-seq data significantly improve detection of differentially expressed genes

Igor V. Deyneko; Orkhan N. Mustafaev; Alexander A. Tyurin; Ksenya V. Zhukova; Alexander Varzari; Irina V. Goldenkova-Pavlova

doi:10.1186/s12859-022-05023-z

BMC Bioinformatics (Nov 2022)

Affiliations

Igor V. Deyneko: Laboratory of Functional Genomics, K.A. Timiryazev Institute of Plant Physiology RAS
Orkhan N. Mustafaev: Genetic Resources Institute, Azerbaijan National Academy of Sciences
Alexander A. Tyurin: Laboratory of Functional Genomics, K.A. Timiryazev Institute of Plant Physiology RAS
Ksenya V. Zhukova: Laboratory of Functional Genomics, K.A. Timiryazev Institute of Plant Physiology RAS
Alexander Varzari: Laboratory of Human Genetics, Chiril Draganiuc Institute of Phthisiopneumology
Irina V. Goldenkova-Pavlova: Laboratory of Functional Genomics, K.A. Timiryazev Institute of Plant Physiology RAS

DOI: https://doi.org/10.1186/s12859-022-05023-z
Journal volume & issue: Vol. 23, no. 1
pp. 1 – 14

Abstract

Background: RNA-seq has become a standard technology for quantifying mRNA. The measured values usually vary by several orders of magnitude, and while the detection of differences at high values is statistically well grounded, the significance of differences for rare mRNAs can be weakened by biological and technical noise.

Results: We have developed a method for cleaning RNA-seq data that improves the detection of differentially expressed genes, specifically genes with low to moderate transcription. Using a data modeling approach, the parameters of randomly distributed mRNA counts are identified, and reads that most probably originate from technical noise are removed. We demonstrate that removing this random component leads to a significant increase in the number of detected differentially expressed genes and to more significant p-values, with no bias towards low-count genes.

Conclusion: Application of RNAdeNoise to our RNA-seq data on polysome profiling and to several published RNA-seq datasets shows that it is suitable for different organisms and for sequencing technologies such as Illumina and BGI, that it improves the detection of differentially expressed genes, and that it removes the need to set thresholds for minimal RNA counts subjectively. The program, RNA-seq data, resulting gene lists and examples of use are available in the supplementary data and at https://github.com/Deyneko/RNAdeNoise .
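
The cleaning idea summarized in the abstract (model the randomly distributed counts, then remove the reads most likely to be noise) can be sketched roughly as follows. This is a minimal illustration and not the RNAdeNoise implementation: the exponential noise model, the function names, and the `fit_max` and `quantile` parameters are all assumptions made for the example.

```python
import math


def estimate_noise_rate(counts, fit_max=20):
    """Crude rate estimate for an ASSUMED exponential noise component.

    Only counts in [1, fit_max] are treated as noise-dominated (assumption).
    The estimate is 1 / mean and ignores the truncation at fit_max.
    """
    low = [c for c in counts if 1 <= c <= fit_max]
    if not low:
        return None  # nothing to fit on
    return len(low) / sum(low)


def noise_cutoff(rate, quantile=0.9):
    """Count below which `quantile` of the fitted exponential mass lies."""
    return math.ceil(-math.log(1.0 - quantile) / rate)


def denoise(counts, fit_max=20, quantile=0.9):
    """Subtract the estimated noise cutoff from every count, clipping at zero.

    Returns the cleaned counts and the cutoff that was applied.
    """
    rate = estimate_noise_rate(counts, fit_max)
    if rate is None:
        return list(counts), 0
    cutoff = noise_cutoff(rate, quantile)
    return [max(c - cutoff, 0) for c in counts], cutoff


if __name__ == "__main__":
    # Hypothetical per-gene read counts for a single sample.
    sample = [1, 2, 3, 1, 2, 3, 500, 1000]
    cleaned, cutoff = denoise(sample)
    print("cutoff:", cutoff)
    print("cleaned:", cleaned)
```

In this sketch the cutoff is derived from the data of each sample rather than chosen by hand, which mirrors the abstract's point about avoiding subjective minimal-count thresholds. Highly expressed genes are barely affected, while low counts that are consistent with the fitted noise component drop to zero.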

Published in BMC Bioinformatics

ISSN: 1471-2105 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Biology (General)
Website: http://www.biomedcentral.com/bmcbioinformatics/
