VARUS: sampling complementary RNA reads from the sequence read archive

Mario Stanke; Willy Bruhn; Felix Becker; Katharina J. Hoff

doi:10.1186/s12859-019-3182-x

BMC Bioinformatics (Nov 2019)

Affiliations

Mario Stanke: Institute for Mathematics and Computer Science, University of Greifswald
Willy Bruhn: Institute for Mathematics and Computer Science, University of Greifswald
Felix Becker: Institute for Mathematics and Computer Science, University of Greifswald
Katharina J. Hoff: Institute for Mathematics and Computer Science, University of Greifswald

DOI: https://doi.org/10.1186/s12859-019-3182-x
Journal volume & issue: Vol. 20, no. 1, pp. 1–7

Abstract

Background: Vast amounts of next-generation sequencing RNA data have been deposited in archives, accompanying very diverse original studies. The data are also readily available for other purposes such as genome annotation or transcriptome assembly. However, selecting a subset of the available experiments, sequencing runs and reads for this purpose is a nontrivial task, complicated by the inhomogeneity of the data.

Results: This article presents the software VARUS, which selects, downloads and aligns reads from NCBI’s Sequence Read Archive, given only the species’ binomial name and genome. VARUS automatically chooses runs from among all archived runs and randomly selects subsets of reads from them. The objective of its online algorithm is to cover a large number of transcripts adequately when network bandwidth and computing resources are limited. For most tested species, VARUS achieved both higher sensitivity and higher specificity with fewer downloaded reads than when runs were manually selected. Using twelve eukaryotic genomes as examples, we show that RNA-Seq data sampled with VARUS are well suited for fully automatic genome annotation with BRAKER.

Conclusions: With VARUS, genome annotation can be automated to the extent that not even the selection and quality control of RNA-Seq data has to be done manually. This makes it possible to run fully automated genome annotation loops over potentially many species without incurring a loss of accuracy relative to a manually supervised annotation process.

Published in BMC Bioinformatics

ISSN: 1471-2105 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Biology (General)
Website: http://www.biomedcentral.com/bmcbioinformatics/
