Scavenger: A pipeline for recovery of unaligned reads utilising similarity with aligned reads [version 2; peer review: 2 approved]

Andrian Yang; Michael Troup; Joshua Y. S. Tang; Joshua W. K. Ho

F1000Research (Oct 2022)

Affiliations

Andrian Yang: Victor Chang Cardiac Research Institute, Sydney, NSW, 2010, Australia
Michael Troup: Victor Chang Cardiac Research Institute, Sydney, NSW, 2010, Australia
Joshua Y. S. Tang: Victor Chang Cardiac Research Institute, Sydney, NSW, 2010, Australia
Joshua W. K. Ho: Victor Chang Cardiac Research Institute, Sydney, NSW, 2010, Australia

Journal volume & issue: Vol. 8

Abstract

Read alignment is an important step in RNA-seq analysis because its result forms the basis for downstream analyses. However, recent studies have shown that published alignment tools have variable mapping sensitivity and do not necessarily align all the reads that should have been aligned, a problem we term the false-negative non-alignment problem. Here we present Scavenger, a Python-based bioinformatics pipeline for recovering unaligned reads using a novel mechanism in which a putative alignment location is discovered based on sequence similarity between aligned and unaligned reads. We showed that Scavenger could recover unaligned reads in a range of simulated and real RNA-seq datasets, including single-cell RNA-seq data. We found that recovered reads tend to contain more genetic variants with respect to the reference genome than previously aligned reads, indicating that divergence between personal and reference genomes plays a role in the false-negative non-alignment problem. Even when the number of recovered reads is small relative to the total number of reads, adding these recovered reads can affect downstream analyses, especially the estimation of expression and differential expression of lowly expressed genes, such as pseudogenes.
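The core idea in the abstract, finding a putative location for an unaligned read from its similarity to reads that did align, can be illustrated with a toy sketch. This is not Scavenger's implementation: the shared k-mer counting used as the similarity measure, the function names, the parameters `k` and `min_shared`, and the read data are all assumptions made up for illustration.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield every k-length substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(aligned_reads, k):
    """Map each k-mer to the ids of the aligned reads that contain it."""
    index = defaultdict(set)
    for read_id, (seq, _location) in aligned_reads.items():
        for kmer in kmers(seq, k):
            index[kmer].add(read_id)
    return index


def putative_location(unaligned_seq, aligned_reads, index, k, min_shared):
    """Return the location of the most similar aligned read, or None.

    Similarity here is simply the number of distinct k-mers shared
    (an assumption of this sketch, not the published method).
    """
    hits = Counter()
    for kmer in set(kmers(unaligned_seq, k)):
        for read_id in index.get(kmer, ()):
            hits[read_id] += 1
    if not hits:
        return None
    best_id, shared = hits.most_common(1)[0]
    if shared < min_shared:
        return None
    return aligned_reads[best_id][1]


# Hypothetical toy data: read id -> (sequence, (chromosome, position)).
aligned = {
    "r1": ("ACGTACGTTAGC", ("chr1", 100)),
    "r2": ("TTGACCATGGCA", ("chr2", 500)),
}
index = build_index(aligned, k=4)

# This read differs from r1 by one base, as a variant-carrying read might.
location = putative_location("ACGTACGATAGC", aligned, index, k=4, min_shared=3)
print(location)
```

In this sketch the unaligned read inherits the location of the aligned read it resembles most; a real pipeline would still need to re-align the read at that location and validate the result before counting it as recovered.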

Published in F1000Research

ISSN: 2046-1402 (Online)
Publisher: F1000 Research Ltd
Country of publisher: United Kingdom
LCC subjects: Medicine; Science
Website: https://f1000research.com
