Short paired-end reads trump long single-end reads for expression analysis

Adam H. Freedman; John M. Gaspar; Timothy B. Sackton

doi:10.1186/s12859-020-3484-z

BMC Bioinformatics (Apr 2020)

Affiliations

Adam H. Freedman: Informatics Group, Harvard University
John M. Gaspar: Informatics Group, Harvard University
Timothy B. Sackton: Informatics Group, Harvard University

DOI: https://doi.org/10.1186/s12859-020-3484-z
Journal volume & issue: Vol. 21, no. 1, pp. 1–11

Abstract

Background: Typical experimental design advice for expression analyses using RNA-seq generally assumes that single-end reads provide robust gene-level expression estimates in a cost-effective manner, and that the additional benefits obtained from paired-end sequencing are not worth the additional cost. However, in many cases (e.g., with Illumina NextSeq and NovaSeq instruments), shorter paired-end reads and longer single-end reads can be generated for the same cost, and it is not obvious which strategy should be preferred. Using publicly available data, we test whether short paired-end reads can achieve more robust expression estimates and differential expression results than single-end reads of approximately the same total number of sequenced bases.

Results: At both the transcript and gene levels, expression estimates from 2 × 40 paired-end reads are unequivocally more highly correlated with those from 2 × 125 reads than estimates from 1 × 75 reads are; in nearly all cases, those correlations are also greater than for 1 × 125, despite the greater total number of sequenced bases for the latter. Across an array of metrics, differential expression tests based upon 2 × 40 consistently outperform those using 1 × 75.

Conclusion: Researchers seeking a cost-effective approach for gene-level expression analysis should prefer short paired-end reads over a longer single-end strategy. Short paired-end reads will also give reasonably robust expression estimates and differential expression results at the isoform level.
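The comparison of "total number of sequenced bases" across read layouts can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: it assumes each library contains the same number of sequenced fragments, so that the number of bases per fragment (reads per fragment times read length) is the quantity being compared. The names `layouts` and `bases_per_fragment` are hypothetical.

```python
# Bases sequenced per fragment for the read layouts named in the abstract.
# Assumption (not from the paper): every library has the same number of
# fragments, so bases per fragment is the relevant comparison.

# Each layout is (reads per fragment, read length in bases).
layouts = {
    "2x40": (2, 40),    # short paired-end
    "1x75": (1, 75),    # longer single-end, similar total bases
    "1x125": (1, 125),  # long single-end
    "2x125": (2, 125),  # long paired-end, used as the reference
}


def bases_per_fragment(reads: int, length: int) -> int:
    """Return the number of bases sequenced from one fragment."""
    return reads * length


totals = {name: bases_per_fragment(*layout) for name, layout in layouts.items()}

for name, total in totals.items():
    print(f"{name}: {total} bases per fragment")
```

Under this assumption, 2 × 40 and 1 × 75 are roughly matched (80 versus 75 bases per fragment), while 1 × 125 sequences more bases than 2 × 40, which is the sense in which the abstract calls the 2 × 40 result notable.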

Published in BMC Bioinformatics

ISSN: 1471-2105 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Biology (General)
Website: http://www.biomedcentral.com/bmcbioinformatics/
