Effect of method of deduplication on estimation of differential gene expression using RNA-seq

Anna V. Klepikova; Artem S. Kasianov; Mikhail S. Chesnokov; Natalia L. Lazarevich; Aleksey A. Penin; Maria Logacheva

doi:10.7717/peerj.3091

PeerJ (Mar 2017)

Affiliations

Anna V. Klepikova: Institute for Information Transmission Problems of the Russian Academy of Sciences, Moscow, Russia
Artem S. Kasianov: A. N. Belozersky Institute of Physico-Chemical Biology, Lomonosov Moscow State University, Moscow, Russia
Mikhail S. Chesnokov: N.N. Blokhin Russian Cancer Research Center of the Ministry of Health of the Russian Federation, Moscow, Russia
Natalia L. Lazarevich: N.N. Blokhin Russian Cancer Research Center of the Ministry of Health of the Russian Federation, Moscow, Russia
Aleksey A. Penin: Institute for Information Transmission Problems of the Russian Academy of Sciences, Moscow, Russia
Maria Logacheva: Institute for Information Transmission Problems of the Russian Academy of Sciences, Moscow, Russia

DOI: https://doi.org/10.7717/peerj.3091
Journal volume & issue: Vol. 5, p. e3091

Abstract

Background. RNA-seq is a useful tool for the analysis of gene expression. However, its robustness is greatly affected by a number of artifacts, one of which is the presence of duplicated reads.

Results. To assess how different methods of removing duplicated reads influence the estimation of gene expression in cancer genomics, we analyzed paired samples of hepatocellular carcinoma (HCC) and non-tumor liver tissue. Four data analysis protocols were applied to each sample: processing without deduplication, deduplication using a method implemented in SAMtools, and deduplication based on one or two molecular indices (MI). We also analyzed the influence of sequencing layout (single-read or paired-end) and read length. We found that deduplication without MI greatly affects estimated expression values; this effect is most pronounced for highly expressed genes.

Conclusion. The use of unique molecular identifiers greatly improves the accuracy of RNA-seq analysis, especially for highly expressed genes. We developed a set of scripts that enable the handling of MI and their incorporation into RNA-seq analysis pipelines. Deduplication without MI affects the results of differential gene expression analysis, producing a high proportion of false negative results, whereas omitting duplicate read removal biases the results towards false positives. Where using MI is not possible, we recommend a paired-end sequencing layout.
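The difference between deduplication with and without molecular indices can be illustrated with a minimal sketch. It is not the authors' script set and not the SAMtools algorithm; the `Read` record, its field names, and the toy reads are hypothetical, and real pipelines work on aligned BAM records and handle details such as mate positions and sequencing errors in the index.

```python
from collections import namedtuple

# Hypothetical minimal read record (assumption, not from the paper):
# mapping coordinate plus the molecular index (MI) attached to the read.
Read = namedtuple("Read", ["chrom", "pos", "strand", "mi"])


def dedup_by_position(reads):
    """Position-only deduplication: keep the first read seen at each coordinate."""
    seen = set()
    kept = []
    for read in reads:
        key = (read.chrom, read.pos, read.strand)
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept


def dedup_by_position_and_mi(reads):
    """MI-aware deduplication: reads are duplicates only if the coordinate
    and the molecular index both match."""
    seen = set()
    kept = []
    for read in reads:
        key = (read.chrom, read.pos, read.strand, read.mi)
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept


# Toy data: three reads share one coordinate, but only two of them
# carry the same MI (i.e. only those two are treated as PCR duplicates).
reads = [
    Read("chr1", 100, "+", "ACGT"),
    Read("chr1", 100, "+", "ACGT"),  # same coordinate, same MI
    Read("chr1", 100, "+", "TTGA"),  # same coordinate, different MI
    Read("chr1", 250, "+", "ACGT"),
]

print(len(dedup_by_position(reads)))         # reads kept without MI
print(len(dedup_by_position_and_mi(reads)))  # reads kept with MI
```

In this toy case the position-only rule discards a read that originated from a distinct molecule. That is the kind of over-collapsing expected to matter most for highly expressed genes, where independent fragments often share a mapping coordinate.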

Published in PeerJ

ISSN: 2167-8359 (Online)
Publisher: PeerJ Inc.
Country of publisher: United States
LCC subjects: Medicine; Science: Biology (General)
Website: https://peerj.com/
