Analysis of error profiles in deep next-generation sequencing data

Xiaotu Ma; Ying Shao; Liqing Tian; Diane A. Flasch; Heather L. Mulder; Michael N. Edmonson; Yu Liu; Xiang Chen; Scott Newman; Joy Nakitandwe; Yongjin Li; Benshang Li; Shuhong Shen; Zhaoming Wang; Sheila Shurtleff; Leslie L. Robison; Shawn Levy; John Easton; Jinghui Zhang

doi:10.1186/s13059-019-1659-6

Genome Biology (Mar 2019)

Affiliations

Xiaotu Ma: Department of Computational Biology, St. Jude Children’s Research Hospital
Ying Shao: Department of Computational Biology, St. Jude Children’s Research Hospital
Liqing Tian: Department of Computational Biology, St. Jude Children’s Research Hospital
Diane A. Flasch: Department of Computational Biology, St. Jude Children’s Research Hospital
Heather L. Mulder: Department of Computational Biology, St. Jude Children’s Research Hospital
Michael N. Edmonson: Department of Computational Biology, St. Jude Children’s Research Hospital
Yu Liu: Department of Computational Biology, St. Jude Children’s Research Hospital
Xiang Chen: Department of Computational Biology, St. Jude Children’s Research Hospital
Scott Newman: Department of Computational Biology, St. Jude Children’s Research Hospital
Joy Nakitandwe: Department of Pathology, St. Jude Children’s Research Hospital
Yongjin Li: Department of Computational Biology, St. Jude Children’s Research Hospital
Benshang Li: Key Laboratory of Pediatric Hematology and Oncology Ministry of Health, Department of Hematology and Oncology, Shanghai Children’s Medical Center, Shanghai Jiao Tong University School of Medicine
Shuhong Shen: Key Laboratory of Pediatric Hematology and Oncology Ministry of Health, Department of Hematology and Oncology, Shanghai Children’s Medical Center, Shanghai Jiao Tong University School of Medicine
Zhaoming Wang: Department of Computational Biology, St. Jude Children’s Research Hospital
Sheila Shurtleff: Department of Pathology, St. Jude Children’s Research Hospital
Leslie L. Robison: Department of Epidemiology and Cancer Control, St. Jude Children’s Research Hospital
Shawn Levy: HudsonAlpha Institute for Biotechnology
John Easton: Department of Computational Biology, St. Jude Children’s Research Hospital
Jinghui Zhang: Department of Computational Biology, St. Jude Children’s Research Hospital

DOI: https://doi.org/10.1186/s13059-019-1659-6
Journal volume & issue: Vol. 20, no. 1, pp. 1–15

Abstract

Background: Sequencing errors are key confounding factors in detecting low-frequency genetic variants, which are important for cancer molecular diagnosis, treatment, and surveillance using deep next-generation sequencing (NGS). However, the errors introduced at the various steps of a conventional NGS workflow, such as sample handling, library preparation, PCR enrichment, and sequencing, are not comprehensively understood. In this study, we use current NGS technology to systematically investigate these questions.

Results: By evaluating read-specific error distributions, we discover that the substitution error rate can be computationally suppressed to 10^-5 to 10^-4, which is 10- to 100-fold lower than the rate generally considered achievable (10^-3) in the current literature. We then quantify substitution errors attributable to sample handling, library preparation, enrichment PCR, and sequencing by using multiple deep sequencing datasets. We find that error rates differ by nucleotide substitution type, ranging from 10^-5 for A>C/T>G, C>A/G>T, and C>G/G>C changes to 10^-4 for A>G/T>C changes. Furthermore, C>T/G>A errors exhibit strong sequence context dependency, sample-specific effects dominate elevated C>A/G>T errors, and target-enrichment PCR leads to a ~6-fold increase in the overall error rate. We also find that more than 70% of hotspot variants can be detected at frequencies of 0.1% to 0.01% with current NGS technology by applying in silico error suppression.

Conclusions: We present the first comprehensive analysis of sequencing error sources in conventional NGS workflows. The error profiles revealed by our study highlight new directions for further improving NGS analysis accuracy both experimentally and computationally, ultimately enhancing the precision of deep sequencing.
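The abstract reports error rates per substitution type, with each substitution paired with its reverse complement (for example A>C/T>G). The following is a minimal Python sketch of how such rates could be tallied from per-position base counts. It is not the authors' pipeline: the function names, the input layout, and the assumption that every listed position is truly homozygous reference (so that any non-reference base counts as an error) are all hypothetical choices made for illustration.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def substitution_class(ref, alt):
    """Label a substitution jointly with its reverse-complement partner,
    e.g. ('A', 'C') and ('T', 'G') both map to 'A>C/T>G'."""
    forward = f"{ref}>{alt}"
    reverse = f"{COMPLEMENT[ref]}>{COMPLEMENT[alt]}"
    return "/".join(sorted((forward, reverse)))


def error_rates(pileup):
    """Estimate the substitution error rate per substitution class.

    pileup: iterable of (ref_base, {base: read_count}) pairs, one per
    position. Assumption (hypothetical): every position is homozygous
    reference, so each non-reference base is a sequencing error.
    """
    mismatches = Counter()
    depth_seen = Counter()
    for ref, counts in pileup:
        depth = sum(counts.values())
        for alt in "ACGT":
            if alt == ref:
                continue
            cls = substitution_class(ref, alt)
            mismatches[cls] += counts.get(alt, 0)
            depth_seen[cls] += depth  # bases that could have shown this error
    return {cls: mismatches[cls] / depth_seen[cls]
            for cls in depth_seen if depth_seen[cls] > 0}


# Toy data: two positions, each covered by 100,000 reads.
toy_pileup = [
    ("A", {"A": 99990, "G": 10}),
    ("T", {"T": 100000}),
]
rates = error_rates(toy_pileup)
```

In this toy input, the ten A>G mismatches are divided by the 200,000 reference A and T bases observed, so the A>G/T>C class lands in the 10^-5 to 10^-4 range that the abstract describes. Real data would additionally need the read-level filtering and context stratification that the paper discusses.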

Published in Genome Biology

ISSN: 1474-760X (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Science: Biology (General): Genetics
Website: https://genomebiology.biomedcentral.com/
