Benchmarking of computational error-correction methods for next-generation sequencing data

Keith Mitchell; Jaqueline J. Brito; Igor Mandric; Qiaozhen Wu; Sergey Knyazev; Sei Chang; Lana S. Martin; Aaron Karlsberg; Ekaterina Gerasimov; Russell Littman; Brian L. Hill; Nicholas C. Wu; Harry Taegyun Yang; Kevin Hsieh; Linus Chen; Eli Littman; Taylor Shabani; German Enik; Douglas Yao; Ren Sun; Jan Schroeder; Eleazar Eskin; Alex Zelikovsky; Pavel Skums; Mihai Pop; Serghei Mangul

doi:10.1186/s13059-020-01988-3

Genome Biology (Mar 2020)

Affiliations

Keith Mitchell: Department of Computer Science, University of California Los Angeles
Jaqueline J. Brito: Department of Clinical Pharmacy, School of Pharmacy, University of Southern California
Igor Mandric: Department of Computer Science, University of California Los Angeles
Qiaozhen Wu: Department of Mathematics, University of California Los Angeles
Sergey Knyazev: Department of Computer Science, Georgia State University
Sei Chang: Department of Computer Science, University of California Los Angeles
Lana S. Martin: Department of Clinical Pharmacy, School of Pharmacy, University of Southern California
Aaron Karlsberg: Department of Clinical Pharmacy, School of Pharmacy, University of Southern California
Ekaterina Gerasimov: Department of Computer Science, Georgia State University
Russell Littman: UCLA Bioinformatics
Brian L. Hill: Department of Computer Science, University of California Los Angeles
Nicholas C. Wu: Department of Integrative Structural and Computational Biology, The Scripps Research Institute
Harry Taegyun Yang: Department of Computer Science, University of California Los Angeles
Kevin Hsieh: Department of Computer Science, University of California Los Angeles
Linus Chen: Department of Computer Science, University of California Los Angeles
Eli Littman: Department of Computer Science, University of California Los Angeles
Taylor Shabani: Department of Computer Science, University of California Los Angeles
German Enik: Department of Computer Science, University of California Los Angeles
Douglas Yao: Department of Molecular, Cell, and Developmental Biology, University of California Los Angeles
Ren Sun: Department of Molecular and Medical Pharmacology, University of California Los Angeles
Jan Schroeder: Epigenetics & Reprogramming Laboratory, Monash University
Eleazar Eskin: Department of Computer Science, University of California Los Angeles
Alex Zelikovsky: Department of Computer Science, Georgia State University
Pavel Skums: Department of Computer Science, Georgia State University
Mihai Pop: Department of Computer Science and Center for Bioinformatics and Computational Biology, University of Maryland
Serghei Mangul: Department of Clinical Pharmacy, School of Pharmacy, University of Southern California

DOI: https://doi.org/10.1186/s13059-020-01988-3
Journal volume & issue: Vol. 21, no. 1, pp. 1–13

Abstract

Background: Recent advancements in next-generation sequencing have rapidly improved our ability to study genomic material at an unprecedented scale. Despite substantial improvements in sequencing technologies, errors present in the data still risk confounding downstream analysis and limiting the applicability of sequencing technologies in clinical tools. Computational error correction promises to eliminate sequencing errors, but the relative accuracy of error correction algorithms remains unknown.

Results: In this paper, we evaluate the ability of error correction algorithms to fix errors across different types of datasets that contain various levels of heterogeneity. We highlight the advantages and limitations of computational error correction techniques across different domains of biology, including immunogenomics and virology. To demonstrate the efficacy of our technique, we apply the UMI-based high-fidelity sequencing protocol to eliminate sequencing errors from both simulated data and the raw reads. We then perform a realistic evaluation of error-correction methods.

Conclusions: In terms of accuracy, we find that method performance varies substantially across different types of datasets with no single method performing best on all types of examined data. Finally, we also identify the techniques that offer a good balance between precision and sensitivity.

Published in Genome Biology

ISSN: 1474-760X (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Science: Biology (General): Genetics
Website: https://genomebiology.biomedcentral.com/
