Genome Biology (Jan 2021)

SequencErr: measuring and suppressing sequencer errors in next-generation sequencing data

  • Eric M. Davis,
  • Yu Sun,
  • Yanling Liu,
  • Pandurang Kolekar,
  • Ying Shao,
  • Karol Szlachta,
  • Heather L. Mulder,
  • Dongren Ren,
  • Stephen V. Rice,
  • Zhaoming Wang,
  • Joy Nakitandwe,
  • Alexander M. Gout,
  • Bridget Shaner,
  • Salina Hall,
  • Leslie L. Robison,
  • Stanley Pounds,
  • Jeffery M. Klco,
  • John Easton,
  • Xiaotu Ma

DOI
https://doi.org/10.1186/s13059-020-02254-2
Journal volume & issue
Vol. 22, no. 1
pp. 1 – 18

Abstract

Background: There is currently no method to precisely measure the errors that occur in the sequencing instrument/sequencer, which is critical for next-generation sequencing applications aimed at discovering the genetic makeup of heterogeneous cellular populations.

Results: We propose a novel computational method, SequencErr, to address this challenge by measuring the base correspondence between overlapping regions in forward and reverse reads. An analysis of 3777 public datasets from 75 research institutions in 18 countries revealed the sequencer error rate to be ~10 per million (pm) and 1.4% of sequencers and 2.7% of flow cells have error rates >100 pm. At the flow cell level, error rates are elevated in the bottom surfaces and >90% of HiSeq and NovaSeq flow cells have at least one outlier error-prone tile. By sequencing a common DNA library on different sequencers, we demonstrate that sequencers with high error rates have reduced overall sequencing accuracy, and removal of outlier error-prone tiles improves sequencing accuracy. We demonstrate that SequencErr can reveal novel insights relative to the popular quality control method FastQC and achieve a 10-fold lower error rate than popular error correction methods including Lighter and Musket.

Conclusions: Our study reveals novel insights into the nature of DNA sequencing errors incurred on DNA sequencers. Our method can be used to assess, calibrate, and monitor sequencer accuracy, and to computationally suppress sequencer errors in existing datasets.
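The idea of comparing overlapping regions of forward and reverse reads can be illustrated with a minimal Python sketch. This is not the SequencErr implementation: the function names (`reverse_complement`, `overlap_counts`, `discordance_per_million`) are hypothetical, the overlap length is assumed to be known in advance (for example, from alignment), and base-quality filtering, alignment, and per-tile grouping are all omitted. It only shows how a per-million discordance rate could be tallied when both reads of a pair cover the same bases of one DNA fragment.

```python
from typing import Iterable, Tuple

# Complement table for unambiguous bases plus N (assumption: no other IUPAC codes).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def overlap_counts(read1: str, read2: str, overlap: int) -> Tuple[int, int]:
    """Count (mismatches, compared bases) in the overlap of a read pair.

    Assumes the last `overlap` bases of the forward read and the first
    `overlap` bases of the reverse-complemented reverse read cover the same
    fragment positions (i.e. the fragment is at least as long as each read).
    """
    if not 0 < overlap <= min(len(read1), len(read2)):
        raise ValueError("overlap must be between 1 and the shorter read length")
    tail = read1.upper()[-overlap:]
    head = reverse_complement(read2)[:overlap]
    compared = 0
    mismatches = 0
    for a, b in zip(tail, head):
        if a == "N" or b == "N":
            continue  # skip no-calls; they carry no base information
        compared += 1
        mismatches += a != b
    return mismatches, compared


def discordance_per_million(pairs: Iterable[Tuple[str, str, int]]) -> float:
    """Aggregate overlap discordance over many pairs, in mismatches per million bases."""
    total_mismatches = 0
    total_compared = 0
    for read1, read2, overlap in pairs:
        m, c = overlap_counts(read1, read2, overlap)
        total_mismatches += m
        total_compared += c
    if total_compared == 0:
        raise ValueError("no comparable bases in the supplied pairs")
    return 1e6 * total_mismatches / total_compared
```

Because both reads in a pair derive from the same fragment, a disagreement inside the overlap indicates that at least one of the two base calls is wrong. Grouping such counts by sequencer, flow cell, or tile would be one way to obtain the per-instrument comparisons described in the abstract, although how SequencErr does this is not specified here.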

Keywords