Mining statistically-solid k-mers for accurate NGS error correction

Liang Zhao; Jin Xie; Lin Bai; Wen Chen; Mingju Wang; Zhonglei Zhang; Yiqi Wang; Zhe Zhao; Jinyan Li

doi:10.1186/s12864-018-5272-y

BMC Genomics (Dec 2018)

Affiliations

Liang Zhao: Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine
Jin Xie: Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine
Lin Bai: School of Computing and Electronic Information, Guangxi University
Wen Chen: Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine
Mingju Wang: Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine
Zhonglei Zhang: Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine
Yiqi Wang: Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine
Zhe Zhao: School of Computing and Electronic Information, Guangxi University
Jinyan Li: Advanced Analytics Institute, Faculty of Engineering & IT, University of Technology Sydney

Journal volume & issue: Vol. 19, no. S10, pp. 1-10

Abstract

Background: NGS data contain many machine-induced errors. The most advanced error-correction methods depend heavily on the selection of solid k-mers. A solid k-mer is a k-mer that occurs frequently in NGS reads; all other k-mers are called weak k-mers. A solid k-mer is unlikely to contain errors, while a weak k-mer most likely does. An intensively investigated problem is to find a good frequency cutoff f0 that balances the numbers of solid and weak k-mers. Once the cutoff is determined, a more challenging but less-studied problem is to: (i) remove a small subset of solid k-mers that are likely to contain errors, and (ii) add a small subset of weak k-mers that are likely to contain no errors to the remaining set of solid k-mers. Identifying these two subsets of k-mers can improve correction performance.

Results: We propose to use a Gamma distribution to model the frequencies of erroneous k-mers and a mixture of Gaussian distributions to model the frequencies of correct k-mers, and to combine the two to determine f0. To identify the two special subsets of k-mers, we use the z-score of a k-mer, which measures how many standard deviations the k-mer's frequency lies from the mean. These statistically-solid k-mers are then used to construct a Bloom filter for error correction. In tests on both real and synthetic NGS data sets, our method is markedly superior to state-of-the-art methods.

Conclusion: The z-score is adequate to distinguish solid k-mers from weak k-mers, and is particularly useful for pinpointing solid k-mers that have very low frequency. Applying the z-score to k-mers can markedly improve error-correction accuracy.
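
The z-score idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not say which population the mean and standard deviation are taken over, so the sketch assumes, purely for illustration, that a k-mer is compared against itself plus those of its Hamming-distance-1 neighbours that actually occur in the reads. The function names (count_kmers, neighbours, z_score) are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev

ALPHABET = "ACGT"


def count_kmers(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def neighbours(kmer):
    """Yield every k-mer that differs from `kmer` at exactly one position."""
    for i, base in enumerate(kmer):
        for alt in ALPHABET:
            if alt != base:
                yield kmer[:i] + alt + kmer[i + 1:]


def z_score(kmer, counts):
    """Standard deviations between the k-mer's frequency and the mean
    frequency of its local population.

    ASSUMPTION: the population is the k-mer itself plus its observed
    Hamming-distance-1 neighbours; the abstract does not specify this.
    """
    population = [counts[kmer]]
    population += [counts[n] for n in neighbours(kmer) if n in counts]
    sigma = pstdev(population)
    if sigma == 0:
        # All frequencies equal (or no observed neighbours): no signal.
        return 0.0
    return (counts[kmer] - mean(population)) / sigma


if __name__ == "__main__":
    toy_reads = ["ACGT"] * 9 + ["ACGA"]
    toy_counts = count_kmers(toy_reads, 4)
    for kmer in sorted(toy_counts):
        print(kmer, toy_counts[kmer], z_score(kmer, toy_counts))
```

In this toy input, ACGT is seen nine times and its only observed neighbour ACGA once, so ACGT scores 1.0 and ACGA scores -1.0. A score defined relative to neighbours, unlike a single global cutoff f0, can in principle flag a low-frequency k-mer as solid when nothing similar outnumbers it; how the paper turns such scores into the two special subsets is not stated in the abstract.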

Published in BMC Genomics

ISSN: 1471-2164 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Technology: Chemical technology: Biotechnology; Science: Biology (General): Genetics
Website: http://bmcgenomics.biomedcentral.com
