Frontiers in Genetics (Aug 2024)

A high-precision genome size estimator based on the k-mer histogram correction

  • Xiangyu Liao,
  • Wufei Zhu,
  • Chaoyun Liu

DOI
https://doi.org/10.3389/fgene.2024.1451730
Journal volume & issue
Vol. 15

Abstract

Introduction: In next-generation sequencing datasets, various characteristics can be extracted through k-mer-based analysis. Among these, genome size (GS) is relatively easy to estimate, yet achieving satisfactory accuracy, especially in the presence of heterozygosity, remains a challenge.

Methods: In this study, we introduce a high-precision genome size estimator, GSET (Genome Size Estimation Tool), which is based on k-mer histogram correction.

Results: We evaluated GSET on both simulated and real datasets. The experimental results show that it estimates genome size more accurately than state-of-the-art tools. Notably, GSET also performs satisfactorily on heterozygous datasets, where other tools struggle to produce usable results.

Discussion: The processing model of GSET diverges from the data-fitting models commonly used by similar tools. Instead, it is derived from empirical data and incorporates a correction term to mitigate the impact of sequencing errors on genome size estimation. GSET is freely available at https://github.com/Xingyu-Liao/GSET.
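
For readers unfamiliar with k-mer histograms, the sketch below shows the generic textbook estimate that tools in this area start from: divide the total number of k-mer occurrences by the depth of the main histogram peak. It is not GSET's model, which the abstract describes only as empirically derived with a correction term. The function names and the `min_depth` cutoff are hypothetical, and the cutoff is only a crude stand-in for handling low-frequency k-mers caused by sequencing errors.

```python
from collections import Counter


def kmer_histogram(reads, k):
    """Return {depth: number of distinct k-mers observed exactly `depth` times}."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return Counter(counts.values())


def estimate_genome_size(hist, min_depth=2):
    """Naive estimate: k-mer occurrences at depth >= min_depth, divided by peak depth.

    `min_depth` is an illustrative assumption: k-mers seen fewer times than
    this are treated as sequencing errors and discarded.
    """
    kept = {depth: n for depth, n in hist.items() if depth >= min_depth}
    if not kept:
        raise ValueError("no k-mers at or above min_depth")
    # Peak = depth shared by the largest number of distinct k-mers.
    peak_depth = max(kept, key=kept.get)
    total_occurrences = sum(depth * n for depth, n in kept.items())
    return total_occurrences / peak_depth


if __name__ == "__main__":
    # Synthetic histogram: an error spike at depth 1 and a main peak at depth 20.
    demo = {1: 1000, 2: 50, 19: 200, 20: 500, 21: 200}
    print(estimate_genome_size(demo))
```

A fixed cutoff like this discards some genuine low-depth k-mers and keeps some erroneous ones, and a single peak ignores the second peak that heterozygosity produces, which is the kind of inaccuracy the abstract says GSET is designed to reduce.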

Keywords