KMC3 and CHTKC: Best Scenarios, Deficiencies, and Challenges in High-Throughput Sequencing Data Analysis

Deyou Tang; Daqiang Tan; Weihao Xiao; Jiabin Lin; Juan Fu

doi:10.3390/a15040107

Algorithms (Mar 2022)

Affiliations

Deyou Tang: School of Software Engineering, South China University of Technology, Guangzhou 510006, China
Daqiang Tan: School of Software Engineering, South China University of Technology, Guangzhou 510006, China
Weihao Xiao: School of Software Engineering, South China University of Technology, Guangzhou 510006, China
Jiabin Lin: School of Software Engineering, South China University of Technology, Guangzhou 510006, China
Juan Fu: School of Medicine, South China University of Technology, Guangzhou 510006, China

DOI: https://doi.org/10.3390/a15040107
Journal volume & issue: Vol. 15, no. 4, p. 107

Abstract

Background: K-mer frequency counting is an upstream step in many bioinformatics data analysis workflows. KMC3 and CHTKC are representative partition-based and non-partition-based k-mer counting algorithms, respectively. This paper evaluates the two algorithms across multiple hardware contexts and datasets and presents their best-suited scenarios and potential improvements.

Results: KMC3 uses less memory and runs faster than CHTKC on a server with a regular configuration. CHTKC is efficient on high-performance computing platforms with large available memory, many threads, and low IO bandwidth. Across the tested datasets, KMC3 is less sensitive to the number of distinct k-mers and is more efficient for tasks with relatively low sequencing quality and long k-mers. CHTKC performs better than KMC3 on counting tasks with large-scale datasets, high sequencing quality, and short k-mers. Both algorithms are affected by IO bandwidth, and reducing the influence of the IO bottleneck is critical: our tests show an improvement from filtering and compressing consecutive first-occurring k-mers in KMC3.

Conclusions: KMC3 is more competitive for k-mer counting on ordinary hardware resources, and CHTKC is more competitive for counting k-mers in super-scale datasets on higher-performance computing platforms. Reducing the influence of the IO bottleneck is essential for optimizing k-mer counting algorithms, and filtering and compressing low-frequency k-mers is critical in relieving the IO impact.
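
To make the distinction between non-partition-based and partition-based counting concrete, the following is a minimal Python sketch, not the implementation of either tool. The function names, the in-memory bins, and the toy bin-assignment hash are all assumptions made for illustration; neither KMC3's nor CHTKC's actual data structures are reproduced, and canonical (strand-independent) k-mers are not handled.

```python
from collections import Counter

def kmers(read, k):
    """Yield every k-mer of a read, skipping windows with non-ACGT symbols."""
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if set(kmer) <= set("ACGT"):
            yield kmer

def count_single_table(reads, k):
    """Non-partitioned style: all k-mers go into one in-memory table."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts

def count_partitioned(reads, k, n_bins=4):
    """Partitioned style: distribute k-mers into bins, then count bin by bin.

    Hypothetical simplification: bins are Python lists held in memory and
    the bin index is a toy hash (sum of character codes modulo n_bins).
    """
    bins = [[] for _ in range(n_bins)]
    for read in reads:
        for kmer in kmers(read, k):
            bins[sum(map(ord, kmer)) % n_bins].append(kmer)
    counts = Counter()
    for bin_kmers in bins:
        # Identical k-mers always land in the same bin, so each bin can be
        # counted independently with a much smaller working set.
        counts.update(Counter(bin_kmers))
    return counts

reads = ["ACGTACGT", "CGTACGTN"]
single = count_single_table(reads, 3)
partitioned = count_partitioned(reads, 3)
print(single == partitioned, dict(single))
```

Both functions return the same counts. They differ in peak working-set size and in how much intermediate data must be moved, which is the kind of memory-versus-IO trade-off the evaluation examines.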

Published in Algorithms

ISSN: 1999-4893 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Technology (General): Industrial engineering. Management engineering; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://www.mdpi.com/journal/algorithms
