Genome Biology (May 2023)

RabbitTClust: enabling fast clustering analysis of millions of bacteria genomes with MinHash sketches

  • Xiaoming Xu,
  • Zekun Yin,
  • Lifeng Yan,
  • Hao Zhang,
  • Borui Xu,
  • Yanjie Wei,
  • Beifang Niu,
  • Bertil Schmidt,
  • Weiguo Liu

DOI
https://doi.org/10.1186/s13059-023-02961-6
Journal volume & issue
Vol. 24, no. 1
pp. 1 – 20

Abstract

We present RabbitTClust, a fast and memory-efficient genome clustering tool based on sketch-based distance estimation. Our approach enables efficient processing of large-scale datasets by combining dimensionality reduction techniques with streaming and parallelization on modern multi-core platforms. RabbitTClust clusters 113,674 complete bacterial genome sequences from RefSeq (455 GB in FASTA format) in less than 6 min, and 1,009,738 GenBank assembled bacterial genomes (4.0 TB in FASTA format) in only 34 min, on a 128-core workstation. Our results further identify 1269 redundant genomes, with identical nucleotide content, in the RefSeq bacterial genomes database.
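
To illustrate the general idea of sketch-based distance estimation named in the title and abstract, the following is a minimal Python sketch of a bottom-s MinHash and a Mash-style distance. It is not RabbitTClust's implementation: the function names, the choice of hash (BLAKE2b), the k-mer length, the sketch size, and the Mash-style conversion from Jaccard index to distance are all assumptions made for illustration only.

```python
import hashlib
import math


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer):
    """Stable 64-bit hash of a k-mer (BLAKE2b is an arbitrary choice here)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_sketch(seq, k=21, size=1000):
    """Bottom-s sketch: the `size` smallest distinct k-mer hash values."""
    return sorted({hash64(km) for km in kmers(seq, k)})[:size]


def jaccard_estimate(sketch_a, sketch_b, size=1000):
    """Estimate the Jaccard index of two k-mer sets from their sketches."""
    set_a, set_b = set(sketch_a), set(sketch_b)
    # The `size` smallest hashes of the merged sketches approximate a
    # random sample of the union of the two k-mer sets.
    union_sample = sorted(set_a | set_b)[:size]
    if not union_sample:
        return 0.0
    shared = sum(1 for h in union_sample if h in set_a and h in set_b)
    return shared / len(union_sample)


def mash_distance(jaccard, k=21):
    """Mash-style distance: -1/k * ln(2j / (1 + j)), clamped to [0, 1]."""
    if jaccard == 0:
        return 1.0
    d = -math.log(2 * jaccard / (1 + jaccard)) / k
    return min(1.0, max(0.0, d))
```

Comparing two genomes then requires only their fixed-size sketches rather than the full sequences, which is the kind of dimensionality reduction that makes pairwise distance computation over very large collections tractable.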

Keywords