Genome Biology (Jun 2023)

Matchtigs: minimum plain text representation of k-mer sets

  • Sebastian Schmidt,
  • Shahbaz Khan,
  • Jarno N. Alanko,
  • Giulio E. Pibiri,
  • Alexandru I. Tomescu

DOI
https://doi.org/10.1186/s13059-023-02968-z
Journal volume & issue
Vol. 24, no. 1
pp. 1 – 32

Abstract

Read online

Abstract We propose a polynomial algorithm computing a minimum plain-text representation of k-mer sets, as well as an efficient near-minimum greedy heuristic. When compressing read sets of large model organisms or bacterial pangenomes, with only a minor runtime increase, we shrink the representation by up to 59% over unitigs and 26% over previous work. Additionally, the number of strings is decreased by up to 97% over unitigs and 90% over previous work. Finally, a small representation has advantages in downstream applications, as it speeds up SSHash-Lite queries by up to 4.26× over unitigs and 2.10× over previous work.

Keywords