Algorithms for Molecular Biology (Apr 2024)

Fast, parallel, and cache-friendly suffix array construction

  • Jamshed Khan,
  • Tobias Rubel,
  • Erin Molloy,
  • Laxman Dhulipala,
  • Rob Patro

DOI
https://doi.org/10.1186/s13015-024-00263-5
Journal volume & issue
Vol. 19, no. 1
pp. 1 – 16

Abstract

Read online

Abstract Purpose String indexes such as the suffix array (sa) and the closely related longest common prefix (lcp) array are fundamental objects in bioinformatics and have a wide variety of applications. Despite their importance in practice, few scalable parallel algorithms for constructing these are known, and the existing algorithms can be highly non-trivial to implement and parallelize. Methods In this paper we present caps-sa, a simple and scalable parallel algorithm for constructing these string indexes inspired by samplesort and utilizing an LCP-informed mergesort. Due to its design, caps-sa has excellent memory-locality and thus incurs fewer cache misses and achieves strong performance on modern multicore systems with deep cache hierarchies. Results We show that despite its simple design, caps-sa outperforms existing state-of-the-art parallel sa and lcp-array construction algorithms on modern hardware. Finally, motivated by applications in modern aligners where the query strings have bounded lengths, we introduce the notion of a bounded-context sa and show that caps-sa can easily be extended to exploit this structure to obtain further speedups. We make our code publicly available at https://github.com/jamshed/CaPS-SA .

Keywords