Fast, parallel, and cache-friendly suffix array construction

Jamshed Khan; Tobias Rubel; Erin Molloy; Laxman Dhulipala; Rob Patro

doi:10.1186/s13015-024-00263-5

Algorithms for Molecular Biology (Apr 2024)

Fast, parallel, and cache-friendly suffix array construction

Jamshed Khan,
Tobias Rubel,
Erin Molloy,
Laxman Dhulipala,
Rob Patro

Affiliations

Jamshed Khan: Department of Computer Science, University of Maryland
Tobias Rubel: Department of Computer Science, University of Maryland
Erin Molloy: Department of Computer Science, University of Maryland
Laxman Dhulipala: Department of Computer Science, University of Maryland
Rob Patro: Department of Computer Science, University of Maryland

DOI: https://doi.org/10.1186/s13015-024-00263-5
Journal volume & issue: Vol. 19, no. 1
pp. 1 – 16

Abstract

Read online

Abstract Purpose String indexes such as the suffix array (sa) and the closely related longest common prefix (lcp) array are fundamental objects in bioinformatics and have a wide variety of applications. Despite their importance in practice, few scalable parallel algorithms for constructing these are known, and the existing algorithms can be highly non-trivial to implement and parallelize. Methods In this paper we present caps-sa, a simple and scalable parallel algorithm for constructing these string indexes inspired by samplesort and utilizing an LCP-informed mergesort. Due to its design, caps-sa has excellent memory-locality and thus incurs fewer cache misses and achieves strong performance on modern multicore systems with deep cache hierarchies. Results We show that despite its simple design, caps-sa outperforms existing state-of-the-art parallel sa and lcp-array construction algorithms on modern hardware. Finally, motivated by applications in modern aligners where the query strings have bounded lengths, we introduce the notion of a bounded-context sa and show that caps-sa can easily be extended to exploit this structure to obtain further speedups. We make our code publicly available at https://github.com/jamshed/CaPS-SA .

Published in Algorithms for Molecular Biology

ISSN: 1748-7188 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Science: Biology (General): Genetics
Website: http://almob.biomedcentral.com

About the journal

Abstract

Keywords