iScience (Dec 2024)
The backpack quotient filter: A dynamic and space-efficient data structure for querying k-mers with abundance
Abstract
Summary: Genomic data sequencing is crucial for understanding biological systems. As genomic databases like the European Nucleotide Archive expand exponentially, efficient data manipulation is essential. A key challenge is querying these databases to determine the presence or absence of specific sequences and their abundance within datasets.This paper presents the Backpack Quotient Filter (BQF), a data structure for indexing k-mers (substrings of length k), which offers greater space efficiency than the Counting Quotient Filter (CQF). The BQF maintains essential features such as abundance information and dynamicity, with an extremely low false positive rate of less than 10−5%. Our method redefines abundance information handling and implements an independent strategy for space efficiency.The BQF uses four times less space than the CQF on complex datasets such as sea-water metagenomics sequences. Additionally, its space efficiency improves with larger datasets, addressing the need for scalable data solutions.