Fast and efficient short read mapping based on a succinct hash index

Haowen Zhang; Yuandong Chan; Kaichao Fan; Bertil Schmidt; Weiguo Liu

doi:10.1186/s12859-018-2094-5

BMC Bioinformatics (Mar 2018)

Fast and efficient short read mapping based on a succinct hash index

Haowen Zhang,
Yuandong Chan,
Kaichao Fan,
Bertil Schmidt,
Weiguo Liu

Affiliations

Haowen Zhang: School of Computational Science and Engineering, Georgia Institute of Technology
Yuandong Chan: School of Software, Shandong University
Kaichao Fan: School of Software, Shandong University
Bertil Schmidt: Johannes Gutenberg University
Weiguo Liu: School of Software, Shandong University

DOI: https://doi.org/10.1186/s12859-018-2094-5
Journal volume & issue: Vol. 19, no. 1
pp. 1 – 14

Abstract

Read online

Abstract Background Various indexing techniques have been applied by next generation sequencing read mapping tools. The choice of a particular data structure is a trade-off between memory consumption, mapping throughput, and construction time. Results We present the succinct hash index – a novel data structure for read mapping which is a variant of the classical q-gram index with a particularly small memory footprint occupying between 3.5 and 5.3 GB for a human reference genome for typical parameter settings. The succinct hash index features two novel seed selection algorithms (group seeding and variable-length seeding) and an efficient parallel construction algorithm, which we have implemented to design the FEM (Fast(F) and Efficient(E) read Mapper(M)) mapper. FEM can return all read mappings within a given edit distance. Our experimental results show that FEM is scalable and outperforms other state-of-the-art all-mappers in terms of both speed and memory footprint. Compared to Masai, FEM is an order-of-magnitude faster using a single thread and two orders-of-magnitude faster when using multiple threads. Furthermore, we observe an up to 2.8-fold speedup compared to BitMapper and an order-of-magnitude speedup compared to BitMapper2 and Hobbes3. Conclusions The presented succinct index is the first feasible implementation of the q-gram index functionality that occupies around 3.5 GB of memory for a whole human reference genome. FEM is freely available at https://github.com/haowenz/FEM.

Published in BMC Bioinformatics

ISSN: 1471-2105 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Biology (General)
Website: http://www.biomedcentral.com/bmcbioinformatics/

About the journal

Abstract

Keywords