IEEE Access (Jan 2020)

Mining Discriminative K-Mers in DNA Sequences Using Sketches and Hardware Acceleration

  • Antonio Saavedra,
  • Hans Lehnert,
  • Cecilia Hernandez,
  • Gonzalo Carvajal,
  • Miguel Figueroa

DOI
https://doi.org/10.1109/ACCESS.2020.3003918
Journal volume & issue
Vol. 8
pp. 114715 – 114732

Abstract

Extracting discriminative k-mers is an important and challenging problem in DNA sequence analysis with applications in metagenomics and motif discovery. Despite the availability of multiple computational tools designed for this purpose, most discriminative k-mer discovery methods suffer from long execution times and high memory usage when processing large datasets. This paper presents a novel approach for discriminative k-mer discovery in DNA sequences, which leverages streaming and sketch algorithms to reduce space complexity and expose data parallelism, enabling the use of parallel platforms for accelerating the execution of computationally intensive operations. To assess the performance of our method, we designed and implemented two versions of the algorithm that leverage parallelization at different levels: (i) a software version tailored for multithreading and vector instructions in commodity CPUs, and (ii) a custom architecture implemented on a Field-Programmable Gate Array (FPGA) accelerator that exploits fine-grain parallelism and deep pipelining on reconfigurable logic. Experimental results show that, when mining discriminative k-mers from a set of well-known ChIP-seq sequences, our parallel software implementation executes at least 15% faster than exact-counting tools, and requires at least five times less memory when processing large datasets. More importantly, we designed a custom FPGA-based accelerator for our algorithm on a Xilinx KCU1500 board, which achieves speedups above 78x with the largest datasets, compared to our parallel software implementation. The accelerator uses less than 3% of the logic resources available on the on-board XCKU115 Kintex UltraScale FPGA, and between 12% and 70% of the memory resources, depending on the size of the dataset.
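To illustrate the general idea of sketch-based approximate k-mer counting that the abstract refers to, the following is a minimal C++ sketch, not the paper's actual method or parameters: it counts k-mers from a "positive" and a "background" set in two Count-Min sketches and flags k-mers whose estimated positive count clearly exceeds the background estimate. The k value, hash scheme, sketch dimensions, and the 2x ratio threshold are all assumptions chosen for illustration.

```cpp
// Illustrative sketch only: Count-Min-based approximate k-mer counting and a
// simple frequency-ratio test for "discriminative" k-mers. Parameters are
// assumptions, not those used in the paper.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct CountMin {
    static constexpr int kRows = 4;          // number of hash rows (assumed)
    static constexpr int kCols = 1 << 16;    // counters per row (assumed)
    std::vector<uint32_t> table = std::vector<uint32_t>(kRows * kCols, 0);

    // Simple multiply-xor hashing with a per-row seed (illustrative only).
    static uint64_t hash(uint64_t key, uint64_t seed) {
        key ^= seed;
        key *= 0x9E3779B97F4A7C15ULL;
        return key ^ (key >> 32);
    }
    void add(uint64_t key) {
        for (int r = 0; r < kRows; ++r)
            ++table[r * kCols + (hash(key, r + 1) % kCols)];
    }
    uint32_t estimate(uint64_t key) const {
        uint32_t m = UINT32_MAX;
        for (int r = 0; r < kRows; ++r)
            m = std::min(m, table[r * kCols + (hash(key, r + 1) % kCols)]);
        return m;  // Count-Min estimate: minimum over rows, never undercounts
    }
};

// Pack a k-mer over {A,C,G,T} into a 2-bit-per-base integer; false on other bases.
bool encode(const std::string& s, size_t pos, int k, uint64_t& code) {
    code = 0;
    for (int i = 0; i < k; ++i) {
        int b;
        switch (s[pos + i]) {
            case 'A': b = 0; break;
            case 'C': b = 1; break;
            case 'G': b = 2; break;
            case 'T': b = 3; break;
            default: return false;  // skip k-mers with ambiguous bases
        }
        code = (code << 2) | b;
    }
    return true;
}

int main() {
    const int k = 5;  // toy k-mer length; real analyses use larger k
    std::vector<std::string> positive   = {"ACGTACGTTGACGT", "TTGACGTACGTT"};
    std::vector<std::string> background = {"AAAAACCCCCGGGG", "CCCCGGGGTTTT"};

    CountMin pos_sketch, bg_sketch;
    auto count = [&](const std::vector<std::string>& seqs, CountMin& cm) {
        for (const auto& s : seqs)
            for (size_t i = 0; i + k <= s.size(); ++i) {
                uint64_t code;
                if (encode(s, i, k, code)) cm.add(code);
            }
    };
    count(positive, pos_sketch);
    count(background, bg_sketch);

    // Flag k-mers from the positive set whose estimated count is at least
    // twice the (smoothed) background estimate -- an assumed threshold.
    for (const auto& s : positive)
        for (size_t i = 0; i + k <= s.size(); ++i) {
            uint64_t code;
            if (!encode(s, i, k, code)) continue;
            uint32_t p = pos_sketch.estimate(code);
            uint32_t b = bg_sketch.estimate(code);
            if (p >= 2 * (b + 1))
                std::cout << s.substr(i, k) << " pos=" << p << " bg=" << b << "\n";
        }
    return 0;
}
```

Because each sketch update and query touches only a fixed number of independent counters, this style of counting exposes the data parallelism that the paper exploits with vector instructions on CPUs and deep pipelines on the FPGA.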

Keywords