BMC Bioinformatics (Jun 2022)

Global, highly specific and fast filtering of alignment seeds

  • Matthis Ebel,
  • Giovanna Migliorelli,
  • Mario Stanke

DOI
https://doi.org/10.1186/s12859-022-04745-4
Journal volume & issue
Vol. 23, no. 1
pp. 1 – 19

Abstract

Read online

Abstract Background An important initial phase of arguably most homology search and alignment methods such as required for genome alignments is seed finding. The seed finding step is crucial to curb the runtime as potential alignments are restricted to and anchored at the sequence position pairs that constitute the seed. To identify seeds, it is good practice to use sets of spaced seed patterns, a method that locally compares two sequences and requires exact matches at certain positions only. Results We introduce a new method for filtering alignment seeds that we call geometric hashing. Geometric hashing achieves a high specificity by combining non-local information from different seeds using a simple hash function that only requires a constant and small amount of additional time per spaced seed. Geometric hashing was tested on the task of finding homologous positions in the coding regions of human and mouse genome sequences. Thereby, the number of false positives was decreased about million-fold over sets of spaced seeds while maintaining a very high sensitivity. Conclusions An additional geometric hashing filtering phase could improve the run-time, accuracy or both of programs for various homology-search-and-align tasks.

Keywords