BMC Bioinformatics (Nov 2017)

Fast batch searching for protein homology based on compression and clustering

  • Hongwei Ge,
  • Liang Sun,
  • Jinghong Yu

DOI
https://doi.org/10.1186/s12859-017-1938-8
Journal volume & issue
Vol. 18, no. 1
pp. 1 – 12

Abstract

Background: In the bioinformatics community, many tasks involve matching a set of protein query sequences against large sequence datasets. To run multiple queries against a database, a commonly used method is to run BLAST on each original query or on the concatenated queries. This is inefficient because it does not exploit the common subsequences shared by the queries.

Results: We propose a compression- and cluster-based BLASTP (C2-BLASTP) algorithm to further exploit the joint information among the query sequences and the database. First, the queries and the database are compressed in turn through redundancy analysis, redundancy removal, and distinction recording. Second, the database is clustered according to the Hamming distance among the subsequences. To improve the sensitivity and selectivity of sequence alignments, ten groups of reduced amino acid alphabets are used. The hit-finding operator is then applied to the clustered database. Furthermore, an execution database is constructed from the potential hits found, with the objective of mitigating the effect of the increasing scale of the sequence database. Finally, the homology search is performed in the execution database. Experiments on the NCBI NR database demonstrate the effectiveness of the proposed C2-BLASTP for batch homology searching in a sequence database. The results are evaluated in terms of homology accuracy, search speed, and memory usage.

Conclusions: C2-BLASTP achieves competitive results compared with some state-of-the-art methods.
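The clustering step can be illustrated with a minimal Python sketch. It is not the authors' implementation. The reduced alphabet in `GROUPS`, the greedy first-fit strategy, and the names `reduce_sequence`, `hamming`, and `cluster_subsequences` are all assumptions made for illustration, and the grouping shown is not one of the ten alphabets used in the paper. The sketch maps equal-length subsequences onto a reduced alphabet and places each one in the first cluster whose representative lies within a chosen Hamming distance.

```python
from typing import Dict, List

# Hypothetical reduced alphabet: each group is collapsed to its first letter.
# Illustrative only; not one of the ten alphabets used in the paper.
GROUPS = ["LVIMC", "AG", "ST", "P", "FYW", "EDNQ", "KR", "H"]
REDUCE: Dict[str, str] = {aa: group[0] for group in GROUPS for aa in group}


def reduce_sequence(seq: str) -> str:
    """Map each residue to its group representative ('X' if unknown)."""
    return "".join(REDUCE.get(aa, "X") for aa in seq)


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def cluster_subsequences(subseqs: List[str], max_dist: int) -> Dict[str, List[str]]:
    """Greedy first-fit clustering keyed by each cluster's first member.

    Distances are measured on the reduced-alphabet form of the subsequences.
    """
    clusters: Dict[str, List[str]] = {}
    for sub in subseqs:
        reduced = reduce_sequence(sub)
        for rep in clusters:
            if hamming(reduce_sequence(rep), reduced) <= max_dist:
                clusters[rep].append(sub)
                break
        else:
            # No existing representative is close enough: start a new cluster.
            clusters[sub] = [sub]
    return clusters


if __name__ == "__main__":
    words = ["LKAST", "IKGST", "WWPHH", "VRASS"]
    for rep, members in cluster_subsequences(words, max_dist=0).items():
        print(rep, members)
```

With `max_dist=0`, subsequences that differ only by substitutions within a group fall into the same cluster, which shows how a reduced alphabet lets a hit-finding step examine one representative per cluster instead of every subsequence.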

Keywords