An alignment-free heuristic for fast sequence comparisons with applications to phylogeny reconstruction

Sriram P. Chockalingam; Jodh Pannu; Sahar Hooshmand; Sharma V. Thankachan; Srinivas Aluru

doi:10.1186/s12859-020-03738-5

BMC Bioinformatics (Nov 2020)

Affiliations

Sriram P. Chockalingam: Institute for Data Engineering and Science, Georgia Institute of Technology
Jodh Pannu: Department of Computer Science, University of Central Florida
Sahar Hooshmand: Department of Computer Science, University of Central Florida
Sharma V. Thankachan: Department of Computer Science, University of Central Florida
Srinivas Aluru: Institute for Data Engineering and Science, Georgia Institute of Technology

DOI: https://doi.org/10.1186/s12859-020-03738-5
Journal volume & issue: Vol. 21, no. S6
pp. 1 – 12

Abstract

Background: Alignment-free methods for sequence comparison have become popular in many bioinformatics applications, particularly for estimating the sequence similarity measures used to construct phylogenetic trees. Recently, the average common substring measure, ACS, and its k-mismatch counterpart, ACS_k, have been shown to produce results as effective as multiple-sequence-alignment-based methods for the reconstruction of phylogenetic trees. Because computing ACS_k exactly takes O(n log^k n) time and is therefore impractical for large datasets, multiple heuristics that approximate ACS_k have been introduced.

Results: In this paper, we present a novel linear-time heuristic to approximate ACS_k that is faster than computing the exact ACS_k and yields values closer to the exact ACS_k than previously published linear-time greedy heuristics. Using four real datasets containing both DNA and protein sequences, we evaluate our algorithm in terms of accuracy and runtime and demonstrate its applicability to phylogeny reconstruction. Our algorithm provides better accuracy than previously published heuristic methods while being comparable in its applications to phylogeny reconstruction.

Conclusions: Our method produces a better approximation of ACS_k and is applicable to the alignment-free comparison of biological sequences at highly competitive speed. The algorithm is implemented in the Rust programming language, and the source code is available at https://github.com/srirampc/adyar-rs.
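To make the quantity being approximated concrete, the sketch below computes ACS_k by brute force under its usual definition: for every position i of a sequence x, take the longest substring starting at i that matches some substring of y with at most k mismatches, then average those lengths over all positions of x. This is an illustration only. It is neither the paper's linear-time heuristic nor the exact O(n log^k n) algorithm, and the function names `lcp_k` and `acs_k` are hypothetical, not taken from the adyar-rs source.

```rust
/// Length of the longest common prefix of `a` and `b` when up to `k`
/// mismatching positions are tolerated (mismatched positions count
/// towards the length).
fn lcp_k(a: &[u8], b: &[u8], k: usize) -> usize {
    let mut mismatches = 0;
    let mut len = 0;
    for (x, y) in a.iter().zip(b.iter()) {
        if x != y {
            if mismatches == k {
                break; // a (k+1)-th mismatch would be needed; stop here
            }
            mismatches += 1;
        }
        len += 1;
    }
    len
}

/// Brute-force ACS_k(x, y): for each start position i in `x`, find the
/// longest k-mismatch match against any start position j in `y`, then
/// average over all positions of `x`. Illustrative only; the running
/// time is at least quadratic in the sequence lengths.
fn acs_k(x: &[u8], y: &[u8], k: usize) -> f64 {
    if x.is_empty() {
        return 0.0;
    }
    let total: usize = (0..x.len())
        .map(|i| {
            (0..y.len())
                .map(|j| lcp_k(&x[i..], &y[j..], k))
                .max()
                .unwrap_or(0) // `y` is empty: nothing can match
        })
        .sum();
    total as f64 / x.len() as f64
}
```

For example, `acs_k(b"ACGT", b"AGGT", 0)` averages the per-position match lengths 1, 0, 2 and 1, while `acs_k(b"ACGT", b"AGGT", 1)` allows the single C/G substitution to be absorbed, so longer matches are found from the first two positions. Setting k = 0 gives the plain ACS measure.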

Published in BMC Bioinformatics

ISSN: 1471-2105 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Biology (General)
Website: http://www.biomedcentral.com/bmcbioinformatics/