From pairs of most similar sequences to phylogenetic best matches

Peter F. Stadler; Manuela Geiß; David Schaller; Alitzel López Sánchez; Marcos González Laffitte; Dulce I. Valdivia; Marc Hellmuth; Maribel Hernández Rosales

doi:10.1186/s13015-020-00165-2

Algorithms for Molecular Biology (Apr 2020)

Affiliations

Peter F. Stadler: Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Universität Leipzig
Manuela Geiß: Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Universität Leipzig
David Schaller: Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Universität Leipzig
Alitzel López Sánchez: CONACYT-Instituto de Matemáticas, UNAM Juriquilla
Marcos González Laffitte: CONACYT-Instituto de Matemáticas, UNAM Juriquilla
Dulce I. Valdivia: Departamento de Ingeniería Genética, Centro de Investigación y de Estudios Avanzados del IPN (CINVESTAV)
Marc Hellmuth: School of Computing, University of Leeds
Maribel Hernández Rosales: CONACYT-Instituto de Matemáticas, UNAM Juriquilla

DOI: https://doi.org/10.1186/s13015-020-00165-2
Journal volume & issue: Vol. 15, no. 1, pp. 1–20

Abstract

Background: Many of the commonly used methods for orthology detection start from mutually most similar pairs of genes (reciprocal best hits) as an approximation for evolutionarily most closely related pairs of genes (reciprocal best matches). This approximation of best matches by best hits becomes exact for ultrametric dissimilarities, i.e., under the Molecular Clock Hypothesis. It fails, however, whenever there are large lineage-specific rate variations among paralogous genes. In practice, this introduces a high level of noise into the input data for best-hit-based orthology detection methods.

Results: If additive distances between genes are known, then evolutionarily most closely related pairs can be identified by considering certain quartets of genes, provided that in each quartet the outgroup relative to the remaining three genes is known. A priori knowledge of the underlying species phylogeny greatly facilitates the identification of the required outgroup. Although the workflow remains a heuristic, since the correct outgroup cannot be determined reliably in all cases, simulations with lineage-specific biases and rate asymmetries show that nearly perfect results can be achieved. In a realistic setting, where distance data have to be estimated from sequence data and hence are noisy, it is still possible to obtain highly accurate sets of best matches.

Conclusion: Improvements of tree-free orthology assessment methods can be expected from a combination of the accurate inference of best matches reported here and recent mathematical advances in the understanding of (reciprocal) best match graphs and orthology relations.

Availability: Accompanying software is available at https://github.com/david-schaller/AsymmeTree.
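
The gap between best hits and best matches described in the abstract can be illustrated with a small sketch. It is not taken from the paper or from the AsymmeTree code base: the toy gene tree `(((x, y1)u, y2)r, z)root`, its branch lengths, and all function names are assumptions chosen for illustration. Gene `x` is the query, `y1` and `y2` are two candidate genes from another species, and `z` is assumed to be a known outgroup. The quartet test uses the standard four-point condition for additive distances, which is one plausible reading of the quartet idea and not necessarily the authors' exact procedure.

```python
# Hypothetical toy gene tree (((x, y1)u, y2)r, z)root, stored as
# child -> (parent, branch length). All names and lengths are illustrative.
def make_tree(len_y1):
    return {
        "x": ("u", 1.0),
        "y1": ("u", len_y1),   # len_y1 > 1 models a fast-evolving paralog
        "u": ("r", 1.0),
        "y2": ("r", 2.0),
        "r": ("root", 1.0),
        "z": ("root", 3.0),    # z plays the role of the known outgroup
    }


def depths(tree, node):
    """Map node and each of its ancestors to the path length from node."""
    total, out = 0.0, {node: 0.0}
    while node in tree:
        node, length = tree[node]
        total += length
        out[node] = total
    return out


def distance(tree, a, b):
    """Additive (path-length) distance between two nodes of the tree."""
    da, db = depths(tree, a), depths(tree, b)
    # The last common ancestor minimizes the summed path length.
    return min(da[n] + db[n] for n in da if n in db)


def best_hit(tree, query, candidates):
    """Candidate with the smallest distance to the query."""
    return min(candidates, key=lambda c: distance(tree, query, c))


def quartet_best_matches(tree, query, c1, c2, outgroup):
    """Use the four-point condition to decide which candidate(s) share the
    most recent common ancestor with the query. Ties caused by noisy
    distances are not handled in this sketch."""
    d = lambda a, b: distance(tree, a, b)
    sums = {
        "c1": d(query, c1) + d(c2, outgroup),    # split query,c1 | c2,outgroup
        "c2": d(query, c2) + d(c1, outgroup),    # split query,c2 | c1,outgroup
        "both": d(query, outgroup) + d(c1, c2),  # split query,outgroup | c1,c2
    }
    winner = min(sums, key=sums.get)
    return {"c1": {c1}, "c2": {c2}, "both": {c1, c2}}[winner]


fast = make_tree(10.0)   # y1 evolves ten times faster than x
clock = make_tree(1.0)   # clock-like (ultrametric) branch lengths

print(best_hit(fast, "x", ["y1", "y2"]))
print(quartet_best_matches(fast, "x", "y1", "y2", "z"))
print(best_hit(clock, "x", ["y1", "y2"]))
```

In the tree with the fast-evolving `y1`, the smallest distance points to `y2`, while the quartet test still recovers `y1`, the gene that shares the most recent common ancestor with `x`. With clock-like branch lengths the two notions agree.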

Published in Algorithms for Molecular Biology

ISSN: 1748-7188 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Science: Biology (General): Genetics
Website: http://almob.biomedcentral.com
