Algorithms for Molecular Biology (May 2022)

Embedding gene trees into phylogenetic networks by conflict resolution algorithms

  • Marcin Wawerka,
  • Dawid Dąbkowski,
  • Natalia Rutecka,
  • Agnieszka Mykowiecka,
  • Paweł Górecki

DOI
https://doi.org/10.1186/s13015-022-00218-8
Journal volume & issue
Vol. 17, no. 1
pp. 1 – 23

Abstract

Background: Phylogenetic networks are mathematical models of evolutionary processes involving reticulate events such as hybridization, recombination, or horizontal gene transfer. One of the crucial notions in phylogenetic network modelling is the displayed tree, which is obtained from a network by removing a set of reticulation edges. Displayed trees may represent the evolutionary history of a gene family if the evolution is shaped by reticulation events.

Results: We address the problem of inferring an optimal tree displayed by a network, given a gene tree G and a tree-child network N, under the deep coalescence and duplication costs. We propose an O(mn)-time dynamic programming algorithm (DP) to compute a lower bound of the optimal displayed tree cost, where m and n are the sizes of G and N, respectively. In addition, our algorithm can verify whether the solution is exact. Moreover, it provides a set of reticulation edges corresponding to the obtained cost. If the cost is exact, the set induces an optimal displayed tree. Otherwise, the set contains pairs of conflicting edges, i.e., edges sharing a reticulation node. Next, we show a conflict resolution algorithm that requires $$2^{r+1}-1$$ invocations of DP in the worst case, where r is the number of reticulations. We propose a similar $$O(2^k mn)$$-time algorithm for level-k tree-child networks and a branch and bound solution to compute lower and upper bounds of optimal costs. We also extend the algorithms to a broader class of phylogenetic networks. Based on simulated data, the average runtime is $$\Theta(2^{0.543k} mn)$$ under the deep coalescence cost and $$\Theta(2^{0.355k} mn)$$ under the duplication cost.

Conclusions: Despite exponential complexity in the worst case, our algorithms perform well on empirical and simulated datasets, due to the strategy of resolving internal dissimilarities between gene trees and networks. Therefore, the algorithms are efficient alternatives to enumeration strategies commonly proposed in the literature and enable analyses of complex networks with dozens of reticulations.
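
For orientation, the enumeration strategy that the abstract contrasts with can be sketched in a few lines. This is not the DP or the conflict resolution algorithm of the paper. It is a minimal brute-force baseline that keeps exactly one parent edge per reticulation, scores each resulting displayed tree with the standard LCA-based duplication cost, and returns the cheapest one. The network encoding (a child-to-parents dictionary), the nested-tuple gene tree, and all function names are assumptions made for illustration.

```python
from itertools import product

def displayed_trees(parents):
    """Yield a child -> parent map for every way of keeping one parent per reticulation."""
    retic = [v for v, ps in parents.items() if len(ps) > 1]
    for choice in product(*(parents[v] for v in retic)):
        tree = {v: ps[0] for v, ps in parents.items() if len(ps) == 1}
        tree.update(zip(retic, choice))
        yield tree

def ancestors(tree, v):
    """Return the path from v up to the root, with v included."""
    path = [v]
    while path[-1] in tree:
        path.append(tree[path[-1]])
    return path

def lca(tree, a, b):
    """Lowest common ancestor of nodes a and b in a child -> parent map."""
    seen = set(ancestors(tree, a))
    return next(x for x in ancestors(tree, b) if x in seen)

def duplication_cost(gene, tree):
    """Count gene-tree nodes whose LCA mapping equals the mapping of one of their children."""
    dups = 0

    def mapping(g):
        nonlocal dups
        if isinstance(g, str):          # a gene-tree leaf is labelled by its species
            return g
        left, right = mapping(g[0]), mapping(g[1])
        m = lca(tree, left, right)
        if m in (left, right):
            dups += 1
        return m

    mapping(gene)
    return dups

def best_displayed_tree(gene, parents):
    """Exhaustively score all displayed trees; exponential in the number of reticulations."""
    best = min(displayed_trees(parents), key=lambda t: duplication_cost(gene, t))
    return duplication_cost(gene, best), best

# Hypothetical tree-child network: leaves a, b, c; reticulation h with parents x and y.
network = {"x": ["r"], "y": ["r"], "a": ["x"], "c": ["y"], "h": ["x", "y"], "b": ["h"]}
gene = (("a", "b"), "c")
cost, tree = best_displayed_tree(gene, network)
```

The sketch visits 2^r displayed trees for r binary reticulations, which is the cost that the lower-bound DP and conflict resolution described above are meant to avoid in practice. Nodes left with a single child after an edge is removed are not suppressed here, because they can never be the image of the LCA mapping.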

Keywords