Algorithms for Molecular Biology (Dec 2021)

Testing the agreement of trees with internal labels

  • David Fernández-Baca,
  • Lei Liu

DOI
https://doi.org/10.1186/s13015-021-00201-9
Journal volume & issue
Vol. 16, no. 1
pp. 1 – 17

Abstract

Read online

Abstract Background A semi-labeled tree is a tree where all leaves as well as, possibly, some internal nodes are labeled with taxa. Semi-labeled trees encompass ordinary phylogenetic trees and taxonomies. Suppose we are given a collection $${\mathcal {P}}= \{{\mathcal {T}}_1, {\mathcal {T}}_2, \ldots , {\mathcal {T}}_k\}$$ P = { T 1 , T 2 , … , T k } of semi-labeled trees, called input trees, over partially overlapping sets of taxa. The agreement problem asks whether there exists a tree $${\mathcal {T}}$$ T , called an agreement tree, whose taxon set is the union of the taxon sets of the input trees such that the restriction of $${\mathcal {T}}$$ T to the taxon set of $${\mathcal {T}}_i$$ T i is isomorphic to $${\mathcal {T}}_i$$ T i , for each $$i \in \{1, 2, \ldots , k\}$$ i ∈ { 1 , 2 , … , k } . The agreement problems is a special case of the supertree problem, the problem of synthesizing a collection of phylogenetic trees with partially overlapping taxon sets into a single supertree that represents the information in the input trees. An obstacle to building large phylogenetic supertrees is the limited amount of taxonomic overlap among the phylogenetic studies from which the input trees are obtained. Incorporating taxonomies into supertree analyses can alleviate this issue. Results We give a $${\mathcal {O}}(n k (\sum _{i \in [k]} d_i + \log ^2(nk)))$$ O ( n k ( ∑ i ∈ [ k ] d i + log 2 ( n k ) ) ) algorithm for the agreement problem, where n is the total number of distinct taxa in $${\mathcal {P}}$$ P , k is the number of trees in $${\mathcal {P}}$$ P , and $$d_i$$ d i is the maximum number of children of a node in $${\mathcal {T}}_i$$ T i . Conclusion Our algorithm can aid in integrating taxonomies into supertree analyses. Our computational experience with the algorithm suggests that its performance in practice is much better than its worst-case bound indicates.

Keywords