Algorithms for Molecular Biology (Dec 2023)

New algorithms for structure informed genome rearrangement

  • Eden Ozeri,
  • Meirav Zehavi,
  • Michal Ziv-Ukelson

DOI
https://doi.org/10.1186/s13015-023-00239-x
Journal volume & issue
Vol. 18, no. 1
pp. 1 – 37

Abstract

Read online

Abstract We define two new computational problems in the domain of perfect genome rearrangements, and propose three algorithms to solve them. The rearrangement scenarios modeled by the problems consider Reversal and Block Interchange operations, and a PQ-tree is utilized to guide the allowed operations and to compute their weights. In the first problem, $$\mathsf {Constrained \ TreeToString \ Divergence}$$ Constrained TreeToString Divergence ( $$\textsf{CTTSD}{}$$ CTTSD ), we define the basic structure-informed rearrangement measure. Here, we assume that the gene order members of the gene cluster from which the PQ-tree is constructed are permutations. The PQ-tree representing the gene cluster is ordered such that the series of gene IDs spelled by its leaves is equivalent to that of the reference gene order. Then, a structure-informed genome rearrangement distance is computed between the ordered PQ-tree and the target gene order. The second problem, $$\mathsf {TreeToString \ Divergence}$$ TreeToString Divergence ( $$\textsf{TTSD}{}$$ TTSD ), generalizes $$\textsf{CTTSD}{}$$ CTTSD , where the gene order members are not necessarily permutations and the structure informed rearrangement measure is extended to also consider up to $$d_S$$ d S and $$d_T$$ d T gene insertion and deletion operations, respectively, when modelling the PQ-tree informed divergence process from the reference gene order to the target gene order. The first algorithm solves $$\textsf{CTTSD}{}$$ CTTSD in $$O(n \gamma ^2 \cdot (m_p \cdot 1.381^\gamma + m_q))$$ O ( n γ 2 · ( m p · 1 . 381 γ + m q ) ) time and $$O(n^2)$$ O ( n 2 ) space, where $$\gamma $$ γ is the maximum number of children of a node, n is the length of the string and the number of leaves in the tree, and $$m_p$$ m p and $$m_q$$ m q are the number of P-nodes and Q-nodes in the tree, respectively. If one of the penalties of $$\textsf{CTTSD}$$ CTTSD is 0, then the algorithm runs in $$O(n m \gamma ^2)$$ O ( n m γ 2 ) time and $$O(n^2)$$ O ( n 2 ) space. The second algorithm solves $$\textsf{TTSD}{}$$ TTSD in $$O(n^2 \gamma ^2 {d_T}^2 {d_S}^2\,m^2 (m_p \cdot 5^\gamma \gamma + m_q))$$ O ( n 2 γ 2 d T 2 d S 2 m 2 ( m p · 5 γ γ + m q ) ) time and $$O(d_T d_S m (m n + 5^\gamma ))$$ O ( d T d S m ( m n + 5 γ ) ) space, where $$\gamma $$ γ is the maximum number of children of a node, n is the length of the string, m is the number of leaves in the tree, $$m_p$$ m p and $$m_q$$ m q are the number of P-nodes and Q-nodes in the tree, respectively, and allowing up to $$d_T$$ d T deletions from the tree and up to $$d_S$$ d S deletions from the string. The third algorithm is intended to reduce the space complexity of the second algorithm. It solves a variant of the problem (where one of the penalties of $$\textsf{TTSD}$$ TTSD is 0) in $$O(n \gamma ^2 {d_T}^2 {d_S}^2\,m^2 (m_p \cdot 4^{\gamma }\gamma ^2n(d_T+d_S+m+n) + m_q))$$ O ( n γ 2 d T 2 d S 2 m 2 ( m p · 4 γ γ 2 n ( d T + d S + m + n ) + m q ) ) time and $$O(\gamma ^2 n m^2 d_T d_S (d_T+d_S+m+n))$$ O ( γ 2 n m 2 d T d S ( d T + d S + m + n ) ) space. The algorithm is implemented as a software tool, denoted MEM-Rearrange, and applied to the comparative and evolutionary analysis of 59 chromosomal gene clusters extracted from a dataset of 1487 prokaryotic genomes.

Keywords