Algorithms for Molecular Biology (May 2020)

Detecting transcriptomic structural variants in heterogeneous contexts via the Multiple Compatible Arrangements Problem

  • Yutong Qiu,
  • Cong Ma,
  • Han Xie,
  • Carl Kingsford

DOI
https://doi.org/10.1186/s13015-020-00170-5
Journal volume & issue
Vol. 15, no. 1
pp. 1 – 15

Abstract

Read online

Abstract Background Transcriptomic structural variants (TSVs)—large-scale transcriptome sequence change due to structural variation - are common in cancer. TSV detection from high-throughput sequencing data is a computationally challenging problem. Among all the confounding factors, sample heterogeneity, where each sample contains multiple distinct alleles, poses a critical obstacle to accurate TSV prediction. Results To improve TSV detection in heterogeneous RNA-seq samples, we introduce the Multiple Compatible Arrangements Problem (MCAP), which seeks k genome arrangements that maximize the number of reads that are concordant with at least one arrangement. This models a heterogeneous or diploid sample. We prove that MCAP is NP-complete and provide a $$\frac{1}{4}$$ 1 4 -approximation algorithm for $$k=1$$ k = 1 and a $$\frac{3}{4}$$ 3 4 -approximation algorithm for the diploid case ( $$k=2$$ k = 2 ) assuming an oracle for $$k=1$$ k = 1 . Combining these, we obtain a $$\frac{3}{16}$$ 3 16 -approximation algorithm for MCAP when $$k=2$$ k = 2 (without an oracle). We also present an integer linear programming formulation for general k. We characterize the conflict structures in the graph that require $$k>1$$ k > 1 alleles to satisfy read concordancy and show that such structures are prevalent. Conclusions We show that the solution to MCAP accurately addresses sample heterogeneity during TSV detection. Our algorithms have improved performance on TCGA cancer samples and cancer cell line samples compared to a TSV calling tool, SQUID. The software is available at https://github.com/Kingsford-Group/diploidsquid .

Keywords