G4mismatch: Deep neural networks to predict G-quadruplex propensity based on G4-seq data.

Mira Barshai; Barak Engel; Idan Haim; Yaron Orenstein

doi:10.1371/journal.pcbi.1010948

PLoS Computational Biology (Mar 2023)

G4mismatch: Deep neural networks to predict G-quadruplex propensity based on G4-seq data.

Mira Barshai,
Barak Engel,
Idan Haim,
Yaron Orenstein

Affiliations

Mira Barshai
Barak Engel
Idan Haim
Yaron Orenstein

DOI: https://doi.org/10.1371/journal.pcbi.1010948
Journal volume & issue: Vol. 19, no. 3
p. e1010948

Abstract

Read online

G-quadruplexes are non-B-DNA structures that form in the genome facilitated by Hoogsteen bonds between guanines in single or multiple strands of DNA. The functions of G-quadruplexes are linked to various molecular and disease phenotypes, and thus researchers are interested in measuring G-quadruplex formation genome-wide. Experimentally measuring G-quadruplexes is a long and laborious process. Computational prediction of G-quadruplex propensity from a given DNA sequence is thus a long-standing challenge. Unfortunately, despite the availability of high-throughput datasets measuring G-quadruplex propensity in the form of mismatch scores, extant methods to predict G-quadruplex formation either rely on small datasets or are based on domain-knowledge rules. We developed G4mismatch, a novel algorithm to accurately and efficiently predict G-quadruplex propensity for any genomic sequence. G4mismatch is based on a convolutional neural network trained on almost 400 millions human genomic loci measured in a single G4-seq experiment. When tested on sequences from a held-out chromosome, G4mismatch, the first method to predict mismatch scores genome-wide, achieved a Pearson correlation of over 0.8. When benchmarked on independent datasets derived from various animal species, G4mismatch trained on human data predicted G-quadruplex propensity genome-wide with high accuracy (Pearson correlations greater than 0.7). Moreover, when tested in detecting G-quadruplexes genome-wide using the predicted mismatch scores, G4mismatch achieved superior performance compared to extant methods. Last, we demonstrate the ability to deduce the mechanism behind G-quadruplex formation by unique visualization of the principles learned by the model.

Published in PLoS Computational Biology

ISSN: 1553-734X (Print); 1553-7358 (Online)
Publisher: Public Library of Science (PLoS)
Country of publisher: United States
LCC subjects: Science: Biology (General)
Website: https://journals.plos.org/ploscompbiol/

About the journal