Scientific Reports (Aug 2024)
CircCNNs, a convolutional neural network framework to better understand the biogenesis of exonic circRNAs
Abstract
Circular RNAs (circRNAs) have been extensively explored as biomarkers for cancer detection; however, their biogenesis mechanism remains elusive. In contrast to linear splicing (LS), which produces linear transcripts, the so-called back splicing (BS) process has been proposed to explain circRNA formation. To investigate the potential mechanism of BS with a machine learning approach, we curated a high-quality dataset of BS and LS exon pairs using stringent, evidence-based filtering. Two convolutional neural network (CNN) base models with different structures for processing splice junction sequences, including motif extraction, were created and compared after extensive hyperparameter tuning. In contrast to the previous study, we were able to identify motifs corresponding to well-established BS-associated genes such as MBNL1, QKI, and ESRP2. Importantly, despite the high false positive rates prevalent in existing circRNA detection pipelines and databases, our base models demonstrated notably high specificity (greater than 90%). To further improve model performance, a novel fast numerical method was proposed and implemented to calculate the reverse complementary matches (RCMs) crossing the two flanking regions and within each flanking region of exon pairs. Our CircCNNs framework, which incorporates RCM information into the optimal base models, further reduced the false positive rate, leading to 88% prediction accuracy.
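To make the notion of reverse complementary matches concrete, the following is a minimal Python sketch that counts RCMs between two flanking sequences and within a single flank. It is not the fast numerical method proposed in the paper, whose details the abstract does not give; it is a plain k-mer counting approach, and the function names (`count_rcm_across`, `count_rcm_within`) and the seed length `k = 5` are assumptions chosen only for illustration.

```python
from collections import Counter

# Translation table for DNA complements (assumes an A/C/G/T alphabet).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_VALID = set("ACGT")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every k-mer in seq, skipping k-mers with non-ACGT symbols."""
    seq = seq.upper()
    kmers = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return Counter(kmer for kmer in kmers if set(kmer) <= _VALID)


def count_rcm_across(upstream: str, downstream: str, k: int = 5) -> int:
    """Count position pairs (i, j) where the k-mer at i in the upstream
    flank is the reverse complement of the k-mer at j in the downstream
    flank. k = 5 is an illustrative choice, not a value from the paper."""
    up = kmer_counts(upstream, k)
    # The k-mers of the reverse-complemented flank are exactly the
    # reverse complements of the k-mers of the original flank.
    down_rc = kmer_counts(reverse_complement(downstream), k)
    return sum(n * down_rc[kmer] for kmer, n in up.items())


def count_rcm_within(flank: str, k: int = 5) -> int:
    """Count RCMs inside one flank. Pairs are ordered, so each unordered
    pair (i, j) with i != j contributes twice."""
    return count_rcm_across(flank, flank, k)
```

Under these assumptions, the two crossing counts and the two within-flank counts for an exon pair could be appended as extra numeric features alongside the junction sequences; how the paper actually encodes and integrates RCM information is not specified in the abstract.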