Frontiers in Genetics (Jun 2025)

HyenaCircle: a HyenaDNA-based pretrained large language model for long eccDNA prediction

  • Fuyu Li,
  • Wenxiang Lu,
  • Yunfei Bai

DOI
https://doi.org/10.3389/fgene.2025.1641162
Journal volume & issue
Vol. 16

Abstract

Read online

IntroductionExtrachromosomal circular DNA (eccDNA) represents a class of circular DNA molecules derived from chromosomes with diverse roles in disease. Long eccDNAs (typically 1–5 kb) pose detection challenges due to their large size, hindering functional studies. We propose HyenaCircle, a novel deep learning model leveraging large language model and third-generation sequencing data to predict long eccDNA formation.MethodsFull-length eccDNAs within 1–5 kb were identified by FLED algorithm for Nanopore sequencing data, extended by 100-bp flanking sequences, and paired with 20,000 length-matched negative controls from eccDNA-depleted genomic regions. HyenaCircle was built by adapting the pretrained HyenaDNA model with a designed classifier head. The strategies of data augmentation, regularization and class imbalance weighting were applied to increase model robustness.ResultsHyenaCircle achieved comparable performance with a validation AUROC of 0.715 and recall of 0.776. It surpassed DNABERT by 5.9% in AUROC and demonstrated stable convergence. Hyperparameter optimization confirmed batch size 16 and learning rate 5 × 10−5 as optimal. The ablation studies revealed flanking sequences are important, as their removal reduced model stability. The model also showed superior stability over the baseline HyenaDNA architecture.ConclusionHyenaCircle integrated third-generation sequencing data and large language model for long eccDNA prediction, which outperformed the existing model. Our work demonstrates that the HyenaDNA architecture enables effective long-sequence genomic modeling and provides a new insight for eccDNA prediction and identification.

Keywords