BioMedInformatics (Jun 2024)

Advancing DNA Language Models through Motif-Oriented Pre-Training with MoDNA

  • Weizhi An,
  • Yuzhi Guo,
  • Yatao Bian,
  • Hehuan Ma,
  • Jinyu Yang,
  • Chunyuan Li,
  • Junzhou Huang

DOI
https://doi.org/10.3390/biomedinformatics4020085
Journal volume & issue
Vol. 4, no. 2
pp. 1556–1571

Abstract

Acquiring meaningful representations of gene expression is essential for accurate prediction on downstream regulatory tasks, such as identifying promoters and transcription factor binding sites. However, the current reliance on supervised learning, constrained by the limited availability of labeled genomic data, impedes the development of robust predictive models with broad generalization. In response, recent work has pivoted towards self-supervised training for DNA sequence modeling, enabling pre-trained genomic representations to be adapted to a variety of downstream tasks. Rather than straightforwardly applying masked language modeling to DNA sequences, approaches such as MoDNA enrich genome language modeling with prior biological knowledge. In this study, we advance DNA language models with the Motif-oriented DNA (MoDNA) pre-training framework, which performs self-supervised learning at the pre-training stage and is flexible enough to be applied across different downstream tasks. MoDNA distinguishes itself by efficiently learning semantic-level genomic representations from an extensive corpus of unlabeled genome data, offering a significant improvement in computational efficiency over previous approaches. The framework is pre-trained on a comprehensive human genome dataset and fine-tuned for targeted downstream tasks. Our analysis and evaluation on promoter prediction and transcription factor binding site prediction further validate MoDNA's capabilities, underscoring its contribution to genomic predictive modeling.
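To make the motif-oriented pre-training idea concrete, the sketch below illustrates one plausible reading of it in Python: DNA sequences are tokenized into overlapping k-mers, and tokens matching known motifs are masked at a higher rate than ordinary tokens before masked-language-model training. This is a minimal sketch under stated assumptions, not the paper's implementation: the motif list, the function names (kmer_tokenize, motif_oriented_mask), and the masking rates are hypothetical, and MoDNA's actual objectives and use of motif knowledge may differ.

```python
import random

# Hypothetical motif consensus strings; a real pipeline would draw motifs
# (e.g., position weight matrices) from a database such as JASPAR.
MOTIFS = {"TATAAA", "CACGTG", "GGGCGG"}

def kmer_tokenize(seq: str, k: int = 6) -> list[str]:
    """Split a DNA sequence into overlapping k-mer tokens (stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def motif_oriented_mask(tokens, motifs=MOTIFS, base_rate=0.15,
                        motif_rate=0.8, seed=0):
    """Mask motif-matching tokens with high probability and the rest at
    the usual MLM rate; return (masked_tokens, reconstruction_labels)."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        rate = motif_rate if tok in motifs else base_rate
        if rng.random() < rate:
            masked.append("[MASK]")
            labels.append(tok)    # the model must reconstruct this k-mer
        else:
            masked.append(tok)
            labels.append(None)   # position ignored by the MLM loss
    return masked, labels

# Toy sequence containing a TATA box and an E-box.
seq = "GGCTATAAAGGCACGTGTTA"
masked, labels = motif_oriented_mask(kmer_tokenize(seq))
print(masked)
```

In a full pipeline, the masked token sequence would feed a transformer encoder pre-trained to recover the labels, and the resulting representations would then be fine-tuned for tasks such as promoter or transcription factor binding site prediction, as the abstract describes.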

Keywords