A self-supervised deep learning method for data-efficient training in genomics

Hüseyin Anil Gündüz; Martin Binder; Xiao-Yin To; René Mreches; Bernd Bischl; Alice C. McHardy; Philipp C. Münch; Mina Rezaei

doi:10.1038/s42003-023-05310-2

Communications Biology (Sep 2023)

Affiliations

Hüseyin Anil Gündüz: Department of Statistics, LMU Munich
Martin Binder: Department of Statistics, LMU Munich
Xiao-Yin To: Department of Statistics, LMU Munich
René Mreches: Department for Computational Biology of Infection Research, Helmholtz Center for Infection Research
Bernd Bischl: Department of Statistics, LMU Munich
Alice C. McHardy: Department for Computational Biology of Infection Research, Helmholtz Center for Infection Research
Philipp C. Münch: Department for Computational Biology of Infection Research, Helmholtz Center for Infection Research
Mina Rezaei: Department of Statistics, LMU Munich

DOI: https://doi.org/10.1038/s42003-023-05310-2
Journal volume & issue: Vol. 6, no. 1, pp. 1–12

Abstract

Deep learning in bioinformatics is often limited to problems where extensive amounts of labeled data are available for supervised classification. By exploiting unlabeled data, self-supervised learning techniques can improve the performance of machine learning models in the presence of limited labeled data. Although many self-supervised learning methods have been suggested before, they have failed to exploit the unique characteristics of genomic data. Therefore, we introduce Self-GenomeNet, a self-supervised learning technique that is custom-tailored for genomic data. Self-GenomeNet leverages reverse-complement sequences and effectively learns short- and long-term dependencies by predicting targets of different lengths. Self-GenomeNet performs better than other self-supervised methods in data-scarce genomic tasks and outperforms standard supervised training with ~10 times fewer labeled training data. Furthermore, the learned representations generalize well to new datasets and tasks. These findings suggest that Self-GenomeNet is well suited for large-scale, unlabeled genomic datasets and could substantially improve the performance of genomic models.
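
The abstract mentions that the method leverages reverse-complement sequences. As a minimal illustration of that one concept only, the sketch below computes the reverse complement of a DNA string in Python. It does not reproduce Self-GenomeNet's architecture or training procedure, which the abstract does not specify; the function name `reverse_complement` and the example sequence are assumptions made for illustration.

```python
# Illustrative sketch only: computes the reverse complement of a DNA string.
# This is NOT the Self-GenomeNet implementation; names here are hypothetical.

# Map each nucleotide to its Watson-Crick complement.
# N (an unknown base) is assumed to map to itself.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    # Complement every base, then reverse the order of the bases.
    return seq.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    example = "ATGCGT"
    print(reverse_complement(example))  # prints ACGCAT
```

Applying the function twice returns the original sequence, which reflects that a sequence and its reverse complement describe the same double-stranded molecule read from opposite strands.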

Published in Communications Biology

ISSN: 2399-3642 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Science: Biology (General)
Website: https://www.nature.com/commsbio/
