Effective training of nanopore callers for epigenetic marks with limited labelled data

Brian Yao; Chloe Hsu; Gal Goldner; Yael Michaeli; Yuval Ebenstein; Jennifer Listgarten

doi:10.1098/rsob.230449

Open Biology (Jun 2024)

Affiliations

Brian Yao: Department of Electrical Engineering & Computer Sciences, University of California, Berkeley, CA 94720, USA
Chloe Hsu: Department of Electrical Engineering & Computer Sciences, University of California, Berkeley, CA 94720, USA
Gal Goldner: Department of Chemical Physics, Tel Aviv University, Tel Aviv-Yafo, Israel
Yael Michaeli: Department of Chemical Physics, Tel Aviv University, Tel Aviv-Yafo, Israel
Yuval Ebenstein: Department of Chemical Physics, Tel Aviv University, Tel Aviv-Yafo, Israel
Jennifer Listgarten: Department of Electrical Engineering & Computer Sciences, University of California, Berkeley, CA 94720, USA

DOI: https://doi.org/10.1098/rsob.230449
Journal volume & issue: Vol. 14, no. 6

Abstract

Nanopore sequencing platforms combined with supervised machine learning (ML) have been effective at detecting base modifications in DNA such as 5-methylcytosine (5mC) and N6-methyladenine (6mA). These ML-based nanopore callers have typically been trained on data that span all modifications on all possible DNA k-mer backgrounds, that is, on a complete training dataset. However, as nanopore technology is pushed to more and more epigenetic modifications, such complete training data will not be feasible to obtain. Nanopore calling has historically been performed with hidden Markov models (HMMs) that cannot make successful calls for k-mer contexts not seen during training because of their independent emission distributions. However, deep neural networks (DNNs), which share parameters across contexts, are increasingly being used as callers, often outperforming their HMM cousins. It stands to reason that a DNN approach should be able to better generalize to unseen k-mer contexts. Indeed, herein we demonstrate that a common DNN approach (DeepSignal) outperforms a common HMM approach (Nanopolish) in the incomplete data setting. Furthermore, we propose a novel hybrid HMM–DNN approach, amortized-HMM, that outperforms both the pure HMM and DNN approaches on 5mC calling when the training data are incomplete. This type of approach is expected to be useful for calling other base modifications such as 5-hydroxymethylcytosine and for the simultaneous calling of different modifications, settings in which complete training data are not likely to be available.
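
The generalization argument in the abstract can be illustrated with a toy sketch. It is not DeepSignal, Nanopolish, or the amortized-HMM method. It assumes, purely for illustration, that the mean signal for a 3-mer is a noise-free sum of per-position base contributions. All names (`fit_table`, `fit_shared`, `truth`) and the held-out split are hypothetical. A per-k-mer lookup table, standing in for independent emission distributions, has nothing to say about contexts absent from training, whereas a model whose parameters are shared across contexts can still predict them.

```python
import itertools
import random

BASES = "ACGT"
K = 3  # toy context length, chosen only to keep the example small


def one_hot(kmer):
    """Concatenate a one-hot encoding of the base at each k-mer position."""
    vec = [0.0] * (len(kmer) * len(BASES))
    for pos, base in enumerate(kmer):
        vec[pos * len(BASES) + BASES.index(base)] = 1.0
    return vec


def predict_shared(w, kmer):
    """Linear prediction from parameters shared across all k-mers."""
    return sum(wi * xi for wi, xi in zip(w, one_hot(kmer)))


def fit_table(train):
    """Independent mean per k-mer: one parameter per seen context."""
    return dict(train)


def fit_shared(train, steps=500, lr=0.5):
    """Fit the shared linear model by plain gradient descent on squared error."""
    w = [0.0] * (K * len(BASES))
    for _ in range(steps):
        grad = [0.0] * len(w)
        for kmer, y in train:
            x = one_hot(kmer)
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for i, xi in enumerate(x):
                grad[i] += err * xi
        w = [wi - lr * g / len(train) for wi, g in zip(w, grad)]
    return w


# Hypothetical ground truth: additive per-position contributions (an assumption).
rng = random.Random(0)
true_w = [rng.uniform(-1.0, 1.0) for _ in range(K * len(BASES))]


def truth(kmer):
    return predict_shared(true_w, kmer)


all_kmers = ["".join(p) for p in itertools.product(BASES, repeat=K)]
# Hold out 16 of the 64 contexts; every base still appears at every position in training.
held_out = [k for k in all_kmers if sum(BASES.index(b) for b in k) % 4 == 0]
train = [(k, truth(k)) for k in all_kmers if k not in held_out]

table = fit_table(train)
shared_w = fit_shared(train)

for kmer in held_out[:3]:
    # The table returns None for unseen contexts; the shared model returns a value.
    print(kmer, table.get(kmer), round(predict_shared(shared_w, kmer), 3), round(truth(kmer), 3))
```

The shared model recovers the held-out contexts here only because the toy signal is exactly additive and noise-free. Real nanopore signals are not, which is why the abstract's claim rests on empirical comparison rather than on an argument like this one.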

Published in Open Biology

ISSN: 2046-2441 (Online)
Publisher: The Royal Society
Country of publisher: United Kingdom
LCC subjects: Science: Biology (General)
Website: https://royalsocietypublishing.org/journal/rsob
