Computational and Structural Biotechnology Journal (Dec 2024)
Robust multi-read reconstruction from noisy clusters using deep neural network for DNA storage
Abstract
DNA holds immense potential as an emerging data storage medium. However, information recovery in DNA storage systems is hampered by errors that are inevitably introduced during synthesis, amplification, sequencing, and storage, including insertion, deletion, and substitution (IDS) errors, strand breaks, and rearrangements. Sequence reconstruction, a crucial step in decoding, infers the DNA reference from a cluster of erroneous copies. Most methods assume that all reads within a cluster are noisy copies of the same reference and contribute equally, and thus overlook contaminated sequences caused by DNA breaks, rearrangements, or mis-clustered reads. To address this issue, we propose RobuSeqNet, a robust multi-read reconstruction neural network designed to accommodate noisy clusters containing strand breaks, rearrangements, and mis-clustered strands. Leveraging the attention mechanism and an elaborate network design, RobuSeqNet is resilient to highly noisy clusters and effectively handles in-strand IDS errors. The effectiveness and robustness of the proposed method are validated on three representative next-generation sequencing datasets. Results demonstrate that RobuSeqNet maintains high sequence reconstruction success rates of 99.74%, 99.58%, and 96.44% across the three datasets, even when noisy clusters contain up to 20% contaminated sequences, outperforming known sequence reconstruction models. In scenarios without contaminated sequences, it performs comparably to existing models, achieving success rates of 99.88%, 99.82%, and 97.68% across the three datasets.
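To make the equal-contribution problem concrete, the following is a minimal toy sketch, not RobuSeqNet itself. It compares a per-position majority vote in which every read counts equally against a vote that down-weights reads disagreeing with a draft consensus. It assumes equal-length reads with substitution errors only (no insertions or deletions). The function names, the `min_agreement` cutoff, and the example sequences are hypothetical, chosen for illustration. The hand-written agreement weights only loosely stand in for the idea of attending less to contaminated reads; the paper's model learns its weighting with a neural network.

```python
from collections import Counter


def weighted_vote(reads, weights=None):
    """Per-position weighted vote over equal-length reads."""
    if weights is None:
        weights = [1.0] * len(reads)  # every read contributes equally
    consensus = []
    for column in zip(*reads):
        tally = Counter()
        for base, weight in zip(column, weights):
            tally[base] += weight
        # Sort first so ties break alphabetically and the result is deterministic.
        consensus.append(max(sorted(tally), key=tally.get))
    return "".join(consensus)


def agreement_weights(reads, draft, min_agreement=0.5):
    """Weight each read by the fraction of positions matching the draft.

    Reads below min_agreement (a hypothetical cutoff) are treated as
    contaminated and receive weight 0.
    """
    weights = []
    for read in reads:
        agreement = sum(a == b for a, b in zip(read, draft)) / len(draft)
        weights.append(agreement if agreement >= min_agreement else 0.0)
    return weights


reference = "ACGTACGTAC"
cluster = [
    "ACGTACGTAC",  # clean copy
    "ACCTACGTAC",  # one substitution (position 2)
    "ACGTACGAAC",  # one substitution (position 7)
    "TTCAGGCATG",  # contaminated read (unrelated sequence)
    "TTCAGGCATG",  # contaminated read (unrelated sequence)
]

draft = weighted_vote(cluster)  # equal contribution from all reads
robust = weighted_vote(cluster, agreement_weights(cluster, draft))

print(draft == reference, robust == reference)  # prints: False True
```

In this toy cluster, 40% of the reads are contaminated. Wherever a genuine read carries a substitution that happens to match the contaminated reads, the equal-weight vote is tipped to the wrong base. Re-voting with agreement-based weights recovers the reference.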