Efficient and low-complexity variable-to-variable length coding for DNA storage

Yunfei Gao; Albert No

doi:10.1186/s12859-024-05943-y

BMC Bioinformatics (Oct 2024)

Affiliations

Yunfei Gao: SJTU-Ruijing-UIH Institute for Medical Imaging Technology, Ruijin Hospital, Shanghai Jiaotong University School of Medicine
Albert No: Department of Artificial Intelligence, Yonsei University

Journal volume & issue: Vol. 25, no. 1, pp. 1–14

Abstract

Background: Efficient DNA-based storage systems offer substantial capacity and longevity at reduced costs, addressing anticipated data growth. However, encoding data into DNA sequences is limited by two key constraints: 1) at most $h$ consecutive identical bases (homopolymer constraint $h$), and 2) a GC ratio within $[0.5 - c_{GC}, 0.5 + c_{GC}]$ (GC content constraint $c_{GC}$). Sequencing or synthesis errors tend to increase when these constraints are violated.

Results: In this research, we address a pure source coding problem in the context of DNA storage, considering both homopolymer and GC content constraints. We introduce a novel coding technique that adheres to these constraints while maintaining linear complexity as the block length increases and achieving near-optimal rates. We demonstrate the effectiveness of the proposed method through experiments on both randomly generated data and existing files. For example, when $h = 4$ and $c_{GC} = 0.05$, the rate reached 1.988, close to the theoretical limit of 1.990. The associated code can be accessed at GitHub.

Conclusion: We propose a variable-to-variable-length encoding method that does not rely on concatenating short predefined sequences and that achieves near-optimal rates.
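The two constraints stated in the abstract can be made concrete with a small checker. The Python sketch below is only an illustration of the constraint definitions, not the authors' encoder. The function names (`max_homopolymer_run`, `gc_ratio`, `satisfies_constraints`) and the floating-point tolerance are assumptions introduced here.

```python
def max_homopolymer_run(seq: str) -> int:
    """Length of the longest run of identical consecutive bases in seq."""
    longest = 0
    current = 0
    prev = None
    for base in seq:
        current = current + 1 if base == prev else 1
        longest = max(longest, current)
        prev = base
    return longest


def gc_ratio(seq: str) -> float:
    """Fraction of bases in seq that are G or C."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    return sum(base in "GC" for base in seq) / len(seq)


def satisfies_constraints(seq: str, h: int, c_gc: float) -> bool:
    """True if seq has no run longer than h and its GC ratio lies
    within [0.5 - c_gc, 0.5 + c_gc].

    The small tolerance (an assumption of this sketch) guards against
    floating-point rounding at the interval boundary.
    """
    within_run_limit = max_homopolymer_run(seq) <= h
    within_gc_band = abs(gc_ratio(seq) - 0.5) <= c_gc + 1e-12
    return within_run_limit and within_gc_band
```

With the abstract's example parameters, `satisfies_constraints("ACGTACGT", h=4, c_gc=0.05)` holds, whereas a sequence such as `"AAAAACGT"` fails because it contains a run of five identical bases.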

Published in BMC Bioinformatics

ISSN: 1471-2105 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Biology (General)
Website: http://www.biomedcentral.com/bmcbioinformatics/
