BMC Bioinformatics (Oct 2024)
Efficient and low-complexity variable-to-variable length coding for DNA storage
Abstract
Background
Efficient DNA-based storage systems offer substantial capacity and longevity at reduced costs, addressing anticipated data growth. However, encoding data into DNA sequences is limited by two key constraints: 1) a maximum of $$h$$ consecutive identical bases (homopolymer constraint $$h$$), and 2) a GC ratio within $$[0.5 - c_{GC}, 0.5 + c_{GC}]$$ (GC content constraint $$c_{GC}$$). Sequencing or synthesis errors tend to increase when these constraints are violated.

Results
In this research, we address a pure source coding problem in the context of DNA storage, considering both homopolymer and GC content constraints. We introduce a novel coding technique that adheres to these constraints while maintaining linear complexity for increased block lengths and achieving near-optimal rates. We demonstrate the effectiveness of the proposed method through experiments on both randomly generated data and existing files. For example, when $$h = 4$$ and $$c_{GC} = 0.05$$, the rate reached 1.988, close to the theoretical limit of 1.990. The associated code can be accessed at GitHub.

Conclusion
We propose a variable-to-variable-length encoding method that does not rely on concatenating short predefined sequences and that achieves near-optimal rates.
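To make the two constraints concrete, the following is a minimal sketch of a checker for a candidate DNA sequence. It is not the encoding method proposed in the paper. The function names (`max_homopolymer_run`, `gc_ratio`, `satisfies_constraints`) and the small floating-point tolerance are assumptions introduced purely for illustration, and the defaults $$h = 4$$ and $$c_{GC} = 0.05$$ are taken from the example in the abstract.

```python
def max_homopolymer_run(seq: str) -> int:
    """Return the length of the longest run of identical consecutive bases."""
    longest = 0
    run = 0
    prev = None
    for base in seq:
        # Extend the current run if the base repeats, otherwise start a new run.
        run = run + 1 if base == prev else 1
        longest = max(longest, run)
        prev = base
    return longest


def gc_ratio(seq: str) -> float:
    """Return the fraction of bases in seq that are G or C."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    return sum(base in "GC" for base in seq) / len(seq)


def satisfies_constraints(seq: str, h: int = 4, c_gc: float = 0.05) -> bool:
    """Check the homopolymer constraint h and the GC content constraint c_gc.

    The 1e-12 tolerance is an illustrative assumption so that ratios lying
    exactly on the interval boundary are not rejected by rounding error.
    """
    run_ok = max_homopolymer_run(seq) <= h
    gc_ok = abs(gc_ratio(seq) - 0.5) <= c_gc + 1e-12
    return run_ok and gc_ok
```

For instance, `satisfies_constraints("ACGTACGTAC")` holds (GC ratio 0.5, longest run 1), whereas `"AAAAACGCGC"` fails the homopolymer constraint for $$h = 4$$ because it contains a run of five A bases.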
Keywords