BMC Bioinformatics (Oct 2024)

Efficient and low-complexity variable-to-variable length coding for DNA storage

  • Yunfei Gao,
  • Albert No

DOI
https://doi.org/10.1186/s12859-024-05943-y
Journal volume & issue
Vol. 25, no. 1
pp. 1 – 14

Abstract

Background: Efficient DNA-based storage systems offer substantial capacity and longevity at reduced costs, addressing anticipated data growth. However, encoding data into DNA sequences is limited by two key constraints: 1) a maximum of $h$ consecutive identical bases (homopolymer constraint $h$), and 2) a GC ratio within $[0.5 - c_{GC}, 0.5 + c_{GC}]$ (GC content constraint $c_{GC}$). Sequencing or synthesis errors tend to increase when these constraints are violated.

Results: In this research, we address a pure source coding problem in the context of DNA storage, considering both homopolymer and GC content constraints. We introduce a novel coding technique that adheres to these constraints while maintaining linear complexity for increased block lengths and achieving near-optimal rates. We demonstrate the effectiveness of the proposed method through experiments on both randomly generated data and existing files. For example, when $h = 4$ and $c_{GC} = 0.05$, the rate reached 1.988, close to the theoretical limit of 1.990. The associated code can be accessed at GitHub.

Conclusion: We propose a variable-to-variable-length encoding method that does not rely on concatenating short predefined sequences, which achieves near-optimal rates.
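To make the two constraints concrete, the following minimal sketch checks whether a given DNA string satisfies a homopolymer limit h and a GC content tolerance c_GC. It is only an illustration of the constraint definitions stated in the abstract, not the authors' encoder; the function names (`max_homopolymer_run`, `gc_ratio`, `satisfies_constraints`) and the small floating-point tolerance are assumptions introduced here.

```python
def max_homopolymer_run(seq: str) -> int:
    """Length of the longest run of identical consecutive bases in seq."""
    longest = 0
    current = 0
    prev = None
    for base in seq:
        # Extend the current run if the base repeats, otherwise start a new run.
        current = current + 1 if base == prev else 1
        longest = max(longest, current)
        prev = base
    return longest


def gc_ratio(seq: str) -> float:
    """Fraction of bases in seq that are G or C."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    return sum(base in "GC" for base in seq) / len(seq)


def satisfies_constraints(seq: str, h: int, c_gc: float) -> bool:
    """True if seq has no run longer than h and its GC ratio lies in
    [0.5 - c_gc, 0.5 + c_gc].

    The 1e-9 slack is an assumption of this sketch; it only guards against
    floating-point rounding at the interval boundaries.
    """
    run_ok = max_homopolymer_run(seq) <= h
    gc_ok = abs(gc_ratio(seq) - 0.5) <= c_gc + 1e-9
    return run_ok and gc_ok


if __name__ == "__main__":
    # Example parameters taken from the abstract: h = 4, c_GC = 0.05.
    for candidate in ("ACGTACGT", "AAAAACGC", "ATATATGC"):
        print(candidate, satisfies_constraints(candidate, h=4, c_gc=0.05))
```

A checker like this is useful mainly for validating the output of an encoder; it does not itself produce constrained sequences.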

Keywords