IEEE Access (Jan 2022)
HADC: A Hybrid Compression Approach for DNA Sequences
Abstract
In the blossoming age of Next Generation Sequencing (NGS) technologies, genome sequencing has become much easier and more affordable. The large number of enormous genomic sequences obtained demands huge storage space so that they can be kept for analysis. Since storage cost has become an impediment for biologists, there is a constant need for software that compresses genomic sequences efficiently. Most general-purpose compression algorithms do not exploit the inherent redundancies in genomic sequences, which explains the success and popularity of reference-based compression algorithms. In this research, a new reference-based lossless compression technique is proposed for deoxyribonucleic acid (DNA) sequences stored in FASTA format; the technique can act as a layer above gzip compression. Several experiments were performed to evaluate it, and the results show that it obtains promising compression ratios, saving up to 99.9% of storage space and reaching a gain of 80% for some plant genomes. The proposed technique also performs compression in acceptable time, saving more than 50% of the time taken by ERGC in most experiments.
Keywords