SparkGC: Spark based genome compression for large collections of genomes

Haichang Yao; Guangyong Hu; Shangdong Liu; Houzhi Fang; Yimu Ji

doi:10.1186/s12859-022-04825-5

BMC Bioinformatics (Jul 2022)

Affiliations

Haichang Yao: School of Computer and Software, Nanjing Vocational University of Industry Technology
Guangyong Hu: School of Computer and Software, Nanjing Vocational University of Industry Technology
Shangdong Liu: School of Computer Science, Nanjing University of Posts and Telecommunications
Houzhi Fang: School of Computer Science, Nanjing University of Posts and Telecommunications
Yimu Ji: School of Computer Science, Nanjing University of Posts and Telecommunications

DOI: https://doi.org/10.1186/s12859-022-04825-5
Journal volume & issue: Vol. 23, no. 1, pp. 1–21

Abstract

Since the completion of the Human Genome Project at the turn of the century, there has been an unprecedented proliferation of sequencing data. As a consequence, it has become extremely difficult to store, back up, and migrate the enormous volume of genomic data, which continues to grow as the cost of sequencing decreases. A much more efficient and scalable genome compression program is therefore urgently required. In this manuscript, we propose a new Apache Spark based genome compression method called SparkGC that runs efficiently and cost-effectively on a scalable computational cluster to compress large collections of genomes. SparkGC uses Spark’s in-memory computation capabilities to reduce compression time by keeping data active in memory between the first-order and second-order compression. The evaluation shows that SparkGC achieves a compression ratio at least 30% better than the best state-of-the-art methods. Even on a single worker node, its compression speed is at least 3.8 times that of the best state-of-the-art methods, and it scales well with the number of nodes. SparkGC is of significant benefit to genomic data storage and transmission. The source code of SparkGC is publicly available at https://github.com/haichangyao/SparkGC .
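The abstract mentions first-order and second-order compression without describing them. The Python sketch below is one plausible reading, not SparkGC's actual algorithm: the first pass encodes each genome against a reference as matches and literal bases, and the second pass encodes one genome's first-pass output against another's. All function names, the tiny k-mer length `K`, and the toy sequences are assumptions made for illustration. The first-pass results stay in ordinary in-memory lists between the two passes, loosely mirroring the idea of keeping intermediate data in memory.

```python
from typing import Dict, List, Tuple, Union

# A first-order token is either a (reference_position, length) match or a literal base.
Token = Union[Tuple[int, int], str]

K = 4  # hypothetical k-mer length, kept tiny for the toy sequences below


def build_index(ref: str, k: int = K) -> Dict[str, int]:
    """Map every k-mer of the reference to its first position."""
    index: Dict[str, int] = {}
    for i in range(len(ref) - k + 1):
        index.setdefault(ref[i:i + k], i)
    return index


def first_order(target: str, ref: str, index: Dict[str, int], k: int = K) -> List[Token]:
    """Greedily encode the target as reference matches plus literal bases."""
    tokens: List[Token] = []
    i = 0
    while i < len(target):
        pos = index.get(target[i:i + k])
        if pos is None:  # no k-mer hit: emit a single literal base
            tokens.append(target[i])
            i += 1
            continue
        length = k
        while (i + length < len(target) and pos + length < len(ref)
               and target[i + length] == ref[pos + length]):
            length += 1  # extend the match as far as it goes
        tokens.append((pos, length))
        i += length
    return tokens


def decode(tokens: List[Token], ref: str) -> str:
    """Rebuild a genome from its first-order tokens."""
    return "".join(ref[t[0]:t[0] + t[1]] if isinstance(t, tuple) else t
                   for t in tokens)


def second_order(tokens: List[Token], prior: List[Token]) -> list:
    """Encode a token list as copies of runs from another genome's token list."""
    first_seen: Dict[Token, int] = {}
    for j, tok in enumerate(prior):
        first_seen.setdefault(tok, j)
    out = []
    i = 0
    while i < len(tokens):
        j = first_seen.get(tokens[i])
        if j is None:  # token never seen in the prior genome
            out.append(("raw", tokens[i]))
            i += 1
            continue
        run = 1
        while (i + run < len(tokens) and j + run < len(prior)
               and tokens[i + run] == prior[j + run]):
            run += 1
        out.append(("copy", j, run))
        i += run
    return out


def undo_second_order(items: list, prior: List[Token]) -> List[Token]:
    """Invert second_order given the same prior token list."""
    tokens: List[Token] = []
    for item in items:
        if item[0] == "copy":
            tokens.extend(prior[item[1]:item[1] + item[2]])
        else:
            tokens.append(item[1])
    return tokens


# Toy data: two genomes that differ from the reference by one or two bases.
ref = "ACGTTGCATGCCGATAGCTAGGCTTAACG"
genomes = [ref[:10] + "T" + ref[11:],
           ref[:10] + "T" + ref[11:20] + "A" + ref[21:]]

index = build_index(ref)
first = [first_order(g, ref, index) for g in genomes]  # held in memory for pass two
second = second_order(first[1], first[0])
```

Both passes are lossless in this sketch: `decode` and `undo_second_order` recover the original inputs exactly. In a Spark setting the first-pass output would presumably live in a cached distributed dataset rather than a local list, but that part is not modelled here.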

Published in BMC Bioinformatics

ISSN: 1471-2105 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Biology (General)
Website: http://www.biomedcentral.com/bmcbioinformatics/
