Tackling the Challenges of FASTQ Referential Compression

Aníbal Guerra; Jaime Lotero; José Édinson Aedo; Sebastián Isaza

doi:10.1177/1177932218821373

Bioinformatics and Biology Insights (Feb 2019)

Affiliations

Aníbal Guerra: Facultad de Ingeniería, Universidad de Antioquia (UdeA), Medellín, Colombia
Jaime Lotero: Facultad de Ciencias y Tecnología (FaCyT), Universidad de Carabobo (UC), Valencia, Venezuela
José Édinson Aedo: Facultad de Ciencias y Tecnología (FaCyT), Universidad de Carabobo (UC), Valencia, Venezuela
Sebastián Isaza: Facultad de Ciencias y Tecnología (FaCyT), Universidad de Carabobo (UC), Valencia, Venezuela

Journal volume & issue: Vol. 13

Abstract

The exponential growth of genomic data has recently motivated the development of compression algorithms to tackle the storage capacity limitations in bioinformatics centers. Referential compressors could theoretically achieve much higher compression than their non-referential counterparts; however, the latest tools have not yet been able to harness that potential. Reaching that goal requires an efficient encoding model to represent the differences between the input and the reference. In this article, we introduce a novel approach for referential compression of FASTQ files. The core of our compression scheme is a referential compressor that combines local alignments with a binary encoding optimized for long reads. Here we present the algorithms and performance tests developed for our read compression algorithm, named UdeACompress. Compared with the best state-of-the-art programs, our compressor achieved the best results when compressing long reads and competitive compression ratios for shorter reads. It also showed reasonable execution times and memory consumption in comparison with similar tools.
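
To illustrate the general idea of referential compression mentioned in the abstract, the following is a minimal sketch in Python. It is not the UdeACompress encoding, whose details are not given on this page. The function names `encode_read` and `decode_read` are hypothetical, the alignment position is assumed to be already known, only substitutions are handled (no insertions or deletions), and no binary packing is attempted.

```python
from typing import List, Tuple

# One substitution: (offset within the read, base found in the read).
Edit = Tuple[int, str]
# One encoded read: (position in the reference, read length, substitutions).
Record = Tuple[int, int, List[Edit]]


def encode_read(read: str, reference: str, pos: int) -> Record:
    """Represent a read by where it maps and how it differs from the reference.

    Assumption: `pos` comes from an aligner that is not shown here.
    """
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read does not fit in the reference at this position")
    # Keep only the bases that disagree with the reference.
    edits = [(i, b) for i, (b, r) in enumerate(zip(read, window)) if b != r]
    return pos, len(read), edits


def decode_read(record: Record, reference: str) -> str:
    """Rebuild the original read from the reference and the stored differences."""
    pos, length, edits = record
    bases = list(reference[pos:pos + length])
    for offset, base in edits:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTTAGC"
read = "ACGTCAG"  # matches reference[4:11] except for one base
record = encode_read(read, reference, 4)
restored = decode_read(record, reference)
```

In this toy case the seven-base read is stored as a position, a length, and a single substitution. The fewer differences a read has from the reference, the smaller its record, which is why the choice of encoding model for those differences matters.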

Published in Bioinformatics and Biology Insights

ISSN: 1177-9322 (Online)
Publisher: SAGE Publishing
Country of publisher: United Kingdom
LCC subjects: Science: Biology (General)
Website: https://journals.sagepub.com/home/bbi
