Using Apache Spark on genome assembly for scalable overlap-graph reduction

Alexander J. Paul; Dylan Lawrence; Myoungkyu Song; Seung-Hwan Lim; Chongle Pan; Tae-Hyuk Ahn

doi:10.1186/s40246-019-0227-1

Human Genomics (Oct 2019)

Affiliations

Alexander J. Paul: Bioinformatics and Computational Biology Program, Saint Louis University
Dylan Lawrence: Computational and Systems Biology Program, Washington University in St. Louis
Myoungkyu Song: Department of Computer Science, University of Nebraska at Omaha
Seung-Hwan Lim: National Center for Computational Sciences, Oak Ridge National Laboratory
Chongle Pan: School of Computer Science, University of Oklahoma
Tae-Hyuk Ahn: Bioinformatics and Computational Biology Program, Saint Louis University

DOI: https://doi.org/10.1186/s40246-019-0227-1
Journal volume & issue: Vol. 13, no. S1
pp. 1 – 12

Abstract

Background: De novo genome assembly builds the genome of a specimen from overlaps between genomic fragments, without relying on a reference sequence. Sequence fragments (called reads) are assembled into contigs and scaffolds based on their overlaps. The quality of a de novo assembly depends on the length and continuity of the assembly. To enable faster and more accurate assembly, sequencing techniques such as high-throughput next-generation sequencing and long-read third-generation sequencing have been proposed. However, these techniques require large amounts of computer memory when very large overlap graphs are resolved, and resolving such graphs is difficult to parallelize.

Results: To address these limitations, we propose an innovative algorithmic approach called Scalable Overlap-graph Reduction Algorithms (SORA). SORA is an algorithm package that performs string graph reduction using Apache Spark. Its implementations are designed to execute de novo genome assembly on either a single machine or a distributed computing platform. SORA efficiently reduces the number of edges along paths in enormous graphs by adapting the scalable features of GraphX and GraphFrames, the graph processing libraries available for Apache Spark.

Conclusions: The algorithms and experimental results are available at our project website, https://github.com/BioHPC/SORA. We evaluated SORA with human genome samples. First, it processed a graph of nearly one billion edges on a distributed cloud cluster. Second, it processed small to mid-size graphs on a single workstation within a short time frame. Overall, SORA scaled linearly as the number of computing instances increased.
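To make the idea of string graph reduction concrete, the following is a minimal sketch of one common reduction step, transitive edge removal, on a toy overlap graph. It is an illustration under simplifying assumptions, not SORA's implementation: it uses plain Python rather than Apache Spark, checks only two-hop paths, and ignores overlap lengths and read orientation. The function name `transitive_reduction` and the read labels are hypothetical.

```python
from collections import defaultdict


def transitive_reduction(edges):
    """Drop every edge (u, w) for which some read v has edges (u, v) and (v, w).

    Simplified illustration: real overlap graphs also track overlap
    lengths and read orientation, which this sketch ignores.
    """
    # Build an adjacency map: read -> set of reads it overlaps into.
    successors = defaultdict(set)
    for u, v in edges:
        successors[u].add(v)

    reduced = set()
    for u, w in edges:
        # The edge u -> w is redundant if an intermediate read v links u to w.
        is_transitive = any(
            w in successors.get(v, ())
            for v in successors[u]
            if v != u and v != w
        )
        if not is_transitive:
            reduced.add((u, w))
    return reduced


# Toy overlap graph (hypothetical reads): r1 -> r3 and r2 -> r4 are implied
# by the chain r1 -> r2 -> r3 -> r4 and can be removed.
toy_edges = [
    ("r1", "r2"),
    ("r2", "r3"),
    ("r1", "r3"),
    ("r3", "r4"),
    ("r2", "r4"),
]
reduced_edges = transitive_reduction(toy_edges)
```

On this toy input only the chain edges remain, which is the kind of edge compaction the abstract describes, here at a scale of five edges rather than nearly one billion.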

Published in Human Genomics

ISSN: 1479-7364 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine; Science: Biology (General): Genetics
Website: https://humgenomics.biomedcentral.com/
