BioWorkbench: a high-performance framework for managing and analyzing bioinformatics experiments

Maria Luiza Mondelli; Thiago Magalhães; Guilherme Loss; Michael Wilde; Ian Foster; Marta Mattoso; Daniel Katz; Helio Barbosa; Ana Tereza R. de Vasconcelos; Kary Ocaña; Luiz M.R. Gadelha Jr

doi:10.7717/peerj.5551

PeerJ (Aug 2018)

Affiliations

Maria Luiza Mondelli: National Laboratory for Scientific Computing, Petrópolis, Rio de Janeiro, Brazil
Thiago Magalhães: National Laboratory for Scientific Computing, Petrópolis, Rio de Janeiro, Brazil
Guilherme Loss: National Laboratory for Scientific Computing, Petrópolis, Rio de Janeiro, Brazil
Michael Wilde: Computation Institute, Argonne National Laboratory/University of Chicago, Chicago, IL, USA
Ian Foster: Computation Institute, Argonne National Laboratory/University of Chicago, Chicago, IL, USA
Marta Mattoso: Computer and Systems Engineering Program, COPPE, Federal University of Rio de Janeiro, Rio de Janeiro, Rio de Janeiro, Brazil
Daniel Katz: National Center for Supercomputing Applications, University of Illinois, Urbana, IL, USA
Helio Barbosa: National Laboratory for Scientific Computing, Petrópolis, Rio de Janeiro, Brazil
Ana Tereza R. de Vasconcelos: National Laboratory for Scientific Computing, Petrópolis, Rio de Janeiro, Brazil
Kary Ocaña: National Laboratory for Scientific Computing, Petrópolis, Rio de Janeiro, Brazil
Luiz M.R. Gadelha Jr: National Laboratory for Scientific Computing, Petrópolis, Rio de Janeiro, Brazil

DOI: https://doi.org/10.7717/peerj.5551
Journal volume & issue: Vol. 6, p. e5551

Abstract

Advances in sequencing techniques have led to exponential growth in biological data, demanding the development of large-scale bioinformatics experiments. Because these experiments are computation- and data-intensive, they require high-performance computing techniques and can benefit from specialized technologies such as Scientific Workflow Management Systems and databases. In this work, we present BioWorkbench, a framework for managing and analyzing bioinformatics experiments. The framework automatically collects provenance data, including both performance data from workflow execution and data from the scientific domain of the workflow application. Provenance data can be analyzed through a web application that abstracts a set of queries to the provenance database, simplifying access to provenance information. We evaluate BioWorkbench using three case studies: SwiftPhylo, a phylogenetic tree assembly workflow; SwiftGECKO, a comparative genomics workflow; and RASflow, a RASopathy analysis workflow. We analyze each workflow from both the computational and the scientific domain perspectives, using queries to a provenance and annotation database. Some of these queries are available as a pre-built feature of the BioWorkbench web application. Through the provenance data, we show that the framework is scalable and achieves high performance, reducing the execution time of the case studies by up to 98%. We also show how applying machine learning techniques can enrich the analysis process.

Published in PeerJ

ISSN: 2167-8359 (Online)
Publisher: PeerJ Inc.
Country of publisher: United States
LCC subjects: Medicine; Science: Biology (General)
Website: https://peerj.com/
