Shared data science infrastructure for genomics data

Hamid Bagheri; Usha Muppirala; Rick E. Masonbrink; Andrew J. Severin; Hridesh Rajan

doi:10.1186/s12859-019-2967-2

BMC Bioinformatics (Aug 2019)

Affiliations

Hamid Bagheri: Department of Computer Science, Iowa State University
Usha Muppirala: Genome Informatics Facility, Iowa State University
Rick E. Masonbrink: Genome Informatics Facility, Iowa State University
Andrew J. Severin: Genome Informatics Facility, Iowa State University
Hridesh Rajan: Department of Computer Science, Iowa State University

DOI: https://doi.org/10.1186/s12859-019-2967-2
Journal volume & issue: Vol. 20, no. 1, pp. 1–13

Abstract

Background: Creating a scalable computational infrastructure to analyze the wealth of information contained in data repositories is difficult due to significant barriers in organizing, extracting and analyzing relevant data. Shared data science infrastructures like BoaG are needed to efficiently process and parse data contained in large data repositories. The main features of BoaG are inspired by existing languages for data-intensive computing, and BoaG can easily integrate data from biological data repositories.

Results: As a proof of concept, Boa for genomics, BoaG, has been implemented to analyze RefSeq’s 153,848 annotation (GFF) and assembly (FASTA) file metadata. BoaG provides a massive improvement over existing solutions like Python and MongoDB by utilizing a domain-specific language that uses Hadoop infrastructure for a smaller storage footprint, scales well, and requires fewer lines of code. We execute scripts through BoaG to answer questions about the genomes in RefSeq. We identify the largest and smallest genomes deposited, explore exon frequencies for assemblies after 2016, identify the most commonly used bacterial genome assembly program, and address how animal genome assemblies have improved since 2016. BoaG databases significantly reduce the storage required for the raw data and significantly speed up queries over large datasets, owing to the automated parallelization and distribution that the Hadoop infrastructure provides during computation.

Conclusions: In order to keep pace with our ability to produce biological data, innovative methods are required. The shared data science infrastructure BoaG gives researchers greater access to data so that they can explore it efficiently in new ways. We demonstrate the potential of the domain-specific language BoaG using the RefSeq database to explore how deposited genome assemblies and annotations are changing over time. This is a small example of how BoaG could be used with large biological datasets.

Published in BMC Bioinformatics

ISSN: 1471-2105 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Biology (General)
Website: http://www.biomedcentral.com/bmcbioinformatics/
