NCBI’s Virus Discovery Hackathon: Engaging Research Communities to Identify Cloud Infrastructure Requirements

Ryan Connor; Rodney Brister; Jan P. Buchmann; Ward Deboutte; Rob Edwards; Joan Martí-Carreras; Mike Tisza; Vadim Zalunin; Juan Andrade-Martínez; Adrian Cantu; Michael D’Amour; Alexandre Efremov; Lydia Fleischmann; Laura Forero-Junco; Sanzhima Garmaeva; Melissa Giluso; Cody Glickman; Margaret Henderson; Benjamin Kellman; David Kristensen; Carl Leubsdorf; Kyle Levi; Shane Levi; Suman Pakala; Vikas Peddu; Alise Ponsero; Eldred Ribeiro; Farrah Roy; Lindsay Rutter; Surya Saha; Migun Shakya; Ryan Shean; Matthew Miller; Benjamin Tully; Christopher Turkington; Ken Youens-Clark; Bert Vanmechelen; Ben Busby

doi:10.3390/genes10090714

Genes (Sep 2019)

Affiliations

Ryan Connor: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda MD 20894, USA
Rodney Brister: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda MD 20894, USA
Jan P. Buchmann: Charles Perkins Centre, School of Life and Environmental Sciences, The University of Sydney, Sydney, NSW 2006, Australia
Ward Deboutte: KU Leuven, Department of Microbiology & Immunology, Rega Institute, Leuven BE3000, Belgium
Rob Edwards: Department of Biology, San Diego State University, 5500 Campanile Dr., San Diego, CA 92182, USA
Joan Martí-Carreras: KU Leuven, Department of Microbiology & Immunology, Rega Institute, Leuven BE3000, Belgium
Mike Tisza: Lab of Cellular Oncology, NCI, NIH, Bethesda, MD 20892-4263, USA
Vadim Zalunin: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda MD 20894, USA
Juan Andrade-Martínez: Research Group on Computational Biology and Microbial Ecology, Department of Biological Sciences, Universidad de los Andes, Bogotá 111711, Colombia
Adrian Cantu: Department of Biology, San Diego State University, 5500 Campanile Dr., San Diego, CA 92182, USA
Michael D’Amour: D’Amour & Associates, 11839 Hilltop Drive, Los Altos Hills, CA 94024, USA
Alexandre Efremov: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda MD 20894, USA
Lydia Fleischmann: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda MD 20894, USA
Laura Forero-Junco: Research Group on Computational Biology and Microbial Ecology, Department of Biological Sciences, Universidad de los Andes, Bogotá 111711, Colombia
Sanzhima Garmaeva: Department of Genetics, University Medical Center Groningen, Groningen 9713AV, The Netherlands
Melissa Giluso: Department of Biology, San Diego State University, 5500 Campanile Dr., San Diego, CA 92182, USA
Cody Glickman: Computational Bioscience Program, University of Colorado Anschutz, Aurora, CO 80045, USA
Margaret Henderson: Department of Biology, San Diego State University, 5500 Campanile Dr., San Diego, CA 92182, USA
Benjamin Kellman: Bioinformatics and Systems Biology Program, University of California at San Diego, 9500 Gilman Dr., La Jolla, CA 92093, USA
David Kristensen: Department of Biomedical Engineering, University of Iowa, Iowa City, IA 52242, USA
Carl Leubsdorf: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda MD 20894, USA
Kyle Levi: Department of Biology, San Diego State University, 5500 Campanile Dr., San Diego, CA 92182, USA
Shane Levi: Department of Biology, San Diego State University, 5500 Campanile Dr., San Diego, CA 92182, USA
Suman Pakala: Division of Infectious Diseases, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37232, USA
Vikas Peddu: Department of Laboratory Medicine, University of Washington Virology, 1616 Eastlake Ave E, Seattle, WA 98102, USA
Alise Ponsero: Department of Biosystems Engineering, University of Arizona, Tucson, AZ 85716, USA
Eldred Ribeiro: MITRE Corporation, 7515 Colshire Drive, McLean, VA 22102-7539, USA
Farrah Roy: Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA
Lindsay Rutter: University of Tsukuba, Ibaraki 305-8575, Japan
Surya Saha: Boyce Thompson Institute, Ithaca, NY 14853, USA
Migun Shakya: Bioscience Division, Los Alamos National Lab, Los Alamos, NM 87545, USA
Ryan Shean: Department of Laboratory Medicine, University of Washington Virology, 1616 Eastlake Ave E, Seattle, WA 98102, USA
Matthew Miller: Department of Biosystems Engineering, University of Arizona, Tucson, AZ 85716, USA
Benjamin Tully: Center for Dark Energy Biosphere Investigations, University of Southern California, Los Angeles, CA 90089, USA
Christopher Turkington: School of Natural Sciences, University of California Merced, Merced, CA 95343, USA
Ken Youens-Clark: Department of Biosystems Engineering, University of Arizona, Tucson, AZ 85716, USA
Bert Vanmechelen: KU Leuven, Department of Microbiology & Immunology, Rega Institute, Leuven BE3000, Belgium
Ben Busby: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda MD 20894, USA

DOI: https://doi.org/10.3390/genes10090714
Journal volume & issue: Vol. 10, no. 9
p. 714

Abstract

A wealth of viral data sits untapped in publicly available metagenomic data sets, when it could be extracted to create a usable index for the virological research community. We hypothesized that work of this complexity and scale could be done in a hackathon setting. Ten teams comprising over 40 participants from six countries assembled to create a crowd-sourced set of analysis and processing pipelines for a complex biological data set in a three-day event on the San Diego State University campus starting 9 January 2019. Prior to the hackathon, 141,676 metagenomic data sets from the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) were pre-assembled into contiguous assemblies (contigs) by NCBI staff. During the hackathon, a subset of 2953 SRA data sets (approximately 55 million contigs) was selected and further filtered for a minimum contig length of 1 kb. This resulted in 4.2 million (Mio) contigs, which were aligned using BLAST against all known virus genomes, phylogenetically clustered, and assigned metadata. Of the 4.2 Mio contigs, 360,000 were labeled with domains, and an additional subset of 4400 contigs was screened for virus or virus-like genes. The work yielded valuable insights into both SRA data and the cloud infrastructure required to support such efforts, revealing analysis bottlenecks and possible workarounds. The main lessons were: (i) conservative assemblies of SRA data improve initial analysis steps; (ii) existing bioinformatic software with weak multithreading/multicore support can be elevated by wrapper scripts to use all cores within a computing node; (iii) redesigning existing bioinformatic algorithms for a cloud infrastructure facilitates its use by a wider audience; and (iv) a cloud infrastructure allows a diverse group of researchers to collaborate effectively. The scientific findings will be extended during a follow-up event. Here, we present the applied workflows, initial results, and lessons learned from the hackathon.
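
Two steps named in the abstract, the 1 kb length filter and the wrapper-script approach of lesson (ii), can be illustrated with a minimal Python sketch. This is not the hackathon's actual pipeline: the function names, the round-robin batching, and the GC-content worker are all hypothetical stand-ins chosen for illustration. In a real wrapper, the worker would presumably launch an external single-threaded program once per batch, which is why a thread pool is enough to keep every core busy.

```python
import os
from concurrent.futures import ThreadPoolExecutor

MIN_LEN = 1000  # 1 kb cutoff, as stated in the abstract


def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_len=MIN_LEN):
    """Keep only contigs whose sequence is at least min_len bases long."""
    return [(h, s) for h, s in records if len(s) >= min_len]


def split_round_robin(records, n_parts):
    """Deal records into at most n_parts non-empty batches."""
    parts = [records[i::n_parts] for i in range(n_parts)]
    return [p for p in parts if p]


def gc_fraction_batch(batch):
    """Hypothetical stand-in for a single-threaded tool run on one batch."""
    return [(h, (s.count("G") + s.count("C")) / len(s)) for h, s in batch]


def run_on_all_cores(records, worker=gc_fraction_batch, n_workers=None):
    """Run worker once per batch, with one batch per available core."""
    n_workers = n_workers or os.cpu_count() or 1
    batches = split_round_robin(records, n_workers)
    results = []
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for out in pool.map(worker, batches):
            results.extend(out)
    return results
```

Because the batches are dealt round-robin, results come back grouped by batch rather than in input order; callers that need the original order can key the results by contig header.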

Published in Genes

ISSN: 2073-4425 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Science: Biology (General): Genetics
Website: http://www.mdpi.com/journal/genes/
