Benchmark datasets for SARS-CoV-2 surveillance bioinformatics
Lingzi Xiaoli,
Jill V. Hagey,
Daniel J. Park,
Christopher A. Gulvik,
Erin L. Young,
Nabil-Fareed Alikhan,
Adrian Lawsin,
Norman Hassell,
Kristen Knipe,
Kelly F. Oakeson,
Adam C. Retchless,
Migun Shakya,
Chien-Chi Lo,
Patrick Chain,
Andrew J. Page,
Benjamin J. Metcalf,
Michelle Su,
Jessica Rowell,
Eshaw Vidyaprakash,
Clinton R. Paden,
Andrew D. Huang,
Dawn Roellig,
Ketan Patel,
Kathryn Winglee,
Michael R. Weigand,
Lee S. Katz
Affiliations
Lingzi Xiaoli
Strain Surveillance and Emerging Variant Team, Centers for Disease Control and Prevention, Atlanta, GA, United States of America
Jill V. Hagey
Strain Surveillance and Emerging Variant Team, Centers for Disease Control and Prevention, Atlanta, GA, United States of America
Daniel J. Park
Broad Institute of MIT and Harvard, Cambridge, MA, United States of America
Christopher A. Gulvik
Strain Surveillance and Emerging Variant Team, Centers for Disease Control and Prevention, Atlanta, GA, United States of America
Erin L. Young
Utah Public Health Laboratory, Salt Lake City, UT, United States of America
Nabil-Fareed Alikhan
Quadram Institute Bioscience, Norwich Research Park, Norwich, United Kingdom
Adrian Lawsin
Strain Surveillance and Emerging Variant Team, Centers for Disease Control and Prevention, Atlanta, GA, United States of America
Norman Hassell
Strain Surveillance and Emerging Variant Team, Centers for Disease Control and Prevention, Atlanta, GA, United States of America
Kristen Knipe
Strain Surveillance and Emerging Variant Team, Centers for Disease Control and Prevention, Atlanta, GA, United States of America
Kelly F. Oakeson
Utah Public Health Laboratory, Salt Lake City, UT, United States of America
Adam C. Retchless
Strain Surveillance and Emerging Variant Team, Centers for Disease Control and Prevention, Atlanta, GA, United States of America
Migun Shakya
Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM, United States of America
Chien-Chi Lo
Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM, United States of America
Patrick Chain
Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM, United States of America
Andrew J. Page
Quadram Institute Bioscience, Norwich Research Park, Norwich, United Kingdom
Benjamin J. Metcalf
Strain Surveillance and Emerging Variant Team, Centers for Disease Control and Prevention, Atlanta, GA, United States of America
Michelle Su
Strain Surveillance and Emerging Variant Team, Centers for Disease Control and Prevention, Atlanta, GA, United States of America
Jessica Rowell
SARS-CoV-2 Emerging Variant Sequencing Project Dry Lab Group Laboratory and Testing Task Force COVID-19 Emergency Response, Centers for Disease Control and Prevention, Atlanta, GA, United States of America
Eshaw Vidyaprakash
SARS-CoV-2 Emerging Variant Sequencing Project Dry Lab Group Laboratory and Testing Task Force COVID-19 Emergency Response, Centers for Disease Control and Prevention, Atlanta, GA, United States of America
Clinton R. Paden
Strain Surveillance and Emerging Variant Team, Centers for Disease Control and Prevention, Atlanta, GA, United States of America
Andrew D. Huang
SARS-CoV-2 Emerging Variant Sequencing Project Dry Lab Group Laboratory and Testing Task Force COVID-19 Emergency Response, Centers for Disease Control and Prevention, Atlanta, GA, United States of America
Dawn Roellig
Strain Surveillance and Emerging Variant Team, Centers for Disease Control and Prevention, Atlanta, GA, United States of America
Ketan Patel
Strain Surveillance and Emerging Variant Team, Centers for Disease Control and Prevention, Atlanta, GA, United States of America
Kathryn Winglee
Strain Surveillance and Emerging Variant Team, Centers for Disease Control and Prevention, Atlanta, GA, United States of America
Michael R. Weigand
Strain Surveillance and Emerging Variant Team, Centers for Disease Control and Prevention, Atlanta, GA, United States of America
Lee S. Katz
Strain Surveillance and Emerging Variant Team, Centers for Disease Control and Prevention, Atlanta, GA, United States of America
Background
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the cause of coronavirus disease 2019 (COVID-19), has spread globally and is being surveilled through an international genome sequencing effort. Surveillance consists of sample acquisition, library preparation, and whole genome sequencing. This effort has necessitated a classification scheme detailing Variants of Concern (VOC) and Variants of Interest (VOI), and has driven the rapid expansion of bioinformatics tools for sequence analysis. These tools underpin the major actionable results of surveillance: maintaining quality assurance and checks, defining population structure, performing genomic epidemiology, and inferring lineage to allow reliable identification and classification. Additionally, the pandemic has required public health laboratories to rapidly reach high-throughput proficiency in sequencing library preparation and downstream data analysis. However, both processes can be limited by the lack of a standardized sequence dataset.
Methods
We identified six SARS-CoV-2 sequence datasets from recent publications, public databases, and internal resources. In addition, we created a method to mine public databases to identify representative genomes for these datasets. Using this novel method, we identified several genomes as either VOI/VOC representatives or non-VOI/VOC representatives. To describe each dataset, we used a previously published datasets format, which records accession information and whole-dataset information. Additionally, a script from the same publication has been enhanced to download and verify all data from this study.
Results
The benchmark datasets focus on the two most widely used sequencing platforms: long-read sequencing data from the Oxford Nanopore Technologies platform and short-read sequencing data from the Illumina platform. There are six datasets: three were derived from recent publications; two were derived by mining public databases to answer common questions not covered by published datasets; and one unique dataset, representing common sequence failures, was obtained by rigorously scrutinizing data that did not pass quality checks. The dataset summary table, data mining script, and quality control (QC) values for all sequence data are publicly available on GitHub: https://github.com/CDCgov/datasets-sars-cov-2.
Discussion
The datasets presented here were generated to help public health laboratories build sequencing and bioinformatics capacity, benchmark different workflows and pipelines, and calibrate QC thresholds to ensure sequencing quality. Together, improvements in these areas support accurate and timely outbreak investigation and surveillance, providing actionable data for pandemic management. Furthermore, these publicly available and standardized benchmark data will facilitate the development and adjudication of new pipelines.
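
The download-and-verify step mentioned in Methods can be illustrated with a minimal sketch. This is not the script from the repository: the tab-separated layout, the column names `filename` and `sha256`, the choice of SHA-256, and the function names are all assumptions made for illustration, and the real datasets format may differ.

```python
import csv
import hashlib
import io
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_dataset(table_text, data_dir):
    """Check downloaded files against a checksum table.

    table_text: tab-separated text with hypothetical columns
                'filename' and 'sha256' (an assumed layout).
    Returns a dict mapping filename to 'ok', 'mismatch', or 'missing'.
    """
    results = {}
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    for row in reader:
        name = row["filename"]
        path = Path(data_dir) / name
        if not path.is_file():
            results[name] = "missing"
        elif sha256_of(path) == row["sha256"].strip().lower():
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results


if __name__ == "__main__":
    # Demonstration on a throwaway file; no real accessions are used.
    with tempfile.TemporaryDirectory() as tmp:
        payload = b"@read1\nACGT\n+\nIIII\n"
        (Path(tmp) / "sample_1.fastq").write_bytes(payload)
        expected = hashlib.sha256(payload).hexdigest()
        table = "filename\tsha256\nsample_1.fastq\t" + expected + "\n"
        print(verify_dataset(table, tmp))
```

Reporting a status per file, rather than stopping at the first failure, lets a laboratory see in one pass which files need to be downloaded again.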