A search-based geographic metadata curation pipeline to refine sequencing institution information and support public health

Kun Zhao; Katie Farrell; Melchizedek Mashiku; Dawit Abay; Kevin Tang; M. Steven Oberste; Cara C. Burns

doi:10.3389/fpubh.2023.1254976

Frontiers in Public Health (Nov 2023)

Affiliations

Kun Zhao: Division of Viral Diseases, National Center for Immunization and Respiratory Diseases, Centers for Disease Control and Prevention, Atlanta, GA, United States
Katie Farrell: Cherokee Nation Businesses, Contracting Agency to the Division of Viral Diseases, Centers for Disease Control and Prevention, Catoosa, OK, United States
Melchizedek Mashiku: Cherokee Nation Businesses, Contracting Agency to the Division of Viral Diseases, Centers for Disease Control and Prevention, Catoosa, OK, United States
Dawit Abay: Cherokee Nation Businesses, Contracting Agency to the Division of Viral Diseases, Centers for Disease Control and Prevention, Catoosa, OK, United States
Kevin Tang: Division of Scientific Resources, National Center for Emerging and Zoonotic Infectious Diseases, Centers for Disease Control and Prevention, Atlanta, GA, United States
M. Steven Oberste: Division of Viral Diseases, National Center for Immunization and Respiratory Diseases, Centers for Disease Control and Prevention, Atlanta, GA, United States
Cara C. Burns: Division of Viral Diseases, National Center for Immunization and Respiratory Diseases, Centers for Disease Control and Prevention, Atlanta, GA, United States

Journal volume & issue: Vol. 11

Abstract

Background: The National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) has amassed a vast reservoir of genetic data since its inception in 2007. These public data hold immense potential for supporting pathogen surveillance and control. However, the lack of standardized metadata and inconsistent submission practices in SRA may impede the data’s utility in public health.

Methods: To address this issue, we introduce the Search-based Geographic Metadata Curation (SGMC) pipeline. SGMC utilized Python and web scraping to extract geographic data of sequencing institutions from NCBI SRA in the Cloud and its website. It then harnessed ChatGPT to refine the sequencing institution and location assignments. To illustrate the pipeline’s utility, we examined the geographic distribution of the sequencing institutions and their countries relevant to polio eradication and categorized them.

Results: SGMC successfully identified 7,649 sequencing institutions and their global locations from a random selection of 2,321,044 SRA accessions. These institutions were distributed across 97 countries, with strong representation in the United States, the United Kingdom and China. However, there was a lack of data from African, Central Asian, and Central American countries, indicating potential disparities in sequencing capabilities. Comparison with manually curated data for U.S. institutions reveals SGMC’s accuracy rates of 94.8% for institutions, 93.1% for countries, and 74.5% for geographic coordinates.

Conclusion: SGMC may represent a novel approach using a generative AI model to enhance geographic data (country and institution assignments) for large numbers of samples within SRA datasets. This information can be utilized to bolster public health endeavors.
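The accuracy figures in the abstract come from comparing pipeline output with manually curated records. The abstract does not say how a match was scored, so the following is only a minimal sketch of one way such per-field accuracy could be computed. The record layout, the function names `haversine_km` and `field_accuracy`, the case-insensitive exact match for names, and the 50 km coordinate tolerance are all assumptions made for illustration, not the authors' method.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km: mean Earth radius


def field_accuracy(predicted, curated, tolerance_km=50.0):
    """Fraction of shared accessions on which each field agrees.

    Both arguments map an accession ID to a dict with the hypothetical
    keys "institution", "country", "lat" and "lon". Names are compared
    case-insensitively after trimming whitespace (an assumption);
    coordinates count as correct when within tolerance_km (an assumption).
    """
    shared = predicted.keys() & curated.keys()
    if not shared:
        raise ValueError("no accessions in common")
    hits = {"institution": 0, "country": 0, "coordinates": 0}
    for acc in shared:
        p, c = predicted[acc], curated[acc]
        for field in ("institution", "country"):
            if p[field].strip().casefold() == c[field].strip().casefold():
                hits[field] += 1
        if haversine_km(p["lat"], p["lon"], c["lat"], c["lon"]) <= tolerance_km:
            hits["coordinates"] += 1
    return {field: n / len(shared) for field, n in hits.items()}
```

Scoring coordinates with a distance tolerance rather than exact equality is a design choice: two correct geocodings of the same institution rarely agree to the last decimal place, so the tolerance chosen directly affects the reported coordinate accuracy.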

Published in Frontiers in Public Health

ISSN: 2296-2565 (Online)
Publisher: Frontiers Media S.A.
Country of publisher: Switzerland
LCC subjects: Medicine: Public aspects of medicine
Website: https://www.frontiersin.org/journals/public-health
