Identifying COVID-19 Outbreaks From Contact-Tracing Interview Forms for Public Health Departments: Development of a Natural Language Processing Pipeline

John Caskey; Iain L McConnell; Madeline Oguss; Dmitriy Dligach; Rachel Kulikoff; Brittany Grogan; Crystal Gibson; Elizabeth Wimmer; Traci E DeSalvo; Edwin E Nyakoe-Nyasani; Matthew M Churpek; Majid Afshar

doi:10.2196/36119

JMIR Public Health and Surveillance (Mar 2022)

DOI: https://doi.org/10.2196/36119
Journal volume & issue: Vol. 8, no. 3
p. e36119

Abstract

Background: In Wisconsin, COVID-19 case interview forms contain free-text fields that need to be mined to identify potential outbreaks for targeted policy making. We developed an automated pipeline to ingest the free text into a pretrained neural language model to identify businesses and facilities as outbreaks.

Objective: We aimed to examine the precision and recall of our natural language processing pipeline against existing outbreaks and potentially new clusters.

Methods: Data on cases of COVID-19 were extracted from the Wisconsin Electronic Disease Surveillance System (WEDSS) for Dane County between July 1, 2020, and June 30, 2021. Features from the case interview forms were fed into a Bidirectional Encoder Representations from Transformers (BERT) model that was fine-tuned for named entity recognition (NER). We also developed a novel location-mapping tool to provide addresses for relevant NER. Precision and recall were measured against manually verified outbreaks and valid addresses in WEDSS.

Results: There were 46,798 cases of COVID-19, with 4,183,273 total BERT tokens and 15,051 unique tokens. The recall and precision of the NER tool were 0.67 (95% CI 0.66-0.68) and 0.55 (95% CI 0.54-0.57), respectively. For the location-mapping tool, the recall and precision were 0.93 (95% CI 0.92-0.95) and 0.93 (95% CI 0.92-0.95), respectively. Across monthly intervals, the NER tool identified more potential clusters than were verified in WEDSS.

Conclusions: We developed a novel pipeline of tools that identified existing outbreaks and novel clusters with associated addresses. Our pipeline ingests data from a statewide database and may be deployed to assist local health departments for targeted interventions.
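The abstract reports precision and recall with 95% CIs but does not say how the intervals were computed. As a rough illustration only, the sketch below derives both metrics from true-positive, false-positive, and false-negative counts and attaches a Wilson score interval to each. The choice of the Wilson interval, the function names, and the example counts are all assumptions made for illustration; they are not taken from the study.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%).

    Assumption: the source does not state which interval method was used;
    Wilson is chosen here only as a common, well-behaved default.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half_width, centre + half_width


def precision_recall(tp, fp, fn):
    """Return point estimates and Wilson intervals for precision and recall.

    tp: entities predicted and confirmed against the reference standard
    fp: entities predicted but not confirmed
    fn: reference entities the tool missed
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": (precision, wilson_interval(tp, tp + fp)),
        "recall": (recall, wilson_interval(tp, tp + fn)),
    }


if __name__ == "__main__":
    # Hypothetical counts, not data from the study.
    for name, (estimate, (low, high)) in precision_recall(80, 20, 40).items():
        print(f"{name}: {estimate:.2f} (95% CI {low:.2f}-{high:.2f})")
```

With entity-level counts from a labelled validation set, the same two functions could be applied separately to the NER tool and to the location-mapping tool.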

Published in JMIR Public Health and Surveillance

ISSN: 2369-2960 (Online)
Publisher: JMIR Publications
Country of publisher: Canada
LCC subjects: Medicine: Public aspects of medicine
Website: https://publichealth.jmir.org
