A Case Demonstration of the Open Health Natural Language Processing Toolkit From the National COVID-19 Cohort Collaborative and the Researching COVID to Enhance Recovery Programs for a Natural Language Processing System for COVID-19 or Postacute Sequelae of SARS CoV-2 Infection: Algorithm Development and Validation

Andrew Wen; Liwei Wang; Huan He; Sunyang Fu; Sijia Liu; David A Hanauer; Daniel R Harris; Ramakanth Kavuluru; Rui Zhang; Karthik Natarajan; Nishanth P Pavinkurve; Janos Hajagos; Sritha Rajupet; Veena Lingam; Mary Saltz; Corey Elowsky; Richard A Moffitt; Farrukh M Koraishy; Matvey B Palchuk; Jordan Donovan; Lora Lingrey; Garo Stone-DerHagopian; Robert T Miller; Andrew E Williams; Peter J Leese; Paul I Kovach; Emily R Pfaff; Mikhail Zemmel; Robert D Pates; Nick Guthe; Melissa A Haendel; Christopher G Chute; Hongfang Liu

doi:10.2196/49997

JMIR Medical Informatics (Sep 2024)

Journal volume & issue: Vol. 12, p. e49997

Abstract

Background: A wealth of clinically relevant information is only obtainable within unstructured clinical narratives, leading to great interest in clinical natural language processing (NLP). While a multitude of approaches to NLP exist, current algorithm development approaches have limitations that can slow the development process. These limitations are exacerbated when the task is emergent, as is the case currently for NLP extraction of signs and symptoms of COVID-19 and postacute sequelae of SARS-CoV-2 infection (PASC).

Objective: This study aims to highlight the current limitations of existing NLP algorithm development approaches that are exacerbated by NLP tasks surrounding emergent clinical concepts and to illustrate our approach to addressing these issues through the use case of developing an NLP system for the signs and symptoms of COVID-19 and PASC.

Methods: We used 2 preexisting studies on PASC as a baseline to determine a set of concepts that should be extracted by NLP. This concept list was then used in conjunction with the Unified Medical Language System to autonomously generate an expanded lexicon to weakly annotate a training set, which was then reviewed by a human expert to generate a fine-tuned NLP algorithm. The annotations from a fully human-annotated test set were then compared with NLP results from the fine-tuned algorithm. The NLP algorithm was then deployed to 10 additional sites that were also running our NLP infrastructure. Of these 10 sites, 5 were used to conduct a federated evaluation of the NLP algorithm.

Results: An NLP algorithm consisting of 12,234 unique normalized text strings corresponding to 2366 unique concepts was developed to extract COVID-19 or PASC signs and symptoms. An unweighted mean dictionary coverage of 77.8% was found for the 5 sites.

Conclusions: The evolutionary and time-critical nature of the PASC NLP task significantly complicates existing approaches to NLP algorithm development. In this work, we present a hybrid approach using the Open Health Natural Language Processing Toolkit aimed at addressing these needs with a dictionary-based weak labeling step that minimizes the need for additional expert annotation while still preserving the fine-tuning capabilities of expert involvement.
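
The dictionary-based weak labeling step described in the Methods can be illustrated with a minimal Python sketch. It is not the Open Health Natural Language Processing Toolkit's actual implementation or API. The lexicon entries, the function names (`normalize`, `weak_label`, `refine`, `unweighted_mean`), and the simple regular-expression matching are all assumptions made for illustration. The sketch shows three ideas from the abstract: tagging text with a lexicon of normalized strings mapped to concepts, dropping entries that an expert reviewer rejects, and averaging a per-site metric without weighting by site size.

```python
import re
from statistics import mean

# Hypothetical toy lexicon: normalized text string -> concept label.
# Illustrative entries only; not taken from the published dictionary.
LEXICON = {
    "shortness of breath": "dyspnea",
    "dyspnea": "dyspnea",
    "loss of smell": "anosmia",
    "anosmia": "anosmia",
    "fatigue": "fatigue",
}


def normalize(text):
    """Lowercase and collapse whitespace so lookups ignore case and spacing."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def weak_label(note, lexicon):
    """Return (start, end, matched string, concept) for each lexicon hit.

    Offsets refer to the normalized note, not the raw input.
    """
    text = normalize(note)
    # Try longer terms first so multiword entries win over shorter ones.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")
    return [
        (m.start(), m.end(), m.group(1), lexicon[m.group(1)])
        for m in pattern.finditer(text)
    ]


def refine(lexicon, rejected_terms):
    """Drop entries a human reviewer flagged as producing bad matches."""
    return {t: c for t, c in lexicon.items() if t not in rejected_terms}


def unweighted_mean(per_site_values):
    """Average per-site values so each site counts equally, whatever its size."""
    return mean(per_site_values)


hits = weak_label("Patient reports Shortness  of breath and fatigue.", LEXICON)
concepts = [concept for _, _, _, concept in hits]
```

In this sketch the weak labels stand in for a first-pass annotation, and `refine` stands in for the expert review that adjusts the lexicon afterward. A production system would also need to handle negation, context, and tokenization, which are omitted here.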

Published in JMIR Medical Informatics

ISSN: 2291-9694 (Online)
Publisher: JMIR Publications
Country of publisher: Canada
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics
Website: https://medinform.jmir.org
