Evaluation of Structured, Semi-Structured, and Free-Text Electronic Health Record Data to Classify Hepatitis C Virus (HCV) Infection

Allan Fong; Justin Hughes; Sravya Gundapenini; Benjamin Hack; Mahdi Barkhordar; Sean Shenghsiu Huang; Adam Visconti; Stephen Fernandez; Dawn Fishbein

doi:10.3390/gidisord5020012

Gastrointestinal Disorders (Mar 2023)

Affiliations

Allan Fong: MedStar Health Research Institute, Hyattsville, MD 20782, USA
Justin Hughes: MedStar Health, Columbia, MD 20037, USA
Sravya Gundapenini: MedStar Health Research Institute, Hyattsville, MD 20782, USA
Benjamin Hack: School of Medicine, Georgetown University, Washington, DC 20007, USA
Mahdi Barkhordar: MedStar Health, Columbia, MD 20037, USA
Sean Shenghsiu Huang: Department of Health Management and Policy, School of Health, Georgetown University, Washington, DC 20007, USA
Adam Visconti: MedStar Health, Columbia, MD 20037, USA
Stephen Fernandez: MedStar Health Research Institute, Hyattsville, MD 20782, USA
Dawn Fishbein: MedStar Health Research Institute, Hyattsville, MD 20782, USA

DOI: https://doi.org/10.3390/gidisord5020012
Journal volume & issue: Vol. 5, no. 2, pp. 115–126

Abstract

Evaluation of the United States Centers for Disease Control and Prevention (CDC)-defined HCV-related risk factors is not consistently performed as part of routine care, rendering risk-based testing susceptible to clinician bias and missed diagnoses. This work uses natural language processing (NLP) and machine learning to identify patients who are at high risk for HCV infection. Models were developed and validated to predict patients with newly identified HCV infection (detectable RNA or reported HCV diagnosis). We evaluated models with three sets of variables: structured variables (structured-based model), semi-structured and free-text notes (text-based model), and all variables (full-set model). We applied each model to three stratifications of data: patients with no history of HCV prior to 2020, patients with a history of HCV prior to 2020, and all patients. We used XGBoost and ten-fold cross-validation, scored by the C-statistic, to evaluate the generalizability of the models. There were 3564 unique patients, 487 of whom had HCV infection. The average C-statistics of the structured-based, text-based, and full-set models for all patients were 0.777 (95% CI: 0.744–0.810), 0.677 (95% CI: 0.631–0.723), and 0.774 (95% CI: 0.735–0.813), respectively. For patients with no history of HCV prior to 2020, the full-set model performed slightly better than the structured-based model and similarly to the text-based model, with average C-statistics of 0.780, 0.774, and 0.759, respectively. NLP identified six additional risk factors that were inconsistently coded in structured elements: incarceration, needlestick, substance use or abuse, sexually transmitted infections, piercings, and tattoos. The availability of model options (structured-based or text-based) with similar performance can provide deployment flexibility in situations where data are limited.
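As a reading aid for the reported metric, the following is a minimal sketch of how a C-statistic and a cross-fold summary could be computed. It is not the authors' code. The function names `c_statistic` and `mean_with_ci` are hypothetical, the model scores are assumed to come from any classifier (the abstract names XGBoost, which is not used here), and the normal-approximation confidence interval is an assumption, since the abstract does not state how the intervals were derived.

```python
import math
import statistics


def c_statistic(labels, scores):
    """Probability that a randomly chosen positive case receives a higher
    score than a randomly chosen negative case (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_with_ci(fold_values, z=1.96):
    """Mean of per-fold C-statistics with a normal-approximation 95% CI.
    The CI method is an assumption, not taken from the abstract."""
    mean = statistics.mean(fold_values)
    half_width = z * statistics.stdev(fold_values) / math.sqrt(len(fold_values))
    return mean, mean - half_width, mean + half_width
```

In a ten-fold setup, `c_statistic` would be called once per held-out fold on that fold's labels and predicted scores, and the ten resulting values passed to `mean_with_ci`. The pairwise loop is quadratic in the number of cases, which is acceptable for a cohort of a few thousand patients but would call for a rank-based formulation on larger data.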

Published in Gastrointestinal Disorders

ISSN: 2624-5647 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Medicine: Internal medicine: Specialties of internal medicine: Diseases of the digestive system. Gastroenterology
Website: https://www.mdpi.com/journal/gastrointestdisord
