Evidence-Based Toxicology (Dec 2025)

An environmental health vocabulary and its semi-automated curation workflow

  • Michelle Angrish,
  • Scott Burns,
  • Joshua Cleland,
  • Caroline Foster,
  • Samuel Kovach,
  • Kristan Markey,
  • Brittany Schulz,
  • Andy Shapiro,
  • Michele Taylor,
  • George Woodall,
  • Sean Watford

DOI
https://doi.org/10.1080/2833373X.2025.2485111
Journal volume & issue
Vol. 3, no. 1

Abstract

The environmental health vocabulary (EHV) is a set of manually curated terminologies developed by the US Environmental Protection Agency’s (EPA) Chemical and Pollutant Assessment Division (CPAD) to standardize the reporting of health effect information. Because manual data curation is a resource bottleneck, a semi-automated curation workflow was developed. The objectives of this work are to describe the manual creation of the EHV and to improve the efficiency of data curation with a semi-automated workflow that minimizes manual review through a sequence of computational text analysis and quality assurance/quality control (QA/QC) steps while maintaining a high level of accuracy. The workflow comprises (1) a series of computational text processing steps that normalize extracted terms and match them to the EHV; (2) a QA step for the computationally identified matches; (3) a manual review of unmatched terms; and (4) curation of the EHV, including completion of missing hierarchical data and related metadata. The EHV was manually created to promote data aggregation, integration, accessibility, and transparent data exchange across EPA partners by normalizing the data extracted into the EPA Health Assessment Workspace Collaborative (HAWC). The workflow described here removes the manual curation bottleneck by transforming data curation into a streamlined semi-automated process powered by computational text processing. This semi-automated curation method offers several advantages to the environmental health community, including (but not limited to) efficiency through automation of repetitive data management tasks, scalability to a large volume of terms and terminology resources, and better integration with other data sets and with artificial intelligence (AI) and machine learning (ML) models.
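The normalize-and-match step, and the routing of unmatched terms to manual review, can be illustrated with a minimal Python sketch. The abstract does not specify the actual text processing steps, so everything here is an assumption made for illustration: the function names `normalize` and `match_terms` are hypothetical, the normalization rules (lowercasing, accent and punctuation stripping, whitespace collapsing) are generic choices, and exact matching after normalization stands in for whatever matching logic the authors used.

```python
import re
import unicodedata


def normalize(term: str) -> str:
    """Hypothetical normalizer: strip accents, lowercase, and replace
    runs of non-alphanumeric characters with a single space."""
    decomposed = unicodedata.normalize("NFKD", term)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", no_accents.lower()).strip()


def match_terms(extracted, vocabulary):
    """Match extracted terms to vocabulary entries by normalized form.

    Returns (matched, unmatched): `matched` maps each extracted term to
    its vocabulary entry and would feed a QA step; `unmatched` lists the
    terms that would be routed to manual review.
    """
    # Index the vocabulary by normalized form (assumed unique here).
    index = {normalize(entry): entry for entry in vocabulary}
    matched, unmatched = {}, []
    for term in extracted:
        hit = index.get(normalize(term))
        if hit is not None:
            matched[term] = hit
        else:
            unmatched.append(term)
    return matched, unmatched
```

Under these assumptions, calling `match_terms` with a list of extracted terms and a list of vocabulary entries returns the candidate matches for QA and the leftover terms for manual review, mirroring steps (1) through (3) of the workflow.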

Keywords