Scientific Data (Dec 2024)

Contextualized race and ethnicity annotations for clinical text from MIMIC-III

  • Oliver J. Bear Don’t Walk,
  • Adrienne Pichon,
  • Harry Reyes Nieva,
  • Tony Sun,
  • Jaan Li,
  • Josh Joseph,
  • Sivan Kinberg,
  • Lauren R. Richter,
  • Salvatore Crusco,
  • Kyle Kulas,
  • Shaan A. Ahmed,
  • Daniel Snyder,
  • Ashkon Rahbari,
  • Benjamin L. Ranard,
  • Pallavi Juneja,
  • Dina Demner-Fushman,
  • Noémie Elhadad

DOI
https://doi.org/10.1038/s41597-024-04183-2
Journal volume & issue
Vol. 11, no. 1
pp. 1 – 12

Abstract

Observational health research often relies on accurate and complete patient race and ethnicity (RE) information for tasks such as characterizing cohorts, assessing quality and performance metrics of hospitals and health systems, and identifying health disparities. Although the electronic health record contains structured, accessible patient-level RE data, these data are often missing, inaccurate, or lacking in granular detail. Natural language processing models can be trained to identify RE in clinical text, which can supplement missing RE data in clinical data repositories. Here we describe the Contextualized Race and Ethnicity Annotations for Clinical Text (C-REACT) Dataset, which comprises 12,000 patients and 17,281 sentences from their clinical notes in the MIMIC-III dataset. Two sets of reference standard annotations for RE data, built on these sentences, are made available along with annotation guidelines. The first set of annotations comprises highly granular information related to RE, such as preferred language and country of origin, while the second set contains RE labels annotated by physicians. This dataset can support health systems’ ability to use RE data to serve health equity goals.