BioData Mining (Jun 2025)

A probabilistic approach for building disease phenotypes across electronic health records

  • David Vidmar,
  • Jessica De Freitas,
  • Will Thompson,
  • John M. Pfeifer,
  • Brandon K. Fornwalt,
  • Noah Zimmerman,
  • Riccardo Miotto,
  • Ruijun Chen

DOI
https://doi.org/10.1186/s13040-025-00454-9
Journal volume & issue
Vol. 18, no. 1
pp. 1–13

Abstract

Background: Identifying the set of patients with a particular disease diagnosis across electronic health records (EHRs), referred to as a phenotype, is an important step in clinical research and applications. This task is often challenging, however, because incomplete data can render definitive classifications impossible. We propose a probabilistic approach to phenotyping that is based on Bayesian inference and does not require gold-standard labels. In this paper, we develop multiple heuristic “labeling functions” (LFs) for 4 diseases across de-identified EHR data and aggregate their votes using three methods: majority vote (MV), a popular open-source approach (Snorkel OSS), and our proposed probabilistic approach (LEVI). We compare the resulting phenotypes to those built using expert-curated logic from the literature, as well as to those from an off-the-shelf natural language processing pipeline (Medspacy), using a curated sample of physician-reviewed labels for evaluation.

Results: Phenotypes built using LFs achieve better classification performance than off-the-shelf alternatives (F1 scores of 0.79–0.82 vs. expert-logic: 0.68, Medspacy: 0.55). Compared to output scores from Snorkel OSS, LEVI provides better probabilistic performance (expected calibration error of 0.04 vs. 0.12), ROC AUC estimates (interval score [loss] of 0.03 vs. 0.10), and operating point selection (equal-cost net benefit of 0.18 vs. 0.15).

Conclusions: For challenging disease states, phenotyping using probabilities rather than binary classification can lead to improved and more personalized downstream decision-making. Probabilistic phenotypes built using LEVI exhibit low calibration error without the need for labels, allowing for better risk-benefit tradeoffs.
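As a rough illustration of two ideas named in the abstract, the sketch below aggregates labeling-function votes by majority vote and computes an expected calibration error (ECE) against reference labels. It is not the authors' code and does not implement LEVI or Snorkel OSS. The three labeling functions, the patient record layout, the diabetes example, and the 10-bin equal-width ECE are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional, Sequence

Patient = Dict[str, object]
Vote = Optional[int]  # 1 = has disease, 0 = does not, None = abstain

# Hypothetical labeling functions for a type 2 diabetes phenotype.
# These are illustrative heuristics, not the LFs used in the paper.
def lf_dx_code(p: Patient) -> Vote:
    codes = p.get("codes", [])
    return 1 if any(c.startswith("E11") for c in codes) else None

def lf_medication(p: Patient) -> Vote:
    return 1 if "metformin" in p.get("meds", []) else None

def lf_hba1c(p: Patient) -> Vote:
    hba1c = p.get("labs", {}).get("hba1c")
    if hba1c is None:
        return None
    if hba1c >= 6.5:
        return 1
    return 0 if hba1c < 5.7 else None

def majority_vote(p: Patient, lfs: Sequence[Callable[[Patient], Vote]]) -> Optional[float]:
    """Fraction of non-abstaining LFs voting positive; None if all abstain."""
    votes = [v for v in (lf(p) for lf in lfs) if v is not None]
    if not votes:
        return None
    return sum(votes) / len(votes)

def expected_calibration_error(probs: Sequence[float], labels: Sequence[int],
                               n_bins: int = 10) -> float:
    """Equal-width-bin ECE: weighted mean of |mean score - observed rate| per bin."""
    bins: List[List[tuple]] = [[] for _ in range(n_bins)]
    for prob, y in zip(probs, labels):
        idx = min(int(prob * n_bins), n_bins - 1)  # prob == 1.0 goes in the last bin
        bins[idx].append((prob, y))
    n = len(probs)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_score = sum(prob for prob, _ in b) / len(b)
        observed = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(mean_score - observed)
    return ece

LFS = [lf_dx_code, lf_medication, lf_hba1c]
example = {"codes": ["E11.9"], "meds": ["metformin"], "labs": {"hba1c": 5.4}}
score = majority_vote(example, LFS)  # two positive votes, one negative
```

A vote fraction of this kind is a score rather than a calibrated probability. Computing ECE requires reference labels, such as the physician-reviewed sample the abstract uses for evaluation.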

Keywords