BMC Medical Research Methodology (Oct 2024)

Analyzing missingness patterns in real-world data using the SMDI toolkit: application to a linked EHR-claims pharmacoepidemiology study

  • Sudha R. Raman,
  • Bradley G. Hammill,
  • Pamela A. Shaw,
  • Hana Lee,
  • Sengwee Toh,
  • John G. Connolly,
  • Kimberly J. Dandreo,
  • Vinit Nalawade,
  • Fang Tian,
  • Wei Liu,
  • Jie Li,
  • José J. Hernández-Muñoz,
  • Robert J. Glynn,
  • Rishi J. Desai,
  • Janick Weberpals

DOI
https://doi.org/10.1186/s12874-024-02330-2
Journal volume & issue
Vol. 24, no. 1
pp. 1 – 14

Abstract

Read online

Abstract Background Missing data in confounding variables present a frequent challenge in generating evidence using real-world data, including electronic health records (EHR). Our objective was to apply a recently published toolkit for characterizing missing data patterns and based on the toolkit results about likely missingness mechanisms, illustrate the decision-making process for analyses in an empirical case example. Methods We utilized the Structural Missing Data Investigations (SMDI) toolkit to characterize missing data patterns in the context of a pharmacoepidemiology study comparing cardiovascular outcomes of initiating sodium-glucose-cotransporter-2 inhibitors (SGLT2i) and dipeptidyl peptidase‐4 inhibitors (DPP‐4i) among older adults. The study used a linked EHR-Medicare claims dataset from Duke Health patients (2015–2017), focusing on partially observed confounders from EHR data (HbA1c lab and body mass index [BMI] values). Our analysis incorporated SMDI's descriptive functions and diagnostic tests to explore missingness patterns and determine missingness mitigation approaches. We used findings from these investigations to inform estimation of adjusted hazard ratios comparing the two classes of medications. Results High levels of missingness were noted for important confounding variables including HbA1c (63.6%) and BMI (16.5%). Diagnostic tests resulted in output that described: 1) the distributions of patient characteristics, exposure, and outcome between patients with or without an observed value of the partially observed covariate, 2) the ability to predict missingness based on observed covariates, and 3) estimate if the missingness of a partially observed covariate is differential with respect to the outcome. There was evidence that missingness could be sufficiently described using observed data, which allowed multiple imputation by chained equations using random forests to address missing confounder data in estimating treatment effects. Multiple imputation resulted in improved alignment of effect estimates with previous studies. Conclusions We were able to demonstrate the practical application of the SMDI toolkit in a real-world setting. Application of the SMDI toolkit and the resulting insights of potential missingness patterns can inform the choice of appropriate analytic methods and increase transparency of research methods in handling missing data. This type of approach can inform analytic decision making and may increase our ability to generate evidence from real-world data.

Keywords