Machine learning methods for propensity and disease risk score estimation in high-dimensional data: a plasmode simulation and real-world data cohort analysis

Yuchen Guo; Victoria Y. Strauss; Martí Català; Annika M. Jödicke; Sara Khalid; Daniel Prieto-Alhambra

doi:10.3389/fphar.2024.1395707

Frontiers in Pharmacology (Oct 2024)

Affiliations

Yuchen Guo: Pharmaco- and Device Epidemiology Group, Centre of Statistics in Medicine, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences (NDORMS), University of Oxford, Oxford, United Kingdom
Victoria Y. Strauss: Boehringer-Ingelheim, Ingelheim, Germany
Martí Català: Pharmaco- and Device Epidemiology Group, Centre of Statistics in Medicine, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences (NDORMS), University of Oxford, Oxford, United Kingdom
Annika M. Jödicke: Pharmaco- and Device Epidemiology Group, Centre of Statistics in Medicine, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences (NDORMS), University of Oxford, Oxford, United Kingdom
Sara Khalid: Pharmaco- and Device Epidemiology Group, Centre of Statistics in Medicine, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences (NDORMS), University of Oxford, Oxford, United Kingdom
Daniel Prieto-Alhambra: Pharmaco- and Device Epidemiology Group, Centre of Statistics in Medicine, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences (NDORMS), University of Oxford, Oxford, United Kingdom; Department of Medical Informatics, Erasmus Medical Center, Rotterdam, Netherlands

DOI: https://doi.org/10.3389/fphar.2024.1395707
Journal volume & issue: Vol. 15

Abstract

Introduction: Machine learning (ML) methods are promising and scalable alternatives for propensity score (PS) estimation, but their comparative performance in disease risk score (DRS) estimation remains unexplored.

Methods: We used real-world data comparing antihypertensive users to non-users with 69 negative control outcomes, together with plasmode simulations, to study the performance of ML methods in PS and DRS estimation. We conducted a cohort study using UK primary care records. Further, we conducted a plasmode simulation with synthetic treatment and outcome mimicking empirical data distributions. We compared four PS and DRS estimation methods:
1. Reference: logistic regression including clinically chosen confounders.
2. Logistic regression with L1 regularisation (LASSO).
3. Multi-layer perceptron (MLP).
4. Extreme Gradient Boosting (XGBoost).
We estimated covariate balance, coverage of the null effect of negative control outcomes (real-world data), and bias, defined as the absolute difference between observed and true effects (plasmode simulation). A total of 632,201 antihypertensive users and non-users were included.

Results: ML methods outperformed the reference method for PS estimation in some scenarios, both in terms of covariate balance and coverage/bias. Specifically, XGBoost achieved the best performance. DRS-based methods performed worse than PS-based methods in all tested scenarios.

Discussion: We found that ML methods could be reliable alternatives for PS estimation. ML-based DRS methods performed worse than PS-based ones, likely because of the rarity of outcomes.
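To make the workflow in the Methods concrete, the following is a minimal sketch of propensity score estimation followed by a covariate balance check. It is not the authors' pipeline: the data are synthetic, the two covariates and their coefficients are invented for illustration, the model is a plain logistic regression fitted by gradient descent (standing in for any of the four estimators compared), and balance is assessed with an inverse-probability-weighted standardized mean difference, which is one common balance metric and is assumed here rather than taken from the abstract.

```python
import math
import random

random.seed(0)  # fixed seed so the synthetic cohort is reproducible


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def linear(w, xi):
    # Intercept w[0] plus the dot product of coefficients and covariates.
    return w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))


def fit_logistic(X, t, lr=0.5, epochs=300):
    # Batch gradient descent on the logistic log-loss; returns [intercept, coefs...].
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)
    for _ in range(epochs):
        grad = [0.0] * (p + 1)
        for xi, ti in zip(X, t):
            err = sigmoid(linear(w, xi)) - ti
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def weighted_mean_var(values, weights):
    total = sum(weights)
    mean = sum(v * w for v, w in zip(values, weights)) / total
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return mean, var


def smd(x, t, weights):
    # Weighted standardized mean difference between treated and untreated.
    m1, v1 = weighted_mean_var([v for v, ti in zip(x, t) if ti == 1],
                               [w for w, ti in zip(weights, t) if ti == 1])
    m0, v0 = weighted_mean_var([v for v, ti in zip(x, t) if ti == 0],
                               [w for w, ti in zip(weights, t) if ti == 0])
    return (m1 - m0) / math.sqrt((v1 + v0) / 2.0)


# Hypothetical cohort: two standard-normal confounders drive treatment assignment.
n = 2000
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(n)]
t = [1 if random.random() < sigmoid(0.8 * x1 - 0.5 * x2) else 0 for x1, x2 in X]

# Estimate propensity scores, then build inverse probability of treatment weights.
w = fit_logistic(X, t)
ps = [sigmoid(linear(w, xi)) for xi in X]
ipw = [1.0 / p if ti == 1 else 1.0 / (1.0 - p) for p, ti in zip(ps, t)]

# Covariate balance before and after weighting.
smd_before = [smd([xi[j] for xi in X], t, [1.0] * n) for j in range(2)]
smd_after = [smd([xi[j] for xi in X], t, ipw) for j in range(2)]
print("SMD before weighting:", [round(s, 3) for s in smd_before])
print("SMD after weighting: ", [round(s, 3) for s in smd_after])
```

In a comparison like the one described, only the `fit_logistic` step would change between estimators (for example, an L1-penalised model, an MLP, or a gradient-boosted model), while the balance check stays the same.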

Published in Frontiers in Pharmacology

ISSN: 1663-9812 (Online)
Publisher: Frontiers Media S.A.
Country of publisher: Switzerland
LCC subjects: Medicine: Therapeutics. Pharmacology
Website: http://journal.frontiersin.org/journals/pharmacology
