Learning debiased graph representations from the OMOP common data model for synthetic data generation

Nicolas Alexander Schulz; Jasmin Carus; Alexander Johannes Wiederhold; Ole Johanns; Frederik Peters; Natalie Rath; Katharina Rausch; Bernd Holleczek; Alexander Katalinic; the AI-CARE Working Group; Christopher Gundler

doi:10.1186/s12874-024-02257-8

BMC Medical Research Methodology (Jun 2024)

Affiliations

Nicolas Alexander Schulz: Institute for Applied Medical Informatics, University Medical Center Hamburg-Eppendorf
Jasmin Carus: University Cancer Center Hamburg, University Medical Center Hamburg-Eppendorf
Alexander Johannes Wiederhold: Institute for Applied Medical Informatics, University Medical Center Hamburg-Eppendorf
Ole Johanns: Cancer Registry Hamburg
Frederik Peters: Cancer Registry Hamburg
Natalie Rath: Saarland Cancer Registry
Katharina Rausch: Saarland Cancer Registry
Bernd Holleczek: Saarland Cancer Registry
Alexander Katalinic: Cancer Registry Schleswig-Holstein
the AI-CARE Working Group
Christopher Gundler: Institute for Applied Medical Informatics, University Medical Center Hamburg-Eppendorf

DOI: https://doi.org/10.1186/s12874-024-02257-8
Journal volume & issue: Vol. 24, no. 1, pp. 1–13

Abstract

Background: Generating synthetic patient data is crucial for medical research, but common approaches build on black-box models that do not allow for expert verification or intervention. We propose a highly available method that generates synthetic data from real patient records in a privacy-preserving and compliant fashion, is interpretable, and allows for expert intervention.

Methods: Our approach ties together two established tools in medical informatics: OMOP as a data standard for electronic health records and Synthea as a data synthesis method. For this study, data pipelines were built that extract data from OMOP, convert them into time series format, learn temporal rules using two statistical algorithms (Markov chain, TARM) and three causal discovery algorithms (DYNOTEARS, J-PCMCI+, LiNGAM), and map the outputs into Synthea graphs. The graphs are evaluated quantitatively by their individual and relative complexity and qualitatively by medical experts.

Results: The algorithms learned qualitatively and quantitatively different graph representations. Whereas the Markov chain produced extremely large graphs, TARM, DYNOTEARS, and J-PCMCI+ reduced the data dimension during learning. The MultiGroupDirect LiNGAM algorithm was found not to be applicable to the problem at hand.

Conclusion: Only TARM and DYNOTEARS are practical algorithms for real-world data in this use case. As causal discovery is a method to debias purely statistical relationships, the gradient-based causal discovery algorithm DYNOTEARS was found to be most suitable.
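As an illustration of the simplest of the learning steps named in the abstract, the following is a minimal sketch of estimating a first-order Markov chain from per-patient event sequences and arranging the result as a transition graph. It is not the authors' pipeline: the function names, the toy event sequences, and the output layout (loosely modeled on a list of weighted transitions per state) are all assumptions made for this example, and no OMOP extraction or actual Synthea export is performed.

```python
from collections import Counter, defaultdict


def learn_markov_transitions(sequences):
    """Estimate first-order transition probabilities from event sequences.

    `sequences` is a list of per-patient event lists, ordered in time.
    Returns {state: {next_state: probability}}.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        # Count each observed (current event, next event) pair.
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    return {
        state: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for state, followers in counts.items()
    }


def to_transition_graph(probabilities):
    """Arrange probabilities as a list of weighted outgoing edges per state.

    The key names are hypothetical and chosen for this sketch only.
    """
    return {
        state: [
            {"transition": nxt, "distribution": p}
            for nxt, p in sorted(followers.items())
        ]
        for state, followers in probabilities.items()
    }


# Hypothetical toy data standing in for time series derived from patient records.
patients = [
    ["diagnosis", "surgery", "follow_up"],
    ["diagnosis", "chemotherapy", "follow_up"],
    ["diagnosis", "surgery", "chemotherapy", "follow_up"],
]

probs = learn_markov_transitions(patients)
graph = to_transition_graph(probs)

for state, edges in graph.items():
    print(state, edges)
```

Because every observed pair of consecutive events becomes an edge, a graph built this way grows quickly with the number of distinct events, which is consistent with the abstract's remark that the Markov chain yields extremely large graphs.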

Published in BMC Medical Research Methodology

ISSN: 1471-2288 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General)
Website: http://bmcmedresmethodol.biomedcentral.com
