Synthetic data for privacy-preserving clinical risk prediction

Zhaozhi Qian; Thomas Callender; Bogdan Cebere; Sam M. Janes; Neal Navani; Mihaela van der Schaar

doi:10.1038/s41598-024-72894-y

Scientific Reports (Oct 2024)

Affiliations

Zhaozhi Qian: University of Cambridge
Thomas Callender: University College London
Bogdan Cebere: University of Cambridge
Sam M. Janes: University College London
Neal Navani: University College London
Mihaela van der Schaar: University of Cambridge

Journal volume & issue: Vol. 14, no. 1, pp. 1–14

Abstract

Synthetic data promise privacy-preserving data sharing for healthcare research and development. Compared with other privacy-enhancing approaches, such as federated learning, analyses performed on synthetic data can be applied downstream without modification, such that synthetic data can act in place of real data for a wide range of use cases. However, the role that synthetic data might play in all aspects of clinical model development remains unknown. In this work, we used state-of-the-art generators explicitly designed for privacy preservation to create a synthetic version of ever-smokers in the UK Biobank before building prognostic models for lung cancer under several data release assumptions. We demonstrate that synthetic data can be effectively used throughout the medical prognostic modeling pipeline even without eventual access to the real data. Furthermore, we show the implications of different data release approaches on how synthetic biobank data could be deployed within the healthcare system.

Published in Scientific Reports

ISSN: 2045-2322 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Medicine; Science
Website: https://www.nature.com/srep/
