PLOS Digital Health (Jan 2023)
Synthetic data in health care: A narrative review
Abstract
Data are central to research, public health, and the development of health information technology (IT) systems. Nevertheless, access to most health care data is tightly controlled, which may limit innovation, development, and efficient implementation of new research, products, services, or systems. Using synthetic data is one of many innovative ways that allow organizations to share datasets with a broader range of users. However, only limited literature explores its potential and applications in health care. In this review paper, we examined existing literature to bridge this gap and highlight the utility of synthetic data in health care. We searched PubMed, Scopus, and Google Scholar to identify peer-reviewed articles, conference papers, reports, and theses/dissertations related to the generation and use of synthetic datasets in health care. The review identified seven use cases of synthetic data in health care: a) simulation and prediction research, b) hypothesis, methods, and algorithm testing, c) epidemiology/public health research, d) health IT development, e) education and training, f) public release of datasets, and g) linking data. The review also identified readily and publicly accessible health care datasets, databases, and sandboxes containing synthetic data with varying degrees of utility for research, education, and software development. The review provided evidence that synthetic data are helpful in different aspects of health care and research. While the original real data remain the preferred choice, synthetic data hold possibilities for bridging data access gaps in research and evidence-based policymaking.
Author summary
Synthetic data, or data that are artificially generated, have gained more attention in recent years because of their potential to make timely health care data more accessible for analysis and technology development.
In this paper, we explored how synthetic data are being used by reviewing published literature and by looking at known synthetic datasets that are available to the public. Based on the available literature, we identified three challenges that synthetic data address in making health care data accessible: they protect the privacy of individuals in datasets, they give researchers broader and faster access to health care research data, and they address the lack of realistic data for software development and testing. Users should also be aware of their limitations, which may include a recognized risk of data leakage, dependency on the imputation model, and the fact that not all synthetic data precisely replicate the content and properties of the original dataset. By explaining the utility and value of synthetic data, we hope that this review helps to improve understanding of synthetic data for different applications in research and software development.