Synthetic data generation methods in healthcare: A review on open-source tools and methods

Vasileios C. Pezoulas; Dimitrios I. Zaridis; Eugenia Mylona; Christos Androutsos; Kosmas Apostolidis; Nikolaos S. Tachos; Dimitrios I. Fotiadis

Computational and Structural Biotechnology Journal (Dec 2024)

Affiliations

Vasileios C. Pezoulas: Unit of Medical Technology and Intelligent Information Systems, Dept. of Materials Science and Engineering, University of Ioannina, Ioannina GR45110, Greece; Biomedical Research Institute - FORTH, University of Ioannina, Ioannina GR45110, Greece
Dimitrios I. Zaridis: Unit of Medical Technology and Intelligent Information Systems, Dept. of Materials Science and Engineering, University of Ioannina, Ioannina GR45110, Greece; Biomedical Research Institute - FORTH, University of Ioannina, Ioannina GR45110, Greece; Biomedical Engineering Laboratory, School of Electrical and Computer Engineering, National Technical University of Athens, 9 Iroon Polytechniou St., 15780 Athens, Greece
Eugenia Mylona: Unit of Medical Technology and Intelligent Information Systems, Dept. of Materials Science and Engineering, University of Ioannina, Ioannina GR45110, Greece; Biomedical Research Institute - FORTH, University of Ioannina, Ioannina GR45110, Greece
Christos Androutsos: Unit of Medical Technology and Intelligent Information Systems, Dept. of Materials Science and Engineering, University of Ioannina, Ioannina GR45110, Greece
Kosmas Apostolidis: Unit of Medical Technology and Intelligent Information Systems, Dept. of Materials Science and Engineering, University of Ioannina, Ioannina GR45110, Greece; Biomedical Research Institute - FORTH, University of Ioannina, Ioannina GR45110, Greece
Nikolaos S. Tachos: Unit of Medical Technology and Intelligent Information Systems, Dept. of Materials Science and Engineering, University of Ioannina, Ioannina GR45110, Greece; Biomedical Research Institute - FORTH, University of Ioannina, Ioannina GR45110, Greece
Dimitrios I. Fotiadis: Unit of Medical Technology and Intelligent Information Systems, Dept. of Materials Science and Engineering, University of Ioannina, Ioannina GR45110, Greece; Biomedical Research Institute - FORTH, University of Ioannina, Ioannina GR45110, Greece; Correspondence to: Dept. of Materials Science and Engineering, University of Ioannina, Ioannina GR45110, Greece.

Journal volume & issue: Vol. 23, pp. 2892–2910

Abstract

Synthetic data generation has emerged as a promising solution to the challenges posed by data scarcity and privacy concerns, as well as to the need for training artificial intelligence (AI) algorithms on unbiased data with sufficient sample size and statistical power. Our review explores the application and efficacy of synthetic data methods in healthcare, considering the diversity of medical data. To this end, we systematically searched the PubMed and Scopus databases, focusing on tabular, imaging, radiomics, time-series, and omics data. Studies involving multimodal synthetic data generation were also explored. The type of method used for synthetic data generation was identified in each study and categorized as statistical, probabilistic, machine learning, or deep learning. Emphasis was given to the programming languages used to implement each method. Our evaluation revealed that the majority of the studies use synthetic data generators to: (i) reduce the cost and time required for clinical trials for rare diseases and conditions, (ii) enhance the predictive power of AI models in personalized medicine, (iii) ensure the delivery of fair treatment recommendations across diverse patient populations, and (iv) enable researchers to access high-quality, representative multimodal datasets without exposing sensitive patient information, among others. We underline the wide use of deep learning-based synthetic data generators in 72.6% of the included studies, with 75.3% of the generators implemented in Python. Finally, thorough documentation of open-source repositories is provided to accelerate research in the field.

Published in Computational and Structural Biotechnology Journal

ISSN: 2001-0370 (Online)
Publisher: Elsevier
Country of publisher: Netherlands
LCC subjects: Technology: Chemical technology: Biotechnology
Website: https://www.journals.elsevier.com/computational-and-structural-biotechnology-journal
