International Journal of Population Data Science (Sep 2024)

Designing a utility evaluation framework for synthetic health data

  • Amir Marashi,
  • Sadiq Dohadwalla,
  • Patrick Browne,
  • Ben Hachey

DOI: https://doi.org/10.23889/ijpds.v9i5.2765
Journal volume & issue: Vol. 9, no. 5

Abstract

Objectives: Synthetic data (SD) promises to unlock health data for training, research, and innovation. However, where utility evaluation is performed at all, it is applied ad hoc to a single task of interest. We produce an initial design for a robust benchmark spanning a range of tasks.

Approach: We undertook several projects as a prototyping experiment to gather requirements. These projects replicate previous studies performed on the Medical Information Mart for Intensive Care, a dataset used in more than 4,000 studies. We refine definitions, identify personas, draft a user statement, and collect requirements.

Results:
  • Definitions: We define utility as an extrinsic measure of SD's effect on a larger system, most often assessed by comparing system performance on synthetic versus real data. This contrasts with fidelity, which measures the accuracy of SD through direct comparison to real data.
  • Personas: Data custodian; user of SD; SD researcher.
  • User statement: As a technical stakeholder, I need a reliable way to measure the utility of datasets and a benchmark to compare generation techniques.
  • Requirements: SD researchers can focus on generation, not evaluation; supports comparison and leaderboards; based on relevant data and applications; comprehensive across study types and applications; future-proof for population research requiring linking.

Conclusion: We propose the following design:
  • Data pipelines follow an extract-generate-evaluate workflow.
  • Study types include cross-sectional and longitudinal.
  • Applications include predictive modelling and clinical research.
This results in a comprehensive utility benchmarking suite that complements current frameworks for the fidelity and privacy of SD.
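The extrinsic notion of utility described above is often operationalised as "train on synthetic, test on real" (TSTR): a downstream model is trained on synthetic data and its performance on held-out real data is compared to a train-on-real baseline. The sketch below illustrates this one common measure with simulated tabular data; it is not the paper's benchmark itself, and all cohorts, feature counts, and noise levels are illustrative assumptions.

```python
# Minimal TSTR utility sketch on simulated "health" data.
# Assumption: a degraded, noisier copy stands in for synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def make_cohort(n, noise):
    """Toy tabular cohort: 5 features, binary outcome driven by 3 of them."""
    X = rng.normal(size=(n, 5))
    logits = X @ np.array([1.5, -1.0, 0.5, 0.0, 0.0])
    logits += rng.normal(scale=noise, size=n)
    y = (logits > 0).astype(int)
    return X, y

# "Real" training data, a noisier "synthetic" stand-in, and a real test set.
X_real, y_real = make_cohort(2000, noise=0.5)
X_synth, y_synth = make_cohort(2000, noise=2.0)
X_test, y_test = make_cohort(1000, noise=0.5)

def auc_on_real_test(X_train, y_train):
    """Train a downstream model, evaluate it on the *real* test set."""
    model = LogisticRegression().fit(X_train, y_train)
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

auc_trtr = auc_on_real_test(X_real, y_real)    # train-real / test-real baseline
auc_tstr = auc_on_real_test(X_synth, y_synth)  # train-synthetic / test-real
utility_ratio = auc_tstr / auc_trtr            # near 1.0 => high task utility
print(f"TRTR AUC={auc_trtr:.3f}  TSTR AUC={auc_tstr:.3f}  ratio={utility_ratio:.3f}")
```

A benchmark generalises this single comparison across many extract-generate-evaluate pipelines, study types, and applications, so that generation techniques can be ranked on a leaderboard rather than judged on one ad hoc task.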