Patterns (Jul 2021)

Privacy-preserving data sharing via probabilistic modeling

  • Joonas Jälkö,
  • Eemil Lagerspetz,
  • Jari Haukka,
  • Sasu Tarkoma,
  • Antti Honkela,
  • Samuel Kaski

Journal volume & issue
Vol. 2, no. 7
p. 100271

Abstract

Summary: Differential privacy allows quantifying the privacy loss that results from accessing sensitive personal data. Repeated accesses to the underlying data incur an increasing loss. Releasing the data as privacy-preserving synthetic data would avoid this limitation, but leaves open the problem of what kind of synthetic data to design. We propose formulating the problem of private data release as one of probabilistic modeling. This approach turns the design of the synthetic data into the choice of a model for the data, and also allows prior knowledge to be incorporated, which improves the quality of the synthetic data. We demonstrate empirically, in an epidemiological study, that statistical discoveries can be reliably reproduced from the synthetic data. We expect the method to see broad use in creating high-quality anonymized data twins of key datasets for research.

The bigger picture: Open data are a key component of open science. Unrestricted access to datasets would be necessary for the transparency and reproducibility that the scientific method requires. So far, openness has been at odds with privacy requirements, which has prevented the release of sensitive data even after pseudonymization, since pseudonymization does not protect against privacy breaches that exploit side information. A recent solution to the data-sharing problem is to release synthetic data drawn from privacy-preserving generative models. We propose to interpret privacy-preserving data sharing as a modeling task, which allows us to incorporate prior knowledge of the data-generation process into the generator model using modern probabilistic modeling methods. We demonstrate that this can significantly increase the utility of the generated data.
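
The release pattern described in the abstract (fit a generative model to the sensitive data under differential privacy, then publish samples drawn from it) can be illustrated with a minimal sketch. The sketch below is not the authors' method, which builds the generator with probabilistic modeling and differentially private inference; it is a simplified, hypothetical stand-in that privatizes the sufficient statistics of an independent-Gaussians model with the Laplace mechanism and samples synthetic records from the result. All function names and parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_synthetic_gaussian(data, epsilon, lower, upper, n_synth):
    """Fit an independent-Gaussian generative model to `data` under
    epsilon-differential privacy (Laplace mechanism on clipped sums)
    and sample `n_synth` synthetic records from it.

    Illustrative only; assumes 0 <= lower <= upper for every feature.
    """
    n, d = data.shape
    clipped = np.clip(data, lower, upper)      # bound each record's influence

    # Split the privacy budget between the two released statistics;
    # by sequential composition the total cost is `epsilon`.
    eps_sum, eps_sq = epsilon / 2.0, epsilon / 2.0

    # L1 sensitivity of the vector of per-feature sums is the sum of ranges.
    sens_sum = np.sum(upper - lower)
    noisy_sum = clipped.sum(axis=0) + rng.laplace(scale=sens_sum / eps_sum, size=d)

    # With non-negative bounds, x**2 ranges over [lower**2, upper**2].
    sens_sq = np.sum(upper**2 - lower**2)
    noisy_sq_sum = (clipped**2).sum(axis=0) + rng.laplace(scale=sens_sq / eps_sq, size=d)

    mean = noisy_sum / n
    var = np.maximum(noisy_sq_sum / n - mean**2, 1e-6)  # keep variances positive

    # Sampling uses only the noisy statistics, so the synthetic data
    # inherit the DP guarantee via post-processing.
    return rng.normal(mean, np.sqrt(var), size=(n_synth, d))

# Toy usage: 1,000 two-dimensional records with values assumed to lie in [0, 10].
private_data = rng.uniform(0.0, 10.0, size=(1000, 2))
synthetic = dp_synthetic_gaussian(private_data, epsilon=1.0,
                                  lower=np.array([0.0, 0.0]),
                                  upper=np.array([10.0, 10.0]),
                                  n_synth=1000)
```

Because the synthetic records are generated purely from the noisy statistics, the differential privacy guarantee carries over by post-processing: the synthetic data can be analyzed and re-shared without incurring further privacy loss, which is the limitation of repeated data access that the summary highlights.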

Keywords