IEEE Access (Jan 2022)

Utility-Embraced Microaggregation for Machine Learning Applications

  • Soobin Lee,
  • Won-Yong Shin

DOI
https://doi.org/10.1109/ACCESS.2022.3183201
Journal volume & issue
Vol. 10
pp. 64535 – 64546

Abstract

With access to vast amounts of data, privacy protection is more important than ever. Among various de-identification (anonymization) techniques, $k$-anonymous microaggregation has been widely studied since it enables a balance between confidentiality and data utility. Although many microaggregation methods have been proposed to reduce information loss and/or computational complexity, machine learning (ML) models trained on the resulting aggregated data are often less effective than expected. Motivated by the fact that ML models can be heavily influenced by even slightly distorted training data, we consider the performance of microaggregation in terms of not only data privacy but also data utility. In this paper, we propose Util-MA, a new utility-embraced microaggregation framework for effective ML applications. Specifically, unlike prior studies that apply microaggregation techniques directly to raw data, we design a unified framework that can potentially enhance data utility while preserving $k$-anonymity through preprocessing steps including dimensionality reduction and clustering. Using real-world datasets, we empirically demonstrate the superiority of Util-MA over benchmark microaggregation methods in terms of classification accuracy. Moreover, we investigate the importance of preprocessing by measuring key performance indicators (KPIs) of clustering; the clustering stage of Util-MA leads to high classification performance when the clustering results substantially coincide with the ground truth labels. We also establish a close relationship between the KPIs of clustering and the classification accuracies, which tends to emerge when Util-MA shows a gain over the benchmark method. Our framework is microaggregation-model-agnostic; thus, the underlying microaggregation model can be chosen according to one's needs and ML tasks.
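
The preprocess-then-aggregate idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the specific dimensionality reduction, clustering, or microaggregation methods, so PCA, plain k-means, and a simple fixed-size grouping are assumed here as stand-ins, and every function and parameter name (`pca_reduce`, `kmeans`, `microaggregate`, `util_ma_sketch`, `n_components`, `n_clusters`) is hypothetical.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project centered data onto its top principal components (assumed reducer)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def kmeans(Z, n_clusters, n_iter=50, seed=0):
    """Plain Lloyd's k-means (assumed clustering step); returns a label per record."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        dist = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):
                centers[c] = Z[labels == c].mean(axis=0)
    return labels

def microaggregate(X, k, order_key):
    """Fixed-size microaggregation (assumed stand-in): sort records by order_key,
    cut them into groups of at least k, and replace each record by its group centroid."""
    n = len(X)
    if n < k:
        raise ValueError("need at least k records")
    order = np.argsort(order_key, kind="stable")
    n_groups = n // k
    out = np.empty_like(X, dtype=float)
    group_ids = np.empty(n, dtype=int)
    for g in range(n_groups):
        start = g * k
        stop = (g + 1) * k if g < n_groups - 1 else n  # last group absorbs the remainder
        idx = order[start:stop]
        out[idx] = X[idx].mean(axis=0)
        group_ids[idx] = g
    return out, group_ids

def util_ma_sketch(X, k, n_components=2, n_clusters=3, seed=0):
    """Reduce dimensionality, cluster, then microaggregate inside each cluster."""
    X = np.asarray(X, dtype=float)
    Z = pca_reduce(X, n_components)
    labels = kmeans(Z, n_clusters, seed=seed)
    parts = [np.flatnonzero(labels == c) for c in range(n_clusters)]
    big = [p for p in parts if len(p) >= k]
    small = [p for p in parts if 0 < len(p) < k]
    if small:
        # Clusters smaller than k cannot be k-anonymous on their own: pool them.
        pool = np.concatenate(small)
        if len(pool) >= k:
            big.append(pool)
        elif big:
            largest = max(range(len(big)), key=lambda i: len(big[i]))
            big[largest] = np.concatenate([big[largest], pool])
        else:
            raise ValueError("fewer than k records in total")
    out = np.empty_like(X)
    group_ids = np.empty(len(X), dtype=int)
    offset = 0
    for idx in big:
        agg, gid = microaggregate(X[idx], k, Z[idx, 0])
        out[idx] = agg
        group_ids[idx] = gid + offset
        offset += gid.max() + 1
    return out, group_ids
```

Calling `util_ma_sketch(X, k=3)` on a numeric array `X` returns the aggregated records together with a group index per record. Every group holds at least `k` records sharing identical values, which is the $k$-anonymity property on the aggregated attributes. Since the framework is described as microaggregation-model-agnostic, `microaggregate` is the piece one would swap for a different method.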

Keywords