Algorithms (Jul 2024)
A Histogram Publishing Method under Differential Privacy That Involves Balancing Small-Bin Availability First
Abstract
Differential privacy, a cornerstone of privacy-preserving techniques, plays an indispensable role in ensuring the secure handling and sharing of sensitive data analysis across domains such as in census, healthcare, and social networks. Histograms, serving as a visually compelling tool for presenting analytical outcomes, are widely employed in these sectors. Currently, numerous algorithms for publishing histograms under differential privacy have been developed, striving to balance privacy protection with the provision of useful data. Nonetheless, the pivotal challenge concerning the effective enhancement of precision for small bins (those intervals that are narrowly defined or contain a relatively small number of data points) within histograms has yet to receive adequate attention and in-depth investigation from experts. In standard DP histogram publishing, adding noise without regard for bin size can result in small data bins being disproportionately influenced by noise, potentially severely impairing the overall accuracy of the histogram. In response to this challenge, this paper introduces the SReB_GCA sanitization algorithm designed to enhance the accuracy of small bins in DP histograms. The SReB_GCA approach involves sorting the bins from smallest to largest and applying a greedy grouping strategy, with a predefined lower bound on the mean relative error required for a bin to be included in a group. Our theoretical analysis reveals that sorting bins in ascending order prior to grouping effectively prioritizes the accuracy of smaller bins. SReB_GCA ensures strict ϵ-DP compliance and strikes a careful balance between reconstruction error and noise error, thereby not only initially improving the accuracy of small bins but also approximately optimizing the mean relative error of the entire histogram. To validate the efficiency of our proposed SReB_GCA method, we conducted extensive experiments using four diverse datasets, including two real-life datasets and two synthetic ones. The experimental results, quantified by the Kullback–Leibler Divergence (KLD), show that the SReB_GCA algorithm achieves substantial performance enhancement compared to the baseline method (DP_BASE) and several other established approaches for differential privacy histogram publication.
Keywords