IEEE Open Journal of the Computer Society (Jan 2022)
Re-Identification in Differentially Private Incomplete Datasets
Abstract
Efforts to counter COVID-19 reaffirmed the importance of rich medical, behavioral, and sociological data. To make such data available to the many researchers who conduct statistical analyses and machine learning, personally identifiable information must be removed to protect individual privacy. In particular, it is considered essential to remove explicit identifiers, to sample the data from the population, and to apply differential privacy, the de facto standard privacy metric. Although it is generally believed that the risk of re-identification is negligible once these techniques have been applied, this study shows that for some data the risk remains highly significant. This study proposes an algorithm that estimates, from an incomplete, differentially private database, the number of people in the population who share a given combination of attribute values. If the estimate is one, it is likely that only one person in the population has that combination, and the probability of re-identification is therefore high. This study concludes that the re-identification risk must be evaluated even after state-of-the-art privacy-protection techniques have been applied.
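As a minimal illustration of this uniqueness argument (not the algorithm proposed in this study), the sketch below assumes a curator who releases a count distorted only by Bernoulli sampling at a known rate and by Laplace noise of scale 1/epsilon, and it naively inverts both distortions with a point estimate. All function names, parameters, and toy numbers are hypothetical.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Draw a zero-mean Laplace sample via the inverse-CDF method."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def estimate_population_count(noisy_count: float, sampling_rate: float) -> int:
    """
    Naively invert the two distortions applied to a released count:
    Laplace noise (zero mean, so negative counts are simply clamped)
    and Bernoulli sampling at a known rate (undone by rescaling).
    """
    denoised = max(noisy_count, 0.0)
    return max(0, round(denoised / sampling_rate))

# Toy scenario (hypothetical numbers): one person in the population has a
# rare combination of attribute values; the curator samples 10% of the
# records and adds Laplace(1/epsilon) noise to the released count.
true_population_count = 1
sampling_rate = 0.10
epsilon = 1.0

sampled_count = sum(random.random() < sampling_rate
                    for _ in range(true_population_count))
published_count = sampled_count + laplace_noise(1.0 / epsilon)

estimate = estimate_population_count(published_count, sampling_rate)
print(f"published count = {published_count:.2f}, "
      f"estimated population count = {estimate}")
if estimate <= 1:
    print("The combination is plausibly unique in the population: "
          "high re-identification risk.")
```

When the estimated population count is one (or zero), the attribute combination is plausibly unique in the population, which is the situation the abstract identifies as carrying a high probability of re-identification.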
Keywords