Journal of Big Data (Nov 2022)

Privacy preserved incremental record linkage

  • Shahidul Islam Khan,
  • Abir Bin Ayub Khan,
  • Abu Sayed Md Latiful Hoque

DOI
https://doi.org/10.1186/s40537-022-00655-7
Journal volume & issue
Vol. 9, no. 1
pp. 1 – 27

Abstract

Read online

Abstract Using an incremental approach to solve the record linkage problem is a relatively new research area. In incremental record linkage, every inserted record is compared with some existing clusters of records based on its blocking key value. Then, considering similarity, either the record will be put into an existing cluster, or a new cluster will be created for it. Although few papers have presented their solutions for incremental record linkage targeting the linkage quality or efficiency, privacy issue regarding the approach has not yet been discussed. Privacy is a major concern when record linkage is performed for sensitive data, e.g., health records, financial records, etc. In this regard, we have come up with a novel concept privacy-preserving incremental record linkage (PPiRL) which encapsulates privacy-preserving techniques with an incremental record linkage approach. In this chapter, we have proposed an end-to-end framework as our solution for PPiRL. For preserving privacy, we have used two types of privacy techniques namely phonetic encoding and generalization. We have used a recently developed phonetic algorithm “nameGist” to handle text-based features. For generalization, we have used the K-anonymization algorithm for numeric and categorical features. For handling incremental updates and internal linkage, we have used the Naive incremental clustering approach using Hierarchical Agglomerative clustering as the base clustering algorithm. We have performed various experiments to test the privacy and linkage quality of PPiRL. We have compared our work with the existing incremental record linkage framework and also with existing privacy-preserved record linkage techniques. It is apparent from our results that other than a small trade-off in linkage quality, our framework works better as a combined package of privacy and linkage solutions that any existing frameworks do not yet provide.

Keywords