IEEE Access (Jan 2024)

KDPII: A New Korean Dialogic Dataset for the Deidentification of Personally Identifiable Information

  • Li Fei,
  • Yejee Kang,
  • Seoyoon Park,
  • Yeonji Jang,
  • Jongkyu Lee,
  • Hansaem Kim

DOI
https://doi.org/10.1109/ACCESS.2024.3461804
Journal volume & issue
Vol. 12
pp. 135626 – 135641

Abstract

The rapid growth of social media in the era of big data and artificial intelligence has raised significant safety concerns about the communication of sensitive personal information. As awareness of the importance of privacy grows, there is increasing advocacy for using language modeling technology to mitigate the risk of personal information leakage and to deidentify sensitive information as the situation requires. Several theoretical analyses of privacy protection have been conducted in Korea thus far. However, the development of language model training resources for Korean has been slower than that for widely spoken languages such as English and Chinese. To address this problem, we developed a comprehensive and organized framework for classifying Korean personally identifiable information (PII) by investigating pertinent examples, such as “Text Anonymization Benchmark” and “Network Intrusion Detection Dataset,” from within and outside Korea. Subsequently, we created a new Korean dataset for PII deidentification, KDPII, which consists of many conversational texts containing abundant Korean PII. Using this dataset, we examined the Korean PII processing performance of many representative language models that are available on the market. We found that although the performance of language models in identifying PII varied by model size, model architecture, and training source, most of them were significantly better at recognizing universal PII than language-specific PII. This result indicates a prospective direction for future work: expanding training data to implement Korean-specific PII deidentification.

Keywords