Detecting Personally Identifiable Information Through Natural Language Processing: A Step Forward

Luca Mainetti; Andrea Elia

doi:10.3390/asi8020055

Applied System Innovation (Apr 2025)

Affiliations

Luca Mainetti: Department of Engineering for Innovation, University of Salento, 73100 Lecce, Italy
Andrea Elia: Faculty of Engineering, University of Salento, 73100 Lecce, Italy

DOI: https://doi.org/10.3390/asi8020055
Journal volume & issue: Vol. 8, no. 2, p. 55

Abstract

The protection of personally identifiable information (PII) is increasingly demanded by customers and by governments through data protection regulations. Private and public organizations store and exchange over the Internet large amounts of data that include the personal information of users, employees, and customers. Discovering PII in a large unstructured text corpus remains challenging, and much research has focused on methods and tools for detecting PII in real-time scenarios and for discovering data exfiltration attacks. In those research efforts, natural language processing (NLP)-based schemas are widely adopted. Our work combines NLP with deep learning to identify PII in unstructured texts. NLP is used to extract the semantic information and the syntactic structure of the text, which are then processed by a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model. We achieved high performance in detecting PII, reaching an accuracy of 99.558%, an improvement of 7.47 percentage points over the state-of-the-art model that we analyzed. However, the experimental results show that there is still room to improve accuracy in detecting PII, including by working on a new, balanced, and higher-quality training dataset for pre-trained models. Our contributions encourage researchers to enhance NLP-based PII detection models and practitioners to turn those models into privacy detection tools to be deployed in security operation centers.
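
As a rough reading of the two figures quoted above, the short Python sketch below derives the baseline accuracy they imply and the corresponding change in error rate. It assumes the 7.47-point gain is measured with the same accuracy metric on the same test data, which the abstract does not state. The variable names are illustrative and do not come from the paper.

```python
# Figures quoted in the abstract (percent and percentage points).
reported_accuracy = 99.558
improvement_pp = 7.47

# Implied accuracy of the compared model.
# Assumption: both accuracies use the same metric and test data.
baseline_accuracy = reported_accuracy - improvement_pp

# Error rates are the complements of the accuracies.
reported_error = 100.0 - reported_accuracy
baseline_error = 100.0 - baseline_accuracy

# Share of the baseline's errors removed by the reported model.
relative_error_reduction = (baseline_error - reported_error) / baseline_error

print(f"implied baseline accuracy: {baseline_accuracy:.3f}%")
print(f"error rate: {baseline_error:.3f}% -> {reported_error:.3f}%")
print(f"relative error reduction: {relative_error_reduction:.1%}")
```

Under that assumption, a gain in percentage points is a plain difference between two accuracies, which is why the baseline is recovered by subtraction and not by dividing by a growth factor.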

Published in Applied System Innovation

ISSN: 2571-5577 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Technology (General): Industrial engineering. Management engineering: Applied mathematics. Quantitative methods
Website: https://www.mdpi.com/journal/asi
