A hybrid rule-based NLP and machine learning approach for PII detection and anonymization in financial documents

Kushagra Mishra; Harsh Pagare; Kanhaiya Sharma

doi:10.1038/s41598-025-04971-9

Scientific Reports (Jul 2025)


Affiliations

Kushagra Mishra: Department of Computer Science & Engineering, Symbiosis Institute of Technology, Constituent of Symbiosis International (Deemed University)
Harsh Pagare: Department of Computer Science & Engineering, Symbiosis Institute of Technology, Constituent of Symbiosis International (Deemed University)
Kanhaiya Sharma: Department of Computer Science & Engineering, Symbiosis Institute of Technology, Constituent of Symbiosis International (Deemed University)

Journal volume & issue: Vol. 15, no. 1
pp. 1 – 27

Abstract


Safeguarding Personally Identifiable Information (PII) in financial documents is essential to prevent data breaches and maintain regulatory compliance. This research presents a scalable hybrid approach that integrates rule-based Natural Language Processing (NLP), Machine Learning (ML) techniques, and a custom Named Entity Recognition (NER) model to detect and anonymize PII accurately. A varied synthetic dataset was created to replicate genuine financial document formats and was used for model training and assessment. On the synthetic datasets, the model attained a precision of 94.7%, a recall of 89.4%, an F1-score of 91.1%, and an overall accuracy of 89.4%. Additional validation on actual financial documents, such as audit reports and vendor bills, showed consistent performance, with an accuracy of 93%. The study evaluates the model with confusion matrices, ROC curves, and precision-recall curves, which further support its capabilities and generalization ability. The proposed approach provides a robust and efficient solution for protecting sensitive information in operational financial contexts and markedly improves on current methods for PII protection.
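To make the rule-based part of such a pipeline concrete, the following is a minimal sketch in Python. It is not the authors' implementation: the abstract does not list their rules, so the patterns, the `PII_PATTERNS` table, the placeholder format, and the `anonymize` function are all hypothetical and chosen only for illustration. The ML and custom NER stages described above are not shown.

```python
import re

# Hypothetical rules for illustration only; the paper's actual patterns are
# not given in the abstract. Order matters: longer digit patterns run first
# so that a shorter pattern cannot match inside them.
PII_PATTERNS = {
    "CARD": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\b\d{3}[ -]\d{3}[ -]\d{4}\b"),
}


def anonymize(text):
    """Replace each rule match with a [LABEL] placeholder.

    Returns the redacted text and a dict of match counts per label.
    """
    counts = {}
    for label, pattern in PII_PATTERNS.items():
        # subn returns the new string and the number of replacements made.
        text, n = pattern.subn(f"[{label}]", text)
        counts[label] = n
    return text, counts


if __name__ == "__main__":
    sample = "Contact jane.doe@example.com or 555-123-4567. Card 4111 1111 1111 1111."
    redacted, counts = anonymize(sample)
    print(redacted)
    print(counts)
```

In a hybrid design of the kind the abstract describes, rules like these would typically handle well-structured identifiers, while a learned NER model would cover free-form entities such as names and addresses that fixed patterns cannot capture.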

Published in Scientific Reports

ISSN: 2045-2322 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Medicine; Science
Website: https://www.nature.com/srep/
