Applied Sciences (Apr 2025)
Automated Redaction of Personally Identifiable Information on Drug Labels Using Optical Character Recognition and Large Language Models for Compliance with Thailand’s Personal Data Protection Act
Abstract
The rapid adoption of artificial intelligence (AI) across industries presents both opportunities and challenges, particularly for personal data privacy. With the enforcement of regulations such as Thailand’s Personal Data Protection Act (PDPA), organizations face increasing pressure to protect sensitive information in diverse data sources, including product and shipping labels. These labels, often processed by AI systems for logistics and inventory management, frequently contain Personally Identifiable Information (PII). This paper introduces a novel AI-driven system for automated PII redaction on label images, designed specifically to facilitate PDPA compliance. The system employs a two-stage pipeline: (1) text extraction using a combination of the EasyOCR and Tesseract OCR engines, to improve recall for both Thai and English text; and (2) redaction using a pre-trained large language model (LLM), Qwen (Qwen/Qwen2.5-72B-Instruct-AWQ), prompted to classify each text segment as PII or non-PII based on simplified PDPA guidelines. Identified PII is then automatically redacted via black masking. We evaluated the system on a dataset of 100 drug label images, achieving a redaction precision of 92.5%, a recall of 83.2%, and an F1-score of 87.6%, with an over-redaction rate of 3.1%. These results indicate that the system redacts most PII accurately while preserving the utility of non-sensitive label information. This research contributes a practical, scalable solution for automated PDPA compliance in AI-driven label processing, mitigating privacy risks and promoting responsible AI adoption.
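The two-stage pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the two OCR outputs are combined, so the overlap-based merge (`merge_detections`, with an assumed IoU threshold of 0.5) is an assumption. The OCR results are hard-coded stand-ins for EasyOCR and Tesseract output, and `toy_is_pii` is a hypothetical placeholder for the LLM classification step. All function names are illustrative.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]   # (x0, y0, x1, y1), with x1 and y1 exclusive
Detection = Tuple[Box, str]       # a bounding box and the text read inside it


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


def merge_detections(primary: List[Detection], secondary: List[Detection],
                     iou_threshold: float = 0.5) -> List[Detection]:
    """Keep all primary detections; add secondary ones that overlap none kept.

    Assumption: the source does not specify the merge rule for the two engines.
    """
    merged = list(primary)
    for box, text in secondary:
        if all(iou(box, kept) < iou_threshold for kept, _ in merged):
            merged.append((box, text))
    return merged


def redact(image: List[List[int]], detections: List[Detection],
           is_pii: Callable[[str], bool]) -> List[List[int]]:
    """Return a copy of a grayscale image with PII boxes filled with black (0)."""
    out = [row[:] for row in image]
    height = len(out)
    width = len(out[0]) if out else 0
    for (x0, y0, x1, y1), text in detections:
        if not is_pii(text):
            continue
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                out[y][x] = 0
    return out


def toy_is_pii(text: str) -> bool:
    """Hypothetical stand-in for the LLM classifier; flags 'Name:' fields only."""
    return text.lower().startswith("name:")


# Hard-coded stand-ins for the output of two OCR engines on one label image.
engine_a = [((0, 0, 5, 2), "Name: Somchai"), ((0, 3, 6, 5), "Paracetamol 500 mg")]
engine_b = [((0, 0, 5, 2), "Name: Somchai"), ((6, 0, 10, 2), "Take twice daily")]

label = [[255] * 10 for _ in range(6)]   # blank white 10x6 "image"
merged = merge_detections(engine_a, engine_b)
redacted = redact(label, merged, toy_is_pii)
```

In this toy run the duplicate name box from the second engine is dropped, the dosage instruction found only by the second engine is kept, and only the name region is masked. A real system would replace the hard-coded lists with actual OCR calls and `toy_is_pii` with a prompt to the LLM.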
Keywords