IEEE Access (Jan 2024)
Zero and Few-Shot Learning Using Large Language Models for De-Identification of Medical Records
Abstract
This paper evaluates and compares the performance and fine-tuning cost of several Large Language Models (LLMs), including GPT-3.5, GPT-4, PaLM, Bard, and Llama, in automating the de-identification of Protected Health Information (PHI) from medical records, preserving the privacy of patients and healthcare professionals. Zero-shot learning was used first to assess each LLM's ability to de-identify medical data; each model was then fine-tuned with training sets of varying sizes to measure the resulting change in performance. The study also investigates how the specificity of prompts affects de-identification accuracy. Fine-tuning LLMs on specific examples significantly improved de-identification accuracy, surpassing the zero-shot accuracy of the pre-trained counterparts. Notably, a GPT-3.5 model fine-tuned with a few-shot learning technique exceeded the performance of a zero-shot GPT-4 model, reaching 99% accuracy. Detailed prompts yielded higher task accuracy across all models, yet fine-tuned models given brief instructions still outperformed pre-trained models given detailed prompts. The fine-tuned models were also more resilient to changes in medical record format than the zero-shot models. Code, calculations, and comparisons are available at https://github.com/YashwanthYS/De-Identification-of-medical-Records. The findings underscore the potential of LLMs, particularly when fine-tuned, to automate the de-identification of PHI in medical records effectively, and highlight the importance of model training and prompt specificity in achieving high accuracy in de-identification tasks.
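The prompt-specificity comparison described in the abstract can be sketched as a small helper that builds either a brief or a detailed de-identification prompt. This is an illustrative assumption only: the function name, the `[REDACTED]` placeholder, and the PHI category list are hypothetical and are not the authors' actual prompts.

```python
# Hypothetical prompt builder contrasting a brief instruction with a
# detailed prompt that enumerates PHI categories (illustrative only;
# not the prompts used in the paper).

PHI_CATEGORIES = [
    "patient name", "clinician name", "date of birth",
    "address", "phone number", "medical record number",
]

def build_deid_prompt(record: str, detailed: bool = True) -> str:
    """Return a de-identification prompt for an LLM, brief or detailed."""
    if detailed:
        categories = ", ".join(PHI_CATEGORIES)
        instruction = (
            "Replace every instance of the following PHI categories "
            f"with the placeholder [REDACTED]: {categories}."
        )
    else:
        instruction = "Remove all PHI from the record below."
    return f"{instruction}\n\nRecord:\n{record}"
```

Either prompt string would then be sent to the model under test (e.g. via each vendor's chat API), with the detailed variant expected to yield higher accuracy for pre-trained models, per the abstract's findings.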
Keywords