International Journal of Population Data Science (Sep 2024)

Automated Translation of Chronic Disease Diagnosis Codes using the ChatGPT Large Language Model

  • Barret Monchka,
  • Hassan Maleki Golandouz,
  • Lisa Lix,
  • Amani Hamad

DOI
https://doi.org/10.23889/ijpds.v9i5.2759
Journal volume & issue
Vol. 9, no. 5

Abstract

Background: The International Classification of Diseases (ICD) is revised over time, and region-specific versions exist, including ICD-10-CA (Canada) and ICD-9-CM (USA). Studies spanning multiple ICD versions require crosswalks to translate diagnosis codes between versions, but manual crosswalk development is costly and requires clinical expertise.

Objective: To evaluate the accuracy of a pre-trained large language model (LLM) in automatically translating chronic disease diagnosis codes from ICD-10-CA to ICD-9-CM.

Approach: Eight prompts were developed to instruct the OpenAI Generative Pre-trained Transformer 4 (GPT-4) LLM to translate 1,272 ICD-10-CA codes for the Elixhauser Comorbidity Index to ICD-9-CM. Prompt accuracy (%) was measured against a crosswalk developed by the Canadian Institute for Health Information. Variability was assessed by replicating each prompt three times. Mean accuracy ± standard deviation was reported for each prompt across replications, for both five-digit and truncated three-digit codes.

Results: The highest prompt performance was observed when assigning the persona of a medical coding specialist (40.8% ± 0.9%), requesting justification for the selected code (41.4% ± 1.1%), and providing diagnosis code labels (47.5% ± 0.7%). For truncated three-digit codes, these prompts achieved accuracies of 82.0% ± 0.5%, 80.8% ± 0.9%, and 82.7% ± 0.1%, respectively. Combining these three prompting techniques marginally improved accuracy, to 48.6% ± 0.7% for five-digit codes and 84.3% ± 0.2% for truncated three-digit codes.

Conclusion: General-purpose LLMs are not yet sufficiently accurate to automate ICD code translation for chronic diseases.

Implications: Additional experiments with fine-tuning, task-specific training, and prompt engineering are needed to improve accuracy and reduce variability.
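The evaluation design described in the Approach (scoring LLM output against a reference crosswalk, at both full-code and truncated three-digit granularity, with mean ± standard deviation across three replications) can be sketched as follows. This is an illustrative sketch only, not the authors' code: the crosswalk entries and replication outputs below are hypothetical toy data.

```python
# Sketch of crosswalk-based accuracy scoring (hypothetical data, not the study's).
from statistics import mean, stdev

# Hypothetical reference crosswalk: ICD-10-CA source code -> ICD-9-CM target code
crosswalk = {"I10": "401.9", "E11.9": "250.00", "J44.9": "496"}

# Hypothetical LLM translations from three replications of one prompt
replications = [
    {"I10": "401.9", "E11.9": "250.00", "J44.9": "491.21"},
    {"I10": "401.9", "E11.9": "250.02", "J44.9": "496"},
    {"I10": "401.1", "E11.9": "250.00", "J44.9": "496"},
]

def truncate(code: str) -> str:
    """Keep only the three-digit ICD-9-CM category (the part before the period)."""
    return code.split(".")[0]

def accuracy(predicted: dict, reference: dict, digits3: bool = False) -> float:
    """Percent of source codes whose predicted target matches the crosswalk."""
    norm = truncate if digits3 else (lambda c: c)
    hits = sum(norm(predicted[k]) == norm(reference[k]) for k in reference)
    return 100.0 * hits / len(reference)

# Mean accuracy +/- standard deviation across replications, as in the abstract
full = [accuracy(r, crosswalk) for r in replications]
trunc = [accuracy(r, crosswalk, digits3=True) for r in replications]
print(f"five-digit codes: {mean(full):.1f}% ± {stdev(full):.1f}%")
print(f"three-digit codes: {mean(trunc):.1f}% ± {stdev(trunc):.1f}%")
```

Truncating to the three-digit category credits translations that land in the right disease group but miss the exact subcode, which is why the abstract's three-digit accuracies are substantially higher than the five-digit ones.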