IEEE Access (Jan 2024)

Evaluation and Analysis of Large Language Models for Clinical Text Augmentation and Generation

  • Atif Latif,
  • Jihie Kim

DOI
https://doi.org/10.1109/ACCESS.2024.3384496
Journal volume & issue
Vol. 12
pp. 48987 – 48996

Abstract

A major challenge in training deep learning (DL) models is data scarcity, which is common in domains that AI research has explored less, such as clinical text and low-resource languages. In this paper, we investigate the generation capability of large language models, namely the Text-To-Text Transfer Transformer (T5) and Bidirectional and Auto-Regressive Transformers (BART), on the Clinical Health-Aware Reasoning across Dimensions (CHARDAT) dataset when ChatGPT is used for data augmentation. We employed ChatGPT to rephrase each instance of the training set into conceptually similar but semantically different samples and added these samples to the dataset. This study aims to investigate the use of large language models, ChatGPT in particular, for data augmentation to overcome the limited availability of data in the clinical domain. In addition to ChatGPT augmentation, we applied other augmentation techniques to the clinical data, namely Easy Data Augmentation (EDA) and An Easier Data Augmentation (AEDA). ChatGPT comprehended the contextual significance of sentences within the dataset and successfully modified general English terms but not clinical terms. The original CHARDAT dataset covers 52 health conditions across three clinical dimensions: Treatments, Risk Factors, and Preventions. We compared the outputs of the different augmentation techniques and evaluated their relative performance. Additionally, we examined how these techniques perform with different pre-trained language models, assessing their sensitivity in various contexts. Despite the relatively small size of the CHARDAT dataset, our results demonstrated that augmentation methods such as ChatGPT augmentation outperformed the previously employed back-translation augmentation. Specifically, the BART model achieved the best performance, with ROUGE-1, ROUGE-2, and ROUGE-L scores of 52.35, 41.59, and 50.71, respectively.
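
As an illustration of one of the baseline techniques named in the abstract, the following is a minimal sketch of AEDA-style augmentation, which inserts punctuation marks at random positions in a sentence while leaving the words themselves unchanged. It is not the authors' implementation: the function name `aeda`, the `ratio` and `rng` parameters, and the example sentence are assumptions made for illustration, and the six punctuation marks and the one-third insertion ratio follow the commonly described AEDA setup rather than anything stated in this abstract.

```python
import random

# Punctuation marks commonly used for AEDA-style insertion (assumed here).
PUNCTUATION = [".", ";", "?", ":", "!", ","]


def aeda(sentence, ratio=1 / 3, rng=None):
    """Return a copy of `sentence` with random punctuation marks inserted.

    `ratio` caps the number of insertions relative to the word count.
    `rng` is an optional random.Random instance for reproducibility.
    Both parameters are illustrative assumptions, not from the paper.
    """
    rng = rng or random.Random()
    words = sentence.split()
    if not words:
        return sentence

    # Insert between 1 and ratio * len(words) marks (at least one).
    max_inserts = max(1, int(ratio * len(words)))
    n_inserts = rng.randint(1, max_inserts)

    # Pick distinct word indices; a mark is placed before each chosen word.
    positions = set(rng.sample(range(len(words)), n_inserts))

    augmented = []
    for i, word in enumerate(words):
        if i in positions:
            augmented.append(rng.choice(PUNCTUATION))
        augmented.append(word)
    return " ".join(augmented)


if __name__ == "__main__":
    # Made-up example sentence, not taken from the CHARDAT dataset.
    example = "regular exercise may lower the risk of hypertension"
    print(aeda(example, rng=random.Random(0)))
```

Because only punctuation is added, clinical terms in the input are preserved verbatim, which is one reason such a technique is a natural baseline against paraphrase-based augmentation.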

Keywords