Applied Sciences (Sep 2024)

Language Model-Based Text Augmentation System for Cerebrovascular Disease Related Medical Report

  • Yu-Hyeon Kim,
  • Chulho Kim,
  • Yu-Seop Kim

DOI
https://doi.org/10.3390/app14198652
Journal volume & issue
Vol. 14, no. 19
p. 8652

Abstract

Medical texts contain sensitive information, which limits their use in AI research. As a result, there is growing interest in generating synthetic text to enlarge the medical text data available for text-based medical AI research. This paper proposes a text augmentation system for cerebrovascular disease reports that combines a synthetic text generation model based on DistilGPT2 with a classification model based on BioBERT. The generation model produces synthetic text from subsets of 5000, 10,000, 15,000, and 20,000 reports randomly extracted from a collection of 73,671 reports. The classification model is fine-tuned on the entire report set and then used to annotate the synthetic text, yielding a new labeled dataset. We then fine-tuned a classification model on each original subset while incrementally increasing the amount of augmented data added to it. Experimental results show that adding augmented data during fine-tuning improves model performance by up to 20%. We also found that generating a large amount of synthetic text is not necessarily required for better performance, and that the appropriate amount of augmentation depends on the size of the original data. By automating the annotation task and generating meaningful synthetic text, the proposed method reduces the time and resources needed to construct datasets for medical AI research.
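
The augmentation loop described in the abstract can be outlined as a minimal sketch. It is not the authors' implementation: the `generate` and `classify` callables are hypothetical stand-ins for the DistilGPT2-based generator and the BioBERT-based classifier, and the function names, the `step` size, and the way the generator consumes the sampled reports are all assumptions made for illustration.

```python
import random
from typing import Callable, Iterator, List, Sequence, Tuple

Labeled = Tuple[str, int]  # (report text, class label)


def sample_reports(reports: Sequence[str], size: int, seed: int = 0) -> List[str]:
    """Randomly extract `size` reports without replacement."""
    rng = random.Random(seed)
    return rng.sample(list(reports), size)


def build_augmented_pool(
    seed_reports: Sequence[str],
    generate: Callable[[Sequence[str], int], List[str]],  # stand-in for the generator
    classify: Callable[[str], int],                       # stand-in for the annotator
    n_synthetic: int,
) -> List[Labeled]:
    """Generate synthetic reports, then label them automatically."""
    synthetic = generate(seed_reports, n_synthetic)
    return [(text, classify(text)) for text in synthetic]


def incremental_datasets(
    original: Sequence[Labeled], augmented: Sequence[Labeled], step: int
) -> Iterator[Tuple[int, List[Labeled]]]:
    """Yield (n_added, dataset) with a growing share of augmented data.

    n_added == 0 is the baseline that uses only the original data.
    """
    for n_added in range(0, len(augmented) + 1, step):
        yield n_added, list(original) + list(augmented[:n_added])


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without any model or medical data.
    corpus = [f"report {i}" for i in range(100)]
    seeds = sample_reports(corpus, 10)
    toy_generate = lambda reps, n: [reps[i % len(reps)] + " (synthetic)" for i in range(n)]
    toy_classify = lambda text: len(text) % 2
    pool = build_augmented_pool(seeds, toy_generate, toy_classify, n_synthetic=20)
    original = [(r, toy_classify(r)) for r in seeds]
    for n_added, dataset in incremental_datasets(original, pool, step=5):
        print(n_added, len(dataset))  # a real run would fine-tune and evaluate here
```

In a real run, each yielded dataset would be used to fine-tune and evaluate a classifier, so that performance can be compared across augmentation amounts for each original subset size.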

Keywords