Informatics in Medicine Unlocked (Jan 2024)

Fully automatic summarization of radiology reports using natural language processing with large language models

  • Mizuho Nishio,
  • Takaaki Matsunaga,
  • Hidetoshi Matsuo,
  • Munenobu Nogami,
  • Yasuhisa Kurata,
  • Koji Fujimoto,
  • Osamu Sugiyama,
  • Toshiaki Akashi,
  • Shigeki Aoki,
  • Takamichi Murakami

Journal volume & issue
Vol. 46
p. 101465

Abstract

Purpose: Natural language processing using language models has yielded promising results in various fields. Language models can help improve the workflow of radiologists. This retrospective study aimed to construct and evaluate language models for automatic summarization of radiology reports.

Methods: Two radiology report datasets from the MIMIC Chest X-ray (MIMIC-CXR) database and the Japan Medical Image Database (JMID) were included in this study. The MIMIC-CXR is an open database comprising chest radiograph reports. The JMID is a large database comprising computed tomography and magnetic resonance imaging reports from 10 academic medical centers in Japan. A total of 128,032 and 1,101,271 reports were included in this study from the MIMIC-CXR database and JMID, respectively. Four Text-to-Text Transfer Transformer (T5) models were constructed. Recall-Oriented Understudy for Gisting Evaluation (ROUGE), a quantitative metric, was used to evaluate the quality of the text summarized from 19,205 and 58,043 test sets from the MIMIC-CXR and JMID, respectively. The Wilcoxon signed-rank test was used to evaluate the differences among the ROUGE values of the four T5 models. Moreover, the subsets of automatically summarized text in the test sets were manually evaluated by two radiologists. The best T5 models were selected for automatic summarization using the Wilcoxon signed-rank test.

Results: The quantitative metrics of the best T5 models were as follows: ROUGE-1 = 57.75 ± 30.99, ROUGE-2 = 49.96 ± 35.36, and ROUGE-L = 54.07 ± 32.48 in the MIMIC-CXR; and ROUGE-1 = 50.00 ± 29.24, ROUGE-2 = 39.66 ± 30.21, and ROUGE-L = 47.87 ± 29.44 in the JMID. The radiologists’ evaluations revealed 86% and 85% of the texts automatically summarized from the MIMIC-CXR and JMID, respectively, to be clinically useful.

Conclusion: The T5 models constructed in this study were able to perform automatic summarization of the radiology reports. The radiologists’ evaluations demonstrated most of the automatically summarized texts to be clinically valuable.
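
For readers unfamiliar with the ROUGE metrics named in the abstract, the following is a minimal sketch of how ROUGE-1, ROUGE-2, and ROUGE-L F-scores can be computed for one generated summary against one reference. It is an illustration only, not the study's evaluation code: the abstract does not state which ROUGE implementation or tokenizer was used, so the lowercasing, whitespace tokenization, F1 variant, and function names below are all assumptions, and the example sentences are invented.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, n_candidate, n_reference):
    """Harmonic mean of precision and recall; 0.0 when nothing overlaps."""
    if overlap == 0:
        return 0.0
    precision = overlap / n_candidate
    recall = overlap / n_reference
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n):
    """ROUGE-N F1 from clipped n-gram overlap (assumed whitespace tokenization)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # & keeps the minimum count per n-gram
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L F1 based on the longest common subsequence of tokens."""
    c = candidate.lower().split()
    r = reference.lower().split()
    return f1(lcs_length(c, r), len(c), len(r))


if __name__ == "__main__":
    # Invented example sentences, not taken from MIMIC-CXR or the JMID.
    generated = "no acute cardiopulmonary process"
    reference = "no acute cardiopulmonary abnormality"
    print(rouge_n(generated, reference, 1))
    print(rouge_n(generated, reference, 2))
    print(rouge_l(generated, reference))
```

These functions return values between 0 and 1; multiplying by 100 gives the 0 to 100 scale on which ROUGE values are commonly reported. A summary-level score like this, computed per report and then aggregated over a test set, is the kind of quantity that paired tests such as the Wilcoxon signed-rank test can compare across models.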

Keywords