A systematic evaluation of GPT-4V's multimodal capability for chest X-ray image analysis

Yunyi Liu; Yingshu Li; Zhanyu Wang; Xinyu Liang; Lingqiao Liu; Lei Wang; Leyang Cui; Zhaopeng Tu; Longyue Wang; Luping Zhou

doi:10.1016/j.metrad.2024.100099

Meta-Radiology (Dec 2024)

Affiliations

Yunyi Liu: University of Sydney, New South Wales 2006, Australia
Yingshu Li: University of Sydney, New South Wales 2006, Australia
Zhanyu Wang: University of Sydney, New South Wales 2006, Australia
Xinyu Liang: First Clinical Medical College, Guangzhou University of Chinese Medicine, Guangzhou 510405, China
Lingqiao Liu: University of Adelaide, South Australia 5005, Australia
Lei Wang: University of Wollongong, New South Wales 2522, Australia
Leyang Cui: Tencent AI Lab, Tencent, Shenzhen 518000, China
Zhaopeng Tu: Tencent AI Lab, Tencent, Shenzhen 518000, China
Longyue Wang: Tencent AI Lab, Tencent, Shenzhen 518000, China; Corresponding author.
Luping Zhou: University of Sydney, New South Wales 2006, Australia; Corresponding author.

DOI: https://doi.org/10.1016/j.metrad.2024.100099
Journal volume & issue: Vol. 2, no. 4
p. 100099

Abstract

This work evaluates GPT-4V's multimodal capability for medical image analysis, focusing on three representative tasks: radiology report generation, medical visual question answering, and medical visual grounding. For each task, a set of prompts is designed to elicit the corresponding capability of GPT-4V and produce sufficiently good outputs. Three evaluation methods (quantitative analysis, human evaluation, and case study) are employed to achieve an in-depth and extensive evaluation. Our evaluation shows that GPT-4V excels in understanding medical images: it can generate high-quality radiology reports and effectively answer questions about medical images. Meanwhile, its performance on medical visual grounding needs to be substantially improved. In addition, we observe a discrepancy between the outcome of the quantitative analysis and that of the human evaluation. This discrepancy suggests that conventional metrics are limited in assessing the performance of large language models like GPT-4V, and that new metrics need to be developed for automatic quantitative analysis.

Published in Meta-Radiology

ISSN: 2950-1628 (Online)
Publisher: KeAi Communications Co., Ltd.
Country of publisher: China
LCC subjects: Medicine: Medicine (General): Medical physics. Medical radiology. Nuclear medicine
Website: https://www.keaipublishing.com/en/journals/meta-radiology/
