Big Data Mining and Analytics (Dec 2024)
TV-SAM: Increasing Zero-Shot Segmentation Performance on Multimodal Medical Images Using GPT-4 Generated Descriptive Prompts Without Human Annotation
Abstract
This study presents a novel zero-shot segmentation algorithm for multimodal medical images, the text-visual-prompt segment anything model (TV-SAM), which requires no manual annotation. TV-SAM integrates the large language model GPT-4, the vision-language model GLIP, and the segment anything model (SAM) to autonomously generate descriptive text prompts and visual bounding-box prompts from medical images, thereby enhancing SAM's capability for zero-shot segmentation. Comprehensive evaluations on seven public datasets spanning eight imaging modalities demonstrate that TV-SAM can effectively segment unseen targets across modalities without additional training. TV-SAM significantly outperforms SAM AUTO (p < 0.01) and GSAM (p < 0.05), closely matches the performance of SAM BBOX with gold-standard bounding-box prompts (p = 0.07), and surpasses state-of-the-art methods on specific datasets such as ISIC (0.853 versus 0.802) and WBC (0.968 versus 0.883). These results indicate that TV-SAM is an effective zero-shot segmentation algorithm for multimodal medical images and highlight the significant contribution of GPT-4 to zero-shot segmentation. Integrating foundation models such as GPT-4, GLIP, and SAM can enhance the ability to address complex problems in specialized domains.
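The abstract describes a three-stage prompting pipeline: GPT-4 produces a descriptive text prompt, GLIP grounds that text into bounding boxes, and SAM segments each box. The following is a minimal sketch of that control flow, assuming hypothetical wrapper callables for the three models; none of these function names or signatures come from the paper.

```python
# Minimal sketch of the TV-SAM pipeline described in the abstract.
# The `describe`, `ground`, and `segment` callables are hypothetical
# stand-ins for GPT-4, GLIP, and SAM; real deployments would wrap the
# actual model APIs.
from dataclasses import dataclass
from typing import Callable, List, Tuple

BBox = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)


@dataclass
class TVSam:
    describe: Callable[[bytes, str], str]       # GPT-4: image + modality -> text prompt
    ground: Callable[[bytes, str], List[BBox]]  # GLIP: image + text -> box prompts
    segment: Callable[[bytes, BBox], object]    # SAM: image + box prompt -> mask

    def zero_shot_segment(self, image: bytes, modality: str) -> List[object]:
        """Segment a medical image with no manual annotation.

        1. GPT-4 generates a descriptive text prompt for the target.
        2. GLIP grounds the text prompt into bounding-box prompts.
        3. SAM turns each bounding box into a segmentation mask.
        """
        text_prompt = self.describe(image, modality)
        boxes = self.ground(image, text_prompt)
        return [self.segment(image, box) for box in boxes]


# Usage with trivial stubs, purely to show the data flow:
tv_sam = TVSam(
    describe=lambda img, mod: f"the lesion in this {mod} image",
    ground=lambda img, txt: [(10, 10, 50, 50)],
    segment=lambda img, box: {"box": box, "mask": None},
)
masks = tv_sam.zero_shot_segment(b"...", "dermoscopy")
```

The key design point the abstract emphasizes is that every prompt in this chain is machine-generated, so no gold-standard annotation enters the loop.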
Keywords