Big Data Mining and Analytics (Dec 2024)

TV-SAM: Increasing Zero-Shot Segmentation Performance on Multimodal Medical Images Using GPT-4 Generated Descriptive Prompts Without Human Annotation

  • Zekun Jiang,
  • Dongjie Cheng,
  • Ziyuan Qin,
  • Jun Gao,
  • Qicheng Lao,
  • Abdullaev Bakhrom Ismoilovich,
  • Urazboev Gayrat,
  • Yuldashov Elyorbek,
  • Bekchanov Habibullo,
  • Defu Tang,
  • Linjing Wei,
  • Kang Li,
  • Le Zhang

DOI
https://doi.org/10.26599/BDMA.2024.9020058
Journal volume & issue
Vol. 7, no. 4
pp. 1199–1211

Abstract

This study presents a novel multimodal medical image zero-shot segmentation algorithm, the text-visual-prompt segment anything model (TV-SAM), which requires no manual annotations. TV-SAM integrates the large language model GPT-4, the vision-language model GLIP, and the segment anything model (SAM) to autonomously generate descriptive text prompts and visual bounding box prompts from medical images, thereby enhancing SAM's capability for zero-shot segmentation. Comprehensive evaluations on seven public datasets encompassing eight imaging modalities demonstrate that TV-SAM can effectively segment unseen targets across various modalities without additional training. TV-SAM significantly outperforms SAM AUTO (p < 0.01) and GSAM (p < 0.05), closely matches the performance of SAM BBOX with gold-standard bounding box prompts (p = 0.07), and surpasses state-of-the-art methods on specific datasets such as ISIC (0.853 versus 0.802) and WBC (0.968 versus 0.883). These results indicate that TV-SAM is an effective multimodal medical image zero-shot segmentation algorithm and highlight the significant contribution of GPT-4 to zero-shot segmentation. Integrating foundation models such as GPT-4, GLIP, and SAM enhances the ability to address complex problems in specialized domains.
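The abstract describes a three-stage pipeline (GPT-4 → GLIP → SAM). Below is a minimal Python sketch of that flow, not the authors' released code: the openai and segment_anything calls follow those packages' public APIs, the GLIP stage is left as a hypothetical wrapper because the paper does not specify its interface here, and all prompts, model names, and checkpoint paths are illustrative assumptions.

```python
# Sketch of a TV-SAM-style prompting pipeline; names and prompts are assumptions.
import base64

import numpy as np
from openai import OpenAI
from segment_anything import SamPredictor, sam_model_registry


def describe_target(image_path: str) -> str:
    """Step 1 (assumed): ask GPT-4 for a short description of the
    segmentation target visible in the medical image."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",  # a vision-capable GPT-4 variant; an assumption
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe the main anatomical or pathological "
                         "structure to segment in this image, in one phrase."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content.strip()


def glip_detect(image: np.ndarray, caption: str) -> np.ndarray:
    """Step 2 (hypothetical wrapper): ground the caption with GLIP and
    return one bounding box as [x0, y0, x1, y1] in pixel coordinates."""
    raise NotImplementedError("plug in a GLIP grounded-detection model here")


def segment(image: np.ndarray, box: np.ndarray) -> np.ndarray:
    """Step 3: prompt SAM with the detected box (segment_anything API)."""
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
    predictor = SamPredictor(sam)
    predictor.set_image(image)  # RGB uint8 array of shape HxWx3
    masks, scores, _ = predictor.predict(box=box, multimask_output=False)
    return masks[0]
```

A caller would chain the stages, e.g. box = glip_detect(img, describe_target(path)) followed by mask = segment(img, box), so the only human input is the raw image.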

Keywords