The Lancet Regional Health. Western Pacific (Feb 2025)
Harnessing large multimodal models in pulmonary CT: the generative AI edge in lung cancer diagnostics
Abstract
Background: Generative artificial intelligence (Gen-AI) has advanced rapidly in multimodal information processing, particularly in medical applications such as the refinement of instruments and the interpretation of medical images. However, evidence on the diagnostic performance of Gen-AI models in tumor recognition, particularly from computed tomography (CT) images, remains limited. This study aimed to evaluate the diagnostic capabilities of several prevalent Gen-AI models (GPT-4-turbo, Gemini-pro-vision, and Claude-3-opus) in lung CT image analysis.
Methods: This retrospective study analyzed chest CT scans from 404 patients with lung conditions, including lung neoplasms (n=184) and non-malignant conditions (n=210). After the CT images were standardized, the diagnostic performance and reliability of the three Gen-AI models were assessed using chi-square tests and receiver operating characteristic (ROC) curves across various clinical scenarios. Likert scale scoring and response rate analysis were used to evaluate internal diagnostic tendencies, and regression analyses were conducted for model optimization.
Findings: In a cueing environment limited to a single CT image, Gemini demonstrated the highest diagnostic accuracy (92.21%), followed by Claude (91.49%), while GPT exhibited the lowest (65.22%). As the complexity of the cueing environment increased, diagnostic accuracy declined for all models: Claude showed a marginal decrease, whereas Gemini's accuracy fluctuated significantly. Under simplified cueing conditions, the performance of all models improved notably (Gemini AUC = 0.76, Claude AUC = 0.69, GPT AUC = 0.73). Feature identification analysis revealed that Claude and GPT excelled in recognizing key features, prioritizing “Morphology/Margins” when diagnosing primary malignancies, with “spiculated” and “irregular” serving as critical indicators. However, in misdiagnosed or missed cases, the Gen-AI models deviated significantly across multiple feature dimensions, and some descriptions completely contradicted the actual findings. Following optimization through Lasso and stepwise regression, the diagnostic performance of the models improved significantly (AUC = 0.896 and AUC = 0.894, respectively).
Interpretation: Gen-AI shows promising potential in pulmonary CT imaging, particularly in simplified diagnostic settings. However, its limitations in processing complex multimodal information highlight significant challenges for clinical integration. Ongoing efforts to improve the robustness and reliability of these models are crucial for their successful adoption in healthcare.
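For readers unfamiliar with the metrics reported above, the following is a minimal sketch of how an AUC and a thresholded accuracy can be derived from ordinal (Likert-style) malignancy ratings. It is not the study's analysis code: the function names, the 1 to 5 rating scale, the threshold of 3, and the toy data are all assumptions made purely for illustration.

```python
# Illustrative sketch only: toy data and names are hypothetical, not from the study.

def roc_auc(labels, scores):
    """AUC as the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case
    (ties count as 0.5). Equivalent to the normalized Mann-Whitney U."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold):
    """Fraction of cases classified correctly when a rating at or above
    `threshold` is read as a 'malignant' call."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Hypothetical example: 1 = malignant, 0 = non-malignant;
# ratings are assumed 1-5 Likert scores of suspected malignancy.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
likert = [5, 4, 4, 2, 3, 2, 1, 1]

print(roc_auc(labels, likert))      # 0.90625
print(accuracy(labels, likert, 3))  # 0.75
```

The pairwise formulation is used here because it handles tied ordinal ratings explicitly; with larger datasets a library routine would normally be preferred.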