Can AI-Based ChatGPT Models Accurately Analyze Hand–Wrist Radiographs? A Comparative Study

Ahmet Yıldırım; Orhan Cicek; Yavuz Selim Genç

doi:10.3390/diagnostics15121513

Diagnostics (Jun 2025)

Affiliations

Ahmet Yıldırım: Department of Orthodontics, Faculty of Dentistry, Zonguldak Bulent Ecevit University, Zonguldak 67600, Türkiye
Orhan Cicek: Department of Orthodontics, Faculty of Dentistry, Zonguldak Bulent Ecevit University, Zonguldak 67600, Türkiye
Yavuz Selim Genç: Samsun Oral and Dental Health Hospital, Samsun Provincial Health Directorate, Samsun 55060, Türkiye

DOI: https://doi.org/10.3390/diagnostics15121513
Journal volume & issue: Vol. 15, no. 12
p. 1513

Abstract

Background/Aims: The aim of this study was to evaluate the effectiveness of large language model (LLM)-based chatbot systems in predicting bone age and identifying growth stages, and to explore their potential as practical, infrastructure-independent alternatives to conventional methods and convolutional neural network (CNN)-based deep learning models. Methods: This study evaluated the performance of three ChatGPT-based models (GPT-4o, GPT-o4-mini-high, and GPT-o1-pro) in predicting bone age and growth stage using 90 anonymized hand–wrist radiographs (30 from each growth stage: pre-peak, peak, and post-peak, with equal male and female distribution). Reference standards were established by expert orthodontists using Fishman’s Skeletal Maturity Indicators (SMI) system and the Greulich–Pyle Atlas, and each radiograph was analyzed by all three GPT models using standardized prompts. Model performance was evaluated through statistical analyses assessing agreement and prediction accuracy. Results: All models showed significant agreement with the reference values in bone age prediction (p < […] p > 0.05). The GPT-o4-mini-high model achieved an accuracy rate of 72.2% within a ±2-year deviation range for bone age prediction. The GPT-o1-pro and GPT-o4-mini-high models showed bias in the Bland–Altman analysis of bone age predictions; however, GPT-o1-pro yielded more reliable predictions with narrower limits of agreement. In terms of growth stage classification, the GPT-4o model achieved the highest agreement with the reference values (κ = 0.283, p < […]). Conclusions: This study shows that general-purpose GPT models can support bone age and growth stage prediction, with each model having distinct strengths. While GPT models do not replace clinical examination, their contextual reasoning and ability to perform preliminary assessments without domain-specific training make them promising tools, though further development is needed.
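
The agreement measures named in the abstract (Bland–Altman bias and limits of agreement, accuracy within a ±2-year window, and Cohen's κ) can be illustrated with a minimal Python sketch. It is not the authors' analysis code. The function names and the small sample arrays are hypothetical, chosen only for illustration, and the 1.96 × SD convention for the limits of agreement is assumed rather than taken from the paper.

```python
from statistics import mean, stdev

def bland_altman(pred, ref):
    """Return (bias, lower limit, upper limit) for predicted vs. reference values.

    Assumes the common convention: limits of agreement = bias +/- 1.96 * SD
    of the paired differences.
    """
    diffs = [p - r for p, r in zip(pred, ref)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

def within_tolerance(pred, ref, tol=2.0):
    """Fraction of predictions within +/- tol (years) of the reference."""
    hits = sum(abs(p - r) <= tol for p, r in zip(pred, ref))
    return hits / len(ref)

def cohen_kappa(pred, ref):
    """Unweighted Cohen's kappa for two label sequences of equal length.

    Undefined (division by zero) when chance agreement equals 1.
    """
    n = len(ref)
    labels = set(pred) | set(ref)
    p_observed = sum(p == r for p, r in zip(pred, ref)) / n
    p_expected = sum((pred.count(l) / n) * (ref.count(l) / n) for l in labels)
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical data, NOT taken from the study.
ref_age = [10.0, 12.0, 14.0, 16.0]
pred_age = [11.0, 12.5, 17.0, 15.5]
ref_stage = ["pre", "pre", "peak", "peak", "post", "post"]
pred_stage = ["pre", "peak", "peak", "peak", "post", "pre"]

bias, lower, upper = bland_altman(pred_age, ref_age)
accuracy = within_tolerance(pred_age, ref_age)
kappa = cohen_kappa(pred_stage, ref_stage)
```

A nonzero bias indicates systematic over- or under-estimation, and narrower limits of agreement indicate more consistent predictions, which is the sense in which the abstract describes GPT-o1-pro as more reliable despite showing bias.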

Published in Diagnostics

ISSN: 2075-4418 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Medicine: Medicine (General)
Website: http://www.mdpi.com/journal/diagnostics
