Concordance between humans and GPT-4 in appraising the methodological quality of case reports and case series using the Murad tool

Zin Tarakji; Adel Kanaan; Samer Saadi; Mohammed Firwana; Adel Kabbara Allababidi; Mohamed F. Abusalih; Rami Basmaci; Tamim I. Rajjo; Zhen Wang; M. Hassan Murad; Bashar Hasan

doi:10.1186/s12874-024-02372-6

BMC Medical Research Methodology (Nov 2024)

Affiliations

Zin Tarakji: Evidence-based Practice Center, Kern Center for the Science of Healthcare Delivery, Mayo Clinic
Adel Kanaan: Evidence-based Practice Center, Kern Center for the Science of Healthcare Delivery, Mayo Clinic
Samer Saadi: Evidence-based Practice Center, Kern Center for the Science of Healthcare Delivery, Mayo Clinic
Mohammed Firwana: Evidence-based Practice Center, Kern Center for the Science of Healthcare Delivery, Mayo Clinic
Adel Kabbara Allababidi: Evidence-based Practice Center, Kern Center for the Science of Healthcare Delivery, Mayo Clinic
Mohamed F. Abusalih: Evidence-based Practice Center, Kern Center for the Science of Healthcare Delivery, Mayo Clinic
Rami Basmaci: Department of Family Medicine, Mayo Clinic
Tamim I. Rajjo: Department of Family Medicine, Mayo Clinic
Zhen Wang: Evidence-based Practice Center, Kern Center for the Science of Healthcare Delivery, Mayo Clinic
M. Hassan Murad: Evidence-based Practice Center, Kern Center for the Science of Healthcare Delivery, Mayo Clinic
Bashar Hasan: Evidence-based Practice Center, Kern Center for the Science of Healthcare Delivery, Mayo Clinic

DOI: https://doi.org/10.1186/s12874-024-02372-6
Journal volume & issue: Vol. 24, no. 1
pp. 1 – 5

Abstract

Background: Assessing the methodological quality of case reports and case series is challenging because human judgment varies and reviewer time is limited. We evaluated the agreement in judgments between human reviewers and GPT-4 when applying a standard methodological quality assessment tool designed for case reports and series.

Methods: We searched Scopus for systematic reviews published in 2023–2024 that cited the appraisal tool by Murad et al. A GPT-4 based agent was developed to assess methodological quality using the 8 signaling questions of the tool. Observed agreement and an agreement coefficient were estimated by comparing the published judgments of human reviewers with the GPT-4 assessments.

Results: We included 797 case reports and series. The observed agreement ranged from 41.91% to 80.93% across the eight questions (the agreement coefficient ranged from 25.39% to 79.72%). The lowest agreement was noted for the first signaling question, about selection of cases. Agreement was similar in articles published in journals with impact factor < 5 vs. ≥ 5, and when excluding systematic reviews that did not use the 3 causality questions. Repeating the analysis with the same prompts demonstrated high agreement between the two GPT-4 attempts, except for the first question about selection of cases.

Conclusions: The study demonstrates moderate agreement between GPT-4 and human reviewers in assessing the methodological quality of case series and reports using the Murad tool. The current performance of GPT-4 seems promising but is unlikely to be sufficient for the rigor of a systematic review; pairing the model with a human reviewer is required.
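
The abstract reports an observed agreement and an agreement coefficient for each signaling question but does not say which coefficient was used. As a minimal sketch only, the Python below computes observed agreement and, as an assumption, Gwet's AC1 for two raters. The function names and the sample judgments are hypothetical and are not taken from the study.

```python
from collections import Counter


def observed_agreement(rater_a, rater_b):
    """Proportion of items on which the two raters gave the same judgment."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(rater_a, rater_b))
    return matches / len(rater_a)


def gwet_ac1(rater_a, rater_b):
    """Gwet's AC1 for two raters (an assumed choice of coefficient).

    Chance agreement is pe = sum_k pi_k * (1 - pi_k) / (q - 1), where pi_k is
    the mean proportion of judgments in category k across both raters and q
    is the number of categories observed.
    """
    po = observed_agreement(rater_a, rater_b)
    categories = set(rater_a) | set(rater_b)
    q = len(categories)
    if q < 2:
        # Only one category was ever used, so the raters agree on every item.
        return 1.0
    n = len(rater_a)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    pe = sum(
        pi * (1 - pi)
        for pi in ((counts_a[k] + counts_b[k]) / (2 * n) for k in categories)
    ) / (q - 1)
    return (po - pe) / (1 - pe)


if __name__ == "__main__":
    # Hypothetical judgments for one signaling question (not study data).
    human = ["yes", "yes", "no", "yes", "no", "no", "yes", "yes"]
    gpt4 = ["yes", "no", "no", "yes", "no", "yes", "yes", "yes"]
    print(f"observed agreement: {observed_agreement(human, gpt4):.2%}")
    print(f"Gwet's AC1: {gwet_ac1(human, gpt4):.2%}")
```

In a setup like this, the two functions would be applied once per signaling question, with the published human judgments as one rater and the GPT-4 output as the other.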

Published in BMC Medical Research Methodology

ISSN: 1471-2288 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General)
Website: http://bmcmedresmethodol.biomedcentral.com
