Critical Care Explorations (Oct 2023)

Clinical Research With Large Language Models Generated Writing—Clinical Research with AI-assisted Writing (CRAW) Study

  • Ivan A. Huespe, MD,
  • Jorge Echeverri, MD,
  • Aisha Khalid, MD,
  • Indalecio Carboni Bisso, MD,
  • Carlos G. Musso, PhD,
  • Salim Surani, MD,
  • Vikas Bansal, MBBS, MPH,
  • Rahul Kashyap, MD

DOI: https://doi.org/10.1097/CCE.0000000000000975
Journal volume & issue: Vol. 5, No. 10, p. e0975

Abstract

IMPORTANCE: The scientific community debates the article quality, authorship merit, originality, and ethical use of Generative Pre-trained Transformer (GPT)-3.5 in scientific writing.

OBJECTIVES: To assess GPT-3.5's ability to craft the background section of a critical care clinical research question, compared with medical researchers with H-indices of 22 and 13.

DESIGN: Observational cross-sectional study.

SETTING: Researchers from 20 countries across six continents evaluated the backgrounds.

PARTICIPANTS: Researchers with a Scopus H-index greater than 1 were included.

MAIN OUTCOMES AND MEASURES: We generated the background section of a critical care clinical research question on “acute kidney injury in sepsis” using three different methods: a researcher with an H-index greater than 20, a researcher with an H-index greater than 10, and GPT-3.5. The three background sections were presented in a blinded survey to researchers with H-indices ranging from 1 to 96. First, the researchers rated the main components of each background on a 5-point Likert scale. Second, they were asked to identify which backgrounds were written by humans alone and which with large language model-generated tools.

RESULTS: A total of 80 researchers completed the survey. The median H-index was 3 (interquartile range, 1–7.25), and the largest group (36%) were from the critical care specialty. Compared with the researchers with H-indices of 22 and 13, GPT-3.5 was rated higher on the Likert scale for the main background components (median 4.5 vs. 3.82 vs. 3.6 vs. 4.5, respectively; p < 0.001). The sensitivity and specificity for distinguishing researcher-written from GPT-3.5-written text were poor: 22.4% and 57.6%, respectively.

CONCLUSIONS AND RELEVANCE: GPT-3.5 produced background research content indistinguishable from the writing of a medical researcher, and it was rated higher than medical researchers with H-indices of 22 and 13 in writing the background section of a critical care clinical research question.
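
For context on the detection metrics reported above: sensitivity and specificity follow the standard confusion-matrix definitions, where a “positive” is a background section actually written by a human researcher. Below is a minimal Python sketch of that calculation; the counts in it are hypothetical illustrations, not the study's response data.

    def detection_metrics(tp, fn, tn, fp):
        # Sensitivity: share of human-written sections correctly identified as human.
        # Specificity: share of GPT-3.5-written sections correctly identified as GPT-3.5.
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        return sensitivity, specificity

    # Illustrative counts only (not the CRAW study data):
    sens, spec = detection_metrics(tp=20, fn=60, tn=50, fp=30)
    print(f"sensitivity = {sens:.1%}, specificity = {spec:.1%}")

With these made-up counts the sketch prints sensitivity = 25.0% and specificity = 62.5%; the study's reported values (22.4% and 57.6%) would follow the same formulas applied to its own blinded-survey responses.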