Application of generative language models to orthopaedic practice

Andrew Jones; Nicholas Cereceda-Monteoliva; Jessica Caterson; Olivia Ambler; Matthew Horner; Arwel Tomos Poacher

doi:10.1136/bmjopen-2023-076484

BMJ Open (Mar 2024)

Affiliations

Andrew Jones: Liverpool John Moores University, Liverpool, UK
Nicholas Cereceda-Monteoliva: Guy's and St Thomas' Hospitals NHS Trust, London, UK
Jessica Caterson: London School of Hygiene & Tropical Medicine, London, UK
Olivia Ambler: Plastic Surgery, Morriston Hospital, Swansea, Wales, UK
Matthew Horner: Trauma Department, University Hospital of Wales, Cardiff, UK
Arwel Tomos Poacher: Trauma Department, University Hospital of Wales, Cardiff, UK

DOI: https://doi.org/10.1136/bmjopen-2023-076484
Journal volume & issue: Vol. 14, no. 3

Abstract

Objective: To explore whether the large language models (LLMs) Generative Pre-trained Transformer (GPT)-3 and ChatGPT can write clinical letters and predict management plans for common orthopaedic scenarios.

Design: Fifteen scenarios were generated, and ChatGPT and GPT-3 were prompted to write clinical letters and, separately, to generate management plans for identical scenarios with the plans removed.

Main outcome measures: Letters were assessed for readability using the Readable Tool. Accuracy of letters and management plans was assessed by three independent orthopaedic surgery clinicians.

Results: Both models generated complete letters for all scenarios after single prompting. Readability was compared using Flesch-Kincaid Grade Level (ChatGPT: 8.77 (SD 0.918); GPT-3: 8.47 (SD 0.982)), Flesch Reading Ease (ChatGPT: 58.2 (SD 4.00); GPT-3: 59.3 (SD 6.98)), Simple Measure of Gobbledygook (SMOG) Index (ChatGPT: 11.6 (SD 0.755); GPT-3: 11.4 (SD 1.01)), and reach (ChatGPT: 81.2%; GPT-3: 80.3%). ChatGPT produced more accurate letters (8.7/10 (SD 0.60) vs 7.3/10 (SD 1.41), p=0.024) and management plans (7.9/10 (SD 0.63) vs 6.8/10 (SD 1.06), p<0.001) than GPT-3. However, both LLMs sometimes omitted key information or added guidance that was, at worst, inaccurate.

Conclusions: This study shows that LLMs are effective for generation of clinical letters. With little prompting, the letters are readable and mostly accurate. However, the output is not consistent and includes inappropriate omissions or insertions. Furthermore, management plans produced by LLMs are generic but often accurate. In the future, a healthcare-specific language model trained on accurate and secure data could provide an excellent tool for increasing the efficiency of clinicians through summarisation of large volumes of data into a single clinical letter.
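
For readers unfamiliar with the three readability scores reported above, the sketch below shows the standard published formulas for each. It is an illustration only: the study used the Readable Tool, whose internal implementation is not described here, and the function names and the choice to pass pre-computed counts (words, sentences, syllables, polysyllabic words) are assumptions made for this example.

```python
import math


def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch Reading Ease formula; higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch-Kincaid Grade Level formula (approximate US school grade)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def smog_index(polysyllables: int, sentences: int) -> float:
    """Standard SMOG formula; polysyllables = words with three or more syllables."""
    return 1.0430 * math.sqrt(polysyllables * (30 / sentences)) + 3.1291


if __name__ == "__main__":
    # Hypothetical counts for a short letter, not data from the study.
    words, sentences, syllables, polysyllables = 100, 5, 150, 10
    print(round(flesch_reading_ease(words, sentences, syllables), 2))
    print(round(flesch_kincaid_grade(words, sentences, syllables), 2))
    print(round(smog_index(polysyllables, sentences), 2))
```

Counting syllables reliably is the hard part in practice and is deliberately left out of this sketch, so scores from any particular tool may differ slightly from values computed this way.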

Published in BMJ Open

ISSN: 2044-6055 (Online)
Publisher: BMJ Publishing Group
Country of publisher: United Kingdom
LCC subjects: Medicine
Website: https://bmjopen.bmj.com
