Understanding the performance and reliability of NLP tools: a comparison of four NLP tools predicting stroke phenotypes in radiology reports

Arlene Casey; Emma Davidson; Claire Grover; Richard Tobin; Andreas Grivas; Huayu Zhang; Patrick Schrempf; Alison Q. O’Neil; Liam Lee; Michael Walsh; Freya Pellie; Karen Ferguson; Vera Cvoro; Honghan Wu; Heather Whalley; Grant Mair; William Whiteley; Beatrice Alex

doi:10.3389/fdgth.2023.1184919

Frontiers in Digital Health (Sep 2023)

Affiliations

Arlene Casey: Advanced Care Research Centre, Usher Institute, University of Edinburgh, Edinburgh, United Kingdom
Emma Davidson: Centre for Clinical Brain Sciences, University of Edinburgh, Edinburgh, United Kingdom
Claire Grover: School of Informatics, University of Edinburgh, Edinburgh, United Kingdom
Richard Tobin: School of Informatics, University of Edinburgh, Edinburgh, United Kingdom
Andreas Grivas: School of Informatics, University of Edinburgh, Edinburgh, United Kingdom
Huayu Zhang: Advanced Care Research Centre, Usher Institute, University of Edinburgh, Edinburgh, United Kingdom
Patrick Schrempf: Canon Medical Research Europe Ltd., AI Research, Edinburgh, United Kingdom
Patrick Schrempf: School of Computer Science, University of St Andrews, St Andrews, United Kingdom
Alison Q. O’Neil: Canon Medical Research Europe Ltd., AI Research, Edinburgh, United Kingdom
Alison Q. O’Neil: School of Engineering, University of Edinburgh, Edinburgh, United Kingdom
Liam Lee: Medical School, University of Edinburgh, Edinburgh, United Kingdom
Michael Walsh: Intensive Care Department, University Hospitals Bristol and Weston, Bristol, United Kingdom
Freya Pellie: National Horizons Centre, Teesside University, Darlington, United Kingdom
Freya Pellie: School of Health and Life Sciences, Teesside University, Middlesbrough, United Kingdom
Karen Ferguson: Centre for Clinical Brain Sciences, University of Edinburgh, Edinburgh, United Kingdom
Vera Cvoro: Centre for Clinical Brain Sciences, University of Edinburgh, Edinburgh, United Kingdom
Vera Cvoro: Department of Geriatric Medicine, NHS Fife, Fife, United Kingdom
Honghan Wu: Institute of Health Informatics, University College London, London, United Kingdom
Honghan Wu: Alan Turing Institute, London, United Kingdom
Heather Whalley: Centre for Clinical Brain Sciences, University of Edinburgh, Edinburgh, United Kingdom
Heather Whalley: Generation Scotland, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, United Kingdom
Grant Mair: Centre for Clinical Brain Sciences, University of Edinburgh, Edinburgh, United Kingdom
Grant Mair: Neuroradiology, Department of Clinical Neurosciences, NHS Lothian, Edinburgh, United Kingdom
William Whiteley: Centre for Clinical Brain Sciences, University of Edinburgh, Edinburgh, United Kingdom
William Whiteley: Neuroradiology, Department of Clinical Neurosciences, NHS Lothian, Edinburgh, United Kingdom
Beatrice Alex: Edinburgh Futures Institute, University of Edinburgh, Edinburgh, United Kingdom
Beatrice Alex: School of Literatures, Languages and Cultures, University of Edinburgh, Edinburgh, United Kingdom

Journal volume & issue: Vol. 5

Abstract

Background: Natural language processing (NLP) has the potential to automate the reading of radiology reports, but there is a need to demonstrate that NLP methods are adaptable and reliable for use in real-world clinical applications.

Methods: We used F1 score, precision, and recall to compare NLP tools on two cohorts: a cohort from a study on delirium using images and radiology reports from NHS Fife, and a population-based cohort (Generation Scotland) that spans multiple National Health Service health boards. We compared four off-the-shelf rule-based and neural NLP tools (namely, EdIE-R, ALARM+, ESPRESSO, and Sem-EHR) and reported on their performance for three cerebrovascular phenotypes, namely, ischaemic stroke, small vessel disease (SVD), and atrophy. Clinical experts from the EdIE-R team defined phenotypes using labelling techniques established during the development of EdIE-R, in conjunction with an expert researcher who read underlying images.

Results: EdIE-R obtained the highest F1 score in both cohorts for ischaemic stroke, ≥93%, followed by ALARM+, ≥87%. The F1 score of ESPRESSO was ≥74%, whilst that of Sem-EHR was ≥66%, although ESPRESSO had the highest precision in both cohorts, 90% and 98%. For SVD, EdIE-R obtained F1 scores ≥98% and ALARM+ ≥90%. ESPRESSO scored lowest at ≥77%, with Sem-EHR at ≥81%. In NHS Fife, F1 scores for atrophy by EdIE-R and ALARM+ were 99%, dropping in Generation Scotland to 96% for EdIE-R and 91% for ALARM+. Sem-EHR performed lowest for atrophy at 89% in NHS Fife and 73% in Generation Scotland. When comparing NLP tool output with brain image reads using F1 scores, ALARM+ scored 80%, outperforming EdIE-R at 66% in ischaemic stroke. For SVD, EdIE-R performed best, scoring 84%, with Sem-EHR at 82%. For atrophy, EdIE-R and both ALARM+ versions were comparable at 80%.

Conclusions: The four NLP tools show varying F1 (and precision/recall) scores across all three phenotypes, with the variation most apparent for ischaemic stroke. If NLP tools are to be used in clinical settings, they cannot be used “out of the box.” It is essential to understand the context of their development to assess whether they are suitable for the task at hand or whether further training, re-training, or modification is required to adapt tools to the target task.
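
For readers unfamiliar with the metrics reported in the abstract, the following is a minimal sketch of how precision, recall, and F1 are conventionally computed for a single binary phenotype label per report. It is not the evaluation code used in the study; the function name `precision_recall_f1` and the toy labels are hypothetical and serve only as an illustration.

```python
def precision_recall_f1(gold, predicted):
    """Return (precision, recall, F1) for paired binary labels.

    gold      -- reference labels (1 = phenotype present, 0 = absent)
    predicted -- labels produced by a tool for the same reports
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)          # true positives
    fp = sum(1 for g, p in pairs if not g and p)      # false positives
    fn = sum(1 for g, p in pairs if g and not p)      # false negatives

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy data: eight reports, one phenotype (not from the study).
gold = [1, 1, 1, 0, 0, 1, 0, 0]
predicted = [1, 1, 0, 0, 1, 1, 0, 0]

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Because F1 is the harmonic mean of precision and recall, a tool can have the highest precision for a phenotype and still obtain a lower F1 than another tool if its recall is comparatively low.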

Published in Frontiers in Digital Health

ISSN: 2673-253X (Online)
Publisher: Frontiers Media S.A.
Country of publisher: Switzerland
LCC subjects: Medicine: Public aspects of medicine; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://www.frontiersin.org/journals/digital-health#
