Development and validation of an automated basal cell carcinoma histopathology information extraction system using natural language processing

Stephen R. Ali; Huw Strafford; Thomas D. Dobbs; Beata Fonferko-Shadrach; Arron S. Lacey; William Owen Pickrell; Hayley A. Hutchings; Iain S. Whitaker

doi:10.3389/fsurg.2022.870494

Frontiers in Surgery (Aug 2022)

Affiliations

Stephen R. Ali: Reconstructive Surgery and Regenerative Medicine Research Centre, Institute of Life Sciences, Swansea University Medical School, Swansea, United Kingdom
Stephen R. Ali: Welsh Centre for Burns and Plastic Surgery, Morriston Hospital, Swansea, United Kingdom
Huw Strafford: Neurology and Molecular Neuroscience Group, Institute of Life Science, Swansea University Medical School, Swansea University, Swansea, United Kingdom
Huw Strafford: Health Data Research UK, Data Science Building, Swansea University Medical School, Swansea University, Swansea, United Kingdom
Thomas D. Dobbs: Reconstructive Surgery and Regenerative Medicine Research Centre, Institute of Life Sciences, Swansea University Medical School, Swansea, United Kingdom
Thomas D. Dobbs: Welsh Centre for Burns and Plastic Surgery, Morriston Hospital, Swansea, United Kingdom
Beata Fonferko-Shadrach: Neurology and Molecular Neuroscience Group, Institute of Life Science, Swansea University Medical School, Swansea University, Swansea, United Kingdom
Arron S. Lacey: Neurology and Molecular Neuroscience Group, Institute of Life Science, Swansea University Medical School, Swansea University, Swansea, United Kingdom
Arron S. Lacey: Health Data Research UK, Data Science Building, Swansea University Medical School, Swansea University, Swansea, United Kingdom
William Owen Pickrell: Neurology and Molecular Neuroscience Group, Institute of Life Science, Swansea University Medical School, Swansea University, Swansea, United Kingdom
William Owen Pickrell: Department of Neurology, Morriston Hospital, Swansea, United Kingdom
Hayley A. Hutchings: Patient and Population Health and Informatics Research, Swansea University Medical School, Swansea, United Kingdom
Iain S. Whitaker: Reconstructive Surgery and Regenerative Medicine Research Centre, Institute of Life Sciences, Swansea University Medical School, Swansea, United Kingdom
Iain S. Whitaker: Welsh Centre for Burns and Plastic Surgery, Morriston Hospital, Swansea, United Kingdom

DOI: https://doi.org/10.3389/fsurg.2022.870494
Journal volume & issue: Vol. 9

Abstract

Introduction: Routinely collected healthcare data are a powerful research resource, but often lack detailed disease-specific information that is collected in clinical free text such as histopathology reports. We aim to use natural language processing (NLP) techniques to extract detailed clinical and pathological information from histopathology reports to enrich routinely collected data.

Methods: We used the General Architecture for Text Engineering (GATE) framework to build an NLP information extraction system using rule-based techniques. During validation, we deployed our rule-based NLP pipeline on 200 previously unseen, de-identified and pseudonymised basal cell carcinoma (BCC) histopathological reports from Swansea Bay University Health Board, Wales, UK. The results of our algorithm were compared with gold standard human annotation by two independent and blinded expert clinicians involved in skin cancer care.

Results: We identified 11,224 items of information with a mean precision, recall, and F1 score of 86.0% (95% CI: 75.1–96.9), 84.2% (95% CI: 72.8–96.1), and 84.5% (95% CI: 73.0–95.1), respectively. The difference between clinician annotator F1 scores was 7.9% in comparison with 15.5% between the NLP pipeline and the gold standard corpus. Cohen's Kappa score on annotated tokens was 0.85.

Conclusion: Using an NLP rule-based approach for named entity recognition in BCC, we have been able to develop and validate a pipeline with a potential application in improving the quality of cancer registry data, supporting service planning, and enhancing the quality of routinely collected data for research.
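
The evaluation measures named in the abstract (precision, recall, F1 score, and Cohen's Kappa) can be illustrated with a minimal Python sketch. This is not the authors' evaluation code: the function names and the toy inputs are assumptions made for illustration, and the definitions used are the standard textbook ones rather than anything specified in the record above.

```python
from collections import Counter


def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same sequence of tokens."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty token sequence")
    n = len(labels_a)
    # Observed agreement: fraction of tokens given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical toy inputs, not data from the study.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
kappa = cohens_kappa(["BCC", "BCC", "O", "O"], ["BCC", "O", "O", "O"])
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f} kappa={kappa:.2f}")
```

In this sketch the NLP output would be scored against the gold standard with `precision_recall_f1`, while `cohens_kappa` would be applied to the two clinicians' token-level labels; how the study aggregated scores across entity types is not stated here and is not assumed.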

Published in Frontiers in Surgery

ISSN: 2296-875X (Online)
Publisher: Frontiers Media S.A.
Country of publisher: Switzerland
LCC subjects: Medicine: Surgery
Website: https://www.frontiersin.org/journals/surgery
