An accessible, efficient, and accurate natural language processing method for extracting diagnostic data from pathology reports

Hansen Lam; Freddy Nguyen; Xintong Wang; Aryeh Stock; Volha Lenskaya; Maryam Kooshesh; Peizi Li; Mohammad Qazi; Shenyu Wang; Mitra Dehghan; Xia Qian; Qiusheng Si; Alexandros D. Polydorides

Journal of Pathology Informatics (Jan 2022)

Affiliations

Hansen Lam: Department of Pathology, Molecular and Cell-Based Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, USA
Freddy Nguyen: Department of Pathology, Molecular and Cell-Based Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, USA
Xintong Wang: Department of Pathology, Molecular and Cell-Based Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, USA
Aryeh Stock: Department of Pathology, Molecular and Cell-Based Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, USA
Volha Lenskaya: Department of Pathology, Molecular and Cell-Based Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, USA
Maryam Kooshesh: Department of Pathology, Molecular and Cell-Based Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, USA
Peizi Li: Department of Pathology, Molecular and Cell-Based Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, USA
Mohammad Qazi: Department of Pathology, Molecular and Cell-Based Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, USA
Shenyu Wang: Department of Pathology, Molecular and Cell-Based Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, USA
Mitra Dehghan: Department of Pathology, Molecular and Cell-Based Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, USA
Xia Qian: Department of Pathology, Molecular and Cell-Based Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, USA
Qiusheng Si: Department of Pathology, Molecular and Cell-Based Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, USA
Alexandros D. Polydorides: Department of Pathology, Molecular and Cell-Based Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, USA. Corresponding author at: Department of Pathology, Molecular and Cell-Based Medicine, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, Box 1194, New York, NY 10029, USA.

Journal volume & issue: Vol. 13, p. 100154

Abstract

Context: Analysis of diagnostic information in pathology reports for clinical or translational research and quality assessment/control often requires manual data extraction, which can be laborious, time-consuming, and error-prone.

Objective: We sought to develop, employ, and evaluate a simple, dictionary- and rule-based natural language processing (NLP) algorithm for generating searchable information on various types of parameters from diverse surgical pathology reports.

Design: Data were exported from the pathology laboratory information system (LIS) into Extensible Markup Language (XML) documents, which were parsed by NLP-based Python code into the desired data points and delivered to Excel spreadsheets. Accuracy and efficiency were compared with those of a manual data extraction method, with concordance measured by Cohen’s κ coefficient and corresponding P values.

Results: Compared with the manual method, the automated method was highly concordant (90%–100%, P<.001), with excellent inter-observer reliability (Cohen’s κ: 0.86–1.0), in 3 clinicopathological research scenarios: squamous dysplasia presence and grade in anal biopsies, epithelial dysplasia grade and location in colonoscopic surveillance biopsies, and adenocarcinoma grade and amount in prostate core biopsies. Notably, the automated method was 24–39 times faster and inherently linked each diagnosis to additional variables such as patient age and location, which would otherwise require additional manual processing time.

Conclusions: A simple, flexible, and scalable NLP-based platform can be used to correctly, safely, and quickly extract and deliver linked data from pathology reports into searchable spreadsheets for clinical and research purposes.
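
The pipeline outlined in the Design section (XML export, dictionary- and rule-based parsing, tabular output, and concordance by Cohen’s κ) can be illustrated with a minimal Python sketch. This is not the authors' code: the XML element names, the sample reports, the `RULES` dictionary, and the helper functions are all hypothetical, and CSV output stands in for the Excel spreadsheets mentioned in the abstract.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Hypothetical XML export; a real LIS export will use different element names.
SAMPLE_XML = """<reports>
  <report id="1"><age>54</age>
    <diagnosis>Anal biopsy: high-grade squamous intraepithelial lesion (HSIL).</diagnosis>
  </report>
  <report id="2"><age>61</age>
    <diagnosis>Anal biopsy: low-grade squamous intraepithelial lesion.</diagnosis>
  </report>
  <report id="3"><age>47</age>
    <diagnosis>Anal biopsy: negative for dysplasia.</diagnosis>
  </report>
</reports>"""

# Assumed dictionary of ordered rules; the first matching pattern wins,
# so negations are checked before grade terms.
RULES = [
    ("negative", re.compile(r"\b(negative for|no evidence of) dysplasia\b", re.I)),
    ("high", re.compile(r"\bhigh[- ]grade\b|\bHSIL\b", re.I)),
    ("low", re.compile(r"\blow[- ]grade\b|\bLSIL\b", re.I)),
]


def classify(text):
    """Return the label of the first rule whose pattern matches the text."""
    for label, pattern in RULES:
        if pattern.search(text):
            return label
    return "unclassified"


def extract_rows(xml_text):
    """Parse the XML and link each diagnosis to its report id and patient age."""
    root = ET.fromstring(xml_text)
    rows = []
    for report in root.iter("report"):
        diagnosis = report.findtext("diagnosis", default="")
        rows.append({
            "report_id": report.get("id"),
            "age": report.findtext("age", default=""),
            "grade": classify(diagnosis),
        })
    return rows


def to_csv(rows):
    """Serialize rows as CSV text, which spreadsheet software can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["report_id", "age", "grade"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in set(a) | set(b))
    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


rows = extract_rows(SAMPLE_XML)
automated = [row["grade"] for row in rows]
manual = ["high", "low", "negative"]  # made-up manual labels for illustration
print(to_csv(rows))
print("kappa:", cohen_kappa(automated, manual))
```

Because each row keeps the report identifier and patient age next to the extracted grade, the diagnosis stays linked to its other variables without a separate manual lookup step.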

Published in Journal of Pathology Informatics

ISSN: 2229-5089 (Print); 2153-3539 (Online)
Publisher: Elsevier
Country of publisher: United States
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Medicine: Pathology
Website: https://www.journals.elsevier.com/journal-of-pathology-informatics
