Generalizable and automated classification of TNM stage from pathology reports with external validation

Jenna Kefeli; Jacob Berkowitz; Jose M. Acitores Cortina; Kevin K. Tsang; Nicholas P. Tatonetti

doi:10.1038/s41467-024-53190-9

Nature Communications (Oct 2024)

Affiliations

Jenna Kefeli: Department of Systems Biology, Columbia University
Jacob Berkowitz: Department of Computational Biomedicine, Cedars-Sinai Medical Center
Jose M. Acitores Cortina: Department of Computational Biomedicine, Cedars-Sinai Medical Center
Kevin K. Tsang: Department of Computational Biomedicine, Cedars-Sinai Medical Center
Nicholas P. Tatonetti: Department of Systems Biology, Columbia University

DOI: https://doi.org/10.1038/s41467-024-53190-9
Journal volume & issue: Vol. 15, no. 1
pp. 1 – 7

Abstract

Cancer staging is an essential clinical attribute informing patient prognosis and clinical trial eligibility. However, it is not routinely recorded in structured electronic health records. Here, we present BB-TEN: Big Bird – TNM staging Extracted from Notes, a generalizable method for the automated classification of TNM stage directly from pathology report text. We train a BERT-based model using publicly available pathology reports across approximately 7000 patients and 23 cancer types. We explore the use of different model types, with differing input sizes, parameters, and model architectures. Our final model goes beyond term-extraction, inferring TNM stage from context when it is not included in the report text explicitly. As external validation, we test our model on almost 8000 pathology reports from Columbia University Medical Center, finding that our trained model achieved an AU-ROC of 0.815–0.942. This suggests that our model can be applied broadly to other institutions without additional institution-specific fine-tuning.
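To make the reported metric concrete, the following is a minimal sketch of how a per-class AU-ROC could be computed for a multi-class stage classifier. It is not the authors' evaluation code: the one-vs-rest scheme, the function names `binary_auroc` and `one_vs_rest_auroc`, and the class labels in the usage note are assumptions made for illustration, since the abstract does not say how the AU-ROC range was aggregated.

```python
def binary_auroc(labels, scores):
    """AU-ROC as the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative one.
    Tied scores count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def one_vs_rest_auroc(true_classes, class_scores, classes):
    """Per-class AU-ROC (assumed one-vs-rest scheme): each class in turn
    is treated as the positive label and all others as negative.

    true_classes: list of gold labels, one per report
    class_scores: list of score rows, one row per report,
                  ordered like `classes`
    """
    result = {}
    for idx, cls in enumerate(classes):
        labels = [1 if t == cls else 0 for t in true_classes]
        scores = [row[idx] for row in class_scores]
        result[cls] = binary_auroc(labels, scores)
    return result


# Two negatives (0.1, 0.4) and two positives (0.35, 0.8):
# three of the four positive/negative pairs are ranked correctly.
print(binary_auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Under this reading, a range such as 0.815–0.942 would correspond to the lowest and highest values across the separately evaluated targets. For example, `one_vs_rest_auroc(gold, scores, ["T1", "T2", "T3"])` returns one value per hypothetical label. The pairwise loop is quadratic in the number of reports, so a rank-based implementation would be preferable for thousands of reports.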

Published in Nature Communications

ISSN: 2041-1723 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Science
Website: https://www.nature.com/ncomms/
