A text-mining approach to obtain detailed treatment information from free-text fields in population-based cancer registries: A study of non-small cell lung cancer in California.

Frances B Maguire; Cyllene R Morris; Arti Parikh-Patel; Rosemary D Cress; Theresa H M Keegan; Chin-Shang Li; Patrick S Lin; Kenneth W Kizer

doi:10.1371/journal.pone.0212454

PLoS ONE (Jan 2019)


Journal volume & issue: Vol. 14, no. 2, p. e0212454

Abstract

Background: Population-based cancer registries have treatment information for all patients, making them an excellent resource for population-level monitoring. However, specific treatment details, such as drug names, are contained in a free-text format that is difficult to process and summarize. We assessed the accuracy and efficiency of a text-mining algorithm to identify systemic treatments for lung cancer from free-text fields in the California Cancer Registry.

Methods: The algorithm used Perl regular expressions in SAS 9.4 to search for treatments in 24,845 free-text records associated with 17,310 patients in California diagnosed with stage IV non-small cell lung cancer between 2012 and 2014. Our algorithm categorized treatments into six groups that align with National Comprehensive Cancer Network guidelines. We compared results to a manual review (gold standard) of the same records.

Results: Percent agreement ranged from 91.1% to 99.4%. Ranges for other measures were 0.71-0.92 (Kappa), 74.3%-97.3% (sensitivity), 92.4%-99.8% (specificity), 60.4%-96.4% (positive predictive value), and 92.9%-99.9% (negative predictive value). The text-mining algorithm used one-sixth of the time required for manual review.

Conclusion: SAS-based text mining of free-text data can accurately detect systemic treatments administered to patients and save considerable time compared to manual review, maximizing the utility of the extant information in population-based cancer registries for comparative effectiveness research.
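The general technique described in the abstract, regular-expression search over free-text treatment fields followed by comparison against a manual review, can be illustrated with a minimal sketch. This is not the authors' SAS code. It uses Python's `re` module in place of Perl regular expressions in SAS, and the category names, drug lists, and function names (`PATTERNS`, `categorize`, `agreement_metrics`) are hypothetical; the abstract does not list the six treatment groups or the search terms actually used.

```python
import math
import re

# Hypothetical categories and drug names, for illustration only.
# The study's actual six groups and search terms are not given in the abstract.
PATTERNS = {
    "platinum": re.compile(r"\b(carboplatin|cisplatin)\b", re.IGNORECASE),
    "pemetrexed": re.compile(r"\b(pemetrexed|alimta)\b", re.IGNORECASE),
    "tki": re.compile(r"\b(erlotinib|tarceva|crizotinib|xalkori)\b", re.IGNORECASE),
}


def categorize(text):
    """Return the set of category names whose pattern matches the free text."""
    return {name for name, pattern in PATTERNS.items() if pattern.search(text)}


def _ratio(numerator, denominator):
    """Divide, returning NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def agreement_metrics(predicted, gold):
    """Compare algorithm flags with gold-standard flags for one category.

    Both arguments are equal-length sequences of booleans, one per record.
    """
    pairs = list(zip(predicted, gold))
    tp = sum(p and g for p, g in pairs)
    tn = sum((not p) and (not g) for p, g in pairs)
    fp = sum(p and (not g) for p, g in pairs)
    fn = sum((not p) and g for p, g in pairs)
    n = tp + tn + fp + fn

    observed = _ratio(tp + tn, n)
    # Chance agreement for Cohen's kappa, from the marginal totals.
    expected = _ratio((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn), n * n)
    return {
        "percent_agreement": observed,
        "kappa": _ratio(observed - expected, 1 - expected),
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
    }


if __name__ == "__main__":
    print(categorize("Carboplatin + Alimta x4 cycles"))
    algorithm_flags = [True, True, False, False, True, False]
    manual_flags = [True, False, False, False, True, True]
    print(agreement_metrics(algorithm_flags, manual_flags))
```

In this sketch each record is flagged per category, and the metrics are computed one category at a time, which is one plausible reading of how the abstract's ranges across treatment groups arise. Plain keyword matching like this does not handle negation ("no cisplatin given") or misspellings; a production algorithm would need additional patterns for those cases.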

Published in PLoS ONE

ISSN: 1932-6203 (Online)
Publisher: Public Library of Science (PLoS)
Country of publisher: United States
LCC subjects: Medicine; Science
Website: https://journals.plos.org/plosone/
