Clinical Epidemiology (Jan 2023)

Development and Validation of Coding Algorithms to Identify Patients with Incident Non-Small Cell Lung Cancer in United States Healthcare Claims Data

  • Beyrer J,
  • Nelson DR,
  • Sheffield KM,
  • Huang YJ,
  • Lau YK,
  • Hincapie AL

Journal volume & issue
Vol. 15
pp. 73 – 89

Abstract

Julie Beyrer,1,* David R Nelson,1,* Kristin M Sheffield,1,* Yu-Jing Huang,1,* Yiu-Keung Lau,1,* Ana L Hincapie,2,*

1 Eli Lilly and Company, Indianapolis, IN, USA; 2 University of Cincinnati James L. Winkle College of Pharmacy, Cincinnati, OH, USA

*These authors contributed equally to this work

Correspondence: Julie Beyrer, Eli Lilly and Company, Lilly Corporate Center, Indianapolis, IN, 46285, USA, Tel +1 317 651 8236, Email [email protected]

Purpose: We sought to develop and validate an incident non-small cell lung cancer (NSCLC) algorithm for United States (US) healthcare claims data. Diagnoses and procedures, but not medications, were incorporated to support longer-term relevance and reliability.

Methods: Patients with newly diagnosed NSCLC per Surveillance, Epidemiology, and End Results (SEER) served as cases. Controls included patients with newly diagnosed small-cell lung cancer or other lung cancers, and two 5% random samples: one of patients with other cancers and one of people without cancer. Algorithms derived from logistic regression and machine learning methods used the entire sample (Approach A) or started with a previous algorithm for those with lung cancer (Approach B). Sensitivity, specificity, positive predictive values (PPV), negative predictive values, and F-scores (compared for 1000 bootstrap samples) were calculated. Misclassification was evaluated by calculating the odds of selection by the algorithm among true positives and true negatives.

Results: The best performing algorithm utilized neural networks (Approach B). A 10-variable point-score algorithm was derived from logistic regression (Approach B); its sensitivity was 77.69% and its PPV was 67.61% (F-score = 72.30%). This algorithm was less sensitive for patients ≥ 80 years old, with Medicare follow-up time < 3 months, or missing SEER data on stage, laterality, or site. It was less specific for patients with SEER primary site of main bronchus, SEER summary stage 2000 regional by direct extension only, or pre-index chronic pulmonary disease.

Conclusion: Our study developed and validated a practical, 10-variable, point-based algorithm for identifying incident NSCLC cases in a US claims database based on a previously validated incident lung cancer algorithm.

Keywords: algorithm, machine learning, Medicare claims, non-small cell lung cancer, positive predictive value, sensitivity, validation
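
The performance measures named in the abstract can be illustrated with a minimal Python sketch. It assumes the reported F-score is the balanced F1 score (the harmonic mean of sensitivity and PPV); the abstract does not state this explicitly, although the reported figures are consistent with it. The function names and the confusion-matrix counts in the usage note are hypothetical and are not taken from the study.

```python
def f_score(sensitivity: float, ppv: float) -> float:
    """Balanced F-score (F1): harmonic mean of sensitivity (recall) and PPV (precision).

    Assumption: the abstract's "F-score" is F1; this is not stated in the source.
    """
    if sensitivity + ppv == 0:
        return 0.0
    return 2 * sensitivity * ppv / (sensitivity + ppv)


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute the abstract's performance measures from confusion-matrix counts.

    tp/fn: true cases flagged / missed by the algorithm.
    fp/tn: controls flagged / correctly not flagged by the algorithm.
    """
    sensitivity = tp / (tp + fn)   # share of true cases the algorithm selects
    specificity = tn / (tn + fp)   # share of controls the algorithm rejects
    ppv = tp / (tp + fp)           # share of selected patients who are true cases
    npv = tn / (tn + fn)           # share of rejected patients who are true controls
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f_score": f_score(sensitivity, ppv),
    }


# Check against the figures reported for the 10-variable point-score algorithm.
reported = f_score(sensitivity=0.7769, ppv=0.6761)
print(f"F-score from reported sensitivity and PPV: {reported:.2%}")
```

Under the F1 assumption, the reported sensitivity (77.69%) and PPV (67.61%) give an F-score of about 72.30%, which matches the value in the Results. For counts, a call such as `classification_metrics(tp=8, fp=2, fn=2, tn=88)` (made-up numbers) returns all five measures at once.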

Keywords