Background: The integration of data from disparate sources could help alleviate data insufficiency in real-world studies and compensate for the inadequacies of single data sources and short-duration, small sample size studies while improving the utility of data for research.

Objective: This study aims to describe and evaluate a process of integrating data from several complementary sources to conduct health outcomes research in patients with non–small cell lung cancer (NSCLC). The integrated data set is also used to describe patient demographics, clinical characteristics, treatment patterns, and mortality rates.

Methods: This retrospective cohort study integrated data from 4 sources: administrative claims from the HealthCore Integrated Research Database, clinical data from a Cancer Care Quality Program (CCQP), clinical data from abstracted medical records (MRs), and mortality data from the US Social Security Administration. Patients with lung cancer who initiated second-line (2L) therapy between November 01, 2015, and April 13, 2018, were identified in the claims and CCQP data. Eligible patients were 18 years or older and received atezolizumab, docetaxel, erlotinib, nivolumab, pembrolizumab, pemetrexed, or ramucirumab in the 2L setting. The main analysis cohort included patients with claims data and data from at least one additional data source (CCQP or MR). Patients without integrated data (claims only) were reported separately. Descriptive and univariate statistics were reported.

Results: Data integration resulted in a main analysis cohort of 2195 patients with NSCLC; 2106 patients had CCQP and 407 patients had MR data. The claims-only cohort included 931 eligible patients. For the main analysis cohort, the mean age was 62.1 (SD 9.27) years, 48.56% (1066/2195) were female, the median length of follow-up was 6.8 months, and for 37.77% (829/2195), death was observed. For the claims-only cohort, the mean age was 66.6 (SD 12.69) years, 52.1% (485/931) were female, the median length of follow-up was 8.6 months, and for 29.3% (273/931), death was observed. The most frequent 2L treatment was immunotherapy (1094/2195, 49.84%), followed by platinum-based regimens (472/2195, 21.50%) and single-agent chemotherapy (441/2195, 20.09%); mean duration of 2L therapy was 5.6 (SD 4.9, median 4) months. We describe challenges and learnings from the data integration process, and the benefits of the integrated data set, which includes a richer set of clinical and outcome data to supplement the utilization metrics available in administrative claims.

Conclusions: The management of patients with NSCLC requires care from a multidisciplinary team, leading to a lack of a single aggregated data source in real-world settings. The availability of integrated clinical data from MRs, health plan claims, and other sources of clinical care may improve the ability to assess emerging treatments.
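
The cohort rule stated in the Methods (claims data plus at least one of CCQP or MR for the main analysis cohort, claims alone for the claims-only cohort) can be illustrated with a minimal sketch. This is not the study's code: the `PatientSources` record, the `assign_cohort` function, the cohort labels, and the handling of patients without claims data are all hypothetical names and assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PatientSources:
    """Hypothetical record of which data sources were linked for one patient."""
    patient_id: str
    has_claims: bool
    has_ccqp: bool = False
    has_mr: bool = False


def assign_cohort(p: PatientSources) -> Optional[str]:
    """Apply the cohort rule described in the Methods.

    Returns "main" for claims plus at least one of CCQP or MR,
    "claims_only" for claims with no additional source, and None
    when there is no claims record (an assumption: the abstract
    does not say how such patients were handled).
    """
    if not p.has_claims:
        return None
    if p.has_ccqp or p.has_mr:
        return "main"
    return "claims_only"


patients = [
    PatientSources("A", has_claims=True, has_ccqp=True),
    PatientSources("B", has_claims=True, has_mr=True),
    PatientSources("C", has_claims=True, has_ccqp=True, has_mr=True),
    PatientSources("D", has_claims=True),
]
cohorts = {p.patient_id: assign_cohort(p) for p in patients}
```

Under this rule every patient in the main analysis cohort has CCQP data, MR data, or both, so the reported counts would imply 2106 + 407 − 2195 = 318 patients with both sources.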