Whole exome sequencing and machine learning germline analysis of individuals presenting with extreme phenotypes of high and low risk of developing tobacco-associated lung adenocarcinoma
Ana Patiño-García,
Elizabeth Guruceaga,
Maria Pilar Andueza,
Marimar Ocón,
Jafait Junior Fodop Sokoudjou,
Nicolás de Villalonga Zornoza,
Gorka Alkorta-Aranburu,
Ibon Tamayo Uria,
Alfonso Gurpide,
Carlos Camps,
Eloísa Jantus-Lewintre,
Maria Navamuel-Andueza,
Miguel F. Sanmamed,
Ignacio Melero,
Mohamed Elgendy,
Juan Pablo Fusco,
Javier J. Zulueta,
Juan P. de-Torres,
Gorka Bastarrika,
Luis Seijo,
Ruben Pio,
Luis M. Montuenga,
Mikel Hernáez,
Idoia Ochoa,
Jose Luis Perez-Gracia
Affiliations
Ana Patiño-García
Department of Pediatrics and Clinical Genetics, Clínica Universidad de Navarra (CUN), Cancer Center Clínica Universidad de Navarra (CCUN), Program in Solid Tumors, Center for Applied Medical Research (Cima) and Navarra Institute for Health Research (IdisNA), University of Navarra, Pamplona, Spain
Elizabeth Guruceaga
Bioinformatics Platform, Cima and IdisNA, University of Navarra, Pamplona, Spain
Maria Pilar Andueza
Department of Oncology, CUN, CCUN and IdisNA, University of Navarra, Pamplona, Spain
Marimar Ocón
Pulmonary Department, CUN, CCUN and IdisNA, University of Navarra, Pamplona, Spain
Jafait Junior Fodop Sokoudjou
Electrical and Electronic Engineering Department, Tecnun, University of Navarra, San Sebastian, Spain
Nicolás de Villalonga Zornoza
Electrical and Electronic Engineering Department, Tecnun, University of Navarra, San Sebastian, Spain
Gorka Alkorta-Aranburu
CIMA LAB Diagnostics and IdisNA, University of Navarra, Pamplona, Spain
Ibon Tamayo Uria
Bioinformatics Platform, Cima and IdisNA, University of Navarra, Pamplona, Spain
Alfonso Gurpide
Department of Oncology, CUN, CCUN and IdisNA, University of Navarra, Pamplona, Spain
Carlos Camps
Department of Medical Oncology, Hospital General Universitario de Valencia, Unidad Mixta TRIAL (Fundación para la Investigación del Hospital General Universitario de Valencia y Centro de Investigación Príncipe Felipe) and Centro de Investigación Biomédica en Red Cáncer (CIBERONC), Valencia, Spain
Eloísa Jantus-Lewintre
Department of Biotechnology, Universitat Politècnica de València, Unidad Mixta TRIAL (Fundación para la Investigación del Hospital General Universitario de Valencia y Centro de Investigación Príncipe Felipe) and CIBERONC, Valencia, Spain
Maria Navamuel-Andueza
Pulmonary Department, CUN, CCUN and IdisNA, University of Navarra, Pamplona, Spain
Miguel F. Sanmamed
Department of Oncology, CUN, Division of Immunology, Cima, CCUN, IdisNA and CIBERONC, University of Navarra, Pamplona, Spain
Ignacio Melero
Division of Immunology, Cima and Immunotherapy, CUN, CCUN, IdisNA and CIBERONC, University of Navarra, Pamplona, Spain
Mohamed Elgendy
Institute for Clinical Chemistry and Laboratory Medicine, Mildred-Scheel Early Career Center, National Center for Tumor Diseases Dresden (NCT/UCC), University Hospital and Faculty of Medicine, Medical Clinic I, University Hospital Carl Gustav Carus, Technische Universität Dresden, Dresden, Germany. Laboratory of Cancer Cell Biology, Institute of Molecular Genetics of the Czech Academy of Sciences, Prague, Czech Republic
Juan Pablo Fusco
Department of Medical Oncology, Hospital La Luz, Quirón, Madrid, Spain
Javier J. Zulueta
Pulmonary, Critical Care, and Sleep Division, Mount Sinai Morningside Hospital, New York, USA
Juan P. de-Torres
Pulmonary Department, CUN, CCUN and IdisNA, University of Navarra, Pamplona, Spain
Gorka Bastarrika
Department of Radiology, CUN, CCUN and IdisNA, Pamplona, Spain
Luis Seijo
Pulmonary Department, CUN, CCUN and Centro de Investigación Biomédica en Red de Enfermedades Respiratorias (CIBERES), University of Navarra, Madrid, Spain
Ruben Pio
Program in Solid Tumors, Cima-CCUN, Department of Biochemistry and Genetics, School of Science, IdisNA and CIBERONC, University of Navarra, Pamplona, Spain
Luis M. Montuenga
Program in Solid Tumors, Cima, Department of Pathology, Anatomy and Physiology, Schools of Medicine and Sciences, CCUN, IdisNA and CIBERONC, University of Navarra, Pamplona, Spain
Mikel Hernáez
Computational Biology Program, Cima, Data Science and Artificial Intelligence Institute (DATAI), CCUN, IdisNA and CIBERONC, University of Navarra, Pamplona, Spain
Idoia Ochoa
Electrical and Electronic Engineering Department, Tecnun, DATAI, University of Navarra, San Sebastian, Spain
Jose Luis Perez-Gracia
Department of Oncology, CUN, CCUN, IdisNA and CIBERONC, University of Navarra, Pamplona, Spain. Corresponding author: Department of Oncology, Clínica Universidad de Navarra, Avda. Pio XII, 36, 31008, Pamplona, Spain.
Summary
Background: Tobacco is the main risk factor for developing lung cancer. Yet some heavy smokers develop lung cancer at a young age, while others never develop it, even at an advanced age. This suggests remarkable variability in individual susceptibility to the carcinogenic effects of tobacco. We characterized the germline profile of subjects presenting these extreme phenotypes using whole exome sequencing (WES) and machine learning (ML).
Methods: We sequenced germline DNA from heavy smokers who either developed lung adenocarcinoma at an early age (extreme cases) or did not develop lung cancer at an advanced age (extreme controls), selected from databases including over 6600 subjects. We selected individual coding genetic variants and variant-rich genes showing a significantly different distribution between extreme cases and controls. We validated the results from our discovery cohort in a validation cohort of extreme cases and controls presenting similar phenotypes, also analysed by WES. We developed ML models using both cohorts.
Findings: Mean age for extreme cases and controls was 50.7 and 79.1 years, respectively, and mean tobacco consumption was 34.6 and 62.3 pack-years, respectively. We validated 16 individual variants and 33 variant-rich genes. The gene harbouring the most validated variants was HLA-A in extreme controls (4 variants in the discovery cohort, p = 3.46E-07; and 4 in the validation cohort, p = 1.67E-06). We trained ML models using the 16 individual variants as input in the discovery cohort and tested them on the validation cohort, obtaining an accuracy of 76.5% and an AUC-ROC of 83.6%. Functions of validated genes included candidate oncogenes, tumour suppressors, DNA repair, HLA-mediated antigen presentation and regulation of proliferation, apoptosis, inflammation and immune response.
Interpretation: Individuals presenting extreme phenotypes of high and low risk of developing tobacco-associated lung adenocarcinoma show different germline profiles. Our strategy may allow the identification of high-risk subjects and the development of new therapeutic approaches.
Funding: See a detailed list of funding bodies in the Acknowledgements section at the end of the manuscript.
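The Findings report two held-out performance metrics, accuracy and AUC-ROC, for models trained on the discovery cohort and tested on the validation cohort. As a reading aid, the following minimal Python sketch shows how these two metrics are conventionally computed from true labels and model scores. It is not the authors' pipeline: the function names, the 0.5 decision threshold, and the toy labels and scores are all hypothetical and are not data from the study.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of subjects whose predicted class matches the true class."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc_roc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC-ROC as the probability that a random case outscores a random control.

    A tie between a case and a control counts as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical held-out labels (1 = extreme case, 0 = extreme control)
# and hypothetical model scores; these are NOT data from the study.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]

# Assumed decision threshold of 0.5 for turning scores into class labels.
y_pred = [int(s >= 0.5) for s in scores]

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"AUC-ROC  = {auc_roc(y_true, scores):.3f}")
```

Accuracy depends on the chosen threshold, whereas AUC-ROC summarizes ranking quality across all thresholds, which is why the two reported figures can differ.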