IEEE Access (Jan 2020)
A New TTZ Feature Extracting Algorithm to Decipher Tobacco Related Mutation Signature Genes for the Personalized Lung Adenocarcinoma Treatment
Abstract
The large proportion of lung adenocarcinomas (LUAD) arising in lifetime nonsmokers and the low sensitivity of the known major tobacco biomarkers make it urgent to identify genuine molecular signatures for the corresponding personalized treatment. Moreover, cancer is presumed to have a symptomatology that depends strongly on modules of functionally related genes rather than on a single important gene. Our aims, therefore, are to identify signature genes by optimizing the tobacco exposure pattern (TEP) classification model and to uncover their interaction relationships at different molecular levels. A new method, TTZ, is proposed to extract features as input variables to the TEP classification model. Based on the Z-curve method, TTZ is able to extract features not only from mutation frequencies but also from sequencing information of insertions and deletions. Two independent LUAD datasets, The Cancer Genome Atlas (TCGA) and Broad data, are downloaded to train and test the TEP classification model. Thirty-four genes are identified as tobacco-related mutational signature genes, with accuracies of 93.55% and 92.65% on the training and validation data, respectively. The inference of genetic and protein-protein interaction (PPI) networks reveals that LAMA1, EGFR, KRAS and TNN are the most connected core genes. Six signature genes are shown to be significantly involved in the cilium damage pathway, which is considered one of the root causes of lung cancer. The identified signature genes may serve as potential drug targets for the precision medicine of LUAD. Most importantly, the TTZ feature extraction method can be easily extended to other disease-related or cancer-related mutational signature identification problems.
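For readers unfamiliar with the Z-curve method on which TTZ is based, the following is a minimal sketch of the standard Z-curve transform of a DNA sequence. It is not the TTZ algorithm itself, whose details are not given in this abstract; the function name `z_curve` and the example sequence are illustrative assumptions. The three coordinates are the cumulative counts x_n = (A_n + G_n) - (C_n + T_n), y_n = (A_n + C_n) - (G_n + T_n) and z_n = (A_n + T_n) - (G_n + C_n).

```python
# Standard Z-curve transform (illustrative sketch, not the TTZ method).
# Each base moves the curve by one step along each of the three axes:
#   x: purine (A, G) vs. pyrimidine (C, T)
#   y: amino (A, C) vs. keto (G, T)
#   z: weak hydrogen bond (A, T) vs. strong hydrogen bond (G, C)
STEPS = {
    "A": (1, 1, 1),
    "C": (-1, 1, -1),
    "G": (1, -1, -1),
    "T": (-1, -1, 1),
}


def z_curve(sequence):
    """Return the list of cumulative (x_n, y_n, z_n) points for a DNA sequence."""
    x = y = z = 0
    points = []
    for base in sequence.upper():
        if base not in STEPS:
            raise ValueError(f"unsupported base: {base!r}")
        dx, dy, dz = STEPS[base]
        x, y, z = x + dx, y + dy, z + dz
        points.append((x, y, z))
    return points


if __name__ == "__main__":
    # Hypothetical short sequence, e.g. an inserted fragment.
    for n, point in enumerate(z_curve("ACGT"), start=1):
        print(n, point)
```

Coordinates of this kind could in principle serve as numerical features for inserted or deleted sequence fragments, although how TTZ combines them with mutation frequencies is not specified here.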
Keywords