Scientific Reports (Jul 2022)

Machine learning techniques for identification of carcinogenic mutations, which cause breast adenocarcinoma

  • Asghar Ali Shah,
  • Hafiz Abid Mahmood Malik,
  • AbdulHafeez Mohammad,
  • Yaser Daanial Khan,
  • Abdullah Alourani

DOI
https://doi.org/10.1038/s41598-022-15533-8
Journal volume & issue
Vol. 12, no. 1
pp. 1 – 15

Abstract

Breast adenocarcinoma is the most common cancer in women. According to United States survey data, more than 282,000 breast cancer patients are registered each year, most of them women. Detecting cancer at an early stage saves many lives. Each cell carries its genetic code as gene sequences, and every gene has a specific sequence of nucleotides. Replication and/or recombination sometimes cause a permanent change in the nucleotide sequence of the genome, called a mutation, and cancer driver mutations can lead to cancer. This study develops a machine learning framework for the early detection of breast adenocarcinoma. Various studies have identified a total of 99 genes whose mutations can lead to breast adenocarcinoma. The dataset is taken from 4127 human samples, including men and women, from more than 12 cohorts, and a total of 6170 mutations in gene sequences are used. Decision Tree, Random Forest, and Gaussian Naïve Bayes classifiers are applied to these gene sequences and assessed with three evaluation methods: independent set testing, self-consistency testing, and tenfold cross-validation testing. Accuracy, specificity, sensitivity, and the Matthews correlation coefficient are calculated as evaluation metrics. The decision tree algorithm obtains the best accuracy, 99%, under each evaluation method.
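
As an illustration of the four evaluation metrics named in the abstract, the following minimal Python sketch computes accuracy, sensitivity, specificity, and the Matthews correlation coefficient (MCC) from binary predictions. It is not the authors' code. The function names `confusion_counts` and `evaluation_metrics`, the label encoding (1 = driver mutation, 0 = non-driver), and the example counts are all assumptions made for illustration; the formulas themselves are the standard definitions.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels.

    Assumed encoding (hypothetical): 1 = driver mutation, 0 = non-driver.
    """
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 0:
            tn += 1
        elif t == 0 and p == 1:
            fp += 1
        else:  # t == 1 and p == 0
            fn += 1
    return tp, tn, fp, fn


def evaluation_metrics(tp, tn, fp, fn):
    """Return accuracy, sensitivity, specificity, and MCC.

    Any ratio whose denominator is zero is reported as 0.0.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Illustrative counts only; these are not results from the study.
example = evaluation_metrics(tp=45, tn=45, fp=5, fn=5)
```

The same function can be applied to the predictions from each evaluation method (independent set, self-consistency, or each fold of a tenfold cross-validation) so that the classifiers are compared on identical metrics. MCC is useful alongside accuracy because it uses all four confusion-matrix cells and therefore stays informative when the classes are imbalanced.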