Infectious Disease Modelling (Dec 2023)

High-dimensional supervised classification in a context of non-independence of observations to identify the determining SNPs in a phenotype

  • Aboubacry Gaye,
  • Abdou Ka Diongue,
  • Lionel Nanguep Komen,
  • Amadou Diallo,
  • Seydou Nourou Sylla,
  • Maryam Diarra,
  • Cheikh Talla,
  • Cheikh Loucoubar

Journal volume & issue
Vol. 8, no. 4
pp. 1079 – 1087

Abstract

This work addresses the problem of supervised classification for highly correlated, high-dimensional data describing non-independent observations, with the aim of identifying SNPs related to a phenotype. We use a general penalized linear mixed model with a single random effect that performs SNP selection and population structure adjustment simultaneously in high-dimensional prediction models. Specifically, the model selects variables and estimates their effects at the same time, taking into account correlations between individuals.

Single nucleotide polymorphisms (SNPs) are a type of genetic variation; each SNP represents a difference in a single DNA building block, namely a nucleotide. Previous research has shown that SNPs can be used to identify the correct source population of an individual and can act in isolation or jointly to affect a phenotype. In this regard, studying the contribution of genetics to infectious disease phenotypes is of great importance.

In this study, we used uncorrelated variables, derived from blocks of correlated variables constructed in previous work, to describe the most closely related observations in the dataset. The model was trained on 90% of the observations and tested on the remaining 10%. The best model, selected with the generalized information criterion (GIC), identified the SNP rs2493311, located in the PRDM16 gene (PR/SET domain 16) on chromosome 1, as the most decisive factor in malaria attacks.
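
The workflow summarized in the abstract (a lasso-type penalty inside a linear mixed model with one random effect, a 90%/10% train/test split, and model choice by GIC) can be illustrated with a minimal sketch. This is not the authors' implementation. The data are simulated, the phenotype is continuous rather than binary, there is no intercept, the kinship matrix is an empirical estimate from standardized genotypes, the variance ratio `eta` is searched over a small grid, and the GIC weight `a_n = log(log n) * log p` is one common choice; all of these, and every function and variable name, are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Simulated stand-in data (hypothetical; not the study's dataset) ---
n, p_all, p, causal = 200, 300, 30, 0
pop = rng.integers(0, 2, size=n)                  # two subpopulations
freqs = rng.uniform(0.2, 0.8, size=(2, p_all))    # allele frequencies per subpopulation
G = rng.binomial(2, freqs[pop]).astype(float)     # genotypes coded 0/1/2
G -= G.mean(axis=0)                               # centre each SNP
Z = G / G.std(axis=0)
K = Z @ Z.T / p_all                               # empirical kinship from all SNPs
X = G[:, :p]                                      # candidate SNPs for selection

# Phenotype = one causal SNP + polygenic random effect (cov 0.5*K) + noise
b = np.sqrt(0.5) * Z @ rng.normal(size=p_all) / np.sqrt(p_all)
y = 2.0 * X[:, causal] + b + np.sqrt(0.5) * rng.normal(size=n)

# --- 90% / 10% train/test split ---
idx = rng.permutation(n)
n_train = int(0.9 * n)
tr, te = idx[:n_train], idx[n_train:]

# Rotate by the kinship eigenvectors so the rotated errors are uncorrelated:
# Cov(U'y) = sigma^2 * diag(eta * d + 1 - eta)
d, U = np.linalg.eigh(K[np.ix_(tr, tr)])
d = np.clip(d, 0.0, None)
Xr, yr = U.T @ X[tr], U.T @ y[tr]


def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def weighted_lasso(X, y, w, lam, n_iter=50):
    """Coordinate descent for (1/2n) * sum_i w_i (y_i - x_i'beta)^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    resid = y.copy()
    col_norm = (w[:, None] * X**2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            rho = (w * X[:, j] * resid).sum() / n + col_norm[j] * beta[j]
            new = soft_threshold(rho, lam) / col_norm[j]
            resid -= X[:, j] * (new - beta[j])
            beta[j] = new
    return beta


def gic_fit(Xr, yr, d, etas=(0.1, 0.5, 0.9), n_lams=15):
    """Grid search over (eta, lambda); keep the fit with the smallest GIC."""
    n, p = Xr.shape
    a_n = np.log(np.log(n)) * np.log(p)           # assumed GIC weight
    best = (np.inf, None, None, None)
    for eta in etas:
        v = eta * d + (1.0 - eta)                 # per-coordinate variance factor
        w = 1.0 / v
        lam_max = np.max(np.abs(Xr.T @ (w * yr))) / n
        for lam in np.geomspace(lam_max, 0.01 * lam_max, n_lams):
            beta = weighted_lasso(Xr, yr, w, lam)
            r = yr - Xr @ beta
            sigma2 = np.sum(w * r**2) / n         # profiled residual variance
            neg2ll = n * np.log(2 * np.pi * sigma2) + np.sum(np.log(v)) + n
            gic = neg2ll + a_n * np.count_nonzero(beta)
            if gic < best[0]:
                best = (gic, eta, lam, beta)
    return best


gic, eta, lam, beta = gic_fit(Xr, yr, d)
selected = np.flatnonzero(beta)                   # indices of retained SNPs
top_snp = int(np.argmax(np.abs(beta)))            # SNP with the largest effect
# Test-set error using fixed effects only (random effect not predicted here)
test_mse = float(np.mean((y[te] - X[te] @ beta) ** 2))
print("selected SNPs:", selected, "| top SNP:", top_snp, "| test MSE:", round(test_mse, 3))
```

In this sketch the retained SNP with the largest absolute coefficient plays the role that rs2493311 plays in the study; a binary phenotype such as the occurrence of malaria attacks would call for a different likelihood or a thresholding step that is not shown here.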

Keywords