Gdaphen, R pipeline to identify the most important qualitative and quantitative predictor variables from phenotypic data

Maria del Mar Muñiz Moreno; Claire Gavériaux-Ruff; Yann Herault

doi:10.1186/s12859-022-05111-0

BMC Bioinformatics (Jan 2023)

Gdaphen, R pipeline to identify the most important qualitative and quantitative predictor variables from phenotypic data

Maria del Mar Muñiz Moreno,
Claire Gavériaux-Ruff,
Yann Herault

Affiliations

Maria del Mar Muñiz Moreno: Université de Strasbourg, CNRS UMR7104, INSERM U1258, Institut de Génétique, Biologie Moléculaire Et Cellulaire (IGBMC)
Claire Gavériaux-Ruff: Université de Strasbourg, CNRS UMR7104, INSERM U1258, Institut de Génétique, Biologie Moléculaire Et Cellulaire (IGBMC)
Yann Herault: Université de Strasbourg, CNRS UMR7104, INSERM U1258, Institut de Génétique, Biologie Moléculaire Et Cellulaire (IGBMC)

DOI: https://doi.org/10.1186/s12859-022-05111-0
Journal volume & issue: Vol. 24, no. 1
pp. 1 – 18

Abstract

Read online

Abstract Background In individuals or animals suffering from genetic or acquired diseases, it is important to identify which clinical or phenotypic variables can be used to discriminate between disease and non-disease states, the response to treatments or sexual dimorphism. However, the data often suffers from low number of samples, high number of variables or unbalanced experimental designs. Moreover, several parameters can be recorded in the same test. Thus, correlations should be assessed, and a more complex statistical framework is necessary for the analysis. Packages already exist that provide analysis tools, but they are not found together, rendering the decision method and implementation difficult for non-statisticians. Result We present Gdaphen, a fast joint-pipeline allowing the identification of most important qualitative and quantitative predictor variables to discriminate between genotypes, treatments, or sex. Gdaphen takes as input behavioral/clinical data and uses a Multiple Factor Analysis (MFA) to deal with groups of variables recorded from the same individuals or anonymize genotype-based recordings. Gdaphen uses as optimized input the non-correlated variables with 30% correlation or higher on the MFA-Principal Component Analysis (PCA), increasing the discriminative power and the classifier’s predictive model efficiency. Gdaphen can determine the strongest variables that predict gene dosage effects thanks to the General Linear Model (GLM)-based classifiers or determine the most discriminative not linear distributed variables thanks to Random Forest (RF) implementation. Moreover, Gdaphen provides the efficacy of each classifier and several visualization options to fully understand and support the results as easily readable plots ready to be included in publications. We demonstrate Gdaphen capabilities on several datasets and provide easily followable vignettes. Conclusions Gdaphen makes the analysis of phenotypic data much easier for medical or preclinical behavioral researchers, providing an integrated framework to perform: (1) pre-processing steps as data imputation or anonymization; (2) a full statistical assessment to identify which variables are the most important discriminators; and (3) state of the art visualizations ready for publication to support the conclusions of the analyses. Gdaphen is open-source and freely available at https://github.com/munizmom/gdaphen , together with vignettes, documentation for the functions and examples to guide you in each own implementation.

Published in BMC Bioinformatics

ISSN: 1471-2105 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Biology (General)
Website: http://www.biomedcentral.com/bmcbioinformatics/

About the journal

Abstract

Keywords