Frontiers in Genetics (Jun 2020)

Robustification of Linear Regression and Its Application in Genome-Wide Association Studies

  • Md. Alamin,
  • Most. Humaira Sultana,
  • Haiming Xu,
  • Md. Nurul Haque Mollah

DOI
https://doi.org/10.3389/fgene.2020.00549
Journal volume & issue
Vol. 11

Abstract

Regression analysis is one of the most popular statistical techniques for exploring the relationship between a response (dependent) variable and one or more explanatory (independent) variables. To test the overall significance of a regression, the F-statistic is used when the parameters are estimated by least-squares estimators (LSEs), whereas the likelihood ratio test (LRT) statistic is used when they are estimated by maximum likelihood estimators (MLEs). However, both procedures produce misleading results and often fail to fit the main body of the data well in the presence of outlying observations. Moreover, outliers occur frequently in real datasets, including molecular OMICS datasets. Hence, this study robustifies MLE-based regression analysis by maximizing the β-likelihood function. The tuning parameter β is selected by cross-validation. For β = 0, the proposed method reduces to classical MLE-based regression analysis. We examine the performance of the proposed method using both synthetic and real data. The simulation results indicate that the proposed method outperforms traditional methods in parameter estimation and mean square error in the presence of both outliers and high leverage points. The relative efficiency analysis shows that, under a normal error distribution with high dimension and outliers, the proposed estimator is less affected than popular estimators, including S, MM, and fast-S. Real data analysis also demonstrates that the proposed method is robust to data contamination, overcoming the drawback of the traditional methods. Genome-wide association studies (GWAS) using the proposed method identify the vital gene influencing hypertension and iron levels in the liver and spleen of mice.
Furthermore, GWAS based on the proposed method identified 15 and 21 significant SNPs for chalkiness degree and chalkiness percentage, respectively. These SNP variants might provide new resources for grain quality traits and could be used in further molecular and physiological analyses to improve rice grain quality. These results offer an important basis for further understanding of robust regression analysis, which might be applied in various fields, including business, genetics, and bioinformatics.
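
The idea of robustifying normal-error regression through a β-weighted likelihood can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `beta_weighted_regression`, the least-squares starting values, the fixed-point update for the variance, and the default β = 0.2 are all assumptions made for illustration, and the cross-validation step for choosing β is omitted. Each observation receives the weight exp(-β r² / (2σ²)), where r is its residual, so large residuals are down-weighted; with β = 0 every weight equals 1 and the estimate coincides with the classical least-squares (normal MLE) fit.

```python
import numpy as np


def beta_weighted_regression(X, y, beta=0.2, n_iter=200, tol=1e-10):
    """Illustrative beta-weighted fit of y = X @ coef + normal error.

    Hypothetical helper, not the paper's code. X must already contain
    an intercept column if one is wanted.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    # Assumption: start from the ordinary least-squares solution.
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ coef
    sigma2 = np.mean(resid ** 2)
    weights = np.ones_like(y)

    for _ in range(n_iter):
        # Down-weight observations with large standardized residuals.
        weights = np.exp(-beta * resid ** 2 / (2.0 * sigma2))

        # Weighted least-squares update of the coefficients.
        Xw = X * weights[:, None]
        new_coef = np.linalg.solve(X.T @ Xw, Xw.T @ y)

        # Weighted variance update; the (1 + beta) factor offsets the
        # shrinkage that the weights induce under normal errors.
        resid = y - X @ new_coef
        sigma2 = (1.0 + beta) * np.sum(weights * resid ** 2) / np.sum(weights)

        converged = np.max(np.abs(new_coef - coef)) < tol
        coef = new_coef
        if converged:
            break

    return coef, sigma2, weights
```

Calling the function with `beta=0` returns the ordinary least-squares coefficients, while a small positive β pulls the fit back toward the uncontaminated observations when some responses are grossly shifted. The returned weights can also be inspected to flag suspected outliers, since contaminated points end up with weights near zero.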

Keywords