Iranian Journal of Animal Science Research (Oct 2021)

Comparison of principal component analysis (PCA) and discriminant analysis of principal component (DAPC) methods for analysis of population structure in Akhal-Take, Arabian and Caspian horse breeds using genomic data

  • Nasrin Babayi,
  • Abbad Rafat,
  • Mohammad Hossein Moradi,
  • Mohammad Reza Feizi derakhshi

DOI
https://doi.org/10.22067/ijasr.0621.39343
Journal volume & issue
Vol. 13, no. 3
pp. 453 – 462

Abstract

Introduction: The development of high-throughput, cost-effective genotyping methods in recent years has made it possible to evaluate genetic structure and the relationships among populations using genomic data. Genome-wide inference of population structure from genetic markers can provide valuable information on evolutionary relationships and on the clustering of subpopulations, which is useful for animal breeding programs. In large-scale studies, one question of particular interest is whether genetic differences exist among subgroups sampled from different geographic locations. The objective of this study was to compare principal component analysis (PCA) and discriminant analysis of principal components (DAPC) for determining population structure, and for assessing how well individuals are assigned to their true population of origin, in three Middle Eastern horse breeds (Akhal-Teke, Arabian and Caspian) using genomic data.

Materials and Methods: Genomic data from 61 animals, comprising Akhal-Teke (19), Arabian (24) and Caspian (18) horses, were used to investigate the population structure of these Asian horse breeds. The data were obtained from the Equine Genetic Diversity Consortium (EGDC) project. Hair or tissue samples were collected from the animals. DNA was extracted using an optimized Puregene (Qiagen) protocol, and approximately 1 μg of DNA was used for genotyping. Genotyping was performed on Illumina 50K SNP BeadChip arrays, which assay 52,603 SNP loci, following Illumina's standard guidelines. Several quality control steps were applied to the raw data to ensure genotype quality. Quality control was carried out in PLINK v1.07. Samples with more than 5% missing genotypes were excluded from the analysis.
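The sample-level missingness filter described above can be sketched as follows. This is a minimal Python illustration, not the PLINK v1.07 procedure the study ran. The genotype encoding (0, 1 or 2 allele copies, with `None` for a missing call), the function names and the toy samples are all assumptions made for the example.

```python
def missing_rate(genotypes):
    """Fraction of missing calls (None) in one sample's genotype list."""
    if not genotypes:
        raise ValueError("sample has no genotypes")
    return sum(g is None for g in genotypes) / len(genotypes)


def filter_samples_by_missingness(samples, max_missing=0.05):
    """Keep samples whose missing-call rate does not exceed max_missing.

    samples: dict mapping a sample ID to a list of genotype calls
             (0, 1 or 2 allele copies; None marks a missing call).
    """
    return {
        sample_id: calls
        for sample_id, calls in samples.items()
        if missing_rate(calls) <= max_missing
    }


# Hypothetical toy data: 20 SNPs per sample.
toy_samples = {
    "horse_A": [0, 1, 2, 1] * 5,         # no missing calls
    "horse_B": [None] + [1] * 19,        # 1 of 20 missing (5%), kept
    "horse_C": [None, None] + [2] * 18,  # 2 of 20 missing (10%), dropped
}

kept = filter_samples_by_missingness(toy_samples)
print(sorted(kept))
```

The threshold is applied as "more than 5% missing is excluded", so a sample sitting exactly at 5% is retained in this sketch.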
Then, for each SNP, minor allele frequency (MAF) and call rate were calculated, and SNPs with a call rate < 95% and those with a MAF < 2% were discarded. The remaining SNPs were tested for deviation from Hardy-Weinberg equilibrium (p < 10⁻⁶) to identify genotyping errors. The Bonferroni correction (β = α/n) was used to address the multiple-testing problem. Principal component analysis (PCA) is a statistical technique that summarizes data from many variables into a few new variables that capture as much of the variation in the data as possible. For this purpose, the variance-covariance matrix of the variables was first calculated and the principal components were then extracted. Each principal component has an associated eigenvalue that measures the amount of variance it explains. Discriminant analysis of principal components (DAPC), in turn, is a model-independent multivariate method designed to identify and describe clusters of genetically related individuals. When group priors are lacking, DAPC uses sequential K-means and model selection to infer genetic clusters. The analyses were performed with the PCA and DAPC approaches, and the analysis code was written in R v3.4.1.

Results and Discussion: Principal component analysis summarizes the overall variation among individuals, which includes both the variation between groups and the variation within groups, and shows a clear picture of the differences between the groups. The first two components explained 10.8% of the variance in both the PCA and DAPC methods. Both methods assigned individuals to their true population of origin with high accuracy, and both were able to cluster the three populations separately. The Bayesian information criterion (BIC) was used to determine the optimal number of clusters for the DAPC method; K = 3 had the lowest BIC and completely separated the three populations.
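To illustrate how a figure such as the share of variance explained by the first two components is obtained, the sketch below computes the leading eigenvalues of a variance-covariance matrix and divides their sum by the total variance (the trace). It is a plain-Python illustration using power iteration with deflation, not the R code used in the study. The function names and the small genotype matrix are hypothetical.

```python
import math


def covariance_matrix(rows):
    """Sample variance-covariance matrix of the columns of `rows`."""
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
        for i in range(p)
    ]


def leading_eigenvalues(matrix, k, iterations=2000):
    """Approximate the k largest eigenvalues of a symmetric, positive
    semi-definite matrix by power iteration with deflation."""
    p = len(matrix)
    work = [row[:] for row in matrix]
    eigenvalues = []
    for _component in range(k):
        # Asymmetric start vector, unlikely to be orthogonal to the
        # leading eigenvector.
        vec = [float(i + 1) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
        for _step in range(iterations):
            product = [sum(work[i][j] * vec[j] for j in range(p)) for i in range(p)]
            norm = math.sqrt(sum(x * x for x in product))
            if norm < 1e-12:  # remaining variance is numerically zero
                break
            vec = [x / norm for x in product]
        # Rayleigh quotient of the unit vector gives the eigenvalue estimate.
        av = [sum(work[i][j] * vec[j] for j in range(p)) for i in range(p)]
        value = sum(vec[i] * av[i] for i in range(p))
        eigenvalues.append(value)
        # Deflation: remove the component just found.
        for i in range(p):
            for j in range(p):
                work[i][j] -= value * vec[i] * vec[j]
    return eigenvalues


def explained_fraction(cov, k):
    """Share of total variance captured by the first k principal components."""
    total_variance = sum(cov[i][i] for i in range(len(cov)))
    return sum(leading_eigenvalues(cov, k)) / total_variance


# Hypothetical genotypes: 6 individuals x 4 SNPs, coded as allele counts.
toy_genotypes = [
    [0, 0, 1, 2],
    [0, 1, 1, 2],
    [1, 0, 0, 2],
    [2, 2, 1, 0],
    [2, 1, 2, 0],
    [1, 2, 2, 1],
]
cov = covariance_matrix(toy_genotypes)
print(f"PC1 + PC2 explain {explained_fraction(cov, 2):.1%} of the variance")
```

With tens of thousands of SNPs and only 61 animals, the variance is spread over many components, which is why the first two can account for a modest share such as the 10.8% reported here.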
DAPC separated the populations better than PCA because it increases between-group variance and reduces within-group variance. It also performed better than PCA in determining the optimal number of clusters (K) and gave a clearer picture of the relationships among individuals. These results show that DAPC can be used in quality control for genome-wide association studies (GWAS) as an alternative to PCA, because it summarizes the genetic differentiation between groups while overlooking within-group variation and therefore describes population structure better.

Conclusion: Although previous studies grouped these three Middle Eastern breeds into a single cluster in tree-based analyses, the results of this study show that the three breeds are grouped separately and that the DAPC method better illustrates inter-population relationships in horse breeds.
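The contrast between between-group and within-group variance that underlies the DAPC result can be made concrete with a small sketch. The Python code below is an illustration only, not the study's R analysis and not a DAPC implementation. It splits the total sum of squares of scores along a single axis into between-group and within-group parts. The function name and the score values are hypothetical; only the breed labels come from the abstract.

```python
def variance_partition(scores, groups):
    """Split the total sum of squares of 1-D scores into
    (between-group, within-group) sums of squares."""
    grand_mean = sum(scores) / len(scores)
    by_group = {}
    for score, group in zip(scores, groups):
        by_group.setdefault(group, []).append(score)
    between = 0.0
    within = 0.0
    for values in by_group.values():
        group_mean = sum(values) / len(values)
        between += len(values) * (group_mean - grand_mean) ** 2
        within += sum((x - group_mean) ** 2 for x in values)
    return between, within


breeds = ["Akhal-Teke", "Akhal-Teke", "Arabian", "Arabian", "Caspian", "Caspian"]

# Hypothetical scores on two candidate axes.
axis_a = [1.0, 5.0, 2.0, 6.0, 3.0, 7.0]  # groups overlap heavily
axis_b = [1.0, 1.2, 4.0, 4.2, 7.0, 7.2]  # groups are well separated

for name, scores in (("axis A", axis_a), ("axis B", axis_b)):
    between, within = variance_partition(scores, breeds)
    print(f"{name}: between = {between:.2f}, within = {within:.2f}, "
          f"ratio = {between / within:.2f}")
```

An axis chosen only for its total variance can resemble axis A, whereas a discriminant axis is chosen for a high between-to-within ratio, as on axis B.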

Keywords