Journal of Epigenetics (Oct 2021)
Imputation of ungenotyped individuals based on genotyped relatives using Machine Learning Methodology
Abstract
Machine learning methods have been used in genetic studies to build models capable of predicting missing genotypes for both human and animal genetic variation. Genotype imputation, the prediction of unknown genotypes, is an important step in such studies. The objective of this study was to investigate machine learning as an imputation approach, to compare family-based methods, and to improve imputation performance in different scenarios. The accuracies of two methods, support vector machine (SVM) and random forest (RF), were also compared. The final population was simulated in the form of different family structures: 100 families, each consisting of one sire with a different number of genotyped progeny (2, 3, 4, 5 or 7). The number of markers was set to 5000 for the whole genome. Imputation of the sires in these families, together with other scenarios (both parents, labelled BothParents; sire or dam and one progeny; sire and maternal grandsire), was used to investigate the ability of the machine learning algorithms to impute genotypes. Imputation accuracy ranged from 0.78 to 0.99 across scenarios. The lowest imputation accuracy was obtained for the sire and maternal grandsire scenario with both methods. Increasing the number of progeny from 2 to 3 considerably increased imputation accuracy for both SVM and RF. Imputation of ungenotyped individuals from parent-offspring trios and pairs of close relatives is possible. However, when only one progeny and one parent, both parents (BothParents), or the sire and maternal grandsire are genotyped, the average imputation accuracy does not exceed 85%. Genotyped progeny are the best source of information for predicting the genotypes of ungenotyped individuals, and when more than 4 progeny are genotyped, imputation accuracy exceeds 95%. These results confirm that machine learning methods achieve good accuracy and computational speed in family trios and can be used in the estimation of breeding values.
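The idea of imputing a sire from its genotyped progeny can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the study's pipeline: a single biallelic marker with allele frequency p = 0.5, unrelated ungenotyped dams, genotypes coded 0/1/2, and a simple majority-vote lookup table standing in for SVM or RF. The function names (`simulate_family`, `fit_lookup`, `concordance`) are hypothetical. Because the toy uses one marker and no linkage information, its concordance values are not expected to match the accuracies reported above; it only shows how more genotyped progeny carry more information about the sire.

```python
import random
from collections import Counter, defaultdict


def simulate_family(n_progeny, p, rng):
    """Simulate one sire family at a single biallelic marker.

    Returns (sire_genotype, progeny_counts), with genotypes coded 0/1/2
    as the number of copies of one allele, and progeny_counts the number
    of progeny carrying genotype 0, 1 and 2.
    """
    # Sire genotype: two independent allele draws with frequency p.
    sire = (rng.random() < p) + (rng.random() < p)
    counts = [0, 0, 0]
    for _ in range(n_progeny):
        from_sire = rng.random() < sire / 2  # Mendelian transmission from the sire
        from_dam = rng.random() < p          # allele from an unrelated, ungenotyped dam
        counts[from_sire + from_dam] += 1
    return sire, tuple(counts)


def fit_lookup(train):
    """Majority-vote table: progeny genotype counts -> most frequent sire genotype.

    This is a deliberately simple stand-in classifier, not SVM or RF.
    """
    table = defaultdict(Counter)
    for sire, feats in train:
        table[feats][sire] += 1
    return {feats: votes.most_common(1)[0][0] for feats, votes in table.items()}


def concordance(n_progeny, n_train=20000, n_test=5000, p=0.5, seed=1):
    """Fraction of test sires whose genotype is predicted exactly."""
    rng = random.Random(seed)
    train = [simulate_family(n_progeny, p, rng) for _ in range(n_train)]
    test = [simulate_family(n_progeny, p, rng) for _ in range(n_test)]
    model = fit_lookup(train)
    # Progeny patterns never seen in training fall back to the heterozygote.
    hits = sum(model.get(feats, 1) == sire for sire, feats in test)
    return hits / n_test


if __name__ == "__main__":
    # Family sizes taken from the simulated scenarios in the abstract.
    for n in (2, 3, 4, 5, 7):
        print(n, round(concordance(n), 3))
```

Replacing the lookup table with a trained classifier and the single marker with a window of linked markers would bring the sketch closer to the kind of model the abstract describes.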
Keywords