The Plant Genome (Mar 2019)

Machine Learning as an Effective Method for Identifying True Single Nucleotide Polymorphisms in Polyploid Plants

  • Walid Korani,
  • Josh P. Clevenger,
  • Ye Chu,
  • Peggy Ozias-Akins

DOI
https://doi.org/10.3835/plantgenome2018.05.0023
Journal volume & issue
Vol. 12, no. 1

Abstract

Single nucleotide polymorphisms (SNPs) have many advantages as molecular markers because they are ubiquitous and codominant. However, discovering true SNPs in polyploid species is difficult. Peanut (Arachis hypogaea L.) is an allopolyploid with a very low rate of true SNP calls. A large set of true and false SNPs identified from the Axiom_Arachis 58K array was leveraged to train machine-learning models that identify true SNPs directly from sequence data, reducing ascertainment bias. These models achieved accuracy rates above 80% on real peanut RNA sequencing (RNA-seq) and whole-genome shotgun (WGS) resequencing data, which is higher than previously reported for polyploids and at least a twofold improvement for peanut. A 48K SNP array, Axiom_Arachis2, was designed using this approach, resulting in 75% accuracy in calling SNPs from different tetraploid peanut genotypes. When the method was used to simulate SNP variation in several polyploids, models achieved >98% accuracy in selecting true SNPs. Additionally, models built with simulated genotypes selected true SNPs at >80% accuracy on real peanut data. This work accomplished the objective of creating an effective approach for calling highly reliable SNPs from polyploids using machine learning. A novel tool for predicting true SNPs from sequence data, designated SNP machine learning (SNP-ML), was developed using the described models. The tool also includes a component, designated SNP machine learner (SNP-MLer), for training new models not included in this study for customized use. The SNP-ML is publicly available.
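The general idea described above (train a classifier on candidate SNPs labeled true or false by an array, then score it on held-out candidates) can be illustrated with a minimal sketch. This is not the SNP-ML implementation: the abstract does not state which features or model types were used, so the two features (`alt_read_fraction`, `unique_mapping_fraction`), the invented toy values, and the nearest-centroid classifier are all assumptions chosen only to keep the example short and dependency-free.

```python
from statistics import mean

# Hypothetical toy data, invented for illustration only.
# Each row: ((alt_read_fraction, unique_mapping_fraction), validated_on_array)
TRAIN = [
    ((0.96, 0.95), True),
    ((0.98, 0.97), True),
    ((0.93, 0.90), True),
    ((0.52, 0.55), False),
    ((0.47, 0.50), False),
    ((0.55, 0.60), False),
]
HELD_OUT = [
    ((0.95, 0.93), True),
    ((0.50, 0.52), False),
    ((0.91, 0.88), True),
    ((0.49, 0.58), False),
]


def fit_centroids(rows):
    """Return {label: centroid}, where a centroid is the per-feature mean."""
    centroids = {}
    for label in {lab for _, lab in rows}:
        feats = [f for f, lab in rows if lab == label]
        centroids[label] = tuple(mean(col) for col in zip(*feats))
    return centroids


def predict(centroids, features):
    """Assign the label whose centroid is closest (squared Euclidean distance)."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def accuracy(centroids, rows):
    """Fraction of rows whose predicted label matches the array-derived label."""
    correct = sum(predict(centroids, f) == lab for f, lab in rows)
    return correct / len(rows)


centroids = fit_centroids(TRAIN)
print("held-out accuracy:", accuracy(centroids, HELD_OUT))
```

The toy classes are cleanly separated, so the score here says nothing about real data. The point is only the workflow: labels come from array validation, and accuracy is measured on candidates the model did not see during training.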