Advanced Intelligent Systems (Dec 2023)

GSCtool: A Novel Descriptor that Characterizes the Genome for Applying Machine Learning in Genomics

  • Zijie Shen,
  • Enhui Shen,
  • Qian-Hao Zhu,
  • Longjiang Fan,
  • Quan Zou,
  • Chu-Yu Ye

DOI
https://doi.org/10.1002/aisy.202300426
Journal volume & issue
Vol. 5, no. 12

Abstract

Machine learning (ML) is one of the core driving forces for the next breeding stage, Breeding 4.0. A genotype matrix based on single‐nucleotide polymorphisms (SNPs) is often used in ML for genome‐to‐phenotype prediction. However, the genotype matrix has an inherent defect: the feature spaces it generates are inconsistent across different individuals or groups, which hinders the application of ML. To overcome this challenge, a genome descriptor, the Genic SNPs Composition Tool (GSCtool), is developed. It counts the number of SNPs in each gene of the genome, so the dimension of the feature vectors equals the number of annotated genes in a species. Compared with the genotype matrix, GSCtool significantly decreases model training time and achieves higher accuracy in phenotype prediction. GSCtool also performs well in variety identification, which is useful for crop variety protection. In general, GSCtool will help facilitate the application and study of genomic ML. The source code and test data of GSCtool are freely available at https://github.com/SZJhacker/GSCtool and https://gitee.com/shenzijie/GSCtool.
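The per-gene counting idea described in the abstract can be illustrated with a minimal Python sketch. This is not the GSCtool implementation: the function name gene_snp_counts, the tuple layouts for genes and SNPs, and the toy coordinates are all hypothetical, and real input would come from an annotation file and a variant file rather than in-memory lists. The sketch only shows why two samples with different SNP sites still yield feature vectors of the same length, one entry per annotated gene.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def gene_snp_counts(genes, snps):
    """Count SNPs falling inside each gene (illustrative sketch only).

    genes: list of (gene_id, chrom, start, end), 1-based inclusive coordinates.
    snps:  iterable of (chrom, pos) for one sample.
    Returns a list of counts in the same order as `genes`.
    """
    # Group SNP positions by chromosome and sort them for binary search.
    positions = defaultdict(list)
    for chrom, pos in snps:
        positions[chrom].append(pos)
    for chrom in positions:
        positions[chrom].sort()

    counts = []
    for _gene_id, chrom, start, end in genes:
        sorted_pos = positions.get(chrom, [])
        # Number of positions p with start <= p <= end.
        n = bisect_right(sorted_pos, end) - bisect_left(sorted_pos, start)
        counts.append(n)
    return counts


# Hypothetical toy annotation: three genes on two chromosomes.
genes = [
    ("geneA", "chr1", 100, 200),
    ("geneB", "chr1", 300, 400),
    ("geneC", "chr2", 50, 150),
]

# Two samples whose SNP sites do not overlap at all.
sample_1 = [("chr1", 120), ("chr1", 200), ("chr1", 350), ("chr2", 10)]
sample_2 = [("chr1", 250), ("chr2", 60), ("chr2", 70), ("chr2", 150)]

vector_1 = gene_snp_counts(genes, sample_1)
vector_2 = gene_snp_counts(genes, sample_2)
print(vector_1, vector_2)
```

Both vectors have exactly len(genes) entries regardless of which SNP sites each sample carries, which is the property the abstract contrasts with the genotype matrix. SNPs outside any annotated gene are simply ignored in this sketch.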

Keywords