IEEE Access (Jan 2020)

A Novel Deep Neural Networks Model Based on Prime Numbers for Y DNA Haplogroup Prediction

  • Jasbir Dhaliwal,
  • Keong Jin,
  • Zhe Jin

DOI: https://doi.org/10.1109/ACCESS.2020.3022274
Journal volume & issue: Vol. 8, pp. 169096–169105

Abstract

Most of the Y chromosome (Ychr) region (approximately 95%) passes unchanged from father to son, except for the gradual accumulation of single-nucleotide polymorphism (SNP) mutations. As a result, mutations are inherited together, and all males in the direct paternal line share an identical pattern of variants. These mutation patterns serve as markers and can be mapped into clusters known as Y DNA haplogroups. Besides lineage tracing, haplogroups have been associated with male infertility, semen parameters, and, more recently, disease progression in several populations. Haplogroup prediction research is therefore gaining importance with the growing interest in personalized medicine. Notably, there are two approaches to predicting haplogroups, differing in the genetic markers used as input: short tandem repeats (STRs) or SNPs. STRs have limitations, as similar STR haplotypes occur across haplogroups, which reduces the effectiveness of STR-based haplogroup prediction tools. By contrast, current SNP-based haplogroup prediction tools are computationally expensive. To date, no studies have leveraged traditional machine learning or deep learning algorithms to identify mutation patterns using SNPs alone, and this paper proposes a novel SNP-based deep neural network (DNN) model. However, DNNs suffer from the curse of dimensionality and become computationally expensive on large datasets. This paper overcomes that limitation by proposing a novel feature extraction method based on prime numbers that computes features in either the forward or reverse direction of the SNP data. Our experimental results show that the model achieves a categorical cross-entropy loss as low as 0.001 on the training dataset and 0.039 on the test dataset.
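The abstract does not spell out the prime-number feature extraction, so the following is only a hypothetical sketch of the general idea, not the authors' algorithm: assign the i-th prime to the i-th SNP position and emit the running product over positions carrying the derived allele, scanned in either the forward or the reverse direction. By the unique-factorization property of integers, each running product encodes exactly which mutated positions have been seen so far. The function names (`primes`, `prime_features`) and the 0/1 input encoding are assumptions for illustration.

```python
def primes(n):
    """Return the first n prime numbers by simple trial division."""
    found = []
    candidate = 2
    while len(found) < n:
        if all(candidate % p for p in found):
            found.append(candidate)
        candidate += 1
    return found


def prime_features(snps, reverse=False):
    """Map a 0/1 SNP vector to cumulative prime-product features.

    Each position i is paired with the i-th prime; the feature at each
    step is the product of the primes for all mutated (value 1)
    positions visited so far, scanning forward or in reverse.
    """
    ps = primes(len(snps))
    order = range(len(snps) - 1, -1, -1) if reverse else range(len(snps))
    features, acc = [], 1
    for i in order:
        if snps[i]:
            acc *= ps[i]
        features.append(acc)
    return features


# Example: SNP vector [1, 0, 1] pairs with primes [2, 3, 5].
# Forward scan yields [2, 2, 10]; reverse scan yields [5, 5, 10].
print(prime_features([1, 0, 1]))
print(prime_features([1, 0, 1], reverse=True))
```

Because prime factorizations are unique, two different mutation patterns can never collapse to the same final product, which is one plausible reason primes (rather than arbitrary weights) would make the encoding injective.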

Keywords