Knowledge Engineering and Data Science (Oct 2023)

Comparison of Machine Learning Algorithms for Species Family Classification using DNA Barcode

  • Lala Septem Riza,
  • M Ammar Fadhlur Rahman,
  • Yudi Prasetyo,
  • Muhammad Iqbal Zain,
  • Herbert Siregar,
  • Topik Hidayat,
  • Khyrina Airin Fariza Abu Samah,
  • Miftahurrahma Rosyda

DOI
https://doi.org/10.17977/um018v6i22023p231-248
Journal volume & issue
Vol. 6, no. 2
pp. 231 – 248

Abstract


Classifying plant species within the Liliaceae and Amaryllidaceae families is challenging because of the complex genetic diversity and overlapping morphological traits among species. This study addresses that difficulty by comparing 11 supervised learning algorithms applied to DNA barcode data, with the aim of improving the precision of family-level classification in these taxonomically intricate plant families. The ribulose-1,5-bisphosphate carboxylase-oxygenase large subunit (rbcL) gene, selected as a DNA barcode locus for plants, is used to represent species within the Amaryllidaceae and Liliaceae families. The experimental results show that nearly all tested models classify species into the appropriate families with an accuracy exceeding 97%; the exception is the Naïve Bayes model. In terms of computation time, the Random Forest model requires significantly more training time than the other models. In terms of memory usage, the Least Squares Support Vector Machine with a polynomial kernel and Regularized Logistic Regression consume more memory than the other models. On the test dataset, the family predictions of these machine learning models show strong concordance with NCBI's classifications, effectively categorizing species into the Amaryllidaceae and Liliaceae families.
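
The comparison described in the abstract rests on three measurements per model: classification accuracy, training time, and memory use. The following is a minimal sketch of how such a measurement could be set up, not the authors' pipeline. Everything in it is an assumption made for illustration: the k-mer frequency encoding (the abstract does not state how sequences were turned into features), the `NearestCentroid` stand-in classifier (not one of the 11 algorithms in the study), the hypothetical `evaluate` helper, and the short synthetic sequences with placeholder labels, which are not real rbcL barcodes.

```python
import time
import tracemalloc
from collections import Counter
from itertools import product

K = 3  # k-mer length; an assumption, not taken from the paper
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]  # fixed feature order


def kmer_vector(seq, k=K):
    """Relative frequency of each k-mer in seq, in KMERS order."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # k-mers containing symbols outside ACGT are ignored.
    total = sum(counts[m] for m in KMERS) or 1
    return [counts[m] / total for m in KMERS]


class NearestCentroid:
    """Toy stand-in classifier: assign each sample to the closest class mean."""

    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            rows = [x for x, lab in zip(X, y) if lab == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict(self, X):
        def dist(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))

        return [
            min(self.centroids, key=lambda lab: dist(x, self.centroids[lab]))
            for x in X
        ]


def evaluate(model, X_train, y_train, X_test, y_test):
    """Hypothetical helper: accuracy, training time, and peak memory during fit."""
    tracemalloc.start()
    start = time.perf_counter()
    model.fit(X_train, y_train)
    train_seconds = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    predictions = model.predict(X_test)
    accuracy = sum(p == t for p, t in zip(predictions, y_test)) / len(y_test)
    return {
        "accuracy": accuracy,
        "train_seconds": train_seconds,
        "peak_bytes": peak_bytes,
    }


# Synthetic placeholder sequences and labels (NOT real barcode data).
train_seqs = ["ATATTATAATTATA", "TATAATATTAATAT", "GCGCCGCGGCCGCG", "CGCGGCGCCGGCGC"]
train_labels = ["family_A", "family_A", "family_B", "family_B"]
test_seqs = ["ATTATATAATAT", "GCCGCGCGGCGC"]
test_labels = ["family_A", "family_B"]

X_train = [kmer_vector(s) for s in train_seqs]
X_test = [kmer_vector(s) for s in test_seqs]

result = evaluate(NearestCentroid(), X_train, train_labels, X_test, test_labels)
print(result)
```

Running the same `evaluate` call over a list of models would give a per-model table of the three quantities the abstract reports. The toy sequences here are trivially separable, so the printed figures say nothing about the study's results.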