Acta Informatica Pragensia (Oct 2023)

Multi-Class Text Classification on Khmer News Using Ensemble Method in Machine Learning Algorithms

  • Raksmey Phann,
  • Chitsutha Soomlek,
  • Pusadee Seresangtakul

DOI
https://doi.org/10.18267/j.aip.210
Journal volume & issue
Vol. 12, no. 2
pp. 243 – 259

Abstract

This study applies text classification to categorize Khmer news articles. The articles were collected from three websites through web scraping and grouped into nine categories. After text preprocessing, the dataset was split into training and testing sets. We then evaluated the performance of the ensemble learning method with several machine learning classifiers under k-fold cross-validation. The classifiers employed were logistic regression, Complement Naive Bayes, Bernoulli Naive Bayes, k-nearest neighbours, perceptron, support vector machines, stochastic gradient descent, AdaBoost, decision tree, and random forest. To improve categorization accuracy, Grid Search CV was used to find the optimal hyperparameters for each classifier with two feature extraction methods, TF-IDF and Delta TF-IDF. The highest accuracy was achieved by the ensemble learning method with the support vector machine and its optimal hyperparameters (C = 10, kernel = rbf): 83.47% with TF-IDF features and 83.40% with Delta TF-IDF features. The model shows that Khmer news articles can be categorized accurately.
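
As a rough illustration of the TF-IDF feature extraction mentioned in the abstract, the following minimal sketch computes term weights for a few tokenized documents. It assumes the textbook variant, weight = tf × ln(N / df); the abstract does not state which variant the authors used, and Delta TF-IDF is not shown. The function name `tf_idf` and the toy documents (English stand-ins for segmented Khmer tokens) are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: weight = tf * ln(N / df), where tf is the term's
    relative frequency in the document, N is the number of documents,
    and df is the number of documents containing the term.
    """
    n_docs = len(docs)
    # Document frequency: count each term once per document.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus; real input would be segmented Khmer text.
docs = [
    ["football", "match", "goal"],
    ["election", "vote", "match"],
    ["football", "league", "goal"],
]
weights = tf_idf(docs)
```

Terms that occur in fewer documents receive larger weights: in the second document, "election" (found in one document) outweighs "match" (found in two), which is what makes such features useful for separating news categories.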

Keywords