Naučno-tehničeskij Vestnik Informacionnyh Tehnologij, Mehaniki i Optiki (Jan 2017)

PROBABILITY DISTRIBUTION OVER THE SET OF CLASSES IN ARABIC DIALECT CLASSIFICATION TASK

  • O. V. Durandin,
  • N. R. Hilal,
  • D. Y. Strebkov,
  • N. Y. Zolotykh

DOI
https://doi.org/10.17586/2226-1494-2017-17-1-110-116
Journal volume & issue
Vol. 17, no. 1
pp. 110 – 116

Abstract


Subject of Research. We propose an approach to machine learning classification problems that uses information about the probability distribution over the set of class labels in the training data. The algorithm is illustrated on a complex natural language processing task: classification of Arabic dialects. Method. Each object in the training set is associated with a probability distribution over the class label set rather than with a single class label. The proposed approach takes this distribution into account when solving the classification problem in order to improve the quality of the resulting classifier. Main Results. The approach is illustrated on automatic classification of Arabic dialects. The analyzed data were mined from the Twitter social network, contain word-marks, and belong to six Arabic dialects (Saudi, Levantine, Algerian, Egyptian, Iraqi, and Jordanian) and to Modern Standard Arabic (MSA). The results demonstrate that taking probability distributions over the set of classes into account increases the quality of the resulting classifier. Experiments show that even a relatively naive use of the probability distributions improves the precision of the classifier from 44% to 67%. Practical Relevance. The approach and the corresponding algorithm can be used effectively when manual annotation by experts requires significant financial and time resources but a system of heuristic rules can be created. Implementing the proposed algorithm makes it possible to reduce data preparation costs significantly without substantial loss of classification precision.
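The Method paragraph above describes training objects that carry a probability distribution over class labels instead of a single label. The abstract does not specify the authors' algorithm, so the following is only a minimal sketch of one common way to use such distributions: a softmax (multinomial logistic) classifier trained with cross-entropy against soft targets. The function names `fit_soft_labels` and `predict_proba`, the hyperparameters, and the toy data are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def fit_soft_labels(X, P, lr=0.5, epochs=500):
    """Hypothetical helper: fit a softmax classifier to soft targets.

    X is an (n, d) feature matrix. P is an (n, k) matrix whose rows are
    probability distributions over the k classes (each row sums to 1).
    Minimizes the mean cross-entropy between P and the model output
    by plain batch gradient descent.
    """
    n, d = X.shape
    k = P.shape[1]
    W = np.zeros((d, k))
    b = np.zeros(k)
    for _ in range(epochs):
        Q = softmax(X @ W + b)   # current predicted distributions
        G = (Q - P) / n          # gradient of the mean loss w.r.t. the logits
        W -= lr * (X.T @ G)
        b -= lr * G.sum(axis=0)
    return W, b


def predict_proba(X, W, b):
    """Predicted class distribution for each row of X."""
    return softmax(X @ W + b)


# Toy data (invented): two features, two classes. Each object has a
# distribution over the classes, e.g. produced by heuristic rules,
# rather than one hard label.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
P = np.array([[0.8, 0.2], [0.7, 0.3], [0.1, 0.9], [0.2, 0.8]])

W, b = fit_soft_labels(X, P)
proba = predict_proba(X, W, b)
pred = proba.argmax(axis=1)  # most probable class per object
```

With hard labels, each row of `P` would be a one-hot vector and the same loop would reduce to ordinary multinomial logistic regression. The soft-target version simply keeps the uncertainty that the labeling rules express.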

Keywords