Informaciâ i Innovacii (Sep 2018)

Problems of Algorithms Development to Determine Quality of Topic Models Ensembles for Make Rubricators

  • A. P. Shiryaev,
  • A. R. Fedorov,
  • P. A. Fedorov,
  • L. G. Gagarina,
  • E. M. Portnov

DOI
https://doi.org/10.31432/1994-2443-2018-13-3-53-58
Journal volume & issue
Vol. 13, no. 3
pp. 53 – 58

Abstract

Intelligent data mining is one of the most active areas of modern research. Its range of applications is extremely wide and covers practically all scientific disciplines. One pressing task is the analysis of text collections to establish thematic headings under which separate articles are to be classified, following the principle of systematization “from the general to the particular”, and to form a list of “core” categories. Clustering, and topic modeling in particular, is one of the methods of intelligent text analysis. The problem of clustering text collections has no unique solution, for several reasons. First, no single criterion of clustering quality is known to be best: there are many reasonable criteria, but they can give different results. Second, the number of clusters is usually unknown in advance and is chosen according to some subjective criterion. Third, the clustering result depends significantly on the distance metric, whose choice is usually subjective and made by an expert. Ensembles of models are becoming increasingly widespread among data mining techniques and can significantly improve the accuracy of modeling results. The main purpose of this research is to increase the effectiveness of clustering textual information by using ensembles of topic models. This article describes the use of a voting algorithm based on a group of different evaluation algorithms. The voting algorithm makes it possible to select the most appropriate solution, to assess the quality of a topic model accurately, and to generate a set of relevant topics. A computational experiment shows that the results generally agree with expert assessments and with the evaluations of formal criteria. A concept for evaluating the quality of topic model ensembles, which uses a simple voting algorithm, is explored and proposed for further research.
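
As an illustration of the simple voting idea described in the abstract, the following Python sketch lets each quality criterion cast one vote for the candidate topic model it ranks best, then picks the model with the most votes. The abstract does not specify the criteria, the candidate models, or the tie-breaking rule, so the criterion names (`coherence`, `perplexity`, `silhouette`), the model labels, the scores, and the function `simple_vote` are all hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
from collections import Counter


def simple_vote(scores, higher_is_better):
    """Pick a model by simple (plurality) voting over quality criteria.

    scores: {criterion_name: {model_name: score}}
    higher_is_better: {criterion_name: bool}
    Returns (winning_model, vote_tally). Ties are broken by whichever
    tied model received its first vote earliest (an assumption of this sketch).
    """
    votes = Counter()
    for criterion, per_model in scores.items():
        # Each criterion votes for the model it scores best.
        pick = max if higher_is_better[criterion] else min
        votes[pick(per_model, key=per_model.get)] += 1
    winner, _ = votes.most_common(1)[0]
    return winner, dict(votes)


# Made-up scores for three hypothetical candidate models.
example_scores = {
    "coherence": {"model_a": 0.41, "model_b": 0.52, "model_c": 0.47},
    "perplexity": {"model_a": 910.0, "model_b": 850.0, "model_c": 870.0},
    "silhouette": {"model_a": 0.12, "model_b": 0.10, "model_c": 0.18},
}
example_direction = {"coherence": True, "perplexity": False, "silhouette": True}

best_model, tally = simple_vote(example_scores, example_direction)
print(best_model, tally)
```

With these made-up numbers, two of the three criteria prefer `model_b`, so it wins the vote even though one criterion disagrees. This shows how voting can reconcile criteria that give different results.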

Keywords