Вестник Кемеровского государственного университета (Nov 2014)
AUTOMATIC CLASSIFICATION OF SEMISTRUCTURED DOCUMENTS IN SCIENTIFIC AND EDUCATIONAL PROCESS
Abstract
Numerous semi-structured documents are used daily in education and research at universities. One way to process such documents uniformly is to work with their metadata rather than with the documents themselves. For large numbers of semi-structured documents, however, this approach is efficient only if a procedure exists for automatically extracting metadata from document content. The procedure includes three stages: identification of the document class, clustering of the documents whose classes could not be identified, and extraction of metadata from the documents of identified classes. The paper is dedicated to possible solutions for the first stage, i.e. automatic classification of semi-structured documents. It includes a definition of a semi-structured document, criteria for evaluating the efficiency of classification methods, and a comparative analysis of different methods against the five most important criteria. To estimate two additionally developed criteria, the following methods are used: multilayer neural networks, the Rocchio algorithm, and the k-nearest neighbor method. Based on the results of the analysis, the neural network method appears to be the most efficient in terms of the balance between accuracy and speed. However, the classification accuracy achieved is still insufficient for semi-structured documents. The authors suggest that the accuracy of the methods can be improved by using not only keywords but also the identified document structure during classification.
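To make the first stage more concrete, the following is a minimal sketch of a Rocchio-style (nearest-centroid) keyword classifier, one of the three methods named in the abstract. It is not the authors' implementation: the function names, the whitespace tokenizer, the raw term-frequency weighting, the similarity threshold, and the toy training data are all assumptions made for illustration. The sketch uses only class centroids (the common simplification of Rocchio for classification) and returns `None` for a document that matches no class, which would correspond to handing it over to the clustering stage.

```python
import math
from collections import Counter, defaultdict


def vectorize(text):
    """Turn a text into a bag-of-words term-frequency vector (assumed tokenizer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def train_rocchio(labeled_docs):
    """Compute one centroid (mean term-frequency vector) per class."""
    sums = defaultdict(Counter)
    counts = Counter()
    for text, label in labeled_docs:
        sums[label].update(vectorize(text))
        counts[label] += 1
    return {
        label: {term: w / counts[label] for term, w in vec.items()}
        for label, vec in sums.items()
    }


def classify(centroids, text, threshold=0.0):
    """Return the class of the most similar centroid, or None if no
    similarity exceeds the (hypothetical) threshold."""
    vec = vectorize(text)
    best_label, best_score = None, threshold
    for label, centroid in centroids.items():
        score = cosine(vec, centroid)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data: two document classes described by keywords only.
TRAINING = [
    ("syllabus course lectures credits semester", "syllabus"),
    ("syllabus course assessment credits literature", "syllabus"),
    ("thesis abstract supervisor defense chapter", "thesis"),
    ("thesis chapter references supervisor results", "thesis"),
]

centroids = train_rocchio(TRAINING)
known = classify(centroids, "course credits semester lectures")
unknown = classify(centroids, "invoice payment total")
```

Here `known` is assigned to the `syllabus` class, while `unknown` shares no keywords with either centroid and stays unclassified. The sketch relies on keywords alone; structural features of the kind the authors propose would have to be added to the vectors as extra dimensions.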