Proceedings of the XXth Conference of Open Innovations Association FRUCT (Nov 2016)

The construction of syntax trees using external data for partially formalized text documents

  • Kirill Chuvilin

DOI
https://doi.org/10.23919/FRUCT.2016.7892178
Journal volume & issue
Vol. 420, no. 19
pp. 16 – 23

Abstract

This article investigates the automatic construction of a logical structure (an abstract syntax tree) for text documents whose format is not fully defined by standards or by other rules common to all documents. In contrast to syntax described by formal grammars, in such cases a parser cannot be built automatically. Text files in LaTeX format are typical examples of such documents with incompletely formalized markup syntax. They are used as the source material for implementing the algorithms developed in this work. The analysis of LaTeX documents is relevant because many scientific publishers and conferences use the LaTeX typesetting system, which gives rise to important applied tasks of automating categorization, correction, comparison, statistics collection, rendering for the Web, and so on. Parsing documents in LaTeX format requires additional information about styles: symbols, commands, and environments. A method for describing them in JSON format is proposed in this work. It allows specifying not only the information necessary for parsing, but also meta-information that facilitates further data mining. Such meta-information is needed, for example, for the correct comparison of documents, which arises in solving the automatic correction problem. This approach is used for the first time. The algorithms developed for constructing the syntax tree of a LaTeX document, which take this information as an external parameter, are described. The results have been successfully applied to the tasks of comparison, auto-correction, and categorization of scientific papers. The implementation of the developed algorithms is available as a set of libraries released under the LGPLv3. The key features of the proposed approach are flexibility (within the framework of the problem) and simplicity of parameter descriptions. The proposed approach makes it possible to solve the problem of parsing documents in LaTeX format. However, widespread practical use of the developed algorithms requires building a base of style element descriptions.
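
To illustrate why external style data is needed, the following is a minimal sketch and not the paper's implementation. The JSON field names (`commands`, `args`, `meta`) and the functions `parse`, `_parse_seq`, and `_parse_command` are assumptions made for this example; the actual description schema proposed in the paper is not reproduced here. The sketch handles only commands with mandatory brace arguments and ignores whitespace, optional arguments, and environments.

```python
import json

# Hypothetical style description. The field names are assumptions for
# illustration only; they are not the schema defined in the paper.
STYLE_JSON = """
{
  "commands": {
    "frac":   {"args": 2, "meta": "fraction"},
    "textbf": {"args": 1, "meta": "bold text"}
  }
}
"""


def parse(source, style):
    """Return a list of syntax tree nodes for a LaTeX fragment."""
    nodes, _ = _parse_seq(source, 0, style, stop=None)
    return nodes


def _parse_seq(src, pos, style, stop):
    """Parse nodes until the `stop` character or the end of input."""
    nodes = []
    while pos < len(src):
        ch = src[pos]
        if ch == stop:
            return nodes, pos + 1
        if ch == "\\":
            node, pos = _parse_command(src, pos + 1, style)
            nodes.append(node)
        elif ch == "{":
            children, pos = _parse_seq(src, pos + 1, style, "}")
            nodes.append({"type": "group", "children": children})
        else:
            # Consume at least one character so a stray "}" cannot stall the loop.
            start = pos
            pos += 1
            while pos < len(src) and src[pos] not in "\\{}":
                pos += 1
            nodes.append({"type": "text", "value": src[start:pos]})
    return nodes, pos


def _parse_command(src, pos, style):
    """Parse a command name, then as many brace arguments as the style declares."""
    start = pos
    while pos < len(src) and src[pos].isalpha():
        pos += 1
    name = src[start:pos]
    # Unknown commands are assumed to take no arguments.
    spec = style["commands"].get(name, {"args": 0, "meta": None})
    args = []
    for _ in range(spec["args"]):
        if pos < len(src) and src[pos] == "{":
            children, pos = _parse_seq(src, pos + 1, style, "}")
            args.append(children)
        else:
            break
    node = {"type": "command", "name": name, "meta": spec.get("meta"), "args": args}
    return node, pos


if __name__ == "__main__":
    style = json.loads(STYLE_JSON)
    print(json.dumps(parse(r"a\frac{1}{2}{x}", style), indent=2))
```

With the description loaded, `\frac` takes the first two brace groups as its arguments and the third group `{x}` remains a sibling node. With an empty description, the same groups cannot be attributed to the command, which is the ambiguity that the external style data resolves.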

Keywords