Proceedings of the Institute for System Programming of the RAS (Feb 2019)
Automatic search for fragments containing biographical information in a natural language text
Abstract
Searching and classifying text documents are key tasks of information retrieval with many practical applications. Methods for text search and classification are used in search engines, electronic libraries and catalogs, systems for collecting and processing information, online education, and many other areas. These methods have a large number of specific applications, but each such practical task is, as a rule, hard to formalize and tied to a narrow subject domain. Each therefore requires individual study and its own approach to the solution. This paper addresses the task of automatically finding text fragments that contain biographical information and determining their type. The key challenge is the multi-class classification of text fragments according to the presence and type of biographical information they contain. After reviewing related work, the author concluded that neural network methods are promising and widely used for such problems. Based on this conclusion, the paper compares various neural network architectures, as well as basic text representation methods (Bag-of-Words, TF-IDF, Word2Vec), on a pre-assembled and annotated corpus of biographical texts. The article describes the steps involved in preparing a training set of text fragments, the text representation methods, and the classification methods chosen for solving the problem. The results of the multi-class classification of text fragments are also presented. Examples of automatic search for fragments containing biographical information are shown for texts that were not used in training the models.
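To make the task concrete, the following minimal sketch shows multi-class classification of text fragments over a TF-IDF representation. It is an illustration under stated assumptions, not the method evaluated in the paper: the toy fragments, the label set (`birth`, `education`, `none`), and all function names are hypothetical, and a simple nearest-centroid rule stands in for the neural network classifiers the paper compares.

```python
import math
from collections import Counter

# Hypothetical toy fragments and labels; NOT taken from the paper's corpus.
TRAIN = [
    ("he was born in 1870 in a small village", "birth"),
    ("she was born in moscow in 1901", "birth"),
    ("he graduated from the university in 1892", "education"),
    ("she studied mathematics at the university", "education"),
    ("the weather was cold and the river froze", "none"),
    ("the city has many parks and bridges", "none"),
]


def tokenize(text):
    """Lowercase whitespace tokenization (a deliberately naive assumption)."""
    return text.lower().split()


def fit_idf(docs):
    """Smoothed inverse document frequency for each term seen in training."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf(text, idf):
    """L2-normalized TF-IDF vector as a sparse dict; unseen terms are dropped."""
    tf = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def fit_centroids(train, idf):
    """Average the TF-IDF vectors of the training fragments of each class."""
    sums, counts = {}, Counter()
    for text, label in train:
        counts[label] += 1
        acc = sums.setdefault(label, Counter())
        for t, v in tfidf(text, idf).items():
            acc[t] += v
    return {
        label: {t: v / counts[label] for t, v in acc.items()}
        for label, acc in sums.items()
    }


def classify(text, idf, centroids):
    """Return the label whose centroid has the largest dot product with the fragment."""
    vec = tfidf(text, idf)

    def score(centroid):
        return sum(v * centroid.get(t, 0.0) for t, v in vec.items())

    return max(centroids, key=lambda label: score(centroids[label]))


IDF = fit_idf([text for text, _ in TRAIN])
CENTROIDS = fit_centroids(TRAIN, IDF)

# Fragments that were not part of the training data.
for fragment in ("he was born in 1923 in kazan", "she studied physics at the university"):
    print(fragment, "->", classify(fragment, IDF, CENTROIDS))
```

Replacing `tfidf` with raw term counts would give a Bag-of-Words variant of the same pipeline, while a Word2Vec representation would require pretrained or separately trained embeddings and is omitted here.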
Keywords