Automatic search for fragments containing biographical information in a natural language text

A. V. Glazkova

doi:10.15514/ISPRAS-2018-30(6)-12

Proceedings of the Institute for System Programming of the RAS (Труды Института системного программирования РАН) (Feb 2019)

Affiliations

A. V. Glazkova: Tyumen State University

DOI: https://doi.org/10.15514/ISPRAS-2018-30(6)-12
Journal volume & issue: Vol. 30, no. 6
pp. 221 – 236

Abstract

Searching and classifying text documents are key tasks of information retrieval with many practical applications. Methods for text search and classification are used in search engines, electronic libraries and catalogs, systems for collecting and processing information, online education, and many other areas. These methods have many specific applications, but each practical task is, as a rule, hard to formalize and narrowly domain-specific, so it requires separate study and its own approach. This paper addresses the task of automatically finding and categorizing text fragments that contain biographical information. The key challenge is the multi-class classification of text fragments according to the presence and type of biographical information they contain. After reviewing related work, the author concluded that neural network methods are promising and widely used for such problems. Based on this conclusion, the paper compares several neural network architectures and basic text representation methods (Bag-of-Words, TF-IDF, Word2Vec) on a previously collected and annotated corpus of biographical texts. The article describes the steps involved in preparing a training set of text fragments, the text representation methods, and the classification methods chosen for the task. The results of the multi-class classification of text fragments are also presented. Examples of automatic search for fragments containing biographical information are shown for texts that were not used to train the models.
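
As a rough illustration of two of the text representations named in the abstract, the following minimal Python sketch builds Bag-of-Words counts and TF-IDF weights for a few text fragments. It is not the author's implementation: the sample fragments, the whitespace tokenizer, the function names, and the unsmoothed formula tf * log(N / df) are all assumptions chosen for brevity.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def bag_of_words(fragments):
    # One term-count mapping per fragment.
    return [Counter(tokenize(fragment)) for fragment in fragments]


def tf_idf(fragments):
    # Assumed weighting: (count / fragment length) * log(N / document frequency).
    counts = bag_of_words(fragments)
    n_fragments = len(fragments)
    doc_freq = Counter()
    for fragment_counts in counts:
        doc_freq.update(fragment_counts.keys())
    vectors = []
    for fragment_counts in counts:
        total = sum(fragment_counts.values())
        vectors.append({
            term: (freq / total) * math.log(n_fragments / doc_freq[term])
            for term, freq in fragment_counts.items()
        })
    return vectors


# Invented sample fragments, used only for illustration.
fragments = [
    "he was born in 1870",
    "he graduated in 1892",
    "he liked cold weather",
]
vectors = tf_idf(fragments)
```

Under this weighting, a term that occurs in every fragment (here "he") receives weight zero, while a term unique to one fragment (such as "born") receives a positive weight. Vectors of this kind could then be passed to a multi-class classifier.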

Published in Proceedings of the Institute for System Programming of the RAS (Труды Института системного программирования РАН)

ISSN: 2079-8156 (Print); 2220-6426 (Online)
Publisher: Ivannikov Institute for System Programming of the Russian Academy of Sciences
Country of publisher: Russian Federation
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://ispranproceedings.elpub.ru/jour/index
