TOPIC SEGMENTATION METHODS COMPARISON ON COMPUTER SCIENCE TEXTS

Volodymyr Sokol; Vitalii Krykun; Mariia Bilova; Ivan Perepelytsya; Volodymyr Pustovarov

doi:10.20998/2079-0023.2021.02.10

Вісник Національного технічного університету "ХПІ": Системний аналіз, управління та інформаційні технології (Dec 2021)

Affiliations

Volodymyr Sokol: National Technical University "Kharkiv Polytechnic Institute"
Vitalii Krykun: National Technical University "Kharkiv Polytechnic Institute"
Mariia Bilova: National Technical University "Kharkiv Polytechnic Institute"
Ivan Perepelytsya: National Technical University "Kharkiv Polytechnic Institute"
Volodymyr Pustovarov: Kharkiv office of the General Customer - State Space Agency of Ukraine

Journal volume & issue: no. 2 (6)
pp. 59 – 66

Abstract

The demand for information systems that simplify and accelerate work has grown sharply amid the rapid informatization of society and all its branches. This has led to the emergence of more and more companies that develop software products and information systems in general. Knowledge management systems are used to systematize, process and use the knowledge these companies hold. One of the main tasks of IT companies is the continuous training of personnel, which requires exporting content from the company's knowledge management system to a learning management system. The main goal of the research is to choose an algorithm for marking up the text of articles similar to those used in the knowledge management systems of IT companies. To achieve this goal, various topic segmentation methods must be compared on a dataset of computer science texts. Inspec is one such dataset; it is normally used for keyword extraction, and in this research it was adapted to the structure of the datasets used for the topic segmentation problem. The TextTiling and TextSeg methods were compared on several well-known data science metrics and on metrics specific to the topic segmentation problem. A new generalized metric was also introduced for comparing topic segmentation results. All software implementations of the algorithms were written in Python and consist of sets of interrelated functions. Across all the metrics, including the newly introduced one, the TextSeg algorithm performs better than the TextTiling algorithm on the adapted Inspec test dataset.
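
The abstract does not name the segmentation-specific metrics that were used. As an illustration only, the sketch below computes Pk, a metric commonly applied to topic segmentation; it is an assumption here, not the authors' implementation and not their new generalized metric. The function names `boundaries_to_labels` and `pk` and the 0/1 boundary encoding (1 means a segment ends after that sentence) are hypothetical choices made for this example.

```python
def boundaries_to_labels(boundaries):
    """Turn a 0/1 boundary list into a segment id for each sentence.

    A 1 at position i means a segment ends after sentence i
    (hypothetical encoding chosen for this sketch).
    """
    labels, current = [], 0
    for b in boundaries:
        labels.append(current)
        if b:
            current += 1
    return labels


def pk(reference, hypothesis, k=None):
    """Pk error: the share of sentence pairs, k positions apart, on which
    the two segmentations disagree about 'same segment or not'.
    Lower is better; 0.0 means the segmentations agree on every pair."""
    if len(reference) != len(hypothesis):
        raise ValueError("segmentations must cover the same sentences")
    ref = boundaries_to_labels(reference)
    hyp = boundaries_to_labels(hypothesis)
    n = len(ref)
    if k is None:
        # Conventional default: half the mean reference segment length.
        num_segments = ref[-1] + 1
        k = max(1, round(n / (2 * num_segments)))
    if n <= k:
        raise ValueError("text is too short for the chosen window size")
    errors = 0
    for i in range(n - k):
        same_in_ref = ref[i] == ref[i + k]
        same_in_hyp = hyp[i] == hyp[i + k]
        errors += same_in_ref != same_in_hyp
    return errors / (n - k)


if __name__ == "__main__":
    # Toy data: 12 sentences in three reference segments of four sentences.
    reference = [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0]
    shifted = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0]   # first boundary one sentence early
    no_splits = [0] * 12                              # predicts a single segment
    for name, hyp in [("shifted", shifted), ("no_splits", no_splits)]:
        print(name, pk(reference, hyp))
```

A near miss (a boundary placed one sentence early) is penalized less than a missed boundary, which is why window-based metrics of this kind are often preferred over plain precision and recall for segmentation output.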

Published in Вісник Національного технічного університету "ХПІ": Системний аналіз, управління та інформаційні технології

ISSN: 2079-0023 (Print); 2410-2857 (Online)
Publisher: National Technical University Kharkiv Polytechnic Institute
Country of publisher: Ukraine
LCC subjects: Technology
Website: http://samit.khpi.edu.ua/
