TOPIC SEGMENTATION METHODS COMPARISON ON COMPUTER SCIENCE TEXTS
Abstract
The rapid informatization of society and all its branches has greatly increased the demand for information systems that simplify and accelerate work. This drives the emergence of more and more companies that develop software products and information systems in general. To systematize, process and use the knowledge these companies accumulate, knowledge management systems are used. One of the main tasks of IT companies is the continuous training of personnel, which requires exporting content from the company's knowledge management system to a learning management system. The main goal of this research is to choose an algorithm for marking up the text of articles similar to those used in the knowledge management systems of IT companies. To achieve this goal, various topic segmentation methods need to be compared on a dataset of computer science texts. Inspec is one such dataset; it is normally used for keyword extraction, and in this research it was adapted to the structure of datasets used for the topic segmentation problem. The TextTiling and TextSeg methods were compared using well-known data science metrics and metrics specific to the topic segmentation problem. A new generalized metric was also introduced for comparing topic segmentation results. All software implementations of the algorithms were written in Python and consist of sets of interrelated functions. On all metrics, including the newly introduced one, the TextSeg algorithm performs better than the TextTiling algorithm on the adapted Inspec test dataset.
Keywords