Bìblìotečnij vìsnik (Jan 2023)

UDC code determination of new electronic receipts for the formation of an electronic library by means of software

  • Kuznetsov Oleksandr,
  • Zaika Victor

Journal volume & issue
no. 3
pp. 3 – 16

Abstract

The purpose of the article is to propose a technique for validating the UDC index of electronic documents newly acquired by a library and to demonstrate its use on five electronic documents on economic topics (UDC index 331) with the developed software tool "Text Analysis". Research methodology. A quantitative method of document content analysis is applied. To find documents (files) with similar content, the cosine similarity measure was used, and a coefficient of thematic direction was calculated for each document. Text files were vectorized, that is, represented as vectors in a multidimensional space. To this end, different word forms were reduced to a single lexeme, and the number (or frequency) of occurrences of each lexeme in each document was counted. Lexemes are interpreted as coordinate axes, and the frequency of use is interpreted as the value of the corresponding coordinate. After vectorization of the texts, the apparatus of analytic geometry was applied, and a numerical value, the coefficient of thematic direction, was assigned to the topic of each text document. Scientific novelty. For the first time, methods of content analysis, namely quantitative analysis, were used to assess the reliability of a document's UDC index, and a software tool was created that helps the classifier confirm or refute the UDC index of a questionable document without reading it. Conclusions. The authors' software tool and the proposed UDC correction technique can be used when building repositories of electronic texts and will help improve the quality of information retrieval and content selection. Once a sufficient number of electronic documents has accumulated, the developed methodology allows the UDC of a new text (accession) to be determined automatically from the coefficient of thematic direction between the new text and the corresponding corpus: a value close to one indicates a match.
The vector of coefficients of thematic direction of the studied texts, arranged in ascending order, made it possible to identify a cluster, that is, a group of texts with the same content. A reliable criterion is the coefficient of the variable in a linear approximation of this distribution: ideally, the graph of the coefficients of thematic direction shows a horizontal plateau on which the coefficient equals one. The number of thematic areas is determined by the number of clusters.
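
The vectorization and cosine measure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' "Text Analysis" tool. It assumes that lowercased word tokens stand in for lexemes (no reduction of word forms is performed), and that the coefficient of thematic direction is the cosine between a new text's frequency vector and the summed frequency vector of a corpus. The function names vectorize, cosine and thematic_coefficient are hypothetical.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Represent a text as a sparse frequency vector: token -> count."""
    # Assumption: lowercased runs of letters stand in for lexemes;
    # a real pipeline would first reduce word forms to one lexeme.
    return Counter(re.findall(r"[^\W\d_]+", text.lower()))


def cosine(u, v):
    """Cosine of the angle between two sparse frequency vectors."""
    dot = sum(count * v[term] for term, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty text has no direction
    return dot / (norm_u * norm_v)


def thematic_coefficient(new_text, corpus_texts):
    """Assumed definition: cosine between a new text and the pooled corpus."""
    corpus_vector = Counter()
    for text in corpus_texts:
        corpus_vector.update(vectorize(text))
    return cosine(vectorize(new_text), corpus_vector)


if __name__ == "__main__":
    corpus = ["labor wages", "wages labor market"]
    print(thematic_coefficient("wages labor", corpus))
    print(thematic_coefficient("protein folding", corpus))
```

Under these assumptions, a text that shares most of its vocabulary with the corpus yields a value near one, while a text with no shared tokens yields zero.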

Keywords