Nongye tushu qingbao xuebao (Mar 2023)
Literature Classification Methods Based on Structural Information Enhancement
Abstract
[Purpose/Significance] Literature classification is a fundamental task in library and information services and is of great value for information resource management and for literature retrieval and acquisition. Deep learning-based methods are currently the mainstream approach to text classification; they use neural networks to model the textual content of each document. This approach uses only the information in the document itself and ignores the associations between documents. Examining the data, we found that documents in the same category tend to share more keywords. Keywords can therefore link documents into an association network that captures structural relationships among them. We attempt to use this structural information to improve the performance of literature classification.
[Methods/Process] This paper proposes a method that models a structural representation of the literature and uses it to enhance traditional literature classification methods. Specifically, we first constructed a large-scale keyword dictionary from data collected from about 930,000 documents. Second, we extracted a keyword set from the title and abstract of each paper with a bidirectional maximum matching algorithm and constructed a keyword-literature graph, with documents and keywords as nodes and the containment relationship between a document and a keyword as edges, so that documents are connected to one another through shared keywords. We then applied a graph convolutional network (GCN) to this graph to learn representations of the documents and keywords; the document representations produced by the graph network encode the structural relationships between documents. In addition, we used BERT+BiLSTM to model the textual content of each document. Finally, the structural and textual representations were concatenated, and classification was performed on the combined representation.
[Results/Conclusions] We constructed a literature classification dataset with 423 classes and split it into training, validation, and test sets at a ratio of 8:1:1. Experiments on this dataset show that structural information can effectively enhance the performance of traditional literature classification methods. Ablation experiments also show that structural information alone is insufficient for the task. A detailed error analysis shows that the model still struggles with less frequent keywords and concepts. In future work, we plan to use few-shot learning methods to address the classification of categories with little data.
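The keyword extraction and graph construction steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy dictionary, the toy documents, and the tie-breaking rule (fewer tokens, then fewer single-character tokens, otherwise the backward result) are all assumptions chosen for illustration. The abstract does not specify how disagreements between the two matching directions are resolved.

```python
from collections import defaultdict


def forward_max_match(text, vocab, max_len):
    """Greedy left-to-right segmentation, longest dictionary entry first."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback token.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, vocab, max_len):
    """Greedy right-to-left segmentation, longest dictionary entry first."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                j -= size
                break
    return tokens[::-1]


def extract_keywords(text, vocab):
    """Bidirectional maximum matching; returns the dictionary keywords found.

    Assumed heuristic: prefer the segmentation with fewer tokens, then
    fewer single-character tokens; on a full tie use the backward result.
    """
    max_len = max(map(len, vocab), default=1)
    fwd = forward_max_match(text, vocab, max_len)
    bwd = backward_max_match(text, vocab, max_len)

    def cost(tokens):
        return (len(tokens), sum(len(t) == 1 for t in tokens))

    best = fwd if cost(fwd) < cost(bwd) else bwd
    return {t for t in best if t in vocab}


def build_keyword_document_graph(docs, vocab):
    """Undirected bipartite graph: an edge links a document to each keyword it contains."""
    adjacency = defaultdict(set)
    for doc_id, text in docs.items():
        for kw in extract_keywords(text, vocab):
            adjacency[("doc", doc_id)].add(("kw", kw))
            adjacency[("kw", kw)].add(("doc", doc_id))
    return adjacency


# Hypothetical toy data, not taken from the paper's dataset.
vocab = {"图神经网络", "神经网络", "文献分类", "文献", "分类"}
docs = {
    "d1": "基于图神经网络的文献分类",
    "d2": "文献分类方法",
    "d3": "神经网络模型",
}
graph = build_keyword_document_graph(docs, vocab)
```

In this toy graph, documents d1 and d2 become two-hop neighbors through the shared keyword node, which is the kind of structural link a graph convolutional network could propagate over. The adjacency structure would still need to be converted to a matrix or edge list before being passed to a graph learning library.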
Keywords