Jisuanji kexue (Oct 2022)
Chinese Keyword Extraction Method Combining Knowledge Graph and Pre-training Model
Abstract
Keywords capture the theme of a text and condense its concepts and content. Through keywords, readers can quickly grasp the gist of a text, which improves the efficiency of information retrieval. Keyword extraction also supports automatic text summarization and text classification. In recent years, research on automatic keyword extraction has attracted wide attention, but extracting keywords from documents accurately remains a challenge. On the one hand, keywords reflect people's subjective understanding, so judging whether a word is a keyword is itself subjective. On the other hand, Chinese words are often rich in semantic information, and it is difficult to accurately extract the main idea of a text by relying solely on traditional statistical and topic features. To address the problems of low accuracy, redundant information, and missing information in Chinese keyword extraction, this paper proposes an unsupervised keyword extraction method that combines a knowledge graph with a pre-trained model. First, topic clustering is carried out with the pre-trained model, and a sentence-based clustering method is proposed to ensure the coverage of the finally selected keywords. Then, the knowledge graph is used for entity linking to achieve accurate word segmentation and semantic disambiguation. After that, a semantic word graph is constructed from the topic information to calculate the semantic weights between words. Finally, keywords are ranked by the weighted PageRank algorithm. Experiments are conducted on two public datasets, DUC 2001 and CSL, and on a separately annotated CLTS dataset, with precision, recall, and F1 score as the evaluation metrics in the comparative experiments. The experimental results show that the proposed method is more accurate than the baseline methods: the F1 score is 9.14% higher than that of the traditional statistical method TF-IDF and 4.82% higher than that of the traditional graph-based method TextRank on the CLTS dataset.
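The final ranking step named in the abstract, weighted PageRank over a word graph, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `weighted_pagerank`, the damping factor of 0.85, the undirected treatment of edges, and the toy edge weights are all assumptions made for illustration. In the paper the weights would come from the semantic word graph built from topic information, which the abstract does not specify in detail.

```python
from collections import defaultdict


def weighted_pagerank(edges, damping=0.85, iterations=100, tol=1e-10):
    """Score the nodes of an undirected weighted graph.

    edges: iterable of (word_a, word_b, weight) with weight > 0.
    Returns a dict mapping each word to a score; the scores sum to 1.
    """
    # Symmetric adjacency map: neighbors[u][v] = accumulated edge weight.
    neighbors = defaultdict(dict)
    for a, b, w in edges:
        if a == b:
            continue  # ignore self-loops in this sketch
        neighbors[a][b] = neighbors[a].get(b, 0.0) + w
        neighbors[b][a] = neighbors[b].get(a, 0.0) + w

    nodes = list(neighbors)
    n = len(nodes)
    if n == 0:
        return {}

    # Total weight leaving each node, used to normalize its contributions.
    out_weight = {u: sum(neighbors[u].values()) for u in nodes}
    scores = {u: 1.0 / n for u in nodes}

    for _ in range(iterations):
        new_scores = {}
        for u in nodes:
            # Each neighbor v passes on a share of its score proportional
            # to the weight of the edge (v, u).
            incoming = sum(
                scores[v] * w / out_weight[v] for v, w in neighbors[u].items()
            )
            new_scores[u] = (1.0 - damping) / n + damping * incoming
        delta = sum(abs(new_scores[u] - scores[u]) for u in nodes)
        scores = new_scores
        if delta < tol:
            break
    return scores


# Toy word graph; the weights are hypothetical semantic similarities.
edges = [
    ("knowledge graph", "entity linking", 0.9),
    ("knowledge graph", "keyword extraction", 0.8),
    ("pre-training model", "keyword extraction", 0.7),
    ("pre-training model", "topic clustering", 0.6),
    ("entity linking", "keyword extraction", 0.5),
]
scores = weighted_pagerank(edges)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Here `ranked` lists the candidate words from highest to lowest score, so the top-k entries would serve as the extracted keywords. Unweighted TextRank corresponds to the special case where every edge weight is equal.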
Keywords