An Unsupervised Approach for Keyphrase Extraction Using Within-Collection Resources

Teng-Fei Li; Liang Hu; Jian-Feng Chu; Hong-Tu Li; Ling Chi

doi:10.1109/ACCESS.2019.2938213

IEEE Access (Jan 2019)

Affiliations

Teng-Fei Li: College of Computer Science and Technology, Jilin University, Changchun, China
Liang Hu: College of Computer Science and Technology, Jilin University, Changchun, China
Jian-Feng Chu: College of Computer Science and Technology, Jilin University, Changchun, China
Hong-Tu Li: College of Computer Science and Technology, Jilin University, Changchun, China
Ling Chi: College of Computer Science and Technology, Jilin University, Changchun, China

DOI: https://doi.org/10.1109/ACCESS.2019.2938213
Journal volume & issue: Vol. 7, pp. 126088–126097

Abstract

The rapidly growing number of scholarly documents makes it hard to select and read suitable ones. Keyphrases can be regarded as the gist of a document, so researchers can use keyphrase queries to find the documents they want. However, many scholarly documents carry no keyphrases tagged by their authors or by other researchers, and automatic keyphrase extraction can help researchers obtain keyphrases quickly. This paper proposes an unsupervised approach to keyphrase extraction that combines graph-based ranking with topic-based clustering, under the assumption that only within-collection resources are used. Graph-based ranking describes the relevance between two words, and topic-based clustering embeds semantic information into words. We assume that each word has its own meaning and that each meaning can be treated as a topic, even though nothing is known about these meanings in advance. Topic-based clustering assigns the “correct meaning” to the “correct word”. In addition, because the approach considers the relevance among phrases and uses only within-collection resources, graph-based ranking can be applied to phrases: the edges of a graph built over phrases describe the hidden relevance between two phrases, and the edge weights measure the strength of their connection. With a position feature added, the approach consists of an enhanced graph-based ranking and a topic-based clustering. Experiments on four datasets (KDD, WWW, GSN and ACM) indicate that the approach performs better than state-of-the-art methods.
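
To make the idea of graph-based ranking with a position feature more concrete, the following is a minimal Python sketch of a generic position-biased PageRank over a word co-occurrence graph. It is not the algorithm from the paper, whose details the abstract does not give, and it leaves out topic-based clustering entirely. The function name `rank_words`, the co-occurrence window, the damping factor and the 1/position bias are all assumptions chosen for illustration.

```python
from collections import defaultdict


def rank_words(tokens, window=2, damping=0.85, iterations=50):
    """Score words with a position-biased PageRank (illustrative sketch only)."""
    if not tokens:
        return {}

    # Edge weight = how often two distinct words co-occur within the window.
    weights = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            v = tokens[j]
            if v != w:
                weights[w][v] += 1.0
                weights[v][w] += 1.0

    # Assumed position feature: each occurrence adds 1/position,
    # so words that appear early (and often) get a larger bias.
    bias = defaultdict(float)
    for pos, w in enumerate(tokens, start=1):
        bias[w] += 1.0 / pos
    total = sum(bias.values())
    bias = {w: b / total for w, b in bias.items()}

    words = list(bias)
    scores = {w: 1.0 / len(words) for w in words}
    out_sum = {w: sum(weights[w].values()) for w in words}

    # Power iteration: each word receives score from its neighbours in
    # proportion to edge weight, plus a share of the position bias.
    for _ in range(iterations):
        new_scores = {}
        for w in words:
            incoming = sum(
                scores[v] * weights[v][w] / out_sum[v] for v in weights[w]
            )
            new_scores[w] = (1 - damping) * bias[w] + damping * incoming
        scores = new_scores
    return scores


tokens = ("keyphrase extraction graph ranking keyphrase "
          "clustering topic graph keyphrase").split()
scores = rank_words(tokens)
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word}: {score:.3f}")
```

In a fuller pipeline, word scores of this kind would typically be summed over candidate phrases to rank them. How the paper builds its phrase graph, sets its edge weights and combines the ranking with topic clusters is not specified in the abstract and is not reproduced here.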

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639