IEEE Access (Jan 2019)
Clustering Scientific Document Based on an Extended Citation Model
Abstract
With the number of published scientific papers increasing exponentially, scientific document clustering is becoming a challenging task, and a high-quality clustering model is needed. In this paper, we propose an extended citation model for scientific document clustering. On the one hand, the proposed model assumes that 1) the more frequently and the more widely a scientific document is cited in another document, the higher the similarity between the citing and the cited documents; and 2) the closer together two scientific documents are cited within a scientific document, the higher the similarity between these two documents. On the other hand, the proposed model combines a citation network and a textual similarity network to enhance the performance of scientific document clustering. To evaluate the performance of the proposed model, we collect scientific documents in the field of oncology from the PMC and PubMed databases as a case study. Compared with traditional scientific document clustering models, such as the traditional bibliographic coupling model and the textual similarity model, the proposed model obtains reasonable clustering results according to precision, recall, and F1-score.
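As a rough illustration of the two ideas summarized above, the following Python sketch blends a citation-based similarity matrix with a textual similarity matrix and computes precision, recall, and F1-score from pair counts. It is only a minimal sketch under stated assumptions: the abstract does not say how the two networks are combined, so the linear blend, the mixing weight `alpha`, and the function names `combine_similarity` and `precision_recall_f1` are hypothetical and not taken from the paper.

```python
import numpy as np


def combine_similarity(citation_sim, text_sim, alpha=0.5):
    """Blend two document-by-document similarity matrices.

    ASSUMPTION: a linear blend with a hypothetical weight `alpha`;
    the paper's actual combination rule is not given in the abstract.
    """
    citation_sim = np.asarray(citation_sim, dtype=float)
    text_sim = np.asarray(text_sim, dtype=float)
    if citation_sim.shape != text_sim.shape:
        raise ValueError("similarity matrices must have the same shape")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * citation_sim + (1.0 - alpha) * text_sim


def precision_recall_f1(true_pos, false_pos, false_neg):
    """Standard precision, recall, and F1-score from raw counts.

    Each value is defined as 0.0 when its denominator is zero.
    """
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    total = precision + recall
    f1 = 2.0 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy two-document example (values are illustrative, not from the paper).
citation_sim = [[1.0, 0.2], [0.2, 1.0]]
text_sim = [[1.0, 0.6], [0.6, 1.0]]
combined = combine_similarity(citation_sim, text_sim, alpha=0.5)
scores = precision_recall_f1(true_pos=8, false_pos=2, false_neg=8)
```

The combined matrix could then be passed to any clustering routine that accepts precomputed similarities; which algorithm the paper uses is not stated in the abstract.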
Keywords