Symmetry (Oct 2021)
Towards Potential Content-Based Features Evaluation to Tackle Meaningful Citations
Abstract
The scientific community has proposed various citation classification models to counter purely quantitative citation analysis systems, in which all citations are treated equally. However, only a small number of benchmark datasets exist, which makes data-driven modeling of asymmetric citation data quite complex. These models classify citations according to varying reasons for citing, mostly harnessing metadata and content-based features derived from research papers. Presently, researchers are more inclined toward binary citation classification, in the belief that exploiting the incomplete datasets in the best possible way is adequate to address the issue. We argue that contemporary machine learning (ML) citation classification models overlook essential aspects when selecting features, which hinders separating the asymmetric citation data. This study presents a novel binary citation classification model that exploits a list of potential natural language processing (NLP) based features. Machine learning classifiers, including SVM, KLR, and RF, are used to classify citations into important and non-important classes. The evaluation is performed on two benchmark datasets containing a corpus of around 953 paper-citation pairs annotated by the citing authors and domain experts. The results show that the proposed model outperforms contemporary approaches, attaining a precision of 0.88.
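As a point of reference for the precision figure reported above, the following is a minimal sketch of how precision for the "important" class could be computed from gold annotations and classifier predictions. The function name `precision`, the label strings, and the toy label lists are illustrative assumptions; they are not taken from the study's data or code.

```python
from typing import Sequence


def precision(y_true: Sequence[str], y_pred: Sequence[str],
              positive: str = "important") -> float:
    """Share of citations predicted as `positive` that are truly `positive`."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # True positives: predicted important and annotated important.
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    # False positives: predicted important but annotated otherwise.
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    predicted_positive = tp + fp
    # Return 0.0 when nothing was predicted positive (a convention, not a rule).
    return tp / predicted_positive if predicted_positive else 0.0


# Hypothetical toy labels for five paper-citation pairs (not from the benchmarks).
gold = ["important", "non-important", "important", "non-important", "important"]
pred = ["important", "important", "important", "non-important", "non-important"]
score = precision(gold, pred)
```

Here three pairs are predicted important and two of them are annotated important, so `score` is 2/3. Precision alone says nothing about the important citations that the classifier missed.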
Keywords