IEEE Access (Jan 2024)
An Effective Author Name Disambiguation Framework for Large-Scale Publications
Abstract
With the rapid growth of scientific literature, Author Name Disambiguation (AND) has become increasingly challenging and urgent for applications including literature retrieval systems, talent management systems, and academic web mining. However, traditional feature modeling shows limitations when dealing with large-scale publications: weak extraction of deep semantic features, poor generalization of expert-constructed heuristic rules, and over-reliance on the assumption that different relational features are equally discriminative. This paper proposes an effective AND framework to address these issues. An extended MiniLM is designed for semantic feature extraction, in which sample selection preferences and training strategies are optimized. Features of inter-paper relations are constructed based on IDF metrics. An ensemble learning approach is introduced to model the differences in discriminability and the inter-feature relations among these features. Based on the relational network of papers, an unsupervised feature fusion algorithm that synthesizes multi-order connectivity information with a lightweight convolution operation is designed to mitigate differences among the features of the authors to be disambiguated. Experiments show that our method achieves competitive performance on a real-world dataset and demonstrate its effectiveness for large-scale publications and complex namesake scenarios.
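To make the idea of IDF-based inter-paper relational features concrete, the following is a minimal sketch, not the paper's implementation. It assumes each paper is reduced to a set of relational tokens (here, co-author names) and scores a pair of papers by an IDF-weighted Jaccard overlap, so that rare shared co-authors count more than common ones. The function names `idf_weights` and `idf_overlap`, the smoothed IDF formula, and the toy data are all illustrative assumptions.

```python
import math
from collections import Counter


def idf_weights(papers):
    """Compute a smoothed IDF weight for every token.

    papers: dict mapping paper id -> set of relational tokens
            (e.g. co-author names, venue, affiliation words).
    The smoothing log((1 + n) / (1 + df)) + 1 is an assumption of this
    sketch; the paper's exact IDF metric is not given in the abstract.
    """
    n = len(papers)
    df = Counter(tok for toks in papers.values() for tok in toks)
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}


def idf_overlap(tokens_a, tokens_b, idf):
    """IDF-weighted Jaccard similarity between two token sets."""
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    shared = tokens_a & tokens_b
    return sum(idf[t] for t in shared) / sum(idf[t] for t in union)


# Toy candidate set for one ambiguous name (hypothetical data).
papers = {
    "p1": {"li wei", "a. smith"},
    "p2": {"li wei", "b. jones"},
    "p3": {"li wei", "a. smith", "c. wu"},
}
idf = idf_weights(papers)
sim_13 = idf_overlap(papers["p1"], papers["p3"], idf)
sim_12 = idf_overlap(papers["p1"], papers["p2"], idf)
```

In this toy set, `p1` and `p3` share the rarer co-author "a. smith", so their score is higher than that of `p1` and `p2`, which share only the name that appears on every paper. In practice the ambiguous name itself would usually be excluded from the token sets, and one such score per relation type (co-author, venue, affiliation) could be fed to a downstream classifier.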
Keywords