Applied Sciences (Oct 2021)
FONDUE: A Framework for Node Disambiguation and Deduplication Using Network Embeddings
Abstract
Data often have a relational nature that is most easily expressed in a network form, with its main components consisting of nodes that represent real objects and links that signify the relations between these objects. Modeling networks is useful for many purposes, but the efficacy of downstream tasks is often hampered by data quality issues related to their construction. In many constructed networks, ambiguity may arise when a node corresponds to multiple concepts. Similarly, a single entity can be mistakenly represented by several different nodes. In this paper, we formalize both the node disambiguation (NDA) and node deduplication (NDD) tasks to resolve these data quality issues. We then introduce FONDUE, a framework for utilizing network embedding methods for data-driven disambiguation and deduplication of nodes. Given an undirected and unweighted network, FONDUE-NDA identifies nodes that appear to correspond to multiple entities for subsequent splitting and suggests how to split them (node disambiguation), whereas FONDUE-NDD identifies nodes that appear to correspond to same entity for merging (node deduplication), using only the network topology. From controlled experiments on benchmark networks, we find that FONDUE-NDA is substantially and consistently more accurate with lower computational cost in identifying ambiguous nodes, and that FONDUE-NDD is a competitive alternative for node deduplication, when compared to state-of-the-art alternatives.
Keywords