String Comparators for Chinese-Characters-Based Record Linkages

Senlin Xu; Mingfan Zheng; Xinran Li

doi:10.1109/ACCESS.2020.3047927

IEEE Access (Jan 2021)

Affiliations

Senlin Xu: Department of Mathematics and Statistics, College of Science, Huazhong Agricultural University, Wuhan, China
Mingfan Zheng: Department of Mathematics and Statistics, College of Science, Huazhong Agricultural University, Wuhan, China
Xinran Li: Department of Mathematics and Statistics, College of Science, Huazhong Agricultural University, Wuhan, China

DOI: https://doi.org/10.1109/ACCESS.2020.3047927
Journal volume & issue: Vol. 9, pp. 3735–3743

Abstract

In the context of big data, data sharing between institutions can greatly reduce the cost of information collection and also helps obtain analysis results effectively and efficiently. Record linkage is the task of locating records that refer to the same entity across heterogeneous data sources. In recent decades, extensive research on alphabet-based record linkage has been carried out, among which the Fellegi-Sunter model extended by Winkler has outperformed others. However, performing record linkage on Chinese-character-based datasets remains a challenge. In this article, two set-based methods (Cosine similarity and Dice similarity) were introduced first, and then the similarity of Chinese characters was quantified using an adapted encoding technique that exploits both the shape and the pronunciation of Chinese characters. A new method, Hybrid similarity, was then proposed; it combines the character transformation technique (SoundShape Code) with Dice similarity. Finally, we applied these methods to simulated datasets and evaluated each by the number of misclassified record pairs and the computational time. The results demonstrated that our Hybrid similarity method outperformed the others in reducing the number of misclassified pairs, at a relatively low computational cost.
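For readers unfamiliar with the two set-based comparators named in the abstract, the following is a minimal sketch of the standard Dice and cosine set similarities applied to strings of Chinese characters. It assumes each string is tokenized into its set of distinct characters; the abstract does not state the tokenization the authors use, so `char_set` and the example names are illustrative assumptions. The SoundShape Code encoding and the Hybrid similarity itself are not reproduced here, since their details are not given in this record.

```python
import math


def char_set(s):
    """Return the set of distinct characters in s.

    Assumption: single characters are the tokens; the paper may
    tokenize differently (for example, into character n-grams).
    """
    return set(s)


def dice_similarity(a, b):
    """Standard Dice coefficient: 2|X ∩ Y| / (|X| + |Y|)."""
    x, y = char_set(a), char_set(b)
    if not x and not y:
        return 1.0  # convention chosen here: two empty strings match
    return 2 * len(x & y) / (len(x) + len(y))


def cosine_similarity(a, b):
    """Set-based cosine similarity: |X ∩ Y| / sqrt(|X| * |Y|)."""
    x, y = char_set(a), char_set(b)
    if not x or not y:
        return 1.0 if x == y else 0.0
    return len(x & y) / math.sqrt(len(x) * len(y))


# Two hypothetical names that share two of their three characters.
print(dice_similarity("张伟明", "张伟民"))
print(cosine_similarity("张伟明", "张伟民"))
```

Both functions return a value in [0, 1]. For the two hypothetical names above, which share two of three characters, each measure evaluates to 2/3. A purely set-based comparison like this treats 明 and 民 as entirely different even though they sound alike, which is the kind of gap the abstract's shape-and-pronunciation encoding is meant to address.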

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
