International Journal of Computational Intelligence Systems (Jun 2024)

CFNAM-PG: Bridging Phonetic and Glyphic Information for Chinese Full Name and Abbreviation Matching Based on Simbert and DenseNet

  • Dongsheng Wang,
  • Yue Feng,
  • Jiawei Li,
  • Sha Liu,
  • Miaomiao Zhou,
  • Diming Zhang,
  • Huige Li

DOI
https://doi.org/10.1007/s44196-024-00549-x
Journal volume & issue
Vol. 17, no. 1
pp. 1 – 14

Abstract

Matching abbreviated names with their full names (full-abbr matching) plays a key role in data integration, address matching, information retrieval, and other fields. Traditional full-abbr matching techniques often struggle with near homophones and near homoglyphs. First, a near-homophone full-abbr matching model based on Simbert and VGG was proposed. It integrates character and speech features, leveraging a speech recognition model and a brain-like dual-process cognitive learning mechanism that combines linguistic knowledge with a neural network. Second, to address near-homoglyph full-abbr matching in Chinese, a DenseNet-based model that fuses glyph structure and image features was proposed, in which statistical feature extractors separately extract feature vectors for the stroke, Wubi, and structural features of each glyph. Lastly, the near-homophone and near-homoglyph models were coupled for the full-abbr matching task, with expert knowledge serving as a component of the feature optimizer. Experimental results showed that the integrated model significantly increased matching accuracy to 87.5%, a 12.3% improvement.

Keywords