Applied Sciences (Apr 2023)

Candidate Term Boundary Conflict Reduction Method for Chinese Geological Text Segmentation

  • Yu Tang,
  • Jiqiu Deng,
  • Zhiyong Guo

DOI
https://doi.org/10.3390/app13074516
Journal volume & issue
Vol. 13, no. 7
p. 4516

Abstract

Although Chinese word segmentation (CWS) relies heavily on computing power to train large models and on human effort to label corpora, the accuracy of existing models and algorithms remains limited, especially for segmentation in a specific domain. In this study, a high-degree-of-freedom-priority candidate term boundary conflict reduction method (HFCR) is proposed to remove the need to set thresholds manually in segmentation based on information entropy. We quantify the uncertainty of the left and right character connections of candidate terms and then arrange the candidates in descending order for local comparisons to determine term boundaries. Dynamic numerical comparisons replace a threshold that would otherwise be set manually and arbitrarily. Experiments show that the average F1-value of CWS exceeds 95% for Chinese geological text and 87% for Chinese general datasets. Our method outperforms representative tokenizers and the SOTA model, resolves the term boundary conflict problem well, and performs excellently when segmenting a single geological text without any training samples or labels.
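The idea of ranking candidates by the uncertainty of their neighbouring characters and settling boundary conflicts locally can be illustrated with a minimal Python sketch. It is not the authors' HFCR implementation: the function names, the `max_len` parameter, the use of the smaller of the left and right neighbour entropies as the "degree of freedom", and the rule of skipping candidates whose freedom is zero are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def entropy(counter):
    """Shannon entropy (natural log) of a Counter of neighbour characters."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in counter.values())


def degrees_of_freedom(text, max_len=4):
    """Map each candidate n-gram (2..max_len chars) to min(left, right) entropy."""
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    n = len(text)
    for i in range(n):
        for k in range(2, max_len + 1):
            j = i + k
            if j > n:
                break
            term = text[i:j]
            # Hypothetical sentinels mark the start and end of the text.
            left[term][text[i - 1] if i > 0 else "<s>"] += 1
            right[term][text[j] if j < n else "</s>"] += 1
    return {t: min(entropy(left[t]), entropy(right[t])) for t in left}


def segment(text, max_len=4):
    """Greedy sketch: accept candidate spans in descending order of freedom,
    rejecting any span that overlaps one already accepted."""
    n = len(text)
    dof = degrees_of_freedom(text, max_len)
    spans = [
        (dof[text[i:i + k]], i, i + k)
        for i in range(n)
        for k in range(2, max_len + 1)
        if i + k <= n
    ]
    # Highest freedom first; ties go to the earlier, then the longer span.
    spans.sort(key=lambda s: (-s[0], s[1], -(s[2] - s[1])))
    owner = [None] * n
    for score, i, j in spans:
        if score <= 0:
            break  # assumption of this sketch: zero-freedom candidates are skipped
        if all(o is None for o in owner[i:j]):
            for p in range(i, j):
                owner[p] = (i, j)
    # Characters not covered by an accepted span become single-character tokens.
    tokens, pos = [], 0
    while pos < n:
        end = owner[pos][1] if owner[pos] else pos + 1
        tokens.append(text[pos:end])
        pos = end
    return tokens


print(segment("花岗岩和花岗岩与花岗岩"))
```

In this toy input, "花岗岩" appears with three different left and three different right neighbours, so it ranks first, while "花岗" and "岗岩" always have the same neighbour on one side and are never accepted. The candidates are compared against one another rather than against a preset cutoff, which is the spirit of the dynamic comparison described in the abstract.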

Keywords