Tūzī yǔ dàng’àn xuékān (Dec 2019)

Development and Application of an Ancient Chinese Sentence Segmentation System Based on Active Learning

  • Chih-Fan Hsu,
  • Chung Chang

DOI
https://doi.org/10.6575/JILA.201912_(95).0004
Journal volume & issue
Vol. 11, no. 2
pp. 117 – 145

Abstract

This study aims to develop a sentence segmentation system for ancient Chinese texts based on active learning. Through human-machine cooperation, the system is expected to reduce the training corpus needed to build a model for automated ancient Chinese sentence segmentation and to help humanities researchers work more efficiently when identifying sentences in uninterpreted text. Two experiments were conducted for system development and evaluation. The first experiment compared automatic sentence segmentation models built by applying different algorithms and feature templates under sequential text selection and active learning text selection, in order to choose the most suitable algorithm and feature template for the system. The results show that conditional random fields combined with a three-word feature template learned effectively under active learning and were therefore appropriate for building the active learning sentence segmentation model for ancient Chinese texts. In the second experiment, six humanities researchers were invited to use the system to segment assigned ancient Chinese texts in order to evaluate its performance. The segmentation results produced by the individual researchers using the system were compared and analyzed. Semi-structured interviews were also conducted to gain an in-depth understanding of their experience with the system and their suggestions for it. The experimental results show that the developed system could effectively learn from the humanities researchers' sentence segmentation data and continually improve the model's predictions through human-machine cooperation.
Moreover, according to the interviews, most of the humanities researchers who participated in this study reported a positive experience with the system and indicated that its sentence segmentation prediction function could effectively assist their sentence segmentation work. In future work, the predictions of the active learning sentence segmentation model could be further improved by embedding a named entity model or by applying phonological features or POS tagging of ancient Chinese. The system is also expected to be developed into a digital humanities learning platform for training in ancient Chinese sentence segmentation.

Keywords