Big Data Analytics (Sep 2017)

Building a Chinese discourse topic corpus with a micro-topic scheme based on theme-rheme theory

  • Xue-feng Xi,
  • Guodong Zhou

DOI
https://doi.org/10.1186/s41044-017-0023-7
Journal volume & issue
Vol. 2, no. 1
pp. 1 – 14

Abstract

Background: Building a suitable discourse topic structure is an important problem in discourse topic analysis, which is central to natural language understanding. Such a structure is both a basic unit for automatic computation and a key to transforming unstructured data into structured data in big data analytics. Although discourse topic structure has wide potential for application in discourse analysis and related tasks, research on constructing such discourse resources for Chinese remains quite limited. In this paper, we propose a micro-topic scheme (MTS) that represents discourse topic structure in Chinese according to theme-rheme theory, with the elementary discourse topic unit (EDTU) as the node and the referent of theme-rheme as the link. In particular, thematic progression is used to directly represent the development of the discourse topic structure.

Results: Guided by the MTS, we manually annotate a Chinese Discourse Topic Corpus (CDTC) of 500 documents. In two preliminary identification experiments, we obtain F1 scores of 89.9 and 72.15, respectively, which suggests that the proposed representation is amenable to automatic computation.

Conclusion: The lack of a formal representation system and related corpus resources for Chinese discourse topic structure has greatly restricted the study of discourse topic analysis and, in turn, the development of natural language understanding. To address these issues, we propose a micro-topic scheme (MTS) representation based on functional grammar theory and construct a corresponding corpus resource, the CDTC. Our preliminary evaluation supports the appropriateness of the MTS for Chinese discourse analysis and the usefulness of the CDTC.

Keywords