Naučno-tehničeskij Vestnik Informacionnyh Tehnologij, Mehaniki i Optiki (Oct 2021)
Automatic construction of the dialog tree based on unmarked text corpora in Russian
Abstract
This paper proposes a method for automatically determining the tree structure and the key topics of its nodes when building a dialog tree from unmarked (i.e., unlabeled) text corpora. Building a dialog tree is a time-consuming task in the creation of an automatic dialog system; in most cases it relies on manual markup, which requires considerable time and resources. The proposed hierarchical clustering of dialogs takes into account the semantic proximity of messages, allows a different number of nodes to be allocated at each level of the hierarchy, and limits the dialog tree in width and depth. The algorithm for annotating the nodes of the dialog tree accounts for the hierarchy of topics by building thematic chains. The method combines natural language processing techniques (tokenization, lemmatization, part-of-speech tagging, word embeddings, etc.), principal component analysis for dimensionality reduction, and cluster analysis. Experiments on constructing the dialog tree structure and annotating its nodes demonstrate the potential of the proposed method for automatic dialog tree construction. On a reference dialog tree containing 13 nodes at the first level, 381 nodes at the second level and 299 nodes at the third level, the recognition accuracy was 0.8, 0.7 and 0.5, respectively. Automatic construction of dialog trees can be useful in developing automatic dialog systems and in improving the quality of generated answers to user questions.
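The idea of a hierarchy with a variable number of nodes per level and a limit on width and depth can be illustrated with a minimal sketch. It is not the authors' algorithm: it swaps in a simple greedy "leader" clustering over cosine similarity, assumes that message embeddings are already available as plain vectors, and omits the dimensionality-reduction and node-annotation steps. The names `cosine`, `leader_clusters`, `build_tree`, `thresholds` and `max_width`, as well as the toy vectors, are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def leader_clusters(items, vectors, threshold, max_width):
    """Greedy leader clustering (an assumption, not the paper's method).

    The first member of each cluster acts as its leader. An item joins the
    most similar leader if the similarity reaches `threshold`, or if the
    width limit `max_width` is already reached; otherwise it starts a new
    cluster.
    """
    clusters = []
    for item in items:
        sims = [cosine(vectors[item], vectors[c[0]]) for c in clusters]
        best = max(range(len(sims)), key=sims.__getitem__) if sims else None
        if best is not None and (sims[best] >= threshold or len(clusters) >= max_width):
            clusters[best].append(item)
        else:
            clusters.append([item])
    return clusters


def build_tree(items, vectors, thresholds, max_width):
    """Recursively split items; one threshold per level bounds the depth."""
    node = {"items": list(items), "children": []}
    if not thresholds or len(items) < 2:
        return node
    clusters = leader_clusters(items, vectors, thresholds[0], max_width)
    if len(clusters) > 1:  # a single cluster means no further split here
        node["children"] = [
            build_tree(c, vectors, thresholds[1:], max_width) for c in clusters
        ]
    return node


# Toy 2-D "embeddings" (hypothetical): two payment and two delivery messages.
vectors = {
    "pay_card": (1.0, 0.1),
    "pay_cash": (0.9, 0.3),
    "ship_time": (0.1, 1.0),
    "ship_cost": (0.3, 0.9),
}
tree = build_tree(list(vectors), vectors, thresholds=(0.5, 0.99), max_width=3)
for child in tree["children"]:
    print(child["items"], [leaf["items"] for leaf in child["children"]])
```

In this sketch the number of children at each node follows from the data and the per-level threshold, the length of `thresholds` bounds the depth, and `max_width` bounds the width.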
Keywords