Naučno-tehničeskij Vestnik Informacionnyh Tehnologij, Mehaniki i Optiki (Aug 2023)

Text augmentation preserving persona speech style and vocabulary

  • Anastasia A. Matveeva,
  • Olesia V. Makhnytkina

DOI
https://doi.org/10.17586/2226-1494-2023-23-4-743-749
Journal volume & issue
Vol. 23, no. 4
pp. 743 – 749

Abstract

Read online

Currently, various natural language processing tasks often require large data sets. However, for many tasks, collecting large datasets is quite tedious and expensive, and requires the involvement of experts. An increase in the amount of data can be achieved using methods of data augmentation, however, the use of classical approaches can lead to the inclusion of phrases in the data corpus that differ in the speech style and vocabulary of the target person, which can lead to both a change in the target class as well as the appearance of replicas with unnatural vocabulary use and lack of meaning. In this context, a new method for test data enrichment is proposed that takes into account the person’s style and vocabulary. In this article, a new method for expanding text data that preserves individual language features and vocabulary is proposed. The core of the method is to create individual templates for each person based on the analysis of syntactic trees of propositions and then to create new replicas according to the generated templates. The method was tested on the task of assessing the user’s emotional state in a dialogue. The search was carried out for data sets in English and Russian. The proposed method made it possible to improve the quality of solving these problems for both the English and Russian languages. Up to a 2 % increase in accuracy and weighted F1 metrics has been noted for various models. The results of the work can be applied to improve the accuracy and weighted F1 metrics of models designed to solve various problems for the English and Russian languages.

Keywords