Гуманитарные и юридические исследования (Jan 2023)

On the corpus of speech samples with errors in the use of Russian as a foreign language: methods of data representation and deep markup parameters

  • S. V. Gusarenko,
  • M. K. Gusarenko

DOI
https://doi.org/10.37493/2409-1030.2022.4.17
Journal volume & issue
Vol. 9, no. 4
pp. 650–658

Abstract

The study presented in this article aims to determine the optimal composition and method of presenting data in a corpus, currently under development, of Russian speech samples containing errors made by foreign students. Two needs motivate the corpus: first, the need for a scientific description of erroneous linguistic expressions, given that all significant facts of language use are now being described; second, the need for a unified database of systematized data on errors in the speech of learners of Russian, for linguodidactic purposes. Creating such a corpus requires an in-depth description of speech errors. The article therefore proposes describing an erroneous linguistic expression as a violation of a particular language norm, that is, of the semantic, morphological, syntactic, or lexical language model underlying the normatively correct expression, and recording the type of speech activity, the speech situation, and the student's native language and specialty. For the purposes of corpus creation, an error is understood as a failure at a particular level of speech generation; the error description model is therefore based on the model for describing linguistic expressions that Russian researchers developed when creating the explanatory combinatorial dictionary. The proposed model of deep annotation includes schematized models of the semantic representation and of the syntactic and lexical compatibility of a linguistic expression (depending on the nature of the error). It is intended both to localize the error in language use precisely and to serve as educational material in linguodidactics.
It is concluded that once a statistically significant number of annotated samples with errors in Russian speech made by foreign students has been collected, the corpus can serve as a source of empirical data for a comprehensive scientific description of the facts of linguistic reality. It is also concluded that, to remain viable, the proposed corpus must be an open system that allows new description parameters to be added to the deep annotation.
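
The description parameters listed in the abstract could be represented, for illustration, as a single annotated record. The following is a minimal sketch under stated assumptions, not the authors' schema: the class names `ViolatedModel` and `ErrorSample`, every field name, the `add_parameter` helper, and the sample sentence are hypothetical. The open-ended `extra` mapping is one possible way to reflect the requirement that the corpus remain an open system.

```python
from dataclasses import dataclass, field
from enum import Enum


class ViolatedModel(Enum):
    """Language model whose norm the erroneous expression violates."""
    SEMANTIC = "semantic"
    MORPHOLOGICAL = "morphological"
    SYNTACTIC = "syntactic"
    LEXICAL = "lexical"


@dataclass
class ErrorSample:
    """One annotated corpus entry (hypothetical schema, not the authors')."""
    erroneous: str            # expression as produced by the student
    normative: str            # normatively correct counterpart
    violated_model: ViolatedModel
    speech_activity: str      # e.g. writing, speaking
    speech_situation: str
    native_language: str
    specialty: str
    # Open-ended parameters, so new description fields can be added later.
    extra: dict = field(default_factory=dict)

    def add_parameter(self, name: str, value: str) -> None:
        """Attach a new description parameter without changing the schema."""
        self.extra[name] = value


# Invented illustrative sample, not taken from the corpus.
sample = ErrorSample(
    erroneous="Я иду в университете",
    normative="Я иду в университет",
    violated_model=ViolatedModel.SYNTACTIC,
    speech_activity="writing",
    speech_situation="classroom essay",
    native_language="Chinese",
    specialty="engineering",
)
sample.add_parameter("proficiency_level", "A2")
```

In this sketch the fixed fields cover the parameters the abstract names, while anything introduced later goes into `extra`, so existing records stay valid when the annotation scheme grows. The schematized semantic and compatibility models mentioned in the abstract are not modeled here.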

Keywords