Proceedings of the XXth Conference of Open Innovations Association FRUCT (Nov 2023)
Lexical and Grammatical Features of Russian-Language Tweets in Comparison with Everyday Spoken Russian
Abstract
This study examines computer-mediated discourse, which is often perceived as neither purely written nor purely spoken. Twitter discourse is a case in point, combining attributes of both spoken and written language. The aim of the study is to determine how closely Russian-language Twitter discourse mirrors everyday spoken Russian. We compare a dataset of 152,223 Russian-language tweets (over 2 million tokens) with transcripts from the ORD speech corpus covering 500 macro episodes of daily conversation, totaling around 1 million tokens. We analyze both lexical and grammatical aspects of the tweets and spoken episodes: a detailed comparison of unigrams, discourse words, and pragmatic markers is supplemented by a multidimensional analysis spanning 22 grammatical features. Our findings indicate that the lexical attributes of Russian Twitter discourse closely align with spoken Russian: the tweets and speech episodes share a significant overlap in lemmas, discourse words, and pragmatic markers. Grammatically, however, Twitter discourse diverges from spontaneous spoken language. These insights could help refine computer-mediated discourse generation systems for the Russian language.
Keywords