ReaderBench: Multilevel analysis of Russian text characteristics

Dragos Corlatescu; Ștefan Ruseti; Mihai Dascalu

doi:10.22363/2687-0088-30145

Russian Journal of Linguistics (Jun 2022)

Affiliations

Dragos Corlatescu: University Politehnica of Bucharest
Ștefan Ruseti: University Politehnica of Bucharest
Mihai Dascalu: University Politehnica of Bucharest

Journal volume & issue: Vol. 26, no. 2
pp. 342–370

Abstract

This paper introduces an adaptation of the open-source ReaderBench framework that now supports multilevel analyses of Russian text characteristics, integrating both textual complexity indices and state-of-the-art language models, namely Bidirectional Encoder Representations from Transformers (BERT). The proposed processing pipeline was evaluated on a dataset of Russian texts from two language levels for foreign learners (A - Basic user and B - Independent user). Our experiments showed that the ReaderBench complexity indices differentiate significantly between the two language levels from two perspectives: a) a statistical perspective, where a Kruskal-Wallis analysis was performed and features such as the “nmod” dependency tag or the number of nouns at the sentence level proved to be the most predictive; and b) a neural network perspective, where our model combining textual complexity indices and contextualized embeddings obtained an accuracy of 92.36% in a leave-one-text-out cross-validation, outperforming the BERT baseline. ReaderBench can be employed by designers and developers of educational materials to evaluate and rank materials by difficulty, as well as by a larger audience for assessing text complexity in different domains, including law, science, or politics.
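
The Kruskal-Wallis analysis mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and does not use ReaderBench: the function names (`average_ranks`, `kruskal_wallis_two_groups`) are hypothetical, the feature values are synthetic numbers invented for illustration (imagined "nouns per sentence" for a handful of level A and level B texts), and only the Python standard library is assumed. With two groups the test has one degree of freedom, so the chi-squared tail probability can be written with `math.erfc`.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_two_groups(a, b):
    """Tie-corrected H statistic and p-value for two independent samples."""
    pooled = list(a) + list(b)
    n = len(pooled)
    ranks = average_ranks(pooled)
    rank_sum_a = sum(ranks[:len(a)])
    rank_sum_b = sum(ranks[len(a):])
    h = (12 / (n * (n + 1))
         * (rank_sum_a ** 2 / len(a) + rank_sum_b ** 2 / len(b))
         - 3 * (n + 1))
    # Tie correction: divide H by 1 - sum(t^3 - t) / (n^3 - n).
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in counts.values()) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical; H is undefined")
    h = max(h / correction, 0.0)  # guard against tiny negative rounding
    # Two groups give 1 degree of freedom: P(chi2_1 > h) = erfc(sqrt(h / 2)).
    p = math.erfc(math.sqrt(h / 2))
    return h, p


# Synthetic, illustrative values only (not data from the paper).
level_a = [2.1, 2.4, 1.9, 2.8, 2.2, 2.6]
level_b = [3.5, 3.1, 4.0, 2.9, 3.8, 3.3]

h, p = kruskal_wallis_two_groups(level_a, level_b)
print(f"H = {h:.4f}, p = {p:.4f}")
```

In practice one such test would be run per complexity index, and the indices with the largest H values would be read as the most discriminative between the two levels; a library routine such as `scipy.stats.kruskal` would normally replace the hand-written function.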

Published in Russian Journal of Linguistics

ISSN: 2687-0088 (Print); 2686-8024 (Online)
Publisher: Peoples’ Friendship University of Russia (RUDN University)
Country of publisher: Russian Federation
LCC subjects: Language and Literature: Philology. Linguistics
Website: http://journals.rudn.ru/linguistics
