Вопросы образования (Dec 2022)

Dataset for Analysis of Russian-Language Reviews on MOOCs Extracted from Stepik

  • Yulia Dyulicheva

DOI
https://doi.org/10.17323/1814-9545-2022-4-298-321
Journal volume & issue
no. 4
pp. 298–321

Abstract

The article reviews datasets and research directions in educational data analysis based on natural language processing methods. The review reveals a lack of datasets for analyzing Russian-language reviews of MOOCs. To address this, reviews were scraped from the Stepik platform to form a dataset of 5721 Russian-language reviews of MOOCs in mathematics, programming, biology, chemistry and physics. The reviews were studied using descriptive statistics, frequency analysis of unigrams and bigrams, and sentiment analysis with the dostoevsky Python library; the weighted F1-score of the sentiment classification was 74%. Unigram analysis was used to identify descriptive characteristics of courses with respect to sentiment, while bigram analysis was used to identify aspects of the learning content and the difficulties students encountered while taking MOOCs. The sentiment analysis shows that positive and neutral reviews predominate in the dataset. The dataset is publicly available on Mendeley Data and will be useful to specialists in text data analysis and in the development of learning analytics tools.
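
To make the unigram and bigram frequency analysis and the weighted F1-score mentioned in the abstract more concrete, here is a minimal sketch using only the Python standard library. It is not the author's code and does not use the dostoevsky library or the Stepik dataset. The function names (`tokenize`, `ngram_counts`, `weighted_f1`), the simple regex tokenizer, the three toy reviews and the sentiment labels are all assumptions made up for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase, then keep runs of letters and digits
    # (works for Cyrillic as well as Latin text).
    return re.findall(r"[^\W_]+", text.lower())


def ngram_counts(reviews, n):
    # Count n-grams (as tuples of tokens) across all reviews.
    counts = Counter()
    for review in reviews:
        tokens = tokenize(review)
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


def weighted_f1(y_true, y_pred):
    # Per-class F1, averaged with weights equal to each class's
    # share of the true labels (its support).
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        total += f1 * n_true
    return total / len(y_true)


# Hypothetical toy reviews, invented for this example.
toy_reviews = ["Отличный курс", "Отличный курс, спасибо!", "Очень сложный курс"]
unigrams = ngram_counts(toy_reviews, 1)
bigrams = ngram_counts(toy_reviews, 2)

# Hypothetical gold and predicted sentiment labels.
gold = ["positive", "positive", "neutral", "negative"]
pred = ["positive", "neutral", "neutral", "negative"]
score = weighted_f1(gold, pred)
```

On the toy data, the bigram ("отличный", "курс") occurs twice and the weighted F1-score comes out at about 0.75. On a real corpus, stop-word removal and lemmatization would normally come before counting; whether and how the article applies them is not stated in the abstract.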

Keywords