Proceedings of the XXth Conference of Open Innovations Association FRUCT (Apr 2022)

Topic Modeling of Literary Texts Using LDA: on the Influence of Linguistic Preprocessing on Model Interpretability

  • Tatiana Sherstinova,
  • Anna Moskvina,
  • Margarita Kirina,
  • Asya Karysheva,
  • Evgenia Kolpaschikova,
  • Irina Zavyalova,
  • Polina Maksimenko,
  • Alena Moskalenko

DOI
https://doi.org/10.23919/FRUCT54823.2022.9770887
Journal volume & issue
Vol. 31, no. 1
pp. 305 – 312

Abstract

The article presents the results of a study that evaluated the influence of linguistic preprocessing on the interpretability of topic models for literary texts. The study was carried out as part of a larger project aimed at building topic models of Russian short stories written in the first three decades of the 20th century, divided into three successive historical periods: 1) the beginning of the century before the First World War (1900-1913), 2) the time of acute social cataclysms, wars and revolutions (World War I, the February and October revolutions, and the Civil War) (1914-1922), and 3) the early Soviet period (1923-1930). The material of the study comprised three samples of different sizes for each period, containing 100, 500 and 1000 short stories respectively. Preprocessing included lemmatization with the spaCy library and four POS-filtering options: 1) nouns only, 2) nouns and verbs, 3) nouns, adjectives, adverbs and verbs, and 4) no filtering. Using latent Dirichlet allocation (LDA), 36 topic models were built (9 models for each preprocessing option). The research showed that, in the case of literary texts, topic models built without any POS filtering are the most interpretable. The study yielded information about the topic diversity of Russian short stories, assessed the expert interpretability of the models, and produced recommendations for optimizing topic modeling that can be used in developing artificial intelligence systems that process large volumes of literary texts.

Keywords