Proceedings of the XXth Conference of Open Innovations Association FRUCT (Nov 2017)
Fast and modular regularized topic modelling
Abstract
Topic modelling is an area of text mining that has been actively developed over the last 15 years. A probabilistic topic model extracts a set of hidden topics from a collection of text documents. It defines each topic by a probability distribution over words and describes each document by a probability distribution over topics. In applications, there are often many requirements to be taken into account, such as problem-specific knowledge and additional data. Therefore, it is natural to consider topic modelling as a multiobjective optimization problem. Historically, however, Bayesian learning became the most popular approach to topic modelling. In the Bayesian paradigm, all requirements are formalized in terms of a probabilistic generative process. This approach is not always convenient due to its limitations and technical difficulties. In this work, we develop a non-Bayesian multiobjective approach called Additive Regularization of Topic Models (ARTM). It is based on regularized Maximum Likelihood Estimation (MLE), and we show that many well-known Bayesian topic models can be reformulated in a much simpler way from the regularization point of view. We review some of the most important types of topic models: multimodal, multilingual, temporal, hierarchical, graph-based, and short-text. The ARTM framework enables easy combination of different types of models to create new models with the desired properties for applications. This modular “lego-style” technology for topic modelling is implemented in the open-source library BigARTM.
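The regularized MLE criterion underlying ARTM, summarized in the abstract, can be sketched as follows (standard notation: $\Phi = (\phi_{wt})$ is the word-in-topic matrix, $\Theta = (\theta_{td})$ the topic-in-document matrix, $n_{dw}$ the count of word $w$ in document $d$, and $R_i$ the regularizers with nonnegative weights $\tau_i$):

```latex
% ARTM maximizes the log-likelihood of the collection
% plus a weighted sum of regularization criteria:
\sum_{d \in D} \sum_{w \in d} n_{dw}
    \ln \sum_{t \in T} \phi_{wt} \theta_{td}
  \;+\; \sum_{i} \tau_i \, R_i(\Phi, \Theta)
  \;\longrightarrow\; \max_{\Phi,\, \Theta}
```

Each requirement of an application is encoded as one regularizer $R_i$, which is how the framework combines models in a modular, “lego-style” fashion.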
Keywords