Proceedings of the Institute for System Programming of the RAS (Oct 2018)
Topic modeling in natural language texts
Abstract
Topic modeling is a method for building a model of a collection of text documents that determines the topics of each document. Shifting from the term space to the space of extracted topics helps resolve synonymy and polysemy of terms. It also enables more efficient topic-sensitive search, classification, summarization, and annotation of document collections and news feeds. The paper traces the evolution of topic modeling techniques. The earliest methods are based on clustering and rely on a similarity function defined on pairs of documents. The next generation of techniques is based on Latent Semantic Analysis (LSA), which analyzes word co-occurrences in documents. Currently, the most popular approaches are based on Bayesian networks: directed probabilistic graphical models that incorporate various kinds of entities and metadata, such as document authorship and connections between words, topics, documents, and authors. The paper contains a comparative survey of different models together with methods for parameter estimation and accuracy measurement. The following topic models are considered: Probabilistic Latent Semantic Indexing, Latent Dirichlet Allocation, non-parametric models, dynamic models, and semi-supervised models. The paper describes well-known quality evaluation metrics: perplexity and topic coherence. Freely available implementations are listed as well.
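As a point of reference for the metrics mentioned above, the standard definition of perplexity on a held-out collection $D$ of $M$ documents (a common formulation, not quoted from the paper itself) is

\[
  \mathrm{perplexity}(D) =
    \exp\!\left( - \frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \right),
\]

where $\mathbf{w}_d$ denotes the sequence of words in document $d$ and $N_d$ is its length in words; lower values indicate that the model generalizes better to unseen documents.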
Keywords