Journal of Electrical and Electronics Engineering (May 2014)

Text Categorization with Latent Dirichlet Allocation

  • ZLACKÝ Daniel,
  • STAŠ Ján,
  • JUHÁR Jozef,
  • CIŽMÁR Anton

Journal volume & issue
Vol. 7, no. 1
pp. 161 – 164

Abstract


This paper focuses on the categorization of Slovak text corpora using latent Dirichlet allocation. Our goal is to build text subcorpora that contain similar documents and to use these better-organized subcorpora to train more robust language models for speech recognition systems. Our previous research showed that categorized text corpora yield better results. We divided the initial text corpus into 2, 5, 10, 20, or 100 subcorpora, varying the number of iterations and the save steps. Language models were built on these subcorpora and adapted to the judicial domain using linear interpolation. The experimental results showed that text categorization using latent Dirichlet allocation can improve an automatic speech recognition system by building language models from organized text corpora.
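The pipeline described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state which inference method or toolkit was used, so everything here is an assumption made for illustration: the collapsed Gibbs sampler, the hyperparameters `alpha` and `beta`, the rule of assigning each document to its dominant topic, and the hypothetical function names `lda_gibbs`, `split_into_subcorpora`, and `interpolate`. Plain unigram dictionaries stand in for the n-gram language models, and the toy documents are invented.

```python
import random
from collections import Counter

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns per-document topic counts.

    Assumption: the sampler and hyperparameters are illustrative only.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # n(d, k)
    topic_word = [Counter() for _ in range(num_topics)]  # n(k, w)
    topic_total = [0] * num_topics                       # n(k)
    assignments = []
    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # p(z=t | rest) is proportional to
                # (n(d,t) + alpha) * (n(t,w) + beta) / (n(t) + V * beta)
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic

def split_into_subcorpora(docs, doc_topic):
    """Assumed rule: each document joins the subcorpus of its dominant topic."""
    subcorpora = {}
    for doc, counts in zip(docs, doc_topic):
        dominant = max(range(len(counts)), key=counts.__getitem__)
        subcorpora.setdefault(dominant, []).append(doc)
    return subcorpora

def interpolate(models, weights, word):
    """Linear interpolation: P(w) = sum_i lambda_i * P_i(w)."""
    return sum(lam * m.get(word, 0.0) for lam, m in zip(weights, models))

# Toy documents (invented for this sketch).
docs = [
    "court judge verdict court".split(),
    "judge court appeal verdict".split(),
    "match goal team goal".split(),
    "team match league goal".split(),
]
doc_topic = lda_gibbs(docs, num_topics=2, iterations=100)
subcorpora = split_into_subcorpora(docs, doc_topic)
```

In this sketch, `num_topics` plays the role of the number of subcorpora (2, 5, 10, 20, or 100 in the paper) and `iterations` the number of sampling sweeps. A unigram model estimated from each subcorpus could then be combined with `interpolate`, with the weights tuned on held-out text from the target domain.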

Keywords