Advances in Electrical and Computer Engineering (Feb 2011)

Domain Independent Vocabulary Generation and Its Use in Category-based Small Footprint Language Model

  • KIM, K.-H.,
  • KIM, J.-H.

DOI
https://doi.org/10.4316/AECE.2011.01013
Journal volume & issue
Vol. 11, no. 1
pp. 77 – 84

Abstract

This paper addresses domain independent vocabulary generation and its use in a category-based small footprint Language Model (LM). Two major constraints on conventional LMs in the embedded environment are limited memory capacity and data sparsity in domain-specific applications. Data sparsity adversely affects both vocabulary coverage and LM performance. To overcome these constraints, we define a set of domain independent categories using a Part-Of-Speech (POS) tagged corpus. We then generate a domain independent vocabulary based on this set, using the corpus and a knowledge base. Next, we propose a mathematical framework for a category-based LM built on this set. In this LM, one word can be assigned multiple categories. To reduce its memory requirements, we propose a tree-based data structure. In addition, we determine the history length of the category n-gram and the independence assumption applied to category history generation. The proposed vocabulary generation method yields at least a 13.68% relative improvement in coverage for an SMS text corpus, where data are sparse due to the difficulty of data collection. The proposed category-based LM requires only 215KB, which is 55% and 13% of the memory required by the conventional category-based LM and the word-based LM, respectively. It also successfully improves performance, achieving 54.9% and 60.6% reductions in normalized perplexity compared to the conventional category-based LM and the word-based LM, respectively.
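
As a rough illustration of how a category-based bigram LM can score a word that belongs to more than one category, the sketch below sums over every category pair. It is a generic textbook-style formulation, not the mathematical framework proposed in the paper. The categories, the probability tables, the uniform category prior, and the function names (`categories_of`, `category_posterior`, `bigram_prob`) are all hypothetical and chosen only for this example.

```python
# Toy parameters, invented for illustration only (not taken from the paper).
# P(word | category): "book" appears under both NOUN and VERB.
emit = {
    "NOUN": {"book": 0.5, "flight": 0.5},
    "VERB": {"book": 0.4, "take": 0.6},
    "DET": {"a": 1.0},
}
# P(category | previous category): a category bigram.
trans = {
    "DET": {"NOUN": 0.9, "VERB": 0.1},
    "VERB": {"DET": 0.8, "NOUN": 0.2},
    "NOUN": {"VERB": 0.7, "NOUN": 0.3},
}


def categories_of(word):
    """Return every category that can emit the given word."""
    return [c for c, words in emit.items() if word in words]


def category_posterior(word):
    """P(category | word), assuming a uniform prior over categories."""
    scores = {c: emit[c][word] for c in categories_of(word)}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


def bigram_prob(word, prev_word):
    """P(word | prev_word), summed over all category pairs."""
    total = 0.0
    for c_prev, p_prev in category_posterior(prev_word).items():
        for c, p_trans in trans.get(c_prev, {}).items():
            # emit[c].get(...) is 0.0 when category c cannot emit the word.
            total += p_prev * p_trans * emit[c].get(word, 0.0)
    return total


if __name__ == "__main__":
    print(bigram_prob("book", "a"))
    print(bigram_prob("flight", "book"))
```

Because the model stores only the small category-to-category table and the word-to-category table rather than a full word-to-word table, its parameter count grows with the number of categories instead of the square of the vocabulary size, which is the general motivation for category-based models on memory-constrained devices.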

Keywords