Categorization of Unorganized Text Corpora for better Domain-Specific Language Modeling

Jan Stas; Daniel Zlacky; Daniel Hladek; Jozef Juhar

doi:10.15598/aeee.v11i5.897

Advances in Electrical and Electronic Engineering (Jan 2013)

Affiliations

Jan Stas: Department of Electronics and Multimedia Communications, Faculty of Electrical Engineering and Informatics, Technical University of Kosice, Kosice
Daniel Zlacky: Department of Electronics and Multimedia Communications, Faculty of Electrical Engineering and Informatics, Technical University of Kosice, Kosice
Daniel Hladek: Department of Electronics and Multimedia Communications, Faculty of Electrical Engineering and Informatics, Technical University of Kosice, Kosice
Jozef Juhar: Department of Electronics and Multimedia Communications, Faculty of Electrical Engineering and Informatics, Technical University of Kosice, Kosice

Journal volume & issue: Vol. 11, no. 5
pp. 398 – 403

Abstract

This paper describes the categorization of unorganized text data gathered from the Internet into in-domain and out-of-domain data for better domain-specific language modeling and speech recognition. An algorithm for text categorization and topic detection based on the most frequent key phrases is presented. In this scheme, each document entering the categorization process is represented in a vector space model, with term weights computed from term frequency and inverse document frequency. Text documents are then automatically classified as in-domain or out-of-domain by comparing them to the list of key phrases with one of the selected distance/similarity measures and applying a predefined threshold. Experimental results for language modeling and adaptation to the judicial domain show a significant improvement in model perplexity of about 19 % and a decrease in the word error rate of the Slovak transcription and dictation system of about 5.54 %, relative.
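
The scheme outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenizer, the smoothed IDF formula, the choice of cosine similarity as the measure, the threshold value of 0.1, the treatment of key phrases as bags of single words, and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def idf_table(token_lists):
    """Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1."""
    n_docs = len(token_lists)
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))  # count each term once per document
    return {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector as a dict; terms unseen in the corpus are dropped."""
    counts = Counter(tokens)
    total = len(tokens) or 1
    return {t: (c / total) * idf[t] for t, c in counts.items() if t in idf}


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def categorize(documents, key_phrases, threshold=0.1):
    """Label each document by its similarity to the key-phrase list.

    Returns a list of (label, score) pairs. The threshold is an assumed
    value; in practice it would be tuned on held-out data.
    """
    token_lists = [tokenize(d) for d in documents]
    idf = idf_table(token_lists)
    # Assumption: key phrases are flattened into a single bag of words.
    key_vec = tfidf_vector(tokenize(" ".join(key_phrases)), idf)
    results = []
    for tokens in token_lists:
        score = cosine(tfidf_vector(tokens, idf), key_vec)
        label = "in-domain" if score >= threshold else "out-of-domain"
        results.append((label, score))
    return results


# Toy usage with made-up documents and key phrases.
docs = [
    "the court issued a judgment in the appeal",
    "the football team won the match yesterday",
]
for label, score in categorize(docs, ["court judgment", "appeal"]):
    print(label, round(score, 3))
```

A document that shares no terms with the key-phrase list receives a score of exactly zero and falls below any positive threshold. Other distance or similarity measures could be substituted for `cosine` without changing the rest of the sketch.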

Published in Advances in Electrical and Electronic Engineering

ISSN: 1336-1376 (Print); 1804-3119 (Online)
Publisher: VSB-Technical University of Ostrava
Country of publisher: Czechia
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: http://advances.utc.sk/
