Telfor Journal (Nov 2014)
Building a Speech Repository for a Serbian LVCSR System
Abstract
This paper describes the procedure for collecting speech and the corresponding textual data, and the processing needed to create a repository for training a large vocabulary continuous speech recognition (LVCSR) system for the Serbian language. The speech database for Serbian consists of recordings from audio books, radio programmes and talk shows, as well as read utterances from a range of male and female speakers. To date, approximately 200 hours of speech recordings have been collected, together with the corresponding orthographic transcriptions, which contain around 200 thousand words (over 3 million characters). The audio files are split so that each contains a single utterance. The corresponding transcriptions are used to create label files and to train the language model (LM): new transcriptions are added to the existing textual corpus previously collected for building the LM. The software designed specifically for building the speech repository for Serbian is also briefly described.
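The splitting and labelling step can be illustrated with a minimal sketch. It is not the software described in the paper. It assumes that the utterance boundaries (in seconds) and their transcriptions are already known, that the recordings are uncompressed WAV files, and that labels are stored as plain-text `.lab` files next to the audio. The function name `split_utterances` and the file naming scheme are hypothetical.

```python
import wave
from pathlib import Path


def split_utterances(wav_path, segments, out_dir):
    """Cut one recording into per-utterance WAV files with matching label files.

    `segments` is a list of (start_sec, end_sec, transcription) tuples.
    The boundaries are assumed to be known in advance (for example from
    manual annotation); this sketch does not detect them.
    Returns the list of WAV paths that were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(wav_path).stem
    written = []

    with wave.open(str(wav_path), "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        for i, (start, end, text) in enumerate(segments, start=1):
            # Convert the boundaries from seconds to frame indices.
            first = int(round(start * rate))
            count = max(0, int(round(end * rate)) - first)
            src.setpos(first)
            frames = src.readframes(count)

            # One WAV file per utterance, same audio format as the source.
            wav_out = out_dir / f"{stem}_{i:04d}.wav"
            with wave.open(str(wav_out), "wb") as dst:
                dst.setparams(params)
                dst.writeframes(frames)

            # Label file with the orthographic transcription (assumed format).
            lab_out = out_dir / f"{stem}_{i:04d}.lab"
            lab_out.write_text(text.strip() + "\n", encoding="utf-8")
            written.append(wav_out)

    return written
```

A call such as `split_utterances("rec.wav", [(0.0, 1.0, "dobar dan")], "out")` would write `out/rec_0001.wav` and `out/rec_0001.lab`. The transcriptions stored in the label files could then also be appended to a text corpus for language model training.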