Proceedings of the XXth Conference of Open Innovations Association FRUCT (Nov 2016)
Unsupervised PCFG Inference from Russian Corpus of Phone Conversations
Full parsing of spontaneous speech remains an unsolved task for the Russian language, although a great amount of theoretical work has been done in the field of spontaneous speech syntax. The paper presents results on probabilistic context-free grammar (PCFG) induction from an unlabelled corpus of Russian spontaneous speech using the algorithm proposed by James Scicluna and Colin de la Higuera in 2014. The corpus contains 40 hours of speech (250 000 tokens). The experiment had three aims: to learn the syntactic structure of elementary discourse units that occur in spontaneous speech, to establish a benchmark for further development of spontaneous speech parsing algorithms, and to collect statistics on the length and structure of elementary discourse units in spontaneous speech.
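As an illustration of the kind of object being induced, the following minimal sketch shows a toy PCFG and how the probability of one parse tree is computed from it. It is not the induction algorithm of Scicluna and de la Higuera; the rules, the probabilities, the transliterated tokens and the function names are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from math import isclose

# Hypothetical toy grammar: (left-hand side, right-hand side) -> probability.
# Tokens are transliterated Russian words used purely as placeholders.
RULES = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("ya",)): 0.6,
    ("NP", ("tebe",)): 0.4,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("pozvonyu",)): 0.3,
    ("V", ("pozvonyu",)): 1.0,
}


def is_proper(rules):
    """Check that rule probabilities sum to 1 for every left-hand side."""
    totals = defaultdict(float)
    for (lhs, _rhs), prob in rules.items():
        totals[lhs] += prob
    return all(isclose(total, 1.0) for total in totals.values())


def tree_probability(tree, rules):
    """Multiply the probabilities of all rules used in a parse tree.

    A tree is a tuple (label, child, ...); each child is either another
    tree or a token string. Unknown rules contribute probability 0.
    """
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    prob = rules.get((label, rhs), 0.0)
    for child in children:
        if not isinstance(child, str):
            prob *= tree_probability(child, rules)
    return prob


# One possible parse of the token sequence "ya pozvonyu tebe".
TREE = ("S", ("NP", "ya"), ("VP", ("V", "pozvonyu"), ("NP", "tebe")))

if __name__ == "__main__":
    print(is_proper(RULES))
    print(tree_probability(TREE, RULES))
```

Under these assumed probabilities the tree uses the rules S → NP VP, NP → ya, VP → V NP, V → pozvonyu and NP → tebe, so its probability is the product 1.0 × 0.6 × 0.7 × 1.0 × 0.4, approximately 0.168.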