MATEC Web of Conferences (Jan 2019)

Neural machine translation system for the Kazakh language based on synthetic corpora

  • Tukeyev Ualsher,
  • Karibayeva Aidana,
  • Abduali Balzhan

DOI
https://doi.org/10.1051/matecconf/201925203006
Journal volume & issue
Vol. 252
p. 03006

Abstract

Large parallel corpora are not available for the Kazakh language, and this shortage seriously impairs the quality of machine translation from and into Kazakh. This article considers neural machine translation of the Kazakh language on the basis of synthetic corpora. Kazakh belongs to the Turkic languages, which are characterised by rich morphology, and neural machine translation of natural languages requires large amounts of training data. The article presents a model for creating synthetic corpora, namely the generation of sentences based on the complete system of suffixes of the Kazakh language. The novelty of this approach lies in generating sentences from that complete suffix system. By using the generated synthetic corpora, we improve translation quality in neural machine translation for the Kazakh-English and Kazakh-Russian pairs.