Complexity (Jan 2021)

Sublemma-Based Neural Machine Translation

  • Thien Nguyen,
  • Huu Nguyen,
  • Phuoc Tran

DOI
https://doi.org/10.1155/2021/5935958
Journal volume & issue
Vol. 2021

Abstract

Read online

Powerful deep learning approach frees us from feature engineering in many artificial intelligence tasks. The approach is able to extract efficient representations from the input data, if the data are large enough. Unfortunately, it is not always possible to collect large and quality data. For tasks in low-resource contexts, such as the Russian ⟶ Vietnamese machine translation, insights into the data can compensate for their humble size. In this study of modelling Russian ⟶ Vietnamese translation, we leverage the input Russian words by decomposing them into not only features but also subfeatures. First, we break down a Russian word into a set of linguistic features: part-of-speech, morphology, dependency labels, and lemma. Second, the lemma feature is further divided into subfeatures labelled with tags corresponding to their positions in the lemma. Being consistent with the source side, Vietnamese target sentences are represented as sequences of subtokens. Sublemma-based neural machine translation proves itself in our experiments on Russian-Vietnamese bilingual data collected from TED talks. Experiment results reveal that the proposed model outperforms the best available Russian ⟶ Vietnamese model by 0.97 BLEU. In addition, automatic machine judgment on the experiment results is verified by human judgment. The proposed sublemma-based model provides an alternative to existing models when we build translation systems from an inflectionally rich language, such as Russian, Czech, or Bulgarian, in low-resource contexts.