Sublemma-Based Neural Machine Translation

Thien Nguyen; Huu Nguyen; Phuoc Tran

doi:10.1155/2021/5935958

Complexity (Jan 2021)

Sublemma-Based Neural Machine Translation

Thien Nguyen,
Huu Nguyen,
Phuoc Tran

Affiliations

Thien Nguyen: Natural Language Processing and Knowledge Discovery Laboratory, Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam
Huu Nguyen: Faculty of Information Technology, Ho Chi Minh City University of Food Industry, Ho Chi Minh City, Vietnam
Phuoc Tran: Natural Language Processing and Knowledge Discovery Laboratory, Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam

DOI: https://doi.org/10.1155/2021/5935958
Journal volume & issue: Vol. 2021

Abstract

Read online

Powerful deep learning approach frees us from feature engineering in many artificial intelligence tasks. The approach is able to extract efficient representations from the input data, if the data are large enough. Unfortunately, it is not always possible to collect large and quality data. For tasks in low-resource contexts, such as the Russian ⟶ Vietnamese machine translation, insights into the data can compensate for their humble size. In this study of modelling Russian ⟶ Vietnamese translation, we leverage the input Russian words by decomposing them into not only features but also subfeatures. First, we break down a Russian word into a set of linguistic features: part-of-speech, morphology, dependency labels, and lemma. Second, the lemma feature is further divided into subfeatures labelled with tags corresponding to their positions in the lemma. Being consistent with the source side, Vietnamese target sentences are represented as sequences of subtokens. Sublemma-based neural machine translation proves itself in our experiments on Russian-Vietnamese bilingual data collected from TED talks. Experiment results reveal that the proposed model outperforms the best available Russian ⟶ Vietnamese model by 0.97 BLEU. In addition, automatic machine judgment on the experiment results is verified by human judgment. The proposed sublemma-based model provides an alternative to existing models when we build translation systems from an inflectionally rich language, such as Russian, Czech, or Bulgarian, in low-resource contexts.

Published in Complexity

ISSN: 1076-2787 (Print); 1099-0526 (Online)
Publisher: Wiley
Country of publisher: United Kingdom
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://onlinelibrary.wiley.com/journal/8503

About the journal