Corpus Augmentation for Neural Machine Translation with Chinese-Japanese Parallel Corpora

Jinyi Zhang; Tadahiro Matsumoto

doi:10.3390/app9102036

Applied Sciences (May 2019)

Affiliations

Jinyi Zhang: Electronics and Information Systems Engineering Division, Graduate School of Engineering, Gifu University, Gifu 501-1193, Japan
Tadahiro Matsumoto: Department of Electrical, Electronic and Computer Engineering, Gifu University, Gifu 501-1193, Japan

DOI: https://doi.org/10.3390/app9102036
Journal volume & issue: Vol. 9, no. 10, p. 2036

Abstract

The translation quality of Neural Machine Translation (NMT) systems depends strongly on the size of the training data. Sufficient amounts of parallel data are, however, not available for many language pairs. This paper presents a corpus augmentation method with two variations: one applicable to any language pair, and one specific to the Chinese-Japanese language pair. The method uses both the source and target sentences of an existing parallel corpus and generates multiple pseudo-parallel sentence pairs from each long parallel sentence pair that contains punctuation marks, as follows: (1) split the sentence pair into parallel partial sentences; (2) back-translate the target partial sentences; and (3) replace each partial sentence in the source sentence with the back-translated target partial sentence to generate pseudo-source sentences. The word alignment information used to determine the split points is modified with “shared Chinese character rates” in segments of the sentence pairs. Experimental results for Japanese-Chinese and Chinese-Japanese translation on ASPEC-JC (Asian Scientific Paper Excerpt Corpus, Japanese-Chinese) show that the method substantially improves translation performance. We also supply code (see Supplementary Materials) that reproduces the proposed method.
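
The three steps in the abstract can be illustrated with a minimal Python sketch. This is not the authors' released code. It assumes the sentence pair has already been split into aligned partial sentences (step 1, which the paper performs using word alignment information), that one partial sentence is replaced at a time, and that each pseudo-source sentence is paired with the unchanged target sentence. The names `augment_pair` and `back_translate` are hypothetical; `back_translate` stands in for any target-to-source translation model.

```python
from typing import Callable, List, Tuple


def augment_pair(
    src_parts: List[str],
    tgt_parts: List[str],
    back_translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Build pseudo-parallel pairs from one pre-split sentence pair.

    src_parts / tgt_parts: aligned partial sentences (assumed output of step 1).
    back_translate: hypothetical target-to-source translator (step 2).
    """
    if len(src_parts) != len(tgt_parts):
        raise ValueError("source and target need the same number of partial sentences")

    # Assumption: the target side of every pseudo pair is the original target.
    target = "".join(tgt_parts)
    pairs = []
    for i, tgt_part in enumerate(tgt_parts):
        # Step 3 (as assumed here): swap in one back-translated part at a time.
        pseudo_parts = list(src_parts)
        pseudo_parts[i] = back_translate(tgt_part)
        pairs.append(("".join(pseudo_parts), target))
    return pairs


if __name__ == "__main__":
    # Toy stand-in for a real back-translation model: it only marks the text.
    def mark(text: str) -> str:
        return "<" + text + ">"

    for pseudo_source, target in augment_pair(["A,", "B."], ["a,", "b."], mark):
        print(pseudo_source, "|", target)
```

Under these assumptions, a sentence pair with n partial sentences yields n pseudo-parallel pairs, which would be added to the original corpus before training.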

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci
