IEEE Access (Jan 2023)

An Enhanced Method for Neural Machine Translation via Data Augmentation Based on the Self-Constructed English-Chinese Corpus, WCC-EC

  • Jinyi Zhang,
  • Cong Guo,
  • Jiannan Mao,
  • Chong Guo,
  • Tadahiro Matsumoto

DOI
https://doi.org/10.1109/ACCESS.2023.3323756
Journal volume & issue
Vol. 11
pp. 112123 – 112132

Abstract

As globalization increases, the need to understand multilingual texts has made translation an everyday necessity. The effectiveness of contemporary Neural Machine Translation (NMT) systems depends heavily on the availability of substantial training data, so building a large parallel corpus is a key step toward high-quality NMT systems. However, building a parallel corpus is often complex and costly, especially for language pairs that lack sufficient parallel data. To address this challenge, this study proposed a novel data augmentation method for the corpus. Bilingual news texts were collected from the KEKE English website, and SentenceBERT was used to align sentences accurately. The parallel partial sentences within the corpus were then filtered and used to augment the dataset. Finally, the method was evaluated by calculating BLEU, chrF, and METEOR scores on both the original corpus and the data-augmented corpus, using the base translation model. Compared with the baseline method, the optimal method improved BLEU scores by 0.4 to 2.1 points, chrF scores by 0.5 to 2.7 points, and METEOR scores by 0.5 to 1.6 points.
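
The sentence alignment step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the alignment procedure, so what follows is an assumption, not the authors' implementation: candidate sentence pairs are scored by cosine similarity between sentence embeddings and matched greedily one-to-one above a cutoff. The names `cosine`, `align_sentences`, and `threshold` are hypothetical, and the embeddings are passed in as plain lists of floats; in practice they would presumably come from a multilingual SentenceBERT model.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def align_sentences(
    src_vecs: Sequence[Sequence[float]],
    tgt_vecs: Sequence[Sequence[float]],
    threshold: float = 0.7,  # hypothetical cutoff, not taken from the paper
) -> List[Tuple[int, int, float]]:
    """Greedy one-to-one alignment of source and target sentences.

    Returns (source index, target index, similarity) triples, sorted by
    source index. Sentences with no match above the threshold are left out.
    """
    # Score every source/target combination.
    candidates = [
        (cosine(s, t), i, j)
        for i, s in enumerate(src_vecs)
        for j, t in enumerate(tgt_vecs)
    ]
    # Consider the most similar pairs first.
    candidates.sort(key=lambda c: -c[0])

    used_src, used_tgt = set(), set()
    pairs: List[Tuple[int, int, float]] = []
    for score, i, j in candidates:
        if score < threshold:
            break  # every remaining candidate scores lower
        if i in used_src or j in used_tgt:
            continue  # each sentence is aligned at most once
        used_src.add(i)
        used_tgt.add(j)
        pairs.append((i, j, score))
    return sorted(pairs)
```

Greedy matching is only one of several possible strategies; it is used here because it is short and easy to follow. The cutoff trades corpus size against alignment quality, and a real pipeline would tune it on held-out data.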

Keywords