Applied Sciences (Jun 2022)

WCC-JC: A Web-Crawled Corpus for Japanese-Chinese Neural Machine Translation

  • Jinyi Zhang,
  • Ye Tian,
  • Jiannan Mao,
  • Mei Han,
  • Tadahiro Matsumoto

DOI
https://doi.org/10.3390/app12126002
Journal volume & issue
Vol. 12, no. 12
p. 6002

Abstract

Currently, few Japanese-Chinese bilingual corpora are large enough to serve as training data for neural machine translation (NMT). In particular, few corpora include spoken language such as daily conversation. In this research, we attempt to construct a sizable Japanese-Chinese bilingual corpus by crawling subtitle data for movies and TV series from websites. We calculated BLEU scores for the constructed corpus, WCC-JC (Web Crawled Corpus - Japanese and Chinese), and for the corpora it was compared against. We also manually evaluated the output of a translation model trained on WCC-JC to confirm the quality and effectiveness of the corpus.

Keywords