WCC-JC: A Web-Crawled Corpus for Japanese-Chinese Neural Machine Translation

Jinyi Zhang; Ye Tian; Jiannan Mao; Mei Han; Tadahiro Matsumoto

doi:10.3390/app12126002

Applied Sciences (Jun 2022)

Affiliations

Jinyi Zhang: School of Information Science and Engineering, Shenyang Ligong University, Shenyang 110159, China
Ye Tian: Zhuzhou CRRC Times Electric Co., Ltd., Zhuzhou 412001, China
Jiannan Mao: Faculty of Engineering, Gifu University, Gifu 501-1193, Japan
Mei Han: School of Electrical and Information Engineering, Hunan University of Technology, Zhuzhou 412007, China
Tadahiro Matsumoto: Faculty of Engineering, Gifu University, Gifu 501-1193, Japan

DOI: https://doi.org/10.3390/app12126002
Journal volume & issue: Vol. 12, no. 12, p. 6002

Abstract

Currently, few Japanese-Chinese bilingual corpora are large enough to serve as training data for neural machine translation (NMT). In particular, few corpora include spoken language such as daily conversation. In this research, we attempt to construct a reasonably large Japanese-Chinese bilingual corpus by crawling subtitle data for movies and TV series from websites. We calculated BLEU scores for the constructed WCC-JC (Web Crawled Corpus - Japanese and Chinese) and for the other corpora used for comparison. We also manually evaluated the translations produced by a model trained on WCC-JC to confirm the corpus's quality and effectiveness.

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci
