Electronics (Feb 2023)

WCC-JC 2.0: A Web-Crawled and Manually Aligned Parallel Corpus for Japanese-Chinese Neural Machine Translation

  • Jinyi Zhang,
  • Ye Tian,
  • Jiannan Mao,
  • Mei Han,
  • Feng Wen,
  • Cong Guo,
  • Zhonghui Gao,
  • Tadahiro Matsumoto

DOI
https://doi.org/10.3390/electronics12051140
Journal volume & issue
Vol. 12, no. 5
p. 1140

Abstract

Movie and TV subtitles are frequently used in natural language processing (NLP) applications, but few Japanese-Chinese bilingual corpora are available as datasets for training neural machine translation (NMT) models. In our previous study, we constructed a sizable Japanese-Chinese bilingual corpus by collecting subtitle text from websites hosting movies and television series. The unsatisfactory translation performance of that initial corpus, the Web-Crawled Corpus of Japanese and Chinese (WCC-JC 1.0), was caused mainly by its limited number of sentence pairs. To address this shortcoming, we thoroughly analyzed the issues in the construction of WCC-JC 1.0 and built the WCC-JC 2.0 corpus by first collecting subtitle data from movie and TV series websites and then manually aligning a large number of high-quality sentence pairs. The resulting corpus contains about 1.4 million sentence pairs, an 87% increase over WCC-JC 1.0, making WCC-JC 2.0 one of the largest publicly available Japanese-Chinese bilingual corpora in the world. To assess WCC-JC 2.0, we computed BLEU scores against other comparative corpora and manually evaluated the translations produced by models trained on WCC-JC 2.0. We provide WCC-JC 2.0 as a free download for research purposes only.
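The evaluation described above reports corpus-level BLEU scores. As a minimal illustration only (the paper's exact scoring setup is not specified here, and a standard tool such as sacreBLEU would normally be used in practice), a self-contained corpus-level BLEU computation over whitespace-tokenized sentence pairs with a single reference per hypothesis might look like:

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100) with brevity penalty, one reference per hypothesis.

    Uses modified (clipped) n-gram precision up to max_n, uniform weights.
    """
    clipped = [0] * max_n   # clipped n-gram matches, per order
    total = [0] * max_n     # total hypothesis n-grams, per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_ngrams = ngrams(h, n)
            r_ngrams = ngrams(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            clipped[n - 1] += sum(min(c, r_ngrams[g]) for g, c in h_ngrams.items())
            total[n - 1] += sum(h_ngrams.values())
    if min(clipped) == 0:
        return 0.0  # some n-gram order has no match; log-precision undefined
    log_precision = sum(math.log(c / t) for c, t in zip(clipped, total)) / max_n
    # Brevity penalty: penalize hypotheses shorter than the references.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return 100 * bp * math.exp(log_precision)
```

A perfect hypothesis scores 100, e.g. `corpus_bleu(["the cat sat on the mat"], ["the cat sat on the mat"])` returns `100.0`. Note that real Japanese and Chinese evaluation requires a segmentation step (e.g. character-level or a tokenizer) before whitespace splitting, since neither language delimits words with spaces.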

Keywords