IEEE Access (Jan 2022)

A Method of Chinese-Vietnamese Bilingual Corpus Construction for Machine Translation

  • Phuoc Tran,
  • Thien Nguyen,
  • Dinh-Hong Vu,
  • Huu-Anh Tran,
  • Bay Vo

DOI
https://doi.org/10.1109/ACCESS.2022.3186978
Journal volume & issue
Vol. 10
pp. 78928 – 78938

Abstract

A bilingual corpus is vital for natural language processing, especially machine translation. The larger the corpus and the higher its quality, the better the resulting machine translation. There are two popular approaches to building a bilingual corpus. The first builds one automatically from resources available on the internet, typically bilingual websites. The second constructs one manually. Automated construction is used increasingly often because it is less expensive and a growing number of bilingual websites can be exploited. In this paper, we use automated collection from a bilingual website to create a Chinese-Vietnamese bilingual corpus. Specifically, we collect the data from the website of a multilingual dictionary (https://glosbe.com). The Chinese-Vietnamese corpus collected from this website contains more than 400k sentence pairs. We chose 100,000 sentence pairs from this corpus for machine translation experiments. From them, we built five datasets of 20k, 40k, 60k, 80k, and 100k sentence pairs, respectively. We also built five additional datasets by applying word segmentation to the sentences of the original datasets. The experimental results showed that: 1) the quality of the corpus is relatively good, with a highest BLEU score of 19.8, although some issues remain to be addressed in future work; 2) the larger the corpus, the higher the machine translation quality; and 3) the untokenized datasets train better translation models than the tokenized (word-segmented) datasets.
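
The construction of the five incremental datasets can be illustrated with a minimal Python sketch. The abstract does not say how the subsets were drawn, so the nested-prefix strategy, the fixed shuffle seed, and the name `build_datasets` are assumptions made only for illustration, not the authors' procedure.

```python
import random


def build_datasets(pairs, sizes=(20_000, 40_000, 60_000, 80_000, 100_000), seed=0):
    """Return {size: sentence pairs} as nested prefixes of one shuffled copy.

    Assumption (not stated in the abstract): each smaller dataset is a
    subset of the next larger one, so that only the corpus size varies.
    """
    if max(sizes) > len(pairs):
        raise ValueError("not enough sentence pairs for the largest dataset")
    shuffled = list(pairs)                  # copy; leave the input untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed for reproducibility
    return {n: shuffled[:n] for n in sizes}


# Toy demonstration with placeholder (Chinese, Vietnamese) pairs.
toy_pairs = [(f"zh-{i}", f"vi-{i}") for i in range(10)]
toy_datasets = build_datasets(toy_pairs, sizes=(2, 4, 6, 8, 10))
```

Under this assumption, any difference in BLEU between two datasets reflects the added sentence pairs rather than a different random sample.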

Keywords