A Method of Chinese-Vietnamese Bilingual Corpus Construction for Machine Translation

Phuoc Tran; Thien Nguyen; Dinh-Hong Vu; Huu-Anh Tran; Bay Vo

doi:10.1109/ACCESS.2022.3186978

IEEE Access (Jan 2022)

Affiliations

Phuoc Tran: Natural Language Processing and Knowledge Discovery Laboratory, Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam
Thien Nguyen: Natural Language Processing and Knowledge Discovery Laboratory, Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam
Dinh-Hong Vu: Natural Language Processing and Knowledge Discovery Laboratory, Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam
Huu-Anh Tran: Faculty of Information Technology, Thai Binh University, Thai Binh City, Vietnam
Bay Vo: Faculty of Information Technology, HUTECH University, Ho Chi Minh City, Vietnam

DOI: https://doi.org/10.1109/ACCESS.2022.3186978
Journal volume & issue: Vol. 10
pp. 78928 – 78938

Abstract

A bilingual corpus is vital for natural language processing tasks, especially machine translation. The larger and higher in quality the corpus, the better the resulting machine translation. There are two common approaches to building a bilingual corpus. The first builds it automatically from resources available on the internet, typically bilingual websites. The second constructs it manually. Automated construction is used increasingly often because it is less expensive and the number of bilingual websites to exploit keeps growing. In this paper, we apply automated collection to a bilingual website to create a Chinese-Vietnamese bilingual corpus. The source is the website of a multilingual dictionary (https://glosbe.com), from which we collected a Chinese-Vietnamese corpus of more than 400k sentence pairs. We chose 100,000 of these sentence pairs for machine translation experiments and built five datasets of 20k, 40k, 60k, 80k, and 100k sentence pairs, respectively. We also built five additional datasets by applying word segmentation to the sentences of the original datasets. The experimental results showed that: 1) the quality of the corpus is relatively good, with a highest BLEU score of 19.8, although some issues remain to be addressed in future work; 2) the larger the corpus, the higher the machine translation quality; and 3) the untokenized datasets train better translation models than the tokenized datasets.
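
The dataset construction described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the five subsets were drawn, so this sketch assumes each smaller dataset is a prefix of the larger one. The names `build_subsets` and `segment_dataset` are hypothetical, and the two segmenters are placeholders, not the word segmentation tools used by the authors.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Pair = Tuple[str, str]  # (Chinese sentence, Vietnamese sentence)


def build_subsets(pairs: Sequence[Pair], sizes: Sequence[int]) -> Dict[int, List[Pair]]:
    """Return one dataset per requested size.

    Assumption: each dataset is the first n pairs of the corpus, so the
    datasets are nested. The paper's actual sampling may differ.
    """
    if sizes and max(sizes) > len(pairs):
        raise ValueError("requested size exceeds corpus size")
    return {n: list(pairs[:n]) for n in sizes}


def segment_dataset(
    dataset: Sequence[Pair],
    segment_zh: Callable[[str], str],
    segment_vi: Callable[[str], str],
) -> List[Pair]:
    """Apply a word segmenter to each side of every sentence pair."""
    return [(segment_zh(zh), segment_vi(vi)) for zh, vi in dataset]


# Toy corpus standing in for the 100,000 selected sentence pairs.
corpus = [(f"zh{i}", f"vi{i}") for i in range(10)]

# Scaled-down stand-ins for the 20k, 40k, 60k, 80k and 100k datasets.
subsets = build_subsets(corpus, sizes=[2, 4, 6, 8, 10])

# Placeholder segmenters: split the Chinese side into characters and
# leave the Vietnamese side unchanged. Real segmenters would go here.
segmented = {
    n: segment_dataset(data, lambda s: " ".join(s), lambda s: s)
    for n, data in subsets.items()
}
```

Under these assumptions the two dictionaries hold the five original datasets and their five segmented counterparts, so each pair of datasets can be used to train and compare translation models at the same corpus size.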

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
