English–Korean speech translation corpus (EnKoST-C): Construction procedure and evaluation results

Jeong-Uk Bang; Joon-Gyu Maeng; Jun Park; Seung Yun; Sang-Hun Kim

doi:10.4218/etrij.2021-0336

ETRI Journal (Feb 2023)

Journal volume & issue: Vol. 45, no. 1, pp. 18–27

Abstract

We present EnKoST-C, an English–Korean speech translation corpus. End-to-end speech translation models are often hampered by a shortage of parallel data, that is, speech in the source language paired with equivalent text in the target language. Most publicly available speech translation corpora were developed for European languages, and there is currently no public corpus for English–Korean end-to-end speech translation. We therefore built EnKoST-C from TED Talks. In constructing it, we improved the sentence alignment approach by using subtitle time information together with bilingual sentence embedding information. The result is a 559-h English–Korean speech translation corpus. The proposed sentence alignment approach achieved an F-measure of 0.96. We also report the baseline performance of an English–Korean speech translation model trained on EnKoST-C. EnKoST-C is freely available on a Korean government open data hub site.
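
As a rough illustration of the two alignment signals named in the abstract, the sketch below scores a candidate English–Korean sentence pair by mixing subtitle time overlap with embedding similarity, and computes an F-measure over predicted alignments. The abstract does not say how the signals are combined, so the weighted sum, the `weight` parameter, and all function names here are assumptions for illustration only, not the authors' method. The embeddings are plain lists of floats standing in for the output of any bilingual sentence encoder.

```python
import math

def time_overlap(a, b):
    """Intersection-over-union of two (start, end) subtitle spans in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0

def alignment_score(src, tgt, weight=0.5):
    """Hypothetical combined score; `weight` balances time vs. embedding evidence.

    src and tgt are dicts with 'span' (start, end) and 'emb' (list of floats).
    """
    return (weight * time_overlap(src["span"], tgt["span"])
            + (1.0 - weight) * cosine(src["emb"], tgt["emb"]))

def f_measure(predicted, gold):
    """Balanced F-measure over sets of (source_index, target_index) pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

A pair whose subtitle spans coincide and whose embeddings point in the same direction scores 1.0 under this scheme; a pair that disagrees on either signal scores lower, so a threshold or a best-match search over candidates could then select alignments.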

Published in ETRI Journal

ISSN: 1225-6463 (Print); 2233-7326 (Online)
Publisher: Electronics and Telecommunications Research Institute (ETRI)
Country of publisher: Korea, Republic of
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering: Telecommunication; Technology: Electrical engineering. Electronics. Nuclear engineering: Electronics
Website: https://onlinelibrary.wiley.com/journal/22337326
