Wiki-Translator: Multilingual Experiments for In-Domain Translations

Dan Tufis; Radu Ion; Stefan Daniel Dumitrescu

Computer Science Journal of Moldova (Nov 2013)

Affiliations

Dan Tufis: Institute for AI, Romanian Academy, Bucharest, Romania
Radu Ion: Institute for AI, Romanian Academy, Bucharest, Romania
Stefan Daniel Dumitrescu: Institute for AI, Romanian Academy, Bucharest, Romania

Journal volume & issue: Vol. 21, no. 3(63)
pp. 332 – 359

Abstract

Various researchers have already shown that comparable corpora can improve the translation quality of statistical machine translation (SMT) systems. The usual approach starts with a baseline system trained on out-of-domain parallel corpora and then adapts it to the domain in which new translations are needed. Adaptation to a new domain, especially a narrow one, relies on data extracted from comparable corpora in that domain or in one as close to it as possible. This article reports on a slightly different approach: building an SMT system entirely from comparable data for the domain of interest. Naturally, the approach is feasible if the comparable corpora are large enough to yield SMT-useful data in quantities sufficient for reliable training. The more comparable corpora, the better the results. Wikipedia is a very good candidate for such an experiment. We report on large-scale experiments showing significant improvements over a baseline system built from highly similar (almost parallel) text fragments extracted from Wikipedia. The improvements, which are statistically significant, are related to what we call the level of translational similarity between extracted pairs of sentences. The experiments were performed for three language pairs: Spanish-English, German-English and Romanian-English, using sentence pairs extracted from the complete Wikipedia dumps as of December 2012. Our experiments, and a comparison with similar work, show that indiscriminately adding more data to a training corpus is not necessarily beneficial in SMT.

Published in Computer Science Journal of Moldova

ISSN: 1561-4042 (Print); 2587-4330 (Online)
Publisher: Vladimir Andrunachievici Institute of Mathematics and Computer Science
Country of publisher: Moldova, Republic of
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: http://www.math.md/en/publications/csjm/
