Proceedings of the XXth Conference of Open Innovations Association FRUCT (Apr 2024)
The Impact of Multilinguality and Tokenization on Statistical Machine Translation
Abstract
Multilingual neural machine translation systems have achieved state-of-the-art translation quality, especially for low-resource languages, yet statistical machine translation systems have not been trained and examined in a similar multilingual setup. This work defines a multilingual statistical machine translation system as a many-to-one system capable of translating from any of a predefined set of source languages into a single target language. We study how the multilingual setting affects translation quality compared to a regular one-to-one machine translation system, and we examine how this setting affects related languages with different amounts of training data. The experiments cover multiple languages from different language families. We also study the impact of different tokenizers and preprocessing methods. Specifically, we compare the default Moses tokenizer with the SentencePiece tokenizer, as well as with dedicated Chinese and Japanese word splitters. We further investigate the impact of lowercasing and conduct our experiments on data of different sizes. We find that multilinguality gives a small gain across all metrics. Languages with a sufficient amount of good-quality training data do not affect the translation quality of related languages with lower-quality data. The SentencePiece tokenizer yields lower BLEU scores on average, but outperforms the other tokenizers on the chrF++ and METEOR metrics. Lowercasing increases the scores on all metrics in all scenarios.
Keywords