IEEE Access (Jan 2022)
An Automatic Post Editing With Efficient and Simple Data Generation Method
Abstract
Automatic post-editing (APE) research considers methods for correcting translation results inferred by machine translation systems. The training of APE models generally requires triplets consisting of a source sentence ($src$), a machine translation sentence ($mt$), and a post-edited sentence ($pe$). Because creating $pe$ requires considerable expert-level human labor, APE research has had difficulty constructing suitable datasets for most language pairs. This has led to the absence of APE data for most language pairs, such as Korean-English, and has limited sustained APE research. Motivated by this problem, we propose a method that generates APE triplets using only a parallel corpus, without human labor. Our proposal comprises three noise generation techniques: random noise, part-of-speech (POS) tagging based noise, and semantic-level noise. The effectiveness of these methods is verified through quantitative and qualitative experiments on Korean-English APE tasks. Our experiments show that POS-based noise yields the best APE performance. The proposed method is significant in that it obviates the expert human labor generally required for APE data construction and enables sustainable APE research for the many language pairs for which human-edited APE triplets are unavailable.
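As an illustration of the triplet-generation idea, the following is a minimal sketch of the random-noise variant only. It assumes that the target side of a parallel pair serves as $pe$ and that a noised copy of it stands in for $mt$. The function names (`add_random_noise`, `make_triplet`), the three token-level operations (delete, replace, swap), and the `noise_prob` parameter are hypothetical choices made for this example; the abstract does not specify the actual noise operations or their rates.

```python
import random


def add_random_noise(tokens, rng, noise_prob=0.15):
    """Return a noised copy of `tokens`.

    Each position is, with probability `noise_prob`, either deleted,
    replaced by a token drawn from the same sentence, or swapped with
    its successor. These operations are illustrative assumptions.
    """
    noised = []
    i = 0
    while i < len(tokens):
        if rng.random() < noise_prob:
            op = rng.choice(["delete", "replace", "swap"])
            if op == "delete":
                i += 1
                continue
            if op == "replace":
                noised.append(rng.choice(tokens))
                i += 1
                continue
            if op == "swap" and i + 1 < len(tokens):
                noised.extend([tokens[i + 1], tokens[i]])
                i += 2
                continue
        # No noise applied (or a swap was requested on the last token).
        noised.append(tokens[i])
        i += 1
    return noised


def make_triplet(src, tgt, rng, noise_prob=0.15):
    """Build a pseudo APE triplet from one parallel sentence pair.

    The clean target is kept as `pe`; its noised copy plays the role of `mt`.
    """
    mt_tokens = add_random_noise(tgt.split(), rng, noise_prob)
    return {"src": src, "mt": " ".join(mt_tokens), "pe": tgt}


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so that runs are repeatable
    triplet = make_triplet("나는 어제 학교에 갔다.", "I went to school yesterday.", rng, 0.3)
    print(triplet)
```

A POS-based variant would presumably restrict these operations to tokens with selected part-of-speech tags, which requires a tagger and is therefore omitted from this sketch.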
Keywords