Lexis: Journal in English Lexicology (Apr 2010)
Using parallel text for the extraction of German multiword expressions
Abstract
We present a procedure for identifying semantically opaque (i.e. idiomatic) German multiword expressions. We focus on lexicographically relevant verb + PP combinations such as ins Leben rufen (“to initiate”, lit.: “to call into life”), which are extracted via dependency parsing [Schiehlen 2003]. Building on [Villada Moirón and Tiedemann 2006], the method exploits the fact that opaque combinations are translated as a whole, whereas compositional uses show regular, individual translations of the words involved. The translations into other languages are obtained by applying GIZA++ [Och and Ney 2003] word alignment to the EUROPARL corpus [Koehn 2005]. We perform numerous experiments to further optimise the original method, analysing several parameters both individually and in combination with each other. Depending on the parameter settings, the 200 highest-scoring multiword candidates reach an uninterpolated average precision between 0.800 and 0.936, compared with a baseline of 0.584 obtained by ranking the 200 most frequent multiwords in decreasing order of their occurrence frequency.
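As an illustration of the evaluation measure, the following minimal sketch computes uninterpolated average precision over a ranked candidate list. It assumes the common definition (the mean of the precision values taken at each rank where a true positive occurs); the function name and the toy ranking are hypothetical and are not taken from the experiments reported here.

```python
from typing import Sequence


def uninterpolated_average_precision(is_true_positive: Sequence[bool]) -> float:
    """Average the precision at every rank that holds a true positive.

    `is_true_positive[i]` says whether the candidate at rank i + 1 was
    judged to be a genuine (opaque) multiword expression.
    Assumed definition; returns 0.0 if the list has no true positives.
    """
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(is_true_positive, start=1):
        if relevant:
            hits += 1
            # Precision at this rank: true positives so far / candidates so far.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Hypothetical toy ranking: true positives at ranks 1 and 3.
toy_ranking = [True, False, True]
print(round(uninterpolated_average_precision(toy_ranking), 3))  # (1/1 + 2/3) / 2
```

Under this definition the score rewards rankings that place true positives near the top: a list that ranks every true positive before any false positive scores 1.0.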
Keywords