Statistical Models for Unsupervised, Semi-Supervised, and Supervised Transliteration Mining

Hassan Sajjad; Helmut Schmid; Alexander Fraser; Hinrich Schütze

doi:10.1162/coli_a_00286

Computational Linguistics (Mar 2017)

DOI: https://doi.org/10.1162/coli_a_00286
Journal volume & issue: Vol. 43, no. 2

Abstract

We present a generative model that efficiently mines transliteration pairs in a consistent fashion in three different settings: unsupervised, semi-supervised, and supervised transliteration mining. The model interpolates two sub-models, one for the generation of transliteration pairs and one for the generation of non-transliteration pairs (i.e., noise). The model is trained on noisy unlabeled data using the EM algorithm. During training the transliteration sub-model learns to generate transliteration pairs and the fixed non-transliteration model generates the noise pairs. After training, the unlabeled data is disambiguated based on the posterior probabilities of the two sub-models. We evaluate our transliteration mining system on data from a transliteration mining shared task and on parallel corpora. For three out of four language pairs, our system outperforms all semi-supervised and supervised systems that participated in the NEWS 2010 shared task. On word pairs extracted from parallel corpora with fewer than 2% transliteration pairs, our system achieves up to 86.7% F-measure with 77.9% precision and 97.8% recall.
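
The training scheme described in the abstract (a mixture of a learned transliteration sub-model and a fixed noise sub-model, fitted with EM and then used to label pairs by posterior probability) can be illustrated with a deliberately small sketch. This is not the authors' implementation; everything specific here is an assumption made for illustration. Word pairs are assumed to have equal length and to be aligned position by position, the transliteration sub-model is a unigram model over character pairs, the noise sub-model multiplies independent source and target character frequencies, and the mixing weight `lam` is re-estimated in each iteration. The names `fit_noise_model`, `posterior`, and `train` are hypothetical.

```python
from collections import Counter
from math import prod


def fit_noise_model(pairs):
    """Fixed noise sub-model (assumption): source and target characters
    are generated independently from their relative frequencies."""
    src = Counter(c for s, _ in pairs for c in s)
    tgt = Counter(c for _, t in pairs for c in t)
    n_src, n_tgt = sum(src.values()), sum(tgt.values())
    return lambda s, t: prod(
        (src[a] / n_src) * (tgt[b] / n_tgt) for a, b in zip(s, t)
    )


def posterior(s, t, p_pair, lam, p_noise):
    """Posterior probability that (s, t) was generated by the
    transliteration sub-model rather than the noise sub-model.
    Only meant for pairs whose characters were seen in training."""
    p_translit = (1.0 - lam) * prod(p_pair.get(cp, 0.0) for cp in zip(s, t))
    return p_translit / (p_translit + lam * p_noise(s, t))


def train(pairs, iterations=20):
    """EM for the two-component mixture; the noise sub-model stays fixed."""
    p_noise = fit_noise_model(pairs)  # estimated once, never updated
    types = {cp for s, t in pairs for cp in zip(s, t)}
    p_pair = {cp: 1.0 / len(types) for cp in types}  # uniform start
    lam = 0.5  # prior probability of the noise sub-model
    for _ in range(iterations):
        # E-step: how strongly each word pair belongs to the transliteration model
        post = [posterior(s, t, p_pair, lam, p_noise) for s, t in pairs]
        # M-step: re-estimate character-pair probabilities from weighted counts
        counts = Counter()
        for (s, t), w in zip(pairs, post):
            for cp in zip(s, t):
                counts[cp] += w
        total = sum(counts.values())
        p_pair = {cp: c / total for cp, c in counts.items()}
        # M-step: re-estimate the mixing weight
        lam = 1.0 - sum(post) / len(pairs)
    return p_pair, lam, p_noise


# Invented toy data: six pairs follow the mapping a->x, b->y; the last two do not.
pairs = [
    ("aba", "xyx"), ("bab", "yxy"), ("aab", "xxy"),
    ("bba", "yyx"), ("abb", "xyy"), ("baa", "yxx"),
    ("aab", "yyx"), ("bab", "xyx"),
]
p_pair, lam, p_noise = train(pairs)

# Mining step: keep pairs whose transliteration posterior exceeds 0.5.
mined = [(s, t) for s, t in pairs if posterior(s, t, p_pair, lam, p_noise) > 0.5]
```

On this toy data the six consistently mapped pairs end up in `mined` and the two inconsistent pairs are rejected. The separation arises because the joint character-pair model can capture the dependence between source and target characters, whereas the fixed noise sub-model, by construction, cannot.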

Published in Computational Linguistics

ISSN: 0891-2017 (Print); 1530-9312 (Online)
Publisher: The MIT Press
Country of publisher: United States
LCC subjects: Language and Literature: Philology. Linguistics: Computational linguistics. Natural language processing
Website: https://direct.mit.edu/coli
