Anuario del Seminario de Filología Vasca "Julio de Urquijo" (Apr 2007)

Consumer Eroski parallel corpus

  • Asier Alcázar

Vol. 41, no. 2


This paper introduces the Consumer Eroski Parallel Corpus, a collection of articles originally written in Spanish and later translated into three other languages spoken in Spain: Basque, Catalan and Galician. The articles have been aligned automatically at the sentence level across the four languages using Moore's (2002) bilingual sentence alignment tool. The Spanish section is also annotated morphosyntactically for parts of speech using SVMtool (Giménez and Márquez 2004). The Basque, Catalan and Galician sections may be annotated in a future release in collaboration with computational linguistics groups in Spain. To my knowledge, the Consumer Eroski Parallel Corpus is the first resource to offer a substantial body of parallel text in these four languages. I would like to thank the Eroski Foundation for granting permission to share the corpus in the public domain. Making this resource public will give the computational linguistics community additional opportunities to train, test and develop natural language processing tools. It may also serve translators as a reference. Once the advanced search interface currently under development has been added, the corpus may also be consulted by Basque and Romance linguists interested in cross-linguistic research.