Journal of Open Humanities Data (Dec 2021)

Old Catalan Morphosyntax: Developing an Annotated Corpus

  • Marieke Meelen
  • Afra Pujol i Campeny

DOI
https://doi.org/10.5334/johd.54
Journal volume & issue
Vol. 7

Abstract

This paper presents a full procedure for the development of a Part-of-Speech (POS) tagged corpus of Old Catalan. As an extremely low-resource language with rich inflection and frequent homographs, Old Catalan poses non-trivial problems in the development of a searchable constituency-based treebank. We demonstrate, however, that a semi-supervised method of incrementally building training data using both neural and memory-based taggers, together with the Pyrrha annotation tool, is highly efficient and yields accurate results. We propose that this simple and effective method could easily be extended to other low-resource historical languages for which no NLP tools exist yet.

Keywords