Integrated Sequence Tagging for Medieval Latin Using Deep Representation Learning

Mike Kestemont; Jeroen De Gussem

Journal of Data Mining and Digital Humanities (Aug 2017)

Journal volume & issue: Vol. Special Issue on Computer-Aided Processing of Intertextuality in Ancient Languages, no. Towards a Digital Ecosystem: NLP. Corpus infrastructure. Methods for Retrieving Texts and Computing Text Similarities

Abstract

In this paper we consider two sequence tagging tasks for medieval Latin: part-of-speech tagging and lemmatization. Both are basic yet foundational preprocessing steps in applications such as text re-use detection. They are, however, generally complicated by the considerable orthographic variation that is typical of medieval Latin. In Digital Classics, these tasks are traditionally solved in a (i) cascaded and (ii) lexicon-dependent fashion. For example, a lexicon is used to generate all the potential lemma-tag pairs for a token, and a context-aware PoS-tagger is then used to select the most appropriate lemma-tag pair. Apart from the problems with out-of-lexicon items, error percolation is a major downside of such approaches. In this paper we explore the possibility of solving these tasks elegantly with a single, integrated approach. For this, we make use of a layered neural network architecture from the field of deep representation learning.
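
To make the traditional cascaded, lexicon-dependent setup described in the abstract more concrete, the following is a minimal Python sketch. It is not the authors' system, nor the integrated neural approach they propose. The toy lexicon, the transition scores, and the function name `tag_cascaded` are all hypothetical and chosen only for illustration. The sketch shows how a spelling variant that is missing from the lexicon (here the medieval spelling "michi" for "mihi") yields no lemma or tag at all, and how the following token then loses its left context.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy lexicon: token -> candidate (lemma, PoS tag) pairs.
LEXICON: Dict[str, List[Tuple[str, str]]] = {
    "amor": [("amor", "NOUN"), ("amo", "VERB")],
    "dei": [("deus", "NOUN")],
    "est": [("sum", "VERB"), ("edo", "VERB")],
    "mihi": [("ego", "PRON")],
}

# Hypothetical stand-in for a context-aware tagger:
# a score for seeing a tag directly after the previous tag.
TRANSITION: Dict[Tuple[str, str], float] = {
    ("<s>", "NOUN"): 0.6,
    ("<s>", "VERB"): 0.4,
    ("NOUN", "NOUN"): 0.5,
    ("NOUN", "VERB"): 0.5,
}

Tagged = Tuple[str, Optional[str], Optional[str]]


def tag_cascaded(tokens: List[str]) -> List[Tagged]:
    """Step 1: look up candidate pairs in the lexicon.
    Step 2: greedily pick the candidate whose tag scores best after the previous tag."""
    prev_tag = "<s>"
    output: List[Tagged] = []
    for token in tokens:
        candidates = LEXICON.get(token.lower())
        if not candidates:
            # Out-of-lexicon item: the cascade has nothing to choose from.
            output.append((token, None, None))
            prev_tag = "UNK"  # the next token no longer has usable context
            continue
        lemma, tag = max(
            candidates,
            key=lambda pair: TRANSITION.get((prev_tag, pair[1]), 0.0),
        )
        output.append((token, lemma, tag))
        prev_tag = tag
    return output


in_lexicon = tag_cascaded(["amor", "dei", "est"])
with_variant = tag_cascaded(["michi", "amor"])
```

In this sketch every later decision depends on the lexicon lookup and on the previously chosen tag, which is the kind of dependency chain that lets an early error percolate through the rest of the sequence.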

Published in Journal of Data Mining and Digital Humanities

ISSN: 2416-5999 (Online)
Publisher: Nicolas Turenne
Country of publisher: France
LCC subjects: General Works: History of scholarship and learning. The humanities; Bibliography. Library science. Information resources
Website: http://jdmdh.episciences.org/
