Evaluating Deep Learning Methods for Word Segmentation of Scripta Continua Texts in Old French and Latin

Thibault Clérice

doi:10.46298/jdmdh.5581

Journal of Data Mining and Digital Humanities (Apr 2020)

Affiliations

Thibault Clérice: Université Paris sciences et lettres

DOI: https://doi.org/10.46298/jdmdh.5581
Journal volume & issue: Vol. 2020, no. Towards a Digital Ecosystem:...

Abstract

Tokenization of modern and old Western European languages seems fairly simple, as it relies mostly on the presence of markers such as spaces and punctuation. However, when dealing with old sources written in scripta continua, such as ancient epigraphy or medieval manuscripts, (1) such markers are mostly absent, and (2) spelling variation and rich morphology make dictionary-based approaches difficult. Applying convolutional encoding to characters, followed by linear classification of each character as word-boundary or in-word-sequence, is shown to be effective at tokenizing such inputs. Additionally, the software is released with a simple interface for tokenizing a corpus or generating a training set.
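The per-character labelling that the abstract describes can be illustrated with a minimal sketch. It covers only the data side: deriving a training pair from an already segmented line, and rebuilding a segmented line from predicted labels. The convolutional encoder and linear classifier are omitted. The function names, the 0/1 encoding, and the choice to put the boundary label on the last character of each word are assumptions made for illustration, not the interface or conventions of the released software.

```python
from typing import List, Tuple

# Assumed label encoding (hypothetical, not taken from the released software).
BOUNDARY = 1  # the character is the last one of a word
IN_WORD = 0   # the character is followed by more characters of the same word


def make_training_pair(spaced: str) -> Tuple[str, List[int]]:
    """Turn a space-separated line into (unspaced text, per-character labels)."""
    chars: List[str] = []
    labels: List[int] = []
    for word in spaced.split():
        for i, ch in enumerate(word):
            chars.append(ch)
            labels.append(BOUNDARY if i == len(word) - 1 else IN_WORD)
    return "".join(chars), labels


def apply_labels(unspaced: str, labels: List[int]) -> str:
    """Rebuild a segmented line by inserting a space after each boundary label."""
    if len(unspaced) != len(labels):
        raise ValueError("need exactly one label per character")
    out: List[str] = []
    for ch, label in zip(unspaced, labels):
        out.append(ch)
        if label == BOUNDARY:
            out.append(" ")
    return "".join(out).rstrip()


# Example: a segmented Latin line becomes scripta continua plus labels.
text, labels = make_training_pair("arma virumque cano")
print(text)
print(labels)
print(apply_labels(text, labels))
```

In a full pipeline, a trained model would supply the labels passed to `apply_labels`; here they come from the reference segmentation, so the round trip simply restores the original line.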

Published in Journal of Data Mining and Digital Humanities

ISSN: 2416-5999 (Online)
Publisher: Nicolas Turenne
Country of publisher: France
LCC subjects: General Works: History of scholarship and learning. The humanities; Bibliography. Library science. Information resources
Website: http://jdmdh.episciences.org/
