Journal of Data Mining and Digital Humanities (Feb 2021)

Corpus and Models for Lemmatisation and POS-tagging of Classical French Theatre

  • Jean-Baptiste Camps,
  • Simon Gabay,
  • Paul Fièvre,
  • Thibault Clérice,
  • Florian Cafiero

DOI
https://doi.org/10.46298/jdmdh.6485
Journal volume & issue
Vol. 2021, no. Digital humanities in...

Abstract

This paper describes the process of building an annotated corpus and training models for classical French literature, with a focus on theatre, and particularly comedies in verse. It was originally developed as a preliminary step to the stylometric analyses presented in Cafiero and Camps [2019]. The use of a recent lemmatiser based on neural networks and a CRF tagger makes it possible to achieve accuracies beyond the current state of the art on the in-domain test, and proves to be robust in out-of-domain tests, i.e. on texts up to 20th-century novels.

Keywords