Studia Linguistica Romanica (Oct 2022)

Harmoniser le corpus ConDÉ: De l'image à la ressource linguistique

  • Pica, Morgane L.

DOI
https://doi.org/10.25364/19.2022.8.7
Journal volume & issue
Vol. 2, no. 8
pp. 131 – 154

Abstract


The corpus compiled for the RIN ConDÉ project consists of twelve reference sources on Norman customary law, dating from the 13th to the 19th century. Although they deal with the same subject, the texts in this corpus are very heterogeneous in format and structure. The texts were processed with the HTR tool Transkribus; Python and XSLT were used for automated transformations; lemmatization was performed with AnaLog; and the data were encoded following the TEI model. Processing the data required a stage of reflection to identify the best means of restoring the structures and reference systems of the sources, and to devise a set of lemma and part-of-speech tags that would work for texts covering six centuries of linguistic evolution. To make the texts maximally comparable, it was eventually decided to create a three-level structure (part > chapter > section).
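
The three-level structure can be pictured as nested TEI `div` elements. The Python sketch below is only an illustration under stated assumptions, not the project's actual pipeline: the function name `build_body`, the input tuples, and the `type` values `part`, `chapter` and `section` are hypothetical, chosen to mirror the hierarchy named in the abstract.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Assumed labels for the three levels; the project's real attribute values may differ.
LEVELS = ("part", "chapter", "section")


def build_body(records):
    """Build a TEI <body> from (part, chapter, section, text) tuples.

    Each level becomes a <div> carrying @type and @n, nested as
    part > chapter > section. Divisions are created once and reused.
    """
    ET.register_namespace("", TEI_NS)
    body = ET.Element(f"{{{TEI_NS}}}body")
    index = {}  # maps a path such as (1, 2) to the <div> already created for it

    for part, chapter, section, text in records:
        parent = body
        path = ()
        for level, number in zip(LEVELS, (part, chapter, section)):
            path += (number,)
            div = index.get(path)
            if div is None:
                div = ET.SubElement(
                    parent,
                    f"{{{TEI_NS}}}div",
                    {"type": level, "n": str(number)},
                )
                index[path] = div
            parent = div
        # The text of the section goes into a paragraph inside the innermost div.
        paragraph = ET.SubElement(parent, f"{{{TEI_NS}}}p")
        paragraph.text = text

    return body


if __name__ == "__main__":
    # Placeholder records, not taken from the corpus.
    sample = [
        (1, 1, 1, "first section"),
        (1, 1, 2, "second section"),
        (1, 2, 1, "section of another chapter"),
    ]
    print(ET.tostring(build_body(sample), encoding="unicode"))
```

Giving every source the same part, chapter and section addressing, whatever its original layout, is what allows a passage in one text to be aligned with its counterpart in another.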