Harmoniser le corpus ConDÉ: De l'image à la ressource linguistique

Pica, Morgane L.

doi:10.25364/19.2022.8.7

Studia Linguistica Romanica (Oct 2022)

Affiliations

Pica, Morgane L.: École normale supérieure de Lyon (Lyon, France)

DOI: https://doi.org/10.25364/19.2022.8.7
Journal volume & issue: Vol. 2, no. 8
pp. 131–154

Abstract

The corpus compiled for the RIN ConDÉ project consists of twelve reference sources on Norman customary law, dating from the 13th to the 19th century. Although they deal with the same subject, the texts in this corpus are very heterogeneous in format and structure. The texts were processed with the HTR tool Transkribus, automated transformations were written in Python and XSLT, lemmatization was performed with AnaLog, and the data were encoded following the TEI model. Processing the data required a stage of reflection to identify the best means of restoring the structures and reference systems, and to devise a set of lemma and part-of-speech tags that would work for texts covering six centuries of linguistic evolution. To make the texts maximally comparable, it was eventually decided to impose a common three-level structure (part > chapter > section).
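
As an illustration of the three-level structure mentioned above, the following is a minimal Python sketch of how nested part > chapter > section divisions might be serialized as TEI `<div>` elements. It is not the project's actual code: the function names `build_body` and `_div`, the use of the `type` and `n` attributes to mark levels, and the placeholder input are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def _div(parent, level, number, title=None):
    """Append a TEI <div> for one structural level (hypothetical helper)."""
    div = ET.SubElement(parent, f"{{{TEI_NS}}}div", {"type": level, "n": str(number)})
    if title is not None:
        head = ET.SubElement(div, f"{{{TEI_NS}}}head")
        head.text = title
    return div


def build_body(parts):
    """Build a TEI <body> from nested data.

    parts: list of (part_title, chapters)
    chapters: list of (chapter_title, sections)
    sections: list of section texts (plain strings)
    """
    ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace
    body = ET.Element(f"{{{TEI_NS}}}body")
    for p_idx, (part_title, chapters) in enumerate(parts, start=1):
        part = _div(body, "part", p_idx, part_title)
        for c_idx, (chapter_title, sections) in enumerate(chapters, start=1):
            chapter = _div(part, "chapter", c_idx, chapter_title)
            for s_idx, text in enumerate(sections, start=1):
                section = _div(chapter, "section", s_idx)
                paragraph = ET.SubElement(section, f"{{{TEI_NS}}}p")
                paragraph.text = text
    return body


if __name__ == "__main__":
    # Placeholder content, invented for the example only.
    sample = [("Part A", [("Chapter 1", ["First section.", "Second section."])])]
    print(ET.tostring(build_body(sample), encoding="unicode"))
```

Numbering every level with `n` gives each section a stable address (part, chapter, section) that can be compared across sources, which is one plausible way to support the comparability goal described in the abstract.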

Published in Studia Linguistica Romanica

ISSN: 2663-9815 (Online)
Publisher: University of Graz
Country of publisher: Austria
LCC subjects: Language and Literature: Romanic languages
Website: https://studialinguisticaromanica.org/index.php/slr