Masked texts: new tools for the security and linguistic analysis of legal corpora

Laura Clemenzi; Francesca Fusco; Daniele Fusi; Giulia Lombardi

doi:10.6092/issn.2532-8816/15608

Umanistica Digitale (Dec 2023)

Affiliations

Laura Clemenzi: Università degli Studi della Tuscia
Francesca Fusco: Università degli Studi di Padova
Daniele Fusi: Università degli Studi di Venezia Ca' Foscari - Venice Centre for digital and public humanities (VeDPH)
Giulia Lombardi: Università di Genova

DOI: https://doi.org/10.6092/issn.2532-8816/15608
Journal volume & issue: no. 16
pp. 1–32

Abstract

The Atti Chiari project, which is collecting the first large Italian corpus of judicial acts, must satisfy strict legal requirements and deal with many peculiarities of language and content; to meet these needs, a number of processes and tools have been designed and implemented. The first requirement is to remove all personal data from the documents without destroying their linguistic form or compromising their readability. To this end, a pseudonymisation procedure has been built on a preliminary annotation stage, which adds information precisely so that it can be removed in different ways for different purposes (linguistic analysis, legal analysis, etc.). This light annotation also provides data useful for converting the documents from their original presentational format into a semantic one based on TEI. Once prepared in this way, the documents are gathered into a corpus, ready to be indexed for linguistic research. Because multiple search criteria must be combined, whatever their origin and model, a new type of search engine, designed primarily for the philological field, has been used to obtain the required openness and granularity of metadata.
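The idea of annotating personal data first and then removing it differently for each purpose can be illustrated with a minimal Python sketch. It is not the project's implementation: the `Annotation` class, the `annotate` and `pseudonymise` helpers, the two mode names, the placeholder formats and the sample sentence (with invented names) are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A hypothetical stand-off annotation marking one span of personal data."""
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    kind: str   # assumed category label, e.g. "person" or "place"


def annotate(text, phrase, kind):
    """Mark every literal occurrence of phrase in text (stand-in for manual annotation)."""
    return [Annotation(m.start(), m.end(), kind)
            for m in re.finditer(re.escape(phrase), text)]


def pseudonymise(text, annotations, mode):
    """Replace each annotated span with a placeholder that depends on mode.

    Assumes the annotated spans do not overlap.
    """
    counters = {}  # per-kind running number, used in "linguistic" mode
    labels = {}    # (kind, original string) -> stable placeholder
    out = []
    pos = 0
    for ann in sorted(annotations, key=lambda a: a.start):
        out.append(text[pos:ann.start])
        original = text[ann.start:ann.end]
        if mode == "linguistic":
            # Assumed policy: the same entity always gets the same numbered
            # label, so references to it remain traceable in the masked text.
            key = (ann.kind, original)
            if key not in labels:
                counters[ann.kind] = counters.get(ann.kind, 0) + 1
                labels[key] = f"{ann.kind.upper()}_{counters[ann.kind]}"
            out.append(labels[key])
        elif mode == "legal":
            # Assumed policy: keep only the category of the removed data.
            out.append(f"[{ann.kind}]")
        else:
            raise ValueError(f"unknown mode: {mode}")
        pos = ann.end
    out.append(text[pos:])
    return "".join(out)


# Invented sample sentence; the names are fictitious.
text = "Mario Rossi sued Anna Bianchi in Rome; Mario Rossi lost."
annotations = (
    annotate(text, "Mario Rossi", "person")
    + annotate(text, "Anna Bianchi", "person")
    + annotate(text, "Rome", "place")
)
linguistic_view = pseudonymise(text, annotations, "linguistic")
legal_view = pseudonymise(text, annotations, "legal")
print(linguistic_view)  # PERSON_1 sued PERSON_2 in PLACE_1; PERSON_1 lost.
print(legal_view)       # [person] sued [person] in [place]; [person] lost.
```

In this sketch the annotations are kept apart from the text, so one annotated source can yield several masked versions, and the same category labels could in principle feed a later conversion step.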

Published in Umanistica Digitale

ISSN: 2532-8816 (Online)
Publisher: University of Bologna
Country of publisher: Italy
LCC subjects: General Works: History of scholarship and learning. The humanities
Website: http://umanisticadigitale.unibo.it/