Umanistica Digitale (Dec 2023)
Masked texts: new tools for the security and linguistic analysis of legal corpora
Abstract
The Atti Chiari project, which is collecting the first large Italian corpus of judicial acts, must meet strict legal requirements and deal with many peculiarities of language and content; to this end, a number of processes and tools have been designed and implemented. The first requirement is to remove all personal data from the documents without destroying their linguistic form or compromising their readability. For this purpose, a pseudonymization procedure has been created, based on a preliminary annotation stage that adds information precisely so that it can later be removed in different ways, according to different purposes (linguistic analysis, legal analysis, etc.). At the same time, this light annotation provides data useful not only for pseudonymization but also for converting the documents from their original presentational format into a semantic one based on TEI. Once prepared in this way, the documents are gathered into a single corpus, ready to be indexed for linguistic research. Because multiple search criteria must be combined, whatever their origin and model, a new type of search engine, designed primarily for the philological field, has been used to obtain the required openness and granularity of metadata.
Keywords