Is Your Model Sensitive? SPEDAC: A New Resource for the Automatic Classification of Sensitive Personal Data

Gaia Gambarelli; Aldo Gangemi; Rocco Tripodi

doi:10.1109/ACCESS.2023.3240089

IEEE Access (Jan 2023)

Affiliations

Gaia Gambarelli: FICLIT, University of Bologna, Bologna, Italy
Aldo Gangemi: FICLIT, University of Bologna, Bologna, Italy
Rocco Tripodi: LILEC, University of Bologna, Bologna, Italy

DOI: https://doi.org/10.1109/ACCESS.2023.3240089
Journal volume & issue: Vol. 11, pp. 10864–10880

Abstract

In recent years, there has been an exponential growth of applications, including dialogue systems, that handle sensitive personal information. This has brought to light the extremely important issue of personal data protection in virtual environments. Sensitive information detection (SID) has been studied across different domains and languages in the literature. However, in the personal data domain, the absence of a shared standard benchmark makes comparison with the state of the art difficult. To fill this gap, we introduce and release SPEDAC, a new annotated resource for the identification of sensitive personal data categories in the English language. SPEDAC enables the evaluation of computational models on three SID subtasks of increasing complexity. SPEDAC 1 concerns binary classification: a model has to detect whether a sentence contains sensitive information. SPEDAC 2 contains sentences labeled with 5 categories that relate to macro-domains of personal information. In SPEDAC 3, the labeling is fine-grained and includes 61 personal data categories. We conduct an extensive evaluation of the resource using different state-of-the-art classifiers. The results show that SPEDAC is challenging, particularly with regard to fine-grained classification. Classifiers based on transformer architectures achieve good results on SPEDAC 1 and 2 but have difficulty discerning among the fine-grained classes in SPEDAC 3.

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
