Semi-Automatic Dataset Annotation Applied to Automatic Violent Message Detection

Beatriz Botella-Gil; Robiert Sepulveda-Torres; Alba Bonet-Jover; Patricio Martinez-Barco; Estela Saquete

doi:10.1109/ACCESS.2024.3361404

IEEE Access (Jan 2024)

Affiliations

Beatriz Botella-Gil: Department of Software and Computing Systems, University of Alicante, San Vicente del Raspeig, Alicante, Spain
Robiert Sepulveda-Torres: Department of Software and Computing Systems, University of Alicante, San Vicente del Raspeig, Alicante, Spain
Alba Bonet-Jover: Department of Software and Computing Systems, University of Alicante, San Vicente del Raspeig, Alicante, Spain
Patricio Martinez-Barco: Department of Software and Computing Systems, University of Alicante, San Vicente del Raspeig, Alicante, Spain
Estela Saquete: Department of Software and Computing Systems, University of Alicante, San Vicente del Raspeig, Alicante, Spain

Journal volume & issue: Vol. 12
pp. 19651 – 19664

Abstract

Annotated corpora are indispensable for training computational models in Artificial Intelligence and Natural Language Processing. However, manual annotation is costly, arduous, and time-consuming, especially when the annotation is semantically complex. To address this problem, this work applies a methodology for the semi-automatic annotation of datasets based on the Human-in-the-Loop paradigm. The methodology supports the construction of a resource with fine-grained annotation to aid in the detection of violent messages in Spanish sourced from social media (Twitter/X). Applying the proposed methodology for semi-automatic violence annotation produced a high-quality resource, hereafter referred to as VILLANOS. The methodology annotates the dataset incrementally, which increases annotator efficiency and thereby validates the suitability of the proposal. Annotation time was reduced by 52% compared to manual annotation, and a model trained on the VILLANOS dataset obtains an $F_{1}$ of 85.2%. These results demonstrate the efficiency and effectiveness of the methodology, evidencing its validity.
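The abstract does not spell out the annotation procedure, so the following is only a generic sketch of what incremental Human-in-the-Loop annotation can look like, not the authors' pipeline. Everything in it is an assumption made for illustration: the keyword-based toy model, the function names (`train`, `predict`, `annotate_incrementally`, `f1_score`), the English placeholder messages, and the `oracle` dictionary that stands in for a human annotator. The idea it shows is that a model trained on the labels collected so far pre-labels each new batch, a human only confirms or corrects the suggestions, and the corrected labels feed the next round. It also shows how an $F_{1}$ score is computed from gold and predicted labels.

```python
from collections import Counter


def train(labeled):
    """Toy model (assumption): words seen more often in violent than in other messages."""
    violent, other = Counter(), Counter()
    for text, label in labeled:
        (violent if label else other).update(text.lower().split())
    return {word for word, count in violent.items() if count > other[word]}


def predict(model, text):
    """Suggest 'violent' if the message contains any word the toy model has learned."""
    return any(word in model for word in text.lower().split())


def annotate_incrementally(batches, seed, review):
    """Pre-label each batch with a model retrained on all labels gathered so far.

    `review(text, suggestion)` stands in for the human annotator and returns
    the final label. Returns the labeled data and the corrections per batch.
    """
    labeled = list(seed)
    corrections = []
    for batch in batches:
        model = train(labeled)  # retrain once before each batch
        fixed = 0
        for text in batch:
            suggestion = predict(model, text)
            label = review(text, suggestion)
            fixed += label != suggestion
            labeled.append((text, label))
        corrections.append(fixed)
    return labeled, corrections


def f1_score(gold, pred):
    """F1 for the positive class: 2*TP / (2*TP + FP + FN)."""
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# Hypothetical toy data; the oracle dictionary simulates the human reviewer.
seed = [("i will hurt you", True), ("have a nice day", False)]
batches = [
    ["you are nice", "they will hurt him"],
    ["you look nice today", "hurt them"],
]
oracle = {
    "you are nice": False,
    "they will hurt him": True,
    "you look nice today": False,
    "hurt them": True,
}

labeled, corrections = annotate_incrementally(
    batches, seed, review=lambda text, suggestion: oracle[text]
)
f1 = f1_score([True, False, True, True], [True, True, True, False])
```

In this toy run the simulated annotator has to correct one suggestion in the first batch and none in the second, which is the kind of falling correction effort an incremental scheme aims for. A real setup would replace the keyword model with a trained classifier and the oracle with actual annotators.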

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
