Offensive Text Span Detection in Romanian Comments Using Large Language Models

Andrei Paraschiv; Teodora Andreea Ion; Mihai Dascalu

doi:10.3390/info15010008

Information (Dec 2023)

Affiliations

Andrei Paraschiv: Computer Science and Engineering Department, National University of Science and Technology Politehnica of Bucharest, 313 Splaiul Independentei, 060042 Bucharest, Romania
Teodora Andreea Ion: Academy of Romanian Scientists, Str. Ilfov, Nr. 3, 050044 Bucharest, Romania
Mihai Dascalu: Computer Science and Engineering Department, National University of Science and Technology Politehnica of Bucharest, 313 Splaiul Independentei, 060042 Bucharest, Romania

Journal volume & issue: Vol. 15, no. 1, p. 8

Abstract

The advent of online platforms and services has revolutionized communication, enabling users to share opinions and ideas seamlessly. However, this convenience has also brought a surge in offensive and harmful language across communication media. In response, social platforms have turned to automated methods to identify offensive content. A critical research question concerns the role that specific text spans within a comment play in conveying its offensive character. This paper presents a comprehensive investigation into detecting offensive text spans in Romanian-language comments using Transformer encoders and Large Language Models (LLMs). We introduce an extensive dataset of 4800 Romanian comments annotated with offensive text spans. Moreover, we explore the impact of model size, architecture, and training data volume on the performance of offensive text span detection, providing insights for determining the optimal configuration. The results indicate that pre-trained BERT models are effective for this span-detection task and achieve superior performance. We further investigate the impact of different sample-retrieval strategies, based on vector text representations, for few-shot learning with LLMs. The analysis highlights important insights and trade-offs in leveraging LLMs for offensive-language-detection tasks.

Published in Information

ISSN: 2078-2489 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Technology (General): Industrial engineering. Management engineering: Information technology
Website: http://www.mdpi.com/journal/information/
