Dataset for multimodal fake news detection and verification tasks

Alessandro Bondielli; Pietro Dell'Oglio; Alessandro Lenci; Francesco Marcelloni; Lucia Passaro

Data in Brief (Jun 2024)

Affiliations

Alessandro Bondielli: Department of Computer Science, University of Pisa, Largo Bruno Pontecorvo, 3, 56127, Pisa, Italy
Pietro Dell'Oglio: Department of Information Engineering, University of Pisa, Largo Lucio Lazzarino, 1, 56122, Pisa, Italy
Alessandro Lenci: Department of Philology, Literature and Linguistics, University of Pisa, Via S. Maria 36, 56127, Pisa, Italy
Francesco Marcelloni: Department of Information Engineering, University of Pisa, Largo Lucio Lazzarino, 1, 56122, Pisa, Italy; Corresponding author.
Lucia Passaro: Department of Computer Science, University of Pisa, Largo Bruno Pontecorvo, 3, 56127, Pisa, Italy

Journal volume & issue: Vol. 54, p. 110440

Abstract

The proliferation of online disinformation and fake news, particularly in the context of breaking news events, demands the development of effective detection mechanisms. While text remains the predominant medium for disseminating misleading information, other modalities play a growing role in online outlets and social media platforms. However, multimodal datasets that combine modalities such as text and images are still uncommon, especially for low-resource languages. This study addresses this gap by releasing a dataset tailored for multimodal fake news detection in the Italian language.

This dataset was originally employed in a shared task on the Italian language. It is divided into two data subsets, each corresponding to a distinct sub-task. Sub-task 1 aims to assess the effectiveness of multimodal fake news detection systems. Sub-task 2 examines the interplay between text and images, specifically how these modalities mutually influence the interpretation of content when distinguishing between fake and real news. Both sub-tasks were framed as classification problems.

The dataset consists of social media posts and news articles. After collection, the data were labeled via crowdsourcing. Annotators were provided with external knowledge about the topic of the news to be labeled, enhancing their ability to discriminate between fake and real news. The data subsets for sub-task 1 and sub-task 2 consist of 913 and 1350 items, respectively, encompassing newspaper articles and tweets.

Published in Data in Brief

ISSN: 2352-3409 (Online)
Publisher: Elsevier
Country of publisher: United States
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Science (General)
Website: http://www.journals.elsevier.com/data-in-brief/
