A Comparative Analysis of Word Embeddings Techniques for Italian News Categorization

Federica Rollo; Giovanni Bonisoli; Laura Po

doi:10.1109/ACCESS.2024.3367246

IEEE Access (Jan 2024)

Affiliations

Federica Rollo: Department of Engineering “Enzo Ferrari”, University of Modena and Reggio Emilia, Modena, Italy
Giovanni Bonisoli: Department of Engineering “Enzo Ferrari”, University of Modena and Reggio Emilia, Modena, Italy
Laura Po: Department of Engineering “Enzo Ferrari”, University of Modena and Reggio Emilia, Modena, Italy

DOI: https://doi.org/10.1109/ACCESS.2024.3367246
Journal volume & issue: Vol. 12
pp. 25536 – 25552

Abstract

Text categorization remains a challenging task in information retrieval, especially for low-resource languages such as Italian. This paper addresses the categorization of Italian news articles and the complexities arising from the language’s structure and writing style. The methodology consists of preprocessing the text, generating word embeddings, applying feature engineering to extract meaningful representations, and training a classifier on the resulting document vectors. Model performance is evaluated on a partitioned dataset, with a training set used to fit the model and a test set used to assess its efficacy on unseen data. We assessed fifteen classifiers for the categorization of Italian news articles, examining eight word embedding models and three approaches for combining word embeddings into document vectors. We compared established models such as Word2Vec and FastText with six novel Italian models pre-trained on native datasets. A significant highlight of our work is the introduction of an Italian GloVe model, which was previously unavailable for the Italian language. The models were tested on two datasets: DICE, a dataset of 10,395 crime news articles extracted from an Italian newspaper, and RCV2-it, a collection of 28,405 Italian news stories released by the multinational media company Reuters Ltd. The best F-scores achieved in the tests were 84% and 93%. The results underscore the efficacy of the Support Vector Classification algorithm and reveal that Gaussian Naive Bayes, Bernoulli Naive Bayes, and Decision Tree models are ineffective for text categorization. The comparison of the word embedding models showed that Word2Vec and GloVe outperform FastText.
The broader impact of this paper lies not only in advancing text categorization methodologies for Italian documents but also in enriching the linguistic landscape by releasing six novel Italian word embedding models.
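
As an illustration of the step in which word embeddings are combined into document vectors, the following is a minimal Python sketch. It is not the authors' implementation: the abstract does not name the three combination approaches, so the mean, sum, and element-wise maximum used here are assumptions chosen because they are common options. The embedding values, the example tokens, and the function name document_vector are hypothetical.

```python
from typing import Dict, List

# Toy 3-dimensional embeddings. The values are hypothetical and serve
# only to illustrate the mechanics; a real model would be loaded from disk.
EMBEDDINGS: Dict[str, List[float]] = {
    "furto": [1.0, 0.0, 2.0],
    "banca": [3.0, 4.0, 0.0],
    "mercato": [0.0, 2.0, 2.0],
}


def document_vector(
    tokens: List[str],
    embeddings: Dict[str, List[float]],
    how: str = "mean",
) -> List[float]:
    """Combine the embeddings of the known tokens into one document vector.

    The strategies "mean", "sum" and "max" are assumed for illustration;
    tokens without an embedding are skipped.
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        # No known token: fall back to a zero vector of the right size.
        dim = len(next(iter(embeddings.values())))
        return [0.0] * dim

    # Transpose so that each tuple holds one dimension across all words.
    columns = list(zip(*vectors))
    if how == "mean":
        return [sum(col) / len(vectors) for col in columns]
    if how == "sum":
        return [sum(col) for col in columns]
    if how == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown combination strategy: {how}")


if __name__ == "__main__":
    doc = ["furto", "banca", "sconosciuto"]  # the last token has no embedding
    for strategy in ("mean", "sum", "max"):
        print(strategy, document_vector(doc, EMBEDDINGS, strategy))
```

The resulting fixed-length vectors are the kind of input a classifier such as Support Vector Classification would be trained on. Skipping out-of-vocabulary tokens is one possible design choice among several.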

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
