A Survey on Text Classification Algorithms: From Text to Predictions

Andrea Gasparetto; Matteo Marcuzzo; Alessandro Zangari; Andrea Albarelli

doi:10.3390/info13020083

Information (Feb 2022)

Affiliations

Andrea Gasparetto: Department of Management, Ca’ Foscari University, 30123 Venice, Italy
Matteo Marcuzzo: Department of Management, Ca’ Foscari University, 30123 Venice, Italy
Alessandro Zangari: Department of Management, Ca’ Foscari University, 30123 Venice, Italy
Andrea Albarelli: Department of Environmental Sciences, Informatics and Statistics, Ca’ Foscari University, 30123 Venice, Italy

Journal volume & issue: Vol. 13, no. 2, p. 83

Abstract

In recent years, the exponential growth of digital documents has been met by rapid progress in text classification techniques. Newly proposed machine learning algorithms leverage the latest advancements in deep learning methods, allowing for the automatic extraction of expressive features. The swift development of these methods has led to a plethora of strategies to encode natural language into machine-interpretable data. The latest language modelling algorithms are used in conjunction with ad hoc preprocessing procedures, whose description is often omitted in favour of a more detailed explanation of the classification step. This paper offers a concise review of recent text classification models, with emphasis on the flow of data, from raw text to output labels. We highlight the differences between earlier methods and more recent, deep learning-based methods, both in their functioning and in how they transform input data. To give a better perspective on the text classification landscape, we provide an overview of datasets for the English language and supply instructions for the synthesis of two new multilabel datasets, a type of dataset we found to be particularly scarce in this setting. Finally, we provide an outline of new experimental results and discuss the open research challenges posed by deep learning-based language models.

Published in Information

ISSN: 2078-2489 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Technology (General): Industrial engineering. Management engineering: Information technology
Website: http://www.mdpi.com/journal/information/
