Genre Classification of Books on Spanish

Juan Arturo Nolazco-Flores; Ana Veronica Guerrero-Galvan; Carolina Del-Valle-Soto; Leibny Paola Garcia-Perera

doi:10.1109/ACCESS.2023.3332997

IEEE Access (Jan 2023)

Affiliations

Juan Arturo Nolazco-Flores: School of Engineering and Science, Tecnologico de Monterrey, Monterrey, Nuevo León, Mexico
Ana Veronica Guerrero-Galvan: School of Engineering and Science, Tecnologico de Monterrey, Monterrey, Nuevo León, Mexico
Carolina Del-Valle-Soto: Facultad de Ingeniería, Universidad Panamericana, Zapopan, Jalisco, Mexico
Leibny Paola Garcia-Perera: Computer Science Department, Johns Hopkins University, Baltimore, MD, USA

DOI: https://doi.org/10.1109/ACCESS.2023.3332997
Journal volume & issue: Vol. 11
pp. 132878 – 132892

Abstract

Genre categorization of published titles is a common practice in publishing houses, libraries, and bookstores, as well as a fundamental element of editorial marketing. However, assigning subject codes to each title is an arduous task for both publishers and data aggregators. Automatic genre categorization is difficult because some publishers use more than 200 categories. Moreover, even though these publishers base their categorization on standards, they often alter the names used in these standards because they consider them too technical. In this paper, we propose a Thema-based categorization as a tool to facilitate editors' work by advancing the categorization process, allowing them to focus on finer category granularity. This categorization has four key features. First, it clusters the most important categories for Latin American publishers. Second, it stops grouping when the number of thematic categories remains practical for the purposes of the publishing business. Third, we assign names to these categories that resonate with Latin American stakeholders. Finally, the number of categories is optimized to provide reasonable classification performance. We worked with the descriptions of Spanish-language books from two publishers and mapped them to the proposed categorization. This allowed us to create a database for training a model to automate categorization. After conducting our analysis, we determined that 26 thematic categories were an appropriate number that fulfilled the three features mentioned earlier.
However, we recognized that classifying into 26 categories was still a complex task. To overcome this challenge, we augmented the data by back-translating it into Spanish using the composed translation $T_{l}^{S}\left(T_{S}^{l}\left(S\right)\right)$, where $T_{S}^{l}\left(S\right)$ is the translation function from Spanish, $S$, to language $l$; $T_{l}^{S}\left(l\right)$ is the translation function from language $l$ to Spanish, $S$; and $T_{S}^{l}\left(S\right)$ and $T_{l}^{S}\left(l\right)$ are non-invertible functions. Using 5-fold cross-validation, the F1-scores were approximately 57%, 57%, 63.38%, and 65.26% for the Support Vector Machine (SVM), Logistic Regression (LR), BERT, and RoBERTa models, respectively. We used the F1-score metric because our categories were not perfectly balanced. The results achieved by RoBERTa outperform those reported in the literature. Furthermore, these results are built upon the foundation of the Thema standard for categorizing book genres. Additionally, the categories have been specifically designed to align with the preferences and needs of Latin American publishers.
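
The back-translation step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the word-level lookup tables, the `toy_translator`, `back_translate`, and `augment` helpers, and the "Fiction" label are all hypothetical stand-ins for a real machine-translation system and the paper's dataset. The sketch only shows how composing two non-invertible translation functions can yield a paraphrase that keeps the original label.

```python
from typing import Callable, Dict, List, Tuple

Translator = Callable[[str], str]

# Hypothetical toy dictionaries standing in for real translation models.
# Two Spanish words map to the same English word, so the round trip
# cannot always recover the original text (the functions are not invertible).
ES_TO_EN = {"un": "a", "relato": "story", "cuento": "story", "breve": "short"}
EN_TO_ES = {"a": "un", "story": "cuento", "short": "breve"}


def toy_translator(table: Dict[str, str]) -> Translator:
    """Build a word-by-word translator from a lookup table (illustration only)."""
    return lambda text: " ".join(table.get(word, word) for word in text.split())


def back_translate(text: str, to_pivot: Translator, to_spanish: Translator) -> str:
    """Translate Spanish text to a pivot language and back to Spanish."""
    return to_spanish(to_pivot(text))


def augment(
    samples: List[Tuple[str, str]],
    pivots: Dict[str, Tuple[Translator, Translator]],
) -> List[Tuple[str, str]]:
    """Return the original samples plus any new back-translated paraphrases."""
    augmented = list(samples)
    seen = {text for text, _ in samples}
    for text, label in samples:
        for to_pivot, to_spanish in pivots.values():
            paraphrase = back_translate(text, to_pivot, to_spanish)
            if paraphrase not in seen:  # skip round trips that add nothing new
                seen.add(paraphrase)
                augmented.append((paraphrase, label))  # the label is preserved
    return augmented


es_to_en = toy_translator(ES_TO_EN)
en_to_es = toy_translator(EN_TO_ES)

samples = [("un relato breve", "Fiction")]  # hypothetical description and label
augmented = augment(samples, {"en": (es_to_en, en_to_es)})
print(augmented)
```

With these toy tables, "un relato breve" comes back as "un cuento breve", so the training set gains one additional example with the same label. In practice, each pivot language would be served by a real translation model, and several pivot languages could be used to obtain several paraphrases per description.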

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
