IEEE Access (Jan 2023)
Genre Classification of Books on Spanish
Abstract
Genre categorization of published titles is a common practice in publishing houses, libraries, and bookstores, as well as a fundamental element of editorial marketing. However, assigning subject codes to each title is an arduous task for both publishers and data aggregators. Automatic genre categorization is difficult because some publishers use more than 200 categories. Moreover, even though these publishers base their categorization on standards, they often alter the names used in these standards because they consider them too technical. In this paper, we propose a Thema-based categorization as a tool to facilitate editors' work by advancing the categorization process, allowing them to focus on finer category granularity. This categorization has four key features. First, it clusters the most important categories for Latin American publishers. Second, it stops grouping when the number of thematic categories remains practical for the purposes of the publishing business. Third, we assign names to these categories that resonate with Latin American stakeholders. Finally, the number of categories is optimized to provide reasonable classification performance. We worked with the descriptions of books in Spanish from two publishers and mapped them to the proposed categorization. This allowed us to create a database for training a model to automate categorization. After conducting our analysis, we determined that 26 thematic categories was an appropriate number that fulfilled the features mentioned earlier.
However, we recognized that classifying into 26 categories was still a complex task. To overcome this challenge, we augmented the data by back-translating it into Spanish using the composition $T_{l}^{S}\left(T_{S}^{l}\left(S\right)\right)$, where $T_{S}^{l}\left(S\right)$ is the translation function from Spanish, $S$, to language $l$; $T_{l}^{S}\left(l\right)$ is the translation function from language $l$ to Spanish, $S$; and $T_{S}^{l}$ and $T_{l}^{S}$ are non-invertible functions, so the round trip generally returns a paraphrase rather than the original text. Experimental results, obtained using 5-fold cross-validation, were F1-scores of approximately 57%, 57%, 63.38%, and 65.26% for the Support Vector Machine (SVM), Logistic Regression (LR), BERT, and RoBERTa models, respectively. We used the F1-score because our categories were not perfectly balanced. The results achieved by RoBERTa outperform those reported in the literature. Furthermore, these results build on the Thema standard for categorizing book genres, and the categories have been specifically designed to align with the preferences and needs of Latin American publishers.
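The back-translation step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the word-level lookup tables `ES_EN` and `EN_ES`, the helper `make_word_translator`, and the functions `back_translate` and `augment` are hypothetical stand-ins for a real machine translation system, chosen only to show how a non-invertible round trip can produce a new Spanish paraphrase that keeps the original label.

```python
from typing import Callable, Dict, List, Tuple

Translator = Callable[[str], str]


def make_word_translator(table: Dict[str, str]) -> Translator:
    """Build a toy word-by-word translator (assumption: stands in for real MT)."""
    def translate(text: str) -> str:
        # Unknown words are passed through unchanged.
        return " ".join(table.get(word, word) for word in text.split())
    return translate


# Toy tables: "relato" and "cuento" both map to "story", so the
# Spanish -> English mapping cannot be inverted exactly.
ES_EN = {"un": "a", "relato": "story", "cuento": "story", "corto": "short"}
EN_ES = {"a": "un", "story": "cuento", "short": "corto"}

es_to_en = make_word_translator(ES_EN)  # plays the role of T_S^l
en_to_es = make_word_translator(EN_ES)  # plays the role of T_l^S


def back_translate(text: str, to_pivot: Translator, to_spanish: Translator) -> str:
    """Compute T_l^S(T_S^l(text)): Spanish -> pivot language -> Spanish."""
    return to_spanish(to_pivot(text))


def augment(
    samples: List[Tuple[str, str]],
    pivots: List[Tuple[Translator, Translator]],
) -> List[Tuple[str, str]]:
    """Append back-translated paraphrases that differ from existing texts."""
    augmented = list(samples)
    seen = {text for text, _ in samples}
    for text, label in samples:
        for to_pivot, to_spanish in pivots:
            paraphrase = back_translate(text, to_pivot, to_spanish)
            if paraphrase not in seen:
                seen.add(paraphrase)
                augmented.append((paraphrase, label))  # the label is preserved
    return augmented


samples = [("un relato corto", "Ficción")]
augmented = augment(samples, [(es_to_en, en_to_es)])
print(augmented)
```

In this toy setting the single description "un relato corto" yields the additional training example "un cuento corto" with the same category. With a real translation system and several pivot languages, each description could contribute several paraphrases.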
Keywords