IEEE Access (Jan 2023)

Multisubject Analysis and Classification of Books and Book Collections, Based on a Subject Term Vocabulary and the Latent Dirichlet Allocation

  • Nikolaos Makris
  • Nikolaos Mitrou

DOI
https://doi.org/10.1109/ACCESS.2023.3326722
Journal volume & issue
Vol. 11
pp. 120881 – 120898

Abstract


In this paper, a new method for automatically analyzing and classifying books and book collections according to the subjects they cover is presented. It is based on a combination of the LDA method for discovering latent topics in the collection, on the one hand, and the description of subjects by means of a subject term vocabulary, on the other. Books, topics and subjects are all modelled as bags-of-words, with specific distributions over the underlying word vocabulary. The Table of Contents (ToC) was used to describe each book, instead of its entire body, while subject (or standard) documents are produced from a subject term hierarchy of the respective disciplines. Frequency-of-terms in the documents and word-generative probabilistic models (such as the ones postulated by LDA) were integrated into a consistent statistical framework. Using Bayesian statistics and simple marginalization equations, we were able to transform the expressions of the books from distributions over unlabeled topics (derived by LDA) to distributions over labeled subjects representing the respective disciplines (Physical Sciences, Health Sciences, Mathematics, etc.).

More specifically, the necessary theoretical basis is first established, with each subject formally defined by the respective branch of a subject term hierarchy (much like a ToC) or by the respective bag of words (single words and biwords) produced by flattening the hierarchy branch; flattening is realized by taking all the terms of the nodes and leaves of the branch, with repetitions allowed. Being confined within a closed set of subjects, we are able to invert the frequency-of-terms in each subject [also interpreted as the probability of generating a term ($w_n$) when sampling the subject ($\mathbf{s}_i$) and denoted by $\text{Pr}\left\{w_n \mid \mathbf{s}_i\right\}$] and express each term as a weighted mixture (or probability distribution) of subjects, denoted by $\text{Pr}\left\{\mathbf{s}_i \mid w_n\right\}$. This is the key idea of the proposed method. Then, any document ($\mathbf{d}_m$) can be expressed as a weighted mixture of subjects (or the respective distribution, denoted by $\text{Pr}\left\{\mathbf{s}_i \mid \mathbf{d}_m\right\}$) by simply summing up the distributions of the individual terms contained in the document. This is made possible by virtue of simple formulas that have been formally proven for the union of documents ($\text{Pr}\left\{\mathbf{s}_i \mid (\mathbf{d}_1 \cup \mathbf{d}_2)\right\}$) and for the union of subjects ($\text{Pr}\left\{(\mathbf{s}_i \cup \mathbf{s}_j) \mid \mathbf{d}\right\}$).

Since not all vocabulary terms are found in a particular set of books and, conversely, not all corpus words are included in the subject vocabulary, two important measures come to the foreground and are calculated with the proposed formulation: the coverage of a book or a corpus by the subject term vocabulary and, conversely, the coverage of the vocabulary by a set of books. These measures are useful for updating/enriching the subject term vocabulary whenever documents with new subjects are included in the corpus under analysis. Following the theoretical formulation, the derived results are combined with the LDA in order to further facilitate our multisubject analysis task: using the subject term vocabulary, LDA is applied to the corpus under study and expresses each book ($\mathbf{b}_m$) as a probability distribution over hidden topics (denoted by $\text{Pr}\left\{\mathbf{t}_k \mid \mathbf{b}_m\right\}$).
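To make the inversion and mixing steps above concrete, the following is a minimal NumPy sketch (variable names and the toy frequency matrix are illustrative, not taken from the paper; a uniform subject prior is assumed, which the abstract does not specify). It normalizes a term-by-subject frequency matrix into $\text{Pr}\left\{w_n \mid \mathbf{s}_i\right\}$, inverts it with Bayes' rule into $\text{Pr}\left\{\mathbf{s}_i \mid w_n\right\}$, and averages these distributions over a document's bag of terms to obtain $\text{Pr}\left\{\mathbf{s}_i \mid \mathbf{d}_m\right\}$:

    import numpy as np

    # Hypothetical input (shape: N terms x S subjects), not from the paper:
    # term_freq_per_subject[n, i] = frequency of term w_n in the flattened subject document s_i
    term_freq_per_subject = np.array([[4., 0., 1.],
                                      [1., 3., 0.],
                                      [0., 2., 5.]])

    # Pr{w_n | s_i}: normalize each subject column into a distribution over terms
    p_w_given_s = term_freq_per_subject / term_freq_per_subject.sum(axis=0, keepdims=True)

    # Bayes inversion with a uniform subject prior (an assumption made for this sketch):
    # Pr{s_i | w_n} proportional to Pr{w_n | s_i} * Pr{s_i}
    joint = p_w_given_s * (1.0 / p_w_given_s.shape[1])
    p_s_given_w = joint / joint.sum(axis=1, keepdims=True)   # each row sums to 1

    # A document as a bag of term indices (e.g. extracted from its ToC), repetitions allowed
    doc_terms = [0, 0, 2, 1]

    # Pr{s_i | d_m}: average the subject distributions of the document's terms,
    # i.e. the frequency-weighted sum over the terms of the bag
    p_s_given_doc = p_s_given_w[doc_terms].mean(axis=0)
    print(p_s_given_doc, p_s_given_doc.sum())                # a distribution over subjects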
In the same framework, each topic ($\mathbf{t}_k$) is expressed as a probability distribution over words ($\text{Pr}\left\{w_n \mid \mathbf{t}_k\right\}$). Having estimated each word's probability distribution over subjects ($\text{Pr}\left\{\mathbf{s}_i \mid w_n\right\}$), we can express each discovered topic as a weighted mixture of subjects [$\text{Pr}\left\{\mathbf{s}_i \mid \mathbf{t}_k\right\}=\sum_n \text{Pr}\left\{w_n \mid \mathbf{t}_k\right\} \text{Pr}\left\{\mathbf{s}_i \mid w_n\right\}$] and, using that, express each book in the same manner [$\text{Pr}\left\{\mathbf{s}_i \mid \mathbf{b}_m\right\}=\sum_k \text{Pr}\left\{\mathbf{t}_k \mid \mathbf{b}_m\right\} \text{Pr}\left\{\mathbf{s}_i \mid \mathbf{t}_k\right\}$]. This provides a clear and formal path to the desired result. The proposed methodology was applied to a Springer e-book collection of more than 50,000 books, while a subject term hierarchy developed by KALLIPOS, a project creating open-access e-books, was used for the proof of concept. A number of experiments were conducted to showcase the validity and usefulness of the proposed approach.
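Computationally, the two marginalization sums above reduce to plain matrix products. A minimal sketch, assuming the LDA estimates $\text{Pr}\left\{w_n \mid \mathbf{t}_k\right\}$ and $\text{Pr}\left\{\mathbf{t}_k \mid \mathbf{b}_m\right\}$ and the word-to-subject distributions $\text{Pr}\left\{\mathbf{s}_i \mid w_n\right\}$ are already available as arrays (random placeholders stand in for real estimates; names and sizes are illustrative):

    import numpy as np

    # Hypothetical sizes: K=10 topics, N=50 vocabulary words, M=5 books, S=7 subjects
    p_w_given_t = np.random.dirichlet(np.ones(50), size=10)   # (K, N): Pr{w_n | t_k}
    p_t_given_b = np.random.dirichlet(np.ones(10), size=5)    # (M, K): Pr{t_k | b_m}
    p_s_given_w = np.random.dirichlet(np.ones(7), size=50)    # (N, S): Pr{s_i | w_n}

    # Pr{s_i | t_k} = sum_n Pr{w_n | t_k} * Pr{s_i | w_n}
    p_s_given_t = p_w_given_t @ p_s_given_w                   # (K, S), rows sum to 1

    # Pr{s_i | b_m} = sum_k Pr{t_k | b_m} * Pr{s_i | t_k}
    p_s_given_b = p_t_given_b @ p_s_given_t                   # (M, S), rows sum to 1

    print(p_s_given_b.sum(axis=1))                            # each book yields a distribution over subjects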

Keywords