Results in Engineering (Mar 2023)
TopicStriKer: A topic kernels-powered approach for text classification
Abstract
Topic models are unsupervised machine learning techniques that output clusters of “topics”, each represented as a set of co-occurring words with an associated probability distribution. Topic modeling algorithms uncover latent themes in large document collections by capturing the context in which words occur. String kernels, in contrast, are supervised machine-learning techniques that quantify the similarity between strings without explicitly encoding them. We propose TopicStriKer, a model that combines the advantages of unsupervised topic modeling with those of supervised string kernels for text classification tasks. The co-occurring words per topic and the topic proportions per document obtained from the topic model are used to reduce each document in the corpus to a topic-word sequence. This reduced representation is then classified with the aid of string kernels, significantly improving accuracy and reducing training time. In our experiments, the proposed algorithm applied to bag-of-words kernel-based string embeddings outperforms traditional text classification approaches. This work extensively compares string kernels combined with topic modeling across various performance metrics to support these findings.
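The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the reduction rule, so the sketch assumes that a document keeps only the tokens belonging to the top words of its highest-proportion topics. The names `reduce_document`, `bow_kernel`, `normalized_bow_kernel`, and `top_k_topics` are hypothetical, and the topic words and topic proportions are toy values standing in for the output of a real topic model.

```python
from collections import Counter
from math import sqrt


def reduce_document(tokens, doc_topic_props, topic_words, top_k_topics=1):
    """Reduce a document to a topic-word sequence.

    Assumption (not from the paper): keep only tokens that appear among
    the top words of the document's `top_k_topics` dominant topics.
    """
    # Rank topic indices by their proportion in this document, highest first.
    ranked = sorted(
        range(len(doc_topic_props)),
        key=lambda t: doc_topic_props[t],
        reverse=True,
    )
    keep = set()
    for t in ranked[:top_k_topics]:
        keep.update(topic_words[t])
    # Preserve the original token order so the result is still a sequence.
    return [tok for tok in tokens if tok in keep]


def bow_kernel(seq_a, seq_b):
    """Bag-of-words kernel: inner product of the two word-count vectors."""
    counts_a, counts_b = Counter(seq_a), Counter(seq_b)
    shared = counts_a.keys() & counts_b.keys()
    return sum(counts_a[w] * counts_b[w] for w in shared)


def normalized_bow_kernel(seq_a, seq_b):
    """Cosine-normalized bag-of-words kernel, in [0, 1]."""
    k_ab = bow_kernel(seq_a, seq_b)
    if k_ab == 0:
        return 0.0
    return k_ab / sqrt(bow_kernel(seq_a, seq_a) * bow_kernel(seq_b, seq_b))


# Toy topic-model output (hypothetical values, for illustration only).
topic_words = [
    ["match", "team", "goal"],      # topic 0: sports
    ["market", "stock", "price"],   # topic 1: finance
]
docs = [
    "the team scored a late goal in the match".split(),
    "the stock price fell as the market closed".split(),
    "a goal for the home team".split(),
]
doc_topic_props = [[0.9, 0.1], [0.2, 0.8], [0.85, 0.15]]

# Step 1: reduce every document to its topic-word sequence.
reduced = [
    reduce_document(doc, props, topic_words)
    for doc, props in zip(docs, doc_topic_props)
]

# Step 2: build the kernel (Gram) matrix over the reduced documents.
gram = [[bow_kernel(a, b) for b in reduced] for a in reduced]
```

A Gram matrix of this kind could then be passed to any kernel-based classifier that accepts precomputed kernels, for example a support vector machine. Because the kernel is evaluated on the shorter topic-word sequences rather than on full documents, each kernel evaluation touches fewer tokens, which is consistent with the reduction in training time that the abstract reports.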