Results in Engineering (Mar 2023)

TopicStriKer: A topic kernels-powered approach for text classification

  • Nikhil V. Chandran,
  • V.S. Anoop,
  • S. Asharaf

Journal volume & issue
Vol. 17
p. 100949

Abstract

Topic models are unsupervised machine learning techniques that output clusters of “topics”, each represented as a set of co-occurring words with associated probability distributions. Topic modeling algorithms uncover latent themes in large document collections by exploiting the context in which words occur. String kernels, in contrast, are used in supervised machine learning to quantify the similarity between strings without an explicit string encoding. We propose TopicStriKer, a model that combines the advantages of unsupervised topic modeling with supervised string kernels for text classification. The co-occurring words per topic and the topic proportions per document are used to reduce each document in the corpus to a topic-word sequence. This reduced representation is then classified with the aid of string kernels, which significantly improves accuracy and reduces training time. In our experiments, bag-of-words kernel-based string embeddings built with the proposed algorithm outperform traditional text classification approaches. This work extensively compares string kernels combined with topic modeling across various performance metrics to establish these findings.
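
The reduction step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how topic words are selected, so the helper names (`reduce_document`, `bow_kernel`), the `top_k` parameter, and the hard-coded topic words and topic proportions are all assumptions made for illustration. In practice the topic words and proportions would come from a fitted topic model.

```python
from collections import Counter

def reduce_document(tokens, topic_words, doc_topic_props, top_k=1):
    """Reduce a tokenized document to a topic-word sequence.

    Assumption: a token is kept only if it is a top word of one of the
    document's top_k highest-proportion topics.
    """
    # Rank topic indices by their proportion in this document, highest first.
    ranked = sorted(range(len(doc_topic_props)),
                    key=lambda t: doc_topic_props[t], reverse=True)
    keep = set()
    for t in ranked[:top_k]:
        keep.update(topic_words[t])
    # Preserve the original word order of the document.
    return [tok for tok in tokens if tok in keep]

def bow_kernel(seq_a, seq_b):
    """Bag-of-words kernel: dot product of the two word-count vectors."""
    counts_a, counts_b = Counter(seq_a), Counter(seq_b)
    return sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())

# Hypothetical topic-model output: top words per topic.
topic_words = [["goal", "match", "team"], ["stock", "market", "bank"]]

# Toy documents with hypothetical per-document topic proportions.
docs = [
    ("the team won the match with a late goal".split(), [0.9, 0.1]),
    ("the bank said the stock market fell".split(), [0.2, 0.8]),
    ("the team lost the match".split(), [0.8, 0.2]),
]

reduced = [reduce_document(tokens, topic_words, props) for tokens, props in docs]

# Gram matrix of pairwise kernel values over the reduced documents.
gram = [[bow_kernel(a, b) for b in reduced] for a in reduced]
```

A Gram matrix of this kind could then be passed to any classifier that accepts a precomputed kernel, for example a support vector machine. Because each document is shortened to its topic words before the kernel is evaluated, fewer tokens enter each kernel computation, which is consistent with the reduced training time reported above.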

Keywords