Journal of King Saud University: Computer and Information Sciences (Jan 2024)

A deep multiple-instance text binary classification for topic relevant content extraction on social media

  • Juan Yin,
  • Xiaoyang Liu,
  • Zhewen Yang

Journal volume & issue
Vol. 36, no. 1
p. 101883

Abstract


Social media platforms hold rich text data that can be used for data mining and analysis. However, natural language evolves rapidly on social media, and social media data is very noisy, which poses a great challenge to the accuracy of data analysis. To overcome this problem, we propose topic-relevant content extraction (TRCE), a method based on deep multiple-instance classification. It leverages existing information and the hierarchical relationships among texts under a thread on social media as weak supervision to extract strongly topic-relevant data and filter out noise accurately, without manually labeled data. The proposed method introduces latent variables, the Bernoulli distribution, and variational inference into multiple-instance learning (MIL) to generate pseudo labels. We then employ a dual-stream neural network with a three-stage training process to train MIL end to end. Experimental results show that TRCE improves significantly on other MIL methods, while its accuracy and F1 score are only slightly lower than those of supervised text classification. Given that TRCE needs no manually labeled data at all, whereas supervised classification relies heavily on labeled data, TRCE is a competitive method for extracting topic-relevant data and filtering out noise on social media.
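
As a rough illustration of the multiple-instance setting the abstract refers to, the sketch below treats a thread as a bag of texts, gives each text a Bernoulli relevance probability, and combines them with the standard noisy-OR rule. This is a generic MIL formulation, not the TRCE model itself: the function names, the 0.5 threshold, and the example probabilities are all assumptions made for illustration, and the variational inference and dual-stream network described in the abstract are not reproduced here.

```python
def bag_probability(instance_probs):
    """Noisy-OR aggregation: probability that at least one instance
    in the bag is positive, given independent Bernoulli probabilities."""
    prob_all_negative = 1.0
    for p in instance_probs:
        prob_all_negative *= (1.0 - p)
    return 1.0 - prob_all_negative


def pseudo_labels(instance_probs, threshold=0.5):
    """Turn per-instance probabilities into hard 0/1 pseudo labels.
    The threshold is a hypothetical choice, not taken from the paper."""
    return [1 if p >= threshold else 0 for p in instance_probs]


# Hypothetical relevance probabilities for three replies under one thread.
thread = [0.9, 0.2, 0.6]
print(bag_probability(thread))
print(pseudo_labels(thread))
```

In this simplified view only the thread-level signal is assumed to be known, and per-text labels are inferred rather than annotated, which is the sense in which the supervision is weak.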

Keywords