A Hybrid Model with New Word Weighting for Fast Filtering Spam Short Texts

Tian Xia; Xuemin Chen; Jiacun Wang; Feng Qiu

doi:10.3390/s23218975

Sensors (Nov 2023)

Affiliations

Tian Xia: School of Computer and Information Engineering, Shanghai Polytechnic University, Shanghai 201209, China
Xuemin Chen: Department of Engineering, Texas Southern University, Houston, TX 77004, USA
Jiacun Wang: Department of Computer Science and Software Engineering, Monmouth University, West Long Branch, NJ 07764, USA
Feng Qiu: Institute of Artificial Intelligence on Education, Shanghai Normal University, Shanghai 200234, China

DOI: https://doi.org/10.3390/s23218975
Journal volume & issue: Vol. 23, no. 21, p. 8975

Abstract

Short message services (SMS), microblogging tools, instant messaging apps, and commercial websites produce numerous short text messages every day. These short text messages are usually guaranteed to reach a mass audience at low cost. Spammers take advantage of short texts by sending malicious or unwanted messages in bulk. Short texts are difficult to classify because of their shortness, sparsity, rapidness, and informal writing. Our previous study demonstrated the effectiveness of the hidden Markov model (HMM) for short text classification. However, the HMM has limited capability to handle new words, most of which arise from informal writing. In this paper, a hybrid model is proposed that addresses the informal writing issue by weighting new words, enabling fast short text filtering with high accuracy. The hybrid model consists of an artificial neural network (ANN) and an HMM, which are used for new word weighting and spam filtering, respectively. The weight of a new word is calculated from the weights of its neighbor, along with the spam and ham (i.e., not spam) probabilities that the ANN predicts for the short text message. Performance evaluations are conducted on benchmark datasets, including the SMS message data maintained by the University of California, Irvine, movie reviews, and customer reviews. The hybrid model operates at a significantly higher speed than deep learning models. The experimental results show that the proposed hybrid model outperforms other prominent machine learning algorithms, achieving a good balance between filtering throughput and accuracy.
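The abstract does not give the formula used to weight a new word, so the following is only a minimal sketch of the idea under stated assumptions, not the authors' method. It assumes word weights lie in [-1, 1] (positive meaning spam-leaning), that the ANN output is summarized as `p_spam - p_ham`, and that the two sources are combined by a linear blend controlled by a hypothetical parameter `alpha`. The function name, the left-then-right neighbor lookup, and the blend are all illustrative choices.

```python
from typing import Dict, List


def weight_new_words(
    tokens: List[str],
    known_weights: Dict[str, float],
    p_spam: float,
    p_ham: float,
    alpha: float = 0.5,  # hypothetical blend factor, not from the paper
) -> List[float]:
    """Return one weight per token.

    Known words keep their stored weight. A new (unseen) word is weighted
    from an adjacent known word and the message-level ANN probabilities.
    """
    # Assumption: the ANN's opinion of the whole message is p_spam - p_ham.
    ann_score = p_spam - p_ham
    weights: List[float] = []
    for i, tok in enumerate(tokens):
        if tok in known_weights:
            weights.append(known_weights[tok])
            continue
        # Look for a known neighbor: left first, then right (assumption).
        neighbor = None
        for j in (i - 1, i + 1):
            if 0 <= j < len(tokens) and tokens[j] in known_weights:
                neighbor = known_weights[tokens[j]]
                break
        if neighbor is None:
            # No known neighbor: fall back to the ANN score alone.
            weights.append(ann_score)
        else:
            weights.append(alpha * neighbor + (1 - alpha) * ann_score)
    return weights


if __name__ == "__main__":
    msg = ["free", "przzze", "now"]          # "przzze" is an informal new word
    vocab = {"free": 1.0, "now": 0.25}       # made-up weights for illustration
    print(weight_new_words(msg, vocab, p_spam=0.75, p_ham=0.25))
```

In the hybrid model described above, weights produced this way would then feed the HMM that performs the actual spam filtering; that stage is not sketched here.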

Published in Sensors

ISSN: 1424-8220 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Chemical technology
Website: http://www.mdpi.com/journal/sensors
