SMS Spam Classification–Simple Deep Learning Models With Higher Accuracy Using BUNOW And GloVe Word Embedding

Surajit Giri; Sayak Das; Sutirtha Bharati Das; Siddhartha Banerjee

doi:10.6180/jase.202310_26(10).0015

Journal of Applied Science and Engineering (Apr 2023)

Affiliations

Surajit Giri, Sayak Das: Department of Computer Science, Ramakrishna Mission Residential College, Narendrapur, West Bengal, India
Sutirtha Bharati Das: Department of Computer Science, Ramakrishna Mission Residential College, Narendrapur, West Bengal, India
Siddhartha Banerjee: Department of Computer Science, Ramakrishna Mission Residential College, Narendrapur, West Bengal, India

DOI: https://doi.org/10.6180/jase.202310_26(10).0015
Journal volume & issue: Vol. 26, no. 10
pp. 1501 – 1511

Abstract

Unwanted text messages are called spam SMSs. Machine learning models have been shown to categorize spam messages efficiently and with high accuracy. However, because proper spam filtering software is lacking and existing software misclassifies genuine SMS messages as spam, spam detection applications have not become popular. In this paper, we propose multiple deep neural network models to classify spam messages. Tiago’s Dataset is used for this research. Initially, a preprocessing step is applied to the messages in the data set, which involves lowercasing the text, tokenization, lemmatization, and removal of numbers, punctuation, and stop words. These preprocessed messages are fed into two different deep learning models with simple architectures for classification, namely a Convolutional Neural Network (CNN) and a hybrid of a Convolutional Neural Network with a Long Short-Term Memory (LSTM) network. To increase the accuracy of these two simple architectures, the BUNOW and GloVe word embedding techniques are incorporated into the deep learning models. BUNOW and GloVe are popular choices in sentiment analysis, but in this work, these two word embedding techniques are tried in the context of text classification to improve accuracy. The best accuracy of 98.44% is achieved by the CNN LSTM BUNOW model after 15 epochs on a 70%-30% train-test split. The proposed model can be used in many practical applications such as real-time SMS spam detection, email spam detection, sentiment analysis, and text categorization.
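
The preprocessing step named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the stop-word list is a tiny hypothetical sample, and `toy_lemmatize` is a crude stand-in for a real lemmatizer (it only strips a plural "s"), used here so the example runs with the standard library alone.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for this
# sketch); a real pipeline would use a much fuller list.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and",
              "you", "your", "for", "in", "on", "now"}


def toy_lemmatize(token):
    """Crude stand-in for lemmatization: strips a trailing plural 's' only."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(message):
    """Lowercase, drop numbers and punctuation, tokenize, lemmatize, drop stop words."""
    text = message.lower()                   # lowercasing
    text = re.sub(r"\d+", " ", text)         # remove numbers
    text = re.sub(r"[^a-z\s]", " ", text)    # remove punctuation and other symbols
    tokens = text.split()                    # whitespace tokenization
    tokens = [toy_lemmatize(t) for t in tokens]
    return [t for t in tokens if t not in STOP_WORDS]


if __name__ == "__main__":
    print(preprocess("WINNER!! You have won 2 free tickets, call 0800 now."))
```

The resulting token lists would then be mapped to word embeddings before being passed to a classifier; that stage is not shown here because the abstract does not give its details.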

Published in Journal of Applied Science and Engineering

ISSN: 2708-9967 (Print); 2708-9975 (Online)
Publisher: Tamkang University Press
Country of publisher: Taiwan, Province of China
LCC subjects: Technology: Engineering (General). Civil engineering (General); Technology: Chemical technology: Chemical engineering; Science: Physics
Website: http://jase.tku.edu.tw/
