Employing Siamese MaLSTM Model and ELMO Word Embedding for Quora Duplicate Questions Detection

Abdulaziz Altamimi; Muhammad Umer; Danial Hanif; Shtwai Alsubai; Tai-Hoon Kim; Imran Ashraf

doi:10.1109/ACCESS.2024.3367978

IEEE Access (Jan 2024)

Affiliations

Abdulaziz Altamimi: ORCiD; Department of Computer Science and Engineering, University of Hafr Al Batin, Hafar Al Batin, Saudi Arabia
Muhammad Umer: ORCiD; Department of Computer Science and Information Technology, The Islamia University of Bahawalpur, Bahawalpur, Pakistan
Danial Hanif: ORCiD; Department of Computer Science and Information Technology, The Islamia University of Bahawalpur, Bahawalpur, Pakistan
Shtwai Alsubai: ORCiD; Department of Computer Science, College of Computer Engineering and Sciences, Prince Sattam bin Abdulaziz University, Al-Kharj, Saudi Arabia
Tai-Hoon Kim: School of Electrical and Computer Engineering, Yeosu Campus, Chonnam National University, Yeosu-si, Republic of Korea
Imran Ashraf: ORCiD; Department of Information and Communication Engineering, Yeungnam University, Gyeongsan, Republic of Korea

DOI: https://doi.org/10.1109/ACCESS.2024.3367978
Journal volume & issue: Vol. 12
pp. 29072 – 29082

Abstract

Quora is an expanding online platform that contains a growing collection of user-generated questions and answers. The content on the platform is managed by its users, who create, edit, and organize it. Because of the vast number of users, it is common to find multiple questions with similar intent, which leads to the problem of duplicate and identical questions. Detecting these duplicates could make the search for high-quality answers more efficient, ultimately improving the experience of both readers and writers on Quora. This study uses the Quora Question Pairs dataset obtained from Kaggle to identify questions that are duplicates or identical. To vectorize the questions for model training, six types of word embeddings are implemented: GoogleNewsVector, FastText crawl, FastText crawl sub-words, bidirectional encoder representations from transformers (BERT), robustly optimized BERT pretraining approach (RoBERTa), and embeddings from language models (ELMO) containing 100 dimensions. The Siamese Manhattan long short-term memory (MaLSTM) neural network model, where Ma stands for Manhattan distance, is applied with ELMO word embeddings to predict duplicate questions in the dataset. Experimental results demonstrate that the proposed model attains an accuracy of 95.68%, which surpasses state-of-the-art models.
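
To illustrate the Manhattan-distance comparison that gives MaLSTM its name, the following is a minimal sketch rather than the authors' implementation. It assumes the commonly used MaLSTM similarity exp(-L1 distance) between the two encoded questions. The input vectors stand in for the final hidden states of a shared LSTM encoder, and the function names and the 0.5 decision threshold are hypothetical choices made for this example only.

```python
import math


def manhattan_similarity(h_left, h_right):
    """Return exp(-L1 distance) between two equal-length vectors.

    The result lies in (0, 1]; it equals 1.0 only when the vectors match.
    In a Siamese setup both vectors would come from the same encoder
    (shared weights); here they are plain lists of floats (assumption).
    """
    if len(h_left) != len(h_right):
        raise ValueError("vectors must have the same length")
    l1_distance = sum(abs(a - b) for a, b in zip(h_left, h_right))
    return math.exp(-l1_distance)


def is_duplicate(h_left, h_right, threshold=0.5):
    """Flag a pair as duplicate when similarity reaches the threshold.

    The threshold value is an illustrative assumption, not from the paper.
    """
    return manhattan_similarity(h_left, h_right) >= threshold


# Hypothetical encoder outputs for two question pairs.
close_pair = ([0.10, 0.20, 0.30], [0.10, 0.25, 0.30])
far_pair = ([0.0, 0.0, 0.0], [1.0, 1.0, 1.0])

print(is_duplicate(*close_pair))
print(is_duplicate(*far_pair))
```

Because the similarity decays exponentially with the L1 distance, encodings that differ slightly in a few dimensions still score close to 1, while encodings that differ across many dimensions score close to 0.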

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639