CLUSTERING VIETNAMESE CONVERSATIONS FROM FACEBOOK PAGE TO BUILD TRAINING DATASET FOR CHATBOT

Trieu Hai Nguyen; Thi-Kim-Ngoan Pham; Thi-Hong-Minh Bui; Thanh-Quynh-Chau Nguyen

doi:10.5455/jjcit.71-1632557439

Jordanian Journal of Computers and Information Technology (Mar 2022)

Affiliations

Trieu Hai Nguyen: Nha Trang University, 02 Nguyen Dinh Chieu Street, Nha Trang City, Vietnam
Thi-Kim-Ngoan Pham: Nha Trang University, 02 Nguyen Dinh Chieu Street, Nha Trang City, Vietnam
Thi-Hong-Minh Bui: Nha Trang University, 02 Nguyen Dinh Chieu Street, Nha Trang City, Vietnam
Thanh-Quynh-Chau Nguyen: Nha Trang University, 02 Nguyen Dinh Chieu Street, Nha Trang City, Vietnam

DOI: https://doi.org/10.5455/jjcit.71-1632557439
Journal volume & issue: Vol. 8, no. 1
pp. 1 – 17

Abstract

The biggest challenge in building chatbots is obtaining training data, which must be realistic and large enough to train a chatbot. We create a tool that collects real training data from the Facebook Messenger conversations of a Facebook page. After text preprocessing, the collected data yields two datasets: FVnC and Sample. We use PhoBERT, a BERT model retrained for Vietnamese, to extract features from our text data. The K-Means and DBSCAN algorithms then cluster the output embeddings of PhoBERT$_{base}$. We evaluate clustering performance with the V-measure and Silhouette scores. We also demonstrate the efficiency of PhoBERT in feature extraction, compared with other models, on the Sample dataset and the wiki dataset. In addition, we propose a GridSearch algorithm that combines both clustering evaluations to find optimal parameters. Clustering this large number of conversations saves considerable time and effort in building data and storylines for training the chatbot. [JJCIT 2022; 8(1): 1-17]
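The grid search mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several parts are assumptions: the embeddings are synthetic 2-D points rather than PhoBERT outputs, only the K-Means parameter `k` is searched, and the two evaluations are combined by a plain average because the abstract does not say how they are combined. The helpers `kmeans`, `silhouette`, `v_measure`, and `grid_search` are hypothetical names written with the standard library only; in practice a library such as scikit-learn would normally supply them.

```python
import math
import random
from collections import Counter


def kmeans(points, k, iters=50):
    """Lloyd's k-means with farthest-first initialisation (deterministic)."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        return -1.0
    scores = []
    for i, p in enumerate(points):
        mean_dist = {}
        for c in clusters:
            d = [math.dist(p, q) for j, q in enumerate(points) if labels[j] == c and j != i]
            mean_dist[c] = sum(d) / len(d) if d else None
        a = mean_dist[labels[i]]
        if a is None:  # singleton cluster: silhouette is defined as 0
            scores.append(0.0)
            continue
        b = min(v for c, v in mean_dist.items() if c != labels[i])
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def entropy(counts, n):
    return -sum(c / n * math.log(c / n) for c in counts if c)


def v_measure(truth, pred):
    """Harmonic mean of homogeneity and completeness."""
    n = len(truth)
    h_c = entropy(Counter(truth).values(), n)
    h_k = entropy(Counter(pred).values(), n)
    h_joint = entropy(Counter(zip(truth, pred)).values(), n)
    homogeneity = 1.0 if h_c == 0 else 1 - (h_joint - h_k) / h_c
    completeness = 1.0 if h_k == 0 else 1 - (h_joint - h_c) / h_k
    total = homogeneity + completeness
    return 0.0 if total == 0 else 2 * homogeneity * completeness / total


def grid_search(points, truth, k_values):
    """Return (k, score) maximising the average of V-measure and silhouette.

    The plain average is an assumption made for this sketch.
    """
    best = None
    for k in k_values:
        labels = kmeans(points, k)
        score = (v_measure(truth, labels) + silhouette(points, labels)) / 2
        if best is None or score > best[1]:
            best = (k, score)
    return best


# Synthetic stand-in for sentence embeddings: three well-separated blobs.
rng = random.Random(42)
points, truth = [], []
for label, (cx, cy) in enumerate([(0, 0), (10, 0), (0, 10)]):
    for _ in range(20):
        points.append((cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5)))
        truth.append(label)

best_k, best_score = grid_search(points, truth, range(2, 7))
print(best_k, round(best_score, 3))
```

Note that V-measure needs reference labels while the Silhouette score does not, so a combined criterion of this kind only applies to data where some ground-truth grouping is available.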

Published in Jordanian Journal of Computers and Information Technology

ISSN: 2413-9351 (Print); 2415-1076 (Online)
Publisher: Scientific Research Support Fund of Jordan (SRSF) and Princess Sumaya University for Technology (PSUT)
Country of publisher: Jordan
LCC subjects: Technology: Technology (General): Industrial engineering. Management engineering: Information technology; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: http://jjcit.org/
