PLoS ONE (Jan 2024)

Multimodal representation learning for tourism recommendation with two-tower architecture.

  • Yuhang Cui,
  • Shengbin Liang,
  • YuYing Zhang

DOI
https://doi.org/10.1371/journal.pone.0299370
Journal volume & issue
Vol. 19, no. 2
p. e0299370

Abstract

Personalized recommendation plays an important role in many online service fields. In tourism recommendation, tourist attractions carry rich context and content information; these implicit features include not only text but also images and videos. To make better use of such features, researchers usually introduce richer feature information or more efficient feature representation methods, but introducing a large amount of feature information without restraint will inevitably degrade the performance of the recommendation system. We propose a novel heterogeneous multimodal representation learning method for tourism recommendation. The proposed model is based on a two-tower architecture in which the item tower handles multimodal latent features: a Bidirectional Long Short-Term Memory (Bi-LSTM) network extracts the text features of items, an External Attention Transformer (EANet) extracts the image features of items, and these feature vectors are concatenated with item ID embeddings to enrich the item representation. To increase the expressiveness of the model, we introduce a deep fully connected stack layer that fuses the multimodal feature vectors and captures the hidden relationships between them. The model is evaluated on three different datasets and outperforms the baseline models in NDCG and precision.
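To make the described architecture concrete, here is a minimal PyTorch sketch of such a two-tower model. It is not the authors' implementation: all dimensions, layer sizes, pooling choices, and names (ItemTower, UserTower, ExternalAttention, score) are illustrative assumptions, and the image branch uses a simplified external-attention block (in the style of Guo et al.'s external attention) in place of the full EANet.

import torch
import torch.nn as nn

class ExternalAttention(nn.Module):
    """Simplified external attention: two small linear memories shared
    across all samples stand in for self-attention."""
    def __init__(self, d_model, s=64):
        super().__init__()
        self.mk = nn.Linear(d_model, s, bias=False)   # external key memory
        self.mv = nn.Linear(s, d_model, bias=False)   # external value memory

    def forward(self, x):                  # x: (batch, n_patches, d_model)
        attn = self.mk(x).softmax(dim=1)   # attend over patches
        attn = attn / (attn.sum(dim=2, keepdim=True) + 1e-9)  # double normalization
        return self.mv(attn)

class ItemTower(nn.Module):
    def __init__(self, vocab_size, n_items, d_text=64, d_img=64, d_id=32, d_out=64):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_text, padding_idx=0)
        self.bilstm = nn.LSTM(d_text, d_text // 2, batch_first=True,
                              bidirectional=True)     # text branch (Bi-LSTM)
        self.ext_attn = ExternalAttention(d_img)      # image branch (EANet-style)
        self.id_emb = nn.Embedding(n_items, d_id)     # item-ID branch
        self.fuse = nn.Sequential(                    # deep fully connected stack
            nn.Linear(d_text + d_img + d_id, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, d_out),
        )

    def forward(self, text_ids, img_patches, item_ids):
        t, _ = self.bilstm(self.word_emb(text_ids))   # (batch, seq_len, d_text)
        t = t.mean(dim=1)                             # mean-pool text tokens
        v = self.ext_attn(img_patches).mean(dim=1)    # mean-pool image patches
        i = self.id_emb(item_ids)
        return self.fuse(torch.cat([t, v, i], dim=-1))

class UserTower(nn.Module):
    def __init__(self, n_users, d_out=64):
        super().__init__()
        self.emb = nn.Embedding(n_users, d_out)
        self.mlp = nn.Sequential(nn.Linear(d_out, d_out), nn.ReLU(),
                                 nn.Linear(d_out, d_out))

    def forward(self, user_ids):
        return self.mlp(self.emb(user_ids))

def score(user_vec, item_vec):
    # Two-tower matching: relevance is the dot product of the tower outputs.
    return (user_vec * item_vec).sum(dim=-1)

# Toy usage with random inputs (shapes only; real inputs come from the datasets):
user_tower = UserTower(n_users=1000)
item_tower = ItemTower(vocab_size=5000, n_items=2000)
u = user_tower(torch.tensor([3, 7]))
it = item_tower(torch.randint(1, 5000, (2, 20)),  # tokenized item descriptions
                torch.randn(2, 49, 64),           # 49 pre-extracted image-patch features
                torch.tensor([11, 42]))
print(score(u, it))                               # one score per (user, item) pair

A practical appeal of the two-tower design, which this sketch preserves, is that item vectors can be pre-computed offline for fast nearest-neighbor retrieval; the abstract does not state the paper's training objective, but such towers are commonly trained jointly with an in-batch softmax or pairwise ranking loss.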