International Journal of Computational Intelligence Systems (Apr 2024)

VSEM-SAMMI: An Explainable Multimodal Learning Approach to Predict User-Generated Image Helpfulness and Product Sales

  • Chengwen Sun,
  • Feng Liu

DOI
https://doi.org/10.1007/s44196-024-00495-8
Journal volume & issue
Vol. 17, no. 1
pp. 1 – 14

Abstract

Using user-generated content (UGC) is of utmost importance for e-commerce platforms seeking to extract valuable commercial information. In this paper, we propose an explainable multimodal learning approach, the visual–semantic embedding model with a self-attention mechanism for multimodal interaction (VSEM-SAMMI), to predict user-generated image (UGI) helpfulness and product sales. Focusing on SHEIN, a fast-fashion retailer, we collect the images posted by consumers, along with product and portrait characteristics. VSEM-SAMMI adopts a self-attention mechanism to enforce attention weights between image and text; we use it to extract features from UGIs and then apply machine learning algorithms to predict UGI helpfulness and product sales. We interpret the extracted features using a caption generation model and test the predictive power of the embeddings and portrait characteristics. The results indicate that, when predicting commercial information, embeddings are more informative than product and portrait characteristics. Combining VSEM-SAMMI with light gradient boosting machine (LightGBM) yields a mean squared error (MSE) of 0.208 for UGI helpfulness prediction and 0.184 for product sales prediction. Our study offers valuable insights for e-commerce platforms, enhances feature extraction from UGIs through image–text joint embeddings for UGI helpfulness and product sales prediction, and pioneers a caption generation model for interpreting image embeddings in the e-commerce domain.
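The abstract describes attention weights enforced between image and text representations before downstream prediction. As a rough illustration only (the paper's exact VSEM-SAMMI formulation is not reproduced here), the following sketch shows generic scaled dot-product attention with hypothetical dimensions, using text embeddings as queries over image-region embeddings:

```python
import numpy as np

def cross_modal_attention(text_emb, image_emb):
    """Scaled dot-product attention with text embeddings as queries and
    image embeddings as keys/values. A generic sketch of image-text
    attention, not the paper's exact VSEM-SAMMI mechanism."""
    d = text_emb.shape[-1]
    scores = text_emb @ image_emb.T / np.sqrt(d)       # (n_text, n_regions)
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ image_emb                          # attended image features

# Toy example: 4 text tokens attending over 6 image regions, 8-dim embeddings.
rng = np.random.default_rng(0)
text = rng.standard_normal((4, 8))
image = rng.standard_normal((6, 8))
fused = cross_modal_attention(text, image)
print(fused.shape)  # (4, 8)
```

In a pipeline like the one the abstract outlines, such fused features would then be fed to a gradient-boosting regressor (e.g. LightGBM) to predict helpfulness or sales scores.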

Keywords