Entropy (Oct 2024)

An Empirical Study of Self-Supervised Learning with Wasserstein Distance

  • Makoto Yamada,
  • Yuki Takezawa,
  • Guillaume Houry,
  • Kira Michaela Düsterwald,
  • Deborah Sulem,
  • Han Zhao,
  • Yao-Hung Tsai

DOI: https://doi.org/10.3390/e26110939
Journal volume & issue: Vol. 26, no. 11, p. 939

Abstract

In this study, we consider the problem of self-supervised learning (SSL) using the 1-Wasserstein distance on a tree structure (a.k.a. the tree-Wasserstein distance, TWD), where TWD is defined as the L1 distance between two tree-embedded vectors. SSL methods commonly use cosine similarity as the objective function, whereas the Wasserstein distance has not been well studied in this setting. Because training with the Wasserstein distance is numerically challenging, this study empirically investigates strategies for optimizing SSL with the Wasserstein distance and identifies a stable training procedure. More specifically, we evaluate combinations of two types of TWD (total variation and ClusterTree) with several probability models, including the softmax function, the ArcFace probability model, and simplicial embedding. We also propose a simple yet effective Jeffrey divergence-based regularization to stabilize optimization. Through experiments on STL10, CIFAR10, CIFAR100, and SVHN, we find that a naive combination of the softmax function and TWD yields significantly lower performance than standard SimCLR, and that a naive combination of TWD and SimSiam fails to train the model. Model performance thus depends on the combination of TWD and probability model, and the Jeffrey divergence regularization aids training. Finally, we show that an appropriate combination of TWD and probability model outperforms cosine similarity-based representation learning.
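To make the abstract's ingredients concrete, below is a minimal sketch (not the authors' implementation) of the loss pieces it describes: the tree-Wasserstein distance for the simplest (star) tree, which reduces to total variation, i.e., half the L1 distance between softmax probability vectors, plus a Jeffrey (symmetric KL) divergence regularizer. Function names, the regularization weight `lambda_reg`, and the toy inputs are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def twd_total_variation(z1: torch.Tensor, z2: torch.Tensor) -> torch.Tensor:
    """1-Wasserstein distance on a star tree = total variation distance."""
    p = F.softmax(z1, dim=-1)  # embed features as probability vectors
    q = F.softmax(z2, dim=-1)
    return 0.5 * (p - q).abs().sum(dim=-1)


def jeffrey_divergence(z1: torch.Tensor, z2: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Jeffrey divergence J(p, q) = KL(p || q) + KL(q || p)."""
    p = F.softmax(z1, dim=-1)
    q = F.softmax(z2, dim=-1)
    kl_pq = (p * ((p + eps) / (q + eps)).log()).sum(dim=-1)
    kl_qp = (q * ((q + eps) / (p + eps)).log()).sum(dim=-1)
    return kl_pq + kl_qp


# Illustrative usage: a SimCLR-style setup where cosine similarity is replaced
# by the (negative) TWD between two augmented views, with Jeffrey-divergence
# regularization added to stabilize training.
z1, z2 = torch.randn(4, 128), torch.randn(4, 128)  # features of two views
lambda_reg = 0.1  # assumed regularization weight, for illustration only
loss = twd_total_variation(z1, z2) + lambda_reg * jeffrey_divergence(z1, z2)
print(loss.mean())
```

In this sketch the softmax plays the role of the probability model; the paper also studies ArcFace-based and simplicial-embedding probability models and a ClusterTree variant of TWD, which are not shown here.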

Keywords