IEEE Access (Jan 2023)
Expectation-Maximization via Pretext-Invariant Representations
Abstract
Contrastive learning has been widely adopted in unsupervised and self-supervised visual representation learning. Such algorithms aim to maximize the cosine similarity between two positive samples while minimizing that between negative samples. Recently, Grill et al. proposed BYOL, an algorithm that uses only positive samples and dispenses with negatives entirely by introducing a Siamese-like asymmetric architecture. Although many recent state-of-the-art (SOTA) methods adopt this architecture, most of them simply add an extra neural network, the predictor, without further exploring the asymmetry itself. In contrast, Chen and He proposed SimSiam, a simple Siamese architecture that relies on a stop-gradient operation instead of a momentum encoder, and interpreted the framework from the perspective of Expectation-Maximization. We argue that BYOL-like algorithms attain suboptimal performance due to representation inconsistency during training. In this work, we analyze this issue and propose a novel self-supervised objective, Expectation-Maximization via Pretext-Invariant Representations (EMPIR), which enhances Expectation-Maximization-based optimization in BYOL-like algorithms by enforcing augmentation invariance within a local region of k nearest neighbors, resulting in consistent representation learning. In other words, we cast Expectation-Maximization as the core task of asymmetric architectures. We show that EMPIR consistently outperforms other SOTA algorithms by a clear margin. We also demonstrate its transfer learning capabilities on downstream image recognition tasks.
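To make the described setting concrete, the following is a minimal PyTorch-style sketch of the asymmetric stop-gradient Siamese loss used by BYOL/SimSiam-type methods, together with an illustrative k-nearest-neighbor invariance term in the spirit of the abstract's description. The names (encoder, predictor, memory_bank, k) and the exact form of the neighborhood term are assumptions for illustration only, not the paper's implementation.

```python
# Sketch only: stop-gradient Siamese loss plus an illustrative kNN invariance term.
import torch
import torch.nn.functional as F

def negative_cosine(p, z):
    # Negative cosine similarity; z is detached to realize the stop-gradient.
    p = F.normalize(p, dim=-1)
    z = F.normalize(z.detach(), dim=-1)
    return -(p * z).sum(dim=-1).mean()

def siamese_loss(encoder, predictor, x1, x2):
    # Two augmented views of the same image -> symmetric asymmetric-branch loss.
    z1, z2 = encoder(x1), encoder(x2)
    p1, p2 = predictor(z1), predictor(z2)
    return 0.5 * (negative_cosine(p1, z2) + negative_cosine(p2, z1))

def knn_invariance(p, memory_bank, k=5):
    # Illustrative local-neighborhood term (assumption): pull each prediction
    # toward the centroid of its k nearest neighbors in a detached memory bank
    # of past representations, encouraging augmentation-invariant, consistent
    # representations within a local region.
    p = F.normalize(p, dim=-1)
    bank = F.normalize(memory_bank.detach(), dim=-1)
    sim = p @ bank.t()                        # (B, N) cosine similarities
    idx = sim.topk(k, dim=-1).indices         # (B, k) nearest-neighbor indices
    neighbors = bank[idx].mean(dim=1)         # (B, D) neighborhood centroids
    return -(p * F.normalize(neighbors, dim=-1)).sum(dim=-1).mean()
```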
Keywords