Mathematics (Oct 2024)

Improving Systematic Generalization of Linear Transformer Using Normalization Layers and Orthogonality Loss Function

  • Taewon Park,
  • Hyun-Chul Kim

DOI
https://doi.org/10.3390/math12213390
Journal volume & issue
Vol. 12, no. 21
p. 3390

Abstract

The Linear Transformer linearizes the attention mechanism of the vanilla Transformer architecture, significantly improving efficiency and achieving theoretically linear complexity with respect to sequence length. However, few studies have explored the capabilities of the Linear Transformer beyond its efficiency. In this work, we investigate the systematic generalization capability of the Linear Transformer, a crucial property for strong generalization to unseen data. Through preliminary experiments, we identify two major issues contributing to its unstable systematic generalization performance: (i) unconstrained norms of Queries and Keys, and (ii) high correlation among Values across the sequence. To address these issues, we propose two simple yet effective methods: normalization layers for Queries and Keys, and an orthogonality loss function applied to Values during training. In experiments, we demonstrate that applying these methods to the Linear Transformer significantly improves its stability and systematic generalization performance across several well-known tasks. Furthermore, with our proposed methods, the Linear Transformer outperforms the vanilla Transformer on specific systematic generalization tasks, such as the sort-of-CLEVR and SCAN tasks.
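
To make the two remedies described in the abstract concrete, the sketch below shows a single-head kernelized linear-attention layer with layer normalization applied to Queries and Keys, plus an orthogonality penalty on the Values. This is a minimal illustration under assumed choices (the `elu + 1` feature map, module and function names such as `NormalizedLinearAttention` and `orthogonality_loss`), not the authors' reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormalizedLinearAttention(nn.Module):
    """Single-head linear attention with LayerNorm on Queries and Keys.

    Illustrative sketch of the paper's first idea (constraining Q/K norms);
    the feature map elu(x) + 1 is a common choice assumed here.
    """
    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.q_norm = nn.LayerNorm(dim)   # constrains Query norms
        self.k_norm = nn.LayerNorm(dim)   # constrains Key norms

    def forward(self, x):
        # x: (batch, seq_len, dim)
        q = F.elu(self.q_norm(self.to_q(x))) + 1   # positive feature map
        k = F.elu(self.k_norm(self.to_k(x))) + 1
        v = self.to_v(x)
        # Kernelized linear attention: cost linear in sequence length
        kv = torch.einsum("bnd,bne->bde", k, v)               # sum_n k_n v_n^T
        z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + 1e-6)
        out = torch.einsum("bnd,bde,bn->bne", q, kv, z)
        return out, v

def orthogonality_loss(v):
    """Penalize correlation among Value vectors across the sequence.

    Pushes the per-sequence Gram matrix of (unit-normalized) Values toward
    the identity, i.e. toward mutually orthogonal Values; a sketch of the
    paper's second idea.
    """
    v = F.normalize(v, dim=-1)                       # unit-norm Values
    gram = torch.einsum("bnd,bmd->bnm", v, v)        # pairwise similarities
    eye = torch.eye(gram.size(-1), device=v.device)
    return ((gram - eye) ** 2).mean()
```

During training, the penalty would be combined with the task objective, e.g. `loss = task_loss + lam * orthogonality_loss(v)` for some assumed weighting coefficient `lam`, encouraging less correlated Values across the sequence.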

Keywords