IEEE Access (Jan 2021)
Robustness of Adaptive Neural Network Optimization Under Training Noise
Abstract
Adaptive gradient methods such as adaptive moment estimation (Adam), RMSProp, and adaptive gradient (AdaGrad) use the temporal history of gradient updates to improve convergence speed and reduce reliance on manual learning rate tuning, making them a popular choice of off-the-shelf optimizer for Deep Neural Networks (DNNs). In this article, we study the robustness of neural network optimizers in the presence of training perturbations. We show that popular adaptive optimization methods generalize poorly when learning from noisy training data, compared to vanilla Stochastic Gradient Descent (SGD) and its variants, which exhibit better implicit regularization properties. We construct an illustrative family of two-class, linearly separable toy datasets on which models trained under noise with adaptive optimizers reach only 52% test accuracy (close to a random classifier), whereas SGD-based methods achieve 100% test accuracy. We strengthen this hypothesis through an empirical analysis with Convolutional Neural Networks (CNNs) on publicly available image datasets: we train neural network models with various optimizers on noisy training data and compute test accuracy on clean test data. Our results further highlight the robustness of SGD optimization to such noisy training data compared to its adaptive counterparts. Based on these results, we suggest reconsidering the extensive use of adaptive gradient methods for neural network optimization, especially when the training data is noisy.
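To make the evaluation protocol concrete (train under label noise, evaluate on clean data, compare SGD against an adaptive optimizer), the following is a minimal sketch assuming PyTorch; the data generator, linear model, noise level, and hyperparameters are illustrative placeholders and not the exact construction or settings used in this paper.

```python
# Minimal sketch (assumed PyTorch): train a linear classifier on noisy,
# linearly separable two-class data with SGD vs. Adam, then evaluate on
# clean test data. All choices below are illustrative, not the paper's setup.
import torch

torch.manual_seed(0)

def make_data(n, d=100, flip=0.0):
    # Two-class linearly separable data: the label is the sign of the first
    # feature; a fraction `flip` of the labels is corrupted (label noise).
    x = torch.randn(n, d)
    y = (x[:, 0] > 0).float()
    noisy = torch.rand(n) < flip
    y[noisy] = 1.0 - y[noisy]
    return x, y

def accuracy(model, x, y):
    with torch.no_grad():
        pred = (model(x).squeeze(1) > 0).float()
    return (pred == y).float().mean().item()

x_train, y_train = make_data(2000, flip=0.3)   # noisy training set
x_test, y_test = make_data(2000, flip=0.0)     # clean test set

for opt_name in ["sgd", "adam"]:
    model = torch.nn.Linear(100, 1)
    opt = (torch.optim.SGD(model.parameters(), lr=0.1) if opt_name == "sgd"
           else torch.optim.Adam(model.parameters(), lr=1e-3))
    loss_fn = torch.nn.BCEWithLogitsLoss()
    for _ in range(500):
        opt.zero_grad()
        loss = loss_fn(model(x_train).squeeze(1), y_train)
        loss.backward()
        opt.step()
    print(opt_name, "clean test accuracy:", accuracy(model, x_test, y_test))
```

The same protocol carries over to the CNN experiments: only the training labels are perturbed, the optimizer is swapped, and generalization is always measured on an uncorrupted test set.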
Keywords