Applied Sciences (Feb 2023)
Stochastic Weight Averaging Revisited
Abstract
Averaging neural network weights sampled by a backbone stochastic gradient descent (SGD) process is a simple yet effective approach for helping the backbone SGD find better optima in terms of generalization. From a statistical perspective, weight averaging contributes to variance reduction. Recently, a well-established stochastic weight-averaging (SWA) method was proposed, which featured the application of a cyclical or high-constant (CHC) learning-rate schedule for generating the weight samples to be averaged. A new insight on weight averaging was then introduced, stating that weight averaging assisted in discovering wider optima and thus resulted in better generalization. We conducted extensive experimental studies of SWA, involving 12 modern deep neural network architectures and 12 open-source image, graph, and text datasets as benchmarks. We disentangled the contributions of the weight-averaging operation and the CHC learning-rate schedule in SWA, showing that the weight-averaging operation in SWA still contributed to variance reduction, while the CHC learning-rate schedule assisted in exploring the parameter space more widely than the backbone SGD, which could be under-fitted due to a limited training budget. We then presented an algorithm, termed periodic SWA (PSWA), that comprises a series of weight-averaging operations to exploit the wide parameter-space structures explored under the CHC learning-rate schedule, and we empirically demonstrated that PSWA outperforms its backbone SGD remarkably.
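For concreteness, the core weight-averaging operation behind SWA and the periodic restarting idea behind PSWA can be sketched as follows. This is a minimal illustration on a toy quadratic objective, not the paper's experimental setup: the objective, noise level, cycle length, and learning-rate bounds below are illustrative assumptions.

```python
import numpy as np

# Toy quadratic objective: loss(w) = 0.5 * w^T A w, gradient = A w.
# The objective, cycle length, and learning-rate bounds are illustrative
# assumptions, not the benchmark configuration used in the paper.
A = np.diag([1.0, 10.0])

def grad(w, rng):
    # Noisy gradient mimicking mini-batch SGD.
    return A @ w + 0.1 * rng.standard_normal(w.shape)

def cyclical_lr(step, cycle_len=50, lr_min=0.01, lr_max=0.1):
    # Cyclical (CHC-style) schedule: decay from lr_max to lr_min within each cycle.
    t = (step % cycle_len) / (cycle_len - 1)
    return lr_max - t * (lr_max - lr_min)

def swa(w0, n_cycles=20, cycle_len=50, seed=0):
    """Run SGD with a cyclical learning rate and average the weights
    collected at the end of each cycle (the SWA solution)."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    w_avg, n_avg = np.zeros_like(w), 0
    for step in range(n_cycles * cycle_len):
        w -= cyclical_lr(step, cycle_len) * grad(w, rng)
        if (step + 1) % cycle_len == 0:      # end of a cycle: take a weight sample
            n_avg += 1
            w_avg += (w - w_avg) / n_avg     # running average of sampled weights
    return w_avg

def pswa(w0, n_periods=4, **kwargs):
    """Periodic SWA sketch: repeatedly run an SWA block and restart the
    backbone SGD from the averaged weights."""
    w = w0.copy()
    for _ in range(n_periods):
        w = swa(w, **kwargs)
    return w

if __name__ == "__main__":
    w0 = np.array([5.0, 5.0])
    print("SWA solution :", swa(w0))
    print("PSWA solution:", pswa(w0))
```

In a PyTorch training loop, a comparable averaging step is available through torch.optim.swa_utils.AveragedModel, with torch.optim.swa_utils.SWALR supplying the constant learning-rate phase of the schedule.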
Keywords