Understanding the Disharmony between Weight Normalization Family and Weight Decay
The fast convergence and potentially better performance of the weight normalization family have drawn increasing attention in recent years. These methods apply standardization or normalization that maps the weight W to W′, which makes W′ independent of the magnitude of W. Surprisingly, W must still be decayed during gradient descent; otherwise we observe a severe under-fitting problem. This is counter-intuitive, since weight decay is widely known as a way to prevent deep networks from over-fitting. Moreover, if we substitute W′ (e.g., W′ = W/∥W∥ for weight normalization) into the original loss function ∑_i L(f(x_i; W′), y_i) + ½λ∥W′∥², the regularization term ½λ∥W′∥² reduces to the constant ½λ and drops out of the optimization objective. Therefore, to decay W, we need to explicitly append the term ½λ∥W∥². In this paper, we theoretically prove that ½λ∥W∥² improves optimization only by modulating the effective learning rate, and has almost no influence on generalization when it is used together with the weight normalization family. Furthermore, we expose several serious problems that arise when a weight decay term is introduced to the weight normalization family, including the absence of a global minimum, training instability, and sensitivity to initialization. To address these problems, we propose an Adaptive Weight Shrink (AWS) scheme, which gradually shrinks the weights during optimization by a dynamic coefficient proportional to the magnitude of the parameter. This simple yet effective method appropriately controls the effective learning rate, which significantly improves training stability and makes optimization more robust to initialization.
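The two observations above, that the penalty on W′ is a constant and that the magnitude of W affects only the step size, can be checked numerically. The following is a minimal sketch and not the paper's implementation: the one-layer linear model, the squared-error loss, the sample values, and the helper names (`normalize`, `loss`, `grad`, `sq_norm`) are all assumptions chosen for illustration, and the gradient is approximated by central differences.

```python
import math


def sq_norm(v):
    """Squared Euclidean norm of a vector stored as a list."""
    return sum(a * a for a in v)


def normalize(w):
    """Weight normalization: W' = W / ||W||."""
    norm = math.sqrt(sq_norm(w))
    return [a / norm for a in w]


def loss(w, x, y):
    """Squared error of a linear model that uses W' (illustrative assumption)."""
    w_prime = normalize(w)
    pred = sum(a * b for a, b in zip(w_prime, x))
    return 0.5 * (pred - y) ** 2


def grad(w, x, y, eps=1e-6):
    """Central-difference gradient of the loss with respect to the raw weight W."""
    g = []
    for i in range(len(w)):
        hi = list(w)
        lo = list(w)
        hi[i] += eps
        lo[i] -= eps
        g.append((loss(hi, x, y) - loss(lo, x, y)) / (2 * eps))
    return g


# Hypothetical sample values, chosen so that ||W|| = 5.
w = [3.0, -4.0]
x, y = [1.0, 2.0], 0.5
lam = 1e-2

# 1) The penalty on W' equals lam / 2 regardless of the scale of W.
penalty_on_w_prime = 0.5 * lam * sq_norm(normalize(w))

# 2) Doubling W leaves the loss unchanged but halves the gradient norm.
w_scaled = [2.0 * a for a in w]
loss_gap = abs(loss(w, x, y) - loss(w_scaled, x, y))
grad_ratio = math.sqrt(sq_norm(grad(w, x, y)) / sq_norm(grad(w_scaled, x, y)))

# 3) The gradient is orthogonal to W, so the loss alone never shrinks ||W||.
grad_dot_w = sum(g * a for g, a in zip(grad(w, x, y), w))
```

Under these assumptions the gradient with respect to W scales as 1/∥W∥ while the loss does not change, which is consistent with the usual reading that the step taken by the normalized weight W′ behaves like a learning rate of order η/∥W∥² for a base learning rate η. Decaying W therefore acts on the step size rather than on the set of functions the model can represent.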