TY - JOUR
AU - Fort, Stanislav
AU - Scherlis, Adam
PY - 2019/07/17
Y2 - 2021/04/12
TI - The Goldilocks Zone: Towards Better Understanding of Neural Network Loss Landscapes
JF - Proceedings of the AAAI Conference on Artificial Intelligence
JA - AAAI
VL - 33
IS - 01
SE - AAAI Technical Track: Machine Learning
DO - 10.1609/aaai.v33i01.33013574
UR - https://ojs.aaai.org/index.php/AAAI/article/view/4237
SP - 3574
EP - 3581
AB - <p>We explore the loss landscape of fully-connected and convolutional neural networks using random, low-dimensional hyperplanes and hyperspheres. Evaluating the Hessian, <em>H</em>, of the loss function on these hypersurfaces, we observe 1) an unusual excess of the number of positive eigenvalues of <em>H</em>, and 2) a large value of Tr(<em>H</em>)/||<em>H</em>|| at a well-defined range of configuration space radii, corresponding to a thick, hollow, spherical shell we refer to as the <em>Goldilocks zone</em>. We observe this effect for fully-connected neural networks over a range of network widths and depths on MNIST and CIFAR-10 datasets with the ReLU and tanh non-linearities, and a similar effect for convolutional networks. Using our observations, we demonstrate a close connection between the Goldilocks zone, measures of local convexity/prevalence of positive curvature, and the suitability of a network initialization. We show that the high and stable accuracy reached when optimizing on random, low-dimensional hypersurfaces is directly related to the overlap between the hypersurface and the Goldilocks zone, and as a corollary demonstrate that the notion of intrinsic dimension is initialization-dependent. We note that common initialization techniques initialize neural networks in this particular region of unusually high convexity/prevalence of positive curvature, and offer a geometric intuition for their success. Furthermore, we demonstrate that initializing a neural network at a number of points and selecting for high measures of local convexity such as Tr(<em>H</em>)/||<em>H</em>||, number of positive eigenvalues of <em>H</em>, or low initial loss, leads to statistically significantly faster training on MNIST. Based on our observations, we hypothesize that the Goldilocks zone contains an unusually high density of suitable initialization configurations.</p>
ER -