Talk / Overview
In this work, we consider a simple classification problem to highlight key features of the learning process in a deep neural network. We first show that, regardless of the structure of the loss landscape, standard probe functions exhibit similar time behaviour. From the evolution of the mean square displacement, we identify two regimes during training, in agreement with recent findings. In addition, we show that, in a minimalistic model whose loss landscape has a unique minimum, training is not a continuous search for that minimum. Instead, after an initial transient approach, the trajectory followed by SGD moves away from the minimum as training goes on. Furthermore, in the case of multiple local minima, we show that, as training proceeds, the distribution of Hessian eigenvalues concentrates in a window around zero whose size fluctuates over the course of training. This is at variance with the naive expectation of obtaining strictly positive eigenvalues after long training times. Finally, we report these features for the standard MNIST classification problem in the underparametrized regime.
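
As an illustration of the mean square displacement used as a probe function, here is a minimal sketch. It assumes the common definition MSD(t) = (1/N) Σ_i (w_i(t) − w_i(t0))², averaged over the N weights; the exact definition used in the talk may differ. The logistic-regression model, the synthetic data, and the names `mean_square_displacement` and `sgd_trajectory` are hypothetical stand-ins, not the network or code from this work.

```python
import numpy as np

def mean_square_displacement(trajectory, t0=0):
    """MSD of the weights relative to the snapshot at step t0.

    trajectory: array-like of shape (T, N), one flattened weight vector per step.
    Returns an array of shape (T - t0,), averaged over the N weights.
    """
    trajectory = np.asarray(trajectory, dtype=float)
    disp = trajectory[t0:] - trajectory[t0]
    return np.mean(disp ** 2, axis=1)

def sgd_trajectory(X, y, steps=500, lr=0.1, batch=8, seed=0):
    """Toy stand-in (assumption): logistic regression trained with mini-batch SGD."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    traj = [w.copy()]
    for _ in range(steps):
        idx = rng.choice(len(X), size=batch, replace=False)
        p = 1.0 / (1.0 + np.exp(-X[idx] @ w))      # predicted probabilities
        grad = X[idx].T @ (p - y[idx]) / batch     # cross-entropy gradient
        w = w - lr * grad
        traj.append(w.copy())
    return np.array(traj)

# Synthetic two-class data (assumption, for illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(float)

traj = sgd_trajectory(X, y)
msd = mean_square_displacement(traj)
```

Plotting `msd` against the step index on log-log axes is one way to look for distinct regimes; whether the two regimes reported here appear in such a toy model is not something this sketch establishes.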