Artificial neural networks (ANNs) trained with gradient-based algorithms have achieved impressive performance in a variety of applications. In particular, the stochastic gradient descent (SGD) algorithm has proved to be surprisingly efficient in navigating high-dimensional loss landscapes. However, this practical success remains largely unexplained by theory. A general consensus has emerged that explaining it requires a detailed description of the trajectory traversed during training. This task is highly nontrivial for at least two reasons. First, the high dimension of the parameter space in which ANNs operate defies standard mathematical techniques. Second, SGD navigates a non-convex loss landscape, following out-of-equilibrium dynamics with complicated, state-dependent noise. In this talk, I will consider prototypical learning problems that are amenable to an exact characterisation. I will show how methods from spin glass theory can be used to derive a low-dimensional description of the network performance and of the learning dynamics of gradient-based algorithms, including multi-pass SGD. I will discuss how different sources of algorithmic noise affect the performance of the network, and how SGD noise can be characterised via an effective fluctuation-dissipation relation that holds at stationarity.
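To give a concrete flavour of the kind of setup mentioned above, the following is a minimal sketch (not the speaker's code) of multi-pass SGD in a prototypical teacher-student problem, tracking the low-dimensional order parameters (teacher-student overlap and student norm) on which spin-glass-style analyses typically close the dynamics. All names, dimensions, and parameter values are illustrative assumptions.

```python
# Illustrative sketch: multi-pass SGD on a teacher-student perceptron,
# summarised by low-dimensional order parameters. All settings below
# are assumptions for illustration, not the talk's actual experiments.
import numpy as np

rng = np.random.default_rng(0)

d = 500          # input dimension
n = 2000         # finite training set, revisited over many passes (multi-pass SGD)
lr = 0.05        # learning rate
passes = 20      # number of passes over the data

# Teacher generating the labels (prototypical, exactly characterisable setup)
w_teacher = rng.standard_normal(d)
w_teacher /= np.linalg.norm(w_teacher) / np.sqrt(d)   # norm sqrt(d)

X = rng.standard_normal((n, d)) / np.sqrt(d)          # standardised inputs
y = X @ w_teacher                                     # noiseless linear teacher labels

w = rng.standard_normal(d)                            # student initialisation

def order_parameters(w):
    """Low-dimensional summary of the high-dimensional weight vector."""
    m = w @ w_teacher / d   # teacher-student overlap
    q = w @ w / d           # student self-overlap
    return m, q

history = []
for p in range(passes):
    for i in rng.permutation(n):        # one SGD pass over the finite dataset
        err = X[i] @ w - y[i]           # pointwise residual (quadratic loss)
        w -= lr * err * X[i]            # stochastic gradient step
    history.append(order_parameters(w))

for p, (m, q) in enumerate(history):
    print(f"pass {p:2d}: overlap m = {m:.3f}, norm q = {q:.3f}")
```

In analyses of this kind, the interest is precisely that the full high-dimensional trajectory of `w` can be summarised by the evolution of a few such overlaps; the script above only monitors them empirically.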
Download the slides for this talk (PDF, 7854.99 MB).