Theoretical Insights into Wide Neural Networks: Optimization, Generalization and Robustness



The success of neural networks often relies on training highly complex architectures. In this talk, I will discuss some intriguing insights on the role of network width from the optimization, generalization, and robustness perspectives.

In the first part, we consider a function regression model where the goal is to learn a concave function using a linear combination of "bump-like" components (neurons). The parameters to be fitted are the centers of the bumps, and the resulting empirical risk minimization problem is highly non-convex. Formulating it as a two-layer neural network, we show that in the limit of diverging network width, the evolution of gradient descent converges to a Wasserstein gradient flow. Remarkably, the cost function optimized by the gradient flow exhibits a special property known as "displacement convexity," which implies global convergence at an exponential rate.

In the second part, we consider the adversarial robustness of neural networks to small perturbations of the input features at test time. For random features models (two-layer networks with the first-layer weights fixed at random values), we provide a precise characterization of the effect of network width on robustness. Our theory reveals several intriguing phenomena and indicates that larger width can hurt robust generalization!
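To make the first part concrete, here is a minimal NumPy sketch of the setup, under illustrative assumptions that are not from the talk: Gaussian bumps with a fixed width `delta`, a target that is itself a bump mixture (rather than a general concave function), and plain full-batch gradient descent on the centers only. As the width N grows, the empirical distribution of the centers evolving under this dynamic is the object that the talk identifies with a Wasserstein gradient flow.

```python
import numpy as np

rng = np.random.default_rng(0)
delta = 0.3  # fixed bump width (an illustrative choice, not from the talk)

def bumps(x, w):
    # K((x_i - w_j) / delta) for every sample x_i and center w_j
    return np.exp(-((x[:, None] - w[None, :]) / delta) ** 2)

# Build a target that is exactly a bump mixture, so the centers are the
# only unknowns; this toy target is not the concave function of the talk.
xs = np.linspace(-1.0, 1.0, 200)
true_centers = rng.uniform(-1.0, 1.0, 50)
y = bumps(xs, true_centers).mean(axis=1)

n_neurons = 200  # the "width" N; the theory concerns the limit N -> infinity
centers = rng.uniform(-1.0, 1.0, n_neurons)

def loss(w):
    # empirical risk: squared error of the bump mixture against the target
    return 0.5 * np.mean((bumps(xs, w).mean(axis=1) - y) ** 2)

loss_init = loss(centers)
lr = 0.5 * n_neurons  # mean-field scaling: step size of order N
for _ in range(500):
    B = bumps(xs, centers)
    resid = B.mean(axis=1) - y
    # d/dw_j of K((x - w_j)/delta) = K * 2 (x - w_j) / delta^2
    dB = B * 2.0 * (xs[:, None] - centers[None, :]) / delta**2
    grad = (resid[:, None] * dB).mean(axis=0) / n_neurons
    centers -= lr * grad
loss_final = loss(centers)
```

Only the centers move during training; the mixture weights stay uniform, matching the abstract's description of the fitted parameters.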
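For the second part, a small sketch of a random features model may help fix ideas. Everything specific here is an assumption for illustration (ReLU activation, ridge regression with penalty `lam`, a linear target, and a gradient-sign perturbation as the attack); the talk's precise threat model and training procedure may differ.

```python
import numpy as np

rng = np.random.default_rng(1)
d, width, n = 20, 500, 300  # input dim, network width N, training samples
W = rng.normal(size=(width, d)) / np.sqrt(d)  # first layer: fixed random weights

def features(X):
    # ReLU random features, scaled by 1/sqrt(N)
    return np.maximum(X @ W.T, 0.0) / np.sqrt(width)

# Only the second layer is trained, here by ridge regression on a linear
# target (illustrative choices, not from the talk).
X = rng.normal(size=(n, d))
beta = rng.normal(size=d) / np.sqrt(d)
y = X @ beta
Phi = features(X)
lam = 1e-3
a = np.linalg.solve(Phi.T @ Phi + lam * np.eye(width), Phi.T @ y)

def predict(X):
    return features(X) @ a

# A gradient-sign (FGSM-style) test-time perturbation of size eps.
x0 = rng.normal(size=(1, d))
eps = 0.1
pre = (x0 @ W.T).ravel()                           # pre-activations at x0
grad = ((a * (pre > 0.0)) @ W) / np.sqrt(width)    # d f / d x at x0
x_adv = x0 + eps * np.sign(grad)                   # pushes the prediction upward
```

Repeating this experiment across several values of `width` is one way to probe empirically how the attack's effect scales with network width, which is the quantity the talk's theory characterizes.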

Related Papers: