An Alternative View: When does SGD Escape Local Minima? 83


Yang Yuan, Tsinghua University


2020.06.17 15:30-16:30




Zoom ID: 917-210-89416
Password: 687416

If you cannot log in the above Zoom ID, please use the following one instead:
Zoom ID: 266-664-3379
Password: 738669


Stochastic gradient descent (SGD) is widely used in machine learning. Although being commonly viewed as a fast but not accurate version of gradient descent (GD), it always finds better solutions than GD for modern neural networks. In order to understand this phenomenon, we take an alternative view that SGD is working on the convolved (thus smoothed) version of the loss function. We show that, even if the function f has many bad local minima or saddle points, as long as for every point x, the weighted average of the gradients of its neighborhoods is one point convex with respect to the desired solution x∗, SGD will get close to, and then stay around x∗ with constant probability. More specifically, SGD will not get stuck at “sharp” local minima with small diameters, as long as the neighborhoods of these regions contain enough gradient information. The neighborhood size is controlled by step size and gradient noise. Our result identifies a set of functions that SGD provably works, which is much larger than the set of convex functions. Empirically, we observe that the loss surface of neural networks enjoys nice one point convexity properties locally, therefore our theorem helps explain why SGD works so well for neural networks.


Yang Yuan is now an assistant professor at IIIS, Tsinghua. He finished his undergraduate study at Peking University in 2012. Afterwards, he received his PhD at Cornell University in 2018, advised by Professor Robert Kleinberg. During his PhD, he was a visiting student at MIT/Microsoft New England (2014-2015) and Princeton University (2016 Fall). Before joining Tsinghua, he spent one year at MIT Institute for Foundations of Data Science (MIFODS) as a postdoc researcher, advised by Professor Piotr Indyk and Professor Aleksander Madry. He works on machine learning theory and algorithm design, including optimization, neural network analysis, hyperparameter tuning, etc. He was on the list of Forbes China 30 Under 30 (2019).