Zhanxing Zhu, Peking University
Room 306, No.5 Science Building
Stochastic gradient descent (SGD) is currently the standard workhorse for training deep learning models. Despite its computational benefits, its theoretical understanding remains rather limited. Recently, we discovered that SGD has an implicit regularization effect: the noise introduced by SGD helps find minima that generalize better. We provide two new perspectives for modeling this noise, an additive one and a multiplicative one. From the additive perspective, we show that the SGD noise is anisotropic and aligns well with the curvature of the loss function, so that it escapes from sharp minima towards flatter ones much more efficiently than its isotropic counterpart. From the multiplicative perspective, we interpret the regularization effect of SGD as minimizing a local Gaussian/Rademacher complexity. This understanding also provides new insights for implementing large-batch training without loss of generalization performance. These findings shed some light on the learning dynamics of deep learning and help open this black box.
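The claim that SGD's gradient noise is anisotropic can be checked numerically. The following is a minimal sketch (not code from the talk) on a toy least-squares problem; the feature scaling, variable names, and setup are illustrative assumptions.

```python
# Minimal sketch: empirically checking that mini-batch gradient noise
# is anisotropic on a toy least-squares problem. The toy setup below
# (uneven feature scales, evaluation point) is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 10
X = rng.normal(size=(n, d)) * np.linspace(1.0, 5.0, d)  # features with uneven scales
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

w = np.zeros(d)  # evaluate the noise at an arbitrary point in parameter space
# Per-example gradients of the squared loss 0.5 * (x . w - y)^2
residuals = X @ w - y
per_example_grads = residuals[:, None] * X        # shape (n, d)
full_grad = per_example_grads.mean(axis=0)

# SGD noise covariance (for batch size B it is this matrix scaled by 1/B)
C = np.cov(per_example_grads.T)
eigvals = np.linalg.eigvalsh(C)
anisotropy = eigvals[-1] / eigvals[0]             # >> 1 means far from isotropic
print(f"largest/smallest eigenvalue of noise covariance: {anisotropy:.1f}")
```

Because the feature scales are uneven, the eigenvalues of the noise covariance spread over a wide range, i.e. the noise is strongly anisotropic rather than spherical.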
Dr. Zhanxing Zhu is currently an assistant professor at the School of Mathematical Sciences, Peking University, and is also affiliated with the Center for Data Science, Peking University. He obtained his Ph.D. in machine learning from the University of Edinburgh in 2016. His research interests cover machine learning and its applications in various domains. He currently focuses on deep learning theory and optimization algorithms, reinforcement learning, and applications in traffic, computer security, computer graphics, medicine, and healthcare. He has published more than 40 papers in top AI journals and conferences, such as NIPS, ICML, CVPR, ACL, IJCAI, AAAI, and ECML. He was awarded the "2019 Alibaba Damo Young Fellow" and was a Best Paper Finalist at the top computer security conference CCS 2018.