It is important to understand how dropout, a popular regularization method, helps neural network training find solutions that generalize well. In this work, we theoretically derive the implicit regularization of dropout and study two of its implications, which intuitively rationalize why dropout helps generalization. First, we find that training with dropout finds neural networks at flatter minima. Second, when trained with dropout, the input weights of hidden neurons tend to condense onto isolated orientations. This work points out characteristics of dropout that are distinct from those of stochastic gradient descent.
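As a rough illustration of the condensation phenomenon mentioned above, the sketch below (not the authors' experimental setup; the architecture, toy data, dropout rate, and the 0.95 similarity threshold are all illustrative assumptions) trains a small network with dropout and then checks how many pairs of hidden neurons have input weight vectors aligned in nearly the same orientation.

```python
# Minimal sketch: train a two-layer network with dropout on a toy 1D
# regression task, then probe "condensation" by measuring the pairwise
# cosine similarity of the hidden neurons' input weight vectors.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy data: y = sin(2x) with a little noise.
x = torch.linspace(-2, 2, 200).unsqueeze(1)
y = torch.sin(2 * x) + 0.05 * torch.randn_like(x)

class MLP(nn.Module):
    def __init__(self, width=100, p_drop=0.5):
        super().__init__()
        self.fc1 = nn.Linear(1, width)
        self.drop = nn.Dropout(p_drop)   # randomly zeros hidden activations during training
        self.fc2 = nn.Linear(width, 1)

    def forward(self, x):
        h = torch.tanh(self.fc1(x))
        return self.fc2(self.drop(h))

model = MLP()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

model.train()                            # dropout is active only in training mode
for step in range(5000):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()

# Crude condensation probe: each row of fc1.weight together with its bias
# defines a hidden neuron's orientation; compare them pairwise.
with torch.no_grad():
    w = torch.cat([model.fc1.weight, model.fc1.bias.unsqueeze(1)], dim=1)
    w = w / w.norm(dim=1, keepdim=True)
    sim = w @ w.t()                      # entries near +/-1 indicate aligned neurons
    n = sim.shape[0]
    mask = ~torch.eye(n, dtype=torch.bool)
    frac = (sim[mask].abs() > 0.95).float().mean().item()
print(f"fraction of distinct neuron pairs with |cosine similarity| > 0.95: {frac:.3f}")
```

Under the condensation picture, rerunning the same sketch without the dropout layer would be expected to yield a noticeably smaller fraction of highly aligned neuron pairs, though the exact numbers depend on the illustrative choices above.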
Zhi-Qin John Xu is an associate professor at the Institute of Natural Sciences/School of Mathematical Sciences, Shanghai Jiao Tong University. Zhi-Qin graduated from Zhiyuan College of Shanghai Jiao Tong University in 2012 and received his Ph.D. in applied mathematics from Shanghai Jiao Tong University in 2016. From 2016 to 2019, he was a postdoctoral fellow at NYU Abu Dhabi and the Courant Institute. He and his collaborators discovered the frequency principle, parameter condensation, and embedding principles in deep learning, and developed multi-scale neural networks, among other contributions. He is one of the managing editors of the Journal of Machine Learning.