Zoom ID: 991-377-04986
Deep learning has achieved tremendous success in many applications, but the reasons for this success are not yet well understood. A recent line of research on deep learning theory focuses on the extremely over-parameterized setting and shows that deep neural networks trained by (stochastic) gradient descent enjoy nice optimization and generalization guarantees in the so-called neural tangent kernel (NTK) regime. However, many have argued that existing NTK-based results cannot explain the success of deep learning, mainly for two reasons: (i) most results require an extremely wide neural network, which is impractical, and (ii) NTK-based analyses require the network parameters to stay very close to initialization throughout training, which does not match empirical observations. In this talk, I will explain how these limitations of current NTK-based analyses can be alleviated.

In the first part of the talk, I will show that under certain assumptions, we can prove optimization and generalization guarantees with network width only polylogarithmic in the training sample size and the inverse target test error. In the second part, I will present a generalized neural tangent kernel analysis of noisy gradient descent with weight decay, which allows the network parameters to move far away from initialization. Our results push the theoretical analysis of over-parameterized deep neural networks towards a more practical setting.
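For readers unfamiliar with the algorithm named in the second part, the update rule of noisy gradient descent with weight decay can be sketched as follows. This is a minimal illustration on a toy quadratic loss, not the talk's actual analysis; the step size, weight-decay coefficient, and noise scale below are illustrative values chosen for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_gd_step(w, grad, lr=0.1, weight_decay=1e-2, noise_std=1e-3):
    # Noisy gradient descent with weight decay:
    #   w <- w - lr * (grad + weight_decay * w) + Gaussian noise.
    # The injected noise lets the iterates explore instead of converging
    # to a single point, and weight decay regularizes the parameter norm.
    noise = noise_std * rng.standard_normal(w.shape)
    return w - lr * (grad + weight_decay * w) + noise

# Toy quadratic loss L(w) = 0.5 * ||w||^2, whose gradient is w itself;
# both the loss gradient and weight decay pull w toward zero, so after
# many steps the iterates concentrate in a small neighborhood of zero.
w = np.ones(3)
for _ in range(200):
    w = noisy_gd_step(w, grad=w)
print(np.linalg.norm(w))  # small: iterates hover near the minimizer
```

Note that, unlike the plain NTK setting, nothing in this update constrains the parameters to stay near their initialization; that freedom is what the generalized analysis in the talk accounts for.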