Popular-science articles and media coverage (in Chinese):
Does a Transformer infer or memorize? Initialization scale matters. Reference [49]
Media coverage of understanding deep learning through the frequency principle: MIT Technology Review Chinese edition (《麻省理工科技评论》中文官网), DeepTech深科技, and 络绎科学
The simplicity bias of neural networks (a summary of five years of theoretical research, 2017.11-2022.11)
Understanding, from a frequency perspective, why depth can accelerate the training of neural networks. Reference [11]
Linear Frequency Principle dynamics: an effective model for quantitatively understanding deep learning. References [13,17]
F-Principle: a first exploration of deep learning applications in computational mathematics. References [4,9,12]
Interpreting the generalization ability of deep learning from the perspective of Fourier analysis. References [1,2,4,17]

Code and figures:
Code at github.
1d example of F-Principle
Condensation. Weight condensation: the input weights of neurons in a group are the same.
Phase diagram of two-layer ReLU NN
Embedding principle
MscaleDNN structure

Paper list
* indicates the corresponding author. #: Equal contribution. Bib citation format is here.

Deep learning
Reading guidance:

Language Models: Language model research faces significant challenges, especially for academic research groups with constrained resources. These challenges include complex data structures, unknown target functions, high computational costs and memory requirements, and a lack of interpretability in the inference process. Drawing a parallel to the use of simple models in scientific research, we propose the concept of an anchor function, a type of benchmark function designed for studying language models.

Frequency Principle: An overview is in [27]. [4] is a comprehensive study of the F-Principle with low- and high-dimensional experiments and a simple theory (a minimal 1-D sketch is given after this guidance). The first paper on the F-Principle is [1]. Theory of the F-Principle for general networks with infinite samples is in [7]; theory for infinite-width two-layer networks (NTK regime) with finite samples is in [13] and [17] (the initial version of [13] is [6]). Inspired by the linear frequency principle, we further derive a Fourier-domain variational formulation for supervised learning and prove its well-posedness in [14]. We also use the F-Principle to understand why DNNs and traditional numerical methods find different solutions when overparameterized in [15]. In [11], we propose a deep frequency principle to understand why deeper learning is faster. Based on the F-Principle, we propose the MscaleDNN in [9] (an initial version is in [8]) to learn high frequencies fast.

Condensation: We found that weight condensation is an important feature of non-linear training, as shown by the phase diagram of two-layer ReLU NNs [10] and three-layer ReLU NNs [29] (a small sketch for checking condensation is also given after this guidance). The condensation induced by gradient flow at the initial training stage is explained in [18]. We also found that dropout can facilitate condensation, which can be explained by its implicit regularization [33]. Condensation motivates us to identify an important characteristic of the loss landscape that provides a basis for condensation, namely the embedding principle: the loss landscape of a DNN “contains” all the critical points of all narrower DNNs [19, 23]. To understand the effective size of a condensed network, we develop a rank stratification analysis for general non-linear models in [35]. This rank analysis reduces many long-standing problems of overparameterized non-linear models to a linear stability hypothesis, supported by a series of experiments, including condensation.

AI for science: (Combustion) In [25], we apply DNNs to reduce detailed mechanisms of chemical kinetics. In [26], we use DNNs to accelerate the simulation of chemical kinetics. (Solve PDE) (a) The MscaleDNN is proposed and comprehensively studied in [9] (an initial version is in [8]); [12,24] develop the MscaleDNN further. The original observation that DNNs are slow to resolve high frequencies when solving PDEs is in Fig. 4 of [4]. (b) MOD-Net (model-operator-data network) learns the PDE operator with a DNN, using cheap data as regularization.
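The F-Principle referenced above says that DNNs tend to fit low-frequency components of a target before high-frequency ones. The sketch below is a minimal, self-contained illustration in the spirit of the "1d example of F-Principle" code linked above, not the released implementation: it trains a small fully connected network on a 1-D target with two frequencies and tracks the relative error of the low- and high-frequency Fourier coefficients of the network output. The network size, target function, and optimizer settings are illustrative assumptions.

```python
# Minimal 1-D F-Principle sketch (hyperparameters are illustrative assumptions,
# not the settings used in the cited papers).
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)

def target(x):
    # Target with one low-frequency and one high-frequency component.
    return torch.sin(x) + 0.5 * torch.sin(5.0 * x)

x = torch.linspace(-np.pi, np.pi, 201).reshape(-1, 1)
y = target(x)

model = nn.Sequential(nn.Linear(1, 200), nn.Tanh(),
                      nn.Linear(200, 200), nn.Tanh(),
                      nn.Linear(200, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Indices of the two dominant frequencies in the DFT of the sampled target.
t_hat = np.abs(np.fft.rfft(y.numpy().ravel()))
k_low, k_high = sorted(np.argsort(t_hat)[-2:])

def freq_error(pred, true, k):
    """Relative error of the k-th discrete Fourier coefficient."""
    p = np.fft.rfft(pred.detach().numpy().ravel())
    t = np.fft.rfft(true.numpy().ravel())
    return abs(p[k] - t[k]) / (abs(t[k]) + 1e-12)

for step in range(5001):
    opt.zero_grad()
    pred = model(x)
    loss = loss_fn(pred, y)
    loss.backward()
    opt.step()
    if step % 500 == 0:
        print(f"step {step:5d}  loss {loss.item():.2e}  "
              f"low-freq err {freq_error(pred, y, k_low):.2e}  "
              f"high-freq err {freq_error(pred, y, k_high):.2e}")
```

Under typical runs of this sketch, the low-frequency error is expected to decay well before the high-frequency error, which is the behavior the F-Principle describes.

The condensation paragraph states that, in the non-linear (small-initialization) regime, the input weights of hidden neurons cluster into a few groups with nearly identical directions. The hedged sketch below checks this on a two-layer tanh network by measuring pairwise cosine similarities of the first-layer weight vectors after training; the initialization scale, width, and target are assumptions chosen only to illustrate the measurement, not the exact setups of [10, 18].

```python
# Hedged sketch for checking weight condensation under small initialization.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.rand(80, 1) * 2 - 1                # 1-D inputs in [-1, 1]
y = torch.sin(3.0 * x)                       # illustrative target

m = 50                                       # hidden width
W = nn.Parameter(torch.randn(m, 1) * 1e-3)   # small initialization
b = nn.Parameter(torch.zeros(m))
a = nn.Parameter(torch.randn(m) * 1e-3)

opt = torch.optim.Adam([W, b, a], lr=1e-3)
for step in range(20001):
    pred = torch.tanh(x @ W.t() + b) @ a.unsqueeze(1)
    loss = F.mse_loss(pred, y)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Condensation shows up as pairwise cosine similarities of the input weight
# vectors (weight and bias of each hidden neuron) clustering near +1 or -1.
v = torch.cat([W, b.unsqueeze(1)], dim=1)    # (m, 2)
sim = F.cosine_similarity(v.unsqueeze(1), v.unsqueeze(0), dim=2)
print("fraction of neuron pairs with |cos| > 0.99:",
      (sim.abs() > 0.99).float().mean().item())
```

How strongly the similarities cluster near ±1 depends on the initialization scale and training time; the point of the sketch is only the measurement itself.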
55 papers

Language Models
A PPT for anchor function and initialization effect in ppt
[54] Zhiwei Wang, Yunji Wang, Zhongwang Zhang, Zhangchen Zhou, Hui Jin, Tianyang Hu, Jiacheng Sun, Zhenguo Li, Yaoyu Zhang, Zhi-Qin John Xu*, Towards Understanding How Transformer Perform Multi-step Reasoning with Matching Operation, arxiv 2405.15302 (2024), and in pdf, and in arxiv
[49] Zhongwang Zhang, Pengxiao Lin, Zhiwei Wang, Yaoyu Zhang, Zhi-Qin John Xu*, Initialization is Critical to Whether Transformers Fit Composite Functions by Inference or Memorizing, NeurIPS 2024, arxiv 2405.05409 (2024), and in pdf, and in arxiv
[48] Zhongwang Zhang#, Zhiwei Wang#, Junjie Yao, Zhangchen Zhou, Xiaolong Li, Weinan E, Zhi-Qin John Xu*, Anchor function: a type of benchmark functions for studying language models, arxiv 2401.08309 (2024), and in pdf, and in arxiv

Condensation
Key papers
[33] Zhongwang Zhang, Zhi-Qin John Xu*, Implicit regularization of dropout. Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024. arxiv 2207.05952 (2022) pdf, and in arxiv.
[10] Tao Luo#, Zhi-Qin John Xu#, Zheng Ma, Yaoyu Zhang*, Phase diagram for two-layer ReLU neural networks at infinite-width limit, arxiv 2007.07497 (2020), Journal of Machine Learning Research (2021) in web and in pdf, and in arxiv
[19] Yaoyu Zhang*, Zhongwang Zhang, Tao Luo, Zhi-Qin John Xu*, Embedding Principle of Loss Landscape of Deep Neural Networks. NeurIPS 2021 spotlight, arxiv 2105.14573 (2021) in web, and in pdf, and in arxiv, see slides, and Talk on Bilibili
[18] Hanxu Zhou, Qixuan Zhou, Tao Luo, Yaoyu Zhang*, Zhi-Qin John Xu*, Towards Understanding the Condensation of Neural Networks at Initial Training. NeurIPS 2022. arxiv 2105.11686 (2021) pdf with Appendix, and in web and arxiv, see slides and video talk in Chinese.
[43] Yaoyu Zhang*, Zhongwang Zhang, Leyang Zhang, Zhiwei Bai, Tao Luo, Zhi-Qin John Xu*, Optimistic Estimate Uncovers the Potential of Nonlinear Models. arxiv 2305.15850 (2023) pdf, and in arxiv.
More papers
[51] Tianyi Chen, Zhi-Qin John Xu*, Efficient and Flexible Method for Reducing Moderate-size Deep Neural Networks with Condensation. Entropy 2024, 26(7), 567. arxiv 2405.01041 pdf, and in arxiv, and in web.
[41] Zhongwang Zhang, Yuqing Li*, Tao Luo*, Zhi-Qin John Xu*, Stochastic Modified Equations and Dynamics of Dropout Algorithm. ICLR 2024. arxiv 2305.15850 (2023) pdf, and in arxiv.
[40] Zhongwang Zhang, Zhi-Qin John Xu*, Loss Spike in Training Neural Networks. arxiv 2305.12133 (2023) pdf, and in arxiv.
[39] Zhangchen Zhou, Hanxu Zhou, Yuqing Li*, Zhi-Qin John Xu*, Understanding the Initial Condensation of Convolutional Neural Networks, arxiv 2305.09947. pdf, and in arxiv.
[37] Zhengan Chen, Yuqing Li*, Tao Luo, Zhangchen Zhou, Zhi-Qin John Xu*, Phase Diagram of Initial Condensation for Two-layer Neural Networks, CSIAM Trans. Appl. Math., 5 (2024), pp. 448-514. CSIAM-AM arxiv 2303.06561. pdf, and in arxiv.
[35] Yaoyu Zhang*, Zhongwang Zhang, Leyang Zhang, Zhiwei Bai, Tao Luo, Zhi-Qin John Xu*, Optimistic Estimate Uncovers the Potential of Nonlinear Models, arxiv 2307.08921. pdf, and in arxiv.
[30] Zhiwei Bai, Tao Luo, Zhi-Qin John Xu*, Yaoyu Zhang*, Embedding Principle in Depth for the Loss Landscape Analysis of Deep Neural Networks. CSIAM Transactions on Applied Mathematics, 2024. arxiv 2205.13283 (2022) pdf, and in arxiv.
[29] Hanxu Zhou, Qixuan Zhou, Zhenyuan Jin, Tao Luo, Yaoyu Zhang, Zhi-Qin John Xu*, Empirical Phase Diagram for Three-layer Neural Networks with Infinite Width. NeurIPS 2022. arxiv 2205.12101 (2022) pdf, and in arxiv.
[23] Yaoyu Zhang*, Yuqing Li, Zhongwang Zhang, Tao Luo, Zhi-Qin John Xu*, Embedding Principle: a hierarchical structure of loss landscape of deep neural networks. Journal of Machine Learning, (2022), pp. 60-113. pdf, and in web, and in arxiv.

Frequency Principle
Key papers
[4] Zhi-Qin John Xu*, Yaoyu Zhang, Tao Luo, Yanyang Xiao, Zheng Ma, Frequency Principle: Fourier Analysis Sheds Light on Deep Neural Networks, arXiv preprint: 1901.06523, Communications in Computational Physics (CiCP). pdf, and in web, some code is in github (nominated for the Outstanding Youth Paper Award at the 2021 World Artificial Intelligence Conference).
[27] Zhi-Qin John Xu*, Yaoyu Zhang, Tao Luo, Overview frequency principle/spectral bias in deep learning. Communications on Applied Mathematics and Computation 2024 (dedicated to the memory of Professor Zhong-Ci Shi), arxiv 2201.07395 (2022) pdf, and in arxiv.
[13] (Alphabetic order) Tao Luo*, Zheng Ma, Zhi-Qin John Xu, Yaoyu Zhang, On the exact computation of linear frequency principle dynamics and its generalization, SIAM Journal on Mathematics of Data Science (SIMODS), arxiv 2010.08153 (2020). in web, and in pdf, and in arxiv, some code is in github. Supplemental Material
More papers
[55] Zhangchen Zhou, Yaoyu Zhang, Zhi-Qin John Xu*, A rationale from frequency perspective for grokking in training neural network. arxiv 2405.03095 pdf, and in arxiv.
[17] Yaoyu Zhang, Tao Luo, Zheng Ma, Zhi-Qin John Xu*, Linear Frequency Principle Model to Understand the Absence of Overfitting in Neural Networks. Chinese Physics Letters, 2021. pdf, and in arxiv, see CPL web
[16] (Alphabetic order) Yuheng Ma, Zhi-Qin John Xu*, Jiwei Zhang*, Frequency Principle in Deep Learning Beyond Gradient-descent-based Training, arxiv 2101.00747 (2021). pdf, and in arxiv
[15] (Alphabetic order) Jihong Wang, Zhi-Qin John Xu*, Jiwei Zhang*, Yaoyu Zhang, Implicit bias in understanding deep learning for solving PDEs beyond Ritz-Galerkin method, CSIAM Transactions on Applied Mathematics, 2022. web, arxiv 2002.07989 (2020). pdf, and in arxiv
[14] (Alphabetic order) Tao Luo*, Zheng Ma, Zhiwei Wang, Zhi-Qin John Xu, Yaoyu Zhang, An Upper Limit of Decaying Rate with Respect to Frequency in Deep Neural Network, MSML22, Mathematical and Scientific Machine Learning 2022, arxiv 2105.11675 (previous version: 2012.03238) (2020). pdf, and in arxiv
[11] Zhi-Qin John Xu*, Hanxu Zhou, Deep frequency principle towards understanding why deeper learning is faster, Proceedings of the AAAI Conference on Artificial Intelligence 2021, arxiv 2007.14313 (2020) pdf, and in arxiv, and AAAI web, and slides, and AAAI speech script slides
[7] (Alphabetic order) Tao Luo, Zheng Ma, Zhi-Qin John Xu, Yaoyu Zhang, Theory of the frequency principle for general deep neural networks, CSIAM Transactions on Applied Mathematics, 2021, arXiv preprint: 1906.09235 (2019). arxiv, in web, see pdf
Note: [13] is a comprehensive version of [6].
[6] Yaoyu Zhang, Zhi-Qin John Xu*, Tao Luo, Zheng Ma, Explicitizing an Implicit Bias of the Frequency Principle in Two-layer Neural Networks. arXiv preprint: 1905.10264 (2019) pdf, and in arxiv
[5] Yaoyu Zhang, Zhi-Qin John Xu*, Tao Luo, Zheng Ma, A type of generalization error induced by initialization in deep neural networks. arXiv preprint: 1905.07777 (2019), 1st Mathematical and Scientific Machine Learning Conference (MSML2020). pdf, and in web
Note: Most of [2] and [3] are combined into paper [4].
[3] Zhi-Qin John Xu*, Frequency Principle in Deep Learning with General Loss Functions and Its Potential Application, arXiv preprint: 1811.10146 (2018). pdf, and in arxiv
[2] Zhi-Qin John Xu*, Understanding training and generalization in deep learning by Fourier analysis, arXiv preprint: 1808.04295 (2018). pdf, and in arxiv
[1] Zhi-Qin John Xu*, Yaoyu Zhang, and Yanyang Xiao, Training behavior of deep neural network in frequency domain, arXiv preprint: 1807.01251 (2018), 26th International Conference on Neural Information Processing (ICONIP 2019). pdf, and in web

Combustion
Key papers
[47] Zhi-Qin John Xu*#, Junjie Yao#, Yuxiao Yi#, Liangkai Hang, Weinan E, Yaoyu Zhang*, Tianhan Zhang*, Solving multiscale dynamical systems by deep learning. arxiv 2401.01220 (2024) pdf, and in arxiv.
[26] Tianhan Zhang*, Yuxiao Yi, Yifan Xu, Zhi X. Chen, Yaoyu Zhang, Weinan E, Zhi-Qin John Xu*, A multi-scale sampling method for accurate and robust deep neural network to predict combustion chemical kinetics. Combustion and Flame, 2022. arxiv 2201.03549 (2022) pdf, and in arxiv, and in web.
[46] Zhiwei Wang, Yaoyu Zhang, Pengxiao Lin, Enhan Zhao, Weinan E, Tianhan Zhang*, Zhi-Qin John Xu*, Deep mechanism reduction (DeePMR) method for fuel chemical kinetics, Combustion and Flame, 2024. pdf
More papers
[44] Runze Mao, Minqi Lin, Yan Zhang*, Tianhan Zhang, Zhi-Qin John Xu, Zhi X Chen*, DeepFlame: A deep learning empowered open-source platform for reacting flow simulations. Computer Physics Communications (2023) pdf, and in arxiv.

Solve PDE
Key papers
[9] Ziqi Liu, Wei Cai, Zhi-Qin John Xu*, Multi-scale Deep Neural Network (MscaleDNN) for Solving Poisson-Boltzmann Equation in Complex Domains, arxiv 2007.11207 (2020) Communications in Computational Physics (CiCP). pdf, and in web, some code is in github.
[50] Liangkai Hang, Dan Hu, Zhi-Qin John Xu*, Input gradient annealing neural network for solving low-temperature Fokker-Planck equations. arxiv 2405.00317 pdf, and in arxiv.
[20] Lulu Zhang, Tao Luo, Yaoyu Zhang, Weinan E, Zhi-Qin John Xu*, Zheng Ma*, MOD-Net: A Machine Learning Approach via Model-Operator-Data Network for Solving PDEs. Communications in Computational Physics (CiCP) (2022), arxiv 2107.03673 (2021) pdf, and in web and in arxiv.
[24] (Alphabetic order) Xi-An Li, Zhi-Qin John Xu, Lei Zhang*, Subspace Decomposition based DNN algorithm for elliptic type multi-scale PDEs. arxiv 2112.06660 (Journal of Computational Physics, 2023) pdf, and in arxiv.
More papers
[52] Zhiwei Wang, Lulu Zhang, Zhongwang Zhang, Zhi-Qin John Xu*, Loss Jump During Loss Switch in Solving PDEs with Neural Networks. Communications in Computational Physics, arxiv 2405.03095 pdf, and in arxiv.
[45] Xiong-Bin Yan, Keke Wu, Zhi-Qin John Xu, Zheng Ma*, An Unsupervised Deep Learning Approach for the Wave Equation Inverse Problem, arxiv 2311.04531. pdf, and in arxiv.
[42] Lulu Zhang, Wei Cai, Zhi-Qin John Xu, A Correction and Comments on “Multi-Scale Deep Neural Network (MscaleDNN) for Solving Poisson-Boltzmann Equation in Complex Domains CiCP, 28 (5): 1970–2001, 2020”, Commun. Comput. Phys., 33 (2023), pp. 1509-1513. pdf.
[38] Xiong-bin Yan, Zhi-Qin John Xu, Zheng Ma*, Laplace-fPINNs: Laplace-based fractional physics-informed neural networks for solving forward and inverse problems of subdiffusion, East Asian Journal on Applied Mathematics (Accepted). arxiv 2304.00909. pdf, and in arxiv.
[36] Xiong-bin Yan, Zhi-Qin John Xu, Zheng Ma*, Bayesian Inversion with Neural Operator (BINO) for Modeling Subdiffusion: Forward and Inverse Problems, arxiv 2211.11981. pdf, and in arxiv.
[34] Yifan Peng, Dan Hu*, Zhi-Qin John Xu, A Non-Gradient Method for Solving Elliptic Partial Differential Equations with Deep Neural Networks. Journal of Computational Physics, 2023. pdf, and in web.
[12] (Alphabetic order) Xi-An Li, Zhi-Qin John Xu*, Lei Zhang, A multi-scale DNN algorithm for nonlinear elliptic equations with multiple scales, arxiv 2009.14597 (2020), Communications in Computational Physics (CiCP). pdf, and in web, and in arxiv, some code is in github.
Note: [9] is a comprehensive version of [8].
[8] (Alphabetic order) Wei Cai, Zhi-Qin John Xu*, Multi-scale Deep Neural Networks for Solving High Dimensional PDEs, arxiv 1910.11710 (2019) pdf, and in arxiv

Others in Deep Learning Theory
[53] Leyang Zhang, Zhi-Qin John Xu, Tao Luo*, Yaoyu Zhang*, Limitation of characterizing implicit regularization by data-independent functions. Transactions on Machine Learning Research (2023) pdf, and in arxiv.
[32] Zhemin Li, Zhi-Qin John Xu, Tao Luo, Hongxia Wang*, A regularized deep matrix factorized model of matrix completion for image restoration, IET Image Processing (2022). web, and pdf, and in arxiv
[31] Shuyu Yin, Tao Luo, Peilin Liu, Zhi-Qin John Xu*, An Experimental Comparison Between Temporal Difference and Residual Gradient with Neural Network Approximation. arxiv 2205.12770 (2022) pdf, and in arxiv.
[22] Lulu Zhang, Zhi-Qin John Xu*, Yaoyu Zhang*, Data-informed Deep Optimization. PLoS ONE (2022) in web, arxiv 2107.08166 (2021) pdf, and in arxiv.
[21] Guangjie Leng, Yekun Zhu, Zhi-Qin John Xu*, Force-in-domain GAN inversion. arxiv 2107.06050 (2021) pdf, and in arxiv.

Computational Neuroscience
[8] Zhi-Qin John Xu, Xiaowei Gu, Chengyu Li, David Cai, Douglas Zhou*, David W. McLaughlin*, Neural networks of different species, brain areas and states can be characterized by the probability polling state, European Journal of Neuroscience (2020). pdf, and in web
[7] Zhi-Qin John Xu, Douglas Zhou, David Cai, Swift Two-sample Test on High-dimensional Neural Spiking Data, arXiv preprint: 1811.12314 (2018). pdf, and in web
[6] Zhi-Qin John Xu*, Fang Xu, Guoqiang Bi, Douglas Zhou*, David Cai, A Cautionary Tale of Entropic Criteria in Assessing the Validity of Maximum Entropy Principle, Europhysics Letters (2018). pdf, and in web
[5] Zhi-Qin John Xu, Jennifer Crodelle, Douglas Zhou*, David Cai, Maximum Entropy Principle Analysis in Network Systems with Short-time Recordings, Physical Review E, DOI: 10.1103/PhysRevE.99.022409 (2019). pdf, and in web
[4] Zhi-Qin John Xu*, Douglas Zhou*, David Cai, Dynamical and Coupling Structure of Pulse-Coupled Networks in Maximum Entropy Analysis, Entropy 2019, 21(1). pdf, and in web
[3] Zhi-Qin John Xu, Guoqiang Bi, Douglas Zhou*, and David Cai*, A dynamical state underlying the second order maximum entropy principle in neuronal networks, Communications in Mathematical Sciences, 15 (2017), pp. 665–692. pdf, and in web
[2] Douglas Zhou, Yanyang Xiao, Yaoyu Zhang, Zhiqin Xu, and David Cai*, Granger causality network reconstruction of conductance-based integrate-and-fire neuronal systems, PLoS ONE, 9 (2014). pdf, and in web
[1] Douglas Zhou, Yanyang Xiao, Yaoyu Zhang, Zhiqin Xu, and David Cai*, Causal and structural connectivity of pulse-coupled nonlinear networks, Physical Review Letters, 111 (2013), p. 054102. pdf, and in web