├── LICENSE
├── README.md
└── talk
    └── 20190322
        ├── Mean-field theory and dynamical isometry of deep neural networks.pdf
        ├── Mean-field theory and dynamical isometry of deep neural networks.pptx
        └── abstract.md

/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2019 fwcore

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
## Slides

[Mean-field theory and dynamical isometry of deep neural networks](https://github.com/fwcore/mean-field-theory-deep-learning/blob/master/talk/20190322/Mean-field%20theory%20and%20dynamical%20isometry%20of%20deep%20neural%20networks.pdf)

## Keywords
* mean field theory
* central limit theorem
* dynamical isometry
* Jacobian matrix
* dynamical system
* fixed points
* eigenvalues
* singular values

## Key people
* [Samuel S. Schoenholz](https://samschoenholz.wordpress.com/), Google Brain
  - focuses on using notions from statistical physics to better understand neural networks.
  - Ph.D. in Physics with Andrea Liu at the University of Pennsylvania, focused on understanding the behavior of disordered solids and glassy liquids from their structure. Central to this approach was the use of machine learning to identify local structural motifs that are particularly susceptible to rearrangement.

* Jeffrey Pennington, Google Brain
  - previously a postdoctoral fellow at Stanford University, a member of the Stanford Artificial Intelligence Laboratory in the Natural Language Processing (NLP) group. He received his Ph.D. in theoretical particle physics from Stanford University while working at the SLAC National Accelerator Laboratory.
  - Jeffrey’s research interests are multidisciplinary, ranging from the development of calculational techniques in perturbative quantum field theory to the vector representation of words and phrases in NLP to the study of trainability and expressivity in deep learning. Recently, his work has focused on building a set of theoretical tools with which to study deep neural networks.
    Leveraging techniques from random matrix theory and free probability, Jeffrey has investigated the geometry of neural network loss surfaces and the learning dynamics of very deep neural networks. He has also developed a new framework to begin harnessing the power of random matrix theory in applications with nonlinear dependencies, like deep learning.
  - [Theories of Deep Learning (STATS 385): Harnessing the Power of Random Matrix Theory to Study and Improve Deep Learning](https://stats385.github.io/pennington_lecture), Stanford University, Fall 2017

## Papers

* Mean Field Analysis of Deep Neural Networks | [arXiv:1903.04440](https://arxiv.org/abs/1903.04440)
  - asymptotic behavior of MLPs in the limit of large network size and a large number of training iterations
  - characterization of the evolution of the parameters in terms of their initialization
  - the limit is a system of integro-differential equations

* Mean-field Analysis of Batch Normalization | [arXiv:1903.02606](https://arxiv.org/abs/1903.02606)
  - analytically quantify the impact of BatchNorm on the geometry of the loss landscape for multi-layer networks consisting of fully-connected and convolutional layers
  - BatchNorm has a flattening effect on the loss landscape, as quantified by the maximum eigenvalue of the Fisher information matrix, enabling the use of larger learning rates
  - quantitative characterization of the maximal allowable learning rate to ensure convergence
  - suggest that networks with smaller values of the BatchNorm parameter achieve lower loss after the same number of training epochs

* A Mean Field Theory of Batch Normalization | [arXiv:1902.08129](https://arxiv.org/abs/1902.08129)
  - provide a precise characterization of signal propagation and gradient backpropagation in wide batch-normalized fully-connected feedforward networks at initialization
  - BatchNorm causes gradient signals to grow exponentially with depth, and these exploding gradients cannot be eliminated by tuning the initial weight variances or by adjusting the nonlinear activation function
  - gradient explosion can be reduced by tuning the network close to the linear regime

* Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent | [arXiv:1902.06720](https://arxiv.org/abs/1902.06720)
  - for wide neural networks the learning dynamics simplify considerably and, in the infinite-width limit, are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters (a minimal numerical sketch of this linearization appears after the next entry)
  - find excellent empirical agreement between the predictions of the original network and those of the linearized version, even for finite, practically-sized networks; this agreement is robust across different architectures, optimization methods, and loss functions

* Dynamical Isometry and a Mean Field Theory of LSTMs and GRUs | [arXiv:1901.08987](https://arxiv.org/abs/1901.08987)
  - develop a mean field theory of signal propagation in LSTMs and GRUs that enables us to calculate the time scales for signal propagation as well as the spectral properties of the state-to-state Jacobians
  - derive a novel initialization scheme that eliminates or reduces training instabilities, enabling successful training where a standard initialization either fails completely or is orders of magnitude slower
  - observe a beneficial effect on generalization performance with this new initialization
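
The sketch below illustrates the linearization claim of arXiv:1902.06720 on a tiny example: for a small random two-layer network, the first-order Taylor expansion in the parameters, f(θ₀) + J(θ₀)·δθ, closely tracks f(θ₀ + δθ) for a small parameter step. The network sizes, the tanh nonlinearity, the 1/√(fan-in) scaling, and the finite-difference Jacobian are illustrative assumptions of this sketch, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 3, 256, 2
x = rng.normal(size=d_in)

def unpack(theta):
    # Split the flat parameter vector into the two weight matrices.
    W1 = theta[: d_in * d_hid].reshape(d_hid, d_in)
    W2 = theta[d_in * d_hid :].reshape(d_out, d_hid)
    return W1, W2

def f(theta):
    # Two-layer tanh network with 1/sqrt(fan-in) scaling, evaluated on a fixed input x.
    W1, W2 = unpack(theta)
    return W2 @ np.tanh(W1 @ x / np.sqrt(d_in)) / np.sqrt(d_hid)

theta0 = rng.normal(size=d_in * d_hid + d_hid * d_out)

# Jacobian of the outputs w.r.t. the parameters at initialization (central differences).
eps = 1e-5
J = np.zeros((d_out, theta0.size))
for i in range(theta0.size):
    e = np.zeros_like(theta0)
    e[i] = eps
    J[:, i] = (f(theta0 + e) - f(theta0 - e)) / (2 * eps)

dtheta = 1e-2 * rng.normal(size=theta0.size)  # a small parameter update
print("exact:     ", f(theta0 + dtheta))
print("linearized:", f(theta0) + J @ dtheta)  # close for small steps; closer as width grows
```
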
* Information Geometry of Orthogonal Initializations and Training | [arXiv:1810.03785](https://arxiv.org/abs/1810.03785)
  - show a novel connection between the maximum curvature of the optimization landscape (gradient smoothness), as measured by the Fisher information matrix, and the maximum singular value of the input-output Jacobian
  - partially explains why more isometric networks can train much faster
  - experimentally investigate the benefits of maintaining orthogonality throughout training
  - critical orthogonal initializations do not trivially give rise to a mean field limit of the pre-activations for each layer

* Dynamical Isometry is Achieved in Residual Networks in a Universal Way for any Activation Function | [arXiv:1809.08848](https://arxiv.org/abs/1809.08848)
  - demonstrate that in residual neural networks (ResNets) dynamical isometry is achievable irrespective of the activation function used
  - initialization acts as a confounding factor between the choice of activation function and the rate of learning; this can be resolved in ResNets by ensuring the same level of dynamical isometry at initialization

* Mean Field Analysis of Neural Networks: A Central Limit Theorem | [arXiv:1808.09372](https://arxiv.org/abs/1808.09372)
  - asymptotic regime of simultaneously (A) large network sizes and (B) large numbers of stochastic gradient descent training iterations
  - rigorously prove that the neural network satisfies a central limit theorem
  - describes the neural network's fluctuations around its mean-field limit
  - the fluctuations have a Gaussian distribution and satisfy a stochastic partial differential equation

* Dynamical Isometry and a Mean Field Theory of RNNs: Gating Enables Signal Propagation in Recurrent Neural Networks | [arXiv:1806.05394](https://arxiv.org/abs/1806.05394)
  - develop a theory of signal propagation in recurrent networks after random initialization, using a combination of mean field theory and random matrix theory
  - the theory defines a maximum timescale over which RNNs can remember an input, which predicts trainability
  - gated recurrent networks feature a much broader, more robust trainable region than vanilla RNNs
  - develop a closed-form critical initialization scheme that achieves dynamical isometry in both vanilla RNNs and minimalRNNs

* Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks | [arXiv:1806.05393](https://arxiv.org/abs/1806.05393)
  - it is possible to train vanilla CNNs with ten thousand layers or more simply by using an appropriate initialization scheme
  - develop a mean field theory for signal propagation in CNNs and characterize the conditions for dynamical isometry
  - these conditions require that the convolution operator be an orthogonal transformation in the sense that it is norm-preserving
  - present an algorithm for generating such random initial orthogonal convolution kernels (a minimal sketch of one such construction follows this entry)
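
The sketch below shows one way to build the kind of orthogonal ("delta-orthogonal"-style) convolution kernel referred to in arXiv:1806.05393: the kernel is zero at every spatial offset except the centre tap, which holds a random orthogonal channel-mixing matrix, so the convolution is norm-preserving. The tensor layout, the equal channel counts, and the Haar sampling via QR are assumptions of this sketch; the paper's full construction also handles unequal channel counts.

```python
import numpy as np

def delta_orthogonal_kernel(k, c, rng):
    """k x k convolution kernel with c input and c output channels: zero everywhere
    except the centre tap, which holds a Haar-random orthogonal channel-mixing matrix."""
    a = rng.normal(size=(c, c))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))        # sign fix so q is Haar-distributed
    kernel = np.zeros((k, k, c, c))    # layout: (height, width, in_channels, out_channels)
    kernel[k // 2, k // 2] = q         # a "delta" in space, an orthogonal map in channels
    return kernel

rng = np.random.default_rng(0)
W = delta_orthogonal_kernel(3, 8, rng)
centre = W[1, 1]
print(np.allclose(centre.T @ centre, np.eye(8)))   # True: the channel map is orthogonal
print(np.abs(W).sum(axis=(2, 3)))                  # all mass sits at the centre spatial tap
```
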
* Universal Statistics of Fisher Information in Deep Neural Networks: Mean Field Approach | [arXiv:1806.01316](https://arxiv.org/abs/1806.01316)
  - reveals novel statistics of the Fisher information matrix (FIM) that are universal among a wide class of DNNs
  - investigates the asymptotic statistics of the FIM's eigenvalues and reveals that most of them are close to zero while the maximum eigenvalue takes a huge value, implying that the eigenvalue distribution has a long tail
  - because the local geometry of parameter space is defined by the FIM, the landscape is locally flat in most dimensions but strongly distorted in others
  - the small eigenvalues that induce flatness can be connected to a norm-based capacity measure of generalization ability
  - the maximum eigenvalue that induces the distortion enables a quantitative estimate of an appropriately sized learning rate for gradient methods to converge

* The Emergence of Spectral Universality in Deep Networks | [arXiv:1802.09979](https://arxiv.org/abs/1802.09979)
  - builds a full theoretical understanding of the spectra of input-output Jacobians at initialization
  - leverages powerful tools from free probability theory to provide a detailed analytic understanding of how a deep network's Jacobian spectrum depends on various hyperparameters, including the nonlinearity, the weight and bias distributions, and the depth
  - for a variety of nonlinearities, reveals the emergence of new universal limiting spectral distributions that remain concentrated around one even as the depth goes to infinity

* Mean Field Residual Networks: On the Edge of Chaos | [arXiv:1712.08969](https://arxiv.org/abs/1712.08969)
  - study randomly initialized residual networks using mean field theory and the theory of difference equations
  - classical feedforward neural networks, such as those with tanh activations, exhibit exponential behavior on average when propagating inputs forward or gradients backward; the exponential forward dynamics causes rapid collapse of the input space geometry, while the exponential backward dynamics causes drastic vanishing or exploding gradients
  - in contrast, by adding skip connections, the network will, depending on the nonlinearity, adopt subexponential forward and backward dynamics, in many cases in fact polynomial
  - in terms of the "edge of chaos" hypothesis, these subexponential and polynomial laws allow residual networks to "hover over the boundary between stability and chaos," thus preserving the geometry of the input space and the gradient information flow
  - common initializations such as the Xavier or He schemes are not optimal for residual networks, because the optimal initialization variances depend on the depth

* Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice | [arXiv:1711.04735](https://arxiv.org/abs/1711.04735)
  - extend previous results for linear DNNs to the nonlinear setting
  - explore the dependence of the singular value distribution on the depth of the network, the weight initialization, and the choice of nonlinearity
  - ReLU networks are incapable of dynamical isometry
  - sigmoidal networks can achieve dynamical isometry, but only with orthogonal weight initialization (a small numerical illustration of the orthogonal-vs-Gaussian contrast follows this entry)
  - demonstrate empirically that deep nonlinear networks achieving dynamical isometry learn orders of magnitude faster than networks that do not
  - show that properly-initialized deep sigmoidal networks consistently outperform deep ReLU networks
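
The sketch below illustrates the orthogonal-vs-Gaussian contrast discussed in arXiv:1711.04735 (and the Jacobian-spectrum theme of arXiv:1802.09979): it propagates a random input through a deep tanh network, accumulates the input-output Jacobian layer by layer, and compares the spread of its singular values for i.i.d. Gaussian weights versus random orthogonal weights. The width, depth, unit gain, zero biases, and tanh nonlinearity are illustrative choices rather than the papers' exact experimental settings; the point is the qualitative gap in conditioning, not the precise numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 400, 50   # width and depth (illustrative)

def jacobian_singular_values(orthogonal):
    # Propagate one input through a deep tanh net and accumulate the input-output Jacobian.
    x = rng.normal(size=N)
    J = np.eye(N)
    for _ in range(L):
        if orthogonal:
            W, r = np.linalg.qr(rng.normal(size=(N, N)))
            W = W * np.sign(np.diag(r))                 # Haar-random orthogonal weights
        else:
            W = rng.normal(size=(N, N)) / np.sqrt(N)    # i.i.d. Gaussian, variance 1/N
        h = W @ x                                       # pre-activations
        D = np.diag(1.0 - np.tanh(h) ** 2)              # diagonal matrix of tanh'(h)
        J = D @ W @ J                                   # chain rule, layer by layer
        x = np.tanh(h)
    return np.linalg.svd(J, compute_uv=False)

for name, ortho in [("gaussian  ", False), ("orthogonal", True)]:
    s = jacobian_singular_values(ortho)
    print(name, "largest / smallest singular value:", s[0], s[-1])
# At this depth the orthogonal spectrum stays far better conditioned than the Gaussian one.
```
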
* Deep Information Propagation | [arXiv:1611.01232](https://arxiv.org/abs/1611.01232)
  - study the behavior of untrained neural networks whose weights and biases are randomly distributed, using mean field theory
  - show the existence of depth scales that naturally limit the maximum depth of signal propagation through these random networks
  - arbitrarily deep networks may be trained only sufficiently close to criticality
  - the presence of dropout destroys the order-to-chaos critical point and therefore strongly limits the maximum trainable depth of random networks
  - develop a mean field theory for backpropagation and show that the ordered and chaotic phases correspond to regions of vanishing and exploding gradients, respectively
--------------------------------------------------------------------------------
/talk/20190322/Mean-field theory and dynamical isometry of deep neural networks.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fwcore/mean-field-theory-deep-learning/407afd9c1972b20897a81e1f5125405f0046d50e/talk/20190322/Mean-field theory and dynamical isometry of deep neural networks.pdf
--------------------------------------------------------------------------------
/talk/20190322/Mean-field theory and dynamical isometry of deep neural networks.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fwcore/mean-field-theory-deep-learning/407afd9c1972b20897a81e1f5125405f0046d50e/talk/20190322/Mean-field theory and dynamical isometry of deep neural networks.pptx
--------------------------------------------------------------------------------
/talk/20190322/abstract.md:
--------------------------------------------------------------------------------
# Mean-field theory and dynamical isometry of neural networks

## Speaker: Feng Wang

Abstract: Initialization, activation function, and batch normalization are known to have a strong impact on the training and generalization of deep neural networks. Heuristic arguments have provided insightful pictures of the underlying mechanisms; however, a fundamental understanding remains elusive, so the training and design of neural networks still rely heavily on experience. Recently, researchers from Google Brain have started to build a theoretical framework to understand the impact of initialization, activation function, and batch normalization, as well as of architectures such as CNNs, ResNets, and gated RNNs. The proposed mean-field theory focuses on the information flow through each layer by investigating how the covariance matrices of the pre-activations/gradients of each layer are preserved, in the limit that (1) each layer is wide and (2) the network is deep.

In this talk, I will first briefly outline the theoretical framework and present its predictions, especially the predicted initialization strategies and the counterintuitive results on batch normalization. Then I will dive into the theory and discuss its assumptions and possible pitfalls. The emphasis will be on the ideas behind the theory rather than the mathematical derivations.
--------------------------------------------------------------------------------
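
As a companion to the abstract above (and the "Deep Information Propagation" entry in the README), here is a minimal sketch of the mean-field recursion it refers to: the pre-activation variance map q ← σ_w² E_z[φ(√q·z)²] + σ_b² for φ = tanh, and the quantity χ₁ = σ_w² E_z[φ'(√q*·z)²] evaluated at the fixed point q*, whose crossing of 1 marks the order-to-chaos transition. The tanh nonlinearity, the Monte-Carlo estimate of the Gaussian averages, and the specific (σ_w, σ_b) values are illustrative assumptions of this sketch.

```python
import numpy as np

z = np.random.default_rng(0).normal(size=200_000)   # samples for the Gaussian averages E_z[...]

def q_star(sigma_w, sigma_b, iters=200):
    # Iterate the variance map q <- sigma_w^2 E[tanh(sqrt(q) z)^2] + sigma_b^2 to its fixed point.
    q = 1.0
    for _ in range(iters):
        q = sigma_w ** 2 * np.mean(np.tanh(np.sqrt(q) * z) ** 2) + sigma_b ** 2
    return q

def chi_1(sigma_w, sigma_b):
    # Slope of the correlation map at its fixed point: sigma_w^2 E[tanh'(sqrt(q*) z)^2].
    q = q_star(sigma_w, sigma_b)
    return sigma_w ** 2 * np.mean((1.0 - np.tanh(np.sqrt(q) * z) ** 2) ** 2)

for sigma_w in (0.8, 1.0, 1.5, 2.0, 3.0):
    print(f"sigma_w = {sigma_w:.1f}  chi_1 = {chi_1(sigma_w, 0.3):.3f}")
# chi_1 < 1: ordered phase (correlations and gradients contract with depth);
# chi_1 > 1: chaotic phase; chi_1 = 1 is the critical line ("edge of chaos").
```
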