├── .gitignore
├── README.md
└── notes
    ├── aevb.md
    ├── alpha-divergence.md
    ├── bayesian-compress.md
    ├── bbb.md
    ├── blackbox-vi.md
    ├── concrete-dropout.md
    ├── cvi.md
    ├── deep-expo-families.md
    ├── interpret-cnn-compress.md
    ├── modern-vi.md
    ├── npn.md
    ├── perturbative-vi.md
    ├── smooth-svi.md
    ├── stein-var.md
    ├── svi.md
    ├── uncertainty-deep-learning.md
    ├── uncertainty-vision.md
    └── vprop.md

/.gitignore:
--------------------------------------------------------------------------------
.DS_Store
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# bayesian-deep-learning-notes
> One-phrase summaries of Bayesian deep learning papers.
> The papers are organised into the categories below; some of them overlap.

## (1). Uncertainty in deep learning
> Model uncertainty in deep learning via Bayesian modelling, e.g. with variational inference.

- [1705]. Concrete Dropout - [[arxiv](https://arxiv.org/abs/1705.07832)] [[Note](/notes/concrete-dropout.md)]
- [1703]. Dropout Inference in Bayesian Neural Networks with Alpha-divergences - [[arxiv](https://arxiv.org/abs/1703.02914)] [[Note](/notes/alpha-divergence.md)]
- [1703]. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? - [[arxiv](https://arxiv.org/abs/1703.04977)] [[Note](/notes/uncertainty-vision.md)]
- [2016]. Uncertainty in Deep Learning - [[PDF](https://pdfs.semanticscholar.org/a6af/62389c6655770c624e2fa3f3ad6dc26bf77e.pdf)] [[Blog](http://mlg.eng.cam.ac.uk/yarin/blog_2248.html)] [[Note](/notes/uncertainty-deep-learning.md)]
- [1505]. Weight Uncertainty in Neural Networks - [[arxiv](https://arxiv.org/abs/1505.05424)] [[Note](/notes/bbb.md)]
- [2015]. On Modern Deep Learning and Variational Inference - [[NIPS](http://www.approximateinference.org/accepted/GalGhahramani2015.pdf)] [[Note](/notes/modern-vi.md)]
- [1995]. Bayesian learning for neural networks

## (2). Probabilistic deep models
> Use probabilistic models to imitate deep neural networks.

- [1711]. Deep Gaussian Mixture Models - [[arxiv](https://arxiv.org/abs/1711.06929)]
- [1411]. Deep Exponential Families - [[arxiv](https://arxiv.org/pdf/1411.2581.pdf)] [[Note](/notes/deep-expo-families.md)]

## (3). Probabilistic neural networks
> Use probabilistic methods to perform the inference in neural networks.

- [1611]. Natural-Parameter Networks: A Class of Probabilistic Neural Networks - [[arxiv](https://arxiv.org/abs/1611.00448)] [[Note](/notes/npn.md)]

## (4). Approximate inference
> Approximate inference, and variational inference in particular, is the main building block of Bayesian deep learning.
> Variational inference: the main idea is to posit a family of distributions over the latent variables with its own parameters, called the *variational parameters*.

### (4.1) General
- [1712]. Vprop: Variational Inference using RMSprop - [[arxiv](https://arxiv.org/abs/1712.01038)] [[Note](/notes/vprop.md)]
- [1709]. Perturbative Black Box Variational Inference - [[arxiv](https://arxiv.org/abs/1709.07433)] [[Note](/notes/perturbative-vi.md)]
- [1703]. Conjugate-Computation Variational Inference: Converting Variational Inference in Non-Conjugate Models to Inferences in Conjugate Models - [[arxiv](https://arxiv.org/abs/1703.04265)] [[Note](/notes/cvi.md)]
- [1611]. Variational Inference via χ-Upper Bound Minimization - [[arxiv](https://arxiv.org/abs/1611.00328)]
- [1601]. Variational Inference: A Review for Statisticians - [[arxiv](https://arxiv.org/abs/1601.00670)]
- [1401]. Black Box Variational Inference - [[arxiv](https://arxiv.org/abs/1401.0118)] [[Note](/notes/blackbox-vi.md)]
- [2014]. Smoothed Gradients for Stochastic Variational Inference - [[NIPS](http://papers.nips.cc/paper/5557-smoothed-gradients-for-stochastic-variational-inference.pdf)] [[Note](/notes/smooth-svi.md)]
- [1206]. Stochastic Variational Inference - [[arxiv](https://arxiv.org/abs/1206.7051)] [[Note](/notes/svi.md)]
- [2011]. Practical Variational Inference for Neural Networks - [[NIPS](https://papers.nips.cc/paper/4329-practical-variational-inference-for-neural-networks)]
- [1999]. An Introduction to Variational Methods for Graphical Models - [[PDF](https://people.eecs.berkeley.edu/~jordan/papers/variational-intro.pdf)]

### (4.2) Reparametrization trick in variational inference
- [1506]. Variational Dropout and the Local Reparameterization Trick - [[arxiv](https://arxiv.org/abs/1506.02557)]
- [1401]. Stochastic Backpropagation and Approximate Inference in Deep Generative Models - [[arxiv](https://arxiv.org/abs/1401.4082)]
- [1312]. Auto-Encoding Variational Bayes - [[arxiv](https://arxiv.org/abs/1312.6114)] [[Note](/notes/aevb.md)]

### (4.3) Others
- [NA]. [A roadmap to research on EP](https://tminka.github.io/papers/ep/roadmap.html)
- [1608]. Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm - [[arxiv](https://arxiv.org/abs/1608.04471)] [[Note](/notes/stein-var.md)]

## (5) Continuous relaxation
> Use continuous distributions to approximate discrete random variables; the Concrete distribution, for example, is a continuous relaxation of a discrete random variable.

- [1611]. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables - [[arxiv](https://arxiv.org/abs/1611.00712)]
- [1611]. Categorical Reparameterization with Gumbel-Softmax - [[arxiv](https://arxiv.org/abs/1611.01144)]

## (6) Bayesian neural network pruning
> Sparse priors can be used to induce sparse weights or neurons in neural networks and hence favour smaller network structures, e.g. for mobile devices.

- [1711]. Interpreting Convolutional Neural Networks Through Compression - [[arXiv](https://arxiv.org/abs/1711.02329)] [[Note](/notes/interpret-cnn-compress.md)]
- [1705]. Structural compression of convolutional neural networks based on greedy filter pruning - [[arXiv](https://arxiv.org/abs/1705.07356)] [[Note](/notes/interpret-cnn-compress.md)]
- [1705]. Structured Bayesian Pruning via Log-Normal Multiplicative Noise - [[arxiv](https://arxiv.org/abs/1705.07283)]
- [1705]. Bayesian Compression for Deep Learning - [[arxiv](https://arxiv.org/abs/1705.08665)] [[Note](/notes/bayesian-compress.md)]
- [1701]. Variational Dropout Sparsifies Deep Neural Networks - [[arxiv](https://arxiv.org/abs/1701.05369)]

## Contribution
Any contribution is welcome. Note that we want a '*one-phrase summary*' that gives the reader an overview RATHER THAN just another list of papers. And please add yourself to the contributor list!
## Contributors
- [Jun Lu](https://github.com/junlulocky)
--------------------------------------------------------------------------------
/notes/aevb.md:
--------------------------------------------------------------------------------
## [Auto-Encoding Variational Bayes](https://arxiv.org/abs/1312.6114)

The authors propose a reparametrisation of the variational lower bound that reduces the variance of its gradient estimator, which is essential for the convergence of the variational inference optimisation.

--------------------------------------------------------------------------------
/notes/alpha-divergence.md:
--------------------------------------------------------------------------------
## [Dropout Inference in Bayesian Neural Networks with Alpha-divergences](https://arxiv.org/abs/1703.02914)

Dropout variational inference (VI) based on the KL divergence can severely underestimate model uncertainty, i.e. the uncertainty estimate is 'biased' (see Section 3.3.2 of [1], or [2]). In short, inspecting the definition of the KL divergence, let *q(w|.)* be the variational distribution and *p(w|.)* the posterior distribution we want to approximate. The KL divergence penalises *q(w)* for placing mass where *p(w|.)* has little or no mass, and penalises it far less for failing to place mass where *p(w|.)* has large mass. In short, there are three cases [4]:

- if *q* is high and *p* is high, then we are happy;
- if *q* is high and *p* is low, then we pay a price;
- if *q* is low, then we do not care.

The authors propose an alternative objective function based on the alpha-divergence that can overcome this problem.

Note:
- **the value of alpha in the alpha-divergence**: alpha-VI is mode-seeking for large alpha and mass-covering for smaller alpha. [3]

## External Reference
[1]. Gal, Yarin. Uncertainty in Deep Learning. PhD thesis, University of Cambridge, 2016.

[2]. Turner, Richard E., and Maneesh Sahani. "Two problems with variational expectation maximisation for time-series models." Bayesian Time Series Models (2011): 115-138.

[3]. Bamler, Robert, et al. "Perturbative Black Box Variational Inference." arXiv preprint arXiv:1709.07433 (2017).

[4]. Blei, D. M. Variational Inference (lecture notes).
--------------------------------------------------------------------------------
/notes/bayesian-compress.md:
--------------------------------------------------------------------------------
## [Bayesian Compression for Deep Learning](https://arxiv.org/abs/1705.08665)

The authors propose placing a sparsity-inducing prior over the neurons to induce sparsity in neural networks. The work builds on Bayesian neural networks. Two different sparse priors are considered in the paper: **the hyperparameter-free log-uniform prior** and **the half-Cauchy prior**.
--------------------------------------------------------------------------------
/notes/bbb.md:
--------------------------------------------------------------------------------
## [Weight Uncertainty in Neural Networks](https://arxiv.org/abs/1505.05424)

The authors work in a Bayesian neural network framework and propose a reparameterisation trick for updating the parameters in variational inference (i.e. the 'variational approximation to the Bayesian posterior distribution on the weights' in the paper). The authors also show how to perform the updates on mini-batches (in the style of stochastic gradient descent).
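
A minimal sketch of the reparameterised update described above, for a single Gaussian weight matrix — assuming a fully factorised Gaussian posterior with parameters (mu, rho), the softplus parameterisation of the standard deviation, and a simple 1/(number of mini-batches) weighting of the KL term; names, shapes and the dummy data are illustrative, not the paper's exact implementation.

```python
import torch

# Variational parameters of q(w) = N(mu, softplus(rho)^2) for one weight matrix.
mu = torch.zeros(784, 100, requires_grad=True)
rho = torch.full((784, 100), -3.0, requires_grad=True)

def sample_weight():
    """Reparameterised sample: w = mu + softplus(rho) * eps, with eps ~ N(0, I)."""
    eps = torch.randn_like(mu)
    return mu + torch.nn.functional.softplus(rho) * eps

def kl_to_standard_normal():
    """Closed-form KL( q(w) || N(0, I) ) for a factorised Gaussian."""
    sigma = torch.nn.functional.softplus(rho)
    return (-torch.log(sigma) + 0.5 * (sigma ** 2 + mu ** 2) - 0.5).sum()

optimizer = torch.optim.Adam([mu, rho], lr=1e-3)
num_batches = 1000                                   # assumed number of mini-batches
x, y = torch.randn(32, 784), torch.randn(32, 100)    # dummy mini-batch

# One mini-batch step: gradients flow to (mu, rho) through the sampled weight.
optimizer.zero_grad()
w = sample_weight()
nll = ((x @ w - y) ** 2).mean()                      # stand-in for the data term
loss = kl_to_standard_normal() / num_batches + nll
loss.backward()
optimizer.step()
```

Sampling the weight inside the loss is what makes the mini-batch update noisy but still trainable by ordinary gradient descent, which is the 'stochastic gradient descent like' behaviour mentioned above.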
--------------------------------------------------------------------------------
/notes/blackbox-vi.md:
--------------------------------------------------------------------------------
## [Black Box Variational Inference](https://arxiv.org/abs/1401.0118)

The authors show that reducing the variance of the gradient estimator is essential for the fast convergence of black-box variational inference. In practice, high-variance gradients require very small steps in the stochastic optimization and thus lead to slow convergence. The authors show how to reduce the variance in two ways: **Rao-Blackwellization** and **control variates**.
--------------------------------------------------------------------------------
/notes/concrete-dropout.md:
--------------------------------------------------------------------------------
## [Concrete Dropout](https://arxiv.org/abs/1705.07832)

The authors replace the discrete Bernoulli distribution of traditional dropout with a continuous relaxation (the Concrete distribution in the paper) to model the uncertainty of a dropout neural network. This relaxation lets us leverage the low-variance pathwise derivative estimator instead of the score-function estimator. A variance analysis of the different estimators can be found in Section 3 of [1].

## External Reference
[1]. Gal, Yarin. Uncertainty in Deep Learning. PhD thesis, University of Cambridge, 2016.
--------------------------------------------------------------------------------
/notes/cvi.md:
--------------------------------------------------------------------------------
## [Conjugate-Computation Variational Inference: Converting Variational Inference in Non-Conjugate Models to Inferences in Conjugate Models](https://arxiv.org/abs/1703.04265)

The authors propose a variational inference method for two classes of non-conjugate models.
- The first class contains models whose joint distribution over observed and hidden variables can be split into a conjugate part and a non-conjugate part. For such models the proposed gradient steps can be expressed as Bayesian inference in a conjugate model;
- The second class additionally allows conditionally-conjugate terms. For this model class, the proposed gradient steps can be written as a message-passing algorithm in which variational message passing (VMP) or stochastic variational inference (SVI) is used for the conjugate part, while stochastic gradients are employed for the rest.

The main concern of the paper is that naive stochastic gradient descent in variational inference may ignore the conjugate part of the lower bound (i.e. the loss). The conjugate terms in the lower bound might have a **closed-form** expression and may not require any stochastic approximation.

**Fun fact**: the authors provide an equivalent expression of gradient descent in equation (9) of the original paper.
--------------------------------------------------------------------------------
/notes/deep-expo-families.md:
--------------------------------------------------------------------------------
## [Deep Exponential Families](https://arxiv.org/pdf/1411.2581.pdf)

In Section 1 of [1], the author gives a hierarchical overview of neural networks. In the deep exponential families paper, the authors imitate the hierarchical structure of neural networks and replace each layer with distributions from exponential families.
Inference is performed with black-box variational inference.

## External Reference
[1]. Gal, Yarin. Uncertainty in Deep Learning. PhD thesis, University of Cambridge, 2016.
--------------------------------------------------------------------------------
/notes/interpret-cnn-compress.md:
--------------------------------------------------------------------------------
## 1. Structural compression of convolutional neural networks based on greedy filter pruning - [[arXiv](https://arxiv.org/abs/1705.07356)]

## 2. Interpreting Convolutional Neural Networks Through Compression - [[arXiv](https://arxiv.org/abs/1711.02329)]

The authors propose classification accuracy reduction (CAR). In CAR structural compression, the filter with the least effect on the classification accuracy is pruned at each iteration. Afterwards, a fine-tuning step is needed to recover accuracy.
--------------------------------------------------------------------------------
/notes/modern-vi.md:
--------------------------------------------------------------------------------
## [On Modern Deep Learning and Variational Inference](http://www.approximateinference.org/accepted/GalGhahramani2015.pdf)

In current deep learning research, *model architecture selection* is often solved empirically by a process of *trial and error*. Stochastic regularisation techniques such as dropout can slow down training but prevent over-fitting. The authors survey how stochastic regularisation techniques can be used in deep learning and propose some future research directions for the field.

The paper also gives a good review of Gaussian processes (GPs). The authors show that each GP covariance function has a one-to-one correspondence with a combination of neural network non-linearity and weight regularisation.
--------------------------------------------------------------------------------
/notes/npn.md:
--------------------------------------------------------------------------------
## [Natural-Parameter Networks: A Class of Probabilistic Neural Networks](https://arxiv.org/abs/1611.00448)

The authors propose a novel algorithm to train the parameters of neural networks (NNs). In traditional NNs, the forward propagation is deterministic, i.e. **o = a * W + b**. In this paper, however, the vector **o** is propagated in a *non-deterministic* way, i.e. via the mean and variance of an exponential-family distribution.
--------------------------------------------------------------------------------
/notes/perturbative-vi.md:
--------------------------------------------------------------------------------
## [Perturbative Black Box Variational Inference](https://arxiv.org/abs/1709.07433)

The drawback of the KL divergence is the following: let *q(w|.)* be the variational distribution and *p(w|.)* the posterior distribution we want to approximate. The KL divergence penalises *q(w)* for placing mass where *p(w|.)* has little or no mass, and penalises it far less for failing to place mass where *p(w|.)* has large mass ([see another note](/notes/alpha-divergence.md)). The authors construct a new variational bound that is tighter than the KL bound and **more mass-covering**. Compared to alpha-divergences, its reparameterization gradients have lower variance. In short, the authors choose a bound from a generalised family of evidence lower bounds (the f-ELBO) that yields a biased estimator with smaller variance, giving a careful bias-variance trade-off.
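
For reference, the standard identity behind this discussion: the log marginal likelihood splits into the ELBO plus the KL divergence from *q* to the posterior, so maximising the ELBO is exactly minimising KL(q‖p), the zero-forcing direction described above.

```latex
\log p(x)
  = \underbrace{\mathbb{E}_{q(w)}\!\big[\log p(x, w) - \log q(w)\big]}_{\text{ELBO}}
  \;+\;
  \underbrace{\mathrm{KL}\big(q(w)\,\|\,p(w \mid x)\big)}_{\ge 0},
\qquad
\mathrm{KL}\big(q\,\|\,p\big)
  = \mathbb{E}_{q(w)}\!\left[\log \frac{q(w)}{p(w \mid x)}\right].
```

The note below about deriving the ELBO from the marginal distribution of the data refers to exactly this decomposition.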
Note: the paper also contains a good review of how the ELBO can be derived from the marginal distribution of the data.
--------------------------------------------------------------------------------
/notes/smooth-svi.md:
--------------------------------------------------------------------------------
## [Smoothed Gradients for Stochastic Variational Inference](http://papers.nips.cc/paper/5557-smoothed-gradients-for-stochastic-variational-inference.pdf)

Stochastic variational inference updates the parameters with a weighted sum, which gives an unbiased gradient estimate. Smoothed gradients for stochastic variational inference instead use a window average, which gives a biased estimator but reduces the variance and thus speeds up convergence.
--------------------------------------------------------------------------------
/notes/stein-var.md:
--------------------------------------------------------------------------------
## [Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm](https://arxiv.org/abs/1608.04471)

A problem with variational inference is that the variational distribution is usually over-simplified and may be very different from the posterior distribution of interest. Stein variational gradient descent exploits Stein's identity and uses an iterative method to move the 'variational distribution' closer to the posterior distribution of interest.
--------------------------------------------------------------------------------
/notes/svi.md:
--------------------------------------------------------------------------------
## [Stochastic Variational Inference](https://arxiv.org/abs/1206.7051)

The authors claim that SVI is scalable to large datasets; it approximates the posterior distribution in conjugate exponential-family models with **local** and **global** hidden variables.

[Assumption]: stochastic variational inference makes the extra assumption that the **complete conditionals** are in the exponential family, where a complete conditional is the *conditional distribution of a hidden variable given the other **hidden variables** and the **observations***.

The variational distribution is assumed to be in the mean-field family, and each factor over the hidden variables is assumed to be in the same exponential family as the associated complete conditional distribution.

[Prior distribution for hidden variables]: the prior distribution for the hidden variables has the same form as the corresponding complete conditional distribution, as shown in equations (2) and (10) of the original paper. With these assumptions, each coordinate-ascent update can be computed in closed form rather than requiring gradient-based optimisation.

[The advantages of the mean-field family]:
- the entropy term in the ELBO decomposes;
- some further computational efficiencies appear in the coordinate update, as shown in equation (15) of the original paper.

[Stochastic]: The coordinate-ascent VI above is not efficient because each update needs to pass over the whole dataset. Instead, we form **intermediate global parameters** using the classical coordinate-ascent update in which the sampled data point is **repeated N times**, i.e. one data sample stands in for the whole dataset in one iteration; finally, we set the new global parameters to a **weighted average of the old estimate and the intermediate parameters**, and the authors prove that this is a form of **stochastic natural gradient** ascent.
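
Written out, the stochastic update described above takes the following form, where λ is the global variational parameter, λ̂_t is the intermediate estimate computed from a single data point repeated N times, and ρ_t is a step size satisfying the usual Robbins–Monro conditions (the symbols follow the standard presentation of SVI rather than necessarily matching the paper's notation):

```latex
\lambda_t = (1 - \rho_t)\,\lambda_{t-1} + \rho_t\,\hat{\lambda}_t,
\qquad
\rho_t = (t + \tau)^{-\kappa},
\qquad
\sum_{t} \rho_t = \infty, \quad \sum_{t} \rho_t^2 < \infty .
```

Because the complete conditionals are in the exponential family, λ̂_t has a closed form, so each iteration touches only one data point (or one mini-batch) instead of the whole dataset.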
--------------------------------------------------------------------------------
/notes/uncertainty-deep-learning.md:
--------------------------------------------------------------------------------
## [Uncertainty in Deep Learning](https://pdfs.semanticscholar.org/a6af/62389c6655770c624e2fa3f3ad6dc26bf77e.pdf)

Out-of-distribution test data: if the model has been trained on a dog dataset and we feed it cat pictures, it will give essentially random outputs. Bayesian deep learning can be leveraged to provide uncertainty information about the model outputs. In general, two kinds of uncertainty can be considered:
- **Epistemic uncertainty**: a large number of model parameters can explain the training data equally well, i.e. we are uncertain which parameters are 'true'; likewise, several model structures can explain the training data equally well, so we are also unsure about the model structure.
- **Aleatoric uncertainty**: the observed data themselves can be noisy.

Combining epistemic and aleatoric uncertainty yields the **predictive uncertainty**.

Deep learning in general produces point estimates and therefore does not provide such uncertainty information.

- Bayesian neural networks: model uncertainty in neural networks can be obtained by placing distributions over the weights.

Stochastic regularisation techniques (SRTs) can be used to model the uncertainty in neural networks. Popular SRTs include dropout, multiplicative Gaussian noise, DropConnect and many others.

- Drawback of the SRTs: we need to repeat the stochastic forward pass several times to obtain the uncertainty.

### Bayesian Modelling
Putting a prior distribution over the parameters represents our prior beliefs about them; after observing the training data, we can capture which parameters are more likely and which are less likely.

To predict new data points, we need to marginalise over the parameters. If the likelihood is conjugate to the prior distribution, this is analytically tractable; otherwise, we need approximate inference methods such as variational inference. Variational inference replaces the marginalisation of Bayesian modelling with optimisation, i.e. it replaces the integral with derivative calculations.
--------------------------------------------------------------------------------
/notes/uncertainty-vision.md:
--------------------------------------------------------------------------------
## [What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?](https://arxiv.org/abs/1703.04977)

This paper is complementary material to [1] on the different types of uncertainty in deep learning.
- **Epistemic uncertainty**: a large number of model parameters can explain the training data equally well, i.e. we are uncertain which parameters are 'true'; likewise, several model structures can explain the training data equally well, so we are also unsure about the model structure.
- **Aleatoric uncertainty**: the observed data themselves can be noisy.

Aleatoric uncertainty can further be categorized into
- **Homoscedastic** uncertainty: uncertainty that stays constant for different inputs;
- **Heteroscedastic** uncertainty: uncertainty that depends on the inputs to the model, with some inputs potentially having noisier outputs than others (see the loss sketch below).
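
For concreteness, heteroscedastic aleatoric uncertainty in regression is typically captured by letting the network predict a per-input variance alongside its prediction and minimising the Gaussian negative log-likelihood; a sketch of the standard form, with notation chosen here for illustration:

```latex
\mathcal{L}(\theta)
  = \frac{1}{N} \sum_{i=1}^{N}
    \left[
      \frac{\lVert y_i - f^{\theta}(x_i) \rVert^2}{2\,\sigma^{\theta}(x_i)^2}
      + \frac{1}{2} \log \sigma^{\theta}(x_i)^2
    \right].
```

The first term is down-weighted for inputs the model declares noisy, while the log-variance term keeps the model from declaring every input infinitely noisy; in practice the network usually outputs the log-variance directly for numerical stability.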
The paper further introduces how to model these uncertainties:
- Epistemic uncertainty is modeled by placing a **prior distribution** over the model's weights and then trying to capture how much these weights vary given the data. Such a method applied to neural networks is referred to as a Bayesian neural network (BNN);
- Aleatoric uncertainty is modeled by placing a distribution over the output of the model, e.g. Gaussian random noise on the output.

Most importantly, the authors show how to combine aleatoric and epistemic uncertainty in one model.

## External Reference
[1]. Gal, Yarin. Uncertainty in Deep Learning. PhD thesis, University of Cambridge, 2016.
--------------------------------------------------------------------------------
/notes/vprop.md:
--------------------------------------------------------------------------------
## [Vprop: Variational Inference using RMSprop](https://arxiv.org/abs/1712.01038)

Natural-gradient descent in the natural parameters of the variational distribution is equivalent to mirror descent in its mean parameters. Thanks to this connection, mean-field variational inference admits an RMSprop-style two-step update which is computationally efficient.
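
A rough, illustrative sketch of the flavour of such an update — an RMSprop-like loop in which the weight used for the gradient evaluation is perturbed by Gaussian noise whose variance is tied to the second-moment estimate, and the (un-square-rooted) second moment acts as the preconditioner. The constants, the precise noise variance and the prior term below are assumptions for illustration, not the paper's exact algorithm.

```python
import numpy as np

def rmsprop_style_variational_step(grad_fn, mu, s, n_data,
                                   lr=1e-2, beta=0.1, prior_prec=1.0):
    """One illustrative step of a mean-field Gaussian, RMSprop-style update.

    mu: mean of the Gaussian over the weights; s: second-moment estimate.
    """
    # Sample a weight from q(w) = N(mu, sigma^2); the variance shrinks as the
    # curvature estimate s (and the amount of data) grows.
    sigma2 = 1.0 / (n_data * (s + prior_prec))
    w = mu + np.sqrt(sigma2) * np.random.randn(*mu.shape)

    g = grad_fn(w)                           # stochastic gradient at the sampled weight
    s = (1.0 - beta) * s + beta * g ** 2     # RMSprop-style second-moment estimate
    mu = mu - lr * (g + prior_prec * mu / n_data) / (s + prior_prec)
    return mu, s

# Tiny usage example on a quadratic objective (purely illustrative).
mu, s = np.zeros(5), np.ones(5)
grad_fn = lambda w: 2.0 * (w - 1.0)          # gradient of ||w - 1||^2
for _ in range(2000):
    mu, s = rmsprop_style_variational_step(grad_fn, mu, s, n_data=100)
```

--------------------------------------------------------------------------------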