├── Classification_Eq.png
├── Eq.png
├── Eq1.png
├── Eq2.png
├── Eq3.png
├── Eq4.png
├── Eq5.png
├── README.md
├── Regression_Eq.png
├── SegNet_Architecture.png
├── encoder_explained.png
└── upsampling_indices.png

/Classification_Eq.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/shreyavshetty/BayesianDeepLearning/6564dd3f0eff74b32d112053a102023232c0155b/Classification_Eq.png
--------------------------------------------------------------------------------
/Eq.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/shreyavshetty/BayesianDeepLearning/6564dd3f0eff74b32d112053a102023232c0155b/Eq.png
--------------------------------------------------------------------------------
/Eq1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/shreyavshetty/BayesianDeepLearning/6564dd3f0eff74b32d112053a102023232c0155b/Eq1.png
--------------------------------------------------------------------------------
/Eq2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/shreyavshetty/BayesianDeepLearning/6564dd3f0eff74b32d112053a102023232c0155b/Eq2.png
--------------------------------------------------------------------------------
/Eq3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/shreyavshetty/BayesianDeepLearning/6564dd3f0eff74b32d112053a102023232c0155b/Eq3.png
--------------------------------------------------------------------------------
/Eq4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/shreyavshetty/BayesianDeepLearning/6564dd3f0eff74b32d112053a102023232c0155b/Eq4.png
--------------------------------------------------------------------------------
/Eq5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/shreyavshetty/BayesianDeepLearning/6564dd3f0eff74b32d112053a102023232c0155b/Eq5.png
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # BayesianDeepLearning
2 |
3 | Reference: Machine Learning: A Probabilistic Perspective, Kevin Murphy
4 |
5 | In machine learning, probability is used to model concepts.
6 | Two ways to look at probability:
7 | - Frequentist Interpretation: Probability represents the long-run frequency of events. E.g. flip a fair coin many times and it lands heads about half the time.
8 | - Bayesian Interpretation: Probability quantifies the uncertainty about something. It relates to information rather than to the number of trials. E.g. the coin is equally likely to land heads or tails on the next toss.
9 |
10 | The Bayesian approach has an advantage: it can model the uncertainty of an event that does not occur frequently. For example, the probability of the ice melting by 2020 CE. This event occurs either once or never, so it has no long-run frequency. The probability indicates how certain or uncertain we are when stating that the ice will melt by 2020 CE.
11 |
12 | Overview of Probability Theory:
13 | A random variable, say X, is a variable whose possible values are numerical outcomes of a random phenomenon.
14 | P(X) -> Probability that the event X is true.
15 | P(~X) -> Probability that the event X is not true.
P(~X) = 1 - P(X)
16 | Two types of Random Variables:
17 | - Discrete Random Variable: A random variable that takes only a countable number of distinct values
18 |   P(X=x) -> PMF -> Probability Mass Function
19 | - Continuous Random Variable: A random variable that takes an uncountably infinite number of possible values
20 |   p(x) -> PDF -> Probability Density Function (for a continuous variable P(X=x) is 0, so probabilities come from integrating the density over an interval)
21 |
22 | Fundamental Rules:
23 | - Union of 2 Events:
24 |   P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
25 |
26 | - If two events are independent of each other then:
27 |   P(A, B) = P(A) P(B)
28 |
29 | - Joint Probabilities:
30 |   P(A, B) = P(A ∩ B) = P(A | B) P(B)
31 |
32 | Given a joint distribution on two events P(A,B), we define the marginal distribution as follows:
33 | P(A) = Σ_b P(A, B=b)
34 |
35 | Sum Rule:
36 | P(A) = Σ_b P(A | B=b) P(B=b)
37 | Product Rule:
38 | P(A, B) = P(A | B) P(B)
39 | Conditional Probability of event A, given that event B is true, is as follows:
40 | P(A | B) = P(A, B) / P(B), if P(B) > 0
41 |
42 | Bayes Rule:
43 | P(A | B) = P(B | A) P(A) / P(B)
44 | This Bayesian probability talks about partial beliefs and calculates the validity of a proposition. This calculation is based on:
45 | - Prior Estimate
46 | - New relevant evidence
47 |
48 | Probability of a hypothesis h given data D - P(h | D) = P(D | h) P(h) / P(D)
49 | Goal of Bayes Learning is to achieve the most probable hypothesis - (MAP) - h_MAP = argmax_h P(h | D)
50 | Since P(D) is independent of the hypothesis, it can be eliminated. If we also assume that P(h) is equal for all hypotheses (a uniform prior), it can be eliminated too. The final equation then reduces to maximizing the likelihood P(D | h).
51 |
52 |
53 | Overview of the following papers published on Bayesian Deep Learning:
54 | 1. Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding
55 | 2. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?
56 | 3.
Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics
57 |
58 | ## Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding
59 |
60 | - Author: Alex Kendall, Vijay Badrinarayanan, Roberto Cipolla - University of Cambridge
61 | - Published: 10 Oct 2016
62 | - Link: [Paper](https://arxiv.org/pdf/1511.02680.pdf)
63 | ### Aim
64 | This paper aims to provide a probabilistic approach to pixel-wise segmentation with SegNet. It extends the deep convolutional encoder-decoder neural network architecture to a Bayesian CNN which can produce a probabilistic segmentation output.
65 | ### Result
66 | Modelling uncertainty improved the performance by 2-3% compared to state-of-the-art architectures like SegNet, FCN and Dilation Network. A significant improvement in performance was observed on smaller datasets.
67 | ### Overview
68 | **Pixel-wise Semantic Segmentation -**
69 | Segmentation in Computer Vision implies partitioning an image into coherent parts without understanding what each component represents. Semantic Segmentation in particular implies that the image is partitioned into semantically meaningful components. When the same goal is achieved by classifying each pixel, it is termed pixel-wise semantic segmentation.
70 |
71 | **Related Work**
72 | - Non Deep Learning Approaches
73 |   - TextonBoost
74 |   - TextonForest
75 |   - Random Forest Based Classifiers
76 | - Deep Learning Approaches - Core Segmentation Engine
77 |   - SegNet
78 |   - Fully Convolutional Networks (FCN)
79 |   - Dilation Network
80 | None of the above methods produce a probabilistic segmentation with a measure of model uncertainty.
81 | - Bayesian Deep Learning Approaches
82 |   - Bayesian neural networks [Paper1](https://papers.nips.cc/paper/419-transforming-neural-net-output-levels-to-probability-distributions.pdf) [Paper2](https://authors.library.caltech.edu/13793/).
They offer a probabilistic interpretation of deep learning models by inferring distributions over the network's weights. They are often computationally very expensive, increasing the number of model
83 | parameters without increasing model capacity significantly.
84 |   - Performing inference in Bayesian neural networks is a difficult task, and approximations to the model posterior are
85 | often used, such as variational inference [Paper](https://papers.nips.cc/paper/4329-practical-variational-inference-for-neural-networks)
86 |   - Training with stochastic gradient descent, using dropout to randomly remove units. During test time, standard dropout approximates the effect of averaging the predictions of all these thinned networks by using the weights of the unthinned network. This is referred to as weight averaging. [Paper](http://jmlr.org/papers/volume15/srivastava14a.old/srivastava14a.pdf)
87 |   - Dropout as approximate Bayesian inference over the network’s weights. [Paper](https://arxiv.org/pdf/1506.02158.pdf)
88 |
89 | **SegNet Architecture -** Since the paper aims to provide a Bayesian approach to the existing SegNet architecture, the following gives a brief overview of the architecture itself [Paper](https://arxiv.org/pdf/1511.00561) [Blog](https://saytosid.github.io/segnet/):
90 | ![alt text](https://github.com/shreyavshetty/BayesianDeepLearning/blob/master/SegNet_Architecture.png "SegNet Architecture")
91 | - SegNet is a deep convolutional encoder-decoder architecture which consists of a sequence of non-linear processing layers (encoders) and a corresponding set of decoders followed by a pixel-wise classifier.
92 | - Each encoder consists of one or more convolutional layers with batch normalisation and a ReLU non-linearity, followed by non-overlapping max-pooling and sub-sampling. The sparse encoding due to the pooling is upsampled in the decoder using max-pooling indices in the encoding sequences.
It helps in retaining the class boundary details in the segmented images and also in reducing the total number of model parameters.
93 | - Encoder Architecture:
94 | ![alt text](https://github.com/shreyavshetty/BayesianDeepLearning/blob/master/encoder_explained.png "Encoder Architecture")
95 |   - 13 VGG16 Conv layers
96 |   - No fully connected layers (this reduces the number of parameters)
97 |   - Good initial weights are available
98 |   - Max-pooling and sub-sampling - translation invariance is achieved, but the feature map size reduces, leading to a lossy image representation with blurred boundaries. Hence, upsampling is done in the decoder.
99 | - Decoder Architecture:
100 |   - For each of the encoders there is a corresponding decoder which upsamples the feature map using memorised max-pooling indices. To do that, boundary information must be captured and stored in the encoder feature maps before sub-sampling. To do this in a space-efficient way, SegNet stores only the max-pooling indices, i.e. the location of the maximum feature value in each pooling window is memorised for each encoder feature map. Only 2 bits are needed for each 2x2 window. This costs a slight loss of precision, which is accepted as a tradeoff.
101 | ![alt text](https://github.com/shreyavshetty/BayesianDeepLearning/blob/master/upsampling_indices.png "Upsampling Index")
102 |   - Sparse feature maps of higher resolutions are produced
103 |   - Sparse maps are fed through a trainable filter bank to produce dense feature maps
104 |   - The last decoder is connected to a softmax classifier which classifies each pixel
105 | For each of the 13 encoders there is a corresponding decoder. The model is trained end to end using stochastic gradient descent.
106 |
107 | **Key aspects of Bayesian SegNet Architecture**
108 |
109 | - It is necessary to find the posterior distribution over the convolutional weights, W, given observed training data X and labels Y: p(W | X, Y). This posterior is intractable. Hence, it is approximated using variational inference.
110 | - Let q(W) be an approximating distribution over the network's weights, found by minimizing the Kullback-Leibler (KL) divergence between this approximating distribution and the full posterior: KL(q(W) || p(W | X, Y)) [Paper](https://arxiv.org/pdf/1506.02158.pdf)
111 | - Minimising the cross entropy loss objective function has the effect of minimising the Kullback-Leibler divergence term
112 | - Training the network with stochastic gradient descent therefore minimizes the divergence term. This encourages the model to learn a distribution of weights which explains the data well while preventing overfitting.
113 | - Use dropout to form a probabilistic encoder-decoder architecture. The dropout probability is kept at 0.5. Sample the posterior distribution over the weights at test time using dropout to obtain the posterior distribution of softmax class probabilities. Take the mean of these samples for the segmentation prediction and use the variance to output model uncertainty for each class. The mean of the per-class variance measurements is used as an overall measure of model uncertainty.
114 | - We use batch normalisation layers after every convolutional layer
115 | - The whole system is trained end-to-end using stochastic gradient descent with a base learning rate of 0.001 and weight decay parameter equal to 0.0005. We train the network until convergence, when we observe no further reduction in training loss.
116 | - Probabilistic Variants of this architecture:
117 |   - Bayesian Encoder: insert dropout after each encoder unit.
118 |   - Bayesian Decoder: insert dropout after each decoder unit.
119 |   - Bayesian Encoder-Decoder: insert dropout after each encoder and decoder unit.
120 |   - Bayesian Center: insert dropout after the deepest encoder, between the encoder and decoder stage.
121 |   - Bayesian Central Four Encoder-Decoder: insert dropout after the central four encoder and decoder units.
122 |   - Bayesian Classifier: insert dropout after the last decoder unit, before the classifier.
123 |
124 | **Comparing Weight Averaging and Monte Carlo Dropout Sampling**
125 | Weight averaging proposes to remove dropout at test time and scale the weights proportionally to the dropout percentage. Monte Carlo sampling with dropout performs better than weight averaging after approximately 6 samples. The weight
126 | averaging technique produces poorer segmentation results, in terms of global accuracy, in addition to being unable to
127 | provide a measure of model uncertainty.
128 |
129 | **Experiments**
130 | Bayesian Deep Learning performs well on the following datasets:
131 | - CamVid Dataset
132 |   - CamVid is a small road scene understanding dataset.
133 |   - Segment 11 classes.
134 |   - Comparison against methods that use depth and motion cues.
135 |   - Bayesian SegNet obtains the highest overall class average and mean intersection over union score by a significant margin.
136 | - Scene Understanding (SUN)
137 |   - Large dataset of indoor scenes.
138 |   - 37 indoor scene classes.
139 |   - Bayesian SegNet outperforms all previous benchmarks, including those which use depth modality.
140 | - Pascal VOC
141 |   - Segmentation challenge
142 |   - 20 salient object classes
143 |   - Involves learning both classes and their spatial context
144 |
145 | **Understanding Modelling Uncertainty**
146 | - Qualitative observations
147 | Segmentation predictions are smooth, with a sharp segmentation around object boundaries. These results also show that when the model predicts an incorrect label, the model uncertainty is generally very high. High model uncertainty is due to:
148 |   - At class boundaries the model often displays a high level of uncertainty. This reflects the ambiguity in defining where these labels transition.
149 |   - Visually difficult to identify - objects are occluded or at a distance from the camera.
150 |   - Visually ambiguous to the model.
151 | - Quantitative observation
152 | Relationship between uncertainty and accuracy, and between uncertainty and the frequency of each class in the dataset. Uncertainty is calculated as the mean uncertainty value for each pixel of that class in a test dataset. We observe an inverse relationship between uncertainty and class accuracy or class frequency. This shows that the model is more confident about classes which are easier or occur more often, and less certain about rare and challenging classes.
153 |
154 |
155 |
156 | ## What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?
157 | - Author: Alex Kendall, Yarin Gal - University of Cambridge
158 | - Published: 5 Oct 2017
159 | - Link: [Paper](https://arxiv.org/pdf/1703.04977)
160 | ### Aim
161 | - Bayesian framework combining aleatoric uncertainty and epistemic uncertainty.
162 | - New loss functions that capture uncertainty for segmentation and depth regression benchmark tasks, giving state-of-the-art results
163 | - Improve model performance by 1-3% over non-Bayesian baselines by reducing the effect of noisy data with the implied attenuation
164 | ### Overview
165 | Quantifying uncertainty in computer vision applications can be largely divided into regression settings such as depth regression, and classification settings such as semantic segmentation. Two main types of uncertainty:
166 | - Aleatoric uncertainty captures noise inherent in the observations. This could be for example sensor noise or motion
167 | noise, resulting in uncertainty which cannot be reduced even if more data were to be collected. This is modelled by placing a distribution over the output of the model.
168 |   - Homoscedastic uncertainty - uncertainty which stays constant for different inputs. Homoscedastic regression assumes constant observation noise σ for every input point x.
169 |   - Heteroscedastic uncertainty - uncertainty which depends on the inputs to the model. The observation noise can vary with input x.
170 |
171 | - Epistemic uncertainty accounts for uncertainty in the model parameters. It captures ignorance about which model generated our collected data. This uncertainty can be explained away given enough data, and is often referred to as model uncertainty. This is modelled by placing a prior distribution over a model’s weights, and then trying to capture how much these weights vary given some data.
172 | We are required to evaluate the posterior p(W|X, Y) = p(Y|X, W)p(W)/p(Y|X), which cannot be evaluated analytically. Hence, we approximate it by fitting a simple distribution over the weights. This replaces the intractable problem of averaging over all weights in the BNN with an optimisation task, where we seek to optimise over the parameters of the simple distribution instead of optimising the original neural network’s parameters. We also make use of dropout variational inference for approximate inference in the models. This inference is done by training a model with dropout before every weight layer, and by also performing dropout at test time to sample from the approximate posterior (stochastic forward passes, referred to as Monte Carlo dropout). More formally, this approach is equivalent to performing approximate variational inference where we find a simple distribution in a tractable family which minimises the Kullback-Leibler (KL) divergence to the true model posterior p(W|X, Y). Dropout can be interpreted as a variational Bayesian approximation. Epistemic uncertainty in the weights can be reduced by observing more data. This uncertainty induces prediction uncertainty by marginalising over the (approximate) weights posterior distribution.
173 | - Classification Equation: Monte Carlo integration
174 |
175 | ![alt text](https://github.com/shreyavshetty/BayesianDeepLearning/blob/master/Classification_Eq.png "Classifier_Eq")
176 |
177 | - Regression Equation: Uncertainty is captured by the predictive variance.
178 |
179 | ![alt text](https://github.com/shreyavshetty/BayesianDeepLearning/blob/master/Regression_Eq.png "Regression_Eq")
180 |
181 | Two terms:
182 | - first term - σ^2 indicates the noise inherent in the data
183 | - second term - how much the model is uncertain about its predictions. This term will vanish when we have zero parameter uncertainty
184 | **Combining Aleatoric and Epistemic Uncertainty in One Model**
185 | The heteroscedastic NN can be turned into a Bayesian NN by placing a distribution over its weights.
186 | We can use a single network to transform the input x, with its head split to predict both ŷ as well as σ̂^2.
187 | ![alt text](https://github.com/shreyavshetty/BayesianDeepLearning/blob/master/Eq.png "Eq")
188 |
189 | This loss consists of two components: the residual regression obtained with a stochastic sample through the model (making use of the uncertainty over the parameters) and an uncertainty regularization term. We do not need ‘uncertainty labels’ to learn uncertainty. Rather, we only need to supervise the learning of the regression task. We learn the variance, σ^2, implicitly from the loss function. The second regularization term prevents the network from predicting infinite uncertainty
190 | (and therefore zero loss) for all data points. In practice, we train the network to predict the log variance. This is because it is more numerically stable than regressing the variance, σ^2, as the loss avoids a potential division by zero.
191 | ![alt text](https://github.com/shreyavshetty/BayesianDeepLearning/blob/master/Eq3.png "Eq3")
192 |
193 |
194 | Heteroscedastic Uncertainty as Learned Loss Attenuation - this makes the model more robust to noisy data: inputs for which the model learned to predict high uncertainty will have a smaller effect on the loss. The model is discouraged from predicting high uncertainty for all points, in effect ignoring the data, through the log σ^2 term.
Large uncertainty increases the contribution of this term, and in turn penalizes the model: the model can learn to ignore the data, but is penalised for that. The model is also discouraged from predicting very low uncertainty for points with high residual error, as low σ^2 will exaggerate the contribution of the residual and will penalize the model. It is important to stress that this learned attenuation is not an ad-hoc construction, but a consequence of the probabilistic
195 | interpretation of the model.
196 |
197 | Heteroscedastic Uncertainty in Classification Tasks: For classification tasks our NN predicts a vector of unaries f_i for each pixel i, which when passed through a softmax operation, forms a probability vector p_i. The model is changed by placing a Gaussian distribution over the unaries vector:
198 | ![alt text](https://github.com/shreyavshetty/BayesianDeepLearning/blob/master/Eq4.png "Eq4")
199 |
200 | Monte Carlo integration is used: unaries are sampled and passed through the softmax function. The equation is as follows:
201 | ![alt text](https://github.com/shreyavshetty/BayesianDeepLearning/blob/master/Eq5.png "Eq5")
202 |
203 | Experiments:
204 | - Semantic Segmentation
205 |   - CamVid - intersection over union (IoU) score of 67.5% - The implicit attenuation obtained from the aleatoric loss provides a larger improvement than the epistemic uncertainty model. However, the combination of both uncertainties improves performance even further.
206 |   - NYU Dataset - This dataset is much harder than CamVid because there is significantly less structure in indoor scenes. Baseline performance improves when the model is given the flexibility to estimate uncertainty and attenuate the loss.
207 | - Pixel-wise Depth Regression
208 |
209 | What Do Aleatoric and Epistemic Uncertainties Capture?
210 | - Quality of Uncertainty Metric
211 |   - Precision-recall curves for regression and classification models show how model performance improves by removing pixels with uncertainty larger than various percentile thresholds.
212 |   - Uncertainty correlates well with accuracy and precision
213 |   - The curves for epistemic and aleatoric uncertainty models are very similar. This suggests that when only one uncertainty is explicitly modeled, it attempts to compensate for the lack of the alternative uncertainty when possible
214 |
215 | Uncertainty with Distance from Training Data
216 | - Epistemic uncertainty decreases as the training dataset gets larger.
217 | - Aleatoric uncertainty remains relatively constant and cannot be explained away with more data
218 | - Testing the models on a different dataset than the one they were trained on shows that epistemic uncertainty increases considerably
219 |
220 | ## Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics
221 | - Author: Alex Kendall, Yarin Gal, Roberto Cipolla
222 | - Published: 24 April 2018
223 | - Link: [Paper](https://arxiv.org/pdf/1705.07115.pdf)
224 | ### Aim
225 | The proposed principled approach to multi-task deep learning weighs multiple loss functions by considering the homoscedastic uncertainty of each task. It simultaneously learns various quantities with different units or scales in both classification and regression settings.
226 | ### Overview
227 | - Combining all tasks into a single model reduces computation and allows systems to run in real-time. Manually tuning loss weights for multiple tasks is expensive.
228 | - Homoscedastic uncertainty is viewed as task-dependent weighting and a principled multi-task loss function is derived which can learn to balance various regression and classification losses optimally.
229 | - Scene Geometry and Semantics learnt using 3 tasks:
230 |   - semantic segmentation
231 |   - instance segmentation
232 |   - pixel-wise metric depth
233 | Existing literature made use of separate deep learning models for each of the tasks.
234 | - Combining the above into a single model resulted in the following key contributions:
235 |   - multi-task loss to simultaneously learn various classification and regression losses of varying quantities and units using homoscedastic task uncertainty
236 |   - unified architecture
237 |   - demonstrates importance of loss weighting for multi-task learning
238 | - Combining models improved generalisation by sharing the domain information between complementary tasks. It does this by using a shared representation to learn multiple tasks. This implies that what is learned from one task can be used to learn other tasks.
239 | - Previous approaches combined multi-objective losses by performing a weighted linear sum of the losses for each individual task.
240 | Eq: L_total = Σ_i w_i L_i
241 | Model performance is extremely sensitive to the choice of the weights w_i, and they are expensive to tune.
242 | - A deep convolutional neural network architecture is used, consisting of a number of convolutional encoders which produce a shared representation, followed by a corresponding number of task-specific convolutional decoders.
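
The contrast between a fixed weighted linear sum of task losses and uncertainty-based weighting can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation. The function names and the example loss values are hypothetical. The uncertainty-weighted form used here, 0.5 * exp(-s_i) * L_i + 0.5 * s_i per task with s_i = log σ_i^2 as a log variance, is the regression-style form and should be read as an assumption, since the notes above do not spell the loss out.

```python
import math


def weighted_sum_loss(losses, weights):
    """Fixed weighting: sum_i w_i * L_i, with hand-tuned weights w_i."""
    if len(losses) != len(weights):
        raise ValueError("need one weight per task loss")
    return sum(w * loss for w, loss in zip(weights, losses))


def uncertainty_weighted_loss(losses, log_vars):
    """Uncertainty weighting (assumed regression-style form).

    Each task i has a log variance s_i = log(sigma_i^2). Its loss is scaled by
    1 / (2 * sigma_i^2) = 0.5 * exp(-s_i), and the term log(sigma_i) = 0.5 * s_i
    is added so that the variance cannot grow without bound.
    """
    if len(losses) != len(log_vars):
        raise ValueError("need one log variance per task loss")
    return sum(0.5 * math.exp(-s) * loss + 0.5 * s
               for loss, s in zip(losses, log_vars))


# Hypothetical per-task losses on very different scales.
task_losses = [4.0, 0.25]

# Equal hand-picked weights: the large-scale task dominates the total.
fixed_total = weighted_sum_loss(task_losses, [1.0, 1.0])

# With s_i = 0 (sigma_i^2 = 1) every task gets the same weight of 0.5.
initial_total = uncertainty_weighted_loss(task_losses, [0.0, 0.0])

# For a fixed loss L_i the per-task term is minimised at s_i = log(L_i),
# so the task with the larger loss ends up with the smaller weight.
best_log_vars = [math.log(loss) for loss in task_losses]
tuned_total = uncertainty_weighted_loss(task_losses, best_log_vars)

print(fixed_total, initial_total, tuned_total)
```

In a real network the log variances would be trainable parameters optimised together with the network weights by gradient descent. Here they are set by hand only to show how the scaling term and the regularising term trade off.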
243 |
--------------------------------------------------------------------------------
/Regression_Eq.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/shreyavshetty/BayesianDeepLearning/6564dd3f0eff74b32d112053a102023232c0155b/Regression_Eq.png
--------------------------------------------------------------------------------
/SegNet_Architecture.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/shreyavshetty/BayesianDeepLearning/6564dd3f0eff74b32d112053a102023232c0155b/SegNet_Architecture.png
--------------------------------------------------------------------------------
/encoder_explained.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/shreyavshetty/BayesianDeepLearning/6564dd3f0eff74b32d112053a102023232c0155b/encoder_explained.png
--------------------------------------------------------------------------------
/upsampling_indices.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/shreyavshetty/BayesianDeepLearning/6564dd3f0eff74b32d112053a102023232c0155b/upsampling_indices.png
--------------------------------------------------------------------------------