├── Deterministic vs Probabilistic Deep Learning.ipynb ├── Deterministic vs Probabilistic Regression with Tensorflow.ipynb ├── Frequentist vs. Bayesian Statistics with Tensorflow.ipynb ├── Gentle Introduction to TensorFlow Probability - Distribution Objects.ipynb ├── Gentle Introduction to TensorFlow Probability - Trainable Parameters.ipynb ├── Maximum Likelihood Estimation from scratch in TensorFlow Probability.ipynb ├── Naive Bayes from Scratch with Tensorflow Probability.ipynb ├── Probabilistic Linear Regression from scratch in TensorFlow.ipynb └── README.md /Deterministic vs Probabilistic Regression with Tensorflow.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 2, 6 | "id": "2d85812d-149c-4264-9338-e17f10a20a19", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "import tensorflow as tf\n", 11 | "import tensorflow_probability as tfp\n", 12 | "import os\n", 13 | "import numpy as np\n", 14 | "import matplotlib.pyplot as plt\n", 15 | "import seaborn as sns\n", 16 | "sns.set_style(\"darkgrid\")\n", 17 | "import pandas as pd\n", 18 | "from tensorflow import keras\n", 19 | "\n", 20 | "from tensorflow.keras.models import Sequential\n", 21 | "from tensorflow.keras.layers import Dense, Flatten, Conv2D, MaxPooling2D\n", 22 | "from tensorflow.keras.losses import SparseCategoricalCrossentropy, MeanSquaredError\n", 23 | "from tensorflow.keras.optimizers import RMSprop\n", 24 | "\n", 25 | "import tensorflow_datasets as tfds\n", 26 | "\n", 27 | "tfd = tfp.distributions\n", 28 | "tfpl = tfp.layers" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "id": "b87cce23-e11f-46c1-ad84-20b3318aa6f8", 34 | "metadata": {}, 35 | "source": [ 36 | "# Introduction\n", 37 | "This article belongs to the series \"Probabilistic Deep Learning\". This weekly series covers probabilistic approaches to deep learning. 
The main goal is to extend deep learning models to quantify uncertainty, i.e. to know what they do not know.\n", 38 | "\n", 39 | "In this article we are going to explore the main differences between deterministic and probabilistic regression. In general, deterministic regression is useful when the relationship between the independent and dependent variables is well understood and relatively stable. On the other hand, probabilistic regression is more appropriate when there is uncertainty or variability in the data. As an exercise to support our claims, we are going to fit a probabilistic model to non-linear data using TensorFlow Probability.\n", 40 | "\n", 41 | "We develop our models using TensorFlow and TensorFlow Probability. TensorFlow Probability is a Python library built on top of TensorFlow. We are going to start with the basic objects that we can find in TensorFlow Probability and understand how we can manipulate them. We will increase complexity incrementally over the following weeks and combine our probabilistic models with deep learning on modern hardware (e.g. GPU).\n", 42 | "\n", 43 | "As usual, the code is available on my GitHub." 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "id": "d9ff3f0b-a1b4-4154-933b-8ec60f232eaa", 49 | "metadata": {}, 50 | "source": [ 51 | "# Deterministic vs Probabilistic Regression\n", 52 | "### Definitions\n", 53 | "Deterministic regression is a type of regression analysis in which the relationship between the independent and dependent variables is known and fixed. In other words, if the same inputs are provided to a deterministic regression model, it will always produce the same output. 
This makes it a useful tool for predicting the value of a dependent variable given a set of known independent variables.\n", 54 | "\n", 55 | "If we think about linear regression models, the Gauss-Markov theorem immediately comes to mind since it establishes the optimality of the ordinary least squares (OLS) estimator, under certain assumptions. In particular, the Gauss-Markov theorem states that the OLS estimator is the best linear unbiased estimator (BLUE), meaning that it has the smallest variance among all linear unbiased estimators. However, the Gauss-Markov theorem does not address the issue of uncertainty or belief in the estimates, which is a key aspect of the probabilistic approaches.\n", 56 | "\n", 57 | "On the other hand, probabilistic regression is a type of regression analysis in which the relationship between the independent and dependent variables is not known and may vary from one set of data to another. Instead of predicting a single value for the dependent variable, a probabilistic regression model predicts a probability distribution for the possible values of the dependent variable. This allows the model to account for uncertainty and variability in the data, and can provide more accurate predictions in some cases.\n", 58 | "\n", 59 | "Let's give a simple example to simplify the understanding. A researcher is studying the relationship between the amount of time that students spend studying for a test and their scores on the test. In this scenario, the researcher could use the OLS method to estimate the slope and intercept of the regression line, and use the Gauss-Markov theorem to justify the choice of this estimator. However, as we stated before, the Gauss-Markov theorem does not address the issue of uncertainty or belief in the estimates. In the probabilistic world, the emphasis is on using probability to describe the uncertainty or belief in the model or parameter, rather than just the optimality of the estimator. 
This means that we might use a different approach to estimating the slope and intercept of the regression line. As a consequence, we might come to a different conclusion about the relationship between study time and test scores, based on the data and the prior belief in the values of the slope and intercept.\n", 60 | "\n", 61 | "### Bayesian Statistics and the Bias–variance Trade-off\n", 62 | "\n", 63 | "Probabilistic regression can be seen as a form of Bayesian statistics, in that it involves treating the unknown relationship between the independent and dependent variables as a random variable, and estimating its probability distribution based on the available data. In this way, we can think of it as a way of incorporating uncertainty and variability into regression analysis. Recall that Bayesian statistics is a framework for statistical analysis in which all unknown quantities are treated as random variables, and their probability distributions are updated as new data is observed. This is in contrast to classical statistics, which typically assumes that unknown quantities are fixed, but unknown, parameters.\n", 64 | "\n", 65 | "Another way to think about the difference between the two approaches is to consider the trade-off between bias and variance in statistical estimates. Bias refers to the difference between the estimated value and the true value of a parameter, while variance refers to the spread or variability of the estimated values. By providing a distribution over the model parameters, rather than a single point estimate, probabilistic regression can help reduce bias in the estimates, which can improve the overall accuracy of the model. Additionally, probabilistic regression can provide a measure of uncertainty or confidence in the estimated values, which can be useful for making decisions or predictions based on the model. This can be particularly helpful when working with noisy or incomplete data, where the uncertainty in the estimates is higher." 
66 | ] 67 | }, 68 | { 69 | "cell_type": "markdown", 70 | "id": "a06a19f4-64f9-4df6-b788-893816d6ba67", 71 | "metadata": {}, 72 | "source": [ 73 | "# Non-Linear Probabilistic Regression\n", 74 | "\n", 75 | "Let's jump to an example to make these concepts easier to understand. We will not cover the full Bayesian approach, which would entail the estimation of the epistemic uncertainty - the uncertainty of the model. We will study this type of uncertainty in a future article. Nevertheless, we will estimate another type of uncertainty - the aleatoric uncertainty. It can be defined as the uncertainty in the generative process of the data. \n", 76 | "\n", 77 | "This time, we are going to cover a more complex regression analysis - non-linear regression. In contrast to linear regression, which models the relationship between the variables using a straight line, non-linear regression allows for more complex relationships between the variables to be modeled. This makes non-linear regression a useful tool for many machine learning applications, where the relationships between the variables may be too complex to be accurately modeled using a linear equation.\n", 78 | "\n", 79 | "We start by creating some data that follows a non-linear pattern. By generating this data synthetically, we have the advantage of knowing the exact generative process and, thus, we only need to find a way to get from the observations back to the generative process.\n", 80 | "\n", 81 | "The data we'll be working with is artificially created from the following equation:\n", 82 | "$$y_i = x_i^3+\\frac{3}{15}(1+x_i)\\epsilon_i$$\n", 83 | "\n", 84 | "where $\\epsilon_i \\sim \\mathcal{N}(0,1)$ are independent and identically distributed." 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": 3, 90 | "id": "c794db49-c2e7-4857-8370-0a19572adf0e", 91 | "metadata": {}, 92 | "outputs": [ 93 | { 94 | "data": { 95 | "image/png": "\n", 96 | "text/plain": [ 97 | "
" 98 | ] 99 | }, 100 | "metadata": {}, 101 | "output_type": "display_data" 102 | } 103 | ], 104 | "source": [ 105 | "# Create and plot 1000 data points\n", 106 | "\n", 107 | "x_train = np.linspace(-1, 1, 1000)[:, np.newaxis]\n", 108 | "y_train = np.power(x_train, 3) + (3/15)*(1+x_train)*np.random.randn(1000)[:, np.newaxis]\n", 109 | "\n", 110 | "plt.scatter(x_train, y_train, alpha=0.1)\n", 111 | "plt.show()" 112 | ] 113 | }, 114 | { 115 | "cell_type": "markdown", 116 | "id": "366f4f86-30d3-4321-91f5-a89fb330d69d", 117 | "metadata": {}, 118 | "source": [ 119 | "As usual, we need to specify our loss function. In the probabilistic case, as we saw in several examples in previous articles, we use the negative log-likelihood as our loss function." 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": 4, 125 | "id": "2a63e19e-9e5e-47db-b23c-7c2cff07c62a", 126 | "metadata": {}, 127 | "outputs": [], 128 | "source": [ 129 | "# Train model using the negative log-likelihood\n", 130 | "\n", 131 | "def nll(y_true, y_pred):\n", 132 | " return -y_pred.log_prob(y_true)" 133 | ] 134 | }, 135 | { 136 | "cell_type": "markdown", 137 | "id": "f349ed58-1fec-4de0-af4c-4eb6af6606e3", 138 | "metadata": {}, 139 | "source": [ 140 | "As we saw in the previous article, a generic way to incorporate any distribution into a Deep Learning architecture in Keras is to use the `DistributionLambda` layer from TensorFlow Probability. Recall that the `DistributionLambda` layer returns a distribution object. It is also the base class for several other probabilistic layers implemented in TensorFlow Probability that we will use in future articles.\n", 141 | "\n", 142 | "To build our model, we start by adding 2 dense layers. The first has 8 units and a sigmoid activation function. The second has 2 units and no activation function. We leave out the activation because the Gaussian distribution that follows should be parameterized by unconstrained real values. 
The Gaussian distribution is defined by the `DistributionLambda` layer. Notice that in this case, we are using a lambda function to instantiate the `DistributionLambda` layer. The lambda function receives an input t, which is the output tensor of the previous Dense layer, and returns a Gaussian distribution with mean and standard deviation defined by the tensor t. Remember that the scale of the distribution is the standard deviation, and this should be a positive value. As before, we pass the tensor component through the softplus function to respect this constraint.\n", 143 | "\n", 144 | "Note that the real difference between a linear and a non-linear model is just the added Dense layer as the first layer of the model." 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": 8, 150 | "id": "bdd6aab1-8d17-4029-8a14-0f5e65e7fb6a", 151 | "metadata": {}, 152 | "outputs": [ 153 | { 154 | "name": "stdout", 155 | "output_type": "stream", 156 | "text": [ 157 | "Model: \"sequential_2\"\n", 158 | "_________________________________________________________________\n", 159 | " Layer (type) Output Shape Param # \n", 160 | "=================================================================\n", 161 | " dense_4 (Dense) (None, 8) 16 \n", 162 | " \n", 163 | " dense_5 (Dense) (None, 2) 18 \n", 164 | " \n", 165 | " independent_normal_1 (Indep ((None, 1), 0 \n", 166 | " endentNormal) (None, 1)) \n", 167 | " \n", 168 | "=================================================================\n", 169 | "Total params: 34\n", 170 | "Trainable params: 34\n", 171 | "Non-trainable params: 0\n", 172 | "_________________________________________________________________\n" 173 | ] 174 | } 175 | ], 176 | "source": [ 177 | "# Create probabilistic regression: normal distribution with learned mean and standard deviation\n", 178 | "\n", 179 | "model = Sequential([\n", 180 | " Dense(input_shape=(1,), units=8, activation='sigmoid'),\n", 181 | " Dense(tfpl.IndependentNormal.params_size(event_shape=1)),\n", 182 | " 
tfpl.IndependentNormal(event_shape=1)\n", 183 | " # Dense(2),\n", 184 | " # tfpl.DistributionLambda(lambda t:tfd.Normal(loc=t[...,:1], scale=tf.math.softplus(t[...,1:])))\n", 185 | "])\n", 186 | "\n", 187 | "model.compile(loss=nll, optimizer=RMSprop(learning_rate=0.01))\n", 188 | "model.summary()" 189 | ] 190 | }, 191 | { 192 | "cell_type": "markdown", 193 | "id": "9c426818-cf4b-4dae-be9b-506ca3f88176", 194 | "metadata": {}, 195 | "source": [ 196 | "We can check the output shape from our model to better understand what is happening. We get an empty event shape and a batch shape of (1000, 1). The 1000 refers to the batch size, while the extra dimension does not really make sense in our problem statement. We want to represent a single random variable that is normally distributed." 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": 9, 202 | "id": "948d6ad5-04a0-46b7-951b-d09e5a636086", 203 | "metadata": {}, 204 | "outputs": [ 205 | { 206 | "data": { 207 | "text/plain": [ 208 | "" 209 | ] 210 | }, 211 | "execution_count": 9, 212 | "metadata": {}, 213 | "output_type": "execute_result" 214 | } 215 | ], 216 | "source": [ 217 | "y_model = model(x_train)\n", 218 | "y_sample = y_model.sample()\n", 219 | "y_model" 220 | ] 221 | }, 222 | { 223 | "cell_type": "markdown", 224 | "id": "18e63de5-806c-42a2-a0e1-8684b777b3e4", 225 | "metadata": {}, 226 | "source": [ 227 | "To simplify the implementation of our last layer and to bring it in line with the output shape we expect, we can use a wrapper that TensorFlow Probability provides. By using the `IndependentNormal` layer we can build a distribution similar to the one we built with `DistributionLambda`. At the same time, we can use a static method that returns the number of parameters required by the probabilistic layer and use it to define the number of units in the previous Dense layer: `tfpl.IndependentNormal.params_size(event_shape=1)`."
228 | ] 229 | }, 230 | { 231 | "cell_type": "code", 232 | "execution_count": 10, 233 | "id": "1e925812-9dae-4579-9dc8-cca0f074a99e", 234 | "metadata": {}, 235 | "outputs": [ 236 | { 237 | "name": "stdout", 238 | "output_type": "stream", 239 | "text": [ 240 | "Model: \"sequential_3\"\n", 241 | "_________________________________________________________________\n", 242 | " Layer (type) Output Shape Param # \n", 243 | "=================================================================\n", 244 | " dense_6 (Dense) (None, 16) 32 \n", 245 | " \n", 246 | " dense_7 (Dense) (None, 2) 34 \n", 247 | " \n", 248 | " distribution_lambda_1 (Dist ((None, 1), 0 \n", 249 | " ributionLambda) (None, 1)) \n", 250 | " \n", 251 | "=================================================================\n", 252 | "Total params: 66\n", 253 | "Trainable params: 66\n", 254 | "Non-trainable params: 0\n", 255 | "_________________________________________________________________\n" 256 | ] 257 | } 258 | ], 259 | "source": [ 260 | "# Create probabilistic regression: normal distribution with learned mean and standard deviation\n", 261 | "\n", 262 | "model = Sequential([\n", 263 | " Dense(input_shape=(1,), units=16, activation='sigmoid'),\n", 264 | " Dense(2),\n", 265 | " tfpl.DistributionLambda(lambda t:tfd.Normal(loc=t[...,:1], scale=tf.math.softplus(t[...,1:])))\n", 266 | "])\n", 267 | "\n", 268 | "model.compile(loss=nll, optimizer=RMSprop(learning_rate=0.01))\n", 269 | "model.summary()" 270 | ] 271 | }, 272 | { 273 | "cell_type": "markdown", 274 | "id": "825a8351-8349-4a29-95a9-31eef5f0e3da", 275 | "metadata": {}, 276 | "source": [ 277 | "As we can see, the shape is now correctly specified, since the extra dimension in the batch shape was moved to the event shape."
278 | ] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "execution_count": 11, 283 | "id": "3289c133-c827-400f-9ba2-740d1e470714", 284 | "metadata": {}, 285 | "outputs": [ 286 | { 287 | "data": { 288 | "text/plain": [ 289 | "" 290 | ] 291 | }, 292 | "execution_count": 11, 293 | "metadata": {}, 294 | "output_type": "execute_result" 295 | } 296 | ], 297 | "source": [ 298 | "y_model = model(x_train)\n", 299 | "y_sample = y_model.sample()\n", 300 | "y_model" 301 | ] 302 | }, 303 | { 304 | "cell_type": "markdown", 305 | "id": "55d5e65d-532e-418d-8596-2190c58fc18b", 306 | "metadata": {}, 307 | "source": [ 308 | "Time to fit the model to our synthetically generated data." 309 | ] 310 | }, 311 | { 312 | "cell_type": "code", 313 | "execution_count": null, 314 | "id": "11c54c98-c798-4312-9505-9c301d0cfc4a", 315 | "metadata": { 316 | "scrolled": true, 317 | "tags": [] 318 | }, 319 | "outputs": [], 320 | "source": [ 321 | "# Train model\n", 322 | "\n", 323 | "model.fit(x_train, y_train, epochs=500, verbose=False)\n", 324 | "model.evaluate(x_train, y_train)" 325 | ] 326 | }, 327 | { 328 | "cell_type": "markdown", 329 | "id": "cda8f34c-6728-4f1a-9260-1bbb03ffced7", 330 | "metadata": {}, 331 | "source": [ 332 | "As expected, we were able to capture the aleatoric uncertainty of the generative process of the data. We can easily build confidence intervals using the learned standard deviation parameter. As a result, we are now able to generate new samples using the same process that we just learned from the data." 
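As a rough sanity check on the $\mu \pm 2\sigma$ band, the sketch below regenerates data from the same equation and counts how many points fall inside the band. It uses only the standard library, and it plugs in the true mean and standard deviation of the generative process (the hypothetical helpers `true_mean` and `true_stddev`) where a trained model would supply `y_model.mean()` and `y_model.stddev()`. It therefore shows what a well-fitted model should achieve, not what our fitted model actually achieves.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible


def true_mean(x):
    # Mean of the generative process: x^3
    return x ** 3


def true_stddev(x):
    # Input-dependent noise scale: (3/15) * (1 + x)
    return (3 / 15) * (1 + x)


n = 1000
xs = [-1 + 2 * i / (n - 1) for i in range(n)]  # same grid as np.linspace(-1, 1, 1000)
ys = [true_mean(x) + true_stddev(x) * random.gauss(0, 1) for x in xs]

# Count the points that fall inside mean +/- 2 standard deviations.
inside = sum(
    1 for x, y in zip(xs, ys)
    if abs(y - true_mean(x)) <= 2 * true_stddev(x)
)
coverage = inside / n
print(f"share of points within mean +/- 2 sd: {coverage:.3f}")
```

For Gaussian noise, roughly 95% of the points should land inside the band. A learned band that covers a much smaller or larger share suggests that the standard deviation was not captured well.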
333 | ] 334 | }, 335 | { 336 | "cell_type": "code", 337 | "execution_count": null, 338 | "id": "ddfbab51-f5bc-41f1-b371-02490a891a9a", 339 | "metadata": {}, 340 | "outputs": [], 341 | "source": [ 342 | "# Plot the data and a sample from the model\n", 343 | "\n", 344 | "y_model = model(x_train)\n", 345 | "y_sample = y_model.sample()\n", 346 | "y_hat = y_model.mean()\n", 347 | "y_sd = y_model.stddev()\n", 348 | "y_hat_m2sd = y_hat -2 * y_sd\n", 349 | "y_hat_p2sd = y_hat + 2*y_sd\n", 350 | "\n", 351 | "fig, (ax1, ax2) =plt.subplots(1, 2, figsize=(15, 5), sharey=True)\n", 352 | "ax1.scatter(x_train, y_train, alpha=0.4, label='data')\n", 353 | "ax1.scatter(x_train, y_sample, alpha=0.4, color='red', label='model sample')\n", 354 | "ax1.legend()\n", 355 | "ax2.scatter(x_train, y_train, alpha=0.4, label='data')\n", 356 | "ax2.plot(x_train, y_hat, color='red', alpha=0.8, label='model $\\mu$')\n", 357 | "ax2.plot(x_train, y_hat_m2sd, color='green', alpha=0.8, label='model $\\mu \\pm 2 \\sigma$')\n", 358 | "ax2.plot(x_train, y_hat_p2sd, color='green', alpha=0.8)\n", 359 | "ax2.legend()\n", 360 | "plt.show()" 361 | ] 362 | }, 363 | { 364 | "cell_type": "markdown", 365 | "id": "126ab39f-6f31-40bc-94cd-33d74b486e71", 366 | "metadata": {}, 367 | "source": [ 368 | "# Conclusion" 369 | ] 370 | }, 371 | { 372 | "cell_type": "markdown", 373 | "id": "da9e4dde-3a5d-440f-b176-0bbd065b2e98", 374 | "metadata": {}, 375 | "source": [ 376 | "This article explored the main differences between deterministic and probabilistic regression. We saw that in general, deterministic regression is useful when the relationship between the independent and dependent variables is well understood and relatively stable. On the other hand, probabilistic regression is more appropriate when there is uncertainty or variability in the data. As an exercise to support our claims, we then fitted a probabilistic model to non-linear data. 
By adding an extra dense layer with a non-linear activation function at the beginning of our model, we are able to learn non-linear patterns in the data. As before, our final layer is a probabilistic layer, which means that it outputs a distribution object. As a way to be more coherent with our problem statement, we extended our approach to use the `IndependentNormal` layer that we explored a few articles ago. It allows us to move batch dimensions to the event shape. By specifying our model with two additional variables defined in the new probabilistic layer, we were able to learn the mean and standard deviation of the Gaussian noise that was used to generate the artificial data. By using these parameters, we were also able to build confidence intervals but, more importantly, to generate new samples using the same process that we have learned from the data.\n", 377 | "\n", 378 | "Next week, we will further explore the differences between a frequentist and a Bayesian approach. See you then!" 379 | ] 380 | }, 381 | { 382 | "cell_type": "markdown", 383 | "id": "9d9759d5-7266-47b6-8969-96b929253b1e", 384 | "metadata": {}, 385 | "source": [ 386 | "# References and Materials" 387 | ] 388 | }, 389 | { 390 | "cell_type": "markdown", 391 | "id": "0f71fdfb-d812-4870-b46e-df8c367a1924", 392 | "metadata": {}, 393 | "source": [ 394 | "[1] - [Coursera: TensorFlow 2 for Deep Learning Specialization](https://www.coursera.org/specializations/tensorflow2-deeplearning)\n", 395 | "\n", 396 | "[2] - [Coursera: Deep Learning Specialization](https://www.coursera.org/specializations/deep-learning)\n", 397 | "\n", 398 | "[3] - [TensorFlow Probability Guides and Tutorials](https://www.tensorflow.org/probability/overview)\n", 399 | "\n", 400 | "[4] - [TensorFlow Probability Posts in TensorFlow Blog](https://blog.tensorflow.org/search?label=TensorFlow+Probability&max-results=20)" 401 | ] 402 | } 403 | ], 404 | "metadata": { 405 | "kernelspec": { 406 | "display_name": "Python 3 (ipykernel)", 407 | "language": 
"python", 408 | "name": "python3" 409 | }, 410 | "language_info": { 411 | "codemirror_mode": { 412 | "name": "ipython", 413 | "version": 3 414 | }, 415 | "file_extension": ".py", 416 | "mimetype": "text/x-python", 417 | "name": "python", 418 | "nbconvert_exporter": "python", 419 | "pygments_lexer": "ipython3", 420 | "version": "3.9.15" 421 | } 422 | }, 423 | "nbformat": 4, 424 | "nbformat_minor": 5 425 | } 426 | -------------------------------------------------------------------------------- /Frequentist vs. Bayesian Statistics with Tensorflow.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "b4bc7609-81a6-4552-bf4a-992ae21e7490", 6 | "metadata": {}, 7 | "source": [ 8 | "# Introduction" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "id": "ce163f9f-7cda-442f-98ae-556b92b08024", 14 | "metadata": {}, 15 | "source": [ 16 | "This article belongs to the series \"Probabilistic Deep Learning\". This weekly series covers probabilistic approaches to deep learning. The main goal is to extend deep learning models to quantify uncertainty, i.e. know what they do not know.\n", 17 | "\n", 18 | "The frequentist approach to statistics is based on the idea of repeated sampling and long-run relative frequency. It involves constructing hypotheses about a population and testing them using sample data. On the other hand, the Bayesian approach is based on subjective probability and involves updating an initial belief about a population using collected data. Both methods have their strengths and limitations and which one to use depends on the problem and analysis goals. In this article, we will explore the differences between the frequentist and the Bayesian approach and discuss how they can be implemented using TensorFlow and TensorFlow Probability." 
19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "id": "093e675c-3984-416a-8db6-9e1fe0e6899a", 24 | "metadata": {}, 25 | "source": [ 26 | "# Frequentist vs. Bayesian Approach to Problems\n", 27 | "Frequentist statistics and Bayesian statistics are two main approaches to statistical inference, which is the process of using data to draw conclusions about a population. Both approaches are used to estimate unknown quantities, make predictions, and test hypotheses, but they differ in their interpretation of probability and how they incorporate prior knowledge and evidence.\n", 28 | "\n", 29 | "In frequentist statistics, probability is interpreted as the long-run relative frequency of an event in an infinite number of trials. This approach is based on the idea that the true value of a population parameter is fixed, but is unknown and must be estimated from data. In this framework, statistical inferences are drawn from the observed data by making assumptions about the underlying data-generating process and using techniques such as point estimation, confidence intervals, and hypothesis testing.\n", 30 | "\n", 31 | "On the other hand, Bayesian statistics interprets probability as a measure of belief or degree of certainty about an event. This approach allows for the incorporation of prior knowledge and evidence into statistical analysis through the use of Bayes' theorem. In this framework, the true value of a population parameter is treated as a random variable and is updated as new data is collected. This results in a full distribution over the parameter space, known as a posterior distribution, which can be used to make probabilistic predictions and to quantify uncertainty.\n", 32 | "\n", 33 | "One key difference between the two approaches is how they handle uncertainty. 
In frequentist statistics, uncertainty is quantified through the use of confidence intervals, which provide an estimate of the likely range of the true population parameter based on the observed data. In Bayesian statistics, uncertainty is represented by the full posterior distribution, which allows for a more complete characterization of the uncertainty surrounding the true value of the parameter.\n", 34 | "\n", 35 | "Another difference is that Bayesian statistics allows for the incorporation of prior knowledge, which can be particularly useful in cases where there is limited data or when the data-generating process is complex. However, the choice of prior distribution can significantly influence the results of a Bayesian analysis, and it is important to choose a prior that is appropriate for the problem at hand." 36 | ] 37 | }, 38 | { 39 | "cell_type": "markdown", 40 | "id": "be697aaf-c600-4039-8fca-cdd1ff4111a0", 41 | "metadata": {}, 42 | "source": [ 43 | "# The Problem\n", 44 | "\n", 45 | "As an example of applying a frequentist approach to a problem, consider the task of estimating the mean income of a population based on a sample of data. 
In this case, the goal is to use the sample data to make inferences about the true mean income of the population (we are going to assume that the standard deviation of the income of the population is known for now).\n", 46 | "\n", 47 | "Let's generate some data:" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 6, 53 | "id": "9966e2ad-b918-48a1-991e-341485eb1ffd", 54 | "metadata": {}, 55 | "outputs": [], 56 | "source": [ 57 | "import tensorflow as tf\n", 58 | "import tensorflow_probability as tfp\n", 59 | "import matplotlib.pyplot as plt\n", 60 | "from scipy.stats import t\n", 61 | "import numpy as np\n", 62 | "\n", 63 | "tfd = tfp.distributions" 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": 7, 69 | "id": "e2757b92-2f3d-4032-9887-78d521607f33", 70 | "metadata": {}, 71 | "outputs": [], 72 | "source": [ 73 | "# Sample data\n", 74 | "sample_size = 30.\n", 75 | "sample_mean = 50000.\n", 76 | "sample_stddev = 10000.\n", 77 | "sample_data = tfd.Normal(loc=sample_mean, scale=sample_stddev).sample(sample_size)" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": 8, 83 | "id": "0f5c4162-168b-448c-b04e-d862e4d38919", 84 | "metadata": {}, 85 | "outputs": [ 86 | { 87 | "data": { 88 | "image/png": "\n", 89 | "text/plain": [ 90 | "
" 91 | ] 92 | }, 93 | "metadata": {}, 94 | "output_type": "display_data" 95 | } 96 | ], 97 | "source": [ 98 | "plt.hist(sample_data, density=True, alpha=0.5);" 99 | ] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "id": "57d455ce-a06c-4e0d-a076-a5e592b09ce7", 104 | "metadata": {}, 105 | "source": [ 106 | "\n", 107 | "## The Frequentist Way\n", 108 | "\n", 109 | "One frequentist way to approach this problem is through point estimation. Point estimation involves using a single point estimate, such as the sample mean, to represent the unknown population parameter.\n", 110 | "\n", 111 | "The sample mean, $\\hat{\\mu}$, is a commonly used point estimate for the population mean, $\\mu$. It is calculated as the sum of the sample values, $x_1, x_2, ..., x_n$, divided by the sample size, $n$:\n", 112 | "\n", 113 | "$\\hat{\\mu} = \\frac{1}{n} \\sum_{i=1}^n x_i$" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": 9, 119 | "id": "fd51e6dd-3b16-46d0-bb6c-c99c047b0e28", 120 | "metadata": {}, 121 | "outputs": [ 122 | { 123 | "data": { 124 | "text/plain": [ 125 | "" 126 | ] 127 | }, 128 | "execution_count": 9, 129 | "metadata": {}, 130 | "output_type": "execute_result" 131 | } 132 | ], 133 | "source": [ 134 | "sample_mean = tf.reduce_mean(sample_data)\n", 135 | "sample_mean" 136 | ] 137 | }, 138 | { 139 | "cell_type": "markdown", 140 | "id": "6a1bd719-bd44-4c54-89c7-0e03f34d3612", 141 | "metadata": {}, 142 | "source": [ 143 | "However, point estimates alone do not provide a complete characterization of the uncertainty surrounding the estimate. To quantify this uncertainty, we can use a confidence interval. 
A confidence interval is an estimate of the likely range of the true population parameter based on the observed data, and it is constructed by adding and subtracting a margin of error to the point estimate. The margin of error is determined by the desired level of confidence and the sample size, and it reflects the variability in the sample data. For example, a 95% confidence interval indicates that we are 95% confident that the true population parameter falls within the interval.\n", 144 | "\n", 145 | "A confidence interval for the population mean is constructed by adding and subtracting a margin of error, $m$, to the point estimate:\n", 146 | "\n", 147 | "$\\hat{\\mu} \\pm m = [\\hat{\\mu} - m, \\hat{\\mu} + m]$\n", 148 | "\n", 149 | "The margin of error is determined by the desired level of confidence, $1-\\alpha$, and the sample size, $n$. It reflects the variability in the sample data and is typically calculated using the standard error, $SE$, of the point estimate:\n", 150 | "\n", 151 | "$m = t_{1-\\frac{\\alpha}{2}, n-1} \\times SE$\n", 152 | "\n", 153 | "where $t_{1-\\frac{\\alpha}{2}, n-1}$ is the critical value of the $t$-distribution with $n-1$ degrees of freedom for the desired level of confidence.\n", 154 | "\n", 155 | "The standard error of the sample mean is calculated as the standard deviation of the sample, $\\sigma$, divided by the square root of the sample size:\n", 156 | "\n", 157 | "$SE = \\frac{\\sigma}{\\sqrt{n}}$\n" 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": 10, 163 | "id": "8aa0722b-4b02-4e91-b59f-0cedc0c5ec90", 164 | "metadata": {}, 165 | "outputs": [ 166 | { 167 | "data": { 168 | "text/plain": [ 169 | "(46550.523, 53645.047)" 170 | ] 171 | }, 172 | "execution_count": 10, 173 | "metadata": {}, 174 | "output_type": "execute_result" 175 | } 176 | ], 177 | "source": [ 178 | "# Standard error of the sample mean\n", 179 | "sample_stddev = tf.math.reduce_std(sample_data)\n", 180 | "standard_error = sample_stddev 
/ tf.sqrt(sample_size)\n", 181 | "\n", 182 | "# Margin of error\n", 183 | "confidence_level = 0.95\n", 184 | "degrees_of_freedom = sample_size - 1\n", 185 | "t_distribution = tfp.distributions.StudentT(df=degrees_of_freedom, loc=0., scale=1.)\n", 186 | "\n", 187 | "# t_distribution.quantile() seems to have a bug, so we use t.ppf instead (assumes `from scipy.stats import t`)\n", 188 | "t_value = t.ppf(confidence_level+(1-confidence_level)/2, df=sample_size-1)\n", 189 | "margin_of_error = t_value * standard_error\n", 190 | "\n", 191 | "confidence_interval_lower = sample_mean - margin_of_error\n", 192 | "confidence_interval_upper = sample_mean + margin_of_error\n", 193 | "confidence_interval = (confidence_interval_lower.numpy(), confidence_interval_upper.numpy())\n", 194 | "confidence_interval" 195 | ] 196 | }, 197 | { 198 | "cell_type": "markdown", 199 | "id": "c01eb0a7-e5ae-4fe3-9ec1-1083f0badb01", 200 | "metadata": {}, 201 | "source": [ 202 | "We can define a helper function to plot our estimated mean and its interval overlaid on the sampled data." 203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "execution_count": 11, 208 | "id": "acc51172-387d-4309-a8f6-ff5b97a1aa79", 209 | "metadata": {}, 210 | "outputs": [], 211 | "source": [ 212 | "def visualize_output(sample_data, sample_mean, interval, type_interval):\n", 213 | " plt.hist(sample_data, density=True, alpha=0.5)\n", 214 | " plt.axvline(x=sample_mean, color='r', linestyle='dashed', linewidth=2)\n", 215 | " plt.axvline(x=interval[0], color='g', linewidth=2)\n", 216 | " plt.axvline(x=interval[1], color='g', linewidth=2)\n", 217 | " plt.legend(['Sample Mean', f'{type_interval} interval'])\n", 218 | " plt.show()" 219 | ] 220 | }, 221 | { 222 | "cell_type": "code", 223 | "execution_count": 12, 224 | "id": "503ec788-9f8f-4b5d-a8dd-3ef89b63d755", 225 | "metadata": {}, 226 | "outputs": [ 227 | { 228 | "data": { 229 | "image/png": "\n", 230 | "text/plain": [ 231 | "
" 232 | ] 233 | }, 234 | "metadata": {}, 235 | "output_type": "display_data" 236 | } 237 | ], 238 | "source": [ 239 | "visualize_output(sample_data, sample_mean, confidence_interval, 'confidence')" 240 | ] 241 | }, 242 | { 243 | "cell_type": "markdown", 244 | "id": "7663a3a2-0441-4584-aaad-52cc517aba55", 245 | "metadata": {}, 246 | "source": [ 247 | "## The Bayesian Way" 248 | ] 249 | }, 250 | { 251 | "cell_type": "markdown", 252 | "id": "4e597e38-6c5e-4236-b90d-e599a5b310c8", 253 | "metadata": {}, 254 | "source": [ 255 | "As an example of applying a Bayesian approach to a problem, consider the task of estimating the mean income of a population based on a sample of data. In this case, the goal is to use the sample data and any available prior knowledge to make inferences about the true mean income of the population.\n", 256 | "\n", 257 | "One way to approach this problem using a Bayesian approach is through the use of Bayes' theorem, which allows us to update our belief about the value of the population mean income based on the observed data. Bayes' theorem states that the posterior probability of a hypothesis given some data is equal to the prior probability of the hypothesis multiplied by the likelihood of the data given the hypothesis, normalized by the marginal probability of the data. This can be written mathematically as:\n", 258 | "\n", 259 | "$P(\\theta|x) = \\frac{P(\\theta)P(x|\\theta)}{P(x)}$\n", 260 | "\n", 261 | "where $\\theta$ is the population mean income, $x$ is the observed data (i.e., the sample of income values), and $P(\\theta|x)$, $P(x|\\theta)$, $P(\\theta)$, and $P(x)$ are the posterior probability, likelihood, prior probability, and marginal probability, respectively.\n", 262 | "\n", 263 | "To apply this approach to the income example, we must first specify a prior distribution over the population mean income, $P(\\theta)$. 
This prior distribution represents our belief about the value of the population mean income before we observe the data. The choice of prior distribution will depend on any available prior knowledge and the characteristics of the data-generating process. For example, if we have strong prior knowledge that the population mean income is normally distributed with a mean of $40,000$ and a standard deviation of $5,000$, we could use a normal prior distribution with these parameters." 264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "execution_count": 14, 269 | "id": "cd97561b-86df-4198-9b5e-507711d47b93", 270 | "metadata": {}, 271 | "outputs": [], 272 | "source": [ 273 | "mu_prior = 40000.\n", 274 | "sigma_prior = 5000.\n", 275 | "prior = tfd.Normal(loc=mu_prior, scale=sigma_prior)" 276 | ] 277 | }, 278 | { 279 | "cell_type": "markdown", 280 | "id": "7fc3bde9-3841-42a6-99f2-9ed9bc9dd1ca", 281 | "metadata": {}, 282 | "source": [ 283 | "The likelihood function, $P(x|\\theta)$, represents the probability of observing the sample data, $x$, given a particular value of the population mean income, $\\theta$. Since the sample data are assumed to be independent and identically distributed (i.i.d.), the likelihood function is simply the product of the individual probability density functions of the sample values. 
For example, if the sample data are normally distributed with a known standard deviation, $\\sigma$, each observation is modelled as a normal distribution with mean equal to the population mean income, $\\theta$, and standard deviation $\\sigma$, which we approximate here with the sample standard deviation:\n", 284 | "\n", 285 | "$P(x|\\theta) = \\prod_{i=1}^n \\mathcal{N}(x_i|\\theta, \\sigma)$" 286 | ] 287 | }, 288 | { 289 | "cell_type": "code", 290 | "execution_count": 15, 291 | "id": "c7fc2315-75e5-4167-8c4a-7ebd392e8ece", 292 | "metadata": {}, 293 | "outputs": [], 294 | "source": [ 295 | "mu_likelihood = np.mean(sample_data)\n", 296 | "sigma_likelihood = np.std(sample_data)\n", 297 | "likelihood = tfd.Normal(loc=mu_likelihood, scale=sigma_likelihood)" 298 | ] 299 | }, 300 | { 301 | "cell_type": "markdown", 302 | "id": "b4432af2-f2da-4503-8fba-6ce52256e67a", 303 | "metadata": {}, 304 | "source": [ 305 | "Next, we use Bayes' theorem to compute the posterior distribution:\n", 306 | "\n", 307 | "$P(\\theta|x) = \\frac{ \\mathcal{N}(\\theta|\\mu_0, \\sigma_0)\\prod_{i=1}^n \\mathcal{N}(x_i|\\theta, \\sigma)}{P(x)}$\n", 308 | "\n", 309 | "The posterior distribution can be used to make probabilistic predictions and to quantify the uncertainty surrounding the estimate of the population mean income. For example, we can use the posterior distribution to compute the posterior mean and posterior standard deviation as estimates of the population mean income and the uncertainty around the estimate, respectively.\n", 310 | "\n", 311 | "As we saw above, the posterior distribution is proportional to the product of a Gaussian prior and a Gaussian likelihood. To compute it, we need to introduce one more concept: conjugate distributions. In our case, we have a normal-normal conjugate family, which is a parametric family of distributions where the prior distribution and the likelihood function are both normal distributions. 
This class of models has several attractive properties, including the ability to analytically compute the posterior distribution and the closed-form solution for the maximum a posteriori (MAP) estimate.\n", 312 | "\n", 313 | "The posterior distribution is defined as the distribution of the model parameters (in this case, the population mean $\\theta$, with the standard deviation treated as known) given the observed data $x$. The mean and variance of the posterior distribution can be calculated using the following equations:\n", 314 | "\n", 315 | "$$\\sigma_{\\text{post}}^2 = \\frac{1}{\\frac{n}{\\sigma_l^2} + \\frac{1}{\\sigma_0^2}}$$\n", 316 | "\n", 317 | "\n", 318 | "$$\\mu_{\\text{post}} = \\sigma_{\\text{post}}^2 \\big( \\frac{\\mu_0}{\\sigma_0^2} + \\frac{n \\mu_l}{\\sigma_l^2} \\big)$$\n", 319 | "\n", 320 | "where $\\mu_0$ and $\\sigma_0^2$ are the mean and variance of the prior distribution, respectively, $\\mu_l$ and $\\sigma_l^2$ are the mean and the variance of the likelihood function (here, the sample mean and the sample variance), and $n$ is the sample size.\n", 321 | "\n", 322 | "These equations assume that the prior distribution and the likelihood function are both normal distributions, as is the case in a normal-normal conjugate model. If the prior and likelihood are not normal, or if the posterior distribution does not have a closed-form solution, these parameters can be approximated using techniques such as Markov Chain Monte Carlo (MCMC) sampling.\n", 323 | "\n", 324 | "We are ready to implement our new equations." 
325 | ] 326 | }, 327 | { 328 | "cell_type": "code", 329 | "execution_count": 18, 330 | "id": "bd65f8f2-e9c2-4dc7-abd9-97f5069e3429", 331 | "metadata": {}, 332 | "outputs": [ 333 | { 334 | "data": { 335 | "text/plain": [ 336 | "(45801.615148780984, 52224.89311258316)" 337 | ] 338 | }, 339 | "execution_count": 18, 340 | "metadata": {}, 341 | "output_type": "execute_result" 342 | } 343 | ], 344 | "source": [ 345 | "# Compute the posterior distribution using Bayes' theorem\n", 346 | "var_posterior = 1 / ((sample_size/(sigma_likelihood**2)) + (1/(sigma_prior**2)))\n", 347 | "mu_posterior = var_posterior * (mu_prior/(sigma_prior**2) + (sample_size*mu_likelihood)/(sigma_likelihood**2))\n", 348 | "posterior = tfd.Normal(mu_posterior, tf.sqrt(var_posterior))\n", 349 | "credible_interval = (posterior.quantile(0.025).numpy(), posterior.quantile(0.975).numpy())\n", 350 | "credible_interval" 351 | ] 352 | }, 353 | { 354 | "cell_type": "markdown", 355 | "id": "df7538ed-88ce-4581-ba44-862c7759895e", 356 | "metadata": {}, 357 | "source": [ 358 | "We can see how updating our prior belief with data resulted in our posterior distribution." 359 | ] 360 | }, 361 | { 362 | "cell_type": "code", 363 | "execution_count": 19, 364 | "id": "8e1e272d-f50b-4b3b-8c3a-3da933876f58", 365 | "metadata": {}, 366 | "outputs": [ 367 | { 368 | "data": { 369 | "image/png": "\n", 370 | "text/plain": [ 371 | "
" 372 | ] 373 | }, 374 | "metadata": {}, 375 | "output_type": "display_data" 376 | } 377 | ], 378 | "source": [ 379 | "x = np.linspace(30000, 70000, 1000)\n", 380 | "\n", 381 | "y_prior = prior.prob(x)\n", 382 | "y_likelihood = likelihood.prob(x)\n", 383 | "y_posterior = posterior.prob(x)\n", 384 | "\n", 385 | "plt.plot(x,y_prior, label=\"Prior\")\n", 386 | "plt.plot(x,y_likelihood, label=\"Likelihood\")\n", 387 | "plt.plot(x,y_posterior, label=\"Posterior\")\n", 388 | "\n", 389 | "plt.legend();" 390 | ] 391 | }, 392 | { 393 | "cell_type": "markdown", 394 | "id": "fea52b87-dc31-4af0-bb38-8ed523466b25", 395 | "metadata": {}, 396 | "source": [ 397 | "Finally, as we did for the frequentist case, we can plot our estimate for the mean of the population. Notice that we are actually computing the credible interval and not the confidence interval with the Bayesian approach." 398 | ] 399 | }, 400 | { 401 | "cell_type": "code", 402 | "execution_count": 20, 403 | "id": "57cef556-a63a-4865-8599-f811199c2e79", 404 | "metadata": {}, 405 | "outputs": [ 406 | { 407 | "data": { 408 | "image/png": "\n", 409 | "text/plain": [ 410 | "
" 411 | ] 412 | }, 413 | "metadata": {}, 414 | "output_type": "display_data" 415 | } 416 | ], 417 | "source": [ 418 | "visualize_output(sample_data, posterior.mean(), (posterior.quantile(0.025), posterior.quantile(0.975)), 'credible')" 419 | ] 420 | }, 421 | { 422 | "cell_type": "markdown", 423 | "id": "d140c216-b9c2-4228-99a8-2630ce1e4501", 424 | "metadata": {}, 425 | "source": [ 426 | "# Conclusion\n", 427 | "In conclusion, the frequentist and Bayesian approaches are two different ways of analyzing data and making predictions. The frequentist approach is based on the idea of statistical significance and involves constructing hypotheses about a population and testing them using sample data. The Bayesian approach is based on subjective probability and involves updating an initial belief about a population using collected data.\n", 428 | "Using TensorFlow and TensorFlow Probability, we demonstrated how to compute the mean and confidence interval for the frequentist approach and the mean and credible interval for the Bayesian approach. Both approaches can be useful for different types of problems and it is important to understand the strengths and limitations of each one in order to choose the most appropriate method for a given situation." 
 429 | ] 430 | }, 431 | { 432 | "cell_type": "code", 433 | "execution_count": null, 434 | "id": "ea280cc8-1a89-4029-a655-28a7ac4d911c", 435 | "metadata": {}, 436 | "outputs": [], 437 | "source": [] 438 | } 439 | ], 440 | "metadata": { 441 | "kernelspec": { 442 | "display_name": "Python 3 (ipykernel)", 443 | "language": "python", 444 | "name": "python3" 445 | }, 446 | "language_info": { 447 | "codemirror_mode": { 448 | "name": "ipython", 449 | "version": 3 450 | }, 451 | "file_extension": ".py", 452 | "mimetype": "text/x-python", 453 | "name": "python", 454 | "nbconvert_exporter": "python", 455 | "pygments_lexer": "ipython3", 456 | "version": "3.9.15" 457 | } 458 | }, 459 | "nbformat": 4, 460 | "nbformat_minor": 5 461 | } 462 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Probabilistic Deep Learning with TFP 2 | 3 | A series of articles covering probabilistic approaches to deep learning. The main goal is to extend deep learning models to quantify uncertainty, i.e., to know what they do not know. 4 | 5 | We develop our models using TensorFlow and TensorFlow Probability (TFP). TFP is a Python library built on top of TensorFlow. We are going to start with the basic objects that we can find in TFP and understand how we can manipulate them. We will increase complexity incrementally over the following weeks and combine our probabilistic models with deep learning on modern hardware (e.g. GPU). 6 | --------------------------------------------------------------------------------