├── .gitignore ├── README.md ├── code ├── __init__.py ├── models.py └── utils.py └── notebooks ├── 00_preface.ipynb ├── 01_golem_of_prague.ipynb ├── 02_small_large_worlds.ipynb ├── 03_sampling_the_imaginary.ipynb ├── 04_geocentric_models.ipynb ├── 05_many_vars_and_spurious_waffles.ipynb ├── 06_haunted_dag_and_causal_terror.ipynb ├── 07_ulysses_compass.ipynb ├── 08_conditional_manatees.ipynb ├── 09_mcmc.ipynb ├── 10_entropy_and_glm.ipynb ├── 11_god_spiked_ints.ipynb ├── 12_monsters_and_mixtures.ipynb ├── 13_models_with_memory.ipynb ├── 14_adventures_in_covariance.ipynb ├── 15_missing_data.ipynb ├── 16_genaralized_linear_madness.ipynb └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | data 2 | .ipynb_checkpoints 3 | __pycache__ 4 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Statistical Rethinking 2 | 3 | Going through the book _Statistical Rethinking_ (2nd edition) by Richard McElreath in an attempt to learn Bayesian modeling starting from zero. I'm a `python` kind of guy, so I think I'm going to try and redo all the code examples using one of the various PPL's (Probabilistic Programming Languages) that exist in the `python` universe. I have been getting more into `pytorch` lately as a framework for autodifferentiation and neural networks, and there is a nice-looking package called `pyro` for Bayesian inference that is built on top of it, so I will try and use that. 4 | 5 | I think this is a much better idea than learning R so that I can copy McElreath's code, because I have learned so much more by implementing things from scratch rather than relying on his custom-built `quap`, `precis`, and other functions as black boxes that simply give you the answer and hide away a lot of the implementation details. 
6 | 7 | Du Phan, one of the maintainers of the package, is [doing something similar](https://fehiepsi.github.io/rethinking-pyro/), so their repo can serve as a comparison. 8 | 9 | I will also use a mixture of `numpy`, `sklearn`, `pandas`, `matplotlib`, etc. for various other things if the need arises rather than go straight to `torch`/`pyro` (especially for simpler problems). 10 | 11 | The data used can be found in the [official repository](https://github.com/rmcelreath/rethinking/tree/master/data) for the book. I noticed that some files were missing (like `cars.csv`), but they can be found [here](https://github.com/fehiepsi/rethinking-numpyro/tree/master/data). -------------------------------------------------------------------------------- /code/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ecotner/statistical-rethinking/ffb5b62f06cc6a856fc45655ebd15d32a3dbbf4e/code/__init__.py -------------------------------------------------------------------------------- /code/models.py: -------------------------------------------------------------------------------- 1 | import tqdm.notebook # the submodule must be imported explicitly for tqdm.notebook.tnrange 2 | from torch import tensor as tt # `import torch.tensor as tt` fails on recent torch releases 3 | import pyro 4 | from pyro.infer import SVI, Trace_ELBO 5 | import pyro.infer.autoguide 6 | from pyro.infer.autoguide import AutoMultivariateNormal, AutoDiagonalNormal, init_to_mean, AutoLaplaceApproximation 7 | from pyro.optim import Adam 8 | 9 | class RegressionBase: 10 | def __init__(self, df, categoricals=None): 11 | if categoricals is None: 12 | categoricals = [] 13 | for col in set(df.columns) - set(categoricals): 14 | setattr(self, col, tt(df[col].values).double()) 15 | for col in categoricals: 16 | setattr(self, col, tt(df[col].values).long()) 17 | 18 | def __call__(self): 19 | raise NotImplementedError 20 | 21 | def train(self, num_steps, lr=1e-2, restart=True, autoguide=None, use_tqdm=True): 22 | if restart: 23 | pyro.clear_param_store() 24 | if autoguide is None: 25 | autoguide = 
AutoMultivariateNormal 26 | else: 27 | autoguide = getattr(pyro.infer.autoguide, autoguide) 28 | self.guide = autoguide(self, init_loc_fn=init_to_mean) 29 | svi = SVI(self, guide=self.guide, optim=Adam({"lr": lr}), loss=Trace_ELBO()) 30 | loss = [] 31 | if use_tqdm: 32 | iterator = tqdm.notebook.tnrange(num_steps) 33 | else: 34 | iterator = range(num_steps) 35 | for _ in iterator: 36 | loss.append(svi.step()) 37 | return loss 38 | 39 | 40 | -------------------------------------------------------------------------------- /code/utils.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | import matplotlib.pyplot as plt 4 | from networkx.algorithms.moral import moral_graph 5 | from networkx.algorithms.dag import ancestors 6 | from networkx.algorithms.shortest_paths import has_path 7 | from pyro.infer import Predictive 8 | from pyro.infer.mcmc import NUTS, MCMC 9 | from pyro import poutine 10 | import torch 11 | from torch import tensor as tt 12 | 13 | ### Sample summarization and interval calculation 14 | 15 | def HPDI(samples, prob): 16 | """Calculates the Highest Posterior Density Interval (HPDI) 17 | 18 | Sorts all the samples, then with a fixed width window (in index space), 19 | iterates through them all and calculates the interval width, keeping the 20 | minimum as it moves along. Probably only useful/correct for continuous 21 | distributions or discrete distributions with a notion of ordering and a large 22 | number of possible values. 
23 | Arguments: 24 | samples (np.array): array of samples from a 1-dim posterior distribution 25 | prob (float): the probability mass of the desired interval 26 | Returns: 27 | Tuple[float, float]: the lower/upper bounds of the interval 28 | """ 29 | samples = sorted(samples) 30 | N = len(samples) 31 | W = min(int(round(N*prob)), N-1) # clamp so that prob=1 does not index past the end 32 | min_interval = float('inf') 33 | bounds = [0, W] 34 | for i in range(N-W): 35 | interval = samples[i+W] - samples[i] 36 | if interval < min_interval: 37 | min_interval = interval 38 | bounds = [i, i+W] 39 | return samples[bounds[0]], samples[bounds[1]] 40 | 41 | 42 | def precis(samples: dict, prob=0.89): 43 | """Computes some summary statistics of the given samples. 44 | 45 | Arguments: 46 | samples (Dict[str, np.array]): dictionary of samples, where the key 47 | is the name of the sample site, and the value is the collection 48 | of sample values 49 | prob (float): the probability mass of the symmetric credible interval 50 | Returns: 51 | pd.DataFrame: summary dataframe 52 | """ 53 | p1, p2 = (1-prob)/2, 1-(1-prob)/2 54 | cols = ["mean","stddev",f"{100*p1:.1f}%",f"{100*p2:.1f}%"] 55 | df = pd.DataFrame(columns=cols, index=samples.keys()) 56 | if isinstance(samples, pd.DataFrame): 57 | samples = {k: np.array(samples[k]) for k in samples.columns} 58 | elif not isinstance(samples, dict): 59 | raise TypeError("`samples` must be either dict or DataFrame") 60 | for k, v in samples.items(): 61 | df.loc[k, "mean"] = v.mean() # single .loc call avoids chained assignment 62 | df.loc[k, "stddev"] = v.std() 63 | q1, q2 = np.quantile(v, [p1, p2]) 64 | df.loc[k, f"{100*p1:.1f}%"] = q1 65 | df.loc[k, f"{100*p2:.1f}%"] = q2 66 | return df 67 | 68 | ### Causal inference tools 69 | 70 | def independent(G, n1, n2, n3=None): 71 | """Computes whether n1 and n2 are independent given n3 on the DAG G 72 | 73 | Can find a decent exposition of the algorithm at http://web.mit.edu/jmn/www/6.034/d-separation.pdf 74 | """ 75 | if n3 is None: 76 | n3 = set() 77 | elif isinstance(n3, (int, str)): 78 | n3 = set([n3]) 79 | 
elif not isinstance(n3, set): 80 | n3 = set(n3) 81 | # Construct the ancestral graph of n1, n2, and n3 82 | a = ancestors(G, n1) | ancestors(G, n2) | {n1, n2} | n3 83 | G = G.subgraph(a) 84 | # Moralize the graph 85 | M = moral_graph(G) 86 | # Remove n3 (if applicable) 87 | M.remove_nodes_from(n3) 88 | # Check that path exists between n1 and n2 89 | return not has_path(M, n1, n2) 90 | 91 | def conditional_independencies(G): 92 | """Finds all conditional independencies in the DAG G 93 | 94 | Only works when conditioning on a single node at a time 95 | """ 96 | tuples = [] 97 | for i1, n1 in enumerate(G.nodes): 98 | for i2, n2 in enumerate(G.nodes): 99 | if i1 >= i2: 100 | continue 101 | for n3 in G.nodes: 102 | try: 103 | if independent(G, n1, n2, n3): 104 | tuples.append((n1, n2, n3)) 105 | except: 106 | pass 107 | return tuples 108 | 109 | def marginal_independencies(G): 110 | """Finds all marginal independencies in the DAG G 111 | """ 112 | tuples = [] 113 | for i1, n1 in enumerate(G.nodes): 114 | for i2, n2 in enumerate(G.nodes): 115 | if i1 >= i2: 116 | continue 117 | try: 118 | if independent(G, n1, n2, {}): 119 | tuples.append((n1, n2, {})) 120 | except: 121 | pass 122 | return tuples 123 | 124 | def sample_posterior(model, num_samples, sites=None, data=None): 125 | p = Predictive( 126 | model, 127 | guide=model.guide, 128 | num_samples=num_samples, 129 | return_sites=sites, 130 | ) 131 | if data is None: 132 | p = p() 133 | else: 134 | p = p(data) 135 | return {k: v.detach().numpy() for k, v in p.items()} 136 | 137 | def sample_prior(model, num_samples, sites=None): 138 | return { 139 | k: v.detach().numpy() 140 | for k, v in Predictive( 141 | model, 142 | {}, 143 | return_sites=sites, 144 | num_samples=num_samples 145 | )().items() 146 | } 147 | 148 | def plot_intervals(samples, p): 149 | for i, (k, s) in enumerate(samples.items()): 150 | mean = s.mean() 151 | hpdi = HPDI(s, p) 152 | plt.scatter([mean], [i], facecolor="none", edgecolor="black") 153 | 
plt.plot(hpdi, [i, i], color="C0") 154 | plt.axhline(i, color="grey", alpha=0.5, linestyle="--") 155 | plt.yticks(range(len(samples)), samples.keys(), fontsize=15) 156 | plt.axvline(0, color="black", alpha=0.5, linestyle="--") 157 | 158 | 159 | def WAIC(model, x, y, out_var_nm, num_samples=100): 160 | p = torch.zeros((num_samples, len(y))) 161 | # Get log probability samples 162 | for i in range(num_samples): 163 | tr = poutine.trace(poutine.condition(model, data=model.guide())).get_trace(x) 164 | dist = tr.nodes[out_var_nm]["fn"] 165 | p[i] = dist.log_prob(y).detach() 166 | pmax = p.max(axis=0).values 167 | lppd = pmax + (p - pmax).exp().mean(axis=0).log() # numerically stable version 168 | penalty = p.var(axis=0) 169 | return -2*(lppd - penalty) 170 | 171 | 172 | def format_data(df, categoricals=None): 173 | data = dict() 174 | if categoricals is None: 175 | categoricals = [] 176 | for col in set(df.columns) - set(categoricals): 177 | data[col] = tt(df[col].values).double() 178 | for col in categoricals: 179 | data[col] = tt(df[col].values).long() 180 | return data 181 | 182 | 183 | def train_nuts(model, data, num_warmup, num_samples, num_chains=1, **kwargs): 184 | _kwargs = dict(adapt_step_size=True, adapt_mass_matrix=True, jit_compile=True) 185 | _kwargs.update(kwargs) 186 | print(_kwargs) 187 | kernel = NUTS(model, **_kwargs) 188 | engine = MCMC(kernel, num_samples, num_warmup, num_chains=num_chains) 189 | engine.run(data, training=True) 190 | return engine 191 | 192 | 193 | def traceplot(s, num_chains=1): 194 | fig, axes = plt.subplots(nrows=len(s), figsize=(12, len(s)*5)) 195 | for (k, v), ax in zip(s.items(), axes): 196 | plt.sca(ax) 197 | if num_chains > 1: 198 | for c in range(num_chains): 199 | plt.plot(v[c], linewidth=0.5) 200 | else: 201 | plt.plot(v, linewidth=0.5) 202 | plt.ylabel(k) 203 | plt.xlabel("Sample index") 204 | return fig 205 | 206 | def trankplot(s, num_chains): 207 | fig, axes = plt.subplots(nrows=len(s), figsize=(12, len(s)*num_chains)) 
208 | ranks = {k: np.argsort(np.argsort(v, axis=None)).reshape(v.shape) for k, v in s.items()} # argsort of argsort gives each sample's rank 209 | num_samples = 1 210 | for p in list(s.values())[0].shape: 211 | num_samples *= p 212 | bins = np.linspace(0, num_samples, 30) 213 | for i, (ax, (k, v)) in enumerate(zip(axes, ranks.items())): 214 | for c in range(num_chains): 215 | ax.hist(v[c], bins=bins, histtype="step", linewidth=2, alpha=0.5) 216 | ax.set_xlim(left=0, right=num_samples) 217 | ax.set_yticks([]) 218 | ax.set_ylabel(k) 219 | plt.xlabel("sample rank") 220 | return fig 221 | 222 | 223 | def unnest_samples(s, max_depth=1): 224 | """Unnests samples from multivariate distributions 225 | 226 | The general index structure of a sample tensor is 227 | [[chains,] samples [,idx1, idx2, ...]]. Sometimes the distribution is univariate 228 | and there are no additional indices. So we will always unnest from the right, but 229 | only if the tensor has rank of 3 or more (2 in the case of no grouping by chains). 230 | """ 231 | def _unnest_samples(s): 232 | _s = dict() 233 | for k in s: 234 | assert s[k].dim() > 0 235 | if s[k].dim() == 1: 236 | _s[k] = s[k] 237 | elif s[k].dim() == 2: 238 | for i in range(s[k].shape[1]): 239 | _s[f"{k}[{i}]"] = s[k][:,i] 240 | else: 241 | for i in range(s[k].shape[1]): 242 | _s[f"{k}[{i}]"] = s[k][:,i,...] 
243 | return _s 244 | 245 | for _ in range(max_depth): 246 | s = _unnest_samples(s) 247 | if all([v.dim() == 1 for v in s.values()]): 248 | break 249 | return s 250 | 251 | 252 | def get_log_prob(mcmc, data, site_names): 253 | """Gets the pointwise log probability of the posterior density conditioned on the data 254 | 255 | Arguments: 256 | mcmc (pyro.infer.mcmc.MCMC): the fitted MCMC object 257 | data (dict): dictionary containing all the input data (including return sites) 258 | site_names (str or List[str]): names of return sites to measure log likelihood at 259 | Returns: 260 | Tensor: pointwise log-likelihood of shape (num posterior samples, num data points) 261 | """ 262 | samples = mcmc.get_samples() 263 | model = mcmc.kernel.model 264 | # get number of samples (must be the same for every site) 265 | N = [v.shape[0] for v in samples.values()] 266 | assert all(n == N[0] for n in N) 267 | N = N[0] 268 | if isinstance(site_names, str): 269 | site_names = [site_names] 270 | # iterate over samples 271 | log_prob = torch.zeros(N, len(data[site_names[0]])) 272 | for i in range(N): 273 | # condition on samples and get trace 274 | s = {k: v[i] for k, v in samples.items()} 275 | for nm in site_names: 276 | s[nm] = data[nm] 277 | tr = poutine.trace(poutine.condition(model, data=s)).get_trace(data) 278 | # get pointwise log probability 279 | for nm in site_names: 280 | node = tr.nodes[nm] 281 | log_prob[i] += node["fn"].log_prob(node["value"]) 282 | return log_prob -------------------------------------------------------------------------------- /notebooks/00_preface.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Preface" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": {}, 14 | "outputs": [ 15 | { 16 | "name": "stdout", 17 | "output_type": "stream", 18 | "text": [ 19 | "/home/ecotner/statistical-rethinking\n" 20 | ] 21 | } 22 | ], 
23 | "source": [ 24 | "%cd ~/statistical-rethinking" 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": 2, 30 | "metadata": {}, 31 | "outputs": [], 32 | "source": [ 33 | "import numpy as np\n", 34 | "import pandas as pd\n", 35 | "from sklearn.linear_model import LinearRegression\n", 36 | "import matplotlib.pyplot as plt" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "### Code 0.1" 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "Illustration of what `code` looks like" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": 3, 56 | "metadata": {}, 57 | "outputs": [ 58 | { 59 | "name": "stdout", 60 | "output_type": "stream", 61 | "text": [ 62 | "All models are wrong, but some are useful\n" 63 | ] 64 | } 65 | ], 66 | "source": [ 67 | "print(\"All models are wrong, but some are useful\")" 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": {}, 73 | "source": [ 74 | "### Code 0.2" 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": {}, 80 | "source": [ 81 | "A complicated way to compute `10*20=200`" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": 4, 87 | "metadata": {}, 88 | "outputs": [ 89 | { 90 | "data": { 91 | "text/plain": [ 92 | "200.0000000000001" 93 | ] 94 | }, 95 | "execution_count": 4, 96 | "metadata": {}, 97 | "output_type": "execute_result" 98 | } 99 | ], 100 | "source": [ 101 | "x = np.array([1, 2])\n", 102 | "x = x*10\n", 103 | "x = np.log(x)\n", 104 | "x = np.sum(x)\n", 105 | "x = np.exp(x)\n", 106 | "x" 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "metadata": {}, 112 | "source": [ 113 | "### Code 0.3" 114 | ] 115 | }, 116 | { 117 | "cell_type": "markdown", 118 | "metadata": {}, 119 | "source": [ 120 | "Mathematically, the expressions\n", 121 | "$$\n", 122 | "p_1 = \\log(0.01^{200}) \\\\\n", 123 | "p_2 = 200 \\times \\log(0.01)\n", 124 | "$$\n", 125 | 
"are equivalent. However, if you compute them numerically, you will see that one is much more stable:" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": 5, 131 | "metadata": {}, 132 | "outputs": [ 133 | { 134 | "name": "stdout", 135 | "output_type": "stream", 136 | "text": [ 137 | "-inf\n", 138 | "-921.0340371976182\n" 139 | ] 140 | }, 141 | { 142 | "name": "stderr", 143 | "output_type": "stream", 144 | "text": [ 145 | "/home/ecotner/.local/lib/python3.7/site-packages/ipykernel_launcher.py:1: RuntimeWarning: divide by zero encountered in log\n", 146 | " \"\"\"Entry point for launching an IPython kernel.\n" 147 | ] 148 | } 149 | ], 150 | "source": [ 151 | "print(np.log(0.01**200))\n", 152 | "print(200*np.log(0.01))" 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "metadata": {}, 158 | "source": [ 159 | "### Code 0.4" 160 | ] 161 | }, 162 | { 163 | "cell_type": "markdown", 164 | "metadata": {}, 165 | "source": [ 166 | "Running linear regression on a sample dataset." 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": 6, 172 | "metadata": {}, 173 | "outputs": [ 174 | { 175 | "name": "stdout", 176 | "output_type": "stream", 177 | "text": [ 178 | "coefficients: [3.93240876]\n", 179 | "intercept: -17.579094890510973\n" 180 | ] 181 | }, 182 | { 183 | "data": { 184 | "image/png": "\n", 185 | "text/plain": [ 186 | "
" 187 | ] 188 | }, 189 | "metadata": { 190 | "needs_background": "light" 191 | }, 192 | "output_type": "display_data" 193 | } 194 | ], 195 | "source": [ 196 | "# Import the data\n", 197 | "cars = pd.read_csv(\"data/cars.csv\")\n", 198 | "\n", 199 | "# Fit a linear regression of distance on speed\n", 200 | "model = LinearRegression()\n", 201 | "X = cars[\"speed\"].values.reshape(-1, 1)\n", 202 | "y = cars[\"dist\"].values\n", 203 | "model.fit(X, y)\n", 204 | "\n", 205 | "# Estimated coefficients from the model\n", 206 | "print(\"coefficients:\", model.coef_)\n", 207 | "print(\"intercept:\", model.intercept_)\n", 208 | "\n", 209 | "# Plot residuals against speed\n", 210 | "res = y - model.predict(X)\n", 211 | "plt.scatter(cars[\"speed\"], res)\n", 212 | "plt.axhline(0, color='grey', linestyle='--', alpha=0.5)\n", 213 | "plt.title(\"Linear regression residuals\")\n", 214 | "plt.ylabel(\"residuals\")\n", 215 | "plt.xlabel(\"speed\")\n", 216 | "plt.show()" 217 | ] 218 | }, 219 | { 220 | "cell_type": "markdown", 221 | "metadata": {}, 222 | "source": [ 223 | "### Code 0.5\n", 224 | "\n", 225 | "The author installs his `rethinking` package for `R`, but since we're doing this in `python`, there is no equivalent step." 
226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": null, 231 | "metadata": {}, 232 | "outputs": [], 233 | "source": [] 234 | } 235 | ], 236 | "metadata": { 237 | "kernelspec": { 238 | "display_name": "Python 3", 239 | "language": "python", 240 | "name": "python3" 241 | }, 242 | "language_info": { 243 | "codemirror_mode": { 244 | "name": "ipython", 245 | "version": 3 246 | }, 247 | "file_extension": ".py", 248 | "mimetype": "text/x-python", 249 | "name": "python", 250 | "nbconvert_exporter": "python", 251 | "pygments_lexer": "ipython3", 252 | "version": "3.7.5" 253 | } 254 | }, 255 | "nbformat": 4, 256 | "nbformat_minor": 4 257 | } 258 | -------------------------------------------------------------------------------- /notebooks/01_golem_of_prague.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 1: The Golem of Prague" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "This chapter is entirely expository; there are no code samples. 
There are many interesting philosophical ideas touched upon that I will try to briefly summarize:\n", 15 | "\n", 16 | "* Models are not hypotheses (nor are hypotheses models)\n", 17 | " * In many cases, one hypothesis could be described by multiple sufficiently flexible models\n", 18 | " * Multiple hypotheses may lead to the same statistical model\n", 19 | " * The underlying scientific/causal model is the real source of truth, but can be difficult or impossible to ascertain.\n", 20 | "* There are multiple types of errors and uncertainties, and we should be careful to distinguish them\n", 21 | " * measurement/systematic error\n", 22 | " * level of belief\n", 23 | " * statistical variations\n", 24 | "* Just because something cannot be proven wrong does not mean it is right\n", 25 | "* The book will focus mainly on four main tools/techniques:\n", 26 | "\n", 27 | " 1. Bayesian data analysis\n", 28 | "\n", 29 | " 1. Reasons about outcomes in the form of probability distributions\n", 30 | " 2. Many natural intuitions about probability and statistics align with the Bayesian interpretation as opposed to others\n", 31 | " 2. Model comparison\n", 32 | " 1. Can use techniques like cross-validation and information criterion to measure generalization accuracy and compare models\n", 33 | " 2. Will be useful for determining if complex models are overfitting\n", 34 | " 3. Multilevel models\n", 35 | " 1. Complex models can be built from hierarchies of simpler ones\n", 36 | " 2. Makes use of _partial pooling_ trick to share information across units in a model to produce better estimates for all units\n", 37 | " 1. helps reduce the effects of common problems like repeat sampling, dataset imbalance, population variation, and misuse of data averaging\n", 38 | " 3. Author argues that multilevel regression should be the default!\n", 39 | " 4. Graphical causal models\n", 40 | " 1. Statistics cannot tell you anything about causation\n", 41 | " 2. 
If two events are correlated, which one caused the other?\n", 42 | " 1. Maybe both could be traced back to a common \"confounding factor\", so there is actually no causal relationship!\n", 43 | " 3. A DAG (Directed Acyclic Graph) is a common tool for specifying causal models that can represent chains of causal relationships\n", 44 | "* The layout of the chapters will be\n", 45 | "\n", 46 | " * Chapters 2/3 introduce Bayesian foundations\n", 47 | " * Chapters 4-9 explore multiple linear regression from a Bayesian perspective, and touch on the problem of overfitting\n", 48 | " * Chapters 9-12 explore \"generalized\" linear models, MCMC (Markov Chain Monte Carlo), and the use of \"maximum entropy\"\n", 49 | " * Chapters 13-16 discuss multilevel models as well as some other specialized models\n", 50 | " * Chapter 17 kind of returns to the beginning and wraps everything up" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [] 59 | } 60 | ], 61 | "metadata": { 62 | "kernelspec": { 63 | "display_name": "Python 3", 64 | "language": "python", 65 | "name": "python3" 66 | }, 67 | "language_info": { 68 | "codemirror_mode": { 69 | "name": "ipython", 70 | "version": 3 71 | }, 72 | "file_extension": ".py", 73 | "mimetype": "text/x-python", 74 | "name": "python", 75 | "nbconvert_exporter": "python", 76 | "pygments_lexer": "ipython3", 77 | "version": "3.7.5" 78 | } 79 | }, 80 | "nbformat": 4, 81 | "nbformat_minor": 4 82 | } 83 | -------------------------------------------------------------------------------- /notebooks/10_entropy_and_glm.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 10: Big Entropy and the Generalized Linear Model\n", 8 | "We are going to examine the role of entropy in our choice of distributions to represent our priors/posteriors. 
The guiding principle here is that we want to choose the distribution that maximizes entropy (uncertainty) given some constraints. Before, we only really used Gaussian distributions, which it turns out are the maximum entropy distribution over the real numbers given a fixed mean and variance." 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": 1, 14 | "metadata": {}, 15 | "outputs": [], 16 | "source": [ 17 | "import numpy as np\n", 18 | "import pandas as pd\n", 19 | "import matplotlib.pyplot as plt\n", 20 | "from scipy.special import binom" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "### Code 10.1 - 10.4\n", 28 | "Now we will analyze the role of entropy in choosing a maximum entropy discrete distribution for count data. If we assume we have 10 pebbles that can be split between 5 buckets, how many ways are there to arrange them? How many of these ways end up with the same number of pebbles in each bucket?" 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": 2, 34 | "metadata": {}, 35 | "outputs": [], 36 | "source": [ 37 | "# consider 5 different potential orderings of the pebbles\n", 38 | "p = {\n", 39 | " \"A\": [0, 0, 10, 0, 0],\n", 40 | " \"B\": [0, 1, 8, 1, 0],\n", 41 | " \"C\": [0, 2, 6, 2, 0],\n", 42 | " \"D\": [1, 2, 4, 2, 1],\n", 43 | " \"E\": [2, 2, 2, 2, 2],\n", 44 | "}\n", 45 | "# normalize\n", 46 | "p_norm = {k: np.array(v)/np.sum(v) for k, v in p.items()}" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": 3, 52 | "metadata": {}, 53 | "outputs": [ 54 | { 55 | "name": "stderr", 56 | "output_type": "stream", 57 | "text": [ 58 | "/home/ecotner/.local/lib/python3.7/site-packages/ipykernel_launcher.py:2: RuntimeWarning: divide by zero encountered in log\n", 59 | " \n" 60 | ] 61 | }, 62 | { 63 | "data": { 64 | "text/plain": [ 65 | "{'A': -0.0,\n", 66 | " 'B': 0.639031859650177,\n", 67 | " 'C': 0.9502705392332347,\n", 68 | " 'D': 
1.4708084763221112,\n", 69 | " 'E': 1.6094379124341005}" 70 | ] 71 | }, 72 | "execution_count": 3, 73 | "metadata": {}, 74 | "output_type": "execute_result" 75 | } 76 | ], 77 | "source": [ 78 | "# calculate entropy\n", 79 | "H = {k: -(q * np.where(q==0, 0, np.log(q))).sum() for k, q in p_norm.items()}\n", 80 | "H" 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "metadata": {}, 86 | "source": [ 87 | "It turns out that the entropy is highest for the most uniform distribution. How many ways can these counts of pebbles be realized? Well there are $\\binom{N}{k}$ ways to sample $k$ identical objects from a pool of $N$. So if we envision looking into each bucket in sequence, we know that for the first bucket, there are $\\binom{N}{k_1}$ ways of arranging the pebbles in that bucket, $\\binom{N-k_1}{k_2}$ ways of arranging the pebbles in the second bucket (because $k_1$ of the pebbles are already in the first bucket, so there are only $N - k_1$ \"free\" pebbles left), $\\binom{N-(k_1+k_2)}{k_3}$ in the third, etc..." 
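The sequential product of binomial coefficients described above telescopes into the multinomial coefficient $N!/(k_1!\,k_2! \cdots k_m!)$. A minimal sketch that checks the two forms against each other using only the standard library (Python 3.8+ is assumed for `math.comb` and `math.prod`; the function names `ways_sequential` and `ways_multinomial` are illustrative and do not appear in the notebook):

```python
from math import comb, factorial, prod

def ways_sequential(counts):
    """Fill the buckets one at a time, choosing k pebbles from those still free."""
    n_left = sum(counts)
    total = 1
    for k in counts:
        total *= comb(n_left, k)
        n_left -= k
    return total

def ways_multinomial(counts):
    """Closed form: N! / (k_1! * k_2! * ... * k_m!)."""
    return factorial(sum(counts)) // prod(factorial(k) for k in counts)

# Compare both forms on a few of the pebble arrangements from the text
for counts in ([0, 0, 10, 0, 0], [0, 1, 8, 1, 0], [2, 2, 2, 2, 2]):
    assert ways_sequential(counts) == ways_multinomial(counts)
    print(counts, ways_sequential(counts))
```

Working with exact integers sidesteps the floating-point rounding that can creep in when the coefficients are computed as floats and multiplied.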
88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": 4, 93 | "metadata": {}, 94 | "outputs": [ 95 | { 96 | "data": { 97 | "text/plain": [ 98 | "{'A': 1, 'B': 90, 'C': 1260, 'D': 37800, 'E': 113400}" 99 | ] 100 | }, 101 | "execution_count": 4, 102 | "metadata": {}, 103 | "output_type": "execute_result" 104 | } 105 | ], 106 | "source": [ 107 | "ways = dict()\n", 108 | "for k in p:\n", 109 | " n_left = 10\n", 110 | " w = []\n", 111 | " for n in p[k]:\n", 112 | " w.append(binom(n_left, n))\n", 113 | " n_left -= n\n", 114 | " ways[k] = int(np.prod(w))\n", 115 | "ways" 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": 5, 121 | "metadata": {}, 122 | "outputs": [ 123 | { 124 | "data": { 125 | "text/plain": [ 126 | "{'A': 0.0,\n", 127 | " 'B': 0.6491853096329675,\n", 128 | " 'C': 1.029920801838728,\n", 129 | " 'D': 1.5206098613995798,\n", 130 | " 'E': 1.6791061114716954}" 131 | ] 132 | }, 133 | "execution_count": 5, 134 | "metadata": {}, 135 | "output_type": "execute_result" 136 | } 137 | ], 138 | "source": [ 139 | "logwayspp = {k: np.log2(ways[k])/10 for k in ways}\n", 140 | "logwayspp" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": 6, 146 | "metadata": {}, 147 | "outputs": [ 148 | { 149 | "data": { 150 | "image/png": "\n", 151 | "text/plain": [ 152 | "
" 153 | ] 154 | }, 155 | "metadata": { 156 | "needs_background": "light" 157 | }, 158 | "output_type": "display_data" 159 | } 160 | ], 161 | "source": [ 162 | "x = np.linspace(-0.05, 1.80, 3)\n", 163 | "plt.plot(x, x, color=\"black\", linestyle=\"--\", zorder=-1)\n", 164 | "x = list(logwayspp.values())\n", 165 | "y = list(H.values())\n", 166 | "plt.scatter(x, y)\n", 167 | "plt.xlabel(\"log(ways)\")\n", 168 | "plt.ylabel(\"entropy\")\n", 169 | "labels = list(H.keys())\n", 170 | "for i in range(len(labels)):\n", 171 | " plt.text(x[i], y[i]+0.15, labels[i], horizontalalignment=\"center\", fontsize=15)\n", 172 | "plt.show()" 173 | ] 174 | }, 175 | { 176 | "cell_type": "markdown", 177 | "metadata": {}, 178 | "source": [ 179 | "### Code 10.5 - 10.6\n", 180 | "Now we want to compare the entropies of several potential probability distributions for sampling blue and white marbles from a bag, where we _know_ that the expected number of blue marbles over two draws is exactly 1. We consider the following proposal distributions" 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": 7, 186 | "metadata": {}, 187 | "outputs": [ 188 | { 189 | "data": { 190 | "text/html": [ 191 | "
" 247 | ], 248 | "text/plain": [ 249 | " ww bw wb bb\n", 250 | "A 0.250000 0.250000 0.250000 0.250000\n", 251 | "B 0.333333 0.166667 0.166667 0.333333\n", 252 | "C 0.166667 0.333333 0.333333 0.166667\n", 253 | "D 0.125000 0.500000 0.250000 0.125000" 254 | ] 255 | }, 256 | "execution_count": 7, 257 | "metadata": {}, 258 | "output_type": "execute_result" 259 | } 260 | ], 261 | "source": [ 262 | "p = pd.DataFrame([\n", 263 | " [1/4, 1/4, 1/4, 1/4],\n", 264 | " [2/6, 1/6, 1/6, 2/6],\n", 265 | " [1/6, 2/6, 2/6, 1/6],\n", 266 | " [1/8, 4/8, 2/8, 1/8],\n", 267 | "], columns=[\"ww\",\"bw\",\"wb\",\"bb\"], index=list(\"ABCD\"))\n", 268 | "p" 269 | ] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "execution_count": 8, 274 | "metadata": {}, 275 | "outputs": [ 276 | { 277 | "data": { 278 | "text/plain": [ 279 | "A 1.0\n", 280 | "B 1.0\n", 281 | "C 1.0\n", 282 | "D 1.0\n", 283 | "dtype: float64" 284 | ] 285 | }, 286 | "execution_count": 8, 287 | "metadata": {}, 288 | "output_type": "execute_result" 289 | } 290 | ], 291 | "source": [ 292 | "# Compute expected value of # of blue marbles\n", 293 | "(p*np.array([[0, 1, 1, 2]])).sum(axis=1)" 294 | ] 295 | }, 296 | { 297 | "cell_type": "code", 298 | "execution_count": 9, 299 | "metadata": {}, 300 | "outputs": [ 301 | { 302 | "data": { 303 | "text/plain": [ 304 | "A 1.386294\n", 305 | "B 1.329661\n", 306 | "C 1.329661\n", 307 | "D 1.213008\n", 308 | "dtype: float64" 309 | ] 310 | }, 311 | "execution_count": 9, 312 | "metadata": {}, 313 | "output_type": "execute_result" 314 | } 315 | ], 316 | "source": [ 317 | "# compute entropy of each distribution\n", 318 | "-(p*np.log(p)).sum(axis=1)" 319 | ] 320 | }, 321 | { 322 | "cell_type": "markdown", 323 | "metadata": {}, 324 | "source": [ 325 | "We see that distribution A has the largest entropy. It just happens to be the same as the binomal distribution for $b$ successes out of $n=2$ trials." 
326 | ] 327 | }, 328 | { 329 | "cell_type": "markdown", 330 | "metadata": {}, 331 | "source": [ 332 | "### Code 10.7 - 10.13\n", 333 | "The above example was kind of special because the distribution over outcomes can remain flat and still be consistent with the constraint. What if the expected value were 1.4 blue marbles in two draws ($p = 0.7$)? The binomial distribution with this expected value is" 334 | ] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "execution_count": 10, 339 | "metadata": {}, 340 | "outputs": [ 341 | { 342 | "data": { 343 | "text/plain": [ 344 | "array([0.09, 0.21, 0.21, 0.49])" 345 | ] 346 | }, 347 | "execution_count": 10, 348 | "metadata": {}, 349 | "output_type": "execute_result" 350 | } 351 | ], 352 | "source": [ 353 | "p = 0.7\n", 354 | "A = np.array([(1-p)**2, p*(1-p), (1-p)*p, p**2])\n", 355 | "A" 356 | ] 357 | }, 358 | { 359 | "cell_type": "code", 360 | "execution_count": 11, 361 | "metadata": {}, 362 | "outputs": [ 363 | { 364 | "data": { 365 | "text/plain": [ 366 | "1.221728604109787" 367 | ] 368 | }, 369 | "execution_count": 11, 370 | "metadata": {}, 371 | "output_type": "execute_result" 372 | } 373 | ], 374 | "source": [ 375 | "# entropy of distribution\n", 376 | "-(A*np.log(A)).sum()" 377 | ] 378 | }, 379 | { 380 | "cell_type": "markdown", 381 | "metadata": {}, 382 | "source": [ 383 | "If we randomly generate a bunch of distributions with the same expected value of 1.4, then we expect that none of them will have a larger entropy than the binomial's 1.22." 
384 | ] 385 | }, 386 | { 387 | "cell_type": "code", 388 | "execution_count": 12, 389 | "metadata": {}, 390 | "outputs": [], 391 | "source": [ 392 | "def sim_dists(G=1.4):\n", 393 | " x123 = np.random.rand(3)\n", 394 | " x4 = (G * x123.sum() - x123[1] - x123[2])/(2-G)\n", 395 | " z = x123.sum() + x4\n", 396 | " p = np.array([*x123, x4])/z\n", 397 | " return dict(H=-(p*np.log(p)).sum(), p=p)\n", 398 | "\n", 399 | "H = [sim_dists() for _ in range(10_000)]\n", 400 | "entropies = np.array([d[\"H\"] for d in H])\n", 401 | "distributions = np.array([d[\"p\"] for d in H])" 402 | ] 403 | }, 404 | { 405 | "cell_type": "code", 406 | "execution_count": 13, 407 | "metadata": {}, 408 | "outputs": [ 409 | { 410 | "data": { 411 | "image/png": "\n", 412 | "text/plain": [ 413 | "
" 414 | ] 415 | }, 416 | "metadata": { 417 | "needs_background": "light" 418 | }, 419 | "output_type": "display_data" 420 | } 421 | ], 422 | "source": [ 423 | "plt.hist(entropies, bins=100, density=True, histtype=\"step\", linewidth=1.15, label=\"random\")\n", 424 | "plt.xlabel(\"entropy\")\n", 425 | "plt.ylabel(\"density\")\n", 426 | "plt.axvline(-(A*np.log(A)).sum(), color=\"black\", linestyle=\"--\", label=\"binomial\")\n", 427 | "plt.legend(title=\"distributions\")\n", 428 | "plt.show()" 429 | ] 430 | }, 431 | { 432 | "cell_type": "markdown", 433 | "metadata": {}, 434 | "source": [ 435 | "We can see that the largest entropy sample has a distribution that is almost identical to the binomial distribution." 436 | ] 437 | }, 438 | { 439 | "cell_type": "code", 440 | "execution_count": 14, 441 | "metadata": {}, 442 | "outputs": [ 443 | { 444 | "name": "stdout", 445 | "output_type": "stream", 446 | "text": [ 447 | "max entropy: 1.221705804760178\n", 448 | "sample distribution: [0.08953159 0.21253592 0.20840089 0.48953159]\n", 449 | "binomial distribution: [0.09 0.21 0.21 0.49]\n" 450 | ] 451 | } 452 | ], 453 | "source": [ 454 | "idx = np.argmax(entropies)\n", 455 | "print(\"max entropy:\", entropies[idx])\n", 456 | "print(\"sample distribution:\", distributions[idx])\n", 457 | "print(\"binomial distribution:\", A)" 458 | ] 459 | }, 460 | { 461 | "cell_type": "markdown", 462 | "metadata": {}, 463 | "source": [ 464 | "## Generalized linear models" 465 | ] 466 | }, 467 | { 468 | "cell_type": "markdown", 469 | "metadata": {}, 470 | "source": [ 471 | "The rest of the chapter is about generalized linear models (abbreviated GLM). There are no more code snippets, just some general theoretical motivation for their use. 
The idea is that instead of using the normal distribution and having $\\mu$ be a linear function of the predictors, perhaps we could (and in fact _should_) use other distributions in different scenarios, and instead try to predict the parameters that define them as functions of the predictors." 472 | ] 473 | }, 474 | { 475 | "cell_type": "markdown", 476 | "metadata": {}, 477 | "source": [ 478 | "For example, if we have some count data where we need to infer the probability of an event occurring, we could use the following GLM:\n", 479 | "\n", 480 | "$$\n", 481 | "\\begin{align}\n", 482 | "y_i &\\sim \\text{Binomial}(n, p_i) \\\\\n", 483 | "f(p_i) &= \\alpha + \\beta x_i\n", 484 | "\\end{align}\n", 485 | "$$\n", 486 | "\n", 487 | "where $f(p)$ is known as the \"link function\", and its purpose is to connect the range of the linear function $\\alpha + \\beta x$ to the domain of the parameter $p$. The problem is that $\\alpha + \\beta x$ can in principle be any real number, but $p \\in [0, 1]$. So we need the inverse function $f^{-1}$ to \"squash\" the real line to fit in there. A useful link function in this case is the logit function\n", 488 | "$$\\text{logit}(p) = \\log \\frac{p}{1-p},$$\n", 489 | "whose inverse is\n", 490 | "$$p = f^{-1}(\\alpha + \\beta x) = \\frac{\\exp(\\alpha + \\beta x)}{1 + \\exp(\\alpha + \\beta x)}$$\n", 491 | "Because this inverse transformation is so common, you'll see $f^{-1}$ just as often as you'll see $f$." 
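The logit link and its inverse can be sketched in a few lines of `numpy`. This is only an illustration of the formulas above, not code from the book: the helper names `logit` and `inv_logit` and the values of `alpha`, `beta`, and `x` are all made up for the example.

```python
import numpy as np

def logit(p):
    """Link function: maps a probability in (0, 1) onto the real line."""
    return np.log(p / (1 - p))

def inv_logit(eta):
    """Inverse link: squashes any real number into (0, 1).

    Algebraically the same as exp(eta) / (1 + exp(eta)).
    """
    return 1 / (1 + np.exp(-eta))

# Hypothetical intercept, slope, and predictor values (illustration only)
alpha, beta = -1.0, 2.0
x = np.linspace(-3, 3, 7)

eta = alpha + beta * x  # linear predictor: can be any real number
p = inv_logit(eta)      # corresponding probabilities, all inside (0, 1)
```

Applying `logit` to `p` recovers `eta`, which is a quick way to check that the two functions really are inverses of each other.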
492 | ] 493 | }, 494 | { 495 | "cell_type": "markdown", 496 | "metadata": {}, 497 | "source": [ 498 | "The use of a nonlinear link function implicitly leads to interaction effects between all of the predictors, because the derivative of the parameter with respect to one predictor is now a function of all the predictors, not just that one:\n", 499 | "$$\n", 500 | "\\frac{\\partial p}{\\partial x_i} = (f^{-1})^\\prime(\\alpha + \\beta \\cdot x) \\beta_i\n", 501 | "$$\n", 502 | "(I'm assuming $x$ and $\\beta$ are both vectors here)" 503 | ] 504 | }, 505 | { 506 | "cell_type": "code", 507 | "execution_count": null, 508 | "metadata": {}, 509 | "outputs": [], 510 | "source": [] 511 | } 512 | ], 513 | "metadata": { 514 | "kernelspec": { 515 | "display_name": "Python 3", 516 | "language": "python", 517 | "name": "python3" 518 | }, 519 | "language_info": { 520 | "codemirror_mode": { 521 | "name": "ipython", 522 | "version": 3 523 | }, 524 | "file_extension": ".py", 525 | "mimetype": "text/x-python", 526 | "name": "python", 527 | "nbconvert_exporter": "python", 528 | "pygments_lexer": "ipython3", 529 | "version": "3.7.5" 530 | } 531 | }, 532 | "nbformat": 4, 533 | "nbformat_minor": 4 534 | } 535 | -------------------------------------------------------------------------------- /notebooks/README.md: -------------------------------------------------------------------------------- 1 | # Statistical Rethinking chapters 2 | 3 | Each of the Jupyter notebooks in this directory corresponds to a chapter of _Statistical Rethinking_. Yes, those are the actual names of the chapters; I did not come up with them! 4 | 5 | ## Chapter 0: [Preface](http://nbviewer.jupyter.org/github/ecotner/statistical-rethinking/blob/master/notebooks/00_preface.ipynb) 6 | Introduces the content of the book, how to use it effectively, some advice on coding, etc. 
7 | 8 | ## Chapter 1: [The Golem of Prague](http://nbviewer.jupyter.org/github/ecotner/statistical-rethinking/blob/master/notebooks/01_golem_of_prague.ipynb) 9 | Discusses the zoo of common statistical tests, differences between hypotheses and models, some philosophy about truth/falsification of models. Introduces some of the fundamental differences between Frequentist and Bayesian statistics, then goes on to highlight specific future chapters: chapter 7 on model comparison, chapter 13 on multilevel models, chapters 5/6 on graphical causal models. 10 | 11 | ## Chapter 2: [Small Worlds and Large Worlds](http://nbviewer.jupyter.org/github/ecotner/statistical-rethinking/blob/master/notebooks/02_small_large_worlds.ipynb) 12 | Gives motivation for the Bayesian way of computing probabilities - it's just counting the ways observations could have occurred. Explains the various pieces of Bayes' rule: likelihood, prior, evidence, posterior. Illustrates how successive observations allow one to update their prior beliefs. Introduces the _grid approximation_ and _quadratic approximation_ (also known as _Laplace approximation_) for computing posteriors of simple models with low dimensionality and nearly Gaussian posteriors. 13 | 14 | ## Chapter 3: [Sampling the Imaginary](http://nbviewer.jupyter.org/github/ecotner/statistical-rethinking/blob/master/notebooks/03_sampling_the_imaginary.ipynb) 15 | Goes into detail about the grid approximation and how to compute posterior distributions from it. Discusses confidence and credible/compatibility intervals, HDI/HPDI/PI. Explains the benefits of having the entire posterior distribution over only having a point estimate. Shows how to sample from posteriors and how to use the samples to calculate any quantity of interest. 
16 | 17 | ## Chapter 4: [Geocentric Models](http://nbviewer.jupyter.org/github/ecotner/statistical-rethinking/blob/master/notebooks/04_geocentric_models.ipynb) 18 | Explains why "common" models/distributions like linear regression and the Gaussian are the default in most scenarios and illustrates the Central Limit Theorem. Uses the grid/quadratic approximation to estimate the posterior of a simple linear regression model, and shows the importance of doing prior predictive sampling/simulation to determine logical/informative priors. Shows how to generalize to polynomial and spline regression models. 19 | 20 | ## Chapter 5: [The Many Variables & The Spurious Waffles](http://nbviewer.jupyter.org/github/ecotner/statistical-rethinking/blob/master/notebooks/05_many_vars_and_spurious_waffles.ipynb) 21 | This chapter starts looking at the correlation vs. causation problem, introduces some techniques for causal inference (DAG's), and shows how including certain predictors in your model could either increase or decrease bias if you're not careful. Shows how to use models to do counterfactual deduction. Explains how to incorporate categorical predictors into your linear models. 22 | 23 | ## Chapter 6: [The Haunted DAG & The Causal Terror](http://nbviewer.jupyter.org/github/ecotner/statistical-rethinking/blob/master/notebooks/06_haunted_dag_and_causal_terror.ipynb) 24 | Sources of bias and causality are discussed in more depth. Explains how multicollinearity can disguise causal relationships and introduce non-identifiability of parameters. Explains the d-separation criterion and how to use it to make causal inferences when designing models. Illustrates Simpson's paradox, how to (try to) eliminate confounding, and how to create tests of causality. 
25 | 26 | ## Chapter 7: [Ulysses' Compass](http://nbviewer.jupyter.org/github/ecotner/statistical-rethinking/blob/master/notebooks/07_ulysses_compass.ipynb) 27 | Introduces underfitting/overfitting problems, how to identify them through information criteria (AIC, WAIC, PSIS-LOO) and cross-validation techniques. Discusses various model fit metrics and when to use them (absolutely wrecks $R^2$ haha). Gives some more detail of the math of information theory underlying probability theory (entropy, KL divergence). Explains common pitfalls when comparing metrics of model fit. Shows how regularizing priors can be used to improve inference in the presence of domain knowledge or to reduce overfitting. Author heavily prefers using WAIC and PSIS-LOO over out-of-sample CV... not sure if I completely agree. Illustrates all this with a problem comparing models of primate brain mass. 28 | 29 | ## Chapter 8: [Conditional Manatees](http://nbviewer.jupyter.org/github/ecotner/statistical-rethinking/blob/master/notebooks/08_conditional_manatees.ipynb) 30 | Introduces nonlinear interactions between predictor variables, how to include them in "linear" models. Shows how rewriting more complicated models can eliminate identifiability issues. Shows how normalizing variables can make choosing priors simpler and more logical. 31 | 32 | ## Chapter 9: [Markov Chain Monte Carlo](http://nbviewer.jupyter.org/github/ecotner/statistical-rethinking/blob/master/notebooks/09_mcmc.ipynb) 33 | This chapter title is actually pretty apt. The concept of MCMC is introduced, various flavors (Metropolis, Gibbs sampling, Hamiltonian MC) are explained, and then we finally settle on HMC/NUTS, talk about some of the pros/cons, how to use it, how to diagnose problems. 
34 | 35 | ## Chapter 10: [Big Entropy and the Generalized Linear Model](http://nbviewer.jupyter.org/github/ecotner/statistical-rethinking/blob/master/notebooks/10_entropy_and_glm.ipynb) 36 | The guiding principle of maximum entropy (to choose priors) is introduced. Generalized linear models (GLM's), link functions, and techniques for interpreting parameters are introduced as well. 37 | 38 | ## Chapter 11: [God Spiked the Integers](http://nbviewer.jupyter.org/github/ecotner/statistical-rethinking/blob/master/notebooks/11_god_spiked_ints.ipynb) 39 | Integer-valued distributions like Poisson/Binomial are introduced, and GLM's are built using them. Brief discussion on how to account for censored data that comes up often in count models or survival analysis where you measure things like durations to an event, or events that cause "subjects" to "be removed from" the study. 40 | 41 | ## Chapter 12: [Monsters and Mixtures](http://nbviewer.jupyter.org/github/ecotner/statistical-rethinking/blob/master/notebooks/12_monsters_and_mixtures.ipynb) 42 | Mixture distributions are introduced, such as the zero-inflated Poisson, the beta-binomial, and the gamma-Poisson. Utility of these distributions and their higher entropy to help cover unexplained variance is discussed. 43 | 44 | ## Chapter 13: [Models with Memory](http://nbviewer.jupyter.org/github/ecotner/statistical-rethinking/blob/master/notebooks/13_models_with_memory.ipynb) 45 | Hierarchical/multilevel models are introduced. Advantages/disadvantages of pooling are discussed, and reparametrization (centered vs. non-centered) and its effects on HMC sampling efficiency are shown. 46 | 47 | ## Chapter 14: [Adventures in Covariance](http://nbviewer.jupyter.org/github/ecotner/statistical-rethinking/blob/master/notebooks/14_adventures_in_covariance.ipynb) 48 | Takes multilevel models further by introducing adaptive priors that can take covariance between groups of data into account (multidimensional Gaussians). 
Introduces Gaussian processes for groups linked by continuous values, which is illustrated in the context of geospatial and phylogenetic similarities. 49 | 50 | ## Chapter 15: [Missing Data and Other Opportunities](http://nbviewer.jupyter.org/github/ecotner/statistical-rethinking/blob/master/notebooks/15_missing_data.ipynb) 51 | Teaches how to deal with a variety of problems by modeling the data itself as distributions to be learned. It is shown how to treat measurement error and missing data as generative processes that allow the recovery of the "true" data. Lots of pitfalls related to the causal implications of using such techniques are discussed. Latent discrete variables and their treatment in HMC are also shown. 52 | 53 | ## Chapter 16: [Generalized Linear Madness](http://nbviewer.jupyter.org/github/ecotner/statistical-rethinking/blob/master/notebooks/16_generalized_linear_madness.ipynb) 54 | Explains that while GLM's are a powerful tool, they are sometimes so general as to be uninterpretable. Often, it is better to formulate a model using scientific theory inspired by the domain, trying to keep the model as close as possible to a plausible generative story. We go through several examples, including biological growth inspired by basic geometric principles, state space models for inferring strategies in children and forecasting population dynamics, and highly nonlinear situations where we infer the parameters of differential equations. 55 | 56 | ## Chapter 17: Horoscopes 57 | This chapter doesn't have any code, so I did not make a notebook for it. It _very_ briefly discusses how statistical models should be used in scientific studies, and offers some guidelines on how scientific studies should be judged to improve the quality of scientific literature. --------------------------------------------------------------------------------