These papers both focus on innovating on the _meta-learner_.
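
## Meta-Learners in Code

To make this concrete, here is a minimal sketch of a T-learner, the simplest meta-learner: fit a separate outcome model to the treated and control groups, and take the difference in predictions as the estimated treatment effect. The simulated data and the choice of estimator below are purely illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Simulated stand-in data: features X, treatment indicator w, outcome y
rng = np.random.default_rng(634)
X = rng.normal(size=(500, 5))
w = rng.integers(0, 2, size=500)
y = X[:, 0] + w * (1 + X[:, 1]) + rng.normal(size=500)

# T-learner: fit one outcome model per treatment arm
mu0 = RandomForestRegressor(random_state=634).fit(X[w == 0], y[w == 0])
mu1 = RandomForestRegressor(random_state=634).fit(X[w == 1], y[w == 1])

# Estimated conditional average treatment effects
tau_hat = mu1.predict(X) - mu0.predict(X)
```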

# Optimising Your Model

## Machine Learning is not just Algorithms

- Another contribution of machine learning to econometrics, in my opinion, has been the development of strategies to test and evaluate models.
- Epistemologically, machine learning frequently takes a more agnostic view on trying to find a specific functional specification of a theoretical model.
- This means that the "correct" model is the one that does the best job of matching _empirics_, and not a particular theory.
- The cost of this is the unsuitability of many machine learning algorithms to theory testing in the traditional econometric sense.

## Cross Validation

Cross validation is one such strategy. It consists of dividing the data into _training_ and _test_ sets:

1. The model is fit using the _training_ data: $y_{train} = f(X_{train}) + \epsilon \rightarrow \hat{f}(X)$
2. The fitted model is applied to the _test features_ to generate _predicted values_: $\hat{y} = \hat{f}(X_{test})$
3. The difference between the _predicted values_ and the _test labels_ is used as a measure of the predictive accuracy of the model: $\hat{e} = y_{test} - \hat{y}$

::: {.fragment}
There are multiple aggregate measures of prediction error, but a common one is _mean squared (prediction) error_, calculated as the mean of the squared differences between the predictions and the test labels.
:::

## k-fold Cross Validation

- There are some obvious shortcomings to dividing the data into a training and test set just once.
- A slightly more advanced method for train-test splitting is known as k-fold CV, which consists of splitting the data randomly into $k$ bins, and then iteratively holding out each bin as the test set for a model trained on the remaining $k-1$ bins.

## Cross Validation Visualised

## Choosing Parameters

Another strategy for improving the predictive accuracy of algorithms relates to choosing the right _parameters_.

Most, if not all, algorithms have parameters that affect predictions in non-obvious ways. For example:

- `k-means`: number of clusters
- Decision Tree: min/max number of splits
- Random Forest: proportion of features to use in each subset
- LASSO/Ridge/EN: $\lambda$, the penalty weight

## Hyperparameter Tuning

- Hyperparameter tuning is the practice of choosing model parameters by optimising an _objective function_. Some possible objective functions include:
    - _Mean Absolute Prediction Error_: Combined with train-test splits.
    - _Goodness-of-Fit_: Measures such as R-squared, AIC, etc.
    - _Coherence/Entropy Measures_: Most algorithms have a measure of the complexity/information tradeoff, which can be optimised.
- Hyperparameter tuning is computationally costly, but also easily parallelisable.


# Machine Learning Recap

## Key Terms

- _Unsupervised Learning_: No $y$, explore $X$
- _Supervised Learning_: Learn relationship between features and labels.
- _Clustering_: Split observations into groups.
- _Dimensionality Reduction_: Reduce $j$, the number of features.
- _Classification vs Regression_: Depends on structure of $y$
- _Cross Validation_: Train-test split data to optimise supervised learner.
- _Hyperparameter Tuning_: Systematically choose optimal parameters for algorithm.
- _Objective Function_: An optimisable aspect of the data used to measure goodness-of-fit.
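
## Key Terms in Code

To make the recap concrete, a compact sketch combining cross validation and hyperparameter tuning with `scikit-learn`. The dataset and penalty grid are purely illustrative; note that `scikit-learn` calls the LASSO penalty `alpha`:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = load_diabetes(return_X_y=True)

# 5-fold cross-validated prediction error for one fixed penalty value
scores = cross_val_score(Lasso(alpha=0.1), X, y,
                         scoring='neg_mean_squared_error', cv=5)

# Hyperparameter tuning: search over a small penalty grid, scored by CV
search = GridSearchCV(Lasso(), {'alpha': [0.01, 0.1, 1.0]}, cv=5)
search.fit(X, y)
search.best_params_
```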

## Trade-offs

These trade-offs are not linear, but generally hold:

- _Explanatory vs predictive power_
- _Flexibility vs efficiency_
- _Information vs time_

## Readings

Ensemble Methods:

- [Grimmer \& Westwood, _Political Analysis_ 2017](https://www.cambridge.org/core/journals/political-analysis/article/estimating-heterogeneous-treatment-effects-and-the-effects-of-heterogeneous-treatments-with-ensemble-methods/C7E3EA00D0AD83429CBE73F4F0C6652C)
- [Künzel et al., _PNAS_ 2019](https://arxiv.org/abs/1706.03461)

Elements of Statistical Learning:

- 9.2: Tree-Based Methods
- 15: Random Forests
- 16: Ensemble Learning

--------------------------------------------------------------------------------
/Week8/examples_selenium.ipynb:

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "opponent-colorado",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "# Browser Automation with Selenium\n",
    "\n",
    "This notebook contains a short tutorial for scraping with the Selenium toolkit.\n",
    "\n",
    "We will be scraping `quotes.toscrape.com`, a wonderful page for practicing more advanced scraping techniques."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "julian-canon",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "# imports\n",
    "import requests\n",
    "from selenium import webdriver\n",
    "from selenium.webdriver.common.by import By"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "finished-mixer",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## When static scraping fails\n",
    "\n",
    "The following webpage is generated dynamically by `javascript`.\n",
    "We can see the script source on this page, but this is often not the case:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "straight-columbia",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "url = \"https://quotes.toscrape.com/js/\"\n",
    "page = requests.get(url)\n",
    "print(BeautifulSoup(page.text, 'html.parser').body.prettify())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "collect-finnish",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Instantiating the WebDriver\n",
    "\n",
    "When we call the `webdriver.Chrome()` method, if we have the webdriver properly installed, an automated Chrome instance should appear!\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "julian-nightlife",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "driver = webdriver.Chrome()\n",
    "driver.get(url)"
   ]
  },
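  {
   "cell_type": "markdown",
   "id": "added-headless-note",
   "metadata": {},
   "source": [
    "Aside: if you would rather not have a browser window pop up, Chrome can usually be run *headless*. The snippet below is a sketch and is left commented out; the exact flag has changed across Chrome/Selenium versions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "added-headless-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical alternative setup: a headless (invisible) browser\n",
    "# from selenium.webdriver.chrome.options import Options\n",
    "# options = Options()\n",
    "# options.add_argument('--headless=new')  # older versions use '--headless'\n",
    "# driver = webdriver.Chrome(options=options)"
   ]
  },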
  {
   "cell_type": "markdown",
   "id": "intimate-edgar",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "Let's select all of the quote-boxes that have the tag \"life\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "white-interstate",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "# This returns a list of elements that have the CSS class 'quote'\n",
    "quote_boxes = driver.find_elements(\n",
    "    By.CLASS_NAME, 'quote')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "living-bundle",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "# Let's navigate the first element to recognize a pattern\n",
    "# Selecting the first div\n",
    "quote_box = quote_boxes[0]\n",
    "# Selecting the container div for the tags\n",
    "tags = quote_box.find_element(By.CLASS_NAME, 'tags')\n",
    "# Getting the tag names\n",
    "[\n",
    "    tag.text for tag\n",
    "    in tags.find_elements(By.TAG_NAME, 'a')\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "impressive-diesel",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "# Some crazy list filtering\n",
    "life_quotes = [\n",
    "    quote for quote in quote_boxes if  # unpack quote_boxes\n",
    "    'life' in [tag.text for tag in  # check if 'life' is in\n",
    "               quote.find_element(By.CLASS_NAME, 'tags').  # the list of tags\n",
    "               find_elements(By.TAG_NAME, 'a')]  # like we obtained before\n",
    "]\n",
    "life_quotes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "young-emergency",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "# Let's put that into a function\n",
    "def filter_quotes_by_tag(driver, tag):\n",
    "    quote_boxes = driver.find_elements(By.CLASS_NAME, 'quote')\n",
    "    tagged_quotes = [\n",
    "        quote for quote in quote_boxes if  # unpack quote_boxes\n",
    "        tag in [t.text for t in  # check if tag is in\n",
    "                quote.find_element(By.CLASS_NAME, 'tags').  # the list of tags\n",
    "                find_elements(By.TAG_NAME, 'a')]  # like we obtained before\n",
    "    ]\n",
    "    return tagged_quotes"
   ]
  },
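  {
   "cell_type": "markdown",
   "id": "added-filter-usage",
   "metadata": {},
   "source": [
    "A quick check that the function behaves as expected: this call should return the same elements as the manual filtering above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "added-filter-usage-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "filter_quotes_by_tag(driver, 'life')"
   ]
  },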
  {
   "cell_type": "markdown",
   "id": "binding-wheel",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Simulating Clicks\n",
    "\n",
    "We can use the `.click()` method of any element to 'click' on it.\n",
    "\n",
    "Let's proceed to the next page of quotes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aging-voltage",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "# Get the \"next\" element\n",
    "next_button = driver.find_element(By.PARTIAL_LINK_TEXT, 'Next')\n",
    "print(driver.current_url)\n",
    "next_button.click()\n",
    "print(driver.current_url)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "still-gasoline",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Sending Keys\n",
    "\n",
    "Let's try to log in!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "controversial-jackson",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "login_box = driver.find_element(By.LINK_TEXT, 'Login')\n",
    "login_box.click()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "august-purpose",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "# Entering username and password\n",
    "username_box = driver.find_element(By.ID, 'username')\n",
    "password_box = driver.find_element(By.ID, 'password')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "flush-minutes",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "username_box.send_keys('username')\n",
    "password_box.send_keys('password')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "double-boring",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "# Using XPATH to get the login button\n",
    "# https://www.w3schools.com/xml/xpath_syntax.asp\n",
    "login_button = driver.find_element(\n",
    "    By.XPATH, r\"//input[(@type='submit')]\")\n",
    "login_button.click()"
   ]
  },
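  {
   "cell_type": "markdown",
   "id": "added-login-check",
   "metadata": {},
   "source": [
    "Did it work? On this site a 'Logout' link should replace the 'Login' link after a successful login. (This is an assumption about the page layout; adjust the link text if the site differs.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "added-login-check-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "# If the login succeeded, this lookup should not raise NoSuchElementException\n",
    "driver.find_element(By.LINK_TEXT, 'Logout')"
   ]
  },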
  {
   "cell_type": "markdown",
   "id": "correct-speaking",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Race Conditions\n",
    "\n",
    "Usually the page will take time to load.\n",
    "\n",
    "If you are running Selenium from a script, it will execute the commands sequentially\n",
    "as fast as possible. This causes problems."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "quality-israel",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "url = \"https://quotes.toscrape.com/js-delayed/\"\n",
    "driver.get(url)\n",
    "filter_quotes_by_tag(driver, 'life')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "average-scottish",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "Selenium does provide more sophisticated \"wait\" functionality,\n",
    "where you can define some condition that it will test until\n",
    "it becomes true.\n",
    "\n",
    "I'll demonstrate a simpler (and less reliable) solution, which\n",
    "is to just use a timed wait."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "frozen-forest",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "from time import sleep\n",
    "url = \"https://quotes.toscrape.com/js-delayed/\"\n",
    "driver.get(url)\n",
    "sleep(10)  # I happen to know the length of the delay\n",
    "filter_quotes_by_tag(driver, 'life')"
   ]
  },
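  {
   "cell_type": "markdown",
   "id": "added-explicit-wait-note",
   "metadata": {},
   "source": [
    "For reference, a sketch of the explicit-wait approach. It assumes the delayed page eventually renders elements with the CSS class `quote`, and polls for up to 15 seconds before giving up:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "added-explicit-wait-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "from selenium.webdriver.support.ui import WebDriverWait\n",
    "from selenium.webdriver.support import expected_conditions as EC\n",
    "\n",
    "driver.get(url)\n",
    "# Block until at least one quote box is present (raises TimeoutException after 15s)\n",
    "WebDriverWait(driver, 15).until(\n",
    "    EC.presence_of_element_located((By.CLASS_NAME, 'quote')))\n",
    "filter_quotes_by_tag(driver, 'life')"
   ]
  },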
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "weekly-kingdom",
   "metadata": {},
   "outputs": [],
   "source": [
    "driver.quit()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "scrape",
   "language": "python",
   "name": "scrape"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}

--------------------------------------------------------------------------------
/Week5/examples_student.ipynb:

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "# Coding Tutorial 5: Unsupervised Learning\n",
    "\n",
    "In this coding tutorial, we learn how to do the following for `k-means` clustering and principal components analysis:\n",
    "\n",
    "- Import models from `scikit-learn`\n",
    "- Prepare a pandas dataframe for analysis with `scikit-learn`\n",
    "- Instantiate and fit a model to data\n",
    "- Visualise the results of the model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "# Importing Models from Scikit-Learn\n",
    "\n",
    "`scikit-learn` is actually a collection of modules, so you will need to find which sub-module contains the model you want to use."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "# standard imports\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "# scikit-learn imports\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.cluster import KMeans\n",
    "from sklearn.decomposition import PCA"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "# import the data\n",
    "link = 'http://github.com/muhark/dpir-intro-python/raw/master/Week2/data/bes_data_subset_week2.feather'\n",
    "df = pd.read_feather(link)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "# Data Pre-Processing\n",
    "\n",
    "There are four steps for preparing data for analysis:\n",
    "\n",
    "1. Feature Selection\n",
    "2. Accounting for NAs\n",
    "3. One Hot Encoding\n",
    "4. Conversion to numpy ndarray"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Feature Selection\n",
    "\n",
    "Here we just choose which columns we are going to use. If your data has a lot of NAs, it may be worthwhile to prefer columns with fewer NAs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "features = ['region', 'Age', 'a02', 'a03', 'e01',\n",
    "            'k01', 'k02', 'k11', 'k13', 'k06', 'k08',\n",
    "            'y01', 'y03', 'y06', 'y08', 'y09', 'y11', 'y17']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Accounting for NAs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "# Can check for na's with:\n",
    "# df[features].isna().sum()\n",
    "df = df[features].dropna()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## One-Hot Encoding\n",
    "\n",
    "We can do a one-hot encoding using the `pd.get_dummies()` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "data = pd.get_dummies(df)\n",
    "print(df.shape, data.shape)"
   ]
  },
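  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a toy illustration of what `pd.get_dummies()` does (hypothetical data): each level of a categorical column becomes its own 0/1 column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.get_dummies(pd.DataFrame({'colour': ['red', 'blue', 'red']}))"
   ]
  },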
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Normalization and Conversion to `numpy`\n",
    "\n",
    "We call `StandardScaler().fit_transform()` on the `.values` attribute of the dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "X = data.values\n",
    "scaler = StandardScaler()\n",
    "X_norm = scaler.fit_transform(X)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "# Instantiating and Fitting `k-means`\n",
    "\n",
    "We first create an instance of the model, where we provide parameters, and then we pass data to it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "kmeans = KMeans(n_clusters=5, random_state=634)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "kmeans.fit(X_norm)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "We can extract the labels using the `.labels_` attribute, and then assign them to a column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "df['labels_'] = kmeans.labels_\n",
    "df['labels_'] = df['labels_'].astype(str)"
   ]
  },
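  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A quick sanity check before visualising: how many observations landed in each cluster? Very lopsided clusters can be a hint that `n_clusters` is worth revisiting."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['labels_'].value_counts()"
   ]
  },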
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "# Visualising the Results\n",
    "\n",
    "This is a bit difficult with so many variables. Let's look at age."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "f, ax = plt.subplots(1, 1, figsize=(15, 8))\n",
    "sns.histplot(df[['labels_', 'Age']].sort_values('labels_'),\n",
    "             x='Age', ax=ax, kde=True, hue='labels_');"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "# We can appropriate this function\n",
    "def grouped_barplot(data, var1, var2):\n",
    "    \"\"\"\n",
    "    Creates a grouped bar plot of the distribution of `var2` within each group of `var1`.\n",
    "    \"\"\"\n",
    "    temp = data.groupby([var1, var2]).apply(len).reset_index().rename({0: 'Count'}, axis=1)\n",
    "    f, ax = plt.subplots(1, 1, figsize=(len(data[var1].unique())*len(data[var2].unique())/5, 10))\n",
    "    sns.barplot(data=temp, x=var1, y='Count', hue=var2)\n",
    "    ax.set_title(f\"BES Sample {var2} per {var1}\")\n",
    "    ax.xaxis.set_ticklabels(ax.xaxis.get_ticklabels(), rotation=30)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "grouped_barplot(df, 'a02', 'labels_')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "grouped_barplot(df, 'region', 'labels_')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Instantiating and Fitting PCA"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "pca = PCA(n_components=2, random_state=634)\n",
    "pca = pca.fit(X_norm)\n",
    "reduced_X = pca.transform(X_norm)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "sns.scatterplot(x=reduced_X[:, 0], y=reduced_X[:, 1]);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Combining PCA and `k-means`\n",
    "\n",
    "We can fit k-means to PCA-reduced data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "pcakmeans = KMeans(n_clusters=5, random_state=634)\n",
    "pcakmeans.fit(reduced_X)\n",
    "df['pcakmeans_labels'] = pcakmeans.labels_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "sns.set_style('darkgrid')\n",
    "f, ax = plt.subplots(1, 1, figsize=(15, 8))\n",
    "sns.scatterplot(x=reduced_X[:, 0],\n",
    "                y=reduced_X[:, 1],\n",
    "                hue=pcakmeans.labels_,\n",
    "                palette=sns.color_palette(palette='colorblind', n_colors=5));"
   ]
  },
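  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How much of the variance in the scaled features do the two components actually capture? With one-hot encoded survey data this share is often fairly small, which is worth keeping in mind when interpreting the plots above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Proportion of total variance explained by each retained component\n",
    "pca.explained_variance_ratio_"
   ]
  },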
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "grouped_barplot(df, 'a02', 'pcakmeans_labels')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.DataFrame(pca.components_, columns=data.columns)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}