├── requirements.txt ├── honest.fth ├── .devcontainer └── devcontainer.json ├── README.md └── idiomatic-pandas.ipynb /requirements.txt: -------------------------------------------------------------------------------- 1 | pandas==2.0.2 2 | pyarrow==12.0.0 3 | 4 | -------------------------------------------------------------------------------- /honest.fth: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mwcraig/2023-scipy-pandas/main/honest.fth -------------------------------------------------------------------------------- /.devcontainer/devcontainer.json: -------------------------------------------------------------------------------- 1 | { 2 | "image": "mcr.microsoft.com/devcontainers/universal:2", 3 | "waitFor": "onCreateCommand", 4 | "updateContentCommand": "python3 -m pip install -r requirements.txt", 5 | "postCreateCommand": "", 6 | "customizations": { 7 | "codespaces": { 8 | "openFiles": [] 9 | }, 10 | "vscode": { 11 | "extensions": [ 12 | "ms-toolsai.jupyter", 13 | "ms-python.python" 14 | ] 15 | } 16 | } 17 | } 18 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # 2023-scipy-pandas 2 | 3 | This repository contains a Jupyter notebook for the Idiomatic Pandas tutorial. The notebook covers various topics and interactive exercises designed to reinforce your learning. You can run the notebook in a local virtual environment or directly in GitHub Codespaces. 4 | 5 | ## Structure of the Repository 6 | 7 | - `idiomatic-pandas.ipynb` - This notebook 8 | - `honest.fth` - The data for the tutorial 9 | - `requirements.txt` - This file lists the Python dependencies required to run the notebook. 10 | 11 | ## Getting Started 12 | 13 | Here's how you can set up and run this project: 14 | 15 | ### Option 1: Running in Local Virtual Environment 16 | 17 | 1. 
**Clone the Repository** 18 | 19 | First, clone this repository to your local machine using the following command: 20 | 21 | ```bash 22 | git clone https://github.com/your_username/your_repository.git 23 | ``` 24 | 25 | 2. **Create and Activate Virtual Environment** 26 | 27 | It is good practice to create a virtual environment for each Python project. Here's how you can do it: 28 | 29 | For Windows: 30 | 31 | ```bash 32 | python -m venv tutorial_env 33 | tutorial_env\Scripts\activate 34 | ``` 35 | 36 | For macOS/Linux: 37 | 38 | ```bash 39 | python3 -m venv tutorial_env 40 | source tutorial_env/bin/activate 41 | ``` 42 | 43 | 3. **Install Dependencies** 44 | 45 | Once your virtual environment is activated, you can install the necessary dependencies using pip. Navigate to the directory containing the `requirements.txt` file and run: 46 | 47 | ```bash 48 | pip install -r requirements.txt 49 | ``` 50 | 51 | 4. **Launch Jupyter Notebook** 52 | 53 | After you have your environment set up and dependencies installed, you can start Jupyter Notebook (it is not listed in `requirements.txt`, so install it with `pip install notebook` if it is not already available) by running: 54 | 55 | ```bash 56 | jupyter notebook 57 | ``` 58 | 59 | Then, in your web browser, navigate to the location of the notebook file and click to open it. 60 | 61 | ### Option 2: Running in GitHub Codespaces 62 | 63 | GitHub Codespaces is a service that allows you to develop in the cloud instead of locally. Here's how you can use it for this project: 64 | 65 | 1. **Open the Repository in Codespaces** 66 | 67 | Navigate to this repository in GitHub. Click the `Code` button in the repository header and then select `Open with Codespaces`. 68 | 69 | 2. **Wait for the codespace to start** 70 | 71 | The codespace installs the dependencies from `requirements.txt` automatically while it starts. 72 | 73 | 3. **Open the notebook** 74 | 75 | 4. **Click on "Select Kernel" -> Python Environments... -> Python 3.10** 76 | 77 | You should be good to go. 78 | 79 | ## Visit MetaSnake for Help 80 | 81 | We hope you enjoy this tutorial and find it helpful. 
Visit www.metasnake.com for Python and Data training for your teams. Buy *Effective Pandas* to up your pandas skills. 82 | -------------------------------------------------------------------------------- /idiomatic-pandas.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "81cc5b34", 6 | "metadata": { 7 | "lines_to_next_cell": 0, 8 | "pycharm": { 9 | "name": "#%% md\n" 10 | } 11 | }, 12 | "source": [ 13 | "# Idiomatic Pandas\n", 14 | "## 5 Tips for Better Pandas Code\n", 15 | "\n", 16 | "https://github.com/mattharrison/2023-scipy-pandas" 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": null, 22 | "id": "d1532d10", 23 | "metadata": { 24 | "lines_to_next_cell": 2 25 | }, 26 | "outputs": [], 27 | "source": [] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "id": "3b359fb0", 32 | "metadata": { 33 | "pycharm": { 34 | "name": "#%% md\n" 35 | } 36 | }, 37 | "source": [ 38 | "## About Matt Harrison @\\_\\_mharrison\\_\\_\n", 39 | "\n", 40 | "* Author of Effective Pandas, Machine Learning Pocket Reference, and Illustrated Guide to Python 3.\n", 41 | "* Advisor at Ponder (creators of Modin)\n", 42 | "* Corporate trainer at MetaSnake. Taught Pandas to 1000's of students." 
43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": null, 48 | "id": "fd45b865", 49 | "metadata": { 50 | "lines_to_next_cell": 2, 51 | "pycharm": { 52 | "name": "#%%\n" 53 | } 54 | }, 55 | "outputs": [], 56 | "source": [] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": null, 61 | "id": "ee074cf6", 62 | "metadata": { 63 | "lines_to_next_cell": 2, 64 | "pycharm": { 65 | "name": "#%%\n" 66 | } 67 | }, 68 | "outputs": [], 69 | "source": [] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": null, 74 | "id": "56dea0d6", 75 | "metadata": { 76 | "lines_to_next_cell": 2, 77 | "pycharm": { 78 | "name": "#%%\n" 79 | } 80 | }, 81 | "outputs": [], 82 | "source": [] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": null, 87 | "id": "74599ee6", 88 | "metadata": { 89 | "lines_to_next_cell": 2, 90 | "pycharm": { 91 | "name": "#%%\n" 92 | } 93 | }, 94 | "outputs": [], 95 | "source": [] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": null, 100 | "id": "ddaf4356", 101 | "metadata": { 102 | "lines_to_next_cell": 2, 103 | "pycharm": { 104 | "name": "#%%\n" 105 | } 106 | }, 107 | "outputs": [], 108 | "source": [] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": null, 113 | "id": "691fab46", 114 | "metadata": { 115 | "lines_to_next_cell": 2, 116 | "pycharm": { 117 | "name": "#%%\n" 118 | } 119 | }, 120 | "outputs": [], 121 | "source": [] 122 | }, 123 | { 124 | "cell_type": "markdown", 125 | "id": "ac1d390a", 126 | "metadata": { 127 | "pycharm": { 128 | "name": "#%% md\n" 129 | } 130 | }, 131 | "source": [ 132 | "## Practice this on your data with your team!\n", 133 | "* Contact me matt@metasnake.com\n", 134 | "* Follow on Twitter @\\_\\_mharrison\\_\\_" 135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": null, 140 | "id": "0d9e4268", 141 | "metadata": { 142 | "lines_to_next_cell": 2, 143 | "pycharm": { 144 | "name": "#%%\n" 145 | } 146 | }, 147 | "outputs": [], 
148 | "source": [] 149 | }, 150 | { 151 | "cell_type": "code", 152 | "execution_count": null, 153 | "id": "e9f646d4", 154 | "metadata": { 155 | "lines_to_next_cell": 2, 156 | "pycharm": { 157 | "name": "#%%\n" 158 | } 159 | }, 160 | "outputs": [], 161 | "source": [] 162 | }, 163 | { 164 | "cell_type": "markdown", 165 | "id": "ec83fe17", 166 | "metadata": { 167 | "lines_to_next_cell": 0, 168 | "pycharm": { 169 | "name": "#%% md\n" 170 | } 171 | }, 172 | "source": [ 173 | "## Outline\n", 174 | "\n", 175 | "* Load Data\n", 176 | "* Types\n", 177 | "* Chaining\n", 178 | "* Mutation\n", 179 | "* Apply\n", 180 | "* Aggregation" 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": null, 186 | "id": "88919775", 187 | "metadata": { 188 | "lines_to_next_cell": 2 189 | }, 190 | "outputs": [], 191 | "source": [] 192 | }, 193 | { 194 | "cell_type": "markdown", 195 | "id": "b6b8d4fe", 196 | "metadata": {}, 197 | "source": [ 198 | "## Imports" 199 | ] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "execution_count": null, 204 | "id": "677e4ecd", 205 | "metadata": { 206 | "lines_to_next_cell": 2, 207 | "pycharm": { 208 | "name": "#%%\n" 209 | } 210 | }, 211 | "outputs": [], 212 | "source": [ 213 | "%matplotlib inline\n", 214 | "from IPython.display import display\n", 215 | "import numpy as np\n", 216 | "import pandas as pd\n", 217 | "import pyarrow\n", 218 | "\n", 219 | "import io\n", 220 | "import zipfile\n", 221 | "#import modin.pandas as pd" 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": null, 227 | "id": "108ee461", 228 | "metadata": {}, 229 | "outputs": [], 230 | "source": [ 231 | "pd.__version__" 232 | ] 233 | }, 234 | { 235 | "cell_type": "code", 236 | "execution_count": null, 237 | "id": "e01647d1", 238 | "metadata": {}, 239 | "outputs": [], 240 | "source": [ 241 | "pyarrow.__version__" 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": null, 247 | "id": "565ba8c3", 248 | "metadata": {}, 249 
| "outputs": [], 250 | "source": [] 251 | }, 252 | { 253 | "cell_type": "code", 254 | "execution_count": null, 255 | "id": "6dfaa61a", 256 | "metadata": {}, 257 | "outputs": [], 258 | "source": [] 259 | }, 260 | { 261 | "cell_type": "markdown", 262 | "id": "4084693e", 263 | "metadata": { 264 | "pycharm": { 265 | "name": "#%% md\n" 266 | } 267 | }, 268 | "source": [ 269 | "## Data\n", 270 | "\n", 271 | "Don't need to run this, but this is how I created the data" 272 | ] 273 | }, 274 | { 275 | "cell_type": "code", 276 | "execution_count": null, 277 | "id": "766a6963", 278 | "metadata": {}, 279 | "outputs": [], 280 | "source": [ 281 | "# https://gss.norc.org/get-the-data/stata\n", 282 | "# takes a few minutes on my computer to load\n", 283 | "path = '~/Downloads/gss_spss_with_codebook.zip'\n", 284 | "with zipfile.ZipFile(path) as z:\n", 285 | " print(z.namelist())\n", 286 | " with open('gss.sav', mode='bw') as fout:\n", 287 | " fout.write(z.open('GSS7218_R3.sav').read())\n", 288 | " gss = pd.read_spss('gss.sav')" 289 | ] 290 | }, 291 | { 292 | "cell_type": "code", 293 | "execution_count": null, 294 | "id": "0bff9887", 295 | "metadata": {}, 296 | "outputs": [], 297 | "source": [ 298 | "!pip install pyreadstat" 299 | ] 300 | }, 301 | { 302 | "cell_type": "code", 303 | "execution_count": null, 304 | "id": "b113d17e", 305 | "metadata": {}, 306 | "outputs": [], 307 | "source": [ 308 | "%%time\n", 309 | "import pyreadstat\n", 310 | "gss, meta = pyreadstat.read_sav('gss.sav')" 311 | ] 312 | }, 313 | { 314 | "cell_type": "code", 315 | "execution_count": null, 316 | "id": "356a1732", 317 | "metadata": {}, 318 | "outputs": [], 319 | "source": [ 320 | "gss.shape" 321 | ] 322 | }, 323 | { 324 | "cell_type": "code", 325 | "execution_count": null, 326 | "id": "41a1eb09", 327 | "metadata": {}, 328 | "outputs": [], 329 | "source": [ 330 | "gss.to_feather('gss.fth')" 331 | ] 332 | }, 333 | { 334 | "cell_type": "code", 335 | "execution_count": null, 336 | "id": "fe3f9902", 337 | 
"metadata": {}, 338 | "outputs": [], 339 | "source": [ 340 | "%%time\n", 341 | "raw = pd.read_feather('~/Dropbox/work/jupyter/gss.fth')" 342 | ] 343 | }, 344 | { 345 | "cell_type": "code", 346 | "execution_count": null, 347 | "id": "17a1c4e6", 348 | "metadata": { 349 | "lines_to_next_cell": 0, 350 | "pycharm": { 351 | "name": "#%%\n" 352 | } 353 | }, 354 | "outputs": [], 355 | "source": [ 356 | "raw" 357 | ] 358 | }, 359 | { 360 | "cell_type": "code", 361 | "execution_count": null, 362 | "id": "c02fcbf9", 363 | "metadata": {}, 364 | "outputs": [], 365 | "source": [ 366 | "# 6000 columns!\n", 367 | "raw.shape" 368 | ] 369 | }, 370 | { 371 | "cell_type": "code", 372 | "execution_count": null, 373 | "id": "e0bcc8ff", 374 | "metadata": { 375 | "lines_to_next_cell": 0 376 | }, 377 | "outputs": [], 378 | "source": [ 379 | "cols = ['YEAR','ID','AGE', 'HRS1','OCC','MAJOR1','SEX','RACE','BORN','INCOME',\n", 380 | " 'INCOME06','HONEST','TICKET']\n", 381 | "\n", 382 | "raw[cols].to_feather('honest.fth')" 383 | ] 384 | }, 385 | { 386 | "cell_type": "code", 387 | "execution_count": null, 388 | "id": "591958d3", 389 | "metadata": { 390 | "lines_to_next_cell": 2 391 | }, 392 | "outputs": [], 393 | "source": [] 394 | }, 395 | { 396 | "cell_type": "markdown", 397 | "id": "7dedee7b", 398 | "metadata": {}, 399 | "source": [ 400 | "## Loading Data" 401 | ] 402 | }, 403 | { 404 | "cell_type": "code", 405 | "execution_count": null, 406 | "id": "20e8aa70", 407 | "metadata": { 408 | "lines_to_next_cell": 2, 409 | "pycharm": { 410 | "name": "#%%\n" 411 | } 412 | }, 413 | "outputs": [], 414 | "source": [ 415 | "raw = pd.read_feather('honest.fth', dtype_backend='pyarrow')" 416 | ] 417 | }, 418 | { 419 | "cell_type": "code", 420 | "execution_count": null, 421 | "id": "5f95238b", 422 | "metadata": { 423 | "lines_to_next_cell": 2, 424 | "pycharm": { 425 | "name": "#%%\n" 426 | } 427 | }, 428 | "outputs": [], 429 | "source": [] 430 | }, 431 | { 432 | "cell_type": "markdown", 433 | "id": 
"8da48799", 434 | "metadata": { 435 | "pycharm": { 436 | "name": "#%% md\n" 437 | } 438 | }, 439 | "source": [ 440 | "## My Cleanup\n", 441 | "See GSS_Codebook.pdf for explanation\n", 442 | "\n", 443 | "Columns:\n", 444 | "\n", 445 | "* YEAR\n", 446 | "* ID - RESPONDENT ID NUMBER\n", 447 | "* AGE - AGE OF RESPONENT\n", 448 | "* HRS1 - NUMBER OF HOURS WORKED LAST WEEK\n", 449 | "* OCC - R'S CENSUS OCCUPATION CODE (1970) - Page 126 (VAR: OCC) see page 125 for notes APPENDIX F,G,H\n", 450 | " Appendix F - Page 3286\n", 451 | "* MAJOR1 - COLLEGE MAJOR 1\n", 452 | "* SEX - RESPONDENTS SEX\n", 453 | "* RACE - RACE OF RESPONDENT\n", 454 | "* BORN - WAS R BORN IN THIS COUNTRY\n", 455 | "* INCOME - TOTAL FAMILY INCOME 1970\n", 456 | "* INCOME06 - TOTAL FAMILY INCOME 2006\n", 457 | "* HONEST - HONEST\n", 458 | "* TICKET - EVER RECEIVED A TRAFFIC TICKET\n" 459 | ] 460 | }, 461 | { 462 | "cell_type": "code", 463 | "execution_count": null, 464 | "id": "a288f332", 465 | "metadata": {}, 466 | "outputs": [], 467 | "source": [ 468 | "cols = ['YEAR','ID','AGE', 'HRS1','OCC','MAJOR1','SEX','RACE','BORN','INCOME',\n", 469 | " 'INCOME06','HONEST','TICKET']\n", 470 | "\n", 471 | "raw[cols].isna().mean()*100" 472 | ] 473 | }, 474 | { 475 | "cell_type": "code", 476 | "execution_count": null, 477 | "id": "b6de3c2f", 478 | "metadata": {}, 479 | "outputs": [], 480 | "source": [ 481 | "(raw\n", 482 | " [cols]\n", 483 | " .isna()\n", 484 | " .mean()*100\n", 485 | ")" 486 | ] 487 | }, 488 | { 489 | "cell_type": "code", 490 | "execution_count": null, 491 | "id": "a284ba6e", 492 | "metadata": {}, 493 | "outputs": [], 494 | "source": [ 495 | "MAJOR= '''RESPONSE PUNCH 1972-82 1982B 1983-87 1987B 1988-91 1993-98 2000-04 2006 2008 2010 2012 2014 2016 2018 ALL\n", 496 | "Accounting/bookkeeping 1 0 0 0 0 0 0 0 0 0 0 28 32 30 29 119\n", 497 | "Advertising 2 0 0 0 0 0 0 0 0 0 0 3 2 0 0 5\n", 498 | "Agriculture/horticulture 3 0 0 0 0 0 0 0 0 0 0 8 2 7 5 22\n", 499 | "Allied health 4 0 0 0 0 0 0 0 0 0 0 0 
2 1 0 3\n", 500 | "Anthropology 5 0 0 0 0 0 0 0 0 0 0 3 5 1 1 10\n", 501 | "Architecture 6 0 0 0 0 0 0 0 0 0 0 2 3 5 3 13\n", 502 | "Art 7 0 0 0 0 0 0 0 0 0 0 6 7 11 10 34\n", 503 | "Biology 8 0 0 0 0 0 0 0 0 0 0 16 22 33 26 97\n", 504 | "Business administration 9 0 0 0 0 0 0 0 0 0 0 90 142 172 138 542\n", 505 | "Chemistry 11 0 0 0 0 0 0 0 0 0 0 5 8 10 4 27\n", 506 | "Communications/speech 12 0 0 0 0 0 0 0 0 0 0 20 18 26 18 82\n", 507 | "Comm. disorders 13 0 0 0 0 0 0 0 0 0 0 4 6 2 2 14\n", 508 | "Computer science 14 0 0 0 0 0 0 0 0 0 0 25 24 33 17 99\n", 509 | "Dentistry 15 0 0 0 0 0 0 0 0 0 0 2 4 3 5 14\n", 510 | "Education 16 0 0 0 0 0 0 0 0 0 0 73 91 97 79 340\n", 511 | "Economics 17 0 0 0 0 0 0 0 0 0 0 11 19 13 19 62\n", 512 | "Engineering 18 0 0 0 0 0 0 0 0 0 0 47 49 47 54 197\n", 513 | "English 19 0 0 0 0 0 0 0 0 0 0 23 26 27 24 100\n", 514 | "Finance 20 0 0 0 0 0 0 0 0 0 0 7 15 14 16 52\n", 515 | "Foreign language 21 0 0 0 0 0 0 0 0 0 0 4 8 6 5 23\n", 516 | "Forestry 22 0 0 0 0 0 0 0 0 0 0 1 0 3 0 4\n", 517 | "Geography 23 0 0 0 0 0 0 0 0 0 0 0 2 2 4 8\n", 518 | "Geology 24 0 0 0 0 0 0 0 0 0 0 1 3 4 2 10\n", 519 | "History 25 0 0 0 0 0 0 0 0 0 0 10 19 14 19 62\n", 520 | "Home economics 26 0 0 0 0 0 0 0 0 0 0 0 0 3 2 5\n", 521 | "Industry & techn 27 0 0 0 0 0 0 0 0 0 0 3 4 6 0 13\n", 522 | "Journalism 28 0 0 0 0 0 0 0 0 0 0 5 6 6 4 21\n", 523 | "Law 29 0 0 0 0 0 0 0 0 0 0 13 18 23 14 68\n", 524 | "Law enforcement 30 0 0 0 0 0 0 0 0 0 0 3 5 4 2 14\n", 525 | "Library science 31 0 0 0 0 0 0 0 0 0 0 4 5 2 3 14\n", 526 | "Marketing 32 0 0 0 0 0 0 0 0 0 0 11 15 13 12 51\n", 527 | "Mathematics 33 0 0 0 0 0 0 0 0 0 0 5 10 12 5 32\n", 528 | "Medicine 34 0 0 0 0 0 0 0 0 0 0 9 25 12 11 57\n", 529 | "Music 35 0 0 0 0 0 0 0 0 0 0 4 2 10 2 18\n", 530 | "Nursing 36 0 0 0 0 0 0 0 0 0 0 36 39 60 51 186\n", 531 | "Optometry 37 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", 532 | "Pharmacy 38 0 0 0 0 0 0 0 0 0 0 2 5 4 4 15\n", 533 | "Philosophy 39 0 0 0 0 0 0 0 0 0 0 2 0 2 2 6\n", 534 | 
"Physical education 40 0 0 0 0 0 0 0 0 0 0 9 6 16 6 37\n", 535 | "Physics 41 0 0 0 0 0 0 0 0 0 0 3 6 7 4 20\n", 536 | "Psychology 42 0 0 0 0 0 0 0 0 0 0 32 32 34 29 127\n", 537 | "Political science/international relations 43 0 0 0 0 0 0 0 0 0 0 16 22 19 14 71\n", 538 | "Sociology 44 0 0 0 0 0 0 0 0 0 0 9 15 10 12 46\n", 539 | "Special education 45 0 0 0 0 0 0 0 0 0 0 5 3 5 2 15\n", 540 | "Theater arts 46 0 0 0 0 0 0 0 0 0 0 6 2 3 1 12\n", 541 | "Theology 47 0 0 0 0 0 0 0 0 0 0 6 6 13 8 33\n", 542 | "Veterinary medicine 48 0 0 0 0 0 0 0 0 0 0 1 5 3 4 13\n", 543 | "Liberal arts 49 0 0 0 0 0 0 0 0 0 0 8 16 12 10 46\n", 544 | "Other 50 0 0 0 0 0 0 0 0 0 0 8 10 21 27 66\n", 545 | "General sciences 51 0 0 0 0 0 0 0 0 0 0 10 13 15 14 52\n", 546 | "Social work 52 0 0 0 0 0 0 0 0 0 0 7 17 24 7 55\n", 547 | "General studies 53 0 0 0 0 0 0 0 0 0 0 2 5 7 7 21\n", 548 | "Other vocational 54 0 0 0 0 0 0 0 0 0 0 5 11 6 5 27\n", 549 | "Health 55 0 0 0 0 0 0 0 0 0 0 23 31 31 42 127\n", 550 | "Industrial Relations 56 0 0 0 0 0 0 0 0 0 0 1 0 0 3 4\n", 551 | "Child/Human/Family Development 57 0 0 0 0 0 0 0 0 0 0 11 3 7 7 28\n", 552 | "Food Science/Nutrition/Culinary Arts 58 0 0 0 0 0 0 0 0 0 0 3 6 9 9 27\n", 553 | "Environmental Science/Ecology 59 0 0 0 0 0 0 0 0 0 0 5 5 6 8 24\n", 554 | "Social Sciences 60 0 0 0 0 0 0 0 0 0 0 4 2 7 5 18\n", 555 | "Human Services/Human Resources 61 0 0 0 0 0 0 0 0 0 0 3 7 7 5 22\n", 556 | "Visual Arts/Graphic Design/Design and Drafting 62 0 0 0 0 0 0 0 0 0 0 3 8 9 10 30\n", 557 | "Fine Arts 63 0 0 0 0 0 0 0 0 0 0 4 5 5 6 20\n", 558 | "Humanities 64 0 0 0 0 0 0 0 0 0 0 0 2 0 1 3\n", 559 | "Ethnic studies 65 0 0 0 0 0 0 0 0 0 0 3 1 0 0 4\n", 560 | "Educational administration 66 0 0 0 0 0 0 0 0 0 0 3 4 8 9 24\n", 561 | "Television/Film 67 0 0 0 0 0 0 0 0 0 0 0 2 6 1 9\n", 562 | "Aviation/Aeronatics 68 0 0 0 0 0 0 0 0 0 0 2 1 1 3 7\n", 563 | "Statistics/Biostatistics 69 0 0 0 0 0 0 0 0 0 0 0 0 2 2 4\n", 564 | "Criminology/Criminal Justice 70 0 0 0 0 0 0 0 
0 0 0 13 17 17 13 60\n", 565 | "Administrative Science/Public Administration 71 0 0 0 0 0 0 0 0 0 0 2 11 3 5 21\n", 566 | "Electronics 72 0 0 0 0 0 0 0 0 0 0 6 6 5 9 26\n", 567 | "Urban and Regional Planning 73 0 0 0 0 0 0 0 0 0 0 1 1 3 2 7\n", 568 | "Mechanics/Machine Trade 74 0 0 0 0 0 0 0 0 0 0 0 1 1 4 6\n", 569 | "Dance 75 0 0 0 0 0 0 0 0 0 0 1 0 1 1 3\n", 570 | "Gerontology 76 0 0 0 0 0 0 0 0 0 0 1 0 1 1 3\n", 571 | "Public Relations 77 0 0 0 0 0 0 0 0 0 0 3 1 2 1 7\n", 572 | "Textiles/Cloth 78 0 0 0 0 0 0 0 0 0 0 3 4 0 0 7\n", 573 | "Parks and Recreation 79 0 0 0 0 0 0 0 0 0 0 1 2 1 0 4\n", 574 | "Information Technology 80 0 0 0 0 0 0 0 0 0 0 0 5 8 11 24\n", 575 | "Fashion 81 0 0 0 0 0 0 0 0 0 0 0 0 3 1 4\n", 576 | "Counseling 82 0 0 0 0 0 0 0 0 0 0 0 0 11 9 20\n", 577 | "Don't know/UNCODED 98 0 0 0 0 0 0 0 0 0 0 2 3 0 0 5\n", 578 | "No answer 99 0 0 0 0 0 0 0 0 0 0 0 1 5 3 9\n", 579 | "Not applicable 0 13626 354 7542 353 5907 10334 8394 4510 2023 2044 1263 1597 1795 1435 61177'''\n", 580 | "\n", 581 | "# copy paste slight tweak from page 186\n", 582 | "major_dict = {int(row.split()[-16]): ' '.join(row.split()[:-16]) for row in MAJOR.split('\\n')[1:]}\n", 583 | "major_dict" 584 | ] 585 | }, 586 | { 587 | "cell_type": "code", 588 | "execution_count": null, 589 | "id": "fd16476a", 590 | "metadata": {}, 591 | "outputs": [], 592 | "source": [ 593 | "raw.MAJOR1.value_counts()" 594 | ] 595 | }, 596 | { 597 | "cell_type": "code", 598 | "execution_count": null, 599 | "id": "382102ad", 600 | "metadata": {}, 601 | "outputs": [], 602 | "source": [ 603 | "(raw\n", 604 | " [cols]\n", 605 | " .assign(\n", 606 | " MAJOR1=raw.MAJOR1.fillna(99).astype('int').replace(major_dict),\n", 607 | " SEX=raw.SEX#\n", 608 | " \n", 609 | " .astype(int)\n", 610 | " .replace({1:'Male', 2:'Female'}),\n", 611 | " RACE=raw.RACE.astype(int).replace({1:'White', 2:'Black', 3:'Other'}),\n", 612 | " OCC=raw.OCC.fillna(9999).astype(int),\n", 613 | " 
BORN=raw.BORN.fillna(4).astype(int).replace({1:'Yes', 2:'No', 3:'Don\\'t Know',\n", 614 | " 4:'No answer', 5:'Not applicable'}),\n", 615 | " INCOME=raw.INCOME.fillna(99).astype(int).replace({99:'No answer', **dict(enumerate(['Not applicable',\n", 616 | " 0,1000,3000,4000,5000,6000,\n", 617 | " 7000,8000,10000,15000,20000,25000,]))}),\n", 618 | " INCOME06=raw.INCOME06.fillna(26).astype(int).replace({26:'Refused', **dict(enumerate(['Not applicable',\n", 619 | " 0,1000,3000,4000,5000,6000,\n", 620 | " 7000,8000,10000,12500,15000,\n", 621 | " 17500,20000,22500,25000,30_000,\n", 622 | " 35_000, 40_000, 50_000, 60_000,\n", 623 | " 75_000, 90_000, 110_000, 130_000,\n", 624 | " 150_000]))}),\n", 625 | " HONEST=raw.HONEST.fillna(9).astype(int).replace({1:'Most desirable', 2:'3 most desireable',\n", 626 | " 3:'Not mentioned', 4: '3 least desireable',\n", 627 | " 5: 'One least desireable',\n", 628 | " 9:'No answer'}),\n", 629 | " TICKET=raw.TICKET.fillna(9).astype(int).replace({1:'Yes', 2:'No', 3:'Refused', 9: 'No answer'}),\n", 630 | " )\n", 631 | " .astype({'YEAR':int, 'ID': 'uint16[pyarrow]'})\n", 632 | " .to_csv('GSS.csv')\n", 633 | ")" 634 | ] 635 | }, 636 | { 637 | "cell_type": "code", 638 | "execution_count": null, 639 | "id": "8366b83d", 640 | "metadata": {}, 641 | "outputs": [], 642 | "source": [] 643 | }, 644 | { 645 | "cell_type": "code", 646 | "execution_count": null, 647 | "id": "be251648", 648 | "metadata": {}, 649 | "outputs": [], 650 | "source": [] 651 | }, 652 | { 653 | "cell_type": "markdown", 654 | "id": "394409a6", 655 | "metadata": {}, 656 | "source": [ 657 | "## Types\n", 658 | "Getting the right types will enable analysis and correctness.\n" 659 | ] 660 | }, 661 | { 662 | "cell_type": "code", 663 | "execution_count": null, 664 | "id": "bc415e79", 665 | "metadata": {}, 666 | "outputs": [], 667 | "source": [ 668 | "%%time\n", 669 | "gss = pd.read_csv('GSS.csv', index_col=0, dtype_backend='pyarrow', engine='pyarrow')" 670 | ] 671 | }, 672 | { 673 | 
"cell_type": "code", 674 | "execution_count": null, 675 | "id": "b6391a1b", 676 | "metadata": { 677 | "pycharm": { 678 | "name": "#%%\n" 679 | } 680 | }, 681 | "outputs": [], 682 | "source": [ 683 | "gss.dtypes" 684 | ] 685 | }, 686 | { 687 | "cell_type": "code", 688 | "execution_count": null, 689 | "id": "fe39dcbf", 690 | "metadata": {}, 691 | "outputs": [], 692 | "source": [ 693 | "gss" 694 | ] 695 | }, 696 | { 697 | "cell_type": "code", 698 | "execution_count": null, 699 | "id": "883a91c7", 700 | "metadata": { 701 | "pycharm": { 702 | "name": "#%%\n" 703 | } 704 | }, 705 | "outputs": [], 706 | "source": [ 707 | "gss.memory_usage(deep=True)" 708 | ] 709 | }, 710 | { 711 | "cell_type": "code", 712 | "execution_count": null, 713 | "id": "6f564675", 714 | "metadata": { 715 | "pycharm": { 716 | "name": "#%%\n" 717 | } 718 | }, 719 | "outputs": [], 720 | "source": [ 721 | "# 36 M (pandas 1)\n", 722 | "# 8.6 M (Pandas 2)\n", 723 | "gss.memory_usage(deep=True).sum()" 724 | ] 725 | }, 726 | { 727 | "cell_type": "code", 728 | "execution_count": null, 729 | "id": "f0a102d3", 730 | "metadata": { 731 | "lines_to_next_cell": 2, 732 | "pycharm": { 733 | "name": "#%%\n" 734 | } 735 | }, 736 | "outputs": [], 737 | "source": [] 738 | }, 739 | { 740 | "cell_type": "markdown", 741 | "id": "ea3e9a77", 742 | "metadata": { 743 | "pycharm": { 744 | "name": "#%% md\n" 745 | } 746 | }, 747 | "source": [ 748 | "## Ints" 749 | ] 750 | }, 751 | { 752 | "cell_type": "code", 753 | "execution_count": null, 754 | "id": "c6a0a74f", 755 | "metadata": { 756 | "pycharm": { 757 | "name": "#%%\n" 758 | } 759 | }, 760 | "outputs": [], 761 | "source": [ 762 | "gss.select_dtypes(int).describe()" 763 | ] 764 | }, 765 | { 766 | "cell_type": "code", 767 | "execution_count": null, 768 | "id": "04eb0242", 769 | "metadata": { 770 | "pycharm": { 771 | "name": "#%%\n" 772 | } 773 | }, 774 | "outputs": [], 775 | "source": [ 776 | "# chaining\n", 777 | "(gss\n", 778 | " .select_dtypes(int)\n", 779 | " 
.describe()\n", 780 | ")" 781 | ] 782 | }, 783 | { 784 | "cell_type": "code", 785 | "execution_count": null, 786 | "id": "70ce85a3", 787 | "metadata": { 788 | "pycharm": { 789 | "name": "#%%\n" 790 | } 791 | }, 792 | "outputs": [], 793 | "source": [ 794 | "# can the integer columns fit in an int8?\n", 795 | "# Do completion on np.int to see the sized types (np.int itself was removed in NumPy 1.24)\n", 796 | "np.iinfo(np.int64)" 797 | ] 798 | }, 799 | { 800 | "cell_type": "code", 801 | "execution_count": null, 802 | "id": "459e3d43", 803 | "metadata": { 804 | "pycharm": { 805 | "name": "#%%\n" 806 | } 807 | }, 808 | "outputs": [], 809 | "source": [ 810 | "np.iinfo(np.uint8)" 811 | ] 812 | }, 813 | { 814 | "cell_type": "code", 815 | "execution_count": null, 816 | "id": "5fbb6683", 817 | "metadata": { 818 | "pycharm": { 819 | "name": "#%%\n" 820 | } 821 | }, 822 | "outputs": [], 823 | "source": [ 824 | "np.iinfo(np.uint16)" 825 | ] 826 | }, 827 | { 828 | "cell_type": "code", 829 | "execution_count": null, 830 | "id": "15a94872", 831 | "metadata": { 832 | "pycharm": { 833 | "name": "#%%\n" 834 | } 835 | }, 836 | "outputs": [], 837 | "source": [ 838 | "# chaining\n", 839 | "(gss\n", 840 | " .astype({'YEAR': 'uint16[pyarrow]', 'ID': 'uint16[pyarrow]', 'OCC': 'uint16[pyarrow]' })\n", 841 | " .select_dtypes(['uint16'])\n", 842 | " .describe()\n", 843 | ")" 844 | ] 845 | }, 846 | { 847 | "cell_type": "code", 848 | "execution_count": null, 849 | "id": "24a7fea9", 850 | "metadata": { 851 | "lines_to_next_cell": 2, 852 | "pycharm": { 853 | "name": "#%%\n" 854 | } 855 | }, 856 | "outputs": [], 857 | "source": [ 858 | "# chaining\n", 859 | "# use 'integer' to see all int-like columns\n", 860 | "(gss\n", 861 | " .astype({#'YEAR': 'uint16[pyarrow]',\n", 862 | " 'ID': 'uint16[pyarrow]', 'OCC': 'uint16[pyarrow]' }) \n", 863 | " .select_dtypes(['integer']) # see https://numpy.org/doc/stable/reference/arrays.scalars.html\n", 864 | " .describe()\n", 865 | ")" 866 | ] 867 | }, 868 | { 869 | "cell_type": "code", 870 | "execution_count": null, 871 | "id": "fb27bf58", 872 | 
"metadata": { 873 | "lines_to_next_cell": 2, 874 | "pycharm": { 875 | "name": "#%%\n" 876 | } 877 | }, 878 | "outputs": [], 879 | "source": [ 880 | "# Inspect memory usage\n", 881 | "(gss\n", 882 | " .astype({'YEAR': 'uint16[pyarrow]', 'ID': 'uint16[pyarrow]', 'OCC': 'uint16[pyarrow]' }) \n", 883 | " .memory_usage(deep=True)\n", 884 | " .sum() # was 36M\n", 885 | ")" 886 | ] 887 | }, 888 | { 889 | "cell_type": "code", 890 | "execution_count": null, 891 | "id": "b8a61fa7", 892 | "metadata": { 893 | "lines_to_next_cell": 2, 894 | "pycharm": { 895 | "name": "#%%\n" 896 | } 897 | }, 898 | "outputs": [], 899 | "source": [] 900 | }, 901 | { 902 | "cell_type": "markdown", 903 | "id": "fdd212a2", 904 | "metadata": {}, 905 | "source": [ 906 | "## Int Exercise\n", 907 | "* Try converting *YEAR* to `'int8'`. What do the values look like?\n", 908 | "* Try converting *YEAR* to `'int8[pyarrow]'`. What do the values look like?" 909 | ] 910 | }, 911 | { 912 | "cell_type": "code", 913 | "execution_count": null, 914 | "id": "370c2e84", 915 | "metadata": {}, 916 | "outputs": [], 917 | "source": [] 918 | }, 919 | { 920 | "cell_type": "code", 921 | "execution_count": null, 922 | "id": "09cdc54a", 923 | "metadata": {}, 924 | "outputs": [], 925 | "source": [] 926 | }, 927 | { 928 | "cell_type": "markdown", 929 | "id": "5df9482a", 930 | "metadata": {}, 931 | "source": [ 932 | "## Floats" 933 | ] 934 | }, 935 | { 936 | "cell_type": "code", 937 | "execution_count": null, 938 | "id": "fab11648", 939 | "metadata": { 940 | "pycharm": { 941 | "name": "#%%\n" 942 | } 943 | }, 944 | "outputs": [], 945 | "source": [ 946 | "(gss\n", 947 | ".select_dtypes('float'))" 948 | ] 949 | }, 950 | { 951 | "cell_type": "code", 952 | "execution_count": null, 953 | "id": "5c8490a5", 954 | "metadata": { 955 | "pycharm": { 956 | "name": "#%%\n" 957 | } 958 | }, 959 | "outputs": [], 960 | "source": [ 961 | "# surprise! 
age and hours worked look int-like\n", 962 | "gss.HRS1.describe()" 963 | ] 964 | }, 965 | { 966 | "cell_type": "code", 967 | "execution_count": null, 968 | "id": "6f357afd", 969 | "metadata": { 970 | "pycharm": { 971 | "name": "#%%\n" 972 | } 973 | }, 974 | "outputs": [], 975 | "source": [ 976 | "# oops! missing values\n", 977 | "gss.HRS1.value_counts(dropna=False)" 978 | ] 979 | }, 980 | { 981 | "cell_type": "code", 982 | "execution_count": null, 983 | "id": "ce14eddb", 984 | "metadata": { 985 | "pycharm": { 986 | "name": "#%%\n" 987 | } 988 | }, 989 | "outputs": [], 990 | "source": [ 991 | "# where are they missing?\n", 992 | "(gss\n", 993 | " .query('HRS1.isna()')\n", 994 | ")" 995 | ] 996 | }, 997 | { 998 | "cell_type": "code", 999 | "execution_count": null, 1000 | "id": "dbe482fb", 1001 | "metadata": { 1002 | "pycharm": { 1003 | "name": "#%%\n" 1004 | } 1005 | }, 1006 | "outputs": [], 1007 | "source": [ 1008 | "# where are they missing?\n", 1009 | "(gss\n", 1010 | " .query('AGE.isna()')\n", 1011 | ")" 1012 | ] 1013 | }, 1014 | { 1015 | "cell_type": "code", 1016 | "execution_count": null, 1017 | "id": "36817f9c", 1018 | "metadata": { 1019 | "pycharm": { 1020 | "name": "#%%\n" 1021 | } 1022 | }, 1023 | "outputs": [], 1024 | "source": [ 1025 | "# where are they missing?\n", 1026 | "# It turns out that ID is not consistent across years\n", 1027 | "(gss\n", 1028 | " .query('ID == 229')\n", 1029 | ")" 1030 | ] 1031 | }, 1032 | { 1033 | "cell_type": "code", 1034 | "execution_count": null, 1035 | "id": "75b3c785", 1036 | "metadata": { 1037 | "lines_to_next_cell": 2, 1038 | "pycharm": { 1039 | "name": "#%%\n" 1040 | } 1041 | }, 1042 | "outputs": [], 1043 | "source": [ 1044 | "# Convert to integers\n", 1045 | "(gss\n", 1046 | " .astype({'YEAR': 'uint16[pyarrow]', 'ID': 'uint16[pyarrow]', 'OCC': 'uint16[pyarrow]',\n", 1047 | " 'HRS1': 'uint8[pyarrow]', 'AGE': 'uint8[pyarrow]'})\n", 1048 | ")" 1049 | ] 1050 | }, 1051 | { 1052 | "cell_type": "code", 1053 | 
"execution_count": null, 1054 | "id": "8e5bc829", 1055 | "metadata": { 1056 | "lines_to_next_cell": 2, 1057 | "pycharm": { 1058 | "name": "#%%\n" 1059 | } 1060 | }, 1061 | "outputs": [], 1062 | "source": [ 1063 | "(gss\n", 1064 | " .astype({'YEAR': 'uint16[pyarrow]', 'ID': 'uint16[pyarrow]', 'OCC': 'uint16[pyarrow]',\n", 1065 | " 'HRS1': 'uint8[pyarrow]', 'AGE': 'uint8[pyarrow]'})\n", 1066 | " .memory_usage(deep=True)\n", 1067 | " .sum() # was 36M \n", 1068 | ")" 1069 | ] 1070 | }, 1071 | { 1072 | "cell_type": "code", 1073 | "execution_count": null, 1074 | "id": "7642b746", 1075 | "metadata": { 1076 | "lines_to_next_cell": 2, 1077 | "pycharm": { 1078 | "name": "#%%\n" 1079 | } 1080 | }, 1081 | "outputs": [], 1082 | "source": [] 1083 | }, 1084 | { 1085 | "cell_type": "markdown", 1086 | "id": "5ca5890d", 1087 | "metadata": {}, 1088 | "source": [ 1089 | "## Float Exercise\n", 1090 | "\n", 1091 | "* What is the mean of the numeric columns?\n", 1092 | "* How many values are missing in the numeric columns?" 
1093 | ] 1094 | }, 1095 | { 1096 | "cell_type": "code", 1097 | "execution_count": null, 1098 | "id": "fab90ba5", 1099 | "metadata": { 1100 | "lines_to_next_cell": 2 1101 | }, 1102 | "outputs": [], 1103 | "source": [] 1104 | }, 1105 | { 1106 | "cell_type": "markdown", 1107 | "id": "a083aa86", 1108 | "metadata": {}, 1109 | "source": [ 1110 | "## Objects" 1111 | ] 1112 | }, 1113 | { 1114 | "cell_type": "code", 1115 | "execution_count": null, 1116 | "id": "490050a8", 1117 | "metadata": { 1118 | "pycharm": { 1119 | "name": "#%%\n" 1120 | } 1121 | }, 1122 | "outputs": [], 1123 | "source": [ 1124 | "# pandas 1.x\n", 1125 | "(gss\n", 1126 | " .select_dtypes(object)\n", 1127 | ")" 1128 | ] 1129 | }, 1130 | { 1131 | "cell_type": "code", 1132 | "execution_count": null, 1133 | "id": "0363a462", 1134 | "metadata": { 1135 | "pycharm": { 1136 | "name": "#%%\n" 1137 | } 1138 | }, 1139 | "outputs": [], 1140 | "source": [ 1141 | "# pandas 2\n", 1142 | "(gss\n", 1143 | " .select_dtypes('string') # str doesn't work\n", 1144 | ")" 1145 | ] 1146 | }, 1147 | { 1148 | "cell_type": "code", 1149 | "execution_count": null, 1150 | "id": "a5f79d7f", 1151 | "metadata": { 1152 | "pycharm": { 1153 | "name": "#%%\n" 1154 | } 1155 | }, 1156 | "outputs": [], 1157 | "source": [ 1158 | "# My go-to method - .value_counts\n", 1159 | "# looks categorical\n", 1160 | "(gss.MAJOR1.value_counts(dropna=False))" 1161 | ] 1162 | }, 1163 | { 1164 | "cell_type": "code", 1165 | "execution_count": null, 1166 | "id": "72004253", 1167 | "metadata": { 1168 | "lines_to_next_cell": 2, 1169 | "pycharm": { 1170 | "name": "#%%\n" 1171 | } 1172 | }, 1173 | "outputs": [], 1174 | "source": [ 1175 | "(gss\n", 1176 | " .astype({'YEAR': 'uint16[pyarrow]', 'ID': 'uint16[pyarrow]', 'OCC': 'uint16[pyarrow]',\n", 1177 | " 'HRS1': 'uint8[pyarrow]', 'AGE': 'uint8[pyarrow]',\n", 1178 | " 'MAJOR1': 'category'})\n", 1179 | " .memory_usage(deep=True)\n", 1180 | " .sum() # was 36M \n", 1181 | ")" 1182 | ] 1183 | }, 1184 | { 1185 | 
"cell_type": "code", 1186 | "execution_count": null, 1187 | "id": "1b38e847", 1188 | "metadata": {}, 1189 | "outputs": [], 1190 | "source": [ 1191 | "(gss\n", 1192 | " .select_dtypes(object)\n", 1193 | " .columns\n", 1194 | ")" 1195 | ] 1196 | }, 1197 | { 1198 | "cell_type": "code", 1199 | "execution_count": null, 1200 | "id": "c4529135", 1201 | "metadata": { 1202 | "lines_to_next_cell": 0, 1203 | "pycharm": { 1204 | "name": "#%%\n" 1205 | } 1206 | }, 1207 | "outputs": [], 1208 | "source": [ 1209 | "# wow!\n", 1210 | "(gss\n", 1211 | " .astype({'YEAR': 'uint16[pyarrow]', 'ID': 'uint16[pyarrow]', 'OCC': 'uint16[pyarrow]',\n", 1212 | " 'HRS1': 'uint8[pyarrow]', 'AGE': 'uint8[pyarrow]',\n", 1213 | " 'MAJOR1': 'category',\n", 1214 | " **{col: 'category' for col in ['SEX', 'RACE', 'BORN', \n", 1215 | " 'INCOME', 'INCOME06', 'HONEST','TICKET']}}) \n", 1216 | " .memory_usage(deep=True)\n", 1217 | " .sum() # was 36M \n", 1218 | ")" 1219 | ] 1220 | }, 1221 | { 1222 | "cell_type": "code", 1223 | "execution_count": null, 1224 | "id": "6df39625", 1225 | "metadata": {}, 1226 | "outputs": [], 1227 | "source": [] 1228 | }, 1229 | { 1230 | "cell_type": "code", 1231 | "execution_count": null, 1232 | "id": "2bd53a07", 1233 | "metadata": { 1234 | "lines_to_next_cell": 2 1235 | }, 1236 | "outputs": [], 1237 | "source": [] 1238 | }, 1239 | { 1240 | "cell_type": "markdown", 1241 | "id": "5c96b300", 1242 | "metadata": {}, 1243 | "source": [ 1244 | "## String and Category Exercises\n", 1245 | "* There is a `.cat` attribute on the category columns. What can you do with this attribute? (Use `dir` or tab completion to inspect).\n", 1246 | "* Categories can be ordered. How do you order *INCOME*?\n", 1247 | "* There is an `.str` attribute on the string and category columns. What can you do with this attribute? (Use `dir` or tab completion to inspect).\n", 1248 | "* Uppercase the values in the *TICKET* column." 
1249 | ] 1250 | }, 1251 | { 1252 | "cell_type": "code", 1253 | "execution_count": null, 1254 | "id": "91a86c01", 1255 | "metadata": {}, 1256 | "outputs": [], 1257 | "source": [] 1258 | }, 1259 | { 1260 | "cell_type": "code", 1261 | "execution_count": null, 1262 | "id": "aeab919f", 1263 | "metadata": {}, 1264 | "outputs": [], 1265 | "source": [ 1266 | " " 1267 | ] 1268 | }, 1269 | { 1270 | "cell_type": "code", 1271 | "execution_count": null, 1272 | "id": "2af1bf07", 1273 | "metadata": {}, 1274 | "outputs": [], 1275 | "source": [] 1276 | }, 1277 | { 1278 | "cell_type": "code", 1279 | "execution_count": null, 1280 | "id": "0d0e9df0", 1281 | "metadata": {}, 1282 | "outputs": [], 1283 | "source": [] 1284 | }, 1285 | { 1286 | "cell_type": "code", 1287 | "execution_count": null, 1288 | "id": "41672375", 1289 | "metadata": {}, 1290 | "outputs": [], 1291 | "source": [] 1292 | }, 1293 | { 1294 | "cell_type": "code", 1295 | "execution_count": null, 1296 | "id": "d51ca7d3", 1297 | "metadata": {}, 1298 | "outputs": [], 1299 | "source": [] 1300 | }, 1301 | { 1302 | "cell_type": "code", 1303 | "execution_count": null, 1304 | "id": "42dcf06d", 1305 | "metadata": {}, 1306 | "outputs": [], 1307 | "source": [] 1308 | }, 1309 | { 1310 | "cell_type": "code", 1311 | "execution_count": null, 1312 | "id": "d13689ff", 1313 | "metadata": {}, 1314 | "outputs": [], 1315 | "source": [] 1316 | }, 1317 | { 1318 | "cell_type": "code", 1319 | "execution_count": null, 1320 | "id": "28320f9c", 1321 | "metadata": {}, 1322 | "outputs": [], 1323 | "source": [] 1324 | }, 1325 | { 1326 | "cell_type": "code", 1327 | "execution_count": null, 1328 | "id": "7e74d51d", 1329 | "metadata": {}, 1330 | "outputs": [], 1331 | "source": [] 1332 | }, 1333 | { 1334 | "cell_type": "code", 1335 | "execution_count": null, 1336 | "id": "5cc2eedf", 1337 | "metadata": {}, 1338 | "outputs": [], 1339 | "source": [] 1340 | }, 1341 | { 1342 | "cell_type": "code", 1343 | "execution_count": null, 1344 | "id": "8e0408ab", 1345 | 
"metadata": {}, 1346 | "outputs": [], 1347 | "source": [] 1348 | }, 1349 | { 1350 | "cell_type": "markdown", 1351 | "id": "f0d47559", 1352 | "metadata": {}, 1353 | "source": [ 1354 | "## Make a Function" 1355 | ] 1356 | }, 1357 | { 1358 | "cell_type": "code", 1359 | "execution_count": null, 1360 | "id": "48b7fd47", 1361 | "metadata": { 1362 | "lines_to_next_cell": 2, 1363 | "pycharm": { 1364 | "name": "#%%\n" 1365 | } 1366 | }, 1367 | "outputs": [], 1368 | "source": [ 1369 | "# a glorious function\n", 1370 | "# add ordered categories to this\n", 1371 | "def tweak_gss(gss):\n", 1372 | " return (gss\n", 1373 | " .astype({'YEAR': 'uint16[pyarrow]', 'ID': 'uint16[pyarrow]', 'OCC': 'uint16[pyarrow]',\n", 1374 | " 'HRS1': 'uint8[pyarrow]', 'AGE': 'uint8[pyarrow]',\n", 1375 | " 'MAJOR1': 'category',\n", 1376 | " **{col: 'category' for col in ['SEX', 'RACE', 'BORN', \n", 1377 | " 'INCOME', 'INCOME06', 'HONEST','TICKET']}})\n", 1378 | " )\n", 1379 | "\n", 1380 | "tweak_gss(gss)" 1381 | ] 1382 | }, 1383 | { 1384 | "cell_type": "markdown", 1385 | "id": "3a3cc33c", 1386 | "metadata": {}, 1387 | "source": [ 1388 | "## Function Exercise\n", 1389 | "* Rearrange your notebook. Put the imports, code to load raw data, and tweak function at the top of the notebook. Restart the kernel and validate that your code works." 
1390 | ] 1391 | }, 1392 | { 1393 | "cell_type": "code", 1394 | "execution_count": null, 1395 | "id": "5fabd502", 1396 | "metadata": {}, 1397 | "outputs": [], 1398 | "source": [] 1399 | }, 1400 | { 1401 | "cell_type": "code", 1402 | "execution_count": null, 1403 | "id": "01e1d15d", 1404 | "metadata": {}, 1405 | "outputs": [], 1406 | "source": [] 1407 | }, 1408 | { 1409 | "cell_type": "markdown", 1410 | "id": "94863cf2", 1411 | "metadata": { 1412 | "lines_to_next_cell": 2 1413 | }, 1414 | "source": [ 1415 | "## Fix Column Names" 1416 | ] 1417 | }, 1418 | { 1419 | "cell_type": "code", 1420 | "execution_count": null, 1421 | "id": "489290dd", 1422 | "metadata": { 1423 | "lines_to_next_cell": 0, 1424 | "pycharm": { 1425 | "name": "#%%\n" 1426 | } 1427 | }, 1428 | "outputs": [], 1429 | "source": [ 1430 | "# a glorious function\n", 1431 | "def tweak_gss(gss):\n", 1432 | " return (gss\n", 1433 | " .astype({'YEAR': 'uint16[pyarrow]', 'ID': 'uint16[pyarrow]', 'OCC': 'uint16[pyarrow]',\n", 1434 | " 'HRS1': 'uint8[pyarrow]', 'AGE': 'uint8[pyarrow]',\n", 1435 | " 'MAJOR1': 'category',\n", 1436 | " **{col: 'category' for col in ['SEX', 'RACE', 'BORN', \n", 1437 | " 'INCOME', 'INCOME06', 'HONEST','TICKET']}})\n", 1438 | " .rename(columns={'YEAR': 'year', 'ID': 'year_id', 'AGE':'age', \n", 1439 | " 'HRS1': 'hours_worked', 'OCC': 'occupation', \n", 1440 | " 'MAJOR1': 'college_major', 'SEX':'sex', \n", 1441 | " 'RACE':'race', 'BORN':'born_in_US',\n", 1442 | " 'INCOME':'income_1970', 'INCOME06': 'income_2006',\n", 1443 | " 'HONEST':'honesty_rank',\n", 1444 | " 'TICKET':'traffic_ticket'})\n", 1445 | " )\n", 1446 | "\n", 1447 | "tweak_gss(gss)" 1448 | ] 1449 | }, 1450 | { 1451 | "cell_type": "code", 1452 | "execution_count": null, 1453 | "id": "0f953444", 1454 | "metadata": { 1455 | "lines_to_next_cell": 2, 1456 | "pycharm": { 1457 | "name": "#%%\n" 1458 | } 1459 | }, 1460 | "outputs": [], 1461 | "source": [] 1462 | }, 1463 | { 1464 | "cell_type": "code", 1465 | "execution_count": null, 
1466 | "id": "70e7086c", 1467 | "metadata": { 1468 | "lines_to_next_cell": 2, 1469 | "pycharm": { 1470 | "name": "#%%\n" 1471 | } 1472 | }, 1473 | "outputs": [], 1474 | "source": [] 1475 | }, 1476 | { 1477 | "cell_type": "markdown", 1478 | "id": "dd6ea0f5", 1479 | "metadata": { 1480 | "pycharm": { 1481 | "name": "#%% md\n" 1482 | } 1483 | }, 1484 | "source": [ 1485 | "## Chain\n", 1486 | "\n", 1487 | "Chaining is also called \"flow\" programming. Rather than making intermediate variables, just leverage the fact that most operations return a new object and work on that.\n", 1488 | "\n", 1489 | "The chain should read like a recipe of ordered steps.\n", 1490 | "\n", 1491 | "(BTW, this is actually what we did above.)\n", 1492 | "\n", 1493 | "