├── .gitattributes ├── .gitignore ├── README.md ├── notebooks ├── Combining DataFrames.ipynb ├── Group Operations.ipynb ├── Indexing and Selecting.ipynb ├── Misc Functions.ipynb ├── Pandas Intro to Data Structures.ipynb └── Row-Column Transformations.ipynb └── requirements.txt /.gitattributes: -------------------------------------------------------------------------------- 1 | # Auto detect text files and perform LF normalization 2 | * text=auto 3 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_STORE 2 | 3 | # Byte-compiled / optimized / DLL files 4 | __pycache__/ 5 | *.py[cod] 6 | *$py.class 7 | 8 | # C extensions 9 | *.so 10 | 11 | # Distribution / packaging 12 | .Python 13 | build/ 14 | develop-eggs/ 15 | dist/ 16 | downloads/ 17 | eggs/ 18 | .eggs/ 19 | lib/ 20 | lib64/ 21 | parts/ 22 | sdist/ 23 | var/ 24 | wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | MANIFEST 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .nox/ 44 | .coverage 45 | .coverage.* 46 | .cache 47 | nosetests.xml 48 | coverage.xml 49 | *.cover 50 | .hypothesis/ 51 | .pytest_cache/ 52 | 53 | # Translations 54 | *.mo 55 | *.pot 56 | 57 | # Django stuff: 58 | *.log 59 | local_settings.py 60 | db.sqlite3 61 | 62 | # Flask stuff: 63 | instance/ 64 | .webassets-cache 65 | 66 | # Scrapy stuff: 67 | .scrapy 68 | 69 | # Sphinx documentation 70 | docs/_build/ 71 | 72 | # PyBuilder 73 | target/ 74 | 75 | # Jupyter Notebook 76 | .ipynb_checkpoints 77 | 78 | # IPython 79 | profile_default/ 80 | ipython_config.py 81 | 82 | # pyenv 83 | .python-version 84 | 85 | # celery beat schedule file 86 | celerybeat-schedule 87 | 88 | # SageMath parsed files 89 | *.sage.py 90 | 91 | # Environments 92 | .env 93 | .venv 94 | env/ 95 | venv/ 96 | ENV/ 97 | env.bak/ 98 | venv.bak/ 99 | 100 | # Spyder project settings 101 | .spyderproject 102 | .spyproject 103 | 104 | # Rope project settings 105 | .ropeproject 106 | 107 | # mkdocs documentation 108 | /site 109 | 110 | # mypy 111 | .mypy_cache/ 112 | .dmypy.json 113 | dmypy.json 114 | 115 | # Pyre type checker 116 | .pyre/ 117 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # An Opinionated Guide to Pandas 2 | 3 | ## Getting started with this tutorial 4 | 5 | I made this after quite a lot of thought. There are a ton of pandas tutorials out there and the maintainers of pandas themselves have tutorials. But I think these tutorials have one of two flavors: 6 | 7 | 1. Intro: you barely get into details. 8 | 2. Reference: just the details 9 | 10 | I wanted to let people know what are the important and advanced pandas functions that a data scientist uses on a day to day basis. 
And I could not find it. 11 | 12 | Thus this. 13 | 14 | This tutorial is an opinionated guide to pandas. I'll let you know which functions I think are not worth learning and which are. This is not an intro to pandas. This is pandas for data scientists, and I hope you enjoy. 15 | 16 | ## Installing Virtualenv 17 | 18 | The first step to get running with these tutorials is to install virtualenv. Fortunately there is a [great tutorial](https://docs.python-guide.org/dev/virtualenvs/#lower-level-virtualenv) in The Hitchhiker's Guide to Python. Please follow the steps in the guide. 19 | 20 | Once you have installed virtualenv, let's make an environment with the following command: 21 | 22 | `virtualenv -p python3 env` 23 | 24 | Notice that we are using Python 3. Pandas has announced that they will stop supporting Python 2 after 2019. You will then need to activate your env. Again, the tutorial is a great resource for showing you how to do this on both Windows and Mac: 25 | 26 | [Activate your env](https://docs.python-guide.org/dev/virtualenvs/#lower-level-virtualenv) 27 | 28 | The next step is to install all the requirements: 29 | 30 | `pip install -r requirements.txt` 31 | 32 | Finally, the last step is to run an ipython notebook from within the env and then we are ready to go: 33 | 34 | `ipython notebook` 35 | 36 | Pandas itself has some good resources on installation that you can find [here](https://pandas.pydata.org/pandas-docs/stable/install.html) 37 | 38 | 39 | ## Order of the Notebooks 40 | 41 | The recommended order is: 42 | 43 | 1. [Pandas Intro to Data Structures](https://github.com/knathanieltucker/pandas-tutorial/blob/master/notebooks/Pandas%20Intro%20to%20Data%20Structures.ipynb) 44 | 2. [Indexing and Selecting](https://github.com/knathanieltucker/pandas-tutorial/blob/master/notebooks/Indexing%20and%20Selecting.ipynb) 45 | 3. 
[Group Operations](https://github.com/knathanieltucker/pandas-tutorial/blob/master/notebooks/Group%20Operations.ipynb) 46 | 4. [Row-Column Transformations](https://github.com/knathanieltucker/pandas-tutorial/blob/master/notebooks/Row-Column%20Transformations.ipynb) 47 | 5. [Combining DataFrames](https://github.com/knathanieltucker/pandas-tutorial/blob/master/notebooks/Combining%20DataFrames.ipynb) 48 | 6. [Misc Functions](https://github.com/knathanieltucker/pandas-tutorial/blob/master/notebooks/Misc%20Functions.ipynb) 49 | 50 | 51 | ## Exercises 52 | 53 | If you are like me, you will find it quite useful to practice these techniques in exercises. Fortunately, pandas has some great [exercises listed on their site](https://pandas.pydata.org/pandas-docs/stable/tutorials.html#exercises-for-new-users). If y'all would like and these tutorials/videos get enough support, I'd be happy to video the solutions to those exercises as well. 54 | -------------------------------------------------------------------------------- /notebooks/Combining DataFrames.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import seaborn as sns\n", 10 | "import pandas as pd\n", 11 | "import numpy as np" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "# Pandas Combining DataFrames\n", 19 | "\n", 20 | "In pandas there are 4 ways (plus a few special cases) to combine data from different frames:\n", 21 | "\n", 22 | "* Merging\n", 23 | "* Joining\n", 24 | "* Concatenating \n", 25 | "* Appending\n", 26 | "\n", 27 | "Merging and joining are basically redundant, as are concatenating and appending. \n", 28 | "\n", 29 | "So today we will be going over Merging and Concatenating in pandas. 
\n", 30 | "\n", 31 | "Check out the full documentation [here](http://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html), but be warned it is a bit long :)\n", 32 | "\n", 33 | "\n", 34 | "Okay let's get started." 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": 2, 40 | "metadata": {}, 41 | "outputs": [ 42 | { 43 | "data": { 44 | "text/html": [ 45 | "
\n", 46 | "\n", 59 | "\n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | "
   total_bill   tip     sex  smoker  day  time    size
0  16.99        1.01    Female  No   Sun  Dinner  2
1  10.34        1.66    Male    No   Sun  Dinner  3
2  21.01        3.50    Male    No   Sun  Dinner  3
\n", 105 | "
" 106 | ], 107 | "text/plain": [ 108 | " total_bill tip sex smoker day time size\n", 109 | "0 16.99 1.01 Female No Sun Dinner 2\n", 110 | "1 10.34 1.66 Male No Sun Dinner 3\n", 111 | "2 21.01 3.50 Male No Sun Dinner 3" 112 | ] 113 | }, 114 | "execution_count": 2, 115 | "metadata": {}, 116 | "output_type": "execute_result" 117 | } 118 | ], 119 | "source": [ 120 | "tips = sns.load_dataset('tips')\n", 121 | "tips.head(3)" 122 | ] 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "metadata": {}, 127 | "source": [ 128 | "## Merge\n", 129 | "\n", 130 | "Merging is for doing complex column-wise combinations of dataframes in a SQL-like way. If you don't know SQL joins then check out this resource [sql joins](https://www.w3schools.com/sql/sql_join.asp) and comment below (and maybe I'll make a video). \n", 131 | "\n", 132 | "To merge we need two dataframes, so let's make them below:" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": 3, 138 | "metadata": {}, 139 | "outputs": [], 140 | "source": [ 141 | "tips_bill = tips.groupby(['sex', 'smoker'])[['total_bill', 'tip']].sum()\n", 142 | "tips_tip = tips.groupby(['sex', 'smoker'])[['total_bill', 'tip']].sum()\n", 143 | "\n", 144 | "del tips_bill['tip']\n", 145 | "del tips_tip['total_bill']" 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": 4, 151 | "metadata": {}, 152 | "outputs": [ 153 | { 154 | "data": { 155 | "text/html": [ 156 | "
\n", 157 | "\n", 170 | "\n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | "
                total_bill
sex     smoker
Male    Yes     1337.07
        No      1919.75
Female  Yes     593.27
        No      977.68
\n", 204 | "
" 205 | ], 206 | "text/plain": [ 207 | " total_bill\n", 208 | "sex smoker \n", 209 | "Male Yes 1337.07\n", 210 | " No 1919.75\n", 211 | "Female Yes 593.27\n", 212 | " No 977.68" 213 | ] 214 | }, 215 | "execution_count": 4, 216 | "metadata": {}, 217 | "output_type": "execute_result" 218 | } 219 | ], 220 | "source": [ 221 | "tips_bill" 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": 18, 227 | "metadata": {}, 228 | "outputs": [ 229 | { 230 | "data": { 231 | "text/html": [ 232 | "
\n", 233 | "\n", 246 | "\n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | "
                tip
sex     smoker
Male    Yes     183.07
        No      302.00
Female  Yes     96.74
        No      149.77
\n", 280 | "
" 281 | ], 282 | "text/plain": [ 283 | " tip\n", 284 | "sex smoker \n", 285 | "Male Yes 183.07\n", 286 | " No 302.00\n", 287 | "Female Yes 96.74\n", 288 | " No 149.77" 289 | ] 290 | }, 291 | "execution_count": 18, 292 | "metadata": {}, 293 | "output_type": "execute_result" 294 | } 295 | ], 296 | "source": [ 297 | "tips_tip" 298 | ] 299 | }, 300 | { 301 | "cell_type": "markdown", 302 | "metadata": {}, 303 | "source": [ 304 | "Now that we have two datasets that we want to combine (that is, take the tips and combine them with the total bill), how do we do it? We merge!" 305 | ] 306 | }, 307 | { 308 | "cell_type": "code", 309 | "execution_count": 20, 310 | "metadata": {}, 311 | "outputs": [], 312 | "source": [ 313 | "pd.merge?" 314 | ] 315 | }, 316 | { 317 | "cell_type": "markdown", 318 | "metadata": {}, 319 | "source": [ 320 | "Notice that there are a ton of options:" 321 | ] 322 | }, 323 | { 324 | "cell_type": "code", 325 | "execution_count": 22, 326 | "metadata": {}, 327 | "outputs": [ 328 | { 329 | "data": { 330 | "text/html": [ 331 | "
\n", 332 | "\n", 345 | "\n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | "
                total_bill  tip
sex     smoker
Male    Yes     1337.07     183.07
        No      1919.75     302.00
Female  Yes     593.27      96.74
        No      977.68      149.77
\n", 385 | "
" 386 | ], 387 | "text/plain": [ 388 | " total_bill tip\n", 389 | "sex smoker \n", 390 | "Male Yes 1337.07 183.07\n", 391 | " No 1919.75 302.00\n", 392 | "Female Yes 593.27 96.74\n", 393 | " No 977.68 149.77" 394 | ] 395 | }, 396 | "execution_count": 22, 397 | "metadata": {}, 398 | "output_type": "execute_result" 399 | } 400 | ], 401 | "source": [ 402 | "# we can merge on the indexes\n", 403 | "pd.merge(tips_bill, tips_tip, \n", 404 | " right_index=True, left_index=True)" 405 | ] 406 | }, 407 | { 408 | "cell_type": "code", 409 | "execution_count": 24, 410 | "metadata": {}, 411 | "outputs": [ 412 | { 413 | "data": { 414 | "text/html": [ 415 | "
\n", 416 | "\n", 429 | "\n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | "
   sex     smoker  total_bill  tip
0  Male    Yes     1337.07     183.07
1  Male    No      1919.75     302.00
2  Female  Yes     593.27      96.74
3  Female  No      977.68      149.77
\n", 470 | "
" 471 | ], 472 | "text/plain": [ 473 | " sex smoker total_bill tip\n", 474 | "0 Male Yes 1337.07 183.07\n", 475 | "1 Male No 1919.75 302.00\n", 476 | "2 Female Yes 593.27 96.74\n", 477 | "3 Female No 977.68 149.77" 478 | ] 479 | }, 480 | "execution_count": 24, 481 | "metadata": {}, 482 | "output_type": "execute_result" 483 | } 484 | ], 485 | "source": [ 486 | "#we can reset indexes and then merge on the columns - perhaps the easiest way\n", 487 | "pd.merge(\n", 488 | " tips_bill.reset_index(), \n", 489 | " tips_tip.reset_index(),\n", 490 | " on=['sex', 'smoker']\n", 491 | ")" 492 | ] 493 | }, 494 | { 495 | "cell_type": "code", 496 | "execution_count": 25, 497 | "metadata": {}, 498 | "outputs": [ 499 | { 500 | "data": { 501 | "text/html": [ 502 | "
\n", 503 | "\n", 516 | "\n", 517 | " \n", 518 | " \n", 519 | " \n", 520 | " \n", 521 | " \n", 522 | " \n", 523 | " \n", 524 | " \n", 525 | " \n", 526 | " \n", 527 | " \n", 528 | " \n", 529 | " \n", 530 | " \n", 531 | " \n", 532 | " \n", 533 | " \n", 534 | " \n", 535 | " \n", 536 | " \n", 537 | " \n", 538 | " \n", 539 | " \n", 540 | " \n", 541 | " \n", 542 | " \n", 543 | " \n", 544 | " \n", 545 | " \n", 546 | " \n", 547 | " \n", 548 | " \n", 549 | " \n", 550 | " \n", 551 | " \n", 552 | " \n", 553 | " \n", 554 | " \n", 555 | " \n", 556 | "
   sex     smoker  total_bill  tip
0  Male    Yes     1337.07     183.07
1  Male    No      1919.75     302.00
2  Female  Yes     593.27      96.74
3  Female  No      977.68      149.77
\n", 557 | "
" 558 | ], 559 | "text/plain": [ 560 | " sex smoker total_bill tip\n", 561 | "0 Male Yes 1337.07 183.07\n", 562 | "1 Male No 1919.75 302.00\n", 563 | "2 Female Yes 593.27 96.74\n", 564 | "3 Female No 977.68 149.77" 565 | ] 566 | }, 567 | "execution_count": 25, 568 | "metadata": {}, 569 | "output_type": "execute_result" 570 | } 571 | ], 572 | "source": [ 573 | "# it can actually infer the above - but be very careful with this\n", 574 | "pd.merge(\n", 575 | " tips_bill.reset_index(), \n", 576 | " tips_tip.reset_index()\n", 577 | ")" 578 | ] 579 | }, 580 | { 581 | "cell_type": "code", 582 | "execution_count": 27, 583 | "metadata": {}, 584 | "outputs": [ 585 | { 586 | "data": { 587 | "text/html": [ 588 | "
\n", 589 | "\n", 602 | "\n", 603 | " \n", 604 | " \n", 605 | " \n", 606 | " \n", 607 | " \n", 608 | " \n", 609 | " \n", 610 | " \n", 611 | " \n", 612 | " \n", 613 | " \n", 614 | " \n", 615 | " \n", 616 | " \n", 617 | " \n", 618 | " \n", 619 | " \n", 620 | " \n", 621 | " \n", 622 | " \n", 623 | " \n", 624 | " \n", 625 | " \n", 626 | " \n", 627 | " \n", 628 | " \n", 629 | " \n", 630 | " \n", 631 | " \n", 632 | " \n", 633 | " \n", 634 | " \n", 635 | " \n", 636 | " \n", 637 | " \n", 638 | " \n", 639 | " \n", 640 | " \n", 641 | " \n", 642 | "
   sex     smoker  total_bill  tip
0  Male    Yes     1337.07     183.07
1  Male    No      1919.75     302.00
2  Female  Yes     593.27      96.74
3  Female  No      977.68      149.77
\n", 643 | "
" 644 | ], 645 | "text/plain": [ 646 | " sex smoker total_bill tip\n", 647 | "0 Male Yes 1337.07 183.07\n", 648 | "1 Male No 1919.75 302.00\n", 649 | "2 Female Yes 593.27 96.74\n", 650 | "3 Female No 977.68 149.77" 651 | ] 652 | }, 653 | "execution_count": 27, 654 | "metadata": {}, 655 | "output_type": "execute_result" 656 | } 657 | ], 658 | "source": [ 659 | "# it can merge on partial column and index\n", 660 | "pd.merge(\n", 661 | " tips_bill.reset_index(), \n", 662 | " tips_tip,\n", 663 | " left_on=['sex', 'smoker'],\n", 664 | " right_index=True\n", 665 | ")" 666 | ] 667 | }, 668 | { 669 | "cell_type": "code", 670 | "execution_count": 5, 671 | "metadata": {}, 672 | "outputs": [ 673 | { 674 | "data": { 675 | "text/html": [ 676 | "
\n", 677 | "\n", 690 | "\n", 691 | " \n", 692 | " \n", 693 | " \n", 694 | " \n", 695 | " \n", 696 | " \n", 697 | " \n", 698 | " \n", 699 | " \n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | " \n", 712 | " \n", 713 | " \n", 714 | " \n", 715 | " \n", 716 | " \n", 717 | " \n", 718 | " \n", 719 | " \n", 720 | " \n", 721 | " \n", 722 | " \n", 723 | " \n", 724 | " \n", 725 | "
        sex     total_bill
smoker
Yes     Male    1337.07
No      Male    1919.75
Yes     Female  593.27
No      Female  977.68
\n", 726 | "
" 727 | ], 728 | "text/plain": [ 729 | " sex total_bill\n", 730 | "smoker \n", 731 | "Yes Male 1337.07\n", 732 | "No Male 1919.75\n", 733 | "Yes Female 593.27\n", 734 | "No Female 977.68" 735 | ] 736 | }, 737 | "execution_count": 5, 738 | "metadata": {}, 739 | "output_type": "execute_result" 740 | } 741 | ], 742 | "source": [ 743 | "#it can do interesting combinations\n", 744 | "tips_bill_strange = tips_bill.reset_index(level=0)\n", 745 | "tips_bill_strange" 746 | ] 747 | }, 748 | { 749 | "cell_type": "code", 750 | "execution_count": 7, 751 | "metadata": {}, 752 | "outputs": [ 753 | { 754 | "data": { 755 | "text/html": [ 756 | "
\n", 757 | "\n", 770 | "\n", 771 | " \n", 772 | " \n", 773 | " \n", 774 | " \n", 775 | " \n", 776 | " \n", 777 | " \n", 778 | " \n", 779 | " \n", 780 | " \n", 781 | " \n", 782 | " \n", 783 | " \n", 784 | " \n", 785 | " \n", 786 | " \n", 787 | " \n", 788 | " \n", 789 | " \n", 790 | " \n", 791 | " \n", 792 | " \n", 793 | " \n", 794 | " \n", 795 | " \n", 796 | " \n", 797 | " \n", 798 | " \n", 799 | " \n", 800 | " \n", 801 | " \n", 802 | " \n", 803 | " \n", 804 | " \n", 805 | " \n", 806 | " \n", 807 | " \n", 808 | " \n", 809 | " \n", 810 | "
   sex     smoker  tip     total_bill
0  Male    Yes     183.07  1337.07
1  Male    No      302.00  1919.75
2  Female  Yes     96.74   593.27
3  Female  No      149.77  977.68
\n", 811 | "
" 812 | ], 813 | "text/plain": [ 814 | " sex smoker tip total_bill\n", 815 | "0 Male Yes 183.07 1337.07\n", 816 | "1 Male No 302.00 1919.75\n", 817 | "2 Female Yes 96.74 593.27\n", 818 | "3 Female No 149.77 977.68" 819 | ] 820 | }, 821 | "execution_count": 7, 822 | "metadata": {}, 823 | "output_type": "execute_result" 824 | } 825 | ], 826 | "source": [ 827 | "pd.merge(\n", 828 | " tips_tip.reset_index(), \n", 829 | " tips_bill_strange,\n", 830 | " on=['sex', 'smoker']\n", 831 | ")" 832 | ] 833 | }, 834 | { 835 | "cell_type": "code", 836 | "execution_count": 31, 837 | "metadata": {}, 838 | "outputs": [ 839 | { 840 | "data": { 841 | "text/html": [ 842 | "
\n", 843 | "\n", 856 | "\n", 857 | " \n", 858 | " \n", 859 | " \n", 860 | " \n", 861 | " \n", 862 | " \n", 863 | " \n", 864 | " \n", 865 | " \n", 866 | " \n", 867 | " \n", 868 | " \n", 869 | " \n", 870 | " \n", 871 | " \n", 872 | " \n", 873 | " \n", 874 | " \n", 875 | " \n", 876 | " \n", 877 | " \n", 878 | " \n", 879 | " \n", 880 | " \n", 881 | " \n", 882 | " \n", 883 | " \n", 884 | " \n", 885 | " \n", 886 | " \n", 887 | " \n", 888 | " \n", 889 | " \n", 890 | " \n", 891 | " \n", 892 | " \n", 893 | " \n", 894 | " \n", 895 | " \n", 896 | "
   sex     smoker  total_bill  tip
0  Male    Yes     1337.07     183.07
1  Male    No      1919.75     302.00
2  Female  Yes     593.27      NaN
3  Female  No      977.68      NaN
\n", 897 | "
" 898 | ], 899 | "text/plain": [ 900 | " sex smoker total_bill tip\n", 901 | "0 Male Yes 1337.07 183.07\n", 902 | "1 Male No 1919.75 302.00\n", 903 | "2 Female Yes 593.27 NaN\n", 904 | "3 Female No 977.68 NaN" 905 | ] 906 | }, 907 | "execution_count": 31, 908 | "metadata": {}, 909 | "output_type": "execute_result" 910 | } 911 | ], 912 | "source": [ 913 | "# we can do any SQL-like functionality\n", 914 | "pd.merge(\n", 915 | " tips_bill.reset_index(), \n", 916 | " tips_tip.reset_index().head(2),\n", 917 | " how='left'\n", 918 | ")" 919 | ] 920 | }, 921 | { 922 | "cell_type": "code", 923 | "execution_count": 32, 924 | "metadata": {}, 925 | "outputs": [ 926 | { 927 | "data": { 928 | "text/html": [ 929 | "
\n", 930 | "\n", 943 | "\n", 944 | " \n", 945 | " \n", 946 | " \n", 947 | " \n", 948 | " \n", 949 | " \n", 950 | " \n", 951 | " \n", 952 | " \n", 953 | " \n", 954 | " \n", 955 | " \n", 956 | " \n", 957 | " \n", 958 | " \n", 959 | " \n", 960 | " \n", 961 | " \n", 962 | " \n", 963 | " \n", 964 | " \n", 965 | " \n", 966 | " \n", 967 | " \n", 968 | " \n", 969 | "
   sex   smoker  total_bill  tip
0  Male  Yes     1337.07     183.07
1  Male  No      1919.75     302.00
\n", 970 | "
" 971 | ], 972 | "text/plain": [ 973 | " sex smoker total_bill tip\n", 974 | "0 Male Yes 1337.07 183.07\n", 975 | "1 Male No 1919.75 302.00" 976 | ] 977 | }, 978 | "execution_count": 32, 979 | "metadata": {}, 980 | "output_type": "execute_result" 981 | } 982 | ], 983 | "source": [ 984 | "pd.merge(\n", 985 | " tips_bill.reset_index(), \n", 986 | " tips_tip.reset_index().head(2),\n", 987 | " how='inner'\n", 988 | ")" 989 | ] 990 | }, 991 | { 992 | "cell_type": "code", 993 | "execution_count": 36, 994 | "metadata": {}, 995 | "outputs": [ 996 | { 997 | "data": { 998 | "text/html": [ 999 | "
\n", 1000 | "\n", 1013 | "\n", 1014 | " \n", 1015 | " \n", 1016 | " \n", 1017 | " \n", 1018 | " \n", 1019 | " \n", 1020 | " \n", 1021 | " \n", 1022 | " \n", 1023 | " \n", 1024 | " \n", 1025 | " \n", 1026 | " \n", 1027 | " \n", 1028 | " \n", 1029 | " \n", 1030 | " \n", 1031 | " \n", 1032 | " \n", 1033 | " \n", 1034 | " \n", 1035 | " \n", 1036 | " \n", 1037 | " \n", 1038 | " \n", 1039 | " \n", 1040 | " \n", 1041 | " \n", 1042 | " \n", 1043 | " \n", 1044 | " \n", 1045 | " \n", 1046 | " \n", 1047 | " \n", 1048 | " \n", 1049 | " \n", 1050 | " \n", 1051 | " \n", 1052 | " \n", 1053 | " \n", 1054 | " \n", 1055 | " \n", 1056 | " \n", 1057 | " \n", 1058 | "
   sex     smoker  total_bill  tip     _merge
0  Male    No      1919.75     302.00  both
1  Female  Yes     593.27      96.74   both
2  Female  No      977.68      NaN     left_only
3  Male    Yes     NaN         183.07  right_only
\n", 1059 | "
" 1060 | ], 1061 | "text/plain": [ 1062 | " sex smoker total_bill tip _merge\n", 1063 | "0 Male No 1919.75 302.00 both\n", 1064 | "1 Female Yes 593.27 96.74 both\n", 1065 | "2 Female No 977.68 NaN left_only\n", 1066 | "3 Male Yes NaN 183.07 right_only" 1067 | ] 1068 | }, 1069 | "execution_count": 36, 1070 | "metadata": {}, 1071 | "output_type": "execute_result" 1072 | } 1073 | ], 1074 | "source": [ 1075 | "# and if you add an indicator...\n", 1076 | "pd.merge(\n", 1077 | " tips_bill.reset_index().tail(3), \n", 1078 | " tips_tip.reset_index().head(3),\n", 1079 | " how='outer',\n", 1080 | " indicator=True\n", 1081 | ")" 1082 | ] 1083 | }, 1084 | { 1085 | "cell_type": "code", 1086 | "execution_count": 35, 1087 | "metadata": {}, 1088 | "outputs": [ 1089 | { 1090 | "data": { 1091 | "text/html": [ 1092 | "
\n", 1093 | "\n", 1106 | "\n", 1107 | " \n", 1108 | " \n", 1109 | " \n", 1110 | " \n", 1111 | " \n", 1112 | " \n", 1113 | " \n", 1114 | " \n", 1115 | " \n", 1116 | " \n", 1117 | " \n", 1118 | " \n", 1119 | " \n", 1120 | " \n", 1121 | " \n", 1122 | " \n", 1123 | " \n", 1124 | " \n", 1125 | " \n", 1126 | " \n", 1127 | " \n", 1128 | " \n", 1129 | " \n", 1130 | " \n", 1131 | " \n", 1132 | " \n", 1133 | " \n", 1134 | " \n", 1135 | " \n", 1136 | " \n", 1137 | " \n", 1138 | " \n", 1139 | " \n", 1140 | " \n", 1141 | " \n", 1142 | " \n", 1143 | " \n", 1144 | " \n", 1145 | "
                total_bill_left  total_bill_right
sex     smoker
Male    Yes     1337.07          1337.07
        No      1919.75          1919.75
Female  Yes     593.27           593.27
        No      977.68           977.68
\n", 1146 | "
" 1147 | ], 1148 | "text/plain": [ 1149 | " total_bill_left total_bill_right\n", 1150 | "sex smoker \n", 1151 | "Male Yes 1337.07 1337.07\n", 1152 | " No 1919.75 1919.75\n", 1153 | "Female Yes 593.27 593.27\n", 1154 | " No 977.68 977.68" 1155 | ] 1156 | }, 1157 | "execution_count": 35, 1158 | "metadata": {}, 1159 | "output_type": "execute_result" 1160 | } 1161 | ], 1162 | "source": [ 1163 | "# it can handle columns with the same name\n", 1164 | "pd.merge(tips_bill, \n", 1165 | " tips_bill, \n", 1166 | " right_index=True, \n", 1167 | " left_index=True,\n", 1168 | " suffixes=('_left', '_right')\n", 1169 | ")" 1170 | ] 1171 | }, 1172 | { 1173 | "cell_type": "markdown", 1174 | "metadata": {}, 1175 | "source": [ 1176 | "This is one of the most complex parts of pandas - but it is very important to master. So please do check out the exercises below!\n", 1177 | "\n", 1178 | "One thing to be careful with here is merging on keys of different data types. Strings are not equal to ints, so a string key will never match an int key!" 1179 | ] 1180 | }, 1181 | { 1182 | "cell_type": "markdown", 1183 | "metadata": {}, 1184 | "source": [ 1185 | "## Concatenation\n", 1186 | "\n", 1187 | "Concatenating is for combining two or more dataframes either column-wise or row-wise. The problem with concatenate is that the combinations it allows you to do are rather simplistic. That's why we need merge. \n", 1188 | "\n", 1189 | "Concatenate can take as many dataframes as you want, but it requires that they are specifically constructed. Concat aligns on the index, so for a column-wise combination all of the dataframes you pass in will need to have the same index. So no more matching on columns. \n", 1190 | "\n", 1191 | "Let's check out basic use below:" 1192 | ] 1193 | }, 1194 | { 1195 | "cell_type": "code", 1196 | "execution_count": 8, 1197 | "metadata": {}, 1198 | "outputs": [ 1199 | { 1200 | "data": { 1201 | "text/html": [ 1202 | "
\n", 1203 | "\n", 1216 | "\n", 1217 | " \n", 1218 | " \n", 1219 | " \n", 1220 | " \n", 1221 | " \n", 1222 | " \n", 1223 | " \n", 1224 | " \n", 1225 | " \n", 1226 | " \n", 1227 | " \n", 1228 | " \n", 1229 | " \n", 1230 | " \n", 1231 | " \n", 1232 | " \n", 1233 | " \n", 1234 | " \n", 1235 | " \n", 1236 | " \n", 1237 | " \n", 1238 | " \n", 1239 | " \n", 1240 | " \n", 1241 | " \n", 1242 | " \n", 1243 | " \n", 1244 | " \n", 1245 | " \n", 1246 | " \n", 1247 | " \n", 1248 | " \n", 1249 | " \n", 1250 | " \n", 1251 | " \n", 1252 | " \n", 1253 | " \n", 1254 | " \n", 1255 | " \n", 1256 | " \n", 1257 | " \n", 1258 | " \n", 1259 | " \n", 1260 | " \n", 1261 | " \n", 1262 | " \n", 1263 | " \n", 1264 | " \n", 1265 | " \n", 1266 | " \n", 1267 | " \n", 1268 | " \n", 1269 | " \n", 1270 | " \n", 1271 | " \n", 1272 | " \n", 1273 | " \n", 1274 | " \n", 1275 | " \n", 1276 | " \n", 1277 | " \n", 1278 | " \n", 1279 | " \n", 1280 | " \n", 1281 | " \n", 1282 | " \n", 1283 | " \n", 1284 | " \n", 1285 | " \n", 1286 | " \n", 1287 | " \n", 1288 | " \n", 1289 | " \n", 1290 | " \n", 1291 | " \n", 1292 | " \n", 1293 | " \n", 1294 | " \n", 1295 | " \n", 1296 | " \n", 1297 | " \n", 1298 | " \n", 1299 | "
                total_bill  tip
sex     smoker
Male    Yes     1337.07     NaN
        No      1919.75     NaN
Female  Yes     593.27      NaN
        No      977.68      NaN
Male    Yes     1337.07     NaN
        No      1919.75     NaN
Female  Yes     593.27      NaN
        No      977.68      NaN
Male    Yes     NaN         183.07
        No      NaN         302.00
Female  Yes     NaN         96.74
        No      NaN         149.77
\n", 1300 | "
" 1301 | ], 1302 | "text/plain": [ 1303 | " total_bill tip\n", 1304 | "sex smoker \n", 1305 | "Male Yes 1337.07 NaN\n", 1306 | " No 1919.75 NaN\n", 1307 | "Female Yes 593.27 NaN\n", 1308 | " No 977.68 NaN\n", 1309 | "Male Yes 1337.07 NaN\n", 1310 | " No 1919.75 NaN\n", 1311 | "Female Yes 593.27 NaN\n", 1312 | " No 977.68 NaN\n", 1313 | "Male Yes NaN 183.07\n", 1314 | " No NaN 302.00\n", 1315 | "Female Yes NaN 96.74\n", 1316 | " No NaN 149.77" 1317 | ] 1318 | }, 1319 | "execution_count": 8, 1320 | "metadata": {}, 1321 | "output_type": "execute_result" 1322 | } 1323 | ], 1324 | "source": [ 1325 | "# this adds the dataframes together row wise\n", 1326 | "pd.concat([tips_bill, tips_bill, tips_tip], sort=False)" 1327 | ] 1328 | }, 1329 | { 1330 | "cell_type": "code", 1331 | "execution_count": 9, 1332 | "metadata": {}, 1333 | "outputs": [ 1334 | { 1335 | "data": { 1336 | "text/html": [ 1337 | "
\n", 1338 | "\n", 1351 | "\n", 1352 | " \n", 1353 | " \n", 1354 | " \n", 1355 | " \n", 1356 | " \n", 1357 | " \n", 1358 | " \n", 1359 | " \n", 1360 | " \n", 1361 | " \n", 1362 | " \n", 1363 | " \n", 1364 | " \n", 1365 | " \n", 1366 | " \n", 1367 | " \n", 1368 | " \n", 1369 | " \n", 1370 | " \n", 1371 | " \n", 1372 | " \n", 1373 | " \n", 1374 | " \n", 1375 | " \n", 1376 | " \n", 1377 | " \n", 1378 | " \n", 1379 | " \n", 1380 | " \n", 1381 | " \n", 1382 | " \n", 1383 | " \n", 1384 | " \n", 1385 | " \n", 1386 | " \n", 1387 | " \n", 1388 | " \n", 1389 | " \n", 1390 | "
                total_bill  tip
sex     smoker
Male    Yes     1337.07     183.07
        No      1919.75     302.00
Female  Yes     593.27      96.74
        No      977.68      149.77
\n", 1391 | "
" 1392 | ], 1393 | "text/plain": [ 1394 | " total_bill tip\n", 1395 | "sex smoker \n", 1396 | "Male Yes 1337.07 183.07\n", 1397 | " No 1919.75 302.00\n", 1398 | "Female Yes 593.27 96.74\n", 1399 | " No 977.68 149.77" 1400 | ] 1401 | }, 1402 | "execution_count": 9, 1403 | "metadata": {}, 1404 | "output_type": "execute_result" 1405 | } 1406 | ], 1407 | "source": [ 1408 | "# this does it column wise\n", 1409 | "pd.concat([tips_bill, tips_tip], axis=1)" 1410 | ] 1411 | }, 1412 | { 1413 | "cell_type": "code", 1414 | "execution_count": 10, 1415 | "metadata": {}, 1416 | "outputs": [ 1417 | { 1418 | "data": { 1419 | "text/html": [ 1420 | "
\n", 1421 | "\n", 1434 | "\n", 1435 | " \n", 1436 | " \n", 1437 | " \n", 1438 | " \n", 1439 | " \n", 1440 | " \n", 1441 | " \n", 1442 | " \n", 1443 | " \n", 1444 | " \n", 1445 | " \n", 1446 | " \n", 1447 | " \n", 1448 | " \n", 1449 | " \n", 1450 | " \n", 1451 | " \n", 1452 | " \n", 1453 | " \n", 1454 | " \n", 1455 | " \n", 1456 | " \n", 1457 | " \n", 1458 | " \n", 1459 | " \n", 1460 | " \n", 1461 | " \n", 1462 | " \n", 1463 | " \n", 1464 | " \n", 1465 | " \n", 1466 | " \n", 1467 | " \n", 1468 | " \n", 1469 | " \n", 1470 | " \n", 1471 | " \n", 1472 | " \n", 1473 | " \n", 1474 | " \n", 1475 | " \n", 1476 | " \n", 1477 | " \n", 1478 | " \n", 1479 | " \n", 1480 | " \n", 1481 | " \n", 1482 | " \n", 1483 | " \n", 1484 | " \n", 1485 | " \n", 1486 | " \n", 1487 | " \n", 1488 | " \n", 1489 | " \n", 1490 | " \n", 1491 | " \n", 1492 | " \n", 1493 | " \n", 1494 | " \n", 1495 | " \n", 1496 | " \n", 1497 | " \n", 1498 | " \n", 1499 | "
                      total_bill  tip
      sex     smoker
num0  Male    Yes     1337.07     NaN
              No      1919.75     NaN
      Female  Yes     593.27      NaN
              No      977.68      NaN
num1  Male    Yes     NaN         183.07
              No      NaN         302.00
      Female  Yes     NaN         96.74
              No      NaN         149.77
\n", 1500 | "
" 1501 | ], 1502 | "text/plain": [ 1503 | " total_bill tip\n", 1504 | " sex smoker \n", 1505 | "num0 Male Yes 1337.07 NaN\n", 1506 | " No 1919.75 NaN\n", 1507 | " Female Yes 593.27 NaN\n", 1508 | " No 977.68 NaN\n", 1509 | "num1 Male Yes NaN 183.07\n", 1510 | " No NaN 302.00\n", 1511 | " Female Yes NaN 96.74\n", 1512 | " No NaN 149.77" 1513 | ] 1514 | }, 1515 | "execution_count": 10, 1516 | "metadata": {}, 1517 | "output_type": "execute_result" 1518 | } 1519 | ], 1520 | "source": [ 1521 | "# and finally this will add on the dataset where it's from\n", 1522 | "pd.concat([tips_bill, tips_tip], sort=False, keys=['num0', 'num1'])" 1523 | ] 1524 | }, 1525 | { 1526 | "cell_type": "markdown", 1527 | "metadata": {}, 1528 | "source": [ 1529 | "As you can see there is not a ton of functionality to concat, but it is invaluable if you have more than one dataframe or you are looking to append the rows of one dataframe onto another." 1530 | ] 1531 | }, 1532 | { 1533 | "cell_type": "markdown", 1534 | "metadata": {}, 1535 | "source": [ 1536 | "## Conclusion\n", 1537 | "\n", 1538 | "There are a couple of other ways to merge data, but they are pretty niche (and mainly for time series data). If y'all have a desire for me to go over them then comment below!\n", 1539 | "\n", 1540 | "They are:\n", 1541 | "\n", 1542 | "* combine_first\n", 1543 | "* merge_ordered\n", 1544 | "* merge_asof\n", 1545 | "\n", 1546 | "Otherwise you should be fully equipped to do the [exercises](https://github.com/guipsamora/pandas_exercises#merge). These functions require a bit of practice to get used to, so don't be discouraged if it takes some time." 
1547 | ] 1548 | }, 1549 | { 1550 | "cell_type": "code", 1551 | "execution_count": null, 1552 | "metadata": {}, 1553 | "outputs": [], 1554 | "source": [] 1555 | } 1556 | ], 1557 | "metadata": { 1558 | "kernelspec": { 1559 | "display_name": "Python 3", 1560 | "language": "python", 1561 | "name": "python3" 1562 | }, 1563 | "language_info": { 1564 | "codemirror_mode": { 1565 | "name": "ipython", 1566 | "version": 3 1567 | }, 1568 | "file_extension": ".py", 1569 | "mimetype": "text/x-python", 1570 | "name": "python", 1571 | "nbconvert_exporter": "python", 1572 | "pygments_lexer": "ipython3", 1573 | "version": "3.7.3" 1574 | } 1575 | }, 1576 | "nbformat": 4, 1577 | "nbformat_minor": 2 1578 | } 1579 | -------------------------------------------------------------------------------- /notebooks/Group Operations.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 3, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import seaborn as sns\n", 10 | "import pandas as pd\n", 11 | "import numpy as np" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "# Pandas Group Operations\n", 19 | "\n", 20 | "Let's next go over grouped operations with pandas. This section of the pandas library does not have as much feature bloat as other parts, which is nice. And the community is starting to narrow around a couple of operations that are core to grouped operations. 
We'll be going over these operations with particular emphasis on groupby and agg:\n", 21 | "\n", 22 | "* groupby\n", 23 | "* agg\n", 24 | "* filter\n", 25 | "* transform\n", 26 | "\n", 27 | "Check out the full documentation [here](http://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html), but be warned it is a bit long :)" 28 | ] 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "metadata": {}, 33 | "source": [ 34 | "Let's start with our good old tips dataset:" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": 4, 40 | "metadata": {}, 41 | "outputs": [ 42 | { 43 | "data": { 44 | "text/html": [ 45 | "
\n", 46 | "\n", 59 | "\n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | "
total_billtipsexsmokerdaytimesize
016.991.01FemaleNoSunDinner2
110.341.66MaleNoSunDinner3
221.013.50MaleNoSunDinner3
\n", 105 | "
" 106 | ], 107 | "text/plain": [ 108 | " total_bill tip sex smoker day time size\n", 109 | "0 16.99 1.01 Female No Sun Dinner 2\n", 110 | "1 10.34 1.66 Male No Sun Dinner 3\n", 111 | "2 21.01 3.50 Male No Sun Dinner 3" 112 | ] 113 | }, 114 | "execution_count": 4, 115 | "metadata": {}, 116 | "output_type": "execute_result" 117 | } 118 | ], 119 | "source": [ 120 | "tips = sns.load_dataset('tips')\n", 121 | "tips.head(3)" 122 | ] 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "metadata": {}, 127 | "source": [ 128 | "### Groupby\n", 129 | "\n", 130 | "A grouped operation starts by specifying which groups of data we want to operate over. There are many ways of making groups, but the tool that pandas uses to make groups of data is `groupby`." 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": 5, 136 | "metadata": {}, 137 | "outputs": [ 138 | { 139 | "data": { 140 | "text/plain": [ 141 | "" 142 | ] 143 | }, 144 | "execution_count": 5, 145 | "metadata": {}, 146 | "output_type": "execute_result" 147 | } 148 | ], 149 | "source": [ 150 | "tips_gb = tips.groupby(['sex', 'smoker'])\n", 151 | "tips_gb" 152 | ] 153 | }, 154 | { 155 | "cell_type": "markdown", 156 | "metadata": {}, 157 | "source": [ 158 | "Groupby works by giving pandas one or more columns. Pandas will look in your data and find every unique combination of the columns that you specify. Each unique combination is a group. So in this case we will have four groups: male smoker, female smoker, male non-smoker, female non-smoker.\n", 159 | "\n", 160 | "The groupby object by itself is not super important." 161 | ] 162 | }, 163 | { 164 | "cell_type": "markdown", 165 | "metadata": {}, 166 | "source": [ 167 | "Once we have these groups (specified in the groupby object), we can do three types of operations on them (with the most important being agg).\n", 168 | "\n", 169 | "### Agg\n", 170 | "\n", 171 | "The aggregate operation collapses the data in each group into a single value. 
You pass a dictionary that maps each column to the aggregation(s) you'd like. In the example below we ask for the mean and the min of the tip column, the first day, and the number of rows (via `size` on total_bill) for each group:" 172 | ] 173 | }, 174 | { 175 | "cell_type": "code", 176 | "execution_count": 6, 177 | "metadata": {}, 178 | "outputs": [ 179 | { 180 | "data": { 181 | "text/html": [ 182 | "
\n", 183 | "\n", 200 | "\n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | "
tipdaytotal_bill
meanminfirstsize
sexsmoker
MaleYes3.0511671.00Sat60
No3.1134021.25Sun97
FemaleYes2.9315151.00Sat33
No2.7735191.00Sun54
\n", 259 | "
" 260 | ], 261 | "text/plain": [ 262 | " tip day total_bill\n", 263 | " mean min first size\n", 264 | "sex smoker \n", 265 | "Male Yes 3.051167 1.00 Sat 60\n", 266 | " No 3.113402 1.25 Sun 97\n", 267 | "Female Yes 2.931515 1.00 Sat 33\n", 268 | " No 2.773519 1.00 Sun 54" 269 | ] 270 | }, 271 | "execution_count": 6, 272 | "metadata": {}, 273 | "output_type": "execute_result" 274 | } 275 | ], 276 | "source": [ 277 | "tips_agg = tips_gb.agg({\n", 278 | " 'tip': ['mean', 'min'],\n", 279 | " 'day': 'first',\n", 280 | " 'total_bill': 'size'\n", 281 | "})\n", 282 | "\n", 283 | "tips_agg" 284 | ] 285 | }, 286 | { 287 | "cell_type": "markdown", 288 | "metadata": {}, 289 | "source": [ 290 | "Notice that we get a multi-index for both the index and the columns. We can always get rid of the row multi-index with `reset_index` (see [indexing and selecting](https://github.com/knathanieltucker/pandas-tutorial/blob/master/notebooks/Indexing%20and%20Selecting.ipynb) for more details):" 291 | ] 292 | }, 293 | { 294 | "cell_type": "code", 295 | "execution_count": 7, 296 | "metadata": {}, 297 | "outputs": [ 298 | { 299 | "data": { 300 | "text/html": [ 301 | "
\n", 302 | "\n", 315 | "\n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | "
sexsmokertipdaytotal_bill
meanminfirstsize
0MaleYes3.0511671.00Sat60
1MaleNo3.1134021.25Sun97
2FemaleYes2.9315151.00Sat33
3FemaleNo2.7735191.00Sun54
\n", 374 | "
" 375 | ], 376 | "text/plain": [ 377 | " sex smoker tip day total_bill\n", 378 | " mean min first size\n", 379 | "0 Male Yes 3.051167 1.00 Sat 60\n", 380 | "1 Male No 3.113402 1.25 Sun 97\n", 381 | "2 Female Yes 2.931515 1.00 Sat 33\n", 382 | "3 Female No 2.773519 1.00 Sun 54" 383 | ] 384 | }, 385 | "execution_count": 7, 386 | "metadata": {}, 387 | "output_type": "execute_result" 388 | } 389 | ], 390 | "source": [ 391 | "tips_agg.reset_index()" 392 | ] 393 | }, 394 | { 395 | "cell_type": "markdown", 396 | "metadata": {}, 397 | "source": [ 398 | "And we can either use stacking or our column trick to get rid of the column nonsense:" 399 | ] 400 | }, 401 | { 402 | "cell_type": "code", 403 | "execution_count": 8, 404 | "metadata": {}, 405 | "outputs": [ 406 | { 407 | "data": { 408 | "text/plain": [ 409 | "MultiIndex(levels=[['tip', 'day', 'total_bill'], ['first', 'mean', 'min', 'size']],\n", 410 | " codes=[[0, 0, 1, 2], [1, 2, 0, 3]])" 411 | ] 412 | }, 413 | "execution_count": 8, 414 | "metadata": {}, 415 | "output_type": "execute_result" 416 | } 417 | ], 418 | "source": [ 419 | "# before\n", 420 | "tips_agg.columns" 421 | ] 422 | }, 423 | { 424 | "cell_type": "code", 425 | "execution_count": 9, 426 | "metadata": {}, 427 | "outputs": [ 428 | { 429 | "data": { 430 | "text/html": [ 431 | "
\n", 432 | "\n", 445 | "\n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | " \n", 486 | " \n", 487 | " \n", 488 | " \n", 489 | " \n", 490 | " \n", 491 | " \n", 492 | " \n", 493 | " \n", 494 | " \n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | " \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " \n", 506 | " \n", 507 | " \n", 508 | " \n", 509 | " \n", 510 | " \n", 511 | " \n", 512 | " \n", 513 | " \n", 514 | " \n", 515 | " \n", 516 | " \n", 517 | " \n", 518 | " \n", 519 | " \n", 520 | " \n", 521 | " \n", 522 | " \n", 523 | " \n", 524 | " \n", 525 | " \n", 526 | " \n", 527 | " \n", 528 | " \n", 529 | " \n", 530 | " \n", 531 | " \n", 532 | " \n", 533 | " \n", 534 | " \n", 535 | " \n", 536 | " \n", 537 | " \n", 538 | " \n", 539 | " \n", 540 | " \n", 541 | " \n", 542 | " \n", 543 | " \n", 544 | " \n", 545 | " \n", 546 | " \n", 547 | " \n", 548 | " \n", 549 | " \n", 550 | " \n", 551 | " \n", 552 | " \n", 553 | " \n", 554 | " \n", 555 | " \n", 556 | " \n", 557 | " \n", 558 | " \n", 559 | " \n", 560 | " \n", 561 | " \n", 562 | " \n", 563 | " \n", 564 | " \n", 565 | " \n", 566 | " \n", 567 | " \n", 568 | "
tipdaytotal_bill
sexsmoker
MaleYesfirstNaNSatNaN
mean3.051167NaNNaN
min1.000000NaNNaN
sizeNaNNaN60.0
NofirstNaNSunNaN
mean3.113402NaNNaN
min1.250000NaNNaN
sizeNaNNaN97.0
FemaleYesfirstNaNSatNaN
mean2.931515NaNNaN
min1.000000NaNNaN
sizeNaNNaN33.0
NofirstNaNSunNaN
mean2.773519NaNNaN
min1.000000NaNNaN
sizeNaNNaN54.0
\n", 569 | "
" 570 | ], 571 | "text/plain": [ 572 | " tip day total_bill\n", 573 | "sex smoker \n", 574 | "Male Yes first NaN Sat NaN\n", 575 | " mean 3.051167 NaN NaN\n", 576 | " min 1.000000 NaN NaN\n", 577 | " size NaN NaN 60.0\n", 578 | " No first NaN Sun NaN\n", 579 | " mean 3.113402 NaN NaN\n", 580 | " min 1.250000 NaN NaN\n", 581 | " size NaN NaN 97.0\n", 582 | "Female Yes first NaN Sat NaN\n", 583 | " mean 2.931515 NaN NaN\n", 584 | " min 1.000000 NaN NaN\n", 585 | " size NaN NaN 33.0\n", 586 | " No first NaN Sun NaN\n", 587 | " mean 2.773519 NaN NaN\n", 588 | " min 1.000000 NaN NaN\n", 589 | " size NaN NaN 54.0" 590 | ] 591 | }, 592 | "execution_count": 9, 593 | "metadata": {}, 594 | "output_type": "execute_result" 595 | } 596 | ], 597 | "source": [ 598 | "tips_agg.stack()" 599 | ] 600 | }, 601 | { 602 | "cell_type": "code", 603 | "execution_count": 10, 604 | "metadata": {}, 605 | "outputs": [ 606 | { 607 | "data": { 608 | "text/plain": [ 609 | "Index(['tip__mean', 'tip__min', 'day__first', 'total_bill__size'], dtype='object')" 610 | ] 611 | }, 612 | "execution_count": 10, 613 | "metadata": {}, 614 | "output_type": "execute_result" 615 | } 616 | ], 617 | "source": [ 618 | "tips_agg.columns = ['__'.join(col).strip() for col in tips_agg.columns.values]\n", 619 | "tips_agg.columns" 620 | ] 621 | }, 622 | { 623 | "cell_type": "code", 624 | "execution_count": 11, 625 | "metadata": {}, 626 | "outputs": [ 627 | { 628 | "data": { 629 | "text/html": [ 630 | "
\n", 631 | "\n", 644 | "\n", 645 | " \n", 646 | " \n", 647 | " \n", 648 | " \n", 649 | " \n", 650 | " \n", 651 | " \n", 652 | " \n", 653 | " \n", 654 | " \n", 655 | " \n", 656 | " \n", 657 | " \n", 658 | " \n", 659 | " \n", 660 | " \n", 661 | " \n", 662 | " \n", 663 | " \n", 664 | " \n", 665 | " \n", 666 | " \n", 667 | " \n", 668 | " \n", 669 | " \n", 670 | " \n", 671 | " \n", 672 | " \n", 673 | " \n", 674 | " \n", 675 | " \n", 676 | " \n", 677 | " \n", 678 | " \n", 679 | " \n", 680 | " \n", 681 | " \n", 682 | " \n", 683 | " \n", 684 | " \n", 685 | " \n", 686 | " \n", 687 | " \n", 688 | " \n", 689 | " \n", 690 | " \n", 691 | " \n", 692 | " \n", 693 | " \n", 694 | " \n", 695 | "
tip__meantip__minday__firsttotal_bill__size
sexsmoker
MaleYes3.0511671.00Sat60
No3.1134021.25Sun97
FemaleYes2.9315151.00Sat33
No2.7735191.00Sun54
\n", 696 | "
" 697 | ], 698 | "text/plain": [ 699 | " tip__mean tip__min day__first total_bill__size\n", 700 | "sex smoker \n", 701 | "Male Yes 3.051167 1.00 Sat 60\n", 702 | " No 3.113402 1.25 Sun 97\n", 703 | "Female Yes 2.931515 1.00 Sat 33\n", 704 | " No 2.773519 1.00 Sun 54" 705 | ] 706 | }, 707 | "execution_count": 11, 708 | "metadata": {}, 709 | "output_type": "execute_result" 710 | } 711 | ], 712 | "source": [ 713 | "tips_agg" 714 | ] 715 | }, 716 | { 717 | "cell_type": "markdown", 718 | "metadata": {}, 719 | "source": [ 720 | "That is about it for aggregation. You can find some common aggregation functions listed [here](http://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#aggregation)." 721 | ] 722 | }, 723 | { 724 | "cell_type": "markdown", 725 | "metadata": {}, 726 | "source": [ 727 | "### Filter\n", 728 | "\n", 729 | "The next common group operation is a filter. This one is pretty simple: we drop every group that doesn't meet our criteria, and the rows of the groups that pass are returned unchanged.\n", 730 | "\n", 731 | "For example, let's only look at the least busy times the place is open. One way to do that is to keep only the day/time groups whose total party size is below the median:" 732 | ] 733 | }, 734 | { 735 | "cell_type": "code", 736 | "execution_count": 53, 737 | "metadata": {}, 738 | "outputs": [], 739 | "source": [ 740 | "# we use the exact same groupby syntax\n", 741 | "tips_gb = tips.groupby(['day', 'time'])" 742 | ] 743 | }, 744 | { 745 | "cell_type": "code", 746 | "execution_count": 54, 747 | "metadata": {}, 748 | "outputs": [], 749 | "source": [ 750 | "median_size = tips_gb.agg({'size': 'sum'}).median().iloc[0]" 751 | ] 752 | }, 753 | { 754 | "cell_type": "code", 755 | "execution_count": 56, 756 | "metadata": {}, 757 | "outputs": [ 758 | { 759 | "data": { 760 | "text/html": [ 761 | "
\n", 762 | "\n", 775 | "\n", 776 | " \n", 777 | " \n", 778 | " \n", 779 | " \n", 780 | " \n", 781 | " \n", 782 | " \n", 783 | " \n", 784 | " \n", 785 | " \n", 786 | " \n", 787 | " \n", 788 | " \n", 789 | " \n", 790 | " \n", 791 | " \n", 792 | " \n", 793 | " \n", 794 | " \n", 795 | " \n", 796 | " \n", 797 | " \n", 798 | " \n", 799 | " \n", 800 | " \n", 801 | " \n", 802 | " \n", 803 | " \n", 804 | " \n", 805 | " \n", 806 | " \n", 807 | " \n", 808 | " \n", 809 | " \n", 810 | " \n", 811 | " \n", 812 | " \n", 813 | " \n", 814 | " \n", 815 | " \n", 816 | " \n", 817 | " \n", 818 | " \n", 819 | " \n", 820 | " \n", 821 | " \n", 822 | " \n", 823 | " \n", 824 | " \n", 825 | " \n", 826 | " \n", 827 | " \n", 828 | " \n", 829 | " \n", 830 | " \n", 831 | " \n", 832 | " \n", 833 | " \n", 834 | " \n", 835 | " \n", 836 | " \n", 837 | " \n", 838 | " \n", 839 | " \n", 840 | "
total_billtipsexsmokerdaytimesize
9028.973.00MaleYesFriDinner2
9122.493.50MaleNoFriDinner2
925.751.00FemaleYesFriDinner2
9316.324.30FemaleYesFriDinner2
9422.753.25FemaleNoFriDinner2
\n", 841 | "
" 842 | ], 843 | "text/plain": [ 844 | " total_bill tip sex smoker day time size\n", 845 | "90 28.97 3.00 Male Yes Fri Dinner 2\n", 846 | "91 22.49 3.50 Male No Fri Dinner 2\n", 847 | "92 5.75 1.00 Female Yes Fri Dinner 2\n", 848 | "93 16.32 4.30 Female Yes Fri Dinner 2\n", 849 | "94 22.75 3.25 Female No Fri Dinner 2" 850 | ] 851 | }, 852 | "execution_count": 56, 853 | "metadata": {}, 854 | "output_type": "execute_result" 855 | } 856 | ], 857 | "source": [ 858 | "# notice that we carved out quite a few rows (the result starts at row 90)\n", 859 | "tips_gb.filter(lambda group: group['size'].sum() < median_size).head()" 860 | ] 861 | }, 862 | { 863 | "cell_type": "markdown", 864 | "metadata": {}, 865 | "source": [ 866 | "That's honestly about it. I don't use this functionality too much, but it's pretty simple and I don't think it complicates things too much, so may as well throw it in." 867 | ] 868 | }, 869 | { 870 | "cell_type": "markdown", 871 | "metadata": {}, 872 | "source": [ 873 | "### Transform\n", 874 | "\n", 875 | "The final group operation is transform. This uses group information to apply transformations to individual data points. For example, below we divide each bill and tip by the average bill and tip for that day. That way we can see how much each bill differs from the average for its day." 876 | ] 877 | }, 878 | { 879 | "cell_type": "code", 880 | "execution_count": 57, 881 | "metadata": {}, 882 | "outputs": [], 883 | "source": [ 884 | "tips_gb = tips.groupby(['day'])" 885 | ] 886 | }, 887 | { 888 | "cell_type": "code", 889 | "execution_count": 58, 890 | "metadata": {}, 891 | "outputs": [ 892 | { 893 | "data": { 894 | "text/html": [ 895 | "
\n", 896 | "\n", 909 | "\n", 910 | " \n", 911 | " \n", 912 | " \n", 913 | " \n", 914 | " \n", 915 | " \n", 916 | " \n", 917 | " \n", 918 | " \n", 919 | " \n", 920 | " \n", 921 | " \n", 922 | " \n", 923 | " \n", 924 | " \n", 925 | " \n", 926 | " \n", 927 | " \n", 928 | " \n", 929 | " \n", 930 | " \n", 931 | " \n", 932 | " \n", 933 | " \n", 934 | " \n", 935 | " \n", 936 | " \n", 937 | " \n", 938 | " \n", 939 | " \n", 940 | " \n", 941 | " \n", 942 | " \n", 943 | " \n", 944 | "
total_billtip
00.7935540.310279
10.4829520.509964
20.9813171.075225
31.1060251.016856
41.1485291.109018
\n", 945 | "
" 946 | ], 947 | "text/plain": [ 948 | " total_bill tip\n", 949 | "0 0.793554 0.310279\n", 950 | "1 0.482952 0.509964\n", 951 | "2 0.981317 1.075225\n", 952 | "3 1.106025 1.016856\n", 953 | "4 1.148529 1.109018" 954 | ] 955 | }, 956 | "execution_count": 58, 957 | "metadata": {}, 958 | "output_type": "execute_result" 959 | } 960 | ], 961 | "source": [ 962 | "tips_gb[['total_bill', 'tip']].transform(lambda x: x / x.mean()).head()" 963 | ] 964 | }, 965 | { 966 | "cell_type": "markdown", 967 | "metadata": {}, 968 | "source": [ 969 | "I think I have only ever used this function for normalization, but it is pretty straightforward and intuitive, so I'm fine with the added flexibility." 970 | ] 971 | }, 972 | { 973 | "cell_type": "markdown", 974 | "metadata": {}, 975 | "source": [ 976 | "## Conclusion\n", 977 | "\n", 978 | "This is about it for understanding pandas group operations. As always, check out some of the [exercises on this topic](https://github.com/guipsamora/pandas_exercises#grouping); you should be able to do them with ease.\n", 979 | "\n", 980 | "As a final note, understanding the groupby and agg functions is critical to using pandas effectively. Transform and filter are nice, but you could probably get by without them."
981 | ] 982 | }, 983 | { 984 | "cell_type": "code", 985 | "execution_count": null, 986 | "metadata": {}, 987 | "outputs": [], 988 | "source": [] 989 | } 990 | ], 991 | "metadata": { 992 | "kernelspec": { 993 | "display_name": "Python 3", 994 | "language": "python", 995 | "name": "python3" 996 | }, 997 | "language_info": { 998 | "codemirror_mode": { 999 | "name": "ipython", 1000 | "version": 3 1001 | }, 1002 | "file_extension": ".py", 1003 | "mimetype": "text/x-python", 1004 | "name": "python", 1005 | "nbconvert_exporter": "python", 1006 | "pygments_lexer": "ipython3", 1007 | "version": "3.7.3" 1008 | } 1009 | }, 1010 | "nbformat": 4, 1011 | "nbformat_minor": 2 1012 | } 1013 | -------------------------------------------------------------------------------- /notebooks/Indexing and Selecting.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import seaborn as sns\n", 10 | "import pandas as pd\n", 11 | "import numpy as np" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "# Pandas Indexing and Selecting\n", 19 | "\n", 20 | "Let's talk about slicing and dicing pandas data. We are going to be going over four topics today:\n", 21 | "\n", 22 | "* Review the basics\n", 23 | "* Multi-index\n", 24 | "* Getting Single Values\n", 25 | "* Pointing out some stuff you don't need to worry about\n", 26 | "\n", 27 | "As always you can check out the full documentation: [basic indexing](http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html) and [advanced indexing](http://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html). 
But be warned that they are very long and tell you way more than you'd need to know :)" 28 | ] 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "metadata": {}, 33 | "source": [ 34 | "## Review the Basics\n", 35 | "\n", 36 | "First let's start with a bit of a recap on traditional indexing and selection. (We went over most of this in the [pandas fundamentals](https://github.com/knathanieltucker/pandas-tutorial/blob/master/notebooks/Pandas%20Intro%20to%20Data%20Structures.ipynb)). To start off with, here is the data we are going to be working with (good old tips data):" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": 2, 42 | "metadata": {}, 43 | "outputs": [ 44 | { 45 | "data": { 46 | "text/html": [ 47 | "
\n", 48 | "\n", 61 | "\n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | "
total_billtipsexsmokerdaytimesize
016.991.01FemaleNoSunDinner2
110.341.66MaleNoSunDinner3
221.013.50MaleNoSunDinner3
\n", 107 | "
" 108 | ], 109 | "text/plain": [ 110 | " total_bill tip sex smoker day time size\n", 111 | "0 16.99 1.01 Female No Sun Dinner 2\n", 112 | "1 10.34 1.66 Male No Sun Dinner 3\n", 113 | "2 21.01 3.50 Male No Sun Dinner 3" 114 | ] 115 | }, 116 | "execution_count": 2, 117 | "metadata": {}, 118 | "output_type": "execute_result" 119 | } 120 | ], 121 | "source": [ 122 | "tips = sns.load_dataset('tips')\n", 123 | "tips.head(3)" 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": {}, 129 | "source": [ 130 | "There are basically five ways to get data from dataframes:" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": 3, 136 | "metadata": {}, 137 | "outputs": [ 138 | { 139 | "data": { 140 | "text/html": [ 141 | "
\n", 142 | "\n", 155 | "\n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | "
total_billtip
016.991.01
110.341.66
221.013.50
323.683.31
424.593.61
\n", 191 | "
" 192 | ], 193 | "text/plain": [ 194 | " total_bill tip\n", 195 | "0 16.99 1.01\n", 196 | "1 10.34 1.66\n", 197 | "2 21.01 3.50\n", 198 | "3 23.68 3.31\n", 199 | "4 24.59 3.61" 200 | ] 201 | }, 202 | "execution_count": 3, 203 | "metadata": {}, 204 | "output_type": "execute_result" 205 | } 206 | ], 207 | "source": [ 208 | "# 1) get columns\n", 209 | "tips[['total_bill', 'tip']].head()" 210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "execution_count": 4, 215 | "metadata": {}, 216 | "outputs": [ 217 | { 218 | "data": { 219 | "text/html": [ 220 | "
\n", 221 | "\n", 234 | "\n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | "
total_billtipsexsmokerdaytimesize
323.683.31MaleNoSunDinner2
424.593.61FemaleNoSunDinner4
\n", 270 | "
" 271 | ], 272 | "text/plain": [ 273 | " total_bill tip sex smoker day time size\n", 274 | "3 23.68 3.31 Male No Sun Dinner 2\n", 275 | "4 24.59 3.61 Female No Sun Dinner 4" 276 | ] 277 | }, 278 | "execution_count": 4, 279 | "metadata": {}, 280 | "output_type": "execute_result" 281 | } 282 | ], 283 | "source": [ 284 | "# 2) get some rows\n", 285 | "tips[3:5]" 286 | ] 287 | }, 288 | { 289 | "cell_type": "code", 290 | "execution_count": 5, 291 | "metadata": {}, 292 | "outputs": [ 293 | { 294 | "data": { 295 | "text/html": [ 296 | "
\n", 297 | "\n", 310 | "\n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | "
sexsmoker
2MaleNo
3MaleNo
4FemaleNo
\n", 336 | "
" 337 | ], 338 | "text/plain": [ 339 | " sex smoker\n", 340 | "2 Male No\n", 341 | "3 Male No\n", 342 | "4 Female No" 343 | ] 344 | }, 345 | "execution_count": 5, 346 | "metadata": {}, 347 | "output_type": "execute_result" 348 | } 349 | ], 350 | "source": [ 351 | "# 3) select rows and columns based on their name\n", 352 | "tips.loc[2:4, 'sex': 'smoker']" 353 | ] 354 | }, 355 | { 356 | "cell_type": "code", 357 | "execution_count": 6, 358 | "metadata": {}, 359 | "outputs": [ 360 | { 361 | "data": { 362 | "text/html": [ 363 | "
\n", 364 | "\n", 377 | "\n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | "
total_billtip
110.341.66
221.013.50
\n", 398 | "
" 399 | ], 400 | "text/plain": [ 401 | " total_bill tip\n", 402 | "1 10.34 1.66\n", 403 | "2 21.01 3.50" 404 | ] 405 | }, 406 | "execution_count": 6, 407 | "metadata": {}, 408 | "output_type": "execute_result" 409 | } 410 | ], 411 | "source": [ 412 | "# 4) select rows and columns by their integer position\n", 413 | "tips.iloc[1:3, 0:2]" 414 | ] 415 | }, 416 | { 417 | "cell_type": "code", 418 | "execution_count": 9, 419 | "metadata": {}, 420 | "outputs": [ 421 | { 422 | "data": { 423 | "text/html": [ 424 | "
\n", 425 | "\n", 438 | "\n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | " \n", 486 | " \n", 487 | " \n", 488 | " \n", 489 | " \n", 490 | " \n", 491 | " \n", 492 | " \n", 493 | " \n", 494 | " \n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | " \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | "
total_billtipsexsmokerdaytimesize
016.991.01FemaleNoSunDinner2
110.341.66MaleNoSunDinner3
221.013.50MaleNoSunDinner3
323.683.31MaleNoSunDinner2
424.593.61FemaleNoSunDinner4
\n", 504 | "
" 505 | ], 506 | "text/plain": [ 507 | " total_bill tip sex smoker day time size\n", 508 | "0 16.99 1.01 Female No Sun Dinner 2\n", 509 | "1 10.34 1.66 Male No Sun Dinner 3\n", 510 | "2 21.01 3.50 Male No Sun Dinner 3\n", 511 | "3 23.68 3.31 Male No Sun Dinner 2\n", 512 | "4 24.59 3.61 Female No Sun Dinner 4" 513 | ] 514 | }, 515 | "execution_count": 9, 516 | "metadata": {}, 517 | "output_type": "execute_result" 518 | } 519 | ], 520 | "source": [ 521 | "# 5) select using a boolean series\n", 522 | "tips[tips['tip'] > 1].head()" 523 | ] 524 | }, 525 | { 526 | "cell_type": "markdown", 527 | "metadata": {}, 528 | "source": [ 529 | "This is just the tip of the iceberg (well, actually it's 90% of the iceberg). \n", 530 | "\n", 531 | "There are a couple of other important concepts that you will most likely run into when diving into other pandas functionality." 532 | ] 533 | }, 534 | { 535 | "cell_type": "markdown", 536 | "metadata": {}, 537 | "source": [ 538 | "# Multi-index\n", 539 | "\n", 540 | "A subject that you might not think you'd need, but that turns out to be a rather frequent use case. \n", 541 | "\n", 542 | "The initial idea behind the multi-index was to provide a framework to work with higher-dimensional data (and thus a replacement for panels).\n", 543 | "\n", 544 | "But because of some operations it became quite commonplace. In almost all cases a multi-index comes from a [groupby](https://github.com/knathanieltucker/pandas-tutorial/blob/master/notebooks/Group%20Operations.ipynb) (you will almost never construct it or read it in yourself).\n", 545 | "\n", 546 | "Let's do an example below:" 547 | ] 548 | }, 549 | { 550 | "cell_type": "code", 551 | "execution_count": 10, 552 | "metadata": {}, 553 | "outputs": [ 554 | { 555 | "data": { 556 | "text/html": [ 557 | "
\n", 558 | "\n", 571 | "\n", 572 | " \n", 573 | " \n", 574 | " \n", 575 | " \n", 576 | " \n", 577 | " \n", 578 | " \n", 579 | " \n", 580 | " \n", 581 | " \n", 582 | " \n", 583 | " \n", 584 | " \n", 585 | " \n", 586 | " \n", 587 | " \n", 588 | " \n", 589 | " \n", 590 | " \n", 591 | " \n", 592 | " \n", 593 | " \n", 594 | " \n", 595 | " \n", 596 | " \n", 597 | " \n", 598 | " \n", 599 | " \n", 600 | " \n", 601 | " \n", 602 | " \n", 603 | " \n", 604 | " \n", 605 | " \n", 606 | " \n", 607 | " \n", 608 | " \n", 609 | " \n", 610 | " \n", 611 | " \n", 612 | " \n", 613 | " \n", 614 | " \n", 615 | " \n", 616 | " \n", 617 | " \n", 618 | " \n", 619 | " \n", 620 | " \n", 621 | " \n", 622 | " \n", 623 | " \n", 624 | " \n", 625 | " \n", 626 | " \n", 627 | " \n", 628 | " \n", 629 | " \n", 630 | " \n", 631 | " \n", 632 | " \n", 633 | " \n", 634 | " \n", 635 | " \n", 636 | "
total_billtipsexsmokerdaytimesize
016.991.01FemaleNoSunDinner2
110.341.66MaleNoSunDinner3
221.013.50MaleNoSunDinner3
323.683.31MaleNoSunDinner2
424.593.61FemaleNoSunDinner4
\n", 637 | "
" 638 | ], 639 | "text/plain": [ 640 | " total_bill tip sex smoker day time size\n", 641 | "0 16.99 1.01 Female No Sun Dinner 2\n", 642 | "1 10.34 1.66 Male No Sun Dinner 3\n", 643 | "2 21.01 3.50 Male No Sun Dinner 3\n", 644 | "3 23.68 3.31 Male No Sun Dinner 2\n", 645 | "4 24.59 3.61 Female No Sun Dinner 4" 646 | ] 647 | }, 648 | "execution_count": 10, 649 | "metadata": {}, 650 | "output_type": "execute_result" 651 | } 652 | ], 653 | "source": [ 654 | "tips.head()" 655 | ] 656 | }, 657 | { 658 | "cell_type": "code", 659 | "execution_count": 11, 660 | "metadata": {}, 661 | "outputs": [ 662 | { 663 | "data": { 664 | "text/html": [ 665 | "
\n", 666 | "\n", 679 | "\n", 680 | " \n", 681 | " \n", 682 | " \n", 683 | " \n", 684 | " \n", 685 | " \n", 686 | " \n", 687 | " \n", 688 | " \n", 689 | " \n", 690 | " \n", 691 | " \n", 692 | " \n", 693 | " \n", 694 | " \n", 695 | " \n", 696 | " \n", 697 | " \n", 698 | " \n", 699 | " \n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | " \n", 712 | "
tip
sexsmoker
MaleYes3.051167
No3.113402
FemaleYes2.931515
No2.773519
\n", 713 | "
" 714 | ], 715 | "text/plain": [ 716 | " tip\n", 717 | "sex smoker \n", 718 | "Male Yes 3.051167\n", 719 | " No 3.113402\n", 720 | "Female Yes 2.931515\n", 721 | " No 2.773519" 722 | ] 723 | }, 724 | "execution_count": 11, 725 | "metadata": {}, 726 | "output_type": "execute_result" 727 | } 728 | ], 729 | "source": [ 730 | "mi_tips = tips.groupby(['sex', 'smoker']).agg({'tip': 'mean'})\n", 731 | "mi_tips" 732 | ] 733 | }, 734 | { 735 | "cell_type": "code", 736 | "execution_count": 12, 737 | "metadata": {}, 738 | "outputs": [ 739 | { 740 | "data": { 741 | "text/plain": [ 742 | "MultiIndex(levels=[['Male', 'Female'], ['Yes', 'No']],\n", 743 | " codes=[[0, 0, 1, 1], [0, 1, 0, 1]],\n", 744 | " names=['sex', 'smoker'])" 745 | ] 746 | }, 747 | "execution_count": 12, 748 | "metadata": {}, 749 | "output_type": "execute_result" 750 | } 751 | ], 752 | "source": [ 753 | "mi_tips.index" 754 | ] 755 | }, 756 | { 757 | "cell_type": "markdown", 758 | "metadata": {}, 759 | "source": [ 760 | "Ultimately there are a ton of operations that you can do on top of this type of data. And there are equivalent multi-index operations you can do, like this:" 761 | ] 762 | }, 763 | { 764 | "cell_type": "code", 765 | "execution_count": 13, 766 | "metadata": {}, 767 | "outputs": [ 768 | { 769 | "data": { 770 | "text/plain": [ 771 | "tip 3.113402\n", 772 | "Name: (Male, No), dtype: float64" 773 | ] 774 | }, 775 | "execution_count": 13, 776 | "metadata": {}, 777 | "output_type": "execute_result" 778 | } 779 | ], 780 | "source": [ 781 | "mi_tips.loc[('Male', 'No')]" 782 | ] 783 | }, 784 | { 785 | "cell_type": "markdown", 786 | "metadata": {}, 787 | "source": [ 788 | "But in that way you'd have a learn a lot of details and there are always exceptions. \n", 789 | "\n", 790 | "So the way that I have always deal with this is simply by resetting the index." 
791 | ] 792 | }, 793 | { 794 | "cell_type": "code", 795 | "execution_count": 14, 796 | "metadata": {}, 797 | "outputs": [ 798 | { 799 | "data": { 800 | "text/html": [ 801 | "
\n", 802 | "\n", 815 | "\n", 816 | " \n", 817 | " \n", 818 | " \n", 819 | " \n", 820 | " \n", 821 | " \n", 822 | " \n", 823 | " \n", 824 | " \n", 825 | " \n", 826 | " \n", 827 | " \n", 828 | " \n", 829 | " \n", 830 | " \n", 831 | " \n", 832 | " \n", 833 | " \n", 834 | " \n", 835 | " \n", 836 | " \n", 837 | " \n", 838 | " \n", 839 | " \n", 840 | " \n", 841 | " \n", 842 | " \n", 843 | " \n", 844 | " \n", 845 | " \n", 846 | " \n", 847 | " \n", 848 | " \n", 849 | " \n", 850 | "
sexsmokertip
0MaleYes3.051167
1MaleNo3.113402
2FemaleYes2.931515
3FemaleNo2.773519
\n", 851 | "
" 852 | ], 853 | "text/plain": [ 854 | " sex smoker tip\n", 855 | "0 Male Yes 3.051167\n", 856 | "1 Male No 3.113402\n", 857 | "2 Female Yes 2.931515\n", 858 | "3 Female No 2.773519" 859 | ] 860 | }, 861 | "execution_count": 14, 862 | "metadata": {}, 863 | "output_type": "execute_result" 864 | } 865 | ], 866 | "source": [ 867 | "ri_tips = mi_tips.reset_index()\n", 868 | "ri_tips" 869 | ] 870 | }, 871 | { 872 | "cell_type": "markdown", 873 | "metadata": {}, 874 | "source": [ 875 | "Notice how we get values spread out over the full column now. So in this way it is easy to select only the male non-smokers:" 876 | ] 877 | }, 878 | { 879 | "cell_type": "code", 880 | "execution_count": 18, 881 | "metadata": {}, 882 | "outputs": [ 883 | { 884 | "data": { 885 | "text/html": [ 886 | "
\n", 887 | "\n", 900 | "\n", 901 | " \n", 902 | " \n", 903 | " \n", 904 | " \n", 905 | " \n", 906 | " \n", 907 | " \n", 908 | " \n", 909 | " \n", 910 | " \n", 911 | " \n", 912 | " \n", 913 | " \n", 914 | " \n", 915 | " \n", 916 | " \n", 917 | "
sexsmokertip
1MaleNo3.113402
\n", 918 | "
" 919 | ], 920 | "text/plain": [ 921 | " sex smoker tip\n", 922 | "1 Male No 3.113402" 923 | ] 924 | }, 925 | "execution_count": 18, 926 | "metadata": {}, 927 | "output_type": "execute_result" 928 | } 929 | ], 930 | "source": [ 931 | "ri_tips[(ri_tips['smoker'] == 'No') & (ri_tips['sex'] == 'Male')]" 932 | ] 933 | }, 934 | { 935 | "cell_type": "markdown", 936 | "metadata": {}, 937 | "source": [ 938 | "Another way you can deal with this is to only certain indexes out:" 939 | ] 940 | }, 941 | { 942 | "cell_type": "code", 943 | "execution_count": 19, 944 | "metadata": {}, 945 | "outputs": [ 946 | { 947 | "data": { 948 | "text/html": [ 949 | "
\n", 950 | "\n", 963 | "\n", 964 | " \n", 965 | " \n", 966 | " \n", 967 | " \n", 968 | " \n", 969 | " \n", 970 | " \n", 971 | " \n", 972 | " \n", 973 | " \n", 974 | " \n", 975 | " \n", 976 | " \n", 977 | " \n", 978 | " \n", 979 | " \n", 980 | " \n", 981 | " \n", 982 | " \n", 983 | " \n", 984 | " \n", 985 | " \n", 986 | " \n", 987 | " \n", 988 | "
sextip
smoker
YesMale3.051167
YesFemale2.931515
\n", 989 | "
" 990 | ], 991 | "text/plain": [ 992 | " sex tip\n", 993 | "smoker \n", 994 | "Yes Male 3.051167\n", 995 | "Yes Female 2.931515" 996 | ] 997 | }, 998 | "execution_count": 19, 999 | "metadata": {}, 1000 | "output_type": "execute_result" 1001 | } 1002 | ], 1003 | "source": [ 1004 | "ri0_tips = mi_tips.reset_index(level=0)\n", 1005 | "ri0_tips.loc['Yes']" 1006 | ] 1007 | }, 1008 | { 1009 | "cell_type": "markdown", 1010 | "metadata": {}, 1011 | "source": [ 1012 | "And finally you can [pull indexes back into the index](https://github.com/knathanieltucker/pandas-tutorial/blob/master/notebooks/Indexing%20and%20Selecting.ipynb) (basically only useful for certain types of merges)." 1013 | ] 1014 | }, 1015 | { 1016 | "cell_type": "code", 1017 | "execution_count": 20, 1018 | "metadata": {}, 1019 | "outputs": [ 1020 | { 1021 | "data": { 1022 | "text/html": [ 1023 | "
\n", 1024 | "\n", 1037 | "\n", 1038 | " \n", 1039 | " \n", 1040 | " \n", 1041 | " \n", 1042 | " \n", 1043 | " \n", 1044 | " \n", 1045 | " \n", 1046 | " \n", 1047 | " \n", 1048 | " \n", 1049 | " \n", 1050 | " \n", 1051 | " \n", 1052 | " \n", 1053 | " \n", 1054 | " \n", 1055 | " \n", 1056 | " \n", 1057 | " \n", 1058 | " \n", 1059 | " \n", 1060 | " \n", 1061 | " \n", 1062 | " \n", 1063 | " \n", 1064 | " \n", 1065 | " \n", 1066 | " \n", 1067 | " \n", 1068 | " \n", 1069 | " \n", 1070 | "
tip
sexsmoker
MaleYes3.051167
No3.113402
FemaleYes2.931515
No2.773519
\n", 1071 | "
" 1072 | ], 1073 | "text/plain": [ 1074 | " tip\n", 1075 | "sex smoker \n", 1076 | "Male Yes 3.051167\n", 1077 | " No 3.113402\n", 1078 | "Female Yes 2.931515\n", 1079 | " No 2.773519" 1080 | ] 1081 | }, 1082 | "execution_count": 20, 1083 | "metadata": {}, 1084 | "output_type": "execute_result" 1085 | } 1086 | ], 1087 | "source": [ 1088 | "ri_tips.set_index(['sex', 'smoker'])" 1089 | ] 1090 | }, 1091 | { 1092 | "cell_type": "code", 1093 | "execution_count": 21, 1094 | "metadata": {}, 1095 | "outputs": [ 1096 | { 1097 | "data": { 1098 | "text/html": [ 1099 | "
\n", 1100 | "\n", 1113 | "\n", 1114 | " \n", 1115 | " \n", 1116 | " \n", 1117 | " \n", 1118 | " \n", 1119 | " \n", 1120 | " \n", 1121 | " \n", 1122 | " \n", 1123 | " \n", 1124 | " \n", 1125 | " \n", 1126 | " \n", 1127 | " \n", 1128 | " \n", 1129 | " \n", 1130 | " \n", 1131 | " \n", 1132 | " \n", 1133 | " \n", 1134 | " \n", 1135 | " \n", 1136 | " \n", 1137 | " \n", 1138 | " \n", 1139 | " \n", 1140 | " \n", 1141 | " \n", 1142 | " \n", 1143 | " \n", 1144 | " \n", 1145 | " \n", 1146 | " \n", 1147 | " \n", 1148 | "
tip
smokersex
YesMale3.051167
NoMale3.113402
YesFemale2.931515
NoFemale2.773519
\n", 1149 | "
" 1150 | ], 1151 | "text/plain": [ 1152 | " tip\n", 1153 | "smoker sex \n", 1154 | "Yes Male 3.051167\n", 1155 | "No Male 3.113402\n", 1156 | "Yes Female 2.931515\n", 1157 | "No Female 2.773519" 1158 | ] 1159 | }, 1160 | "execution_count": 21, 1161 | "metadata": {}, 1162 | "output_type": "execute_result" 1163 | } 1164 | ], 1165 | "source": [ 1166 | "ri0_tips.set_index('sex', append=True)" 1167 | ] 1168 | }, 1169 | { 1170 | "cell_type": "markdown", 1171 | "metadata": {}, 1172 | "source": [ 1173 | "# Getting Single Values\n", 1174 | "\n", 1175 | "The next little indexing trick is one that is mostly about speed. But it is getting and setting single values. It is a pretty simple:" 1176 | ] 1177 | }, 1178 | { 1179 | "cell_type": "code", 1180 | "execution_count": 37, 1181 | "metadata": {}, 1182 | "outputs": [ 1183 | { 1184 | "data": { 1185 | "text/html": [ 1186 | "
\n", 1187 | "\n", 1200 | "\n", 1201 | " \n", 1202 | " \n", 1203 | " \n", 1204 | " \n", 1205 | " \n", 1206 | " \n", 1207 | " \n", 1208 | " \n", 1209 | " \n", 1210 | " \n", 1211 | " \n", 1212 | " \n", 1213 | " \n", 1214 | " \n", 1215 | " \n", 1216 | " \n", 1217 | " \n", 1218 | " \n", 1219 | " \n", 1220 | " \n", 1221 | " \n", 1222 | " \n", 1223 | " \n", 1224 | " \n", 1225 | " \n", 1226 | " \n", 1227 | " \n", 1228 | " \n", 1229 | " \n", 1230 | " \n", 1231 | " \n", 1232 | " \n", 1233 | " \n", 1234 | " \n", 1235 | " \n", 1236 | " \n", 1237 | " \n", 1238 | " \n", 1239 | " \n", 1240 | " \n", 1241 | " \n", 1242 | " \n", 1243 | " \n", 1244 | " \n", 1245 | "
total_billtipsexsmokerdaytimesize
06.001.01FemaleNoSunDinner2
110.341.66MaleNoSunDinner3
221.013.50MaleNoSunDinner3
\n", 1246 | "
" 1247 | ], 1248 | "text/plain": [ 1249 | " total_bill tip sex smoker day time size\n", 1250 | "0 6.00 1.01 Female No Sun Dinner 2\n", 1251 | "1 10.34 1.66 Male No Sun Dinner 3\n", 1252 | "2 21.01 3.50 Male No Sun Dinner 3" 1253 | ] 1254 | }, 1255 | "execution_count": 37, 1256 | "metadata": {}, 1257 | "output_type": "execute_result" 1258 | } 1259 | ], 1260 | "source": [ 1261 | "tips.head(3)" 1262 | ] 1263 | }, 1264 | { 1265 | "cell_type": "markdown", 1266 | "metadata": {}, 1267 | "source": [ 1268 | "When getting/setting single values you should use the `at` function" 1269 | ] 1270 | }, 1271 | { 1272 | "cell_type": "code", 1273 | "execution_count": 23, 1274 | "metadata": {}, 1275 | "outputs": [ 1276 | { 1277 | "data": { 1278 | "text/html": [ 1279 | "
\n", 1280 | "\n", 1293 | "\n", 1294 | " \n", 1295 | " \n", 1296 | " \n", 1297 | " \n", 1298 | " \n", 1299 | " \n", 1300 | " \n", 1301 | " \n", 1302 | " \n", 1303 | " \n", 1304 | " \n", 1305 | " \n", 1306 | " \n", 1307 | " \n", 1308 | " \n", 1309 | " \n", 1310 | " \n", 1311 | " \n", 1312 | " \n", 1313 | " \n", 1314 | " \n", 1315 | " \n", 1316 | " \n", 1317 | " \n", 1318 | " \n", 1319 | " \n", 1320 | " \n", 1321 | " \n", 1322 | " \n", 1323 | " \n", 1324 | " \n", 1325 | " \n", 1326 | " \n", 1327 | " \n", 1328 | " \n", 1329 | " \n", 1330 | " \n", 1331 | " \n", 1332 | " \n", 1333 | " \n", 1334 | " \n", 1335 | " \n", 1336 | " \n", 1337 | " \n", 1338 | "
total_billtipsexsmokerdaytimesize
09000.001.01FemaleNoSunDinner2
110.341.66MaleNoSunDinner3
221.013.50MaleNoSunDinner3
\n", 1339 | "
" 1340 | ], 1341 | "text/plain": [ 1342 | " total_bill tip sex smoker day time size\n", 1343 | "0 9000.00 1.01 Female No Sun Dinner 2\n", 1344 | "1 10.34 1.66 Male No Sun Dinner 3\n", 1345 | "2 21.01 3.50 Male No Sun Dinner 3" 1346 | ] 1347 | }, 1348 | "execution_count": 23, 1349 | "metadata": {}, 1350 | "output_type": "execute_result" 1351 | } 1352 | ], 1353 | "source": [ 1354 | "tips.at[0, 'total_bill'] = 9000\n", 1355 | "tips.head(3)" 1356 | ] 1357 | }, 1358 | { 1359 | "cell_type": "code", 1360 | "execution_count": 24, 1361 | "metadata": {}, 1362 | "outputs": [ 1363 | { 1364 | "data": { 1365 | "text/plain": [ 1366 | "9000.0" 1367 | ] 1368 | }, 1369 | "execution_count": 24, 1370 | "metadata": {}, 1371 | "output_type": "execute_result" 1372 | } 1373 | ], 1374 | "source": [ 1375 | "tips.iat[0, 0]" 1376 | ] 1377 | }, 1378 | { 1379 | "cell_type": "markdown", 1380 | "metadata": {}, 1381 | "source": [ 1382 | "If you are modifying single values of a dataframe you should always use these guys. It's faster and it is a good way to know that you are not messing up (often times modifying the data can result in odd errors).\n", 1383 | "\n", 1384 | "So just to prove it's faster let's time it!" 1385 | ] 1386 | }, 1387 | { 1388 | "cell_type": "code", 1389 | "execution_count": 25, 1390 | "metadata": {}, 1391 | "outputs": [ 1392 | { 1393 | "name": "stdout", 1394 | "output_type": "stream", 1395 | "text": [ 1396 | "5.85 µs ± 96.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)\n" 1397 | ] 1398 | } 1399 | ], 1400 | "source": [ 1401 | "%%timeit\n", 1402 | "tips.at[0, 'total_bill'] = 6" 1403 | ] 1404 | }, 1405 | { 1406 | "cell_type": "code", 1407 | "execution_count": 26, 1408 | "metadata": {}, 1409 | "outputs": [ 1410 | { 1411 | "name": "stdout", 1412 | "output_type": "stream", 1413 | "text": [ 1414 | "304 µs ± 8.3 µs per loop (mean ± std. dev. 
of 7 runs, 1000 loops each)\n" 1415 | ] 1416 | } 1417 | ], 1418 | "source": [ 1419 | "%%timeit\n", 1420 | "tips.loc[0, 'total_bill'] = 6" 1421 | ] 1422 | }, 1423 | { 1424 | "cell_type": "markdown", 1425 | "metadata": {}, 1426 | "source": [ 1427 | "# Where, Masks and Queries\n", 1428 | "\n", 1429 | "These are things that are built into pandas that I have personally never used, mostly because they are pretty redundant and the need for them doesn't come up too often.\n", 1430 | "\n", 1431 | "They are a bit faster, yes. But the mental space is probably not worth it. So if you wanna learn them, go for it (docs are [here](http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#the-query-method)). If not, it probably won't matter.\n", 1432 | "\n", 1433 | "Let me show you how you'd duplicate the `where`/`mask` functionality below. " 1434 | ] 1435 | }, 1436 | { 1437 | "cell_type": "code", 1438 | "execution_count": 27, 1439 | "metadata": {}, 1440 | "outputs": [ 1441 | { 1442 | "data": { 1443 | "text/html": [ 1444 | "
\n", 1445 | "\n", 1458 | "\n", 1459 | " \n", 1460 | " \n", 1461 | " \n", 1462 | " \n", 1463 | " \n", 1464 | " \n", 1465 | " \n", 1466 | " \n", 1467 | " \n", 1468 | " \n", 1469 | " \n", 1470 | " \n", 1471 | " \n", 1472 | " \n", 1473 | " \n", 1474 | " \n", 1475 | " \n", 1476 | " \n", 1477 | " \n", 1478 | " \n", 1479 | " \n", 1480 | " \n", 1481 | " \n", 1482 | " \n", 1483 | " \n", 1484 | " \n", 1485 | " \n", 1486 | " \n", 1487 | " \n", 1488 | " \n", 1489 | " \n", 1490 | " \n", 1491 | " \n", 1492 | " \n", 1493 | " \n", 1494 | " \n", 1495 | " \n", 1496 | " \n", 1497 | " \n", 1498 | " \n", 1499 | " \n", 1500 | " \n", 1501 | " \n", 1502 | " \n", 1503 | " \n", 1504 | " \n", 1505 | " \n", 1506 | " \n", 1507 | " \n", 1508 | " \n", 1509 | " \n", 1510 | " \n", 1511 | "
01234
0-1.4387810.584173-0.6941120.1353040.409292
1-2.2032191.2324871.284779-2.460982-0.855321
2-0.827212-0.293645-0.6797450.209145-0.402497
30.4717471.1413610.4298782.290840-0.655701
4-1.9443340.1867851.031003-0.6338080.413554
\n", 1512 | "
" 1513 | ], 1514 | "text/plain": [ 1515 | " 0 1 2 3 4\n", 1516 | "0 -1.438781 0.584173 -0.694112 0.135304 0.409292\n", 1517 | "1 -2.203219 1.232487 1.284779 -2.460982 -0.855321\n", 1518 | "2 -0.827212 -0.293645 -0.679745 0.209145 -0.402497\n", 1519 | "3 0.471747 1.141361 0.429878 2.290840 -0.655701\n", 1520 | "4 -1.944334 0.186785 1.031003 -0.633808 0.413554" 1521 | ] 1522 | }, 1523 | "execution_count": 27, 1524 | "metadata": {}, 1525 | "output_type": "execute_result" 1526 | } 1527 | ], 1528 | "source": [ 1529 | "df = pd.DataFrame(np.random.randn(25).reshape((5, 5)))\n", 1530 | "df.head()" 1531 | ] 1532 | }, 1533 | { 1534 | "cell_type": "code", 1535 | "execution_count": 28, 1536 | "metadata": {}, 1537 | "outputs": [ 1538 | { 1539 | "data": { 1540 | "text/html": [ 1541 | "
\n", 1542 | "\n", 1555 | "\n", 1556 | " \n", 1557 | " \n", 1558 | " \n", 1559 | " \n", 1560 | " \n", 1561 | " \n", 1562 | " \n", 1563 | " \n", 1564 | " \n", 1565 | " \n", 1566 | " \n", 1567 | " \n", 1568 | " \n", 1569 | " \n", 1570 | " \n", 1571 | " \n", 1572 | " \n", 1573 | " \n", 1574 | " \n", 1575 | " \n", 1576 | " \n", 1577 | " \n", 1578 | " \n", 1579 | " \n", 1580 | " \n", 1581 | " \n", 1582 | " \n", 1583 | " \n", 1584 | " \n", 1585 | " \n", 1586 | " \n", 1587 | " \n", 1588 | " \n", 1589 | " \n", 1590 | " \n", 1591 | " \n", 1592 | " \n", 1593 | " \n", 1594 | " \n", 1595 | " \n", 1596 | " \n", 1597 | " \n", 1598 | " \n", 1599 | " \n", 1600 | " \n", 1601 | " \n", 1602 | " \n", 1603 | " \n", 1604 | " \n", 1605 | " \n", 1606 | " \n", 1607 | " \n", 1608 | "
01234
0NaN0.584173NaN0.1353040.409292
1NaN1.2324871.284779NaNNaN
2NaNNaNNaN0.209145NaN
30.4717471.1413610.4298782.290840NaN
4NaN0.1867851.031003NaN0.413554
\n", 1609 | "
" 1610 | ], 1611 | "text/plain": [ 1612 | " 0 1 2 3 4\n", 1613 | "0 NaN 0.584173 NaN 0.135304 0.409292\n", 1614 | "1 NaN 1.232487 1.284779 NaN NaN\n", 1615 | "2 NaN NaN NaN 0.209145 NaN\n", 1616 | "3 0.471747 1.141361 0.429878 2.290840 NaN\n", 1617 | "4 NaN 0.186785 1.031003 NaN 0.413554" 1618 | ] 1619 | }, 1620 | "execution_count": 28, 1621 | "metadata": {}, 1622 | "output_type": "execute_result" 1623 | } 1624 | ], 1625 | "source": [ 1626 | "df.where(df > 0)" 1627 | ] 1628 | }, 1629 | { 1630 | "cell_type": "code", 1631 | "execution_count": 29, 1632 | "metadata": {}, 1633 | "outputs": [ 1634 | { 1635 | "data": { 1636 | "text/html": [ 1637 | "
\n", 1638 | "\n", 1651 | "\n", 1652 | " \n", 1653 | " \n", 1654 | " \n", 1655 | " \n", 1656 | " \n", 1657 | " \n", 1658 | " \n", 1659 | " \n", 1660 | " \n", 1661 | " \n", 1662 | " \n", 1663 | " \n", 1664 | " \n", 1665 | " \n", 1666 | " \n", 1667 | " \n", 1668 | " \n", 1669 | " \n", 1670 | " \n", 1671 | " \n", 1672 | " \n", 1673 | " \n", 1674 | " \n", 1675 | " \n", 1676 | " \n", 1677 | " \n", 1678 | " \n", 1679 | " \n", 1680 | " \n", 1681 | " \n", 1682 | " \n", 1683 | " \n", 1684 | " \n", 1685 | " \n", 1686 | " \n", 1687 | " \n", 1688 | " \n", 1689 | " \n", 1690 | " \n", 1691 | " \n", 1692 | " \n", 1693 | " \n", 1694 | " \n", 1695 | " \n", 1696 | " \n", 1697 | " \n", 1698 | " \n", 1699 | " \n", 1700 | " \n", 1701 | " \n", 1702 | " \n", 1703 | " \n", 1704 | "
01234
0NaN0.584173NaN0.1353040.409292
1NaN1.2324871.284779NaNNaN
2NaNNaNNaN0.209145NaN
30.4717471.1413610.4298782.290840NaN
4NaN0.1867851.031003NaN0.413554
\n", 1705 | "
" 1706 | ], 1707 | "text/plain": [ 1708 | " 0 1 2 3 4\n", 1709 | "0 NaN 0.584173 NaN 0.135304 0.409292\n", 1710 | "1 NaN 1.232487 1.284779 NaN NaN\n", 1711 | "2 NaN NaN NaN 0.209145 NaN\n", 1712 | "3 0.471747 1.141361 0.429878 2.290840 NaN\n", 1713 | "4 NaN 0.186785 1.031003 NaN 0.413554" 1714 | ] 1715 | }, 1716 | "execution_count": 29, 1717 | "metadata": {}, 1718 | "output_type": "execute_result" 1719 | } 1720 | ], 1721 | "source": [ 1722 | "df[df < 0] = np.NaN\n", 1723 | "df" 1724 | ] 1725 | }, 1726 | { 1727 | "cell_type": "markdown", 1728 | "metadata": {}, 1729 | "source": [ 1730 | "## Conclusion\n", 1731 | "\n", 1732 | "So that's it. This is really all I know about indexing and prob all you'll need to know too. If you've got any question or comment please add them! \n", 1733 | "\n", 1734 | "p.s. there are not really any great tutorials on this in particular, but if you know of one I should link, let me know." 1735 | ] 1736 | }, 1737 | { 1738 | "cell_type": "code", 1739 | "execution_count": null, 1740 | "metadata": {}, 1741 | "outputs": [], 1742 | "source": [] 1743 | } 1744 | ], 1745 | "metadata": { 1746 | "kernelspec": { 1747 | "display_name": "Python 3", 1748 | "language": "python", 1749 | "name": "python3" 1750 | }, 1751 | "language_info": { 1752 | "codemirror_mode": { 1753 | "name": "ipython", 1754 | "version": 3 1755 | }, 1756 | "file_extension": ".py", 1757 | "mimetype": "text/x-python", 1758 | "name": "python", 1759 | "nbconvert_exporter": "python", 1760 | "pygments_lexer": "ipython3", 1761 | "version": "3.7.3" 1762 | } 1763 | }, 1764 | "nbformat": 4, 1765 | "nbformat_minor": 2 1766 | } 1767 | -------------------------------------------------------------------------------- /notebooks/Misc Functions.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 2, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import seaborn as sns\n", 10 | 
"import pandas as pd\n", 11 | "import numpy as np" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "# Pandas Misc Useful Functions\n", 19 | "\n", 20 | "Pandas is massive. I mean really massive! There are hundreds of functions. So we are not going to go over all of them here, but I'll show you a couple of the most useful ones:\n", 21 | "\n", 22 | "\n", 23 | "This time each function has a bit of documentation, so let's just jump right in. " 24 | ] 25 | }, 26 | { 27 | "cell_type": "code", 28 | "execution_count": 3, 29 | "metadata": {}, 30 | "outputs": [ 31 | { 32 | "data": { 33 | "text/html": [ 34 | "
\n", 35 | "\n", 48 | "\n", 49 | " \n", 50 | " \n", 51 | " \n", 52 | " \n", 53 | " \n", 54 | " \n", 55 | " \n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | "
total_billtipsexsmokerdaytimesize
016.991.01FemaleNoSunDinner2
110.341.66MaleNoSunDinner3
221.013.50MaleNoSunDinner3
\n", 94 | "
" 95 | ], 96 | "text/plain": [ 97 | " total_bill tip sex smoker day time size\n", 98 | "0 16.99 1.01 Female No Sun Dinner 2\n", 99 | "1 10.34 1.66 Male No Sun Dinner 3\n", 100 | "2 21.01 3.50 Male No Sun Dinner 3" 101 | ] 102 | }, 103 | "execution_count": 3, 104 | "metadata": {}, 105 | "output_type": "execute_result" 106 | } 107 | ], 108 | "source": [ 109 | "tips = sns.load_dataset('tips')\n", 110 | "tips.head(3)" 111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "metadata": {}, 116 | "source": [ 117 | "## Sample\n", 118 | "\n", 119 | "Pretty useful. Let's you get samples from a dataframe in a pretty powerful diverse way.\n", 120 | "\n", 121 | "http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#selecting-random-samples" 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": 4, 127 | "metadata": {}, 128 | "outputs": [ 129 | { 130 | "data": { 131 | "text/html": [ 132 | "
\n", 133 | "\n", 146 | "\n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | "
total_billtipsexsmokerdaytimesize
13714.152.0FemaleNoThurLunch2
15916.492.0MaleNoSunDinner4
20316.402.5FemaleYesThurLunch2
21628.153.0MaleYesSatDinner5
3215.063.0FemaleNoSatDinner2
\n", 212 | "
" 213 | ], 214 | "text/plain": [ 215 | " total_bill tip sex smoker day time size\n", 216 | "137 14.15 2.0 Female No Thur Lunch 2\n", 217 | "159 16.49 2.0 Male No Sun Dinner 4\n", 218 | "203 16.40 2.5 Female Yes Thur Lunch 2\n", 219 | "216 28.15 3.0 Male Yes Sat Dinner 5\n", 220 | "32 15.06 3.0 Female No Sat Dinner 2" 221 | ] 222 | }, 223 | "execution_count": 4, 224 | "metadata": {}, 225 | "output_type": "execute_result" 226 | } 227 | ], 228 | "source": [ 229 | "tips.sample(5)" 230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "execution_count": 5, 235 | "metadata": {}, 236 | "outputs": [], 237 | "source": [ 238 | "tips.sample?" 239 | ] 240 | }, 241 | { 242 | "cell_type": "markdown", 243 | "metadata": {}, 244 | "source": [ 245 | "## isin\n", 246 | "\n", 247 | "The next pretty useful function is called is in. It is applied to an entire column and is very useful in selecting specific rows\n", 248 | "\n", 249 | "http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-with-isin" 250 | ] 251 | }, 252 | { 253 | "cell_type": "code", 254 | "execution_count": 8, 255 | "metadata": {}, 256 | "outputs": [ 257 | { 258 | "data": { 259 | "text/plain": [ 260 | "107 True\n", 261 | "217 True\n", 262 | "193 False\n", 263 | "226 False\n", 264 | "214 True\n", 265 | "Name: day, dtype: bool" 266 | ] 267 | }, 268 | "execution_count": 8, 269 | "metadata": {}, 270 | "output_type": "execute_result" 271 | } 272 | ], 273 | "source": [ 274 | "is_weekend = tips.day.isin(['Sat', 'Sun']).sample(5)\n", 275 | "is_weekend" 276 | ] 277 | }, 278 | { 279 | "cell_type": "code", 280 | "execution_count": 10, 281 | "metadata": {}, 282 | "outputs": [ 283 | { 284 | "data": { 285 | "text/html": [ 286 | "
\n", 287 | "\n", 300 | "\n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | "
total_billtipsexsmokerdaytimesize
4430.405.60MaleNoSunDinner4
6113.812.00MaleYesSatDinner2
24027.182.00FemaleYesSatDinner2
16810.591.61FemaleYesSatDinner2
10322.423.48FemaleYesSatDinner2
\n", 366 | "
" 367 | ], 368 | "text/plain": [ 369 | " total_bill tip sex smoker day time size\n", 370 | "44 30.40 5.60 Male No Sun Dinner 4\n", 371 | "61 13.81 2.00 Male Yes Sat Dinner 2\n", 372 | "240 27.18 2.00 Female Yes Sat Dinner 2\n", 373 | "168 10.59 1.61 Female Yes Sat Dinner 2\n", 374 | "103 22.42 3.48 Female Yes Sat Dinner 2" 375 | ] 376 | }, 377 | "execution_count": 10, 378 | "metadata": {}, 379 | "output_type": "execute_result" 380 | } 381 | ], 382 | "source": [ 383 | "tips[tips.day.isin(['Sat', 'Sun'])].sample(5)" 384 | ] 385 | }, 386 | { 387 | "cell_type": "markdown", 388 | "metadata": {}, 389 | "source": [ 390 | "## drop_duplicates\n", 391 | "\n", 392 | "This one is a pretty useful function in a lot of respects, and it works on more than one column\n", 393 | "\n", 394 | "http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#duplicate-data" 395 | ] 396 | }, 397 | { 398 | "cell_type": "code", 399 | "execution_count": 11, 400 | "metadata": {}, 401 | "outputs": [ 402 | { 403 | "data": { 404 | "text/html": [ 405 | "
\n", 406 | "\n", 419 | "\n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | "
timeday
0DinnerSun
19DinnerSat
77LunchThur
90DinnerFri
220LunchFri
243DinnerThur
\n", 460 | "
" 461 | ], 462 | "text/plain": [ 463 | " time day\n", 464 | "0 Dinner Sun\n", 465 | "19 Dinner Sat\n", 466 | "77 Lunch Thur\n", 467 | "90 Dinner Fri\n", 468 | "220 Lunch Fri\n", 469 | "243 Dinner Thur" 470 | ] 471 | }, 472 | "execution_count": 11, 473 | "metadata": {}, 474 | "output_type": "execute_result" 475 | } 476 | ], 477 | "source": [ 478 | "tips[['time', 'day']].drop_duplicates(keep='first')" 479 | ] 480 | }, 481 | { 482 | "cell_type": "markdown", 483 | "metadata": {}, 484 | "source": [ 485 | "## cut\n", 486 | "\n", 487 | "This will cut your numeric data into equal buckets and then assign them labels depending on the bucket. Pretty useful and if you need something more granular you can use qcut.\n", 488 | "\n", 489 | "http://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html#tiling" 490 | ] 491 | }, 492 | { 493 | "cell_type": "code", 494 | "execution_count": 12, 495 | "metadata": {}, 496 | "outputs": [ 497 | { 498 | "data": { 499 | "text/plain": [ 500 | "0 low\n", 501 | "1 low\n", 502 | "2 mid\n", 503 | "3 mid\n", 504 | "4 mid\n", 505 | "Name: total_bill, dtype: category\n", 506 | "Categories (3, object): [low < mid < high]" 507 | ] 508 | }, 509 | "execution_count": 12, 510 | "metadata": {}, 511 | "output_type": "execute_result" 512 | } 513 | ], 514 | "source": [ 515 | "pd.cut(tips['total_bill'], 3, labels=['low', 'mid', 'high']).head()" 516 | ] 517 | }, 518 | { 519 | "cell_type": "markdown", 520 | "metadata": {}, 521 | "source": [ 522 | "## str\n", 523 | "\n", 524 | "The str functions are really really useful and there are a ton of them. 
If you ever need to compute a string operation on a column first look here.\n", 525 | "\n", 526 | "http://pandas.pydata.org/pandas-docs/stable/user_guide/text.html" 527 | ] 528 | }, 529 | { 530 | "cell_type": "code", 531 | "execution_count": 13, 532 | "metadata": {}, 533 | "outputs": [ 534 | { 535 | "data": { 536 | "text/plain": [ 537 | "0 female\n", 538 | "1 male\n", 539 | "2 male\n", 540 | "3 male\n", 541 | "4 female\n", 542 | "Name: sex, dtype: object" 543 | ] 544 | }, 545 | "execution_count": 13, 546 | "metadata": {}, 547 | "output_type": "execute_result" 548 | } 549 | ], 550 | "source": [ 551 | "tips.sex.str.lower().head()" 552 | ] 553 | }, 554 | { 555 | "cell_type": "markdown", 556 | "metadata": {}, 557 | "source": [ 558 | "## NaNs\n", 559 | "\n", 560 | "There are three that are pretty useful:\n", 561 | "\n", 562 | "* isna\n", 563 | "* fillna\n", 564 | "* dropna\n", 565 | "\n", 566 | "They are all pretty self-explanatory, but it is nice to know that they exist." 567 | ] 568 | }, 569 | { 570 | "cell_type": "code", 571 | "execution_count": 16, 572 | "metadata": {}, 573 | "outputs": [ 574 | { 575 | "data": { 576 | "text/plain": [ 577 | "total_bill 0\n", 578 | "tip 0\n", 579 | "sex 0\n", 580 | "smoker 0\n", 581 | "day 0\n", 582 | "time 0\n", 583 | "size 0\n", 584 | "dtype: int64" 585 | ] 586 | }, 587 | "execution_count": 16, 588 | "metadata": {}, 589 | "output_type": "execute_result" 590 | } 591 | ], 592 | "source": [ 593 | "tips.isna().sum()" 594 | ] 595 | }, 596 | { 597 | "cell_type": "code", 598 | "execution_count": 17, 599 | "metadata": {}, 600 | "outputs": [], 601 | "source": [ 602 | "tips['tip'] = tips['tip'].fillna(0)" 603 | ] 604 | }, 605 | { 606 | "cell_type": "code", 607 | "execution_count": 18, 608 | "metadata": {}, 609 | "outputs": [], 610 | "source": [ 611 | "tips.dropna(axis=1, how='any', inplace=True)" 612 | ] 613 | }, 614 | { 615 | "cell_type": "markdown", 616 | "metadata": {}, 617 | "source": [ 618 | "## corr\n", 619 | "\n", 620 | "Calculate 
correlation. Pretty straightforward\n", 621 | "\n", 622 | "http://pandas.pydata.org/pandas-docs/stable/user_guide/computation.html#correlation" 623 | ] 624 | }, 625 | { 626 | "cell_type": "code", 627 | "execution_count": 19, 628 | "metadata": {}, 629 | "outputs": [ 630 | { 631 | "data": { 632 | "text/html": [ 633 | "
\n", 634 | "\n", 647 | "\n", 648 | " \n", 649 | " \n", 650 | " \n", 651 | " \n", 652 | " \n", 653 | " \n", 654 | " \n", 655 | " \n", 656 | " \n", 657 | " \n", 658 | " \n", 659 | " \n", 660 | " \n", 661 | " \n", 662 | " \n", 663 | " \n", 664 | " \n", 665 | " \n", 666 | " \n", 667 | "
            tip       total_bill
tip         1.000000  0.675734
total_bill  0.675734  1.000000
\n", 668 | "
" 669 | ], 670 | "text/plain": [ 671 | " tip total_bill\n", 672 | "tip 1.000000 0.675734\n", 673 | "total_bill 0.675734 1.000000" 674 | ] 675 | }, 676 | "execution_count": 19, 677 | "metadata": {}, 678 | "output_type": "execute_result" 679 | } 680 | ], 681 | "source": [ 682 | "tips[['tip', 'total_bill']].corr('pearson')" 683 | ] 684 | }, 685 | { 686 | "cell_type": "markdown", 687 | "metadata": {}, 688 | "source": [ 689 | "## rank\n", 690 | "\n", 691 | "This will calculate what rank each entry is in the column.\n", 692 | "\n", 693 | "http://pandas.pydata.org/pandas-docs/stable/user_guide/computation.html#data-ranking" 694 | ] 695 | }, 696 | { 697 | "cell_type": "code", 698 | "execution_count": 20, 699 | "metadata": {}, 700 | "outputs": [ 701 | { 702 | "data": { 703 | "text/plain": [ 704 | "0 5.0\n", 705 | "1 33.0\n", 706 | "2 177.0\n", 707 | "3 165.0\n", 708 | "4 185.0\n", 709 | "Name: tip, dtype: float64" 710 | ] 711 | }, 712 | "execution_count": 20, 713 | "metadata": {}, 714 | "output_type": "execute_result" 715 | } 716 | ], 717 | "source": [ 718 | "tips.tip.rank().head()" 719 | ] 720 | }, 721 | { 722 | "cell_type": "markdown", 723 | "metadata": {}, 724 | "source": [ 725 | "## rename\n", 726 | "\n", 727 | "Rename while not completely needed, is a nice convienience funtion. You can rename columns or indexes.\n", 728 | "\n", 729 | "http://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#renaming-mapping-labels" 730 | ] 731 | }, 732 | { 733 | "cell_type": "code", 734 | "execution_count": 21, 735 | "metadata": {}, 736 | "outputs": [], 737 | "source": [ 738 | "tips.rename(columns={'total_bill': 'bill'}, inplace=True)" 739 | ] 740 | }, 741 | { 742 | "cell_type": "markdown", 743 | "metadata": {}, 744 | "source": [ 745 | "## itertuples\n", 746 | "\n", 747 | "There are a couple of iteraters for dataframes. I would very much so caution you to not use these unless you are really sure that you know what you are doing. 
These are much slower than the vectorized functions, but when working with a small dataframe they can be really useful.\n", 748 | "\n", 749 | "http://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#itertuples" 750 | ] 751 | }, 752 | { 753 | "cell_type": "code", 754 | "execution_count": 22, 755 | "metadata": {}, 756 | "outputs": [ 757 | { 758 | "name": "stdout", 759 | "output_type": "stream", 760 | "text": [ 761 | "Pandas(Index=0, bill=16.99, tip=1.01, sex='Female', smoker='No', day='Sun', time='Dinner', size=2)\n" 762 | ] 763 | } 764 | ], 765 | "source": [ 766 | "for tup in tips.itertuples():\n", 767 | "    print(tup)\n", 768 | "    break" 769 | ] 770 | }, 771 | { 772 | "cell_type": "markdown", 773 | "metadata": {}, 774 | "source": [ 775 | "## Conclusion\n", 776 | "\n", 777 | "I hope this has been interesting. These are the functions that I use most (other than the functions I demonstrated in the other notebooks).\n", 778 | "\n", 779 | "There are a couple of other things that I have not gone over, but if I get enough interest I'd be happy to cover:\n", 780 | "\n", 781 | "\n", 782 | "* timeseries \n", 783 | "* io\n", 784 | "* performance\n", 785 | "\n", 786 | "Please let me know if these interest you! 
And at this point you should be ready for all the exercises listed [here](https://github.com/guipsamora/pandas_exercises#merge)" 787 | ] 788 | }, 789 | { 790 | "cell_type": "code", 791 | "execution_count": 23, 792 | "metadata": {}, 793 | "outputs": [ 794 | { 795 | "name": "stdout", 796 | "output_type": "stream", 797 | "text": [ 798 | "Combining DataFrames.ipynb README.md\r\n", 799 | "Group Operations.ipynb Row-Column Transformations.ipynb\r\n", 800 | "Indexing and Selecting.ipynb \u001b[34menv\u001b[m\u001b[m/\r\n", 801 | "Misc Functions.ipynb requirements.txt\r\n", 802 | "Pandas Intro to Data Structures.ipynb\r\n" 803 | ] 804 | } 805 | ], 806 | "source": [ 807 | "ls" 808 | ] 809 | }, 810 | { 811 | "cell_type": "code", 812 | "execution_count": null, 813 | "metadata": {}, 814 | "outputs": [], 815 | "source": [] 816 | } 817 | ], 818 | "metadata": { 819 | "kernelspec": { 820 | "display_name": "Python 3", 821 | "language": "python", 822 | "name": "python3" 823 | }, 824 | "language_info": { 825 | "codemirror_mode": { 826 | "name": "ipython", 827 | "version": 3 828 | }, 829 | "file_extension": ".py", 830 | "mimetype": "text/x-python", 831 | "name": "python", 832 | "nbconvert_exporter": "python", 833 | "pygments_lexer": "ipython3", 834 | "version": "3.7.3" 835 | } 836 | }, 837 | "nbformat": 4, 838 | "nbformat_minor": 2 839 | } 840 | -------------------------------------------------------------------------------- /notebooks/Row-Column Transformations.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import seaborn as sns\n", 10 | "import pandas as pd\n", 11 | "import numpy as np" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "# Pandas Row-Column Transformations\n", 19 | "\n", 20 | "There comes a time in the life of any data scientist when he or she 
needs to transform the set of columns in a dataset into rows and vice versa.\n", 21 | "\n", 22 | "This is not a common operation, but it does happen every now and then. Pandas has two sets of methods to do this:\n", 23 | "\n", 24 | "* stack and unstack\n", 25 | "* pivot and melt\n", 26 | "\n", 27 | "Again, these two sets of methods basically do the same thing.\n", 28 | "\n", 29 | "\n", 30 | "I have found that stack and unstack are much more stable but a bit less powerful. So those are the ones I use. \n", 31 | "\n", 32 | "Right at the end we will go over pandas dummy variables, the last way to make this kind of transformation. \n", 33 | "\n", 34 | "Check out the full documentation for both [stack and unstack](http://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html) and [dummy variables](http://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html#computing-indicator-dummy-variables), but be warned it is a bit long :)" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": {}, 40 | "source": [ 41 | "Okay, let's get started" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": 2, 47 | "metadata": {}, 48 | "outputs": [ 49 | { 50 | "data": { 51 | "text/html": [ 52 | "
\n", 53 | "\n", 66 | "\n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | "
   total_bill  tip   sex     smoker  day  time    size
0  16.99       1.01  Female  No      Sun  Dinner  2
1  10.34       1.66  Male    No      Sun  Dinner  3
2  21.01       3.50  Male    No      Sun  Dinner  3
\n", 112 | "
" 113 | ], 114 | "text/plain": [ 115 | " total_bill tip sex smoker day time size\n", 116 | "0 16.99 1.01 Female No Sun Dinner 2\n", 117 | "1 10.34 1.66 Male No Sun Dinner 3\n", 118 | "2 21.01 3.50 Male No Sun Dinner 3" 119 | ] 120 | }, 121 | "execution_count": 2, 122 | "metadata": {}, 123 | "output_type": "execute_result" 124 | } 125 | ], 126 | "source": [ 127 | "tips = sns.load_dataset('tips')\n", 128 | "tips.head(3)" 129 | ] 130 | }, 131 | { 132 | "cell_type": "markdown", 133 | "metadata": {}, 134 | "source": [ 135 | "A question we might want to ask is: what is the male to female ratio on different days of the week?\n", 136 | "\n", 137 | "To do this we might start with a groupby:" 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": 4, 143 | "metadata": {}, 144 | "outputs": [ 145 | { 146 | "data": { 147 | "text/html": [ 148 | "
\n", 149 | "\n", 162 | "\n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | "
              size
day   sex
Thur  Male    73
      Female  79
Fri   Male    21
      Female  19
Sat   Male    156
      Female  63
Sun   Male    163
      Female  53
\n", 214 | "
" 215 | ], 216 | "text/plain": [ 217 | " size\n", 218 | "day sex \n", 219 | "Thur Male 73\n", 220 | " Female 79\n", 221 | "Fri Male 21\n", 222 | " Female 19\n", 223 | "Sat Male 156\n", 224 | " Female 63\n", 225 | "Sun Male 163\n", 226 | " Female 53" 227 | ] 228 | }, 229 | "execution_count": 4, 230 | "metadata": {}, 231 | "output_type": "execute_result" 232 | } 233 | ], 234 | "source": [ 235 | "tips_gb = tips.groupby(['day', 'sex']).agg({'size': 'sum'})\n", 236 | "tips_gb" 237 | ] 238 | }, 239 | { 240 | "cell_type": "markdown", 241 | "metadata": {}, 242 | "source": [ 243 | "So we are getting somewhere, but it is a bit hard to tell the number of male and female visitors by looking at it, and you might want to do more columnwise operations comparing the male to the female visitors.\n", 244 | "\n", 245 | "So what you might want to do is take the values in the column sex and make them into column. This is where unstacking comes in!\n", 246 | "\n", 247 | "## Unstack" 248 | ] 249 | }, 250 | { 251 | "cell_type": "code", 252 | "execution_count": 14, 253 | "metadata": {}, 254 | "outputs": [ 255 | { 256 | "data": { 257 | "text/html": [ 258 | "
\n", 259 | "\n", 276 | "\n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | "
      size
sex   Male  Female
day
Thur  73    79
Fri   21    19
Sat   156   63
Sun   163   53
\n", 316 | "
" 317 | ], 318 | "text/plain": [ 319 | " size \n", 320 | "sex Male Female\n", 321 | "day \n", 322 | "Thur 73 79\n", 323 | "Fri 21 19\n", 324 | "Sat 156 63\n", 325 | "Sun 163 53" 326 | ] 327 | }, 328 | "execution_count": 14, 329 | "metadata": {}, 330 | "output_type": "execute_result" 331 | } 332 | ], 333 | "source": [ 334 | "tips_us = tips_gb.unstack()\n", 335 | "tips_us" 336 | ] 337 | }, 338 | { 339 | "cell_type": "markdown", 340 | "metadata": {}, 341 | "source": [ 342 | "Notice we basically moved an index to the columns!" 343 | ] 344 | }, 345 | { 346 | "cell_type": "code", 347 | "execution_count": 15, 348 | "metadata": {}, 349 | "outputs": [ 350 | { 351 | "data": { 352 | "text/html": [ 353 | "
\n", 354 | "\n", 371 | "\n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | "
        size
day     Thur  Fri  Sat  Sun
sex
Male    73    21   156  163
Female  79    19   63   53
\n", 409 | "
" 410 | ], 411 | "text/plain": [ 412 | " size \n", 413 | "day Thur Fri Sat Sun\n", 414 | "sex \n", 415 | "Male 73 21 156 163\n", 416 | "Female 79 19 63 53" 417 | ] 418 | }, 419 | "execution_count": 15, 420 | "metadata": {}, 421 | "output_type": "execute_result" 422 | } 423 | ], 424 | "source": [ 425 | "# you could do the same with the days of the week\n", 426 | "tips_gb.unstack(0)" 427 | ] 428 | }, 429 | { 430 | "cell_type": "markdown", 431 | "metadata": {}, 432 | "source": [ 433 | "The problem is that now we have this odd new object as the columns:" 434 | ] 435 | }, 436 | { 437 | "cell_type": "code", 438 | "execution_count": 16, 439 | "metadata": {}, 440 | "outputs": [ 441 | { 442 | "data": { 443 | "text/plain": [ 444 | "MultiIndex(levels=[['size'], ['Male', 'Female']],\n", 445 | " codes=[[0, 0], [0, 1]],\n", 446 | " names=[None, 'sex'])" 447 | ] 448 | }, 449 | "execution_count": 16, 450 | "metadata": {}, 451 | "output_type": "execute_result" 452 | } 453 | ], 454 | "source": [ 455 | "tips_us.columns" 456 | ] 457 | }, 458 | { 459 | "cell_type": "markdown", 460 | "metadata": {}, 461 | "source": [ 462 | "And while you can do things with it:" 463 | ] 464 | }, 465 | { 466 | "cell_type": "code", 467 | "execution_count": 17, 468 | "metadata": {}, 469 | "outputs": [ 470 | { 471 | "data": { 472 | "text/html": [ 473 | "
\n", 474 | "\n", 491 | "\n", 492 | " \n", 493 | " \n", 494 | " \n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | " \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " \n", 506 | " \n", 507 | " \n", 508 | " \n", 509 | " \n", 510 | " \n", 511 | " \n", 512 | " \n", 513 | " \n", 514 | " \n", 515 | " \n", 516 | " \n", 517 | " \n", 518 | " \n", 519 | " \n", 520 | " \n", 521 | " \n", 522 | " \n", 523 | " \n", 524 | "
      size
sex   Male
day
Thur  73
Fri   21
Sat   156
Sun   163
\n", 525 | "
" 526 | ], 527 | "text/plain": [ 528 | " size\n", 529 | "sex Male\n", 530 | "day \n", 531 | "Thur 73\n", 532 | "Fri 21\n", 533 | "Sat 156\n", 534 | "Sun 163" 535 | ] 536 | }, 537 | "execution_count": 17, 538 | "metadata": {}, 539 | "output_type": "execute_result" 540 | } 541 | ], 542 | "source": [ 543 | "tips_us[[('size', 'Male')]]" 544 | ] 545 | }, 546 | { 547 | "cell_type": "markdown", 548 | "metadata": {}, 549 | "source": [ 550 | "I find it a bit annoying to memorize a separate set of syntax, so I always convert it with a line of code like so (ps I wish this were in pandas core):" 551 | ] 552 | }, 553 | { 554 | "cell_type": "code", 555 | "execution_count": 18, 556 | "metadata": {}, 557 | "outputs": [], 558 | "source": [ 559 | "tips_us_copy = tips_us.copy()\n", 560 | "\n", 561 | "tips_us_copy.columns = ['__'.join(col).strip() for col in tips_us.columns.values]" 562 | ] 563 | }, 564 | { 565 | "cell_type": "code", 566 | "execution_count": 19, 567 | "metadata": {}, 568 | "outputs": [ 569 | { 570 | "data": { 571 | "text/html": [ 572 | "
\n", 573 | "\n", 586 | "\n", 587 | " \n", 588 | " \n", 589 | " \n", 590 | " \n", 591 | " \n", 592 | " \n", 593 | " \n", 594 | " \n", 595 | " \n", 596 | " \n", 597 | " \n", 598 | " \n", 599 | " \n", 600 | " \n", 601 | " \n", 602 | " \n", 603 | " \n", 604 | " \n", 605 | " \n", 606 | " \n", 607 | " \n", 608 | " \n", 609 | " \n", 610 | " \n", 611 | " \n", 612 | " \n", 613 | " \n", 614 | " \n", 615 | " \n", 616 | " \n", 617 | " \n", 618 | " \n", 619 | " \n", 620 | " \n", 621 | "
      size__Male  size__Female
day
Thur  73          79
Fri   21          19
Sat   156         63
Sun   163         53
\n", 622 | "
" 623 | ], 624 | "text/plain": [ 625 | " size__Male size__Female\n", 626 | "day \n", 627 | "Thur 73 79\n", 628 | "Fri 21 19\n", 629 | "Sat 156 63\n", 630 | "Sun 163 53" 631 | ] 632 | }, 633 | "execution_count": 19, 634 | "metadata": {}, 635 | "output_type": "execute_result" 636 | } 637 | ], 638 | "source": [ 639 | "tips_us_copy" 640 | ] 641 | }, 642 | { 643 | "cell_type": "markdown", 644 | "metadata": {}, 645 | "source": [ 646 | "You can of course repeat that operation as many times as you need to get the desired granularity of columns. \n", 647 | "\n", 648 | "But now let's try out the reverse operation. This is useful if somebody gives you data in pivot form.\n", 649 | "\n", 650 | "## Stack" 651 | ] 652 | }, 653 | { 654 | "cell_type": "code", 655 | "execution_count": 20, 656 | "metadata": {}, 657 | "outputs": [ 658 | { 659 | "data": { 660 | "text/html": [ 661 | "
\n", 662 | "\n", 675 | "\n", 676 | " \n", 677 | " \n", 678 | " \n", 679 | " \n", 680 | " \n", 681 | " \n", 682 | " \n", 683 | " \n", 684 | " \n", 685 | " \n", 686 | " \n", 687 | " \n", 688 | " \n", 689 | " \n", 690 | " \n", 691 | " \n", 692 | " \n", 693 | " \n", 694 | " \n", 695 | " \n", 696 | " \n", 697 | " \n", 698 | " \n", 699 | " \n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | " \n", 712 | " \n", 713 | " \n", 714 | " \n", 715 | " \n", 716 | " \n", 717 | " \n", 718 | " \n", 719 | " \n", 720 | " \n", 721 | " \n", 722 | " \n", 723 | " \n", 724 | " \n", 725 | " \n", 726 | "
              size
day   sex
Thur  Male    73
      Female  79
Fri   Male    21
      Female  19
Sat   Male    156
      Female  63
Sun   Male    163
      Female  53
\n", 727 | "
" 728 | ], 729 | "text/plain": [ 730 | " size\n", 731 | "day sex \n", 732 | "Thur Male 73\n", 733 | " Female 79\n", 734 | "Fri Male 21\n", 735 | " Female 19\n", 736 | "Sat Male 156\n", 737 | " Female 63\n", 738 | "Sun Male 163\n", 739 | " Female 53" 740 | ] 741 | }, 742 | "execution_count": 20, 743 | "metadata": {}, 744 | "output_type": "execute_result" 745 | } 746 | ], 747 | "source": [ 748 | "tips_us.stack()" 749 | ] 750 | }, 751 | { 752 | "cell_type": "markdown", 753 | "metadata": {}, 754 | "source": [ 755 | "Again you can unstack either column index:" 756 | ] 757 | }, 758 | { 759 | "cell_type": "code", 760 | "execution_count": 22, 761 | "metadata": {}, 762 | "outputs": [ 763 | { 764 | "data": { 765 | "text/html": [ 766 | "
\n", 767 | "\n", 780 | "\n", 781 | " \n", 782 | " \n", 783 | " \n", 784 | " \n", 785 | " \n", 786 | " \n", 787 | " \n", 788 | " \n", 789 | " \n", 790 | " \n", 791 | " \n", 792 | " \n", 793 | " \n", 794 | " \n", 795 | " \n", 796 | " \n", 797 | " \n", 798 | " \n", 799 | " \n", 800 | " \n", 801 | " \n", 802 | " \n", 803 | " \n", 804 | " \n", 805 | " \n", 806 | " \n", 807 | " \n", 808 | " \n", 809 | " \n", 810 | " \n", 811 | " \n", 812 | " \n", 813 | " \n", 814 | " \n", 815 | " \n", 816 | " \n", 817 | " \n", 818 | " \n", 819 | " \n", 820 | " \n", 821 | "
sex         Male  Female
day
Thur  size  73    79
Fri   size  21    19
Sat   size  156   63
Sun   size  163   53
\n", 822 | "
" 823 | ], 824 | "text/plain": [ 825 | "sex Male Female\n", 826 | "day \n", 827 | "Thur size 73 79\n", 828 | "Fri size 21 19\n", 829 | "Sat size 156 63\n", 830 | "Sun size 163 53" 831 | ] 832 | }, 833 | "execution_count": 22, 834 | "metadata": {}, 835 | "output_type": "execute_result" 836 | } 837 | ], 838 | "source": [ 839 | "tips_us.stack(0)" 840 | ] 841 | }, 842 | { 843 | "cell_type": "markdown", 844 | "metadata": {}, 845 | "source": [ 846 | "## What about Melting and Pivoting?\n", 847 | "\n", 848 | "That is about it when it comes to stacking and unstacking. Anything you can do with melting and pivoting can be done with stacking and unstacking. Let's do a single example from pandas:" 849 | ] 850 | }, 851 | { 852 | "cell_type": "code", 853 | "execution_count": 26, 854 | "metadata": {}, 855 | "outputs": [ 856 | { 857 | "data": { 858 | "text/html": [ 859 | "
\n", 860 | "\n", 873 | "\n", 874 | " \n", 875 | " \n", 876 | " \n", 877 | " \n", 878 | " \n", 879 | " \n", 880 | " \n", 881 | " \n", 882 | " \n", 883 | " \n", 884 | " \n", 885 | " \n", 886 | " \n", 887 | " \n", 888 | " \n", 889 | " \n", 890 | " \n", 891 | " \n", 892 | " \n", 893 | " \n", 894 | " \n", 895 | " \n", 896 | " \n", 897 | " \n", 898 | " \n", 899 | "
   first  last  height  weight
0  John   Doe   5.5     130
1  Mary   Bo    6.0     150
\n", 900 | "
" 901 | ], 902 | "text/plain": [ 903 | " first last height weight\n", 904 | "0 John Doe 5.5 130\n", 905 | "1 Mary Bo 6.0 150" 906 | ] 907 | }, 908 | "execution_count": 26, 909 | "metadata": {}, 910 | "output_type": "execute_result" 911 | } 912 | ], 913 | "source": [ 914 | "cheese = pd.DataFrame({'first': ['John', 'Mary'],\n", 915 | " 'last': ['Doe', 'Bo'],\n", 916 | " 'height': [5.5, 6.0],\n", 917 | " 'weight': [130, 150]})\n", 918 | "cheese" 919 | ] 920 | }, 921 | { 922 | "cell_type": "code", 923 | "execution_count": 27, 924 | "metadata": {}, 925 | "outputs": [ 926 | { 927 | "data": { 928 | "text/html": [ 929 | "
\n", 930 | "\n", 943 | "\n", 944 | " \n", 945 | " \n", 946 | " \n", 947 | " \n", 948 | " \n", 949 | " \n", 950 | " \n", 951 | " \n", 952 | " \n", 953 | " \n", 954 | " \n", 955 | " \n", 956 | " \n", 957 | " \n", 958 | " \n", 959 | " \n", 960 | " \n", 961 | " \n", 962 | " \n", 963 | " \n", 964 | " \n", 965 | " \n", 966 | " \n", 967 | " \n", 968 | " \n", 969 | " \n", 970 | " \n", 971 | " \n", 972 | " \n", 973 | " \n", 974 | " \n", 975 | " \n", 976 | " \n", 977 | " \n", 978 | " \n", 979 | " \n", 980 | " \n", 981 | " \n", 982 | " \n", 983 | "
   first  last  variable  value
0  John   Doe   height    5.5
1  Mary   Bo    height    6.0
2  John   Doe   weight    130.0
3  Mary   Bo    weight    150.0
\n", 984 | "
" 985 | ], 986 | "text/plain": [ 987 | " first last variable value\n", 988 | "0 John Doe height 5.5\n", 989 | "1 Mary Bo height 6.0\n", 990 | "2 John Doe weight 130.0\n", 991 | "3 Mary Bo weight 150.0" 992 | ] 993 | }, 994 | "execution_count": 27, 995 | "metadata": {}, 996 | "output_type": "execute_result" 997 | } 998 | ], 999 | "source": [ 1000 | "# melt does stacking in one operation\n", 1001 | "cheese.melt(id_vars=['first', 'last'])" 1002 | ] 1003 | }, 1004 | { 1005 | "cell_type": "markdown", 1006 | "metadata": {}, 1007 | "source": [ 1008 | "To do this with stacking we just need to do it in two steps:" 1009 | ] 1010 | }, 1011 | { 1012 | "cell_type": "code", 1013 | "execution_count": 28, 1014 | "metadata": {}, 1015 | "outputs": [ 1016 | { 1017 | "data": { 1018 | "text/html": [ 1019 | "
\n", 1020 | "\n", 1033 | "\n", 1034 | " \n", 1035 | " \n", 1036 | " \n", 1037 | " \n", 1038 | " \n", 1039 | " \n", 1040 | " \n", 1041 | " \n", 1042 | " \n", 1043 | " \n", 1044 | " \n", 1045 | " \n", 1046 | " \n", 1047 | " \n", 1048 | " \n", 1049 | " \n", 1050 | " \n", 1051 | " \n", 1052 | " \n", 1053 | " \n", 1054 | " \n", 1055 | " \n", 1056 | " \n", 1057 | " \n", 1058 | " \n", 1059 | " \n", 1060 | " \n", 1061 | " \n", 1062 | " \n", 1063 | " \n", 1064 | " \n", 1065 | " \n", 1066 | " \n", 1067 | " \n", 1068 | " \n", 1069 | " \n", 1070 | " \n", 1071 | " \n", 1072 | " \n", 1073 | "
   first  last  level_2  0
0  John   Doe   height   5.5
1  John   Doe   weight   130.0
2  Mary   Bo    height   6.0
3  Mary   Bo    weight   150.0
\n", 1074 | "
" 1075 | ], 1076 | "text/plain": [ 1077 | " first last level_2 0\n", 1078 | "0 John Doe height 5.5\n", 1079 | "1 John Doe weight 130.0\n", 1080 | "2 Mary Bo height 6.0\n", 1081 | "3 Mary Bo weight 150.0" 1082 | ] 1083 | }, 1084 | "execution_count": 28, 1085 | "metadata": {}, 1086 | "output_type": "execute_result" 1087 | } 1088 | ], 1089 | "source": [ 1090 | "cheese.set_index(['first', 'last'], inplace=True)\n", 1091 | "cheese.stack().reset_index()" 1092 | ] 1093 | }, 1094 | { 1095 | "cell_type": "markdown", 1096 | "metadata": {}, 1097 | "source": [ 1098 | "I have used melt and pivot before, but after getting a better understanding of stack and unstack I have found them more versitile and stable than the former. So why learn both!" 1099 | ] 1100 | }, 1101 | { 1102 | "cell_type": "markdown", 1103 | "metadata": {}, 1104 | "source": [ 1105 | "## Dummy Variables\n", 1106 | "\n", 1107 | "There is one final way to transform the values in a column into headers, and this is called making dummy vars (well not quite, if you are interested in more ways to do it you can check out my [YT video](https://www.youtube.com/watch?v=WRxHfnl-Pcs&t=2s)).\n", 1108 | "\n", 1109 | "Making a dummy variable will take all the `k` distinct values in one column and make `k` columns out of them. \n", 1110 | "\n", 1111 | "Let's look at an example below:" 1112 | ] 1113 | }, 1114 | { 1115 | "cell_type": "code", 1116 | "execution_count": 31, 1117 | "metadata": {}, 1118 | "outputs": [ 1119 | { 1120 | "data": { 1121 | "text/html": [ 1122 | "
\n", 1123 | "\n", 1136 | "\n", 1137 | " \n", 1138 | " \n", 1139 | " \n", 1140 | " \n", 1141 | " \n", 1142 | " \n", 1143 | " \n", 1144 | " \n", 1145 | " \n", 1146 | " \n", 1147 | " \n", 1148 | " \n", 1149 | " \n", 1150 | " \n", 1151 | " \n", 1152 | " \n", 1153 | " \n", 1154 | " \n", 1155 | " \n", 1156 | " \n", 1157 | " \n", 1158 | " \n", 1159 | " \n", 1160 | " \n", 1161 | " \n", 1162 | " \n", 1163 | " \n", 1164 | " \n", 1165 | " \n", 1166 | " \n", 1167 | " \n", 1168 | " \n", 1169 | " \n", 1170 | " \n", 1171 | " \n", 1172 | " \n", 1173 | " \n", 1174 | " \n", 1175 | " \n", 1176 | " \n", 1177 | " \n", 1178 | " \n", 1179 | " \n", 1180 | " \n", 1181 | " \n", 1182 | " \n", 1183 | " \n", 1184 | " \n", 1185 | " \n", 1186 | " \n", 1187 | " \n", 1188 | " \n", 1189 | " \n", 1190 | " \n", 1191 | " \n", 1192 | " \n", 1193 | " \n", 1194 | " \n", 1195 | " \n", 1196 | " \n", 1197 | " \n", 1198 | " \n", 1199 | " \n", 1200 | " \n", 1201 | "
   total_bill  tip   sex     smoker  day  time    size
0  16.99       1.01  Female  No      Sun  Dinner  2
1  10.34       1.66  Male    No      Sun  Dinner  3
2  21.01       3.50  Male    No      Sun  Dinner  3
3  23.68       3.31  Male    No      Sun  Dinner  2
4  24.59       3.61  Female  No      Sun  Dinner  4
\n", 1202 | "
" 1203 | ], 1204 | "text/plain": [ 1205 | " total_bill tip sex smoker day time size\n", 1206 | "0 16.99 1.01 Female No Sun Dinner 2\n", 1207 | "1 10.34 1.66 Male No Sun Dinner 3\n", 1208 | "2 21.01 3.50 Male No Sun Dinner 3\n", 1209 | "3 23.68 3.31 Male No Sun Dinner 2\n", 1210 | "4 24.59 3.61 Female No Sun Dinner 4" 1211 | ] 1212 | }, 1213 | "execution_count": 31, 1214 | "metadata": {}, 1215 | "output_type": "execute_result" 1216 | } 1217 | ], 1218 | "source": [ 1219 | "tips.head()" 1220 | ] 1221 | }, 1222 | { 1223 | "cell_type": "code", 1224 | "execution_count": 30, 1225 | "metadata": {}, 1226 | "outputs": [ 1227 | { 1228 | "data": { 1229 | "text/html": [ 1230 | "
\n", 1231 | "\n", 1244 | "\n", 1245 | " \n", 1246 | " \n", 1247 | " \n", 1248 | " \n", 1249 | " \n", 1250 | " \n", 1251 | " \n", 1252 | " \n", 1253 | " \n", 1254 | " \n", 1255 | " \n", 1256 | " \n", 1257 | " \n", 1258 | " \n", 1259 | " \n", 1260 | " \n", 1261 | " \n", 1262 | " \n", 1263 | " \n", 1264 | " \n", 1265 | " \n", 1266 | " \n", 1267 | " \n", 1268 | " \n", 1269 | " \n", 1270 | " \n", 1271 | " \n", 1272 | " \n", 1273 | " \n", 1274 | " \n", 1275 | " \n", 1276 | " \n", 1277 | " \n", 1278 | " \n", 1279 | " \n", 1280 | " \n", 1281 | " \n", 1282 | " \n", 1283 | " \n", 1284 | " \n", 1285 | " \n", 1286 | " \n", 1287 | " \n", 1288 | " \n", 1289 | " \n", 1290 | " \n", 1291 | " \n", 1292 | " \n", 1293 | " \n", 1294 | " \n", 1295 | " \n", 1296 | " \n", 1297 | " \n", 1298 | " \n", 1299 | " \n", 1300 | " \n", 1301 | " \n", 1302 | " \n", 1303 | " \n", 1304 | " \n", 1305 | " \n", 1306 | " \n", 1307 | " \n", 1308 | " \n", 1309 | " \n", 1310 | " \n", 1311 | " \n", 1312 | " \n", 1313 | " \n", 1314 | " \n", 1315 | "
   total_bill  tip   smoker  day  time    size  sex_Male  sex_Female
0  16.99       1.01  No      Sun  Dinner  2     0         1
1  10.34       1.66  No      Sun  Dinner  3     1         0
2  21.01       3.50  No      Sun  Dinner  3     1         0
3  23.68       3.31  No      Sun  Dinner  2     1         0
4  24.59       3.61  No      Sun  Dinner  4     0         1
\n", 1316 | "
" 1317 | ], 1318 | "text/plain": [ 1319 | " total_bill tip smoker day time size sex_Male sex_Female\n", 1320 | "0 16.99 1.01 No Sun Dinner 2 0 1\n", 1321 | "1 10.34 1.66 No Sun Dinner 3 1 0\n", 1322 | "2 21.01 3.50 No Sun Dinner 3 1 0\n", 1323 | "3 23.68 3.31 No Sun Dinner 2 1 0\n", 1324 | "4 24.59 3.61 No Sun Dinner 4 0 1" 1325 | ] 1326 | }, 1327 | "execution_count": 30, 1328 | "metadata": {}, 1329 | "output_type": "execute_result" 1330 | } 1331 | ], 1332 | "source": [ 1333 | "pd.get_dummies(tips.head(), columns=['sex'])" 1334 | ] 1335 | }, 1336 | { 1337 | "cell_type": "markdown", 1338 | "metadata": {}, 1339 | "source": [ 1340 | "Notice the sex column was split into the sex_Male and sex_Female column. When the sex is female the sex_Female is 1 and 0 otherwise. And similarly for the sex_Male column.\n", 1341 | "\n", 1342 | "This can be very useful for ML models and doing some types of analysis." 1343 | ] 1344 | }, 1345 | { 1346 | "cell_type": "markdown", 1347 | "metadata": {}, 1348 | "source": [ 1349 | "## Conclusion\n", 1350 | "\n", 1351 | "These three ways to transform rows to columns and back again have served me quite well, and I'd be surprised if you'd need anything more than these. \n", 1352 | "\n", 1353 | "They are pretty intuitive, so you might not need to do too much practice. I actually don't know a good exercise for these guys as well - so if somebody has a good one they know of please send it over. 
" 1354 | ] 1355 | }, 1356 | { 1357 | "cell_type": "code", 1358 | "execution_count": null, 1359 | "metadata": {}, 1360 | "outputs": [], 1361 | "source": [] 1362 | }, 1363 | { 1364 | "cell_type": "code", 1365 | "execution_count": null, 1366 | "metadata": {}, 1367 | "outputs": [], 1368 | "source": [] 1369 | } 1370 | ], 1371 | "metadata": { 1372 | "kernelspec": { 1373 | "display_name": "Python 3", 1374 | "language": "python", 1375 | "name": "python3" 1376 | }, 1377 | "language_info": { 1378 | "codemirror_mode": { 1379 | "name": "ipython", 1380 | "version": 3 1381 | }, 1382 | "file_extension": ".py", 1383 | "mimetype": "text/x-python", 1384 | "name": "python", 1385 | "nbconvert_exporter": "python", 1386 | "pygments_lexer": "ipython3", 1387 | "version": "3.7.3" 1388 | } 1389 | }, 1390 | "nbformat": 4, 1391 | "nbformat_minor": 2 1392 | } 1393 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | alabaster==0.7.12 2 | appnope==0.1.0 3 | attrs==19.1.0 4 | Babel==2.6.0 5 | backcall==0.1.0 6 | bleach==3.1.0 7 | certifi==2019.3.9 8 | chardet==3.0.4 9 | cycler==0.10.0 10 | decorator==4.4.0 11 | defusedxml==0.5.0 12 | docutils==0.14 13 | entrypoints==0.3 14 | idna==2.8 15 | imagesize==1.1.0 16 | ipykernel==5.1.0 17 | ipyparallel==6.2.3 18 | ipython==7.4.0 19 | ipython-genutils==0.2.0 20 | ipywidgets==7.4.2 21 | jedi==0.13.3 22 | Jinja2==2.10.1 23 | jsonschema==3.0.1 24 | jupyter-client==5.2.4 25 | jupyter-core==4.4.0 26 | kiwisolver==1.0.1 27 | MarkupSafe==1.1.1 28 | matplotlib==3.0.3 29 | mistune==0.8.4 30 | nbconvert==5.4.1 31 | nbformat==4.4.0 32 | nose==1.3.7 33 | notebook==5.7.8 34 | numpy==1.16.2 35 | packaging==19.0 36 | pandas==0.24.2 37 | pandocfilters==1.4.2 38 | parso==0.4.0 39 | pexpect==4.7.0 40 | pickleshare==0.7.5 41 | prometheus-client==0.6.0 42 | prompt-toolkit==2.0.9 43 | ptyprocess==0.6.0 44 | Pygments==2.3.1 45 | 
pyparsing==2.4.0 46 | pyrsistent==0.14.11 47 | python-dateutil==2.8.0 48 | pytz==2019.1 49 | pyzmq==18.0.1 50 | qtconsole==4.4.3 51 | requests==2.21.0 52 | scipy==1.2.1 53 | seaborn==0.9.0 54 | Send2Trash==1.5.0 55 | six==1.12.0 56 | snowballstemmer==1.2.1 57 | Sphinx==2.0.1 58 | sphinxcontrib-applehelp==1.0.1 59 | sphinxcontrib-devhelp==1.0.1 60 | sphinxcontrib-htmlhelp==1.0.2 61 | sphinxcontrib-jsmath==1.0.1 62 | sphinxcontrib-qthelp==1.0.2 63 | sphinxcontrib-serializinghtml==1.1.3 64 | terminado==0.8.2 65 | testpath==0.4.2 66 | tornado==6.0.2 67 | traitlets==4.3.2 68 | urllib3==1.24.2 69 | wcwidth==0.1.7 70 | webencodings==0.5.1 71 | widgetsnbextension==3.4.2 72 | --------------------------------------------------------------------------------