├── .ipynb_checkpoints └── Pandas Tutorial-checkpoint.ipynb ├── DataStructures.png ├── Pandas Tutorial.ipynb ├── README.md ├── RegularSeasonCompactResults.csv └── result.csv /.ipynb_checkpoints/Pandas Tutorial-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import pandas as pd" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "# Introduction" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "Since I've been working on a lot of Kaggle competitions, I use Pandas a lot. As you may know, Pandas (in addition to NumPy) is the go-to Python library for all your data science needs. It helps with handling input data in CSV format and with transforming your data into a form that can be fed into ML models. However, getting comfortable with the ideas of dataframes, slicing, etc. was very tough for me in the beginning. Hopefully, this short tutorial can show you a lot of different commands that will help you gain the most insight into your dataset. " 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "# Loading in Data" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "metadata": {}, 38 | "source": [ 39 | "The first step in any ML problem is identifying what format your data is in, and then loading it into whatever framework you're using. For Kaggle competitions, a lot of data can be found in CSV files, so that's the example we're going to use. " 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "Since I'm a huge sports fan, we're going to be looking at a sports dataset that shows the results from NCAA basketball games from 1985 to 2016. 
This dataset is in a CSV file, and the function we're going to use to read in the file is called **pd.read_csv()**. This function returns a **dataframe**. The dataframe is the crown jewel data structure of Pandas. It is defined as \"a two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns)\"." 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "metadata": {}, 52 | "source": [ 53 | "Just think of it as a table for now. We'll explain more about what makes it unique later on. " 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 2, 59 | "metadata": { 60 | "collapsed": false 61 | }, 62 | "outputs": [], 63 | "source": [ 64 | "df = pd.read_csv('RegularSeasonCompactResults.csv')" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": {}, 70 | "source": [ 71 | "Now that we have our dataframe in our variable df, let's look at what it contains. We can use the function **head()** to see the first few rows of the dataframe (five by default), or the function **tail()** to see the last few rows." 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 3, 77 | "metadata": { 78 | "collapsed": false 79 | }, 80 | "outputs": [ 81 | { 82 | "data": { 83 | "text/html": [ 84 | "
" 158 | ], 159 | "text/plain": [ 160 | " Season Daynum Wteam Wscore Lteam Lscore Wloc Numot\n", 161 | "0 1985 20 1228 81 1328 64 N 0\n", 162 | "1 1985 25 1106 77 1354 70 H 0\n", 163 | "2 1985 25 1112 63 1223 56 H 0\n", 164 | "3 1985 25 1165 70 1432 54 H 0\n", 165 | "4 1985 25 1192 86 1447 74 H 0" 166 | ] 167 | }, 168 | "execution_count": 3, 169 | "metadata": {}, 170 | "output_type": "execute_result" 171 | } 172 | ], 173 | "source": [ 174 | "df.head()" 175 | ] 176 | }, 177 | { 178 | "cell_type": "markdown", 179 | "metadata": {}, 180 | "source": [ 181 | "We can see the dimensions of the dataframe using the **shape** attribute" 182 | ] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "execution_count": 4, 187 | "metadata": { 188 | "collapsed": false 189 | }, 190 | "outputs": [ 191 | { 192 | "data": { 193 | "text/plain": [ 194 | "(145289, 8)" 195 | ] 196 | }, 197 | "execution_count": 4, 198 | "metadata": {}, 199 | "output_type": "execute_result" 200 | } 201 | ], 202 | "source": [ 203 | "df.shape" 204 | ] 205 | }, 206 | { 207 | "cell_type": "markdown", 208 | "metadata": { 209 | "collapsed": true 210 | }, 211 | "source": [ 212 | "We can also extract all the column names as a list by calling **tolist()** on the **columns** attribute" 213 | ] 214 | }, 215 | { 216 | "cell_type": "code", 217 | "execution_count": 6, 218 | "metadata": { 219 | "collapsed": false 220 | }, 221 | "outputs": [ 222 | { 223 | "data": { 224 | "text/plain": [ 225 | "['Season', 'Daynum', 'Wteam', 'Wscore', 'Lteam', 'Lscore', 'Wloc', 'Numot']" 226 | ] 227 | }, 228 | "execution_count": 6, 229 | "metadata": {}, 230 | "output_type": "execute_result" 231 | } 232 | ], 233 | "source": [ 234 | "df.columns.tolist()" 235 | ] 236 | }, 237 | { 238 | "cell_type": "markdown", 239 | "metadata": {}, 240 | "source": [ 241 | "In order to get a better idea of the type of data that we are dealing with, we can call the **describe()** function to see statistics like mean, min, etc. about each numeric column of the dataset. 
" 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": 10, 247 | "metadata": { 248 | "collapsed": false 249 | }, 250 | "outputs": [ 251 | { 252 | "data": { 253 | "text/html": [ 254 | "
" 352 | ], 353 | "text/plain": [ 354 | " Season Daynum Wteam Wscore \\\n", 355 | "count 145289.000000 145289.000000 145289.000000 145289.000000 \n", 356 | "mean 2001.574834 75.223816 1286.720646 76.600321 \n", 357 | "std 9.233342 33.287418 104.570275 12.173033 \n", 358 | "min 1985.000000 0.000000 1101.000000 34.000000 \n", 359 | "25% 1994.000000 47.000000 1198.000000 68.000000 \n", 360 | "50% 2002.000000 78.000000 1284.000000 76.000000 \n", 361 | "75% 2010.000000 103.000000 1379.000000 84.000000 \n", 362 | "max 2016.000000 132.000000 1464.000000 186.000000 \n", 363 | "\n", 364 | " Lteam Lscore Numot \n", 365 | "count 145289.000000 145289.000000 145289.000000 \n", 366 | "mean 1282.864064 64.497009 0.044387 \n", 367 | "std 104.829234 11.380625 0.247819 \n", 368 | "min 1101.000000 20.000000 0.000000 \n", 369 | "25% 1191.000000 57.000000 0.000000 \n", 370 | "50% 1280.000000 64.000000 0.000000 \n", 371 | "75% 1375.000000 72.000000 0.000000 \n", 372 | "max 1464.000000 150.000000 6.000000 " 373 | ] 374 | }, 375 | "execution_count": 10, 376 | "metadata": {}, 377 | "output_type": "execute_result" 378 | } 379 | ], 380 | "source": [ 381 | "df.describe()" 382 | ] 383 | }, 384 | { 385 | "cell_type": "markdown", 386 | "metadata": {}, 387 | "source": [ 388 | "Okay, so now let's look at information that we want to extract from the dataframe. Let's say I wanted to know the max value of a certain column. 
The function **max()** will show you the maximum values of all columns" 389 | ] 390 | }, 391 | { 392 | "cell_type": "code", 393 | "execution_count": 22, 394 | "metadata": { 395 | "collapsed": false 396 | }, 397 | "outputs": [ 398 | { 399 | "data": { 400 | "text/plain": [ 401 | "Season 2016\n", 402 | "Daynum 132\n", 403 | "Wteam 1464\n", 404 | "Wscore 186\n", 405 | "Lteam 1464\n", 406 | "Lscore 150\n", 407 | "Wloc N\n", 408 | "Numot 6\n", 409 | "dtype: object" 410 | ] 411 | }, 412 | "execution_count": 22, 413 | "metadata": {}, 414 | "output_type": "execute_result" 415 | } 416 | ], 417 | "source": [ 418 | "df.max()" 419 | ] 420 | }, 421 | { 422 | "cell_type": "markdown", 423 | "metadata": {}, 424 | "source": [ 425 | "Then, if you'd like to get the max value for one particular column, you select that column by name using the bracket indexing operator and call **max()** on it" 426 | ] 427 | }, 428 | { 429 | "cell_type": "code", 430 | "execution_count": 24, 431 | "metadata": { 432 | "collapsed": false 433 | }, 434 | "outputs": [ 435 | { 436 | "data": { 437 | "text/plain": [ 438 | "186" 439 | ] 440 | }, 441 | "execution_count": 24, 442 | "metadata": {}, 443 | "output_type": "execute_result" 444 | } 445 | ], 446 | "source": [ 447 | "df['Wscore'].max()" 448 | ] 449 | }, 450 | { 451 | "cell_type": "markdown", 452 | "metadata": {}, 453 | "source": [ 454 | "But what if that's not enough? Let's say we want to actually see the game (row) where this max score happened. 
We can call the **argmax()** function to identify the row index. (In recent pandas versions, argmax() returns the integer position and idxmax() returns the index label; the two coincide here because the dataframe uses the default integer index.)" 455 | ] 456 | }, 457 | { 458 | "cell_type": "code", 459 | "execution_count": 36, 460 | "metadata": { 461 | "collapsed": false 462 | }, 463 | "outputs": [ 464 | { 465 | "data": { 466 | "text/plain": [ 467 | "24970" 468 | ] 469 | }, 470 | "execution_count": 36, 471 | "metadata": {}, 472 | "output_type": "execute_result" 473 | } 474 | ], 475 | "source": [ 476 | "df['Wscore'].argmax()" 477 | ] 478 | }, 479 | { 480 | "cell_type": "markdown", 481 | "metadata": {}, 482 | "source": [ 483 | "Then, in order to get attributes about the game, we need to use the **iloc[]** indexer. iloc is definitely one of the more important tools in Pandas. The main idea is that you want to use it whenever you have the integer position of a certain row that you want to access. As per the Pandas documentation, iloc is an \"integer-location based indexing for selection by position.\"" 484 | ] 485 | }, 486 | { 487 | "cell_type": "code", 488 | "execution_count": 35, 489 | "metadata": { 490 | "collapsed": false 491 | }, 492 | "outputs": [ 493 | { 494 | "data": { 495 | "text/html": [ 496 | "
" 526 | ], 527 | "text/plain": [ 528 | " Season Daynum Wteam Wscore Lteam Lscore Wloc Numot\n", 529 | "24970 1991 68 1258 186 1109 140 H 0" 530 | ] 531 | }, 532 | "execution_count": 35, 533 | "metadata": {}, 534 | "output_type": "execute_result" 535 | } 536 | ], 537 | "source": [ 538 | "df.iloc[[df['Wscore'].argmax()]]" 539 | ] 540 | }, 541 | { 542 | "cell_type": "markdown", 543 | "metadata": {}, 544 | "source": [ 545 | "Let's take this a step further. Let's say you want to know the game with the highest scoring winning team (this is what we just calculated), but you then want to know how many points the losing team scored. " 546 | ] 547 | }, 548 | { 549 | "cell_type": "code", 550 | "execution_count": 38, 551 | "metadata": { 552 | "collapsed": false 553 | }, 554 | "outputs": [ 555 | { 556 | "data": { 557 | "text/plain": [ 558 | "140" 559 | ] 560 | }, 561 | "execution_count": 38, 562 | "metadata": {}, 563 | "output_type": "execute_result" 564 | } 565 | ], 566 | "source": [ 567 | "df.iloc[[df['Wscore'].argmax()]]['Lscore'].max()" 568 | ] 569 | }, 570 | { 571 | "cell_type": "markdown", 572 | "metadata": {}, 573 | "source": [ 574 | "The bracket indexing operator is the best way to extract certain columns from a dataframe." 575 | ] 576 | }, 577 | { 578 | "cell_type": "code", 579 | "execution_count": 27, 580 | "metadata": { 581 | "collapsed": false, 582 | "scrolled": true 583 | }, 584 | "outputs": [ 585 | { 586 | "data": { 587 | "text/html": [ 588 | "
" 907 | ], 908 | "text/plain": [ 909 | " Wscore Lscore\n", 910 | "0 81 64\n", 911 | "1 77 70\n", 912 | "2 63 56\n", 913 | "3 70 54\n", 914 | "4 86 74\n", 915 | "5 79 78\n", 916 | "6 64 44\n", 917 | "7 58 56\n", 918 | "8 98 80\n", 919 | "9 97 89\n", 920 | "10 103 71\n", 921 | "11 75 71\n", 922 | "12 91 72\n", 923 | "13 70 65\n", 924 | "14 87 58\n", 925 | "15 65 62\n", 926 | "16 92 50\n", 927 | "17 65 60\n", 928 | "18 58 53\n", 929 | "19 50 48\n", 930 | "20 47 40\n", 931 | "21 55 52\n", 932 | "22 76 56\n", 933 | "23 59 58\n", 934 | "24 79 76\n", 935 | "25 106 55\n", 936 | "26 95 77\n", 937 | "27 79 66\n", 938 | "28 64 59\n", 939 | "29 76 47\n", 940 | "... ... ...\n", 941 | "145259 69 67\n", 942 | "145260 72 65\n", 943 | "145261 64 61\n", 944 | "145262 77 62\n", 945 | "145263 57 54\n", 946 | "145264 68 63\n", 947 | "145265 81 69\n", 948 | "145266 64 60\n", 949 | "145267 81 71\n", 950 | "145268 93 80\n", 951 | "145269 74 54\n", 952 | "145270 64 61\n", 953 | "145271 55 53\n", 954 | "145272 61 57\n", 955 | "145273 88 57\n", 956 | "145274 76 59\n", 957 | "145275 69 67\n", 958 | "145276 82 60\n", 959 | "145277 54 53\n", 960 | "145278 82 79\n", 961 | "145279 80 74\n", 962 | "145280 71 38\n", 963 | "145281 82 71\n", 964 | "145282 76 54\n", 965 | "145283 62 59\n", 966 | "145284 70 50\n", 967 | "145285 72 58\n", 968 | "145286 82 77\n", 969 | "145287 66 62\n", 970 | "145288 87 74\n", 971 | "\n", 972 | "[145289 rows x 2 columns]" 973 | ] 974 | }, 975 | "execution_count": 27, 976 | "metadata": {}, 977 | "output_type": "execute_result" 978 | } 979 | ], 980 | "source": [ 981 | "df[['Wscore', 'Lscore']]" 982 | ] 983 | }, 984 | { 985 | "cell_type": "markdown", 986 | "metadata": {}, 987 | "source": [ 988 | "Now, let's say we want to find all of the rows that satisfy a particular condition. For example, I want to find all of the games where the winning team scored more than 150 points. 
The idea behind this command is you want to access the column 'Wscore' of the dataframe df (df['Wscore']), find which entries are above 150 (df['Wscore'] > 150), and then return the results in a dataframe (df[df['Wscore'] > 150])." 989 | ] 990 | }, 991 | { 992 | "cell_type": "code", 993 | "execution_count": 33, 994 | "metadata": { 995 | "collapsed": false 996 | }, 997 | "outputs": [ 998 | { 999 | "data": { 1000 | "text/html": [ 1001 | "
" 1218 | ], 1219 | "text/plain": [ 1220 | " Season Daynum Wteam Wscore Lteam Lscore Wloc Numot\n", 1221 | "5269 1986 75 1258 151 1109 107 H 0\n", 1222 | "12046 1988 40 1328 152 1147 84 H 0\n", 1223 | "12355 1988 52 1328 151 1173 99 N 0\n", 1224 | "16040 1989 40 1328 152 1331 122 H 0\n", 1225 | "16853 1989 68 1258 162 1109 144 A 0\n", 1226 | "17867 1989 92 1258 181 1109 150 H 0\n", 1227 | "19653 1990 30 1328 173 1109 101 H 0\n", 1228 | "19971 1990 38 1258 152 1109 137 A 0\n", 1229 | "20022 1990 40 1116 166 1109 101 H 0\n", 1230 | "22145 1990 97 1258 157 1362 115 H 0\n", 1231 | "23582 1991 26 1318 152 1258 123 N 0\n", 1232 | "24341 1991 47 1328 172 1258 112 H 0\n", 1233 | "24970 1991 68 1258 186 1109 140 H 0\n", 1234 | "25656 1991 84 1106 151 1212 97 H 0\n", 1235 | "28687 1992 54 1261 159 1319 86 H 0\n", 1236 | "35023 1993 112 1380 155 1341 91 A 0\n", 1237 | "40060 1995 32 1375 156 1341 114 H 0\n", 1238 | "52600 1998 33 1395 153 1410 87 H 0" 1239 | ] 1240 | }, 1241 | "execution_count": 33, 1242 | "metadata": {}, 1243 | "output_type": "execute_result" 1244 | } 1245 | ], 1246 | "source": [ 1247 | "df[df['Wscore'] > 150]" 1248 | ] 1249 | }, 1250 | { 1251 | "cell_type": "markdown", 1252 | "metadata": {}, 1253 | "source": [ 1254 | "Each dataframe has a **values** attribute, which is useful because it returns the contents of your dataframe as a NumPy array" 1255 | ] 1256 | }, 1257 | { 1258 | "cell_type": "code", 1259 | "execution_count": 39, 1260 | "metadata": { 1261 | "collapsed": false 1262 | }, 1263 | "outputs": [ 1264 | { 1265 | "data": { 1266 | "text/plain": [ 1267 | "array([[1985, 20, 1228, ..., 64, 'N', 0],\n", 1268 | " [1985, 25, 1106, ..., 70, 'H', 0],\n", 1269 | " [1985, 25, 1112, ..., 56, 'H', 0],\n", 1270 | " ..., \n", 1271 | " [2016, 132, 1246, ..., 77, 'N', 1],\n", 1272 | " [2016, 132, 1277, ..., 62, 'N', 0],\n", 1273 | " [2016, 132, 1386, ..., 74, 'N', 0]], dtype=object)" 1274 | ] 1275 | }, 1276 | "execution_count": 39, 1277 | "metadata": {}, 1278 | 
"output_type": "execute_result" 1279 | } 1280 | ], 1281 | "source": [ 1282 | "df.values" 1283 | ] 1284 | }, 1285 | { 1286 | "cell_type": "markdown", 1287 | "metadata": {}, 1288 | "source": [ 1289 | "Now you can access elements just like you would in an array. " 1290 | ] 1291 | }, 1292 | { 1293 | "cell_type": "code", 1294 | "execution_count": 40, 1295 | "metadata": { 1296 | "collapsed": false 1297 | }, 1298 | "outputs": [ 1299 | { 1300 | "data": { 1301 | "text/plain": [ 1302 | "1985" 1303 | ] 1304 | }, 1305 | "execution_count": 40, 1306 | "metadata": {}, 1307 | "output_type": "execute_result" 1308 | } 1309 | ], 1310 | "source": [ 1311 | "df.values[0][0]" 1312 | ] 1313 | }, 1314 | { 1315 | "cell_type": "markdown", 1316 | "metadata": {}, 1317 | "source": [ 1318 | "# Dataframe Iteration" 1319 | ] 1320 | }, 1321 | { 1322 | "cell_type": "markdown", 1323 | "metadata": {}, 1324 | "source": [ 1325 | "In order to iterate through dataframes, we can use the " 1326 | ] 1327 | }, 1328 | { 1329 | "cell_type": "markdown", 1330 | "metadata": { 1331 | "collapsed": true 1332 | }, 1333 | "source": [ 1334 | "# Lots of Other Resources" 1335 | ] 1336 | }, 1337 | { 1338 | "cell_type": "markdown", 1339 | "metadata": {}, 1340 | "source": [ 1341 | "Pandas has been around for a while and there are a lot of other good resources if you're still interested in getting the most out of this library. 
\n", 1342 | "* http://pandas.pydata.org/pandas-docs/stable/10min.html\n", 1343 | "* https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python\n", 1344 | "* http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/\n", 1345 | "* https://www.dataquest.io/blog/pandas-python-tutorial/\n", 1346 | "* https://drive.google.com/file/d/0ByIrJAE4KMTtTUtiVExiUGVkRkE/view" 1347 | ] 1348 | }, 1349 | { 1350 | "cell_type": "code", 1351 | "execution_count": null, 1352 | "metadata": { 1353 | "collapsed": true 1354 | }, 1355 | "outputs": [], 1356 | "source": [] 1357 | } 1358 | ], 1359 | "metadata": { 1360 | "anaconda-cloud": {}, 1361 | "kernelspec": { 1362 | "display_name": "Python [conda root]", 1363 | "language": "python", 1364 | "name": "conda-root-py" 1365 | }, 1366 | "language_info": { 1367 | "codemirror_mode": { 1368 | "name": "ipython", 1369 | "version": 2 1370 | }, 1371 | "file_extension": ".py", 1372 | "mimetype": "text/x-python", 1373 | "name": "python", 1374 | "nbconvert_exporter": "python", 1375 | "pygments_lexer": "ipython2", 1376 | "version": "2.7.12" 1377 | } 1378 | }, 1379 | "nbformat": 4, 1380 | "nbformat_minor": 1 1381 | } 1382 | -------------------------------------------------------------------------------- /DataStructures.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/adeshpande3/Pandas-Tutorial/7ce62d4166db83e4f29599a1d8b8eb6b22f21e4e/DataStructures.png -------------------------------------------------------------------------------- /Pandas Tutorial.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Introduction" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Since I've been working on a lot of Kaggle competitions, I use Pandas quite a bit. 
As you may know, Pandas (in addition to NumPy) is the go-to Python library for all your data science needs. It helps with handling input data in CSV format and with transforming your data into a form that can be fed into ML models. However, getting comfortable with the ideas of dataframes, slicing, etc. was very tough for me in the beginning. Hopefully, this short tutorial can show you a lot of different commands that will help you gain the most insight into your dataset. " 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 1, 20 | "metadata": { 21 | "collapsed": true 22 | }, 23 | "outputs": [], 24 | "source": [ 25 | "import pandas as pd" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "# Loading in Data" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "metadata": {}, 38 | "source": [ 39 | "The first step in any ML problem is identifying what format your data is in, and then loading it into whatever framework you're using. For Kaggle competitions, a lot of data can be found in CSV files, so that's the example we're going to use. " 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "Since I'm a huge sports fan, we're going to be looking at a sports dataset that shows the results from NCAA basketball games from 1985 to 2016. This dataset is in a CSV file, and the function we're going to use to read in the file is called **pd.read_csv()**. This function returns a **dataframe**. The dataframe is the crown jewel data structure of Pandas. It is defined as \"a two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns)\"." 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "metadata": {}, 52 | "source": [ 53 | "Just think of it as a table for now. 
" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 2, 59 | "metadata": { 60 | "collapsed": false 61 | }, 62 | "outputs": [], 63 | "source": [ 64 | "df = pd.read_csv('RegularSeasonCompactResults.csv')" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": {}, 70 | "source": [ 71 | "# The Basics" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": {}, 77 | "source": [ 78 | "Now that we have our dataframe in our variable df, let's look at what it contains. We can use the function **head()** to see the first couple rows of the dataframe (or the function **tail()** to see the last few rows)." 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": 3, 84 | "metadata": { 85 | "collapsed": false 86 | }, 87 | "outputs": [ 88 | { 89 | "data": { 90 | "text/html": [ 91 | "
\n", 92 | "\n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | "
   Season  Daynum  Wteam  Wscore  Lteam  Lscore  Wloc  Numot
0  1985    20      1228   81      1328   64      N     0
1  1985    25      1106   77      1354   70      H     0
2  1985    25      1112   63      1223   56      H     0
3  1985    25      1165   70      1432   54      H     0
4  1985    25      1192   86      1447   74      H     0
\n", 164 | "
" 165 | ], 166 | "text/plain": [ 167 | " Season Daynum Wteam Wscore Lteam Lscore Wloc Numot\n", 168 | "0 1985 20 1228 81 1328 64 N 0\n", 169 | "1 1985 25 1106 77 1354 70 H 0\n", 170 | "2 1985 25 1112 63 1223 56 H 0\n", 171 | "3 1985 25 1165 70 1432 54 H 0\n", 172 | "4 1985 25 1192 86 1447 74 H 0" 173 | ] 174 | }, 175 | "execution_count": 3, 176 | "metadata": {}, 177 | "output_type": "execute_result" 178 | } 179 | ], 180 | "source": [ 181 | "df.head()" 182 | ] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "execution_count": 4, 187 | "metadata": { 188 | "collapsed": false 189 | }, 190 | "outputs": [ 191 | { 192 | "data": { 193 | "text/html": [ 194 | "
\n", 195 | "\n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | "
        Season  Daynum  Wteam  Wscore  Lteam  Lscore  Wloc  Numot
145284  2016    132     1114   70      1419   50      N     0
145285  2016    132     1163   72      1272   58      N     0
145286  2016    132     1246   82      1401   77      N     1
145287  2016    132     1277   66      1345   62      N     0
145288  2016    132     1386   87      1433   74      N     0
\n", 267 | "
" 268 | ], 269 | "text/plain": [ 270 | " Season Daynum Wteam Wscore Lteam Lscore Wloc Numot\n", 271 | "145284 2016 132 1114 70 1419 50 N 0\n", 272 | "145285 2016 132 1163 72 1272 58 N 0\n", 273 | "145286 2016 132 1246 82 1401 77 N 1\n", 274 | "145287 2016 132 1277 66 1345 62 N 0\n", 275 | "145288 2016 132 1386 87 1433 74 N 0" 276 | ] 277 | }, 278 | "execution_count": 4, 279 | "metadata": {}, 280 | "output_type": "execute_result" 281 | } 282 | ], 283 | "source": [ 284 | "df.tail()" 285 | ] 286 | }, 287 | { 288 | "cell_type": "markdown", 289 | "metadata": {}, 290 | "source": [ 291 | "We can see the dimensions of the dataframe using the **shape** attribute" 292 | ] 293 | }, 294 | { 295 | "cell_type": "code", 296 | "execution_count": 5, 297 | "metadata": { 298 | "collapsed": false 299 | }, 300 | "outputs": [ 301 | { 302 | "data": { 303 | "text/plain": [ 304 | "(145289, 8)" 305 | ] 306 | }, 307 | "execution_count": 5, 308 | "metadata": {}, 309 | "output_type": "execute_result" 310 | } 311 | ], 312 | "source": [ 313 | "df.shape" 314 | ] 315 | }, 316 | { 317 | "cell_type": "markdown", 318 | "metadata": { 319 | "collapsed": true 320 | }, 321 | "source": [ 322 | "We can also extract all the column names as a list by using the **columns** attribute, and the row labels with the **index** attribute" 323 | ] 324 | }, 325 | { 326 | "cell_type": "code", 327 | "execution_count": 6, 328 | "metadata": { 329 | "collapsed": false 330 | }, 331 | "outputs": [ 332 | { 333 | "data": { 334 | "text/plain": [ 335 | "['Season', 'Daynum', 'Wteam', 'Wscore', 'Lteam', 'Lscore', 'Wloc', 'Numot']" 336 | ] 337 | }, 338 | "execution_count": 6, 339 | "metadata": {}, 340 | "output_type": "execute_result" 341 | } 342 | ], 343 | "source": [ 344 | "df.columns.tolist()" 345 | ] 346 | }, 347 | { 348 | "cell_type": "markdown", 349 | "metadata": {}, 350 | "source": [ 351 | "In order to get a better idea of the type of data that we are dealing with, we can call the **describe()** function to see 
statistics like mean, min, etc. about each numeric column of the dataset. " 352 | ] 353 | }, 354 | { 355 | "cell_type": "code", 356 | "execution_count": 7, 357 | "metadata": { 358 | "collapsed": false 359 | }, 360 | "outputs": [ 361 | { 362 | "data": { 363 | "text/html": [ 364 | "
\n", 365 | "\n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | "
       Season         Daynum         Wteam          Wscore         Lteam          Lscore         Numot
count  145289.000000  145289.000000  145289.000000  145289.000000  145289.000000  145289.000000  145289.000000
mean   2001.574834    75.223816      1286.720646    76.600321      1282.864064    64.497009      0.044387
std    9.233342       33.287418      104.570275     12.173033      104.829234     11.380625      0.247819
min    1985.000000    0.000000       1101.000000    34.000000      1101.000000    20.000000      0.000000
25%    1994.000000    47.000000      1198.000000    68.000000      1191.000000    57.000000      0.000000
50%    2002.000000    78.000000      1284.000000    76.000000      1280.000000    64.000000      0.000000
75%    2010.000000    103.000000     1379.000000    84.000000      1375.000000    72.000000      0.000000
max    2016.000000    132.000000     1464.000000    186.000000     1464.000000    150.000000     6.000000
\n", 461 | "
" 462 | ], 463 | "text/plain": [ 464 | " Season Daynum Wteam Wscore \\\n", 465 | "count 145289.000000 145289.000000 145289.000000 145289.000000 \n", 466 | "mean 2001.574834 75.223816 1286.720646 76.600321 \n", 467 | "std 9.233342 33.287418 104.570275 12.173033 \n", 468 | "min 1985.000000 0.000000 1101.000000 34.000000 \n", 469 | "25% 1994.000000 47.000000 1198.000000 68.000000 \n", 470 | "50% 2002.000000 78.000000 1284.000000 76.000000 \n", 471 | "75% 2010.000000 103.000000 1379.000000 84.000000 \n", 472 | "max 2016.000000 132.000000 1464.000000 186.000000 \n", 473 | "\n", 474 | " Lteam Lscore Numot \n", 475 | "count 145289.000000 145289.000000 145289.000000 \n", 476 | "mean 1282.864064 64.497009 0.044387 \n", 477 | "std 104.829234 11.380625 0.247819 \n", 478 | "min 1101.000000 20.000000 0.000000 \n", 479 | "25% 1191.000000 57.000000 0.000000 \n", 480 | "50% 1280.000000 64.000000 0.000000 \n", 481 | "75% 1375.000000 72.000000 0.000000 \n", 482 | "max 1464.000000 150.000000 6.000000 " 483 | ] 484 | }, 485 | "execution_count": 7, 486 | "metadata": {}, 487 | "output_type": "execute_result" 488 | } 489 | ], 490 | "source": [ 491 | "df.describe()" 492 | ] 493 | }, 494 | { 495 | "cell_type": "markdown", 496 | "metadata": {}, 497 | "source": [ 498 | "Okay, so now let's look at information that we want to extract from the dataframe. Let's say I wanted to know the max value of a certain column. 
The function **max()** will show you the maximum values of all columns" 499 | ] 500 | }, 501 | { 502 | "cell_type": "code", 503 | "execution_count": 8, 504 | "metadata": { 505 | "collapsed": false 506 | }, 507 | "outputs": [ 508 | { 509 | "data": { 510 | "text/plain": [ 511 | "Season 2016\n", 512 | "Daynum 132\n", 513 | "Wteam 1464\n", 514 | "Wscore 186\n", 515 | "Lteam 1464\n", 516 | "Lscore 150\n", 517 | "Wloc N\n", 518 | "Numot 6\n", 519 | "dtype: object" 520 | ] 521 | }, 522 | "execution_count": 8, 523 | "metadata": {}, 524 | "output_type": "execute_result" 525 | } 526 | ], 527 | "source": [ 528 | "df.max()" 529 | ] 530 | }, 531 | { 532 | "cell_type": "markdown", 533 | "metadata": {}, 534 | "source": [ 535 | "Then, if you'd like to get the max value for one particular column, you select that column by name using the bracket indexing operator and call **max()** on it" 536 | ] 537 | }, 538 | { 539 | "cell_type": "code", 540 | "execution_count": 9, 541 | "metadata": { 542 | "collapsed": false 543 | }, 544 | "outputs": [ 545 | { 546 | "data": { 547 | "text/plain": [ 548 | "186" 549 | ] 550 | }, 551 | "execution_count": 9, 552 | "metadata": {}, 553 | "output_type": "execute_result" 554 | } 555 | ], 556 | "source": [ 557 | "df['Wscore'].max()" 558 | ] 559 | }, 560 | { 561 | "cell_type": "markdown", 562 | "metadata": {}, 563 | "source": [ 564 | "Similarly, you can find the mean of the losing teams' scores. " 565 | ] 566 | }, 567 | { 568 | "cell_type": "code", 569 | "execution_count": 10, 570 | "metadata": { 571 | "collapsed": false 572 | }, 573 | "outputs": [ 574 | { 575 | "data": { 576 | "text/plain": [ 577 | "64.49700940883343" 578 | ] 579 | }, 580 | "execution_count": 10, 581 | "metadata": {}, 582 | "output_type": "execute_result" 583 | } 584 | ], 585 | "source": [ 586 | "df['Lscore'].mean()" 587 | ] 588 | }, 589 | { 590 | "cell_type": "markdown", 591 | "metadata": {}, 592 | "source": [ 593 | "But what if that's not enough? 
Let's say we want to actually see the game (row) where this max score happened. We can call the **argmax()** function to identify the row index" 594 | ] 595 | }, 596 | { 597 | "cell_type": "code", 598 | "execution_count": 11, 599 | "metadata": { 600 | "collapsed": false 601 | }, 602 | "outputs": [ 603 | { 604 | "data": { 605 | "text/plain": [ 606 | "24970" 607 | ] 608 | }, 609 | "execution_count": 11, 610 | "metadata": {}, 611 | "output_type": "execute_result" 612 | } 613 | ], 614 | "source": [ 615 | "df['Wscore'].argmax()" 616 | ] 617 | }, 618 | { 619 | "cell_type": "markdown", 620 | "metadata": {}, 621 | "source": [ 622 | "One of the most useful functions that you can call on certain columns in a dataframe is the **value_counts()** function. It shows how many times each item appears in the column. This particular command shows the number of games in each season" 623 | ] 624 | }, 625 | { 626 | "cell_type": "code", 627 | "execution_count": 12, 628 | "metadata": { 629 | "collapsed": false 630 | }, 631 | "outputs": [ 632 | { 633 | "data": { 634 | "text/plain": [ 635 | "2016 5369\n", 636 | "2014 5362\n", 637 | "2015 5354\n", 638 | "2013 5320\n", 639 | "2010 5263\n", 640 | "2012 5253\n", 641 | "2009 5249\n", 642 | "2011 5246\n", 643 | "2008 5163\n", 644 | "2007 5043\n", 645 | "2006 4757\n", 646 | "2005 4675\n", 647 | "2003 4616\n", 648 | "2004 4571\n", 649 | "2002 4555\n", 650 | "2000 4519\n", 651 | "2001 4467\n", 652 | "1999 4222\n", 653 | "1998 4167\n", 654 | "1997 4155\n", 655 | "1992 4127\n", 656 | "1991 4123\n", 657 | "1996 4122\n", 658 | "1995 4077\n", 659 | "1994 4060\n", 660 | "1990 4045\n", 661 | "1989 4037\n", 662 | "1993 3982\n", 663 | "1988 3955\n", 664 | "1987 3915\n", 665 | "1986 3783\n", 666 | "1985 3737\n", 667 | "Name: Season, dtype: int64" 668 | ] 669 | }, 670 | "execution_count": 12, 671 | "metadata": {}, 672 | "output_type": "execute_result" 673 | } 674 | ], 675 | "source": [ 676 | "df['Season'].value_counts()" 677 | ] 678 | }, 679 | { 680 | 
"cell_type": "markdown", 681 | "metadata": {}, 682 | "source": [ 683 | "# Accessing Values" 684 | ] 685 | }, 686 | { 687 | "cell_type": "markdown", 688 | "metadata": {}, 689 | "source": [ 690 | "Then, in order to get attributes about the game, we need to use the **iloc[]** indexer. Iloc is definitely one of the more important tools in Pandas. The main idea is that you want to use it whenever you have the integer position of a certain row that you want to access. As per Pandas documentation, iloc is an \"integer-location based indexing for selection by position.\"" 691 | ] 692 | }, 693 | { 694 | "cell_type": "code", 695 | "execution_count": 13, 696 | "metadata": { 697 | "collapsed": false 698 | }, 699 | "outputs": [ 700 | { 701 | "data": { 702 | "text/html": [ 703 | "
\n", 704 | "\n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | " \n", 712 | " \n", 713 | " \n", 714 | " \n", 715 | " \n", 716 | " \n", 717 | " \n", 718 | " \n", 719 | " \n", 720 | " \n", 721 | " \n", 722 | " \n", 723 | " \n", 724 | " \n", 725 | " \n", 726 | " \n", 727 | " \n", 728 | " \n", 729 | " \n", 730 | " \n", 731 | "
       Season  Daynum  Wteam  Wscore  Lteam  Lscore  Wloc  Numot
24970  1991    68      1258   186     1109   140     H     0
\n", 732 | "
" 733 | ], 734 | "text/plain": [ 735 | " Season Daynum Wteam Wscore Lteam Lscore Wloc Numot\n", 736 | "24970 1991 68 1258 186 1109 140 H 0" 737 | ] 738 | }, 739 | "execution_count": 13, 740 | "metadata": {}, 741 | "output_type": "execute_result" 742 | } 743 | ], 744 | "source": [ 745 | "df.iloc[[df['Wscore'].argmax()]]" 746 | ] 747 | }, 748 | { 749 | "cell_type": "markdown", 750 | "metadata": {}, 751 | "source": [ 752 | "Let's take this a step further. Let's say you want to know the game with the highest scoring winning team (this is what we just calculated), but you then want to know how many points the losing team scored. " 753 | ] 754 | }, 755 | { 756 | "cell_type": "code", 757 | "execution_count": 14, 758 | "metadata": { 759 | "collapsed": false 760 | }, 761 | "outputs": [ 762 | { 763 | "data": { 764 | "text/plain": [ 765 | "24970 140\n", 766 | "Name: Lscore, dtype: int64" 767 | ] 768 | }, 769 | "execution_count": 14, 770 | "metadata": {}, 771 | "output_type": "execute_result" 772 | } 773 | ], 774 | "source": [ 775 | "df.iloc[[df['Wscore'].argmax()]]['Lscore']" 776 | ] 777 | }, 778 | { 779 | "cell_type": "markdown", 780 | "metadata": {}, 781 | "source": [ 782 | "When you see data displayed in the above format, you're dealing with a Pandas **Series** object, not a dataframe object." 
783 | ] 784 | }, 785 | { 786 | "cell_type": "code", 787 | "execution_count": 15, 788 | "metadata": { 789 | "collapsed": false 790 | }, 791 | "outputs": [ 792 | { 793 | "data": { 794 | "text/plain": [ 795 | "pandas.core.series.Series" 796 | ] 797 | }, 798 | "execution_count": 15, 799 | "metadata": {}, 800 | "output_type": "execute_result" 801 | } 802 | ], 803 | "source": [ 804 | "type(df.iloc[[df['Wscore'].argmax()]]['Lscore'])" 805 | ] 806 | }, 807 | { 808 | "cell_type": "code", 809 | "execution_count": 16, 810 | "metadata": { 811 | "collapsed": false 812 | }, 813 | "outputs": [ 814 | { 815 | "data": { 816 | "text/plain": [ 817 | "pandas.core.frame.DataFrame" 818 | ] 819 | }, 820 | "execution_count": 16, 821 | "metadata": {}, 822 | "output_type": "execute_result" 823 | } 824 | ], 825 | "source": [ 826 | "type(df.iloc[[df['Wscore'].argmax()]])" 827 | ] 828 | }, 829 | { 830 | "cell_type": "markdown", 831 | "metadata": {}, 832 | "source": [ 833 | "The following is a summary of the 3 data structures in Pandas (Haven't ever really used Panels yet)\n", 834 | "\n", 835 | "![](DataStructures.png)" 836 | ] 837 | }, 838 | { 839 | "cell_type": "markdown", 840 | "metadata": {}, 841 | "source": [ 842 | "When you want to access values in a Series, you'll want to just treat the Series like a Python dictionary, so you'd access the value according to its key (which is normally an integer index)" 843 | ] 844 | }, 845 | { 846 | "cell_type": "code", 847 | "execution_count": 17, 848 | "metadata": { 849 | "collapsed": false 850 | }, 851 | "outputs": [ 852 | { 853 | "data": { 854 | "text/plain": [ 855 | "140" 856 | ] 857 | }, 858 | "execution_count": 17, 859 | "metadata": {}, 860 | "output_type": "execute_result" 861 | } 862 | ], 863 | "source": [ 864 | "df.iloc[[df['Wscore'].argmax()]]['Lscore'][24970]" 865 | ] 866 | }, 867 | { 868 | "cell_type": "markdown", 869 | "metadata": {}, 870 | "source": [ 871 | "The other really important function in Pandas is the **loc** function. 
In contrast to iloc, which is an integer-position based indexer, loc is a \"Purely label-location based indexer for selection by label\". Since all the games are labeled in order from 0 to 145288, iloc and loc are going to be pretty interchangeable in this type of dataset" 872 | ] 873 | }, 874 | { 875 | "cell_type": "code", 876 | "execution_count": 18, 877 | "metadata": { 878 | "collapsed": false 879 | }, 880 | "outputs": [ 881 | { 882 | "data": { 883 | "text/html": [ 884 | "
\n", 885 | "\n", 886 | " \n", 887 | " \n", 888 | " \n", 889 | " \n", 890 | " \n", 891 | " \n", 892 | " \n", 893 | " \n", 894 | " \n", 895 | " \n", 896 | " \n", 897 | " \n", 898 | " \n", 899 | " \n", 900 | " \n", 901 | " \n", 902 | " \n", 903 | " \n", 904 | " \n", 905 | " \n", 906 | " \n", 907 | " \n", 908 | " \n", 909 | " \n", 910 | " \n", 911 | " \n", 912 | " \n", 913 | " \n", 914 | " \n", 915 | " \n", 916 | " \n", 917 | " \n", 918 | " \n", 919 | " \n", 920 | " \n", 921 | " \n", 922 | " \n", 923 | " \n", 924 | " \n", 925 | " \n", 926 | " \n", 927 | " \n", 928 | " \n", 929 | " \n", 930 | " \n", 931 | " \n", 932 | " \n", 933 | " \n", 934 | "
   Season  Daynum  Wteam  Wscore  Lteam  Lscore  Wloc  Numot
0  1985    20      1228   81      1328   64      N     0
1  1985    25      1106   77      1354   70      H     0
2  1985    25      1112   63      1223   56      H     0
\n", 935 | "
" 936 | ], 937 | "text/plain": [ 938 | " Season Daynum Wteam Wscore Lteam Lscore Wloc Numot\n", 939 | "0 1985 20 1228 81 1328 64 N 0\n", 940 | "1 1985 25 1106 77 1354 70 H 0\n", 941 | "2 1985 25 1112 63 1223 56 H 0" 942 | ] 943 | }, 944 | "execution_count": 18, 945 | "metadata": {}, 946 | "output_type": "execute_result" 947 | } 948 | ], 949 | "source": [ 950 | "df.iloc[:3]" 951 | ] 952 | }, 953 | { 954 | "cell_type": "code", 955 | "execution_count": 19, 956 | "metadata": { 957 | "collapsed": false 958 | }, 959 | "outputs": [ 960 | { 961 | "data": { 962 | "text/html": [ 963 | "
\n", 964 | "\n", 965 | " \n", 966 | " \n", 967 | " \n", 968 | " \n", 969 | " \n", 970 | " \n", 971 | " \n", 972 | " \n", 973 | " \n", 974 | " \n", 975 | " \n", 976 | " \n", 977 | " \n", 978 | " \n", 979 | " \n", 980 | " \n", 981 | " \n", 982 | " \n", 983 | " \n", 984 | " \n", 985 | " \n", 986 | " \n", 987 | " \n", 988 | " \n", 989 | " \n", 990 | " \n", 991 | " \n", 992 | " \n", 993 | " \n", 994 | " \n", 995 | " \n", 996 | " \n", 997 | " \n", 998 | " \n", 999 | " \n", 1000 | " \n", 1001 | " \n", 1002 | " \n", 1003 | " \n", 1004 | " \n", 1005 | " \n", 1006 | " \n", 1007 | " \n", 1008 | " \n", 1009 | " \n", 1010 | " \n", 1011 | " \n", 1012 | " \n", 1013 | " \n", 1014 | " \n", 1015 | " \n", 1016 | " \n", 1017 | " \n", 1018 | " \n", 1019 | " \n", 1020 | " \n", 1021 | " \n", 1022 | " \n", 1023 | " \n", 1024 | "
   Season  Daynum  Wteam  Wscore  Lteam  Lscore  Wloc  Numot
0  1985    20      1228   81      1328   64      N     0
1  1985    25      1106   77      1354   70      H     0
2  1985    25      1112   63      1223   56      H     0
3  1985    25      1165   70      1432   54      H     0
\n", 1025 | "
" 1026 | ], 1027 | "text/plain": [ 1028 | " Season Daynum Wteam Wscore Lteam Lscore Wloc Numot\n", 1029 | "0 1985 20 1228 81 1328 64 N 0\n", 1030 | "1 1985 25 1106 77 1354 70 H 0\n", 1031 | "2 1985 25 1112 63 1223 56 H 0\n", 1032 | "3 1985 25 1165 70 1432 54 H 0" 1033 | ] 1034 | }, 1035 | "execution_count": 19, 1036 | "metadata": {}, 1037 | "output_type": "execute_result" 1038 | } 1039 | ], 1040 | "source": [ 1041 | "df.loc[:3]" 1042 | ] 1043 | }, 1044 | { 1045 | "cell_type": "markdown", 1046 | "metadata": {}, 1047 | "source": [ 1048 | "Notice the slight difference: an iloc slice excludes its end position, while a loc slice includes its end label. " 1049 | ] 1050 | }, 1051 | { 1052 | "cell_type": "markdown", 1053 | "metadata": {}, 1054 | "source": [ 1055 | "Below is an example of how you can use loc to achieve the same task as we did previously with iloc" 1056 | ] 1057 | }, 1058 | { 1059 | "cell_type": "code", 1060 | "execution_count": 20, 1061 | "metadata": { 1062 | "collapsed": false 1063 | }, 1064 | "outputs": [ 1065 | { 1066 | "data": { 1067 | "text/plain": [ 1068 | "140" 1069 | ] 1070 | }, 1071 | "execution_count": 20, 1072 | "metadata": {}, 1073 | "output_type": "execute_result" 1074 | } 1075 | ], 1076 | "source": [ 1077 | "df.loc[df['Wscore'].argmax(), 'Lscore']" 1078 | ] 1079 | }, 1080 | { 1081 | "cell_type": "markdown", 1082 | "metadata": {}, 1083 | "source": [ 1084 | "A faster version uses the **at[]** accessor. At is really useful whenever you know the row label and the column label of the particular value that you want to get. 
" 1085 | ] 1086 | }, 1087 | { 1088 | "cell_type": "code", 1089 | "execution_count": 21, 1090 | "metadata": { 1091 | "collapsed": false 1092 | }, 1093 | "outputs": [ 1094 | { 1095 | "data": { 1096 | "text/plain": [ 1097 | "140" 1098 | ] 1099 | }, 1100 | "execution_count": 21, 1101 | "metadata": {}, 1102 | "output_type": "execute_result" 1103 | } 1104 | ], 1105 | "source": [ 1106 | "df.at[df['Wscore'].argmax(), 'Lscore']" 1107 | ] 1108 | }, 1109 | { 1110 | "cell_type": "markdown", 1111 | "metadata": {}, 1112 | "source": [ 1113 | "If you'd like to see more discussion on how loc and iloc are different, check out this great Stack Overflow post: http://stackoverflow.com/questions/31593201/pandas-iloc-vs-ix-vs-loc-explanation. Just remember that **iloc looks at position** and **loc looks at labels**. Loc becomes very important when your row labels aren't integers. " 1114 | ] 1115 | }, 1116 | { 1117 | "cell_type": "markdown", 1118 | "metadata": {}, 1119 | "source": [ 1120 | "# Sorting" 1121 | ] 1122 | }, 1123 | { 1124 | "cell_type": "markdown", 1125 | "metadata": {}, 1126 | "source": [ 1127 | "Let's say that we want to sort the dataframe in increasing order for the scores of the losing team" 1128 | ] 1129 | }, 1130 | { 1131 | "cell_type": "code", 1132 | "execution_count": 22, 1133 | "metadata": { 1134 | "collapsed": false, 1135 | "scrolled": true 1136 | }, 1137 | "outputs": [ 1138 | { 1139 | "data": { 1140 | "text/html": [ 1141 | "
\n", 1142 | "\n", 1143 | " \n", 1144 | " \n", 1145 | " \n", 1146 | " \n", 1147 | " \n", 1148 | " \n", 1149 | " \n", 1150 | " \n", 1151 | " \n", 1152 | " \n", 1153 | " \n", 1154 | " \n", 1155 | " \n", 1156 | " \n", 1157 | " \n", 1158 | " \n", 1159 | " \n", 1160 | " \n", 1161 | " \n", 1162 | " \n", 1163 | " \n", 1164 | " \n", 1165 | " \n", 1166 | " \n", 1167 | " \n", 1168 | " \n", 1169 | " \n", 1170 | " \n", 1171 | " \n", 1172 | " \n", 1173 | " \n", 1174 | " \n", 1175 | " \n", 1176 | " \n", 1177 | " \n", 1178 | " \n", 1179 | " \n", 1180 | " \n", 1181 | " \n", 1182 | " \n", 1183 | " \n", 1184 | " \n", 1185 | " \n", 1186 | " \n", 1187 | " \n", 1188 | " \n", 1189 | " \n", 1190 | " \n", 1191 | " \n", 1192 | " \n", 1193 | " \n", 1194 | " \n", 1195 | " \n", 1196 | " \n", 1197 | " \n", 1198 | " \n", 1199 | " \n", 1200 | " \n", 1201 | " \n", 1202 | " \n", 1203 | " \n", 1204 | " \n", 1205 | " \n", 1206 | " \n", 1207 | " \n", 1208 | " \n", 1209 | " \n", 1210 | " \n", 1211 | " \n", 1212 | " \n", 1213 | "
        Season  Daynum  Wteam  Wscore  Lteam  Lscore  Wloc  Numot
100027  2008    66      1203   49      1387   20      H     0
49310   1997    66      1157   61      1204   21      H     0
89021   2006    44      1284   41      1343   21      A     0
85042   2005    66      1131   73      1216   22      H     0
103660  2009    26      1326   59      1359   22      H     0
\n", 1214 | "
" 1215 | ], 1216 | "text/plain": [ 1217 | " Season Daynum Wteam Wscore Lteam Lscore Wloc Numot\n", 1218 | "100027 2008 66 1203 49 1387 20 H 0\n", 1219 | "49310 1997 66 1157 61 1204 21 H 0\n", 1220 | "89021 2006 44 1284 41 1343 21 A 0\n", 1221 | "85042 2005 66 1131 73 1216 22 H 0\n", 1222 | "103660 2009 26 1326 59 1359 22 H 0" 1223 | ] 1224 | }, 1225 | "execution_count": 22, 1226 | "metadata": {}, 1227 | "output_type": "execute_result" 1228 | } 1229 | ], 1230 | "source": [ 1231 | "df.sort_values('Lscore').head()" 1232 | ] 1233 | }, 1234 | { 1235 | "cell_type": "code", 1236 | "execution_count": 23, 1237 | "metadata": { 1238 | "collapsed": false 1239 | }, 1240 | "outputs": [ 1241 | { 1242 | "data": { 1243 | "text/plain": [ 1244 | "" 1245 | ] 1246 | }, 1247 | "execution_count": 23, 1248 | "metadata": {}, 1249 | "output_type": "execute_result" 1250 | } 1251 | ], 1252 | "source": [ 1253 | "df.groupby('Lscore')" 1254 | ] 1255 | }, 1256 | { 1257 | "cell_type": "markdown", 1258 | "metadata": {}, 1259 | "source": [ 1260 | "# Filtering Rows Conditionally" 1261 | ] 1262 | }, 1263 | { 1264 | "cell_type": "markdown", 1265 | "metadata": {}, 1266 | "source": [ 1267 | "Now, let's say we want to find all of the rows that satisfy a particular condition. For example, I want to find all of the games where the winning team scored more than 150 points. The idea behind this command is that you access the column 'Wscore' of the dataframe df (df['Wscore']), find which entries are above 150 (df['Wscore'] > 150), and then return only those specific rows in a dataframe format (df[df['Wscore'] > 150])." 1268 | ] 1269 | }, 1270 | { 1271 | "cell_type": "code", 1272 | "execution_count": 24, 1273 | "metadata": { 1274 | "collapsed": false 1275 | }, 1276 | "outputs": [ 1277 | { 1278 | "data": { 1279 | "text/html": [ 1280 | "
\n", 1281 | "\n", 1282 | " \n", 1283 | " \n", 1284 | " \n", 1285 | " \n", 1286 | " \n", 1287 | " \n", 1288 | " \n", 1289 | " \n", 1290 | " \n", 1291 | " \n", 1292 | " \n", 1293 | " \n", 1294 | " \n", 1295 | " \n", 1296 | " \n", 1297 | " \n", 1298 | " \n", 1299 | " \n", 1300 | " \n", 1301 | " \n", 1302 | " \n", 1303 | " \n", 1304 | " \n", 1305 | " \n", 1306 | " \n", 1307 | " \n", 1308 | " \n", 1309 | " \n", 1310 | " \n", 1311 | " \n", 1312 | " \n", 1313 | " \n", 1314 | " \n", 1315 | " \n", 1316 | " \n", 1317 | " \n", 1318 | " \n", 1319 | " \n", 1320 | " \n", 1321 | " \n", 1322 | " \n", 1323 | " \n", 1324 | " \n", 1325 | " \n", 1326 | " \n", 1327 | " \n", 1328 | " \n", 1329 | " \n", 1330 | " \n", 1331 | " \n", 1332 | " \n", 1333 | " \n", 1334 | " \n", 1335 | " \n", 1336 | " \n", 1337 | " \n", 1338 | " \n", 1339 | " \n", 1340 | " \n", 1341 | " \n", 1342 | " \n", 1343 | " \n", 1344 | " \n", 1345 | " \n", 1346 | " \n", 1347 | " \n", 1348 | " \n", 1349 | " \n", 1350 | " \n", 1351 | " \n", 1352 | " \n", 1353 | " \n", 1354 | " \n", 1355 | " \n", 1356 | " \n", 1357 | " \n", 1358 | " \n", 1359 | " \n", 1360 | " \n", 1361 | " \n", 1362 | " \n", 1363 | " \n", 1364 | " \n", 1365 | " \n", 1366 | " \n", 1367 | " \n", 1368 | " \n", 1369 | " \n", 1370 | " \n", 1371 | " \n", 1372 | " \n", 1373 | " \n", 1374 | " \n", 1375 | " \n", 1376 | " \n", 1377 | " \n", 1378 | " \n", 1379 | " \n", 1380 | " \n", 1381 | " \n", 1382 | " \n", 1383 | " \n", 1384 | " \n", 1385 | " \n", 1386 | " \n", 1387 | " \n", 1388 | " \n", 1389 | " \n", 1390 | " \n", 1391 | " \n", 1392 | " \n", 1393 | " \n", 1394 | " \n", 1395 | " \n", 1396 | " \n", 1397 | " \n", 1398 | " \n", 1399 | " \n", 1400 | " \n", 1401 | " \n", 1402 | " \n", 1403 | " \n", 1404 | " \n", 1405 | " \n", 1406 | " \n", 1407 | " \n", 1408 | " \n", 1409 | " \n", 1410 | " \n", 1411 | " \n", 1412 | " \n", 1413 | " \n", 1414 | " \n", 1415 | " \n", 1416 | " \n", 1417 | " \n", 1418 | " \n", 1419 | " \n", 1420 | " \n", 1421 | " \n", 1422 | " \n", 1423 | 
" \n", 1424 | " \n", 1425 | " \n", 1426 | " \n", 1427 | " \n", 1428 | " \n", 1429 | " \n", 1430 | " \n", 1431 | " \n", 1432 | " \n", 1433 | " \n", 1434 | " \n", 1435 | " \n", 1436 | " \n", 1437 | " \n", 1438 | " \n", 1439 | " \n", 1440 | " \n", 1441 | " \n", 1442 | " \n", 1443 | " \n", 1444 | " \n", 1445 | " \n", 1446 | " \n", 1447 | " \n", 1448 | " \n", 1449 | " \n", 1450 | " \n", 1451 | " \n", 1452 | " \n", 1453 | " \n", 1454 | " \n", 1455 | " \n", 1456 | " \n", 1457 | " \n", 1458 | " \n", 1459 | " \n", 1460 | " \n", 1461 | " \n", 1462 | " \n", 1463 | " \n", 1464 | " \n", 1465 | " \n", 1466 | " \n", 1467 | " \n", 1468 | " \n", 1469 | " \n", 1470 | " \n", 1471 | " \n", 1472 | " \n", 1473 | " \n", 1474 | " \n", 1475 | " \n", 1476 | " \n", 1477 | " \n", 1478 | " \n", 1479 | " \n", 1480 | " \n", 1481 | " \n", 1482 | " \n", 1483 | " \n", 1484 | " \n", 1485 | " \n", 1486 | " \n", 1487 | " \n", 1488 | " \n", 1489 | " \n", 1490 | " \n", 1491 | " \n", 1492 | " \n", 1493 | " \n", 1494 | " \n", 1495 | "
       Season  Daynum  Wteam  Wscore  Lteam  Lscore  Wloc  Numot
5269   1986    75      1258   151     1109   107     H     0
12046  1988    40      1328   152     1147   84      H     0
12355  1988    52      1328   151     1173   99      N     0
16040  1989    40      1328   152     1331   122     H     0
16853  1989    68      1258   162     1109   144     A     0
17867  1989    92      1258   181     1109   150     H     0
19653  1990    30      1328   173     1109   101     H     0
19971  1990    38      1258   152     1109   137     A     0
20022  1990    40      1116   166     1109   101     H     0
22145  1990    97      1258   157     1362   115     H     0
23582  1991    26      1318   152     1258   123     N     0
24341  1991    47      1328   172     1258   112     H     0
24970  1991    68      1258   186     1109   140     H     0
25656  1991    84      1106   151     1212   97      H     0
28687  1992    54      1261   159     1319   86      H     0
35023  1993    112     1380   155     1341   91      A     0
40060  1995    32      1375   156     1341   114     H     0
52600  1998    33      1395   153     1410   87      H     0
\n", 1496 | "
" 1497 | ], 1498 | "text/plain": [ 1499 | " Season Daynum Wteam Wscore Lteam Lscore Wloc Numot\n", 1500 | "5269 1986 75 1258 151 1109 107 H 0\n", 1501 | "12046 1988 40 1328 152 1147 84 H 0\n", 1502 | "12355 1988 52 1328 151 1173 99 N 0\n", 1503 | "16040 1989 40 1328 152 1331 122 H 0\n", 1504 | "16853 1989 68 1258 162 1109 144 A 0\n", 1505 | "17867 1989 92 1258 181 1109 150 H 0\n", 1506 | "19653 1990 30 1328 173 1109 101 H 0\n", 1507 | "19971 1990 38 1258 152 1109 137 A 0\n", 1508 | "20022 1990 40 1116 166 1109 101 H 0\n", 1509 | "22145 1990 97 1258 157 1362 115 H 0\n", 1510 | "23582 1991 26 1318 152 1258 123 N 0\n", 1511 | "24341 1991 47 1328 172 1258 112 H 0\n", 1512 | "24970 1991 68 1258 186 1109 140 H 0\n", 1513 | "25656 1991 84 1106 151 1212 97 H 0\n", 1514 | "28687 1992 54 1261 159 1319 86 H 0\n", 1515 | "35023 1993 112 1380 155 1341 91 A 0\n", 1516 | "40060 1995 32 1375 156 1341 114 H 0\n", 1517 | "52600 1998 33 1395 153 1410 87 H 0" 1518 | ] 1519 | }, 1520 | "execution_count": 24, 1521 | "metadata": {}, 1522 | "output_type": "execute_result" 1523 | } 1524 | ], 1525 | "source": [ 1526 | "df[df['Wscore'] > 150]" 1527 | ] 1528 | }, 1529 | { 1530 | "cell_type": "markdown", 1531 | "metadata": {}, 1532 | "source": [ 1533 | "This also works if you have multiple conditions. Let's say we want to find out when the winning team scores more than 150 points and when the losing team scores below 100. " 1534 | ] 1535 | }, 1536 | { 1537 | "cell_type": "code", 1538 | "execution_count": 25, 1539 | "metadata": { 1540 | "collapsed": false 1541 | }, 1542 | "outputs": [ 1543 | { 1544 | "data": { 1545 | "text/html": [ 1546 | "
[HTML table omitted; the text/plain output below shows the same rows]" 1631 | ], 1632 | "text/plain": [ 1633 | " Season Daynum Wteam Wscore Lteam Lscore Wloc Numot\n", 1634 | "12046 1988 40 1328 152 1147 84 H 0\n", 1635 | "12355 1988 52 1328 151 1173 99 N 0\n", 1636 | "25656 1991 84 1106 151 1212 97 H 0\n", 1637 | "28687 1992 54 1261 159 1319 86 H 0\n", 1638 | "35023 1993 112 1380 155 1341 91 A 0\n", 1639 | "52600 1998 33 1395 153 1410 87 H 0" 1640 | ] 1641 | }, 1642 | "execution_count": 25, 1643 | "metadata": {}, 1644 | "output_type": "execute_result" 1645 | } 1646 | ], 1647 | "source": [ 1648 | "df[(df['Wscore'] > 150) & (df['Lscore'] < 100)]" 1649 | ] 1650 | }, 1651 | { 1652 | "cell_type": "markdown", 1653 | "metadata": {}, 1654 | "source": [ 1655 | "# Grouping" 1656 | ] 1657 | }, 1658 | { 1659 | "cell_type": "markdown", 1660 | "metadata": {}, 1661 | "source": [ 1662 | "Another important function in Pandas is **groupby()**. It lets you group entries by a shared attribute (e.g. grouping games by Wteam number) and then perform operations on each group. The following command groups all the entries (games) with the same Wteam number and finds the mean winning score for each group.
" 1663 | ] 1664 | }, 1665 | { 1666 | "cell_type": "code", 1667 | "execution_count": 26, 1668 | "metadata": { 1669 | "collapsed": false 1670 | }, 1671 | "outputs": [ 1672 | { 1673 | "data": { 1674 | "text/plain": [ 1675 | "Wteam\n", 1676 | "1101 78.111111\n", 1677 | "1102 69.893204\n", 1678 | "1103 75.839768\n", 1679 | "1104 75.825944\n", 1680 | "1105 74.960894\n", 1681 | "Name: Wscore, dtype: float64" 1682 | ] 1683 | }, 1684 | "execution_count": 26, 1685 | "metadata": {}, 1686 | "output_type": "execute_result" 1687 | } 1688 | ], 1689 | "source": [ 1690 | "df.groupby('Wteam')['Wscore'].mean().head()" 1691 | ] 1692 | }, 1693 | { 1694 | "cell_type": "markdown", 1695 | "metadata": {}, 1696 | "source": [ 1697 | "This next command groups all the games with the same Wteam number and counts how many times that team won at home, on the road, or at a neutral site." 1698 | ] 1699 | }, 1700 | { 1701 | "cell_type": "code", 1702 | "execution_count": 27, 1703 | "metadata": { 1704 | "collapsed": false, 1705 | "scrolled": false 1706 | }, 1707 | "outputs": [ 1708 | { 1709 | "data": { 1710 | "text/plain": [ 1711 | "Wteam Wloc\n", 1712 | "1101 H 12\n", 1713 | " A 3\n", 1714 | " N 3\n", 1715 | "1102 H 204\n", 1716 | " A 73\n", 1717 | " N 32\n", 1718 | "1103 H 324\n", 1719 | " A 153\n", 1720 | " N 41\n", 1721 | "Name: Wloc, dtype: int64" 1722 | ] 1723 | }, 1724 | "execution_count": 27, 1725 | "metadata": {}, 1726 | "output_type": "execute_result" 1727 | } 1728 | ], 1729 | "source": [ 1730 | "df.groupby('Wteam')['Wloc'].value_counts().head(9)" 1731 | ] 1732 | }, 1733 | { 1734 | "cell_type": "markdown", 1735 | "metadata": {}, 1736 | "source": [ 1737 | "Each dataframe has a **values** attribute, which returns the contents of the dataframe as a NumPy array." 1738 | ] 1739 | }, 1740 | { 1741 | "cell_type": "code", 1742 | "execution_count": 28, 1743 | "metadata": { 1744 | "collapsed": false 1745 | }, 1746 | "outputs": [ 1747 | { 1748 | "data": { 1749 | 
"text/plain": [ 1750 | "array([[1985, 20, 1228, ..., 64, 'N', 0],\n", 1751 | " [1985, 25, 1106, ..., 70, 'H', 0],\n", 1752 | " [1985, 25, 1112, ..., 56, 'H', 0],\n", 1753 | " ..., \n", 1754 | " [2016, 132, 1246, ..., 77, 'N', 1],\n", 1755 | " [2016, 132, 1277, ..., 62, 'N', 0],\n", 1756 | " [2016, 132, 1386, ..., 74, 'N', 0]], dtype=object)" 1757 | ] 1758 | }, 1759 | "execution_count": 28, 1760 | "metadata": {}, 1761 | "output_type": "execute_result" 1762 | } 1763 | ], 1764 | "source": [ 1765 | "df.values" 1766 | ] 1767 | }, 1768 | { 1769 | "cell_type": "markdown", 1770 | "metadata": {}, 1771 | "source": [ 1772 | "Now you can access elements just as you would in an array. " 1773 | ] 1774 | }, 1775 | { 1776 | "cell_type": "code", 1777 | "execution_count": 29, 1778 | "metadata": { 1779 | "collapsed": false 1780 | }, 1781 | "outputs": [ 1782 | { 1783 | "data": { 1784 | "text/plain": [ 1785 | "1985" 1786 | ] 1787 | }, 1788 | "execution_count": 29, 1789 | "metadata": {}, 1790 | "output_type": "execute_result" 1791 | } 1792 | ], 1793 | "source": [ 1794 | "df.values[0][0]" 1795 | ] 1796 | }, 1797 | { 1798 | "cell_type": "markdown", 1799 | "metadata": {}, 1800 | "source": [ 1801 | "# Dataframe Iteration" 1802 | ] 1803 | }, 1804 | { 1805 | "cell_type": "markdown", 1806 | "metadata": {}, 1807 | "source": [ 1808 | "To iterate through a dataframe, we can use the **iterrows()** function. Below is an example of what the first two rows look like. 
iterrows() yields each row as an (index, Series) pair." 1809 | ] 1810 | }, 1811 | { 1812 | "cell_type": "code", 1813 | "execution_count": 30, 1814 | "metadata": { 1815 | "collapsed": false 1816 | }, 1817 | "outputs": [ 1818 | { 1819 | "name": "stdout", 1820 | "output_type": "stream", 1821 | "text": [ 1822 | "Season 1985\n", 1823 | "Daynum 20\n", 1824 | "Wteam 1228\n", 1825 | "Wscore 81\n", 1826 | "Lteam 1328\n", 1827 | "Lscore 64\n", 1828 | "Wloc N\n", 1829 | "Numot 0\n", 1830 | "Name: 0, dtype: object\n", 1831 | "Season 1985\n", 1832 | "Daynum 25\n", 1833 | "Wteam 1106\n", 1834 | "Wscore 77\n", 1835 | "Lteam 1354\n", 1836 | "Lscore 70\n", 1837 | "Wloc H\n", 1838 | "Numot 0\n", 1839 | "Name: 1, dtype: object\n" 1840 | ] 1841 | } 1842 | ], 1843 | "source": [ 1844 | "for index, row in df.iterrows():\n", 1845 | "    print(row)\n", 1846 | "    if index == 1:\n", 1847 | "        break" 1848 | ] 1849 | }, 1850 | { 1851 | "cell_type": "markdown", 1852 | "metadata": {}, 1853 | "source": [ 1854 | "# Extracting Rows and Columns" 1855 | ] 1856 | }, 1857 | { 1858 | "cell_type": "markdown", 1859 | "metadata": {}, 1860 | "source": [ 1861 | "The bracket indexing operator is one way to extract certain columns from a dataframe." 1862 | ] 1863 | }, 1864 | { 1865 | "cell_type": "code", 1866 | "execution_count": 31, 1867 | "metadata": { 1868 | "collapsed": false, 1869 | "scrolled": true 1870 | }, 1871 | "outputs": [ 1872 | { 1873 | "data": { 1874 | "text/html": [ 1875 | "
[HTML table omitted; the text/plain output below shows the same rows]" 1913 | ], 1914 | "text/plain": [ 1915 | " Wscore Lscore\n", 1916 | "0 81 64\n", 1917 | "1 77 70\n", 1918 | "2 63 56\n", 1919 | "3 70 54\n", 1920 | "4 86 74" 1921 | ] 1922 | }, 1923 | "execution_count": 31, 1924 | "metadata": {}, 1925 | "output_type": "execute_result" 1926 | } 1927 | ], 1928 | "source": [ 1929 | "df[['Wscore', 'Lscore']].head()" 1930 | ] 1931 | }, 1932 | { 1933 | "cell_type": "markdown", 1934 | "metadata": {}, 1935 | "source": [ 1936 | "Notice that you can achieve the same result with the **loc** indexer. loc is very versatile and can help you with a lot of accessing and extracting tasks. " 1937 | ] 1938 | }, 1939 | { 1940 | "cell_type": "code", 1941 | "execution_count": 32, 1942 | "metadata": { 1943 | "collapsed": false 1944 | }, 1945 | "outputs": [ 1946 | { 1947 | "data": { 1948 | "text/html": [ 1949 | "
[HTML table omitted; the text/plain output below shows the same rows]" 1987 | ], 1988 | "text/plain": [ 1989 | " Wscore Lscore\n", 1990 | "0 81 64\n", 1991 | "1 77 70\n", 1992 | "2 63 56\n", 1993 | "3 70 54\n", 1994 | "4 86 74" 1995 | ] 1996 | }, 1997 | "execution_count": 32, 1998 | "metadata": {}, 1999 | "output_type": "execute_result" 2000 | } 2001 | ], 2002 | "source": [ 2003 | "df.loc[:, ['Wscore', 'Lscore']].head()" 2004 | ] 2005 | }, 2006 | { 2007 | "cell_type": "markdown", 2008 | "metadata": {}, 2009 | "source": [ 2010 | "Note the difference in the return types: single brackets return a Series, while double brackets return a DataFrame. " 2011 | ] 2012 | }, 2013 | { 2014 | "cell_type": "code", 2015 | "execution_count": 33, 2016 | "metadata": { 2017 | "collapsed": false 2018 | }, 2019 | "outputs": [ 2020 | { 2021 | "data": { 2022 | "text/plain": [ 2023 | "pandas.core.series.Series" 2024 | ] 2025 | }, 2026 | "execution_count": 33, 2027 | "metadata": {}, 2028 | "output_type": "execute_result" 2029 | } 2030 | ], 2031 | "source": [ 2032 | "type(df['Wscore'])" 2033 | ] 2034 | }, 2035 | { 2036 | "cell_type": "code", 2037 | "execution_count": 34, 2038 | "metadata": { 2039 | "collapsed": false 2040 | }, 2041 | "outputs": [ 2042 | { 2043 | "data": { 2044 | "text/plain": [ 2045 | "pandas.core.frame.DataFrame" 2046 | ] 2047 | }, 2048 | "execution_count": 34, 2049 | "metadata": {}, 2050 | "output_type": "execute_result" 2051 | } 2052 | ], 2053 | "source": [ 2054 | "type(df[['Wscore']])" 2055 | ] 2056 | }, 2057 | { 2058 | "cell_type": "markdown", 2059 | "metadata": {}, 2060 | "source": [ 2061 | "You've seen before that you can access columns through df['col name']. You can access rows by using slicing operations. " 2062 | ] 2063 | }, 2064 | { 2065 | "cell_type": "code", 2066 | "execution_count": 35, 2067 | "metadata": { 2068 | "collapsed": false 2069 | }, 2070 | "outputs": [ 2071 | { 2072 | "data": { 2073 | "text/html": [ 2074 | "
[HTML table omitted; the text/plain output below shows the same rows]" 2126 | ], 2127 | "text/plain": [ 2128 | " Season Daynum Wteam Wscore Lteam Lscore Wloc Numot\n", 2129 | "0 1985 20 1228 81 1328 64 N 0\n", 2130 | "1 1985 25 1106 77 1354 70 H 0\n", 2131 | "2 1985 25 1112 63 1223 56 H 0" 2132 | ] 2133 | }, 2134 | "execution_count": 35, 2135 | "metadata": {}, 2136 | "output_type": "execute_result" 2137 | } 2138 | ], 2139 | "source": [ 2140 | "df[0:3]" 2141 | ] 2142 | }, 2143 | { 2144 | "cell_type": "markdown", 2145 | "metadata": {}, 2146 | "source": [ 2147 | "Here's an equivalent using **iloc**, which selects rows and columns by integer position." 2148 | ] 2149 | }, 2150 | { 2151 | "cell_type": "code", 2152 | "execution_count": 36, 2153 | "metadata": { 2154 | "collapsed": false 2155 | }, 2156 | "outputs": [ 2157 | { 2158 | "data": { 2159 | "text/html": [ 2160 | "
[HTML table omitted; the text/plain output below shows the same rows]" 2212 | ], 2213 | "text/plain": [ 2214 | " Season Daynum Wteam Wscore Lteam Lscore Wloc Numot\n", 2215 | "0 1985 20 1228 81 1328 64 N 0\n", 2216 | "1 1985 25 1106 77 1354 70 H 0\n", 2217 | "2 1985 25 1112 63 1223 56 H 0" 2218 | ] 2219 | }, 2220 | "execution_count": 36, 2221 | "metadata": {}, 2222 | "output_type": "execute_result" 2223 | } 2224 | ], 2225 | "source": [ 2226 | "df.iloc[0:3,:]" 2227 | ] 2228 | }, 2229 | { 2230 | "cell_type": "markdown", 2231 | "metadata": {}, 2232 | "source": [ 2233 | "# Data Cleaning" 2234 | ] 2235 | }, 2236 | { 2237 | "cell_type": "markdown", 2238 | "metadata": {}, 2239 | "source": [ 2240 | "One of the biggest parts of doing well in Kaggle competitions is data cleaning. Often, the CSV file you're given (the Titanic dataset is a good example) has a lot of missing values, which you have to identify. The following **isnull()** call flags the missing values in the dataframe, and **sum()** then totals them for each column. In this case, we have a pretty clean dataset." 2241 | ] 2242 | }, 2243 | { 2244 | "cell_type": "code", 2245 | "execution_count": 37, 2246 | "metadata": { 2247 | "collapsed": false 2248 | }, 2249 | "outputs": [ 2250 | { 2251 | "data": { 2252 | "text/plain": [ 2253 | "Season 0\n", 2254 | "Daynum 0\n", 2255 | "Wteam 0\n", 2256 | "Wscore 0\n", 2257 | "Lteam 0\n", 2258 | "Lscore 0\n", 2259 | "Wloc 0\n", 2260 | "Numot 0\n", 2261 | "dtype: int64" 2262 | ] 2263 | }, 2264 | "execution_count": 37, 2265 | "metadata": {}, 2266 | "output_type": "execute_result" 2267 | } 2268 | ], 2269 | "source": [ 2270 | "df.isnull().sum()" 2271 | ] 2272 | }, 2273 | { 2274 | "cell_type": "markdown", 2275 | "metadata": {}, 2276 | "source": [ 2277 | "If you do end up having missing values in your datasets, be sure to get familiar with these two functions. \n", 2278 | "* **dropna()** - This function allows you to drop all (or some) of the rows that have missing values. 
\n", 2279 | "* **fillna()** - This function lets you replace missing values with the value that you pass in." 2280 | ] 2281 | }, 2282 | { 2283 | "cell_type": "markdown", 2284 | "metadata": {}, 2285 | "source": [ 2286 | "# Visualizing Data" 2287 | ] 2288 | }, 2289 | { 2290 | "cell_type": "markdown", 2291 | "metadata": {}, 2292 | "source": [ 2293 | "A useful way to explore dataframes is to plot them with matplotlib. " 2294 | ] 2295 | }, 2296 | { 2297 | "cell_type": "code", 2298 | "execution_count": 38, 2299 | "metadata": { 2300 | "collapsed": true 2301 | }, 2302 | "outputs": [], 2303 | "source": [ 2304 | "import matplotlib.pyplot as plt\n", 2305 | "%matplotlib inline" 2306 | ] 2307 | }, 2308 | { 2309 | "cell_type": "code", 2310 | "execution_count": 39, 2311 | "metadata": { 2312 | "collapsed": false 2313 | }, 2314 | "outputs": [ 2315 | { 2316 | "data": { 2317 | "text/plain": [ 2318 | "" 2319 | ] 2320 | }, 2321 | "execution_count": 39, 2322 | "metadata": {}, 2323 | "output_type": "execute_result" 2324 | }, 2325 | { 2326 | "data": { 2327 | "image/png": 
"[base64-encoded PNG omitted: histogram of Wscore with 20 bins, x-axis labeled 'Points for Winning Team']", 2328 | "text/plain": [ 2329 | "" 2330 | ] 2331 | }, 2332 | "metadata": {}, 2333 | "output_type": "display_data" 2334 | } 2335 | ], 2336 | "source": [ 2337 | "ax = df['Wscore'].plot.hist(bins=20)\n", 2338 | "ax.set_xlabel('Points for Winning Team')" 2339 | ] 2340 | }, 2341 | { 2342 | "cell_type": "markdown", 2343 | "metadata": {}, 2344 | "source": [ 2345 | "# Creating Kaggle Submission CSVs" 2346 | ] 2347 | }, 2348 | { 2349 | "cell_type": "markdown", 2350 | "metadata": {}, 2351 | "source": [ 2352 | "This isn't directly Pandas related, but I assume that most people who use Pandas probably do a lot of Kaggle competitions as well. As you probably know, Kaggle competitions require you to create a CSV of your predictions. 
Here's some starter code that can help you create that CSV file:" 2353 | ] 2354 | }, 2355 | { 2356 | "cell_type": "code", 2357 | "execution_count": 40, 2358 | "metadata": { 2359 | "collapsed": false 2360 | }, 2361 | "outputs": [ 2362 | { 2363 | "name": "stdout", 2364 | "output_type": "stream", 2365 | "text": [ 2366 | "[[ 0 10]\n", 2367 | " [ 1 15]\n", 2368 | " [ 2 20]]\n" 2369 | ] 2370 | } 2371 | ], 2372 | "source": [ 2373 | "import numpy as np\n", 2374 | "import csv\n", 2375 | "\n", 2376 | "results = [[0,10],[1,15],[2,20]]\n", 2377 | "results = np.array(results)\n", 2378 | "print(results)" 2379 | ] 2380 | }, 2381 | { 2382 | "cell_type": "code", 2383 | "execution_count": 41, 2384 | "metadata": { 2385 | "collapsed": false 2386 | }, 2387 | "outputs": [], 2388 | "source": [ 2389 | "firstRow = [['id', 'pred']]\n", 2390 | "with open(\"result.csv\", \"wb\") as f:  # Python 2; on Python 3 use open(\"result.csv\", \"w\", newline=\"\")\n", 2391 | "    writer = csv.writer(f)\n", 2392 | "    writer.writerows(firstRow)\n", 2393 | "    writer.writerows(results)" 2394 | ] 2395 | }, 2396 | { 2397 | "cell_type": "markdown", 2398 | "metadata": {}, 2399 | "source": [ 2400 | "The approach above relies on Python lists and NumPy. If you want a purely Pandas-based approach, take a look at this video: https://www.youtube.com/watch?v=ylRlGCtAtiE&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=22" 2401 | ] 2402 | }, 2403 | { 2404 | "cell_type": "markdown", 2405 | "metadata": {}, 2406 | "source": [ 2407 | "# Other Useful Functions" 2408 | ] 2409 | }, 2410 | { 2411 | "cell_type": "markdown", 2412 | "metadata": {}, 2413 | "source": [ 2414 | "* **drop()** - This function removes the column or row that you pass in (you also have to specify the axis). 
\n", 2415 | "* **agg()** - The aggregate function lets you compute summary statistics for each group.\n", 2416 | "* **apply()** - Lets you apply a specific function to any/all elements in a DataFrame or Series.\n", 2417 | "* **get_dummies()** - Helpful for turning categorical data into one-hot vectors.\n", 2418 | "* **drop_duplicates()** - Lets you remove duplicate rows." 2419 | ] 2420 | }, 2421 | { 2422 | "cell_type": "markdown", 2423 | "metadata": { 2424 | "collapsed": true 2425 | }, 2426 | "source": [ 2427 | "# Lots of Other Great Resources" 2428 | ] 2429 | }, 2430 | { 2431 | "cell_type": "markdown", 2432 | "metadata": {}, 2433 | "source": [ 2434 | "Pandas has been around for a while, and there are a lot of other good resources if you're still interested in getting the most out of this library. \n", 2435 | "* http://pandas.pydata.org/pandas-docs/stable/10min.html\n", 2436 | "* https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python\n", 2437 | "* http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/\n", 2438 | "* https://www.dataquest.io/blog/pandas-python-tutorial/\n", 2439 | "* https://drive.google.com/file/d/0ByIrJAE4KMTtTUtiVExiUGVkRkE/view\n", 2440 | "* https://www.youtube.com/playlist?list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y" 2441 | ] 2442 | } 2443 | ], 2444 | "metadata": { 2445 | "anaconda-cloud": {}, 2446 | "kernelspec": { 2447 | "display_name": "Python [conda root]", 2448 | "language": "python", 2449 | "name": "conda-root-py" 2450 | }, 2451 | "language_info": { 2452 | "codemirror_mode": { 2453 | "name": "ipython", 2454 | "version": 2 2455 | }, 2456 | "file_extension": ".py", 2457 | "mimetype": "text/x-python", 2458 | "name": "python", 2459 | "nbconvert_exporter": "python", 2460 | "pygments_lexer": "ipython2", 2461 | "version": "2.7.12" 2462 | } 2463 | }, 2464 | "nbformat": 4, 2465 | "nbformat_minor": 1 2466 | } 2467 | -------------------------------------------------------------------------------- /README.md: 
-------------------------------------------------------------------------------- 1 | # Pandas-Tutorial 2 | 3 | I've been working with Pandas quite a bit lately, and figured I'd make a short summary of the most important and helpful functions in the library. 4 | 5 | Hopefully it's helpful for you! 6 | 7 | # Lots of Other Great Tutorials 8 | * http://pandas.pydata.org/pandas-docs/stable/10min.html 9 | * https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python 10 | * http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/ 11 | * https://www.dataquest.io/blog/pandas-python-tutorial/ 12 | * https://drive.google.com/file/d/0ByIrJAE4KMTtTUtiVExiUGVkRkE/view 13 | * https://www.youtube.com/playlist?list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y 14 | -------------------------------------------------------------------------------- /result.csv: -------------------------------------------------------------------------------- 1 | id,pred 2 | 0,10 3 | 1,15 4 | 2,20 5 | --------------------------------------------------------------------------------
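The notebook's "Other Useful Functions" list and its pointer to a purely Pandas-based way of writing the submission file can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the notebook's own code: the rows are made up, only `Wscore` appears in the tutorial text above (`Wteam`, `Lscore` and `Wloc` are assumed column names), and the CSV is written to an in-memory buffer rather than to `result.csv`.

```python
import io

import pandas as pd

# Made-up rows for illustration; only the column names imitate the dataset.
games = pd.DataFrame({
    "Wteam": [1228, 1106, 1228, 1228],
    "Wscore": [81, 77, 81, 90],
    "Lscore": [64, 70, 64, 85],
    "Wloc": ["N", "H", "N", "A"],
})

# drop_duplicates(): rows 0 and 2 are identical, so only the first is kept.
unique_games = games.drop_duplicates()

# drop(): axis=1 removes a column; axis=0 would remove a row by index label.
no_loser_score = unique_games.drop("Lscore", axis=1)

# agg(): several summary statistics per group (here, per winning team).
per_team = unique_games.groupby("Wteam")["Wscore"].agg(["mean", "max"])

# apply(): run a function on every row (axis=1) to get the winning margin.
margin = unique_games.apply(lambda row: row["Wscore"] - row["Lscore"], axis=1)

# get_dummies(): one-hot encode the categorical game-location column.
one_hot = pd.get_dummies(unique_games["Wloc"], prefix="Wloc")

# Pandas-only submission file: build a DataFrame and let to_csv() write it.
submission = pd.DataFrame({"id": [0, 1, 2], "pred": [10, 15, 20]})
buffer = io.StringIO()  # stand-in for a real path such as "result.csv"
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

Passing a filename such as `"result.csv"` to `to_csv()` in place of the buffer should produce the same `id,pred` layout as the `csv`-module version, and `index=False` keeps the DataFrame's row index out of the file.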