├── .ipynb_checkpoints
│   └── Pandas Tutorial-checkpoint.ipynb
├── DataStructures.png
├── Pandas Tutorial.ipynb
├── README.md
├── RegularSeasonCompactResults.csv
└── result.csv
/.ipynb_checkpoints/Pandas Tutorial-checkpoint.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "import pandas as pd"
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "metadata": {},
17 | "source": [
18 | "# Introduction"
19 | ]
20 | },
21 | {
22 | "cell_type": "markdown",
23 | "metadata": {},
24 | "source": [
25 | "Since I've been working on a lot of Kaggle competitions, I use Pandas a lot. As you may know, Pandas (in addition to Numpy) is the go-to Python library for all your data science needs. It helps with dealing with input data in CSV formats and with transforming your data into a form where it can be fed into ML models. However, getting comfortable with the ideas of dataframes, slicing, etc. was very tough for me in the beginning. Hopefully, this short tutorial can show you a lot of different commands that will help you gain the most insights into your dataset. "
26 | ]
27 | },
28 | {
29 | "cell_type": "markdown",
30 | "metadata": {},
31 | "source": [
32 | "# Loading in Data"
33 | ]
34 | },
35 | {
36 | "cell_type": "markdown",
37 | "metadata": {},
38 | "source": [
39 | "The first step in any ML problem is identifying what format your data is in, and then loading it into whatever framework you're using. For Kaggle competitions, a lot of data can be found in CSV files, so that's the example we're going to use. "
40 | ]
41 | },
42 | {
43 | "cell_type": "markdown",
44 | "metadata": {},
45 | "source": [
46 | "Since I'm a huge sports fan, we're going to be looking at a sports dataset that shows the results from NCAA basketball games from 1985 to 2016. This dataset is in a CSV file, and the function we're going to use to read in the file is called **pd.read_csv()**. This function returns a **dataframe** variable. The dataframe is the golden jewel data structure for Pandas. It is defined as \"a two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns)\"."
47 | ]
48 | },
49 | {
50 | "cell_type": "markdown",
51 | "metadata": {},
52 | "source": [
53 | "Just think of it as a table for now. We'll explain more about what makes it unique later on. "
54 | ]
55 | },
56 | {
57 | "cell_type": "code",
58 | "execution_count": 2,
59 | "metadata": {
60 | "collapsed": false
61 | },
62 | "outputs": [],
63 | "source": [
64 | "df = pd.read_csv('RegularSeasonCompactResults.csv')"
65 | ]
66 | },
67 | {
68 | "cell_type": "markdown",
69 | "metadata": {},
70 | "source": [
71 | "Now that we have our dataframe in our variable df, let's look at what it contains. We can use the function **head()** to see the first couple rows of the dataframe (or the function **tail()** to see the last few rows)."
72 | ]
73 | },
74 | {
75 | "cell_type": "code",
76 | "execution_count": 3,
77 | "metadata": {
78 | "collapsed": false
79 | },
80 | "outputs": [
81 | {
82 | "data": {
159 | "text/plain": [
160 | " Season Daynum Wteam Wscore Lteam Lscore Wloc Numot\n",
161 | "0 1985 20 1228 81 1328 64 N 0\n",
162 | "1 1985 25 1106 77 1354 70 H 0\n",
163 | "2 1985 25 1112 63 1223 56 H 0\n",
164 | "3 1985 25 1165 70 1432 54 H 0\n",
165 | "4 1985 25 1192 86 1447 74 H 0"
166 | ]
167 | },
168 | "execution_count": 3,
169 | "metadata": {},
170 | "output_type": "execute_result"
171 | }
172 | ],
173 | "source": [
174 | "df.head()"
175 | ]
176 | },
177 | {
178 | "cell_type": "markdown",
179 | "metadata": {},
180 | "source": [
181 | "We can see the dimensions of the dataframe using the **shape** attribute."
182 | ]
183 | },
184 | {
185 | "cell_type": "code",
186 | "execution_count": 4,
187 | "metadata": {
188 | "collapsed": false
189 | },
190 | "outputs": [
191 | {
192 | "data": {
193 | "text/plain": [
194 | "(145289, 8)"
195 | ]
196 | },
197 | "execution_count": 4,
198 | "metadata": {},
199 | "output_type": "execute_result"
200 | }
201 | ],
202 | "source": [
203 | "df.shape"
204 | ]
205 | },
206 | {
207 | "cell_type": "markdown",
208 | "metadata": {
209 | "collapsed": true
210 | },
211 | "source": [
212 | "We can also extract all the column names as a list by calling **tolist()** on the **columns** attribute."
213 | ]
214 | },
215 | {
216 | "cell_type": "code",
217 | "execution_count": 6,
218 | "metadata": {
219 | "collapsed": false
220 | },
221 | "outputs": [
222 | {
223 | "data": {
224 | "text/plain": [
225 | "['Season', 'Daynum', 'Wteam', 'Wscore', 'Lteam', 'Lscore', 'Wloc', 'Numot']"
226 | ]
227 | },
228 | "execution_count": 6,
229 | "metadata": {},
230 | "output_type": "execute_result"
231 | }
232 | ],
233 | "source": [
234 | "df.columns.tolist()"
235 | ]
236 | },
237 | {
238 | "cell_type": "markdown",
239 | "metadata": {},
240 | "source": [
241 | "In order to get a better idea of the type of data that we are dealing with, we can call the **describe()** function to see statistics like mean, min, etc about each column of the dataset. "
242 | ]
243 | },
244 | {
245 | "cell_type": "code",
246 | "execution_count": 10,
247 | "metadata": {
248 | "collapsed": false
249 | },
250 | "outputs": [
251 | {
252 | "data": {
353 | "text/plain": [
354 | " Season Daynum Wteam Wscore \\\n",
355 | "count 145289.000000 145289.000000 145289.000000 145289.000000 \n",
356 | "mean 2001.574834 75.223816 1286.720646 76.600321 \n",
357 | "std 9.233342 33.287418 104.570275 12.173033 \n",
358 | "min 1985.000000 0.000000 1101.000000 34.000000 \n",
359 | "25% 1994.000000 47.000000 1198.000000 68.000000 \n",
360 | "50% 2002.000000 78.000000 1284.000000 76.000000 \n",
361 | "75% 2010.000000 103.000000 1379.000000 84.000000 \n",
362 | "max 2016.000000 132.000000 1464.000000 186.000000 \n",
363 | "\n",
364 | " Lteam Lscore Numot \n",
365 | "count 145289.000000 145289.000000 145289.000000 \n",
366 | "mean 1282.864064 64.497009 0.044387 \n",
367 | "std 104.829234 11.380625 0.247819 \n",
368 | "min 1101.000000 20.000000 0.000000 \n",
369 | "25% 1191.000000 57.000000 0.000000 \n",
370 | "50% 1280.000000 64.000000 0.000000 \n",
371 | "75% 1375.000000 72.000000 0.000000 \n",
372 | "max 1464.000000 150.000000 6.000000 "
373 | ]
374 | },
375 | "execution_count": 10,
376 | "metadata": {},
377 | "output_type": "execute_result"
378 | }
379 | ],
380 | "source": [
381 | "df.describe()"
382 | ]
383 | },
384 | {
385 | "cell_type": "markdown",
386 | "metadata": {},
387 | "source": [
388 | "Okay, so now let's look at information that we want to extract from the dataframe. Let's say I wanted to know the max value of a certain column. The function **max()** will show you the maximum values of all columns."
389 | ]
390 | },
391 | {
392 | "cell_type": "code",
393 | "execution_count": 22,
394 | "metadata": {
395 | "collapsed": false
396 | },
397 | "outputs": [
398 | {
399 | "data": {
400 | "text/plain": [
401 | "Season 2016\n",
402 | "Daynum 132\n",
403 | "Wteam 1464\n",
404 | "Wscore 186\n",
405 | "Lteam 1464\n",
406 | "Lscore 150\n",
407 | "Wloc N\n",
408 | "Numot 6\n",
409 | "dtype: object"
410 | ]
411 | },
412 | "execution_count": 22,
413 | "metadata": {},
414 | "output_type": "execute_result"
415 | }
416 | ],
417 | "source": [
418 | "df.max()"
419 | ]
420 | },
421 | {
422 | "cell_type": "markdown",
423 | "metadata": {},
424 | "source": [
425 | "Then, if you'd like to specifically get the max value for a particular column, you pass in the name of the column using the bracket indexing operator"
426 | ]
427 | },
428 | {
429 | "cell_type": "code",
430 | "execution_count": 24,
431 | "metadata": {
432 | "collapsed": false
433 | },
434 | "outputs": [
435 | {
436 | "data": {
437 | "text/plain": [
438 | "186"
439 | ]
440 | },
441 | "execution_count": 24,
442 | "metadata": {},
443 | "output_type": "execute_result"
444 | }
445 | ],
446 | "source": [
447 | "df['Wscore'].max()"
448 | ]
449 | },
450 | {
451 | "cell_type": "markdown",
452 | "metadata": {},
453 | "source": [
454 | "But what if that's not enough? Let's say we want to actually see the game (row) where this max score happened. We can call the **argmax()** function to identify the row index."
455 | ]
456 | },
457 | {
458 | "cell_type": "code",
459 | "execution_count": 36,
460 | "metadata": {
461 | "collapsed": false
462 | },
463 | "outputs": [
464 | {
465 | "data": {
466 | "text/plain": [
467 | "24970"
468 | ]
469 | },
470 | "execution_count": 36,
471 | "metadata": {},
472 | "output_type": "execute_result"
473 | }
474 | ],
475 | "source": [
476 | "df['Wscore'].argmax()"
477 | ]
478 | },
479 | {
480 | "cell_type": "markdown",
481 | "metadata": {},
482 | "source": [
483 | "Then, in order to get attributes about the game, we need to use the **iloc[]** indexer. Iloc is definitely one of the more important tools. The main idea is that you want to use it whenever you have the integer position of a certain row that you want to access. As per Pandas documentation, iloc is an \"integer-location based indexing for selection by position.\""
484 | ]
485 | },
486 | {
487 | "cell_type": "code",
488 | "execution_count": 35,
489 | "metadata": {
490 | "collapsed": false
491 | },
492 | "outputs": [
493 | {
494 | "data": {
527 | "text/plain": [
528 | " Season Daynum Wteam Wscore Lteam Lscore Wloc Numot\n",
529 | "24970 1991 68 1258 186 1109 140 H 0"
530 | ]
531 | },
532 | "execution_count": 35,
533 | "metadata": {},
534 | "output_type": "execute_result"
535 | }
536 | ],
537 | "source": [
538 | "df.iloc[[df['Wscore'].argmax()]]"
539 | ]
540 | },
541 | {
542 | "cell_type": "markdown",
543 | "metadata": {},
544 | "source": [
545 | "Let's take this a step further. Let's say you want to know the game with the highest scoring winning team (this is what we just calculated), but you then want to know how many points the losing team scored. "
546 | ]
547 | },
548 | {
549 | "cell_type": "code",
550 | "execution_count": 38,
551 | "metadata": {
552 | "collapsed": false
553 | },
554 | "outputs": [
555 | {
556 | "data": {
557 | "text/plain": [
558 | "140"
559 | ]
560 | },
561 | "execution_count": 38,
562 | "metadata": {},
563 | "output_type": "execute_result"
564 | }
565 | ],
566 | "source": [
567 | "df.iloc[[df['Wscore'].argmax()]]['Lscore'].max()"
568 | ]
569 | },
570 | {
571 | "cell_type": "markdown",
572 | "metadata": {},
573 | "source": [
574 | "The bracket indexing operator is the best way to extract certain columns from a dataframe."
575 | ]
576 | },
577 | {
578 | "cell_type": "code",
579 | "execution_count": 27,
580 | "metadata": {
581 | "collapsed": false,
582 | "scrolled": true
583 | },
584 | "outputs": [
585 | {
586 | "data": {
908 | "text/plain": [
909 | " Wscore Lscore\n",
910 | "0 81 64\n",
911 | "1 77 70\n",
912 | "2 63 56\n",
913 | "3 70 54\n",
914 | "4 86 74\n",
915 | "5 79 78\n",
916 | "6 64 44\n",
917 | "7 58 56\n",
918 | "8 98 80\n",
919 | "9 97 89\n",
920 | "10 103 71\n",
921 | "11 75 71\n",
922 | "12 91 72\n",
923 | "13 70 65\n",
924 | "14 87 58\n",
925 | "15 65 62\n",
926 | "16 92 50\n",
927 | "17 65 60\n",
928 | "18 58 53\n",
929 | "19 50 48\n",
930 | "20 47 40\n",
931 | "21 55 52\n",
932 | "22 76 56\n",
933 | "23 59 58\n",
934 | "24 79 76\n",
935 | "25 106 55\n",
936 | "26 95 77\n",
937 | "27 79 66\n",
938 | "28 64 59\n",
939 | "29 76 47\n",
940 | "... ... ...\n",
941 | "145259 69 67\n",
942 | "145260 72 65\n",
943 | "145261 64 61\n",
944 | "145262 77 62\n",
945 | "145263 57 54\n",
946 | "145264 68 63\n",
947 | "145265 81 69\n",
948 | "145266 64 60\n",
949 | "145267 81 71\n",
950 | "145268 93 80\n",
951 | "145269 74 54\n",
952 | "145270 64 61\n",
953 | "145271 55 53\n",
954 | "145272 61 57\n",
955 | "145273 88 57\n",
956 | "145274 76 59\n",
957 | "145275 69 67\n",
958 | "145276 82 60\n",
959 | "145277 54 53\n",
960 | "145278 82 79\n",
961 | "145279 80 74\n",
962 | "145280 71 38\n",
963 | "145281 82 71\n",
964 | "145282 76 54\n",
965 | "145283 62 59\n",
966 | "145284 70 50\n",
967 | "145285 72 58\n",
968 | "145286 82 77\n",
969 | "145287 66 62\n",
970 | "145288 87 74\n",
971 | "\n",
972 | "[145289 rows x 2 columns]"
973 | ]
974 | },
975 | "execution_count": 27,
976 | "metadata": {},
977 | "output_type": "execute_result"
978 | }
979 | ],
980 | "source": [
981 | "df[['Wscore', 'Lscore']]"
982 | ]
983 | },
984 | {
985 | "cell_type": "markdown",
986 | "metadata": {},
987 | "source": [
988 | "Now, let's say we want to find all of the rows that satisfy a particular condition. For example, I want to find all of the games where the winning team scored more than 150 points. The idea behind this command is you want to access the column 'Wscore' of the dataframe df (df['Wscore']), find which entries are above 150 (df['Wscore'] > 150), and then return the results in a dataframe (df[df['Wscore'] > 150])."
989 | ]
990 | },
991 | {
992 | "cell_type": "code",
993 | "execution_count": 33,
994 | "metadata": {
995 | "collapsed": false
996 | },
997 | "outputs": [
998 | {
999 | "data": {
1219 | "text/plain": [
1220 | " Season Daynum Wteam Wscore Lteam Lscore Wloc Numot\n",
1221 | "5269 1986 75 1258 151 1109 107 H 0\n",
1222 | "12046 1988 40 1328 152 1147 84 H 0\n",
1223 | "12355 1988 52 1328 151 1173 99 N 0\n",
1224 | "16040 1989 40 1328 152 1331 122 H 0\n",
1225 | "16853 1989 68 1258 162 1109 144 A 0\n",
1226 | "17867 1989 92 1258 181 1109 150 H 0\n",
1227 | "19653 1990 30 1328 173 1109 101 H 0\n",
1228 | "19971 1990 38 1258 152 1109 137 A 0\n",
1229 | "20022 1990 40 1116 166 1109 101 H 0\n",
1230 | "22145 1990 97 1258 157 1362 115 H 0\n",
1231 | "23582 1991 26 1318 152 1258 123 N 0\n",
1232 | "24341 1991 47 1328 172 1258 112 H 0\n",
1233 | "24970 1991 68 1258 186 1109 140 H 0\n",
1234 | "25656 1991 84 1106 151 1212 97 H 0\n",
1235 | "28687 1992 54 1261 159 1319 86 H 0\n",
1236 | "35023 1993 112 1380 155 1341 91 A 0\n",
1237 | "40060 1995 32 1375 156 1341 114 H 0\n",
1238 | "52600 1998 33 1395 153 1410 87 H 0"
1239 | ]
1240 | },
1241 | "execution_count": 33,
1242 | "metadata": {},
1243 | "output_type": "execute_result"
1244 | }
1245 | ],
1246 | "source": [
1247 | "df[df['Wscore'] > 150]"
1248 | ]
1249 | },
1250 | {
1251 | "cell_type": "markdown",
1252 | "metadata": {},
1253 | "source": [
1254 | "Each dataframe has a **values** attribute, which is useful because it returns the contents of your dataframe as a NumPy array."
1255 | ]
1256 | },
1257 | {
1258 | "cell_type": "code",
1259 | "execution_count": 39,
1260 | "metadata": {
1261 | "collapsed": false
1262 | },
1263 | "outputs": [
1264 | {
1265 | "data": {
1266 | "text/plain": [
1267 | "array([[1985, 20, 1228, ..., 64, 'N', 0],\n",
1268 | " [1985, 25, 1106, ..., 70, 'H', 0],\n",
1269 | " [1985, 25, 1112, ..., 56, 'H', 0],\n",
1270 | " ..., \n",
1271 | " [2016, 132, 1246, ..., 77, 'N', 1],\n",
1272 | " [2016, 132, 1277, ..., 62, 'N', 0],\n",
1273 | " [2016, 132, 1386, ..., 74, 'N', 0]], dtype=object)"
1274 | ]
1275 | },
1276 | "execution_count": 39,
1277 | "metadata": {},
1278 | "output_type": "execute_result"
1279 | }
1280 | ],
1281 | "source": [
1282 | "df.values"
1283 | ]
1284 | },
1285 | {
1286 | "cell_type": "markdown",
1287 | "metadata": {},
1288 | "source": [
1289 | "Now, you can simply just access elements like you would in an array. "
1290 | ]
1291 | },
1292 | {
1293 | "cell_type": "code",
1294 | "execution_count": 40,
1295 | "metadata": {
1296 | "collapsed": false
1297 | },
1298 | "outputs": [
1299 | {
1300 | "data": {
1301 | "text/plain": [
1302 | "1985"
1303 | ]
1304 | },
1305 | "execution_count": 40,
1306 | "metadata": {},
1307 | "output_type": "execute_result"
1308 | }
1309 | ],
1310 | "source": [
1311 | "df.values[0][0]"
1312 | ]
1313 | },
1314 | {
1315 | "cell_type": "markdown",
1316 | "metadata": {},
1317 | "source": [
1318 | "# Dataframe Iteration"
1319 | ]
1320 | },
1321 | {
1322 | "cell_type": "markdown",
1323 | "metadata": {},
1324 | "source": [
1325 | "In order to iterate through dataframes, we can use the "
1326 | ]
1327 | },
1328 | {
1329 | "cell_type": "markdown",
1330 | "metadata": {
1331 | "collapsed": true
1332 | },
1333 | "source": [
1334 | "# Lots of Other Resources"
1335 | ]
1336 | },
1337 | {
1338 | "cell_type": "markdown",
1339 | "metadata": {},
1340 | "source": [
1341 | "Pandas has been around for a while and there are a lot of other good resources if you're still interested in getting the most out of this library. \n",
1342 | "* http://pandas.pydata.org/pandas-docs/stable/10min.html\n",
1343 | "* https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python\n",
1344 | "* http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/\n",
1345 | "* https://www.dataquest.io/blog/pandas-python-tutorial/\n",
1346 | "* https://drive.google.com/file/d/0ByIrJAE4KMTtTUtiVExiUGVkRkE/view"
1347 | ]
1348 | },
1349 | {
1350 | "cell_type": "code",
1351 | "execution_count": null,
1352 | "metadata": {
1353 | "collapsed": true
1354 | },
1355 | "outputs": [],
1356 | "source": []
1357 | }
1358 | ],
1359 | "metadata": {
1360 | "anaconda-cloud": {},
1361 | "kernelspec": {
1362 | "display_name": "Python [conda root]",
1363 | "language": "python",
1364 | "name": "conda-root-py"
1365 | },
1366 | "language_info": {
1367 | "codemirror_mode": {
1368 | "name": "ipython",
1369 | "version": 2
1370 | },
1371 | "file_extension": ".py",
1372 | "mimetype": "text/x-python",
1373 | "name": "python",
1374 | "nbconvert_exporter": "python",
1375 | "pygments_lexer": "ipython2",
1376 | "version": "2.7.12"
1377 | }
1378 | },
1379 | "nbformat": 4,
1380 | "nbformat_minor": 1
1381 | }
1382 |
--------------------------------------------------------------------------------
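The row-lookup and filtering steps from the notebook above can also be tried outside Jupyter. What follows is only a minimal sketch, assuming a tiny hand-made dataframe in place of the real CSV: the three rows are copied from the notebook's printed output, and the names `games`, `top_label`, `top_game`, `losing_score` and `high_scoring` are illustrative, not part of the original tutorial. It uses `idxmax()` with `loc[]` rather than `argmax()` with `iloc[]`; this is a swapped-in variant, not the notebook's own approach.

```python
import pandas as pd

# Stand-in for RegularSeasonCompactResults.csv (assumption: three rows
# copied from the notebook's printed output, not the full dataset).
games = pd.DataFrame({
    "Season": [1985, 1991, 1998],
    "Wteam": [1228, 1258, 1395],
    "Wscore": [81, 186, 153],
    "Lteam": [1328, 1109, 1410],
    "Lscore": [64, 140, 87],
})

# idxmax() returns the index *label* of the largest value,
# so it pairs with the label-based loc[] indexer.
top_label = games["Wscore"].idxmax()
top_game = games.loc[top_label]

# Points scored by the losing team in that same game.
losing_score = int(top_game["Lscore"])

# Boolean mask: keep only the rows where the winner scored more than 150.
high_scoring = games[games["Wscore"] > 150]

print(top_label, losing_score, len(high_scoring))
```

With a default integer index, as in the notebook, the label from `idxmax()` and the position used with `iloc[]` are the same number. They can differ once the dataframe has been filtered or re-indexed, which is why pairing `idxmax()` with `loc[]` may be the safer habit.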
/DataStructures.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/adeshpande3/Pandas-Tutorial/7ce62d4166db83e4f29599a1d8b8eb6b22f21e4e/DataStructures.png
--------------------------------------------------------------------------------
/Pandas Tutorial.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Introduction"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "Since I've been working on a lot of Kaggle competitions, I use Pandas quite a bit. As you may know, Pandas (in addition to Numpy) is the go-to Python library for all your data science needs. It helps with dealing with input data in CSV formats and with transforming your data into a form where it can be inputted into ML models. However, getting comfortable with the ideas of dataframes, slicing, etc was very tough for me in the beginning. Hopefully, this short tutorial can show you a lot of different commands that will help you gain the most insights into your dataset. "
15 | ]
16 | },
17 | {
18 | "cell_type": "code",
19 | "execution_count": 1,
20 | "metadata": {
21 | "collapsed": true
22 | },
23 | "outputs": [],
24 | "source": [
25 | "import pandas as pd"
26 | ]
27 | },
28 | {
29 | "cell_type": "markdown",
30 | "metadata": {},
31 | "source": [
32 | "# Loading in Data"
33 | ]
34 | },
35 | {
36 | "cell_type": "markdown",
37 | "metadata": {},
38 | "source": [
39 | "The first step in any ML problem is identifying what format your data is in, and then loading it into whatever framework you're using. For Kaggle competitions, a lot of data can be found in CSV files, so that's the example we're going to use. "
40 | ]
41 | },
42 | {
43 | "cell_type": "markdown",
44 | "metadata": {},
45 | "source": [
46 | "Since I'm a huge sports fan, we're going to be looking at a sports dataset that shows the results from NCAA basketball games from 1985 to 2016. This dataset is in a CSV file, and the function we're going to use to read in the file is called **pd.read_csv()**. This function returns a **dataframe**. The dataframe is the crown jewel data structure of Pandas. It is defined as \"a two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns)\"."
47 | ]
48 | },
49 | {
50 | "cell_type": "markdown",
51 | "metadata": {},
52 | "source": [
53 | "Just think of it as a table for now. "
54 | ]
55 | },
56 | {
57 | "cell_type": "code",
58 | "execution_count": 2,
59 | "metadata": {
60 | "collapsed": false
61 | },
62 | "outputs": [],
63 | "source": [
64 | "df = pd.read_csv('RegularSeasonCompactResults.csv')"
65 | ]
66 | },
67 | {
68 | "cell_type": "markdown",
69 | "metadata": {},
70 | "source": [
71 | "# The Basics"
72 | ]
73 | },
74 | {
75 | "cell_type": "markdown",
76 | "metadata": {},
77 | "source": [
78 | "Now that we have our dataframe in the variable df, let's look at what it contains. We can use the function **head()** to see the first few rows of the dataframe (five by default), or the function **tail()** to see the last few rows."
79 | ]
80 | },
81 | {
82 | "cell_type": "code",
83 | "execution_count": 3,
84 | "metadata": {
85 | "collapsed": false
86 | },
87 | "outputs": [
88 | {
89 | "data": {
166 | "text/plain": [
167 | " Season Daynum Wteam Wscore Lteam Lscore Wloc Numot\n",
168 | "0 1985 20 1228 81 1328 64 N 0\n",
169 | "1 1985 25 1106 77 1354 70 H 0\n",
170 | "2 1985 25 1112 63 1223 56 H 0\n",
171 | "3 1985 25 1165 70 1432 54 H 0\n",
172 | "4 1985 25 1192 86 1447 74 H 0"
173 | ]
174 | },
175 | "execution_count": 3,
176 | "metadata": {},
177 | "output_type": "execute_result"
178 | }
179 | ],
180 | "source": [
181 | "df.head()"
182 | ]
183 | },
184 | {
185 | "cell_type": "code",
186 | "execution_count": 4,
187 | "metadata": {
188 | "collapsed": false
189 | },
190 | "outputs": [
191 | {
192 | "data": {
269 | "text/plain": [
270 | " Season Daynum Wteam Wscore Lteam Lscore Wloc Numot\n",
271 | "145284 2016 132 1114 70 1419 50 N 0\n",
272 | "145285 2016 132 1163 72 1272 58 N 0\n",
273 | "145286 2016 132 1246 82 1401 77 N 1\n",
274 | "145287 2016 132 1277 66 1345 62 N 0\n",
275 | "145288 2016 132 1386 87 1433 74 N 0"
276 | ]
277 | },
278 | "execution_count": 4,
279 | "metadata": {},
280 | "output_type": "execute_result"
281 | }
282 | ],
283 | "source": [
284 | "df.tail()"
285 | ]
286 | },
287 | {
288 | "cell_type": "markdown",
289 | "metadata": {},
290 | "source": [
291 | "We can see the dimensions of the dataframe (rows, columns) using the **shape** attribute."
292 | ]
293 | },
294 | {
295 | "cell_type": "code",
296 | "execution_count": 5,
297 | "metadata": {
298 | "collapsed": false
299 | },
300 | "outputs": [
301 | {
302 | "data": {
303 | "text/plain": [
304 | "(145289, 8)"
305 | ]
306 | },
307 | "execution_count": 5,
308 | "metadata": {},
309 | "output_type": "execute_result"
310 | }
311 | ],
312 | "source": [
313 | "df.shape"
314 | ]
315 | },
316 | {
317 | "cell_type": "markdown",
318 | "metadata": {
319 | "collapsed": true
320 | },
321 | "source": [
322 | "We can also extract all the column names as a list by calling **tolist()** on the **columns** attribute. The row labels are available in the same way through the **index** attribute."
323 | ]
324 | },
325 | {
326 | "cell_type": "code",
327 | "execution_count": 6,
328 | "metadata": {
329 | "collapsed": false
330 | },
331 | "outputs": [
332 | {
333 | "data": {
334 | "text/plain": [
335 | "['Season', 'Daynum', 'Wteam', 'Wscore', 'Lteam', 'Lscore', 'Wloc', 'Numot']"
336 | ]
337 | },
338 | "execution_count": 6,
339 | "metadata": {},
340 | "output_type": "execute_result"
341 | }
342 | ],
343 | "source": [
344 | "df.columns.tolist()"
345 | ]
346 | },
347 | {
348 | "cell_type": "markdown",
349 | "metadata": {},
350 | "source": [
351 | "In order to get a better idea of the type of data that we are dealing with, we can call the **describe()** function to see statistics like mean, min, etc. for each numeric column of the dataset. "
352 | ]
353 | },
354 | {
355 | "cell_type": "code",
356 | "execution_count": 7,
357 | "metadata": {
358 | "collapsed": false
359 | },
360 | "outputs": [
361 | {
362 | "data": {
463 | "text/plain": [
464 | " Season Daynum Wteam Wscore \\\n",
465 | "count 145289.000000 145289.000000 145289.000000 145289.000000 \n",
466 | "mean 2001.574834 75.223816 1286.720646 76.600321 \n",
467 | "std 9.233342 33.287418 104.570275 12.173033 \n",
468 | "min 1985.000000 0.000000 1101.000000 34.000000 \n",
469 | "25% 1994.000000 47.000000 1198.000000 68.000000 \n",
470 | "50% 2002.000000 78.000000 1284.000000 76.000000 \n",
471 | "75% 2010.000000 103.000000 1379.000000 84.000000 \n",
472 | "max 2016.000000 132.000000 1464.000000 186.000000 \n",
473 | "\n",
474 | " Lteam Lscore Numot \n",
475 | "count 145289.000000 145289.000000 145289.000000 \n",
476 | "mean 1282.864064 64.497009 0.044387 \n",
477 | "std 104.829234 11.380625 0.247819 \n",
478 | "min 1101.000000 20.000000 0.000000 \n",
479 | "25% 1191.000000 57.000000 0.000000 \n",
480 | "50% 1280.000000 64.000000 0.000000 \n",
481 | "75% 1375.000000 72.000000 0.000000 \n",
482 | "max 1464.000000 150.000000 6.000000 "
483 | ]
484 | },
485 | "execution_count": 7,
486 | "metadata": {},
487 | "output_type": "execute_result"
488 | }
489 | ],
490 | "source": [
491 | "df.describe()"
492 | ]
493 | },
494 | {
495 | "cell_type": "markdown",
496 | "metadata": {},
497 | "source": [
498 | "Okay, so now let's look at information that we want to extract from the dataframe. Let's say I wanted to know the max value of a certain column. The function **max()** will show you the maximum value of every column."
499 | ]
500 | },
501 | {
502 | "cell_type": "code",
503 | "execution_count": 8,
504 | "metadata": {
505 | "collapsed": false
506 | },
507 | "outputs": [
508 | {
509 | "data": {
510 | "text/plain": [
511 | "Season 2016\n",
512 | "Daynum 132\n",
513 | "Wteam 1464\n",
514 | "Wscore 186\n",
515 | "Lteam 1464\n",
516 | "Lscore 150\n",
517 | "Wloc N\n",
518 | "Numot 6\n",
519 | "dtype: object"
520 | ]
521 | },
522 | "execution_count": 8,
523 | "metadata": {},
524 | "output_type": "execute_result"
525 | }
526 | ],
527 | "source": [
528 | "df.max()"
529 | ]
530 | },
531 | {
532 | "cell_type": "markdown",
533 | "metadata": {},
534 | "source": [
535 | "Then, if you'd like to get the max value for one particular column, select that column by name with the bracket indexing operator and call **max()** on it."
536 | ]
537 | },
538 | {
539 | "cell_type": "code",
540 | "execution_count": 9,
541 | "metadata": {
542 | "collapsed": false
543 | },
544 | "outputs": [
545 | {
546 | "data": {
547 | "text/plain": [
548 | "186"
549 | ]
550 | },
551 | "execution_count": 9,
552 | "metadata": {},
553 | "output_type": "execute_result"
554 | }
555 | ],
556 | "source": [
557 | "df['Wscore'].max()"
558 | ]
559 | },
560 | {
561 | "cell_type": "markdown",
562 | "metadata": {},
563 | "source": [
564 | "Similarly, if you'd like to find the mean of the losing teams' scores, call **mean()** on the 'Lscore' column. "
565 | ]
566 | },
567 | {
568 | "cell_type": "code",
569 | "execution_count": 10,
570 | "metadata": {
571 | "collapsed": false
572 | },
573 | "outputs": [
574 | {
575 | "data": {
576 | "text/plain": [
577 | "64.49700940883343"
578 | ]
579 | },
580 | "execution_count": 10,
581 | "metadata": {},
582 | "output_type": "execute_result"
583 | }
584 | ],
585 | "source": [
586 | "df['Lscore'].mean()"
587 | ]
588 | },
589 | {
590 | "cell_type": "markdown",
591 | "metadata": {},
592 | "source": [
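593 | "But what if that's not enough? Let's say we want to actually see the game (row) where this max score happened. We can call the **argmax()** function to identify the row. Note that in recent Pandas versions argmax() returns the integer position of the maximum, while **idxmax()** returns its row label; with the default 0-based index used here, the two coincide."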
594 | ]
595 | },
596 | {
597 | "cell_type": "code",
598 | "execution_count": 11,
599 | "metadata": {
600 | "collapsed": false
601 | },
602 | "outputs": [
603 | {
604 | "data": {
605 | "text/plain": [
606 | "24970"
607 | ]
608 | },
609 | "execution_count": 11,
610 | "metadata": {},
611 | "output_type": "execute_result"
612 | }
613 | ],
614 | "source": [
615 | "df['Wscore'].argmax()"
616 | ]
617 | },
618 | {
619 | "cell_type": "markdown",
620 | "metadata": {},
621 | "source": [
622 | "One of the most useful functions that you can call on a column of a dataframe is the **value_counts()** function. It shows how many times each value appears in the column, sorted from most to least frequent. This particular command shows the number of games in each season."
623 | ]
624 | },
625 | {
626 | "cell_type": "code",
627 | "execution_count": 12,
628 | "metadata": {
629 | "collapsed": false
630 | },
631 | "outputs": [
632 | {
633 | "data": {
634 | "text/plain": [
635 | "2016 5369\n",
636 | "2014 5362\n",
637 | "2015 5354\n",
638 | "2013 5320\n",
639 | "2010 5263\n",
640 | "2012 5253\n",
641 | "2009 5249\n",
642 | "2011 5246\n",
643 | "2008 5163\n",
644 | "2007 5043\n",
645 | "2006 4757\n",
646 | "2005 4675\n",
647 | "2003 4616\n",
648 | "2004 4571\n",
649 | "2002 4555\n",
650 | "2000 4519\n",
651 | "2001 4467\n",
652 | "1999 4222\n",
653 | "1998 4167\n",
654 | "1997 4155\n",
655 | "1992 4127\n",
656 | "1991 4123\n",
657 | "1996 4122\n",
658 | "1995 4077\n",
659 | "1994 4060\n",
660 | "1990 4045\n",
661 | "1989 4037\n",
662 | "1993 3982\n",
663 | "1988 3955\n",
664 | "1987 3915\n",
665 | "1986 3783\n",
666 | "1985 3737\n",
667 | "Name: Season, dtype: int64"
668 | ]
669 | },
670 | "execution_count": 12,
671 | "metadata": {},
672 | "output_type": "execute_result"
673 | }
674 | ],
675 | "source": [
676 | "df['Season'].value_counts()"
677 | ]
678 | },
679 | {
680 | "cell_type": "markdown",
681 | "metadata": {},
682 | "source": [
683 | "# Accessing Values"
684 | ]
685 | },
686 | {
687 | "cell_type": "markdown",
688 | "metadata": {},
689 | "source": [
690 | "Then, in order to get the other attributes of that game, we need to use the **iloc[]** indexer. Iloc is definitely one of the more important tools in Pandas. The main idea is that you use it whenever you have the integer position of a row that you want to access. As per the Pandas documentation, iloc is an \"integer-location based indexing for selection by position.\""
691 | ]
692 | },
693 | {
694 | "cell_type": "code",
695 | "execution_count": 13,
696 | "metadata": {
697 | "collapsed": false
698 | },
699 | "outputs": [
700 | {
701 | "data": {
734 | "text/plain": [
735 | " Season Daynum Wteam Wscore Lteam Lscore Wloc Numot\n",
736 | "24970 1991 68 1258 186 1109 140 H 0"
737 | ]
738 | },
739 | "execution_count": 13,
740 | "metadata": {},
741 | "output_type": "execute_result"
742 | }
743 | ],
744 | "source": [
745 | "df.iloc[[df['Wscore'].argmax()]]"
746 | ]
747 | },
748 | {
749 | "cell_type": "markdown",
750 | "metadata": {},
751 | "source": [
752 | "Let's take this a step further. Let's say you want to know the game with the highest scoring winning team (this is what we just calculated), but you then want to know how many points the losing team scored. "
753 | ]
754 | },
755 | {
756 | "cell_type": "code",
757 | "execution_count": 14,
758 | "metadata": {
759 | "collapsed": false
760 | },
761 | "outputs": [
762 | {
763 | "data": {
764 | "text/plain": [
765 | "24970 140\n",
766 | "Name: Lscore, dtype: int64"
767 | ]
768 | },
769 | "execution_count": 14,
770 | "metadata": {},
771 | "output_type": "execute_result"
772 | }
773 | ],
774 | "source": [
775 | "df.iloc[[df['Wscore'].argmax()]]['Lscore']"
776 | ]
777 | },
778 | {
779 | "cell_type": "markdown",
780 | "metadata": {},
781 | "source": [
782 | "When you see data displayed in the above format, you're dealing with a Pandas **Series** object, not a dataframe object."
783 | ]
784 | },
785 | {
786 | "cell_type": "code",
787 | "execution_count": 15,
788 | "metadata": {
789 | "collapsed": false
790 | },
791 | "outputs": [
792 | {
793 | "data": {
794 | "text/plain": [
795 | "pandas.core.series.Series"
796 | ]
797 | },
798 | "execution_count": 15,
799 | "metadata": {},
800 | "output_type": "execute_result"
801 | }
802 | ],
803 | "source": [
804 | "type(df.iloc[[df['Wscore'].argmax()]]['Lscore'])"
805 | ]
806 | },
807 | {
808 | "cell_type": "code",
809 | "execution_count": 16,
810 | "metadata": {
811 | "collapsed": false
812 | },
813 | "outputs": [
814 | {
815 | "data": {
816 | "text/plain": [
817 | "pandas.core.frame.DataFrame"
818 | ]
819 | },
820 | "execution_count": 16,
821 | "metadata": {},
822 | "output_type": "execute_result"
823 | }
824 | ],
825 | "source": [
826 | "type(df.iloc[[df['Wscore'].argmax()]])"
827 | ]
828 | },
829 | {
830 | "cell_type": "markdown",
831 | "metadata": {},
832 | "source": [
833 | "The following is a summary of the three data structures in Pandas (I haven't really used Panels yet, and note that Panel has since been removed from recent Pandas releases).\n",
834 | "\n",
835 | ""
836 | ]
837 | },
838 | {
839 | "cell_type": "markdown",
840 | "metadata": {},
841 | "source": [
842 | "When you want to access values in a Series, you'll want to just treat the Series like a Python dictionary, so you'd access the value according to its key (which is normally an integer index)"
843 | ]
844 | },
845 | {
846 | "cell_type": "code",
847 | "execution_count": 17,
848 | "metadata": {
849 | "collapsed": false
850 | },
851 | "outputs": [
852 | {
853 | "data": {
854 | "text/plain": [
855 | "140"
856 | ]
857 | },
858 | "execution_count": 17,
859 | "metadata": {},
860 | "output_type": "execute_result"
861 | }
862 | ],
863 | "source": [
864 | "df.iloc[[df['Wscore'].argmax()]]['Lscore'][24970]"
865 | ]
866 | },
867 | {
868 | "cell_type": "markdown",
869 | "metadata": {},
870 | "source": [
871 | "The other really important indexer in Pandas is **loc**. In contrast to iloc, which indexes by integer position, loc is a \"Purely label-location based indexer for selection by label\". Since all the games are labeled in order from 0 to 145288, iloc and loc are going to be pretty interchangeable in this type of dataset."
872 | ]
873 | },
874 | {
875 | "cell_type": "code",
876 | "execution_count": 18,
877 | "metadata": {
878 | "collapsed": false
879 | },
880 | "outputs": [
881 | {
882 | "data": {
937 | "text/plain": [
938 | " Season Daynum Wteam Wscore Lteam Lscore Wloc Numot\n",
939 | "0 1985 20 1228 81 1328 64 N 0\n",
940 | "1 1985 25 1106 77 1354 70 H 0\n",
941 | "2 1985 25 1112 63 1223 56 H 0"
942 | ]
943 | },
944 | "execution_count": 18,
945 | "metadata": {},
946 | "output_type": "execute_result"
947 | }
948 | ],
949 | "source": [
950 | "df.iloc[:3]"
951 | ]
952 | },
953 | {
954 | "cell_type": "code",
955 | "execution_count": 19,
956 | "metadata": {
957 | "collapsed": false
958 | },
959 | "outputs": [
960 | {
961 | "data": {
1027 | "text/plain": [
1028 | " Season Daynum Wteam Wscore Lteam Lscore Wloc Numot\n",
1029 | "0 1985 20 1228 81 1328 64 N 0\n",
1030 | "1 1985 25 1106 77 1354 70 H 0\n",
1031 | "2 1985 25 1112 63 1223 56 H 0\n",
1032 | "3 1985 25 1165 70 1432 54 H 0"
1033 | ]
1034 | },
1035 | "execution_count": 19,
1036 | "metadata": {},
1037 | "output_type": "execute_result"
1038 | }
1039 | ],
1040 | "source": [
1041 | "df.loc[:3]"
1042 | ]
1043 | },
1044 | {
1045 | "cell_type": "markdown",
1046 | "metadata": {},
1047 | "source": [
1048 | "Notice the slight difference: iloc is exclusive of the end position (so iloc[:3] returns rows 0 through 2), while loc is inclusive of the end label (so loc[:3] returns rows 0 through 3). "
1049 | ]
1050 | },
1051 | {
1052 | "cell_type": "markdown",
1053 | "metadata": {},
1054 | "source": [
1055 | "Below is an example of how you can use loc to achieve the same task as we did previously with iloc."
1056 | ]
1057 | },
1058 | {
1059 | "cell_type": "code",
1060 | "execution_count": 20,
1061 | "metadata": {
1062 | "collapsed": false
1063 | },
1064 | "outputs": [
1065 | {
1066 | "data": {
1067 | "text/plain": [
1068 | "140"
1069 | ]
1070 | },
1071 | "execution_count": 20,
1072 | "metadata": {},
1073 | "output_type": "execute_result"
1074 | }
1075 | ],
1076 | "source": [
1077 | "df.loc[df['Wscore'].idxmax(), 'Lscore']"
1078 | ]
1079 | },
1080 | {
1081 | "cell_type": "markdown",
1082 | "metadata": {},
1083 | "source": [
1084 | "A faster version uses the **at[]** accessor. At is really useful whenever you know the row label and the column label of the particular value that you want to get. "
1085 | ]
1086 | },
1087 | {
1088 | "cell_type": "code",
1089 | "execution_count": 21,
1090 | "metadata": {
1091 | "collapsed": false
1092 | },
1093 | "outputs": [
1094 | {
1095 | "data": {
1096 | "text/plain": [
1097 | "140"
1098 | ]
1099 | },
1100 | "execution_count": 21,
1101 | "metadata": {},
1102 | "output_type": "execute_result"
1103 | }
1104 | ],
1105 | "source": [
1106 | "df.at[df['Wscore'].idxmax(), 'Lscore']"
1107 | ]
1108 | },
1109 | {
1110 | "cell_type": "markdown",
1111 | "metadata": {},
1112 | "source": [
1113 | "If you'd like to see more discussion on how loc and iloc are different, check out this great Stack Overflow post: http://stackoverflow.com/questions/31593201/pandas-iloc-vs-ix-vs-loc-explanation. Just remember that **iloc looks at position** and **loc looks at labels**. Loc becomes very important when your row labels aren't integers. "
1114 | ]
1115 | },
1116 | {
1117 | "cell_type": "markdown",
1118 | "metadata": {},
1119 | "source": [
1120 | "# Sorting"
1121 | ]
1122 | },
1123 | {
1124 | "cell_type": "markdown",
1125 | "metadata": {},
1126 | "source": [
1127 | "Let's say that we want to sort the dataframe in increasing order of the losing team's score. The **sort_values()** function does this, sorting in ascending order by default."
1128 | ]
1129 | },
1130 | {
1131 | "cell_type": "code",
1132 | "execution_count": 22,
1133 | "metadata": {
1134 | "collapsed": false,
1135 | "scrolled": true
1136 | },
1137 | "outputs": [
1138 | {
1139 | "data": {
1216 | "text/plain": [
1217 | " Season Daynum Wteam Wscore Lteam Lscore Wloc Numot\n",
1218 | "100027 2008 66 1203 49 1387 20 H 0\n",
1219 | "49310 1997 66 1157 61 1204 21 H 0\n",
1220 | "89021 2006 44 1284 41 1343 21 A 0\n",
1221 | "85042 2005 66 1131 73 1216 22 H 0\n",
1222 | "103660 2009 26 1326 59 1359 22 H 0"
1223 | ]
1224 | },
1225 | "execution_count": 22,
1226 | "metadata": {},
1227 | "output_type": "execute_result"
1228 | }
1229 | ],
1230 | "source": [
1231 | "df.sort_values('Lscore').head()"
1232 | ]
1233 | },
1234 | {
1235 | "cell_type": "code",
1236 | "execution_count": 23,
1237 | "metadata": {
1238 | "collapsed": false
1239 | },
1240 | "outputs": [
1241 | {
1242 | "data": {
1243 | "text/plain": [
1244 | ""
1245 | ]
1246 | },
1247 | "execution_count": 23,
1248 | "metadata": {},
1249 | "output_type": "execute_result"
1250 | }
1251 | ],
1252 | "source": [
1253 | "df.groupby('Lscore')"
1254 | ]
1255 | },
1256 | {
1257 | "cell_type": "markdown",
1258 | "metadata": {},
1259 | "source": [
1260 | "# Filtering Rows Conditionally"
1261 | ]
1262 | },
1263 | {
1264 | "cell_type": "markdown",
1265 | "metadata": {},
1266 | "source": [
1267 | "Now, let's say we want to find all of the rows that satisfy a particular condition. For example, I want to find all of the games where the winning team scored more than 150 points. The idea behind this command is that you access the column 'Wscore' of the dataframe df (df['Wscore']), find which entries are above 150 (df['Wscore'] > 150), and then return only those specific rows in a dataframe format (df[df['Wscore'] > 150])."
1268 | ]
1269 | },
1270 | {
1271 | "cell_type": "code",
1272 | "execution_count": 24,
1273 | "metadata": {
1274 | "collapsed": false
1275 | },
1276 | "outputs": [
1277 | {
1278 | "data": {
1498 | "text/plain": [
1499 | " Season Daynum Wteam Wscore Lteam Lscore Wloc Numot\n",
1500 | "5269 1986 75 1258 151 1109 107 H 0\n",
1501 | "12046 1988 40 1328 152 1147 84 H 0\n",
1502 | "12355 1988 52 1328 151 1173 99 N 0\n",
1503 | "16040 1989 40 1328 152 1331 122 H 0\n",
1504 | "16853 1989 68 1258 162 1109 144 A 0\n",
1505 | "17867 1989 92 1258 181 1109 150 H 0\n",
1506 | "19653 1990 30 1328 173 1109 101 H 0\n",
1507 | "19971 1990 38 1258 152 1109 137 A 0\n",
1508 | "20022 1990 40 1116 166 1109 101 H 0\n",
1509 | "22145 1990 97 1258 157 1362 115 H 0\n",
1510 | "23582 1991 26 1318 152 1258 123 N 0\n",
1511 | "24341 1991 47 1328 172 1258 112 H 0\n",
1512 | "24970 1991 68 1258 186 1109 140 H 0\n",
1513 | "25656 1991 84 1106 151 1212 97 H 0\n",
1514 | "28687 1992 54 1261 159 1319 86 H 0\n",
1515 | "35023 1993 112 1380 155 1341 91 A 0\n",
1516 | "40060 1995 32 1375 156 1341 114 H 0\n",
1517 | "52600 1998 33 1395 153 1410 87 H 0"
1518 | ]
1519 | },
1520 | "execution_count": 24,
1521 | "metadata": {},
1522 | "output_type": "execute_result"
1523 | }
1524 | ],
1525 | "source": [
1526 | "df[df['Wscore'] > 150]"
1527 | ]
1528 | },
1529 | {
1530 | "cell_type": "markdown",
1531 | "metadata": {},
1532 | "source": [
1533 | "This also works with multiple conditions. Let's say we want the games where the winning team scored more than 150 points and the losing team scored fewer than 100. Wrap each condition in parentheses and combine them with **&**. "
1534 | ]
1535 | },
1536 | {
1537 | "cell_type": "code",
1538 | "execution_count": 25,
1539 | "metadata": {
1540 | "collapsed": false
1541 | },
1542 | "outputs": [
1543 | {
1544 | "data": {
1632 | "text/plain": [
1633 | " Season Daynum Wteam Wscore Lteam Lscore Wloc Numot\n",
1634 | "12046 1988 40 1328 152 1147 84 H 0\n",
1635 | "12355 1988 52 1328 151 1173 99 N 0\n",
1636 | "25656 1991 84 1106 151 1212 97 H 0\n",
1637 | "28687 1992 54 1261 159 1319 86 H 0\n",
1638 | "35023 1993 112 1380 155 1341 91 A 0\n",
1639 | "52600 1998 33 1395 153 1410 87 H 0"
1640 | ]
1641 | },
1642 | "execution_count": 25,
1643 | "metadata": {},
1644 | "output_type": "execute_result"
1645 | }
1646 | ],
1647 | "source": [
1648 | "df[(df['Wscore'] > 150) & (df['Lscore'] < 100)]"
1649 | ]
1650 | },
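  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The same pattern extends to other logical operators. The cell below is a minimal sketch on a small made-up dataframe called toy (hypothetical values, not rows from the dataset): **|** means OR and **~** means NOT. Each comparison gets its own parentheses because Python gives & and | higher precedence than > and <."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy frame that reuses the tutorial's column names\n",
    "toy = pd.DataFrame({\n",
    "    'Wteam':  [1328, 1328, 1106, 1258],\n",
    "    'Wscore': [152, 151, 151, 162],\n",
    "    'Lscore': [84, 99, 97, 144],\n",
    "    'Wloc':   ['H', 'N', 'H', 'A'],\n",
    "})\n",
    "\n",
    "# OR: winner scored more than 160, or loser scored fewer than 90\n",
    "print(toy[(toy['Wscore'] > 160) | (toy['Lscore'] < 90)])\n",
    "\n",
    "# NOT: every game that was not won at home\n",
    "print(toy[~(toy['Wloc'] == 'H')])"
   ]
  },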
1651 | {
1652 | "cell_type": "markdown",
1653 | "metadata": {},
1654 | "source": [
1655 | "# Grouping"
1656 | ]
1657 | },
1658 | {
1659 | "cell_type": "markdown",
1660 | "metadata": {},
1661 | "source": [
1662 | "Another important function in Pandas is **groupby()**. It lets you group entries by a certain attribute (e.g., by Wteam number) and then perform operations on each group. The following command groups all the entries (games) with the same Wteam number and finds the mean Wscore for each group. "
1663 | ]
1664 | },
1665 | {
1666 | "cell_type": "code",
1667 | "execution_count": 26,
1668 | "metadata": {
1669 | "collapsed": false
1670 | },
1671 | "outputs": [
1672 | {
1673 | "data": {
1674 | "text/plain": [
1675 | "Wteam\n",
1676 | "1101 78.111111\n",
1677 | "1102 69.893204\n",
1678 | "1103 75.839768\n",
1679 | "1104 75.825944\n",
1680 | "1105 74.960894\n",
1681 | "Name: Wscore, dtype: float64"
1682 | ]
1683 | },
1684 | "execution_count": 26,
1685 | "metadata": {},
1686 | "output_type": "execute_result"
1687 | }
1688 | ],
1689 | "source": [
1690 | "df.groupby('Wteam')['Wscore'].mean().head()"
1691 | ]
1692 | },
1693 | {
1694 | "cell_type": "markdown",
1695 | "metadata": {},
1696 | "source": [
1697 | "This next command groups all the games with the same Wteam number and counts how many times that specific team won at home, on the road, or at a neutral site."
1698 | ]
1699 | },
1700 | {
1701 | "cell_type": "code",
1702 | "execution_count": 27,
1703 | "metadata": {
1704 | "collapsed": false,
1705 | "scrolled": false
1706 | },
1707 | "outputs": [
1708 | {
1709 | "data": {
1710 | "text/plain": [
1711 | "Wteam Wloc\n",
1712 | "1101 H 12\n",
1713 | " A 3\n",
1714 | " N 3\n",
1715 | "1102 H 204\n",
1716 | " A 73\n",
1717 | " N 32\n",
1718 | "1103 H 324\n",
1719 | " A 153\n",
1720 | " N 41\n",
1721 | "Name: Wloc, dtype: int64"
1722 | ]
1723 | },
1724 | "execution_count": 27,
1725 | "metadata": {},
1726 | "output_type": "execute_result"
1727 | }
1728 | ],
1729 | "source": [
1730 | "df.groupby('Wteam')['Wloc'].value_counts().head(9)"
1731 | ]
1732 | },
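  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you want several statistics per group at once, one option is **agg()** with a list of function names. This is a minimal sketch on a made-up dataframe called toy (hypothetical values), not output from the real dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy frame that reuses the tutorial's column names\n",
    "toy = pd.DataFrame({\n",
    "    'Wteam':  [1328, 1328, 1106, 1258],\n",
    "    'Wscore': [152, 151, 151, 162],\n",
    "})\n",
    "\n",
    "# One row per winning team, one column per statistic\n",
    "stats = toy.groupby('Wteam')['Wscore'].agg(['mean', 'max', 'count'])\n",
    "print(stats)"
   ]
  },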
1733 | {
1734 | "cell_type": "markdown",
1735 | "metadata": {},
1736 | "source": [
1737 | "Each dataframe has a **values** attribute, which is useful because it returns the contents of your dataframe as a NumPy array."
1738 | ]
1739 | },
1740 | {
1741 | "cell_type": "code",
1742 | "execution_count": 28,
1743 | "metadata": {
1744 | "collapsed": false
1745 | },
1746 | "outputs": [
1747 | {
1748 | "data": {
1749 | "text/plain": [
1750 | "array([[1985, 20, 1228, ..., 64, 'N', 0],\n",
1751 | " [1985, 25, 1106, ..., 70, 'H', 0],\n",
1752 | " [1985, 25, 1112, ..., 56, 'H', 0],\n",
1753 | " ..., \n",
1754 | " [2016, 132, 1246, ..., 77, 'N', 1],\n",
1755 | " [2016, 132, 1277, ..., 62, 'N', 0],\n",
1756 | " [2016, 132, 1386, ..., 74, 'N', 0]], dtype=object)"
1757 | ]
1758 | },
1759 | "execution_count": 28,
1760 | "metadata": {},
1761 | "output_type": "execute_result"
1762 | }
1763 | ],
1764 | "source": [
1765 | "df.values"
1766 | ]
1767 | },
1768 | {
1769 | "cell_type": "markdown",
1770 | "metadata": {},
1771 | "source": [
1772 | "Now you can access elements just like you would in an array. "
1773 | ]
1774 | },
1775 | {
1776 | "cell_type": "code",
1777 | "execution_count": 29,
1778 | "metadata": {
1779 | "collapsed": false
1780 | },
1781 | "outputs": [
1782 | {
1783 | "data": {
1784 | "text/plain": [
1785 | "1985"
1786 | ]
1787 | },
1788 | "execution_count": 29,
1789 | "metadata": {},
1790 | "output_type": "execute_result"
1791 | }
1792 | ],
1793 | "source": [
1794 | "df.values[0][0]"
1795 | ]
1796 | },
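  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice the dtype=object in the array above: the dataframe mixes numbers with the string column Wloc, so NumPy falls back to a generic object array. A minimal sketch on a made-up dataframe called toy (hypothetical values) shows that selecting only numeric columns first gives a numeric array, which you can index with the usual [row, column] syntax."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy frame that reuses the tutorial's column names\n",
    "toy = pd.DataFrame({\n",
    "    'Wscore': [152, 151, 162],\n",
    "    'Lscore': [84, 99, 144],\n",
    "    'Wloc':   ['H', 'N', 'A'],\n",
    "})\n",
    "\n",
    "# Mixed column types: the array has dtype object\n",
    "print(toy.values.dtype)\n",
    "\n",
    "# Numeric columns only: the array gets an integer dtype\n",
    "scores = toy[['Wscore', 'Lscore']].values\n",
    "print(scores.dtype)\n",
    "\n",
    "# Row 0, column 1 (the first game's Lscore)\n",
    "print(scores[0, 1])"
   ]
  },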
1797 | {
1798 | "cell_type": "markdown",
1799 | "metadata": {},
1800 | "source": [
1801 | "# Dataframe Iteration"
1802 | ]
1803 | },
1804 | {
1805 | "cell_type": "markdown",
1806 | "metadata": {},
1807 | "source": [
1808 | "To iterate through the rows of a dataframe, we can use the **iterrows()** function. It yields an (index, row) pair for each row, where each row is a Series object. Below is an example of what the first two rows look like."
1809 | ]
1810 | },
1811 | {
1812 | "cell_type": "code",
1813 | "execution_count": 30,
1814 | "metadata": {
1815 | "collapsed": false
1816 | },
1817 | "outputs": [
1818 | {
1819 | "name": "stdout",
1820 | "output_type": "stream",
1821 | "text": [
1822 | "Season 1985\n",
1823 | "Daynum 20\n",
1824 | "Wteam 1228\n",
1825 | "Wscore 81\n",
1826 | "Lteam 1328\n",
1827 | "Lscore 64\n",
1828 | "Wloc N\n",
1829 | "Numot 0\n",
1830 | "Name: 0, dtype: object\n",
1831 | "Season 1985\n",
1832 | "Daynum 25\n",
1833 | "Wteam 1106\n",
1834 | "Wscore 77\n",
1835 | "Lteam 1354\n",
1836 | "Lscore 70\n",
1837 | "Wloc H\n",
1838 | "Numot 0\n",
1839 | "Name: 1, dtype: object\n"
1840 | ]
1841 | }
1842 | ],
1843 | "source": [
1844 | "for index, row in df.iterrows():\n",
1845 | "    print(row)\n",
1846 | "    if index == 1:\n",
1847 | "        break"
1848 | ]
1849 | },
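  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because iterrows() builds a new Series for every row, it can be slow on large dataframes, and mixed column types end up as dtype object (as in the output above). An alternative worth knowing is **itertuples()**. The cell below is a minimal sketch on a made-up dataframe called toy (hypothetical values), assuming a pandas version where itertuples() yields namedtuples so that fields can be read by name."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy frame that reuses the tutorial's column names\n",
    "toy = pd.DataFrame({\n",
    "    'Wteam':  [1328, 1106, 1258],\n",
    "    'Wscore': [152, 151, 162],\n",
    "})\n",
    "\n",
    "# Each row is a lightweight tuple; Index holds the row label\n",
    "for row in toy.itertuples():\n",
    "    print('%d: team %d scored %d' % (row.Index, row.Wteam, row.Wscore))"
   ]
  },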
1850 | {
1851 | "cell_type": "markdown",
1852 | "metadata": {},
1853 | "source": [
1854 | "# Extracting Rows and Columns"
1855 | ]
1856 | },
1857 | {
1858 | "cell_type": "markdown",
1859 | "metadata": {},
1860 | "source": [
1861 | "The bracket indexing operator is one way to extract certain columns from a dataframe."
1862 | ]
1863 | },
1864 | {
1865 | "cell_type": "code",
1866 | "execution_count": 31,
1867 | "metadata": {
1868 | "collapsed": false,
1869 | "scrolled": true
1870 | },
1871 | "outputs": [
1872 | {
1873 | "data": {
1914 | "text/plain": [
1915 | " Wscore Lscore\n",
1916 | "0 81 64\n",
1917 | "1 77 70\n",
1918 | "2 63 56\n",
1919 | "3 70 54\n",
1920 | "4 86 74"
1921 | ]
1922 | },
1923 | "execution_count": 31,
1924 | "metadata": {},
1925 | "output_type": "execute_result"
1926 | }
1927 | ],
1928 | "source": [
1929 | "df[['Wscore', 'Lscore']].head()"
1930 | ]
1931 | },
1932 | {
1933 | "cell_type": "markdown",
1934 | "metadata": {},
1935 | "source": [
1936 | "Notice that you can achieve the same result by using **loc**. Loc is a very versatile indexer that can help you with a lot of accessing and extracting tasks. "
1937 | ]
1938 | },
1939 | {
1940 | "cell_type": "code",
1941 | "execution_count": 32,
1942 | "metadata": {
1943 | "collapsed": false
1944 | },
1945 | "outputs": [
1946 | {
1947 | "data": {
1988 | "text/plain": [
1989 | " Wscore Lscore\n",
1990 | "0 81 64\n",
1991 | "1 77 70\n",
1992 | "2 63 56\n",
1993 | "3 70 54\n",
1994 | "4 86 74"
1995 | ]
1996 | },
1997 | "execution_count": 32,
1998 | "metadata": {},
1999 | "output_type": "execute_result"
2000 | }
2001 | ],
2002 | "source": [
2003 | "df.loc[:, ['Wscore', 'Lscore']].head()"
2004 | ]
2005 | },
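  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Part of what makes loc so versatile is that it takes a row selector and a column selector at the same time. The cell below is a minimal sketch on a made-up dataframe called toy (hypothetical values): the first call combines a boolean condition with a list of columns, and the second reads a single cell by row label and column name."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy frame that reuses the tutorial's column names\n",
    "toy = pd.DataFrame({\n",
    "    'Wteam':  [1328, 1328, 1106, 1258],\n",
    "    'Wscore': [152, 151, 151, 162],\n",
    "    'Lscore': [84, 99, 97, 144],\n",
    "})\n",
    "\n",
    "# Rows where Wscore is above 151, restricted to two columns\n",
    "print(toy.loc[toy['Wscore'] > 151, ['Wteam', 'Wscore']])\n",
    "\n",
    "# A single value: row label 0, column 'Wscore'\n",
    "print(toy.loc[0, 'Wscore'])"
   ]
  },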
2006 | {
2007 | "cell_type": "markdown",
2008 | "metadata": {},
2009 | "source": [
2010 | "Note the difference in the return types when you use single brackets and when you use double brackets. "
2011 | ]
2012 | },
2013 | {
2014 | "cell_type": "code",
2015 | "execution_count": 33,
2016 | "metadata": {
2017 | "collapsed": false
2018 | },
2019 | "outputs": [
2020 | {
2021 | "data": {
2022 | "text/plain": [
2023 | "pandas.core.series.Series"
2024 | ]
2025 | },
2026 | "execution_count": 33,
2027 | "metadata": {},
2028 | "output_type": "execute_result"
2029 | }
2030 | ],
2031 | "source": [
2032 | "type(df['Wscore'])"
2033 | ]
2034 | },
2035 | {
2036 | "cell_type": "code",
2037 | "execution_count": 34,
2038 | "metadata": {
2039 | "collapsed": false
2040 | },
2041 | "outputs": [
2042 | {
2043 | "data": {
2044 | "text/plain": [
2045 | "pandas.core.frame.DataFrame"
2046 | ]
2047 | },
2048 | "execution_count": 34,
2049 | "metadata": {},
2050 | "output_type": "execute_result"
2051 | }
2052 | ],
2053 | "source": [
2054 | "type(df[['Wscore']])"
2055 | ]
2056 | },
2057 | {
2058 | "cell_type": "markdown",
2059 | "metadata": {},
2060 | "source": [
2061 | "You've seen before that you can access columns through df['col name']. You can access rows by using slicing operations. "
2062 | ]
2063 | },
2064 | {
2065 | "cell_type": "code",
2066 | "execution_count": 35,
2067 | "metadata": {
2068 | "collapsed": false
2069 | },
2070 | "outputs": [
2071 | {
2072 | "data": {
2127 | "text/plain": [
2128 | " Season Daynum Wteam Wscore Lteam Lscore Wloc Numot\n",
2129 | "0 1985 20 1228 81 1328 64 N 0\n",
2130 | "1 1985 25 1106 77 1354 70 H 0\n",
2131 | "2 1985 25 1112 63 1223 56 H 0"
2132 | ]
2133 | },
2134 | "execution_count": 35,
2135 | "metadata": {},
2136 | "output_type": "execute_result"
2137 | }
2138 | ],
2139 | "source": [
2140 | "df[0:3]"
2141 | ]
2142 | },
2143 | {
2144 | "cell_type": "markdown",
2145 | "metadata": {},
2146 | "source": [
2147 | "Here's an equivalent using **iloc**, which selects rows and columns by integer position."
2148 | ]
2149 | },
2150 | {
2151 | "cell_type": "code",
2152 | "execution_count": 36,
2153 | "metadata": {
2154 | "collapsed": false
2155 | },
2156 | "outputs": [
2157 | {
2158 | "data": {
2213 | "text/plain": [
2214 | " Season Daynum Wteam Wscore Lteam Lscore Wloc Numot\n",
2215 | "0 1985 20 1228 81 1328 64 N 0\n",
2216 | "1 1985 25 1106 77 1354 70 H 0\n",
2217 | "2 1985 25 1112 63 1223 56 H 0"
2218 | ]
2219 | },
2220 | "execution_count": 36,
2221 | "metadata": {},
2222 | "output_type": "execute_result"
2223 | }
2224 | ],
2225 | "source": [
2226 | "df.iloc[0:3,:]"
2227 | ]
2228 | },
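  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One thing that is easy to trip over: iloc slices by position and excludes the end point, while loc slices by label and includes it. The two also diverge once a filter leaves gaps in the index. The cell below is a minimal sketch on a made-up dataframe called toy (hypothetical values)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy frame with the default 0..3 index\n",
    "toy = pd.DataFrame({'Wscore': [152, 151, 151, 162]})\n",
    "\n",
    "# Positions 0 and 1 (end point excluded)\n",
    "print(len(toy.iloc[0:2]))\n",
    "\n",
    "# Labels 0, 1 and 2 (end point included)\n",
    "print(len(toy.loc[0:2]))\n",
    "\n",
    "# Filtering keeps the original row labels\n",
    "sub = toy[toy['Wscore'] > 151]\n",
    "print(sub.index.tolist())\n",
    "\n",
    "# Second row by position versus the row labelled 3: same row here\n",
    "print(sub.iloc[1]['Wscore'])\n",
    "print(sub.loc[3, 'Wscore'])"
   ]
  },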
2229 | {
2230 | "cell_type": "markdown",
2231 | "metadata": {},
2232 | "source": [
2233 | "# Data Cleaning"
2234 | ]
2235 | },
2236 | {
2237 | "cell_type": "markdown",
2238 | "metadata": {},
2239 | "source": [
2240 | "One of the big jobs in doing well in Kaggle competitions is data cleaning. A lot of the time, the CSV file you're given (the Titanic dataset is a good example) will have a lot of missing values, which you have to identify. The following **isnull** function flags every missing value in the dataframe, and **sum** then totals them for each column. In this case, we have a pretty clean dataset."
2241 | ]
2242 | },
2243 | {
2244 | "cell_type": "code",
2245 | "execution_count": 37,
2246 | "metadata": {
2247 | "collapsed": false
2248 | },
2249 | "outputs": [
2250 | {
2251 | "data": {
2252 | "text/plain": [
2253 | "Season 0\n",
2254 | "Daynum 0\n",
2255 | "Wteam 0\n",
2256 | "Wscore 0\n",
2257 | "Lteam 0\n",
2258 | "Lscore 0\n",
2259 | "Wloc 0\n",
2260 | "Numot 0\n",
2261 | "dtype: int64"
2262 | ]
2263 | },
2264 | "execution_count": 37,
2265 | "metadata": {},
2266 | "output_type": "execute_result"
2267 | }
2268 | ],
2269 | "source": [
2270 | "df.isnull().sum()"
2271 | ]
2272 | },
2273 | {
2274 | "cell_type": "markdown",
2275 | "metadata": {},
2276 | "source": [
2277 | "If you do end up having missing values in your datasets, be sure to get familiar with these two functions. \n",
2278 | "* **dropna()** - This function allows you to drop all (or some) of the rows that have missing values. \n",
2279 | "* **fillna()** - This function allows you to replace the missing values with the value that you pass in."
2280 | ]
2281 | },
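  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since the basketball dataset has no missing values, here is a minimal sketch on a made-up dataframe called gaps (hypothetical values) that shows both functions. Both return a new dataframe by default and leave the original untouched."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical frame with one missing value in each column\n",
    "gaps = pd.DataFrame({\n",
    "    'Wscore': [81, np.nan, 63],\n",
    "    'Lscore': [64, 70, np.nan],\n",
    "})\n",
    "\n",
    "# Count the missing values per column\n",
    "print(gaps.isnull().sum())\n",
    "\n",
    "# Drop every row that has at least one missing value\n",
    "print(gaps.dropna())\n",
    "\n",
    "# Drop only the rows where Wscore is missing\n",
    "print(gaps.dropna(subset=['Wscore']))\n",
    "\n",
    "# Replace missing values with a constant\n",
    "print(gaps.fillna(0))\n",
    "\n",
    "# Replace missing values with the mean of their own column\n",
    "print(gaps.fillna(gaps.mean()))"
   ]
  },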
2282 | {
2283 | "cell_type": "markdown",
2284 | "metadata": {},
2285 | "source": [
2286 | "# Visualizing Data"
2287 | ]
2288 | },
2289 | {
2290 | "cell_type": "markdown",
2291 | "metadata": {},
2292 | "source": [
2293 | "A useful way of visualizing dataframes is through matplotlib. "
2294 | ]
2295 | },
2296 | {
2297 | "cell_type": "code",
2298 | "execution_count": 38,
2299 | "metadata": {
2300 | "collapsed": true
2301 | },
2302 | "outputs": [],
2303 | "source": [
2304 | "import matplotlib.pyplot as plt\n",
2305 | "%matplotlib inline"
2306 | ]
2307 | },
2308 | {
2309 | "cell_type": "code",
2310 | "execution_count": 39,
2311 | "metadata": {
2312 | "collapsed": false
2313 | },
2314 | "outputs": [
2315 | {
2316 | "data": {
2317 | "text/plain": [
2318 | ""
2319 | ]
2320 | },
2321 | "execution_count": 39,
2322 | "metadata": {},
2323 | "output_type": "execute_result"
2324 | },
2325 | {
2326 | "data": {
2327 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAjkAAAF5CAYAAAB9WzucAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAAPYQAAD2EBqD+naQAAIABJREFUeJzt3XuYnVV99//3h0MSwSYRIwkU0qpoSD1QMhyClQDGB6rg\n6YePMppysNZKAXnSeon6gxLhaYt4SfhBwFKkohym0iCeEglykCJEooRClCFUiR0QEhgJQwxNIMn3\n98daW+/cTuawZ8/svW8+r+vaV7LX+s6619o72fs7617rvhURmJmZmVXNTs3ugJmZmdlocJJjZmZm\nleQkx8zMzCrJSY6ZmZlVkpMcMzMzqyQnOWZmZlZJTnLMzMyskpzkmJmZWSU5yTEzM7NKcpJjZmZm\nldRySY6kT0vaJumiUvl5kp6Q9Lyk70var1Q/XtJlknolbZC0WNKepZhXSLpOUp+k9ZK+LGn3Usy+\nkpZI2ihpraQLJbXc62RmZmYDa6kvb0kHAx8DHiiVnwWcnusOATYCyySNK4RdDBwLHA/MAfYGbiwd\n4npgJjA3x84BrigcZydgKbALMBs4CTgZOK8R4zMzM7Oxo1a5QaeklwP3AacC5wD3R8Tf5rongC9E\nxML8fCKwDjgpIm7Iz58GToiIm3LMDKAbmB0RKyTNBH4GdETE/TnmGGAJsE9ErJX0DuDbwF4R0Ztj\n/hq4AHhVRGwZkxfDzMzMRqyVZnIuA74TEbcXCyW9GpgG3FYri4jngHuBw3LRQaTZl2LMaqCnEDMb\nWF9LcLJbgQAOLcSsqiU42TJgEvCGkQzOzMzMxtYuze4AgKQTgD8lJStl00iJyLpS+bpcBzAVeCEn\nPzuKmQY8VayMiK2SninF9HecWt0DmJmZWVtoepIjaR/Sepq3R8SLze7PcEl6JXAM8EtgU3N7Y2Zm\n1lYmAH8MLIuIXze68aYnOUAH8CpgpSTlsp2BOZJOB/YHRJqtKc6yTAVqp57WAuMkTSzN5kzNdbWY\n8m6rnYE9SjEHl/o3tVDXn2OA6wYaoJmZmQ3ow6TNQQ3VCknOrcCbSmVXkxYNXxARj0paS9oR9SD8\nduHxoaR1PJAWLG/JMcWFx9OB5TlmOTBZ0oGFdTlzSQnUvYWYz0qaUliXczTQBzy0g/7/EuDaa69l\n5syZwxp4q5o/fz4LFy5sdjcaokpjAY+nlVVpLODxtLIqjaW7u5t58+ZB/i5ttKYnORGxkVICIWkj\n8OuI6M5FFwNnS/o56YU4H3gc+FZu4zlJVwEXSVoPbAAuAe6OiBU55mFJy4ArJZ0KjAMuBboiojZL\nc0vuyzV52/pe+ViLBjiVtglg5syZzJo1a2QvRouYNGmSx9KiPJ7WVaWxgMfTyqo0loJRWe7R9CRn\nB7bb1x4RF0rajXRNm8nAXcA7IuKFQth8YCuwGBgP3AycVmr3Q8Ai0uzRthx7ZuE42yQdB3wJuId0\nPZ6rgXMbNTAzMzMbGy2Z5ETE2/opWwAsGOBnNgNn5MeOYp4F5g1y7MeA44bYVTMzM2tRrXSdHDMz\nM7OGacmZHGuuzs7OZnehYZo5lp6eHnp7ewcPHKIpU6ZU6r0B/1trZR5P66rSWEZby9zWoV1JmgXc\nd99991VxIZjVqaenhxkzZrJp0/MNa3PChN1Yvbqb6dOnN6xNM7NmWrlyJR0dHZBuubSy0e17Jsds\nFPT29uYE51rSPWFHqptNm+bR29vrJMfMbIic5JiNqpmAZ/jMzJrBC4/NzMyskpzkmJmZWSU5yTEz\nM7NKcpJjZmZmleQkx8zMzCrJSY6ZmZlVkpMcMzMzqyQnOWZmZlZJTnLMzMyskpzkmJmZWSU5yTEz\nM7NKcpJjZmZmleQkx8zMzCrJSY6ZmZlVkpMcMzMzqyQnOWZmZlZJTnLMzMyskpzkmJmZWSU5yTEz\nM7NKcpJjZmZmleQkx8zMzCrJSY6Zm
ZlVkpMcMzMzq6SmJzmSPi7pAUl9+XGPpD8v1H9F0rbSY2mp\njfGSLpPUK2mDpMWS9izFvELSdfkY6yV9WdLupZh9JS2RtFHSWkkXSmr6a2RmZmbD1wpf4I8BZwGz\ngA7gduBbkmYWYr4HTAWm5UdnqY2LgWOB44E5wN7AjaWY64GZwNwcOwe4olaZk5mlwC7AbOAk4GTg\nvBGOz8zMzJpgl2Z3ICKWlIrOlnQqKdHozmWbI+Lp/n5e0kTgI8AJEXFnLjsF6JZ0SESsyAnTMUBH\nRNyfY84Alkj6ZESszfX7A0dFRC+wStI5wAWSFkTEloYO3MzMzEZVK8zk/JaknSSdAOwG3FOoOlLS\nOkkPS7pc0h6Fug5SsnZbrSAiVgM9wGG5aDawvpbgZLcCARxaiFmVE5yaZcAk4A0jH52ZmZmNpabP\n5ABIeiOwHJgAbADelxMVSKeqbgTWAK8F/glYKumwiAjS6asXIuK5UrPrch35z6eKlRGxVdIzpZh1\n/bRRq3ug/hGamZnZWGuJJAd4GDiANGvyfuBrkuZExMMRcUMh7meSVgG/AI4E7hjznu7A/PnzmTRp\n0nZlnZ2ddHaWlw+ZmZm99HR1ddHV1bVdWV9f36gesyWSnLze5dH89H5JhwBnAqf2E7tGUi+wHynJ\nWQuMkzSxNJszNdeR/yzvttoZ2KMUc3DpcFMLdQNauHAhs2bNGizMzMzsJam/X/xXrlxJR0fHqB2z\npdbkFOwEjO+vQtI+wCuBJ3PRfcAW0q6pWswMYDrpFBj5z8mSDiw0NRcQcG8h5k2SphRijgb6gIdG\nMhgzMzMbe02fyZH0j6R1Nz3AHwAfBo4Ajs7XsTmXtCZnLWn25vPAI6RFwUTEc5KuAi6StJ60pucS\n4O6IWJFjHpa0DLgy79waB1wKdOWdVQC3kJKZaySdBewFnA8siogXR/llMDMzswZrepJDOo30VVJS\n0Qc8CBwdEbdLmgC8GTgRmAw8QUpu/r6UeMwHtgKLSTNANwOnlY7zIWARaVfVthx7Zq0yIrZJOg74\nEmln10bgalKSZWZmZm2m6UlORHx0gLpNwJ/vqL4Qtxk4Iz92FPMsMG+Qdh4DjhvseGZmZtb6WnVN\njpmZmdmIOMkxMzOzSnKSY2ZmZpXkJMfMzMwqyUmOmZmZVZKTHDMzM6skJzlmZmZWSU5yzMzMrJKc\n5JiZmVklOckxMzOzSnKSY2ZmZpXkJMfMzMwqyUmOmZmZVZKTHDMzM6skJzlmZmZWSU5yzMzMrJKc\n5JiZmVklOckxMzOzSnKSY2ZmZpXkJMfMzMwqyUmOmZmZVZKTHDMzM6skJzlmZmZWSU5yzMzMrJKc\n5JiZmVklOckxMzOzSnKSY2ZmZpXkJMfMzMwqqelJjqSPS3pAUl9+3CPpz0sx50l6QtLzkr4vab9S\n/XhJl0nqlbRB0mJJe5ZiXiHpunyM9ZK+LGn3Usy+kpZI2ihpraQLJTX9NTIzM7Pha4Uv8MeAs4BZ\nQAdwO/AtSTMBJJ0FnA58DDgE2AgskzSu0MbFwLHA8cAcYG/gxtJxrgdmAnNz7BzgilplTmaWArsA\ns4GTgJOB8xo2UjMzMxszTU9yImJJRNwcEb+IiJ9HxNnAb0iJBsCZwPkR8d2I+ClwIimJeS+ApInA\nR4D5EXFnRNwPnAL8maRDcsxM4BjgLyPiJxFxD3AGcIKkafk4xwD7Ax+OiFURsQw4BzhN0i6j/0qY\nmZlZIzU9ySmStJOkE4DdgHskvRqYBtxWi4mI54B7gcNy0UGk2ZdizGqgpxAzG1ifE6CaW4EADi3E\nrIqI3kLMMmAS8IaGDNDMzMzGTEskOZLeKGkDsBm4HHhfTlSmkRKRdaUfWZfrAKYCL+TkZ0cx04Cn\nipURsRV4phTT33EoxJiZmVmbaJXTMA8DB5BmTd4PfE3SnOZ2aXjmz5/PpEmTtivr7Oyks7OzST2y\nK
uru7m5IO1OmTGH69OkNacvMbCi6urro6urarqyvr29Uj9kSSU5EbAEezU/vz2tpzgQuBESarSnO\nskwFaqee1gLjJE0szeZMzXW1mPJuq52BPUoxB5e6NrVQN6CFCxcya9aswcLM6vQksBPz5s1rSGsT\nJuzG6tXdTnTMbMz094v/ypUr6ejoGLVjtkSS04+dgPERsUbSWtKOqAfhtwuNDwUuy7H3AVtyzE05\nZgYwHVieY5YDkyUdWFiXM5eUQN1biPmspCmFdTlHA33AQ6MySrMhexbYBlxL2iQ4Et1s2jSP3t5e\nJzlmVmlNT3Ik/SPwPdJC4T8APgwcQUowIG0PP1vSz4FfAucDjwPfgrQQWdJVwEWS1gMbgEuAuyNi\nRY55WNIy4EpJpwLjgEuBroiozdLcQkpmrsnb1vfKx1oUES+O4ktgNgwzSVdbMDOzwTQ9ySGdRvoq\nKanoI83YHB0RtwNExIWSdiNd02YycBfwjoh4odDGfGArsBgYD9wMnFY6zoeARaRdVdty7Jm1yojY\nJuk44EvAPaTr8VwNnNvAsZqZmdkYaXqSExEfHULMAmDBAPWbSde9OWOAmGeBARc0RMRjwHGD9cfM\nzMxaX9OTHLNW0tPTQ29v7+CBg2jULigzM6ufkxyzrKenhxkzZrJp0/PN7oqZmTWAkxyzrLe3Nyc4\njdjBtJR0VxAzM2sWJzlmv6cRO5h8usrMrNla4rYOZmZmZo3mJMfMzMwqyUmOmZmZVZKTHDMzM6sk\nJzlmZmZWSU5yzMzMrJKc5JiZmVklOckxMzOzSnKSY2ZmZpXkJMfMzMwqyUmOmZmZVZKTHDMzM6sk\nJzlmZmZWSU5yzMzMrJKc5JiZmVklOckxMzOzSnKSY2ZmZpXkJMfMzMwqyUmOmZmZVZKTHDMzM6sk\nJzlmZmZWSU5yzMzMrJKc5JiZmVklNT3JkfQZSSskPSdpnaSbJL2+FPMVSdtKj6WlmPGSLpPUK2mD\npMWS9izFvELSdZL6JK2X9GVJu5di9pW0RNJGSWslXSip6a+TmZmZDU8rfHkfDlwKHAq8HdgVuEXS\ny0px3wOmAtPyo7NUfzFwLHA8MAfYG7ixFHM9MBOYm2PnAFfUKnMysxTYBZgNnAScDJw3gvGZmZlZ\nE+zS7A5ExDuLzyWdDDwFdAA/LFRtjoin+2tD0kTgI8AJEXFnLjsF6JZ0SESskDQTOAboiIj7c8wZ\nwBJJn4yItbl+f+CoiOgFVkk6B7hA0oKI2NK4kZuZmdloaoWZnLLJQADPlMqPzKezHpZ0uaQ9CnUd\npITttlpBRKwGeoDDctFsYH0twcluzcc6tBCzKic4NcuAScAbRjYsMzMzG0stleRIEum00w8j4qFC\n1feAE4G3AZ8CjgCW5nhIp69eiIjnSk2uy3W1mKeKlRGxlZRMFWPW9dMGhRgzMzNrA00/XVVyOfAn\nwJ8VCyPihsLTn0laBfwCOBK4Y8x6Z2ZmZm2jZZIcSYuAdwKHR8STA8VGxBpJvcB+pCRnLTBO0sTS\nbM7UXEf+s7zbamdgj1LMwaXDTS3U7dD8+fOZNGnSdmWdnZ10dpbXR5uZmb30dHV10dXVtV1ZX1/f\nqB6zJZKcnOC8BzgiInqGEL8P8EqglgzdB2wh7Zq6KcfMAKYDy3PMcmCypAML63LmAgLuLcR8VtKU\nwrqco4E+oHj67PcsXLiQWbNmDdZ1MzOzl6T+fvFfuXIlHR0do3bMpic5ki4nbQd/N7BRUm3mpC8i\nNuXr2JxL2g6+ljR783ngEdKiYCLiOUlXARdJWg9sAC4B7o6IFTnmYUnLgCslnQqMI21d78o7qwBu\nISUz10g6C9gLOB9YFBEvjuoLYWZmZg3V9CQH+Dhph9MPSuWnAF8DtgJvJi08ngw8QUpu/r6UeMzP\nsYuB8cDNwGmlNj8ELCLtqtqWY8+sVUbENknHAV8C7gE2AleTkiw
zMzNrI01PciJiwB1eEbEJ+PMh\ntLMZOCM/dhTzLDBvkHYeA44b7HhmZmbW2uraQi7pLyRNaHRnzMzMzBql3uvkLATWSrpC0iGN7JCZ\nmZlZI9Sb5OwN/BWwD3C3pJ9K+jtJr2pc18zMzMzqV1eSExEvRMS/R8SxpG3a1wB/CTwu6RuSji1c\njdjMzMxszI34tg75wn23ki7KF8BBQBfwX5IOH2n7ZmZmZvWoO8mRNEXS/5H0AHA36WrC7wX+CPhD\n4JukLeBmZmZmY66uLeSSbiLdgmEN8GXgqxHxdCFkg6QLgb8deRfNzMzMhq/e6+Q8B7w9Iu4aIOZp\n4HV1tm9mZmY2InUlORFx0hBignSncDMzM7MxV+/FABdKKt8yAUmnSfriyLtlZmZmNjL1Ljz+36R7\nO5X9CPhg/d0xMzMza4x6k5wppHU5ZX25zszMzKyp6k1yfgEc00/5MaQdV2ZmZmZNVe/uqouBiyW9\nErg9l80FPgV8shEdMzMzMxuJendXXZnvQv5Z4HO5+HHgExHxr43qnJmZmVm96p3JISIuBS6VtBfw\nPxHxbOO6ZWZmZjYydSc5NfneVWZmZmYtpd7r5LxK0lck9UjaJOmF4qPRnTQzMzMbrnpncq4GXgt8\nAXiSdPdxMzMzs5ZRb5IzB5gTEfc3sjNmZmZmjVLvdXIex7M3ZmZm1sLqTXLmA/8kaZ9GdsbMzMys\nUeo9XXUN8AfAf0t6DnixWBkRe460Y2ZmZmYjUW+S8+mG9sLMzMysweq94vFVje6ImZmZWSPVuyYH\nSX8saYGkayTtmcuOljSzcd0zMzMzq0+9FwM8HPgZcATwAeDluaoDOK8xXTMzMzOrX70zOZ8HFkTE\nUUDxCse3AbNH3CszMzOzEao3yXkzsLif8qeAVw2nIUmfkbRC0nOS1km6SdLr+4k7T9ITkp6X9H1J\n+5Xqx0u6TFKvpA2SFtdOoxViXiHpOkl9ktZL+rKk3Usx+0paImmjpLWSLpRU92k9MzMza456v7z7\ngGn9lB8A/GqYbR0OXAocCrwd2BW4RdLLagGSzgJOBz4GHAJsBJZJGldo52LgWOB40hWZ9wZuLB3r\nemAmMDfHzgGuKBxnJ2ApaUH2bOAk4GR8Cs7MzKzt1LuF/OvABZLeT77ysaRDgS8C1w6noYh4Z/G5\npJNJM0IdwA9z8ZnA+RHx3RxzIrAOeC9wg6SJwEeAEyLizhxzCtAt6ZCIWJEXRB8DdNRuRyHpDGCJ\npE9GxNpcvz9wVET0AqsknZPHuiAitgxnbGZmZtY89c7kfAZ4FHiCtOj4IeAe4MfA+SPs02RS4vQM\ngKRXk2aNbqsFRMRzwL3AYbnoIFLCVoxZDfQUYmYD60v327o1H+vQQsyqnODULAMmAW8Y4bjMzMxs\nDNV7nZzNwCmSzgPeREp0VkbEwyPpjCSRTjv9MCIeysXTSInIulL4On53ymwq8EJOfnYUM400Q1Qc\nx1ZJz5Ri+jtOre6BYQ3IzMzMmqbe01UARMQaYE2D+gJwOfAnwJ81sE0zMzN7CaoryZH0LwPVR8TH\n6mhzEfBO4PCIeLJQtRYQabamOMsyFbi/EDNO0sTSbM7UXFeLKe+22hnYoxRzcKlrUwt1OzR//nwm\nTZq0XVlnZyednZ0D/ZiZmdlLQldXF11dXduV9fX1jeox653J2av0fFfSmpU/AP5juI3lBOc9wBER\n0VOsi4g1ktaSdkQ9mOMnktbRXJbD7gO25JibcswMYDqwPMcsByZLOrCwLmcuKYG6txDzWUlTCuty\njibtJqudPuvXwoULmTVr1nCHbmZm9pLQ3y/+K1eupKOjY9SOWe+anHeVyyTtAvwzgyQD/fzc5UAn\n8G5go6TazElfRGzKf78YOFvSz4FfkhY3Pw58K/fnOUlXARdJWg9sAC4B7o6IFTnmYUnLgCslnQqM\nI21d78o7qwBuyf2/Jm9b3ys
fa1FEbHendTMzM2ttI1qTUxQRWyR9AfgBcNEwfvTjpIXFPyiVnwJ8\nLbd9oaTdSNe0mQzcBbwjIopXW54PbCVdpHA8cDNwWqnNDwGLSLuqtuXYMwtj2CbpOOBLpN1iG4Gr\ngXOHMR4zMzNrAQ1LcrJXk05dDVlEDGkbe0QsABYMUL8ZOCM/dhTzLDBvkOM8Bhw3lD6ZmZlZ66p3\n4fGF5SLSqZ13M8yLAZqZmZmNhnpncg4rPd8GPA18GrhyRD0yMzMza4B6Fx4f3uiOmJmZmTWS765t\nZmZmlVTvmpwfk2/MOZiIOKSeY5iZmZmNRL1rcu4A/hp4hN9dbG82MIO0zXvzyLtmZmZmVr96k5zJ\nwGUR8dlioaR/AKZGxEdH3DMzMzOzEah3Tc4HgK/0U3418L/r7o2ZmZlZg9Sb5GwmnZ4qm41PVZmZ\nmVkLqPd01SXAFZIOBFbkskOBvwL+qREdMzMzMxuJeq+T8w+S1pDu+1Rbf9MNfCwirm9U58zMzMzq\nVfe9q3Iy44TGzMzMWlLdFwOUNFHSyZLOk/SKXHaApL0a1z0zMzOz+tR7McA3ArcCzwP7knZVrQc+\nCPwhcFKD+mdmZmZWl3pnchaSTlW9FthUKF8CzBlpp8zMzMxGqt4k52Dg8ogo39rhV4BPV5mZmVnT\n1bvw+EXg5f2U7wf01t8ds+Hp6emht7cx/+S6u7sb0o6ZmbWGepOc7wDnSPpgfh6S/hC4APhGQ3pm\nNoienh5mzJjJpk3PN7srZmbWgupNcv6OlMysBV4G3A7sDfwY+OwAP2fWML29vTnBuRaY2YAWlwLn\nNKAdMzNrBfVeDHA9cJSkI4ADSKeuVgLL+lmnYzbKZgKzGtCOT1eZmVXJsJMcSbsC3wVOj4g7gTsb\n3iszMzOzERr27qqIeBHoADxjY2ZmZi2r3i3k1wGnNLIjZmZmZo1U78LjAE6X9HbgJ8DG7SojPjXS\njpmZmZmNRL1JTgfwYP77m0t1Po1lZmZmTTesJEfSa4A1EXH4KPXHzMzMrCGGuybnv4BX1Z5I+rqk\nqY3tkpmZmdnIDTfJUen5O4HdG9QXMzMzs4apd3dVQ0k6XNK3Jf1K0jZJ7y7VfyWXFx9LSzHjJV0m\nqVfSBkmLJe1ZinmFpOsk9UlaL+nLknYvxewraYmkjZLWSrpQUku8TmZmZjZ0w/3yDn5/YXEjFhrv\nDvwn8DcDtPc9YCowLT86S/UXA8cCxwNzSLeZuLEUcz3p8rhzc+wc4IpaZU5mlpLWKs0GTgJOBs6r\na1RmZmbWNMPdXSXgakmb8/MJwD9LKm8h/3+G02hE3AzcDCCpfEqsZnNEPN1vp6SJwEeAE/JVmJF0\nCtAt6ZCIWCFpJnAM0BER9+eYM4Alkj4ZEWtz/f7AURHRC6ySdA5wgaQFEbFlOOMyMzOz5hnuTM5X\ngaeAvvy4Fnii8Lz2GA1HSlon6WFJl0vao1DXQUrYbqsVRMRqoAc4LBfNBtbXEpzsVtLM0aGFmFU5\nwalZBkwC3tDQ0ZiZmdmoGtZMTkQ06yrH3yOdeloDvBb4J2CppMPyDUGnAS9ExHOln1uX68h/PlWs\njIitkp4pxazrp41a3QMNGIuZmZmNgXovBjimIuKGwtOfSVoF/AI4ErijKZ0qmT9/PpMmTdqurLOz\nk87O8tIhMzOzl56uri66urq2K+vrG62TP0lbJDllEbFGUi+wHynJWQuMkzSxNJszNdeR/yzvttoZ\n2KMUc3DpcFMLdTu0cOFCZs2aNdyhmJmZvST094v/ypUr6ejoGLVjtuXWaEn7AK8EnsxF9wFbSLum\najEzgOnA8ly0HJgs6cBCU3NJi6nvLcS8SdKUQszRpHVGDzV4GGZmZjaKWmImJ1+rZj9+d7HB10g6\nAHgmP84lrclZm+M+DzxCWhRMRDwn6SrgIknrgQ3AJcDdEbEixzwsaRlwpaRTgXHApUBX3lkFc
Asp\nmblG0lnAXsD5wKKIeHE0XwMzMzNrrJZIcoCDSKedatfh+WIu/yrp2jlvBk4EJpN2cy0D/r6UeMwH\ntgKLgfGkLemnlY7zIWARaVfVthx7Zq0yIrZJOg74EnAP6e7qV5OSLLNK6e7ublhbU6ZMYfr06Q1r\nz8ysEVoiycnXthno1NmfD6GNzcAZ+bGjmGeBeYO08xhw3GDHM2tfTwI7MW/egP8VhmXChN1Yvbrb\niY6ZtZSWSHLMbCw9S5rIvJZ0AfCR6mbTpnn09vY6yTGzluIkx+wlaybgHYFmVl1tubvKzMzMbDBO\ncszMzKySnOSYmZlZJTnJMTMzs0pykmNmZmaV5CTHzMzMKslJjpmZmVWSkxwzMzOrJCc5ZmZmVklO\ncszMzKySnOSYmZlZJTnJMTMzs0pykmNmZmaV5CTHzMzMKslJjpmZmVWSkxwzMzOrJCc5ZmZmVklO\ncszMzKySnOSYmZlZJTnJMTMzs0pykmNmZmaV5CTHzMzMKslJjpmZmVWSkxwzMzOrJCc5ZmZmVklO\ncszMzKySWiLJkXS4pG9L+pWkbZLe3U/MeZKekPS8pO9L2q9UP17SZZJ6JW2QtFjSnqWYV0i6TlKf\npPWSvixp91LMvpKWSNooaa2kCyW1xOtkZmZmQ9cqX967A/8J/A0Q5UpJZwGnAx8DDgE2AsskjSuE\nXQwcCxwPzAH2Bm4sNXU9MBOYm2PnAFcUjrMTsBTYBZgNnAScDJw3wvGZmZnZGNul2R0AiIibgZsB\nJKmfkDOB8yPiuznmRGAd8F7gBkkTgY8AJ0TEnTnmFKBb0iERsULSTOAYoCMi7s8xZwBLJH0yItbm\n+v2BoyKiF1gl6RzgAkkLImLLqL0IZmZm1lCtMpOzQ5JeDUwDbquVRcRzwL3AYbnoIFLCVoxZDfQU\nYmYD62sJTnYraebo0ELMqpzg1CwDJgFvaNCQzMzMbAy0fJJDSnCCNHNTtC7XAUwFXsjJz45ipgFP\nFSsjYivwTCmmv+NQiDEzM7M20BKnq6pg/vz5TJo0abuyzs5OOjs7m9QjMzOz1tHV1UVXV9d2ZX19\nfaN6zHZIctYCIs3WFGdZpgL3F2LGSZpYms2ZmutqMeXdVjsDe5RiDi4df2qhbocWLlzIrFmzBh2M\nmZnZS1F/v/ivXLmSjo6OUTtmy5+uiog1pARjbq0sLzQ+FLgnF90HbCnFzACmA8tz0XJgsqQDC83P\nJSVQ9xZi3iRpSiHmaKAPeKhBQzIzM7Mx0BIzOflaNfuREg6A10g6AHgmIh4jbQ8/W9LPgV8C5wOP\nA9+CtBCdV2t+AAAWK0lEQVRZ0lXARZLWAxuAS4C7I2JFjnlY0jLgSkmnAuOAS4GuvLMK4BZSMnNN\n3ra+Vz7Wooh4cVRfBDMzM2uolkhySLuj7iAtMA7gi7n8q8BHIuJCSbuRrmkzGbgLeEdEvFBoYz6w\nFVgMjCdtST+tdJwPAYtIu6q25dgza5URsU3SccCXSLNEG4GrgXMbNVAzMzMbGy2R5ORr2wx46iwi\nFgALBqjfDJyRHzuKeRaYN8hxHgOOGyjGzMzMWl/Lr8kxMzMzq4eTHDMzM6skJzlmZmZWSU5yzMzM\nrJKc5JiZmVklOckxMzOzSnKSY2ZmZpXkJMfMzMwqyUmOmZmZVZKTHDMzM6skJzlmZmZWSU5yzMzM\nrJKc5JiZmVklOckxMzOzSnKSY2ZmZpXkJMfMzMwqaZdmd8DMqqG7u7thbU2ZMoXp06c3rD0ze2ly\nkmNmI/QksBPz5s1rWIsTJuzG6tXdTnTMbESc5JjZCD0LbAOuBWY2oL1uNm2aR29vr5McMxsRJzlm\n1iAzgVnN7oSZ2W954bGZmZlVkpMcMzMzqyQnOWZmZlZJTnLMzMyskpzkmJmZWSU5yTEzM7NKcpJj\nZmZmldQWSY6kcyVtKz0eKsWcJ+kJSc9L+r6k/Ur14yVdJ
qlX0gZJiyXtWYp5haTrJPVJWi/py5J2\nH4sxmpmZWWO1RZKT/RSYCkzLj7fWKiSdBZwOfAw4BNgILJM0rvDzFwPHAscDc4C9gRtLx7iedEWz\nuTl2DnDFKIzFzMzMRlk7XfF4S0Q8vYO6M4HzI+K7AJJOBNYB7wVukDQR+AhwQkTcmWNOAbolHRIR\nKyTNBI4BOiLi/hxzBrBE0icjYu2ojs7MzMwaqp1mcl4n6VeSfiHpWkn7Akh6NWlm57ZaYEQ8B9wL\nHJaLDiIldMWY1UBPIWY2sL6W4GS3AgEcOjpDMjMzs9HSLknOj4CTSTMtHwdeDfxHXi8zjZSIrCv9\nzLpcB+k01ws5+dlRzDTgqWJlRGwFninEmJmZWZtoi9NVEbGs8PSnklYA/w18AHi4Ob0yMzOzVtYW\nSU5ZRPRJegTYD/gBINJsTXE2ZypQO/W0FhgnaWJpNmdqrqvFlHdb7QzsUYjZofnz5zNp0qTtyjo7\nO+ns7BziqMzMzKqrq6uLrq6u7cr6+vpG9ZhtmeRIejkpwflqRKyRtJa0I+rBXD+RtI7msvwj9wFb\ncsxNOWYGMB1YnmOWA5MlHVhYlzOXlEDdO1ifFi5cyKxZsxowOjMzs+rp7xf/lStX0tHRMWrHbIsk\nR9IXgO+QTlH9IfA54EXg33LIxcDZkn4O/BI4H3gc+BakhciSrgIukrQe2ABcAtwdEStyzMOSlgFX\nSjoVGAdcCnR5Z5WZmVn7aYskB9iHdA2bVwJPAz8EZkfErwEi4kJJu5GuaTMZuAt4R0S8UGhjPrAV\nWAyMB24GTisd50PAItKuqm059sxRGpOZmZmNorZIciJi0IUtEbEAWDBA/WbgjPzYUcyzwLzh99CG\nqqenh97e3oa01d3d3ZB2zMysmtoiybFq6OnpYcaMmWza9Hyzu2JmZi8BTnJszPT29uYE51rS3TNG\nailwTgPaMTOzKnKSY00wE2jETjSfrjIzsx1rlysem5mZmQ2LkxwzMzOrJCc5ZmZmVklOcszMzKyS\nnOSYmZlZJTnJMTMzs0pykmNmZmaV5CTHzMzMKslJjpmZmVWSkxwzMzOrJCc5ZmZmVklOcszMzKyS\nfINOM2tJ3d2NuQHrlClTmD59ekPaMrP24iTHzFrMk8BOzJs3ryGtTZiwG6tXdzvRMXsJcpJjZi3m\nWWAbcC0wc4RtdbNp0zx6e3ud5Ji9BDnJMbMWNROY1exOmFkb88JjMzMzqyQnOWZmZlZJTnLMzMys\nkpzkmJmZWSU5yTEzM7NKcpJjZmZmleQt5GZWeY26ejL4Cspm7cRJjplVWGOvngy+grJZO3GSY4Pq\n6emht7d3xO008rdps6Fp5NWTwVdQNmsvTnL6Iek04JPANOAB4IyI+HFzezV2urq66OzsBFKCM2PG\nTDZter7JvapXF9DZ7E400M1U6yrAY/X+jP7Vk4v/b6rA42ldVRrLaPPC4xJJHwS+CJwLHEhKcpZJ\nmtLUjo2hrq6u3/69t7c3JzjXAveN8HH+mI3hd7oGD2kry5rdgQarzvtT/H9TBR5P66rSWEabZ3J+\n33zgioj4GoCkjwPHAh8BLmxmx5qrEb8J+3SVVUN/p177+vpYuXLlsNvyQmaz0eMkp0DSrkAH8I+1\nsogISbcChzWtY2bWIgZeyNzR0THsFr2Q2Wz0OMnZ3hRgZ2BdqXwdMGPsuzN8q1ev5q/+6q/YvHlz\n3W088sgjHHrooQBMnjy5UV0zq4CBFjLPBxYOs720kPmuu+5i5syRL4zevHkz48ePH3E7kGamli9f\n3rD2oLH9G25bg820NbNvw22v3lnDmpfS7KGTnJGbAK2zc2jRokXcddddI25nxYoVpZKljPx0090N\nbGuo7T0OXNfA9oZqtMa6jqGPZ7C2xvJ92JH+3p92eB/W9FO3oY5j3A+ogVvcdyIlYY3xlre8taHt\nNbZ/w29r4Jm25vZtu
O3VM2tYM27cBL7xjcXstddeI+zXyBW+OyeMRvuKiNFoty3l01XPA8dHxLcL\n5VcDkyLiff38zIcY+beOmZnZS9mHI+L6RjfqmZyCiHhR0n3AXODbAJKUn1+ygx9bBnwY+CWwaQy6\naWZmVhUTgD9mlLaOeianRNIHgKuBjwMrSCfa3w/sHxFPN7FrZmZmNgyeySmJiBvyNXHOA6YC/wkc\n4wTHzMysvXgmx8zMzCrJVzw2MzOzSnKSY2ZmZpXkJGcIJH1G0gpJz0laJ+kmSa/vJ+48SU9Iel7S\n9yXt14z+DoekT0vaJumiUnnbjEXS3pKukdSb+/uApFmlmLYYj6SdJJ0v6dHc159LOrufuJYcj6TD\nJX1b0q/yv6t39xMzYN8ljZd0WX4/N0haLGnPsRvFb/uxw7FI2kXS5yU9KOk3OearkvYqtdESY8l9\nGfS9KcT+c475RKm8rcYjaaakb0l6Nr9P90rap1DfEuMZbCySdpe0SNJj+f/NzyT9dSmmJcaS+9KQ\n78xGjMlJztAcDlwKHAq8HdgVuEXSy2oBks4CTgc+BhwCbCTd2HPc2Hd3aCQdTOrvA6XythmLpMmk\nK7RtBo4hXYb274D1hZi2GQ/waeCvgb8B9gc+BXxK0um1gBYfz+6kxfp/A/zegr8h9v1i0v3ijgfm\nAHsDN45ut/s10Fh2A/4U+BzpRr7vI10V/VuluFYZCwzy3tRIeh/ps+5X/VS3zXgkvRa4C3iI1Nc3\nke4SXLzUR6uMZ7D3ZiFwNPAh0ufCQmCRpOMKMa0yFmjcd+bIxxQRfgzzQbr9wzbgrYWyJ4D5hecT\ngf8BPtDs/u5gDC8HVgNvA+4ALmrHsQAXAHcOEtNO4/kOcGWpbDHwtXYbT/4/8u7hvBf5+WbgfYWY\nGbmtQ1ppLP3EHARsBfZp5bEMNB7gD4Ee0i8La4BPlN6rthkP6Rb3Xx3gZ1pyPDsYyyrg/y2V/QQ4\nr5XHUujLsL8zGzUmz+TUZzIp234GQNKrgWnAbbWAiHgOuJfWvbHnZcB3IuL2YmEbjuVdwE8k3ZCn\nRVdK+mitsg3Hcw8wV9LrACQdAPwZ6b4E7Tie3xpi3w8iXdqiGLOa9MXb0uPjd58Lz+bnHbTRWCQJ\n+BpwYUT0d3+KthlPHsuxwH9Jujl/NvxI0nsKYW0zHtLnwrsl7Q0g6SjgdfzuAnqtPpZ6vjMb8lng\nJGeY8n+ei4EfRsRDuXga6Q3s78ae08awe0Mi6QTSVPtn+qluq7EArwFOJc1KHQ18CbhE0l/k+nYb\nzwXA14GHJb0A3AdcHBH/luvbbTxFQ+n7VOCF/IG3o5iWI2k86b27PiJ+k4un0V5j+TSpv4t2UN9O\n49mTNFt9FukXhP8F3AR8Q9LhOaadxnMG6cZoj+fPhaXAaRFRu5lay45lBN+ZDfks8MUAh+9y4E9I\nv123nbzo7mLg7RHxYrP70wA7ASsi4pz8/AFJbyRdsfqa5nWrbh8knXc/gbSW4E+B/0/SExHRjuOp\nPEm7AP9O+tD+myZ3py6SOoBPkNYXVUHtF/hvRkTtljwPSnoL6bNh5HcxHlufIK1vOY40kzEHuDx/\nLtw+4E82X1O/Mz2TMwySFgHvBI6MiCcLVWsBkTLPoqm5rpV0AK8CVkp6UdKLwBHAmfk3hHW0z1gA\nnuT3b/3cDUzPf2+n9wbgQuCCiPj3iPhZRFxHWmRYm3Vrt/EUDaXva4FxkiYOENMyCgnOvsDRhVkc\naK+xvJX0ufBY4XPhj4CLJD2aY9ppPL3AFgb/bGj58UiaAPwD8LcRsTQifhoRl5NmfD+Zw1pyLCP8\nzmzImJzkDFF+s94DHBURPcW6iFhDetHnFuInkjLve8ayn0NwK2mXwZ8CB+THT4BrgQMi4lHaZyyQ\ndlbNKJXNAP4b2u69gbRrZ2upbBv5/2objue3htj3+0hfTsWYGaQvpuVj1tkhKCQ4rwH
mRsT6Ukjb\njIW0FufN/O4z4QDSwtALSbsWoY3Gk2epf8zvfza8nvzZQPuMZ9f8KH8ubOV33+EtN5YGfGc2ZkzN\nXnXdDg/SdNt60ra4qYXHhELMp4BfkxbCvgn4JvBfwLhm938I4yvvrmqbsZAWp20mzXS8lnSqZwNw\nQpuO5yuk6eh3kn6Tfh/wFPCP7TAe0lbYA0hJ9Dbg/+Tn+w617/n/2xrgSNLM493AXa00FtKp/m+R\nvjDfVPpc2LXVxjKU96af+O12V7XbeID3kraLfzR/NpwOvAAc1mrjGcJY7gAeJM26/zFwMvA88LFW\nG0uhLyP+zmzEmMZ88O34yP/otvbzOLEUt4D028/zpFXv+zW770Mc3+0Ukpx2GwspIXgw9/VnwEf6\niWmL8eQPu4vyf+yN+T/954Bd2mE8+UO4v/8v/zrUvgPjSdfY6CUlrP8O7NlKYyEloOW62vM5rTaW\nob43pfhH+f0kp63GQ0oGHsn/l1YCx7XieAYbC2kh9VXAY3ksDwFntuJYcl8a8p3ZiDH5Bp1mZmZW\nSV6TY2ZmZpXkJMfMzMwqyUmOmZmZVZKTHDMzM6skJzlmZmZWSU5yzMzMrJKc5JiZmVklOckxMzOz\nSnKSY2ZmZpXkJMesgiStkfSJJhz3LZIelPSCpG+M9fH7MxqvhaRzJa1sZJtm1nhOcsxajKSvSNom\naaukzZL+S9I5kobz//Ug4F+Gccwj8jEnDr/H27mIdI+gPyLdN6ghJB2T+7dnqfxJSY+Wyv4oxx6V\ni4b1WgzRFyjcHXk05ORs2w4eWyX962ge36wKdml2B8ysX98jJQkTgHeQ7sa7GbhwKD8cEb8e5vEE\nRP5zJF4LfCkinqy3AUm7RsSLpeIfAi+S7kZ8Q47bn/T6TJA0PSJ6cuzbSHefvhvqei0GFRHPk24q\nOJoOAnbOf/8zYDHwetKNCgH+Z5SPb9b2PJNj1po2R8TTEfFYRPwLcCvwnlqlpOMl/VTSpvwb/98W\nf7h8iib/9v+Xkr4haaOkRyS9K9f9EelO9ADri7MEkt6fTz89L6lX0i2SXlbubG32BNgD+Epu48Rc\nd4Ske3Nfn5D0T8VZKUl3SLpU0kJJTwM3l9uPiI3AT0hJTs2RwF2kZKZYfgTwo4h4YbivRaG/2yS9\nTdKPc8zdkl5fiDlX0v2F51+RdJOkv8tj7JW0SNLOhZhpkpbk1/Lnkj4w0Km0iPh1RDwVEU8Bz+Ti\np2tlEbGh8NovlvRsPu6NkvYpHPcwSbfmuvX5728q1I/P4z1F0vfyeFdJ6pA0Q9Jdkn4j6T8k7dtf\nX81alZMcs/awCRgHIKkD+DpwPfBG4Fzg/FpSMYC/B/4NeBOwFLhO0mTgMeD4HPM6YC/gTEnT8jG+\nDOxPSh6+Qf+zPT3ANNIswydyG1+XtDewBLgXeDPwceAvgbNLP38iaabqLTmmP3cARxWeHwX8APiP\nUvmROXYgO3otiv4vMB/oALYAV5Xqo/T8KOA1+fgnkmbiTi7UX0N6jeYA7wdOBV41SD8HJGkcKQFe\nCxwGHE6a8VoiqfY+vRy4EphNen0fB5ZKGl9q7u+BfwYOIL2f1wGX5fKDgZcBF4+kv2ZjLiL88MOP\nFnoAXwG+UXj+dtKpiQvy82uBm0s/83lgVeH5GuAThefbgAWF57vlsqPz8yOArcDEQsyBuWzfYfR9\nPXBi4fk/AA+VYk4F+grP7wB+MoS25+b+TM3P15ISkNnAmlz2mjyutzbgtTiyEPOOXDYuPz8XWFl6\nzx4FVCj7OnB9/vv++RgHFupfm8s+MYSx/977k8v/stiPXPYyUsL41h20tSvpVNvb8vPxuR+fLh1v\nG/DBQtlJwDPN/v/hhx/DeXgmx6w1vUvSBkmbSDMhXcDnct1M8nqTgruB1xV+e+/PqtpfIq0peQ7Y\nc8fhPADcBvxU0g2SPtrPbMdg9geW99PXlxdPqQD
3DaGte8jrciTNJK3HWUk6jTUln3Y7kvQF/qNB\n2hrKa7Gq8PfaGqOBXq+fRURxdufJQvzrgRcj4renuCLiF6SkcCQOAN6Y/61skLQBeIq0lue1AJL2\nkvSvSgvY+0invsYB00ttFce7jjRT9dNS2SRJXstpbcP/WM1a0+2k0zYvAk9ExLYGtFlezBsMcMo6\nH/NoSYcBRwNnAP9X0qER8d8N6E/RxsECIuJ/JK0gnRZ6JfDDnFRskXQPacHxkcDdEbFlkOaG8lq8\nWKqnn5jhttloLyclf6fw+6cRn8p/dpFmb04jnZrcDNxPPv1Z0N94h/samLUU/2M1a00bI2JNRDze\nT4LTTdptU/RW4JHSTMJwvJD/3LlcERHLI+JzpNNXLwLvG0a73aS1IkVvBTZExON19LO2LudI0nqc\nmrty2REMvh6nGVYDu0g6sFYgaT/gFSNsdyUwA1gbEY+WHr/JMYcBF0XELRHRTfrc/4MRHtesLTjJ\nMWs/XwTmSjpb0usknUT6Lf0LI2jzv0m/qb9L0hRJu0s6RNJn8i6bfUmLk6cADw2j3cuBffPuqRmS\n3gMsyGOoxx2kxdFHA3cWyu8E3gvsQ2OSnP5O+9W9vT4iVpNO/V0p6eCc7FxBOrU21MS0v+N/lTQL\n9k2lCzH+cd4VtkjSlBzzc+AkSa+X9BbgatJC9nqOZ9ZWnOSYtZm8ruMDwAdJ6ygWAGdHxDXFsPKP\n9ddUoc0nSItpLyAt6L0U6CPtBFpCmok4D/jbiLhloO6V+voE8E7S7pz/JCU9V5IWJA/Utx1ZTjrd\nAtuv47mXdEpmA/Djgfq0g+PVEzNcf0F6be8EbiS9Dr9haAlHv8ePtI38cNJ6mW+SEtB/JiUotVOA\nJ5J2u/0naafc54FnB2t7B2VmbUX1z26bmVm98sLrHmBuRLTiKTaztuckx8xsDCjdZuLlpNm3vUlX\nr54GzIiIrc3sm1lVeXeVmdnY2BX4R+DVpNNqdwOdTnDMRo9ncszMzKySvPDYzMzMKslJjpmZmVWS\nkxwzMzOrJCc5ZmZmVklOcszMzKySnOSYmZlZJTnJMTMzs0pykmNmZmaV9P8D26paXy8mx+gAAAAA\nSUVORK5CYII=\n",
2328 | "text/plain": [
2329 | ""
2330 | ]
2331 | },
2332 | "metadata": {},
2333 | "output_type": "display_data"
2334 | }
2335 | ],
2336 | "source": [
2337 | "ax = df['Wscore'].plot.hist(bins=20)\n",
2338 | "ax.set_xlabel('Points for Winning Team')"
2339 | ]
2340 | },
2341 | {
2342 | "cell_type": "markdown",
2343 | "metadata": {},
2344 | "source": [
2345 | "# Creating Kaggle Submission CSVs"
2346 | ]
2347 | },
2348 | {
2349 | "cell_type": "markdown",
2350 | "metadata": {},
2351 | "source": [
2352 | "This isn't directly Pandas-related, but I assume that most people who use Pandas probably do a lot of Kaggle competitions as well. As you probably know, Kaggle competitions require you to create a CSV of your predictions. Here's some starter code that can help you create that CSV file."
2353 | ]
2354 | },
2355 | {
2356 | "cell_type": "code",
2357 | "execution_count": 40,
2358 | "metadata": {
2359 | "collapsed": false
2360 | },
2361 | "outputs": [
2362 | {
2363 | "name": "stdout",
2364 | "output_type": "stream",
2365 | "text": [
2366 | "[[ 0 10]\n",
2367 | " [ 1 15]\n",
2368 | " [ 2 20]]\n"
2369 | ]
2370 | }
2371 | ],
2372 | "source": [
2373 | "import numpy as np\n",
2374 | "import csv\n",
2375 | "\n",
2376 | "results = [[0,10],[1,15],[2,20]]\n",
2377 | "results = np.array(results)\n",
2378 | "print(results)"
2379 | ]
2380 | },
2381 | {
2382 | "cell_type": "code",
2383 | "execution_count": 41,
2384 | "metadata": {
2385 | "collapsed": false
2386 | },
2387 | "outputs": [],
2388 | "source": [
2389 | "firstRow = [['id', 'pred']]\n",
2390 | "with open(\"result.csv\", \"wb\") as f:  # \"wb\" is the Python 2 idiom; on Python 3 use open(\"result.csv\", \"w\", newline=\"\")\n",
2391 | "    writer = csv.writer(f)\n",
2392 | "    writer.writerows(firstRow)\n",
2393 | "    writer.writerows(results)"
2394 | ]
2395 | },
2396 | {
2397 | "cell_type": "markdown",
2398 | "metadata": {},
2399 | "source": [
2400 | "The approach I described above relies on Python lists and NumPy rather than Pandas. If you want a purely Pandas-based approach, take a look at this video: https://www.youtube.com/watch?v=ylRlGCtAtiE&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=22"
2401 | ]
2402 | },
2403 | {
2404 | "cell_type": "markdown",
2405 | "metadata": {},
2406 | "source": [
2407 | "# Other Useful Functions"
2408 | ]
2409 | },
2410 | {
2411 | "cell_type": "markdown",
2412 | "metadata": {},
2413 | "source": [
2414 | "* **drop()** - Removes the rows or columns that you pass in (you also have to specify the axis).\n",
2415 | "* **agg()** - The aggregate function lets you compute summary statistics about each group.\n",
2416 | "* **apply()** - Lets you apply a function along the rows or columns of a DataFrame, or to each element of a Series.\n",
2417 | "* **get_dummies()** - Helpful for turning categorical data into one-hot vectors.\n",
2418 | "* **drop_duplicates()** - Lets you remove duplicate rows."
2419 | ]
2420 | },
2421 | {
2422 | "cell_type": "markdown",
2423 | "metadata": {
2424 | "collapsed": true
2425 | },
2426 | "source": [
2427 | "# Lots of Other Great Resources"
2428 | ]
2429 | },
2430 | {
2431 | "cell_type": "markdown",
2432 | "metadata": {},
2433 | "source": [
2434 | "Pandas has been around for a while, and there are a lot of other good resources if you're still interested in getting the most out of this library. \n",
2435 | "* http://pandas.pydata.org/pandas-docs/stable/10min.html\n",
2436 | "* https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python\n",
2437 | "* http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/\n",
2438 | "* https://www.dataquest.io/blog/pandas-python-tutorial/\n",
2439 | "* https://drive.google.com/file/d/0ByIrJAE4KMTtTUtiVExiUGVkRkE/view\n",
2440 | "* https://www.youtube.com/playlist?list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y"
2441 | ]
2442 | }
2443 | ],
2444 | "metadata": {
2445 | "anaconda-cloud": {},
2446 | "kernelspec": {
2447 | "display_name": "Python [conda root]",
2448 | "language": "python",
2449 | "name": "conda-root-py"
2450 | },
2451 | "language_info": {
2452 | "codemirror_mode": {
2453 | "name": "ipython",
2454 | "version": 2
2455 | },
2456 | "file_extension": ".py",
2457 | "mimetype": "text/x-python",
2458 | "name": "python",
2459 | "nbconvert_exporter": "python",
2460 | "pygments_lexer": "ipython2",
2461 | "version": "2.7.12"
2462 | }
2463 | },
2464 | "nbformat": 4,
2465 | "nbformat_minor": 1
2466 | }
2467 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Pandas-Tutorial
2 |
3 | I've been working with Pandas quite a bit lately, and figured I'd make a short summary of the most important and helpful functions in the library.
4 |
5 | Hopefully it's helpful for you!
6 |
7 | # Lots of Other Great Tutorials
8 | * http://pandas.pydata.org/pandas-docs/stable/10min.html
9 | * https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python
10 | * http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/
11 | * https://www.dataquest.io/blog/pandas-python-tutorial/
12 | * https://drive.google.com/file/d/0ByIrJAE4KMTtTUtiVExiUGVkRkE/view
13 | * https://www.youtube.com/playlist?list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y
14 |
--------------------------------------------------------------------------------
/result.csv:
--------------------------------------------------------------------------------
1 | id,pred
2 | 0,10
3 | 1,15
4 | 2,20
5 |
--------------------------------------------------------------------------------