├── README.md
├── Ecommerce Purchases Exercise .ipynb
├── SF Salaries Exercise.ipynb
├── Choropleth Maps Exercise .ipynb
└── NLP Project .ipynb

/README.md:
--------------------------------------------------------------------------------
1 | # Python-for-Data-Analysis-and-Machine-Learning
2 |
3 | This repo contains the exercises made for Jose Portilla's course on Udemy.
4 |
5 | The course teaches how to use NumPy, Pandas, Seaborn, Matplotlib, Plotly, Scikit-Learn, Machine Learning models, TensorFlow, Deep Neural Networks and more.
6 |
7 | The exercises offer the possibility to reproduce a real analysis process using real data (like datasets on Kaggle) or fake data (created for the scope of the course).
8 |
9 | ### Details of each project: ###
10 |
11 | 1. *911 Calls Data Capstone Project .ipynb*: analysis of the [Kaggle 911 Dataset](https://www.kaggle.com/mchirico/montcoalert) with visualizations in Seaborn and Matplotlib.
12 |
13 | 2. *Choropleth Maps Exercise .ipynb*: choropleth maps with Plotly.
14 |
15 | 3. *Decision Trees and Random Forest Project .ipynb*: analysis of a dataset from [LendingClub.com](https://www.lendingclub.com/) with visualizations in Seaborn and Matplotlib and Machine Learning models with Scikit-Learn.
16 |
17 | 4. *Ecommerce Purchases Exercise .ipynb*: analysis of a dataset of fake Amazon data with Pandas.
18 |
19 | 5. *K Nearest Neighbors Project .ipynb*: using NumPy, Matplotlib, Pandas, and Seaborn to analyze a dataset, then applying the KNN model to the data using Scikit-Learn.
20 |
21 | 6. *Linear Regression - Project Exercise .ipynb*: using NumPy, Matplotlib, Pandas, and Seaborn to analyze a dataset about Ecommerce customers, then applying the Linear Regression model to the data using Scikit-Learn.
22 |
23 | 7. *Logistic Regression Project .ipynb*: using NumPy, Matplotlib, Pandas, and Seaborn to work with a fake advertising data set indicating whether or not a particular internet user clicked on an advertisement.
Creating a Logistic Regression model with Scikit-Learn to predict whether or not a user will click on an ad, based on that user's features.
24 |
25 | 8. *NLP Project .ipynb*: classifying Yelp reviews into 1-star or 5-star categories based on the text content of the reviews.
26 | The analysis uses the [Yelp Review Data Set from Kaggle](https://www.kaggle.com/c/yelp-recsys-2013).
27 |
28 | 9. *SF Salaries Exercise .ipynb*: simple Pandas analysis using the [SF Salaries dataset from Kaggle](https://github.com/marcogdepinto/Python-for-Data-Analysis-and-Machine-Learning/blob/master/SF%20Salaries%20Exercise.ipynb).
29 |
30 | 10. *Support Vector Machines Project .ipynb*: SVM application with Scikit-Learn on the Iris dataset.
31 |
32 | 11. *Tensorflow Project Exercise .ipynb*: TensorFlow application on the [UCI dataset](https://archive.ics.uci.edu/ml/datasets/banknote+authentication) - created a Deep Neural Network using the contrib.learn module.
33 |
--------------------------------------------------------------------------------
/Ecommerce Purchases Exercise .ipynb:
--------------------------------------------------------------------------------
1 | {
2 |  "cells": [
3 |   {
4 |    "cell_type": "markdown",
5 |    "metadata": {},
6 |    "source": [
7 |     "___\n",
8 |     "\n",
9 |     " \n",
10 |     "___\n",
11 |     "# Ecommerce Purchases Exercise\n",
12 |     "\n",
13 |     "In this Exercise you will be given some Fake Data about some purchases done through Amazon! Just go ahead and follow the directions and try your best to answer the questions and complete the tasks. Feel free to reference the solutions. Most of the tasks can be solved in different ways. 
For the most part, the questions get progressively harder.\n", 14 | "\n", 15 | "Please excuse anything that doesn't make \"Real-World\" sense in the dataframe, all the data is fake and made-up.\n", 16 | "\n", 17 | "Also note that all of these questions can be answered with one line of code.\n", 18 | "____\n", 19 | "** Import pandas and read in the Ecommerce Purchases csv file and set it to a DataFrame called ecom. **" 20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": 3, 25 | "metadata": {}, 26 | "outputs": [], 27 | "source": [ 28 | "import pandas as pd" 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": 4, 34 | "metadata": {}, 35 | "outputs": [], 36 | "source": [ 37 | "ecom = pd.read_csv('Ecommerce Purchases')" 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "**Check the head of the DataFrame.**" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": 4, 50 | "metadata": {}, 51 | "outputs": [ 52 | { 53 | "data": { 54 | "text/html": [ 55 | "
\n", 56 | "\n", 69 | "\n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | "
[HTML table for ecom.head() lost in extraction; the same five rows and 14 columns appear in the text/plain output of this cell]
\n", 177 | "
" 178 | ], 179 | "text/plain": [ 180 | " Address Lot AM or PM \\\n", 181 | "0 16629 Pace Camp Apt. 448\\nAlexisborough, NE 77... 46 in PM \n", 182 | "1 9374 Jasmine Spurs Suite 508\\nSouth John, TN 8... 28 rn PM \n", 183 | "2 Unit 0065 Box 5052\\nDPO AP 27450 94 vE PM \n", 184 | "3 7780 Julia Fords\\nNew Stacy, WA 45798 36 vm PM \n", 185 | "4 23012 Munoz Drive Suite 337\\nNew Cynthia, TX 5... 20 IE AM \n", 186 | "\n", 187 | " Browser Info \\\n", 188 | "0 Opera/9.56.(X11; Linux x86_64; sl-SI) Presto/2... \n", 189 | "1 Opera/8.93.(Windows 98; Win 9x 4.90; en-US) Pr... \n", 190 | "2 Mozilla/5.0 (compatible; MSIE 9.0; Windows NT ... \n", 191 | "3 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0 ... \n", 192 | "4 Opera/9.58.(X11; Linux x86_64; it-IT) Presto/2... \n", 193 | "\n", 194 | " Company Credit Card CC Exp Date \\\n", 195 | "0 Martinez-Herman 6011929061123406 02/20 \n", 196 | "1 Fletcher, Richards and Whitaker 3337758169645356 11/18 \n", 197 | "2 Simpson, Williams and Pham 675957666125 08/19 \n", 198 | "3 Williams, Marshall and Buchanan 6011578504430710 02/24 \n", 199 | "4 Brown, Watson and Andrews 6011456623207998 10/25 \n", 200 | "\n", 201 | " CC Security Code CC Provider \\\n", 202 | "0 900 JCB 16 digit \n", 203 | "1 561 Mastercard \n", 204 | "2 699 JCB 16 digit \n", 205 | "3 384 Discover \n", 206 | "4 678 Diners Club / Carte Blanche \n", 207 | "\n", 208 | " Email Job \\\n", 209 | "0 pdunlap@yahoo.com Scientist, product/process development \n", 210 | "1 anthony41@reed.com Drilling engineer \n", 211 | "2 amymiller@morales-harrison.com Customer service manager \n", 212 | "3 brent16@olson-robinson.info Drilling engineer \n", 213 | "4 christopherwright@gmail.com Fine artist \n", 214 | "\n", 215 | " IP Address Language Purchase Price \n", 216 | "0 149.146.147.205 el 98.14 \n", 217 | "1 15.160.41.51 fr 70.73 \n", 218 | "2 132.207.160.22 de 0.95 \n", 219 | "3 30.250.74.19 es 78.04 \n", 220 | "4 24.140.33.94 es 77.82 " 221 | ] 222 | }, 223 | "execution_count": 4, 224 | 
"metadata": {}, 225 | "output_type": "execute_result" 226 | } 227 | ], 228 | "source": [ 229 | "ecom.head()" 230 | ] 231 | }, 232 | { 233 | "cell_type": "markdown", 234 | "metadata": {}, 235 | "source": [ 236 | "** How many rows and columns are there? **" 237 | ] 238 | }, 239 | { 240 | "cell_type": "code", 241 | "execution_count": 5, 242 | "metadata": {}, 243 | "outputs": [ 244 | { 245 | "name": "stdout", 246 | "output_type": "stream", 247 | "text": [ 248 | "\n", 249 | "RangeIndex: 10000 entries, 0 to 9999\n", 250 | "Data columns (total 14 columns):\n", 251 | "Address 10000 non-null object\n", 252 | "Lot 10000 non-null object\n", 253 | "AM or PM 10000 non-null object\n", 254 | "Browser Info 10000 non-null object\n", 255 | "Company 10000 non-null object\n", 256 | "Credit Card 10000 non-null int64\n", 257 | "CC Exp Date 10000 non-null object\n", 258 | "CC Security Code 10000 non-null int64\n", 259 | "CC Provider 10000 non-null object\n", 260 | "Email 10000 non-null object\n", 261 | "Job 10000 non-null object\n", 262 | "IP Address 10000 non-null object\n", 263 | "Language 10000 non-null object\n", 264 | "Purchase Price 10000 non-null float64\n", 265 | "dtypes: float64(1), int64(2), object(11)\n", 266 | "memory usage: 1.1+ MB\n" 267 | ] 268 | } 269 | ], 270 | "source": [ 271 | "ecom.info()" 272 | ] 273 | }, 274 | { 275 | "cell_type": "markdown", 276 | "metadata": {}, 277 | "source": [ 278 | "** What is the average Purchase Price? **" 279 | ] 280 | }, 281 | { 282 | "cell_type": "code", 283 | "execution_count": 6, 284 | "metadata": {}, 285 | "outputs": [ 286 | { 287 | "data": { 288 | "text/plain": [ 289 | "50.34730200000025" 290 | ] 291 | }, 292 | "execution_count": 6, 293 | "metadata": {}, 294 | "output_type": "execute_result" 295 | } 296 | ], 297 | "source": [ 298 | "ecom['Purchase Price'].mean()" 299 | ] 300 | }, 301 | { 302 | "cell_type": "markdown", 303 | "metadata": {}, 304 | "source": [ 305 | "** What were the highest and lowest purchase prices? 
**" 306 | ] 307 | }, 308 | { 309 | "cell_type": "code", 310 | "execution_count": 9, 311 | "metadata": {}, 312 | "outputs": [ 313 | { 314 | "data": { 315 | "text/plain": [ 316 | "99.99" 317 | ] 318 | }, 319 | "execution_count": 9, 320 | "metadata": {}, 321 | "output_type": "execute_result" 322 | } 323 | ], 324 | "source": [ 325 | "ecom['Purchase Price'].max()" 326 | ] 327 | }, 328 | { 329 | "cell_type": "code", 330 | "execution_count": 10, 331 | "metadata": {}, 332 | "outputs": [ 333 | { 334 | "data": { 335 | "text/plain": [ 336 | "0.0" 337 | ] 338 | }, 339 | "execution_count": 10, 340 | "metadata": {}, 341 | "output_type": "execute_result" 342 | } 343 | ], 344 | "source": [ 345 | "ecom['Purchase Price'].min()" 346 | ] 347 | }, 348 | { 349 | "cell_type": "markdown", 350 | "metadata": {}, 351 | "source": [ 352 | "** How many people have English 'en' as their Language of choice on the website? **" 353 | ] 354 | }, 355 | { 356 | "cell_type": "code", 357 | "execution_count": 11, 358 | "metadata": {}, 359 | "outputs": [ 360 | { 361 | "data": { 362 | "text/plain": [ 363 | "Address 1098\n", 364 | "Lot 1098\n", 365 | "AM or PM 1098\n", 366 | "Browser Info 1098\n", 367 | "Company 1098\n", 368 | "Credit Card 1098\n", 369 | "CC Exp Date 1098\n", 370 | "CC Security Code 1098\n", 371 | "CC Provider 1098\n", 372 | "Email 1098\n", 373 | "Job 1098\n", 374 | "IP Address 1098\n", 375 | "Language 1098\n", 376 | "Purchase Price 1098\n", 377 | "dtype: int64" 378 | ] 379 | }, 380 | "execution_count": 11, 381 | "metadata": {}, 382 | "output_type": "execute_result" 383 | } 384 | ], 385 | "source": [ 386 | "ecom[ecom['Language'] == 'en'].count()" 387 | ] 388 | }, 389 | { 390 | "cell_type": "markdown", 391 | "metadata": {}, 392 | "source": [ 393 | "** How many people have the job title of \"Lawyer\" ? 
**\n" 394 | ] 395 | }, 396 | { 397 | "cell_type": "code", 398 | "execution_count": 16, 399 | "metadata": {}, 400 | "outputs": [ 401 | { 402 | "name": "stdout", 403 | "output_type": "stream", 404 | "text": [ 405 | "\n", 406 | "Int64Index: 30 entries, 470 to 9979\n", 407 | "Data columns (total 14 columns):\n", 408 | "Address 30 non-null object\n", 409 | "Lot 30 non-null object\n", 410 | "AM or PM 30 non-null object\n", 411 | "Browser Info 30 non-null object\n", 412 | "Company 30 non-null object\n", 413 | "Credit Card 30 non-null int64\n", 414 | "CC Exp Date 30 non-null object\n", 415 | "CC Security Code 30 non-null int64\n", 416 | "CC Provider 30 non-null object\n", 417 | "Email 30 non-null object\n", 418 | "Job 30 non-null object\n", 419 | "IP Address 30 non-null object\n", 420 | "Language 30 non-null object\n", 421 | "Purchase Price 30 non-null float64\n", 422 | "dtypes: float64(1), int64(2), object(11)\n", 423 | "memory usage: 3.5+ KB\n" 424 | ] 425 | } 426 | ], 427 | "source": [ 428 | "ecom[ecom['Job'] == 'Lawyer'].info()" 429 | ] 430 | }, 431 | { 432 | "cell_type": "markdown", 433 | "metadata": {}, 434 | "source": [ 435 | "** How many people made the purchase during the AM and how many people made the purchase during PM ? **\n", 436 | "\n", 437 | "**(Hint: Check out [value_counts()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html) ) **" 438 | ] 439 | }, 440 | { 441 | "cell_type": "code", 442 | "execution_count": 17, 443 | "metadata": {}, 444 | "outputs": [ 445 | { 446 | "data": { 447 | "text/plain": [ 448 | "PM 5068\n", 449 | "AM 4932\n", 450 | "Name: AM or PM, dtype: int64" 451 | ] 452 | }, 453 | "execution_count": 17, 454 | "metadata": {}, 455 | "output_type": "execute_result" 456 | } 457 | ], 458 | "source": [ 459 | "ecom['AM or PM'].value_counts()" 460 | ] 461 | }, 462 | { 463 | "cell_type": "markdown", 464 | "metadata": {}, 465 | "source": [ 466 | "** What are the 5 most common Job Titles? 
**" 467 | ] 468 | }, 469 | { 470 | "cell_type": "code", 471 | "execution_count": 19, 472 | "metadata": {}, 473 | "outputs": [ 474 | { 475 | "data": { 476 | "text/plain": [ 477 | "Interior and spatial designer 31\n", 478 | "Lawyer 30\n", 479 | "Social researcher 28\n", 480 | "Designer, jewellery 27\n", 481 | "Purchasing manager 27\n", 482 | "Name: Job, dtype: int64" 483 | ] 484 | }, 485 | "execution_count": 19, 486 | "metadata": {}, 487 | "output_type": "execute_result" 488 | } 489 | ], 490 | "source": [ 491 | "ecom['Job'].value_counts().head(5)" 492 | ] 493 | }, 494 | { 495 | "cell_type": "markdown", 496 | "metadata": {}, 497 | "source": [ 498 | "** Someone made a purchase that came from Lot: \"90 WT\" , what was the Purchase Price for this transaction? **" 499 | ] 500 | }, 501 | { 502 | "cell_type": "code", 503 | "execution_count": 22, 504 | "metadata": {}, 505 | "outputs": [ 506 | { 507 | "data": { 508 | "text/plain": [ 509 | "513 75.1\n", 510 | "Name: Purchase Price, dtype: float64" 511 | ] 512 | }, 513 | "execution_count": 22, 514 | "metadata": {}, 515 | "output_type": "execute_result" 516 | } 517 | ], 518 | "source": [ 519 | "ecom[ecom['Lot'] == '90 WT']['Purchase Price']" 520 | ] 521 | }, 522 | { 523 | "cell_type": "markdown", 524 | "metadata": {}, 525 | "source": [ 526 | "** What is the email of the person with the following Credit Card Number: 4926535242672853 **" 527 | ] 528 | }, 529 | { 530 | "cell_type": "code", 531 | "execution_count": 30, 532 | "metadata": {}, 533 | "outputs": [ 534 | { 535 | "data": { 536 | "text/plain": [ 537 | "1234 bondellen@williams-garza.com\n", 538 | "Name: Email, dtype: object" 539 | ] 540 | }, 541 | "execution_count": 30, 542 | "metadata": {}, 543 | "output_type": "execute_result" 544 | } 545 | ], 546 | "source": [ 547 | "ecom[ecom['Credit Card'] == 4926535242672853]['Email']" 548 | ] 549 | }, 550 | { 551 | "cell_type": "markdown", 552 | "metadata": {}, 553 | "source": [ 554 | "** How many people have American Express as their 
Credit Card Provider *and* made a purchase above $95 ?**" 555 | ] 556 | }, 557 | { 558 | "cell_type": "code", 559 | "execution_count": 38, 560 | "metadata": {}, 561 | "outputs": [ 562 | { 563 | "data": { 564 | "text/plain": [ 565 | "39" 566 | ] 567 | }, 568 | "execution_count": 38, 569 | "metadata": {}, 570 | "output_type": "execute_result" 571 | } 572 | ], 573 | "source": [ 574 | "sum(ecom[ecom['CC Provider'] == 'American Express']['Purchase Price'] > 95)" 575 | ] 576 | }, 577 | { 578 | "cell_type": "markdown", 579 | "metadata": {}, 580 | "source": [ 581 | "** Hard: How many people have a credit card that expires in 2025? **" 582 | ] 583 | }, 584 | { 585 | "cell_type": "code", 586 | "execution_count": 50, 587 | "metadata": {}, 588 | "outputs": [ 589 | { 590 | "data": { 591 | "text/plain": [ 592 | "1033" 593 | ] 594 | }, 595 | "execution_count": 50, 596 | "metadata": {}, 597 | "output_type": "execute_result" 598 | } 599 | ], 600 | "source": [ 601 | "sum(ecom['CC Exp Date'].apply(lambda x: x[3:]) == '25')" 602 | ] 603 | }, 604 | { 605 | "cell_type": "markdown", 606 | "metadata": {}, 607 | "source": [ 608 | "** Hard: What are the top 5 most popular email providers/hosts (e.g. gmail.com, yahoo.com, etc...) **" 609 | ] 610 | }, 611 | { 612 | "cell_type": "code", 613 | "execution_count": 56, 614 | "metadata": {}, 615 | "outputs": [ 616 | { 617 | "data": { 618 | "text/plain": [ 619 | "hotmail.com 1638\n", 620 | "yahoo.com 1616\n", 621 | "gmail.com 1605\n", 622 | "smith.com 42\n", 623 | "williams.com 37\n", 624 | "Name: Email, dtype: int64" 625 | ] 626 | }, 627 | "execution_count": 56, 628 | "metadata": {}, 629 | "output_type": "execute_result" 630 | } 631 | ], 632 | "source": [ 633 | "ecom['Email'].apply(lambda x: x.split('@')[1]).value_counts().head(5)" 634 | ] 635 | }, 636 | { 637 | "cell_type": "markdown", 638 | "metadata": {}, 639 | "source": [ 640 | "# Great Job!" 
641 | ] 642 | } 643 | ], 644 | "metadata": { 645 | "kernelspec": { 646 | "display_name": "Python 3", 647 | "language": "python", 648 | "name": "python3" 649 | }, 650 | "language_info": { 651 | "codemirror_mode": { 652 | "name": "ipython", 653 | "version": 3 654 | }, 655 | "file_extension": ".py", 656 | "mimetype": "text/x-python", 657 | "name": "python", 658 | "nbconvert_exporter": "python", 659 | "pygments_lexer": "ipython3", 660 | "version": "3.6.4" 661 | } 662 | }, 663 | "nbformat": 4, 664 | "nbformat_minor": 1 665 | } 666 | -------------------------------------------------------------------------------- /SF Salaries Exercise.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "___\n", 8 | "\n", 9 | " \n", 10 | "___" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": {}, 16 | "source": [ 17 | "# SF Salaries Exercise \n", 18 | "\n", 19 | "Welcome to a quick exercise for you to practice your pandas skills! We will be using the [SF Salaries Dataset](https://www.kaggle.com/kaggle/sf-salaries) from Kaggle! Just follow along and complete the tasks outlined in bold below. The tasks will get harder and harder as you go along." 
20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "metadata": {}, 25 | "source": [ 26 | "** Import pandas as pd.**" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": 2, 32 | "metadata": {}, 33 | "outputs": [], 34 | "source": [ 35 | "import pandas as pd" 36 | ] 37 | }, 38 | { 39 | "cell_type": "markdown", 40 | "metadata": {}, 41 | "source": [ 42 | "** Read Salaries.csv as a dataframe called sal.**" 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": 3, 48 | "metadata": {}, 49 | "outputs": [], 50 | "source": [ 51 | "sal = pd.read_csv('Salaries.csv')" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "** Check the head of the DataFrame. **" 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": 5, 64 | "metadata": {}, 65 | "outputs": [ 66 | { 67 | "data": { 68 | "text/html": [ 69 | "
\n", 70 | "\n", 83 | "\n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | "
[HTML table for sal.head() lost in extraction; the same five rows and 13 columns appear in the text/plain output of this cell]
\n", 185 | "
" 186 | ], 187 | "text/plain": [ 188 | " Id EmployeeName JobTitle \\\n", 189 | "0 1 NATHANIEL FORD GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY \n", 190 | "1 2 GARY JIMENEZ CAPTAIN III (POLICE DEPARTMENT) \n", 191 | "2 3 ALBERT PARDINI CAPTAIN III (POLICE DEPARTMENT) \n", 192 | "3 4 CHRISTOPHER CHONG WIRE ROPE CABLE MAINTENANCE MECHANIC \n", 193 | "4 5 PATRICK GARDNER DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT) \n", 194 | "\n", 195 | " BasePay OvertimePay OtherPay Benefits TotalPay TotalPayBenefits \\\n", 196 | "0 167411.18 0.00 400184.25 NaN 567595.43 567595.43 \n", 197 | "1 155966.02 245131.88 137811.38 NaN 538909.28 538909.28 \n", 198 | "2 212739.13 106088.18 16452.60 NaN 335279.91 335279.91 \n", 199 | "3 77916.00 56120.71 198306.90 NaN 332343.61 332343.61 \n", 200 | "4 134401.60 9737.00 182234.59 NaN 326373.19 326373.19 \n", 201 | "\n", 202 | " Year Notes Agency Status \n", 203 | "0 2011 NaN San Francisco NaN \n", 204 | "1 2011 NaN San Francisco NaN \n", 205 | "2 2011 NaN San Francisco NaN \n", 206 | "3 2011 NaN San Francisco NaN \n", 207 | "4 2011 NaN San Francisco NaN " 208 | ] 209 | }, 210 | "execution_count": 5, 211 | "metadata": {}, 212 | "output_type": "execute_result" 213 | } 214 | ], 215 | "source": [ 216 | "sal.head()" 217 | ] 218 | }, 219 | { 220 | "cell_type": "markdown", 221 | "metadata": {}, 222 | "source": [ 223 | "** Use the .info() method to find out how many entries there are.**" 224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": 8, 229 | "metadata": {}, 230 | "outputs": [ 231 | { 232 | "name": "stdout", 233 | "output_type": "stream", 234 | "text": [ 235 | "\n", 236 | "RangeIndex: 148654 entries, 0 to 148653\n", 237 | "Data columns (total 13 columns):\n", 238 | "Id 148654 non-null int64\n", 239 | "EmployeeName 148654 non-null object\n", 240 | "JobTitle 148654 non-null object\n", 241 | "BasePay 148045 non-null float64\n", 242 | "OvertimePay 148650 non-null float64\n", 243 | "OtherPay 148650 non-null float64\n", 244 | 
"Benefits 112491 non-null float64\n", 245 | "TotalPay 148654 non-null float64\n", 246 | "TotalPayBenefits 148654 non-null float64\n", 247 | "Year 148654 non-null int64\n", 248 | "Notes 0 non-null float64\n", 249 | "Agency 148654 non-null object\n", 250 | "Status 0 non-null float64\n", 251 | "dtypes: float64(8), int64(2), object(3)\n", 252 | "memory usage: 14.7+ MB\n" 253 | ] 254 | } 255 | ], 256 | "source": [ 257 | "sal.info()" 258 | ] 259 | }, 260 | { 261 | "cell_type": "markdown", 262 | "metadata": {}, 263 | "source": [ 264 | "**What is the average BasePay ?**" 265 | ] 266 | }, 267 | { 268 | "cell_type": "code", 269 | "execution_count": 9, 270 | "metadata": {}, 271 | "outputs": [ 272 | { 273 | "data": { 274 | "text/plain": [ 275 | "66325.44884050643" 276 | ] 277 | }, 278 | "execution_count": 9, 279 | "metadata": {}, 280 | "output_type": "execute_result" 281 | } 282 | ], 283 | "source": [ 284 | "sal['BasePay'].mean()" 285 | ] 286 | }, 287 | { 288 | "cell_type": "markdown", 289 | "metadata": {}, 290 | "source": [ 291 | "** What is the highest amount of OvertimePay in the dataset ? **" 292 | ] 293 | }, 294 | { 295 | "cell_type": "code", 296 | "execution_count": 10, 297 | "metadata": {}, 298 | "outputs": [ 299 | { 300 | "data": { 301 | "text/plain": [ 302 | "245131.88" 303 | ] 304 | }, 305 | "execution_count": 10, 306 | "metadata": {}, 307 | "output_type": "execute_result" 308 | } 309 | ], 310 | "source": [ 311 | "sal['OvertimePay'].max()" 312 | ] 313 | }, 314 | { 315 | "cell_type": "markdown", 316 | "metadata": {}, 317 | "source": [ 318 | "** What is the job title of JOSEPH DRISCOLL ? Note: Use all caps, otherwise you may get an answer that doesn't match up (there is also a lowercase Joseph Driscoll). 
**" 319 | ] 320 | }, 321 | { 322 | "cell_type": "code", 323 | "execution_count": 15, 324 | "metadata": {}, 325 | "outputs": [ 326 | { 327 | "data": { 328 | "text/plain": [ 329 | "24 CAPTAIN, FIRE SUPPRESSION\n", 330 | "Name: JobTitle, dtype: object" 331 | ] 332 | }, 333 | "execution_count": 15, 334 | "metadata": {}, 335 | "output_type": "execute_result" 336 | } 337 | ], 338 | "source": [ 339 | "sal[sal['EmployeeName']=='JOSEPH DRISCOLL']['JobTitle']" 340 | ] 341 | }, 342 | { 343 | "cell_type": "markdown", 344 | "metadata": {}, 345 | "source": [ 346 | "** How much does JOSEPH DRISCOLL make (including benefits)? **" 347 | ] 348 | }, 349 | { 350 | "cell_type": "code", 351 | "execution_count": 16, 352 | "metadata": {}, 353 | "outputs": [ 354 | { 355 | "data": { 356 | "text/plain": [ 357 | "24 270324.91\n", 358 | "Name: TotalPayBenefits, dtype: float64" 359 | ] 360 | }, 361 | "execution_count": 16, 362 | "metadata": {}, 363 | "output_type": "execute_result" 364 | } 365 | ], 366 | "source": [ 367 | "sal[sal['EmployeeName']=='JOSEPH DRISCOLL']['TotalPayBenefits']" 368 | ] 369 | }, 370 | { 371 | "cell_type": "markdown", 372 | "metadata": {}, 373 | "source": [ 374 | "** What is the name of highest paid person (including benefits)?**" 375 | ] 376 | }, 377 | { 378 | "cell_type": "code", 379 | "execution_count": 18, 380 | "metadata": {}, 381 | "outputs": [ 382 | { 383 | "data": { 384 | "text/html": [ 385 | "
\n", 386 | "\n", 399 | "\n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | "
[HTML table lost in extraction; its single row (Id 1, NATHANIEL FORD) appears in the text/plain output of this cell]
\n", 437 | "
" 438 | ], 439 | "text/plain": [ 440 | " Id EmployeeName JobTitle \\\n", 441 | "0 1 NATHANIEL FORD GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY \n", 442 | "\n", 443 | " BasePay OvertimePay OtherPay Benefits TotalPay TotalPayBenefits \\\n", 444 | "0 167411.18 0.0 400184.25 NaN 567595.43 567595.43 \n", 445 | "\n", 446 | " Year Notes Agency Status \n", 447 | "0 2011 NaN San Francisco NaN " 448 | ] 449 | }, 450 | "execution_count": 18, 451 | "metadata": {}, 452 | "output_type": "execute_result" 453 | } 454 | ], 455 | "source": [ 456 | "sal[sal['TotalPayBenefits']== sal['TotalPayBenefits'].max()]" 457 | ] 458 | }, 459 | { 460 | "cell_type": "markdown", 461 | "metadata": {}, 462 | "source": [ 463 | "** What is the name of lowest paid person (including benefits)? Do you notice something strange about how much he or she is paid?**" 464 | ] 465 | }, 466 | { 467 | "cell_type": "code", 468 | "execution_count": 19, 469 | "metadata": {}, 470 | "outputs": [ 471 | { 472 | "data": { 473 | "text/html": [ 474 | "
\n", 475 | "\n", 488 | "\n", 489 | " \n", 490 | " \n", 491 | " \n", 492 | " \n", 493 | " \n", 494 | " \n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | " \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " \n", 506 | " \n", 507 | " \n", 508 | " \n", 509 | " \n", 510 | " \n", 511 | " \n", 512 | " \n", 513 | " \n", 514 | " \n", 515 | " \n", 516 | " \n", 517 | " \n", 518 | " \n", 519 | " \n", 520 | " \n", 521 | " \n", 522 | " \n", 523 | " \n", 524 | " \n", 525 | "
[HTML table lost in extraction; its single row (Id 148654, Joe Lopez) appears in the text/plain output of this cell]
\n", 526 | "
" 527 | ], 528 | "text/plain": [ 529 | " Id EmployeeName JobTitle BasePay OvertimePay \\\n", 530 | "148653 148654 Joe Lopez Counselor, Log Cabin Ranch 0.0 0.0 \n", 531 | "\n", 532 | " OtherPay Benefits TotalPay TotalPayBenefits Year Notes \\\n", 533 | "148653 -618.13 0.0 -618.13 -618.13 2014 NaN \n", 534 | "\n", 535 | " Agency Status \n", 536 | "148653 San Francisco NaN " 537 | ] 538 | }, 539 | "execution_count": 19, 540 | "metadata": {}, 541 | "output_type": "execute_result" 542 | } 543 | ], 544 | "source": [ 545 | "sal[sal['TotalPayBenefits']== sal['TotalPayBenefits'].min()]\n", 546 | "\n", 547 | "#Negative Value" 548 | ] 549 | }, 550 | { 551 | "cell_type": "markdown", 552 | "metadata": {}, 553 | "source": [ 554 | "** What was the average (mean) BasePay of all employees per year? (2011-2014) ? **" 555 | ] 556 | }, 557 | { 558 | "cell_type": "code", 559 | "execution_count": 20, 560 | "metadata": {}, 561 | "outputs": [ 562 | { 563 | "data": { 564 | "text/plain": [ 565 | "Year\n", 566 | "2011 63595.956517\n", 567 | "2012 65436.406857\n", 568 | "2013 69630.030216\n", 569 | "2014 66564.421924\n", 570 | "Name: BasePay, dtype: float64" 571 | ] 572 | }, 573 | "execution_count": 20, 574 | "metadata": {}, 575 | "output_type": "execute_result" 576 | } 577 | ], 578 | "source": [ 579 | "sal.groupby('Year').mean()['BasePay']" 580 | ] 581 | }, 582 | { 583 | "cell_type": "markdown", 584 | "metadata": {}, 585 | "source": [ 586 | "** How many unique job titles are there? **" 587 | ] 588 | }, 589 | { 590 | "cell_type": "code", 591 | "execution_count": 4, 592 | "metadata": {}, 593 | "outputs": [ 594 | { 595 | "data": { 596 | "text/plain": [ 597 | "2159" 598 | ] 599 | }, 600 | "execution_count": 4, 601 | "metadata": {}, 602 | "output_type": "execute_result" 603 | } 604 | ], 605 | "source": [ 606 | "sal['JobTitle'].nunique()" 607 | ] 608 | }, 609 | { 610 | "cell_type": "markdown", 611 | "metadata": {}, 612 | "source": [ 613 | "** What are the top 5 most common jobs? 
**" 614 | ] 615 | }, 616 | { 617 | "cell_type": "code", 618 | "execution_count": 8, 619 | "metadata": {}, 620 | "outputs": [ 621 | { 622 | "data": { 623 | "text/plain": [ 624 | "Transit Operator 7036\n", 625 | "Special Nurse 4389\n", 626 | "Registered Nurse 3736\n", 627 | "Public Svc Aide-Public Works 2518\n", 628 | "Police Officer 3 2421\n", 629 | "Name: JobTitle, dtype: int64" 630 | ] 631 | }, 632 | "execution_count": 8, 633 | "metadata": {}, 634 | "output_type": "execute_result" 635 | } 636 | ], 637 | "source": [ 638 | "sal['JobTitle'].value_counts().head(5)" 639 | ] 640 | }, 641 | { 642 | "cell_type": "markdown", 643 | "metadata": {}, 644 | "source": [ 645 | "** How many Job Titles were represented by only one person in 2013? (e.g. Job Titles with only one occurrence in 2013?) **" 646 | ] 647 | }, 648 | { 649 | "cell_type": "code", 650 | "execution_count": 9, 651 | "metadata": {}, 652 | "outputs": [ 653 | { 654 | "data": { 655 | "text/plain": [ 656 | "202" 657 | ] 658 | }, 659 | "execution_count": 9, 660 | "metadata": {}, 661 | "output_type": "execute_result" 662 | } 663 | ], 664 | "source": [ 665 | "sum(sal[sal['Year']==2013]['JobTitle'].value_counts() == 1) " 666 | ] 667 | }, 668 | { 669 | "cell_type": "markdown", 670 | "metadata": {}, 671 | "source": [ 672 | "** How many people have the word Chief in their job title? 
(This is pretty tricky) **" 673 | ] 674 | }, 675 | { 676 | "cell_type": "code", 677 | "execution_count": 16, 678 | "metadata": {}, 679 | "outputs": [], 680 | "source": [ 681 | "def chief_string(title):\n", 682 | " if 'chief' in title.lower():\n", 683 | " return True\n", 684 | " else:\n", 685 | " return False" 686 | ] 687 | }, 688 | { 689 | "cell_type": "code", 690 | "execution_count": 17, 691 | "metadata": {}, 692 | "outputs": [ 693 | { 694 | "data": { 695 | "text/plain": [ 696 | "627" 697 | ] 698 | }, 699 | "execution_count": 17, 700 | "metadata": {}, 701 | "output_type": "execute_result" 702 | } 703 | ], 704 | "source": [ 705 | "sum(sal['JobTitle'].apply(lambda x: chief_string(x)))" 706 | ] 707 | }, 708 | { 709 | "cell_type": "markdown", 710 | "metadata": {}, 711 | "source": [ 712 | "** Bonus: Is there a correlation between length of the Job Title string and Salary? **" 713 | ] 714 | }, 715 | { 716 | "cell_type": "code", 717 | "execution_count": 18, 718 | "metadata": {}, 719 | "outputs": [], 720 | "source": [ 721 | "sal['title_len'] = sal['JobTitle'].apply(len)" 722 | ] 723 | }, 724 | { 725 | "cell_type": "code", 726 | "execution_count": 19, 727 | "metadata": {}, 728 | "outputs": [ 729 | { 730 | "data": { 731 | "text/html": [ 732 | "
\n", 733 | "\n", 746 | "\n", 747 | " \n", 748 | " \n", 749 | " \n", 750 | " \n", 751 | " \n", 752 | " \n", 753 | " \n", 754 | " \n", 755 | " \n", 756 | " \n", 757 | " \n", 758 | " \n", 759 | " \n", 760 | " \n", 761 | " \n", 762 | " \n", 763 | " \n", 764 | " \n", 765 | " \n", 766 | "
| | title_len | TotalPayBenefits |
| title_len | 1.000000 | -0.036878 |
| TotalPayBenefits | -0.036878 | 1.000000 |
\n", 767 | "
" 768 | ], 769 | "text/plain": [ 770 | " title_len TotalPayBenefits\n", 771 | "title_len 1.000000 -0.036878\n", 772 | "TotalPayBenefits -0.036878 1.000000" 773 | ] 774 | }, 775 | "execution_count": 19, 776 | "metadata": {}, 777 | "output_type": "execute_result" 778 | } 779 | ], 780 | "source": [ 781 | "sal[['title_len','TotalPayBenefits']].corr() # No correlation." 782 | ] 783 | }, 784 | { 785 | "cell_type": "markdown", 786 | "metadata": {}, 787 | "source": [ 788 | "# Great Job!" 789 | ] 790 | } 791 | ], 792 | "metadata": { 793 | "kernelspec": { 794 | "display_name": "Python 3", 795 | "language": "python", 796 | "name": "python3" 797 | }, 798 | "language_info": { 799 | "codemirror_mode": { 800 | "name": "ipython", 801 | "version": 3 802 | }, 803 | "file_extension": ".py", 804 | "mimetype": "text/x-python", 805 | "name": "python", 806 | "nbconvert_exporter": "python", 807 | "pygments_lexer": "ipython3", 808 | "version": "3.6.4" 809 | } 810 | }, 811 | "nbformat": 4, 812 | "nbformat_minor": 1 813 | } 814 | -------------------------------------------------------------------------------- /Choropleth Maps Exercise .ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "___\n", 8 | "\n", 9 | " \n", 10 | "___" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": {}, 16 | "source": [ 17 | "# Choropleth Maps Exercise \n", 18 | "\n", 19 | "Welcome to the Choropleth Maps Exercise! In this exercise we will give you some simple datasets and ask you to create Choropleth Maps from them. 
Due to the Nature of Plotly we can't show you examples\n", 20 | "\n", 21 | "[Full Documentation Reference](https://plot.ly/python/reference/#choropleth)\n", 22 | "\n", 23 | "## Plotly Imports" 24 | ] 25 | }, 26 | { 27 | "cell_type": "code", 28 | "execution_count": 1, 29 | "metadata": {}, 30 | "outputs": [ 31 | { 32 | "data": { 33 | "text/html": [ 34 | "" 35 | ], 36 | "text/vnd.plotly.v1+html": [ 37 | "" 38 | ] 39 | }, 40 | "metadata": {}, 41 | "output_type": "display_data" 42 | } 43 | ], 44 | "source": [ 45 | "import plotly.graph_objs as go \n", 46 | "from plotly.offline import init_notebook_mode,iplot\n", 47 | "init_notebook_mode(connected=True) " 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": {}, 53 | "source": [ 54 | "** Import pandas and read the csv file: 2014_World_Power_Consumption**" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": 2, 60 | "metadata": {}, 61 | "outputs": [], 62 | "source": [ 63 | "import pandas as pd" 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": 3, 69 | "metadata": {}, 70 | "outputs": [], 71 | "source": [ 72 | "df = pd.read_csv('2014_World_Power_Consumption')" 73 | ] 74 | }, 75 | { 76 | "cell_type": "markdown", 77 | "metadata": {}, 78 | "source": [ 79 | "** Check the head of the DataFrame. **" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": 4, 85 | "metadata": {}, 86 | "outputs": [ 87 | { 88 | "data": { 89 | "text/html": [ 90 | "
\n", 91 | "\n", 104 | "\n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | "
| | Country | Power Consumption KWH | Text |
| 0 | China | 5.523000e+12 | China 5,523,000,000,000 |
| 1 | United States | 3.832000e+12 | United 3,832,000,000,000 |
| 2 | European | 2.771000e+12 | European 2,771,000,000,000 |
| 3 | Russia | 1.065000e+12 | Russia 1,065,000,000,000 |
| 4 | Japan | 9.210000e+11 | Japan 921,000,000,000 |
\n", 146 | "
" 147 | ], 148 | "text/plain": [ 149 | " Country Power Consumption KWH Text\n", 150 | "0 China 5.523000e+12 China 5,523,000,000,000\n", 151 | "1 United States 3.832000e+12 United 3,832,000,000,000\n", 152 | "2 European 2.771000e+12 European 2,771,000,000,000\n", 153 | "3 Russia 1.065000e+12 Russia 1,065,000,000,000\n", 154 | "4 Japan 9.210000e+11 Japan 921,000,000,000" 155 | ] 156 | }, 157 | "execution_count": 4, 158 | "metadata": {}, 159 | "output_type": "execute_result" 160 | } 161 | ], 162 | "source": [ 163 | "df.head()" 164 | ] 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "metadata": {}, 169 | "source": [ 170 | "** Referencing the lecture notes, create a Choropleth Plot of the Power Consumption for Countries using the data and layout dictionary. **" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": 16, 176 | "metadata": {}, 177 | "outputs": [], 178 | "source": [ 179 | "data = dict(type = 'choropleth',\n", 180 | " locations = df['Country'],\n", 181 | " locationmode = 'country names',\n", 182 | " colorscale = 'Viridis',\n", 183 | " reversescale = True,\n", 184 | " text= df['Country'],\n", 185 | " z = df['Power Consumption KWH'],\n", 186 | " colorbar = {'title':'Power Consumption'})\n", 187 | "\n", 188 | "layout = dict(title = '2014 Power Consumption KWH',\n", 189 | " geo = dict(showframe = False,projection = {'type':'Mercator'})\n", 190 | " )" 191 | ] 192 | }, 193 | { 194 | "cell_type": "code", 195 | "execution_count": 17, 196 | "metadata": {}, 197 | "outputs": [ 198 | { 199 | "data": { 200 | "application/vnd.plotly.v1+json": { 201 | "data": [ 202 | { 203 | "colorbar": { 204 | "title": "Power Consumption" 205 | }, 206 | "colorscale": "Viridis", 207 | "locationmode": "country names", 208 | "locations": [ 209 | "China", 210 | "United States", 211 | "European", 212 | "Russia", 213 | "Japan", 214 | "India", 215 | "Germany", 216 | "Canada", 217 | "Brazil", 218 | "Korea,", 219 | "France", 220 | "United Kingdom", 221 | "Italy", 222 | 
"Taiwan", 223 | "Spain", 224 | "Mexico", 225 | "Saudi", 226 | "Australia", 227 | "South", 228 | "Turkey", 229 | "Iran", 230 | "Indonesia", 231 | "Ukraine", 232 | "Thailand", 233 | "Poland", 234 | "Egypt", 235 | "Sweden", 236 | "Norway", 237 | "Malaysia", 238 | "Argentina", 239 | "Netherlands", 240 | "Vietnam", 241 | "Venezuela", 242 | "United Arab Emirates", 243 | "Finland", 244 | "Belgium", 245 | "Kazakhstan", 246 | "Pakistan", 247 | "Philippines", 248 | "Austria", 249 | "Chile", 250 | "Czechia", 251 | "Israel", 252 | "Switzerland", 253 | "Greece", 254 | "Iraq", 255 | "Romania", 256 | "Kuwait", 257 | "Colombia", 258 | "Singapore", 259 | "Portugal", 260 | "Uzbekistan", 261 | "Hong", 262 | "Algeria", 263 | "Bangladesh", 264 | "New", 265 | "Bulgaria", 266 | "Belarus", 267 | "Peru", 268 | "Denmark", 269 | "Qatar", 270 | "Slovakia", 271 | "Libya", 272 | "Serbia", 273 | "Morocco", 274 | "Syria", 275 | "Nigeria", 276 | "Ireland", 277 | "Hungary", 278 | "Oman", 279 | "Ecuador", 280 | "Puerto", 281 | "Azerbaijan", 282 | "Croatia", 283 | "Iceland", 284 | "Cuba", 285 | "Korea,", 286 | "Dominican", 287 | "Jordan", 288 | "Tajikistan", 289 | "Tunisia", 290 | "Slovenia", 291 | "Lebanon", 292 | "Bosnia", 293 | "Turkmenistan", 294 | "Bahrain", 295 | "Mozambique", 296 | "Ghana", 297 | "Sri", 298 | "Kyrgyzstan", 299 | "Lithuania", 300 | "Uruguay", 301 | "Costa", 302 | "Guatemala", 303 | "Georgia", 304 | "Trinidad", 305 | "Zambia", 306 | "Paraguay", 307 | "Albania", 308 | "Burma", 309 | "Estonia", 310 | "Congo,", 311 | "Panama", 312 | "Latvia", 313 | "Macedonia", 314 | "Zimbabwe", 315 | "Kenya", 316 | "Bolivia", 317 | "Luxembourg", 318 | "Sudan", 319 | "El", 320 | "Cameroon", 321 | "West", 322 | "Ethiopia", 323 | "Armenia", 324 | "Honduras", 325 | "Angola", 326 | "Cote", 327 | "Tanzania", 328 | "Nicaragua", 329 | "Moldova", 330 | "Cyprus", 331 | "Macau", 332 | "Namibia", 333 | "Mongolia", 334 | "Afghanistan", 335 | "Yemen", 336 | "Brunei", 337 | "Cambodia", 338 | "Montenegro", 339 | 
"Nepal", 340 | "Botswana", 341 | "Papua", 342 | "Jamaica", 343 | "Kosovo", 344 | "Laos", 345 | "Uganda", 346 | "New", 347 | "Mauritius", 348 | "Senegal", 349 | "Bhutan", 350 | "Malawi", 351 | "Madagascar", 352 | "Bahamas,", 353 | "Gabon", 354 | "Suriname", 355 | "Guam", 356 | "Liechtenstein", 357 | "Swaziland", 358 | "Burkina", 359 | "Togo", 360 | "Curacao", 361 | "Mauritania", 362 | "Barbados", 363 | "Niger", 364 | "Aruba", 365 | "Benin", 366 | "Guinea", 367 | "Mali", 368 | "Fiji", 369 | "Congo,", 370 | "Virgin", 371 | "Lesotho", 372 | "South", 373 | "Bermuda", 374 | "French", 375 | "Jersey", 376 | "Belize", 377 | "Andorra", 378 | "Guyana", 379 | "Cayman", 380 | "Haiti", 381 | "Rwanda", 382 | "Saint", 383 | "Djibouti", 384 | "Seychelles", 385 | "Somalia", 386 | "Antigua", 387 | "Greenland", 388 | "Cabo", 389 | "Eritrea", 390 | "Burundi", 391 | "Liberia", 392 | "Maldives", 393 | "Faroe", 394 | "Gambia,", 395 | "Chad", 396 | "Micronesia,", 397 | "Grenada", 398 | "Central", 399 | "Turks", 400 | "Gibraltar", 401 | "American", 402 | "Sierra", 403 | "Saint", 404 | "Saint", 405 | "Timor-Leste", 406 | "Equatorial", 407 | "Samoa", 408 | "Dominica", 409 | "Western", 410 | "Solomon", 411 | "Sao", 412 | "British", 413 | "Vanuatu", 414 | "Guinea-Bissau", 415 | "Tonga", 416 | "Saint", 417 | "Comoros", 418 | "Cook", 419 | "Kiribati", 420 | "Montserrat", 421 | "Nauru", 422 | "Falkland", 423 | "Saint", 424 | "Niue", 425 | "Gaza", 426 | "Malta", 427 | "Northern" 428 | ], 429 | "reversescale": true, 430 | "text": [ 431 | "China", 432 | "United States", 433 | "European", 434 | "Russia", 435 | "Japan", 436 | "India", 437 | "Germany", 438 | "Canada", 439 | "Brazil", 440 | "Korea,", 441 | "France", 442 | "United Kingdom", 443 | "Italy", 444 | "Taiwan", 445 | "Spain", 446 | "Mexico", 447 | "Saudi", 448 | "Australia", 449 | "South", 450 | "Turkey", 451 | "Iran", 452 | "Indonesia", 453 | "Ukraine", 454 | "Thailand", 455 | "Poland", 456 | "Egypt", 457 | "Sweden", 458 | "Norway", 459 | 
"Malaysia", 460 | "Argentina", 461 | "Netherlands", 462 | "Vietnam", 463 | "Venezuela", 464 | "United Arab Emirates", 465 | "Finland", 466 | "Belgium", 467 | "Kazakhstan", 468 | "Pakistan", 469 | "Philippines", 470 | "Austria", 471 | "Chile", 472 | "Czechia", 473 | "Israel", 474 | "Switzerland", 475 | "Greece", 476 | "Iraq", 477 | "Romania", 478 | "Kuwait", 479 | "Colombia", 480 | "Singapore", 481 | "Portugal", 482 | "Uzbekistan", 483 | "Hong", 484 | "Algeria", 485 | "Bangladesh", 486 | "New", 487 | "Bulgaria", 488 | "Belarus", 489 | "Peru", 490 | "Denmark", 491 | "Qatar", 492 | "Slovakia", 493 | "Libya", 494 | "Serbia", 495 | "Morocco", 496 | "Syria", 497 | "Nigeria", 498 | "Ireland", 499 | "Hungary", 500 | "Oman", 501 | "Ecuador", 502 | "Puerto", 503 | "Azerbaijan", 504 | "Croatia", 505 | "Iceland", 506 | "Cuba", 507 | "Korea,", 508 | "Dominican", 509 | "Jordan", 510 | "Tajikistan", 511 | "Tunisia", 512 | "Slovenia", 513 | "Lebanon", 514 | "Bosnia", 515 | "Turkmenistan", 516 | "Bahrain", 517 | "Mozambique", 518 | "Ghana", 519 | "Sri", 520 | "Kyrgyzstan", 521 | "Lithuania", 522 | "Uruguay", 523 | "Costa", 524 | "Guatemala", 525 | "Georgia", 526 | "Trinidad", 527 | "Zambia", 528 | "Paraguay", 529 | "Albania", 530 | "Burma", 531 | "Estonia", 532 | "Congo,", 533 | "Panama", 534 | "Latvia", 535 | "Macedonia", 536 | "Zimbabwe", 537 | "Kenya", 538 | "Bolivia", 539 | "Luxembourg", 540 | "Sudan", 541 | "El", 542 | "Cameroon", 543 | "West", 544 | "Ethiopia", 545 | "Armenia", 546 | "Honduras", 547 | "Angola", 548 | "Cote", 549 | "Tanzania", 550 | "Nicaragua", 551 | "Moldova", 552 | "Cyprus", 553 | "Macau", 554 | "Namibia", 555 | "Mongolia", 556 | "Afghanistan", 557 | "Yemen", 558 | "Brunei", 559 | "Cambodia", 560 | "Montenegro", 561 | "Nepal", 562 | "Botswana", 563 | "Papua", 564 | "Jamaica", 565 | "Kosovo", 566 | "Laos", 567 | "Uganda", 568 | "New", 569 | "Mauritius", 570 | "Senegal", 571 | "Bhutan", 572 | "Malawi", 573 | "Madagascar", 574 | "Bahamas,", 575 | "Gabon", 576 
| "Suriname", 577 | "Guam", 578 | "Liechtenstein", 579 | "Swaziland", 580 | "Burkina", 581 | "Togo", 582 | "Curacao", 583 | "Mauritania", 584 | "Barbados", 585 | "Niger", 586 | "Aruba", 587 | "Benin", 588 | "Guinea", 589 | "Mali", 590 | "Fiji", 591 | "Congo,", 592 | "Virgin", 593 | "Lesotho", 594 | "South", 595 | "Bermuda", 596 | "French", 597 | "Jersey", 598 | "Belize", 599 | "Andorra", 600 | "Guyana", 601 | "Cayman", 602 | "Haiti", 603 | "Rwanda", 604 | "Saint", 605 | "Djibouti", 606 | "Seychelles", 607 | "Somalia", 608 | "Antigua", 609 | "Greenland", 610 | "Cabo", 611 | "Eritrea", 612 | "Burundi", 613 | "Liberia", 614 | "Maldives", 615 | "Faroe", 616 | "Gambia,", 617 | "Chad", 618 | "Micronesia,", 619 | "Grenada", 620 | "Central", 621 | "Turks", 622 | "Gibraltar", 623 | "American", 624 | "Sierra", 625 | "Saint", 626 | "Saint", 627 | "Timor-Leste", 628 | "Equatorial", 629 | "Samoa", 630 | "Dominica", 631 | "Western", 632 | "Solomon", 633 | "Sao", 634 | "British", 635 | "Vanuatu", 636 | "Guinea-Bissau", 637 | "Tonga", 638 | "Saint", 639 | "Comoros", 640 | "Cook", 641 | "Kiribati", 642 | "Montserrat", 643 | "Nauru", 644 | "Falkland", 645 | "Saint", 646 | "Niue", 647 | "Gaza", 648 | "Malta", 649 | "Northern" 650 | ], 651 | "type": "choropleth", 652 | "z": [ 653 | 5523000000000, 654 | 3832000000000, 655 | 2771000000000, 656 | 1065000000000, 657 | 921000000000, 658 | 864700000000, 659 | 540100000000, 660 | 511000000000, 661 | 483500000000, 662 | 482400000000, 663 | 451100000000, 664 | 319100000000, 665 | 303100000000, 666 | 249500000000, 667 | 243100000000, 668 | 234000000000, 669 | 231600000000, 670 | 222600000000, 671 | 211600000000, 672 | 197000000000, 673 | 195300000000, 674 | 167500000000, 675 | 159800000000, 676 | 155900000000, 677 | 139000000000, 678 | 135600000000, 679 | 130500000000, 680 | 126400000000, 681 | 118500000000, 682 | 117100000000, 683 | 116800000000, 684 | 108300000000, 685 | 97690000000, 686 | 93280000000, 687 | 82040000000, 688 | 81890000000, 
689 | 80290000000, 690 | 78890000000, 691 | 75270000000, 692 | 69750000000, 693 | 63390000000, 694 | 60550000000, 695 | 59830000000, 696 | 58010000000, 697 | 57730000000, 698 | 53410000000, 699 | 50730000000, 700 | 50000000000, 701 | 49380000000, 702 | 47180000000, 703 | 46250000000, 704 | 45210000000, 705 | 44210000000, 706 | 42870000000, 707 | 41520000000, 708 | 40300000000, 709 | 37990000000, 710 | 37880000000, 711 | 35690000000, 712 | 31960000000, 713 | 30530000000, 714 | 28360000000, 715 | 27540000000, 716 | 26910000000, 717 | 26700000000, 718 | 25700000000, 719 | 24780000000, 720 | 24240000000, 721 | 21550000000, 722 | 20360000000, 723 | 19020000000, 724 | 18620000000, 725 | 17790000000, 726 | 16970000000, 727 | 16940000000, 728 | 16200000000, 729 | 16000000000, 730 | 15140000000, 731 | 14560000000, 732 | 14420000000, 733 | 13310000000, 734 | 13020000000, 735 | 12940000000, 736 | 12560000000, 737 | 11750000000, 738 | 11690000000, 739 | 11280000000, 740 | 10580000000, 741 | 10170000000, 742 | 9943000000, 743 | 9664000000, 744 | 9559000000, 745 | 8987000000, 746 | 8915000000, 747 | 8468000000, 748 | 8365000000, 749 | 8327000000, 750 | 8125000000, 751 | 7793000000, 752 | 7765000000, 753 | 7417000000, 754 | 7292000000, 755 | 7144000000, 756 | 7141000000, 757 | 6960000000, 758 | 6831000000, 759 | 6627000000, 760 | 6456000000, 761 | 6108000000, 762 | 5665000000, 763 | 5665000000, 764 | 5535000000, 765 | 5312000000, 766 | 5227000000, 767 | 5043000000, 768 | 5036000000, 769 | 4842000000, 770 | 4731000000, 771 | 4545000000, 772 | 4412000000, 773 | 4305000000, 774 | 4296000000, 775 | 4291000000, 776 | 4238000000, 777 | 4204000000, 778 | 3893000000, 779 | 3838000000, 780 | 3766000000, 781 | 3553000000, 782 | 3465000000, 783 | 3239000000, 784 | 3213000000, 785 | 3116000000, 786 | 3008000000, 787 | 2887000000, 788 | 2874000000, 789 | 2821000000, 790 | 2716000000, 791 | 2658000000, 792 | 2586000000, 793 | 2085000000, 794 | 2027000000, 795 | 1883000000, 796 | 1716000000, 
797 | 1680000000, 798 | 1572000000, 799 | 1566000000, 800 | 1360000000, 801 | 1295000000, 802 | 985500000, 803 | 976000000, 804 | 968000000, 805 | 962600000, 806 | 938000000, 807 | 930200000, 808 | 920700000, 809 | 911000000, 810 | 903000000, 811 | 882600000, 812 | 777600000, 813 | 740000000, 814 | 723500000, 815 | 707000000, 816 | 694100000, 817 | 664200000, 818 | 652900000, 819 | 630100000, 820 | 605000000, 821 | 562400000, 822 | 558000000, 823 | 545900000, 824 | 452000000, 825 | 365500000, 826 | 336400000, 827 | 311600000, 828 | 293900000, 829 | 293000000, 830 | 293000000, 831 | 292000000, 832 | 285500000, 833 | 284000000, 834 | 282900000, 835 | 276900000, 836 | 267100000, 837 | 261300000, 838 | 218600000, 839 | 190700000, 840 | 178600000, 841 | 178000000, 842 | 168300000, 843 | 167400000, 844 | 160000000, 845 | 146000000, 846 | 134900000, 847 | 130200000, 848 | 127400000, 849 | 125300000, 850 | 93000000, 851 | 90400000, 852 | 89750000, 853 | 83700000, 854 | 79050000, 855 | 60450000, 856 | 51150000, 857 | 49290000, 858 | 46500000, 859 | 44640000, 860 | 39990000, 861 | 39990000, 862 | 28950000, 863 | 24180000, 864 | 23250000, 865 | 23250000, 866 | 11160000, 867 | 7440000, 868 | 2790000, 869 | 202000, 870 | 174700, 871 | 48300 872 | ] 873 | } 874 | ], 875 | "layout": { 876 | "geo": { 877 | "projection": { 878 | "type": "Mercator" 879 | }, 880 | "showframe": false 881 | }, 882 | "title": "2014 Power Consumption KWH" 883 | } 884 | }, 885 | "text/html": [ 886 | "
" 887 | ], 888 | "text/vnd.plotly.v1+html": [ 889 | "
" 890 | ] 891 | }, 892 | "metadata": {}, 893 | "output_type": "display_data" 894 | } 895 | ], 896 | "source": [ 897 | "choromap = go.Figure(data = [data],layout = layout)\n", 898 | "iplot(choromap,validate=False)" 899 | ] 900 | }, 901 | { 902 | "cell_type": "markdown", 903 | "metadata": {}, 904 | "source": [ 905 | "## USA Choropleth\n", 906 | "\n", 907 | "** Import the 2012_Election_Data csv file using pandas. **" 908 | ] 909 | }, 910 | { 911 | "cell_type": "code", 912 | "execution_count": 8, 913 | "metadata": {}, 914 | "outputs": [], 915 | "source": [ 916 | "df2 = pd.read_csv('2012_Election_Data')" 917 | ] 918 | }, 919 | { 920 | "cell_type": "markdown", 921 | "metadata": {}, 922 | "source": [ 923 | "** Check the head of the DataFrame. **" 924 | ] 925 | }, 926 | { 927 | "cell_type": "code", 928 | "execution_count": 19, 929 | "metadata": {}, 930 | "outputs": [ 931 | { 932 | "data": { 933 | "text/html": [ 934 | "
\n", 935 | "\n", 948 | "\n", 949 | " \n", 950 | " \n", 951 | " \n", 952 | " \n", 953 | " \n", 954 | " \n", 955 | " \n", 956 | " \n", 957 | " \n", 958 | " \n", 959 | " \n", 960 | " \n", 961 | " \n", 962 | " \n", 963 | " \n", 964 | " \n", 965 | " \n", 966 | " \n", 967 | " \n", 968 | " \n", 969 | " \n", 970 | " \n", 971 | " \n", 972 | " \n", 973 | " \n", 974 | " \n", 975 | " \n", 976 | " \n", 977 | " \n", 978 | " \n", 979 | " \n", 980 | " \n", 981 | " \n", 982 | " \n", 983 | " \n", 984 | " \n", 985 | " \n", 986 | " \n", 987 | " \n", 988 | " \n", 989 | " \n", 990 | " \n", 991 | " \n", 992 | " \n", 993 | " \n", 994 | " \n", 995 | " \n", 996 | " \n", 997 | " \n", 998 | " \n", 999 | " \n", 1000 | " \n", 1001 | " \n", 1002 | " \n", 1003 | " \n", 1004 | " \n", 1005 | " \n", 1006 | " \n", 1007 | " \n", 1008 | " \n", 1009 | " \n", 1010 | " \n", 1011 | " \n", 1012 | " \n", 1013 | " \n", 1014 | " \n", 1015 | " \n", 1016 | " \n", 1017 | " \n", 1018 | " \n", 1019 | " \n", 1020 | " \n", 1021 | " \n", 1022 | " \n", 1023 | " \n", 1024 | " \n", 1025 | " \n", 1026 | " \n", 1027 | " \n", 1028 | " \n", 1029 | " \n", 1030 | " \n", 1031 | " \n", 1032 | " \n", 1033 | " \n", 1034 | " \n", 1035 | " \n", 1036 | " \n", 1037 | " \n", 1038 | " \n", 1039 | " \n", 1040 | " \n", 1041 | " \n", 1042 | " \n", 1043 | " \n", 1044 | " \n", 1045 | " \n", 1046 | " \n", 1047 | " \n", 1048 | " \n", 1049 | " \n", 1050 | " \n", 1051 | " \n", 1052 | " \n", 1053 | " \n", 1054 | " \n", 1055 | " \n", 1056 | " \n", 1057 | " \n", 1058 | " \n", 1059 | " \n", 1060 | " \n", 1061 | " \n", 1062 | " \n", 1063 | " \n", 1064 | " \n", 1065 | " \n", 1066 | " \n", 1067 | " \n", 1068 | " \n", 1069 | " \n", 1070 | " \n", 1071 | " \n", 1072 | " \n", 1073 | "
| | Year | ICPSR State Code | Alphanumeric State Code | State | VEP Total Ballots Counted | VEP Highest Office | VAP Highest Office | Total Ballots Counted | Highest Office | Voting-Eligible Population (VEP) | Voting-Age Population (VAP) | % Non-citizen | Prison | Probation | Parole | Total Ineligible Felon | State Abv |
| 0 | 2012 | 41 | 1 | Alabama | NaN | 58.6% | 56.0% | NaN | 2,074,338 | 3,539,217 | 3707440.0 | 2.6% | 32,232 | 57,993 | 8,616 | 71,584 | AL |
| 1 | 2012 | 81 | 2 | Alaska | 58.9% | 58.7% | 55.3% | 301,694 | 300,495 | 511,792 | 543763.0 | 3.8% | 5,633 | 7,173 | 1,882 | 11,317 | AK |
| 2 | 2012 | 61 | 3 | Arizona | 53.0% | 52.6% | 46.5% | 2,323,579 | 2,306,559 | 4,387,900 | 4959270.0 | 9.9% | 35,188 | 72,452 | 7,460 | 81,048 | AZ |
| 3 | 2012 | 42 | 4 | Arkansas | 51.1% | 50.7% | 47.7% | 1,078,548 | 1,069,468 | 2,109,847 | 2242740.0 | 3.5% | 14,471 | 30,122 | 23,372 | 53,808 | AR |
| 4 | 2012 | 71 | 5 | California | 55.7% | 55.1% | 45.1% | 13,202,158 | 13,038,547 | 23,681,837 | 28913129.0 | 17.4% | 119,455 | 0 | 89,287 | 208,742 | CA |
\n", 1074 | "
" 1075 | ], 1076 | "text/plain": [ 1077 | " Year ICPSR State Code Alphanumeric State Code State \\\n", 1078 | "0 2012 41 1 Alabama \n", 1079 | "1 2012 81 2 Alaska \n", 1080 | "2 2012 61 3 Arizona \n", 1081 | "3 2012 42 4 Arkansas \n", 1082 | "4 2012 71 5 California \n", 1083 | "\n", 1084 | " VEP Total Ballots Counted VEP Highest Office VAP Highest Office \\\n", 1085 | "0 NaN 58.6% 56.0% \n", 1086 | "1 58.9% 58.7% 55.3% \n", 1087 | "2 53.0% 52.6% 46.5% \n", 1088 | "3 51.1% 50.7% 47.7% \n", 1089 | "4 55.7% 55.1% 45.1% \n", 1090 | "\n", 1091 | " Total Ballots Counted Highest Office Voting-Eligible Population (VEP) \\\n", 1092 | "0 NaN 2,074,338 3,539,217 \n", 1093 | "1 301,694 300,495 511,792 \n", 1094 | "2 2,323,579 2,306,559 4,387,900 \n", 1095 | "3 1,078,548 1,069,468 2,109,847 \n", 1096 | "4 13,202,158 13,038,547 23,681,837 \n", 1097 | "\n", 1098 | " Voting-Age Population (VAP) % Non-citizen Prison Probation Parole \\\n", 1099 | "0 3707440.0 2.6% 32,232 57,993 8,616 \n", 1100 | "1 543763.0 3.8% 5,633 7,173 1,882 \n", 1101 | "2 4959270.0 9.9% 35,188 72,452 7,460 \n", 1102 | "3 2242740.0 3.5% 14,471 30,122 23,372 \n", 1103 | "4 28913129.0 17.4% 119,455 0 89,287 \n", 1104 | "\n", 1105 | " Total Ineligible Felon State Abv \n", 1106 | "0 71,584 AL \n", 1107 | "1 11,317 AK \n", 1108 | "2 81,048 AZ \n", 1109 | "3 53,808 AR \n", 1110 | "4 208,742 CA " 1111 | ] 1112 | }, 1113 | "execution_count": 19, 1114 | "metadata": {}, 1115 | "output_type": "execute_result" 1116 | } 1117 | ], 1118 | "source": [ 1119 | "df2.head()" 1120 | ] 1121 | }, 1122 | { 1123 | "cell_type": "markdown", 1124 | "metadata": {}, 1125 | "source": [ 1126 | "** Now create a plot that displays the Voting-Age Population (VAP) per state. If you later want to play around with other columns, make sure you consider their data type. VAP has already been transformed to a float for you. 
**" 1127 | ] 1128 | }, 1129 | { 1130 | "cell_type": "code", 1131 | "execution_count": 34, 1132 | "metadata": {}, 1133 | "outputs": [], 1134 | "source": [ 1135 | "data = dict(type='choropleth',\n", 1136 | " colorscale = 'Viridis',\n", 1137 | " reversescale = True,\n", 1138 | " locations = df2['State Abv'],\n", 1139 | " z = df2['Voting-Age Population (VAP)'],\n", 1140 | " locationmode = 'USA-states',\n", 1141 | " text = df2['State'],\n", 1142 | " marker = dict(line = dict(color = 'rgb(255,255,255)',width = 1)),\n", 1143 | " colorbar = {'title':\"Voting-Age Population (VAP)\"}\n", 1144 | " ) \n", 1145 | "\n", 1146 | "layout = dict(title = '2012 General Election Voting Data',\n", 1147 | " geo = dict(scope='usa',\n", 1148 | " showlakes = True,\n", 1149 | " lakecolor = 'rgb(85,173,240)')\n", 1150 | " )" 1151 | ] 1152 | }, 1153 | { 1154 | "cell_type": "code", 1155 | "execution_count": 35, 1156 | "metadata": {}, 1157 | "outputs": [ 1158 | { 1159 | "data": { 1160 | "application/vnd.plotly.v1+json": { 1161 | "data": [ 1162 | { 1163 | "colorbar": { 1164 | "title": "Voting-Age Population (VAP)" 1165 | }, 1166 | "colorscale": "Viridis", 1167 | "locationmode": "USA-states", 1168 | "locations": [ 1169 | "AL", 1170 | "AK", 1171 | "AZ", 1172 | "AR", 1173 | "CA", 1174 | "CO", 1175 | "CT", 1176 | "DE", 1177 | "District of Columbia", 1178 | "FL", 1179 | "GA", 1180 | "HI", 1181 | "ID", 1182 | "IL", 1183 | "IN", 1184 | "IA", 1185 | "KS", 1186 | "KY", 1187 | "LA", 1188 | "ME", 1189 | "MD", 1190 | "MA", 1191 | "MI", 1192 | "MN", 1193 | "MS", 1194 | "MO", 1195 | "MT", 1196 | "NE", 1197 | "NV", 1198 | "NH", 1199 | "NJ", 1200 | "NM", 1201 | "NY", 1202 | "NC", 1203 | "ND", 1204 | "OH", 1205 | "OK", 1206 | "OR", 1207 | "PA", 1208 | "RI", 1209 | "SC", 1210 | "SD", 1211 | "TN", 1212 | "TX", 1213 | "UT", 1214 | "VT", 1215 | "VA", 1216 | "WA", 1217 | "WV", 1218 | "WI", 1219 | "WY" 1220 | ], 1221 | "marker": { 1222 | "line": { 1223 | "color": "rgb(255,255,255)", 1224 | "width": 1 1225 | } 1226 | }, 
1227 | "reversescale": true, 1228 | "text": [ 1229 | "Alabama", 1230 | "Alaska", 1231 | "Arizona", 1232 | "Arkansas", 1233 | "California", 1234 | "Colorado", 1235 | "Connecticut", 1236 | "Delaware", 1237 | "District of Columbia", 1238 | "Florida", 1239 | "Georgia", 1240 | "Hawaii", 1241 | "Idaho", 1242 | "Illinois", 1243 | "Indiana", 1244 | "Iowa", 1245 | "Kansas", 1246 | "Kentucky", 1247 | "Louisiana", 1248 | "Maine", 1249 | "Maryland", 1250 | "Massachusetts", 1251 | "Michigan", 1252 | "Minnesota", 1253 | "Mississippi", 1254 | "Missouri", 1255 | "Montana", 1256 | "Nebraska", 1257 | "Nevada", 1258 | "New Hampshire", 1259 | "New Jersey", 1260 | "New Mexico", 1261 | "New York", 1262 | "North Carolina", 1263 | "North Dakota", 1264 | "Ohio", 1265 | "Oklahoma", 1266 | "Oregon", 1267 | "Pennsylvania", 1268 | "Rhode Island", 1269 | "South Carolina", 1270 | "South Dakota", 1271 | "Tennessee", 1272 | "Texas", 1273 | "Utah", 1274 | "Vermont", 1275 | "Virginia", 1276 | "Washington", 1277 | "West Virginia", 1278 | "Wisconsin", 1279 | "Wyoming" 1280 | ], 1281 | "type": "choropleth", 1282 | "z": [ 1283 | 3707440, 1284 | 543763, 1285 | 4959270, 1286 | 2242740, 1287 | 28913129, 1288 | 3981208, 1289 | 2801375, 1290 | 715708, 1291 | 528848, 1292 | 15380947, 1293 | 7452696, 1294 | 1088335, 1295 | 1173727, 1296 | 9827043, 1297 | 4960376, 1298 | 2356209, 1299 | 2162442, 1300 | 3368684, 1301 | 3495847, 1302 | 1064779, 1303 | 4553853, 1304 | 5263550, 1305 | 7625576, 1306 | 4114820, 1307 | 2246931, 1308 | 4628500, 1309 | 785454, 1310 | 1396507, 1311 | 2105976, 1312 | 1047978, 1313 | 6847503, 1314 | 1573400, 1315 | 15344671, 1316 | 7496980, 1317 | 549955, 1318 | 8896930, 1319 | 2885093, 1320 | 3050747, 1321 | 10037099, 1322 | 834983, 1323 | 3662322, 1324 | 631472, 1325 | 4976284, 1326 | 19185395, 1327 | 1978956, 1328 | 502242, 1329 | 6348827, 1330 | 5329782, 1331 | 1472642, 1332 | 4417273, 1333 | 441726 1334 | ] 1335 | } 1336 | ], 1337 | "layout": { 1338 | "geo": { 1339 | "lakecolor": 
"rgb(85,173,240)", 1340 | "scope": "usa", 1341 | "showlakes": true 1342 | }, 1343 | "title": "2012 General Election Voting Data" 1344 | } 1345 | }, 1346 | "text/html": [ 1347 | "
" 1348 | ], 1349 | "text/vnd.plotly.v1+html": [ 1350 | "
" 1351 | ] 1352 | }, 1353 | "metadata": {}, 1354 | "output_type": "display_data" 1355 | } 1356 | ], 1357 | "source": [ 1358 | "choromap = go.Figure(data = [data],layout = layout)\n", 1359 | "iplot(choromap,validate=False)" 1360 | ] 1361 | }, 1362 | { 1363 | "cell_type": "markdown", 1364 | "metadata": {}, 1365 | "source": [ 1366 | "# Great Job!" 1367 | ] 1368 | } 1369 | ], 1370 | "metadata": { 1371 | "kernelspec": { 1372 | "display_name": "Python 3", 1373 | "language": "python", 1374 | "name": "python3" 1375 | }, 1376 | "language_info": { 1377 | "codemirror_mode": { 1378 | "name": "ipython", 1379 | "version": 3 1380 | }, 1381 | "file_extension": ".py", 1382 | "mimetype": "text/x-python", 1383 | "name": "python", 1384 | "nbconvert_exporter": "python", 1385 | "pygments_lexer": "ipython3", 1386 | "version": "3.6.4" 1387 | } 1388 | }, 1389 | "nbformat": 4, 1390 | "nbformat_minor": 1 1391 | } 1392 | -------------------------------------------------------------------------------- /NLP Project .ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "___\n", 8 | "\n", 9 | " \n", 10 | "___" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": {}, 16 | "source": [ 17 | "# Natural Language Processing Project\n", 18 | "\n", 19 | "Welcome to the NLP Project for this section of the course. In this NLP project you will be attempting to classify Yelp Reviews into 1 star or 5 star categories based off the text content in the reviews. 
This will be a simpler procedure than the lecture, since we will utilize the pipeline methods for more complex tasks.\n", 20 | "\n", 21 | "We will use the [Yelp Review Data Set from Kaggle](https://www.kaggle.com/c/yelp-recsys-2013).\n", 22 | "\n", 23 | "Each observation in this dataset is a review of a particular business by a particular user.\n", 24 | "\n", 25 | "The \"stars\" column is the number of stars (1 through 5) assigned by the reviewer to the business. (Higher stars is better.) In other words, it is the rating of the business by the person who wrote the review.\n", 26 | "\n", 27 | "The \"cool\" column is the number of \"cool\" votes this review received from other Yelp users. \n", 28 | "\n", 29 | "All reviews start with 0 \"cool\" votes, and there is no limit to how many \"cool\" votes a review can receive. In other words, it is a rating of the review itself, not a rating of the business.\n", 30 | "\n", 31 | "The \"useful\" and \"funny\" columns are similar to the \"cool\" column.\n", 32 | "\n", 33 | "Let's get started! Just follow the directions below!" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "## Imports\n", 41 | " **Import the usual suspects. 
:) **" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": 25, 47 | "metadata": {}, 48 | "outputs": [], 49 | "source": [ 50 | "import numpy as np\n", 51 | "import pandas as pd\n", 52 | "import matplotlib.pyplot as plt\n", 53 | "import seaborn as sns \n", 54 | "\n", 55 | "%matplotlib inline" 56 | ] 57 | }, 58 | { 59 | "cell_type": "markdown", 60 | "metadata": {}, 61 | "source": [ 62 | "## The Data\n", 63 | "\n", 64 | "**Read the yelp.csv file and set it as a dataframe called yelp.**" 65 | ] 66 | }, 67 | { 68 | "cell_type": "code", 69 | "execution_count": 37, 70 | "metadata": {}, 71 | "outputs": [], 72 | "source": [ 73 | "yelp = pd.read_csv('yelp.csv')" 74 | ] 75 | }, 76 | { 77 | "cell_type": "markdown", 78 | "metadata": {}, 79 | "source": [ 80 | "** Check the head, info , and describe methods on yelp.**" 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": 38, 86 | "metadata": {}, 87 | "outputs": [ 88 | { 89 | "data": { 90 | "text/html": [ 91 | "
" 190 | ], 191 | "text/plain": [ 192 | " business_id date review_id stars \\\n", 193 | "0 9yKzy9PApeiPPOUJEtnvkg 2011-01-26 fWKvX83p0-ka4JS3dc6E5A 5 \n", 194 | "1 ZRJwVLyzEJq1VAihDhYiow 2011-07-27 IjZ33sJrzXqU-0X6U8NwyA 5 \n", 195 | "2 6oRAC4uyJCsJl1X0WZpVSA 2012-06-14 IESLBzqUCLdSzSqm0eCSxQ 4 \n", 196 | "3 _1QQZuf4zZOyFCvXc0o6Vg 2010-05-27 G-WvGaISbqqaMHlNnByodA 5 \n", 197 | "4 6ozycU1RpktNG2-1BroVtw 2012-01-05 1uJFq2r5QfJG_6ExMRCaGw 5 \n", 198 | "\n", 199 | " text type \\\n", 200 | "0 My wife took me here on my birthday for breakf... review \n", 201 | "1 I have no idea why some people give bad review... review \n", 202 | "2 love the gyro plate. Rice is so good and I als... review \n", 203 | "3 Rosie, Dakota, and I LOVE Chaparral Dog Park!!... review \n", 204 | "4 General Manager Scott Petello is a good egg!!!... review \n", 205 | "\n", 206 | " user_id cool useful funny \n", 207 | "0 rLtl8ZkDX5vH5nAx9C3q5Q 2 5 0 \n", 208 | "1 0a2KyEL0d3Yb1V6aivbIuQ 0 0 0 \n", 209 | "2 0hT2KtfLiobPvh6cDC8JQg 0 1 0 \n", 210 | "3 uZetl9T0NcROGOyFfughhg 1 2 0 \n", 211 | "4 vYmM4KTsC8ZfQBg-j5MWkw 0 0 0 " 212 | ] 213 | }, 214 | "execution_count": 38, 215 | "metadata": {}, 216 | "output_type": "execute_result" 217 | } 218 | ], 219 | "source": [ 220 | "yelp.head()" 221 | ] 222 | }, 223 | { 224 | "cell_type": "code", 225 | "execution_count": 39, 226 | "metadata": {}, 227 | "outputs": [ 228 | { 229 | "name": "stdout", 230 | "output_type": "stream", 231 | "text": [ 232 | "\n", 233 | "RangeIndex: 10000 entries, 0 to 9999\n", 234 | "Data columns (total 10 columns):\n", 235 | "business_id 10000 non-null object\n", 236 | "date 10000 non-null object\n", 237 | "review_id 10000 non-null object\n", 238 | "stars 10000 non-null int64\n", 239 | "text 10000 non-null object\n", 240 | "type 10000 non-null object\n", 241 | "user_id 10000 non-null object\n", 242 | "cool 10000 non-null int64\n", 243 | "useful 10000 non-null int64\n", 244 | "funny 10000 non-null int64\n", 245 | "dtypes: int64(4), 
object(6)\n", 246 | "memory usage: 781.3+ KB\n" 247 | ] 248 | } 249 | ], 250 | "source": [ 251 | "yelp.info()" 252 | ] 253 | }, 254 | { 255 | "cell_type": "code", 256 | "execution_count": 40, 257 | "metadata": {}, 258 | "outputs": [ 259 | { 260 | "data": { 261 | "text/html": [ 262 | "
" 346 | ], 347 | "text/plain": [ 348 | " stars cool useful funny\n", 349 | "count 10000.000000 10000.000000 10000.000000 10000.000000\n", 350 | "mean 3.777500 0.876800 1.409300 0.701300\n", 351 | "std 1.214636 2.067861 2.336647 1.907942\n", 352 | "min 1.000000 0.000000 0.000000 0.000000\n", 353 | "25% 3.000000 0.000000 0.000000 0.000000\n", 354 | "50% 4.000000 0.000000 1.000000 0.000000\n", 355 | "75% 5.000000 1.000000 2.000000 1.000000\n", 356 | "max 5.000000 77.000000 76.000000 57.000000" 357 | ] 358 | }, 359 | "execution_count": 40, 360 | "metadata": {}, 361 | "output_type": "execute_result" 362 | } 363 | ], 364 | "source": [ 365 | "yelp.describe()" 366 | ] 367 | }, 368 | { 369 | "cell_type": "markdown", 370 | "metadata": {}, 371 | "source": [ 372 | "**Create a new column called \"text length\" which is the number of characters in the text column.**" 373 | ] 374 | }, 375 | { 376 | "cell_type": "code", 377 | "execution_count": 41, 378 | "metadata": {}, 379 | "outputs": [], 380 | "source": [ 381 | "yelp['text length'] = yelp['text'].apply(len)" 382 | ] 383 | }, 384 | { 385 | "cell_type": "code", 386 | "execution_count": 42, 387 | "metadata": {}, 388 | "outputs": [ 389 | { 390 | "data": { 391 | "text/html": [ 392 | "
" 497 | ], 498 | "text/plain": [ 499 | " business_id date review_id stars \\\n", 500 | "0 9yKzy9PApeiPPOUJEtnvkg 2011-01-26 fWKvX83p0-ka4JS3dc6E5A 5 \n", 501 | "1 ZRJwVLyzEJq1VAihDhYiow 2011-07-27 IjZ33sJrzXqU-0X6U8NwyA 5 \n", 502 | "2 6oRAC4uyJCsJl1X0WZpVSA 2012-06-14 IESLBzqUCLdSzSqm0eCSxQ 4 \n", 503 | "3 _1QQZuf4zZOyFCvXc0o6Vg 2010-05-27 G-WvGaISbqqaMHlNnByodA 5 \n", 504 | "4 6ozycU1RpktNG2-1BroVtw 2012-01-05 1uJFq2r5QfJG_6ExMRCaGw 5 \n", 505 | "\n", 506 | " text type \\\n", 507 | "0 My wife took me here on my birthday for breakf... review \n", 508 | "1 I have no idea why some people give bad review... review \n", 509 | "2 love the gyro plate. Rice is so good and I als... review \n", 510 | "3 Rosie, Dakota, and I LOVE Chaparral Dog Park!!... review \n", 511 | "4 General Manager Scott Petello is a good egg!!!... review \n", 512 | "\n", 513 | " user_id cool useful funny text length \n", 514 | "0 rLtl8ZkDX5vH5nAx9C3q5Q 2 5 0 889 \n", 515 | "1 0a2KyEL0d3Yb1V6aivbIuQ 0 0 0 1345 \n", 516 | "2 0hT2KtfLiobPvh6cDC8JQg 0 1 0 76 \n", 517 | "3 uZetl9T0NcROGOyFfughhg 1 2 0 419 \n", 518 | "4 vYmM4KTsC8ZfQBg-j5MWkw 0 0 0 469 " 519 | ] 520 | }, 521 | "execution_count": 42, 522 | "metadata": {}, 523 | "output_type": "execute_result" 524 | } 525 | ], 526 | "source": [ 527 | "yelp.head(5)" 528 | ] 529 | }, 530 | { 531 | "cell_type": "markdown", 532 | "metadata": {}, 533 | "source": [ 534 | "# EDA\n", 535 | "\n", 536 | "Let's explore the data\n", 537 | "\n", 538 | "## Imports\n", 539 | "\n", 540 | "**Import the data visualization libraries if you haven't done so already.**" 541 | ] 542 | }, 543 | { 544 | "cell_type": "code", 545 | "execution_count": 32, 546 | "metadata": {}, 547 | "outputs": [], 548 | "source": [ 549 | "#Already done" 550 | ] 551 | }, 552 | { 553 | "cell_type": "markdown", 554 | "metadata": {}, 555 | "source": [ 556 | "**Use FacetGrid from the seaborn library to create a grid of 5 histograms of text length based off of the star ratings. 
Reference the seaborn documentation for hints on this**" 557 | ] 558 | }, 559 | { 560 | "cell_type": "code", 561 | "execution_count": 44, 562 | "metadata": {}, 563 | "outputs": [ 564 | { 565 | "data": { 566 | "text/plain": [ 567 | "" 568 | ] 569 | }, 570 | "execution_count": 44, 571 | "metadata": {}, 572 | "output_type": "execute_result" 573 | }, 574 | { 575 | "data": { 576 | "image/png": "\n", 577 | "text/plain": [ 578 | "
" 579 | ] 580 | }, 581 | "metadata": {}, 582 | "output_type": "display_data" 583 | } 584 | ], 585 | "source": [ 586 | "g = sns.FacetGrid(yelp,col='stars')\n", 587 | "g.map(plt.hist,'text length')" 588 | ] 589 | }, 590 | { 591 | "cell_type": "markdown", 592 | "metadata": {}, 593 | "source": [ 594 | "**Create a boxplot of text length for each star category.**" 595 | ] 596 | }, 597 | { 598 | "cell_type": "code", 599 | "execution_count": 45, 600 | "metadata": {}, 601 | "outputs": [ 602 | { 603 | "data": { 604 | "text/plain": [ 605 | "" 606 | ] 607 | }, 608 | "execution_count": 45, 609 | "metadata": {}, 610 | "output_type": "execute_result" 611 | }, 612 | { 613 | "data": { 614 | "image/png": "\n", 615 | "text/plain": [ 616 | "
" 617 | ] 618 | }, 619 | "metadata": {}, 620 | "output_type": "display_data" 621 | } 622 | ], 623 | "source": [ 624 | "sns.boxplot(x='stars',y='text length',data=yelp,palette='rainbow')" 625 | ] 626 | }, 627 | { 628 | "cell_type": "markdown", 629 | "metadata": {}, 630 | "source": [ 631 | "**Create a countplot of the number of occurrences for each type of star rating.**" 632 | ] 633 | }, 634 | { 635 | "cell_type": "code", 636 | "execution_count": 48, 637 | "metadata": {}, 638 | "outputs": [ 639 | { 640 | "data": { 641 | "text/plain": [ 642 | "" 643 | ] 644 | }, 645 | "execution_count": 48, 646 | "metadata": {}, 647 | "output_type": "execute_result" 648 | }, 649 | { 650 | "data": { 651 | "image/png": "\n", 652 | "text/plain": [ 653 | "
" 654 | ] 655 | }, 656 | "metadata": {}, 657 | "output_type": "display_data" 658 | } 659 | ], 660 | "source": [ 661 | "sns.countplot(data=yelp, x='stars', palette='rainbow')" 662 | ] 663 | }, 664 | { 665 | "cell_type": "markdown", 666 | "metadata": {}, 667 | "source": [ 668 | "** Use groupby to get the mean values of the numerical columns, you should be able to create this dataframe with the operation:**" 669 | ] 670 | }, 671 | { 672 | "cell_type": "code", 673 | "execution_count": 50, 674 | "metadata": {}, 675 | "outputs": [ 676 | { 677 | "data": { 678 | "text/html": [ 679 | "
" 749 | ], 750 | "text/plain": [ 751 | " cool useful funny text length\n", 752 | "stars \n", 753 | "1 0.576769 1.604806 1.056075 826.515354\n", 754 | "2 0.719525 1.563107 0.875944 842.256742\n", 755 | "3 0.788501 1.306639 0.694730 758.498289\n", 756 | "4 0.954623 1.395916 0.670448 712.923142\n", 757 | "5 0.944261 1.381780 0.608631 624.999101" 758 | ] 759 | }, 760 | "execution_count": 50, 761 | "metadata": {}, 762 | "output_type": "execute_result" 763 | } 764 | ], 765 | "source": [ 766 | "stars = yelp.groupby('stars').mean(numeric_only=True)\n", 767 | "stars" 768 | ] 769 | }, 770 | { 771 | "cell_type": "markdown", 772 | "metadata": {}, 773 | "source": [ 774 | "**Use the corr() method on that groupby dataframe to produce this dataframe:**" 775 | ] 776 | }, 777 | { 778 | "cell_type": "code", 779 | "execution_count": 51, 780 | "metadata": {}, 781 | "outputs": [ 782 | { 783 | "data": { 784 | "text/html": [ 785 | "
" 841 | ], 842 | "text/plain": [ 843 | " cool useful funny text length\n", 844 | "cool 1.000000 -0.743329 -0.944939 -0.857664\n", 845 | "useful -0.743329 1.000000 0.894506 0.699881\n", 846 | "funny -0.944939 0.894506 1.000000 0.843461\n", 847 | "text length -0.857664 0.699881 0.843461 1.000000" 848 | ] 849 | }, 850 | "execution_count": 51, 851 | "metadata": {}, 852 | "output_type": "execute_result" 853 | } 854 | ], 855 | "source": [ 856 | "stars.corr()" 857 | ] 858 | }, 859 | { 860 | "cell_type": "markdown", 861 | "metadata": {}, 862 | "source": [ 863 | "**Then use seaborn to create a heatmap based off that .corr() dataframe:**" 864 | ] 865 | }, 866 | { 867 | "cell_type": "code", 868 | "execution_count": 54, 869 | "metadata": {}, 870 | "outputs": [ 871 | { 872 | "data": { 873 | "text/plain": [ 874 | "" 875 | ] 876 | }, 877 | "execution_count": 54, 878 | "metadata": {}, 879 | "output_type": "execute_result" 880 | }, 881 | { 882 | "data": { 883 | "image/png": "\n", 884 | "text/plain": [ 885 | "
" 886 | ] 887 | }, 888 | "metadata": {}, 889 | "output_type": "display_data" 890 | } 891 | ], 892 | "source": [ 893 | "sns.heatmap(stars.corr(), cmap='coolwarm', annot=True)" 894 | ] 895 | }, 896 | { 897 | "cell_type": "markdown", 898 | "metadata": {}, 899 | "source": [ 900 | "## NLP Classification Task\n", 901 | "\n", 902 | "Let's move on to the actual task. To make things a little easier, go ahead and only grab reviews that were either 1 star or 5 stars.\n", 903 | "\n", 904 | "**Create a dataframe called yelp_class that contains the columns of yelp dataframe but for only the 1 or 5 star reviews.**" 905 | ] 906 | }, 907 | { 908 | "cell_type": "code", 909 | "execution_count": 55, 910 | "metadata": {}, 911 | "outputs": [], 912 | "source": [ 913 | "yelp_class = yelp[(yelp.stars==1) | (yelp.stars==5)]" 914 | ] 915 | }, 916 | { 917 | "cell_type": "markdown", 918 | "metadata": {}, 919 | "source": [ 920 | "** Create two objects X and y. X will be the 'text' column of yelp_class and y will be the 'stars' column of yelp_class. (Your features and target/labels)**" 921 | ] 922 | }, 923 | { 924 | "cell_type": "code", 925 | "execution_count": 56, 926 | "metadata": {}, 927 | "outputs": [], 928 | "source": [ 929 | "X = yelp_class['text']\n", 930 | "y = yelp_class['stars']" 931 | ] 932 | }, 933 | { 934 | "cell_type": "markdown", 935 | "metadata": {}, 936 | "source": [ 937 | "**Import CountVectorizer and create a CountVectorizer object.**" 938 | ] 939 | }, 940 | { 941 | "cell_type": "code", 942 | "execution_count": 57, 943 | "metadata": {}, 944 | "outputs": [], 945 | "source": [ 946 | "from sklearn.feature_extraction.text import CountVectorizer\n", 947 | "cv = CountVectorizer()" 948 | ] 949 | }, 950 | { 951 | "cell_type": "markdown", 952 | "metadata": {}, 953 | "source": [ 954 | "** Use the fit_transform method on the CountVectorizer object and pass in X (the 'text' column). 
Save this result by overwriting X.**" 955 | ] 956 | }, 957 | { 958 | "cell_type": "code", 959 | "execution_count": 58, 960 | "metadata": {}, 961 | "outputs": [], 962 | "source": [ 963 | "X = cv.fit_transform(X)" 964 | ] 965 | }, 966 | { 967 | "cell_type": "markdown", 968 | "metadata": {}, 969 | "source": [ 970 | "## Train Test Split\n", 971 | "\n", 972 | "Let's split our data into training and testing data.\n", 973 | "\n", 974 | "**Use train_test_split to split up the data into X_train, X_test, y_train, y_test. Use test_size=0.3 and random_state=101.**" 975 | ] 976 | }, 977 | { 978 | "cell_type": "code", 979 | "execution_count": 59, 980 | "metadata": {}, 981 | "outputs": [], 982 | "source": [ 983 | "from sklearn.model_selection import train_test_split" 984 | ] 985 | }, 986 | { 987 | "cell_type": "code", 988 | "execution_count": 60, 989 | "metadata": {}, 990 | "outputs": [], 991 | "source": [ 992 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)" 993 | ] 994 | }, 995 | { 996 | "cell_type": "markdown", 997 | "metadata": {}, 998 | "source": [ 999 | "## Training a Model\n", 1000 | "\n", 1001 | "Time to train a model!\n", 1002 | "\n", 1003 | "**Import MultinomialNB, create an instance of the estimator, and call it nb.**" 1004 | ] 1005 | }, 1006 | { 1007 | "cell_type": "code", 1008 | "execution_count": 61, 1009 | "metadata": {}, 1010 | "outputs": [], 1011 | "source": [ 1012 | "from sklearn.naive_bayes import MultinomialNB\n", 1013 | "nb = MultinomialNB()" 1014 | ] 1015 | }, 1016 | { 1017 | "cell_type": "markdown", 1018 | "metadata": {}, 1019 | "source": [ 1020 | "**Now fit nb using the training data.**" 1021 | ] 1022 | }, 1023 | { 1024 | "cell_type": "code", 1025 | "execution_count": 63, 1026 | "metadata": {}, 1027 | "outputs": [ 1028 | { 1029 | "data": { 1030 | "text/plain": [ 1031 | "MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)" 1032 | ] 1033 | }, 1034 | "execution_count": 63, 1035 | "metadata": {}, 1036 | 
"output_type": "execute_result" 1037 | } 1038 | ], 1039 | "source": [ 1040 | "nb.fit(X_train, y_train)" 1041 | ] 1042 | }, 1043 | { 1044 | "cell_type": "markdown", 1045 | "metadata": {}, 1046 | "source": [ 1047 | "## Predictions and Evaluations\n", 1048 | "\n", 1049 | "Time to see how our model did!\n", 1050 | "\n", 1051 | "**Use the predict method off of nb to predict labels from X_test.**" 1052 | ] 1053 | }, 1054 | { 1055 | "cell_type": "code", 1056 | "execution_count": 65, 1057 | "metadata": {}, 1058 | "outputs": [], 1059 | "source": [ 1060 | "predictions = nb.predict(X_test)" 1061 | ] 1062 | }, 1063 | { 1064 | "cell_type": "markdown", 1065 | "metadata": {}, 1066 | "source": [ 1067 | "** Create a confusion matrix and classification report using these predictions and y_test **" 1068 | ] 1069 | }, 1070 | { 1071 | "cell_type": "code", 1072 | "execution_count": 66, 1073 | "metadata": {}, 1074 | "outputs": [], 1075 | "source": [ 1076 | "from sklearn.metrics import confusion_matrix,classification_report" 1077 | ] 1078 | }, 1079 | { 1080 | "cell_type": "code", 1081 | "execution_count": 67, 1082 | "metadata": {}, 1083 | "outputs": [ 1084 | { 1085 | "name": "stdout", 1086 | "output_type": "stream", 1087 | "text": [ 1088 | "[[159 69]\n", 1089 | " [ 22 976]]\n", 1090 | "\n", 1091 | "\n", 1092 | " precision recall f1-score support\n", 1093 | "\n", 1094 | " 1 0.88 0.70 0.78 228\n", 1095 | " 5 0.93 0.98 0.96 998\n", 1096 | "\n", 1097 | "avg / total 0.92 0.93 0.92 1226\n", 1098 | "\n" 1099 | ] 1100 | } 1101 | ], 1102 | "source": [ 1103 | "print(confusion_matrix(y_test,predictions))\n", 1104 | "print('\\n')\n", 1105 | "print(classification_report(y_test,predictions))" 1106 | ] 1107 | }, 1108 | { 1109 | "cell_type": "markdown", 1110 | "metadata": {}, 1111 | "source": [ 1112 | "**Great! 
Let's see what happens if we try to include TF-IDF to this process using a pipeline.**" 1113 | ] 1114 | }, 1115 | { 1116 | "cell_type": "markdown", 1117 | "metadata": {}, 1118 | "source": [ 1119 | "# Using Text Processing\n", 1120 | "\n", 1121 | "** Import TfidfTransformer from sklearn. **" 1122 | ] 1123 | }, 1124 | { 1125 | "cell_type": "code", 1126 | "execution_count": 68, 1127 | "metadata": {}, 1128 | "outputs": [], 1129 | "source": [ 1130 | "from sklearn.feature_extraction.text import TfidfTransformer" 1131 | ] 1132 | }, 1133 | { 1134 | "cell_type": "markdown", 1135 | "metadata": {}, 1136 | "source": [ 1137 | "** Import Pipeline from sklearn. **" 1138 | ] 1139 | }, 1140 | { 1141 | "cell_type": "code", 1142 | "execution_count": 69, 1143 | "metadata": {}, 1144 | "outputs": [], 1145 | "source": [ 1146 | "from sklearn.pipeline import Pipeline" 1147 | ] 1148 | }, 1149 | { 1150 | "cell_type": "markdown", 1151 | "metadata": {}, 1152 | "source": [ 1153 | "** Now create a pipeline with the following steps:CountVectorizer(), TfidfTransformer(),MultinomialNB()**" 1154 | ] 1155 | }, 1156 | { 1157 | "cell_type": "code", 1158 | "execution_count": 70, 1159 | "metadata": {}, 1160 | "outputs": [], 1161 | "source": [ 1162 | "pipeline = Pipeline([\n", 1163 | " ('bow', CountVectorizer()), # strings to token integer counts\n", 1164 | " ('tfidf', TfidfTransformer()), # integer counts to weighted TF-IDF scores\n", 1165 | " ('classifier', MultinomialNB()), # train on TF-IDF vectors w/ Naive Bayes classifier\n", 1166 | "])" 1167 | ] 1168 | }, 1169 | { 1170 | "cell_type": "markdown", 1171 | "metadata": {}, 1172 | "source": [ 1173 | "## Using the Pipeline\n", 1174 | "\n", 1175 | "**Time to use the pipeline! Remember this pipeline has all your pre-process steps in it already, meaning we'll need to re-split the original data (Remember that we overwrote X as the CountVectorized version. 
What we need is just the text).**" 1176 | ] 1177 | }, 1178 | { 1179 | "cell_type": "markdown", 1180 | "metadata": {}, 1181 | "source": [ 1182 | "### Train Test Split\n", 1183 | "\n", 1184 | "**Redo the train test split on the yelp_class object.**" 1185 | ] 1186 | }, 1187 | { 1188 | "cell_type": "code", 1189 | "execution_count": 71, 1190 | "metadata": {}, 1191 | "outputs": [], 1192 | "source": [ 1193 | "X = yelp_class['text']\n", 1194 | "y = yelp_class['stars']\n", 1195 | "X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.3,random_state=101)" 1196 | ] 1197 | }, 1198 | { 1199 | "cell_type": "markdown", 1200 | "metadata": {}, 1201 | "source": [ 1202 | "**Now fit the pipeline to the training data. Remember you can't use the same training data as last time because that data has already been vectorized. We need to pass in just the text and labels.**" 1203 | ] 1204 | }, 1205 | { 1206 | "cell_type": "code", 1207 | "execution_count": 72, 1208 | "metadata": {}, 1209 | "outputs": [ 1210 | { 1211 | "data": { 1212 | "text/plain": [ 1213 | "Pipeline(memory=None,\n", 1214 | " steps=[('bow', CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n", 1215 | " dtype=, encoding='utf-8', input='content',\n", 1216 | " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n", 1217 | " ngram_range=(1, 1), preprocessor=None, stop_words=None,\n", 1218 | " strip_...f=False, use_idf=True)), ('classifier', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])" 1219 | ] 1220 | }, 1221 | "execution_count": 72, 1222 | "metadata": {}, 1223 | "output_type": "execute_result" 1224 | } 1225 | ], 1226 | "source": [ 1227 | "# May take some time\n", 1228 | "pipeline.fit(X_train,y_train)" 1229 | ] 1230 | }, 1231 | { 1232 | "cell_type": "markdown", 1233 | "metadata": {}, 1234 | "source": [ 1235 | "### Predictions and Evaluation\n", 1236 | "\n", 1237 | "**Now use the pipeline to predict from X_test and create a classification report and confusion matrix. 
You should notice strange results.**" 1238 | ] 1239 | }, 1240 | { 1241 | "cell_type": "code", 1242 | "execution_count": 74, 1243 | "metadata": {}, 1244 | "outputs": [], 1245 | "source": [ 1246 | "predictions = pipeline.predict(X_test)" 1247 | ] 1248 | }, 1249 | { 1250 | "cell_type": "code", 1251 | "execution_count": 75, 1252 | "metadata": {}, 1253 | "outputs": [ 1254 | { 1255 | "name": "stdout", 1256 | "output_type": "stream", 1257 | "text": [ 1258 | "[[ 0 228]\n", 1259 | " [ 0 998]]\n", 1260 | "\n", 1261 | "\n", 1262 | " precision recall f1-score support\n", 1263 | "\n", 1264 | " 1 0.00 0.00 0.00 228\n", 1265 | " 5 0.81 1.00 0.90 998\n", 1266 | "\n", 1267 | "avg / total 0.66 0.81 0.73 1226\n", 1268 | "\n" 1269 | ] 1270 | }, 1271 | { 1272 | "name": "stderr", 1273 | "output_type": "stream", 1274 | "text": [ 1275 | "/home/marco/anaconda3/lib/python3.6/site-packages/sklearn/metrics/classification.py:1135: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.\n", 1276 | " 'precision', 'predicted', average, warn_for)\n" 1277 | ] 1278 | } 1279 | ], 1280 | "source": [ 1281 | "print(confusion_matrix(y_test,predictions))\n", 1282 | "print('\\n')\n", 1283 | "print(classification_report(y_test,predictions))" 1284 | ] 1285 | }, 1286 | { 1287 | "cell_type": "markdown", 1288 | "metadata": {}, 1289 | "source": [ 1290 | "Looks like TF-IDF actually made things worse! The pipeline now predicts 5 stars for every review in the test set, so recall on the 1-star class drops to zero. That is it for this project, but there is still a lot more you can play with:\n", 1291 | "\n", 1292 | "**Some other things to try:**\n", 1293 | "Go back and play around with the pipeline steps to see whether a custom analyzer like the one we created in the lecture helps (note: it probably won't). Or recreate the pipeline with just CountVectorizer() and MultinomialNB(). Does changing the ML model at the end to another classifier help at all?" 
1294 | ] 1295 | }, 1296 | { 1297 | "cell_type": "markdown", 1298 | "metadata": {}, 1299 | "source": [ 1300 | "# Great Job!" 1301 | ] 1302 | } 1303 | ], 1304 | "metadata": { 1305 | "kernelspec": { 1306 | "display_name": "Python 3", 1307 | "language": "python", 1308 | "name": "python3" 1309 | }, 1310 | "language_info": { 1311 | "codemirror_mode": { 1312 | "name": "ipython", 1313 | "version": 3 1314 | }, 1315 | "file_extension": ".py", 1316 | "mimetype": "text/x-python", 1317 | "name": "python", 1318 | "nbconvert_exporter": "python", 1319 | "pygments_lexer": "ipython3", 1320 | "version": "3.6.5" 1321 | } 1322 | }, 1323 | "nbformat": 4, 1324 | "nbformat_minor": 1 1325 | } 1326 | --------------------------------------------------------------------------------