├── README.md
├── analysis-of-opioid-prescription-problem
│   ├── README.md
│   ├── data
│   │   ├── 123
│   │   ├── mhincome.csv
│   │   ├── opioids.csv
│   │   ├── overdoses.csv
│   │   ├── overdosesnew.csv
│   │   └── prescriber-info.csv
│   ├── images
│   │   ├── 123
│   │   └── opioids.png
│   └── notebooks
│       ├── 123
│       └── opioid-prescription-problem.ipynb
├── churn
│   ├── README.md
│   ├── data
│   │   └── 123.png
│   ├── images
│   │   ├── 123.png
│   │   ├── balancedchurn.png
│   │   ├── baseline.png
│   │   ├── cellphone.jpg
│   │   ├── churnprob.png
│   │   ├── cm.png
│   │   ├── cms.png
│   │   ├── cms1.png
│   │   ├── cms2.png
│   │   ├── df_churn_new.png
│   │   ├── featurerf.png
│   │   ├── imbalancechurn.png
│   │   ├── model_comparison.png
│   │   └── predictions.png
│   └── notebooks
│       └── predicting-customer-churn.ipynb
├── click-prediction
│   ├── README.md
│   ├── images
│   │   ├── 123
│   │   └── click1.png
│   ├── notebooks
│   │   └── click-predictive-model.ipynb
│   └── optimal-bidding-strategies-in-online-display-advertising .pdf
├── predicting-number-of-comments-on-reddit-using-random-forest-classifier
│   ├── 123.png
│   ├── README.md
│   ├── images
│   │   ├── 123.png
│   │   ├── Reddit-logo.png
│   │   ├── redditRF.png
│   │   ├── redditpage.png
│   │   └── redditwordshist.png
│   └── notebooks
│       ├── 123.png
│       └── project-3-marco-tavora.ipynb
├── retail-strategy
│   ├── README.md
│   ├── data
│   │   ├── 123
│   │   ├── ia_zip_city_county_sqkm.csv
│   │   ├── iowa_incomes.xls
│   │   └── pop_iowa_per_county.csv
│   ├── images
│   │   ├── 123
│   │   ├── 123.png
│   │   ├── hm3.png
│   │   ├── liquor.jpeg
│   │   ├── output.png
│   │   └── test.jpg
│   └── notebooks
│       └── retail-recommendations.ipynb
└── tennis
    ├── 123.png
    ├── README.md
    ├── images
    │   ├── 123.png
    │   ├── ATP_World_Tour.png
    │   ├── ROC.png
    │   ├── balanced.png
    │   ├── cv_score.png
    │   ├── decisiontree.png
    │   ├── imbalance.png
    │   ├── rf_features.png
    │   ├── rounds.png
    │   ├── surfaces.png
    │   └── tennis_df.png
    ├── notebooks
    │   ├── 123.png
    │   └── Final_Project_Marco_Tavora-DATNYC41_GA.ipynb
    └── slides
        ├── 123.png
        └── Final_Project_Marco_Tavora_DATNYC41.pdf
/README.md:
--------------------------------------------------------------------------------
1 | ## Supervised Machine Learning Projects
2 |
3 | [MIT license](https://opensource.org/licenses/MIT)
4 |
5 |
6 |
7 |
9 |
10 |
11 |
12 |
13 |
14 |
15 |
16 |
17 |
18 | Notebooks and descriptions •
19 | Contact Information
20 |
21 |
22 |
23 | ### Notebooks and descriptions
24 | | Notebook | Brief Description |
25 | |--------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------|
26 | |[predicting-comments-on-reddit](http://nbviewer.jupyter.org/github/marcotav/machine-learning-classification-projects/blob/master/predicting-number-of-comments-on-reddit-using-random-forest-classifier/notebooks/project-3-marco-tavora.ipynb) | In this project I determine which characteristics of a post on Reddit contribute most to the overall interaction as measured by number of comments.|
27 | |[tennis-matches-prediction](http://nbviewer.jupyter.org/github/marcotav/machine-learning-classification-projects/blob/master/tennis/notebooks/Final_Project_Marco_Tavora-DATNYC41_GA.ipynb) | The goal of the project is to predict the probability that the higher-ranked player will win a tennis match. I will call that a `win` (as opposed to an upset).|
28 | |[churn-analysis](http://nbviewer.jupyter.org/github/marcotav/machine-learning-classification-projects/blob/master/churn/notebooks/predicting-customer-churn.ipynb) | This project was done in collaboration with [Corey Girard](https://github.com/coreygirard/). A mobile device company is having a major problem with customer retention. Customers switching from one company to another is called churn. Our goal in this analysis is to understand the problem, identify behaviors which are strongly correlated with churn, and devise a solution.|
29 | |[click-prediction](http://nbviewer.jupyter.org/github/marcotav/machine-learning-classification-projects/blob/master/click-prediction/notebooks/click-predictive-model.ipynb) | Many ads are actually sold on a "pay-per-click" (PPC) basis, meaning the company only pays for ad clicks, not ad views. Thus your optimal approach (as a search engine) is actually to choose an ad based on "expected value", meaning the price of a click times the likelihood that the ad will be clicked [...] In order for you to maximize expected value, you therefore need to accurately predict the likelihood that a given ad will be clicked, also known as "click-through rate" (CTR). In this project I will predict the likelihood that a given online ad will be clicked.|
30 | | [retail-store-expansion-analysis-with-lasso-and-ridge-regressions](http://nbviewer.jupyter.org/github/marcotav/machine-learning-classification-projects/blob/master/retail-strategy/notebooks/retail-recommendations.ipynb) | Based on a dataset containing the spirits purchase information of Iowa Class E liquor licensees by product and date of purchase, this project provides recommendations on where to open new stores in the state of Iowa. To devise an expansion strategy, I first needed to understand the data, and for that I conducted a thorough exploratory data analysis (EDA). With the data in hand I built multivariate regression models of total sales by county, using both Lasso and Ridge regularization, and based on these models I made recommendations about new locations.|
31 |
32 |
33 |
34 |
35 |
36 |
37 |
38 |
39 |
40 |
41 |
42 |
43 | ## Contact Information
44 |
45 | Feel free to contact me:
46 |
47 | * Email: [marcotav65@gmail.com](mailto:marcotav65@gmail.com)
48 | * GitHub: [marcotav](https://github.com/marcotav)
49 | * LinkedIn: [marco-tavora](https://www.linkedin.com/in/marco-tavora)
50 | * Website: [marcotavora.me](http://www.marcotavora.me)
51 |
52 |
53 |
54 |
55 |
56 |
--------------------------------------------------------------------------------
/analysis-of-opioid-prescription-problem/README.md:
--------------------------------------------------------------------------------
1 | ## U.S. Opiate Prescriptions/Overdoses [[view code]](http://nbviewer.jupyter.org/github/marcotav/machine-learning-classification-projects/blob/master/analysis-of-opioid-prescription-problem/notebooks/opioid-prescription-problem.ipynb)
2 |   
3 |
4 |
5 | **The code is available [here](http://nbviewer.jupyter.org/github/marcotav/machine-learning-classification-projects/blob/master/analysis-of-opioid-prescription-problem/notebooks/opioid-prescription-problem.ipynb) or by clicking on the [view code] link above.**
6 |
7 |
8 |
9 |
10 |
11 |
12 |
13 |
14 |
15 |
16 |
17 |
18 |
19 |
20 | Brief Introduction •
21 | Dataset •
22 | Project Goal
23 |
24 |
25 |
26 | ## Brief Introduction
27 |
28 | Accidental death by drug overdose is a rising trend in the United States. What can you do to help? (From Kaggle)
29 |
30 |
31 | ## Dataset
32 |
33 | This dataset contains:
34 | - Summaries of prescription records for 250 common **opioid** and **non-opioid** drugs written by 25,000 unique licensed medical professionals in 2014 in the United States for citizens covered under Class D Medicare
35 | - Metadata about the doctors themselves.
36 | - The data is in a format with one row per prescriber, reduced to 25,000 unique prescribers to keep it manageable.
37 | - The main data is in `prescriber-info.csv`.
38 | - There is also `opioids.csv`, which contains the names of all opioid drugs included in the data.
39 | - The file `overdoses.csv` contains information on opioid-related drug overdose fatalities.
40 |
41 |
42 | The data consists of the following characteristics for each prescriber:
43 | - NPI – unique National Provider Identifier number
44 | - Gender - (M/F)
45 | - State - U.S. State by abbreviation
46 | - Credentials - set of initials indicative of medical degree
47 | - Specialty - description of type of medicinal practice
48 | - A long list of drugs with numeric values indicating the total number of prescriptions written for the year by that individual
49 | - `Opioid.Prescriber` - a boolean label indicating whether or not that individual prescribed opiate drugs more than 10 times in the year
50 |
51 |
52 | ## Project Goal
53 |
54 | The increase in overdose fatalities is a well-known problem, and the search for possible solutions is an ongoing effort. This dataset can be used to detect sources of significant quantities of opiate prescriptions.
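
A minimal sketch of how the files fit together (column names taken from the CSVs in `data/`; pandas assumed):

```
import pandas as pd

prescribers = pd.read_csv('data/prescriber-info.csv')          # one row per prescriber
opioids = pd.read_csv('data/opioids.csv')                      # drug name -> generic name
overdoses = pd.read_csv('data/overdoses.csv', thousands=',')   # per-state fatalities

# per-state overdose death rate, as stored in overdosesnew.csv
overdoses['Deaths/Population'] = overdoses['Deaths'] / overdoses['Population']
print(overdoses.sort_values('Deaths/Population', ascending=False).head())
```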
55 |
56 |
--------------------------------------------------------------------------------
/analysis-of-opioid-prescription-problem/data/123:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/analysis-of-opioid-prescription-problem/data/mhincome.csv:
--------------------------------------------------------------------------------
1 | State,Income
Mississippi,40593.00
Arkansas,41995.00
West Virginia,42019.00
Alabama,44765.00
Kentucky,45215.00
New Mexico,45382.00
--------------------------------------------------------------------------------
/analysis-of-opioid-prescription-problem/data/opioids.csv:
--------------------------------------------------------------------------------
1 | Drug Name,Generic Name
2 | ABSTRAL,FENTANYL CITRATE
3 | ACETAMINOPHEN-CODEINE,ACETAMINOPHEN WITH CODEINE
4 | ACTIQ,FENTANYL CITRATE
5 | ASCOMP WITH CODEINE,CODEINE/BUTALBITAL/ASA/CAFFEIN
6 | ASPIRIN-CAFFEINE-DIHYDROCODEIN,DIHYDROCODEINE/ASPIRIN/CAFFEIN
7 | AVINZA,MORPHINE SULFATE
8 | BELLADONNA-OPIUM,OPIUM/BELLADONNA ALKALOIDS
9 | BUPRENORPHINE HCL,BUPRENORPHINE HCL
10 | BUTALB-ACETAMINOPH-CAFF-CODEIN,BUTALBIT/ACETAMIN/CAFF/CODEINE
11 | BUTALB-CAFF-ACETAMINOPH-CODEIN,BUTALBIT/ACETAMIN/CAFF/CODEINE
12 | BUTALBITAL COMPOUND-CODEINE,CODEINE/BUTALBITAL/ASA/CAFFEIN
13 | BUTORPHANOL TARTRATE,BUTORPHANOL TARTRATE
14 | BUTRANS,BUPRENORPHINE
15 | CAPITAL W-CODEINE,ACETAMINOPHEN WITH CODEINE
16 | CARISOPRODOL COMPOUND-CODEINE,CODEINE/CARISOPRODOL/ASPIRIN
17 | CARISOPRODOL-ASPIRIN-CODEINE,CODEINE/CARISOPRODOL/ASPIRIN
18 | CODEINE SULFATE,CODEINE SULFATE
19 | CO-GESIC,HYDROCODONE/ACETAMINOPHEN
20 | CONZIP,TRAMADOL HCL
21 | DEMEROL,MEPERIDINE HCL
22 | DEMEROL,MEPERIDINE HCL/PF
23 | DILAUDID,HYDROMORPHONE HCL
24 | DILAUDID,HYDROMORPHONE HCL/PF
25 | DILAUDID-HP,HYDROMORPHONE HCL/PF
26 | DISKETS,METHADONE HCL
27 | DOLOPHINE HCL,METHADONE HCL
28 | DURAGESIC,FENTANYL
29 | DURAMORPH,MORPHINE SULFATE/PF
30 | ENDOCET,OXYCODONE HCL/ACETAMINOPHEN
31 | ENDODAN,OXYCODONE HCL/ASPIRIN
32 | EXALGO,HYDROMORPHONE HCL
33 | FENTANYL,FENTANYL
34 | FENTANYL CITRATE,FENTANYL CITRATE
35 | FENTORA,FENTANYL CITRATE
36 | FIORICET WITH CODEINE,BUTALBIT/ACETAMIN/CAFF/CODEINE
37 | FIORINAL WITH CODEINE #3,CODEINE/BUTALBITAL/ASA/CAFFEIN
38 | HYCET,HYDROCODONE/ACETAMINOPHEN
39 | HYDROCODONE-ACETAMINOPHEN,HYDROCODONE/ACETAMINOPHEN
40 | HYDROCODONE-IBUPROFEN,HYDROCODONE/IBUPROFEN
41 | HYDROMORPHONE ER,HYDROMORPHONE HCL
42 | HYDROMORPHONE HCL,HYDROMORPHONE HCL
43 | HYDROMORPHONE HCL,HYDROMORPHONE HCL/PF
44 | IBUDONE,HYDROCODONE/IBUPROFEN
45 | INFUMORPH,MORPHINE SULFATE/PF
46 | KADIAN,MORPHINE SULFATE
47 | LAZANDA,FENTANYL CITRATE
48 | LEVORPHANOL TARTRATE,LEVORPHANOL TARTRATE
49 | LORCET,HYDROCODONE/ACETAMINOPHEN
50 | LORCET 10-650,HYDROCODONE/ACETAMINOPHEN
51 | LORCET HD,HYDROCODONE/ACETAMINOPHEN
52 | LORCET PLUS,HYDROCODONE/ACETAMINOPHEN
53 | LORTAB,HYDROCODONE/ACETAMINOPHEN
54 | MAGNACET,OXYCODONE HCL/ACETAMINOPHEN
55 | MEPERIDINE HCL,MEPERIDINE HCL
56 | MEPERIDINE HCL,MEPERIDINE HCL/PF
57 | MEPERITAB,MEPERIDINE HCL
58 | METHADONE HCL,METHADONE HCL
59 | METHADONE INTENSOL,METHADONE HCL
60 | METHADOSE,METHADONE HCL
61 | MORPHINE SULFATE,MORPHINE SULFATE
62 | MORPHINE SULFATE,MORPHINE SULFATE/PF
63 | MORPHINE SULFATE ER,MORPHINE SULFATE
64 | MS CONTIN,MORPHINE SULFATE
65 | NALBUPHINE HCL,NALBUPHINE HCL
66 | NORCO,HYDROCODONE/ACETAMINOPHEN
67 | NUCYNTA,TAPENTADOL HCL
68 | NUCYNTA ER,TAPENTADOL HCL
69 | OPANA,OXYMORPHONE HCL
70 | OPANA ER,OXYMORPHONE HCL
71 | OPIUM TINCTURE,OPIUM TINCTURE
72 | OXECTA,OXYCODONE HCL
73 | OXYCODONE HCL,OXYCODONE HCL
74 | OXYCODONE HCL ER,OXYCODONE HCL
75 | OXYCODONE HCL-ASPIRIN,OXYCODONE HCL/ASPIRIN
76 | OXYCODONE HCL-IBUPROFEN,IBUPROFEN/OXYCODONE HCL
77 | OXYCODONE-ACETAMINOPHEN,OXYCODONE HCL/ACETAMINOPHEN
78 | OXYCONTIN,OXYCODONE HCL
79 | OXYMORPHONE HCL,OXYMORPHONE HCL
80 | OXYMORPHONE HCL ER,OXYMORPHONE HCL
81 | PENTAZOCINE-ACETAMINOPHEN,PENTAZOCINE HCL/ACETAMINOPHEN
82 | PENTAZOCINE-NALOXONE HCL,PENTAZOCINE HCL/NALOXONE HCL
83 | PERCOCET,OXYCODONE HCL/ACETAMINOPHEN
84 | PERCODAN,OXYCODONE HCL/ASPIRIN
85 | PRIMLEV,OXYCODONE HCL/ACETAMINOPHEN
86 | REPREXAIN,HYDROCODONE/IBUPROFEN
87 | ROXICET,OXYCODONE HCL/ACETAMINOPHEN
88 | ROXICODONE,OXYCODONE HCL
89 | RYBIX ODT,TRAMADOL HCL
90 | STAGESIC,HYDROCODONE/ACETAMINOPHEN
91 | SUBSYS,FENTANYL
92 | SYNALGOS-DC,DIHYDROCODEINE/ASPIRIN/CAFFEIN
93 | TALWIN,PENTAZOCINE LACTATE
94 | TRAMADOL HCL,TRAMADOL HCL
95 | TRAMADOL HCL ER,TRAMADOL HCL
96 | TRAMADOL HCL-ACETAMINOPHEN,TRAMADOL HCL/ACETAMINOPHEN
97 | TREZIX,DHCODEINE BT/ACETAMINOPHN/CAFF
98 | TYLENOL-CODEINE NO.3,ACETAMINOPHEN WITH CODEINE
99 | TYLENOL-CODEINE NO.4,ACETAMINOPHEN WITH CODEINE
100 | ULTRACET,TRAMADOL HCL/ACETAMINOPHEN
101 | ULTRAM,TRAMADOL HCL
102 | ULTRAM ER,TRAMADOL HCL
103 | VICODIN,HYDROCODONE/ACETAMINOPHEN
104 | VICODIN ES,HYDROCODONE/ACETAMINOPHEN
105 | VICODIN HP,HYDROCODONE/ACETAMINOPHEN
106 | VICOPROFEN,HYDROCODONE/IBUPROFEN
107 | XARTEMIS XR,OXYCODONE HCL/ACETAMINOPHEN
108 | XODOL 10-300,HYDROCODONE/ACETAMINOPHEN
109 | XODOL 5-300,HYDROCODONE/ACETAMINOPHEN
110 | XODOL 7.5-300,HYDROCODONE/ACETAMINOPHEN
111 | XYLON 10,HYDROCODONE/IBUPROFEN
112 | ZAMICET,HYDROCODONE/ACETAMINOPHEN
113 | ZOHYDRO ER,HYDROCODONE BITARTRATE
114 | ZOLVIT,HYDROCODONE/ACETAMINOPHEN
--------------------------------------------------------------------------------
/analysis-of-opioid-prescription-problem/data/overdoses.csv:
--------------------------------------------------------------------------------
1 | "State","Population","Deaths","Abbrev"
2 | "Alabama","4,833,722","723","AL"
3 | "Alaska","735,132","124","AK"
4 | "Arizona","6,626,624","1,211","AZ"
5 | "Arkansas","2,959,373","356","AR"
6 | "California","38,332,521","4,521","CA"
7 | "Colorado","5,268,367","899","CO"
8 | "Connecticut","3,596,080","623","CT"
9 | "Delaware","925,749","189","DE"
10 | "Florida","19,552,860","2,634","FL"
11 | "Georgia","9,992,167","1,206","GA"
12 | "Hawaii","1,404,054","157","HI"
13 | "Idaho","1,612,136","212","ID"
14 | "Illinois","12,882,135","1,705","IL"
15 | "Indiana","6,570,902","1,172","IN"
16 | "Iowa","3,090,416","264","IA"
17 | "Kansas","2,893,957","332","KS"
18 | "Kentucky","4,395,295","1,077","KY"
19 | "Louisiana","4,625,470","777","LA"
20 | "Maine","1,328,302","216","ME"
21 | "Maryland","5,928,814","1,070","MD"
22 | "Massachusetts","6,692,824","1,289","MA"
23 | "Michigan","9,895,622","1,762","MI"
24 | "Minnesota","5,420,380","517","MN"
25 | "Mississippi","2,991,207","336","MS"
26 | "Missouri","6,044,171","1,067","MO"
27 | "Montana","1,015,165","125","MT"
28 | "Nebraska","1,868,516","125","NE"
29 | "Nevada","2,790,136","545","NV"
30 | "New Hampshire","1,323,459","334","NH"
31 | "New Jersey","8,899,339","1,253","NJ"
32 | "New Mexico","2,085,287","547","NM"
33 | "New York","19,651,127","2,300","NY"
34 | "North Carolina","9,848,060","1,358","NC"
35 | "North Dakota","723,393","43","ND"
36 | "Ohio","11,570,808","2,744","OH"
37 | "Oklahoma","3,850,568","777","OK"
38 | "Oregon","3,930,065","522","OR"
39 | "Pennsylvania","12,773,801","2,732","PA"
40 | "Rhode Island","1,051,511","247","RI"
41 | "South Carolina","4,774,839","701","SC"
42 | "South Dakota","844,877","63","SD"
43 | "Tennessee","6,495,978","1,269","TN"
44 | "Texas","26,448,193","2,601","TX"
45 | "Utah","2,900,872","603","UT"
46 | "Vermont","626,630","83","VT"
47 | "Virginia","8,260,405","980","VA"
48 | "Washington","6,971,406","979","WA"
49 | "West Virginia","1,854,304","627","WV"
50 | "Wisconsin","5,742,713","853","WI"
51 | "Wyoming","582,658","109","WY"
52 |
--------------------------------------------------------------------------------
/analysis-of-opioid-prescription-problem/data/overdosesnew.csv:
--------------------------------------------------------------------------------
1 | ,State,Population,Deaths,Abbrev,Deaths/Population
2 | 0,Alabama,4833722,723,AL,0.0001495741790694624
3 | 1,Alaska,735132,124,AK,0.00016867718994683948
4 | 2,Arizona,6626624,1211,AZ,0.00018274765551810394
5 | 3,Arkansas,2959373,356,AR,0.00012029575183662215
6 | 4,California,38332521,4521,CA,0.00011794162977175439
7 | 5,Colorado,5268367,899,CO,0.00017064111137284096
8 | 6,Connecticut,3596080,623,CT,0.00017324419923917154
9 | 7,Delaware,925749,189,DE,0.00020415901070376527
10 | 8,Florida,19552860,2634,FL,0.0001347117506083509
11 | 9,Georgia,9992167,1206,GA,0.000120694540033208
12 | 10,Hawaii,1404054,157,HI,0.00011181906109024297
13 | 11,Idaho,1612136,212,ID,0.00013150255313447502
14 | 12,Illinois,12882135,1705,IL,0.00013235383731035268
15 | 13,Indiana,6570902,1172,IN,0.00017836211832104634
16 | 14,Iowa,3090416,264,IA,8.542539256850857e-05
17 | 15,Kansas,2893957,332,KS,0.00011472181514790993
18 | 16,Kentucky,4395295,1077,KY,0.00024503474738328144
19 | 17,Louisiana,4625470,777,LA,0.00016798292930231954
20 | 18,Maine,1328302,216,ME,0.00016261362250452081
21 | 19,Maryland,5928814,1070,MD,0.00018047454347530552
22 | 20,Massachusetts,6692824,1289,MA,0.0001925943368598965
23 | 21,Michigan,9895622,1762,MI,0.00017805853942278718
24 | 22,Minnesota,5420380,517,MN,9.538076666211594e-05
25 | 23,Mississippi,2991207,336,MS,0.00011232923699362832
26 | 24,Missouri,6044171,1067,MO,0.00017653372149795232
27 | 25,Montana,1015165,125,MT,0.00012313269271497737
28 | 26,Nebraska,1868516,125,NE,6.689800890118148e-05
29 | 27,Nevada,2790136,545,NV,0.00019533098028196475
30 | 28,New Hampshire,1323459,334,NH,0.0002523689815853759
31 | 29,New Jersey,8899339,1253,NJ,0.0001407969737977169
32 | 30,New Mexico,2085287,547,NM,0.0002623140124117208
33 | 31,New York,19651127,2300,NY,0.00011704163328647767
34 | 32,North Carolina,9848060,1358,NC,0.00013789517935512173
35 | 33,North Dakota,723393,43,ND,5.9442101319752885e-05
36 | 34,Ohio,11570808,2744,OH,0.00023714852065646582
37 | 35,Oklahoma,3850568,777,OK,0.00020178841147591733
38 | 36,Oregon,3930065,522,OR,0.00013282223067557407
39 | 37,Pennsylvania,12773801,2732,PA,0.00021387525921219534
40 | 38,Rhode Island,1051511,247,RI,0.00023490006286191965
41 | 39,South Carolina,4774839,701,SC,0.00014681123279758752
42 | 40,South Dakota,844877,63,SD,7.456706715888821e-05
43 | 41,Tennessee,6495978,1269,TN,0.00019535164681900091
44 | 42,Texas,26448193,2601,TX,9.83432025015849e-05
45 | 43,Utah,2900872,603,UT,0.00020786853056598153
46 | 44,Vermont,626630,83,VT,0.00013245455851140225
47 | 45,Virginia,8260405,980,VA,0.00011863825078794562
48 | 46,Washington,6971406,979,WA,0.00014043078254228773
49 | 47,West Virginia,1854304,627,WV,0.000338132258788203
50 | 48,Wisconsin,5742713,853,WI,0.00014853606648982808
51 | 49,Wyoming,582658,109,WY,0.00018707372077616716
52 |
--------------------------------------------------------------------------------
/analysis-of-opioid-prescription-problem/images/123:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/analysis-of-opioid-prescription-problem/images/opioids.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/analysis-of-opioid-prescription-problem/images/opioids.png
--------------------------------------------------------------------------------
/analysis-of-opioid-prescription-problem/notebooks/123:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/churn/README.md:
--------------------------------------------------------------------------------
1 | ## Churn Analysis [[view code]](http://nbviewer.jupyter.org/github/marcotav/machine-learning-classification-projects/blob/master/churn/notebooks/predicting-customer-churn.ipynb)
2 |       
3 |
4 | **The code is available [here](http://nbviewer.jupyter.org/github/marcotav/machine-learning-classification-projects/blob/master/churn/notebooks/predicting-customer-churn.ipynb) or by clicking on the [view code] link above.**
5 |
6 | This project was done in collaboration with [Corey Girard](https://github.com/coreygirard/)
7 |
8 |
9 |
10 |
11 |
12 |
13 |
14 | Goals •
15 | Why is this important? •
16 | Importing modules and reading the data •
17 | Data Handling and Feature Engineering •
18 | Features and target •
19 | Using `pandas-profiling` and rejecting variables with correlations above 0.9 •
20 | Scaling •
21 | Model Comparison •
22 | Building a random forest classifier using GridSearch to optimize hyperparameters
23 |
24 |
25 |
26 |
27 | ### Goals
28 | From Wikipedia,
29 |
30 | > Churn rate is a measure of the number of individuals or items moving out of a collective group over a specific period. It is one of two primary factors that determine the steady-state level of customers a business will support [...] It is an important factor for any business with a subscriber-based service model, [such as] mobile telephone networks.
31 |
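As a simple illustration of the definition (made-up numbers):

```
customers_at_start = 1000
customers_lost = 50                                 # left during the period
churn_rate = customers_lost / customers_at_start    # 0.05, i.e. 5% per period
```
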
32 | Our goal in this analysis was to predict customer churn for a mobile phone company based on customer attributes including:
33 | - Area code
34 | - Call duration at different hours
35 | - Charges
36 | - Account length
37 |
38 | See [this website](http://blog.yhat.com/posts/predicting-customer-churn-with-sklearn.html) for a similar analysis.
39 |
40 |
41 | ### Why is this important?
42 |
43 | It is a well-known fact that in several businesses (particularly the ones involving subscriptions), the acquisition of new customers costs much more than the retention of existing ones. A thorough analysis of what causes churn-rates and how to predict them can be used to build efficient customer retention strategies.
44 |
45 |
46 | ## Importing modules and reading the data
47 | ```
48 | from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
49 | from sklearn.ensemble import RandomForestClassifier
50 | import pandas as pd
51 | import seaborn as sns
52 | import numpy as np
53 | import matplotlib.pyplot as plt
54 | %matplotlib inline
55 | ```
56 | Reading the data:
57 | ```
58 | df = pd.read_csv("data.csv")
59 | ```
60 |
61 |
62 |
63 |
64 |
65 | ## Data Handling and Feature Engineering
66 | In this section the following steps are taken:
67 | - Conversion of strings into booleans
68 | - Conversion of booleans to integers
69 | - Converting the states column into dummy columns
70 | - Creation of several new features (feature engineering)
71 |
72 | The commented code follows (most of the lines were omitted for brevity):
73 | ```
74 | # convert binary strings to boolean ints
75 | df['international_plan'] = df.international_plan.replace({'Yes': 1, 'No': 0})
76 | # convert booleans to ints
77 | df['churn'] = df.churn.replace({True: 1, False: 0})
78 | # handle state and area code dummies
79 | state_dummies = pd.get_dummies(df.state)
80 | state_dummies.columns = ['state_'+c.lower() for c in state_dummies.columns.values]
81 | df.drop('state', axis='columns', inplace=True)
82 | df = pd.concat([df, state_dummies], axis='columns')
83 | area_dummies = pd.get_dummies(df.area_code)
84 | area_dummies.columns = ['area_code_'+str(c) for c in area_dummies.columns.values]
85 | df.drop('area_code', axis='columns', inplace=True)
86 | df = pd.concat([df, area_dummies], axis='columns')
87 | # feature engineering
88 | df['total_minutes'] = df.total_day_minutes + df.total_eve_minutes + df.total_intl_minutes
89 | df['total_calls'] = df.total_day_calls + df.total_eve_calls + df.total_intl_calls
90 | ```
91 |
92 |
93 | ### Features and target
94 | Defining the features matrix and the target (the churn):
95 | ```
96 | X = df[[c for c in df.columns if c != 'churn']]
97 | y = df.churn
98 | ```
99 |
100 |
101 | ### Using `pandas-profiling` and rejecting variables with correlations above 0.9
102 |
103 | The package `pandas-profiling` contains a method `get_rejected_variables(threshold)` which identifies variables with correlation higher than a threshold.
104 | ```
105 | import pandas_profiling
106 | profile = pandas_profiling.ProfileReport(X)
107 | rejected_variables = profile.get_rejected_variables(threshold=0.9)
108 | X = X.drop(rejected_variables,axis=1)
109 | ```
110 |
111 | ### Scaling
112 | ```
113 | from sklearn.preprocessing import StandardScaler
114 | cols = X.columns.tolist()
115 | scaler = StandardScaler()
116 | X[cols] = scaler.fit_transform(X[cols])
117 | X = X[cols]
118 | ```
119 | We can now build our models.
120 |
121 |
122 | ## Model Comparison
123 |
124 | We can write a for loop that does the following:
125 | - Iterates over a list of models, in this case LogisticRegression, GaussianNB, KNeighborsClassifier and LinearSVC
126 | - Trains each model using the training dataset X_train and y_train
127 | - Predicts the target using the test features X_test
128 | - Calculates the `f1_score` and cross-validation score
129 | - Builds a dataframe with that information
130 |
131 | The code will also print out the confusion matrix, from which "recall" and "precision" can be calculated:
132 | - When a customer churns, how often does the classifier predict that to happen? This is the "recall".
133 | - When the model predicts a churn, how often does that user actually churn? This is the "precision".
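
As a toy illustration with made-up counts (not from this dataset):

```
# hypothetical confusion matrix:        predicted 0   predicted 1
#                         actual 0             850            50
#                         actual 1              60            40
recall = 40 / (60 + 40)      # 0.40: fraction of actual churners the model catches
precision = 40 / (50 + 40)   # 0.444: fraction of predicted churners who actually churn
```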
134 |
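The loop below uses a few names that are presumably imported in the notebook; for completeness:

```
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
```
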
135 | ```
136 | X_train, X_test, y_train, y_test = train_test_split(X, y,
137 |                                                     test_size=0.25, random_state=0)
138 |
139 | models = [LogisticRegression, GaussianNB,
140 |           KNeighborsClassifier, LinearSVC]
141 |
142 | lst = []
143 | for model in models:
144 |     clf = model().fit(X_train, y_train)
145 |     y_pred = clf.predict(X_test)
146 |     lst.append([model.__name__,
147 |                 round(metrics.f1_score(y_test,
148 |                                        y_pred,
149 |                                        average="macro"), 3)])
150 | df = pd.DataFrame(lst, columns=['Model','f1_score'])
151 |
152 | lst_av_cross_val_scores = []
153 |
154 | for model in models:
155 |     clf = model()
156 |     scores = cross_val_score(clf, X, y, cv=5)
157 |     av_cross_val_scores = scores.mean()
158 |     lst_av_cross_val_scores.append(round(av_cross_val_scores, 3))
159 |
160 | model_names = [model.__name__ for model in models]
161 |
162 | df1 = pd.DataFrame(list(zip(model_names, lst_av_cross_val_scores)))
163 | df1.columns = ['Model','Average Cross-Validation']
164 | df_all = pd.concat([df1, df['f1_score']], axis=1)
165 | ```
166 |
167 |
168 |
169 |
170 | If we use cross-validation as our metric, we see that the `KNeighborsClassifier` has the best performance.
171 |
172 | Now we will look at confusion matrices. These are obtained as follows:
173 |
174 | ```
175 | models_names = ['LogisticRegression', 'GaussianNB', 'KNeighborsClassifier', 'LinearSVC']
176 | i = 0
177 | for preds in y_pred_lst:  # y_pred_lst holds each model's test-set predictions
178 |     print('Confusion Matrix for:', models_names[i])
179 |     i += 1
180 |     print('')
181 |     cm = pd.crosstab(pd.concat([X_test, y_test], axis=1)['churn'], preds,
182 |                      rownames=['Actual Values'], colnames=['Predicted Values'])
183 |     recall = round(cm.iloc[1,1]/(cm.iloc[1,0]+cm.iloc[1,1]), 3)
184 |     precision = round(cm.iloc[1,1]/(cm.iloc[0,1]+cm.iloc[1,1]), 3)
185 |     print(cm)
186 |     print('Recall for {} is:'.format(models_names[i-1]), recall)
187 |     print('Precision for {} is:'.format(models_names[i-1]), precision, '\n')
188 |     print('------------------------------------------------------------ \n')
189 | ```
190 | The output is:
191 |
192 |
193 |
194 |
195 |
196 | The highest recall is from `GaussianNB` and the highest precision from `KNeighborsClassifier`.
197 |
198 |
199 | ### Finding best hyperparameters
200 | As a complement, let us use a Random Forest Classifier with grid search for hyperparameter optimization.
201 |
202 |
203 | ```
204 | n_estimators = list(range(20,160,10))
205 | max_depth = list(range(2, 16, 2)) + [None]
206 | def rfscore(X, y, test_size, n_estimators, max_depth):
207 |
208 |     X_train, X_test, y_train, y_test = train_test_split(X, y,
209 |                                                         test_size=test_size, random_state=42)
210 |     rf_params = {
211 |         'n_estimators': n_estimators,
212 |         'max_depth': max_depth}  # parameters for grid search
213 |     rf_gs = GridSearchCV(RandomForestClassifier(), rf_params, cv=5, verbose=1, n_jobs=-1)
214 |     rf_gs.fit(X_train, y_train)  # training the random forest with all possible parameters
215 |     max_depth_best = rf_gs.best_params_['max_depth']  # getting the best max_depth
216 |     n_estimators_best = rf_gs.best_params_['n_estimators']  # getting the best n_estimators
217 |     print("best max_depth:", max_depth_best)
218 |     print("best n_estimators:", n_estimators_best)
219 |     best_rf_gs = RandomForestClassifier(max_depth=max_depth_best, n_estimators=n_estimators_best)  # instantiate the best model
220 |     best_rf_gs.fit(X_train, y_train)  # fitting the best model
221 |     best_rf_score = best_rf_gs.score(X_test, y_test)
222 |     print("best score is:", round(best_rf_score, 3))
223 |     preds = best_rf_gs.predict(X_test)
224 |     df_pred = pd.DataFrame(np.array(preds).reshape(len(preds), 1))
225 |     df_pred.columns = ['predictions']
226 |     print('Features and their importance:\n')
227 |     feature_importances = pd.Series(best_rf_gs.feature_importances_, index=X.columns).sort_values().tail(10)
228 |     print(feature_importances)
229 |     feature_importances.plot(kind="barh", figsize=(6,6))
230 |     return (df_pred, max_depth_best, n_estimators_best)
231 |
232 |
233 | triple = rfscore(X, y, 0.3, n_estimators, max_depth)
234 | ```
235 | ```
236 | df_pred = triple[0]  # predictions DataFrame returned by rfscore
237 | ```
238 | The predictions are:
239 | ```
240 | df_pred['predictions'].value_counts()/df_pred.shape[0]
241 | ```
242 |
243 |
244 |
245 |
246 |
247 |
248 |
249 | ### Cross Validation
250 | ```
251 | def cv_score(X, y, cv, n_estimators, max_depth):
252 |     rf = RandomForestClassifier(n_estimators=n_estimators,
253 |                                 max_depth=max_depth)
254 |     s = cross_val_score(rf, X, y, cv=cv, n_jobs=-1)
255 |     return("{} Score is :{:0.3} ± {:0.3}".format("Random Forest", s.mean().round(3), s.std().round(3)))
256 | ```
257 | ```
258 | dict_best = {'max_depth': triple[1], 'n_estimators': triple[2]}
259 | n_estimators_best = dict_best['n_estimators']
260 | max_depth_best = dict_best['max_depth']
261 | cv_score(X,y,5,n_estimators_best,max_depth_best)
262 | ```
263 | The output is:
264 | ```
265 | 'Random Forest Score is :0.774 ± 0.054'
266 | ```
267 |
268 | For the random forest, the recall and precision found are:
269 |
270 | ```
271 | recall: 0.286
272 | precision: 0.727
273 | ```
274 |
275 | Both the cross-validation score and the precision of our `RandomForestClassifier` are the highest among the five models investigated.
276 |
--------------------------------------------------------------------------------
/churn/data/123.png:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/churn/images/123.png:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/churn/images/balancedchurn.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/churn/images/balancedchurn.png
--------------------------------------------------------------------------------
/churn/images/baseline.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/churn/images/baseline.png
--------------------------------------------------------------------------------
/churn/images/cellphone.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/churn/images/cellphone.jpg
--------------------------------------------------------------------------------
/churn/images/churnprob.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/churn/images/churnprob.png
--------------------------------------------------------------------------------
/churn/images/cm.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/churn/images/cm.png
--------------------------------------------------------------------------------
/churn/images/cms.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/churn/images/cms.png
--------------------------------------------------------------------------------
/churn/images/cms1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/churn/images/cms1.png
--------------------------------------------------------------------------------
/churn/images/cms2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/churn/images/cms2.png
--------------------------------------------------------------------------------
/churn/images/df_churn_new.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/churn/images/df_churn_new.png
--------------------------------------------------------------------------------
/churn/images/featurerf.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/churn/images/featurerf.png
--------------------------------------------------------------------------------
/churn/images/imbalancechurn.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/churn/images/imbalancechurn.png
--------------------------------------------------------------------------------
/churn/images/model_comparison.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/churn/images/model_comparison.png
--------------------------------------------------------------------------------
/churn/images/predictions.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/churn/images/predictions.png
--------------------------------------------------------------------------------
/click-prediction/README.md:
--------------------------------------------------------------------------------
1 | ## Predicting clicks on ads [[view code]](http://nbviewer.jupyter.org/github/marcotav/machine-learning-classification-projects/blob/master/click-prediction/notebooks/click-predictive-model.ipynb)
2 |       
3 |
4 |
5 | **The code is available [here](http://nbviewer.jupyter.org/github/marcotav/machine-learning-classification-projects/blob/master/click-prediction/notebooks/click-predictive-model.ipynb) or by clicking on the [view code] link above.**
6 |
7 |
8 |
9 |
10 |
11 |
12 |
13 |
14 | ## Problem Statement
15 |
16 | Borrowing from [here](https://turi.com/learn/gallery/notebooks/click_through_rate_prediction_intro.html):
17 |
18 |
19 | > Many ads are actually sold on a "pay-per-click" (PPC) basis, meaning the company only pays for ad clicks, not ad views. Thus your optimal approach (as a search engine) is actually to choose an ad based on "expected value", meaning the price of a click times the likelihood that the ad will be clicked [...] In order for you to maximize expected value, you therefore need to accurately predict the likelihood that a given ad will be clicked, also known as "click-through rate" (CTR).
20 |
21 | In this project I will predict the likelihood that a given online ad will be clicked.
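
In arithmetic terms (illustrative numbers, not from the dataset):

```
price_per_click = 0.50                        # hypothetical price paid per click
p_click = 0.02                                # hypothetical predicted CTR
expected_value = price_per_click * p_click    # 0.01 expected revenue per impression
```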
22 |
23 | ## Dataset
24 |
25 | - The two files `train_click.csv` and `test_click.csv` contain ad impression attributes from a campaign.
26 | - Each row in `train_click.csv` includes a `click` column indicating whether the impression resulted in a click.
27 |
28 | ## Import the relevant libraries and the files
29 |
30 | ```
31 | import numpy as np
32 | import pandas as pd
33 | import matplotlib.pyplot as plt
34 | from fancyimpute import BiScaler, KNN, NuclearNormMinimization, SoftImpute # used for feature imputation algorithms
35 | pd.set_option('display.max_columns', None) # display all columns
36 | pd.set_option('display.max_rows', None) # displays all rows
37 | %matplotlib inline
38 | from IPython.core.interactiveshell import InteractiveShell
39 | InteractiveShell.ast_node_interactivity = "all" # so we can see the value of multiple statements at once.
40 | ```
41 |
42 | ## Import the data
43 |
44 | ```
45 | train = pd.read_csv('train_click.csv',index_col=0)
46 | test = pd.read_csv('test_click.csv',index_col=0)
47 | ```
48 |
49 | ## Data Dictionary
50 |
51 | The meaning of the columns follows:
52 | - `location` – ad placement in the website
53 | - `carrier` – mobile carrier
54 | - `device` – type of device e.g. phone, tablet or computer
55 | - `day` – weekday user saw the ad
56 | - `hour` – hour user saw the ad
57 | - `dimension` – size of ad
58 |
59 | ## Imbalance
60 | The `click` column is **heavily** unbalanced. I will correct for this later.
61 |
62 | ```
63 | import aux_func_v2 as af
64 | af.s_to_df(train['click'].value_counts())
65 | ```
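
(`aux_func_v2` is a local helper module not included in the repo; `s_to_df` presumably converts the value-counts `Series` into a small `DataFrame`, along the lines of this sketch:)

```
import pandas as pd

def s_to_df(s):
    # hypothetical reimplementation: Series -> two-column DataFrame
    return pd.DataFrame({'value': s.index, 'count': s.values})
```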
66 |
67 | ### Checking the variance of each feature
68 |
69 | Let's quickly study the variance of the features to have an estimate of their impact on clicks. But let us first consider the cardinalities.
70 |
71 | #### Train set cardinalities
72 |
73 | ```
74 | cardin_train = [train[col].nunique() for col in train.columns]
75 | cols = train.columns.tolist()
76 | d = dict(zip(cols, cardin_train))
77 | cardinal_train = pd.DataFrame(list(d.items()), columns=['column', 'cardinality'])
78 | cardinal_train.sort_values('cardinality', ascending=False)
79 | ```
80 |
81 | #### Test set cardinalities
82 | ```
83 | cardin_test = [test[col].nunique() for col in test.columns]
84 | cols = test.columns.tolist()
85 | d = dict(zip(cols, cardin_test))
86 | cardinal_test = pd.DataFrame(list(d.items()), columns=['column', 'cardinality'])
87 | cardinal_test.sort_values('cardinality', ascending=False)
88 | ```
89 |
90 | #### High and low cardinality in the training data
91 |
92 | We can set *arbitrary* thresholds to determine the level of cardinality in the feature categories:
93 |
94 | ```
95 | target = 'click'
96 | cardinal_train_threshold = 33 # our choice
97 | low_cardinal_train = cardinal_train[cardinal_train['cardinality']
98 | <= cardinal_train_threshold]['column'].tolist()
99 | low_cardinal_train.remove(target)
100 | high_cardinal_train = cardinal_train[cardinal_train['cardinality']
101 | > cardinal_train_threshold]['column'].tolist()
102 | print('Features with low cardinal_train:\n',low_cardinal_train)
103 | print('')
104 | print('Features with high cardinal_train:\n',high_cardinal_train)
105 | ```
106 |
107 | #### High and low cardinality in the test data
108 |
109 | ```
110 | cardinal_test_threshold = 25 # chosen so that low_cardinal_test agrees with low_cardinal_train
111 | low_cardinal_test = cardinal_test[cardinal_test['cardinality']
112 | <= cardinal_test_threshold]['column'].tolist()
113 | high_cardinal_test = cardinal_test[cardinal_test['cardinality']
114 | > cardinal_test_threshold]['column'].tolist()
115 | print('Features with low cardinal_test:\n',low_cardinal_test)
116 | print('')
117 | print('Features with high cardinal_test:\n',high_cardinal_test)
118 | ```
119 |
120 | #### Now let's look at the features' variances.
121 |
122 | From the bar plot below we see that `device_type` has non-negligible variance.
123 |
124 | ```
125 | import matplotlib.pyplot as plt
126 |
127 | for col in low_cardinal_train:
128 |     ax = train[target].groupby(train[col]).sum().plot(kind='bar',
129 |                                                       title="Clicks per " + col,
130 |                                                       figsize=(10, 5), fontsize=12)
131 |     ax.set_xlabel(col, fontsize=12)
132 |     ax.set_ylabel("Clicks", fontsize=12)
133 |     plt.show()
134 |
135 | ```
136 |
137 | ### Dropping some features
138 |
139 | Notice that some of the features are massively dominated by **just one level**. We will drop those. We have to
140 | do that for both train and test sets:
141 |
142 | ```
143 | cols_to_drop = ['location']
144 | train_new = train.drop(cols_to_drop,axis=1)
145 | test_new = test.drop(cols_to_drop,axis=1)
146 | ```
147 |
148 |
149 | ### Data types
150 |
151 | ```
152 | train_new.dtypes
153 | test_new.dtypes
154 | ```
155 |
156 | #### Converting some of the integer columns into strings:
157 |
158 | ```
159 | cols_to_convert = test_new.columns.tolist()
160 | for col in cols_to_convert:
161 |     train_new[col] = train_new[col].astype(str)
162 |     test_new[col] = test_new[col].astype(str)
163 | ```
164 |
165 |
166 | ## Handling missing values
167 |
168 | The only column with missing values is the `website` column. There are several ways to fill missing values, including:
169 | - Dropping the corresponding rows
170 | - Filling `NaN`s with the most frequent value
171 | - Multiple Imputation by Chained Equations (MICE), a more sophisticated option
172 |
173 | In our case there is only a relatively small percentage of `NaN`s, in just one column: $\approx 13\%$ of `website` values are missing. I opted for most-frequent-value imputation to avoid dropping rows. A future analysis using MICE should improve the final results.
174 |
175 | ```
176 | train_new['website'] = train_new[['website']].apply(lambda x:x.fillna(x.value_counts().index[0]))
177 | train_new.isnull().any()
178 | test_new['website'] = test_new[['website']].apply(lambda x:x.fillna(x.value_counts().index[0]))
179 | test_new.isnull().any()
180 | ```
181 |
182 |
183 | ### Dummies
184 |
185 | We can transform the categories with low cardinality into dummies using one-hot encoding:
186 |
187 | ```
188 | cols_to_keep = ['carrier', 'device', 'day', 'hour', 'dimension']
189 | low_cardin_train = train_new[cols_to_keep]
190 | low_cardin_test = test_new[cols_to_keep]
191 | dummies_train = pd.concat([pd.get_dummies(low_cardin_train[col], drop_first = True, prefix= col)
192 | for col in cols_to_keep], axis=1)
193 | dummies_test = pd.concat([pd.get_dummies(low_cardin_test[col], drop_first = True, prefix= col)
194 | for col in cols_to_keep], axis=1)
195 | dummies_train.head()
196 | dummies_test.head()
197 |
198 | train_new.to_csv('train_new.csv')
199 | test_new.to_csv('test_new.csv')
200 | ```
201 |
202 | #### Concatenating with the rest of the `DataFrame`:
203 |
204 | ```
205 | train_new = pd.concat([train_new[high_cardinal_train + ['click']], dummies_train], axis = 1)
206 | test_new = pd.concat([test_new[high_cardinal_test], dummies_test], axis = 1)
207 | ```
208 |
209 | Now, to treat the columns with high cardinality, we will break them up into three ranges based on the number of impressions (number of rows).
210 |
211 | #### Building up dictionaries for creation of dummy variables
212 |
213 | ```
214 | train_new['count'] = 1 # auxiliary column
215 | test_new['count'] = 1
216 | ```
217 |
218 | #### In the next cell, I use `pd.cut` to bin column entries into ranges
219 |
220 | ```
221 | def series_to_dataframe(s, name, index_list):
222 |     lst = [s.iloc[i] for i in range(s.shape[0])]
223 |     new_df = pd.DataFrame({name: lst})  # transforms list into dataframe
224 |     new_df.index = index_list
225 |     return new_df
226 |
227 | def ranges(df1, col):
228 |     df = series_to_dataframe(df1['count'].groupby(df1[col]).sum(),
229 |                              'sum of ads',
230 |                              df1['count'].groupby(df1[col]).sum().index.tolist()).sort_values('sum of ads', ascending=False)
231 |     # pd.cut splits 'sum of ads' into 3 equal-width bins, turned into dummies below
232 |     df = pd.concat([df, pd.get_dummies(pd.cut(df['sum of ads'], 3), drop_first=True)], axis=1)
233 |     df.columns = ['sum of ads', col + '_1', col + '_2']
234 |     return df
235 | website_train = ranges(train_new,'website')
236 | publisher_train = ranges(train_new,'publisher')
237 | website_test = ranges(test_new,'website')
238 | publisher_test = ranges(test_new,'publisher')
239 | website_train.reset_index(level=0, inplace=True)
240 | publisher_train.reset_index(level=0, inplace=True)
241 | website_test.reset_index(level=0, inplace=True)
242 | publisher_test.reset_index(level=0, inplace=True)
243 | website_train.columns = ['website', 'sum of impressions', 'website_1', 'website_2']
244 | publisher_train.columns = ['publisher', 'sum of impressions', 'publisher_1', 'publisher_2']
245 | website_test.columns = ['website', 'sum of impressions', 'website_1', 'website_2']
246 | publisher_test.columns = ['publisher', 'sum of impressions', 'publisher_1', 'publisher_2']
247 | train_new = train_new.merge(website_train, how='left')
248 | train_new = train_new.drop('website',axis=1).drop('sum of impressions',axis=1)
249 | train_new = train_new.merge(publisher_train, how='left')
250 | train_new = train_new.drop('publisher',axis=1).drop('sum of impressions',axis=1)
251 | test_new = test_new.merge(website_test, how='left')
252 | test_new = test_new.drop('website',axis=1).drop('sum of impressions',axis=1)
253 | test_new = test_new.merge(publisher_test, how='left')
254 | test_new = test_new.drop('publisher',axis=1).drop('sum of impressions',axis=1)
255 | ```
256 |
257 | ## Imbalanced classes
258 |
259 |
260 | #### Imbalanced classes in general
261 |
262 | - We can account for unbalanced classes using:
263 |   - Undersampling: randomly sample the majority class, artificially balancing the classes when fitting the model
264 |   - Oversampling: bootstrap (sample with replacement) the minority class to balance the classes when fitting the model. We can oversample using the SMOTE algorithm (Synthetic Minority Oversampling Technique); see the sketch after the code below.
265 | - Note that it is crucial that we **evaluate our model on the real data!!**
266 |
267 | ```
268 | zeros = train_new[train_new['click'] == 0]
269 | ones = train_new[train_new['click'] == 1]
270 | counts = train_new['click'].value_counts()
271 | proportion = counts[1]/counts[0]
272 | train_new = ones.append(zeros.sample(frac=proportion))
273 | #train_new['response'].value_counts()
274 | #train_new.isnull().any()
275 | ```
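
As mentioned above, oversampling with SMOTE is an alternative. A minimal sketch, assuming the `imbalanced-learn` package, a fully numeric feature matrix, and that we remember to resample only the training split:

```
from imblearn.over_sampling import SMOTE

X = train_new.drop('click', axis=1)
y = train_new['click']
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)   # balanced classes
```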
276 |
277 | ## Models
278 |
279 | ```
280 | from sklearn.model_selection import cross_val_score, StratifiedKFold, train_test_split, GridSearchCV
281 | from sklearn.tree import DecisionTreeClassifier
282 | from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier
283 | from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
284 | from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfVectorizer, TfidfTransformer
285 | import seaborn as sns
286 | from sklearn.metrics import confusion_matrix
287 | %matplotlib inline
288 |
289 | X_test = test_new
290 | ```
291 |
292 | ## Defining ranges for the hyperparameters to be scanned by the grid search
293 | ```
294 | n_estimators = list(range(20,120,10))
295 | max_depth = list(range(2, 22, 2)) + [None]
296 | def random_forest_score(df, target_col, test_size, n_estimators, max_depth):
297 |
298 |     X_train = df.drop(target_col, axis=1) # predictors
299 |     y_train = df[target_col] # target
300 |     X_test = test_new  # note: test_size is unused; predictions are made on the separate test set
301 |
302 |     rf_params = {
303 |         'n_estimators': n_estimators,
304 |         'max_depth': max_depth} # parameters for grid search
305 |     rf_gs = GridSearchCV(RandomForestClassifier(), rf_params, cv=5, verbose=1, n_jobs=-1)
306 |     rf_gs.fit(X_train, y_train) # training the random forest with all possible parameters
307 |     print('The best parameters on the training data are:\n', rf_gs.best_params_) # printing the best parameters
308 |     max_depth_best = rf_gs.best_params_['max_depth'] # getting the best max_depth
309 |     n_estimators_best = rf_gs.best_params_['n_estimators'] # getting the best n_estimators
310 |     print("best max_depth:", max_depth_best)
311 |     print("best n_estimators:", n_estimators_best)
312 |     best_rf_gs = RandomForestClassifier(max_depth=max_depth_best, n_estimators=n_estimators_best) # instantiate the best model
313 |     best_rf_gs.fit(X_train, y_train) # fitting the best model
314 |     preds = best_rf_gs.predict(X_test)
315 |     feature_importances = pd.Series(best_rf_gs.feature_importances_, index=X_train.columns).sort_values().tail(5)
316 |     feature_importances.plot(kind="barh", figsize=(6,6))
317 |     return
318 |
319 | random_forest_score(train_new, 'click', 0.3, n_estimators, max_depth)
320 | ```
321 | ```
322 | X = train_new.drop('click', axis=1) # predictors
323 | y = train_new['click']
324 |
325 | def cv_score(X, y, cv, n_estimators, max_depth):
326 |     rf = RandomForestClassifier(n_estimators=n_estimators,
327 |                                 max_depth=max_depth)
328 |     s = cross_val_score(rf, X, y, cv=cv, n_jobs=-1)
329 |     return("{} Score is :{:0.3} ± {:0.3}".format("Random Forest", s.mean().round(3), s.std().round(3)))
330 |
331 | dict_best = {'max_depth': 14, 'n_estimators': 80}
332 | n_estimators_best = dict_best['n_estimators']
333 | max_depth_best = dict_best['max_depth']
334 | cv_score(X, y, 5, n_estimators_best, max_depth_best)
335 |
336 | n_estimators = list(range(20,120,10))
337 | max_depth = list(range(2, 16, 2)) + [None]
338 |
339 | def random_forest_score_probas(df, target_col, test_size, n_estimators, max_depth):
340 |
341 |     X_train = df.drop(target_col, axis=1) # predictors
342 |     y_train = df[target_col] # target
343 |     X_test = test_new
344 |
345 |     rf_params = {
346 |         'n_estimators': n_estimators,
347 |         'max_depth': max_depth} # parameters for grid search
348 |     rf_gs = GridSearchCV(RandomForestClassifier(), rf_params, cv=5, n_jobs=-1)
349 |     rf_gs.fit(X_train, y_train) # training the random forest with all possible parameters
350 |     max_depth_best = rf_gs.best_params_['max_depth'] # getting the best max_depth
351 |     n_estimators_best = rf_gs.best_params_['n_estimators'] # getting the best n_estimators
352 |     best_rf_gs = RandomForestClassifier(max_depth=max_depth_best, n_estimators=n_estimators_best) # instantiate the best model
353 |     best_rf_gs.fit(X_train, y_train) # fitting the best model
354 |     preds = best_rf_gs.predict(X_test)
355 |     prob_list = [prob[0] for prob in best_rf_gs.predict_proba(X_test).tolist()]  # prob[0] = P(no click); prob[1] would give P(click)
356 |     df_prob = pd.DataFrame(np.array(prob_list).reshape(-1, 1))  # one row per test impression
357 |     df_prob.columns = ['probabilities']
358 |     df_prob.to_csv('probs.csv')
359 |     return df_prob
360 |
361 | random_forest_score_probas(train_new, 'click', 0.3, n_estimators, max_depth).head()
362 |
363 | def random_forest_score_preds(df, target_col, test_size, n_estimators, max_depth):
364 |
365 |     X_train = df.drop(target_col, axis=1) # predictors
366 |     y_train = df[target_col] # target
367 |     X_test = test_new
368 |
369 |     rf_params = {
370 |         'n_estimators': n_estimators,
371 |         'max_depth': max_depth} # parameters for grid search
372 |     rf_gs = GridSearchCV(RandomForestClassifier(), rf_params, cv=5, verbose=1, n_jobs=-1)
373 |     rf_gs.fit(X_train, y_train) # training the random forest with all possible parameters
374 |     max_depth_best = rf_gs.best_params_['max_depth'] # getting the best max_depth
375 |     n_estimators_best = rf_gs.best_params_['n_estimators'] # getting the best n_estimators
376 |     best_rf_gs = RandomForestClassifier(max_depth=max_depth_best, n_estimators=n_estimators_best) # instantiate the best model
377 |     best_rf_gs.fit(X_train, y_train) # fitting the best model
378 |     preds = best_rf_gs.predict(X_test)
379 |     df_pred = pd.DataFrame(np.array(preds).reshape(-1, 1))  # one row per test impression
380 |     df_pred.columns = ['predictions']
381 |     df_pred.to_csv('preds.csv')
382 |     return df_pred
383 |
384 | random_forest_score_preds(train_new, 'click', 0.3, n_estimators, max_depth)
385 | ```
386 |
--------------------------------------------------------------------------------
/click-prediction/images/123:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/click-prediction/images/click1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/click-prediction/images/click1.png
--------------------------------------------------------------------------------
/click-prediction/optimal-bidding-strategies-in-online-display-advertising .pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/click-prediction/optimal-bidding-strategies-in-online-display-advertising .pdf
--------------------------------------------------------------------------------
/predicting-number-of-comments-on-reddit-using-random-forest-classifier/123.png:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/predicting-number-of-comments-on-reddit-using-random-forest-classifier/README.md:
--------------------------------------------------------------------------------
1 | ## Predicting Comments on Reddit [[view code]](http://nbviewer.jupyter.org/github/marcotav/machine-learning-classification-projects/blob/master/predicting-number-of-comments-on-reddit-using-random-forest-classifier/notebooks/project-3-marco-tavora.ipynb)
2 |       
3 |
4 | **The code is available [here](http://nbviewer.jupyter.org/github/marcotav/machine-learning-classification-projects/blob/master/predicting-number-of-comments-on-reddit-using-random-forest-classifier/notebooks/project-3-marco-tavora.ipynb) or by clicking on the [view code] link above.**
5 |
6 |
7 |
8 |
9 |
10 |
12 |
13 |
14 |
15 |
16 | Problem Statement •
17 | Steps •
18 | Bird's-eye view of webscraping •
19 | Writing functions to extract data from Reddit •
20 | Quick review of NLP techniques •
21 | Preprocessing the text •
22 | Models
23 |
24 |
25 |
26 | ## Problem Statement
27 |
28 | Determine which characteristics of a post on Reddit contribute most to the overall interaction as measured by number of comments.
29 |
30 |
31 | ## Steps
32 |
33 | This project had three steps:
34 | - Collecting data by scraping a website using the Python package `requests` together with the library `BeautifulSoup`, which efficiently extracts information from HTML. We scraped the 'hot' threads as listed on the [Reddit homepage](https://www.reddit.com/) (see figure below) and acquired the following pieces of information about each thread:
35 |
36 | - The title of the thread
37 | - The subreddit that the thread corresponds to
38 | - The length of time it has been up on Reddit
39 | - The number of comments on the thread
40 |
41 |
42 |
43 |
44 |
46 |
47 |
48 |
49 | - Using Natural Language Processing (NLP) techniques to preprocess the data. NLP, in a nutshell, is "how to transform text data and convert it to features that enable us to build models." NLP techniques include:
50 |
51 | - Tokenization: essentially splitting text into pieces based on given patterns
52 | - Removing stopwords
53 | - Lemmatization: returns the word's *lemma* (its base/dictionary form)
54 | - Stemming: returns the base form of the word (it is usually cruder than lemmatization); see the short `nltk` sketch after this list.
55 |
56 | - After the step above we obtain *numerical* features which allow for algebraic computations. We then build a `RandomForestClassifier` and use it to classify each post according to the number of comments associated with it. More concretely, the model predicts whether a given Reddit post will receive an above- or below-_median_ number of comments.
57 |
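A minimal `nltk` sketch of the preprocessing steps above (corpora downloads assumed; the example words are hypothetical):

```
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

# one-time: nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')
tokens = word_tokenize("The runners were running quickly")                    # tokenization
tokens = [t for t in tokens if t.lower() not in stopwords.words('english')]  # stopword removal
print(WordNetLemmatizer().lemmatize('running', pos='v'))   # 'run' (lemma)
print(PorterStemmer().stem('quickly'))                     # 'quickli' (cruder stem)
```
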
58 |
59 | ### Bird's-eye view of webscraping
60 |
61 | The general strategy is:
62 | - Use the `requests` Python packages to make a `.get` request (the object `res` is a `Response` object):
63 | ```
64 | res = requests.get(URL,headers={"user-agent":'mt'})
65 | ```
66 | - Create a BeautifulSoup object from the HTML
67 | ```
68 | soup = BeautifulSoup(res.content,"lxml")
69 | ```
70 | - Inspect the page structure (the notebook calls `.extract`, but note that `.extract()` actually *detaches* a tag from the tree; `soup.prettify()` is the standard way to view the parsed HTML):
71 | ```
72 | print(soup.prettify())
73 | ```
74 |
75 | ### Writing functions to extract data from Reddit
76 | Here I write down the functions that will extract the information needed. The structure of the functions depends on the HTML code of the page. The page has the following structure:
77 | - The thread title is within an `<a>` tag with the attribute `data-event-action="title"`.
78 | - The time since the thread was created is within a `<time>` tag with the attribute `class="live-timestamp"`.
79 | - The subreddit is within an `<a>` tag with the attribute `class="subreddit hover may-blank"`.
80 | - The number of comments is within an `<a>` tag with the attribute `data-event-action="comments"`.
81 |
82 | The functions are:
83 | ```
84 | def extract_title_from_result(result):   # thread titles on the page
85 | titles = []
86 | title = result.find_all('a', {'data-event-action':'title'})
87 | for i in title:
88 | titles.append(i.text)
89 | return titles
90 |
91 | def extract_time_from_result(result):   # "time since posted" strings
92 | times = []
93 | time = result.find_all('time', {'class':'live-timestamp'})
94 | for i in time:
95 | times.append(i.text)
96 | return times
97 |
98 | def extract_subreddit_from_result(result):   # subreddit names
99 | subreddits = []
100 | subreddit = result.find_all('a', {'class':'subreddit hover may-blank'})
101 | for i in subreddit:
102 | subreddits.append(i.string)
103 | return subreddits
104 |
105 | def extract_num_from_result(result):   # comment-count strings, e.g. '42 comments'
106 | nums_lst = []
107 | nums = result.find_all('a', {'data-event-action': 'comments'})
108 | for i in nums:
109 | nums_lst.append(i.string)
110 | return nums_lst
111 | ```
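
As a quick sanity check (a sketch of mine, not from the original notebook), the helpers can be exercised on a hand-written HTML fragment that mimics the attributes listed above:
```
# minimal smoke test of the four extractors
from bs4 import BeautifulSoup

html = """
<a data-event-action="title">Cool thread title</a>
<time class="live-timestamp">3 hours ago</time>
<a class="subreddit hover may-blank">r/python</a>
<a data-event-action="comments">42 comments</a>
"""
soup = BeautifulSoup(html, "lxml")
print(extract_title_from_result(soup))      # ['Cool thread title']
print(extract_time_from_result(soup))       # ['3 hours ago']
print(extract_subreddit_from_result(soup))  # ['r/python']
print(extract_num_from_result(soup))        # ['42 comments']
```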
112 | I then write a function that loops over successive pages, following the 'next' button link at the bottom of each page and storing the visited URLs:
113 | ```
114 | def get_urls(n=25):
115 | j=0 # counting loops
116 | titles = []
117 | times = []
118 | subreddits = []
119 | nums = []
120 | URLS = []
121 | URL = "http://www.reddit.com"
122 |
123 | for _ in range(n):
124 |
125 | res = requests.get(URL, headers={"user-agent":'mt'})
126 | soup = BeautifulSoup(res.content,"lxml")
127 |
128 | titles.extend(extract_title_from_result(soup))
129 | times.extend(extract_time_from_result(soup))
130 | subreddits.extend(extract_subreddit_from_result(soup))
131 | nums.extend(extract_num_from_result(soup))
132 |
133 | URL = soup.find('span',{'class':'next-button'}).find('a')['href']
134 | URLS.append(URL)
135 |         j+=1
136 |         print(j)          # simple progress indicator
137 |         time.sleep(3)     # be polite to the server; requires the standard-library time module
138 |
139 | return titles, times, subreddits, nums, URLS
140 | ```
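
A usage sketch (network access required; the selectors target Reddit's old page layout, so they may no longer match the live site):
```
# assemble the scraped lists into a DataFrame; the column names here are my choice
import pandas as pd

titles, times, subreddits, nums, URLS = get_urls(n=2)
# if the lists come back with unequal lengths, truncate them to the shortest first
df = pd.DataFrame({'titles': titles, 'times': times,
                   'subreddits': subreddits, 'nums': nums})
```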
141 |
142 | I then build a pandas `DataFrame` (along the lines sketched above), perform some exploratory data analysis, and create:
143 | - A binary column that flags whether the number of comments is at or above the median
144 | - A set of dummy columns for the subreddits
145 | - A concatenation of both
146 |
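Note that the scraped `nums` values are strings such as `'42 comments'`, so before the median comparison below a numeric conversion is needed; a minimal sketch (not in the original notebook):
```
# parse the leading integer out of strings like '42 comments';
# threads without a comment count fall back to 0
df['nums'] = (df['nums'].str.extract(r'(\d+)', expand=False)
                        .fillna(0).astype(int))
```
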
147 | ```
148 | df['binary'] = df['nums'].apply(lambda x: 1 if x >= np.median(df['nums']) else 0)
149 | # dummies created and dataframes concatenated
150 | df_subred = pd.concat([df['binary'],pd.get_dummies(df['subreddits'], drop_first = True)], axis = 1)
151 | ```
152 |
153 | ### Quick review of NLP techniques
154 | Before applying NLP to our problem, I will provide a quick review of the basic procedures using `Python`. We use the package `nltk` (Natural Language Toolkit) to perform the actions above. The general procedure is the following. We first import `nltk` and the necessary classes for lemmatization and stemming
155 | ```
156 | import nltk
157 | from nltk.stem import WordNetLemmatizer
158 | from nltk.stem.porter import PorterStemmer
159 | ```
160 | We then create objects of the classes `PorterStemmer` and `WordNetLemmatizer`:
161 | ```
162 | stemmer = PorterStemmer()
163 | lemmatizer = WordNetLemmatizer()
164 | ```
165 | To use lemmatization and/or stemming on a given string `text` we must first tokenize it. To do that, we use `RegexpTokenizer` (imported from `nltk.tokenize`), where the argument below is a regular expression.
166 | ```
167 | tokenizer = RegexpTokenizer(r'\w+')
168 | tokens = tokenizer.tokenize(text)
169 | tokens_lemma = [lemmatizer.lemmatize(i) for i in tokens]
170 | stem_text = [stemmer.stem(i) for i in tokens]
171 | ```
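
To see the difference between the two normalizations, a quick illustration continuing from the snippets above (it assumes the WordNet data has been fetched once with `nltk.download('wordnet')`):
```
# compare lemmas and stems on a small sentence
tokens = tokenizer.tokenize("the geese were running across the fields")
print([lemmatizer.lemmatize(t) for t in tokens])  # dictionary forms, e.g. 'geese' -> 'goose'
print([stemmer.stem(t) for t in tokens])          # cruder truncations, e.g. 'running' -> 'run'
```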
172 |
173 | ### Preprocessing the text
174 | To preprocess the text, before creating numerical features from it, I used the following `cleaner` function (it relies on the standard-library `string` module and on `stopwords` from `nltk.corpus`):
175 | ```
176 | def cleaner(text):
177 | stemmer = PorterStemmer()
178 | stop = stopwords.words('english')
179 | text = text.translate(str.maketrans('', '', string.punctuation))
180 | text = text.translate(str.maketrans('', '', string.digits))
181 | text = text.lower().strip()
182 | final_text = []
183 | for w in text.split():
184 | if w not in stop:
185 | final_text.append(stemmer.stem(w.strip()))
186 | return ' '.join(final_text)
187 | ```
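
For example (a quick check of mine, assuming `nltk.download('stopwords')` has been run once):
```
print(cleaner("Check out these 100 Cute Cats!!!"))  # -> 'check cute cat'
```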
188 | I then use `CountVectorizer` to create features based on the words in the thread titles. `CountVectorizer` is scikit-learn's bag-of-words tool. I then combine these new features with the subreddit features into a single table `df_all` and build a model.
189 |
190 | ```
191 | cvt = CountVectorizer(min_df=min_df, preprocessor=cleaner)  # min_df is set elsewhere in the notebook
192 | # fit the vocabulary and build the sparse document-term matrix in one step
193 | X_title = cvt.fit_transform(df["titles"])
195 | X_thread = pd.DataFrame(X_title.todense(),
196 | columns=cvt.get_feature_names())
197 | df_all = pd.concat([df_subred,X_thread],axis=1)
198 | ```
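
On a toy corpus the effect of the `cleaner` preprocessor is easy to see (a sketch; note that recent scikit-learn versions rename `get_feature_names` to `get_feature_names_out`):
```
toy = ["Cats are great", "10 great dogs!", "dogs and cats"]
cvt_toy = CountVectorizer(min_df=1, preprocessor=cleaner)
X_toy = cvt_toy.fit_transform(toy)
print(cvt_toy.get_feature_names())  # stemmed vocabulary: ['cat', 'dog', 'great']
print(X_toy.todense())              # per-title word counts
```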
199 |
200 |
201 |
202 |
203 |
204 | ### Models
205 | Finally, with the data properly treated, we use the following function to fit the training data using a `RandomForestClassifier` with hyperparameters optimized via `GridSearchCV`. The ranges of hyperparameters searched are:
206 | ```
207 | n_estimators = list(range(20,220,10))
208 | max_depth = list(range(2, 22, 2)) + [None]
209 | ```
210 |
211 | The function shown below does the following:
212 | - Defines target and predictors
213 | - Performs a train-test split of the data
214 | - Uses `GridSearchCV`, which performs an "exhaustive search over specified parameter values for an estimator" (see the docs). It searches the hyperparameter space for the highest cross-validation score. It has several important arguments, namely:
215 |
216 | | Argument | Description |
217 | | --- | ---|
218 | | **`estimator`** | Sklearn instance of the model to fit on |
219 | | **`param_grid`** | A dictionary where keys are hyperparameters and values are lists of values to test |
220 | | **`cv`** | Number of internal cross-validation folds to run for each set of hyperparameters |
221 |
222 | - After fitting, `GridSearchCV` provides information such as:
223 |
224 | | Property | Use |
225 | | --- | ---|
226 | | **`results.param_grid`** | Parameters searched over. |
227 | | **`results.best_score_`** | Best mean cross-validated score.|
228 | | **`results.best_estimator_`** | Reference to model with best score. |
229 | | **`results.best_params_`** | Parameters found to perform with the best score. |
230 | | **`results.grid_scores_`** | Display score attributes with corresponding parameters. |
231 |
232 | - The estimator chosen here was a `RandomForestClassifier`. The latter fits a set of decision tree classifiers on sub-samples of the data and averages their predictions to improve accuracy and control over-fitting.
233 | - Fits models on the training data for every parameter combination in the grid `rf_params` and finds the best model, i.e., the one with the best mean cross-validated score.
234 | - Instantiates the best model and fits it
235 | - Scores the model and makes predictions
236 | - Determines the most relevant features and prints out a bar plot showing them.
237 |
238 | ```
239 | def rfscore(df,target_col,test_size,n_estimators,max_depth):
240 |
241 | X = df.drop(target_col, axis=1) # predictors
242 | y = df[target_col] # target
243 |
244 | # train-test split
245 | X_train, X_test, y_train, y_test = train_test_split(X,
246 | y, test_size = test_size, random_state=42)
247 | # definition of a grid of parameter values
248 | rf_params = {
249 | 'n_estimators':n_estimators,
250 | 'max_depth':max_depth} # parameters for grid search
251 |
252 | # Instantiation
253 | rf_gs = GridSearchCV(RandomForestClassifier(), rf_params, cv=5, verbose=1, n_jobs=-1)
254 |
255 | # fitting using training data with all possible parameters
256 | rf_gs.fit(X_train,y_train)
257 |
258 | # Parameters that have been found to perform with the best score
259 | max_depth_best = rf_gs.best_params_['max_depth']
260 | n_estimators_best = rf_gs.best_params_['n_estimators']
261 |
262 | # Best model
263 | best_rf_gs = RandomForestClassifier(max_depth=max_depth_best,n_estimators=n_estimators_best)
264 |
265 | # fitting best model using training data with all possible parameters
266 | best_rf_gs.fit(X_train,y_train)
267 |
268 | # scoring
269 | best_rf_score = best_rf_gs.score(X_test,y_test)
270 |
271 | # predictions
272 | preds = best_rf_gs.predict(X_test)
273 |
274 |     # finds the most important features and plots a bar chart
275 |     feature_importances = pd.Series(best_rf_gs.feature_importances_, index=X.columns).sort_values().tail(5)
276 |     feature_importances.plot(kind="barh", figsize=(6,6))
277 |     return best_rf_score, preds
278 | ```
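
A call along these lines (using the parameter ranges defined earlier) runs the whole pipeline; the returned score and predictions reflect the `return` statement added above:
```
# hypothetical invocation on the combined feature table
score, preds = rfscore(df_all, 'binary', test_size=0.3,
                       n_estimators=n_estimators, max_depth=max_depth)
print("Test accuracy:", score)
```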
279 | The function below performs cross-validation to obtain the accuracy score of the model with the best parameters found by the grid search:
280 |
281 | ```
282 | def cv_score(X,y,cv,n_estimators,max_depth):
283 |     rf = RandomForestClassifier(n_estimators=n_estimators,
284 |                                 max_depth=max_depth)
285 |     s = cross_val_score(rf, X, y, cv=cv, n_jobs=-1)
286 |     return "{} score: {:0.3f} ± {:0.3f}".format("Random Forest", s.mean(), s.std())
287 | ```
288 | The most important features according to the `RandomForestClassifier` are shown in the graph below:
289 |
290 |
291 |
292 |
293 |
--------------------------------------------------------------------------------
/predicting-number-of-comments-on-reddit-using-random-forest-classifier/images/123.png:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/predicting-number-of-comments-on-reddit-using-random-forest-classifier/images/Reddit-logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/predicting-number-of-comments-on-reddit-using-random-forest-classifier/images/Reddit-logo.png
--------------------------------------------------------------------------------
/predicting-number-of-comments-on-reddit-using-random-forest-classifier/images/redditRF.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/predicting-number-of-comments-on-reddit-using-random-forest-classifier/images/redditRF.png
--------------------------------------------------------------------------------
/predicting-number-of-comments-on-reddit-using-random-forest-classifier/images/redditpage.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/predicting-number-of-comments-on-reddit-using-random-forest-classifier/images/redditpage.png
--------------------------------------------------------------------------------
/predicting-number-of-comments-on-reddit-using-random-forest-classifier/images/redditwordshist.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/predicting-number-of-comments-on-reddit-using-random-forest-classifier/images/redditwordshist.png
--------------------------------------------------------------------------------
/predicting-number-of-comments-on-reddit-using-random-forest-classifier/notebooks/123.png:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/retail-strategy/README.md:
--------------------------------------------------------------------------------
1 | ## Retail Expansion Analysis with Lasso and Ridge Regressions [[view code]](http://nbviewer.jupyter.org/github/marcotav/machine-learning-regression-models/blob/master/retail/notebooks/retail-recommendations.ipynb)
2 |       [](https://opensource.org/licenses/MIT)
3 |
4 | **The code is available [here](http://nbviewer.jupyter.org/github/marcotav/machine-learning-regression-models/blob/master/retail/notebooks/retail-recommendations.ipynb) or by clicking on the [view code] link above.**
5 |
6 |
7 |
8 |
9 |
10 |
11 |
12 |
13 |
14 |
15 |
16 |
17 |
18 | Summary •
19 | Preamble •
20 | Getting data •
21 | Data Munging and EDA •
22 | Mining the data •
23 | Building the models •
24 | Plotting results •
25 | Conclusions and recommendations
26 |
27 |
28 |
29 | ## Summary
30 | Based on a dataset containing the spirits purchase information of Iowa Class E liquor licensees by product and date of purchase (link), this project provides recommendations on where to open new stores in the state of Iowa. I first conducted a thorough exploratory data analysis and then built several multivariate regression models of total sales by county, using both Lasso and Ridge regularization; based on these models, I made recommendations about new locations.
31 |
32 |
33 | ## Preamble
34 |
35 | Expansion plans traditionally use subsets of the following mix of data:
36 |
37 | #### Demographics
38 |
39 | I focused on the following quantities:
40 | - The ratio between sales and volume for each county, i.e., the number of dollars per liter sold. If this ratio is high in a given county, the stores in that county are, on average, high-end stores.
41 | - Another critical ratio is the number of stores per area. The meaning of a high value of this ratio is not so straightforward since it may indicate either that the market is saturated, or that the county is a strong market for this type of product and would welcome a new store (an example would be a county close to some major university). In contrast, a low value may indicate a market with untapped potential or a market with a population which is not a target of this type of store.
42 | - Another important ratio is consumption/person, i.e., the consumption *per capita*. Knowing the profile of the population in the county (whether they are "light" or "heavy" drinkers) would undoubtedly help the owner decide whether or not to open a new storefront there.
43 |
44 | #### Nearby businesses
45 |
46 | Competition is a critical component, and can be indirectly measured by the ratio of the number of stores to the population.
47 |
48 | #### Aggregated human flow/foot traffic
49 |
50 | For this information to be useful, we would need more granular data such as apps check-ins. Population and population density will be used as proxies.
51 |
52 |
53 | ## Getting data
54 |
55 | Three datasets were used, namely:
56 | - A dataset containing the spirits purchase information of Iowa Class “E” liquor licensees by product and date of purchase.
57 | - A dataset with information about population per county
58 | - A database containing information about incomes
59 |
60 |
61 | ## Data Munging and EDA
62 |
63 | Data munging included:
64 | - Checking the time span of the data and dropping the 2016 data (which covered only three months)
65 | - Stripping symbols (such as dollar signs) from the data and converting the resulting object columns to floats
66 | - Dropping `NaN` values
67 | - Converting store numbers to strings
68 | - Examining the data, we find that the maximum values in all columns were many standard deviations above the mean, indicating the presence of outliers. Keeping outliers in the analysis would inflate the predicted sales. Also, since the goal is to predict the *most likely performance* for each store, keeping exceptionally well-performing stores would be detrimental (one possible filtering rule is sketched below).
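A minimal sketch of such a rule (the percentile threshold is an assumption of mine, not the notebook's exact criterion):
```
# drop stores whose total sales fall above the 99th percentile
cap = df['sale_dollars'].quantile(0.99)
df = df[df['sale_dollars'] <= cap]
```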
70 |
71 | To strip dollar signs, for example, I used:
72 | ```
73 | for col in cols_with_dollar:
74 | df[col] = df[col].apply(lambda x: x.strip('$')).astype('float')
75 | ```
76 | To plot histograms I found it convenient to write a simple function:
77 | ```
78 | def draw_histograms(df,col,bins):
79 |     df[col].hist(bins=bins)
80 |     plt.title(col)
81 |     plt.xlabel(col)
82 |     plt.xticks(rotation=90)
83 |     plt.show()
84 | ```
85 |
86 | ## Mining the data
87 |
88 | Some of the steps for mining the data included: computing the total sales per county, creating a profit column, calculating profit per store and sales per volume, dropping outliers, and calculating both the stores-per-person and the alcohol-consumption-per-person ratios.
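
A hedged sketch of this county-level aggregation (column names other than `sale_dollars` are assumptions about the notebook's schema):
```
# aggregate store-level records to county level
county = df.groupby('county').agg(
    sale_dollars=('sale_dollars', 'sum'),
    volume_sold_liters=('volume_sold_liters', 'sum'),
    num_stores=('store_number', 'nunique'))
county['sales_per_liter'] = county['sale_dollars'] / county['volume_sold_liters']
# join the (hypothetical) per-county population table from the second dataset
county = county.join(pop_per_county)
county['consumption_per_capita'] = county['volume_sold_liters'] / county['population']
county['store_population_ratio'] = county['num_stores'] / county['population']
```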
89 |
90 | I then looked for any statistical relationships, correlations, or other relevant properties of the dataset.
91 |
92 | #### Steps:
93 | - First I needed to choose the proper predictors. I looked for strong correlations between variables to avoid problems with multicollinearity.
94 | - Also, variables that changed very little had little impact and they were therefore not included as predictors.
95 | - I then studied correlations between predictors.
96 | - I saw from the correlation matrices that `num_stores` and `stores_per_area` are highly correlated. Furthermore, both are highly correlated with the target variable `sale_dollars`. The same holds for `store_population_ratio` and `consumption_per_capita`.
97 |
98 | A heatmap of correlations using `Seaborn` follows:
99 |
100 |
101 |
102 |
103 |
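A heatmap along these lines can be reproduced with (a sketch; `cols_to_keep` is the list of retained columns used below):
```
import seaborn as sns
import matplotlib.pyplot as plt

# correlation heatmap of the retained columns
sns.heatmap(df[cols_to_keep].corr(), annot=True, cmap='coolwarm')
plt.show()
```
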
104 | To generate scatter plots for all the predictors (which provided information similar to the correlation matrices) we write:
105 | ```
106 | g = sns.pairplot(df[cols_to_keep])
107 | for ax in g.axes.flatten(): # from [6]
108 | for tick in ax.get_xticklabels():
109 |         tick.set(rotation=90)
110 | ```
111 |
112 |
113 |
114 |
115 |
116 |
117 | ## Building the models
118 |
119 | Using `scikit-learn` and `statsmodels`, I built the necessary models and evaluated their fit. For that I generated all combinations of the relevant features using the `itertools` module.
120 |
121 | Preparing training and test sets:
122 | ```
123 | # choose candidate features
124 | features = ['num_stores','population', 'store_population_ratio', \
125 |             'consumption_per_capita', 'stores_per_area', 'per_capita_income']
126 | # defining the predictors and the target
127 | X,y = df_final[features], df_final['sale_dollars']
128 | # train-test split
129 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3)
130 | ```
131 | I now generate all combinations of the features (with six candidates this gives 2^6 − 1 = 63 subsets):
132 |
133 | ```
134 | combs = []
135 | for num in range(1,len(features)+1):
136 |     combs.extend([list(c) for c in itertools.combinations(features, num)])  # each combination as a list of column names
137 | ```
138 |
139 | I then instantiated the models and tested them. The code below collects the `r2` score for every (model, feature-combination) pair and finds the best predictors using `itemgetter`:
140 | ```
141 | lr = linear_model.LinearRegression(normalize=True)
142 | ridge = linear_model.RidgeCV(cv=5)
143 | lasso = linear_model.LassoCV(cv=5)
144 | models = [lr,lasso,ridge]
145 | r2_comb_lst = []
146 | for comb in combs:
147 | for m in models:
148 | model = m.fit(X_train[comb],y_train)
149 | r2 = m.score(X_test[comb], y_test)
150 | r2_comb_lst.append([round(r2,3),comb,str(model).split('(')[0]])
151 |
152 | r2_comb_lst.sort(key=operator.itemgetter(0))  # sort by r2, not by the feature list
153 | ```
154 | The best predictors were obtained via:
155 | ```
156 | r2_comb_lst[-1][1]
157 | ```
158 | Dropping highly correlated predictors, I redefined `X` and `y` and built a Ridge model:
159 | ```
160 | X ,y = df_final[features], df_final['sale_dollars']
161 | ridge = linear_model.RidgeCV(cv=5)
162 | model = ridge.fit(X,y)
163 | ```
164 |
165 |
166 | ## Plotting results
167 |
168 | I then plotted the predictions versus the true value:
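A sketch of how such a plot can be produced (my reconstruction, not the notebook's exact code):
```
import matplotlib.pyplot as plt

y_pred = model.predict(X)
plt.scatter(y, y_pred, alpha=0.5)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--')  # perfect-prediction line
plt.xlabel('actual sale_dollars')
plt.ylabel('predicted sale_dollars')
plt.show()
```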
169 |
170 |
171 |
172 |
173 |
174 |
175 | ## Conclusions and recommendations
176 |
177 | The following recommendations were provided:
178 |
179 | - Linn has the highest sales, largely because it has the largest population, so this alone is not very informative.
180 | - Next, ordering counties by `sales_per_litters` shows which counties have more high-end stores (Johnson has the highest value).
181 | - We would recommend Johnson for a new store *if the goal of the owner is to build new high-end stores*.
182 | - If the plan is to open more stores but with cheaper products, Johnson is not the place to choose. The least saturated market is Decatur. But, as discussed before, this information alone does not yield a unique recommendation, and a more thorough analysis is needed.
183 | - The county with the weakest competition is Butler. This could indicate untapped potential. However, the absence of a reasonable number of stores may indicate, as observed before, that the county's population is simply not interested in this category of product. Again, further investigation must be carried out.
184 |
185 |
186 | I strongly recommend reading the notebook using [nbviewer](http://nbviewer.jupyter.org/github/marcotav/machine-learning-regression-models/blob/master/retail/notebooks/retail-recommendations.ipynb).
187 |
188 |
--------------------------------------------------------------------------------
/retail-strategy/data/123:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/retail-strategy/data/ia_zip_city_county_sqkm.csv:
--------------------------------------------------------------------------------
1 | ,Zip Code,City,County,State,County Number,Area (sqkm)
0,50001,ACKWORTH,Warren,IA,91,62.796656
1,50002,ADAIR,Guthrie,IA,39,279.202219
2,50003,ADEL,Dallas,IA,25,298.086291
3,50005,ALBION,Marshall,IA,64,69.623573
4,50006,ALDEN,Hardin,IA,42,317.74515
5,50007,ALLEMAN,Polk,IA,77,13.782897
6,50008,ALLERTON,Wayne,IA,93,220.623573
7,50009,ALTOONA,Polk,IA,77,65.207113
8,50010,AMES,Story,IA,85,155.294118
9,50011,AMES,Story,IA,85,0.125094
10,50012,AMES,Story,IA,85,1.982622
11,50012,AMES,Story,IA,85,1.982622
12,50014,AMES,Story,IA,85,144.826088
13,50020,ANITA,Cass,IA,15,249.128489
14,50021,ANKENY,Polk,IA,77,66.725924
15,50022,ATLANTIC,Cass,IA,15,431.883311
16,50023,ANKENY,Polk,IA,77,57.424136
17,50025,AUDUBON,Audubon,IA,5,507.431421
18,50026,BAGLEY,Guthrie,IA,39,142.869501
19,50027,BARNES CITY,Mahaska,IA,62,72.89173
20,50028,BAXTER,Jasper,IA,50,114.933651
21,50029,BAYARD,Guthrie,IA,39,105.03836
22,50032,BERWICK,Polk,IA,77,0.95539
23,50033,BEVINGTON,Warren,IA,91,0.288201
24,50034,BLAIRSBURG,Hamilton,IA,40,163.484203
25,50035,BONDURANT,Polk,IA,77,116.89815
26,50036,BOONE,Boone,IA,8,505.063491
27,50038,BOONEVILLE,Dallas,IA,25,8.874239
28,50039,BOUTON,Dallas,IA,25,60.662047
29,50041,BRADFORD,Franklin,IA,35,1.101427
30,50042,BRAYTON,Audubon,IA,5,84.16259
31,50044,BUSSEY,Marion,IA,63,118.473056
32,50046,CAMBRIDGE,Story,IA,85,119.973352
33,50047,CARLISLE,Warren,IA,91,152.159628
34,50048,CASEY,Guthrie,IA,39,226.57327
35,50049,CHARITON,Lucas,IA,59,523.124656
36,50050,CHURDAN,Greene,IA,37,197.706628
37,50051,CLEMONS,Marshall,IA,64,66.573089
38,50052,CLIO,Wayne,IA,93,50.94063
39,50054,COLFAX,Jasper,IA,50,152.872278
40,50055,COLLINS,Story,IA,85,126.340521
41,50056,COLO,Story,IA,85,149.192377
42,50057,COLUMBIA,Marion,IA,63,52.183538
43,50058,COON RAPIDS,Carroll,IA,14,364.231967
44,50060,CORYDON,Wayne,IA,93,453.610186
45,50061,CUMMING,Warren,IA,91,81.699043
46,50062,MELCHER-DALLAS,Marion,IA,63,80.104319
47,50063,DALLAS CENTER,Dallas,IA,25,170.532757
48,50064,DANA,Greene,IA,37,40.418909
49,50065,DAVIS CITY,Decatur,IA,27,147.902989
50,50066,DAWSON,Dallas,IA,25,64.732191
51,50067,DECATUR,Decatur,IA,27,87.093517
52,50068,DERBY,Lucas,IA,59,114.749633
53,50069,DE SOTO,Dallas,IA,25,13.262492
54,50070,DEXTER,Dallas,IA,25,171.070769
55,50071,DOWS,Wright,IA,99,293.114182
56,50072,EARLHAM,Madison,IA,61,228.908869
57,50073,ELKHART,Polk,IA,77,67.420142
58,50074,ELLSTON,Ringgold,IA,80,127.815922
59,50075,ELLSWORTH,Hamilton,IA,40,119.369391
60,50076,EXIRA,Audubon,IA,5,290.651382
61,50078,FERGUSON,Marshall,IA,64,0.660699
62,50101,GALT,Wright,IA,99,25.431226
63,50102,GARDEN CITY,Hardin,IA,42,1.304963
64,50103,GARDEN GROVE,Decatur,IA,27,157.409444
65,50104,GIBSON,Keokuk,IA,54,28.94056
66,50105,GILBERT,Story,IA,85,20.560447
67,50106,GILMAN,Marshall,IA,64,170.141114
68,50107,GRAND JUNCTION,Greene,IA,37,130.433186
69,50108,GRAND RIVER,Decatur,IA,27,188.608107
70,50109,GRANGER,Polk,IA,77,62.757235
71,50111,GRIMES,Polk,IA,77,71.484606
72,50112,GRINNELL,Poweshiek,IA,79,475.672153
73,50115,GUTHRIE CENTER,Guthrie,IA,39,441.185334
74,50116,HAMILTON,Marion,IA,63,44.071754
75,50117,HAMLIN,Audubon,IA,5,71.65718
76,50118,HARTFORD,Warren,IA,91,51.466255
77,50119,HARVEY,Marion,IA,63,40.354632
78,50120,HAVERHILL,Marshall,IA,64,43.94135
79,50122,HUBBARD,Hardin,IA,42,218.621364
80,50123,HUMESTON,Wayne,IA,93,204.584834
81,50124,HUXLEY,Story,IA,85,54.098688
82,50125,INDIANOLA,Warren,IA,91,426.521944
83,50126,IOWA FALLS,Hardin,IA,42,351.87483
84,50127,IRA,Jasper,IA,50,0.02826
85,50128,JAMAICA,Guthrie,IA,39,75.898812
86,50129,JEFFERSON,Greene,IA,37,435.65865
87,50130,JEWELL,Hamilton,IA,40,171.933052
88,50131,JOHNSTON,Polk,IA,77,64.666939
89,50132,KAMRAR,Hamilton,IA,40,74.609258
90,50133,KELLERTON,Ringgold,IA,80,196.966937
91,50134,KELLEY,Story,IA,85,48.427675
92,50135,KELLOGG,Jasper,IA,50,195.278764
93,50136,KESWICK,Keokuk,IA,54,100.632209
94,50138,KNOXVILLE,Marion,IA,63,461.827288
95,50139,LACONA,Warren,IA,91,229.508368
96,50140,LAMONI,Decatur,IA,27,241.678133
97,50141,LAUREL,Marshall,IA,64,93.777481
98,50142,LE GRAND,Marshall,IA,64,2.566675
99,50143,LEIGHTON,Mahaska,IA,62,92.958086
100,50144,LEON,Decatur,IA,27,362.605062
101,50146,LINDEN,Dallas,IA,25,75.845999
102,50147,LINEVILLE,Wayne,IA,93,187.491745
103,50148,LISCOMB,Marshall,IA,64,51.098115
104,50149,LORIMOR,Union,IA,88,213.449691
105,50150,LOVILIA,Monroe,IA,68,148.536678
106,50151,LUCAS,Lucas,IA,59,205.563843
107,50153,LYNNVILLE,Jasper,IA,50,80.372194
108,50154,MC CALLSBURG,Story,IA,85,53.144166
109,50155,MACKSBURG,Madison,IA,61,78.50132
110,50156,MADRID,Boone,IA,8,238.724015
111,50157,MALCOM,Poweshiek,IA,79,168.389373
112,50158,MARSHALLTOWN,Marshall,IA,64,548.346934
113,50160,MARTENSDALE,Warren,IA,91,0.965184
114,50161,MAXWELL,Story,IA,85,211.812219
115,50162,MELBOURNE,Marshall,IA,64,132.120698
116,50163,MELCHER-DALLAS,Marion,IA,63,1.230059
117,50164,MENLO,Guthrie,IA,39,138.23504
118,50165,MILLERTON,Wayne,IA,93,6.437634
119,50166,MILO,Warren,IA,91,162.522428
120,50167,MINBURN,Dallas,IA,25,102.610498
121,50168,MINGO,Jasper,IA,50,104.773461
122,50169,MITCHELLVILLE,Polk,IA,77,103.378565
123,50170,MONROE,Jasper,IA,50,230.99503
124,50171,MONTEZUMA,Poweshiek,IA,79,281.675684
125,50173,MONTOUR,Tama,IA,86,80.140704
126,50174,MURRAY,Clarke,IA,20,278.718495
127,50201,NEVADA,Story,IA,85,300.453642
128,50206,NEW PROVIDENCE,Hardin,IA,42,124.182004
129,50207,NEW SHARON,Mahaska,IA,62,366.588282
130,50208,NEWTON,Jasper,IA,50,426.046663
131,50210,NEW VIRGINIA,Warren,IA,91,196.767136
132,50211,NORWALK,Warren,IA,91,147.182178
133,50212,OGDEN,Boone,IA,8,352.230522
134,50213,OSCEOLA,Clarke,IA,20,543.975469
135,50214,OTLEY,Marion,IA,63,102.571148
136,50216,PANORA,Guthrie,IA,39,145.699881
137,50217,PATON,Greene,IA,37,178.606122
138,50218,PATTERSON,Madison,IA,61,0.542828
139,50219,PELLA,Marion,IA,63,317.144262
140,50220,PERRY,Dallas,IA,25,268.176779
141,50222,PERU,Madison,IA,61,108.369441
142,50223,PILOT MOUND,Boone,IA,8,76.600548
143,50225,PLEASANTVILLE,Marion,IA,63,217.246336
144,50226,POLK CITY,Polk,IA,77,109.873855
145,50227,POPEJOY,Franklin,IA,35,0.966375
146,50228,PRAIRIE CITY,Jasper,IA,50,180.367188
147,50229,PROLE,Warren,IA,91,105.616644
148,50230,RADCLIFFE,Hardin,IA,42,223.982113
149,50231,RANDALL,Hamilton,IA,40,1.065168
150,50232,REASNOR,Jasper,IA,50,86.762448
151,50233,REDFIELD,Dallas,IA,25,130.640688
152,50234,RHODES,Marshall,IA,64,81.236093
153,50235,RIPPEY,Greene,IA,37,121.134316
154,50236,ROLAND,Story,IA,85,89.345522
155,50237,RUNNELLS,Polk,IA,77,146.002809
156,50238,RUSSELL,Lucas,IA,59,308.904729
157,50239,SAINT ANTHONY,Marshall,IA,64,44.053312
158,50240,SAINT CHARLES,Madison,IA,61,197.714047
159,50242,SEARSBORO,Poweshiek,IA,79,106.954493
160,50243,SHELDAHL,Story,IA,85,1.425493
161,50244,SLATER,Story,IA,85,57.130248
162,50244,SLATER,Story,IA,85,57.130248
163,50246,STANHOPE,Hamilton,IA,40,120.153227
164,50247,STATE CENTER,Marshall,IA,64,215.968634
165,50248,STORY CITY,Story,IA,85,211.580755
166,50249,STRATFORD,Hamilton,IA,40,202.332923
167,50250,STUART,Adair,IA,1,265.060086
168,50251,SULLY,Jasper,IA,50,104.817095
169,50252,SWAN,Marion,IA,63,22.861403
170,50254,THAYER,Union,IA,88,113.65365
171,50255,THORNBURG,Keokuk,IA,54,0.456756
172,50256,TRACY,Marion,IA,63,70.812037
173,50257,TRURO,Madison,IA,61,103.296613
174,50258,UNION,Hardin,IA,42,139.476319
175,50261,VAN METER,Madison,IA,61,173.242731
176,50262,VAN WERT,Decatur,IA,27,83.830606
177,50263,WAUKEE,Dallas,IA,25,90.002855
178,50264,WELDON,Decatur,IA,27,174.299474
179,50265,WEST DES MOINES,Polk,IA,77,46.466559
180,50266,WEST DES MOINES,Dallas,IA,25,43.060835
181,50268,WHAT CHEER,Keokuk,IA,54,123.524623
182,50271,WILLIAMS,Hamilton,IA,40,172.819803
183,50272,WILLIAMSON,Lucas,IA,59,1.922108
184,50273,WINTERSET,Madison,IA,61,519.17142
185,50274,WIOTA,Cass,IA,15,130.159776
186,50275,WOODBURN,Clarke,IA,20,131.849713
187,50276,WOODWARD,Dallas,IA,25,209.84696
188,50277,YALE,Guthrie,IA,39,104.624278
189,50278,ZEARING,Story,IA,85,138.864198
190,50309,DES MOINES,Polk,IA,77,7.776473
191,50310,DES MOINES,Polk,IA,77,21.123546
192,50311,DES MOINES,Polk,IA,77,6.511832
193,50312,DES MOINES,Polk,IA,77,15.05106
194,50313,DES MOINES,Polk,IA,77,47.635293
195,50314,DES MOINES,Polk,IA,77,6.629721
196,50315,DES MOINES,Polk,IA,77,26.560331
197,50316,DES MOINES,Polk,IA,77,9.302481
198,50317,DES MOINES,Polk,IA,77,60.041842
199,50319,DES MOINES,Polk,IA,77,0.213707
200,50320,DES MOINES,Polk,IA,77,49.547031
201,50321,DES MOINES,Polk,IA,77,30.969186
202,50322,URBANDALE,Polk,IA,77,27.938267
203,50323,URBANDALE,Dallas,IA,25,19.984131
204,50324,WINDSOR HEIGHTS,Polk,IA,77,3.74028
205,50325,CLIVE,Polk,IA,77,20.224117
206,50327,PLEASANT HILL,Polk,IA,77,49.702622
207,50401,MASON CITY,Cerro Gordo,IA,17,387.509792
208,50420,ALEXANDER,Franklin,IA,35,117.256906
209,50421,BELMOND,Wright,IA,99,232.911303
210,50423,BRITT,Hancock,IA,41,376.364842
211,50424,BUFFALO CENTER,Winnebago,IA,95,315.854649
212,50426,CARPENTER,Mitchell,IA,66,0.060113
213,50428,CLEAR LAKE,Cerro Gordo,IA,17,316.380154
214,50430,CORWITH,Hancock,IA,41,160.984015
215,50431,COULTER,Franklin,IA,35,1.936776
216,50432,CRYSTAL LAKE,Hancock,IA,41,1.127714
217,50433,DOUGHERTY,Cerro Gordo,IA,17,125.889253
218,50434,FERTILE,Worth,IA,98,23.487419
219,50435,FLOYD,Floyd,IA,34,105.304339
220,50436,FOREST CITY,Winnebago,IA,95,354.034151
221,50438,GARNER,Hancock,IA,41,342.63578
222,50439,GOODELL,Hancock,IA,41,84.706319
223,50440,GRAFTON,Worth,IA,98,80.665798
224,50441,HAMPTON,Franklin,IA,35,395.662654
225,50444,HANLONTOWN,Worth,IA,98,55.966912
226,50446,JOICE,Worth,IA,98,108.86978
227,50447,KANAWHA,Hancock,IA,41,261.433926
228,50448,KENSETT,Worth,IA,98,160.165797
229,50449,KLEMME,Hancock,IA,41,95.013046
230,50450,LAKE MILLS,Winnebago,IA,95,214.081289
231,50451,LAKOTA,Kossuth,IA,55,164.530815
232,50452,LATIMER,Franklin,IA,35,127.424687
233,50453,LELAND,Winnebago,IA,95,114.347611
234,50454,LITTLE CEDAR,Mitchell,IA,66,44.76636
235,50455,MC INTIRE,Mitchell,IA,66,84.207513
236,50456,MANLY,Worth,IA,98,120.430473
237,50457,MESERVEY,Cerro Gordo,IA,17,88.503707
238,50458,NORA SPRINGS,Floyd,IA,34,190.798489
239,50459,NORTHWOOD,Worth,IA,98,375.483499
240,50460,ORCHARD,Mitchell,IA,66,93.166151
241,50461,OSAGE,Mitchell,IA,66,446.474922
242,50464,PLYMOUTH,Cerro Gordo,IA,17,61.009266
243,50465,RAKE,Winnebago,IA,95,9.497218
244,50466,RICEVILLE,Howard,IA,45,354.341172
245,50467,ROCK FALLS,Cerro Gordo,IA,17,0.709984
246,50468,ROCKFORD,Floyd,IA,34,256.209489
247,50469,ROCKWELL,Cerro Gordo,IA,17,227.879604
248,50470,ROWAN,Wright,IA,99,62.885229
249,50471,RUDD,Floyd,IA,34,110.089036
250,50472,SAINT ANSGAR,Mitchell,IA,66,316.41701
251,50473,SCARVILLE,Winnebago,IA,95,96.000513
252,50475,SHEFFIELD,Franklin,IA,35,231.102519
253,50476,STACYVILLE,Mitchell,IA,66,101.702581
254,50477,SWALEDALE,Cerro Gordo,IA,17,58.455862
255,50478,THOMPSON,Winnebago,IA,95,191.575269
256,50479,THORNTON,Cerro Gordo,IA,17,150.266394
257,50480,TITONKA,Kossuth,IA,55,163.023328
258,50482,VENTURA,Cerro Gordo,IA,17,81.166123
259,50483,WESLEY,Kossuth,IA,55,198.210428
260,50484,WODEN,Hancock,IA,41,113.720772
261,50501,FORT DODGE,Webster,IA,94,407.21578
262,50510,ALBERT CITY,Buena Vista,IA,11,219.73212
263,50511,ALGONA,Kossuth,IA,55,321.723723
264,50514,ARMSTRONG,Emmet,IA,32,287.766282
265,50515,AYRSHIRE,Palo Alto,IA,74,96.076793
266,50516,BADGER,Webster,IA,94,63.984856
267,50517,BANCROFT,Kossuth,IA,55,198.110964
268,50518,BARNUM,Webster,IA,94,80.803741
269,50519,BODE,Humboldt,IA,46,142.395653
270,50520,BRADGATE,Humboldt,IA,46,62.072927
271,50521,BURNSIDE,Webster,IA,94,3.160848
272,50522,BURT,Kossuth,IA,55,176.225395
273,50523,CALLENDER,Webster,IA,94,109.190936
274,50524,CLARE,Webster,IA,94,148.535431
275,50525,CLARION,Wright,IA,99,363.773781
276,50527,CURLEW,Palo Alto,IA,74,141.597255
277,50528,CYLINDER,Palo Alto,IA,74,191.696767
278,50529,DAKOTA CITY,Humboldt,IA,46,1.445412
279,50530,DAYTON,Webster,IA,94,168.385319
280,50531,DOLLIVER,Emmet,IA,32,100.449294
281,50532,DUNCOMBE,Webster,IA,94,181.71793
282,50533,EAGLE GROVE,Wright,IA,99,258.50312
283,50535,EARLY,Sac,IA,81,158.492375
284,50536,EMMETSBURG,Palo Alto,IA,74,370.232494
285,50538,FARNHAMVILLE,Calhoun,IA,13,80.88901
286,50539,FENTON,Kossuth,IA,55,139.538906
287,50540,FONDA,Pocahontas,IA,76,275.625728
288,50541,GILMORE CITY,Humboldt,IA,46,237.749663
289,50542,GOLDFIELD,Wright,IA,99,172.884137
290,50543,GOWRIE,Webster,IA,94,212.824794
291,50544,HARCOURT,Webster,IA,94,75.465666
292,50545,HARDY,Humboldt,IA,46,97.252233
293,50546,HAVELOCK,Pocahontas,IA,76,137.019674
294,50548,HUMBOLDT,Humboldt,IA,46,323.465219
295,50551,JOLLEY,Calhoun,IA,13,69.704315
296,50554,LAURENS,Pocahontas,IA,76,232.762069
297,50556,LEDYARD,Kossuth,IA,55,101.116
298,50557,LEHIGH,Webster,IA,94,130.151481
299,50558,LIVERMORE,Humboldt,IA,46,114.721586
300,50559,LONE ROCK,Kossuth,IA,55,102.790935
301,50560,LU VERNE,Kossuth,IA,55,225.654488
302,50561,LYTTON,Calhoun,IA,13,119.358374
303,50562,MALLARD,Palo Alto,IA,74,165.098566
304,50563,MANSON,Calhoun,IA,13,253.318021
305,50565,MARATHON,Buena Vista,IA,11,114.004136
306,50566,MOORLAND,Webster,IA,94,89.865522
307,50567,NEMAHA,Sac,IA,81,67.326397
308,50568,NEWELL,Buena Vista,IA,11,220.071293
309,50569,OTHO,Webster,IA,94,54.889709
310,50570,OTTOSEN,Humboldt,IA,46,112.220748
311,50571,PALMER,Pocahontas,IA,76,115.641227
312,50573,PLOVER,Pocahontas,IA,76,1.045738
313,50574,POCAHONTAS,Pocahontas,IA,76,288.134473
314,50575,POMEROY,Calhoun,IA,13,163.4466
315,50576,REMBRANDT,Buena Vista,IA,11,93.051632
316,50577,RENWICK,Humboldt,IA,46,106.941323
317,50578,RINGSTED,Emmet,IA,32,192.331445
318,50579,ROCKWELL CITY,Calhoun,IA,13,359.119951
319,50581,ROLFE,Pocahontas,IA,76,246.722485
320,50582,RUTLAND,Humboldt,IA,46,53.572451
321,50583,SAC CITY,Sac,IA,81,306.359541
322,50585,SIOUX RAPIDS,Buena Vista,IA,11,165.291906
323,50586,SOMERS,Calhoun,IA,13,91.12132
324,50588,STORM LAKE,Buena Vista,IA,11,368.993698
325,50590,SWEA CITY,Kossuth,IA,55,203.980739
326,50591,THOR,Humboldt,IA,46,73.985552
327,50593,VARINA,Pocahontas,IA,76,0.480019
328,50594,VINCENT,Webster,IA,94,67.103128
329,50595,WEBSTER CITY,Hamilton,IA,40,399.609138
330,50597,WEST BEND,Palo Alto,IA,74,214.240511
331,50598,WHITTEMORE,Kossuth,IA,55,176.474137
332,50599,WOOLSTOCK,Wright,IA,99,133.057067
333,50601,ACKLEY,Franklin,IA,35,368.01212
334,50602,ALLISON,Butler,IA,12,207.455662
335,50603,ALTA VISTA,Chickasaw,IA,19,122.972014
336,50604,APLINGTON,Butler,IA,12,184.521061
337,50605,AREDALE,Butler,IA,12,38.865937
338,50606,ARLINGTON,Fayette,IA,33,184.315162
339,50607,AURORA,Buchanan,IA,10,123.088687
340,50609,BEAMAN,Grundy,IA,38,89.218598
341,50611,BRISTOW,Butler,IA,12,78.763743
342,50612,BUCKINGHAM,Tama,IA,86,57.581068
343,50613,CEDAR FALLS,Black Hawk,IA,7,329.972902
344,50616,CHARLES CITY,Floyd,IA,34,448.105088
345,50619,CLARKSVILLE,Butler,IA,12,230.86623
346,50620,COLWELL,Floyd,IA,34,0.324589
347,50621,CONRAD,Grundy,IA,38,165.399151
348,50622,DENVER,Bremer,IA,9,64.857976
349,50624,DIKE,Grundy,IA,38,135.187133
350,50625,DUMONT,Butler,IA,12,158.053593
351,50626,DUNKERTON,Black Hawk,IA,7,130.892804
352,50627,ELDORA,Hardin,IA,42,277.223505
353,50628,ELMA,Howard,IA,45,289.356789
354,50629,FAIRBANK,Buchanan,IA,10,205.328666
355,50630,FREDERICKSBURG,Chickasaw,IA,19,214.715992
356,50632,GARWIN,Tama,IA,86,110.25125
357,50632,GARWIN,Tama,IA,86,110.25125
358,50633,GENEVA,Franklin,IA,35,103.078532
359,50634,GILBERTVILLE,Black Hawk,IA,7,1.018252
360,50635,GLADBROOK,Tama,IA,86,217.592812
361,50636,GREENE,Butler,IA,12,313.519052
362,50638,GRUNDY CENTER,Grundy,IA,38,245.826479
363,50641,HAZLETON,Buchanan,IA,10,123.019787
364,50642,HOLLAND,Grundy,IA,38,83.404671
365,50643,HUDSON,Black Hawk,IA,7,163.108458
366,50644,INDEPENDENCE,Buchanan,IA,10,372.595201
367,50645,IONIA,Chickasaw,IA,19,214.246026
368,50647,JANESVILLE,Bremer,IA,9,79.06534
369,50648,JESUP,Black Hawk,IA,7,223.212718
370,50650,LAMONT,Buchanan,IA,10,106.756415
371,50651,LA PORTE CITY,Black Hawk,IA,7,294.627077
372,50652,LINCOLN,Tama,IA,86,0.605668
373,50653,MARBLE ROCK,Floyd,IA,34,134.748744
374,50654,MASONVILLE,Delaware,IA,28,136.18815
375,50655,MAYNARD,Fayette,IA,33,91.05168
376,50658,NASHUA,Chickasaw,IA,19,187.097603
377,50659,NEW HAMPTON,Chickasaw,IA,19,403.802766
378,50660,NEW HARTFORD,Butler,IA,12,100.165022
379,50662,OELWEIN,Fayette,IA,33,176.049517
380,50664,ORAN,Fayette,IA,33,0.09535
381,50665,PARKERSBURG,Butler,IA,12,253.290828
382,50666,PLAINFIELD,Bremer,IA,9,140.256035
383,50667,RAYMOND,Black Hawk,IA,7,5.217793
384,50668,READLYN,Bremer,IA,9,87.689572
385,50669,REINBECK,Grundy,IA,38,239.750053
386,50670,SHELL ROCK,Butler,IA,12,148.931701
387,50671,STANLEY,Buchanan,IA,10,57.533274
388,50672,STEAMBOAT ROCK,Hardin,IA,42,94.993283
389,50673,STOUT,Grundy,IA,38,0.444167
390,50674,SUMNER,Bremer,IA,9,408.690075
391,50675,TRAER,Tama,IA,86,287.237436
392,50676,TRIPOLI,Bremer,IA,9,148.867149
393,50677,WAVERLY,Bremer,IA,9,325.186841
394,50680,WELLSBURG,Grundy,IA,38,138.682394
395,50681,WESTGATE,Fayette,IA,33,63.049395
396,50682,WINTHROP,Buchanan,IA,10,220.98261
397,50701,WATERLOO,Black Hawk,IA,7,214.718743
398,50702,WATERLOO,Black Hawk,IA,7,25.60849
399,50703,WATERLOO,Black Hawk,IA,7,244.724015
400,50707,EVANSDALE,Black Hawk,IA,7,25.361881
401,50801,CRESTON,Union,IA,88,545.028688
402,50830,AFTON,Union,IA,88,306.617835
403,50833,BEDFORD,Taylor,IA,87,536.325319
404,50835,BENTON,Ringgold,IA,80,43.994784
405,50836,BLOCKTON,Taylor,IA,87,232.828727
406,50837,BRIDGEWATER,Adair,IA,1,130.795854
407,50839,CARBON,Adams,IA,2,1.828417
408,50840,CLEARFIELD,Taylor,IA,87,129.94877
409,50841,CORNING,Adams,IA,2,610.836196
410,50842,CROMWELL,Union,IA,88,0.674912
411,50843,CUMBERLAND,Cass,IA,15,195.866561
412,50845,DIAGONAL,Ringgold,IA,80,285.590236
413,50846,FONTANELLE,Adair,IA,1,238.867152
414,50847,GRANT,Montgomery,IA,69,0.863108
415,50848,GRAVITY,Taylor,IA,87,117.720194
416,50849,GREENFIELD,Adair,IA,1,304.431532
417,50851,LENOX,Taylor,IA,87,335.214398
418,50853,MASSENA,Cass,IA,15,195.987739
419,50854,MOUNT AYR,Ringgold,IA,80,350.941226
420,50857,NODAWAY,Adams,IA,2,131.265143
421,50858,ORIENT,Adair,IA,1,206.99525
422,50859,PRESCOTT,Adams,IA,2,206.490726
423,50860,REDDING,Ringgold,IA,80,115.136578
424,50861,SHANNON CITY,Union,IA,88,113.520838
425,50862,SHARPSBURG,Taylor,IA,87,56.218206
426,50863,TINGLEY,Ringgold,IA,80,78.178667
427,50864,VILLISCA,Montgomery,IA,69,377.642994
428,51001,AKRON,Plymouth,IA,75,360.862327
429,51002,ALTA,Buena Vista,IA,11,297.148464
430,51003,ALTON,Sioux,IA,84,144.371109
431,51004,ANTHON,Woodbury,IA,97,212.848541
432,51005,AURELIA,Cherokee,IA,18,244.17026
433,51006,BATTLE CREEK,Ida,IA,47,213.095547
434,51007,BRONSON,Woodbury,IA,97,87.318034
435,51008,BRUNSVILLE,Plymouth,IA,75,0.630646
436,51009,CALUMET,O'Brien,IA,71,0.611821
437,51010,CASTANA,Monona,IA,67,176.313125
438,51011,CHATSWORTH,Sioux,IA,84,1.275286
439,51012,CHEROKEE,Cherokee,IA,18,386.838104
440,51014,CLEGHORN,Cherokee,IA,18,139.646952
441,51016,CORRECTIONVILLE,Woodbury,IA,97,262.895432
442,51018,CUSHING,Woodbury,IA,97,95.118381
443,51019,DANBURY,Woodbury,IA,97,236.787928
444,51020,GALVA,Ida,IA,47,140.577018
445,51022,GRANVILLE,Sioux,IA,84,204.082524
446,51023,HAWARDEN,Sioux,IA,84,271.44549
447,51024,HINTON,Plymouth,IA,75,232.728447
448,51025,HOLSTEIN,Ida,IA,47,278.402985
449,51026,HORNICK,Woodbury,IA,97,281.235872
450,51027,IRETON,Sioux,IA,84,239.835719
451,51028,KINGSLEY,Plymouth,IA,75,328.930974
452,51029,LARRABEE,Cherokee,IA,18,58.631884
453,51030,LAWTON,Woodbury,IA,97,153.939315
454,51031,LE MARS,Plymouth,IA,75,605.111843
455,51033,LINN GROVE,Buena Vista,IA,11,163.475339
456,51034,MAPLETON,Monona,IA,67,290.876284
457,51035,MARCUS,Cherokee,IA,18,278.97681
458,51036,MAURICE,Sioux,IA,84,114.229271
459,51037,MERIDEN,Cherokee,IA,18,61.66748
460,51038,MERRILL,Plymouth,IA,75,233.828323
461,51039,MOVILLE,Woodbury,IA,97,223.159311
462,51040,ONAWA,Monona,IA,67,399.810225
463,51041,ORANGE CITY,Sioux,IA,84,184.491043
464,51044,OTO,Woodbury,IA,97,89.839583
465,51046,PAULLINA,O'Brien,IA,71,240.918039
466,51047,PETERSON,Clay,IA,21,200.078951
467,51048,PIERSON,Woodbury,IA,97,86.240685
468,51049,QUIMBY,Cherokee,IA,18,113.097948
469,51050,REMSEN,Plymouth,IA,75,353.266814
470,51051,RODNEY,Monona,IA,67,8.740209
471,51052,SALIX,Woodbury,IA,97,159.659315
472,51053,SCHALLER,Sac,IA,81,195.513267
473,51054,SERGEANT BLUFF,Woodbury,IA,97,106.329025
474,51055,SLOAN,Woodbury,IA,97,174.692448
475,51056,SMITHLAND,Woodbury,IA,97,88.932213
476,51058,SUTHERLAND,O'Brien,IA,71,214.786027
477,51060,UTE,Monona,IA,67,156.247108
478,51061,WASHTA,Cherokee,IA,18,121.628171
479,51062,WESTFIELD,Plymouth,IA,75,144.594267
480,51063,WHITING,Monona,IA,67,162.058438
481,51101,SIOUX CITY,Woodbury,IA,97,3.138764
482,51103,SIOUX CITY,Woodbury,IA,97,27.86321
483,51104,SIOUX CITY,Woodbury,IA,97,20.00953
484,51105,SIOUX CITY,Woodbury,IA,97,15.825592
485,51106,SIOUX CITY,Woodbury,IA,97,81.702782
486,51108,SIOUX CITY,Woodbury,IA,97,116.455967
487,51109,SIOUX CITY,Woodbury,IA,97,49.159557
488,51111,SIOUX CITY,Woodbury,IA,97,17.993387
489,51201,SHELDON,O'Brien,IA,71,295.817592
490,51230,ALVORD,Lyon,IA,60,64.875507
491,51231,ARCHER,O'Brien,IA,71,73.029493
492,51232,ASHTON,Osceola,IA,72,156.638511
493,51234,BOYDEN,Sioux,IA,84,129.274507
494,51235,DOON,Lyon,IA,60,144.971909
495,51237,GEORGE,Lyon,IA,60,249.759921
496,51238,HOSPERS,Sioux,IA,84,118.831336
497,51239,HULL,Sioux,IA,84,171.513941
498,51240,INWOOD,Lyon,IA,60,263.161609
499,51241,LARCHWOOD,Lyon,IA,60,232.767886
500,51242,LESTER,Lyon,IA,60,1.181066
501,51243,LITTLE ROCK,Lyon,IA,60,139.238401
502,51244,MATLOCK,Sioux,IA,84,0.778428
503,51245,PRIMGHAR,O'Brien,IA,71,203.413241
504,51246,ROCK RAPIDS,Lyon,IA,60,424.515402
505,51247,ROCK VALLEY,Sioux,IA,84,289.650577
506,51248,SANBORN,O'Brien,IA,71,214.49695
507,51249,SIBLEY,Osceola,IA,72,324.74321
508,51250,SIOUX CENTER,Sioux,IA,84,186.398963
509,51301,SPENCER,Clay,IA,21,406.942409
510,51331,ARNOLDS PARK,Dickinson,IA,30,7.12251
511,51333,DICKENS,Clay,IA,21,170.365473
512,51334,ESTHERVILLE,Emmet,IA,32,491.781068
513,51338,EVERLY,Clay,IA,21,189.29993
514,51341,GILLETT GROVE,Clay,IA,21,0.89395
515,51342,GRAETTINGER,Palo Alto,IA,74,214.231634
516,51343,GREENVILLE,Clay,IA,21,65.322005
517,51345,HARRIS,Osceola,IA,72,146.304432
518,51346,HARTLEY,O'Brien,IA,71,377.358581
519,51347,LAKE PARK,Dickinson,IA,30,223.92845
520,51350,MELVIN,Osceola,IA,72,112.104258
521,51351,MILFORD,Dickinson,IA,30,278.931975
522,51354,OCHEYEDAN,Osceola,IA,72,227.662269
523,51355,OKOBOJI,Dickinson,IA,30,11.052693
524,51357,ROYAL,Clay,IA,21,108.592195
525,51358,RUTHVEN,Palo Alto,IA,74,202.954357
526,51360,SPIRIT LAKE,Dickinson,IA,30,334.993966
527,51363,SUPERIOR,Dickinson,IA,30,1.050241
528,51364,TERRIL,Dickinson,IA,30,167.034887
529,51365,WALLINGFORD,Emmet,IA,32,58.440957
530,51366,WEBB,Clay,IA,21,163.894539
531,51401,CARROLL,Carroll,IA,14,454.121
532,51430,ARCADIA,Carroll,IA,14,105.531033
533,51431,ARTHUR,Ida,IA,47,101.935208
534,51433,AUBURN,Sac,IA,81,147.355553
535,51436,BREDA,Carroll,IA,14,156.790205
536,51439,CHARTER OAK,Crawford,IA,24,223.476837
537,51440,DEDHAM,Carroll,IA,14,66.015871
538,51441,DELOIT,Crawford,IA,24,40.266915
539,51442,DENISON,Crawford,IA,24,448.677639
540,51443,GLIDDEN,Carroll,IA,14,276.442701
541,51444,HALBUR,Carroll,IA,14,0.42223
542,51445,IDA GROVE,Ida,IA,47,327.873042
543,51446,IRWIN,Shelby,IA,83,90.274738
544,51447,KIRKMAN,Shelby,IA,83,90.268943
545,51448,KIRON,Crawford,IA,24,144.043445
546,51449,LAKE CITY,Calhoun,IA,13,232.944822
547,51450,LAKE VIEW,Sac,IA,81,157.595822
548,51451,LANESBORO,Carroll,IA,14,0.955449
549,51453,LOHRVILLE,Calhoun,IA,13,223.339647
550,51454,MANILLA,Crawford,IA,24,259.313231
551,51455,MANNING,Carroll,IA,14,287.004925
552,51458,ODEBOLT,Sac,IA,81,245.407583
553,51459,RALSTON,Carroll,IA,14,1.355933
554,51461,SCHLESWIG,Crawford,IA,24,130.196128
555,51462,SCRANTON,Greene,IA,37,281.553969
556,51463,TEMPLETON,Carroll,IA,14,77.535941
557,51465,VAIL,Crawford,IA,24,141.104728
558,51466,WALL LAKE,Sac,IA,81,134.548626
559,51467,WESTSIDE,Crawford,IA,24,168.975493
560,51501,COUNCIL BLUFFS,Pottawattamie,IA,78,68.663347
561,51503,COUNCIL BLUFFS,Pottawattamie,IA,78,311.311378
562,51510,CARTER LAKE,Pottawattamie,IA,78,5.228569
563,51520,ARION,Crawford,IA,24,48.585455
564,51521,AVOCA,Pottawattamie,IA,78,223.26416
565,51523,BLENCOE,Monona,IA,67,134.969227
566,51525,CARSON,Pottawattamie,IA,78,158.290149
567,51526,CRESCENT,Pottawattamie,IA,78,111.333068
568,51527,DEFIANCE,Shelby,IA,83,101.109924
569,51528,DOW CITY,Crawford,IA,24,189.885849
570,51529,DUNLAP,Harrison,IA,43,333.953737
571,51530,EARLING,Shelby,IA,83,140.370471
572,51531,ELK HORN,Shelby,IA,83,73.952439
573,51532,ELLIOTT,Montgomery,IA,69,147.80971
574,51533,EMERSON,Mills,IA,65,213.641636
575,51534,GLENWOOD,Mills,IA,65,260.80305
576,51535,GRISWOLD,Cass,IA,15,337.126792
577,51536,HANCOCK,Pottawattamie,IA,78,124.955843
578,51537,HARLAN,Shelby,IA,83,422.332929
579,51540,HASTINGS,Mills,IA,65,141.382464
580,51541,HENDERSON,Mills,IA,65,106.735256
581,51542,HONEY CREEK,Pottawattamie,IA,78,92.88071
582,51543,KIMBALLTON,Audubon,IA,5,57.402651
583,51544,LEWIS,Cass,IA,15,145.135032
584,51545,LITTLE SIOUX,Harrison,IA,43,116.122152
585,51546,LOGAN,Harrison,IA,43,293.310022
586,51548,MC CLELLAND,Pottawattamie,IA,78,64.393248
587,51549,MACEDONIA,Pottawattamie,IA,78,84.699901
588,51550,MAGNOLIA,Harrison,IA,43,1.456103
589,51551,MALVERN,Mills,IA,65,210.796119
590,51552,MARNE,Cass,IA,15,91.256513
591,51553,MINDEN,Pottawattamie,IA,78,118.587348
592,51554,MINEOLA,Mills,IA,65,6.840874
593,51555,MISSOURI VALLEY,Harrison,IA,43,410.136654
594,51556,MODALE,Harrison,IA,43,117.494595
595,51557,MONDAMIN,Harrison,IA,43,181.091843
596,51558,MOORHEAD,Monona,IA,67,213.388981
597,51559,NEOLA,Pottawattamie,IA,78,220.723262
598,51560,OAKLAND,Pottawattamie,IA,78,254.891645
599,51561,PACIFIC JUNCTION,Mills,IA,65,159.38428
600,51562,PANAMA,Shelby,IA,83,87.966803
601,51563,PERSIA,Harrison,IA,43,142.269566
602,51564,PISGAH,Harrison,IA,43,103.461051
603,51565,PORTSMOUTH,Shelby,IA,83,126.12811
604,51566,RED OAK,Montgomery,IA,69,457.891547
605,51570,SHELBY,Shelby,IA,83,166.695973
606,51571,SILVER CITY,Mills,IA,65,121.369661
607,51572,SOLDIER,Monona,IA,67,114.390965
608,51573,STANTON,Montgomery,IA,69,156.152123
609,51575,TREYNOR,Pottawattamie,IA,78,126.569658
610,51576,UNDERWOOD,Pottawattamie,IA,78,130.453907
611,51577,WALNUT,Pottawattamie,IA,78,206.979038
612,51578,WESTPHALIA,Shelby,IA,83,0.096684
613,51579,WOODBINE,Harrison,IA,43,301.420886
614,51601,SHENANDOAH,Page,IA,73,276.34259
615,51630,BLANCHARD,Page,IA,73,65.739953
616,51631,BRADDYVILLE,Page,IA,73,95.851302
617,51632,CLARINDA,Page,IA,73,540.669979
618,51636,COIN,Page,IA,73,144.434643
619,51637,COLLEGE SPRINGS,Page,IA,73,4.185566
620,51638,ESSEX,Page,IA,73,220.766642
621,51639,FARRAGUT,Fremont,IA,36,186.79102
622,51640,HAMBURG,Fremont,IA,36,313.989946
623,51645,IMOGENE,Fremont,IA,36,107.043333
624,51646,NEW MARKET,Taylor,IA,87,162.903928
625,51647,NORTHBORO,Page,IA,73,49.689882
626,51648,PERCIVAL,Fremont,IA,36,130.838045
627,51649,RANDOLPH,Fremont,IA,36,106.334486
628,51650,RIVERTON,Fremont,IA,36,78.508646
629,51652,SIDNEY,Fremont,IA,36,200.054769
630,51653,TABOR,Fremont,IA,36,89.512532
631,51654,THURMAN,Fremont,IA,36,137.790111
632,51656,YORKTOWN,Page,IA,73,0.438077
633,52001,DUBUQUE,Dubuque,IA,31,75.057763
634,52002,DUBUQUE,Dubuque,IA,31,74.76947
635,52003,DUBUQUE,Dubuque,IA,31,151.954104
636,52030,ANDREW,Jackson,IA,49,0.690483
637,52031,BELLEVUE,Jackson,IA,49,448.070077
638,52032,BERNARD,Jackson,IA,49,272.572418
639,52033,CASCADE,Jones,IA,53,252.635822
640,52035,COLESBURG,Clayton,IA,22,139.028244
641,52037,DELMAR,Clinton,IA,23,176.775293
642,52038,DUNDEE,Delaware,IA,28,76.611993
643,52039,DURANGO,Dubuque,IA,31,91.068312
644,52040,DYERSVILLE,Dubuque,IA,31,157.748966
645,52041,EARLVILLE,Delaware,IA,28,155.040397
646,52042,EDGEWOOD,Clayton,IA,22,161.463192
647,52043,ELKADER,Clayton,IA,22,258.435335
648,52044,ELKPORT,Clayton,IA,22,31.383574
649,52045,EPWORTH,Dubuque,IA,31,123.92508
650,52046,FARLEY,Dubuque,IA,31,119.196635
651,52047,FARMERSBURG,Clayton,IA,22,88.328123
652,52048,GARBER,Clayton,IA,22,85.789255
653,52049,GARNAVILLO,Clayton,IA,22,188.393273
654,52050,GREELEY,Delaware,IA,28,84.028046
655,52052,GUTTENBERG,Clayton,IA,22,249.00035
656,52053,HOLY CROSS,Dubuque,IA,31,154.540299
657,52054,LA MOTTE,Jackson,IA,49,138.18618
658,52057,MANCHESTER,Delaware,IA,28,362.350079
659,52060,MAQUOKETA,Jackson,IA,49,409.835195
660,52064,MILES,Jackson,IA,49,107.540582
661,52065,NEW VIENNA,Dubuque,IA,31,131.382162
662,52066,NORTH BUENA VISTA,Clayton,IA,22,0.681956
663,52068,PEOSTA,Dubuque,IA,31,120.418986
664,52069,PRESTON,Jackson,IA,49,133.277674
665,52070,SABULA,Jackson,IA,49,154.415679
666,52072,SAINT OLAF,Clayton,IA,22,87.106056
667,52073,SHERRILL,Dubuque,IA,31,139.623458
668,52074,SPRAGUEVILLE,Jackson,IA,49,80.684903
669,52076,STRAWBERRY POINT,Clayton,IA,22,252.900986
670,52077,VOLGA,Clayton,IA,22,83.72915
671,52078,WORTHINGTON,Dubuque,IA,31,106.290071
672,52079,ZWINGLE,Jackson,IA,49,153.776141
673,52101,DECORAH,Winneshiek,IA,96,804.622162
674,52132,CALMAR,Winneshiek,IA,96,157.379001
675,52133,CASTALIA,Winneshiek,IA,96,118.295514
676,52134,CHESTER,Howard,IA,45,77.559488
677,52135,CLERMONT,Fayette,IA,33,67.561964
678,52136,CRESCO,Howard,IA,45,552.444562
679,52140,DORCHESTER,Allamakee,IA,3,201.58604
680,52141,ELGIN,Fayette,IA,33,216.938945
681,52142,FAYETTE,Fayette,IA,33,189.331878
682,52144,FORT ATKINSON,Winneshiek,IA,96,194.119174
683,52146,HARPERS FERRY,Allamakee,IA,3,221.678469
684,52147,HAWKEYE,Fayette,IA,33,187.912171
685,52151,LANSING,Allamakee,IA,3,324.670969
686,52154,LAWLER,Chickasaw,IA,19,189.966484
687,52155,LIME SPRINGS,Howard,IA,45,274.895155
688,52156,LUANA,Clayton,IA,22,110.19542
689,52157,MC GREGOR,Clayton,IA,22,149.443621
690,52158,MARQUETTE,Clayton,IA,22,3.357585
691,52159,MONONA,Clayton,IA,22,197.141302
692,52160,NEW ALBIN,Allamakee,IA,3,103.702081
693,52161,OSSIAN,Winneshiek,IA,96,140.304657
694,52162,POSTVILLE,Allamakee,IA,3,258.921418
695,52163,PROTIVIN,Howard,IA,45,4.602409
696,52164,RANDALIA,Fayette,IA,33,65.306405
697,52165,RIDGEWAY,Winneshiek,IA,96,172.29035
698,52166,SAINT LUCAS,Fayette,IA,33,0.493739
699,52168,SPILLVILLE,Winneshiek,IA,96,0.417415
700,52169,WADENA,Fayette,IA,33,77.089638
701,52170,WATERVILLE,Allamakee,IA,3,120.174634
702,52171,WAUCOMA,Fayette,IA,33,205.89717
703,52171,WAUCOMA,Fayette,IA,33,205.89717
704,52172,WAUKON,Allamakee,IA,3,409.930314
705,52175,WEST UNION,Fayette,IA,33,224.708583
706,52201,AINSWORTH,Washington,IA,92,171.814169
707,52202,ALBURNETT,Linn,IA,57,65.295077
708,52203,AMANA,Iowa,IA,48,119.753928
709,52205,ANAMOSA,Jones,IA,53,309.946308
710,52206,ATKINS,Benton,IA,6,73.245916
711,52207,BALDWIN,Jackson,IA,49,102.173612
712,52208,BELLE PLAINE,Benton,IA,6,150.72943
713,52209,BLAIRSTOWN,Benton,IA,6,95.402594
714,52210,BRANDON,Buchanan,IA,10,90.462642
715,52211,BROOKLYN,Poweshiek,IA,79,236.939306
716,52212,CENTER JUNCTION,Jones,IA,53,60.790568
717,52213,CENTER POINT,Linn,IA,57,194.486805
718,52214,CENTRAL CITY,Linn,IA,57,247.622502
719,52215,CHELSEA,Tama,IA,86,224.182198
720,52216,CLARENCE,Cedar,IA,16,146.950128
721,52217,CLUTIER,Tama,IA,86,152.367658
722,52218,COGGON,Linn,IA,57,187.432341
723,52219,PRAIRIEBURG,Linn,IA,57,1.194124
724,52220,CONROY,Iowa,IA,48,1.194245
725,52221,GUERNSEY,Poweshiek,IA,79,53.776053
726,52222,DEEP RIVER,Poweshiek,IA,79,209.152349
727,52223,DELHI,Delaware,IA,28,127.619807
728,52224,DYSART,Tama,IA,86,256.593612
729,52225,ELBERON,Tama,IA,86,91.276214
730,52227,ELY,Linn,IA,57,76.013553
731,52228,FAIRFAX,Linn,IA,57,96.928032
732,52229,GARRISON,Benton,IA,6,127.391906
733,52231,HARPER,Keokuk,IA,54,84.735151
734,52232,HARTWICK,Poweshiek,IA,79,47.517446
735,52233,HIAWATHA,Linn,IA,57,9.080084
736,52235,HILLS,Johnson,IA,52,5.257012
737,52236,HOMESTEAD,Iowa,IA,48,74.148178
738,52237,HOPKINTON,Delaware,IA,28,210.461938
739,52240,IOWA CITY,Johnson,IA,52,415.571318
740,52241,CORALVILLE,Johnson,IA,52,30.871305
741,52242,IOWA CITY,Johnson,IA,52,1.995678
742,52245,IOWA CITY,Johnson,IA,52,21.712859
743,52246,IOWA CITY,Johnson,IA,52,23.832009
744,52246,IOWA CITY,Johnson,IA,52,23.832009
745,52247,KALONA,Washington,IA,92,206.350122
746,52248,KEOTA,Washington,IA,92,295.236835
747,52249,KEYSTONE,Benton,IA,6,131.160363
748,52251,LADORA,Iowa,IA,48,106.339173
749,52253,LISBON,Linn,IA,57,121.480006
750,52254,LOST NATION,Clinton,IA,23,146.759383
751,52255,LOWDEN,Cedar,IA,16,112.408307
752,52257,LUZERNE,Benton,IA,6,40.66545
753,52301,MARENGO,Iowa,IA,48,316.484635
754,52302,MARION,Linn,IA,57,192.237746
755,52305,MARTELLE,Jones,IA,53,73.493573
756,52306,MECHANICSVILLE,Cedar,IA,16,194.245557
757,52307,MIDDLE AMANA,Iowa,IA,48,0.488234
758,52308,MILLERSBURG,Iowa,IA,48,0.233003
759,52309,MONMOUTH,Jackson,IA,49,74.291839
760,52310,MONTICELLO,Jones,IA,53,385.183593
761,52312,MORLEY,Jones,IA,53,0.242872
762,52313,MOUNT AUBURN,Benton,IA,6,84.959546
763,52314,MOUNT VERNON,Linn,IA,57,155.658054
764,52315,NEWHALL,Benton,IA,6,75.003645
765,52316,NORTH ENGLISH,Iowa,IA,48,180.293861
766,52317,NORTH LIBERTY,Johnson,IA,52,95.897591
767,52318,NORWAY,Benton,IA,6,95.080335
768,52320,OLIN,Jones,IA,53,153.985129
769,52321,ONSLOW,Jones,IA,53,77.84139
770,52322,OXFORD,Johnson,IA,52,231.220008
771,52323,OXFORD JUNCTION,Jones,IA,53,132.353794
772,52324,PALO,Linn,IA,57,107.056042
773,52325,PARNELL,Iowa,IA,48,109.143193
774,52326,QUASQUETON,Buchanan,IA,10,5.447918
775,52327,RIVERSIDE,Washington,IA,92,217.607612
776,52328,ROBINS,Linn,IA,57,7.989464
777,52329,ROWLEY,Buchanan,IA,10,132.090802
778,52330,RYAN,Delaware,IA,28,123.160697
779,52332,SHELLSBURG,Benton,IA,6,110.692089
780,52333,SOLON,Johnson,IA,52,234.514567
781,52334,SOUTH AMANA,Iowa,IA,48,43.109728
782,52335,SOUTH ENGLISH,Keokuk,IA,54,144.712437
783,52336,SPRINGVILLE,Linn,IA,57,127.253732
784,52337,STANWOOD,Cedar,IA,16,70.714333
785,52338,SWISHER,Johnson,IA,52,88.82337
786,52339,TAMA,Tama,IA,86,265.108854
787,52340,TIFFIN,Johnson,IA,52,42.940483
788,52341,TODDVILLE,Linn,IA,57,34.64544
789,52342,TOLEDO,Tama,IA,86,232.660911
790,52345,URBANA,Benton,IA,6,8.460683
791,52346,VAN HORNE,Benton,IA,6,134.845571
792,52347,VICTOR,Iowa,IA,48,166.647034
793,52348,VINING,Tama,IA,86,2.376982
794,52349,VINTON,Benton,IA,6,382.672144
795,52351,WALFORD,Benton,IA,6,2.13643
796,52352,WALKER,Linn,IA,57,205.022618
797,52353,WASHINGTON,Washington,IA,92,395.338718
798,52354,WATKINS,Benton,IA,6,76.593832
799,52355,WEBSTER,Keokuk,IA,54,95.20977
800,52356,WELLMAN,Washington,IA,92,232.4078
801,52358,WEST BRANCH,Cedar,IA,16,200.939529
802,52359,WEST CHESTER,Washington,IA,92,38.168572
803,52361,WILLIAMSBURG,Iowa,IA,48,330.197198
804,52362,WYOMING,Jones,IA,53,157.429942
805,52401,CEDAR RAPIDS,Linn,IA,57,3.464505
806,52402,CEDAR RAPIDS,Linn,IA,57,36.420817
807,52403,CEDAR RAPIDS,Linn,IA,57,69.523743
808,52404,CEDAR RAPIDS,Linn,IA,57,142.93349
809,52405,CEDAR RAPIDS,Linn,IA,57,38.49318
810,52411,CEDAR RAPIDS,Linn,IA,57,44.635019
811,52501,OTTUMWA,Wapello,IA,90,591.297871
812,52530,AGENCY,Wapello,IA,90,36.986236
813,52531,ALBIA,Monroe,IA,68,563.107904
814,52533,BATAVIA,Jefferson,IA,51,227.778322
815,52534,BEACON,Mahaska,IA,62,1.012395
816,52535,BIRMINGHAM,Van Buren,IA,89,149.696508
817,52536,BLAKESBURG,Wapello,IA,90,160.133711
818,52537,BLOOMFIELD,Davis,IA,26,900.130186
819,52540,BRIGHTON,Jefferson,IA,51,257.681662
820,52542,CANTRIL,Van Buren,IA,89,117.166206
821,52543,CEDAR,Mahaska,IA,62,53.398057
822,52544,CENTERVILLE,Appanoose,IA,4,355.984819
823,52548,CHILLICOTHE,Wapello,IA,90,0.622729
824,52549,CINCINNATI,Appanoose,IA,4,113.382459
825,52550,DELTA,Keokuk,IA,54,100.697091
826,52551,DOUDS,Van Buren,IA,89,151.969068
827,52552,DRAKESVILLE,Davis,IA,26,151.883138
828,52553,EDDYVILLE,Wapello,IA,90,218.424094
829,52554,ELDON,Wapello,IA,90,94.705759
830,52555,EXLINE,Appanoose,IA,4,64.737097
831,52556,FAIRFIELD,Jefferson,IA,51,458.48455
832,52557,FAIRFIELD,Jefferson,IA,51,0.116099
833,52560,FLORIS,Davis,IA,26,91.932554
834,52561,FREMONT,Mahaska,IA,62,90.172828
835,52563,HEDRICK,Keokuk,IA,54,299.15143
836,52565,KEOSAUQUA,Van Buren,IA,89,307.637735
837,52566,KIRKVILLE,Wapello,IA,90,2.69235
838,52567,LIBERTYVILLE,Jefferson,IA,51,73.121919
839,52569,MELROSE,Monroe,IA,68,249.170073
840,52570,MILTON,Van Buren,IA,89,177.637484
841,52571,MORAVIA,Appanoose,IA,4,276.402554
842,52572,MOULTON,Appanoose,IA,4,252.233684
843,52573,MOUNT STERLING,Van Buren,IA,89,74.630386
844,52574,MYSTIC,Appanoose,IA,4,109.572946
845,52576,OLLIE,Keokuk,IA,54,117.180743
846,52577,OSKALOOSA,Mahaska,IA,62,415.772496
847,52580,PACKWOOD,Jefferson,IA,51,98.765465
848,52581,PLANO,Appanoose,IA,4,102.818704
849,52583,PROMISE CITY,Wayne,IA,93,120.716306
850,52584,PULASKI,Davis,IA,26,58.551231
851,52585,RICHLAND,Keokuk,IA,54,145.900379
852,52586,ROSE HILL,Mahaska,IA,62,122.508394
853,52588,SELMA,Van Buren,IA,89,26.887357
854,52590,SEYMOUR,Wayne,IA,93,194.526072
855,52591,SIGOURNEY,Keokuk,IA,54,327.849389
856,52593,UDELL,Appanoose,IA,4,42.505876
857,52594,UNIONVILLE,Appanoose,IA,4,121.987222
858,52595,UNIVERSITY PARK,Mahaska,IA,62,1.193986
859,52601,BURLINGTON,Des Moines,IA,29,312.02532
860,52619,ARGYLE,Lee,IA,56,95.322544
861,52620,BONAPARTE,Van Buren,IA,89,139.397291
862,52621,CRAWFORDSVILLE,Washington,IA,92,109.582868
863,52623,DANVILLE,Des Moines,IA,29,151.439917
864,52624,DENMARK,Lee,IA,56,1.656465
865,52625,DONNELLSON,Lee,IA,56,285.245058
866,52626,FARMINGTON,Van Buren,IA,89,247.437264
867,52627,FORT MADISON,Lee,IA,56,188.414662
868,52630,HILLSBORO,Henry,IA,44,109.946731
869,52632,KEOKUK,Lee,IA,56,140.470022
870,52635,LOCKRIDGE,Jefferson,IA,51,112.512335
871,52637,MEDIAPOLIS,Des Moines,IA,29,172.04191
872,52638,MIDDLETOWN,Des Moines,IA,29,88.521308
873,52639,MONTROSE,Lee,IA,56,117.469873
874,52640,MORNING SUN,Louisa,IA,58,188.643645
875,52641,MOUNT PLEASANT,Henry,IA,44,551.76361
876,52644,MOUNT UNION,Henry,IA,44,111.080445
877,52645,NEW LONDON,Henry,IA,44,185.650036
878,52646,OAKVILLE,Louisa,IA,58,155.005489
879,52647,OLDS,Henry,IA,44,0.911591
880,52649,SALEM,Henry,IA,44,119.399466
881,52650,SPERRY,Des Moines,IA,29,104.495885
882,52651,STOCKPORT,Van Buren,IA,89,156.669871
883,52653,WAPELLO,Louisa,IA,58,312.47521
884,52654,WAYLAND,Henry,IA,44,136.065742
885,52655,WEST BURLINGTON,Des Moines,IA,29,42.56453
886,52656,WEST POINT,Lee,IA,56,247.274743
887,52657,SAINT PAUL,Lee,IA,56,0.494762
888,52658,WEVER,Lee,IA,56,131.24099
889,52659,WINFIELD,Henry,IA,44,159.819459
890,52660,YARMOUTH,Des Moines,IA,29,57.367543
891,52701,ANDOVER,Clinton,IA,23,1.643828
892,52720,ATALISSA,Muscatine,IA,70,110.12746
893,52721,BENNETT,Cedar,IA,16,108.698097
894,52722,BETTENDORF,Scott,IA,82,73.161625
895,52726,BLUE GRASS,Scott,IA,82,94.823134
896,52727,BRYANT,Clinton,IA,23,63.081901
897,52728,BUFFALO,Scott,IA,82,5.764558
898,52729,CALAMUS,Clinton,IA,23,108.881384
899,52730,CAMANCHE,Clinton,IA,23,101.243411
900,52731,CHARLOTTE,Clinton,IA,23,136.590407
901,52732,CLINTON,Clinton,IA,23,310.704633
902,52737,COLUMBUS CITY,Louisa,IA,58,0.607264
903,52738,COLUMBUS JUNCTION,Louisa,IA,58,323.19118
904,52739,CONESVILLE,Muscatine,IA,70,89.211035
905,52742,DE WITT,Clinton,IA,23,304.142776
906,52745,DIXON,Scott,IA,82,74.381611
907,52746,DONAHUE,Scott,IA,82,68.807314
908,52747,DURANT,Cedar,IA,16,56.094151
909,52748,ELDRIDGE,Scott,IA,82,107.004915
910,52749,FRUITLAND,Muscatine,IA,70,5.21822
911,52750,GOOSE LAKE,Clinton,IA,23,78.435746
912,52751,GRAND MOUND,Clinton,IA,23,129.085616
913,52752,GRANDVIEW,Louisa,IA,58,0.984227
914,52753,LE CLAIRE,Scott,IA,82,68.492183
915,52754,LETTS,Muscatine,IA,70,186.005712
916,52755,LONE TREE,Johnson,IA,52,157.930258
917,52756,LONG GROVE,Scott,IA,82,111.107641
918,52757,LOW MOOR,Clinton,IA,23,3.069924
919,52758,MC CAUSLAND,Scott,IA,82,1.745891
920,52760,MOSCOW,Muscatine,IA,70,58.900339
921,52761,MUSCATINE,Muscatine,IA,70,482.973608
922,52765,NEW LIBERTY,Scott,IA,82,78.185471
923,52766,NICHOLS,Muscatine,IA,70,119.401548
924,52767,PLEASANT VALLEY,Scott,IA,82,3.354642
925,52768,PRINCETON,Scott,IA,82,89.481384
926,52769,STOCKTON,Muscatine,IA,70,106.726313
927,52769,STOCKTON,Muscatine,IA,70,106.726313
928,52772,TIPTON,Cedar,IA,16,342.328096
929,52773,WALCOTT,Scott,IA,82,145.837805
930,52774,WELTON,Clinton,IA,23,0.728977
931,52776,WEST LIBERTY,Muscatine,IA,70,219.678908
932,52777,WHEATLAND,Clinton,IA,23,139.946616
933,52778,WILTON,Muscatine,IA,70,215.972375
934,52801,DAVENPORT,Scott,IA,82,1.359908
935,52802,DAVENPORT,Scott,IA,82,29.294412
936,52803,DAVENPORT,Scott,IA,82,14.068035
937,52804,DAVENPORT,Scott,IA,82,88.861422
938,52806,DAVENPORT,Scott,IA,82,79.448284
939,52807,DAVENPORT,Scott,IA,82,76.46944
--------------------------------------------------------------------------------
/retail-strategy/data/iowa_incomes.xls:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/retail-strategy/data/iowa_incomes.xls
--------------------------------------------------------------------------------
/retail-strategy/data/pop_iowa_per_county.csv:
--------------------------------------------------------------------------------
1 | ,county,population
2 | 0,Adair,7092
3 | 1,Adams,3693
4 | 2,Allamakee,13884
5 | 3,Appanoose,12462
6 | 4,Audubon,5678
7 | 5,Benton,25699
8 | 6,Black Hawk,132904
9 | 7,Boone,26532
10 | 8,Bremer,24798
11 | 9,Buchanan,20992
12 | 10,Buena Vista,20332
13 | 11,Butler,14791
14 | 12,Calhoun,9846
15 | 13,Carroll,20437
16 | 14,Cass,13157
17 | 15,Cedar,18454
18 | 16,Cerro Gordo,43070
19 | 17,Cherokee,11508
20 | 18,Chickasaw,12023
21 | 19,Clarke,9309
22 | 20,Clay,16333
23 | 21,Clayton,17590
24 | 22,Clinton,47309
25 | 23,Crawford,16940
26 | 24,Dallas,84516
27 | 25,Davis,8860
28 | 26,Decatur,8141
29 | 27,Delaware,17327
30 | 28,Des Moines,39739
31 | 29,Dickinson,17243
32 | 30,Dubuque,97003
33 | 31,Emmet,9658
34 | 32,Fayette,20054
35 | 33,Floyd,15873
36 | 34,Franklin,10170
37 | 35,Fremont,6950
38 | 36,Greene,9011
39 | 37,Grundy,12313
40 | 38,Guthrie,10625
41 | 39,Hamilton,15076
42 | 40,Hancock,10835
43 | 41,Hardin,17226
44 | 42,Harrison,14149
45 | 43,Henry,19773
46 | 44,Howard,9332
47 | 45,Humboldt,9487
48 | 46,Ida,6985
49 | 47,Iowa,16311
50 | 48,Jackson,19472
51 | 49,Jasper,36708
52 | 50,Jefferson,18090
53 | 51,Johnson,146547
54 | 52,Jones,20439
55 | 53,Keokuk,10119
56 | 54,Kossuth,15114
57 | 55,Lee,34615
58 | 56,Linn,221661
59 | 57,Louisa,11142
60 | 58,Lucas,8647
61 | 59,Lyon,11754
62 | 60,Madison,15848
63 | 61,Mahaska,22181
64 | 62,Marion,33189
65 | 63,Marshall,40312
66 | 64,Mills,14972
67 | 65,Mitchell,10763
68 | 66,Monona,8898
69 | 67,Monroe,7870
70 | 68,Montgomery,10225
71 | 69,Muscatine,42940
72 | 70,O'Brien,14020
73 | 71,Osceola,6064
74 | 72,Page,15391
75 | 73,Palo Alto,9047
76 | 74,Plymouth,25200
77 | 75,Pocahontas,6886
78 | 76,Polk,474045
79 | 77,Pottawattamie,93582
80 | 78,Poweshiek,18533
81 | 79,Ringgold,5068
82 | 80,Sac,9876
83 | 81,Scott,172474
84 | 82,Shelby,11800
85 | 83,Sioux,34898
86 | 84,Story,97090
87 | 85,Tama,17319
88 | 86,Taylor,6216
89 | 87,Union,12420
90 | 88,Van Buren,7271
91 | 89,Wapello,34982
92 | 90,Warren,49691
93 | 91,Washington,22281
94 | 92,Wayne,6452
95 | 93,Webster,36769
96 | 94,Winnebago,10631
97 | 95,Winneshiek,20561
98 | 96,Woodbury,102779
99 | 97,Worth,7572
100 | 98,Wright,12779
101 |
--------------------------------------------------------------------------------
/retail-strategy/images/123:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/retail-strategy/images/123.png:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/retail-strategy/images/hm3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/retail-strategy/images/hm3.png
--------------------------------------------------------------------------------
/retail-strategy/images/liquor.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/retail-strategy/images/liquor.jpeg
--------------------------------------------------------------------------------
/retail-strategy/images/output.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/retail-strategy/images/output.png
--------------------------------------------------------------------------------
/retail-strategy/images/test.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/retail-strategy/images/test.jpg
--------------------------------------------------------------------------------
/tennis/123.png:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/tennis/README.md:
--------------------------------------------------------------------------------
1 | ## Forecasting the winner in the Men's ATP World Tour [[view code]](http://nbviewer.jupyter.org/github/marcotav/machine-learning-classification-projects/blob/master/tennis/notebooks/Final_Project_Marco_Tavora-DATNYC41_GA.ipynb)
2 |       
3 |
4 | **The code is available [here](http://nbviewer.jupyter.org/github/marcotav/machine-learning-classification-projects/blob/master/tennis/notebooks/Final_Project_Marco_Tavora-DATNYC41_GA.ipynb) or by clicking on the [view code] link above.**
5 |
6 |
7 |
8 |
9 |
10 |
11 |
12 | Problem Statement •
13 | Dataset •
14 | Importing basic modules •
15 | Pre-Processing of dataset •
16 | `Best_of` = 5 •
17 | Dummy variables •
18 | Exploratory Analysis for Best_of = 5 •
19 | Logistic Regression •
20 | Decision Trees and Random Forests
21 |
22 |
23 |
24 | ## Problem Statement
25 |
26 | The goal of the project is to predict the probability that the higher-ranked player will win a tennis match. I will call that a `win` (as opposed to an upset).
27 |
28 | ## Dataset
29 |
30 | The dataset contains results for the men's ATP tour dating back to January 2000 and comes from http://www.tennis-data.co.uk/data.php (obtained via Kaggle). The features used for each match were:
31 | - `Date`: date of the match
32 | - `Series`: name of the ATP tennis series (we kept the four main current categories, namely Grand Slams, Masters 1000, ATP 250 and ATP 500)
33 | - `Surface`: type of surface (clay, hard or grass)
34 | - `Round`: round of match (from first round to the final)
35 | - `Best of`: maximum number of sets playable in match (Best of 3 or Best of 5)
36 | - `WRank`: ATP Entry ranking of the match winner as of the start of the tournament
37 | - `LRank`: ATP Entry ranking of the match loser as of the start of the tournament
38 |
39 | The output variable is binary: `win` equals 1 if the higher-ranked player (by definition, the one with the numerically smaller ATP ranking) wins the match, and 0 otherwise (an upset).
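
The preprocessing code below assumes the raw CSV has already been read into `df_atp`. A minimal loading sketch (the filename `atp_data.csv` is an assumption; substitute the actual path of the downloaded file):

```
import pandas as pd

# Hypothetical filename for the tennis-data.co.uk CSV obtained from Kaggle
df_atp = pd.read_csv('atp_data.csv')
df_atp.head()
```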
40 |
41 | ## Importing basic modules
42 |
43 | ```
44 | import numpy as np
45 | import pandas as pd
46 | import statsmodels.api as sm
46 | import matplotlib.pyplot as plt
47 | from sklearn import metrics
48 | import seaborn as sns
49 | sns.set_style("darkgrid")
50 | import pylab as pl
51 | %matplotlib inline
52 | ```
53 |
54 | ## Pre-Processing of dataset
55 |
56 | After loading the dataset we proceed as follows:
57 | - Keep only completed matches, i.e. eliminate matches with injury withdrawals and walkovers
58 | - Choose the features listed above
59 | - Drop `NaN` entries
60 | - Consider only the two final years (to avoid mixing in tournament categories that existed only in the past); this choice is somewhat arbitrary and can be changed if needed
61 | - Keep only higher-ranked players, which improves accuracy (as suggested by Corral and Prieto-Rodriguez (2010) and confirmed here)
62 | ```
63 | # Converting the Date column to datetime
64 | df_atp['Date'] = pd.to_datetime(df_atp['Date'])
65 | # Restricting to the final two years of data
66 | df_atp = df_atp.loc[(df_atp['Date'] > '2014-11-09') & (df_atp['Date'] <= '2016-11-09')]
67 | # Keeping only completed matches
68 | df_atp = df_atp[df_atp['Comment'] == 'Completed'].drop("Comment",axis = 1)
69 | # Renaming Best of to Best_of
70 | df_atp.rename(columns = {'Best of':'Best_of'},inplace=True)
71 | # Choosing features
72 | cols_to_keep = ['Date','Series','Surface', 'Round','Best_of', 'WRank','LRank']
73 | # Dropping NaNs
74 | df_atp = df_atp[cols_to_keep].dropna()
75 | # Dropping errors in the dataset and unimportant entries (e.g. there are very few entries for Masters Cup)
76 | df_atp = df_atp[(df_atp['LRank'] != 'NR') & (df_atp['WRank'] != 'NR') & (df_atp['Series'] != 'Masters Cup')]
77 | ```
78 | Another important step is to convert some of the columns from strings to numerical values:
79 | ```
80 | cols_to_keep = ['Best_of','WRank','LRank']
81 | df_atp[cols_to_keep] = df_atp[cols_to_keep].astype(int)
82 | ```
83 | I now create an extra column for the variable `win` (described above) using an auxiliary function `win(x)`:
84 |
85 | ```
86 | def win(x):
87 |     if x > 0:
88 |         return 0
89 |     else:
90 |         return 1
91 | ```
92 | Using the `apply()` method, which applies a function to each element of a column:
93 | ```
94 | df_atp['win'] = (df_atp['WRank'] - df_atp['LRank']).apply(win)
95 | ```
96 |
97 | Following [Corral and Prieto-Rodriguez](https://ideas.repec.org/a/eee/intfor/v26yi3p551-563.html), we restrict the analysis to higher-ranked players (both players within the top 150):
98 | ```
99 | df_new = df_atp[(df_atp['WRank'] <= 150) & (df_atp['LRank'] <= 150)]
100 | ```
101 |
102 |
103 |
104 |
105 |
106 |
107 |
108 |
109 | ## `Best_of` = 5
110 |
111 | We now restrict the analysis to matches with `Best_of` = 5. Since only Grand Slams are played over five sets, the `Series` column carries no information here and can be dropped. The case `Best_of` = 3 is considered afterwards.
112 | ```
113 | df3 = df_new.copy()
114 | df3 = df3[df3['Best_of'] == 5]
115 | # Drop the Series and Best_of columns
116 | df3.drop("Series", axis=1, inplace=True)
117 | df3.drop("Best_of", axis=1, inplace=True)
118 | ```
119 | The dataset is uneven in the frequency of `win` values (imbalanced classes). The quick function below converts a pandas `Series` into a `DataFrame` (for display purposes only):
120 | ```
121 | def series_to_df(s):
122 |     return s.to_frame()
123 | series_to_df(df3['win'].value_counts())
124 | series_to_df(df3['win'].value_counts()/df3.shape[0])
125 | ```
126 |
127 |
128 |
129 |
130 |
131 |
132 | To correct this problem and create a balanced dataset, I undersample the majority class so that both classes end up with the same number of rows:
133 |
134 | ```
135 | y_0 = df3[df3.win == 0]
136 | y_1 = df3[df3.win == 1]
137 | n = min([len(y_0), len(y_1)])
138 | y_0 = y_0.sample(n = n, random_state = 0)
139 | y_1 = y_1.sample(n = n, random_state = 0)
140 | df_strat = pd.concat([y_0, y_1])
141 | X_strat = df_strat[['Date', 'Surface', 'Round','WRank', 'LRank']]
142 | y_strat = df_strat.win
143 | df = X_strat.copy()
144 | df['win'] = y_strat
145 | ```
146 | The balanced classes become:
147 |
148 |
149 |
150 |
151 |
152 | We now define the variables `P1` and `P2`, where `P1` is the numerically larger of the two rankings (i.e. the lower-ranked player) and `P2` the smaller:
153 | ```
154 | ranks = ["WRank", "LRank"]
155 | df["P1"] = df[ranks].max(axis=1)
156 | df["P2"] = df[ranks].min(axis=1)
157 | ```
158 |
159 |
160 | ## Exploratory Analysis for Best_of = 5
161 |
162 | I first look at the percentage of wins for each surface. When the `Surface` is clay there is a higher likelihood of upsets (the opposite of a `win`), i.e. the percentage of wins is lower, though the difference is not very large.
163 | ```
164 | win_by_Surface = pd.crosstab(df.win, df.Surface).apply(lambda x: x/x.sum(), axis = 0)
165 | ```
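
One hedged way to visualize this crosstab (a sketch; the original figure may have been generated differently):

```
# Sketch: grouped bar plot of win/upset fractions per surface
# (assumes matplotlib was imported as plt, as in the basic modules above)
win_by_Surface.T.plot(kind='bar')
plt.ylabel('Fraction of matches')
plt.title('Win vs. upset fraction by surface')
```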
166 |
167 |
168 |
169 |
170 |
171 | What about the dependence on rounds? The overall relation is not very clear, but upsets are noticeably unlikely in the semifinals.
172 |
173 | ```
174 | win_by_round = pd.crosstab(df.win, df.Round).apply(lambda x: x/x.sum(), axis = 0)
175 | ```
176 |
177 |
178 |
179 |
180 |
181 |
182 |
183 | ## Dummy variables
184 | To keep the dataframe cleaner we transform the `Round` entries into numbers using:
185 | ```
186 | df1 = df.copy()
187 | def round_number(x):
188 |     if x == '1st Round':
189 |         return 1
190 |     elif x == '2nd Round':
191 |         return 2
192 |     elif x == '3rd Round':
193 |         return 3
194 |     elif x == '4th Round':
195 |         return 4
196 |     elif x == 'Quarterfinals':
197 |         return 5
198 |     elif x == 'Semifinals':
199 |         return 6
200 |     elif x == 'The Final':
201 |         return 7
202 | df1['Round'] = df1['Round'].apply(round_number)
203 | ```
204 | We then transform the rounds into dummy variables:
205 | ```
206 | dummy_ranks = pd.get_dummies(df1['Round'], prefix='Round')
207 | # .loc replaces the deprecated .ix indexer; Round_1 is dropped as the baseline
208 | df1 = df1.join(dummy_ranks.loc[:, 'Round_2':])
209 | rounds = ['Round_2', 'Round_3', 'Round_4', 'Round_5', 'Round_6', 'Round_7']
210 | df1[rounds] = df1[rounds].astype(int)
211 | ```
212 | We repeat this for the `Surface` variable; a minimal sketch is shown below (assuming the surface labels are `Clay`, `Grass` and `Hard`, with clay as the dropped baseline).
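
```
# Sketch: Surface dummies, mirroring the Round treatment above
dummy_surfaces = pd.get_dummies(df1['Surface'], prefix='Surface')
df1 = df1.join(dummy_surfaces.loc[:, 'Surface_Grass':])  # Surface_Clay dropped as baseline
surfaces = ['Surface_Grass', 'Surface_Hard']
df1[surfaces] = df1[surfaces].astype(int)
```

I now take the logarithms of `P1` and `P2`, then create a variable `D` for the absolute ranking gap (at this stage the notebook calls the dataframe `df4`):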
213 | ```
214 | df4['P1'] = np.log2(df4['P1'].astype('float64'))
215 | df4['P2'] = np.log2(df4['P2'].astype('float64'))
216 | df4['D'] = df4['P1'] - df4['P2']
217 | df4['D'] = np.absolute(df4['D'])
218 | ```
219 |
220 | ## Logistic Regression
221 |
222 | The next step is building the models, starting with a logistic regression. First, `X` and `y` must be defined:
223 |
224 | ```
225 | feature_cols = ['Round_2','Round_3','Round_4','Round_5','Round_6','Round_7','Surface_Grass','Surface_Hard','D']
226 | dfnew = df4.copy()
227 | dfnew[feature_cols].head()
228 | X = dfnew[feature_cols]
229 | y = dfnew.win
230 | ```
231 | Doing a train-test split:
232 | ```
233 | from sklearn.model_selection import train_test_split  # cross_validation is deprecated in recent scikit-learn
234 | X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
235 | ```
236 | I then fit the model with the training data,
237 | ```
238 | from sklearn.linear_model import LogisticRegression
239 | logreg = LogisticRegression()
240 | logreg.fit(X_train, y_train)
241 | ```
242 | and make predictions using the test set:
243 | ```
244 | y_pred_class = logreg.predict(X_test)
245 | from sklearn import metrics
246 | print('Accuracy score is:',metrics.accuracy_score(y_test, y_pred_class))
247 | ```
248 | and obtain:
249 | ```
250 | Accuracy score is: 0.7070707070707071
251 | ```
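
It can also be instructive to inspect the fitted coefficients (a sketch, not part of the original notebook; `feature_cols` and `logreg` are defined above):

```
# Sketch: pair each feature with its logistic-regression coefficient
coef_df = pd.DataFrame({'feature': feature_cols, 'coefficient': logreg.coef_[0]})
print(coef_df.sort_values('coefficient'))
```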
252 |
253 | The next step is to evaluate appropriate metrics. Using `scikit-learn` to calculate the AUC:
254 | ```
255 | y_pred_prob = logreg.predict_proba(X_test)[:, 1]
256 | auc_score = metrics.roc_auc_score(y_test, y_pred_prob)
257 | print('AUC is:', auc_score)
258 | ```
259 | I obtain the following `auc_score`:
260 | ```
261 | AUC is: 0.7546938775510204
262 | ```
263 | To plot the ROC curve I use:
264 | ```
265 | fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob)
266 | fig = plt.plot(fpr, tpr,label='ROC curve (area = %0.2f)' % auc_score )
267 | plt.plot([0, 1], [0, 1], 'k--')
268 | plt.xlim([0.0, 1.0])
269 | plt.ylim([0.0, 1.0])
270 | plt.title('ROC curve for win classifier')
271 | plt.xlabel('False Positive Rate (1 - Specificity)')
272 | plt.ylabel('True Positive Rate (Sensitivity)')
273 | plt.legend(loc="lower right")
274 | plt.grid(True)
275 | ```
276 |
277 |
278 |
279 |
281 |
282 |
283 |
284 | Now we must perform cross-validation.
285 | ```
286 | from sklearn.model_selection import cross_val_score  # cross_validation is deprecated in recent scikit-learn
287 | print('Mean CV score is:',cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean())
288 | ```
289 | The output is:
290 | ```
291 | Mean CV score is: 0.7287617728531856
292 | ```
293 |
294 |
295 | ## Decision Trees and Random Forests
296 |
297 |
298 | I now build a decision tree model to predict the likelihood of an upset in a given match:
299 |
300 | ```
301 | from sklearn.tree import DecisionTreeClassifier
302 | model = DecisionTreeClassifier()
303 | X = dfnew[feature_cols].dropna()
304 | y = dfnew['win']
305 | model.fit(X, y)
306 | ```
307 | Again performing cross-validation:
308 | ```
309 | scores = cross_val_score(model, X, y, scoring='roc_auc', cv=5)
310 | print('AUC {}, Average AUC {}'.format(scores, scores.mean()))
311 | model = DecisionTreeClassifier(
312 |     max_depth=4,
313 |     min_samples_leaf=6)
314 |
315 | model.fit(X, y)
316 | ```
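
The values `max_depth = 4` and `min_samples_leaf = 6` above were fixed by hand; a quick grid search is one way such values could be chosen (a sketch, not part of the original notebook, assuming `X` and `y` as defined above):

```
from sklearn.model_selection import GridSearchCV

# Sketch: small grid search over the two hand-tuned hyperparameters
param_grid = {'max_depth': [2, 3, 4, 5, 6], 'min_samples_leaf': [2, 4, 6, 8]}
grid = GridSearchCV(DecisionTreeClassifier(), param_grid, scoring='roc_auc', cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```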
317 |
318 |
319 |
320 |
321 |
322 |
323 |
324 |
325 | Evaluating the cross-validation score:
326 |
327 | ```
328 | scores = cross_val_score(model, X, y, scoring='roc_auc', cv=5)
329 | print('CV AUC {}, Average AUC {}'.format(scores, scores.mean()))
330 | ```
331 |
332 |
333 |
334 |
335 |
336 |
337 |
338 |
339 |
340 |
341 |
342 |
343 |
344 | Now I repeat the lines above using a random forest classifier:
345 | ```
346 | from sklearn.ensemble import RandomForestClassifier
347 | from sklearn.model_selection import cross_val_score  # cross_validation is deprecated in recent scikit-learn
348 | X = dfnew[feature_cols].dropna()
349 | y = dfnew['win']
350 | model = RandomForestClassifier(n_estimators = 200)
351 | model.fit(X, y)
352 | features = X.columns
353 | feature_importances = model.feature_importances_
354 | features_df = pd.DataFrame({'Features': features, 'Importance Score': feature_importances})
355 | features_df.sort_values('Importance Score', inplace=True, ascending=False)
356 | feature_importances = pd.Series(model.feature_importances_, index=X.columns)
357 | feature_importances = feature_importances.sort_values()  # sort_values returns a copy
358 | feature_importances.plot(kind="barh", figsize=(7,6))
359 | scores = cross_val_score(model, X, y, scoring='roc_auc')
360 | print('AUC {}, Average AUC {}'.format(scores, scores.mean()))
361 | for n_trees in range(1, 100, 10):
362 |     model = RandomForestClassifier(n_estimators=n_trees)
363 |     scores = cross_val_score(model, X, y, scoring='roc_auc')
364 |     print('n trees: {}, CV AUC {}, Average AUC {}'.format(n_trees, scores, scores.mean()))
365 | ```
366 |
367 |
368 |
369 |
370 |
371 |
372 |
373 |
374 |
375 |
376 |
377 | The identical analysis is carried out for `Best_of` = 3 and is therefore omitted from this README.
378 |
--------------------------------------------------------------------------------
/tennis/images/123.png:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/tennis/images/ATP_World_Tour.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/tennis/images/ATP_World_Tour.png
--------------------------------------------------------------------------------
/tennis/images/ROC.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/tennis/images/ROC.png
--------------------------------------------------------------------------------
/tennis/images/balanced.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/tennis/images/balanced.png
--------------------------------------------------------------------------------
/tennis/images/cv_score.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/tennis/images/cv_score.png
--------------------------------------------------------------------------------
/tennis/images/decisiontree.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/tennis/images/decisiontree.png
--------------------------------------------------------------------------------
/tennis/images/imbalance.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/tennis/images/imbalance.png
--------------------------------------------------------------------------------
/tennis/images/rf_features.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/tennis/images/rf_features.png
--------------------------------------------------------------------------------
/tennis/images/rounds.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/tennis/images/rounds.png
--------------------------------------------------------------------------------
/tennis/images/surfaces.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/tennis/images/surfaces.png
--------------------------------------------------------------------------------
/tennis/images/tennis_df.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/tennis/images/tennis_df.png
--------------------------------------------------------------------------------
/tennis/notebooks/123.png:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/tennis/slides/123.png:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/tennis/slides/Final_Project_Marco_Tavora_DATNYC41.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/tennis/slides/Final_Project_Marco_Tavora_DATNYC41.pdf
--------------------------------------------------------------------------------