├── README.md
├── analysis-of-opioid-prescription-problem
│   ├── README.md
│   ├── data
│   │   ├── 123
│   │   ├── mhincome.csv
│   │   ├── opioids.csv
│   │   ├── overdoses.csv
│   │   ├── overdosesnew.csv
│   │   └── prescriber-info.csv
│   ├── images
│   │   ├── 123
│   │   └── opioids.png
│   └── notebooks
│       ├── 123
│       └── opioid-prescription-problem.ipynb
├── churn
│   ├── README.md
│   ├── data
│   │   └── 123.png
│   ├── images
│   │   ├── 123.png
│   │   ├── balancedchurn.png
│   │   ├── baseline.png
│   │   ├── cellphone.jpg
│   │   ├── churnprob.png
│   │   ├── cm.png
│   │   ├── cms.png
│   │   ├── cms1.png
│   │   ├── cms2.png
│   │   ├── df_churn_new.png
│   │   ├── featurerf.png
│   │   ├── imbalancechurn.png
│   │   ├── model_comparison.png
│   │   └── predictions.png
│   └── notebooks
│       └── predicting-customer-churn.ipynb
├── click-prediction
│   ├── README.md
│   ├── images
│   │   ├── 123
│   │   └── click1.png
│   ├── notebooks
│   │   └── click-predictive-model.ipynb
│   └── optimal-bidding-strategies-in-online-display-advertising .pdf
├── predicting-number-of-comments-on-reddit-using-random-forest-classifier
│   ├── 123.png
│   ├── README.md
│   ├── images
│   │   ├── 123.png
│   │   ├── Reddit-logo.png
│   │   ├── redditRF.png
│   │   ├── redditpage.png
│   │   └── redditwordshist.png
│   └── notebooks
│       ├── 123.png
│       └── project-3-marco-tavora.ipynb
├── retail-strategy
│   ├── README.md
│   ├── data
│   │   ├── 123
│   │   ├── ia_zip_city_county_sqkm.csv
│   │   ├── iowa_incomes.xls
│   │   └── pop_iowa_per_county.csv
│   ├── images
│   │   ├── 123
│   │   ├── 123.png
│   │   ├── hm3.png
│   │   ├── liquor.jpeg
│   │   ├── output.png
│   │   └── test.jpg
│   └── notebooks
│       └── retail-recommendations.ipynb
└── tennis
    ├── 123.png
    ├── README.md
    ├── images
    │   ├── 123.png
    │   ├── ATP_World_Tour.png
    │   ├── ROC.png
    │   ├── balanced.png
    │   ├── cv_score.png
    │   ├── decisiontree.png
    │   ├── imbalance.png
    │   ├── rf_features.png
    │   ├── rounds.png
    │   ├── surfaces.png
    │   └── tennis_df.png
    ├── notebooks
    │   ├── 123.png
    │   └── Final_Project_Marco_Tavora-DATNYC41_GA.ipynb
    └── slides
        ├── 123.png
        └── Final_Project_Marco_Tavora_DATNYC41.pdf
/README.md:
--------------------------------------------------------------------------------
1 | ## Supervised Machine Learning Projects
2 |
3 | [MIT license](https://opensource.org/licenses/MIT)
4 |
5 |
6 |
7 |
9 |
10 |
11 |
12 |
13 |
14 |
15 |
16 |
17 |
18 | Notebooks and descriptions •
19 | Contact Information
20 |
21 |
22 |
23 | ### Notebooks and descriptions
24 | | Notebook | Brief Description |
25 | |--------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------|
26 | |[predicting-comments-on-reddit](http://nbviewer.jupyter.org/github/marcotav/machine-learning-classification-projects/blob/master/predicting-number-of-comments-on-reddit-using-random-forest-classifier/notebooks/project-3-marco-tavora.ipynb) | In this project I determine which characteristics of a post on Reddit contribute most to the overall interaction as measured by number of comments.|
27 | |[tennis-matches-prediction](http://nbviewer.jupyter.org/github/marcotav/machine-learning-classification-projects/blob/master/tennis/notebooks/Final_Project_Marco_Tavora-DATNYC41_GA.ipynb) | The goal of the project is to predict the probability that the higher-ranked player will win a tennis match. I will call that a `win` (as opposed to an upset).|
28 | |[churn-analysis](http://nbviewer.jupyter.org/github/marcotav/machine-learning-classification-projects/blob/master/churn/notebooks/predicting-customer-churn.ipynb) | This project was done in collaboration with [Corey Girard](https://github.com/coreygirard/). A mobile device company is having a major problem with customer retention. Customers switching from one company to another is called churn. Our goal in this analysis is to understand the problem, identify behaviors which are strongly correlated with churn, and devise a solution.|
29 | |[click-prediction](http://nbviewer.jupyter.org/github/marcotav/machine-learning-classification-projects/blob/master/click-prediction/notebooks/click-predictive-model.ipynb) | Many ads are actually sold on a "pay-per-click" (PPC) basis, meaning the company only pays for ad clicks, not ad views. Thus your optimal approach (as a search engine) is actually to choose an ad based on "expected value", meaning the price of a click times the likelihood that the ad will be clicked [...] In order for you to maximize expected value, you therefore need to accurately predict the likelihood that a given ad will be clicked, also known as "click-through rate" (CTR). In this project I will predict the likelihood that a given online ad will be clicked.|
30 | | [retail-store-expansion-analysis-with-lasso-and-ridge-regressions](http://nbviewer.jupyter.org/github/marcotav/machine-learning-classification-projects/blob/master/retail-strategy/notebooks/retail-recommendations.ipynb) | Based on a dataset containing the spirits purchase information of Iowa Class E liquor licensees by product and date of purchase, this project provides recommendations on where to open new stores in the state of Iowa. To devise an expansion strategy, I first needed to understand the data, and for that I conducted a thorough exploratory data analysis (EDA). With the data in hand I built multivariate regression models of total sales by county, using both Lasso and Ridge regularization, and based on these models I made recommendations about new locations.|
31 |
32 |
33 |
34 |
35 |
36 |
37 |
38 |
39 |
40 |
41 |
42 |
43 | ## Contact Information
44 |
45 | Feel free to contact me:
46 |
47 | * Email: [marcotav65@gmail.com](mailto:marcotav65@gmail.com)
48 | * GitHub: [marcotav](https://github.com/marcotav)
49 | * LinkedIn: [marco-tavora](https://www.linkedin.com/in/marco-tavora)
50 | * Website: [marcotavora.me](http://www.marcotavora.me)
51 |
52 |
53 |
54 |
55 |
56 |
--------------------------------------------------------------------------------
/analysis-of-opioid-prescription-problem/README.md:
--------------------------------------------------------------------------------
1 | ## U.S. Opiate Prescriptions/Overdoses [[view code]](http://nbviewer.jupyter.org/github/marcotav/machine-learning-classification-projects/blob/master/analysis-of-opioid-prescription-problem/notebooks/opioid-prescription-problem.ipynb)
2 |   
3 |
4 |
5 | **The code is available [here](http://nbviewer.jupyter.org/github/marcotav/machine-learning-classification-projects/blob/master/analysis-of-opioid-prescription-problem/notebooks/opioid-prescription-problem.ipynb) or by clicking on the [view code] link above.**
6 |
7 |
8 |
9 |
10 |
11 |
12 |
13 |
14 |
15 |
16 |
17 |
18 |
19 |
20 | Brief Introduction •
21 | Dataset •
22 | Project Goal
23 |
24 |
25 |
26 | ## Brief Introduction
27 |
28 | Accidental death by drug overdose is a rising trend in the United States. What can you do to help? (From Kaggle)
29 |
30 |
31 | ## Dataset
32 |
33 | This dataset contains:
34 | - Summaries of prescription records for 250 common **opioid** and **non-opioid** drugs written by 25,000 unique licensed medical professionals in 2014 in the United States for citizens covered under Class D Medicare
35 | - Metadata about the doctors themselves.
36 | - The data is in a format with one row per prescriber, reduced to 25,000 unique prescribers to keep it manageable.
37 | - The main data is in `prescriber-info.csv`.
38 | - There is also `opioids.csv`, which contains the names of all opioid drugs included in the data.
39 | - The file `overdoses.csv` contains information on opioid-related drug overdose fatalities.
40 |
41 |
42 | The data consists of the following characteristics for each prescriber:
43 | - NPI – unique National Provider Identifier number
44 | - Gender - (M/F)
45 | - State - U.S. State by abbreviation
46 | - Credentials - set of initials indicative of medical degree
47 | - Specialty - description of type of medicinal practice
48 | - A long list of drugs with numeric values indicating the total number of prescriptions written for the year by that individual
49 | - `Opioid.Prescriber` - a boolean label indicating whether or not that individual prescribed opiate drugs more than 10 times in the year
50 |
51 |
52 | ## Project Goal
53 |
54 | The increase in overdose fatalities is a well-known problem, and the search for possible solutions is an ongoing effort. This dataset can be used to detect sources of significant quantities of opiate prescriptions.
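
A minimal sketch of how the files fit together (column names taken from the CSVs in `data/`; pandas assumed):

```
import pandas as pd

prescribers = pd.read_csv('data/prescriber-info.csv')          # one row per prescriber
opioids = pd.read_csv('data/opioids.csv')                      # drug name -> generic name
overdoses = pd.read_csv('data/overdoses.csv', thousands=',')   # per-state fatalities

# per-state overdose death rate, as stored in overdosesnew.csv
overdoses['Deaths/Population'] = overdoses['Deaths'] / overdoses['Population']
print(overdoses.sort_values('Deaths/Population', ascending=False).head())
```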
55 |
56 |
--------------------------------------------------------------------------------
/analysis-of-opioid-prescription-problem/data/123:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/analysis-of-opioid-prescription-problem/data/mhincome.csv:
--------------------------------------------------------------------------------
1 | State,Income
Mississippi,40593.00
Arkansas,41995.00
West Virginia,42019.00
Alabama,44765.00
Kentucky,45215.00
New Mexico,45382.00
--------------------------------------------------------------------------------
/analysis-of-opioid-prescription-problem/data/opioids.csv:
--------------------------------------------------------------------------------
1 | Drug Name,Generic Name
2 | ABSTRAL,FENTANYL CITRATE
3 | ACETAMINOPHEN-CODEINE,ACETAMINOPHEN WITH CODEINE
4 | ACTIQ,FENTANYL CITRATE
5 | ASCOMP WITH CODEINE,CODEINE/BUTALBITAL/ASA/CAFFEIN
6 | ASPIRIN-CAFFEINE-DIHYDROCODEIN,DIHYDROCODEINE/ASPIRIN/CAFFEIN
7 | AVINZA,MORPHINE SULFATE
8 | BELLADONNA-OPIUM,OPIUM/BELLADONNA ALKALOIDS
9 | BUPRENORPHINE HCL,BUPRENORPHINE HCL
10 | BUTALB-ACETAMINOPH-CAFF-CODEIN,BUTALBIT/ACETAMIN/CAFF/CODEINE
11 | BUTALB-CAFF-ACETAMINOPH-CODEIN,BUTALBIT/ACETAMIN/CAFF/CODEINE
12 | BUTALBITAL COMPOUND-CODEINE,CODEINE/BUTALBITAL/ASA/CAFFEIN
13 | BUTORPHANOL TARTRATE,BUTORPHANOL TARTRATE
14 | BUTRANS,BUPRENORPHINE
15 | CAPITAL W-CODEINE,ACETAMINOPHEN WITH CODEINE
16 | CARISOPRODOL COMPOUND-CODEINE,CODEINE/CARISOPRODOL/ASPIRIN
17 | CARISOPRODOL-ASPIRIN-CODEINE,CODEINE/CARISOPRODOL/ASPIRIN
18 | CODEINE SULFATE,CODEINE SULFATE
19 | CO-GESIC,HYDROCODONE/ACETAMINOPHEN
20 | CONZIP,TRAMADOL HCL
21 | DEMEROL,MEPERIDINE HCL
22 | DEMEROL,MEPERIDINE HCL/PF
23 | DILAUDID,HYDROMORPHONE HCL
24 | DILAUDID,HYDROMORPHONE HCL/PF
25 | DILAUDID-HP,HYDROMORPHONE HCL/PF
26 | DISKETS,METHADONE HCL
27 | DOLOPHINE HCL,METHADONE HCL
28 | DURAGESIC,FENTANYL
29 | DURAMORPH,MORPHINE SULFATE/PF
30 | ENDOCET,OXYCODONE HCL/ACETAMINOPHEN
31 | ENDODAN,OXYCODONE HCL/ASPIRIN
32 | EXALGO,HYDROMORPHONE HCL
33 | FENTANYL,FENTANYL
34 | FENTANYL CITRATE,FENTANYL CITRATE
35 | FENTORA,FENTANYL CITRATE
36 | FIORICET WITH CODEINE,BUTALBIT/ACETAMIN/CAFF/CODEINE
37 | FIORINAL WITH CODEINE #3,CODEINE/BUTALBITAL/ASA/CAFFEIN
38 | HYCET,HYDROCODONE/ACETAMINOPHEN
39 | HYDROCODONE-ACETAMINOPHEN,HYDROCODONE/ACETAMINOPHEN
40 | HYDROCODONE-IBUPROFEN,HYDROCODONE/IBUPROFEN
41 | HYDROMORPHONE ER,HYDROMORPHONE HCL
42 | HYDROMORPHONE HCL,HYDROMORPHONE HCL
43 | HYDROMORPHONE HCL,HYDROMORPHONE HCL/PF
44 | IBUDONE,HYDROCODONE/IBUPROFEN
45 | INFUMORPH,MORPHINE SULFATE/PF
46 | KADIAN,MORPHINE SULFATE
47 | LAZANDA,FENTANYL CITRATE
48 | LEVORPHANOL TARTRATE,LEVORPHANOL TARTRATE
49 | LORCET,HYDROCODONE/ACETAMINOPHEN
50 | LORCET 10-650,HYDROCODONE/ACETAMINOPHEN
51 | LORCET HD,HYDROCODONE/ACETAMINOPHEN
52 | LORCET PLUS,HYDROCODONE/ACETAMINOPHEN
53 | LORTAB,HYDROCODONE/ACETAMINOPHEN
54 | MAGNACET,OXYCODONE HCL/ACETAMINOPHEN
55 | MEPERIDINE HCL,MEPERIDINE HCL
56 | MEPERIDINE HCL,MEPERIDINE HCL/PF
57 | MEPERITAB,MEPERIDINE HCL
58 | METHADONE HCL,METHADONE HCL
59 | METHADONE INTENSOL,METHADONE HCL
60 | METHADOSE,METHADONE HCL
61 | MORPHINE SULFATE,MORPHINE SULFATE
62 | MORPHINE SULFATE,MORPHINE SULFATE/PF
63 | MORPHINE SULFATE ER,MORPHINE SULFATE
64 | MS CONTIN,MORPHINE SULFATE
65 | NALBUPHINE HCL,NALBUPHINE HCL
66 | NORCO,HYDROCODONE/ACETAMINOPHEN
67 | NUCYNTA,TAPENTADOL HCL
68 | NUCYNTA ER,TAPENTADOL HCL
69 | OPANA,OXYMORPHONE HCL
70 | OPANA ER,OXYMORPHONE HCL
71 | OPIUM TINCTURE,OPIUM TINCTURE
72 | OXECTA,OXYCODONE HCL
73 | OXYCODONE HCL,OXYCODONE HCL
74 | OXYCODONE HCL ER,OXYCODONE HCL
75 | OXYCODONE HCL-ASPIRIN,OXYCODONE HCL/ASPIRIN
76 | OXYCODONE HCL-IBUPROFEN,IBUPROFEN/OXYCODONE HCL
77 | OXYCODONE-ACETAMINOPHEN,OXYCODONE HCL/ACETAMINOPHEN
78 | OXYCONTIN,OXYCODONE HCL
79 | OXYMORPHONE HCL,OXYMORPHONE HCL
80 | OXYMORPHONE HCL ER,OXYMORPHONE HCL
81 | PENTAZOCINE-ACETAMINOPHEN,PENTAZOCINE HCL/ACETAMINOPHEN
82 | PENTAZOCINE-NALOXONE HCL,PENTAZOCINE HCL/NALOXONE HCL
83 | PERCOCET,OXYCODONE HCL/ACETAMINOPHEN
84 | PERCODAN,OXYCODONE HCL/ASPIRIN
85 | PRIMLEV,OXYCODONE HCL/ACETAMINOPHEN
86 | REPREXAIN,HYDROCODONE/IBUPROFEN
87 | ROXICET,OXYCODONE HCL/ACETAMINOPHEN
88 | ROXICODONE,OXYCODONE HCL
89 | RYBIX ODT,TRAMADOL HCL
90 | STAGESIC,HYDROCODONE/ACETAMINOPHEN
91 | SUBSYS,FENTANYL
92 | SYNALGOS-DC,DIHYDROCODEINE/ASPIRIN/CAFFEIN
93 | TALWIN,PENTAZOCINE LACTATE
94 | TRAMADOL HCL,TRAMADOL HCL
95 | TRAMADOL HCL ER,TRAMADOL HCL
96 | TRAMADOL HCL-ACETAMINOPHEN,TRAMADOL HCL/ACETAMINOPHEN
97 | TREZIX,DHCODEINE BT/ACETAMINOPHN/CAFF
98 | TYLENOL-CODEINE NO.3,ACETAMINOPHEN WITH CODEINE
99 | TYLENOL-CODEINE NO.4,ACETAMINOPHEN WITH CODEINE
100 | ULTRACET,TRAMADOL HCL/ACETAMINOPHEN
101 | ULTRAM,TRAMADOL HCL
102 | ULTRAM ER,TRAMADOL HCL
103 | VICODIN,HYDROCODONE/ACETAMINOPHEN
104 | VICODIN ES,HYDROCODONE/ACETAMINOPHEN
105 | VICODIN HP,HYDROCODONE/ACETAMINOPHEN
106 | VICOPROFEN,HYDROCODONE/IBUPROFEN
107 | XARTEMIS XR,OXYCODONE HCL/ACETAMINOPHEN
108 | XODOL 10-300,HYDROCODONE/ACETAMINOPHEN
109 | XODOL 5-300,HYDROCODONE/ACETAMINOPHEN
110 | XODOL 7.5-300,HYDROCODONE/ACETAMINOPHEN
111 | XYLON 10,HYDROCODONE/IBUPROFEN
112 | ZAMICET,HYDROCODONE/ACETAMINOPHEN
113 | ZOHYDRO ER,HYDROCODONE BITARTRATE
114 | ZOLVIT,HYDROCODONE/ACETAMINOPHEN
--------------------------------------------------------------------------------
/analysis-of-opioid-prescription-problem/data/overdoses.csv:
--------------------------------------------------------------------------------
1 | "State","Population","Deaths","Abbrev"
2 | "Alabama","4,833,722","723","AL"
3 | "Alaska","735,132","124","AK"
4 | "Arizona","6,626,624","1,211","AZ"
5 | "Arkansas","2,959,373","356","AR"
6 | "California","38,332,521","4,521","CA"
7 | "Colorado","5,268,367","899","CO"
8 | "Connecticut","3,596,080","623","CT"
9 | "Delaware","925,749","189","DE"
10 | "Florida","19,552,860","2,634","FL"
11 | "Georgia","9,992,167","1,206","GA"
12 | "Hawaii","1,404,054","157","HI"
13 | "Idaho","1,612,136","212","ID"
14 | "Illinois","12,882,135","1,705","IL"
15 | "Indiana","6,570,902","1,172","IN"
16 | "Iowa","3,090,416","264","IA"
17 | "Kansas","2,893,957","332","KS"
18 | "Kentucky","4,395,295","1,077","KY"
19 | "Louisiana","4,625,470","777","LA"
20 | "Maine","1,328,302","216","ME"
21 | "Maryland","5,928,814","1,070","MD"
22 | "Massachusetts","6,692,824","1,289","MA"
23 | "Michigan","9,895,622","1,762","MI"
24 | "Minnesota","5,420,380","517","MN"
25 | "Mississippi","2,991,207","336","MS"
26 | "Missouri","6,044,171","1,067","MO"
27 | "Montana","1,015,165","125","MT"
28 | "Nebraska","1,868,516","125","NE"
29 | "Nevada","2,790,136","545","NV"
30 | "New Hampshire","1,323,459","334","NH"
31 | "New Jersey","8,899,339","1,253","NJ"
32 | "New Mexico","2,085,287","547","NM"
33 | "New York","19,651,127","2,300","NY"
34 | "North Carolina","9,848,060","1,358","NC"
35 | "North Dakota","723,393","43","ND"
36 | "Ohio","11,570,808","2,744","OH"
37 | "Oklahoma","3,850,568","777","OK"
38 | "Oregon","3,930,065","522","OR"
39 | "Pennsylvania","12,773,801","2,732","PA"
40 | "Rhode Island","1,051,511","247","RI"
41 | "South Carolina","4,774,839","701","SC"
42 | "South Dakota","844,877","63","SD"
43 | "Tennessee","6,495,978","1,269","TN"
44 | "Texas","26,448,193","2,601","TX"
45 | "Utah","2,900,872","603","UT"
46 | "Vermont","626,630","83","VT"
47 | "Virginia","8,260,405","980","VA"
48 | "Washington","6,971,406","979","WA"
49 | "West Virginia","1,854,304","627","WV"
50 | "Wisconsin","5,742,713","853","WI"
51 | "Wyoming","582,658","109","WY"
52 |
--------------------------------------------------------------------------------
/analysis-of-opioid-prescription-problem/data/overdosesnew.csv:
--------------------------------------------------------------------------------
1 | ,State,Population,Deaths,Abbrev,Deaths/Population
2 | 0,Alabama,4833722,723,AL,0.0001495741790694624
3 | 1,Alaska,735132,124,AK,0.00016867718994683948
4 | 2,Arizona,6626624,1211,AZ,0.00018274765551810394
5 | 3,Arkansas,2959373,356,AR,0.00012029575183662215
6 | 4,California,38332521,4521,CA,0.00011794162977175439
7 | 5,Colorado,5268367,899,CO,0.00017064111137284096
8 | 6,Connecticut,3596080,623,CT,0.00017324419923917154
9 | 7,Delaware,925749,189,DE,0.00020415901070376527
10 | 8,Florida,19552860,2634,FL,0.0001347117506083509
11 | 9,Georgia,9992167,1206,GA,0.000120694540033208
12 | 10,Hawaii,1404054,157,HI,0.00011181906109024297
13 | 11,Idaho,1612136,212,ID,0.00013150255313447502
14 | 12,Illinois,12882135,1705,IL,0.00013235383731035268
15 | 13,Indiana,6570902,1172,IN,0.00017836211832104634
16 | 14,Iowa,3090416,264,IA,8.542539256850857e-05
17 | 15,Kansas,2893957,332,KS,0.00011472181514790993
18 | 16,Kentucky,4395295,1077,KY,0.00024503474738328144
19 | 17,Louisiana,4625470,777,LA,0.00016798292930231954
20 | 18,Maine,1328302,216,ME,0.00016261362250452081
21 | 19,Maryland,5928814,1070,MD,0.00018047454347530552
22 | 20,Massachusetts,6692824,1289,MA,0.0001925943368598965
23 | 21,Michigan,9895622,1762,MI,0.00017805853942278718
24 | 22,Minnesota,5420380,517,MN,9.538076666211594e-05
25 | 23,Mississippi,2991207,336,MS,0.00011232923699362832
26 | 24,Missouri,6044171,1067,MO,0.00017653372149795232
27 | 25,Montana,1015165,125,MT,0.00012313269271497737
28 | 26,Nebraska,1868516,125,NE,6.689800890118148e-05
29 | 27,Nevada,2790136,545,NV,0.00019533098028196475
30 | 28,New Hampshire,1323459,334,NH,0.0002523689815853759
31 | 29,New Jersey,8899339,1253,NJ,0.0001407969737977169
32 | 30,New Mexico,2085287,547,NM,0.0002623140124117208
33 | 31,New York,19651127,2300,NY,0.00011704163328647767
34 | 32,North Carolina,9848060,1358,NC,0.00013789517935512173
35 | 33,North Dakota,723393,43,ND,5.9442101319752885e-05
36 | 34,Ohio,11570808,2744,OH,0.00023714852065646582
37 | 35,Oklahoma,3850568,777,OK,0.00020178841147591733
38 | 36,Oregon,3930065,522,OR,0.00013282223067557407
39 | 37,Pennsylvania,12773801,2732,PA,0.00021387525921219534
40 | 38,Rhode Island,1051511,247,RI,0.00023490006286191965
41 | 39,South Carolina,4774839,701,SC,0.00014681123279758752
42 | 40,South Dakota,844877,63,SD,7.456706715888821e-05
43 | 41,Tennessee,6495978,1269,TN,0.00019535164681900091
44 | 42,Texas,26448193,2601,TX,9.83432025015849e-05
45 | 43,Utah,2900872,603,UT,0.00020786853056598153
46 | 44,Vermont,626630,83,VT,0.00013245455851140225
47 | 45,Virginia,8260405,980,VA,0.00011863825078794562
48 | 46,Washington,6971406,979,WA,0.00014043078254228773
49 | 47,West Virginia,1854304,627,WV,0.000338132258788203
50 | 48,Wisconsin,5742713,853,WI,0.00014853606648982808
51 | 49,Wyoming,582658,109,WY,0.00018707372077616716
52 |
--------------------------------------------------------------------------------
/analysis-of-opioid-prescription-problem/images/123:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/analysis-of-opioid-prescription-problem/images/opioids.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/analysis-of-opioid-prescription-problem/images/opioids.png
--------------------------------------------------------------------------------
/analysis-of-opioid-prescription-problem/notebooks/123:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/churn/README.md:
--------------------------------------------------------------------------------
1 | ## Churn Analysis [[view code]](http://nbviewer.jupyter.org/github/marcotav/machine-learning-classification-projects/blob/master/churn/notebooks/predicting-customer-churn.ipynb)
2 |       
3 |
4 | **The code is available [here](http://nbviewer.jupyter.org/github/marcotav/machine-learning-classification-projects/blob/master/churn/notebooks/predicting-customer-churn.ipynb) or by clicking on the [view code] link above.**
5 |
6 | This project was done in collaboration with [Corey Girard](https://github.com/coreygirard/)
7 |
8 |
9 |
10 |
11 |
12 |
13 |
14 | Goals •
15 | Why is this important? •
16 | Importing modules and reading the data •
17 | Data Handling and Feature Engineering •
18 | Features and target •
19 | Using `pandas-profiling` and rejecting variables with correlations above 0.9 •
20 | Scaling •
21 | Model Comparison •
22 | Building a random forest classifier using GridSearch to optimize hyperparameters
23 |
24 |
25 |
26 |
27 | ### Goals
28 | From Wikipedia,
29 |
30 | > Churn rate is a measure of the number of individuals or items moving out of a collective group over a specific period. It is one of two primary factors that determine the steady-state level of customers a business will support [...] It is an important factor for any business with a subscriber-based service model, [such as] mobile telephone networks.
31 |
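As a simple illustration of the definition (made-up numbers):

```
customers_at_start = 1000
customers_lost = 50                                 # left during the period
churn_rate = customers_lost / customers_at_start    # 0.05, i.e. 5% per period
```
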
32 | Our goal in this analysis was to predict customer churn for a mobile phone company based on customer attributes including:
33 | - Area code
34 | - Call duration at different hours
35 | - Charges
36 | - Account length
37 |
38 | See [this website](http://blog.yhat.com/posts/predicting-customer-churn-with-sklearn.html) for a similar analysis.
39 |
40 |
41 | ### Why is this important?
42 |
43 | It is a well-known fact that in several businesses (particularly the ones involving subscriptions), the acquisition of new customers costs much more than the retention of existing ones. A thorough analysis of what causes churn-rates and how to predict them can be used to build efficient customer retention strategies.
44 |
45 |
46 | ## Importing modules and reading the data
47 | ```
48 | from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
49 | from sklearn.ensemble import RandomForestClassifier
50 | import pandas as pd
51 | import seaborn as sns
52 | import numpy as np
53 | import matplotlib.pyplot as plt
54 | %matplotlib inline
55 | ```
56 | Reading the data:
57 | ```
58 | df = pd.read_csv("data.csv")
59 | ```
60 |
61 |
62 |
63 |
64 |
65 | ## Data Handling and Feature Engineering
66 | In this section the following steps are taken:
67 | - Conversion of strings into booleans
68 | - Conversion of booleans to integers
69 | - Converting the states column into dummy columns
70 | - Creation of several new features (feature engineering)
71 |
72 | The commented code follows (most of the lines were omitted for brevity):
73 | ```
74 | # convert binary strings to boolean ints
75 | df['international_plan'] = df.international_plan.replace({'Yes': 1, 'No': 0})
76 | # convert booleans to ints
77 | df['churn'] = df.churn.replace({True: 1, False: 0})
78 | # handle state and area code dummies
79 | state_dummies = pd.get_dummies(df.state)
80 | state_dummies.columns = ['state_'+c.lower() for c in state_dummies.columns.values]
81 | df.drop('state', axis='columns', inplace=True)
82 | df = pd.concat([df, state_dummies], axis='columns')
83 | area_dummies = pd.get_dummies(df.area_code)
84 | area_dummies.columns = ['area_code_'+str(c) for c in area_dummies.columns.values]
85 | df.drop('area_code', axis='columns', inplace=True)
86 | df = pd.concat([df, area_dummies], axis='columns')
87 | # feature engineering
88 | df['total_minutes'] = df.total_day_minutes + df.total_eve_minutes + df.total_intl_minutes
89 | df['total_calls'] = df.total_day_calls + df.total_eve_calls + df.total_intl_calls
90 | ```
91 |
92 |
93 | ### Features and target
94 | Defining the features matrix and the target (the churn):
95 | ```
96 | X = df[[c for c in df.columns if c != 'churn']]
97 | y = df.churn
98 | ```
99 |
100 |
101 | ### Using `pandas-profiling` and rejecting variables with correlations above 0.9
102 |
103 | The package `pandas-profiling` contains a method `get_rejected_variables(threshold)` which identifies variables with correlation higher than a threshold.
104 | ```
105 | import pandas_profiling
106 | profile = pandas_profiling.ProfileReport(X)
107 | rejected_variables = profile.get_rejected_variables(threshold=0.9)
108 | X = X.drop(rejected_variables,axis=1)
109 | ```
110 |
111 | ### Scaling
112 | ```
113 | from sklearn.preprocessing import StandardScaler
114 | cols = X.columns.tolist()
115 | scaler = StandardScaler()
116 | X[cols] = scaler.fit_transform(X[cols])
117 | X = X[cols]
118 | ```
119 | We can now build our models.
120 |
121 |
122 | ## Model Comparison
123 |
124 | We can write a for loop that does the following:
125 | - Iterates over a list of models, in this case LogisticRegression, GaussianNB, KNeighborsClassifier and LinearSVC
126 | - Trains each model using the training dataset X_train and y_train
127 | - Predicts the target using the test features X_test
128 | - Calculates the `f1_score` and cross-validation score
129 | - Builds a dataframe with that information
130 |
131 | The code will also print out the confusion matrix, from which "recall" and "precision" can be calculated:
132 | - When a customer churns, how often does the classifier predict that to happen? This is the "recall".
133 | - When the model predicts a churn, how often does that user actually churn? This is the "precision".
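
As a toy illustration with made-up counts (not from this dataset):

```
# hypothetical confusion matrix:        predicted 0   predicted 1
#                         actual 0             850            50
#                         actual 1              60            40
recall = 40 / (60 + 40)      # 0.40: fraction of actual churners the model catches
precision = 40 / (50 + 40)   # 0.444: fraction of predicted churners who actually churn
```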
134 |
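The loop below uses a few names that are presumably imported in the notebook; for completeness:

```
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
```
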
135 | ```
136 | X_train, X_test, y_train, y_test = train_test_split(X, y,
137 |                                                     test_size=0.25, random_state=0)
138 |
139 | models = [LogisticRegression, GaussianNB,
140 |           KNeighborsClassifier, LinearSVC]
141 |
142 | lst = []
143 | for model in models:
144 |     clf = model().fit(X_train, y_train)
145 |     y_pred = clf.predict(X_test)
146 |     lst.append([model.__name__,
147 |                 round(metrics.f1_score(y_test,
148 |                                        y_pred,
149 |                                        average="macro"), 3)])
150 | df = pd.DataFrame(lst, columns=['Model','f1_score'])
151 |
152 | lst_av_cross_val_scores = []
153 |
154 | for model in models:
155 |     clf = model()
156 |     scores = cross_val_score(clf, X, y, cv=5)
157 |     av_cross_val_scores = scores.mean()
158 |     lst_av_cross_val_scores.append(round(av_cross_val_scores, 3))
159 |
160 | model_names = [model.__name__ for model in models]
161 |
162 | df1 = pd.DataFrame(list(zip(model_names, lst_av_cross_val_scores)))
163 | df1.columns = ['Model','Average Cross-Validation']
164 | df_all = pd.concat([df1, df['f1_score']], axis=1)
165 | ```
166 |
167 |
168 |
169 |
170 | If we use cross-validation as our metric, we see that the `KNeighborsClassifier` has the best performance.
171 |
172 | Now we will look at confusion matrices. These are obtained as follows:
173 |
174 | ```
175 | models_names = ['LogisticRegression', 'GaussianNB', 'KNeighborsClassifier', 'LinearSVC']
176 | i = 0
177 | for preds in y_pred_lst:  # y_pred_lst holds each model's test-set predictions
178 |     print('Confusion Matrix for:', models_names[i])
179 |     i += 1
180 |     print('')
181 |     cm = pd.crosstab(pd.concat([X_test, y_test], axis=1)['churn'], preds,
182 |                      rownames=['Actual Values'], colnames=['Predicted Values'])
183 |     recall = round(cm.iloc[1,1]/(cm.iloc[1,0]+cm.iloc[1,1]), 3)
184 |     precision = round(cm.iloc[1,1]/(cm.iloc[0,1]+cm.iloc[1,1]), 3)
185 |     print(cm)
186 |     print('Recall for {} is:'.format(models_names[i-1]), recall)
187 |     print('Precision for {} is:'.format(models_names[i-1]), precision, '\n')
188 |     print('------------------------------------------------------------ \n')
189 | ```
190 | The output is:
191 |
192 |
193 |
194 |
195 |
196 | The highest recall is from `GaussianNB` and the highest precision from `KNeighborsClassifier`.
197 |
198 |
199 | ### Finding best hyperparameters
200 | As a complement, let us use a Random Forest Classifier with grid search for hyperparameter optimization.
201 |
202 |
203 | ```
204 | n_estimators = list(range(20,160,10))
205 | max_depth = list(range(2, 16, 2)) + [None]
206 | def rfscore(X, y, test_size, n_estimators, max_depth):
207 |
208 |     X_train, X_test, y_train, y_test = train_test_split(X, y,
209 |                                                         test_size=test_size, random_state=42)
210 |     rf_params = {
211 |         'n_estimators': n_estimators,
212 |         'max_depth': max_depth}  # parameters for grid search
213 |     rf_gs = GridSearchCV(RandomForestClassifier(), rf_params, cv=5, verbose=1, n_jobs=-1)
214 |     rf_gs.fit(X_train, y_train)  # training the random forest with all possible parameters
215 |     max_depth_best = rf_gs.best_params_['max_depth']  # getting the best max_depth
216 |     n_estimators_best = rf_gs.best_params_['n_estimators']  # getting the best n_estimators
217 |     print("best max_depth:", max_depth_best)
218 |     print("best n_estimators:", n_estimators_best)
219 |     best_rf_gs = RandomForestClassifier(max_depth=max_depth_best, n_estimators=n_estimators_best)  # instantiate the best model
220 |     best_rf_gs.fit(X_train, y_train)  # fitting the best model
221 |     best_rf_score = best_rf_gs.score(X_test, y_test)
222 |     print("best score is:", round(best_rf_score, 3))
223 |     preds = best_rf_gs.predict(X_test)
224 |     df_pred = pd.DataFrame(np.array(preds).reshape(len(preds), 1))
225 |     df_pred.columns = ['predictions']
226 |     print('Features and their importance:\n')
227 |     feature_importances = pd.Series(best_rf_gs.feature_importances_, index=X.columns).sort_values().tail(10)
228 |     print(feature_importances)
229 |     feature_importances.plot(kind="barh", figsize=(6,6))
230 |     return (df_pred, max_depth_best, n_estimators_best)
231 |
232 |
233 | triple = rfscore(X, y, 0.3, n_estimators, max_depth)
234 | ```
235 | ```
236 | df_pred = triple[0]  # predictions DataFrame returned by rfscore
237 | ```
238 | The predictions are:
239 | ```
240 | df_pred['predictions'].value_counts()/df_pred.shape[0]
241 | ```
242 |
243 |
244 |
245 |
246 |
247 |
248 |
249 | ### Cross Validation
250 | ```
251 | def cv_score(X, y, cv, n_estimators, max_depth):
252 |     rf = RandomForestClassifier(n_estimators=n_estimators,
253 |                                 max_depth=max_depth)
254 |     s = cross_val_score(rf, X, y, cv=cv, n_jobs=-1)
255 |     return("{} Score is :{:0.3} ± {:0.3}".format("Random Forest", s.mean().round(3), s.std().round(3)))
256 | ```
257 | ```
258 | dict_best = {'max_depth': triple[1], 'n_estimators': triple[2]}
259 | n_estimators_best = dict_best['n_estimators']
260 | max_depth_best = dict_best['max_depth']
261 | cv_score(X,y,5,n_estimators_best,max_depth_best)
262 | ```
263 | The output is:
264 | ```
265 | 'Random Forest Score is :0.774 ± 0.054'
266 | ```
267 |
268 | For the random forest, the recall and precision found are:
269 |
270 | ```
271 | recall: 0.286
272 | precision: 0.727
273 | ```
274 |
275 | Both the cross-validation score and the precision of our `RandomForestClassifier` are the highest among the five models investigated.
276 |
--------------------------------------------------------------------------------
/churn/data/123.png:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/churn/images/123.png:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/churn/images/balancedchurn.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/churn/images/balancedchurn.png
--------------------------------------------------------------------------------
/churn/images/baseline.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/churn/images/baseline.png
--------------------------------------------------------------------------------
/churn/images/cellphone.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/churn/images/cellphone.jpg
--------------------------------------------------------------------------------
/churn/images/churnprob.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/churn/images/churnprob.png
--------------------------------------------------------------------------------
/churn/images/cm.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/churn/images/cm.png
--------------------------------------------------------------------------------
/churn/images/cms.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/churn/images/cms.png
--------------------------------------------------------------------------------
/churn/images/cms1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/churn/images/cms1.png
--------------------------------------------------------------------------------
/churn/images/cms2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/churn/images/cms2.png
--------------------------------------------------------------------------------
/churn/images/df_churn_new.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/churn/images/df_churn_new.png
--------------------------------------------------------------------------------
/churn/images/featurerf.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/churn/images/featurerf.png
--------------------------------------------------------------------------------
/churn/images/imbalancechurn.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/churn/images/imbalancechurn.png
--------------------------------------------------------------------------------
/churn/images/model_comparison.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/churn/images/model_comparison.png
--------------------------------------------------------------------------------
/churn/images/predictions.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/churn/images/predictions.png
--------------------------------------------------------------------------------
/click-prediction/README.md:
--------------------------------------------------------------------------------
1 | ## Predicting clicks on ads [[view code]](http://nbviewer.jupyter.org/github/marcotav/machine-learning-classification-projects/blob/master/click-prediction/notebooks/click-predictive-model.ipynb)
2 |       
3 |
4 |
5 | **The code is available [here](http://nbviewer.jupyter.org/github/marcotav/machine-learning-classification-projects/blob/master/click-prediction/notebooks/click-predictive-model.ipynb) or by clicking on the [view code] link above.**
6 |
7 |
8 |
9 |
10 |
11 |
12 |
13 |
14 | ## Problem Statement
15 |
16 | Borrowing from [here](https://turi.com/learn/gallery/notebooks/click_through_rate_prediction_intro.html):
17 |
18 |
19 | > Many ads are actually sold on a "pay-per-click" (PPC) basis, meaning the company only pays for ad clicks, not ad views. Thus your optimal approach (as a search engine) is actually to choose an ad based on "expected value", meaning the price of a click times the likelihood that the ad will be clicked [...] In order for you to maximize expected value, you therefore need to accurately predict the likelihood that a given ad will be clicked, also known as "click-through rate" (CTR).
20 |
21 | In this project I will predict the likelihood that a given online ad will be clicked.
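
In arithmetic terms (illustrative numbers, not from the dataset):

```
price_per_click = 0.50                        # hypothetical price paid per click
p_click = 0.02                                # hypothetical predicted CTR
expected_value = price_per_click * p_click    # 0.01 expected revenue per impression
```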
22 |
23 | ## Dataset
24 |
25 | - The two files `train_click.csv` and `test_click.csv` contain ad impression attributes from a campaign.
26 | - Each row in `train_click.csv` includes a `click` column indicating whether the impression resulted in a click.
27 |
28 | ## Import the relevant libraries and the files
29 |
30 | ```
31 | import numpy as np
32 | import pandas as pd
33 | import matplotlib.pyplot as plt
34 | from fancyimpute import BiScaler, KNN, NuclearNormMinimization, SoftImpute # used for feature imputation algorithms
35 | pd.set_option('display.max_columns', None) # display all columns
36 | pd.set_option('display.max_rows', None) # displays all rows
37 | %matplotlib inline
38 | from IPython.core.interactiveshell import InteractiveShell
39 | InteractiveShell.ast_node_interactivity = "all" # so we can see the value of multiple statements at once.
40 | ```
41 |
42 | ## Import the data
43 |
44 | ```
45 | train = pd.read_csv('train_click.csv',index_col=0)
46 | test = pd.read_csv('test_click.csv',index_col=0)
47 | ```
48 |
49 | ## Data Dictionary
50 |
51 | The meaning of the columns follows:
52 | - `location` – ad placement in the website
53 | - `carrier` – mobile carrier
54 | - `device` – type of device e.g. phone, tablet or computer
55 | - `day` – weekday user saw the ad
56 | - `hour` – hour user saw the ad
57 | - `dimension` – size of ad
58 |
59 | ## Imbalance
60 | The `click` column is **heavily** unbalanced. I will correct for this later.
61 |
62 | ```
63 | import aux_func_v2 as af
64 | af.s_to_df(train['click'].value_counts())
65 | ```
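
(`aux_func_v2` is a local helper module not included in the repo; `s_to_df` presumably converts the value-counts `Series` into a small `DataFrame`, along the lines of this sketch:)

```
import pandas as pd

def s_to_df(s):
    # hypothetical reimplementation: Series -> two-column DataFrame
    return pd.DataFrame({'value': s.index, 'count': s.values})
```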
66 |
67 | ### Checking the variance of each feature
68 |
69 | Let's quickly study the variance of the features to have an estimate of their impact on clicks. But let us first consider the cardinalities.
70 |
71 | #### Train set cardinalities
72 |
73 | ```
74 | cardin_train = [train[col].nunique() for col in train.columns]
75 | cols = train.columns.tolist()
76 | d = dict(zip(cols, cardin_train))
77 | cardinal_train = pd.DataFrame(list(d.items()), columns=['column', 'cardinality'])
78 | cardinal_train.sort_values('cardinality', ascending=False)
79 | ```
80 |
81 | #### Test set cardinalities
82 | ```
83 | cardin_test = [test[col].nunique() for col in test.columns]
84 | cols = test.columns.tolist()
85 | d = dict(zip(cols, cardin_test))
86 | cardinal_test = pd.DataFrame(list(d.items()), columns=['column', 'cardinality'])
87 | cardinal_test.sort_values('cardinality', ascending=False)
88 | ```
89 |
90 | #### High and low cardinality in the training data
91 |
92 | We can set *arbitrary* thresholds to determine the level of cardinality in the feature categories:
93 |
94 | ```
95 | target = 'click'
96 | cardinal_train_threshold = 33 # our choice
97 | low_cardinal_train = cardinal_train[cardinal_train['cardinality']
98 | <= cardinal_train_threshold]['column'].tolist()
99 | low_cardinal_train.remove(target)
100 | high_cardinal_train = cardinal_train[cardinal_train['cardinality']
101 | > cardinal_train_threshold]['column'].tolist()
102 | print('Features with low cardinal_train:\n',low_cardinal_train)
103 | print('')
104 | print('Features with high cardinal_train:\n',high_cardinal_train)
105 | ```
106 |
107 | #### High and low cardinality in the test data
108 |
109 | ```
110 | cardinal_test_threshold = 25 # chosen so that low_cardinal_test agrees with low_cardinal_train
111 | low_cardinal_test = cardinal_test[cardinal_test['cardinality']
112 | <= cardinal_test_threshold]['column'].tolist()
113 | high_cardinal_test = cardinal_test[cardinal_test['cardinality']
114 | > cardinal_test_threshold]['column'].tolist()
115 | print('Features with low cardinal_test:\n',low_cardinal_test)
116 | print('')
117 | print('Features with high cardinal_test:\n',high_cardinal_test)
118 | ```
119 |
120 | #### Now let's look at the features' variances.
121 |
122 | From the bar plot below we see that `device_type` has non-negligible variance.
123 |
124 | ```
125 | import matplotlib.pyplot as plt
126 |
127 | for col in low_cardinal_train:
128 |     ax = train[target].groupby(train[col]).sum().plot(kind='bar',
129 |                                                       title="Clicks per " + col,
130 |                                                       figsize=(10, 5), fontsize=12)
131 |     ax.set_xlabel(col, fontsize=12)
132 |     ax.set_ylabel("Clicks", fontsize=12)
133 |     plt.show()
134 |
135 | ```
136 |
137 | ### Dropping some features
138 |
139 | Notice that some of the features are massively dominated by **just one level**. We will drop those. We have to
140 | do that for both train and test sets:
141 |
142 | ```
143 | cols_to_drop = ['location']
144 | train_new = train.drop(cols_to_drop,axis=1)
145 | test_new = test.drop(cols_to_drop,axis=1)
146 | ```
147 |
148 |
149 | ### Data types
150 |
151 | ```
152 | train_new.dtypes
153 | test_new.dtypes
154 | ```
155 |
156 | #### Converting some of the integer columns into strings:
157 |
158 | ```
159 | cols_to_convert = test_new.columns.tolist()
160 | for col in cols_to_convert:
161 |     train_new[col] = train_new[col].astype(str)
162 |     test_new[col] = test_new[col].astype(str)
163 | ```
164 |
165 |
166 | ## Handling missing values
167 |
168 | The only column with missing values is the `website` column. There are several ways to fill missing values, including:
169 | - Dropping the corresponding rows
170 | - Filling `NaN`s with the most frequent value
171 | - Multiple Imputation by Chained Equations (MICE), a more sophisticated option
172 |
173 | In our case there is only a relatively small percentage of `NaN`s, in just one column: $\approx 13\%$ of `website` values are missing. I opted for most-frequent-value imputation to avoid dropping rows. A future analysis using MICE should improve the final results.
174 |
175 | ```
176 | train_new['website'] = train_new[['website']].apply(lambda x:x.fillna(x.value_counts().index[0]))
177 | train_new.isnull().any()
178 | test_new['website'] = test_new[['website']].apply(lambda x:x.fillna(x.value_counts().index[0]))
179 | test_new.isnull().any()
180 | ```
181 |
182 |
183 | ### Dummies
184 |
185 | We can transform the categories with low cardinality into dummies using one-hot encoding:
186 |
187 | ```
188 | cols_to_keep = ['carrier', 'device', 'day', 'hour', 'dimension']
189 | low_cardin_train = train_new[cols_to_keep]
190 | low_cardin_test = test_new[cols_to_keep]
191 | dummies_train = pd.concat([pd.get_dummies(low_cardin_train[col], drop_first = True, prefix= col)
192 | for col in cols_to_keep], axis=1)
193 | dummies_test = pd.concat([pd.get_dummies(low_cardin_test[col], drop_first = True, prefix= col)
194 | for col in cols_to_keep], axis=1)
195 | dummies_train.head()
196 | dummies_test.head()
197 |
198 | train_new.to_csv('train_new.csv')
199 | test_new.to_csv('test_new.csv')
200 | ```
201 |
202 | #### Concatenating with the rest of the `DataFrame`:
203 |
204 | ```
205 | train_new = pd.concat([train_new[high_cardinal_train + ['click']], dummies_train], axis = 1)
206 | test_new = pd.concat([test_new[high_cardinal_test], dummies_test], axis = 1)
207 | ```
208 |
209 | Now, to treat the columns with high cardinality, we will break them up into three ranges based on the number of impressions (number of rows).
210 |
211 | #### Building up dictionaries for creation of dummy variables
212 |
213 | ```
214 | train_new['count'] = 1 # auxiliary column
215 | test_new['count'] = 1
216 | ```
217 |
218 | #### In the next cell, I use `pd.cut` to bin column entries into ranges
219 |
220 | ```
221 | def series_to_dataframe(s, name, index_list):
222 |     lst = [s.iloc[i] for i in range(s.shape[0])]
223 |     new_df = pd.DataFrame({name: lst})  # transforms list into dataframe
224 |     new_df.index = index_list
225 |     return new_df
226 |
227 | def ranges(df1, col):
228 |     df = series_to_dataframe(df1['count'].groupby(df1[col]).sum(),
229 |                              'sum of ads',
230 |                              df1['count'].groupby(df1[col]).sum().index.tolist()).sort_values('sum of ads', ascending=False)
231 |     # pd.cut splits 'sum of ads' into 3 equal-width bins, turned into dummies below
232 |     df = pd.concat([df, pd.get_dummies(pd.cut(df['sum of ads'], 3), drop_first=True)], axis=1)
233 |     df.columns = ['sum of ads', col + '_1', col + '_2']
234 |     return df
235 | website_train = ranges(train_new,'website')
236 | publisher_train = ranges(train_new,'publisher')
237 | website_test = ranges(test_new,'website')
238 | publisher_test = ranges(test_new,'publisher')
239 | website_train.reset_index(level=0, inplace=True)
240 | publisher_train.reset_index(level=0, inplace=True)
241 | website_test.reset_index(level=0, inplace=True)
242 | publisher_test.reset_index(level=0, inplace=True)
243 | website_train.columns = ['website', 'sum of impressions', 'website_1', 'website_2']
244 | publisher_train.columns = ['publisher', 'sum of impressions', 'publisher_1', 'publisher_2']
245 | website_test.columns = ['website', 'sum of impressions', 'website_1', 'website_2']
246 | publisher_test.columns = ['publisher', 'sum of impressions', 'publisher_1', 'publisher_2']
247 | train_new = train_new.merge(website_train, how='left')
248 | train_new = train_new.drop('website',axis=1).drop('sum of impressions',axis=1)
249 | train_new = train_new.merge(publisher_train, how='left')
250 | train_new = train_new.drop('publisher',axis=1).drop('sum of impressions',axis=1)
251 | test_new = test_new.merge(website_test, how='left')
252 | test_new = test_new.drop('website',axis=1).drop('sum of impressions',axis=1)
253 | test_new = test_new.merge(publisher_test, how='left')
254 | test_new = test_new.drop('publisher',axis=1).drop('sum of impressions',axis=1)
255 | ```
256 |
257 | ## Imbalanced classes
258 |
259 |
260 | #### Imbalanced classes in general
261 |
262 | - We can account for unbalanced classes using:
263 |   - Undersampling: randomly sample the majority class, artificially balancing the classes when fitting the model
264 |   - Oversampling: bootstrap (sample with replacement) the minority class to balance the classes when fitting the model. We can oversample using the SMOTE algorithm (Synthetic Minority Oversampling Technique); see the sketch after the code below.
265 | - Note that it is crucial that we **evaluate our model on the real data!!**
266 |
267 | ```
268 | zeros = train_new[train_new['click'] == 0]
269 | ones = train_new[train_new['click'] == 1]
270 | counts = train_new['click'].value_counts()
271 | proportion = counts[1]/counts[0]
272 | train_new = ones.append(zeros.sample(frac=proportion))
273 | #train_new['response'].value_counts()
274 | #train_new.isnull().any()
275 | ```
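
As mentioned above, oversampling with SMOTE is an alternative. A minimal sketch, assuming the `imbalanced-learn` package, a fully numeric feature matrix, and that we remember to resample only the training split:

```
from imblearn.over_sampling import SMOTE

X = train_new.drop('click', axis=1)
y = train_new['click']
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)   # balanced classes
```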
276 |
277 | ## Models
278 |
279 | ```
280 | from sklearn.model_selection import cross_val_score, StratifiedKFold, train_test_split, GridSearchCV
281 | from sklearn.tree import DecisionTreeClassifier
282 | from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier
283 | from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
284 | from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfVectorizer, TfidfTransformer
285 | import seaborn as sns
286 | from sklearn.metrics import confusion_matrix
287 | %matplotlib inline
288 |
289 | X_test = test_new
290 | ```
291 |
292 | ## Defining ranges for the hyperparameters to be scanned by the grid search
293 | ```
294 | n_estimators = list(range(20,120,10))
295 | max_depth = list(range(2, 22, 2)) + [None]
296 | def random_forest_score(df, target_col, test_size, n_estimators, max_depth):
297 |
298 |     X_train = df.drop(target_col, axis=1) # predictors
299 |     y_train = df[target_col] # target
300 |     X_test = test_new  # note: test_size is unused; predictions are made on the separate test set
301 |
302 |     rf_params = {
303 |         'n_estimators': n_estimators,
304 |         'max_depth': max_depth} # parameters for grid search
305 |     rf_gs = GridSearchCV(RandomForestClassifier(), rf_params, cv=5, verbose=1, n_jobs=-1)
306 |     rf_gs.fit(X_train, y_train) # training the random forest with all possible parameters
307 |     print('The best parameters on the training data are:\n', rf_gs.best_params_) # printing the best parameters
308 |     max_depth_best = rf_gs.best_params_['max_depth'] # getting the best max_depth
309 |     n_estimators_best = rf_gs.best_params_['n_estimators'] # getting the best n_estimators
310 |     print("best max_depth:", max_depth_best)
311 |     print("best n_estimators:", n_estimators_best)
312 |     best_rf_gs = RandomForestClassifier(max_depth=max_depth_best, n_estimators=n_estimators_best) # instantiate the best model
313 |     best_rf_gs.fit(X_train, y_train) # fitting the best model
314 |     preds = best_rf_gs.predict(X_test)
315 |     feature_importances = pd.Series(best_rf_gs.feature_importances_, index=X_train.columns).sort_values().tail(5)
316 |     feature_importances.plot(kind="barh", figsize=(6,6))
317 |     return
318 |
319 | random_forest_score(train_new, 'click', 0.3, n_estimators, max_depth)
320 | ```
321 | ```
322 | X = train_new.drop('click', axis=1) # predictors
323 | y = train_new['click']
324 |
325 | def cv_score(X, y, cv, n_estimators, max_depth):
326 |     rf = RandomForestClassifier(n_estimators=n_estimators,
327 |                                 max_depth=max_depth)
328 |     s = cross_val_score(rf, X, y, cv=cv, n_jobs=-1)
329 |     return("{} Score is :{:0.3} ± {:0.3}".format("Random Forest", s.mean().round(3), s.std().round(3)))
330 |
331 | dict_best = {'max_depth': 14, 'n_estimators': 80}
332 | n_estimators_best = dict_best['n_estimators']
333 | max_depth_best = dict_best['max_depth']
334 | cv_score(X, y, 5, n_estimators_best, max_depth_best)
335 |
336 | n_estimators = list(range(20,120,10))
337 | max_depth = list(range(2, 16, 2)) + [None]
338 |
339 | def random_forest_score_probas(df, target_col, test_size, n_estimators, max_depth):
340 |
341 |     X_train = df.drop(target_col, axis=1) # predictors
342 |     y_train = df[target_col] # target
343 |     X_test = test_new
344 |
345 |     rf_params = {
346 |         'n_estimators': n_estimators,
347 |         'max_depth': max_depth} # parameters for grid search
348 |     rf_gs = GridSearchCV(RandomForestClassifier(), rf_params, cv=5, n_jobs=-1)
349 |     rf_gs.fit(X_train, y_train) # training the random forest with all possible parameters
350 |     max_depth_best = rf_gs.best_params_['max_depth'] # getting the best max_depth
351 |     n_estimators_best = rf_gs.best_params_['n_estimators'] # getting the best n_estimators
352 |     best_rf_gs = RandomForestClassifier(max_depth=max_depth_best, n_estimators=n_estimators_best) # instantiate the best model
353 |     best_rf_gs.fit(X_train, y_train) # fitting the best model
354 |     preds = best_rf_gs.predict(X_test)
355 |     prob_list = [prob[0] for prob in best_rf_gs.predict_proba(X_test).tolist()]  # prob[0] = P(no click); prob[1] would give P(click)
356 |     df_prob = pd.DataFrame(np.array(prob_list).reshape(-1, 1))  # one row per test impression
357 |     df_prob.columns = ['probabilities']
358 |     df_prob.to_csv('probs.csv')
359 |     return df_prob
360 |
361 | random_forest_score_probas(train_new, 'click', 0.3, n_estimators, max_depth).head()
362 |
363 | def random_forest_score_preds(df, target_col, test_size, n_estimators, max_depth):
364 |
365 |     X_train = df.drop(target_col, axis=1) # predictors
366 |     y_train = df[target_col] # target
367 |     X_test = test_new
368 |
369 |     rf_params = {
370 |         'n_estimators': n_estimators,
371 |         'max_depth': max_depth} # parameters for grid search
372 |     rf_gs = GridSearchCV(RandomForestClassifier(), rf_params, cv=5, verbose=1, n_jobs=-1)
373 |     rf_gs.fit(X_train, y_train) # training the random forest with all possible parameters
374 |     max_depth_best = rf_gs.best_params_['max_depth'] # getting the best max_depth
375 |     n_estimators_best = rf_gs.best_params_['n_estimators'] # getting the best n_estimators
376 |     best_rf_gs = RandomForestClassifier(max_depth=max_depth_best, n_estimators=n_estimators_best) # instantiate the best model
377 |     best_rf_gs.fit(X_train, y_train) # fitting the best model
378 |     preds = best_rf_gs.predict(X_test)
379 |     df_pred = pd.DataFrame(np.array(preds).reshape(-1, 1))  # one row per test impression
380 |     df_pred.columns = ['predictions']
381 |     df_pred.to_csv('preds.csv')
382 |     return df_pred
383 |
384 | random_forest_score_preds(train_new, 'click', 0.3, n_estimators, max_depth)
385 | ```
386 |
--------------------------------------------------------------------------------
/click-prediction/images/123:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/click-prediction/images/click1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/click-prediction/images/click1.png
--------------------------------------------------------------------------------
/click-prediction/optimal-bidding-strategies-in-online-display-advertising .pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/click-prediction/optimal-bidding-strategies-in-online-display-advertising .pdf
--------------------------------------------------------------------------------
/predicting-number-of-comments-on-reddit-using-random-forest-classifier/123.png:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/predicting-number-of-comments-on-reddit-using-random-forest-classifier/README.md:
--------------------------------------------------------------------------------
1 | ## Predicting Comments on Reddit [[view code]](http://nbviewer.jupyter.org/github/marcotav/machine-learning-classification-projects/blob/master/predicting-number-of-comments-on-reddit-using-random-forest-classifier/notebooks/project-3-marco-tavora.ipynb)
2 |       
3 |
4 | **The code is available [here](http://nbviewer.jupyter.org/github/marcotav/machine-learning-classification-projects/blob/master/predicting-number-of-comments-on-reddit-using-random-forest-classifier/notebooks/project-3-marco-tavora.ipynb) or by clicking on the [view code] link above.**
5 |
6 |
7 |
8 |
9 |
10 |
12 |
13 |
14 |
15 |
16 | Problem Statement •
17 | Steps •
18 | Bird's-eye view of webscraping •
19 | Writing functions to extract data from Reddit •
20 | Quick review of NLP techniques •
21 | Preprocessing the text •
22 | Models
23 |
24 |
25 |
26 | ## Problem Statement
27 |
28 | Determine which characteristics of a post on Reddit contribute most to the overall interaction as measured by number of comments.
29 |
30 |
31 | ## Steps
32 |
33 | This project had three steps:
34 | - Collecting data by scraping a website using the Python package `requests` together with the library `BeautifulSoup`, which efficiently extracts information from HTML. We scraped the 'hot' threads as listed on the [Reddit homepage](https://www.reddit.com/) (see figure below) and acquired the following pieces of information about each thread:
35 |
36 | - The title of the thread
37 | - The subreddit that the thread corresponds to
38 | - The length of time it has been up on Reddit
39 | - The number of comments on the thread
40 |
41 |
42 |
43 |
44 |
46 |
47 |
48 |
49 | - Using Natural Language Processing (NLP) techniques to preprocess the data. NLP, in a nutshell, is "how to transform text data and convert it to features that enable us to build models." NLP techniques include:
50 |
51 | - Tokenization: essentially splitting text into pieces based on given patterns
52 | - Removing stopwords
53 | - Lemmatization: returns the word's *lemma* (its base/dictionary form)
54 | - Stemming: returns the base form of the word (it is usually cruder than lemmatization); see the short `nltk` sketch after this list.
55 |
56 | - After the step above we obtain *numerical* features which allow for algebraic computations. We then build a `RandomForestClassifier` and use it to classify each post according to the number of comments associated with it. More concretely, the model predicts whether a given Reddit post will receive an above- or below-_median_ number of comments.
57 |
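A minimal `nltk` sketch of the preprocessing steps above (corpora downloads assumed; the example words are hypothetical):

```
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

# one-time: nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')
tokens = word_tokenize("The runners were running quickly")                    # tokenization
tokens = [t for t in tokens if t.lower() not in stopwords.words('english')]  # stopword removal
print(WordNetLemmatizer().lemmatize('running', pos='v'))   # 'run' (lemma)
print(PorterStemmer().stem('quickly'))                     # 'quickli' (cruder stem)
```
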
58 |
59 | ### Bird's-eye view of webscraping
60 |
61 | The general strategy is:
62 | - Use the `requests` Python packages to make a `.get` request (the object `res` is a `Response` object):
63 | ```
64 | res = requests.get(URL,headers={"user-agent":'mt'})
65 | ```
66 | - Create a BeautifulSoup object from the HTML
67 | ```
68 | soup = BeautifulSoup(res.content,"lxml")
69 | ```
70 | - Inspect the page structure (the notebook calls `.extract`, but note that `.extract()` actually *detaches* a tag from the tree; `soup.prettify()` is the standard way to view the parsed HTML):
71 | ```
72 | print(soup.prettify())
73 | ```
74 |
75 | ### Writing functions to extract data from Reddit
76 | Here I write down the functions that will extract the information needed. The structure of the functions depends on the HTML code of the page. The page has the following structure:
77 | - The thread title is within an `<a>` tag with the attribute `data-event-action="title"`.
78 | - The time since the thread was created is within a `<time>` tag with the attribute `class="live-timestamp"`.
79 | - The subreddit is within an `<a>` tag with the attribute `class="subreddit hover may-blank"`.
80 | - The number of comments is within an `<a>` tag with the attribute `data-event-action="comments"`.
81 |
82 | The functions are:
83 | ```
84 | def extract_title_from_result(result):   # thread titles on the page
85 | titles = []
86 | title = result.find_all('a', {'data-event-action':'title'})
87 | for i in title:
88 | titles.append(i.text)
89 | return titles
90 |
91 | def extract_time_from_result(result):   # "time since posted" strings
92 | times = []
93 | time = result.find_all('time', {'class':'live-timestamp'})
94 | for i in time:
95 | times.append(i.text)
96 | return times
97 |
98 | def extract_subreddit_from_result(result):   # subreddit names
99 | subreddits = []
100 | subreddit = result.find_all('a', {'class':'subreddit hover may-blank'})
101 | for i in subreddit:
102 | subreddits.append(i.string)
103 | return subreddits
104 |
105 | def extract_num_from_result(result):   # comment-count strings, e.g. '42 comments'
106 | nums_lst = []
107 | nums = result.find_all('a', {'data-event-action': 'comments'})
108 | for i in nums:
109 | nums_lst.append(i.string)
110 | return nums_lst
111 | ```
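
As a quick sanity check (a sketch of mine, not from the original notebook), the helpers can be exercised on a hand-written HTML fragment that mimics the attributes listed above:
```
# minimal smoke test of the four extractors
from bs4 import BeautifulSoup

html = """
<a data-event-action="title">Cool thread title</a>
<time class="live-timestamp">3 hours ago</time>
<a class="subreddit hover may-blank">r/python</a>
<a data-event-action="comments">42 comments</a>
"""
soup = BeautifulSoup(html, "lxml")
print(extract_title_from_result(soup))      # ['Cool thread title']
print(extract_time_from_result(soup))       # ['3 hours ago']
print(extract_subreddit_from_result(soup))  # ['r/python']
print(extract_num_from_result(soup))        # ['42 comments']
```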
112 | I then write a function that loops over successive pages, following the 'next' button link at the bottom of each page and storing the visited URLs:
113 | ```
114 | def get_urls(n=25):
115 | j=0 # counting loops
116 | titles = []
117 | times = []
118 | subreddits = []
119 | nums = []
120 | URLS = []
121 | URL = "http://www.reddit.com"
122 |
123 | for _ in range(n):
124 |
125 | res = requests.get(URL, headers={"user-agent":'mt'})
126 | soup = BeautifulSoup(res.content,"lxml")
127 |
128 | titles.extend(extract_title_from_result(soup))
129 | times.extend(extract_time_from_result(soup))
130 | subreddits.extend(extract_subreddit_from_result(soup))
131 | nums.extend(extract_num_from_result(soup))
132 |
133 | URL = soup.find('span',{'class':'next-button'}).find('a')['href']
134 | URLS.append(URL)
135 |         j+=1
136 |         print(j)          # simple progress indicator
137 |         time.sleep(3)     # be polite to the server; requires the standard-library time module
138 |
139 | return titles, times, subreddits, nums, URLS
140 | ```
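
A usage sketch (network access required; the selectors target Reddit's old page layout, so they may no longer match the live site):
```
# assemble the scraped lists into a DataFrame; the column names here are my choice
import pandas as pd

titles, times, subreddits, nums, URLS = get_urls(n=2)
# if the lists come back with unequal lengths, truncate them to the shortest first
df = pd.DataFrame({'titles': titles, 'times': times,
                   'subreddits': subreddits, 'nums': nums})
```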
141 |
142 | I then build a pandas `DataFrame` (along the lines sketched above), perform some exploratory data analysis, and create:
143 | - A binary column that flags whether the number of comments is at or above the median
144 | - A set of dummy columns for the subreddits
145 | - A concatenation of both
146 |
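Note that the scraped `nums` values are strings such as `'42 comments'`, so before the median comparison below a numeric conversion is needed; a minimal sketch (not in the original notebook):
```
# parse the leading integer out of strings like '42 comments';
# threads without a comment count fall back to 0
df['nums'] = (df['nums'].str.extract(r'(\d+)', expand=False)
                        .fillna(0).astype(int))
```
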
147 | ```
148 | df['binary'] = df['nums'].apply(lambda x: 1 if x >= np.median(df['nums']) else 0)
149 | # dummies created and dataframes concatenated
150 | df_subred = pd.concat([df['binary'],pd.get_dummies(df['subreddits'], drop_first = True)], axis = 1)
151 | ```
152 |
153 | ### Quick review of NLP techniques
154 | Before applying NLP to our problem, I will provide a quick review of the basic procedures using `Python`. We use the package `nltk` (Natural Language Toolkit) to perform the actions above. The general procedure is the following. We first import `nltk` and the necessary classes for lemmatization and stemming
155 | ```
156 | import nltk
157 | from nltk.stem import WordNetLemmatizer
158 | from nltk.stem.porter import PorterStemmer
159 | ```
160 | We then create objects of the classes `PorterStemmer` and `WordNetLemmatizer`:
161 | ```
162 | stemmer = PorterStemmer()
163 | lemmatizer = WordNetLemmatizer()
164 | ```
165 | To use lemmatization and/or stemming on a given string `text` we must first tokenize it. To do that, we use `RegexpTokenizer` (imported from `nltk.tokenize`), where the argument below is a regular expression.
166 | ```
167 | tokenizer = RegexpTokenizer(r'\w+')
168 | tokens = tokenizer.tokenize(text)
169 | tokens_lemma = [lemmatizer.lemmatize(i) for i in tokens]
170 | stem_text = [stemmer.stem(i) for i in tokens]
171 | ```
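
To see the difference between the two normalizations, a quick illustration continuing from the snippets above (it assumes the WordNet data has been fetched once with `nltk.download('wordnet')`):
```
# compare lemmas and stems on a small sentence
tokens = tokenizer.tokenize("the geese were running across the fields")
print([lemmatizer.lemmatize(t) for t in tokens])  # dictionary forms, e.g. 'geese' -> 'goose'
print([stemmer.stem(t) for t in tokens])          # cruder truncations, e.g. 'running' -> 'run'
```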
172 |
173 | ### Preprocessing the text
174 | To preprocess the text, before creating numerical features from it, I used the following `cleaner` function (it relies on the standard-library `string` module and on `stopwords` from `nltk.corpus`):
175 | ```
176 | def cleaner(text):
177 | stemmer = PorterStemmer()
178 | stop = stopwords.words('english')
179 | text = text.translate(str.maketrans('', '', string.punctuation))
180 | text = text.translate(str.maketrans('', '', string.digits))
181 | text = text.lower().strip()
182 | final_text = []
183 | for w in text.split():
184 | if w not in stop:
185 | final_text.append(stemmer.stem(w.strip()))
186 | return ' '.join(final_text)
187 | ```
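
For example (a quick check of mine, assuming `nltk.download('stopwords')` has been run once):
```
print(cleaner("Check out these 100 Cute Cats!!!"))  # -> 'check cute cat'
```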
188 | I then use `CountVectorizer` to create features based on the words in the thread titles. `CountVectorizer` is scikit-learn's bag-of-words tool. I then combine these new features with the subreddit features into a single table `df_all` and build a model.
189 |
190 | ```
191 | cvt = CountVectorizer(min_df=min_df, preprocessor=cleaner)  # min_df is set elsewhere in the notebook
192 | # fit the vocabulary and build the sparse document-term matrix in one step
193 | X_title = cvt.fit_transform(df["titles"])
195 | X_thread = pd.DataFrame(X_title.todense(),
196 | columns=cvt.get_feature_names())
197 | df_all = pd.concat([df_subred,X_thread],axis=1)
198 | ```
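
On a toy corpus the effect of the `cleaner` preprocessor is easy to see (a sketch; note that recent scikit-learn versions rename `get_feature_names` to `get_feature_names_out`):
```
toy = ["Cats are great", "10 great dogs!", "dogs and cats"]
cvt_toy = CountVectorizer(min_df=1, preprocessor=cleaner)
X_toy = cvt_toy.fit_transform(toy)
print(cvt_toy.get_feature_names())  # stemmed vocabulary: ['cat', 'dog', 'great']
print(X_toy.todense())              # per-title word counts
```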
199 |
200 |
201 |
202 |
203 |
204 | ### Models
205 | Finally, with the data properly treated, we use the following function to fit the training data using a `RandomForestClassifier` with hyperparameters optimized via `GridSearchCV`. The ranges of hyperparameters searched are:
206 | ```
207 | n_estimators = list(range(20,220,10))
208 | max_depth = list(range(2, 22, 2)) + [None]
209 | ```
210 |
211 | The function shown below does the following:
212 | - Defines target and predictors
213 | - Performs a train-test split of the data
214 | - Uses `GridSearchCV`, which performs an "exhaustive search over specified parameter values for an estimator" (see the docs). It searches the hyperparameter space for the highest cross-validation score. It has several important arguments, namely:
215 |
216 | | Argument | Description |
217 | | --- | ---|
218 | | **`estimator`** | Sklearn instance of the model to fit on |
219 | | **`param_grid`** | A dictionary where keys are hyperparameters and values are lists of values to test |
220 | | **`cv`** | Number of internal cross-validation folds to run for each set of hyperparameters |
221 |
222 | - After fitting, `GridSearchCV` provides information such as:
223 |
224 | | Property | Use |
225 | | --- | ---|
226 | | **`results.param_grid`** | Parameters searched over. |
227 | | **`results.best_score_`** | Best mean cross-validated score.|
228 | | **`results.best_estimator_`** | Reference to model with best score. |
229 | | **`results.best_params_`** | Parameters found to perform with the best score. |
230 | | **`results.grid_scores_`** | Display score attributes with corresponding parameters. |
231 |
232 | - The estimator chosen here was a `RandomForestClassifier`. The latter fits a set of decision tree classifiers on sub-samples of the data and averages their predictions to improve accuracy and control over-fitting.
233 | - Fits models on the training data for every parameter combination in the grid `rf_params` and finds the best model, i.e., the one with the best mean cross-validated score.
234 | - Instantiates the best model and fits it
235 | - Scores the model and makes predictions
236 | - Determines the most relevant features and prints out a bar plot showing them.
237 |
238 | ```
239 | def rfscore(df,target_col,test_size,n_estimators,max_depth):
240 |
241 | X = df.drop(target_col, axis=1) # predictors
242 | y = df[target_col] # target
243 |
244 | # train-test split
245 | X_train, X_test, y_train, y_test = train_test_split(X,
246 | y, test_size = test_size, random_state=42)
247 | # definition of a grid of parameter values
248 | rf_params = {
249 | 'n_estimators':n_estimators,
250 | 'max_depth':max_depth} # parameters for grid search
251 |
252 | # Instantiation
253 | rf_gs = GridSearchCV(RandomForestClassifier(), rf_params, cv=5, verbose=1, n_jobs=-1)
254 |
255 | # fitting using training data with all possible parameters
256 | rf_gs.fit(X_train,y_train)
257 |
258 | # Parameters that have been found to perform with the best score
259 | max_depth_best = rf_gs.best_params_['max_depth']
260 | n_estimators_best = rf_gs.best_params_['n_estimators']
261 |
262 | # Best model
263 | best_rf_gs = RandomForestClassifier(max_depth=max_depth_best,n_estimators=n_estimators_best)
264 |
265 | # fitting best model using training data with all possible parameters
266 | best_rf_gs.fit(X_train,y_train)
267 |
268 | # scoring
269 | best_rf_score = best_rf_gs.score(X_test,y_test)
270 |
271 | # predictions
272 | preds = best_rf_gs.predict(X_test)
273 |
274 |     # finds the most important features and plots a bar chart
275 |     feature_importances = pd.Series(best_rf_gs.feature_importances_, index=X.columns).sort_values().tail(5)
276 |     feature_importances.plot(kind="barh", figsize=(6,6))
277 |     return best_rf_score, preds
278 | ```
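
A call along these lines (using the parameter ranges defined earlier) runs the whole pipeline; the returned score and predictions reflect the `return` statement added above:
```
# hypothetical invocation on the combined feature table
score, preds = rfscore(df_all, 'binary', test_size=0.3,
                       n_estimators=n_estimators, max_depth=max_depth)
print("Test accuracy:", score)
```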
279 | The function below performs cross-validation to obtain the accuracy score of the model with the best parameters found by the grid search:
280 |
281 | ```
282 | def cv_score(X,y,cv,n_estimators,max_depth):
283 |     rf = RandomForestClassifier(n_estimators=n_estimators,
284 |                                 max_depth=max_depth)
285 |     s = cross_val_score(rf, X, y, cv=cv, n_jobs=-1)
286 |     return "{} score: {:0.3f} ± {:0.3f}".format("Random Forest", s.mean(), s.std())
287 | ```
288 | The most important features according to the `RandomForestClassifier` are shown in the graph below:
289 |
290 |
291 |
292 |
293 |
--------------------------------------------------------------------------------
/predicting-number-of-comments-on-reddit-using-random-forest-classifier/images/123.png:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/predicting-number-of-comments-on-reddit-using-random-forest-classifier/images/Reddit-logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/predicting-number-of-comments-on-reddit-using-random-forest-classifier/images/Reddit-logo.png
--------------------------------------------------------------------------------
/predicting-number-of-comments-on-reddit-using-random-forest-classifier/images/redditRF.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/predicting-number-of-comments-on-reddit-using-random-forest-classifier/images/redditRF.png
--------------------------------------------------------------------------------
/predicting-number-of-comments-on-reddit-using-random-forest-classifier/images/redditpage.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/predicting-number-of-comments-on-reddit-using-random-forest-classifier/images/redditpage.png
--------------------------------------------------------------------------------
/predicting-number-of-comments-on-reddit-using-random-forest-classifier/images/redditwordshist.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/predicting-number-of-comments-on-reddit-using-random-forest-classifier/images/redditwordshist.png
--------------------------------------------------------------------------------
/predicting-number-of-comments-on-reddit-using-random-forest-classifier/notebooks/123.png:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/retail-strategy/README.md:
--------------------------------------------------------------------------------
1 | ## Retail Expansion Analysis with Lasso and Ridge Regressions [[view code]](http://nbviewer.jupyter.org/github/marcotav/machine-learning-regression-models/blob/master/retail/notebooks/retail-recommendations.ipynb)
2 |       [](https://opensource.org/licenses/MIT)
3 |
4 | **The code is available [here](http://nbviewer.jupyter.org/github/marcotav/machine-learning-regression-models/blob/master/retail/notebooks/retail-recommendations.ipynb) or by clicking on the [view code] link above.**
5 |
6 |
7 |
8 |
9 |
10 |
11 |
12 |
13 |
14 |
15 |
16 |
17 |
18 | Summary •
19 | Preamble •
20 | Getting data •
21 | Data Munging and EDA •
22 | Mining the data •
23 | Building the models •
24 | Plotting results •
25 | Conclusions and recommendations
26 |
27 |
28 |
29 | ## Summary
30 | Based on a dataset containing the spirits purchase information of Iowa Class E liquor licensees by product and date of purchase (link), this project provides recommendations on where to open new stores in the state of Iowa. I first conducted a thorough exploratory data analysis and then built several multivariate regression models of total sales by county, using both Lasso and Ridge regularization; based on these models, I made recommendations about new locations.
31 |
32 |
33 | ## Preamble
34 |
35 | Expansion plans traditionally use subsets of the following mix of data:
36 |
37 | #### Demographics
38 |
39 | I focused on the following quantities:
40 | - The ratio between sales and volume for each county, i.e., the number of dollars per liter sold. If this ratio is high in a given county, the stores in that county are, on average, high-end stores.
41 | - Another critical ratio is the number of stores per area. The meaning of a high value of this ratio is not so straightforward since it may indicate either that the market is saturated, or that the county is a strong market for this type of product and would welcome a new store (an example would be a county close to some major university). In contrast, a low value may indicate a market with untapped potential or a market with a population which is not a target of this type of store.
42 | - Another important ratio is consumption/person, i.e., the consumption *per capita*. Knowing the profile of the population in the county (whether they are "light" or "heavy" drinkers) would undoubtedly help the owner decide whether or not to open a new storefront there.
43 |
44 | #### Nearby businesses
45 |
46 | Competition is a critical component, and can be indirectly measured by the ratio of the number of stores to the population.
47 |
48 | #### Aggregated human flow/foot traffic
49 |
50 | For this information to be useful, we would need more granular data such as apps check-ins. Population and population density will be used as proxies.
51 |
52 |
53 | ## Getting data
54 |
55 | Three datasets were used, namely:
56 | - A dataset containing the spirits purchase information of Iowa Class “E” liquor licensees by product and date of purchase.
57 | - A dataset with information about population per county
58 | - A database containing information about incomes
59 |
60 |
61 | ## Data Munging and EDA
62 |
63 | Data munging included:
64 | - Checking the time span of the data and dropping the 2016 data (which covered only three months)
65 | - Stripping symbols (such as dollar signs) from the data and converting the resulting object columns to floats
66 | - Dropping `NaN` values
67 | - Converting store numbers to strings
68 | - Examining the data, we find that the maximum values in all columns were many standard deviations above the mean, indicating the presence of outliers. Keeping outliers in the analysis would inflate the predicted sales. Also, since the goal is to predict the *most likely performance* for each store, keeping exceptionally well-performing stores would be detrimental (one possible filtering rule is sketched below).
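A minimal sketch of such a rule (the percentile threshold is an assumption of mine, not the notebook's exact criterion):
```
# drop stores whose total sales fall above the 99th percentile
cap = df['sale_dollars'].quantile(0.99)
df = df[df['sale_dollars'] <= cap]
```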
70 |
71 | To strip dollar signs, for example, I used:
72 | ```
73 | for col in cols_with_dollar:
74 | df[col] = df[col].apply(lambda x: x.strip('$')).astype('float')
75 | ```
76 | To plot histograms I found it convenient to write a simple function:
77 | ```
78 | def draw_histograms(df,col,bins):
79 |     df[col].hist(bins=bins)
80 |     plt.title(col)
81 |     plt.xlabel(col)
82 |     plt.xticks(rotation=90)
83 |     plt.show()
84 | ```
85 |
86 | ## Mining the data
87 |
88 | Some of the steps for mining the data included: computing the total sales per county, creating a profit column, calculating profit per store and sales per volume, dropping outliers, and calculating both the stores-per-person and the alcohol-consumption-per-person ratios.
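
A hedged sketch of this county-level aggregation (column names other than `sale_dollars` are assumptions about the notebook's schema):
```
# aggregate store-level records to county level
county = df.groupby('county').agg(
    sale_dollars=('sale_dollars', 'sum'),
    volume_sold_liters=('volume_sold_liters', 'sum'),
    num_stores=('store_number', 'nunique'))
county['sales_per_liter'] = county['sale_dollars'] / county['volume_sold_liters']
# join the (hypothetical) per-county population table from the second dataset
county = county.join(pop_per_county)
county['consumption_per_capita'] = county['volume_sold_liters'] / county['population']
county['store_population_ratio'] = county['num_stores'] / county['population']
```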
89 |
90 | I then looked for any statistical relationships, correlations, or other relevant properties of the dataset.
91 |
92 | #### Steps:
93 | - First I needed to choose the proper predictors. I looked for strong correlations between variables to avoid problems with multicollinearity.
94 | - Also, variables that changed very little had little impact and they were therefore not included as predictors.
95 | - I then studied correlations between predictors.
96 | - I saw from the correlation matrices that `num_stores` and `stores_per_area` are highly correlated. Furthermore, both are highly correlated with the target variable `sale_dollars`. The same holds for `store_population_ratio` and `consumption_per_capita`.
97 |
98 | A heatmap of correlations using `Seaborn` follows:
99 |
100 |
101 |
102 |
103 |
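A heatmap along these lines can be reproduced with (a sketch; `cols_to_keep` is the list of retained columns used below):
```
import seaborn as sns
import matplotlib.pyplot as plt

# correlation heatmap of the retained columns
sns.heatmap(df[cols_to_keep].corr(), annot=True, cmap='coolwarm')
plt.show()
```
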
104 | To generate scatter plots for all the predictors (which provided information similar to the correlation matrices) we write:
105 | ```
106 | g = sns.pairplot(df[cols_to_keep])
107 | for ax in g.axes.flatten(): # from [6]
108 | for tick in ax.get_xticklabels():
109 |         tick.set(rotation=90)
110 | ```
111 |
112 |
113 |
114 |
115 |
116 |
117 | ## Building the models
118 |
119 | Using `scikit-learn` and `statsmodels`, I built the necessary models and evaluated their fit. For that I generated all combinations of the relevant features using the `itertools` module.
120 |
121 | Preparing training and test sets:
122 | ```
123 | # choose candidate features
124 | features = ['num_stores','population', 'store_population_ratio', \
125 |             'consumption_per_capita', 'stores_per_area', 'per_capita_income']
126 | # defining the predictors and the target
127 | X,y = df_final[features], df_final['sale_dollars']
128 | # train-test split
129 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3)
130 | ```
131 | I now generate all combinations of the features (with six candidates this gives 2^6 − 1 = 63 subsets):
132 |
133 | ```
134 | combs = []
135 | for num in range(1,len(features)+1):
136 |     combs.extend([list(c) for c in itertools.combinations(features, num)])  # each combination as a list of column names
137 | ```
138 |
139 | I then instantiated the models and tested them. The code below collects the `r2` score for every (model, feature-combination) pair and finds the best predictors using `itemgetter`:
140 | ```
141 | lr = linear_model.LinearRegression(normalize=True)
142 | ridge = linear_model.RidgeCV(cv=5)
143 | lasso = linear_model.LassoCV(cv=5)
144 | models = [lr,lasso,ridge]
145 | r2_comb_lst = []
146 | for comb in combs:
147 | for m in models:
148 | model = m.fit(X_train[comb],y_train)
149 | r2 = m.score(X_test[comb], y_test)
150 | r2_comb_lst.append([round(r2,3),comb,str(model).split('(')[0]])
151 |
152 | r2_comb_lst.sort(key=operator.itemgetter(0))  # sort by r2, not by the feature list
153 | ```
154 | The best predictors were obtained via:
155 | ```
156 | r2_comb_lst[-1][1]
157 | ```
158 | Dropping highly correlated predictors, I redefined `X` and `y` and built a Ridge model:
159 | ```
160 | X ,y = df_final[features], df_final['sale_dollars']
161 | ridge = linear_model.RidgeCV(cv=5)
162 | model = ridge.fit(X,y)
163 | ```
164 |
165 |
166 | ## Plotting results
167 |
168 | I then plotted the predictions versus the true value:
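A sketch of how such a plot can be produced (my reconstruction, not the notebook's exact code):
```
import matplotlib.pyplot as plt

y_pred = model.predict(X)
plt.scatter(y, y_pred, alpha=0.5)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--')  # perfect-prediction line
plt.xlabel('actual sale_dollars')
plt.ylabel('predicted sale_dollars')
plt.show()
```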
169 |
170 |
171 |
172 |
173 |
174 |
175 | ## Conclusions and recommendations
176 |
177 | The following recommendations were provided:
178 |
179 | - Linn has the highest sales, largely because it has the largest population, so this alone is not very informative.
180 | - Next, ordering counties by `sales_per_litters` shows which counties have more high-end stores (Johnson has the highest value).
181 | - We would recommend Johnson for a new store *if the goal of the owner is to build new high-end stores*.
182 | - If the plan is to open more stores but with cheaper products, Johnson is not the place to choose. The least saturated market is Decatur. But, as discussed before, this information alone does not yield a unique recommendation, and a more thorough analysis is needed.
183 | - The county with the weakest competition is Butler. This could indicate untapped potential. However, the absence of a reasonable number of stores may indicate, as observed before, that the county's population is simply not interested in this category of product. Again, further investigation must be carried out.
184 |
185 |
186 | I strongly recommend reading the notebook using [nbviewer](http://nbviewer.jupyter.org/github/marcotav/machine-learning-regression-models/blob/master/retail/notebooks/retail-recommendations.ipynb).
187 |
188 |
--------------------------------------------------------------------------------
/retail-strategy/data/123:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/retail-strategy/data/ia_zip_city_county_sqkm.csv:
--------------------------------------------------------------------------------
1 | ,Zip Code,City,County,State,County Number,Area (sqkm)
0,50001,ACKWORTH,Warren,IA,91,62.796656
1,50002,ADAIR,Guthrie,IA,39,279.202219
2,50003,ADEL,Dallas,IA,25,298.086291
3,50005,ALBION,Marshall,IA,64,69.623573
4,50006,ALDEN,Hardin,IA,42,317.74515
5,50007,ALLEMAN,Polk,IA,77,13.782897
6,50008,ALLERTON,Wayne,IA,93,220.623573
7,50009,ALTOONA,Polk,IA,77,65.207113
8,50010,AMES,Story,IA,85,155.294118
9,50011,AMES,Story,IA,85,0.125094
10,50012,AMES,Story,IA,85,1.982622
11,50012,AMES,Story,IA,85,1.982622
12,50014,AMES,Story,IA,85,144.826088
13,50020,ANITA,Cass,IA,15,249.128489
14,50021,ANKENY,Polk,IA,77,66.725924
15,50022,ATLANTIC,Cass,IA,15,431.883311
16,50023,ANKENY,Polk,IA,77,57.424136
17,50025,AUDUBON,Audubon,IA,5,507.431421
18,50026,BAGLEY,Guthrie,IA,39,142.869501
19,50027,BARNES CITY,Mahaska,IA,62,72.89173
20,50028,BAXTER,Jasper,IA,50,114.933651
21,50029,BAYARD,Guthrie,IA,39,105.03836
22,50032,BERWICK,Polk,IA,77,0.95539
23,50033,BEVINGTON,Warren,IA,91,0.288201
24,50034,BLAIRSBURG,Hamilton,IA,40,163.484203
25,50035,BONDURANT,Polk,IA,77,116.89815
26,50036,BOONE,Boone,IA,8,505.063491
27,50038,BOONEVILLE,Dallas,IA,25,8.874239
28,50039,BOUTON,Dallas,IA,25,60.662047
29,50041,BRADFORD,Franklin,IA,35,1.101427
30,50042,BRAYTON,Audubon,IA,5,84.16259
31,50044,BUSSEY,Marion,IA,63,118.473056
32,50046,CAMBRIDGE,Story,IA,85,119.973352
33,50047,CARLISLE,Warren,IA,91,152.159628
34,50048,CASEY,Guthrie,IA,39,226.57327
35,50049,CHARITON,Lucas,IA,59,523.124656
36,50050,CHURDAN,Greene,IA,37,197.706628
37,50051,CLEMONS,Marshall,IA,64,66.573089
38,50052,CLIO,Wayne,IA,93,50.94063
39,50054,COLFAX,Jasper,IA,50,152.872278
40,50055,COLLINS,Story,IA,85,126.340521
41,50056,COLO,Story,IA,85,149.192377
42,50057,COLUMBIA,Marion,IA,63,52.183538
43,50058,COON RAPIDS,Carroll,IA,14,364.231967
44,50060,CORYDON,Wayne,IA,93,453.610186
45,50061,CUMMING,Warren,IA,91,81.699043
46,50062,MELCHER-DALLAS,Marion,IA,63,80.104319
47,50063,DALLAS CENTER,Dallas,IA,25,170.532757
48,50064,DANA,Greene,IA,37,40.418909
49,50065,DAVIS CITY,Decatur,IA,27,147.902989
50,50066,DAWSON,Dallas,IA,25,64.732191
51,50067,DECATUR,Decatur,IA,27,87.093517
52,50068,DERBY,Lucas,IA,59,114.749633
53,50069,DE SOTO,Dallas,IA,25,13.262492
54,50070,DEXTER,Dallas,IA,25,171.070769
55,50071,DOWS,Wright,IA,99,293.114182
56,50072,EARLHAM,Madison,IA,61,228.908869
57,50073,ELKHART,Polk,IA,77,67.420142
58,50074,ELLSTON,Ringgold,IA,80,127.815922
59,50075,ELLSWORTH,Hamilton,IA,40,119.369391
60,50076,EXIRA,Audubon,IA,5,290.651382
61,50078,FERGUSON,Marshall,IA,64,0.660699
62,50101,GALT,Wright,IA,99,25.431226
63,50102,GARDEN CITY,Hardin,IA,42,1.304963
64,50103,GARDEN GROVE,Decatur,IA,27,157.409444
65,50104,GIBSON,Keokuk,IA,54,28.94056
66,50105,GILBERT,Story,IA,85,20.560447
67,50106,GILMAN,Marshall,IA,64,170.141114
68,50107,GRAND JUNCTION,Greene,IA,37,130.433186
69,50108,GRAND RIVER,Decatur,IA,27,188.608107
70,50109,GRANGER,Polk,IA,77,62.757235
71,50111,GRIMES,Polk,IA,77,71.484606
72,50112,GRINNELL,Poweshiek,IA,79,475.672153
73,50115,GUTHRIE CENTER,Guthrie,IA,39,441.185334
74,50116,HAMILTON,Marion,IA,63,44.071754
75,50117,HAMLIN,Audubon,IA,5,71.65718
76,50118,HARTFORD,Warren,IA,91,51.466255
77,50119,HARVEY,Marion,IA,63,40.354632
78,50120,HAVERHILL,Marshall,IA,64,43.94135
79,50122,HUBBARD,Hardin,IA,42,218.621364
80,50123,HUMESTON,Wayne,IA,93,204.584834
81,50124,HUXLEY,Story,IA,85,54.098688
82,50125,INDIANOLA,Warren,IA,91,426.521944
83,50126,IOWA FALLS,Hardin,IA,42,351.87483
84,50127,IRA,Jasper,IA,50,0.02826
85,50128,JAMAICA,Guthrie,IA,39,75.898812
86,50129,JEFFERSON,Greene,IA,37,435.65865
87,50130,JEWELL,Hamilton,IA,40,171.933052
88,50131,JOHNSTON,Polk,IA,77,64.666939
89,50132,KAMRAR,Hamilton,IA,40,74.609258
90,50133,KELLERTON,Ringgold,IA,80,196.966937
91,50134,KELLEY,Story,IA,85,48.427675
92,50135,KELLOGG,Jasper,IA,50,195.278764
93,50136,KESWICK,Keokuk,IA,54,100.632209
94,50138,KNOXVILLE,Marion,IA,63,461.827288
95,50139,LACONA,Warren,IA,91,229.508368
96,50140,LAMONI,Decatur,IA,27,241.678133
97,50141,LAUREL,Marshall,IA,64,93.777481
98,50142,LE GRAND,Marshall,IA,64,2.566675
99,50143,LEIGHTON,Mahaska,IA,62,92.958086
100,50144,LEON,Decatur,IA,27,362.605062
101,50146,LINDEN,Dallas,IA,25,75.845999
102,50147,LINEVILLE,Wayne,IA,93,187.491745
103,50148,LISCOMB,Marshall,IA,64,51.098115
104,50149,LORIMOR,Union,IA,88,213.449691
105,50150,LOVILIA,Monroe,IA,68,148.536678
106,50151,LUCAS,Lucas,IA,59,205.563843
107,50153,LYNNVILLE,Jasper,IA,50,80.372194
108,50154,MC CALLSBURG,Story,IA,85,53.144166
109,50155,MACKSBURG,Madison,IA,61,78.50132
110,50156,MADRID,Boone,IA,8,238.724015
111,50157,MALCOM,Poweshiek,IA,79,168.389373
112,50158,MARSHALLTOWN,Marshall,IA,64,548.346934
113,50160,MARTENSDALE,Warren,IA,91,0.965184
114,50161,MAXWELL,Story,IA,85,211.812219
115,50162,MELBOURNE,Marshall,IA,64,132.120698
116,50163,MELCHER-DALLAS,Marion,IA,63,1.230059
117,50164,MENLO,Guthrie,IA,39,138.23504
118,50165,MILLERTON,Wayne,IA,93,6.437634
119,50166,MILO,Warren,IA,91,162.522428
120,50167,MINBURN,Dallas,IA,25,102.610498
121,50168,MINGO,Jasper,IA,50,104.773461
122,50169,MITCHELLVILLE,Polk,IA,77,103.378565
123,50170,MONROE,Jasper,IA,50,230.99503
124,50171,MONTEZUMA,Poweshiek,IA,79,281.675684
125,50173,MONTOUR,Tama,IA,86,80.140704
126,50174,MURRAY,Clarke,IA,20,278.718495
127,50201,NEVADA,Story,IA,85,300.453642
128,50206,NEW PROVIDENCE,Hardin,IA,42,124.182004
129,50207,NEW SHARON,Mahaska,IA,62,366.588282
130,50208,NEWTON,Jasper,IA,50,426.046663
131,50210,NEW VIRGINIA,Warren,IA,91,196.767136
132,50211,NORWALK,Warren,IA,91,147.182178
133,50212,OGDEN,Boone,IA,8,352.230522
134,50213,OSCEOLA,Clarke,IA,20,543.975469
135,50214,OTLEY,Marion,IA,63,102.571148
136,50216,PANORA,Guthrie,IA,39,145.699881
137,50217,PATON,Greene,IA,37,178.606122
138,50218,PATTERSON,Madison,IA,61,0.542828
139,50219,PELLA,Marion,IA,63,317.144262
140,50220,PERRY,Dallas,IA,25,268.176779
141,50222,PERU,Madison,IA,61,108.369441
142,50223,PILOT MOUND,Boone,IA,8,76.600548
143,50225,PLEASANTVILLE,Marion,IA,63,217.246336
144,50226,POLK CITY,Polk,IA,77,109.873855
145,50227,POPEJOY,Franklin,IA,35,0.966375
146,50228,PRAIRIE CITY,Jasper,IA,50,180.367188
147,50229,PROLE,Warren,IA,91,105.616644
148,50230,RADCLIFFE,Hardin,IA,42,223.982113
149,50231,RANDALL,Hamilton,IA,40,1.065168
150,50232,REASNOR,Jasper,IA,50,86.762448
151,50233,REDFIELD,Dallas,IA,25,130.640688
152,50234,RHODES,Marshall,IA,64,81.236093
153,50235,RIPPEY,Greene,IA,37,121.134316
154,50236,ROLAND,Story,IA,85,89.345522
155,50237,RUNNELLS,Polk,IA,77,146.002809
156,50238,RUSSELL,Lucas,IA,59,308.904729
157,50239,SAINT ANTHONY,Marshall,IA,64,44.053312
158,50240,SAINT CHARLES,Madison,IA,61,197.714047
159,50242,SEARSBORO,Poweshiek,IA,79,106.954493
160,50243,SHELDAHL,Story,IA,85,1.425493
161,50244,SLATER,Story,IA,85,57.130248
162,50244,SLATER,Story,IA,85,57.130248
163,50246,STANHOPE,Hamilton,IA,40,120.153227
164,50247,STATE CENTER,Marshall,IA,64,215.968634
165,50248,STORY CITY,Story,IA,85,211.580755
166,50249,STRATFORD,Hamilton,IA,40,202.332923
167,50250,STUART,Adair,IA,1,265.060086
168,50251,SULLY,Jasper,IA,50,104.817095
169,50252,SWAN,Marion,IA,63,22.861403
170,50254,THAYER,Union,IA,88,113.65365
171,50255,THORNBURG,Keokuk,IA,54,0.456756
172,50256,TRACY,Marion,IA,63,70.812037
173,50257,TRURO,Madison,IA,61,103.296613
174,50258,UNION,Hardin,IA,42,139.476319
175,50261,VAN METER,Madison,IA,61,173.242731
176,50262,VAN WERT,Decatur,IA,27,83.830606
177,50263,WAUKEE,Dallas,IA,25,90.002855
178,50264,WELDON,Decatur,IA,27,174.299474
179,50265,WEST DES MOINES,Polk,IA,77,46.466559
180,50266,WEST DES MOINES,Dallas,IA,25,43.060835
181,50268,WHAT CHEER,Keokuk,IA,54,123.524623
182,50271,WILLIAMS,Hamilton,IA,40,172.819803
183,50272,WILLIAMSON,Lucas,IA,59,1.922108
184,50273,WINTERSET,Madison,IA,61,519.17142
185,50274,WIOTA,Cass,IA,15,130.159776
186,50275,WOODBURN,Clarke,IA,20,131.849713
187,50276,WOODWARD,Dallas,IA,25,209.84696
188,50277,YALE,Guthrie,IA,39,104.624278
189,50278,ZEARING,Story,IA,85,138.864198
190,50309,DES MOINES,Polk,IA,77,7.776473
191,50310,DES MOINES,Polk,IA,77,21.123546
192,50311,DES MOINES,Polk,IA,77,6.511832
193,50312,DES MOINES,Polk,IA,77,15.05106
194,50313,DES MOINES,Polk,IA,77,47.635293
195,50314,DES MOINES,Polk,IA,77,6.629721
196,50315,DES MOINES,Polk,IA,77,26.560331
197,50316,DES MOINES,Polk,IA,77,9.302481
198,50317,DES MOINES,Polk,IA,77,60.041842
199,50319,DES MOINES,Polk,IA,77,0.213707
200,50320,DES MOINES,Polk,IA,77,49.547031
201,50321,DES MOINES,Polk,IA,77,30.969186
202,50322,URBANDALE,Polk,IA,77,27.938267
203,50323,URBANDALE,Dallas,IA,25,19.984131
204,50324,WINDSOR HEIGHTS,Polk,IA,77,3.74028
205,50325,CLIVE,Polk,IA,77,20.224117
206,50327,PLEASANT HILL,Polk,IA,77,49.702622
207,50401,MASON CITY,Cerro Gordo,IA,17,387.509792
208,50420,ALEXANDER,Franklin,IA,35,117.256906
209,50421,BELMOND,Wright,IA,99,232.911303
210,50423,BRITT,Hancock,IA,41,376.364842
211,50424,BUFFALO CENTER,Winnebago,IA,95,315.854649
212,50426,CARPENTER,Mitchell,IA,66,0.060113
213,50428,CLEAR LAKE,Cerro Gordo,IA,17,316.380154
214,50430,CORWITH,Hancock,IA,41,160.984015
215,50431,COULTER,Franklin,IA,35,1.936776
216,50432,CRYSTAL LAKE,Hancock,IA,41,1.127714
217,50433,DOUGHERTY,Cerro Gordo,IA,17,125.889253
218,50434,FERTILE,Worth,IA,98,23.487419
219,50435,FLOYD,Floyd,IA,34,105.304339
220,50436,FOREST CITY,Winnebago,IA,95,354.034151
221,50438,GARNER,Hancock,IA,41,342.63578
222,50439,GOODELL,Hancock,IA,41,84.706319
223,50440,GRAFTON,Worth,IA,98,80.665798
224,50441,HAMPTON,Franklin,IA,35,395.662654
225,50444,HANLONTOWN,Worth,IA,98,55.966912
226,50446,JOICE,Worth,IA,98,108.86978
227,50447,KANAWHA,Hancock,IA,41,261.433926
228,50448,KENSETT,Worth,IA,98,160.165797
229,50449,KLEMME,Hancock,IA,41,95.013046
230,50450,LAKE MILLS,Winnebago,IA,95,214.081289
231,50451,LAKOTA,Kossuth,IA,55,164.530815
232,50452,LATIMER,Franklin,IA,35,127.424687
233,50453,LELAND,Winnebago,IA,95,114.347611
234,50454,LITTLE CEDAR,Mitchell,IA,66,44.76636
235,50455,MC INTIRE,Mitchell,IA,66,84.207513
236,50456,MANLY,Worth,IA,98,120.430473
237,50457,MESERVEY,Cerro Gordo,IA,17,88.503707
238,50458,NORA SPRINGS,Floyd,IA,34,190.798489
239,50459,NORTHWOOD,Worth,IA,98,375.483499
240,50460,ORCHARD,Mitchell,IA,66,93.166151
241,50461,OSAGE,Mitchell,IA,66,446.474922
242,50464,PLYMOUTH,Cerro Gordo,IA,17,61.009266
243,50465,RAKE,Winnebago,IA,95,9.497218
244,50466,RICEVILLE,Howard,IA,45,354.341172
245,50467,ROCK FALLS,Cerro Gordo,IA,17,0.709984
246,50468,ROCKFORD,Floyd,IA,34,256.209489
247,50469,ROCKWELL,Cerro Gordo,IA,17,227.879604
248,50470,ROWAN,Wright,IA,99,62.885229
249,50471,RUDD,Floyd,IA,34,110.089036
250,50472,SAINT ANSGAR,Mitchell,IA,66,316.41701
251,50473,SCARVILLE,Winnebago,IA,95,96.000513
252,50475,SHEFFIELD,Franklin,IA,35,231.102519
253,50476,STACYVILLE,Mitchell,IA,66,101.702581
254,50477,SWALEDALE,Cerro Gordo,IA,17,58.455862
255,50478,THOMPSON,Winnebago,IA,95,191.575269
256,50479,THORNTON,Cerro Gordo,IA,17,150.266394
257,50480,TITONKA,Kossuth,IA,55,163.023328
258,50482,VENTURA,Cerro Gordo,IA,17,81.166123
259,50483,WESLEY,Kossuth,IA,55,198.210428
260,50484,WODEN,Hancock,IA,41,113.720772
261,50501,FORT DODGE,Webster,IA,94,407.21578
262,50510,ALBERT CITY,Buena Vista,IA,11,219.73212
263,50511,ALGONA,Kossuth,IA,55,321.723723
264,50514,ARMSTRONG,Emmet,IA,32,287.766282
265,50515,AYRSHIRE,Palo Alto,IA,74,96.076793
266,50516,BADGER,Webster,IA,94,63.984856
267,50517,BANCROFT,Kossuth,IA,55,198.110964
268,50518,BARNUM,Webster,IA,94,80.803741
269,50519,BODE,Humboldt,IA,46,142.395653
270,50520,BRADGATE,Humboldt,IA,46,62.072927
271,50521,BURNSIDE,Webster,IA,94,3.160848
272,50522,BURT,Kossuth,IA,55,176.225395
273,50523,CALLENDER,Webster,IA,94,109.190936
274,50524,CLARE,Webster,IA,94,148.535431
275,50525,CLARION,Wright,IA,99,363.773781
276,50527,CURLEW,Palo Alto,IA,74,141.597255
277,50528,CYLINDER,Palo Alto,IA,74,191.696767
278,50529,DAKOTA CITY,Humboldt,IA,46,1.445412
279,50530,DAYTON,Webster,IA,94,168.385319
280,50531,DOLLIVER,Emmet,IA,32,100.449294
281,50532,DUNCOMBE,Webster,IA,94,181.71793
282,50533,EAGLE GROVE,Wright,IA,99,258.50312
283,50535,EARLY,Sac,IA,81,158.492375
284,50536,EMMETSBURG,Palo Alto,IA,74,370.232494
285,50538,FARNHAMVILLE,Calhoun,IA,13,80.88901
286,50539,FENTON,Kossuth,IA,55,139.538906
287,50540,FONDA,Pocahontas,IA,76,275.625728
288,50541,GILMORE CITY,Humboldt,IA,46,237.749663
289,50542,GOLDFIELD,Wright,IA,99,172.884137
290,50543,GOWRIE,Webster,IA,94,212.824794
291,50544,HARCOURT,Webster,IA,94,75.465666
292,50545,HARDY,Humboldt,IA,46,97.252233
293,50546,HAVELOCK,Pocahontas,IA,76,137.019674
294,50548,HUMBOLDT,Humboldt,IA,46,323.465219
295,50551,JOLLEY,Calhoun,IA,13,69.704315
296,50554,LAURENS,Pocahontas,IA,76,232.762069
297,50556,LEDYARD,Kossuth,IA,55,101.116
298,50557,LEHIGH,Webster,IA,94,130.151481
299,50558,LIVERMORE,Humboldt,IA,46,114.721586
300,50559,LONE ROCK,Kossuth,IA,55,102.790935
301,50560,LU VERNE,Kossuth,IA,55,225.654488
302,50561,LYTTON,Calhoun,IA,13,119.358374
303,50562,MALLARD,Palo Alto,IA,74,165.098566
304,50563,MANSON,Calhoun,IA,13,253.318021
305,50565,MARATHON,Buena Vista,IA,11,114.004136
306,50566,MOORLAND,Webster,IA,94,89.865522
307,50567,NEMAHA,Sac,IA,81,67.326397
308,50568,NEWELL,Buena Vista,IA,11,220.071293
309,50569,OTHO,Webster,IA,94,54.889709
310,50570,OTTOSEN,Humboldt,IA,46,112.220748
311,50571,PALMER,Pocahontas,IA,76,115.641227
312,50573,PLOVER,Pocahontas,IA,76,1.045738
313,50574,POCAHONTAS,Pocahontas,IA,76,288.134473
314,50575,POMEROY,Calhoun,IA,13,163.4466
315,50576,REMBRANDT,Buena Vista,IA,11,93.051632
316,50577,RENWICK,Humboldt,IA,46,106.941323
317,50578,RINGSTED,Emmet,IA,32,192.331445
318,50579,ROCKWELL CITY,Calhoun,IA,13,359.119951
319,50581,ROLFE,Pocahontas,IA,76,246.722485
320,50582,RUTLAND,Humboldt,IA,46,53.572451
321,50583,SAC CITY,Sac,IA,81,306.359541
322,50585,SIOUX RAPIDS,Buena Vista,IA,11,165.291906
323,50586,SOMERS,Calhoun,IA,13,91.12132
324,50588,STORM LAKE,Buena Vista,IA,11,368.993698
325,50590,SWEA CITY,Kossuth,IA,55,203.980739
326,50591,THOR,Humboldt,IA,46,73.985552
327,50593,VARINA,Pocahontas,IA,76,0.480019
328,50594,VINCENT,Webster,IA,94,67.103128
329,50595,WEBSTER CITY,Hamilton,IA,40,399.609138
330,50597,WEST BEND,Palo Alto,IA,74,214.240511
331,50598,WHITTEMORE,Kossuth,IA,55,176.474137
332,50599,WOOLSTOCK,Wright,IA,99,133.057067
333,50601,ACKLEY,Franklin,IA,35,368.01212
334,50602,ALLISON,Butler,IA,12,207.455662
335,50603,ALTA VISTA,Chickasaw,IA,19,122.972014
336,50604,APLINGTON,Butler,IA,12,184.521061
337,50605,AREDALE,Butler,IA,12,38.865937
338,50606,ARLINGTON,Fayette,IA,33,184.315162
339,50607,AURORA,Buchanan,IA,10,123.088687
340,50609,BEAMAN,Grundy,IA,38,89.218598
341,50611,BRISTOW,Butler,IA,12,78.763743
342,50612,BUCKINGHAM,Tama,IA,86,57.581068
343,50613,CEDAR FALLS,Black Hawk,IA,7,329.972902
344,50616,CHARLES CITY,Floyd,IA,34,448.105088
345,50619,CLARKSVILLE,Butler,IA,12,230.86623
346,50620,COLWELL,Floyd,IA,34,0.324589
347,50621,CONRAD,Grundy,IA,38,165.399151
348,50622,DENVER,Bremer,IA,9,64.857976
349,50624,DIKE,Grundy,IA,38,135.187133
350,50625,DUMONT,Butler,IA,12,158.053593
351,50626,DUNKERTON,Black Hawk,IA,7,130.892804
352,50627,ELDORA,Hardin,IA,42,277.223505
353,50628,ELMA,Howard,IA,45,289.356789
354,50629,FAIRBANK,Buchanan,IA,10,205.328666
355,50630,FREDERICKSBURG,Chickasaw,IA,19,214.715992
356,50632,GARWIN,Tama,IA,86,110.25125
357,50632,GARWIN,Tama,IA,86,110.25125
358,50633,GENEVA,Franklin,IA,35,103.078532
359,50634,GILBERTVILLE,Black Hawk,IA,7,1.018252
360,50635,GLADBROOK,Tama,IA,86,217.592812
361,50636,GREENE,Butler,IA,12,313.519052
362,50638,GRUNDY CENTER,Grundy,IA,38,245.826479
363,50641,HAZLETON,Buchanan,IA,10,123.019787
364,50642,HOLLAND,Grundy,IA,38,83.404671
365,50643,HUDSON,Black Hawk,IA,7,163.108458
366,50644,INDEPENDENCE,Buchanan,IA,10,372.595201
367,50645,IONIA,Chickasaw,IA,19,214.246026
368,50647,JANESVILLE,Bremer,IA,9,79.06534
369,50648,JESUP,Black Hawk,IA,7,223.212718
370,50650,LAMONT,Buchanan,IA,10,106.756415
371,50651,LA PORTE CITY,Black Hawk,IA,7,294.627077
372,50652,LINCOLN,Tama,IA,86,0.605668
373,50653,MARBLE ROCK,Floyd,IA,34,134.748744
374,50654,MASONVILLE,Delaware,IA,28,136.18815
375,50655,MAYNARD,Fayette,IA,33,91.05168
376,50658,NASHUA,Chickasaw,IA,19,187.097603
377,50659,NEW HAMPTON,Chickasaw,IA,19,403.802766
378,50660,NEW HARTFORD,Butler,IA,12,100.165022
379,50662,OELWEIN,Fayette,IA,33,176.049517
380,50664,ORAN,Fayette,IA,33,0.09535
381,50665,PARKERSBURG,Butler,IA,12,253.290828
382,50666,PLAINFIELD,Bremer,IA,9,140.256035
383,50667,RAYMOND,Black Hawk,IA,7,5.217793
384,50668,READLYN,Bremer,IA,9,87.689572
385,50669,REINBECK,Grundy,IA,38,239.750053
386,50670,SHELL ROCK,Butler,IA,12,148.931701
387,50671,STANLEY,Buchanan,IA,10,57.533274
388,50672,STEAMBOAT ROCK,Hardin,IA,42,94.993283
389,50673,STOUT,Grundy,IA,38,0.444167
390,50674,SUMNER,Bremer,IA,9,408.690075
391,50675,TRAER,Tama,IA,86,287.237436
392,50676,TRIPOLI,Bremer,IA,9,148.867149
393,50677,WAVERLY,Bremer,IA,9,325.186841
394,50680,WELLSBURG,Grundy,IA,38,138.682394
395,50681,WESTGATE,Fayette,IA,33,63.049395
396,50682,WINTHROP,Buchanan,IA,10,220.98261
397,50701,WATERLOO,Black Hawk,IA,7,214.718743
398,50702,WATERLOO,Black Hawk,IA,7,25.60849
399,50703,WATERLOO,Black Hawk,IA,7,244.724015
400,50707,EVANSDALE,Black Hawk,IA,7,25.361881
401,50801,CRESTON,Union,IA,88,545.028688
402,50830,AFTON,Union,IA,88,306.617835
403,50833,BEDFORD,Taylor,IA,87,536.325319
404,50835,BENTON,Ringgold,IA,80,43.994784
405,50836,BLOCKTON,Taylor,IA,87,232.828727
406,50837,BRIDGEWATER,Adair,IA,1,130.795854
407,50839,CARBON,Adams,IA,2,1.828417
408,50840,CLEARFIELD,Taylor,IA,87,129.94877
409,50841,CORNING,Adams,IA,2,610.836196
410,50842,CROMWELL,Union,IA,88,0.674912
411,50843,CUMBERLAND,Cass,IA,15,195.866561
412,50845,DIAGONAL,Ringgold,IA,80,285.590236
413,50846,FONTANELLE,Adair,IA,1,238.867152
414,50847,GRANT,Montgomery,IA,69,0.863108
415,50848,GRAVITY,Taylor,IA,87,117.720194
416,50849,GREENFIELD,Adair,IA,1,304.431532
417,50851,LENOX,Taylor,IA,87,335.214398
418,50853,MASSENA,Cass,IA,15,195.987739
419,50854,MOUNT AYR,Ringgold,IA,80,350.941226
420,50857,NODAWAY,Adams,IA,2,131.265143
421,50858,ORIENT,Adair,IA,1,206.99525
422,50859,PRESCOTT,Adams,IA,2,206.490726
423,50860,REDDING,Ringgold,IA,80,115.136578
424,50861,SHANNON CITY,Union,IA,88,113.520838
425,50862,SHARPSBURG,Taylor,IA,87,56.218206
426,50863,TINGLEY,Ringgold,IA,80,78.178667
427,50864,VILLISCA,Montgomery,IA,69,377.642994
428,51001,AKRON,Plymouth,IA,75,360.862327
429,51002,ALTA,Buena Vista,IA,11,297.148464
430,51003,ALTON,Sioux,IA,84,144.371109
431,51004,ANTHON,Woodbury,IA,97,212.848541
432,51005,AURELIA,Cherokee,IA,18,244.17026
433,51006,BATTLE CREEK,Ida,IA,47,213.095547
434,51007,BRONSON,Woodbury,IA,97,87.318034
435,51008,BRUNSVILLE,Plymouth,IA,75,0.630646
436,51009,CALUMET,O'Brien,IA,71,0.611821
437,51010,CASTANA,Monona,IA,67,176.313125
438,51011,CHATSWORTH,Sioux,IA,84,1.275286
439,51012,CHEROKEE,Cherokee,IA,18,386.838104
440,51014,CLEGHORN,Cherokee,IA,18,139.646952
441,51016,CORRECTIONVILLE,Woodbury,IA,97,262.895432
442,51018,CUSHING,Woodbury,IA,97,95.118381
443,51019,DANBURY,Woodbury,IA,97,236.787928
444,51020,GALVA,Ida,IA,47,140.577018
445,51022,GRANVILLE,Sioux,IA,84,204.082524
446,51023,HAWARDEN,Sioux,IA,84,271.44549
447,51024,HINTON,Plymouth,IA,75,232.728447
448,51025,HOLSTEIN,Ida,IA,47,278.402985
449,51026,HORNICK,Woodbury,IA,97,281.235872
450,51027,IRETON,Sioux,IA,84,239.835719
451,51028,KINGSLEY,Plymouth,IA,75,328.930974
452,51029,LARRABEE,Cherokee,IA,18,58.631884
453,51030,LAWTON,Woodbury,IA,97,153.939315
454,51031,LE MARS,Plymouth,IA,75,605.111843
455,51033,LINN GROVE,Buena Vista,IA,11,163.475339
456,51034,MAPLETON,Monona,IA,67,290.876284
457,51035,MARCUS,Cherokee,IA,18,278.97681
458,51036,MAURICE,Sioux,IA,84,114.229271
459,51037,MERIDEN,Cherokee,IA,18,61.66748
460,51038,MERRILL,Plymouth,IA,75,233.828323
461,51039,MOVILLE,Woodbury,IA,97,223.159311
462,51040,ONAWA,Monona,IA,67,399.810225
463,51041,ORANGE CITY,Sioux,IA,84,184.491043
464,51044,OTO,Woodbury,IA,97,89.839583
465,51046,PAULLINA,O'Brien,IA,71,240.918039
466,51047,PETERSON,Clay,IA,21,200.078951
467,51048,PIERSON,Woodbury,IA,97,86.240685
468,51049,QUIMBY,Cherokee,IA,18,113.097948
469,51050,REMSEN,Plymouth,IA,75,353.266814
470,51051,RODNEY,Monona,IA,67,8.740209
471,51052,SALIX,Woodbury,IA,97,159.659315
472,51053,SCHALLER,Sac,IA,81,195.513267
473,51054,SERGEANT BLUFF,Woodbury,IA,97,106.329025
474,51055,SLOAN,Woodbury,IA,97,174.692448
475,51056,SMITHLAND,Woodbury,IA,97,88.932213
476,51058,SUTHERLAND,O'Brien,IA,71,214.786027
477,51060,UTE,Monona,IA,67,156.247108
478,51061,WASHTA,Cherokee,IA,18,121.628171
479,51062,WESTFIELD,Plymouth,IA,75,144.594267
480,51063,WHITING,Monona,IA,67,162.058438
481,51101,SIOUX CITY,Woodbury,IA,97,3.138764
482,51103,SIOUX CITY,Woodbury,IA,97,27.86321
483,51104,SIOUX CITY,Woodbury,IA,97,20.00953
484,51105,SIOUX CITY,Woodbury,IA,97,15.825592
485,51106,SIOUX CITY,Woodbury,IA,97,81.702782
486,51108,SIOUX CITY,Woodbury,IA,97,116.455967
487,51109,SIOUX CITY,Woodbury,IA,97,49.159557
488,51111,SIOUX CITY,Woodbury,IA,97,17.993387
489,51201,SHELDON,O'Brien,IA,71,295.817592
490,51230,ALVORD,Lyon,IA,60,64.875507
491,51231,ARCHER,O'Brien,IA,71,73.029493
492,51232,ASHTON,Osceola,IA,72,156.638511
493,51234,BOYDEN,Sioux,IA,84,129.274507
494,51235,DOON,Lyon,IA,60,144.971909
495,51237,GEORGE,Lyon,IA,60,249.759921
496,51238,HOSPERS,Sioux,IA,84,118.831336
497,51239,HULL,Sioux,IA,84,171.513941
498,51240,INWOOD,Lyon,IA,60,263.161609
499,51241,LARCHWOOD,Lyon,IA,60,232.767886
500,51242,LESTER,Lyon,IA,60,1.181066
501,51243,LITTLE ROCK,Lyon,IA,60,139.238401
502,51244,MATLOCK,Sioux,IA,84,0.778428
503,51245,PRIMGHAR,O'Brien,IA,71,203.413241
504,51246,ROCK RAPIDS,Lyon,IA,60,424.515402
505,51247,ROCK VALLEY,Sioux,IA,84,289.650577
506,51248,SANBORN,O'Brien,IA,71,214.49695
507,51249,SIBLEY,Osceola,IA,72,324.74321
508,51250,SIOUX CENTER,Sioux,IA,84,186.398963
509,51301,SPENCER,Clay,IA,21,406.942409
510,51331,ARNOLDS PARK,Dickinson,IA,30,7.12251
511,51333,DICKENS,Clay,IA,21,170.365473
512,51334,ESTHERVILLE,Emmet,IA,32,491.781068
513,51338,EVERLY,Clay,IA,21,189.29993
514,51341,GILLETT GROVE,Clay,IA,21,0.89395
515,51342,GRAETTINGER,Palo Alto,IA,74,214.231634
516,51343,GREENVILLE,Clay,IA,21,65.322005
517,51345,HARRIS,Osceola,IA,72,146.304432
518,51346,HARTLEY,O'Brien,IA,71,377.358581
519,51347,LAKE PARK,Dickinson,IA,30,223.92845
520,51350,MELVIN,Osceola,IA,72,112.104258
521,51351,MILFORD,Dickinson,IA,30,278.931975
522,51354,OCHEYEDAN,Osceola,IA,72,227.662269
523,51355,OKOBOJI,Dickinson,IA,30,11.052693
524,51357,ROYAL,Clay,IA,21,108.592195
525,51358,RUTHVEN,Palo Alto,IA,74,202.954357
526,51360,SPIRIT LAKE,Dickinson,IA,30,334.993966
527,51363,SUPERIOR,Dickinson,IA,30,1.050241
528,51364,TERRIL,Dickinson,IA,30,167.034887
529,51365,WALLINGFORD,Emmet,IA,32,58.440957
530,51366,WEBB,Clay,IA,21,163.894539
531,51401,CARROLL,Carroll,IA,14,454.121
532,51430,ARCADIA,Carroll,IA,14,105.531033
533,51431,ARTHUR,Ida,IA,47,101.935208
534,51433,AUBURN,Sac,IA,81,147.355553
535,51436,BREDA,Carroll,IA,14,156.790205
536,51439,CHARTER OAK,Crawford,IA,24,223.476837
537,51440,DEDHAM,Carroll,IA,14,66.015871
538,51441,DELOIT,Crawford,IA,24,40.266915
539,51442,DENISON,Crawford,IA,24,448.677639
540,51443,GLIDDEN,Carroll,IA,14,276.442701
541,51444,HALBUR,Carroll,IA,14,0.42223
542,51445,IDA GROVE,Ida,IA,47,327.873042
543,51446,IRWIN,Shelby,IA,83,90.274738
544,51447,KIRKMAN,Shelby,IA,83,90.268943
545,51448,KIRON,Crawford,IA,24,144.043445
546,51449,LAKE CITY,Calhoun,IA,13,232.944822
547,51450,LAKE VIEW,Sac,IA,81,157.595822
548,51451,LANESBORO,Carroll,IA,14,0.955449
549,51453,LOHRVILLE,Calhoun,IA,13,223.339647
550,51454,MANILLA,Crawford,IA,24,259.313231
551,51455,MANNING,Carroll,IA,14,287.004925
552,51458,ODEBOLT,Sac,IA,81,245.407583
553,51459,RALSTON,Carroll,IA,14,1.355933
554,51461,SCHLESWIG,Crawford,IA,24,130.196128
555,51462,SCRANTON,Greene,IA,37,281.553969
556,51463,TEMPLETON,Carroll,IA,14,77.535941
557,51465,VAIL,Crawford,IA,24,141.104728
558,51466,WALL LAKE,Sac,IA,81,134.548626
559,51467,WESTSIDE,Crawford,IA,24,168.975493
560,51501,COUNCIL BLUFFS,Pottawattamie,IA,78,68.663347
561,51503,COUNCIL BLUFFS,Pottawattamie,IA,78,311.311378
562,51510,CARTER LAKE,Pottawattamie,IA,78,5.228569
563,51520,ARION,Crawford,IA,24,48.585455
564,51521,AVOCA,Pottawattamie,IA,78,223.26416
565,51523,BLENCOE,Monona,IA,67,134.969227
566,51525,CARSON,Pottawattamie,IA,78,158.290149
567,51526,CRESCENT,Pottawattamie,IA,78,111.333068
568,51527,DEFIANCE,Shelby,IA,83,101.109924
569,51528,DOW CITY,Crawford,IA,24,189.885849
570,51529,DUNLAP,Harrison,IA,43,333.953737
571,51530,EARLING,Shelby,IA,83,140.370471
572,51531,ELK HORN,Shelby,IA,83,73.952439
573,51532,ELLIOTT,Montgomery,IA,69,147.80971
574,51533,EMERSON,Mills,IA,65,213.641636
575,51534,GLENWOOD,Mills,IA,65,260.80305
576,51535,GRISWOLD,Cass,IA,15,337.126792
577,51536,HANCOCK,Pottawattamie,IA,78,124.955843
578,51537,HARLAN,Shelby,IA,83,422.332929
579,51540,HASTINGS,Mills,IA,65,141.382464
580,51541,HENDERSON,Mills,IA,65,106.735256
581,51542,HONEY CREEK,Pottawattamie,IA,78,92.88071
582,51543,KIMBALLTON,Audubon,IA,5,57.402651
583,51544,LEWIS,Cass,IA,15,145.135032
584,51545,LITTLE SIOUX,Harrison,IA,43,116.122152
585,51546,LOGAN,Harrison,IA,43,293.310022
586,51548,MC CLELLAND,Pottawattamie,IA,78,64.393248
587,51549,MACEDONIA,Pottawattamie,IA,78,84.699901
588,51550,MAGNOLIA,Harrison,IA,43,1.456103
589,51551,MALVERN,Mills,IA,65,210.796119
590,51552,MARNE,Cass,IA,15,91.256513
591,51553,MINDEN,Pottawattamie,IA,78,118.587348
592,51554,MINEOLA,Mills,IA,65,6.840874
593,51555,MISSOURI VALLEY,Harrison,IA,43,410.136654
594,51556,MODALE,Harrison,IA,43,117.494595
595,51557,MONDAMIN,Harrison,IA,43,181.091843
596,51558,MOORHEAD,Monona,IA,67,213.388981
597,51559,NEOLA,Pottawattamie,IA,78,220.723262
598,51560,OAKLAND,Pottawattamie,IA,78,254.891645
599,51561,PACIFIC JUNCTION,Mills,IA,65,159.38428
600,51562,PANAMA,Shelby,IA,83,87.966803
601,51563,PERSIA,Harrison,IA,43,142.269566
602,51564,PISGAH,Harrison,IA,43,103.461051
603,51565,PORTSMOUTH,Shelby,IA,83,126.12811
604,51566,RED OAK,Montgomery,IA,69,457.891547
605,51570,SHELBY,Shelby,IA,83,166.695973
606,51571,SILVER CITY,Mills,IA,65,121.369661
607,51572,SOLDIER,Monona,IA,67,114.390965
608,51573,STANTON,Montgomery,IA,69,156.152123
609,51575,TREYNOR,Pottawattamie,IA,78,126.569658
610,51576,UNDERWOOD,Pottawattamie,IA,78,130.453907
611,51577,WALNUT,Pottawattamie,IA,78,206.979038
612,51578,WESTPHALIA,Shelby,IA,83,0.096684
613,51579,WOODBINE,Harrison,IA,43,301.420886
614,51601,SHENANDOAH,Page,IA,73,276.34259
615,51630,BLANCHARD,Page,IA,73,65.739953
616,51631,BRADDYVILLE,Page,IA,73,95.851302
617,51632,CLARINDA,Page,IA,73,540.669979
618,51636,COIN,Page,IA,73,144.434643
619,51637,COLLEGE SPRINGS,Page,IA,73,4.185566
620,51638,ESSEX,Page,IA,73,220.766642
621,51639,FARRAGUT,Fremont,IA,36,186.79102
622,51640,HAMBURG,Fremont,IA,36,313.989946
623,51645,IMOGENE,Fremont,IA,36,107.043333
624,51646,NEW MARKET,Taylor,IA,87,162.903928
625,51647,NORTHBORO,Page,IA,73,49.689882
626,51648,PERCIVAL,Fremont,IA,36,130.838045
627,51649,RANDOLPH,Fremont,IA,36,106.334486
628,51650,RIVERTON,Fremont,IA,36,78.508646
629,51652,SIDNEY,Fremont,IA,36,200.054769
630,51653,TABOR,Fremont,IA,36,89.512532
631,51654,THURMAN,Fremont,IA,36,137.790111
632,51656,YORKTOWN,Page,IA,73,0.438077
633,52001,DUBUQUE,Dubuque,IA,31,75.057763
634,52002,DUBUQUE,Dubuque,IA,31,74.76947
635,52003,DUBUQUE,Dubuque,IA,31,151.954104
636,52030,ANDREW,Jackson,IA,49,0.690483
637,52031,BELLEVUE,Jackson,IA,49,448.070077
638,52032,BERNARD,Jackson,IA,49,272.572418
639,52033,CASCADE,Jones,IA,53,252.635822
640,52035,COLESBURG,Clayton,IA,22,139.028244
641,52037,DELMAR,Clinton,IA,23,176.775293
642,52038,DUNDEE,Delaware,IA,28,76.611993
643,52039,DURANGO,Dubuque,IA,31,91.068312
644,52040,DYERSVILLE,Dubuque,IA,31,157.748966
645,52041,EARLVILLE,Delaware,IA,28,155.040397
646,52042,EDGEWOOD,Clayton,IA,22,161.463192
647,52043,ELKADER,Clayton,IA,22,258.435335
648,52044,ELKPORT,Clayton,IA,22,31.383574
649,52045,EPWORTH,Dubuque,IA,31,123.92508
650,52046,FARLEY,Dubuque,IA,31,119.196635
651,52047,FARMERSBURG,Clayton,IA,22,88.328123
652,52048,GARBER,Clayton,IA,22,85.789255
653,52049,GARNAVILLO,Clayton,IA,22,188.393273
654,52050,GREELEY,Delaware,IA,28,84.028046
655,52052,GUTTENBERG,Clayton,IA,22,249.00035
656,52053,HOLY CROSS,Dubuque,IA,31,154.540299
657,52054,LA MOTTE,Jackson,IA,49,138.18618
658,52057,MANCHESTER,Delaware,IA,28,362.350079
659,52060,MAQUOKETA,Jackson,IA,49,409.835195
660,52064,MILES,Jackson,IA,49,107.540582
661,52065,NEW VIENNA,Dubuque,IA,31,131.382162
662,52066,NORTH BUENA VISTA,Clayton,IA,22,0.681956
663,52068,PEOSTA,Dubuque,IA,31,120.418986
664,52069,PRESTON,Jackson,IA,49,133.277674
665,52070,SABULA,Jackson,IA,49,154.415679
666,52072,SAINT OLAF,Clayton,IA,22,87.106056
667,52073,SHERRILL,Dubuque,IA,31,139.623458
668,52074,SPRAGUEVILLE,Jackson,IA,49,80.684903
669,52076,STRAWBERRY POINT,Clayton,IA,22,252.900986
670,52077,VOLGA,Clayton,IA,22,83.72915
671,52078,WORTHINGTON,Dubuque,IA,31,106.290071
672,52079,ZWINGLE,Jackson,IA,49,153.776141
673,52101,DECORAH,Winneshiek,IA,96,804.622162
674,52132,CALMAR,Winneshiek,IA,96,157.379001
675,52133,CASTALIA,Winneshiek,IA,96,118.295514
676,52134,CHESTER,Howard,IA,45,77.559488
677,52135,CLERMONT,Fayette,IA,33,67.561964
678,52136,CRESCO,Howard,IA,45,552.444562
679,52140,DORCHESTER,Allamakee,IA,3,201.58604
680,52141,ELGIN,Fayette,IA,33,216.938945
681,52142,FAYETTE,Fayette,IA,33,189.331878
682,52144,FORT ATKINSON,Winneshiek,IA,96,194.119174
683,52146,HARPERS FERRY,Allamakee,IA,3,221.678469
684,52147,HAWKEYE,Fayette,IA,33,187.912171
685,52151,LANSING,Allamakee,IA,3,324.670969
686,52154,LAWLER,Chickasaw,IA,19,189.966484
687,52155,LIME SPRINGS,Howard,IA,45,274.895155
688,52156,LUANA,Clayton,IA,22,110.19542
689,52157,MC GREGOR,Clayton,IA,22,149.443621
690,52158,MARQUETTE,Clayton,IA,22,3.357585
691,52159,MONONA,Clayton,IA,22,197.141302
692,52160,NEW ALBIN,Allamakee,IA,3,103.702081
693,52161,OSSIAN,Winneshiek,IA,96,140.304657
694,52162,POSTVILLE,Allamakee,IA,3,258.921418
695,52163,PROTIVIN,Howard,IA,45,4.602409
696,52164,RANDALIA,Fayette,IA,33,65.306405
697,52165,RIDGEWAY,Winneshiek,IA,96,172.29035
698,52166,SAINT LUCAS,Fayette,IA,33,0.493739
699,52168,SPILLVILLE,Winneshiek,IA,96,0.417415
700,52169,WADENA,Fayette,IA,33,77.089638
701,52170,WATERVILLE,Allamakee,IA,3,120.174634
702,52171,WAUCOMA,Fayette,IA,33,205.89717
703,52171,WAUCOMA,Fayette,IA,33,205.89717
704,52172,WAUKON,Allamakee,IA,3,409.930314
705,52175,WEST UNION,Fayette,IA,33,224.708583
706,52201,AINSWORTH,Washington,IA,92,171.814169
707,52202,ALBURNETT,Linn,IA,57,65.295077
708,52203,AMANA,Iowa,IA,48,119.753928
709,52205,ANAMOSA,Jones,IA,53,309.946308
710,52206,ATKINS,Benton,IA,6,73.245916
711,52207,BALDWIN,Jackson,IA,49,102.173612
712,52208,BELLE PLAINE,Benton,IA,6,150.72943
713,52209,BLAIRSTOWN,Benton,IA,6,95.402594
714,52210,BRANDON,Buchanan,IA,10,90.462642
715,52211,BROOKLYN,Poweshiek,IA,79,236.939306
716,52212,CENTER JUNCTION,Jones,IA,53,60.790568
717,52213,CENTER POINT,Linn,IA,57,194.486805
718,52214,CENTRAL CITY,Linn,IA,57,247.622502
719,52215,CHELSEA,Tama,IA,86,224.182198
720,52216,CLARENCE,Cedar,IA,16,146.950128
721,52217,CLUTIER,Tama,IA,86,152.367658
722,52218,COGGON,Linn,IA,57,187.432341
723,52219,PRAIRIEBURG,Linn,IA,57,1.194124
724,52220,CONROY,Iowa,IA,48,1.194245
725,52221,GUERNSEY,Poweshiek,IA,79,53.776053
726,52222,DEEP RIVER,Poweshiek,IA,79,209.152349
727,52223,DELHI,Delaware,IA,28,127.619807
728,52224,DYSART,Tama,IA,86,256.593612
729,52225,ELBERON,Tama,IA,86,91.276214
730,52227,ELY,Linn,IA,57,76.013553
731,52228,FAIRFAX,Linn,IA,57,96.928032
732,52229,GARRISON,Benton,IA,6,127.391906
733,52231,HARPER,Keokuk,IA,54,84.735151
734,52232,HARTWICK,Poweshiek,IA,79,47.517446
735,52233,HIAWATHA,Linn,IA,57,9.080084
736,52235,HILLS,Johnson,IA,52,5.257012
737,52236,HOMESTEAD,Iowa,IA,48,74.148178
738,52237,HOPKINTON,Delaware,IA,28,210.461938
739,52240,IOWA CITY,Johnson,IA,52,415.571318
740,52241,CORALVILLE,Johnson,IA,52,30.871305
741,52242,IOWA CITY,Johnson,IA,52,1.995678
742,52245,IOWA CITY,Johnson,IA,52,21.712859
743,52246,IOWA CITY,Johnson,IA,52,23.832009
744,52246,IOWA CITY,Johnson,IA,52,23.832009
745,52247,KALONA,Washington,IA,92,206.350122
746,52248,KEOTA,Washington,IA,92,295.236835
747,52249,KEYSTONE,Benton,IA,6,131.160363
748,52251,LADORA,Iowa,IA,48,106.339173
749,52253,LISBON,Linn,IA,57,121.480006
750,52254,LOST NATION,Clinton,IA,23,146.759383
751,52255,LOWDEN,Cedar,IA,16,112.408307
752,52257,LUZERNE,Benton,IA,6,40.66545
753,52301,MARENGO,Iowa,IA,48,316.484635
754,52302,MARION,Linn,IA,57,192.237746
755,52305,MARTELLE,Jones,IA,53,73.493573
756,52306,MECHANICSVILLE,Cedar,IA,16,194.245557
757,52307,MIDDLE AMANA,Iowa,IA,48,0.488234
758,52308,MILLERSBURG,Iowa,IA,48,0.233003
759,52309,MONMOUTH,Jackson,IA,49,74.291839
760,52310,MONTICELLO,Jones,IA,53,385.183593
761,52312,MORLEY,Jones,IA,53,0.242872
762,52313,MOUNT AUBURN,Benton,IA,6,84.959546
763,52314,MOUNT VERNON,Linn,IA,57,155.658054
764,52315,NEWHALL,Benton,IA,6,75.003645
765,52316,NORTH ENGLISH,Iowa,IA,48,180.293861
766,52317,NORTH LIBERTY,Johnson,IA,52,95.897591
767,52318,NORWAY,Benton,IA,6,95.080335
768,52320,OLIN,Jones,IA,53,153.985129
769,52321,ONSLOW,Jones,IA,53,77.84139
770,52322,OXFORD,Johnson,IA,52,231.220008
771,52323,OXFORD JUNCTION,Jones,IA,53,132.353794
772,52324,PALO,Linn,IA,57,107.056042
773,52325,PARNELL,Iowa,IA,48,109.143193
774,52326,QUASQUETON,Buchanan,IA,10,5.447918
775,52327,RIVERSIDE,Washington,IA,92,217.607612
776,52328,ROBINS,Linn,IA,57,7.989464
777,52329,ROWLEY,Buchanan,IA,10,132.090802
778,52330,RYAN,Delaware,IA,28,123.160697
779,52332,SHELLSBURG,Benton,IA,6,110.692089
780,52333,SOLON,Johnson,IA,52,234.514567
781,52334,SOUTH AMANA,Iowa,IA,48,43.109728
782,52335,SOUTH ENGLISH,Keokuk,IA,54,144.712437
783,52336,SPRINGVILLE,Linn,IA,57,127.253732
784,52337,STANWOOD,Cedar,IA,16,70.714333
785,52338,SWISHER,Johnson,IA,52,88.82337
786,52339,TAMA,Tama,IA,86,265.108854
787,52340,TIFFIN,Johnson,IA,52,42.940483
788,52341,TODDVILLE,Linn,IA,57,34.64544
789,52342,TOLEDO,Tama,IA,86,232.660911
790,52345,URBANA,Benton,IA,6,8.460683
791,52346,VAN HORNE,Benton,IA,6,134.845571
792,52347,VICTOR,Iowa,IA,48,166.647034
793,52348,VINING,Tama,IA,86,2.376982
794,52349,VINTON,Benton,IA,6,382.672144
795,52351,WALFORD,Benton,IA,6,2.13643
796,52352,WALKER,Linn,IA,57,205.022618
797,52353,WASHINGTON,Washington,IA,92,395.338718
798,52354,WATKINS,Benton,IA,6,76.593832
799,52355,WEBSTER,Keokuk,IA,54,95.20977
800,52356,WELLMAN,Washington,IA,92,232.4078
801,52358,WEST BRANCH,Cedar,IA,16,200.939529
802,52359,WEST CHESTER,Washington,IA,92,38.168572
803,52361,WILLIAMSBURG,Iowa,IA,48,330.197198
804,52362,WYOMING,Jones,IA,53,157.429942
805,52401,CEDAR RAPIDS,Linn,IA,57,3.464505
806,52402,CEDAR RAPIDS,Linn,IA,57,36.420817
807,52403,CEDAR RAPIDS,Linn,IA,57,69.523743
808,52404,CEDAR RAPIDS,Linn,IA,57,142.93349
809,52405,CEDAR RAPIDS,Linn,IA,57,38.49318
810,52411,CEDAR RAPIDS,Linn,IA,57,44.635019
811,52501,OTTUMWA,Wapello,IA,90,591.297871
812,52530,AGENCY,Wapello,IA,90,36.986236
813,52531,ALBIA,Monroe,IA,68,563.107904
814,52533,BATAVIA,Jefferson,IA,51,227.778322
815,52534,BEACON,Mahaska,IA,62,1.012395
816,52535,BIRMINGHAM,Van Buren,IA,89,149.696508
817,52536,BLAKESBURG,Wapello,IA,90,160.133711
818,52537,BLOOMFIELD,Davis,IA,26,900.130186
819,52540,BRIGHTON,Jefferson,IA,51,257.681662
820,52542,CANTRIL,Van Buren,IA,89,117.166206
821,52543,CEDAR,Mahaska,IA,62,53.398057
822,52544,CENTERVILLE,Appanoose,IA,4,355.984819
823,52548,CHILLICOTHE,Wapello,IA,90,0.622729
824,52549,CINCINNATI,Appanoose,IA,4,113.382459
825,52550,DELTA,Keokuk,IA,54,100.697091
826,52551,DOUDS,Van Buren,IA,89,151.969068
827,52552,DRAKESVILLE,Davis,IA,26,151.883138
828,52553,EDDYVILLE,Wapello,IA,90,218.424094
829,52554,ELDON,Wapello,IA,90,94.705759
830,52555,EXLINE,Appanoose,IA,4,64.737097
831,52556,FAIRFIELD,Jefferson,IA,51,458.48455
832,52557,FAIRFIELD,Jefferson,IA,51,0.116099
833,52560,FLORIS,Davis,IA,26,91.932554
834,52561,FREMONT,Mahaska,IA,62,90.172828
835,52563,HEDRICK,Keokuk,IA,54,299.15143
836,52565,KEOSAUQUA,Van Buren,IA,89,307.637735
837,52566,KIRKVILLE,Wapello,IA,90,2.69235
838,52567,LIBERTYVILLE,Jefferson,IA,51,73.121919
839,52569,MELROSE,Monroe,IA,68,249.170073
840,52570,MILTON,Van Buren,IA,89,177.637484
841,52571,MORAVIA,Appanoose,IA,4,276.402554
842,52572,MOULTON,Appanoose,IA,4,252.233684
843,52573,MOUNT STERLING,Van Buren,IA,89,74.630386
844,52574,MYSTIC,Appanoose,IA,4,109.572946
845,52576,OLLIE,Keokuk,IA,54,117.180743
846,52577,OSKALOOSA,Mahaska,IA,62,415.772496
847,52580,PACKWOOD,Jefferson,IA,51,98.765465
848,52581,PLANO,Appanoose,IA,4,102.818704
849,52583,PROMISE CITY,Wayne,IA,93,120.716306
850,52584,PULASKI,Davis,IA,26,58.551231
851,52585,RICHLAND,Keokuk,IA,54,145.900379
852,52586,ROSE HILL,Mahaska,IA,62,122.508394
853,52588,SELMA,Van Buren,IA,89,26.887357
854,52590,SEYMOUR,Wayne,IA,93,194.526072
855,52591,SIGOURNEY,Keokuk,IA,54,327.849389
856,52593,UDELL,Appanoose,IA,4,42.505876
857,52594,UNIONVILLE,Appanoose,IA,4,121.987222
858,52595,UNIVERSITY PARK,Mahaska,IA,62,1.193986
859,52601,BURLINGTON,Des Moines,IA,29,312.02532
860,52619,ARGYLE,Lee,IA,56,95.322544
861,52620,BONAPARTE,Van Buren,IA,89,139.397291
862,52621,CRAWFORDSVILLE,Washington,IA,92,109.582868
863,52623,DANVILLE,Des Moines,IA,29,151.439917
864,52624,DENMARK,Lee,IA,56,1.656465
865,52625,DONNELLSON,Lee,IA,56,285.245058
866,52626,FARMINGTON,Van Buren,IA,89,247.437264
867,52627,FORT MADISON,Lee,IA,56,188.414662
868,52630,HILLSBORO,Henry,IA,44,109.946731
869,52632,KEOKUK,Lee,IA,56,140.470022
870,52635,LOCKRIDGE,Jefferson,IA,51,112.512335
871,52637,MEDIAPOLIS,Des Moines,IA,29,172.04191
872,52638,MIDDLETOWN,Des Moines,IA,29,88.521308
873,52639,MONTROSE,Lee,IA,56,117.469873
874,52640,MORNING SUN,Louisa,IA,58,188.643645
875,52641,MOUNT PLEASANT,Henry,IA,44,551.76361
876,52644,MOUNT UNION,Henry,IA,44,111.080445
877,52645,NEW LONDON,Henry,IA,44,185.650036
878,52646,OAKVILLE,Louisa,IA,58,155.005489
879,52647,OLDS,Henry,IA,44,0.911591
880,52649,SALEM,Henry,IA,44,119.399466
881,52650,SPERRY,Des Moines,IA,29,104.495885
882,52651,STOCKPORT,Van Buren,IA,89,156.669871
883,52653,WAPELLO,Louisa,IA,58,312.47521
884,52654,WAYLAND,Henry,IA,44,136.065742
885,52655,WEST BURLINGTON,Des Moines,IA,29,42.56453
886,52656,WEST POINT,Lee,IA,56,247.274743
887,52657,SAINT PAUL,Lee,IA,56,0.494762
888,52658,WEVER,Lee,IA,56,131.24099
889,52659,WINFIELD,Henry,IA,44,159.819459
890,52660,YARMOUTH,Des Moines,IA,29,57.367543
891,52701,ANDOVER,Clinton,IA,23,1.643828
892,52720,ATALISSA,Muscatine,IA,70,110.12746
893,52721,BENNETT,Cedar,IA,16,108.698097
894,52722,BETTENDORF,Scott,IA,82,73.161625
895,52726,BLUE GRASS,Scott,IA,82,94.823134
896,52727,BRYANT,Clinton,IA,23,63.081901
897,52728,BUFFALO,Scott,IA,82,5.764558
898,52729,CALAMUS,Clinton,IA,23,108.881384
899,52730,CAMANCHE,Clinton,IA,23,101.243411
900,52731,CHARLOTTE,Clinton,IA,23,136.590407
901,52732,CLINTON,Clinton,IA,23,310.704633
902,52737,COLUMBUS CITY,Louisa,IA,58,0.607264
903,52738,COLUMBUS JUNCTION,Louisa,IA,58,323.19118
904,52739,CONESVILLE,Muscatine,IA,70,89.211035
905,52742,DE WITT,Clinton,IA,23,304.142776
906,52745,DIXON,Scott,IA,82,74.381611
907,52746,DONAHUE,Scott,IA,82,68.807314
908,52747,DURANT,Cedar,IA,16,56.094151
909,52748,ELDRIDGE,Scott,IA,82,107.004915
910,52749,FRUITLAND,Muscatine,IA,70,5.21822
911,52750,GOOSE LAKE,Clinton,IA,23,78.435746
912,52751,GRAND MOUND,Clinton,IA,23,129.085616
913,52752,GRANDVIEW,Louisa,IA,58,0.984227
914,52753,LE CLAIRE,Scott,IA,82,68.492183
915,52754,LETTS,Muscatine,IA,70,186.005712
916,52755,LONE TREE,Johnson,IA,52,157.930258
917,52756,LONG GROVE,Scott,IA,82,111.107641
918,52757,LOW MOOR,Clinton,IA,23,3.069924
919,52758,MC CAUSLAND,Scott,IA,82,1.745891
920,52760,MOSCOW,Muscatine,IA,70,58.900339
921,52761,MUSCATINE,Muscatine,IA,70,482.973608
922,52765,NEW LIBERTY,Scott,IA,82,78.185471
923,52766,NICHOLS,Muscatine,IA,70,119.401548
924,52767,PLEASANT VALLEY,Scott,IA,82,3.354642
925,52768,PRINCETON,Scott,IA,82,89.481384
926,52769,STOCKTON,Muscatine,IA,70,106.726313
927,52769,STOCKTON,Muscatine,IA,70,106.726313
928,52772,TIPTON,Cedar,IA,16,342.328096
929,52773,WALCOTT,Scott,IA,82,145.837805
930,52774,WELTON,Clinton,IA,23,0.728977
931,52776,WEST LIBERTY,Muscatine,IA,70,219.678908
932,52777,WHEATLAND,Clinton,IA,23,139.946616
933,52778,WILTON,Muscatine,IA,70,215.972375
934,52801,DAVENPORT,Scott,IA,82,1.359908
935,52802,DAVENPORT,Scott,IA,82,29.294412
936,52803,DAVENPORT,Scott,IA,82,14.068035
937,52804,DAVENPORT,Scott,IA,82,88.861422
938,52806,DAVENPORT,Scott,IA,82,79.448284
939,52807,DAVENPORT,Scott,IA,82,76.46944
--------------------------------------------------------------------------------
/retail-strategy/data/iowa_incomes.xls:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/retail-strategy/data/iowa_incomes.xls
--------------------------------------------------------------------------------
/retail-strategy/data/pop_iowa_per_county.csv:
--------------------------------------------------------------------------------
1 | ,county,population
2 | 0,Adair,7092
3 | 1,Adams,3693
4 | 2,Allamakee,13884
5 | 3,Appanoose,12462
6 | 4,Audubon,5678
7 | 5,Benton,25699
8 | 6,Black Hawk,132904
9 | 7,Boone,26532
10 | 8,Bremer,24798
11 | 9,Buchanan,20992
12 | 10,Buena Vista,20332
13 | 11,Butler,14791
14 | 12,Calhoun,9846
15 | 13,Carroll,20437
16 | 14,Cass,13157
17 | 15,Cedar,18454
18 | 16,Cerro Gordo,43070
19 | 17,Cherokee,11508
20 | 18,Chickasaw,12023
21 | 19,Clarke,9309
22 | 20,Clay,16333
23 | 21,Clayton,17590
24 | 22,Clinton,47309
25 | 23,Crawford,16940
26 | 24,Dallas,84516
27 | 25,Davis,8860
28 | 26,Decatur,8141
29 | 27,Delaware,17327
30 | 28,Des Moines,39739
31 | 29,Dickinson,17243
32 | 30,Dubuque,97003
33 | 31,Emmet,9658
34 | 32,Fayette,20054
35 | 33,Floyd,15873
36 | 34,Franklin,10170
37 | 35,Fremont,6950
38 | 36,Greene,9011
39 | 37,Grundy,12313
40 | 38,Guthrie,10625
41 | 39,Hamilton,15076
42 | 40,Hancock,10835
43 | 41,Hardin,17226
44 | 42,Harrison,14149
45 | 43,Henry,19773
46 | 44,Howard,9332
47 | 45,Humboldt,9487
48 | 46,Ida,6985
49 | 47,Iowa,16311
50 | 48,Jackson,19472
51 | 49,Jasper,36708
52 | 50,Jefferson,18090
53 | 51,Johnson,146547
54 | 52,Jones,20439
55 | 53,Keokuk,10119
56 | 54,Kossuth,15114
57 | 55,Lee,34615
58 | 56,Linn,221661
59 | 57,Louisa,11142
60 | 58,Lucas,8647
61 | 59,Lyon,11754
62 | 60,Madison,15848
63 | 61,Mahaska,22181
64 | 62,Marion,33189
65 | 63,Marshall,40312
66 | 64,Mills,14972
67 | 65,Mitchell,10763
68 | 66,Monona,8898
69 | 67,Monroe,7870
70 | 68,Montgomery,10225
71 | 69,Muscatine,42940
72 | 70,O'Brien,14020
73 | 71,Osceola,6064
74 | 72,Page,15391
75 | 73,Palo Alto,9047
76 | 74,Plymouth,25200
77 | 75,Pocahontas,6886
78 | 76,Polk,474045
79 | 77,Pottawattamie,93582
80 | 78,Poweshiek,18533
81 | 79,Ringgold,5068
82 | 80,Sac,9876
83 | 81,Scott,172474
84 | 82,Shelby,11800
85 | 83,Sioux,34898
86 | 84,Story,97090
87 | 85,Tama,17319
88 | 86,Taylor,6216
89 | 87,Union,12420
90 | 88,Van Buren,7271
91 | 89,Wapello,34982
92 | 90,Warren,49691
93 | 91,Washington,22281
94 | 92,Wayne,6452
95 | 93,Webster,36769
96 | 94,Winnebago,10631
97 | 95,Winneshiek,20561
98 | 96,Woodbury,102779
99 | 97,Worth,7572
100 | 98,Wright,12779
101 |
--------------------------------------------------------------------------------
/retail-strategy/images/123:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/retail-strategy/images/123.png:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/retail-strategy/images/hm3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/retail-strategy/images/hm3.png
--------------------------------------------------------------------------------
/retail-strategy/images/liquor.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/retail-strategy/images/liquor.jpeg
--------------------------------------------------------------------------------
/retail-strategy/images/output.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/retail-strategy/images/output.png
--------------------------------------------------------------------------------
/retail-strategy/images/test.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/retail-strategy/images/test.jpg
--------------------------------------------------------------------------------
/tennis/123.png:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/tennis/README.md:
--------------------------------------------------------------------------------
1 | ## Forecasting the winner in the Men's ATP World Tour [[view code]](http://nbviewer.jupyter.org/github/marcotav/machine-learning-classification-projects/blob/master/tennis/notebooks/Final_Project_Marco_Tavora-DATNYC41_GA.ipynb)
2 |       
3 |
4 | **The code is available [here](http://nbviewer.jupyter.org/github/marcotav/machine-learning-classification-projects/blob/master/tennis/notebooks/Final_Project_Marco_Tavora-DATNYC41_GA.ipynb) or by clicking on the [view code] link above.**
5 |
6 |
7 |
8 |
9 |
10 |
11 |
12 | Problem Statement •
13 | Dataset •
14 | Importing basic modules •
15 | Pre-Processing of dataset •
16 | `Best_of` = 5 •
17 | Dummy variables •
18 | Exploratory Analysis for Best_of = 5 •
19 | Logistic Regression •
20 | Decision Trees and Random Forests
21 |
22 |
23 |
24 | ## Problem Statement
25 |
26 | The goal of the project is to predict the probability that the higher-ranked player will win a tennis match. I will call that a `win` (as opposed to an upset).
27 |
28 | ## Dataset
29 |
30 | The dataset contains results for the men's ATP tour dating back to January 2000 and comes from http://www.tennis-data.co.uk/data.php (obtained via Kaggle). The features used for each match were:
31 | - `Date`: date of the match
32 | - `Series`: name of the ATP tennis series (we kept the four main current categories, namely Grand Slams, Masters 1000, ATP 250 and ATP 500)
33 | - `Surface`: type of surface (clay, hard or grass)
34 | - `Round`: round of match (from first round to the final)
35 | - `Best of`: maximum number of sets playable in match (Best of 3 or Best of 5)
36 | - `WRank`: ATP Entry ranking of the match winner as of the start of the tournament
37 | - `LRank`: ATP Entry ranking of the match loser as of the start of the tournament
38 |
39 | The output variable is binary: `win` equals 1 if the higher-ranked player (by definition, the one with the numerically smaller ATP ranking) wins the match, and 0 otherwise (an upset).
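
The preprocessing code below assumes the raw CSV has already been read into `df_atp`. A minimal loading sketch (the filename `atp_data.csv` is an assumption; substitute the actual path of the downloaded file):

```
import pandas as pd

# Hypothetical filename for the tennis-data.co.uk CSV obtained from Kaggle
df_atp = pd.read_csv('atp_data.csv')
df_atp.head()
```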
40 |
41 | ## Importing basic modules
42 |
43 | ```
44 | import numpy as np
45 | import pandas as pd
46 | import statsmodels.api as sm
46 | import matplotlib.pyplot as plt
47 | from sklearn import metrics
48 | import seaborn as sns
49 | sns.set_style("darkgrid")
50 | import pylab as pl
51 | %matplotlib inline
52 | ```
53 |
54 | ## Pre-Processing of dataset
55 |
56 | After loading the dataset we proceed as follows:
57 | - Keep only completed matches, i.e. eliminate matches with injury withdrawals and walkovers
58 | - Choose the features listed above
59 | - Drop `NaN` entries
60 | - Consider only the two final years (to avoid mixing in tournament categories that existed only in the past); this choice is somewhat arbitrary and can be changed if needed
61 | - Keep only higher-ranked players, which improves accuracy (as suggested by Corral and Prieto-Rodriguez (2010) and confirmed here)
62 | ```
63 | # Converting the Date column to datetime
64 | df_atp['Date'] = pd.to_datetime(df_atp['Date'])
65 | # Restricting to the final two years of data
66 | df_atp = df_atp.loc[(df_atp['Date'] > '2014-11-09') & (df_atp['Date'] <= '2016-11-09')]
67 | # Keeping only completed matches
68 | df_atp = df_atp[df_atp['Comment'] == 'Completed'].drop("Comment",axis = 1)
69 | # Renaming Best of to Best_of
70 | df_atp.rename(columns = {'Best of':'Best_of'},inplace=True)
71 | # Choosing features
72 | cols_to_keep = ['Date','Series','Surface', 'Round','Best_of', 'WRank','LRank']
73 | # Dropping NaNs
74 | df_atp = df_atp[cols_to_keep].dropna()
75 | # Dropping errors in the dataset and unimportant entries (e.g. there are very few entries for Masters Cup)
76 | df_atp = df_atp[(df_atp['LRank'] != 'NR') & (df_atp['WRank'] != 'NR') & (df_atp['Series'] != 'Masters Cup')]
77 | ```
78 | Another important step is to convert some of the columns from strings to numerical values:
79 | ```
80 | cols_to_keep = ['Best_of','WRank','LRank']
81 | df_atp[cols_to_keep] = df_atp[cols_to_keep].astype(int)
82 | ```
83 | I now create an extra column for the variable `win` (described above) using an auxiliary function `win(x)`:
84 |
85 | ```
86 | def win(x):
87 |     if x > 0:
88 |         return 0
89 |     else:
90 |         return 1
91 | ```
92 | Using the `apply()` method, which applies a function to each element of a column:
93 | ```
94 | df_atp['win'] = (df_atp['WRank'] - df_atp['LRank']).apply(win)
95 | ```
96 |
97 | Following [Corral and Prieto-Rodriguez](https://ideas.repec.org/a/eee/intfor/v26yi3p551-563.html), we restrict the analysis to higher-ranked players (both players within the top 150):
98 | ```
99 | df_new = df_atp[(df_atp['WRank'] <= 150) & (df_atp['LRank'] <= 150)]
100 | ```
101 |
102 |
103 |
104 |
105 |
106 |
107 |
108 |
109 | ## `Best_of` = 5
110 |
111 | We now restrict the analysis to matches with `Best_of` = 5. Since only Grand Slams are played over five sets, the `Series` column carries no information here and can be dropped. The case `Best_of` = 3 is considered afterwards.
112 | ```
113 | df3 = df_new.copy()
114 | df3 = df3[df3['Best_of'] == 5]
115 | # Drop the Series and Best_of columns
116 | df3.drop("Series", axis=1, inplace=True)
117 | df3.drop("Best_of", axis=1, inplace=True)
118 | ```
119 | The dataset is uneven in the frequency of `win` values (imbalanced classes). The quick function below converts a pandas `Series` into a `DataFrame` (for display purposes only):
120 | ```
121 | def series_to_df(s):
122 |     return s.to_frame()
123 | series_to_df(df3['win'].value_counts())
124 | series_to_df(df3['win'].value_counts()/df3.shape[0])
125 | ```
126 |
127 |
128 |
129 |
130 |
131 |
132 | To correct this problem and create a balanced dataset, I undersample the majority class so that both classes end up with the same number of rows:
133 |
134 | ```
135 | y_0 = df3[df3.win == 0]
136 | y_1 = df3[df3.win == 1]
137 | n = min([len(y_0), len(y_1)])
138 | y_0 = y_0.sample(n = n, random_state = 0)
139 | y_1 = y_1.sample(n = n, random_state = 0)
140 | df_strat = pd.concat([y_0, y_1])
141 | X_strat = df_strat[['Date', 'Surface', 'Round','WRank', 'LRank']]
142 | y_strat = df_strat.win
143 | df = X_strat.copy()
144 | df['win'] = y_strat
145 | ```
146 | The balanced classes become:
147 |
148 |
149 |
150 |
151 |
152 | We now define the variables `P1` and `P2`, where `P1` is the numerically larger of the two rankings (i.e. the lower-ranked player) and `P2` the smaller:
153 | ```
154 | ranks = ["WRank", "LRank"]
155 | df["P1"] = df[ranks].max(axis=1)
156 | df["P2"] = df[ranks].min(axis=1)
157 | ```
158 |
159 |
160 | ## Exploratory Analysis for Best_of = 5
161 |
162 | I first look at the percentage of wins for each surface. When the `Surface` is clay there is a higher likelihood of upsets (the opposite of a `win`), i.e. the percentage of wins is lower, though the difference is not very large.
163 | ```
164 | win_by_Surface = pd.crosstab(df.win, df.Surface).apply(lambda x: x/x.sum(), axis = 0)
165 | ```
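
One hedged way to visualize this crosstab (a sketch; the original figure may have been generated differently):

```
# Sketch: grouped bar plot of win/upset fractions per surface
# (assumes matplotlib was imported as plt, as in the basic modules above)
win_by_Surface.T.plot(kind='bar')
plt.ylabel('Fraction of matches')
plt.title('Win vs. upset fraction by surface')
```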
166 |
167 |
168 |
169 |
170 |
171 | What about the dependence on rounds? The overall relation is not very clear, but upsets are noticeably unlikely in the semifinals.
172 |
173 | ```
174 | win_by_round = pd.crosstab(df.win, df.Round).apply(lambda x: x/x.sum(), axis = 0)
175 | ```
176 |
177 |
178 |
179 |
180 |
181 |
182 |
183 | ## Dummy variables
184 | To keep the dataframe cleaner we transform the `Round` entries into numbers using:
185 | ```
186 | df1 = df.copy()
187 | def round_number(x):
188 |     if x == '1st Round':
189 |         return 1
190 |     elif x == '2nd Round':
191 |         return 2
192 |     elif x == '3rd Round':
193 |         return 3
194 |     elif x == '4th Round':
195 |         return 4
196 |     elif x == 'Quarterfinals':
197 |         return 5
198 |     elif x == 'Semifinals':
199 |         return 6
200 |     elif x == 'The Final':
201 |         return 7
202 | df1['Round'] = df1['Round'].apply(round_number)
203 | ```
204 | We then transform the rounds into dummy variables:
205 | ```
206 | dummy_ranks = pd.get_dummies(df1['Round'], prefix='Round')
207 | # .loc replaces the deprecated .ix indexer; Round_1 is dropped as the baseline
208 | df1 = df1.join(dummy_ranks.loc[:, 'Round_2':])
209 | rounds = ['Round_2', 'Round_3', 'Round_4', 'Round_5', 'Round_6', 'Round_7']
210 | df1[rounds] = df1[rounds].astype(int)
211 | ```
212 | We repeat this for the `Surface` variable; a minimal sketch is shown below (assuming the surface labels are `Clay`, `Grass` and `Hard`, with clay as the dropped baseline).
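
```
# Sketch: Surface dummies, mirroring the Round treatment above
dummy_surfaces = pd.get_dummies(df1['Surface'], prefix='Surface')
df1 = df1.join(dummy_surfaces.loc[:, 'Surface_Grass':])  # Surface_Clay dropped as baseline
surfaces = ['Surface_Grass', 'Surface_Hard']
df1[surfaces] = df1[surfaces].astype(int)
```

I now take the logarithms of `P1` and `P2`, then create a variable `D` for the absolute ranking gap (at this stage the notebook calls the dataframe `df4`):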
213 | ```
214 | df4['P1'] = np.log2(df4['P1'].astype('float64'))
215 | df4['P2'] = np.log2(df4['P2'].astype('float64'))
216 | df4['D'] = df4['P1'] - df4['P2']
217 | df4['D'] = np.absolute(df4['D'])
218 | ```
219 |
220 | ## Logistic Regression
221 |
222 | The next step is building the models, starting with a logistic regression. First, `X` and `y` must be defined:
223 |
224 | ```
225 | feature_cols = ['Round_2','Round_3','Round_4','Round_5','Round_6','Round_7','Surface_Grass','Surface_Hard','D']
226 | dfnew = df4.copy()
227 | dfnew[feature_cols].head()
228 | X = dfnew[feature_cols]
229 | y = dfnew.win
230 | ```
231 | Doing a train-test split:
232 | ```
233 | from sklearn.model_selection import train_test_split  # cross_validation is deprecated in recent scikit-learn
234 | X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
235 | ```
236 | I then fit the model with the training data,
237 | ```
238 | from sklearn.linear_model import LogisticRegression
239 | logreg = LogisticRegression()
240 | logreg.fit(X_train, y_train)
241 | ```
242 | and make predictions using the test set:
243 | ```
244 | y_pred_class = logreg.predict(X_test)
245 | from sklearn import metrics
246 | print('Accuracy score is:',metrics.accuracy_score(y_test, y_pred_class))
247 | ```
248 | and obtain:
249 | ```
250 | Accuracy score is: 0.7070707070707071
251 | ```
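
It can also be instructive to inspect the fitted coefficients (a sketch, not part of the original notebook; `feature_cols` and `logreg` are defined above):

```
# Sketch: pair each feature with its logistic-regression coefficient
coef_df = pd.DataFrame({'feature': feature_cols, 'coefficient': logreg.coef_[0]})
print(coef_df.sort_values('coefficient'))
```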
252 |
253 | The next step is to evaluate appropriate metrics. Using `scikit-learn` to calculate the AUC:
254 | ```
255 | y_pred_prob = logreg.predict_proba(X_test)[:, 1]
256 | auc_score = metrics.roc_auc_score(y_test, y_pred_prob)
257 | print('AUC is:', auc_score)
258 | ```
259 | I obtain the following `auc_score`:
260 | ```
261 | AUC is: 0.7546938775510204
262 | ```
263 | To plot the ROC curve I use:
264 | ```
265 | fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob)
266 | fig = plt.plot(fpr, tpr,label='ROC curve (area = %0.2f)' % auc_score )
267 | plt.plot([0, 1], [0, 1], 'k--')
268 | plt.xlim([0.0, 1.0])
269 | plt.ylim([0.0, 1.0])
270 | plt.title('ROC curve for win classifier')
271 | plt.xlabel('False Positive Rate (1 - Specificity)')
272 | plt.ylabel('True Positive Rate (Sensitivity)')
273 | plt.legend(loc="lower right")
274 | plt.grid(True)
275 | ```
276 |
277 |
278 |
279 |
281 |
282 |
283 |
284 | Now we must perform cross-validation.
285 | ```
286 | from sklearn.model_selection import cross_val_score  # cross_validation is deprecated in recent scikit-learn
287 | print('Mean CV score is:',cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean())
288 | ```
289 | The output is:
290 | ```
291 | Mean CV score is: 0.7287617728531856
292 | ```
293 |
294 |
295 | ## Decision Trees and Random Forests
296 |
297 |
298 | I now build a decision tree model to predict the likelihood of an upset in a given match:
299 |
300 | ```
301 | from sklearn.tree import DecisionTreeClassifier
302 | model = DecisionTreeClassifier()
303 | X = dfnew[feature_cols].dropna()
304 | y = dfnew['win']
305 | model.fit(X, y)
306 | ```
307 | Again performing cross-validation:
308 | ```
309 | scores = cross_val_score(model, X, y, scoring='roc_auc', cv=5)
310 | print('AUC {}, Average AUC {}'.format(scores, scores.mean()))
311 | model = DecisionTreeClassifier(
312 |     max_depth=4,
313 |     min_samples_leaf=6)
314 |
315 | model.fit(X, y)
316 | ```
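
The values `max_depth = 4` and `min_samples_leaf = 6` above were fixed by hand; a quick grid search is one way such values could be chosen (a sketch, not part of the original notebook, assuming `X` and `y` as defined above):

```
from sklearn.model_selection import GridSearchCV

# Sketch: small grid search over the two hand-tuned hyperparameters
param_grid = {'max_depth': [2, 3, 4, 5, 6], 'min_samples_leaf': [2, 4, 6, 8]}
grid = GridSearchCV(DecisionTreeClassifier(), param_grid, scoring='roc_auc', cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```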
317 |
318 |
319 |
320 |
321 |
322 |
323 |
324 |
325 | Evaluating the cross-validation score:
326 |
327 | ```
328 | scores = cross_val_score(model, X, y, scoring='roc_auc', cv=5)
329 | print('CV AUC {}, Average AUC {}'.format(scores, scores.mean()))
330 | ```
331 |
332 |
333 |
334 |
335 |
336 |
337 |
338 |
339 |
340 |
341 |
342 |
343 |
344 | Now I repeat the lines above using a random forest classifier:
345 | ```
346 | from sklearn.ensemble import RandomForestClassifier
347 | from sklearn.model_selection import cross_val_score  # cross_validation is deprecated in recent scikit-learn
348 | X = dfnew[feature_cols].dropna()
349 | y = dfnew['win']
350 | model = RandomForestClassifier(n_estimators = 200)
351 | model.fit(X, y)
352 | features = X.columns
353 | feature_importances = model.feature_importances_
354 | features_df = pd.DataFrame({'Features': features, 'Importance Score': feature_importances})
355 | features_df.sort_values('Importance Score', inplace=True, ascending=False)
356 | feature_importances = pd.Series(model.feature_importances_, index=X.columns)
357 | feature_importances = feature_importances.sort_values()  # sort_values returns a copy
358 | feature_importances.plot(kind="barh", figsize=(7,6))
359 | scores = cross_val_score(model, X, y, scoring='roc_auc')
360 | print('AUC {}, Average AUC {}'.format(scores, scores.mean()))
361 | for n_trees in range(1, 100, 10):
362 |     model = RandomForestClassifier(n_estimators=n_trees)
363 |     scores = cross_val_score(model, X, y, scoring='roc_auc')
364 |     print('n trees: {}, CV AUC {}, Average AUC {}'.format(n_trees, scores, scores.mean()))
365 | ```
366 |
367 |
368 |
369 |
370 |
371 |
372 |
373 |
374 |
375 |
376 |
377 | The identical analysis is carried out for `Best_of` = 3 and is therefore omitted from this README.
378 |
--------------------------------------------------------------------------------
/tennis/images/123.png:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/tennis/images/ATP_World_Tour.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/tennis/images/ATP_World_Tour.png
--------------------------------------------------------------------------------
/tennis/images/ROC.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/tennis/images/ROC.png
--------------------------------------------------------------------------------
/tennis/images/balanced.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/tennis/images/balanced.png
--------------------------------------------------------------------------------
/tennis/images/cv_score.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/tennis/images/cv_score.png
--------------------------------------------------------------------------------
/tennis/images/decisiontree.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/tennis/images/decisiontree.png
--------------------------------------------------------------------------------
/tennis/images/imbalance.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/tennis/images/imbalance.png
--------------------------------------------------------------------------------
/tennis/images/rf_features.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/tennis/images/rf_features.png
--------------------------------------------------------------------------------
/tennis/images/rounds.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/tennis/images/rounds.png
--------------------------------------------------------------------------------
/tennis/images/surfaces.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/tennis/images/surfaces.png
--------------------------------------------------------------------------------
/tennis/images/tennis_df.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/tennis/images/tennis_df.png
--------------------------------------------------------------------------------
/tennis/notebooks/123.png:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/tennis/slides/123.png:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/tennis/slides/Final_Project_Marco_Tavora_DATNYC41.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/tennis/slides/Final_Project_Marco_Tavora_DATNYC41.pdf
--------------------------------------------------------------------------------