├── README.md ├── analysis-of-opioid-prescription-problem ├── README.md ├── data │ ├── 123 │ ├── mhincome.csv │ ├── opioids.csv │ ├── overdoses.csv │ ├── overdosesnew.csv │ └── prescriber-info.csv ├── images │ ├── 123 │ └── opioids.png └── notebooks │ ├── 123 │ └── opioid-prescription-problem.ipynb ├── churn ├── README.md ├── data │ └── 123.png ├── images │ ├── 123.png │ ├── balancedchurn.png │ ├── baseline.png │ ├── cellphone.jpg │ ├── churnprob.png │ ├── cm.png │ ├── cms.png │ ├── cms1.png │ ├── cms2.png │ ├── df_churn_new.png │ ├── featurerf.png │ ├── imbalancechurn.png │ ├── model_comparison.png │ └── predictions.png └── notebooks │ └── predicting-customer-churn.ipynb ├── click-prediction ├── README.md ├── images │ ├── 123 │ └── click1.png ├── notebooks │ └── click-predictive-model.ipynb └── optimal-bidding-strategies-in-online-display-advertising .pdf ├── predicting-number-of-comments-on-reddit-using-random-forest-classifier ├── 123.png ├── README.md ├── images │ ├── 123.png │ ├── Reddit-logo.png │ ├── redditRF.png │ ├── redditpage.png │ └── redditwordshist.png └── notebooks │ ├── 123.png │ └── project-3-marco-tavora.ipynb ├── retail-strategy ├── README.md ├── data │ ├── 123 │ ├── ia_zip_city_county_sqkm.csv │ ├── iowa_incomes.xls │ └── pop_iowa_per_county.csv ├── images │ ├── 123 │ ├── 123.png │ ├── hm3.png │ ├── liquor.jpeg │ ├── output.png │ └── test.jpg └── notebooks │ └── retail-recommendations.ipynb └── tennis ├── 123.png ├── README.md ├── images ├── 123.png ├── ATP_World_Tour.png ├── ROC.png ├── balanced.png ├── cv_score.png ├── decisiontree.png ├── imbalance.png ├── rf_features.png ├── rounds.png ├── surfaces.png └── tennis_df.png ├── notebooks ├── 123.png └── Final_Project_Marco_Tavora-DATNYC41_GA.ipynb └── slides ├── 123.png └── Final_Project_Marco_Tavora_DATNYC41.pdf /README.md: -------------------------------------------------------------------------------- 1 | ## Supervised Machine Learning Projects 2 | 3 | ![image title](https://img.shields.io/badge/python-v3.6-green.svg) ![image title](https://img.shields.io/badge/nltk-v3.2.5-yellow.svg) ![Image title](https://img.shields.io/badge/sklearn-0.19.1-orange.svg) ![Image title](https://img.shields.io/badge/BeautifulSoup-4.6.0-blue.svg) ![Image title](https://img.shields.io/badge/pandas-0.22.0-red.svg) ![Image title](https://img.shields.io/badge/matplotlib-v2.1.2-orange.svg) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) 4 |
5 | 6 |

7 | 9 |

10 | 11 |
12 |

13 | 14 |

15 |
16 | 17 |

18 | Notebooks and descriptions • 19 | Contact Information 20 |

21 | 22 | 23 | ### Notebooks and descriptions 24 | | Notebook | Brief Description | 25 | |--------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------| 26 | |[predicting-comments-on-reddit](http://nbviewer.jupyter.org/github/marcotav/machine-learning-classification-projects/blob/master/predicting-number-of-comments-on-reddit-using-random-forest-classifier/notebooks/project-3-marco-tavora.ipynb) | In this project I determine which characteristics of a post on Reddit contribute most to the overall interaction as measured by number of comments.| 27 | |[tennis-matches-prediction](http://nbviewer.jupyter.org/github/marcotav/machine-learning-classification-projects/blob/master/tennis/notebooks/Final_Project_Marco_Tavora-DATNYC41_GA.ipynb) | The goal of the project is to predict the probability that the higher-ranked player will win a tennis match. I will call that a `win` (as opposed to an upset).| 28 | |[churn-analysis](http://nbviewer.jupyter.org/github/marcotav/machine-learning-classification-projects/blob/master/churn/notebooks/predicting-customer-churn.ipynb) | This project was done in collaboration with [Corey Girard](https://github.com/coreygirard/). A mobile device company is having a major problem with customer retention. Customers switching from one company to another is called churn. Our goal in this analysis is to understand the problem, identify behaviors which are strongly correlated with churn and to devise a solution.| 29 | |[click-prediction](http://nbviewer.jupyter.org/github/marcotav/machine-learning-classification-projects/blob/master/click-prediction/notebooks/click-predictive-model.ipynb) | Many ads are actually sold on a "pay-per-click" (PPC) basis, meaning the company only pays for ad clicks, not ad views. Thus your optimal approach (as a search engine) is actually to choose an ad based on "expected value", meaning the price of a click times the likelihood that the ad will be clicked [...] In order for you to maximize expected value, you therefore need to accurately predict the likelihood that a given ad will be clicked, also known as "click-through rate" (CTR). In this project I will predict the likelihood that a given online ad will be clicked.| 30 | | [retail-store-expansion-analysis-with-lasso-and-ridge-regressions](http://nbviewer.jupyter.org/github/marcotav/machine-learning-classification-projects/blob/master/retail-strategy/notebooks/retail-recommendations.ipynb) | Based on a dataset containing the spirits purchase information of Iowa Class E liquor licensees by product and date of purchase, this project provides recommendations on where to open new stores in the state of Iowa. To devise an expansion strategy, I first needed to understand the data, and for that I conducted a thorough exploratory data analysis (EDA). With the data in hand I built multivariate regression models of total sales by county, using both Lasso and Ridge regularization, and based on these models, I made recommendations about new locations.| 31 | 32 | 33 |
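The click-prediction description above turns on a simple piece of arithmetic: the expected value of showing an ad is the price of a click times the predicted click-through rate. A minimal illustration with made-up numbers (both ads and their prices/rates are hypothetical):

```
# expected value of showing an ad = price-per-click * predicted CTR
ads = {'ad_A': (0.50, 0.012), 'ad_B': (1.20, 0.004)}  # hypothetical (price, CTR) pairs
best_ad = max(ads, key=lambda ad: ads[ad][0] * ads[ad][1])
print(best_ad)  # 'ad_A', since 0.50 * 0.012 = 0.006 > 1.20 * 0.004 = 0.0048
```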

34 | 35 |

36 | 37 | 38 |

39 | 40 |

41 | 42 | 43 | ## Contact Information 44 | 45 | Feel free to contact me: 46 | 47 | * Email: [marcotav65@gmail.com](mailto:marcotav65@gmail.com) 48 | * GitHub: [marcotav](https://github.com/marcotav) 49 | * LinkedIn: [marco-tavora](https://www.linkedin.com/in/marco-tavora) 50 | * Website: [marcotavora.me](http://www.marcotavora.me) 51 | 52 | 53 | 54 | 55 | 56 | -------------------------------------------------------------------------------- /analysis-of-opioid-prescription-problem/README.md: -------------------------------------------------------------------------------- 1 | ## U.S. Opiate Prescriptions/Overdoses [[view code]](http://nbviewer.jupyter.org/github/marcotav/machine-learning-classification-projects/blob/master/analysis-of-opioid-prescription-problem/notebooks/opioid-prescription-problem.ipynb) 2 | ![image title](https://img.shields.io/badge/work-in%20progress-blue.svg) ![image title](https://img.shields.io/badge/python-v3.6-green.svg) ![Image title](https://img.shields.io/badge/sklearn-0.19.1-orange.svg) 3 | 4 | 5 | **The code is available [here](http://nbviewer.jupyter.org/github/marcotav/machine-learning-classification-projects/blob/master/analysis-of-opioid-prescription-problem/notebooks/opioid-prescription-problem.ipynb) or by clicking on the [view code] link above.** 6 | 7 | 8 | 9 | 10 | 11 | 12 |
13 |
14 |

15 | 16 |

17 |
18 | 19 |

20 | Brief Introduction • 21 | Dataset • 22 | Project Goal 23 |

24 | 25 | 26 | ## Brief Introduction 27 | 28 | Accidental death by fatal drug overdose is a rising trend in the United States. What can you do to help? (From Kaggle) 29 | 30 | 31 | ## Dataset 32 | 33 | This dataset contains: 34 | - Summaries of prescription records for 250 common **opioid** and ***non-opioid*** drugs written by 25,000 unique licensed medical professionals in 2014 in the United States for citizens covered under Class D Medicare 35 | - Metadata about the doctors themselves. 36 | - The data is in a format with one row per prescriber, reduced to 25,000 unique prescribers to keep it manageable. 37 | - The main data is in `prescriber-info.csv`. 38 | - There is also `opioids.csv`, which contains the names of all opioid drugs included in the data. 39 | - The file `overdoses.csv` contains information on opioid-related drug overdose fatalities. 40 | 41 | 42 | The data consists of the following characteristics for each prescriber: 43 | - NPI – unique National Provider Identifier number 44 | - Gender - (M/F) 45 | - State - U.S. State by abbreviation 46 | - Credentials - set of initials indicative of medical degree 47 | - Specialty - description of type of medicinal practice 48 | - A long list of drugs with numeric values indicating the total number of prescriptions written for the year by that individual 49 | - `Opioid.Prescriber` - a boolean label indicating whether or not that individual prescribed opiate drugs more than 10 times in the year 50 | 51 | 52 | ## Project Goal 53 | 54 | The increase in overdose fatalities is a well-known problem, and the search for possible solutions is an ongoing effort. This dataset can be used to detect sources of significant quantities of opiate prescriptions. 55 | 56 | -------------------------------------------------------------------------------- /analysis-of-opioid-prescription-problem/data/123: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /analysis-of-opioid-prescription-problem/data/mhincome.csv: -------------------------------------------------------------------------------- 1 | State,Income 2 | Mississippi,40593.00 3 | Arkansas,41995.00 4 | West Virginia,42019.00 5 | Alabama,44765.00 6 | Kentucky,45215.00 7 | New Mexico,45382.00 -------------------------------------------------------------------------------- /analysis-of-opioid-prescription-problem/data/opioids.csv: -------------------------------------------------------------------------------- 1 | Drug Name,Generic Name 2 | ABSTRAL,FENTANYL CITRATE 3 | ACETAMINOPHEN-CODEINE,ACETAMINOPHEN WITH CODEINE 4 | ACTIQ,FENTANYL CITRATE 5 | ASCOMP WITH CODEINE,CODEINE/BUTALBITAL/ASA/CAFFEIN 6 | ASPIRIN-CAFFEINE-DIHYDROCODEIN,DIHYDROCODEINE/ASPIRIN/CAFFEIN 7 | AVINZA,MORPHINE SULFATE 8 | BELLADONNA-OPIUM,OPIUM/BELLADONNA ALKALOIDS 9 | BUPRENORPHINE HCL,BUPRENORPHINE HCL 10 | BUTALB-ACETAMINOPH-CAFF-CODEIN,BUTALBIT/ACETAMIN/CAFF/CODEINE 11 | BUTALB-CAFF-ACETAMINOPH-CODEIN,BUTALBIT/ACETAMIN/CAFF/CODEINE 12 | BUTALBITAL COMPOUND-CODEINE,CODEINE/BUTALBITAL/ASA/CAFFEIN 13 | BUTORPHANOL TARTRATE,BUTORPHANOL TARTRATE 14 | BUTRANS,BUPRENORPHINE 15 | CAPITAL W-CODEINE,ACETAMINOPHEN WITH CODEINE 16 | CARISOPRODOL COMPOUND-CODEINE,CODEINE/CARISOPRODOL/ASPIRIN 17 | CARISOPRODOL-ASPIRIN-CODEINE,CODEINE/CARISOPRODOL/ASPIRIN 18 | CODEINE SULFATE,CODEINE SULFATE 19 | CO-GESIC,HYDROCODONE/ACETAMINOPHEN 20 | CONZIP,TRAMADOL HCL 21 | DEMEROL,MEPERIDINE HCL 22 | DEMEROL,MEPERIDINE HCL/PF 23 | 
DILAUDID,HYDROMORPHONE HCL 24 | DILAUDID,HYDROMORPHONE HCL/PF 25 | DILAUDID-HP,HYDROMORPHONE HCL/PF 26 | DISKETS,METHADONE HCL 27 | DOLOPHINE HCL,METHADONE HCL 28 | DURAGESIC,FENTANYL 29 | DURAMORPH,MORPHINE SULFATE/PF 30 | ENDOCET,OXYCODONE HCL/ACETAMINOPHEN 31 | ENDODAN,OXYCODONE HCL/ASPIRIN 32 | EXALGO,HYDROMORPHONE HCL 33 | FENTANYL,FENTANYL 34 | FENTANYL CITRATE,FENTANYL CITRATE 35 | FENTORA,FENTANYL CITRATE 36 | FIORICET WITH CODEINE,BUTALBIT/ACETAMIN/CAFF/CODEINE 37 | FIORINAL WITH CODEINE #3,CODEINE/BUTALBITAL/ASA/CAFFEIN 38 | HYCET,HYDROCODONE/ACETAMINOPHEN 39 | HYDROCODONE-ACETAMINOPHEN,HYDROCODONE/ACETAMINOPHEN 40 | HYDROCODONE-IBUPROFEN,HYDROCODONE/IBUPROFEN 41 | HYDROMORPHONE ER,HYDROMORPHONE HCL 42 | HYDROMORPHONE HCL,HYDROMORPHONE HCL 43 | HYDROMORPHONE HCL,HYDROMORPHONE HCL/PF 44 | IBUDONE,HYDROCODONE/IBUPROFEN 45 | INFUMORPH,MORPHINE SULFATE/PF 46 | KADIAN,MORPHINE SULFATE 47 | LAZANDA,FENTANYL CITRATE 48 | LEVORPHANOL TARTRATE,LEVORPHANOL TARTRATE 49 | LORCET,HYDROCODONE/ACETAMINOPHEN 50 | LORCET 10-650,HYDROCODONE/ACETAMINOPHEN 51 | LORCET HD,HYDROCODONE/ACETAMINOPHEN 52 | LORCET PLUS,HYDROCODONE/ACETAMINOPHEN 53 | LORTAB,HYDROCODONE/ACETAMINOPHEN 54 | MAGNACET,OXYCODONE HCL/ACETAMINOPHEN 55 | MEPERIDINE HCL,MEPERIDINE HCL 56 | MEPERIDINE HCL,MEPERIDINE HCL/PF 57 | MEPERITAB,MEPERIDINE HCL 58 | METHADONE HCL,METHADONE HCL 59 | METHADONE INTENSOL,METHADONE HCL 60 | METHADOSE,METHADONE HCL 61 | MORPHINE SULFATE,MORPHINE SULFATE 62 | MORPHINE SULFATE,MORPHINE SULFATE/PF 63 | MORPHINE SULFATE ER,MORPHINE SULFATE 64 | MS CONTIN,MORPHINE SULFATE 65 | NALBUPHINE HCL,NALBUPHINE HCL 66 | NORCO,HYDROCODONE/ACETAMINOPHEN 67 | NUCYNTA,TAPENTADOL HCL 68 | NUCYNTA ER,TAPENTADOL HCL 69 | OPANA,OXYMORPHONE HCL 70 | OPANA ER,OXYMORPHONE HCL 71 | OPIUM TINCTURE,OPIUM TINCTURE 72 | OXECTA,OXYCODONE HCL 73 | OXYCODONE HCL,OXYCODONE HCL 74 | OXYCODONE HCL ER,OXYCODONE HCL 75 | OXYCODONE HCL-ASPIRIN,OXYCODONE HCL/ASPIRIN 76 | OXYCODONE HCL-IBUPROFEN,IBUPROFEN/OXYCODONE HCL 77 | OXYCODONE-ACETAMINOPHEN,OXYCODONE HCL/ACETAMINOPHEN 78 | OXYCONTIN,OXYCODONE HCL 79 | OXYMORPHONE HCL,OXYMORPHONE HCL 80 | OXYMORPHONE HCL ER,OXYMORPHONE HCL 81 | PENTAZOCINE-ACETAMINOPHEN,PENTAZOCINE HCL/ACETAMINOPHEN 82 | PENTAZOCINE-NALOXONE HCL,PENTAZOCINE HCL/NALOXONE HCL 83 | PERCOCET,OXYCODONE HCL/ACETAMINOPHEN 84 | PERCODAN,OXYCODONE HCL/ASPIRIN 85 | PRIMLEV,OXYCODONE HCL/ACETAMINOPHEN 86 | REPREXAIN,HYDROCODONE/IBUPROFEN 87 | ROXICET,OXYCODONE HCL/ACETAMINOPHEN 88 | ROXICODONE,OXYCODONE HCL 89 | RYBIX ODT,TRAMADOL HCL 90 | STAGESIC,HYDROCODONE/ACETAMINOPHEN 91 | SUBSYS,FENTANYL 92 | SYNALGOS-DC,DIHYDROCODEINE/ASPIRIN/CAFFEIN 93 | TALWIN,PENTAZOCINE LACTATE 94 | TRAMADOL HCL,TRAMADOL HCL 95 | TRAMADOL HCL ER,TRAMADOL HCL 96 | TRAMADOL HCL-ACETAMINOPHEN,TRAMADOL HCL/ACETAMINOPHEN 97 | TREZIX,DHCODEINE BT/ACETAMINOPHN/CAFF 98 | TYLENOL-CODEINE NO.3,ACETAMINOPHEN WITH CODEINE 99 | TYLENOL-CODEINE NO.4,ACETAMINOPHEN WITH CODEINE 100 | ULTRACET,TRAMADOL HCL/ACETAMINOPHEN 101 | ULTRAM,TRAMADOL HCL 102 | ULTRAM ER,TRAMADOL HCL 103 | VICODIN,HYDROCODONE/ACETAMINOPHEN 104 | VICODIN ES,HYDROCODONE/ACETAMINOPHEN 105 | VICODIN HP,HYDROCODONE/ACETAMINOPHEN 106 | VICOPROFEN,HYDROCODONE/IBUPROFEN 107 | XARTEMIS XR,OXYCODONE HCL/ACETAMINOPHEN 108 | XODOL 10-300,HYDROCODONE/ACETAMINOPHEN 109 | XODOL 5-300,HYDROCODONE/ACETAMINOPHEN 110 | XODOL 7.5-300,HYDROCODONE/ACETAMINOPHEN 111 | XYLON 10,HYDROCODONE/IBUPROFEN 112 | ZAMICET,HYDROCODONE/ACETAMINOPHEN 113 | ZOHYDRO ER,HYDROCODONE BITARTRATE 114 | 
ZOLVIT,HYDROCODONE/ACETAMINOPHEN -------------------------------------------------------------------------------- /analysis-of-opioid-prescription-problem/data/overdoses.csv: -------------------------------------------------------------------------------- 1 | "State","Population","Deaths","Abbrev" 2 | "Alabama","4,833,722","723","AL" 3 | "Alaska","735,132","124","AK" 4 | "Arizona","6,626,624","1,211","AZ" 5 | "Arkansas","2,959,373","356","AR" 6 | "California","38,332,521","4,521","CA" 7 | "Colorado","5,268,367","899","CO" 8 | "Connecticut","3,596,080","623","CT" 9 | "Delaware","925,749","189","DE" 10 | "Florida","19,552,860","2,634","FL" 11 | "Georgia","9,992,167","1,206","GA" 12 | "Hawaii","1,404,054","157","HI" 13 | "Idaho","1,612,136","212","ID" 14 | "Illinois","12,882,135","1,705","IL" 15 | "Indiana","6,570,902","1,172","IN" 16 | "Iowa","3,090,416","264","IA" 17 | "Kansas","2,893,957","332","KS" 18 | "Kentucky","4,395,295","1,077","KY" 19 | "Louisiana","4,625,470","777","LA" 20 | "Maine","1,328,302","216","ME" 21 | "Maryland","5,928,814","1,070","MD" 22 | "Massachusetts","6,692,824","1,289","MA" 23 | "Michigan","9,895,622","1,762","MI" 24 | "Minnesota","5,420,380","517","MN" 25 | "Mississippi","2,991,207","336","MS" 26 | "Missouri","6,044,171","1,067","MO" 27 | "Montana","1,015,165","125","MT" 28 | "Nebraska","1,868,516","125","NE" 29 | "Nevada","2,790,136","545","NV" 30 | "New Hampshire","1,323,459","334","NH" 31 | "New Jersey","8,899,339","1,253","NJ" 32 | "New Mexico","2,085,287","547","NM" 33 | "New York","19,651,127","2,300","NY" 34 | "North Carolina","9,848,060","1,358","NC" 35 | "North Dakota","723,393","43","ND" 36 | "Ohio","11,570,808","2,744","OH" 37 | "Oklahoma","3,850,568","777","OK" 38 | "Oregon","3,930,065","522","OR" 39 | "Pennsylvania","12,773,801","2,732","PA" 40 | "Rhode Island","1,051,511","247","RI" 41 | "South Carolina","4,774,839","701","SC" 42 | "South Dakota","844,877","63","SD" 43 | "Tennessee","6,495,978","1,269","TN" 44 | "Texas","26,448,193","2,601","TX" 45 | "Utah","2,900,872","603","UT" 46 | "Vermont","626,630","83","VT" 47 | "Virginia","8,260,405","980","VA" 48 | "Washington","6,971,406","979","WA" 49 | "West Virginia","1,854,304","627","WV" 50 | "Wisconsin","5,742,713","853","WI" 51 | "Wyoming","582,658","109","WY" 52 | -------------------------------------------------------------------------------- /analysis-of-opioid-prescription-problem/data/overdosesnew.csv: -------------------------------------------------------------------------------- 1 | ,State,Population,Deaths,Abbrev,Deaths/Population 2 | 0,Alabama,4833722,723,AL,0.0001495741790694624 3 | 1,Alaska,735132,124,AK,0.00016867718994683948 4 | 2,Arizona,6626624,1211,AZ,0.00018274765551810394 5 | 3,Arkansas,2959373,356,AR,0.00012029575183662215 6 | 4,California,38332521,4521,CA,0.00011794162977175439 7 | 5,Colorado,5268367,899,CO,0.00017064111137284096 8 | 6,Connecticut,3596080,623,CT,0.00017324419923917154 9 | 7,Delaware,925749,189,DE,0.00020415901070376527 10 | 8,Florida,19552860,2634,FL,0.0001347117506083509 11 | 9,Georgia,9992167,1206,GA,0.000120694540033208 12 | 10,Hawaii,1404054,157,HI,0.00011181906109024297 13 | 11,Idaho,1612136,212,ID,0.00013150255313447502 14 | 12,Illinois,12882135,1705,IL,0.00013235383731035268 15 | 13,Indiana,6570902,1172,IN,0.00017836211832104634 16 | 14,Iowa,3090416,264,IA,8.542539256850857e-05 17 | 15,Kansas,2893957,332,KS,0.00011472181514790993 18 | 16,Kentucky,4395295,1077,KY,0.00024503474738328144 19 | 17,Louisiana,4625470,777,LA,0.00016798292930231954 20 | 
18,Maine,1328302,216,ME,0.00016261362250452081 21 | 19,Maryland,5928814,1070,MD,0.00018047454347530552 22 | 20,Massachusetts,6692824,1289,MA,0.0001925943368598965 23 | 21,Michigan,9895622,1762,MI,0.00017805853942278718 24 | 22,Minnesota,5420380,517,MN,9.538076666211594e-05 25 | 23,Mississippi,2991207,336,MS,0.00011232923699362832 26 | 24,Missouri,6044171,1067,MO,0.00017653372149795232 27 | 25,Montana,1015165,125,MT,0.00012313269271497737 28 | 26,Nebraska,1868516,125,NE,6.689800890118148e-05 29 | 27,Nevada,2790136,545,NV,0.00019533098028196475 30 | 28,New Hampshire,1323459,334,NH,0.0002523689815853759 31 | 29,New Jersey,8899339,1253,NJ,0.0001407969737977169 32 | 30,New Mexico,2085287,547,NM,0.0002623140124117208 33 | 31,New York,19651127,2300,NY,0.00011704163328647767 34 | 32,North Carolina,9848060,1358,NC,0.00013789517935512173 35 | 33,North Dakota,723393,43,ND,5.9442101319752885e-05 36 | 34,Ohio,11570808,2744,OH,0.00023714852065646582 37 | 35,Oklahoma,3850568,777,OK,0.00020178841147591733 38 | 36,Oregon,3930065,522,OR,0.00013282223067557407 39 | 37,Pennsylvania,12773801,2732,PA,0.00021387525921219534 40 | 38,Rhode Island,1051511,247,RI,0.00023490006286191965 41 | 39,South Carolina,4774839,701,SC,0.00014681123279758752 42 | 40,South Dakota,844877,63,SD,7.456706715888821e-05 43 | 41,Tennessee,6495978,1269,TN,0.00019535164681900091 44 | 42,Texas,26448193,2601,TX,9.83432025015849e-05 45 | 43,Utah,2900872,603,UT,0.00020786853056598153 46 | 44,Vermont,626630,83,VT,0.00013245455851140225 47 | 45,Virginia,8260405,980,VA,0.00011863825078794562 48 | 46,Washington,6971406,979,WA,0.00014043078254228773 49 | 47,West Virginia,1854304,627,WV,0.000338132258788203 50 | 48,Wisconsin,5742713,853,WI,0.00014853606648982808 51 | 49,Wyoming,582658,109,WY,0.00018707372077616716 52 | -------------------------------------------------------------------------------- /analysis-of-opioid-prescription-problem/images/123: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /analysis-of-opioid-prescription-problem/images/opioids.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/analysis-of-opioid-prescription-problem/images/opioids.png -------------------------------------------------------------------------------- /analysis-of-opioid-prescription-problem/notebooks/123: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /churn/README.md: -------------------------------------------------------------------------------- 1 | ## Churn Analysis [[view code]](http://nbviewer.jupyter.org/github/marcotav/machine-learning-classification-projects/blob/master/churn/notebooks/predicting-customer-churn.ipynb) 2 | ![image title](https://img.shields.io/badge/work-in%20progress-blue.svg) ![image title](https://img.shields.io/badge/statsmodels-v0.8.0-blue.svg) ![Image title](https://img.shields.io/badge/sklearn-0.19.1-orange.svg) ![Image title](https://img.shields.io/badge/seaborn-v0.8.1-yellow.svg) ![Image title](https://img.shields.io/badge/pandas-0.22.0-red.svg) ![Image title](https://img.shields.io/badge/numpy-1.14.2-green.svg) ![Image title](https://img.shields.io/badge/matplotlib-v2.1.2-orange.svg) 3 | 4 | **The code is available 
[here](http://nbviewer.jupyter.org/github/marcotav/machine-learning-classification-projects/blob/master/churn/notebooks/predicting-customer-churn.ipynb) or by clicking on the [view code] link above.** 5 | 6 | This project was done in collaboration with [Corey Girard](https://github.com/coreygirard/). 7 | 8 |
9 | 10 |

11 | 12 |

13 |

14 | Goals • 15 | Why is this important? • 16 | Importing modules and reading the data • 17 | Data Handling and Feature Engineering • 18 | Features and target • 19 | Using `pandas-profiling` and rejecting variables with correlations above 0.9 • 20 | Scaling • 21 | Model Comparison • 22 | Building a random forest classifier using GridSearch to optimize hyperparameters 23 |

24 | 25 | 26 | 27 | ### Goals 28 | From Wikipedia, 29 | 30 | > Churn rate is a measure of the number of individuals or items moving out of a collective group over a specific period. It is one of two primary factors that determine the steady-state level of customers a business will support [...] It is an important factor for any business with a subscriber-based service model, [such as] mobile telephone networks. 31 | 32 | Our goal in this analysis was to predict the churn rate for a mobile phone company based on customer attributes including: 33 | - Area code 34 | - Call duration at different hours 35 | - Charges 36 | - Account length 37 | 38 | See [this website](http://blog.yhat.com/posts/predicting-customer-churn-with-sklearn.html) for a similar analysis. 39 | 40 | 41 | ### Why is this important? 42 | 43 | It is a well-known fact that in several businesses (particularly the ones involving subscriptions), the acquisition of new customers costs much more than the retention of existing ones. A thorough analysis of what causes churn and how to predict it can be used to build efficient customer retention strategies. 44 | 45 | 46 | ## Importing modules and reading the data 47 | ``` 48 | from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV 49 | from sklearn.ensemble import RandomForestClassifier 50 | import pandas as pd 51 | import seaborn as sns 52 | import numpy as np 53 | import matplotlib.pyplot as plt 54 | %matplotlib inline 55 | ``` 56 | Reading the data: 57 | ``` 58 | df = pd.read_csv("data.csv") 59 | ``` 60 |
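Since the analysis below revolves around class imbalance and churn probabilities, a quick sanity check on the target right after loading is useful; a minimal sketch, assuming `data.csv` contains the `churn` column used throughout:

```
# fraction of customers in each class; churn is typically the minority class
print(df['churn'].value_counts(normalize=True))
```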

61 | 62 |

63 | 64 | 65 | ## Data Handling and Feature Engineering 66 | In this section the following steps are taken: 67 | - Conversion of strings into booleans 68 | - Conversion of booleans to integers 69 | - Converting the states column into dummy columns 70 | - Creation of several new features (feature engineering) 71 | 72 | The commented code follows (most of the lines were omitted for brevity): 73 | ``` 74 | # convert binary strings to boolean ints 75 | df['international_plan'] = df.international_plan.replace({'Yes': 1, 'No': 0}) 76 | # convert booleans to boolean ints 77 | df['churn'] = df.churn.replace({True: 1, False: 0}) 78 | # handle state and area code dummies 79 | state_dummies = pd.get_dummies(df.state) 80 | state_dummies.columns = ['state_'+c.lower() for c in state_dummies.columns.values] 81 | df.drop('state', axis='columns', inplace=True) 82 | df = pd.concat([df, state_dummies], axis='columns') 83 | area_dummies = pd.get_dummies(df.area_code) 84 | area_dummies.columns = ['area_code_'+str(c) for c in area_dummies.columns.values] 85 | df.drop('area_code', axis='columns', inplace=True) 86 | df = pd.concat([df, area_dummies], axis='columns') 87 | # feature engineering 88 | df['total_minutes'] = df.total_day_minutes + df.total_eve_minutes + df.total_intl_minutes 89 | df['total_calls'] = df.total_day_calls + df.total_eve_calls + df.total_intl_calls 90 | ``` 91 | 92 | 93 | ### Features and target 94 | Defining the features matrix and the target (the churn): 95 | ``` 96 | X = df[[c for c in df.columns if c != 'churn']] 97 | y = df.churn 98 | ``` 99 | 100 | 101 | ### Using `pandas-profiling` and rejecting variables with correlations above 0.9 102 | 103 | The package `pandas-profiling` contains a method `get_rejected_variables(threshold)` which identifies variables with correlation higher than a threshold. 104 | ``` 105 | import pandas_profiling 106 | profile = pandas_profiling.ProfileReport(X) 107 | rejected_variables = profile.get_rejected_variables(threshold=0.9) 108 | X = X.drop(rejected_variables,axis=1) 109 | ``` 110 | 111 | ### Scaling 112 | ``` 113 | from sklearn.preprocessing import StandardScaler 114 | cols = X.columns.tolist() 115 | scaler = StandardScaler() 116 | X[cols] = scaler.fit_transform(X[cols]) 117 | X = X[cols] 118 | ``` 119 | We can now build our models. 120 | 121 | 122 | ## Model Comparison 123 | 124 | We can write a for loop that does the following: 125 | - Iterates over a list of models, in this case LogisticRegression, GaussianNB, KNeighborsClassifier and LinearSVC 126 | - Trains each model using the training dataset X_train and y_train 127 | - Predicts the target using the test features X_test 128 | - Calculates the `f1_score` and cross-validation score 129 | - Builds a dataframe with that information 130 | 131 | The code will also print out the confusion matrix from which "recall" and "precision" can be calculated: 132 | - When a consumer churns, how often does my classifier predict that to happen? This is the "recall". 133 | - When the model predicts a churn, how often does that user actually churn? 
This is the "precision". 134 | 135 | ``` 136 | X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, 137 | test_size=0.25, random_state=0) 138 | 139 | models = [LogisticRegression, GaussianNB, 140 | KNeighborsClassifier, LinearSVC] 141 | 142 | lst = [] 143 | for model in models: 144 | clf = model().fit(X_train, y_train) 145 | y_pred = clf.predict(X_test) 146 | lst.append([i for i in (model.__name__, 147 | round(metrics.f1_score(y_test, 148 | y_pred, 149 | average="macro"),3))]) 150 | df = pd.DataFrame(lst, columns=['Model','f1_score']) 151 | 152 | lst_av_cross_val_scores = [] 153 | 154 | for model in models: 155 | clf = model() 156 | cross_val_scores = (model.__name__, cross_val_score(clf, X, y, cv=5)) 157 | av_cross_val_scores = list(cross_val_scores)[1].mean() 158 | lst_av_cross_val_scores.append(round(av_cross_val_scores,3)) 159 | 160 | model_names = [model.__name__ for model in models] 161 | 162 | df1 = pd.DataFrame(list(zip(model_names, lst_av_cross_val_scores))) 163 | df1.columns = ['Model','Average Cross-Validation'] 164 | df_all = pd.concat([df1,df['f1_score']],axis=1) 165 | ``` 166 |
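The block above assumes the four estimators and the `model_selection` and `metrics` modules are already imported, and the confusion-matrix loop in the next section expects a list `y_pred_lst` of per-model test predictions; a minimal sketch filling in both (import paths are the standard scikit-learn ones):

```
from sklearn import model_selection, metrics
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

# collect each model's test-set predictions for the confusion matrices below
y_pred_lst = [model().fit(X_train, y_train).predict(X_test) for model in models]
```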

167 | 168 |

169 | 170 | If we use cross-validation as our metric, we see that the `KNeighborsClassifier` has the best performance. 171 | 172 | Now we will look at confusion matrices. These are obtained as follows: 173 | 174 | ``` 175 | models_names = ['LogisticRegression', 'GaussianNB', 'KNeighborsClassifier', 'LinearSVC'] 176 | i=0 177 | for preds in y_pred_lst: 178 | print('Confusion Matrix for:',models_names[i]) 179 | i +=1 180 | print('') 181 | cm = pd.crosstab(pd.concat([X_test,y_test],axis=1)['churn'], preds, 182 | rownames=['Actual Values'], colnames=['Predicted Values']) 183 | recall = round(cm.iloc[1,1]/(cm.iloc[1,0]+cm.iloc[1,1]),3) 184 | precision = round(cm.iloc[1,1]/(cm.iloc[0,1]+cm.iloc[1,1]),3) 185 | print(cm) # display the confusion matrix 186 | print('Recall for {} is:'.format(models_names[i-1]),recall) 187 | print('Precision for {} is:'.format(models_names[i-1]),precision,'\n') 188 | print('------------------------------------------------------------ \n') 189 | ``` 190 | The output is: 191 | 192 |

193 | 194 |

195 | 196 | The highest recall is from `GaussianNB` and the highest precision from `KNeighborsClassifier`. 197 | 198 | 199 | ### Finding best hyperparameters 200 | As a complement, let us use a Random Forest Classifier with GridSearch for hyperparameter optimization. 201 | 202 | 203 | ``` 204 | n_estimators = list(range(20,160,10)) 205 | max_depth = list(range(2, 16, 2)) + [None] 206 | def rfscore(X,y,test_size,n_estimators,max_depth): 207 | 208 | X_train, X_test, y_train, y_test = train_test_split(X, 209 | y, test_size = test_size, random_state=42) 210 | rf_params = { 211 | 'n_estimators':n_estimators, 212 | 'max_depth':max_depth} # parameters for grid search 213 | rf_gs = GridSearchCV(RandomForestClassifier(), rf_params, cv=5, verbose=1, n_jobs=-1) 214 | rf_gs.fit(X_train,y_train) # training the random forest with all possible parameters 215 | max_depth_best = rf_gs.best_params_['max_depth'] # getting the best max_depth 216 | n_estimators_best = rf_gs.best_params_['n_estimators'] # getting the best n_estimators 217 | print("best max_depth:",max_depth_best) 218 | print("best n_estimators:",n_estimators_best) 219 | best_rf_gs = RandomForestClassifier(max_depth=max_depth_best,n_estimators=n_estimators_best) # instantiate the best model 220 | best_rf_gs.fit(X_train,y_train) # fitting the best model 221 | best_rf_score = best_rf_gs.score(X_test,y_test) 222 | print ("best score is:",round(best_rf_score,3)) 223 | preds = best_rf_gs.predict(X_test) 224 | df_pred = pd.DataFrame(np.array(preds).reshape(len(preds),1)) 225 | df_pred.columns = ['predictions'] 226 | print('Features and their importance:\n') 227 | feature_importances = pd.Series(best_rf_gs.feature_importances_, index=X.columns).sort_values().tail(10) 228 | print(feature_importances) 229 | print(feature_importances.plot(kind="barh", figsize=(6,6))) 230 | return (df_pred,max_depth_best,n_estimators_best) 231 | 232 | 233 | triple = rfscore(X,y,0.3,n_estimators,max_depth) 234 | ``` 235 | ``` 236 | df_pred = triple[0] 237 | ``` 238 | The predictions are: 239 | ``` 240 | df_pred['predictions'].value_counts()/df_pred.shape[0] 241 | ``` 242 | 243 |
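The recall and precision quoted in the Cross Validation section below can be recovered from these predictions; a minimal sketch that re-creates the split used inside `rfscore` (`test_size=0.3`, `random_state=42`) so that `y_test` lines up row-by-row with `df_pred`:

```
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score, precision_score

# same split as inside rfscore, so the rows of df_pred match y_test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print('recall:', round(recall_score(y_test, df_pred['predictions']), 3))
print('precision:', round(precision_score(y_test, df_pred['predictions']), 3))
```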

244 | 245 |

246 | 247 | 248 | 249 | ### Cross Validation 250 | ``` 251 | def cv_score(X,y,cv,n_estimators,max_depth): 252 | rf = RandomForestClassifier(n_estimators=n_estimators, 253 | max_depth=max_depth) # use the arguments passed to the function 254 | s = cross_val_score(rf, X, y, cv=cv, n_jobs=-1) 255 | return("{} Score is :{:0.3} ± {:0.3}".format("Random Forest", s.mean().round(3), s.std().round(3))) 256 | ``` 257 | ``` 258 | dict_best = {'max_depth': triple[1], 'n_estimators': triple[2]} 259 | n_estimators_best = dict_best['n_estimators'] 260 | max_depth_best = dict_best['max_depth'] 261 | cv_score(X,y,5,n_estimators_best,max_depth_best) 262 | ``` 263 | The output is: 264 | ``` 265 | 'Random Forest Score is :0.774 ± 0.054' 266 | ``` 267 | 268 | For the random forest, the recall and precision found are: 269 | 270 | ``` 271 | recall: 0.286 272 | precision: 0.727 273 | ``` 274 | 275 | Both the cross-validation score and the precision of our `RandomForestClassifier` are the highest among the five models investigated. 276 | -------------------------------------------------------------------------------- /churn/data/123.png: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /churn/images/123.png: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /churn/images/balancedchurn.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/churn/images/balancedchurn.png -------------------------------------------------------------------------------- /churn/images/baseline.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/churn/images/baseline.png -------------------------------------------------------------------------------- /churn/images/cellphone.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/churn/images/cellphone.jpg -------------------------------------------------------------------------------- /churn/images/churnprob.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/churn/images/churnprob.png -------------------------------------------------------------------------------- /churn/images/cm.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/churn/images/cm.png -------------------------------------------------------------------------------- /churn/images/cms.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/churn/images/cms.png -------------------------------------------------------------------------------- /churn/images/cms1.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/churn/images/cms1.png -------------------------------------------------------------------------------- /churn/images/cms2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/churn/images/cms2.png -------------------------------------------------------------------------------- /churn/images/df_churn_new.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/churn/images/df_churn_new.png -------------------------------------------------------------------------------- /churn/images/featurerf.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/churn/images/featurerf.png -------------------------------------------------------------------------------- /churn/images/imbalancechurn.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/churn/images/imbalancechurn.png -------------------------------------------------------------------------------- /churn/images/model_comparison.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/churn/images/model_comparison.png -------------------------------------------------------------------------------- /churn/images/predictions.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/churn/images/predictions.png -------------------------------------------------------------------------------- /click-prediction/README.md: -------------------------------------------------------------------------------- 1 | ## Predicting clicks on ads [[view code]](http://nbviewer.jupyter.org/github/marcotav/machine-learning-classification-projects/blob/master/click-prediction/notebooks/click-predictive-model.ipynb) 2 | ![image title](https://img.shields.io/badge/work-in%20progress-blue.svg) ![image title](https://img.shields.io/badge/statsmodels-v0.8.0-blue.svg) ![Image title](https://img.shields.io/badge/sklearn-0.19.1-orange.svg) ![Image title](https://img.shields.io/badge/seaborn-v0.8.1-yellow.svg) ![Image title](https://img.shields.io/badge/pandas-0.22.0-red.svg) ![Image title](https://img.shields.io/badge/numpy-1.14.2-green.svg) ![Image title](https://img.shields.io/badge/matplotlib-v2.1.2-orange.svg) 3 | 4 | 5 | **The code is available [here](http://nbviewer.jupyter.org/github/marcotav/machine-learning-classification-projects/blob/master/click-prediction/notebooks/click-predictive-model.ipynb) or by clicking on the [view code] link above.** 6 | 7 |
8 | 9 |

10 | 11 |

12 | 13 | 14 | ## Problem Statement 15 | 16 | Borrowing from [here](https://turi.com/learn/gallery/notebooks/click_through_rate_prediction_intro.html): 17 | 18 | 19 | > Many ads are actually sold on a "pay-per-click" (PPC) basis, meaning the company only pays for ad clicks, not ad views. Thus your optimal approach (as a search engine) is actually to choose an ad based on "expected value", meaning the price of a click times the likelihood that the ad will be clicked [...] In order for you to maximize expected value, you therefore need to accurately predict the likelihood that a given ad will be clicked, also known as "click-through rate" (CTR). 20 | 21 | In this project I will predict the likelihood that a given online ad will be clicked. 22 | 23 | ## Dataset 24 | 25 | - The two files `train_click.csv` and `test_click.csv` contain ad impression attributes from a campaign. 26 | - Each row in `train_click.csv` includes a `click` column. 27 | 28 | ## Import the relevant libraries and the files 29 | 30 | ``` 31 | import numpy as np 32 | import pandas as pd 33 | import matplotlib.pyplot as plt 34 | from fancyimpute import BiScaler, KNN, NuclearNormMinimization, SoftImpute # used for feature imputation algorithms 35 | pd.set_option('display.max_columns', None) # display all columns 36 | pd.set_option('display.max_rows', None) # display all rows 37 | %matplotlib inline 38 | from IPython.core.interactiveshell import InteractiveShell 39 | InteractiveShell.ast_node_interactivity = "all" # so we can see the value of multiple statements at once. 40 | ``` 41 | 42 | ## Import the data 43 | 44 | ``` 45 | train = pd.read_csv('train_click.csv',index_col=0) 46 | test = pd.read_csv('test_click.csv',index_col=0) 47 | ``` 48 | 49 | ## Data Dictionary 50 | 51 | The meaning of the columns follows: 52 | - `location` – ad placement in the website 53 | - `carrier` – mobile carrier 54 | - `device` – type of device, e.g. phone, tablet or computer 55 | - `day` – weekday the user saw the ad 56 | - `hour` – hour the user saw the ad 57 | - `dimension` – size of the ad 58 | 59 | ## Imbalance 60 | The `click` column is **heavily** unbalanced. I will correct for this later. 61 | 62 | ``` 63 | import aux_func_v2 as af 64 | af.s_to_df(train['click'].value_counts()) 65 | ``` 66 | 67 | ### Checking the variance of each feature 68 | 69 | Let's quickly study the variance of the features to have an estimate of their impact on clicks. But let us first consider the cardinalities. 
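For reference, the per-column cardinalities computed step by step in the next two cells can also be read off with a one-liner:

```
# number of unique values per column, largest first
train.nunique().sort_values(ascending=False)
```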
70 | 71 | #### Train set cardinalities 72 | 73 | ``` 74 | cardin_train = [train[col].nunique() for col in train.columns.tolist()] 75 | cols = [col for col in train.columns.tolist()] 76 | d = {k:v for (k, v) in zip(cols,cardin_train)} 77 | cardinal_train = pd.DataFrame(list(d.items()), columns=['column', 'cardinality']) 78 | cardinal_train.sort_values('cardinality',ascending=False) 79 | ``` 80 | 81 | #### Test set cardinalities 82 | ``` 83 | cardin_test = [test[col].nunique() for col in test.columns.tolist()] 84 | cols = [col for col in test.columns.tolist()] 85 | d = {k:v for (k, v) in zip(cols,cardin_test)} 86 | cardinal_test = pd.DataFrame(list(d.items()), columns=['column', 'cardinality']) 87 | cardinal_test.sort_values('cardinality',ascending=False) 88 | ``` 89 | 90 | #### High and low cardinality in the training data 91 | 92 | We can set *arbitrary* thresholds to determine the level of cardinality in the feature categories: 93 | 94 | ``` 95 | target = 'click' 96 | cardinal_train_threshold = 33 # our choice 97 | low_cardinal_train = cardinal_train[cardinal_train['cardinality'] 98 | <= cardinal_train_threshold]['column'].tolist() 99 | low_cardinal_train.remove(target) 100 | high_cardinal_train = cardinal_train[cardinal_train['cardinality'] 101 | > cardinal_train_threshold]['column'].tolist() 102 | print('Features with low cardinal_train:\n',low_cardinal_train) 103 | print('') 104 | print('Features with high cardinal_train:\n',high_cardinal_train) 105 | ``` 106 | 107 | #### High and low cardinality in the test data 108 | 109 | ``` 110 | cardinal_test_threshold = 25 # chosen for low_cardinal_set to agree with low_cardinal_train 111 | low_cardinal_test = cardinal_test[cardinal_test['cardinality'] 112 | <= cardinal_test_threshold]['column'].tolist() 113 | high_cardinal_test = cardinal_test[cardinal_test['cardinality'] 114 | > cardinal_test_threshold]['column'].tolist() 115 | print('Features with low cardinal_test:\n',low_cardinal_test) 116 | print('') 117 | print('Features with high cardinal_test:\n',high_cardinal_test) 118 | ``` 119 | 120 | #### Now let's look at the features' variances. 121 | 122 | From the bar plot below we see that `device_type` has non-negligible variance 123 | 124 | ``` 125 | from matplotlib import pyplot 126 | import matplotlib.pyplot as plt 127 | 128 | for col in low_cardinal_train: 129 | ax = train[target].groupby(train[col]).sum().plot(kind='bar', 130 | title ="Clicks per " + col, 131 | figsize=(10, 5), fontsize=12); 132 | ax.set_xlabel(col, fontsize=12); 133 | ax.set_ylabel("Clicks", fontsize=12); 134 | plt.show(); 135 | ``` 136 | 137 | ### Dropping some features 138 | 139 | Notice that some of the features are massively dominated by **just one level**. We will drop those. We have to 140 | do that for both train and test sets: 141 | 142 | ``` 143 | cols_to_drop = ['location'] 144 | train_new = train.drop(cols_to_drop,axis=1) 145 | test_new = test.drop(cols_to_drop,axis=1) 146 | ``` 147 | 148 | 149 | ### Data types 150 | 151 | ``` 152 | train_new.dtypes 153 | test_new.dtypes 154 | ``` 155 | 156 | #### Converting some of the integer columns into strings: 157 | 158 | ``` 159 | cols_to_convert = test_new.columns.tolist() 160 | for col in cols_to_convert: 161 | train_new[col] = train_new[col].astype(str) 162 | test_new[col] = test_new[col].astype(str) 163 | ``` 164 | 165 | 166 | ## Handling missing values 167 | 168 | The only column with missing values is the `domain` column. 
There are several ways to fill missing values, including: 169 | - Dropping the corresponding rows 170 | - Filling `NaN`s using the most frequent value. 171 | - Using Multiple Imputation by Chained Equations (MICE), a more sophisticated option 172 | 173 | In our case, there is only a relatively small percentage of `NaN`s in just one column, namely, $\approx$ 13$\%$ of domain values are missing. I opted for value imputation to avoid dropping rows. Future analysis using MICE should improve the final results. 174 | 175 | ``` 176 | train_new['website'] = train_new[['website']].apply(lambda x:x.fillna(x.value_counts().index[0])) 177 | train_new.isnull().any() 178 | test_new['website'] = test_new[['website']].apply(lambda x:x.fillna(x.value_counts().index[0])) 179 | test_new.isnull().any() 180 | ``` 181 | 182 | 183 | ### Dummies 184 | 185 | We can transform the categories with low cardinality into dummies using one-hot encoding: 186 | 187 | ``` 188 | cols_to_keep = ['carrier', 'device', 'day', 'hour', 'dimension'] 189 | low_cardin_train = train_new[cols_to_keep] 190 | low_cardin_test = test_new[cols_to_keep] 191 | dummies_train = pd.concat([pd.get_dummies(low_cardin_train[col], drop_first = True, prefix= col) 192 | for col in cols_to_keep], axis=1) 193 | dummies_test = pd.concat([pd.get_dummies(low_cardin_test[col], drop_first = True, prefix= col) 194 | for col in cols_to_keep], axis=1) 195 | dummies_train.head() 196 | dummies_test.head() 197 | 198 | train_new.to_csv('train_new.csv') 199 | test_new.to_csv('test_new.csv') 200 | ``` 201 | 202 | #### Concatenating with the rest of the `DataFrame`: 203 | 204 | ``` 205 | train_new = pd.concat([train_new[high_cardinal_train + ['click']], dummies_train], axis = 1) 206 | test_new = pd.concat([test_new[high_cardinal_test], dummies_test], axis = 1) 207 | ``` 208 | 209 | Now, to treat the columns with high cardinality, we will break them up into percentiles based on the number of impressions (number of rows), as illustrated in the toy example below. 
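Before reading the `ranges` function in the next section, here is a toy version of the same bucketing idea with made-up data: count impressions per website, cut the counts into 3 bins with `pd.cut` (note that `pd.cut` makes equal-width value bins; `pd.qcut` would give true percentiles), and dummy-encode the bin membership:

```
import pandas as pd

sites = pd.Series(['a.com'] * 5 + ['b.com'] * 3 + ['c.com'])  # hypothetical impressions
counts = sites.value_counts()                    # impressions per website
buckets = pd.cut(counts, 3)                      # 3 equal-width count bins
print(pd.get_dummies(buckets, drop_first=True))  # bin-membership dummies
```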
210 | 211 | #### Building up dictionaries for creation of dummy variables 212 | 213 | ``` 214 | train_new['count'] = 1 # auxiliary column 215 | test_new['count'] = 1 216 | ``` 217 | 218 | #### In the next cell, I use `pd.cut` to rename column entries using percentiles 219 | 220 | ``` 221 | def series_to_dataframe(s,name,index_list): 222 | lst = [s.iloc[i] for i in range(s.shape[0])] 223 | new_df = pd.DataFrame({name: lst}) # transforms list into dataframe 224 | new_df.index = index_list 225 | return new_df 226 | def ranges(df1,col): 227 | df = series_to_dataframe(df1['count'].groupby(df1[col]).sum(), 228 | 'sum of ads', 229 | df1['count'].groupby(df1[col]).sum().index.tolist()).sort_values('sum of ads',ascending=False) 230 | #print('How the pd.cut looks like:\n') 231 | #print(pd.get_dummies(pd.cut(df['sum of ads'], 3)).head(3)) 232 | df = pd.concat([df,pd.get_dummies(pd.cut(df['sum of ads'], 3), drop_first = True)],axis=1) 233 | df.columns = ['sum of ads',col + '_1',col + '_2'] 234 | return df 235 | website_train = ranges(train_new,'website') 236 | publisher_train = ranges(train_new,'publisher') 237 | website_test = ranges(test_new,'website') 238 | publisher_test = ranges(test_new,'publisher') 239 | website_train.reset_index(level=0, inplace=True) 240 | publisher_train.reset_index(level=0, inplace=True) 241 | website_test.reset_index(level=0, inplace=True) 242 | publisher_test.reset_index(level=0, inplace=True) 243 | website_train.columns = ['website', 'sum of impressions', 'website_1', 'website_2'] 244 | publisher_train.columns = ['publisher', 'sum of impressions', 'publisher_1', 'publisher_2'] 245 | website_test.columns = ['website', 'sum of impressions', 'website_1', 'website_2'] 246 | publisher_test.columns = ['publisher', 'sum of impressions', 'publisher_1', 'publisher_2'] 247 | train_new = train_new.merge(website_train, how='left') 248 | train_new = train_new.drop('website',axis=1).drop('sum of impressions',axis=1) 249 | train_new = train_new.merge(publisher_train, how='left') 250 | train_new = train_new.drop('publisher',axis=1).drop('sum of impressions',axis=1) 251 | test_new = test_new.merge(website_test, how='left') 252 | test_new = test_new.drop('website',axis=1).drop('sum of impressions',axis=1) 253 | test_new = test_new.merge(publisher_test, how='left') 254 | test_new = test_new.drop('publisher',axis=1).drop('sum of impressions',axis=1) 255 | ``` 256 | 257 | ## Imbalanced classes 258 | 259 | 260 | #### Imbalanced classes in general 261 | 262 | - We can account for unbalanced classes using: 263 | - Undersampling: randomly sample the majority class, artificially balancing the classes when fitting the model 264 | - Oversampling: bootstrap (sample with replacement) the minority class to balance the classes when fitting the model. 
We can oversample using the SMOTE algorithm (Synthetic Minority Oversampling Technique) 265 | - Note that it is crucial that we **evaluate our model on the real data!!** 266 | 267 | ``` 268 | zeros = train_new[train_new['click'] == 0] 269 | ones = train_new[train_new['click'] == 1] 270 | counts = train_new['click'].value_counts() 271 | proportion = counts[1]/counts[0] 272 | train_new = ones.append(zeros.sample(frac=proportion)) 273 | #train_new['response'].value_counts() 274 | #train_new.isnull().any() 275 | ``` 276 | 277 | ## Models 278 | 279 | ``` 280 | from sklearn.model_selection import cross_val_score, StratifiedKFold, train_test_split, GridSearchCV 281 | from sklearn.tree import DecisionTreeClassifier 282 | from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier 283 | from sklearn.linear_model import LogisticRegression, LogisticRegressionCV 284 | from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfVectorizer, TfidfTransformer 285 | import seaborn as sns 286 | from sklearn.metrics import confusion_matrix 287 | %matplotlib inline 288 | 289 | X_test = test_new 290 | ``` 291 | 292 | ## Defining ranges for the hyperparameters to be scanned by the grid search 293 | ``` 294 | n_estimators = list(range(20,120,10)) 295 | max_depth = list(range(2, 22, 2)) + [None] 296 | def random_forest_score(df,target_col,test_size,n_estimators,max_depth): 297 | 298 | X_train = df.drop(target_col, axis=1) # predictors 299 | y_train = df[target_col] # target 300 | X_test = test_new 301 | 302 | rf_params = { 303 | 'n_estimators':n_estimators, 304 | 'max_depth':max_depth} # parameters for grid search 305 | rf_gs = GridSearchCV(RandomForestClassifier(), rf_params, cv=5, verbose=1, n_jobs=-1) 306 | rf_gs.fit(X_train,y_train) # training the random forest with all possible parameters 307 | print('The best parameters on the training data are:\n',rf_gs.best_params_) # printing the best parameters 308 | max_depth_best = rf_gs.best_params_['max_depth'] # getting the best max_depth 309 | n_estimators_best = rf_gs.best_params_['n_estimators'] # getting the best n_estimators 310 | print("best max_depth:",max_depth_best) 311 | print("best n_estimators:",n_estimators_best) 312 | best_rf_gs = RandomForestClassifier(max_depth=max_depth_best,n_estimators=n_estimators_best) # instantiate the best model 313 | best_rf_gs.fit(X_train,y_train) # fitting the best model 314 | preds = best_rf_gs.predict(X_test) 315 | feature_importances = pd.Series(best_rf_gs.feature_importances_, index=X_train.columns).sort_values().tail(5) 316 | print(feature_importances.plot(kind="barh", figsize=(6,6))) 317 | return 318 | 319 | random_forest_score(train_new,'click',0.3,n_estimators,max_depth) 320 | ``` 321 | ``` 322 | X = train_new.drop('click', axis=1) # predictors 323 | y = train_new['click'] 324 | 325 | def cv_score(X,y,cv,n_estimators,max_depth): 326 | rf = RandomForestClassifier(n_estimators=n_estimators_best, 327 | max_depth=max_depth_best) 328 | s = cross_val_score(rf, X, y, cv=cv, n_jobs=-1) 329 | return("{} Score is :{:0.3} ± {:0.3}".format("Random Forest", s.mean().round(3), s.std().round(3))) 330 | 331 | dict_best = {'max_depth': 14, 'n_estimators': 80} 332 | n_estimators_best = dict_best['n_estimators'] 333 | max_depth_best = dict_best['max_depth'] 334 | cv_score(X,y,5,n_estimators_best,max_depth_best) 335 | 336 | n_estimators = list(range(20,120,10)) 337 | max_depth = list(range(2, 16, 2)) + [None] 338 | 339 | def 
random_forest_score_probas(df,target_col,test_size,n_estimators,max_depth): 340 | 341 | X_train = df.drop(target_col, axis=1) # predictors 342 | y_train = df[target_col] # target 343 | X_test = test_new 344 | 345 | rf_params = { 346 | 'n_estimators':n_estimators, 347 | 'max_depth':max_depth} # parameters for grid search 348 | rf_gs = GridSearchCV(RandomForestClassifier(), rf_params, cv=5, n_jobs=-1) 349 | rf_gs.fit(X_train,y_train) # training the random forest with all possible parameters 350 | max_depth_best = rf_gs.best_params_['max_depth'] # getting the best max_depth 351 | n_estimators_best = rf_gs.best_params_['n_estimators'] # getting the best n_estimators 352 | best_rf_gs = RandomForestClassifier(max_depth=max_depth_best,n_estimators=n_estimators_best) # instantiate the best model 353 | best_rf_gs.fit(X_train,y_train) # fitting the best model 354 | preds = best_rf_gs.predict(X_test) 355 | prob_list = [prob[0] for prob in best_rf_gs.predict_proba(X_test).tolist()] 356 | df_prob = pd.DataFrame(np.array(prob_list).reshape(53333,1)) 357 | df_prob.columns = ['probabilities'] 358 | df_prob.to_csv('probs.csv') 359 | return df_prob 360 | 361 | random_forest_score_probas(train_new,'click',0.3,n_estimators,max_depth).head() 362 | 363 | def random_forest_score_preds(df,target_col,test_size,n_estimators,max_depth): 364 | 365 | X_train = df.drop(target_col, axis=1) # predictors 366 | y_train = df[target_col] # target 367 | X_test = test_new 368 | 369 | rf_params = { 370 | 'n_estimators':n_estimators, 371 | 'max_depth':max_depth} # parameters for grid search 372 | rf_gs = GridSearchCV(RandomForestClassifier(), rf_params, cv=5, verbose=1, n_jobs=-1) 373 | rf_gs.fit(X_train,y_train) # training the random forest with all possible parameters 374 | max_depth_best = rf_gs.best_params_['max_depth'] # getting the best max_depth 375 | n_estimators_best = rf_gs.best_params_['n_estimators'] # getting the best n_estimators 376 | best_rf_gs = RandomForestClassifier(max_depth=max_depth_best,n_estimators=n_estimators_best) # instantiate the best model 377 | best_rf_gs.fit(X_train,y_train) # fitting the best model 378 | preds = best_rf_gs.predict(X_test) 379 | df_pred = pd.DataFrame(np.array(preds).reshape(53333,1)) 380 | df_pred.columns = ['predictions'] 381 | df_pred.to_csv('preds.csv') 382 | return df_pred 383 | 384 | random_forest_score_preds(train_new,'click',0.3,n_estimators,max_depth) 385 | ``` 386 | -------------------------------------------------------------------------------- /click-prediction/images/123: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /click-prediction/images/click1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/click-prediction/images/click1.png -------------------------------------------------------------------------------- /click-prediction/optimal-bidding-strategies-in-online-display-advertising .pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/click-prediction/optimal-bidding-strategies-in-online-display-advertising .pdf -------------------------------------------------------------------------------- 
/predicting-number-of-comments-on-reddit-using-random-forest-classifier/123.png: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /predicting-number-of-comments-on-reddit-using-random-forest-classifier/README.md: -------------------------------------------------------------------------------- 1 | ## Predicting Comments on Reddit [[view code]](http://nbviewer.jupyter.org/github/marcotav/machine-learning-classification-projects/blob/master/predicting-number-of-comments-on-reddit-using-random-forest-classifier/notebooks/project-3-marco-tavora.ipynb) 2 | ![image title](https://img.shields.io/badge/python-v3.6-green.svg) ![image title](https://img.shields.io/badge/nltk-v3.2.5-yellow.svg) ![Image title](https://img.shields.io/badge/sklearn-0.19.1-orange.svg) ![Image title](https://img.shields.io/badge/BeautifulSoup-4.6.0-blue.svg) ![Image title](https://img.shields.io/badge/pandas-0.22.0-red.svg) ![Image title](https://img.shields.io/badge/numpy-1.14.2-green.svg) ![Image title](https://img.shields.io/badge/matplotlib-v2.1.2-orange.svg) 3 | 4 | **The code is available [here](http://nbviewer.jupyter.org/github/marcotav/machine-learning-classification-projects/blob/master/predicting-number-of-comments-on-reddit-using-random-forest-classifier/notebooks/project-3-marco-tavora.ipynb) or by clicking on the [view code] link above.** 5 | 6 | 7 |
8 |
9 |

10 | 12 |

13 |
14 | 15 |

16 | Problem Statement • 17 | Steps • 18 | Bird's-eye view of webscraping • 19 | Writing functions to extract data from Reddit • 20 | Quick review of NLP techniques • 21 | Preprocessing the text • 22 | Models 23 |

24 | 
25 | 
26 | ## Problem Statement
27 | 
28 | Determine which characteristics of a post on Reddit contribute most to the overall interaction, as measured by the number of comments.
29 | 
30 | 
31 | ## Steps
32 | 
33 | This project had three steps:
34 | - Collecting data by scraping a website using the Python package `requests` together with the library `BeautifulSoup`, which parses the HTML. We scraped the 'hot' threads as listed on the
[Reddit homepage](https://www.reddit.com/) (see figure below) and acquired the following pieces of information about each thread: 35 | 36 | - The title of the thread 37 | - The subreddit that the thread corresponds to 38 | - The length of time it has been up on Reddit 39 | - The number of comments on the thread 40 | 41 |
42 |
43 |

44 | 46 |

47 |
48 | 
49 | - Using Natural Language Processing (NLP) techniques to preprocess the data. NLP, in a nutshell, is "how to transform text data and convert it to features that enable us to build models." NLP techniques include:
50 | 
51 |     - Tokenization: essentially splitting text into pieces based on given patterns
52 |     - Removing stopwords
53 |     - Lemmatization: returns the word's *lemma* (its base/dictionary form)
54 |     - Stemming: returns the base form of the word (it is usually cruder than lemmatization).
55 | 
56 | - After the step above we obtain *numerical* features which allow for algebraic computations. We then build a `RandomForestClassifier` and use it to classify each post according to its number of comments. More concretely, the model predicts whether a given Reddit post will have an above- or below-_median_ number of comments.
57 | 
58 | 
59 | ### Bird's-eye view of webscraping
60 | 
61 | The general strategy is:
62 | - Use the `requests` Python package to make a `.get` request (the object `res` is a `Response` object):
63 | ```
64 | res = requests.get(URL,headers={"user-agent":'mt'})
65 | ```
66 | - Create a BeautifulSoup object from the HTML
67 | ```
68 | soup = BeautifulSoup(res.content,"lxml")
69 | ```
70 | - Inspect the parsed page structure with `.prettify` (note that `.extract` would instead *remove* a tag from the tree):
71 | ```
72 | print(soup.prettify())
73 | ```
74 | 
75 | ### Writing functions to extract data from Reddit
76 | Here I write down the functions that will extract the information needed. The structure of the functions depends on the HTML code of the page. The page has the following structure (a one-line example follows the list):
77 | - The thread title is within an `<a>` tag with the attribute `data-event-action="title"`.
78 | - The time since the thread was created is within a `<time>` tag with the attribute `class="live-timestamp"`.
79 | - The subreddit is within an `<a>` tag with the attribute `class="subreddit hover may-blank"`.
80 | - The number of comments is within an `<a>` tag with the attribute `data-event-action="comments"`.
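Each of these tag/attribute pairs maps directly onto a `find_all` call. As a one-line illustration (a sketch, assuming `soup` was built from the page as above):
```
title_links = soup.find_all('a', {'data-event-action': 'title'}) # each element's .text is a thread title
```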
81 | 
82 | The functions are:
83 | ```
84 | def extract_title_from_result(result):
85 |     titles = []
86 |     title = result.find_all('a', {'data-event-action':'title'})
87 |     for i in title:
88 |         titles.append(i.text)
89 |     return titles
90 | 
91 | def extract_time_from_result(result):
92 |     times = []
93 |     time = result.find_all('time', {'class':'live-timestamp'})
94 |     for i in time:
95 |         times.append(i.text)
96 |     return times
97 | 
98 | def extract_subreddit_from_result(result):
99 |     subreddits = []
100 |     subreddit = result.find_all('a', {'class':'subreddit hover may-blank'})
101 |     for i in subreddit:
102 |         subreddits.append(i.string)
103 |     return subreddits
104 | 
105 | def extract_num_from_result(result):
106 |     nums_lst = []
107 |     nums = result.find_all('a', {'data-event-action': 'comments'})
108 |     for i in nums:
109 |         nums_lst.append(i.string)
110 |     return nums_lst
111 | ```
112 | I then write a function that scrapes each page, stores the results, and follows the "next"-button link to keep paginating through Reddit:
113 | ```
114 | def get_urls(n=25):
115 |     j=0 # counting loops
116 |     titles = []
117 |     times = []
118 |     subreddits = []
119 |     nums = []
120 |     URLS = []
121 |     URL = "http://www.reddit.com"
122 | 
123 |     for _ in range(n):
124 | 
125 |         res = requests.get(URL, headers={"user-agent":'mt'})
126 |         soup = BeautifulSoup(res.content,"lxml")
127 | 
128 |         titles.extend(extract_title_from_result(soup))
129 |         times.extend(extract_time_from_result(soup))
130 |         subreddits.extend(extract_subreddit_from_result(soup))
131 |         nums.extend(extract_num_from_result(soup))
132 | 
133 |         URL = soup.find('span',{'class':'next-button'}).find('a')['href'] # link to the next page
134 |         URLS.append(URL)
135 |         j+=1
136 |         print(j)
137 |         time.sleep(3) # polite delay between requests
138 | 
139 |     return titles, times, subreddits, nums, URLS
140 | ```
141 | 
142 | I then build a pandas `DataFrame`, perform some exploratory data analysis, and create:
143 | - A binary column indicating whether a post's number of comments is at or above the median
144 | - A set of dummy columns for the subreddits
145 | 
146 | and then concatenate both:
147 | ```
148 | df['binary'] = df['nums'].apply(lambda x: 1 if x >= np.median(df['nums']) else 0)
149 | # dummies created and dataframes concatenated
150 | df_subred = pd.concat([df['binary'],pd.get_dummies(df['subreddits'], drop_first = True)], axis = 1)
151 | ```
152 | ### Quick review of NLP techniques
153 | Before applying NLP to our problem, I will provide a quick review of the basic procedures using `Python`. We use the package `nltk` (Natural Language Toolkit) to perform the actions above. The general procedure is the following. We first import `nltk` and the classes needed for tokenization, lemmatization, and stemming:
154 | ```
155 | import nltk
156 | from nltk.tokenize import RegexpTokenizer
157 | from nltk.stem import WordNetLemmatizer
158 | from nltk.stem.porter import PorterStemmer
159 | ```
160 | We then create objects of the classes `PorterStemmer` and `WordNetLemmatizer`:
161 | ```
162 | stemmer = PorterStemmer()
163 | lemmatizer = WordNetLemmatizer()
164 | ```
165 | To use lemmatization and/or stemming on a given string `text` we must first tokenize it. To do that, we use `RegexpTokenizer`, where the argument below is a regular expression.
166 | ```
167 | tokenizer = RegexpTokenizer(r'\w+')
168 | tokens = tokenizer.tokenize(text)
169 | tokens_lemma = [lemmatizer.lemmatize(i) for i in tokens]
170 | stem_text = [stemmer.stem(i) for i in tokens]
171 | ```
172 | 
173 | ### Preprocessing the text
174 | To preprocess the text, before creating numerical features from it, I used the following `cleaner` function:
175 | ```
176 | import string
177 | from nltk.corpus import stopwords
178 | 
179 | def cleaner(text):
180 |     stemmer = PorterStemmer()
181 |     stop = stopwords.words('english')
182 |     text = text.translate(str.maketrans('', '', string.punctuation)) # remove punctuation
183 |     text = text.translate(str.maketrans('', '', string.digits)) # remove digits
184 |     text = text.lower().strip()
185 |     final_text = []
186 |     for w in text.split():
187 |         if w not in stop: # drop stopwords and stem the rest
188 |             final_text.append(stemmer.stem(w.strip()))
189 |     return ' '.join(final_text)
190 | ```
191 | I then use `CountVectorizer` to create features based on the words in the thread titles. `CountVectorizer` is scikit-learn's bag-of-words tool. I then combine the resulting word-count table with the subreddit features table into `df_all` and build a model on it:
192 | 
193 | ```
194 | cvt = CountVectorizer(min_df=min_df, preprocessor=cleaner) # min_df: minimum document frequency for a word to be kept
195 | X_title = cvt.fit_transform(df["titles"])
196 | X_thread = pd.DataFrame(X_title.todense(),
197 |                         columns=cvt.get_feature_names())
198 | df_all = pd.concat([df_subred,X_thread],axis=1)
199 | ```
200 | 
201 | 
202 | 
203 | 
204 | ### Models
205 | Finally, now with the data properly treated, we use the following function to fit the training data using a `RandomForestClassifier` with hyperparameters optimized via `GridSearchCV`. The hyperparameter ranges are:
206 | ```
207 | n_estimators = list(range(20,220,10))
208 | max_depth = list(range(2, 22, 2)) + [None]
209 | ```
210 | 
211 | The function below does the following:
212 | - Defines target and predictors
213 | - Performs a train-test split of the data
214 | - Uses `GridSearchCV`, which performs an "exhaustive search over specified parameter values for an estimator" (see the docs). It searches the hyperparameter space for the highest cross-validation score. Its most important arguments are:
215 | 
216 | | Argument | Description |
217 | | --- | --- |
218 | | **`estimator`** | Sklearn instance of the model to fit on |
219 | | **`param_grid`** | A dictionary where keys are hyperparameters and values are lists of values to test |
220 | | **`cv`** | Number of internal cross-validation folds to run for each set of hyperparameters |
221 | 
222 | - After fitting, `GridSearchCV` provides information such as:
223 | 
224 | | Property | Use |
225 | | --- | --- |
226 | | **`results.param_grid`** | Parameters searched over. |
227 | | **`results.best_score_`** | Best mean cross-validated score. |
228 | | **`results.best_estimator_`** | Reference to the model with the best score. |
229 | | **`results.best_params_`** | Parameters found to perform with the best score. |
230 | | **`results.grid_scores_`** | Display score attributes with corresponding parameters. |
231 | 
232 | - The estimator chosen here was a `RandomForestClassifier`, which fits a set of decision tree classifiers on sub-samples of the data and averages them to improve accuracy and avoid over-fitting.
233 | - Fits several models using the training data, one for each parameter combination in the grid `rf_params`, and finds the best model, i.e., the one with the best mean cross-validated score.
234 | - Instantiates the best model and fits it
235 | - Scores the model and makes predictions
236 | - Determines the most relevant features and prints out a bar plot showing them.
237 | 
238 | ```
239 | def rfscore(df,target_col,test_size,n_estimators,max_depth):
240 | 
241 |     X = df.drop(target_col, axis=1) # predictors
242 |     y = df[target_col] # target
243 | 
244 |     # train-test split
245 |     X_train, X_test, y_train, y_test = train_test_split(X,
246 |                                        y, test_size = test_size, random_state=42)
247 |     # definition of a grid of parameter values
248 |     rf_params = {
249 |         'n_estimators':n_estimators,
250 |         'max_depth':max_depth} # parameters for grid search
251 | 
252 |     # Instantiation
253 |     rf_gs = GridSearchCV(RandomForestClassifier(), rf_params, cv=5, verbose=1, n_jobs=-1)
254 | 
255 |     # fitting using training data with all possible parameters
256 |     rf_gs.fit(X_train,y_train)
257 | 
258 |     # Parameters that have been found to perform with the best score
259 |     max_depth_best = rf_gs.best_params_['max_depth']
260 |     n_estimators_best = rf_gs.best_params_['n_estimators']
261 | 
262 |     # Best model
263 |     best_rf_gs = RandomForestClassifier(max_depth=max_depth_best,n_estimators=n_estimators_best)
264 | 
265 |     # fitting the best model on the training data
266 |     best_rf_gs.fit(X_train,y_train)
267 | 
268 |     # scoring
269 |     best_rf_score = best_rf_gs.score(X_test,y_test)
270 | 
271 |     # predictions
272 |     preds = best_rf_gs.predict(X_test)
273 | 
274 |     # finds the most important features and plots a bar chart
275 |     feature_importances = pd.Series(best_rf_gs.feature_importances_, index=X.columns).sort_values().tail(5)
276 |     feature_importances.plot(kind="barh", figsize=(6,6))
277 |     return best_rf_score, preds
278 | ```
279 | The function below performs cross-validation to obtain the accuracy score of the model with the best parameters found by the `GridSearch`:
280 | 
281 | ```
282 | def cv_score(X,y,cv,n_estimators,max_depth):
283 |     rf = RandomForestClassifier(n_estimators=n_estimators, # use the passed-in best parameters
284 |                                 max_depth=max_depth)
285 |     s = cross_val_score(rf, X, y, cv=cv, n_jobs=-1)
286 |     return("{} score: {:0.3f} ± {:0.3f}".format("Random Forest", s.mean(), s.std()))
287 | ```
288 | The most important features according to the `RandomForestClassifier` are shown in the graph below:
289 | 
290 | 291 | 292 | 293 | -------------------------------------------------------------------------------- /predicting-number-of-comments-on-reddit-using-random-forest-classifier/images/123.png: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /predicting-number-of-comments-on-reddit-using-random-forest-classifier/images/Reddit-logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/predicting-number-of-comments-on-reddit-using-random-forest-classifier/images/Reddit-logo.png -------------------------------------------------------------------------------- /predicting-number-of-comments-on-reddit-using-random-forest-classifier/images/redditRF.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/predicting-number-of-comments-on-reddit-using-random-forest-classifier/images/redditRF.png -------------------------------------------------------------------------------- /predicting-number-of-comments-on-reddit-using-random-forest-classifier/images/redditpage.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/predicting-number-of-comments-on-reddit-using-random-forest-classifier/images/redditpage.png -------------------------------------------------------------------------------- /predicting-number-of-comments-on-reddit-using-random-forest-classifier/images/redditwordshist.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/predicting-number-of-comments-on-reddit-using-random-forest-classifier/images/redditwordshist.png -------------------------------------------------------------------------------- /predicting-number-of-comments-on-reddit-using-random-forest-classifier/notebooks/123.png: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /retail-strategy/README.md: -------------------------------------------------------------------------------- 1 | ## Retail Expansion Analysis with Lasso and Ridge Regressions [[view code]](http://nbviewer.jupyter.org/github/marcotav/machine-learning-regression-models/blob/master/retail/notebooks/retail-recommendations.ipynb) 2 | ![image title](https://img.shields.io/badge/work-in%20progress-blue.svg) ![image title](https://img.shields.io/badge/statsmodels-v0.8.0-blue.svg) ![Image title](https://img.shields.io/badge/sklearn-0.19.1-orange.svg) ![Image title](https://img.shields.io/badge/pandas-0.22.0-red.svg) ![Image title](https://img.shields.io/badge/numpy-1.14.2-green.svg) ![Image title](https://img.shields.io/badge/matplotlib-v2.1.2-orange.svg) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) 3 | 4 | **The code is available [here](http://nbviewer.jupyter.org/github/marcotav/machine-learning-regression-models/blob/master/retail/notebooks/retail-recommendations.ipynb) or by clicking on the [view 
code] link above.** 5 | 6 | 7 | 8 | 9 | 10 |
11 | 12 |

13 | 14 |

15 |
16 | 17 |

18 | Summary • 19 | Preamble • 20 | Getting data • 21 | Data Munging and EDA • 22 | Mining the data • 23 | Building the models • 24 | Plotting results • 25 | Conclusions and recommendations 26 |

27 | 
28 | 
29 | ## Summary
30 | Based on a dataset containing the spirits purchase information of Iowa Class E liquor licensees by product and date of purchase (link), this project provides recommendations on where to open new stores in the state of Iowa. I first conducted a thorough exploratory data analysis and then built several multivariate regression models of total sales by county, using both Lasso and Ridge regularization, and based on these models, I made recommendations about new locations.
31 | 
32 | 
33 | ## Preamble
34 | 
35 | Expansion plans traditionally use subsets of the following mix of data:
36 | 
37 | #### Demographics
38 | 
39 | I focused on the following quantities:
40 | - The ratio between sales and volume for each county, i.e., the number of dollars per liter sold. If this ratio is high in a given county, the stores in that county are, on average, high-end stores.
41 | - Another critical ratio is the number of stores per area. The meaning of a high value of this ratio is not so straightforward, since it may indicate either that the market is saturated, or that the county is a strong market for this type of product and would welcome a new store (an example would be a county close to some major university). In contrast, a low value may indicate a market with untapped potential or a market whose population is not a target of this type of store.
42 | - A third important ratio is consumption per person, i.e., the consumption *per capita*. Knowing the profile of the county's population (whether they are "light" or "heavy" drinkers) would undoubtedly help the owner decide whether or not to open a new storefront there.
43 | 
44 | #### Nearby businesses
45 | 
46 | Competition is a critical component, and can be indirectly measured by the ratio of the number of stores to the population.
47 | 
48 | #### Aggregated human flow/foot traffic
49 | 
50 | For this information to be useful, we would need more granular data such as app check-ins. Population and population density will be used as proxies.
51 | 
52 | 
53 | ## Getting data
54 | 
55 | Three datasets were used, namely:
56 | - A dataset containing the spirits purchase information of Iowa Class “E” liquor licensees by product and date of purchase.
57 | - A dataset with information about the population of each county
58 | - A database containing information about incomes
59 | 
60 | 
61 | ## Data Munging and EDA
62 | 
63 | Data munging included:
64 | - Checking the time span of the data and dropping the 2016 data (which contained only three months)
65 | - Eliminating symbols (such as dollar signs) from the data
66 | - Converting columns of objects into columns of floats and dropping `NaN` values
67 | - Converting store numbers to strings.
68 | - Examining the data, we found that the maximum values in all columns were many standard deviations larger than the mean, indicating the presence of outliers. Keeping outliers in the analysis would inflate the predicted sales. Also, since the goal is to predict the *most likely performance* for each store, keeping exceptionally well-performing stores would be detrimental (a sketch of such a filter follows below).
69 | 
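The outlier filter referenced in the last bullet can be sketched as follows. This is a minimal illustration rather than the notebook's exact rule: the three-standard-deviation cutoff and the use of the `sale_dollars` column are my assumptions here.
```
# keep only stores whose sales are within 3 standard deviations of the mean
# (illustrative cutoff; the notebook's exact criterion may differ)
cutoff = df['sale_dollars'].mean() + 3*df['sale_dollars'].std()
df = df[df['sale_dollars'] < cutoff]
```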
70 | 
71 | To strip dollar signs, for example, I used:
72 | ```
73 | for col in cols_with_dollar:
74 |     df[col] = df[col].apply(lambda x: x.strip('$')).astype('float')
75 | ```
76 | To plot histograms I found it convenient to write a simple function:
77 | ```
78 | def draw_histograms(df,col,bins):
79 |     df[col].hist(bins=bins)
80 |     plt.title(col)
81 |     plt.xlabel(col)
82 |     plt.xticks(rotation=90)
83 |     plt.show()
84 | ```
85 | 
86 | ## Mining the data
87 | 
88 | Some of the steps for mining the data included: computing the total sales per county, creating a profit column, calculating the profit per store and the sales per volume, dropping outliers, and calculating both the stores-per-person and the alcohol-consumption-per-person ratios.
89 | 
90 | I then looked for any statistical relationships, correlations, or other relevant properties of the dataset.
91 | 
92 | #### Steps:
93 | - First I needed to choose the proper predictors. I looked for strong correlations between variables to avoid problems with multicollinearity.
94 | - Variables that changed very little had little impact and were therefore not included as predictors.
95 | - I then studied correlations between predictors.
96 | - I saw from the correlation matrices that `num_stores` and `stores_per_area` are highly correlated. Furthermore, both are highly correlated with the target variable `sale_dollars`. The same happens with `store_population_ratio` and `consumption_per_capita`.
97 | 
98 | A heatmap of correlations using `Seaborn` follows:
99 | 
100 | 
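The heatmap itself can be produced with a call along the following lines (a sketch: I am assuming the correlation matrix is taken over the same `cols_to_keep` columns used for the pair plot further down, and the styling arguments are my choices):
```
import seaborn as sns
import matplotlib.pyplot as plt

# correlation matrix of the retained columns, rendered as an annotated heatmap
sns.heatmap(df[cols_to_keep].corr(), annot=True, cmap='coolwarm')
plt.show()
```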

101 | 102 |

103 | 
104 | To generate scatter plots for all the predictors (which provided similar information to the correlation matrices) we write:
105 | ```
106 | g = sns.pairplot(df[cols_to_keep])
107 | for ax in g.axes.flatten(): # from [6]
108 |     for tick in ax.get_xticklabels():
109 |         tick.set(rotation=90)
110 | ```
111 | 

112 | 113 |

114 | 
115 | 
116 | 
117 | ## Building the models
118 | 
119 | Using `scikit-learn` and `statsmodels`, I built the necessary models and evaluated their fit. For that I generated all combinations of the relevant features using the `itertools` module.
120 | 
121 | Preparing training and test sets:
122 | ```
123 | # choose candidate features
124 | features = ['num_stores','population', 'store_population_ratio', \
125 |             'consumption_per_capita', 'stores_per_area', u'per_capita_income']
126 | # defining the predictors and the target
127 | X,y = df_final[features], df_final['sale_dollars']
128 | # train-test split
129 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3)
130 | ```
131 | I now generate all combinations of features:
132 | 
133 | ```
134 | combs = []
135 | for num in range(1,len(features)+1):
136 |     combs.extend([list(c) for c in itertools.combinations(features, num)]) # each element of combs is one feature subset
137 | ```
138 | 
139 | I then instantiated the models and tested them. The code below records the `r2` score of each model/feature-combination pair and sorts the results by score using `itemgetter`:
140 | ```
141 | lr = linear_model.LinearRegression(normalize=True)
142 | ridge = linear_model.RidgeCV(cv=5)
143 | lasso = linear_model.LassoCV(cv=5)
144 | models = [lr,lasso,ridge]
145 | r2_comb_lst = []
146 | for comb in combs:
147 |     for m in models:
148 |         model = m.fit(X_train[comb],y_train)
149 |         r2 = m.score(X_test[comb], y_test)
150 |         r2_comb_lst.append([round(r2,3),comb,str(model).split('(')[0]])
151 | 
152 | r2_comb_lst.sort(key=operator.itemgetter(0)) # sort by the r2 score so the best combination comes last
153 | ```
154 | The best predictors were obtained via:
155 | ```
156 | r2_comb_lst[-1][1]
157 | ```
158 | Dropping highly correlated predictors, I redefined `X` and `y` (with `features` now holding the reduced list) and built a Ridge model:
159 | ```
160 | X, y = df_final[features], df_final['sale_dollars']
161 | ridge = linear_model.RidgeCV(cv=5)
162 | model = ridge.fit(X,y)
163 | ```
164 | 
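Once fitted, the regularization strength chosen by cross-validation and the per-feature coefficients can be inspected. A short sketch (`alpha_` and `coef_` are standard `RidgeCV` attributes; the printing format is mine):
```
print('alpha chosen by CV:', model.alpha_)
for feat, coef in zip(features, model.coef_):
    print('{}: {:.2f}'.format(feat, coef)) # one coefficient per predictor
```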

171 | 172 |

173 | 
174 | 
175 | ## Conclusions and recommendations
176 | 
177 | The following recommendations were provided:
178 | 
179 | - Linn has higher sales, in part simply because it has a larger population, which by itself is not very useful information.
180 | - Next, ordering stores by `sales_per_litters` shows which counties have more high-end stores (Johnson tops the list).
181 | - We would recommend Johnson for a new store *if the goal of the owner is to build new high-end stores*.
182 | - If the plan is to open more stores but with cheaper products, Johnson is not the place to choose. The least saturated market is Decatur. But, as discussed before, this information alone does not yield a unique recommendation, and a more thorough analysis is needed.
183 | - The county with the weakest competition is Butler. This could indicate untapped potential. However, the absence of a reasonable number of stores may indicate, as observed before, that the county's population is simply not interested in this category of product. Again, further investigation must be carried out.
184 | 
185 | 
186 | I strongly recommend reading the notebook using [nbviewer](http://nbviewer.jupyter.org/github/marcotav/machine-learning-regression-models/blob/master/retail/notebooks/retail-recommendations.ipynb).
187 | 
188 | 
--------------------------------------------------------------------------------
/retail-strategy/data/123:
--------------------------------------------------------------------------------
1 | 
2 | 
--------------------------------------------------------------------------------
/retail-strategy/data/ia_zip_city_county_sqkm.csv:
--------------------------------------------------------------------------------
1 | ,Zip Code,City,County,State,County Number,Area (sqkm) 0,50001,ACKWORTH,Warren,IA,91,62.796656 1,50002,ADAIR,Guthrie,IA,39,279.202219 2,50003,ADEL,Dallas,IA,25,298.086291 3,50005,ALBION,Marshall,IA,64,69.623573 4,50006,ALDEN,Hardin,IA,42,317.74515 5,50007,ALLEMAN,Polk,IA,77,13.782897 6,50008,ALLERTON,Wayne,IA,93,220.623573 7,50009,ALTOONA,Polk,IA,77,65.207113 8,50010,AMES,Story,IA,85,155.294118 9,50011,AMES,Story,IA,85,0.125094 10,50012,AMES,Story,IA,85,1.982622 11,50012,AMES,Story,IA,85,1.982622 12,50014,AMES,Story,IA,85,144.826088 13,50020,ANITA,Cass,IA,15,249.128489 14,50021,ANKENY,Polk,IA,77,66.725924 15,50022,ATLANTIC,Cass,IA,15,431.883311 16,50023,ANKENY,Polk,IA,77,57.424136 17,50025,AUDUBON,Audubon,IA,5,507.431421 18,50026,BAGLEY,Guthrie,IA,39,142.869501 19,50027,BARNES CITY,Mahaska,IA,62,72.89173 20,50028,BAXTER,Jasper,IA,50,114.933651 21,50029,BAYARD,Guthrie,IA,39,105.03836 22,50032,BERWICK,Polk,IA,77,0.95539 23,50033,BEVINGTON,Warren,IA,91,0.288201 24,50034,BLAIRSBURG,Hamilton,IA,40,163.484203 25,50035,BONDURANT,Polk,IA,77,116.89815 26,50036,BOONE,Boone,IA,8,505.063491 27,50038,BOONEVILLE,Dallas,IA,25,8.874239 28,50039,BOUTON,Dallas,IA,25,60.662047 29,50041,BRADFORD,Franklin,IA,35,1.101427 30,50042,BRAYTON,Audubon,IA,5,84.16259 31,50044,BUSSEY,Marion,IA,63,118.473056 32,50046,CAMBRIDGE,Story,IA,85,119.973352 33,50047,CARLISLE,Warren,IA,91,152.159628 34,50048,CASEY,Guthrie,IA,39,226.57327 35,50049,CHARITON,Lucas,IA,59,523.124656 36,50050,CHURDAN,Greene,IA,37,197.706628 37,50051,CLEMONS,Marshall,IA,64,66.573089 38,50052,CLIO,Wayne,IA,93,50.94063 39,50054,COLFAX,Jasper,IA,50,152.872278 40,50055,COLLINS,Story,IA,85,126.340521 41,50056,COLO,Story,IA,85,149.192377 42,50057,COLUMBIA,Marion,IA,63,52.183538 43,50058,COON RAPIDS,Carroll,IA,14,364.231967 44,50060,CORYDON,Wayne,IA,93,453.610186
45,50061,CUMMING,Warren,IA,91,81.699043 46,50062,MELCHER-DALLAS,Marion,IA,63,80.104319 47,50063,DALLAS CENTER,Dallas,IA,25,170.532757 48,50064,DANA,Greene,IA,37,40.418909 49,50065,DAVIS CITY,Decatur,IA,27,147.902989 50,50066,DAWSON,Dallas,IA,25,64.732191 51,50067,DECATUR,Decatur,IA,27,87.093517 52,50068,DERBY,Lucas,IA,59,114.749633 53,50069,DE SOTO,Dallas,IA,25,13.262492 54,50070,DEXTER,Dallas,IA,25,171.070769 55,50071,DOWS,Wright,IA,99,293.114182 56,50072,EARLHAM,Madison,IA,61,228.908869 57,50073,ELKHART,Polk,IA,77,67.420142 58,50074,ELLSTON,Ringgold,IA,80,127.815922 59,50075,ELLSWORTH,Hamilton,IA,40,119.369391 60,50076,EXIRA,Audubon,IA,5,290.651382 61,50078,FERGUSON,Marshall,IA,64,0.660699 62,50101,GALT,Wright,IA,99,25.431226 63,50102,GARDEN CITY,Hardin,IA,42,1.304963 64,50103,GARDEN GROVE,Decatur,IA,27,157.409444 65,50104,GIBSON,Keokuk,IA,54,28.94056 66,50105,GILBERT,Story,IA,85,20.560447 67,50106,GILMAN,Marshall,IA,64,170.141114 68,50107,GRAND JUNCTION,Greene,IA,37,130.433186 69,50108,GRAND RIVER,Decatur,IA,27,188.608107 70,50109,GRANGER,Polk,IA,77,62.757235 71,50111,GRIMES,Polk,IA,77,71.484606 72,50112,GRINNELL,Poweshiek,IA,79,475.672153 73,50115,GUTHRIE CENTER,Guthrie,IA,39,441.185334 74,50116,HAMILTON,Marion,IA,63,44.071754 75,50117,HAMLIN,Audubon,IA,5,71.65718 76,50118,HARTFORD,Warren,IA,91,51.466255 77,50119,HARVEY,Marion,IA,63,40.354632 78,50120,HAVERHILL,Marshall,IA,64,43.94135 79,50122,HUBBARD,Hardin,IA,42,218.621364 80,50123,HUMESTON,Wayne,IA,93,204.584834 81,50124,HUXLEY,Story,IA,85,54.098688 82,50125,INDIANOLA,Warren,IA,91,426.521944 83,50126,IOWA FALLS,Hardin,IA,42,351.87483 84,50127,IRA,Jasper,IA,50,0.02826 85,50128,JAMAICA,Guthrie,IA,39,75.898812 86,50129,JEFFERSON,Greene,IA,37,435.65865 87,50130,JEWELL,Hamilton,IA,40,171.933052 88,50131,JOHNSTON,Polk,IA,77,64.666939 89,50132,KAMRAR,Hamilton,IA,40,74.609258 90,50133,KELLERTON,Ringgold,IA,80,196.966937 91,50134,KELLEY,Story,IA,85,48.427675 92,50135,KELLOGG,Jasper,IA,50,195.278764 93,50136,KESWICK,Keokuk,IA,54,100.632209 94,50138,KNOXVILLE,Marion,IA,63,461.827288 95,50139,LACONA,Warren,IA,91,229.508368 96,50140,LAMONI,Decatur,IA,27,241.678133 97,50141,LAUREL,Marshall,IA,64,93.777481 98,50142,LE GRAND,Marshall,IA,64,2.566675 99,50143,LEIGHTON,Mahaska,IA,62,92.958086 100,50144,LEON,Decatur,IA,27,362.605062 101,50146,LINDEN,Dallas,IA,25,75.845999 102,50147,LINEVILLE,Wayne,IA,93,187.491745 103,50148,LISCOMB,Marshall,IA,64,51.098115 104,50149,LORIMOR,Union,IA,88,213.449691 105,50150,LOVILIA,Monroe,IA,68,148.536678 106,50151,LUCAS,Lucas,IA,59,205.563843 107,50153,LYNNVILLE,Jasper,IA,50,80.372194 108,50154,MC CALLSBURG,Story,IA,85,53.144166 109,50155,MACKSBURG,Madison,IA,61,78.50132 110,50156,MADRID,Boone,IA,8,238.724015 111,50157,MALCOM,Poweshiek,IA,79,168.389373 112,50158,MARSHALLTOWN,Marshall,IA,64,548.346934 113,50160,MARTENSDALE,Warren,IA,91,0.965184 114,50161,MAXWELL,Story,IA,85,211.812219 115,50162,MELBOURNE,Marshall,IA,64,132.120698 116,50163,MELCHER-DALLAS,Marion,IA,63,1.230059 117,50164,MENLO,Guthrie,IA,39,138.23504 118,50165,MILLERTON,Wayne,IA,93,6.437634 119,50166,MILO,Warren,IA,91,162.522428 120,50167,MINBURN,Dallas,IA,25,102.610498 121,50168,MINGO,Jasper,IA,50,104.773461 122,50169,MITCHELLVILLE,Polk,IA,77,103.378565 123,50170,MONROE,Jasper,IA,50,230.99503 124,50171,MONTEZUMA,Poweshiek,IA,79,281.675684 125,50173,MONTOUR,Tama,IA,86,80.140704 126,50174,MURRAY,Clarke,IA,20,278.718495 127,50201,NEVADA,Story,IA,85,300.453642 128,50206,NEW PROVIDENCE,Hardin,IA,42,124.182004 129,50207,NEW SHARON,Mahaska,IA,62,366.588282 
130,50208,NEWTON,Jasper,IA,50,426.046663 131,50210,NEW VIRGINIA,Warren,IA,91,196.767136 132,50211,NORWALK,Warren,IA,91,147.182178 133,50212,OGDEN,Boone,IA,8,352.230522 134,50213,OSCEOLA,Clarke,IA,20,543.975469 135,50214,OTLEY,Marion,IA,63,102.571148 136,50216,PANORA,Guthrie,IA,39,145.699881 137,50217,PATON,Greene,IA,37,178.606122 138,50218,PATTERSON,Madison,IA,61,0.542828 139,50219,PELLA,Marion,IA,63,317.144262 140,50220,PERRY,Dallas,IA,25,268.176779 141,50222,PERU,Madison,IA,61,108.369441 142,50223,PILOT MOUND,Boone,IA,8,76.600548 143,50225,PLEASANTVILLE,Marion,IA,63,217.246336 144,50226,POLK CITY,Polk,IA,77,109.873855 145,50227,POPEJOY,Franklin,IA,35,0.966375 146,50228,PRAIRIE CITY,Jasper,IA,50,180.367188 147,50229,PROLE,Warren,IA,91,105.616644 148,50230,RADCLIFFE,Hardin,IA,42,223.982113 149,50231,RANDALL,Hamilton,IA,40,1.065168 150,50232,REASNOR,Jasper,IA,50,86.762448 151,50233,REDFIELD,Dallas,IA,25,130.640688 152,50234,RHODES,Marshall,IA,64,81.236093 153,50235,RIPPEY,Greene,IA,37,121.134316 154,50236,ROLAND,Story,IA,85,89.345522 155,50237,RUNNELLS,Polk,IA,77,146.002809 156,50238,RUSSELL,Lucas,IA,59,308.904729 157,50239,SAINT ANTHONY,Marshall,IA,64,44.053312 158,50240,SAINT CHARLES,Madison,IA,61,197.714047 159,50242,SEARSBORO,Poweshiek,IA,79,106.954493 160,50243,SHELDAHL,Story,IA,85,1.425493 161,50244,SLATER,Story,IA,85,57.130248 162,50244,SLATER,Story,IA,85,57.130248 163,50246,STANHOPE,Hamilton,IA,40,120.153227 164,50247,STATE CENTER,Marshall,IA,64,215.968634 165,50248,STORY CITY,Story,IA,85,211.580755 166,50249,STRATFORD,Hamilton,IA,40,202.332923 167,50250,STUART,Adair,IA,1,265.060086 168,50251,SULLY,Jasper,IA,50,104.817095 169,50252,SWAN,Marion,IA,63,22.861403 170,50254,THAYER,Union,IA,88,113.65365 171,50255,THORNBURG,Keokuk,IA,54,0.456756 172,50256,TRACY,Marion,IA,63,70.812037 173,50257,TRURO,Madison,IA,61,103.296613 174,50258,UNION,Hardin,IA,42,139.476319 175,50261,VAN METER,Madison,IA,61,173.242731 176,50262,VAN WERT,Decatur,IA,27,83.830606 177,50263,WAUKEE,Dallas,IA,25,90.002855 178,50264,WELDON,Decatur,IA,27,174.299474 179,50265,WEST DES MOINES,Polk,IA,77,46.466559 180,50266,WEST DES MOINES,Dallas,IA,25,43.060835 181,50268,WHAT CHEER,Keokuk,IA,54,123.524623 182,50271,WILLIAMS,Hamilton,IA,40,172.819803 183,50272,WILLIAMSON,Lucas,IA,59,1.922108 184,50273,WINTERSET,Madison,IA,61,519.17142 185,50274,WIOTA,Cass,IA,15,130.159776 186,50275,WOODBURN,Clarke,IA,20,131.849713 187,50276,WOODWARD,Dallas,IA,25,209.84696 188,50277,YALE,Guthrie,IA,39,104.624278 189,50278,ZEARING,Story,IA,85,138.864198 190,50309,DES MOINES,Polk,IA,77,7.776473 191,50310,DES MOINES,Polk,IA,77,21.123546 192,50311,DES MOINES,Polk,IA,77,6.511832 193,50312,DES MOINES,Polk,IA,77,15.05106 194,50313,DES MOINES,Polk,IA,77,47.635293 195,50314,DES MOINES,Polk,IA,77,6.629721 196,50315,DES MOINES,Polk,IA,77,26.560331 197,50316,DES MOINES,Polk,IA,77,9.302481 198,50317,DES MOINES,Polk,IA,77,60.041842 199,50319,DES MOINES,Polk,IA,77,0.213707 200,50320,DES MOINES,Polk,IA,77,49.547031 201,50321,DES MOINES,Polk,IA,77,30.969186 202,50322,URBANDALE,Polk,IA,77,27.938267 203,50323,URBANDALE,Dallas,IA,25,19.984131 204,50324,WINDSOR HEIGHTS,Polk,IA,77,3.74028 205,50325,CLIVE,Polk,IA,77,20.224117 206,50327,PLEASANT HILL,Polk,IA,77,49.702622 207,50401,MASON CITY,Cerro Gordo,IA,17,387.509792 208,50420,ALEXANDER,Franklin,IA,35,117.256906 209,50421,BELMOND,Wright,IA,99,232.911303 210,50423,BRITT,Hancock,IA,41,376.364842 211,50424,BUFFALO CENTER,Winnebago,IA,95,315.854649 212,50426,CARPENTER,Mitchell,IA,66,0.060113 213,50428,CLEAR LAKE,Cerro 
Gordo,IA,17,316.380154 214,50430,CORWITH,Hancock,IA,41,160.984015 215,50431,COULTER,Franklin,IA,35,1.936776 216,50432,CRYSTAL LAKE,Hancock,IA,41,1.127714 217,50433,DOUGHERTY,Cerro Gordo,IA,17,125.889253 218,50434,FERTILE,Worth,IA,98,23.487419 219,50435,FLOYD,Floyd,IA,34,105.304339 220,50436,FOREST CITY,Winnebago,IA,95,354.034151 221,50438,GARNER,Hancock,IA,41,342.63578 222,50439,GOODELL,Hancock,IA,41,84.706319 223,50440,GRAFTON,Worth,IA,98,80.665798 224,50441,HAMPTON,Franklin,IA,35,395.662654 225,50444,HANLONTOWN,Worth,IA,98,55.966912 226,50446,JOICE,Worth,IA,98,108.86978 227,50447,KANAWHA,Hancock,IA,41,261.433926 228,50448,KENSETT,Worth,IA,98,160.165797 229,50449,KLEMME,Hancock,IA,41,95.013046 230,50450,LAKE MILLS,Winnebago,IA,95,214.081289 231,50451,LAKOTA,Kossuth,IA,55,164.530815 232,50452,LATIMER,Franklin,IA,35,127.424687 233,50453,LELAND,Winnebago,IA,95,114.347611 234,50454,LITTLE CEDAR,Mitchell,IA,66,44.76636 235,50455,MC INTIRE,Mitchell,IA,66,84.207513 236,50456,MANLY,Worth,IA,98,120.430473 237,50457,MESERVEY,Cerro Gordo,IA,17,88.503707 238,50458,NORA SPRINGS,Floyd,IA,34,190.798489 239,50459,NORTHWOOD,Worth,IA,98,375.483499 240,50460,ORCHARD,Mitchell,IA,66,93.166151 241,50461,OSAGE,Mitchell,IA,66,446.474922 242,50464,PLYMOUTH,Cerro Gordo,IA,17,61.009266 243,50465,RAKE,Winnebago,IA,95,9.497218 244,50466,RICEVILLE,Howard,IA,45,354.341172 245,50467,ROCK FALLS,Cerro Gordo,IA,17,0.709984 246,50468,ROCKFORD,Floyd,IA,34,256.209489 247,50469,ROCKWELL,Cerro Gordo,IA,17,227.879604 248,50470,ROWAN,Wright,IA,99,62.885229 249,50471,RUDD,Floyd,IA,34,110.089036 250,50472,SAINT ANSGAR,Mitchell,IA,66,316.41701 251,50473,SCARVILLE,Winnebago,IA,95,96.000513 252,50475,SHEFFIELD,Franklin,IA,35,231.102519 253,50476,STACYVILLE,Mitchell,IA,66,101.702581 254,50477,SWALEDALE,Cerro Gordo,IA,17,58.455862 255,50478,THOMPSON,Winnebago,IA,95,191.575269 256,50479,THORNTON,Cerro Gordo,IA,17,150.266394 257,50480,TITONKA,Kossuth,IA,55,163.023328 258,50482,VENTURA,Cerro Gordo,IA,17,81.166123 259,50483,WESLEY,Kossuth,IA,55,198.210428 260,50484,WODEN,Hancock,IA,41,113.720772 261,50501,FORT DODGE,Webster,IA,94,407.21578 262,50510,ALBERT CITY,Buena Vista,IA,11,219.73212 263,50511,ALGONA,Kossuth,IA,55,321.723723 264,50514,ARMSTRONG,Emmet,IA,32,287.766282 265,50515,AYRSHIRE,Palo Alto,IA,74,96.076793 266,50516,BADGER,Webster,IA,94,63.984856 267,50517,BANCROFT,Kossuth,IA,55,198.110964 268,50518,BARNUM,Webster,IA,94,80.803741 269,50519,BODE,Humboldt,IA,46,142.395653 270,50520,BRADGATE,Humboldt,IA,46,62.072927 271,50521,BURNSIDE,Webster,IA,94,3.160848 272,50522,BURT,Kossuth,IA,55,176.225395 273,50523,CALLENDER,Webster,IA,94,109.190936 274,50524,CLARE,Webster,IA,94,148.535431 275,50525,CLARION,Wright,IA,99,363.773781 276,50527,CURLEW,Palo Alto,IA,74,141.597255 277,50528,CYLINDER,Palo Alto,IA,74,191.696767 278,50529,DAKOTA CITY,Humboldt,IA,46,1.445412 279,50530,DAYTON,Webster,IA,94,168.385319 280,50531,DOLLIVER,Emmet,IA,32,100.449294 281,50532,DUNCOMBE,Webster,IA,94,181.71793 282,50533,EAGLE GROVE,Wright,IA,99,258.50312 283,50535,EARLY,Sac,IA,81,158.492375 284,50536,EMMETSBURG,Palo Alto,IA,74,370.232494 285,50538,FARNHAMVILLE,Calhoun,IA,13,80.88901 286,50539,FENTON,Kossuth,IA,55,139.538906 287,50540,FONDA,Pocahontas,IA,76,275.625728 288,50541,GILMORE CITY,Humboldt,IA,46,237.749663 289,50542,GOLDFIELD,Wright,IA,99,172.884137 290,50543,GOWRIE,Webster,IA,94,212.824794 291,50544,HARCOURT,Webster,IA,94,75.465666 292,50545,HARDY,Humboldt,IA,46,97.252233 293,50546,HAVELOCK,Pocahontas,IA,76,137.019674 
294,50548,HUMBOLDT,Humboldt,IA,46,323.465219 295,50551,JOLLEY,Calhoun,IA,13,69.704315 296,50554,LAURENS,Pocahontas,IA,76,232.762069 297,50556,LEDYARD,Kossuth,IA,55,101.116 298,50557,LEHIGH,Webster,IA,94,130.151481 299,50558,LIVERMORE,Humboldt,IA,46,114.721586 300,50559,LONE ROCK,Kossuth,IA,55,102.790935 301,50560,LU VERNE,Kossuth,IA,55,225.654488 302,50561,LYTTON,Calhoun,IA,13,119.358374 303,50562,MALLARD,Palo Alto,IA,74,165.098566 304,50563,MANSON,Calhoun,IA,13,253.318021 305,50565,MARATHON,Buena Vista,IA,11,114.004136 306,50566,MOORLAND,Webster,IA,94,89.865522 307,50567,NEMAHA,Sac,IA,81,67.326397 308,50568,NEWELL,Buena Vista,IA,11,220.071293 309,50569,OTHO,Webster,IA,94,54.889709 310,50570,OTTOSEN,Humboldt,IA,46,112.220748 311,50571,PALMER,Pocahontas,IA,76,115.641227 312,50573,PLOVER,Pocahontas,IA,76,1.045738 313,50574,POCAHONTAS,Pocahontas,IA,76,288.134473 314,50575,POMEROY,Calhoun,IA,13,163.4466 315,50576,REMBRANDT,Buena Vista,IA,11,93.051632 316,50577,RENWICK,Humboldt,IA,46,106.941323 317,50578,RINGSTED,Emmet,IA,32,192.331445 318,50579,ROCKWELL CITY,Calhoun,IA,13,359.119951 319,50581,ROLFE,Pocahontas,IA,76,246.722485 320,50582,RUTLAND,Humboldt,IA,46,53.572451 321,50583,SAC CITY,Sac,IA,81,306.359541 322,50585,SIOUX RAPIDS,Buena Vista,IA,11,165.291906 323,50586,SOMERS,Calhoun,IA,13,91.12132 324,50588,STORM LAKE,Buena Vista,IA,11,368.993698 325,50590,SWEA CITY,Kossuth,IA,55,203.980739 326,50591,THOR,Humboldt,IA,46,73.985552 327,50593,VARINA,Pocahontas,IA,76,0.480019 328,50594,VINCENT,Webster,IA,94,67.103128 329,50595,WEBSTER CITY,Hamilton,IA,40,399.609138 330,50597,WEST BEND,Palo Alto,IA,74,214.240511 331,50598,WHITTEMORE,Kossuth,IA,55,176.474137 332,50599,WOOLSTOCK,Wright,IA,99,133.057067 333,50601,ACKLEY,Franklin,IA,35,368.01212 334,50602,ALLISON,Butler,IA,12,207.455662 335,50603,ALTA VISTA,Chickasaw,IA,19,122.972014 336,50604,APLINGTON,Butler,IA,12,184.521061 337,50605,AREDALE,Butler,IA,12,38.865937 338,50606,ARLINGTON,Fayette,IA,33,184.315162 339,50607,AURORA,Buchanan,IA,10,123.088687 340,50609,BEAMAN,Grundy,IA,38,89.218598 341,50611,BRISTOW,Butler,IA,12,78.763743 342,50612,BUCKINGHAM,Tama,IA,86,57.581068 343,50613,CEDAR FALLS,Black Hawk,IA,7,329.972902 344,50616,CHARLES CITY,Floyd,IA,34,448.105088 345,50619,CLARKSVILLE,Butler,IA,12,230.86623 346,50620,COLWELL,Floyd,IA,34,0.324589 347,50621,CONRAD,Grundy,IA,38,165.399151 348,50622,DENVER,Bremer,IA,9,64.857976 349,50624,DIKE,Grundy,IA,38,135.187133 350,50625,DUMONT,Butler,IA,12,158.053593 351,50626,DUNKERTON,Black Hawk,IA,7,130.892804 352,50627,ELDORA,Hardin,IA,42,277.223505 353,50628,ELMA,Howard,IA,45,289.356789 354,50629,FAIRBANK,Buchanan,IA,10,205.328666 355,50630,FREDERICKSBURG,Chickasaw,IA,19,214.715992 356,50632,GARWIN,Tama,IA,86,110.25125 357,50632,GARWIN,Tama,IA,86,110.25125 358,50633,GENEVA,Franklin,IA,35,103.078532 359,50634,GILBERTVILLE,Black Hawk,IA,7,1.018252 360,50635,GLADBROOK,Tama,IA,86,217.592812 361,50636,GREENE,Butler,IA,12,313.519052 362,50638,GRUNDY CENTER,Grundy,IA,38,245.826479 363,50641,HAZLETON,Buchanan,IA,10,123.019787 364,50642,HOLLAND,Grundy,IA,38,83.404671 365,50643,HUDSON,Black Hawk,IA,7,163.108458 366,50644,INDEPENDENCE,Buchanan,IA,10,372.595201 367,50645,IONIA,Chickasaw,IA,19,214.246026 368,50647,JANESVILLE,Bremer,IA,9,79.06534 369,50648,JESUP,Black Hawk,IA,7,223.212718 370,50650,LAMONT,Buchanan,IA,10,106.756415 371,50651,LA PORTE CITY,Black Hawk,IA,7,294.627077 372,50652,LINCOLN,Tama,IA,86,0.605668 373,50653,MARBLE ROCK,Floyd,IA,34,134.748744 374,50654,MASONVILLE,Delaware,IA,28,136.18815 
375,50655,MAYNARD,Fayette,IA,33,91.05168 376,50658,NASHUA,Chickasaw,IA,19,187.097603 377,50659,NEW HAMPTON,Chickasaw,IA,19,403.802766 378,50660,NEW HARTFORD,Butler,IA,12,100.165022 379,50662,OELWEIN,Fayette,IA,33,176.049517 380,50664,ORAN,Fayette,IA,33,0.09535 381,50665,PARKERSBURG,Butler,IA,12,253.290828 382,50666,PLAINFIELD,Bremer,IA,9,140.256035 383,50667,RAYMOND,Black Hawk,IA,7,5.217793 384,50668,READLYN,Bremer,IA,9,87.689572 385,50669,REINBECK,Grundy,IA,38,239.750053 386,50670,SHELL ROCK,Butler,IA,12,148.931701 387,50671,STANLEY,Buchanan,IA,10,57.533274 388,50672,STEAMBOAT ROCK,Hardin,IA,42,94.993283 389,50673,STOUT,Grundy,IA,38,0.444167 390,50674,SUMNER,Bremer,IA,9,408.690075 391,50675,TRAER,Tama,IA,86,287.237436 392,50676,TRIPOLI,Bremer,IA,9,148.867149 393,50677,WAVERLY,Bremer,IA,9,325.186841 394,50680,WELLSBURG,Grundy,IA,38,138.682394 395,50681,WESTGATE,Fayette,IA,33,63.049395 396,50682,WINTHROP,Buchanan,IA,10,220.98261 397,50701,WATERLOO,Black Hawk,IA,7,214.718743 398,50702,WATERLOO,Black Hawk,IA,7,25.60849 399,50703,WATERLOO,Black Hawk,IA,7,244.724015 400,50707,EVANSDALE,Black Hawk,IA,7,25.361881 401,50801,CRESTON,Union,IA,88,545.028688 402,50830,AFTON,Union,IA,88,306.617835 403,50833,BEDFORD,Taylor,IA,87,536.325319 404,50835,BENTON,Ringgold,IA,80,43.994784 405,50836,BLOCKTON,Taylor,IA,87,232.828727 406,50837,BRIDGEWATER,Adair,IA,1,130.795854 407,50839,CARBON,Adams,IA,2,1.828417 408,50840,CLEARFIELD,Taylor,IA,87,129.94877 409,50841,CORNING,Adams,IA,2,610.836196 410,50842,CROMWELL,Union,IA,88,0.674912 411,50843,CUMBERLAND,Cass,IA,15,195.866561 412,50845,DIAGONAL,Ringgold,IA,80,285.590236 413,50846,FONTANELLE,Adair,IA,1,238.867152 414,50847,GRANT,Montgomery,IA,69,0.863108 415,50848,GRAVITY,Taylor,IA,87,117.720194 416,50849,GREENFIELD,Adair,IA,1,304.431532 417,50851,LENOX,Taylor,IA,87,335.214398 418,50853,MASSENA,Cass,IA,15,195.987739 419,50854,MOUNT AYR,Ringgold,IA,80,350.941226 420,50857,NODAWAY,Adams,IA,2,131.265143 421,50858,ORIENT,Adair,IA,1,206.99525 422,50859,PRESCOTT,Adams,IA,2,206.490726 423,50860,REDDING,Ringgold,IA,80,115.136578 424,50861,SHANNON CITY,Union,IA,88,113.520838 425,50862,SHARPSBURG,Taylor,IA,87,56.218206 426,50863,TINGLEY,Ringgold,IA,80,78.178667 427,50864,VILLISCA,Montgomery,IA,69,377.642994 428,51001,AKRON,Plymouth,IA,75,360.862327 429,51002,ALTA,Buena Vista,IA,11,297.148464 430,51003,ALTON,Sioux,IA,84,144.371109 431,51004,ANTHON,Woodbury,IA,97,212.848541 432,51005,AURELIA,Cherokee,IA,18,244.17026 433,51006,BATTLE CREEK,Ida,IA,47,213.095547 434,51007,BRONSON,Woodbury,IA,97,87.318034 435,51008,BRUNSVILLE,Plymouth,IA,75,0.630646 436,51009,CALUMET,O'Brien,IA,71,0.611821 437,51010,CASTANA,Monona,IA,67,176.313125 438,51011,CHATSWORTH,Sioux,IA,84,1.275286 439,51012,CHEROKEE,Cherokee,IA,18,386.838104 440,51014,CLEGHORN,Cherokee,IA,18,139.646952 441,51016,CORRECTIONVILLE,Woodbury,IA,97,262.895432 442,51018,CUSHING,Woodbury,IA,97,95.118381 443,51019,DANBURY,Woodbury,IA,97,236.787928 444,51020,GALVA,Ida,IA,47,140.577018 445,51022,GRANVILLE,Sioux,IA,84,204.082524 446,51023,HAWARDEN,Sioux,IA,84,271.44549 447,51024,HINTON,Plymouth,IA,75,232.728447 448,51025,HOLSTEIN,Ida,IA,47,278.402985 449,51026,HORNICK,Woodbury,IA,97,281.235872 450,51027,IRETON,Sioux,IA,84,239.835719 451,51028,KINGSLEY,Plymouth,IA,75,328.930974 452,51029,LARRABEE,Cherokee,IA,18,58.631884 453,51030,LAWTON,Woodbury,IA,97,153.939315 454,51031,LE MARS,Plymouth,IA,75,605.111843 455,51033,LINN GROVE,Buena Vista,IA,11,163.475339 456,51034,MAPLETON,Monona,IA,67,290.876284 
457,51035,MARCUS,Cherokee,IA,18,278.97681 458,51036,MAURICE,Sioux,IA,84,114.229271 459,51037,MERIDEN,Cherokee,IA,18,61.66748 460,51038,MERRILL,Plymouth,IA,75,233.828323 461,51039,MOVILLE,Woodbury,IA,97,223.159311 462,51040,ONAWA,Monona,IA,67,399.810225 463,51041,ORANGE CITY,Sioux,IA,84,184.491043 464,51044,OTO,Woodbury,IA,97,89.839583 465,51046,PAULLINA,O'Brien,IA,71,240.918039 466,51047,PETERSON,Clay,IA,21,200.078951 467,51048,PIERSON,Woodbury,IA,97,86.240685 468,51049,QUIMBY,Cherokee,IA,18,113.097948 469,51050,REMSEN,Plymouth,IA,75,353.266814 470,51051,RODNEY,Monona,IA,67,8.740209 471,51052,SALIX,Woodbury,IA,97,159.659315 472,51053,SCHALLER,Sac,IA,81,195.513267 473,51054,SERGEANT BLUFF,Woodbury,IA,97,106.329025 474,51055,SLOAN,Woodbury,IA,97,174.692448 475,51056,SMITHLAND,Woodbury,IA,97,88.932213 476,51058,SUTHERLAND,O'Brien,IA,71,214.786027 477,51060,UTE,Monona,IA,67,156.247108 478,51061,WASHTA,Cherokee,IA,18,121.628171 479,51062,WESTFIELD,Plymouth,IA,75,144.594267 480,51063,WHITING,Monona,IA,67,162.058438 481,51101,SIOUX CITY,Woodbury,IA,97,3.138764 482,51103,SIOUX CITY,Woodbury,IA,97,27.86321 483,51104,SIOUX CITY,Woodbury,IA,97,20.00953 484,51105,SIOUX CITY,Woodbury,IA,97,15.825592 485,51106,SIOUX CITY,Woodbury,IA,97,81.702782 486,51108,SIOUX CITY,Woodbury,IA,97,116.455967 487,51109,SIOUX CITY,Woodbury,IA,97,49.159557 488,51111,SIOUX CITY,Woodbury,IA,97,17.993387 489,51201,SHELDON,O'Brien,IA,71,295.817592 490,51230,ALVORD,Lyon,IA,60,64.875507 491,51231,ARCHER,O'Brien,IA,71,73.029493 492,51232,ASHTON,Osceola,IA,72,156.638511 493,51234,BOYDEN,Sioux,IA,84,129.274507 494,51235,DOON,Lyon,IA,60,144.971909 495,51237,GEORGE,Lyon,IA,60,249.759921 496,51238,HOSPERS,Sioux,IA,84,118.831336 497,51239,HULL,Sioux,IA,84,171.513941 498,51240,INWOOD,Lyon,IA,60,263.161609 499,51241,LARCHWOOD,Lyon,IA,60,232.767886 500,51242,LESTER,Lyon,IA,60,1.181066 501,51243,LITTLE ROCK,Lyon,IA,60,139.238401 502,51244,MATLOCK,Sioux,IA,84,0.778428 503,51245,PRIMGHAR,O'Brien,IA,71,203.413241 504,51246,ROCK RAPIDS,Lyon,IA,60,424.515402 505,51247,ROCK VALLEY,Sioux,IA,84,289.650577 506,51248,SANBORN,O'Brien,IA,71,214.49695 507,51249,SIBLEY,Osceola,IA,72,324.74321 508,51250,SIOUX CENTER,Sioux,IA,84,186.398963 509,51301,SPENCER,Clay,IA,21,406.942409 510,51331,ARNOLDS PARK,Dickinson,IA,30,7.12251 511,51333,DICKENS,Clay,IA,21,170.365473 512,51334,ESTHERVILLE,Emmet,IA,32,491.781068 513,51338,EVERLY,Clay,IA,21,189.29993 514,51341,GILLETT GROVE,Clay,IA,21,0.89395 515,51342,GRAETTINGER,Palo Alto,IA,74,214.231634 516,51343,GREENVILLE,Clay,IA,21,65.322005 517,51345,HARRIS,Osceola,IA,72,146.304432 518,51346,HARTLEY,O'Brien,IA,71,377.358581 519,51347,LAKE PARK,Dickinson,IA,30,223.92845 520,51350,MELVIN,Osceola,IA,72,112.104258 521,51351,MILFORD,Dickinson,IA,30,278.931975 522,51354,OCHEYEDAN,Osceola,IA,72,227.662269 523,51355,OKOBOJI,Dickinson,IA,30,11.052693 524,51357,ROYAL,Clay,IA,21,108.592195 525,51358,RUTHVEN,Palo Alto,IA,74,202.954357 526,51360,SPIRIT LAKE,Dickinson,IA,30,334.993966 527,51363,SUPERIOR,Dickinson,IA,30,1.050241 528,51364,TERRIL,Dickinson,IA,30,167.034887 529,51365,WALLINGFORD,Emmet,IA,32,58.440957 530,51366,WEBB,Clay,IA,21,163.894539 531,51401,CARROLL,Carroll,IA,14,454.121 532,51430,ARCADIA,Carroll,IA,14,105.531033 533,51431,ARTHUR,Ida,IA,47,101.935208 534,51433,AUBURN,Sac,IA,81,147.355553 535,51436,BREDA,Carroll,IA,14,156.790205 536,51439,CHARTER OAK,Crawford,IA,24,223.476837 537,51440,DEDHAM,Carroll,IA,14,66.015871 538,51441,DELOIT,Crawford,IA,24,40.266915 539,51442,DENISON,Crawford,IA,24,448.677639 
540,51443,GLIDDEN,Carroll,IA,14,276.442701 541,51444,HALBUR,Carroll,IA,14,0.42223 542,51445,IDA GROVE,Ida,IA,47,327.873042 543,51446,IRWIN,Shelby,IA,83,90.274738 544,51447,KIRKMAN,Shelby,IA,83,90.268943 545,51448,KIRON,Crawford,IA,24,144.043445 546,51449,LAKE CITY,Calhoun,IA,13,232.944822 547,51450,LAKE VIEW,Sac,IA,81,157.595822 548,51451,LANESBORO,Carroll,IA,14,0.955449 549,51453,LOHRVILLE,Calhoun,IA,13,223.339647 550,51454,MANILLA,Crawford,IA,24,259.313231 551,51455,MANNING,Carroll,IA,14,287.004925 552,51458,ODEBOLT,Sac,IA,81,245.407583 553,51459,RALSTON,Carroll,IA,14,1.355933 554,51461,SCHLESWIG,Crawford,IA,24,130.196128 555,51462,SCRANTON,Greene,IA,37,281.553969 556,51463,TEMPLETON,Carroll,IA,14,77.535941 557,51465,VAIL,Crawford,IA,24,141.104728 558,51466,WALL LAKE,Sac,IA,81,134.548626 559,51467,WESTSIDE,Crawford,IA,24,168.975493 560,51501,COUNCIL BLUFFS,Pottawattamie,IA,78,68.663347 561,51503,COUNCIL BLUFFS,Pottawattamie,IA,78,311.311378 562,51510,CARTER LAKE,Pottawattamie,IA,78,5.228569 563,51520,ARION,Crawford,IA,24,48.585455 564,51521,AVOCA,Pottawattamie,IA,78,223.26416 565,51523,BLENCOE,Monona,IA,67,134.969227 566,51525,CARSON,Pottawattamie,IA,78,158.290149 567,51526,CRESCENT,Pottawattamie,IA,78,111.333068 568,51527,DEFIANCE,Shelby,IA,83,101.109924 569,51528,DOW CITY,Crawford,IA,24,189.885849 570,51529,DUNLAP,Harrison,IA,43,333.953737 571,51530,EARLING,Shelby,IA,83,140.370471 572,51531,ELK HORN,Shelby,IA,83,73.952439 573,51532,ELLIOTT,Montgomery,IA,69,147.80971 574,51533,EMERSON,Mills,IA,65,213.641636 575,51534,GLENWOOD,Mills,IA,65,260.80305 576,51535,GRISWOLD,Cass,IA,15,337.126792 577,51536,HANCOCK,Pottawattamie,IA,78,124.955843 578,51537,HARLAN,Shelby,IA,83,422.332929 579,51540,HASTINGS,Mills,IA,65,141.382464 580,51541,HENDERSON,Mills,IA,65,106.735256 581,51542,HONEY CREEK,Pottawattamie,IA,78,92.88071 582,51543,KIMBALLTON,Audubon,IA,5,57.402651 583,51544,LEWIS,Cass,IA,15,145.135032 584,51545,LITTLE SIOUX,Harrison,IA,43,116.122152 585,51546,LOGAN,Harrison,IA,43,293.310022 586,51548,MC CLELLAND,Pottawattamie,IA,78,64.393248 587,51549,MACEDONIA,Pottawattamie,IA,78,84.699901 588,51550,MAGNOLIA,Harrison,IA,43,1.456103 589,51551,MALVERN,Mills,IA,65,210.796119 590,51552,MARNE,Cass,IA,15,91.256513 591,51553,MINDEN,Pottawattamie,IA,78,118.587348 592,51554,MINEOLA,Mills,IA,65,6.840874 593,51555,MISSOURI VALLEY,Harrison,IA,43,410.136654 594,51556,MODALE,Harrison,IA,43,117.494595 595,51557,MONDAMIN,Harrison,IA,43,181.091843 596,51558,MOORHEAD,Monona,IA,67,213.388981 597,51559,NEOLA,Pottawattamie,IA,78,220.723262 598,51560,OAKLAND,Pottawattamie,IA,78,254.891645 599,51561,PACIFIC JUNCTION,Mills,IA,65,159.38428 600,51562,PANAMA,Shelby,IA,83,87.966803 601,51563,PERSIA,Harrison,IA,43,142.269566 602,51564,PISGAH,Harrison,IA,43,103.461051 603,51565,PORTSMOUTH,Shelby,IA,83,126.12811 604,51566,RED OAK,Montgomery,IA,69,457.891547 605,51570,SHELBY,Shelby,IA,83,166.695973 606,51571,SILVER CITY,Mills,IA,65,121.369661 607,51572,SOLDIER,Monona,IA,67,114.390965 608,51573,STANTON,Montgomery,IA,69,156.152123 609,51575,TREYNOR,Pottawattamie,IA,78,126.569658 610,51576,UNDERWOOD,Pottawattamie,IA,78,130.453907 611,51577,WALNUT,Pottawattamie,IA,78,206.979038 612,51578,WESTPHALIA,Shelby,IA,83,0.096684 613,51579,WOODBINE,Harrison,IA,43,301.420886 614,51601,SHENANDOAH,Page,IA,73,276.34259 615,51630,BLANCHARD,Page,IA,73,65.739953 616,51631,BRADDYVILLE,Page,IA,73,95.851302 617,51632,CLARINDA,Page,IA,73,540.669979 618,51636,COIN,Page,IA,73,144.434643 619,51637,COLLEGE SPRINGS,Page,IA,73,4.185566 
620,51638,ESSEX,Page,IA,73,220.766642 621,51639,FARRAGUT,Fremont,IA,36,186.79102 622,51640,HAMBURG,Fremont,IA,36,313.989946 623,51645,IMOGENE,Fremont,IA,36,107.043333 624,51646,NEW MARKET,Taylor,IA,87,162.903928 625,51647,NORTHBORO,Page,IA,73,49.689882 626,51648,PERCIVAL,Fremont,IA,36,130.838045 627,51649,RANDOLPH,Fremont,IA,36,106.334486 628,51650,RIVERTON,Fremont,IA,36,78.508646 629,51652,SIDNEY,Fremont,IA,36,200.054769 630,51653,TABOR,Fremont,IA,36,89.512532 631,51654,THURMAN,Fremont,IA,36,137.790111 632,51656,YORKTOWN,Page,IA,73,0.438077 633,52001,DUBUQUE,Dubuque,IA,31,75.057763 634,52002,DUBUQUE,Dubuque,IA,31,74.76947 635,52003,DUBUQUE,Dubuque,IA,31,151.954104 636,52030,ANDREW,Jackson,IA,49,0.690483 637,52031,BELLEVUE,Jackson,IA,49,448.070077 638,52032,BERNARD,Jackson,IA,49,272.572418 639,52033,CASCADE,Jones,IA,53,252.635822 640,52035,COLESBURG,Clayton,IA,22,139.028244 641,52037,DELMAR,Clinton,IA,23,176.775293 642,52038,DUNDEE,Delaware,IA,28,76.611993 643,52039,DURANGO,Dubuque,IA,31,91.068312 644,52040,DYERSVILLE,Dubuque,IA,31,157.748966 645,52041,EARLVILLE,Delaware,IA,28,155.040397 646,52042,EDGEWOOD,Clayton,IA,22,161.463192 647,52043,ELKADER,Clayton,IA,22,258.435335 648,52044,ELKPORT,Clayton,IA,22,31.383574 649,52045,EPWORTH,Dubuque,IA,31,123.92508 650,52046,FARLEY,Dubuque,IA,31,119.196635 651,52047,FARMERSBURG,Clayton,IA,22,88.328123 652,52048,GARBER,Clayton,IA,22,85.789255 653,52049,GARNAVILLO,Clayton,IA,22,188.393273 654,52050,GREELEY,Delaware,IA,28,84.028046 655,52052,GUTTENBERG,Clayton,IA,22,249.00035 656,52053,HOLY CROSS,Dubuque,IA,31,154.540299 657,52054,LA MOTTE,Jackson,IA,49,138.18618 658,52057,MANCHESTER,Delaware,IA,28,362.350079 659,52060,MAQUOKETA,Jackson,IA,49,409.835195 660,52064,MILES,Jackson,IA,49,107.540582 661,52065,NEW VIENNA,Dubuque,IA,31,131.382162 662,52066,NORTH BUENA VISTA,Clayton,IA,22,0.681956 663,52068,PEOSTA,Dubuque,IA,31,120.418986 664,52069,PRESTON,Jackson,IA,49,133.277674 665,52070,SABULA,Jackson,IA,49,154.415679 666,52072,SAINT OLAF,Clayton,IA,22,87.106056 667,52073,SHERRILL,Dubuque,IA,31,139.623458 668,52074,SPRAGUEVILLE,Jackson,IA,49,80.684903 669,52076,STRAWBERRY POINT,Clayton,IA,22,252.900986 670,52077,VOLGA,Clayton,IA,22,83.72915 671,52078,WORTHINGTON,Dubuque,IA,31,106.290071 672,52079,ZWINGLE,Jackson,IA,49,153.776141 673,52101,DECORAH,Winneshiek,IA,96,804.622162 674,52132,CALMAR,Winneshiek,IA,96,157.379001 675,52133,CASTALIA,Winneshiek,IA,96,118.295514 676,52134,CHESTER,Howard,IA,45,77.559488 677,52135,CLERMONT,Fayette,IA,33,67.561964 678,52136,CRESCO,Howard,IA,45,552.444562 679,52140,DORCHESTER,Allamakee,IA,3,201.58604 680,52141,ELGIN,Fayette,IA,33,216.938945 681,52142,FAYETTE,Fayette,IA,33,189.331878 682,52144,FORT ATKINSON,Winneshiek,IA,96,194.119174 683,52146,HARPERS FERRY,Allamakee,IA,3,221.678469 684,52147,HAWKEYE,Fayette,IA,33,187.912171 685,52151,LANSING,Allamakee,IA,3,324.670969 686,52154,LAWLER,Chickasaw,IA,19,189.966484 687,52155,LIME SPRINGS,Howard,IA,45,274.895155 688,52156,LUANA,Clayton,IA,22,110.19542 689,52157,MC GREGOR,Clayton,IA,22,149.443621 690,52158,MARQUETTE,Clayton,IA,22,3.357585 691,52159,MONONA,Clayton,IA,22,197.141302 692,52160,NEW ALBIN,Allamakee,IA,3,103.702081 693,52161,OSSIAN,Winneshiek,IA,96,140.304657 694,52162,POSTVILLE,Allamakee,IA,3,258.921418 695,52163,PROTIVIN,Howard,IA,45,4.602409 696,52164,RANDALIA,Fayette,IA,33,65.306405 697,52165,RIDGEWAY,Winneshiek,IA,96,172.29035 698,52166,SAINT LUCAS,Fayette,IA,33,0.493739 699,52168,SPILLVILLE,Winneshiek,IA,96,0.417415 700,52169,WADENA,Fayette,IA,33,77.089638 
701,52170,WATERVILLE,Allamakee,IA,3,120.174634 702,52171,WAUCOMA,Fayette,IA,33,205.89717 703,52171,WAUCOMA,Fayette,IA,33,205.89717 704,52172,WAUKON,Allamakee,IA,3,409.930314 705,52175,WEST UNION,Fayette,IA,33,224.708583 706,52201,AINSWORTH,Washington,IA,92,171.814169 707,52202,ALBURNETT,Linn,IA,57,65.295077 708,52203,AMANA,Iowa,IA,48,119.753928 709,52205,ANAMOSA,Jones,IA,53,309.946308 710,52206,ATKINS,Benton,IA,6,73.245916 711,52207,BALDWIN,Jackson,IA,49,102.173612 712,52208,BELLE PLAINE,Benton,IA,6,150.72943 713,52209,BLAIRSTOWN,Benton,IA,6,95.402594 714,52210,BRANDON,Buchanan,IA,10,90.462642 715,52211,BROOKLYN,Poweshiek,IA,79,236.939306 716,52212,CENTER JUNCTION,Jones,IA,53,60.790568 717,52213,CENTER POINT,Linn,IA,57,194.486805 718,52214,CENTRAL CITY,Linn,IA,57,247.622502 719,52215,CHELSEA,Tama,IA,86,224.182198 720,52216,CLARENCE,Cedar,IA,16,146.950128 721,52217,CLUTIER,Tama,IA,86,152.367658 722,52218,COGGON,Linn,IA,57,187.432341 723,52219,PRAIRIEBURG,Linn,IA,57,1.194124 724,52220,CONROY,Iowa,IA,48,1.194245 725,52221,GUERNSEY,Poweshiek,IA,79,53.776053 726,52222,DEEP RIVER,Poweshiek,IA,79,209.152349 727,52223,DELHI,Delaware,IA,28,127.619807 728,52224,DYSART,Tama,IA,86,256.593612 729,52225,ELBERON,Tama,IA,86,91.276214 730,52227,ELY,Linn,IA,57,76.013553 731,52228,FAIRFAX,Linn,IA,57,96.928032 732,52229,GARRISON,Benton,IA,6,127.391906 733,52231,HARPER,Keokuk,IA,54,84.735151 734,52232,HARTWICK,Poweshiek,IA,79,47.517446 735,52233,HIAWATHA,Linn,IA,57,9.080084 736,52235,HILLS,Johnson,IA,52,5.257012 737,52236,HOMESTEAD,Iowa,IA,48,74.148178 738,52237,HOPKINTON,Delaware,IA,28,210.461938 739,52240,IOWA CITY,Johnson,IA,52,415.571318 740,52241,CORALVILLE,Johnson,IA,52,30.871305 741,52242,IOWA CITY,Johnson,IA,52,1.995678 742,52245,IOWA CITY,Johnson,IA,52,21.712859 743,52246,IOWA CITY,Johnson,IA,52,23.832009 744,52246,IOWA CITY,Johnson,IA,52,23.832009 745,52247,KALONA,Washington,IA,92,206.350122 746,52248,KEOTA,Washington,IA,92,295.236835 747,52249,KEYSTONE,Benton,IA,6,131.160363 748,52251,LADORA,Iowa,IA,48,106.339173 749,52253,LISBON,Linn,IA,57,121.480006 750,52254,LOST NATION,Clinton,IA,23,146.759383 751,52255,LOWDEN,Cedar,IA,16,112.408307 752,52257,LUZERNE,Benton,IA,6,40.66545 753,52301,MARENGO,Iowa,IA,48,316.484635 754,52302,MARION,Linn,IA,57,192.237746 755,52305,MARTELLE,Jones,IA,53,73.493573 756,52306,MECHANICSVILLE,Cedar,IA,16,194.245557 757,52307,MIDDLE AMANA,Iowa,IA,48,0.488234 758,52308,MILLERSBURG,Iowa,IA,48,0.233003 759,52309,MONMOUTH,Jackson,IA,49,74.291839 760,52310,MONTICELLO,Jones,IA,53,385.183593 761,52312,MORLEY,Jones,IA,53,0.242872 762,52313,MOUNT AUBURN,Benton,IA,6,84.959546 763,52314,MOUNT VERNON,Linn,IA,57,155.658054 764,52315,NEWHALL,Benton,IA,6,75.003645 765,52316,NORTH ENGLISH,Iowa,IA,48,180.293861 766,52317,NORTH LIBERTY,Johnson,IA,52,95.897591 767,52318,NORWAY,Benton,IA,6,95.080335 768,52320,OLIN,Jones,IA,53,153.985129 769,52321,ONSLOW,Jones,IA,53,77.84139 770,52322,OXFORD,Johnson,IA,52,231.220008 771,52323,OXFORD JUNCTION,Jones,IA,53,132.353794 772,52324,PALO,Linn,IA,57,107.056042 773,52325,PARNELL,Iowa,IA,48,109.143193 774,52326,QUASQUETON,Buchanan,IA,10,5.447918 775,52327,RIVERSIDE,Washington,IA,92,217.607612 776,52328,ROBINS,Linn,IA,57,7.989464 777,52329,ROWLEY,Buchanan,IA,10,132.090802 778,52330,RYAN,Delaware,IA,28,123.160697 779,52332,SHELLSBURG,Benton,IA,6,110.692089 780,52333,SOLON,Johnson,IA,52,234.514567 781,52334,SOUTH AMANA,Iowa,IA,48,43.109728 782,52335,SOUTH ENGLISH,Keokuk,IA,54,144.712437 783,52336,SPRINGVILLE,Linn,IA,57,127.253732 
784,52337,STANWOOD,Cedar,IA,16,70.714333 785,52338,SWISHER,Johnson,IA,52,88.82337 786,52339,TAMA,Tama,IA,86,265.108854 787,52340,TIFFIN,Johnson,IA,52,42.940483 788,52341,TODDVILLE,Linn,IA,57,34.64544 789,52342,TOLEDO,Tama,IA,86,232.660911 790,52345,URBANA,Benton,IA,6,8.460683 791,52346,VAN HORNE,Benton,IA,6,134.845571 792,52347,VICTOR,Iowa,IA,48,166.647034 793,52348,VINING,Tama,IA,86,2.376982 794,52349,VINTON,Benton,IA,6,382.672144 795,52351,WALFORD,Benton,IA,6,2.13643 796,52352,WALKER,Linn,IA,57,205.022618 797,52353,WASHINGTON,Washington,IA,92,395.338718 798,52354,WATKINS,Benton,IA,6,76.593832 799,52355,WEBSTER,Keokuk,IA,54,95.20977 800,52356,WELLMAN,Washington,IA,92,232.4078 801,52358,WEST BRANCH,Cedar,IA,16,200.939529 802,52359,WEST CHESTER,Washington,IA,92,38.168572 803,52361,WILLIAMSBURG,Iowa,IA,48,330.197198 804,52362,WYOMING,Jones,IA,53,157.429942 805,52401,CEDAR RAPIDS,Linn,IA,57,3.464505 806,52402,CEDAR RAPIDS,Linn,IA,57,36.420817 807,52403,CEDAR RAPIDS,Linn,IA,57,69.523743 808,52404,CEDAR RAPIDS,Linn,IA,57,142.93349 809,52405,CEDAR RAPIDS,Linn,IA,57,38.49318 810,52411,CEDAR RAPIDS,Linn,IA,57,44.635019 811,52501,OTTUMWA,Wapello,IA,90,591.297871 812,52530,AGENCY,Wapello,IA,90,36.986236 813,52531,ALBIA,Monroe,IA,68,563.107904 814,52533,BATAVIA,Jefferson,IA,51,227.778322 815,52534,BEACON,Mahaska,IA,62,1.012395 816,52535,BIRMINGHAM,Van Buren,IA,89,149.696508 817,52536,BLAKESBURG,Wapello,IA,90,160.133711 818,52537,BLOOMFIELD,Davis,IA,26,900.130186 819,52540,BRIGHTON,Jefferson,IA,51,257.681662 820,52542,CANTRIL,Van Buren,IA,89,117.166206 821,52543,CEDAR,Mahaska,IA,62,53.398057 822,52544,CENTERVILLE,Appanoose,IA,4,355.984819 823,52548,CHILLICOTHE,Wapello,IA,90,0.622729 824,52549,CINCINNATI,Appanoose,IA,4,113.382459 825,52550,DELTA,Keokuk,IA,54,100.697091 826,52551,DOUDS,Van Buren,IA,89,151.969068 827,52552,DRAKESVILLE,Davis,IA,26,151.883138 828,52553,EDDYVILLE,Wapello,IA,90,218.424094 829,52554,ELDON,Wapello,IA,90,94.705759 830,52555,EXLINE,Appanoose,IA,4,64.737097 831,52556,FAIRFIELD,Jefferson,IA,51,458.48455 832,52557,FAIRFIELD,Jefferson,IA,51,0.116099 833,52560,FLORIS,Davis,IA,26,91.932554 834,52561,FREMONT,Mahaska,IA,62,90.172828 835,52563,HEDRICK,Keokuk,IA,54,299.15143 836,52565,KEOSAUQUA,Van Buren,IA,89,307.637735 837,52566,KIRKVILLE,Wapello,IA,90,2.69235 838,52567,LIBERTYVILLE,Jefferson,IA,51,73.121919 839,52569,MELROSE,Monroe,IA,68,249.170073 840,52570,MILTON,Van Buren,IA,89,177.637484 841,52571,MORAVIA,Appanoose,IA,4,276.402554 842,52572,MOULTON,Appanoose,IA,4,252.233684 843,52573,MOUNT STERLING,Van Buren,IA,89,74.630386 844,52574,MYSTIC,Appanoose,IA,4,109.572946 845,52576,OLLIE,Keokuk,IA,54,117.180743 846,52577,OSKALOOSA,Mahaska,IA,62,415.772496 847,52580,PACKWOOD,Jefferson,IA,51,98.765465 848,52581,PLANO,Appanoose,IA,4,102.818704 849,52583,PROMISE CITY,Wayne,IA,93,120.716306 850,52584,PULASKI,Davis,IA,26,58.551231 851,52585,RICHLAND,Keokuk,IA,54,145.900379 852,52586,ROSE HILL,Mahaska,IA,62,122.508394 853,52588,SELMA,Van Buren,IA,89,26.887357 854,52590,SEYMOUR,Wayne,IA,93,194.526072 855,52591,SIGOURNEY,Keokuk,IA,54,327.849389 856,52593,UDELL,Appanoose,IA,4,42.505876 857,52594,UNIONVILLE,Appanoose,IA,4,121.987222 858,52595,UNIVERSITY PARK,Mahaska,IA,62,1.193986 859,52601,BURLINGTON,Des Moines,IA,29,312.02532 860,52619,ARGYLE,Lee,IA,56,95.322544 861,52620,BONAPARTE,Van Buren,IA,89,139.397291 862,52621,CRAWFORDSVILLE,Washington,IA,92,109.582868 863,52623,DANVILLE,Des Moines,IA,29,151.439917 864,52624,DENMARK,Lee,IA,56,1.656465 865,52625,DONNELLSON,Lee,IA,56,285.245058 
866,52626,FARMINGTON,Van Buren,IA,89,247.437264 867,52627,FORT MADISON,Lee,IA,56,188.414662 868,52630,HILLSBORO,Henry,IA,44,109.946731 869,52632,KEOKUK,Lee,IA,56,140.470022 870,52635,LOCKRIDGE,Jefferson,IA,51,112.512335 871,52637,MEDIAPOLIS,Des Moines,IA,29,172.04191 872,52638,MIDDLETOWN,Des Moines,IA,29,88.521308 873,52639,MONTROSE,Lee,IA,56,117.469873 874,52640,MORNING SUN,Louisa,IA,58,188.643645 875,52641,MOUNT PLEASANT,Henry,IA,44,551.76361 876,52644,MOUNT UNION,Henry,IA,44,111.080445 877,52645,NEW LONDON,Henry,IA,44,185.650036 878,52646,OAKVILLE,Louisa,IA,58,155.005489 879,52647,OLDS,Henry,IA,44,0.911591 880,52649,SALEM,Henry,IA,44,119.399466 881,52650,SPERRY,Des Moines,IA,29,104.495885 882,52651,STOCKPORT,Van Buren,IA,89,156.669871 883,52653,WAPELLO,Louisa,IA,58,312.47521 884,52654,WAYLAND,Henry,IA,44,136.065742 885,52655,WEST BURLINGTON,Des Moines,IA,29,42.56453 886,52656,WEST POINT,Lee,IA,56,247.274743 887,52657,SAINT PAUL,Lee,IA,56,0.494762 888,52658,WEVER,Lee,IA,56,131.24099 889,52659,WINFIELD,Henry,IA,44,159.819459 890,52660,YARMOUTH,Des Moines,IA,29,57.367543 891,52701,ANDOVER,Clinton,IA,23,1.643828 892,52720,ATALISSA,Muscatine,IA,70,110.12746 893,52721,BENNETT,Cedar,IA,16,108.698097 894,52722,BETTENDORF,Scott,IA,82,73.161625 895,52726,BLUE GRASS,Scott,IA,82,94.823134 896,52727,BRYANT,Clinton,IA,23,63.081901 897,52728,BUFFALO,Scott,IA,82,5.764558 898,52729,CALAMUS,Clinton,IA,23,108.881384 899,52730,CAMANCHE,Clinton,IA,23,101.243411 900,52731,CHARLOTTE,Clinton,IA,23,136.590407 901,52732,CLINTON,Clinton,IA,23,310.704633 902,52737,COLUMBUS CITY,Louisa,IA,58,0.607264 903,52738,COLUMBUS JUNCTION,Louisa,IA,58,323.19118 904,52739,CONESVILLE,Muscatine,IA,70,89.211035 905,52742,DE WITT,Clinton,IA,23,304.142776 906,52745,DIXON,Scott,IA,82,74.381611 907,52746,DONAHUE,Scott,IA,82,68.807314 908,52747,DURANT,Cedar,IA,16,56.094151 909,52748,ELDRIDGE,Scott,IA,82,107.004915 910,52749,FRUITLAND,Muscatine,IA,70,5.21822 911,52750,GOOSE LAKE,Clinton,IA,23,78.435746 912,52751,GRAND MOUND,Clinton,IA,23,129.085616 913,52752,GRANDVIEW,Louisa,IA,58,0.984227 914,52753,LE CLAIRE,Scott,IA,82,68.492183 915,52754,LETTS,Muscatine,IA,70,186.005712 916,52755,LONE TREE,Johnson,IA,52,157.930258 917,52756,LONG GROVE,Scott,IA,82,111.107641 918,52757,LOW MOOR,Clinton,IA,23,3.069924 919,52758,MC CAUSLAND,Scott,IA,82,1.745891 920,52760,MOSCOW,Muscatine,IA,70,58.900339 921,52761,MUSCATINE,Muscatine,IA,70,482.973608 922,52765,NEW LIBERTY,Scott,IA,82,78.185471 923,52766,NICHOLS,Muscatine,IA,70,119.401548 924,52767,PLEASANT VALLEY,Scott,IA,82,3.354642 925,52768,PRINCETON,Scott,IA,82,89.481384 926,52769,STOCKTON,Muscatine,IA,70,106.726313 927,52769,STOCKTON,Muscatine,IA,70,106.726313 928,52772,TIPTON,Cedar,IA,16,342.328096 929,52773,WALCOTT,Scott,IA,82,145.837805 930,52774,WELTON,Clinton,IA,23,0.728977 931,52776,WEST LIBERTY,Muscatine,IA,70,219.678908 932,52777,WHEATLAND,Clinton,IA,23,139.946616 933,52778,WILTON,Muscatine,IA,70,215.972375 934,52801,DAVENPORT,Scott,IA,82,1.359908 935,52802,DAVENPORT,Scott,IA,82,29.294412 936,52803,DAVENPORT,Scott,IA,82,14.068035 937,52804,DAVENPORT,Scott,IA,82,88.861422 938,52806,DAVENPORT,Scott,IA,82,79.448284 939,52807,DAVENPORT,Scott,IA,82,76.46944 -------------------------------------------------------------------------------- /retail-strategy/data/iowa_incomes.xls: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/retail-strategy/data/iowa_incomes.xls -------------------------------------------------------------------------------- /retail-strategy/data/pop_iowa_per_county.csv: -------------------------------------------------------------------------------- 1 | ,county,population 2 | 0,Adair,7092 3 | 1,Adams,3693 4 | 2,Allamakee,13884 5 | 3,Appanoose,12462 6 | 4,Audubon,5678 7 | 5,Benton,25699 8 | 6,Black Hawk,132904 9 | 7,Boone,26532 10 | 8,Bremer,24798 11 | 9,Buchanan,20992 12 | 10,Buena Vista,20332 13 | 11,Butler,14791 14 | 12,Calhoun,9846 15 | 13,Carroll,20437 16 | 14,Cass,13157 17 | 15,Cedar,18454 18 | 16,Cerro Gordo,43070 19 | 17,Cherokee,11508 20 | 18,Chickasaw,12023 21 | 19,Clarke,9309 22 | 20,Clay,16333 23 | 21,Clayton,17590 24 | 22,Clinton,47309 25 | 23,Crawford,16940 26 | 24,Dallas,84516 27 | 25,Davis,8860 28 | 26,Decatur,8141 29 | 27,Delaware,17327 30 | 28,Des Moines,39739 31 | 29,Dickinson,17243 32 | 30,Dubuque,97003 33 | 31,Emmet,9658 34 | 32,Fayette,20054 35 | 33,Floyd,15873 36 | 34,Franklin,10170 37 | 35,Fremont,6950 38 | 36,Greene,9011 39 | 37,Grundy,12313 40 | 38,Guthrie,10625 41 | 39,Hamilton,15076 42 | 40,Hancock,10835 43 | 41,Hardin,17226 44 | 42,Harrison,14149 45 | 43,Henry,19773 46 | 44,Howard,9332 47 | 45,Humboldt,9487 48 | 46,Ida,6985 49 | 47,Iowa,16311 50 | 48,Jackson,19472 51 | 49,Jasper,36708 52 | 50,Jefferson,18090 53 | 51,Johnson,146547 54 | 52,Jones,20439 55 | 53,Keokuk,10119 56 | 54,Kossuth,15114 57 | 55,Lee,34615 58 | 56,Linn,221661 59 | 57,Louisa,11142 60 | 58,Lucas,8647 61 | 59,Lyon,11754 62 | 60,Madison,15848 63 | 61,Mahaska,22181 64 | 62,Marion,33189 65 | 63,Marshall,40312 66 | 64,Mills,14972 67 | 65,Mitchell,10763 68 | 66,Monona,8898 69 | 67,Monroe,7870 70 | 68,Montgomery,10225 71 | 69,Muscatine,42940 72 | 70,O'Brien,14020 73 | 71,Osceola,6064 74 | 72,Page,15391 75 | 73,Palo Alto,9047 76 | 74,Plymouth,25200 77 | 75,Pocahontas,6886 78 | 76,Polk,474045 79 | 77,Pottawattamie,93582 80 | 78,Poweshiek,18533 81 | 79,Ringgold,5068 82 | 80,Sac,9876 83 | 81,Scott,172474 84 | 82,Shelby,11800 85 | 83,Sioux,34898 86 | 84,Story,97090 87 | 85,Tama,17319 88 | 86,Taylor,6216 89 | 87,Union,12420 90 | 88,Van Buren,7271 91 | 89,Wapello,34982 92 | 90,Warren,49691 93 | 91,Washington,22281 94 | 92,Wayne,6452 95 | 93,Webster,36769 96 | 94,Winnebago,10631 97 | 95,Winneshiek,20561 98 | 96,Woodbury,102779 99 | 97,Worth,7572 100 | 98,Wright,12779 101 | -------------------------------------------------------------------------------- /retail-strategy/images/123: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /retail-strategy/images/123.png: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /retail-strategy/images/hm3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/retail-strategy/images/hm3.png -------------------------------------------------------------------------------- /retail-strategy/images/liquor.jpeg: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/retail-strategy/images/liquor.jpeg -------------------------------------------------------------------------------- /retail-strategy/images/output.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/retail-strategy/images/output.png -------------------------------------------------------------------------------- /retail-strategy/images/test.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/retail-strategy/images/test.jpg -------------------------------------------------------------------------------- /tennis/123.png: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /tennis/README.md: -------------------------------------------------------------------------------- 1 | ## Forecasting the winner in the Men's ATP World Tour [[view code]](http://nbviewer.jupyter.org/github/marcotav/machine-learning-classification-projects/blob/master/tennis/notebooks/Final_Project_Marco_Tavora-DATNYC41_GA.ipynb) 2 | ![image title](https://img.shields.io/badge/work-in%20progress-blue.svg) ![image title](https://img.shields.io/badge/statsmodels-v0.8.0-blue.svg) ![Image title](https://img.shields.io/badge/sklearn-0.19.1-orange.svg) ![Image title](https://img.shields.io/badge/seaborn-v0.8.1-yellow.svg) ![Image title](https://img.shields.io/badge/pandas-0.22.0-red.svg) ![Image title](https://img.shields.io/badge/numpy-1.14.2-green.svg) ![Image title](https://img.shields.io/badge/matplotlib-v2.1.2-orange.svg) 3 | 4 | **The code is available [here](http://nbviewer.jupyter.org/github/marcotav/machine-learning-classification-projects/blob/master/tennis/notebooks/Final_Project_Marco_Tavora-DATNYC41_GA.ipynb) or by clicking on the [view code] link above.** 5 | 6 |

Problem Statement • Dataset • Importing basic modules • Pre-Processing of dataset • `Best_of` = 5 • Dummy variables • Exploratory Analysis for Best_of = 5 • Logistic Regression • Decision Trees and Random Forests

## Problem Statement

The goal of the project is to predict the probability that the higher-ranked player will win a tennis match. I will call that a `win` (as opposed to an upset).

## Dataset

The dataset, available at http://www.tennis-data.co.uk/data.php (obtained from Kaggle), contains results for the men's ATP tour dating back to January 2000. The features for each match that were used in the project were:
- `Date`: date of the match
- `Series`: name of ATP tennis series (we kept the four main current categories, namely Grand Slams, Masters 1000, ATP250 and ATP500)
- `Surface`: type of surface (clay, hard or grass)
- `Round`: round of match (from first round to the final)
- `Best of`: maximum number of sets playable in the match (best of 3 or best of 5)
- `WRank`: ATP entry ranking of the match winner as of the start of the tournament
- `LRank`: ATP entry ranking of the match loser as of the start of the tournament

The output variable is binary. The better player is, by definition, the higher-ranked one (i.e. the one with the smaller ranking number). The `win` variable is 1 if the higher-ranked player wins and 0 otherwise.

## Importing basic modules

```
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn import metrics
import seaborn as sns
sns.set_style("darkgrid")
import pylab as pl
%matplotlib inline
```

## Pre-Processing of dataset

After loading the dataset we proceed as follows:
- Keep only completed matches, i.e. eliminate matches with injury withdrawals and walkovers.
- Choose the features listed above.
- Drop `NaN` entries.
- Consider the two final years only (to avoid comparing different categories of tournaments which existed in the past). Note that this choice is somewhat arbitrary and can be changed if needed.
- Choose only higher-ranked players for better accuracy (as suggested by Corral and Prieto-Rodriguez (2010) and confirmed here).
```
# Converting to datetime
df_atp['Date'] = pd.to_datetime(df_atp['Date'])
# Restricting dates
df_atp = df_atp.loc[(df_atp['Date'] > '2014-11-09') & (df_atp['Date'] <= '2016-11-09')]
# Keeping only completed matches
df_atp = df_atp[df_atp['Comment'] == 'Completed'].drop("Comment", axis=1)
# Renaming Best of to Best_of
df_atp.rename(columns={'Best of': 'Best_of'}, inplace=True)
# Choosing features
cols_to_keep = ['Date', 'Series', 'Surface', 'Round', 'Best_of', 'WRank', 'LRank']
# Dropping NaNs
df_atp = df_atp[cols_to_keep].dropna()
# Dropping errors in the dataset and unimportant entries (e.g. there are very few entries for Masters Cup)
df_atp = df_atp[(df_atp['LRank'] != 'NR') & (df_atp['WRank'] != 'NR') & (df_atp['Series'] != 'Masters Cup')]
```
Another important step is to convert some of the columns from strings to numerical values:
```
cols_to_keep = ['Best_of', 'WRank', 'LRank']
df_atp[cols_to_keep] = df_atp[cols_to_keep].astype(int)
```
I now create an extra column for the variable `win` (described above) using an auxiliary function `win(x)`:

```
def win(x):
    if x > 0:
        return 0   # winner had the larger ranking number: an upset
    elif x <= 0:
        return 1   # winner was the higher-ranked player: a win
```
Using the `apply()` method, which applies a function to each element of a column:
```
df_atp['win'] = (df_atp['WRank'] - df_atp['LRank']).apply(win)
```

Following [Corral and Prieto-Rodriguez](https://ideas.repec.org/a/eee/intfor/v26yi3p551-563.html) we restrict the analysis to higher-ranked players:
```
df_new = df_atp[(df_atp['WRank'] <= 150) & (df_atp['LRank'] <= 150)]
```
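
As a quick sanity check of the `win` construction, here is a minimal self-contained sketch on a toy dataframe (the rankings below are made up for illustration):
```
import pandas as pd

def win(x):  # same rule as above
    return 0 if x > 0 else 1

# Toy matches: WRank is the winner's ranking, LRank the loser's
toy = pd.DataFrame({'WRank': [3, 120, 45], 'LRank': [78, 15, 60]})
toy['win'] = (toy['WRank'] - toy['LRank']).apply(win)
print(toy)
#    WRank  LRank  win
# 0      3     78    1   <- higher-ranked player won
# 1    120     15    0   <- upset
# 2     45     60    1
```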
## `Best_of` = 5

We now restrict our analysis to matches with `Best_of` = 5. Since only Grand Slam matches are played as best of 5, we can drop the `Series` column. The case of `Best_of` = 3 will be considered afterwards.
```
df3 = df_new.copy()
df3 = df3[df3['Best_of'] == 5]
# Drop the Series and Best_of columns
df3.drop("Series", axis=1, inplace=True)
df3.drop("Best_of", axis=1, inplace=True)
```
The dataset is uneven in terms of the frequency of `wins` (imbalanced classes). I use this quick function to convert a `Series` into a `DataFrame` (for aesthetic reasons only!):
```
def series_to_df(s):
    return s.to_frame()
series_to_df(df3['win'].value_counts())
series_to_df(df3['win'].value_counts()/df3.shape[0])
```
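
As an aside, pandas can also report the class proportions directly with `value_counts(normalize=True)`, which should agree with the ratio computed above:
```
# Proportion of wins (1) and upsets (0); normalize=True divides by the total count
df3['win'].value_counts(normalize=True)
```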

To correct this problem and create a balanced dataset via simple undersampling, I used a stratified sampling procedure: each class is sampled down to the size of the smaller one.

```
y_0 = df3[df3.win == 0]
y_1 = df3[df3.win == 1]
n = min([len(y_0), len(y_1)])
y_0 = y_0.sample(n=n, random_state=0)
y_1 = y_1.sample(n=n, random_state=0)
df_strat = pd.concat([y_0, y_1])
X_strat = df_strat[['Date', 'Surface', 'Round', 'WRank', 'LRank']]
y_strat = df_strat.win
df = X_strat.copy()
df['win'] = y_strat
```
The balanced classes become:
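
The same logic can be wrapped in a small reusable helper; this is just a sketch, assuming a dataframe with a discrete target column:
```
import pandas as pd

def undersample(df, target='win', random_state=0):
    """Downsample every class to the size of the smallest class."""
    groups = [g for _, g in df.groupby(target)]
    n = min(len(g) for g in groups)
    return pd.concat(g.sample(n=n, random_state=random_state) for g in groups)

# df_strat = undersample(df3)  # equivalent to the stratified sampling above (up to row order)
```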

We now define the variables `P1` and `P2`, where `P1` is the numerically larger of the two rankings and `P2` the smaller:
```
ranks = ["WRank", "LRank"]
df["P1"] = df[ranks].max(axis=1)
df["P2"] = df[ranks].min(axis=1)
```

## Exploratory Analysis for Best_of = 5

I first look at the percentage of wins for each surface. We find that when the `Surface` is Clay there is a higher likelihood of upsets (the opposite of wins), i.e. the percentage of wins is lower. The difference is not too large, though.
```
win_by_Surface = pd.crosstab(df.win, df.Surface).apply(lambda x: x/x.sum(), axis=0)
```
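
One quick way to chart this crosstab is a stacked bar plot; a sketch, reusing `win_by_Surface` and the `matplotlib` import from above (the figures in the repo may have been styled differently):
```
# Transpose so surfaces sit on the x-axis; each bar splits into upsets (0) and wins (1)
win_by_Surface.T.plot(kind='bar', stacked=True)
plt.ylabel('Fraction of matches')
plt.title('Wins vs. upsets by surface')
```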

What about the dependence on rounds? The pattern is less clear, but we can see that upsets are unlikely to happen in the semifinals.

```
win_by_round = pd.crosstab(df.win, df.Round).apply(lambda x: x/x.sum(), axis=0)
```

## Dummy variables
To keep the dataframe cleaner we transform the `Round` entries into numbers using:
```
df1 = df.copy()
def round_number(x):
    if x == '1st Round':
        return 1
    elif x == '2nd Round':
        return 2
    elif x == '3rd Round':
        return 3
    elif x == '4th Round':
        return 4
    elif x == 'Quarterfinals':
        return 5
    elif x == 'Semifinals':
        return 6
    elif x == 'The Final':
        return 7
df1['Round'] = df1['Round'].apply(round_number)
```
We then transform rounds into dummy variables:
```
dummy_ranks = pd.get_dummies(df1['Round'], prefix='Round')
df1 = df1.join(dummy_ranks.loc[:, 'Round_2':])
rounds = ['Round_2', 'Round_3',
          'Round_4', 'Round_5', 'Round_6', 'Round_7']
df1[rounds] = df1[rounds].astype(int)
```
We repeat this for the `Surface` variable; the resulting dataframe is `df4`. I now take the logarithms of `P1` and `P2`, then create a variable `D` equal to the absolute difference of the two logarithms:
```
df4['P1'] = np.log2(df4['P1'].astype('float64'))
df4['P2'] = np.log2(df4['P2'].astype('float64'))
df4['D'] = df4['P1'] - df4['P2']
df4['D'] = np.absolute(df4['D'])
```

## Logistic Regression

The next step is building the models. I first use a logistic regression. First, `y` and `X` must be defined:

```
feature_cols = ['Round_2','Round_3','Round_4','Round_5','Round_6','Round_7','Surface_Grass','Surface_Hard','D']
dfnew = df4.copy()
dfnew[feature_cols].head()
X = dfnew[feature_cols]
y = dfnew.win
```
Doing a train-test split:
```
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
```
I then fit the model with the training data,
```
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
```
and make predictions using the test set:
```
y_pred_class = logreg.predict(X_test)
from sklearn import metrics
print('Accuracy score is:', metrics.accuracy_score(y_test, y_pred_class))
```
and obtain:
```
Accuracy score is: 0.7070707070707071
```

The next step is to evaluate the appropriate metrics. Using `scikit-learn` to calculate the AUC,
```
y_pred_prob = logreg.predict_proba(X_test)[:, 1]
auc_score = metrics.roc_auc_score(y_test, y_pred_prob)
print('AUC is:', auc_score)
```
I obtain the following `auc_score`:
```
AUC is: 0.7546938775510204
```
To plot the ROC curve I use:
```
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob)
fig = plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC curve for win classifier')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.legend(loc="lower right")
plt.grid(True)
```
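
Accuracy and AUC can be complemented with a confusion matrix and a per-class report; a quick sketch using the same test split (the exact numbers will vary with the split):
```
from sklearn.metrics import confusion_matrix, classification_report

# Rows: true classes (0 = upset, 1 = win); columns: predicted classes
print(confusion_matrix(y_test, y_pred_class))
print(classification_report(y_test, y_pred_class))
```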
Now we must perform cross-validation.
```
from sklearn.model_selection import cross_val_score
print('Mean CV score is:', cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean())
```
The output is:
```
Mean CV score is: 0.7287617728531856
```

## Decision Trees and Random Forests

I now build a decision tree model to predict the likelihood of an upset in a given match:

```
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
X = dfnew[feature_cols].dropna()
y = dfnew['win']
model.fit(X, y)
```
Again performing cross-validation, and then refitting a pruned tree with a maximum depth of 4 and at least 6 samples per leaf:
```
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=5)
print('AUC {}, Average AUC {}'.format(scores, scores.mean()))
model = DecisionTreeClassifier(
    max_depth=4,
    min_samples_leaf=6)

model.fit(X, y)
```
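
For inspection, the pruned tree can be exported with scikit-learn's Graphviz writer; a sketch (the output file name is arbitrary, and rendering the `.dot` file requires Graphviz to be installed):
```
from sklearn.tree import export_graphviz

# Writes a Graphviz description of the fitted tree; render with: dot -Tpng tree.dot -o tree.png
export_graphviz(model, out_file='tree.dot',
                feature_names=feature_cols,
                class_names=['upset', 'win'],
                filled=True, rounded=True)
```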
Evaluating the cross-validation score:

```
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=5)
print('CV AUC {}, Average AUC {}'.format(scores, scores.mean()))
```
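
The `max_depth` and `min_samples_leaf` values above were set by hand; a small grid search is a natural way to tune them (a sketch, not part of the original analysis):
```
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {'max_depth': [2, 3, 4, 5, 6],
              'min_samples_leaf': [2, 4, 6, 8]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid, scoring='roc_auc', cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```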
Now I repeat the steps above using a random forest classifier:
```
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
X = dfnew[feature_cols].dropna()
y = dfnew['win']
model = RandomForestClassifier(n_estimators=200)
model.fit(X, y)
features = X.columns
feature_importances = model.feature_importances_
features_df = pd.DataFrame({'Features': features, 'Importance Score': feature_importances})
features_df.sort_values('Importance Score', inplace=True, ascending=False)
feature_importances = pd.Series(model.feature_importances_, index=X.columns)
feature_importances = feature_importances.sort_values()
feature_importances.plot(kind="barh", figsize=(7,6))
scores = cross_val_score(model, X, y, scoring='roc_auc')
print('AUC {}, Average AUC {}'.format(scores, scores.mean()))
for n_trees in range(1, 100, 10):
    model = RandomForestClassifier(n_estimators=n_trees)
    scores = cross_val_score(model, X, y, scoring='roc_auc')
    print('n trees: {}, CV AUC {}, Average AUC {}'.format(n_trees, scores, scores.mean()))
```
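
To see where the score levels off, the loop output can be collected and plotted against the number of trees (a sketch building on the loop above):
```
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

tree_range = list(range(1, 100, 10))
avg_aucs = [cross_val_score(RandomForestClassifier(n_estimators=n, random_state=0),
                            X, y, scoring='roc_auc').mean()
            for n in tree_range]
plt.plot(tree_range, avg_aucs, marker='o')
plt.xlabel('Number of trees')
plt.ylabel('Mean CV AUC')
```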
The same analysis was carried out for `Best_of` = 3 and is therefore omitted here in the README.
https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/tennis/images/tennis_df.png -------------------------------------------------------------------------------- /tennis/notebooks/123.png: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /tennis/slides/123.png: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /tennis/slides/Final_Project_Marco_Tavora_DATNYC41.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marcotav/supervised-machine-learning/bf8dbeab4b68e04219aaee9fad0e018f0e6347b1/tennis/slides/Final_Project_Marco_Tavora_DATNYC41.pdf --------------------------------------------------------------------------------