├── .gitignore ├── README.md ├── cleaning-notebooks ├── 01 - Deduplication.ipynb ├── 02 - String Matching.ipynb ├── 03 - Managing Nulls.ipynb ├── 04 - Preprocessing with Scikit-learn.ipynb ├── Case Study - Lobste.rs newest stories.ipynb └── Dask Pipeline.ipynb ├── conda_reqs.txt ├── data ├── HVAC.csv ├── HVAC_with_nulls.csv ├── all_lobsters.json ├── customer_data.csv ├── customer_data_duped.csv ├── customer_database.json ├── customer_database_duped.json ├── iot_example.csv ├── iot_example_with_nulls.csv ├── nhl_scores.csv ├── sales_data.csv ├── sales_data_duped.csv ├── sales_data_duped_with_nulls.csv ├── sales_data_with_nulls.csv └── sales_summary.csv ├── install_reqs.txt ├── solutions ├── dask.py ├── dedupe.py ├── engarde.py ├── hypothesis.py ├── lobsters_dropped.py ├── nulls.py └── preprocessing.py └── validation-notebooks ├── .gitignore ├── 01 - Data Validation with Voluptuous.ipynb ├── 02 - Dataframe Validation with Engarde.ipynb ├── 03 - TDDA.ipynb ├── 04 - Hypothesis.ipynb ├── Case Study - Basic Queued Pipeline with Validation.ipynb ├── __init__.py └── queue_example.py /.gitignore: -------------------------------------------------------------------------------- 1 | .*/* 2 | *.pyc 3 | *~ 4 | .ropeproject* 5 | dask-worker-space* 6 | *.cfg 7 | .ipynb* 8 | *.log 9 | *.*lock 10 | **/ignore-* 11 | generate-* 12 | *.db 13 | **/.hypothesis* 14 | *.png 15 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## Data Cleaning 101 2 | 3 | Welcome to the code repository for Practical Data Cleaning with Python! This is a two-day training offered through Safari with O'Reilly Media. You can sign up by searching for the course on Safari. 4 | 5 | This course aims to give you a practical overview of data cleaning and validation libraries and methods in Python. 
Since we only have 6 hours, it can't go deeply into any one library or tool, but I have tried to include useful tools I have found in my work and to incorporate a mixture of the munging and testing I have seen in my own and others' workflows. 6 | 7 | If you have a suggestion for another library or additional topic, feel free to drop me a line :) 8 | 9 | ### Installation 10 | 11 | These lessons have been tested with Python 3.4 and Python 3.6 and primarily use the latest release of each library, except where versions are pinned. You can likely run most of the code with older releases, but if you run into an issue, try upgrading the library in question first. 12 | 13 | ```pip install -r install_reqs.txt``` 14 | 15 | 16 | I believe this will also work with Conda, although I am less familiar with Conda, so please report issues! (Special thanks to @blue_hacker for this fix!) 17 | 18 | ``` 19 | $ conda create -n dataclean --copy python=3.6 20 | $ source activate dataclean 21 | $ pip install -r install_reqs.txt 22 | ``` 23 | 24 | In addition, you will need to install [sqlite3](https://www.sqlite.org/) or change the day-two case study to use a connection string for your database of choice. [More info](https://dataset.readthedocs.io/en/latest/quickstart.html#connecting-to-a-database) 25 | 26 | If you want to visualize graphs using Dask, you will need to install [Graphviz](http://www.graphviz.org/), which has special requirements on all platforms. For Linux, it is usually available via the system package manager (apt, yum). For other platforms, you might need to use a special installer. It is also [available via conda install graphviz](https://anaconda.org/anaconda/graphviz) and [pip install graphviz](https://pypi.python.org/pypi/graphviz), but these might not include all necessary dependencies for your OS. For best results, search for your 27 | OS and "install graphviz and dependencies" and follow a recent article on setup. 
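If you are not sure whether the two system-level prerequisites above are in place, a quick check from Python can help. This is only a minimal sketch and not part of the course materials: `check_prerequisites` is a hypothetical helper name, and it assumes that Graphviz exposes a `dot` executable on your `PATH`, which is the usual setup but may differ on your platform.

```python
import shutil
import sqlite3


def check_prerequisites():
    """Report whether the system-level dependencies look usable.

    Hypothetical helper for illustration; it is not part of this repository.
    """
    return {
        # The sqlite3 module ships with CPython; this is the version of the
        # SQLite library it was compiled against.
        'sqlite_version': sqlite3.sqlite_version,
        # Assumption: Graphviz installs a `dot` executable.
        # shutil.which returns None when it cannot be found on PATH.
        'graphviz_dot': shutil.which('dot'),
    }


if __name__ == '__main__':
    for name, value in check_prerequisites().items():
        print('{}: {}'.format(name, value or 'not found'))
```

If `graphviz_dot` comes back as not found, the Dask graph visualizations will probably fail even when `pip install graphviz` succeeded, since that package only provides the Python bindings.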
28 | 29 | ### Repository structure 30 | 31 | Each day coincides with a particular notebook folder. For day one, we will use `cleaning-notebooks`. Day two will focus on `validation-notebooks`. The `data` folder holds data we will use throughout the course. The `queue_example.py` file is used in the day two case study. 32 | 33 | 34 | ### Python2 v. Python3 35 | 36 | This repository has been built with Python 3. If you are using Python 2 and need help porting some logic or finding alternatives, please let me know and I will try and help. :) 37 | 38 | ### Corrections? 39 | 40 | If you find any issues in these code examples, feel free to submit an Issue or Pull Request. I appreciate your input! 41 | 42 | ### Questions? 43 | 44 | Reach out to @kjam on Twitter or GitHub. @kjam is also often on freenode. :) 45 | -------------------------------------------------------------------------------- /cleaning-notebooks/01 - Deduplication.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Deduplicating data\n", 8 | "\n", 9 | "In this notebook, we deduplicate data using the [Dedupe library](https://dedupe.readthedocs.io/en/latest/), which utilizes a shallow neural network to learn from a small training exercise.\n", 10 | "\n", 11 | "If you are interested in building your own parser, the same folks have created the [Parserator](https://github.com/datamade/parserator) which you can use to extract text features and train your own text extraction (hooray! 
less brittle than regex!)" 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": null, 17 | "metadata": {}, 18 | "outputs": [], 19 | "source": [ 20 | "import pandas as pd\n", 21 | "import dedupe\n", 22 | "import os" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": null, 28 | "metadata": {}, 29 | "outputs": [], 30 | "source": [ 31 | "customers = pd.read_csv('../data/customer_data_duped.csv', \n", 32 | " encoding='utf-8')" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "metadata": {}, 38 | "source": [ 39 | "## Checking Data Quality" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": null, 45 | "metadata": { 46 | "scrolled": true 47 | }, 48 | "outputs": [], 49 | "source": [ 50 | "customers.head()" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "customers.dtypes" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": null, 65 | "metadata": {}, 66 | "outputs": [], 67 | "source": [ 68 | "for col in customers.columns:\n", 69 | " print(col, customers[col].isnull().sum())" 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [ 76 | "## Setting up Dedupe" 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": null, 82 | "metadata": {}, 83 | "outputs": [], 84 | "source": [ 85 | "variables = [\n", 86 | " {'field': 'name', 'type': 'String'},\n", 87 | " {'field': 'job', 'type': 'String'},\n", 88 | " {'field': 'company', 'type': 'String'}, \n", 89 | " {'field': 'street_address','type': 'String'},\n", 90 | " {'field': 'city','type': 'String'},\n", 91 | " {'field': 'state', 'type': 'String', 'has_missing': True},\n", 92 | " {'field': 'email', 'type': 'String', 'has_missing': True},\n", 93 | " {'field': 'user_name', 'type': 'String'},\n", 94 | "]\n", 95 | "\n", 96 | "deduper = dedupe.Dedupe(variables)" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 
101 | "execution_count": null, 102 | "metadata": {}, 103 | "outputs": [], 104 | "source": [ 105 | "deduper" 106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "execution_count": null, 111 | "metadata": {}, 112 | "outputs": [], 113 | "source": [ 114 | "customers.shape" 115 | ] 116 | }, 117 | { 118 | "cell_type": "code", 119 | "execution_count": null, 120 | "metadata": { 121 | "scrolled": true 122 | }, 123 | "outputs": [], 124 | "source": [ 125 | "deduper.sample(customers.T.to_dict(), 500)" 126 | ] 127 | }, 128 | { 129 | "cell_type": "markdown", 130 | "metadata": {}, 131 | "source": [ 132 | "Note: If you receive an error like this:\n", 133 | "\n", 134 | "```/usr/local/lib/python2.7/site-packages/dedupe/sampling.py:39: UserWarning: 250 blocked samples were requested, but only able to sample 249\n", 135 | " % (sample_size, len(blocked_sample)))\n", 136 | "```\n", 137 | "\n", 138 | "you can continue (some were selected), or use the suggested number (^ here it would be 249)" 139 | ] 140 | }, 141 | { 142 | "cell_type": "markdown", 143 | "metadata": {}, 144 | "source": [ 145 | "#### Either use training file (uncomment) or resume active training below" 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": null, 151 | "metadata": {}, 152 | "outputs": [], 153 | "source": [ 154 | "training_file = '../data/ignore-dedupe-training.json'\n", 155 | "#if os.path.exists(training_file):\n", 156 | "# with open(training_file, 'rb') as f:\n", 157 | "# deduper.readTraining(f)" 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": null, 163 | "metadata": {}, 164 | "outputs": [], 165 | "source": [ 166 | "dedupe.consoleLabel(deduper)" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": null, 172 | "metadata": {}, 173 | "outputs": [], 174 | "source": [ 175 | "deduper.train()" 176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "execution_count": null, 181 | "metadata": {}, 182 | "outputs": [], 183 | "source": 
[ 184 | "with open(training_file, 'w') as tf:\n", 185 | " deduper.writeTraining(tf)" 186 | ] 187 | }, 188 | { 189 | "cell_type": "code", 190 | "execution_count": null, 191 | "metadata": {}, 192 | "outputs": [], 193 | "source": [ 194 | "dupes = deduper.match(customers.T.to_dict())" 195 | ] 196 | }, 197 | { 198 | "cell_type": "code", 199 | "execution_count": null, 200 | "metadata": {}, 201 | "outputs": [], 202 | "source": [ 203 | "dupes" 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": null, 209 | "metadata": {}, 210 | "outputs": [], 211 | "source": [ 212 | "dupes[2]" 213 | ] 214 | }, 215 | { 216 | "cell_type": "code", 217 | "execution_count": null, 218 | "metadata": {}, 219 | "outputs": [], 220 | "source": [ 221 | "customers.iloc[[741,1107]]" 222 | ] 223 | }, 224 | { 225 | "cell_type": "markdown", 226 | "metadata": { 227 | "collapsed": true 228 | }, 229 | "source": [ 230 | "### Exercise: Flag duplicates by adding 2 extra columns, one for confidence score and one for duplicate_ids" 231 | ] 232 | }, 233 | { 234 | "cell_type": "code", 235 | "execution_count": null, 236 | "metadata": {}, 237 | "outputs": [], 238 | "source": [ 239 | "# %load ../solutions/dedupe.py\n" 240 | ] 241 | }, 242 | { 243 | "cell_type": "code", 244 | "execution_count": null, 245 | "metadata": {}, 246 | "outputs": [], 247 | "source": [] 248 | }, 249 | { 250 | "cell_type": "code", 251 | "execution_count": null, 252 | "metadata": {}, 253 | "outputs": [], 254 | "source": [ 255 | "customers[customers.confidence.notnull() == True].head()" 256 | ] 257 | }, 258 | { 259 | "cell_type": "code", 260 | "execution_count": null, 261 | "metadata": {}, 262 | "outputs": [], 263 | "source": [] 264 | } 265 | ], 266 | "metadata": { 267 | "kernelspec": { 268 | "display_name": "Python 3", 269 | "language": "python", 270 | "name": "python3" 271 | }, 272 | "language_info": { 273 | "codemirror_mode": { 274 | "name": "ipython", 275 | "version": 3 276 | }, 277 | "file_extension": ".py", 278 | 
"mimetype": "text/x-python", 279 | "name": "python", 280 | "nbconvert_exporter": "python", 281 | "pygments_lexer": "ipython3", 282 | "version": "3.6.6" 283 | } 284 | }, 285 | "nbformat": 4, 286 | "nbformat_minor": 1 287 | } 288 | -------------------------------------------------------------------------------- /cleaning-notebooks/02 - String Matching.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## String Matching\n", 8 | "\n", 9 | "In this notebook, we use [fuzzywuzzy](https://github.com/seatgeek/fuzzywuzzy), a popular string matching library by SeatGeek. \n", 10 | "\n", 11 | "For more information on the different methods available and how they differ, see [their blog post explaining methodologies](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/)." 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": null, 17 | "metadata": { 18 | "scrolled": true 19 | }, 20 | "outputs": [], 21 | "source": [ 22 | "from fuzzywuzzy import fuzz, process" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": null, 28 | "metadata": {}, 29 | "outputs": [], 30 | "source": [ 31 | "berlin = ['Berlin, Germany', \n", 32 | " 'Berlin, Deutschland', \n", 33 | " 'Berlin', \n", 34 | " 'Berlin, DE']" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": {}, 40 | "source": [ 41 | "#### Try matching the first and second strings: 'Berlin, Germany' and 'Berlin, Deutschland'" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": null, 47 | "metadata": {}, 48 | "outputs": [], 49 | "source": [ 50 | "fuzz.partial_ratio(berlin[0], berlin[1])" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "fuzz.ratio?" 
60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": null, 65 | "metadata": {}, 66 | "outputs": [], 67 | "source": [ 68 | "fuzz.ratio(berlin[0], berlin[1])" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": null, 74 | "metadata": {}, 75 | "outputs": [], 76 | "source": [ 77 | "fuzz.token_set_ratio(berlin[0], berlin[1])" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": null, 83 | "metadata": {}, 84 | "outputs": [], 85 | "source": [ 86 | "fuzz.token_sort_ratio(berlin[0], berlin[1])" 87 | ] 88 | }, 89 | { 90 | "cell_type": "markdown", 91 | "metadata": {}, 92 | "source": [ 93 | "#### Try matching the second and third strings: 'Berlin, Deutschland' and 'Berlin'" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": null, 99 | "metadata": {}, 100 | "outputs": [], 101 | "source": [ 102 | "fuzz.partial_ratio(berlin[1], berlin[2])" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": null, 108 | "metadata": {}, 109 | "outputs": [], 110 | "source": [ 111 | "fuzz.ratio(berlin[1], berlin[2])" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": null, 117 | "metadata": {}, 118 | "outputs": [], 119 | "source": [ 120 | "fuzz.token_sort_ratio(berlin[1], berlin[2])" 121 | ] 122 | }, 123 | { 124 | "cell_type": "markdown", 125 | "metadata": {}, 126 | "source": [ 127 | "### What do you think will score lowest and highest for the final two elements: \n", 128 | "- 'Berlin'\n", 129 | "- 'Berlin, DE'" 130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": null, 135 | "metadata": {}, 136 | "outputs": [], 137 | "source": [ 138 | "fuzz.token_set_ratio(berlin[2], berlin[3])" 139 | ] 140 | }, 141 | { 142 | "cell_type": "markdown", 143 | "metadata": {}, 144 | "source": [ 145 | "### Extracting a guess out of a list" 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": null, 151 | "metadata": {}, 152 | "outputs": [], 153 | 
"source": [ 154 | "choices = ['Germany', 'Deutschland', 'France', \n", 155 | " 'United Kingdom', 'Great Britain', \n", 156 | " 'United States']" 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": null, 162 | "metadata": {}, 163 | "outputs": [], 164 | "source": [ 165 | "process.extract('DE', choices, limit=2)" 166 | ] 167 | }, 168 | { 169 | "cell_type": "code", 170 | "execution_count": null, 171 | "metadata": {}, 172 | "outputs": [], 173 | "source": [ 174 | "process.extract('UK', choices)" 175 | ] 176 | }, 177 | { 178 | "cell_type": "code", 179 | "execution_count": null, 180 | "metadata": {}, 181 | "outputs": [], 182 | "source": [ 183 | "process.extract('frankreich', choices)" 184 | ] 185 | }, 186 | { 187 | "cell_type": "markdown", 188 | "metadata": { 189 | "collapsed": true 190 | }, 191 | "source": [ 192 | "### Will this properly extract?" 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": null, 198 | "metadata": {}, 199 | "outputs": [], 200 | "source": [ 201 | "process.extract('U.S.', choices)" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": null, 207 | "metadata": {}, 208 | "outputs": [], 209 | "source": [] 210 | } 211 | ], 212 | "metadata": { 213 | "kernelspec": { 214 | "display_name": "Python 3", 215 | "language": "python", 216 | "name": "python3" 217 | }, 218 | "language_info": { 219 | "codemirror_mode": { 220 | "name": "ipython", 221 | "version": 3 222 | }, 223 | "file_extension": ".py", 224 | "mimetype": "text/x-python", 225 | "name": "python", 226 | "nbconvert_exporter": "python", 227 | "pygments_lexer": "ipython3", 228 | "version": "3.6.6" 229 | } 230 | }, 231 | "nbformat": 4, 232 | "nbformat_minor": 2 233 | } 234 | -------------------------------------------------------------------------------- /cleaning-notebooks/03 - Managing Nulls.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 
5 | "metadata": {}, 6 | "source": [ 7 | "### Managing Nulls with Pandas\n", 8 | "\n", 9 | "In this notebook, we will take a look at some ways to manage nulls using Pandas DataFrames.\n", 10 | "\n", 11 | "For even more details on how to do this, check out the [Pandas documentation](http://pandas.pydata.org/pandas-docs/stable/missing_data.html)." 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": null, 17 | "metadata": {}, 18 | "outputs": [], 19 | "source": [ 20 | "import pandas as pd\n", 21 | "from numpy import random" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": null, 27 | "metadata": {}, 28 | "outputs": [], 29 | "source": [ 30 | "df = pd.read_csv('../data/iot_example_with_nulls.csv')" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "### Data Quality Check" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": null, 43 | "metadata": {}, 44 | "outputs": [], 45 | "source": [ 46 | "df.head()" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": null, 52 | "metadata": {}, 53 | "outputs": [], 54 | "source": [ 55 | "df.dtypes" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": null, 61 | "metadata": {}, 62 | "outputs": [], 63 | "source": [ 64 | "df.note.value_counts()" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": {}, 70 | "source": [ 71 | "### Let's remove all null values (including the note: n/a)" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": null, 77 | "metadata": {}, 78 | "outputs": [], 79 | "source": [ 80 | "df = pd.read_csv('../data/iot_example_with_nulls.csv', \n", 81 | " na_values=['n/a'])" 82 | ] 83 | }, 84 | { 85 | "cell_type": "markdown", 86 | "metadata": {}, 87 | "source": [ 88 | "### Test to see if we can use dropna" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": null, 94 | "metadata": {}, 95 | "outputs": [], 96 | "source": [ 97 | 
"df.shape" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": null, 103 | "metadata": {}, 104 | "outputs": [], 105 | "source": [ 106 | "df.dropna().shape" 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": null, 112 | "metadata": {}, 113 | "outputs": [], 114 | "source": [ 115 | "df.dropna(how='all', axis=1).shape" 116 | ] 117 | }, 118 | { 119 | "cell_type": "markdown", 120 | "metadata": {}, 121 | "source": [ 122 | "### Test to see if we can drop columns" 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": null, 128 | "metadata": {}, 129 | "outputs": [], 130 | "source": [ 131 | "my_columns = list(df.columns)" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": null, 137 | "metadata": {}, 138 | "outputs": [], 139 | "source": [ 140 | "my_columns" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": null, 146 | "metadata": {}, 147 | "outputs": [], 148 | "source": [ 149 | "list(df.dropna(thresh=int(df.shape[0] * .9), axis=1).columns)" 150 | ] 151 | }, 152 | { 153 | "cell_type": "markdown", 154 | "metadata": {}, 155 | "source": [ 156 | "### I want to find all columns that have missing data" 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": null, 162 | "metadata": {}, 163 | "outputs": [], 164 | "source": [ 165 | "missing_info = list(df.columns[df.isnull().any()])" 166 | ] 167 | }, 168 | { 169 | "cell_type": "code", 170 | "execution_count": null, 171 | "metadata": {}, 172 | "outputs": [], 173 | "source": [ 174 | "missing_info" 175 | ] 176 | }, 177 | { 178 | "cell_type": "code", 179 | "execution_count": null, 180 | "metadata": {}, 181 | "outputs": [], 182 | "source": [ 183 | "for col in missing_info:\n", 184 | " num_missing = df[df[col].isnull() == True].shape[0]\n", 185 | " print('number missing for column {}: {}'.format(col, \n", 186 | " num_missing))" 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | 
"execution_count": null, 192 | "metadata": {}, 193 | "outputs": [], 194 | "source": [ 195 | "for col in missing_info:\n", 196 | " percent_missing = df[df[col].isnull() == True].shape[0] / df.shape[0]\n", 197 | " print('percent missing for column {}: {}'.format(\n", 198 | " col, percent_missing))" 199 | ] 200 | }, 201 | { 202 | "cell_type": "markdown", 203 | "metadata": {}, 204 | "source": [ 205 | "### Can I easily substitute majority values in for missing data?" 206 | ] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "execution_count": null, 211 | "metadata": {}, 212 | "outputs": [], 213 | "source": [ 214 | "df.note.value_counts()" 215 | ] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "execution_count": null, 220 | "metadata": {}, 221 | "outputs": [], 222 | "source": [ 223 | "df.build.value_counts().head()" 224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": null, 229 | "metadata": {}, 230 | "outputs": [], 231 | "source": [ 232 | "df.latest.value_counts()" 233 | ] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "execution_count": null, 238 | "metadata": {}, 239 | "outputs": [], 240 | "source": [ 241 | "df.latest = df.latest.fillna(0)" 242 | ] 243 | }, 244 | { 245 | "cell_type": "markdown", 246 | "metadata": {}, 247 | "source": [ 248 | "### Have not yet addressed temperature missing values... 
Let's find a way to fill" 249 | ] 250 | }, 251 | { 252 | "cell_type": "code", 253 | "execution_count": null, 254 | "metadata": {}, 255 | "outputs": [], 256 | "source": [ 257 | "df.username.value_counts().head()" 258 | ] 259 | }, 260 | { 261 | "cell_type": "code", 262 | "execution_count": null, 263 | "metadata": {}, 264 | "outputs": [], 265 | "source": [ 266 | "df = df.set_index('timestamp')" 267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "execution_count": null, 272 | "metadata": {}, 273 | "outputs": [], 274 | "source": [ 275 | "df.head()" 276 | ] 277 | }, 278 | { 279 | "cell_type": "code", 280 | "execution_count": null, 281 | "metadata": {}, 282 | "outputs": [], 283 | "source": [ 284 | "df.temperature = df.groupby('username').temperature.fillna(\n", 285 | " method='backfill', limit=3)" 286 | ] 287 | }, 288 | { 289 | "cell_type": "markdown", 290 | "metadata": {}, 291 | "source": [ 292 | "### Exercise: How many temperature values did I fill? What percentage of values are still missing (for temperature)?" 
293 | ] 294 | }, 295 | { 296 | "cell_type": "code", 297 | "execution_count": null, 298 | "metadata": {}, 299 | "outputs": [], 300 | "source": [ 301 | "# %load ../solutions/nulls.py\n" 302 | ] 303 | }, 304 | { 305 | "cell_type": "code", 306 | "execution_count": null, 307 | "metadata": {}, 308 | "outputs": [], 309 | "source": [] 310 | }, 311 | { 312 | "cell_type": "code", 313 | "execution_count": null, 314 | "metadata": {}, 315 | "outputs": [], 316 | "source": [ 317 | "rows_filled" 318 | ] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": null, 323 | "metadata": {}, 324 | "outputs": [], 325 | "source": [ 326 | "still_missing" 327 | ] 328 | }, 329 | { 330 | "cell_type": "code", 331 | "execution_count": null, 332 | "metadata": {}, 333 | "outputs": [], 334 | "source": [] 335 | } 336 | ], 337 | "metadata": { 338 | "kernelspec": { 339 | "display_name": "Python 3", 340 | "language": "python", 341 | "name": "python3" 342 | }, 343 | "language_info": { 344 | "codemirror_mode": { 345 | "name": "ipython", 346 | "version": 3 347 | }, 348 | "file_extension": ".py", 349 | "mimetype": "text/x-python", 350 | "name": "python", 351 | "nbconvert_exporter": "python", 352 | "pygments_lexer": "ipython3", 353 | "version": "3.6.6" 354 | } 355 | }, 356 | "nbformat": 4, 357 | "nbformat_minor": 2 358 | } 359 | -------------------------------------------------------------------------------- /cleaning-notebooks/04 - Preprocessing with Scikit-learn.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Scikit Learn Preprocessing\n", 8 | "\n", 9 | "In this notebook, we'll use `sklearn.preprocessing` to do some scaling for us. If you need to prepare data for machine learning or feature extraction, the [sklearn.preprocessing documentation](http://scikit-learn.org/stable/modules/preprocessing.html) has great examples." 
10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": null, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "from sklearn import preprocessing\n", 19 | "import pandas as pd\n", 20 | "from datetime import datetime" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": null, 26 | "metadata": {}, 27 | "outputs": [], 28 | "source": [ 29 | "hvac = pd.read_csv('../data/HVAC_with_nulls.csv')" 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": {}, 35 | "source": [ 36 | "## Checking Data Quality" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": null, 42 | "metadata": {}, 43 | "outputs": [], 44 | "source": [ 45 | "hvac.dtypes" 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": null, 51 | "metadata": {}, 52 | "outputs": [], 53 | "source": [ 54 | "hvac.shape" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": null, 60 | "metadata": {}, 61 | "outputs": [], 62 | "source": [ 63 | "hvac.head()" 64 | ] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "metadata": {}, 69 | "source": [ 70 | "## Impute missing values with mean" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": null, 76 | "metadata": {}, 77 | "outputs": [], 78 | "source": [ 79 | "imp = preprocessing.Imputer(missing_values='NaN', \n", 80 | " strategy='mean')" 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": null, 86 | "metadata": {}, 87 | "outputs": [], 88 | "source": [ 89 | "hvac_numeric = hvac[['TargetTemp', 'SystemAge']]" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": null, 95 | "metadata": {}, 96 | "outputs": [], 97 | "source": [ 98 | "imp = imp.fit(hvac_numeric.loc[:10])" 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": null, 104 | "metadata": {}, 105 | "outputs": [], 106 | "source": [ 107 | "transformed = imp.fit_transform(hvac_numeric)" 108 | ] 109 | }, 110 | { 111 | "cell_type": 
"code", 112 | "execution_count": null, 113 | "metadata": {}, 114 | "outputs": [], 115 | "source": [ 116 | "transformed" 117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "execution_count": null, 122 | "metadata": {}, 123 | "outputs": [], 124 | "source": [ 125 | "hvac['TargetTemp'], hvac['SystemAge'] = transformed[:,0], transformed[:,1]" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": null, 131 | "metadata": {}, 132 | "outputs": [], 133 | "source": [ 134 | "hvac.head()" 135 | ] 136 | }, 137 | { 138 | "cell_type": "markdown", 139 | "metadata": {}, 140 | "source": [ 141 | "## Scale temperature values" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": null, 147 | "metadata": {}, 148 | "outputs": [], 149 | "source": [ 150 | "hvac['ScaledTemp'] = preprocessing.scale(hvac['ActualTemp'])" 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": null, 156 | "metadata": {}, 157 | "outputs": [], 158 | "source": [ 159 | "hvac['ScaledTemp'].head()" 160 | ] 161 | }, 162 | { 163 | "cell_type": "markdown", 164 | "metadata": {}, 165 | "source": [ 166 | "## Scale using a min and max scaler" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": null, 172 | "metadata": {}, 173 | "outputs": [], 174 | "source": [ 175 | "min_max_scaler = preprocessing.MinMaxScaler()" 176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "execution_count": null, 181 | "metadata": {}, 182 | "outputs": [], 183 | "source": [ 184 | "temp_minmax = min_max_scaler.fit_transform(hvac[['ActualTemp']])" 185 | ] 186 | }, 187 | { 188 | "cell_type": "code", 189 | "execution_count": null, 190 | "metadata": {}, 191 | "outputs": [], 192 | "source": [ 193 | "temp_minmax" 194 | ] 195 | }, 196 | { 197 | "cell_type": "markdown", 198 | "metadata": {}, 199 | "source": [ 200 | "### Exercise: add the `temp_minmax` back to the dataframe as a new column" 201 | ] 202 | }, 203 | { 204 | "cell_type": "code", 205 | 
"execution_count": null, 206 | "metadata": {}, 207 | "outputs": [], 208 | "source": [ 209 | "# %load ../solutions/preprocessing.py\n", 210 | "\n" 211 | ] 212 | }, 213 | { 214 | "cell_type": "code", 215 | "execution_count": null, 216 | "metadata": {}, 217 | "outputs": [], 218 | "source": [] 219 | } 220 | ], 221 | "metadata": { 222 | "kernelspec": { 223 | "display_name": "Python 3", 224 | "language": "python", 225 | "name": "python3" 226 | }, 227 | "language_info": { 228 | "codemirror_mode": { 229 | "name": "ipython", 230 | "version": 3 231 | }, 232 | "file_extension": ".py", 233 | "mimetype": "text/x-python", 234 | "name": "python", 235 | "nbconvert_exporter": "python", 236 | "pygments_lexer": "ipython3", 237 | "version": "3.6.1" 238 | } 239 | }, 240 | "nbformat": 4, 241 | "nbformat_minor": 2 242 | } 243 | -------------------------------------------------------------------------------- /cleaning-notebooks/Case Study - Lobste.rs newest stories.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Case Study: Preparing Lobste.rs Stories for Machine Learning\n", 8 | "\n", 9 | "In this case study, we'll be preparing [lobste.rs](http://lobste.rs) stories for machine learning. To do so, we need to extract features and clean up the messy parts of the data. We'll be using Pandas along with `sklearn.preprocessing` and `fuzzywuzzy`. 
" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": null, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "import pandas as pd\n", 19 | "import requests\n", 20 | "from fuzzywuzzy import fuzz\n", 21 | "from collections import Counter\n", 22 | "from sklearn import preprocessing" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "### If you'd rather read from the API to get the latest, uncomment the details (and add comment to the final line)" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": null, 35 | "metadata": {}, 36 | "outputs": [], 37 | "source": [ 38 | "#resp = requests.get('https://lobste.rs/hottest.json')\n", 39 | "#stories = pd.read_json(resp.content)\n", 40 | "#stories = stories.set_index('short_id')\n", 41 | "\n", 42 | "stories = pd.read_json('../data/all_lobsters.json')" 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": null, 48 | "metadata": {}, 49 | "outputs": [], 50 | "source": [ 51 | "stories.head()" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": null, 57 | "metadata": {}, 58 | "outputs": [], 59 | "source": [ 60 | "stories.dtypes" 61 | ] 62 | }, 63 | { 64 | "cell_type": "markdown", 65 | "metadata": {}, 66 | "source": [ 67 | "### Let's take a look at the submitter_user field, as it appears like a dict" 68 | ] 69 | }, 70 | { 71 | "cell_type": "code", 72 | "execution_count": null, 73 | "metadata": {}, 74 | "outputs": [], 75 | "source": [ 76 | "stories.submitter_user.iloc[3]" 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": null, 82 | "metadata": {}, 83 | "outputs": [], 84 | "source": [ 85 | "user_df = stories['submitter_user'].apply(pd.Series)" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": null, 91 | "metadata": {}, 92 | "outputs": [], 93 | "source": [ 94 | "user_df.head()" 95 | ] 96 | }, 97 | { 98 | "cell_type": "markdown", 99 | "metadata": {}, 100 | 
"source": [ 101 | "### Can we combine the user data without potential column overlap?" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": null, 107 | "metadata": {}, 108 | "outputs": [], 109 | "source": [ 110 | "set(user_df.columns).intersection(stories.columns)" 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": null, 116 | "metadata": {}, 117 | "outputs": [], 118 | "source": [ 119 | "user_df = user_df.rename(columns={'created_at': \n", 120 | " 'user_created_at'})" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": null, 126 | "metadata": {}, 127 | "outputs": [], 128 | "source": [ 129 | "stories = pd.concat([stories.drop(['submitter_user'], axis=1), \n", 130 | " user_df], axis=1)" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": null, 136 | "metadata": {}, 137 | "outputs": [], 138 | "source": [ 139 | "stories.head()" 140 | ] 141 | }, 142 | { 143 | "cell_type": "markdown", 144 | "metadata": {}, 145 | "source": [ 146 | "### Let's check for nulls" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": null, 152 | "metadata": {}, 153 | "outputs": [], 154 | "source": [ 155 | "stories.shape" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": null, 161 | "metadata": {}, 162 | "outputs": [], 163 | "source": [ 164 | "stories.dropna().shape" 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": null, 170 | "metadata": {}, 171 | "outputs": [], 172 | "source": [ 173 | "stories.dropna(thresh=10, axis=1).shape" 174 | ] 175 | }, 176 | { 177 | "cell_type": "markdown", 178 | "metadata": { 179 | "collapsed": true 180 | }, 181 | "source": [ 182 | "### Exercise: which columns would be dropped?" 
183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": null, 188 | "metadata": {}, 189 | "outputs": [], 190 | "source": [ 191 | "# %load ../solutions/lobsters_dropped.py\n" 192 | ] 193 | }, 194 | { 195 | "cell_type": "markdown", 196 | "metadata": {}, 197 | "source": [ 198 | "## Let's make the tags easier to use by having them as features in the columns." 199 | ] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "execution_count": null, 204 | "metadata": {}, 205 | "outputs": [], 206 | "source": [ 207 | "tag_df = stories.tags.apply(pd.Series)" 208 | ] 209 | }, 210 | { 211 | "cell_type": "code", 212 | "execution_count": null, 213 | "metadata": {}, 214 | "outputs": [], 215 | "source": [ 216 | "tag_df.head()" 217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": null, 222 | "metadata": {}, 223 | "outputs": [], 224 | "source": [ 225 | "# what are our unique tags?\n", 226 | "\n", 227 | "pd.unique(tag_df.values.ravel())" 228 | ] 229 | }, 230 | { 231 | "cell_type": "code", 232 | "execution_count": null, 233 | "metadata": {}, 234 | "outputs": [], 235 | "source": [ 236 | "set(tag_df.values.ravel())" 237 | ] 238 | }, 239 | { 240 | "cell_type": "code", 241 | "execution_count": null, 242 | "metadata": {}, 243 | "outputs": [], 244 | "source": [ 245 | "len(pd.unique(tag_df.values.ravel()))" 246 | ] 247 | }, 248 | { 249 | "cell_type": "code", 250 | "execution_count": null, 251 | "metadata": {}, 252 | "outputs": [], 253 | "source": [ 254 | "# most common tags\n", 255 | "\n", 256 | "Counter(tag_df.values.ravel()).most_common(5)" 257 | ] 258 | }, 259 | { 260 | "cell_type": "markdown", 261 | "metadata": {}, 262 | "source": [ 263 | "### Let's create a dummy df with our tags" 264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "execution_count": null, 269 | "metadata": {}, 270 | "outputs": [], 271 | "source": [ 272 | "tag_df = pd.get_dummies(\n", 273 | " tag_df.apply(pd.Series).stack()).groupby(level=0).sum()" 274 | ] 275 | }, 276 | { 277 | 
"cell_type": "code", 278 | "execution_count": null, 279 | "metadata": { 280 | "scrolled": true 281 | }, 282 | "outputs": [], 283 | "source": [ 284 | "tag_df.head()" 285 | ] 286 | }, 287 | { 288 | "cell_type": "markdown", 289 | "metadata": {}, 290 | "source": [ 291 | "### Now we can add it back to our stories DataFrame" 292 | ] 293 | }, 294 | { 295 | "cell_type": "code", 296 | "execution_count": null, 297 | "metadata": {}, 298 | "outputs": [], 299 | "source": [ 300 | "stories = pd.concat([stories.drop('tags', axis=1), \n", 301 | " tag_df], axis=1)" 302 | ] 303 | }, 304 | { 305 | "cell_type": "code", 306 | "execution_count": null, 307 | "metadata": {}, 308 | "outputs": [], 309 | "source": [ 310 | "stories.head()" 311 | ] 312 | }, 313 | { 314 | "cell_type": "markdown", 315 | "metadata": {}, 316 | "source": [ 317 | "### Another potentially useful feature is the post times..." 318 | ] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": null, 323 | "metadata": {}, 324 | "outputs": [], 325 | "source": [ 326 | "stories['created_hour'] = stories.created_at.map(\n", 327 | " lambda x: x.hour)" 328 | ] 329 | }, 330 | { 331 | "cell_type": "code", 332 | "execution_count": null, 333 | "metadata": {}, 334 | "outputs": [], 335 | "source": [ 336 | "stories['created_dow'] = stories.created_at.map(\n", 337 | " lambda x: x.weekday())" 338 | ] 339 | }, 340 | { 341 | "cell_type": "markdown", 342 | "metadata": {}, 343 | "source": [ 344 | "### Let's analyze some of the correlations in our features so far..." 
345 | ] 346 | }, 347 | { 348 | "cell_type": "code", 349 | "execution_count": null, 350 | "metadata": {}, 351 | "outputs": [], 352 | "source": [ 353 | "stories[['created_hour', 'score']].corr()" 354 | ] 355 | }, 356 | { 357 | "cell_type": "code", 358 | "execution_count": null, 359 | "metadata": {}, 360 | "outputs": [], 361 | "source": [ 362 | "stories[['created_dow', 'score']].corr()" 363 | ] 364 | }, 365 | { 366 | "cell_type": "code", 367 | "execution_count": null, 368 | "metadata": {}, 369 | "outputs": [], 370 | "source": [ 371 | "stories[['karma', 'score']].corr()" 372 | ] 373 | }, 374 | { 375 | "cell_type": "code", 376 | "execution_count": null, 377 | "metadata": {}, 378 | "outputs": [], 379 | "source": [ 380 | "stories[['comment_count', 'score']].corr()" 381 | ] 382 | }, 383 | { 384 | "cell_type": "code", 385 | "execution_count": null, 386 | "metadata": {}, 387 | "outputs": [], 388 | "source": [ 389 | "stories[['hardware', 'score']].corr()" 390 | ] 391 | }, 392 | { 393 | "cell_type": "markdown", 394 | "metadata": {}, 395 | "source": [ 396 | "## Exercise: can you find a more highly positive correlation?" 397 | ] 398 | }, 399 | { 400 | "cell_type": "code", 401 | "execution_count": null, 402 | "metadata": {}, 403 | "outputs": [], 404 | "source": [] 405 | }, 406 | { 407 | "cell_type": "markdown", 408 | "metadata": {}, 409 | "source": [ 410 | "## We might also want/need to normalize scores. We can use a Scaler / MinMaxScaler or Normalizer" 411 | ] 412 | }, 413 | { 414 | "cell_type": "code", 415 | "execution_count": null, 416 | "metadata": {}, 417 | "outputs": [], 418 | "source": [ 419 | "normed_score = preprocessing.normalize(stories[['score']])" 420 | ] 421 | }, 422 | { 423 | "cell_type": "code", 424 | "execution_count": null, 425 | "metadata": {}, 426 | "outputs": [], 427 | "source": [ 428 | "normed_score[:5]" 429 | ] 430 | }, 431 | { 432 | "cell_type": "markdown", 433 | "metadata": {}, 434 | "source": [ 435 | "#### hmm... 
maybe a min-max scaler works better for our needs!" 436 | ] 437 | }, 438 | { 439 | "cell_type": "code", 440 | "execution_count": null, 441 | "metadata": {}, 442 | "outputs": [], 443 | "source": [ 444 | "scaler = preprocessing.MinMaxScaler()" 445 | ] 446 | }, 447 | { 448 | "cell_type": "code", 449 | "execution_count": null, 450 | "metadata": {}, 451 | "outputs": [], 452 | "source": [ 453 | "scaled_score = scaler.fit_transform(stories[['score']])" 454 | ] 455 | }, 456 | { 457 | "cell_type": "code", 458 | "execution_count": null, 459 | "metadata": {}, 460 | "outputs": [], 461 | "source": [ 462 | "scaled_score[:5]" 463 | ] 464 | }, 465 | { 466 | "cell_type": "code", 467 | "execution_count": null, 468 | "metadata": {}, 469 | "outputs": [], 470 | "source": [ 471 | "stories['scaled_score'] = scaled_score[:,0]" 472 | ] 473 | }, 474 | { 475 | "cell_type": "markdown", 476 | "metadata": {}, 477 | "source": [ 478 | "## Exercise: can you add a scaled or normalized karma score?" 479 | ] 480 | }, 481 | { 482 | "cell_type": "code", 483 | "execution_count": null, 484 | "metadata": {}, 485 | "outputs": [], 486 | "source": [] 487 | }, 488 | { 489 | "cell_type": "markdown", 490 | "metadata": { 491 | "collapsed": true 492 | }, 493 | "source": [ 494 | "## What else should we add?\n", 495 | "\n", 496 | "- fuzzywuzzy to find match of title with topics\n", 497 | "- add normalization or scaling to comments\n", 498 | "- extract domain name\n", 499 | "- number of words in the title\n", 500 | "- number of capitalized words in the title\n", 501 | "- use NLP to extract named entities from the title\n", 502 | "- what else?" 
503 | ] 504 | }, 505 | { 506 | "cell_type": "code", 507 | "execution_count": null, 508 | "metadata": {}, 509 | "outputs": [], 510 | "source": [] 511 | } 512 | ], 513 | "metadata": { 514 | "kernelspec": { 515 | "display_name": "Python 3", 516 | "language": "python", 517 | "name": "python3" 518 | }, 519 | "language_info": { 520 | "codemirror_mode": { 521 | "name": "ipython", 522 | "version": 3 523 | }, 524 | "file_extension": ".py", 525 | "mimetype": "text/x-python", 526 | "name": "python", 527 | "nbconvert_exporter": "python", 528 | "pygments_lexer": "ipython3", 529 | "version": "3.6.6" 530 | } 531 | }, 532 | "nbformat": 4, 533 | "nbformat_minor": 2 534 | } 535 | -------------------------------------------------------------------------------- /cleaning-notebooks/Dask Pipeline.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Tracking the International Space Station with Dask\n", 8 | "\n", 9 | "In this notebook, we will use two APIs: the [LocationIQ geocoder](https://locationiq.org) and the [open notify API for ISS location](http://api.open-notify.org/). We will use them to track the ISS location and next pass time in relation to a list of cities.\n", 10 | "\n", 11 | "To help build our graphs and intelligently parallelize data, we will use [Dask](http://dask.pydata.org/en/latest/), specifically [Dask delayed](http://dask.pydata.org/en/latest/delayed.html)." 
12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": null, 17 | "metadata": {}, 18 | "outputs": [], 19 | "source": [ 20 | "import requests\n", 21 | "import logging\n", 22 | "import sys\n", 23 | "import numpy as np\n", 24 | "from time import sleep\n", 25 | "from datetime import datetime\n", 26 | "from math import radians\n", 27 | "from dask import delayed\n", 28 | "from operator import itemgetter\n", 29 | "from sklearn.neighbors import DistanceMetric" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": null, 35 | "metadata": {}, 36 | "outputs": [], 37 | "source": [ 38 | "logger = logging.getLogger()\n", 39 | "logger.setLevel(logging.INFO)" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "### First, we need to get lat and long pairs from a list of cities" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": null, 52 | "metadata": {}, 53 | "outputs": [], 54 | "source": [ 55 | "def get_lat_long(address):\n", 56 | " resp = requests.get(\n", 57 | " 'https://eu1.locationiq.org/v1/search.php',\n", 58 | " params={'key': '92e7ba84cf3465', #Please be kind, you can generate your own for more use here - https://locationiq.org :D\n", 59 | " 'q': address,\n", 60 | " 'format': 'json'}\n", 61 | " )\n", 62 | " if resp.status_code != 200:\n", 63 | " print('There was a problem with your request!')\n", 64 | " print(resp.content)\n", 65 | " return\n", 66 | " data = resp.json()[0]\n", 67 | " return {\n", 68 | " 'name': data.get('display_name'),\n", 69 | " 'lat': float(data.get('lat')),\n", 70 | " 'long': float(data.get('lon')),\n", 71 | " }" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": null, 77 | "metadata": {}, 78 | "outputs": [], 79 | "source": [ 80 | "get_lat_long('Berlin, Germany')" 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": null, 86 | "metadata": {}, 87 | "outputs": [], 88 | "source": [ 89 | "locations = []\n", 90 | "for 
city in ['Seattle, Washington', 'Miami, Florida', \n", 91 | " 'Berlin, Germany', 'Singapore', \n", 92 | " 'Wellington, New Zealand',\n", 93 | " 'Beirut, Lebanon', 'Beijing, China', 'Nairobi, Kenya',\n", 94 | " 'Cape Town, South Africa', 'Buenos Aires, Argentina']:\n", 95 | " locations.append(get_lat_long(city))\n", 96 | " sleep(2)" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": null, 102 | "metadata": {}, 103 | "outputs": [], 104 | "source": [ 105 | "locations" 106 | ] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "metadata": {}, 111 | "source": [ 112 | "### Now we can define the functions we will use to get the ISS data and compare location and next pass times amongst cities " 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": null, 118 | "metadata": {}, 119 | "outputs": [], 120 | "source": [ 121 | "def get_spaceship_location():\n", 122 | " resp = requests.get('http://api.open-notify.org/iss-now.json')\n", 123 | " location = resp.json()['iss_position']\n", 124 | " return {'lat': float(location.get('latitude')),\n", 125 | " 'long': float(location.get('longitude'))}" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": null, 131 | "metadata": {}, 132 | "outputs": [], 133 | "source": [ 134 | "def great_circle_dist(lon1, lat1, lon2, lat2):\n", 135 | " \"Found on SO: http://stackoverflow.com/a/41858332/380442\"\n", 136 | " dist = DistanceMetric.get_metric('haversine')\n", 137 | " lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])\n", 138 | "\n", 139 | " X = [[lat1, lon1], [lat2, lon2]]\n", 140 | " kms = 6367\n", 141 | " return (kms * dist.pairwise(X)).max()" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": null, 147 | "metadata": {}, 148 | "outputs": [], 149 | "source": [ 150 | "def iss_dist_from_loc(issloc, loc):\n", 151 | " distance = great_circle_dist(issloc.get('long'), \n", 152 | " issloc.get('lat'), \n", 153 | " loc.get('long'), 
loc.get('lat'))\n", 154 | " logging.info('ISS is ~%dkm from %s', int(distance), loc.get('name'))\n", 155 | " return distance" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": null, 161 | "metadata": {}, 162 | "outputs": [], 163 | "source": [ 164 | "def iss_pass_near_loc(loc):\n", 165 | " resp = requests.get('http://api.open-notify.org/iss-pass.json',\n", 166 | " params={'lat': loc.get('lat'), \n", 167 | " 'lon': loc.get('long')})\n", 168 | " data = resp.json().get('response')[0]\n", 169 | " td = datetime.fromtimestamp(data.get('risetime')) - datetime.now()\n", 170 | " m, s = divmod(int(td.total_seconds()), 60)\n", 171 | " h, m = divmod(m, 60)\n", 172 | " logging.info('ISS will pass near %s in %02d:%02d:%02d',loc.get('name'), h, m, s)\n", 173 | " return td.total_seconds()" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": null, 179 | "metadata": {}, 180 | "outputs": [], 181 | "source": [ 182 | "iss_dist_from_loc(get_spaceship_location(), locations[4])" 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": null, 188 | "metadata": {}, 189 | "outputs": [], 190 | "source": [ 191 | "iss_pass_near_loc(locations[4])" 192 | ] 193 | }, 194 | { 195 | "cell_type": "markdown", 196 | "metadata": {}, 197 | "source": [ 198 | "### Let's create a delayed pipeline" 199 | ] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "execution_count": null, 204 | "metadata": {}, 205 | "outputs": [], 206 | "source": [ 207 | "output = []\n", 208 | "\n", 209 | "for loc in locations:\n", 210 | " issloc = delayed(get_spaceship_location)()\n", 211 | " dist = delayed(iss_dist_from_loc)(issloc, loc)\n", 212 | " output.append((loc.get('name'), dist))\n", 213 | "\n", 214 | "closest = delayed(lambda x: sorted(x, \n", 215 | " key=itemgetter(1))[0])(output)" 216 | ] 217 | }, 218 | { 219 | "cell_type": "code", 220 | "execution_count": null, 221 | "metadata": {}, 222 | "outputs": [], 223 | "source": [ 224 | "closest" 225 | ] 226 | 
}, 227 | { 228 | "cell_type": "markdown", 229 | "metadata": {}, 230 | "source": [ 231 | "### Let's see our DAG!" 232 | ] 233 | }, 234 | { 235 | "cell_type": "code", 236 | "execution_count": null, 237 | "metadata": {}, 238 | "outputs": [], 239 | "source": [ 240 | "closest.visualize()" 241 | ] 242 | }, 243 | { 244 | "cell_type": "markdown", 245 | "metadata": {}, 246 | "source": [ 247 | "### Remember: it is lazy, so let's start it with `compute()`" 248 | ] 249 | }, 250 | { 251 | "cell_type": "code", 252 | "execution_count": null, 253 | "metadata": {}, 254 | "outputs": [], 255 | "source": [ 256 | "closest.compute()" 257 | ] 258 | }, 259 | { 260 | "cell_type": "markdown", 261 | "metadata": { 262 | "collapsed": true 263 | }, 264 | "source": [ 265 | "### Exercise: which city will it fly over next?\n", 266 | "\n", 267 | "### Extra: add your city and compare!" 268 | ] 269 | }, 270 | { 271 | "cell_type": "code", 272 | "execution_count": null, 273 | "metadata": {}, 274 | "outputs": [], 275 | "source": [ 276 | "# %load ../solutions/dask.py\n", 277 | "\n" 278 | ] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "execution_count": null, 283 | "metadata": {}, 284 | "outputs": [], 285 | "source": [] 286 | } 287 | ], 288 | "metadata": { 289 | "kernelspec": { 290 | "display_name": "Python 3", 291 | "language": "python", 292 | "name": "python3" 293 | }, 294 | "language_info": { 295 | "codemirror_mode": { 296 | "name": "ipython", 297 | "version": 3 298 | }, 299 | "file_extension": ".py", 300 | "mimetype": "text/x-python", 301 | "name": "python", 302 | "nbconvert_exporter": "python", 303 | "pygments_lexer": "ipython3", 304 | "version": "3.6.6" 305 | } 306 | }, 307 | "nbformat": 4, 308 | "nbformat_minor": 2 309 | } 310 | -------------------------------------------------------------------------------- /conda_reqs.txt: -------------------------------------------------------------------------------- 1 | jupyter 2 | pandas 3 | scikit-learn 4 | scipy 5 | requests 6 | fuzzywuzzy 7 | dask 8 | 
graphviz 9 | voluptuous 10 | engarde 11 | tdda 12 | faker 13 | hypothesis 14 | dataset 15 | distributed 16 | bokeh 17 | tornado 18 | -------------------------------------------------------------------------------- /data/customer_database.json: -------------------------------------------------------------------------------- 1 | [ 2 | {"job": "Toxicologist ", "address": "Junckenallee 8\n60024 Fallingbostel ", "company": "Löffler KG ", "name": "Anne-Kathrin Sorgatz B.Eng. "}, 3 | 4 | {"job": "Ceramics designer ", "address": "Minna-Kruschwitz-Straße 570\n03700 Worbis ", "company": "Pärtzelt AG ", "name": "Nadine Bruder "}, 5 | 6 | {"job": "Engineer, petroleum ", "address": "Ernst-Mentzel-Allee 70\n51786 Rockenhausen ", "company": "Rose Binner AG & Co. KG ", "name": "Reingard auch Schlauchin "}, 7 | 8 | {"job": "Neurosurgeon ", "address": "Burkhardt-Steinberg-Gasse 3\n82473 Halberstadt ", "company": "Kroker ", "name": "Charlotte Scheel B.A. "}, 9 | 10 | {"job": "Human resources officer ", "address": "Gretl-Trub-Ring 35\n10586 Garmisch-Partenkirchen ", "company": "Trommler ", "name": "Volkmar Meister "}, 11 | 12 | {"job": "Investment analyst ", "address": "Rosel-Bonbach-Weg 98\n57085 Saarlouis ", "company": "Trub ", "name": "Prof. Karolina Hertrampf "}, 13 | 14 | {"job": "Psychotherapist, child ", "address": "Gunda-Neureuther-Ring 8\n36338 Bischofswerda ", "company": "Oderwald ", "name": "Gudula Pieper "}, 15 | 16 | {"job": "Designer, ceramics/pottery ", "address": "Löwergasse 09\n63585 Ravensburg ", "company": "Rogner Birnbaum e.V. ", "name": "Aleksandra Mitschke B.A. "}, 17 | 18 | {"job": "Technical brewer ", "address": "Finkestraße 388\n60394 Pritzwalk ", "company": "Oestrovsky OHG mbH ", "name": "Carmen Schlosser MBA. 
"}, 19 | 20 | {"job": "Research scientist (medical) ", "address": "Heringplatz 94\n69733 Büsingenm Hochrhein ", "company": "Heydrich Atzler GmbH ", "name": "Wolfram Löffler "}, 21 | 22 | {"job": "Scientist, product/process development ", "address": "Junckenweg 044\n60890 Aachen ", "company": "Gude ", "name": "Prof. Friedrich-Wilhelm Heintze "}, 23 | 24 | {"job": "Newspaper journalist ", "address": "Helena-Gumprich-Allee 477\n08356 Mühldorfm Inn ", "company": "Hande ", "name": "Reimer Gude B.Sc. "}, 25 | 26 | {"job": "Best boy ", "address": "Henkring 8/8\n20802 Schleiz ", "company": "Kruschwitz ", "name": "Ing. Jens-Uwe Conradi "}, 27 | 28 | {"job": "Government social research officer ", "address": "Strohstr. 75\n95096 Freising ", "company": "Heinz ", "name": "Rena Schmiedecke "}, 29 | 30 | {"job": "Politician's assistant", "address": "Annelene-Schleich-Straße 8\n67005 Düren ", "company": "Schmidtke ", "name": "Lissy Mans B.Eng. "}, 31 | 32 | {"job": "Broadcast engineer ", "address": "Riehlgasse 9/5\n29353 Herzberg ", "company": "Kuhl AG ", "name": "Roselinde Kraushaar B.Eng. "}, 33 | 34 | {"job": "Scientist, research (maths) ", "address": "Hornichallee 8\n73902 Klötze ", "company": "Franke Stiftung & Co. KG ", "name": "Dr. Klaus-D. Scheel "}, 35 | 36 | {"job": "Sports therapist ", "address": "Tania-Hartung-Platz 7/9\n79609 Euskirchen ", "company": "Etzold AG ", "name": "Johanna Buchholz B.A. "}, 37 | 38 | {"job": "Lighting technician, broadcasting/film/video ", "address": "Cornelius-Harloff-Weg 2/5\n03574 Bergzabern ", "company": "Stadelmann ", "name": "Renato Hornig "}, 39 | 40 | {"job": "Engineer, maintenance (IT) ", "address": "Elfriede-Ladeck-Ring 618\n22695 Backnang ", "company": "Textor GmbH & Co. OHG ", "name": "Hans-Peter Speer "}, 41 | 42 | {"name": "Shawn Sanchez ", "address": "375 Jocelyn Unions\nThomashaven, VT 51428 ", "company": "Lambert, Thomas and Hess ", "job": "Advertising account planner "}, 43 | 44 | {"name": "Dr. 
Lori Hunt MD ", "address": "6425 Long Lodge Suite 464\nRachelview, ND 86669 ", "company": "Robinson-Rhodes ", "job": "Contracting civil engineer "}, 45 | 46 | {"name": "Alyssa James ", "address": "203 Christopher Station Suite 299\nAndrewsberg, GU 30904-0672 ", "company": "Hicks, Garcia and Roman ", "job": "Accountant, chartered management "}, 47 | 48 | {"name": "Bobby Barron ", "address": "547 Garrett Prairie Suite 288\nPort Angeltown, OK 74707-8260 ", "company": "Ewing-Lawrence ", "job": "Building services engineer "}, 49 | 50 | {"name": "Joshua Roberts ", "address": "8295 Palmer Lodge Suite 803\nWest Julie, LA 13745-5550 ", "company": "Tucker-Torres ", "job": "Clinical embryologist "}, 51 | 52 | {"name": "Joshua Becker ", "address": "136 Lee Falls Suite 336\nJamiestad, OH 38171 ", "company": "Schmidt, Weeks and Mathews ", "job": "Industrial buyer "}, 53 | 54 | {"name": "Barbara Reynolds ", "address": "3950 Jared Ramp Apt. 792\nJanetmouth, TX 68677 ", "company": "Jensen PLC ", "job": "Scientist, research (life sciences) "}, 55 | 56 | {"name": "William Espinoza ", "address": "8254 Wright Ville\nBlanchardfort, WA 58718-2634 ", "company": "Brown, Avila and Valdez ", "job": "Artist "}, 57 | 58 | {"name": "Laura Sexton ", "address": "10101 Robert Street\nEast Davidborough, VT 26605 ", "company": "Ingram-Ryan ", "job": "Animal technologist "}, 59 | 60 | {"name": "James Harrell ", "address": "1517 Johnson Motorway Suite 185\nSouth Julia, ID 14807 ", "company": "Bradford-George ", "job": "Primary school teacher "}, 61 | 62 | {"name": "Christopher Rodriguez ", "address": "260 Cassandra Landing\nPort Karen, FM 09345 ", "company": "Norman, Marshall and Abbott ", "job": "Psychologist, prison and probation services "}, 63 | 64 | {"name": "Robert Martinez ", "address": "520 Smith Turnpike\nNew Stephanie, IA 82621 ", "company": "Zhang-Solis ", "job": "International aid/development worker "}, 65 | 66 | {"name": "Sandra Meza ", "address": "33487 Billy Estate\nNew Victoria, ID 
98841-2128 ", "company": "Caldwell LLC ", "job": "Advertising account planner "}, 67 | 68 | {"name": "John Ryan ", "address": "76586 Vincent Trail\nWest Monique, IA 58444-6293 ", "company": "Gonzalez, Jones and Hoffman ", "job": "Education officer, community "}, 69 | 70 | {"name": "Michelle Wilson ", "address": "1504 Diane Village Apt. 937\nAnthonystad, IL 88819-8569 ", "company": "Reid-Winters ", "job": "Gaffer "}, 71 | 72 | {"name": "Jessica Johnson ", "address": "5230 Johnson Overpass Apt. 741\nEast Dylanfort, KY 88382-3442 ", "company": "Rose Group ", "job": "Engineer, broadcasting (operations) "}, 73 | 74 | {"name": "Lori Arellano ", "address": "24531 Evans Manors Apt. 673\nKaylaton, UT 55939-9389 ", "company": "Lucas Inc ", "job": "Printmaker "}, 75 | 76 | {"name": "Jose Espinoza ", "address": "0662 Valerie Hollow\nTravisstad, NM 46213-8971 ", "company": "Ward, Morris and Simmons ", "job": "Ophthalmologist "}, 77 | 78 | {"name": "Glenn Quinn ", "address": "93847 Anthony Walks\nWarrenhaven, WA 06141 ", "company": "Nichols-Watts ", "job": "Systems developer "}, 79 | 80 | {"name": "Matthew Cooper ", "address": "USCGC Wright\nFPO AP 09773 ", "company": "Miller, Yoder and Coleman ", "job": "Actor "}, 81 | 82 | {"job": "Designer, fashion/clothing ", "address": "23 Cooper greens\nSouth Paigeborough\nS6 3WF ", "company": "Jones-Haynes ", "name": "Jade Shaw "}, 83 | 84 | {"job": "Fast food restaurant manager ", "address": "4 Joan mount\nNorth Ronaldfurt\nNW56 7UG ", "company": "Smith, Richardson and Noble ", "name": "Hilary Smith "}, 85 | 86 | {"job": "Agricultural engineer ", "address": "143 Scott centers\nCraigview\nE7 6NL ", "company": "Brown-Mann ", "name": "Mr. 
Martin Thompson "}, 87 | 88 | {"job": "Operational investment banker ", "address": "104 Pearce viaduct\nGregoryburgh\nFY3W 3ZP ", "company": "Duncan Group ", "name": "Jemma Hudson "}, 89 | 90 | {"job": "Geographical information systems officer ", "address": "Studio 25V\nThompson rest\nLake Amber\nM0 5FS ", "company": "Robertson, Gill and Walton ", "name": "Brenda Phillips "}, 91 | 92 | {"job": "Psychologist, educational ", "address": "540 Morton oval\nNew Annaburgh\nL3 8EF ", "company": "Roberts PLC ", "name": "Albert Khan "}, 93 | 94 | {"job": "Scientist, product/process development ", "address": "11 Smith coves\nPort Joanna\nB0 9ST ", "company": "Byrne-Connolly ", "name": "Terry Johnson "}, 95 | 96 | {"job": "Secondary school teacher ", "address": "Flat 34n\nNorton brooks\nJonesberg\nLS7 3LG ", "company": "Brown-Curtis ", "name": "Sheila Cross "}, 97 | 98 | {"job": "Engineer, civil (contracting) ", "address": "12 Christopher ville\nHortonton\nG20 3JA ", "company": "Martin, Thompson and Gould ", "name": "Sandra Hudson "}, 99 | 100 | {"job": "Educational psychologist ", "address": "Flat 6\nHolly corners\nPort Ritatown\nCA34 9YE ", "company": "Roberts LLC ", "name": "Kelly Barry "}, 101 | 102 | {"job": "Teacher, adult education ", "address": "83 Ellie falls\nDouglasmouth\nPA9 6QU ", "company": "Townsend Inc ", "name": "Emily Newman-Jones "}, 103 | 104 | {"job": "Garment/textile technologist ", "address": "952 Williams grove\nEast Georginaborough\nE3 5AR ", "company": "Wong-Parker ", "name": "Caroline West "}, 105 | 106 | {"job": "Community development worker ", "address": "1 Kent common\nShaunview\nW25 1PS ", "company": "Stewart-Robinson ", "name": "Mr. 
Joseph Carr "}, 107 | 108 | {"job": "Information systems manager ", "address": "Flat 80\nPatricia fort\nCarolebury\nB9 6XW ", "company": "Gibbons LLC ", "name": "Rita Dean "}, 109 | 110 | {"job": "Scientist, forensic ", "address": "3 Taylor isle\nHillport\nE5 1HS ", "company": "Banks, Reed and Hudson ", "name": "Gareth Singh-Green "}, 111 | 112 | {"job": "Psychiatrist ", "address": "Studio 6\nElizabeth manor\nPort Wendy\nRG2W 0ND ", "company": "Cunningham LLC ", "name": "Elliott Clarke "}, 113 | 114 | {"job": "Materials engineer ", "address": "Flat 52\nNicholls spring\nNew Samuel\nM39 8AL ", "company": "Jones Ltd ", "name": "Marilyn Edwards-Walker "}, 115 | 116 | {"job": "Personal assistant ", "address": "94 Lee shoal\nAnnchester\nFY9P 1FY ", "company": "Murray, Ball and Mann ", "name": "Mathew Smith "}, 117 | 118 | {"job": "Museum education officer ", "address": "Flat 5\nGray overpass\nAmberborough\nTD0 8AW ", "company": "King, Steele and Dunn ", "name": "Miss Beth Miah "}, 119 | 120 | {"job": "Nature conservation officer ", "address": "41 Hammond orchard\nPort Anneville\nDN7P 9EH ", "company": "Walker, Anderson and Phillips ", "name": "Suzanne Archer-Barker "}, 121 | 122 | {"name": "Andrés Pablo Valentín Ricart ", "address": "Ronda de Alejandro Palmer 18\nBaleares, 49650 ", "job": "Conservator, museum/gallery ", "company": "Roda-Agustí "}, 123 | 124 | {"name": "Gabriel Raul Infante Montero ", "address": "C. 
de Sergio Daza 147\nÁvila, 96969 ", "job": "Engineer, manufacturing ", "company": "Castell-Llamas "}, 125 | 126 | {"name": "Trinidad Cuervo Escudero ", "address": "Via Jorge Cuenca 70\nCáceres, 78327 ", "job": "Media buyer ", "company": "Alberdi and Sons "}, 127 | 128 | {"name": "Vicenta Vizcaíno Montes ", "address": "Acceso de Inmaculada Espinosa 79 Puerta 1 \nValencia, 37415 ", "job": "Secretary, company ", "company": "Guerra, Riquelme and Quintero "}, 129 | 130 | {"name": "Dolores Zurita Prats ", "address": "Avenida de Enrique Calatayud 78\nNavarra, 00769 ", "job": "Mining engineer ", "company": "Pons, Tur and Araujo "}, 131 | 132 | {"name": "Santiago Gras Benítez ", "address": "Paseo Eva Piñeiro 37\nMadrid, 86140 ", "job": "Special educational needs teacher ", "company": "Sanz-Alvarez "}, 133 | 134 | {"name": "Noelia Reig ", "address": "Cañada de Ramon Uriarte 52 Puerta 1 \nSevilla, 61918 ", "job": "Production manager ", "company": "Lluch-Caro "}, 135 | 136 | {"name": "Luis Miguel Maldonado Moliner ", "address": "Avenida Jose Maria Rosa 59 Apt. 52 \nMálaga, 37984 ", "job": "Physiological scientist ", "company": "Pacheco Ltd "}, 137 | 138 | {"name": "Francisca Barral ", "address": "Ronda de Raul Sedano 20\nMelilla, 02762 ", "job": "Quality manager ", "company": "Landa Inc "}, 139 | 140 | {"name": "Natalia Peñalver Sotelo ", "address": "Rambla Vicente Heredia 74\nCantabria, 85661 ", "job": "Chemical engineer ", "company": "Jaume, Barba and Cerdán "}, 141 | 142 | {"name": "Vanesa Barrena Simó ", "address": "Urbanización Juan Luis Cerdá 2 Piso 2 \nVizcaya, 20484 ", "job": "Arts development officer ", "company": "Berenguer and Sons "}, 143 | 144 | {"name": "Gonzalo Robledo Pou ", "address": "Cañada Aitor Gascón 30 Apt. 
95 \nCáceres, 78565 ", "job": "Television floor manager ", "company": "Coronado-Cortés "}, 145 | 146 | {"name": "Tomás Cepeda Castrillo ", "address": "Urbanización Alicia Ariza 45\nCáceres, 76326 ", "job": "Housing manager/officer ", "company": "Cruz Inc "}, 147 | 148 | {"name": "Lidia Azcona Sebastián ", "address": "C. Manuela Jódar 386\nLugo, 14995 ", "job": "Occupational hygienist ", "company": "Iglesia Ltd "}, 149 | 150 | {"name": "Elena Beltran-Amat ", "address": "Cuesta Nicolás Higueras 85 Piso 6 \nÁvila, 70131 ", "job": "Physiological scientist ", "company": "Pintor, Orozco and Torre "}, 151 | 152 | {"name": "Esperanza Villa Rius ", "address": "Glorieta Juan Serrano 608 Puerta 4 \nCórdoba, 90524 ", "job": "Education officer, museum ", "company": "Martí LLC "}, 153 | 154 | {"name": "Belen Carmona Urrutia ", "address": "Vial Natalia Bou 56\nCórdoba, 82426 ", "job": "Control and instrumentation engineer ", "company": "Roca PLC "}, 155 | 156 | {"name": "Juan Luis de Mesa ", "address": "Urbanización Purificación Villegas 56 Apt. 
14 \nOurense, 99568 ", "job": "Ranger/warden ", "company": "Quintana PLC "}, 157 | 158 | {"name": "Alberto Torrens ", "address": "Calle Juan Luis Alcázar 17 Piso 6 \nZaragoza, 24032 ", "job": "Presenter, broadcasting ", "company": "Moya, Recio and Caro "}, 159 | 160 | {"name": "Andrés Lago ", "address": "Paseo de Blanca Bernad 41\nPontevedra, 68224 ", "job": "Management consultant ", "company": "Boada, Machado and Portillo "} 161 | ] 162 | -------------------------------------------------------------------------------- /data/customer_database_duped.json: -------------------------------------------------------------------------------- 1 | {"address":{"0":"Junckenallee 8\n60024 Fallingbostel ","1":"Minna-Kruschwitz-Stra\u00dfe 570\n03700 Worbis ","2":"Ernst-Mentzel-Allee 70\n51786 Rockenhausen ","3":"Burkhardt-Steinberg-Gasse 3\n82473 Halberstadt ","4":"Gretl-Trub-Ring 35\n10586 Garmisch-Partenkirchen ","5":"Rosel-Bonbach-Weg 98\n57085 Saarlouis ","6":"Gunda-Neureuther-Ring 8\n36338 Bischofswerda ","7":"L\u00f6wergasse 09\n63585 Ravensburg ","8":"Finkestra\u00dfe 388\n60394 Pritzwalk ","9":"Heringplatz 94\n69733 B\u00fcsingenm Hochrhein ","10":"Junckenweg 044\n60890 Aachen ","11":"Helena-Gumprich-Allee 477\n08356 M\u00fchldorfm Inn ","12":"Henkring 8\/8\n20802 Schleiz ","13":"Strohstr. 
75\n95096 Freising ","14":"Annelene-Schleich-Stra\u00dfe 8\n67005 D\u00fcren ","15":"Riehlgasse 9\/5\n29353 Herzberg ","16":"Hornichallee 8\n73902 Kl\u00f6tze ","17":"Tania-Hartung-Platz 7\/9\n79609 Euskirchen ","18":"Cornelius-Harloff-Weg 2\/5\n03574 Bergzabern ","19":"Elfriede-Ladeck-Ring 618\n22695 Backnang ","20":"375 Jocelyn Unions\nThomashaven, VT 51428 ","21":"6425 Long Lodge Suite 464\nRachelview, ND 86669 ","22":"203 Christopher Station Suite 299\nAndrewsberg, GU 30904-0672 ","23":"547 Garrett Prairie Suite 288\nPort Angeltown, OK 74707-8260 ","24":"8295 Palmer Lodge Suite 803\nWest Julie, LA 13745-5550 ","25":"136 Lee Falls Suite 336\nJamiestad, OH 38171 ","26":"3950 Jared Ramp Apt. 792\nJanetmouth, TX 68677 ","27":"8254 Wright Ville\nBlanchardfort, WA 58718-2634 ","28":"10101 Robert Street\nEast Davidborough, VT 26605 ","29":"1517 Johnson Motorway Suite 185\nSouth Julia, ID 14807 ","30":"260 Cassandra Landing\nPort Karen, FM 09345 ","31":"520 Smith Turnpike\nNew Stephanie, IA 82621 ","32":"33487 Billy Estate\nNew Victoria, ID 98841-2128 ","33":"76586 Vincent Trail\nWest Monique, IA 58444-6293 ","34":"1504 Diane Village Apt. 937\nAnthonystad, IL 88819-8569 ","35":"5230 Johnson Overpass Apt. 741\nEast Dylanfort, KY 88382-3442 ","36":"24531 Evans Manors Apt. 
673\nKaylaton, UT 55939-9389 ","37":"0662 Valerie Hollow\nTravisstad, NM 46213-8971 ","38":"93847 Anthony Walks\nWarrenhaven, WA 06141 ","39":"USCGC Wright\nFPO AP 09773 ","40":"23 Cooper greens\nSouth Paigeborough\nS6 3WF ","41":"4 Joan mount\nNorth Ronaldfurt\nNW56 7UG ","42":"143 Scott centers\nCraigview\nE7 6NL ","43":"104 Pearce viaduct\nGregoryburgh\nFY3W 3ZP ","44":"Studio 25V\nThompson rest\nLake Amber\nM0 5FS ","45":"540 Morton oval\nNew Annaburgh\nL3 8EF ","46":"11 Smith coves\nPort Joanna\nB0 9ST ","47":"Flat 34n\nNorton brooks\nJonesberg\nLS7 3LG ","48":"12 Christopher ville\nHortonton\nG20 3JA ","49":"Flat 6\nHolly corners\nPort Ritatown\nCA34 9YE ","50":"83 Ellie falls\nDouglasmouth\nPA9 6QU ","51":"952 Williams grove\nEast Georginaborough\nE3 5AR ","52":"1 Kent common\nShaunview\nW25 1PS ","53":"Flat 80\nPatricia fort\nCarolebury\nB9 6XW ","54":"3 Taylor isle\nHillport\nE5 1HS ","55":"Studio 6\nElizabeth manor\nPort Wendy\nRG2W 0ND ","56":"Flat 52\nNicholls spring\nNew Samuel\nM39 8AL ","57":"94 Lee shoal\nAnnchester\nFY9P 1FY ","58":"Flat 5\nGray overpass\nAmberborough\nTD0 8AW ","59":"41 Hammond orchard\nPort Anneville\nDN7P 9EH ","60":"Ronda de Alejandro Palmer 18\nBaleares, 49650 ","61":"C. de Sergio Daza 147\n\u00c1vila, 96969 ","62":"Via Jorge Cuenca 70\nC\u00e1ceres, 78327 ","63":"Acceso de Inmaculada Espinosa 79 Puerta 1 \nValencia, 37415 ","64":"Avenida de Enrique Calatayud 78\nNavarra, 00769 ","65":"Paseo Eva Pi\u00f1eiro 37\nMadrid, 86140 ","66":"Ca\u00f1ada de Ramon Uriarte 52 Puerta 1 \nSevilla, 61918 ","67":"Avenida Jose Maria Rosa 59 Apt. 52 \nM\u00e1laga, 37984 ","68":"Ronda de Raul Sedano 20\nMelilla, 02762 ","69":"Rambla Vicente Heredia 74\nCantabria, 85661 ","70":"Urbanizaci\u00f3n Juan Luis Cerd\u00e1 2 Piso 2 \nVizcaya, 20484 ","71":"Ca\u00f1ada Aitor Gasc\u00f3n 30 Apt. 95 \nC\u00e1ceres, 78565 ","72":"Urbanizaci\u00f3n Alicia Ariza 45\nC\u00e1ceres, 76326 ","73":"C. 
Manuela J\u00f3dar 386\nLugo, 14995 ","74":"Cuesta Nicol\u00e1s Higueras 85 Piso 6 \n\u00c1vila, 70131 ","75":"Glorieta Juan Serrano 608 Puerta 4 \nC\u00f3rdoba, 90524 ","76":"Vial Natalia Bou 56\nC\u00f3rdoba, 82426 ","77":"Urbanizaci\u00f3n Purificaci\u00f3n Villegas 56 Apt. 14 \nOurense, 99568 ","78":"Calle Juan Luis Alc\u00e1zar 17 Piso 6 \nZaragoza, 24032 ","79":"Paseo de Blanca Bernad 41\nPontevedra, 68224 ","80":"8295 Palmer Lodge Suite 803\nWest Julie, LA 13745-5550 ","81":"Flat 5\nGray overpass\nAmberborough\nTD0 8AW ","82":"Gunda-Neureuther-Ring 8\n36338 Bischofswerda ","83":"Ca\u00f1ada de Ramon Uriarte 52 Puerta 1 \nSevilla, 61918 ","84":"Flat 6\nHolly corners\nPort Ritatown\nCA34 9YE ","85":"76586 Vincent Trail\nWest Monique, IA 58444-6293 ","86":"94 Lee shoal\nAnnchester\nFY9P 1FY ","87":"Ronda de Alejandro Palmer 18\nBaleares, 49650 ","88":"Ca\u00f1ada Aitor Gasc\u00f3n 30 Apt. 95 \nC\u00e1ceres, 78565 ","89":"Urbanizaci\u00f3n Alicia Ariza 45\nC\u00e1ceres, 76326 ","90":"93847 Anthony Walks\nWarrenhaven, WA 06141 ","91":"12 Christopher ville\nHortonton\nG20 3JA ","92":"203 Christopher Station Suite 299\nAndrewsberg, GU 30904-0672 ","93":"952 Williams grove\nEast Georginaborough\nE3 5AR ","94":"1517 Johnson Motorway Suite 185\nSouth Julia, ID 14807 ","95":"23 Cooper greens\nSouth Paigeborough\nS6 3WF ","96":"260 Cassandra Landing\nPort Karen, FM 09345 ","97":"Ronda de Raul Sedano 20\nMelilla, 02762 ","98":"Rambla Vicente Heredia 74\nCantabria, 85661 ","99":"375 Jocelyn Unions\nThomashaven, VT 51428 ","100":"Calle Juan Luis Alc\u00e1zar 17 Piso 6 \nZaragoza, 24032 ","101":"24531 Evans Manors Apt. 673\nKaylaton, UT 55939-9389 ","102":"Finkestra\u00dfe 388\n60394 Pritzwalk ","103":"C. de Sergio Daza 147\n\u00c1vila, 96969 "},"company":{"0":"L\u00f6ffler KG ","1":"P\u00e4rtzelt AG ","2":"Rose Binner AG & Co. KG ","3":"Kroker ","4":"Trommler ","5":"Trub ","6":"Oderwald ","7":"Rogner Birnbaum e.V. 
","8":"Oestrovsky OHG mbH ","9":"Heydrich Atzler GmbH ","10":"Gude ","11":"Hande ","12":"Kruschwitz ","13":"Heinz ","14":"Schmidtke ","15":"Kuhl AG ","16":"Franke Stiftung & Co. KG ","17":"Etzold AG ","18":"Stadelmann ","19":"Textor GmbH & Co. OHG ","20":"Lambert, Thomas and Hess ","21":"Robinson-Rhodes ","22":"Hicks, Garcia and Roman ","23":"Ewing-Lawrence ","24":"Tucker-Torres ","25":"Schmidt, Weeks and Mathews ","26":"Jensen PLC ","27":"Brown, Avila and Valdez ","28":"Ingram-Ryan ","29":"Bradford-George ","30":"Norman, Marshall and Abbott ","31":"Zhang-Solis ","32":"Caldwell LLC ","33":"Gonzalez, Jones and Hoffman ","34":"Reid-Winters ","35":"Rose Group ","36":"Lucas Inc ","37":"Ward, Morris and Simmons ","38":"Nichols-Watts ","39":"Miller, Yoder and Coleman ","40":"Jones-Haynes ","41":"Smith, Richardson and Noble ","42":"Brown-Mann ","43":"Duncan Group ","44":"Robertson, Gill and Walton ","45":"Roberts PLC ","46":"Byrne-Connolly ","47":"Brown-Curtis ","48":"Martin, Thompson and Gould ","49":"Roberts LLC ","50":"Townsend Inc ","51":"Wong-Parker ","52":"Stewart-Robinson ","53":"Gibbons LLC ","54":"Banks, Reed and Hudson ","55":"Cunningham LLC ","56":"Jones Ltd ","57":"Murray, Ball and Mann ","58":"King, Steele and Dunn ","59":"Walker, Anderson and Phillips ","60":"Roda-Agust\u00ed ","61":"Castell-Llamas ","62":"Alberdi and Sons ","63":"Guerra, Riquelme and Quintero ","64":"Pons, Tur and Araujo ","65":"Sanz-Alvarez ","66":"Lluch-Caro ","67":"Pacheco Ltd ","68":"Landa Inc ","69":"Jaume, Barba and Cerd\u00e1n ","70":"Berenguer and Sons ","71":"Coronado-Cort\u00e9s ","72":"Cruz Inc ","73":"Iglesia Ltd ","74":"Pintor, Orozco and Torre ","75":"Mart\u00ed LLC ","76":"Roca PLC ","77":"Quintana PLC ","78":"Moya, Recio and Caro ","79":"Boada, Machado and Portillo ","80":"Tucker-Torres ","81":"King, Steele and Dunn ","82":"Oderwald ","83":"Lluch-Caro ","84":"Roberts LLC ","85":"Gonzalez, Jones and Hoffman ","86":"Murray, Ball and Mann ","87":"Roda-Agust\u00ed 
","88":"Coronado-Cort\u00e9s ","89":"Cruz Inc ","90":"Nichols-Watts ","91":"Martin, Thompson and Gould ","92":"Hicks, Garcia and Roman ","93":"Wong-Parker ","94":"Bradford-George ","95":"Jones-Haynes ","96":"Norman, Marshall and Abbott ","97":"Landa Inc ","98":"Jaume, Barba and Cerd\u00e1n ","99":"Lambert, Thomas and Hess ","100":"Moya, Recio and Caro ","101":"Lucas Inc ","102":"Oestrovsky OHG mbH ","103":"Castell-Llamas "},"job":{"0":"Toxicologist ","1":"Ceramics designer ","2":"Engineer, petroleum ","3":"Neurosurgeon ","4":"Human resources officer ","5":"Investment analyst ","6":"Psychotherapist, child ","7":"Designer, ceramics\/pottery ","8":"Technical brewer ","9":"Research scientist (medical) ","10":"Scientist, product\/process development ","11":"Newspaper journalist ","12":"Best boy ","13":"Government social research officer ","14":"Politician's assistant","15":"Broadcast engineer ","16":"Scientist, research (maths) ","17":"Sports therapist ","18":"Lighting technician, broadcasting\/film\/video ","19":"Engineer, maintenance (IT) ","20":"Advertising account planner ","21":"Contracting civil engineer ","22":"Accountant, chartered management ","23":"Building services engineer ","24":"Clinical embryologist ","25":"Industrial buyer ","26":"Scientist, research (life sciences) ","27":"Artist ","28":"Animal technologist ","29":"Primary school teacher ","30":"Psychologist, prison and probation services ","31":"International aid\/development worker ","32":"Advertising account planner ","33":"Education officer, community ","34":"Gaffer ","35":"Engineer, broadcasting (operations) ","36":"Printmaker ","37":"Ophthalmologist ","38":"Systems developer ","39":"Actor ","40":"Designer, fashion\/clothing ","41":"Fast food restaurant manager ","42":"Agricultural engineer ","43":"Operational investment banker ","44":"Geographical information systems officer ","45":"Psychologist, educational ","46":"Scientist, product\/process development ","47":"Secondary school teacher 
","48":"Engineer, civil (contracting) ","49":"Educational psychologist ","50":"Teacher, adult education ","51":"Garment\/textile technologist ","52":"Community development worker ","53":"Information systems manager ","54":"Scientist, forensic ","55":"Psychiatrist ","56":"Materials engineer ","57":"Personal assistant ","58":"Museum education officer ","59":"Nature conservation officer ","60":"Conservator, museum\/gallery ","61":"Engineer, manufacturing ","62":"Media buyer ","63":"Secretary, company ","64":"Mining engineer ","65":"Special educational needs teacher ","66":"Production manager ","67":"Physiological scientist ","68":"Quality manager ","69":"Chemical engineer ","70":"Arts development officer ","71":"Television floor manager ","72":"Housing manager\/officer ","73":"Occupational hygienist ","74":"Physiological scientist ","75":"Education officer, museum ","76":"Control and instrumentation engineer ","77":"Ranger\/warden ","78":"Presenter, broadcasting ","79":"Management consultant ","80":"Clinical embryologist ","81":"Museum education officer ","82":"Psychotherapist, child ","83":"Production manager ","84":"Educational psychologist ","85":"Education officer, community ","86":"Personal assistant ","87":"Conservator, museum\/gallery ","88":"Television floor manager ","89":"Housing manager\/officer ","90":"Systems developer ","91":"Engineer, civil (contracting) ","92":"Accountant, chartered management ","93":"Garment\/textile technologist ","94":"Primary school teacher ","95":"Designer, fashion\/clothing ","96":"Psychologist, prison and probation services ","97":"Quality manager ","98":"Chemical engineer ","99":"Advertising account planner ","100":"Presenter, broadcasting ","101":"Printmaker ","102":"Technical brewer ","103":"Engineer, manufacturing "},"name":{"0":"Anne-Kathrin Sorgatz B.Eng. ","1":"Nadine Bruder ","2":"Reingard auch Schlauchin ","3":"Charlotte Scheel B.A. ","4":"Volkmar Meister ","5":"Prof. 
Karolina Hertrampf ","6":"Gudula Pieper ","7":"Aleksandra Mitschke B.A. ","8":"Carmen Schlosser MBA. ","9":"Wolfram L\u00f6ffler ","10":"Prof. Friedrich-Wilhelm Heintze ","11":"Reimer Gude B.Sc. ","12":"Ing. Jens-Uwe Conradi ","13":"Rena Schmiedecke ","14":"Lissy Mans B.Eng. ","15":"Roselinde Kraushaar B.Eng. ","16":"Dr. Klaus-D. Scheel ","17":"Johanna Buchholz B.A. ","18":"Renato Hornig ","19":"Hans-Peter Speer ","20":"Shawn Sanchez ","21":"Dr. Lori Hunt MD ","22":"Alyssa James ","23":"Bobby Barron ","24":"Joshua Roberts ","25":"Joshua Becker ","26":"Barbara Reynolds ","27":"William Espinoza ","28":"Laura Sexton ","29":"James Harrell ","30":"Christopher Rodriguez ","31":"Robert Martinez ","32":"Sandra Meza ","33":"John Ryan ","34":"Michelle Wilson ","35":"Jessica Johnson ","36":"Lori Arellano ","37":"Jose Espinoza ","38":"Glenn Quinn ","39":"Matthew Cooper ","40":"Jade Shaw ","41":"Hilary Smith ","42":"Mr. Martin Thompson ","43":"Jemma Hudson ","44":"Brenda Phillips ","45":"Albert Khan ","46":"Terry Johnson ","47":"Sheila Cross ","48":"Sandra Hudson ","49":"Kelly Barry ","50":"Emily Newman-Jones ","51":"Caroline West ","52":"Mr. 
Joseph Carr ","53":"Rita Dean ","54":"Gareth Singh-Green ","55":"Elliott Clarke ","56":"Marilyn Edwards-Walker ","57":"Mathew Smith ","58":"Miss Beth Miah ","59":"Suzanne Archer-Barker ","60":"Andr\u00e9s Pablo Valent\u00edn Ricart ","61":"Gabriel Raul Infante Montero ","62":"Trinidad Cuervo Escudero ","63":"Vicenta Vizca\u00edno Montes ","64":"Dolores Zurita Prats ","65":"Santiago Gras Ben\u00edtez ","66":"Noelia Reig ","67":"Luis Miguel Maldonado Moliner ","68":"Francisca Barral ","69":"Natalia Pe\u00f1alver Sotelo ","70":"Vanesa Barrena Sim\u00f3 ","71":"Gonzalo Robledo Pou ","72":"Tom\u00e1s Cepeda Castrillo ","73":"Lidia Azcona Sebasti\u00e1n ","74":"Elena Beltran-Amat ","75":"Esperanza Villa Rius ","76":"Belen Carmona Urrutia ","77":"Juan Luis de Mesa ","78":"Alberto Torrens ","79":"Andr\u00e9s Lago ","80":"Joshua Roberts ","81":"Miss Beth Miah ","82":"Gudula Pieper ","83":"Noelia Reig ","84":"Kelly Barry ","85":"John Ryan ","86":"Mathew Smith ","87":"Andr\u00e9s Pablo Valent\u00edn Ricart ","88":"Gonzalo Robledo Pou ","89":"Tom\u00e1s Cepeda Castrillo ","90":"Glenn Quinn ","91":"Sandra Hudson ","92":"Alyssa James ","93":"Caroline West ","94":"James Harrell ","95":"Jade Shaw ","96":"Christopher Rodriguez ","97":"Francisca Barral ","98":"Natalia Pe\u00f1alver Sotelo ","99":"Shawn Sanchez ","100":"Alberto Torrens ","101":"Lori Arellano ","102":"Carmen Schlosser MBA. ","103":"Gabriel Raul Infante Montero "}} -------------------------------------------------------------------------------- /data/nhl_scores.csv: -------------------------------------------------------------------------------- 1 | time,status,scores,teams,date,notes 2 | ,"Status 3 | GAME 2 4 | OTT LEADS 2-0 5 | FINAL /2OT","['Rangers', 'Senators']","['Rangers', 'Senators']","Saturday, Apr 29","NYR 6 | New York Rangers 7 | Game Information 8 | Goal Scorers 9 | 1st: M. Grabner (3) 4:16 (SHG) 2nd: C. Kreider (1) 10:39 D. Stepan (2) 13:10 (SHG) B. Skjei (3) 15:51 3rd: B. 
Skjei (4) 5:10 OT: None 2OT: None 10 | L: H. Lundqvist 11 | OTT 12 | Ottawa Senators 13 | Game Information 14 | Goal Scorers 15 | 1st: J. Pageau (2) 13:59 2nd: M. Methot (1) 14:00 3rd: M. Stone (2) 1:28 J. Pageau (3) 16:41 J. Pageau (4) 18:58 OT: None 2OT: J. Pageau (5) 2:54 16 | W: C. Anderson" 17 | ,"Status 18 | GAME 2 19 | PIT LEADS 2-0 20 | FINAL","['Penguins', 'Capitals']","['Penguins', 'Capitals']","Saturday, Apr 29","PIT 21 | Pittsburgh Penguins 22 | Game Information 23 | Goal Scorers 24 | 1st: None 2nd: M. Cullen (1) 1:15 (SHG) P. Kessel (3) 13:04 J. Guentzel (6) 16:14 3rd: P. Kessel (4) 2:19 (PPG) E. Malkin (2) 5:31 J. Guentzel (7) 19:17 25 | W: M. Fleury 26 | WSH 27 | Washington Capitals 28 | Game Information 29 | Goal Scorers 30 | 1st: None 2nd: M. Niskanen (1) 2:09 (PPG) 3rd: N. Backstrom (3) 3:44 31 | L: B. Holtby" 32 | 3:00 PM ET,"Status 33 | GAME 3 34 | TIED 1-1 35 | 3:00 PM ET 36 | NBC, TVAS, SN","['Blues', 'Predators']","['Blues', 'Predators']","Sunday, Apr 30","STL 37 | St. Louis Blues 38 | Game Information 39 | J. Schwartz: 7 Points 40 | J. Schwartz: 3 Goals 41 | J. Schwartz: 4 Assists 42 | NSH 43 | Nashville Predators 44 | Game Information 45 | R. Ellis: 7 Points 46 | F. Forsberg: 3 Goals 47 | R. Johansen: 6 Assists" 48 | 7:00 PM ET,"Status 49 | GAME 3 50 | EDM LEADS 2-0 51 | 7:00 PM ET 52 | NBCSN, TVAS, SN","['Ducks', 'Oilers']","['Ducks', 'Oilers']","Sunday, Apr 30","ANA 53 | Anaheim Ducks 54 | Game Information 55 | R. Getzlaf: 7 Points 56 | R. Getzlaf: 4 Goals 57 | R. Kesler: 4 Assists 58 | EDM 59 | Edmonton Oilers 60 | Game Information 61 | L. Draisaitl: 7 Points 62 | M. Letestu: 3 Goals 63 | L. Draisaitl: 5 Assists" 64 | 7:30 PM ET,"Status 65 | GAME 3 66 | PIT LEADS 2-0 67 | 7:30 PM ET 68 | NBCSN, CBC, TVAS","['Capitals', 'Penguins']","['Capitals', 'Penguins']","Monday, May 1","WSH 69 | Washington Capitals 70 | Game Information 71 | T.J. Oshie: 9 Points 72 | A. Ovechkin: 4 Goals 73 | T.J. 
Oshie: 6 Assists 74 | PIT 75 | Pittsburgh Penguins 76 | Game Information 77 | E. Malkin: 13 Points 78 | J. Guentzel: 7 Goals 79 | E. Malkin: 10 Assists" 80 | 7:00 PM ET,"Status 81 | GAME 3 82 | OTT LEADS 2-0 83 | 7:00 PM ET 84 | NBCSN, CBC, TVAS","['Senators', 'Rangers']","['Senators', 'Rangers']","Tuesday, May 2","OTT 85 | Ottawa Senators 86 | Game Information 87 | D. Brassard: 8 Points 88 | J. Pageau: 5 Goals 89 | E. Karlsson: 7 Assists 90 | NYR 91 | New York Rangers 92 | Game Information 93 | M. Zibanejad: 6 Points 94 | B. Skjei: 4 Goals 95 | M. Zibanejad: 5 Assists" 96 | 9:30 PM ET,"Status 97 | GAME 4 98 | TIED 1-1 99 | 9:30 PM ET 100 | NBCSN, TVAS, SN","['Blues', 'Predators']","['Blues', 'Predators']","Tuesday, May 2","STL 101 | St. Louis Blues 102 | Game Information 103 | J. Schwartz: 7 Points 104 | J. Schwartz: 3 Goals 105 | J. Schwartz: 4 Assists 106 | NSH 107 | Nashville Predators 108 | Game Information 109 | R. Ellis: 7 Points 110 | F. Forsberg: 3 Goals 111 | R. Johansen: 6 Assists" 112 | 7:30 PM ET,"Status 113 | GAME 4 114 | PIT LEADS 2-0 115 | 7:30 PM ET 116 | NBCSN, CBC, TVAS","['Capitals', 'Penguins']","['Capitals', 'Penguins']","Wednesday, May 3","WSH 117 | Washington Capitals 118 | Game Information 119 | T.J. Oshie: 9 Points 120 | A. Ovechkin: 4 Goals 121 | T.J. Oshie: 6 Assists 122 | PIT 123 | Pittsburgh Penguins 124 | Game Information 125 | E. Malkin: 13 Points 126 | J. Guentzel: 7 Goals 127 | E. Malkin: 10 Assists" 128 | 10:00 PM ET,"Status 129 | GAME 4 130 | EDM LEADS 2-0 131 | 10:00 PM ET 132 | NBCSN, TVAS, SN","['Ducks', 'Oilers']","['Ducks', 'Oilers']","Wednesday, May 3","ANA 133 | Anaheim Ducks 134 | Game Information 135 | R. Getzlaf: 7 Points 136 | R. Getzlaf: 4 Goals 137 | R. Kesler: 4 Assists 138 | EDM 139 | Edmonton Oilers 140 | Game Information 141 | L. Draisaitl: 7 Points 142 | M. Letestu: 3 Goals 143 | L. 
Draisaitl: 5 Assists" 144 | 7:30 PM ET,"Status 145 | GAME 4 146 | OTT LEADS 2-0 147 | 7:30 PM ET 148 | NBCSN, CBC, TVAS","['Senators', 'Rangers']","['Senators', 'Rangers']","Thursday, May 4","OTT 149 | Ottawa Senators 150 | Game Information 151 | D. Brassard: 8 Points 152 | J. Pageau: 5 Goals 153 | E. Karlsson: 7 Assists 154 | NYR 155 | New York Rangers 156 | Game Information 157 | M. Zibanejad: 6 Points 158 | B. Skjei: 4 Goals 159 | M. Zibanejad: 5 Assists" 160 | -------------------------------------------------------------------------------- /data/sales_data.csv: -------------------------------------------------------------------------------- 1 | ,timestamp,city,store_id,sale_number,sale_amount,associate 2 | 0,2018-09-10 05:00:45,Williamburgh,6,1530,1167.0,Gary Lee 3 | 1,2018-09-12 10:01:27,Ibarraberg,1,2744,258.0,Daniel Davis 4 | 2,2018-09-13 12:01:48,Sarachester,2,1908,266.0,Michael Roth 5 | 3,2018-09-14 20:02:19,Caldwellbury,14,771,-108.0,Michaela Stewart 6 | 4,2018-09-16 01:03:21,Erikaland,11,1571,-372.0,Mark Taylor 7 | 5,2018-09-18 03:04:11,Ponceview,19,1006,-399.0,Douglas Peters 8 | 6,2018-09-20 12:04:49,East Randymouth,3,2320,-304.0,Julie Cooper 9 | 7,2018-09-21 15:05:42,New Jamesberg,15,2818,-295.0,Dean Davis 10 | 8,2018-09-23 20:06:09,Olsenville,18,2309,729.0,Nicole Anderson 11 | 9,2018-09-25 21:06:34,Port Grace,9,29,457.0,Andrew Robles 12 | 10,2018-09-27 04:07:32,New Katherineville,8,1654,-89.0,Patricia Peterson 13 | 11,2018-09-29 10:07:58,East Kirstenbury,11,513,810.0,Jeremiah Thompson 14 | 12,2018-10-01 15:08:04,West Petermouth,14,1276,127.0,Robert Taylor 15 | 13,2018-10-02 17:08:42,North Stanleybury,11,1437,-303.0,Lawrence Norton 16 | 14,2018-10-05 02:08:50,West Shannon,2,1021,477.0,Michael Silva 17 | 15,2018-10-07 06:09:00,Phillipsshire,16,1066,-432.0,Kelsey Collins 18 | 16,2018-10-09 15:09:03,South Colinport,8,464,229.0,Taylor Martin MD 19 | 17,2018-10-10 19:09:06,Valeriebury,2,2911,747.0,Richard Wong 20 | 18,2018-10-13 01:09:53,Port 
Amanda,1,421,1330.0,Christopher Joyce 21 | 19,2018-10-14 08:10:23,South Neilstad,3,2106,-177.0,Melinda Miranda 22 | 20,2018-10-16 12:10:55,West Natasha,3,278,-154.0,Charles Fletcher 23 | 21,2018-10-17 20:11:03,Beniteztown,3,1186,627.0,Mary Luna 24 | 22,2018-10-20 02:11:21,Stephaniemouth,5,277,-130.0,Carol Morrison 25 | 23,2018-10-22 03:12:13,New Timothy,17,2986,1487.0,Tony Lynch 26 | 24,2018-10-23 12:12:25,Caseyside,10,2624,372.0,Alicia Valenzuela 27 | 25,2018-10-24 14:13:28,New Meganland,19,2767,-145.0,Michael Harvey 28 | 26,2018-10-26 16:14:19,Riceport,9,1778,53.0,Alexis Boone 29 | 27,2018-10-27 19:14:56,West Bridgetborough,5,216,1383.0,Jonathan Townsend 30 | 28,2018-10-29 22:15:38,Robertstown,3,198,1471.0,Cynthia Friedman 31 | 29,2018-10-31 00:16:30,East Deborahland,9,365,1313.0,Ms. Victoria Ford DDS 32 | 30,2018-11-01 06:16:52,Katherinefurt,14,488,94.0,Jason Robinson 33 | 31,2018-11-03 09:17:33,East Katie,16,1137,-259.0,Gregory James 34 | 32,2018-11-05 17:17:41,Angelabury,4,1389,1097.0,Carl Bailey 35 | 33,2018-11-08 02:17:55,Johnfurt,12,1370,676.0,Michael Clark 36 | 34,2018-11-10 08:18:17,Port Matthewton,4,449,1427.0,Martin Chang 37 | 35,2018-11-12 11:18:42,East Johntown,10,2540,534.0,Nathan Santana 38 | 36,2018-11-14 20:19:23,North Adam,4,457,886.0,Robert Wilson 39 | 37,2018-11-17 03:19:30,Kathleenbury,19,884,1439.0,Carl Mills 40 | 38,2018-11-18 08:19:35,Hughesview,14,280,-241.0,Brenda Mcbride 41 | 39,2018-11-19 11:19:43,New Jonathan,18,1135,1328.0,Edward Thomas 42 | 40,2018-11-21 19:20:45,South Ashleyton,3,2323,-4.0,Robert Landry 43 | 41,2018-11-23 02:21:43,West Sean,4,2327,1581.0,Denise Ross 44 | 42,2018-11-24 06:22:33,Randyborough,7,248,1417.0,Eric Montgomery 45 | 43,2018-11-26 15:22:56,West Robertmouth,15,1799,262.0,Pamela Salinas 46 | 44,2018-11-27 22:23:37,Cohenshire,1,339,1359.0,Wayne Peterson 47 | 45,2018-11-30 00:24:21,New Sydney,9,1847,1529.0,Brad Ray 48 | 46,2018-12-01 05:24:30,Smithfurt,8,2072,-238.0,Mr. 
Micheal Hale DDS 49 | 47,2018-12-03 12:24:53,South Jeffreytown,10,254,542.0,Linda Mclaughlin 50 | 48,2018-12-05 16:25:18,North Jason,15,2850,-284.0,Jeremy Brewer 51 | 49,2018-12-07 23:26:14,South Jessicaview,3,516,992.0,Summer Nash 52 | 50,2018-12-09 07:26:28,Port Briana,8,1219,799.0,Christian Perez 53 | 51,2018-12-10 12:26:57,Aliciastad,6,652,-164.0,Rhonda Stone 54 | 52,2018-12-12 17:27:07,East Curtis,17,443,112.0,Elizabeth Barber 55 | 53,2018-12-13 20:27:28,Lake Brandon,11,406,880.0,Stacy Mcintosh 56 | 54,2018-12-14 22:27:45,North Kyle,5,2944,537.0,Ashley Taylor 57 | 55,2018-12-16 01:28:14,Samanthashire,9,2293,-184.0,Jason Harper 58 | 56,2018-12-18 07:29:18,South Jason,17,2387,-304.0,Sarah Sandoval 59 | 57,2018-12-19 12:29:25,Katrinafurt,5,1044,543.0,Kimberly Zhang 60 | 58,2018-12-20 13:29:37,New Lisachester,13,736,991.0,Michelle Murray 61 | 59,2018-12-22 15:29:58,New Melanie,7,1717,1579.0,Eric Morrow 62 | 60,2018-12-24 20:30:45,Port Matthew,15,725,-455.0,Tracey Martin 63 | 61,2018-12-26 03:31:41,South Paulabury,2,2963,1028.0,David Ayers 64 | 62,2018-12-27 08:32:37,South Lindamouth,19,1756,942.0,Andre Villa 65 | 63,2018-12-29 17:32:50,Chaveztown,18,161,1551.0,Eric Hunter 66 | 64,2018-12-31 02:33:29,Timothyside,15,1096,757.0,Lisa James 67 | 65,2019-01-01 08:34:21,Port Pamelafurt,9,1662,-397.0,Austin Mclaughlin 68 | 66,2019-01-02 14:34:59,New Tylermouth,7,1727,203.0,Kenneth Elliott 69 | 67,2019-01-04 18:35:32,Port Emilyfurt,12,87,1301.0,Mrs. 
Nicole Huang 70 | 68,2019-01-06 00:36:34,Janiceview,1,820,443.0,Linda Snyder 71 | 69,2019-01-08 02:37:14,Port Monicaville,2,718,-400.0,Courtney Ward 72 | 70,2019-01-10 08:37:44,Gonzalesburgh,3,196,1482.0,Kevin Madden 73 | 71,2019-01-11 10:37:50,Lake Chelseatown,11,2941,-321.0,Don Garza 74 | 72,2019-01-12 11:38:16,Faulknerstad,15,433,1281.0,Oscar Garcia 75 | 73,2019-01-14 17:38:53,Chelseashire,6,78,796.0,Jeffery Hughes 76 | 74,2019-01-15 22:39:00,Port Matthew,4,170,-47.0,Adam Brandt 77 | 75,2019-01-18 07:39:35,Lake Richardstad,6,2330,1274.0,Christopher Randall 78 | 76,2019-01-20 15:40:17,New Phillipton,14,2514,-68.0,Maria Lee 79 | 77,2019-01-21 16:41:11,North Thomasshire,16,1491,33.0,Elizabeth Bishop 80 | 78,2019-01-22 22:42:11,Seanview,8,962,243.0,Jamie Cummings 81 | 79,2019-01-24 06:42:31,West Kimberly,14,2639,1429.0,Steven Miller 82 | 80,2019-01-25 08:43:00,Lake Sandrafurt,15,2557,320.0,Shawn James 83 | 81,2019-01-27 17:44:00,Port Dariusshire,18,1878,880.0,Mark Fowler 84 | 82,2019-01-29 18:44:03,Katherinestad,18,2764,119.0,Jennifer Lee 85 | 83,2019-01-31 20:44:23,Elizabethview,13,1375,16.0,Joseph Barron 86 | 84,2019-02-03 03:45:05,New Amy,3,212,857.0,John Chambers 87 | 85,2019-02-05 11:45:39,East Joseph,1,1430,156.0,Jennifer Wilson 88 | 86,2019-02-06 18:46:26,Ortiztown,13,185,1454.0,Christopher Harris 89 | 87,2019-02-09 00:47:04,Jonesland,6,1774,1053.0,Christopher Young 90 | 88,2019-02-11 05:47:26,New Jason,16,280,309.0,Ashley Bryant 91 | 89,2019-02-12 06:47:48,Lindaville,2,1178,971.0,Monique Hughes 92 | 90,2019-02-13 15:48:35,Port Ryan,17,1307,615.0,Cory Wilkerson 93 | 91,2019-02-14 22:48:39,Danielsmouth,10,708,1134.0,Pamela Potts 94 | 92,2019-02-16 01:49:43,Lauraborough,6,2998,1201.0,Calvin Soto 95 | 93,2019-02-18 05:50:27,South Vincenthaven,1,1020,692.0,Lori Johnson 96 | 94,2019-02-20 07:51:08,East Rebecca,17,1788,512.0,Gina Frey 97 | 95,2019-02-22 09:52:00,Port Coryborough,3,1455,958.0,Laurie Carr DVM 98 | 96,2019-02-24 12:52:05,East 
Michelleville,7,1572,262.0,Eugene Summers 99 | 97,2019-02-25 13:52:59,Bairdmouth,11,2100,414.0,Brandon Salinas 100 | 98,2019-02-26 16:54:02,Carolhaven,11,1544,171.0,Heather Johnson 101 | 99,2019-02-27 21:54:27,Hardingburgh,1,939,981.0,Karen Flynn 102 | 100,2019-03-02 05:55:05,Port Shannon,12,2351,648.0,Dawn Schmidt 103 | 101,2019-03-03 11:55:51,Munozberg,14,1827,-213.0,John Glover 104 | 102,2019-03-04 12:56:26,Zacharyborough,9,936,964.0,Marc Barnett 105 | 103,2019-03-05 19:56:30,Lake Edwardmouth,14,2565,-144.0,Keith Jones 106 | 104,2019-03-06 20:56:50,South Stevenview,3,2683,-265.0,Mariah Wright 107 | 105,2019-03-08 21:57:36,Harrisport,17,698,1113.0,Donna Ibarra 108 | 106,2019-03-11 06:58:00,Brockfort,8,2888,14.0,Miranda Gibbs 109 | 107,2019-03-13 13:58:38,West Kimberly,17,1467,-349.0,Christina Randolph 110 | 108,2019-03-15 14:58:53,South James,18,2362,12.0,Jonathan Clark 111 | 109,2019-03-16 22:58:58,Fosterburgh,6,809,755.0,Lauren Nicholson 112 | 110,2019-03-18 07:59:22,Sarahton,5,1500,774.0,Kimberly Price 113 | 111,2019-03-19 15:00:18,New Douglasmouth,13,2028,-78.0,Carlos French 114 | 112,2019-03-20 16:00:25,South Nancyview,7,681,-310.0,Lance Hurst 115 | 113,2019-03-23 01:01:02,West Kenneth,2,2063,1096.0,Herbert Morris 116 | 114,2019-03-24 04:01:26,Port Bob,18,457,926.0,James Turner 117 | 115,2019-03-26 11:02:03,West Alex,10,1168,111.0,Barbara Savage 118 | 116,2019-03-28 19:02:35,Bruceton,12,1036,1570.0,Diane Pearson 119 | 117,2019-03-30 03:03:11,Davidstad,13,2689,976.0,Christopher Ortiz 120 | 118,2019-03-31 04:03:56,North Michael,3,2633,462.0,Kendra Santiago 121 | 119,2019-04-01 06:04:29,Jonesstad,6,1966,553.0,Andrew Warren 122 | 120,2019-04-03 12:04:38,Justinside,18,2386,1490.0,Stephen Shaffer 123 | 121,2019-04-05 13:04:55,West Carriemouth,19,1245,1149.0,Jeffrey Ford 124 | 122,2019-04-07 22:05:23,Heathershire,7,1941,650.0,Vanessa Burke 125 | 123,2019-04-10 00:05:47,Port Michael,9,2724,-179.0,Brandon Lawson 126 | 124,2019-04-12 07:06:20,North 
Georgemouth,1,1180,-391.0,George Andrade 127 | 125,2019-04-13 15:06:49,Port Peter,13,1646,349.0,Dave Evans 128 | 126,2019-04-14 21:07:08,Lake Richard,4,2574,1387.0,Kayla Mercado 129 | 127,2019-04-17 02:07:19,Lake Kimberlyport,16,2170,553.0,Dana Wilson 130 | 128,2019-04-19 09:08:14,South Donaldchester,19,2305,357.0,Chelsea Brown 131 | 129,2019-04-20 17:09:07,Markmouth,1,2511,1504.0,Laura Henderson 132 | 130,2019-04-21 23:09:18,South Margaretstad,3,929,-91.0,Stephen Palmer 133 | 131,2019-04-24 01:09:50,West Donaldbury,12,1547,980.0,Chelsea Lee 134 | 132,2019-04-26 10:10:45,Sarahland,8,2126,-372.0,David Wallace 135 | 133,2019-04-28 14:10:54,Kathrynmouth,6,1563,1024.0,Ms. Brenda Larson DDS 136 | 134,2019-04-30 17:11:08,Port Jacobborough,3,652,452.0,Jodi Watson 137 | 135,2019-05-01 21:11:39,Kathrynburgh,10,1344,925.0,Adam Fisher 138 | 136,2019-05-03 03:11:48,Salasshire,13,1832,755.0,Shelly Todd 139 | 137,2019-05-04 05:12:13,West Debbieberg,11,478,708.0,Connie Wilson 140 | 138,2019-05-06 06:12:22,South Melissaside,10,1770,1204.0,Amber Steele 141 | 139,2019-05-07 13:12:50,Kellerside,5,5,528.0,Thomas Baker 142 | 140,2019-05-09 15:13:19,North Karenton,7,549,567.0,Crystal Jennings 143 | 141,2019-05-11 19:14:07,East Zacharychester,1,2205,1512.0,Kelsey Gardner 144 | 142,2019-05-14 02:14:32,North Johnview,1,1247,-449.0,Alyssa Buck 145 | 143,2019-05-16 05:15:14,West Michael,11,2890,264.0,Christine Oneal 146 | 144,2019-05-17 09:15:33,Hectorton,12,1674,823.0,Jennifer Page 147 | 145,2019-05-19 16:16:30,East Jesusport,2,1729,396.0,Haley Pitts 148 | 146,2019-05-20 23:17:05,West Georgeshire,6,1862,1021.0,Heidi Lutz 149 | 147,2019-05-23 06:17:10,Alexandrabury,18,1043,15.0,Stacey Daniels 150 | 148,2019-05-24 09:17:53,Nicholschester,6,2876,47.0,Timothy Moran 151 | 149,2019-05-25 18:18:18,Brettland,14,2130,1494.0,Amy Lane 152 | 150,2019-05-27 20:18:51,Tammyport,12,2382,819.0,Carrie Levine 153 | 151,2019-05-29 04:19:23,Lake Brentfurt,13,1217,1040.0,Kristina Trujillo 154 | 152,2019-05-30 
13:20:01,Franklinstad,5,1080,-405.0,Amanda Newton 155 | 153,2019-05-31 20:20:53,West Billyfort,5,2508,55.0,Peter Duffy 156 | 154,2019-06-02 01:21:17,Beardshire,18,1243,1027.0,Amanda Haynes 157 | 155,2019-06-03 03:22:02,Port Mackenziemouth,5,2155,1599.0,Monique Martinez 158 | 156,2019-06-05 07:22:10,Longhaven,8,2368,1527.0,Chad Williams 159 | 157,2019-06-07 08:23:07,Alexanderport,19,1602,-462.0,Carlos Vasquez 160 | 158,2019-06-09 10:23:36,Danielhaven,11,1348,1374.0,Derek Mcgee 161 | 159,2019-06-10 18:23:46,Brownmouth,12,700,1155.0,Andrea Crawford 162 | 160,2019-06-12 01:24:37,Lake Ginaland,7,2695,547.0,Bradley Carrillo 163 | 161,2019-06-14 05:25:11,Denisemouth,16,16,535.0,James Pugh 164 | 162,2019-06-15 08:25:52,Moraport,3,1827,-358.0,Belinda Vasquez 165 | 163,2019-06-16 15:26:33,Ariasberg,19,444,1299.0,Alicia Ross 166 | 164,2019-06-17 19:26:37,West Amanda,7,2454,-78.0,Dennis Jones 167 | 165,2019-06-19 04:26:44,Romanside,11,1965,1310.0,Valerie Ayers 168 | 166,2019-06-21 13:27:17,Batesshire,6,914,467.0,Donald Smith 169 | 167,2019-06-22 21:27:21,Jamesburgh,4,2102,-358.0,James Thompson 170 | 168,2019-06-24 06:28:08,New Natashastad,12,1625,1071.0,James Gordon 171 | 169,2019-06-25 07:28:37,North Linda,6,893,16.0,Johnny Turner 172 | 170,2019-06-27 16:29:23,West Cody,15,1755,888.0,James Chen 173 | 171,2019-06-30 01:29:59,Karaville,11,1569,-391.0,Christina Khan 174 | 172,2019-07-01 08:30:55,Michaelafort,17,2561,1334.0,Mark Soto 175 | 173,2019-07-02 16:31:29,North Justin,2,1786,412.0,Ray Miller 176 | 174,2019-07-03 20:31:47,Jacquelineburgh,19,59,1380.0,Brandi Simmons 177 | 175,2019-07-05 21:32:25,Wigginshaven,14,2574,1087.0,Jeffery Rodriguez 178 | 176,2019-07-08 01:32:39,West Johnfort,15,1623,107.0,Michael Trujillo 179 | 177,2019-07-10 02:33:30,South Josephfurt,16,2918,741.0,Kayla Sanchez 180 | 178,2019-07-11 07:33:55,North Nicholas,11,2117,-304.0,Stacy Williams 181 | 179,2019-07-12 08:34:47,Griffithville,12,214,753.0,Thomas Fuller 182 | 180,2019-07-13 
14:35:35,Timchester,6,2292,-9.0,Michael Johnson 183 | 181,2019-07-15 21:35:58,Port Timothy,1,2365,724.0,Christopher West 184 | 182,2019-07-18 04:36:04,New Emma,13,1995,56.0,Christopher Castillo 185 | 183,2019-07-20 12:36:31,Lisaville,1,2929,250.0,Heather Fisher 186 | 184,2019-07-22 20:37:17,New Tashabury,2,668,141.0,Antonio Jackson 187 | 185,2019-07-25 05:37:25,Albertfurt,18,534,388.0,Eric Hogan 188 | 186,2019-07-27 12:38:24,South Jessemouth,12,1841,1446.0,Martin Ponce DVM 189 | 187,2019-07-28 21:38:54,Johnchester,1,72,1475.0,Nicole Carpenter 190 | 188,2019-07-31 00:38:57,Markmouth,8,1260,938.0,Kevin Mosley 191 | 189,2019-08-02 07:39:28,West Meghanburgh,7,1740,368.0,Randall Page 192 | 190,2019-08-04 11:40:07,Millerview,19,1576,722.0,James Martinez 193 | 191,2019-08-06 20:40:50,Rebeccaview,12,2055,1100.0,Michael Mcintyre 194 | 192,2019-08-08 03:40:55,Debbieton,4,1781,899.0,Ronald Wood 195 | 193,2019-08-10 06:41:47,Robinsonshire,8,55,738.0,James Garcia 196 | 194,2019-08-11 09:41:58,Port Brandyside,2,2627,-433.0,Deborah Reed DDS 197 | 195,2019-08-12 12:42:27,Tinafort,10,1324,-329.0,George Collins 198 | 196,2019-08-13 16:42:30,Nguyenton,7,1929,-147.0,Katelyn Barber 199 | 197,2019-08-14 19:42:52,Leachville,4,590,662.0,Amanda Tate 200 | 198,2019-08-16 03:42:59,North Spencer,10,539,1068.0,Johnathan Brown 201 | 199,2019-08-18 07:43:28,Guerrachester,15,2124,167.0,Nicole Gonzalez 202 | 200,2019-08-20 11:43:31,Lisamouth,15,1748,1266.0,Rebecca Baxter 203 | 201,2019-08-21 13:44:01,Lake Joanne,3,629,1407.0,Tracy Diaz 204 | 202,2019-08-22 17:44:05,East Jennifer,15,2623,1124.0,Ryan Morrow 205 | 203,2019-08-25 01:45:08,Lake Summerport,2,2798,-319.0,Jason Anderson 206 | 204,2019-08-26 08:45:21,Port Carrieburgh,16,2144,1307.0,Kyle Robbins 207 | 205,2019-08-28 13:45:36,South Ryan,18,1835,1278.0,Robert Santos 208 | 206,2019-08-29 16:45:54,West Vincentmouth,4,887,-132.0,Miguel Hunter 209 | 207,2019-08-31 00:46:11,Dianabury,14,2697,-20.0,Nathan White 210 | 208,2019-09-01 06:46:44,North 
Jeffrey,4,1806,391.0,James Davis 211 | 209,2019-09-03 12:47:26,Lake Ashleytown,1,2532,1539.0,James Greer 212 | 210,2019-09-05 18:47:30,Nelsonfort,7,1781,146.0,Jennifer Mcdonald 213 | 211,2019-09-07 23:48:08,West Michaelbury,18,366,-167.0,Sarah Lewis 214 | 212,2018-09-09 04:48:48,Jeffreyside,17,81,173.0,Frank Hoffman 215 | -------------------------------------------------------------------------------- /data/sales_data_duped.csv: -------------------------------------------------------------------------------- 1 | timestamp,city,store_id,sale_number,sale_amount,associate 2 | 2017-09-15T06:17:10,Alexandrabury,18,1043,15.0,Stacey Daniels 3 | 2017-09-11T16:16:30,East Jesusport,2,1729,396.0,Haley Pitts 4 | 2017-07-12T15:00:18,New Douglasmouth,13,2028,-78.0,Carlos French 5 | 2017-07-29T13:04:55,West Carriemouth,19,1245,1149.0,Jeffrey Ford 6 | 2017-11-07T21:35:58,Port Timothy,1,2365,724.0,Christopher West 7 | 2017-05-24T18:44:03,Katherinestad,18,2764,119.0,Jennifer Lee 8 | 2017-08-07T21:07:08,Lake Richard,4,2574,1387.0,Kayla Mercado 9 | 2017-09-26T03:22:02,Port Mackenziemouth,5,2155,1599.0,Monique Martinez 10 | 2017-10-18T07:28:37,North Linda,6,893,16.0,Johnny Turner 11 | 2017-02-14T03:12:13,New Timothy,17,2986,1487.0,Tony Lynch 12 | 2017-04-08T22:27:45,North Kyle,5,2944,537.0,Ashley Taylor 13 | 2017-08-05T07:06:20,North Georgemouth,1,1180,-391.0,George Andrade 14 | 2017-06-17T09:52:00,Port Coryborough,3,1455,958.0,Laurie Carr DVM 15 | 2017-01-20T04:07:32,New Katherineville,8,1654,-89.0,Patricia Peterson 16 | 2017-12-05T12:42:27,Tinafort,10,1324,-329.0,George Collins 17 | 2017-06-11T01:49:43,Lauraborough,6,2998,1201.0,Calvin Soto 18 | 2017-05-01T00:36:34,Janiceview,1,820,443.0,Linda Snyder 19 | 2017-02-09T20:11:03,Beniteztown,3,1186,627.0,Mary Luna 20 | 2017-07-17T04:01:26,Port Bob,18,457,926.0,James Turner 21 | 2017-12-29T18:47:30,Nelsonfort,7,1781,146.0,Jennifer Mcdonald 22 | 2017-07-23T03:03:11,Davidstad,13,2689,976.0,Christopher Ortiz 23 | 2017-07-16T01:01:02,West 
Kenneth,2,2063,1096.0,Herbert Morris 24 | 2017-11-10T04:36:04,New Emma,13,1995,56.0,Christopher Castillo 25 | 2017-02-18T16:14:19,Riceport,9,1778,53.0,Alexis Boone 26 | 2017-12-29T18:47:30,Nelsonfort,7,1781,146.0,Jennifer Mcdonald 27 | 2017-10-15T21:27:21,Jamesburgh,4,2102,-358.0,James Thompson 28 | 2017-12-31T23:48:08,West Michaelbury,18,366,-167.0,Sarah Lewis 29 | 2017-06-09T22:48:39,Danielsmouth,10,708,1134.0,Pamela Potts 30 | 2017-11-05T14:35:35,Timchester,6,2292,-9.0,Michael Johnson 31 | 2017-07-06T13:58:38,West Kimberly,17,1467,-349.0,Christina Randolph 32 | 2017-12-15T17:44:05,East Jennifer,15,2623,1124.0,Ryan Morrow 33 | 2017-05-06T10:37:50,Lake Chelseatown,11,2941,-321.0,Don Garza 34 | 2017-10-09T15:26:33,Ariasberg,19,444,1299.0,Alicia Ross 35 | 2017-03-25T00:24:21,New Sydney,9,1847,1529.0,Brad Ray 36 | 2017-05-07T11:38:16,Faulknerstad,15,433,1281.0,Oscar Garcia 37 | 2017-09-08T05:15:14,West Michael,11,2890,264.0,Christine Oneal 38 | 2017-06-29T20:56:50,South Stevenview,3,2683,-265.0,Mariah Wright 39 | 2017-01-28T02:08:50,West Shannon,2,1021,477.0,Michael Silva 40 | 2017-08-06T15:06:49,Port Peter,13,1646,349.0,Dave Evans 41 | 2017-11-29T20:40:50,Rebeccaview,12,2055,1100.0,Michael Mcintyre 42 | 2017-06-06T05:47:26,New Jason,16,280,309.0,Ashley Bryant 43 | 2017-10-12T04:26:44,Romanside,11,1965,1310.0,Valerie Ayers 44 | 2017-03-18T02:21:43,West Sean,4,2327,1581.0,Denise Ross 45 | 2017-10-23T01:29:59,Karaville,11,1569,-391.0,Christina Khan 46 | 2017-12-24T00:46:11,Dianabury,14,2697,-20.0,Nathan White 47 | 2017-01-25T17:08:42,North Stanleybury,11,1437,-303.0,Lawrence Norton 48 | 2017-06-06T05:47:26,New Jason,16,280,309.0,Ashley Bryant 49 | 2017-04-04T12:26:57,Aliciastad,6,652,-164.0,Rhonda Stone 50 | 2017-11-12T12:36:31,Lisaville,1,2929,250.0,Heather Fisher 51 | 2017-02-16T14:13:28,New Meganland,19,2767,-145.0,Michael Harvey 52 | 2017-07-09T22:58:58,Fosterburgh,6,809,755.0,Lauren Nicholson 53 | 2017-04-18T20:30:45,Port Matthew,15,725,-455.0,Tracey Martin 54 | 
2017-01-24T15:08:04,West Petermouth,14,1276,127.0,Robert Taylor 55 | 2017-05-29T03:45:05,New Amy,3,212,857.0,John Chambers 56 | 2017-09-25T01:21:17,Beardshire,18,1243,1027.0,Amanda Haynes 57 | 2017-02-01T15:09:03,South Colinport,8,464,229.0,Taylor Martin MD 58 | 2017-02-08T12:10:55,West Natasha,3,278,-154.0,Charles Fletcher 59 | 2017-06-25T05:55:05,Port Shannon,12,2351,648.0,Dawn Schmidt 60 | 2017-06-11T01:49:43,Lauraborough,6,2998,1201.0,Calvin Soto 61 | 2017-04-06T17:27:07,East Curtis,17,443,112.0,Elizabeth Barber 62 | 2017-06-04T00:47:04,Jonesland,6,1774,1053.0,Christopher Young 63 | 2017-02-26T09:17:33,East Katie,16,1137,-259.0,Gregory James 64 | 2017-12-09T03:42:59,North Spencer,10,539,1068.0,Johnathan Brown 65 | 2017-07-04T06:58:00,Brockfort,8,2888,14.0,Miranda Gibbs 66 | 2017-02-24T06:16:52,Katherinefurt,14,488,94.0,Jason Robinson 67 | 2017-12-03T06:41:47,Robinsonshire,8,55,738.0,James Garcia 68 | 2017-05-31T11:45:39,East Joseph,1,1430,156.0,Jennifer Wilson 69 | 2017-06-04T00:47:04,Jonesland,6,1774,1053.0,Christopher Young 70 | 2017-11-19T12:38:24,South Jessemouth,12,1841,1446.0,Martin Ponce DVM 71 | 2017-04-13T12:29:25,Katrinafurt,5,1044,543.0,Kimberly Zhang 72 | 2017-10-26T20:31:47,Jacquelineburgh,19,59,1380.0,Brandi Simmons 73 | 2017-08-19T10:10:45,Sarahland,8,2126,-372.0,David Wallace 74 | 2017-04-07T20:27:28,Lake Brandon,11,406,880.0,Stacy Mcintosh 75 | 2017-12-11T07:43:28,Guerrachester,15,2124,167.0,Nicole Gonzalez 76 | 2017-06-19T12:52:05,East Michelleville,7,1572,262.0,Eugene Summers 77 | 2017-04-21T08:32:37,South Lindamouth,19,1756,942.0,Andre Villa 78 | 2017-03-16T19:20:45,South Ashleyton,3,2323,-4.0,Robert Landry 79 | 2017-05-17T22:42:11,Seanview,8,962,243.0,Jamie Cummings 80 | 2017-09-19T20:18:51,Tammyport,12,2382,819.0,Carrie Levine 81 | 2017-03-21T15:22:56,West Robertmouth,15,1799,262.0,Pamela Salinas 82 | 2017-10-14T13:27:17,Batesshire,6,914,467.0,Donald Smith 83 | 2017-08-13T17:09:07,Markmouth,1,2511,1504.0,Laura Henderson 84 | 
2017-04-14T13:29:37,New Lisachester,13,736,991.0,Michelle Murray 85 | 2017-08-06T15:06:49,Port Peter,13,1646,349.0,Dave Evans 86 | 2017-09-22T13:20:01,Franklinstad,5,1080,-405.0,Amanda Newton 87 | 2017-08-21T14:10:54,Kathrynmouth,6,1563,1024.0,Ms. Brenda Larson DDS 88 | 2017-02-05T01:09:53,Port Amanda,1,421,1330.0,Christopher Joyce 89 | 2017-08-29T06:12:22,South Melissaside,10,1770,1204.0,Amber Steele 90 | 2017-12-21T13:45:36,South Ryan,18,1835,1278.0,Robert Santos 91 | 2017-07-23T03:03:11,Davidstad,13,2689,976.0,Christopher Ortiz 92 | 2017-05-10T22:39:00,Port Matthew,4,170,-47.0,Adam Brandt 93 | 2017-03-26T05:24:30,Smithfurt,8,2072,-238.0,Mr. Micheal Hale DDS 94 | 2017-12-27T12:47:26,Lake Ashleytown,1,2532,1539.0,James Greer 95 | 2017-11-17T05:37:25,Albertfurt,18,534,388.0,Eric Hogan 96 | 2017-01-13T12:04:49,East Randymouth,3,2320,-304.0,Julie Cooper 97 | 2017-09-16T09:17:53,Nicholschester,6,2876,47.0,Timothy Moran 98 | 2017-06-28T19:56:30,Lake Edwardmouth,14,2565,-144.0,Keith Jones 99 | 2017-08-24T21:11:39,Kathrynburgh,10,1344,925.0,Adam Fisher 100 | 2017-04-03T07:26:28,Port Briana,8,1219,799.0,Christian Perez 101 | 2017-09-06T02:14:32,North Johnview,1,1247,-449.0,Alyssa Buck 102 | 2017-06-22T21:54:27,Hardingburgh,1,939,981.0,Karen Flynn 103 | 2017-04-23T17:32:50,Chaveztown,18,161,1551.0,Eric Hunter 104 | 2017-07-16T01:01:02,West Kenneth,2,2063,1096.0,Herbert Morris 105 | 2017-06-07T06:47:48,Lindaville,2,1178,971.0,Monique Hughes 106 | 2017-09-12T23:17:05,West Georgeshire,6,1862,1021.0,Heidi Lutz 107 | 2017-09-23T20:20:53,West Billyfort,5,2508,55.0,Peter Duffy 108 | 2017-11-25T07:39:28,West Meghanburgh,7,1740,368.0,Randall Page 109 | 2017-09-11T16:16:30,East Jesusport,2,1729,396.0,Haley Pitts 110 | 2017-04-20T03:31:41,South Paulabury,2,2963,1028.0,David Ayers 111 | 2017-01-03T05:00:45,Williamburgh,6,1530,1167.0,Gary Lee 112 | 2017-07-08T14:58:53,South James,18,2362,12.0,Jonathan Clark 113 | 2017-10-07T05:25:11,Denisemouth,16,16,535.0,James Pugh 114 | 
2017-05-15T15:40:17,New Phillipton,14,2514,-68.0,Maria Lee 115 | 2017-09-28T07:22:10,Longhaven,8,2368,1527.0,Chad Williams 116 | 2017-04-03T07:26:28,Port Briana,8,1219,799.0,Christian Perez 117 | 2017-05-29T03:45:05,New Amy,3,212,857.0,John Chambers 118 | 2017-12-14T13:44:01,Lake Joanne,3,629,1407.0,Tracy Diaz 119 | 2017-03-03T02:17:55,Johnfurt,12,1370,676.0,Michael Clark 120 | 2017-06-08T15:48:35,Port Ryan,17,1307,615.0,Cory Wilkerson 121 | 2017-10-08T08:25:52,Moraport,3,1827,-358.0,Belinda Vasquez 122 | 2017-12-07T19:42:52,Leachville,4,590,662.0,Amanda Tate 123 | 2017-10-03T18:23:46,Brownmouth,12,700,1155.0,Andrea Crawford 124 | 2017-10-05T01:24:37,Lake Ginaland,7,2695,547.0,Bradley Carrillo 125 | 2017-11-25T07:39:28,West Meghanburgh,7,1740,368.0,Randall Page 126 | 2017-08-26T03:11:48,Salasshire,13,1832,755.0,Shelly Todd 127 | 2017-09-17T18:18:18,Brettland,14,2130,1494.0,Amy Lane 128 | 2017-06-15T07:51:08,East Rebecca,17,1788,512.0,Gina Frey 129 | 2017-11-04T08:34:47,Griffithville,12,214,753.0,Thomas Fuller 130 | 2017-03-30T16:25:18,North Jason,15,2850,-284.0,Jeremy Brewer 131 | 2017-11-20T21:38:54,Johnchester,1,72,1475.0,Nicole Carpenter 132 | 2017-10-26T20:31:47,Jacquelineburgh,19,59,1380.0,Brandi Simmons 133 | 2017-02-21T22:15:38,Robertstown,3,198,1471.0,Cynthia Friedman 134 | 2017-03-07T11:18:42,East Johntown,10,2540,534.0,Nathan Santana 135 | 2017-10-24T08:30:55,Michaelafort,17,2561,1334.0,Mark Soto 136 | 2017-03-12T03:19:30,Kathleenbury,19,884,1439.0,Carl Mills 137 | 2017-10-02T10:23:36,Danielhaven,11,1348,1374.0,Derek Mcgee 138 | 2017-05-09T17:38:53,Chelseashire,6,78,796.0,Jeffery Hughes 139 | 2017-07-25T06:04:29,Jonesstad,6,1966,553.0,Andrew Warren 140 | 2017-02-16T14:13:28,New Meganland,19,2767,-145.0,Michael Harvey 141 | 2017-01-07T20:02:19,Caldwellbury,14,771,-108.0,Michaela Stewart 142 | 2017-08-12T09:08:14,South Donaldchester,19,2305,357.0,Chelsea Brown 143 | 2017-12-04T09:41:58,Port Brandyside,2,2627,-433.0,Deborah Reed DDS 144 | 
2017-07-11T07:59:22,Sarahton,5,1500,774.0,Kimberly Price 145 | 2017-11-27T11:40:07,Millerview,19,1576,722.0,James Martinez 146 | 2017-01-16T20:06:09,Olsenville,18,2309,729.0,Nicole Anderson 147 | 2017-02-15T12:12:25,Caseyside,10,2624,372.0,Alicia Valenzuela 148 | 2017-10-17T06:28:08,New Natashastad,12,1625,1071.0,James Gordon 149 | 2017-03-19T06:22:33,Randyborough,7,248,1417.0,Eric Montgomery 150 | 2017-12-13T11:43:31,Lisamouth,15,1748,1266.0,Rebecca Baxter 151 | 2017-05-16T16:41:11,North Thomasshire,16,1491,33.0,Elizabeth Bishop 152 | 2017-05-16T16:41:11,North Thomasshire,16,1491,33.0,Elizabeth Bishop 153 | 2017-03-07T11:18:42,East Johntown,10,2540,534.0,Nathan Santana 154 | 2017-08-03T00:05:47,Port Michael,9,2724,-179.0,Brandon Lawson 155 | 2017-10-09T15:26:33,Ariasberg,19,444,1299.0,Alicia Ross 156 | 2017-02-19T19:14:56,West Bridgetborough,5,216,1383.0,Jonathan Townsend 157 | 2017-09-09T09:15:33,Hectorton,12,1674,823.0,Jennifer Page 158 | 2017-01-14T15:05:42,New Jamesberg,15,2818,-295.0,Dean Davis 159 | 2017-05-19T06:42:31,West Kimberly,14,2639,1429.0,Steven Miller 160 | 2017-04-26T08:34:21,Port Pamelafurt,9,1662,-397.0,Austin Mclaughlin 161 | 2017-06-01T18:46:26,Ortiztown,13,185,1454.0,Christopher Harris 162 | 2017-11-27T11:40:07,Millerview,19,1576,722.0,James Martinez 163 | 2017-10-10T19:26:37,West Amanda,7,2454,-78.0,Dennis Jones 164 | 2017-04-01T23:26:14,South Jessicaview,3,516,992.0,Summer Nash 165 | 2017-04-23T17:32:50,Chaveztown,18,161,1551.0,Eric Hunter 166 | 2017-01-09T01:03:21,Erikaland,11,1571,-372.0,Mark Taylor 167 | 2017-02-19T19:14:56,West Bridgetborough,5,216,1383.0,Jonathan Townsend 168 | 2017-12-01T03:40:55,Debbieton,4,1781,899.0,Ronald Wood 169 | 2017-08-27T05:12:13,West Debbieberg,11,478,708.0,Connie Wilson 170 | 2017-09-21T04:19:23,Lake Brentfurt,13,1217,1040.0,Kristina Trujillo 171 | 2017-12-18T01:45:08,Lake Summerport,2,2798,-319.0,Jason Anderson 172 | 2017-07-24T04:03:56,North Michael,3,2633,462.0,Kendra Santiago 173 | 
2017-02-23T00:16:30,East Deborahland,9,365,1313.0,Ms. Victoria Ford DDS 174 | 2017-04-06T17:27:07,East Curtis,17,443,112.0,Elizabeth Barber 175 | 2017-03-14T11:19:43,New Jonathan,18,1135,1328.0,Edward Thomas 176 | 2017-02-06T08:10:23,South Neilstad,3,2106,-177.0,Melinda Miranda 177 | 2017-10-05T01:24:37,Lake Ginaland,7,2695,547.0,Bradley Carrillo 178 | 2017-08-17T01:09:50,West Donaldbury,12,1547,980.0,Chelsea Lee 179 | 2017-07-09T22:58:58,Fosterburgh,6,809,755.0,Lauren Nicholson 180 | 2017-10-20T16:29:23,West Cody,15,1755,888.0,James Chen 181 | 2017-01-03T05:00:45,Williamburgh,6,1530,1167.0,Gary Lee 182 | 2017-01-05T10:01:27,Ibarraberg,1,2744,258.0,Daniel Davis 183 | 2017-01-06T12:01:48,Sarachester,2,1908,266.0,Michael Roth 184 | 2017-11-14T20:37:17,New Tashabury,2,668,141.0,Antonio Jackson 185 | 2017-08-30T13:12:50,Kellerside,5,5,528.0,Thomas Baker 186 | 2017-04-16T15:29:58,New Melanie,7,1717,1579.0,Eric Morrow 187 | 2017-06-26T11:55:51,Munozberg,14,1827,-213.0,John Glover 188 | 2017-07-21T19:02:35,Bruceton,12,1036,1570.0,Diane Pearson 189 | 2017-02-02T19:09:06,Valeriebury,2,2911,747.0,Richard Wong 190 | 2017-02-28T17:17:41,Angelabury,4,1389,1097.0,Carl Bailey 191 | 2017-03-28T12:24:53,South Jeffreytown,10,254,542.0,Linda Mclaughlin 192 | 2017-06-15T07:51:08,East Rebecca,17,1788,512.0,Gina Frey 193 | 2017-08-30T13:12:50,Kellerside,5,5,528.0,Thomas Baker 194 | 2017-08-14T23:09:18,South Margaretstad,3,929,-91.0,Stephen Palmer 195 | 2017-06-20T13:52:59,Bairdmouth,11,2100,414.0,Brandon Salinas 196 | 2017-09-03T19:14:07,East Zacharychester,1,2205,1512.0,Kelsey Gardner 197 | 2017-06-27T12:56:26,Zacharyborough,9,936,964.0,Marc Barnett 198 | 2017-01-18T21:06:34,Port Grace,9,29,457.0,Andrew Robles 199 | 2017-04-12T07:29:18,South Jason,17,2387,-304.0,Sarah Sandoval 200 | 2017-09-30T08:23:07,Alexanderport,19,1602,-462.0,Carlos Vasquez 201 | 2017-11-02T02:33:30,South Josephfurt,16,2918,741.0,Kayla Sanchez 202 | 2017-03-14T11:19:43,New Jonathan,18,1135,1328.0,Edward Thomas 203 
| 2017-10-31T01:32:39,West Johnfort,15,1623,107.0,Michael Trujillo 204 | 2017-06-13T05:50:27,South Vincenthaven,1,1020,692.0,Lori Johnson 205 | 2017-07-13T16:00:25,South Nancyview,7,681,-310.0,Lance Hurst 206 | 2017-04-25T02:33:29,Timothyside,15,1096,757.0,Lisa James 207 | 2017-11-03T07:33:55,North Nicholas,11,2117,-304.0,Stacy Williams 208 | 2017-12-06T16:42:30,Nguyenton,7,1929,-147.0,Katelyn Barber 209 | 2017-01-30T06:09:00,Phillipsshire,16,1066,-432.0,Kelsey Collins 210 | 2017-03-22T22:23:37,Cohenshire,1,339,1359.0,Wayne Peterson 211 | 2017-11-23T00:38:57,Markmouth,8,1260,938.0,Kevin Mosley 212 | 2017-10-28T21:32:25,Wigginshaven,14,2574,1087.0,Jeffery Rodriguez 213 | 2017-12-19T08:45:21,Port Carrieburgh,16,2144,1307.0,Kyle Robbins 214 | 2017-03-18T02:21:43,West Sean,4,2327,1581.0,Denise Ross 215 | 2017-07-01T21:57:36,Harrisport,17,698,1113.0,Donna Ibarra 216 | 2017-10-25T16:31:29,North Justin,2,1786,412.0,Ray Miller 217 | 2017-05-13T07:39:35,Lake Richardstad,6,2330,1274.0,Christopher Randall 218 | 2017-05-05T08:37:44,Gonzalesburgh,3,196,1482.0,Kevin Madden 219 | 2017-01-22T10:07:58,East Kirstenbury,11,513,810.0,Jeremiah Thompson 220 | 2017-07-27T12:04:38,Justinside,18,2386,1490.0,Stephen Shaffer 221 | 2017-05-26T20:44:23,Elizabethview,13,1375,16.0,Joseph Barron 222 | 2017-03-09T20:19:23,North Adam,4,457,886.0,Robert Wilson 223 | 2017-03-05T08:18:17,Port Matthewton,4,449,1427.0,Martin Chang 224 | 2017-08-10T02:07:19,Lake Kimberlyport,16,2170,553.0,Dana Wilson 225 | 2017-07-31T22:05:23,Heathershire,7,1941,650.0,Vanessa Burke 226 | 2017-07-19T11:02:03,West Alex,10,1168,111.0,Barbara Savage 227 | 2018-01-02T04:48:48,Jeffreyside,17,81,173.0,Frank Hoffman 228 | 2017-05-22T17:44:00,Port Dariusshire,18,1878,880.0,Mark Fowler 229 | 2017-12-25T06:46:44,North Jeffrey,4,1806,391.0,James Davis 230 | 2017-10-07T05:25:11,Denisemouth,16,16,535.0,James Pugh 231 | 2017-04-27T14:34:59,New Tylermouth,7,1727,203.0,Kenneth Elliott 232 | 2017-12-22T16:45:54,West 
Vincentmouth,4,887,-132.0,Miguel Hunter 233 | 2017-08-17T01:09:50,West Donaldbury,12,1547,980.0,Chelsea Lee 234 | 2017-05-03T02:37:14,Port Monicaville,2,718,-400.0,Courtney Ward 235 | 2017-10-20T16:29:23,West Cody,15,1755,888.0,James Chen 236 | 2017-05-20T08:43:00,Lake Sandrafurt,15,2557,320.0,Shawn James 237 | 2017-05-15T15:40:17,New Phillipton,14,2514,-68.0,Maria Lee 238 | 2017-06-21T16:54:02,Carolhaven,11,1544,171.0,Heather Johnson 239 | 2017-01-11T03:04:11,Ponceview,19,1006,-399.0,Douglas Peters 240 | 2017-06-01T18:46:26,Ortiztown,13,185,1454.0,Christopher Harris 241 | 2017-08-23T17:11:08,Port Jacobborough,3,652,452.0,Jodi Watson 242 | 2017-04-10T01:28:14,Samanthashire,9,2293,-184.0,Jason Harper 243 | 2017-02-12T02:11:21,Stephaniemouth,5,277,-130.0,Carol Morrison 244 | 2017-03-13T08:19:35,Hughesview,14,280,-241.0,Brenda Mcbride 245 | 2017-09-01T15:13:19,North Karenton,7,549,567.0,Crystal Jennings 246 | 2017-04-29T18:35:32,Port Emilyfurt,12,87,1301.0,Mrs. Nicole Huang 247 | -------------------------------------------------------------------------------- /data/sales_data_duped_with_nulls.csv: -------------------------------------------------------------------------------- 1 | timestamp,city,store_id,sale_number,sale_amount,associate 2 | 2017-09-15T06:17:10,Alexandrabury,18,1043.0,15.0,Stacey Daniels 3 | 2017-09-11T16:16:30,East Jesusport,2,1729.0,396.0,Haley Pitts 4 | 2017-07-12T15:00:18,New Douglasmouth,13,2028.0,-78.0,Carlos French 5 | 2017-07-29T13:04:55,West Carriemouth,19,1245.0,1149.0,Jeffrey Ford 6 | 2017-11-07T21:35:58,Port Timothy,1,2365.0,724.0,Christopher West 7 | 2017-05-24T18:44:03,Katherinestad,18,2764.0,119.0,Jennifer Lee 8 | 2017-08-07T21:07:08,Lake Richard,4,2574.0,1387.0,Kayla Mercado 9 | 2017-09-26T03:22:02,Port Mackenziemouth,5,2155.0,1599.0, 10 | 2017-10-18T07:28:37,North Linda,6,893.0,16.0,Johnny Turner 11 | 2017-02-14T03:12:13,New Timothy,17,2986.0,1487.0,Tony Lynch 12 | 2017-04-08T22:27:45,North Kyle,5,2944.0,537.0,Ashley Taylor 13 | 
2017-08-05T07:06:20,North Georgemouth,1,1180.0,,George Andrade 14 | 2017-06-17T09:52:00,Port Coryborough,3,1455.0,958.0,Laurie Carr DVM 15 | 2017-01-20T04:07:32,New Katherineville,8,1654.0,-89.0,Patricia Peterson 16 | 2017-12-05T12:42:27,Tinafort,10,1324.0,,George Collins 17 | 2017-06-11T01:49:43,Lauraborough,6,,, 18 | 2017-05-01T00:36:34,Janiceview,1,820.0,443.0,Linda Snyder 19 | 2017-02-09T20:11:03,Beniteztown,3,1186.0,627.0,Mary Luna 20 | 2017-07-17T04:01:26,Port Bob,18,,926.0,James Turner 21 | 2017-12-29T18:47:30,Nelsonfort,7,1781.0,146.0, 22 | 2017-07-23T03:03:11,Davidstad,13,,976.0,Christopher Ortiz 23 | 2017-07-16T01:01:02,West Kenneth,2,2063.0,1096.0, 24 | 2017-11-10T04:36:04,New Emma,13,,56.0,Christopher Castillo 25 | 2017-02-18T16:14:19,Riceport,9,,53.0, 26 | 2017-12-29T18:47:30,Nelsonfort,7,1781.0,146.0,Jennifer Mcdonald 27 | 2017-10-15T21:27:21,Jamesburgh,4,2102.0,-358.0,James Thompson 28 | 2017-12-31T23:48:08,West Michaelbury,18,366.0,-167.0, 29 | 2017-06-09T22:48:39,Danielsmouth,10,,1134.0, 30 | 2017-11-05T14:35:35,Timchester,6,2292.0,-9.0,Michael Johnson 31 | 2017-07-06T13:58:38,West Kimberly,17,1467.0,,Christina Randolph 32 | 2017-12-15T17:44:05,East Jennifer,15,2623.0,1124.0,Ryan Morrow 33 | 2017-05-06T10:37:50,Lake Chelseatown,11,2941.0,-321.0,Don Garza 34 | 2017-10-09T15:26:33,Ariasberg,19,444.0,1299.0,Alicia Ross 35 | 2017-03-25T00:24:21,New Sydney,9,,1529.0,Brad Ray 36 | 2017-05-07T11:38:16,Faulknerstad,15,,1281.0, 37 | 2017-09-08T05:15:14,West Michael,11,2890.0,264.0,Christine Oneal 38 | 2017-06-29T20:56:50,South Stevenview,3,2683.0,-265.0,Mariah Wright 39 | 2017-01-28T02:08:50,West Shannon,2,,477.0, 40 | 2017-08-06T15:06:49,Port Peter,13,1646.0,349.0,Dave Evans 41 | 2017-11-29T20:40:50,Rebeccaview,12,,1100.0,Michael Mcintyre 42 | 2017-06-06T05:47:26,New Jason,16,280.0,309.0,Ashley Bryant 43 | 2017-10-12T04:26:44,Romanside,11,1965.0,1310.0, 44 | 2017-03-18T02:21:43,West Sean,4,,1581.0,Denise Ross 45 | 
2017-10-23T01:29:59,Karaville,11,1569.0,-391.0,Christina Khan 46 | 2017-12-24T00:46:11,Dianabury,14,2697.0,-20.0,Nathan White 47 | 2017-01-25T17:08:42,North Stanleybury,11,1437.0,,Lawrence Norton 48 | 2017-06-06T05:47:26,New Jason,16,280.0,309.0,Ashley Bryant 49 | 2017-04-04T12:26:57,Aliciastad,6,652.0,-164.0,Rhonda Stone 50 | 2017-11-12T12:36:31,Lisaville,1,,250.0, 51 | 2017-02-16T14:13:28,New Meganland,19,,-145.0,Michael Harvey 52 | 2017-07-09T22:58:58,Fosterburgh,6,809.0,, 53 | 2017-04-18T20:30:45,Port Matthew,15,725.0,-455.0,Tracey Martin 54 | 2017-01-24T15:08:04,West Petermouth,14,1276.0,127.0,Robert Taylor 55 | 2017-05-29T03:45:05,New Amy,3,212.0,857.0,John Chambers 56 | 2017-09-25T01:21:17,Beardshire,18,1243.0,1027.0,Amanda Haynes 57 | 2017-02-01T15:09:03,South Colinport,8,,229.0,Taylor Martin MD 58 | 2017-02-08T12:10:55,West Natasha,3,278.0,-154.0,Charles Fletcher 59 | 2017-06-25T05:55:05,Port Shannon,12,2351.0,648.0,Dawn Schmidt 60 | 2017-06-11T01:49:43,Lauraborough,6,,1201.0,Calvin Soto 61 | 2017-04-06T17:27:07,East Curtis,17,443.0,112.0,Elizabeth Barber 62 | 2017-06-04T00:47:04,Jonesland,6,1774.0,1053.0,Christopher Young 63 | 2017-02-26T09:17:33,East Katie,16,1137.0,-259.0, 64 | 2017-12-09T03:42:59,North Spencer,10,,1068.0,Johnathan Brown 65 | 2017-07-04T06:58:00,Brockfort,8,,,Miranda Gibbs 66 | 2017-02-24T06:16:52,Katherinefurt,14,,,Jason Robinson 67 | 2017-12-03T06:41:47,Robinsonshire,8,55.0,738.0,James Garcia 68 | 2017-05-31T11:45:39,East Joseph,1,,156.0, 69 | 2017-06-04T00:47:04,Jonesland,6,1774.0,,Christopher Young 70 | 2017-11-19T12:38:24,South Jessemouth,12,1841.0,1446.0, 71 | 2017-04-13T12:29:25,Katrinafurt,5,,543.0, 72 | 2017-10-26T20:31:47,Jacquelineburgh,19,59.0,1380.0,Brandi Simmons 73 | 2017-08-19T10:10:45,Sarahland,8,2126.0,,David Wallace 74 | 2017-04-07T20:27:28,Lake Brandon,11,406.0,880.0,Stacy Mcintosh 75 | 2017-12-11T07:43:28,Guerrachester,15,2124.0,167.0,Nicole Gonzalez 76 | 2017-06-19T12:52:05,East Michelleville,7,1572.0,262.0,Eugene 
Summers 77 | 2017-04-21T08:32:37,South Lindamouth,19,1756.0,,Andre Villa 78 | 2017-03-16T19:20:45,South Ashleyton,3,2323.0,-4.0,Robert Landry 79 | 2017-05-17T22:42:11,Seanview,8,962.0,243.0,Jamie Cummings 80 | 2017-09-19T20:18:51,Tammyport,12,,819.0,Carrie Levine 81 | 2017-03-21T15:22:56,West Robertmouth,15,1799.0,262.0,Pamela Salinas 82 | 2017-10-14T13:27:17,Batesshire,6,914.0,467.0,Donald Smith 83 | 2017-08-13T17:09:07,Markmouth,1,2511.0,1504.0,Laura Henderson 84 | 2017-04-14T13:29:37,New Lisachester,13,,991.0,Michelle Murray 85 | 2017-08-06T15:06:49,Port Peter,13,1646.0,349.0,Dave Evans 86 | 2017-09-22T13:20:01,Franklinstad,5,1080.0,, 87 | 2017-08-21T14:10:54,Kathrynmouth,6,,1024.0, 88 | 2017-02-05T01:09:53,Port Amanda,1,421.0,1330.0,Christopher Joyce 89 | 2017-08-29T06:12:22,South Melissaside,10,1770.0,1204.0, 90 | 2017-12-21T13:45:36,South Ryan,18,1835.0,1278.0,Robert Santos 91 | 2017-07-23T03:03:11,Davidstad,13,2689.0,976.0,Christopher Ortiz 92 | 2017-05-10T22:39:00,Port Matthew,4,170.0,-47.0,Adam Brandt 93 | 2017-03-26T05:24:30,Smithfurt,8,2072.0,-238.0,Mr. 
Micheal Hale DDS 94 | 2017-12-27T12:47:26,Lake Ashleytown,1,2532.0,1539.0,James Greer 95 | 2017-11-17T05:37:25,Albertfurt,18,534.0,388.0, 96 | 2017-01-13T12:04:49,East Randymouth,3,2320.0,-304.0, 97 | 2017-09-16T09:17:53,Nicholschester,6,2876.0,47.0, 98 | 2017-06-28T19:56:30,Lake Edwardmouth,14,2565.0,,Keith Jones 99 | 2017-08-24T21:11:39,Kathrynburgh,10,1344.0,,Adam Fisher 100 | 2017-04-03T07:26:28,Port Briana,8,1219.0,799.0,Christian Perez 101 | 2017-09-06T02:14:32,North Johnview,1,1247.0,,Alyssa Buck 102 | 2017-06-22T21:54:27,Hardingburgh,1,939.0,981.0, 103 | 2017-04-23T17:32:50,Chaveztown,18,161.0,1551.0,Eric Hunter 104 | 2017-07-16T01:01:02,West Kenneth,2,,1096.0,Herbert Morris 105 | 2017-06-07T06:47:48,Lindaville,2,1178.0,971.0,Monique Hughes 106 | 2017-09-12T23:17:05,West Georgeshire,6,1862.0,1021.0,Heidi Lutz 107 | 2017-09-23T20:20:53,West Billyfort,5,2508.0,,Peter Duffy 108 | 2017-11-25T07:39:28,West Meghanburgh,7,,368.0, 109 | 2017-09-11T16:16:30,East Jesusport,2,1729.0,396.0,Haley Pitts 110 | 2017-04-20T03:31:41,South Paulabury,2,2963.0,1028.0,David Ayers 111 | 2017-01-03T05:00:45,Williamburgh,6,1530.0,1167.0,Gary Lee 112 | 2017-07-08T14:58:53,South James,18,2362.0,,Jonathan Clark 113 | 2017-10-07T05:25:11,Denisemouth,16,16.0,535.0,James Pugh 114 | 2017-05-15T15:40:17,New Phillipton,14,2514.0,-68.0, 115 | 2017-09-28T07:22:10,Longhaven,8,2368.0,1527.0, 116 | 2017-04-03T07:26:28,Port Briana,8,,799.0,Christian Perez 117 | 2017-05-29T03:45:05,New Amy,3,,, 118 | 2017-12-14T13:44:01,Lake Joanne,3,629.0,1407.0,Tracy Diaz 119 | 2017-03-03T02:17:55,Johnfurt,12,1370.0,676.0,Michael Clark 120 | 2017-06-08T15:48:35,Port Ryan,17,,615.0,Cory Wilkerson 121 | 2017-10-08T08:25:52,Moraport,3,1827.0,-358.0,Belinda Vasquez 122 | 2017-12-07T19:42:52,Leachville,4,590.0,662.0,Amanda Tate 123 | 2017-10-03T18:23:46,Brownmouth,12,700.0,1155.0,Andrea Crawford 124 | 2017-10-05T01:24:37,Lake Ginaland,7,2695.0,547.0,Bradley Carrillo 125 | 2017-11-25T07:39:28,West 
Meghanburgh,7,1740.0,368.0, 126 | 2017-08-26T03:11:48,Salasshire,13,1832.0,755.0,Shelly Todd 127 | 2017-09-17T18:18:18,Brettland,14,,,Amy Lane 128 | 2017-06-15T07:51:08,East Rebecca,17,1788.0,, 129 | 2017-11-04T08:34:47,Griffithville,12,214.0,753.0,Thomas Fuller 130 | 2017-03-30T16:25:18,North Jason,15,2850.0,-284.0, 131 | 2017-11-20T21:38:54,Johnchester,1,,1475.0,Nicole Carpenter 132 | 2017-10-26T20:31:47,Jacquelineburgh,19,59.0,1380.0,Brandi Simmons 133 | 2017-02-21T22:15:38,Robertstown,3,198.0,, 134 | 2017-03-07T11:18:42,East Johntown,10,2540.0,534.0,Nathan Santana 135 | 2017-10-24T08:30:55,Michaelafort,17,2561.0,,Mark Soto 136 | 2017-03-12T03:19:30,Kathleenbury,19,884.0,1439.0,Carl Mills 137 | 2017-10-02T10:23:36,Danielhaven,11,1348.0,,Derek Mcgee 138 | 2017-05-09T17:38:53,Chelseashire,6,78.0,796.0,Jeffery Hughes 139 | 2017-07-25T06:04:29,Jonesstad,6,1966.0,553.0,Andrew Warren 140 | 2017-02-16T14:13:28,New Meganland,19,2767.0,,Michael Harvey 141 | 2017-01-07T20:02:19,Caldwellbury,14,771.0,-108.0,Michaela Stewart 142 | 2017-08-12T09:08:14,South Donaldchester,19,2305.0,,Chelsea Brown 143 | 2017-12-04T09:41:58,Port Brandyside,2,2627.0,-433.0,Deborah Reed DDS 144 | 2017-07-11T07:59:22,Sarahton,5,1500.0,774.0,Kimberly Price 145 | 2017-11-27T11:40:07,Millerview,19,1576.0,722.0, 146 | 2017-01-16T20:06:09,Olsenville,18,,729.0,Nicole Anderson 147 | 2017-02-15T12:12:25,Caseyside,10,,372.0,Alicia Valenzuela 148 | 2017-10-17T06:28:08,New Natashastad,12,1625.0,,James Gordon 149 | 2017-03-19T06:22:33,Randyborough,7,248.0,1417.0,Eric Montgomery 150 | 2017-12-13T11:43:31,Lisamouth,15,1748.0,1266.0,Rebecca Baxter 151 | 2017-05-16T16:41:11,North Thomasshire,16,1491.0,33.0,Elizabeth Bishop 152 | 2017-05-16T16:41:11,North Thomasshire,16,,33.0,Elizabeth Bishop 153 | 2017-03-07T11:18:42,East Johntown,10,2540.0,534.0,Nathan Santana 154 | 2017-08-03T00:05:47,Port Michael,9,,, 155 | 2017-10-09T15:26:33,Ariasberg,19,444.0,,Alicia Ross 156 | 2017-02-19T19:14:56,West 
Bridgetborough,5,216.0,1383.0,Jonathan Townsend 157 | 2017-09-09T09:15:33,Hectorton,12,1674.0,823.0,Jennifer Page 158 | 2017-01-14T15:05:42,New Jamesberg,15,2818.0,-295.0,Dean Davis 159 | 2017-05-19T06:42:31,West Kimberly,14,2639.0,1429.0,Steven Miller 160 | 2017-04-26T08:34:21,Port Pamelafurt,9,1662.0,-397.0, 161 | 2017-06-01T18:46:26,Ortiztown,13,185.0,1454.0,Christopher Harris 162 | 2017-11-27T11:40:07,Millerview,19,1576.0,, 163 | 2017-10-10T19:26:37,West Amanda,7,2454.0,,Dennis Jones 164 | 2017-04-01T23:26:14,South Jessicaview,3,,992.0,Summer Nash 165 | 2017-04-23T17:32:50,Chaveztown,18,161.0,,Eric Hunter 166 | 2017-01-09T01:03:21,Erikaland,11,1571.0,,Mark Taylor 167 | 2017-02-19T19:14:56,West Bridgetborough,5,216.0,1383.0,Jonathan Townsend 168 | 2017-12-01T03:40:55,Debbieton,4,1781.0,899.0,Ronald Wood 169 | 2017-08-27T05:12:13,West Debbieberg,11,,708.0,Connie Wilson 170 | 2017-09-21T04:19:23,Lake Brentfurt,13,1217.0,,Kristina Trujillo 171 | 2017-12-18T01:45:08,Lake Summerport,2,2798.0,-319.0, 172 | 2017-07-24T04:03:56,North Michael,3,2633.0,462.0,Kendra Santiago 173 | 2017-02-23T00:16:30,East Deborahland,9,365.0,1313.0,Ms. 
Victoria Ford DDS 174 | 2017-04-06T17:27:07,East Curtis,17,443.0,112.0,Elizabeth Barber 175 | 2017-03-14T11:19:43,New Jonathan,18,1135.0,1328.0,Edward Thomas 176 | 2017-02-06T08:10:23,South Neilstad,3,2106.0,-177.0,Melinda Miranda 177 | 2017-10-05T01:24:37,Lake Ginaland,7,2695.0,547.0,Bradley Carrillo 178 | 2017-08-17T01:09:50,West Donaldbury,12,1547.0,980.0,Chelsea Lee 179 | 2017-07-09T22:58:58,Fosterburgh,6,809.0,,Lauren Nicholson 180 | 2017-10-20T16:29:23,West Cody,15,1755.0,888.0,James Chen 181 | 2017-01-03T05:00:45,Williamburgh,6,1530.0,1167.0,Gary Lee 182 | 2017-01-05T10:01:27,Ibarraberg,1,,258.0,Daniel Davis 183 | 2017-01-06T12:01:48,Sarachester,2,1908.0,,Michael Roth 184 | 2017-11-14T20:37:17,New Tashabury,2,668.0,141.0,Antonio Jackson 185 | 2017-08-30T13:12:50,Kellerside,5,5.0,528.0,Thomas Baker 186 | 2017-04-16T15:29:58,New Melanie,7,1717.0,1579.0,Eric Morrow 187 | 2017-06-26T11:55:51,Munozberg,14,1827.0,,John Glover 188 | 2017-07-21T19:02:35,Bruceton,12,1036.0,1570.0,Diane Pearson 189 | 2017-02-02T19:09:06,Valeriebury,2,2911.0,747.0,Richard Wong 190 | 2017-02-28T17:17:41,Angelabury,4,1389.0,1097.0,Carl Bailey 191 | 2017-03-28T12:24:53,South Jeffreytown,10,254.0,542.0, 192 | 2017-06-15T07:51:08,East Rebecca,17,1788.0,512.0,Gina Frey 193 | 2017-08-30T13:12:50,Kellerside,5,5.0,528.0,Thomas Baker 194 | 2017-08-14T23:09:18,South Margaretstad,3,929.0,-91.0,Stephen Palmer 195 | 2017-06-20T13:52:59,Bairdmouth,11,2100.0,414.0,Brandon Salinas 196 | 2017-09-03T19:14:07,East Zacharychester,1,2205.0,,Kelsey Gardner 197 | 2017-06-27T12:56:26,Zacharyborough,9,936.0,964.0,Marc Barnett 198 | 2017-01-18T21:06:34,Port Grace,9,29.0,457.0, 199 | 2017-04-12T07:29:18,South Jason,17,2387.0,-304.0,Sarah Sandoval 200 | 2017-09-30T08:23:07,Alexanderport,19,,-462.0,Carlos Vasquez 201 | 2017-11-02T02:33:30,South Josephfurt,16,2918.0,741.0, 202 | 2017-03-14T11:19:43,New Jonathan,18,1135.0,, 203 | 2017-10-31T01:32:39,West Johnfort,15,1623.0,107.0,Michael Trujillo 204 | 
2017-06-13T05:50:27,South Vincenthaven,1,1020.0,692.0,Lori Johnson 205 | 2017-07-13T16:00:25,South Nancyview,7,681.0,-310.0,Lance Hurst 206 | 2017-04-25T02:33:29,Timothyside,15,1096.0,757.0, 207 | 2017-11-03T07:33:55,North Nicholas,11,2117.0,-304.0,Stacy Williams 208 | 2017-12-06T16:42:30,Nguyenton,7,,-147.0,Katelyn Barber 209 | 2017-01-30T06:09:00,Phillipsshire,16,1066.0,,Kelsey Collins 210 | 2017-03-22T22:23:37,Cohenshire,1,,1359.0,Wayne Peterson 211 | 2017-11-23T00:38:57,Markmouth,8,1260.0,938.0,Kevin Mosley 212 | 2017-10-28T21:32:25,Wigginshaven,14,2574.0,1087.0,Jeffery Rodriguez 213 | 2017-12-19T08:45:21,Port Carrieburgh,16,,1307.0,Kyle Robbins 214 | 2017-03-18T02:21:43,West Sean,4,2327.0,1581.0,Denise Ross 215 | 2017-07-01T21:57:36,Harrisport,17,698.0,1113.0,Donna Ibarra 216 | 2017-10-25T16:31:29,North Justin,2,,412.0,Ray Miller 217 | 2017-05-13T07:39:35,Lake Richardstad,6,2330.0,1274.0,Christopher Randall 218 | 2017-05-05T08:37:44,Gonzalesburgh,3,196.0,1482.0,Kevin Madden 219 | 2017-01-22T10:07:58,East Kirstenbury,11,513.0,810.0,Jeremiah Thompson 220 | 2017-07-27T12:04:38,Justinside,18,2386.0,1490.0,Stephen Shaffer 221 | 2017-05-26T20:44:23,Elizabethview,13,1375.0,16.0,Joseph Barron 222 | 2017-03-09T20:19:23,North Adam,4,457.0,886.0,Robert Wilson 223 | 2017-03-05T08:18:17,Port Matthewton,4,449.0,1427.0,Martin Chang 224 | 2017-08-10T02:07:19,Lake Kimberlyport,16,2170.0,553.0,Dana Wilson 225 | 2017-07-31T22:05:23,Heathershire,7,1941.0,650.0,Vanessa Burke 226 | 2017-07-19T11:02:03,West Alex,10,1168.0,111.0,Barbara Savage 227 | 2018-01-02T04:48:48,Jeffreyside,17,81.0,173.0,Frank Hoffman 228 | 2017-05-22T17:44:00,Port Dariusshire,18,1878.0,880.0,Mark Fowler 229 | 2017-12-25T06:46:44,North Jeffrey,4,1806.0,391.0,James Davis 230 | 2017-10-07T05:25:11,Denisemouth,16,16.0,535.0,James Pugh 231 | 2017-04-27T14:34:59,New Tylermouth,7,1727.0,203.0,Kenneth Elliott 232 | 2017-12-22T16:45:54,West Vincentmouth,4,887.0,-132.0,Miguel Hunter 233 | 2017-08-17T01:09:50,West 
Donaldbury,12,1547.0,980.0,Chelsea Lee 234 | 2017-05-03T02:37:14,Port Monicaville,2,718.0,-400.0,Courtney Ward 235 | 2017-10-20T16:29:23,West Cody,15,1755.0,888.0, 236 | 2017-05-20T08:43:00,Lake Sandrafurt,15,2557.0,320.0,Shawn James 237 | 2017-05-15T15:40:17,New Phillipton,14,2514.0,-68.0, 238 | 2017-06-21T16:54:02,Carolhaven,11,,,Heather Johnson 239 | 2017-01-11T03:04:11,Ponceview,19,1006.0,-399.0,Douglas Peters 240 | 2017-06-01T18:46:26,Ortiztown,13,,1454.0,Christopher Harris 241 | 2017-08-23T17:11:08,Port Jacobborough,3,652.0,452.0,Jodi Watson 242 | 2017-04-10T01:28:14,Samanthashire,9,2293.0,-184.0,Jason Harper 243 | 2017-02-12T02:11:21,Stephaniemouth,5,,-130.0, 244 | 2017-03-13T08:19:35,Hughesview,14,280.0,-241.0,Brenda Mcbride 245 | 2017-09-01T15:13:19,North Karenton,7,549.0,567.0,Crystal Jennings 246 | 2017-04-29T18:35:32,Port Emilyfurt,12,87.0,1301.0,Mrs. Nicole Huang 247 | -------------------------------------------------------------------------------- /data/sales_data_with_nulls.csv: -------------------------------------------------------------------------------- 1 | timestamp,city,store_id,sale_number,sale_amount,associate 2 | 2017-02-19T17:00:00,Stephanieport,11,2162.0,247.0,Jenna White 3 | 2017-02-19T22:00:00,Gutierreztown,11,,1586.0,Laura Massey 4 | 2017-02-20T01:00:00,Colemanside,3,2858.0,631.0,Jacqueline Benson 5 | 2017-02-20T08:00:00,,1,1080.0,-161.0,Tina Martin 6 | 2017-02-20T13:00:00,,1,358.0,1414.0,David Khan 7 | 2017-02-20T21:00:00,South Jackie,7,2560.0,869.0,Erika Townsend 8 | 2017-02-21T03:00:00,Amandafurt,5,695.0,1273.0,Anne Wells 9 | 2017-02-21T10:00:00,East Lisa,9,1911.0,1584.0,Linda Atkinson 10 | 2017-02-21T11:00:00,Lake Jason,3,999.0,579.0,Shannon House 11 | 2017-02-21T13:00:00,Williambury,13,43.0,1399.0,Thomas Martin 12 | 2017-02-21T15:00:00,Stevenhaven,11,1425.0,1537.0,Terry Larsen 13 | 2017-02-21T21:00:00,Brandyberg,12,789.0,1461.0,Beth Christensen 14 | 2017-02-22T05:00:00,Port Seanborough,10,1689.0,1192.0,Amanda Palmer 15 | 
2017-02-22T08:00:00,Jacquelineview,7,41.0,1171.0,Joyce Terry 16 | 2017-02-22T10:00:00,Mendezside,14,1571.0,282.0,Jeffrey Bush 17 | 2017-02-22T16:00:00,West Sherry,7,,1027.0,Marc Butler 18 | 2017-02-22T18:00:00,West Deanna,5,344.0,-104.0, 19 | 2017-02-22T19:00:00,East Carol,16,1773.0,748.0,Felicia Warren 20 | 2017-02-23T01:00:00,Cruzport,17,,434.0,Marissa Rodriguez 21 | 2017-02-23T05:00:00,Timothyhaven,5,1190.0,1213.0,Nicholas Carrillo 22 | 2017-02-23T08:00:00,East Henryland,3,2444.0,-338.0,Mckenzie Martinez 23 | 2017-02-23T13:00:00,North Saraview,12,2197.0,386.0,Steven Walker 24 | 2017-02-23T17:00:00,Beckland,10,1301.0,161.0,Megan Bonilla 25 | 2017-02-23T21:00:00,Lake Justinfort,7,2474.0,1093.0, 26 | 2017-02-24T06:00:00,Torresstad,17,2350.0,-207.0,Stephanie Perez 27 | 2017-02-24T07:00:00,North Shane,8,1337.0,1481.0,Aaron Mcmahon 28 | 2017-02-24T13:00:00,,7,1799.0,1562.0,Meghan Walker 29 | 2017-02-24T20:00:00,,7,2190.0,-45.0,John Blake 30 | 2017-02-25T01:00:00,Lake Kennethfort,15,2500.0,1434.0,Michele Flynn 31 | 2017-02-25T04:00:00,Gonzalestown,15,2004.0,1104.0,Jose Powell 32 | 2017-02-25T07:00:00,East Dean,16,2394.0,314.0,Hector Watts 33 | 2017-02-25T15:00:00,East Jennifer,1,1593.0,1516.0,Victoria Lindsey 34 | 2017-02-25T20:00:00,Terryside,3,446.0,525.0,Mallory Henson 35 | 2017-02-26T02:00:00,Paulhaven,12,1358.0,853.0,Michael Delacruz 36 | 2017-02-26T06:00:00,New Amandatown,18,2327.0,167.0,James Thomas 37 | 2017-02-26T11:00:00,,11,631.0,229.0,Holly Rhodes 38 | 2017-02-26T20:00:00,New James,15,786.0,1145.0,David Pineda 39 | 2017-02-27T05:00:00,New Susanbury,19,765.0,244.0,Robert Moore 40 | 2017-02-27T13:00:00,Port Tinaport,1,1393.0,134.0,Mr. 
Joshua Benson 41 | 2017-02-27T18:00:00,Lake Ericafort,12,2633.0,1062.0,Benjamin Quinn 42 | 2017-02-27T23:00:00,Nancyhaven,4,1512.0,1538.0,Paul White 43 | 2017-02-28T04:00:00,South John,3,1156.0,1204.0,Sheila Dixon 44 | 2017-02-28T10:00:00,New Kimton,4,,648.0,Michelle Kim 45 | 2017-02-28T13:00:00,Lake Lauren,19,289.0,995.0,Pamela Rose 46 | 2017-02-28T19:00:00,Kellimouth,6,,7.0,Kelly Page 47 | 2017-02-28T22:00:00,Lake Jeffreyborough,6,,819.0,Hannah Graves 48 | 2017-03-01T02:00:00,,1,508.0,597.0,Brenda Peck 49 | 2017-03-01T04:00:00,Pittmanberg,1,957.0,1598.0,Anthony Dominguez 50 | 2017-03-01T10:00:00,Kyleland,8,2424.0,634.0,Zachary Robbins 51 | 2017-03-01T14:00:00,North Daniellefort,2,1656.0,294.0,Emily Taylor 52 | 2017-03-01T20:00:00,Port Tracey,12,2072.0,726.0,Mark Garcia 53 | 2017-03-01T23:00:00,South Stevenberg,2,702.0,1391.0,William Herrera 54 | 2017-03-02T07:00:00,Deniseborough,18,514.0,817.0,Patricia Smith 55 | 2017-03-02T09:00:00,West Jamesbury,8,87.0,1146.0,Anna Kemp DVM 56 | 2017-03-02T18:00:00,East Williammouth,15,,362.0,Tracy Brooks 57 | 2017-03-02T22:00:00,Watkinsport,5,,378.0,Joseph Miller 58 | 2017-03-03T03:00:00,,7,1697.0,894.0,Dr. 
Carol Compton 59 | 2017-03-03T12:00:00,Lake Kimberlyview,8,1485.0,1596.0,William Cook 60 | 2017-03-03T18:00:00,South Michael,16,801.0,1488.0,Laurie Schmidt 61 | 2017-03-04T00:00:00,West Arianafort,12,478.0,860.0, 62 | 2017-03-04T07:00:00,Keithfort,10,,910.0,Steven Smith 63 | 2017-03-04T10:00:00,Jaclynmouth,15,756.0,-418.0,Ryan Lewis 64 | 2017-03-04T13:00:00,New David,5,2243.0,-221.0,Jennifer Lowery 65 | 2017-03-04T18:00:00,Brownmouth,19,250.0,-5.0,Leah Tapia 66 | 2017-03-05T03:00:00,East Justinshire,9,1998.0,219.0,Richard Foley 67 | 2017-03-05T06:00:00,Morganmouth,17,2734.0,-473.0,Carrie Richardson 68 | 2017-03-05T07:00:00,East Nathanton,14,1558.0,-317.0,Phillip Jimenez 69 | 2017-03-05T14:00:00,,5,1196.0,249.0,Carol Cannon 70 | 2017-03-05T21:00:00,Lindseyborough,4,,458.0,Renee Cain 71 | 2017-03-06T01:00:00,North Julie,12,369.0,1194.0, 72 | 2017-03-06T09:00:00,South Alexandra,10,2505.0,499.0,Joann Anderson 73 | 2017-03-06T12:00:00,West Tasha,1,2180.0,1435.0,Denise Sutton 74 | 2017-03-06T16:00:00,Johnsonhaven,2,2416.0,1034.0,Melissa Bradley 75 | 2017-03-07T01:00:00,Kaylaton,12,1898.0,-282.0,Rodney Brennan 76 | 2017-03-07T06:00:00,Hollowayshire,9,,-275.0, 77 | 2017-03-07T09:00:00,New Gary,5,445.0,40.0,Matthew Jones 78 | 2017-03-07T17:00:00,Torresburgh,10,2469.0,67.0, 79 | 2017-03-08T01:00:00,New Kevinberg,3,1345.0,-14.0,Abigail Rosario 80 | 2017-03-08T04:00:00,Port Timothyville,7,2011.0,860.0,Calvin Freeman 81 | 2017-03-08T10:00:00,Brendahaven,3,1070.0,332.0,Jasmine Payne 82 | 2017-03-08T14:00:00,South Kurtville,14,2217.0,-174.0,Rhonda Reyes 83 | 2017-03-08T20:00:00,Grantbury,9,,571.0,Valerie Roberts 84 | 2017-03-09T04:00:00,Ortizville,19,2507.0,-77.0,Rachel Valentine 85 | 2017-03-09T09:00:00,New Janetland,3,2332.0,199.0,Maria Harrison 86 | 2017-03-09T18:00:00,Williston,14,1916.0,386.0,Albert Nguyen 87 | 2017-03-09T19:00:00,Lake Sarahtown,16,2211.0,815.0, 88 | 2017-03-10T03:00:00,Brianchester,16,683.0,1329.0,Joseph Grant 89 | 
2017-03-10T06:00:00,Penaview,12,2664.0,462.0, 90 | 2017-03-10T15:00:00,Williamsmouth,5,1882.0,79.0,Jennifer Marsh 91 | 2017-03-10T22:00:00,North Paulfurt,17,2789.0,-414.0,Elizabeth Sanchez 92 | 2017-03-11T03:00:00,East Annetteside,19,1870.0,379.0,Erika Perry 93 | 2017-03-11T08:00:00,Murphyville,13,391.0,322.0,Paula Patel 94 | 2017-03-11T13:00:00,,8,1375.0,1332.0, 95 | 2017-03-11T17:00:00,Longfort,10,936.0,657.0,Scott Hall 96 | 2017-03-11T20:00:00,North Lukeside,18,785.0,-328.0,Janet Harris 97 | 2017-03-12T02:00:00,East Juliebury,14,2732.0,-365.0,Jennifer Harper 98 | 2017-03-12T07:00:00,Davidside,5,,1316.0,David Garcia 99 | 2017-03-12T15:00:00,Wilsonstad,1,1707.0,1211.0,Stephen Ball 100 | 2017-03-12T22:00:00,West Jessicaport,18,62.0,-432.0,Aaron Diaz 101 | 2017-03-13T04:00:00,Masonhaven,6,62.0,-467.0,Hector Taylor 102 | 2017-03-13T13:00:00,Traciburgh,16,594.0,293.0,James Gay 103 | 2017-03-13T21:00:00,West Kirkchester,14,2988.0,451.0,Charles Thomas 104 | 2017-03-14T05:00:00,Ethanmouth,13,715.0,1008.0,Jesse Patterson 105 | 2017-03-14T07:00:00,Lauramouth,18,1579.0,1446.0,Steven Rodriguez 106 | 2017-03-14T08:00:00,Madelineberg,8,423.0,-185.0,Charles Castro 107 | 2017-03-14T12:00:00,,10,2865.0,1211.0,Richard Coleman 108 | 2017-03-14T20:00:00,,6,1812.0,1429.0,Susan Wilkins 109 | 2017-03-14T23:00:00,West Tiffanyshire,9,2646.0,120.0,Aaron Allen 110 | 2017-03-15T07:00:00,West Shawntown,16,489.0,954.0,Susan Powers 111 | 2017-03-15T16:00:00,Alanmouth,11,590.0,1090.0,Jennifer Lara 112 | 2017-03-15T21:00:00,Marybury,8,,274.0,Kenneth Price 113 | 2017-03-16T05:00:00,Davisfort,18,1258.0,239.0, 114 | 2017-03-16T14:00:00,Victoriaside,11,1434.0,591.0,Travis Dean 115 | 2017-03-16T15:00:00,,13,237.0,1023.0,Jacqueline Levine 116 | 2017-03-16T17:00:00,New Robert,11,1507.0,1273.0,Joyce Moore 117 | 2017-03-16T21:00:00,Port Heatherchester,5,1007.0,860.0,Michael Cherry 118 | 2017-03-17T04:00:00,Kristinaside,16,2125.0,1449.0,Sharon Moore 119 | 
2017-03-17T07:00:00,Dawnville,2,1859.0,1401.0,Steven Manning 120 | 2017-03-17T09:00:00,South Michaelport,10,2004.0,530.0,Vincent Jacobs 121 | 2017-03-17T11:00:00,New Monique,13,2903.0,-431.0, 122 | 2017-03-17T18:00:00,Kathyview,2,1569.0,423.0,Donald Morales MD 123 | 2017-03-18T01:00:00,Port Kathrynport,10,,976.0,Michelle Browning 124 | 2017-03-18T07:00:00,Turnerland,12,1024.0,225.0,Sandra Mcmahon 125 | 2017-03-18T12:00:00,Port Kristenstad,3,1259.0,231.0,Amber Dickson MD 126 | 2017-03-18T16:00:00,North Brianton,7,788.0,-111.0,Andrew Wyatt 127 | 2017-03-19T01:00:00,Drakeland,5,1004.0,741.0,Brian Ellis 128 | 2017-03-19T06:00:00,Maddenton,12,2063.0,1324.0, 129 | 2017-03-19T09:00:00,Mcguirehaven,9,687.0,623.0,Carl Lee 130 | 2017-03-19T17:00:00,Gardnerborough,12,153.0,1002.0, 131 | 2017-03-19T20:00:00,Port Teresahaven,10,2329.0,898.0,Stephen Perry 132 | 2017-03-20T01:00:00,Victormouth,10,1241.0,228.0,Michael Palmer 133 | 2017-03-20T03:00:00,,4,2965.0,-171.0,Kimberly Harding 134 | 2017-03-20T05:00:00,Anthonyshire,17,23.0,515.0,Justin Lindsey 135 | 2017-03-20T08:00:00,Karenbury,1,2468.0,-19.0,Jeffrey Black 136 | 2017-03-20T14:00:00,Melissaburgh,17,2489.0,-54.0, 137 | 2017-03-20T21:00:00,Janetbury,4,980.0,-230.0, 138 | 2017-03-21T05:00:00,Erikville,4,858.0,1241.0, 139 | 2017-03-21T07:00:00,West Susan,8,2266.0,-443.0,Dawn Reid 140 | 2017-03-21T14:00:00,Millerland,13,532.0,691.0,Joseph Gilbert 141 | 2017-03-21T18:00:00,West James,5,1386.0,1206.0,Mark Harrison 142 | 2017-03-21T19:00:00,Wendybury,6,,569.0,Howard Weeks 143 | 2017-03-21T22:00:00,Charlesmouth,10,,784.0,Jose Hall 144 | 2017-03-22T07:00:00,Jameshaven,1,533.0,-36.0,Brittney Torres 145 | 2017-03-22T13:00:00,New Christina,7,1099.0,681.0,Antonio Miller 146 | 2017-03-22T20:00:00,,4,1423.0,-372.0,Kimberly Page 147 | 2017-03-23T00:00:00,Port David,9,922.0,733.0,Manuel Porter 148 | 2017-03-23T09:00:00,Ryanstad,13,2984.0,164.0,Todd Smith 149 | 2017-03-23T18:00:00,South Sydney,13,1218.0,1472.0,Brenda Sanders 150 | 
2017-03-23T21:00:00,,15,,1092.0,David Robinson 151 | 2017-03-24T06:00:00,,19,2507.0,724.0,Leslie Robinson 152 | 2017-03-24T11:00:00,Jennifertown,11,2778.0,949.0,Randy Harmon 153 | 2017-03-24T12:00:00,Hammondborough,2,1605.0,-309.0,Emily Gregory 154 | 2017-03-24T15:00:00,Brendachester,15,2097.0,726.0,Lindsay Palmer 155 | 2017-03-25T00:00:00,Allisonhaven,10,473.0,1457.0,Rachel Martinez 156 | 2017-03-25T07:00:00,Jenniferton,6,,-236.0,Kayla Santiago 157 | 2017-03-25T11:00:00,Scottland,17,2445.0,1170.0,Stephanie Perez 158 | 2017-03-25T13:00:00,North Miranda,2,469.0,1030.0,Anthony Walker 159 | 2017-03-25T17:00:00,Port Davidstad,12,2823.0,-319.0,Nicolas James 160 | 2017-03-25T19:00:00,South Matthew,10,2689.0,1327.0,Brenda Payne 161 | 2017-03-26T01:00:00,Lake Jamesshire,12,1609.0,56.0,Melissa Hanson 162 | 2017-03-26T08:00:00,Brownhaven,2,,909.0,Caleb Trujillo 163 | 2017-03-26T10:00:00,Christybury,1,1385.0,557.0,Jennifer Anderson 164 | 2017-03-26T15:00:00,Port Mary,10,635.0,245.0,Ryan Walker 165 | 2017-03-26T22:00:00,Gregoryburgh,15,,1199.0,Brian Clark 166 | 2017-03-27T03:00:00,,4,2874.0,1394.0,Ms. 
Carla Vaughn DVM 167 | 2017-03-27T05:00:00,,18,953.0,557.0,Keith Wilson 168 | 2017-03-27T13:00:00,New Amyfort,6,500.0,1173.0,Christina Pope 169 | 2017-03-27T17:00:00,Dustinfurt,13,1765.0,686.0,Raymond Fry 170 | 2017-03-27T22:00:00,Nicolefort,14,2428.0,-274.0,Katie Miller 171 | 2017-03-28T06:00:00,New Brianborough,6,1166.0,1583.0,Kristen Summers 172 | 2017-03-28T07:00:00,Tammyfurt,2,757.0,462.0,Crystal Hudson 173 | 2017-03-28T14:00:00,Lauraburgh,12,1836.0,1258.0,Kimberly Pennington 174 | 2017-03-28T15:00:00,Alejandroland,19,2478.0,1254.0,Michele Rogers 175 | 2017-03-28T16:00:00,Lake Vanessaton,9,1854.0,-262.0,Carlos Jensen 176 | 2017-03-28T18:00:00,West Richard,8,2820.0,1020.0,Jaime Clark 177 | 2017-03-28T19:00:00,Port Rebekahmouth,17,1478.0,900.0,Nicholas Nash 178 | 2017-03-28T21:00:00,Port Josephview,17,183.0,1501.0,Jeanne Curry 179 | 2017-03-28T23:00:00,East Kevin,4,1187.0,936.0,Eric Day 180 | 2017-03-29T05:00:00,West John,17,1078.0,116.0,Tyler Cain 181 | 2017-03-29T13:00:00,Nancyberg,10,2392.0,178.0,Ashley Johnson 182 | 2017-03-29T16:00:00,North Julie,11,2806.0,1035.0, 183 | 2017-03-29T22:00:00,Arnoldberg,5,417.0,338.0,Scott Beltran 184 | 2017-03-30T04:00:00,Port Sydney,13,1480.0,-88.0,Theresa Young 185 | 2017-03-30T06:00:00,,8,483.0,1442.0,Stephen Henry 186 | 2017-03-30T12:00:00,New Cameronton,1,1262.0,-264.0,Ms. 
Jessica Williams 187 | 2017-03-30T13:00:00,Ramirezton,4,2275.0,283.0,Harold Patton 188 | 2017-03-30T22:00:00,North Kennethberg,10,,378.0,Nicholas Hughes 189 | 2017-03-31T01:00:00,Lake Brandonfort,10,2286.0,492.0,Rodney Patel 190 | 2017-03-31T10:00:00,New Zachary,10,,145.0,Jonathan Thomas 191 | 2017-03-31T18:00:00,New Lisaberg,6,1986.0,-392.0,Kimberly Stevens 192 | 2017-03-31T20:00:00,South Stephanieview,5,30.0,394.0,Vanessa Allen 193 | 2017-03-31T21:00:00,North Stephanie,12,2514.0,-177.0,Brett Dixon 194 | 2017-04-01T04:00:00,,15,110.0,521.0,Melissa Levine 195 | 2017-04-01T08:00:00,,6,,-298.0,Autumn Alvarado 196 | 2017-04-01T15:00:00,Morrisstad,12,2797.0,991.0, 197 | 2017-04-01T18:00:00,Amandaside,17,1211.0,1121.0,Audrey Daniels 198 | 2017-04-02T00:00:00,South Joseph,11,232.0,-427.0,Tina Lopez 199 | 2017-04-02T05:00:00,Port Nicolehaven,5,2153.0,629.0,Julie Payne 200 | 2017-04-02T09:00:00,North Timothyfort,10,1564.0,-297.0,Stephen Carrillo 201 | 2017-04-02T11:00:00,New Erika,9,1952.0,976.0,Mark Knight 202 | 2017-04-02T15:00:00,North Stephen,7,713.0,109.0,Michael Reed 203 | 2017-04-02T22:00:00,Port Lindsayhaven,5,1454.0,594.0,Anna Ayala 204 | 2017-04-03T01:00:00,East Justintown,3,497.0,1305.0,Dillon Adams 205 | 2017-04-03T03:00:00,Bettyport,10,2241.0,169.0,Karen Thompson 206 | 2017-04-03T09:00:00,Lake Edwardfurt,1,726.0,345.0,Linda Guerra 207 | 2017-04-03T12:00:00,Josephfort,4,1276.0,1301.0,Ricky Li 208 | 2017-04-03T17:00:00,Johnsonchester,19,563.0,-225.0,Elizabeth Lynch 209 | 2017-04-03T22:00:00,,4,1955.0,-34.0,Anthony Peck DDS 210 | 2017-04-03T23:00:00,North Katherine,5,1645.0,1444.0,Hannah Jones 211 | 2017-04-04T07:00:00,North Jason,19,1176.0,482.0,Brian Martin 212 | 2017-04-04T09:00:00,Romanstad,16,2930.0,-464.0,Melanie Stevens 213 | 2017-04-04T16:00:00,Turnerport,3,2809.0,948.0, 214 | 2017-04-04T17:00:00,Kingside,11,1231.0,1265.0,Taylor Neal 215 | 2017-04-05T00:00:00,Christineview,14,1067.0,517.0,Stephen Nolan 216 | 2017-04-05T08:00:00,Port 
Alicia,6,1410.0,-97.0,Michael Cabrera 217 | 2017-04-05T10:00:00,,14,2022.0,-117.0,Russell Benjamin 218 | 2017-04-05T17:00:00,Lake Cynthia,9,2838.0,-122.0,Paul Palmer 219 | 2017-04-05T20:00:00,Carpenterport,2,2696.0,-264.0,William Lopez 220 | 2017-04-05T21:00:00,Foxchester,14,1077.0,-37.0,Joel Zhang 221 | 2017-04-05T22:00:00,Williamville,18,2120.0,-246.0,Heather Patrick 222 | 2017-04-06T00:00:00,Steinland,17,1849.0,1340.0,Jacob Brown 223 | 2017-04-06T03:00:00,,9,2840.0,1536.0,Kathryn Ramos 224 | 2017-04-06T11:00:00,Lake Coryland,8,1652.0,234.0,Benjamin Weber 225 | 2017-04-06T18:00:00,Port Gregory,16,1889.0,220.0,Nancy Soto 226 | 2017-04-06T19:00:00,Lake Anita,14,44.0,81.0,David Murphy 227 | 2017-04-07T04:00:00,West Jasonport,17,1845.0,698.0,Robert May 228 | 2017-04-07T13:00:00,South Peter,14,1959.0,542.0,Savannah James 229 | 2017-04-07T17:00:00,Wendyhaven,13,273.0,374.0,Mark Harrison 230 | 2017-04-07T21:00:00,Lake Lauren,12,652.0,-225.0,Jim Rogers 231 | 2017-04-08T05:00:00,Myersborough,15,2644.0,19.0,Daniel Taylor 232 | 2017-04-08T07:00:00,Brooksside,8,2703.0,400.0,Kristie Luna 233 | 2017-04-08T12:00:00,Kathyton,13,2286.0,799.0,John Cook 234 | 2017-04-08T21:00:00,Deanhaven,3,2693.0,590.0,Elaine Barber 235 | 2017-04-08T22:00:00,New Christinaside,19,1647.0,921.0,Laura Thomas 236 | 2017-04-09T07:00:00,East Shannonton,9,118.0,858.0,Mary Ramirez 237 | 2017-04-09T12:00:00,Richardsville,8,806.0,281.0,Alejandro Gordon 238 | 2017-04-09T20:00:00,,5,1487.0,1171.0,Cheryl Johnson 239 | 2017-04-10T01:00:00,Lake Jon,2,1966.0,85.0, 240 | 2017-04-10T10:00:00,New Alejandro,14,2882.0,580.0,Bonnie Rivera 241 | 2017-04-10T17:00:00,Victoriaburgh,16,2660.0,-250.0,Mrs. 
Destiny Butler 242 | 2017-04-10T22:00:00,Ortizberg,1,2269.0,414.0,Bobby Campbell 243 | 2017-04-11T00:00:00,,15,,1175.0,Ariana Fisher 244 | 2017-04-11T06:00:00,Allisonland,1,773.0,567.0,Linda Holland 245 | 2017-04-11T10:00:00,Bridgestown,5,1169.0,1193.0,Abigail Lloyd 246 | 2017-04-11T14:00:00,Guzmanhaven,1,671.0,-17.0, 247 | 2017-04-11T16:00:00,Yoderfort,3,2490.0,827.0,Kevin Reeves 248 | 2017-04-11T23:00:00,Brownview,13,,-117.0,Brittney Nunez 249 | 2017-04-12T01:00:00,Jasonfurt,6,394.0,577.0,Nathan Hartman 250 | 2017-04-12T05:00:00,New Emily,15,2445.0,753.0,Olivia Williams 251 | 2017-04-12T10:00:00,Ashleytown,1,761.0,835.0,Samuel Hodges 252 | 2017-04-12T14:00:00,East Davidberg,17,978.0,73.0,Kevin Kim 253 | 2017-04-12T17:00:00,Kirbyside,7,2891.0,949.0,David Fernandez 254 | 2017-04-12T23:00:00,North Patriciaside,2,413.0,-107.0,Dylan Martin 255 | 2017-04-13T01:00:00,South Michaelside,17,2029.0,1059.0,Kevin Allison 256 | 2017-04-13T04:00:00,Lindaland,18,1409.0,99.0,Colleen Forbes 257 | 2017-04-13T13:00:00,Amberfurt,6,,709.0,Justin Elliott 258 | 2017-04-13T15:00:00,Lake Brandon,2,,-171.0,Elizabeth Reed 259 | 2017-04-13T18:00:00,West James,10,734.0,1018.0,April Conner 260 | 2017-04-14T03:00:00,,10,2678.0,-489.0,Nicole Morgan 261 | 2017-04-14T05:00:00,Mcmahonside,12,842.0,229.0,Crystal Wilson 262 | 2017-04-14T11:00:00,East Stephanieton,4,1705.0,1442.0,Todd Sutton 263 | 2017-04-14T13:00:00,Priceside,1,777.0,-77.0,Lindsey Nelson 264 | 2017-04-14T17:00:00,Port Andreafort,4,2442.0,1337.0,Tammy Salinas 265 | 2017-04-14T23:00:00,,3,1312.0,-178.0, 266 | 2017-04-15T08:00:00,South Charlesfort,3,2706.0,767.0,Deborah Williams 267 | 2017-04-15T15:00:00,South Michaelberg,1,1420.0,920.0, 268 | 2017-04-15T18:00:00,West Manuel,10,1899.0,306.0,Steven Whitehead 269 | 2017-04-16T02:00:00,Westview,3,460.0,54.0, 270 | 2017-04-16T11:00:00,Lake Sueside,5,1634.0,335.0,Robert Smith 271 | 2017-04-16T13:00:00,Port Steventon,9,1831.0,642.0,Michael Miller 272 | 
2017-04-16T22:00:00,,13,2777.0,1535.0,Cindy Edwards 273 | 2017-04-17T04:00:00,Teresatown,10,1469.0,329.0,Natasha Garcia 274 | 2017-04-17T12:00:00,Port John,7,1912.0,994.0,Erik Jones 275 | 2017-04-17T19:00:00,North David,9,2735.0,-236.0, 276 | 2017-04-17T22:00:00,East Brittneyland,19,2555.0,-154.0,Amanda Beasley 277 | 2017-04-18T05:00:00,Jeffreyland,14,,959.0,Jennifer Benton 278 | 2017-04-18T14:00:00,West Judymouth,10,360.0,-368.0,Robert Rice 279 | 2017-04-18T19:00:00,Mcdonaldmouth,7,228.0,331.0,Daniel Brown 280 | 2017-04-19T01:00:00,Paynestad,12,2383.0,1081.0,Megan Wilcox 281 | 2017-04-19T09:00:00,New Danielle,14,327.0,214.0,Victoria Walker 282 | 2017-04-19T16:00:00,Waltertown,3,1216.0,657.0,Carlos Fitzpatrick 283 | 2017-04-19T18:00:00,Jordanstad,5,2331.0,-468.0,Cynthia Morgan 284 | 2017-04-20T02:00:00,Williamton,10,1946.0,-446.0,Erika Gomez 285 | 2017-04-20T08:00:00,,17,716.0,821.0,Adriana Hudson 286 | 2017-04-20T17:00:00,Sarahstad,12,644.0,1498.0,Brandon Rivera 287 | 2017-04-21T02:00:00,,15,1884.0,653.0,Parker Franco 288 | 2017-04-21T06:00:00,Craigshire,2,1317.0,1447.0,Michelle Jones 289 | 2017-04-21T14:00:00,Johnville,14,876.0,-54.0,Miranda Spencer 290 | 2017-04-21T17:00:00,East Courtneyshire,18,1345.0,986.0,Lisa Gardner 291 | 2017-04-21T22:00:00,East Lindseychester,18,786.0,-483.0,Kevin Green 292 | 2017-04-22T05:00:00,South Kennethville,11,2865.0,1338.0,Eric Mills 293 | 2017-04-22T10:00:00,South Vanessastad,9,638.0,1322.0,Victor Reynolds 294 | 2017-04-22T16:00:00,Adamtown,2,2479.0,1523.0,Sandra Mays 295 | 2017-04-22T19:00:00,East Amy,14,2440.0,663.0,Kathryn Branch DDS 296 | 2017-04-22T21:00:00,New Rebeccaborough,2,2831.0,690.0,William Drake 297 | 2017-04-23T01:00:00,Lewisport,6,2730.0,782.0,Lori Allen 298 | 2017-04-23T07:00:00,Samuelmouth,8,19.0,-387.0, 299 | 2017-04-23T10:00:00,Port Meganburgh,16,1322.0,774.0,Chad Villanueva 300 | 2017-04-23T19:00:00,Port Tyrone,15,1225.0,-144.0,Daniel Escobar 301 | 2017-04-23T23:00:00,Gregorychester,13,2830.0,644.0,Justin 
Williams 302 | 2017-04-24T06:00:00,,6,2131.0,1511.0, 303 | 2017-04-24T13:00:00,Heatherfort,14,1940.0,537.0,Christopher White 304 | 2017-04-24T20:00:00,Bryanside,16,1971.0,109.0,Lucas Stanley 305 | 2017-04-25T02:00:00,Feliciabury,11,,1289.0,Jo Morrow 306 | 2017-04-25T10:00:00,Janetstad,19,747.0,-168.0,Madison Evans 307 | 2017-04-25T18:00:00,Collinsport,9,1378.0,198.0,Christopher Lopez 308 | 2017-04-26T00:00:00,Lake Laura,3,2245.0,1570.0,James Weiss 309 | 2017-04-26T03:00:00,Port Dianemouth,6,2359.0,1148.0,Krista Harris 310 | 2017-04-26T07:00:00,Kellishire,11,2704.0,1585.0,Mariah Johnson 311 | 2017-04-26T10:00:00,Alexisville,18,1976.0,1147.0,Kathryn Summers 312 | 2017-04-26T11:00:00,Davidland,4,2937.0,-77.0,Erik Johnson 313 | 2017-04-26T19:00:00,Patrickstad,2,422.0,1210.0, 314 | 2017-04-26T23:00:00,North Lori,11,2789.0,-451.0,Michael Flores 315 | 2017-04-27T07:00:00,Patelmouth,1,2163.0,285.0,Angelica Esparza 316 | 2017-04-27T14:00:00,Port Erica,8,525.0,1560.0,Todd Mcbride 317 | 2017-04-27T22:00:00,Castilloburgh,19,2833.0,-190.0,Corey Brown 318 | 2017-04-28T07:00:00,Frostport,3,,817.0,Jessica Hughes 319 | 2017-04-28T15:00:00,Kimport,8,1835.0,1466.0,Hailey Roth 320 | 2017-04-28T20:00:00,Valerietown,11,2770.0,191.0, 321 | 2017-04-29T03:00:00,Lake Claire,19,1243.0,619.0,Brian Rogers 322 | 2017-04-29T10:00:00,South Michelleshire,7,2917.0,968.0,Katherine Poole 323 | 2017-04-29T16:00:00,Russellburgh,6,1735.0,513.0,Dominique Carr 324 | 2017-04-30T00:00:00,Reneehaven,7,1308.0,1368.0,Evan Stokes 325 | 2017-04-30T06:00:00,Bestfurt,17,733.0,244.0,Danielle Terrell 326 | 2017-04-30T14:00:00,Port Marilynhaven,9,1167.0,504.0,James Montgomery 327 | 2017-04-30T19:00:00,Port Kristen,6,1777.0,1010.0,Joseph Jones 328 | 2017-05-01T03:00:00,North Kylemouth,4,299.0,1285.0,William Robinson 329 | 2017-05-01T12:00:00,Jesseside,10,2267.0,-420.0, 330 | 2017-05-01T13:00:00,Codyborough,5,785.0,-359.0,Carol Heath 331 | 2017-05-01T18:00:00,,18,1034.0,315.0,Laurie Mcdonald 332 | 
2017-05-01T20:00:00,Port Susan,8,1221.0,1365.0,Carolyn Blackwell 333 | 2017-05-02T04:00:00,New Joy,14,2319.0,1409.0,Debra Mccullough 334 | 2017-05-02T07:00:00,Gonzaleston,16,1714.0,727.0,Lori Arroyo 335 | 2017-05-02T16:00:00,Andrewsburgh,19,,512.0,Scott Levine 336 | 2017-05-03T00:00:00,Lake Jasmine,17,1631.0,-58.0,Robert Sexton 337 | 2017-05-03T05:00:00,North Bonnieshire,19,2035.0,173.0,Andre Wilson 338 | 2017-05-03T14:00:00,Johnburgh,19,2402.0,-226.0,James Shea 339 | 2017-05-03T20:00:00,East Lisaton,11,2814.0,1253.0,Sharon Wade 340 | 2017-05-04T00:00:00,Calebton,4,1566.0,1476.0, 341 | 2017-05-04T09:00:00,Schroederfurt,18,676.0,472.0,Brittany Weaver 342 | 2017-05-04T10:00:00,Kristopherburgh,5,,407.0,Tammy Johnson 343 | 2017-05-04T19:00:00,East Connie,13,2394.0,762.0,Mark Robinson 344 | 2017-05-05T04:00:00,Howardville,13,944.0,1024.0, 345 | 2017-05-05T05:00:00,Richardfort,17,1084.0,970.0,Alicia Harris 346 | 2017-05-05T12:00:00,West Jessestad,10,111.0,-21.0,Michelle Johnson 347 | 2017-05-05T13:00:00,East Kendra,3,1667.0,370.0,Jared Sanchez 348 | 2017-05-05T18:00:00,Tylerport,6,841.0,953.0,Danielle Hart 349 | 2017-05-05T23:00:00,East Chelsealand,7,,-175.0, 350 | 2017-05-06T03:00:00,Kristinville,13,2809.0,-136.0,Cynthia Reed 351 | 2017-05-06T09:00:00,New Michaelchester,15,636.0,107.0,Oscar Mann 352 | 2017-05-06T15:00:00,Raymondton,11,1380.0,1266.0,Kaylee Stone 353 | 2017-05-06T21:00:00,Port Tashaberg,9,1217.0,779.0,Ernest Clay 354 | 2017-05-06T23:00:00,Mcintoshside,7,2244.0,694.0, 355 | 2017-05-07T06:00:00,Martinezberg,10,2888.0,365.0,Brittany Gould DDS 356 | 2017-05-07T10:00:00,Danielmouth,9,651.0,-171.0,Ryan Wu 357 | 2017-05-07T16:00:00,Jeffersonfurt,16,523.0,53.0,Jimmy Porter 358 | 2017-05-07T21:00:00,North Meganhaven,12,2857.0,969.0,Mrs. 
Madison Jones MD 359 | 2017-05-07T23:00:00,New Scottchester,5,31.0,1172.0,Miss Patricia Villa DDS 360 | 2017-05-08T07:00:00,Sanchezview,10,316.0,-120.0,Laura Lewis 361 | 2017-05-08T14:00:00,West Richard,4,1914.0,870.0,Sean Anderson 362 | 2017-05-08T23:00:00,,14,126.0,-414.0,Oscar Zamora 363 | 2017-05-09T08:00:00,Port Scott,7,2775.0,-74.0,Francisco Chambers 364 | 2017-05-09T16:00:00,Gonzalezview,9,976.0,898.0,Stephanie Nichols 365 | 2017-05-09T17:00:00,South Kimberly,14,1466.0,131.0,Natalie Cannon 366 | 2017-05-09T22:00:00,Mcgeeland,2,,1336.0,Jennifer Robinson 367 | 2017-05-10T06:00:00,Gonzalesborough,4,1255.0,-86.0, 368 | 2017-05-10T08:00:00,Zunigaport,2,2854.0,501.0,Brandon Hernandez 369 | 2017-05-10T17:00:00,Lake Tyler,3,864.0,191.0,Benjamin Chapman 370 | 2017-05-11T02:00:00,New Andrea,9,,1432.0,Kristen Smith 371 | -------------------------------------------------------------------------------- /data/sales_summary.csv: -------------------------------------------------------------------------------- 1 | ,timestamp,city,store_id,sale_number,sale_amount,associate,store_total,associate_total,city_total 2 | 0,2017-09-15,Alexandrabury,18,1043.0,15.0,Stacey Daniels,7688.0,15.0,15.0 3 | 1,2017-09-11,East Jesusport,2,1729.0,396.0,Haley Pitts,2450.0,396.0,396.0 4 | 2,2017-07-12,New Douglasmouth,13,2028.0,-78.0,Carlos French,3472.0,-78.0,-78.0 5 | 3,2017-07-29,West Carriemouth,19,1245.0,1149.0,Jeffrey Ford,4868.0,1149.0,1149.0 6 | 4,2017-11-07,Port Timothy,1,2365.0,724.0,Christopher West,6232.0,724.0,724.0 7 | 5,2017-05-24,Katherinestad,18,2764.0,119.0,Jennifer Lee,7688.0,119.0,119.0 8 | 6,2017-08-07,Lake Richard,4,2574.0,1387.0,Kayla Mercado,7793.0,1387.0,1387.0 9 | 8,2017-10-18,North Linda,6,893.0,16.0,Johnny Turner,6174.0,16.0,16.0 10 | 9,2017-02-14,New Timothy,17,2986.0,1487.0,Tony Lynch,3093.0,1487.0,1487.0 11 | 10,2017-04-08,North Kyle,5,2944.0,537.0,Ashley Taylor,3222.0,537.0,537.0 12 | 12,2017-06-17,Port Coryborough,3,1455.0,958.0,Laurie Carr DVM,5196.0,958.0,958.0 
13 | 13,2017-01-20,New Katherineville,8,1654.0,-89.0,Patricia Peterson,2391.0,-89.0,-89.0 14 | 16,2017-05-01,Janiceview,1,820.0,443.0,Linda Snyder,6232.0,443.0,443.0 15 | 17,2017-02-09,Beniteztown,3,1186.0,627.0,Mary Luna,5196.0,627.0,627.0 16 | 24,2017-12-29,Nelsonfort,7,1781.0,146.0,Jennifer Mcdonald,5061.0,146.0,146.0 17 | 25,2017-10-15,Jamesburgh,4,2102.0,-358.0,James Thompson,7793.0,-358.0,-358.0 18 | 28,2017-11-05,Timchester,6,2292.0,-9.0,Michael Johnson,6174.0,-9.0,-9.0 19 | 30,2017-12-15,East Jennifer,15,2623.0,1124.0,Ryan Morrow,3384.0,1124.0,1124.0 20 | 31,2017-05-06,Lake Chelseatown,11,2941.0,-321.0,Don Garza,1352.0,-321.0,-321.0 21 | 32,2017-10-09,Ariasberg,19,444.0,1299.0,Alicia Ross,4868.0,1299.0,1299.0 22 | 35,2017-09-08,West Michael,11,2890.0,264.0,Christine Oneal,1352.0,264.0,264.0 23 | 36,2017-06-29,South Stevenview,3,2683.0,-265.0,Mariah Wright,5196.0,-265.0,-265.0 24 | 38,2017-08-06,Port Peter,13,1646.0,349.0,Dave Evans,3472.0,349.0,349.0 25 | 40,2017-06-06,New Jason,16,280.0,309.0,Ashley Bryant,1430.0,309.0,309.0 26 | 43,2017-10-23,Karaville,11,1569.0,-391.0,Christina Khan,1352.0,-391.0,-391.0 27 | 44,2017-12-24,Dianabury,14,2697.0,-20.0,Nathan White,2274.0,-20.0,-20.0 28 | 47,2017-04-04,Aliciastad,6,652.0,-164.0,Rhonda Stone,6174.0,-164.0,-164.0 29 | 51,2017-04-18,Port Matthew,15,725.0,-455.0,Tracey Martin,3384.0,-455.0,-502.0 30 | 52,2017-01-24,West Petermouth,14,1276.0,127.0,Robert Taylor,2274.0,127.0,127.0 31 | 53,2017-05-29,New Amy,3,212.0,857.0,John Chambers,5196.0,857.0,857.0 32 | 54,2017-09-25,Beardshire,18,1243.0,1027.0,Amanda Haynes,7688.0,1027.0,1027.0 33 | 56,2017-02-08,West Natasha,3,278.0,-154.0,Charles Fletcher,5196.0,-154.0,-154.0 34 | 57,2017-06-25,Port Shannon,12,2351.0,648.0,Dawn Schmidt,7906.0,648.0,648.0 35 | 59,2017-04-06,East Curtis,17,443.0,112.0,Elizabeth Barber,3093.0,112.0,112.0 36 | 60,2017-06-04,Jonesland,6,1774.0,1053.0,Christopher Young,6174.0,1053.0,1053.0 37 | 65,2017-12-03,Robinsonshire,8,55.0,738.0,James 
Garcia,2391.0,738.0,738.0 38 | 70,2017-10-26,Jacquelineburgh,19,59.0,1380.0,Brandi Simmons,4868.0,1380.0,1380.0 39 | 72,2017-04-07,Lake Brandon,11,406.0,880.0,Stacy Mcintosh,1352.0,880.0,880.0 40 | 73,2017-12-11,Guerrachester,15,2124.0,167.0,Nicole Gonzalez,3384.0,167.0,167.0 41 | 74,2017-06-19,East Michelleville,7,1572.0,262.0,Eugene Summers,5061.0,262.0,262.0 42 | 76,2017-03-16,South Ashleyton,3,2323.0,-4.0,Robert Landry,5196.0,-4.0,-4.0 43 | 77,2017-05-17,Seanview,8,962.0,243.0,Jamie Cummings,2391.0,243.0,243.0 44 | 79,2017-03-21,West Robertmouth,15,1799.0,262.0,Pamela Salinas,3384.0,262.0,262.0 45 | 80,2017-10-14,Batesshire,6,914.0,467.0,Donald Smith,6174.0,467.0,467.0 46 | 81,2017-08-13,Markmouth,1,2511.0,1504.0,Laura Henderson,6232.0,1504.0,2442.0 47 | 86,2017-02-05,Port Amanda,1,421.0,1330.0,Christopher Joyce,6232.0,1330.0,1330.0 48 | 88,2017-12-21,South Ryan,18,1835.0,1278.0,Robert Santos,7688.0,1278.0,1278.0 49 | 89,2017-07-23,Davidstad,13,2689.0,976.0,Christopher Ortiz,3472.0,976.0,976.0 50 | 90,2017-05-10,Port Matthew,4,170.0,-47.0,Adam Brandt,7793.0,-47.0,-502.0 51 | 91,2017-03-26,Smithfurt,8,2072.0,-238.0,Mr. 
Micheal Hale DDS,2391.0,-238.0,-238.0 52 | 92,2017-12-27,Lake Ashleytown,1,2532.0,1539.0,James Greer,6232.0,1539.0,1539.0 53 | 98,2017-04-03,Port Briana,8,1219.0,799.0,Christian Perez,2391.0,799.0,799.0 54 | 101,2017-04-23,Chaveztown,18,161.0,1551.0,Eric Hunter,7688.0,1551.0,1551.0 55 | 103,2017-06-07,Lindaville,2,1178.0,971.0,Monique Hughes,2450.0,971.0,971.0 56 | 104,2017-09-12,West Georgeshire,6,1862.0,1021.0,Heidi Lutz,6174.0,1021.0,1021.0 57 | 108,2017-04-20,South Paulabury,2,2963.0,1028.0,David Ayers,2450.0,1028.0,1028.0 58 | 109,2017-01-03,Williamburgh,6,1530.0,1167.0,Gary Lee,6174.0,1167.0,1167.0 59 | 111,2017-10-07,Denisemouth,16,16.0,535.0,James Pugh,1430.0,535.0,535.0 60 | 116,2017-12-14,Lake Joanne,3,629.0,1407.0,Tracy Diaz,5196.0,1407.0,1407.0 61 | 117,2017-03-03,Johnfurt,12,1370.0,676.0,Michael Clark,7906.0,676.0,676.0 62 | 119,2017-10-08,Moraport,3,1827.0,-358.0,Belinda Vasquez,5196.0,-358.0,-358.0 63 | 120,2017-12-07,Leachville,4,590.0,662.0,Amanda Tate,7793.0,662.0,662.0 64 | 121,2017-10-03,Brownmouth,12,700.0,1155.0,Andrea Crawford,7906.0,1155.0,1155.0 65 | 122,2017-10-05,Lake Ginaland,7,2695.0,547.0,Bradley Carrillo,5061.0,547.0,547.0 66 | 124,2017-08-26,Salasshire,13,1832.0,755.0,Shelly Todd,3472.0,755.0,755.0 67 | 127,2017-11-04,Griffithville,12,214.0,753.0,Thomas Fuller,7906.0,753.0,753.0 68 | 132,2017-03-07,East Johntown,10,2540.0,534.0,Nathan Santana,645.0,534.0,534.0 69 | 134,2017-03-12,Kathleenbury,19,884.0,1439.0,Carl Mills,4868.0,1439.0,1439.0 70 | 136,2017-05-09,Chelseashire,6,78.0,796.0,Jeffery Hughes,6174.0,796.0,796.0 71 | 137,2017-07-25,Jonesstad,6,1966.0,553.0,Andrew Warren,6174.0,553.0,553.0 72 | 139,2017-01-07,Caldwellbury,14,771.0,-108.0,Michaela Stewart,2274.0,-108.0,-108.0 73 | 141,2017-12-04,Port Brandyside,2,2627.0,-433.0,Deborah Reed DDS,2450.0,-433.0,-433.0 74 | 142,2017-07-11,Sarahton,5,1500.0,774.0,Kimberly Price,3222.0,774.0,774.0 75 | 147,2017-03-19,Randyborough,7,248.0,1417.0,Eric Montgomery,5061.0,1417.0,1417.0 76 | 
148,2017-12-13,Lisamouth,15,1748.0,1266.0,Rebecca Baxter,3384.0,1266.0,1266.0 77 | 149,2017-05-16,North Thomasshire,16,1491.0,33.0,Elizabeth Bishop,1430.0,33.0,33.0 78 | 154,2017-02-19,West Bridgetborough,5,216.0,1383.0,Jonathan Townsend,3222.0,1383.0,1383.0 79 | 155,2017-09-09,Hectorton,12,1674.0,823.0,Jennifer Page,7906.0,823.0,823.0 80 | 156,2017-01-14,New Jamesberg,15,2818.0,-295.0,Dean Davis,3384.0,-295.0,-295.0 81 | 157,2017-05-19,West Kimberly,14,2639.0,1429.0,Steven Miller,2274.0,1429.0,1429.0 82 | 159,2017-06-01,Ortiztown,13,185.0,1454.0,Christopher Harris,3472.0,1454.0,1454.0 83 | 166,2017-12-01,Debbieton,4,1781.0,899.0,Ronald Wood,7793.0,899.0,899.0 84 | 170,2017-07-24,North Michael,3,2633.0,462.0,Kendra Santiago,5196.0,462.0,462.0 85 | 171,2017-02-23,East Deborahland,9,365.0,1313.0,Ms. Victoria Ford DDS,2093.0,1313.0,1313.0 86 | 173,2017-03-14,New Jonathan,18,1135.0,1328.0,Edward Thomas,7688.0,1328.0,1328.0 87 | 174,2017-02-06,South Neilstad,3,2106.0,-177.0,Melinda Miranda,5196.0,-177.0,-177.0 88 | 176,2017-08-17,West Donaldbury,12,1547.0,980.0,Chelsea Lee,7906.0,980.0,980.0 89 | 178,2017-10-20,West Cody,15,1755.0,888.0,James Chen,3384.0,888.0,888.0 90 | 182,2017-11-14,New Tashabury,2,668.0,141.0,Antonio Jackson,2450.0,141.0,141.0 91 | 183,2017-08-30,Kellerside,5,5.0,528.0,Thomas Baker,3222.0,528.0,528.0 92 | 184,2017-04-16,New Melanie,7,1717.0,1579.0,Eric Morrow,5061.0,1579.0,1579.0 93 | 186,2017-07-21,Bruceton,12,1036.0,1570.0,Diane Pearson,7906.0,1570.0,1570.0 94 | 187,2017-02-02,Valeriebury,2,2911.0,747.0,Richard Wong,2450.0,747.0,747.0 95 | 188,2017-02-28,Angelabury,4,1389.0,1097.0,Carl Bailey,7793.0,1097.0,1097.0 96 | 190,2017-06-15,East Rebecca,17,1788.0,512.0,Gina Frey,3093.0,512.0,512.0 97 | 192,2017-08-14,South Margaretstad,3,929.0,-91.0,Stephen Palmer,5196.0,-91.0,-91.0 98 | 193,2017-06-20,Bairdmouth,11,2100.0,414.0,Brandon Salinas,1352.0,414.0,414.0 99 | 195,2017-06-27,Zacharyborough,9,936.0,964.0,Marc Barnett,2093.0,964.0,964.0 100 | 
197,2017-04-12,South Jason,17,2387.0,-304.0,Sarah Sandoval,3093.0,-304.0,-304.0 101 | 201,2017-10-31,West Johnfort,15,1623.0,107.0,Michael Trujillo,3384.0,107.0,107.0 102 | 202,2017-06-13,South Vincenthaven,1,1020.0,692.0,Lori Johnson,6232.0,692.0,692.0 103 | 203,2017-07-13,South Nancyview,7,681.0,-310.0,Lance Hurst,5061.0,-310.0,-310.0 104 | 205,2017-11-03,North Nicholas,11,2117.0,-304.0,Stacy Williams,1352.0,-304.0,-304.0 105 | 209,2017-11-23,Markmouth,8,1260.0,938.0,Kevin Mosley,2391.0,938.0,2442.0 106 | 210,2017-10-28,Wigginshaven,14,2574.0,1087.0,Jeffery Rodriguez,2274.0,1087.0,1087.0 107 | 212,2017-03-18,West Sean,4,2327.0,1581.0,Denise Ross,7793.0,1581.0,1581.0 108 | 213,2017-07-01,Harrisport,17,698.0,1113.0,Donna Ibarra,3093.0,1113.0,1113.0 109 | 215,2017-05-13,Lake Richardstad,6,2330.0,1274.0,Christopher Randall,6174.0,1274.0,1274.0 110 | 216,2017-05-05,Gonzalesburgh,3,196.0,1482.0,Kevin Madden,5196.0,1482.0,1482.0 111 | 217,2017-01-22,East Kirstenbury,11,513.0,810.0,Jeremiah Thompson,1352.0,810.0,810.0 112 | 218,2017-07-27,Justinside,18,2386.0,1490.0,Stephen Shaffer,7688.0,1490.0,1490.0 113 | 219,2017-05-26,Elizabethview,13,1375.0,16.0,Joseph Barron,3472.0,16.0,16.0 114 | 220,2017-03-09,North Adam,4,457.0,886.0,Robert Wilson,7793.0,886.0,886.0 115 | 221,2017-03-05,Port Matthewton,4,449.0,1427.0,Martin Chang,7793.0,1427.0,1427.0 116 | 222,2017-08-10,Lake Kimberlyport,16,2170.0,553.0,Dana Wilson,1430.0,553.0,553.0 117 | 223,2017-07-31,Heathershire,7,1941.0,650.0,Vanessa Burke,5061.0,650.0,650.0 118 | 224,2017-07-19,West Alex,10,1168.0,111.0,Barbara Savage,645.0,111.0,111.0 119 | 225,2018-01-02,Jeffreyside,17,81.0,173.0,Frank Hoffman,3093.0,173.0,173.0 120 | 226,2017-05-22,Port Dariusshire,18,1878.0,880.0,Mark Fowler,7688.0,880.0,880.0 121 | 227,2017-12-25,North Jeffrey,4,1806.0,391.0,James Davis,7793.0,391.0,391.0 122 | 229,2017-04-27,New Tylermouth,7,1727.0,203.0,Kenneth Elliott,5061.0,203.0,203.0 123 | 230,2017-12-22,West 
Vincentmouth,4,887.0,-132.0,Miguel Hunter,7793.0,-132.0,-132.0 124 | 232,2017-05-03,Port Monicaville,2,718.0,-400.0,Courtney Ward,2450.0,-400.0,-400.0 125 | 234,2017-05-20,Lake Sandrafurt,15,2557.0,320.0,Shawn James,3384.0,320.0,320.0 126 | 237,2017-01-11,Ponceview,19,1006.0,-399.0,Douglas Peters,4868.0,-399.0,-399.0 127 | 239,2017-08-23,Port Jacobborough,3,652.0,452.0,Jodi Watson,5196.0,452.0,452.0 128 | 240,2017-04-10,Samanthashire,9,2293.0,-184.0,Jason Harper,2093.0,-184.0,-184.0 129 | 242,2017-03-13,Hughesview,14,280.0,-241.0,Brenda Mcbride,2274.0,-241.0,-241.0 130 | 243,2017-09-01,North Karenton,7,549.0,567.0,Crystal Jennings,5061.0,567.0,567.0 131 | 244,2017-04-29,Port Emilyfurt,12,87.0,1301.0,Mrs. Nicole Huang,7906.0,1301.0,1301.0 132 | -------------------------------------------------------------------------------- /install_reqs.txt: -------------------------------------------------------------------------------- 1 | jupyter 2 | pandas 3 | scikit-learn 4 | scipy 5 | requests 6 | dedupe 7 | fuzzywuzzy 8 | dask 9 | graphviz 10 | voluptuous 11 | engarde 12 | tdda 13 | faker 14 | hypothesis 15 | dataset 16 | distributed 17 | bokeh 18 | tornado 19 | -------------------------------------------------------------------------------- /solutions/dask.py: -------------------------------------------------------------------------------- 1 | output = [] 2 | 3 | for loc in locations: 4 | issloc = delayed(get_spaceship_location)() 5 | next_pass = delayed(iss_pass_near_loc)(loc) 6 | output.append((loc.get('name'), next_pass)) 7 | 8 | earliest = delayed(lambda x: sorted(x, key=itemgetter(1))[0])(output) 9 | 10 | earliest.compute() 11 | -------------------------------------------------------------------------------- /solutions/dedupe.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | dupe_dict = {} 4 | 5 | for dupepair, confidence in dupes: 6 | dupe_dict[dupepair[0]] = {'pair': dupepair, 'confidence': 
confidence[0]} 7 | dupe_dict[dupepair[1]] = {'pair': dupepair, 'confidence': confidence[0]} 8 | 9 | customers['duplicate_pair'] = customers.index.map(lambda i: dupe_dict[i].get('pair') 10 | if i in dupe_dict else np.nan) 11 | customers['confidence'] = customers.index.map(lambda i: dupe_dict[i].get('confidence') 12 | if i in dupe_dict else np.nan) 13 | -------------------------------------------------------------------------------- /solutions/engarde.py: -------------------------------------------------------------------------------- 1 | @ed.has_dtypes(final_types) 2 | @ed.none_missing() 3 | def calculate_store_sales(sales): 4 | sales['store_total'] = sales.groupby('store_id').transform(sum)['sale_amount'] 5 | sales['associate_total'] = sales.groupby('associate').transform(sum)['sale_amount'] 6 | sales['city_total'] = sales.groupby('city')['sale_amount'].transform(sum) 7 | sales['store_total'] = pd.to_numeric(sales['store_total']) 8 | sales['city_total'] = pd.to_numeric(sales['city_total']) 9 | sales['associate_total'] = pd.to_numeric(sales['associate_total']) 10 | return sales 11 | -------------------------------------------------------------------------------- /solutions/hypothesis.py: -------------------------------------------------------------------------------- 1 | def parse_email(email): 2 | result = re.match('(?P[\.\w\-\!~#$%&\|{}\+\/\^\`\=\*\']+).(?P[\w\.\-]+)', email).groups() 3 | return result 4 | -------------------------------------------------------------------------------- /solutions/lobsters_dropped.py: -------------------------------------------------------------------------------- 1 | set(stories.columns) - set(stories.dropna(thresh=10, axis=1).columns) 2 | -------------------------------------------------------------------------------- /solutions/nulls.py: -------------------------------------------------------------------------------- 1 | rows_filled = 32357 - df[df.temperature.isnull() == True].shape[0] # taken from cell 15 2 | still_missing = 
df[df.temperature.isnull() == True].shape[0] / df.shape[0] 3 | -------------------------------------------------------------------------------- /solutions/preprocessing.py: -------------------------------------------------------------------------------- 1 | hvac['MinMaxScaledTemp'] = temp_minmax[:,0] 2 | hvac['MinMaxScaledTemp'].head() 3 | -------------------------------------------------------------------------------- /validation-notebooks/.gitignore: -------------------------------------------------------------------------------- 1 | *.*lock 2 | -------------------------------------------------------------------------------- /validation-notebooks/01 - Data Validation with Voluptuous.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Data Validation with Voluptuous (Schema Definitions)\n", 8 | "\n", 9 | "In this notebook, we'll use [Voluptuous](https://github.com/alecthomas/voluptuous) to define schemas for our data. We can then use schema checking at different points in our cleanup to ensure we meet criteria. We can then use schema validation exceptions to either mark, set aside or remove unclean / invalid data. 
" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": null, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "import logging\n", 19 | "import pandas as pd\n", 20 | "from datetime import datetime\n", 21 | "from voluptuous import Schema, Required, Range, All, ALLOW_EXTRA\n", 22 | "from voluptuous.error import MultipleInvalid, Invalid" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": null, 28 | "metadata": {}, 29 | "outputs": [], 30 | "source": [ 31 | "logger = logging.getLogger(0)\n", 32 | "logger.setLevel(logging.WARNING)" 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": null, 38 | "metadata": {}, 39 | "outputs": [], 40 | "source": [ 41 | "sales = pd.read_csv('../data/sales_data.csv')" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": null, 47 | "metadata": {}, 48 | "outputs": [], 49 | "source": [ 50 | "sales.head()" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "sales.dtypes" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": null, 65 | "metadata": {}, 66 | "outputs": [], 67 | "source": [ 68 | "sales['timestamp'].map(lambda x: datetime.strptime(x, \n", 69 | " '%Y-%m-%d %H:%M:%S'))" 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [ 76 | "### Data Quality Check" 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": null, 82 | "metadata": {}, 83 | "outputs": [], 84 | "source": [ 85 | "sales.head()" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": null, 91 | "metadata": {}, 92 | "outputs": [], 93 | "source": [ 94 | "sales.dtypes" 95 | ] 96 | }, 97 | { 98 | "cell_type": "markdown", 99 | "metadata": {}, 100 | "source": [ 101 | "## Defining our first schema" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": null, 107 | "metadata": {}, 108 | "outputs": 
[], 109 | "source": [ 110 | "schema = Schema({\n", 111 | " Required('sale_amount'): All(float, \n", 112 | " Range(min=2.50, max=1450.99)),\n", 113 | "}, extra=ALLOW_EXTRA)" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": null, 119 | "metadata": {}, 120 | "outputs": [], 121 | "source": [ 122 | "error_count = 0\n", 123 | "for s_id, sale in sales.T.to_dict().items():\n", 124 | " try:\n", 125 | " schema(sale)\n", 126 | " except MultipleInvalid as e:\n", 127 | " logging.warning('issue with sale: %s (%s) - %s', \n", 128 | " s_id, sale['sale_amount'], e)\n", 129 | " error_count += 1" 130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": null, 135 | "metadata": {}, 136 | "outputs": [], 137 | "source": [ 138 | "error_count" 139 | ] 140 | }, 141 | { 142 | "cell_type": "code", 143 | "execution_count": null, 144 | "metadata": {}, 145 | "outputs": [], 146 | "source": [ 147 | "sales.shape" 148 | ] 149 | }, 150 | { 151 | "cell_type": "markdown", 152 | "metadata": {}, 153 | "source": [ 154 | "### Questions we might want to answer:\n", 155 | "- Do we have an improperly defined schema?\n", 156 | "- Are negative values possibly returns or falsely marked? (data entry procedures)\n", 157 | "- Are higher values combined purchases or special sales? (or potentially fraud?)\n", 158 | "- What should we do with our schema and our failing data points?" 
159 | ] 160 | }, 161 | { 162 | "cell_type": "markdown", 163 | "metadata": { 164 | "collapsed": true 165 | }, 166 | "source": [ 167 | "### Adding a custom Validation Case" 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": null, 173 | "metadata": {}, 174 | "outputs": [], 175 | "source": [ 176 | "def ValidDate(fmt='%Y-%m-%d %H:%M:%S'):\n", 177 | " return lambda v: datetime.strptime(v, fmt)" 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": null, 183 | "metadata": {}, 184 | "outputs": [], 185 | "source": [ 186 | "schema = Schema({\n", 187 | " Required('timestamp'): All(ValidDate()),\n", 188 | "}, extra=ALLOW_EXTRA)" 189 | ] 190 | }, 191 | { 192 | "cell_type": "code", 193 | "execution_count": null, 194 | "metadata": {}, 195 | "outputs": [], 196 | "source": [ 197 | "error_count = 0\n", 198 | "for s_id, sale in sales.T.to_dict().items():\n", 199 | " try:\n", 200 | " schema(sale)\n", 201 | " except MultipleInvalid as e:\n", 202 | " logging.warning('issue with sale: %s (%s) - %s', \n", 203 | " s_id, sale['timestamp'], e)\n", 204 | " error_count += 1" 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": null, 210 | "metadata": {}, 211 | "outputs": [], 212 | "source": [ 213 | "error_count" 214 | ] 215 | }, 216 | { 217 | "cell_type": "markdown", 218 | "metadata": {}, 219 | "source": [ 220 | "## So we have valid date structures, what about actual valid dates?" 221 | ] 222 | }, 223 | { 224 | "cell_type": "code", 225 | "execution_count": null, 226 | "metadata": {}, 227 | "outputs": [], 228 | "source": [ 229 | "def ValidDate(fmt='%Y-%m-%d %H:%M:%S'):\n", 230 | " def validation_func(v):\n", 231 | " try:\n", 232 | " assert datetime.strptime(v, fmt) <= datetime.now()\n", 233 | " except AssertionError:\n", 234 | " raise Invalid('date is in the future! 
%s' % v)\n", 235 | " return validation_func" 236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": null, 241 | "metadata": {}, 242 | "outputs": [], 243 | "source": [ 244 | "schema = Schema({\n", 245 | " Required('timestamp'): All(ValidDate()),\n", 246 | "}, extra=ALLOW_EXTRA)" 247 | ] 248 | }, 249 | { 250 | "cell_type": "code", 251 | "execution_count": null, 252 | "metadata": {}, 253 | "outputs": [], 254 | "source": [ 255 | "error_count = 0\n", 256 | "for s_id, sale in sales.T.to_dict().items():\n", 257 | " try:\n", 258 | " schema(sale)\n", 259 | " except MultipleInvalid as e:\n", 260 | " logging.warning('issue with sale: %s (%s) - %s', \n", 261 | " s_id, sale['timestamp'], e)\n", 262 | " error_count += 1" 263 | ] 264 | }, 265 | { 266 | "cell_type": "code", 267 | "execution_count": null, 268 | "metadata": {}, 269 | "outputs": [], 270 | "source": [ 271 | "error_count" 272 | ] 273 | }, 274 | { 275 | "cell_type": "markdown", 276 | "metadata": { 277 | "collapsed": true 278 | }, 279 | "source": [ 280 | "## Exercise: what are some possible reasons for future dates? What should we do with the data and schema?" 
281 | ] 282 | }, 283 | { 284 | "cell_type": "code", 285 | "execution_count": null, 286 | "metadata": {}, 287 | "outputs": [], 288 | "source": [] 289 | } 290 | ], 291 | "metadata": { 292 | "kernelspec": { 293 | "display_name": "Python 3", 294 | "language": "python", 295 | "name": "python3" 296 | }, 297 | "language_info": { 298 | "codemirror_mode": { 299 | "name": "ipython", 300 | "version": 3 301 | }, 302 | "file_extension": ".py", 303 | "mimetype": "text/x-python", 304 | "name": "python", 305 | "nbconvert_exporter": "python", 306 | "pygments_lexer": "ipython3", 307 | "version": "3.6.6" 308 | } 309 | }, 310 | "nbformat": 4, 311 | "nbformat_minor": 2 312 | } 313 | -------------------------------------------------------------------------------- /validation-notebooks/02 - Dataframe Validation with Engarde.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Pandas DataFrame Validation with Engarde\n", 8 | "\n", 9 | "In this notebook, we'll take a look at how to validate data within `pandas.DataFrame` objects. Tom Augspurger has created the library [engarde](https://github.com/TomAugspurger/engarde), which allows you to write both function decorators or utilize built-in functions to test your DataFrame with specific validation rules or definitions." 
10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": null, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "import pandas as pd\n", 19 | "import engarde.decorators as ed\n", 20 | "from datetime import datetime" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": null, 26 | "metadata": {}, 27 | "outputs": [], 28 | "source": [ 29 | "sales = pd.read_csv('../data/sales_data_duped_with_nulls.csv')" 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": {}, 35 | "source": [ 36 | "## Data Quality Check" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": null, 42 | "metadata": {}, 43 | "outputs": [], 44 | "source": [ 45 | "sales.head()" 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": null, 51 | "metadata": {}, 52 | "outputs": [], 53 | "source": [ 54 | "sales.dtypes" 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": {}, 60 | "source": [ 61 | "### Engarde lets us track datatypes, so first we record the dtypes we expect after the first function runs -- including anything that function will change" 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": null, 67 | "metadata": {}, 68 | "outputs": [], 69 | "source": [ 70 | "new_dtypes = {\n", 71 | " 'timestamp': object,\n", 72 | " 'city': object,\n", 73 | " 'store_id': int,\n", 74 | " 'sale_number': float,\n", 75 | " 'sale_amount': float,\n", 76 | " 'associate': object\n", 77 | "}" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": null, 83 | "metadata": {}, 84 | "outputs": [], 85 | "source": [ 86 | "@ed.has_dtypes(new_dtypes)\n", 87 | "@ed.is_shape((None, 6))\n", 88 | "def update_dtypes(sales):\n", 89 | " sales.timestamp = sales.timestamp.map(\n", 90 | " lambda x: datetime.strptime(\n", 91 | " x, '%Y-%m-%dT%H:%M:%S').date())\n", 92 | " return sales" 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": null, 98 | "metadata": {}, 99 | 
"outputs": [], 100 | "source": [ 101 | "sales = update_dtypes(sales)" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": null, 107 | "metadata": {}, 108 | "outputs": [], 109 | "source": [ 110 | "sales.timestamp.iloc[0]" 111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "metadata": {}, 116 | "source": [ 117 | "## Now we want to remove poor quality data, so let's drop duplicates and any rows missing values in important columns we might need later" 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": null, 123 | "metadata": {}, 124 | "outputs": [], 125 | "source": [ 126 | "@ed.has_dtypes(new_dtypes)\n", 127 | "@ed.is_shape((None, 6))\n", 128 | "@ed.none_missing()\n", 129 | "def remove_poor_quality_data(sales):\n", 130 | " sales = sales.drop_duplicates()\n", 131 | " sales = sales.dropna(subset=['sale_amount', 'store_id', \n", 132 | " 'sale_number', \n", 133 | " 'city', 'associate'])\n", 134 | " return sales" 135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": null, 140 | "metadata": {}, 141 | "outputs": [], 142 | "source": [ 143 | "sales = remove_poor_quality_data(sales)" 144 | ] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "execution_count": null, 149 | "metadata": {}, 150 | "outputs": [], 151 | "source": [ 152 | "sales.isnull().any()" 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": null, 158 | "metadata": {}, 159 | "outputs": [], 160 | "source": [ 161 | "final_types = new_dtypes.copy()\n", 162 | "final_types.update({\n", 163 | " 'store_total': float,\n", 164 | " 'associate_total': float,\n", 165 | " 'city_total': float\n", 166 | "})" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": null, 172 | "metadata": {}, 173 | "outputs": [], 174 | "source": [ 175 | "@ed.has_dtypes(final_types)\n", 176 | "@ed.none_missing()\n", 177 | "def calculate_store_sales(sales):\n", 178 | " sales['store_total'] = sales.groupby(\n", 179 | " 
'store_id').transform(sum)['sale_amount']\n", 180 | " sales['associate_total'] = sales.groupby(\n", 181 | " 'associate').transform(sum)['sale_amount']\n", 182 | " sales['city_total'] = sales.groupby('city')[\n", 183 | " 'sale_amount'].transform(sum)\n", 184 | " return sales" 185 | ] 186 | }, 187 | { 188 | "cell_type": "code", 189 | "execution_count": null, 190 | "metadata": {}, 191 | "outputs": [], 192 | "source": [ 193 | "sales.head()" 194 | ] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "execution_count": null, 199 | "metadata": {}, 200 | "outputs": [], 201 | "source": [ 202 | "sales = calculate_store_sales(sales)" 203 | ] 204 | }, 205 | { 206 | "cell_type": "markdown", 207 | "metadata": {}, 208 | "source": [ 209 | "## Exercise: Can you fix the above error?" 210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "execution_count": null, 215 | "metadata": {}, 216 | "outputs": [], 217 | "source": [] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": null, 222 | "metadata": {}, 223 | "outputs": [], 224 | "source": [ 225 | "# %load ../solutions/engarde.py\n" 226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": null, 231 | "metadata": {}, 232 | "outputs": [], 233 | "source": [ 234 | "sales = calculate_store_sales(sales)" 235 | ] 236 | }, 237 | { 238 | "cell_type": "code", 239 | "execution_count": null, 240 | "metadata": {}, 241 | "outputs": [], 242 | "source": [ 243 | "sales" 244 | ] 245 | }, 246 | { 247 | "cell_type": "code", 248 | "execution_count": null, 249 | "metadata": {}, 250 | "outputs": [], 251 | "source": [ 252 | "@ed.is_shape((None, 9))\n", 253 | "def save_report(sales):\n", 254 | " sales.to_csv('../data/sales_summary.csv')\n", 255 | " return sales" 256 | ] 257 | }, 258 | { 259 | "cell_type": "code", 260 | "execution_count": null, 261 | "metadata": {}, 262 | "outputs": [], 263 | "source": [ 264 | "sales = save_report(sales)" 265 | ] 266 | }, 267 | { 268 | "cell_type": "code", 269 | "execution_count": null, 
270 | "metadata": {}, 271 | "outputs": [], 272 | "source": [ 273 | "sales.dtypes" 274 | ] 275 | }, 276 | { 277 | "cell_type": "code", 278 | "execution_count": null, 279 | "metadata": {}, 280 | "outputs": [], 281 | "source": [] 282 | } 283 | ], 284 | "metadata": { 285 | "kernelspec": { 286 | "display_name": "Python 3", 287 | "language": "python", 288 | "name": "python3" 289 | }, 290 | "language_info": { 291 | "codemirror_mode": { 292 | "name": "ipython", 293 | "version": 3 294 | }, 295 | "file_extension": ".py", 296 | "mimetype": "text/x-python", 297 | "name": "python", 298 | "nbconvert_exporter": "python", 299 | "pygments_lexer": "ipython3", 300 | "version": "3.6.6" 301 | } 302 | }, 303 | "nbformat": 4, 304 | "nbformat_minor": 2 305 | } 306 | -------------------------------------------------------------------------------- /validation-notebooks/03 - TDDA.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## TDDA: Test-Driven Data Analysis\n", 8 | "\n", 9 | "In this notebook, we'll review a Python library: [TDDA](https://github.com/tdda/tdda), which takes data inputs (such as NumPy arrays or Pandas DataFrames) and builds a set of constraints around them. You can then save your constraints (JSON output) and test new data against observed constraints." 
10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": null, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "import pandas as pd\n", 19 | "import numpy as np\n", 20 | "from tdda.constraints.pdconstraints import discover_constraints, \\\n", 21 | " verify_df" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": null, 27 | "metadata": {}, 28 | "outputs": [], 29 | "source": [ 30 | "df = pd.read_csv('../data/iot_example.csv')" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "## Basic Data Quality Check" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": null, 43 | "metadata": {}, 44 | "outputs": [], 45 | "source": [ 46 | "df.sample(10)" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": null, 52 | "metadata": {}, 53 | "outputs": [], 54 | "source": [ 55 | "df.dtypes" 56 | ] 57 | }, 58 | { 59 | "cell_type": "markdown", 60 | "metadata": {}, 61 | "source": [ 62 | "## Use `discover_constraints` to build the constraint object" 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": null, 68 | "metadata": {}, 69 | "outputs": [], 70 | "source": [ 71 | "constraints = discover_constraints(df)" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": null, 77 | "metadata": {}, 78 | "outputs": [], 79 | "source": [ 80 | "constraints" 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": null, 86 | "metadata": {}, 87 | "outputs": [], 88 | "source": [ 89 | "constraints.fields" 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "metadata": {}, 95 | "source": [ 96 | "## Now write the constraints to a file" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": null, 102 | "metadata": {}, 103 | "outputs": [], 104 | "source": [ 105 | "with open('../data/ignore-iot_constraints.tdda', 'w') as f:\n", 106 | " f.write(constraints.to_json())" 107 | ] 108 | }, 109 | { 110 | 
"cell_type": "code", 111 | "execution_count": null, 112 | "metadata": {}, 113 | "outputs": [], 114 | "source": [ 115 | "cat ../data/ignore-iot_constraints.tdda" 116 | ] 117 | }, 118 | { 119 | "cell_type": "markdown", 120 | "metadata": {}, 121 | "source": [ 122 | "## Exercise: what types of constraints are being extracted? How does this compare with defining your own schema?" 123 | ] 124 | }, 125 | { 126 | "cell_type": "markdown", 127 | "metadata": {}, 128 | "source": [ 129 | "### Now, let's read in our other IoT dataset :D (can anyone guess what will happen?)" 130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": null, 135 | "metadata": {}, 136 | "outputs": [], 137 | "source": [ 138 | "new_df = pd.read_csv('../data/iot_example_with_nulls.csv')" 139 | ] 140 | }, 141 | { 142 | "cell_type": "markdown", 143 | "metadata": {}, 144 | "source": [ 145 | "## We pass the new dataframe to `verify_df`, along with the filepath to our saved constraints." 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": null, 151 | "metadata": {}, 152 | "outputs": [], 153 | "source": [ 154 | "v = verify_df(new_df, '../data/ignore-iot_constraints.tdda')" 155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "metadata": {}, 160 | "source": [ 161 | "## We can now test passes, failures and look at the output" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": null, 167 | "metadata": {}, 168 | "outputs": [], 169 | "source": [ 170 | "v" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": null, 176 | "metadata": {}, 177 | "outputs": [], 178 | "source": [ 179 | "v.passes" 180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": null, 185 | "metadata": {}, 186 | "outputs": [], 187 | "source": [ 188 | "v.failures" 189 | ] 190 | }, 191 | { 192 | "cell_type": "code", 193 | "execution_count": null, 194 | "metadata": {}, 195 | "outputs": [], 196 | "source": [ 197 | 
"print(str(v))" 198 | ] 199 | }, 200 | { 201 | "cell_type": "markdown", 202 | "metadata": {}, 203 | "source": [ 204 | "## In addition, we can take a look at the passes and failures in a dataframe" 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": null, 210 | "metadata": {}, 211 | "outputs": [], 212 | "source": [ 213 | "v.to_frame()" 214 | ] 215 | }, 216 | { 217 | "cell_type": "markdown", 218 | "metadata": { 219 | "collapsed": true 220 | }, 221 | "source": [ 222 | "## Exercise: How could we fix the schema or separate data so all tests pass?" 223 | ] 224 | }, 225 | { 226 | "cell_type": "code", 227 | "execution_count": null, 228 | "metadata": {}, 229 | "outputs": [], 230 | "source": [] 231 | } 232 | ], 233 | "metadata": { 234 | "kernelspec": { 235 | "display_name": "Python 3", 236 | "language": "python", 237 | "name": "python3" 238 | }, 239 | "language_info": { 240 | "codemirror_mode": { 241 | "name": "ipython", 242 | "version": 3 243 | }, 244 | "file_extension": ".py", 245 | "mimetype": "text/x-python", 246 | "name": "python", 247 | "nbconvert_exporter": "python", 248 | "pygments_lexer": "ipython3", 249 | "version": "3.6.6" 250 | } 251 | }, 252 | "nbformat": 4, 253 | "nbformat_minor": 2 254 | } 255 | -------------------------------------------------------------------------------- /validation-notebooks/04 - Hypothesis.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### Hypothesis: Property-Based Testing\n", 8 | "\n", 9 | "In this notebook, we use property based testing to find issues in our code. [Hypothesis](https://hypothesis.readthedocs.io/en/latest/) is a great library written primarily by [David MacIver](http://www.drmaciver.com/). 
It introduces a methodology similar to [Haskell's QuickCheck](https://hackage.haskell.org/package/QuickCheck), but in Python -- hooray!\n", 10 | "\n", 11 | "The documentation is incredibly useful and the library is used by many other Python libraries you may know and love. It also has the ability to mock and test `numpy` datatypes. I recommend it *especially* if you are writing any libraries or have shared code across teams (like utils or helpers and so forth)." 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": null, 17 | "metadata": {}, 18 | "outputs": [], 19 | "source": [ 20 | "from hypothesis import given, assume\n", 21 | "from hypothesis.strategies import tuples, integers, emails\n", 22 | "import re" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "### First, we need to write our function, which takes a tuple and finds the range" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": null, 35 | "metadata": {}, 36 | "outputs": [], 37 | "source": [ 38 | "def calculate_range(tuple_obj):\n", 39 | " return max(tuple_obj) - min(tuple_obj)" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "## Then, we can define our test and our rules using Hypothesis `strategies` and `given`" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": null, 52 | "metadata": {}, 53 | "outputs": [], 54 | "source": [ 55 | "@given(tuples(integers(), integers(), integers()))\n", 56 | "def test_calculate_range(tup):\n", 57 | " result = calculate_range(tup)\n", 58 | " assert isinstance(result, int)\n", 59 | " assert result > 0" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": null, 65 | "metadata": { 66 | "scrolled": false 67 | }, 68 | "outputs": [], 69 | "source": [ 70 | "test_calculate_range()" 71 | ] 72 | }, 73 | { 74 | "cell_type": "markdown", 75 | "metadata": {}, 76 | "source": [ 77 | "## We can fix our test by adding 
`>=`" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": null, 83 | "metadata": {}, 84 | "outputs": [], 85 | "source": [ 86 | "@given(tuples(integers(), integers()))\n", 87 | "def test_calculate_range(tup):\n", 88 | " result = calculate_range(tup)\n", 89 | " assert isinstance(result, int)\n", 90 | " assert result >= 0" 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": null, 96 | "metadata": {}, 97 | "outputs": [], 98 | "source": [ 99 | "test_calculate_range()" 100 | ] 101 | }, 102 | { 103 | "cell_type": "markdown", 104 | "metadata": {}, 105 | "source": [ 106 | "## You can also use Hypothesis alongside `faker` a library which generates mock data based on specified types." 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": null, 112 | "metadata": {}, 113 | "outputs": [], 114 | "source": [ 115 | "def parse_email(email):\n", 116 | " result = re.match('(?P\\w+)@(?P\\w+)', \n", 117 | " email).groups()\n", 118 | " return result" 119 | ] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "execution_count": null, 124 | "metadata": {}, 125 | "outputs": [], 126 | "source": [ 127 | "@given(emails())\n", 128 | "def test_parse_email(email):\n", 129 | " result = parse_email(email)\n", 130 | " #print(result)\n", 131 | " assert len(result) == 2\n", 132 | " assert '.' 
in result[1]" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": null, 138 | "metadata": {}, 139 | "outputs": [], 140 | "source": [ 141 | "test_parse_email()" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": null, 147 | "metadata": {}, 148 | "outputs": [], 149 | "source": [ 150 | "def parse_email(email):\n", 151 | " result = re.match('(?P\\w+).(?P[\\w\\.]+)', \n", 152 | " email).groups()\n", 153 | " return result" 154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "execution_count": null, 159 | "metadata": {}, 160 | "outputs": [], 161 | "source": [ 162 | "test_parse_email()" 163 | ] 164 | }, 165 | { 166 | "cell_type": "markdown", 167 | "metadata": {}, 168 | "source": [ 169 | "## Exercise: can you fix the regex for this error?" 170 | ] 171 | }, 172 | { 173 | "cell_type": "code", 174 | "execution_count": null, 175 | "metadata": {}, 176 | "outputs": [], 177 | "source": [] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": null, 182 | "metadata": {}, 183 | "outputs": [], 184 | "source": [ 185 | "# %load ../solutions/hypothesis.py\n", 186 | "\n" 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": null, 192 | "metadata": {}, 193 | "outputs": [], 194 | "source": [] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "execution_count": null, 199 | "metadata": {}, 200 | "outputs": [], 201 | "source": [ 202 | "test_parse_email()" 203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "execution_count": null, 208 | "metadata": {}, 209 | "outputs": [], 210 | "source": [] 211 | } 212 | ], 213 | "metadata": { 214 | "kernelspec": { 215 | "display_name": "Python 3", 216 | "language": "python", 217 | "name": "python3" 218 | }, 219 | "language_info": { 220 | "codemirror_mode": { 221 | "name": "ipython", 222 | "version": 3 223 | }, 224 | "file_extension": ".py", 225 | "mimetype": "text/x-python", 226 | "name": "python", 227 | "nbconvert_exporter": "python", 228 | "pygments_lexer": 
"ipython3", 229 | "version": "3.6.6" 230 | } 231 | }, 232 | "nbformat": 4, 233 | "nbformat_minor": 2 234 | } 235 | -------------------------------------------------------------------------------- /validation-notebooks/Case Study - Basic Queued Pipeline with Validation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Case Study: Queued Pipeline with Dask Distributed + Schema Validation\n", 8 | "\n", 9 | "In this case study, we will take a look at a pipeline which takes in router environment variables (like temperature and fan RPM) and determines whether they are outside of normal ranges. \n", 10 | "\n", 11 | "We will define a schema in [Voluptuous](https://github.com/alecthomas/voluptuous) to set the thresholds we expect to see and use [Dask Distributed](http://distributed.readthedocs.io/en/latest/index.html) to schedule and distribute the work across several workers. We use [dataset](https://dataset.readthedocs.io/en/latest/) to make a quick `sqlite3` database to store our output." 
12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": null, 17 | "metadata": {}, 18 | "outputs": [], 19 | "source": [ 20 | "import logging\n", 21 | "import random\n", 22 | "import dataset\n", 23 | "import sys\n", 24 | "from time import sleep\n", 25 | "from datetime import datetime\n", 26 | "from queue import Queue\n", 27 | "from queue_example import generate_example, generate_machine_db\n", 28 | "from distributed import Client\n", 29 | "from voluptuous import Schema, Required, Range, All, ALLOW_EXTRA\n", 30 | "from voluptuous.error import MultipleInvalid" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": null, 36 | "metadata": {}, 37 | "outputs": [], 38 | "source": [ 39 | "logger = logging.getLogger(0)\n", 40 | "logger.setLevel(logging.WARNING)" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": {}, 46 | "source": [ 47 | "### We set up a Queue and start adding events via another thread. This will keep running until we stop the notebook." 
48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": null, 53 | "metadata": {}, 54 | "outputs": [], 55 | "source": [ 56 | "queue = Queue()\n", 57 | "db = dataset.connect('sqlite:///output_db.db')\n", 58 | "table = db['readings']" 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": null, 64 | "metadata": {}, 65 | "outputs": [], 66 | "source": [ 67 | "def load_data(input_q):\n", 68 | " while True:\n", 69 | " input_q.put(generate_example())\n", 70 | " sleep(1)" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": null, 76 | "metadata": {}, 77 | "outputs": [], 78 | "source": [ 79 | "from threading import Thread\n", 80 | "load_thread = Thread(target=load_data, args=(queue,))\n", 81 | "load_thread.start()" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": null, 87 | "metadata": {}, 88 | "outputs": [], 89 | "source": [ 90 | "generate_example()" 91 | ] 92 | }, 93 | { 94 | "cell_type": "markdown", 95 | "metadata": {}, 96 | "source": [ 97 | "## Then, we define our schema in Voluptuous" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": null, 103 | "metadata": {}, 104 | "outputs": [], 105 | "source": [ 106 | "schema = Schema({\n", 107 | " Required('AmbientTemp'): All(float, Range(min=3, max=40)),\n", 108 | " Required('Fan'): All(int, Range(min=100, max=2000)),\n", 109 | " Required('CpuTemp'): All(float, Range(min=5, max=50)),\n", 110 | "}, extra=ALLOW_EXTRA)" 111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "metadata": {}, 116 | "source": [ 117 | "### Now, we need to start our scheduler and workers.\n", 118 | "\n", 119 | "### Commands (if you are in the environment where you installed distributed AND in this folder, validation-notebooks): \n", 120 | " - To start the scheduler: dask-scheduler\n", 121 | " - Then, in a terminal, navigate to validation-notebooks and run an export to add the path to your PYTHONPATH\n", 122 | " i.e. 
export PYTHONPATH=$PYTHONPATH:/path/to/validation/notebooks\n", 123 | " - In that same terminal, start a worker: dask-worker SCHEDULER_IP:SCHEDULER_PORT (most often 127.0.0.1:8786)\n", 124 | "\n", 125 | "To view the Bokeh application: click on the Web UI link (usually: http://127.0.0.1:8787/status/ )\n", 126 | "\n", 127 | "#### Note: you need to start your workers in *this* directory, or copy the `queue_example.py` file to an accessible place." 128 | ] 129 | }, 130 | { 131 | "cell_type": "code", 132 | "execution_count": null, 133 | "metadata": {}, 134 | "outputs": [], 135 | "source": [ 136 | "client = Client('127.0.0.1:8786')\n", 137 | "client" 138 | ] 139 | }, 140 | { 141 | "cell_type": "markdown", 142 | "metadata": {}, 143 | "source": [ 144 | "## Now we can define our pipeline functions:\n", 145 | "\n", 146 | "1. Test the schema, adding warnings if we find schema failures.\n", 147 | "2. Add some extra machine information from our machine database*. (*just a dict, but use your imagination)\n", 148 | "3. 
Insert our reading into our database of readings" 149 | ] 150 | }, 151 | { 152 | "cell_type": "code", 153 | "execution_count": null, 154 | "metadata": {}, 155 | "outputs": [], 156 | "source": [ 157 | "def test_schema(reading):\n", 158 | " try:\n", 159 | " schema(reading)\n", 160 | " reading['warning'] = False\n", 161 | " except MultipleInvalid as e:\n", 162 | " logger.warning('SCHEMA: Issue with router %s (%s)', \n", 163 | " reading.get('MachineId'), e)\n", 164 | " reading['warning'] = True\n", 165 | " return reading" 166 | ] 167 | }, 168 | { 169 | "cell_type": "code", 170 | "execution_count": null, 171 | "metadata": {}, 172 | "outputs": [], 173 | "source": [ 174 | "def add_machine_info(reading):\n", 175 | " mdb = generate_machine_db()\n", 176 | " reading['brand'] = mdb[reading['MachineId']]\n", 177 | " return reading" 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": null, 183 | "metadata": {}, 184 | "outputs": [], 185 | "source": [ 186 | "def add_reading(reading):\n", 187 | " db = dataset.connect('sqlite:///output_db.db',\n", 188 | " engine_kwargs={'connect_args': \n", 189 | " {'check_same_thread':False}})\n", 190 | " table = db['readings']\n", 191 | " reading['processed_at'] = datetime.now()\n", 192 | " table.insert(reading)\n", 193 | " return reading" 194 | ] 195 | }, 196 | { 197 | "cell_type": "markdown", 198 | "metadata": {}, 199 | "source": [ 200 | "## To begin, we scatter the queue from the data to our workers:" 201 | ] 202 | }, 203 | { 204 | "cell_type": "code", 205 | "execution_count": null, 206 | "metadata": {}, 207 | "outputs": [], 208 | "source": [ 209 | "remote_q = client.scatter(queue)" 210 | ] 211 | }, 212 | { 213 | "cell_type": "markdown", 214 | "metadata": {}, 215 | "source": [ 216 | "### Then, we create a series of `map` functions, passing the futures objects to the next step of the pipeline. At the end we `gather` the data into one queue." 
217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": null, 222 | "metadata": {}, 223 | "outputs": [], 224 | "source": [ 225 | "schema_q = client.map(test_schema, remote_q)\n", 226 | "info_q = client.map(add_machine_info, schema_q)\n", 227 | "insert_q = client.map(add_reading, info_q)\n", 228 | "final = client.gather(insert_q)" 229 | ] 230 | }, 231 | { 232 | "cell_type": "markdown", 233 | "metadata": { 234 | "collapsed": true 235 | }, 236 | "source": [ 237 | "### Then, we can collect the data using `get`\n", 238 | "\n", 239 | "#### Note: you can watch errors and logging in the worker processes. And make sure to check the Bokeh Status Web UI!" 240 | ] 241 | }, 242 | { 243 | "cell_type": "code", 244 | "execution_count": null, 245 | "metadata": { 246 | "scrolled": false 247 | }, 248 | "outputs": [], 249 | "source": [ 250 | "count = 0\n", 251 | "while count < 40:\n", 252 | " item = final.get()\n", 253 | " print(item)\n", 254 | " print('Queue size: ', queue.qsize())\n", 255 | " count += 1" 256 | ] 257 | }, 258 | { 259 | "cell_type": "code", 260 | "execution_count": null, 261 | "metadata": {}, 262 | "outputs": [], 263 | "source": [ 264 | "db = dataset.connect('sqlite:///output_db.db')\n", 265 | "table = db['readings']\n", 266 | "table.count()" 267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "execution_count": null, 272 | "metadata": {}, 273 | "outputs": [], 274 | "source": [ 275 | "warnings = db.query('''SELECT COUNT(*) as cnt FROM readings \n", 276 | " where warning == 1''')" 277 | ] 278 | }, 279 | { 280 | "cell_type": "code", 281 | "execution_count": null, 282 | "metadata": {}, 283 | "outputs": [], 284 | "source": [ 285 | "list(warnings)[0].get('cnt')" 286 | ] 287 | }, 288 | { 289 | "cell_type": "markdown", 290 | "metadata": { 291 | "collapsed": true 292 | }, 293 | "source": [ 294 | "### Exercise: did you see any other errors (or can you spot a potential error when checking the `queue_example.py` file?) How might we prevent the error?" 
295 | ] 296 | }, 297 | { 298 | "cell_type": "code", 299 | "execution_count": null, 300 | "metadata": {}, 301 | "outputs": [], 302 | "source": [] 303 | } 304 | ], 305 | "metadata": { 306 | "kernelspec": { 307 | "display_name": "Python 3", 308 | "language": "python", 309 | "name": "python3" 310 | }, 311 | "language_info": { 312 | "codemirror_mode": { 313 | "name": "ipython", 314 | "version": 3 315 | }, 316 | "file_extension": ".py", 317 | "mimetype": "text/x-python", 318 | "name": "python", 319 | "nbconvert_exporter": "python", 320 | "pygments_lexer": "ipython3", 321 | "version": "3.6.6" 322 | } 323 | }, 324 | "nbformat": 4, 325 | "nbformat_minor": 2 326 | } 327 | -------------------------------------------------------------------------------- /validation-notebooks/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kjam/data-cleaning-101/a1e3d4159c65fdea8abf935ca6a2fce2fa64c11c/validation-notebooks/__init__.py -------------------------------------------------------------------------------- /validation-notebooks/queue_example.py: -------------------------------------------------------------------------------- 1 | import os 2 | import random 3 | import sys 4 | from time import sleep 5 | from datetime import datetime 6 | 7 | # PYTHON_PATH is an issue for some OS re: dask execution 8 | sys.path.append(os.path.dirname(__file__)) 9 | 10 | def generate_machine_db(): 11 | return {mid: generate_brand() for mid in range(100)} 12 | 13 | def generate_brand(): 14 | return random.choice(['Cisco', 'Juniper', 'Arista']) 15 | 16 | def generate_example(): 17 | sleep(random.random()) 18 | return { 19 | 'MachineId': random.randint(0,110), # yes, 110 intentionally >.< 20 | 'AmbientTemp': random.randint(2,45) + random.random(), 21 | 'Fan': random.randint(60, 2500), 22 | 'CpuTemp': random.randint(2,55) + random.random(), 23 | 'ObservationTime': datetime.now().isoformat(), 24 | } 25 | 
--------------------------------------------------------------------------------
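A note on the case study exercise: `generate_example()` draws `MachineId` from 0 to 110, while `generate_machine_db()` only covers ids 0 to 99, so the notebook's `add_machine_info` can raise a `KeyError` on a worker. Below is a minimal sketch of one possible guard, not the course's official solution. The two generator functions are copied from `queue_example.py` so the block runs on its own; the optional `machine_db` parameter and the `'unknown'` placeholder brand are assumptions made for illustration.

```python
import random


def generate_brand():
    # Copied from queue_example.py so this sketch is self-contained.
    return random.choice(['Cisco', 'Juniper', 'Arista'])


def generate_machine_db():
    # Copied from queue_example.py: only machine ids 0..99 are known.
    return {mid: generate_brand() for mid in range(100)}


def add_machine_info(reading, machine_db=None):
    """Attach the brand, flagging unknown machine ids instead of raising KeyError.

    `machine_db` is a hypothetical optional argument (not in the notebook)
    that makes the function easy to test with a fixed lookup table.
    """
    mdb = machine_db if machine_db is not None else generate_machine_db()
    brand = mdb.get(reading.get('MachineId'))
    if brand is None:
        # Assumed convention: mark the reading rather than dropping it.
        reading['brand'] = 'unknown'
        reading['warning'] = True
    else:
        reading['brand'] = brand
    return reading
```

Called with `{'MachineId': 105}`, this returns the reading with `brand` set to `'unknown'` and `warning` set to `True`, so the row still lands in the database and shows up in the warning count. Another option would be to validate `MachineId` in the Voluptuous schema itself, for example with a `Range` on the id.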