├── .gitignore
├── Intro_to_ML.html
├── Intro_to_ML.ipynb
├── LICENSE
├── README.md
├── images
│   ├── Insight_small.png
│   ├── clustering.png
│   ├── density_est.png
│   ├── dim_redux.png
│   ├── future_labs.png
│   ├── l2.png
│   ├── logistic2.png
│   ├── logistic_loss2.png
│   ├── lr_sklearn.png
│   ├── overfitting.png
│   ├── supervised.png
│   ├── supervised_vs_unsupervised.png
│   ├── train_test_valid.png
│   └── what_is_ML.png
└── resources
    ├── __init__.py
    ├── bank_marketing.csv
    └── plot_utils.py
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 |
6 | # C extensions
7 | *.so
8 |
9 | # Distribution / packaging
10 | .Python
11 | env/
12 | build/
13 | develop-eggs/
14 | dist/
15 | downloads/
16 | eggs/
17 | .eggs/
18 | lib/
19 | lib64/
20 | parts/
21 | sdist/
22 | var/
23 | wheels/
24 | *.egg-info/
25 | .installed.cfg
26 | *.egg
27 |
28 | # PyInstaller
29 | # Usually these files are written by a python script from a template
30 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
31 | *.manifest
32 | *.spec
33 |
34 | # Installer logs
35 | pip-log.txt
36 | pip-delete-this-directory.txt
37 |
38 | # Unit test / coverage reports
39 | htmlcov/
40 | .tox/
41 | .coverage
42 | .coverage.*
43 | .cache
44 | nosetests.xml
45 | coverage.xml
46 | *.cover
47 | .hypothesis/
48 |
49 | # Translations
50 | *.mo
51 | *.pot
52 |
53 | # Django stuff:
54 | *.log
55 | local_settings.py
56 |
57 | # Flask stuff:
58 | instance/
59 | .webassets-cache
60 |
61 | # Scrapy stuff:
62 | .scrapy
63 |
64 | # Sphinx documentation
65 | docs/_build/
66 |
67 | # PyBuilder
68 | target/
69 |
70 | # Jupyter Notebook
71 | .ipynb_checkpoints
72 |
73 | # pyenv
74 | .python-version
75 |
76 | # celery beat schedule file
77 | celerybeat-schedule
78 |
79 | # SageMath parsed files
80 | *.sage.py
81 |
82 | # dotenv
83 | .env
84 |
85 | # virtualenv
86 | .venv
87 | venv/
88 | ENV/
89 |
90 | # Spyder project settings
91 | .spyderproject
92 | .spyproject
93 |
94 | # Rope project settings
95 | .ropeproject
96 |
97 | # mkdocs documentation
98 | /site
99 |
100 | # mypy
101 | .mypy_cache/
102 |
--------------------------------------------------------------------------------
/Intro_to_ML.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Machine Learning Primer\n",
8 | "## [Future Labs AI Summit, Oct. 30, 10-10:45am](http://aisummit2017.futurelabs.engineering.nyu.edu)\n",
9 | "[](http://futurelabs.nyc/)\n",
10 | "## [Ross Fadely, AI Lead](https://www.linkedin.com/in/rossfadely/), [Insight Data Science](http://insightdata.ai)\n",
11 | "### ross [at] insightdata [dot] ai\n",
12 | "[](http://insightdata.ai)\n",
13 | "\n",
14 | ""
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "# What is (and isn't) Machine Learning\n",
22 | ""
23 | ]
24 | },
25 | {
26 | "cell_type": "markdown",
27 | "metadata": {},
28 | "source": [
29 | "# Breaking down the Machine Learning Landscape\n",
30 | "\n",
31 | "\n",
32 | "# Supervised Learning\n",
33 | "\n",
34 | "\n",
35 | "# Unsupervised Learning\n",
36 | "\n",
37 | "\n",
38 | ""
39 | ]
40 | },
41 | {
42 | "cell_type": "markdown",
43 | "metadata": {},
44 | "source": [
45 | "# Diving in - Predicting Telemarketing success\n",
46 | "\n",
47 | "As a working example, we are going to build a machine learning model to predict the success of a telemarketing campaign for an (unnamed) bank. The original dataset can be found [here](http://archive.ics.uci.edu/ml/datasets/Bank+Marketing), but for our purposes we will use the dataset found in `resources/bank_marketing.csv` in the GitHub repo.\n",
48 | "\n",
49 | "\n",
50 | "## Loading data and preprocessing\n",
51 | "\n",
52 | "To start, we are going to load the modules we will use. Our dependencies here are numpy, pandas, and scikit-learn, all of which can be installed via pip or with an environment manager like Anaconda."
53 | ]
54 | },
55 | {
56 | "cell_type": "code",
57 | "execution_count": null,
58 | "metadata": {
59 | "collapsed": false
60 | },
61 | "outputs": [],
62 | "source": [
63 | "# loading needed modules upfront.\n",
64 | "import pandas as pd\n",
65 | "import numpy as np\n",
66 | "from sklearn import preprocessing\n",
67 | "from sklearn.linear_model import LogisticRegression\n",
68 | "from sklearn.model_selection import train_test_split\n",
69 | "from sklearn.metrics import confusion_matrix"
70 | ]
71 | },
72 | {
73 | "cell_type": "markdown",
74 | "metadata": {},
75 | "source": [
76 | "Let's load our data into Pandas and have a look:"
77 | ]
78 | },
79 | {
80 | "cell_type": "code",
81 | "execution_count": null,
82 | "metadata": {
83 | "collapsed": false
84 | },
85 | "outputs": [],
86 | "source": [
87 | "data = pd.read_csv('./resources/bank_marketing.csv')\n",
88 | "data.head()"
89 | ]
90 | },
91 | {
92 | "cell_type": "markdown",
93 | "metadata": {},
94 | "source": [
95 | "Seems like we have 6 columns (or \"features\") in our data, followed by the thing we want to model - the \"outcome\"."
96 | ]
97 | },
98 | {
99 | "cell_type": "code",
100 | "execution_count": null,
101 | "metadata": {
102 | "collapsed": false
103 | },
104 | "outputs": [],
105 | "source": [
106 | "data.describe()"
107 | ]
108 | },
109 | {
110 | "cell_type": "markdown",
111 | "metadata": {},
112 | "source": [
113 | "Almost all ML models require the data to be preprocessed. Common preprocessing steps might include in-filling/removing missing data points, mean subtraction and unit-variance scaling (continuous data), and tokenization (NLP). \n",
114 | "\n",
115 | "In our example we can see that all of the features are categorical. In other words, all the entries for `job` fall into buckets like `housemaid`, `services`, or `admin`. Note there is no natural way to represent these categories as numbers (e.g., what does it mean to have `housemaid` be a 1 and `services` be a 2?). To account for this, most ML models require us to split each category into its own feature indicating whether it is or isn't true. This is called one-hot encoding or creating \"dummy\" variables."
116 | ]
117 | },
118 | {
119 | "cell_type": "code",
120 | "execution_count": null,
121 | "metadata": {
122 | "collapsed": false
123 | },
124 | "outputs": [],
125 | "source": [
126 | "data = pd.get_dummies(data, columns =['job', 'marital', 'default', 'housing', 'loan', 'prev_outcome'])\n",
127 | "data.head()"
128 | ]
129 | },
130 | {
131 | "cell_type": "markdown",
132 | "metadata": {},
133 | "source": [
134 | "### Creating a training, validation, and test dataset\n",
135 | "\n",
136 | "The whole point of creating a ML model is to try to develop a model that ***generalizes*** to ***new*** unseen data. If a model works well for data it has in hand but fails when new users/examples are entered into the system, it is useless.\n",
137 | "\n",
138 | "To assess generalization, we split the data into special buckets we call training, validation, and test.\n",
139 | "\n",
140 | "Our training data is what we use to have our ML model *learn*. Later, to ultimately measure generalization we run *inference* (predict) on the test data. Finally we have a third set of data we call validation data. For better or worse, almost all ML models have settings, called hyperparameters, that drive how they work and are not learned from the training data. We use the validation set to tune these, treating it as \"test\" data during our search for the best hyperparameters.\n",
141 | "\n",
142 | "**NOTE - ** Below we will call the simple `train_test_split` function twice to keep 60% of our data for training, 20% for validation, and 20% for testing: the first call holds out 20% for testing, and the second holds out 25% of the remaining 80% (i.e., 20% of the total) for validation. Scikit-learn offers ways to do this in one shot as well as more robust (and almost always used) [cross validation approaches](http://scikit-learn.org/0.16/modules/classes.html#module-sklearn.cross_validation). Also note, above we mention additional pre-processing techniques beyond one-hot encoding. Often, these must be performed after splitting the data so as to avoid a form of [data leakage](https://machinelearningmastery.com/data-leakage-machine-learning/)."
143 | ]
144 | },
145 | {
146 | "cell_type": "code",
147 | "execution_count": null,
148 | "metadata": {
149 | "collapsed": false
150 | },
151 | "outputs": [],
152 | "source": [
153 | "# break out our data and labels\n",
154 | "X = data.iloc[:,1:]\n",
155 | "y = data.iloc[:,0]\n",
156 | "\n",
157 | "# split off 20% for test data\n",
158 | "X_remaining, X_test, y_remaining, y_test = train_test_split(X, y, test_size=0.2, random_state=0)\n",
159 | "\n",
160 | "# split off 20% for validation data\n",
161 | "X_train, X_valid, y_train, y_valid = train_test_split(X_remaining, y_remaining, test_size=0.25, random_state=0)\n",
162 | "\n",
163 | "print(\"We have {}, {}, {} train, validation, and test samples\".format(y_train.size, y_valid.size, y_test.size))"
164 | ]
165 | },
166 | {
167 | "cell_type": "markdown",
168 | "metadata": {},
169 | "source": [
170 | "# Fitting an initial model - logistic regression\n",
171 | "\n",
172 | "As an initial approach, we are going to explore our predictive task using [Logistic Regression](https://en.wikipedia.org/wiki/Logistic_regression) ([scikit's documentation](http://scikit-learn.org/0.16/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression)). Logistic Regression is a relatively simple but incredibly useful model and, with creative [feature engineering](https://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/), can often rival the performance of highly complicated and expensive-to-compute models. Let's look at the math behind the model:\n",
173 | "\n",
174 | "\n",
175 | "\n",
176 | "If we look inside the parentheses, we see a simple linear model (remember y=mx+b?). The exponential produces values from ~0 to infinity, which means the logistic function ranges from 0 to 1 - exactly what we want for binary classification. \n",
177 | "\n",
178 | "So how does one fit this to data? First we need to define a cost or **loss function** which helps us define how well the current parameters (w and b here) fit the data. For Logistic Regression this looks like:\n",
179 | "\n",
180 | "\n",
181 | "\n",
182 | "This may be a lot to unpack, but if you walk through the math, you will see this loss function is high when there is a mismatch between our predictions *f* and labels *y*. To find the best models our code must optimize the parameters *w* and *b* to find the lowest value of the loss. Discussion of optimization is out of scope here, but know it often relies on variants of [gradient descent](https://en.wikipedia.org/wiki/Gradient_descent)."
183 | ]
184 | },
185 | {
186 | "cell_type": "markdown",
187 | "metadata": {},
188 | "source": [
189 | "\n",
190 | "\n",
191 | "### Exercise: Building intuition\n",
192 | "Above we've shown you the model and cost function equations. To build some intuition about what those equations are doing, try plotting them out, i.e. plot `f(x)`, and `J(f,y)`. From these graphs, what can you say (in words) about their function in the model?\n",
193 | ""
194 | ]
195 | },
196 | {
197 | "cell_type": "code",
198 | "execution_count": null,
199 | "metadata": {
200 | "collapsed": true
201 | },
202 | "outputs": [],
203 | "source": []
204 | },
205 | {
206 | "cell_type": "markdown",
207 | "metadata": {},
208 | "source": [
209 | "### Fitting LR in scikit-learn\n",
210 | "Math aside, the process of fitting most ML models in scikit-learn is really straightforward:"
211 | ]
212 | },
213 | {
214 | "cell_type": "code",
215 | "execution_count": null,
216 | "metadata": {
217 | "collapsed": true
218 | },
219 | "outputs": [],
220 | "source": [
221 | "# call the model class\n",
222 | "classifier = LogisticRegression(random_state=0)\n",
223 | "\n",
224 | "# fit to the training data\n",
225 | "classifier.fit(X_train, y_train)\n",
226 | "\n",
227 | "# generate predictions for our validation data!\n",
228 | "y_pred = classifier.predict(X_valid)"
229 | ]
230 | },
231 | {
232 | "cell_type": "markdown",
233 | "metadata": {},
234 | "source": [
235 | "Let's see how we did by looking at the *accuracy* - i.e., the fraction of examples we predicted correctly. Scikit-learn includes this as the `score` method of its LR class:"
236 | ]
237 | },
238 | {
239 | "cell_type": "code",
240 | "execution_count": null,
241 | "metadata": {
242 | "collapsed": false
243 | },
244 | "outputs": [],
245 | "source": [
246 | "print(\"We got {:0.2f} correct!\".format(classifier.score(X_valid, y_valid)))"
247 | ]
248 | },
249 | {
250 | "cell_type": "markdown",
251 | "metadata": {},
252 | "source": [
253 | "# Overfitting\n",
254 | "\n",
255 | "Before we go further, let's take a crucial conceptual aside. Of the core ideas that are key to successfully applying ML, overfitting is right at the top. In short, overfitting is when your model has learned the nuances and properties of your training data and performs significantly (sometimes *MUCH*) worse on your test data. Consider the example below, where a polynomial model is fit to data using an increasingly higher degree ([borrowed from scikit-learn's website](http://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html#sphx-glr-auto-examples-model-selection-plot-underfitting-overfitting-py)):\n",
256 | "\n",
257 | "\n",
258 | "\n",
259 | "When the degree is low, the model has little flexibility so the test error (here mean squared error or MSE) is reasonable but a little high. As we increase the model's flexibility, we find a model with the right amount of wiggle to fit the training and test data well. However if we continue to increase the capacity we will find we fit the training data *really* well but have a crazy function that can't possibly match the data in places the model hasn't seen. Our goal when training ML models is to make sure this doesn't happen!"
260 | ]
261 | },
262 | {
263 | "cell_type": "markdown",
264 | "metadata": {},
265 | "source": [
266 | "# Regularization\n",
267 | "\n",
268 | "One of the main ways to combat overfitting is through regularization. Simply put, regularization is the application of any approach that serves to limit model capacity in the hopes of getting better generalization performance. Typically, regularization comes in the form of penalizing parameter values from being too crazy or by cleverly restricting the structure of a model. A complete review of regularization is also out of scope here, as it can vary quite significantly across different ML models.\n",
269 | "\n",
270 | "Going back to our working example, the two main forms of regularization for logistic regression are L1 and L2 penalties. Here we will briefly explore the L2 penalty.\n",
271 | "\n",
272 | "## L2 Regularization\n",
273 | "\n",
274 | "Probably the most common form of regularization across ML, L2 regularization penalizes model weight parameters for having large positive or negative values:\n",
275 | "\n",
276 | "\n",
277 | "\n",
278 | "L2 is very straightforward to understand and compute, being just the sum of all the squared weight values. This sum is added to our loss function, so that it is taken into account during training (optimization). The lambda in front of the sum is called a hyperparameter (a parameter not learned during training but set upfront) and it controls how much the L2 penalty matters during training.\n",
279 | "\n",
280 | "Looking at the scikit-learn API:\n",
281 | "\n",
282 | "\n",
283 | "\n",
284 | "We see that L2 is the default penalty, so we have already been using it! Let's explore the effect of varying the hyperparameter *C* (which is 1/lambda):"
285 | ]
286 | },
287 | {
288 | "cell_type": "code",
289 | "execution_count": null,
290 | "metadata": {
291 | "collapsed": false
292 | },
293 | "outputs": [],
294 | "source": [
295 | "def LR_model(X_train, y_train, X_valid, y_valid, lamb=1, rs=0, class_weight=None):\n",
296 | "    model = LogisticRegression(C=1./lamb, random_state=rs, class_weight=class_weight)\n",
297 | "    model.fit(X_train, y_train)\n",
298 | "    y_pred = model.predict(X_valid)\n",
299 | "    return y_pred, model.coef_, model.score(X_valid, y_valid), model\n",
300 | "\n",
301 | "# fit lambda=1 as a reminder\n",
302 | "lamb = 1\n",
303 | "_, weights, acc, _ = LR_model(X_train, y_train, X_valid, y_valid, lamb)\n",
304 | "print(\"We got {:0.2f} correct!\".format(acc))"
305 | ]
306 | },
307 | {
308 | "cell_type": "markdown",
309 | "metadata": {},
310 | "source": [
311 | "Great, let's look at the weights of the model to see what is happening under the hood."
312 | ]
313 | },
314 | {
315 | "cell_type": "code",
316 | "execution_count": null,
317 | "metadata": {
318 | "collapsed": false
319 | },
320 | "outputs": [],
321 | "source": [
322 | "print(\"lambda = {:2.2f}: We got {:0.2f} correct!\".format(lamb, acc))\n",
323 | "print(\"Here are the weights:\\n\", weights)\n",
324 | "print(\"\\nAnd the sum of their magnitudes is {:1.3f}\".format(np.sum(np.abs(weights))))"
325 | ]
326 | },
327 | {
328 | "cell_type": "markdown",
329 | "metadata": {},
330 | "source": [
331 | "Interesting, almost all of the weights carry some value with only one being zero. What happens if we take lambda (and hence C) to very large or very small values?"
332 | ]
333 | },
334 | {
335 | "cell_type": "code",
336 | "execution_count": null,
337 | "metadata": {
338 | "collapsed": false,
339 | "scrolled": false
340 | },
341 | "outputs": [],
342 | "source": [
343 | "lamb = 1.e6\n",
344 | "_, weights, acc, _ = LR_model(X_train, y_train, X_valid, y_valid, lamb=lamb)\n",
345 | "print(\"lambda = {:g}: We got {:0.2f} correct!\".format(lamb, acc))\n",
346 | "print(\"Here are the weights:\\n\", weights)\n",
347 | "print(\"\\nAnd the sum of their magnitudes is {:1.3f}\".format(np.sum(np.abs(weights))))\n",
348 | "\n",
349 | "lamb = 1.e-6\n",
350 | "_, weights, acc, _ = LR_model(X_train, y_train, X_valid, y_valid, lamb=lamb)\n",
351 | "print(\"\\n\\nlambda = {:g}: We got {:0.2f} correct!\".format(lamb, acc))\n",
352 | "print(\"Here are the weights:\\n\", weights)\n",
353 | "print(\"\\nAnd the sum of their magnitudes is {:1.3f}\".format(np.sum(np.abs(weights))))"
354 | ]
355 | },
356 | {
357 | "cell_type": "markdown",
358 | "metadata": {},
359 | "source": [
360 | "# Class Imbalance\n",
361 | "\n",
362 | "At this point you might be suspicious about the performance of our LR model - Why is the accuracy high? Also, why does regularization only have a small effect?\n",
363 | "\n",
364 | "It turns out that this dataset, like ***most*** datasets in the wild, suffers from a significant class imbalance. That is, the negative examples (`'no'`) greatly outnumber the positive examples (`'yes'`). Let's look at this in detail:"
365 | ]
366 | },
367 | {
368 | "cell_type": "code",
369 | "execution_count": null,
370 | "metadata": {
371 | "collapsed": false
372 | },
373 | "outputs": [],
374 | "source": [
375 | "print(\"{:0.2f} of the test examples are negative!\".format(np.mean(y_test == 'no')))"
376 | ]
377 | },
378 | {
379 | "cell_type": "code",
380 | "execution_count": null,
381 | "metadata": {
382 | "collapsed": false
383 | },
384 | "outputs": [],
385 | "source": [
386 | "y_pred, weights, acc, _ = LR_model(X_train, y_train, X_valid, y_valid, lamb=1)\n",
387 | "print(\"We predicted {:3d} positive samples in our validation data,\".format(np.sum(y_pred=='yes')))\n",
388 | "print(\"but there are {:3d} positive samples in our validation data!\".format(np.sum(y_valid=='yes')))"
389 | ]
390 | },
391 | {
392 | "cell_type": "markdown",
393 | "metadata": {},
394 | "source": [
395 | "To better understand where our errors lie, let's look at the confusion matrix - a grid that shows how many validation examples our model says are positive/negative versus their true labels."
396 | ]
397 | },
398 | {
399 | "cell_type": "code",
400 | "execution_count": null,
401 | "metadata": {
402 | "collapsed": false
403 | },
404 | "outputs": [],
405 | "source": [
406 | "%matplotlib inline\n",
407 | "from resources.plot_utils import plot_confusion_matrix\n",
408 | "fig = plot_confusion_matrix(confusion_matrix(y_valid, y_pred), ['negative','positive'])\n",
409 | "fig.show()"
410 | ]
411 | },
412 | {
413 | "cell_type": "markdown",
414 | "metadata": {},
415 | "source": [
416 | "### Addressing imbalance\n",
417 | "\n",
418 | "Clearly the above results are an issue. We are wrongly predicting that 744 people will not convert in the marketing campaign when in fact they will. As a business, missing out on the opportunity to contact high-yield leads is not good. One approach to address this is to directly tackle the class imbalance in the data.\n",
419 | "\n",
420 | "[There are many approaches to addressing class imbalance](https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/). Perhaps the simplest and most straightforward is to reweight the examples in your loss function, according to their relative frequencies. Luckily, most ML models in scikit-learn have this option baked in. Below we will use this reweighting approach by setting `class_weight='balanced'`."
421 | ]
422 | },
423 | {
424 | "cell_type": "code",
425 | "execution_count": null,
426 | "metadata": {
427 | "collapsed": false
428 | },
429 | "outputs": [],
430 | "source": [
431 | "y_pred, weights, acc, model = LR_model(X_train, y_train, X_valid, y_valid, lamb=1e-6, class_weight='balanced')\n",
432 | "print(\"We got {:0.2f} correct!\\n\".format(acc))\n",
433 | "fig = plot_confusion_matrix(confusion_matrix(y_valid, y_pred), ['negative','positive'])"
434 | ]
435 | },
436 | {
437 | "cell_type": "markdown",
438 | "metadata": {},
439 | "source": [
440 | "Our total accuracy has gone down, but our ability to detect positives has dramatically improved. Determining the [right metric](https://en.wikipedia.org/wiki/Precision_and_recall) for success depends a lot on the context of the problem."
441 | ]
442 | },
443 | {
444 | "cell_type": "markdown",
445 | "metadata": {},
446 | "source": [
447 | "\n",
448 | "\n",
449 | "### Exercise\n",
450 | "Thus far, we've used the validation data to assess and tune the performance of our model. Of course, we can continue tuning to see if we can do better. But once we're done, we need to evaluate the performance of our model on the test set. This allows us to better understand how generalizable our model is. Your task is to calculate accuracy and re-generate the confusion matrix on the test set.\n",
451 | ""
452 | ]
453 | },
454 | {
455 | "cell_type": "code",
456 | "execution_count": null,
457 | "metadata": {
458 | "collapsed": true
459 | },
460 | "outputs": [],
461 | "source": []
462 | },
463 | {
464 | "cell_type": "markdown",
465 | "metadata": {},
466 | "source": [
467 | "\n",
468 | "\n",
469 | "### Exercise\n",
470 | "Try fitting a Random Forest model to the data. Does it do better?\n",
471 | ""
472 | ]
473 | },
474 | {
475 | "cell_type": "code",
476 | "execution_count": null,
477 | "metadata": {
478 | "collapsed": true
479 | },
480 | "outputs": [],
481 | "source": [
482 | "# you'll need to import the RF classifier model\n",
483 | "from sklearn.ensemble import RandomForestClassifier"
484 | ]
485 | },
486 | {
487 | "cell_type": "markdown",
488 | "metadata": {},
489 | "source": [
490 | "# Want to learn more?"
491 | ]
492 | },
493 | {
494 | "cell_type": "markdown",
495 | "metadata": {
496 | "collapsed": true
497 | },
498 | "source": [
499 | "## Continued learning with the above example\n",
500 | "- What about other classification models?\n",
501 | "- Try to tune the hyperparameters of the Random Forest model to find the best one. Remember to use validation data for this. To limit your search, try focusing on `n_estimators` and `max_depth`.\n",
502 | "- Do the above with a form of [cross-validation](http://scikit-learn.org/0.16/modules/generated/sklearn.cross_validation.KFold.html#sklearn.cross_validation.KFold). Don't forget to hold out test data!\n",
503 | "\n",
504 | "## Longer term - course and book work\n",
505 | "- A decent starting point to get an understanding of the ML landscape is to take [Andrew Ng's ML course on Coursera](https://www.coursera.org/learn/machine-learning).\n",
506 | "- A good book which gives a good sense of ML models, but not in full gory detail, is the [Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/).\n",
507 | "- More in-depth reading can be found in [The Elements of Statistical Learning](https://www.amazon.com/Elements-Statistical-Learning-Prediction-Statistics/dp/0387848576) and [Machine Learning: A Probabilistic Perspective](https://www.amazon.com/Machine-Learning-Probabilistic-Perspective-Computation-ebook/dp/B00AF1AYTQ).\n",
508 | "\n",
509 | "## Longer term - ML projects!\n",
510 | "- One of the best ways to learn ML is to just start using it! Pick a project to work on where you can leverage ML. Stuck on ideas? [Insight's blog](https://blog.insightdatascience.com) is full of them.\n",
511 | "- [Kaggle is ok](https://www.kaggle.com): there is a lot of useful information on the competition forums and you can learn a lot doing a project. Be warned - 1) Kaggle competitions tend to collapse down to just a handful of ML approaches and 2) Kaggle is pretty focused on getting that 0.0001% improvement, which is often NOT what real world problems are calling for."
512 | ]
513 | }
514 | ],
515 | "metadata": {
516 | "kernelspec": {
517 | "display_name": "Python 2",
518 | "language": "python",
519 | "name": "python2"
520 | },
521 | "language_info": {
522 | "codemirror_mode": {
523 | "name": "ipython",
524 | "version": 2
525 | },
526 | "file_extension": ".py",
527 | "mimetype": "text/x-python",
528 | "name": "python",
529 | "nbconvert_exporter": "python",
530 | "pygments_lexer": "ipython2",
531 | "version": "2.7.12"
532 | }
533 | },
534 | "nbformat": 4,
535 | "nbformat_minor": 1
536 | }
537 |
--------------------------------------------------------------------------------
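The notebook above describes the logistic model, its loss function, and the L2 penalty in prose, but the equation images are not reproduced in this listing. As a rough aid for its "Building intuition" exercise, here is a minimal pure-Python sketch that assumes the standard textbook forms: `f(x) = 1 / (1 + exp(-(w.x + b)))`, the mean cross-entropy loss, and a penalty of `lamb` times the sum of squared weights. The function names, parameter values, and data points are made up for illustration, and scaling conventions (for example how scikit-learn applies `C`) may differ from this sketch.

```python
import math


def logistic(x, w, b):
    """Logistic model: squashes the linear score w.x + b into (0, 1)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(preds, labels, eps=1e-12):
    """Mean cross-entropy between predictions in (0, 1) and 0/1 labels."""
    total = 0.0
    for f, y in zip(preds, labels):
        f = min(max(f, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(f) + (1 - y) * math.log(1.0 - f))
    return total / len(labels)


def l2_penalty(w, lamb):
    """lamb times the sum of the squared weights (assumed scaling)."""
    return lamb * sum(wi * wi for wi in w)


# Hypothetical parameters and data, chosen only to exercise the functions.
w, b = [2.0, -1.0], 0.5
X = [[1.0, 0.0], [0.0, 3.0]]
y = [1, 0]

preds = [logistic(x, w, b) for x in X]
print("predictions:", preds)
print("penalized loss:", log_loss(preds, y) + l2_penalty(w, lamb=0.1))
```

Evaluating `logistic` over a range of scores and `log_loss` over a range of predictions for a fixed label gives the curves the exercise asks about: the loss grows as the prediction moves away from the label, and a larger `lamb` makes large weights more costly.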
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2017 Insight Data Science
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Machine Learning Primer
2 | ML primer workshop @ Future Labs AI Summit.
3 | 10am, October 30, 2017
4 |
5 | Ross Fadely, AI Lead, [Insight Data Science](http://insightdata.ai)
6 |
7 | # About
8 | This 45-minute workshop gives an overview of the important aspects of machine learning and touches on core concepts by working through a case study. It is aimed at audiences who are less familiar with ML or could use a refresher.
9 |
10 | # Usage
11 | ```
12 | git clone git@github.com:InsightDataScience/Intro_to_ML-FutureLabs.git
13 | ```
14 |
15 | or [download the zip archive](https://github.com/InsightDataScience/Intro_to_ML-FutureLabs/archive/master.zip).
16 |
17 | For those not equipped to run Jupyter notebooks, the HTML version (`Intro_to_ML.html`) can be opened in a browser.
18 |
19 |
20 |
--------------------------------------------------------------------------------
/images/Insight_small.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/InsightDataScience/Intro_to_ML-FutureLabs/f151146296c8be6ce4b7123e71d1c436ea60a71b/images/Insight_small.png
--------------------------------------------------------------------------------
/images/clustering.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/InsightDataScience/Intro_to_ML-FutureLabs/f151146296c8be6ce4b7123e71d1c436ea60a71b/images/clustering.png
--------------------------------------------------------------------------------
/images/density_est.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/InsightDataScience/Intro_to_ML-FutureLabs/f151146296c8be6ce4b7123e71d1c436ea60a71b/images/density_est.png
--------------------------------------------------------------------------------
/images/dim_redux.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/InsightDataScience/Intro_to_ML-FutureLabs/f151146296c8be6ce4b7123e71d1c436ea60a71b/images/dim_redux.png
--------------------------------------------------------------------------------
/images/future_labs.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/InsightDataScience/Intro_to_ML-FutureLabs/f151146296c8be6ce4b7123e71d1c436ea60a71b/images/future_labs.png
--------------------------------------------------------------------------------
/images/l2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/InsightDataScience/Intro_to_ML-FutureLabs/f151146296c8be6ce4b7123e71d1c436ea60a71b/images/l2.png
--------------------------------------------------------------------------------
/images/logistic2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/InsightDataScience/Intro_to_ML-FutureLabs/f151146296c8be6ce4b7123e71d1c436ea60a71b/images/logistic2.png
--------------------------------------------------------------------------------
/images/logistic_loss2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/InsightDataScience/Intro_to_ML-FutureLabs/f151146296c8be6ce4b7123e71d1c436ea60a71b/images/logistic_loss2.png
--------------------------------------------------------------------------------
/images/lr_sklearn.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/InsightDataScience/Intro_to_ML-FutureLabs/f151146296c8be6ce4b7123e71d1c436ea60a71b/images/lr_sklearn.png
--------------------------------------------------------------------------------
/images/overfitting.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/InsightDataScience/Intro_to_ML-FutureLabs/f151146296c8be6ce4b7123e71d1c436ea60a71b/images/overfitting.png
--------------------------------------------------------------------------------
/images/supervised.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/InsightDataScience/Intro_to_ML-FutureLabs/f151146296c8be6ce4b7123e71d1c436ea60a71b/images/supervised.png
--------------------------------------------------------------------------------
/images/supervised_vs_unsupervised.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/InsightDataScience/Intro_to_ML-FutureLabs/f151146296c8be6ce4b7123e71d1c436ea60a71b/images/supervised_vs_unsupervised.png
--------------------------------------------------------------------------------
/images/train_test_valid.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/InsightDataScience/Intro_to_ML-FutureLabs/f151146296c8be6ce4b7123e71d1c436ea60a71b/images/train_test_valid.png
--------------------------------------------------------------------------------
/images/what_is_ML.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/InsightDataScience/Intro_to_ML-FutureLabs/f151146296c8be6ce4b7123e71d1c436ea60a71b/images/what_is_ML.png
--------------------------------------------------------------------------------
/resources/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/InsightDataScience/Intro_to_ML-FutureLabs/f151146296c8be6ce4b7123e71d1c436ea60a71b/resources/__init__.py
--------------------------------------------------------------------------------
/resources/plot_utils.py:
--------------------------------------------------------------------------------
1 | import itertools
2 | import numpy as np
3 | import matplotlib.pyplot as plt
4 |
5 | def plot_confusion_matrix(cm, classes,
6 |                           normalize=False,
7 |                           title='Confusion matrix',
8 |                           cmap=plt.cm.Blues,
9 |                           figsize=[10, 5],
10 |                           fontsize=20):
11 |     """
12 |     Plot the confusion matrix `cm` (rows: true labels, columns: predicted).
13 |     Setting `normalize=True` only changes the cell text format to two decimals.
14 |     """
15 |     fig = plt.figure(figsize=figsize)
16 |     plt.imshow(cm, interpolation='nearest', cmap=cmap)
17 |     plt.title(title)
18 |     plt.colorbar()
19 |     tick_marks = np.arange(len(classes))
20 |     plt.xticks(tick_marks, classes, rotation=45)
21 |     plt.yticks(tick_marks, classes)
22 |
23 |     fmt = '.2f' if normalize else 'd'
24 |     thresh = cm.max() / 2.
25 |     for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
26 |         plt.text(j, i, format(cm[i, j], fmt), fontsize=fontsize,
27 |                  horizontalalignment="center",
28 |                  color="white" if cm[i, j] > thresh else "black")
29 |
30 |     plt.tight_layout()
31 |     plt.ylabel('True label')
32 |     plt.xlabel('Predicted label')
33 |     return plt  # the pyplot module, so callers can call .show()
34 |
--------------------------------------------------------------------------------
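`plot_confusion_matrix` above takes a 2-D array of counts with true labels on the rows and predicted labels on the columns, matching its axis labels. As a minimal sketch of what such a matrix contains, and of how precision and recall (the metrics the notebook links to) follow from it, the pure-Python example below builds the counts by hand. The helper name `confusion_counts` and the label lists are hypothetical and used only for illustration; the notebook itself relies on `sklearn.metrics.confusion_matrix`.

```python
def confusion_counts(y_true, y_pred, labels=("no", "yes")):
    """Count (true, predicted) pairs; rows are true labels, columns predicted."""
    index = {label: i for i, label in enumerate(labels)}
    cm = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        cm[index[t]][index[p]] += 1
    return cm


# Made-up labels, for illustration only.
y_true = ["no", "no", "no", "yes", "yes", "no"]
y_pred = ["no", "no", "yes", "no", "yes", "no"]

cm = confusion_counts(y_true, y_pred)
tn, fp = cm[0]  # true 'no' row: correct rejections, false alarms
fn, tp = cm[1]  # true 'yes' row: missed positives, correct detections

accuracy = (tp + tn) / float(len(y_true))
precision = tp / float(tp + fp)  # of predicted positives, how many were right
recall = tp / float(tp + fn)     # of actual positives, how many were found

print(cm, accuracy, precision, recall)
```

To plot a hand-built matrix like this one with `plot_confusion_matrix`, it would first need to be converted to a NumPy array, since the function calls `cm.max()` and reads `cm.shape`.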