├── .gitignore ├── Final_Project ├── amtrak_speeds │ └── readme.md ├── readme.md └── speeding_cops │ ├── florida_toll_plaza.kml │ ├── readme.md │ └── transponder_data.csv ├── LICENSE ├── class1_1 ├── class1-1.ipynb ├── exercise │ ├── Exercise1_MeanFunction.ipynb │ ├── Exercise2_MayoralExcuseGenerator.ipynb │ ├── Exercise3-Answers.ipynb │ ├── Exercise3.ipynb │ ├── Exercise4-Answers.ipynb │ ├── Exercise4.ipynb │ └── excuse.csv ├── lab1-1.ipynb └── newsroom_examples.md ├── class1_2 ├── .ipynb_checkpoints │ └── EDA_Python-checkpoint.ipynb ├── 2013_NYC_CD_MedianIncome_Recycle.xlsx ├── Data_Collection_Sheet.csv ├── EDA_Python.ipynb ├── class1_2.ipynb ├── height_weight.xlsx └── heights_weights_genders.csv ├── class2_1 ├── .ipynb_checkpoints │ └── EDA_Review-checkpoint.ipynb ├── EDA_Review.ipynb ├── README.md └── data │ └── ontime_reports_may_2015_ny.csv ├── class2_2 ├── DoNow_2-2.ipynb ├── DoNow_2-2_answers.ipynb ├── Multiple_Variable_Regression.ipynb ├── Simple_Linear_Regression.ipynb └── data │ ├── 2013_NYC_CD_MedianIncome_Recycle.xlsx │ ├── height_weight.xlsx │ ├── heights_weights_genders.csv │ └── ontime_reports_may_2015_ny.csv ├── class3_1 ├── .ipynb_checkpoints │ ├── classification-checkpoint.ipynb │ └── regression_review-checkpoint.ipynb ├── README.md ├── classification.ipynb ├── data │ ├── apib12tx.csv │ └── category-training.csv └── regression_review.ipynb ├── class3_2 ├── 3-2_DoNow.ipynb ├── 3-2_DoNow_Answers.ipynb ├── 3-2_DoNow_Answers_statsmodels.ipynb ├── 3-2_Exercises-Answers.ipynb ├── 3-2_Exercises.ipynb ├── Decision_Tree.ipynb ├── data │ ├── hanford.csv │ ├── hanford.txt │ ├── iris.csv │ ├── ontime_reports_may_2015_ny.csv │ ├── seeds_dataset.txt │ └── titanic.csv └── images │ ├── hanford_variables.png │ └── iris_scatter.png ├── class4_1 ├── README.md ├── data │ ├── bills_training.txt │ ├── contribs_training.csv │ ├── contribs_training_small.csv │ └── contribs_unclassified.csv ├── doc_classifier.py └── donors.py ├── class4_2 ├── 4-2_DoNow.ipynb ├── Feature_Engineering.ipynb ├── Logistic_regression.ipynb ├── Naive_Bayes.ipynb ├── data │ ├── ontime_reports_may_2015_ny.csv │ ├── titanic.csv │ └── wine.csv └── images │ └── titanic.png ├── class5_1 ├── .ipynb_checkpoints │ └── vectorization-checkpoint.ipynb ├── README.md ├── bill_classifier.py ├── crime_clusterer.py ├── data │ ├── bills_training.txt │ ├── columbia_crime.csv │ └── releases_training.txt ├── release_classifier.py └── vectorization.ipynb ├── class5_2 ├── 5_2-Assignment.ipynb ├── 5_2-DoNow.ipynb ├── data │ └── wine.csv ├── kmeans.ipynb └── knn.ipynb ├── class6_1 ├── .ipynb_checkpoints │ ├── cluster_crime-checkpoint.ipynb │ └── cluster_emails-checkpoint.ipynb ├── README.md ├── cluster_crime.ipynb ├── cluster_emails.ipynb └── data │ ├── cluster_examples │ ├── kmeans_10.csv │ └── kmeans_3.csv │ ├── columbia_crime.csv │ └── jeb_subjects.csv ├── class6_2 ├── AssociationRuleMining.ipynb └── RandomForest.ipynb ├── class7_1 ├── README.md ├── bill_classifier.py └── data │ └── bills_training.txt ├── data_journalism_on_github.md └── readme.md /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | 5 | # C extensions 6 | *.so 7 | 8 | # Distribution / packaging 9 | .Python 10 | env/ 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | *.egg-info/ 23 | .installed.cfg 24 | *.egg 25 | 26 | # PyInstaller 27 | # Usually these files are 
written by a python script from a template 28 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 29 | *.manifest 30 | *.spec 31 | 32 | # Installer logs 33 | pip-log.txt 34 | pip-delete-this-directory.txt 35 | 36 | # Unit test / coverage reports 37 | htmlcov/ 38 | .tox/ 39 | .coverage 40 | .coverage.* 41 | .cache 42 | nosetests.xml 43 | coverage.xml 44 | *,cover 45 | 46 | # Translations 47 | *.mo 48 | *.pot 49 | 50 | # Django stuff: 51 | *.log 52 | 53 | # Sphinx documentation 54 | docs/_build/ 55 | 56 | # PyBuilder 57 | target/ 58 | -------------------------------------------------------------------------------- /Final_Project/amtrak_speeds/readme.md: -------------------------------------------------------------------------------- 1 | ##Source: [Derailed Amtrak train sped into deadly crash curve | Al Jazeera America](http://america.aljazeera.com/multimedia/2015/5/map-derailed-amtrak-sped-through-northeast-corridor.html) 2 | 3 | ##Data: https://github.com/ajam/amtrak-188 4 | + The live data is accessible here: https://www.googleapis.com/mapsengine/v1/tables/01382379791355219452-08584582962951999356/features?version=published&key=AIzaSyCVFeFQrtk-ywrUE0pEcvlwgCqS6TJcOW4&maxResults=250 5 | 6 | ##Notes: 7 | + Michael Keller doesn't provide the code for scraping the data, but you can scrape the live data from the URL above (I'd recommend storing the results in a database) -------------------------------------------------------------------------------- /Final_Project/readme.md: -------------------------------------------------------------------------------- 1 | ##Final Project 2 | 3 | 4 | The final project is a chance for you to demonstrate the skills you've learned in this and other Lede classes to explore a topic of personal or professional interest using data. You should demonstrate not only strong technical ability, but also the ability to synthesize the data in interesting and meaningful ways. 5 | 6 | Requirements: 7 | + You must write a blog post, [submitted through the class Tumblr](http://ledealgorithms.tumblr.com/submit), outlining your project, your goals, your methodology, and your findings. Specifically address the data you used, its source, the steps you took to clean the data, and the insights you gained at each step, either with respect to your project or to working with data more generally. 8 | + You must present your work in class, on either August 27th or August 31st. Prepare a 15-minute presentation on the points covered in your blog post and be prepared to answer questions. All work is due September 1st. 9 | + You must provide the source code for your project. Code should be well written and commented wherever possible to explain its operation. 10 | 11 | You are free to work in groups, and we encourage you to find projects that are of limited enough scope to fit into the time allotted for this project. We often work under tight deadlines, and being able to constrain scope is an important skill. A smaller, more constrained objective will often allow us to better understand the essential tasks and challenges. Attempting to implement everything we envision at once is a recipe for disaster (like Healthcare.gov). Take this opportunity to develop a more iterative approach and build your project in phases rather than tackling everything all at once. For more information on this approach, look into [Agile Development](http://agilemethodology.org/). 
12 | 13 | If you work in groups, please indicate in your blog post the work of each person on the project so they may receive the proper credit. 14 | 15 | If you have any questions or need assistance shaping projects, please don't hesitate to reach out. 16 | -------------------------------------------------------------------------------- /Final_Project/speeding_cops/readme.md: -------------------------------------------------------------------------------- 1 | ##Source: [The Florida Sun-Sentinel Speeding Cops](http://www.sun-sentinel.com/news/speeding-cops/) 2 | 3 | ##Background 4 | [Documenting the process](http://towcenter.gitbooks.io/sensors-and-journalism/content/the_second_section/sun_sentinel_%E2%80%93.html) 5 | 6 | ##Data 7 | + transponder_data.csv - an extract from their online database with the entrance and exit locations and the entrance and exit times 8 | + florida_toll_plaza.kml - locations for each toll booth (as well as the toll plazas) in KML format, extracted from [here](https://www.google.com/maps/d/viewer?mid=zkhNiVf3Ss6c.k8ys3XRv92Ms&hl=en_US). A KML file is just XML with spatial data. OpenRefine is recommended for processing it; Python is also an option (but OpenRefine will be easier and faster) 9 | -------------------------------------------------------------------------------- /class1_1/class1-1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "name": "", 4 | "signature": "sha256:83d3e5703fd0ce5b2e62c63376699efb77c5cfa83ce21a7433dc7a0f14c00d56" 5 | }, 6 | "nbformat": 3, 7 | "nbformat_minor": 0, 8 | "worksheets": [ 9 | { 10 | "cells": [ 11 | { 12 | "cell_type": "code", 13 | "collapsed": false, 14 | "input": [ 15 | "for i in range(10):\n", 16 | " print i" 17 | ], 18 | "language": "python", 19 | "metadata": {}, 20 | "outputs": [ 21 | { 22 | "output_type": "stream", 23 | "stream": "stdout", 24 | "text": [ 25 | "0\n", 26 | "1\n", 27 | "2\n", 28 | "3\n", 29 | "4\n", 30 | "5\n", 31 | "6\n", 32 | "7\n", 33 | "8\n", 34 | "9\n" 35 | ] 36 | } 37 | ], 38 | "prompt_number": 1 39 | }, 40 | { 41 | "cell_type": "code", 42 | "collapsed": false, 43 | "input": [ 44 | "for i in range(1,10):\n", 45 | " print i" 46 | ], 47 | "language": "python", 48 | "metadata": {}, 49 | "outputs": [ 50 | { 51 | "output_type": "stream", 52 | "stream": "stdout", 53 | "text": [ 54 | "1\n", 55 | "2\n", 56 | "3\n", 57 | "4\n", 58 | "5\n", 59 | "6\n", 60 | "7\n", 61 | "8\n", 62 | "9\n" 63 | ] 64 | } 65 | ], 66 | "prompt_number": 2 67 | }, 68 | { 69 | "cell_type": "code", 70 | "collapsed": false, 71 | "input": [ 72 | "for i in range(1,10,2):\n", 73 | " print i" 74 | ], 75 | "language": "python", 76 | "metadata": {}, 77 | "outputs": [ 78 | { 79 | "output_type": "stream", 80 | "stream": "stdout", 81 | "text": [ 82 | "1\n", 83 | "3\n", 84 | "5\n", 85 | "7\n", 86 | "9\n" 87 | ] 88 | } 89 | ], 90 | "prompt_number": 3 91 | }, 92 | { 93 | "cell_type": "code", 94 | "collapsed": false, 95 | "input": [ 96 | "print range(10)" 97 | ], 98 | "language": "python", 99 | "metadata": {}, 100 | "outputs": [ 101 | { 102 | "output_type": "stream", 103 | "stream": "stdout", 104 | "text": [ 105 | "[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]\n" 106 | ] 107 | } 108 | ], 109 | "prompt_number": 4 110 | }, 111 | { 112 | "cell_type": "code", 113 | "collapsed": false, 114 | "input": [ 115 | "print range(1,10,3)" 116 | ], 117 | "language": "python", 118 | "metadata": {}, 119 | "outputs": [ 120 | { 121 | "output_type": "stream", 122 | "stream": "stdout", 123 | "text": [ 124 | "[1, 4, 7]\n" 125 | ] 
126 | } 127 | ], 128 | "prompt_number": 6 129 | }, 130 | { 131 | "cell_type": "code", 132 | "collapsed": false, 133 | "input": [ 134 | "print range(1,10,3,5)" 135 | ], 136 | "language": "python", 137 | "metadata": {}, 138 | "outputs": [ 139 | { 140 | "ename": "TypeError", 141 | "evalue": "range expected at most 3 arguments, got 4", 142 | "output_type": "pyerr", 143 | "traceback": [ 144 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", 145 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0;32mprint\u001b[0m \u001b[0mrange\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;36m10\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;36m3\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;36m5\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", 146 | "\u001b[0;31mTypeError\u001b[0m: range expected at most 3 arguments, got 4" 147 | ] 148 | } 149 | ], 150 | "prompt_number": 7 151 | }, 152 | { 153 | "cell_type": "code", 154 | "collapsed": false, 155 | "input": [ 156 | "l = [1,2,\"abc\"]" 157 | ], 158 | "language": "python", 159 | "metadata": {}, 160 | "outputs": [], 161 | "prompt_number": 8 162 | }, 163 | { 164 | "cell_type": "code", 165 | "collapsed": false, 166 | "input": [ 167 | "l" 168 | ], 169 | "language": "python", 170 | "metadata": {}, 171 | "outputs": [ 172 | { 173 | "metadata": {}, 174 | "output_type": "pyout", 175 | "prompt_number": 14, 176 | "text": [ 177 | "[2]" 178 | ] 179 | } 180 | ], 181 | "prompt_number": 14 182 | }, 183 | { 184 | "cell_type": "code", 185 | "collapsed": false, 186 | "input": [ 187 | "l.pop()" 188 | ], 189 | "language": "python", 190 | "metadata": {}, 191 | "outputs": [ 192 | { 193 | "metadata": {}, 194 | "output_type": "pyout", 195 | "prompt_number": 10, 196 | "text": [ 197 | "'abc'" 198 | ] 199 | } 200 | ], 201 | "prompt_number": 10 202 | }, 203 | { 204 | "cell_type": "code", 205 | "collapsed": false, 206 | "input": [ 207 | "l2 = l.remove(2)" 208 | ], 209 | "language": "python", 210 | "metadata": {}, 211 | "outputs": [], 212 | "prompt_number": 17 213 | }, 214 | { 215 | "cell_type": "code", 216 | "collapsed": false, 217 | "input": [ 218 | "l" 219 | ], 220 | "language": "python", 221 | "metadata": {}, 222 | "outputs": [ 223 | { 224 | "metadata": {}, 225 | "output_type": "pyout", 226 | "prompt_number": 13, 227 | "text": [ 228 | "[2]" 229 | ] 230 | } 231 | ], 232 | "prompt_number": 13 233 | }, 234 | { 235 | "cell_type": "code", 236 | "collapsed": false, 237 | "input": [ 238 | "l2" 239 | ], 240 | "language": "python", 241 | "metadata": {}, 242 | "outputs": [], 243 | "prompt_number": 18 244 | }, 245 | { 246 | "cell_type": "code", 247 | "collapsed": false, 248 | "input": [ 249 | "l = [1,1,1,2,3]" 250 | ], 251 | "language": "python", 252 | "metadata": {}, 253 | "outputs": [], 254 | "prompt_number": 19 255 | }, 256 | { 257 | "cell_type": "code", 258 | "collapsed": false, 259 | "input": [ 260 | "s = set(l)" 261 | ], 262 | "language": "python", 263 | "metadata": {}, 264 | "outputs": [], 265 | "prompt_number": 20 266 | }, 267 | { 268 | "cell_type": "code", 269 | "collapsed": false, 270 | "input": [ 271 | "s" 272 | ], 273 | "language": "python", 274 | "metadata": {}, 275 | "outputs": [ 276 | { 277 | "metadata": {}, 278 | "output_type": "pyout", 279 | "prompt_number": 21, 280 | "text": [ 281 | "{1, 2, 3}" 282 | ] 283 | } 284 | ], 285 | "prompt_number": 21 286 | }, 287 | { 288 | 
"cell_type": "code", 289 | "collapsed": false, 290 | "input": [ 291 | "l" 292 | ], 293 | "language": "python", 294 | "metadata": {}, 295 | "outputs": [ 296 | { 297 | "metadata": {}, 298 | "output_type": "pyout", 299 | "prompt_number": 22, 300 | "text": [ 301 | "[1, 1, 1, 2, 3]" 302 | ] 303 | } 304 | ], 305 | "prompt_number": 22 306 | }, 307 | { 308 | "cell_type": "code", 309 | "collapsed": false, 310 | "input": [ 311 | "s1 = set({1,2,3})" 312 | ], 313 | "language": "python", 314 | "metadata": {}, 315 | "outputs": [], 316 | "prompt_number": 23 317 | }, 318 | { 319 | "cell_type": "code", 320 | "collapsed": false, 321 | "input": [ 322 | "s2 = set({3,4,5})" 323 | ], 324 | "language": "python", 325 | "metadata": {}, 326 | "outputs": [], 327 | "prompt_number": 24 328 | }, 329 | { 330 | "cell_type": "code", 331 | "collapsed": false, 332 | "input": [ 333 | "s1 - s2" 334 | ], 335 | "language": "python", 336 | "metadata": {}, 337 | "outputs": [ 338 | { 339 | "metadata": {}, 340 | "output_type": "pyout", 341 | "prompt_number": 25, 342 | "text": [ 343 | "{1, 2}" 344 | ] 345 | } 346 | ], 347 | "prompt_number": 25 348 | }, 349 | { 350 | "cell_type": "code", 351 | "collapsed": false, 352 | "input": [ 353 | "state_dict = {'ny': 'New York'}" 354 | ], 355 | "language": "python", 356 | "metadata": {}, 357 | "outputs": [], 358 | "prompt_number": 26 359 | }, 360 | { 361 | "cell_type": "code", 362 | "collapsed": false, 363 | "input": [ 364 | "state_dict['ny']" 365 | ], 366 | "language": "python", 367 | "metadata": {}, 368 | "outputs": [ 369 | { 370 | "metadata": {}, 371 | "output_type": "pyout", 372 | "prompt_number": 27, 373 | "text": [ 374 | "'New York'" 375 | ] 376 | } 377 | ], 378 | "prompt_number": 27 379 | }, 380 | { 381 | "cell_type": "code", 382 | "collapsed": false, 383 | "input": [ 384 | "state_dict[0]" 385 | ], 386 | "language": "python", 387 | "metadata": {}, 388 | "outputs": [ 389 | { 390 | "ename": "KeyError", 391 | "evalue": "0", 392 | "output_type": "pyerr", 393 | "traceback": [ 394 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)", 395 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mstate_dict\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", 396 | "\u001b[0;31mKeyError\u001b[0m: 0" 397 | ] 398 | } 399 | ], 400 | "prompt_number": 28 401 | }, 402 | { 403 | "cell_type": "code", 404 | "collapsed": false, 405 | "input": [], 406 | "language": "python", 407 | "metadata": {}, 408 | "outputs": [] 409 | } 410 | ], 411 | "metadata": {} 412 | } 413 | ] 414 | } -------------------------------------------------------------------------------- /class1_1/exercise/Exercise1_MeanFunction.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "def mean_calc(input_list):\n", 12 | " list_len = 0 # variable to track running length\n", 13 | " list_sum = 0 # variable to track running sum\n", 14 | " if input_list:\n", 15 | " for i in input_list:\n", 16 | " if isinstance(i,int) or isinstance(i,float): # check to see if element i is of type int or float\n", 17 | " list_len += 1\n", 18 | " list_sum += i\n", 19 | " else: # element i is not int or float\n", 20 | " print \"list element %s is 
not of type int or float\" % i\n", 21 | " return list_sum/float(list_len) #return the final calculation\n", 22 | " else: #list is empty\n", 23 | " return \"input list is empty\"" 24 | ] 25 | }, 26 | { 27 | "cell_type": "code", 28 | "execution_count": 2, 29 | "metadata": { 30 | "collapsed": true 31 | }, 32 | "outputs": [], 33 | "source": [ 34 | "test_list = [1,1,1,2,3,4,4,4,4,5,6,7,9]" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": 3, 40 | "metadata": { 41 | "collapsed": false 42 | }, 43 | "outputs": [ 44 | { 45 | "data": { 46 | "text/plain": [ 47 | "3.923076923076923" 48 | ] 49 | }, 50 | "execution_count": 3, 51 | "metadata": {}, 52 | "output_type": "execute_result" 53 | } 54 | ], 55 | "source": [ 56 | "mean_calc(test_list)" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": 4, 62 | "metadata": { 63 | "collapsed": true 64 | }, 65 | "outputs": [], 66 | "source": [ 67 | "import numpy as np" 68 | ] 69 | }, 70 | { 71 | "cell_type": "code", 72 | "execution_count": 5, 73 | "metadata": { 74 | "collapsed": false 75 | }, 76 | "outputs": [ 77 | { 78 | "data": { 79 | "text/plain": [ 80 | "3.9230769230769229" 81 | ] 82 | }, 83 | "execution_count": 5, 84 | "metadata": {}, 85 | "output_type": "execute_result" 86 | } 87 | ], 88 | "source": [ 89 | "np.mean(test_list)" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": 6, 95 | "metadata": { 96 | "collapsed": true 97 | }, 98 | "outputs": [], 99 | "source": [ 100 | "test_string = ['1',2,'3','4']" 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "execution_count": 7, 106 | "metadata": { 107 | "collapsed": false 108 | }, 109 | "outputs": [ 110 | { 111 | "name": "stdout", 112 | "output_type": "stream", 113 | "text": [ 114 | "list element 1 is not of type int or float\n", 115 | "list element 3 is not of type int or float\n", 116 | "list element 4 is not of type int or float\n" 117 | ] 118 | }, 119 | { 120 | "data": { 121 | "text/plain": [ 122 | "2.0" 123 | ] 124 | }, 125 | "execution_count": 7, 126 | "metadata": {}, 127 | "output_type": "execute_result" 128 | } 129 | ], 130 | "source": [ 131 | "mean_calc(test_string)" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": 8, 137 | "metadata": { 138 | "collapsed": false 139 | }, 140 | "outputs": [ 141 | { 142 | "ename": "TypeError", 143 | "evalue": "cannot perform reduce with flexible type", 144 | "output_type": "error", 145 | "traceback": [ 146 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 147 | "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", 148 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtest_string\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", 149 | "\u001b[0;32m/Users/richarddunks/anaconda/lib/python2.7/site-packages/numpy/core/fromnumeric.pyc\u001b[0m in \u001b[0;36mmean\u001b[0;34m(a, axis, dtype, out, keepdims)\u001b[0m\n\u001b[1;32m 2733\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2734\u001b[0m return _methods._mean(a, axis=axis, dtype=dtype,\n\u001b[0;32m-> 2735\u001b[0;31m out=out, keepdims=keepdims)\n\u001b[0m\u001b[1;32m 2736\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2737\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mstd\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0maxis\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mout\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mddof\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkeepdims\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mFalse\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 150 | "\u001b[0;32m/Users/richarddunks/anaconda/lib/python2.7/site-packages/numpy/core/_methods.pyc\u001b[0m in \u001b[0;36m_mean\u001b[0;34m(a, axis, dtype, out, keepdims)\u001b[0m\n\u001b[1;32m 64\u001b[0m \u001b[0mdtype\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mmu\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'f8'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 65\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 66\u001b[0;31m \u001b[0mret\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mumr_sum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marr\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mout\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkeepdims\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 67\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mret\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmu\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mndarray\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 68\u001b[0m ret = um.true_divide(\n", 151 | "\u001b[0;31mTypeError\u001b[0m: cannot perform reduce with flexible type" 152 | ] 153 | } 154 | ], 155 | "source": [ 156 | "np.mean(test_string)" 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": 9, 162 | "metadata": { 163 | "collapsed": true 164 | }, 165 | "outputs": [], 166 | "source": [ 167 | "empty_list =[]" 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": 10, 173 | "metadata": { 174 | "collapsed": false 175 | }, 176 | "outputs": [ 177 | { 178 | "data": { 179 | "text/plain": [ 180 | "'input list is empty'" 181 | ] 182 | }, 183 | "execution_count": 10, 184 | "metadata": {}, 185 | "output_type": "execute_result" 186 | } 187 | ], 188 | "source": [ 189 | "mean_calc(empty_list)" 190 | ] 191 | }, 192 | { 193 | "cell_type": "code", 194 | "execution_count": 11, 195 | "metadata": { 196 | "collapsed": false 197 | }, 198 | "outputs": [ 199 | { 200 | "name": "stderr", 201 | "output_type": "stream", 202 | "text": [ 203 | "/Users/richarddunks/anaconda/lib/python2.7/site-packages/numpy/core/_methods.py:59: RuntimeWarning: Mean of empty slice.\n", 204 | " warnings.warn(\"Mean of empty slice.\", RuntimeWarning)\n", 205 | "/Users/richarddunks/anaconda/lib/python2.7/site-packages/numpy/core/_methods.py:71: RuntimeWarning: invalid value encountered in double_scalars\n", 206 | " ret = ret.dtype.type(ret / rcount)\n" 207 | ] 208 | }, 209 | { 210 | "data": { 211 | "text/plain": [ 212 | "nan" 213 | ] 214 | }, 215 | "execution_count": 11, 216 | "metadata": {}, 217 | "output_type": "execute_result" 218 | } 219 | ], 220 | "source": [ 221 | "np.mean(empty_list)" 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": null, 227 | "metadata": { 228 | "collapsed": true 229 | }, 230 | "outputs": [], 231 | "source": [] 232 | } 233 | ], 234 | "metadata": { 
235 | "kernelspec": { 236 | "display_name": "Python 2", 237 | "language": "python", 238 | "name": "python2" 239 | }, 240 | "language_info": { 241 | "codemirror_mode": { 242 | "name": "ipython", 243 | "version": 2 244 | }, 245 | "file_extension": ".py", 246 | "mimetype": "text/x-python", 247 | "name": "python", 248 | "nbconvert_exporter": "python", 249 | "pygments_lexer": "ipython2", 250 | "version": "2.7.10" 251 | } 252 | }, 253 | "nbformat": 4, 254 | "nbformat_minor": 0 255 | } 256 | -------------------------------------------------------------------------------- /class1_1/exercise/Exercise2_MayoralExcuseGenerator.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": false 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import random #package for generating pseudo-random numbers: https://docs.python.org/2/library/random.html" 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": 3, 17 | "metadata": { 18 | "collapsed": false 19 | }, 20 | "outputs": [], 21 | "source": [ 22 | "import csv" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": 4, 28 | "metadata": { 29 | "collapsed": false 30 | }, 31 | "outputs": [ 32 | { 33 | "name": "stdout", 34 | "output_type": "stream", 35 | "text": [ 36 | "Enter your name: Richard\n" 37 | ] 38 | } 39 | ], 40 | "source": [ 41 | "person = raw_input('Enter your name: ')" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": 5, 47 | "metadata": { 48 | "collapsed": false 49 | }, 50 | "outputs": [ 51 | { 52 | "name": "stdout", 53 | "output_type": "stream", 54 | "text": [ 55 | "Enter your destination: Chelsea\n" 56 | ] 57 | } 58 | ], 59 | "source": [ 60 | "place = raw_input('Enter your destination: ')" 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": 6, 66 | "metadata": { 67 | "collapsed": false 68 | }, 69 | "outputs": [], 70 | "source": [ 71 | "r = random.randrange(0,11) # generate random number between 0 and 10" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 8, 77 | "metadata": { 78 | "collapsed": false 79 | }, 80 | "outputs": [], 81 | "source": [ 82 | "excuse_list = [] #create an empty list to hold the excuses\n", 83 | "inputReader = csv.DictReader(open('excuse.csv','rU'))\n", 84 | "for line in inputReader:\n", 85 | " excuse_list.append(line) # append the excuses (as dictionary) to the list" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": 7, 91 | "metadata": { 92 | "collapsed": false 93 | }, 94 | "outputs": [ 95 | { 96 | "name": "stdout", 97 | "output_type": "stream", 98 | "text": [ 99 | "Sorry, Richard I was late to Chelsea, breakfast began a little later than expected\n", 100 | "From the story \"De Blasio 15 Minutes Late to St. 
Patrick's Day Mass, Blames Breakfast\"\n", 101 | "http://www.dnainfo.com/new-york/20150317/midtown/de-blasio-15-minutes-late-st-patricks-day-mass-blames-breakfast\n" 102 | ] 103 | } 104 | ], 105 | "source": [ 106 | "print \"Sorry, \" + person + \" I was late to \" + place + \", \" + excuse_list[r]['excuse']\n", 107 | "print 'From the story \"' + excuse_list[r]['headline'] + '\"'\n", 108 | "print excuse_list[r]['hyperlink']" 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": 15, 114 | "metadata": { 115 | "collapsed": false 116 | }, 117 | "outputs": [], 118 | "source": [ 119 | "# alternate way of generating the list of excuses using the context manager\n", 120 | "# http://preshing.com/20110920/the-python-with-statement-by-example/\n", 121 | "excuse_list2 = []\n", 122 | "with open('excuse.csv','rU') as inputFile:\n", 123 | " inputReader = csv.DictReader(inputFile)\n", 124 | " for line in inputReader:\n", 125 | " excuse_list2.append(line) # append the excuses (as dictionary) to the list\n", 126 | " #file connection is close at end of the indented code" 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": null, 132 | "metadata": { 133 | "collapsed": false 134 | }, 135 | "outputs": [], 136 | "source": [ 137 | "# This is the least elegant and least pythonic way of doing this. \n", 138 | "# Putting this code up at a Python conference could get you booed or otherwise shamed and driven from the hall\n", 139 | "# but it gets the job done\n", 140 | "inputFile = open('excuse.csv','rU') #create the file object\n", 141 | "header = next(inputFile) # return the first line of the file (header) and assign to a variable\n", 142 | "excuse_list = []\n", 143 | "for line in inputFile:\n", 144 | " line = line.split(',') # split the line on the comma\n", 145 | " excuse_list.append(line[0]) # append the first element to the list\n", 146 | "inputFile.close() # close connection to the file" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": null, 152 | "metadata": { 153 | "collapsed": false 154 | }, 155 | "outputs": [], 156 | "source": [] 157 | } 158 | ], 159 | "metadata": { 160 | "kernelspec": { 161 | "display_name": "Python 2", 162 | "language": "python", 163 | "name": "python2" 164 | }, 165 | "language_info": { 166 | "codemirror_mode": { 167 | "name": "ipython", 168 | "version": 2 169 | }, 170 | "file_extension": ".py", 171 | "mimetype": "text/x-python", 172 | "name": "python", 173 | "nbconvert_exporter": "python", 174 | "pygments_lexer": "ipython2", 175 | "version": "2.7.10" 176 | } 177 | }, 178 | "nbformat": 4, 179 | "nbformat_minor": 0 180 | } 181 | -------------------------------------------------------------------------------- /class1_1/exercise/Exercise3-Answers.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# The following code will print the prime numbers between 1 and 100. 
Modify the code so it prints every other prime number from 1 to 100" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 2, 13 | "metadata": { 14 | "collapsed": false 15 | }, 16 | "outputs": [ 17 | { 18 | "name": "stdout", 19 | "output_type": "stream", 20 | "text": [ 21 | "1\n", 22 | "3\n", 23 | "7\n", 24 | "13\n", 25 | "19\n", 26 | "29\n", 27 | "37\n", 28 | "43\n", 29 | "53\n", 30 | "61\n", 31 | "71\n", 32 | "79\n", 33 | "89\n" 34 | ] 35 | } 36 | ], 37 | "source": [ 38 | "j = 0 # add check counter outside the for-loop so it doesn't get reset\n", 39 | "for num in range(1,101): \n", 40 | " prime = True \n", 41 | " for i in range(2,num): \n", 42 | " if (num%i==0): \n", 43 | " prime = False \n", 44 | " if prime: \n", 45 | " if j%2 == 0: # test the check counter for being even and if so, then print the number\n", 46 | " print num\n", 47 | " j += 1 # increment the check counter each time a prime is found" 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": {}, 53 | "source": [ 54 | "# Extra Credit: Can you write a procedure that runs faster than the one above?" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": 12, 60 | "metadata": { 61 | "collapsed": false 62 | }, 63 | "outputs": [ 64 | { 65 | "name": "stdout", 66 | "output_type": "stream", 67 | "text": [ 68 | "1\n", 69 | "3\n", 70 | "7\n", 71 | "13\n", 72 | "19\n", 73 | "29\n", 74 | "37\n", 75 | "43\n", 76 | "53\n", 77 | "61\n", 78 | "71\n", 79 | "79\n", 80 | "89\n" 81 | ] 82 | } 83 | ], 84 | "source": [ 85 | "j = 0 \n", 86 | "for num in range(1,101): \n", 87 | " prime = True \n", 88 | " for i in range(2,num): \n", 89 | " if (num%i==0): \n", 90 | " prime = False\n", 91 | " continue \n", 92 | " # once the number has already been shown to be false, \n", 93 | " # there's no reason to keep checking\n", 94 | " if prime: \n", 95 | " if j%2 == 0: \n", 96 | " print num\n", 97 | " j += 1" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": 12, 103 | "metadata": { 104 | "collapsed": false 105 | }, 106 | "outputs": [], 107 | "source": [] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": null, 112 | "metadata": { 113 | "collapsed": false 114 | }, 115 | "outputs": [], 116 | "source": [] 117 | } 118 | ], 119 | "metadata": { 120 | "kernelspec": { 121 | "display_name": "Python 2", 122 | "language": "python", 123 | "name": "python2" 124 | }, 125 | "language_info": { 126 | "codemirror_mode": { 127 | "name": "ipython", 128 | "version": 2 129 | }, 130 | "file_extension": ".py", 131 | "mimetype": "text/x-python", 132 | "name": "python", 133 | "nbconvert_exporter": "python", 134 | "pygments_lexer": "ipython2", 135 | "version": "2.7.10" 136 | } 137 | }, 138 | "nbformat": 4, 139 | "nbformat_minor": 0 140 | } 141 | -------------------------------------------------------------------------------- /class1_1/exercise/Exercise3.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "name": "", 4 | "signature": "sha256:68df1038be6fa984e8fa87db9aa2fa3b80b0196b0e8ca61596f63da0942cd96d" 5 | }, 6 | "nbformat": 3, 7 | "nbformat_minor": 0, 8 | "worksheets": [ 9 | { 10 | "cells": [ 11 | { 12 | "cell_type": "heading", 13 | "level": 1, 14 | "metadata": {}, 15 | "source": [ 16 | "The following code will print the prime numbers between 1 and 100. 
Modify the code so it prints every other prime number from 1 to 100" 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "collapsed": false, 22 | "input": [ 23 | "for num in range(1,101): # for-loop through the numbers\n", 24 | " prime = True # boolean flag to check the number for being prime\n", 25 | " for i in range(2,num): # for-loop to check for \"primeness\" by checking for divisors other than 1\n", 26 | " if (num%i==0): # logical test for the number having a divisor other than 1 and itself\n", 27 | " prime = False # if there's a divisor, the boolean value gets flipped to False\n", 28 | " if prime: # if prime is still True after going through all numbers from 1 - 100, then it gets printed\n", 29 | " print num" 30 | ], 31 | "language": "python", 32 | "metadata": {}, 33 | "outputs": [] 34 | }, 35 | { 36 | "cell_type": "heading", 37 | "level": 1, 38 | "metadata": {}, 39 | "source": [ 40 | "Extra Credit: Can you write a procedure that runs faster than the one above?" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "collapsed": false, 46 | "input": [], 47 | "language": "python", 48 | "metadata": {}, 49 | "outputs": [] 50 | } 51 | ], 52 | "metadata": {} 53 | } 54 | ] 55 | } -------------------------------------------------------------------------------- /class1_1/exercise/Exercise4-Answers.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# The writer of this code wants to count the mean and median article length for recent articles on gay marraige. This code has several issues, including errors. When they checked their custom functions against the numpy functions, they noticed some discrepancies. Fix the code so it executes properly and the output of the custom functions match the output of the numpy functions" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 5, 13 | "metadata": { 14 | "collapsed": false 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import requests # a better package than urllib2" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 6, 24 | "metadata": { 25 | "collapsed": false 26 | }, 27 | "outputs": [], 28 | "source": [ 29 | "def my_mean(input_list):\n", 30 | " list_sum = 0\n", 31 | " list_count = 0\n", 32 | " for el in input_list:\n", 33 | " list_sum += el\n", 34 | " list_count += 1\n", 35 | " return list_sum / float(list_count) # cast list_count to float" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": 42, 41 | "metadata": { 42 | "collapsed": false 43 | }, 44 | "outputs": [], 45 | "source": [ 46 | "def my_median(input_list):\n", 47 | " input_list.sort() # sort the list\n", 48 | " list_length = len(input_list) # get length so it doesn't need to be recalculated\n", 49 | "\n", 50 | " # test for even length and take len/2 and len/2 -1 divided over 2.0 for float division\n", 51 | " if list_length %2 == 0: \n", 52 | " return (input_list[list_length/2] + input_list[(list_length/2) - 1]) / 2.0 \n", 53 | " else:\n", 54 | " return input_list[list_length/2]" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": 2, 60 | "metadata": { 61 | "collapsed": false 62 | }, 63 | "outputs": [], 64 | "source": [ 65 | "api_key = \"ffaf60d7d82258e112dd4fb2b5e4e2d6:3:72421680\"" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": 3, 71 | "metadata": { 72 | "collapsed": false 73 | }, 74 | "outputs": [], 75 | "source": [ 76 | "url = 
\"http://api.nytimes.com/svc/search/v2/articlesearch.json?q=gay+marriage&api-key=%s\" % api_key # variable name mistyped" 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": 8, 82 | "metadata": { 83 | "collapsed": false 84 | }, 85 | "outputs": [], 86 | "source": [ 87 | "r = requests.get(url)" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": 10, 93 | "metadata": { 94 | "collapsed": false 95 | }, 96 | "outputs": [], 97 | "source": [ 98 | "wc_list = []\n", 99 | "for article in r.json()['response']['docs']:\n", 100 | " wc_list.append(int(article['word_count'])) #word_count needs to be cast to int" 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "execution_count": 11, 106 | "metadata": { 107 | "collapsed": false 108 | }, 109 | "outputs": [ 110 | { 111 | "data": { 112 | "text/plain": [ 113 | "1034.2" 114 | ] 115 | }, 116 | "execution_count": 11, 117 | "metadata": {}, 118 | "output_type": "execute_result" 119 | } 120 | ], 121 | "source": [ 122 | "my_mean(wc_list)" 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": 12, 128 | "metadata": { 129 | "collapsed": false 130 | }, 131 | "outputs": [], 132 | "source": [ 133 | "import numpy as np" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": 13, 139 | "metadata": { 140 | "collapsed": false 141 | }, 142 | "outputs": [ 143 | { 144 | "data": { 145 | "text/plain": [ 146 | "1034.2" 147 | ] 148 | }, 149 | "execution_count": 13, 150 | "metadata": {}, 151 | "output_type": "execute_result" 152 | } 153 | ], 154 | "source": [ 155 | "np.mean(wc_list)" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": 43, 161 | "metadata": { 162 | "collapsed": false 163 | }, 164 | "outputs": [ 165 | { 166 | "data": { 167 | "text/plain": [ 168 | "926.5" 169 | ] 170 | }, 171 | "execution_count": 43, 172 | "metadata": {}, 173 | "output_type": "execute_result" 174 | } 175 | ], 176 | "source": [ 177 | "my_median(wc_list)" 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": 28, 183 | "metadata": { 184 | "collapsed": false 185 | }, 186 | "outputs": [ 187 | { 188 | "data": { 189 | "text/plain": [ 190 | "926.5" 191 | ] 192 | }, 193 | "execution_count": 28, 194 | "metadata": {}, 195 | "output_type": "execute_result" 196 | } 197 | ], 198 | "source": [ 199 | "np.median(wc_list)" 200 | ] 201 | } 202 | ], 203 | "metadata": { 204 | "kernelspec": { 205 | "display_name": "Python 2", 206 | "language": "python", 207 | "name": "python2" 208 | }, 209 | "language_info": { 210 | "codemirror_mode": { 211 | "name": "ipython", 212 | "version": 2 213 | }, 214 | "file_extension": ".py", 215 | "mimetype": "text/x-python", 216 | "name": "python", 217 | "nbconvert_exporter": "python", 218 | "pygments_lexer": "ipython2", 219 | "version": "2.7.10" 220 | } 221 | }, 222 | "nbformat": 4, 223 | "nbformat_minor": 0 224 | } 225 | -------------------------------------------------------------------------------- /class1_1/exercise/Exercise4.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "name": "", 4 | "signature": "sha256:27eed4d676ae4b5bf707d837cc436a5377ce46ea08f8fe7dcf43383e80482aeb" 5 | }, 6 | "nbformat": 3, 7 | "nbformat_minor": 0, 8 | "worksheets": [ 9 | { 10 | "cells": [ 11 | { 12 | "cell_type": "heading", 13 | "level": 1, 14 | "metadata": {}, 15 | "source": [ 16 | "The writer of this code wants to count the mean and median article length for recent articles on gay marriage in the New York 
Times. This code has several issues, including errors. When they checked their custom functions against the numpy functions, they noticed some discrepancies. Fix the code so it executes properly, retrieves the articles, and outputs the correct result from the custom functions, compared to the numpy functions." 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "collapsed": false, 22 | "input": [ 23 | "import requests # a better package than urllib2" 24 | ], 25 | "language": "python", 26 | "metadata": {}, 27 | "outputs": [] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "collapsed": false, 32 | "input": [ 33 | "def my_mean(input_list):\n", 34 | " list_sum = 0\n", 35 | " list_count = 0\n", 36 | " for el in input_list:\n", 37 | " list_sum += el\n", 38 | " list_count += 1\n", 39 | " return list_sum / list_count" 40 | ], 41 | "language": "python", 42 | "metadata": {}, 43 | "outputs": [] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "collapsed": false, 48 | "input": [ 49 | "def my_median(input_list):\n", 50 | " list_length = len(input_list)\n", 51 | " return input_list[list_length/2]" 52 | ], 53 | "language": "python", 54 | "metadata": {}, 55 | "outputs": [] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "collapsed": false, 60 | "input": [ 61 | "api_key = \"ffaf60d7d82258e112dd4fb2b5e4e2d6:3:72421680\"" 62 | ], 63 | "language": "python", 64 | "metadata": {}, 65 | "outputs": [] 66 | }, 67 | { 68 | "cell_type": "code", 69 | "collapsed": false, 70 | "input": [ 71 | "url = \"http://api.nytimes.com/svc/search/v2/articlesearch.json?q=gay+marriage&api-key=%s\" % API_key" 72 | ], 73 | "language": "python", 74 | "metadata": {}, 75 | "outputs": [] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "collapsed": false, 80 | "input": [ 81 | "r = requests.get(url)" 82 | ], 83 | "language": "python", 84 | "metadata": {}, 85 | "outputs": [] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "collapsed": false, 90 | "input": [ 91 | "wc_list = []\n", 92 | "for article in r.json()['response']['docs']:\n", 93 | " wc_list.append(article['word_count'])" 94 | ], 95 | "language": "python", 96 | "metadata": {}, 97 | "outputs": [] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "collapsed": false, 102 | "input": [ 103 | "my_mean(wc_list)" 104 | ], 105 | "language": "python", 106 | "metadata": {}, 107 | "outputs": [] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "collapsed": false, 112 | "input": [ 113 | "import numpy as np" 114 | ], 115 | "language": "python", 116 | "metadata": {}, 117 | "outputs": [] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "collapsed": false, 122 | "input": [ 123 | "np.mean(wc_list)" 124 | ], 125 | "language": "python", 126 | "metadata": {}, 127 | "outputs": [] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "collapsed": false, 132 | "input": [ 133 | "my_median(wc_list)" 134 | ], 135 | "language": "python", 136 | "metadata": {}, 137 | "outputs": [] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "collapsed": false, 142 | "input": [ 143 | "np.median(wc_list)" 144 | ], 145 | "language": "python", 146 | "metadata": {}, 147 | "outputs": [] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "collapsed": false, 152 | "input": [], 153 | "language": "python", 154 | "metadata": {}, 155 | "outputs": [] 156 | } 157 | ], 158 | "metadata": {} 159 | } 160 | ] 161 | } -------------------------------------------------------------------------------- /class1_1/exercise/excuse.csv: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/datapolitan/lede_algorithms/9109b5c91a3eb74c1343e79daf60abd76be182ea/class1_1/exercise/excuse.csv -------------------------------------------------------------------------------- /class1_1/lab1-1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "name": "", 4 | "signature": "sha256:b6f3a87300b6901d9f5557b6771c6025c4453f08fb77b010b5531157a6471784" 5 | }, 6 | "nbformat": 3, 7 | "nbformat_minor": 0, 8 | "worksheets": [ 9 | { 10 | "cells": [ 11 | { 12 | "cell_type": "code", 13 | "collapsed": false, 14 | "input": [ 15 | "b = 1\n", 16 | "for i in range(5):\n", 17 | " print b\n", 18 | " b += 1" 19 | ], 20 | "language": "python", 21 | "metadata": {}, 22 | "outputs": [ 23 | { 24 | "output_type": "stream", 25 | "stream": "stdout", 26 | "text": [ 27 | "1\n", 28 | "2\n", 29 | "3\n", 30 | "4\n", 31 | "5\n" 32 | ] 33 | } 34 | ], 35 | "prompt_number": 11 36 | }, 37 | { 38 | "cell_type": "code", 39 | "collapsed": false, 40 | "input": [ 41 | "for i in range(5):\n", 42 | " b = 5\n", 43 | " print b\n", 44 | " b += 1" 45 | ], 46 | "language": "python", 47 | "metadata": {}, 48 | "outputs": [ 49 | { 50 | "output_type": "stream", 51 | "stream": "stdout", 52 | "text": [ 53 | "5\n", 54 | "5\n", 55 | "5\n", 56 | "5\n", 57 | "5\n" 58 | ] 59 | } 60 | ], 61 | "prompt_number": 9 62 | }, 63 | { 64 | "cell_type": "code", 65 | "collapsed": false, 66 | "input": [ 67 | "for i in range(10):\n", 68 | " print i\n", 69 | " for j in range(10):\n", 70 | " print j\n", 71 | "print i" 72 | ], 73 | "language": "python", 74 | "metadata": {}, 75 | "outputs": [ 76 | { 77 | "output_type": "stream", 78 | "stream": "stdout", 79 | "text": [ 80 | "0\n", 81 | "0\n", 82 | "1\n", 83 | "2\n", 84 | "3\n", 85 | "4\n", 86 | "5\n", 87 | "6\n", 88 | "7\n", 89 | "8\n", 90 | "9\n", 91 | "1\n", 92 | "0\n", 93 | "1\n", 94 | "2\n", 95 | "3\n", 96 | "4\n", 97 | "5\n", 98 | "6\n", 99 | "7\n", 100 | "8\n", 101 | "9\n", 102 | "2\n", 103 | "0\n", 104 | "1\n", 105 | "2\n", 106 | "3\n", 107 | "4\n", 108 | "5\n", 109 | "6\n", 110 | "7\n", 111 | "8\n", 112 | "9\n", 113 | "3\n", 114 | "0\n", 115 | "1\n", 116 | "2\n", 117 | "3\n", 118 | "4\n", 119 | "5\n", 120 | "6\n", 121 | "7\n", 122 | "8\n", 123 | "9\n", 124 | "4\n", 125 | "0\n", 126 | "1\n", 127 | "2\n", 128 | "3\n", 129 | "4\n", 130 | "5\n", 131 | "6\n", 132 | "7\n", 133 | "8\n", 134 | "9\n", 135 | "5\n", 136 | "0\n", 137 | "1\n", 138 | "2\n", 139 | "3\n", 140 | "4\n", 141 | "5\n", 142 | "6\n", 143 | "7\n", 144 | "8\n", 145 | "9\n", 146 | "6\n", 147 | "0\n", 148 | "1\n", 149 | "2\n", 150 | "3\n", 151 | "4\n", 152 | "5\n", 153 | "6\n", 154 | "7\n", 155 | "8\n", 156 | "9\n", 157 | "7\n", 158 | "0\n", 159 | "1\n", 160 | "2\n", 161 | "3\n", 162 | "4\n", 163 | "5\n", 164 | "6\n", 165 | "7\n", 166 | "8\n", 167 | "9\n", 168 | "8\n", 169 | "0\n", 170 | "1\n", 171 | "2\n", 172 | "3\n", 173 | "4\n", 174 | "5\n", 175 | "6\n", 176 | "7\n", 177 | "8\n", 178 | "9\n", 179 | "9\n", 180 | "0\n", 181 | "1\n", 182 | "2\n", 183 | "3\n", 184 | "4\n", 185 | "5\n", 186 | "6\n", 187 | "7\n", 188 | "8\n", 189 | "9\n", 190 | "9\n" 191 | ] 192 | } 193 | ], 194 | "prompt_number": 13 195 | }, 196 | { 197 | "cell_type": "code", 198 | "collapsed": false, 199 | "input": [ 200 | "person= raw_input(\"Enter your name: \")" 201 | ], 202 | "language": "python", 203 | "metadata": {}, 204 | "outputs": [ 205 | { 206 | "name": "stdout", 207 | "output_type": "stream", 208 | "stream": "stdout", 209 | "text": [ 210 | "Enter your name: Richard\n" 211 | ] 
212 | } 213 | ], 214 | "prompt_number": 14 215 | }, 216 | { 217 | "cell_type": "code", 218 | "collapsed": false, 219 | "input": [ 220 | "person" 221 | ], 222 | "language": "python", 223 | "metadata": {}, 224 | "outputs": [ 225 | { 226 | "metadata": {}, 227 | "output_type": "pyout", 228 | "prompt_number": 15, 229 | "text": [ 230 | "'Richard'" 231 | ] 232 | } 233 | ], 234 | "prompt_number": 15 235 | }, 236 | { 237 | "cell_type": "code", 238 | "collapsed": false, 239 | "input": [ 240 | "3/4" 241 | ], 242 | "language": "python", 243 | "metadata": {}, 244 | "outputs": [ 245 | { 246 | "metadata": {}, 247 | "output_type": "pyout", 248 | "prompt_number": 16, 249 | "text": [ 250 | "0" 251 | ] 252 | } 253 | ], 254 | "prompt_number": 16 255 | }, 256 | { 257 | "cell_type": "code", 258 | "collapsed": false, 259 | "input": [ 260 | "3/4.0" 261 | ], 262 | "language": "python", 263 | "metadata": {}, 264 | "outputs": [ 265 | { 266 | "metadata": {}, 267 | "output_type": "pyout", 268 | "prompt_number": 17, 269 | "text": [ 270 | "0.75" 271 | ] 272 | } 273 | ], 274 | "prompt_number": 17 275 | }, 276 | { 277 | "cell_type": "code", 278 | "collapsed": false, 279 | "input": [ 280 | "import csv" 281 | ], 282 | "language": "python", 283 | "metadata": {}, 284 | "outputs": [], 285 | "prompt_number": 24 286 | }, 287 | { 288 | "cell_type": "code", 289 | "collapsed": false, 290 | "input": [ 291 | "inputFile = open('../lede_algorithms/class1_1/exercise/excuse.csv','rU')\n", 292 | "inputReader = csv.reader(inputFile)" 293 | ], 294 | "language": "python", 295 | "metadata": {}, 296 | "outputs": [], 297 | "prompt_number": 27 298 | }, 299 | { 300 | "cell_type": "code", 301 | "collapsed": false, 302 | "input": [ 303 | "for line in inputFile:\n", 304 | " line = line.split(',')\n", 305 | " print line" 306 | ], 307 | "language": "python", 308 | "metadata": {}, 309 | "outputs": [ 310 | { 311 | "output_type": "stream", 312 | "stream": "stdout", 313 | "text": [ 314 | "['excuse', 'headline', 'hyperlink\\rthe fog was unexpected and did slow us down a bit', \"De Blasio Blames 'Rough Night' and Fog for Missing Flight 587 Ceremony\", 'http://www.dnainfo.com/new-york/20141112/rockaway-park/de-blasio-arrives-20-minutes-late-flight-587-memorial-angering-families\\rwe had some meetings at Gracie Mansion', \"De Blasio 30 Minutes Late to Rockaway St. Patrick's Day Parade\", 'http://www.dnainfo.com/new-york/20150307/belle-harbor/de-blasio-30-minutes-late-rockaway-st-patricks-day-parade\\rI had a very rough night and woke up sluggish', \"De Blasio Blames 'Rough Night' and Fog for Missing Flight 587 Ceremony\", 'http://www.dnainfo.com/new-york/20141112/rockaway-park/de-blasio-arrives-20-minutes-late-flight-587-memorial-angering-families\\rI just woke up in the middle of the night and couldn\\x89\\xdb\\xaat get back to sleep', \"De Blasio Blames 'Rough Night' and Fog for Missing Flight 587 Ceremony\", 'http://www.dnainfo.com/new-york/20141112/rockaway-park/de-blasio-arrives-20-minutes-late-flight-587-memorial-angering-families\\rwe had some stuff we had to do', \"De Blasio 30 Minutes Late to Rockaway St. 
Patrick's Day Parade\", 'http://www.dnainfo.com/new-york/20150307/belle-harbor/de-blasio-30-minutes-late-rockaway-st-patricks-day-parade\\rI should have gotten myself moving quicker', \"De Blasio Blames 'Rough Night' and Fog for Missing Flight 587 Ceremony\", 'http://www.dnainfo.com/new-york/20141112/rockaway-park/de-blasio-arrives-20-minutes-late-flight-587-memorial-angering-families\\rI was just not feeling well this morning', \"De Blasio Blames 'Rough Night' and Fog for Missing Flight 587 Ceremony\", 'http://www.dnainfo.com/new-york/20141112/rockaway-park/de-blasio-arrives-20-minutes-late-flight-587-memorial-angering-families\\rbreakfast began a little later than expected', '\"De Blasio 15 Minutes Late to St. Patrick\\'s Day Mass', ' Blames Breakfast\"', 'http://www.dnainfo.com/new-york/20150317/midtown/de-blasio-15-minutes-late-st-patricks-day-mass-blames-breakfast\\rthe detail drove away when we went into the subway rather than waiting to confirm we got on a train', 'Mayor de Blasio Is Irked by a Subway Delay', 'http://www.nytimes.com/2015/05/06/nyregion/mayor-de-blasio-is-irked-by-a-subway-delay.html?ref=nyregion&_r=0\\rwe waited 20 mins for an express only to hear there were major delays', 'Mayor de Blasio Is Irked by a Subway Delay', 'http://www.nytimes.com/2015/05/06/nyregion/mayor-de-blasio-is-irked-by-a-subway-delay.html?ref=nyregion&_r=0\\rwe need a better system', 'Mayor de Blasio Is Irked by a Subway Delay', 'http://www.nytimes.com/2015/05/06/nyregion/mayor-de-blasio-is-irked-by-a-subway-delay.html?ref=nyregion&_r=0']\n" 315 | ] 316 | } 317 | ], 318 | "prompt_number": 28 319 | }, 320 | { 321 | "cell_type": "code", 322 | "collapsed": false, 323 | "input": [ 324 | "inputFile = open('/Users/richarddunks/Dropbox/Datapolitan/Projects/Training/)" 325 | ], 326 | "language": "python", 327 | "metadata": {}, 328 | "outputs": [] 329 | } 330 | ], 331 | "metadata": {} 332 | } 333 | ] 334 | } -------------------------------------------------------------------------------- /class1_1/newsroom_examples.md: -------------------------------------------------------------------------------- 1 | # Algorithms in the newsroom 2 | 3 | Ever since Phillip Meyer published [Precision Journalism](http://www.unc.edu/~pmeyer/book/) in the 1970s (and probably even a bit before), journalists have been using algorithms in some form in order to tell new and different kinds of stories. Below we've included several examples from different eras of what you'd now call data journalism: 4 | 5 | ### Classic examples 6 | 7 | Computer-assisted reporting specialists have been a fixture in newsroom for decades, applying data analysis and social science methods to the news report. Some of the most sophisticated of those techniques are based in part on algorithms we'll learn in this class: 8 | 9 | - **School Scandals, Children Left Behind: Cheating in Texas Schools and Also [Faking the Grade](http://clipfile.org/?p=892)**: Two powerful series of stories by the Dallas Morning News in the mid-2000s that showed rampant cheating by students and teachers on Texas standardized exams. Not the first story to use regression models but one of the most powerful early examples. 10 | 11 | - **[Speed Trap: Who Gets a Ticket, Who Gets a Break?](http://www.boston.com/globe/metro/packages/tickets/)**: Another early example of using logistic regression to explain a newsworthy phenomenon -- in this case the many factors that go into whether a person is given a speeding ticket or let off the hook. 
Just as interesting as the story is its [detailed methodology](http://www.boston.com/globe/metro/packages/tickets/study.pdf), which is worth a read. 12 | 13 | - **[Cluster analysis in CAR](https://www.ire.org/publications/search-uplink-archives/167/)**: Simple cluster analysis has been used for years in newsrooms to find everything from [crime hotspots](http://www.icpsr.umich.edu/CrimeStat/) to [cancer clusters on Long Island](http://www.ij-healthgeographics.com/content/2/1/3). 14 | 15 | ### Algorithmic journalism catches on 16 | 17 | Although reporters and computer-assisted reporting specialists had been doing some form of it for years, the idea of "data journalism" as it is now known was popularized during the 2012 presidential elections, in large part thanks to the predictive modeling of Nate Silver. 18 | 19 | - **[FiveThirtyEight](http://fivethirtyeight.blogs.nytimes.com/fivethirtyeights-2012-forecast/)**: Nate Silver's prediction models were the first example of data/algorithmic journalism reaching the mainstream. Since then, election predictions have become a bit old hat. The Times' new model, [Leo](http://www.nytimes.com/newsgraphics/2014/senate-model/), was exceedingly accurate in 2014 (its [source code](https://github.com/TheUpshot/leo-senate-model) is online). The Times also ran a series of [live predictions](http://elections.nytimes.com/2014/senate-model) on key 2014 races on Election Night. 20 | 21 | - **[ProPublica's Message Machine](https://projects.propublica.org/emails/)**: Also during the 2012 elections, ProPublica launched its Message Machine project, which used hashing algorithms to reverse-engineer targeted e-mail messages from political campaigns. 22 | 23 | - **[L.A. Times crime alerts](http://maps.latimes.com/crime/)**: The Los Angeles Times has for years been calculating and publicizing alerts when crime spikes in certain neighborhoods. 24 | 25 | ### Modern examples 26 | 27 | These days, sophisticated algorithms are used to solve all sorts of journalistic problems, both exciting and mundane. 28 | 29 | - **[Campaign finance data deduplication](https://github.com/cjdd3b/fec-standardizer/wiki)**: Most campaign finance data is organized by contribution, not donor. Joe Smith might give three different contributions and be listed in the data in three different ways. Connecting those records into a single canonical Joe Smith is often the first step to doing sophisticated campaign finance analysis. Over the last few years, people have developed highly accurate methods to do this using both supervised and unsupervised machine learning. 30 | 31 | - **[NYT Cooking](http://cooking.nytimes.com/)**: The new Cooking website and app has been one of the Times' most successful new products, but it was initially based largely on recipes stored in free-text articles. The Times extracted many of those recipes using an algorithmic technique known as [conditional random fields](http://open.blogs.nytimes.com/2015/04/09/extracting-structured-data-from-recipes-using-conditional-random-fields/). The L.A. Times did something [similar](https://source.opennews.org/en-US/articles/how-we-made-new-california-cookbook/) in 2013.
-------------------------------------------------------------------------------- /class1_2/2013_NYC_CD_MedianIncome_Recycle.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datapolitan/lede_algorithms/9109b5c91a3eb74c1343e79daf60abd76be182ea/class1_2/2013_NYC_CD_MedianIncome_Recycle.xlsx -------------------------------------------------------------------------------- /class1_2/Data_Collection_Sheet.csv: -------------------------------------------------------------------------------- 1 | name,height (inches),age (years),siblings (not including you) 2 | Richard,72,35,1 3 | Adam,71,29,1 4 | Rashida,62,24,0 5 | Spe ,62,23,1 6 | Jiachuan,67,25,0 7 | GK,66,36,1 8 | Fanny,64,32,3 9 | Meghan,65,27,2 10 | Arthur,74,49,1 11 | Elliott,66,31,3 12 | Aliza,64,23,1 13 | Lindsay,67,22,2 14 | Michael,71,49,3 15 | Vanessa,66,27,2 16 | Melissa,61,23,1 17 | Kassahun,67,32,3 18 | Sebastian,67.7165,26,2 19 | Giulia,67,27,2 20 | Siutan,66,25,0 21 | Tian,68,23,0 22 | Laure,65,25,3 23 | Katie,67,36,1 -------------------------------------------------------------------------------- /class1_2/height_weight.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datapolitan/lede_algorithms/9109b5c91a3eb74c1343e79daf60abd76be182ea/class1_2/height_weight.xlsx -------------------------------------------------------------------------------- /class2_1/README.md: -------------------------------------------------------------------------------- 1 | # Algorithms: Week 2, Class 1 (Tuesday, July 21) 2 | 3 | Today we'll spend the first hour reviewing last week's material by exploring a new dataset: transportation data used by FiveThirtyEight to build their [fastest-flight tracker](http://projects.fivethirtyeight.com/flights/), which launched last month. 4 | 5 | Then we'll talk about why it's important to learn, and explain, what's going on under the hood of modern algorithms -- both as an exercise in transparency and skepticism and as a setup for Thursday's class, where we'll begin discussing regression. 6 | 7 | In lab, you'll continue our class work by analyzing some data on your own and developing your results into story ideas. Then we'll ask you to critique a project released earlier this summer by NPR. 8 | 9 | ## Hour 1: Exploratory data analysis review 10 | 11 | We'll be working with airline on-time performance reports from the U.S. Department of Transportation, which you can download [here](http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time). The rest of what you'll need is in the accompanying [iPython notebook](https://github.com/datapolitan/lede_algorithms/blob/master/class2_1/EDA_Review.ipynb). 12 | 13 | ## Hour 2: Transparency and the "nerd box" 14 | 15 | First, we'll read Simon Rogers' piece in Mother Jones: [Hey Wonk Reporters, Liberate Your Data!](http://www.motherjones.com/media/2014/04/vox-538-upshot-open-data-missing) (and possibly this: [Debugging the Backlash to Data Journalism](http://towcenter.org/debugging-the-backlash-to-data-journalism/)). 16 | 17 | Then we'll discuss the evolution of transparency in data journalism and why it's important: from the journalistic tradition of the "nerd box," through making data available via news apps, and finally to more modern examples of transparency. 18 | 19 | - [The Boston Globe's Speed Trap: Who Gets a Ticket, Who Gets a Break?](http://www.boston.com/globe/metro/packages/tickets/study.pdf) 20 | - [St. 
Petersburg (now Tampa Bay) Times: Vanishing Wetlands](http://www.sptimes.com/2006/webspecials06/wetlands/) 21 | - [Ft. Lauderdale Sun Sentinel police speeding investigation](http://databases.sun-sentinel.com/news/broward/ftlaudCopSpeeds/ftlaudCopSpeeds_list.php) 22 | - [Washington Post police shootings](http://www.washingtonpost.com/national/how-the-washington-post-is-examining-police-shootings-in-the-us/2015/06/29/f42c10b2-151b-11e5-9518-f9e0a8959f32_story.html) 23 | - [Leo: The NYT Senate model](http://www.nytimes.com/newsgraphics/2014/senate-model/methodology.html) 24 | 25 | ## Hour 3: From transparent data to transparent algorithms 26 | 27 | Even if you never write another algorithm before you die, this class should at least teach you enough to ask good questions about their capabilities and roles in society. We'll look at stories from some reporters who can articulate a clear, accurate understanding of how algorithms work, as well as some who ... well ... can't. 28 | 29 | Here are a few examples from one of those categories: 30 | 31 | - [Experts predict robots will take over 30% of our jobs by 2025 — and white-collar jobs aren't immune](http://www.businessinsider.com/experts-predict-that-one-third-of-jobs-will-be-replaced-by-robots-2015-5) 32 | - [Journalists, here's how robots are going to steal your job](http://www.newstatesman.com/future-proof/2014/03/journalists-heres-how-robots-are-going-steal-your-job) 33 | - [Artificial intelligence could end mankind: Hawking](http://www.cnbc.com/2014/05/04/artificial-intelligence-could-end-mankind-hawking.html) 34 | - ['Chappie' Doesn't Think Robots Will Destroy the World](http://www.nbcnews.com/tech/innovation/chappie-doesnt-think-robots-will-destroy-world-n305876) 35 | - [What Happens When Robots Write the Future?](http://op-talk.blogs.nytimes.com/2014/08/18/what-happens-when-robots-write-the-future/) 36 | 37 | And a few from the other: 38 | 39 | - [At UPS, the Algorithm Is the Driver](http://www.wsj.com/articles/at-ups-the-algorithm-is-the-driver-1424136536) 40 | - [If Algorithms Know All, How Much Should Humans Help?](http://www.nytimes.com/2015/04/07/upshot/if-algorithms-know-all-how-much-should-humans-help.html?abt=0002&abg=0) 41 | - [The Potential and the Risks of Data Science](http://bits.blogs.nytimes.com/2013/04/07/the-potential-and-the-risks-of-data-science/) 42 | - [Google Schools Its Algorithm](http://www.nytimes.com/2011/03/06/weekinreview/06lohr.html) 43 | - [When Algorithms Discriminate](http://www.nytimes.com/2015/07/10/upshot/when-algorithms-discriminate.html) 44 | 45 | ## Lab 46 | 47 | You'll be working on two things in lab today. First, by way of review, download a slice of the [data](http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time) we used for the class exercises. Explore that data on your own using the skills you've learned so far and come back with two 100-ish-word story pitches. Each pitch should also come with a brief description of how you analyzed the data in order to get that idea. 48 | 49 | Second, write a short critique of NPR and Planet Money's recent coverage of the effect of algorithms and automation on the labor market. 
You don't need to listen to all of their podcasts on the subject (they did a handful), but check out [a few of them](http://www.npr.org/sections/money/2015/05/08/405270046/episode-622-humans-vs-robots), look at some of their [data visualizations](http://www.npr.org/sections/money/2015/05/21/408234543/will-your-job-be-done-by-a-machine) and play around with their tool that calculates whether your job is [likely to be done by a machine](http://www.npr.org/sections/money/2015/05/21/408234543/will-your-job-be-done-by-a-machine). 50 | 51 | ## Questions 52 | 53 | I'm at chase.davis@nytimes.com, and I'll be on Slack after class. -------------------------------------------------------------------------------- /class2_2/DoNow_2-2.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "#Accomplish the following tasks by whatever means necessary based on the material we've covered in class. Save the notebook in this format: `_DoNow_2-2.ipynb` where `` is your last (family) name and turn it in via Slack." 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "# the magic command to plot inline with the notebook \n", 19 | "# https://ipython.org/ipython-doc/dev/interactive/tutorial.html#magic-functions\n", 20 | "%matplotlib inline" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "###1. Import the pandas package and use the common alias" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": null, 33 | "metadata": { 34 | "collapsed": true 35 | }, 36 | "outputs": [], 37 | "source": [] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "###2. Read the file \"heights_weights.xlsx\" in the `data` folder into a pandas dataframe" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": null, 49 | "metadata": { 50 | "collapsed": true 51 | }, 52 | "outputs": [], 53 | "source": [] 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "metadata": {}, 58 | "source": [ 59 | "###3. Plot a histogram for both height and weight. Describe the data distribution in comments." 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": null, 65 | "metadata": { 66 | "collapsed": false 67 | }, 68 | "outputs": [], 69 | "source": [] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "metadata": {}, 74 | "source": [ 75 | "###4. Calculate the mean height and mean weight for the dataframe. " 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": null, 81 | "metadata": { 82 | "collapsed": false 83 | }, 84 | "outputs": [], 85 | "source": [] 86 | }, 87 | { 88 | "cell_type": "markdown", 89 | "metadata": {}, 90 | "source": [ 91 | "###5. Calculate the other significant descriptive statistics on the two data points\n", 92 | "+ Standard deviation\n", 93 | "+ Range\n", 94 | "+ Interquartile range" 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": null, 100 | "metadata": { 101 | "collapsed": false 102 | }, 103 | "outputs": [], 104 | "source": [] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "metadata": {}, 109 | "source": [ 110 | "###6. Calculate the coefficient of correlation for these variables. Do they appear correlated? 
(put your answer in comments)" 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": null, 116 | "metadata": { 117 | "collapsed": false 118 | }, 119 | "outputs": [], 120 | "source": [] 121 | }, 122 | { 123 | "cell_type": "markdown", 124 | "metadata": {}, 125 | "source": [ 126 | "###Extra Credit: Create a scatter plot of height and weight" 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": null, 132 | "metadata": { 133 | "collapsed": false 134 | }, 135 | "outputs": [], 136 | "source": [] 137 | } 138 | ], 139 | "metadata": { 140 | "kernelspec": { 141 | "display_name": "Python 2", 142 | "language": "python", 143 | "name": "python2" 144 | }, 145 | "language_info": { 146 | "codemirror_mode": { 147 | "name": "ipython", 148 | "version": 2 149 | }, 150 | "file_extension": ".py", 151 | "mimetype": "text/x-python", 152 | "name": "python", 153 | "nbconvert_exporter": "python", 154 | "pygments_lexer": "ipython2", 155 | "version": "2.7.10" 156 | } 157 | }, 158 | "nbformat": 4, 159 | "nbformat_minor": 0 160 | } 161 | -------------------------------------------------------------------------------- /class2_2/data/2013_NYC_CD_MedianIncome_Recycle.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datapolitan/lede_algorithms/9109b5c91a3eb74c1343e79daf60abd76be182ea/class2_2/data/2013_NYC_CD_MedianIncome_Recycle.xlsx -------------------------------------------------------------------------------- /class2_2/data/height_weight.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datapolitan/lede_algorithms/9109b5c91a3eb74c1343e79daf60abd76be182ea/class2_2/data/height_weight.xlsx -------------------------------------------------------------------------------- /class3_1/.ipynb_checkpoints/classification-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 2, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import csv, re, string\n", 12 | "import numpy as np\n", 13 | "from sklearn.linear_model import LogisticRegression\n", 14 | "from sklearn.feature_extraction.text import CountVectorizer\n", 15 | "from sklearn.pipeline import Pipeline" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 15, 21 | "metadata": { 22 | "collapsed": false 23 | }, 24 | "outputs": [], 25 | "source": [ 26 | "PUNCTUATION = re.compile('[%s]' % re.escape(string.punctuation))\n", 27 | "VALID_CLASSES = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'J', 'K', 'L', 'M', 'T', 'X', 'Z']" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": 16, 33 | "metadata": { 34 | "collapsed": false 35 | }, 36 | "outputs": [], 37 | "source": [ 38 | "data = []\n", 39 | "with open('data/category-training.csv', 'r') as f:\n", 40 | " inputreader = csv.reader(f, delimiter=',', quotechar='\"')\n", 41 | " for r in inputreader:\n", 42 | " # Concatenate the occupation and employer strings together and remove\n", 43 | " # punctuation. Both occupation and employer will be used in prediction.\n", 44 | " text = PUNCTUATION.sub('', ' '.join(r[0:2]))\n", 45 | " if len(r[2]) > 1 and r[2][0] in VALID_CLASSES:\n", 46 | " # We're only attempting to classify the first character of the\n", 47 | " # industry prefix (\"A\", \"B\", etc.) -- not the whole thing. 
That's\n", 48 | " # what the r[2][0] piece is about.\n", 49 | " data.append([text, r[2][0]])" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 18, 55 | "metadata": { 56 | "collapsed": true 57 | }, 58 | "outputs": [], 59 | "source": [ 60 | " texts = np.array([el[0] for el in data])\n", 61 | " classes = np.array([el[1] for el in data])" 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": 19, 67 | "metadata": { 68 | "collapsed": false 69 | }, 70 | "outputs": [ 71 | { 72 | "name": "stdout", 73 | "output_type": "stream", 74 | "text": [ 75 | "['Owner First Priority Title Llc' 'SENIOR PARTNER ARES MANAGEMENT'\n", 76 | " 'CEO HB AGENCY' ..., 'INVESTMENT EXECUTIVE FEF MANAGEMENT LLC'\n", 77 | " 'Owner Fair Funeral Home' 'ST MARTIN LIRERRE LAW FIRM ']\n" 78 | ] 79 | } 80 | ], 81 | "source": [ 82 | "print texts" 83 | ] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "execution_count": 20, 88 | "metadata": { 89 | "collapsed": false 90 | }, 91 | "outputs": [ 92 | { 93 | "name": "stdout", 94 | "output_type": "stream", 95 | "text": [ 96 | "['F' 'Z' 'Z' ..., 'F' 'G' 'K']\n" 97 | ] 98 | } 99 | ], 100 | "source": [ 101 | "print classes" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": 21, 107 | "metadata": { 108 | "collapsed": true 109 | }, 110 | "outputs": [], 111 | "source": [ 112 | "pipeline = Pipeline([\n", 113 | " ('vectorizer', CountVectorizer(\n", 114 | " ngram_range=(1,2),\n", 115 | " stop_words='english',\n", 116 | " min_df=2,\n", 117 | " max_df=len(texts))),\n", 118 | " ('classifier', LogisticRegression())\n", 119 | "])" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": 22, 125 | "metadata": { 126 | "collapsed": false 127 | }, 128 | "outputs": [ 129 | { 130 | "data": { 131 | "text/plain": [ 132 | "Pipeline(steps=[('vectorizer', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',\n", 133 | " dtype=, encoding=u'utf-8', input=u'content',\n", 134 | " lowercase=True, max_df=66923, max_features=None, min_df=2,\n", 135 | " ngram_range=(1, 2), preprocessor=None, stop_words='english...',\n", 136 | " penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n", 137 | " verbose=0))])" 138 | ] 139 | }, 140 | "execution_count": 22, 141 | "metadata": {}, 142 | "output_type": "execute_result" 143 | } 144 | ], 145 | "source": [ 146 | "pipeline.fit(np.asarray(texts), np.asarray(classes))" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": 27, 152 | "metadata": { 153 | "collapsed": false 154 | }, 155 | "outputs": [ 156 | { 157 | "name": "stdout", 158 | "output_type": "stream", 159 | "text": [ 160 | "['K']\n" 161 | ] 162 | } 163 | ], 164 | "source": [ 165 | "print pipeline.predict(['LAWYER'])" 166 | ] 167 | }, 168 | { 169 | "cell_type": "code", 170 | "execution_count": 28, 171 | "metadata": { 172 | "collapsed": false 173 | }, 174 | "outputs": [ 175 | { 176 | "name": "stdout", 177 | "output_type": "stream", 178 | "text": [ 179 | "['K']\n" 180 | ] 181 | } 182 | ], 183 | "source": [ 184 | "print pipeline.predict(['SKADDEN ARPS'])" 185 | ] 186 | }, 187 | { 188 | "cell_type": "code", 189 | "execution_count": 29, 190 | "metadata": { 191 | "collapsed": false, 192 | "scrolled": true 193 | }, 194 | "outputs": [ 195 | { 196 | "name": "stdout", 197 | "output_type": "stream", 198 | "text": [ 199 | "['J']\n" 200 | ] 201 | } 202 | ], 203 | "source": [ 204 | "print pipeline.predict(['COMPUTER PROGRAMMER'])" 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": 
null, 210 | "metadata": { 211 | "collapsed": true 212 | }, 213 | "outputs": [], 214 | "source": [] 215 | } 216 | ], 217 | "metadata": { 218 | "kernelspec": { 219 | "display_name": "Python 2", 220 | "language": "python", 221 | "name": "python2" 222 | }, 223 | "language_info": { 224 | "codemirror_mode": { 225 | "name": "ipython", 226 | "version": 2 227 | }, 228 | "file_extension": ".py", 229 | "mimetype": "text/x-python", 230 | "name": "python", 231 | "nbconvert_exporter": "python", 232 | "pygments_lexer": "ipython2", 233 | "version": "2.7.9" 234 | } 235 | }, 236 | "nbformat": 4, 237 | "nbformat_minor": 0 238 | } 239 | -------------------------------------------------------------------------------- /class3_1/README.md: -------------------------------------------------------------------------------- 1 | # Algorithms: Week 3, Class 1 (Tuesday, July 28) 2 | 3 | Today we'll be reviewing a bit of material from last week, including both your lab assignments from last Tuesday and Thursday's lesson on regression, and then expanding on the latter in a couple ways. 4 | 5 | First we'll review, and (roughly) recreate a couple stories that have employed regression in basic ways. Then, in the interests of transparency, we'll talk a bit about what's going on under the hood with the algorithms that comprise regression. 6 | 7 | Finally we'll touch briefly on the idea of classification using a portion of a project we're currently implementing at the Times. 8 | 9 | ## Hour 1/1.5: Review 10 | 11 | First we'll talk through some of your story ideas and critiques from last Tuesday. Then we'll revisit some basic regression concepts from last week using [this iPython notebook](https://github.com/datapolitan/lede_algorithms/blob/master/class3_1/regression_review.ipynb), which (very roughly) mimics a project that the St. Paul Pioneer Press did in 2006 and 2010, known as [Schools that Work](http://www.twincities.com/ci_15487174). 12 | 13 | ## Hour 2: A closer look at regression 14 | 15 | Journalists tend to look at linear regression through a statistical lens and use it primarily to describe things, as in the case above. You can see another example here: 16 | 17 | - [Race gap found in pothole patching](http://www.jsonline.com/watchdog/watchdogreports/32580034.html) (Milwaukee Journal Sentinel). And the [associated explainer](http://www.jsonline.com/news/milwaukee/32580074.html). 18 | 19 | But looked at another way, linear regression is also a predictive model -- one that, at scale, is based on an algorithm that we can demystify, per our conversations last week. We'll spend a short amount of time talking about how that works and relate it (hypothetically) to this [fun story](http://fivethirtyeight.com/features/donald-trump-is-the-worlds-greatest-troll/) from FiveThirtyEight. 20 | 21 | ## Hour 3: Introduction to classification 22 | 23 | Using a [project](https://github.com/datapolitan/lede_algorithms/blob/master/class3_1/classification.ipynb) we've been working on at the Times, we'll expand our idea of supervised learning to include something that seems a bit more like what you might consider "machine learning" -- classifying people's jobs based on strings representing their occupation and employer. 24 | 25 | We'll also discuss how lots of data problems in journalism are secretly classification problems, including things like [sorting through documents](https://github.com/cjdd3b/nicar2014/tree/master/lightning-talk/naive-bayes) and [extracting quotes from news articles](https://github.com/cjdd3b/citizen-quotes).
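
To make that framing concrete before the lab, here is a minimal, hypothetical sketch of document sorting as a classification problem. The four headlines and two categories are invented for illustration; the occupation classifier we'll walk through in class uses the same basic pipeline idea with different features.

```python
# A tiny, made-up document classifier: text goes in, a category comes out.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

train_texts = [
    "city council approves new budget",      # invented training examples
    "mayor vetoes spending plan",
    "quarterback injured in season opener",
    "home team wins in overtime",
]
train_labels = ["politics", "politics", "sports", "sports"]

pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),       # turn each headline into word counts
    ("classifier", MultinomialNB()),         # a simple probabilistic classifier
])
pipeline.fit(train_texts, train_labels)

# Should lean toward "politics" because "budget" only appears in the politics examples.
print(pipeline.predict(["county budget hearing scheduled"]))
```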
26 | 27 | ## Lab 28 | 29 | Like last week, you'll be doing two things in the lab today: 30 | 31 | First you'll expand the schools analysis we did earlier by layering in other variables [(documented here)](http://www.cde.ca.gov/ta/ac/ap/reclayout12b.asp) using multiple regression, interpreting the results, and again writing two ledes about what you found. Back those ledes up with some internet research. If you find some schools that are over/under-performing or have other interesting characteristics, Google around to see what has been written about them. It's a good way to check your assumptions and to find other interesting facts to round out your story pitches. 32 | 33 | This of course comes with a huge, blinking-red caveat: This is an algorithms class, and we're not getting deep enough into the guts of statistical regression for you to run out and write full-on stories based on your findings. There are things like [p-values](http://blog.minitab.com/blog/adventures-in-statistics/how-to-interpret-regression-analysis-results-p-values-and-coefficients) to consider, as well as rules of thumb for interpreting r-squared. If you'd like to get more in depth with that, we can carve out some time later in the course. 34 | 35 | Your second assignment today is to write a short story (300-500 words) about [this company](https://www.upstart.com/), which is a startup that uses predictive models to assess creditworthiness using variables that go beyond credit score. No doubt their model is more complex than this, but you can think of the intuition as being similar to regression -- a handful of independent variables that help predict the likelihood that someone will pay their loan back. What are the implications of this? Why might it be good or bad for consumers if this catches on? -------------------------------------------------------------------------------- /class3_1/classification.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Classification in the Wild" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "This code comes straight from a Times project that helps us standardize campaign finance data to enable new types of analyses. Specifically, it tries to categorize a free-form occupation/employer string into a discrete job category (for example, the strings \"LAWYER\" and \"ATTORNEY\" would both be categorized under \"LAW\").\n", 15 | "\n", 16 | "We use this to create one of a large number of features that inform the larger predictive model we use for standardization. But it also shows the power of simple classification in action."
17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": 2, 22 | "metadata": { 23 | "collapsed": true 24 | }, 25 | "outputs": [], 26 | "source": [ 27 | "import csv, re, string\n", 28 | "import numpy as np\n", 29 | "from sklearn.linear_model import LogisticRegression\n", 30 | "from sklearn.feature_extraction.text import CountVectorizer\n", 31 | "from sklearn.pipeline import Pipeline" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": 30, 37 | "metadata": { 38 | "collapsed": false 39 | }, 40 | "outputs": [], 41 | "source": [ 42 | "# Some basic setup for data-cleaning purposes\n", 43 | "PUNCTUATION = re.compile('[%s]' % re.escape(string.punctuation))\n", 44 | "VALID_CLASSES = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'J', 'K', 'L', 'M', 'T', 'X', 'Z']" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": 16, 50 | "metadata": { 51 | "collapsed": false 52 | }, 53 | "outputs": [], 54 | "source": [ 55 | "# Open the training data and clean it up a bit\n", 56 | "data = []\n", 57 | "with open('data/category-training.csv', 'r') as f:\n", 58 | " inputreader = csv.reader(f, delimiter=',', quotechar='\"')\n", 59 | " for r in inputreader:\n", 60 | " # Concatenate the occupation and employer strings together and remove\n", 61 | " # punctuation. Both occupation and employer will be used in prediction.\n", 62 | " text = PUNCTUATION.sub('', ' '.join(r[0:2]))\n", 63 | " if len(r[2]) > 1 and r[2][0] in VALID_CLASSES:\n", 64 | " # We're only attempting to classify the first character of the\n", 65 | " # industry prefix (\"A\", \"B\", etc.) -- not the whole thing. That's\n", 66 | " # what the r[2][0] piece is about.\n", 67 | " data.append([text, r[2][0]])" 68 | ] 69 | }, 70 | { 71 | "cell_type": "code", 72 | "execution_count": 18, 73 | "metadata": { 74 | "collapsed": true 75 | }, 76 | "outputs": [], 77 | "source": [ 78 | " # Separate the text of the occupation/employer strings from the correct classification\n", 79 | " texts = np.array([el[0] for el in data])\n", 80 | " classes = np.array([el[1] for el in data])" 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": 19, 86 | "metadata": { 87 | "collapsed": false 88 | }, 89 | "outputs": [ 90 | { 91 | "name": "stdout", 92 | "output_type": "stream", 93 | "text": [ 94 | "['Owner First Priority Title Llc' 'SENIOR PARTNER ARES MANAGEMENT'\n", 95 | " 'CEO HB AGENCY' ..., 'INVESTMENT EXECUTIVE FEF MANAGEMENT LLC'\n", 96 | " 'Owner Fair Funeral Home' 'ST MARTIN LIRERRE LAW FIRM ']\n" 97 | ] 98 | } 99 | ], 100 | "source": [ 101 | "print texts" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": 20, 107 | "metadata": { 108 | "collapsed": false 109 | }, 110 | "outputs": [ 111 | { 112 | "name": "stdout", 113 | "output_type": "stream", 114 | "text": [ 115 | "['F' 'Z' 'Z' ..., 'F' 'G' 'K']\n" 116 | ] 117 | } 118 | ], 119 | "source": [ 120 | "print classes" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": 31, 126 | "metadata": { 127 | "collapsed": true 128 | }, 129 | "outputs": [], 130 | "source": [ 131 | "# Build a simple machine learning pipeline to turn the above arrays into something scikit-learn understands\n", 132 | "pipeline = Pipeline([\n", 133 | " ('vectorizer', CountVectorizer(\n", 134 | " ngram_range=(1,2),\n", 135 | " stop_words='english',\n", 136 | " min_df=2,\n", 137 | " max_df=len(texts))),\n", 138 | " ('classifier', LogisticRegression())\n", 139 | "])" 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": 32, 145 | 
"metadata": { 146 | "collapsed": false 147 | }, 148 | "outputs": [ 149 | { 150 | "data": { 151 | "text/plain": [ 152 | "Pipeline(steps=[('vectorizer', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',\n", 153 | " dtype=, encoding=u'utf-8', input=u'content',\n", 154 | " lowercase=True, max_df=66923, max_features=None, min_df=2,\n", 155 | " ngram_range=(1, 2), preprocessor=None, stop_words='english...',\n", 156 | " penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n", 157 | " verbose=0))])" 158 | ] 159 | }, 160 | "execution_count": 32, 161 | "metadata": {}, 162 | "output_type": "execute_result" 163 | } 164 | ], 165 | "source": [ 166 | "# Fit the model\n", 167 | "pipeline.fit(np.asarray(texts), np.asarray(classes))" 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": 27, 173 | "metadata": { 174 | "collapsed": false 175 | }, 176 | "outputs": [ 177 | { 178 | "name": "stdout", 179 | "output_type": "stream", 180 | "text": [ 181 | "['K']\n" 182 | ] 183 | } 184 | ], 185 | "source": [ 186 | "# Now, run some predictions. \"K\" means \"LAW\" in this case.\n", 187 | "print pipeline.predict(['LAWYER'])" 188 | ] 189 | }, 190 | { 191 | "cell_type": "code", 192 | "execution_count": 28, 193 | "metadata": { 194 | "collapsed": false 195 | }, 196 | "outputs": [ 197 | { 198 | "name": "stdout", 199 | "output_type": "stream", 200 | "text": [ 201 | "['K']\n" 202 | ] 203 | } 204 | ], 205 | "source": [ 206 | "# It also recognizes law firms!\n", 207 | "print pipeline.predict(['SKADDEN ARPS'])" 208 | ] 209 | }, 210 | { 211 | "cell_type": "code", 212 | "execution_count": 34, 213 | "metadata": { 214 | "collapsed": false, 215 | "scrolled": true 216 | }, 217 | "outputs": [ 218 | { 219 | "name": "stdout", 220 | "output_type": "stream", 221 | "text": [ 222 | "['F']\n" 223 | ] 224 | } 225 | ], 226 | "source": [ 227 | "# The \"F\" category represents business and finance.\n", 228 | "print pipeline.predict(['CEO'])" 229 | ] 230 | } 231 | ], 232 | "metadata": { 233 | "kernelspec": { 234 | "display_name": "Python 2", 235 | "language": "python", 236 | "name": "python2" 237 | }, 238 | "language_info": { 239 | "codemirror_mode": { 240 | "name": "ipython", 241 | "version": 2 242 | }, 243 | "file_extension": ".py", 244 | "mimetype": "text/x-python", 245 | "name": "python", 246 | "nbconvert_exporter": "python", 247 | "pygments_lexer": "ipython2", 248 | "version": "2.7.9" 249 | } 250 | }, 251 | "nbformat": 4, 252 | "nbformat_minor": 0 253 | } 254 | -------------------------------------------------------------------------------- /class3_2/3-2_DoNow.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "##1. Import the necessary packages to read in the data, plot, and create a linear regression model" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "metadata": {}, 22 | "source": [ 23 | "## 2. Read in the hanford.csv file " 24 | ] 25 | }, 26 | { 27 | "cell_type": "code", 28 | "execution_count": null, 29 | "metadata": { 30 | "collapsed": true 31 | }, 32 | "outputs": [], 33 | "source": [] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "metadata": {}, 38 | "source": [ 39 | "" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "## 3. 
Calculate the basic descriptive statistics on the data" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": null, 52 | "metadata": { 53 | "collapsed": false 54 | }, 55 | "outputs": [], 56 | "source": [] 57 | }, 58 | { 59 | "cell_type": "markdown", 60 | "metadata": {}, 61 | "source": [ 62 | "## 4. Calculate the coefficient of correlation (r) and generate the scatter plot. Does there seem to be a correlation worthy of investigation?" 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": null, 68 | "metadata": { 69 | "collapsed": false 70 | }, 71 | "outputs": [], 72 | "source": [] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": {}, 77 | "source": [ 78 | "## 5. Create a linear regression model based on the available data to predict the mortality rate given a level of exposure" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": null, 84 | "metadata": { 85 | "collapsed": true 86 | }, 87 | "outputs": [], 88 | "source": [] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "metadata": { 93 | "collapsed": true 94 | }, 95 | "source": [ 96 | "## 6. Plot the linear regression line on the scatter plot of values. Calculate the r^2 (coefficient of determination)" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": null, 102 | "metadata": { 103 | "collapsed": false 104 | }, 105 | "outputs": [], 106 | "source": [] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "metadata": { 111 | "collapsed": true 112 | }, 113 | "source": [ 114 | "## 7. Predict the mortality rate (Cancer per 100,000 man years) given an index of exposure = 10" 115 | ] 116 | }, 117 | { 118 | "cell_type": "code", 119 | "execution_count": null, 120 | "metadata": { 121 | "collapsed": true 122 | }, 123 | "outputs": [], 124 | "source": [] 125 | } 126 | ], 127 | "metadata": { 128 | "kernelspec": { 129 | "display_name": "Python 2", 130 | "language": "python", 131 | "name": "python2" 132 | }, 133 | "language_info": { 134 | "codemirror_mode": { 135 | "name": "ipython", 136 | "version": 2 137 | }, 138 | "file_extension": ".py", 139 | "mimetype": "text/x-python", 140 | "name": "python", 141 | "nbconvert_exporter": "python", 142 | "pygments_lexer": "ipython2", 143 | "version": "2.7.10" 144 | } 145 | }, 146 | "nbformat": 4, 147 | "nbformat_minor": 0 148 | } 149 | -------------------------------------------------------------------------------- /class3_2/3-2_Exercises.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "##We covered a lot of information today and I'd like you to practice developing classification trees on your own. For each exercise, work through the problem, determine the result, and provide the requested interpretation in comments along with the code. The point is to build classifiers, not necessarily good classifiers (that will hopefully come later)" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "###1. Load the iris dataset and create a holdout set that is 50% of the data (50% in training and 50% in test). 
Output the results (don't worry about creating the tree visual unless you'd like to) and discuss them briefly (are they good or not?)" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": null, 20 | "metadata": { 21 | "collapsed": true 22 | }, 23 | "outputs": [], 24 | "source": [] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "###2. Redo the model with a 75% - 25% training/test split and compare the results. Are they better or worse than before? Discuss why this may be." 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": null, 36 | "metadata": { 37 | "collapsed": true 38 | }, 39 | "outputs": [], 40 | "source": [] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "###3. Perform 10-fold cross validation on the data and compare your results to the hold out method we used in 1 and 2. Take the average of the results. What do you notice about the accuracy measures in each of these?" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": null, 52 | "metadata": { 53 | "collapsed": true 54 | }, 55 | "outputs": [], 56 | "source": [] 57 | }, 58 | { 59 | "cell_type": "markdown", 60 | "metadata": {}, 61 | "source": [ 62 | "###4. Open the seeds_dataset.txt and perform basic exploratory analysis. What attributes do we have? What are we trying to predict?\n", 63 | "For context of the data, see the documentation here: https://archive.ics.uci.edu/ml/datasets/seeds" 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": null, 69 | "metadata": { 70 | "collapsed": true 71 | }, 72 | "outputs": [], 73 | "source": [] 74 | }, 75 | { 76 | "cell_type": "markdown", 77 | "metadata": {}, 78 | "source": [ 79 | "###5. Using the seeds_dataset.txt, create a classifier to predict the type of seed. Perform the above hold out evaluation (50-50, 75-25, 10-fold cross validation) and discuss the results."
80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": null, 85 | "metadata": { 86 | "collapsed": true 87 | }, 88 | "outputs": [], 89 | "source": [] 90 | } 91 | ], 92 | "metadata": { 93 | "kernelspec": { 94 | "display_name": "Python 2", 95 | "language": "python", 96 | "name": "python2" 97 | }, 98 | "language_info": { 99 | "codemirror_mode": { 100 | "name": "ipython", 101 | "version": 2 102 | }, 103 | "file_extension": ".py", 104 | "mimetype": "text/x-python", 105 | "name": "python", 106 | "nbconvert_exporter": "python", 107 | "pygments_lexer": "ipython2", 108 | "version": "2.7.10" 109 | } 110 | }, 111 | "nbformat": 4, 112 | "nbformat_minor": 0 113 | } 114 | -------------------------------------------------------------------------------- /class3_2/data/hanford.csv: -------------------------------------------------------------------------------- 1 | County,Exposure,Mortality 2 | Umatilla,2.49,147.1 3 | Morrow,2.57,130.1 4 | Gilliam,3.41,129.9 5 | Sherman,1.25,113.5 6 | Wasco,1.62,137.5 7 | HoodRiver,3.83,162.3 8 | Portland,11.64,207.5 9 | Columbia,6.41,177.9 10 | Clatsop,8.34,210.3 -------------------------------------------------------------------------------- /class3_2/data/hanford.txt: -------------------------------------------------------------------------------- 1 | County Exposure Mortality 2 | Umatilla 2.49 147.1 3 | Morrow 2.57 130.1 4 | Gilliam 3.41 129.9 5 | Sherman 1.25 113.5 6 | Wasco 1.62 137.5 7 | HoodRiver 3.83 162.3 8 | Portland 11.64 207.5 9 | Columbia 6.41 177.9 10 | Clatsop 8.34 210.3 11 | -------------------------------------------------------------------------------- /class3_2/data/iris.csv: -------------------------------------------------------------------------------- 1 | SepalLength,SepalWidth,PetalLength,PetalWidth,Name 2 | 5.1,3.5,1.4,0.2,Iris-setosa 3 | 4.9,3.0,1.4,0.2,Iris-setosa 4 | 4.7,3.2,1.3,0.2,Iris-setosa 5 | 4.6,3.1,1.5,0.2,Iris-setosa 6 | 5.0,3.6,1.4,0.2,Iris-setosa 7 | 5.4,3.9,1.7,0.4,Iris-setosa 8 | 4.6,3.4,1.4,0.3,Iris-setosa 9 | 5.0,3.4,1.5,0.2,Iris-setosa 10 | 4.4,2.9,1.4,0.2,Iris-setosa 11 | 4.9,3.1,1.5,0.1,Iris-setosa 12 | 5.4,3.7,1.5,0.2,Iris-setosa 13 | 4.8,3.4,1.6,0.2,Iris-setosa 14 | 4.8,3.0,1.4,0.1,Iris-setosa 15 | 4.3,3.0,1.1,0.1,Iris-setosa 16 | 5.8,4.0,1.2,0.2,Iris-setosa 17 | 5.7,4.4,1.5,0.4,Iris-setosa 18 | 5.4,3.9,1.3,0.4,Iris-setosa 19 | 5.1,3.5,1.4,0.3,Iris-setosa 20 | 5.7,3.8,1.7,0.3,Iris-setosa 21 | 5.1,3.8,1.5,0.3,Iris-setosa 22 | 5.4,3.4,1.7,0.2,Iris-setosa 23 | 5.1,3.7,1.5,0.4,Iris-setosa 24 | 4.6,3.6,1.0,0.2,Iris-setosa 25 | 5.1,3.3,1.7,0.5,Iris-setosa 26 | 4.8,3.4,1.9,0.2,Iris-setosa 27 | 5.0,3.0,1.6,0.2,Iris-setosa 28 | 5.0,3.4,1.6,0.4,Iris-setosa 29 | 5.2,3.5,1.5,0.2,Iris-setosa 30 | 5.2,3.4,1.4,0.2,Iris-setosa 31 | 4.7,3.2,1.6,0.2,Iris-setosa 32 | 4.8,3.1,1.6,0.2,Iris-setosa 33 | 5.4,3.4,1.5,0.4,Iris-setosa 34 | 5.2,4.1,1.5,0.1,Iris-setosa 35 | 5.5,4.2,1.4,0.2,Iris-setosa 36 | 4.9,3.1,1.5,0.1,Iris-setosa 37 | 5.0,3.2,1.2,0.2,Iris-setosa 38 | 5.5,3.5,1.3,0.2,Iris-setosa 39 | 4.9,3.1,1.5,0.1,Iris-setosa 40 | 4.4,3.0,1.3,0.2,Iris-setosa 41 | 5.1,3.4,1.5,0.2,Iris-setosa 42 | 5.0,3.5,1.3,0.3,Iris-setosa 43 | 4.5,2.3,1.3,0.3,Iris-setosa 44 | 4.4,3.2,1.3,0.2,Iris-setosa 45 | 5.0,3.5,1.6,0.6,Iris-setosa 46 | 5.1,3.8,1.9,0.4,Iris-setosa 47 | 4.8,3.0,1.4,0.3,Iris-setosa 48 | 5.1,3.8,1.6,0.2,Iris-setosa 49 | 4.6,3.2,1.4,0.2,Iris-setosa 50 | 5.3,3.7,1.5,0.2,Iris-setosa 51 | 5.0,3.3,1.4,0.2,Iris-setosa 52 | 7.0,3.2,4.7,1.4,Iris-versicolor 53 | 6.4,3.2,4.5,1.5,Iris-versicolor 54 | 6.9,3.1,4.9,1.5,Iris-versicolor 55 | 
5.5,2.3,4.0,1.3,Iris-versicolor 56 | 6.5,2.8,4.6,1.5,Iris-versicolor 57 | 5.7,2.8,4.5,1.3,Iris-versicolor 58 | 6.3,3.3,4.7,1.6,Iris-versicolor 59 | 4.9,2.4,3.3,1.0,Iris-versicolor 60 | 6.6,2.9,4.6,1.3,Iris-versicolor 61 | 5.2,2.7,3.9,1.4,Iris-versicolor 62 | 5.0,2.0,3.5,1.0,Iris-versicolor 63 | 5.9,3.0,4.2,1.5,Iris-versicolor 64 | 6.0,2.2,4.0,1.0,Iris-versicolor 65 | 6.1,2.9,4.7,1.4,Iris-versicolor 66 | 5.6,2.9,3.6,1.3,Iris-versicolor 67 | 6.7,3.1,4.4,1.4,Iris-versicolor 68 | 5.6,3.0,4.5,1.5,Iris-versicolor 69 | 5.8,2.7,4.1,1.0,Iris-versicolor 70 | 6.2,2.2,4.5,1.5,Iris-versicolor 71 | 5.6,2.5,3.9,1.1,Iris-versicolor 72 | 5.9,3.2,4.8,1.8,Iris-versicolor 73 | 6.1,2.8,4.0,1.3,Iris-versicolor 74 | 6.3,2.5,4.9,1.5,Iris-versicolor 75 | 6.1,2.8,4.7,1.2,Iris-versicolor 76 | 6.4,2.9,4.3,1.3,Iris-versicolor 77 | 6.6,3.0,4.4,1.4,Iris-versicolor 78 | 6.8,2.8,4.8,1.4,Iris-versicolor 79 | 6.7,3.0,5.0,1.7,Iris-versicolor 80 | 6.0,2.9,4.5,1.5,Iris-versicolor 81 | 5.7,2.6,3.5,1.0,Iris-versicolor 82 | 5.5,2.4,3.8,1.1,Iris-versicolor 83 | 5.5,2.4,3.7,1.0,Iris-versicolor 84 | 5.8,2.7,3.9,1.2,Iris-versicolor 85 | 6.0,2.7,5.1,1.6,Iris-versicolor 86 | 5.4,3.0,4.5,1.5,Iris-versicolor 87 | 6.0,3.4,4.5,1.6,Iris-versicolor 88 | 6.7,3.1,4.7,1.5,Iris-versicolor 89 | 6.3,2.3,4.4,1.3,Iris-versicolor 90 | 5.6,3.0,4.1,1.3,Iris-versicolor 91 | 5.5,2.5,4.0,1.3,Iris-versicolor 92 | 5.5,2.6,4.4,1.2,Iris-versicolor 93 | 6.1,3.0,4.6,1.4,Iris-versicolor 94 | 5.8,2.6,4.0,1.2,Iris-versicolor 95 | 5.0,2.3,3.3,1.0,Iris-versicolor 96 | 5.6,2.7,4.2,1.3,Iris-versicolor 97 | 5.7,3.0,4.2,1.2,Iris-versicolor 98 | 5.7,2.9,4.2,1.3,Iris-versicolor 99 | 6.2,2.9,4.3,1.3,Iris-versicolor 100 | 5.1,2.5,3.0,1.1,Iris-versicolor 101 | 5.7,2.8,4.1,1.3,Iris-versicolor 102 | 6.3,3.3,6.0,2.5,Iris-virginica 103 | 5.8,2.7,5.1,1.9,Iris-virginica 104 | 7.1,3.0,5.9,2.1,Iris-virginica 105 | 6.3,2.9,5.6,1.8,Iris-virginica 106 | 6.5,3.0,5.8,2.2,Iris-virginica 107 | 7.6,3.0,6.6,2.1,Iris-virginica 108 | 4.9,2.5,4.5,1.7,Iris-virginica 109 | 7.3,2.9,6.3,1.8,Iris-virginica 110 | 6.7,2.5,5.8,1.8,Iris-virginica 111 | 7.2,3.6,6.1,2.5,Iris-virginica 112 | 6.5,3.2,5.1,2.0,Iris-virginica 113 | 6.4,2.7,5.3,1.9,Iris-virginica 114 | 6.8,3.0,5.5,2.1,Iris-virginica 115 | 5.7,2.5,5.0,2.0,Iris-virginica 116 | 5.8,2.8,5.1,2.4,Iris-virginica 117 | 6.4,3.2,5.3,2.3,Iris-virginica 118 | 6.5,3.0,5.5,1.8,Iris-virginica 119 | 7.7,3.8,6.7,2.2,Iris-virginica 120 | 7.7,2.6,6.9,2.3,Iris-virginica 121 | 6.0,2.2,5.0,1.5,Iris-virginica 122 | 6.9,3.2,5.7,2.3,Iris-virginica 123 | 5.6,2.8,4.9,2.0,Iris-virginica 124 | 7.7,2.8,6.7,2.0,Iris-virginica 125 | 6.3,2.7,4.9,1.8,Iris-virginica 126 | 6.7,3.3,5.7,2.1,Iris-virginica 127 | 7.2,3.2,6.0,1.8,Iris-virginica 128 | 6.2,2.8,4.8,1.8,Iris-virginica 129 | 6.1,3.0,4.9,1.8,Iris-virginica 130 | 6.4,2.8,5.6,2.1,Iris-virginica 131 | 7.2,3.0,5.8,1.6,Iris-virginica 132 | 7.4,2.8,6.1,1.9,Iris-virginica 133 | 7.9,3.8,6.4,2.0,Iris-virginica 134 | 6.4,2.8,5.6,2.2,Iris-virginica 135 | 6.3,2.8,5.1,1.5,Iris-virginica 136 | 6.1,2.6,5.6,1.4,Iris-virginica 137 | 7.7,3.0,6.1,2.3,Iris-virginica 138 | 6.3,3.4,5.6,2.4,Iris-virginica 139 | 6.4,3.1,5.5,1.8,Iris-virginica 140 | 6.0,3.0,4.8,1.8,Iris-virginica 141 | 6.9,3.1,5.4,2.1,Iris-virginica 142 | 6.7,3.1,5.6,2.4,Iris-virginica 143 | 6.9,3.1,5.1,2.3,Iris-virginica 144 | 5.8,2.7,5.1,1.9,Iris-virginica 145 | 6.8,3.2,5.9,2.3,Iris-virginica 146 | 6.7,3.3,5.7,2.5,Iris-virginica 147 | 6.7,3.0,5.2,2.3,Iris-virginica 148 | 6.3,2.5,5.0,1.9,Iris-virginica 149 | 6.5,3.0,5.2,2.0,Iris-virginica 150 | 6.2,3.4,5.4,2.3,Iris-virginica 151 | 
5.9,3.0,5.1,1.8,Iris-virginica -------------------------------------------------------------------------------- /class3_2/data/seeds_dataset.txt: -------------------------------------------------------------------------------- 1 | 15.26,14.84,0.871,5.763,3.312,2.221,5.22,1 2 | 14.88,14.57,0.8811,5.554,3.333,1.018,4.956,1 3 | 14.29,14.09,0.905,5.291,3.337,2.699,4.825,1 4 | 13.84,13.94,0.8955,5.324,3.379,2.259,4.805,1 5 | 16.14,14.99,0.9034,5.658,3.562,1.355,5.175,1 6 | 14.38,14.21,0.8951,5.386,3.312,2.462,4.956,1 7 | 14.69,14.49,0.8799,5.563,3.259,3.586,5.219,1 8 | 14.11,14.1,0.8911,5.42,3.302,2.7,5,1 9 | 16.63,15.46,0.8747,6.053,3.465,2.04,5.877,1 10 | 16.44,15.25,0.888,5.884,3.505,1.969,5.533,1 11 | 15.26,14.85,0.8696,5.714,3.242,4.543,5.314,1 12 | 14.03,14.16,0.8796,5.438,3.201,1.717,5.001,1 13 | 13.89,14.02,0.888,5.439,3.199,3.986,4.738,1 14 | 13.78,14.06,0.8759,5.479,3.156,3.136,4.872,1 15 | 13.74,14.05,0.8744,5.482,3.114,2.932,4.825,1 16 | 14.59,14.28,0.8993,5.351,3.333,4.185,4.781,1 17 | 13.99,13.83,0.9183,5.119,3.383,5.234,4.781,1 18 | 15.69,14.75,0.9058,5.527,3.514,1.599,5.046,1 19 | 14.7,14.21,0.9153,5.205,3.466,1.767,4.649,1 20 | 12.72,13.57,0.8686,5.226,3.049,4.102,4.914,1 21 | 14.16,14.4,0.8584,5.658,3.129,3.072,5.176,1 22 | 14.11,14.26,0.8722,5.52,3.168,2.688,5.219,1 23 | 15.88,14.9,0.8988,5.618,3.507,0.7651,5.091,1 24 | 12.08,13.23,0.8664,5.099,2.936,1.415,4.961,1 25 | 15.01,14.76,0.8657,5.789,3.245,1.791,5.001,1 26 | 16.19,15.16,0.8849,5.833,3.421,0.903,5.307,1 27 | 13.02,13.76,0.8641,5.395,3.026,3.373,4.825,1 28 | 12.74,13.67,0.8564,5.395,2.956,2.504,4.869,1 29 | 14.11,14.18,0.882,5.541,3.221,2.754,5.038,1 30 | 13.45,14.02,0.8604,5.516,3.065,3.531,5.097,1 31 | 13.16,13.82,0.8662,5.454,2.975,0.8551,5.056,1 32 | 15.49,14.94,0.8724,5.757,3.371,3.412,5.228,1 33 | 14.09,14.41,0.8529,5.717,3.186,3.92,5.299,1 34 | 13.94,14.17,0.8728,5.585,3.15,2.124,5.012,1 35 | 15.05,14.68,0.8779,5.712,3.328,2.129,5.36,1 36 | 16.12,15,0.9,5.709,3.485,2.27,5.443,1 37 | 16.2,15.27,0.8734,5.826,3.464,2.823,5.527,1 38 | 17.08,15.38,0.9079,5.832,3.683,2.956,5.484,1 39 | 14.8,14.52,0.8823,5.656,3.288,3.112,5.309,1 40 | 14.28,14.17,0.8944,5.397,3.298,6.685,5.001,1 41 | 13.54,13.85,0.8871,5.348,3.156,2.587,5.178,1 42 | 13.5,13.85,0.8852,5.351,3.158,2.249,5.176,1 43 | 13.16,13.55,0.9009,5.138,3.201,2.461,4.783,1 44 | 15.5,14.86,0.882,5.877,3.396,4.711,5.528,1 45 | 15.11,14.54,0.8986,5.579,3.462,3.128,5.18,1 46 | 13.8,14.04,0.8794,5.376,3.155,1.56,4.961,1 47 | 15.36,14.76,0.8861,5.701,3.393,1.367,5.132,1 48 | 14.99,14.56,0.8883,5.57,3.377,2.958,5.175,1 49 | 14.79,14.52,0.8819,5.545,3.291,2.704,5.111,1 50 | 14.86,14.67,0.8676,5.678,3.258,2.129,5.351,1 51 | 14.43,14.4,0.8751,5.585,3.272,3.975,5.144,1 52 | 15.78,14.91,0.8923,5.674,3.434,5.593,5.136,1 53 | 14.49,14.61,0.8538,5.715,3.113,4.116,5.396,1 54 | 14.33,14.28,0.8831,5.504,3.199,3.328,5.224,1 55 | 14.52,14.6,0.8557,5.741,3.113,1.481,5.487,1 56 | 15.03,14.77,0.8658,5.702,3.212,1.933,5.439,1 57 | 14.46,14.35,0.8818,5.388,3.377,2.802,5.044,1 58 | 14.92,14.43,0.9006,5.384,3.412,1.142,5.088,1 59 | 15.38,14.77,0.8857,5.662,3.419,1.999,5.222,1 60 | 12.11,13.47,0.8392,5.159,3.032,1.502,4.519,1 61 | 11.42,12.86,0.8683,5.008,2.85,2.7,4.607,1 62 | 11.23,12.63,0.884,4.902,2.879,2.269,4.703,1 63 | 12.36,13.19,0.8923,5.076,3.042,3.22,4.605,1 64 | 13.22,13.84,0.868,5.395,3.07,4.157,5.088,1 65 | 12.78,13.57,0.8716,5.262,3.026,1.176,4.782,1 66 | 12.88,13.5,0.8879,5.139,3.119,2.352,4.607,1 67 | 14.34,14.37,0.8726,5.63,3.19,1.313,5.15,1 68 | 
14.01,14.29,0.8625,5.609,3.158,2.217,5.132,1 69 | 14.37,14.39,0.8726,5.569,3.153,1.464,5.3,1 70 | 12.73,13.75,0.8458,5.412,2.882,3.533,5.067,1 71 | 17.63,15.98,0.8673,6.191,3.561,4.076,6.06,2 72 | 16.84,15.67,0.8623,5.998,3.484,4.675,5.877,2 73 | 17.26,15.73,0.8763,5.978,3.594,4.539,5.791,2 74 | 19.11,16.26,0.9081,6.154,3.93,2.936,6.079,2 75 | 16.82,15.51,0.8786,6.017,3.486,4.004,5.841,2 76 | 16.77,15.62,0.8638,5.927,3.438,4.92,5.795,2 77 | 17.32,15.91,0.8599,6.064,3.403,3.824,5.922,2 78 | 20.71,17.23,0.8763,6.579,3.814,4.451,6.451,2 79 | 18.94,16.49,0.875,6.445,3.639,5.064,6.362,2 80 | 17.12,15.55,0.8892,5.85,3.566,2.858,5.746,2 81 | 16.53,15.34,0.8823,5.875,3.467,5.532,5.88,2 82 | 18.72,16.19,0.8977,6.006,3.857,5.324,5.879,2 83 | 20.2,16.89,0.8894,6.285,3.864,5.173,6.187,2 84 | 19.57,16.74,0.8779,6.384,3.772,1.472,6.273,2 85 | 19.51,16.71,0.878,6.366,3.801,2.962,6.185,2 86 | 18.27,16.09,0.887,6.173,3.651,2.443,6.197,2 87 | 18.88,16.26,0.8969,6.084,3.764,1.649,6.109,2 88 | 18.98,16.66,0.859,6.549,3.67,3.691,6.498,2 89 | 21.18,17.21,0.8989,6.573,4.033,5.78,6.231,2 90 | 20.88,17.05,0.9031,6.45,4.032,5.016,6.321,2 91 | 20.1,16.99,0.8746,6.581,3.785,1.955,6.449,2 92 | 18.76,16.2,0.8984,6.172,3.796,3.12,6.053,2 93 | 18.81,16.29,0.8906,6.272,3.693,3.237,6.053,2 94 | 18.59,16.05,0.9066,6.037,3.86,6.001,5.877,2 95 | 18.36,16.52,0.8452,6.666,3.485,4.933,6.448,2 96 | 16.87,15.65,0.8648,6.139,3.463,3.696,5.967,2 97 | 19.31,16.59,0.8815,6.341,3.81,3.477,6.238,2 98 | 18.98,16.57,0.8687,6.449,3.552,2.144,6.453,2 99 | 18.17,16.26,0.8637,6.271,3.512,2.853,6.273,2 100 | 18.72,16.34,0.881,6.219,3.684,2.188,6.097,2 101 | 16.41,15.25,0.8866,5.718,3.525,4.217,5.618,2 102 | 17.99,15.86,0.8992,5.89,3.694,2.068,5.837,2 103 | 19.46,16.5,0.8985,6.113,3.892,4.308,6.009,2 104 | 19.18,16.63,0.8717,6.369,3.681,3.357,6.229,2 105 | 18.95,16.42,0.8829,6.248,3.755,3.368,6.148,2 106 | 18.83,16.29,0.8917,6.037,3.786,2.553,5.879,2 107 | 18.85,16.17,0.9056,6.152,3.806,2.843,6.2,2 108 | 17.63,15.86,0.88,6.033,3.573,3.747,5.929,2 109 | 19.94,16.92,0.8752,6.675,3.763,3.252,6.55,2 110 | 18.55,16.22,0.8865,6.153,3.674,1.738,5.894,2 111 | 18.45,16.12,0.8921,6.107,3.769,2.235,5.794,2 112 | 19.38,16.72,0.8716,6.303,3.791,3.678,5.965,2 113 | 19.13,16.31,0.9035,6.183,3.902,2.109,5.924,2 114 | 19.14,16.61,0.8722,6.259,3.737,6.682,6.053,2 115 | 20.97,17.25,0.8859,6.563,3.991,4.677,6.316,2 116 | 19.06,16.45,0.8854,6.416,3.719,2.248,6.163,2 117 | 18.96,16.2,0.9077,6.051,3.897,4.334,5.75,2 118 | 19.15,16.45,0.889,6.245,3.815,3.084,6.185,2 119 | 18.89,16.23,0.9008,6.227,3.769,3.639,5.966,2 120 | 20.03,16.9,0.8811,6.493,3.857,3.063,6.32,2 121 | 20.24,16.91,0.8897,6.315,3.962,5.901,6.188,2 122 | 18.14,16.12,0.8772,6.059,3.563,3.619,6.011,2 123 | 16.17,15.38,0.8588,5.762,3.387,4.286,5.703,2 124 | 18.43,15.97,0.9077,5.98,3.771,2.984,5.905,2 125 | 15.99,14.89,0.9064,5.363,3.582,3.336,5.144,2 126 | 18.75,16.18,0.8999,6.111,3.869,4.188,5.992,2 127 | 18.65,16.41,0.8698,6.285,3.594,4.391,6.102,2 128 | 17.98,15.85,0.8993,5.979,3.687,2.257,5.919,2 129 | 20.16,17.03,0.8735,6.513,3.773,1.91,6.185,2 130 | 17.55,15.66,0.8991,5.791,3.69,5.366,5.661,2 131 | 18.3,15.89,0.9108,5.979,3.755,2.837,5.962,2 132 | 18.94,16.32,0.8942,6.144,3.825,2.908,5.949,2 133 | 15.38,14.9,0.8706,5.884,3.268,4.462,5.795,2 134 | 16.16,15.33,0.8644,5.845,3.395,4.266,5.795,2 135 | 15.56,14.89,0.8823,5.776,3.408,4.972,5.847,2 136 | 15.38,14.66,0.899,5.477,3.465,3.6,5.439,2 137 | 17.36,15.76,0.8785,6.145,3.574,3.526,5.971,2 138 | 15.57,15.15,0.8527,5.92,3.231,2.64,5.879,2 139 | 
15.6,15.11,0.858,5.832,3.286,2.725,5.752,2 140 | 16.23,15.18,0.885,5.872,3.472,3.769,5.922,2 141 | 13.07,13.92,0.848,5.472,2.994,5.304,5.395,3 142 | 13.32,13.94,0.8613,5.541,3.073,7.035,5.44,3 143 | 13.34,13.95,0.862,5.389,3.074,5.995,5.307,3 144 | 12.22,13.32,0.8652,5.224,2.967,5.469,5.221,3 145 | 11.82,13.4,0.8274,5.314,2.777,4.471,5.178,3 146 | 11.21,13.13,0.8167,5.279,2.687,6.169,5.275,3 147 | 11.43,13.13,0.8335,5.176,2.719,2.221,5.132,3 148 | 12.49,13.46,0.8658,5.267,2.967,4.421,5.002,3 149 | 12.7,13.71,0.8491,5.386,2.911,3.26,5.316,3 150 | 10.79,12.93,0.8107,5.317,2.648,5.462,5.194,3 151 | 11.83,13.23,0.8496,5.263,2.84,5.195,5.307,3 152 | 12.01,13.52,0.8249,5.405,2.776,6.992,5.27,3 153 | 12.26,13.6,0.8333,5.408,2.833,4.756,5.36,3 154 | 11.18,13.04,0.8266,5.22,2.693,3.332,5.001,3 155 | 11.36,13.05,0.8382,5.175,2.755,4.048,5.263,3 156 | 11.19,13.05,0.8253,5.25,2.675,5.813,5.219,3 157 | 11.34,12.87,0.8596,5.053,2.849,3.347,5.003,3 158 | 12.13,13.73,0.8081,5.394,2.745,4.825,5.22,3 159 | 11.75,13.52,0.8082,5.444,2.678,4.378,5.31,3 160 | 11.49,13.22,0.8263,5.304,2.695,5.388,5.31,3 161 | 12.54,13.67,0.8425,5.451,2.879,3.082,5.491,3 162 | 12.02,13.33,0.8503,5.35,2.81,4.271,5.308,3 163 | 12.05,13.41,0.8416,5.267,2.847,4.988,5.046,3 164 | 12.55,13.57,0.8558,5.333,2.968,4.419,5.176,3 165 | 11.14,12.79,0.8558,5.011,2.794,6.388,5.049,3 166 | 12.1,13.15,0.8793,5.105,2.941,2.201,5.056,3 167 | 12.44,13.59,0.8462,5.319,2.897,4.924,5.27,3 168 | 12.15,13.45,0.8443,5.417,2.837,3.638,5.338,3 169 | 11.35,13.12,0.8291,5.176,2.668,4.337,5.132,3 170 | 11.24,13,0.8359,5.09,2.715,3.521,5.088,3 171 | 11.02,13,0.8189,5.325,2.701,6.735,5.163,3 172 | 11.55,13.1,0.8455,5.167,2.845,6.715,4.956,3 173 | 11.27,12.97,0.8419,5.088,2.763,4.309,5,3 174 | 11.4,13.08,0.8375,5.136,2.763,5.588,5.089,3 175 | 10.83,12.96,0.8099,5.278,2.641,5.182,5.185,3 176 | 10.8,12.57,0.859,4.981,2.821,4.773,5.063,3 177 | 11.26,13.01,0.8355,5.186,2.71,5.335,5.092,3 178 | 10.74,12.73,0.8329,5.145,2.642,4.702,4.963,3 179 | 11.48,13.05,0.8473,5.18,2.758,5.876,5.002,3 180 | 12.21,13.47,0.8453,5.357,2.893,1.661,5.178,3 181 | 11.41,12.95,0.856,5.09,2.775,4.957,4.825,3 182 | 12.46,13.41,0.8706,5.236,3.017,4.987,5.147,3 183 | 12.19,13.36,0.8579,5.24,2.909,4.857,5.158,3 184 | 11.65,13.07,0.8575,5.108,2.85,5.209,5.135,3 185 | 12.89,13.77,0.8541,5.495,3.026,6.185,5.316,3 186 | 11.56,13.31,0.8198,5.363,2.683,4.062,5.182,3 187 | 11.81,13.45,0.8198,5.413,2.716,4.898,5.352,3 188 | 10.91,12.8,0.8372,5.088,2.675,4.179,4.956,3 189 | 11.23,12.82,0.8594,5.089,2.821,7.524,4.957,3 190 | 10.59,12.41,0.8648,4.899,2.787,4.975,4.794,3 191 | 10.93,12.8,0.839,5.046,2.717,5.398,5.045,3 192 | 11.27,12.86,0.8563,5.091,2.804,3.985,5.001,3 193 | 11.87,13.02,0.8795,5.132,2.953,3.597,5.132,3 194 | 10.82,12.83,0.8256,5.18,2.63,4.853,5.089,3 195 | 12.11,13.27,0.8639,5.236,2.975,4.132,5.012,3 196 | 12.8,13.47,0.886,5.16,3.126,4.873,4.914,3 197 | 12.79,13.53,0.8786,5.224,3.054,5.483,4.958,3 198 | 13.37,13.78,0.8849,5.32,3.128,4.67,5.091,3 199 | 12.62,13.67,0.8481,5.41,2.911,3.306,5.231,3 200 | 12.76,13.38,0.8964,5.073,3.155,2.828,4.83,3 201 | 12.38,13.44,0.8609,5.219,2.989,5.472,5.045,3 202 | 12.67,13.32,0.8977,4.984,3.135,2.3,4.745,3 203 | 11.18,12.72,0.868,5.009,2.81,4.051,4.828,3 204 | 12.7,13.41,0.8874,5.183,3.091,8.456,5,3 205 | 12.37,13.47,0.8567,5.204,2.96,3.919,5.001,3 206 | 12.19,13.2,0.8783,5.137,2.981,3.631,4.87,3 207 | 11.23,12.88,0.8511,5.14,2.795,4.325,5.003,3 208 | 13.2,13.66,0.8883,5.236,3.232,8.315,5.056,3 209 | 11.84,13.21,0.8521,5.175,2.836,3.598,5.044,3 210 | 
12.3,13.34,0.8684,5.243,2.974,5.637,5.063,3 -------------------------------------------------------------------------------- /class3_2/images/hanford_variables.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datapolitan/lede_algorithms/9109b5c91a3eb74c1343e79daf60abd76be182ea/class3_2/images/hanford_variables.png -------------------------------------------------------------------------------- /class3_2/images/iris_scatter.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datapolitan/lede_algorithms/9109b5c91a3eb74c1343e79daf60abd76be182ea/class3_2/images/iris_scatter.png -------------------------------------------------------------------------------- /class4_1/README.md: -------------------------------------------------------------------------------- 1 | # Algorithms: Week 4, Class 1 (Tuesday, Aug. 4) 2 | 3 | This week's class is going to be a bit different than the last few. After a quick review of last week's material, we're going to build a supervised learning system that is meant to outline a rough automated approach to [this story](http://www.nytimes.com/2015/08/02/us/small-pool-of-rich-donors-dominates-election-giving.html) from Sunday's Times about wealthy donors to Super PACs affiliated with presidential candidates. 4 | 5 | Along the way, we'll talk about: 6 | 7 | - How to train and apply different models, including the decision trees you've already discussed 8 | - How to engineer useful features for those models 9 | - How to evaluate the results of those models so you don't get yourself in trouble 10 | - The difference between statistical and rules-based solutions to problems like this 11 | 12 | For lab, you'll be asked to take on a simpler supervised learning problem that will give you a chance to apply the lessons from class. 13 | 14 | If you'd like to get a head start, feel free to read [this documentation](https://github.com/cjdd3b/fec-standardizer/wiki) on standardizing FEC data. We'll be taking a simpler approach, but much of the intuition will be similar. -------------------------------------------------------------------------------- /class4_1/doc_classifier.py: -------------------------------------------------------------------------------- 1 | from sklearn import preprocessing 2 | 3 | ########## FEATURES ########## 4 | 5 | # Put your features here 6 | 7 | 8 | ########## MAIN ########## 9 | 10 | if __name__ == '__main__': 11 | 12 | # First we'll do some preprocessing to create our two vectors for model training: features, which 13 | # represents the feature vector, and labels, which represent our correct answers. 14 | 15 | features, labels = [], [] 16 | with open('data/bills_training.txt', 'rU') as csvfile: 17 | for line in csvfile.readlines(): 18 | bill = line.strip().split('|') 19 | 20 | if len(bill) > 1: 21 | labels.append(bill[1]) 22 | 23 | features.append([ 24 | # Your features here, based on bill[0], which contains the text of the bill titles 25 | ]) 26 | 27 | # A little bit of cleanup for scikit-learn's benefit. Scikit-learn models wants our categories to 28 | # be numbers, not strings. The LabelEncoder performs this transformation. 
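    # For example (with made-up labels), LabelEncoder().fit_transform(['education', 'crime', 'education'])
    # returns array([1, 0, 1]): each distinct label string gets an integer code, assigned in sorted order,
    # which is the numeric form scikit-learn expects for its target vector.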
29 | encoder = preprocessing.LabelEncoder() 30 | encoded_labels = encoder.fit_transform(labels) 31 | 32 | print features 33 | print encoded_labels 34 | 35 | # STEP ONE: Create and train a model 36 | 37 | # Your code here 38 | 39 | 40 | # STEP TWO: Evaluate the model 41 | 42 | # Your code here 43 | 44 | 45 | # STEP THREE: Apply the model 46 | 47 | # Use the model to get categories for each of these documents 48 | 49 | docs_new = ["Public postsecondary education: executive officer compensation.", 50 | "An act to add Section 236.3 to the Education code, related to the pricing of college textbooks.", 51 | "Political Reform Act of 1974: campaign disclosures.", 52 | "An act to add Section 236.3 to the Penal Code, relating to human trafficking." 53 | ] -------------------------------------------------------------------------------- /class4_1/donors.py: -------------------------------------------------------------------------------- 1 | import csv, itertools 2 | import numpy as np 3 | from sklearn.tree import DecisionTreeClassifier 4 | from sklearn.linear_model import LogisticRegression 5 | from sklearn import cross_validation 6 | from sklearn.cross_validation import KFold 7 | from sklearn import metrics 8 | from nameparser import HumanName 9 | 10 | ########## HELPER FUNCTIONS ########## 11 | 12 | def _shingle(word, n): 13 | ''' 14 | Splits words into shingles of size n. Given the word "shingle" and n=2, the output 15 | would be a list that looks like :['sh', 'hi', 'in', 'ng', 'gl', 'le'] 16 | 17 | More on shingling here: http://blog.mafr.de/2011/01/06/near-duplicate-detection/ 18 | ''' 19 | return set([word[i:i + n] for i in range(len(word) - n + 1)]) 20 | 21 | def _jaccard_sim(X, Y): 22 | ''' 23 | Jaccard similarity between two sets. 24 | 25 | Explanation here: http://en.wikipedia.org/wiki/Jaccard_index 26 | ''' 27 | if not X or not Y: return 0 28 | x = set(X) 29 | y = set(Y) 30 | return float(len(x & y)) / len(x | y) 31 | 32 | def sim(str1, str2, shingle_length=3): 33 | ''' 34 | String similarity metric based on shingles and Jaccard. 35 | ''' 36 | str1_shingles = _shingle(str1, shingle_length) 37 | str2_shingles = _shingle(str2, shingle_length) 38 | return _jaccard_sim(str1_shingles, str2_shingles) 39 | 40 | ########## FEATURES ########## 41 | 42 | def same_name(name1, name2): 43 | return 1 if name1 == name2 else 0 44 | 45 | def same_zip_code(zip1, zip2): 46 | return 1 if zip1[:5] == zip2[:5] else 0 47 | 48 | def same_first_name(name1, name2): 49 | first1 = HumanName(name1).first 50 | first2 = HumanName(name2).first 51 | return 1 if first1 == first2 else 0 52 | 53 | def same_last_name(name1, name2): 54 | last1 = HumanName(name1).last 55 | last2 = HumanName(name2).last 56 | return 1 if last1 == last2 else 0 57 | 58 | 59 | # We're going to add more here ... 60 | 61 | ########## MAIN ########## 62 | 63 | if __name__ == '__main__': 64 | 65 | # STEP ONE: Train our model. 
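    # Each training example here is a PAIR of contribution records: itertools.combinations()
    # below generates every pairing, the feature functions above turn each pair into a vector
    # of 0/1 similarity signals, and the label is 1 when both records share a contributor_ext_id
    # (i.e., they are known to be the same donor). The decision tree then learns which
    # combinations of signals indicate a match.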
66 | 67 | features, matches = [], [] 68 | with open('data/contribs_training_small.csv', 'rU') as csvfile: 69 | reader = csv.DictReader(csvfile) 70 | for c in itertools.combinations(reader, 2): 71 | 72 | # Fill up our vector of correct answers 73 | match = 1 if c[0]['contributor_ext_id'] == c[1]['contributor_ext_id'] else 0 74 | matches.append(match) 75 | 76 | # And now fill up our feature vector 77 | features.append([ 78 | same_name(c[0]['name'], c[1]['name']), 79 | same_zip_code(c[0]['zip_code'], c[1]['zip_code']), 80 | same_first_name(c[0]['name'], c[1]['name']), 81 | same_last_name(c[0]['name'], c[1]['name']) 82 | 83 | ]) 84 | 85 | clf = DecisionTreeClassifier() 86 | clf = clf.fit(features, matches) 87 | 88 | # STEP TWO: Evaluate the model using 10-fold cross-validation 89 | 90 | # scores = cross_validation.cross_val_score(clf, features, matches, cv=10, scoring='f1') 91 | # print "%s (%s folds): %0.2f (+/- %0.2f)\n" % ('f1', 10, scores.mean(), scores.std() / 2) 92 | 93 | # STEP THREE: Apply the model 94 | 95 | with open('data/contribs_unclassified.csv', 'rU') as csvfile: 96 | reader = csv.DictReader(csvfile) 97 | for key, group in itertools.groupby(reader, lambda x: x['last_name']): 98 | for c in itertools.combinations(group, 2): 99 | 100 | # Making print-friendly representations of the records, for easier evaluation 101 | record1 = '%s, %s %s | %s %s %s | %s %s' % \ 102 | (c[0]['last_name'], c[0]['first_name'], c[0]['middle_name'], 103 | c[0]['city'], c[0]['state'], c[0]['zip'], 104 | c[0]['employer'], c[0]['occupation']) 105 | record2 = '%s, %s %s | %s %s %s | %s %s' % \ 106 | (c[1]['last_name'], c[1]['first_name'], c[1]['middle_name'], 107 | c[1]['city'], c[1]['state'], c[1]['zip'], 108 | c[1]['employer'], c[1]['occupation']) 109 | 110 | # We need to do this because our training set has full names, but this set has name 111 | # components. Turn those into full names. 112 | name1 = '%s, %s %s' % (c[0]['last_name'], c[0]['first_name'], c[0]['middle_name']) 113 | name2 = '%s, %s %s' % (c[1]['last_name'], c[1]['first_name'], c[1]['middle_name']) 114 | 115 | # And now fill up our feature vector 116 | features = [ 117 | same_name(name1, name2), 118 | same_zip_code(c[0]['zip'], c[1]['zip']), 119 | same_first_name(name1, name2), 120 | same_last_name(name1, name2) 121 | ] 122 | 123 | # Predict match or no match 124 | match = clf.predict_proba(features) 125 | 126 | # Print the results 127 | if match[0][0] < match[0][1]: 128 | print 'MATCH!' 129 | print record1 + ' ---------> ' + record2 + '\n' 130 | print match 131 | else: 132 | print 'NO MATCH!' 133 | print record1 + ' ---------> ' + record2 + '\n' -------------------------------------------------------------------------------- /class4_2/4-2_DoNow.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "###import wine.csv and build a decision tree classifier to predict wine_cultivar. 
Test the data using 5-fold cross validation" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [] 18 | } 19 | ], 20 | "metadata": { 21 | "kernelspec": { 22 | "display_name": "Python 2", 23 | "language": "python", 24 | "name": "python2" 25 | }, 26 | "language_info": { 27 | "codemirror_mode": { 28 | "name": "ipython", 29 | "version": 2 30 | }, 31 | "file_extension": ".py", 32 | "mimetype": "text/x-python", 33 | "name": "python", 34 | "nbconvert_exporter": "python", 35 | "pygments_lexer": "ipython2", 36 | "version": "2.7.10" 37 | } 38 | }, 39 | "nbformat": 4, 40 | "nbformat_minor": 0 41 | } 42 | -------------------------------------------------------------------------------- /class4_2/Feature_Engineering.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import pandas as pd\n", 12 | "%matplotlib inline" 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "metadata": {}, 18 | "source": [ 19 | "###A simple example to illustrate the intuition behind dummy variables" 20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": null, 25 | "metadata": { 26 | "collapsed": true 27 | }, 28 | "outputs": [], 29 | "source": [ 30 | "df = pd.DataFrame({'key':['b','b','a','c','a','b'],'data1':range(6)})" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": null, 36 | "metadata": { 37 | "collapsed": false 38 | }, 39 | "outputs": [], 40 | "source": [ 41 | "df" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": null, 47 | "metadata": { 48 | "collapsed": false 49 | }, 50 | "outputs": [], 51 | "source": [ 52 | "pd.get_dummies(df['key'],prefix='key')" 53 | ] 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "metadata": { 58 | "collapsed": true 59 | }, 60 | "source": [ 61 | "###Now we have a matrix of values based on the presence of absence of the attribute value in our dataset" 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "###Now let's look at another example using our flight data" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": null, 74 | "metadata": { 75 | "collapsed": false 76 | }, 77 | "outputs": [], 78 | "source": [ 79 | "df = pd.read_csv('data/ontime_reports_may_2015_ny.csv')" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": null, 85 | "metadata": { 86 | "collapsed": false 87 | }, 88 | "outputs": [], 89 | "source": [ 90 | "#count number of NaNs in column\n", 91 | "df['DEP_DELAY'].isnull().sum()" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": null, 97 | "metadata": { 98 | "collapsed": false 99 | }, 100 | "outputs": [], 101 | "source": [ 102 | "#calculate the percentage this represents of the total number of instances\n", 103 | "df['DEP_DELAY'].isnull().sum()/df['DEP_DELAY'].sum()" 104 | ] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "metadata": {}, 109 | "source": [ 110 | "###We could explore whether the NaNs are actually zero delays, but we'll just filter them out for now, especially since they represent such a small number of instances" 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": null, 116 | "metadata": { 117 | "collapsed": false 118 | }, 119 | "outputs": [], 120 | "source": [ 
121 | "#filter DEP_DELAY NaNs\n", 122 | "df = df[pd.notnull(df['DEP_DELAY'])]" 123 | ] 124 | }, 125 | { 126 | "cell_type": "markdown", 127 | "metadata": {}, 128 | "source": [ 129 | "###We can discretize the continuous DEP_DELAY value by giving it a value of 0 if it's delayed and a 1 if it's not. We record this value into a separate column. (We could also code -1 for early, 0 for ontime, and 1 for late)" 130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": null, 135 | "metadata": { 136 | "collapsed": false 137 | }, 138 | "outputs": [], 139 | "source": [ 140 | "#code whether delay or not delayed\n", 141 | "df['IS_DELAYED'] = df['DEP_DELAY'].apply(lambda x: 1 if x>0 else 0 )" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": null, 147 | "metadata": { 148 | "collapsed": false, 149 | "scrolled": true 150 | }, 151 | "outputs": [], 152 | "source": [ 153 | "#Let's check that our column was created properly\n", 154 | "df[['DEP_DELAY','IS_DELAYED']]" 155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": null, 160 | "metadata": { 161 | "collapsed": true 162 | }, 163 | "outputs": [], 164 | "source": [ 165 | "###Dummy variables create a " 166 | ] 167 | }, 168 | { 169 | "cell_type": "code", 170 | "execution_count": null, 171 | "metadata": { 172 | "collapsed": false 173 | }, 174 | "outputs": [], 175 | "source": [ 176 | "pd.get_dummies(df['ORIGIN'],prefix='origin')" 177 | ] 178 | }, 179 | { 180 | "cell_type": "markdown", 181 | "metadata": {}, 182 | "source": [ 183 | "###Normalize values" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": null, 189 | "metadata": { 190 | "collapsed": false 191 | }, 192 | "outputs": [], 193 | "source": [ 194 | "#Normalize the data attributes for the Iris dataset\n", 195 | "# Example from Jump Start Scikit Learn https://machinelearningmastery.com/jump-start-scikit-learn/\n", 196 | "from sklearn.datasets import load_iris \n", 197 | "from sklearn import preprocessing #load the iris dataset\n", 198 | "iris=load_iris()\n", 199 | "X=iris.data\n", 200 | "y=iris.target #normalize the data attributes \n", 201 | "normalized_X = preprocessing.normalize(X)" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": null, 207 | "metadata": { 208 | "collapsed": false 209 | }, 210 | "outputs": [], 211 | "source": [ 212 | "zip(X,normalized_X)" 213 | ] 214 | }, 215 | { 216 | "cell_type": "code", 217 | "execution_count": null, 218 | "metadata": { 219 | "collapsed": true 220 | }, 221 | "outputs": [], 222 | "source": [] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": null, 227 | "metadata": { 228 | "collapsed": true 229 | }, 230 | "outputs": [], 231 | "source": [] 232 | }, 233 | { 234 | "cell_type": "code", 235 | "execution_count": null, 236 | "metadata": { 237 | "collapsed": true 238 | }, 239 | "outputs": [], 240 | "source": [] 241 | } 242 | ], 243 | "metadata": { 244 | "kernelspec": { 245 | "display_name": "Python 2", 246 | "language": "python", 247 | "name": "python2" 248 | }, 249 | "language_info": { 250 | "codemirror_mode": { 251 | "name": "ipython", 252 | "version": 2 253 | }, 254 | "file_extension": ".py", 255 | "mimetype": "text/x-python", 256 | "name": "python", 257 | "nbconvert_exporter": "python", 258 | "pygments_lexer": "ipython2", 259 | "version": "2.7.10" 260 | } 261 | }, 262 | "nbformat": 4, 263 | "nbformat_minor": 0 264 | } 265 | -------------------------------------------------------------------------------- 
/class4_2/Logistic_regression.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import pandas as pd\n", 19 | "%matplotlib inline\n", 20 | "import numpy as np" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": null, 26 | "metadata": { 27 | "collapsed": true 28 | }, 29 | "outputs": [], 30 | "source": [ 31 | "titanic = pd.read_csv(\"data/titanic.csv\")" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": null, 37 | "metadata": { 38 | "collapsed": false 39 | }, 40 | "outputs": [], 41 | "source": [ 42 | "titanic.columns" 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "metadata": {}, 48 | "source": [ 49 | "###Let's do a simple logistic regression to predict survival based on pclass and sex" 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": {}, 55 | "source": [ 56 | "First we need to prepare our features. Remember we drop one value in each dummy to avoid the dummy variable trap" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": null, 62 | "metadata": { 63 | "collapsed": true 64 | }, 65 | "outputs": [], 66 | "source": [ 67 | "titanic['sex_female'] = titanic['sex'].apply(lambda x:1 if x=='female' else 0)" 68 | ] 69 | }, 70 | { 71 | "cell_type": "code", 72 | "execution_count": null, 73 | "metadata": { 74 | "collapsed": true 75 | }, 76 | "outputs": [], 77 | "source": [ 78 | "dataset = titanic[['survived']].join([pd.get_dummies(titanic['pclass'],prefix=\"pclass\"),titanic.sex_female])" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": null, 84 | "metadata": { 85 | "collapsed": true 86 | }, 87 | "outputs": [], 88 | "source": [ 89 | "from sklearn.linear_model import LogisticRegression" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": null, 95 | "metadata": { 96 | "collapsed": true 97 | }, 98 | "outputs": [], 99 | "source": [ 100 | "lm = LogisticRegression()" 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "execution_count": null, 106 | "metadata": { 107 | "collapsed": false 108 | }, 109 | "outputs": [], 110 | "source": [ 111 | "#drop pclass_1st to avoid dummy variable trap\n", 112 | "x = np.asarray(dataset[['pclass_2nd','pclass_3rd','sex_female']])\n", 113 | "y = np.asarray(dataset['survived'])" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": null, 119 | "metadata": { 120 | "collapsed": true 121 | }, 122 | "outputs": [], 123 | "source": [ 124 | "lm = lm.fit(x,y)" 125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "execution_count": null, 130 | "metadata": { 131 | "collapsed": false 132 | }, 133 | "outputs": [], 134 | "source": [ 135 | "lm.score(x,y)" 136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": null, 141 | "metadata": { 142 | "collapsed": false 143 | }, 144 | "outputs": [], 145 | "source": [ 146 | "y.mean()" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": null, 152 | "metadata": { 153 | "collapsed": false 154 | }, 155 | "outputs": [], 156 | "source": [ 157 | "lm.coef_" 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": null, 163 | "metadata": { 164 | "collapsed": false 165 | }, 166 | "outputs": [], 167 | "source": [ 168 | "lm.intercept_" 
169 | ] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "execution_count": null, 174 | "metadata": { 175 | "collapsed": true 176 | }, 177 | "outputs": [], 178 | "source": [] 179 | } 180 | ], 181 | "metadata": { 182 | "kernelspec": { 183 | "display_name": "Python 2", 184 | "language": "python", 185 | "name": "python2" 186 | }, 187 | "language_info": { 188 | "codemirror_mode": { 189 | "name": "ipython", 190 | "version": 2 191 | }, 192 | "file_extension": ".py", 193 | "mimetype": "text/x-python", 194 | "name": "python", 195 | "nbconvert_exporter": "python", 196 | "pygments_lexer": "ipython2", 197 | "version": "2.7.10" 198 | } 199 | }, 200 | "nbformat": 4, 201 | "nbformat_minor": 0 202 | } 203 | -------------------------------------------------------------------------------- /class4_2/Naive_Bayes.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": { 7 | "collapsed": false 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "from sklearn import datasets\n", 12 | "from sklearn import metrics \n", 13 | "from sklearn.naive_bayes import GaussianNB \n", 14 | "#load the iris datasets \n", 15 | "dataset=datasets.load_iris() \n", 16 | "#fit a Naive Bayes model to the data \n", 17 | "model=GaussianNB() \n", 18 | "model.fit(dataset.data,dataset.target) \n", 19 | "print(model)\n", 20 | "#makepredictions\n", 21 | "expected=dataset.target \n", 22 | "predicted=model.predict(dataset.data) \n", 23 | "#summarize the fit of the model \n", 24 | "print(metrics.classification_report(expected,predicted)) \n", 25 | "print(metrics.confusion_matrix(expected,predicted))" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "###Let's get back to the saga of Leo and Kate" 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": null, 38 | "metadata": { 39 | "collapsed": true 40 | }, 41 | "outputs": [], 42 | "source": [ 43 | "import pandas as pd\n", 44 | "%matplotlib inline\n", 45 | "import numpy as np" 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": null, 51 | "metadata": { 52 | "collapsed": true 53 | }, 54 | "outputs": [], 55 | "source": [ 56 | "titanic = pd.read_csv(\"data/titanic.csv\")" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": null, 62 | "metadata": { 63 | "collapsed": false 64 | }, 65 | "outputs": [], 66 | "source": [ 67 | "titanic['sex_female'] = titanic['sex'].apply(lambda x:1 if x=='female' else 0)" 68 | ] 69 | }, 70 | { 71 | "cell_type": "code", 72 | "execution_count": null, 73 | "metadata": { 74 | "collapsed": true 75 | }, 76 | "outputs": [], 77 | "source": [ 78 | "dataset = titanic[['survived']].join([pd.get_dummies(titanic['pclass'],prefix=\"pclass\"),titanic.sex_female])" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": null, 84 | "metadata": { 85 | "collapsed": false 86 | }, 87 | "outputs": [], 88 | "source": [ 89 | "#drop pclass_1st to avoid dummy variable trap\n", 90 | "x = np.asarray(dataset[['pclass_1st','pclass_2nd','pclass_3rd','sex_female']])\n", 91 | "y = np.asarray(dataset['survived'])" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": null, 97 | "metadata": { 98 | "collapsed": false 99 | }, 100 | "outputs": [], 101 | "source": [ 102 | "model.fit(x,y)" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": null, 108 | "metadata": { 109 | "collapsed": true 110 | }, 111 | "outputs": [], 112 | 
"source": [ 113 | "expected = y \n", 114 | "predicted = model.predict(x) " 115 | ] 116 | }, 117 | { 118 | "cell_type": "code", 119 | "execution_count": null, 120 | "metadata": { 121 | "collapsed": true 122 | }, 123 | "outputs": [], 124 | "source": [ 125 | "def measure_performance(X,y,clf, show_accuracy=True, show_classification_report=True, show_confussion_matrix=True):\n", 126 | " y_pred=clf.predict(X)\n", 127 | " if show_accuracy:\n", 128 | " print \"Accuracy:{0:.3f}\".format(metrics.accuracy_score(y, y_pred)),\"\\n\"\n", 129 | " if show_classification_report:\n", 130 | " print \"Classification report\"\n", 131 | " print metrics.classification_report(y,y_pred),\"\\n\"\n", 132 | " if show_confussion_matrix:\n", 133 | " print \"Confusion matrix\"\n", 134 | " print metrics.confusion_matrix(y,y_pred),\"\\n\"" 135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": null, 140 | "metadata": { 141 | "collapsed": false 142 | }, 143 | "outputs": [], 144 | "source": [ 145 | "measure_performance(x,y,model)" 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": null, 151 | "metadata": { 152 | "collapsed": true 153 | }, 154 | "outputs": [], 155 | "source": [] 156 | } 157 | ], 158 | "metadata": { 159 | "kernelspec": { 160 | "display_name": "Python 2", 161 | "language": "python", 162 | "name": "python2" 163 | }, 164 | "language_info": { 165 | "codemirror_mode": { 166 | "name": "ipython", 167 | "version": 2 168 | }, 169 | "file_extension": ".py", 170 | "mimetype": "text/x-python", 171 | "name": "python", 172 | "nbconvert_exporter": "python", 173 | "pygments_lexer": "ipython2", 174 | "version": "2.7.10" 175 | } 176 | }, 177 | "nbformat": 4, 178 | "nbformat_minor": 0 179 | } 180 | -------------------------------------------------------------------------------- /class4_2/images/titanic.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datapolitan/lede_algorithms/9109b5c91a3eb74c1343e79daf60abd76be182ea/class4_2/images/titanic.png -------------------------------------------------------------------------------- /class5_1/.ipynb_checkpoints/vectorization-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "from sklearn.feature_extraction.text import CountVectorizer" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "## Basic vectorization\n", 19 | "\n", 20 | "Vectorizing text is a fundamental concept in applying both supervised and unsupervised learning to documents. 
Basically, you can think of it as turning the words in a given text document into features.\n", 21 | "\n", 22 | "Rather than explicitly defining our features, as we did for the donor classification problem, we can instead take advantage of tools, called vectorizers, that turn each word into a feature best described as \"The number of times Word X appears in this document\".\n", 23 | "\n", 24 | "Here's an example with one bill title:" 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": 14, 30 | "metadata": { 31 | "collapsed": true 32 | }, 33 | "outputs": [], 34 | "source": [ 35 | "bill_titles = ['An act to amend Section 44277 of the Education Code, relating to teachers.']" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": 16, 41 | "metadata": { 42 | "collapsed": false 43 | }, 44 | "outputs": [ 45 | { 46 | "name": "stdout", 47 | "output_type": "stream", 48 | "text": [ 49 | "[[1 1 1 1 1 1 1 1 1 1 1 2]]\n" 50 | ] 51 | } 52 | ], 53 | "source": [ 54 | "vectorizer = CountVectorizer()\n", 55 | "features = vectorizer.fit_transform(bill_titles).toarray()" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": 17, 61 | "metadata": { 62 | "collapsed": false 63 | }, 64 | "outputs": [ 65 | { 66 | "name": "stdout", 67 | "output_type": "stream", 68 | "text": [ 69 | "[[1 1 1 1 1 1 1 1 1 1 1 2]]\n", 70 | "[u'44277', u'act', u'amend', u'an', u'code', u'education', u'of', u'relating', u'section', u'teachers', u'the', u'to']\n" 71 | ] 72 | } 73 | ], 74 | "source": [ 75 | "print features\n", 76 | "print vectorizer.get_feature_names()" 77 | ] 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "metadata": {}, 82 | "source": [ 83 | "Think of this vector as a matrix with one row and 12 columns. The row corresponds to our document above. The columns each correspond to a word contained in that document (the first is \"44277\", the second is \"act\", etc.) The numbers correspond to the number of times each word appears in that document. You'll see that all words appear once, except the last one, \"to\", which appears twice.\n", 84 | "\n", 85 | "Now what happens if we add another bill and run it again?" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": 19, 91 | "metadata": { 92 | "collapsed": false 93 | }, 94 | "outputs": [ 95 | { 96 | "name": "stdout", 97 | "output_type": "stream", 98 | "text": [ 99 | "[[1 1 1 1 0 1 0 1 0 1 1 0 1 1 1 2]\n", 100 | " [0 1 0 1 1 0 1 0 1 0 0 1 0 0 0 1]]\n", 101 | "[u'44277', u'act', u'amend', u'an', u'care', u'code', u'coverage', u'education', u'health', u'of', u'relating', u'relative', u'section', u'teachers', u'the', u'to']\n" 102 | ] 103 | } 104 | ], 105 | "source": [ 106 | "bill_titles = ['An act to amend Section 44277 of the Education Code, relating to teachers.',\n", 107 | " 'An act relative to health care coverage']\n", 108 | "features = vectorizer.fit_transform(bill_titles).toarray()\n", 109 | "\n", 110 | "print features\n", 111 | "print vectorizer.get_feature_names()" 112 | ] 113 | }, 114 | { 115 | "cell_type": "markdown", 116 | "metadata": {}, 117 | "source": [ 118 | "Now we've got two rows, each corresponding to a document. The columns correspond to all words contained in BOTH documents, with counts. For example, the first entry from the first column, \"44277', appears once in the first document but zero times in the second. This, basically, is the concept of vectorization." 
119 | ] 120 | }, 121 | { 122 | "cell_type": "markdown", 123 | "metadata": {}, 124 | "source": [ 125 | "## Cleaning up our vectors\n", 126 | "\n", 127 | "As you might imagine, a document set with a relatively large vocabulary can result in vectors that are thousands and thousands of dimensions wide. This isn't necessarily bad, but in the interest of keeping our feature space as low-dimensional as possible, there are a few things we can do to clean them up.\n", 128 | "\n", 129 | "First is removing so-called \"stop words\" -- words like \"and\", \"or\", \"the', etc. that appear in almost every document and therefore aren't especially useful. Scikit-learn's vectorizer objects make this easy:" 130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": 21, 135 | "metadata": { 136 | "collapsed": false 137 | }, 138 | "outputs": [ 139 | { 140 | "name": "stdout", 141 | "output_type": "stream", 142 | "text": [ 143 | "[[1 1 1 0 1 0 1 0 1 0 1 1]\n", 144 | " [0 1 0 1 0 1 0 1 0 1 0 0]]\n", 145 | "[u'44277', u'act', u'amend', u'care', u'code', u'coverage', u'education', u'health', u'relating', u'relative', u'section', u'teachers']\n" 146 | ] 147 | } 148 | ], 149 | "source": [ 150 | "new_vectorizer = CountVectorizer(stop_words='english')\n", 151 | "features = new_vectorizer.fit_transform(bill_titles).toarray()\n", 152 | "\n", 153 | "print features\n", 154 | "print new_vectorizer.get_feature_names()" 155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "metadata": {}, 160 | "source": [ 161 | "Notice that our feature space is now a little smaller. We can use a similar trick to eliminate words that only appear a small number of times, which becomes useful when document sets get very large." 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": 24, 167 | "metadata": { 168 | "collapsed": false 169 | }, 170 | "outputs": [ 171 | { 172 | "name": "stdout", 173 | "output_type": "stream", 174 | "text": [ 175 | "[[1]\n", 176 | " [1]]\n", 177 | "[u'act']\n" 178 | ] 179 | } 180 | ], 181 | "source": [ 182 | "new_vectorizer = CountVectorizer(stop_words='english', min_df=2)\n", 183 | "features = new_vectorizer.fit_transform(bill_titles).toarray()\n", 184 | "\n", 185 | "print features\n", 186 | "print new_vectorizer.get_feature_names()" 187 | ] 188 | }, 189 | { 190 | "cell_type": "markdown", 191 | "metadata": {}, 192 | "source": [ 193 | "This is a bad example for this document set, but it will help later -- I promise. Finally, we can also create features that comprise more than one word. These are known as N-grams, with the N being the number of words contained in the feature. 
Here is how you could create a feature vector of all 1-grams and 2-grams:" 194 | ] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "execution_count": null, 199 | "metadata": { 200 | "collapsed": true 201 | }, 202 | "outputs": [], 203 | "source": [ 204 | "new_vectorizer = CountVectorizer(stop_words='english', ngram_range=(1,2))\n", 205 | "features = new_vectorizer.fit_transform(bill_titles).toarray()\n", 206 | "\n", 207 | "print features\n", 208 | "print new_vectorizer.get_feature_names()" 209 | ] 210 | } 211 | ], 212 | "metadata": { 213 | "kernelspec": { 214 | "display_name": "Python 2", 215 | "language": "python", 216 | "name": "python2" 217 | }, 218 | "language_info": { 219 | "codemirror_mode": { 220 | "name": "ipython", 221 | "version": 2 222 | }, 223 | "file_extension": ".py", 224 | "mimetype": "text/x-python", 225 | "name": "python", 226 | "nbconvert_exporter": "python", 227 | "pygments_lexer": "ipython2", 228 | "version": "2.7.9" 229 | } 230 | }, 231 | "nbformat": 4, 232 | "nbformat_minor": 0 233 | } 234 | -------------------------------------------------------------------------------- /class5_1/README.md: -------------------------------------------------------------------------------- 1 | # Algorithms: Week 5, Class 1 (Tuesday, Aug. 11) 2 | 3 | We'll pick up where we left off last week with the bill classification problem, using it as an excuse to introduce a method of feature creation that is especially useful for text documents -- the idea of vectorization. If we have time, we'll also discuss the basic idea of supervised learning: 4 | 5 | ## Hour 1: Exercise review 6 | 7 | We'll talk in detail through the exercises from last week (which were deliberately difficult) and use them to segue into basic natural language processing techniques. 8 | 9 | ## Hour 2/2.5: Vectorization 10 | 11 | We'll talk about how to use vectorization to engineer our features for natural language classification and clustering problems, rather than building features by hand. We'll then revisit the bill classification problem from last week using what we've learned. 12 | 13 | ## Hour 2.5/3: Unsupervised learning 14 | 15 | We'll talk a little about the intuition and dangers of unsupervised learning, also known as clustering, using crime data as our example. 16 | 17 | ## Lab 18 | 19 | You'll be doing two things in lab today: 20 | 21 | - First you'll work through a simple document classification problem (classifying drug-related and non-drug-related press releases) using vectorization and the other techinques we discussed in class. 22 | 23 | - Second, take a look at [this map](https://www.google.com/maps/d/u/1/embed?mid=z9S6reOYqCIE.kQnlzV2-uDzg), which shows police dispatch logs for Columbia, Mo., over the first 10 days of August. Within the map, there are three layers (eps_0.3, eps_0.2, eps_0.4), each of which shows hotspots of dispatches calculated in slightly different ways. Choose what you think is the fairest representations of the hotsports and write a couple paragraphs characterizing your findings. Be sure to include the layer you chose in your Tumblr post. 
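If you get stuck on the press-release task, the pattern is the same vectorize-then-train loop used for the bill titles. A minimal sketch (using the repo's data/releases_training.txt, with the classifier choice and the evaluation step left to you) might look like:

from sklearn import preprocessing
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

training = [line.strip().split('|') for line in open('data/releases_training.txt').readlines()]
text = [t[0] for t in training if len(t) > 1]
labels = preprocessing.LabelEncoder().fit_transform([t[1] for t in training if len(t) > 1])

vectorizer = CountVectorizer(stop_words='english')
features = vectorizer.fit_transform(text)
model = MultinomialNB().fit(features, labels)   # or swap in DecisionTreeClassifier, as in class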
-------------------------------------------------------------------------------- /class5_1/bill_classifier.py: -------------------------------------------------------------------------------- 1 | from sklearn import preprocessing 2 | from sklearn import cross_validation 3 | from sklearn.tree import DecisionTreeClassifier 4 | from sklearn.naive_bayes import MultinomialNB 5 | from sklearn.feature_extraction.text import CountVectorizer 6 | 7 | if __name__ == '__main__': 8 | 9 | ########## STEP 1: DATA IMPORT AND PREPROCESSING ########## 10 | 11 | # Here we're taking in the training data and splitting it into two lists: One with the text of 12 | # each bill title, and the second with each bill title's corresponding category. Order is important. 13 | # The first bill in list 1 should also be the first category in list 2. 14 | training = [line.strip().split('|') for line in open('data/bills_training.txt', 'r').readlines()] 15 | text = [t[0] for t in training if len(t) > 1] 16 | labels = [t[1] for t in training if len(t) > 1] 17 | 18 | # A little bit of cleanup for scikit-learn's benefit. Scikit-learn models wants our categories to 19 | # be numbers, not strings. The LabelEncoder performs this transformation. 20 | encoder = preprocessing.LabelEncoder() 21 | correct_labels = encoder.fit_transform(labels) 22 | 23 | ########## STEP 2: FEATURE EXTRACTION ########## 24 | print 'Extracting features ...' 25 | 26 | vectorizer = CountVectorizer(stop_words='english') 27 | data = vectorizer.fit_transform(text) 28 | 29 | ########## STEP 3: MODEL BUILDING ########## 30 | print 'Training ...' 31 | 32 | #model = MultinomialNB() 33 | model = DecisionTreeClassifier() 34 | fit_model = model.fit(data, correct_labels) 35 | 36 | # ########## STEP 4: EVALUATION ########## 37 | print 'Evaluating ...' 38 | 39 | # Evaluate our model with 10-fold cross-validation 40 | scores = cross_validation.cross_val_score(model, data, correct_labels, cv=5) 41 | print "Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2) 42 | 43 | # ########## STEP 5: APPLYING THE MODEL ########## 44 | print 'Classifying ...' 45 | 46 | docs_new = ["Public postsecondary education: executive officer compensation.", 47 | "An act to add Section 236.3 to the Education code, related to the pricing of college textbooks.", 48 | "Political Reform Act of 1974: campaign disclosures.", 49 | "An act to add Section 236.3 to the Penal Code, relating to human trafficking." 50 | ] 51 | 52 | test_data = vectorizer.transform(docs_new) 53 | 54 | for i in xrange(len(docs_new)): 55 | print '%s -> %s' % (docs_new[i], encoder.classes_[model.predict(test_data.toarray()[i])]) -------------------------------------------------------------------------------- /class5_1/crime_clusterer.py: -------------------------------------------------------------------------------- 1 | ''' 2 | cluster.py 3 | This script demonstrates the use of the DBSCAN algorithm for finding 4 | clusters of crimes in Columbia, Mo. DBSCAN is a density-based clustering 5 | algorithm that finds points based on their proximity to other points in the 6 | dataset. Unlike algorithms such as K-means, you do not need to specify the 7 | number of clusters you would like it to find in advance. Instead, you set a 8 | parameter, epsilon, that identifies how close you would like two points to be 9 | for them to belong to the same cluster. 
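Points that do not have enough neighbors within epsilon (min_samples, which defaults to 5 in scikit-learn) are
treated as noise and come back with the cluster label -1 in the output below.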
10 | 11 | More information here: 12 | http://en.wikipedia.org/wiki/DBSCAN 13 | http://scikit-learn.org/dev/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN 14 | 15 | And there's a clean, documented implementation here for reference: 16 | https://github.com/cjdd3b/car-datascience-toolkit/blob/master/cluster/dbscan.py 17 | ''' 18 | 19 | import csv 20 | import numpy as np 21 | from scipy.spatial import distance 22 | from sklearn.cluster import DBSCAN 23 | 24 | ########## MODIFY THIS ########## 25 | 26 | EPS = 0.04 27 | 28 | ######### DON'T WORRY (YET) ABOUT MODIFYING THIS ########## 29 | 30 | # Pull in our data using DictReader 31 | data = list(csv.DictReader(open('data/columbia_crime.csv', 'r').readlines())) 32 | 33 | # Separate out the coordinates 34 | coords = [(float(d['lat']), float(d['lng'])) for d in data if len(d['lat']) > 0] 35 | types = [d['ExtNatureDisplayName'] for d in data] 36 | 37 | # Scikit-learn's implemenetation of DBSCAN requires the input of a distance matrix showing pairwise 38 | # distances between all points in the dataset. 39 | distance_matrix = distance.squareform(distance.pdist(coords)) 40 | 41 | # Run DBSCAN. Setting epsilon with lat/lon data like we have here is an inexact science. 0.03 looked 42 | # good after a few test runs. Ideally we'd project the data and set epsilon using meters or feet. 43 | db = DBSCAN(eps=EPS).fit(distance_matrix) 44 | 45 | # And now we print out the results in the form cluster_id,lat,lng. You can save this to a file and import 46 | # directly into a mapping program or Fusion Tables if you want to visualize it. 47 | for k in set(db.labels_): 48 | class_members = [index[0] for index in np.argwhere(db.labels_ == k)] 49 | for index in class_members: 50 | print '%s,%s,%s' % (int(k), types[index], '{0},{1}'.format(*coords[index])) 51 | -------------------------------------------------------------------------------- /class5_1/data/releases_training.txt: -------------------------------------------------------------------------------- 1 | FEB 12 (BEAUMONT, Texas) – A 25-year-old Port Arthur, Texas man has pleaded guilty to drug trafficking violations in the Eastern District of Texas, announced Drug Enforcement Administration Acting Special Agent in Charge Steven S. Whipple and U.S. Attorney John M. Bales today. Michael Joseph Barrett IV pleaded guilty to possession with intent to distribute methamphetamine on Feb. 11, 2014 before U.S. District Judge Marcia Crone. According to information presented in court, on Feb. 19, 2013, law enforcement officers responded to a residence on 32nd Street in Port Arthur after receiving information regarding suspected manufacture of methamphetamine at the location. Consent to search was obtained and a search of the premises revealed a small amount of cocaine, a semi-automatic pistol, and various items associated with methamphetamine manufacture, including a three liter bottle containing a methamphetamine mixture. A federal grand jury returned an indictment on Dec. 4, 2013, charging Barrett with drug trafficking violations. Barrett faces up to 20 years in federal prison at sentencing. A sentencing date has not been set. This case was investigated by the Drug Enforcement Administration, the Port Arthur Police Department and the Jefferson County Sheriff's Office Crime Lab and prosecuted by Assistant U.S. Attorney Randall L. 
Fluke.|YES 2 | FEB 05 (BROWNSVILLE, Texas ) - Stephen Whipple, Acting Special Agent in Charge of the United States Drug Enforcement Administration (DEA), Houston Division and United States Attorney Kenneth Magidson announced Jesus Mauricio Juarez Jr. aka Flaco 27, has been sentenced to federal prison for his involvement in a 1,000 pound marijuana load. He pleaded guilty in November 2013. Today, Senior U.S. District Judge Hilda G. Tagle sentenced Juarez to 31 months in federal prison. In handing down the sentence, Ruben Gonzalez-Cavazos aka Mume, also pleaded guilty in relation to the conspiracy and was sentenced to 47 months in federal prison and assessed a $15,000 fine on Feb. 3, 2014. Co-defendant Francisco Javier Maya, 35, went to trial last week in Brownsville and was convicted on all counts. He will be sentenced on May 13, 2014. Adolfo Lozano-Luna aka Chefero, 35, and Alberto Martinez aka El Diablo, 50, also pleaded guilty and will be sentenced at a later date. Evidence at Maya’s trial placed all five men in a conspiracy involving a 1,000 pound marijuana load, which was forcibly hijacked from them by unknown individuals on Dec. 11, 2012. One month later, Juarez was injured after an improvised explosive device (IED) detonated at his residence in Brownsville. In sentencing Juarez today, Judge Tagle discussed the bombing incident and noted that at least he and his family still have their lives. Evidence also linked Juarez, Gonzalez-Cavazos and Maya to other marijuana loads during the conspiracy. Maya’s role in the drug trafficking organization was to provide drivers for tractor trailers to drive marijuana loads to locations to include Houston and Taylor. Maya, Juarez and Gonzalez-Cavazos would share in the profits of each successful marijuana load. At the direction of Juarez, Maya provided bank account numbers associated with him and Gonzalez-Cavazos to Juarez in order to deposit drug profits. Juarez then made deposits stemming from narcotics proceeds from a successful marijuana load delivered to Taylor in November 2012. Evidence was presented at Maya’s trial that a $6,000 deposit was made into an account associated with Maya on Nov. 28, 2012, while another $6,500 was deposited into an account associated with Gonzalez-Cavazos on the same day. The jury last week also heard that Maya was a follower of the Santeria religion. The jury saw photos of Maya’s residence in Mission, Texas, which depicted numerous images of what was considered to be altars showing glasses of alcohol, knives, a machete, kettles, feathers and substances that appeared to be blood. Testimony also included descriptions of two rituals involving the sacrifice of animals. In December 2012, Maya had a Santeria priest, known as a “Padrino,” perform rituals with the organization to “bless” a 1,000 pound marijuana load that was destined for Houston. After meeting with the Padrino, Maya, Gonzalez-Cavazos and Juarez decided the marijuana load should remain in the Rio Grande Valley. The next day, a second ritual, attended by all five defendants, was performed and the 1,000 pounds of marijuana was to be transported to Houston. However, the marijuana was stolen from the group by unknown individuals that evening. After the theft and subsequent IED detonation, law enforcement was able to piece together the events and conspirators involved in this drug trafficking organization. 
The case was investigated by the Drug Enforcement Administration, FBI, Homeland Security Investigations, Bureau of Alcohol, Tobacco, Firearms and Explosives and the Brownsville Police Department. The case was prosecuted by Assistant United States Attorneys Angel Castro and Jody Young.|YES 3 | JAN 08 (HOUSTON) - Javier F. Peña, Special Agent in Charge of the United States Drug Enforcement Administration (DEA), Houston Division and Kenneth Magidson, United States Attorney, Southern District of Texas announced Oscar Nava-Valencia, 42, of Guadalajara, Mexico, has received a 25-year sentence for his role in the smuggling of a 3,100 kilogram load of cocaine from Panama. Nava-Valencia previously pleaded guilty and was sentenced late yesterday afternoon in federal court in Houston. U.S. District Judge Ewing Werlein Jr. sentenced Nava-Valencia to a term of 300 months in federal prison and further ordered him to pay a $5,000 fine. In March 2006, Panamanian authorities seized approximately 2,080 kilograms of cocaine from a warehouse in Panama City, Panama. The seized cocaine was part of a larger load totaling approximately 3,100 kilograms which was to be shipped from Panama to Mexico and eventually destined for the United States. Nava-Valencia, along with other associates, was to take possession of approximately 1,250 kilograms of cocaine once it arrived in Mexico. In January of 2010, Nava-Valencia was apprehended by Mexican authorities and extradited to the United States in January 2011. He has been and will remain in custody pending transfer to a U.S. Bureau of Prisons facility to be determined in the near future. The investigation leading to the charges was conducted by the Drug Enforcement Administration. Assistant United States Attorneys James Sturgis prosecuted the case.|YES 4 | JAN 21 (MONTGOMERY, Ala.) – The Drug Enforcement Administration awarded Assistant U. S. Attorneys Verne Speirs and Gray Borden the Spartan Award, announced George L. Beck, Jr., United States Attorney Middle District of Alabama. The Spartan award recognizes prosecutors for their dedication and extraordinary effort to investigate and prosecute large-scale drug dealers and money launderers. This year’s award is presented to Assistant U.S. Attorneys Speirs and Borden because of long hours invested and success obtained in combating the ever-growing scourge of drug dealing in the Middle District of Alabama. DEA chose Speirs and Borden for this award after examining the work of all federal prosecutors in the State of Alabama. “The DEA in Alabama was pleased to present the 2013 Spartan Award for Excellence in Drug Investigations to AUSA’s Speirs and Borden,” stated Clay Morris, Assistant Special Agent in Charge of DEA in Alabama. “The award was named after the Spartan Warrior Society. AUSA’s Speirs and Borden were selected by DEA management to receive the award because they exhibited many traits of a Spartan Warrior: a relentless pursuit of justice, tenacity, loyalty and dedication. Throughout 2013, AUSA’s Speirs and Borden tirelessly worked alongside our agents and task force officers in many long term complex investigations. Because of the dedication of AUSA’s Speirs and Borden, many drug trafficking organizations were completely dismantled and dangerous criminals were removed from the streets of our communities. I cannot say enough about the outstanding efforts of AUSA’s Speirs and Borden and the entire staff of the Unites States Attorney’s Office. 
One thing is certain, as long as AUSA’s Speirs and Borden are prosecuting drug trafficking organizations, those who target and sell poison to our children should be very afraid.” “I am very pleased that the extraordinary success of AUSAs Speirs and Borden are receiving the recognition they truly deserve,” stated U.S. Attorney George Beck, “They have worked tirelessly to prosecute these criminals. I believe it is essential that these types of crimes be vigorously prosecuted and that we continue to combat the drug problem facing this district and this nation.” “I am truly humbled to receive this award, but the real credit goes to the DEA Agents and Task Force Officers who risk everything to combat drug traffickers across this country,” stated Verne Speirs, Assistant U.S. Attorney. “The safety of our families and communities depend upon their selfless service.” “I consider this award to be one of the great achievements in my career in the U.S. Attorney’s Office, but the credit goes to our dedicated and professional staff and the DEA’s stable of tireless agents,” stated Gray Borden, Assistant U.S. Attorney. “I am proud to be associated with a team of this caliber.”|NO 5 | JAN 30 (SAN JUAN, Puerto Rico) – Yesterday, January 29, U.S. Magistrate Judge Marcos E. López authorized a complaint charging: Joselito Taveras, Miguel Jimenez, and Alberto Dominguez with conspiracy to possess and possession with intent to distribute controlled substances, and conspiracy to import and importation of controlled substances, announced Rosa Emilia Rodríguez-Vélez, United States Attorney for the District of Puerto Rico. The crew of the Coast Guard Cutter Farallon offloaded 136 kilograms (300 pounds) of cocaine Monday night, 60 nautical miles northwest of Aguadilla, Puerto Rico and transferred the custody of the defendants to Drug Enforcement Administration (DEA) special agents and Customs and Border Protection officers Wednesday at Coast Guard San Juan, Puerto Rico. The interdiction was a result of U.S. Coast Guard, Customs Border Protection, Drug Enforcement Administration and Dominican Republic Navy coordinated efforts in support of Operation Unified Resolve, Operation Caribbean Guard, and the Caribbean Corridor Strike Force (CCSF), to interdict the illegal drug shipment consisting of nine bales of cocaine with an estimated street value of approximately $3.5 million dollars.|YES 6 | JAN 10 (WASHINGTON) – The U.S. Department of Justice and the U.S. Department of Commerce's National Institute of Standards and Technology (NIST) today announced appointments to a newly created National Commission on Forensic Science. Members of the commission will work to improve the practice of forensic science by developing guidance concerning the intersections between forensic science and the criminal justice system. The commission also will work to develop policy recommendations for the U.S. Attorney General, including uniform codes for professional responsibility and requirements for formal training and certification. The commission is co-chaired by Deputy Attorney General James M. Cole and Under Secretary of Commerce for Standards and Technology and NIST Director Patrick D. Gallagher. Nelson Santos, Deputy Assistant Administrator for the Office of Forensic Sciences at the Drug Enforcement Administration, and John M. Butler, Special Assistant to the NIST director for forensic science, serve as vice-chairs. 
"I appreciate the commitment each of the commissioners has made and look forward to working with them to strengthen the validity and reliability of the forensic sciences and enhance quality assurance and quality control," said Deputy Attorney General Cole. "Scientifically valid and accurate forensic analysis supports all aspects of our justice system."|NO 7 | -------------------------------------------------------------------------------- /class5_1/release_classifier.py: -------------------------------------------------------------------------------- 1 | from sklearn import preprocessing 2 | from sklearn.tree import DecisionTreeClassifier 3 | from sklearn.naive_bayes import MultinomialNB 4 | from sklearn.feature_extraction.text import CountVectorizer 5 | 6 | if __name__ == '__main__': 7 | 8 | ########## STEP 1: DATA IMPORT AND PREPROCESSING ########## 9 | 10 | training = [line.strip().split('|') for line in open('data/releases_training.txt', 'r').readlines()] 11 | text = [t[0] for t in training if len(t) > 1] 12 | labels = [t[1] for t in training if len(t) > 1] 13 | 14 | encoder = preprocessing.LabelEncoder() 15 | correct_labels = encoder.fit_transform(labels) 16 | 17 | ########## FEATURE EXTRACTION ########## 18 | 19 | # VECTORIZE YOUR DATA HERE 20 | 21 | ########## MODEL BUILDING ########## 22 | 23 | # TRAIN YOUR MODEL HERE 24 | 25 | ########## STEP 5: APPLYING THE MODEL ########## 26 | docs_new = ["Five Columbia Residents among 10 Defendants Indicted for Conspiracy to Distribute a Ton of Marijuana", 27 | ] 28 | 29 | # EVALUATE THE DOCUMENT HERE -------------------------------------------------------------------------------- /class5_1/vectorization.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 4, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "from sklearn.feature_extraction.text import CountVectorizer" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "## Basic vectorization\n", 19 | "\n", 20 | "Vectorizing text is a fundamental concept in applying both supervised and unsupervised learning to documents. 
Basically, you can think of it as turning the words in a given text document into features, represented by a matrix.\n", 21 | "\n", 22 | "Rather than explicitly defining our features, as we did for the donor classification problem, we can instead take advantage of tools, called vectorizers, that turn each word into a feature best described as \"The number of times Word X appears in this document\".\n", 23 | "\n", 24 | "Here's an example with one bill title:" 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": 5, 30 | "metadata": { 31 | "collapsed": true 32 | }, 33 | "outputs": [], 34 | "source": [ 35 | "bill_titles = ['An act to amend Section 44277 of the Education Code, relating to teachers.']" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": 7, 41 | "metadata": { 42 | "collapsed": false, 43 | "scrolled": false 44 | }, 45 | "outputs": [], 46 | "source": [ 47 | "vectorizer = CountVectorizer()\n", 48 | "features = vectorizer.fit_transform(bill_titles).toarray()" 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": 8, 54 | "metadata": { 55 | "collapsed": false, 56 | "scrolled": true 57 | }, 58 | "outputs": [ 59 | { 60 | "name": "stdout", 61 | "output_type": "stream", 62 | "text": [ 63 | "[[1 1 1 1 1 1 1 1 1 1 1 2]]\n", 64 | "[u'44277', u'act', u'amend', u'an', u'code', u'education', u'of', u'relating', u'section', u'teachers', u'the', u'to']\n" 65 | ] 66 | } 67 | ], 68 | "source": [ 69 | "print features\n", 70 | "print vectorizer.get_feature_names()" 71 | ] 72 | }, 73 | { 74 | "cell_type": "markdown", 75 | "metadata": {}, 76 | "source": [ 77 | "Think of this vector as a matrix with one row and 12 columns. The row corresponds to our document above. The columns each correspond to a word contained in that document (the first is \"44277\", the second is \"act\", etc.) The numbers correspond to the number of times each word appears in that document. You'll see that all words appear once, except the last one, \"to\", which appears twice.\n", 78 | "\n", 79 | "Now what happens if we add another bill and run it again?" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": 11, 85 | "metadata": { 86 | "collapsed": false 87 | }, 88 | "outputs": [ 89 | { 90 | "name": "stdout", 91 | "output_type": "stream", 92 | "text": [ 93 | "[[1 1 1 1 0 1 0 1 0 1 1 0 1 1 1 2]\n", 94 | " [0 1 0 1 1 0 1 0 1 0 0 1 0 0 0 1]]\n", 95 | "[u'44277', u'act', u'amend', u'an', u'care', u'code', u'coverage', u'education', u'health', u'of', u'relating', u'relative', u'section', u'teachers', u'the', u'to']\n" 96 | ] 97 | } 98 | ], 99 | "source": [ 100 | "bill_titles = ['An act to amend Section 44277 of the Education Code, relating to teachers.',\n", 101 | " 'An act relative to health care coverage']\n", 102 | "features = vectorizer.fit_transform(bill_titles).toarray()\n", 103 | "\n", 104 | "print features\n", 105 | "print vectorizer.get_feature_names()" 106 | ] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "metadata": {}, 111 | "source": [ 112 | "Now we've got two rows, each corresponding to a document. The columns correspond to all words contained in BOTH documents, with counts. For example, the first entry from the first column, \"44277', appears once in the first document but zero times in the second. This, basically, is the concept of vectorization." 
113 | ] 114 | }, 115 | { 116 | "cell_type": "markdown", 117 | "metadata": {}, 118 | "source": [ 119 | "## Cleaning up our vectors\n", 120 | "\n", 121 | "As you might imagine, a document set with a relatively large vocabulary can result in vectors that are thousands and thousands of dimensions wide. This isn't necessarily bad, but in the interest of keeping our feature space as low-dimensional as possible, there are a few things we can do to clean them up.\n", 122 | "\n", 123 | "First is removing so-called \"stop words\" -- words like \"and\", \"or\", \"the', etc. that appear in almost every document and therefore aren't especially useful. Scikit-learn's vectorizer objects make this easy:" 124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": 12, 129 | "metadata": { 130 | "collapsed": false 131 | }, 132 | "outputs": [ 133 | { 134 | "name": "stdout", 135 | "output_type": "stream", 136 | "text": [ 137 | "[[1 1 1 0 1 0 1 0 1 0 1 1]\n", 138 | " [0 1 0 1 0 1 0 1 0 1 0 0]]\n", 139 | "[u'44277', u'act', u'amend', u'care', u'code', u'coverage', u'education', u'health', u'relating', u'relative', u'section', u'teachers']\n" 140 | ] 141 | } 142 | ], 143 | "source": [ 144 | "new_vectorizer = CountVectorizer(stop_words='english')\n", 145 | "features = new_vectorizer.fit_transform(bill_titles).toarray()\n", 146 | "\n", 147 | "print features\n", 148 | "print new_vectorizer.get_feature_names()" 149 | ] 150 | }, 151 | { 152 | "cell_type": "markdown", 153 | "metadata": {}, 154 | "source": [ 155 | "Notice that our feature space is now a little smaller. We can use a similar trick to eliminate words that only appear a small number of times, which becomes useful when document sets get very large." 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": 13, 161 | "metadata": { 162 | "collapsed": false 163 | }, 164 | "outputs": [ 165 | { 166 | "name": "stdout", 167 | "output_type": "stream", 168 | "text": [ 169 | "[[1]\n", 170 | " [1]]\n", 171 | "[u'act']\n" 172 | ] 173 | } 174 | ], 175 | "source": [ 176 | "new_vectorizer = CountVectorizer(stop_words='english', min_df=2)\n", 177 | "features = new_vectorizer.fit_transform(bill_titles).toarray()\n", 178 | "\n", 179 | "print features\n", 180 | "print new_vectorizer.get_feature_names()" 181 | ] 182 | }, 183 | { 184 | "cell_type": "markdown", 185 | "metadata": {}, 186 | "source": [ 187 | "This is a bad example for this document set, but it will help later -- I promise. Finally, we can also create features that comprise more than one word. These are known as N-grams, with the N being the number of words contained in the feature. 
Here is how you could create a feature vector of all 1-grams and 2-grams:" 188 | ] 189 | }, 190 | { 191 | "cell_type": "code", 192 | "execution_count": 17, 193 | "metadata": { 194 | "collapsed": false, 195 | "scrolled": true 196 | }, 197 | "outputs": [ 198 | { 199 | "name": "stdout", 200 | "output_type": "stream", 201 | "text": [ 202 | "[[1 1 1 1 0 1 1 0 0 1 1 0 1 1 0 0 1 1 0 0 1 1 1]\n", 203 | " [0 0 1 0 1 0 0 1 1 0 0 1 0 0 1 1 0 0 1 1 0 0 0]]\n", 204 | "[u'44277', u'44277 education', u'act', u'act amend', u'act relative', u'amend', u'amend section', u'care', u'care coverage', u'code', u'code relating', u'coverage', u'education', u'education code', u'health', u'health care', u'relating', u'relating teachers', u'relative', u'relative health', u'section', u'section 44277', u'teachers']\n" 205 | ] 206 | } 207 | ], 208 | "source": [ 209 | "new_vectorizer = CountVectorizer(stop_words='english', ngram_range=(1,2))\n", 210 | "features = new_vectorizer.fit_transform(bill_titles).toarray()\n", 211 | "\n", 212 | "print features\n", 213 | "print new_vectorizer.get_feature_names()" 214 | ] 215 | }, 216 | { 217 | "cell_type": "markdown", 218 | "metadata": {}, 219 | "source": [ 220 | "Although the feature space gets much larger, sometimes having multi-word features can make our models more accurate.\n", 221 | "\n", 222 | "These are just a few basic tricks scikit-learn makes available for transforming your vectors (you can see other ones [here](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)). But now let's take what we've learned here and apply it to the bill classification problem." 223 | ] 224 | }, 225 | { 226 | "cell_type": "code", 227 | "execution_count": null, 228 | "metadata": { 229 | "collapsed": true 230 | }, 231 | "outputs": [], 232 | "source": [] 233 | } 234 | ], 235 | "metadata": { 236 | "kernelspec": { 237 | "display_name": "Python 2", 238 | "language": "python", 239 | "name": "python2" 240 | }, 241 | "language_info": { 242 | "codemirror_mode": { 243 | "name": "ipython", 244 | "version": 2 245 | }, 246 | "file_extension": ".py", 247 | "mimetype": "text/x-python", 248 | "name": "python", 249 | "nbconvert_exporter": "python", 250 | "pygments_lexer": "ipython2", 251 | "version": "2.7.9" 252 | } 253 | }, 254 | "nbformat": 4, 255 | "nbformat_minor": 0 256 | } 257 | -------------------------------------------------------------------------------- /class5_2/5_2-Assignment.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "test_string = \"Do you know the way to San Jose?\"" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "##Extend your functions from class\n", 19 | "###1. Add code to your tokenizer to filter for punctuation before tokenizing\n", 20 | "####This might be helpful: http://stackoverflow.com/a/266162/1808021" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": null, 26 | "metadata": { 27 | "collapsed": true 28 | }, 29 | "outputs": [], 30 | "source": [] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": {}, 35 | "source": [ 36 | "###2. 
Add code to your tokenizer to filter for stopwords\n", 37 | "###Your function should use the list of stopwords to filter the string and not return words in the stopword list\n", 38 | "###You can use the list in NLTK or create your own\n" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": null, 44 | "metadata": { 45 | "collapsed": true 46 | }, 47 | "outputs": [], 48 | "source": [] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": {}, 53 | "source": [ 54 | "###3. Add code to your tokenizer to call your tokenizer to create word tokens (if it doesn't already) and then generate the counts for each token" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": null, 60 | "metadata": { 61 | "collapsed": true 62 | }, 63 | "outputs": [], 64 | "source": [] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "metadata": {}, 69 | "source": [ 70 | "##Bonus\n", 71 | "###Write a simple function to calculate the tf-idf \n", 72 | "####Remember the following were $t$ is the term, $D$ is the document, $N$ is the total number of documents, $n_w$ is the number of documents containing each word $t$, and $i_w$ is the frequency word $t$ appears in a document\n", 73 | "\n", 74 | "$tf(t,D)=\\frac{i_w}{n_D}$\n", 75 | "\n", 76 | "$idf(t,D)=\\log(\\frac{N}{1+n_w})$\n", 77 | "\n", 78 | "$tfidf=tf\\times idf$" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": null, 84 | "metadata": { 85 | "collapsed": true 86 | }, 87 | "outputs": [], 88 | "source": [] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "metadata": {}, 93 | "source": [ 94 | "##k-NN on Iris\n", 95 | "###4. Using the Iris dataset, test the kNN for various levels of k to see if you can build a better classifier than our decision tree in 3_2" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": null, 101 | "metadata": { 102 | "collapsed": true 103 | }, 104 | "outputs": [], 105 | "source": [] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [ 111 | "##k-Means with Congressional Bills\n", 112 | "###5. Explore the clusters of Congressional Records. Select another subset and investigate the contents. Write code that investigates a different cluster." 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": null, 118 | "metadata": { 119 | "collapsed": true 120 | }, 121 | "outputs": [], 122 | "source": [] 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "metadata": {}, 127 | "source": [ 128 | "###6. 
On the class Tumblr, provide a response to the lesson on k-Means, specifically whether you think this is a useful technique for working journalists (data or otherwise)" 129 | ] 130 | } 131 | ], 132 | "metadata": { 133 | "kernelspec": { 134 | "display_name": "Python 2", 135 | "language": "python", 136 | "name": "python2" 137 | }, 138 | "language_info": { 139 | "codemirror_mode": { 140 | "name": "ipython", 141 | "version": 2 142 | }, 143 | "file_extension": ".py", 144 | "mimetype": "text/x-python", 145 | "name": "python", 146 | "nbconvert_exporter": "python", 147 | "pygments_lexer": "ipython2", 148 | "version": "2.7.10" 149 | } 150 | }, 151 | "nbformat": 4, 152 | "nbformat_minor": 0 153 | } 154 | -------------------------------------------------------------------------------- /class5_2/5_2-DoNow.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "##Let's check your knowledge of the material we've covered" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "###Code your own tokenizer\n", 15 | "####Write a simple tokenizer function to take in a string, tokenize by individual words" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": null, 21 | "metadata": { 22 | "collapsed": true 23 | }, 24 | "outputs": [], 25 | "source": [] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": {}, 30 | "source": [ 31 | "###Create your own vectorizer\n", 32 | "####Write code to output the list of tokens and the count for each token" 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": null, 38 | "metadata": { 39 | "collapsed": true 40 | }, 41 | "outputs": [], 42 | "source": [] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": null, 47 | "metadata": { 48 | "collapsed": true 49 | }, 50 | "outputs": [], 51 | "source": [] 52 | } 53 | ], 54 | "metadata": { 55 | "kernelspec": { 56 | "display_name": "Python 2", 57 | "language": "python", 58 | "name": "python2" 59 | }, 60 | "language_info": { 61 | "codemirror_mode": { 62 | "name": "ipython", 63 | "version": 2 64 | }, 65 | "file_extension": ".py", 66 | "mimetype": "text/x-python", 67 | "name": "python", 68 | "nbconvert_exporter": "python", 69 | "pygments_lexer": "ipython2", 70 | "version": "2.7.10" 71 | } 72 | }, 73 | "nbformat": 4, 74 | "nbformat_minor": 0 75 | } 76 | -------------------------------------------------------------------------------- /class5_2/kmeans.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import pandas as pd\n", 12 | "import re #a package for doing regex\n", 13 | "import glob #for accessing files on our local system" 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "metadata": {}, 19 | "source": [ 20 | "###We'll be using data from http://www.cs.cornell.edu/home/llee/data/convote.html to explore k-means clustering" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": null, 26 | "metadata": { 27 | "collapsed": false 28 | }, 29 | "outputs": [], 30 | "source": [ 31 | "!curl -O http://www.cs.cornell.edu/home/llee/data/convote/convote_v1.1.tar.gz" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": null, 37 | "metadata": { 38 | "collapsed": false 39 | }, 40 | 
"outputs": [], 41 | "source": [ 42 | "!tar -zxvf convote_v1.1.tar.gz" 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": null, 48 | "metadata": { 49 | "collapsed": true 50 | }, 51 | "outputs": [], 52 | "source": [ 53 | "paths = glob.glob(\"convote_v1.1/data_stage_one/development_set/*\")\n", 54 | "speeches = []\n", 55 | "for path in paths:\n", 56 | " speech = {}\n", 57 | " filename = path[-26:]\n", 58 | " speech['filename'] = filename\n", 59 | " speech['bill_no'] = filename[:3]\n", 60 | " speech['speaker_no'] = filename[4:10]\n", 61 | " speech['bill_vote'] = filename[-5]\n", 62 | " speech['party'] = filename[-7]\n", 63 | " \n", 64 | " # Open the file\n", 65 | " speech_file = open(path, 'r')\n", 66 | " # Read the stuff out of it\n", 67 | " speech['contents'] = speech_file.read()\n", 68 | "\n", 69 | " cleaned_contents = re.sub(r\"[^ \\w]\",'', speech['contents'])\n", 70 | " cleaned_contents = re.sub(r\" +\",' ', cleaned_contents)\n", 71 | " cleaned_contents = cleaned_contents.strip()\n", 72 | " words = cleaned_contents.split(' ')\n", 73 | " speech['word_count'] = len(words)\n", 74 | " \n", 75 | " speeches.append(speech)" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": null, 81 | "metadata": { 82 | "collapsed": false 83 | }, 84 | "outputs": [], 85 | "source": [ 86 | "speeches[:5]" 87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "execution_count": null, 92 | "metadata": { 93 | "collapsed": false 94 | }, 95 | "outputs": [], 96 | "source": [ 97 | "speeches_df = pd.DataFrame(speeches)\n", 98 | "speeches_df.head()" 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": null, 104 | "metadata": { 105 | "collapsed": false 106 | }, 107 | "outputs": [], 108 | "source": [ 109 | "speeches_df[\"word_count\"].describe()" 110 | ] 111 | }, 112 | { 113 | "cell_type": "markdown", 114 | "metadata": {}, 115 | "source": [ 116 | "###Notice that we have a lot of speeches that are relatively short. 
They probably aren't the best for clustering because of their brevity" 117 | ] 118 | }, 119 | { 120 | "cell_type": "markdown", 121 | "metadata": {}, 122 | "source": [ 123 | "###Time to bring the TF-IDF vectorizer" 124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": null, 129 | "metadata": { 130 | "collapsed": true 131 | }, 132 | "outputs": [], 133 | "source": [ 134 | "from sklearn.feature_extraction.text import TfidfVectorizer" 135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": null, 140 | "metadata": { 141 | "collapsed": true 142 | }, 143 | "outputs": [], 144 | "source": [ 145 | "vectorizer = TfidfVectorizer(max_features=10000, stop_words='english')\n", 146 | "longer_speeches = speeches_df[speeches_df[\"word_count\"] > 92] \n", 147 | "#filtering for word counts greater than 92 (our median length)\n", 148 | "X = vectorizer.fit_transform(longer_speeches['contents'])" 149 | ] 150 | }, 151 | { 152 | "cell_type": "code", 153 | "execution_count": null, 154 | "metadata": { 155 | "collapsed": true 156 | }, 157 | "outputs": [], 158 | "source": [ 159 | "from sklearn.cluster import KMeans" 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": null, 165 | "metadata": { 166 | "collapsed": false 167 | }, 168 | "outputs": [], 169 | "source": [ 170 | "number_of_clusters = 7\n", 171 | "km = KMeans(n_clusters=number_of_clusters)\n", 172 | "km.fit(X)" 173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": null, 178 | "metadata": { 179 | "collapsed": false 180 | }, 181 | "outputs": [], 182 | "source": [ 183 | "print(\"Top terms per cluster:\")\n", 184 | "order_centroids = km.cluster_centers_.argsort()[:, ::-1]\n", 185 | "terms = vectorizer.get_feature_names()\n", 186 | "for i in range(number_of_clusters):\n", 187 | " print(\"Cluster %d:\" % i),\n", 188 | " for ind in order_centroids[i, :15]:\n", 189 | " print(' %s' % terms[ind]),\n", 190 | " print ''" 191 | ] 192 | }, 193 | { 194 | "cell_type": "code", 195 | "execution_count": null, 196 | "metadata": { 197 | "collapsed": true 198 | }, 199 | "outputs": [], 200 | "source": [] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "execution_count": null, 205 | "metadata": { 206 | "collapsed": true 207 | }, 208 | "outputs": [], 209 | "source": [ 210 | "additional_stopwords = ['mr','congress','chairman','madam','amendment','legislation','speaker']" 211 | ] 212 | }, 213 | { 214 | "cell_type": "code", 215 | "execution_count": null, 216 | "metadata": { 217 | "collapsed": false 218 | }, 219 | "outputs": [], 220 | "source": [ 221 | "import nltk\n", 222 | "\n", 223 | "english_stopwords = nltk.corpus.stopwords.words('english')\n", 224 | "new_stopwords = additional_stopwords + english_stopwords" 225 | ] 226 | }, 227 | { 228 | "cell_type": "code", 229 | "execution_count": null, 230 | "metadata": { 231 | "collapsed": true 232 | }, 233 | "outputs": [], 234 | "source": [ 235 | "vectorizer = TfidfVectorizer(max_features=10000, stop_words=new_stopwords)" 236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": null, 241 | "metadata": { 242 | "collapsed": true 243 | }, 244 | "outputs": [], 245 | "source": [ 246 | "longer_speeches = speeches_df[speeches_df[\"word_count\"] > 92]\n", 247 | "X = vectorizer.fit_transform(longer_speeches['contents'])" 248 | ] 249 | }, 250 | { 251 | "cell_type": "code", 252 | "execution_count": null, 253 | "metadata": { 254 | "collapsed": false 255 | }, 256 | "outputs": [], 257 | "source": [ 258 | "number_of_clusters = 7\n", 259 | "km = 
KMeans(n_clusters=number_of_clusters)\n", 260 | "km.fit(X)" 261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "execution_count": null, 266 | "metadata": { 267 | "collapsed": false 268 | }, 269 | "outputs": [], 270 | "source": [ 271 | "print(\"Top terms per cluster:\")\n", 272 | "order_centroids = km.cluster_centers_.argsort()[:, ::-1]\n", 273 | "terms = vectorizer.get_feature_names()\n", 274 | "for i in range(number_of_clusters):\n", 275 | " print(\"Cluster %d:\" % i),\n", 276 | " for ind in order_centroids[i, :15]:\n", 277 | " print(' %s' % terms[ind]),\n", 278 | " print ''" 279 | ] 280 | }, 281 | { 282 | "cell_type": "code", 283 | "execution_count": null, 284 | "metadata": { 285 | "collapsed": false 286 | }, 287 | "outputs": [], 288 | "source": [ 289 | "longer_speeches[\"k-means label\"] = km.labels_" 290 | ] 291 | }, 292 | { 293 | "cell_type": "code", 294 | "execution_count": null, 295 | "metadata": { 296 | "collapsed": false 297 | }, 298 | "outputs": [], 299 | "source": [ 300 | "longer_speeches.head()" 301 | ] 302 | }, 303 | { 304 | "cell_type": "code", 305 | "execution_count": null, 306 | "metadata": { 307 | "collapsed": true 308 | }, 309 | "outputs": [], 310 | "source": [ 311 | "china_speeches = longer_speeches[longer_speeches[\"k-means label\"] == 1]" 312 | ] 313 | }, 314 | { 315 | "cell_type": "code", 316 | "execution_count": null, 317 | "metadata": { 318 | "collapsed": false 319 | }, 320 | "outputs": [], 321 | "source": [ 322 | "china_speeches.head()" 323 | ] 324 | }, 325 | { 326 | "cell_type": "code", 327 | "execution_count": null, 328 | "metadata": { 329 | "collapsed": false 330 | }, 331 | "outputs": [], 332 | "source": [ 333 | "vectorizer = TfidfVectorizer(max_features=10000, stop_words=new_stopwords)\n", 334 | "X = vectorizer.fit_transform(china_speeches['contents'])\n", 335 | "\n", 336 | "number_of_clusters = 5\n", 337 | "km = KMeans(n_clusters=number_of_clusters)\n", 338 | "km.fit(X)\n", 339 | "\n", 340 | "print(\"Top terms per cluster:\")\n", 341 | "order_centroids = km.cluster_centers_.argsort()[:, ::-1]\n", 342 | "terms = vectorizer.get_feature_names()\n", 343 | "for i in range(number_of_clusters):\n", 344 | " print(\"Cluster %d:\" % i),\n", 345 | " for ind in order_centroids[i, :10]:\n", 346 | " print(' %s' % terms[ind]),\n", 347 | " print ''" 348 | ] 349 | }, 350 | { 351 | "cell_type": "code", 352 | "execution_count": null, 353 | "metadata": { 354 | "collapsed": false 355 | }, 356 | "outputs": [], 357 | "source": [ 358 | "km.get_params()" 359 | ] 360 | }, 361 | { 362 | "cell_type": "code", 363 | "execution_count": null, 364 | "metadata": { 365 | "collapsed": false 366 | }, 367 | "outputs": [], 368 | "source": [ 369 | "km.score(X)" 370 | ] 371 | }, 372 | { 373 | "cell_type": "code", 374 | "execution_count": null, 375 | "metadata": { 376 | "collapsed": true 377 | }, 378 | "outputs": [], 379 | "source": [] 380 | } 381 | ], 382 | "metadata": { 383 | "kernelspec": { 384 | "display_name": "Python 2", 385 | "language": "python", 386 | "name": "python2" 387 | }, 388 | "language_info": { 389 | "codemirror_mode": { 390 | "name": "ipython", 391 | "version": 2 392 | }, 393 | "file_extension": ".py", 394 | "mimetype": "text/x-python", 395 | "name": "python", 396 | "nbconvert_exporter": "python", 397 | "pygments_lexer": "ipython2", 398 | "version": "2.7.10" 399 | } 400 | }, 401 | "nbformat": 4, 402 | "nbformat_minor": 0 403 | } 404 | -------------------------------------------------------------------------------- /class5_2/knn.ipynb: 
-------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "##Let's work with the wine dataset we worked with before, but slightly modified. This has more instances and different target features" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "####based on http://blog.yhathq.com/posts/classification-using-knn-and-python.html" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": null, 20 | "metadata": { 21 | "collapsed": true 22 | }, 23 | "outputs": [], 24 | "source": [ 25 | "import pandas as pd\n", 26 | "import matplotlib.pyplot as plt\n", 27 | "%matplotlib inline\n", 28 | "from sklearn.neighbors import KNeighborsClassifier\n", 29 | "from sklearn import cross_validation" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": null, 35 | "metadata": { 36 | "collapsed": true 37 | }, 38 | "outputs": [], 39 | "source": [ 40 | "import numpy as np" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": null, 46 | "metadata": { 47 | "collapsed": false 48 | }, 49 | "outputs": [], 50 | "source": [ 51 | "df = pd.read_csv(\"data/wine.csv\")" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": null, 57 | "metadata": { 58 | "collapsed": false 59 | }, 60 | "outputs": [], 61 | "source": [ 62 | "df.columns" 63 | ] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "metadata": {}, 68 | "source": [ 69 | "###Instead of wine cultvar, we have the wine color (red or white), as well as a binary (is red) and high quality indicator (0 or 1)" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": null, 75 | "metadata": { 76 | "collapsed": false 77 | }, 78 | "outputs": [], 79 | "source": [ 80 | "df.high_quality.unique()" 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "metadata": {}, 86 | "source": [ 87 | "###Let's set up our training and test sets" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": null, 93 | "metadata": { 94 | "collapsed": false 95 | }, 96 | "outputs": [], 97 | "source": [ 98 | "train, test = cross_validation.train_test_split(df[['density','sulphates','residual_sugar','high_quality']],train_size=0.75)" 99 | ] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "metadata": {}, 104 | "source": [ 105 | "###We'll use just three columns (dimensions) for classification" 106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "execution_count": null, 111 | "metadata": { 112 | "collapsed": false 113 | }, 114 | "outputs": [], 115 | "source": [ 116 | "train" 117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "execution_count": null, 122 | "metadata": { 123 | "collapsed": false 124 | }, 125 | "outputs": [], 126 | "source": [ 127 | "x_train = train[:,:3]\n", 128 | "y_train = train[:,3]" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": null, 134 | "metadata": { 135 | "collapsed": true 136 | }, 137 | "outputs": [], 138 | "source": [ 139 | "x_test = test[:,:3]\n", 140 | "y_test = test[:,3]" 141 | ] 142 | }, 143 | { 144 | "cell_type": "markdown", 145 | "metadata": {}, 146 | "source": [ 147 | "###Let's start with a k of 1 to predict high quality" 148 | ] 149 | }, 150 | { 151 | "cell_type": "code", 152 | "execution_count": null, 153 | "metadata": { 154 | "collapsed": false 155 | }, 156 | "outputs": [], 157 | "source": [ 158 | "clf = KNeighborsClassifier(n_neighbors=1)" 159 | ] 160 | }, 161 | { 162 | 
"cell_type": "code", 163 | "execution_count": null, 164 | "metadata": { 165 | "collapsed": false 166 | }, 167 | "outputs": [], 168 | "source": [ 169 | "clf.fit(x_train,y_train)" 170 | ] 171 | }, 172 | { 173 | "cell_type": "code", 174 | "execution_count": null, 175 | "metadata": { 176 | "collapsed": false 177 | }, 178 | "outputs": [], 179 | "source": [ 180 | "preds = clf.predict(x_test)" 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": null, 186 | "metadata": { 187 | "collapsed": true 188 | }, 189 | "outputs": [], 190 | "source": [ 191 | "accuracy = np.where(preds==y_test, 1, 0).sum() / float(len(test))" 192 | ] 193 | }, 194 | { 195 | "cell_type": "code", 196 | "execution_count": null, 197 | "metadata": { 198 | "collapsed": false 199 | }, 200 | "outputs": [], 201 | "source": [ 202 | "print \"Accuracy: %3f\" % (accuracy,)" 203 | ] 204 | }, 205 | { 206 | "cell_type": "markdown", 207 | "metadata": {}, 208 | "source": [ 209 | "###Not bad. Let's see what happens as the k changes" 210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "execution_count": null, 215 | "metadata": { 216 | "collapsed": false 217 | }, 218 | "outputs": [], 219 | "source": [ 220 | "results = []\n", 221 | "for k in range(1, 51, 2):\n", 222 | " clf = KNeighborsClassifier(n_neighbors=k)\n", 223 | " clf.fit(x_train,y_train)\n", 224 | " preds = clf.predict(x_test)\n", 225 | " accuracy = np.where(preds==y_test, 1, 0).sum() / float(len(test))\n", 226 | " print \"Neighbors: %d, Accuracy: %3f\" % (k, accuracy)\n", 227 | "\n", 228 | " results.append([k, accuracy])\n", 229 | "\n", 230 | "results = pd.DataFrame(results, columns=[\"k\", \"accuracy\"])\n", 231 | "\n", 232 | "plt.plot(results.k, results.accuracy)\n", 233 | "plt.title(\"Accuracy with Increasing K\")\n", 234 | "plt.show()" 235 | ] 236 | }, 237 | { 238 | "cell_type": "markdown", 239 | "metadata": {}, 240 | "source": [ 241 | "###Looks like about 80% is the best we can do. 
The way it plateaus, suggests there's not much more to be gained by increasing k" 242 | ] 243 | }, 244 | { 245 | "cell_type": "markdown", 246 | "metadata": {}, 247 | "source": [ 248 | "###We can also tune this a bit by not weighting each instance the same, but decreasing the weight as the distance increases" 249 | ] 250 | }, 251 | { 252 | "cell_type": "code", 253 | "execution_count": null, 254 | "metadata": { 255 | "collapsed": false 256 | }, 257 | "outputs": [], 258 | "source": [ 259 | "results = []\n", 260 | "for k in range(1, 51, 2):\n", 261 | " clf = KNeighborsClassifier(n_neighbors=k,weights='distance')\n", 262 | " clf.fit(x_train,y_train)\n", 263 | " preds = clf.predict(x_test)\n", 264 | " accuracy = np.where(preds==y_test, 1, 0).sum() / float(len(test))\n", 265 | " print \"Neighbors: %d, Accuracy: %3f\" % (k, accuracy)\n", 266 | "\n", 267 | " results.append([k, accuracy])\n", 268 | "\n", 269 | "results = pd.DataFrame(results, columns=[\"k\", \"accuracy\"])\n", 270 | "\n", 271 | "plt.plot(results.k, results.accuracy)\n", 272 | "plt.title(\"Accuracy with Increasing K\")\n", 273 | "plt.show()" 274 | ] 275 | }, 276 | { 277 | "cell_type": "markdown", 278 | "metadata": {}, 279 | "source": [ 280 | "###This actually increases the accuracy of our prediction" 281 | ] 282 | } 283 | ], 284 | "metadata": { 285 | "kernelspec": { 286 | "display_name": "Python 2", 287 | "language": "python", 288 | "name": "python2" 289 | }, 290 | "language_info": { 291 | "codemirror_mode": { 292 | "name": "ipython", 293 | "version": 2 294 | }, 295 | "file_extension": ".py", 296 | "mimetype": "text/x-python", 297 | "name": "python", 298 | "nbconvert_exporter": "python", 299 | "pygments_lexer": "ipython2", 300 | "version": "2.7.10" 301 | } 302 | }, 303 | "nbformat": 4, 304 | "nbformat_minor": 0 305 | } 306 | -------------------------------------------------------------------------------- /class6_1/README.md: -------------------------------------------------------------------------------- 1 | # Algorithms: Week 6, Class 1 (Tuesday, Aug. 18) 2 | 3 | After a quick review of the homework, we'll explore in-depth several methods for clustering crime data before we return to and expand on the document clustering problem from last Thursday. 4 | 5 | ## Hour 1: Exercise review 6 | 7 | We'll talk in detail through the bill classification problem and the ambiguities inherent in clustering data, via the crime example we talked about briefly last Tuesday. 8 | 9 | ## Hour 2: Clustering crime 10 | 11 | We'll look at the two methods you learned last week -- k-means clustering and k-nearest neighbors -- along with another one, known as DBSCAN, to see how different methods can produce different results when we apply them to crime data. 12 | 13 | ## Hour 3: Back to document clustering 14 | 15 | Finally we'll return to the idea of document clustering that you started exploring last week, going more into depth on the ideas of document similarity and term frequency-inverse document frequency and showing how clustering can more quickly help us explore a new document set. 
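For reference ahead of class, here is a minimal sketch of what a DBSCAN call looks like in scikit-learn. The points are made-up 2D coordinates (not the crime data in `data/columbia_crime.csv`), and the `eps`/`min_samples` values are purely illustrative; the point of the sketch is the contrast with k-means -- you don't choose the number of clusters up front, and points that don't sit in any dense region come back labeled -1 (noise).

```python
# A minimal DBSCAN sketch on made-up points (illustrative values only).
import numpy as np
from sklearn.cluster import DBSCAN

points = np.array([
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],   # a tight group
    [5.0, 5.0], [5.1, 5.2], [4.9, 5.1],   # another tight group
    [10.0, 0.0],                          # an isolated point
])

db = DBSCAN(eps=0.5, min_samples=2).fit(points)

# Two clusters (0 and 1) plus one noise point labeled -1 --
# no "number of clusters" parameter anywhere.
print(db.labels_)
```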
-------------------------------------------------------------------------------- /class6_2/AssociationRuleMining.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "##A simple example of Association Rule Mining based on http://orange.biolab.si/docs/latest/reference/rst/Orange.associate.html" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import Orange #pip install orange\n", 19 | "data = Orange.data.Table(\"market-basket.basket\")" 20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": null, 25 | "metadata": { 26 | "collapsed": false 27 | }, 28 | "outputs": [], 29 | "source": [ 30 | "for d in data:\n", 31 | " print d" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": null, 37 | "metadata": { 38 | "collapsed": false 39 | }, 40 | "outputs": [], 41 | "source": [ 42 | "rules = Orange.associate.AssociationRulesSparseInducer(data, support=0.3)\n", 43 | "print \"%4s %4s %s\" % (\"Supp\", \"Conf\", \"Rule\")\n", 44 | "for r in rules[:5]:\n", 45 | " print \"%4.1f %4.1f %s\" % (r.support, r.confidence, r)" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "###Spanish Inquisition example" 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": null, 58 | "metadata": { 59 | "collapsed": true 60 | }, 61 | "outputs": [], 62 | "source": [ 63 | "data = Orange.data.Table(\"inquisition.basket\")" 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": null, 69 | "metadata": { 70 | "collapsed": false 71 | }, 72 | "outputs": [], 73 | "source": [ 74 | "for d in data:\n", 75 | " print d" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": null, 81 | "metadata": { 82 | "collapsed": false 83 | }, 84 | "outputs": [], 85 | "source": [ 86 | "rules = Orange.associate.AssociationRulesSparseInducer(data, support = 0.5)\n", 87 | "\n", 88 | "print \"%5s %5s\" % (\"supp\", \"conf\")\n", 89 | "for r in rules:\n", 90 | " print \"%5.3f %5.3f %s\" % (r.support, r.confidence, r)" 91 | ] 92 | } 93 | ], 94 | "metadata": { 95 | "kernelspec": { 96 | "display_name": "Python 2", 97 | "language": "python", 98 | "name": "python2" 99 | }, 100 | "language_info": { 101 | "codemirror_mode": { 102 | "name": "ipython", 103 | "version": 2 104 | }, 105 | "file_extension": ".py", 106 | "mimetype": "text/x-python", 107 | "name": "python", 108 | "nbconvert_exporter": "python", 109 | "pygments_lexer": "ipython2", 110 | "version": "2.7.10" 111 | } 112 | }, 113 | "nbformat": 4, 114 | "nbformat_minor": 0 115 | } 116 | -------------------------------------------------------------------------------- /class6_2/RandomForest.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "##Based on example from http://blog.yhathq.com/posts/random-forests-in-python.html, with modifications from https://gist.github.com/glamp/5717321" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "from sklearn.datasets import load_iris\n", 19 | "from sklearn.ensemble import RandomForestClassifier\n", 20 | "import pandas as pd\n", 21 | "import numpy as 
np" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": null, 27 | "metadata": { 28 | "collapsed": true 29 | }, 30 | "outputs": [], 31 | "source": [ 32 | "iris = load_iris()\n", 33 | "df = pd.DataFrame(iris.data, columns=iris.feature_names)" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": null, 39 | "metadata": { 40 | "collapsed": true 41 | }, 42 | "outputs": [], 43 | "source": [ 44 | "df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": null, 50 | "metadata": { 51 | "collapsed": false 52 | }, 53 | "outputs": [], 54 | "source": [ 55 | "df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": null, 61 | "metadata": { 62 | "collapsed": false 63 | }, 64 | "outputs": [], 65 | "source": [ 66 | "df.head()" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": null, 72 | "metadata": { 73 | "collapsed": true 74 | }, 75 | "outputs": [], 76 | "source": [ 77 | "train, test = df[df['is_train']==True], df[df['is_train']==False]" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": null, 83 | "metadata": { 84 | "collapsed": false 85 | }, 86 | "outputs": [], 87 | "source": [ 88 | "features = df.columns[:4]\n", 89 | "clf = RandomForestClassifier(n_jobs=2)\n", 90 | "y, _ = pd.factorize(train['species'])\n", 91 | "clf.fit(train[features], y)" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": null, 97 | "metadata": { 98 | "collapsed": false 99 | }, 100 | "outputs": [], 101 | "source": [ 102 | "preds = iris.target_names[clf.predict(test[features])]\n", 103 | "pd.crosstab(test['species'], preds, rownames=['actual'], colnames=['preds'])" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": null, 109 | "metadata": { 110 | "collapsed": true 111 | }, 112 | "outputs": [], 113 | "source": [] 114 | } 115 | ], 116 | "metadata": { 117 | "kernelspec": { 118 | "display_name": "Python 2", 119 | "language": "python", 120 | "name": "python2" 121 | }, 122 | "language_info": { 123 | "codemirror_mode": { 124 | "name": "ipython", 125 | "version": 2 126 | }, 127 | "file_extension": ".py", 128 | "mimetype": "text/x-python", 129 | "name": "python", 130 | "nbconvert_exporter": "python", 131 | "pygments_lexer": "ipython2", 132 | "version": "2.7.10" 133 | } 134 | }, 135 | "nbformat": 4, 136 | "nbformat_minor": 0 137 | } 138 | -------------------------------------------------------------------------------- /class7_1/README.md: -------------------------------------------------------------------------------- 1 | # Algorithms: Week 7, Class 1 (Tuesday, Aug. 25) 2 | 3 | After a quick look back at last Thursday's material, we'll spend some time looking over [examples](https://github.com/datapolitan/lede_algorithms/blob/master/class1_1/newsroom_examples.md) of algorithms and journalism from earlier in the course and talk about how to build on the skills you've learned here going forward. I'm counting on wrapping up early so we can talk about final projects. 4 | 5 | ## Resources for later 6 | 7 | - IRE/NICAR: I've said this a hundred times, but [sign up](http://www.ire.org/membership/). Use the student rate if you'd like. And if someone there balks, tell me and I'll talk to them. 
8 | 9 | - MORE TK -------------------------------------------------------------------------------- /class7_1/bill_classifier.py: -------------------------------------------------------------------------------- 1 | from sklearn import preprocessing 2 | from sklearn import cross_validation 3 | from sklearn.tree import DecisionTreeClassifier 4 | from sklearn.ensemble import RandomForestClassifier 5 | from sklearn.naive_bayes import MultinomialNB 6 | from sklearn.feature_extraction.text import CountVectorizer 7 | 8 | if __name__ == '__main__': 9 | 10 | ########## STEP 1: DATA IMPORT AND PREPROCESSING ########## 11 | 12 | # Here we're taking in the training data and splitting it into two lists: One with the text of 13 | # each bill title, and the second with each bill title's corresponding category. Order is important. 14 | # The first bill in list 1 should also be the first category in list 2. 15 | training = [line.strip().split('|') for line in open('data/bills_training.txt', 'r').readlines()] 16 | text = [t[0] for t in training if len(t) > 1] 17 | labels = [t[1] for t in training if len(t) > 1] 18 | 19 | # A little bit of cleanup for scikit-learn's benefit. Scikit-learn models wants our categories to 20 | # be numbers, not strings. The LabelEncoder performs this transformation. 21 | encoder = preprocessing.LabelEncoder() 22 | correct_labels = encoder.fit_transform(labels) 23 | 24 | ########## STEP 2: FEATURE EXTRACTION ########## 25 | print 'Extracting features ...' 26 | 27 | vectorizer = CountVectorizer(stop_words='english') 28 | data = vectorizer.fit_transform(text) 29 | 30 | ########## STEP 3: MODEL BUILDING ########## 31 | print 'Training ...' 32 | 33 | #model = MultinomialNB() 34 | model = RandomForestClassifier() 35 | fit_model = model.fit(data, correct_labels) 36 | 37 | # ########## STEP 4: EVALUATION ########## 38 | print 'Evaluating ...' 39 | 40 | # Evaluate our model with 10-fold cross-validation 41 | scores = cross_validation.cross_val_score(model, data, correct_labels, cv=10) 42 | print "Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2) 43 | 44 | # ########## STEP 5: APPLYING THE MODEL ########## 45 | # print 'Classifying ...' 46 | 47 | # docs_new = ["Public postsecondary education: executive officer compensation.", 48 | # "An act to add Section 236.3 to the Education code, related to the pricing of college textbooks.", 49 | # "Political Reform Act of 1974: campaign disclosures.", 50 | # "An act to add Section 236.3 to the Penal Code, relating to human trafficking." 
51 | # ] 52 | 53 | # test_data = vectorizer.transform(docs_new) 54 | 55 | # for i in xrange(len(docs_new)): 56 | # print '%s -> %s' % (docs_new[i], encoder.classes_[model.predict(test_data.toarray()[i])]) -------------------------------------------------------------------------------- /data_journalism_on_github.md: -------------------------------------------------------------------------------- 1 | #Data Journalists on Github 2 | 3 | ##Organizations 4 | + New York Times The Upshot: https://github.com/TheUpshot 5 | + New York Times Newsroom Developers: https://github.com/newsdev 6 | + FiveThirtyEight.com: https://github.com/fivethirtyeight 7 | + Al Jazeera America (at least until April): https://github.com/ajam 8 | + Chicago Tribune News Apps: https://github.com/newsapps 9 | + Northwestern University Knight Lab: https://github.com/NUKnightLab 10 | + ProPublica: https://github.com/propublica 11 | + Sunlight Labs: https://github.com/sunlightlabs 12 | + NPR Visuals Team: https://github.com/nprapps 13 | + NPR Tech: https://github.com/npr 14 | + The Guardian: https://github.com/guardian 15 | + Vox Media: https://github.com/voxmedia 16 | + Time Magazine: https://github.com/TimeMagazine 17 | + Los Angeles Times Data Desk: https://github.com/datadesk 18 | + BuzzFeed News: https://github.com/BuzzFeedNews 19 | + [Huffington Post Data](http://data.huffingtonpost.com/): https://github.com/huffpostdata 20 | 21 | 22 | ##Tools 23 | + Wireservice: https://github.com/wireservice 24 | + [Open Civic Data](http://opencivicdata.org/): https://github.com/opencivicdata 25 | + [TabulaPDF](http://tabula.technology/): https://github.com/tabulapdf 26 | + [Public Media Platform](http://publicmediaplatform.org/): https://github.com/publicmediaplatform 27 | + [CensusReporter](http://censusreporter.org/): https://github.com/censusreporter 28 | + Mozilla Foundation: https://github.com/mozilla 29 | 30 | ##People 31 | + Michael Keller: https://github.com/mhkeller 32 | + Joanna S. Kao: https://github.com/joannaskao 33 | + Kevin Quealy: https://github.com/kpq 34 | + Joe Germuska: https://github.com/JoeGermuska 35 | 36 | ##Github's infrequently updated [list of open journalism projects](https://github.com/showcases/open-journalism) -------------------------------------------------------------------------------- /readme.md: -------------------------------------------------------------------------------- 1 | # Algorithms, Summer 2015 2 | ## LEDE Program, Columbia University, Graduate School of Journalism 3 | 4 | 5 | ### Instructors: 6 | 7 | Richard Dunks: richard [at] datapolitan [dot] com 8 | 9 | Chase Davis: chase.davis [at] nytimes [dot] com 10 | 11 | 12 | #### Room Number: Pulitzer Hall 601B 13 | 14 | #### Course Dates: 14 July - 27 August 2015 15 | 16 | ### Course Overview 17 | 18 | This course presents an overview of algorithms as they relate to journalistic tradecraft, with particular emphasis on algorithms that relate to the discovery, cleaning, and analysis of data. This course intends to provide literacy in the common types of data algorithms, while providing practice in the design, development, and testing of algorithms to support news reporting and analysis, including the basic concepts of algorithm reverse engineering in support of investigative news reporting. The emphasis in this class will be on practical applications and critical awareness of the impact algorithms have in modern life. 
19 | 20 | 21 | ### Learning Objectives 22 | 23 | + You will understand the basic structure and operation of algorithms 24 | + You will understand the primary types of data science algorithms, including techniques of supervised and unsupervised machine learning 25 | + You will be practiced in implementing basic algorithms in Python 26 | + You will be able to meaningfully explain and critique the use and operation of algorithms as tools of public policy and business 27 | + You will understand how algorithms are applied in the newsroom 28 | 29 | ### Course Requirements 30 | All students will be expected to have a laptop during both lectures and lab time. Time will be set aside to help install, configure, and run the programs necessary for all assignments, projects, and exercises. Where possible, all programs will be free and open-source. All assigned work using services hosted online can be run using free accounts. 31 | 32 | ### Course Readings 33 | The required readings for this course consist of book chapters, newspaper articles, and short blog posts. The intention is to help give you a foundation in the critical skills ahead of class lectures. All required readings are available online or will be made available to you electronically. Recommended readings are suggestions if you wish to study further the topics covered in class. Suggested readings will also be provided as appropriate for those interested in a more in-depth discussion of the material covered in class. 34 | 35 | ### Assignments 36 | This course consists of programming and critical response assignments intended to reinforce learning and provide you with pratical applications of the material covered in class. Completion of these assignments is critical to achieving the outcomes of this course. Assignments are intended to be completed during lab time or for homework. Generally, assignments will be due the following week, unless otherwise stated. For example, exercises assigned on Tuesday will be due before class on the following Tuesday. 37 | + Programming assignments will be submitted via Slack to the TAs in Python scripts (not ipynb) format. The exercises should be standalone for each assignment, not a combination of all assignments. This allows them to be tested and scored separately. 38 | + Response questions should be [submitted using this address](http://ledealgorithms.tumblr.com/submit) and will be posted to the [class Tumblr](http://ledealgorithms.tumblr.com/) after grading. They should be clear, concise, and use the elements of good grammar. This is an opportunity to develop your ability to explain algorithms to your audience. 39 | 40 | ### Class Format 41 | Class runs from 10am to 1pm Tuesday and Thursday. Lab time will be from 2pm to 5pm Tuesday and Thursday. The class will be taught in roughly 50 minute blocks, with approximately 10 minute breaks between each 50 minute block. Class will be a mix of lecture and practical exercise work, emphasizing the application of skills covered in the lecture portion of the class. Lab time is intended for the completion of exercises, but may also include guided learning sessions as necessary to ensure comprehension of the course material. 42 | 43 | ### Course Policies 44 | + Attendance and Tardiness: We expect you to attend every class, arriving on time and staying for the entire duration of class. Absences will only be excused for circumstances coordinated in advance and you are responsible for making up any missed work. 
45 | + Participation: We expect you to be fully engaged while you’re in class. This means asking questions when necessary, engaging in class discussions, participating in class exercises, and completing all assigned work. Learning will occur in this class only when you actively use the tools, techniques, and skills described in the lectures. We will provide you ample time and resources to accomplish the goals of this course and expect you to take full advantage of what’s offered. 46 | + Late Assignments: All assignments are to be submitted before the start of class. Assignments posted by the end of the day following class will be marked down 10% and assignments posted at the end of the day following will be marked down 20%. No assignments will be accepted for a grade after three days following class. 47 | + Office Hours: We won’t be holding regular office hours, but are available via email to answer whatever questions you may have about the material. Please feel free to also reach out to the Teaching Assistants as necessary for support and guidance with the exercises, particularly during lab time. 48 | 49 | ---- 50 | ### Resources 51 | #### Technical 52 | 53 | + [Stack Overflow](http://stackoverflow.com) - Q&A community of technology pros 54 | 55 | #### (Some) Open Data Sources 56 | 57 | + [New York City Open Data Portal](https://nycopendata.socrata.com/) 58 | + [New York State Open Data Portal](https://data.ny.gov/) 59 | + [Hilary Mason’s Research Quality Data Sets](https://bitly.com/bundles/hmason/1) 60 | 61 | #### Visualizations 62 | 63 | + [Flowing Data](http://flowingdata.com/) 64 | + [Tableau Visualization Gallery](http://www.tableausoftware.com/public/gallery) 65 | + [Visualizing.org](http://www.visualizing.org/) 66 | + [Data is Beautiful](http://www.reddit.com/r/dataisbeautiful/) 67 | 68 | #### Data Journalism and Critiques 69 | 70 | + [FiveThirtyEight](http://fivethirtyeight.com/) 71 | + [Upshot](http://www.nytimes.com/upshot/) 72 | + [IQuantNY](http://iquantny.tumblr.com/) 73 | + [SimplyStatistics](http://simplystatistics.org/) 74 | + [Data Journalism Handbook](http://datajournalismhandbook.org/1.0/en/index.html) 75 | 76 | #### Suggested Reading 77 | Conway, Drew and John Myles White. Machine Learning for Hackers. O'Reilly Media, Inc., 2012. 78 | 79 | Knuth, Donald E. The Art of Computer Programming. Addison-Wesley Professional, 2011. 80 | 81 | MacCormick, John. Nine Algorithms That Changed the Future: The Ingenious Ideas That Drive Today's Computers. Princeton University Press, 2011. 82 | 83 | McCallum, Q Ethan. Bad Data Handbook. O'Reilly Media, Inc., 2012. 84 | 85 | McKinney, Wes. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. O'Reilly Media, Inc., 2012. 86 | 87 | O'Neil, Cathy and Rachel Schutt. Doing Data Science: Straight Talk from the Front Line. O'Reilly Media, Inc., 2013. 88 | 89 | Russell, Matthew A. Mining the Social Web. O'Reilly Media, Inc., 2013. 90 | 91 | Sedgewick, Robert and Kevin Wayne. Algorithms. Addison-Wesley Professional, 2011. 92 | 93 | Steiner, Christopher. Automate This: How Algorithms Came to Rule Our World. Penguin Group, 2012. 
94 | 95 | ---- 96 | ### Course Outline 97 | (Subject to change) 98 | 99 | #### Week 1: Introduction to Algorithms/Statistics review 100 | ##### Class 1 Readings 101 | + Miller, Claire Cain, [“When Algorithms Discriminate”](http://nyti.ms/1KS5rdu) New York Times, 9 July 2015 102 | + O’Neil, Cathy, [“Algorithms And Accountability Of Those Who Deploy Them”](http://mathbabe.org/2015/05/26/algorithms-and-accountability-of-those-who-deploy-them/) 103 | + Elkus, Adam, [“You Can’t Handle the (Algorithmic) Truth”](http://www.slate.com/articles/technology/future_tense/2015/05/algorithms_aren_t_responsible_for_the_cruelties_of_bureaucracy.single.html) 104 | + Diakopoulos, Nicholas, ["Algorithmic Accontability Reporting: On the Investigation of Black Boxes"](http://towcenter.org/wp-content/uploads/2014/02/78524_Tow-Center-Report-WEB-1.pdf) 105 | 106 | ##### Class 2 Readings (optional) 107 | + McKinney, "Getting Started With Pandas" Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. 108 | + McKinney, "Plotting and Visualization" Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. 109 | 110 | #### Week 2: Statistics in Reporting/Opening the Blackbox: Supervised Learning - Linear Regression 111 | ##### Class 1 Readings 112 | + (TBD) 113 | 114 | ##### Class 2 Readings 115 | + O'Neill, "Statistical Inference, Exploratory Data Analysis, and the Data Science Process" Doing Data Science: Straight Talk from the Front Line pp. 17-37 116 | 117 | #### Week 3: Opening the Blackbox: Supervised Learning - Feature Engineering/Decision Trees 118 | 119 | ##### Class 2 Readings 120 | + Building Machine Learning Systems with Python, pp. 33-43 121 | + Learning scikit-learn: Machine Learning in Python, pp. 41-52 122 | + Brownlee, Jason, ("Discover Feature Engineering, How to Engineer Features and How to Get Good at It")[http://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/] 123 | + ("A Visual Introduction to Machine Learning")[http://www.r2d3.us/visual-intro-to-machine-learning-part-1/] 124 | 125 | #### Week 4: Opening the Blackbox: Supervised Learning - Feature Engineering/Logistic Regression 126 | 127 | #### Week 5: Opening the Blackbox: Unsupervised Learning - Clustering, k-NN 128 | 129 | #### Week 6: Natural Language Processing, Reverse Engineering, and Ethics Revisited 130 | 131 | #### Week 7: Advanced Topics (we'll be polling the class for topics) 132 | 133 | --------------------------------------------------------------------------------