├── .gitignore
├── Final_Project
│   ├── amtrak_speeds
│   │   └── readme.md
│   ├── readme.md
│   └── speeding_cops
│       ├── florida_toll_plaza.kml
│       ├── readme.md
│       └── transponder_data.csv
├── LICENSE
├── class1_1
│   ├── class1-1.ipynb
│   ├── exercise
│   │   ├── Exercise1_MeanFunction.ipynb
│   │   ├── Exercise2_MayoralExcuseGenerator.ipynb
│   │   ├── Exercise3-Answers.ipynb
│   │   ├── Exercise3.ipynb
│   │   ├── Exercise4-Answers.ipynb
│   │   ├── Exercise4.ipynb
│   │   └── excuse.csv
│   ├── lab1-1.ipynb
│   └── newsroom_examples.md
├── class1_2
│   ├── .ipynb_checkpoints
│   │   └── EDA_Python-checkpoint.ipynb
│   ├── 2013_NYC_CD_MedianIncome_Recycle.xlsx
│   ├── Data_Collection_Sheet.csv
│   ├── EDA_Python.ipynb
│   ├── class1_2.ipynb
│   ├── height_weight.xlsx
│   └── heights_weights_genders.csv
├── class2_1
│   ├── .ipynb_checkpoints
│   │   └── EDA_Review-checkpoint.ipynb
│   ├── EDA_Review.ipynb
│   ├── README.md
│   └── data
│       └── ontime_reports_may_2015_ny.csv
├── class2_2
│   ├── DoNow_2-2.ipynb
│   ├── DoNow_2-2_answers.ipynb
│   ├── Multiple_Variable_Regression.ipynb
│   ├── Simple_Linear_Regression.ipynb
│   └── data
│       ├── 2013_NYC_CD_MedianIncome_Recycle.xlsx
│       ├── height_weight.xlsx
│       ├── heights_weights_genders.csv
│       └── ontime_reports_may_2015_ny.csv
├── class3_1
│   ├── .ipynb_checkpoints
│   │   ├── classification-checkpoint.ipynb
│   │   └── regression_review-checkpoint.ipynb
│   ├── README.md
│   ├── classification.ipynb
│   ├── data
│   │   ├── apib12tx.csv
│   │   └── category-training.csv
│   └── regression_review.ipynb
├── class3_2
│   ├── 3-2_DoNow.ipynb
│   ├── 3-2_DoNow_Answers.ipynb
│   ├── 3-2_DoNow_Answers_statsmodels.ipynb
│   ├── 3-2_Exercises-Answers.ipynb
│   ├── 3-2_Exercises.ipynb
│   ├── Decision_Tree.ipynb
│   ├── data
│   │   ├── hanford.csv
│   │   ├── hanford.txt
│   │   ├── iris.csv
│   │   ├── ontime_reports_may_2015_ny.csv
│   │   ├── seeds_dataset.txt
│   │   └── titanic.csv
│   └── images
│       ├── hanford_variables.png
│       └── iris_scatter.png
├── class4_1
│   ├── README.md
│   ├── data
│   │   ├── bills_training.txt
│   │   ├── contribs_training.csv
│   │   ├── contribs_training_small.csv
│   │   └── contribs_unclassified.csv
│   ├── doc_classifier.py
│   └── donors.py
├── class4_2
│   ├── 4-2_DoNow.ipynb
│   ├── Feature_Engineering.ipynb
│   ├── Logistic_regression.ipynb
│   ├── Naive_Bayes.ipynb
│   ├── data
│   │   ├── ontime_reports_may_2015_ny.csv
│   │   ├── titanic.csv
│   │   └── wine.csv
│   └── images
│       └── titanic.png
├── class5_1
│   ├── .ipynb_checkpoints
│   │   └── vectorization-checkpoint.ipynb
│   ├── README.md
│   ├── bill_classifier.py
│   ├── crime_clusterer.py
│   ├── data
│   │   ├── bills_training.txt
│   │   ├── columbia_crime.csv
│   │   └── releases_training.txt
│   ├── release_classifier.py
│   └── vectorization.ipynb
├── class5_2
│   ├── 5_2-Assignment.ipynb
│   ├── 5_2-DoNow.ipynb
│   ├── data
│   │   └── wine.csv
│   ├── kmeans.ipynb
│   └── knn.ipynb
├── class6_1
│   ├── .ipynb_checkpoints
│   │   ├── cluster_crime-checkpoint.ipynb
│   │   └── cluster_emails-checkpoint.ipynb
│   ├── README.md
│   ├── cluster_crime.ipynb
│   ├── cluster_emails.ipynb
│   └── data
│       ├── cluster_examples
│       │   ├── kmeans_10.csv
│       │   └── kmeans_3.csv
│       ├── columbia_crime.csv
│       └── jeb_subjects.csv
├── class6_2
│   ├── AssociationRuleMining.ipynb
│   └── RandomForest.ipynb
├── class7_1
│   ├── README.md
│   ├── bill_classifier.py
│   └── data
│       └── bills_training.txt
├── data_journalism_on_github.md
└── readme.md
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 |
5 | # C extensions
6 | *.so
7 |
8 | # Distribution / packaging
9 | .Python
10 | env/
11 | build/
12 | develop-eggs/
13 | dist/
14 | downloads/
15 | eggs/
16 | .eggs/
17 | lib/
18 | lib64/
19 | parts/
20 | sdist/
21 | var/
22 | *.egg-info/
23 | .installed.cfg
24 | *.egg
25 |
26 | # PyInstaller
27 | # Usually these files are written by a python script from a template
28 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
29 | *.manifest
30 | *.spec
31 |
32 | # Installer logs
33 | pip-log.txt
34 | pip-delete-this-directory.txt
35 |
36 | # Unit test / coverage reports
37 | htmlcov/
38 | .tox/
39 | .coverage
40 | .coverage.*
41 | .cache
42 | nosetests.xml
43 | coverage.xml
44 | *,cover
45 |
46 | # Translations
47 | *.mo
48 | *.pot
49 |
50 | # Django stuff:
51 | *.log
52 |
53 | # Sphinx documentation
54 | docs/_build/
55 |
56 | # PyBuilder
57 | target/
58 |
--------------------------------------------------------------------------------
/Final_Project/amtrak_speeds/readme.md:
--------------------------------------------------------------------------------
1 | ## Source: [Derailed Amtrak train sped into deadly crash curve | Al Jazeera America](http://america.aljazeera.com/multimedia/2015/5/map-derailed-amtrak-sped-through-northeast-corridor.html)
2 |
3 | ## Data: https://github.com/ajam/amtrak-188
4 | + The live data is accessible here: https://www.googleapis.com/mapsengine/v1/tables/01382379791355219452-08584582962951999356/features?version=published&key=AIzaSyCVFeFQrtk-ywrUE0pEcvlwgCqS6TJcOW4&maxResults=250
5 |
6 | ## Notes:
7 | + Michael Keller doesn't provide the code for scraping the data, but you can scrape the live data yourself from the URL above (I'd recommend storing it in a database)
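8 |
9 | If you do scrape it yourself, here is a minimal sketch of that step (Python, using the `requests` package). The response structure -- a GeoJSON-style `features` list with `properties` and `geometry` keys -- and the output file name are assumptions, so inspect the raw JSON first and adjust:
10 |
11 | ```python
12 | import csv
13 | import json
14 |
15 | import requests
16 |
17 | # URL from the readme above
18 | URL = ("https://www.googleapis.com/mapsengine/v1/tables/"
19 |        "01382379791355219452-08584582962951999356/features"
20 |        "?version=published&key=AIzaSyCVFeFQrtk-ywrUE0pEcvlwgCqS6TJcOW4&maxResults=250")
21 |
22 | response = requests.get(URL)
23 | response.raise_for_status()
24 | features = response.json().get("features", [])  # assumed GeoJSON-style payload
25 |
26 | # Flatten each feature into a row. A CSV keeps the sketch simple; appending to a
27 | # database table on a schedule is the better long-term setup.
28 | with open("amtrak_positions.csv", "w") as outfile:
29 |     writer = csv.writer(outfile)
30 |     writer.writerow(["properties", "geometry"])
31 |     for feature in features:
32 |         writer.writerow([json.dumps(feature.get("properties")),
33 |                          json.dumps(feature.get("geometry"))])
34 | ```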
--------------------------------------------------------------------------------
/Final_Project/readme.md:
--------------------------------------------------------------------------------
1 | ## Final Project
2 |
3 |
4 | The final project is a chance for you to demonstrate the skills you've learned in this and other Lede classes to explore a topic of personal or professional interest using data. You should demonstrate not only strong technical ability, but also the ability to synthesize the data in interesting and meaningful ways.
5 |
6 | Requirements:
7 | + You must write a blog post, [submitted through the class Tumblr](http://ledealgorithms.tumblr.com/submit), outlining your project, your goals, your methodology, and your findings. Specifically address the data you used, its source, the steps you took to clean the data, and the insights you gained at each step, either with respect to your project or working with data more generally.
8 | + You must present your work in class, either August 27th or August 31st. Prepare a 15-minute presentation on the points covered in your blog post and be prepared to answer questions. All work is due September 1st.
9 | + You must provide the source code for your project. Code should be well written and commented wherever necessary to explain how it works.
10 |
11 | You are free to work in groups, and we encourage you to find projects that are of limited enough scope to fit into the time allotted. We often work under tight deadlines, and being able to constrain a project's scope is an important skill. A smaller, more constrained objective will let you better understand the essential tasks and challenges, while attempting to implement everything you envision at once is a recipe for disaster (like Healthcare.gov). Take this opportunity to develop a more iterative approach and build your project in phases rather than tackling everything all at once. For more information on this approach, look into [Agile Development](http://agilemethodology.org/).
12 |
13 | If you work in groups, please indicate in your blog post the work of each person on the project so they may receive the proper credit.
14 |
15 | If you have any questions or need assistance shaping projects, please don't hesitate to reach out.
16 |
--------------------------------------------------------------------------------
/Final_Project/speeding_cops/readme.md:
--------------------------------------------------------------------------------
1 | ## Source: [The Florida Sun-Sentinel Speeding Cops](http://www.sun-sentinel.com/news/speeding-cops/)
2 |
3 | ## Background
4 | [Documenting the process](http://towcenter.gitbooks.io/sensors-and-journalism/content/the_second_section/sun_sentinel_%E2%80%93.html)
5 |
6 | ## Data
7 | + transponder_data.csv - an extract from their online database with the entrance and exit locations and the entrance and exit times
8 | + florida_toll_plaza.kml - locations for each toll booth (as well as the toll plazas) in KML format, extracted from [here](https://www.google.com/maps/d/viewer?mid=zkhNiVf3Ss6c.k8ys3XRv92Ms&hl=en_US). A KML file is just XML with spatial data. We recommend using OpenRefine to process it; Python is also an option (but OpenRefine will be easier and faster). See the sketch below for a Python starting point.
9 |
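10 | If you go the Python route, here is a minimal sketch that pulls placemark names and coordinates out of the KML, assuming the export uses standard `Placemark`/`Point` elements (inspect the file first -- this particular export may nest things differently):
11 |
12 | ```python
13 | import csv
14 | import xml.etree.ElementTree as ET
15 |
16 | NS = "{http://www.opengis.net/kml/2.2}"  # standard KML namespace
17 |
18 | tree = ET.parse("florida_toll_plaza.kml")
19 | rows = []
20 | for placemark in tree.getroot().iter(NS + "Placemark"):
21 |     name = placemark.find(NS + "name")
22 |     coords = placemark.find(".//" + NS + "coordinates")
23 |     if name is not None and coords is not None:
24 |         lon, lat = coords.text.strip().split(",")[:2]  # KML stores "lon,lat[,alt]"
25 |         rows.append([name.text, lat, lon])
26 |
27 | with open("toll_plazas.csv", "w") as outfile:
28 |     writer = csv.writer(outfile)
29 |     writer.writerow(["name", "latitude", "longitude"])
30 |     writer.writerows(rows)
31 | ```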
--------------------------------------------------------------------------------
/class1_1/class1-1.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "name": "",
4 | "signature": "sha256:83d3e5703fd0ce5b2e62c63376699efb77c5cfa83ce21a7433dc7a0f14c00d56"
5 | },
6 | "nbformat": 3,
7 | "nbformat_minor": 0,
8 | "worksheets": [
9 | {
10 | "cells": [
11 | {
12 | "cell_type": "code",
13 | "collapsed": false,
14 | "input": [
15 | "for i in range(10):\n",
16 | " print i"
17 | ],
18 | "language": "python",
19 | "metadata": {},
20 | "outputs": [
21 | {
22 | "output_type": "stream",
23 | "stream": "stdout",
24 | "text": [
25 | "0\n",
26 | "1\n",
27 | "2\n",
28 | "3\n",
29 | "4\n",
30 | "5\n",
31 | "6\n",
32 | "7\n",
33 | "8\n",
34 | "9\n"
35 | ]
36 | }
37 | ],
38 | "prompt_number": 1
39 | },
40 | {
41 | "cell_type": "code",
42 | "collapsed": false,
43 | "input": [
44 | "for i in range(1,10):\n",
45 | " print i"
46 | ],
47 | "language": "python",
48 | "metadata": {},
49 | "outputs": [
50 | {
51 | "output_type": "stream",
52 | "stream": "stdout",
53 | "text": [
54 | "1\n",
55 | "2\n",
56 | "3\n",
57 | "4\n",
58 | "5\n",
59 | "6\n",
60 | "7\n",
61 | "8\n",
62 | "9\n"
63 | ]
64 | }
65 | ],
66 | "prompt_number": 2
67 | },
68 | {
69 | "cell_type": "code",
70 | "collapsed": false,
71 | "input": [
72 | "for i in range(1,10,2):\n",
73 | " print i"
74 | ],
75 | "language": "python",
76 | "metadata": {},
77 | "outputs": [
78 | {
79 | "output_type": "stream",
80 | "stream": "stdout",
81 | "text": [
82 | "1\n",
83 | "3\n",
84 | "5\n",
85 | "7\n",
86 | "9\n"
87 | ]
88 | }
89 | ],
90 | "prompt_number": 3
91 | },
92 | {
93 | "cell_type": "code",
94 | "collapsed": false,
95 | "input": [
96 | "print range(10)"
97 | ],
98 | "language": "python",
99 | "metadata": {},
100 | "outputs": [
101 | {
102 | "output_type": "stream",
103 | "stream": "stdout",
104 | "text": [
105 | "[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]\n"
106 | ]
107 | }
108 | ],
109 | "prompt_number": 4
110 | },
111 | {
112 | "cell_type": "code",
113 | "collapsed": false,
114 | "input": [
115 | "print range(1,10,3)"
116 | ],
117 | "language": "python",
118 | "metadata": {},
119 | "outputs": [
120 | {
121 | "output_type": "stream",
122 | "stream": "stdout",
123 | "text": [
124 | "[1, 4, 7]\n"
125 | ]
126 | }
127 | ],
128 | "prompt_number": 6
129 | },
130 | {
131 | "cell_type": "code",
132 | "collapsed": false,
133 | "input": [
134 | "print range(1,10,3,5)"
135 | ],
136 | "language": "python",
137 | "metadata": {},
138 | "outputs": [
139 | {
140 | "ename": "TypeError",
141 | "evalue": "range expected at most 3 arguments, got 4",
142 | "output_type": "pyerr",
143 | "traceback": [
144 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)",
145 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0;32mprint\u001b[0m \u001b[0mrange\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;36m10\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;36m3\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;36m5\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
146 | "\u001b[0;31mTypeError\u001b[0m: range expected at most 3 arguments, got 4"
147 | ]
148 | }
149 | ],
150 | "prompt_number": 7
151 | },
152 | {
153 | "cell_type": "code",
154 | "collapsed": false,
155 | "input": [
156 | "l = [1,2,\"abc\"]"
157 | ],
158 | "language": "python",
159 | "metadata": {},
160 | "outputs": [],
161 | "prompt_number": 8
162 | },
163 | {
164 | "cell_type": "code",
165 | "collapsed": false,
166 | "input": [
167 | "l"
168 | ],
169 | "language": "python",
170 | "metadata": {},
171 | "outputs": [
172 | {
173 | "metadata": {},
174 | "output_type": "pyout",
175 | "prompt_number": 14,
176 | "text": [
177 | "[2]"
178 | ]
179 | }
180 | ],
181 | "prompt_number": 14
182 | },
183 | {
184 | "cell_type": "code",
185 | "collapsed": false,
186 | "input": [
187 | "l.pop()"
188 | ],
189 | "language": "python",
190 | "metadata": {},
191 | "outputs": [
192 | {
193 | "metadata": {},
194 | "output_type": "pyout",
195 | "prompt_number": 10,
196 | "text": [
197 | "'abc'"
198 | ]
199 | }
200 | ],
201 | "prompt_number": 10
202 | },
203 | {
204 | "cell_type": "code",
205 | "collapsed": false,
206 | "input": [
207 | "l2 = l.remove(2)"
208 | ],
209 | "language": "python",
210 | "metadata": {},
211 | "outputs": [],
212 | "prompt_number": 17
213 | },
214 | {
215 | "cell_type": "code",
216 | "collapsed": false,
217 | "input": [
218 | "l"
219 | ],
220 | "language": "python",
221 | "metadata": {},
222 | "outputs": [
223 | {
224 | "metadata": {},
225 | "output_type": "pyout",
226 | "prompt_number": 13,
227 | "text": [
228 | "[2]"
229 | ]
230 | }
231 | ],
232 | "prompt_number": 13
233 | },
234 | {
235 | "cell_type": "code",
236 | "collapsed": false,
237 | "input": [
238 | "l2"
239 | ],
240 | "language": "python",
241 | "metadata": {},
242 | "outputs": [],
243 | "prompt_number": 18
244 | },
245 | {
246 | "cell_type": "code",
247 | "collapsed": false,
248 | "input": [
249 | "l = [1,1,1,2,3]"
250 | ],
251 | "language": "python",
252 | "metadata": {},
253 | "outputs": [],
254 | "prompt_number": 19
255 | },
256 | {
257 | "cell_type": "code",
258 | "collapsed": false,
259 | "input": [
260 | "s = set(l)"
261 | ],
262 | "language": "python",
263 | "metadata": {},
264 | "outputs": [],
265 | "prompt_number": 20
266 | },
267 | {
268 | "cell_type": "code",
269 | "collapsed": false,
270 | "input": [
271 | "s"
272 | ],
273 | "language": "python",
274 | "metadata": {},
275 | "outputs": [
276 | {
277 | "metadata": {},
278 | "output_type": "pyout",
279 | "prompt_number": 21,
280 | "text": [
281 | "{1, 2, 3}"
282 | ]
283 | }
284 | ],
285 | "prompt_number": 21
286 | },
287 | {
288 | "cell_type": "code",
289 | "collapsed": false,
290 | "input": [
291 | "l"
292 | ],
293 | "language": "python",
294 | "metadata": {},
295 | "outputs": [
296 | {
297 | "metadata": {},
298 | "output_type": "pyout",
299 | "prompt_number": 22,
300 | "text": [
301 | "[1, 1, 1, 2, 3]"
302 | ]
303 | }
304 | ],
305 | "prompt_number": 22
306 | },
307 | {
308 | "cell_type": "code",
309 | "collapsed": false,
310 | "input": [
311 | "s1 = set({1,2,3})"
312 | ],
313 | "language": "python",
314 | "metadata": {},
315 | "outputs": [],
316 | "prompt_number": 23
317 | },
318 | {
319 | "cell_type": "code",
320 | "collapsed": false,
321 | "input": [
322 | "s2 = set({3,4,5})"
323 | ],
324 | "language": "python",
325 | "metadata": {},
326 | "outputs": [],
327 | "prompt_number": 24
328 | },
329 | {
330 | "cell_type": "code",
331 | "collapsed": false,
332 | "input": [
333 | "s1 - s2"
334 | ],
335 | "language": "python",
336 | "metadata": {},
337 | "outputs": [
338 | {
339 | "metadata": {},
340 | "output_type": "pyout",
341 | "prompt_number": 25,
342 | "text": [
343 | "{1, 2}"
344 | ]
345 | }
346 | ],
347 | "prompt_number": 25
348 | },
349 | {
350 | "cell_type": "code",
351 | "collapsed": false,
352 | "input": [
353 | "state_dict = {'ny': 'New York'}"
354 | ],
355 | "language": "python",
356 | "metadata": {},
357 | "outputs": [],
358 | "prompt_number": 26
359 | },
360 | {
361 | "cell_type": "code",
362 | "collapsed": false,
363 | "input": [
364 | "state_dict['ny']"
365 | ],
366 | "language": "python",
367 | "metadata": {},
368 | "outputs": [
369 | {
370 | "metadata": {},
371 | "output_type": "pyout",
372 | "prompt_number": 27,
373 | "text": [
374 | "'New York'"
375 | ]
376 | }
377 | ],
378 | "prompt_number": 27
379 | },
380 | {
381 | "cell_type": "code",
382 | "collapsed": false,
383 | "input": [
384 | "state_dict[0]"
385 | ],
386 | "language": "python",
387 | "metadata": {},
388 | "outputs": [
389 | {
390 | "ename": "KeyError",
391 | "evalue": "0",
392 | "output_type": "pyerr",
393 | "traceback": [
394 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)",
395 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mstate_dict\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
396 | "\u001b[0;31mKeyError\u001b[0m: 0"
397 | ]
398 | }
399 | ],
400 | "prompt_number": 28
401 | },
402 | {
403 | "cell_type": "code",
404 | "collapsed": false,
405 | "input": [],
406 | "language": "python",
407 | "metadata": {},
408 | "outputs": []
409 | }
410 | ],
411 | "metadata": {}
412 | }
413 | ]
414 | }
--------------------------------------------------------------------------------
/class1_1/exercise/Exercise1_MeanFunction.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "def mean_calc(input_list):\n",
12 | " list_len = 0 # variable to track running length\n",
13 | " list_sum = 0 # variable to track running sum\n",
14 | " if input_list:\n",
15 | " for i in input_list:\n",
16 | " if isinstance(i,int) or isinstance(i,float): # check to see if element i is of type int or float\n",
17 | " list_len += 1\n",
18 | " list_sum += i\n",
19 | " else: # element i is not int or float\n",
20 | " print \"list element %s is not of type int or float\" % i\n",
21 | " return list_sum/float(list_len) #return the final calculation\n",
22 | " else: #list is empty\n",
23 | " return \"input list is empty\""
24 | ]
25 | },
26 | {
27 | "cell_type": "code",
28 | "execution_count": 2,
29 | "metadata": {
30 | "collapsed": true
31 | },
32 | "outputs": [],
33 | "source": [
34 | "test_list = [1,1,1,2,3,4,4,4,4,5,6,7,9]"
35 | ]
36 | },
37 | {
38 | "cell_type": "code",
39 | "execution_count": 3,
40 | "metadata": {
41 | "collapsed": false
42 | },
43 | "outputs": [
44 | {
45 | "data": {
46 | "text/plain": [
47 | "3.923076923076923"
48 | ]
49 | },
50 | "execution_count": 3,
51 | "metadata": {},
52 | "output_type": "execute_result"
53 | }
54 | ],
55 | "source": [
56 | "mean_calc(test_list)"
57 | ]
58 | },
59 | {
60 | "cell_type": "code",
61 | "execution_count": 4,
62 | "metadata": {
63 | "collapsed": true
64 | },
65 | "outputs": [],
66 | "source": [
67 | "import numpy as np"
68 | ]
69 | },
70 | {
71 | "cell_type": "code",
72 | "execution_count": 5,
73 | "metadata": {
74 | "collapsed": false
75 | },
76 | "outputs": [
77 | {
78 | "data": {
79 | "text/plain": [
80 | "3.9230769230769229"
81 | ]
82 | },
83 | "execution_count": 5,
84 | "metadata": {},
85 | "output_type": "execute_result"
86 | }
87 | ],
88 | "source": [
89 | "np.mean(test_list)"
90 | ]
91 | },
92 | {
93 | "cell_type": "code",
94 | "execution_count": 6,
95 | "metadata": {
96 | "collapsed": true
97 | },
98 | "outputs": [],
99 | "source": [
100 | "test_string = ['1',2,'3','4']"
101 | ]
102 | },
103 | {
104 | "cell_type": "code",
105 | "execution_count": 7,
106 | "metadata": {
107 | "collapsed": false
108 | },
109 | "outputs": [
110 | {
111 | "name": "stdout",
112 | "output_type": "stream",
113 | "text": [
114 | "list element 1 is not of type int or float\n",
115 | "list element 3 is not of type int or float\n",
116 | "list element 4 is not of type int or float\n"
117 | ]
118 | },
119 | {
120 | "data": {
121 | "text/plain": [
122 | "2.0"
123 | ]
124 | },
125 | "execution_count": 7,
126 | "metadata": {},
127 | "output_type": "execute_result"
128 | }
129 | ],
130 | "source": [
131 | "mean_calc(test_string)"
132 | ]
133 | },
134 | {
135 | "cell_type": "code",
136 | "execution_count": 8,
137 | "metadata": {
138 | "collapsed": false
139 | },
140 | "outputs": [
141 | {
142 | "ename": "TypeError",
143 | "evalue": "cannot perform reduce with flexible type",
144 | "output_type": "error",
145 | "traceback": [
146 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
147 | "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)",
148 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtest_string\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
149 | "\u001b[0;32m/Users/richarddunks/anaconda/lib/python2.7/site-packages/numpy/core/fromnumeric.pyc\u001b[0m in \u001b[0;36mmean\u001b[0;34m(a, axis, dtype, out, keepdims)\u001b[0m\n\u001b[1;32m 2733\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2734\u001b[0m return _methods._mean(a, axis=axis, dtype=dtype,\n\u001b[0;32m-> 2735\u001b[0;31m out=out, keepdims=keepdims)\n\u001b[0m\u001b[1;32m 2736\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2737\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mstd\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mout\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mddof\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkeepdims\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mFalse\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
150 | "\u001b[0;32m/Users/richarddunks/anaconda/lib/python2.7/site-packages/numpy/core/_methods.pyc\u001b[0m in \u001b[0;36m_mean\u001b[0;34m(a, axis, dtype, out, keepdims)\u001b[0m\n\u001b[1;32m 64\u001b[0m \u001b[0mdtype\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mmu\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'f8'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 65\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 66\u001b[0;31m \u001b[0mret\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mumr_sum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marr\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mout\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkeepdims\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 67\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mret\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmu\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mndarray\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 68\u001b[0m ret = um.true_divide(\n",
151 | "\u001b[0;31mTypeError\u001b[0m: cannot perform reduce with flexible type"
152 | ]
153 | }
154 | ],
155 | "source": [
156 | "np.mean(test_string)"
157 | ]
158 | },
159 | {
160 | "cell_type": "code",
161 | "execution_count": 9,
162 | "metadata": {
163 | "collapsed": true
164 | },
165 | "outputs": [],
166 | "source": [
167 | "empty_list =[]"
168 | ]
169 | },
170 | {
171 | "cell_type": "code",
172 | "execution_count": 10,
173 | "metadata": {
174 | "collapsed": false
175 | },
176 | "outputs": [
177 | {
178 | "data": {
179 | "text/plain": [
180 | "'input list is empty'"
181 | ]
182 | },
183 | "execution_count": 10,
184 | "metadata": {},
185 | "output_type": "execute_result"
186 | }
187 | ],
188 | "source": [
189 | "mean_calc(empty_list)"
190 | ]
191 | },
192 | {
193 | "cell_type": "code",
194 | "execution_count": 11,
195 | "metadata": {
196 | "collapsed": false
197 | },
198 | "outputs": [
199 | {
200 | "name": "stderr",
201 | "output_type": "stream",
202 | "text": [
203 | "/Users/richarddunks/anaconda/lib/python2.7/site-packages/numpy/core/_methods.py:59: RuntimeWarning: Mean of empty slice.\n",
204 | " warnings.warn(\"Mean of empty slice.\", RuntimeWarning)\n",
205 | "/Users/richarddunks/anaconda/lib/python2.7/site-packages/numpy/core/_methods.py:71: RuntimeWarning: invalid value encountered in double_scalars\n",
206 | " ret = ret.dtype.type(ret / rcount)\n"
207 | ]
208 | },
209 | {
210 | "data": {
211 | "text/plain": [
212 | "nan"
213 | ]
214 | },
215 | "execution_count": 11,
216 | "metadata": {},
217 | "output_type": "execute_result"
218 | }
219 | ],
220 | "source": [
221 | "np.mean(empty_list)"
222 | ]
223 | },
224 | {
225 | "cell_type": "code",
226 | "execution_count": null,
227 | "metadata": {
228 | "collapsed": true
229 | },
230 | "outputs": [],
231 | "source": []
232 | }
233 | ],
234 | "metadata": {
235 | "kernelspec": {
236 | "display_name": "Python 2",
237 | "language": "python",
238 | "name": "python2"
239 | },
240 | "language_info": {
241 | "codemirror_mode": {
242 | "name": "ipython",
243 | "version": 2
244 | },
245 | "file_extension": ".py",
246 | "mimetype": "text/x-python",
247 | "name": "python",
248 | "nbconvert_exporter": "python",
249 | "pygments_lexer": "ipython2",
250 | "version": "2.7.10"
251 | }
252 | },
253 | "nbformat": 4,
254 | "nbformat_minor": 0
255 | }
256 |
--------------------------------------------------------------------------------
/class1_1/exercise/Exercise2_MayoralExcuseGenerator.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {
7 | "collapsed": false
8 | },
9 | "outputs": [],
10 | "source": [
11 | "import random #package for generating pseudo-random numbers: https://docs.python.org/2/library/random.html"
12 | ]
13 | },
14 | {
15 | "cell_type": "code",
16 | "execution_count": 3,
17 | "metadata": {
18 | "collapsed": false
19 | },
20 | "outputs": [],
21 | "source": [
22 | "import csv"
23 | ]
24 | },
25 | {
26 | "cell_type": "code",
27 | "execution_count": 4,
28 | "metadata": {
29 | "collapsed": false
30 | },
31 | "outputs": [
32 | {
33 | "name": "stdout",
34 | "output_type": "stream",
35 | "text": [
36 | "Enter your name: Richard\n"
37 | ]
38 | }
39 | ],
40 | "source": [
41 | "person = raw_input('Enter your name: ')"
42 | ]
43 | },
44 | {
45 | "cell_type": "code",
46 | "execution_count": 5,
47 | "metadata": {
48 | "collapsed": false
49 | },
50 | "outputs": [
51 | {
52 | "name": "stdout",
53 | "output_type": "stream",
54 | "text": [
55 | "Enter your destination: Chelsea\n"
56 | ]
57 | }
58 | ],
59 | "source": [
60 | "place = raw_input('Enter your destination: ')"
61 | ]
62 | },
63 | {
64 | "cell_type": "code",
65 | "execution_count": 6,
66 | "metadata": {
67 | "collapsed": false
68 | },
69 | "outputs": [],
70 | "source": [
71 | "r = random.randrange(0,11) # generate random number between 0 and 10"
72 | ]
73 | },
74 | {
75 | "cell_type": "code",
76 | "execution_count": 8,
77 | "metadata": {
78 | "collapsed": false
79 | },
80 | "outputs": [],
81 | "source": [
82 | "excuse_list = [] #create an empty list to hold the excuses\n",
83 | "inputReader = csv.DictReader(open('excuse.csv','rU'))\n",
84 | "for line in inputReader:\n",
85 | " excuse_list.append(line) # append the excuses (as dictionary) to the list"
86 | ]
87 | },
88 | {
89 | "cell_type": "code",
90 | "execution_count": 7,
91 | "metadata": {
92 | "collapsed": false
93 | },
94 | "outputs": [
95 | {
96 | "name": "stdout",
97 | "output_type": "stream",
98 | "text": [
99 | "Sorry, Richard I was late to Chelsea, breakfast began a little later than expected\n",
100 | "From the story \"De Blasio 15 Minutes Late to St. Patrick's Day Mass, Blames Breakfast\"\n",
101 | "http://www.dnainfo.com/new-york/20150317/midtown/de-blasio-15-minutes-late-st-patricks-day-mass-blames-breakfast\n"
102 | ]
103 | }
104 | ],
105 | "source": [
106 | "print \"Sorry, \" + person + \" I was late to \" + place + \", \" + excuse_list[r]['excuse']\n",
107 | "print 'From the story \"' + excuse_list[r]['headline'] + '\"'\n",
108 | "print excuse_list[r]['hyperlink']"
109 | ]
110 | },
111 | {
112 | "cell_type": "code",
113 | "execution_count": 15,
114 | "metadata": {
115 | "collapsed": false
116 | },
117 | "outputs": [],
118 | "source": [
119 | "# alternate way of generating the list of excuses using the context manager\n",
120 | "# http://preshing.com/20110920/the-python-with-statement-by-example/\n",
121 | "excuse_list2 = []\n",
122 | "with open('excuse.csv','rU') as inputFile:\n",
123 | " inputReader = csv.DictReader(inputFile)\n",
124 | " for line in inputReader:\n",
125 | " excuse_list2.append(line) # append the excuses (as dictionary) to the list\n",
126 | " #file connection is close at end of the indented code"
127 | ]
128 | },
129 | {
130 | "cell_type": "code",
131 | "execution_count": null,
132 | "metadata": {
133 | "collapsed": false
134 | },
135 | "outputs": [],
136 | "source": [
137 | "# This is the least elegant and least pythonic way of doing this. \n",
138 | "# Putting this code up at a Python conference could get you booed or otherwise shamed and driven from the hall\n",
139 | "# but it gets the job done\n",
140 | "inputFile = open('excuse.csv','rU') #create the file object\n",
141 | "header = next(inputFile) # return the first line of the file (header) and assign to a variable\n",
142 | "excuse_list = []\n",
143 | "for line in inputFile:\n",
144 | " line = line.split(',') # split the line on the comma\n",
145 | " excuse_list.append(line[0]) # append the first element to the list\n",
146 | "inputFile.close() # close connection to the file"
147 | ]
148 | },
149 | {
150 | "cell_type": "code",
151 | "execution_count": null,
152 | "metadata": {
153 | "collapsed": false
154 | },
155 | "outputs": [],
156 | "source": []
157 | }
158 | ],
159 | "metadata": {
160 | "kernelspec": {
161 | "display_name": "Python 2",
162 | "language": "python",
163 | "name": "python2"
164 | },
165 | "language_info": {
166 | "codemirror_mode": {
167 | "name": "ipython",
168 | "version": 2
169 | },
170 | "file_extension": ".py",
171 | "mimetype": "text/x-python",
172 | "name": "python",
173 | "nbconvert_exporter": "python",
174 | "pygments_lexer": "ipython2",
175 | "version": "2.7.10"
176 | }
177 | },
178 | "nbformat": 4,
179 | "nbformat_minor": 0
180 | }
181 |
--------------------------------------------------------------------------------
/class1_1/exercise/Exercise3-Answers.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# The following code will print the prime numbers between 1 and 100. Modify the code so it prints every other prime number from 1 to 100"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 2,
13 | "metadata": {
14 | "collapsed": false
15 | },
16 | "outputs": [
17 | {
18 | "name": "stdout",
19 | "output_type": "stream",
20 | "text": [
21 | "1\n",
22 | "3\n",
23 | "7\n",
24 | "13\n",
25 | "19\n",
26 | "29\n",
27 | "37\n",
28 | "43\n",
29 | "53\n",
30 | "61\n",
31 | "71\n",
32 | "79\n",
33 | "89\n"
34 | ]
35 | }
36 | ],
37 | "source": [
38 | "j = 0 # add check counter outside the for-loop so it doesn't get reset\n",
39 | "for num in range(1,101): \n",
40 | " prime = True \n",
41 | " for i in range(2,num): \n",
42 | " if (num%i==0): \n",
43 | " prime = False \n",
44 | " if prime: \n",
45 | " if j%2 == 0: # test the check counter for being even and if so, then print the number\n",
46 | " print num\n",
47 | " j += 1 # increment the check counter each time a prime is found"
48 | ]
49 | },
50 | {
51 | "cell_type": "markdown",
52 | "metadata": {},
53 | "source": [
54 | "# Extra Credit: Can you write a procedure that runs faster than the one above?"
55 | ]
56 | },
57 | {
58 | "cell_type": "code",
59 | "execution_count": 12,
60 | "metadata": {
61 | "collapsed": false
62 | },
63 | "outputs": [
64 | {
65 | "name": "stdout",
66 | "output_type": "stream",
67 | "text": [
68 | "1\n",
69 | "3\n",
70 | "7\n",
71 | "13\n",
72 | "19\n",
73 | "29\n",
74 | "37\n",
75 | "43\n",
76 | "53\n",
77 | "61\n",
78 | "71\n",
79 | "79\n",
80 | "89\n"
81 | ]
82 | }
83 | ],
84 | "source": [
85 | "j = 0 \n",
86 | "for num in range(1,101): \n",
87 | " prime = True \n",
88 | " for i in range(2,num): \n",
89 | " if (num%i==0): \n",
90 | " prime = False\n",
91 | " continue \n",
92 | " # once the number has already been shown to be false, \n",
93 | " # there's no reason to keep checking\n",
94 | " if prime: \n",
95 | " if j%2 == 0: \n",
96 | " print num\n",
97 | " j += 1"
98 | ]
99 | },
100 | {
101 | "cell_type": "code",
102 | "execution_count": 12,
103 | "metadata": {
104 | "collapsed": false
105 | },
106 | "outputs": [],
107 | "source": []
108 | },
109 | {
110 | "cell_type": "code",
111 | "execution_count": null,
112 | "metadata": {
113 | "collapsed": false
114 | },
115 | "outputs": [],
116 | "source": []
117 | }
118 | ],
119 | "metadata": {
120 | "kernelspec": {
121 | "display_name": "Python 2",
122 | "language": "python",
123 | "name": "python2"
124 | },
125 | "language_info": {
126 | "codemirror_mode": {
127 | "name": "ipython",
128 | "version": 2
129 | },
130 | "file_extension": ".py",
131 | "mimetype": "text/x-python",
132 | "name": "python",
133 | "nbconvert_exporter": "python",
134 | "pygments_lexer": "ipython2",
135 | "version": "2.7.10"
136 | }
137 | },
138 | "nbformat": 4,
139 | "nbformat_minor": 0
140 | }
141 |
--------------------------------------------------------------------------------
/class1_1/exercise/Exercise3.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "name": "",
4 | "signature": "sha256:68df1038be6fa984e8fa87db9aa2fa3b80b0196b0e8ca61596f63da0942cd96d"
5 | },
6 | "nbformat": 3,
7 | "nbformat_minor": 0,
8 | "worksheets": [
9 | {
10 | "cells": [
11 | {
12 | "cell_type": "heading",
13 | "level": 1,
14 | "metadata": {},
15 | "source": [
16 | "The following code will print the prime numbers between 1 and 100. Modify the code so it prints every other prime number from 1 to 100"
17 | ]
18 | },
19 | {
20 | "cell_type": "code",
21 | "collapsed": false,
22 | "input": [
23 | "for num in range(1,101): # for-loop through the numbers\n",
24 | " prime = True # boolean flag to check the number for being prime\n",
25 | " for i in range(2,num): # for-loop to check for \"primeness\" by checking for divisors other than 1\n",
26 | " if (num%i==0): # logical test for the number having a divisor other than 1 and itself\n",
27 | " prime = False # if there's a divisor, the boolean value gets flipped to False\n",
28 | " if prime: # if prime is still True after going through all numbers from 1 - 100, then it gets printed\n",
29 | " print num"
30 | ],
31 | "language": "python",
32 | "metadata": {},
33 | "outputs": []
34 | },
35 | {
36 | "cell_type": "heading",
37 | "level": 1,
38 | "metadata": {},
39 | "source": [
40 | "Extra Credit: Can you write a procedure that runs faster than the one above?"
41 | ]
42 | },
43 | {
44 | "cell_type": "code",
45 | "collapsed": false,
46 | "input": [],
47 | "language": "python",
48 | "metadata": {},
49 | "outputs": []
50 | }
51 | ],
52 | "metadata": {}
53 | }
54 | ]
55 | }
--------------------------------------------------------------------------------
/class1_1/exercise/Exercise4-Answers.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# The writer of this code wants to count the mean and median article length for recent articles on gay marraige. This code has several issues, including errors. When they checked their custom functions against the numpy functions, they noticed some discrepancies. Fix the code so it executes properly and the output of the custom functions match the output of the numpy functions"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 5,
13 | "metadata": {
14 | "collapsed": false
15 | },
16 | "outputs": [],
17 | "source": [
18 | "import requests # a better package than urllib2"
19 | ]
20 | },
21 | {
22 | "cell_type": "code",
23 | "execution_count": 6,
24 | "metadata": {
25 | "collapsed": false
26 | },
27 | "outputs": [],
28 | "source": [
29 | "def my_mean(input_list):\n",
30 | " list_sum = 0\n",
31 | " list_count = 0\n",
32 | " for el in input_list:\n",
33 | " list_sum += el\n",
34 | " list_count += 1\n",
35 | " return list_sum / float(list_count) # cast list_count to float"
36 | ]
37 | },
38 | {
39 | "cell_type": "code",
40 | "execution_count": 42,
41 | "metadata": {
42 | "collapsed": false
43 | },
44 | "outputs": [],
45 | "source": [
46 | "def my_median(input_list):\n",
47 | " input_list.sort() # sort the list\n",
48 | " list_length = len(input_list) # get length so it doesn't need to be recalculated\n",
49 | "\n",
50 | " # test for even length and take len/2 and len/2 -1 divided over 2.0 for float division\n",
51 | " if list_length %2 == 0: \n",
52 | " return (input_list[list_length/2] + input_list[(list_length/2) - 1]) / 2.0 \n",
53 | " else:\n",
54 | " return input_list[list_length/2]"
55 | ]
56 | },
57 | {
58 | "cell_type": "code",
59 | "execution_count": 2,
60 | "metadata": {
61 | "collapsed": false
62 | },
63 | "outputs": [],
64 | "source": [
65 | "api_key = \"ffaf60d7d82258e112dd4fb2b5e4e2d6:3:72421680\""
66 | ]
67 | },
68 | {
69 | "cell_type": "code",
70 | "execution_count": 3,
71 | "metadata": {
72 | "collapsed": false
73 | },
74 | "outputs": [],
75 | "source": [
76 | "url = \"http://api.nytimes.com/svc/search/v2/articlesearch.json?q=gay+marriage&api-key=%s\" % api_key # variable name mistyped"
77 | ]
78 | },
79 | {
80 | "cell_type": "code",
81 | "execution_count": 8,
82 | "metadata": {
83 | "collapsed": false
84 | },
85 | "outputs": [],
86 | "source": [
87 | "r = requests.get(url)"
88 | ]
89 | },
90 | {
91 | "cell_type": "code",
92 | "execution_count": 10,
93 | "metadata": {
94 | "collapsed": false
95 | },
96 | "outputs": [],
97 | "source": [
98 | "wc_list = []\n",
99 | "for article in r.json()['response']['docs']:\n",
100 | " wc_list.append(int(article['word_count'])) #word_count needs to be cast to int"
101 | ]
102 | },
103 | {
104 | "cell_type": "code",
105 | "execution_count": 11,
106 | "metadata": {
107 | "collapsed": false
108 | },
109 | "outputs": [
110 | {
111 | "data": {
112 | "text/plain": [
113 | "1034.2"
114 | ]
115 | },
116 | "execution_count": 11,
117 | "metadata": {},
118 | "output_type": "execute_result"
119 | }
120 | ],
121 | "source": [
122 | "my_mean(wc_list)"
123 | ]
124 | },
125 | {
126 | "cell_type": "code",
127 | "execution_count": 12,
128 | "metadata": {
129 | "collapsed": false
130 | },
131 | "outputs": [],
132 | "source": [
133 | "import numpy as np"
134 | ]
135 | },
136 | {
137 | "cell_type": "code",
138 | "execution_count": 13,
139 | "metadata": {
140 | "collapsed": false
141 | },
142 | "outputs": [
143 | {
144 | "data": {
145 | "text/plain": [
146 | "1034.2"
147 | ]
148 | },
149 | "execution_count": 13,
150 | "metadata": {},
151 | "output_type": "execute_result"
152 | }
153 | ],
154 | "source": [
155 | "np.mean(wc_list)"
156 | ]
157 | },
158 | {
159 | "cell_type": "code",
160 | "execution_count": 43,
161 | "metadata": {
162 | "collapsed": false
163 | },
164 | "outputs": [
165 | {
166 | "data": {
167 | "text/plain": [
168 | "926.5"
169 | ]
170 | },
171 | "execution_count": 43,
172 | "metadata": {},
173 | "output_type": "execute_result"
174 | }
175 | ],
176 | "source": [
177 | "my_median(wc_list)"
178 | ]
179 | },
180 | {
181 | "cell_type": "code",
182 | "execution_count": 28,
183 | "metadata": {
184 | "collapsed": false
185 | },
186 | "outputs": [
187 | {
188 | "data": {
189 | "text/plain": [
190 | "926.5"
191 | ]
192 | },
193 | "execution_count": 28,
194 | "metadata": {},
195 | "output_type": "execute_result"
196 | }
197 | ],
198 | "source": [
199 | "np.median(wc_list)"
200 | ]
201 | }
202 | ],
203 | "metadata": {
204 | "kernelspec": {
205 | "display_name": "Python 2",
206 | "language": "python",
207 | "name": "python2"
208 | },
209 | "language_info": {
210 | "codemirror_mode": {
211 | "name": "ipython",
212 | "version": 2
213 | },
214 | "file_extension": ".py",
215 | "mimetype": "text/x-python",
216 | "name": "python",
217 | "nbconvert_exporter": "python",
218 | "pygments_lexer": "ipython2",
219 | "version": "2.7.10"
220 | }
221 | },
222 | "nbformat": 4,
223 | "nbformat_minor": 0
224 | }
225 |
--------------------------------------------------------------------------------
/class1_1/exercise/Exercise4.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "name": "",
4 | "signature": "sha256:27eed4d676ae4b5bf707d837cc436a5377ce46ea08f8fe7dcf43383e80482aeb"
5 | },
6 | "nbformat": 3,
7 | "nbformat_minor": 0,
8 | "worksheets": [
9 | {
10 | "cells": [
11 | {
12 | "cell_type": "heading",
13 | "level": 1,
14 | "metadata": {},
15 | "source": [
16 | "The writer of this code wants to count the mean and median article length for recent articles on gay marriage in the New York Times. This code has several issues, including errors. When they checked their custom functions against the numpy functions, they noticed some discrepancies. Fix the code so it executes properly, retrieves the articles, and outputs the correct result from the custom functions, compared to the numpy functions."
17 | ]
18 | },
19 | {
20 | "cell_type": "code",
21 | "collapsed": false,
22 | "input": [
23 | "import requests # a better package than urllib2"
24 | ],
25 | "language": "python",
26 | "metadata": {},
27 | "outputs": []
28 | },
29 | {
30 | "cell_type": "code",
31 | "collapsed": false,
32 | "input": [
33 | "def my_mean(input_list):\n",
34 | " list_sum = 0\n",
35 | " list_count = 0\n",
36 | " for el in input_list:\n",
37 | " list_sum += el\n",
38 | " list_count += 1\n",
39 | " return list_sum / list_count"
40 | ],
41 | "language": "python",
42 | "metadata": {},
43 | "outputs": []
44 | },
45 | {
46 | "cell_type": "code",
47 | "collapsed": false,
48 | "input": [
49 | "def my_median(input_list):\n",
50 | " list_length = len(input_list)\n",
51 | " return input_list[list_length/2]"
52 | ],
53 | "language": "python",
54 | "metadata": {},
55 | "outputs": []
56 | },
57 | {
58 | "cell_type": "code",
59 | "collapsed": false,
60 | "input": [
61 | "api_key = \"ffaf60d7d82258e112dd4fb2b5e4e2d6:3:72421680\""
62 | ],
63 | "language": "python",
64 | "metadata": {},
65 | "outputs": []
66 | },
67 | {
68 | "cell_type": "code",
69 | "collapsed": false,
70 | "input": [
71 | "url = \"http://api.nytimes.com/svc/search/v2/articlesearch.json?q=gay+marriage&api-key=%s\" % API_key"
72 | ],
73 | "language": "python",
74 | "metadata": {},
75 | "outputs": []
76 | },
77 | {
78 | "cell_type": "code",
79 | "collapsed": false,
80 | "input": [
81 | "r = requests.get(url)"
82 | ],
83 | "language": "python",
84 | "metadata": {},
85 | "outputs": []
86 | },
87 | {
88 | "cell_type": "code",
89 | "collapsed": false,
90 | "input": [
91 | "wc_list = []\n",
92 | "for article in r.json()['response']['docs']:\n",
93 | " wc_list.append(article['word_count'])"
94 | ],
95 | "language": "python",
96 | "metadata": {},
97 | "outputs": []
98 | },
99 | {
100 | "cell_type": "code",
101 | "collapsed": false,
102 | "input": [
103 | "my_mean(wc_list)"
104 | ],
105 | "language": "python",
106 | "metadata": {},
107 | "outputs": []
108 | },
109 | {
110 | "cell_type": "code",
111 | "collapsed": false,
112 | "input": [
113 | "import numpy as np"
114 | ],
115 | "language": "python",
116 | "metadata": {},
117 | "outputs": []
118 | },
119 | {
120 | "cell_type": "code",
121 | "collapsed": false,
122 | "input": [
123 | "np.mean(wc_list)"
124 | ],
125 | "language": "python",
126 | "metadata": {},
127 | "outputs": []
128 | },
129 | {
130 | "cell_type": "code",
131 | "collapsed": false,
132 | "input": [
133 | "my_median(wc_list)"
134 | ],
135 | "language": "python",
136 | "metadata": {},
137 | "outputs": []
138 | },
139 | {
140 | "cell_type": "code",
141 | "collapsed": false,
142 | "input": [
143 | "np.median(wc_list)"
144 | ],
145 | "language": "python",
146 | "metadata": {},
147 | "outputs": []
148 | },
149 | {
150 | "cell_type": "code",
151 | "collapsed": false,
152 | "input": [],
153 | "language": "python",
154 | "metadata": {},
155 | "outputs": []
156 | }
157 | ],
158 | "metadata": {}
159 | }
160 | ]
161 | }
--------------------------------------------------------------------------------
/class1_1/exercise/excuse.csv:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datapolitan/lede_algorithms/9109b5c91a3eb74c1343e79daf60abd76be182ea/class1_1/exercise/excuse.csv
--------------------------------------------------------------------------------
/class1_1/lab1-1.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "name": "",
4 | "signature": "sha256:b6f3a87300b6901d9f5557b6771c6025c4453f08fb77b010b5531157a6471784"
5 | },
6 | "nbformat": 3,
7 | "nbformat_minor": 0,
8 | "worksheets": [
9 | {
10 | "cells": [
11 | {
12 | "cell_type": "code",
13 | "collapsed": false,
14 | "input": [
15 | "b = 1\n",
16 | "for i in range(5):\n",
17 | " print b\n",
18 | " b += 1"
19 | ],
20 | "language": "python",
21 | "metadata": {},
22 | "outputs": [
23 | {
24 | "output_type": "stream",
25 | "stream": "stdout",
26 | "text": [
27 | "1\n",
28 | "2\n",
29 | "3\n",
30 | "4\n",
31 | "5\n"
32 | ]
33 | }
34 | ],
35 | "prompt_number": 11
36 | },
37 | {
38 | "cell_type": "code",
39 | "collapsed": false,
40 | "input": [
41 | "for i in range(5):\n",
42 | " b = 5\n",
43 | " print b\n",
44 | " b += 1"
45 | ],
46 | "language": "python",
47 | "metadata": {},
48 | "outputs": [
49 | {
50 | "output_type": "stream",
51 | "stream": "stdout",
52 | "text": [
53 | "5\n",
54 | "5\n",
55 | "5\n",
56 | "5\n",
57 | "5\n"
58 | ]
59 | }
60 | ],
61 | "prompt_number": 9
62 | },
63 | {
64 | "cell_type": "code",
65 | "collapsed": false,
66 | "input": [
67 | "for i in range(10):\n",
68 | " print i\n",
69 | " for j in range(10):\n",
70 | " print j\n",
71 | "print i"
72 | ],
73 | "language": "python",
74 | "metadata": {},
75 | "outputs": [
76 | {
77 | "output_type": "stream",
78 | "stream": "stdout",
79 | "text": [
80 | "0\n",
81 | "0\n",
82 | "1\n",
83 | "2\n",
84 | "3\n",
85 | "4\n",
86 | "5\n",
87 | "6\n",
88 | "7\n",
89 | "8\n",
90 | "9\n",
91 | "1\n",
92 | "0\n",
93 | "1\n",
94 | "2\n",
95 | "3\n",
96 | "4\n",
97 | "5\n",
98 | "6\n",
99 | "7\n",
100 | "8\n",
101 | "9\n",
102 | "2\n",
103 | "0\n",
104 | "1\n",
105 | "2\n",
106 | "3\n",
107 | "4\n",
108 | "5\n",
109 | "6\n",
110 | "7\n",
111 | "8\n",
112 | "9\n",
113 | "3\n",
114 | "0\n",
115 | "1\n",
116 | "2\n",
117 | "3\n",
118 | "4\n",
119 | "5\n",
120 | "6\n",
121 | "7\n",
122 | "8\n",
123 | "9\n",
124 | "4\n",
125 | "0\n",
126 | "1\n",
127 | "2\n",
128 | "3\n",
129 | "4\n",
130 | "5\n",
131 | "6\n",
132 | "7\n",
133 | "8\n",
134 | "9\n",
135 | "5\n",
136 | "0\n",
137 | "1\n",
138 | "2\n",
139 | "3\n",
140 | "4\n",
141 | "5\n",
142 | "6\n",
143 | "7\n",
144 | "8\n",
145 | "9\n",
146 | "6\n",
147 | "0\n",
148 | "1\n",
149 | "2\n",
150 | "3\n",
151 | "4\n",
152 | "5\n",
153 | "6\n",
154 | "7\n",
155 | "8\n",
156 | "9\n",
157 | "7\n",
158 | "0\n",
159 | "1\n",
160 | "2\n",
161 | "3\n",
162 | "4\n",
163 | "5\n",
164 | "6\n",
165 | "7\n",
166 | "8\n",
167 | "9\n",
168 | "8\n",
169 | "0\n",
170 | "1\n",
171 | "2\n",
172 | "3\n",
173 | "4\n",
174 | "5\n",
175 | "6\n",
176 | "7\n",
177 | "8\n",
178 | "9\n",
179 | "9\n",
180 | "0\n",
181 | "1\n",
182 | "2\n",
183 | "3\n",
184 | "4\n",
185 | "5\n",
186 | "6\n",
187 | "7\n",
188 | "8\n",
189 | "9\n",
190 | "9\n"
191 | ]
192 | }
193 | ],
194 | "prompt_number": 13
195 | },
196 | {
197 | "cell_type": "code",
198 | "collapsed": false,
199 | "input": [
200 | "person= raw_input(\"Enter your name: \")"
201 | ],
202 | "language": "python",
203 | "metadata": {},
204 | "outputs": [
205 | {
206 | "name": "stdout",
207 | "output_type": "stream",
208 | "stream": "stdout",
209 | "text": [
210 | "Enter your name: Richard\n"
211 | ]
212 | }
213 | ],
214 | "prompt_number": 14
215 | },
216 | {
217 | "cell_type": "code",
218 | "collapsed": false,
219 | "input": [
220 | "person"
221 | ],
222 | "language": "python",
223 | "metadata": {},
224 | "outputs": [
225 | {
226 | "metadata": {},
227 | "output_type": "pyout",
228 | "prompt_number": 15,
229 | "text": [
230 | "'Richard'"
231 | ]
232 | }
233 | ],
234 | "prompt_number": 15
235 | },
236 | {
237 | "cell_type": "code",
238 | "collapsed": false,
239 | "input": [
240 | "3/4"
241 | ],
242 | "language": "python",
243 | "metadata": {},
244 | "outputs": [
245 | {
246 | "metadata": {},
247 | "output_type": "pyout",
248 | "prompt_number": 16,
249 | "text": [
250 | "0"
251 | ]
252 | }
253 | ],
254 | "prompt_number": 16
255 | },
256 | {
257 | "cell_type": "code",
258 | "collapsed": false,
259 | "input": [
260 | "3/4.0"
261 | ],
262 | "language": "python",
263 | "metadata": {},
264 | "outputs": [
265 | {
266 | "metadata": {},
267 | "output_type": "pyout",
268 | "prompt_number": 17,
269 | "text": [
270 | "0.75"
271 | ]
272 | }
273 | ],
274 | "prompt_number": 17
275 | },
276 | {
277 | "cell_type": "code",
278 | "collapsed": false,
279 | "input": [
280 | "import csv"
281 | ],
282 | "language": "python",
283 | "metadata": {},
284 | "outputs": [],
285 | "prompt_number": 24
286 | },
287 | {
288 | "cell_type": "code",
289 | "collapsed": false,
290 | "input": [
291 | "inputFile = open('../lede_algorithms/class1_1/exercise/excuse.csv','rU')\n",
292 | "inputReader = csv.reader(inputFile)"
293 | ],
294 | "language": "python",
295 | "metadata": {},
296 | "outputs": [],
297 | "prompt_number": 27
298 | },
299 | {
300 | "cell_type": "code",
301 | "collapsed": false,
302 | "input": [
303 | "for line in inputFile:\n",
304 | " line = line.split(',')\n",
305 | " print line"
306 | ],
307 | "language": "python",
308 | "metadata": {},
309 | "outputs": [
310 | {
311 | "output_type": "stream",
312 | "stream": "stdout",
313 | "text": [
314 | "['excuse', 'headline', 'hyperlink\\rthe fog was unexpected and did slow us down a bit', \"De Blasio Blames 'Rough Night' and Fog for Missing Flight 587 Ceremony\", 'http://www.dnainfo.com/new-york/20141112/rockaway-park/de-blasio-arrives-20-minutes-late-flight-587-memorial-angering-families\\rwe had some meetings at Gracie Mansion', \"De Blasio 30 Minutes Late to Rockaway St. Patrick's Day Parade\", 'http://www.dnainfo.com/new-york/20150307/belle-harbor/de-blasio-30-minutes-late-rockaway-st-patricks-day-parade\\rI had a very rough night and woke up sluggish', \"De Blasio Blames 'Rough Night' and Fog for Missing Flight 587 Ceremony\", 'http://www.dnainfo.com/new-york/20141112/rockaway-park/de-blasio-arrives-20-minutes-late-flight-587-memorial-angering-families\\rI just woke up in the middle of the night and couldn\\x89\\xdb\\xaat get back to sleep', \"De Blasio Blames 'Rough Night' and Fog for Missing Flight 587 Ceremony\", 'http://www.dnainfo.com/new-york/20141112/rockaway-park/de-blasio-arrives-20-minutes-late-flight-587-memorial-angering-families\\rwe had some stuff we had to do', \"De Blasio 30 Minutes Late to Rockaway St. Patrick's Day Parade\", 'http://www.dnainfo.com/new-york/20150307/belle-harbor/de-blasio-30-minutes-late-rockaway-st-patricks-day-parade\\rI should have gotten myself moving quicker', \"De Blasio Blames 'Rough Night' and Fog for Missing Flight 587 Ceremony\", 'http://www.dnainfo.com/new-york/20141112/rockaway-park/de-blasio-arrives-20-minutes-late-flight-587-memorial-angering-families\\rI was just not feeling well this morning', \"De Blasio Blames 'Rough Night' and Fog for Missing Flight 587 Ceremony\", 'http://www.dnainfo.com/new-york/20141112/rockaway-park/de-blasio-arrives-20-minutes-late-flight-587-memorial-angering-families\\rbreakfast began a little later than expected', '\"De Blasio 15 Minutes Late to St. Patrick\\'s Day Mass', ' Blames Breakfast\"', 'http://www.dnainfo.com/new-york/20150317/midtown/de-blasio-15-minutes-late-st-patricks-day-mass-blames-breakfast\\rthe detail drove away when we went into the subway rather than waiting to confirm we got on a train', 'Mayor de Blasio Is Irked by a Subway Delay', 'http://www.nytimes.com/2015/05/06/nyregion/mayor-de-blasio-is-irked-by-a-subway-delay.html?ref=nyregion&_r=0\\rwe waited 20 mins for an express only to hear there were major delays', 'Mayor de Blasio Is Irked by a Subway Delay', 'http://www.nytimes.com/2015/05/06/nyregion/mayor-de-blasio-is-irked-by-a-subway-delay.html?ref=nyregion&_r=0\\rwe need a better system', 'Mayor de Blasio Is Irked by a Subway Delay', 'http://www.nytimes.com/2015/05/06/nyregion/mayor-de-blasio-is-irked-by-a-subway-delay.html?ref=nyregion&_r=0']\n"
315 | ]
316 | }
317 | ],
318 | "prompt_number": 28
319 | },
320 | {
321 | "cell_type": "code",
322 | "collapsed": false,
323 | "input": [
324 | "inputFile = open('/Users/richarddunks/Dropbox/Datapolitan/Projects/Training/)"
325 | ],
326 | "language": "python",
327 | "metadata": {},
328 | "outputs": []
329 | }
330 | ],
331 | "metadata": {}
332 | }
333 | ]
334 | }
--------------------------------------------------------------------------------
/class1_1/newsroom_examples.md:
--------------------------------------------------------------------------------
1 | # Algorithms in the newsroom
2 |
3 | Ever since Philip Meyer published [Precision Journalism](http://www.unc.edu/~pmeyer/book/) in the 1970s (and probably even a bit before), journalists have been using algorithms in some form to tell new and different kinds of stories. Below we've included several examples from different eras of what you'd now call data journalism:
4 |
5 | ### Classic examples
6 |
7 | Computer-assisted reporting specialists have been a fixture in newsrooms for decades, applying data analysis and social science methods to the news report. Some of the most sophisticated of those techniques are based in part on algorithms we'll learn in this class:
8 |
9 | - **School Scandals, Children Left Behind: Cheating in Texas Schools and Also [Faking the Grade](http://clipfile.org/?p=892)**: Two powerful series of stories by the Dallas Morning News in the mid-2000s that showed rampant cheating by students and teachers on Texas standardized exams. Not the first story to use regression models but one of the most powerful early examples.
10 |
11 | - **[Speed Trap: Who Gets a Ticket, Who Gets a Break?](http://www.boston.com/globe/metro/packages/tickets/)**: Another early example of using logistic regression to explain a newsworthy phenomenon -- in this case the many factors that go into whether a person is given a speeding ticket or let off the hook. Just as interesting as the story is its [detailed methodology](http://www.boston.com/globe/metro/packages/tickets/study.pdf), which is worth a read.
12 |
13 | - **[Cluster analysis in CAR](https://www.ire.org/publications/search-uplink-archives/167/)**: Simple cluster analysis has been used for years in newsrooms to find everything from [crime hotspots](http://www.icpsr.umich.edu/CrimeStat/) to [cancer clusters on Long Island](http://www.ij-healthgeographics.com/content/2/1/3).
14 |
15 | ### Algorithmic journalism catches on
16 |
17 | Although reporters and computer-assisted reporting specialists had been doing some form of it for years, the idea of "data journalism" as it is now known was popularized during the 2012 presidential election, in large part thanks to the predictive modeling of Nate Silver.
18 |
19 | - **[FiveThirtyEight](http://fivethirtyeight.blogs.nytimes.com/fivethirtyeights-2012-forecast/)**: Nate Silver's prediction models were the first example of data/algorithmic journalism reaching the mainstream. Since then, election predictions have become a bit old hat. The Times' new model, [Leo](http://www.nytimes.com/newsgraphics/2014/senate-model/), was exceedingly accurate in 2014 (its [source code](https://github.com/TheUpshot/leo-senate-model) is online). The Times also ran a series of [live predictions](http://elections.nytimes.com/2014/senate-model) on key 2014 races on Election Night.
20 |
21 | - **[ProPublica's Message Machine](https://projects.propublica.org/emails/)**: Also during the 2012 elections, ProPublica launched its Message Machine project, which used hashing algorithms to reverse-engineer targeted e-mail messages from political campaigns.
22 |
23 | - **[L.A. Times crime alerts](http://maps.latimes.com/crime/)**: The Los Angeles Times has for years been calculating and publicizing alerts when crime spikes in certain neighborhoods.
24 |
25 | ### Modern examples
26 |
27 | These days, sophisticated algorithms are used to solve all sorts of journalistic problems, both exciting and mundane.
28 |
29 | - **[Campaign finance data deduplication](https://github.com/cjdd3b/fec-standardizer/wiki)**: Most campaign finance data is organized by contribution, not donor. Joe Smith might give three different contributions and be listed in the data in three different ways. Connecting those records into a single canonical Joe Smith is often the first step to doing sophisticated campaign finance analysis. Over the last few years, people have developed highly accurate methods to do this using both supervised and unsupervised machine learning.
30 |
31 | - **[NYT Cooking](http://cooking.nytimes.com/)**: The new Cooking website and app has been one of the Times' most successful new products, but it was initially based largely on recipes stored in free-text articles. The Times extracted many of those recipes using an algorithmic technique known as [conditional random fields](http://open.blogs.nytimes.com/2015/04/09/extracting-structured-data-from-recipes-using-conditional-random-fields/). The L.A. Times did something [similar](https://source.opennews.org/en-US/articles/how-we-made-new-california-cookbook/) in 2013.
32 |
33 | - **[The Echo Chamber](http://www.reuters.com/investigates/special-report/scotus/)**: A team from Reuters was a finalist for the Pulitzer this year after using (among other things) sophisticated topic modeling techniques to help document how a small group of lawyers has disproportionate influence over the U.S. Supreme Court.
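
To make the matching idea in the first bullet concrete, here is a minimal, hypothetical sketch of the pairwise-comparison intuition: turn each donor string into character shingles, score pairs with Jaccard similarity, and treat high-scoring pairs as candidate matches. (The class4_1 `donors.py` exercise in this repo builds classifier features the same way; a production system would feed many such scores into a supervised model rather than rely on a single cutoff.)

```python
# Minimal record-matching sketch: character shingles + Jaccard similarity.
def shingle(word, n=3):
    """Split a string into overlapping character n-grams."""
    return set(word[i:i + n] for i in range(len(word) - n + 1))

def jaccard(a, b):
    """Jaccard similarity between two sets."""
    if not a or not b:
        return 0.0
    return float(len(a & b)) / len(a | b)

score = jaccard(shingle('SMITH, JOE D'), shingle('SMITH, JOSEPH'))
print(score)  # 0.5 -- a real matcher would learn the cutoff from training data
```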
--------------------------------------------------------------------------------
/class1_2/2013_NYC_CD_MedianIncome_Recycle.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datapolitan/lede_algorithms/9109b5c91a3eb74c1343e79daf60abd76be182ea/class1_2/2013_NYC_CD_MedianIncome_Recycle.xlsx
--------------------------------------------------------------------------------
/class1_2/Data_Collection_Sheet.csv:
--------------------------------------------------------------------------------
1 | name,height (inches),age (years),siblings (not including you)
2 | Richard,72,35,1
3 | Adam,71,29,1
4 | Rashida,62,24,0
5 | Spe ,62,23,1
6 | Jiachuan,67,25,0
7 | GK,66,36,1
8 | Fanny,64,32,3
9 | Meghan,65,27,2
10 | Arthur,74,49,1
11 | Elliott,66,31,3
12 | Aliza,64,23,1
13 | Lindsay,67,22,2
14 | Michael,71,49,3
15 | Vanessa,66,27,2
16 | Melissa,61,23,1
17 | Kassahun,67,32,3
18 | Sebastian,67.7165,26,2
19 | Giulia,67,27,2
20 | Siutan,66,25,0
21 | Tian,68,23,0
22 | Laure,65,25,3
23 | Katie,67,36,1
--------------------------------------------------------------------------------
/class1_2/height_weight.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datapolitan/lede_algorithms/9109b5c91a3eb74c1343e79daf60abd76be182ea/class1_2/height_weight.xlsx
--------------------------------------------------------------------------------
/class2_1/README.md:
--------------------------------------------------------------------------------
1 | # Algorithms: Week 2, Class 1 (Tuesday, July 21)
2 |
3 | Today we'll spend the first hour reviewing last week's material by exploring a new dataset: transportation data used by FiveThirtyEight to build their [fastest-flight tracker](http://projects.fivethirtyeight.com/flights/), which launched last month.
4 |
5 | Then we'll talk about why it's important to learn, and explain, what's going on under the hood of modern algorithms -- both as an exercise in transparency and skepticism and as a setup for Thursday's class, where we'll begin discussing regression.
6 |
7 | In lab, you'll continue our class work by analyzing some data on your own and developing your results into story ideas. Then we'll ask you to critique a project released earlier this summer by NPR.
8 |
9 | ## Hour 1: Exploratory data analysis review
10 |
11 | We'll be working with airline on-time performance reports from the U.S. Department of Transportation, which you can download [here](http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time). The rest of what you'll need is in the accompanying [IPython notebook](https://github.com/datapolitan/lede_algorithms/blob/master/class2_1/EDA_Review.ipynb).
12 |
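If you'd like a head start, here's a minimal EDA sketch. The file path matches the copy in this folder's `data` directory, but the column names (`CARRIER`, `ARR_DELAY`) are assumptions -- swap in whatever fields your download actually contains.

```python
import pandas as pd

# Load the on-time slice (column names below are assumptions -- match them
# to the fields you selected on the BTS download page).
df = pd.read_csv('data/ontime_reports_may_2015_ny.csv')

print(df.describe())                              # basic descriptive statistics
print(df.groupby('CARRIER')['ARR_DELAY'].mean())  # average arrival delay by airline
df['ARR_DELAY'].hist(bins=50)                     # delay distribution (with %matplotlib inline in a notebook)
```
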
13 | ## Hour 2: Transparency and the "nerd box"
14 |
15 | First, we'll read Simon Rogers' piece in Mother Jones: [Hey Wonk Reporters, Liberate Your Data!](http://www.motherjones.com/media/2014/04/vox-538-upshot-open-data-missing) (and possibly this: [Debugging the Backlash to Data Journalism](http://towcenter.org/debugging-the-backlash-to-data-journalism/)).
16 |
17 | Then we'll discuss the evolution of transparency in data journalism and why it's important: from the journalistic tradition of the "nerd box," through making data available via news apps, and finally to more modern examples of transparency.
18 |
19 | - [The Boston Globe's Speed Trap: Who Gets a Ticket, Who Gets a Break?](http://www.boston.com/globe/metro/packages/tickets/study.pdf)
20 | - [St. Petersburg (now Tampa Bay) Times: Vanishing Wetlands](http://www.sptimes.com/2006/webspecials06/wetlands/)
21 | - [Ft. Lauderdale Sun Sentinel police speeding investigation](http://databases.sun-sentinel.com/news/broward/ftlaudCopSpeeds/ftlaudCopSpeeds_list.php)
22 | - [Washington Post police shootings](http://www.washingtonpost.com/national/how-the-washington-post-is-examining-police-shootings-in-the-us/2015/06/29/f42c10b2-151b-11e5-9518-f9e0a8959f32_story.html)
23 | - [Leo: The NYT Senate model](http://www.nytimes.com/newsgraphics/2014/senate-model/methodology.html)
24 |
25 | ## Hour 3: From transparent data to transparent algorithms
26 |
27 | Even if you never write another algorithm before you die, this class should at least teach you enough to ask good questions about algorithms' capabilities and roles in society. We'll look at stories from some reporters who can articulate a clear, accurate understanding of how algorithms work, as well as some who ... well ... can't.
28 |
29 | Here are a few examples from one of those categories:
30 |
31 | - [Experts predict robots will take over 30% of our jobs by 2025 — and white-collar jobs aren't immune](http://www.businessinsider.com/experts-predict-that-one-third-of-jobs-will-be-replaced-by-robots-2015-5)
32 | - [Journalists, here's how robots are going to steal your job](http://www.newstatesman.com/future-proof/2014/03/journalists-heres-how-robots-are-going-steal-your-job)
33 | - [Artificial intelligence could end mankind: Hawking](http://www.cnbc.com/2014/05/04/artificial-intelligence-could-end-mankind-hawking.html)
34 | - ['Chappie' Doesn't Think Robots Will Destroy the World](http://www.nbcnews.com/tech/innovation/chappie-doesnt-think-robots-will-destroy-world-n305876)
35 | - [What Happens When Robots Write the Future?](http://op-talk.blogs.nytimes.com/2014/08/18/what-happens-when-robots-write-the-future/)
36 |
37 | And a few from the other:
38 |
39 | - [At UPS, the Algorithm Is the Driver](http://www.wsj.com/articles/at-ups-the-algorithm-is-the-driver-1424136536)
40 | - [If Algorithms Know All, How Much Should Humans Help?](http://www.nytimes.com/2015/04/07/upshot/if-algorithms-know-all-how-much-should-humans-help.html?abt=0002&abg=0)
41 | - [The Potential and the Risks of Data Science](http://bits.blogs.nytimes.com/2013/04/07/the-potential-and-the-risks-of-data-science/)
42 | - [Google Schools Its Algorithm](http://www.nytimes.com/2011/03/06/weekinreview/06lohr.html)
43 | - [When Algorithms Discriminate](http://www.nytimes.com/2015/07/10/upshot/when-algorithms-discriminate.html)
44 |
45 | ## Lab
46 |
47 | You'll be working on two things in lab today. First, by way of review, download a slice of the [data](http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time) we used for the class exercises. Explore that data on your own using the skills you've learned so far and come back with two 100-ish-word story pitches. Each pitch should also come with a brief description of how you analyzed the data in order to get that idea.
48 |
49 | Second, write a short critique of NPR and Planet Money's recent coverage of the effect of algorithms and automation on the labor market. You don't need to listen to all of their podcasts on the subject (they did a handful), but check out [a few of them](http://www.npr.org/sections/money/2015/05/08/405270046/episode-622-humans-vs-robots), look at some of their [data visualizations](http://www.npr.org/sections/money/2015/05/21/408234543/will-your-job-be-done-by-a-machine) and play around with their tool that calculates whether your job is [likely to be done by a machine](http://www.npr.org/sections/money/2015/05/21/408234543/will-your-job-be-done-by-a-machine).
50 |
51 | ## Questions
52 |
53 | I'm at chase.davis@nytimes.com, and I'll be on Slack after class.
--------------------------------------------------------------------------------
/class2_2/DoNow_2-2.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "#Accomplish the following tasks by whatever means necessary based on the material we've covered in class. Save the notebook in this format: `_DoNow_2-2.ipynb` where `` is your last (family) name and turn it in via Slack."
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": null,
13 | "metadata": {
14 | "collapsed": true
15 | },
16 | "outputs": [],
17 | "source": [
18 | "# the magic command to plot inline with the notebook \n",
19 | "# https://ipython.org/ipython-doc/dev/interactive/tutorial.html#magic-functions\n",
20 | "%matplotlib inline"
21 | ]
22 | },
23 | {
24 | "cell_type": "markdown",
25 | "metadata": {},
26 | "source": [
27 | "###1. Import the pandas package and use the common alias"
28 | ]
29 | },
30 | {
31 | "cell_type": "code",
32 | "execution_count": null,
33 | "metadata": {
34 | "collapsed": true
35 | },
36 | "outputs": [],
37 | "source": []
38 | },
39 | {
40 | "cell_type": "markdown",
41 | "metadata": {},
42 | "source": [
43 | "###2. Read the file \"heights_weights.xlsx\" in the `data` folder into a pandas dataframe"
44 | ]
45 | },
46 | {
47 | "cell_type": "code",
48 | "execution_count": null,
49 | "metadata": {
50 | "collapsed": true
51 | },
52 | "outputs": [],
53 | "source": []
54 | },
55 | {
56 | "cell_type": "markdown",
57 | "metadata": {},
58 | "source": [
59 | "###3. Plot a histogram for both height and weight. Describe the data distribution in comments."
60 | ]
61 | },
62 | {
63 | "cell_type": "code",
64 | "execution_count": null,
65 | "metadata": {
66 | "collapsed": false
67 | },
68 | "outputs": [],
69 | "source": []
70 | },
71 | {
72 | "cell_type": "markdown",
73 | "metadata": {},
74 | "source": [
75 | "###4. Calculate the mean height and mean weight for the dataframe. "
76 | ]
77 | },
78 | {
79 | "cell_type": "code",
80 | "execution_count": null,
81 | "metadata": {
82 | "collapsed": false
83 | },
84 | "outputs": [],
85 | "source": []
86 | },
87 | {
88 | "cell_type": "markdown",
89 | "metadata": {},
90 | "source": [
91 | "###5. Calculate the other significant descriptive statistics on the two data points\n",
92 | "+ Standard deviation\n",
93 | "+ Range\n",
94 | "+ Interquartile range"
95 | ]
96 | },
97 | {
98 | "cell_type": "code",
99 | "execution_count": null,
100 | "metadata": {
101 | "collapsed": false
102 | },
103 | "outputs": [],
104 | "source": []
105 | },
106 | {
107 | "cell_type": "markdown",
108 | "metadata": {},
109 | "source": [
110 | "###6. Calculate the coefficient of correlation for these variables. Do they appear correlated? (put your answer in comments)"
111 | ]
112 | },
113 | {
114 | "cell_type": "code",
115 | "execution_count": null,
116 | "metadata": {
117 | "collapsed": false
118 | },
119 | "outputs": [],
120 | "source": []
121 | },
122 | {
123 | "cell_type": "markdown",
124 | "metadata": {},
125 | "source": [
126 | "###Extra Credit: Create a scatter plot of height and weight"
127 | ]
128 | },
129 | {
130 | "cell_type": "code",
131 | "execution_count": null,
132 | "metadata": {
133 | "collapsed": false
134 | },
135 | "outputs": [],
136 | "source": []
137 | }
138 | ],
139 | "metadata": {
140 | "kernelspec": {
141 | "display_name": "Python 2",
142 | "language": "python",
143 | "name": "python2"
144 | },
145 | "language_info": {
146 | "codemirror_mode": {
147 | "name": "ipython",
148 | "version": 2
149 | },
150 | "file_extension": ".py",
151 | "mimetype": "text/x-python",
152 | "name": "python",
153 | "nbconvert_exporter": "python",
154 | "pygments_lexer": "ipython2",
155 | "version": "2.7.10"
156 | }
157 | },
158 | "nbformat": 4,
159 | "nbformat_minor": 0
160 | }
161 |
--------------------------------------------------------------------------------
/class2_2/data/2013_NYC_CD_MedianIncome_Recycle.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datapolitan/lede_algorithms/9109b5c91a3eb74c1343e79daf60abd76be182ea/class2_2/data/2013_NYC_CD_MedianIncome_Recycle.xlsx
--------------------------------------------------------------------------------
/class2_2/data/height_weight.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datapolitan/lede_algorithms/9109b5c91a3eb74c1343e79daf60abd76be182ea/class2_2/data/height_weight.xlsx
--------------------------------------------------------------------------------
/class3_1/.ipynb_checkpoints/classification-checkpoint.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 2,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "import csv, re, string\n",
12 | "import numpy as np\n",
13 | "from sklearn.linear_model import LogisticRegression\n",
14 | "from sklearn.feature_extraction.text import CountVectorizer\n",
15 | "from sklearn.pipeline import Pipeline"
16 | ]
17 | },
18 | {
19 | "cell_type": "code",
20 | "execution_count": 15,
21 | "metadata": {
22 | "collapsed": false
23 | },
24 | "outputs": [],
25 | "source": [
26 | "PUNCTUATION = re.compile('[%s]' % re.escape(string.punctuation))\n",
27 | "VALID_CLASSES = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'J', 'K', 'L', 'M', 'T', 'X', 'Z']"
28 | ]
29 | },
30 | {
31 | "cell_type": "code",
32 | "execution_count": 16,
33 | "metadata": {
34 | "collapsed": false
35 | },
36 | "outputs": [],
37 | "source": [
38 | "data = []\n",
39 | "with open('data/category-training.csv', 'r') as f:\n",
40 | " inputreader = csv.reader(f, delimiter=',', quotechar='\"')\n",
41 | " for r in inputreader:\n",
42 | " # Concatenate the occupation and employer strings together and remove\n",
43 | " # punctuation. Both occupation and employer will be used in prediction.\n",
44 | " text = PUNCTUATION.sub('', ' '.join(r[0:2]))\n",
45 | " if len(r[2]) > 1 and r[2][0] in VALID_CLASSES:\n",
46 | " # We're only attempting to classify the first character of the\n",
47 | " # industry prefix (\"A\", \"B\", etc.) -- not the whole thing. That's\n",
48 | " # what the r[2][0] piece is about.\n",
49 | " data.append([text, r[2][0]])"
50 | ]
51 | },
52 | {
53 | "cell_type": "code",
54 | "execution_count": 18,
55 | "metadata": {
56 | "collapsed": true
57 | },
58 | "outputs": [],
59 | "source": [
60 | " texts = np.array([el[0] for el in data])\n",
61 | " classes = np.array([el[1] for el in data])"
62 | ]
63 | },
64 | {
65 | "cell_type": "code",
66 | "execution_count": 19,
67 | "metadata": {
68 | "collapsed": false
69 | },
70 | "outputs": [
71 | {
72 | "name": "stdout",
73 | "output_type": "stream",
74 | "text": [
75 | "['Owner First Priority Title Llc' 'SENIOR PARTNER ARES MANAGEMENT'\n",
76 | " 'CEO HB AGENCY' ..., 'INVESTMENT EXECUTIVE FEF MANAGEMENT LLC'\n",
77 | " 'Owner Fair Funeral Home' 'ST MARTIN LIRERRE LAW FIRM ']\n"
78 | ]
79 | }
80 | ],
81 | "source": [
82 | "print texts"
83 | ]
84 | },
85 | {
86 | "cell_type": "code",
87 | "execution_count": 20,
88 | "metadata": {
89 | "collapsed": false
90 | },
91 | "outputs": [
92 | {
93 | "name": "stdout",
94 | "output_type": "stream",
95 | "text": [
96 | "['F' 'Z' 'Z' ..., 'F' 'G' 'K']\n"
97 | ]
98 | }
99 | ],
100 | "source": [
101 | "print classes"
102 | ]
103 | },
104 | {
105 | "cell_type": "code",
106 | "execution_count": 21,
107 | "metadata": {
108 | "collapsed": true
109 | },
110 | "outputs": [],
111 | "source": [
112 | "pipeline = Pipeline([\n",
113 | " ('vectorizer', CountVectorizer(\n",
114 | " ngram_range=(1,2),\n",
115 | " stop_words='english',\n",
116 | " min_df=2,\n",
117 | " max_df=len(texts))),\n",
118 | " ('classifier', LogisticRegression())\n",
119 | "])"
120 | ]
121 | },
122 | {
123 | "cell_type": "code",
124 | "execution_count": 22,
125 | "metadata": {
126 | "collapsed": false
127 | },
128 | "outputs": [
129 | {
130 | "data": {
131 | "text/plain": [
132 | "Pipeline(steps=[('vectorizer', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',\n",
133 | " dtype=, encoding=u'utf-8', input=u'content',\n",
134 | " lowercase=True, max_df=66923, max_features=None, min_df=2,\n",
135 | " ngram_range=(1, 2), preprocessor=None, stop_words='english...',\n",
136 | " penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n",
137 | " verbose=0))])"
138 | ]
139 | },
140 | "execution_count": 22,
141 | "metadata": {},
142 | "output_type": "execute_result"
143 | }
144 | ],
145 | "source": [
146 | "pipeline.fit(np.asarray(texts), np.asarray(classes))"
147 | ]
148 | },
149 | {
150 | "cell_type": "code",
151 | "execution_count": 27,
152 | "metadata": {
153 | "collapsed": false
154 | },
155 | "outputs": [
156 | {
157 | "name": "stdout",
158 | "output_type": "stream",
159 | "text": [
160 | "['K']\n"
161 | ]
162 | }
163 | ],
164 | "source": [
165 | "print pipeline.predict(['LAWYER'])"
166 | ]
167 | },
168 | {
169 | "cell_type": "code",
170 | "execution_count": 28,
171 | "metadata": {
172 | "collapsed": false
173 | },
174 | "outputs": [
175 | {
176 | "name": "stdout",
177 | "output_type": "stream",
178 | "text": [
179 | "['K']\n"
180 | ]
181 | }
182 | ],
183 | "source": [
184 | "print pipeline.predict(['SKADDEN ARPS'])"
185 | ]
186 | },
187 | {
188 | "cell_type": "code",
189 | "execution_count": 29,
190 | "metadata": {
191 | "collapsed": false,
192 | "scrolled": true
193 | },
194 | "outputs": [
195 | {
196 | "name": "stdout",
197 | "output_type": "stream",
198 | "text": [
199 | "['J']\n"
200 | ]
201 | }
202 | ],
203 | "source": [
204 | "print pipeline.predict(['COMPUTER PROGRAMMER'])"
205 | ]
206 | },
207 | {
208 | "cell_type": "code",
209 | "execution_count": null,
210 | "metadata": {
211 | "collapsed": true
212 | },
213 | "outputs": [],
214 | "source": []
215 | }
216 | ],
217 | "metadata": {
218 | "kernelspec": {
219 | "display_name": "Python 2",
220 | "language": "python",
221 | "name": "python2"
222 | },
223 | "language_info": {
224 | "codemirror_mode": {
225 | "name": "ipython",
226 | "version": 2
227 | },
228 | "file_extension": ".py",
229 | "mimetype": "text/x-python",
230 | "name": "python",
231 | "nbconvert_exporter": "python",
232 | "pygments_lexer": "ipython2",
233 | "version": "2.7.9"
234 | }
235 | },
236 | "nbformat": 4,
237 | "nbformat_minor": 0
238 | }
239 |
--------------------------------------------------------------------------------
/class3_1/README.md:
--------------------------------------------------------------------------------
1 | # Algorithms: Week 3, Class 1 (Tuesday, July 28)
2 |
3 | Today we'll be reviewing a bit of material from last week, including both your lab assignments from last Tuesday and Thursday's lesson on regression, and then expanding on the latter in a couple ways.
4 |
5 | First we'll review and (roughly) recreate a couple of stories that have employed regression in basic ways. Then, in the interests of transparency, we'll talk a bit about what's going on under the hood with the algorithms that underlie regression.
6 |
7 | Finally we'll touch briefly on the idea of classification using a portion of a project we're currently implementing at the Times.
8 |
9 | ## Hour 1/1.5: Review
10 |
11 | First we'll talk through some of your story ideas and critiques from last Tuesday. Then we'll revisit some basic regression concepts from last week using [this iPython notebook](https://github.com/datapolitan/lede_algorithms/blob/master/class3_1/regression_review.ipynb), which (very roughly) mimics a project that the St. Paul Pioneer Press did in 2006 and 2010, known as [Schools that Work](http://www.twincities.com/ci_15487174).
12 |
13 | ## Hour 2: A closer look at regression
14 |
15 | Journalists tend to look at linear regression through a statistical lens and use it primarily to describe things, as in the case above. You can see another example here:
16 |
17 | - [Race gap found in pothole patching](http://www.jsonline.com/watchdog/watchdogreports/32580034.html) (Milwaukee Journal Sentinel). And the [associated explainer](http://www.jsonline.com/news/milwaukee/32580074.html).
18 |
19 | But looked at another way, linear regression is also a predictive model -- one that, at scale, is based on an algorithm that we can demystify, per our conversations last week. We'll spend a short amount of time talking about how that works and relate it (hypothetically) to this [fun story](http://fivethirtyeight.com/features/donald-trump-is-the-worlds-greatest-troll/) from FiveThirtyEight.
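
To make the prediction framing concrete, here's a tiny fit-then-predict sketch on made-up numbers (not the class dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: predict a school's test score from the share of low-income
# students. The numbers are purely illustrative.
x = np.array([[10], [30], [50], [70], [90]])   # percent low-income students
y = np.array([880, 840, 790, 740, 700])        # test score

model = LinearRegression().fit(x, y)
print(model.coef_, model.intercept_)   # the fitted line
print(model.predict([[60]]))           # predicted score for an unseen school
```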
20 |
21 | ## Hour 3: Introduction to classification
22 |
23 | Using a [project](https://github.com/datapolitan/lede_algorithms/blob/master/class3_1/classification.ipynb) we've been working on at the Times, we'll expand our idea of supervised learning to include something that seems a bit more like what you might consider "machine learning" -- classifying people's jobs based on strings representing their occupation and employer.
24 |
25 | We'll also discuss how lots of data problems in journalism are secretly classification problems, including things like [sorting through documents](https://github.com/cjdd3b/nicar2014/tree/master/lightning-talk/naive-bayes) and [extracting quotes from news articles](https://github.com/cjdd3b/citizen-quotes).
26 |
27 | ## Lab
28 |
29 | Like last week, you'll be doing two things in the lab today:
30 |
31 | First you'll expand the schools analysis we did earlier by layering in other variables ([documented here](http://www.cde.ca.gov/ta/ac/ap/reclayout12b.asp)) using multiple regression, interpreting the results, and again writing two ledes about what you found. Back those ledes up with some internet research. If you find some schools that are over/under-performing or have other interesting characteristics, Google around to see what has been written about them. It's a good way to check your assumptions and to find other interesting facts to round out your story pitches.
32 |
33 | This of course comes with a huge, blinking-red caveat: This is an algorithms class, and we're not getting deep enough into the guts of statistical regression for you to run out and write full-on stories based on your findings. There are things like [p-values](http://blog.minitab.com/blog/adventures-in-statistics/how-to-interpret-regression-analysis-results-p-values-and-coefficients) to consider, as well as rules of thumb for interpreting r-squared. If you'd like to get more in depth with that, we can carve out some time later in the course.
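
If you'd like a scaffold for the multiple-regression part of the assignment, here's a minimal sketch using statsmodels. The outcome and predictor names below are placeholders, not the real field names -- pull the actual columns from the record layout linked above.

```python
import pandas as pd
import statsmodels.formula.api as smf

# 'data/apib12tx.csv' is the schools file from class; the outcome and predictor
# columns in the formula are hypothetical -- swap in real fields from the layout docs.
df = pd.read_csv('data/apib12tx.csv')
model = smf.ols('api_score ~ pct_free_lunch + pct_english_learners', data=df).fit()

print(model.summary())   # coefficients, r-squared and p-values to interpret
```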
34 |
35 | Your second assignment today is to write a short story (300-500 words) about [this company](https://www.upstart.com/), which is a startup that uses predictive models to assess creditworthiness using variables that go beyond credit score. No doubt their model is more complex than this, but you can think of the intuition as being similar to regression -- a handful of independent variables that help predict the likelihood that someone will pay their loan back. What are the implications of this? Why might it be good or bad for consumers if this catches on?
--------------------------------------------------------------------------------
/class3_1/classification.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Classification in the Wild"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "This code comes straight from a Times project that helps us standardize campaign finance data to enable new types of analyses. Specifically, it tries to categorize a free-form occupation/employer string into a discrete job category (for example, the strings \"LAWYER\" and \"ATTORNEY\" would both be categorized under \"LAW\").\n",
15 | "\n",
16 | "We use this to create one of a large number of features that inform the larger predictive model we use for standardization. But it also shows the power of simple classification in action."
17 | ]
18 | },
19 | {
20 | "cell_type": "code",
21 | "execution_count": 2,
22 | "metadata": {
23 | "collapsed": true
24 | },
25 | "outputs": [],
26 | "source": [
27 | "import csv, re, string\n",
28 | "import numpy as np\n",
29 | "from sklearn.linear_model import LogisticRegression\n",
30 | "from sklearn.feature_extraction.text import CountVectorizer\n",
31 | "from sklearn.pipeline import Pipeline"
32 | ]
33 | },
34 | {
35 | "cell_type": "code",
36 | "execution_count": 30,
37 | "metadata": {
38 | "collapsed": false
39 | },
40 | "outputs": [],
41 | "source": [
42 | "# Some basic setup for data-cleaning purposes\n",
43 | "PUNCTUATION = re.compile('[%s]' % re.escape(string.punctuation))\n",
44 | "VALID_CLASSES = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'J', 'K', 'L', 'M', 'T', 'X', 'Z']"
45 | ]
46 | },
47 | {
48 | "cell_type": "code",
49 | "execution_count": 16,
50 | "metadata": {
51 | "collapsed": false
52 | },
53 | "outputs": [],
54 | "source": [
55 | "# Open the training data and clean it up a bit\n",
56 | "data = []\n",
57 | "with open('data/category-training.csv', 'r') as f:\n",
58 | " inputreader = csv.reader(f, delimiter=',', quotechar='\"')\n",
59 | " for r in inputreader:\n",
60 | " # Concatenate the occupation and employer strings together and remove\n",
61 | " # punctuation. Both occupation and employer will be used in prediction.\n",
62 | " text = PUNCTUATION.sub('', ' '.join(r[0:2]))\n",
63 | " if len(r[2]) > 1 and r[2][0] in VALID_CLASSES:\n",
64 | " # We're only attempting to classify the first character of the\n",
65 | " # industry prefix (\"A\", \"B\", etc.) -- not the whole thing. That's\n",
66 | " # what the r[2][0] piece is about.\n",
67 | " data.append([text, r[2][0]])"
68 | ]
69 | },
70 | {
71 | "cell_type": "code",
72 | "execution_count": 18,
73 | "metadata": {
74 | "collapsed": true
75 | },
76 | "outputs": [],
77 | "source": [
78 | " # Separate the text of the occupation/employer strings from the correct classification\n",
79 | " texts = np.array([el[0] for el in data])\n",
80 | " classes = np.array([el[1] for el in data])"
81 | ]
82 | },
83 | {
84 | "cell_type": "code",
85 | "execution_count": 19,
86 | "metadata": {
87 | "collapsed": false
88 | },
89 | "outputs": [
90 | {
91 | "name": "stdout",
92 | "output_type": "stream",
93 | "text": [
94 | "['Owner First Priority Title Llc' 'SENIOR PARTNER ARES MANAGEMENT'\n",
95 | " 'CEO HB AGENCY' ..., 'INVESTMENT EXECUTIVE FEF MANAGEMENT LLC'\n",
96 | " 'Owner Fair Funeral Home' 'ST MARTIN LIRERRE LAW FIRM ']\n"
97 | ]
98 | }
99 | ],
100 | "source": [
101 | "print texts"
102 | ]
103 | },
104 | {
105 | "cell_type": "code",
106 | "execution_count": 20,
107 | "metadata": {
108 | "collapsed": false
109 | },
110 | "outputs": [
111 | {
112 | "name": "stdout",
113 | "output_type": "stream",
114 | "text": [
115 | "['F' 'Z' 'Z' ..., 'F' 'G' 'K']\n"
116 | ]
117 | }
118 | ],
119 | "source": [
120 | "print classes"
121 | ]
122 | },
123 | {
124 | "cell_type": "code",
125 | "execution_count": 31,
126 | "metadata": {
127 | "collapsed": true
128 | },
129 | "outputs": [],
130 | "source": [
131 | "# Build a simple machine learning pipeline to turn the above arrays into something scikit-learn understands\n",
132 | "pipeline = Pipeline([\n",
133 | " ('vectorizer', CountVectorizer(\n",
134 | " ngram_range=(1,2),\n",
135 | " stop_words='english',\n",
136 | " min_df=2,\n",
137 | " max_df=len(texts))),\n",
138 | " ('classifier', LogisticRegression())\n",
139 | "])"
140 | ]
141 | },
142 | {
143 | "cell_type": "code",
144 | "execution_count": 32,
145 | "metadata": {
146 | "collapsed": false
147 | },
148 | "outputs": [
149 | {
150 | "data": {
151 | "text/plain": [
152 | "Pipeline(steps=[('vectorizer', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',\n",
153 | " dtype=, encoding=u'utf-8', input=u'content',\n",
154 | " lowercase=True, max_df=66923, max_features=None, min_df=2,\n",
155 | " ngram_range=(1, 2), preprocessor=None, stop_words='english...',\n",
156 | " penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n",
157 | " verbose=0))])"
158 | ]
159 | },
160 | "execution_count": 32,
161 | "metadata": {},
162 | "output_type": "execute_result"
163 | }
164 | ],
165 | "source": [
166 | "# Fit the model\n",
167 | "pipeline.fit(np.asarray(texts), np.asarray(classes))"
168 | ]
169 | },
170 | {
171 | "cell_type": "code",
172 | "execution_count": 27,
173 | "metadata": {
174 | "collapsed": false
175 | },
176 | "outputs": [
177 | {
178 | "name": "stdout",
179 | "output_type": "stream",
180 | "text": [
181 | "['K']\n"
182 | ]
183 | }
184 | ],
185 | "source": [
186 | "# Now, run some predictions. \"K\" means \"LAW\" in this case.\n",
187 | "print pipeline.predict(['LAWYER'])"
188 | ]
189 | },
190 | {
191 | "cell_type": "code",
192 | "execution_count": 28,
193 | "metadata": {
194 | "collapsed": false
195 | },
196 | "outputs": [
197 | {
198 | "name": "stdout",
199 | "output_type": "stream",
200 | "text": [
201 | "['K']\n"
202 | ]
203 | }
204 | ],
205 | "source": [
206 | "# It also recognizes law firms!\n",
207 | "print pipeline.predict(['SKADDEN ARPS'])"
208 | ]
209 | },
210 | {
211 | "cell_type": "code",
212 | "execution_count": 34,
213 | "metadata": {
214 | "collapsed": false,
215 | "scrolled": true
216 | },
217 | "outputs": [
218 | {
219 | "name": "stdout",
220 | "output_type": "stream",
221 | "text": [
222 | "['F']\n"
223 | ]
224 | }
225 | ],
226 | "source": [
227 | "# The \"F\" category represents business and finance.\n",
228 | "print pipeline.predict(['CEO'])"
229 | ]
230 | }
231 | ],
232 | "metadata": {
233 | "kernelspec": {
234 | "display_name": "Python 2",
235 | "language": "python",
236 | "name": "python2"
237 | },
238 | "language_info": {
239 | "codemirror_mode": {
240 | "name": "ipython",
241 | "version": 2
242 | },
243 | "file_extension": ".py",
244 | "mimetype": "text/x-python",
245 | "name": "python",
246 | "nbconvert_exporter": "python",
247 | "pygments_lexer": "ipython2",
248 | "version": "2.7.9"
249 | }
250 | },
251 | "nbformat": 4,
252 | "nbformat_minor": 0
253 | }
254 |
--------------------------------------------------------------------------------
/class3_2/3-2_DoNow.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "##1. Import the necessary packages to read in the data, plot, and create a linear regression model"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": null,
13 | "metadata": {
14 | "collapsed": true
15 | },
16 | "outputs": [],
17 | "source": []
18 | },
19 | {
20 | "cell_type": "markdown",
21 | "metadata": {},
22 | "source": [
23 | "## 2. Read in the hanford.csv file "
24 | ]
25 | },
26 | {
27 | "cell_type": "code",
28 | "execution_count": null,
29 | "metadata": {
30 | "collapsed": true
31 | },
32 | "outputs": [],
33 | "source": []
34 | },
35 | {
36 | "cell_type": "markdown",
37 | "metadata": {},
38 | "source": [
39 | "
"
40 | ]
41 | },
42 | {
43 | "cell_type": "markdown",
44 | "metadata": {},
45 | "source": [
46 | "## 3. Calculate the basic descriptive statistics on the data"
47 | ]
48 | },
49 | {
50 | "cell_type": "code",
51 | "execution_count": null,
52 | "metadata": {
53 | "collapsed": false
54 | },
55 | "outputs": [],
56 | "source": []
57 | },
58 | {
59 | "cell_type": "markdown",
60 | "metadata": {},
61 | "source": [
62 | "## 4. Calculate the coefficient of correlation (r) and generate the scatter plot. Does there seem to be a correlation worthy of investigation?"
63 | ]
64 | },
65 | {
66 | "cell_type": "code",
67 | "execution_count": null,
68 | "metadata": {
69 | "collapsed": false
70 | },
71 | "outputs": [],
72 | "source": []
73 | },
74 | {
75 | "cell_type": "markdown",
76 | "metadata": {},
77 | "source": [
78 | "## 5. Create a linear regression model based on the available data to predict the mortality rate given a level of exposure"
79 | ]
80 | },
81 | {
82 | "cell_type": "code",
83 | "execution_count": null,
84 | "metadata": {
85 | "collapsed": true
86 | },
87 | "outputs": [],
88 | "source": []
89 | },
90 | {
91 | "cell_type": "markdown",
92 | "metadata": {
93 | "collapsed": true
94 | },
95 | "source": [
96 | "## 6. Plot the linear regression line on the scatter plot of values. Calculate the r^2 (coefficient of determination)"
97 | ]
98 | },
99 | {
100 | "cell_type": "code",
101 | "execution_count": null,
102 | "metadata": {
103 | "collapsed": false
104 | },
105 | "outputs": [],
106 | "source": []
107 | },
108 | {
109 | "cell_type": "markdown",
110 | "metadata": {
111 | "collapsed": true
112 | },
113 | "source": [
114 | "## 7. Predict the mortality rate (Cancer per 100,000 man years) given an index of exposure = 10"
115 | ]
116 | },
117 | {
118 | "cell_type": "code",
119 | "execution_count": null,
120 | "metadata": {
121 | "collapsed": true
122 | },
123 | "outputs": [],
124 | "source": []
125 | }
126 | ],
127 | "metadata": {
128 | "kernelspec": {
129 | "display_name": "Python 2",
130 | "language": "python",
131 | "name": "python2"
132 | },
133 | "language_info": {
134 | "codemirror_mode": {
135 | "name": "ipython",
136 | "version": 2
137 | },
138 | "file_extension": ".py",
139 | "mimetype": "text/x-python",
140 | "name": "python",
141 | "nbconvert_exporter": "python",
142 | "pygments_lexer": "ipython2",
143 | "version": "2.7.10"
144 | }
145 | },
146 | "nbformat": 4,
147 | "nbformat_minor": 0
148 | }
149 |
--------------------------------------------------------------------------------
/class3_2/3-2_Exercises.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "##We covered a lot of information today and I'd like you to practice developing classification trees on your own. For each exercise, work through the problem, determine the result, and provide the requested interpretation in comments along with the code. The point is to build classifiers, not necessarily good classifiers (that will hopefully come later)"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "###1. Load the iris dataset and create a holdout set that is 50% of the data (50% in training and 50% in test). Output the results (don't worry about creating the tree visual unless you'd like to) and discuss them briefly (are they good or not?)"
15 | ]
16 | },
17 | {
18 | "cell_type": "code",
19 | "execution_count": null,
20 | "metadata": {
21 | "collapsed": true
22 | },
23 | "outputs": [],
24 | "source": []
25 | },
26 | {
27 | "cell_type": "markdown",
28 | "metadata": {},
29 | "source": [
30 | "###2. Redo the model with a 75% - 25% training/test split and compare the results. Are they better or worse than before? Discuss why this may be."
31 | ]
32 | },
33 | {
34 | "cell_type": "code",
35 | "execution_count": null,
36 | "metadata": {
37 | "collapsed": true
38 | },
39 | "outputs": [],
40 | "source": []
41 | },
42 | {
43 | "cell_type": "markdown",
44 | "metadata": {},
45 | "source": [
46 | "###3. Perform 10-fold cross validation on the data and compare your results to the hold out method we used in 1 and 2. Take the average of the results. What do you notice about the accuracy measures in each of these?"
47 | ]
48 | },
49 | {
50 | "cell_type": "code",
51 | "execution_count": null,
52 | "metadata": {
53 | "collapsed": true
54 | },
55 | "outputs": [],
56 | "source": []
57 | },
58 | {
59 | "cell_type": "markdown",
60 | "metadata": {},
61 | "source": [
62 | "###4. Open the seeds_dataset.txt and perform basic exploratory analysis. What attributes to we have? What are we trying to predict?\n",
63 | "For context of the data, see the documentation here: https://archive.ics.uci.edu/ml/datasets/seeds"
64 | ]
65 | },
66 | {
67 | "cell_type": "code",
68 | "execution_count": null,
69 | "metadata": {
70 | "collapsed": true
71 | },
72 | "outputs": [],
73 | "source": []
74 | },
75 | {
76 | "cell_type": "markdown",
77 | "metadata": {},
78 | "source": [
79 | "###5. Using the seeds_dataset.txt, create a classifier to predict the type of seed. Perform the above hold out evaluation (50-50, 75-25, 10-fold cross validation) and discuss the results."
80 | ]
81 | },
82 | {
83 | "cell_type": "code",
84 | "execution_count": null,
85 | "metadata": {
86 | "collapsed": true
87 | },
88 | "outputs": [],
89 | "source": []
90 | }
91 | ],
92 | "metadata": {
93 | "kernelspec": {
94 | "display_name": "Python 2",
95 | "language": "python",
96 | "name": "python2"
97 | },
98 | "language_info": {
99 | "codemirror_mode": {
100 | "name": "ipython",
101 | "version": 2
102 | },
103 | "file_extension": ".py",
104 | "mimetype": "text/x-python",
105 | "name": "python",
106 | "nbconvert_exporter": "python",
107 | "pygments_lexer": "ipython2",
108 | "version": "2.7.10"
109 | }
110 | },
111 | "nbformat": 4,
112 | "nbformat_minor": 0
113 | }
114 |
--------------------------------------------------------------------------------
/class3_2/data/hanford.csv:
--------------------------------------------------------------------------------
1 | County,Exposure,Mortality
2 | Umatilla,2.49,147.1
3 | Morrow,2.57,130.1
4 | Gilliam,3.41,129.9
5 | Sherman,1.25,113.5
6 | Wasco,1.62,137.5
7 | HoodRiver,3.83,162.3
8 | Portland,11.64,207.5
9 | Columbia,6.41,177.9
10 | Clatsop,8.34,210.3
--------------------------------------------------------------------------------
/class3_2/data/hanford.txt:
--------------------------------------------------------------------------------
1 | County Exposure Mortality
2 | Umatilla 2.49 147.1
3 | Morrow 2.57 130.1
4 | Gilliam 3.41 129.9
5 | Sherman 1.25 113.5
6 | Wasco 1.62 137.5
7 | HoodRiver 3.83 162.3
8 | Portland 11.64 207.5
9 | Columbia 6.41 177.9
10 | Clatsop 8.34 210.3
11 |
--------------------------------------------------------------------------------
/class3_2/data/iris.csv:
--------------------------------------------------------------------------------
1 | SepalLength,SepalWidth,PetalLength,PetalWidth,Name
2 | 5.1,3.5,1.4,0.2,Iris-setosa
3 | 4.9,3.0,1.4,0.2,Iris-setosa
4 | 4.7,3.2,1.3,0.2,Iris-setosa
5 | 4.6,3.1,1.5,0.2,Iris-setosa
6 | 5.0,3.6,1.4,0.2,Iris-setosa
7 | 5.4,3.9,1.7,0.4,Iris-setosa
8 | 4.6,3.4,1.4,0.3,Iris-setosa
9 | 5.0,3.4,1.5,0.2,Iris-setosa
10 | 4.4,2.9,1.4,0.2,Iris-setosa
11 | 4.9,3.1,1.5,0.1,Iris-setosa
12 | 5.4,3.7,1.5,0.2,Iris-setosa
13 | 4.8,3.4,1.6,0.2,Iris-setosa
14 | 4.8,3.0,1.4,0.1,Iris-setosa
15 | 4.3,3.0,1.1,0.1,Iris-setosa
16 | 5.8,4.0,1.2,0.2,Iris-setosa
17 | 5.7,4.4,1.5,0.4,Iris-setosa
18 | 5.4,3.9,1.3,0.4,Iris-setosa
19 | 5.1,3.5,1.4,0.3,Iris-setosa
20 | 5.7,3.8,1.7,0.3,Iris-setosa
21 | 5.1,3.8,1.5,0.3,Iris-setosa
22 | 5.4,3.4,1.7,0.2,Iris-setosa
23 | 5.1,3.7,1.5,0.4,Iris-setosa
24 | 4.6,3.6,1.0,0.2,Iris-setosa
25 | 5.1,3.3,1.7,0.5,Iris-setosa
26 | 4.8,3.4,1.9,0.2,Iris-setosa
27 | 5.0,3.0,1.6,0.2,Iris-setosa
28 | 5.0,3.4,1.6,0.4,Iris-setosa
29 | 5.2,3.5,1.5,0.2,Iris-setosa
30 | 5.2,3.4,1.4,0.2,Iris-setosa
31 | 4.7,3.2,1.6,0.2,Iris-setosa
32 | 4.8,3.1,1.6,0.2,Iris-setosa
33 | 5.4,3.4,1.5,0.4,Iris-setosa
34 | 5.2,4.1,1.5,0.1,Iris-setosa
35 | 5.5,4.2,1.4,0.2,Iris-setosa
36 | 4.9,3.1,1.5,0.1,Iris-setosa
37 | 5.0,3.2,1.2,0.2,Iris-setosa
38 | 5.5,3.5,1.3,0.2,Iris-setosa
39 | 4.9,3.1,1.5,0.1,Iris-setosa
40 | 4.4,3.0,1.3,0.2,Iris-setosa
41 | 5.1,3.4,1.5,0.2,Iris-setosa
42 | 5.0,3.5,1.3,0.3,Iris-setosa
43 | 4.5,2.3,1.3,0.3,Iris-setosa
44 | 4.4,3.2,1.3,0.2,Iris-setosa
45 | 5.0,3.5,1.6,0.6,Iris-setosa
46 | 5.1,3.8,1.9,0.4,Iris-setosa
47 | 4.8,3.0,1.4,0.3,Iris-setosa
48 | 5.1,3.8,1.6,0.2,Iris-setosa
49 | 4.6,3.2,1.4,0.2,Iris-setosa
50 | 5.3,3.7,1.5,0.2,Iris-setosa
51 | 5.0,3.3,1.4,0.2,Iris-setosa
52 | 7.0,3.2,4.7,1.4,Iris-versicolor
53 | 6.4,3.2,4.5,1.5,Iris-versicolor
54 | 6.9,3.1,4.9,1.5,Iris-versicolor
55 | 5.5,2.3,4.0,1.3,Iris-versicolor
56 | 6.5,2.8,4.6,1.5,Iris-versicolor
57 | 5.7,2.8,4.5,1.3,Iris-versicolor
58 | 6.3,3.3,4.7,1.6,Iris-versicolor
59 | 4.9,2.4,3.3,1.0,Iris-versicolor
60 | 6.6,2.9,4.6,1.3,Iris-versicolor
61 | 5.2,2.7,3.9,1.4,Iris-versicolor
62 | 5.0,2.0,3.5,1.0,Iris-versicolor
63 | 5.9,3.0,4.2,1.5,Iris-versicolor
64 | 6.0,2.2,4.0,1.0,Iris-versicolor
65 | 6.1,2.9,4.7,1.4,Iris-versicolor
66 | 5.6,2.9,3.6,1.3,Iris-versicolor
67 | 6.7,3.1,4.4,1.4,Iris-versicolor
68 | 5.6,3.0,4.5,1.5,Iris-versicolor
69 | 5.8,2.7,4.1,1.0,Iris-versicolor
70 | 6.2,2.2,4.5,1.5,Iris-versicolor
71 | 5.6,2.5,3.9,1.1,Iris-versicolor
72 | 5.9,3.2,4.8,1.8,Iris-versicolor
73 | 6.1,2.8,4.0,1.3,Iris-versicolor
74 | 6.3,2.5,4.9,1.5,Iris-versicolor
75 | 6.1,2.8,4.7,1.2,Iris-versicolor
76 | 6.4,2.9,4.3,1.3,Iris-versicolor
77 | 6.6,3.0,4.4,1.4,Iris-versicolor
78 | 6.8,2.8,4.8,1.4,Iris-versicolor
79 | 6.7,3.0,5.0,1.7,Iris-versicolor
80 | 6.0,2.9,4.5,1.5,Iris-versicolor
81 | 5.7,2.6,3.5,1.0,Iris-versicolor
82 | 5.5,2.4,3.8,1.1,Iris-versicolor
83 | 5.5,2.4,3.7,1.0,Iris-versicolor
84 | 5.8,2.7,3.9,1.2,Iris-versicolor
85 | 6.0,2.7,5.1,1.6,Iris-versicolor
86 | 5.4,3.0,4.5,1.5,Iris-versicolor
87 | 6.0,3.4,4.5,1.6,Iris-versicolor
88 | 6.7,3.1,4.7,1.5,Iris-versicolor
89 | 6.3,2.3,4.4,1.3,Iris-versicolor
90 | 5.6,3.0,4.1,1.3,Iris-versicolor
91 | 5.5,2.5,4.0,1.3,Iris-versicolor
92 | 5.5,2.6,4.4,1.2,Iris-versicolor
93 | 6.1,3.0,4.6,1.4,Iris-versicolor
94 | 5.8,2.6,4.0,1.2,Iris-versicolor
95 | 5.0,2.3,3.3,1.0,Iris-versicolor
96 | 5.6,2.7,4.2,1.3,Iris-versicolor
97 | 5.7,3.0,4.2,1.2,Iris-versicolor
98 | 5.7,2.9,4.2,1.3,Iris-versicolor
99 | 6.2,2.9,4.3,1.3,Iris-versicolor
100 | 5.1,2.5,3.0,1.1,Iris-versicolor
101 | 5.7,2.8,4.1,1.3,Iris-versicolor
102 | 6.3,3.3,6.0,2.5,Iris-virginica
103 | 5.8,2.7,5.1,1.9,Iris-virginica
104 | 7.1,3.0,5.9,2.1,Iris-virginica
105 | 6.3,2.9,5.6,1.8,Iris-virginica
106 | 6.5,3.0,5.8,2.2,Iris-virginica
107 | 7.6,3.0,6.6,2.1,Iris-virginica
108 | 4.9,2.5,4.5,1.7,Iris-virginica
109 | 7.3,2.9,6.3,1.8,Iris-virginica
110 | 6.7,2.5,5.8,1.8,Iris-virginica
111 | 7.2,3.6,6.1,2.5,Iris-virginica
112 | 6.5,3.2,5.1,2.0,Iris-virginica
113 | 6.4,2.7,5.3,1.9,Iris-virginica
114 | 6.8,3.0,5.5,2.1,Iris-virginica
115 | 5.7,2.5,5.0,2.0,Iris-virginica
116 | 5.8,2.8,5.1,2.4,Iris-virginica
117 | 6.4,3.2,5.3,2.3,Iris-virginica
118 | 6.5,3.0,5.5,1.8,Iris-virginica
119 | 7.7,3.8,6.7,2.2,Iris-virginica
120 | 7.7,2.6,6.9,2.3,Iris-virginica
121 | 6.0,2.2,5.0,1.5,Iris-virginica
122 | 6.9,3.2,5.7,2.3,Iris-virginica
123 | 5.6,2.8,4.9,2.0,Iris-virginica
124 | 7.7,2.8,6.7,2.0,Iris-virginica
125 | 6.3,2.7,4.9,1.8,Iris-virginica
126 | 6.7,3.3,5.7,2.1,Iris-virginica
127 | 7.2,3.2,6.0,1.8,Iris-virginica
128 | 6.2,2.8,4.8,1.8,Iris-virginica
129 | 6.1,3.0,4.9,1.8,Iris-virginica
130 | 6.4,2.8,5.6,2.1,Iris-virginica
131 | 7.2,3.0,5.8,1.6,Iris-virginica
132 | 7.4,2.8,6.1,1.9,Iris-virginica
133 | 7.9,3.8,6.4,2.0,Iris-virginica
134 | 6.4,2.8,5.6,2.2,Iris-virginica
135 | 6.3,2.8,5.1,1.5,Iris-virginica
136 | 6.1,2.6,5.6,1.4,Iris-virginica
137 | 7.7,3.0,6.1,2.3,Iris-virginica
138 | 6.3,3.4,5.6,2.4,Iris-virginica
139 | 6.4,3.1,5.5,1.8,Iris-virginica
140 | 6.0,3.0,4.8,1.8,Iris-virginica
141 | 6.9,3.1,5.4,2.1,Iris-virginica
142 | 6.7,3.1,5.6,2.4,Iris-virginica
143 | 6.9,3.1,5.1,2.3,Iris-virginica
144 | 5.8,2.7,5.1,1.9,Iris-virginica
145 | 6.8,3.2,5.9,2.3,Iris-virginica
146 | 6.7,3.3,5.7,2.5,Iris-virginica
147 | 6.7,3.0,5.2,2.3,Iris-virginica
148 | 6.3,2.5,5.0,1.9,Iris-virginica
149 | 6.5,3.0,5.2,2.0,Iris-virginica
150 | 6.2,3.4,5.4,2.3,Iris-virginica
151 | 5.9,3.0,5.1,1.8,Iris-virginica
--------------------------------------------------------------------------------
/class3_2/data/seeds_dataset.txt:
--------------------------------------------------------------------------------
1 | 15.26,14.84,0.871,5.763,3.312,2.221,5.22,1
2 | 14.88,14.57,0.8811,5.554,3.333,1.018,4.956,1
3 | 14.29,14.09,0.905,5.291,3.337,2.699,4.825,1
4 | 13.84,13.94,0.8955,5.324,3.379,2.259,4.805,1
5 | 16.14,14.99,0.9034,5.658,3.562,1.355,5.175,1
6 | 14.38,14.21,0.8951,5.386,3.312,2.462,4.956,1
7 | 14.69,14.49,0.8799,5.563,3.259,3.586,5.219,1
8 | 14.11,14.1,0.8911,5.42,3.302,2.7,5,1
9 | 16.63,15.46,0.8747,6.053,3.465,2.04,5.877,1
10 | 16.44,15.25,0.888,5.884,3.505,1.969,5.533,1
11 | 15.26,14.85,0.8696,5.714,3.242,4.543,5.314,1
12 | 14.03,14.16,0.8796,5.438,3.201,1.717,5.001,1
13 | 13.89,14.02,0.888,5.439,3.199,3.986,4.738,1
14 | 13.78,14.06,0.8759,5.479,3.156,3.136,4.872,1
15 | 13.74,14.05,0.8744,5.482,3.114,2.932,4.825,1
16 | 14.59,14.28,0.8993,5.351,3.333,4.185,4.781,1
17 | 13.99,13.83,0.9183,5.119,3.383,5.234,4.781,1
18 | 15.69,14.75,0.9058,5.527,3.514,1.599,5.046,1
19 | 14.7,14.21,0.9153,5.205,3.466,1.767,4.649,1
20 | 12.72,13.57,0.8686,5.226,3.049,4.102,4.914,1
21 | 14.16,14.4,0.8584,5.658,3.129,3.072,5.176,1
22 | 14.11,14.26,0.8722,5.52,3.168,2.688,5.219,1
23 | 15.88,14.9,0.8988,5.618,3.507,0.7651,5.091,1
24 | 12.08,13.23,0.8664,5.099,2.936,1.415,4.961,1
25 | 15.01,14.76,0.8657,5.789,3.245,1.791,5.001,1
26 | 16.19,15.16,0.8849,5.833,3.421,0.903,5.307,1
27 | 13.02,13.76,0.8641,5.395,3.026,3.373,4.825,1
28 | 12.74,13.67,0.8564,5.395,2.956,2.504,4.869,1
29 | 14.11,14.18,0.882,5.541,3.221,2.754,5.038,1
30 | 13.45,14.02,0.8604,5.516,3.065,3.531,5.097,1
31 | 13.16,13.82,0.8662,5.454,2.975,0.8551,5.056,1
32 | 15.49,14.94,0.8724,5.757,3.371,3.412,5.228,1
33 | 14.09,14.41,0.8529,5.717,3.186,3.92,5.299,1
34 | 13.94,14.17,0.8728,5.585,3.15,2.124,5.012,1
35 | 15.05,14.68,0.8779,5.712,3.328,2.129,5.36,1
36 | 16.12,15,0.9,5.709,3.485,2.27,5.443,1
37 | 16.2,15.27,0.8734,5.826,3.464,2.823,5.527,1
38 | 17.08,15.38,0.9079,5.832,3.683,2.956,5.484,1
39 | 14.8,14.52,0.8823,5.656,3.288,3.112,5.309,1
40 | 14.28,14.17,0.8944,5.397,3.298,6.685,5.001,1
41 | 13.54,13.85,0.8871,5.348,3.156,2.587,5.178,1
42 | 13.5,13.85,0.8852,5.351,3.158,2.249,5.176,1
43 | 13.16,13.55,0.9009,5.138,3.201,2.461,4.783,1
44 | 15.5,14.86,0.882,5.877,3.396,4.711,5.528,1
45 | 15.11,14.54,0.8986,5.579,3.462,3.128,5.18,1
46 | 13.8,14.04,0.8794,5.376,3.155,1.56,4.961,1
47 | 15.36,14.76,0.8861,5.701,3.393,1.367,5.132,1
48 | 14.99,14.56,0.8883,5.57,3.377,2.958,5.175,1
49 | 14.79,14.52,0.8819,5.545,3.291,2.704,5.111,1
50 | 14.86,14.67,0.8676,5.678,3.258,2.129,5.351,1
51 | 14.43,14.4,0.8751,5.585,3.272,3.975,5.144,1
52 | 15.78,14.91,0.8923,5.674,3.434,5.593,5.136,1
53 | 14.49,14.61,0.8538,5.715,3.113,4.116,5.396,1
54 | 14.33,14.28,0.8831,5.504,3.199,3.328,5.224,1
55 | 14.52,14.6,0.8557,5.741,3.113,1.481,5.487,1
56 | 15.03,14.77,0.8658,5.702,3.212,1.933,5.439,1
57 | 14.46,14.35,0.8818,5.388,3.377,2.802,5.044,1
58 | 14.92,14.43,0.9006,5.384,3.412,1.142,5.088,1
59 | 15.38,14.77,0.8857,5.662,3.419,1.999,5.222,1
60 | 12.11,13.47,0.8392,5.159,3.032,1.502,4.519,1
61 | 11.42,12.86,0.8683,5.008,2.85,2.7,4.607,1
62 | 11.23,12.63,0.884,4.902,2.879,2.269,4.703,1
63 | 12.36,13.19,0.8923,5.076,3.042,3.22,4.605,1
64 | 13.22,13.84,0.868,5.395,3.07,4.157,5.088,1
65 | 12.78,13.57,0.8716,5.262,3.026,1.176,4.782,1
66 | 12.88,13.5,0.8879,5.139,3.119,2.352,4.607,1
67 | 14.34,14.37,0.8726,5.63,3.19,1.313,5.15,1
68 | 14.01,14.29,0.8625,5.609,3.158,2.217,5.132,1
69 | 14.37,14.39,0.8726,5.569,3.153,1.464,5.3,1
70 | 12.73,13.75,0.8458,5.412,2.882,3.533,5.067,1
71 | 17.63,15.98,0.8673,6.191,3.561,4.076,6.06,2
72 | 16.84,15.67,0.8623,5.998,3.484,4.675,5.877,2
73 | 17.26,15.73,0.8763,5.978,3.594,4.539,5.791,2
74 | 19.11,16.26,0.9081,6.154,3.93,2.936,6.079,2
75 | 16.82,15.51,0.8786,6.017,3.486,4.004,5.841,2
76 | 16.77,15.62,0.8638,5.927,3.438,4.92,5.795,2
77 | 17.32,15.91,0.8599,6.064,3.403,3.824,5.922,2
78 | 20.71,17.23,0.8763,6.579,3.814,4.451,6.451,2
79 | 18.94,16.49,0.875,6.445,3.639,5.064,6.362,2
80 | 17.12,15.55,0.8892,5.85,3.566,2.858,5.746,2
81 | 16.53,15.34,0.8823,5.875,3.467,5.532,5.88,2
82 | 18.72,16.19,0.8977,6.006,3.857,5.324,5.879,2
83 | 20.2,16.89,0.8894,6.285,3.864,5.173,6.187,2
84 | 19.57,16.74,0.8779,6.384,3.772,1.472,6.273,2
85 | 19.51,16.71,0.878,6.366,3.801,2.962,6.185,2
86 | 18.27,16.09,0.887,6.173,3.651,2.443,6.197,2
87 | 18.88,16.26,0.8969,6.084,3.764,1.649,6.109,2
88 | 18.98,16.66,0.859,6.549,3.67,3.691,6.498,2
89 | 21.18,17.21,0.8989,6.573,4.033,5.78,6.231,2
90 | 20.88,17.05,0.9031,6.45,4.032,5.016,6.321,2
91 | 20.1,16.99,0.8746,6.581,3.785,1.955,6.449,2
92 | 18.76,16.2,0.8984,6.172,3.796,3.12,6.053,2
93 | 18.81,16.29,0.8906,6.272,3.693,3.237,6.053,2
94 | 18.59,16.05,0.9066,6.037,3.86,6.001,5.877,2
95 | 18.36,16.52,0.8452,6.666,3.485,4.933,6.448,2
96 | 16.87,15.65,0.8648,6.139,3.463,3.696,5.967,2
97 | 19.31,16.59,0.8815,6.341,3.81,3.477,6.238,2
98 | 18.98,16.57,0.8687,6.449,3.552,2.144,6.453,2
99 | 18.17,16.26,0.8637,6.271,3.512,2.853,6.273,2
100 | 18.72,16.34,0.881,6.219,3.684,2.188,6.097,2
101 | 16.41,15.25,0.8866,5.718,3.525,4.217,5.618,2
102 | 17.99,15.86,0.8992,5.89,3.694,2.068,5.837,2
103 | 19.46,16.5,0.8985,6.113,3.892,4.308,6.009,2
104 | 19.18,16.63,0.8717,6.369,3.681,3.357,6.229,2
105 | 18.95,16.42,0.8829,6.248,3.755,3.368,6.148,2
106 | 18.83,16.29,0.8917,6.037,3.786,2.553,5.879,2
107 | 18.85,16.17,0.9056,6.152,3.806,2.843,6.2,2
108 | 17.63,15.86,0.88,6.033,3.573,3.747,5.929,2
109 | 19.94,16.92,0.8752,6.675,3.763,3.252,6.55,2
110 | 18.55,16.22,0.8865,6.153,3.674,1.738,5.894,2
111 | 18.45,16.12,0.8921,6.107,3.769,2.235,5.794,2
112 | 19.38,16.72,0.8716,6.303,3.791,3.678,5.965,2
113 | 19.13,16.31,0.9035,6.183,3.902,2.109,5.924,2
114 | 19.14,16.61,0.8722,6.259,3.737,6.682,6.053,2
115 | 20.97,17.25,0.8859,6.563,3.991,4.677,6.316,2
116 | 19.06,16.45,0.8854,6.416,3.719,2.248,6.163,2
117 | 18.96,16.2,0.9077,6.051,3.897,4.334,5.75,2
118 | 19.15,16.45,0.889,6.245,3.815,3.084,6.185,2
119 | 18.89,16.23,0.9008,6.227,3.769,3.639,5.966,2
120 | 20.03,16.9,0.8811,6.493,3.857,3.063,6.32,2
121 | 20.24,16.91,0.8897,6.315,3.962,5.901,6.188,2
122 | 18.14,16.12,0.8772,6.059,3.563,3.619,6.011,2
123 | 16.17,15.38,0.8588,5.762,3.387,4.286,5.703,2
124 | 18.43,15.97,0.9077,5.98,3.771,2.984,5.905,2
125 | 15.99,14.89,0.9064,5.363,3.582,3.336,5.144,2
126 | 18.75,16.18,0.8999,6.111,3.869,4.188,5.992,2
127 | 18.65,16.41,0.8698,6.285,3.594,4.391,6.102,2
128 | 17.98,15.85,0.8993,5.979,3.687,2.257,5.919,2
129 | 20.16,17.03,0.8735,6.513,3.773,1.91,6.185,2
130 | 17.55,15.66,0.8991,5.791,3.69,5.366,5.661,2
131 | 18.3,15.89,0.9108,5.979,3.755,2.837,5.962,2
132 | 18.94,16.32,0.8942,6.144,3.825,2.908,5.949,2
133 | 15.38,14.9,0.8706,5.884,3.268,4.462,5.795,2
134 | 16.16,15.33,0.8644,5.845,3.395,4.266,5.795,2
135 | 15.56,14.89,0.8823,5.776,3.408,4.972,5.847,2
136 | 15.38,14.66,0.899,5.477,3.465,3.6,5.439,2
137 | 17.36,15.76,0.8785,6.145,3.574,3.526,5.971,2
138 | 15.57,15.15,0.8527,5.92,3.231,2.64,5.879,2
139 | 15.6,15.11,0.858,5.832,3.286,2.725,5.752,2
140 | 16.23,15.18,0.885,5.872,3.472,3.769,5.922,2
141 | 13.07,13.92,0.848,5.472,2.994,5.304,5.395,3
142 | 13.32,13.94,0.8613,5.541,3.073,7.035,5.44,3
143 | 13.34,13.95,0.862,5.389,3.074,5.995,5.307,3
144 | 12.22,13.32,0.8652,5.224,2.967,5.469,5.221,3
145 | 11.82,13.4,0.8274,5.314,2.777,4.471,5.178,3
146 | 11.21,13.13,0.8167,5.279,2.687,6.169,5.275,3
147 | 11.43,13.13,0.8335,5.176,2.719,2.221,5.132,3
148 | 12.49,13.46,0.8658,5.267,2.967,4.421,5.002,3
149 | 12.7,13.71,0.8491,5.386,2.911,3.26,5.316,3
150 | 10.79,12.93,0.8107,5.317,2.648,5.462,5.194,3
151 | 11.83,13.23,0.8496,5.263,2.84,5.195,5.307,3
152 | 12.01,13.52,0.8249,5.405,2.776,6.992,5.27,3
153 | 12.26,13.6,0.8333,5.408,2.833,4.756,5.36,3
154 | 11.18,13.04,0.8266,5.22,2.693,3.332,5.001,3
155 | 11.36,13.05,0.8382,5.175,2.755,4.048,5.263,3
156 | 11.19,13.05,0.8253,5.25,2.675,5.813,5.219,3
157 | 11.34,12.87,0.8596,5.053,2.849,3.347,5.003,3
158 | 12.13,13.73,0.8081,5.394,2.745,4.825,5.22,3
159 | 11.75,13.52,0.8082,5.444,2.678,4.378,5.31,3
160 | 11.49,13.22,0.8263,5.304,2.695,5.388,5.31,3
161 | 12.54,13.67,0.8425,5.451,2.879,3.082,5.491,3
162 | 12.02,13.33,0.8503,5.35,2.81,4.271,5.308,3
163 | 12.05,13.41,0.8416,5.267,2.847,4.988,5.046,3
164 | 12.55,13.57,0.8558,5.333,2.968,4.419,5.176,3
165 | 11.14,12.79,0.8558,5.011,2.794,6.388,5.049,3
166 | 12.1,13.15,0.8793,5.105,2.941,2.201,5.056,3
167 | 12.44,13.59,0.8462,5.319,2.897,4.924,5.27,3
168 | 12.15,13.45,0.8443,5.417,2.837,3.638,5.338,3
169 | 11.35,13.12,0.8291,5.176,2.668,4.337,5.132,3
170 | 11.24,13,0.8359,5.09,2.715,3.521,5.088,3
171 | 11.02,13,0.8189,5.325,2.701,6.735,5.163,3
172 | 11.55,13.1,0.8455,5.167,2.845,6.715,4.956,3
173 | 11.27,12.97,0.8419,5.088,2.763,4.309,5,3
174 | 11.4,13.08,0.8375,5.136,2.763,5.588,5.089,3
175 | 10.83,12.96,0.8099,5.278,2.641,5.182,5.185,3
176 | 10.8,12.57,0.859,4.981,2.821,4.773,5.063,3
177 | 11.26,13.01,0.8355,5.186,2.71,5.335,5.092,3
178 | 10.74,12.73,0.8329,5.145,2.642,4.702,4.963,3
179 | 11.48,13.05,0.8473,5.18,2.758,5.876,5.002,3
180 | 12.21,13.47,0.8453,5.357,2.893,1.661,5.178,3
181 | 11.41,12.95,0.856,5.09,2.775,4.957,4.825,3
182 | 12.46,13.41,0.8706,5.236,3.017,4.987,5.147,3
183 | 12.19,13.36,0.8579,5.24,2.909,4.857,5.158,3
184 | 11.65,13.07,0.8575,5.108,2.85,5.209,5.135,3
185 | 12.89,13.77,0.8541,5.495,3.026,6.185,5.316,3
186 | 11.56,13.31,0.8198,5.363,2.683,4.062,5.182,3
187 | 11.81,13.45,0.8198,5.413,2.716,4.898,5.352,3
188 | 10.91,12.8,0.8372,5.088,2.675,4.179,4.956,3
189 | 11.23,12.82,0.8594,5.089,2.821,7.524,4.957,3
190 | 10.59,12.41,0.8648,4.899,2.787,4.975,4.794,3
191 | 10.93,12.8,0.839,5.046,2.717,5.398,5.045,3
192 | 11.27,12.86,0.8563,5.091,2.804,3.985,5.001,3
193 | 11.87,13.02,0.8795,5.132,2.953,3.597,5.132,3
194 | 10.82,12.83,0.8256,5.18,2.63,4.853,5.089,3
195 | 12.11,13.27,0.8639,5.236,2.975,4.132,5.012,3
196 | 12.8,13.47,0.886,5.16,3.126,4.873,4.914,3
197 | 12.79,13.53,0.8786,5.224,3.054,5.483,4.958,3
198 | 13.37,13.78,0.8849,5.32,3.128,4.67,5.091,3
199 | 12.62,13.67,0.8481,5.41,2.911,3.306,5.231,3
200 | 12.76,13.38,0.8964,5.073,3.155,2.828,4.83,3
201 | 12.38,13.44,0.8609,5.219,2.989,5.472,5.045,3
202 | 12.67,13.32,0.8977,4.984,3.135,2.3,4.745,3
203 | 11.18,12.72,0.868,5.009,2.81,4.051,4.828,3
204 | 12.7,13.41,0.8874,5.183,3.091,8.456,5,3
205 | 12.37,13.47,0.8567,5.204,2.96,3.919,5.001,3
206 | 12.19,13.2,0.8783,5.137,2.981,3.631,4.87,3
207 | 11.23,12.88,0.8511,5.14,2.795,4.325,5.003,3
208 | 13.2,13.66,0.8883,5.236,3.232,8.315,5.056,3
209 | 11.84,13.21,0.8521,5.175,2.836,3.598,5.044,3
210 | 12.3,13.34,0.8684,5.243,2.974,5.637,5.063,3
--------------------------------------------------------------------------------
/class3_2/images/hanford_variables.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datapolitan/lede_algorithms/9109b5c91a3eb74c1343e79daf60abd76be182ea/class3_2/images/hanford_variables.png
--------------------------------------------------------------------------------
/class3_2/images/iris_scatter.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datapolitan/lede_algorithms/9109b5c91a3eb74c1343e79daf60abd76be182ea/class3_2/images/iris_scatter.png
--------------------------------------------------------------------------------
/class4_1/README.md:
--------------------------------------------------------------------------------
1 | # Algorithms: Week 4, Class 1 (Tuesday, Aug. 4)
2 |
3 | This week's class is going to be a bit different than the last few. After a quick review of last week's material, we're going to build a supervised learning system that is meant to outline a rough automated approach to [this story](http://www.nytimes.com/2015/08/02/us/small-pool-of-rich-donors-dominates-election-giving.html) from Sunday's Times about wealthy donors to Super PACs affiliated with presidential candidates.
4 |
5 | Along the way, we'll talk about:
6 |
7 | - How to train and apply different models, including the decision trees you've already discussed
8 | - How to engineer useful features for those models
9 | - How to evaluate the results of those models so you don't get yourself in trouble
10 | - The difference between statistical and rules-based solutions to problems like this
11 |
12 | For lab, you'll be asked to take on a simpler supervised learning problem that will give you a chance to apply the lessons from class.
13 |
14 | If you'd like to get a head start, feel free to read [this documentation](https://github.com/cjdd3b/fec-standardizer/wiki) on standardizing FEC data. We'll be taking a simpler approach, but much of the intuition will be similar.
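
For reference (and not as the class solution), here's a minimal sketch of the train/evaluate/apply loop we'll walk through, using a toy feature matrix in place of the donor-pair features you'll build:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer releases

# Toy stand-ins: each row is a feature vector, each label the "right answer".
features = np.array([[1, 0, 1], [0, 0, 1], [1, 1, 0], [0, 1, 0]])
labels = np.array([1, 0, 1, 0])

clf = DecisionTreeClassifier()
print(cross_val_score(clf, features, labels, cv=2))  # accuracy on held-out folds

clf.fit(features, labels)            # train on all the data
print(clf.predict([[1, 0, 0]]))      # apply the model to a new, unseen record
```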
--------------------------------------------------------------------------------
/class4_1/doc_classifier.py:
--------------------------------------------------------------------------------
1 | from sklearn import preprocessing
2 |
3 | ########## FEATURES ##########
4 |
5 | # Put your features here
6 |
7 |
8 | ########## MAIN ##########
9 |
10 | if __name__ == '__main__':
11 |
12 | # First we'll do some preprocessing to create our two vectors for model training: features, which
13 | # represents the feature vector, and labels, which represent our correct answers.
14 |
15 | features, labels = [], []
16 | with open('data/bills_training.txt', 'rU') as csvfile:
17 | for line in csvfile.readlines():
18 | bill = line.strip().split('|')
19 |
20 | if len(bill) > 1:
21 | labels.append(bill[1])
22 |
23 | features.append([
24 | # Your features here, based on bill[0], which contains the text of the bill titles
25 | ])
26 |
27 |     # A little bit of cleanup for scikit-learn's benefit. Scikit-learn models want our categories to
28 | # be numbers, not strings. The LabelEncoder performs this transformation.
29 | encoder = preprocessing.LabelEncoder()
30 | encoded_labels = encoder.fit_transform(labels)
31 |
32 | print features
33 | print encoded_labels
34 |
35 | # STEP ONE: Create and train a model
36 |
37 | # Your code here
38 |
39 |
40 | # STEP TWO: Evaluate the model
41 |
42 | # Your code here
43 |
44 |
45 | # STEP THREE: Apply the model
46 |
47 | # Use the model to get categories for each of these documents
48 |
49 | docs_new = ["Public postsecondary education: executive officer compensation.",
50 | "An act to add Section 236.3 to the Education code, related to the pricing of college textbooks.",
51 | "Political Reform Act of 1974: campaign disclosures.",
52 | "An act to add Section 236.3 to the Penal Code, relating to human trafficking."
53 | ]
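54 | 
55 |     # --- A rough sketch of one possible completion (not the official answer). It only runs if
56 |     # the feature list above has actually been filled in, and it mirrors the approach used
57 |     # elsewhere in this repo: a DecisionTreeClassifier evaluated with cross-validation. Applying
58 |     # the model to docs_new would also require building the same features for each new title,
59 |     # which is left as part of the exercise. ---
60 |     if features and len(features[0]) > 0:
61 |         from sklearn.tree import DecisionTreeClassifier
62 |         from sklearn import cross_validation
63 | 
64 |         # Step one: create and train a decision tree on the hand-built features
65 |         clf = DecisionTreeClassifier()
66 |         clf.fit(features, encoded_labels)
67 | 
68 |         # Step two: evaluate with 5-fold cross-validation
69 |         scores = cross_validation.cross_val_score(clf, features, encoded_labels, cv=5)
70 |         print "Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2)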
--------------------------------------------------------------------------------
/class4_1/donors.py:
--------------------------------------------------------------------------------
1 | import csv, itertools
2 | import numpy as np
3 | from sklearn.tree import DecisionTreeClassifier
4 | from sklearn.linear_model import LogisticRegression
5 | from sklearn import cross_validation
6 | from sklearn.cross_validation import KFold
7 | from sklearn import metrics
8 | from nameparser import HumanName
9 |
10 | ########## HELPER FUNCTIONS ##########
11 |
12 | def _shingle(word, n):
13 | '''
14 |     Splits a word into shingles of size n. Given the word "shingle" and n=2, the output
15 |     would be a set that looks like: {'sh', 'hi', 'in', 'ng', 'gl', 'le'}
16 |
17 | More on shingling here: http://blog.mafr.de/2011/01/06/near-duplicate-detection/
18 | '''
19 | return set([word[i:i + n] for i in range(len(word) - n + 1)])
20 |
21 | def _jaccard_sim(X, Y):
22 | '''
23 | Jaccard similarity between two sets.
24 |
25 | Explanation here: http://en.wikipedia.org/wiki/Jaccard_index
26 | '''
27 | if not X or not Y: return 0
28 | x = set(X)
29 | y = set(Y)
30 | return float(len(x & y)) / len(x | y)
31 |
32 | def sim(str1, str2, shingle_length=3):
33 | '''
34 | String similarity metric based on shingles and Jaccard.
35 | '''
36 | str1_shingles = _shingle(str1, shingle_length)
37 | str2_shingles = _shingle(str2, shingle_length)
38 | return _jaccard_sim(str1_shingles, str2_shingles)
39 |
40 | ########## FEATURES ##########
41 |
42 | def same_name(name1, name2):
43 | return 1 if name1 == name2 else 0
44 |
45 | def same_zip_code(zip1, zip2):
46 | return 1 if zip1[:5] == zip2[:5] else 0
47 |
48 | def same_first_name(name1, name2):
49 | first1 = HumanName(name1).first
50 | first2 = HumanName(name2).first
51 | return 1 if first1 == first2 else 0
52 |
53 | def same_last_name(name1, name2):
54 | last1 = HumanName(name1).last
55 | last2 = HumanName(name2).last
56 | return 1 if last1 == last2 else 0
57 |
58 |
59 | # We're going to add more here ...
60 |
61 | ########## MAIN ##########
62 |
63 | if __name__ == '__main__':
64 |
65 | # STEP ONE: Train our model.
66 |
67 | features, matches = [], []
68 | with open('data/contribs_training_small.csv', 'rU') as csvfile:
69 | reader = csv.DictReader(csvfile)
70 | for c in itertools.combinations(reader, 2):
71 |
72 | # Fill up our vector of correct answers
73 | match = 1 if c[0]['contributor_ext_id'] == c[1]['contributor_ext_id'] else 0
74 | matches.append(match)
75 |
76 | # And now fill up our feature vector
77 | features.append([
78 | same_name(c[0]['name'], c[1]['name']),
79 | same_zip_code(c[0]['zip_code'], c[1]['zip_code']),
80 | same_first_name(c[0]['name'], c[1]['name']),
81 | same_last_name(c[0]['name'], c[1]['name'])
82 |
83 | ])
84 |
85 | clf = DecisionTreeClassifier()
86 | clf = clf.fit(features, matches)
87 |
88 | # STEP TWO: Evaluate the model using 10-fold cross-validation
89 |
90 | # scores = cross_validation.cross_val_score(clf, features, matches, cv=10, scoring='f1')
91 | # print "%s (%s folds): %0.2f (+/- %0.2f)\n" % ('f1', 10, scores.mean(), scores.std() / 2)
92 |
93 | # STEP THREE: Apply the model
94 |
95 | with open('data/contribs_unclassified.csv', 'rU') as csvfile:
96 | reader = csv.DictReader(csvfile)
97 | for key, group in itertools.groupby(reader, lambda x: x['last_name']):
98 | for c in itertools.combinations(group, 2):
99 |
100 | # Making print-friendly representations of the records, for easier evaluation
101 | record1 = '%s, %s %s | %s %s %s | %s %s' % \
102 | (c[0]['last_name'], c[0]['first_name'], c[0]['middle_name'],
103 | c[0]['city'], c[0]['state'], c[0]['zip'],
104 | c[0]['employer'], c[0]['occupation'])
105 | record2 = '%s, %s %s | %s %s %s | %s %s' % \
106 | (c[1]['last_name'], c[1]['first_name'], c[1]['middle_name'],
107 | c[1]['city'], c[1]['state'], c[1]['zip'],
108 | c[1]['employer'], c[1]['occupation'])
109 |
110 | # We need to do this because our training set has full names, but this set has name
111 | # components. Turn those into full names.
112 | name1 = '%s, %s %s' % (c[0]['last_name'], c[0]['first_name'], c[0]['middle_name'])
113 | name2 = '%s, %s %s' % (c[1]['last_name'], c[1]['first_name'], c[1]['middle_name'])
114 |
115 | # And now fill up our feature vector
116 | features = [
117 | same_name(name1, name2),
118 | same_zip_code(c[0]['zip'], c[1]['zip']),
119 | same_first_name(name1, name2),
120 | same_last_name(name1, name2)
121 | ]
122 |
123 | # Predict match or no match
124 | match = clf.predict_proba(features)
125 |
126 | # Print the results
127 | if match[0][0] < match[0][1]:
128 | print 'MATCH!'
129 | print record1 + ' ---------> ' + record2 + '\n'
130 | print match
131 | else:
132 | print 'NO MATCH!'
133 | print record1 + ' ---------> ' + record2 + '\n'
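134 | 
135 | # --- An illustrative sketch (our addition, not part of the class code) of the kind of feature the
136 | # "We're going to add more here" comment above points at. It reuses the sim() helper, which is
137 | # defined above but never wired in. To use it, move this def up into the FEATURES section and
138 | # append similar_name(name1, name2) to each feature vector; the 0.5 threshold is an arbitrary example.
139 | def similar_name(name1, name2, threshold=0.5):
140 |     return 1 if sim(name1, name2) >= threshold else 0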
--------------------------------------------------------------------------------
/class4_2/4-2_DoNow.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "###Import wine.csv and build a decision tree classifier to predict wine_cultivar. Evaluate the model using 5-fold cross-validation"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": null,
13 | "metadata": {
14 | "collapsed": true
15 | },
16 | "outputs": [],
17 | "source": []
18 | }
19 | ],
20 | "metadata": {
21 | "kernelspec": {
22 | "display_name": "Python 2",
23 | "language": "python",
24 | "name": "python2"
25 | },
26 | "language_info": {
27 | "codemirror_mode": {
28 | "name": "ipython",
29 | "version": 2
30 | },
31 | "file_extension": ".py",
32 | "mimetype": "text/x-python",
33 | "name": "python",
34 | "nbconvert_exporter": "python",
35 | "pygments_lexer": "ipython2",
36 | "version": "2.7.10"
37 | }
38 | },
39 | "nbformat": 4,
40 | "nbformat_minor": 0
41 | }
42 |
--------------------------------------------------------------------------------
/class4_2/Feature_Engineering.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": null,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "import pandas as pd\n",
12 | "%matplotlib inline"
13 | ]
14 | },
15 | {
16 | "cell_type": "markdown",
17 | "metadata": {},
18 | "source": [
19 | "###A simple example to illustrate the intuition behind dummy variables"
20 | ]
21 | },
22 | {
23 | "cell_type": "code",
24 | "execution_count": null,
25 | "metadata": {
26 | "collapsed": true
27 | },
28 | "outputs": [],
29 | "source": [
30 | "df = pd.DataFrame({'key':['b','b','a','c','a','b'],'data1':range(6)})"
31 | ]
32 | },
33 | {
34 | "cell_type": "code",
35 | "execution_count": null,
36 | "metadata": {
37 | "collapsed": false
38 | },
39 | "outputs": [],
40 | "source": [
41 | "df"
42 | ]
43 | },
44 | {
45 | "cell_type": "code",
46 | "execution_count": null,
47 | "metadata": {
48 | "collapsed": false
49 | },
50 | "outputs": [],
51 | "source": [
52 | "pd.get_dummies(df['key'],prefix='key')"
53 | ]
54 | },
55 | {
56 | "cell_type": "markdown",
57 | "metadata": {
58 | "collapsed": true
59 | },
60 | "source": [
61 | "###Now we have a matrix of values based on the presence or absence of each attribute value in our dataset"
62 | ]
63 | },
64 | {
65 | "cell_type": "markdown",
66 | "metadata": {},
67 | "source": [
68 | "###Now let's look at another example using our flight data"
69 | ]
70 | },
71 | {
72 | "cell_type": "code",
73 | "execution_count": null,
74 | "metadata": {
75 | "collapsed": false
76 | },
77 | "outputs": [],
78 | "source": [
79 | "df = pd.read_csv('data/ontime_reports_may_2015_ny.csv')"
80 | ]
81 | },
82 | {
83 | "cell_type": "code",
84 | "execution_count": null,
85 | "metadata": {
86 | "collapsed": false
87 | },
88 | "outputs": [],
89 | "source": [
90 | "#count number of NaNs in column\n",
91 | "df['DEP_DELAY'].isnull().sum()"
92 | ]
93 | },
94 | {
95 | "cell_type": "code",
96 | "execution_count": null,
97 | "metadata": {
98 | "collapsed": false
99 | },
100 | "outputs": [],
101 | "source": [
102 | "#calculate the fraction of the total number of instances this represents\n",
103 | "df['DEP_DELAY'].isnull().sum()/float(len(df))"
104 | ]
105 | },
106 | {
107 | "cell_type": "markdown",
108 | "metadata": {},
109 | "source": [
110 | "###We could explore whether the NaNs are actually zero delays, but we'll just filter them out for now, especially since they represent such a small number of instances"
111 | ]
112 | },
113 | {
114 | "cell_type": "code",
115 | "execution_count": null,
116 | "metadata": {
117 | "collapsed": false
118 | },
119 | "outputs": [],
120 | "source": [
121 | "#filter DEP_DELAY NaNs\n",
122 | "df = df[pd.notnull(df['DEP_DELAY'])]"
123 | ]
124 | },
125 | {
126 | "cell_type": "markdown",
127 | "metadata": {},
128 | "source": [
129 | "###We can discretize the continuous DEP_DELAY value by giving it a value of 1 if the flight is delayed and 0 if it's not. We record this value in a separate column. (We could also code -1 for early, 0 for on time, and 1 for late)"
130 | ]
131 | },
132 | {
133 | "cell_type": "code",
134 | "execution_count": null,
135 | "metadata": {
136 | "collapsed": false
137 | },
138 | "outputs": [],
139 | "source": [
140 | "#code whether delay or not delayed\n",
141 | "df['IS_DELAYED'] = df['DEP_DELAY'].apply(lambda x: 1 if x>0 else 0 )"
142 | ]
143 | },
144 | {
145 | "cell_type": "code",
146 | "execution_count": null,
147 | "metadata": {
148 | "collapsed": false,
149 | "scrolled": true
150 | },
151 | "outputs": [],
152 | "source": [
153 | "#Let's check that our column was created properly\n",
154 | "df[['DEP_DELAY','IS_DELAYED']]"
155 | ]
156 | },
157 | {
158 | "cell_type": "code",
159 | "execution_count": null,
160 | "metadata": {
161 | "collapsed": true
162 | },
163 | "outputs": [],
164 | "source": [
165 | "###Dummy variables create a binary (0/1) column for each unique value of the attribute, as shown here for the ORIGIN airport"
166 | ]
167 | },
168 | {
169 | "cell_type": "code",
170 | "execution_count": null,
171 | "metadata": {
172 | "collapsed": false
173 | },
174 | "outputs": [],
175 | "source": [
176 | "pd.get_dummies(df['ORIGIN'],prefix='origin')"
177 | ]
178 | },
179 | {
180 | "cell_type": "markdown",
181 | "metadata": {},
182 | "source": [
183 | "###Normalize values"
184 | ]
185 | },
186 | {
187 | "cell_type": "code",
188 | "execution_count": null,
189 | "metadata": {
190 | "collapsed": false
191 | },
192 | "outputs": [],
193 | "source": [
194 | "#Normalize the data attributes for the Iris dataset\n",
195 | "# Example from Jump Start Scikit Learn https://machinelearningmastery.com/jump-start-scikit-learn/\n",
196 | "from sklearn.datasets import load_iris\n",
197 | "from sklearn import preprocessing\n",
198 | "iris=load_iris() #load the iris dataset\n",
199 | "X=iris.data\n",
200 | "y=iris.target\n",
201 | "normalized_X = preprocessing.normalize(X) #normalize the data attributes"
202 | ]
203 | },
204 | {
205 | "cell_type": "code",
206 | "execution_count": null,
207 | "metadata": {
208 | "collapsed": false
209 | },
210 | "outputs": [],
211 | "source": [
212 | "zip(X,normalized_X)"
213 | ]
214 | },
215 | {
216 | "cell_type": "code",
217 | "execution_count": null,
218 | "metadata": {
219 | "collapsed": true
220 | },
221 | "outputs": [],
222 | "source": []
223 | },
224 | {
225 | "cell_type": "code",
226 | "execution_count": null,
227 | "metadata": {
228 | "collapsed": true
229 | },
230 | "outputs": [],
231 | "source": []
232 | },
233 | {
234 | "cell_type": "code",
235 | "execution_count": null,
236 | "metadata": {
237 | "collapsed": true
238 | },
239 | "outputs": [],
240 | "source": []
241 | }
242 | ],
243 | "metadata": {
244 | "kernelspec": {
245 | "display_name": "Python 2",
246 | "language": "python",
247 | "name": "python2"
248 | },
249 | "language_info": {
250 | "codemirror_mode": {
251 | "name": "ipython",
252 | "version": 2
253 | },
254 | "file_extension": ".py",
255 | "mimetype": "text/x-python",
256 | "name": "python",
257 | "nbconvert_exporter": "python",
258 | "pygments_lexer": "ipython2",
259 | "version": "2.7.10"
260 | }
261 | },
262 | "nbformat": 4,
263 | "nbformat_minor": 0
264 | }
265 |
--------------------------------------------------------------------------------
/class4_2/Logistic_regression.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | ""
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": null,
13 | "metadata": {
14 | "collapsed": true
15 | },
16 | "outputs": [],
17 | "source": [
18 | "import pandas as pd\n",
19 | "%matplotlib inline\n",
20 | "import numpy as np"
21 | ]
22 | },
23 | {
24 | "cell_type": "code",
25 | "execution_count": null,
26 | "metadata": {
27 | "collapsed": true
28 | },
29 | "outputs": [],
30 | "source": [
31 | "titanic = pd.read_csv(\"data/titanic.csv\")"
32 | ]
33 | },
34 | {
35 | "cell_type": "code",
36 | "execution_count": null,
37 | "metadata": {
38 | "collapsed": false
39 | },
40 | "outputs": [],
41 | "source": [
42 | "titanic.columns"
43 | ]
44 | },
45 | {
46 | "cell_type": "markdown",
47 | "metadata": {},
48 | "source": [
49 | "###Let's do a simple logistic regression to predict survival based on pclass and sex"
50 | ]
51 | },
52 | {
53 | "cell_type": "markdown",
54 | "metadata": {},
55 | "source": [
56 | "First we need to prepare our features. Remember we drop one level of each dummy variable to avoid the dummy variable trap"
57 | ]
58 | },
59 | {
60 | "cell_type": "code",
61 | "execution_count": null,
62 | "metadata": {
63 | "collapsed": true
64 | },
65 | "outputs": [],
66 | "source": [
67 | "titanic['sex_female'] = titanic['sex'].apply(lambda x:1 if x=='female' else 0)"
68 | ]
69 | },
70 | {
71 | "cell_type": "code",
72 | "execution_count": null,
73 | "metadata": {
74 | "collapsed": true
75 | },
76 | "outputs": [],
77 | "source": [
78 | "dataset = titanic[['survived']].join([pd.get_dummies(titanic['pclass'],prefix=\"pclass\"),titanic.sex_female])"
79 | ]
80 | },
81 | {
82 | "cell_type": "code",
83 | "execution_count": null,
84 | "metadata": {
85 | "collapsed": true
86 | },
87 | "outputs": [],
88 | "source": [
89 | "from sklearn.linear_model import LogisticRegression"
90 | ]
91 | },
92 | {
93 | "cell_type": "code",
94 | "execution_count": null,
95 | "metadata": {
96 | "collapsed": true
97 | },
98 | "outputs": [],
99 | "source": [
100 | "lm = LogisticRegression()"
101 | ]
102 | },
103 | {
104 | "cell_type": "code",
105 | "execution_count": null,
106 | "metadata": {
107 | "collapsed": false
108 | },
109 | "outputs": [],
110 | "source": [
111 | "#drop pclass_1st to avoid dummy variable trap\n",
112 | "x = np.asarray(dataset[['pclass_2nd','pclass_3rd','sex_female']])\n",
113 | "y = np.asarray(dataset['survived'])"
114 | ]
115 | },
116 | {
117 | "cell_type": "code",
118 | "execution_count": null,
119 | "metadata": {
120 | "collapsed": true
121 | },
122 | "outputs": [],
123 | "source": [
124 | "lm = lm.fit(x,y)"
125 | ]
126 | },
127 | {
128 | "cell_type": "code",
129 | "execution_count": null,
130 | "metadata": {
131 | "collapsed": false
132 | },
133 | "outputs": [],
134 | "source": [
135 | "lm.score(x,y)"
136 | ]
137 | },
138 | {
139 | "cell_type": "code",
140 | "execution_count": null,
141 | "metadata": {
142 | "collapsed": false
143 | },
144 | "outputs": [],
145 | "source": [
146 | "y.mean()"
147 | ]
148 | },
149 | {
150 | "cell_type": "code",
151 | "execution_count": null,
152 | "metadata": {
153 | "collapsed": false
154 | },
155 | "outputs": [],
156 | "source": [
157 | "lm.coef_"
158 | ]
159 | },
160 | {
161 | "cell_type": "code",
162 | "execution_count": null,
163 | "metadata": {
164 | "collapsed": false
165 | },
166 | "outputs": [],
167 | "source": [
168 | "lm.intercept_"
169 | ]
170 | },
171 | {
172 | "cell_type": "code",
173 | "execution_count": null,
174 | "metadata": {
175 | "collapsed": true
176 | },
177 | "outputs": [],
178 | "source": []
179 | }
180 | ],
181 | "metadata": {
182 | "kernelspec": {
183 | "display_name": "Python 2",
184 | "language": "python",
185 | "name": "python2"
186 | },
187 | "language_info": {
188 | "codemirror_mode": {
189 | "name": "ipython",
190 | "version": 2
191 | },
192 | "file_extension": ".py",
193 | "mimetype": "text/x-python",
194 | "name": "python",
195 | "nbconvert_exporter": "python",
196 | "pygments_lexer": "ipython2",
197 | "version": "2.7.10"
198 | }
199 | },
200 | "nbformat": 4,
201 | "nbformat_minor": 0
202 | }
203 |
--------------------------------------------------------------------------------
/class4_2/Naive_Bayes.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": null,
6 | "metadata": {
7 | "collapsed": false
8 | },
9 | "outputs": [],
10 | "source": [
11 | "from sklearn import datasets\n",
12 | "from sklearn import metrics \n",
13 | "from sklearn.naive_bayes import GaussianNB \n",
14 | "#load the iris datasets \n",
15 | "dataset=datasets.load_iris() \n",
16 | "#fit a Naive Bayes model to the data \n",
17 | "model=GaussianNB() \n",
18 | "model.fit(dataset.data,dataset.target) \n",
19 | "print(model)\n",
20 | "#make predictions\n",
21 | "expected=dataset.target \n",
22 | "predicted=model.predict(dataset.data) \n",
23 | "#summarize the fit of the model \n",
24 | "print(metrics.classification_report(expected,predicted)) \n",
25 | "print(metrics.confusion_matrix(expected,predicted))"
26 | ]
27 | },
28 | {
29 | "cell_type": "markdown",
30 | "metadata": {},
31 | "source": [
32 | "###Let's get back to the saga of Leo and Kate"
33 | ]
34 | },
35 | {
36 | "cell_type": "code",
37 | "execution_count": null,
38 | "metadata": {
39 | "collapsed": true
40 | },
41 | "outputs": [],
42 | "source": [
43 | "import pandas as pd\n",
44 | "%matplotlib inline\n",
45 | "import numpy as np"
46 | ]
47 | },
48 | {
49 | "cell_type": "code",
50 | "execution_count": null,
51 | "metadata": {
52 | "collapsed": true
53 | },
54 | "outputs": [],
55 | "source": [
56 | "titanic = pd.read_csv(\"data/titanic.csv\")"
57 | ]
58 | },
59 | {
60 | "cell_type": "code",
61 | "execution_count": null,
62 | "metadata": {
63 | "collapsed": false
64 | },
65 | "outputs": [],
66 | "source": [
67 | "titanic['sex_female'] = titanic['sex'].apply(lambda x:1 if x=='female' else 0)"
68 | ]
69 | },
70 | {
71 | "cell_type": "code",
72 | "execution_count": null,
73 | "metadata": {
74 | "collapsed": true
75 | },
76 | "outputs": [],
77 | "source": [
78 | "dataset = titanic[['survived']].join([pd.get_dummies(titanic['pclass'],prefix=\"pclass\"),titanic.sex_female])"
79 | ]
80 | },
81 | {
82 | "cell_type": "code",
83 | "execution_count": null,
84 | "metadata": {
85 | "collapsed": false
86 | },
87 | "outputs": [],
88 | "source": [
89 | "#unlike the logistic regression example, we keep all of the pclass dummies here\n",
90 | "x = np.asarray(dataset[['pclass_1st','pclass_2nd','pclass_3rd','sex_female']])\n",
91 | "y = np.asarray(dataset['survived'])"
92 | ]
93 | },
94 | {
95 | "cell_type": "code",
96 | "execution_count": null,
97 | "metadata": {
98 | "collapsed": false
99 | },
100 | "outputs": [],
101 | "source": [
102 | "model.fit(x,y)"
103 | ]
104 | },
105 | {
106 | "cell_type": "code",
107 | "execution_count": null,
108 | "metadata": {
109 | "collapsed": true
110 | },
111 | "outputs": [],
112 | "source": [
113 | "expected = y \n",
114 | "predicted = model.predict(x) "
115 | ]
116 | },
117 | {
118 | "cell_type": "code",
119 | "execution_count": null,
120 | "metadata": {
121 | "collapsed": true
122 | },
123 | "outputs": [],
124 | "source": [
125 | "def measure_performance(X,y,clf, show_accuracy=True, show_classification_report=True, show_confusion_matrix=True):\n",
126 | " y_pred=clf.predict(X)\n",
127 | " if show_accuracy:\n",
128 | " print \"Accuracy:{0:.3f}\".format(metrics.accuracy_score(y, y_pred)),\"\\n\"\n",
129 | " if show_classification_report:\n",
130 | " print \"Classification report\"\n",
131 | " print metrics.classification_report(y,y_pred),\"\\n\"\n",
132 | "    if show_confusion_matrix:\n",
133 | " print \"Confusion matrix\"\n",
134 | " print metrics.confusion_matrix(y,y_pred),\"\\n\""
135 | ]
136 | },
137 | {
138 | "cell_type": "code",
139 | "execution_count": null,
140 | "metadata": {
141 | "collapsed": false
142 | },
143 | "outputs": [],
144 | "source": [
145 | "measure_performance(x,y,model)"
146 | ]
147 | },
148 | {
149 | "cell_type": "code",
150 | "execution_count": null,
151 | "metadata": {
152 | "collapsed": true
153 | },
154 | "outputs": [],
155 | "source": []
156 | }
157 | ],
158 | "metadata": {
159 | "kernelspec": {
160 | "display_name": "Python 2",
161 | "language": "python",
162 | "name": "python2"
163 | },
164 | "language_info": {
165 | "codemirror_mode": {
166 | "name": "ipython",
167 | "version": 2
168 | },
169 | "file_extension": ".py",
170 | "mimetype": "text/x-python",
171 | "name": "python",
172 | "nbconvert_exporter": "python",
173 | "pygments_lexer": "ipython2",
174 | "version": "2.7.10"
175 | }
176 | },
177 | "nbformat": 4,
178 | "nbformat_minor": 0
179 | }
180 |
--------------------------------------------------------------------------------
/class4_2/images/titanic.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datapolitan/lede_algorithms/9109b5c91a3eb74c1343e79daf60abd76be182ea/class4_2/images/titanic.png
--------------------------------------------------------------------------------
/class5_1/.ipynb_checkpoints/vectorization-checkpoint.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "from sklearn.feature_extraction.text import CountVectorizer"
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "metadata": {},
17 | "source": [
18 | "## Basic vectorization\n",
19 | "\n",
20 | "Vectorizing text is a fundamental concept in applying both supervised and unsupervised learning to documents. Basically, you can think of it as turning the words in a given text document into features.\n",
21 | "\n",
22 | "Rather than explicitly defining our features, as we did for the donor classification problem, we can instead take advantage of tools, called vectorizers, that turn each word into a feature best described as \"The number of times Word X appears in this document\".\n",
23 | "\n",
24 | "Here's an example with one bill title:"
25 | ]
26 | },
27 | {
28 | "cell_type": "code",
29 | "execution_count": 14,
30 | "metadata": {
31 | "collapsed": true
32 | },
33 | "outputs": [],
34 | "source": [
35 | "bill_titles = ['An act to amend Section 44277 of the Education Code, relating to teachers.']"
36 | ]
37 | },
38 | {
39 | "cell_type": "code",
40 | "execution_count": 16,
41 | "metadata": {
42 | "collapsed": false
43 | },
44 | "outputs": [
45 | {
46 | "name": "stdout",
47 | "output_type": "stream",
48 | "text": [
49 | "[[1 1 1 1 1 1 1 1 1 1 1 2]]\n"
50 | ]
51 | }
52 | ],
53 | "source": [
54 | "vectorizer = CountVectorizer()\n",
55 | "features = vectorizer.fit_transform(bill_titles).toarray()"
56 | ]
57 | },
58 | {
59 | "cell_type": "code",
60 | "execution_count": 17,
61 | "metadata": {
62 | "collapsed": false
63 | },
64 | "outputs": [
65 | {
66 | "name": "stdout",
67 | "output_type": "stream",
68 | "text": [
69 | "[[1 1 1 1 1 1 1 1 1 1 1 2]]\n",
70 | "[u'44277', u'act', u'amend', u'an', u'code', u'education', u'of', u'relating', u'section', u'teachers', u'the', u'to']\n"
71 | ]
72 | }
73 | ],
74 | "source": [
75 | "print features\n",
76 | "print vectorizer.get_feature_names()"
77 | ]
78 | },
79 | {
80 | "cell_type": "markdown",
81 | "metadata": {},
82 | "source": [
83 | "Think of this vector as a matrix with one row and 12 columns. The row corresponds to our document above. The columns each correspond to a word contained in that document (the first is \"44277\", the second is \"act\", etc.) The numbers correspond to the number of times each word appears in that document. You'll see that all words appear once, except the last one, \"to\", which appears twice.\n",
84 | "\n",
85 | "Now what happens if we add another bill and run it again?"
86 | ]
87 | },
88 | {
89 | "cell_type": "code",
90 | "execution_count": 19,
91 | "metadata": {
92 | "collapsed": false
93 | },
94 | "outputs": [
95 | {
96 | "name": "stdout",
97 | "output_type": "stream",
98 | "text": [
99 | "[[1 1 1 1 0 1 0 1 0 1 1 0 1 1 1 2]\n",
100 | " [0 1 0 1 1 0 1 0 1 0 0 1 0 0 0 1]]\n",
101 | "[u'44277', u'act', u'amend', u'an', u'care', u'code', u'coverage', u'education', u'health', u'of', u'relating', u'relative', u'section', u'teachers', u'the', u'to']\n"
102 | ]
103 | }
104 | ],
105 | "source": [
106 | "bill_titles = ['An act to amend Section 44277 of the Education Code, relating to teachers.',\n",
107 | " 'An act relative to health care coverage']\n",
108 | "features = vectorizer.fit_transform(bill_titles).toarray()\n",
109 | "\n",
110 | "print features\n",
111 | "print vectorizer.get_feature_names()"
112 | ]
113 | },
114 | {
115 | "cell_type": "markdown",
116 | "metadata": {},
117 | "source": [
118 | "Now we've got two rows, each corresponding to a document. The columns correspond to all words contained in BOTH documents, with counts. For example, the first entry from the first column, \"44277\", appears once in the first document but zero times in the second. This, basically, is the concept of vectorization."
119 | ]
120 | },
121 | {
122 | "cell_type": "markdown",
123 | "metadata": {},
124 | "source": [
125 | "## Cleaning up our vectors\n",
126 | "\n",
127 | "As you might imagine, a document set with a relatively large vocabulary can result in vectors that are thousands and thousands of dimensions wide. This isn't necessarily bad, but in the interest of keeping our feature space as low-dimensional as possible, there are a few things we can do to clean them up.\n",
128 | "\n",
129 | "First is removing so-called \"stop words\" -- words like \"and\", \"or\", \"the\", etc. that appear in almost every document and therefore aren't especially useful. Scikit-learn's vectorizer objects make this easy:"
130 | ]
131 | },
132 | {
133 | "cell_type": "code",
134 | "execution_count": 21,
135 | "metadata": {
136 | "collapsed": false
137 | },
138 | "outputs": [
139 | {
140 | "name": "stdout",
141 | "output_type": "stream",
142 | "text": [
143 | "[[1 1 1 0 1 0 1 0 1 0 1 1]\n",
144 | " [0 1 0 1 0 1 0 1 0 1 0 0]]\n",
145 | "[u'44277', u'act', u'amend', u'care', u'code', u'coverage', u'education', u'health', u'relating', u'relative', u'section', u'teachers']\n"
146 | ]
147 | }
148 | ],
149 | "source": [
150 | "new_vectorizer = CountVectorizer(stop_words='english')\n",
151 | "features = new_vectorizer.fit_transform(bill_titles).toarray()\n",
152 | "\n",
153 | "print features\n",
154 | "print new_vectorizer.get_feature_names()"
155 | ]
156 | },
157 | {
158 | "cell_type": "markdown",
159 | "metadata": {},
160 | "source": [
161 | "Notice that our feature space is now a little smaller. We can use a similar trick to eliminate words that appear in only a small number of documents, which becomes useful when document sets get very large."
162 | ]
163 | },
164 | {
165 | "cell_type": "code",
166 | "execution_count": 24,
167 | "metadata": {
168 | "collapsed": false
169 | },
170 | "outputs": [
171 | {
172 | "name": "stdout",
173 | "output_type": "stream",
174 | "text": [
175 | "[[1]\n",
176 | " [1]]\n",
177 | "[u'act']\n"
178 | ]
179 | }
180 | ],
181 | "source": [
182 | "new_vectorizer = CountVectorizer(stop_words='english', min_df=2)\n",
183 | "features = new_vectorizer.fit_transform(bill_titles).toarray()\n",
184 | "\n",
185 | "print features\n",
186 | "print new_vectorizer.get_feature_names()"
187 | ]
188 | },
189 | {
190 | "cell_type": "markdown",
191 | "metadata": {},
192 | "source": [
193 | "This is a bad example for this document set, but it will help later -- I promise. Finally, we can also create features that comprise more than one word. These are known as N-grams, with the N being the number of words contained in the feature. Here is how you could create a feature vector of all 1-grams and 2-grams:"
194 | ]
195 | },
196 | {
197 | "cell_type": "code",
198 | "execution_count": null,
199 | "metadata": {
200 | "collapsed": true
201 | },
202 | "outputs": [],
203 | "source": [
204 | "new_vectorizer = CountVectorizer(stop_words='english', ngram_range=(1,2))\n",
205 | "features = new_vectorizer.fit_transform(bill_titles).toarray()\n",
206 | "\n",
207 | "print features\n",
208 | "print new_vectorizer.get_feature_names()"
209 | ]
210 | }
211 | ],
212 | "metadata": {
213 | "kernelspec": {
214 | "display_name": "Python 2",
215 | "language": "python",
216 | "name": "python2"
217 | },
218 | "language_info": {
219 | "codemirror_mode": {
220 | "name": "ipython",
221 | "version": 2
222 | },
223 | "file_extension": ".py",
224 | "mimetype": "text/x-python",
225 | "name": "python",
226 | "nbconvert_exporter": "python",
227 | "pygments_lexer": "ipython2",
228 | "version": "2.7.9"
229 | }
230 | },
231 | "nbformat": 4,
232 | "nbformat_minor": 0
233 | }
234 |
--------------------------------------------------------------------------------
/class5_1/README.md:
--------------------------------------------------------------------------------
1 | # Algorithms: Week 5, Class 1 (Tuesday, Aug. 11)
2 |
3 | We'll pick up where we left off last week with the bill classification problem, using it as an excuse to introduce a method of feature creation that is especially useful for text documents -- the idea of vectorization. If we have time, we'll also discuss the basic idea of unsupervised learning.
4 |
5 | ## Hour 1: Exercise review
6 |
7 | We'll talk in detail through the exercises from last week (which were deliberately difficult) and use them to segue into basic natural language processing techniques.
8 |
9 | ## Hour 2/2.5: Vectorization
10 |
11 | We'll talk about how to use vectorization to engineer our features for natural language classification and clustering problems, rather than building features by hand. We'll then revisit the bill classification problem from last week using what we've learned.
12 |
13 | ## Hour 2.5/3: Unsupervised learning
14 |
15 | We'll talk a little about the intuition and dangers of unsupervised learning, also known as clustering, using crime data as our example.
16 |
17 | ## Lab
18 |
19 | You'll be doing two things in lab today:
20 |
21 | - First you'll work through a simple document classification problem (classifying drug-related and non-drug-related press releases) using vectorization and the other techniques we discussed in class.
22 |
23 | - Second, take a look at [this map](https://www.google.com/maps/d/u/1/embed?mid=z9S6reOYqCIE.kQnlzV2-uDzg), which shows police dispatch logs for Columbia, Mo., over the first 10 days of August. Within the map, there are three layers (eps_0.3, eps_0.2, eps_0.4), each of which shows hotspots of dispatches calculated in slightly different ways. Choose what you think is the fairest representation of the hotspots and write a couple of paragraphs characterizing your findings. Be sure to include the layer you chose in your Tumblr post.
--------------------------------------------------------------------------------
/class5_1/bill_classifier.py:
--------------------------------------------------------------------------------
1 | from sklearn import preprocessing
2 | from sklearn import cross_validation
3 | from sklearn.tree import DecisionTreeClassifier
4 | from sklearn.naive_bayes import MultinomialNB
5 | from sklearn.feature_extraction.text import CountVectorizer
6 |
7 | if __name__ == '__main__':
8 |
9 | ########## STEP 1: DATA IMPORT AND PREPROCESSING ##########
10 |
11 | # Here we're taking in the training data and splitting it into two lists: One with the text of
12 | # each bill title, and the second with each bill title's corresponding category. Order is important.
13 | # The first bill in list 1 should also be the first category in list 2.
14 | training = [line.strip().split('|') for line in open('data/bills_training.txt', 'r').readlines()]
15 | text = [t[0] for t in training if len(t) > 1]
16 | labels = [t[1] for t in training if len(t) > 1]
17 |
18 |     # A little bit of cleanup for scikit-learn's benefit. Scikit-learn models want our categories to
19 | # be numbers, not strings. The LabelEncoder performs this transformation.
20 | encoder = preprocessing.LabelEncoder()
21 | correct_labels = encoder.fit_transform(labels)
22 |
23 | ########## STEP 2: FEATURE EXTRACTION ##########
24 | print 'Extracting features ...'
25 |
26 | vectorizer = CountVectorizer(stop_words='english')
27 | data = vectorizer.fit_transform(text)
28 |
29 | ########## STEP 3: MODEL BUILDING ##########
30 | print 'Training ...'
31 |
32 | #model = MultinomialNB()
33 | model = DecisionTreeClassifier()
34 | fit_model = model.fit(data, correct_labels)
35 |
36 | # ########## STEP 4: EVALUATION ##########
37 | print 'Evaluating ...'
38 |
39 |     # Evaluate our model with 5-fold cross-validation
40 | scores = cross_validation.cross_val_score(model, data, correct_labels, cv=5)
41 | print "Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2)
42 |
43 | # ########## STEP 5: APPLYING THE MODEL ##########
44 | print 'Classifying ...'
45 |
46 | docs_new = ["Public postsecondary education: executive officer compensation.",
47 | "An act to add Section 236.3 to the Education code, related to the pricing of college textbooks.",
48 | "Political Reform Act of 1974: campaign disclosures.",
49 | "An act to add Section 236.3 to the Penal Code, relating to human trafficking."
50 | ]
51 |
52 | test_data = vectorizer.transform(docs_new)
53 |
54 | for i in xrange(len(docs_new)):
55 | print '%s -> %s' % (docs_new[i], encoder.classes_[model.predict(test_data.toarray()[i])])
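56 | 
57 |     # --- Optional sketch (our addition): the vectorization notebook in this folder shows how stop
58 |     # words, ngram_range and min_df reshape the feature space. One way to experiment here is to
59 |     # re-vectorize with those options and re-run the same cross-validation for comparison. The
60 |     # parameter values below are examples only, not recommendations.
61 |     alt_vectorizer = CountVectorizer(stop_words='english', ngram_range=(1, 2), min_df=2)
62 |     alt_data = alt_vectorizer.fit_transform(text)
63 |     alt_scores = cross_validation.cross_val_score(DecisionTreeClassifier(), alt_data, correct_labels, cv=5)
64 |     print "1-2 gram, min_df=2 accuracy: %0.2f (+/- %0.2f)" % (alt_scores.mean(), alt_scores.std() * 2)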
--------------------------------------------------------------------------------
/class5_1/crime_clusterer.py:
--------------------------------------------------------------------------------
1 | '''
2 | crime_clusterer.py
3 | This script demonstrates the use of the DBSCAN algorithm for finding
4 | clusters of crimes in Columbia, Mo. DBSCAN is a density-based clustering
5 | algorithm that groups points based on their proximity to other points in the
6 | dataset. Unlike algorithms such as K-means, you do not need to specify the
7 | number of clusters you would like it to find in advance. Instead, you set a
8 | parameter, epsilon, that identifies how close you would like two points to be
9 | for them to belong to the same cluster.
10 |
11 | More information here:
12 | http://en.wikipedia.org/wiki/DBSCAN
13 | http://scikit-learn.org/dev/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN
14 |
15 | And there's a clean, documented implementation here for reference:
16 | https://github.com/cjdd3b/car-datascience-toolkit/blob/master/cluster/dbscan.py
17 | '''
18 |
19 | import csv
20 | import numpy as np
21 | from scipy.spatial import distance
22 | from sklearn.cluster import DBSCAN
23 |
24 | ########## MODIFY THIS ##########
25 |
26 | EPS = 0.04
27 |
28 | ######### DON'T WORRY (YET) ABOUT MODIFYING THIS ##########
29 |
30 | # Pull in our data using DictReader
31 | data = list(csv.DictReader(open('data/columbia_crime.csv', 'r').readlines()))
32 |
33 | # Separate out the coordinates
34 | coords = [(float(d['lat']), float(d['lng'])) for d in data if len(d['lat']) > 0]
35 | types = [d['ExtNatureDisplayName'] for d in data if len(d['lat']) > 0]  # filtered the same way as coords so indexes line up
36 |
37 | # Scikit-learn's implementation of DBSCAN requires the input of a distance matrix showing pairwise
38 | # distances between all points in the dataset.
39 | distance_matrix = distance.squareform(distance.pdist(coords))
40 |
41 | # Run DBSCAN. Setting epsilon with lat/lon data like we have here is an inexact science. 0.03 looked
42 | # good after a few test runs. Ideally we'd project the data and set epsilon using meters or feet.
43 | db = DBSCAN(eps=EPS).fit(distance_matrix)
44 |
45 | # And now we print out the results in the form cluster_id,lat,lng. You can save this to a file and import
46 | # directly into a mapping program or Fusion Tables if you want to visualize it.
47 | for k in set(db.labels_):
48 | class_members = [index[0] for index in np.argwhere(db.labels_ == k)]
49 | for index in class_members:
50 | print '%s,%s,%s' % (int(k), types[index], '{0},{1}'.format(*coords[index]))
51 |
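52 | # --- An optional sketch (our addition) for exploring the epsilon parameter discussed above: re-run
53 | # DBSCAN over a few candidate values and compare how many clusters, and how many noise points
54 | # (label -1), each produces. The candidate values here are arbitrary examples, not recommendations.
55 | for candidate_eps in (0.02, 0.03, 0.04, 0.05):
56 |     candidate_db = DBSCAN(eps=candidate_eps).fit(distance_matrix)
57 |     candidate_labels = list(candidate_db.labels_)
58 |     n_clusters = len(set(candidate_labels)) - (1 if -1 in candidate_labels else 0)
59 |     print 'eps=%s: %s clusters, %s noise points' % (candidate_eps, n_clusters, candidate_labels.count(-1))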
--------------------------------------------------------------------------------
/class5_1/data/releases_training.txt:
--------------------------------------------------------------------------------
1 | FEB 12 (BEAUMONT, Texas) – A 25-year-old Port Arthur, Texas man has pleaded guilty to drug trafficking violations in the Eastern District of Texas, announced Drug Enforcement Administration Acting Special Agent in Charge Steven S. Whipple and U.S. Attorney John M. Bales today. Michael Joseph Barrett IV pleaded guilty to possession with intent to distribute methamphetamine on Feb. 11, 2014 before U.S. District Judge Marcia Crone. According to information presented in court, on Feb. 19, 2013, law enforcement officers responded to a residence on 32nd Street in Port Arthur after receiving information regarding suspected manufacture of methamphetamine at the location. Consent to search was obtained and a search of the premises revealed a small amount of cocaine, a semi-automatic pistol, and various items associated with methamphetamine manufacture, including a three liter bottle containing a methamphetamine mixture. A federal grand jury returned an indictment on Dec. 4, 2013, charging Barrett with drug trafficking violations. Barrett faces up to 20 years in federal prison at sentencing. A sentencing date has not been set. This case was investigated by the Drug Enforcement Administration, the Port Arthur Police Department and the Jefferson County Sheriff's Office Crime Lab and prosecuted by Assistant U.S. Attorney Randall L. Fluke.|YES
2 | FEB 05 (BROWNSVILLE, Texas ) - Stephen Whipple, Acting Special Agent in Charge of the United States Drug Enforcement Administration (DEA), Houston Division and United States Attorney Kenneth Magidson announced Jesus Mauricio Juarez Jr. aka Flaco 27, has been sentenced to federal prison for his involvement in a 1,000 pound marijuana load. He pleaded guilty in November 2013. Today, Senior U.S. District Judge Hilda G. Tagle sentenced Juarez to 31 months in federal prison. In handing down the sentence, Ruben Gonzalez-Cavazos aka Mume, also pleaded guilty in relation to the conspiracy and was sentenced to 47 months in federal prison and assessed a $15,000 fine on Feb. 3, 2014. Co-defendant Francisco Javier Maya, 35, went to trial last week in Brownsville and was convicted on all counts. He will be sentenced on May 13, 2014. Adolfo Lozano-Luna aka Chefero, 35, and Alberto Martinez aka El Diablo, 50, also pleaded guilty and will be sentenced at a later date. Evidence at Maya’s trial placed all five men in a conspiracy involving a 1,000 pound marijuana load, which was forcibly hijacked from them by unknown individuals on Dec. 11, 2012. One month later, Juarez was injured after an improvised explosive device (IED) detonated at his residence in Brownsville. In sentencing Juarez today, Judge Tagle discussed the bombing incident and noted that at least he and his family still have their lives. Evidence also linked Juarez, Gonzalez-Cavazos and Maya to other marijuana loads during the conspiracy. Maya’s role in the drug trafficking organization was to provide drivers for tractor trailers to drive marijuana loads to locations to include Houston and Taylor. Maya, Juarez and Gonzalez-Cavazos would share in the profits of each successful marijuana load. At the direction of Juarez, Maya provided bank account numbers associated with him and Gonzalez-Cavazos to Juarez in order to deposit drug profits. Juarez then made deposits stemming from narcotics proceeds from a successful marijuana load delivered to Taylor in November 2012. Evidence was presented at Maya’s trial that a $6,000 deposit was made into an account associated with Maya on Nov. 28, 2012, while another $6,500 was deposited into an account associated with Gonzalez-Cavazos on the same day. The jury last week also heard that Maya was a follower of the Santeria religion. The jury saw photos of Maya’s residence in Mission, Texas, which depicted numerous images of what was considered to be altars showing glasses of alcohol, knives, a machete, kettles, feathers and substances that appeared to be blood. Testimony also included descriptions of two rituals involving the sacrifice of animals. In December 2012, Maya had a Santeria priest, known as a “Padrino,” perform rituals with the organization to “bless” a 1,000 pound marijuana load that was destined for Houston. After meeting with the Padrino, Maya, Gonzalez-Cavazos and Juarez decided the marijuana load should remain in the Rio Grande Valley. The next day, a second ritual, attended by all five defendants, was performed and the 1,000 pounds of marijuana was to be transported to Houston. However, the marijuana was stolen from the group by unknown individuals that evening. After the theft and subsequent IED detonation, law enforcement was able to piece together the events and conspirators involved in this drug trafficking organization. 
The case was investigated by the Drug Enforcement Administration, FBI, Homeland Security Investigations, Bureau of Alcohol, Tobacco, Firearms and Explosives and the Brownsville Police Department. The case was prosecuted by Assistant United States Attorneys Angel Castro and Jody Young.|YES
3 | JAN 08 (HOUSTON) - Javier F. Peña, Special Agent in Charge of the United States Drug Enforcement Administration (DEA), Houston Division and Kenneth Magidson, United States Attorney, Southern District of Texas announced Oscar Nava-Valencia, 42, of Guadalajara, Mexico, has received a 25-year sentence for his role in the smuggling of a 3,100 kilogram load of cocaine from Panama. Nava-Valencia previously pleaded guilty and was sentenced late yesterday afternoon in federal court in Houston. U.S. District Judge Ewing Werlein Jr. sentenced Nava-Valencia to a term of 300 months in federal prison and further ordered him to pay a $5,000 fine. In March 2006, Panamanian authorities seized approximately 2,080 kilograms of cocaine from a warehouse in Panama City, Panama. The seized cocaine was part of a larger load totaling approximately 3,100 kilograms which was to be shipped from Panama to Mexico and eventually destined for the United States. Nava-Valencia, along with other associates, was to take possession of approximately 1,250 kilograms of cocaine once it arrived in Mexico. In January of 2010, Nava-Valencia was apprehended by Mexican authorities and extradited to the United States in January 2011. He has been and will remain in custody pending transfer to a U.S. Bureau of Prisons facility to be determined in the near future. The investigation leading to the charges was conducted by the Drug Enforcement Administration. Assistant United States Attorneys James Sturgis prosecuted the case.|YES
4 | JAN 21 (MONTGOMERY, Ala.) – The Drug Enforcement Administration awarded Assistant U. S. Attorneys Verne Speirs and Gray Borden the Spartan Award, announced George L. Beck, Jr., United States Attorney Middle District of Alabama. The Spartan award recognizes prosecutors for their dedication and extraordinary effort to investigate and prosecute large-scale drug dealers and money launderers. This year’s award is presented to Assistant U.S. Attorneys Speirs and Borden because of long hours invested and success obtained in combating the ever-growing scourge of drug dealing in the Middle District of Alabama. DEA chose Speirs and Borden for this award after examining the work of all federal prosecutors in the State of Alabama. “The DEA in Alabama was pleased to present the 2013 Spartan Award for Excellence in Drug Investigations to AUSA’s Speirs and Borden,” stated Clay Morris, Assistant Special Agent in Charge of DEA in Alabama. “The award was named after the Spartan Warrior Society. AUSA’s Speirs and Borden were selected by DEA management to receive the award because they exhibited many traits of a Spartan Warrior: a relentless pursuit of justice, tenacity, loyalty and dedication. Throughout 2013, AUSA’s Speirs and Borden tirelessly worked alongside our agents and task force officers in many long term complex investigations. Because of the dedication of AUSA’s Speirs and Borden, many drug trafficking organizations were completely dismantled and dangerous criminals were removed from the streets of our communities. I cannot say enough about the outstanding efforts of AUSA’s Speirs and Borden and the entire staff of the Unites States Attorney’s Office. One thing is certain, as long as AUSA’s Speirs and Borden are prosecuting drug trafficking organizations, those who target and sell poison to our children should be very afraid.” “I am very pleased that the extraordinary success of AUSAs Speirs and Borden are receiving the recognition they truly deserve,” stated U.S. Attorney George Beck, “They have worked tirelessly to prosecute these criminals. I believe it is essential that these types of crimes be vigorously prosecuted and that we continue to combat the drug problem facing this district and this nation.” “I am truly humbled to receive this award, but the real credit goes to the DEA Agents and Task Force Officers who risk everything to combat drug traffickers across this country,” stated Verne Speirs, Assistant U.S. Attorney. “The safety of our families and communities depend upon their selfless service.” “I consider this award to be one of the great achievements in my career in the U.S. Attorney’s Office, but the credit goes to our dedicated and professional staff and the DEA’s stable of tireless agents,” stated Gray Borden, Assistant U.S. Attorney. “I am proud to be associated with a team of this caliber.”|NO
5 | JAN 30 (SAN JUAN, Puerto Rico) – Yesterday, January 29, U.S. Magistrate Judge Marcos E. López authorized a complaint charging: Joselito Taveras, Miguel Jimenez, and Alberto Dominguez with conspiracy to possess and possession with intent to distribute controlled substances, and conspiracy to import and importation of controlled substances, announced Rosa Emilia Rodríguez-Vélez, United States Attorney for the District of Puerto Rico. The crew of the Coast Guard Cutter Farallon offloaded 136 kilograms (300 pounds) of cocaine Monday night, 60 nautical miles northwest of Aguadilla, Puerto Rico and transferred the custody of the defendants to Drug Enforcement Administration (DEA) special agents and Customs and Border Protection officers Wednesday at Coast Guard San Juan, Puerto Rico. The interdiction was a result of U.S. Coast Guard, Customs Border Protection, Drug Enforcement Administration and Dominican Republic Navy coordinated efforts in support of Operation Unified Resolve, Operation Caribbean Guard, and the Caribbean Corridor Strike Force (CCSF), to interdict the illegal drug shipment consisting of nine bales of cocaine with an estimated street value of approximately $3.5 million dollars.|YES
6 | JAN 10 (WASHINGTON) – The U.S. Department of Justice and the U.S. Department of Commerce's National Institute of Standards and Technology (NIST) today announced appointments to a newly created National Commission on Forensic Science. Members of the commission will work to improve the practice of forensic science by developing guidance concerning the intersections between forensic science and the criminal justice system. The commission also will work to develop policy recommendations for the U.S. Attorney General, including uniform codes for professional responsibility and requirements for formal training and certification. The commission is co-chaired by Deputy Attorney General James M. Cole and Under Secretary of Commerce for Standards and Technology and NIST Director Patrick D. Gallagher. Nelson Santos, Deputy Assistant Administrator for the Office of Forensic Sciences at the Drug Enforcement Administration, and John M. Butler, Special Assistant to the NIST director for forensic science, serve as vice-chairs. "I appreciate the commitment each of the commissioners has made and look forward to working with them to strengthen the validity and reliability of the forensic sciences and enhance quality assurance and quality control," said Deputy Attorney General Cole. "Scientifically valid and accurate forensic analysis supports all aspects of our justice system."|NO
7 |
--------------------------------------------------------------------------------
/class5_1/release_classifier.py:
--------------------------------------------------------------------------------
1 | from sklearn import preprocessing
2 | from sklearn.tree import DecisionTreeClassifier
3 | from sklearn.naive_bayes import MultinomialNB
4 | from sklearn.feature_extraction.text import CountVectorizer
5 |
6 | if __name__ == '__main__':
7 |
8 | ########## STEP 1: DATA IMPORT AND PREPROCESSING ##########
9 |
10 | training = [line.strip().split('|') for line in open('data/releases_training.txt', 'r').readlines()]
11 | text = [t[0] for t in training if len(t) > 1]
12 | labels = [t[1] for t in training if len(t) > 1]
13 |
14 | encoder = preprocessing.LabelEncoder()
15 | correct_labels = encoder.fit_transform(labels)
16 |
17 | ########## FEATURE EXTRACTION ##########
18 |
19 | # VECTORIZE YOUR DATA HERE
20 |
21 | ########## MODEL BUILDING ##########
22 |
23 | # TRAIN YOUR MODEL HERE
24 |
25 | ########## STEP 5: APPLYING THE MODEL ##########
26 | docs_new = ["Five Columbia Residents among 10 Defendants Indicted for Conspiracy to Distribute a Ton of Marijuana",
27 | ]
28 |
29 | # EVALUATE THE DOCUMENT HERE
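30 | 
31 |     # --- A rough sketch of one possible completion (not the official lab answer). It follows the
32 |     # same pattern as bill_classifier.py in this folder: vectorize the release text, train a
33 |     # classifier, then predict a label for each new headline. Using MultinomialNB rather than
34 |     # DecisionTreeClassifier is just one example of a reasonable choice. ---
35 |     vectorizer = CountVectorizer(stop_words='english')
36 |     data = vectorizer.fit_transform(text)
37 | 
38 |     model = MultinomialNB()
39 |     model.fit(data, correct_labels)
40 | 
41 |     new_data = vectorizer.transform(docs_new)
42 |     for i in xrange(len(docs_new)):
43 |         print '%s -> %s' % (docs_new[i], encoder.classes_[model.predict(new_data.toarray()[i])])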
--------------------------------------------------------------------------------
/class5_1/vectorization.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 4,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "from sklearn.feature_extraction.text import CountVectorizer"
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "metadata": {},
17 | "source": [
18 | "## Basic vectorization\n",
19 | "\n",
20 | "Vectorizing text is a fundamental concept in applying both supervised and unsupervised learning to documents. Basically, you can think of it as turning the words in a given text document into features, represented by a matrix.\n",
21 | "\n",
22 | "Rather than explicitly defining our features, as we did for the donor classification problem, we can instead take advantage of tools, called vectorizers, that turn each word into a feature best described as \"The number of times Word X appears in this document\".\n",
23 | "\n",
24 | "Here's an example with one bill title:"
25 | ]
26 | },
27 | {
28 | "cell_type": "code",
29 | "execution_count": 5,
30 | "metadata": {
31 | "collapsed": true
32 | },
33 | "outputs": [],
34 | "source": [
35 | "bill_titles = ['An act to amend Section 44277 of the Education Code, relating to teachers.']"
36 | ]
37 | },
38 | {
39 | "cell_type": "code",
40 | "execution_count": 7,
41 | "metadata": {
42 | "collapsed": false,
43 | "scrolled": false
44 | },
45 | "outputs": [],
46 | "source": [
47 | "vectorizer = CountVectorizer()\n",
48 | "features = vectorizer.fit_transform(bill_titles).toarray()"
49 | ]
50 | },
51 | {
52 | "cell_type": "code",
53 | "execution_count": 8,
54 | "metadata": {
55 | "collapsed": false,
56 | "scrolled": true
57 | },
58 | "outputs": [
59 | {
60 | "name": "stdout",
61 | "output_type": "stream",
62 | "text": [
63 | "[[1 1 1 1 1 1 1 1 1 1 1 2]]\n",
64 | "[u'44277', u'act', u'amend', u'an', u'code', u'education', u'of', u'relating', u'section', u'teachers', u'the', u'to']\n"
65 | ]
66 | }
67 | ],
68 | "source": [
69 | "print features\n",
70 | "print vectorizer.get_feature_names()"
71 | ]
72 | },
73 | {
74 | "cell_type": "markdown",
75 | "metadata": {},
76 | "source": [
77 | "Think of this vector as a matrix with one row and 12 columns. The row corresponds to our document above. The columns each correspond to a word contained in that document (the first is \"44277\", the second is \"act\", etc.) The numbers correspond to the number of times each word appears in that document. You'll see that all words appear once, except the last one, \"to\", which appears twice.\n",
78 | "\n",
79 | "Now what happens if we add another bill and run it again?"
80 | ]
81 | },
82 | {
83 | "cell_type": "code",
84 | "execution_count": 11,
85 | "metadata": {
86 | "collapsed": false
87 | },
88 | "outputs": [
89 | {
90 | "name": "stdout",
91 | "output_type": "stream",
92 | "text": [
93 | "[[1 1 1 1 0 1 0 1 0 1 1 0 1 1 1 2]\n",
94 | " [0 1 0 1 1 0 1 0 1 0 0 1 0 0 0 1]]\n",
95 | "[u'44277', u'act', u'amend', u'an', u'care', u'code', u'coverage', u'education', u'health', u'of', u'relating', u'relative', u'section', u'teachers', u'the', u'to']\n"
96 | ]
97 | }
98 | ],
99 | "source": [
100 | "bill_titles = ['An act to amend Section 44277 of the Education Code, relating to teachers.',\n",
101 | " 'An act relative to health care coverage']\n",
102 | "features = vectorizer.fit_transform(bill_titles).toarray()\n",
103 | "\n",
104 | "print features\n",
105 | "print vectorizer.get_feature_names()"
106 | ]
107 | },
108 | {
109 | "cell_type": "markdown",
110 | "metadata": {},
111 | "source": [
112 | "Now we've got two rows, each corresponding to a document. The columns correspond to all words contained in BOTH documents, with counts. For example, the first entry from the first column, \"44277\", appears once in the first document but zero times in the second. This, basically, is the concept of vectorization."
113 | ]
114 | },
115 | {
116 | "cell_type": "markdown",
117 | "metadata": {},
118 | "source": [
119 | "## Cleaning up our vectors\n",
120 | "\n",
121 | "As you might imagine, a document set with a relatively large vocabulary can result in vectors that are thousands and thousands of dimensions wide. This isn't necessarily bad, but in the interest of keeping our feature space as low-dimensional as possible, there are a few things we can do to clean them up.\n",
122 | "\n",
123 | "First is removing so-called \"stop words\" -- words like \"and\", \"or\", \"the\", etc. that appear in almost every document and therefore aren't especially useful. Scikit-learn's vectorizer objects make this easy:"
124 | ]
125 | },
126 | {
127 | "cell_type": "code",
128 | "execution_count": 12,
129 | "metadata": {
130 | "collapsed": false
131 | },
132 | "outputs": [
133 | {
134 | "name": "stdout",
135 | "output_type": "stream",
136 | "text": [
137 | "[[1 1 1 0 1 0 1 0 1 0 1 1]\n",
138 | " [0 1 0 1 0 1 0 1 0 1 0 0]]\n",
139 | "[u'44277', u'act', u'amend', u'care', u'code', u'coverage', u'education', u'health', u'relating', u'relative', u'section', u'teachers']\n"
140 | ]
141 | }
142 | ],
143 | "source": [
144 | "new_vectorizer = CountVectorizer(stop_words='english')\n",
145 | "features = new_vectorizer.fit_transform(bill_titles).toarray()\n",
146 | "\n",
147 | "print features\n",
148 | "print new_vectorizer.get_feature_names()"
149 | ]
150 | },
151 | {
152 | "cell_type": "markdown",
153 | "metadata": {},
154 | "source": [
155 | "Notice that our feature space is now a little smaller. We can use a similar trick to eliminate words that only appear a small number of times, which becomes useful when document sets get very large."
156 | ]
157 | },
158 | {
159 | "cell_type": "code",
160 | "execution_count": 13,
161 | "metadata": {
162 | "collapsed": false
163 | },
164 | "outputs": [
165 | {
166 | "name": "stdout",
167 | "output_type": "stream",
168 | "text": [
169 | "[[1]\n",
170 | " [1]]\n",
171 | "[u'act']\n"
172 | ]
173 | }
174 | ],
175 | "source": [
176 | "new_vectorizer = CountVectorizer(stop_words='english', min_df=2)\n",
177 | "features = new_vectorizer.fit_transform(bill_titles).toarray()\n",
178 | "\n",
179 | "print features\n",
180 | "print new_vectorizer.get_feature_names()"
181 | ]
182 | },
183 | {
184 | "cell_type": "markdown",
185 | "metadata": {},
186 | "source": [
187 | "This is a bad example for this document set, but it will help later -- I promise. Finally, we can also create features that comprise more than one word. These are known as N-grams, with the N being the number of words contained in the feature. Here is how you could create a feature vector of all 1-grams and 2-grams:"
188 | ]
189 | },
190 | {
191 | "cell_type": "code",
192 | "execution_count": 17,
193 | "metadata": {
194 | "collapsed": false,
195 | "scrolled": true
196 | },
197 | "outputs": [
198 | {
199 | "name": "stdout",
200 | "output_type": "stream",
201 | "text": [
202 | "[[1 1 1 1 0 1 1 0 0 1 1 0 1 1 0 0 1 1 0 0 1 1 1]\n",
203 | " [0 0 1 0 1 0 0 1 1 0 0 1 0 0 1 1 0 0 1 1 0 0 0]]\n",
204 | "[u'44277', u'44277 education', u'act', u'act amend', u'act relative', u'amend', u'amend section', u'care', u'care coverage', u'code', u'code relating', u'coverage', u'education', u'education code', u'health', u'health care', u'relating', u'relating teachers', u'relative', u'relative health', u'section', u'section 44277', u'teachers']\n"
205 | ]
206 | }
207 | ],
208 | "source": [
209 | "new_vectorizer = CountVectorizer(stop_words='english', ngram_range=(1,2))\n",
210 | "features = new_vectorizer.fit_transform(bill_titles).toarray()\n",
211 | "\n",
212 | "print features\n",
213 | "print new_vectorizer.get_feature_names()"
214 | ]
215 | },
216 | {
217 | "cell_type": "markdown",
218 | "metadata": {},
219 | "source": [
220 | "Although the feature space gets much larger, sometimes having multi-word features can make our models more accurate.\n",
221 | "\n",
222 | "These are just a few basic tricks scikit-learn makes available for transforming your vectors (you can see other ones [here](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)). But now let's take what we've learned here and apply it to the bill classification problem."
223 | ]
224 | },
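{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before we do, here's a minimal sketch of how a count vector feeds a classifier. The four titles, their labels and the `MultinomialNB` model below are illustrative assumptions only -- the real classifier is built from the full `data/bills_training.txt` file."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import CountVectorizer\n",
"from sklearn.naive_bayes import MultinomialNB\n",
"\n",
"# Toy labeled examples -- the labels here are assumptions for the sketch, not real bill categories\n",
"sketch_titles = ['An act to amend Section 44277 of the Education Code, relating to teachers.',\n",
"                 'An act relative to health care coverage',\n",
"                 'An act relating to school district employees',\n",
"                 'An act to expand health insurance coverage']\n",
"sketch_labels = ['education', 'health', 'education', 'health']\n",
"\n",
"sketch_vectorizer = CountVectorizer(stop_words='english')\n",
"X = sketch_vectorizer.fit_transform(sketch_titles)\n",
"\n",
"clf = MultinomialNB()\n",
"clf.fit(X, sketch_labels)\n",
"\n",
"# Use transform() (not fit_transform) on new text so its columns line up with the training vocabulary\n",
"print clf.predict(sketch_vectorizer.transform(['An act relating to classroom teachers']))"
]
},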
225 | {
226 | "cell_type": "code",
227 | "execution_count": null,
228 | "metadata": {
229 | "collapsed": true
230 | },
231 | "outputs": [],
232 | "source": []
233 | }
234 | ],
235 | "metadata": {
236 | "kernelspec": {
237 | "display_name": "Python 2",
238 | "language": "python",
239 | "name": "python2"
240 | },
241 | "language_info": {
242 | "codemirror_mode": {
243 | "name": "ipython",
244 | "version": 2
245 | },
246 | "file_extension": ".py",
247 | "mimetype": "text/x-python",
248 | "name": "python",
249 | "nbconvert_exporter": "python",
250 | "pygments_lexer": "ipython2",
251 | "version": "2.7.9"
252 | }
253 | },
254 | "nbformat": 4,
255 | "nbformat_minor": 0
256 | }
257 |
--------------------------------------------------------------------------------
/class5_2/5_2-Assignment.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": null,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "test_string = \"Do you know the way to San Jose?\""
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "metadata": {},
17 | "source": [
18 | "##Extend your functions from class\n",
19 | "###1. Add code to your tokenizer to filter for punctuation before tokenizing\n",
20 | "####This might be helpful: http://stackoverflow.com/a/266162/1808021"
21 | ]
22 | },
23 | {
24 | "cell_type": "code",
25 | "execution_count": null,
26 | "metadata": {
27 | "collapsed": true
28 | },
29 | "outputs": [],
30 | "source": []
31 | },
32 | {
33 | "cell_type": "markdown",
34 | "metadata": {},
35 | "source": [
36 | "###2. Add code to your tokenizer to filter for stopwords\n",
37 | "###Your function should use the list of stopwords to filter the string and not return words in the stopword list\n",
38 | "###You can use the list in NLTK or create your own\n"
39 | ]
40 | },
41 | {
42 | "cell_type": "code",
43 | "execution_count": null,
44 | "metadata": {
45 | "collapsed": true
46 | },
47 | "outputs": [],
48 | "source": []
49 | },
50 | {
51 | "cell_type": "markdown",
52 | "metadata": {},
53 | "source": [
54 | "###3. Add code to your tokenizer to call your tokenizer to create word tokens (if it doesn't already) and then generate the counts for each token"
55 | ]
56 | },
57 | {
58 | "cell_type": "code",
59 | "execution_count": null,
60 | "metadata": {
61 | "collapsed": true
62 | },
63 | "outputs": [],
64 | "source": []
65 | },
66 | {
67 | "cell_type": "markdown",
68 | "metadata": {},
69 | "source": [
70 | "##Bonus\n",
71 | "###Write a simple function to calculate the tf-idf \n",
72 | "####Remember the following were $t$ is the term, $D$ is the document, $N$ is the total number of documents, $n_w$ is the number of documents containing each word $t$, and $i_w$ is the frequency word $t$ appears in a document\n",
73 | "\n",
74 | "$tf(t,D)=\\frac{i_w}{n_D}$\n",
75 | "\n",
76 | "$idf(t,D)=\\log(\\frac{N}{1+n_w})$\n",
77 | "\n",
78 | "$tfidf=tf\\times idf$"
79 | ]
80 | },
81 | {
82 | "cell_type": "code",
83 | "execution_count": null,
84 | "metadata": {
85 | "collapsed": true
86 | },
87 | "outputs": [],
88 | "source": []
89 | },
90 | {
91 | "cell_type": "markdown",
92 | "metadata": {},
93 | "source": [
94 | "##k-NN on Iris\n",
95 | "###4. Using the Iris dataset, test the kNN for various levels of k to see if you can build a better classifier than our decision tree in 3_2"
96 | ]
97 | },
98 | {
99 | "cell_type": "code",
100 | "execution_count": null,
101 | "metadata": {
102 | "collapsed": true
103 | },
104 | "outputs": [],
105 | "source": []
106 | },
107 | {
108 | "cell_type": "markdown",
109 | "metadata": {},
110 | "source": [
111 | "##k-Means with Congressional Bills\n",
112 | "###5. Explore the clusters of Congressional Records. Select another subset and investigate the contents. Write code that investigates a different cluster."
113 | ]
114 | },
115 | {
116 | "cell_type": "code",
117 | "execution_count": null,
118 | "metadata": {
119 | "collapsed": true
120 | },
121 | "outputs": [],
122 | "source": []
123 | },
124 | {
125 | "cell_type": "markdown",
126 | "metadata": {},
127 | "source": [
128 | "###6. On the class Tumblr, provide a response to the lesson on k-Means, specifically whether you think this is a useful technique for working journalists (data or otherwise)"
129 | ]
130 | }
131 | ],
132 | "metadata": {
133 | "kernelspec": {
134 | "display_name": "Python 2",
135 | "language": "python",
136 | "name": "python2"
137 | },
138 | "language_info": {
139 | "codemirror_mode": {
140 | "name": "ipython",
141 | "version": 2
142 | },
143 | "file_extension": ".py",
144 | "mimetype": "text/x-python",
145 | "name": "python",
146 | "nbconvert_exporter": "python",
147 | "pygments_lexer": "ipython2",
148 | "version": "2.7.10"
149 | }
150 | },
151 | "nbformat": 4,
152 | "nbformat_minor": 0
153 | }
154 |
--------------------------------------------------------------------------------
/class5_2/5_2-DoNow.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "##Let's check your knowledge of the material we've covered"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "###Code your own tokenizer\n",
15 | "####Write a simple tokenizer function to take in a string, tokenize by individual words"
16 | ]
17 | },
18 | {
19 | "cell_type": "code",
20 | "execution_count": null,
21 | "metadata": {
22 | "collapsed": true
23 | },
24 | "outputs": [],
25 | "source": []
26 | },
27 | {
28 | "cell_type": "markdown",
29 | "metadata": {},
30 | "source": [
31 | "###Create your own vectorizer\n",
32 | "####Write code to output the list of tokens and the count for each token"
33 | ]
34 | },
35 | {
36 | "cell_type": "code",
37 | "execution_count": null,
38 | "metadata": {
39 | "collapsed": true
40 | },
41 | "outputs": [],
42 | "source": []
43 | },
44 | {
45 | "cell_type": "code",
46 | "execution_count": null,
47 | "metadata": {
48 | "collapsed": true
49 | },
50 | "outputs": [],
51 | "source": []
52 | }
53 | ],
54 | "metadata": {
55 | "kernelspec": {
56 | "display_name": "Python 2",
57 | "language": "python",
58 | "name": "python2"
59 | },
60 | "language_info": {
61 | "codemirror_mode": {
62 | "name": "ipython",
63 | "version": 2
64 | },
65 | "file_extension": ".py",
66 | "mimetype": "text/x-python",
67 | "name": "python",
68 | "nbconvert_exporter": "python",
69 | "pygments_lexer": "ipython2",
70 | "version": "2.7.10"
71 | }
72 | },
73 | "nbformat": 4,
74 | "nbformat_minor": 0
75 | }
76 |
--------------------------------------------------------------------------------
/class5_2/kmeans.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": null,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "import pandas as pd\n",
12 | "import re #a package for doing regex\n",
13 | "import glob #for accessing files on our local system"
14 | ]
15 | },
16 | {
17 | "cell_type": "markdown",
18 | "metadata": {},
19 | "source": [
20 | "###We'll be using data from http://www.cs.cornell.edu/home/llee/data/convote.html to explore k-means clustering"
21 | ]
22 | },
23 | {
24 | "cell_type": "code",
25 | "execution_count": null,
26 | "metadata": {
27 | "collapsed": false
28 | },
29 | "outputs": [],
30 | "source": [
31 | "!curl -O http://www.cs.cornell.edu/home/llee/data/convote/convote_v1.1.tar.gz"
32 | ]
33 | },
34 | {
35 | "cell_type": "code",
36 | "execution_count": null,
37 | "metadata": {
38 | "collapsed": false
39 | },
40 | "outputs": [],
41 | "source": [
42 | "!tar -zxvf convote_v1.1.tar.gz"
43 | ]
44 | },
45 | {
46 | "cell_type": "code",
47 | "execution_count": null,
48 | "metadata": {
49 | "collapsed": true
50 | },
51 | "outputs": [],
52 | "source": [
53 | "paths = glob.glob(\"convote_v1.1/data_stage_one/development_set/*\")\n",
54 | "speeches = []\n",
55 | "for path in paths:\n",
56 | " speech = {}\n",
57 | " filename = path[-26:]\n",
58 | " speech['filename'] = filename\n",
59 | " speech['bill_no'] = filename[:3]\n",
60 | " speech['speaker_no'] = filename[4:10]\n",
61 | " speech['bill_vote'] = filename[-5]\n",
62 | " speech['party'] = filename[-7]\n",
63 | " \n",
64 | " # Open the file\n",
65 | " speech_file = open(path, 'r')\n",
66 | " # Read the stuff out of it\n",
67 | " speech['contents'] = speech_file.read()\n",
68 | "\n",
69 | " cleaned_contents = re.sub(r\"[^ \\w]\",'', speech['contents'])\n",
70 | " cleaned_contents = re.sub(r\" +\",' ', cleaned_contents)\n",
71 | " cleaned_contents = cleaned_contents.strip()\n",
72 | " words = cleaned_contents.split(' ')\n",
73 | " speech['word_count'] = len(words)\n",
74 | " \n",
75 | " speeches.append(speech)"
76 | ]
77 | },
78 | {
79 | "cell_type": "code",
80 | "execution_count": null,
81 | "metadata": {
82 | "collapsed": false
83 | },
84 | "outputs": [],
85 | "source": [
86 | "speeches[:5]"
87 | ]
88 | },
89 | {
90 | "cell_type": "code",
91 | "execution_count": null,
92 | "metadata": {
93 | "collapsed": false
94 | },
95 | "outputs": [],
96 | "source": [
97 | "speeches_df = pd.DataFrame(speeches)\n",
98 | "speeches_df.head()"
99 | ]
100 | },
101 | {
102 | "cell_type": "code",
103 | "execution_count": null,
104 | "metadata": {
105 | "collapsed": false
106 | },
107 | "outputs": [],
108 | "source": [
109 | "speeches_df[\"word_count\"].describe()"
110 | ]
111 | },
112 | {
113 | "cell_type": "markdown",
114 | "metadata": {},
115 | "source": [
116 | "###Notice that we have a lot of speeches that are relatively short. They probably aren't the best for clustering because of their brevity"
117 | ]
118 | },
119 | {
120 | "cell_type": "markdown",
121 | "metadata": {},
122 | "source": [
123 | "###Time to bring the TF-IDF vectorizer"
124 | ]
125 | },
126 | {
127 | "cell_type": "code",
128 | "execution_count": null,
129 | "metadata": {
130 | "collapsed": true
131 | },
132 | "outputs": [],
133 | "source": [
134 | "from sklearn.feature_extraction.text import TfidfVectorizer"
135 | ]
136 | },
137 | {
138 | "cell_type": "code",
139 | "execution_count": null,
140 | "metadata": {
141 | "collapsed": true
142 | },
143 | "outputs": [],
144 | "source": [
145 | "vectorizer = TfidfVectorizer(max_features=10000, stop_words='english')\n",
146 | "longer_speeches = speeches_df[speeches_df[\"word_count\"] > 92] \n",
147 | "#filtering for word counts greater than 92 (our median length)\n",
148 | "X = vectorizer.fit_transform(longer_speeches['contents'])"
149 | ]
150 | },
151 | {
152 | "cell_type": "code",
153 | "execution_count": null,
154 | "metadata": {
155 | "collapsed": true
156 | },
157 | "outputs": [],
158 | "source": [
159 | "from sklearn.cluster import KMeans"
160 | ]
161 | },
162 | {
163 | "cell_type": "code",
164 | "execution_count": null,
165 | "metadata": {
166 | "collapsed": false
167 | },
168 | "outputs": [],
169 | "source": [
170 | "number_of_clusters = 7\n",
171 | "km = KMeans(n_clusters=number_of_clusters)\n",
172 | "km.fit(X)"
173 | ]
174 | },
175 | {
176 | "cell_type": "code",
177 | "execution_count": null,
178 | "metadata": {
179 | "collapsed": false
180 | },
181 | "outputs": [],
182 | "source": [
183 | "print(\"Top terms per cluster:\")\n",
184 | "order_centroids = km.cluster_centers_.argsort()[:, ::-1]\n",
185 | "terms = vectorizer.get_feature_names()\n",
186 | "for i in range(number_of_clusters):\n",
187 | " print(\"Cluster %d:\" % i),\n",
188 | " for ind in order_centroids[i, :15]:\n",
189 | " print(' %s' % terms[ind]),\n",
190 | " print ''"
191 | ]
192 | },
193 | {
194 | "cell_type": "code",
195 | "execution_count": null,
196 | "metadata": {
197 | "collapsed": true
198 | },
199 | "outputs": [],
200 | "source": []
201 | },
202 | {
203 | "cell_type": "code",
204 | "execution_count": null,
205 | "metadata": {
206 | "collapsed": true
207 | },
208 | "outputs": [],
209 | "source": [
210 | "additional_stopwords = ['mr','congress','chairman','madam','amendment','legislation','speaker']"
211 | ]
212 | },
213 | {
214 | "cell_type": "code",
215 | "execution_count": null,
216 | "metadata": {
217 | "collapsed": false
218 | },
219 | "outputs": [],
220 | "source": [
221 | "import nltk\n",
222 | "\n",
223 | "english_stopwords = nltk.corpus.stopwords.words('english')\n",
224 | "new_stopwords = additional_stopwords + english_stopwords"
225 | ]
226 | },
227 | {
228 | "cell_type": "code",
229 | "execution_count": null,
230 | "metadata": {
231 | "collapsed": true
232 | },
233 | "outputs": [],
234 | "source": [
235 | "vectorizer = TfidfVectorizer(max_features=10000, stop_words=new_stopwords)"
236 | ]
237 | },
238 | {
239 | "cell_type": "code",
240 | "execution_count": null,
241 | "metadata": {
242 | "collapsed": true
243 | },
244 | "outputs": [],
245 | "source": [
246 | "longer_speeches = speeches_df[speeches_df[\"word_count\"] > 92]\n",
247 | "X = vectorizer.fit_transform(longer_speeches['contents'])"
248 | ]
249 | },
250 | {
251 | "cell_type": "code",
252 | "execution_count": null,
253 | "metadata": {
254 | "collapsed": false
255 | },
256 | "outputs": [],
257 | "source": [
258 | "number_of_clusters = 7\n",
259 | "km = KMeans(n_clusters=number_of_clusters)\n",
260 | "km.fit(X)"
261 | ]
262 | },
263 | {
264 | "cell_type": "code",
265 | "execution_count": null,
266 | "metadata": {
267 | "collapsed": false
268 | },
269 | "outputs": [],
270 | "source": [
271 | "print(\"Top terms per cluster:\")\n",
272 | "order_centroids = km.cluster_centers_.argsort()[:, ::-1]\n",
273 | "terms = vectorizer.get_feature_names()\n",
274 | "for i in range(number_of_clusters):\n",
275 | " print(\"Cluster %d:\" % i),\n",
276 | " for ind in order_centroids[i, :15]:\n",
277 | " print(' %s' % terms[ind]),\n",
278 | " print ''"
279 | ]
280 | },
281 | {
282 | "cell_type": "code",
283 | "execution_count": null,
284 | "metadata": {
285 | "collapsed": false
286 | },
287 | "outputs": [],
288 | "source": [
289 | "longer_speeches[\"k-means label\"] = km.labels_"
290 | ]
291 | },
292 | {
293 | "cell_type": "code",
294 | "execution_count": null,
295 | "metadata": {
296 | "collapsed": false
297 | },
298 | "outputs": [],
299 | "source": [
300 | "longer_speeches.head()"
301 | ]
302 | },
303 | {
304 | "cell_type": "code",
305 | "execution_count": null,
306 | "metadata": {
307 | "collapsed": true
308 | },
309 | "outputs": [],
310 | "source": [
311 | "china_speeches = longer_speeches[longer_speeches[\"k-means label\"] == 1]"
312 | ]
313 | },
314 | {
315 | "cell_type": "code",
316 | "execution_count": null,
317 | "metadata": {
318 | "collapsed": false
319 | },
320 | "outputs": [],
321 | "source": [
322 | "china_speeches.head()"
323 | ]
324 | },
325 | {
326 | "cell_type": "code",
327 | "execution_count": null,
328 | "metadata": {
329 | "collapsed": false
330 | },
331 | "outputs": [],
332 | "source": [
333 | "vectorizer = TfidfVectorizer(max_features=10000, stop_words=new_stopwords)\n",
334 | "X = vectorizer.fit_transform(china_speeches['contents'])\n",
335 | "\n",
336 | "number_of_clusters = 5\n",
337 | "km = KMeans(n_clusters=number_of_clusters)\n",
338 | "km.fit(X)\n",
339 | "\n",
340 | "print(\"Top terms per cluster:\")\n",
341 | "order_centroids = km.cluster_centers_.argsort()[:, ::-1]\n",
342 | "terms = vectorizer.get_feature_names()\n",
343 | "for i in range(number_of_clusters):\n",
344 | " print(\"Cluster %d:\" % i),\n",
345 | " for ind in order_centroids[i, :10]:\n",
346 | " print(' %s' % terms[ind]),\n",
347 | " print ''"
348 | ]
349 | },
350 | {
351 | "cell_type": "code",
352 | "execution_count": null,
353 | "metadata": {
354 | "collapsed": false
355 | },
356 | "outputs": [],
357 | "source": [
358 | "km.get_params()"
359 | ]
360 | },
361 | {
362 | "cell_type": "code",
363 | "execution_count": null,
364 | "metadata": {
365 | "collapsed": false
366 | },
367 | "outputs": [],
368 | "source": [
369 | "km.score(X)"
370 | ]
371 | },
372 | {
373 | "cell_type": "code",
374 | "execution_count": null,
375 | "metadata": {
376 | "collapsed": true
377 | },
378 | "outputs": [],
379 | "source": []
380 | }
381 | ],
382 | "metadata": {
383 | "kernelspec": {
384 | "display_name": "Python 2",
385 | "language": "python",
386 | "name": "python2"
387 | },
388 | "language_info": {
389 | "codemirror_mode": {
390 | "name": "ipython",
391 | "version": 2
392 | },
393 | "file_extension": ".py",
394 | "mimetype": "text/x-python",
395 | "name": "python",
396 | "nbconvert_exporter": "python",
397 | "pygments_lexer": "ipython2",
398 | "version": "2.7.10"
399 | }
400 | },
401 | "nbformat": 4,
402 | "nbformat_minor": 0
403 | }
404 |
--------------------------------------------------------------------------------
/class5_2/knn.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "##Let's work with the wine dataset we worked with before, but slightly modified. This has more instances and different target features"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "####based on http://blog.yhathq.com/posts/classification-using-knn-and-python.html"
15 | ]
16 | },
17 | {
18 | "cell_type": "code",
19 | "execution_count": null,
20 | "metadata": {
21 | "collapsed": true
22 | },
23 | "outputs": [],
24 | "source": [
25 | "import pandas as pd\n",
26 | "import matplotlib.pyplot as plt\n",
27 | "%matplotlib inline\n",
28 | "from sklearn.neighbors import KNeighborsClassifier\n",
29 | "from sklearn import cross_validation"
30 | ]
31 | },
32 | {
33 | "cell_type": "code",
34 | "execution_count": null,
35 | "metadata": {
36 | "collapsed": true
37 | },
38 | "outputs": [],
39 | "source": [
40 | "import numpy as np"
41 | ]
42 | },
43 | {
44 | "cell_type": "code",
45 | "execution_count": null,
46 | "metadata": {
47 | "collapsed": false
48 | },
49 | "outputs": [],
50 | "source": [
51 | "df = pd.read_csv(\"data/wine.csv\")"
52 | ]
53 | },
54 | {
55 | "cell_type": "code",
56 | "execution_count": null,
57 | "metadata": {
58 | "collapsed": false
59 | },
60 | "outputs": [],
61 | "source": [
62 | "df.columns"
63 | ]
64 | },
65 | {
66 | "cell_type": "markdown",
67 | "metadata": {},
68 | "source": [
69 | "###Instead of wine cultvar, we have the wine color (red or white), as well as a binary (is red) and high quality indicator (0 or 1)"
70 | ]
71 | },
72 | {
73 | "cell_type": "code",
74 | "execution_count": null,
75 | "metadata": {
76 | "collapsed": false
77 | },
78 | "outputs": [],
79 | "source": [
80 | "df.high_quality.unique()"
81 | ]
82 | },
83 | {
84 | "cell_type": "markdown",
85 | "metadata": {},
86 | "source": [
87 | "###Let's set up our training and test sets"
88 | ]
89 | },
90 | {
91 | "cell_type": "code",
92 | "execution_count": null,
93 | "metadata": {
94 | "collapsed": false
95 | },
96 | "outputs": [],
97 | "source": [
98 | "train, test = cross_validation.train_test_split(df[['density','sulphates','residual_sugar','high_quality']],train_size=0.75)"
99 | ]
100 | },
101 | {
102 | "cell_type": "markdown",
103 | "metadata": {},
104 | "source": [
105 | "###We'll use just three columns (dimensions) for classification"
106 | ]
107 | },
108 | {
109 | "cell_type": "code",
110 | "execution_count": null,
111 | "metadata": {
112 | "collapsed": false
113 | },
114 | "outputs": [],
115 | "source": [
116 | "train"
117 | ]
118 | },
119 | {
120 | "cell_type": "code",
121 | "execution_count": null,
122 | "metadata": {
123 | "collapsed": false
124 | },
125 | "outputs": [],
126 | "source": [
127 | "x_train = train[:,:3]\n",
128 | "y_train = train[:,3]"
129 | ]
130 | },
131 | {
132 | "cell_type": "code",
133 | "execution_count": null,
134 | "metadata": {
135 | "collapsed": true
136 | },
137 | "outputs": [],
138 | "source": [
139 | "x_test = test[:,:3]\n",
140 | "y_test = test[:,3]"
141 | ]
142 | },
143 | {
144 | "cell_type": "markdown",
145 | "metadata": {},
146 | "source": [
147 | "###Let's start with a k of 1 to predict high quality"
148 | ]
149 | },
150 | {
151 | "cell_type": "code",
152 | "execution_count": null,
153 | "metadata": {
154 | "collapsed": false
155 | },
156 | "outputs": [],
157 | "source": [
158 | "clf = KNeighborsClassifier(n_neighbors=1)"
159 | ]
160 | },
161 | {
162 | "cell_type": "code",
163 | "execution_count": null,
164 | "metadata": {
165 | "collapsed": false
166 | },
167 | "outputs": [],
168 | "source": [
169 | "clf.fit(x_train,y_train)"
170 | ]
171 | },
172 | {
173 | "cell_type": "code",
174 | "execution_count": null,
175 | "metadata": {
176 | "collapsed": false
177 | },
178 | "outputs": [],
179 | "source": [
180 | "preds = clf.predict(x_test)"
181 | ]
182 | },
183 | {
184 | "cell_type": "code",
185 | "execution_count": null,
186 | "metadata": {
187 | "collapsed": true
188 | },
189 | "outputs": [],
190 | "source": [
191 | "accuracy = np.where(preds==y_test, 1, 0).sum() / float(len(test))"
192 | ]
193 | },
194 | {
195 | "cell_type": "code",
196 | "execution_count": null,
197 | "metadata": {
198 | "collapsed": false
199 | },
200 | "outputs": [],
201 | "source": [
202 | "print \"Accuracy: %3f\" % (accuracy,)"
203 | ]
204 | },
205 | {
206 | "cell_type": "markdown",
207 | "metadata": {},
208 | "source": [
209 | "###Not bad. Let's see what happens as the k changes"
210 | ]
211 | },
212 | {
213 | "cell_type": "code",
214 | "execution_count": null,
215 | "metadata": {
216 | "collapsed": false
217 | },
218 | "outputs": [],
219 | "source": [
220 | "results = []\n",
221 | "for k in range(1, 51, 2):\n",
222 | " clf = KNeighborsClassifier(n_neighbors=k)\n",
223 | " clf.fit(x_train,y_train)\n",
224 | " preds = clf.predict(x_test)\n",
225 | " accuracy = np.where(preds==y_test, 1, 0).sum() / float(len(test))\n",
226 | " print \"Neighbors: %d, Accuracy: %3f\" % (k, accuracy)\n",
227 | "\n",
228 | " results.append([k, accuracy])\n",
229 | "\n",
230 | "results = pd.DataFrame(results, columns=[\"k\", \"accuracy\"])\n",
231 | "\n",
232 | "plt.plot(results.k, results.accuracy)\n",
233 | "plt.title(\"Accuracy with Increasing K\")\n",
234 | "plt.show()"
235 | ]
236 | },
237 | {
238 | "cell_type": "markdown",
239 | "metadata": {},
240 | "source": [
241 | "###Looks like about 80% is the best we can do. The way it plateaus, suggests there's not much more to be gained by increasing k"
242 | ]
243 | },
244 | {
245 | "cell_type": "markdown",
246 | "metadata": {},
247 | "source": [
248 | "###We can also tune this a bit by not weighting each instance the same, but decreasing the weight as the distance increases"
249 | ]
250 | },
251 | {
252 | "cell_type": "code",
253 | "execution_count": null,
254 | "metadata": {
255 | "collapsed": false
256 | },
257 | "outputs": [],
258 | "source": [
259 | "results = []\n",
260 | "for k in range(1, 51, 2):\n",
261 | " clf = KNeighborsClassifier(n_neighbors=k,weights='distance')\n",
262 | " clf.fit(x_train,y_train)\n",
263 | " preds = clf.predict(x_test)\n",
264 | " accuracy = np.where(preds==y_test, 1, 0).sum() / float(len(test))\n",
265 | " print \"Neighbors: %d, Accuracy: %3f\" % (k, accuracy)\n",
266 | "\n",
267 | " results.append([k, accuracy])\n",
268 | "\n",
269 | "results = pd.DataFrame(results, columns=[\"k\", \"accuracy\"])\n",
270 | "\n",
271 | "plt.plot(results.k, results.accuracy)\n",
272 | "plt.title(\"Accuracy with Increasing K\")\n",
273 | "plt.show()"
274 | ]
275 | },
276 | {
277 | "cell_type": "markdown",
278 | "metadata": {},
279 | "source": [
280 | "###This actually increases the accuracy of our prediction"
281 | ]
282 | }
283 | ],
284 | "metadata": {
285 | "kernelspec": {
286 | "display_name": "Python 2",
287 | "language": "python",
288 | "name": "python2"
289 | },
290 | "language_info": {
291 | "codemirror_mode": {
292 | "name": "ipython",
293 | "version": 2
294 | },
295 | "file_extension": ".py",
296 | "mimetype": "text/x-python",
297 | "name": "python",
298 | "nbconvert_exporter": "python",
299 | "pygments_lexer": "ipython2",
300 | "version": "2.7.10"
301 | }
302 | },
303 | "nbformat": 4,
304 | "nbformat_minor": 0
305 | }
306 |
--------------------------------------------------------------------------------
/class6_1/README.md:
--------------------------------------------------------------------------------
1 | # Algorithms: Week 6, Class 1 (Tuesday, Aug. 18)
2 |
3 | After a quick review of the homework, we'll explore in-depth several methods for clustering crime data before we return to and expand on the document clustering problem from last Thursday.
4 |
5 | ## Hour 1: Exercise review
6 |
7 | We'll talk through the bill classification problem in detail, along with the ambiguities inherent in clustering data, using the crime example we touched on briefly last Tuesday.
8 |
9 | ## Hour 2: Clustering crime
10 |
11 | We'll look at the two methods you learned last week -- k-means clustering and k-nearest neighbors -- along with another one, known as DBSCAN, to see how different methods can produce different results when we apply them to crime data.
12 |
13 | ## Hour 3: Back to document clustering
14 |
15 | Finally, we'll return to the document clustering idea you started exploring last week, going into more depth on document similarity and term frequency-inverse document frequency (tf-idf), and showing how clustering can help us more quickly explore a new document set.
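
For a taste of what "document similarity" means in practice, here's a minimal sketch (toy documents rather than the class data) that builds tf-idf vectors and compares them with cosine similarity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ['tax cuts for small business',
        'small business tax relief',
        'health care coverage for children']

tfidf = TfidfVectorizer(stop_words='english')
X = tfidf.fit_transform(docs)

# Pairwise cosine similarity: values near 1 mean two documents share a lot of weighted vocabulary
print cosine_similarity(X)
```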
--------------------------------------------------------------------------------
/class6_2/AssociationRuleMining.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "##A simple example of Association Rule Mining based on http://orange.biolab.si/docs/latest/reference/rst/Orange.associate.html"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": null,
13 | "metadata": {
14 | "collapsed": true
15 | },
16 | "outputs": [],
17 | "source": [
18 | "import Orange #pip install orange\n",
19 | "data = Orange.data.Table(\"market-basket.basket\")"
20 | ]
21 | },
22 | {
23 | "cell_type": "code",
24 | "execution_count": null,
25 | "metadata": {
26 | "collapsed": false
27 | },
28 | "outputs": [],
29 | "source": [
30 | "for d in data:\n",
31 | " print d"
32 | ]
33 | },
34 | {
35 | "cell_type": "code",
36 | "execution_count": null,
37 | "metadata": {
38 | "collapsed": false
39 | },
40 | "outputs": [],
41 | "source": [
42 | "rules = Orange.associate.AssociationRulesSparseInducer(data, support=0.3)\n",
43 | "print \"%4s %4s %s\" % (\"Supp\", \"Conf\", \"Rule\")\n",
44 | "for r in rules[:5]:\n",
45 | " print \"%4.1f %4.1f %s\" % (r.support, r.confidence, r)"
46 | ]
47 | },
48 | {
49 | "cell_type": "markdown",
50 | "metadata": {},
51 | "source": [
52 | "###Spanish Inquisition example"
53 | ]
54 | },
55 | {
56 | "cell_type": "code",
57 | "execution_count": null,
58 | "metadata": {
59 | "collapsed": true
60 | },
61 | "outputs": [],
62 | "source": [
63 | "data = Orange.data.Table(\"inquisition.basket\")"
64 | ]
65 | },
66 | {
67 | "cell_type": "code",
68 | "execution_count": null,
69 | "metadata": {
70 | "collapsed": false
71 | },
72 | "outputs": [],
73 | "source": [
74 | "for d in data:\n",
75 | " print d"
76 | ]
77 | },
78 | {
79 | "cell_type": "code",
80 | "execution_count": null,
81 | "metadata": {
82 | "collapsed": false
83 | },
84 | "outputs": [],
85 | "source": [
86 | "rules = Orange.associate.AssociationRulesSparseInducer(data, support = 0.5)\n",
87 | "\n",
88 | "print \"%5s %5s\" % (\"supp\", \"conf\")\n",
89 | "for r in rules:\n",
90 | " print \"%5.3f %5.3f %s\" % (r.support, r.confidence, r)"
91 | ]
92 | }
93 | ],
94 | "metadata": {
95 | "kernelspec": {
96 | "display_name": "Python 2",
97 | "language": "python",
98 | "name": "python2"
99 | },
100 | "language_info": {
101 | "codemirror_mode": {
102 | "name": "ipython",
103 | "version": 2
104 | },
105 | "file_extension": ".py",
106 | "mimetype": "text/x-python",
107 | "name": "python",
108 | "nbconvert_exporter": "python",
109 | "pygments_lexer": "ipython2",
110 | "version": "2.7.10"
111 | }
112 | },
113 | "nbformat": 4,
114 | "nbformat_minor": 0
115 | }
116 |
--------------------------------------------------------------------------------
/class6_2/RandomForest.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "##Based on example from http://blog.yhathq.com/posts/random-forests-in-python.html, with modifications from https://gist.github.com/glamp/5717321"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": null,
13 | "metadata": {
14 | "collapsed": true
15 | },
16 | "outputs": [],
17 | "source": [
18 | "from sklearn.datasets import load_iris\n",
19 | "from sklearn.ensemble import RandomForestClassifier\n",
20 | "import pandas as pd\n",
21 | "import numpy as np"
22 | ]
23 | },
24 | {
25 | "cell_type": "code",
26 | "execution_count": null,
27 | "metadata": {
28 | "collapsed": true
29 | },
30 | "outputs": [],
31 | "source": [
32 | "iris = load_iris()\n",
33 | "df = pd.DataFrame(iris.data, columns=iris.feature_names)"
34 | ]
35 | },
36 | {
37 | "cell_type": "code",
38 | "execution_count": null,
39 | "metadata": {
40 | "collapsed": true
41 | },
42 | "outputs": [],
43 | "source": [
44 | "df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75"
45 | ]
46 | },
47 | {
48 | "cell_type": "code",
49 | "execution_count": null,
50 | "metadata": {
51 | "collapsed": false
52 | },
53 | "outputs": [],
54 | "source": [
55 | "df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)"
56 | ]
57 | },
58 | {
59 | "cell_type": "code",
60 | "execution_count": null,
61 | "metadata": {
62 | "collapsed": false
63 | },
64 | "outputs": [],
65 | "source": [
66 | "df.head()"
67 | ]
68 | },
69 | {
70 | "cell_type": "code",
71 | "execution_count": null,
72 | "metadata": {
73 | "collapsed": true
74 | },
75 | "outputs": [],
76 | "source": [
77 | "train, test = df[df['is_train']==True], df[df['is_train']==False]"
78 | ]
79 | },
80 | {
81 | "cell_type": "code",
82 | "execution_count": null,
83 | "metadata": {
84 | "collapsed": false
85 | },
86 | "outputs": [],
87 | "source": [
88 | "features = df.columns[:4]\n",
89 | "clf = RandomForestClassifier(n_jobs=2)\n",
90 | "y, _ = pd.factorize(train['species'])\n",
91 | "clf.fit(train[features], y)"
92 | ]
93 | },
94 | {
95 | "cell_type": "code",
96 | "execution_count": null,
97 | "metadata": {
98 | "collapsed": false
99 | },
100 | "outputs": [],
101 | "source": [
102 | "preds = iris.target_names[clf.predict(test[features])]\n",
103 | "pd.crosstab(test['species'], preds, rownames=['actual'], colnames=['preds'])"
104 | ]
105 | },
106 | {
107 | "cell_type": "code",
108 | "execution_count": null,
109 | "metadata": {
110 | "collapsed": true
111 | },
112 | "outputs": [],
113 | "source": []
114 | }
115 | ],
116 | "metadata": {
117 | "kernelspec": {
118 | "display_name": "Python 2",
119 | "language": "python",
120 | "name": "python2"
121 | },
122 | "language_info": {
123 | "codemirror_mode": {
124 | "name": "ipython",
125 | "version": 2
126 | },
127 | "file_extension": ".py",
128 | "mimetype": "text/x-python",
129 | "name": "python",
130 | "nbconvert_exporter": "python",
131 | "pygments_lexer": "ipython2",
132 | "version": "2.7.10"
133 | }
134 | },
135 | "nbformat": 4,
136 | "nbformat_minor": 0
137 | }
138 |
--------------------------------------------------------------------------------
/class7_1/README.md:
--------------------------------------------------------------------------------
1 | # Algorithms: Week 7, Class 1 (Tuesday, Aug. 25)
2 |
3 | After a quick look back at last Thursday's material, we'll spend some time looking over [examples](https://github.com/datapolitan/lede_algorithms/blob/master/class1_1/newsroom_examples.md) of algorithms and journalism from earlier in the course and talk about how to build on the skills you've learned here going forward. I'm counting on wrapping up early so we can talk about final projects.
4 |
5 | ## Resources for later
6 |
7 | - IRE/NICAR: I've said this a hundred times, but [sign up](http://www.ire.org/membership/). Use the student rate if you'd like. And if someone there balks, tell me and I'll talk to them.
8 |
9 | - MORE TK
--------------------------------------------------------------------------------
/class7_1/bill_classifier.py:
--------------------------------------------------------------------------------
1 | from sklearn import preprocessing
2 | from sklearn import cross_validation
3 | from sklearn.tree import DecisionTreeClassifier
4 | from sklearn.ensemble import RandomForestClassifier
5 | from sklearn.naive_bayes import MultinomialNB
6 | from sklearn.feature_extraction.text import CountVectorizer
7 |
8 | if __name__ == '__main__':
9 |
10 | ########## STEP 1: DATA IMPORT AND PREPROCESSING ##########
11 |
12 | # Here we're taking in the training data and splitting it into two lists: One with the text of
13 | # each bill title, and the second with each bill title's corresponding category. Order is important.
14 | # The first bill in list 1 should also be the first category in list 2.
15 | training = [line.strip().split('|') for line in open('data/bills_training.txt', 'r').readlines()]
16 | text = [t[0] for t in training if len(t) > 1]
17 | labels = [t[1] for t in training if len(t) > 1]
18 |
19 | # A little bit of cleanup for scikit-learn's benefit. Scikit-learn models want our categories to
20 | # be numbers, not strings. The LabelEncoder performs this transformation.
21 | encoder = preprocessing.LabelEncoder()
22 | correct_labels = encoder.fit_transform(labels)
23 |
24 | ########## STEP 2: FEATURE EXTRACTION ##########
25 | print 'Extracting features ...'
26 |
27 | vectorizer = CountVectorizer(stop_words='english')
28 | data = vectorizer.fit_transform(text)
29 |
30 | ########## STEP 3: MODEL BUILDING ##########
31 | print 'Training ...'
32 |
33 | #model = MultinomialNB()
34 | model = RandomForestClassifier()
35 | fit_model = model.fit(data, correct_labels)
36 |
37 | # ########## STEP 4: EVALUATION ##########
38 | print 'Evaluating ...'
39 |
40 | # Evaluate our model with 10-fold cross-validation
41 | scores = cross_validation.cross_val_score(model, data, correct_labels, cv=10)
42 | print "Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2)
43 |
44 | # ########## STEP 5: APPLYING THE MODEL ##########
45 | # print 'Classifying ...'
46 |
47 | # docs_new = ["Public postsecondary education: executive officer compensation.",
48 | # "An act to add Section 236.3 to the Education code, related to the pricing of college textbooks.",
49 | # "Political Reform Act of 1974: campaign disclosures.",
50 | # "An act to add Section 236.3 to the Penal Code, relating to human trafficking."
51 | # ]
52 |
53 | # test_data = vectorizer.transform(docs_new)
54 |
55 | # for i in xrange(len(docs_new)):
56 | # print '%s -> %s' % (docs_new[i], encoder.classes_[model.predict(test_data.toarray()[i])])
--------------------------------------------------------------------------------
/data_journalism_on_github.md:
--------------------------------------------------------------------------------
1 | #Data Journalists on Github
2 |
3 | ##Organizations
4 | + New York Times The Upshot: https://github.com/TheUpshot
5 | + New York Times Newsroom Developers: https://github.com/newsdev
6 | + FiveThirtyEight.com: https://github.com/fivethirtyeight
7 | + Al Jazeera America (at least until April): https://github.com/ajam
8 | + Chicago Tribune News Apps: https://github.com/newsapps
9 | + Northwestern University Knight Lab: https://github.com/NUKnightLab
10 | + ProPublica: https://github.com/propublica
11 | + Sunlight Labs: https://github.com/sunlightlabs
12 | + NPR Visuals Team: https://github.com/nprapps
13 | + NPR Tech: https://github.com/npr
14 | + The Guardian: https://github.com/guardian
15 | + Vox Media: https://github.com/voxmedia
16 | + Time Magazine: https://github.com/TimeMagazine
17 | + Los Angeles Times Data Desk: https://github.com/datadesk
18 | + BuzzFeed News: https://github.com/BuzzFeedNews
19 | + [Huffington Post Data](http://data.huffingtonpost.com/): https://github.com/huffpostdata
20 |
21 |
22 | ##Tools
23 | + Wireservice: https://github.com/wireservice
24 | + [Open Civic Data](http://opencivicdata.org/): https://github.com/opencivicdata
25 | + [TabulaPDF](http://tabula.technology/): https://github.com/tabulapdf
26 | + [Public Media Platform](http://publicmediaplatform.org/): https://github.com/publicmediaplatform
27 | + [CensusReporter](http://censusreporter.org/): https://github.com/censusreporter
28 | + Mozilla Foundation: https://github.com/mozilla
29 |
30 | ##People
31 | + Michael Keller: https://github.com/mhkeller
32 | + Joanna S. Kao: https://github.com/joannaskao
33 | + Kevin Quealy: https://github.com/kpq
34 | + Joe Germuska: https://github.com/JoeGermuska
35 |
36 | ##Github's infrequently updated [list of open journalism projects](https://github.com/showcases/open-journalism)
--------------------------------------------------------------------------------
/readme.md:
--------------------------------------------------------------------------------
1 | # Algorithms, Summer 2015
2 | ## LEDE Program, Columbia University, Graduate School of Journalism
3 |
4 |
5 | ### Instructors:
6 |
7 | Richard Dunks: richard [at] datapolitan [dot] com
8 |
9 | Chase Davis: chase.davis [at] nytimes [dot] com
10 |
11 |
12 | #### Room Number: Pulitzer Hall 601B
13 |
14 | #### Course Dates: 14 July - 27 August 2015
15 |
16 | ### Course Overview
17 |
18 | This course presents an overview of algorithms as they relate to journalistic tradecraft, with particular emphasis on those used in the discovery, cleaning, and analysis of data. It aims to provide literacy in the common types of data algorithms, while offering practice in the design, development, and testing of algorithms to support news reporting and analysis, including the basic concepts of algorithm reverse engineering in support of investigative news reporting. The emphasis in this class will be on practical applications and critical awareness of the impact algorithms have in modern life.
19 |
20 |
21 | ### Learning Objectives
22 |
23 | + You will understand the basic structure and operation of algorithms
24 | + You will understand the primary types of data science algorithms, including techniques of supervised and unsupervised machine learning
25 | + You will be practiced in implementing basic algorithms in Python
26 | + You will be able to meaningfully explain and critique the use and operation of algorithms as tools of public policy and business
27 | + You will understand how algorithms are applied in the newsroom
28 |
29 | ### Course Requirements
30 | All students will be expected to have a laptop during both lectures and lab time. Time will be set aside to help install, configure, and run the programs necessary for all assignments, projects, and exercises. Where possible, all programs will be free and open-source. All assigned work using services hosted online can be run using free accounts.
31 |
32 | ### Course Readings
33 | The required readings for this course consist of book chapters, newspaper articles, and short blog posts. The intention is to help give you a foundation in the critical skills ahead of class lectures. All required readings are available online or will be made available to you electronically. Recommended readings are suggestions if you wish to study further the topics covered in class. Suggested readings will also be provided as appropriate for those interested in a more in-depth discussion of the material covered in class.
34 |
35 | ### Assignments
36 | This course consists of programming and critical response assignments intended to reinforce learning and provide you with practical applications of the material covered in class. Completion of these assignments is critical to achieving the outcomes of this course. Assignments are intended to be completed during lab time or for homework. Generally, assignments will be due the following week, unless otherwise stated. For example, exercises assigned on Tuesday will be due before class on the following Tuesday.
37 | + Programming assignments will be submitted via Slack to the TAs as Python scripts (not ipynb files). The exercises should be standalone for each assignment, not a combination of all assignments. This allows them to be tested and scored separately.
38 | + Response questions should be [submitted using this address](http://ledealgorithms.tumblr.com/submit) and will be posted to the [class Tumblr](http://ledealgorithms.tumblr.com/) after grading. They should be clear, concise, and use the elements of good grammar. This is an opportunity to develop your ability to explain algorithms to your audience.
39 |
40 | ### Class Format
41 | Class runs from 10am to 1pm Tuesday and Thursday. Lab time will be from 2pm to 5pm Tuesday and Thursday. The class will be taught in roughly 50 minute blocks, with approximately 10 minute breaks between each 50 minute block. Class will be a mix of lecture and practical exercise work, emphasizing the application of skills covered in the lecture portion of the class. Lab time is intended for the completion of exercises, but may also include guided learning sessions as necessary to ensure comprehension of the course material.
42 |
43 | ### Course Policies
44 | + Attendance and Tardiness: We expect you to attend every class, arriving on time and staying for the entire duration of class. Absences will only be excused for circumstances coordinated in advance and you are responsible for making up any missed work.
45 | + Participation: We expect you to be fully engaged while you’re in class. This means asking questions when necessary, engaging in class discussions, participating in class exercises, and completing all assigned work. Learning will occur in this class only when you actively use the tools, techniques, and skills described in the lectures. We will provide you ample time and resources to accomplish the goals of this course and expect you to take full advantage of what’s offered.
46 | + Late Assignments: All assignments are to be submitted before the start of class. Assignments posted by the end of the day following class will be marked down 10%, and assignments posted by the end of the second day following class will be marked down 20%. No assignments will be accepted for a grade more than three days after class.
47 | + Office Hours: We won’t be holding regular office hours, but are available via email to answer whatever questions you may have about the material. Please feel free to also reach out to the Teaching Assistants as necessary for support and guidance with the exercises, particularly during lab time.
48 |
49 | ----
50 | ### Resources
51 | #### Technical
52 |
53 | + [Stack Overflow](http://stackoverflow.com) - Q&A community of technology pros
54 |
55 | #### (Some) Open Data Sources
56 |
57 | + [New York City Open Data Portal](https://nycopendata.socrata.com/)
58 | + [New York State Open Data Portal](https://data.ny.gov/)
59 | + [Hilary Mason’s Research Quality Data Sets](https://bitly.com/bundles/hmason/1)
60 |
61 | #### Visualizations
62 |
63 | + [Flowing Data](http://flowingdata.com/)
64 | + [Tableau Visualization Gallery](http://www.tableausoftware.com/public/gallery)
65 | + [Visualizing.org](http://www.visualizing.org/)
66 | + [Data is Beautiful](http://www.reddit.com/r/dataisbeautiful/)
67 |
68 | #### Data Journalism and Critiques
69 |
70 | + [FiveThirtyEight](http://fivethirtyeight.com/)
71 | + [Upshot](http://www.nytimes.com/upshot/)
72 | + [IQuantNY](http://iquantny.tumblr.com/)
73 | + [SimplyStatistics](http://simplystatistics.org/)
74 | + [Data Journalism Handbook](http://datajournalismhandbook.org/1.0/en/index.html)
75 |
76 | #### Suggested Reading
77 | Conway, Drew and John Myles White. Machine Learning for Hackers. O'Reilly Media, Inc., 2012.
78 |
79 | Knuth, Donald E. The Art of Computer Programming. Addison-Wesley Professional, 2011.
80 |
81 | MacCormick, John. Nine Algorithms That Changed the Future: The Ingenious Ideas That Drive Today's Computers. Princeton University Press, 2011.
82 |
83 | McCallum, Q Ethan. Bad Data Handbook. O'Reilly Media, Inc., 2012.
84 |
85 | McKinney, Wes. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. O'Reilly Media, Inc., 2012.
86 |
87 | O'Neil, Cathy and Rachel Schutt. Doing Data Science: Straight Talk from the Front Line. O'Reilly Media, Inc., 2013.
88 |
89 | Russell, Matthew A. Mining the Social Web. O'Reilly Media, Inc., 2013.
90 |
91 | Sedgewick, Robert and Kevin Wayne. Algorithms. Addison-Wesley Professional, 2011.
92 |
93 | Steiner, Christopher. Automate This: How Algorithms Came to Rule Our World. Penguin Group, 2012.
94 |
95 | ----
96 | ### Course Outline
97 | (Subject to change)
98 |
99 | #### Week 1: Introduction to Algorithms/Statistics review
100 | ##### Class 1 Readings
101 | + Miller, Claire Cain, [“When Algorithms Discriminate”](http://nyti.ms/1KS5rdu) New York Times, 9 July 2015
102 | + O’Neil, Cathy, [“Algorithms And Accountability Of Those Who Deploy Them”](http://mathbabe.org/2015/05/26/algorithms-and-accountability-of-those-who-deploy-them/)
103 | + Elkus, Adam, [“You Can’t Handle the (Algorithmic) Truth”](http://www.slate.com/articles/technology/future_tense/2015/05/algorithms_aren_t_responsible_for_the_cruelties_of_bureaucracy.single.html)
104 | + Diakopoulos, Nicholas, ["Algorithmic Accountability Reporting: On the Investigation of Black Boxes"](http://towcenter.org/wp-content/uploads/2014/02/78524_Tow-Center-Report-WEB-1.pdf)
105 |
106 | ##### Class 2 Readings (optional)
107 | + McKinney, "Getting Started With Pandas" Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython.
108 | + McKinney, "Plotting and Visualization" Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython.
109 |
110 | #### Week 2: Statistics in Reporting/Opening the Blackbox: Supervised Learning - Linear Regression
111 | ##### Class 1 Readings
112 | + (TBD)
113 |
114 | ##### Class 2 Readings
115 | + O'Neill, "Statistical Inference, Exploratory Data Analysis, and the Data Science Process" Doing Data Science: Straight Talk from the Front Line pp. 17-37
116 |
117 | #### Week 3: Opening the Blackbox: Supervised Learning - Feature Engineering/Decision Trees
118 |
119 | ##### Class 2 Readings
120 | + Building Machine Learning Systems with Python, pp. 33-43
121 | + Learning scikit-learn: Machine Learning in Python, pp. 41-52
122 | + Brownlee, Jason, ["Discover Feature Engineering, How to Engineer Features and How to Get Good at It"](http://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/)
123 | + ["A Visual Introduction to Machine Learning"](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)
124 |
125 | #### Week 4: Opening the Blackbox: Supervised Learning - Feature Engineering/Logistic Regression
126 |
127 | #### Week 5: Opening the Blackbox: Unsupervised Learning - Clustering, k-NN
128 |
129 | #### Week 6: Natural Language Processing, Reverse Engineering, and Ethics Revisited
130 |
131 | #### Week 7: Advanced Topics (we'll be polling the class for topics)
132 |
133 |
--------------------------------------------------------------------------------