├── .gitignore ├── Final_Project ├── amtrak_speeds │ └── readme.md ├── readme.md └── speeding_cops │ ├── florida_toll_plaza.kml │ ├── readme.md │ └── transponder_data.csv ├── LICENSE ├── class1_1 ├── class1-1.ipynb ├── exercise │ ├── Exercise1_MeanFunction.ipynb │ ├── Exercise2_MayoralExcuseGenerator.ipynb │ ├── Exercise3-Answers.ipynb │ ├── Exercise3.ipynb │ ├── Exercise4-Answers.ipynb │ ├── Exercise4.ipynb │ └── excuse.csv ├── lab1-1.ipynb └── newsroom_examples.md ├── class1_2 ├── .ipynb_checkpoints │ └── EDA_Python-checkpoint.ipynb ├── 2013_NYC_CD_MedianIncome_Recycle.xlsx ├── Data_Collection_Sheet.csv ├── EDA_Python.ipynb ├── class1_2.ipynb ├── height_weight.xlsx └── heights_weights_genders.csv ├── class2_1 ├── .ipynb_checkpoints │ └── EDA_Review-checkpoint.ipynb ├── EDA_Review.ipynb ├── README.md └── data │ └── ontime_reports_may_2015_ny.csv ├── class2_2 ├── DoNow_2-2.ipynb ├── DoNow_2-2_answers.ipynb ├── Multiple_Variable_Regression.ipynb ├── Simple_Linear_Regression.ipynb └── data │ ├── 2013_NYC_CD_MedianIncome_Recycle.xlsx │ ├── height_weight.xlsx │ ├── heights_weights_genders.csv │ └── ontime_reports_may_2015_ny.csv ├── class3_1 ├── .ipynb_checkpoints │ ├── classification-checkpoint.ipynb │ └── regression_review-checkpoint.ipynb ├── README.md ├── classification.ipynb ├── data │ ├── apib12tx.csv │ └── category-training.csv └── regression_review.ipynb ├── class3_2 ├── 3-2_DoNow.ipynb ├── 3-2_DoNow_Answers.ipynb ├── 3-2_DoNow_Answers_statsmodels.ipynb ├── 3-2_Exercises-Answers.ipynb ├── 3-2_Exercises.ipynb ├── Decision_Tree.ipynb ├── data │ ├── hanford.csv │ ├── hanford.txt │ ├── iris.csv │ ├── ontime_reports_may_2015_ny.csv │ ├── seeds_dataset.txt │ └── titanic.csv └── images │ ├── hanford_variables.png │ └── iris_scatter.png ├── class4_1 ├── README.md ├── data │ ├── bills_training.txt │ ├── contribs_training.csv │ ├── contribs_training_small.csv │ └── contribs_unclassified.csv ├── doc_classifier.py └── donors.py ├── class4_2 ├── 4-2_DoNow.ipynb ├── Feature_Engineering.ipynb ├── Logistic_regression.ipynb ├── Naive_Bayes.ipynb ├── data │ ├── ontime_reports_may_2015_ny.csv │ ├── titanic.csv │ └── wine.csv └── images │ └── titanic.png ├── class5_1 ├── .ipynb_checkpoints │ └── vectorization-checkpoint.ipynb ├── README.md ├── bill_classifier.py ├── crime_clusterer.py ├── data │ ├── bills_training.txt │ ├── columbia_crime.csv │ └── releases_training.txt ├── release_classifier.py └── vectorization.ipynb ├── class5_2 ├── 5_2-Assignment.ipynb ├── 5_2-DoNow.ipynb ├── data │ └── wine.csv ├── kmeans.ipynb └── knn.ipynb ├── class6_1 ├── .ipynb_checkpoints │ ├── cluster_crime-checkpoint.ipynb │ └── cluster_emails-checkpoint.ipynb ├── README.md ├── cluster_crime.ipynb ├── cluster_emails.ipynb └── data │ ├── cluster_examples │ ├── kmeans_10.csv │ └── kmeans_3.csv │ ├── columbia_crime.csv │ └── jeb_subjects.csv ├── class6_2 ├── AssociationRuleMining.ipynb └── RandomForest.ipynb ├── class7_1 ├── README.md ├── bill_classifier.py └── data │ └── bills_training.txt ├── data_journalism_on_github.md └── readme.md /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | 5 | # C extensions 6 | *.so 7 | 8 | # Distribution / packaging 9 | .Python 10 | env/ 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | *.egg-info/ 23 | .installed.cfg 24 | *.egg 25 | 26 | # PyInstaller 27 | # Usually these files are 
written by a python script from a template 28 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 29 | *.manifest 30 | *.spec 31 | 32 | # Installer logs 33 | pip-log.txt 34 | pip-delete-this-directory.txt 35 | 36 | # Unit test / coverage reports 37 | htmlcov/ 38 | .tox/ 39 | .coverage 40 | .coverage.* 41 | .cache 42 | nosetests.xml 43 | coverage.xml 44 | *,cover 45 | 46 | # Translations 47 | *.mo 48 | *.pot 49 | 50 | # Django stuff: 51 | *.log 52 | 53 | # Sphinx documentation 54 | docs/_build/ 55 | 56 | # PyBuilder 57 | target/ 58 | -------------------------------------------------------------------------------- /Final_Project/amtrak_speeds/readme.md: -------------------------------------------------------------------------------- 1 | ##Source: [Derailed Amtrak train sped into deadly crash curve | Al Jazeera America](http://america.aljazeera.com/multimedia/2015/5/map-derailed-amtrak-sped-through-northeast-corridor.html) 2 | 3 | ##Data: https://github.com/ajam/amtrak-188 4 | + The live data is accessible here: https://www.googleapis.com/mapsengine/v1/tables/01382379791355219452-08584582962951999356/features?version=published&key=AIzaSyCVFeFQrtk-ywrUE0pEcvlwgCqS6TJcOW4&maxResults=250 5 | 6 | ##Notes: 7 | + Michael Keller doesn't provide the code for scraping the data, but you can scrape the live data from the URL above (I'd recommend storing the results in a database) -------------------------------------------------------------------------------- /Final_Project/readme.md: -------------------------------------------------------------------------------- 1 | ##Final Project 2 | 3 | 4 | The final project is a chance for you to demonstrate the skills you've learned in this and other Lede classes to explore a topic of personal or professional interest using data. You should demonstrate not only strong technical ability, but also the ability to synthesize the data in interesting and meaningful ways. 5 | 6 | Requirements: 7 | + You must write a blog post, [submitted through the class Tumblr](http://ledealgorithms.tumblr.com/submit), outlining your project, your goals, your methodology, and your findings. Specifically address the data you used, its source, the steps you took to clean the data, and the insights you gained at each step, either with respect to your project or to working with data more generally. 8 | + You must present your work in class, on either August 27th or August 31st. Prepare a 15-minute presentation on the points covered in your blog post and be prepared to answer questions. All work is due September 1st. 9 | + You must provide the source code for your project. Code should be well written and commented wherever possible to explain its operation. 10 | 11 | You are free to work in groups, and we encourage you to find projects that are of limited enough scope to fit into the time allotted for this project. We often work under tight deadlines, and being able to constrain scope is an important skill. A smaller, more constrained objective will often allow us to better understand the essential tasks and challenges. Attempting to implement everything we envision at once is a recipe for disaster (like Healthcare.gov). Take this opportunity to develop a more iterative approach and build your project in phases rather than tackling everything all at once. For more information on this approach, look into [Agile Development](http://agilemethodology.org/). 
12 | 13 | If you work in groups, please indicate in your blog post the work of each person on the project so they may receive the proper credit. 14 | 15 | If you have any questions or need assistance shaping projects, please don't hesitate to reach out. 16 | -------------------------------------------------------------------------------- /Final_Project/speeding_cops/readme.md: -------------------------------------------------------------------------------- 1 | ##Source: [The Florida Sun-Sentinel Speeding Cops](http://www.sun-sentinel.com/news/speeding-cops/) 2 | 3 | ##Background 4 | [Documenting the process](http://towcenter.gitbooks.io/sensors-and-journalism/content/the_second_section/sun_sentinel_%E2%80%93.html) 5 | 6 | ##Data 7 | + transponder_data.csv - an extract from their online database with the entrance and exit locations and the entrance and exit times 8 | + florida_toll_plaza.kml - locations for each toll booth (as well as the toll plazas) in KML format, extracted from [here](https://www.google.com/maps/d/viewer?mid=zkhNiVf3Ss6c.k8ys3XRv92Ms&hl=en_US). A KML file is just XML with spatial data. OpenRefine is recommended for processing it; Python is also an option (but OpenRefine will be easier and faster) 9 | -------------------------------------------------------------------------------- /class1_1/class1-1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "name": "", 4 | "signature": "sha256:83d3e5703fd0ce5b2e62c63376699efb77c5cfa83ce21a7433dc7a0f14c00d56" 5 | }, 6 | "nbformat": 3, 7 | "nbformat_minor": 0, 8 | "worksheets": [ 9 | { 10 | "cells": [ 11 | { 12 | "cell_type": "code", 13 | "collapsed": false, 14 | "input": [ 15 | "for i in range(10):\n", 16 | " print i" 17 | ], 18 | "language": "python", 19 | "metadata": {}, 20 | "outputs": [ 21 | { 22 | "output_type": "stream", 23 | "stream": "stdout", 24 | "text": [ 25 | "0\n", 26 | "1\n", 27 | "2\n", 28 | "3\n", 29 | "4\n", 30 | "5\n", 31 | "6\n", 32 | "7\n", 33 | "8\n", 34 | "9\n" 35 | ] 36 | } 37 | ], 38 | "prompt_number": 1 39 | }, 40 | { 41 | "cell_type": "code", 42 | "collapsed": false, 43 | "input": [ 44 | "for i in range(1,10):\n", 45 | " print i" 46 | ], 47 | "language": "python", 48 | "metadata": {}, 49 | "outputs": [ 50 | { 51 | "output_type": "stream", 52 | "stream": "stdout", 53 | "text": [ 54 | "1\n", 55 | "2\n", 56 | "3\n", 57 | "4\n", 58 | "5\n", 59 | "6\n", 60 | "7\n", 61 | "8\n", 62 | "9\n" 63 | ] 64 | } 65 | ], 66 | "prompt_number": 2 67 | }, 68 | { 69 | "cell_type": "code", 70 | "collapsed": false, 71 | "input": [ 72 | "for i in range(1,10,2):\n", 73 | " print i" 74 | ], 75 | "language": "python", 76 | "metadata": {}, 77 | "outputs": [ 78 | { 79 | "output_type": "stream", 80 | "stream": "stdout", 81 | "text": [ 82 | "1\n", 83 | "3\n", 84 | "5\n", 85 | "7\n", 86 | "9\n" 87 | ] 88 | } 89 | ], 90 | "prompt_number": 3 91 | }, 92 | { 93 | "cell_type": "code", 94 | "collapsed": false, 95 | "input": [ 96 | "print range(10)" 97 | ], 98 | "language": "python", 99 | "metadata": {}, 100 | "outputs": [ 101 | { 102 | "output_type": "stream", 103 | "stream": "stdout", 104 | "text": [ 105 | "[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]\n" 106 | ] 107 | } 108 | ], 109 | "prompt_number": 4 110 | }, 111 | { 112 | "cell_type": "code", 113 | "collapsed": false, 114 | "input": [ 115 | "print range(1,10,3)" 116 | ], 117 | "language": "python", 118 | "metadata": {}, 119 | "outputs": [ 120 | { 121 | "output_type": "stream", 122 | "stream": "stdout", 123 | "text": [ 124 | "[1, 4, 7]\n" 125 | ] 
126 | } 127 | ], 128 | "prompt_number": 6 129 | }, 130 | { 131 | "cell_type": "code", 132 | "collapsed": false, 133 | "input": [ 134 | "print range(1,10,3,5)" 135 | ], 136 | "language": "python", 137 | "metadata": {}, 138 | "outputs": [ 139 | { 140 | "ename": "TypeError", 141 | "evalue": "range expected at most 3 arguments, got 4", 142 | "output_type": "pyerr", 143 | "traceback": [ 144 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", 145 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0;32mprint\u001b[0m \u001b[0mrange\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;36m10\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;36m3\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;36m5\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", 146 | "\u001b[0;31mTypeError\u001b[0m: range expected at most 3 arguments, got 4" 147 | ] 148 | } 149 | ], 150 | "prompt_number": 7 151 | }, 152 | { 153 | "cell_type": "code", 154 | "collapsed": false, 155 | "input": [ 156 | "l = [1,2,\"abc\"]" 157 | ], 158 | "language": "python", 159 | "metadata": {}, 160 | "outputs": [], 161 | "prompt_number": 8 162 | }, 163 | { 164 | "cell_type": "code", 165 | "collapsed": false, 166 | "input": [ 167 | "l" 168 | ], 169 | "language": "python", 170 | "metadata": {}, 171 | "outputs": [ 172 | { 173 | "metadata": {}, 174 | "output_type": "pyout", 175 | "prompt_number": 14, 176 | "text": [ 177 | "[2]" 178 | ] 179 | } 180 | ], 181 | "prompt_number": 14 182 | }, 183 | { 184 | "cell_type": "code", 185 | "collapsed": false, 186 | "input": [ 187 | "l.pop()" 188 | ], 189 | "language": "python", 190 | "metadata": {}, 191 | "outputs": [ 192 | { 193 | "metadata": {}, 194 | "output_type": "pyout", 195 | "prompt_number": 10, 196 | "text": [ 197 | "'abc'" 198 | ] 199 | } 200 | ], 201 | "prompt_number": 10 202 | }, 203 | { 204 | "cell_type": "code", 205 | "collapsed": false, 206 | "input": [ 207 | "l2 = l.remove(2)" 208 | ], 209 | "language": "python", 210 | "metadata": {}, 211 | "outputs": [], 212 | "prompt_number": 17 213 | }, 214 | { 215 | "cell_type": "code", 216 | "collapsed": false, 217 | "input": [ 218 | "l" 219 | ], 220 | "language": "python", 221 | "metadata": {}, 222 | "outputs": [ 223 | { 224 | "metadata": {}, 225 | "output_type": "pyout", 226 | "prompt_number": 13, 227 | "text": [ 228 | "[2]" 229 | ] 230 | } 231 | ], 232 | "prompt_number": 13 233 | }, 234 | { 235 | "cell_type": "code", 236 | "collapsed": false, 237 | "input": [ 238 | "l2" 239 | ], 240 | "language": "python", 241 | "metadata": {}, 242 | "outputs": [], 243 | "prompt_number": 18 244 | }, 245 | { 246 | "cell_type": "code", 247 | "collapsed": false, 248 | "input": [ 249 | "l = [1,1,1,2,3]" 250 | ], 251 | "language": "python", 252 | "metadata": {}, 253 | "outputs": [], 254 | "prompt_number": 19 255 | }, 256 | { 257 | "cell_type": "code", 258 | "collapsed": false, 259 | "input": [ 260 | "s = set(l)" 261 | ], 262 | "language": "python", 263 | "metadata": {}, 264 | "outputs": [], 265 | "prompt_number": 20 266 | }, 267 | { 268 | "cell_type": "code", 269 | "collapsed": false, 270 | "input": [ 271 | "s" 272 | ], 273 | "language": "python", 274 | "metadata": {}, 275 | "outputs": [ 276 | { 277 | "metadata": {}, 278 | "output_type": "pyout", 279 | "prompt_number": 21, 280 | "text": [ 281 | "{1, 2, 3}" 282 | ] 283 | } 284 | ], 285 | "prompt_number": 21 286 | }, 287 | { 288 | 
"cell_type": "code", 289 | "collapsed": false, 290 | "input": [ 291 | "l" 292 | ], 293 | "language": "python", 294 | "metadata": {}, 295 | "outputs": [ 296 | { 297 | "metadata": {}, 298 | "output_type": "pyout", 299 | "prompt_number": 22, 300 | "text": [ 301 | "[1, 1, 1, 2, 3]" 302 | ] 303 | } 304 | ], 305 | "prompt_number": 22 306 | }, 307 | { 308 | "cell_type": "code", 309 | "collapsed": false, 310 | "input": [ 311 | "s1 = set({1,2,3})" 312 | ], 313 | "language": "python", 314 | "metadata": {}, 315 | "outputs": [], 316 | "prompt_number": 23 317 | }, 318 | { 319 | "cell_type": "code", 320 | "collapsed": false, 321 | "input": [ 322 | "s2 = set({3,4,5})" 323 | ], 324 | "language": "python", 325 | "metadata": {}, 326 | "outputs": [], 327 | "prompt_number": 24 328 | }, 329 | { 330 | "cell_type": "code", 331 | "collapsed": false, 332 | "input": [ 333 | "s1 - s2" 334 | ], 335 | "language": "python", 336 | "metadata": {}, 337 | "outputs": [ 338 | { 339 | "metadata": {}, 340 | "output_type": "pyout", 341 | "prompt_number": 25, 342 | "text": [ 343 | "{1, 2}" 344 | ] 345 | } 346 | ], 347 | "prompt_number": 25 348 | }, 349 | { 350 | "cell_type": "code", 351 | "collapsed": false, 352 | "input": [ 353 | "state_dict = {'ny': 'New York'}" 354 | ], 355 | "language": "python", 356 | "metadata": {}, 357 | "outputs": [], 358 | "prompt_number": 26 359 | }, 360 | { 361 | "cell_type": "code", 362 | "collapsed": false, 363 | "input": [ 364 | "state_dict['ny']" 365 | ], 366 | "language": "python", 367 | "metadata": {}, 368 | "outputs": [ 369 | { 370 | "metadata": {}, 371 | "output_type": "pyout", 372 | "prompt_number": 27, 373 | "text": [ 374 | "'New York'" 375 | ] 376 | } 377 | ], 378 | "prompt_number": 27 379 | }, 380 | { 381 | "cell_type": "code", 382 | "collapsed": false, 383 | "input": [ 384 | "state_dict[0]" 385 | ], 386 | "language": "python", 387 | "metadata": {}, 388 | "outputs": [ 389 | { 390 | "ename": "KeyError", 391 | "evalue": "0", 392 | "output_type": "pyerr", 393 | "traceback": [ 394 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)", 395 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mstate_dict\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", 396 | "\u001b[0;31mKeyError\u001b[0m: 0" 397 | ] 398 | } 399 | ], 400 | "prompt_number": 28 401 | }, 402 | { 403 | "cell_type": "code", 404 | "collapsed": false, 405 | "input": [], 406 | "language": "python", 407 | "metadata": {}, 408 | "outputs": [] 409 | } 410 | ], 411 | "metadata": {} 412 | } 413 | ] 414 | } -------------------------------------------------------------------------------- /class1_1/exercise/Exercise1_MeanFunction.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "def mean_calc(input_list):\n", 12 | " list_len = 0 # variable to track running length\n", 13 | " list_sum = 0 # variable to track running sum\n", 14 | " if input_list:\n", 15 | " for i in input_list:\n", 16 | " if isinstance(i,int) or isinstance(i,float): # check to see if element i is of type int or float\n", 17 | " list_len += 1\n", 18 | " list_sum += i\n", 19 | " else: # element i is not int or float\n", 20 | " print \"list element %s is 
not of type int or float\" % i\n", 21 | " return list_sum/float(list_len) #return the final calculation\n", 22 | " else: #list is empty\n", 23 | " return \"input list is empty\"" 24 | ] 25 | }, 26 | { 27 | "cell_type": "code", 28 | "execution_count": 2, 29 | "metadata": { 30 | "collapsed": true 31 | }, 32 | "outputs": [], 33 | "source": [ 34 | "test_list = [1,1,1,2,3,4,4,4,4,5,6,7,9]" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": 3, 40 | "metadata": { 41 | "collapsed": false 42 | }, 43 | "outputs": [ 44 | { 45 | "data": { 46 | "text/plain": [ 47 | "3.923076923076923" 48 | ] 49 | }, 50 | "execution_count": 3, 51 | "metadata": {}, 52 | "output_type": "execute_result" 53 | } 54 | ], 55 | "source": [ 56 | "mean_calc(test_list)" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": 4, 62 | "metadata": { 63 | "collapsed": true 64 | }, 65 | "outputs": [], 66 | "source": [ 67 | "import numpy as np" 68 | ] 69 | }, 70 | { 71 | "cell_type": "code", 72 | "execution_count": 5, 73 | "metadata": { 74 | "collapsed": false 75 | }, 76 | "outputs": [ 77 | { 78 | "data": { 79 | "text/plain": [ 80 | "3.9230769230769229" 81 | ] 82 | }, 83 | "execution_count": 5, 84 | "metadata": {}, 85 | "output_type": "execute_result" 86 | } 87 | ], 88 | "source": [ 89 | "np.mean(test_list)" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": 6, 95 | "metadata": { 96 | "collapsed": true 97 | }, 98 | "outputs": [], 99 | "source": [ 100 | "test_string = ['1',2,'3','4']" 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "execution_count": 7, 106 | "metadata": { 107 | "collapsed": false 108 | }, 109 | "outputs": [ 110 | { 111 | "name": "stdout", 112 | "output_type": "stream", 113 | "text": [ 114 | "list element 1 is not of type int or float\n", 115 | "list element 3 is not of type int or float\n", 116 | "list element 4 is not of type int or float\n" 117 | ] 118 | }, 119 | { 120 | "data": { 121 | "text/plain": [ 122 | "2.0" 123 | ] 124 | }, 125 | "execution_count": 7, 126 | "metadata": {}, 127 | "output_type": "execute_result" 128 | } 129 | ], 130 | "source": [ 131 | "mean_calc(test_string)" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": 8, 137 | "metadata": { 138 | "collapsed": false 139 | }, 140 | "outputs": [ 141 | { 142 | "ename": "TypeError", 143 | "evalue": "cannot perform reduce with flexible type", 144 | "output_type": "error", 145 | "traceback": [ 146 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 147 | "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", 148 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtest_string\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", 149 | "\u001b[0;32m/Users/richarddunks/anaconda/lib/python2.7/site-packages/numpy/core/fromnumeric.pyc\u001b[0m in \u001b[0;36mmean\u001b[0;34m(a, axis, dtype, out, keepdims)\u001b[0m\n\u001b[1;32m 2733\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2734\u001b[0m return _methods._mean(a, axis=axis, dtype=dtype,\n\u001b[0;32m-> 2735\u001b[0;31m out=out, keepdims=keepdims)\n\u001b[0m\u001b[1;32m 2736\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2737\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mstd\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0maxis\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mout\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mddof\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkeepdims\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mFalse\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 150 | "\u001b[0;32m/Users/richarddunks/anaconda/lib/python2.7/site-packages/numpy/core/_methods.pyc\u001b[0m in \u001b[0;36m_mean\u001b[0;34m(a, axis, dtype, out, keepdims)\u001b[0m\n\u001b[1;32m 64\u001b[0m \u001b[0mdtype\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mmu\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'f8'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 65\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 66\u001b[0;31m \u001b[0mret\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mumr_sum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marr\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mout\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkeepdims\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 67\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mret\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmu\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mndarray\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 68\u001b[0m ret = um.true_divide(\n", 151 | "\u001b[0;31mTypeError\u001b[0m: cannot perform reduce with flexible type" 152 | ] 153 | } 154 | ], 155 | "source": [ 156 | "np.mean(test_string)" 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": 9, 162 | "metadata": { 163 | "collapsed": true 164 | }, 165 | "outputs": [], 166 | "source": [ 167 | "empty_list =[]" 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": 10, 173 | "metadata": { 174 | "collapsed": false 175 | }, 176 | "outputs": [ 177 | { 178 | "data": { 179 | "text/plain": [ 180 | "'input list is empty'" 181 | ] 182 | }, 183 | "execution_count": 10, 184 | "metadata": {}, 185 | "output_type": "execute_result" 186 | } 187 | ], 188 | "source": [ 189 | "mean_calc(empty_list)" 190 | ] 191 | }, 192 | { 193 | "cell_type": "code", 194 | "execution_count": 11, 195 | "metadata": { 196 | "collapsed": false 197 | }, 198 | "outputs": [ 199 | { 200 | "name": "stderr", 201 | "output_type": "stream", 202 | "text": [ 203 | "/Users/richarddunks/anaconda/lib/python2.7/site-packages/numpy/core/_methods.py:59: RuntimeWarning: Mean of empty slice.\n", 204 | " warnings.warn(\"Mean of empty slice.\", RuntimeWarning)\n", 205 | "/Users/richarddunks/anaconda/lib/python2.7/site-packages/numpy/core/_methods.py:71: RuntimeWarning: invalid value encountered in double_scalars\n", 206 | " ret = ret.dtype.type(ret / rcount)\n" 207 | ] 208 | }, 209 | { 210 | "data": { 211 | "text/plain": [ 212 | "nan" 213 | ] 214 | }, 215 | "execution_count": 11, 216 | "metadata": {}, 217 | "output_type": "execute_result" 218 | } 219 | ], 220 | "source": [ 221 | "np.mean(empty_list)" 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": null, 227 | "metadata": { 228 | "collapsed": true 229 | }, 230 | "outputs": [], 231 | "source": [] 232 | } 233 | ], 234 | "metadata": { 
235 | "kernelspec": { 236 | "display_name": "Python 2", 237 | "language": "python", 238 | "name": "python2" 239 | }, 240 | "language_info": { 241 | "codemirror_mode": { 242 | "name": "ipython", 243 | "version": 2 244 | }, 245 | "file_extension": ".py", 246 | "mimetype": "text/x-python", 247 | "name": "python", 248 | "nbconvert_exporter": "python", 249 | "pygments_lexer": "ipython2", 250 | "version": "2.7.10" 251 | } 252 | }, 253 | "nbformat": 4, 254 | "nbformat_minor": 0 255 | } 256 | -------------------------------------------------------------------------------- /class1_1/exercise/Exercise2_MayoralExcuseGenerator.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": false 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import random #package for generating pseudo-random numbers: https://docs.python.org/2/library/random.html" 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": 3, 17 | "metadata": { 18 | "collapsed": false 19 | }, 20 | "outputs": [], 21 | "source": [ 22 | "import csv" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": 4, 28 | "metadata": { 29 | "collapsed": false 30 | }, 31 | "outputs": [ 32 | { 33 | "name": "stdout", 34 | "output_type": "stream", 35 | "text": [ 36 | "Enter your name: Richard\n" 37 | ] 38 | } 39 | ], 40 | "source": [ 41 | "person = raw_input('Enter your name: ')" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": 5, 47 | "metadata": { 48 | "collapsed": false 49 | }, 50 | "outputs": [ 51 | { 52 | "name": "stdout", 53 | "output_type": "stream", 54 | "text": [ 55 | "Enter your destination: Chelsea\n" 56 | ] 57 | } 58 | ], 59 | "source": [ 60 | "place = raw_input('Enter your destination: ')" 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": 6, 66 | "metadata": { 67 | "collapsed": false 68 | }, 69 | "outputs": [], 70 | "source": [ 71 | "r = random.randrange(0,11) # generate random number between 0 and 10" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 8, 77 | "metadata": { 78 | "collapsed": false 79 | }, 80 | "outputs": [], 81 | "source": [ 82 | "excuse_list = [] #create an empty list to hold the excuses\n", 83 | "inputReader = csv.DictReader(open('excuse.csv','rU'))\n", 84 | "for line in inputReader:\n", 85 | " excuse_list.append(line) # append the excuses (as dictionary) to the list" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": 7, 91 | "metadata": { 92 | "collapsed": false 93 | }, 94 | "outputs": [ 95 | { 96 | "name": "stdout", 97 | "output_type": "stream", 98 | "text": [ 99 | "Sorry, Richard I was late to Chelsea, breakfast began a little later than expected\n", 100 | "From the story \"De Blasio 15 Minutes Late to St. 
Patrick's Day Mass, Blames Breakfast\"\n", 101 | "http://www.dnainfo.com/new-york/20150317/midtown/de-blasio-15-minutes-late-st-patricks-day-mass-blames-breakfast\n" 102 | ] 103 | } 104 | ], 105 | "source": [ 106 | "print \"Sorry, \" + person + \" I was late to \" + place + \", \" + excuse_list[r]['excuse']\n", 107 | "print 'From the story \"' + excuse_list[r]['headline'] + '\"'\n", 108 | "print excuse_list[r]['hyperlink']" 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": 15, 114 | "metadata": { 115 | "collapsed": false 116 | }, 117 | "outputs": [], 118 | "source": [ 119 | "# alternate way of generating the list of excuses using the context manager\n", 120 | "# http://preshing.com/20110920/the-python-with-statement-by-example/\n", 121 | "excuse_list2 = []\n", 122 | "with open('excuse.csv','rU') as inputFile:\n", 123 | " inputReader = csv.DictReader(inputFile)\n", 124 | " for line in inputReader:\n", 125 | " excuse_list2.append(line) # append the excuses (as dictionary) to the list\n", 126 | " #file connection is close at end of the indented code" 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": null, 132 | "metadata": { 133 | "collapsed": false 134 | }, 135 | "outputs": [], 136 | "source": [ 137 | "# This is the least elegant and least pythonic way of doing this. \n", 138 | "# Putting this code up at a Python conference could get you booed or otherwise shamed and driven from the hall\n", 139 | "# but it gets the job done\n", 140 | "inputFile = open('excuse.csv','rU') #create the file object\n", 141 | "header = next(inputFile) # return the first line of the file (header) and assign to a variable\n", 142 | "excuse_list = []\n", 143 | "for line in inputFile:\n", 144 | " line = line.split(',') # split the line on the comma\n", 145 | " excuse_list.append(line[0]) # append the first element to the list\n", 146 | "inputFile.close() # close connection to the file" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": null, 152 | "metadata": { 153 | "collapsed": false 154 | }, 155 | "outputs": [], 156 | "source": [] 157 | } 158 | ], 159 | "metadata": { 160 | "kernelspec": { 161 | "display_name": "Python 2", 162 | "language": "python", 163 | "name": "python2" 164 | }, 165 | "language_info": { 166 | "codemirror_mode": { 167 | "name": "ipython", 168 | "version": 2 169 | }, 170 | "file_extension": ".py", 171 | "mimetype": "text/x-python", 172 | "name": "python", 173 | "nbconvert_exporter": "python", 174 | "pygments_lexer": "ipython2", 175 | "version": "2.7.10" 176 | } 177 | }, 178 | "nbformat": 4, 179 | "nbformat_minor": 0 180 | } 181 | -------------------------------------------------------------------------------- /class1_1/exercise/Exercise3-Answers.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# The following code will print the prime numbers between 1 and 100. 
Modify the code so it prints every other prime number from 1 to 100" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 2, 13 | "metadata": { 14 | "collapsed": false 15 | }, 16 | "outputs": [ 17 | { 18 | "name": "stdout", 19 | "output_type": "stream", 20 | "text": [ 21 | "1\n", 22 | "3\n", 23 | "7\n", 24 | "13\n", 25 | "19\n", 26 | "29\n", 27 | "37\n", 28 | "43\n", 29 | "53\n", 30 | "61\n", 31 | "71\n", 32 | "79\n", 33 | "89\n" 34 | ] 35 | } 36 | ], 37 | "source": [ 38 | "j = 0 # add check counter outside the for-loop so it doesn't get reset\n", 39 | "for num in range(1,101): \n", 40 | " prime = True \n", 41 | " for i in range(2,num): \n", 42 | " if (num%i==0): \n", 43 | " prime = False \n", 44 | " if prime: \n", 45 | " if j%2 == 0: # test the check counter for being even and if so, then print the number\n", 46 | " print num\n", 47 | " j += 1 # increment the check counter each time a prime is found" 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": {}, 53 | "source": [ 54 | "# Extra Credit: Can you write a procedure that runs faster than the one above?" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": 12, 60 | "metadata": { 61 | "collapsed": false 62 | }, 63 | "outputs": [ 64 | { 65 | "name": "stdout", 66 | "output_type": "stream", 67 | "text": [ 68 | "1\n", 69 | "3\n", 70 | "7\n", 71 | "13\n", 72 | "19\n", 73 | "29\n", 74 | "37\n", 75 | "43\n", 76 | "53\n", 77 | "61\n", 78 | "71\n", 79 | "79\n", 80 | "89\n" 81 | ] 82 | } 83 | ], 84 | "source": [ 85 | "j = 0 \n", 86 | "for num in range(1,101): \n", 87 | " prime = True \n", 88 | " for i in range(2,num): \n", 89 | " if (num%i==0): \n", 90 | " prime = False\n", 91 | " continue \n", 92 | " # once the number has already been shown to be false, \n", 93 | " # there's no reason to keep checking\n", 94 | " if prime: \n", 95 | " if j%2 == 0: \n", 96 | " print num\n", 97 | " j += 1" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": 12, 103 | "metadata": { 104 | "collapsed": false 105 | }, 106 | "outputs": [], 107 | "source": [] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": null, 112 | "metadata": { 113 | "collapsed": false 114 | }, 115 | "outputs": [], 116 | "source": [] 117 | } 118 | ], 119 | "metadata": { 120 | "kernelspec": { 121 | "display_name": "Python 2", 122 | "language": "python", 123 | "name": "python2" 124 | }, 125 | "language_info": { 126 | "codemirror_mode": { 127 | "name": "ipython", 128 | "version": 2 129 | }, 130 | "file_extension": ".py", 131 | "mimetype": "text/x-python", 132 | "name": "python", 133 | "nbconvert_exporter": "python", 134 | "pygments_lexer": "ipython2", 135 | "version": "2.7.10" 136 | } 137 | }, 138 | "nbformat": 4, 139 | "nbformat_minor": 0 140 | } 141 | -------------------------------------------------------------------------------- /class1_1/exercise/Exercise3.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "name": "", 4 | "signature": "sha256:68df1038be6fa984e8fa87db9aa2fa3b80b0196b0e8ca61596f63da0942cd96d" 5 | }, 6 | "nbformat": 3, 7 | "nbformat_minor": 0, 8 | "worksheets": [ 9 | { 10 | "cells": [ 11 | { 12 | "cell_type": "heading", 13 | "level": 1, 14 | "metadata": {}, 15 | "source": [ 16 | "The following code will print the prime numbers between 1 and 100. 
Modify the code so it prints every other prime number from 1 to 100" 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "collapsed": false, 22 | "input": [ 23 | "for num in range(1,101): # for-loop through the numbers\n", 24 | " prime = True # boolean flag to check the number for being prime\n", 25 | " for i in range(2,num): # for-loop to check for \"primeness\" by checking for divisors other than 1\n", 26 | " if (num%i==0): # logical test for the number having a divisor other than 1 and itself\n", 27 | " prime = False # if there's a divisor, the boolean value gets flipped to False\n", 28 | " if prime: # if prime is still True after going through all numbers from 1 - 100, then it gets printed\n", 29 | " print num" 30 | ], 31 | "language": "python", 32 | "metadata": {}, 33 | "outputs": [] 34 | }, 35 | { 36 | "cell_type": "heading", 37 | "level": 1, 38 | "metadata": {}, 39 | "source": [ 40 | "Extra Credit: Can you write a procedure that runs faster than the one above?" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "collapsed": false, 46 | "input": [], 47 | "language": "python", 48 | "metadata": {}, 49 | "outputs": [] 50 | } 51 | ], 52 | "metadata": {} 53 | } 54 | ] 55 | } -------------------------------------------------------------------------------- /class1_1/exercise/Exercise4-Answers.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# The writer of this code wants to count the mean and median article length for recent articles on gay marraige. This code has several issues, including errors. When they checked their custom functions against the numpy functions, they noticed some discrepancies. Fix the code so it executes properly and the output of the custom functions match the output of the numpy functions" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 5, 13 | "metadata": { 14 | "collapsed": false 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import requests # a better package than urllib2" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 6, 24 | "metadata": { 25 | "collapsed": false 26 | }, 27 | "outputs": [], 28 | "source": [ 29 | "def my_mean(input_list):\n", 30 | " list_sum = 0\n", 31 | " list_count = 0\n", 32 | " for el in input_list:\n", 33 | " list_sum += el\n", 34 | " list_count += 1\n", 35 | " return list_sum / float(list_count) # cast list_count to float" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": 42, 41 | "metadata": { 42 | "collapsed": false 43 | }, 44 | "outputs": [], 45 | "source": [ 46 | "def my_median(input_list):\n", 47 | " input_list.sort() # sort the list\n", 48 | " list_length = len(input_list) # get length so it doesn't need to be recalculated\n", 49 | "\n", 50 | " # test for even length and take len/2 and len/2 -1 divided over 2.0 for float division\n", 51 | " if list_length %2 == 0: \n", 52 | " return (input_list[list_length/2] + input_list[(list_length/2) - 1]) / 2.0 \n", 53 | " else:\n", 54 | " return input_list[list_length/2]" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": 2, 60 | "metadata": { 61 | "collapsed": false 62 | }, 63 | "outputs": [], 64 | "source": [ 65 | "api_key = \"ffaf60d7d82258e112dd4fb2b5e4e2d6:3:72421680\"" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": 3, 71 | "metadata": { 72 | "collapsed": false 73 | }, 74 | "outputs": [], 75 | "source": [ 76 | "url = 
\"http://api.nytimes.com/svc/search/v2/articlesearch.json?q=gay+marriage&api-key=%s\" % api_key # variable name mistyped" 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": 8, 82 | "metadata": { 83 | "collapsed": false 84 | }, 85 | "outputs": [], 86 | "source": [ 87 | "r = requests.get(url)" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": 10, 93 | "metadata": { 94 | "collapsed": false 95 | }, 96 | "outputs": [], 97 | "source": [ 98 | "wc_list = []\n", 99 | "for article in r.json()['response']['docs']:\n", 100 | " wc_list.append(int(article['word_count'])) #word_count needs to be cast to int" 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "execution_count": 11, 106 | "metadata": { 107 | "collapsed": false 108 | }, 109 | "outputs": [ 110 | { 111 | "data": { 112 | "text/plain": [ 113 | "1034.2" 114 | ] 115 | }, 116 | "execution_count": 11, 117 | "metadata": {}, 118 | "output_type": "execute_result" 119 | } 120 | ], 121 | "source": [ 122 | "my_mean(wc_list)" 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": 12, 128 | "metadata": { 129 | "collapsed": false 130 | }, 131 | "outputs": [], 132 | "source": [ 133 | "import numpy as np" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": 13, 139 | "metadata": { 140 | "collapsed": false 141 | }, 142 | "outputs": [ 143 | { 144 | "data": { 145 | "text/plain": [ 146 | "1034.2" 147 | ] 148 | }, 149 | "execution_count": 13, 150 | "metadata": {}, 151 | "output_type": "execute_result" 152 | } 153 | ], 154 | "source": [ 155 | "np.mean(wc_list)" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": 43, 161 | "metadata": { 162 | "collapsed": false 163 | }, 164 | "outputs": [ 165 | { 166 | "data": { 167 | "text/plain": [ 168 | "926.5" 169 | ] 170 | }, 171 | "execution_count": 43, 172 | "metadata": {}, 173 | "output_type": "execute_result" 174 | } 175 | ], 176 | "source": [ 177 | "my_median(wc_list)" 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": 28, 183 | "metadata": { 184 | "collapsed": false 185 | }, 186 | "outputs": [ 187 | { 188 | "data": { 189 | "text/plain": [ 190 | "926.5" 191 | ] 192 | }, 193 | "execution_count": 28, 194 | "metadata": {}, 195 | "output_type": "execute_result" 196 | } 197 | ], 198 | "source": [ 199 | "np.median(wc_list)" 200 | ] 201 | } 202 | ], 203 | "metadata": { 204 | "kernelspec": { 205 | "display_name": "Python 2", 206 | "language": "python", 207 | "name": "python2" 208 | }, 209 | "language_info": { 210 | "codemirror_mode": { 211 | "name": "ipython", 212 | "version": 2 213 | }, 214 | "file_extension": ".py", 215 | "mimetype": "text/x-python", 216 | "name": "python", 217 | "nbconvert_exporter": "python", 218 | "pygments_lexer": "ipython2", 219 | "version": "2.7.10" 220 | } 221 | }, 222 | "nbformat": 4, 223 | "nbformat_minor": 0 224 | } 225 | -------------------------------------------------------------------------------- /class1_1/exercise/Exercise4.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "name": "", 4 | "signature": "sha256:27eed4d676ae4b5bf707d837cc436a5377ce46ea08f8fe7dcf43383e80482aeb" 5 | }, 6 | "nbformat": 3, 7 | "nbformat_minor": 0, 8 | "worksheets": [ 9 | { 10 | "cells": [ 11 | { 12 | "cell_type": "heading", 13 | "level": 1, 14 | "metadata": {}, 15 | "source": [ 16 | "The writer of this code wants to count the mean and median article length for recent articles on gay marriage in the New York 
Times. This code has several issues, including errors. When they checked their custom functions against the numpy functions, they noticed some discrepancies. Fix the code so it executes properly, retrieves the articles, and outputs the correct result from the custom functions, compared to the numpy functions." 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "collapsed": false, 22 | "input": [ 23 | "import requests # a better package than urllib2" 24 | ], 25 | "language": "python", 26 | "metadata": {}, 27 | "outputs": [] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "collapsed": false, 32 | "input": [ 33 | "def my_mean(input_list):\n", 34 | " list_sum = 0\n", 35 | " list_count = 0\n", 36 | " for el in input_list:\n", 37 | " list_sum += el\n", 38 | " list_count += 1\n", 39 | " return list_sum / list_count" 40 | ], 41 | "language": "python", 42 | "metadata": {}, 43 | "outputs": [] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "collapsed": false, 48 | "input": [ 49 | "def my_median(input_list):\n", 50 | " list_length = len(input_list)\n", 51 | " return input_list[list_length/2]" 52 | ], 53 | "language": "python", 54 | "metadata": {}, 55 | "outputs": [] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "collapsed": false, 60 | "input": [ 61 | "api_key = \"ffaf60d7d82258e112dd4fb2b5e4e2d6:3:72421680\"" 62 | ], 63 | "language": "python", 64 | "metadata": {}, 65 | "outputs": [] 66 | }, 67 | { 68 | "cell_type": "code", 69 | "collapsed": false, 70 | "input": [ 71 | "url = \"http://api.nytimes.com/svc/search/v2/articlesearch.json?q=gay+marriage&api-key=%s\" % API_key" 72 | ], 73 | "language": "python", 74 | "metadata": {}, 75 | "outputs": [] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "collapsed": false, 80 | "input": [ 81 | "r = requests.get(url)" 82 | ], 83 | "language": "python", 84 | "metadata": {}, 85 | "outputs": [] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "collapsed": false, 90 | "input": [ 91 | "wc_list = []\n", 92 | "for article in r.json()['response']['docs']:\n", 93 | " wc_list.append(article['word_count'])" 94 | ], 95 | "language": "python", 96 | "metadata": {}, 97 | "outputs": [] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "collapsed": false, 102 | "input": [ 103 | "my_mean(wc_list)" 104 | ], 105 | "language": "python", 106 | "metadata": {}, 107 | "outputs": [] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "collapsed": false, 112 | "input": [ 113 | "import numpy as np" 114 | ], 115 | "language": "python", 116 | "metadata": {}, 117 | "outputs": [] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "collapsed": false, 122 | "input": [ 123 | "np.mean(wc_list)" 124 | ], 125 | "language": "python", 126 | "metadata": {}, 127 | "outputs": [] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "collapsed": false, 132 | "input": [ 133 | "my_median(wc_list)" 134 | ], 135 | "language": "python", 136 | "metadata": {}, 137 | "outputs": [] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "collapsed": false, 142 | "input": [ 143 | "np.median(wc_list)" 144 | ], 145 | "language": "python", 146 | "metadata": {}, 147 | "outputs": [] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "collapsed": false, 152 | "input": [], 153 | "language": "python", 154 | "metadata": {}, 155 | "outputs": [] 156 | } 157 | ], 158 | "metadata": {} 159 | } 160 | ] 161 | } -------------------------------------------------------------------------------- /class1_1/exercise/excuse.csv: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/datapolitan/lede_algorithms/9109b5c91a3eb74c1343e79daf60abd76be182ea/class1_1/exercise/excuse.csv -------------------------------------------------------------------------------- /class1_1/lab1-1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "name": "", 4 | "signature": "sha256:b6f3a87300b6901d9f5557b6771c6025c4453f08fb77b010b5531157a6471784" 5 | }, 6 | "nbformat": 3, 7 | "nbformat_minor": 0, 8 | "worksheets": [ 9 | { 10 | "cells": [ 11 | { 12 | "cell_type": "code", 13 | "collapsed": false, 14 | "input": [ 15 | "b = 1\n", 16 | "for i in range(5):\n", 17 | " print b\n", 18 | " b += 1" 19 | ], 20 | "language": "python", 21 | "metadata": {}, 22 | "outputs": [ 23 | { 24 | "output_type": "stream", 25 | "stream": "stdout", 26 | "text": [ 27 | "1\n", 28 | "2\n", 29 | "3\n", 30 | "4\n", 31 | "5\n" 32 | ] 33 | } 34 | ], 35 | "prompt_number": 11 36 | }, 37 | { 38 | "cell_type": "code", 39 | "collapsed": false, 40 | "input": [ 41 | "for i in range(5):\n", 42 | " b = 5\n", 43 | " print b\n", 44 | " b += 1" 45 | ], 46 | "language": "python", 47 | "metadata": {}, 48 | "outputs": [ 49 | { 50 | "output_type": "stream", 51 | "stream": "stdout", 52 | "text": [ 53 | "5\n", 54 | "5\n", 55 | "5\n", 56 | "5\n", 57 | "5\n" 58 | ] 59 | } 60 | ], 61 | "prompt_number": 9 62 | }, 63 | { 64 | "cell_type": "code", 65 | "collapsed": false, 66 | "input": [ 67 | "for i in range(10):\n", 68 | " print i\n", 69 | " for j in range(10):\n", 70 | " print j\n", 71 | "print i" 72 | ], 73 | "language": "python", 74 | "metadata": {}, 75 | "outputs": [ 76 | { 77 | "output_type": "stream", 78 | "stream": "stdout", 79 | "text": [ 80 | "0\n", 81 | "0\n", 82 | "1\n", 83 | "2\n", 84 | "3\n", 85 | "4\n", 86 | "5\n", 87 | "6\n", 88 | "7\n", 89 | "8\n", 90 | "9\n", 91 | "1\n", 92 | "0\n", 93 | "1\n", 94 | "2\n", 95 | "3\n", 96 | "4\n", 97 | "5\n", 98 | "6\n", 99 | "7\n", 100 | "8\n", 101 | "9\n", 102 | "2\n", 103 | "0\n", 104 | "1\n", 105 | "2\n", 106 | "3\n", 107 | "4\n", 108 | "5\n", 109 | "6\n", 110 | "7\n", 111 | "8\n", 112 | "9\n", 113 | "3\n", 114 | "0\n", 115 | "1\n", 116 | "2\n", 117 | "3\n", 118 | "4\n", 119 | "5\n", 120 | "6\n", 121 | "7\n", 122 | "8\n", 123 | "9\n", 124 | "4\n", 125 | "0\n", 126 | "1\n", 127 | "2\n", 128 | "3\n", 129 | "4\n", 130 | "5\n", 131 | "6\n", 132 | "7\n", 133 | "8\n", 134 | "9\n", 135 | "5\n", 136 | "0\n", 137 | "1\n", 138 | "2\n", 139 | "3\n", 140 | "4\n", 141 | "5\n", 142 | "6\n", 143 | "7\n", 144 | "8\n", 145 | "9\n", 146 | "6\n", 147 | "0\n", 148 | "1\n", 149 | "2\n", 150 | "3\n", 151 | "4\n", 152 | "5\n", 153 | "6\n", 154 | "7\n", 155 | "8\n", 156 | "9\n", 157 | "7\n", 158 | "0\n", 159 | "1\n", 160 | "2\n", 161 | "3\n", 162 | "4\n", 163 | "5\n", 164 | "6\n", 165 | "7\n", 166 | "8\n", 167 | "9\n", 168 | "8\n", 169 | "0\n", 170 | "1\n", 171 | "2\n", 172 | "3\n", 173 | "4\n", 174 | "5\n", 175 | "6\n", 176 | "7\n", 177 | "8\n", 178 | "9\n", 179 | "9\n", 180 | "0\n", 181 | "1\n", 182 | "2\n", 183 | "3\n", 184 | "4\n", 185 | "5\n", 186 | "6\n", 187 | "7\n", 188 | "8\n", 189 | "9\n", 190 | "9\n" 191 | ] 192 | } 193 | ], 194 | "prompt_number": 13 195 | }, 196 | { 197 | "cell_type": "code", 198 | "collapsed": false, 199 | "input": [ 200 | "person= raw_input(\"Enter your name: \")" 201 | ], 202 | "language": "python", 203 | "metadata": {}, 204 | "outputs": [ 205 | { 206 | "name": "stdout", 207 | "output_type": "stream", 208 | "stream": "stdout", 209 | "text": [ 210 | "Enter your name: Richard\n" 211 | ] 
212 | } 213 | ], 214 | "prompt_number": 14 215 | }, 216 | { 217 | "cell_type": "code", 218 | "collapsed": false, 219 | "input": [ 220 | "person" 221 | ], 222 | "language": "python", 223 | "metadata": {}, 224 | "outputs": [ 225 | { 226 | "metadata": {}, 227 | "output_type": "pyout", 228 | "prompt_number": 15, 229 | "text": [ 230 | "'Richard'" 231 | ] 232 | } 233 | ], 234 | "prompt_number": 15 235 | }, 236 | { 237 | "cell_type": "code", 238 | "collapsed": false, 239 | "input": [ 240 | "3/4" 241 | ], 242 | "language": "python", 243 | "metadata": {}, 244 | "outputs": [ 245 | { 246 | "metadata": {}, 247 | "output_type": "pyout", 248 | "prompt_number": 16, 249 | "text": [ 250 | "0" 251 | ] 252 | } 253 | ], 254 | "prompt_number": 16 255 | }, 256 | { 257 | "cell_type": "code", 258 | "collapsed": false, 259 | "input": [ 260 | "3/4.0" 261 | ], 262 | "language": "python", 263 | "metadata": {}, 264 | "outputs": [ 265 | { 266 | "metadata": {}, 267 | "output_type": "pyout", 268 | "prompt_number": 17, 269 | "text": [ 270 | "0.75" 271 | ] 272 | } 273 | ], 274 | "prompt_number": 17 275 | }, 276 | { 277 | "cell_type": "code", 278 | "collapsed": false, 279 | "input": [ 280 | "import csv" 281 | ], 282 | "language": "python", 283 | "metadata": {}, 284 | "outputs": [], 285 | "prompt_number": 24 286 | }, 287 | { 288 | "cell_type": "code", 289 | "collapsed": false, 290 | "input": [ 291 | "inputFile = open('../lede_algorithms/class1_1/exercise/excuse.csv','rU')\n", 292 | "inputReader = csv.reader(inputFile)" 293 | ], 294 | "language": "python", 295 | "metadata": {}, 296 | "outputs": [], 297 | "prompt_number": 27 298 | }, 299 | { 300 | "cell_type": "code", 301 | "collapsed": false, 302 | "input": [ 303 | "for line in inputFile:\n", 304 | " line = line.split(',')\n", 305 | " print line" 306 | ], 307 | "language": "python", 308 | "metadata": {}, 309 | "outputs": [ 310 | { 311 | "output_type": "stream", 312 | "stream": "stdout", 313 | "text": [ 314 | "['excuse', 'headline', 'hyperlink\\rthe fog was unexpected and did slow us down a bit', \"De Blasio Blames 'Rough Night' and Fog for Missing Flight 587 Ceremony\", 'http://www.dnainfo.com/new-york/20141112/rockaway-park/de-blasio-arrives-20-minutes-late-flight-587-memorial-angering-families\\rwe had some meetings at Gracie Mansion', \"De Blasio 30 Minutes Late to Rockaway St. Patrick's Day Parade\", 'http://www.dnainfo.com/new-york/20150307/belle-harbor/de-blasio-30-minutes-late-rockaway-st-patricks-day-parade\\rI had a very rough night and woke up sluggish', \"De Blasio Blames 'Rough Night' and Fog for Missing Flight 587 Ceremony\", 'http://www.dnainfo.com/new-york/20141112/rockaway-park/de-blasio-arrives-20-minutes-late-flight-587-memorial-angering-families\\rI just woke up in the middle of the night and couldn\\x89\\xdb\\xaat get back to sleep', \"De Blasio Blames 'Rough Night' and Fog for Missing Flight 587 Ceremony\", 'http://www.dnainfo.com/new-york/20141112/rockaway-park/de-blasio-arrives-20-minutes-late-flight-587-memorial-angering-families\\rwe had some stuff we had to do', \"De Blasio 30 Minutes Late to Rockaway St. 
Patrick's Day Parade\", 'http://www.dnainfo.com/new-york/20150307/belle-harbor/de-blasio-30-minutes-late-rockaway-st-patricks-day-parade\\rI should have gotten myself moving quicker', \"De Blasio Blames 'Rough Night' and Fog for Missing Flight 587 Ceremony\", 'http://www.dnainfo.com/new-york/20141112/rockaway-park/de-blasio-arrives-20-minutes-late-flight-587-memorial-angering-families\\rI was just not feeling well this morning', \"De Blasio Blames 'Rough Night' and Fog for Missing Flight 587 Ceremony\", 'http://www.dnainfo.com/new-york/20141112/rockaway-park/de-blasio-arrives-20-minutes-late-flight-587-memorial-angering-families\\rbreakfast began a little later than expected', '\"De Blasio 15 Minutes Late to St. Patrick\\'s Day Mass', ' Blames Breakfast\"', 'http://www.dnainfo.com/new-york/20150317/midtown/de-blasio-15-minutes-late-st-patricks-day-mass-blames-breakfast\\rthe detail drove away when we went into the subway rather than waiting to confirm we got on a train', 'Mayor de Blasio Is Irked by a Subway Delay', 'http://www.nytimes.com/2015/05/06/nyregion/mayor-de-blasio-is-irked-by-a-subway-delay.html?ref=nyregion&_r=0\\rwe waited 20 mins for an express only to hear there were major delays', 'Mayor de Blasio Is Irked by a Subway Delay', 'http://www.nytimes.com/2015/05/06/nyregion/mayor-de-blasio-is-irked-by-a-subway-delay.html?ref=nyregion&_r=0\\rwe need a better system', 'Mayor de Blasio Is Irked by a Subway Delay', 'http://www.nytimes.com/2015/05/06/nyregion/mayor-de-blasio-is-irked-by-a-subway-delay.html?ref=nyregion&_r=0']\n" 315 | ] 316 | } 317 | ], 318 | "prompt_number": 28 319 | }, 320 | { 321 | "cell_type": "code", 322 | "collapsed": false, 323 | "input": [ 324 | "inputFile = open('/Users/richarddunks/Dropbox/Datapolitan/Projects/Training/)" 325 | ], 326 | "language": "python", 327 | "metadata": {}, 328 | "outputs": [] 329 | } 330 | ], 331 | "metadata": {} 332 | } 333 | ] 334 | } -------------------------------------------------------------------------------- /class1_1/newsroom_examples.md: -------------------------------------------------------------------------------- 1 | # Algorithms in the newsroom 2 | 3 | Ever since Phillip Meyer published [Precision Journalism](http://www.unc.edu/~pmeyer/book/) in the 1970s (and probably even a bit before), journalists have been using algorithms in some form in order to tell new and different kinds of stories. Below we've included several examples from different eras of what you'd now call data journalism: 4 | 5 | ### Classic examples 6 | 7 | Computer-assisted reporting specialists have been a fixture in newsroom for decades, applying data analysis and social science methods to the news report. Some of the most sophisticated of those techniques are based in part on algorithms we'll learn in this class: 8 | 9 | - **School Scandals, Children Left Behind: Cheating in Texas Schools and Also [Faking the Grade](http://clipfile.org/?p=892)**: Two powerful series of stories by the Dallas Morning News in the mid-2000s that showed rampant cheating by students and teachers on Texas standardized exams. Not the first story to use regression models but one of the most powerful early examples. 10 | 11 | - **[Speed Trap: Who Gets a Ticket, Who Gets a Break?](http://www.boston.com/globe/metro/packages/tickets/)**: Another early example of using logistic regression to explain a newsworthy phenomenon -- in this case the many factors that go into whether a person is given a speeding ticket or let off the hook. 
Just as interesting as the story is its [detailed methodology](http://www.boston.com/globe/metro/packages/tickets/study.pdf), which is worth a read. 12 | 13 | - **[Cluster analysis in CAR](https://www.ire.org/publications/search-uplink-archives/167/)**: Simple cluster analysis has been used for years in newsrooms to find everything from [crime hotspots](http://www.icpsr.umich.edu/CrimeStat/) to [cancer clusters on Long Island](http://www.ij-healthgeographics.com/content/2/1/3). 14 | 15 | ### Algorithmic journalism catches on 16 | 17 | Although reporters and computer-assisted reporting specialists had been doing some form of it for years, the idea of "data journalism" as it is now known was popularized during the 2012 presidential elections, in large part thanks to the predictive modeling of Nate Silver. 18 | 19 | - **[FiveThirtyEight](http://fivethirtyeight.blogs.nytimes.com/fivethirtyeights-2012-forecast/)**: Nate Silver's prediction models were the first example of data/algorithmic journalism reaching the mainstream. Since then, election predictions have become a bit old hat. The Times' new model, [Leo](http://www.nytimes.com/newsgraphics/2014/senate-model/), was exceedingly accurate in 2014 (its [source code](https://github.com/TheUpshot/leo-senate-model) is online). The Times also ran a series of [live predictions](http://elections.nytimes.com/2014/senate-model) on key 2014 races on Election Night. 20 | 21 | - **[ProPublica's Message Machine](https://projects.propublica.org/emails/)**: Also during the 2012 elections, ProPublica launched its Message Machine project, which used hashing algorithms to reverse-engineer targeted e-mail messages from political campaigns. 22 | 23 | - **[L.A. Times crime alerts](http://maps.latimes.com/crime/)**: The Los Angeles Times has for years been calculating and publicizing alerts when crime spikes in certain neighborhoods. 24 | 25 | ### Modern examples 26 | 27 | These days, sophisticated algorithms are used to solve all sorts of journalistic problems, both exciting and mundane. 28 | 29 | - **[Campaign finance data deduplication](https://github.com/cjdd3b/fec-standardizer/wiki)**: Most campaign finance data is organized by contribution, not donor. Joe Smith might give three different contributions and be listed in the data in three different ways. Connecting those records into a single canonical Joe Smith is often the first step to doing sophisticated campaign finance analysis. Over the last few years, people have developed highly accurate methods to do this using both supervised and unsupervised machine learning. 30 | 31 | - **[NYT Cooking](http://cooking.nytimes.com/)**: The new Cooking website and app has been one of the Times' most successful new products, but it was initially based largely on recipes stored in free-text articles. The Times extracted many of those recipes using an algorithmic technique known as [conditional random fields](http://open.blogs.nytimes.com/2015/04/09/extracting-structured-data-from-recipes-using-conditional-random-fields/). The L.A. Times did something [similar](https://source.opennews.org/en-US/articles/how-we-made-new-california-cookbook/) in 2013.
-------------------------------------------------------------------------------- /class1_2/2013_NYC_CD_MedianIncome_Recycle.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datapolitan/lede_algorithms/9109b5c91a3eb74c1343e79daf60abd76be182ea/class1_2/2013_NYC_CD_MedianIncome_Recycle.xlsx -------------------------------------------------------------------------------- /class1_2/Data_Collection_Sheet.csv: -------------------------------------------------------------------------------- 1 | name,height (inches),age (years),siblings (not including you) 2 | Richard,72,35,1 3 | Adam,71,29,1 4 | Rashida,62,24,0 5 | Spe ,62,23,1 6 | Jiachuan,67,25,0 7 | GK,66,36,1 8 | Fanny,64,32,3 9 | Meghan,65,27,2 10 | Arthur,74,49,1 11 | Elliott,66,31,3 12 | Aliza,64,23,1 13 | Lindsay,67,22,2 14 | Michael,71,49,3 15 | Vanessa,66,27,2 16 | Melissa,61,23,1 17 | Kassahun,67,32,3 18 | Sebastian,67.7165,26,2 19 | Giulia,67,27,2 20 | Siutan,66,25,0 21 | Tian,68,23,0 22 | Laure,65,25,3 23 | Katie,67,36,1 -------------------------------------------------------------------------------- /class1_2/height_weight.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datapolitan/lede_algorithms/9109b5c91a3eb74c1343e79daf60abd76be182ea/class1_2/height_weight.xlsx -------------------------------------------------------------------------------- /class2_1/README.md: -------------------------------------------------------------------------------- 1 | # Algorithms: Week 2, Class 1 (Tuesday, July 21) 2 | 3 | Today we'll spend the first hour reviewing last week's material by exploring a new dataset: transportation data used by FiveThirtyEight to build their [fastest-flight tracker](http://projects.fivethirtyeight.com/flights/), which launched last month. 4 | 5 | Then we'll talk about why it's important to learn, and explain, what's going on under the hood of modern algorithms -- both as an exercise in transparency and skepticism and as a setup for Thursday's class, where we'll begin discussing regression. 6 | 7 | In lab, you'll continue our class work by analyzing some data on your own and developing your results into story ideas. Then we'll ask you to critique a project released earlier this summer by NPR. 8 | 9 | ## Hour 1: Exploratory data analysis review 10 | 11 | We'll be working with airline on-time performance reports from the U.S. Department of Transportation, which you can download [here](http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time). The rest of what you'll need is in the accompanying [iPython notebook](https://github.com/datapolitan/lede_algorithms/blob/master/class2_1/EDA_Review.ipynb). 12 | 13 | ## Hour 2: Transparency and the "nerd box" 14 | 15 | First, we'll read Simon Rogers' piece in Mother Jones: [Hey Wonk Reporters, Liberate Your Data!](http://www.motherjones.com/media/2014/04/vox-538-upshot-open-data-missing) (and possibly this: [Debugging the Backlash to Data Journalism](http://towcenter.org/debugging-the-backlash-to-data-journalism/)). 16 | 17 | Then we'll discuss the evolution of transparency in data journalism and why it's important: from the journalistic tradition of the "nerd box," through making data available via news apps, and finally to more modern examples of transparency. 18 | 19 | - [The Boston Globe's Speed Trap: Who Gets a Ticket, Who Gets a Break?](http://www.boston.com/globe/metro/packages/tickets/study.pdf) 20 | - [St. 
Petersburg (now Tampa Bay) Times: Vanishing Wetlands](http://www.sptimes.com/2006/webspecials06/wetlands/) 21 | - [Ft. Lauderdale Sun Sentinel police speeding investigation](http://databases.sun-sentinel.com/news/broward/ftlaudCopSpeeds/ftlaudCopSpeeds_list.php) 22 | - [Washington Post police shootings](http://www.washingtonpost.com/national/how-the-washington-post-is-examining-police-shootings-in-the-us/2015/06/29/f42c10b2-151b-11e5-9518-f9e0a8959f32_story.html) 23 | - [Leo: The NYT Senate model](http://www.nytimes.com/newsgraphics/2014/senate-model/methodology.html) 24 | 25 | ## Hour 3: From transparent data to transparent algorithms 26 | 27 | Even if you never write another algorithm before you die, this class should at least teach you enough to ask good questions about their capabilities and roles in society. We'll look at stories from some reporters who can articulate a clear, accurate understanding of how algorithms work, as well as some who ... well ... can't. 28 | 29 | Here are a few examples from one of those categories: 30 | 31 | - [Experts predict robots will take over 30% of our jobs by 2025 — and white-collar jobs aren't immune](http://www.businessinsider.com/experts-predict-that-one-third-of-jobs-will-be-replaced-by-robots-2015-5) 32 | - [Journalists, here's how robots are going to steal your job](http://www.newstatesman.com/future-proof/2014/03/journalists-heres-how-robots-are-going-steal-your-job) 33 | - [Artificial intelligence could end mankind: Hawking](http://www.cnbc.com/2014/05/04/artificial-intelligence-could-end-mankind-hawking.html) 34 | - ['Chappie' Doesn't Think Robots Will Destroy the World](http://www.nbcnews.com/tech/innovation/chappie-doesnt-think-robots-will-destroy-world-n305876) 35 | - [What Happens When Robots Write the Future?](http://op-talk.blogs.nytimes.com/2014/08/18/what-happens-when-robots-write-the-future/) 36 | 37 | And a few from the other: 38 | 39 | - [At UPS, the Algorithm Is the Driver](http://www.wsj.com/articles/at-ups-the-algorithm-is-the-driver-1424136536) 40 | - [If Algorithms Know All, How Much Should Humans Help?](http://www.nytimes.com/2015/04/07/upshot/if-algorithms-know-all-how-much-should-humans-help.html?abt=0002&abg=0) 41 | - [The Potential and the Risks of Data Science](http://bits.blogs.nytimes.com/2013/04/07/the-potential-and-the-risks-of-data-science/) 42 | - [Google Schools Its Algorithm](http://www.nytimes.com/2011/03/06/weekinreview/06lohr.html) 43 | - [When Algorithms Discriminate](http://www.nytimes.com/2015/07/10/upshot/when-algorithms-discriminate.html) 44 | 45 | ## Lab 46 | 47 | You'll be working on two things in lab today. First, by way of review, download a slice of the [data](http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time) we used for the class exercises. Explore that data on your own using the skills you've learned so far and come back with two 100-ish-word story pitches. Each pitch should also come with a brief description of how you analyzed the data in order to get that idea. 48 | 49 | Second, write a short critique of NPR and Planet Money's recent coverage of the effect of algorithms and automation on the labor market. 
You don't need to listen to all of their podcasts on the subject (they did a handful), but check out [a few of them](http://www.npr.org/sections/money/2015/05/08/405270046/episode-622-humans-vs-robots), look at some of their [data visualizations](http://www.npr.org/sections/money/2015/05/21/408234543/will-your-job-be-done-by-a-machine) and play around with their tool that calculates whether your job is [likely to be done by a machine](http://www.npr.org/sections/money/2015/05/21/408234543/will-your-job-be-done-by-a-machine). 50 | 51 | ## Questions 52 | 53 | I'm at chase.davis@nytimes.com, and I'll be on Slack after class. -------------------------------------------------------------------------------- /class2_2/DoNow_2-2.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "#Accomplish the following tasks by whatever means necessary based on the material we've covered in class. Save the notebook in this format: `_DoNow_2-2.ipynb` where `` is your last (family) name and turn it in via Slack." 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "# the magic command to plot inline with the notebook \n", 19 | "# https://ipython.org/ipython-doc/dev/interactive/tutorial.html#magic-functions\n", 20 | "%matplotlib inline" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "###1. Import the pandas package and use the common alias" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": null, 33 | "metadata": { 34 | "collapsed": true 35 | }, 36 | "outputs": [], 37 | "source": [] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "###2. Read the file \"heights_weights.xlsx\" in the `data` folder into a pandas dataframe" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": null, 49 | "metadata": { 50 | "collapsed": true 51 | }, 52 | "outputs": [], 53 | "source": [] 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "metadata": {}, 58 | "source": [ 59 | "###3. Plot a histogram for both height and weight. Describe the data distribution in comments." 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": null, 65 | "metadata": { 66 | "collapsed": false 67 | }, 68 | "outputs": [], 69 | "source": [] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "metadata": {}, 74 | "source": [ 75 | "###4. Calculate the mean height and mean weight for the dataframe. " 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": null, 81 | "metadata": { 82 | "collapsed": false 83 | }, 84 | "outputs": [], 85 | "source": [] 86 | }, 87 | { 88 | "cell_type": "markdown", 89 | "metadata": {}, 90 | "source": [ 91 | "###5. Calculate the other significant descriptive statistics on the two data points\n", 92 | "+ Standard deviation\n", 93 | "+ Range\n", 94 | "+ Interquartile range" 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": null, 100 | "metadata": { 101 | "collapsed": false 102 | }, 103 | "outputs": [], 104 | "source": [] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "metadata": {}, 109 | "source": [ 110 | "###6. Calculate the coefficient of correlation for these variables. Do they appear correlated? 
(put your answer in comments)" 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": null, 116 | "metadata": { 117 | "collapsed": false 118 | }, 119 | "outputs": [], 120 | "source": [] 121 | }, 122 | { 123 | "cell_type": "markdown", 124 | "metadata": {}, 125 | "source": [ 126 | "###Extra Credit: Create a scatter plot of height and weight" 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": null, 132 | "metadata": { 133 | "collapsed": false 134 | }, 135 | "outputs": [], 136 | "source": [] 137 | } 138 | ], 139 | "metadata": { 140 | "kernelspec": { 141 | "display_name": "Python 2", 142 | "language": "python", 143 | "name": "python2" 144 | }, 145 | "language_info": { 146 | "codemirror_mode": { 147 | "name": "ipython", 148 | "version": 2 149 | }, 150 | "file_extension": ".py", 151 | "mimetype": "text/x-python", 152 | "name": "python", 153 | "nbconvert_exporter": "python", 154 | "pygments_lexer": "ipython2", 155 | "version": "2.7.10" 156 | } 157 | }, 158 | "nbformat": 4, 159 | "nbformat_minor": 0 160 | } 161 | -------------------------------------------------------------------------------- /class2_2/data/2013_NYC_CD_MedianIncome_Recycle.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datapolitan/lede_algorithms/9109b5c91a3eb74c1343e79daf60abd76be182ea/class2_2/data/2013_NYC_CD_MedianIncome_Recycle.xlsx -------------------------------------------------------------------------------- /class2_2/data/height_weight.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datapolitan/lede_algorithms/9109b5c91a3eb74c1343e79daf60abd76be182ea/class2_2/data/height_weight.xlsx -------------------------------------------------------------------------------- /class3_1/.ipynb_checkpoints/classification-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 2, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import csv, re, string\n", 12 | "import numpy as np\n", 13 | "from sklearn.linear_model import LogisticRegression\n", 14 | "from sklearn.feature_extraction.text import CountVectorizer\n", 15 | "from sklearn.pipeline import Pipeline" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 15, 21 | "metadata": { 22 | "collapsed": false 23 | }, 24 | "outputs": [], 25 | "source": [ 26 | "PUNCTUATION = re.compile('[%s]' % re.escape(string.punctuation))\n", 27 | "VALID_CLASSES = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'J', 'K', 'L', 'M', 'T', 'X', 'Z']" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": 16, 33 | "metadata": { 34 | "collapsed": false 35 | }, 36 | "outputs": [], 37 | "source": [ 38 | "data = []\n", 39 | "with open('data/category-training.csv', 'r') as f:\n", 40 | " inputreader = csv.reader(f, delimiter=',', quotechar='\"')\n", 41 | " for r in inputreader:\n", 42 | " # Concatenate the occupation and employer strings together and remove\n", 43 | " # punctuation. Both occupation and employer will be used in prediction.\n", 44 | " text = PUNCTUATION.sub('', ' '.join(r[0:2]))\n", 45 | " if len(r[2]) > 1 and r[2][0] in VALID_CLASSES:\n", 46 | " # We're only attempting to classify the first character of the\n", 47 | " # industry prefix (\"A\", \"B\", etc.) -- not the whole thing. 
That's\n", 48 | " # what the r[2][0] piece is about.\n", 49 | " data.append([text, r[2][0]])" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 18, 55 | "metadata": { 56 | "collapsed": true 57 | }, 58 | "outputs": [], 59 | "source": [ 60 | " texts = np.array([el[0] for el in data])\n", 61 | " classes = np.array([el[1] for el in data])" 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": 19, 67 | "metadata": { 68 | "collapsed": false 69 | }, 70 | "outputs": [ 71 | { 72 | "name": "stdout", 73 | "output_type": "stream", 74 | "text": [ 75 | "['Owner First Priority Title Llc' 'SENIOR PARTNER ARES MANAGEMENT'\n", 76 | " 'CEO HB AGENCY' ..., 'INVESTMENT EXECUTIVE FEF MANAGEMENT LLC'\n", 77 | " 'Owner Fair Funeral Home' 'ST MARTIN LIRERRE LAW FIRM ']\n" 78 | ] 79 | } 80 | ], 81 | "source": [ 82 | "print texts" 83 | ] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "execution_count": 20, 88 | "metadata": { 89 | "collapsed": false 90 | }, 91 | "outputs": [ 92 | { 93 | "name": "stdout", 94 | "output_type": "stream", 95 | "text": [ 96 | "['F' 'Z' 'Z' ..., 'F' 'G' 'K']\n" 97 | ] 98 | } 99 | ], 100 | "source": [ 101 | "print classes" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": 21, 107 | "metadata": { 108 | "collapsed": true 109 | }, 110 | "outputs": [], 111 | "source": [ 112 | "pipeline = Pipeline([\n", 113 | " ('vectorizer', CountVectorizer(\n", 114 | " ngram_range=(1,2),\n", 115 | " stop_words='english',\n", 116 | " min_df=2,\n", 117 | " max_df=len(texts))),\n", 118 | " ('classifier', LogisticRegression())\n", 119 | "])" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": 22, 125 | "metadata": { 126 | "collapsed": false 127 | }, 128 | "outputs": [ 129 | { 130 | "data": { 131 | "text/plain": [ 132 | "Pipeline(steps=[('vectorizer', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',\n", 133 | " dtype=, encoding=u'utf-8', input=u'content',\n", 134 | " lowercase=True, max_df=66923, max_features=None, min_df=2,\n", 135 | " ngram_range=(1, 2), preprocessor=None, stop_words='english...',\n", 136 | " penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n", 137 | " verbose=0))])" 138 | ] 139 | }, 140 | "execution_count": 22, 141 | "metadata": {}, 142 | "output_type": "execute_result" 143 | } 144 | ], 145 | "source": [ 146 | "pipeline.fit(np.asarray(texts), np.asarray(classes))" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": 27, 152 | "metadata": { 153 | "collapsed": false 154 | }, 155 | "outputs": [ 156 | { 157 | "name": "stdout", 158 | "output_type": "stream", 159 | "text": [ 160 | "['K']\n" 161 | ] 162 | } 163 | ], 164 | "source": [ 165 | "print pipeline.predict(['LAWYER'])" 166 | ] 167 | }, 168 | { 169 | "cell_type": "code", 170 | "execution_count": 28, 171 | "metadata": { 172 | "collapsed": false 173 | }, 174 | "outputs": [ 175 | { 176 | "name": "stdout", 177 | "output_type": "stream", 178 | "text": [ 179 | "['K']\n" 180 | ] 181 | } 182 | ], 183 | "source": [ 184 | "print pipeline.predict(['SKADDEN ARPS'])" 185 | ] 186 | }, 187 | { 188 | "cell_type": "code", 189 | "execution_count": 29, 190 | "metadata": { 191 | "collapsed": false, 192 | "scrolled": true 193 | }, 194 | "outputs": [ 195 | { 196 | "name": "stdout", 197 | "output_type": "stream", 198 | "text": [ 199 | "['J']\n" 200 | ] 201 | } 202 | ], 203 | "source": [ 204 | "print pipeline.predict(['COMPUTER PROGRAMMER'])" 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": 
null, 210 | "metadata": { 211 | "collapsed": true 212 | }, 213 | "outputs": [], 214 | "source": [] 215 | } 216 | ], 217 | "metadata": { 218 | "kernelspec": { 219 | "display_name": "Python 2", 220 | "language": "python", 221 | "name": "python2" 222 | }, 223 | "language_info": { 224 | "codemirror_mode": { 225 | "name": "ipython", 226 | "version": 2 227 | }, 228 | "file_extension": ".py", 229 | "mimetype": "text/x-python", 230 | "name": "python", 231 | "nbconvert_exporter": "python", 232 | "pygments_lexer": "ipython2", 233 | "version": "2.7.9" 234 | } 235 | }, 236 | "nbformat": 4, 237 | "nbformat_minor": 0 238 | } 239 | -------------------------------------------------------------------------------- /class3_1/README.md: -------------------------------------------------------------------------------- 1 | # Algorithms: Week 3, Class 1 (Tuesday, July 28) 2 | 3 | Today we'll be reviewing a bit of material from last week, including both your lab assignments from last Tuesday and Thursday's lesson on regression, and then expanding on the latter in a couple ways. 4 | 5 | First we'll review, and (roughly) recreate a couple stories that have employed regression in basic ways. Then, in the interests of transparency, we'll talk a bit about what's going on under the hood with the algorithms that comprise regression. 6 | 7 | Finally we'll touch briefly on the idea of classification using a portion of a project we're currently implementing at the Times. 8 | 9 | ## Hour 1/1.5: Review 10 | 11 | First we'll talk through some of your story ideas and critiques from last Tuesday. Then we'll revisit some basic regression concepts from last week using [this iPython notebook](https://github.com/datapolitan/lede_algorithms/blob/master/class3_1/regression_review.ipynb), which (very roughly) mimics a project that the St. Paul Pioneer Press did in 2006 and 2010, known as [Schools that Work](http://www.twincities.com/ci_15487174). 12 | 13 | ## Hour 2: A closer look at regression 14 | 15 | Journalists tend to look at linear regression through a statistical lens and use it primarily to describe things, as in the case above. You can see another example here: 16 | 17 | - [Race gap found in pothole patching](http://www.jsonline.com/watchdog/watchdogreports/32580034.html) (Milwaukee Journal Sentinel). And the [associated explainer](http://www.jsonline.com/news/milwaukee/32580074.html). 18 | 19 | But looked at another way, linear regression is also a predictive model -- one that, at scale, is based on an algorithm that we can demystify, per our conversations last week. We'll spend a short amount of time talking about how that works and relate it (hypothetically) to this [fun story](http://fivethirtyeight.com/features/donald-trump-is-the-worlds-greatest-troll/) from FiveThirtyEight. 20 | 21 | ## Hour 3: Introduction to classification 22 | 23 | Using a [project](https://github.com/datapolitan/lede_algorithms/blob/master/class3_1/classification.ipynb) we've been working on at the Times, we'll expand our idea of supervised learning to include something that seems a bit more like what you might consider "machine learning" -- classifying people's jobs based on strings representing their occupation and employer. 24 | 25 | We'll also discuss how lots of data problems in journalism are secretly classification problems, including things like [sorting through documents](https://github.com/cjdd3b/nicar2014/tree/master/lightning-talk/naive-bayes) and [extracting quotes from news articles](https://github.com/cjdd3b/citizen-quotes).
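
To make that framing concrete before the lab, here is a minimal, hypothetical sketch of document sorting as a classification problem. The four headlines and two categories are invented for illustration; the occupation classifier we'll walk through in class uses the same basic pipeline idea with different features.

```python
# A tiny, made-up document classifier: text goes in, a category comes out.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

train_texts = [
    "city council approves new budget",      # invented training examples
    "mayor vetoes spending plan",
    "quarterback injured in season opener",
    "home team wins in overtime",
]
train_labels = ["politics", "politics", "sports", "sports"]

pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),       # turn each headline into word counts
    ("classifier", MultinomialNB()),         # a simple probabilistic classifier
])
pipeline.fit(train_texts, train_labels)

# Should lean toward "politics" because "budget" only appears in the politics examples.
print(pipeline.predict(["county budget hearing scheduled"]))
```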
26 | 27 | ## Lab 28 | 29 | Like last week, you'll be doing two things in the lab today: 30 | 31 | First you'll expand the schools analysis we did earlier by layering in other variables [(documented here)](http://www.cde.ca.gov/ta/ac/ap/reclayout12b.asp) using multiple regression, interpreting the results, and again writing two ledes about what you found. Back those ledes up with some internet research. If you find some schools that are over/under-performing or have other interesting characteristics, Google around to see what has been written about them. It's a good way to check your assumptions and to find other interesting facts to round out your story pitches. 32 | 33 | This of course comes with a huge, blinking-red caveat: This is an algorithms class, and we're not getting deep enough into the guts of statistical regression for you to run out and write full-on stories based on your findings. There are things like [p-values](http://blog.minitab.com/blog/adventures-in-statistics/how-to-interpret-regression-analysis-results-p-values-and-coefficients) to consider, as well as rules of thumb for interpreting r-squared. If you'd like to get more in depth with that, we can carve out some time later in the course. 34 | 35 | Your second assignment today is to write a short story (300-500 words) about [this company](https://www.upstart.com/), which is a startup that uses predictive models to assess creditworthiness using variables that go beyond credit score. No doubt their model is more complex than this, but you can think of the intuition as being similar to regression -- a handful of independent variables that help predict the likelihood that someone will pay their loan back. What are the implications of this? Why might it be good or bad for consumers if this catches on? -------------------------------------------------------------------------------- /class3_1/classification.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Classification in the Wild" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "This code comes straight from a Times project that helps us standardize campaign finance data to enable new types of analyses. Specifically, it tries to categorize a free-form occupation/employer string into a discrete job category (for example, the strings \"LAWYER\" and \"ATTORNEY\" would both be categorized under \"LAW\").\n", 15 | "\n", 16 | "We use this to create one of a large number of features that inform the larger predictive model we use for standardization. But it also shows the power of simple classification in action."
17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": 2, 22 | "metadata": { 23 | "collapsed": true 24 | }, 25 | "outputs": [], 26 | "source": [ 27 | "import csv, re, string\n", 28 | "import numpy as np\n", 29 | "from sklearn.linear_model import LogisticRegression\n", 30 | "from sklearn.feature_extraction.text import CountVectorizer\n", 31 | "from sklearn.pipeline import Pipeline" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": 30, 37 | "metadata": { 38 | "collapsed": false 39 | }, 40 | "outputs": [], 41 | "source": [ 42 | "# Some basic setup for data-cleaning purposes\n", 43 | "PUNCTUATION = re.compile('[%s]' % re.escape(string.punctuation))\n", 44 | "VALID_CLASSES = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'J', 'K', 'L', 'M', 'T', 'X', 'Z']" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": 16, 50 | "metadata": { 51 | "collapsed": false 52 | }, 53 | "outputs": [], 54 | "source": [ 55 | "# Open the training data and clean it up a bit\n", 56 | "data = []\n", 57 | "with open('data/category-training.csv', 'r') as f:\n", 58 | " inputreader = csv.reader(f, delimiter=',', quotechar='\"')\n", 59 | " for r in inputreader:\n", 60 | " # Concatenate the occupation and employer strings together and remove\n", 61 | " # punctuation. Both occupation and employer will be used in prediction.\n", 62 | " text = PUNCTUATION.sub('', ' '.join(r[0:2]))\n", 63 | " if len(r[2]) > 1 and r[2][0] in VALID_CLASSES:\n", 64 | " # We're only attempting to classify the first character of the\n", 65 | " # industry prefix (\"A\", \"B\", etc.) -- not the whole thing. That's\n", 66 | " # what the r[2][0] piece is about.\n", 67 | " data.append([text, r[2][0]])" 68 | ] 69 | }, 70 | { 71 | "cell_type": "code", 72 | "execution_count": 18, 73 | "metadata": { 74 | "collapsed": true 75 | }, 76 | "outputs": [], 77 | "source": [ 78 | " # Separate the text of the occupation/employer strings from the correct classification\n", 79 | " texts = np.array([el[0] for el in data])\n", 80 | " classes = np.array([el[1] for el in data])" 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": 19, 86 | "metadata": { 87 | "collapsed": false 88 | }, 89 | "outputs": [ 90 | { 91 | "name": "stdout", 92 | "output_type": "stream", 93 | "text": [ 94 | "['Owner First Priority Title Llc' 'SENIOR PARTNER ARES MANAGEMENT'\n", 95 | " 'CEO HB AGENCY' ..., 'INVESTMENT EXECUTIVE FEF MANAGEMENT LLC'\n", 96 | " 'Owner Fair Funeral Home' 'ST MARTIN LIRERRE LAW FIRM ']\n" 97 | ] 98 | } 99 | ], 100 | "source": [ 101 | "print texts" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": 20, 107 | "metadata": { 108 | "collapsed": false 109 | }, 110 | "outputs": [ 111 | { 112 | "name": "stdout", 113 | "output_type": "stream", 114 | "text": [ 115 | "['F' 'Z' 'Z' ..., 'F' 'G' 'K']\n" 116 | ] 117 | } 118 | ], 119 | "source": [ 120 | "print classes" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": 31, 126 | "metadata": { 127 | "collapsed": true 128 | }, 129 | "outputs": [], 130 | "source": [ 131 | "# Build a simple machine learning pipeline to turn the above arrays into something scikit-learn understands\n", 132 | "pipeline = Pipeline([\n", 133 | " ('vectorizer', CountVectorizer(\n", 134 | " ngram_range=(1,2),\n", 135 | " stop_words='english',\n", 136 | " min_df=2,\n", 137 | " max_df=len(texts))),\n", 138 | " ('classifier', LogisticRegression())\n", 139 | "])" 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": 32, 145 | 
"metadata": { 146 | "collapsed": false 147 | }, 148 | "outputs": [ 149 | { 150 | "data": { 151 | "text/plain": [ 152 | "Pipeline(steps=[('vectorizer', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',\n", 153 | " dtype=, encoding=u'utf-8', input=u'content',\n", 154 | " lowercase=True, max_df=66923, max_features=None, min_df=2,\n", 155 | " ngram_range=(1, 2), preprocessor=None, stop_words='english...',\n", 156 | " penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n", 157 | " verbose=0))])" 158 | ] 159 | }, 160 | "execution_count": 32, 161 | "metadata": {}, 162 | "output_type": "execute_result" 163 | } 164 | ], 165 | "source": [ 166 | "# Fit the model\n", 167 | "pipeline.fit(np.asarray(texts), np.asarray(classes))" 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": 27, 173 | "metadata": { 174 | "collapsed": false 175 | }, 176 | "outputs": [ 177 | { 178 | "name": "stdout", 179 | "output_type": "stream", 180 | "text": [ 181 | "['K']\n" 182 | ] 183 | } 184 | ], 185 | "source": [ 186 | "# Now, run some predictions. \"K\" means \"LAW\" in this case.\n", 187 | "print pipeline.predict(['LAWYER'])" 188 | ] 189 | }, 190 | { 191 | "cell_type": "code", 192 | "execution_count": 28, 193 | "metadata": { 194 | "collapsed": false 195 | }, 196 | "outputs": [ 197 | { 198 | "name": "stdout", 199 | "output_type": "stream", 200 | "text": [ 201 | "['K']\n" 202 | ] 203 | } 204 | ], 205 | "source": [ 206 | "# It also recognizes law firms!\n", 207 | "print pipeline.predict(['SKADDEN ARPS'])" 208 | ] 209 | }, 210 | { 211 | "cell_type": "code", 212 | "execution_count": 34, 213 | "metadata": { 214 | "collapsed": false, 215 | "scrolled": true 216 | }, 217 | "outputs": [ 218 | { 219 | "name": "stdout", 220 | "output_type": "stream", 221 | "text": [ 222 | "['F']\n" 223 | ] 224 | } 225 | ], 226 | "source": [ 227 | "# The \"F\" category represents business and finance.\n", 228 | "print pipeline.predict(['CEO'])" 229 | ] 230 | } 231 | ], 232 | "metadata": { 233 | "kernelspec": { 234 | "display_name": "Python 2", 235 | "language": "python", 236 | "name": "python2" 237 | }, 238 | "language_info": { 239 | "codemirror_mode": { 240 | "name": "ipython", 241 | "version": 2 242 | }, 243 | "file_extension": ".py", 244 | "mimetype": "text/x-python", 245 | "name": "python", 246 | "nbconvert_exporter": "python", 247 | "pygments_lexer": "ipython2", 248 | "version": "2.7.9" 249 | } 250 | }, 251 | "nbformat": 4, 252 | "nbformat_minor": 0 253 | } 254 | -------------------------------------------------------------------------------- /class3_2/3-2_DoNow.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "##1. Import the necessary packages to read in the data, plot, and create a linear regression model" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "metadata": {}, 22 | "source": [ 23 | "## 2. Read in the hanford.csv file " 24 | ] 25 | }, 26 | { 27 | "cell_type": "code", 28 | "execution_count": null, 29 | "metadata": { 30 | "collapsed": true 31 | }, 32 | "outputs": [], 33 | "source": [] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "metadata": {}, 38 | "source": [ 39 | "" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "## 3. 
Calculate the basic descriptive statistics on the data" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": null, 52 | "metadata": { 53 | "collapsed": false 54 | }, 55 | "outputs": [], 56 | "source": [] 57 | }, 58 | { 59 | "cell_type": "markdown", 60 | "metadata": {}, 61 | "source": [ 62 | "## 4. Calculate the coefficient of correlation (r) and generate the scatter plot. Does there seem to be a correlation worthy of investigation?" 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": null, 68 | "metadata": { 69 | "collapsed": false 70 | }, 71 | "outputs": [], 72 | "source": [] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": {}, 77 | "source": [ 78 | "## 5. Create a linear regression model based on the available data to predict the mortality rate given a level of exposure" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": null, 84 | "metadata": { 85 | "collapsed": true 86 | }, 87 | "outputs": [], 88 | "source": [] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "metadata": { 93 | "collapsed": true 94 | }, 95 | "source": [ 96 | "## 6. Plot the linear regression line on the scatter plot of values. Calculate the r^2 (coefficient of determination)" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": null, 102 | "metadata": { 103 | "collapsed": false 104 | }, 105 | "outputs": [], 106 | "source": [] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "metadata": { 111 | "collapsed": true 112 | }, 113 | "source": [ 114 | "## 7. Predict the mortality rate (Cancer per 100,000 man years) given an index of exposure = 10" 115 | ] 116 | }, 117 | { 118 | "cell_type": "code", 119 | "execution_count": null, 120 | "metadata": { 121 | "collapsed": true 122 | }, 123 | "outputs": [], 124 | "source": [] 125 | } 126 | ], 127 | "metadata": { 128 | "kernelspec": { 129 | "display_name": "Python 2", 130 | "language": "python", 131 | "name": "python2" 132 | }, 133 | "language_info": { 134 | "codemirror_mode": { 135 | "name": "ipython", 136 | "version": 2 137 | }, 138 | "file_extension": ".py", 139 | "mimetype": "text/x-python", 140 | "name": "python", 141 | "nbconvert_exporter": "python", 142 | "pygments_lexer": "ipython2", 143 | "version": "2.7.10" 144 | } 145 | }, 146 | "nbformat": 4, 147 | "nbformat_minor": 0 148 | } 149 | -------------------------------------------------------------------------------- /class3_2/3-2_Exercises.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "##We covered a lot of information today and I'd like you to practice developing classification trees on your own. For each exercise, work through the problem, determine the result, and provide the requested interpretation in comments along with the code. The point is to build classifiers, not necessarily good classifiers (that will hopefully come later)" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "###1. Load the iris dataset and create a holdout set that is 50% of the data (50% in training and 50% in test). 
Output the results (don't worry about creating the tree visual unless you'd like to) and discuss them briefly (are they good or not?)" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": null, 20 | "metadata": { 21 | "collapsed": true 22 | }, 23 | "outputs": [], 24 | "source": [] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "###2. Redo the model with a 75% - 25% training/test split and compare the results. Are they better or worse than before? Discuss why this may be." 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": null, 36 | "metadata": { 37 | "collapsed": true 38 | }, 39 | "outputs": [], 40 | "source": [] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "###3. Perform 10-fold cross validation on the data and compare your results to the hold out method we used in 1 and 2. Take the average of the results. What do you notice about the accuracy measures in each of these?" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": null, 52 | "metadata": { 53 | "collapsed": true 54 | }, 55 | "outputs": [], 56 | "source": [] 57 | }, 58 | { 59 | "cell_type": "markdown", 60 | "metadata": {}, 61 | "source": [ 62 | "###4. Open the seeds_dataset.txt and perform basic exploratory analysis. What attributes do we have? What are we trying to predict?\n", 63 | "For context of the data, see the documentation here: https://archive.ics.uci.edu/ml/datasets/seeds" 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": null, 69 | "metadata": { 70 | "collapsed": true 71 | }, 72 | "outputs": [], 73 | "source": [] 74 | }, 75 | { 76 | "cell_type": "markdown", 77 | "metadata": {}, 78 | "source": [ 79 | "###5. Using the seeds_dataset.txt, create a classifier to predict the type of seed. Perform the above hold out evaluation (50-50, 75-25, 10-fold cross validation) and discuss the results."
80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": null, 85 | "metadata": { 86 | "collapsed": true 87 | }, 88 | "outputs": [], 89 | "source": [] 90 | } 91 | ], 92 | "metadata": { 93 | "kernelspec": { 94 | "display_name": "Python 2", 95 | "language": "python", 96 | "name": "python2" 97 | }, 98 | "language_info": { 99 | "codemirror_mode": { 100 | "name": "ipython", 101 | "version": 2 102 | }, 103 | "file_extension": ".py", 104 | "mimetype": "text/x-python", 105 | "name": "python", 106 | "nbconvert_exporter": "python", 107 | "pygments_lexer": "ipython2", 108 | "version": "2.7.10" 109 | } 110 | }, 111 | "nbformat": 4, 112 | "nbformat_minor": 0 113 | } 114 | -------------------------------------------------------------------------------- /class3_2/data/hanford.csv: -------------------------------------------------------------------------------- 1 | County,Exposure,Mortality 2 | Umatilla,2.49,147.1 3 | Morrow,2.57,130.1 4 | Gilliam,3.41,129.9 5 | Sherman,1.25,113.5 6 | Wasco,1.62,137.5 7 | HoodRiver,3.83,162.3 8 | Portland,11.64,207.5 9 | Columbia,6.41,177.9 10 | Clatsop,8.34,210.3 -------------------------------------------------------------------------------- /class3_2/data/hanford.txt: -------------------------------------------------------------------------------- 1 | County Exposure Mortality 2 | Umatilla 2.49 147.1 3 | Morrow 2.57 130.1 4 | Gilliam 3.41 129.9 5 | Sherman 1.25 113.5 6 | Wasco 1.62 137.5 7 | HoodRiver 3.83 162.3 8 | Portland 11.64 207.5 9 | Columbia 6.41 177.9 10 | Clatsop 8.34 210.3 11 | -------------------------------------------------------------------------------- /class3_2/data/iris.csv: -------------------------------------------------------------------------------- 1 | SepalLength,SepalWidth,PetalLength,PetalWidth,Name 2 | 5.1,3.5,1.4,0.2,Iris-setosa 3 | 4.9,3.0,1.4,0.2,Iris-setosa 4 | 4.7,3.2,1.3,0.2,Iris-setosa 5 | 4.6,3.1,1.5,0.2,Iris-setosa 6 | 5.0,3.6,1.4,0.2,Iris-setosa 7 | 5.4,3.9,1.7,0.4,Iris-setosa 8 | 4.6,3.4,1.4,0.3,Iris-setosa 9 | 5.0,3.4,1.5,0.2,Iris-setosa 10 | 4.4,2.9,1.4,0.2,Iris-setosa 11 | 4.9,3.1,1.5,0.1,Iris-setosa 12 | 5.4,3.7,1.5,0.2,Iris-setosa 13 | 4.8,3.4,1.6,0.2,Iris-setosa 14 | 4.8,3.0,1.4,0.1,Iris-setosa 15 | 4.3,3.0,1.1,0.1,Iris-setosa 16 | 5.8,4.0,1.2,0.2,Iris-setosa 17 | 5.7,4.4,1.5,0.4,Iris-setosa 18 | 5.4,3.9,1.3,0.4,Iris-setosa 19 | 5.1,3.5,1.4,0.3,Iris-setosa 20 | 5.7,3.8,1.7,0.3,Iris-setosa 21 | 5.1,3.8,1.5,0.3,Iris-setosa 22 | 5.4,3.4,1.7,0.2,Iris-setosa 23 | 5.1,3.7,1.5,0.4,Iris-setosa 24 | 4.6,3.6,1.0,0.2,Iris-setosa 25 | 5.1,3.3,1.7,0.5,Iris-setosa 26 | 4.8,3.4,1.9,0.2,Iris-setosa 27 | 5.0,3.0,1.6,0.2,Iris-setosa 28 | 5.0,3.4,1.6,0.4,Iris-setosa 29 | 5.2,3.5,1.5,0.2,Iris-setosa 30 | 5.2,3.4,1.4,0.2,Iris-setosa 31 | 4.7,3.2,1.6,0.2,Iris-setosa 32 | 4.8,3.1,1.6,0.2,Iris-setosa 33 | 5.4,3.4,1.5,0.4,Iris-setosa 34 | 5.2,4.1,1.5,0.1,Iris-setosa 35 | 5.5,4.2,1.4,0.2,Iris-setosa 36 | 4.9,3.1,1.5,0.1,Iris-setosa 37 | 5.0,3.2,1.2,0.2,Iris-setosa 38 | 5.5,3.5,1.3,0.2,Iris-setosa 39 | 4.9,3.1,1.5,0.1,Iris-setosa 40 | 4.4,3.0,1.3,0.2,Iris-setosa 41 | 5.1,3.4,1.5,0.2,Iris-setosa 42 | 5.0,3.5,1.3,0.3,Iris-setosa 43 | 4.5,2.3,1.3,0.3,Iris-setosa 44 | 4.4,3.2,1.3,0.2,Iris-setosa 45 | 5.0,3.5,1.6,0.6,Iris-setosa 46 | 5.1,3.8,1.9,0.4,Iris-setosa 47 | 4.8,3.0,1.4,0.3,Iris-setosa 48 | 5.1,3.8,1.6,0.2,Iris-setosa 49 | 4.6,3.2,1.4,0.2,Iris-setosa 50 | 5.3,3.7,1.5,0.2,Iris-setosa 51 | 5.0,3.3,1.4,0.2,Iris-setosa 52 | 7.0,3.2,4.7,1.4,Iris-versicolor 53 | 6.4,3.2,4.5,1.5,Iris-versicolor 54 | 6.9,3.1,4.9,1.5,Iris-versicolor 55 | 
5.5,2.3,4.0,1.3,Iris-versicolor 56 | 6.5,2.8,4.6,1.5,Iris-versicolor 57 | 5.7,2.8,4.5,1.3,Iris-versicolor 58 | 6.3,3.3,4.7,1.6,Iris-versicolor 59 | 4.9,2.4,3.3,1.0,Iris-versicolor 60 | 6.6,2.9,4.6,1.3,Iris-versicolor 61 | 5.2,2.7,3.9,1.4,Iris-versicolor 62 | 5.0,2.0,3.5,1.0,Iris-versicolor 63 | 5.9,3.0,4.2,1.5,Iris-versicolor 64 | 6.0,2.2,4.0,1.0,Iris-versicolor 65 | 6.1,2.9,4.7,1.4,Iris-versicolor 66 | 5.6,2.9,3.6,1.3,Iris-versicolor 67 | 6.7,3.1,4.4,1.4,Iris-versicolor 68 | 5.6,3.0,4.5,1.5,Iris-versicolor 69 | 5.8,2.7,4.1,1.0,Iris-versicolor 70 | 6.2,2.2,4.5,1.5,Iris-versicolor 71 | 5.6,2.5,3.9,1.1,Iris-versicolor 72 | 5.9,3.2,4.8,1.8,Iris-versicolor 73 | 6.1,2.8,4.0,1.3,Iris-versicolor 74 | 6.3,2.5,4.9,1.5,Iris-versicolor 75 | 6.1,2.8,4.7,1.2,Iris-versicolor 76 | 6.4,2.9,4.3,1.3,Iris-versicolor 77 | 6.6,3.0,4.4,1.4,Iris-versicolor 78 | 6.8,2.8,4.8,1.4,Iris-versicolor 79 | 6.7,3.0,5.0,1.7,Iris-versicolor 80 | 6.0,2.9,4.5,1.5,Iris-versicolor 81 | 5.7,2.6,3.5,1.0,Iris-versicolor 82 | 5.5,2.4,3.8,1.1,Iris-versicolor 83 | 5.5,2.4,3.7,1.0,Iris-versicolor 84 | 5.8,2.7,3.9,1.2,Iris-versicolor 85 | 6.0,2.7,5.1,1.6,Iris-versicolor 86 | 5.4,3.0,4.5,1.5,Iris-versicolor 87 | 6.0,3.4,4.5,1.6,Iris-versicolor 88 | 6.7,3.1,4.7,1.5,Iris-versicolor 89 | 6.3,2.3,4.4,1.3,Iris-versicolor 90 | 5.6,3.0,4.1,1.3,Iris-versicolor 91 | 5.5,2.5,4.0,1.3,Iris-versicolor 92 | 5.5,2.6,4.4,1.2,Iris-versicolor 93 | 6.1,3.0,4.6,1.4,Iris-versicolor 94 | 5.8,2.6,4.0,1.2,Iris-versicolor 95 | 5.0,2.3,3.3,1.0,Iris-versicolor 96 | 5.6,2.7,4.2,1.3,Iris-versicolor 97 | 5.7,3.0,4.2,1.2,Iris-versicolor 98 | 5.7,2.9,4.2,1.3,Iris-versicolor 99 | 6.2,2.9,4.3,1.3,Iris-versicolor 100 | 5.1,2.5,3.0,1.1,Iris-versicolor 101 | 5.7,2.8,4.1,1.3,Iris-versicolor 102 | 6.3,3.3,6.0,2.5,Iris-virginica 103 | 5.8,2.7,5.1,1.9,Iris-virginica 104 | 7.1,3.0,5.9,2.1,Iris-virginica 105 | 6.3,2.9,5.6,1.8,Iris-virginica 106 | 6.5,3.0,5.8,2.2,Iris-virginica 107 | 7.6,3.0,6.6,2.1,Iris-virginica 108 | 4.9,2.5,4.5,1.7,Iris-virginica 109 | 7.3,2.9,6.3,1.8,Iris-virginica 110 | 6.7,2.5,5.8,1.8,Iris-virginica 111 | 7.2,3.6,6.1,2.5,Iris-virginica 112 | 6.5,3.2,5.1,2.0,Iris-virginica 113 | 6.4,2.7,5.3,1.9,Iris-virginica 114 | 6.8,3.0,5.5,2.1,Iris-virginica 115 | 5.7,2.5,5.0,2.0,Iris-virginica 116 | 5.8,2.8,5.1,2.4,Iris-virginica 117 | 6.4,3.2,5.3,2.3,Iris-virginica 118 | 6.5,3.0,5.5,1.8,Iris-virginica 119 | 7.7,3.8,6.7,2.2,Iris-virginica 120 | 7.7,2.6,6.9,2.3,Iris-virginica 121 | 6.0,2.2,5.0,1.5,Iris-virginica 122 | 6.9,3.2,5.7,2.3,Iris-virginica 123 | 5.6,2.8,4.9,2.0,Iris-virginica 124 | 7.7,2.8,6.7,2.0,Iris-virginica 125 | 6.3,2.7,4.9,1.8,Iris-virginica 126 | 6.7,3.3,5.7,2.1,Iris-virginica 127 | 7.2,3.2,6.0,1.8,Iris-virginica 128 | 6.2,2.8,4.8,1.8,Iris-virginica 129 | 6.1,3.0,4.9,1.8,Iris-virginica 130 | 6.4,2.8,5.6,2.1,Iris-virginica 131 | 7.2,3.0,5.8,1.6,Iris-virginica 132 | 7.4,2.8,6.1,1.9,Iris-virginica 133 | 7.9,3.8,6.4,2.0,Iris-virginica 134 | 6.4,2.8,5.6,2.2,Iris-virginica 135 | 6.3,2.8,5.1,1.5,Iris-virginica 136 | 6.1,2.6,5.6,1.4,Iris-virginica 137 | 7.7,3.0,6.1,2.3,Iris-virginica 138 | 6.3,3.4,5.6,2.4,Iris-virginica 139 | 6.4,3.1,5.5,1.8,Iris-virginica 140 | 6.0,3.0,4.8,1.8,Iris-virginica 141 | 6.9,3.1,5.4,2.1,Iris-virginica 142 | 6.7,3.1,5.6,2.4,Iris-virginica 143 | 6.9,3.1,5.1,2.3,Iris-virginica 144 | 5.8,2.7,5.1,1.9,Iris-virginica 145 | 6.8,3.2,5.9,2.3,Iris-virginica 146 | 6.7,3.3,5.7,2.5,Iris-virginica 147 | 6.7,3.0,5.2,2.3,Iris-virginica 148 | 6.3,2.5,5.0,1.9,Iris-virginica 149 | 6.5,3.0,5.2,2.0,Iris-virginica 150 | 6.2,3.4,5.4,2.3,Iris-virginica 151 | 
5.9,3.0,5.1,1.8,Iris-virginica -------------------------------------------------------------------------------- /class3_2/data/seeds_dataset.txt: -------------------------------------------------------------------------------- 1 | 15.26,14.84,0.871,5.763,3.312,2.221,5.22,1 2 | 14.88,14.57,0.8811,5.554,3.333,1.018,4.956,1 3 | 14.29,14.09,0.905,5.291,3.337,2.699,4.825,1 4 | 13.84,13.94,0.8955,5.324,3.379,2.259,4.805,1 5 | 16.14,14.99,0.9034,5.658,3.562,1.355,5.175,1 6 | 14.38,14.21,0.8951,5.386,3.312,2.462,4.956,1 7 | 14.69,14.49,0.8799,5.563,3.259,3.586,5.219,1 8 | 14.11,14.1,0.8911,5.42,3.302,2.7,5,1 9 | 16.63,15.46,0.8747,6.053,3.465,2.04,5.877,1 10 | 16.44,15.25,0.888,5.884,3.505,1.969,5.533,1 11 | 15.26,14.85,0.8696,5.714,3.242,4.543,5.314,1 12 | 14.03,14.16,0.8796,5.438,3.201,1.717,5.001,1 13 | 13.89,14.02,0.888,5.439,3.199,3.986,4.738,1 14 | 13.78,14.06,0.8759,5.479,3.156,3.136,4.872,1 15 | 13.74,14.05,0.8744,5.482,3.114,2.932,4.825,1 16 | 14.59,14.28,0.8993,5.351,3.333,4.185,4.781,1 17 | 13.99,13.83,0.9183,5.119,3.383,5.234,4.781,1 18 | 15.69,14.75,0.9058,5.527,3.514,1.599,5.046,1 19 | 14.7,14.21,0.9153,5.205,3.466,1.767,4.649,1 20 | 12.72,13.57,0.8686,5.226,3.049,4.102,4.914,1 21 | 14.16,14.4,0.8584,5.658,3.129,3.072,5.176,1 22 | 14.11,14.26,0.8722,5.52,3.168,2.688,5.219,1 23 | 15.88,14.9,0.8988,5.618,3.507,0.7651,5.091,1 24 | 12.08,13.23,0.8664,5.099,2.936,1.415,4.961,1 25 | 15.01,14.76,0.8657,5.789,3.245,1.791,5.001,1 26 | 16.19,15.16,0.8849,5.833,3.421,0.903,5.307,1 27 | 13.02,13.76,0.8641,5.395,3.026,3.373,4.825,1 28 | 12.74,13.67,0.8564,5.395,2.956,2.504,4.869,1 29 | 14.11,14.18,0.882,5.541,3.221,2.754,5.038,1 30 | 13.45,14.02,0.8604,5.516,3.065,3.531,5.097,1 31 | 13.16,13.82,0.8662,5.454,2.975,0.8551,5.056,1 32 | 15.49,14.94,0.8724,5.757,3.371,3.412,5.228,1 33 | 14.09,14.41,0.8529,5.717,3.186,3.92,5.299,1 34 | 13.94,14.17,0.8728,5.585,3.15,2.124,5.012,1 35 | 15.05,14.68,0.8779,5.712,3.328,2.129,5.36,1 36 | 16.12,15,0.9,5.709,3.485,2.27,5.443,1 37 | 16.2,15.27,0.8734,5.826,3.464,2.823,5.527,1 38 | 17.08,15.38,0.9079,5.832,3.683,2.956,5.484,1 39 | 14.8,14.52,0.8823,5.656,3.288,3.112,5.309,1 40 | 14.28,14.17,0.8944,5.397,3.298,6.685,5.001,1 41 | 13.54,13.85,0.8871,5.348,3.156,2.587,5.178,1 42 | 13.5,13.85,0.8852,5.351,3.158,2.249,5.176,1 43 | 13.16,13.55,0.9009,5.138,3.201,2.461,4.783,1 44 | 15.5,14.86,0.882,5.877,3.396,4.711,5.528,1 45 | 15.11,14.54,0.8986,5.579,3.462,3.128,5.18,1 46 | 13.8,14.04,0.8794,5.376,3.155,1.56,4.961,1 47 | 15.36,14.76,0.8861,5.701,3.393,1.367,5.132,1 48 | 14.99,14.56,0.8883,5.57,3.377,2.958,5.175,1 49 | 14.79,14.52,0.8819,5.545,3.291,2.704,5.111,1 50 | 14.86,14.67,0.8676,5.678,3.258,2.129,5.351,1 51 | 14.43,14.4,0.8751,5.585,3.272,3.975,5.144,1 52 | 15.78,14.91,0.8923,5.674,3.434,5.593,5.136,1 53 | 14.49,14.61,0.8538,5.715,3.113,4.116,5.396,1 54 | 14.33,14.28,0.8831,5.504,3.199,3.328,5.224,1 55 | 14.52,14.6,0.8557,5.741,3.113,1.481,5.487,1 56 | 15.03,14.77,0.8658,5.702,3.212,1.933,5.439,1 57 | 14.46,14.35,0.8818,5.388,3.377,2.802,5.044,1 58 | 14.92,14.43,0.9006,5.384,3.412,1.142,5.088,1 59 | 15.38,14.77,0.8857,5.662,3.419,1.999,5.222,1 60 | 12.11,13.47,0.8392,5.159,3.032,1.502,4.519,1 61 | 11.42,12.86,0.8683,5.008,2.85,2.7,4.607,1 62 | 11.23,12.63,0.884,4.902,2.879,2.269,4.703,1 63 | 12.36,13.19,0.8923,5.076,3.042,3.22,4.605,1 64 | 13.22,13.84,0.868,5.395,3.07,4.157,5.088,1 65 | 12.78,13.57,0.8716,5.262,3.026,1.176,4.782,1 66 | 12.88,13.5,0.8879,5.139,3.119,2.352,4.607,1 67 | 14.34,14.37,0.8726,5.63,3.19,1.313,5.15,1 68 | 
14.01,14.29,0.8625,5.609,3.158,2.217,5.132,1 69 | 14.37,14.39,0.8726,5.569,3.153,1.464,5.3,1 70 | 12.73,13.75,0.8458,5.412,2.882,3.533,5.067,1 71 | 17.63,15.98,0.8673,6.191,3.561,4.076,6.06,2 72 | 16.84,15.67,0.8623,5.998,3.484,4.675,5.877,2 73 | 17.26,15.73,0.8763,5.978,3.594,4.539,5.791,2 74 | 19.11,16.26,0.9081,6.154,3.93,2.936,6.079,2 75 | 16.82,15.51,0.8786,6.017,3.486,4.004,5.841,2 76 | 16.77,15.62,0.8638,5.927,3.438,4.92,5.795,2 77 | 17.32,15.91,0.8599,6.064,3.403,3.824,5.922,2 78 | 20.71,17.23,0.8763,6.579,3.814,4.451,6.451,2 79 | 18.94,16.49,0.875,6.445,3.639,5.064,6.362,2 80 | 17.12,15.55,0.8892,5.85,3.566,2.858,5.746,2 81 | 16.53,15.34,0.8823,5.875,3.467,5.532,5.88,2 82 | 18.72,16.19,0.8977,6.006,3.857,5.324,5.879,2 83 | 20.2,16.89,0.8894,6.285,3.864,5.173,6.187,2 84 | 19.57,16.74,0.8779,6.384,3.772,1.472,6.273,2 85 | 19.51,16.71,0.878,6.366,3.801,2.962,6.185,2 86 | 18.27,16.09,0.887,6.173,3.651,2.443,6.197,2 87 | 18.88,16.26,0.8969,6.084,3.764,1.649,6.109,2 88 | 18.98,16.66,0.859,6.549,3.67,3.691,6.498,2 89 | 21.18,17.21,0.8989,6.573,4.033,5.78,6.231,2 90 | 20.88,17.05,0.9031,6.45,4.032,5.016,6.321,2 91 | 20.1,16.99,0.8746,6.581,3.785,1.955,6.449,2 92 | 18.76,16.2,0.8984,6.172,3.796,3.12,6.053,2 93 | 18.81,16.29,0.8906,6.272,3.693,3.237,6.053,2 94 | 18.59,16.05,0.9066,6.037,3.86,6.001,5.877,2 95 | 18.36,16.52,0.8452,6.666,3.485,4.933,6.448,2 96 | 16.87,15.65,0.8648,6.139,3.463,3.696,5.967,2 97 | 19.31,16.59,0.8815,6.341,3.81,3.477,6.238,2 98 | 18.98,16.57,0.8687,6.449,3.552,2.144,6.453,2 99 | 18.17,16.26,0.8637,6.271,3.512,2.853,6.273,2 100 | 18.72,16.34,0.881,6.219,3.684,2.188,6.097,2 101 | 16.41,15.25,0.8866,5.718,3.525,4.217,5.618,2 102 | 17.99,15.86,0.8992,5.89,3.694,2.068,5.837,2 103 | 19.46,16.5,0.8985,6.113,3.892,4.308,6.009,2 104 | 19.18,16.63,0.8717,6.369,3.681,3.357,6.229,2 105 | 18.95,16.42,0.8829,6.248,3.755,3.368,6.148,2 106 | 18.83,16.29,0.8917,6.037,3.786,2.553,5.879,2 107 | 18.85,16.17,0.9056,6.152,3.806,2.843,6.2,2 108 | 17.63,15.86,0.88,6.033,3.573,3.747,5.929,2 109 | 19.94,16.92,0.8752,6.675,3.763,3.252,6.55,2 110 | 18.55,16.22,0.8865,6.153,3.674,1.738,5.894,2 111 | 18.45,16.12,0.8921,6.107,3.769,2.235,5.794,2 112 | 19.38,16.72,0.8716,6.303,3.791,3.678,5.965,2 113 | 19.13,16.31,0.9035,6.183,3.902,2.109,5.924,2 114 | 19.14,16.61,0.8722,6.259,3.737,6.682,6.053,2 115 | 20.97,17.25,0.8859,6.563,3.991,4.677,6.316,2 116 | 19.06,16.45,0.8854,6.416,3.719,2.248,6.163,2 117 | 18.96,16.2,0.9077,6.051,3.897,4.334,5.75,2 118 | 19.15,16.45,0.889,6.245,3.815,3.084,6.185,2 119 | 18.89,16.23,0.9008,6.227,3.769,3.639,5.966,2 120 | 20.03,16.9,0.8811,6.493,3.857,3.063,6.32,2 121 | 20.24,16.91,0.8897,6.315,3.962,5.901,6.188,2 122 | 18.14,16.12,0.8772,6.059,3.563,3.619,6.011,2 123 | 16.17,15.38,0.8588,5.762,3.387,4.286,5.703,2 124 | 18.43,15.97,0.9077,5.98,3.771,2.984,5.905,2 125 | 15.99,14.89,0.9064,5.363,3.582,3.336,5.144,2 126 | 18.75,16.18,0.8999,6.111,3.869,4.188,5.992,2 127 | 18.65,16.41,0.8698,6.285,3.594,4.391,6.102,2 128 | 17.98,15.85,0.8993,5.979,3.687,2.257,5.919,2 129 | 20.16,17.03,0.8735,6.513,3.773,1.91,6.185,2 130 | 17.55,15.66,0.8991,5.791,3.69,5.366,5.661,2 131 | 18.3,15.89,0.9108,5.979,3.755,2.837,5.962,2 132 | 18.94,16.32,0.8942,6.144,3.825,2.908,5.949,2 133 | 15.38,14.9,0.8706,5.884,3.268,4.462,5.795,2 134 | 16.16,15.33,0.8644,5.845,3.395,4.266,5.795,2 135 | 15.56,14.89,0.8823,5.776,3.408,4.972,5.847,2 136 | 15.38,14.66,0.899,5.477,3.465,3.6,5.439,2 137 | 17.36,15.76,0.8785,6.145,3.574,3.526,5.971,2 138 | 15.57,15.15,0.8527,5.92,3.231,2.64,5.879,2 139 | 
15.6,15.11,0.858,5.832,3.286,2.725,5.752,2 140 | 16.23,15.18,0.885,5.872,3.472,3.769,5.922,2 141 | 13.07,13.92,0.848,5.472,2.994,5.304,5.395,3 142 | 13.32,13.94,0.8613,5.541,3.073,7.035,5.44,3 143 | 13.34,13.95,0.862,5.389,3.074,5.995,5.307,3 144 | 12.22,13.32,0.8652,5.224,2.967,5.469,5.221,3 145 | 11.82,13.4,0.8274,5.314,2.777,4.471,5.178,3 146 | 11.21,13.13,0.8167,5.279,2.687,6.169,5.275,3 147 | 11.43,13.13,0.8335,5.176,2.719,2.221,5.132,3 148 | 12.49,13.46,0.8658,5.267,2.967,4.421,5.002,3 149 | 12.7,13.71,0.8491,5.386,2.911,3.26,5.316,3 150 | 10.79,12.93,0.8107,5.317,2.648,5.462,5.194,3 151 | 11.83,13.23,0.8496,5.263,2.84,5.195,5.307,3 152 | 12.01,13.52,0.8249,5.405,2.776,6.992,5.27,3 153 | 12.26,13.6,0.8333,5.408,2.833,4.756,5.36,3 154 | 11.18,13.04,0.8266,5.22,2.693,3.332,5.001,3 155 | 11.36,13.05,0.8382,5.175,2.755,4.048,5.263,3 156 | 11.19,13.05,0.8253,5.25,2.675,5.813,5.219,3 157 | 11.34,12.87,0.8596,5.053,2.849,3.347,5.003,3 158 | 12.13,13.73,0.8081,5.394,2.745,4.825,5.22,3 159 | 11.75,13.52,0.8082,5.444,2.678,4.378,5.31,3 160 | 11.49,13.22,0.8263,5.304,2.695,5.388,5.31,3 161 | 12.54,13.67,0.8425,5.451,2.879,3.082,5.491,3 162 | 12.02,13.33,0.8503,5.35,2.81,4.271,5.308,3 163 | 12.05,13.41,0.8416,5.267,2.847,4.988,5.046,3 164 | 12.55,13.57,0.8558,5.333,2.968,4.419,5.176,3 165 | 11.14,12.79,0.8558,5.011,2.794,6.388,5.049,3 166 | 12.1,13.15,0.8793,5.105,2.941,2.201,5.056,3 167 | 12.44,13.59,0.8462,5.319,2.897,4.924,5.27,3 168 | 12.15,13.45,0.8443,5.417,2.837,3.638,5.338,3 169 | 11.35,13.12,0.8291,5.176,2.668,4.337,5.132,3 170 | 11.24,13,0.8359,5.09,2.715,3.521,5.088,3 171 | 11.02,13,0.8189,5.325,2.701,6.735,5.163,3 172 | 11.55,13.1,0.8455,5.167,2.845,6.715,4.956,3 173 | 11.27,12.97,0.8419,5.088,2.763,4.309,5,3 174 | 11.4,13.08,0.8375,5.136,2.763,5.588,5.089,3 175 | 10.83,12.96,0.8099,5.278,2.641,5.182,5.185,3 176 | 10.8,12.57,0.859,4.981,2.821,4.773,5.063,3 177 | 11.26,13.01,0.8355,5.186,2.71,5.335,5.092,3 178 | 10.74,12.73,0.8329,5.145,2.642,4.702,4.963,3 179 | 11.48,13.05,0.8473,5.18,2.758,5.876,5.002,3 180 | 12.21,13.47,0.8453,5.357,2.893,1.661,5.178,3 181 | 11.41,12.95,0.856,5.09,2.775,4.957,4.825,3 182 | 12.46,13.41,0.8706,5.236,3.017,4.987,5.147,3 183 | 12.19,13.36,0.8579,5.24,2.909,4.857,5.158,3 184 | 11.65,13.07,0.8575,5.108,2.85,5.209,5.135,3 185 | 12.89,13.77,0.8541,5.495,3.026,6.185,5.316,3 186 | 11.56,13.31,0.8198,5.363,2.683,4.062,5.182,3 187 | 11.81,13.45,0.8198,5.413,2.716,4.898,5.352,3 188 | 10.91,12.8,0.8372,5.088,2.675,4.179,4.956,3 189 | 11.23,12.82,0.8594,5.089,2.821,7.524,4.957,3 190 | 10.59,12.41,0.8648,4.899,2.787,4.975,4.794,3 191 | 10.93,12.8,0.839,5.046,2.717,5.398,5.045,3 192 | 11.27,12.86,0.8563,5.091,2.804,3.985,5.001,3 193 | 11.87,13.02,0.8795,5.132,2.953,3.597,5.132,3 194 | 10.82,12.83,0.8256,5.18,2.63,4.853,5.089,3 195 | 12.11,13.27,0.8639,5.236,2.975,4.132,5.012,3 196 | 12.8,13.47,0.886,5.16,3.126,4.873,4.914,3 197 | 12.79,13.53,0.8786,5.224,3.054,5.483,4.958,3 198 | 13.37,13.78,0.8849,5.32,3.128,4.67,5.091,3 199 | 12.62,13.67,0.8481,5.41,2.911,3.306,5.231,3 200 | 12.76,13.38,0.8964,5.073,3.155,2.828,4.83,3 201 | 12.38,13.44,0.8609,5.219,2.989,5.472,5.045,3 202 | 12.67,13.32,0.8977,4.984,3.135,2.3,4.745,3 203 | 11.18,12.72,0.868,5.009,2.81,4.051,4.828,3 204 | 12.7,13.41,0.8874,5.183,3.091,8.456,5,3 205 | 12.37,13.47,0.8567,5.204,2.96,3.919,5.001,3 206 | 12.19,13.2,0.8783,5.137,2.981,3.631,4.87,3 207 | 11.23,12.88,0.8511,5.14,2.795,4.325,5.003,3 208 | 13.2,13.66,0.8883,5.236,3.232,8.315,5.056,3 209 | 11.84,13.21,0.8521,5.175,2.836,3.598,5.044,3 210 | 
12.3,13.34,0.8684,5.243,2.974,5.637,5.063,3 -------------------------------------------------------------------------------- /class3_2/images/hanford_variables.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datapolitan/lede_algorithms/9109b5c91a3eb74c1343e79daf60abd76be182ea/class3_2/images/hanford_variables.png -------------------------------------------------------------------------------- /class3_2/images/iris_scatter.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datapolitan/lede_algorithms/9109b5c91a3eb74c1343e79daf60abd76be182ea/class3_2/images/iris_scatter.png -------------------------------------------------------------------------------- /class4_1/README.md: -------------------------------------------------------------------------------- 1 | # Algorithms: Week 4, Class 1 (Tuesday, Aug. 4) 2 | 3 | This week's class is going to be a bit different than the last few. After a quick review of last week's material, we're going to build a supervised learning system that is meant to outline a rough automated approach to [this story](http://www.nytimes.com/2015/08/02/us/small-pool-of-rich-donors-dominates-election-giving.html) from Sunday's Times about wealthy donors to Super PACs affiliated with presidential candidates. 4 | 5 | Along the way, we'll talk about: 6 | 7 | - How to train and apply different models, including the decision trees you've already discussed 8 | - How to engineer useful features for those models 9 | - How to evaluate the results of those models so you don't get yourself in trouble 10 | - The difference between statistical and rules-based solutions to problems like this 11 | 12 | For lab, you'll be asked to take on a simpler supervised learning problem that will give you a chance to apply the lessons from class. 13 | 14 | If you'd like to get a head start, feel free to read [this documentation](https://github.com/cjdd3b/fec-standardizer/wiki) on standardizing FEC data. We'll be taking a simpler approach, but much of the intuition will be similar. -------------------------------------------------------------------------------- /class4_1/doc_classifier.py: -------------------------------------------------------------------------------- 1 | from sklearn import preprocessing 2 | 3 | ########## FEATURES ########## 4 | 5 | # Put your features here 6 | 7 | 8 | ########## MAIN ########## 9 | 10 | if __name__ == '__main__': 11 | 12 | # First we'll do some preprocessing to create our two vectors for model training: features, which 13 | # represents the feature vector, and labels, which represent our correct answers. 14 | 15 | features, labels = [], [] 16 | with open('data/bills_training.txt', 'rU') as csvfile: 17 | for line in csvfile.readlines(): 18 | bill = line.strip().split('|') 19 | 20 | if len(bill) > 1: 21 | labels.append(bill[1]) 22 | 23 | features.append([ 24 | # Your features here, based on bill[0], which contains the text of the bill titles 25 | ]) 26 | 27 | # A little bit of cleanup for scikit-learn's benefit. Scikit-learn models wants our categories to 28 | # be numbers, not strings. The LabelEncoder performs this transformation. 
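    # For example (with made-up labels), LabelEncoder().fit_transform(['education', 'crime', 'education'])
    # returns array([1, 0, 1]): each distinct label string gets an integer code, assigned in sorted order,
    # which is the numeric form scikit-learn expects for its target vector.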
29 | encoder = preprocessing.LabelEncoder() 30 | encoded_labels = encoder.fit_transform(labels) 31 | 32 | print features 33 | print encoded_labels 34 | 35 | # STEP ONE: Create and train a model 36 | 37 | # Your code here 38 | 39 | 40 | # STEP TWO: Evaluate the model 41 | 42 | # Your code here 43 | 44 | 45 | # STEP THREE: Apply the model 46 | 47 | # Use the model to get categories for each of these documents 48 | 49 | docs_new = ["Public postsecondary education: executive officer compensation.", 50 | "An act to add Section 236.3 to the Education code, related to the pricing of college textbooks.", 51 | "Political Reform Act of 1974: campaign disclosures.", 52 | "An act to add Section 236.3 to the Penal Code, relating to human trafficking." 53 | ] -------------------------------------------------------------------------------- /class4_1/donors.py: -------------------------------------------------------------------------------- 1 | import csv, itertools 2 | import numpy as np 3 | from sklearn.tree import DecisionTreeClassifier 4 | from sklearn.linear_model import LogisticRegression 5 | from sklearn import cross_validation 6 | from sklearn.cross_validation import KFold 7 | from sklearn import metrics 8 | from nameparser import HumanName 9 | 10 | ########## HELPER FUNCTIONS ########## 11 | 12 | def _shingle(word, n): 13 | ''' 14 | Splits words into shingles of size n. Given the word "shingle" and n=2, the output 15 | would be a list that looks like :['sh', 'hi', 'in', 'ng', 'gl', 'le'] 16 | 17 | More on shingling here: http://blog.mafr.de/2011/01/06/near-duplicate-detection/ 18 | ''' 19 | return set([word[i:i + n] for i in range(len(word) - n + 1)]) 20 | 21 | def _jaccard_sim(X, Y): 22 | ''' 23 | Jaccard similarity between two sets. 24 | 25 | Explanation here: http://en.wikipedia.org/wiki/Jaccard_index 26 | ''' 27 | if not X or not Y: return 0 28 | x = set(X) 29 | y = set(Y) 30 | return float(len(x & y)) / len(x | y) 31 | 32 | def sim(str1, str2, shingle_length=3): 33 | ''' 34 | String similarity metric based on shingles and Jaccard. 35 | ''' 36 | str1_shingles = _shingle(str1, shingle_length) 37 | str2_shingles = _shingle(str2, shingle_length) 38 | return _jaccard_sim(str1_shingles, str2_shingles) 39 | 40 | ########## FEATURES ########## 41 | 42 | def same_name(name1, name2): 43 | return 1 if name1 == name2 else 0 44 | 45 | def same_zip_code(zip1, zip2): 46 | return 1 if zip1[:5] == zip2[:5] else 0 47 | 48 | def same_first_name(name1, name2): 49 | first1 = HumanName(name1).first 50 | first2 = HumanName(name2).first 51 | return 1 if first1 == first2 else 0 52 | 53 | def same_last_name(name1, name2): 54 | last1 = HumanName(name1).last 55 | last2 = HumanName(name2).last 56 | return 1 if last1 == last2 else 0 57 | 58 | 59 | # We're going to add more here ... 60 | 61 | ########## MAIN ########## 62 | 63 | if __name__ == '__main__': 64 | 65 | # STEP ONE: Train our model. 
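    # Each training example here is a PAIR of contribution records: itertools.combinations()
    # below generates every pairing, the feature functions above turn each pair into a vector
    # of 0/1 similarity signals, and the label is 1 when both records share a contributor_ext_id
    # (i.e., they are known to be the same donor). The decision tree then learns which
    # combinations of signals indicate a match.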
66 | 67 | features, matches = [], [] 68 | with open('data/contribs_training_small.csv', 'rU') as csvfile: 69 | reader = csv.DictReader(csvfile) 70 | for c in itertools.combinations(reader, 2): 71 | 72 | # Fill up our vector of correct answers 73 | match = 1 if c[0]['contributor_ext_id'] == c[1]['contributor_ext_id'] else 0 74 | matches.append(match) 75 | 76 | # And now fill up our feature vector 77 | features.append([ 78 | same_name(c[0]['name'], c[1]['name']), 79 | same_zip_code(c[0]['zip_code'], c[1]['zip_code']), 80 | same_first_name(c[0]['name'], c[1]['name']), 81 | same_last_name(c[0]['name'], c[1]['name']) 82 | 83 | ]) 84 | 85 | clf = DecisionTreeClassifier() 86 | clf = clf.fit(features, matches) 87 | 88 | # STEP TWO: Evaluate the model using 10-fold cross-validation 89 | 90 | # scores = cross_validation.cross_val_score(clf, features, matches, cv=10, scoring='f1') 91 | # print "%s (%s folds): %0.2f (+/- %0.2f)\n" % ('f1', 10, scores.mean(), scores.std() / 2) 92 | 93 | # STEP THREE: Apply the model 94 | 95 | with open('data/contribs_unclassified.csv', 'rU') as csvfile: 96 | reader = csv.DictReader(csvfile) 97 | for key, group in itertools.groupby(reader, lambda x: x['last_name']): 98 | for c in itertools.combinations(group, 2): 99 | 100 | # Making print-friendly representations of the records, for easier evaluation 101 | record1 = '%s, %s %s | %s %s %s | %s %s' % \ 102 | (c[0]['last_name'], c[0]['first_name'], c[0]['middle_name'], 103 | c[0]['city'], c[0]['state'], c[0]['zip'], 104 | c[0]['employer'], c[0]['occupation']) 105 | record2 = '%s, %s %s | %s %s %s | %s %s' % \ 106 | (c[1]['last_name'], c[1]['first_name'], c[1]['middle_name'], 107 | c[1]['city'], c[1]['state'], c[1]['zip'], 108 | c[1]['employer'], c[1]['occupation']) 109 | 110 | # We need to do this because our training set has full names, but this set has name 111 | # components. Turn those into full names. 112 | name1 = '%s, %s %s' % (c[0]['last_name'], c[0]['first_name'], c[0]['middle_name']) 113 | name2 = '%s, %s %s' % (c[1]['last_name'], c[1]['first_name'], c[1]['middle_name']) 114 | 115 | # And now fill up our feature vector 116 | features = [ 117 | same_name(name1, name2), 118 | same_zip_code(c[0]['zip'], c[1]['zip']), 119 | same_first_name(name1, name2), 120 | same_last_name(name1, name2) 121 | ] 122 | 123 | # Predict match or no match 124 | match = clf.predict_proba(features) 125 | 126 | # Print the results 127 | if match[0][0] < match[0][1]: 128 | print 'MATCH!' 129 | print record1 + ' ---------> ' + record2 + '\n' 130 | print match 131 | else: 132 | print 'NO MATCH!' 133 | print record1 + ' ---------> ' + record2 + '\n' -------------------------------------------------------------------------------- /class4_2/4-2_DoNow.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "###import wine.csv and build a decision tree classifier to predict wine_cultivar. 
Test the data using 5-fold cross validation" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [] 18 | } 19 | ], 20 | "metadata": { 21 | "kernelspec": { 22 | "display_name": "Python 2", 23 | "language": "python", 24 | "name": "python2" 25 | }, 26 | "language_info": { 27 | "codemirror_mode": { 28 | "name": "ipython", 29 | "version": 2 30 | }, 31 | "file_extension": ".py", 32 | "mimetype": "text/x-python", 33 | "name": "python", 34 | "nbconvert_exporter": "python", 35 | "pygments_lexer": "ipython2", 36 | "version": "2.7.10" 37 | } 38 | }, 39 | "nbformat": 4, 40 | "nbformat_minor": 0 41 | } 42 | -------------------------------------------------------------------------------- /class4_2/Feature_Engineering.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import pandas as pd\n", 12 | "%matplotlib inline" 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "metadata": {}, 18 | "source": [ 19 | "###A simple example to illustrate the intuition behind dummy variables" 20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": null, 25 | "metadata": { 26 | "collapsed": true 27 | }, 28 | "outputs": [], 29 | "source": [ 30 | "df = pd.DataFrame({'key':['b','b','a','c','a','b'],'data1':range(6)})" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": null, 36 | "metadata": { 37 | "collapsed": false 38 | }, 39 | "outputs": [], 40 | "source": [ 41 | "df" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": null, 47 | "metadata": { 48 | "collapsed": false 49 | }, 50 | "outputs": [], 51 | "source": [ 52 | "pd.get_dummies(df['key'],prefix='key')" 53 | ] 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "metadata": { 58 | "collapsed": true 59 | }, 60 | "source": [ 61 | "###Now we have a matrix of values based on the presence of absence of the attribute value in our dataset" 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "###Now let's look at another example using our flight data" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": null, 74 | "metadata": { 75 | "collapsed": false 76 | }, 77 | "outputs": [], 78 | "source": [ 79 | "df = pd.read_csv('data/ontime_reports_may_2015_ny.csv')" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": null, 85 | "metadata": { 86 | "collapsed": false 87 | }, 88 | "outputs": [], 89 | "source": [ 90 | "#count number of NaNs in column\n", 91 | "df['DEP_DELAY'].isnull().sum()" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": null, 97 | "metadata": { 98 | "collapsed": false 99 | }, 100 | "outputs": [], 101 | "source": [ 102 | "#calculate the percentage this represents of the total number of instances\n", 103 | "df['DEP_DELAY'].isnull().sum()/df['DEP_DELAY'].sum()" 104 | ] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "metadata": {}, 109 | "source": [ 110 | "###We could explore whether the NaNs are actually zero delays, but we'll just filter them out for now, especially since they represent such a small number of instances" 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": null, 116 | "metadata": { 117 | "collapsed": false 118 | }, 119 | "outputs": [], 120 | "source": [ 
121 | "#filter DEP_DELAY NaNs\n", 122 | "df = df[pd.notnull(df['DEP_DELAY'])]" 123 | ] 124 | }, 125 | { 126 | "cell_type": "markdown", 127 | "metadata": {}, 128 | "source": [ 129 | "###We can discretize the continuous DEP_DELAY value by giving it a value of 0 if it's delayed and a 1 if it's not. We record this value into a separate column. (We could also code -1 for early, 0 for ontime, and 1 for late)" 130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": null, 135 | "metadata": { 136 | "collapsed": false 137 | }, 138 | "outputs": [], 139 | "source": [ 140 | "#code whether delay or not delayed\n", 141 | "df['IS_DELAYED'] = df['DEP_DELAY'].apply(lambda x: 1 if x>0 else 0 )" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": null, 147 | "metadata": { 148 | "collapsed": false, 149 | "scrolled": true 150 | }, 151 | "outputs": [], 152 | "source": [ 153 | "#Let's check that our column was created properly\n", 154 | "df[['DEP_DELAY','IS_DELAYED']]" 155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": null, 160 | "metadata": { 161 | "collapsed": true 162 | }, 163 | "outputs": [], 164 | "source": [ 165 | "###Dummy variables create a " 166 | ] 167 | }, 168 | { 169 | "cell_type": "code", 170 | "execution_count": null, 171 | "metadata": { 172 | "collapsed": false 173 | }, 174 | "outputs": [], 175 | "source": [ 176 | "pd.get_dummies(df['ORIGIN'],prefix='origin')" 177 | ] 178 | }, 179 | { 180 | "cell_type": "markdown", 181 | "metadata": {}, 182 | "source": [ 183 | "###Normalize values" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": null, 189 | "metadata": { 190 | "collapsed": false 191 | }, 192 | "outputs": [], 193 | "source": [ 194 | "#Normalize the data attributes for the Iris dataset\n", 195 | "# Example from Jump Start Scikit Learn https://machinelearningmastery.com/jump-start-scikit-learn/\n", 196 | "from sklearn.datasets import load_iris \n", 197 | "from sklearn import preprocessing #load the iris dataset\n", 198 | "iris=load_iris()\n", 199 | "X=iris.data\n", 200 | "y=iris.target #normalize the data attributes \n", 201 | "normalized_X = preprocessing.normalize(X)" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": null, 207 | "metadata": { 208 | "collapsed": false 209 | }, 210 | "outputs": [], 211 | "source": [ 212 | "zip(X,normalized_X)" 213 | ] 214 | }, 215 | { 216 | "cell_type": "code", 217 | "execution_count": null, 218 | "metadata": { 219 | "collapsed": true 220 | }, 221 | "outputs": [], 222 | "source": [] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": null, 227 | "metadata": { 228 | "collapsed": true 229 | }, 230 | "outputs": [], 231 | "source": [] 232 | }, 233 | { 234 | "cell_type": "code", 235 | "execution_count": null, 236 | "metadata": { 237 | "collapsed": true 238 | }, 239 | "outputs": [], 240 | "source": [] 241 | } 242 | ], 243 | "metadata": { 244 | "kernelspec": { 245 | "display_name": "Python 2", 246 | "language": "python", 247 | "name": "python2" 248 | }, 249 | "language_info": { 250 | "codemirror_mode": { 251 | "name": "ipython", 252 | "version": 2 253 | }, 254 | "file_extension": ".py", 255 | "mimetype": "text/x-python", 256 | "name": "python", 257 | "nbconvert_exporter": "python", 258 | "pygments_lexer": "ipython2", 259 | "version": "2.7.10" 260 | } 261 | }, 262 | "nbformat": 4, 263 | "nbformat_minor": 0 264 | } 265 | -------------------------------------------------------------------------------- 
/class4_2/Logistic_regression.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import pandas as pd\n", 19 | "%matplotlib inline\n", 20 | "import numpy as np" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": null, 26 | "metadata": { 27 | "collapsed": true 28 | }, 29 | "outputs": [], 30 | "source": [ 31 | "titanic = pd.read_csv(\"data/titanic.csv\")" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": null, 37 | "metadata": { 38 | "collapsed": false 39 | }, 40 | "outputs": [], 41 | "source": [ 42 | "titanic.columns" 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "metadata": {}, 48 | "source": [ 49 | "###Let's do a simple logistic regression to predict survival based on pclass and sex" 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": {}, 55 | "source": [ 56 | "First we need to prepare our features. Remember we drop one value in each dummy to avoid the dummy variable trap" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": null, 62 | "metadata": { 63 | "collapsed": true 64 | }, 65 | "outputs": [], 66 | "source": [ 67 | "titanic['sex_female'] = titanic['sex'].apply(lambda x:1 if x=='female' else 0)" 68 | ] 69 | }, 70 | { 71 | "cell_type": "code", 72 | "execution_count": null, 73 | "metadata": { 74 | "collapsed": true 75 | }, 76 | "outputs": [], 77 | "source": [ 78 | "dataset = titanic[['survived']].join([pd.get_dummies(titanic['pclass'],prefix=\"pclass\"),titanic.sex_female])" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": null, 84 | "metadata": { 85 | "collapsed": true 86 | }, 87 | "outputs": [], 88 | "source": [ 89 | "from sklearn.linear_model import LogisticRegression" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": null, 95 | "metadata": { 96 | "collapsed": true 97 | }, 98 | "outputs": [], 99 | "source": [ 100 | "lm = LogisticRegression()" 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "execution_count": null, 106 | "metadata": { 107 | "collapsed": false 108 | }, 109 | "outputs": [], 110 | "source": [ 111 | "#drop pclass_1st to avoid dummy variable trap\n", 112 | "x = np.asarray(dataset[['pclass_2nd','pclass_3rd','sex_female']])\n", 113 | "y = np.asarray(dataset['survived'])" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": null, 119 | "metadata": { 120 | "collapsed": true 121 | }, 122 | "outputs": [], 123 | "source": [ 124 | "lm = lm.fit(x,y)" 125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "execution_count": null, 130 | "metadata": { 131 | "collapsed": false 132 | }, 133 | "outputs": [], 134 | "source": [ 135 | "lm.score(x,y)" 136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": null, 141 | "metadata": { 142 | "collapsed": false 143 | }, 144 | "outputs": [], 145 | "source": [ 146 | "y.mean()" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": null, 152 | "metadata": { 153 | "collapsed": false 154 | }, 155 | "outputs": [], 156 | "source": [ 157 | "lm.coef_" 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": null, 163 | "metadata": { 164 | "collapsed": false 165 | }, 166 | "outputs": [], 167 | "source": [ 168 | "lm.intercept_" 
169 | ] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "execution_count": null, 174 | "metadata": { 175 | "collapsed": true 176 | }, 177 | "outputs": [], 178 | "source": [] 179 | } 180 | ], 181 | "metadata": { 182 | "kernelspec": { 183 | "display_name": "Python 2", 184 | "language": "python", 185 | "name": "python2" 186 | }, 187 | "language_info": { 188 | "codemirror_mode": { 189 | "name": "ipython", 190 | "version": 2 191 | }, 192 | "file_extension": ".py", 193 | "mimetype": "text/x-python", 194 | "name": "python", 195 | "nbconvert_exporter": "python", 196 | "pygments_lexer": "ipython2", 197 | "version": "2.7.10" 198 | } 199 | }, 200 | "nbformat": 4, 201 | "nbformat_minor": 0 202 | } 203 | -------------------------------------------------------------------------------- /class4_2/Naive_Bayes.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": { 7 | "collapsed": false 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "from sklearn import datasets\n", 12 | "from sklearn import metrics \n", 13 | "from sklearn.naive_bayes import GaussianNB \n", 14 | "#load the iris datasets \n", 15 | "dataset=datasets.load_iris() \n", 16 | "#fit a Naive Bayes model to the data \n", 17 | "model=GaussianNB() \n", 18 | "model.fit(dataset.data,dataset.target) \n", 19 | "print(model)\n", 20 | "#makepredictions\n", 21 | "expected=dataset.target \n", 22 | "predicted=model.predict(dataset.data) \n", 23 | "#summarize the fit of the model \n", 24 | "print(metrics.classification_report(expected,predicted)) \n", 25 | "print(metrics.confusion_matrix(expected,predicted))" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "###Let's get back to the saga of Leo and Kate" 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": null, 38 | "metadata": { 39 | "collapsed": true 40 | }, 41 | "outputs": [], 42 | "source": [ 43 | "import pandas as pd\n", 44 | "%matplotlib inline\n", 45 | "import numpy as np" 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": null, 51 | "metadata": { 52 | "collapsed": true 53 | }, 54 | "outputs": [], 55 | "source": [ 56 | "titanic = pd.read_csv(\"data/titanic.csv\")" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": null, 62 | "metadata": { 63 | "collapsed": false 64 | }, 65 | "outputs": [], 66 | "source": [ 67 | "titanic['sex_female'] = titanic['sex'].apply(lambda x:1 if x=='female' else 0)" 68 | ] 69 | }, 70 | { 71 | "cell_type": "code", 72 | "execution_count": null, 73 | "metadata": { 74 | "collapsed": true 75 | }, 76 | "outputs": [], 77 | "source": [ 78 | "dataset = titanic[['survived']].join([pd.get_dummies(titanic['pclass'],prefix=\"pclass\"),titanic.sex_female])" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": null, 84 | "metadata": { 85 | "collapsed": false 86 | }, 87 | "outputs": [], 88 | "source": [ 89 | "#drop pclass_1st to avoid dummy variable trap\n", 90 | "x = np.asarray(dataset[['pclass_1st','pclass_2nd','pclass_3rd','sex_female']])\n", 91 | "y = np.asarray(dataset['survived'])" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": null, 97 | "metadata": { 98 | "collapsed": false 99 | }, 100 | "outputs": [], 101 | "source": [ 102 | "model.fit(x,y)" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": null, 108 | "metadata": { 109 | "collapsed": true 110 | }, 111 | "outputs": [], 112 | 
"source": [ 113 | "expected = y \n", 114 | "predicted = model.predict(x) " 115 | ] 116 | }, 117 | { 118 | "cell_type": "code", 119 | "execution_count": null, 120 | "metadata": { 121 | "collapsed": true 122 | }, 123 | "outputs": [], 124 | "source": [ 125 | "def measure_performance(X,y,clf, show_accuracy=True, show_classification_report=True, show_confussion_matrix=True):\n", 126 | " y_pred=clf.predict(X)\n", 127 | " if show_accuracy:\n", 128 | " print \"Accuracy:{0:.3f}\".format(metrics.accuracy_score(y, y_pred)),\"\\n\"\n", 129 | " if show_classification_report:\n", 130 | " print \"Classification report\"\n", 131 | " print metrics.classification_report(y,y_pred),\"\\n\"\n", 132 | " if show_confussion_matrix:\n", 133 | " print \"Confusion matrix\"\n", 134 | " print metrics.confusion_matrix(y,y_pred),\"\\n\"" 135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": null, 140 | "metadata": { 141 | "collapsed": false 142 | }, 143 | "outputs": [], 144 | "source": [ 145 | "measure_performance(x,y,model)" 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": null, 151 | "metadata": { 152 | "collapsed": true 153 | }, 154 | "outputs": [], 155 | "source": [] 156 | } 157 | ], 158 | "metadata": { 159 | "kernelspec": { 160 | "display_name": "Python 2", 161 | "language": "python", 162 | "name": "python2" 163 | }, 164 | "language_info": { 165 | "codemirror_mode": { 166 | "name": "ipython", 167 | "version": 2 168 | }, 169 | "file_extension": ".py", 170 | "mimetype": "text/x-python", 171 | "name": "python", 172 | "nbconvert_exporter": "python", 173 | "pygments_lexer": "ipython2", 174 | "version": "2.7.10" 175 | } 176 | }, 177 | "nbformat": 4, 178 | "nbformat_minor": 0 179 | } 180 | -------------------------------------------------------------------------------- /class4_2/images/titanic.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datapolitan/lede_algorithms/9109b5c91a3eb74c1343e79daf60abd76be182ea/class4_2/images/titanic.png -------------------------------------------------------------------------------- /class5_1/.ipynb_checkpoints/vectorization-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "from sklearn.feature_extraction.text import CountVectorizer" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "## Basic vectorization\n", 19 | "\n", 20 | "Vectorizing text is a fundamental concept in applying both supervised and unsupervised learning to documents. 
Basically, you can think of it as turning the words in a given text document into features.\n", 21 | "\n", 22 | "Rather than explicitly defining our features, as we did for the donor classification problem, we can instead take advantage of tools, called vectorizers, that turn each word into a feature best described as \"The number of times Word X appears in this document\".\n", 23 | "\n", 24 | "Here's an example with one bill title:" 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": 14, 30 | "metadata": { 31 | "collapsed": true 32 | }, 33 | "outputs": [], 34 | "source": [ 35 | "bill_titles = ['An act to amend Section 44277 of the Education Code, relating to teachers.']" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": 16, 41 | "metadata": { 42 | "collapsed": false 43 | }, 44 | "outputs": [ 45 | { 46 | "name": "stdout", 47 | "output_type": "stream", 48 | "text": [ 49 | "[[1 1 1 1 1 1 1 1 1 1 1 2]]\n" 50 | ] 51 | } 52 | ], 53 | "source": [ 54 | "vectorizer = CountVectorizer()\n", 55 | "features = vectorizer.fit_transform(bill_titles).toarray()" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": 17, 61 | "metadata": { 62 | "collapsed": false 63 | }, 64 | "outputs": [ 65 | { 66 | "name": "stdout", 67 | "output_type": "stream", 68 | "text": [ 69 | "[[1 1 1 1 1 1 1 1 1 1 1 2]]\n", 70 | "[u'44277', u'act', u'amend', u'an', u'code', u'education', u'of', u'relating', u'section', u'teachers', u'the', u'to']\n" 71 | ] 72 | } 73 | ], 74 | "source": [ 75 | "print features\n", 76 | "print vectorizer.get_feature_names()" 77 | ] 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "metadata": {}, 82 | "source": [ 83 | "Think of this vector as a matrix with one row and 12 columns. The row corresponds to our document above. The columns each correspond to a word contained in that document (the first is \"44277\", the second is \"act\", etc.) The numbers correspond to the number of times each word appears in that document. You'll see that all words appear once, except the last one, \"to\", which appears twice.\n", 84 | "\n", 85 | "Now what happens if we add another bill and run it again?" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": 19, 91 | "metadata": { 92 | "collapsed": false 93 | }, 94 | "outputs": [ 95 | { 96 | "name": "stdout", 97 | "output_type": "stream", 98 | "text": [ 99 | "[[1 1 1 1 0 1 0 1 0 1 1 0 1 1 1 2]\n", 100 | " [0 1 0 1 1 0 1 0 1 0 0 1 0 0 0 1]]\n", 101 | "[u'44277', u'act', u'amend', u'an', u'care', u'code', u'coverage', u'education', u'health', u'of', u'relating', u'relative', u'section', u'teachers', u'the', u'to']\n" 102 | ] 103 | } 104 | ], 105 | "source": [ 106 | "bill_titles = ['An act to amend Section 44277 of the Education Code, relating to teachers.',\n", 107 | " 'An act relative to health care coverage']\n", 108 | "features = vectorizer.fit_transform(bill_titles).toarray()\n", 109 | "\n", 110 | "print features\n", 111 | "print vectorizer.get_feature_names()" 112 | ] 113 | }, 114 | { 115 | "cell_type": "markdown", 116 | "metadata": {}, 117 | "source": [ 118 | "Now we've got two rows, each corresponding to a document. The columns correspond to all words contained in BOTH documents, with counts. For example, the first entry from the first column, \"44277', appears once in the first document but zero times in the second. This, basically, is the concept of vectorization." 
119 | ] 120 | }, 121 | { 122 | "cell_type": "markdown", 123 | "metadata": {}, 124 | "source": [ 125 | "## Cleaning up our vectors\n", 126 | "\n", 127 | "As you might imagine, a document set with a relatively large vocabulary can result in vectors that are thousands and thousands of dimensions wide. This isn't necessarily bad, but in the interest of keeping our feature space as low-dimensional as possible, there are a few things we can do to clean them up.\n", 128 | "\n", 129 | "First is removing so-called \"stop words\" -- words like \"and\", \"or\", \"the', etc. that appear in almost every document and therefore aren't especially useful. Scikit-learn's vectorizer objects make this easy:" 130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": 21, 135 | "metadata": { 136 | "collapsed": false 137 | }, 138 | "outputs": [ 139 | { 140 | "name": "stdout", 141 | "output_type": "stream", 142 | "text": [ 143 | "[[1 1 1 0 1 0 1 0 1 0 1 1]\n", 144 | " [0 1 0 1 0 1 0 1 0 1 0 0]]\n", 145 | "[u'44277', u'act', u'amend', u'care', u'code', u'coverage', u'education', u'health', u'relating', u'relative', u'section', u'teachers']\n" 146 | ] 147 | } 148 | ], 149 | "source": [ 150 | "new_vectorizer = CountVectorizer(stop_words='english')\n", 151 | "features = new_vectorizer.fit_transform(bill_titles).toarray()\n", 152 | "\n", 153 | "print features\n", 154 | "print new_vectorizer.get_feature_names()" 155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "metadata": {}, 160 | "source": [ 161 | "Notice that our feature space is now a little smaller. We can use a similar trick to eliminate words that only appear a small number of times, which becomes useful when document sets get very large." 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": 24, 167 | "metadata": { 168 | "collapsed": false 169 | }, 170 | "outputs": [ 171 | { 172 | "name": "stdout", 173 | "output_type": "stream", 174 | "text": [ 175 | "[[1]\n", 176 | " [1]]\n", 177 | "[u'act']\n" 178 | ] 179 | } 180 | ], 181 | "source": [ 182 | "new_vectorizer = CountVectorizer(stop_words='english', min_df=2)\n", 183 | "features = new_vectorizer.fit_transform(bill_titles).toarray()\n", 184 | "\n", 185 | "print features\n", 186 | "print new_vectorizer.get_feature_names()" 187 | ] 188 | }, 189 | { 190 | "cell_type": "markdown", 191 | "metadata": {}, 192 | "source": [ 193 | "This is a bad example for this document set, but it will help later -- I promise. Finally, we can also create features that comprise more than one word. These are known as N-grams, with the N being the number of words contained in the feature. 
Here is how you could create a feature vector of all 1-grams and 2-grams:" 194 | ] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "execution_count": null, 199 | "metadata": { 200 | "collapsed": true 201 | }, 202 | "outputs": [], 203 | "source": [ 204 | "new_vectorizer = CountVectorizer(stop_words='english', ngram_range=(1,2))\n", 205 | "features = new_vectorizer.fit_transform(bill_titles).toarray()\n", 206 | "\n", 207 | "print features\n", 208 | "print new_vectorizer.get_feature_names()" 209 | ] 210 | } 211 | ], 212 | "metadata": { 213 | "kernelspec": { 214 | "display_name": "Python 2", 215 | "language": "python", 216 | "name": "python2" 217 | }, 218 | "language_info": { 219 | "codemirror_mode": { 220 | "name": "ipython", 221 | "version": 2 222 | }, 223 | "file_extension": ".py", 224 | "mimetype": "text/x-python", 225 | "name": "python", 226 | "nbconvert_exporter": "python", 227 | "pygments_lexer": "ipython2", 228 | "version": "2.7.9" 229 | } 230 | }, 231 | "nbformat": 4, 232 | "nbformat_minor": 0 233 | } 234 | -------------------------------------------------------------------------------- /class5_1/README.md: -------------------------------------------------------------------------------- 1 | # Algorithms: Week 5, Class 1 (Tuesday, Aug. 11) 2 | 3 | We'll pick up where we left off last week with the bill classification problem, using it as an excuse to introduce a method of feature creation that is especially useful for text documents -- the idea of vectorization. If we have time, we'll also discuss the basic idea of supervised learning: 4 | 5 | ## Hour 1: Exercise review 6 | 7 | We'll talk in detail through the exercises from last week (which were deliberately difficult) and use them to segue into basic natural language processing techniques. 8 | 9 | ## Hour 2/2.5: Vectorization 10 | 11 | We'll talk about how to use vectorization to engineer our features for natural language classification and clustering problems, rather than building features by hand. We'll then revisit the bill classification problem from last week using what we've learned. 12 | 13 | ## Hour 2.5/3: Unsupervised learning 14 | 15 | We'll talk a little about the intuition and dangers of unsupervised learning, also known as clustering, using crime data as our example. 16 | 17 | ## Lab 18 | 19 | You'll be doing two things in lab today: 20 | 21 | - First you'll work through a simple document classification problem (classifying drug-related and non-drug-related press releases) using vectorization and the other techinques we discussed in class. 22 | 23 | - Second, take a look at [this map](https://www.google.com/maps/d/u/1/embed?mid=z9S6reOYqCIE.kQnlzV2-uDzg), which shows police dispatch logs for Columbia, Mo., over the first 10 days of August. Within the map, there are three layers (eps_0.3, eps_0.2, eps_0.4), each of which shows hotspots of dispatches calculated in slightly different ways. Choose what you think is the fairest representations of the hotsports and write a couple paragraphs characterizing your findings. Be sure to include the layer you chose in your Tumblr post. 
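If you get stuck on the press-release task, the pattern is the same vectorize-then-train loop used for the bill titles. A minimal sketch (using the repo's data/releases_training.txt, with the classifier choice and the evaluation step left to you) might look like:

from sklearn import preprocessing
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

training = [line.strip().split('|') for line in open('data/releases_training.txt').readlines()]
text = [t[0] for t in training if len(t) > 1]
labels = preprocessing.LabelEncoder().fit_transform([t[1] for t in training if len(t) > 1])

vectorizer = CountVectorizer(stop_words='english')
features = vectorizer.fit_transform(text)
model = MultinomialNB().fit(features, labels)   # or swap in DecisionTreeClassifier, as in class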
-------------------------------------------------------------------------------- /class5_1/bill_classifier.py: -------------------------------------------------------------------------------- 1 | from sklearn import preprocessing 2 | from sklearn import cross_validation 3 | from sklearn.tree import DecisionTreeClassifier 4 | from sklearn.naive_bayes import MultinomialNB 5 | from sklearn.feature_extraction.text import CountVectorizer 6 | 7 | if __name__ == '__main__': 8 | 9 | ########## STEP 1: DATA IMPORT AND PREPROCESSING ########## 10 | 11 | # Here we're taking in the training data and splitting it into two lists: One with the text of 12 | # each bill title, and the second with each bill title's corresponding category. Order is important. 13 | # The first bill in list 1 should also be the first category in list 2. 14 | training = [line.strip().split('|') for line in open('data/bills_training.txt', 'r').readlines()] 15 | text = [t[0] for t in training if len(t) > 1] 16 | labels = [t[1] for t in training if len(t) > 1] 17 | 18 | # A little bit of cleanup for scikit-learn's benefit. Scikit-learn models wants our categories to 19 | # be numbers, not strings. The LabelEncoder performs this transformation. 20 | encoder = preprocessing.LabelEncoder() 21 | correct_labels = encoder.fit_transform(labels) 22 | 23 | ########## STEP 2: FEATURE EXTRACTION ########## 24 | print 'Extracting features ...' 25 | 26 | vectorizer = CountVectorizer(stop_words='english') 27 | data = vectorizer.fit_transform(text) 28 | 29 | ########## STEP 3: MODEL BUILDING ########## 30 | print 'Training ...' 31 | 32 | #model = MultinomialNB() 33 | model = DecisionTreeClassifier() 34 | fit_model = model.fit(data, correct_labels) 35 | 36 | # ########## STEP 4: EVALUATION ########## 37 | print 'Evaluating ...' 38 | 39 | # Evaluate our model with 10-fold cross-validation 40 | scores = cross_validation.cross_val_score(model, data, correct_labels, cv=5) 41 | print "Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2) 42 | 43 | # ########## STEP 5: APPLYING THE MODEL ########## 44 | print 'Classifying ...' 45 | 46 | docs_new = ["Public postsecondary education: executive officer compensation.", 47 | "An act to add Section 236.3 to the Education code, related to the pricing of college textbooks.", 48 | "Political Reform Act of 1974: campaign disclosures.", 49 | "An act to add Section 236.3 to the Penal Code, relating to human trafficking." 50 | ] 51 | 52 | test_data = vectorizer.transform(docs_new) 53 | 54 | for i in xrange(len(docs_new)): 55 | print '%s -> %s' % (docs_new[i], encoder.classes_[model.predict(test_data.toarray()[i])]) -------------------------------------------------------------------------------- /class5_1/crime_clusterer.py: -------------------------------------------------------------------------------- 1 | ''' 2 | cluster.py 3 | This script demonstrates the use of the DBSCAN algorithm for finding 4 | clusters of crimes in Columbia, Mo. DBSCAN is a density-based clustering 5 | algorithm that finds points based on their proximity to other points in the 6 | dataset. Unlike algorithms such as K-means, you do not need to specify the 7 | number of clusters you would like it to find in advance. Instead, you set a 8 | parameter, epsilon, that identifies how close you would like two points to be 9 | for them to belong to the same cluster. 
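Points that do not have enough neighbors within epsilon (min_samples, which defaults to 5 in scikit-learn) are
treated as noise and come back with the cluster label -1 in the output below.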
10 | 11 | More information here: 12 | http://en.wikipedia.org/wiki/DBSCAN 13 | http://scikit-learn.org/dev/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN 14 | 15 | And there's a clean, documented implementation here for reference: 16 | https://github.com/cjdd3b/car-datascience-toolkit/blob/master/cluster/dbscan.py 17 | ''' 18 | 19 | import csv 20 | import numpy as np 21 | from scipy.spatial import distance 22 | from sklearn.cluster import DBSCAN 23 | 24 | ########## MODIFY THIS ########## 25 | 26 | EPS = 0.04 27 | 28 | ######### DON'T WORRY (YET) ABOUT MODIFYING THIS ########## 29 | 30 | # Pull in our data using DictReader 31 | data = list(csv.DictReader(open('data/columbia_crime.csv', 'r').readlines())) 32 | 33 | # Separate out the coordinates 34 | coords = [(float(d['lat']), float(d['lng'])) for d in data if len(d['lat']) > 0] 35 | types = [d['ExtNatureDisplayName'] for d in data] 36 | 37 | # Scikit-learn's implemenetation of DBSCAN requires the input of a distance matrix showing pairwise 38 | # distances between all points in the dataset. 39 | distance_matrix = distance.squareform(distance.pdist(coords)) 40 | 41 | # Run DBSCAN. Setting epsilon with lat/lon data like we have here is an inexact science. 0.03 looked 42 | # good after a few test runs. Ideally we'd project the data and set epsilon using meters or feet. 43 | db = DBSCAN(eps=EPS).fit(distance_matrix) 44 | 45 | # And now we print out the results in the form cluster_id,lat,lng. You can save this to a file and import 46 | # directly into a mapping program or Fusion Tables if you want to visualize it. 47 | for k in set(db.labels_): 48 | class_members = [index[0] for index in np.argwhere(db.labels_ == k)] 49 | for index in class_members: 50 | print '%s,%s,%s' % (int(k), types[index], '{0},{1}'.format(*coords[index])) 51 | -------------------------------------------------------------------------------- /class5_1/data/releases_training.txt: -------------------------------------------------------------------------------- 1 | FEB 12 (BEAUMONT, Texas) – A 25-year-old Port Arthur, Texas man has pleaded guilty to drug trafficking violations in the Eastern District of Texas, announced Drug Enforcement Administration Acting Special Agent in Charge Steven S. Whipple and U.S. Attorney John M. Bales today. Michael Joseph Barrett IV pleaded guilty to possession with intent to distribute methamphetamine on Feb. 11, 2014 before U.S. District Judge Marcia Crone. According to information presented in court, on Feb. 19, 2013, law enforcement officers responded to a residence on 32nd Street in Port Arthur after receiving information regarding suspected manufacture of methamphetamine at the location. Consent to search was obtained and a search of the premises revealed a small amount of cocaine, a semi-automatic pistol, and various items associated with methamphetamine manufacture, including a three liter bottle containing a methamphetamine mixture. A federal grand jury returned an indictment on Dec. 4, 2013, charging Barrett with drug trafficking violations. Barrett faces up to 20 years in federal prison at sentencing. A sentencing date has not been set. This case was investigated by the Drug Enforcement Administration, the Port Arthur Police Department and the Jefferson County Sheriff's Office Crime Lab and prosecuted by Assistant U.S. Attorney Randall L. 
Fluke.|YES 2 | FEB 05 (BROWNSVILLE, Texas ) - Stephen Whipple, Acting Special Agent in Charge of the United States Drug Enforcement Administration (DEA), Houston Division and United States Attorney Kenneth Magidson announced Jesus Mauricio Juarez Jr. aka Flaco 27, has been sentenced to federal prison for his involvement in a 1,000 pound marijuana load. He pleaded guilty in November 2013. Today, Senior U.S. District Judge Hilda G. Tagle sentenced Juarez to 31 months in federal prison. In handing down the sentence, Ruben Gonzalez-Cavazos aka Mume, also pleaded guilty in relation to the conspiracy and was sentenced to 47 months in federal prison and assessed a $15,000 fine on Feb. 3, 2014. Co-defendant Francisco Javier Maya, 35, went to trial last week in Brownsville and was convicted on all counts. He will be sentenced on May 13, 2014. Adolfo Lozano-Luna aka Chefero, 35, and Alberto Martinez aka El Diablo, 50, also pleaded guilty and will be sentenced at a later date. Evidence at Maya’s trial placed all five men in a conspiracy involving a 1,000 pound marijuana load, which was forcibly hijacked from them by unknown individuals on Dec. 11, 2012. One month later, Juarez was injured after an improvised explosive device (IED) detonated at his residence in Brownsville. In sentencing Juarez today, Judge Tagle discussed the bombing incident and noted that at least he and his family still have their lives. Evidence also linked Juarez, Gonzalez-Cavazos and Maya to other marijuana loads during the conspiracy. Maya’s role in the drug trafficking organization was to provide drivers for tractor trailers to drive marijuana loads to locations to include Houston and Taylor. Maya, Juarez and Gonzalez-Cavazos would share in the profits of each successful marijuana load. At the direction of Juarez, Maya provided bank account numbers associated with him and Gonzalez-Cavazos to Juarez in order to deposit drug profits. Juarez then made deposits stemming from narcotics proceeds from a successful marijuana load delivered to Taylor in November 2012. Evidence was presented at Maya’s trial that a $6,000 deposit was made into an account associated with Maya on Nov. 28, 2012, while another $6,500 was deposited into an account associated with Gonzalez-Cavazos on the same day. The jury last week also heard that Maya was a follower of the Santeria religion. The jury saw photos of Maya’s residence in Mission, Texas, which depicted numerous images of what was considered to be altars showing glasses of alcohol, knives, a machete, kettles, feathers and substances that appeared to be blood. Testimony also included descriptions of two rituals involving the sacrifice of animals. In December 2012, Maya had a Santeria priest, known as a “Padrino,” perform rituals with the organization to “bless” a 1,000 pound marijuana load that was destined for Houston. After meeting with the Padrino, Maya, Gonzalez-Cavazos and Juarez decided the marijuana load should remain in the Rio Grande Valley. The next day, a second ritual, attended by all five defendants, was performed and the 1,000 pounds of marijuana was to be transported to Houston. However, the marijuana was stolen from the group by unknown individuals that evening. After the theft and subsequent IED detonation, law enforcement was able to piece together the events and conspirators involved in this drug trafficking organization. 
The case was investigated by the Drug Enforcement Administration, FBI, Homeland Security Investigations, Bureau of Alcohol, Tobacco, Firearms and Explosives and the Brownsville Police Department. The case was prosecuted by Assistant United States Attorneys Angel Castro and Jody Young.|YES 3 | JAN 08 (HOUSTON) - Javier F. Peña, Special Agent in Charge of the United States Drug Enforcement Administration (DEA), Houston Division and Kenneth Magidson, United States Attorney, Southern District of Texas announced Oscar Nava-Valencia, 42, of Guadalajara, Mexico, has received a 25-year sentence for his role in the smuggling of a 3,100 kilogram load of cocaine from Panama. Nava-Valencia previously pleaded guilty and was sentenced late yesterday afternoon in federal court in Houston. U.S. District Judge Ewing Werlein Jr. sentenced Nava-Valencia to a term of 300 months in federal prison and further ordered him to pay a $5,000 fine. In March 2006, Panamanian authorities seized approximately 2,080 kilograms of cocaine from a warehouse in Panama City, Panama. The seized cocaine was part of a larger load totaling approximately 3,100 kilograms which was to be shipped from Panama to Mexico and eventually destined for the United States. Nava-Valencia, along with other associates, was to take possession of approximately 1,250 kilograms of cocaine once it arrived in Mexico. In January of 2010, Nava-Valencia was apprehended by Mexican authorities and extradited to the United States in January 2011. He has been and will remain in custody pending transfer to a U.S. Bureau of Prisons facility to be determined in the near future. The investigation leading to the charges was conducted by the Drug Enforcement Administration. Assistant United States Attorneys James Sturgis prosecuted the case.|YES 4 | JAN 21 (MONTGOMERY, Ala.) – The Drug Enforcement Administration awarded Assistant U. S. Attorneys Verne Speirs and Gray Borden the Spartan Award, announced George L. Beck, Jr., United States Attorney Middle District of Alabama. The Spartan award recognizes prosecutors for their dedication and extraordinary effort to investigate and prosecute large-scale drug dealers and money launderers. This year’s award is presented to Assistant U.S. Attorneys Speirs and Borden because of long hours invested and success obtained in combating the ever-growing scourge of drug dealing in the Middle District of Alabama. DEA chose Speirs and Borden for this award after examining the work of all federal prosecutors in the State of Alabama. “The DEA in Alabama was pleased to present the 2013 Spartan Award for Excellence in Drug Investigations to AUSA’s Speirs and Borden,” stated Clay Morris, Assistant Special Agent in Charge of DEA in Alabama. “The award was named after the Spartan Warrior Society. AUSA’s Speirs and Borden were selected by DEA management to receive the award because they exhibited many traits of a Spartan Warrior: a relentless pursuit of justice, tenacity, loyalty and dedication. Throughout 2013, AUSA’s Speirs and Borden tirelessly worked alongside our agents and task force officers in many long term complex investigations. Because of the dedication of AUSA’s Speirs and Borden, many drug trafficking organizations were completely dismantled and dangerous criminals were removed from the streets of our communities. I cannot say enough about the outstanding efforts of AUSA’s Speirs and Borden and the entire staff of the Unites States Attorney’s Office. 
One thing is certain, as long as AUSA’s Speirs and Borden are prosecuting drug trafficking organizations, those who target and sell poison to our children should be very afraid.” “I am very pleased that the extraordinary success of AUSAs Speirs and Borden are receiving the recognition they truly deserve,” stated U.S. Attorney George Beck, “They have worked tirelessly to prosecute these criminals. I believe it is essential that these types of crimes be vigorously prosecuted and that we continue to combat the drug problem facing this district and this nation.” “I am truly humbled to receive this award, but the real credit goes to the DEA Agents and Task Force Officers who risk everything to combat drug traffickers across this country,” stated Verne Speirs, Assistant U.S. Attorney. “The safety of our families and communities depend upon their selfless service.” “I consider this award to be one of the great achievements in my career in the U.S. Attorney’s Office, but the credit goes to our dedicated and professional staff and the DEA’s stable of tireless agents,” stated Gray Borden, Assistant U.S. Attorney. “I am proud to be associated with a team of this caliber.”|NO 5 | JAN 30 (SAN JUAN, Puerto Rico) – Yesterday, January 29, U.S. Magistrate Judge Marcos E. López authorized a complaint charging: Joselito Taveras, Miguel Jimenez, and Alberto Dominguez with conspiracy to possess and possession with intent to distribute controlled substances, and conspiracy to import and importation of controlled substances, announced Rosa Emilia Rodríguez-Vélez, United States Attorney for the District of Puerto Rico. The crew of the Coast Guard Cutter Farallon offloaded 136 kilograms (300 pounds) of cocaine Monday night, 60 nautical miles northwest of Aguadilla, Puerto Rico and transferred the custody of the defendants to Drug Enforcement Administration (DEA) special agents and Customs and Border Protection officers Wednesday at Coast Guard San Juan, Puerto Rico. The interdiction was a result of U.S. Coast Guard, Customs Border Protection, Drug Enforcement Administration and Dominican Republic Navy coordinated efforts in support of Operation Unified Resolve, Operation Caribbean Guard, and the Caribbean Corridor Strike Force (CCSF), to interdict the illegal drug shipment consisting of nine bales of cocaine with an estimated street value of approximately $3.5 million dollars.|YES 6 | JAN 10 (WASHINGTON) – The U.S. Department of Justice and the U.S. Department of Commerce's National Institute of Standards and Technology (NIST) today announced appointments to a newly created National Commission on Forensic Science. Members of the commission will work to improve the practice of forensic science by developing guidance concerning the intersections between forensic science and the criminal justice system. The commission also will work to develop policy recommendations for the U.S. Attorney General, including uniform codes for professional responsibility and requirements for formal training and certification. The commission is co-chaired by Deputy Attorney General James M. Cole and Under Secretary of Commerce for Standards and Technology and NIST Director Patrick D. Gallagher. Nelson Santos, Deputy Assistant Administrator for the Office of Forensic Sciences at the Drug Enforcement Administration, and John M. Butler, Special Assistant to the NIST director for forensic science, serve as vice-chairs. 
"I appreciate the commitment each of the commissioners has made and look forward to working with them to strengthen the validity and reliability of the forensic sciences and enhance quality assurance and quality control," said Deputy Attorney General Cole. "Scientifically valid and accurate forensic analysis supports all aspects of our justice system."|NO 7 | -------------------------------------------------------------------------------- /class5_1/release_classifier.py: -------------------------------------------------------------------------------- 1 | from sklearn import preprocessing 2 | from sklearn.tree import DecisionTreeClassifier 3 | from sklearn.naive_bayes import MultinomialNB 4 | from sklearn.feature_extraction.text import CountVectorizer 5 | 6 | if __name__ == '__main__': 7 | 8 | ########## STEP 1: DATA IMPORT AND PREPROCESSING ########## 9 | 10 | training = [line.strip().split('|') for line in open('data/releases_training.txt', 'r').readlines()] 11 | text = [t[0] for t in training if len(t) > 1] 12 | labels = [t[1] for t in training if len(t) > 1] 13 | 14 | encoder = preprocessing.LabelEncoder() 15 | correct_labels = encoder.fit_transform(labels) 16 | 17 | ########## FEATURE EXTRACTION ########## 18 | 19 | # VECTORIZE YOUR DATA HERE 20 | 21 | ########## MODEL BUILDING ########## 22 | 23 | # TRAIN YOUR MODEL HERE 24 | 25 | ########## STEP 5: APPLYING THE MODEL ########## 26 | docs_new = ["Five Columbia Residents among 10 Defendants Indicted for Conspiracy to Distribute a Ton of Marijuana", 27 | ] 28 | 29 | # EVALUATE THE DOCUMENT HERE -------------------------------------------------------------------------------- /class5_1/vectorization.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 4, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "from sklearn.feature_extraction.text import CountVectorizer" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "## Basic vectorization\n", 19 | "\n", 20 | "Vectorizing text is a fundamental concept in applying both supervised and unsupervised learning to documents. 
Basically, you can think of it as turning the words in a given text document into features, represented by a matrix.\n", 21 | "\n", 22 | "Rather than explicitly defining our features, as we did for the donor classification problem, we can instead take advantage of tools, called vectorizers, that turn each word into a feature best described as \"The number of times Word X appears in this document\".\n", 23 | "\n", 24 | "Here's an example with one bill title:" 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": 5, 30 | "metadata": { 31 | "collapsed": true 32 | }, 33 | "outputs": [], 34 | "source": [ 35 | "bill_titles = ['An act to amend Section 44277 of the Education Code, relating to teachers.']" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": 7, 41 | "metadata": { 42 | "collapsed": false, 43 | "scrolled": false 44 | }, 45 | "outputs": [], 46 | "source": [ 47 | "vectorizer = CountVectorizer()\n", 48 | "features = vectorizer.fit_transform(bill_titles).toarray()" 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": 8, 54 | "metadata": { 55 | "collapsed": false, 56 | "scrolled": true 57 | }, 58 | "outputs": [ 59 | { 60 | "name": "stdout", 61 | "output_type": "stream", 62 | "text": [ 63 | "[[1 1 1 1 1 1 1 1 1 1 1 2]]\n", 64 | "[u'44277', u'act', u'amend', u'an', u'code', u'education', u'of', u'relating', u'section', u'teachers', u'the', u'to']\n" 65 | ] 66 | } 67 | ], 68 | "source": [ 69 | "print features\n", 70 | "print vectorizer.get_feature_names()" 71 | ] 72 | }, 73 | { 74 | "cell_type": "markdown", 75 | "metadata": {}, 76 | "source": [ 77 | "Think of this vector as a matrix with one row and 12 columns. The row corresponds to our document above. The columns each correspond to a word contained in that document (the first is \"44277\", the second is \"act\", etc.) The numbers correspond to the number of times each word appears in that document. You'll see that all words appear once, except the last one, \"to\", which appears twice.\n", 78 | "\n", 79 | "Now what happens if we add another bill and run it again?" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": 11, 85 | "metadata": { 86 | "collapsed": false 87 | }, 88 | "outputs": [ 89 | { 90 | "name": "stdout", 91 | "output_type": "stream", 92 | "text": [ 93 | "[[1 1 1 1 0 1 0 1 0 1 1 0 1 1 1 2]\n", 94 | " [0 1 0 1 1 0 1 0 1 0 0 1 0 0 0 1]]\n", 95 | "[u'44277', u'act', u'amend', u'an', u'care', u'code', u'coverage', u'education', u'health', u'of', u'relating', u'relative', u'section', u'teachers', u'the', u'to']\n" 96 | ] 97 | } 98 | ], 99 | "source": [ 100 | "bill_titles = ['An act to amend Section 44277 of the Education Code, relating to teachers.',\n", 101 | " 'An act relative to health care coverage']\n", 102 | "features = vectorizer.fit_transform(bill_titles).toarray()\n", 103 | "\n", 104 | "print features\n", 105 | "print vectorizer.get_feature_names()" 106 | ] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "metadata": {}, 111 | "source": [ 112 | "Now we've got two rows, each corresponding to a document. The columns correspond to all words contained in BOTH documents, with counts. For example, the first entry from the first column, \"44277', appears once in the first document but zero times in the second. This, basically, is the concept of vectorization." 
113 | ] 114 | }, 115 | { 116 | "cell_type": "markdown", 117 | "metadata": {}, 118 | "source": [ 119 | "## Cleaning up our vectors\n", 120 | "\n", 121 | "As you might imagine, a document set with a relatively large vocabulary can result in vectors that are thousands and thousands of dimensions wide. This isn't necessarily bad, but in the interest of keeping our feature space as low-dimensional as possible, there are a few things we can do to clean them up.\n", 122 | "\n", 123 | "First is removing so-called \"stop words\" -- words like \"and\", \"or\", \"the', etc. that appear in almost every document and therefore aren't especially useful. Scikit-learn's vectorizer objects make this easy:" 124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": 12, 129 | "metadata": { 130 | "collapsed": false 131 | }, 132 | "outputs": [ 133 | { 134 | "name": "stdout", 135 | "output_type": "stream", 136 | "text": [ 137 | "[[1 1 1 0 1 0 1 0 1 0 1 1]\n", 138 | " [0 1 0 1 0 1 0 1 0 1 0 0]]\n", 139 | "[u'44277', u'act', u'amend', u'care', u'code', u'coverage', u'education', u'health', u'relating', u'relative', u'section', u'teachers']\n" 140 | ] 141 | } 142 | ], 143 | "source": [ 144 | "new_vectorizer = CountVectorizer(stop_words='english')\n", 145 | "features = new_vectorizer.fit_transform(bill_titles).toarray()\n", 146 | "\n", 147 | "print features\n", 148 | "print new_vectorizer.get_feature_names()" 149 | ] 150 | }, 151 | { 152 | "cell_type": "markdown", 153 | "metadata": {}, 154 | "source": [ 155 | "Notice that our feature space is now a little smaller. We can use a similar trick to eliminate words that only appear a small number of times, which becomes useful when document sets get very large." 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": 13, 161 | "metadata": { 162 | "collapsed": false 163 | }, 164 | "outputs": [ 165 | { 166 | "name": "stdout", 167 | "output_type": "stream", 168 | "text": [ 169 | "[[1]\n", 170 | " [1]]\n", 171 | "[u'act']\n" 172 | ] 173 | } 174 | ], 175 | "source": [ 176 | "new_vectorizer = CountVectorizer(stop_words='english', min_df=2)\n", 177 | "features = new_vectorizer.fit_transform(bill_titles).toarray()\n", 178 | "\n", 179 | "print features\n", 180 | "print new_vectorizer.get_feature_names()" 181 | ] 182 | }, 183 | { 184 | "cell_type": "markdown", 185 | "metadata": {}, 186 | "source": [ 187 | "This is a bad example for this document set, but it will help later -- I promise. Finally, we can also create features that comprise more than one word. These are known as N-grams, with the N being the number of words contained in the feature. 
Here is how you could create a feature vector of all 1-grams and 2-grams:" 188 | ] 189 | }, 190 | { 191 | "cell_type": "code", 192 | "execution_count": 17, 193 | "metadata": { 194 | "collapsed": false, 195 | "scrolled": true 196 | }, 197 | "outputs": [ 198 | { 199 | "name": "stdout", 200 | "output_type": "stream", 201 | "text": [ 202 | "[[1 1 1 1 0 1 1 0 0 1 1 0 1 1 0 0 1 1 0 0 1 1 1]\n", 203 | " [0 0 1 0 1 0 0 1 1 0 0 1 0 0 1 1 0 0 1 1 0 0 0]]\n", 204 | "[u'44277', u'44277 education', u'act', u'act amend', u'act relative', u'amend', u'amend section', u'care', u'care coverage', u'code', u'code relating', u'coverage', u'education', u'education code', u'health', u'health care', u'relating', u'relating teachers', u'relative', u'relative health', u'section', u'section 44277', u'teachers']\n" 205 | ] 206 | } 207 | ], 208 | "source": [ 209 | "new_vectorizer = CountVectorizer(stop_words='english', ngram_range=(1,2))\n", 210 | "features = new_vectorizer.fit_transform(bill_titles).toarray()\n", 211 | "\n", 212 | "print features\n", 213 | "print new_vectorizer.get_feature_names()" 214 | ] 215 | }, 216 | { 217 | "cell_type": "markdown", 218 | "metadata": {}, 219 | "source": [ 220 | "Although the feature space gets much larger, sometimes having multi-word features can make our models more accurate.\n", 221 | "\n", 222 | "These are just a few basic tricks scikit-learn makes available for transforming your vectors (you can see other ones [here](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)). But now let's take what we've learned here and apply it to the bill classification problem." 223 | ] 224 | }, 225 | { 226 | "cell_type": "code", 227 | "execution_count": null, 228 | "metadata": { 229 | "collapsed": true 230 | }, 231 | "outputs": [], 232 | "source": [] 233 | } 234 | ], 235 | "metadata": { 236 | "kernelspec": { 237 | "display_name": "Python 2", 238 | "language": "python", 239 | "name": "python2" 240 | }, 241 | "language_info": { 242 | "codemirror_mode": { 243 | "name": "ipython", 244 | "version": 2 245 | }, 246 | "file_extension": ".py", 247 | "mimetype": "text/x-python", 248 | "name": "python", 249 | "nbconvert_exporter": "python", 250 | "pygments_lexer": "ipython2", 251 | "version": "2.7.9" 252 | } 253 | }, 254 | "nbformat": 4, 255 | "nbformat_minor": 0 256 | } 257 | -------------------------------------------------------------------------------- /class5_2/5_2-Assignment.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "test_string = \"Do you know the way to San Jose?\"" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "##Extend your functions from class\n", 19 | "###1. Add code to your tokenizer to filter for punctuation before tokenizing\n", 20 | "####This might be helpful: http://stackoverflow.com/a/266162/1808021" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": null, 26 | "metadata": { 27 | "collapsed": true 28 | }, 29 | "outputs": [], 30 | "source": [] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": {}, 35 | "source": [ 36 | "###2. 
Add code to your tokenizer to filter for stopwords\n", 37 | "###Your function should use the list of stopwords to filter the string and not return words in the stopword list\n", 38 | "###You can use the list in NLTK or create your own\n" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": null, 44 | "metadata": { 45 | "collapsed": true 46 | }, 47 | "outputs": [], 48 | "source": [] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": {}, 53 | "source": [ 54 | "###3. Add code to your tokenizer to call your tokenizer to create word tokens (if it doesn't already) and then generate the counts for each token" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": null, 60 | "metadata": { 61 | "collapsed": true 62 | }, 63 | "outputs": [], 64 | "source": [] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "metadata": {}, 69 | "source": [ 70 | "##Bonus\n", 71 | "###Write a simple function to calculate the tf-idf \n", 72 | "####Remember the following were $t$ is the term, $D$ is the document, $N$ is the total number of documents, $n_w$ is the number of documents containing each word $t$, and $i_w$ is the frequency word $t$ appears in a document\n", 73 | "\n", 74 | "$tf(t,D)=\\frac{i_w}{n_D}$\n", 75 | "\n", 76 | "$idf(t,D)=\\log(\\frac{N}{1+n_w})$\n", 77 | "\n", 78 | "$tfidf=tf\\times idf$" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": null, 84 | "metadata": { 85 | "collapsed": true 86 | }, 87 | "outputs": [], 88 | "source": [] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "metadata": {}, 93 | "source": [ 94 | "##k-NN on Iris\n", 95 | "###4. Using the Iris dataset, test the kNN for various levels of k to see if you can build a better classifier than our decision tree in 3_2" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": null, 101 | "metadata": { 102 | "collapsed": true 103 | }, 104 | "outputs": [], 105 | "source": [] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [ 111 | "##k-Means with Congressional Bills\n", 112 | "###5. Explore the clusters of Congressional Records. Select another subset and investigate the contents. Write code that investigates a different cluster." 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": null, 118 | "metadata": { 119 | "collapsed": true 120 | }, 121 | "outputs": [], 122 | "source": [] 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "metadata": {}, 127 | "source": [ 128 | "###6. 
On the class Tumblr, provide a response to the lesson on k-Means, specifically whether you think this is a useful technique for working journalists (data or otherwise)" 129 | ] 130 | } 131 | ], 132 | "metadata": { 133 | "kernelspec": { 134 | "display_name": "Python 2", 135 | "language": "python", 136 | "name": "python2" 137 | }, 138 | "language_info": { 139 | "codemirror_mode": { 140 | "name": "ipython", 141 | "version": 2 142 | }, 143 | "file_extension": ".py", 144 | "mimetype": "text/x-python", 145 | "name": "python", 146 | "nbconvert_exporter": "python", 147 | "pygments_lexer": "ipython2", 148 | "version": "2.7.10" 149 | } 150 | }, 151 | "nbformat": 4, 152 | "nbformat_minor": 0 153 | } 154 | -------------------------------------------------------------------------------- /class5_2/5_2-DoNow.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "##Let's check your knowledge of the material we've covered" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "###Code your own tokenizer\n", 15 | "####Write a simple tokenizer function to take in a string, tokenize by individual words" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": null, 21 | "metadata": { 22 | "collapsed": true 23 | }, 24 | "outputs": [], 25 | "source": [] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": {}, 30 | "source": [ 31 | "###Create your own vectorizer\n", 32 | "####Write code to output the list of tokens and the count for each token" 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": null, 38 | "metadata": { 39 | "collapsed": true 40 | }, 41 | "outputs": [], 42 | "source": [] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": null, 47 | "metadata": { 48 | "collapsed": true 49 | }, 50 | "outputs": [], 51 | "source": [] 52 | } 53 | ], 54 | "metadata": { 55 | "kernelspec": { 56 | "display_name": "Python 2", 57 | "language": "python", 58 | "name": "python2" 59 | }, 60 | "language_info": { 61 | "codemirror_mode": { 62 | "name": "ipython", 63 | "version": 2 64 | }, 65 | "file_extension": ".py", 66 | "mimetype": "text/x-python", 67 | "name": "python", 68 | "nbconvert_exporter": "python", 69 | "pygments_lexer": "ipython2", 70 | "version": "2.7.10" 71 | } 72 | }, 73 | "nbformat": 4, 74 | "nbformat_minor": 0 75 | } 76 | -------------------------------------------------------------------------------- /class5_2/kmeans.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import pandas as pd\n", 12 | "import re #a package for doing regex\n", 13 | "import glob #for accessing files on our local system" 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "metadata": {}, 19 | "source": [ 20 | "###We'll be using data from http://www.cs.cornell.edu/home/llee/data/convote.html to explore k-means clustering" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": null, 26 | "metadata": { 27 | "collapsed": false 28 | }, 29 | "outputs": [], 30 | "source": [ 31 | "!curl -O http://www.cs.cornell.edu/home/llee/data/convote/convote_v1.1.tar.gz" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": null, 37 | "metadata": { 38 | "collapsed": false 39 | }, 40 | 
"outputs": [], 41 | "source": [ 42 | "!tar -zxvf convote_v1.1.tar.gz" 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": null, 48 | "metadata": { 49 | "collapsed": true 50 | }, 51 | "outputs": [], 52 | "source": [ 53 | "paths = glob.glob(\"convote_v1.1/data_stage_one/development_set/*\")\n", 54 | "speeches = []\n", 55 | "for path in paths:\n", 56 | " speech = {}\n", 57 | " filename = path[-26:]\n", 58 | " speech['filename'] = filename\n", 59 | " speech['bill_no'] = filename[:3]\n", 60 | " speech['speaker_no'] = filename[4:10]\n", 61 | " speech['bill_vote'] = filename[-5]\n", 62 | " speech['party'] = filename[-7]\n", 63 | " \n", 64 | " # Open the file\n", 65 | " speech_file = open(path, 'r')\n", 66 | " # Read the stuff out of it\n", 67 | " speech['contents'] = speech_file.read()\n", 68 | "\n", 69 | " cleaned_contents = re.sub(r\"[^ \\w]\",'', speech['contents'])\n", 70 | " cleaned_contents = re.sub(r\" +\",' ', cleaned_contents)\n", 71 | " cleaned_contents = cleaned_contents.strip()\n", 72 | " words = cleaned_contents.split(' ')\n", 73 | " speech['word_count'] = len(words)\n", 74 | " \n", 75 | " speeches.append(speech)" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": null, 81 | "metadata": { 82 | "collapsed": false 83 | }, 84 | "outputs": [], 85 | "source": [ 86 | "speeches[:5]" 87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "execution_count": null, 92 | "metadata": { 93 | "collapsed": false 94 | }, 95 | "outputs": [], 96 | "source": [ 97 | "speeches_df = pd.DataFrame(speeches)\n", 98 | "speeches_df.head()" 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": null, 104 | "metadata": { 105 | "collapsed": false 106 | }, 107 | "outputs": [], 108 | "source": [ 109 | "speeches_df[\"word_count\"].describe()" 110 | ] 111 | }, 112 | { 113 | "cell_type": "markdown", 114 | "metadata": {}, 115 | "source": [ 116 | "###Notice that we have a lot of speeches that are relatively short. 
They probably aren't the best for clustering because of their brevity" 117 | ] 118 | }, 119 | { 120 | "cell_type": "markdown", 121 | "metadata": {}, 122 | "source": [ 123 | "###Time to bring the TF-IDF vectorizer" 124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": null, 129 | "metadata": { 130 | "collapsed": true 131 | }, 132 | "outputs": [], 133 | "source": [ 134 | "from sklearn.feature_extraction.text import TfidfVectorizer" 135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": null, 140 | "metadata": { 141 | "collapsed": true 142 | }, 143 | "outputs": [], 144 | "source": [ 145 | "vectorizer = TfidfVectorizer(max_features=10000, stop_words='english')\n", 146 | "longer_speeches = speeches_df[speeches_df[\"word_count\"] > 92] \n", 147 | "#filtering for word counts greater than 92 (our median length)\n", 148 | "X = vectorizer.fit_transform(longer_speeches['contents'])" 149 | ] 150 | }, 151 | { 152 | "cell_type": "code", 153 | "execution_count": null, 154 | "metadata": { 155 | "collapsed": true 156 | }, 157 | "outputs": [], 158 | "source": [ 159 | "from sklearn.cluster import KMeans" 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": null, 165 | "metadata": { 166 | "collapsed": false 167 | }, 168 | "outputs": [], 169 | "source": [ 170 | "number_of_clusters = 7\n", 171 | "km = KMeans(n_clusters=number_of_clusters)\n", 172 | "km.fit(X)" 173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": null, 178 | "metadata": { 179 | "collapsed": false 180 | }, 181 | "outputs": [], 182 | "source": [ 183 | "print(\"Top terms per cluster:\")\n", 184 | "order_centroids = km.cluster_centers_.argsort()[:, ::-1]\n", 185 | "terms = vectorizer.get_feature_names()\n", 186 | "for i in range(number_of_clusters):\n", 187 | " print(\"Cluster %d:\" % i),\n", 188 | " for ind in order_centroids[i, :15]:\n", 189 | " print(' %s' % terms[ind]),\n", 190 | " print ''" 191 | ] 192 | }, 193 | { 194 | "cell_type": "code", 195 | "execution_count": null, 196 | "metadata": { 197 | "collapsed": true 198 | }, 199 | "outputs": [], 200 | "source": [] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "execution_count": null, 205 | "metadata": { 206 | "collapsed": true 207 | }, 208 | "outputs": [], 209 | "source": [ 210 | "additional_stopwords = ['mr','congress','chairman','madam','amendment','legislation','speaker']" 211 | ] 212 | }, 213 | { 214 | "cell_type": "code", 215 | "execution_count": null, 216 | "metadata": { 217 | "collapsed": false 218 | }, 219 | "outputs": [], 220 | "source": [ 221 | "import nltk\n", 222 | "\n", 223 | "english_stopwords = nltk.corpus.stopwords.words('english')\n", 224 | "new_stopwords = additional_stopwords + english_stopwords" 225 | ] 226 | }, 227 | { 228 | "cell_type": "code", 229 | "execution_count": null, 230 | "metadata": { 231 | "collapsed": true 232 | }, 233 | "outputs": [], 234 | "source": [ 235 | "vectorizer = TfidfVectorizer(max_features=10000, stop_words=new_stopwords)" 236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": null, 241 | "metadata": { 242 | "collapsed": true 243 | }, 244 | "outputs": [], 245 | "source": [ 246 | "longer_speeches = speeches_df[speeches_df[\"word_count\"] > 92]\n", 247 | "X = vectorizer.fit_transform(longer_speeches['contents'])" 248 | ] 249 | }, 250 | { 251 | "cell_type": "code", 252 | "execution_count": null, 253 | "metadata": { 254 | "collapsed": false 255 | }, 256 | "outputs": [], 257 | "source": [ 258 | "number_of_clusters = 7\n", 259 | "km = 
KMeans(n_clusters=number_of_clusters)\n", 260 | "km.fit(X)" 261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "execution_count": null, 266 | "metadata": { 267 | "collapsed": false 268 | }, 269 | "outputs": [], 270 | "source": [ 271 | "print(\"Top terms per cluster:\")\n", 272 | "order_centroids = km.cluster_centers_.argsort()[:, ::-1]\n", 273 | "terms = vectorizer.get_feature_names()\n", 274 | "for i in range(number_of_clusters):\n", 275 | " print(\"Cluster %d:\" % i),\n", 276 | " for ind in order_centroids[i, :15]:\n", 277 | " print(' %s' % terms[ind]),\n", 278 | " print ''" 279 | ] 280 | }, 281 | { 282 | "cell_type": "code", 283 | "execution_count": null, 284 | "metadata": { 285 | "collapsed": false 286 | }, 287 | "outputs": [], 288 | "source": [ 289 | "longer_speeches[\"k-means label\"] = km.labels_" 290 | ] 291 | }, 292 | { 293 | "cell_type": "code", 294 | "execution_count": null, 295 | "metadata": { 296 | "collapsed": false 297 | }, 298 | "outputs": [], 299 | "source": [ 300 | "longer_speeches.head()" 301 | ] 302 | }, 303 | { 304 | "cell_type": "code", 305 | "execution_count": null, 306 | "metadata": { 307 | "collapsed": true 308 | }, 309 | "outputs": [], 310 | "source": [ 311 | "china_speeches = longer_speeches[longer_speeches[\"k-means label\"] == 1]" 312 | ] 313 | }, 314 | { 315 | "cell_type": "code", 316 | "execution_count": null, 317 | "metadata": { 318 | "collapsed": false 319 | }, 320 | "outputs": [], 321 | "source": [ 322 | "china_speeches.head()" 323 | ] 324 | }, 325 | { 326 | "cell_type": "code", 327 | "execution_count": null, 328 | "metadata": { 329 | "collapsed": false 330 | }, 331 | "outputs": [], 332 | "source": [ 333 | "vectorizer = TfidfVectorizer(max_features=10000, stop_words=new_stopwords)\n", 334 | "X = vectorizer.fit_transform(china_speeches['contents'])\n", 335 | "\n", 336 | "number_of_clusters = 5\n", 337 | "km = KMeans(n_clusters=number_of_clusters)\n", 338 | "km.fit(X)\n", 339 | "\n", 340 | "print(\"Top terms per cluster:\")\n", 341 | "order_centroids = km.cluster_centers_.argsort()[:, ::-1]\n", 342 | "terms = vectorizer.get_feature_names()\n", 343 | "for i in range(number_of_clusters):\n", 344 | " print(\"Cluster %d:\" % i),\n", 345 | " for ind in order_centroids[i, :10]:\n", 346 | " print(' %s' % terms[ind]),\n", 347 | " print ''" 348 | ] 349 | }, 350 | { 351 | "cell_type": "code", 352 | "execution_count": null, 353 | "metadata": { 354 | "collapsed": false 355 | }, 356 | "outputs": [], 357 | "source": [ 358 | "km.get_params()" 359 | ] 360 | }, 361 | { 362 | "cell_type": "code", 363 | "execution_count": null, 364 | "metadata": { 365 | "collapsed": false 366 | }, 367 | "outputs": [], 368 | "source": [ 369 | "km.score(X)" 370 | ] 371 | }, 372 | { 373 | "cell_type": "code", 374 | "execution_count": null, 375 | "metadata": { 376 | "collapsed": true 377 | }, 378 | "outputs": [], 379 | "source": [] 380 | } 381 | ], 382 | "metadata": { 383 | "kernelspec": { 384 | "display_name": "Python 2", 385 | "language": "python", 386 | "name": "python2" 387 | }, 388 | "language_info": { 389 | "codemirror_mode": { 390 | "name": "ipython", 391 | "version": 2 392 | }, 393 | "file_extension": ".py", 394 | "mimetype": "text/x-python", 395 | "name": "python", 396 | "nbconvert_exporter": "python", 397 | "pygments_lexer": "ipython2", 398 | "version": "2.7.10" 399 | } 400 | }, 401 | "nbformat": 4, 402 | "nbformat_minor": 0 403 | } 404 | -------------------------------------------------------------------------------- /class5_2/knn.ipynb: 
-------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "##Let's work with the wine dataset we worked with before, but slightly modified. This has more instances and different target features" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "####based on http://blog.yhathq.com/posts/classification-using-knn-and-python.html" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": null, 20 | "metadata": { 21 | "collapsed": true 22 | }, 23 | "outputs": [], 24 | "source": [ 25 | "import pandas as pd\n", 26 | "import matplotlib.pyplot as plt\n", 27 | "%matplotlib inline\n", 28 | "from sklearn.neighbors import KNeighborsClassifier\n", 29 | "from sklearn import cross_validation" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": null, 35 | "metadata": { 36 | "collapsed": true 37 | }, 38 | "outputs": [], 39 | "source": [ 40 | "import numpy as np" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": null, 46 | "metadata": { 47 | "collapsed": false 48 | }, 49 | "outputs": [], 50 | "source": [ 51 | "df = pd.read_csv(\"data/wine.csv\")" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": null, 57 | "metadata": { 58 | "collapsed": false 59 | }, 60 | "outputs": [], 61 | "source": [ 62 | "df.columns" 63 | ] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "metadata": {}, 68 | "source": [ 69 | "###Instead of wine cultvar, we have the wine color (red or white), as well as a binary (is red) and high quality indicator (0 or 1)" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": null, 75 | "metadata": { 76 | "collapsed": false 77 | }, 78 | "outputs": [], 79 | "source": [ 80 | "df.high_quality.unique()" 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "metadata": {}, 86 | "source": [ 87 | "###Let's set up our training and test sets" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": null, 93 | "metadata": { 94 | "collapsed": false 95 | }, 96 | "outputs": [], 97 | "source": [ 98 | "train, test = cross_validation.train_test_split(df[['density','sulphates','residual_sugar','high_quality']],train_size=0.75)" 99 | ] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "metadata": {}, 104 | "source": [ 105 | "###We'll use just three columns (dimensions) for classification" 106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "execution_count": null, 111 | "metadata": { 112 | "collapsed": false 113 | }, 114 | "outputs": [], 115 | "source": [ 116 | "train" 117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "execution_count": null, 122 | "metadata": { 123 | "collapsed": false 124 | }, 125 | "outputs": [], 126 | "source": [ 127 | "x_train = train[:,:3]\n", 128 | "y_train = train[:,3]" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": null, 134 | "metadata": { 135 | "collapsed": true 136 | }, 137 | "outputs": [], 138 | "source": [ 139 | "x_test = test[:,:3]\n", 140 | "y_test = test[:,3]" 141 | ] 142 | }, 143 | { 144 | "cell_type": "markdown", 145 | "metadata": {}, 146 | "source": [ 147 | "###Let's start with a k of 1 to predict high quality" 148 | ] 149 | }, 150 | { 151 | "cell_type": "code", 152 | "execution_count": null, 153 | "metadata": { 154 | "collapsed": false 155 | }, 156 | "outputs": [], 157 | "source": [ 158 | "clf = KNeighborsClassifier(n_neighbors=1)" 159 | ] 160 | }, 161 | { 162 | 
"cell_type": "code", 163 | "execution_count": null, 164 | "metadata": { 165 | "collapsed": false 166 | }, 167 | "outputs": [], 168 | "source": [ 169 | "clf.fit(x_train,y_train)" 170 | ] 171 | }, 172 | { 173 | "cell_type": "code", 174 | "execution_count": null, 175 | "metadata": { 176 | "collapsed": false 177 | }, 178 | "outputs": [], 179 | "source": [ 180 | "preds = clf.predict(x_test)" 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": null, 186 | "metadata": { 187 | "collapsed": true 188 | }, 189 | "outputs": [], 190 | "source": [ 191 | "accuracy = np.where(preds==y_test, 1, 0).sum() / float(len(test))" 192 | ] 193 | }, 194 | { 195 | "cell_type": "code", 196 | "execution_count": null, 197 | "metadata": { 198 | "collapsed": false 199 | }, 200 | "outputs": [], 201 | "source": [ 202 | "print \"Accuracy: %3f\" % (accuracy,)" 203 | ] 204 | }, 205 | { 206 | "cell_type": "markdown", 207 | "metadata": {}, 208 | "source": [ 209 | "###Not bad. Let's see what happens as the k changes" 210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "execution_count": null, 215 | "metadata": { 216 | "collapsed": false 217 | }, 218 | "outputs": [], 219 | "source": [ 220 | "results = []\n", 221 | "for k in range(1, 51, 2):\n", 222 | " clf = KNeighborsClassifier(n_neighbors=k)\n", 223 | " clf.fit(x_train,y_train)\n", 224 | " preds = clf.predict(x_test)\n", 225 | " accuracy = np.where(preds==y_test, 1, 0).sum() / float(len(test))\n", 226 | " print \"Neighbors: %d, Accuracy: %3f\" % (k, accuracy)\n", 227 | "\n", 228 | " results.append([k, accuracy])\n", 229 | "\n", 230 | "results = pd.DataFrame(results, columns=[\"k\", \"accuracy\"])\n", 231 | "\n", 232 | "plt.plot(results.k, results.accuracy)\n", 233 | "plt.title(\"Accuracy with Increasing K\")\n", 234 | "plt.show()" 235 | ] 236 | }, 237 | { 238 | "cell_type": "markdown", 239 | "metadata": {}, 240 | "source": [ 241 | "###Looks like about 80% is the best we can do. 
The way it plateaus, suggests there's not much more to be gained by increasing k" 242 | ] 243 | }, 244 | { 245 | "cell_type": "markdown", 246 | "metadata": {}, 247 | "source": [ 248 | "###We can also tune this a bit by not weighting each instance the same, but decreasing the weight as the distance increases" 249 | ] 250 | }, 251 | { 252 | "cell_type": "code", 253 | "execution_count": null, 254 | "metadata": { 255 | "collapsed": false 256 | }, 257 | "outputs": [], 258 | "source": [ 259 | "results = []\n", 260 | "for k in range(1, 51, 2):\n", 261 | " clf = KNeighborsClassifier(n_neighbors=k,weights='distance')\n", 262 | " clf.fit(x_train,y_train)\n", 263 | " preds = clf.predict(x_test)\n", 264 | " accuracy = np.where(preds==y_test, 1, 0).sum() / float(len(test))\n", 265 | " print \"Neighbors: %d, Accuracy: %3f\" % (k, accuracy)\n", 266 | "\n", 267 | " results.append([k, accuracy])\n", 268 | "\n", 269 | "results = pd.DataFrame(results, columns=[\"k\", \"accuracy\"])\n", 270 | "\n", 271 | "plt.plot(results.k, results.accuracy)\n", 272 | "plt.title(\"Accuracy with Increasing K\")\n", 273 | "plt.show()" 274 | ] 275 | }, 276 | { 277 | "cell_type": "markdown", 278 | "metadata": {}, 279 | "source": [ 280 | "###This actually increases the accuracy of our prediction" 281 | ] 282 | } 283 | ], 284 | "metadata": { 285 | "kernelspec": { 286 | "display_name": "Python 2", 287 | "language": "python", 288 | "name": "python2" 289 | }, 290 | "language_info": { 291 | "codemirror_mode": { 292 | "name": "ipython", 293 | "version": 2 294 | }, 295 | "file_extension": ".py", 296 | "mimetype": "text/x-python", 297 | "name": "python", 298 | "nbconvert_exporter": "python", 299 | "pygments_lexer": "ipython2", 300 | "version": "2.7.10" 301 | } 302 | }, 303 | "nbformat": 4, 304 | "nbformat_minor": 0 305 | } 306 | -------------------------------------------------------------------------------- /class6_1/README.md: -------------------------------------------------------------------------------- 1 | # Algorithms: Week 6, Class 1 (Tuesday, Aug. 18) 2 | 3 | After a quick review of the homework, we'll explore in-depth several methods for clustering crime data before we return to and expand on the document clustering problem from last Thursday. 4 | 5 | ## Hour 1: Exercise review 6 | 7 | We'll talk in detail through the bill classification problem and the ambiguities inherent in clustering data, via the crime example we talked about briefly last Tuesday. 8 | 9 | ## Hour 2: Clustering crime 10 | 11 | We'll look at the two methods you learned last week -- k-means clustering and k-nearest neighbors -- along with another one, known as DBSCAN, to see how different methods can produce different results when we apply them to crime data. 12 | 13 | ## Hour 3: Back to document clustering 14 | 15 | Finally we'll return to the idea of document clustering that you started exploring last week, going more into depth on the ideas of document similarity and term frequency-inverse document frequency and showing how clustering can more quickly help us explore a new document set. 
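For reference ahead of class, here is a minimal sketch of what a DBSCAN call looks like in scikit-learn. The points are made-up 2D coordinates (not the crime data in `data/columbia_crime.csv`), and the `eps`/`min_samples` values are purely illustrative; the point of the sketch is the contrast with k-means -- you don't choose the number of clusters up front, and points that don't sit in any dense region come back labeled -1 (noise).

```python
# A minimal DBSCAN sketch on made-up points (illustrative values only).
import numpy as np
from sklearn.cluster import DBSCAN

points = np.array([
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],   # a tight group
    [5.0, 5.0], [5.1, 5.2], [4.9, 5.1],   # another tight group
    [10.0, 0.0],                          # an isolated point
])

db = DBSCAN(eps=0.5, min_samples=2).fit(points)

# Two clusters (0 and 1) plus one noise point labeled -1 --
# no "number of clusters" parameter anywhere.
print(db.labels_)
```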
-------------------------------------------------------------------------------- /class6_2/AssociationRuleMining.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "##A simple example of Association Rule Mining based on http://orange.biolab.si/docs/latest/reference/rst/Orange.associate.html" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import Orange #pip install orange\n", 19 | "data = Orange.data.Table(\"market-basket.basket\")" 20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": null, 25 | "metadata": { 26 | "collapsed": false 27 | }, 28 | "outputs": [], 29 | "source": [ 30 | "for d in data:\n", 31 | " print d" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": null, 37 | "metadata": { 38 | "collapsed": false 39 | }, 40 | "outputs": [], 41 | "source": [ 42 | "rules = Orange.associate.AssociationRulesSparseInducer(data, support=0.3)\n", 43 | "print \"%4s %4s %s\" % (\"Supp\", \"Conf\", \"Rule\")\n", 44 | "for r in rules[:5]:\n", 45 | " print \"%4.1f %4.1f %s\" % (r.support, r.confidence, r)" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "###Spanish Inquisition example" 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": null, 58 | "metadata": { 59 | "collapsed": true 60 | }, 61 | "outputs": [], 62 | "source": [ 63 | "data = Orange.data.Table(\"inquisition.basket\")" 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": null, 69 | "metadata": { 70 | "collapsed": false 71 | }, 72 | "outputs": [], 73 | "source": [ 74 | "for d in data:\n", 75 | " print d" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": null, 81 | "metadata": { 82 | "collapsed": false 83 | }, 84 | "outputs": [], 85 | "source": [ 86 | "rules = Orange.associate.AssociationRulesSparseInducer(data, support = 0.5)\n", 87 | "\n", 88 | "print \"%5s %5s\" % (\"supp\", \"conf\")\n", 89 | "for r in rules:\n", 90 | " print \"%5.3f %5.3f %s\" % (r.support, r.confidence, r)" 91 | ] 92 | } 93 | ], 94 | "metadata": { 95 | "kernelspec": { 96 | "display_name": "Python 2", 97 | "language": "python", 98 | "name": "python2" 99 | }, 100 | "language_info": { 101 | "codemirror_mode": { 102 | "name": "ipython", 103 | "version": 2 104 | }, 105 | "file_extension": ".py", 106 | "mimetype": "text/x-python", 107 | "name": "python", 108 | "nbconvert_exporter": "python", 109 | "pygments_lexer": "ipython2", 110 | "version": "2.7.10" 111 | } 112 | }, 113 | "nbformat": 4, 114 | "nbformat_minor": 0 115 | } 116 | -------------------------------------------------------------------------------- /class6_2/RandomForest.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "##Based on example from http://blog.yhathq.com/posts/random-forests-in-python.html, with modifications from https://gist.github.com/glamp/5717321" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "from sklearn.datasets import load_iris\n", 19 | "from sklearn.ensemble import RandomForestClassifier\n", 20 | "import pandas as pd\n", 21 | "import numpy as 
np" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": null, 27 | "metadata": { 28 | "collapsed": true 29 | }, 30 | "outputs": [], 31 | "source": [ 32 | "iris = load_iris()\n", 33 | "df = pd.DataFrame(iris.data, columns=iris.feature_names)" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": null, 39 | "metadata": { 40 | "collapsed": true 41 | }, 42 | "outputs": [], 43 | "source": [ 44 | "df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": null, 50 | "metadata": { 51 | "collapsed": false 52 | }, 53 | "outputs": [], 54 | "source": [ 55 | "df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": null, 61 | "metadata": { 62 | "collapsed": false 63 | }, 64 | "outputs": [], 65 | "source": [ 66 | "df.head()" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": null, 72 | "metadata": { 73 | "collapsed": true 74 | }, 75 | "outputs": [], 76 | "source": [ 77 | "train, test = df[df['is_train']==True], df[df['is_train']==False]" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": null, 83 | "metadata": { 84 | "collapsed": false 85 | }, 86 | "outputs": [], 87 | "source": [ 88 | "features = df.columns[:4]\n", 89 | "clf = RandomForestClassifier(n_jobs=2)\n", 90 | "y, _ = pd.factorize(train['species'])\n", 91 | "clf.fit(train[features], y)" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": null, 97 | "metadata": { 98 | "collapsed": false 99 | }, 100 | "outputs": [], 101 | "source": [ 102 | "preds = iris.target_names[clf.predict(test[features])]\n", 103 | "pd.crosstab(test['species'], preds, rownames=['actual'], colnames=['preds'])" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": null, 109 | "metadata": { 110 | "collapsed": true 111 | }, 112 | "outputs": [], 113 | "source": [] 114 | } 115 | ], 116 | "metadata": { 117 | "kernelspec": { 118 | "display_name": "Python 2", 119 | "language": "python", 120 | "name": "python2" 121 | }, 122 | "language_info": { 123 | "codemirror_mode": { 124 | "name": "ipython", 125 | "version": 2 126 | }, 127 | "file_extension": ".py", 128 | "mimetype": "text/x-python", 129 | "name": "python", 130 | "nbconvert_exporter": "python", 131 | "pygments_lexer": "ipython2", 132 | "version": "2.7.10" 133 | } 134 | }, 135 | "nbformat": 4, 136 | "nbformat_minor": 0 137 | } 138 | -------------------------------------------------------------------------------- /class7_1/README.md: -------------------------------------------------------------------------------- 1 | # Algorithms: Week 7, Class 1 (Tuesday, Aug. 25) 2 | 3 | After a quick look back at last Thursday's material, we'll spend some time looking over [examples](https://github.com/datapolitan/lede_algorithms/blob/master/class1_1/newsroom_examples.md) of algorithms and journalism from earlier in the course and talk about how to build on the skills you've learned here going forward. I'm counting on wrapping up early so we can talk about final projects. 4 | 5 | ## Resources for later 6 | 7 | - IRE/NICAR: I've said this a hundred times, but [sign up](http://www.ire.org/membership/). Use the student rate if you'd like. And if someone there balks, tell me and I'll talk to them. 
8 | 9 | - MORE TK -------------------------------------------------------------------------------- /class7_1/bill_classifier.py: -------------------------------------------------------------------------------- 1 | from sklearn import preprocessing 2 | from sklearn import cross_validation 3 | from sklearn.tree import DecisionTreeClassifier 4 | from sklearn.ensemble import RandomForestClassifier 5 | from sklearn.naive_bayes import MultinomialNB 6 | from sklearn.feature_extraction.text import CountVectorizer 7 | 8 | if __name__ == '__main__': 9 | 10 | ########## STEP 1: DATA IMPORT AND PREPROCESSING ########## 11 | 12 | # Here we're taking in the training data and splitting it into two lists: One with the text of 13 | # each bill title, and the second with each bill title's corresponding category. Order is important. 14 | # The first bill in list 1 should also be the first category in list 2. 15 | training = [line.strip().split('|') for line in open('data/bills_training.txt', 'r').readlines()] 16 | text = [t[0] for t in training if len(t) > 1] 17 | labels = [t[1] for t in training if len(t) > 1] 18 | 19 | # A little bit of cleanup for scikit-learn's benefit. Scikit-learn models wants our categories to 20 | # be numbers, not strings. The LabelEncoder performs this transformation. 21 | encoder = preprocessing.LabelEncoder() 22 | correct_labels = encoder.fit_transform(labels) 23 | 24 | ########## STEP 2: FEATURE EXTRACTION ########## 25 | print 'Extracting features ...' 26 | 27 | vectorizer = CountVectorizer(stop_words='english') 28 | data = vectorizer.fit_transform(text) 29 | 30 | ########## STEP 3: MODEL BUILDING ########## 31 | print 'Training ...' 32 | 33 | #model = MultinomialNB() 34 | model = RandomForestClassifier() 35 | fit_model = model.fit(data, correct_labels) 36 | 37 | # ########## STEP 4: EVALUATION ########## 38 | print 'Evaluating ...' 39 | 40 | # Evaluate our model with 10-fold cross-validation 41 | scores = cross_validation.cross_val_score(model, data, correct_labels, cv=10) 42 | print "Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2) 43 | 44 | # ########## STEP 5: APPLYING THE MODEL ########## 45 | # print 'Classifying ...' 46 | 47 | # docs_new = ["Public postsecondary education: executive officer compensation.", 48 | # "An act to add Section 236.3 to the Education code, related to the pricing of college textbooks.", 49 | # "Political Reform Act of 1974: campaign disclosures.", 50 | # "An act to add Section 236.3 to the Penal Code, relating to human trafficking." 
51 | # ] 52 | 53 | # test_data = vectorizer.transform(docs_new) 54 | 55 | # for i in xrange(len(docs_new)): 56 | # print '%s -> %s' % (docs_new[i], encoder.classes_[model.predict(test_data.toarray()[i])]) -------------------------------------------------------------------------------- /data_journalism_on_github.md: -------------------------------------------------------------------------------- 1 | #Data Journalists on Github 2 | 3 | ##Organizations 4 | + New York Times The Upshot: https://github.com/TheUpshot 5 | + New York Times Newsroom Developers: https://github.com/newsdev 6 | + FiveThirtyEight.com: https://github.com/fivethirtyeight 7 | + Al Jazeera America (at least until April): https://github.com/ajam 8 | + Chicago Tribune News Apps: https://github.com/newsapps 9 | + Northwestern University Knight Lab: https://github.com/NUKnightLab 10 | + ProPublica: https://github.com/propublica 11 | + Sunlight Labs: https://github.com/sunlightlabs 12 | + NPR Visuals Team: https://github.com/nprapps 13 | + NPR Tech: https://github.com/npr 14 | + The Guardian: https://github.com/guardian 15 | + Vox Media: https://github.com/voxmedia 16 | + Time Magazine: https://github.com/TimeMagazine 17 | + Los Angeles Times Data Desk: https://github.com/datadesk 18 | + BuzzFeed News: https://github.com/BuzzFeedNews 19 | + [Huffington Post Data](http://data.huffingtonpost.com/): https://github.com/huffpostdata 20 | 21 | 22 | ##Tools 23 | + Wireservice: https://github.com/wireservice 24 | + [Open Civic Data](http://opencivicdata.org/): https://github.com/opencivicdata 25 | + [TabulaPDF](http://tabula.technology/): https://github.com/tabulapdf 26 | + [Public Media Platform](http://publicmediaplatform.org/): https://github.com/publicmediaplatform 27 | + [CensusReporter](http://censusreporter.org/): https://github.com/censusreporter 28 | + Mozilla Foundation: https://github.com/mozilla 29 | 30 | ##People 31 | + Michael Keller: https://github.com/mhkeller 32 | + Joanna S. Kao: https://github.com/joannaskao 33 | + Kevin Quealy: https://github.com/kpq 34 | + Joe Germuska: https://github.com/JoeGermuska 35 | 36 | ##Github's infrequently updated [list of open journalism projects](https://github.com/showcases/open-journalism) -------------------------------------------------------------------------------- /readme.md: -------------------------------------------------------------------------------- 1 | # Algorithms, Summer 2015 2 | ## LEDE Program, Columbia University, Graduate School of Journalism 3 | 4 | 5 | ### Instructors: 6 | 7 | Richard Dunks: richard [at] datapolitan [dot] com 8 | 9 | Chase Davis: chase.davis [at] nytimes [dot] com 10 | 11 | 12 | #### Room Number: Pulitzer Hall 601B 13 | 14 | #### Course Dates: 14 July - 27 August 2015 15 | 16 | ### Course Overview 17 | 18 | This course presents an overview of algorithms as they relate to journalistic tradecraft, with particular emphasis on algorithms that relate to the discovery, cleaning, and analysis of data. This course intends to provide literacy in the common types of data algorithms, while providing practice in the design, development, and testing of algorithms to support news reporting and analysis, including the basic concepts of algorithm reverse engineering in support of investigative news reporting. The emphasis in this class will be on practical applications and critical awareness of the impact algorithms have in modern life. 
19 | 20 | 21 | ### Learning Objectives 22 | 23 | + You will understand the basic structure and operation of algorithms 24 | + You will understand the primary types of data science algorithms, including techniques of supervised and unsupervised machine learning 25 | + You will be practiced in implementing basic algorithms in Python 26 | + You will be able to meaningfully explain and critique the use and operation of algorithms as tools of public policy and business 27 | + You will understand how algorithms are applied in the newsroom 28 | 29 | ### Course Requirements 30 | All students will be expected to have a laptop during both lectures and lab time. Time will be set aside to help install, configure, and run the programs necessary for all assignments, projects, and exercises. Where possible, all programs will be free and open-source. All assigned work using services hosted online can be run using free accounts. 31 | 32 | ### Course Readings 33 | The required readings for this course consist of book chapters, newspaper articles, and short blog posts. The intention is to help give you a foundation in the critical skills ahead of class lectures. All required readings are available online or will be made available to you electronically. Recommended readings are suggestions if you wish to study further the topics covered in class. Suggested readings will also be provided as appropriate for those interested in a more in-depth discussion of the material covered in class. 34 | 35 | ### Assignments 36 | This course consists of programming and critical response assignments intended to reinforce learning and provide you with pratical applications of the material covered in class. Completion of these assignments is critical to achieving the outcomes of this course. Assignments are intended to be completed during lab time or for homework. Generally, assignments will be due the following week, unless otherwise stated. For example, exercises assigned on Tuesday will be due before class on the following Tuesday. 37 | + Programming assignments will be submitted via Slack to the TAs in Python scripts (not ipynb) format. The exercises should be standalone for each assignment, not a combination of all assignments. This allows them to be tested and scored separately. 38 | + Response questions should be [submitted using this address](http://ledealgorithms.tumblr.com/submit) and will be posted to the [class Tumblr](http://ledealgorithms.tumblr.com/) after grading. They should be clear, concise, and use the elements of good grammar. This is an opportunity to develop your ability to explain algorithms to your audience. 39 | 40 | ### Class Format 41 | Class runs from 10am to 1pm Tuesday and Thursday. Lab time will be from 2pm to 5pm Tuesday and Thursday. The class will be taught in roughly 50 minute blocks, with approximately 10 minute breaks between each 50 minute block. Class will be a mix of lecture and practical exercise work, emphasizing the application of skills covered in the lecture portion of the class. Lab time is intended for the completion of exercises, but may also include guided learning sessions as necessary to ensure comprehension of the course material. 42 | 43 | ### Course Policies 44 | + Attendance and Tardiness: We expect you to attend every class, arriving on time and staying for the entire duration of class. Absences will only be excused for circumstances coordinated in advance and you are responsible for making up any missed work. 
45 | + Participation: We expect you to be fully engaged while you’re in class. This means asking questions when necessary, engaging in class discussions, participating in class exercises, and completing all assigned work. Learning will occur in this class only when you actively use the tools, techniques, and skills described in the lectures. We will provide you ample time and resources to accomplish the goals of this course and expect you to take full advantage of what’s offered. 46 | + Late Assignments: All assignments are to be submitted before the start of class. Assignments posted by the end of the day following class will be marked down 10% and assignments posted at the end of the day following will be marked down 20%. No assignments will be accepted for a grade after three days following class. 47 | + Office Hours: We won’t be holding regular office hours, but are available via email to answer whatever questions you may have about the material. Please feel free to also reach out to the Teaching Assistants as necessary for support and guidance with the exercises, particularly during lab time. 48 | 49 | ---- 50 | ### Resources 51 | #### Technical 52 | 53 | + [Stack Overflow](http://stackoverflow.com) - Q&A community of technology pros 54 | 55 | #### (Some) Open Data Sources 56 | 57 | + [New York City Open Data Portal](https://nycopendata.socrata.com/) 58 | + [New York State Open Data Portal](https://data.ny.gov/) 59 | + [Hilary Mason’s Research Quality Data Sets](https://bitly.com/bundles/hmason/1) 60 | 61 | #### Visualizations 62 | 63 | + [Flowing Data](http://flowingdata.com/) 64 | + [Tableau Visualization Gallery](http://www.tableausoftware.com/public/gallery) 65 | + [Visualizing.org](http://www.visualizing.org/) 66 | + [Data is Beautiful](http://www.reddit.com/r/dataisbeautiful/) 67 | 68 | #### Data Journalism and Critiques 69 | 70 | + [FiveThirtyEight](http://fivethirtyeight.com/) 71 | + [Upshot](http://www.nytimes.com/upshot/) 72 | + [IQuantNY](http://iquantny.tumblr.com/) 73 | + [SimplyStatistics](http://simplystatistics.org/) 74 | + [Data Journalism Handbook](http://datajournalismhandbook.org/1.0/en/index.html) 75 | 76 | #### Suggested Reading 77 | Conway, Drew and John Myles White. Machine Learning for Hackers. O'Reilly Media, Inc., 2012. 78 | 79 | Knuth, Donald E. The Art of Computer Programming. Addison-Wesley Professional, 2011. 80 | 81 | MacCormick, John. Nine Algorithms That Changed the Future: The Ingenious Ideas That Drive Today's Computers. Princeton University Press, 2011. 82 | 83 | McCallum, Q Ethan. Bad Data Handbook. O'Reilly Media, Inc., 2012. 84 | 85 | McKinney, Wes. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. O'Reilly Media, Inc., 2012. 86 | 87 | O'Neil, Cathy and Rachel Schutt. Doing Data Science: Straight Talk from the Front Line. O'Reilly Media, Inc., 2013. 88 | 89 | Russell, Matthew A. Mining the Social Web. O'Reilly Media, Inc., 2013. 90 | 91 | Sedgewick, Robert and Kevin Wayne. Algorithms. Addison-Wesley Professional, 2011. 92 | 93 | Steiner, Christopher. Automate This: How Algorithms Came to Rule Our World. Penguin Group, 2012. 
94 | 95 | ---- 96 | ### Course Outline 97 | (Subject to change) 98 | 99 | #### Week 1: Introduction to Algorithms/Statistics review 100 | ##### Class 1 Readings 101 | + Miller, Claire Cain, [“When Algorithms Discriminate”](http://nyti.ms/1KS5rdu) New York Times, 9 July 2015 102 | + O’Neil, Cathy, [“Algorithms And Accountability Of Those Who Deploy Them”](http://mathbabe.org/2015/05/26/algorithms-and-accountability-of-those-who-deploy-them/) 103 | + Elkus, Adam, [“You Can’t Handle the (Algorithmic) Truth”](http://www.slate.com/articles/technology/future_tense/2015/05/algorithms_aren_t_responsible_for_the_cruelties_of_bureaucracy.single.html) 104 | + Diakopoulos, Nicholas, ["Algorithmic Accontability Reporting: On the Investigation of Black Boxes"](http://towcenter.org/wp-content/uploads/2014/02/78524_Tow-Center-Report-WEB-1.pdf) 105 | 106 | ##### Class 2 Readings (optional) 107 | + McKinney, "Getting Started With Pandas" Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. 108 | + McKinney, "Plotting and Visualization" Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. 109 | 110 | #### Week 2: Statistics in Reporting/Opening the Blackbox: Supervised Learning - Linear Regression 111 | ##### Class 1 Readings 112 | + (TBD) 113 | 114 | ##### Class 2 Readings 115 | + O'Neill, "Statistical Inference, Exploratory Data Analysis, and the Data Science Process" Doing Data Science: Straight Talk from the Front Line pp. 17-37 116 | 117 | #### Week 3: Opening the Blackbox: Supervised Learning - Feature Engineering/Decision Trees 118 | 119 | ##### Class 2 Readings 120 | + Building Machine Learning Systems with Python, pp. 33-43 121 | + Learning scikit-learn: Machine Learning in Python, pp. 41-52 122 | + Brownlee, Jason, ("Discover Feature Engineering, How to Engineer Features and How to Get Good at It")[http://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/] 123 | + ("A Visual Introduction to Machine Learning")[http://www.r2d3.us/visual-intro-to-machine-learning-part-1/] 124 | 125 | #### Week 4: Opening the Blackbox: Supervised Learning - Feature Engineering/Logistic Regression 126 | 127 | #### Week 5: Opening the Blackbox: Unsupervised Learning - Clustering, k-NN 128 | 129 | #### Week 6: Natural Language Processing, Reverse Engineering, and Ethics Revisited 130 | 131 | #### Week 7: Advanced Topics (we'll be polling the class for topics) 132 | 133 | --------------------------------------------------------------------------------