├── .gitignore ├── 1-Introduction-to-machine-learning-software-stack └── 1.3 │ ├── 1.3a NumPy Examples │ ├── 1.3b Matplotlib │ ├── 1.3c Pandas Examples │ ├── 1.3d Machine learning with scikit learn │ ├── Machine Learning with Scikit Learn.docx │ └── Matplotlib.docx ├── 2-intro-to-ml └── README.md ├── 3-project-overview └── README.md ├── 4-regression ├── 4.1 - Linear Regression Ad Sales Revenue.ipynb ├── 4.2 - Multivariate Linear Regression.ipynb ├── 4.3 - Regularization and Model Evaluation.ipynb └── README.md ├── 5-classification ├── 5.2 - Logistic Regression in Classifying Breast Cancer .ipynb ├── 5.3 - Visualizing SVMs with Diabetes.ipynb ├── 5.4 - Artificial Neural Networks in Classifying Breast Cancer.ipynb ├── 5.4.2 - Peeking Inside a Neural Network with MNIST Data.ipynb └── README.md ├── 6-clustering ├── README.md ├── example-clustering-2-dimensions.ipynb ├── example-clustering-states.ipynb └── k means visualization │ ├── d3.v3.min.js │ ├── index.html │ └── k-means.js ├── 7 - Practical Methodologies ├── 7---practical-methodologies └── CV ├── Error Analysis and Classification Measures.ipynb ├── Iris PCA.ipynb ├── LICENSE ├── Learning curves and bias-variance tradeoff.ipynb ├── Ng's Machine Learning Exercise 6.ipynb ├── README.md ├── datasets ├── Advertising.csv ├── breast-cancer-wisconson.csv └── pima-indians-diabetes.csv ├── evaluating estimator performance using cross-validation.ipynb ├── examples ├── Gradient.ipynb └── kNN.ipynb ├── images ├── neuralnet.png ├── train_img.png └── weights.png ├── kaggle-data ├── README.md ├── county_facts.csv ├── county_facts_dictionary.csv └── primary_results.csv └── requirements.txt /.gitignore: -------------------------------------------------------------------------------- 1 | #####=== IPythonNotebook ===##### 2 | # Temporary data 3 | .ipynb_checkpoints/ 4 | 5 | #####=== Python ===##### 6 | 7 | # Byte-compiled / optimized / DLL files 8 | __pycache__/ 9 | *.py[cod] 10 | 11 | # C extensions 12 | *.so 13 | 14 | # Distribution / packaging 15 | .Python 16 | env/ 17 | build/ 18 | develop-eggs/ 19 | dist/ 20 | downloads/ 21 | eggs/ 22 | lib/ 23 | lib64/ 24 | parts/ 25 | sdist/ 26 | var/ 27 | *.egg-info/ 28 | .installed.cfg 29 | *.egg 30 | 31 | # PyInstaller 32 | # Usually these files are written by a python script from a template 33 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 34 | *.manifest 35 | *.spec 36 | 37 | # Installer logs 38 | pip-log.txt 39 | pip-delete-this-directory.txt 40 | 41 | # Unit test / coverage reports 42 | htmlcov/ 43 | .tox/ 44 | .coverage 45 | .cache 46 | nosetests.xml 47 | coverage.xml 48 | 49 | # Translations 50 | *.mo 51 | *.pot 52 | 53 | # Django stuff: 54 | *.log 55 | 56 | # Sphinx documentation 57 | docs/_build/ 58 | 59 | # PyBuilder 60 | target/ 61 | 62 | #####=== OSX ===##### 63 | .DS_Store 64 | .AppleDouble 65 | .LSOverride 66 | 67 | # Icon must end with two \r 68 | Icon 69 | 70 | 71 | # Thumbnails 72 | ._* 73 | 74 | # Files that might appear on external disk 75 | .Spotlight-V100 76 | .Trashes 77 | 78 | # Directories potentially created on remote AFP share 79 | .AppleDB 80 | .AppleDesktop 81 | Network Trash Folder 82 | Temporary Items 83 | .apdisk 84 | 85 | -------------------------------------------------------------------------------- /1-Introduction-to-machine-learning-software-stack/1.3/1.3a NumPy Examples: -------------------------------------------------------------------------------- 1 | ARRAYS 2 | 3 | The central feature of NumPy is the array object class. 
Arrays are similar to lists in Python. 4 | Arrays make operations with large amounts of numeric data very fast and are generally much more efficient than lists. 5 | An array can be created from a list: 6 | In [1]: import numpy as np 7 | 8 | In [2]: a = np.array([1,2,3,4,5], float) 9 | 10 | In [3]: a 11 | Out[3]: array([ 1., 2., 3., 4., 5.]) 12 | 13 | We can also find out the type of each member of the list. 14 | In [4]: type(a) 15 | Out[4]: numpy.ndarray 16 | 17 | Arrays can be accessed with the help of index. 18 | In [5]: a[:2] 19 | Out[5]: array([ 1., 2.]) 20 | This indexing gives the elements from the beginning of the array before the 3rd element. 21 | 22 | In [6]: a[3] 23 | Out[6]: 4.0 24 | 25 | In [7]: a[0] 26 | Out[7]: 1.0 27 | 28 | In [8]: a[0] = 7 29 | 30 | In [9]: a 31 | Out[9]: array([ 7., 2., 3., 4., 5.]) 32 | 33 | An example of two-dimensional array – 34 | In [12]: a = np.array([[1,2,3], [4,5,6]], float) 35 | 36 | In [13]: a 37 | Out[13]: 38 | array([[ 1., 2., 3.], 39 | [ 4., 5., 6.]]) 40 | In [14]: a[0,0] 41 | Out[14]: 1.0 42 | 43 | In [15]: a[0,1] 44 | Out[15]: 2.0 45 | 46 | Array slicing works with multiple dimensions in the same way as usual, applying each slice specification as a filter to a specified dimension. Use of a single ":" in a dimension indicates the use of everything along that dimension 47 | In [16]: a[1,:] 48 | Out[16]: array([ 4., 5., 6.]) 49 | 50 | In [17]: a[:,2] 51 | Out[17]: array([ 3., 6.]) 52 | 53 | In [18]: a[-1:, -2:] 54 | Out[18]: array([[ 5., 6.]]) 55 | 56 | The shape property of an array returns a tuple with the size of each array dimension 57 | In [19]: a.shape 58 | Out[19]: (2, 3) 59 | The dtype property tells you what type of values are stored by the array 60 | In [20]: a.dtype 61 | Out[20]: dtype('float64') 62 | 63 | The len function returns the length of the first axis 64 | In [23]: len(a) 65 | Out[23]: 2 66 | 67 | The in statement can be used to test if values are present in an array 68 | In [24]: 2 in a 69 | Out[24]: True 70 | 71 | In [25]: 10 in a 72 | Out[25]: False 73 | 74 | Arrays can be reshaped using tuples that specify new dimensions. In the following example, we turn a twenty-element one-dimensional array into a two-dimensional array whose first axis has five elements and whose second axis has four elements 75 | In [26]: a = np.array(range(20), float) 76 | 77 | In [27]: a 78 | Out[27]: 79 | array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 80 | 11., 12., 13., 14., 15., 16., 17., 18., 19.]) 81 | 82 | In [28]: a = a.reshape((5,4)) 83 | 84 | In [29]: a 85 | Out[29]: 86 | array([[ 0., 1., 2., 3.], 87 | [ 4., 5., 6., 7.], 88 | [ 8., 9., 10., 11.], 89 | [ 12., 13., 14., 15.], 90 | [ 16., 17., 18., 19.]]) 91 | In [30]: a.shape 92 | Out[30]: (5, 4) 93 | The reshape function creates a new array and does not itself modify the original array. 
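As a quick standalone check (a small sketch, separate from the session above): the object returned by reshape is a new ndarray object, while the original array keeps its shape. Note, however, that the reshaped array may share the same underlying data (a view), so assigning into its elements can still change the original values.

import numpy as np

a = np.arange(6, dtype=float)    # a one-dimensional array of 6 elements
b = a.reshape((2, 3))            # a new 2x3 array object is returned
print a.shape                    # (6,)  -- the original shape is unchanged
print b.shape                    # (2, 3)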
94 | 95 | The copy function can be used to create a new, separate copy of an array in memory 96 | In [31]: a = np.array([1,2,3], float) 97 | 98 | In [32]: b = a 99 | 100 | In [33]: c = a.copy() 101 | 102 | In [34]: a[0] = 0 103 | 104 | In [35]: a 105 | Out[35]: array([ 0., 2., 3.]) 106 | 107 | In [36]: b 108 | Out[36]: array([ 0., 2., 3.]) 109 | 110 | In [37]: c 111 | Out[37]: array([ 1., 2., 3.]) 112 | 113 | Lists can also be created from array 114 | In [38]: a - np.array([1,2,3], float) 115 | Out[38]: array([-1., 0., 0.]) 116 | 117 | In [39]: a.tolist() 118 | Out[39]: [0.0, 2.0, 3.0] 119 | 120 | In [40]: list(a) 121 | Out[40]: [0.0, 2.0, 3.0] 122 | We can convert the raw data in an array to a binary string (i.e., not in human-readable form) using the tostring function. The fromstring function then allows an array to be created from this data later on. These routines are sometimes convenient for saving large amount of array data in files that can be read later on. 123 | In [48]: a = np.array([5,4,3], float) 124 | 125 | In [49]: s = a.tostring() 126 | 127 | In [50]: s 128 | Out[50]: '\x00\x00\x00\x00\x00\x00\x14@\x00\x00\x00\x00\x00\x00\x10@\x00\x00\x00\x00\x00\x00\x08@' 129 | 130 | In [51]: np.fromstring(s) 131 | Out[51]: array([ 5., 4., 3.]) 132 | 133 | Array can also be transposed with the help of transpose function 134 | In [52]: a = np.array(range(6), float).reshape((3,2)) 135 | 136 | In [53]: a 137 | Out[53]: 138 | array([[ 0., 1.], 139 | [ 2., 3.], 140 | [ 4., 5.]]) 141 | 142 | In [54]: a.transpose() 143 | Out[54]: 144 | array([[ 0., 2., 4.], 145 | [ 1., 3., 5.]]) 146 | 147 | One-dimensional versions of multi-dimensional arrays can be generated with flatten 148 | In [55]: a = np.array([[1, 2, 3], [4, 5, 6]], float) 149 | 150 | In [56]: a 151 | Out[56]: 152 | array([[ 1., 2., 3.], 153 | [ 4., 5., 6.]]) 154 | 155 | In [57]: a.flatten() 156 | Out[57]: array([ 1., 2., 3., 4., 5., 6.]) 157 | 158 | Two or more arrays can be concatenated together using the concatenate function with a tuple of the arrays to be joined 159 | In [62]: a = np.array([1,2], float) 160 | 161 | In [63]: b = np.array([3,4,5], float) 162 | 163 | In [64]: c = np.array([6,7,8,9], float) 164 | 165 | In [65]: np.concatenate((a, b, c)) 166 | Out[65]: array([ 1., 2., 3., 4., 5., 6., 7., 8., 9.]) 167 | 168 | If an array has more than one dimension, it is possible to specify the axis along which multiple arrays are concatenated. 
By default (without specifying the axis), NumPy concatenates along the first dimension 169 | In [66]: a = np.array([[1, 2], [3, 4]], float) 170 | 171 | In [67]: b = np.array([[5, 6], [7,8]], float) 172 | 173 | In [68]: np.concatenate((a,b)) 174 | Out[68]: 175 | array([[ 1., 2.], 176 | [ 3., 4.], 177 | [ 5., 6.], 178 | [ 7., 8.]]) 179 | 180 | In [69]: np.concatenate((a,b), axis = 0) 181 | Out[69]: 182 | array([[ 1., 2.], 183 | [ 3., 4.], 184 | [ 5., 6.], 185 | [ 7., 8.]]) 186 | 187 | In [70]: np.concatenate((a,b), axis = 1) 188 | Out[70]: 189 | array([[ 1., 2., 5., 6.], 190 | [ 3., 4., 7., 8.]]) 191 | 192 | The dimensionality of an array can be increased using the newaxis constant in bracket notation 193 | In [75]: a = np.array([1, 2, 3], float) 194 | 195 | In [76]: a 196 | Out[76]: array([ 1., 2., 3.]) 197 | 198 | In [77]: a[:, np.newaxis].shape 199 | Out[77]: (3, 1) 200 | 201 | In [79]: a[np.newaxis,:] 202 | Out[79]: array([[ 1., 2., 3.]]) 203 | 204 | In [80]: a[np.newaxis,:].shape 205 | Out[80]: (1, 3) 206 | 207 | The arange function returns an array 208 | In [82]: np.arange(5, dtype = float) 209 | Out[82]: array([ 0., 1., 2., 3., 4.]) 210 | 211 | In [83]: np.arange(1,6,2, dtype = int) 212 | Out[83]: array([1, 3, 5]) 213 | 214 | The functions zeros and ones create new arrays of specified dimensions filled with these values 215 | In [84]: np.ones((3,2), dtype = float) 216 | Out[84]: 217 | array([[ 1., 1.], 218 | [ 1., 1.], 219 | [ 1., 1.]]) 220 | 221 | In [85]: np.zeros(7, dtype = int) 222 | Out[85]: array([0, 0, 0, 0, 0, 0, 0]) 223 | 224 | The zeros_like and ones_like functions create a new array with the same dimensions and type of an existing one 225 | In [86]: a = np.array([[1,2,3], [4,5,6]], float) 226 | 227 | In [87]: np.zeros_like(a) 228 | Out[87]: 229 | array([[ 0., 0., 0.], 230 | [ 0., 0., 0.]]) 231 | 232 | In [88]: np.ones_like(a) 233 | Out[88]: 234 | array([[ 1., 1., 1.], 235 | [ 1., 1., 1.]]) 236 | 237 | 238 | 239 | 240 | ARRAY MATHEMATICS 241 | 242 | When standard mathematical operations are used with arrays, they are applied on an element-by-element basis. This means that the arrays should be the same size during addition, subtraction 243 | In [90]: a = np.array([1,2,3]) 244 | 245 | In [91]: b = np.array([5,6,7]) 246 | 247 | In [92]: a+b 248 | Out[92]: array([ 6, 8, 10]) 249 | 250 | In [93]: a-b 251 | Out[93]: array([-4, -4, -4]) 252 | 253 | In [94]: a*b 254 | Out[94]: array([ 5, 12, 21]) 255 | 256 | In [95]: b/a 257 | Out[95]: array([5, 3, 2]) 258 | 259 | In [96]: a%b 260 | Out[96]: array([1, 2, 3]) 261 | 262 | In [97]: b**a 263 | Out[97]: array([ 5, 36, 343]) 264 | 265 | For two-dimensional arrays, multiplication remains element wise and does not correspond to matrix multiplication 266 | In [98]: a = np.array([[1,2], [3,4]]) 267 | 268 | In [99]: b = np.array([[2,0], [1,3]]) 269 | 270 | In [100]: a*b 271 | Out[100]: 272 | array([[ 2, 0], 273 | [ 3, 12]]) 274 | We will see Errors when the arrays do not match in size. 
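To make this concrete, here is a small standalone sketch (the exact wording of the error message varies between NumPy versions): adding arrays whose shapes cannot be matched raises a ValueError.

import numpy as np

a = np.array([1, 2, 3], float)   # shape (3,)
b = np.array([1, 2], float)      # shape (2,)
try:
    a + b                        # shapes (3,) and (2,) cannot be broadcast together
except ValueError as e:
    print e                      # prints the shape-mismatch error message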
275 | 276 | Arrays that do not match in the number of dimensions will be broadcasted by Python to perform mathematical operations 277 | In [103]: a = np.array([[1, 2], [3, 4], [5, 6]]) 278 | 279 | In [104]: b = np.array([-1, 3]) 280 | 281 | In [105]: a 282 | Out[105]: 283 | array([[1, 2], 284 | [3, 4], 285 | [5, 6]]) 286 | 287 | In [106]: b 288 | Out[106]: array([-1, 3]) 289 | 290 | In [107]: a + b 291 | Out[107]: 292 | array([[0, 5], 293 | [2, 7], 294 | [4, 9]]) 295 | 296 | In addition to the standard operators, NumPy offers a large library of common mathematical functions that can be applied elementwise to arrays. Among these are the functions: abs, sign, sqrt, log, log10, exp, sin, cos, tan, arcsin, arccos, arctan, sinh, cosh, tanh, arcsinh, arccosh, and arctanh 297 | In [108]: a = np.array([6,8,9]) 298 | 299 | In [109]: np.sqrt(a) 300 | Out[109]: array([ 2.44948974, 2.82842712, 3. ]) 301 | 302 | In [110]: np. sin(a) 303 | Out[110]: array([-0.2794155 , 0.98935825, 0.41211849]) 304 | 305 | In [111]: np.cos(a) 306 | Out[111]: array([ 0.96017029, -0.14550003, -0.91113026]) 307 | 308 | In [112]: np.log(a) 309 | Out[112]: array([ 1.79175947, 2.07944154, 2.19722458]) 310 | 311 | In [113]: np.log10(a) 312 | Out[113]: array([ 0.77815125, 0.90308999, 0.95424251]) 313 | 314 | In [114]: np.tan(a) 315 | Out[114]: array([-0.29100619, -6.79971146, -0.45231566]) 316 | 317 | 318 | The functions floor, ceil, and rint give the lower, upper, or nearest (rounded) integer 319 | In [117]: a = np.array([3.6, 8.4, 9.2, 5.9], float) 320 | 321 | In [118]: np.floor(a) 322 | Out[118]: array([ 3., 8., 9., 5.]) 323 | 324 | In [119]: np.ceil(a) 325 | Out[119]: array([ 4., 9., 10., 6.]) 326 | 327 | In [120]: np.rint(a) 328 | Out[120]: array([ 4., 8., 9., 6.]) 329 | Also included in the NumPy module are two important mathematical constants 330 | In [121]: np.pi 331 | Out[121]: 3.141592653589793 332 | 333 | In [122]: np.e 334 | Out[122]: 2.718281828459045 335 | 336 | 337 | 338 | 339 | ARRAY ITERATION 340 | 341 | It is possible to iterate over arrays in a manner similar to that of lists 342 | In [131]: a = np.array([1,3,6]) 343 | 344 | In [132]: for x in a: 345 | ...: print x 346 | ...: 347 | 1 348 | 3 349 | 6 350 | 351 | For multidimensional arrays, iteration proceeds over the first axis such that each loop returns a subsection of the array 352 | In [133]: a = np.array([[1, 2], [3, 4], [5, 6]], float) 353 | 354 | In [134]: for x in a: 355 | ...: print x 356 | ...: 357 | [ 1. 2.] 358 | [ 3. 4.] 359 | [ 5. 6.] 360 | 361 | 362 | Multiple assignment can also be used with array iteration 363 | In [135]: a = np.array([[1, 2], [3, 4], [5, 6]]) 364 | 365 | In [136]: for (x,y) in a: 366 | ...: print x*y 367 | ...: 368 | 2 369 | 12 370 | 30 371 | 372 | 373 | 374 | 375 | ARRAY OPERATIONS 376 | 377 | The elements in an array can be summed or multiplied 378 | In [137]: a = np.array([2,4,3]) 379 | 380 | In [138]: a.sum() 381 | Out[138]: 9 382 | 383 | In [139]: a.prod() 384 | Out[139]: 24 385 | 386 | Here, member functions of the arrays were used. 
We can also use standalone functions in the NumPy module 387 | In [140]: np.sum(a) 388 | Out[140]: 9 389 | 390 | In [141]: np.prod(a) 391 | Out[141]: 24 392 | 393 | A number of routines enable computation of statistical quantities in array datasets, such as the mean (average), variance, and standard deviation 394 | In [143]: a.mean() 395 | Out[143]: 3.0 396 | 397 | In [144]: a.var() 398 | Out[144]: 0.66666666666666663 399 | 400 | In [145]: a.std() 401 | Out[145]: 0.81649658092772603 402 | 403 | It's also possible to find the minimum and maximum element values 404 | In [150]: a = np.array([4,9,6]) 405 | 406 | In [151]: a.min() 407 | Out[151]: 4 408 | 409 | In [152]: a.max() 410 | Out[152]: 9 411 | 412 | The argmin and argmax functions return the array indices of the minimum and maximum values 413 | In [153]: a = np.array([4,9,6]) 414 | 415 | In [154]: a.argmin() 416 | Out[154]: 0 417 | 418 | In [155]: a.argmax() 419 | Out[155]: 1 420 | 421 | For multidimensional arrays, each of the functions thus far described can take an optional argument axis that will perform an operation along only the specified axis, placing the results in a return array 422 | In [163]: a = np.array([[0,1], [2,-4], [9,7]]) 423 | 424 | In [164]: a.mean(axis=0) 425 | Out[164]: array([ 3.66666667, 1.33333333]) 426 | 427 | In [165]: a.mean(axis=1) 428 | Out[165]: array([ 0.5, -1. , 8. ]) 429 | 430 | In [166]: a.min(axis=1) 431 | Out[166]: array([ 0, -4, 7]) 432 | 433 | In [167]: a.max(axis=0) 434 | Out[167]: array([9, 7]) 435 | 436 | Arrays can also be sorted 437 | In [168]: a = np.array([8,5,9,3,7,0,12]) 438 | 439 | In [169]: sorted(a) 440 | Out[169]: [0, 3, 5, 7, 8, 9, 12] 441 | 442 | In [170]: a.sort() 443 | 444 | In [171]: a 445 | Out[171]: array([ 0, 3, 5, 7, 8, 9, 12]) 446 | 447 | Values in an array can be "clipped" to be within a prespecified range 448 | In [172]: a = np.array([8,5,9,3,7,0,12]) 449 | 450 | In [173]: a.clip(0,5) 451 | Out[173]: array([5, 5, 5, 3, 5, 0, 5]) 452 | 453 | Unique elements can be extracted from an array. 454 | In [176]: a = np.array([7,7,7,7,8,1,3,2,2,5,5,9,9,1]) 455 | 456 | In [177]: np.unique(a) 457 | Out[177]: array([1, 2, 3, 5, 7, 8, 9]) 458 | 459 | 460 | 461 | 462 | BOOLEAN COMPARISONS 463 | 464 | Boolean comparisons can be used to compare members element wise on arrays of equal size. The return value is an array of Boolean True / False values. 
465 | In [178]: a = np.array([6,9,4], float) 466 | 467 | In [179]: b = np.array([0,4,7], float) 468 | 469 | In [180]: a > b 470 | Out[180]: array([ True, True, False], dtype=bool) 471 | In [181]: a == b 472 | Out[181]: array([False, False, False], dtype=bool) 473 | 474 | In [182]: a <= b 475 | Out[182]: array([False, False, True], dtype=bool) 476 | 477 | The results of a Boolean comparison can be stored in an array 478 | In [183]: c = a > b 479 | 480 | In [184]: c 481 | Out[184]: array([ True, True, False], dtype=bool) 482 | 483 | Arrays can be compared to single values using broadcasting 484 | In [185]: a = np.array([2,9,7,5,8]) 485 | 486 | In [186]: a > 4 487 | Out[186]: array([False, True, True, True, True], dtype=bool) 488 | 489 | It is possible to test whether or not values are NaN ("not a number") or finite 490 | In [197]: np.isnan(a) 491 | Out[197]: array([False, True, False], dtype=bool) 492 | 493 | In [198]: np.isfinite(a) 494 | Out[198]: array([ True, False, False], dtype=bool) 495 | 496 | 497 | 498 | 499 | STATISTICS 500 | 501 | In addition to the mean, var, and std functions, NumPy supplies several other methods for returning statistical features of arrays. 502 | In [199]: a = np.array([1,3,5,7,4,6,9,8]) 503 | 504 | In [200]: np.median(a) 505 | Out[200]: 5.5 506 | 507 | The correlation coefficient for multiple variables observed at multiple instances can be found for arrays of the form [[x1, x2, …], [y1, y2, …], [z1, z2, …], …] where x, y, z are different observables and the numbers indicate the observation times 508 | In [202]: a = np.array([[1,2,3,4,5], [6,7,8,9,10]]) 509 | 510 | In [203]: b = np.corrcoef(a) 511 | 512 | In [204]: b 513 | Out[204]: 514 | array([[ 1., 1.], 515 | [ 1., 1.]]) 516 | 517 | The covariance for data can be found 518 | In [208]: np.cov(a) 519 | Out[208]: 520 | array([[ 0.91666667, 2.08333333], 521 | [ 2.08333333, 8.91666667]]) 522 | -------------------------------------------------------------------------------- /1-Introduction-to-machine-learning-software-stack/1.3/1.3b Matplotlib: -------------------------------------------------------------------------------- 1 | LINE PLOT 2 | 3 | In [1]: import matplotlib as mp 4 | 5 | In [2]: import matplotlib.pyplot as plt 6 | 7 | In [3]: plt.plot([1,2,3,4]) 8 | Out[3]: [] 9 | 10 | In [4]: plt.show() 11 | 12 | 13 | 14 | In [5]: plt.plot([1,2,3,4], [2,5,8,15]) 15 | Out[5]: [] 16 | 17 | In [6]: plt.show() 18 | 19 | 20 | 21 | 22 | PLOTTING VALUES WITH DOTS 23 | 24 | In [7]: plt.plot([1,2,3,4], [2,5,8,15], 'ro') 25 | Out[7]: [] 26 | 27 | In [8]: plt.axis([0,6,0,20]) 28 | Out[8]: [0, 6, 0, 20] 29 | 30 | In [9]: plt.show() 31 | 32 | 33 | 34 | 35 | PLOTTING SEVERAL LINES WITH DIFFERENT FORMAT STYLE 36 | 37 | In [11]: import numpy as np 38 | 39 | In [12]: t = np.arange(0., 5., 0.2) 40 | 41 | In [13]: plt.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^') 42 | Out[13]: 43 | [, 44 | , 45 | ] 46 | 47 | In [14]: plt.show() 48 | 49 | 50 | 51 | 52 | SIMPLE PLOT 53 | 54 | In [16]: x = np.arange(0, 3*np.pi, 0.1) 55 | 56 | In [17]: y = np.sin(x) 57 | 58 | In [18]: plt.plot(x,y) 59 | Out[18]: [] 60 | 61 | In [19]: plt.show() 62 | 63 | 64 | 65 | 66 | SINE AND COSINE PLOT 67 | 68 | In [20]: x = np.arange(0, 3 * np.pi, 0.1) 69 | In [22]: y_sin = np.sin(x) 70 | 71 | In [23]: y_cos = np.cos(x) 72 | 73 | In [24]: plt.plot(x, y_sin) 74 | Out[24]: [] 75 | 76 | In [25]: plt.plot(x, y_cos) 77 | Out[25]: [] 78 | 79 | In [26]: plt.xlabel('x axis label') 80 | Out[26]: 81 | 82 | In [27]: plt.ylabel('y axis label') 83 | Out[27]: 84 | 85 | In [28]: 
plt.title ('Sine and Cosine') 86 | Out[28]: 87 | 88 | In [29]: plt.legend(['Sine', 'Cosine']) 89 | Out[29]: 90 | 91 | In [30]: plt.show() 92 | 93 | 94 | 95 | 96 | SUBPLOTS 97 | 98 | In [31]: x = np.arange(0, 3 * np.pi, 0.1) 99 | 100 | In [32]: y_sin = np.sin(x) 101 | 102 | In [33]: y_cos = np.cos(x) 103 | 104 | In [34]: plt.subplot(2,1,1) 105 | Out[34]: 106 | 107 | In [35]: plt.plot(x, y_sin) 108 | Out[35]: [] 109 | 110 | In [36]: plt.title('Sine') 111 | Out[36]: 112 | 113 | In [37]: plt.subplot(2,1,2) 114 | Out[37]: 115 | 116 | In [38]: plt.plot(x, y_cos) 117 | Out[38]: [] 118 | 119 | In [39]: plt.title('Cosine') 120 | Out[39]: 121 | 122 | In [40]: plt.show() 123 | 124 | 125 | 126 | 127 | HISTOGRAM 128 | 129 | data = np.random.normal(5.0, 3.0, 1000) 130 | 131 | In [12]: plt.hist(data) 132 | Out[12]: 133 | (array([ 2., 12., 44., 100., 200., 249., 214., 123., 43., 13.]), 134 | array([ -5.6815256 , -3.7794175 , -1.8773094 , 0.02479871, 135 | 1.92690681, 3.82901491, 5.73112301, 7.63323111, 136 | 9.53533921, 11.43744731, 13.33955541]), 137 | ) 138 | 139 | In [13]: plt.xlabel('data') 140 | Out[13]: 141 | 142 | In [14]: plt.show() 143 | 144 | 145 | 146 | 147 | 148 | 149 | In [15]: bins = np.arange(-5., 16., 1.) 150 | 151 | In [16]: plt.hist(data, bins) 152 | Out[16]: 153 | (array([ 0., 5., 8., 13., 31., 33., 74., 104., 110., 154 | 138., 120., 113., 103., 74., 25., 27., 13., 5., 155 | 3., 0.]), 156 | array([ -5., -4., -3., -2., -1., 0., 1., 2., 3., 4., 5., 157 | 6., 7., 8., 9., 10., 11., 12., 13., 14., 15.]), 158 | ) 159 | 160 | In [17]: plt.show() 161 | 162 | 163 | 164 | 165 | BAR CHART 166 | 167 | In [3]: year = (2011, 2012, 2013, 2014, 2015) 168 | 169 | In [4]: score = (83, 78, 99, 60, 80) 170 | In [13]: plt.xlabel('year') 171 | Out[13]: 172 | 173 | In [14]: plt.ylabel('score') 174 | Out[14]: 175 | 176 | In [15]: plt.bar(year, score) 177 | Out[15]: 178 | 179 | In [16]: plt.show() 180 | 181 | 182 | 183 | 184 | SCATTER PLOT 185 | 186 | In [14]: year = (2011, 2012, 2013, 2014, 2015) 187 | 188 | In [15]: score = (83, 78, 99, 60, 80) 189 | 190 | In [16]: plt.scatter(year, score) 191 | Out[16]: 192 | 193 | In [17]: plt.xlabel(year) 194 | Out[17]: 195 | 196 | In [18]: plt.ylabel(score) 197 | Out[18]: 198 | 199 | In [19]: plt.show() 200 | 201 | 202 | -------------------------------------------------------------------------------- /1-Introduction-to-machine-learning-software-stack/1.3/1.3c Pandas Examples: -------------------------------------------------------------------------------- 1 | SERIES IN PANDAS: 2 | 3 | A series is an one-dimensional array containing an array of any data type and an associated array of data labels called index. 4 | 5 | 6 | In [1]: import pandas as pd 7 | 8 | In [3]: from pandas import Series, DataFrame 9 | 10 | 11 | 12 | Forming a simple series. 13 | 14 | In [4]: obj = Series([4,7,-3,1]) 15 | 16 | In [5]: obj 17 | Out[5]: 18 | 0 4 19 | 1 7 20 | 2 -3 21 | 3 1 22 | dtype: int64 23 | 24 | The Series is represented normally with an index at the left. Since we did not specify any index for data here, it specified an index on its own starting from 0 through N-1 where N is the length of data (in this case its 4). 25 | 26 | 27 | 28 | We can get an array representation and index objects of the Series via its values and index attributes - 29 | 30 | In [6]: obj.values 31 | Out[6]: array([ 4, 7, -3, 1], dtype=int64) 32 | 33 | In [7]: obj.index 34 | Out[7]: RangeIndex(start=0, stop=4, step=1) 35 | 36 | 37 | 38 | It is good to create a Series with an index for each element in the Series. 
39 | 40 | In [8]: obj2 = Series([4,7,-3,1], index = ['a','b','c','d']) 41 | 42 | In [9]: obj2 43 | Out[9]: 44 | a 4 45 | b 7 46 | c -3 47 | d 1 48 | dtype: int64 49 | 50 | 51 | 52 | Compared to regular Numpy Array, in pandas, we can use values in the index when selecting a single values or a set of values. 53 | 54 | In [10]: obj2.index 55 | Out[10]: Index([u'a', u'b', u'c', u'd'], dtype='object') 56 | 57 | In [11]: obj2['a'] 58 | Out[11]: 4 59 | 60 | In [12]: obj2['d'] - 4 61 | Out[12]: -3 62 | 63 | In [13]: obj2[['c', 'a', 'd']] 64 | Out[13]: 65 | c -3 66 | a 4 67 | d 1 68 | dtype: int64 69 | 70 | 71 | 72 | NumPy array operations, such as filtering with Boolean array, Scalar multiplications, or applying math functions, will preserve the index value. 73 | 74 | In [14]: obj2 75 | Out[14]: 76 | a 4 77 | b 7 78 | c -3 79 | d 1 80 | dtype: int64 81 | 82 | 83 | In [15]: obj2[obj2 > 0] 84 | Out[15]: 85 | a 4 86 | b 7 87 | d 1 88 | dtype: int64 89 | 90 | 91 | In [16]: obj2 * 2 92 | Out[16]: 93 | a 8 94 | b 14 95 | c -6 96 | d 2 97 | dtype: int64 98 | 99 | 100 | In [17]: import numpy as np 101 | 102 | 103 | In [18]: np.exp(obj2) 104 | Out[18]: 105 | a 54.598150 106 | b 1096.633158 107 | c 0.049787 108 | d 2.718282 109 | dtype: float64 110 | 111 | 112 | 113 | We can also think of Series as a fixed-length, ordered dictionary, as it is a mapping of index values to data values. It can be substituted into many functions that expect a dictionary. 114 | 115 | In [19]: 'b' in obj2 116 | Out[19]: True 117 | 118 | In [21]: 'f' in obj2 119 | Out[21]: False 120 | 121 | 122 | 123 | If there is data in Python dictionary, we can create a Series from it by passing the dictionary. 124 | 125 | In [22]: sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000} 126 | 127 | 128 | In [23]: obj3 = Series(sdata) 129 | 130 | 131 | In [24]: obj3 132 | Out[24]: 133 | Ohio 35000 134 | Oregon 16000 135 | Texas 71000 136 | Utah 5000 137 | dtype: int64 138 | 139 | 140 | 141 | 142 | While passing a dictionary, the index of the resulting Series will have the dictionary keys in the sorted order. 143 | 144 | In [25]: states = ['California', 'Ohio', 'Oregon', 'Texas'] 145 | 146 | In [26]: obj4 = Series(sdata, index= states) 147 | 148 | In [27]: obj4 149 | Out[27]: 150 | California NaN 151 | Ohio 35000.0 152 | Oregon 16000.0 153 | Texas 71000.0 154 | dtype: float64 155 | 156 | In this case, three values are found in sdata and were placed in the appropriate location. There was no value for California and hence it is NaN (Not an Number) which is considered in pandas to mark missing or NA values. 157 | 158 | 159 | 160 | The isnull or notnull function in pandas is used to detect missing values. 161 | 162 | In [28]: pd.isnull(obj4) 163 | Out[28]: 164 | California True 165 | Ohio False 166 | Oregon False 167 | Texas False 168 | dtype: bool 169 | 170 | 171 | In [29]: pd.notnull(obj4) 172 | Out[29]: 173 | California False 174 | Ohio True 175 | Oregon True 176 | Texas True 177 | dtype: bool 178 | 179 | 180 | 181 | Pandas Series also has these as instance methods. 182 | 183 | In [30]: obj4.isnull() 184 | Out[30]: 185 | California True 186 | Ohio False 187 | Oregon False 188 | Texas False 189 | dtype: bool 190 | 191 | 192 | 193 | Another important Series feature for many applications is that it automatically aligns differently indexed data in arithmetic operations. 
194 | 195 | In [31]: obj3 196 | Out[31]: 197 | Ohio 35000 198 | Oregon 16000 199 | Texas 71000 200 | Utah 5000 201 | dtype: int64 202 | 203 | 204 | In [32]: obj4 205 | Out[32]: 206 | California NaN 207 | Ohio 35000.0 208 | Oregon 16000.0 209 | Texas 71000.0 210 | dtype: float64 211 | 212 | 213 | In [33]: obj3 + obj4 214 | Out[33]: 215 | California NaN 216 | Ohio 70000.0 217 | Oregon 32000.0 218 | Texas 142000.0 219 | Utah NaN 220 | dtype: float64 221 | 222 | 223 | 224 | Both the Series objects and its index has a name attribute, which integrates with other key areas of pandas functionality. 225 | 226 | In [34]: obj4.name = 'population' 227 | 228 | In [35]: obj4.index.name = 'state' 229 | 230 | In [36]: obj4 231 | Out[36]: 232 | state 233 | California NaN 234 | Ohio 35000.0 235 | Oregon 16000.0 236 | Texas 71000.0 237 | Name: population, dtype: float64 238 | 239 | 240 | 241 | A Series index can be altered by assigning it. 242 | 243 | In [37]: obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan'] 244 | 245 | In [38]: obj 246 | Out[38]: 247 | Bob 4 248 | Steve 7 249 | Jeff -3 250 | Ryan 1 251 | dtype: int64 252 | 253 | 254 | 255 | 256 | 257 | 258 | DATA FRAMES: 259 | 260 | A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered 261 | collection of columns, each of which can be a different value type (numeric, 262 | string, boolean, etc.). The DataFrame has both a row and column index; it can be 263 | thought of as a dict of Series (one for all sharing the same index). 264 | 265 | 266 | 267 | Though there are different ways to construct a DataFrame, the most common way is to form a dictionary of equal length lists or Numpy arrays. 268 | 269 | In [39]: data = {'state' : ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 270 | ...: 'year' : [2000, 2001, 2002, 2001, 2002], 271 | ...: 'pop' : [1.5, 1.7, 3.6, 2.4, 2.9]} 272 | 273 | In [40]: frame = DataFrame(data) 274 | 275 | 276 | 277 | The resulting DataFrame will have its index assigned automatically and the columns are placed in sorted order. 278 | 279 | In [41]: frame 280 | Out[41]: 281 | pop state year 282 | 0 1.5 Ohio 2000 283 | 1 1.7 Ohio 2001 284 | 2 3.6 Ohio 2002 285 | 3 2.4 Nevada 2001 286 | 4 2.9 Nevada 2002 287 | 288 | 289 | 290 | If we specify a sequence of columns, the DataFrame’s columns will be exactly in the order of what we pass 291 | 292 | In [42]: DataFrame(data, columns = ['year', 'state', 'pop']) 293 | Out[42]: 294 | year state pop 295 | 0 2000 Ohio 1.5 296 | 1 2001 Ohio 1.7 297 | 2 2002 Ohio 3.6 298 | 3 2001 Nevada 2.4 299 | 4 2002 Nevada 2.9 300 | 301 | 302 | 303 | If we pass a column that is not contained in the data, it will appear as NaN values. 304 | 305 | In [43]: frame2 = DataFrame(data, columns = ['year', 'state', 'pop', 'debt'], 306 | ...: index = ['one', 'two', 'three', 'four', 'five']) 307 | 308 | In [44]: frame2 309 | Out[44]: 310 | year state pop debt 311 | one 2000 Ohio 1.5 NaN 312 | two 2001 Ohio 1.7 NaN 313 | three 2002 Ohio 3.6 NaN 314 | four 2001 Nevada 2.4 NaN 315 | five 2002 Nevada 2.9 NaN 316 | 317 | 318 | 319 | A column in a DataFrame can be retrieved as a Series either by dictionary like notation or by attribute. 
320 | 321 | In [45]: frame2.columns 322 | Out[45]: Index([u'year', u'state', u'pop', u'debt'], dtype='object') 323 | 324 | 325 | In [46]: frame2['state'] 326 | Out[46]: 327 | one Ohio 328 | two Ohio 329 | three Ohio 330 | four Nevada 331 | five Nevada 332 | Name: state, dtype: object 333 | 334 | 335 | In [47]: frame2.year 336 | Out[47]: 337 | one 2000 338 | two 2001 339 | three 2002 340 | four 2001 341 | five 2002 342 | Name: year, dtype: int64 343 | 344 | 345 | Note: The name attribute has the same index as the DataFrame, and their name attribute has been appropriately set. 346 | 347 | 348 | 349 | Rows can also be retrieved by position or name by a couple of methods, such as the 350 | ix indexing field. 351 | 352 | In [48]: frame2.ix['three'] 353 | Out[48]: 354 | year 2002 355 | state Ohio 356 | pop 3.6 357 | debt NaN 358 | Name: three, dtype: object 359 | 360 | 361 | 362 | Columns can also be assigned values and modified. 363 | 364 | In [49]: frame2['debt'] = 16.5 365 | 366 | In [50]: frame2 367 | Out[50]: 368 | year state pop debt 369 | one 2000 Ohio 1.5 16.5 370 | two 2001 Ohio 1.7 16.5 371 | three 2002 Ohio 3.6 16.5 372 | four 2001 Nevada 2.4 16.5 373 | five 2002 Nevada 2.9 16.5 374 | 375 | In [52]: frame2['debt'] = np.arange(5.) 376 | 377 | In [53]: frame2 378 | Out[53]: 379 | year state pop debt 380 | one 2000 Ohio 1.5 0.0 381 | two 2001 Ohio 1.7 1.0 382 | three 2002 Ohio 3.6 2.0 383 | four 2001 Nevada 2.4 3.0 384 | five 2002 Nevada 2.9 4.0 385 | 386 | 387 | 388 | While assigning lists or arrays to a column, the value’s length must match the length 389 | of the DataFrame. If you assign a Series, it will be instead conformed exactly to the 390 | DataFrame’s index, inserting missing values in any holes 391 | 392 | In [56]: val = Series([-1.4, -1.6, -1.7], index = ['two', 'four', 'five']) 393 | 394 | In [57]: frame2['debt'] = val 395 | 396 | In [58]: frame2 397 | Out[58]: 398 | year state pop debt 399 | one 2000 Ohio 1.5 NaN 400 | two 2001 Ohio 1.7 -1.4 401 | three 2002 Ohio 3.6 NaN 402 | four 2001 Nevada 2.4 -1.6 403 | five 2002 Nevada 2.9 -1.7 404 | 405 | 406 | 407 | Assigning a column that doesn’t exist will create a new column. The del keyword will 408 | delete columns as with a dictionary. 409 | 410 | In [59]: frame2['western'] = frame2.state == 'Ohio' 411 | 412 | In [60]: frame2 413 | Out[60]: 414 | year state pop debt western 415 | one 2000 Ohio 1.5 NaN True 416 | two 2001 Ohio 1.7 -1.4 True 417 | three 2002 Ohio 3.6 NaN True 418 | four 2001 Nevada 2.4 -1.6 False 419 | five 2002 Nevada 2.9 -1.7 False 420 | 421 | In [61]: del frame2['western'] 422 | 423 | In [62]: frame2.columns 424 | Out[62]: Index([u'year', u'state', u'pop', u'debt'], dtype='object') 425 | 426 | 427 | 428 | Another common form of data is a nested dict of dicts format 429 | 430 | In [63]: pop = {'Nevada' : {2001: 2.4, 2002: 2.9}, 431 | ...: 'Ohio' : {2000: 1.5, 2001: 1.7, 2002: 3.6}} 432 | 433 | 434 | 435 | If passed to DataFrame, it will interpret the outer dict keys as the columns and the inner 436 | keys as the row indices. 437 | 438 | In [64]: frame3 = DataFrame(pop) 439 | 440 | In [65]: frame3 441 | Out[65]: 442 | Nevada Ohio 443 | 2000 NaN 1.5 444 | 2001 2.4 1.7 445 | 2002 2.9 3.6 446 | 447 | 448 | 449 | We can also transpose the result 450 | 451 | In [66]: frame3.T 452 | Out[66]: 453 | 2000 2001 2002 454 | Nevada NaN 2.4 2.9 455 | Ohio 1.5 1.7 3.6 456 | 457 | 458 | 459 | The keys in the inner dicts are unioned and sorted to form the index in the result. 
This 460 | isn’t true if an explicit index is specified. 461 | 462 | In [67]: DataFrame(pop, index=[2001, 2002, 2003]) 463 | Out[67]: 464 | Nevada Ohio 465 | 2001 2.4 1.7 466 | 2002 2.9 3.6 467 | 2003 NaN NaN 468 | 469 | 470 | 471 | Dicts of Series are treated much in the same way. 472 | 473 | In [68]: pdata = {'Ohio': frame3['Ohio'][:-1], 474 | ...: 'Nevada' : frame3['Nevada'][:2]} 475 | 476 | In [69]: DataFrame(pdata) 477 | Out[69]: 478 | Nevada Ohio 479 | 2000 NaN 1.5 480 | 2001 2.4 1.7 481 | 482 | 483 | 484 | If the DataFrame’s index and column have their name attributes set, these will also be displayed. 485 | 486 | In [70]: frame3.index.name = 'year'; frame3.columns.name = 'state 487 | 488 | In [71]: frame3 489 | Out[71]: 490 | State Nevada Ohio 491 | year 492 | 2000 NaN 1.5 493 | 2001 2.4 1.7 494 | 2002 2.9 3.6 495 | 496 | 497 | 498 | Like Series, the values attribute returns the data contained in the DataFrame as a 2D 499 | ndarray 500 | 501 | In [72]: frame3.values 502 | Out[72]: 503 | array([[ nan, 1.5], 504 | [ 2.4, 1.7], 505 | [ 2.9, 3.6]]) 506 | 507 | 508 | 509 | If the DataFrame’s columns are different dtypes, the dtype of the values array will be 510 | chosen to accomodate all of the columns 511 | 512 | In [73]: frame2.values 513 | Out[73]: 514 | array([[2000L, 'Ohio', 1.5, nan], 515 | [2001L, 'Ohio', 1.7, -1.4], 516 | [2002L, 'Ohio', 3.6, nan], 517 | [2001L, 'Nevada', 2.4, -1.6], 518 | [2002L, 'Nevada', 2.9, -1.7]], dtype=object) 519 | 520 | -------------------------------------------------------------------------------- /1-Introduction-to-machine-learning-software-stack/1.3/1.3d Machine learning with scikit learn: -------------------------------------------------------------------------------- 1 | To begin with, let us load the scikit learn library and import the module that contains the various functions in order to extract the inbuilt datasets 2 | 3 | In [5]: from sklearn.datasets import load_iris,load_boston,make_classification, make_circles, make_moons 4 | 5 | 6 | First, we will look into the ‘Iris flower data set’ by Ronald Fisher. 7 | This dataset would be used for classification problem. 8 | 9 | The load_iris function returns a dictionary object 10 | 11 | In [6]: data = load_iris() 12 | 13 | 14 | The predictor x, response variable y, response variable names and feature names can be extracted by querying the dictionary object with the appropriate keys 15 | 16 | In [7]: x = data['data'] 17 | 18 | In [8]: y = data['target'] 19 | 20 | In [9]: y_labels = data['target_names'] 21 | 22 | In [10]: x_labels = data['feature_names'] 23 | 24 | 25 | 26 | Let’s print to see the values 27 | 28 | In [11]: print 29 | 30 | In [12]: print x.shape 31 | (150, 4) 32 | 33 | In [13]: print y.shape 34 | (150,) 35 | 36 | In [14]: print x_labels 37 | ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'] 38 | 39 | In [15]: print y_labels 40 | ['setosa' 'versicolor' 'virginica'] 41 | 42 | 43 | We can see that predictors have 150 instances and 4 attributes. 44 | The response variable has 150 instances and a class label for each of the rows in our predictor set. 45 | We are then printing out the attribute names, petal and sepal width and length, and finally, the class labels. 46 | 47 | 48 | 49 | Now we will see some features with the Boston Housing Dataset. 50 | This dataset would be used for regression problem. 51 | The data is loaded pretty much the same way as is done above for iris flower data set. 
52 | 53 | In [16]: data = load_boston() 54 | 55 | 56 | The various components of the data, including the predictors and response variables, are queried using the respective keys from the dictionary. 57 | 58 | In [17]: x = data['data'] 59 | 60 | In [18]: y = data['target'] 61 | 62 | In [19]: x_labels = data['feature_names'] 63 | 64 | 65 | Let's print these variables to understand more. 66 | 67 | In [20]: print 68 | 69 | In [21]: print x.shape 70 | (506, 13) 71 | 72 | In [22]: print y.shape 73 | (506,) 74 | 75 | In [23]: print x_labels 76 | ['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 77 | 'B' 'LSTAT'] 78 | 79 | We can see that the predictor set x has 506 instances and 13 attributes. 80 | The response variable has 506 entries. 81 | Finally, we have also printed out the names of the attributes. 82 | 83 | 84 | 85 | Scikit-learn has also provided functions that will help us produce a random classification dataset with some desired properties 86 | 87 | # make some classification dataset 88 | 89 | The make_classification function is a function that can be used to generate a classification dataset. 90 | 91 | In [24]: x,y = make_classification(n_samples = 50, n_features = 5, n_classes = 2) 92 | 93 | In this example, we generated a dataset with 50 instances that are dictated by the n_samples parameter, five attributes, n_features parameters, and two classes set by the n_classes parameter. 94 | 95 | Let’s print to analyze the output of this function. 96 | 97 | In [25]: print 98 | 99 | In [26]: print x.shape 100 | (50, 5) 101 | 102 | In [27]: print y.shape 103 | (50,) 104 | 105 | In [28]: print x[1,:] 106 | [-0.06371735 0.24424535 -0.3383024 0.84448991 0.94607483] 107 | 108 | In [29]: print y[1] 109 | 0 110 | 111 | 112 | We can see that predictor x has 150 instances with 5 features. 113 | The response variable has 150 instances, with a class label for each of the prediction instances. 114 | On printing the second record in our predictor set x, we can see that we have a vector of dimension 5, relating to the five features that we requested. 115 | Also, we will print the response variable y. For the second row of our predictors, the class label is 0. 116 | 117 | 118 | 119 | Scikit learn also provide us with functions that can generate data with non-linear relationships. 120 | 121 | In [30]: x,y = make_circles() 122 | 123 | In [31]: import numpy as np 124 | 125 | In [32]: import matplotlib.pyplot as plt 126 | 127 | In [33]: plt.close('all') 128 | 129 | In [34]: plt.figure(1) 130 | Out[34]: 131 | 132 | In [35]: plt.scatter(x[:,0],x[:,1],c=y) 133 | Out[35]: 134 | 135 | In [36]: plt.show() 136 | 137 | 138 | Let us look at the plot to understand non – linear relationship. (Plot to be found in the docx file) 139 | Our classification has produced two concentric circles. x is a dataset with two variables. Variable y is the class label. As shown by the concentric circle, the relationship between our prediction variable is nonlinear. 140 | 141 | Another interesting function to produce a nonlinear relationship is make_moons from scikit-learn. 142 | 143 | In [39]: x,y = make_moons() 144 | 145 | In [40]: plt.figure(2) 146 | Out[40]: 147 | 148 | In [42]: plt.scatter(x[:,0],x[:,1],c=y) 149 | Out[42]: 150 | 151 | In [43]: plt.show() 152 | 153 | Lets look at the plot to understand more (Plot to be found in the docx file) 154 | 155 | 156 | The crescent-shaped plot shows that the attributes in our predictor set x are nonlinearly related to each other. 
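Both generator functions also accept a few optional keyword arguments that make these toy datasets a little more realistic, such as noise, which jitters the points with Gaussian noise, and random_state, which makes the output reproducible. A small standalone sketch (parameter defaults may differ slightly between scikit-learn versions):

import matplotlib.pyplot as plt
from sklearn.datasets import make_moons

# 200 samples, with Gaussian noise added and a fixed seed for reproducibility
x, y = make_moons(n_samples=200, noise=0.1, random_state=0)

plt.figure(3)
plt.scatter(x[:, 0], x[:, 1], c=y)
plt.title('make_moons with noise')
plt.show()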
157 | 158 | 159 | 160 | 161 | Let us now learn the API structures of Scikit learn. One of the major advantages of using scikit-learn is its clean API structure. All the data modeling classes deriving from the BaseEstimator class have to strictly implement the fit and transform functions. We will see some examples regarding this. 162 | 163 | Lets see how we can put some machine learning functionalities in scikit learn 164 | 165 | In [44]: import numpy as np 166 | 167 | Lets start with the preprocessing module of scikit learn 168 | 169 | In [45]: from sklearn.preprocessing import PolynomialFeatures 170 | 171 | We will use the PolynomialFeatures class in order to demonstrate the ease of using scikit-learn's SDK. 172 | 173 | With a set of predictor variables, we may want to add some more variables to our predictor set in order to see if our model accuracy can be improved. We can use the polynomials of the existing features as a new feature. The PolynomialFeatures class helps us do this. 174 | 175 | We will first create a dataset. In this case, our dataset has two instances and two attributes – 176 | 177 | In [46]: x = np.asmatrix([[1,2],[2,4]]) 178 | 179 | We will proceed to instantiate our PolynomialFeatures class with the required degree of polynomials. In this case, it will be a second degree 180 | 181 | In [47]: poly = PolynomialFeatures(degree = 2) 182 | 183 | There are two functions, fit and transform. The fit function is used to do the necessary calculations for the transformation. The transform function takes the input and, based on the calculations performed by fit, transforms the given input. 184 | 185 | In [48]: poly.fit(x) 186 | Out[48]: PolynomialFeatures(degree=2, include_bias=True, interaction_only=False) 187 | 188 | In [49]: x_poly = poly.transform(x) 189 | 190 | In [50]: print "Original x variable shape", x.shape 191 | Original x variable shape (2, 2) 192 | 193 | 194 | #original x values 195 | 196 | In [51]: print x 197 | [[1 2] 198 | [2 4]] 199 | 200 | In [52]: print 201 | 202 | In [53]: print "Transformed x variables", x_poly.shape 203 | Transformed x variables (2, 6) 204 | 205 | 206 | # transformed x values 207 | 208 | In [54]: print x_poly 209 | [[ 1. 1. 2. 1. 2. 4.] 210 | [ 1. 2. 4. 4. 8. 16.]] 211 | 212 | 213 | Alternatively, in this case, fit and transform can be called in one shot and the output would be the same. 214 | 215 | In [55]: x_poly = poly.fit_transform(x) 216 | 217 | 218 | Any class that implements a machine learning method in scikit-learn has to deliver from BaseEstimator. BaseEstimator expects that the implementation class provides both the fit and transform methods. This way the API is kept very clean. 219 | 220 | Lets see another example. Here we imported a class called DecisionTreeClassifier from the module tree. 
DecisionTreeClassifier implements the decision tree algorithm 221 | 222 | In [56]: from sklearn.tree import DecisionTreeClassifier 223 | 224 | In [57]: from sklearn.datasets import load_iris 225 | 226 | In [58]: data = load_iris() 227 | 228 | In [59]: x = data['data'] 229 | 230 | In [60]: y = data['target'] 231 | 232 | In [61]: estimator = DecisionTreeClassifier() 233 | 234 | In [62]: estimator.fit(x,y) 235 | Out[62]: 236 | DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None, 237 | max_features=None, max_leaf_nodes=None, min_samples_leaf=1, 238 | min_samples_split=2, min_weight_fraction_leaf=0.0, 239 | presort=False, random_state=None, splitter='best') 240 | 241 | In [63]: predicted_y = estimator.predict(x) 242 | 243 | In [64]: predicted_y_prob = estimator.predict_proba(x) 244 | 245 | In [65]: predicted_y_lprob = estimator.predict_log_proba(x) 246 | 247 | We have used the iris dataset to see how the tree algorithm can be used. First, we have loaded the iris dataset in the x and y variables. We then instantiate DecisonTreeClassifier. We proceeded to build the model by invoking the fit function and passing our x predictor and y response variable. This built the tree model. 248 | 249 | Now, we are ready with our model to do some predictions. We have used the predict function in order to predict the class labels for the given input. As we can see, we leveraged the same fit and predict method as in PolynomialFeatures. There are two other methods, predict_proba - which gives the probability of the prediction, and predict_log_proba – which provides the logarithm of the prediction probability. 250 | 251 | 252 | We will see another interesting utility called pipelining. Various machine learning methods can be chained together using pipe lining. 253 | 254 | In [66]: from sklearn.pipeline import Pipeline 255 | 256 | In [67]: poly = PolynomialFeatures(n=3) 257 | 258 | In [70]: tree_estimator = DecisionTreeClassifier() 259 | 260 | 261 | Let's start by instantiating the data processing routines, PolynomialFeatures and DecisionTreeClassifier 262 | 263 | In [71]: steps = [('poly',poly),('tree',tree_estimator)] 264 | 265 | We will define a list of tuples to indicate the order of our chaining. We want to run the polynomial feature generation, followed by our decision tree 266 | 267 | In [72]: estimator = Pipeline(steps = steps) 268 | 269 | In [73]: estimator.fit(x,y) 270 | Out[73]: 271 | Pipeline(steps=[('poly', PolynomialFeatures(degree=2, include_bias=True, interaction_only=False)), ('tree', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None, 272 | max_features=None, max_leaf_nodes=None, min_samples_leaf=1, 273 | min_samples_split=2, min_weight_fraction_leaf=0.0, 274 | presort=False, random_state=None, splitter='best'))]) 275 | 276 | In [74]: predicted_y = estimator.predict(x) 277 | 278 | We can now instantiate our Pipeline object with the list declared using the steps variable. Now, we can proceed as usual by calling the fit and predict methods. 279 | 280 | We can invoke the named_steps attribute in order to inspect the models in the various stages of our pipeline. 
281 | 282 | In [75]: estimator.named_steps 283 | Out[75]: 284 | {'poly': PolynomialFeatures(degree=2, include_bias=True, interaction_only=False), 285 | 'tree': DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None, 286 | max_features=None, max_leaf_nodes=None, min_samples_leaf=1, 287 | min_samples_split=2, min_weight_fraction_leaf=0.0, 288 | presort=False, random_state=None, splitter='best')} 289 | -------------------------------------------------------------------------------- /1-Introduction-to-machine-learning-software-stack/1.3/Machine Learning with Scikit Learn.docx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crowd-course/datascience/f5961c20c4052566b1b5a9fc0699c8cadb6147f5/1-Introduction-to-machine-learning-software-stack/1.3/Machine Learning with Scikit Learn.docx -------------------------------------------------------------------------------- /1-Introduction-to-machine-learning-software-stack/1.3/Matplotlib.docx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crowd-course/datascience/f5961c20c4052566b1b5a9fc0699c8cadb6147f5/1-Introduction-to-machine-learning-software-stack/1.3/Matplotlib.docx -------------------------------------------------------------------------------- /2-intro-to-ml/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crowd-course/datascience/f5961c20c4052566b1b5a9fc0699c8cadb6147f5/2-intro-to-ml/README.md -------------------------------------------------------------------------------- /3-project-overview/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crowd-course/datascience/f5961c20c4052566b1b5a9fc0699c8cadb6147f5/3-project-overview/README.md -------------------------------------------------------------------------------- /4-regression/4.3 - Regularization and Model Evaluation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Optimizing the Models " 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Welcome to the practical section of module 4.3. Here we'll continue with the advertising-sales dataset to investigate the ideas of regularization and model evaluation. We'll continue with the multivariate regression model we build in the previous module and we'll be looking into tuning the regularization parameter to achieve the most accurate model and we'll evaluate this accuracy using better metrics than MSE which we have been using in the previous modules." 
15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 2, 20 | "metadata": { 21 | "collapsed": true 22 | }, 23 | "outputs": [], 24 | "source": [ 25 | "import pandas as pd\n", 26 | "import numpy as np\n", 27 | "from sklearn.linear_model import SGDRegressor\n", 28 | "from sklearn.preprocessing import StandardScaler\n", 29 | "\n", 30 | "%matplotlib inline\n", 31 | "import matplotlib.pyplot as plt\n", 32 | "plt.rcParams['figure.figsize'] = (10, 10)" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "metadata": {}, 38 | "source": [ 39 | "# Building the Model" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "In the following you'll see the same code (without visualization) we wrote in the previous module for the rgeression model using both **TV** and **Newspaper** data, so it's nothing new, **except** for the part where we prepare our data. We'll be splitting the dataset into three parts now instead of two:\n", 47 | "* **Training Set**: we'll train the model on this\n", 48 | "* **Validation Set**: we'll be tuning hyperparameters on this (more on that later)\n", 49 | "* **Tests Set**: we'll be evaluating our model on this" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 3, 55 | "metadata": { 56 | "collapsed": true 57 | }, 58 | "outputs": [], 59 | "source": [ 60 | "def scale_features(X, scalar=None):\n", 61 | " if(len(X.shape) == 1):\n", 62 | " X = X.reshape(-1, 1)\n", 63 | " \n", 64 | " if scalar == None:\n", 65 | " scalar = StandardScaler()\n", 66 | " scalar.fit(X)\n", 67 | " \n", 68 | " return scalar.transform(X), scalar" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": 3, 74 | "metadata": { 75 | "collapsed": false 76 | }, 77 | "outputs": [], 78 | "source": [ 79 | "# get the advertising data set\n", 80 | "\n", 81 | "dataset = pd.read_csv('../datasets/Advertising.csv')\n", 82 | "dataset = dataset[[\"TV\", \"Radio\", \"Newspaper\", \"Sales\"]] # filtering the Unamed index column out of the dataset" 83 | ] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "execution_count": 7, 88 | "metadata": { 89 | "collapsed": true 90 | }, 91 | "outputs": [], 92 | "source": [ 93 | "dataset_size = len(dataset)\n", 94 | "training_size = np.floor(dataset_size * 0.6).astype(int)\n", 95 | "validation_size = np.floor(dataset_size * 0.2).astype(int)\n", 96 | "\n", 97 | "# First we split the shuffled dataset into three parts: training, validation and test\n", 98 | "X_training = dataset[[\"TV\", \"Newspaper\"]][:training_size]\n", 99 | "y_training = dataset[\"Sales\"][:training_size]\n", 100 | "\n", 101 | "X_validation = dataset[[\"TV\", \"Newspaper\"]][training_size:training_size + validation_size]\n", 102 | "y_validation = dataset[\"Sales\"][training_size:training_size + validation_size]\n", 103 | "\n", 104 | "X_test = dataset[[\"TV\", \"Newspaper\"]][training_size:training_size + validation_size:]\n", 105 | "y_test = dataset[\"Sales\"][training_size:training_size + validation_size:]\n", 106 | "\n", 107 | "# Second we apply feature scaling on X_training and X_test\n", 108 | "X_training, training_scalar = scale_features(X_training)\n", 109 | "X_validation,_ = scale_features(X_validation, scalar=training_scalar)\n", 110 | "X_test,_ = scale_features(X_test, scalar=training_scalar)" 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": 40, 116 | "metadata": { 117 | "collapsed": false 118 | }, 119 | "outputs": [ 120 | { 121 | "name": "stdout", 122 | "output_type": "stream", 123 | "text": 
[ 124 | "Trained model: y = 11.55 + 3.32x₁ + 0.83x₂\n", 125 | "The Test Data MSE is: 16.852\n" 126 | ] 127 | } 128 | ], 129 | "source": [ 130 | "model = SGDRegressor(loss='squared_loss')\n", 131 | "model.fit(X_training, y_training)\n", 132 | "\n", 133 | "w0 = model.intercept_\n", 134 | "w1 = model.coef_[0] # Notice that model.coef_ is a list now not a single number\n", 135 | "w2 = model.coef_[1]\n", 136 | "\n", 137 | "print \"Trained model: y = %0.2f + %0.2fx₁ + %0.2fx₂\" % (w0, w1, w2)\n", 138 | "\n", 139 | "MSE = np.mean((y_test - model.predict(X_test)) ** 2)\n", 140 | "\n", 141 | "print \"The Test Data MSE is: %0.3f\" % (MSE)" 142 | ] 143 | }, 144 | { 145 | "cell_type": "markdown", 146 | "metadata": {}, 147 | "source": [ 148 | "# L2 Regularization" 149 | ] 150 | }, 151 | { 152 | "cell_type": "markdown", 153 | "metadata": {}, 154 | "source": [ 155 | "From the videos, we learned that the idea of **regularization** is introduced to prevent the model from overfitting to the data points by adding a penality for large weights values. Such penality is expressed mathematically with the second term of the cost function:\n", 156 | "\n", 157 | "$$ J(W) = \\sum_{i=1}^{m} (h_w(X^{(i)} - y^{(i)})^2 + \\lambda \\sum_{j=1}^{n} w_j^2 $$\n", 158 | "\n", 159 | "This is called **L2 Regularization** and $\\lambda$ is called the **Regularization Parameter** , How can we implment it then with scikit-learn for our models?\n", 160 | "\n", 161 | "Well, no worries, scikit-learn implements that for you and we have been using it all the time.\n", 162 | "The **SGDRegressor** constructs has two arguments that define the behavior of the penality:\n", 163 | "* *penalty*: which is a string specifying the type of penality to use (default to 'l2')\n", 164 | "* *alpha*: which is the value of the $\\lambda$ in the equation above\n", 165 | "\n", 166 | "Now let's play with the value of alpha and see how does that affect our model's accuracy. Let's set alpha to a large number say 1. In this case we give the values of the weights a very harsh penalty so they'll end up smaller than they should be and the accuracy should be worse!" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": 41, 172 | "metadata": { 173 | "collapsed": false 174 | }, 175 | "outputs": [ 176 | { 177 | "name": "stdout", 178 | "output_type": "stream", 179 | "text": [ 180 | "Trained model: y = 11.53 + 1.97x₁ + 0.48x₂\n", 181 | "The Test Data MSE is: 19.423\n" 182 | ] 183 | } 184 | ], 185 | "source": [ 186 | "model = SGDRegressor(loss='squared_loss', alpha=1)\n", 187 | "model.fit(X_training, y_training)\n", 188 | "\n", 189 | "w0 = model.intercept_\n", 190 | "w1 = model.coef_[0] # Notice that model.coef_ is a list now not a single number\n", 191 | "w2 = model.coef_[1]\n", 192 | "\n", 193 | "print \"Trained model: y = %0.2f + %0.2fx₁ + %0.2fx₂\" % (w0, w1, w2)\n", 194 | "\n", 195 | "MSE = np.mean((y_test - model.predict(X_test)) ** 2)\n", 196 | "\n", 197 | "print \"The Test Data MSE is: %0.3f\" % (MSE)" 198 | ] 199 | }, 200 | { 201 | "cell_type": "markdown", 202 | "metadata": {}, 203 | "source": [ 204 | "The effect the value of the regularization parameter has on the model's accuracy makes a very good candidate for tuning. We can use the validation data set we created for that purpose. We create a list of possible values for the regularization parameter, we train the model using each of these value and evaluate the model using the validation set. 
The value with the best evaluation (least MSE) is the best value for the regularization parameter." 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": 61, 210 | "metadata": { 211 | "collapsed": false 212 | }, 213 | "outputs": [ 214 | { 215 | "name": "stdout", 216 | "output_type": "stream", 217 | "text": [ 218 | "The Best alpha is: 0.0001\n", 219 | "The Test Data MSE is: 16.985\n" 220 | ] 221 | } 222 | ], 223 | "source": [ 224 | "alphas = [0.00025, 0.00005, 0.0001, 0.0002, 0.0004]\n", 225 | "best_alpha = alphas[0]\n", 226 | "least_mse = float(\"inf\") #initialized to infinity\n", 227 | "for possible_alpha in alphas:\n", 228 | " model = SGDRegressor(loss='squared_loss', alpha=possible_alpha)\n", 229 | " model.fit(X_training, y_training)\n", 230 | " \n", 231 | " mse = np.mean((y_validation - model.predict(X_validation)) ** 2)\n", 232 | " if mse <= least_mse:\n", 233 | " least_mse = mse\n", 234 | " best_alpha = possible_alpha\n", 235 | "\n", 236 | "print \"The Best alpha is: %.4f\" % (best_alpha) \n", 237 | "\n", 238 | "best_model = SGDRegressor(loss='squared_loss', alpha=best_alpha)\n", 239 | "best_model.fit(X_training, y_training)\n", 240 | "MSE = np.mean((y_test - best_model.predict(X_test)) ** 2) # evaluating the best model on test data\n", 241 | "\n", 242 | "print \"The Test Data MSE is: %0.3f\" % (MSE)" 243 | ] 244 | }, 245 | { 246 | "cell_type": "markdown", 247 | "metadata": {}, 248 | "source": [ 249 | "There's a better way to tune the regularization parameter and possiblby multiple parameters at the same time. This way through scikit-learn's [GridSearchCV](http://scikit-learn.org/stable/modules/grid_search.html). We'll not be working with that here, but you're encouraged to read the documentation and user guides and try for yourself how it could be done. Once you got the hang of it, you can maybe try and tune the learning rate and the regularization parameter at the same time! " 250 | ] 251 | }, 252 | { 253 | "cell_type": "markdown", 254 | "metadata": {}, 255 | "source": [ 256 | "# The R-squared Metric" 257 | ] 258 | }, 259 | { 260 | "cell_type": "markdown", 261 | "metadata": {}, 262 | "source": [ 263 | "The Last thing we have here is to see how we can evaluate our model using the $R^2$ metric. We learned in the videos that the $R^2$ metric measures how close the data points are to our regression line (or plane). We also learned that there's an adjusted version of that metric denoted by $\\overline{R^2}$ that penalizes for the extra features we add to the model that doesn't help the model be more accurate. Those metric can be calculated using the following formulas:\n", 264 | "\n", 265 | "$$R^2 = 1 - \\frac{\\sum_{i=1}^{n}(y_i - f_i)^2}{\\sum_{i=1}^{n}(y_i - \\overline{y})^2}$$" 266 | ] 267 | }, 268 | { 269 | "cell_type": "markdown", 270 | "metadata": {}, 271 | "source": [ 272 | "where $f_i$ is our model prediction and $overline{y}$ is the mean of all n $y_i$s. And for the adjusted version:\n", 273 | "\n", 274 | "$$\\overline{R^2} = R^2 - \\frac{k - 1}{n - k}(1 - R^2)$$" 275 | ] 276 | }, 277 | { 278 | "cell_type": "markdown", 279 | "metadata": {}, 280 | "source": [ 281 | "where $k$ is the number of fatures and $n$ is the number of data samples. Both $R^2$ and $\\overline{R^2}$ take a value less than or equal to **1**.The closer it is to one, the better our model is.\n", 282 | "\n", 283 | "Fortunately, we don't have to do all these calculations by hand to use this metric with scikit-learn. The model's **score** method does that for us. 
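If you want to check this metric by hand, or compute the adjusted version yourself (as far as I can tell from the scikit-learn documentation, `score` returns the plain $R^2$, so the $\overline{R^2}$ correction has to be applied on top of it), both formulas above translate directly into NumPy. This sketch reuses `model`, `X_test` and `y_test` from the earlier cells:

```python
predictions = model.predict(X_test)

ss_res = np.sum((y_test - predictions) ** 2)        # sum of squared residuals
ss_tot = np.sum((y_test - np.mean(y_test)) ** 2)    # total sum of squares
R2 = 1 - ss_res / ss_tot

n = len(y_test)   # number of test samples
k = 2             # number of features in this model (TV and Newspaper)
R2_adjusted = R2 - (k - 1.0) / (n - k) * (1 - R2)   # adjusted R-squared, using the notebook's formula

print("R^2 = %0.3f, adjusted R^2 = %0.3f" % (R2, R2_adjusted))
print("model.score(X_test, y_test) = %0.3f" % model.score(X_test, y_test))
```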
It takes the test Xs and ys and spits out the value of $\\overline{R^2}$" 284 | ] 285 | }, 286 | { 287 | "cell_type": "code", 288 | "execution_count": 64, 289 | "metadata": { 290 | "collapsed": false 291 | }, 292 | "outputs": [ 293 | { 294 | "name": "stdout", 295 | "output_type": "stream", 296 | "text": [ 297 | "Trained model: y = 13.82 + 3.90x₁ + 0.98x₂\n", 298 | "The Model's Adjusted R² on Test Data is 0.65\n" 299 | ] 300 | } 301 | ], 302 | "source": [ 303 | "model = SGDRegressor(loss='squared_loss', eta0=0.02)\n", 304 | "model.fit(X_training, y_training)\n", 305 | "\n", 306 | "w0 = model.intercept_\n", 307 | "w1 = model.coef_[0] # Notice that model.coef_ is a list now not a single number\n", 308 | "w2 = model.coef_[1]\n", 309 | "\n", 310 | "print \"Trained model: y = %0.2f + %0.2fx₁ + %0.2fx₂\" % (w0, w1, w2)\n", 311 | "\n", 312 | "R2_adjusted = model.score(X_test, y_test)\n", 313 | "\n", 314 | "print \"The Model's Adjusted R² on Test Data is %0.2f\" % (R2_adjusted)" 315 | ] 316 | }, 317 | { 318 | "cell_type": "markdown", 319 | "metadata": {}, 320 | "source": [ 321 | "# Exercise\n", 322 | "Apply the ideas of L2 Regularization and $R^2$ metric to the exercises you did in the last two modules.\n", 323 | "\n", 324 | "# Research Idea\n", 325 | "Download [Kaggle's 2016 US Election Dataset](https://www.kaggle.com/benhamner/2016-us-election/) and explore the data using what you learned in Linear Regression. Make assumptions about the data correlations and dependence and test your assumptions using what you learned. If had interesting results, publish your code and your results to the [Script's Repo](https://www.kaggle.com/benhamner/2016-us-election/scripts) and share them with the community." 326 | ] 327 | }, 328 | { 329 | "cell_type": "code", 330 | "execution_count": null, 331 | "metadata": { 332 | "collapsed": true 333 | }, 334 | "outputs": [], 335 | "source": [] 336 | } 337 | ], 338 | "metadata": { 339 | "kernelspec": { 340 | "display_name": "Python 2", 341 | "language": "python", 342 | "name": "python2" 343 | }, 344 | "language_info": { 345 | "codemirror_mode": { 346 | "name": "ipython", 347 | "version": 2 348 | }, 349 | "file_extension": ".py", 350 | "mimetype": "text/x-python", 351 | "name": "python", 352 | "nbconvert_exporter": "python", 353 | "pygments_lexer": "ipython2", 354 | "version": "2.7.6" 355 | } 356 | }, 357 | "nbformat": 4, 358 | "nbformat_minor": 0 359 | } 360 | -------------------------------------------------------------------------------- /4-regression/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crowd-course/datascience/f5961c20c4052566b1b5a9fc0699c8cadb6147f5/4-regression/README.md -------------------------------------------------------------------------------- /5-classification/5.4 - Artificial Neural Networks in Classifying Breast Cancer.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Example 5.4: Classifying Malignant/Benign Breast Tumors with Artificial Neural Networks" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Welcome to the practical section of module 5.4. 
Here we'll be using **Artificial Neural Networks** with the [Wisconsin Breast Cancer Database](http://bit.ly/1IoTs7x) just like in the practical example for module 5.2 to predict whether a patient's tumor is benign or malignant based on tumor cell charactaristics. This is just one example from many to which machine learning and classification could offer great insights and aid. **Make sure** to delete any rows with missing data (which will contain a **\"?\"** character in a feature cell).\n", 15 | "\n", 16 | "By the end of the module, we'll have a trained an artificial neural network model on the a subset of the features presented in the dataset that is very accurate at diagnosing the condition of the tumor based on these features. We'll also see how we can make interseting inferences from the model that could be helpful for the physicians in diagnosing cancer." 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "metadata": {}, 22 | "source": [ 23 | "Since scikit-learn's newest **stable** version does not support neural networks / multi-layer perceptrons, we will be using the [scikit-neuralnetwork](https://scikit-neuralnetwork.readthedocs.io/en/latest/) third-party implementation. To install scikit-neuralnetwork, please consult the [Installation](https://scikit-neuralnetwork.readthedocs.io/en/latest/guide_installation.html) section of the documentation." 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "First, we will import all our dependencies. Make sure to install all of these separately:" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 52, 36 | "metadata": { 37 | "collapsed": true 38 | }, 39 | "outputs": [], 40 | "source": [ 41 | "import pandas as pd # we use this library to import a CSV of cancer tumor data\n", 42 | "import numpy as np # we use this library to help us represent traditional Python arrays/lists as matrices/tensors with linear algebra operations\n", 43 | "\n", 44 | "from sknn.mlp import Classifier, Layer # we use this library for the actual neural network code\n", 45 | "from sklearn.utils import shuffle # we use this library for randomly shuffling arrays/tensors\n", 46 | "\n", 47 | "import sys # we use for accessing output window \n", 48 | "import logging # we use this library for outputting real time statistics/updates on training progress\n", 49 | "logging.basicConfig(format=\"%(message)s\", level=logging.DEBUG, stream=sys.stdout) # set the logging mode to DEBUG to output training information, use \"INFO\" for less volume of output" 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": {}, 55 | "source": [ 56 | "# The Data\n", 57 | "\n", 58 | "We'll start off by exploring our dataset to see what attributes we have and how the class of the tumor is represented" 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "metadata": {}, 64 | "source": [ 65 | "Before we proceed, ensure to include headers to the dataset provided by the University of Wisconsin. We will use the following headers:\n", 66 | "\n", 67 | "> ID,CT,UCS,UCSh,MA,SECS,BN,BC,NN,M,Class\n", 68 | "\n", 69 | "Add this to the **beginning** line of your .csv file." 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": 2, 75 | "metadata": { 76 | "collapsed": false 77 | }, 78 | "outputs": [ 79 | { 80 | "data": { 81 | "text/html": [ 82 | "
\n", 83 | "\n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | "
IDCTUCSUCShMASECSBNBCNNMClass
010000255111213112
1100294554457103212
210154253111223112
310162776881343712
410170234113213112
510171228101087109714
6101809911112103112
710185612121213112
810330782111211152
910330784211212112
\n", 243 | "
" 244 | ], 245 | "text/plain": [ 246 | " ID CT UCS UCSh MA SECS BN BC NN M Class\n", 247 | "0 1000025 5 1 1 1 2 1 3 1 1 2\n", 248 | "1 1002945 5 4 4 5 7 10 3 2 1 2\n", 249 | "2 1015425 3 1 1 1 2 2 3 1 1 2\n", 250 | "3 1016277 6 8 8 1 3 4 3 7 1 2\n", 251 | "4 1017023 4 1 1 3 2 1 3 1 1 2\n", 252 | "5 1017122 8 10 10 8 7 10 9 7 1 4\n", 253 | "6 1018099 1 1 1 1 2 10 3 1 1 2\n", 254 | "7 1018561 2 1 2 1 2 1 3 1 1 2\n", 255 | "8 1033078 2 1 1 1 2 1 1 1 5 2\n", 256 | "9 1033078 4 2 1 1 2 1 2 1 1 2" 257 | ] 258 | }, 259 | "execution_count": 2, 260 | "metadata": {}, 261 | "output_type": "execute_result" 262 | } 263 | ], 264 | "source": [ 265 | "dataset = pd.read_csv('./datasets/breast-cancer-wisconson.csv') # import the CSV data into an array using the panda dependency\n", 266 | "print dataset[:10]" 267 | ] 268 | }, 269 | { 270 | "cell_type": "markdown", 271 | "metadata": {}, 272 | "source": [ 273 | "To understand the meaning of the abbreviations we can consult the [dataset's website](http://bit.ly/1IoTs7x) to find a description of each attribute in order. We are going to train on **all** features unlike the logistic regression example (where we just trained on three). This does mean that we will be unable to visualize the results, but will get a feel for how to work with high-dimensional data.\n", 274 | "\n", 275 | "If you noticed the **Class** attribute at the end (which gives the class of the tumor), you'll find that it takes either 2 or 4, where 2 represents a *benign* tumor while 4 represents a *malignant* tumor. We'll change that to more expressive values and make a benign tumor represented by 0 (false) and mlignants by 1s (true).\n", 276 | "\n", 277 | "You'll notice that the **ID** attribute of data that is useless to our modelling, since it provides no information about the tumor itself, and is instead a way of identifying a specific tumor. We will hence strip this from our dataset before training." 278 | ] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "execution_count": null, 283 | "metadata": { 284 | "collapsed": false 285 | }, 286 | "outputs": [], 287 | "source": [ 288 | "dataset = dataset[[\"CT\", \"UCS\", \"UCSh\", \"MA\", \"SECS\", \"BN\", \"BC\", \"NN\", \"M\", \"Class\"]] # remove the ID attribute from the dataset\n", 289 | "dataset.is_copy = False # this is just to hide a nasty warning!\n", 290 | "[0 if tclass == 2 else 1 for tclass in dataset[\"Class\"]] # convert Class attributes to 0/1 if they are 2/4 in dataset[\"Class\"] column" 291 | ] 292 | }, 293 | { 294 | "cell_type": "markdown", 295 | "metadata": {}, 296 | "source": [ 297 | "We will now need to split up the dataset into two separate tensors: **X** and **y**. **X** will contain the features and their values for each training example, and **y** will contain all the outputs. In this training set, let's say **m** refers to the number of training examples (of which there are just over 600) and **n** refers to the number of features (of which there are 9). Thus, **X** will be a matrix where $X\\in \\mathbb{R}^{m\\:\\cdot \\:n}$ and **y** is a vector where $y\\in \\mathbb{R}^m$ (because we only have one output - a probability of the tumor being malignant).\n", 298 | "\n", 299 | "We simply separate by the **\"Class\"** attribute and the other features." 
300 | ] 301 | }, 302 | { 303 | "cell_type": "code", 304 | "execution_count": null, 305 | "metadata": { 306 | "collapsed": true 307 | }, 308 | "outputs": [], 309 | "source": [ 310 | "X = np.array(dataset[[\"CT\", \"UCS\", \"UCSh\", \"MA\", \"SECS\", \"BN\", \"BC\", \"NN\", \"M\"]]) # X is composed of all the n feature columns\n", 311 | "y = np.array(dataset[\"Class\"]) # y is composed of just the output class column" 312 | ] 313 | }, 314 | { 315 | "cell_type": "markdown", 316 | "metadata": {}, 317 | "source": [ 318 | "# Training the Neural Network Model" 319 | ] 320 | }, 321 | { 322 | "cell_type": "markdown", 323 | "metadata": {}, 324 | "source": [ 325 | "For the training, we are going to split the dataset into a training set and a test set. The training set will be 70% of the original data set, and will be what the neural network will learn from. We will test the accuracy of the neural network's learned weights by using the test set, which is composed of 30% of the original data." 326 | ] 327 | }, 328 | { 329 | "cell_type": "markdown", 330 | "metadata": {}, 331 | "source": [ 332 | "It's **very important** to **shuffle** the dataset before partitioning into training/test sets. Why? Because the data given to us by University of Wisconsin may be in some sorted (apparent or unapparent) order. It may be that most y=0 examples are in the latter half of the data. It may be that most of the nearby recorded patients are similar, or were recorded during similar dates/times. We want none of that - we want to remove any information and ensure that our partitioning is random, so that our test results represent true probabilites of picking a random training case from an entire population of permutations of the feature vector. For example, I once got a 5% difference in test results since the data I had was sorted beforehand. Essentially, we want to make sure that we're being absolutely fair about the partition, and not accidentally making our test results too good/too bad.\n", 333 | "\n", 334 | "We can't use numpy's traditional shuffle function because this shuffles one array only. If we independently shuffled **x** and **y**, the order between them would be lost (ie. if training case x had output of 1 beforehand, this may be accidentally changed to an output of 0, since we use corresponding indicdes to match up the inputs and outputs)." 
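Two small gaps are worth closing while we are preparing the data: the introduction asked us to drop the rows that contain the "?" placeholder for missing values (a handful of **BN** cells have it), and the list comprehension in the earlier cell builds the 0/1 labels but never assigns them back to the dataframe. A minimal pandas sketch of both steps, assuming it runs before **X** and **y** are built:

```python
dataset = dataset.replace('?', np.nan).dropna()    # drop the rows flagged with the '?' placeholder
dataset["BN"] = dataset["BN"].astype(int)          # the '?' entries made this column text; restore integers
dataset["Class"] = [0 if c == 2 else 1 for c in dataset["Class"]]  # assign the 0/1 labels back
```

With that in place, the synchronized shuffle in the next cell sees a fully numeric, consistently labelled dataset.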
335 | ] 336 | }, 337 | { 338 | "cell_type": "code", 339 | "execution_count": null, 340 | "metadata": { 341 | "collapsed": true 342 | }, 343 | "outputs": [], 344 | "source": [ 345 | "X, y = shuffle(X, y, random_state=0) # we use scikit-learn's synchronized shuffle feature to shuffle two arrays in unison" 346 | ] 347 | }, 348 | { 349 | "cell_type": "code", 350 | "execution_count": 48, 351 | "metadata": { 352 | "collapsed": false 353 | }, 354 | "outputs": [], 355 | "source": [ 356 | "dataset_size = len(dataset) # get the size of overall dataset\n", 357 | "training_size = np.floor(dataset_size * 0.7).astype(int) # get the training size as 70% of the dataset size (or roughly 0.7 * dataset_size) and as an integer\n", 358 | "\n", 359 | "X_train = X[:training_size] # extract the first 70% of inputs for training\n", 360 | "y_train = y[:training_size] # extract the first 70% of inputs for training\n", 361 | "\n", 362 | "X_test = X[training_size:] # extract rest 30% of inputs for testing\n", 363 | "y_test = y[training_size:] # extract rest 30% of outputs for testing" 364 | ] 365 | }, 366 | { 367 | "cell_type": "markdown", 368 | "metadata": {}, 369 | "source": [ 370 | "**scikit-neuralnetwork** offers a really neat, easy to use API (from sknn.mlp import Classifier, Layer) for training neural networks. This API has support for many different paradigms like dropout, momentum, weight decay, mini-batch gradient descent etc. and even different neural network types like Convolutional Neural Networks. Today, however, our goal is to get a simple Artificial Neural Network setup!" 371 | ] 372 | }, 373 | { 374 | "cell_type": "markdown", 375 | "metadata": {}, 376 | "source": [ 377 | "Our first job is to configure the architecture for the neural network. We will need to decide:\n", 378 | "* The number of hidden layers\n", 379 | "* The size of each hidden layer\n", 380 | "* The activation function used at each hidden layer\n", 381 | "* The learning rate, number of iterations, and other hyperparameters\n", 382 | "\n", 383 | "Some types of activation functions offered by scikit-neuralnetwork include:\n", 384 | "* Linear\n", 385 | "* Rectifier\n", 386 | "* Sigmoid\n", 387 | "* Softmax\n", 388 | "\n", 389 | "Where Softmax computes a sigmoid probability distribution over multiple outputs. Generally, it is conventional to use Softmax as the activation function for the output layer (when we have 1 output it really is just the same as a Sigmoid layer, but scikit-neuralnetwork will still output a warning). Recall the formula for the sigmoid activation function:\n", 390 | "\n", 391 | "$Sigmoid\\left(z\\right)=\\frac{1}{1+e^{-z}}$\n", 392 | "\n", 393 | "This function \"squeezes\" any real value into a probability of range (0, 1).\n", 394 | "\n", 395 | "The Linear and Rectifier [(ReLU)](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)) may be used, but we obviously need to use Sigmoid/Softmax in our neural network because we are performing a classification task. Generally, I found that using Rectifier units throughout and then having a Softmax (Sigmoid) layer at the end produced the best results." 396 | ] 397 | }, 398 | { 399 | "cell_type": "markdown", 400 | "metadata": {}, 401 | "source": [ 402 | "Now we need to decide on our structure (as in, the hidden layers and their sizes). Our neural network will end up looking like the following:\n", 403 | "\n", 404 | "\n", 405 | "That is, for this example, we will use two hidden layers (both of which are Rectifiers). 
Generally, the greater number of hidden layers you have, the greater complexity. Two should be fine for our task though. In our neural network, we will have **9** input nodes (for our 9 features), and I have chosen 100 neurons for the first hidden layer as well as 50 for the second. The Softmax layer will just have one output because we only want one output (probability of malignancy). You can **play around** with these numbers, bar the Softmax output layer :) We really need not that many hidden units in each hidden layer, but I want to demonstrate the scale we can create with this API.\n", 406 | "\n", 407 | "**NOTE**: In actuality, our Softmax layer for **this coding example** will have **two outputs**. For classifiers, whether its binary classification or multi-class, scikit-neuralnetwork uses a one-hot-encoded representation of the labels with cross-entropy-loss. This requires that the output layer (the softmax layer) to be a probability distribution over all labels, hence the number of the units in the softmax layer needs to be the number of labels. In binary classification, we have 2 labels: 0 and 1, so the expected behavior is for the softmax layer to have 2 units.\n", 408 | "\n", 409 | "Finally, we need to choose our **hyperparameters**. A **learning rate** of **0.001** and number of iterations/epochs of **100** should suffice. Our code will look like the following:" 410 | ] 411 | }, 412 | { 413 | "cell_type": "code", 414 | "execution_count": null, 415 | "metadata": { 416 | "collapsed": true 417 | }, 418 | "outputs": [], 419 | "source": [ 420 | "nn = Classifier(layers=[ # create a new Classifier object (neural network classifier), and pass all the layers as parameters\n", 421 | " Layer(\"Rectifier\", units=100), # create the first post-input hidden layer, a Rectifier (ReLU) layer of 100 units\n", 422 | " Layer(\"Rectifier\", units=50), # create the second hidden layer, a Rectifier layer of 50 units\n", 423 | " Layer(\"Softmax\"), units=2], # create the final output layer, a Softmax layer that will output two probabilities, as mentioned before\n", 424 | " learning_rate=0.001, n_iter=100) # pass in hyperparameters as a separate parameter to the layers" 425 | ] 426 | }, 427 | { 428 | "cell_type": "markdown", 429 | "metadata": {}, 430 | "source": [ 431 | "Our neural network has been configured and built! Now, we just need to train it using our training set, using the intuitively named function \"fit\":" 432 | ] 433 | }, 434 | { 435 | "cell_type": "code", 436 | "execution_count": null, 437 | "metadata": { 438 | "collapsed": true 439 | }, 440 | "outputs": [], 441 | "source": [ 442 | "nn.fit(X_train, y_train) # begin training using backpropagation!" 443 | ] 444 | }, 445 | { 446 | "cell_type": "markdown", 447 | "metadata": {}, 448 | "source": [ 449 | "The output in my console due to the DEBUG logging looked like this (it's fun to see training in action!):\n", 450 | "\n", 451 | "" 452 | ] 453 | }, 454 | { 455 | "cell_type": "markdown", 456 | "metadata": {}, 457 | "source": [ 458 | "**NOTE**: We do **not** need to prefix 1s to the dataset beforehand to achieve bias terms that vertically shift the decision boundary regions. The API provides this by default." 
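One detail to watch in the construction cell above: as printed, the closing parenthesis of the Softmax layer sits before `units=2` (`Layer("Softmax"), units=2]`), which Python would reject. The call the surrounding text describes would look something like this sketch, with the same layer sizes and hyperparameters:

```python
nn = Classifier(
    layers=[
        Layer("Rectifier", units=100),  # first hidden layer
        Layer("Rectifier", units=50),   # second hidden layer
        Layer("Softmax", units=2)       # output layer: one unit per class label (benign/malignant)
    ],
    learning_rate=0.001,
    n_iter=100)
```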
459 | ] 460 | }, 461 | { 462 | "cell_type": "markdown", 463 | "metadata": {}, 464 | "source": [ 465 | "# Results " 466 | ] 467 | }, 468 | { 469 | "cell_type": "markdown", 470 | "metadata": {}, 471 | "source": [ 472 | "We can see the weights that were produced (including the bias terms) using the following line of code:" 473 | ] 474 | }, 475 | { 476 | "cell_type": "code", 477 | "execution_count": null, 478 | "metadata": { 479 | "collapsed": true 480 | }, 481 | "outputs": [], 482 | "source": [ 483 | "print nn.get_parameters() # output the weights of the neural network" 484 | ] 485 | }, 486 | { 487 | "cell_type": "markdown", 488 | "metadata": {}, 489 | "source": [ 490 | "Unlike logistic regression, the weights of neural networks, unfortunately, are not very interpretable. There are also a high number of these weights. For example, the weights my system produced were outputted as so:\n", 491 | "\n", 492 | "" 493 | ] 494 | }, 495 | { 496 | "cell_type": "markdown", 497 | "metadata": {}, 498 | "source": [ 499 | "However, it is more important to output the **acurracy** of these results, and how much error is made. Unlike logistic regression where we used a cost function that outputted a real number, we are going to find the **percent** error of our system like so:\n", 500 | "\n", 501 | "1. Create predictions in the form of probabilities for each test tumor being malignant\n", 502 | "2. Iterate through the test set with index i\n", 503 | "3. Fetch the predicted output probability of this test tumor\n", 504 | "4. Round this probability to either a 1 (malignant) or a 0 (benign)\n", 505 | "5. Compare this to the correct test output at the ith index\n", 506 | "6. If an error occurs, increment some pre-initialized error counter\n", 507 | "7. Get the percentage error by dividing the error counter by the total number of test examples, multiplying by 100" 508 | ] 509 | }, 510 | { 511 | "cell_type": "code", 512 | "execution_count": null, 513 | "metadata": { 514 | "collapsed": true 515 | }, 516 | "outputs": [], 517 | "source": [ 518 | "error_count = 0.0 # initialize the error counter\n", 519 | "prob_predictions = nn.predict_proba(X_test) # predict the outputs of the X_test instances, in the form of probabilities (hence predict_proba)\n", 520 | "for i in xrange(len(prob_predictions)): # iterate through all the predictions (equal in size to test set)\n", 521 | " # create a discrete decision for the tumor being malignant or benign, using 0.5 as the lowest probability needed for a predicted malignancy (general rounding)\n", 522 | " # as discussed before, our network actually outputs [probability_of_benign, probability_of_malignant], so we will want to\n", 523 | " # fetch the probability_of_malignant value and round this one (that's how it would be for a single output network if it worked!)\n", 524 | " discrete_prediction = 0 if prob_predictions[i][1] < 0.5 else 1\n", 525 | "\tif not y_test[i] == discrete_prediction: # if the actual, correct value for this test tumor does not equal the discrete prediction \n", 526 | "\t\terror_count += 1.0 # increment the number of errors\n", 527 | " \n", 528 | "error_rate = error_count / len(prob_predictions) * 100 # get the percentage of errors by dividing total errors by number of instances, multiplying by 100 \n", 529 | "print error_count # print number of raw errors\n", 530 | "print str(error_rate) + \"%\" # output this error percentage " 531 | ] 532 | }, 533 | { 534 | "cell_type": "markdown", 535 | "metadata": {}, 536 | "source": [ 537 | "My program produced **8** errors in 
total, with an error rate of **3.9%**. This means my neural network had a success/accuracy rate of **96.1%**. This is pretty good! In the future, we can make it even better. We could introduce more complex models with a greater number of hidden layers, or we could perform pre-processing on the input eg. by using normalization. We may also want to apply **regularization**, and ensure that our model is not too complex for the data (we could use more test data to do that - but right now it looks good!). Lastly, we may want to look at each individual error and try to minimize **false negatives** (we predict a patient **does not** have a malignant tumor when they **do** - risky business!) over **false positives**.\n", 538 | "\n", 539 | "There are many other things we can do from here, and this practical example demonstrated the power of neural networks and how we can use an API like scikit-neuralnetwork to get one up and running in no time. Hope y'all had fun!" 540 | ] 541 | } 542 | ], 543 | "metadata": { 544 | "kernelspec": { 545 | "display_name": "Python 2", 546 | "language": "python", 547 | "name": "python2" 548 | }, 549 | "language_info": { 550 | "codemirror_mode": { 551 | "name": "ipython", 552 | "version": 2 553 | }, 554 | "file_extension": ".py", 555 | "mimetype": "text/x-python", 556 | "name": "python", 557 | "nbconvert_exporter": "python", 558 | "pygments_lexer": "ipython2", 559 | "version": "2.7.11" 560 | } 561 | }, 562 | "nbformat": 4, 563 | "nbformat_minor": 0 564 | } 565 | -------------------------------------------------------------------------------- /5-classification/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crowd-course/datascience/f5961c20c4052566b1b5a9fc0699c8cadb6147f5/5-classification/README.md -------------------------------------------------------------------------------- /6-clustering/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crowd-course/datascience/f5961c20c4052566b1b5a9fc0699c8cadb6147f5/6-clustering/README.md -------------------------------------------------------------------------------- /6-clustering/k means visualization/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 |
[The markup of index.html was stripped during extraction; only the page heading "K means visualization" survives. Judging from the element IDs referenced in k-means.js below, the page supplies a #kmeans container holding an svg, numeric inputs #N and #K for the point and cluster counts, buttons #step, #restart and #reset, and script tags loading d3.v3.min.js and k-means.js.]
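The script that follows (k-means.js) alternates the two standard Lloyd steps: `updateGroups` assigns every point to its nearest centre, and `moveCenter` moves each centre to the mean of the points assigned to it. For readers who want the same loop outside the browser, here is a minimal NumPy sketch; the 200 random 2-D points, `k = 3` and 10 iterations are illustrative assumptions, not values taken from the page:

```python
import numpy as np

points = np.random.rand(200, 2)     # random 2-D points, like the dots on the page
centers = np.random.rand(3, 2)      # k = 3 centres at random positions, as the script does

for _ in range(10):                 # a few Lloyd iterations
    # assignment step: index of the nearest centre for every point
    dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    # update step: move each centre to the mean of its assigned points (skip empty clusters)
    for j in range(len(centers)):
        if np.any(labels == j):
            centers[j] = points[labels == j].mean(axis=0)
```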
42 | -------------------------------------------------------------------------------- /6-clustering/k means visualization/k-means.js: -------------------------------------------------------------------------------- 1 | var flag = false; 2 | var WIDTH = d3.select("#kmeans")[0][0].offsetWidth - 20; 3 | var HEIGHT = Math.max(300, WIDTH * .7); 4 | var svg = d3.select("#kmeans svg") 5 | .attr('width', WIDTH) 6 | .attr('height', HEIGHT) 7 | .style('padding', '10px') 8 | .style('background', '#223344') 9 | .style('cursor', 'pointer') 10 | .style('-webkit-user-select', 'none') 11 | .style('-khtml-user-select', 'none') 12 | .style('-moz-user-select', 'none') 13 | .style('-ms-user-select', 'none') 14 | .style('user-select', 'none') 15 | .on('click', function() { 16 | d3.event.preventDefault(); 17 | step(); 18 | }); 19 | 20 | d3.selectAll("#kmeans button") 21 | .style('padding', '.5em .8em'); 22 | 23 | d3.selectAll("#kmeans label") 24 | .style('display', 'inline-block') 25 | .style('width', '15em'); 26 | 27 | var lineg = svg.append('g'); 28 | var dotg = svg.append('g'); 29 | var centerg = svg.append('g'); 30 | d3.select("#step") 31 | .on('click', function() { step(); draw(); }); 32 | d3.select("#restart") 33 | .on('click', function() { restart(); draw(); }); 34 | d3.select("#reset") 35 | .on('click', function() { init(); draw(); }); 36 | 37 | 38 | var groups = [], dots = []; 39 | 40 | function step() { 41 | d3.select("#restart").attr("disabled", null); 42 | if (flag) { 43 | moveCenter(); 44 | draw(); 45 | } else { 46 | updateGroups(); 47 | draw(); 48 | } 49 | flag = !flag; 50 | } 51 | 52 | function init() { 53 | d3.select("#restart").attr("disabled", "disabled"); 54 | 55 | var N = parseInt(d3.select('#N')[0][0].value, 10); 56 | var K = parseInt(d3.select('#K')[0][0].value, 10); 57 | groups = []; 58 | for (var i = 0; i < K; i++) { 59 | var g = { 60 | dots: [], 61 | color: 'hsl(' + (i * 360 / K) + ',100%,50%)', 62 | center: { 63 | x: Math.random() * WIDTH, 64 | y: Math.random() * HEIGHT 65 | }, 66 | init: { 67 | center: {} 68 | } 69 | }; 70 | g.init.center = { 71 | x: g.center.x, 72 | y: g.center.y 73 | }; 74 | groups.push(g); 75 | } 76 | 77 | dots = []; 78 | flag = false; 79 | for (i = 0; i < N; i++) { 80 | var dot ={ 81 | x: Math.random() * WIDTH, 82 | y: Math.random() * HEIGHT, 83 | group: undefined 84 | }; 85 | dot.init = { 86 | x: dot.x, 87 | y: dot.y, 88 | group: dot.group 89 | }; 90 | dots.push(dot); 91 | } 92 | } 93 | 94 | function restart() { 95 | flag = false; 96 | d3.select("#restart").attr("disabled", "disabled"); 97 | 98 | groups.forEach(function(g) { 99 | g.dots = []; 100 | g.center.x = g.init.center.x; 101 | g.center.y = g.init.center.y; 102 | }); 103 | 104 | for (var i = 0; i < dots.length; i++) { 105 | var dot = dots[i]; 106 | dots[i] = { 107 | x: dot.init.x, 108 | y: dot.init.y, 109 | group: undefined, 110 | init: dot.init 111 | }; 112 | } 113 | } 114 | 115 | 116 | function draw() { 117 | var circles = dotg.selectAll('circle') 118 | .data(dots); 119 | circles.enter() 120 | .append('circle'); 121 | circles.exit().remove(); 122 | circles 123 | .transition() 124 | .duration(500) 125 | .attr('cx', function(d) { return d.x; }) 126 | .attr('cy', function(d) { return d.y; }) 127 | .attr('fill', function(d) { return d.group ? 
d.group.color : '#ffffff'; }) 128 | .attr('r', 5); 129 | 130 | if (dots[0].group) { 131 | var l = lineg.selectAll('line') 132 | .data(dots); 133 | var updateLine = function(lines) { 134 | lines 135 | .attr('x1', function(d) { return d.x; }) 136 | .attr('y1', function(d) { return d.y; }) 137 | .attr('x2', function(d) { return d.group.center.x; }) 138 | .attr('y2', function(d) { return d.group.center.y; }) 139 | .attr('stroke', function(d) { return d.group.color; }); 140 | }; 141 | updateLine(l.enter().append('line')); 142 | updateLine(l.transition().duration(500)); 143 | l.exit().remove(); 144 | } else { 145 | lineg.selectAll('line').remove(); 146 | } 147 | 148 | var c = centerg.selectAll('path') 149 | .data(groups); 150 | var updateCenters = function(centers) { 151 | centers 152 | .attr('transform', function(d) { return "translate(" + d.center.x + "," + d.center.y + ") rotate(45)";}) 153 | .attr('fill', function(d,i) { return d.color; }) 154 | .attr('stroke', '#aabbcc'); 155 | }; 156 | c.exit().remove(); 157 | updateCenters(c.enter() 158 | .append('path') 159 | .attr('d', d3.svg.symbol().type('cross')) 160 | .attr('stroke', '#aabbcc')); 161 | updateCenters(c 162 | .transition() 163 | .duration(500));} 164 | 165 | function moveCenter() { 166 | groups.forEach(function(group, i) { 167 | if (group.dots.length == 0) return; 168 | 169 | // get center of gravity 170 | var x = 0, y = 0; 171 | group.dots.forEach(function(dot) { 172 | x += dot.x; 173 | y += dot.y; 174 | }); 175 | 176 | group.center = { 177 | x: x / group.dots.length, 178 | y: y / group.dots.length 179 | }; 180 | }); 181 | 182 | } 183 | 184 | function updateGroups() { 185 | groups.forEach(function(g) { g.dots = []; }); 186 | dots.forEach(function(dot) { 187 | // find the nearest group 188 | var min = Infinity; 189 | var group; 190 | groups.forEach(function(g) { 191 | var d = Math.pow(g.center.x - dot.x, 2) + Math.pow(g.center.y - dot.y, 2); 192 | if (d < min) { 193 | min = d; 194 | group = g; 195 | } 196 | }); 197 | 198 | // update group 199 | group.dots.push(dot); 200 | dot.group = group; 201 | }); 202 | } 203 | 204 | init(); draw(); 205 | -------------------------------------------------------------------------------- /7 - Practical Methodologies: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /7---practical-methodologies/CV: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2016 Stanford Crowd Course Initiative 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 
14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # datascience 2 | 3 | [![Binder](http://mybinder.org/badge.svg)](http://mybinder.org/repo/crowd-course/datascience) 4 | 5 | * Add Jupyter notebooks to the corresponding folder. 6 | * To maintain the order, create one notebook per module. 7 | e.g. 2-1-this-is-module-1-for-chapter-2, 3-3-this-is-module-3-for-chapter-3. 8 | * If you need to add extra info, add it to the README file inside each folder. 9 | * Use scikit-learn, numpy and Pandas only. 10 | -------------------------------------------------------------------------------- /datasets/Advertising.csv: -------------------------------------------------------------------------------- 1 | "","TV","Radio","Newspaper","Sales" 2 | "1",230.1,37.8,69.2,22.1 3 | "2",44.5,39.3,45.1,10.4 4 | "3",17.2,45.9,69.3,9.3 5 | "4",151.5,41.3,58.5,18.5 6 | "5",180.8,10.8,58.4,12.9 7 | "6",8.7,48.9,75,7.2 8 | "7",57.5,32.8,23.5,11.8 9 | "8",120.2,19.6,11.6,13.2 10 | "9",8.6,2.1,1,4.8 11 | "10",199.8,2.6,21.2,10.6 12 | "11",66.1,5.8,24.2,8.6 13 | "12",214.7,24,4,17.4 14 | "13",23.8,35.1,65.9,9.2 15 | "14",97.5,7.6,7.2,9.7 16 | "15",204.1,32.9,46,19 17 | "16",195.4,47.7,52.9,22.4 18 | "17",67.8,36.6,114,12.5 19 | "18",281.4,39.6,55.8,24.4 20 | "19",69.2,20.5,18.3,11.3 21 | "20",147.3,23.9,19.1,14.6 22 | "21",218.4,27.7,53.4,18 23 | "22",237.4,5.1,23.5,12.5 24 | "23",13.2,15.9,49.6,5.6 25 | "24",228.3,16.9,26.2,15.5 26 | "25",62.3,12.6,18.3,9.7 27 | "26",262.9,3.5,19.5,12 28 | "27",142.9,29.3,12.6,15 29 | "28",240.1,16.7,22.9,15.9 30 | "29",248.8,27.1,22.9,18.9 31 | "30",70.6,16,40.8,10.5 32 | "31",292.9,28.3,43.2,21.4 33 | "32",112.9,17.4,38.6,11.9 34 | "33",97.2,1.5,30,9.6 35 | "34",265.6,20,0.3,17.4 36 | "35",95.7,1.4,7.4,9.5 37 | "36",290.7,4.1,8.5,12.8 38 | "37",266.9,43.8,5,25.4 39 | "38",74.7,49.4,45.7,14.7 40 | "39",43.1,26.7,35.1,10.1 41 | "40",228,37.7,32,21.5 42 | "41",202.5,22.3,31.6,16.6 43 | "42",177,33.4,38.7,17.1 44 | "43",293.6,27.7,1.8,20.7 45 | "44",206.9,8.4,26.4,12.9 46 | "45",25.1,25.7,43.3,8.5 47 | "46",175.1,22.5,31.5,14.9 48 | "47",89.7,9.9,35.7,10.6 49 | "48",239.9,41.5,18.5,23.2 50 | "49",227.2,15.8,49.9,14.8 51 | "50",66.9,11.7,36.8,9.7 52 | "51",199.8,3.1,34.6,11.4 53 | "52",100.4,9.6,3.6,10.7 54 | "53",216.4,41.7,39.6,22.6 55 | "54",182.6,46.2,58.7,21.2 56 | "55",262.7,28.8,15.9,20.2 57 | "56",198.9,49.4,60,23.7 58 | "57",7.3,28.1,41.4,5.5 59 | "58",136.2,19.2,16.6,13.2 60 | "59",210.8,49.6,37.7,23.8 61 | "60",210.7,29.5,9.3,18.4 62 | "61",53.5,2,21.4,8.1 63 | "62",261.3,42.7,54.7,24.2 64 | "63",239.3,15.5,27.3,15.7 65 | "64",102.7,29.6,8.4,14 66 | "65",131.1,42.8,28.9,18 67 | "66",69,9.3,0.9,9.3 68 | "67",31.5,24.6,2.2,9.5 69 | "68",139.3,14.5,10.2,13.4 70 | "69",237.4,27.5,11,18.9 71 | "70",216.8,43.9,27.2,22.3 72 | "71",199.1,30.6,38.7,18.3 73 | "72",109.8,14.3,31.7,12.4 74 | "73",26.8,33,19.3,8.8 75 | "74",129.4,5.7,31.3,11 76 | "75",213.4,24.6,13.1,17 77 | 
"76",16.9,43.7,89.4,8.7 78 | "77",27.5,1.6,20.7,6.9 79 | "78",120.5,28.5,14.2,14.2 80 | "79",5.4,29.9,9.4,5.3 81 | "80",116,7.7,23.1,11 82 | "81",76.4,26.7,22.3,11.8 83 | "82",239.8,4.1,36.9,12.3 84 | "83",75.3,20.3,32.5,11.3 85 | "84",68.4,44.5,35.6,13.6 86 | "85",213.5,43,33.8,21.7 87 | "86",193.2,18.4,65.7,15.2 88 | "87",76.3,27.5,16,12 89 | "88",110.7,40.6,63.2,16 90 | "89",88.3,25.5,73.4,12.9 91 | "90",109.8,47.8,51.4,16.7 92 | "91",134.3,4.9,9.3,11.2 93 | "92",28.6,1.5,33,7.3 94 | "93",217.7,33.5,59,19.4 95 | "94",250.9,36.5,72.3,22.2 96 | "95",107.4,14,10.9,11.5 97 | "96",163.3,31.6,52.9,16.9 98 | "97",197.6,3.5,5.9,11.7 99 | "98",184.9,21,22,15.5 100 | "99",289.7,42.3,51.2,25.4 101 | "100",135.2,41.7,45.9,17.2 102 | "101",222.4,4.3,49.8,11.7 103 | "102",296.4,36.3,100.9,23.8 104 | "103",280.2,10.1,21.4,14.8 105 | "104",187.9,17.2,17.9,14.7 106 | "105",238.2,34.3,5.3,20.7 107 | "106",137.9,46.4,59,19.2 108 | "107",25,11,29.7,7.2 109 | "108",90.4,0.3,23.2,8.7 110 | "109",13.1,0.4,25.6,5.3 111 | "110",255.4,26.9,5.5,19.8 112 | "111",225.8,8.2,56.5,13.4 113 | "112",241.7,38,23.2,21.8 114 | "113",175.7,15.4,2.4,14.1 115 | "114",209.6,20.6,10.7,15.9 116 | "115",78.2,46.8,34.5,14.6 117 | "116",75.1,35,52.7,12.6 118 | "117",139.2,14.3,25.6,12.2 119 | "118",76.4,0.8,14.8,9.4 120 | "119",125.7,36.9,79.2,15.9 121 | "120",19.4,16,22.3,6.6 122 | "121",141.3,26.8,46.2,15.5 123 | "122",18.8,21.7,50.4,7 124 | "123",224,2.4,15.6,11.6 125 | "124",123.1,34.6,12.4,15.2 126 | "125",229.5,32.3,74.2,19.7 127 | "126",87.2,11.8,25.9,10.6 128 | "127",7.8,38.9,50.6,6.6 129 | "128",80.2,0,9.2,8.8 130 | "129",220.3,49,3.2,24.7 131 | "130",59.6,12,43.1,9.7 132 | "131",0.7,39.6,8.7,1.6 133 | "132",265.2,2.9,43,12.7 134 | "133",8.4,27.2,2.1,5.7 135 | "134",219.8,33.5,45.1,19.6 136 | "135",36.9,38.6,65.6,10.8 137 | "136",48.3,47,8.5,11.6 138 | "137",25.6,39,9.3,9.5 139 | "138",273.7,28.9,59.7,20.8 140 | "139",43,25.9,20.5,9.6 141 | "140",184.9,43.9,1.7,20.7 142 | "141",73.4,17,12.9,10.9 143 | "142",193.7,35.4,75.6,19.2 144 | "143",220.5,33.2,37.9,20.1 145 | "144",104.6,5.7,34.4,10.4 146 | "145",96.2,14.8,38.9,11.4 147 | "146",140.3,1.9,9,10.3 148 | "147",240.1,7.3,8.7,13.2 149 | "148",243.2,49,44.3,25.4 150 | "149",38,40.3,11.9,10.9 151 | "150",44.7,25.8,20.6,10.1 152 | "151",280.7,13.9,37,16.1 153 | "152",121,8.4,48.7,11.6 154 | "153",197.6,23.3,14.2,16.6 155 | "154",171.3,39.7,37.7,19 156 | "155",187.8,21.1,9.5,15.6 157 | "156",4.1,11.6,5.7,3.2 158 | "157",93.9,43.5,50.5,15.3 159 | "158",149.8,1.3,24.3,10.1 160 | "159",11.7,36.9,45.2,7.3 161 | "160",131.7,18.4,34.6,12.9 162 | "161",172.5,18.1,30.7,14.4 163 | "162",85.7,35.8,49.3,13.3 164 | "163",188.4,18.1,25.6,14.9 165 | "164",163.5,36.8,7.4,18 166 | "165",117.2,14.7,5.4,11.9 167 | "166",234.5,3.4,84.8,11.9 168 | "167",17.9,37.6,21.6,8 169 | "168",206.8,5.2,19.4,12.2 170 | "169",215.4,23.6,57.6,17.1 171 | "170",284.3,10.6,6.4,15 172 | "171",50,11.6,18.4,8.4 173 | "172",164.5,20.9,47.4,14.5 174 | "173",19.6,20.1,17,7.6 175 | "174",168.4,7.1,12.8,11.7 176 | "175",222.4,3.4,13.1,11.5 177 | "176",276.9,48.9,41.8,27 178 | "177",248.4,30.2,20.3,20.2 179 | "178",170.2,7.8,35.2,11.7 180 | "179",276.7,2.3,23.7,11.8 181 | "180",165.6,10,17.6,12.6 182 | "181",156.6,2.6,8.3,10.5 183 | "182",218.5,5.4,27.4,12.2 184 | "183",56.2,5.7,29.7,8.7 185 | "184",287.6,43,71.8,26.2 186 | "185",253.8,21.3,30,17.6 187 | "186",205,45.1,19.6,22.6 188 | "187",139.5,2.1,26.6,10.3 189 | "188",191.1,28.7,18.2,17.3 190 | "189",286,13.9,3.7,15.9 191 | "190",18.7,12.1,23.4,6.7 192 | 
"191",39.5,41.1,5.8,10.8 193 | "192",75.5,10.8,6,9.9 194 | "193",17.2,4.1,31.6,5.9 195 | "194",166.8,42,3.6,19.6 196 | "195",149.7,35.6,6,17.3 197 | "196",38.2,3.7,13.8,7.6 198 | "197",94.2,4.9,8.1,9.7 199 | "198",177,9.3,6.4,12.8 200 | "199",283.6,42,66.2,25.5 201 | "200",232.1,8.6,8.7,13.4 202 | -------------------------------------------------------------------------------- /datasets/breast-cancer-wisconson.csv: -------------------------------------------------------------------------------- 1 | ID,CT,UCS,UCSh,MA,SECS,BN,BC,NN,M,Class 2 | 1000025,5,1,1,1,2,1,3,1,1,2 3 | 1002945,5,4,4,5,7,10,3,2,1,2 4 | 1015425,3,1,1,1,2,2,3,1,1,2 5 | 1016277,6,8,8,1,3,4,3,7,1,2 6 | 1017023,4,1,1,3,2,1,3,1,1,2 7 | 1017122,8,10,10,8,7,10,9,7,1,4 8 | 1018099,1,1,1,1,2,10,3,1,1,2 9 | 1018561,2,1,2,1,2,1,3,1,1,2 10 | 1033078,2,1,1,1,2,1,1,1,5,2 11 | 1033078,4,2,1,1,2,1,2,1,1,2 12 | 1035283,1,1,1,1,1,1,3,1,1,2 13 | 1036172,2,1,1,1,2,1,2,1,1,2 14 | 1041801,5,3,3,3,2,3,4,4,1,4 15 | 1043999,1,1,1,1,2,3,3,1,1,2 16 | 1044572,8,7,5,10,7,9,5,5,4,4 17 | 1047630,7,4,6,4,6,1,4,3,1,4 18 | 1048672,4,1,1,1,2,1,2,1,1,2 19 | 1049815,4,1,1,1,2,1,3,1,1,2 20 | 1050670,10,7,7,6,4,10,4,1,2,4 21 | 1050718,6,1,1,1,2,1,3,1,1,2 22 | 1054590,7,3,2,10,5,10,5,4,4,4 23 | 1054593,10,5,5,3,6,7,7,10,1,4 24 | 1056784,3,1,1,1,2,1,2,1,1,2 25 | 1057013,8,4,5,1,2,?,7,3,1,4 26 | 1059552,1,1,1,1,2,1,3,1,1,2 27 | 1065726,5,2,3,4,2,7,3,6,1,4 28 | 1066373,3,2,1,1,1,1,2,1,1,2 29 | 1066979,5,1,1,1,2,1,2,1,1,2 30 | 1067444,2,1,1,1,2,1,2,1,1,2 31 | 1070935,1,1,3,1,2,1,1,1,1,2 32 | 1070935,3,1,1,1,1,1,2,1,1,2 33 | 1071760,2,1,1,1,2,1,3,1,1,2 34 | 1072179,10,7,7,3,8,5,7,4,3,4 35 | 1074610,2,1,1,2,2,1,3,1,1,2 36 | 1075123,3,1,2,1,2,1,2,1,1,2 37 | 1079304,2,1,1,1,2,1,2,1,1,2 38 | 1080185,10,10,10,8,6,1,8,9,1,4 39 | 1081791,6,2,1,1,1,1,7,1,1,2 40 | 1084584,5,4,4,9,2,10,5,6,1,4 41 | 1091262,2,5,3,3,6,7,7,5,1,4 42 | 1096800,6,6,6,9,6,?,7,8,1,2 43 | 1099510,10,4,3,1,3,3,6,5,2,4 44 | 1100524,6,10,10,2,8,10,7,3,3,4 45 | 1102573,5,6,5,6,10,1,3,1,1,4 46 | 1103608,10,10,10,4,8,1,8,10,1,4 47 | 1103722,1,1,1,1,2,1,2,1,2,2 48 | 1105257,3,7,7,4,4,9,4,8,1,4 49 | 1105524,1,1,1,1,2,1,2,1,1,2 50 | 1106095,4,1,1,3,2,1,3,1,1,2 51 | 1106829,7,8,7,2,4,8,3,8,2,4 52 | 1108370,9,5,8,1,2,3,2,1,5,4 53 | 1108449,5,3,3,4,2,4,3,4,1,4 54 | 1110102,10,3,6,2,3,5,4,10,2,4 55 | 1110503,5,5,5,8,10,8,7,3,7,4 56 | 1110524,10,5,5,6,8,8,7,1,1,4 57 | 1111249,10,6,6,3,4,5,3,6,1,4 58 | 1112209,8,10,10,1,3,6,3,9,1,4 59 | 1113038,8,2,4,1,5,1,5,4,4,4 60 | 1113483,5,2,3,1,6,10,5,1,1,4 61 | 1113906,9,5,5,2,2,2,5,1,1,4 62 | 1115282,5,3,5,5,3,3,4,10,1,4 63 | 1115293,1,1,1,1,2,2,2,1,1,2 64 | 1116116,9,10,10,1,10,8,3,3,1,4 65 | 1116132,6,3,4,1,5,2,3,9,1,4 66 | 1116192,1,1,1,1,2,1,2,1,1,2 67 | 1116998,10,4,2,1,3,2,4,3,10,4 68 | 1117152,4,1,1,1,2,1,3,1,1,2 69 | 1118039,5,3,4,1,8,10,4,9,1,4 70 | 1120559,8,3,8,3,4,9,8,9,8,4 71 | 1121732,1,1,1,1,2,1,3,2,1,2 72 | 1121919,5,1,3,1,2,1,2,1,1,2 73 | 1123061,6,10,2,8,10,2,7,8,10,4 74 | 1124651,1,3,3,2,2,1,7,2,1,2 75 | 1125035,9,4,5,10,6,10,4,8,1,4 76 | 1126417,10,6,4,1,3,4,3,2,3,4 77 | 1131294,1,1,2,1,2,2,4,2,1,2 78 | 1132347,1,1,4,1,2,1,2,1,1,2 79 | 1133041,5,3,1,2,2,1,2,1,1,2 80 | 1133136,3,1,1,1,2,3,3,1,1,2 81 | 1136142,2,1,1,1,3,1,2,1,1,2 82 | 1137156,2,2,2,1,1,1,7,1,1,2 83 | 1143978,4,1,1,2,2,1,2,1,1,2 84 | 1143978,5,2,1,1,2,1,3,1,1,2 85 | 1147044,3,1,1,1,2,2,7,1,1,2 86 | 1147699,3,5,7,8,8,9,7,10,7,4 87 | 1147748,5,10,6,1,10,4,4,10,10,4 88 | 1148278,3,3,6,4,5,8,4,4,1,4 89 | 1148873,3,6,6,6,5,10,6,8,3,4 90 | 1152331,4,1,1,1,2,1,3,1,1,2 91 | 
1155546,2,1,1,2,3,1,2,1,1,2 92 | 1156272,1,1,1,1,2,1,3,1,1,2 93 | 1156948,3,1,1,2,2,1,1,1,1,2 94 | 1157734,4,1,1,1,2,1,3,1,1,2 95 | 1158247,1,1,1,1,2,1,2,1,1,2 96 | 1160476,2,1,1,1,2,1,3,1,1,2 97 | 1164066,1,1,1,1,2,1,3,1,1,2 98 | 1165297,2,1,1,2,2,1,1,1,1,2 99 | 1165790,5,1,1,1,2,1,3,1,1,2 100 | 1165926,9,6,9,2,10,6,2,9,10,4 101 | 1166630,7,5,6,10,5,10,7,9,4,4 102 | 1166654,10,3,5,1,10,5,3,10,2,4 103 | 1167439,2,3,4,4,2,5,2,5,1,4 104 | 1167471,4,1,2,1,2,1,3,1,1,2 105 | 1168359,8,2,3,1,6,3,7,1,1,4 106 | 1168736,10,10,10,10,10,1,8,8,8,4 107 | 1169049,7,3,4,4,3,3,3,2,7,4 108 | 1170419,10,10,10,8,2,10,4,1,1,4 109 | 1170420,1,6,8,10,8,10,5,7,1,4 110 | 1171710,1,1,1,1,2,1,2,3,1,2 111 | 1171710,6,5,4,4,3,9,7,8,3,4 112 | 1171795,1,3,1,2,2,2,5,3,2,2 113 | 1171845,8,6,4,3,5,9,3,1,1,4 114 | 1172152,10,3,3,10,2,10,7,3,3,4 115 | 1173216,10,10,10,3,10,8,8,1,1,4 116 | 1173235,3,3,2,1,2,3,3,1,1,2 117 | 1173347,1,1,1,1,2,5,1,1,1,2 118 | 1173347,8,3,3,1,2,2,3,2,1,2 119 | 1173509,4,5,5,10,4,10,7,5,8,4 120 | 1173514,1,1,1,1,4,3,1,1,1,2 121 | 1173681,3,2,1,1,2,2,3,1,1,2 122 | 1174057,1,1,2,2,2,1,3,1,1,2 123 | 1174057,4,2,1,1,2,2,3,1,1,2 124 | 1174131,10,10,10,2,10,10,5,3,3,4 125 | 1174428,5,3,5,1,8,10,5,3,1,4 126 | 1175937,5,4,6,7,9,7,8,10,1,4 127 | 1176406,1,1,1,1,2,1,2,1,1,2 128 | 1176881,7,5,3,7,4,10,7,5,5,4 129 | 1177027,3,1,1,1,2,1,3,1,1,2 130 | 1177399,8,3,5,4,5,10,1,6,2,4 131 | 1177512,1,1,1,1,10,1,1,1,1,2 132 | 1178580,5,1,3,1,2,1,2,1,1,2 133 | 1179818,2,1,1,1,2,1,3,1,1,2 134 | 1180194,5,10,8,10,8,10,3,6,3,4 135 | 1180523,3,1,1,1,2,1,2,2,1,2 136 | 1180831,3,1,1,1,3,1,2,1,1,2 137 | 1181356,5,1,1,1,2,2,3,3,1,2 138 | 1182404,4,1,1,1,2,1,2,1,1,2 139 | 1182410,3,1,1,1,2,1,1,1,1,2 140 | 1183240,4,1,2,1,2,1,2,1,1,2 141 | 1183246,1,1,1,1,1,?,2,1,1,2 142 | 1183516,3,1,1,1,2,1,1,1,1,2 143 | 1183911,2,1,1,1,2,1,1,1,1,2 144 | 1183983,9,5,5,4,4,5,4,3,3,4 145 | 1184184,1,1,1,1,2,5,1,1,1,2 146 | 1184241,2,1,1,1,2,1,2,1,1,2 147 | 1184840,1,1,3,1,2,?,2,1,1,2 148 | 1185609,3,4,5,2,6,8,4,1,1,4 149 | 1185610,1,1,1,1,3,2,2,1,1,2 150 | 1187457,3,1,1,3,8,1,5,8,1,2 151 | 1187805,8,8,7,4,10,10,7,8,7,4 152 | 1188472,1,1,1,1,1,1,3,1,1,2 153 | 1189266,7,2,4,1,6,10,5,4,3,4 154 | 1189286,10,10,8,6,4,5,8,10,1,4 155 | 1190394,4,1,1,1,2,3,1,1,1,2 156 | 1190485,1,1,1,1,2,1,1,1,1,2 157 | 1192325,5,5,5,6,3,10,3,1,1,4 158 | 1193091,1,2,2,1,2,1,2,1,1,2 159 | 1193210,2,1,1,1,2,1,3,1,1,2 160 | 1193683,1,1,2,1,3,?,1,1,1,2 161 | 1196295,9,9,10,3,6,10,7,10,6,4 162 | 1196915,10,7,7,4,5,10,5,7,2,4 163 | 1197080,4,1,1,1,2,1,3,2,1,2 164 | 1197270,3,1,1,1,2,1,3,1,1,2 165 | 1197440,1,1,1,2,1,3,1,1,7,2 166 | 1197510,5,1,1,1,2,?,3,1,1,2 167 | 1197979,4,1,1,1,2,2,3,2,1,2 168 | 1197993,5,6,7,8,8,10,3,10,3,4 169 | 1198128,10,8,10,10,6,1,3,1,10,4 170 | 1198641,3,1,1,1,2,1,3,1,1,2 171 | 1199219,1,1,1,2,1,1,1,1,1,2 172 | 1199731,3,1,1,1,2,1,1,1,1,2 173 | 1199983,1,1,1,1,2,1,3,1,1,2 174 | 1200772,1,1,1,1,2,1,2,1,1,2 175 | 1200847,6,10,10,10,8,10,10,10,7,4 176 | 1200892,8,6,5,4,3,10,6,1,1,4 177 | 1200952,5,8,7,7,10,10,5,7,1,4 178 | 1201834,2,1,1,1,2,1,3,1,1,2 179 | 1201936,5,10,10,3,8,1,5,10,3,4 180 | 1202125,4,1,1,1,2,1,3,1,1,2 181 | 1202812,5,3,3,3,6,10,3,1,1,4 182 | 1203096,1,1,1,1,1,1,3,1,1,2 183 | 1204242,1,1,1,1,2,1,1,1,1,2 184 | 1204898,6,1,1,1,2,1,3,1,1,2 185 | 1205138,5,8,8,8,5,10,7,8,1,4 186 | 1205579,8,7,6,4,4,10,5,1,1,4 187 | 1206089,2,1,1,1,1,1,3,1,1,2 188 | 1206695,1,5,8,6,5,8,7,10,1,4 189 | 1206841,10,5,6,10,6,10,7,7,10,4 190 | 1207986,5,8,4,10,5,8,9,10,1,4 191 | 1208301,1,2,3,1,2,1,3,1,1,2 192 | 1210963,10,10,10,8,6,8,7,10,1,4 193 | 
1211202,7,5,10,10,10,10,4,10,3,4 194 | 1212232,5,1,1,1,2,1,2,1,1,2 195 | 1212251,1,1,1,1,2,1,3,1,1,2 196 | 1212422,3,1,1,1,2,1,3,1,1,2 197 | 1212422,4,1,1,1,2,1,3,1,1,2 198 | 1213375,8,4,4,5,4,7,7,8,2,2 199 | 1213383,5,1,1,4,2,1,3,1,1,2 200 | 1214092,1,1,1,1,2,1,1,1,1,2 201 | 1214556,3,1,1,1,2,1,2,1,1,2 202 | 1214966,9,7,7,5,5,10,7,8,3,4 203 | 1216694,10,8,8,4,10,10,8,1,1,4 204 | 1216947,1,1,1,1,2,1,3,1,1,2 205 | 1217051,5,1,1,1,2,1,3,1,1,2 206 | 1217264,1,1,1,1,2,1,3,1,1,2 207 | 1218105,5,10,10,9,6,10,7,10,5,4 208 | 1218741,10,10,9,3,7,5,3,5,1,4 209 | 1218860,1,1,1,1,1,1,3,1,1,2 210 | 1218860,1,1,1,1,1,1,3,1,1,2 211 | 1219406,5,1,1,1,1,1,3,1,1,2 212 | 1219525,8,10,10,10,5,10,8,10,6,4 213 | 1219859,8,10,8,8,4,8,7,7,1,4 214 | 1220330,1,1,1,1,2,1,3,1,1,2 215 | 1221863,10,10,10,10,7,10,7,10,4,4 216 | 1222047,10,10,10,10,3,10,10,6,1,4 217 | 1222936,8,7,8,7,5,5,5,10,2,4 218 | 1223282,1,1,1,1,2,1,2,1,1,2 219 | 1223426,1,1,1,1,2,1,3,1,1,2 220 | 1223793,6,10,7,7,6,4,8,10,2,4 221 | 1223967,6,1,3,1,2,1,3,1,1,2 222 | 1224329,1,1,1,2,2,1,3,1,1,2 223 | 1225799,10,6,4,3,10,10,9,10,1,4 224 | 1226012,4,1,1,3,1,5,2,1,1,4 225 | 1226612,7,5,6,3,3,8,7,4,1,4 226 | 1227210,10,5,5,6,3,10,7,9,2,4 227 | 1227244,1,1,1,1,2,1,2,1,1,2 228 | 1227481,10,5,7,4,4,10,8,9,1,4 229 | 1228152,8,9,9,5,3,5,7,7,1,4 230 | 1228311,1,1,1,1,1,1,3,1,1,2 231 | 1230175,10,10,10,3,10,10,9,10,1,4 232 | 1230688,7,4,7,4,3,7,7,6,1,4 233 | 1231387,6,8,7,5,6,8,8,9,2,4 234 | 1231706,8,4,6,3,3,1,4,3,1,2 235 | 1232225,10,4,5,5,5,10,4,1,1,4 236 | 1236043,3,3,2,1,3,1,3,6,1,2 237 | 1241232,3,1,4,1,2,?,3,1,1,2 238 | 1241559,10,8,8,2,8,10,4,8,10,4 239 | 1241679,9,8,8,5,6,2,4,10,4,4 240 | 1242364,8,10,10,8,6,9,3,10,10,4 241 | 1243256,10,4,3,2,3,10,5,3,2,4 242 | 1270479,5,1,3,3,2,2,2,3,1,2 243 | 1276091,3,1,1,3,1,1,3,1,1,2 244 | 1277018,2,1,1,1,2,1,3,1,1,2 245 | 128059,1,1,1,1,2,5,5,1,1,2 246 | 1285531,1,1,1,1,2,1,3,1,1,2 247 | 1287775,5,1,1,2,2,2,3,1,1,2 248 | 144888,8,10,10,8,5,10,7,8,1,4 249 | 145447,8,4,4,1,2,9,3,3,1,4 250 | 167528,4,1,1,1,2,1,3,6,1,2 251 | 169356,3,1,1,1,2,?,3,1,1,2 252 | 183913,1,2,2,1,2,1,1,1,1,2 253 | 191250,10,4,4,10,2,10,5,3,3,4 254 | 1017023,6,3,3,5,3,10,3,5,3,2 255 | 1100524,6,10,10,2,8,10,7,3,3,4 256 | 1116116,9,10,10,1,10,8,3,3,1,4 257 | 1168736,5,6,6,2,4,10,3,6,1,4 258 | 1182404,3,1,1,1,2,1,1,1,1,2 259 | 1182404,3,1,1,1,2,1,2,1,1,2 260 | 1198641,3,1,1,1,2,1,3,1,1,2 261 | 242970,5,7,7,1,5,8,3,4,1,2 262 | 255644,10,5,8,10,3,10,5,1,3,4 263 | 263538,5,10,10,6,10,10,10,6,5,4 264 | 274137,8,8,9,4,5,10,7,8,1,4 265 | 303213,10,4,4,10,6,10,5,5,1,4 266 | 314428,7,9,4,10,10,3,5,3,3,4 267 | 1182404,5,1,4,1,2,1,3,2,1,2 268 | 1198641,10,10,6,3,3,10,4,3,2,4 269 | 320675,3,3,5,2,3,10,7,1,1,4 270 | 324427,10,8,8,2,3,4,8,7,8,4 271 | 385103,1,1,1,1,2,1,3,1,1,2 272 | 390840,8,4,7,1,3,10,3,9,2,4 273 | 411453,5,1,1,1,2,1,3,1,1,2 274 | 320675,3,3,5,2,3,10,7,1,1,4 275 | 428903,7,2,4,1,3,4,3,3,1,4 276 | 431495,3,1,1,1,2,1,3,2,1,2 277 | 432809,3,1,3,1,2,?,2,1,1,2 278 | 434518,3,1,1,1,2,1,2,1,1,2 279 | 452264,1,1,1,1,2,1,2,1,1,2 280 | 456282,1,1,1,1,2,1,3,1,1,2 281 | 476903,10,5,7,3,3,7,3,3,8,4 282 | 486283,3,1,1,1,2,1,3,1,1,2 283 | 486662,2,1,1,2,2,1,3,1,1,2 284 | 488173,1,4,3,10,4,10,5,6,1,4 285 | 492268,10,4,6,1,2,10,5,3,1,4 286 | 508234,7,4,5,10,2,10,3,8,2,4 287 | 527363,8,10,10,10,8,10,10,7,3,4 288 | 529329,10,10,10,10,10,10,4,10,10,4 289 | 535331,3,1,1,1,3,1,2,1,1,2 290 | 543558,6,1,3,1,4,5,5,10,1,4 291 | 555977,5,6,6,8,6,10,4,10,4,4 292 | 560680,1,1,1,1,2,1,1,1,1,2 293 | 561477,1,1,1,1,2,1,3,1,1,2 294 | 563649,8,8,8,1,2,?,6,10,1,4 295 | 
601265,10,4,4,6,2,10,2,3,1,4 296 | 606140,1,1,1,1,2,?,2,1,1,2 297 | 606722,5,5,7,8,6,10,7,4,1,4 298 | 616240,5,3,4,3,4,5,4,7,1,2 299 | 61634,5,4,3,1,2,?,2,3,1,2 300 | 625201,8,2,1,1,5,1,1,1,1,2 301 | 63375,9,1,2,6,4,10,7,7,2,4 302 | 635844,8,4,10,5,4,4,7,10,1,4 303 | 636130,1,1,1,1,2,1,3,1,1,2 304 | 640744,10,10,10,7,9,10,7,10,10,4 305 | 646904,1,1,1,1,2,1,3,1,1,2 306 | 653777,8,3,4,9,3,10,3,3,1,4 307 | 659642,10,8,4,4,4,10,3,10,4,4 308 | 666090,1,1,1,1,2,1,3,1,1,2 309 | 666942,1,1,1,1,2,1,3,1,1,2 310 | 667204,7,8,7,6,4,3,8,8,4,4 311 | 673637,3,1,1,1,2,5,5,1,1,2 312 | 684955,2,1,1,1,3,1,2,1,1,2 313 | 688033,1,1,1,1,2,1,1,1,1,2 314 | 691628,8,6,4,10,10,1,3,5,1,4 315 | 693702,1,1,1,1,2,1,1,1,1,2 316 | 704097,1,1,1,1,1,1,2,1,1,2 317 | 704168,4,6,5,6,7,?,4,9,1,2 318 | 706426,5,5,5,2,5,10,4,3,1,4 319 | 709287,6,8,7,8,6,8,8,9,1,4 320 | 718641,1,1,1,1,5,1,3,1,1,2 321 | 721482,4,4,4,4,6,5,7,3,1,2 322 | 730881,7,6,3,2,5,10,7,4,6,4 323 | 733639,3,1,1,1,2,?,3,1,1,2 324 | 733639,3,1,1,1,2,1,3,1,1,2 325 | 733823,5,4,6,10,2,10,4,1,1,4 326 | 740492,1,1,1,1,2,1,3,1,1,2 327 | 743348,3,2,2,1,2,1,2,3,1,2 328 | 752904,10,1,1,1,2,10,5,4,1,4 329 | 756136,1,1,1,1,2,1,2,1,1,2 330 | 760001,8,10,3,2,6,4,3,10,1,4 331 | 760239,10,4,6,4,5,10,7,1,1,4 332 | 76389,10,4,7,2,2,8,6,1,1,4 333 | 764974,5,1,1,1,2,1,3,1,2,2 334 | 770066,5,2,2,2,2,1,2,2,1,2 335 | 785208,5,4,6,6,4,10,4,3,1,4 336 | 785615,8,6,7,3,3,10,3,4,2,4 337 | 792744,1,1,1,1,2,1,1,1,1,2 338 | 797327,6,5,5,8,4,10,3,4,1,4 339 | 798429,1,1,1,1,2,1,3,1,1,2 340 | 704097,1,1,1,1,1,1,2,1,1,2 341 | 806423,8,5,5,5,2,10,4,3,1,4 342 | 809912,10,3,3,1,2,10,7,6,1,4 343 | 810104,1,1,1,1,2,1,3,1,1,2 344 | 814265,2,1,1,1,2,1,1,1,1,2 345 | 814911,1,1,1,1,2,1,1,1,1,2 346 | 822829,7,6,4,8,10,10,9,5,3,4 347 | 826923,1,1,1,1,2,1,1,1,1,2 348 | 830690,5,2,2,2,3,1,1,3,1,2 349 | 831268,1,1,1,1,1,1,1,3,1,2 350 | 832226,3,4,4,10,5,1,3,3,1,4 351 | 832567,4,2,3,5,3,8,7,6,1,4 352 | 836433,5,1,1,3,2,1,1,1,1,2 353 | 837082,2,1,1,1,2,1,3,1,1,2 354 | 846832,3,4,5,3,7,3,4,6,1,2 355 | 850831,2,7,10,10,7,10,4,9,4,4 356 | 855524,1,1,1,1,2,1,2,1,1,2 357 | 857774,4,1,1,1,3,1,2,2,1,2 358 | 859164,5,3,3,1,3,3,3,3,3,4 359 | 859350,8,10,10,7,10,10,7,3,8,4 360 | 866325,8,10,5,3,8,4,4,10,3,4 361 | 873549,10,3,5,4,3,7,3,5,3,4 362 | 877291,6,10,10,10,10,10,8,10,10,4 363 | 877943,3,10,3,10,6,10,5,1,4,4 364 | 888169,3,2,2,1,4,3,2,1,1,2 365 | 888523,4,4,4,2,2,3,2,1,1,2 366 | 896404,2,1,1,1,2,1,3,1,1,2 367 | 897172,2,1,1,1,2,1,2,1,1,2 368 | 95719,6,10,10,10,8,10,7,10,7,4 369 | 160296,5,8,8,10,5,10,8,10,3,4 370 | 342245,1,1,3,1,2,1,1,1,1,2 371 | 428598,1,1,3,1,1,1,2,1,1,2 372 | 492561,4,3,2,1,3,1,2,1,1,2 373 | 493452,1,1,3,1,2,1,1,1,1,2 374 | 493452,4,1,2,1,2,1,2,1,1,2 375 | 521441,5,1,1,2,2,1,2,1,1,2 376 | 560680,3,1,2,1,2,1,2,1,1,2 377 | 636437,1,1,1,1,2,1,1,1,1,2 378 | 640712,1,1,1,1,2,1,2,1,1,2 379 | 654244,1,1,1,1,1,1,2,1,1,2 380 | 657753,3,1,1,4,3,1,2,2,1,2 381 | 685977,5,3,4,1,4,1,3,1,1,2 382 | 805448,1,1,1,1,2,1,1,1,1,2 383 | 846423,10,6,3,6,4,10,7,8,4,4 384 | 1002504,3,2,2,2,2,1,3,2,1,2 385 | 1022257,2,1,1,1,2,1,1,1,1,2 386 | 1026122,2,1,1,1,2,1,1,1,1,2 387 | 1071084,3,3,2,2,3,1,1,2,3,2 388 | 1080233,7,6,6,3,2,10,7,1,1,4 389 | 1114570,5,3,3,2,3,1,3,1,1,2 390 | 1114570,2,1,1,1,2,1,2,2,1,2 391 | 1116715,5,1,1,1,3,2,2,2,1,2 392 | 1131411,1,1,1,2,2,1,2,1,1,2 393 | 1151734,10,8,7,4,3,10,7,9,1,4 394 | 1156017,3,1,1,1,2,1,2,1,1,2 395 | 1158247,1,1,1,1,1,1,1,1,1,2 396 | 1158405,1,2,3,1,2,1,2,1,1,2 397 | 1168278,3,1,1,1,2,1,2,1,1,2 398 | 1176187,3,1,1,1,2,1,3,1,1,2 399 | 1196263,4,1,1,1,2,1,1,1,1,2 400 | 
1196475,3,2,1,1,2,1,2,2,1,2 401 | 1206314,1,2,3,1,2,1,1,1,1,2 402 | 1211265,3,10,8,7,6,9,9,3,8,4 403 | 1213784,3,1,1,1,2,1,1,1,1,2 404 | 1223003,5,3,3,1,2,1,2,1,1,2 405 | 1223306,3,1,1,1,2,4,1,1,1,2 406 | 1223543,1,2,1,3,2,1,1,2,1,2 407 | 1229929,1,1,1,1,2,1,2,1,1,2 408 | 1231853,4,2,2,1,2,1,2,1,1,2 409 | 1234554,1,1,1,1,2,1,2,1,1,2 410 | 1236837,2,3,2,2,2,2,3,1,1,2 411 | 1237674,3,1,2,1,2,1,2,1,1,2 412 | 1238021,1,1,1,1,2,1,2,1,1,2 413 | 1238464,1,1,1,1,1,?,2,1,1,2 414 | 1238633,10,10,10,6,8,4,8,5,1,4 415 | 1238915,5,1,2,1,2,1,3,1,1,2 416 | 1238948,8,5,6,2,3,10,6,6,1,4 417 | 1239232,3,3,2,6,3,3,3,5,1,2 418 | 1239347,8,7,8,5,10,10,7,2,1,4 419 | 1239967,1,1,1,1,2,1,2,1,1,2 420 | 1240337,5,2,2,2,2,2,3,2,2,2 421 | 1253505,2,3,1,1,5,1,1,1,1,2 422 | 1255384,3,2,2,3,2,3,3,1,1,2 423 | 1257200,10,10,10,7,10,10,8,2,1,4 424 | 1257648,4,3,3,1,2,1,3,3,1,2 425 | 1257815,5,1,3,1,2,1,2,1,1,2 426 | 1257938,3,1,1,1,2,1,1,1,1,2 427 | 1258549,9,10,10,10,10,10,10,10,1,4 428 | 1258556,5,3,6,1,2,1,1,1,1,2 429 | 1266154,8,7,8,2,4,2,5,10,1,4 430 | 1272039,1,1,1,1,2,1,2,1,1,2 431 | 1276091,2,1,1,1,2,1,2,1,1,2 432 | 1276091,1,3,1,1,2,1,2,2,1,2 433 | 1276091,5,1,1,3,4,1,3,2,1,2 434 | 1277629,5,1,1,1,2,1,2,2,1,2 435 | 1293439,3,2,2,3,2,1,1,1,1,2 436 | 1293439,6,9,7,5,5,8,4,2,1,2 437 | 1294562,10,8,10,1,3,10,5,1,1,4 438 | 1295186,10,10,10,1,6,1,2,8,1,4 439 | 527337,4,1,1,1,2,1,1,1,1,2 440 | 558538,4,1,3,3,2,1,1,1,1,2 441 | 566509,5,1,1,1,2,1,1,1,1,2 442 | 608157,10,4,3,10,4,10,10,1,1,4 443 | 677910,5,2,2,4,2,4,1,1,1,2 444 | 734111,1,1,1,3,2,3,1,1,1,2 445 | 734111,1,1,1,1,2,2,1,1,1,2 446 | 780555,5,1,1,6,3,1,2,1,1,2 447 | 827627,2,1,1,1,2,1,1,1,1,2 448 | 1049837,1,1,1,1,2,1,1,1,1,2 449 | 1058849,5,1,1,1,2,1,1,1,1,2 450 | 1182404,1,1,1,1,1,1,1,1,1,2 451 | 1193544,5,7,9,8,6,10,8,10,1,4 452 | 1201870,4,1,1,3,1,1,2,1,1,2 453 | 1202253,5,1,1,1,2,1,1,1,1,2 454 | 1227081,3,1,1,3,2,1,1,1,1,2 455 | 1230994,4,5,5,8,6,10,10,7,1,4 456 | 1238410,2,3,1,1,3,1,1,1,1,2 457 | 1246562,10,2,2,1,2,6,1,1,2,4 458 | 1257470,10,6,5,8,5,10,8,6,1,4 459 | 1259008,8,8,9,6,6,3,10,10,1,4 460 | 1266124,5,1,2,1,2,1,1,1,1,2 461 | 1267898,5,1,3,1,2,1,1,1,1,2 462 | 1268313,5,1,1,3,2,1,1,1,1,2 463 | 1268804,3,1,1,1,2,5,1,1,1,2 464 | 1276091,6,1,1,3,2,1,1,1,1,2 465 | 1280258,4,1,1,1,2,1,1,2,1,2 466 | 1293966,4,1,1,1,2,1,1,1,1,2 467 | 1296572,10,9,8,7,6,4,7,10,3,4 468 | 1298416,10,6,6,2,4,10,9,7,1,4 469 | 1299596,6,6,6,5,4,10,7,6,2,4 470 | 1105524,4,1,1,1,2,1,1,1,1,2 471 | 1181685,1,1,2,1,2,1,2,1,1,2 472 | 1211594,3,1,1,1,1,1,2,1,1,2 473 | 1238777,6,1,1,3,2,1,1,1,1,2 474 | 1257608,6,1,1,1,1,1,1,1,1,2 475 | 1269574,4,1,1,1,2,1,1,1,1,2 476 | 1277145,5,1,1,1,2,1,1,1,1,2 477 | 1287282,3,1,1,1,2,1,1,1,1,2 478 | 1296025,4,1,2,1,2,1,1,1,1,2 479 | 1296263,4,1,1,1,2,1,1,1,1,2 480 | 1296593,5,2,1,1,2,1,1,1,1,2 481 | 1299161,4,8,7,10,4,10,7,5,1,4 482 | 1301945,5,1,1,1,1,1,1,1,1,2 483 | 1302428,5,3,2,4,2,1,1,1,1,2 484 | 1318169,9,10,10,10,10,5,10,10,10,4 485 | 474162,8,7,8,5,5,10,9,10,1,4 486 | 787451,5,1,2,1,2,1,1,1,1,2 487 | 1002025,1,1,1,3,1,3,1,1,1,2 488 | 1070522,3,1,1,1,1,1,2,1,1,2 489 | 1073960,10,10,10,10,6,10,8,1,5,4 490 | 1076352,3,6,4,10,3,3,3,4,1,4 491 | 1084139,6,3,2,1,3,4,4,1,1,4 492 | 1115293,1,1,1,1,2,1,1,1,1,2 493 | 1119189,5,8,9,4,3,10,7,1,1,4 494 | 1133991,4,1,1,1,1,1,2,1,1,2 495 | 1142706,5,10,10,10,6,10,6,5,2,4 496 | 1155967,5,1,2,10,4,5,2,1,1,2 497 | 1170945,3,1,1,1,1,1,2,1,1,2 498 | 1181567,1,1,1,1,1,1,1,1,1,2 499 | 1182404,4,2,1,1,2,1,1,1,1,2 500 | 1204558,4,1,1,1,2,1,2,1,1,2 501 | 1217952,4,1,1,1,2,1,2,1,1,2 502 | 1224565,6,1,1,1,2,1,3,1,1,2 
503 | 1238186,4,1,1,1,2,1,2,1,1,2 504 | 1253917,4,1,1,2,2,1,2,1,1,2 505 | 1265899,4,1,1,1,2,1,3,1,1,2 506 | 1268766,1,1,1,1,2,1,1,1,1,2 507 | 1277268,3,3,1,1,2,1,1,1,1,2 508 | 1286943,8,10,10,10,7,5,4,8,7,4 509 | 1295508,1,1,1,1,2,4,1,1,1,2 510 | 1297327,5,1,1,1,2,1,1,1,1,2 511 | 1297522,2,1,1,1,2,1,1,1,1,2 512 | 1298360,1,1,1,1,2,1,1,1,1,2 513 | 1299924,5,1,1,1,2,1,2,1,1,2 514 | 1299994,5,1,1,1,2,1,1,1,1,2 515 | 1304595,3,1,1,1,1,1,2,1,1,2 516 | 1306282,6,6,7,10,3,10,8,10,2,4 517 | 1313325,4,10,4,7,3,10,9,10,1,4 518 | 1320077,1,1,1,1,1,1,1,1,1,2 519 | 1320077,1,1,1,1,1,1,2,1,1,2 520 | 1320304,3,1,2,2,2,1,1,1,1,2 521 | 1330439,4,7,8,3,4,10,9,1,1,4 522 | 333093,1,1,1,1,3,1,1,1,1,2 523 | 369565,4,1,1,1,3,1,1,1,1,2 524 | 412300,10,4,5,4,3,5,7,3,1,4 525 | 672113,7,5,6,10,4,10,5,3,1,4 526 | 749653,3,1,1,1,2,1,2,1,1,2 527 | 769612,3,1,1,2,2,1,1,1,1,2 528 | 769612,4,1,1,1,2,1,1,1,1,2 529 | 798429,4,1,1,1,2,1,3,1,1,2 530 | 807657,6,1,3,2,2,1,1,1,1,2 531 | 8233704,4,1,1,1,1,1,2,1,1,2 532 | 837480,7,4,4,3,4,10,6,9,1,4 533 | 867392,4,2,2,1,2,1,2,1,1,2 534 | 869828,1,1,1,1,1,1,3,1,1,2 535 | 1043068,3,1,1,1,2,1,2,1,1,2 536 | 1056171,2,1,1,1,2,1,2,1,1,2 537 | 1061990,1,1,3,2,2,1,3,1,1,2 538 | 1113061,5,1,1,1,2,1,3,1,1,2 539 | 1116192,5,1,2,1,2,1,3,1,1,2 540 | 1135090,4,1,1,1,2,1,2,1,1,2 541 | 1145420,6,1,1,1,2,1,2,1,1,2 542 | 1158157,5,1,1,1,2,2,2,1,1,2 543 | 1171578,3,1,1,1,2,1,1,1,1,2 544 | 1174841,5,3,1,1,2,1,1,1,1,2 545 | 1184586,4,1,1,1,2,1,2,1,1,2 546 | 1186936,2,1,3,2,2,1,2,1,1,2 547 | 1197527,5,1,1,1,2,1,2,1,1,2 548 | 1222464,6,10,10,10,4,10,7,10,1,4 549 | 1240603,2,1,1,1,1,1,1,1,1,2 550 | 1240603,3,1,1,1,1,1,1,1,1,2 551 | 1241035,7,8,3,7,4,5,7,8,2,4 552 | 1287971,3,1,1,1,2,1,2,1,1,2 553 | 1289391,1,1,1,1,2,1,3,1,1,2 554 | 1299924,3,2,2,2,2,1,4,2,1,2 555 | 1306339,4,4,2,1,2,5,2,1,2,2 556 | 1313658,3,1,1,1,2,1,1,1,1,2 557 | 1313982,4,3,1,1,2,1,4,8,1,2 558 | 1321264,5,2,2,2,1,1,2,1,1,2 559 | 1321321,5,1,1,3,2,1,1,1,1,2 560 | 1321348,2,1,1,1,2,1,2,1,1,2 561 | 1321931,5,1,1,1,2,1,2,1,1,2 562 | 1321942,5,1,1,1,2,1,3,1,1,2 563 | 1321942,5,1,1,1,2,1,3,1,1,2 564 | 1328331,1,1,1,1,2,1,3,1,1,2 565 | 1328755,3,1,1,1,2,1,2,1,1,2 566 | 1331405,4,1,1,1,2,1,3,2,1,2 567 | 1331412,5,7,10,10,5,10,10,10,1,4 568 | 1333104,3,1,2,1,2,1,3,1,1,2 569 | 1334071,4,1,1,1,2,3,2,1,1,2 570 | 1343068,8,4,4,1,6,10,2,5,2,4 571 | 1343374,10,10,8,10,6,5,10,3,1,4 572 | 1344121,8,10,4,4,8,10,8,2,1,4 573 | 142932,7,6,10,5,3,10,9,10,2,4 574 | 183936,3,1,1,1,2,1,2,1,1,2 575 | 324382,1,1,1,1,2,1,2,1,1,2 576 | 378275,10,9,7,3,4,2,7,7,1,4 577 | 385103,5,1,2,1,2,1,3,1,1,2 578 | 690557,5,1,1,1,2,1,2,1,1,2 579 | 695091,1,1,1,1,2,1,2,1,1,2 580 | 695219,1,1,1,1,2,1,2,1,1,2 581 | 824249,1,1,1,1,2,1,3,1,1,2 582 | 871549,5,1,2,1,2,1,2,1,1,2 583 | 878358,5,7,10,6,5,10,7,5,1,4 584 | 1107684,6,10,5,5,4,10,6,10,1,4 585 | 1115762,3,1,1,1,2,1,1,1,1,2 586 | 1217717,5,1,1,6,3,1,1,1,1,2 587 | 1239420,1,1,1,1,2,1,1,1,1,2 588 | 1254538,8,10,10,10,6,10,10,10,1,4 589 | 1261751,5,1,1,1,2,1,2,2,1,2 590 | 1268275,9,8,8,9,6,3,4,1,1,4 591 | 1272166,5,1,1,1,2,1,1,1,1,2 592 | 1294261,4,10,8,5,4,1,10,1,1,4 593 | 1295529,2,5,7,6,4,10,7,6,1,4 594 | 1298484,10,3,4,5,3,10,4,1,1,4 595 | 1311875,5,1,2,1,2,1,1,1,1,2 596 | 1315506,4,8,6,3,4,10,7,1,1,4 597 | 1320141,5,1,1,1,2,1,2,1,1,2 598 | 1325309,4,1,2,1,2,1,2,1,1,2 599 | 1333063,5,1,3,1,2,1,3,1,1,2 600 | 1333495,3,1,1,1,2,1,2,1,1,2 601 | 1334659,5,2,4,1,1,1,1,1,1,2 602 | 1336798,3,1,1,1,2,1,2,1,1,2 603 | 1344449,1,1,1,1,1,1,2,1,1,2 604 | 1350568,4,1,1,1,2,1,2,1,1,2 605 | 1352663,5,4,6,8,4,1,8,10,1,4 606 | 
188336,5,3,2,8,5,10,8,1,2,4 607 | 352431,10,5,10,3,5,8,7,8,3,4 608 | 353098,4,1,1,2,2,1,1,1,1,2 609 | 411453,1,1,1,1,2,1,1,1,1,2 610 | 557583,5,10,10,10,10,10,10,1,1,4 611 | 636375,5,1,1,1,2,1,1,1,1,2 612 | 736150,10,4,3,10,3,10,7,1,2,4 613 | 803531,5,10,10,10,5,2,8,5,1,4 614 | 822829,8,10,10,10,6,10,10,10,10,4 615 | 1016634,2,3,1,1,2,1,2,1,1,2 616 | 1031608,2,1,1,1,1,1,2,1,1,2 617 | 1041043,4,1,3,1,2,1,2,1,1,2 618 | 1042252,3,1,1,1,2,1,2,1,1,2 619 | 1057067,1,1,1,1,1,?,1,1,1,2 620 | 1061990,4,1,1,1,2,1,2,1,1,2 621 | 1073836,5,1,1,1,2,1,2,1,1,2 622 | 1083817,3,1,1,1,2,1,2,1,1,2 623 | 1096352,6,3,3,3,3,2,6,1,1,2 624 | 1140597,7,1,2,3,2,1,2,1,1,2 625 | 1149548,1,1,1,1,2,1,1,1,1,2 626 | 1174009,5,1,1,2,1,1,2,1,1,2 627 | 1183596,3,1,3,1,3,4,1,1,1,2 628 | 1190386,4,6,6,5,7,6,7,7,3,4 629 | 1190546,2,1,1,1,2,5,1,1,1,2 630 | 1213273,2,1,1,1,2,1,1,1,1,2 631 | 1218982,4,1,1,1,2,1,1,1,1,2 632 | 1225382,6,2,3,1,2,1,1,1,1,2 633 | 1235807,5,1,1,1,2,1,2,1,1,2 634 | 1238777,1,1,1,1,2,1,1,1,1,2 635 | 1253955,8,7,4,4,5,3,5,10,1,4 636 | 1257366,3,1,1,1,2,1,1,1,1,2 637 | 1260659,3,1,4,1,2,1,1,1,1,2 638 | 1268952,10,10,7,8,7,1,10,10,3,4 639 | 1275807,4,2,4,3,2,2,2,1,1,2 640 | 1277792,4,1,1,1,2,1,1,1,1,2 641 | 1277792,5,1,1,3,2,1,1,1,1,2 642 | 1285722,4,1,1,3,2,1,1,1,1,2 643 | 1288608,3,1,1,1,2,1,2,1,1,2 644 | 1290203,3,1,1,1,2,1,2,1,1,2 645 | 1294413,1,1,1,1,2,1,1,1,1,2 646 | 1299596,2,1,1,1,2,1,1,1,1,2 647 | 1303489,3,1,1,1,2,1,2,1,1,2 648 | 1311033,1,2,2,1,2,1,1,1,1,2 649 | 1311108,1,1,1,3,2,1,1,1,1,2 650 | 1315807,5,10,10,10,10,2,10,10,10,4 651 | 1318671,3,1,1,1,2,1,2,1,1,2 652 | 1319609,3,1,1,2,3,4,1,1,1,2 653 | 1323477,1,2,1,3,2,1,2,1,1,2 654 | 1324572,5,1,1,1,2,1,2,2,1,2 655 | 1324681,4,1,1,1,2,1,2,1,1,2 656 | 1325159,3,1,1,1,2,1,3,1,1,2 657 | 1326892,3,1,1,1,2,1,2,1,1,2 658 | 1330361,5,1,1,1,2,1,2,1,1,2 659 | 1333877,5,4,5,1,8,1,3,6,1,2 660 | 1334015,7,8,8,7,3,10,7,2,3,4 661 | 1334667,1,1,1,1,2,1,1,1,1,2 662 | 1339781,1,1,1,1,2,1,2,1,1,2 663 | 1339781,4,1,1,1,2,1,3,1,1,2 664 | 13454352,1,1,3,1,2,1,2,1,1,2 665 | 1345452,1,1,3,1,2,1,2,1,1,2 666 | 1345593,3,1,1,3,2,1,2,1,1,2 667 | 1347749,1,1,1,1,2,1,1,1,1,2 668 | 1347943,5,2,2,2,2,1,1,1,2,2 669 | 1348851,3,1,1,1,2,1,3,1,1,2 670 | 1350319,5,7,4,1,6,1,7,10,3,4 671 | 1350423,5,10,10,8,5,5,7,10,1,4 672 | 1352848,3,10,7,8,5,8,7,4,1,4 673 | 1353092,3,2,1,2,2,1,3,1,1,2 674 | 1354840,2,1,1,1,2,1,3,1,1,2 675 | 1354840,5,3,2,1,3,1,1,1,1,2 676 | 1355260,1,1,1,1,2,1,2,1,1,2 677 | 1365075,4,1,4,1,2,1,1,1,1,2 678 | 1365328,1,1,2,1,2,1,2,1,1,2 679 | 1368267,5,1,1,1,2,1,1,1,1,2 680 | 1368273,1,1,1,1,2,1,1,1,1,2 681 | 1368882,2,1,1,1,2,1,1,1,1,2 682 | 1369821,10,10,10,10,5,10,10,10,7,4 683 | 1371026,5,10,10,10,4,10,5,6,3,4 684 | 1371920,5,1,1,1,2,1,3,2,1,2 685 | 466906,1,1,1,1,2,1,1,1,1,2 686 | 466906,1,1,1,1,2,1,1,1,1,2 687 | 534555,1,1,1,1,2,1,1,1,1,2 688 | 536708,1,1,1,1,2,1,1,1,1,2 689 | 566346,3,1,1,1,2,1,2,3,1,2 690 | 603148,4,1,1,1,2,1,1,1,1,2 691 | 654546,1,1,1,1,2,1,1,1,8,2 692 | 654546,1,1,1,3,2,1,1,1,1,2 693 | 695091,5,10,10,5,4,5,4,4,1,4 694 | 714039,3,1,1,1,2,1,1,1,1,2 695 | 763235,3,1,1,1,2,1,2,1,2,2 696 | 776715,3,1,1,1,3,2,1,1,1,2 697 | 841769,2,1,1,1,2,1,1,1,1,2 698 | 888820,5,10,10,3,7,3,8,10,2,4 699 | 897471,4,8,6,4,3,4,10,6,1,4 700 | 897471,4,8,8,5,4,5,10,4,1,4 701 | -------------------------------------------------------------------------------- /datasets/pima-indians-diabetes.csv: -------------------------------------------------------------------------------- 1 | Pregnancy Count,Blood Glucose,Diastolic BP,Triceps Skin Fold Thickness,Serum 
Insulin,BMI,Diabetes Pedigree Function,Age,Class 2 | 6,148,72,35,0,33.6,0.627,50,1 3 | 1,85,66,29,0,26.6,0.351,31,0 4 | 8,183,64,0,0,23.3,0.672,32,1 5 | 1,89,66,23,94,28.1,0.167,21,0 6 | 0,137,40,35,168,43.1,2.288,33,1 7 | 5,116,74,0,0,25.6,0.201,30,0 8 | 3,78,50,32,88,31.0,0.248,26,1 9 | 10,115,0,0,0,35.3,0.134,29,0 10 | 2,197,70,45,543,30.5,0.158,53,1 11 | 8,125,96,0,0,0.0,0.232,54,1 12 | 4,110,92,0,0,37.6,0.191,30,0 13 | 10,168,74,0,0,38.0,0.537,34,1 14 | 10,139,80,0,0,27.1,1.441,57,0 15 | 1,189,60,23,846,30.1,0.398,59,1 16 | 5,166,72,19,175,25.8,0.587,51,1 17 | 7,100,0,0,0,30.0,0.484,32,1 18 | 0,118,84,47,230,45.8,0.551,31,1 19 | 7,107,74,0,0,29.6,0.254,31,1 20 | 1,103,30,38,83,43.3,0.183,33,0 21 | 1,115,70,30,96,34.6,0.529,32,1 22 | 3,126,88,41,235,39.3,0.704,27,0 23 | 8,99,84,0,0,35.4,0.388,50,0 24 | 7,196,90,0,0,39.8,0.451,41,1 25 | 9,119,80,35,0,29.0,0.263,29,1 26 | 11,143,94,33,146,36.6,0.254,51,1 27 | 10,125,70,26,115,31.1,0.205,41,1 28 | 7,147,76,0,0,39.4,0.257,43,1 29 | 1,97,66,15,140,23.2,0.487,22,0 30 | 13,145,82,19,110,22.2,0.245,57,0 31 | 5,117,92,0,0,34.1,0.337,38,0 32 | 5,109,75,26,0,36.0,0.546,60,0 33 | 3,158,76,36,245,31.6,0.851,28,1 34 | 3,88,58,11,54,24.8,0.267,22,0 35 | 6,92,92,0,0,19.9,0.188,28,0 36 | 10,122,78,31,0,27.6,0.512,45,0 37 | 4,103,60,33,192,24.0,0.966,33,0 38 | 11,138,76,0,0,33.2,0.420,35,0 39 | 9,102,76,37,0,32.9,0.665,46,1 40 | 2,90,68,42,0,38.2,0.503,27,1 41 | 4,111,72,47,207,37.1,1.390,56,1 42 | 3,180,64,25,70,34.0,0.271,26,0 43 | 7,133,84,0,0,40.2,0.696,37,0 44 | 7,106,92,18,0,22.7,0.235,48,0 45 | 9,171,110,24,240,45.4,0.721,54,1 46 | 7,159,64,0,0,27.4,0.294,40,0 47 | 0,180,66,39,0,42.0,1.893,25,1 48 | 1,146,56,0,0,29.7,0.564,29,0 49 | 2,71,70,27,0,28.0,0.586,22,0 50 | 7,103,66,32,0,39.1,0.344,31,1 51 | 7,105,0,0,0,0.0,0.305,24,0 52 | 1,103,80,11,82,19.4,0.491,22,0 53 | 1,101,50,15,36,24.2,0.526,26,0 54 | 5,88,66,21,23,24.4,0.342,30,0 55 | 8,176,90,34,300,33.7,0.467,58,1 56 | 7,150,66,42,342,34.7,0.718,42,0 57 | 1,73,50,10,0,23.0,0.248,21,0 58 | 7,187,68,39,304,37.7,0.254,41,1 59 | 0,100,88,60,110,46.8,0.962,31,0 60 | 0,146,82,0,0,40.5,1.781,44,0 61 | 0,105,64,41,142,41.5,0.173,22,0 62 | 2,84,0,0,0,0.0,0.304,21,0 63 | 8,133,72,0,0,32.9,0.270,39,1 64 | 5,44,62,0,0,25.0,0.587,36,0 65 | 2,141,58,34,128,25.4,0.699,24,0 66 | 7,114,66,0,0,32.8,0.258,42,1 67 | 5,99,74,27,0,29.0,0.203,32,0 68 | 0,109,88,30,0,32.5,0.855,38,1 69 | 2,109,92,0,0,42.7,0.845,54,0 70 | 1,95,66,13,38,19.6,0.334,25,0 71 | 4,146,85,27,100,28.9,0.189,27,0 72 | 2,100,66,20,90,32.9,0.867,28,1 73 | 5,139,64,35,140,28.6,0.411,26,0 74 | 13,126,90,0,0,43.4,0.583,42,1 75 | 4,129,86,20,270,35.1,0.231,23,0 76 | 1,79,75,30,0,32.0,0.396,22,0 77 | 1,0,48,20,0,24.7,0.140,22,0 78 | 7,62,78,0,0,32.6,0.391,41,0 79 | 5,95,72,33,0,37.7,0.370,27,0 80 | 0,131,0,0,0,43.2,0.270,26,1 81 | 2,112,66,22,0,25.0,0.307,24,0 82 | 3,113,44,13,0,22.4,0.140,22,0 83 | 2,74,0,0,0,0.0,0.102,22,0 84 | 7,83,78,26,71,29.3,0.767,36,0 85 | 0,101,65,28,0,24.6,0.237,22,0 86 | 5,137,108,0,0,48.8,0.227,37,1 87 | 2,110,74,29,125,32.4,0.698,27,0 88 | 13,106,72,54,0,36.6,0.178,45,0 89 | 2,100,68,25,71,38.5,0.324,26,0 90 | 15,136,70,32,110,37.1,0.153,43,1 91 | 1,107,68,19,0,26.5,0.165,24,0 92 | 1,80,55,0,0,19.1,0.258,21,0 93 | 4,123,80,15,176,32.0,0.443,34,0 94 | 7,81,78,40,48,46.7,0.261,42,0 95 | 4,134,72,0,0,23.8,0.277,60,1 96 | 2,142,82,18,64,24.7,0.761,21,0 97 | 6,144,72,27,228,33.9,0.255,40,0 98 | 2,92,62,28,0,31.6,0.130,24,0 99 | 1,71,48,18,76,20.4,0.323,22,0 100 | 6,93,50,30,64,28.7,0.356,23,0 101 | 
1,122,90,51,220,49.7,0.325,31,1 102 | 1,163,72,0,0,39.0,1.222,33,1 103 | 1,151,60,0,0,26.1,0.179,22,0 104 | 0,125,96,0,0,22.5,0.262,21,0 105 | 1,81,72,18,40,26.6,0.283,24,0 106 | 2,85,65,0,0,39.6,0.930,27,0 107 | 1,126,56,29,152,28.7,0.801,21,0 108 | 1,96,122,0,0,22.4,0.207,27,0 109 | 4,144,58,28,140,29.5,0.287,37,0 110 | 3,83,58,31,18,34.3,0.336,25,0 111 | 0,95,85,25,36,37.4,0.247,24,1 112 | 3,171,72,33,135,33.3,0.199,24,1 113 | 8,155,62,26,495,34.0,0.543,46,1 114 | 1,89,76,34,37,31.2,0.192,23,0 115 | 4,76,62,0,0,34.0,0.391,25,0 116 | 7,160,54,32,175,30.5,0.588,39,1 117 | 4,146,92,0,0,31.2,0.539,61,1 118 | 5,124,74,0,0,34.0,0.220,38,1 119 | 5,78,48,0,0,33.7,0.654,25,0 120 | 4,97,60,23,0,28.2,0.443,22,0 121 | 4,99,76,15,51,23.2,0.223,21,0 122 | 0,162,76,56,100,53.2,0.759,25,1 123 | 6,111,64,39,0,34.2,0.260,24,0 124 | 2,107,74,30,100,33.6,0.404,23,0 125 | 5,132,80,0,0,26.8,0.186,69,0 126 | 0,113,76,0,0,33.3,0.278,23,1 127 | 1,88,30,42,99,55.0,0.496,26,1 128 | 3,120,70,30,135,42.9,0.452,30,0 129 | 1,118,58,36,94,33.3,0.261,23,0 130 | 1,117,88,24,145,34.5,0.403,40,1 131 | 0,105,84,0,0,27.9,0.741,62,1 132 | 4,173,70,14,168,29.7,0.361,33,1 133 | 9,122,56,0,0,33.3,1.114,33,1 134 | 3,170,64,37,225,34.5,0.356,30,1 135 | 8,84,74,31,0,38.3,0.457,39,0 136 | 2,96,68,13,49,21.1,0.647,26,0 137 | 2,125,60,20,140,33.8,0.088,31,0 138 | 0,100,70,26,50,30.8,0.597,21,0 139 | 0,93,60,25,92,28.7,0.532,22,0 140 | 0,129,80,0,0,31.2,0.703,29,0 141 | 5,105,72,29,325,36.9,0.159,28,0 142 | 3,128,78,0,0,21.1,0.268,55,0 143 | 5,106,82,30,0,39.5,0.286,38,0 144 | 2,108,52,26,63,32.5,0.318,22,0 145 | 10,108,66,0,0,32.4,0.272,42,1 146 | 4,154,62,31,284,32.8,0.237,23,0 147 | 0,102,75,23,0,0.0,0.572,21,0 148 | 9,57,80,37,0,32.8,0.096,41,0 149 | 2,106,64,35,119,30.5,1.400,34,0 150 | 5,147,78,0,0,33.7,0.218,65,0 151 | 2,90,70,17,0,27.3,0.085,22,0 152 | 1,136,74,50,204,37.4,0.399,24,0 153 | 4,114,65,0,0,21.9,0.432,37,0 154 | 9,156,86,28,155,34.3,1.189,42,1 155 | 1,153,82,42,485,40.6,0.687,23,0 156 | 8,188,78,0,0,47.9,0.137,43,1 157 | 7,152,88,44,0,50.0,0.337,36,1 158 | 2,99,52,15,94,24.6,0.637,21,0 159 | 1,109,56,21,135,25.2,0.833,23,0 160 | 2,88,74,19,53,29.0,0.229,22,0 161 | 17,163,72,41,114,40.9,0.817,47,1 162 | 4,151,90,38,0,29.7,0.294,36,0 163 | 7,102,74,40,105,37.2,0.204,45,0 164 | 0,114,80,34,285,44.2,0.167,27,0 165 | 2,100,64,23,0,29.7,0.368,21,0 166 | 0,131,88,0,0,31.6,0.743,32,1 167 | 6,104,74,18,156,29.9,0.722,41,1 168 | 3,148,66,25,0,32.5,0.256,22,0 169 | 4,120,68,0,0,29.6,0.709,34,0 170 | 4,110,66,0,0,31.9,0.471,29,0 171 | 3,111,90,12,78,28.4,0.495,29,0 172 | 6,102,82,0,0,30.8,0.180,36,1 173 | 6,134,70,23,130,35.4,0.542,29,1 174 | 2,87,0,23,0,28.9,0.773,25,0 175 | 1,79,60,42,48,43.5,0.678,23,0 176 | 2,75,64,24,55,29.7,0.370,33,0 177 | 8,179,72,42,130,32.7,0.719,36,1 178 | 6,85,78,0,0,31.2,0.382,42,0 179 | 0,129,110,46,130,67.1,0.319,26,1 180 | 5,143,78,0,0,45.0,0.190,47,0 181 | 5,130,82,0,0,39.1,0.956,37,1 182 | 6,87,80,0,0,23.2,0.084,32,0 183 | 0,119,64,18,92,34.9,0.725,23,0 184 | 1,0,74,20,23,27.7,0.299,21,0 185 | 5,73,60,0,0,26.8,0.268,27,0 186 | 4,141,74,0,0,27.6,0.244,40,0 187 | 7,194,68,28,0,35.9,0.745,41,1 188 | 8,181,68,36,495,30.1,0.615,60,1 189 | 1,128,98,41,58,32.0,1.321,33,1 190 | 8,109,76,39,114,27.9,0.640,31,1 191 | 5,139,80,35,160,31.6,0.361,25,1 192 | 3,111,62,0,0,22.6,0.142,21,0 193 | 9,123,70,44,94,33.1,0.374,40,0 194 | 7,159,66,0,0,30.4,0.383,36,1 195 | 11,135,0,0,0,52.3,0.578,40,1 196 | 8,85,55,20,0,24.4,0.136,42,0 197 | 5,158,84,41,210,39.4,0.395,29,1 198 | 1,105,58,0,0,24.3,0.187,21,0 199 | 
3,107,62,13,48,22.9,0.678,23,1 200 | 4,109,64,44,99,34.8,0.905,26,1 201 | 4,148,60,27,318,30.9,0.150,29,1 202 | 0,113,80,16,0,31.0,0.874,21,0 203 | 1,138,82,0,0,40.1,0.236,28,0 204 | 0,108,68,20,0,27.3,0.787,32,0 205 | 2,99,70,16,44,20.4,0.235,27,0 206 | 6,103,72,32,190,37.7,0.324,55,0 207 | 5,111,72,28,0,23.9,0.407,27,0 208 | 8,196,76,29,280,37.5,0.605,57,1 209 | 5,162,104,0,0,37.7,0.151,52,1 210 | 1,96,64,27,87,33.2,0.289,21,0 211 | 7,184,84,33,0,35.5,0.355,41,1 212 | 2,81,60,22,0,27.7,0.290,25,0 213 | 0,147,85,54,0,42.8,0.375,24,0 214 | 7,179,95,31,0,34.2,0.164,60,0 215 | 0,140,65,26,130,42.6,0.431,24,1 216 | 9,112,82,32,175,34.2,0.260,36,1 217 | 12,151,70,40,271,41.8,0.742,38,1 218 | 5,109,62,41,129,35.8,0.514,25,1 219 | 6,125,68,30,120,30.0,0.464,32,0 220 | 5,85,74,22,0,29.0,1.224,32,1 221 | 5,112,66,0,0,37.8,0.261,41,1 222 | 0,177,60,29,478,34.6,1.072,21,1 223 | 2,158,90,0,0,31.6,0.805,66,1 224 | 7,119,0,0,0,25.2,0.209,37,0 225 | 7,142,60,33,190,28.8,0.687,61,0 226 | 1,100,66,15,56,23.6,0.666,26,0 227 | 1,87,78,27,32,34.6,0.101,22,0 228 | 0,101,76,0,0,35.7,0.198,26,0 229 | 3,162,52,38,0,37.2,0.652,24,1 230 | 4,197,70,39,744,36.7,2.329,31,0 231 | 0,117,80,31,53,45.2,0.089,24,0 232 | 4,142,86,0,0,44.0,0.645,22,1 233 | 6,134,80,37,370,46.2,0.238,46,1 234 | 1,79,80,25,37,25.4,0.583,22,0 235 | 4,122,68,0,0,35.0,0.394,29,0 236 | 3,74,68,28,45,29.7,0.293,23,0 237 | 4,171,72,0,0,43.6,0.479,26,1 238 | 7,181,84,21,192,35.9,0.586,51,1 239 | 0,179,90,27,0,44.1,0.686,23,1 240 | 9,164,84,21,0,30.8,0.831,32,1 241 | 0,104,76,0,0,18.4,0.582,27,0 242 | 1,91,64,24,0,29.2,0.192,21,0 243 | 4,91,70,32,88,33.1,0.446,22,0 244 | 3,139,54,0,0,25.6,0.402,22,1 245 | 6,119,50,22,176,27.1,1.318,33,1 246 | 2,146,76,35,194,38.2,0.329,29,0 247 | 9,184,85,15,0,30.0,1.213,49,1 248 | 10,122,68,0,0,31.2,0.258,41,0 249 | 0,165,90,33,680,52.3,0.427,23,0 250 | 9,124,70,33,402,35.4,0.282,34,0 251 | 1,111,86,19,0,30.1,0.143,23,0 252 | 9,106,52,0,0,31.2,0.380,42,0 253 | 2,129,84,0,0,28.0,0.284,27,0 254 | 2,90,80,14,55,24.4,0.249,24,0 255 | 0,86,68,32,0,35.8,0.238,25,0 256 | 12,92,62,7,258,27.6,0.926,44,1 257 | 1,113,64,35,0,33.6,0.543,21,1 258 | 3,111,56,39,0,30.1,0.557,30,0 259 | 2,114,68,22,0,28.7,0.092,25,0 260 | 1,193,50,16,375,25.9,0.655,24,0 261 | 11,155,76,28,150,33.3,1.353,51,1 262 | 3,191,68,15,130,30.9,0.299,34,0 263 | 3,141,0,0,0,30.0,0.761,27,1 264 | 4,95,70,32,0,32.1,0.612,24,0 265 | 3,142,80,15,0,32.4,0.200,63,0 266 | 4,123,62,0,0,32.0,0.226,35,1 267 | 5,96,74,18,67,33.6,0.997,43,0 268 | 0,138,0,0,0,36.3,0.933,25,1 269 | 2,128,64,42,0,40.0,1.101,24,0 270 | 0,102,52,0,0,25.1,0.078,21,0 271 | 2,146,0,0,0,27.5,0.240,28,1 272 | 10,101,86,37,0,45.6,1.136,38,1 273 | 2,108,62,32,56,25.2,0.128,21,0 274 | 3,122,78,0,0,23.0,0.254,40,0 275 | 1,71,78,50,45,33.2,0.422,21,0 276 | 13,106,70,0,0,34.2,0.251,52,0 277 | 2,100,70,52,57,40.5,0.677,25,0 278 | 7,106,60,24,0,26.5,0.296,29,1 279 | 0,104,64,23,116,27.8,0.454,23,0 280 | 5,114,74,0,0,24.9,0.744,57,0 281 | 2,108,62,10,278,25.3,0.881,22,0 282 | 0,146,70,0,0,37.9,0.334,28,1 283 | 10,129,76,28,122,35.9,0.280,39,0 284 | 7,133,88,15,155,32.4,0.262,37,0 285 | 7,161,86,0,0,30.4,0.165,47,1 286 | 2,108,80,0,0,27.0,0.259,52,1 287 | 7,136,74,26,135,26.0,0.647,51,0 288 | 5,155,84,44,545,38.7,0.619,34,0 289 | 1,119,86,39,220,45.6,0.808,29,1 290 | 4,96,56,17,49,20.8,0.340,26,0 291 | 5,108,72,43,75,36.1,0.263,33,0 292 | 0,78,88,29,40,36.9,0.434,21,0 293 | 0,107,62,30,74,36.6,0.757,25,1 294 | 2,128,78,37,182,43.3,1.224,31,1 295 | 1,128,48,45,194,40.5,0.613,24,1 296 | 
0,161,50,0,0,21.9,0.254,65,0 297 | 6,151,62,31,120,35.5,0.692,28,0 298 | 2,146,70,38,360,28.0,0.337,29,1 299 | 0,126,84,29,215,30.7,0.520,24,0 300 | 14,100,78,25,184,36.6,0.412,46,1 301 | 8,112,72,0,0,23.6,0.840,58,0 302 | 0,167,0,0,0,32.3,0.839,30,1 303 | 2,144,58,33,135,31.6,0.422,25,1 304 | 5,77,82,41,42,35.8,0.156,35,0 305 | 5,115,98,0,0,52.9,0.209,28,1 306 | 3,150,76,0,0,21.0,0.207,37,0 307 | 2,120,76,37,105,39.7,0.215,29,0 308 | 10,161,68,23,132,25.5,0.326,47,1 309 | 0,137,68,14,148,24.8,0.143,21,0 310 | 0,128,68,19,180,30.5,1.391,25,1 311 | 2,124,68,28,205,32.9,0.875,30,1 312 | 6,80,66,30,0,26.2,0.313,41,0 313 | 0,106,70,37,148,39.4,0.605,22,0 314 | 2,155,74,17,96,26.6,0.433,27,1 315 | 3,113,50,10,85,29.5,0.626,25,0 316 | 7,109,80,31,0,35.9,1.127,43,1 317 | 2,112,68,22,94,34.1,0.315,26,0 318 | 3,99,80,11,64,19.3,0.284,30,0 319 | 3,182,74,0,0,30.5,0.345,29,1 320 | 3,115,66,39,140,38.1,0.150,28,0 321 | 6,194,78,0,0,23.5,0.129,59,1 322 | 4,129,60,12,231,27.5,0.527,31,0 323 | 3,112,74,30,0,31.6,0.197,25,1 324 | 0,124,70,20,0,27.4,0.254,36,1 325 | 13,152,90,33,29,26.8,0.731,43,1 326 | 2,112,75,32,0,35.7,0.148,21,0 327 | 1,157,72,21,168,25.6,0.123,24,0 328 | 1,122,64,32,156,35.1,0.692,30,1 329 | 10,179,70,0,0,35.1,0.200,37,0 330 | 2,102,86,36,120,45.5,0.127,23,1 331 | 6,105,70,32,68,30.8,0.122,37,0 332 | 8,118,72,19,0,23.1,1.476,46,0 333 | 2,87,58,16,52,32.7,0.166,25,0 334 | 1,180,0,0,0,43.3,0.282,41,1 335 | 12,106,80,0,0,23.6,0.137,44,0 336 | 1,95,60,18,58,23.9,0.260,22,0 337 | 0,165,76,43,255,47.9,0.259,26,0 338 | 0,117,0,0,0,33.8,0.932,44,0 339 | 5,115,76,0,0,31.2,0.343,44,1 340 | 9,152,78,34,171,34.2,0.893,33,1 341 | 7,178,84,0,0,39.9,0.331,41,1 342 | 1,130,70,13,105,25.9,0.472,22,0 343 | 1,95,74,21,73,25.9,0.673,36,0 344 | 1,0,68,35,0,32.0,0.389,22,0 345 | 5,122,86,0,0,34.7,0.290,33,0 346 | 8,95,72,0,0,36.8,0.485,57,0 347 | 8,126,88,36,108,38.5,0.349,49,0 348 | 1,139,46,19,83,28.7,0.654,22,0 349 | 3,116,0,0,0,23.5,0.187,23,0 350 | 3,99,62,19,74,21.8,0.279,26,0 351 | 5,0,80,32,0,41.0,0.346,37,1 352 | 4,92,80,0,0,42.2,0.237,29,0 353 | 4,137,84,0,0,31.2,0.252,30,0 354 | 3,61,82,28,0,34.4,0.243,46,0 355 | 1,90,62,12,43,27.2,0.580,24,0 356 | 3,90,78,0,0,42.7,0.559,21,0 357 | 9,165,88,0,0,30.4,0.302,49,1 358 | 1,125,50,40,167,33.3,0.962,28,1 359 | 13,129,0,30,0,39.9,0.569,44,1 360 | 12,88,74,40,54,35.3,0.378,48,0 361 | 1,196,76,36,249,36.5,0.875,29,1 362 | 5,189,64,33,325,31.2,0.583,29,1 363 | 5,158,70,0,0,29.8,0.207,63,0 364 | 5,103,108,37,0,39.2,0.305,65,0 365 | 4,146,78,0,0,38.5,0.520,67,1 366 | 4,147,74,25,293,34.9,0.385,30,0 367 | 5,99,54,28,83,34.0,0.499,30,0 368 | 6,124,72,0,0,27.6,0.368,29,1 369 | 0,101,64,17,0,21.0,0.252,21,0 370 | 3,81,86,16,66,27.5,0.306,22,0 371 | 1,133,102,28,140,32.8,0.234,45,1 372 | 3,173,82,48,465,38.4,2.137,25,1 373 | 0,118,64,23,89,0.0,1.731,21,0 374 | 0,84,64,22,66,35.8,0.545,21,0 375 | 2,105,58,40,94,34.9,0.225,25,0 376 | 2,122,52,43,158,36.2,0.816,28,0 377 | 12,140,82,43,325,39.2,0.528,58,1 378 | 0,98,82,15,84,25.2,0.299,22,0 379 | 1,87,60,37,75,37.2,0.509,22,0 380 | 4,156,75,0,0,48.3,0.238,32,1 381 | 0,93,100,39,72,43.4,1.021,35,0 382 | 1,107,72,30,82,30.8,0.821,24,0 383 | 0,105,68,22,0,20.0,0.236,22,0 384 | 1,109,60,8,182,25.4,0.947,21,0 385 | 1,90,62,18,59,25.1,1.268,25,0 386 | 1,125,70,24,110,24.3,0.221,25,0 387 | 1,119,54,13,50,22.3,0.205,24,0 388 | 5,116,74,29,0,32.3,0.660,35,1 389 | 8,105,100,36,0,43.3,0.239,45,1 390 | 5,144,82,26,285,32.0,0.452,58,1 391 | 3,100,68,23,81,31.6,0.949,28,0 392 | 1,100,66,29,196,32.0,0.444,42,0 393 | 
5,166,76,0,0,45.7,0.340,27,1 394 | 1,131,64,14,415,23.7,0.389,21,0 395 | 4,116,72,12,87,22.1,0.463,37,0 396 | 4,158,78,0,0,32.9,0.803,31,1 397 | 2,127,58,24,275,27.7,1.600,25,0 398 | 3,96,56,34,115,24.7,0.944,39,0 399 | 0,131,66,40,0,34.3,0.196,22,1 400 | 3,82,70,0,0,21.1,0.389,25,0 401 | 3,193,70,31,0,34.9,0.241,25,1 402 | 4,95,64,0,0,32.0,0.161,31,1 403 | 6,137,61,0,0,24.2,0.151,55,0 404 | 5,136,84,41,88,35.0,0.286,35,1 405 | 9,72,78,25,0,31.6,0.280,38,0 406 | 5,168,64,0,0,32.9,0.135,41,1 407 | 2,123,48,32,165,42.1,0.520,26,0 408 | 4,115,72,0,0,28.9,0.376,46,1 409 | 0,101,62,0,0,21.9,0.336,25,0 410 | 8,197,74,0,0,25.9,1.191,39,1 411 | 1,172,68,49,579,42.4,0.702,28,1 412 | 6,102,90,39,0,35.7,0.674,28,0 413 | 1,112,72,30,176,34.4,0.528,25,0 414 | 1,143,84,23,310,42.4,1.076,22,0 415 | 1,143,74,22,61,26.2,0.256,21,0 416 | 0,138,60,35,167,34.6,0.534,21,1 417 | 3,173,84,33,474,35.7,0.258,22,1 418 | 1,97,68,21,0,27.2,1.095,22,0 419 | 4,144,82,32,0,38.5,0.554,37,1 420 | 1,83,68,0,0,18.2,0.624,27,0 421 | 3,129,64,29,115,26.4,0.219,28,1 422 | 1,119,88,41,170,45.3,0.507,26,0 423 | 2,94,68,18,76,26.0,0.561,21,0 424 | 0,102,64,46,78,40.6,0.496,21,0 425 | 2,115,64,22,0,30.8,0.421,21,0 426 | 8,151,78,32,210,42.9,0.516,36,1 427 | 4,184,78,39,277,37.0,0.264,31,1 428 | 0,94,0,0,0,0.0,0.256,25,0 429 | 1,181,64,30,180,34.1,0.328,38,1 430 | 0,135,94,46,145,40.6,0.284,26,0 431 | 1,95,82,25,180,35.0,0.233,43,1 432 | 2,99,0,0,0,22.2,0.108,23,0 433 | 3,89,74,16,85,30.4,0.551,38,0 434 | 1,80,74,11,60,30.0,0.527,22,0 435 | 2,139,75,0,0,25.6,0.167,29,0 436 | 1,90,68,8,0,24.5,1.138,36,0 437 | 0,141,0,0,0,42.4,0.205,29,1 438 | 12,140,85,33,0,37.4,0.244,41,0 439 | 5,147,75,0,0,29.9,0.434,28,0 440 | 1,97,70,15,0,18.2,0.147,21,0 441 | 6,107,88,0,0,36.8,0.727,31,0 442 | 0,189,104,25,0,34.3,0.435,41,1 443 | 2,83,66,23,50,32.2,0.497,22,0 444 | 4,117,64,27,120,33.2,0.230,24,0 445 | 8,108,70,0,0,30.5,0.955,33,1 446 | 4,117,62,12,0,29.7,0.380,30,1 447 | 0,180,78,63,14,59.4,2.420,25,1 448 | 1,100,72,12,70,25.3,0.658,28,0 449 | 0,95,80,45,92,36.5,0.330,26,0 450 | 0,104,64,37,64,33.6,0.510,22,1 451 | 0,120,74,18,63,30.5,0.285,26,0 452 | 1,82,64,13,95,21.2,0.415,23,0 453 | 2,134,70,0,0,28.9,0.542,23,1 454 | 0,91,68,32,210,39.9,0.381,25,0 455 | 2,119,0,0,0,19.6,0.832,72,0 456 | 2,100,54,28,105,37.8,0.498,24,0 457 | 14,175,62,30,0,33.6,0.212,38,1 458 | 1,135,54,0,0,26.7,0.687,62,0 459 | 5,86,68,28,71,30.2,0.364,24,0 460 | 10,148,84,48,237,37.6,1.001,51,1 461 | 9,134,74,33,60,25.9,0.460,81,0 462 | 9,120,72,22,56,20.8,0.733,48,0 463 | 1,71,62,0,0,21.8,0.416,26,0 464 | 8,74,70,40,49,35.3,0.705,39,0 465 | 5,88,78,30,0,27.6,0.258,37,0 466 | 10,115,98,0,0,24.0,1.022,34,0 467 | 0,124,56,13,105,21.8,0.452,21,0 468 | 0,74,52,10,36,27.8,0.269,22,0 469 | 0,97,64,36,100,36.8,0.600,25,0 470 | 8,120,0,0,0,30.0,0.183,38,1 471 | 6,154,78,41,140,46.1,0.571,27,0 472 | 1,144,82,40,0,41.3,0.607,28,0 473 | 0,137,70,38,0,33.2,0.170,22,0 474 | 0,119,66,27,0,38.8,0.259,22,0 475 | 7,136,90,0,0,29.9,0.210,50,0 476 | 4,114,64,0,0,28.9,0.126,24,0 477 | 0,137,84,27,0,27.3,0.231,59,0 478 | 2,105,80,45,191,33.7,0.711,29,1 479 | 7,114,76,17,110,23.8,0.466,31,0 480 | 8,126,74,38,75,25.9,0.162,39,0 481 | 4,132,86,31,0,28.0,0.419,63,0 482 | 3,158,70,30,328,35.5,0.344,35,1 483 | 0,123,88,37,0,35.2,0.197,29,0 484 | 4,85,58,22,49,27.8,0.306,28,0 485 | 0,84,82,31,125,38.2,0.233,23,0 486 | 0,145,0,0,0,44.2,0.630,31,1 487 | 0,135,68,42,250,42.3,0.365,24,1 488 | 1,139,62,41,480,40.7,0.536,21,0 489 | 0,173,78,32,265,46.5,1.159,58,0 490 | 4,99,72,17,0,25.6,0.294,28,0 491 | 
8,194,80,0,0,26.1,0.551,67,0 492 | 2,83,65,28,66,36.8,0.629,24,0 493 | 2,89,90,30,0,33.5,0.292,42,0 494 | 4,99,68,38,0,32.8,0.145,33,0 495 | 4,125,70,18,122,28.9,1.144,45,1 496 | 3,80,0,0,0,0.0,0.174,22,0 497 | 6,166,74,0,0,26.6,0.304,66,0 498 | 5,110,68,0,0,26.0,0.292,30,0 499 | 2,81,72,15,76,30.1,0.547,25,0 500 | 7,195,70,33,145,25.1,0.163,55,1 501 | 6,154,74,32,193,29.3,0.839,39,0 502 | 2,117,90,19,71,25.2,0.313,21,0 503 | 3,84,72,32,0,37.2,0.267,28,0 504 | 6,0,68,41,0,39.0,0.727,41,1 505 | 7,94,64,25,79,33.3,0.738,41,0 506 | 3,96,78,39,0,37.3,0.238,40,0 507 | 10,75,82,0,0,33.3,0.263,38,0 508 | 0,180,90,26,90,36.5,0.314,35,1 509 | 1,130,60,23,170,28.6,0.692,21,0 510 | 2,84,50,23,76,30.4,0.968,21,0 511 | 8,120,78,0,0,25.0,0.409,64,0 512 | 12,84,72,31,0,29.7,0.297,46,1 513 | 0,139,62,17,210,22.1,0.207,21,0 514 | 9,91,68,0,0,24.2,0.200,58,0 515 | 2,91,62,0,0,27.3,0.525,22,0 516 | 3,99,54,19,86,25.6,0.154,24,0 517 | 3,163,70,18,105,31.6,0.268,28,1 518 | 9,145,88,34,165,30.3,0.771,53,1 519 | 7,125,86,0,0,37.6,0.304,51,0 520 | 13,76,60,0,0,32.8,0.180,41,0 521 | 6,129,90,7,326,19.6,0.582,60,0 522 | 2,68,70,32,66,25.0,0.187,25,0 523 | 3,124,80,33,130,33.2,0.305,26,0 524 | 6,114,0,0,0,0.0,0.189,26,0 525 | 9,130,70,0,0,34.2,0.652,45,1 526 | 3,125,58,0,0,31.6,0.151,24,0 527 | 3,87,60,18,0,21.8,0.444,21,0 528 | 1,97,64,19,82,18.2,0.299,21,0 529 | 3,116,74,15,105,26.3,0.107,24,0 530 | 0,117,66,31,188,30.8,0.493,22,0 531 | 0,111,65,0,0,24.6,0.660,31,0 532 | 2,122,60,18,106,29.8,0.717,22,0 533 | 0,107,76,0,0,45.3,0.686,24,0 534 | 1,86,66,52,65,41.3,0.917,29,0 535 | 6,91,0,0,0,29.8,0.501,31,0 536 | 1,77,56,30,56,33.3,1.251,24,0 537 | 4,132,0,0,0,32.9,0.302,23,1 538 | 0,105,90,0,0,29.6,0.197,46,0 539 | 0,57,60,0,0,21.7,0.735,67,0 540 | 0,127,80,37,210,36.3,0.804,23,0 541 | 3,129,92,49,155,36.4,0.968,32,1 542 | 8,100,74,40,215,39.4,0.661,43,1 543 | 3,128,72,25,190,32.4,0.549,27,1 544 | 10,90,85,32,0,34.9,0.825,56,1 545 | 4,84,90,23,56,39.5,0.159,25,0 546 | 1,88,78,29,76,32.0,0.365,29,0 547 | 8,186,90,35,225,34.5,0.423,37,1 548 | 5,187,76,27,207,43.6,1.034,53,1 549 | 4,131,68,21,166,33.1,0.160,28,0 550 | 1,164,82,43,67,32.8,0.341,50,0 551 | 4,189,110,31,0,28.5,0.680,37,0 552 | 1,116,70,28,0,27.4,0.204,21,0 553 | 3,84,68,30,106,31.9,0.591,25,0 554 | 6,114,88,0,0,27.8,0.247,66,0 555 | 1,88,62,24,44,29.9,0.422,23,0 556 | 1,84,64,23,115,36.9,0.471,28,0 557 | 7,124,70,33,215,25.5,0.161,37,0 558 | 1,97,70,40,0,38.1,0.218,30,0 559 | 8,110,76,0,0,27.8,0.237,58,0 560 | 11,103,68,40,0,46.2,0.126,42,0 561 | 11,85,74,0,0,30.1,0.300,35,0 562 | 6,125,76,0,0,33.8,0.121,54,1 563 | 0,198,66,32,274,41.3,0.502,28,1 564 | 1,87,68,34,77,37.6,0.401,24,0 565 | 6,99,60,19,54,26.9,0.497,32,0 566 | 0,91,80,0,0,32.4,0.601,27,0 567 | 2,95,54,14,88,26.1,0.748,22,0 568 | 1,99,72,30,18,38.6,0.412,21,0 569 | 6,92,62,32,126,32.0,0.085,46,0 570 | 4,154,72,29,126,31.3,0.338,37,0 571 | 0,121,66,30,165,34.3,0.203,33,1 572 | 3,78,70,0,0,32.5,0.270,39,0 573 | 2,130,96,0,0,22.6,0.268,21,0 574 | 3,111,58,31,44,29.5,0.430,22,0 575 | 2,98,60,17,120,34.7,0.198,22,0 576 | 1,143,86,30,330,30.1,0.892,23,0 577 | 1,119,44,47,63,35.5,0.280,25,0 578 | 6,108,44,20,130,24.0,0.813,35,0 579 | 2,118,80,0,0,42.9,0.693,21,1 580 | 10,133,68,0,0,27.0,0.245,36,0 581 | 2,197,70,99,0,34.7,0.575,62,1 582 | 0,151,90,46,0,42.1,0.371,21,1 583 | 6,109,60,27,0,25.0,0.206,27,0 584 | 12,121,78,17,0,26.5,0.259,62,0 585 | 8,100,76,0,0,38.7,0.190,42,0 586 | 8,124,76,24,600,28.7,0.687,52,1 587 | 1,93,56,11,0,22.5,0.417,22,0 588 | 8,143,66,0,0,34.9,0.129,41,1 589 | 
6,103,66,0,0,24.3,0.249,29,0 590 | 3,176,86,27,156,33.3,1.154,52,1 591 | 0,73,0,0,0,21.1,0.342,25,0 592 | 11,111,84,40,0,46.8,0.925,45,1 593 | 2,112,78,50,140,39.4,0.175,24,0 594 | 3,132,80,0,0,34.4,0.402,44,1 595 | 2,82,52,22,115,28.5,1.699,25,0 596 | 6,123,72,45,230,33.6,0.733,34,0 597 | 0,188,82,14,185,32.0,0.682,22,1 598 | 0,67,76,0,0,45.3,0.194,46,0 599 | 1,89,24,19,25,27.8,0.559,21,0 600 | 1,173,74,0,0,36.8,0.088,38,1 601 | 1,109,38,18,120,23.1,0.407,26,0 602 | 1,108,88,19,0,27.1,0.400,24,0 603 | 6,96,0,0,0,23.7,0.190,28,0 604 | 1,124,74,36,0,27.8,0.100,30,0 605 | 7,150,78,29,126,35.2,0.692,54,1 606 | 4,183,0,0,0,28.4,0.212,36,1 607 | 1,124,60,32,0,35.8,0.514,21,0 608 | 1,181,78,42,293,40.0,1.258,22,1 609 | 1,92,62,25,41,19.5,0.482,25,0 610 | 0,152,82,39,272,41.5,0.270,27,0 611 | 1,111,62,13,182,24.0,0.138,23,0 612 | 3,106,54,21,158,30.9,0.292,24,0 613 | 3,174,58,22,194,32.9,0.593,36,1 614 | 7,168,88,42,321,38.2,0.787,40,1 615 | 6,105,80,28,0,32.5,0.878,26,0 616 | 11,138,74,26,144,36.1,0.557,50,1 617 | 3,106,72,0,0,25.8,0.207,27,0 618 | 6,117,96,0,0,28.7,0.157,30,0 619 | 2,68,62,13,15,20.1,0.257,23,0 620 | 9,112,82,24,0,28.2,1.282,50,1 621 | 0,119,0,0,0,32.4,0.141,24,1 622 | 2,112,86,42,160,38.4,0.246,28,0 623 | 2,92,76,20,0,24.2,1.698,28,0 624 | 6,183,94,0,0,40.8,1.461,45,0 625 | 0,94,70,27,115,43.5,0.347,21,0 626 | 2,108,64,0,0,30.8,0.158,21,0 627 | 4,90,88,47,54,37.7,0.362,29,0 628 | 0,125,68,0,0,24.7,0.206,21,0 629 | 0,132,78,0,0,32.4,0.393,21,0 630 | 5,128,80,0,0,34.6,0.144,45,0 631 | 4,94,65,22,0,24.7,0.148,21,0 632 | 7,114,64,0,0,27.4,0.732,34,1 633 | 0,102,78,40,90,34.5,0.238,24,0 634 | 2,111,60,0,0,26.2,0.343,23,0 635 | 1,128,82,17,183,27.5,0.115,22,0 636 | 10,92,62,0,0,25.9,0.167,31,0 637 | 13,104,72,0,0,31.2,0.465,38,1 638 | 5,104,74,0,0,28.8,0.153,48,0 639 | 2,94,76,18,66,31.6,0.649,23,0 640 | 7,97,76,32,91,40.9,0.871,32,1 641 | 1,100,74,12,46,19.5,0.149,28,0 642 | 0,102,86,17,105,29.3,0.695,27,0 643 | 4,128,70,0,0,34.3,0.303,24,0 644 | 6,147,80,0,0,29.5,0.178,50,1 645 | 4,90,0,0,0,28.0,0.610,31,0 646 | 3,103,72,30,152,27.6,0.730,27,0 647 | 2,157,74,35,440,39.4,0.134,30,0 648 | 1,167,74,17,144,23.4,0.447,33,1 649 | 0,179,50,36,159,37.8,0.455,22,1 650 | 11,136,84,35,130,28.3,0.260,42,1 651 | 0,107,60,25,0,26.4,0.133,23,0 652 | 1,91,54,25,100,25.2,0.234,23,0 653 | 1,117,60,23,106,33.8,0.466,27,0 654 | 5,123,74,40,77,34.1,0.269,28,0 655 | 2,120,54,0,0,26.8,0.455,27,0 656 | 1,106,70,28,135,34.2,0.142,22,0 657 | 2,155,52,27,540,38.7,0.240,25,1 658 | 2,101,58,35,90,21.8,0.155,22,0 659 | 1,120,80,48,200,38.9,1.162,41,0 660 | 11,127,106,0,0,39.0,0.190,51,0 661 | 3,80,82,31,70,34.2,1.292,27,1 662 | 10,162,84,0,0,27.7,0.182,54,0 663 | 1,199,76,43,0,42.9,1.394,22,1 664 | 8,167,106,46,231,37.6,0.165,43,1 665 | 9,145,80,46,130,37.9,0.637,40,1 666 | 6,115,60,39,0,33.7,0.245,40,1 667 | 1,112,80,45,132,34.8,0.217,24,0 668 | 4,145,82,18,0,32.5,0.235,70,1 669 | 10,111,70,27,0,27.5,0.141,40,1 670 | 6,98,58,33,190,34.0,0.430,43,0 671 | 9,154,78,30,100,30.9,0.164,45,0 672 | 6,165,68,26,168,33.6,0.631,49,0 673 | 1,99,58,10,0,25.4,0.551,21,0 674 | 10,68,106,23,49,35.5,0.285,47,0 675 | 3,123,100,35,240,57.3,0.880,22,0 676 | 8,91,82,0,0,35.6,0.587,68,0 677 | 6,195,70,0,0,30.9,0.328,31,1 678 | 9,156,86,0,0,24.8,0.230,53,1 679 | 0,93,60,0,0,35.3,0.263,25,0 680 | 3,121,52,0,0,36.0,0.127,25,1 681 | 2,101,58,17,265,24.2,0.614,23,0 682 | 2,56,56,28,45,24.2,0.332,22,0 683 | 0,162,76,36,0,49.6,0.364,26,1 684 | 0,95,64,39,105,44.6,0.366,22,0 685 | 4,125,80,0,0,32.3,0.536,27,1 686 | 
5,136,82,0,0,0.0,0.640,69,0 687 | 2,129,74,26,205,33.2,0.591,25,0 688 | 3,130,64,0,0,23.1,0.314,22,0 689 | 1,107,50,19,0,28.3,0.181,29,0 690 | 1,140,74,26,180,24.1,0.828,23,0 691 | 1,144,82,46,180,46.1,0.335,46,1 692 | 8,107,80,0,0,24.6,0.856,34,0 693 | 13,158,114,0,0,42.3,0.257,44,1 694 | 2,121,70,32,95,39.1,0.886,23,0 695 | 7,129,68,49,125,38.5,0.439,43,1 696 | 2,90,60,0,0,23.5,0.191,25,0 697 | 7,142,90,24,480,30.4,0.128,43,1 698 | 3,169,74,19,125,29.9,0.268,31,1 699 | 0,99,0,0,0,25.0,0.253,22,0 700 | 4,127,88,11,155,34.5,0.598,28,0 701 | 4,118,70,0,0,44.5,0.904,26,0 702 | 2,122,76,27,200,35.9,0.483,26,0 703 | 6,125,78,31,0,27.6,0.565,49,1 704 | 1,168,88,29,0,35.0,0.905,52,1 705 | 2,129,0,0,0,38.5,0.304,41,0 706 | 4,110,76,20,100,28.4,0.118,27,0 707 | 6,80,80,36,0,39.8,0.177,28,0 708 | 10,115,0,0,0,0.0,0.261,30,1 709 | 2,127,46,21,335,34.4,0.176,22,0 710 | 9,164,78,0,0,32.8,0.148,45,1 711 | 2,93,64,32,160,38.0,0.674,23,1 712 | 3,158,64,13,387,31.2,0.295,24,0 713 | 5,126,78,27,22,29.6,0.439,40,0 714 | 10,129,62,36,0,41.2,0.441,38,1 715 | 0,134,58,20,291,26.4,0.352,21,0 716 | 3,102,74,0,0,29.5,0.121,32,0 717 | 7,187,50,33,392,33.9,0.826,34,1 718 | 3,173,78,39,185,33.8,0.970,31,1 719 | 10,94,72,18,0,23.1,0.595,56,0 720 | 1,108,60,46,178,35.5,0.415,24,0 721 | 5,97,76,27,0,35.6,0.378,52,1 722 | 4,83,86,19,0,29.3,0.317,34,0 723 | 1,114,66,36,200,38.1,0.289,21,0 724 | 1,149,68,29,127,29.3,0.349,42,1 725 | 5,117,86,30,105,39.1,0.251,42,0 726 | 1,111,94,0,0,32.8,0.265,45,0 727 | 4,112,78,40,0,39.4,0.236,38,0 728 | 1,116,78,29,180,36.1,0.496,25,0 729 | 0,141,84,26,0,32.4,0.433,22,0 730 | 2,175,88,0,0,22.9,0.326,22,0 731 | 2,92,52,0,0,30.1,0.141,22,0 732 | 3,130,78,23,79,28.4,0.323,34,1 733 | 8,120,86,0,0,28.4,0.259,22,1 734 | 2,174,88,37,120,44.5,0.646,24,1 735 | 2,106,56,27,165,29.0,0.426,22,0 736 | 2,105,75,0,0,23.3,0.560,53,0 737 | 4,95,60,32,0,35.4,0.284,28,0 738 | 0,126,86,27,120,27.4,0.515,21,0 739 | 8,65,72,23,0,32.0,0.600,42,0 740 | 2,99,60,17,160,36.6,0.453,21,0 741 | 1,102,74,0,0,39.5,0.293,42,1 742 | 11,120,80,37,150,42.3,0.785,48,1 743 | 3,102,44,20,94,30.8,0.400,26,0 744 | 1,109,58,18,116,28.5,0.219,22,0 745 | 9,140,94,0,0,32.7,0.734,45,1 746 | 13,153,88,37,140,40.6,1.174,39,0 747 | 12,100,84,33,105,30.0,0.488,46,0 748 | 1,147,94,41,0,49.3,0.358,27,1 749 | 1,81,74,41,57,46.3,1.096,32,0 750 | 3,187,70,22,200,36.4,0.408,36,1 751 | 6,162,62,0,0,24.3,0.178,50,1 752 | 4,136,70,0,0,31.2,1.182,22,1 753 | 1,121,78,39,74,39.0,0.261,28,0 754 | 3,108,62,24,0,26.0,0.223,25,0 755 | 0,181,88,44,510,43.3,0.222,26,1 756 | 8,154,78,32,0,32.4,0.443,45,1 757 | 1,128,88,39,110,36.5,1.057,37,1 758 | 7,137,90,41,0,32.0,0.391,39,0 759 | 0,123,72,0,0,36.3,0.258,52,1 760 | 1,106,76,0,0,37.5,0.197,26,0 761 | 6,190,92,0,0,35.5,0.278,66,1 762 | 2,88,58,26,16,28.4,0.766,22,0 763 | 9,170,74,31,0,44.0,0.403,43,1 764 | 9,89,62,0,0,22.5,0.142,33,0 765 | 10,101,76,48,180,32.9,0.171,63,0 766 | 2,122,70,27,0,36.8,0.340,27,0 767 | 5,121,72,23,112,26.2,0.245,30,0 768 | 1,126,60,0,0,30.1,0.349,47,1 769 | 1,93,70,31,0,30.4,0.315,23,0 770 | -------------------------------------------------------------------------------- /evaluating estimator performance using cross-validation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that would just repeat the labels of the samples 
that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting. To avoid it, it is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a test set X_test, y_test. Note that the word “experiment” is not intended to denote academic use only, because even in commercial settings machine learning usually starts out experimentally.\n", 8 | "In scikit-learn a random split into training and test sets can be quickly computed with the train_test_split helper function. Let’s load the iris data set to fit a linear support vector machine on it:" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": 1, 14 | "metadata": { 15 | "collapsed": false 16 | }, 17 | "outputs": [ 18 | { 19 | "data": { 20 | "text/plain": [ 21 | "((150, 4), (150,))" 22 | ] 23 | }, 24 | "execution_count": 1, 25 | "metadata": {}, 26 | "output_type": "execute_result" 27 | } 28 | ], 29 | "source": [ 30 | "import numpy as np\n", 31 | "from sklearn import cross_validation\n", 32 | "from sklearn import datasets\n", 33 | "from sklearn import svm\n", 34 | "\n", 35 | "iris = datasets.load_iris()\n", 36 | "iris.data.shape, iris.target.shape\n", 37 | "# ((150, 4), (150,))" 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "We can now quickly sample a training set while holding out 40% of the data for testing (evaluating) our classifier:" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": 3, 50 | "metadata": { 51 | "collapsed": false 52 | }, 53 | "outputs": [ 54 | { 55 | "data": { 56 | "text/plain": [ 57 | "0.96666666666666667" 58 | ] 59 | }, 60 | "execution_count": 3, 61 | "metadata": {}, 62 | "output_type": "execute_result" 63 | } 64 | ], 65 | "source": [ 66 | "X_train, X_test, y_train, y_test = cross_validation.train_test_split(\n", 67 | "    iris.data, iris.target, test_size=0.4, random_state=0)\n", 68 | "\n", 69 | "X_train.shape, y_train.shape\n", 70 | "# ((90, 4), (90,))\n", 71 | "X_test.shape, y_test.shape\n", 72 | "# ((60, 4), (60,))\n", 73 | "\n", 74 | "clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)\n", 75 | "clf.score(X_test, y_test) " 76 | ] 77 | }, 78 | { 79 | "cell_type": "markdown", 80 | "metadata": {}, 81 | "source": [ 82 | "When evaluating different settings (“hyperparameters”) for estimators, such as the C setting that must be manually set for an SVM, there is still a risk of overfitting on the test set because the parameters can be tweaked until the estimator performs optimally. This way, knowledge about the test set can “leak” into the model and evaluation metrics no longer report on generalization performance. To solve this problem, yet another part of the dataset can be held out as a so-called “validation set”: training proceeds on the training set, after which evaluation is done on the validation set, and when the experiment seems to be successful, final evaluation can be done on the test set.\n", 83 | "However, by partitioning the available data into three sets, we drastically reduce the number of samples which can be used for learning the model, and the results can depend on a particular random choice for the pair of (train, validation) sets.\n", 84 | "A solution to this problem is a procedure called cross-validation (CV for short). A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. 

In the basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches are described below, but generally follow the same principles). The following procedure is followed for each of the k “folds”:\n", 85 | "- A model is trained using k-1 of the folds as training data;\n", 86 | "- the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).\n", 87 | "The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop. This approach can be computationally expensive, but does not waste too much data (as is the case when fixing an arbitrary test set), which is a major advantage in problems such as inverse inference where the number of samples is very small." 88 | ] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "metadata": {}, 93 | "source": [ 94 | "# Computing cross-validated metrics " 95 | ] 96 | }, 97 | { 98 | "cell_type": "markdown", 99 | "metadata": {}, 100 | "source": [ 101 | "The simplest way to use cross-validation is to call the cross_val_score helper function on the estimator and the dataset.\n", 102 | "The following example demonstrates how to estimate the accuracy of a linear kernel support vector machine on the iris dataset by splitting the data, fitting a model and computing the score 5 consecutive times (with different splits each time):" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": 10, 108 | "metadata": { 109 | "collapsed": false 110 | }, 111 | "outputs": [ 112 | { 113 | "data": { 114 | "text/plain": [ 115 | "array([ 0.96666667, 1. , 0.96666667, 0.96666667, 1. ])" 116 | ] 117 | }, 118 | "execution_count": 10, 119 | "metadata": {}, 120 | "output_type": "execute_result" 121 | } 122 | ], 123 | "source": [ 124 | "clf = svm.SVC(kernel='linear', C=1)\n", 125 | "scores = cross_validation.cross_val_score(\n", 126 | " clf, iris.data, iris.target, cv=5)\n", 127 | "...\n", 128 | "scores \n" 129 | ] 130 | }, 131 | { 132 | "cell_type": "markdown", 133 | "metadata": {}, 134 | "source": [ 135 | "The mean score and the 95% confidence interval of the score estimate are hence given by:" 136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": 11, 141 | "metadata": { 142 | "collapsed": false 143 | }, 144 | "outputs": [ 145 | { 146 | "name": "stdout", 147 | "output_type": "stream", 148 | "text": [ 149 | "Accuracy: 0.98 (+/- 0.03)\n" 150 | ] 151 | } 152 | ], 153 | "source": [ 154 | "print(\"Accuracy: %0.2f (+/- %0.2f)\" % (scores.mean(), scores.std() * 2))" 155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "metadata": {}, 160 | "source": [ 161 | "By default, the score computed at each CV iteration is the score method of the estimator. It is possible to change this by using the scoring parameter:" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": 13, 167 | "metadata": { 168 | "collapsed": false 169 | }, 170 | "outputs": [ 171 | { 172 | "data": { 173 | "text/plain": [ 174 | "array([ 0.96658312, 1. , 0.96658312, 0.96658312, 1. 

])" 175 | ] 176 | }, 177 | "execution_count": 13, 178 | "metadata": {}, 179 | "output_type": "execute_result" 180 | } 181 | ], 182 | "source": [ 183 | "from sklearn import metrics\n", 184 | "scores = cross_validation.cross_val_score(clf, iris.data, iris.target, cv=5, scoring='f1_weighted')\n", 185 | "scores" 186 | ] 187 | }, 188 | { 189 | "cell_type": "markdown", 190 | "metadata": {}, 191 | "source": [ 192 | "In the case of the Iris dataset, the samples are balanced across target classes hence the accuracy and the F1-score are almost equal.\n", 193 | "When the cv argument is an integer, cross_val_score uses the KFold or StratifiedKFold strategies by default, the latter being used if the estimator derives from ClassifierMixin.\n", 194 | "It is also possible to use other cross validation strategies by passing a cross validation iterator instead, for instance:" 195 | ] 196 | }, 197 | { 198 | "cell_type": "code", 199 | "execution_count": 14, 200 | "metadata": { 201 | "collapsed": false 202 | }, 203 | "outputs": [ 204 | { 205 | "data": { 206 | "text/plain": [ 207 | "array([ 0.97777778, 0.97777778, 1. ])" 208 | ] 209 | }, 210 | "execution_count": 14, 211 | "metadata": {}, 212 | "output_type": "execute_result" 213 | } 214 | ], 215 | "source": [ 216 | "n_samples = iris.data.shape[0]\n", 217 | "cv = cross_validation.ShuffleSplit(n_samples, n_iter=3, test_size=0.3, random_state=0)\n", 218 | "cross_validation.cross_val_score(clf, iris.data, iris.target, cv=cv)" 219 | ] 220 | }, 221 | { 222 | "cell_type": "markdown", 223 | "metadata": {}, 224 | "source": [ 225 | "# Data transformation with held out data" 226 | ] 227 | }, 228 | { 229 | "cell_type": "markdown", 230 | "metadata": {}, 231 | "source": [ 232 | "Just as it is important to test a predictor on data held-out from training, preprocessing (such as standardization, feature selection, etc.) 
and similar data transformations should similarly be learnt from a training set and applied to held-out data for prediction:" 233 | ] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "execution_count": 15, 238 | "metadata": { 239 | "collapsed": false 240 | }, 241 | "outputs": [ 242 | { 243 | "data": { 244 | "text/plain": [ 245 | "0.93333333333333335" 246 | ] 247 | }, 248 | "execution_count": 15, 249 | "metadata": {}, 250 | "output_type": "execute_result" 251 | } 252 | ], 253 | "source": [ 254 | "from sklearn import preprocessing\n", 255 | "X_train, X_test, y_train, y_test = cross_validation.train_test_split(iris.data, iris.target, test_size=0.4, random_state=0)\n", 256 | "scaler = preprocessing.StandardScaler().fit(X_train)\n", 257 | "X_train_transformed = scaler.transform(X_train)\n", 258 | "clf = svm.SVC(C=1).fit(X_train_transformed, y_train)\n", 259 | "X_test_transformed = scaler.transform(X_test)\n", 260 | "clf.score(X_test_transformed, y_test) " 261 | ] 262 | }, 263 | { 264 | "cell_type": "markdown", 265 | "metadata": {}, 266 | "source": [ 267 | "A Pipeline makes it easier to compose estimators, providing this behavior under cross-validation:" 268 | ] 269 | }, 270 | { 271 | "cell_type": "code", 272 | "execution_count": 16, 273 | "metadata": { 274 | "collapsed": false 275 | }, 276 | "outputs": [ 277 | { 278 | "data": { 279 | "text/plain": [ 280 | "array([ 0.97777778, 0.93333333, 0.95555556])" 281 | ] 282 | }, 283 | "execution_count": 16, 284 | "metadata": {}, 285 | "output_type": "execute_result" 286 | } 287 | ], 288 | "source": [ 289 | "from sklearn.pipeline import make_pipeline\n", 290 | "clf = make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))\n", 291 | "cross_validation.cross_val_score(clf, iris.data, iris.target, cv=cv)" 292 | ] 293 | }, 294 | { 295 | "cell_type": "markdown", 296 | "metadata": {}, 297 | "source": [ 298 | "# Obtaining predictions by cross-validation" 299 | ] 300 | }, 301 | { 302 | "cell_type": "markdown", 303 | "metadata": {}, 304 | "source": [ 305 | "The function cross_val_predict has a similar interface to cross_val_score, but returns, for each element in the input, the prediction that was obtained for that element when it was in the test set. Only cross-validation strategies that assign all elements to a test set exactly once can be used (otherwise, an exception is raised).\n", 306 | "These predictions can then be used to evaluate the classifier:" 307 | ] 308 | }, 309 | { 310 | "cell_type": "code", 311 | "execution_count": 17, 312 | "metadata": { 313 | "collapsed": false 314 | }, 315 | "outputs": [ 316 | { 317 | "data": { 318 | "text/plain": [ 319 | "0.96666666666666667" 320 | ] 321 | }, 322 | "execution_count": 17, 323 | "metadata": {}, 324 | "output_type": "execute_result" 325 | } 326 | ], 327 | "source": [ 328 | "predicted = cross_validation.cross_val_predict(clf, iris.data, iris.target, cv=10)\n", 329 | "metrics.accuracy_score(iris.target, predicted) " 330 | ] 331 | }, 332 | { 333 | "cell_type": "markdown", 334 | "metadata": {}, 335 | "source": [ 336 | "Note that the results of this computation may be slightly different from those obtained using cross_val_score, as the elements are grouped in different ways." 

337 | ] 338 | } 339 | ], 340 | "metadata": { 341 | "kernelspec": { 342 | "display_name": "Python 3", 343 | "language": "python", 344 | "name": "python3" 345 | }, 346 | "language_info": { 347 | "codemirror_mode": { 348 | "name": "ipython", 349 | "version": 3 350 | }, 351 | "file_extension": ".py", 352 | "mimetype": "text/x-python", 353 | "name": "python", 354 | "nbconvert_exporter": "python", 355 | "pygments_lexer": "ipython3", 356 | "version": "3.5.1" 357 | } 358 | }, 359 | "nbformat": 4, 360 | "nbformat_minor": 0 361 | } 362 | -------------------------------------------------------------------------------- /examples/Gradient.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Code for Gradient descent " 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Code for calculating a local minimum of $$f(x,y)= x^4 + y^4 - x^2 - y^2$$ with partial derivative wrt x: $$4x^3 - 2x$$ and partial derivative wrt y: $$4y^3 - 2y$$ " 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": null, 20 | "metadata": { 21 | "collapsed": true 22 | }, 23 | "outputs": [], 24 | "source": [ 25 | "max_iter = 1000\n", 26 | "x_o=0\n", 27 | "y_o=0\n", 28 | "alpha = 0.01 ## Step Size\n", 29 | "x_k=2 ## Starting position of x coordinate \n", 30 | "y_k=2 ## Starting position of y coordinate \n", 31 | "\n", 32 | "\n" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "metadata": {}, 38 | "source": [ 39 | "Here we set the max iterations to 1000, the starting coordinate ($x_k,y_k$) = (2,2), and the step size to 0.01, denoted by alpha " 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": null, 45 | "metadata": { 46 | "collapsed": true 47 | }, 48 | "outputs": [], 49 | "source": [ 50 | "def devx(x): ## Defining partial derivative wrt x\n", 51 | " return 4*x**3 - 2*x\n", 52 | "def devy(y): ## Defining partial derivative wrt y\n", 53 | " return 4*y**3 - 2*y\n", 54 | "for i in range(max_iter):\n", 55 | " x_o = x_k\n", 56 | " y_o = y_k\n", 57 | " x_k = x_o - alpha * devx(x_o)\n", 58 | " y_k = y_o - alpha * devy(y_o)\n", 59 | "\n", 60 | "print \"Local Minimum at\",x_k,\",\",y_k" 61 | ] 62 | }, 63 | { 64 | "cell_type": "markdown", 65 | "metadata": {}, 66 | "source": [ 67 | "Here we define two functions, devx(x) and devy(y), as the partial derivative with respect to x: $$4x^3 - 2x$$ and the partial derivative with respect to y: $$4y^3 - 2y$$ respectively. In the following loop we calculate the local minimum using the gradient descent update equations: ![Alt Text](http://www.codeproject.com/KB/recipes/879043/GradientDescent.jpg \"Gradient Descent\"). 

" 68 | ] 69 | } 70 | ], 71 | "metadata": { 72 | "kernelspec": { 73 | "display_name": "Python 2", 74 | "language": "python", 75 | "name": "python2" 76 | }, 77 | "language_info": { 78 | "codemirror_mode": { 79 | "name": "ipython", 80 | "version": 2 81 | }, 82 | "file_extension": ".py", 83 | "mimetype": "text/x-python", 84 | "name": "python", 85 | "nbconvert_exporter": "python", 86 | "pygments_lexer": "ipython2", 87 | "version": "2.7.10" 88 | } 89 | }, 90 | "nbformat": 4, 91 | "nbformat_minor": 0 92 | } 93 | -------------------------------------------------------------------------------- /images/neuralnet.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crowd-course/datascience/f5961c20c4052566b1b5a9fc0699c8cadb6147f5/images/neuralnet.png -------------------------------------------------------------------------------- /images/train_img.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crowd-course/datascience/f5961c20c4052566b1b5a9fc0699c8cadb6147f5/images/train_img.png -------------------------------------------------------------------------------- /images/weights.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crowd-course/datascience/f5961c20c4052566b1b5a9fc0699c8cadb6147f5/images/weights.png -------------------------------------------------------------------------------- /kaggle-data/README.md: -------------------------------------------------------------------------------- 1 | Source: [https://www.kaggle.com/benhamner/2016-us-election](https://www.kaggle.com/benhamner/2016-us-election) 2 | 3 | -------------------------------------------------------------------------------- /kaggle-data/county_facts_dictionary.csv: -------------------------------------------------------------------------------- 1 | column_name,description 2 | PST045214,"Population, 2014 estimate" 3 | PST040210,"Population, 2010 (April 1) estimates base" 4 | PST120214,"Population, percent change - April 1, 2010 to July 1, 2014" 5 | POP010210,"Population, 2010" 6 | AGE135214,"Persons under 5 years, percent, 2014" 7 | AGE295214,"Persons under 18 years, percent, 2014" 8 | AGE775214,"Persons 65 years and over, percent, 2014" 9 | SEX255214,"Female persons, percent, 2014" 10 | RHI125214,"White alone, percent, 2014" 11 | RHI225214,"Black or African American alone, percent, 2014" 12 | RHI325214,"American Indian and Alaska Native alone, percent, 2014" 13 | RHI425214,"Asian alone, percent, 2014" 14 | RHI525214,"Native Hawaiian and Other Pacific Islander alone, percent, 2014" 15 | RHI625214,"Two or More Races, percent, 2014" 16 | RHI725214,"Hispanic or Latino, percent, 2014" 17 | RHI825214,"White alone, not Hispanic or Latino, percent, 2014" 18 | POP715213,"Living in same house 1 year & over, percent, 2009-2013" 19 | POP645213,"Foreign born persons, percent, 2009-2013" 20 | POP815213,"Language other than English spoken at home, pct age 5+, 2009-2013" 21 | EDU635213,"High school graduate or higher, percent of persons age 25+, 2009-2013" 22 | EDU685213,"Bachelor's degree or higher, percent of persons age 25+, 2009-2013" 23 | VET605213,"Veterans, 2009-2013" 24 | LFE305213,"Mean travel time to work (minutes), workers age 16+, 2009-2013" 25 | HSG010214,"Housing units, 2014" 26 | HSG445213,"Homeownership rate, 2009-2013" 27 | HSG096213,"Housing units in multi-unit structures, percent, 2009-2013" 28 | HSG495213,"Median value of 
owner-occupied housing units, 2009-2013" 29 | HSD410213,"Households, 2009-2013" 30 | HSD310213,"Persons per household, 2009-2013" 31 | INC910213,"Per capita money income in past 12 months (2013 dollars), 2009-2013" 32 | INC110213,"Median household income, 2009-2013" 33 | PVY020213,"Persons below poverty level, percent, 2009-2013" 34 | BZA010213,"Private nonfarm establishments, 2013" 35 | BZA110213,"Private nonfarm employment, 2013" 36 | BZA115213,"Private nonfarm employment, percent change, 2012-2013" 37 | NES010213,"Nonemployer establishments, 2013" 38 | SBO001207,"Total number of firms, 2007" 39 | SBO315207,"Black-owned firms, percent, 2007" 40 | SBO115207,"American Indian- and Alaska Native-owned firms, percent, 2007" 41 | SBO215207,"Asian-owned firms, percent, 2007" 42 | SBO515207,"Native Hawaiian- and Other Pacific Islander-owned firms, percent, 2007" 43 | SBO415207,"Hispanic-owned firms, percent, 2007" 44 | SBO015207,"Women-owned firms, percent, 2007" 45 | MAN450207,"Manufacturers shipments, 2007 ($1,000)" 46 | WTN220207,"Merchant wholesaler sales, 2007 ($1,000)" 47 | RTN130207,"Retail sales, 2007 ($1,000)" 48 | RTN131207,"Retail sales per capita, 2007" 49 | AFN120207,"Accommodation and food services sales, 2007 ($1,000)" 50 | BPS030214,"Building permits, 2014" 51 | LND110210,"Land area in square miles, 2010" 52 | POP060210,"Population per square mile, 2010" 53 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | numpy==1.11.0 2 | pandas==0.18.1 3 | python-dateutil==2.5.3 4 | pytz==2016.4 5 | scikit-learn==0.17.1 6 | scipy==0.17.1 7 | six==1.10.0 8 | scikit-neuralnetwork==0.7 9 | --------------------------------------------------------------------------------
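A note on versions, since requirements.txt pins scikit-learn==0.17.1: the sklearn.cross_validation module used in evaluating estimator performance using cross-validation.ipynb was deprecated in scikit-learn 0.18 and removed in 0.20. The following is a minimal sketch, assuming a newer scikit-learn (0.18+) is installed instead of the pinned version, of how the notebook's train/test split, k-fold scoring, and CV-iterator calls map onto sklearn.model_selection:

from sklearn import datasets, svm
from sklearn.model_selection import KFold, ShuffleSplit, cross_val_score, train_test_split

iris = datasets.load_iris()

# Same 60/40 train/test split as in the notebook, via model_selection
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
print(clf.score(X_test, y_test))

# 5-fold cross-validated accuracy (stratified folds are used automatically for classifiers)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

# Explicit CV iterators no longer take n_samples; ShuffleSplit uses n_splits instead of n_iter
cv = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0)
print(cross_val_score(clf, iris.data, iris.target, cv=cv))
print(cross_val_score(clf, iris.data, iris.target, cv=KFold(n_splits=5)))

The rest of the notebook (the scoring parameter, make_pipeline, cross_val_predict) carries over unchanged apart from the import path.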