├── .gitignore
├── README.md
├── datasets
│   └── titanic
│       ├── test.csv
│       └── train.csv
├── day01
│   └── README.md
├── day02
│   ├── README.md
│   ├── intro-to-jupyter.ipynb
│   ├── intro-to-numpy.ipynb
│   ├── pset1-sol.ipynb
│   └── pset1.ipynb
├── day03
│   ├── README.md
│   ├── data_vis.ipynb
│   └── visualization.ipynb
├── day04
│   ├── Classification + Regression.ipynb
│   └── README.md
├── day05
│   ├── Cleaning and RegEx.ipynb
│   ├── README.md
│   └── cleaning_regex.py
├── day06
│   ├── Day06slides.pdf
│   ├── README.md
│   ├── linear_reg.ipynb
│   └── nbasalary.csv
├── day07
│   ├── Logistic Regression.ipynb
│   └── README.md
├── day08
│   ├── MNIST.ipynb
│   └── README.md
├── day09
│   ├── Kmeans.ipynb
│   ├── Logistic Regression and Cross Validation.ipynb
│   └── README.md
├── day10
│   ├── README.md
│   ├── bias-variance-sol.ipynb
│   ├── bias-variance.ipynb
│   ├── images
│   │   ├── bv.png
│   │   ├── data.png
│   │   ├── data1.png
│   │   ├── hoods.png
│   │   ├── k=1.png
│   │   ├── k=10.png
│   │   └── k=100.png
│   └── knn.ipynb
├── day11
│   ├── Decision Trees.ipynb
│   ├── README.md
│   └── calc_gini.py
├── day12
│   ├── README.md
│   └── Random Forest.ipynb
├── day13
│   └── README.md
├── day14
│   ├── README.md
│   └── nn.ipynb
├── day15
│   └── README.md
├── day16
│   ├── README.md
│   ├── VGG.ipynb
│   ├── cat.jpg
│   ├── day18.ipynb
│   ├── dog.jpeg
│   ├── imagenet_utils.py
│   └── utils.py
├── day18
│   └── PCA and Facial Recognition.ipynb
├── final_proj.pdf
├── guide
│   ├── DockerCheatsheet.md
│   ├── deepaws.md
│   └── keras.md
├── homework
│   └── Keras is cool!.ipynb
├── project1
│   ├── Proj1_v3.pdf
│   ├── data.csv
│   └── test.csv
├── project2
│   └── project2.pdf
└── syllabus.md
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | .DS_Store
2 | sentinel
3 | __pycache__/
4 | *.py[co]
5 | .ipynb_checkpoints/
6 | project1/test_labels.csv
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Data Science with Kaggle Decal Spring 2017
2 | ## Taught by [Phillip Kuznetsov](https://github.com/philkuz), [Raul Puri](https://github.com/raulpuric), [Humza Iqbal](https://github.com/humzaiqbal), [James Bartlett](https://github.com/JamesMBartlett), [Dan Geng](https://github.com/dangeng), [Jordan Prosky](https://github.com/jorpro)
3 | Data Science Decal taught at UC Berkeley, Spring Semester 2017.
4 | 
5 | To get started on the coursework, you'll first need to clone the repo:
6 | ```
7 | cd
8 | git clone https://github.com/kaggledecal/sp17.git kaggledecal
9 | ```
10 | 
11 | Then you'll want to start a jupyter server from within this repo:
12 | ```
13 | cd kaggledecal
14 | jupyter notebook
15 | ```
16 | 
17 | ## Setting up your environment
18 | If you have not yet installed jupyter on your machine, you'll probably get an error message when you run the above command.
19 | ### Mac OS 10.10+, Linux, and Windows 10
20 | Use [Docker](https://docs.docker.com/engine/installation/).
21 | ### Windows < 10 and Mac OS < 10.10
22 | Use [Anaconda](https://www.continuum.io/downloads).
23 | ### What should I use?
24 | The choice between the two is at your discretion; we encourage Docker because it lets you quickly prototype papers like [Style Transfer](https://hub.docker.com/r/kchentw/neural-style/) ([paper](http://arxiv.org/abs/1508.06576)).
--------------------------------------------------------------------------------
/datasets/titanic/test.csv:
--------------------------------------------------------------------------------
1 | PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
2 | 892,3,"Kelly, Mr. 
James",male,34.5,0,0,330911,7.8292,,Q 3 | 893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S 4 | 894,2,"Myles, Mr. Thomas Francis",male,62,0,0,240276,9.6875,,Q 5 | 895,3,"Wirz, Mr. Albert",male,27,0,0,315154,8.6625,,S 6 | 896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22,1,1,3101298,12.2875,,S 7 | 897,3,"Svensson, Mr. Johan Cervin",male,14,0,0,7538,9.225,,S 8 | 898,3,"Connolly, Miss. Kate",female,30,0,0,330972,7.6292,,Q 9 | 899,2,"Caldwell, Mr. Albert Francis",male,26,1,1,248738,29,,S 10 | 900,3,"Abrahim, Mrs. Joseph (Sophie Halaut Easu)",female,18,0,0,2657,7.2292,,C 11 | 901,3,"Davies, Mr. John Samuel",male,21,2,0,A/4 48871,24.15,,S 12 | 902,3,"Ilieff, Mr. Ylio",male,,0,0,349220,7.8958,,S 13 | 903,1,"Jones, Mr. Charles Cresson",male,46,0,0,694,26,,S 14 | 904,1,"Snyder, Mrs. John Pillsbury (Nelle Stevenson)",female,23,1,0,21228,82.2667,B45,S 15 | 905,2,"Howard, Mr. Benjamin",male,63,1,0,24065,26,,S 16 | 906,1,"Chaffee, Mrs. Herbert Fuller (Carrie Constance Toogood)",female,47,1,0,W.E.P. 5734,61.175,E31,S 17 | 907,2,"del Carlo, Mrs. Sebastiano (Argenia Genovesi)",female,24,1,0,SC/PARIS 2167,27.7208,,C 18 | 908,2,"Keane, Mr. Daniel",male,35,0,0,233734,12.35,,Q 19 | 909,3,"Assaf, Mr. Gerios",male,21,0,0,2692,7.225,,C 20 | 910,3,"Ilmakangas, Miss. Ida Livija",female,27,1,0,STON/O2. 3101270,7.925,,S 21 | 911,3,"Assaf Khalil, Mrs. Mariana (Miriam"")""",female,45,0,0,2696,7.225,,C 22 | 912,1,"Rothschild, Mr. Martin",male,55,1,0,PC 17603,59.4,,C 23 | 913,3,"Olsen, Master. Artur Karl",male,9,0,1,C 17368,3.1708,,S 24 | 914,1,"Flegenheim, Mrs. Alfred (Antoinette)",female,,0,0,PC 17598,31.6833,,S 25 | 915,1,"Williams, Mr. Richard Norris II",male,21,0,1,PC 17597,61.3792,,C 26 | 916,1,"Ryerson, Mrs. Arthur Larned (Emily Maria Borie)",female,48,1,3,PC 17608,262.375,B57 B59 B63 B66,C 27 | 917,3,"Robins, Mr. Alexander A",male,50,1,0,A/5. 3337,14.5,,S 28 | 918,1,"Ostby, Miss. Helene Ragnhild",female,22,0,1,113509,61.9792,B36,C 29 | 919,3,"Daher, Mr. Shedid",male,22.5,0,0,2698,7.225,,C 30 | 920,1,"Brady, Mr. John Bertram",male,41,0,0,113054,30.5,A21,S 31 | 921,3,"Samaan, Mr. Elias",male,,2,0,2662,21.6792,,C 32 | 922,2,"Louch, Mr. Charles Alexander",male,50,1,0,SC/AH 3085,26,,S 33 | 923,2,"Jefferys, Mr. Clifford Thomas",male,24,2,0,C.A. 31029,31.5,,S 34 | 924,3,"Dean, Mrs. Bertram (Eva Georgetta Light)",female,33,1,2,C.A. 2315,20.575,,S 35 | 925,3,"Johnston, Mrs. Andrew G (Elizabeth Lily"" Watson)""",female,,1,2,W./C. 6607,23.45,,S 36 | 926,1,"Mock, Mr. Philipp Edmund",male,30,1,0,13236,57.75,C78,C 37 | 927,3,"Katavelas, Mr. Vassilios (Catavelas Vassilios"")""",male,18.5,0,0,2682,7.2292,,C 38 | 928,3,"Roth, Miss. Sarah A",female,,0,0,342712,8.05,,S 39 | 929,3,"Cacic, Miss. Manda",female,21,0,0,315087,8.6625,,S 40 | 930,3,"Sap, Mr. Julius",male,25,0,0,345768,9.5,,S 41 | 931,3,"Hee, Mr. Ling",male,,0,0,1601,56.4958,,S 42 | 932,3,"Karun, Mr. Franz",male,39,0,1,349256,13.4167,,C 43 | 933,1,"Franklin, Mr. Thomas Parham",male,,0,0,113778,26.55,D34,S 44 | 934,3,"Goldsmith, Mr. Nathan",male,41,0,0,SOTON/O.Q. 3101263,7.85,,S 45 | 935,2,"Corbett, Mrs. Walter H (Irene Colvin)",female,30,0,0,237249,13,,S 46 | 936,1,"Kimball, Mrs. Edwin Nelson Jr (Gertrude Parsons)",female,45,1,0,11753,52.5542,D19,S 47 | 937,3,"Peltomaki, Mr. Nikolai Johannes",male,25,0,0,STON/O 2. 3101291,7.925,,S 48 | 938,1,"Chevre, Mr. Paul Romaine",male,45,0,0,PC 17594,29.7,A9,C 49 | 939,3,"Shaughnessy, Mr. Patrick",male,,0,0,370374,7.75,,Q 50 | 940,1,"Bucknell, Mrs. 
William Robert (Emma Eliza Ward)",female,60,0,0,11813,76.2917,D15,C 51 | 941,3,"Coutts, Mrs. William (Winnie Minnie"" Treanor)""",female,36,0,2,C.A. 37671,15.9,,S 52 | 942,1,"Smith, Mr. Lucien Philip",male,24,1,0,13695,60,C31,S 53 | 943,2,"Pulbaum, Mr. Franz",male,27,0,0,SC/PARIS 2168,15.0333,,C 54 | 944,2,"Hocking, Miss. Ellen Nellie""""",female,20,2,1,29105,23,,S 55 | 945,1,"Fortune, Miss. Ethel Flora",female,28,3,2,19950,263,C23 C25 C27,S 56 | 946,2,"Mangiavacchi, Mr. Serafino Emilio",male,,0,0,SC/A.3 2861,15.5792,,C 57 | 947,3,"Rice, Master. Albert",male,10,4,1,382652,29.125,,Q 58 | 948,3,"Cor, Mr. Bartol",male,35,0,0,349230,7.8958,,S 59 | 949,3,"Abelseth, Mr. Olaus Jorgensen",male,25,0,0,348122,7.65,F G63,S 60 | 950,3,"Davison, Mr. Thomas Henry",male,,1,0,386525,16.1,,S 61 | 951,1,"Chaudanson, Miss. Victorine",female,36,0,0,PC 17608,262.375,B61,C 62 | 952,3,"Dika, Mr. Mirko",male,17,0,0,349232,7.8958,,S 63 | 953,2,"McCrae, Mr. Arthur Gordon",male,32,0,0,237216,13.5,,S 64 | 954,3,"Bjorklund, Mr. Ernst Herbert",male,18,0,0,347090,7.75,,S 65 | 955,3,"Bradley, Miss. Bridget Delia",female,22,0,0,334914,7.725,,Q 66 | 956,1,"Ryerson, Master. John Borie",male,13,2,2,PC 17608,262.375,B57 B59 B63 B66,C 67 | 957,2,"Corey, Mrs. Percy C (Mary Phyllis Elizabeth Miller)",female,,0,0,F.C.C. 13534,21,,S 68 | 958,3,"Burns, Miss. Mary Delia",female,18,0,0,330963,7.8792,,Q 69 | 959,1,"Moore, Mr. Clarence Bloomfield",male,47,0,0,113796,42.4,,S 70 | 960,1,"Tucker, Mr. Gilbert Milligan Jr",male,31,0,0,2543,28.5375,C53,C 71 | 961,1,"Fortune, Mrs. Mark (Mary McDougald)",female,60,1,4,19950,263,C23 C25 C27,S 72 | 962,3,"Mulvihill, Miss. Bertha E",female,24,0,0,382653,7.75,,Q 73 | 963,3,"Minkoff, Mr. Lazar",male,21,0,0,349211,7.8958,,S 74 | 964,3,"Nieminen, Miss. Manta Josefina",female,29,0,0,3101297,7.925,,S 75 | 965,1,"Ovies y Rodriguez, Mr. Servando",male,28.5,0,0,PC 17562,27.7208,D43,C 76 | 966,1,"Geiger, Miss. Amalie",female,35,0,0,113503,211.5,C130,C 77 | 967,1,"Keeping, Mr. Edwin",male,32.5,0,0,113503,211.5,C132,C 78 | 968,3,"Miles, Mr. Frank",male,,0,0,359306,8.05,,S 79 | 969,1,"Cornell, Mrs. Robert Clifford (Malvina Helen Lamson)",female,55,2,0,11770,25.7,C101,S 80 | 970,2,"Aldworth, Mr. Charles Augustus",male,30,0,0,248744,13,,S 81 | 971,3,"Doyle, Miss. Elizabeth",female,24,0,0,368702,7.75,,Q 82 | 972,3,"Boulos, Master. Akar",male,6,1,1,2678,15.2458,,C 83 | 973,1,"Straus, Mr. Isidor",male,67,1,0,PC 17483,221.7792,C55 C57,S 84 | 974,1,"Case, Mr. Howard Brown",male,49,0,0,19924,26,,S 85 | 975,3,"Demetri, Mr. Marinko",male,,0,0,349238,7.8958,,S 86 | 976,2,"Lamb, Mr. John Joseph",male,,0,0,240261,10.7083,,Q 87 | 977,3,"Khalil, Mr. Betros",male,,1,0,2660,14.4542,,C 88 | 978,3,"Barry, Miss. Julia",female,27,0,0,330844,7.8792,,Q 89 | 979,3,"Badman, Miss. Emily Louisa",female,18,0,0,A/4 31416,8.05,,S 90 | 980,3,"O'Donoghue, Ms. Bridget",female,,0,0,364856,7.75,,Q 91 | 981,2,"Wells, Master. Ralph Lester",male,2,1,1,29103,23,,S 92 | 982,3,"Dyker, Mrs. Adolf Fredrik (Anna Elisabeth Judith Andersson)",female,22,1,0,347072,13.9,,S 93 | 983,3,"Pedersen, Mr. Olaf",male,,0,0,345498,7.775,,S 94 | 984,1,"Davidson, Mrs. Thornton (Orian Hays)",female,27,1,2,F.C. 12750,52,B71,S 95 | 985,3,"Guest, Mr. Robert",male,,0,0,376563,8.05,,S 96 | 986,1,"Birnbaum, Mr. Jakob",male,25,0,0,13905,26,,C 97 | 987,3,"Tenglin, Mr. Gunnar Isidor",male,25,0,0,350033,7.7958,,S 98 | 988,1,"Cavendish, Mrs. Tyrell William (Julia Florence Siegel)",female,76,1,0,19877,78.85,C46,S 99 | 989,3,"Makinen, Mr. Kalle Edvard",male,29,0,0,STON/O 2. 
3101268,7.925,,S 100 | 990,3,"Braf, Miss. Elin Ester Maria",female,20,0,0,347471,7.8542,,S 101 | 991,3,"Nancarrow, Mr. William Henry",male,33,0,0,A./5. 3338,8.05,,S 102 | 992,1,"Stengel, Mrs. Charles Emil Henry (Annie May Morris)",female,43,1,0,11778,55.4417,C116,C 103 | 993,2,"Weisz, Mr. Leopold",male,27,1,0,228414,26,,S 104 | 994,3,"Foley, Mr. William",male,,0,0,365235,7.75,,Q 105 | 995,3,"Johansson Palmquist, Mr. Oskar Leander",male,26,0,0,347070,7.775,,S 106 | 996,3,"Thomas, Mrs. Alexander (Thamine Thelma"")""",female,16,1,1,2625,8.5167,,C 107 | 997,3,"Holthen, Mr. Johan Martin",male,28,0,0,C 4001,22.525,,S 108 | 998,3,"Buckley, Mr. Daniel",male,21,0,0,330920,7.8208,,Q 109 | 999,3,"Ryan, Mr. Edward",male,,0,0,383162,7.75,,Q 110 | 1000,3,"Willer, Mr. Aaron (Abi Weller"")""",male,,0,0,3410,8.7125,,S 111 | 1001,2,"Swane, Mr. George",male,18.5,0,0,248734,13,F,S 112 | 1002,2,"Stanton, Mr. Samuel Ward",male,41,0,0,237734,15.0458,,C 113 | 1003,3,"Shine, Miss. Ellen Natalia",female,,0,0,330968,7.7792,,Q 114 | 1004,1,"Evans, Miss. Edith Corse",female,36,0,0,PC 17531,31.6792,A29,C 115 | 1005,3,"Buckley, Miss. Katherine",female,18.5,0,0,329944,7.2833,,Q 116 | 1006,1,"Straus, Mrs. Isidor (Rosalie Ida Blun)",female,63,1,0,PC 17483,221.7792,C55 C57,S 117 | 1007,3,"Chronopoulos, Mr. Demetrios",male,18,1,0,2680,14.4542,,C 118 | 1008,3,"Thomas, Mr. John",male,,0,0,2681,6.4375,,C 119 | 1009,3,"Sandstrom, Miss. Beatrice Irene",female,1,1,1,PP 9549,16.7,G6,S 120 | 1010,1,"Beattie, Mr. Thomson",male,36,0,0,13050,75.2417,C6,C 121 | 1011,2,"Chapman, Mrs. John Henry (Sara Elizabeth Lawry)",female,29,1,0,SC/AH 29037,26,,S 122 | 1012,2,"Watt, Miss. Bertha J",female,12,0,0,C.A. 33595,15.75,,S 123 | 1013,3,"Kiernan, Mr. John",male,,1,0,367227,7.75,,Q 124 | 1014,1,"Schabert, Mrs. Paul (Emma Mock)",female,35,1,0,13236,57.75,C28,C 125 | 1015,3,"Carver, Mr. Alfred John",male,28,0,0,392095,7.25,,S 126 | 1016,3,"Kennedy, Mr. John",male,,0,0,368783,7.75,,Q 127 | 1017,3,"Cribb, Miss. Laura Alice",female,17,0,1,371362,16.1,,S 128 | 1018,3,"Brobeck, Mr. Karl Rudolf",male,22,0,0,350045,7.7958,,S 129 | 1019,3,"McCoy, Miss. Alicia",female,,2,0,367226,23.25,,Q 130 | 1020,2,"Bowenur, Mr. Solomon",male,42,0,0,211535,13,,S 131 | 1021,3,"Petersen, Mr. Marius",male,24,0,0,342441,8.05,,S 132 | 1022,3,"Spinner, Mr. Henry John",male,32,0,0,STON/OQ. 369943,8.05,,S 133 | 1023,1,"Gracie, Col. Archibald IV",male,53,0,0,113780,28.5,C51,C 134 | 1024,3,"Lefebre, Mrs. Frank (Frances)",female,,0,4,4133,25.4667,,S 135 | 1025,3,"Thomas, Mr. Charles P",male,,1,0,2621,6.4375,,C 136 | 1026,3,"Dintcheff, Mr. Valtcho",male,43,0,0,349226,7.8958,,S 137 | 1027,3,"Carlsson, Mr. Carl Robert",male,24,0,0,350409,7.8542,,S 138 | 1028,3,"Zakarian, Mr. Mapriededer",male,26.5,0,0,2656,7.225,,C 139 | 1029,2,"Schmidt, Mr. August",male,26,0,0,248659,13,,S 140 | 1030,3,"Drapkin, Miss. Jennie",female,23,0,0,SOTON/OQ 392083,8.05,,S 141 | 1031,3,"Goodwin, Mr. Charles Frederick",male,40,1,6,CA 2144,46.9,,S 142 | 1032,3,"Goodwin, Miss. Jessie Allis",female,10,5,2,CA 2144,46.9,,S 143 | 1033,1,"Daniels, Miss. Sarah",female,33,0,0,113781,151.55,,S 144 | 1034,1,"Ryerson, Mr. Arthur Larned",male,61,1,3,PC 17608,262.375,B57 B59 B63 B66,C 145 | 1035,2,"Beauchamp, Mr. Henry James",male,28,0,0,244358,26,,S 146 | 1036,1,"Lindeberg-Lind, Mr. Erik Gustaf (Mr Edward Lingrey"")""",male,42,0,0,17475,26.55,,S 147 | 1037,3,"Vander Planke, Mr. Julius",male,31,3,0,345763,18,,S 148 | 1038,1,"Hilliard, Mr. Herbert Henry",male,,0,0,17463,51.8625,E46,S 149 | 1039,3,"Davies, Mr. 
Evan",male,22,0,0,SC/A4 23568,8.05,,S 150 | 1040,1,"Crafton, Mr. John Bertram",male,,0,0,113791,26.55,,S 151 | 1041,2,"Lahtinen, Rev. William",male,30,1,1,250651,26,,S 152 | 1042,1,"Earnshaw, Mrs. Boulton (Olive Potter)",female,23,0,1,11767,83.1583,C54,C 153 | 1043,3,"Matinoff, Mr. Nicola",male,,0,0,349255,7.8958,,C 154 | 1044,3,"Storey, Mr. Thomas",male,60.5,0,0,3701,,,S 155 | 1045,3,"Klasen, Mrs. (Hulda Kristina Eugenia Lofqvist)",female,36,0,2,350405,12.1833,,S 156 | 1046,3,"Asplund, Master. Filip Oscar",male,13,4,2,347077,31.3875,,S 157 | 1047,3,"Duquemin, Mr. Joseph",male,24,0,0,S.O./P.P. 752,7.55,,S 158 | 1048,1,"Bird, Miss. Ellen",female,29,0,0,PC 17483,221.7792,C97,S 159 | 1049,3,"Lundin, Miss. Olga Elida",female,23,0,0,347469,7.8542,,S 160 | 1050,1,"Borebank, Mr. John James",male,42,0,0,110489,26.55,D22,S 161 | 1051,3,"Peacock, Mrs. Benjamin (Edith Nile)",female,26,0,2,SOTON/O.Q. 3101315,13.775,,S 162 | 1052,3,"Smyth, Miss. Julia",female,,0,0,335432,7.7333,,Q 163 | 1053,3,"Touma, Master. Georges Youssef",male,7,1,1,2650,15.2458,,C 164 | 1054,2,"Wright, Miss. Marion",female,26,0,0,220844,13.5,,S 165 | 1055,3,"Pearce, Mr. Ernest",male,,0,0,343271,7,,S 166 | 1056,2,"Peruschitz, Rev. Joseph Maria",male,41,0,0,237393,13,,S 167 | 1057,3,"Kink-Heilmann, Mrs. Anton (Luise Heilmann)",female,26,1,1,315153,22.025,,S 168 | 1058,1,"Brandeis, Mr. Emil",male,48,0,0,PC 17591,50.4958,B10,C 169 | 1059,3,"Ford, Mr. Edward Watson",male,18,2,2,W./C. 6608,34.375,,S 170 | 1060,1,"Cassebeer, Mrs. Henry Arthur Jr (Eleanor Genevieve Fosdick)",female,,0,0,17770,27.7208,,C 171 | 1061,3,"Hellstrom, Miss. Hilda Maria",female,22,0,0,7548,8.9625,,S 172 | 1062,3,"Lithman, Mr. Simon",male,,0,0,S.O./P.P. 251,7.55,,S 173 | 1063,3,"Zakarian, Mr. Ortin",male,27,0,0,2670,7.225,,C 174 | 1064,3,"Dyker, Mr. Adolf Fredrik",male,23,1,0,347072,13.9,,S 175 | 1065,3,"Torfa, Mr. Assad",male,,0,0,2673,7.2292,,C 176 | 1066,3,"Asplund, Mr. Carl Oscar Vilhelm Gustafsson",male,40,1,5,347077,31.3875,,S 177 | 1067,2,"Brown, Miss. Edith Eileen",female,15,0,2,29750,39,,S 178 | 1068,2,"Sincock, Miss. Maude",female,20,0,0,C.A. 33112,36.75,,S 179 | 1069,1,"Stengel, Mr. Charles Emil Henry",male,54,1,0,11778,55.4417,C116,C 180 | 1070,2,"Becker, Mrs. Allen Oliver (Nellie E Baumgardner)",female,36,0,3,230136,39,F4,S 181 | 1071,1,"Compton, Mrs. Alexander Taylor (Mary Eliza Ingersoll)",female,64,0,2,PC 17756,83.1583,E45,C 182 | 1072,2,"McCrie, Mr. James Matthew",male,30,0,0,233478,13,,S 183 | 1073,1,"Compton, Mr. Alexander Taylor Jr",male,37,1,1,PC 17756,83.1583,E52,C 184 | 1074,1,"Marvin, Mrs. Daniel Warner (Mary Graham Carmichael Farquarson)",female,18,1,0,113773,53.1,D30,S 185 | 1075,3,"Lane, Mr. Patrick",male,,0,0,7935,7.75,,Q 186 | 1076,1,"Douglas, Mrs. Frederick Charles (Mary Helene Baxter)",female,27,1,1,PC 17558,247.5208,B58 B60,C 187 | 1077,2,"Maybery, Mr. Frank Hubert",male,40,0,0,239059,16,,S 188 | 1078,2,"Phillips, Miss. Alice Frances Louisa",female,21,0,1,S.O./P.P. 2,21,,S 189 | 1079,3,"Davies, Mr. Joseph",male,17,2,0,A/4 48873,8.05,,S 190 | 1080,3,"Sage, Miss. Ada",female,,8,2,CA. 2343,69.55,,S 191 | 1081,2,"Veal, Mr. James",male,40,0,0,28221,13,,S 192 | 1082,2,"Angle, Mr. William A",male,34,1,0,226875,26,,S 193 | 1083,1,"Salomon, Mr. Abraham L",male,,0,0,111163,26,,S 194 | 1084,3,"van Billiard, Master. Walter John",male,11.5,1,1,A/5. 851,14.5,,S 195 | 1085,2,"Lingane, Mr. John",male,61,0,0,235509,12.35,,Q 196 | 1086,2,"Drew, Master. Marshall Brines",male,8,0,2,28220,32.5,,S 197 | 1087,3,"Karlsson, Mr. 
Julius Konrad Eugen",male,33,0,0,347465,7.8542,,S 198 | 1088,1,"Spedden, Master. Robert Douglas",male,6,0,2,16966,134.5,E34,C 199 | 1089,3,"Nilsson, Miss. Berta Olivia",female,18,0,0,347066,7.775,,S 200 | 1090,2,"Baimbrigge, Mr. Charles Robert",male,23,0,0,C.A. 31030,10.5,,S 201 | 1091,3,"Rasmussen, Mrs. (Lena Jacobsen Solvang)",female,,0,0,65305,8.1125,,S 202 | 1092,3,"Murphy, Miss. Nora",female,,0,0,36568,15.5,,Q 203 | 1093,3,"Danbom, Master. Gilbert Sigvard Emanuel",male,0.33,0,2,347080,14.4,,S 204 | 1094,1,"Astor, Col. John Jacob",male,47,1,0,PC 17757,227.525,C62 C64,C 205 | 1095,2,"Quick, Miss. Winifred Vera",female,8,1,1,26360,26,,S 206 | 1096,2,"Andrew, Mr. Frank Thomas",male,25,0,0,C.A. 34050,10.5,,S 207 | 1097,1,"Omont, Mr. Alfred Fernand",male,,0,0,F.C. 12998,25.7417,,C 208 | 1098,3,"McGowan, Miss. Katherine",female,35,0,0,9232,7.75,,Q 209 | 1099,2,"Collett, Mr. Sidney C Stuart",male,24,0,0,28034,10.5,,S 210 | 1100,1,"Rosenbaum, Miss. Edith Louise",female,33,0,0,PC 17613,27.7208,A11,C 211 | 1101,3,"Delalic, Mr. Redjo",male,25,0,0,349250,7.8958,,S 212 | 1102,3,"Andersen, Mr. Albert Karvin",male,32,0,0,C 4001,22.525,,S 213 | 1103,3,"Finoli, Mr. Luigi",male,,0,0,SOTON/O.Q. 3101308,7.05,,S 214 | 1104,2,"Deacon, Mr. Percy William",male,17,0,0,S.O.C. 14879,73.5,,S 215 | 1105,2,"Howard, Mrs. Benjamin (Ellen Truelove Arman)",female,60,1,0,24065,26,,S 216 | 1106,3,"Andersson, Miss. Ida Augusta Margareta",female,38,4,2,347091,7.775,,S 217 | 1107,1,"Head, Mr. Christopher",male,42,0,0,113038,42.5,B11,S 218 | 1108,3,"Mahon, Miss. Bridget Delia",female,,0,0,330924,7.8792,,Q 219 | 1109,1,"Wick, Mr. George Dennick",male,57,1,1,36928,164.8667,,S 220 | 1110,1,"Widener, Mrs. George Dunton (Eleanor Elkins)",female,50,1,1,113503,211.5,C80,C 221 | 1111,3,"Thomson, Mr. Alexander Morrison",male,,0,0,32302,8.05,,S 222 | 1112,2,"Duran y More, Miss. Florentina",female,30,1,0,SC/PARIS 2148,13.8583,,C 223 | 1113,3,"Reynolds, Mr. Harold J",male,21,0,0,342684,8.05,,S 224 | 1114,2,"Cook, Mrs. (Selena Rogers)",female,22,0,0,W./C. 14266,10.5,F33,S 225 | 1115,3,"Karlsson, Mr. Einar Gervasius",male,21,0,0,350053,7.7958,,S 226 | 1116,1,"Candee, Mrs. Edward (Helen Churchill Hungerford)",female,53,0,0,PC 17606,27.4458,,C 227 | 1117,3,"Moubarek, Mrs. George (Omine Amenia"" Alexander)""",female,,0,2,2661,15.2458,,C 228 | 1118,3,"Asplund, Mr. Johan Charles",male,23,0,0,350054,7.7958,,S 229 | 1119,3,"McNeill, Miss. Bridget",female,,0,0,370368,7.75,,Q 230 | 1120,3,"Everett, Mr. Thomas James",male,40.5,0,0,C.A. 6212,15.1,,S 231 | 1121,2,"Hocking, Mr. Samuel James Metcalfe",male,36,0,0,242963,13,,S 232 | 1122,2,"Sweet, Mr. George Frederick",male,14,0,0,220845,65,,S 233 | 1123,1,"Willard, Miss. Constance",female,21,0,0,113795,26.55,,S 234 | 1124,3,"Wiklund, Mr. Karl Johan",male,21,1,0,3101266,6.4958,,S 235 | 1125,3,"Linehan, Mr. Michael",male,,0,0,330971,7.8792,,Q 236 | 1126,1,"Cumings, Mr. John Bradley",male,39,1,0,PC 17599,71.2833,C85,C 237 | 1127,3,"Vendel, Mr. Olof Edvin",male,20,0,0,350416,7.8542,,S 238 | 1128,1,"Warren, Mr. Frank Manley",male,64,1,0,110813,75.25,D37,C 239 | 1129,3,"Baccos, Mr. Raffull",male,20,0,0,2679,7.225,,C 240 | 1130,2,"Hiltunen, Miss. Marta",female,18,1,1,250650,13,,S 241 | 1131,1,"Douglas, Mrs. Walter Donald (Mahala Dutton)",female,48,1,0,PC 17761,106.425,C86,C 242 | 1132,1,"Lindstrom, Mrs. Carl Johan (Sigrid Posse)",female,55,0,0,112377,27.7208,,C 243 | 1133,2,"Christy, Mrs. (Alice Frances)",female,45,0,2,237789,30,,S 244 | 1134,1,"Spedden, Mr. 
Frederic Oakley",male,45,1,1,16966,134.5,E34,C 245 | 1135,3,"Hyman, Mr. Abraham",male,,0,0,3470,7.8875,,S 246 | 1136,3,"Johnston, Master. William Arthur Willie""""",male,,1,2,W./C. 6607,23.45,,S 247 | 1137,1,"Kenyon, Mr. Frederick R",male,41,1,0,17464,51.8625,D21,S 248 | 1138,2,"Karnes, Mrs. J Frank (Claire Bennett)",female,22,0,0,F.C.C. 13534,21,,S 249 | 1139,2,"Drew, Mr. James Vivian",male,42,1,1,28220,32.5,,S 250 | 1140,2,"Hold, Mrs. Stephen (Annie Margaret Hill)",female,29,1,0,26707,26,,S 251 | 1141,3,"Khalil, Mrs. Betros (Zahie Maria"" Elias)""",female,,1,0,2660,14.4542,,C 252 | 1142,2,"West, Miss. Barbara J",female,0.92,1,2,C.A. 34651,27.75,,S 253 | 1143,3,"Abrahamsson, Mr. Abraham August Johannes",male,20,0,0,SOTON/O2 3101284,7.925,,S 254 | 1144,1,"Clark, Mr. Walter Miller",male,27,1,0,13508,136.7792,C89,C 255 | 1145,3,"Salander, Mr. Karl Johan",male,24,0,0,7266,9.325,,S 256 | 1146,3,"Wenzel, Mr. Linhart",male,32.5,0,0,345775,9.5,,S 257 | 1147,3,"MacKay, Mr. George William",male,,0,0,C.A. 42795,7.55,,S 258 | 1148,3,"Mahon, Mr. John",male,,0,0,AQ/4 3130,7.75,,Q 259 | 1149,3,"Niklasson, Mr. Samuel",male,28,0,0,363611,8.05,,S 260 | 1150,2,"Bentham, Miss. Lilian W",female,19,0,0,28404,13,,S 261 | 1151,3,"Midtsjo, Mr. Karl Albert",male,21,0,0,345501,7.775,,S 262 | 1152,3,"de Messemaeker, Mr. Guillaume Joseph",male,36.5,1,0,345572,17.4,,S 263 | 1153,3,"Nilsson, Mr. August Ferdinand",male,21,0,0,350410,7.8542,,S 264 | 1154,2,"Wells, Mrs. Arthur Henry (Addie"" Dart Trevaskis)""",female,29,0,2,29103,23,,S 265 | 1155,3,"Klasen, Miss. Gertrud Emilia",female,1,1,1,350405,12.1833,,S 266 | 1156,2,"Portaluppi, Mr. Emilio Ilario Giuseppe",male,30,0,0,C.A. 34644,12.7375,,C 267 | 1157,3,"Lyntakoff, Mr. Stanko",male,,0,0,349235,7.8958,,S 268 | 1158,1,"Chisholm, Mr. Roderick Robert Crispin",male,,0,0,112051,0,,S 269 | 1159,3,"Warren, Mr. Charles William",male,,0,0,C.A. 49867,7.55,,S 270 | 1160,3,"Howard, Miss. May Elizabeth",female,,0,0,A. 2. 39186,8.05,,S 271 | 1161,3,"Pokrnic, Mr. Mate",male,17,0,0,315095,8.6625,,S 272 | 1162,1,"McCaffry, Mr. Thomas Francis",male,46,0,0,13050,75.2417,C6,C 273 | 1163,3,"Fox, Mr. Patrick",male,,0,0,368573,7.75,,Q 274 | 1164,1,"Clark, Mrs. Walter Miller (Virginia McDowell)",female,26,1,0,13508,136.7792,C89,C 275 | 1165,3,"Lennon, Miss. Mary",female,,1,0,370371,15.5,,Q 276 | 1166,3,"Saade, Mr. Jean Nassr",male,,0,0,2676,7.225,,C 277 | 1167,2,"Bryhl, Miss. Dagmar Jenny Ingeborg ",female,20,1,0,236853,26,,S 278 | 1168,2,"Parker, Mr. Clifford Richard",male,28,0,0,SC 14888,10.5,,S 279 | 1169,2,"Faunthorpe, Mr. Harry",male,40,1,0,2926,26,,S 280 | 1170,2,"Ware, Mr. John James",male,30,1,0,CA 31352,21,,S 281 | 1171,2,"Oxenham, Mr. Percy Thomas",male,22,0,0,W./C. 14260,10.5,,S 282 | 1172,3,"Oreskovic, Miss. Jelka",female,23,0,0,315085,8.6625,,S 283 | 1173,3,"Peacock, Master. Alfred Edward",male,0.75,1,1,SOTON/O.Q. 3101315,13.775,,S 284 | 1174,3,"Fleming, Miss. Honora",female,,0,0,364859,7.75,,Q 285 | 1175,3,"Touma, Miss. Maria Youssef",female,9,1,1,2650,15.2458,,C 286 | 1176,3,"Rosblom, Miss. Salli Helena",female,2,1,1,370129,20.2125,,S 287 | 1177,3,"Dennis, Mr. William",male,36,0,0,A/5 21175,7.25,,S 288 | 1178,3,"Franklin, Mr. Charles (Charles Fardon)",male,,0,0,SOTON/O.Q. 3101314,7.25,,S 289 | 1179,1,"Snyder, Mr. John Pillsbury",male,24,1,0,21228,82.2667,B45,S 290 | 1180,3,"Mardirosian, Mr. Sarkis",male,,0,0,2655,7.2292,F E46,C 291 | 1181,3,"Ford, Mr. Arthur",male,,0,0,A/5 1478,8.05,,S 292 | 1182,1,"Rheims, Mr. 
George Alexander Lucien",male,,0,0,PC 17607,39.6,,S 293 | 1183,3,"Daly, Miss. Margaret Marcella Maggie""""",female,30,0,0,382650,6.95,,Q 294 | 1184,3,"Nasr, Mr. Mustafa",male,,0,0,2652,7.2292,,C 295 | 1185,1,"Dodge, Dr. Washington",male,53,1,1,33638,81.8583,A34,S 296 | 1186,3,"Wittevrongel, Mr. Camille",male,36,0,0,345771,9.5,,S 297 | 1187,3,"Angheloff, Mr. Minko",male,26,0,0,349202,7.8958,,S 298 | 1188,2,"Laroche, Miss. Louise",female,1,1,2,SC/Paris 2123,41.5792,,C 299 | 1189,3,"Samaan, Mr. Hanna",male,,2,0,2662,21.6792,,C 300 | 1190,1,"Loring, Mr. Joseph Holland",male,30,0,0,113801,45.5,,S 301 | 1191,3,"Johansson, Mr. Nils",male,29,0,0,347467,7.8542,,S 302 | 1192,3,"Olsson, Mr. Oscar Wilhelm",male,32,0,0,347079,7.775,,S 303 | 1193,2,"Malachard, Mr. Noel",male,,0,0,237735,15.0458,D,C 304 | 1194,2,"Phillips, Mr. Escott Robert",male,43,0,1,S.O./P.P. 2,21,,S 305 | 1195,3,"Pokrnic, Mr. Tome",male,24,0,0,315092,8.6625,,S 306 | 1196,3,"McCarthy, Miss. Catherine Katie""""",female,,0,0,383123,7.75,,Q 307 | 1197,1,"Crosby, Mrs. Edward Gifford (Catherine Elizabeth Halstead)",female,64,1,1,112901,26.55,B26,S 308 | 1198,1,"Allison, Mr. Hudson Joshua Creighton",male,30,1,2,113781,151.55,C22 C26,S 309 | 1199,3,"Aks, Master. Philip Frank",male,0.83,0,1,392091,9.35,,S 310 | 1200,1,"Hays, Mr. Charles Melville",male,55,1,1,12749,93.5,B69,S 311 | 1201,3,"Hansen, Mrs. Claus Peter (Jennie L Howard)",female,45,1,0,350026,14.1083,,S 312 | 1202,3,"Cacic, Mr. Jego Grga",male,18,0,0,315091,8.6625,,S 313 | 1203,3,"Vartanian, Mr. David",male,22,0,0,2658,7.225,,C 314 | 1204,3,"Sadowitz, Mr. Harry",male,,0,0,LP 1588,7.575,,S 315 | 1205,3,"Carr, Miss. Jeannie",female,37,0,0,368364,7.75,,Q 316 | 1206,1,"White, Mrs. John Stuart (Ella Holmes)",female,55,0,0,PC 17760,135.6333,C32,C 317 | 1207,3,"Hagardon, Miss. Kate",female,17,0,0,AQ/3. 30631,7.7333,,Q 318 | 1208,1,"Spencer, Mr. William Augustus",male,57,1,0,PC 17569,146.5208,B78,C 319 | 1209,2,"Rogers, Mr. Reginald Harry",male,19,0,0,28004,10.5,,S 320 | 1210,3,"Jonsson, Mr. Nils Hilding",male,27,0,0,350408,7.8542,,S 321 | 1211,2,"Jefferys, Mr. Ernest Wilfred",male,22,2,0,C.A. 31029,31.5,,S 322 | 1212,3,"Andersson, Mr. Johan Samuel",male,26,0,0,347075,7.775,,S 323 | 1213,3,"Krekorian, Mr. Neshan",male,25,0,0,2654,7.2292,F E57,C 324 | 1214,2,"Nesson, Mr. Israel",male,26,0,0,244368,13,F2,S 325 | 1215,1,"Rowe, Mr. Alfred G",male,33,0,0,113790,26.55,,S 326 | 1216,1,"Kreuchen, Miss. Emilie",female,39,0,0,24160,211.3375,,S 327 | 1217,3,"Assam, Mr. Ali",male,23,0,0,SOTON/O.Q. 3101309,7.05,,S 328 | 1218,2,"Becker, Miss. Ruth Elizabeth",female,12,2,1,230136,39,F4,S 329 | 1219,1,"Rosenshine, Mr. George (Mr George Thorne"")""",male,46,0,0,PC 17585,79.2,,C 330 | 1220,2,"Clarke, Mr. Charles Valentine",male,29,1,0,2003,26,,S 331 | 1221,2,"Enander, Mr. Ingvar",male,21,0,0,236854,13,,S 332 | 1222,2,"Davies, Mrs. John Morgan (Elizabeth Agnes Mary White) ",female,48,0,2,C.A. 33112,36.75,,S 333 | 1223,1,"Dulles, Mr. William Crothers",male,39,0,0,PC 17580,29.7,A18,C 334 | 1224,3,"Thomas, Mr. Tannous",male,,0,0,2684,7.225,,C 335 | 1225,3,"Nakid, Mrs. Said (Waika Mary"" Mowad)""",female,19,1,1,2653,15.7417,,C 336 | 1226,3,"Cor, Mr. Ivan",male,27,0,0,349229,7.8958,,S 337 | 1227,1,"Maguire, Mr. John Edward",male,30,0,0,110469,26,C106,S 338 | 1228,2,"de Brito, Mr. Jose Joaquim",male,32,0,0,244360,13,,S 339 | 1229,3,"Elias, Mr. Joseph",male,39,0,2,2675,7.2292,,C 340 | 1230,2,"Denbury, Mr. Herbert",male,25,0,0,C.A. 31029,31.5,,S 341 | 1231,3,"Betros, Master. 
Seman",male,,0,0,2622,7.2292,,C 342 | 1232,2,"Fillbrook, Mr. Joseph Charles",male,18,0,0,C.A. 15185,10.5,,S 343 | 1233,3,"Lundstrom, Mr. Thure Edvin",male,32,0,0,350403,7.5792,,S 344 | 1234,3,"Sage, Mr. John George",male,,1,9,CA. 2343,69.55,,S 345 | 1235,1,"Cardeza, Mrs. James Warburton Martinez (Charlotte Wardle Drake)",female,58,0,1,PC 17755,512.3292,B51 B53 B55,C 346 | 1236,3,"van Billiard, Master. James William",male,,1,1,A/5. 851,14.5,,S 347 | 1237,3,"Abelseth, Miss. Karen Marie",female,16,0,0,348125,7.65,,S 348 | 1238,2,"Botsford, Mr. William Hull",male,26,0,0,237670,13,,S 349 | 1239,3,"Whabee, Mrs. George Joseph (Shawneene Abi-Saab)",female,38,0,0,2688,7.2292,,C 350 | 1240,2,"Giles, Mr. Ralph",male,24,0,0,248726,13.5,,S 351 | 1241,2,"Walcroft, Miss. Nellie",female,31,0,0,F.C.C. 13528,21,,S 352 | 1242,1,"Greenfield, Mrs. Leo David (Blanche Strouse)",female,45,0,1,PC 17759,63.3583,D10 D12,C 353 | 1243,2,"Stokes, Mr. Philip Joseph",male,25,0,0,F.C.C. 13540,10.5,,S 354 | 1244,2,"Dibden, Mr. William",male,18,0,0,S.O.C. 14879,73.5,,S 355 | 1245,2,"Herman, Mr. Samuel",male,49,1,2,220845,65,,S 356 | 1246,3,"Dean, Miss. Elizabeth Gladys Millvina""""",female,0.17,1,2,C.A. 2315,20.575,,S 357 | 1247,1,"Julian, Mr. Henry Forbes",male,50,0,0,113044,26,E60,S 358 | 1248,1,"Brown, Mrs. John Murray (Caroline Lane Lamson)",female,59,2,0,11769,51.4792,C101,S 359 | 1249,3,"Lockyer, Mr. Edward",male,,0,0,1222,7.8792,,S 360 | 1250,3,"O'Keefe, Mr. Patrick",male,,0,0,368402,7.75,,Q 361 | 1251,3,"Lindell, Mrs. Edvard Bengtsson (Elin Gerda Persson)",female,30,1,0,349910,15.55,,S 362 | 1252,3,"Sage, Master. William Henry",male,14.5,8,2,CA. 2343,69.55,,S 363 | 1253,2,"Mallet, Mrs. Albert (Antoinette Magnin)",female,24,1,1,S.C./PARIS 2079,37.0042,,C 364 | 1254,2,"Ware, Mrs. John James (Florence Louise Long)",female,31,0,0,CA 31352,21,,S 365 | 1255,3,"Strilic, Mr. Ivan",male,27,0,0,315083,8.6625,,S 366 | 1256,1,"Harder, Mrs. George Achilles (Dorothy Annan)",female,25,1,0,11765,55.4417,E50,C 367 | 1257,3,"Sage, Mrs. John (Annie Bullen)",female,,1,9,CA. 2343,69.55,,S 368 | 1258,3,"Caram, Mr. Joseph",male,,1,0,2689,14.4583,,C 369 | 1259,3,"Riihivouri, Miss. Susanna Juhantytar Sanni""""",female,22,0,0,3101295,39.6875,,S 370 | 1260,1,"Gibson, Mrs. Leonard (Pauline C Boeson)",female,45,0,1,112378,59.4,,C 371 | 1261,2,"Pallas y Castello, Mr. Emilio",male,29,0,0,SC/PARIS 2147,13.8583,,C 372 | 1262,2,"Giles, Mr. Edgar",male,21,1,0,28133,11.5,,S 373 | 1263,1,"Wilson, Miss. Helen Alice",female,31,0,0,16966,134.5,E39 E41,C 374 | 1264,1,"Ismay, Mr. Joseph Bruce",male,49,0,0,112058,0,B52 B54 B56,S 375 | 1265,2,"Harbeck, Mr. William H",male,44,0,0,248746,13,,S 376 | 1266,1,"Dodge, Mrs. Washington (Ruth Vidaver)",female,54,1,1,33638,81.8583,A34,S 377 | 1267,1,"Bowen, Miss. Grace Scott",female,45,0,0,PC 17608,262.375,,C 378 | 1268,3,"Kink, Miss. Maria",female,22,2,0,315152,8.6625,,S 379 | 1269,2,"Cotterill, Mr. Henry Harry""""",male,21,0,0,29107,11.5,,S 380 | 1270,1,"Hipkins, Mr. William Edward",male,55,0,0,680,50,C39,S 381 | 1271,3,"Asplund, Master. Carl Edgar",male,5,4,2,347077,31.3875,,S 382 | 1272,3,"O'Connor, Mr. Patrick",male,,0,0,366713,7.75,,Q 383 | 1273,3,"Foley, Mr. Joseph",male,26,0,0,330910,7.8792,,Q 384 | 1274,3,"Risien, Mrs. Samuel (Emma)",female,,0,0,364498,14.5,,S 385 | 1275,3,"McNamee, Mrs. Neal (Eileen O'Leary)",female,19,1,0,376566,16.1,,S 386 | 1276,2,"Wheeler, Mr. Edwin Frederick""""",male,,0,0,SC/PARIS 2159,12.875,,S 387 | 1277,2,"Herman, Miss. Kate",female,24,1,2,220845,65,,S 388 | 1278,3,"Aronsson, Mr. 
Ernst Axel Algot",male,24,0,0,349911,7.775,,S 389 | 1279,2,"Ashby, Mr. John",male,57,0,0,244346,13,,S 390 | 1280,3,"Canavan, Mr. Patrick",male,21,0,0,364858,7.75,,Q 391 | 1281,3,"Palsson, Master. Paul Folke",male,6,3,1,349909,21.075,,S 392 | 1282,1,"Payne, Mr. Vivian Ponsonby",male,23,0,0,12749,93.5,B24,S 393 | 1283,1,"Lines, Mrs. Ernest H (Elizabeth Lindsey James)",female,51,0,1,PC 17592,39.4,D28,S 394 | 1284,3,"Abbott, Master. Eugene Joseph",male,13,0,2,C.A. 2673,20.25,,S 395 | 1285,2,"Gilbert, Mr. William",male,47,0,0,C.A. 30769,10.5,,S 396 | 1286,3,"Kink-Heilmann, Mr. Anton",male,29,3,1,315153,22.025,,S 397 | 1287,1,"Smith, Mrs. Lucien Philip (Mary Eloise Hughes)",female,18,1,0,13695,60,C31,S 398 | 1288,3,"Colbert, Mr. Patrick",male,24,0,0,371109,7.25,,Q 399 | 1289,1,"Frolicher-Stehli, Mrs. Maxmillian (Margaretha Emerentia Stehli)",female,48,1,1,13567,79.2,B41,C 400 | 1290,3,"Larsson-Rondberg, Mr. Edvard A",male,22,0,0,347065,7.775,,S 401 | 1291,3,"Conlon, Mr. Thomas Henry",male,31,0,0,21332,7.7333,,Q 402 | 1292,1,"Bonnell, Miss. Caroline",female,30,0,0,36928,164.8667,C7,S 403 | 1293,2,"Gale, Mr. Harry",male,38,1,0,28664,21,,S 404 | 1294,1,"Gibson, Miss. Dorothy Winifred",female,22,0,1,112378,59.4,,C 405 | 1295,1,"Carrau, Mr. Jose Pedro",male,17,0,0,113059,47.1,,S 406 | 1296,1,"Frauenthal, Mr. Isaac Gerald",male,43,1,0,17765,27.7208,D40,C 407 | 1297,2,"Nourney, Mr. Alfred (Baron von Drachstedt"")""",male,20,0,0,SC/PARIS 2166,13.8625,D38,C 408 | 1298,2,"Ware, Mr. William Jeffery",male,23,1,0,28666,10.5,,S 409 | 1299,1,"Widener, Mr. George Dunton",male,50,1,1,113503,211.5,C80,C 410 | 1300,3,"Riordan, Miss. Johanna Hannah""""",female,,0,0,334915,7.7208,,Q 411 | 1301,3,"Peacock, Miss. Treasteall",female,3,1,1,SOTON/O.Q. 3101315,13.775,,S 412 | 1302,3,"Naughton, Miss. Hannah",female,,0,0,365237,7.75,,Q 413 | 1303,1,"Minahan, Mrs. William Edward (Lillian E Thorpe)",female,37,1,0,19928,90,C78,Q 414 | 1304,3,"Henriksson, Miss. Jenny Lovisa",female,28,0,0,347086,7.775,,S 415 | 1305,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.05,,S 416 | 1306,1,"Oliva y Ocana, Dona. Fermina",female,39,0,0,PC 17758,108.9,C105,C 417 | 1307,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.25,,S 418 | 1308,3,"Ware, Mr. Frederick",male,,0,0,359309,8.05,,S 419 | 1309,3,"Peter, Master. Michael J",male,,1,1,2668,22.3583,,C 420 | -------------------------------------------------------------------------------- /day01/README.md: -------------------------------------------------------------------------------- 1 | # Day 1 2 | Introduction to the course and installation of the necessary components for the class. 3 | 4 | [Webcast](https://www.youtube.com/watch?v=RUrXA9koSCw&t=26s) 5 | 6 | [Slides](https://docs.google.com/a/berkeley.edu/presentation/d/1RJvvZ2MUi3OKO1gF8X92AWYLg9Xb2ZfsSpAeM-6eqEk/edit?usp=sharing) 7 | 8 | [Quiz](https://goo.gl/forms/moXHmopQmgXzH6h32) 9 | -------------------------------------------------------------------------------- /day02/README.md: -------------------------------------------------------------------------------- 1 | # Day 2 2 | Introduction to numpy and scientific computing. 
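Numpy's core object is the n-dimensional array, which lets you replace slow Python loops with fast vectorized operations; for example, `(np.arange(1, 11) ** 2).sum()` squares and sums the numbers 1 through 10 in a single line (that snippet is only an illustration, not part of today's assignment).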
We will briefly go over the [intro-to-numpy](https://github.com/kaggledecal/sp17/blob/master/day02/intro-to-numpy.ipynb) and [intro-to-jupyter](https://github.com/kaggledecal/sp17/blob/master/day02/intro-to-jupyter.ipynb) notebooks in class today, and then you will work through a [simple assignment](https://github.com/kaggledecal/sp17/blob/master/day02/pset1.ipynb) that applies some of what you learned in lecture.
3 | 
4 | 
5 | [Slides](https://docs.google.com/a/berkeley.edu/presentation/d/1i5NNMnSpeKMhkQETF29zvUEYjv3dO8tYKmi5nhejatQ/edit?usp=sharing)
6 | 
7 | [Video](https://www.youtube.com/watch?v=QNpZFI9AdQg)
8 | 
9 | [Homework](https://github.com/kaggledecal/sp17/blob/master/day02/pset1.ipynb)
10 | 
11 | ## Notebooks
12 | [intro-to-numpy](https://github.com/kaggledecal/sp17/blob/master/day02/intro-to-numpy.ipynb)
13 | 
14 | 
15 | [intro-to-jupyter](https://github.com/kaggledecal/sp17/blob/master/day02/intro-to-jupyter.ipynb)
16 | 
--------------------------------------------------------------------------------
/day02/pset1.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": null,
6 | "metadata": {
7 | "collapsed": false
8 | },
9 | "outputs": [],
10 | "source": [
11 | "#### RUN ME PLEASE ####\n",
12 | "\"\"\"Use 'Shift + Enter' when focused on this cell.\"\"\"\n",
13 | "import numpy as np"
14 | ]
15 | },
16 | {
17 | "cell_type": "markdown",
18 | "metadata": {},
19 | "source": [
20 | "# Python for DS Practice\n",
21 | "## Data Science for Kaggle Decal Spring 2017\n",
22 | "\n",
23 | "\n",
24 | "Welcome to the first pset. This should be a really simple intro to numpy. It is structured to get you familiar with the tools we will be using in this class. Make sure that you run the first cell before any others; otherwise you'll get errors because the numpy package won't be imported.\n",
25 | "\n",
26 | "If you have trouble with any question, first consult the old notebooks, especially those from class today.\n",
27 | "Then, ask on [Piazza](https://piazza.com/class/iy117z6nv30626?cid=9)! We'll try to get back to you there as soon as possible. "
28 | ]
29 | },
30 | {
31 | "cell_type": "markdown",
32 | "metadata": {},
33 | "source": [
34 | "# Problem 1\n",
35 | "Fill `fun_array` with the values 1-100 using a numpy method. Use Google to find the answer."
36 | ]
37 | },
38 | {
39 | "cell_type": "code",
40 | "execution_count": null,
41 | "metadata": {
42 | "collapsed": false
43 | },
44 | "outputs": [],
45 | "source": [
46 | "fun_array = ## YOUR CODE HERE\n",
47 | "fun_array"
48 | ]
49 | },
50 | {
51 | "cell_type": "markdown",
52 | "metadata": {},
53 | "source": [
54 | "# Problem 2 \n",
55 | "Take the square root of each element in `fun_array` using numpy commands. Don't use a `for` loop."
56 | ]
57 | },
58 | {
59 | "cell_type": "code",
60 | "execution_count": null,
61 | "metadata": {
62 | "collapsed": false
63 | },
64 | "outputs": [],
65 | "source": [
66 | "### YOUR CODE HERE\n",
67 | "fun_array_sqrt = ## YOUR CODE HERE\n",
68 | "fun_array_sqrt"
69 | ]
70 | },
71 | {
72 | "cell_type": "markdown",
73 | "metadata": {},
74 | "source": [
75 | "# Problem 3\n",
76 | "Fill this array with 100 random values. 
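(Any distribution is fine; numpy's random module is a good place to look.) 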
Your answer should use only a single method call like `np.<method>()`."
77 | ]
78 | },
79 | {
80 | "cell_type": "code",
81 | "execution_count": null,
82 | "metadata": {
83 | "collapsed": true
84 | },
85 | "outputs": [],
86 | "source": [
87 | "random_array = ## YOUR CODE HERE\n",
88 | "random_array"
89 | ]
90 | },
91 | {
92 | "cell_type": "markdown",
93 | "metadata": {},
94 | "source": [
95 | "# Problem 4\n",
96 | "Multiply each element in `random_array` by `5`. Do not use a `for` loop."
97 | ]
98 | },
99 | {
100 | "cell_type": "code",
101 | "execution_count": null,
102 | "metadata": {
103 | "collapsed": false
104 | },
105 | "outputs": [],
106 | "source": [
107 | "random_arrayx5 = ## YOUR CODE HERE\n",
108 | "random_arrayx5"
109 | ]
110 | },
111 | {
112 | "cell_type": "markdown",
113 | "metadata": {},
114 | "source": [
115 | "# Problem 5\n",
116 | "Multiply the matrix $X$ with the vector $\\vec{y}$ using numpy commands. It should take only one line."
117 | ]
118 | },
119 | {
120 | "cell_type": "code",
121 | "execution_count": null,
122 | "metadata": {
123 | "collapsed": false
124 | },
125 | "outputs": [],
126 | "source": [
127 | "X = np.array([[1,2,3],[4,5,6],[7,8,9]])\n",
128 | "y = np.array([1,2,3])\n",
129 | "## YOUR CODE HERE"
130 | ]
131 | },
132 | {
133 | "cell_type": "markdown",
134 | "metadata": {},
135 | "source": [
136 | "# Problem 6\n",
137 | "Get the shape of `X` and output it. Do it both with and without a `print()` statement for full credit."
138 | ]
139 | },
140 | {
141 | "cell_type": "code",
142 | "execution_count": null,
143 | "metadata": {
144 | "collapsed": true
145 | },
146 | "outputs": [],
147 | "source": []
148 | },
149 | {
150 | "cell_type": "markdown",
151 | "metadata": {},
152 | "source": [
153 | "# Problem 7\n",
154 | "Fill in the missing code so that you can plot the training and testing error curves defined below. \n",
155 | "\n",
156 | "This is a very simple 2-line problem. You should read [the `matplotlib` documentation](http://matplotlib.org/users/pyplot_tutorial.html) regardless of how familiar you are with matplotlib. \n",
157 | "\n",
158 | "Alternatively, google how to plot a graph using matplotlib. You'll probably find an answer on Stack Overflow.\n",
159 | "\n",
160 | "I almost always copy and paste matplotlib code whenever I need to use it."
161 | ]
162 | },
163 | {
164 | "cell_type": "code",
165 | "execution_count": null,
166 | "metadata": {
167 | "collapsed": false
168 | },
169 | "outputs": [],
170 | "source": [
171 | "%matplotlib inline\n",
172 | "import matplotlib.pyplot as plt\n",
173 | "X = np.linspace(0, 5, 500) #np.arange(500)\n",
174 | "\n",
175 | "# These are just to represent actual training and testing error curves \n",
176 | "training_accuracy = np.exp(-(X)) * np.exp(1.5) + 0.5\n",
177 | "testing_accuracy = (X - 2.4)**2 + 1\n",
178 | "\n",
179 | "### YOUR CODE BELOW ###\n",
180 | "### END YOUR CODE ###\n",
181 | "\n",
182 | "plt.title(\"Bias Variance\")\n",
183 | "plt.xlabel(\"Parameter Value \")\n",
184 | "plt.ylabel(\"Error\")\n",
185 | "plt.legend(labels=['Training Error','Testing Error'])\n",
186 | "axes = plt.gca()\n",
187 | "axes.set_ylim([0,5])"
188 | ]
189 | },
190 | {
191 | "cell_type": "markdown",
192 | "metadata": {},
193 | "source": [
194 | "# HOMEWORK SUBMISSION\n",
195 | "We will be submitting work through Gradescope. You must make your submission in PDF format so that we can easily process it using Gradescope question matching.\n",
196 | "## Saving the file as a PDF\n",
197 | "\n",
198 | "In the menu bar, you simply follow\n",
199 | "`File > Print Preview` \n",
200 | "Then print the webpage and save it as a PDF. This differs per browser/OS, so you may have to spend some time looking for the option. \n",
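"If you have a LaTeX installation, running `jupyter nbconvert --to pdf pset1.ipynb` in a terminal is another way to produce the PDF; we haven't tested this on every setup, so treat it as an optional shortcut rather than the official route. \n",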
201 | "\n",
202 | "## Submitting\n",
203 | "Go to [gradescope.com](https://gradescope.com/), and click on the box that says `Add a course`. Type `9KPPBM` into the course enrollment box and you should be good to go. If you are not enrolled in the course, we will remove you from Gradescope, so please only enroll if we have sent you a permission number. \n",
204 | "\n",
205 | "When you are in the course, navigate to `Problem Set 1`. On the submission page, select upload and upload the PDF you just downloaded. You will be required to label your questions appropriately.\n"
206 | ]
207 | },
208 | {
209 | "cell_type": "code",
210 | "execution_count": null,
211 | "metadata": {
212 | "collapsed": true
213 | },
214 | "outputs": [],
215 | "source": []
216 | }
217 | ],
218 | "metadata": {
219 | "kernelspec": {
220 | "display_name": "Python 3",
221 | "language": "python",
222 | "name": "python3"
223 | },
224 | "language_info": {
225 | "codemirror_mode": {
226 | "name": "ipython",
227 | "version": 3
228 | },
229 | "file_extension": ".py",
230 | "mimetype": "text/x-python",
231 | "name": "python",
232 | "nbconvert_exporter": "python",
233 | "pygments_lexer": "ipython3",
234 | "version": "3.5.0"
235 | }
236 | },
237 | "nbformat": 4,
238 | "nbformat_minor": 0
239 | }
--------------------------------------------------------------------------------
/day03/README.md:
--------------------------------------------------------------------------------
1 | # Day 3
2 | Data Visualization Lecture.
3 | 
4 | [Slides](https://docs.google.com/presentation/d/1WWNmpZdYLLtIjLW11Aswuj0_5Pf0CL2J6-rT2Fc0aGc/edit#slide=id.g1c95f0019c_0_1)
5 | 
6 | [Video](https://youtu.be/nq5RLYvOwFg)
7 | 
8 | ## Notebooks
9 | [data-vis1](https://github.com/kaggledecal/sp17/blob/master/day03/data_vis.ipynb)
10 | [data-vis2](https://github.com/kaggledecal/sp17/blob/master/day03/visualization.ipynb)
11 | 
--------------------------------------------------------------------------------
/day03/visualization.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Titanic Data"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "## Summary Statistics"
15 | ]
16 | },
17 | {
18 | "cell_type": "code",
19 | "execution_count": null,
20 | "metadata": {
21 | "collapsed": false
22 | },
23 | "outputs": [],
24 | "source": [
25 | "%matplotlib inline\n",
26 | "\n",
27 | "#Importing Modules\n",
28 | "from pandas import DataFrame, Series\n",
29 | "import numpy as np\n",
30 | "import pandas as pd\n",
31 | "import matplotlib.pyplot as plt"
32 | ]
33 | },
34 | {
35 | "cell_type": "code",
36 | "execution_count": null,
37 | "metadata": {
38 | "collapsed": false
39 | },
40 | "outputs": [],
41 | "source": [
42 | "#Read in Data\n",
43 | "df = pd.read_csv('../datasets/titanic/train.csv') \n"
44 | ]
45 | },
46 | {
47 | "cell_type": "code",
48 | "execution_count": null,
49 | "metadata": {
50 | "collapsed": false
51 | },
52 | "outputs": [],
53 | "source": [
54 | "#Data diagnostics\n",
55 | "\n",
56 | "# It is often a good idea to start with a question about what might affect the target variable you are trying to predict. 
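For the Titanic data, a natural first question is whether sex or passenger class changed a passenger's chances of survival; the groupby summaries a few cells down give a quick first answer.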
\n", 57 | "df.shape" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": null, 63 | "metadata": { 64 | "collapsed": false 65 | }, 66 | "outputs": [], 67 | "source": [ 68 | "df.head()" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": null, 74 | "metadata": { 75 | "collapsed": false 76 | }, 77 | "outputs": [], 78 | "source": [ 79 | "df.describe()" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": null, 85 | "metadata": { 86 | "collapsed": false 87 | }, 88 | "outputs": [], 89 | "source": [ 90 | "df.dtypes" 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": null, 96 | "metadata": { 97 | "collapsed": false 98 | }, 99 | "outputs": [], 100 | "source": [ 101 | "df.describe(include=['O'])" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": null, 107 | "metadata": { 108 | "collapsed": false 109 | }, 110 | "outputs": [], 111 | "source": [ 112 | "df.columns" 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": null, 118 | "metadata": { 119 | "collapsed": false 120 | }, 121 | "outputs": [], 122 | "source": [] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": null, 127 | "metadata": { 128 | "collapsed": false 129 | }, 130 | "outputs": [], 131 | "source": [ 132 | "df['Survived'].mean()" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": null, 138 | "metadata": { 139 | "collapsed": false 140 | }, 141 | "outputs": [], 142 | "source": [ 143 | "df.groupby('Sex')['Survived'].mean() #Groupby groups dataframe with selected variables so you can perform statistics on each group" 144 | ] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "execution_count": null, 149 | "metadata": { 150 | "collapsed": false 151 | }, 152 | "outputs": [], 153 | "source": [ 154 | "df.groupby('Sex')['Survived'].size() #Size counts the number in each group" 155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": null, 160 | "metadata": { 161 | "collapsed": false 162 | }, 163 | "outputs": [], 164 | "source": [ 165 | "df.corr()" 166 | ] 167 | }, 168 | { 169 | "cell_type": "code", 170 | "execution_count": null, 171 | "metadata": { 172 | "collapsed": false 173 | }, 174 | "outputs": [], 175 | "source": [ 176 | "df.groupby('Pclass')['Survived'].mean()" 177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": null, 182 | "metadata": { 183 | "collapsed": false 184 | }, 185 | "outputs": [], 186 | "source": [ 187 | "df['Fare_bins'] = pd.cut(df['Fare'],bins=[0,20,50,80,1000]) #Categorizing numerical data into bins for easy groupby\n", 188 | "df.groupby('Fare_bins')['Survived'].mean()" 189 | ] 190 | }, 191 | { 192 | "cell_type": "code", 193 | "execution_count": null, 194 | "metadata": { 195 | "collapsed": false 196 | }, 197 | "outputs": [], 198 | "source": [ 199 | "df.groupby(['Sex','Fare_bins'])['Survived'].mean()" 200 | ] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "execution_count": null, 205 | "metadata": { 206 | "collapsed": false 207 | }, 208 | "outputs": [], 209 | "source": [ 210 | "df.groupby(['Sex','Fare_bins'])['Survived'].agg([np.mean,np.size,np.std])" 211 | ] 212 | }, 213 | { 214 | "cell_type": "code", 215 | "execution_count": null, 216 | "metadata": { 217 | "collapsed": false 218 | }, 219 | "outputs": [], 220 | "source": [ 221 | "df.isnull().sum() #Counts missing values for each column" 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": null, 227 | "metadata": { 228 | "collapsed": true 
229 | },
230 | "outputs": [],
231 | "source": [
232 | "def countInfs(series):\n",
233 | "    # Counts effectively infinite values (magnitude above 1e20) in a numeric column\n",
234 | "    if (series.dtype == 'int64') | (series.dtype == 'float64'):\n",
235 | "        return sum((series > 1e20) | (series < -1e20))\n",
236 | "    else:\n",
237 | "        return 0"
238 | ]
239 | },
240 | {
241 | "cell_type": "code",
242 | "execution_count": null,
243 | "metadata": {
244 | "collapsed": false
245 | },
246 | "outputs": [],
247 | "source": [
248 | "df.select_dtypes(include=[np.number]).apply(countInfs,axis=0)"
249 | ]
250 | },
251 | {
252 | "cell_type": "code",
253 | "execution_count": null,
254 | "metadata": {
255 | "collapsed": false
256 | },
257 | "outputs": [],
258 | "source": [
259 | "df['SibSp'].value_counts() #Tabulates counts of each unique value"
260 | ]
261 | },
262 | {
263 | "cell_type": "markdown",
264 | "metadata": {},
265 | "source": [
266 | "## Data Visualizations"
267 | ]
268 | },
269 | {
270 | "cell_type": "code",
271 | "execution_count": null,
272 | "metadata": {
273 | "collapsed": false
274 | },
275 | "outputs": [],
276 | "source": [
277 | "#Pandas Histogram\n",
278 | "df['Age'].hist(bins=20)\n",
279 | "plt.title('Distribution of All Ages')\n",
280 | "plt.xlabel('Age')\n",
281 | "plt.ylabel('Counts')\n"
282 | ]
283 | },
284 | {
285 | "cell_type": "code",
286 | "execution_count": null,
287 | "metadata": {
288 | "collapsed": false
289 | },
290 | "outputs": [],
291 | "source": [
292 | "#Pandas Group By Histogram with Transparency\n",
293 | "df.groupby('Sex')['Age'].hist(bins=20,alpha=0.5)\n",
294 | "plt.legend(labels=['Female','Male'])\n",
295 | "plt.title('Distribution of Ages by Female and Male')"
296 | ]
297 | },
298 | {
299 | "cell_type": "code",
300 | "execution_count": null,
301 | "metadata": {
302 | "collapsed": false
303 | },
304 | "outputs": [],
305 | "source": [
306 | "#Pandas Group By Density\n",
307 | "df.groupby('Sex')['Age'].plot(kind='density')\n",
308 | "plt.legend(labels=['Female','Male'])\n",
309 | "plt.title('Distribution of Ages by Female and Male')"
310 | ]
311 | },
312 | {
313 | "cell_type": "code",
314 | "execution_count": null,
315 | "metadata": {
316 | "collapsed": false
317 | },
318 | "outputs": [],
319 | "source": [
320 | "#Pandas Scatterplot\n",
321 | "colors = ['blue','green','yellow']\n",
322 | "plt.scatter(df['Age'],df['Fare'],c=df['Pclass'].apply(lambda x: colors[x-1]),alpha=0.5) # color by class, computed on the full frame so x, y, and c stay aligned\n",
323 | "plt.xlabel('Age')\n",
324 | "plt.ylabel('Fare Price')\n",
325 | "plt.title('Scatterplot of Fare Price vs. Age Colored by Class')\n"
326 | ]
327 | },
328 | {
329 | "cell_type": "code",
330 | "execution_count": null,
331 | "metadata": {
332 | "collapsed": false
333 | },
334 | "outputs": [],
335 | "source": [
336 | "#Subplots\n",
337 | "fig, axes = plt.subplots(2,1)\n",
338 | "df.Embarked.value_counts().plot(ax=axes[0],kind='bar')\n",
339 | "df.groupby('Embarked')['Age'].mean().plot(ax=axes[1],kind='bar')\n",
340 | "axes[0].set_title(\"Number of Passengers per Location\")\n",
341 | "axes[1].set_title(\"Mean Age per Location\")\n",
342 | "axes[1].set_xlabel(\"Location\")\n",
343 | "axes[0].set_ylabel(\"Counts\")\n",
344 | "axes[1].set_ylabel(\"Mean Age\")"
345 | ]
346 | },
347 | {
348 | "cell_type": "markdown",
349 | "metadata": {},
350 | "source": [
351 | "## Prettier Data Visualizations - Seaborn"
352 | ]
353 | },
354 | {
355 | "cell_type": "code",
356 | "execution_count": null,
357 | "metadata": {
358 | "collapsed": false
359 | },
360 | "outputs": [],
361 | "source": [
362 | "!pip install seaborn"
363 | ]
364 | },
365 | {
366 | "cell_type": "code",
367 | "execution_count": null,
368 | "metadata": {
369 | "collapsed": false
370 | },
371 | "outputs": [],
372 | "source": [
373 | "import seaborn as sns\n",
374 | "#~/anaconda/bin/pip install seaborn if using anaconda, otherwise just use pip install\n",
375 | "\n",
376 | "sns.set_style(\"white\")\n",
377 | "\n",
378 | "fig, axes = plt.subplots(2,1)\n",
379 | "df.Embarked.value_counts().plot(ax=axes[0],kind='bar')\n",
380 | "df.groupby('Embarked')['Age'].mean().plot(ax=axes[1],kind='bar')\n",
381 | "axes[0].set_title(\"Number of Passengers per Location\")\n",
382 | "axes[1].set_title(\"Mean Age per Location\")\n",
383 | "axes[1].set_xlabel(\"Location\")\n",
384 | "axes[0].set_ylabel(\"Counts\")\n",
385 | "axes[1].set_ylabel(\"Mean Age\")"
386 | ]
387 | },
388 | {
389 | "cell_type": "code",
390 | "execution_count": null,
391 | "metadata": {
392 | "collapsed": false,
393 | "scrolled": true
394 | },
395 | "outputs": [],
396 | "source": [
397 | "sns.stripplot(x=\"Embarked\", y=\"Age\", hue='Sex', data=df, jitter=True);\n",
398 | "plt.title(\"Strip Plot with Seaborn\")\n"
399 | ]
400 | },
401 | {
402 | "cell_type": "markdown",
403 | "metadata": {
404 | "collapsed": true
405 | },
406 | "source": [
407 | "## Visualizing Missing Data"
408 | ]
409 | },
410 | {
411 | "cell_type": "code",
412 | "execution_count": null,
413 | "metadata": {
414 | "collapsed": false
415 | },
416 | "outputs": [],
417 | "source": [
418 | "!pip install missingno"
419 | ]
420 | },
421 | {
422 | "cell_type": "code",
423 | "execution_count": null,
424 | "metadata": {
425 | "collapsed": false,
426 | "scrolled": false
427 | },
428 | "outputs": [],
429 | "source": [
430 | "import missingno as msno\n",
431 | "msno.matrix(df)"
432 | ]
433 | }
434 | ],
435 | "metadata": {
436 | "kernelspec": {
437 | "display_name": "Python 3",
438 | "language": "python",
439 | "name": "python3"
440 | },
441 | "language_info": {
442 | "codemirror_mode": {
443 | "name": "ipython",
444 | "version": 3
445 | },
446 | "file_extension": ".py",
447 | "mimetype": "text/x-python",
448 | "name": "python",
449 | "nbconvert_exporter": "python",
450 | "pygments_lexer": "ipython3",
451 | "version": "3.5.0"
452 | }
453 | },
454 | "nbformat": 4,
455 | "nbformat_minor": 0
456 | }
--------------------------------------------------------------------------------
/day04/Classification + Regression.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | 
"cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## This assignment will help you become more familiar with the standard scikit learn API for machine learning with some simple examples. Don't worry if you don't know exactly what these classifiers are yet, we will dive more into how they actually work in future lectures." 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 2, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import matplotlib.pyplot as plt\n", 19 | "import numpy as np\n", 20 | "from sklearn import datasets, linear_model\n", 21 | "from sklearn.preprocessing import PolynomialFeatures\n", 22 | "# Magic function to make matplotlib inline and interactive\n", 23 | "%matplotlib notebook\n" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "### Regression Example" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 1, 36 | "metadata": { 37 | "collapsed": false 38 | }, 39 | "outputs": [ 40 | { 41 | "ename": "NameError", 42 | "evalue": "name 'datasets' is not defined", 43 | "output_type": "error", 44 | "traceback": [ 45 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 46 | "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", 47 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0;31m# Load the diabetes dataset\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0mdiabetes\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdatasets\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mload_diabetes\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", 48 | "\u001b[0;31mNameError\u001b[0m: name 'datasets' is not defined" 49 | ] 50 | } 51 | ], 52 | "source": [ 53 | "\n", 54 | "# Load the diabetes dataset\n", 55 | "diabetes = datasets.load_diabetes()\n", 56 | "\n", 57 | "\n", 58 | "# Use only one feature\n", 59 | "diabetes_X = diabetes.data[:, np.newaxis, 2]\n", 60 | "poly = PolynomialFeatures(degree=3)\n", 61 | "poly.fit_transform(diabetes_X)\n", 62 | "# Split the data into training/testing sets\n", 63 | "diabetes_X_train = diabetes_X[:-20]\n", 64 | "diabetes_X_test = diabetes_X[-20:]\n", 65 | "\n", 66 | "# Split the targets into training/testing sets\n", 67 | "diabetes_y_train = diabetes.target[:-20]\n", 68 | "diabetes_y_test = diabetes.target[-20:]\n", 69 | "\n", 70 | "\n", 71 | "\n", 72 | "regr = None\n", 73 | "# Create linear regression object\n", 74 | "#YOUR CODE HERE\n", 75 | "# Train the model using the training sets\n", 76 | "#YOUR CODE HERE\n", 77 | "\n", 78 | "# The coefficients\n", 79 | "print('Coefficients: \\n', regr.coef_)\n", 80 | "# The mean squared error\n", 81 | "print(\"Mean squared error: %.2f\"\n", 82 | " % np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2))\n", 83 | "# Explained variance score: 1 is perfect prediction\n", 84 | "print('Variance score: %.2f' % regr.score(diabetes_X_test, diabetes_y_test))\n", 85 | "\n", 86 | "# Plot outputs\n", 87 | "plt.scatter(diabetes_X_test, diabetes_y_test, color='black')\n", 88 | "plt.plot(diabetes_X_test, regr.predict(diabetes_X_test), color='blue',\n", 89 | " linewidth=3)\n", 90 | "\n", 91 | "plt.xticks(())\n", 92 | 
"plt.yticks(())\n", 93 | "\n", 94 | "#plt.show()" 95 | ] 96 | }, 97 | { 98 | "cell_type": "markdown", 99 | "metadata": {}, 100 | "source": [ 101 | "### 3 Class Classification Example" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": 1, 107 | "metadata": { 108 | "collapsed": false 109 | }, 110 | "outputs": [ 111 | { 112 | "ename": "NameError", 113 | "evalue": "name 'datasets' is not defined", 114 | "output_type": "error", 115 | "traceback": [ 116 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 117 | "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", 118 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# import some data to play with\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0miris\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdatasets\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mload_iris\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3\u001b[0m \u001b[0mX\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0miris\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdata\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m:\u001b[0m\u001b[0;36m2\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;31m# we only take the first two features.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0mY\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0miris\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtarget\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", 119 | "\u001b[0;31mNameError\u001b[0m: name 'datasets' is not defined" 120 | ] 121 | } 122 | ], 123 | "source": [ 124 | "# import some data to play with\n", 125 | "iris = datasets.load_iris()\n", 126 | "X = iris.data[:, :2] # we only take the first two features.\n", 127 | "Y = iris.target\n", 128 | "\n", 129 | "h = .02 # step size in the mesh\n", 130 | "\n", 131 | "logreg = None\n", 132 | "\n", 133 | "\n", 134 | "#YOUR CODE HERE\n", 135 | "# We fit the model to the data\n", 136 | "\n", 137 | "# YOUR CODE HERE\n", 138 | "\n", 139 | "# Plot the decision boundary. For that, we will assign a color to each\n", 140 | "# point in the mesh [x_min, x_max]x[y_min, y_max].\n", 141 | "x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5\n", 142 | "y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5\n", 143 | "xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))\n", 144 | "Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])\n", 145 | "\n", 146 | "# Put the result into a color plot\n", 147 | "Z = Z.reshape(xx.shape)\n", 148 | "plt.figure(1, figsize=(4, 3))\n", 149 | "plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)\n", 150 | "\n", 151 | "# Plot also the training points\n", 152 | "plt.scatter(X[:, 0], X[:, 1], c=Y, edgecolors='k', cmap=plt.cm.Paired)\n", 153 | "plt.xlabel('Sepal length')\n", 154 | "plt.ylabel('Sepal width')\n", 155 | "\n", 156 | "plt.xlim(xx.min(), xx.max())\n", 157 | "plt.ylim(yy.min(), yy.max())\n", 158 | "plt.xticks(())\n", 159 | "plt.yticks(())\n", 160 | "\n", 161 | "plt.show()\n" 162 | ] 163 | }, 164 | { 165 | "cell_type": "markdown", 166 | "metadata": {}, 167 | "source": [ 168 | "### In this example we experiment with a classifier known as an SVM which has a feature called a kernel. We'll dive more into that later. Right now fill in the code below to make an SVM with a linear, rbf, and polynomial kernel. 
Notice how the decision boundary changes." 169 | ] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "execution_count": 3, 174 | "metadata": { 175 | "collapsed": false 176 | }, 177 | "outputs": [ 178 | { 179 | "name": "stdout", 180 | "output_type": "stream", 181 | "text": [ 182 | "Automatically created module for IPython interactive environment\n" 183 | ] 184 | }, 185 | { 186 | "ename": "NameError", 187 | "evalue": "name 'svc' is not defined", 188 | "output_type": "error", 189 | "traceback": [ 190 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 191 | "\u001b[0;31mNameError\u001b[0m                                 Traceback (most recent call last)", 192 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m     34\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     35\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 36\u001b[0;31m \u001b[0;32mfor\u001b[0m \u001b[0mi\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mclf\u001b[0m \u001b[0;32min\u001b[0m \u001b[0menumerate\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msvc\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlin_svc\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mrbf_svc\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mpoly_svc\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m     37\u001b[0m     \u001b[0;31m# Plot the decision boundary. For that, we will assign a color to each\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     38\u001b[0m     \u001b[0;31m# point in the mesh [x_min, x_max]x[y_min, y_max].\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 193 | "\u001b[0;31mNameError\u001b[0m: name 'svc' is not defined" 194 | ] 195 | } 196 | ], 197 | "source": [ 198 | "print(__doc__)\n", 199 | "\n", 200 | "import numpy as np\n", 201 | "import matplotlib.pyplot as plt\n", 202 | "from sklearn import svm, datasets\n", 203 | "\n", 204 | "# import some data to play with\n", 205 | "iris = datasets.load_iris()\n", 206 | "X = iris.data[:, :2]  # we only take the first two features. We could\n", 207 | "                      # avoid this ugly slicing by using a two-dim dataset\n", 208 | "y = iris.target\n", 209 | "\n", 210 | "h = .02  # step size in the mesh\n", 211 | "\n", 212 | "# we create an instance of SVM and fit our data. We do not scale our\n", 213 | "# data since we want to plot the support vectors\n", 214 | "C = 1.0  # SVM regularization parameter\n", 215 | "svc = None       # YOUR CODE HERE: SVC with a linear kernel\n", 216 | "lin_svc, rbf_svc = None, None  # YOUR CODE HERE: LinearSVC and an RBF-kernel SVC\n", 217 | "poly_svc = None  # YOUR CODE HERE: SVC with a degree-3 polynomial kernel\n", 218 | "\n", 219 | "\n", 220 | "# create a mesh to plot in\n", 221 | "x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1\n", 222 | "y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1\n", 223 | "xx, yy = np.meshgrid(np.arange(x_min, x_max, h),\n", 224 | "                     np.arange(y_min, y_max, h))\n", 225 | "\n", 226 | "# title for the plots\n", 227 | "titles = ['SVC with linear kernel',\n", 228 | "          'LinearSVC (linear kernel)',\n", 229 | "          'SVC with RBF kernel',\n", 230 | "          'SVC with polynomial (degree 3) kernel']\n", 231 | "\n", 232 | "\n", 233 | "for i, clf in enumerate((svc, lin_svc, rbf_svc, poly_svc)):\n", 234 | "    # Plot the decision boundary. 
For that, we will assign a color to each\n", 235 | "    # point in the mesh [x_min, x_max]x[y_min, y_max].\n", 236 | "    plt.subplot(2, 2, i + 1)\n", 237 | "    plt.subplots_adjust(wspace=0.4, hspace=0.4)\n", 238 | "\n", 239 | "    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])\n", 240 | "\n", 241 | "    # Put the result into a color plot\n", 242 | "    Z = Z.reshape(xx.shape)\n", 243 | "    plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.8)\n", 244 | "\n", 245 | "    # Plot also the training points\n", 246 | "    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm)\n", 247 | "    plt.xlabel('Sepal length')\n", 248 | "    plt.ylabel('Sepal width')\n", 249 | "    plt.xlim(xx.min(), xx.max())\n", 250 | "    plt.ylim(yy.min(), yy.max())\n", 251 | "    plt.xticks(())\n", 252 | "    plt.yticks(())\n", 253 | "    plt.title(titles[i])\n", 254 | "\n", 255 | "plt.show()" 256 | ] 257 | }, 258 | { 259 | "cell_type": "code", 260 | "execution_count": null, 261 | "metadata": { 262 | "collapsed": true 263 | }, 264 | "outputs": [], 265 | "source": [] 266 | } 267 | ], 268 | "metadata": { 269 | "anaconda-cloud": {}, 270 | "kernelspec": { 271 | "display_name": "Python [Root]", 272 | "language": "python", 273 | "name": "Python [Root]" 274 | }, 275 | "language_info": { 276 | "codemirror_mode": { 277 | "name": "ipython", 278 | "version": 2 279 | }, 280 | "file_extension": ".py", 281 | "mimetype": "text/x-python", 282 | "name": "python", 283 | "nbconvert_exporter": "python", 284 | "pygments_lexer": "ipython2", 285 | "version": "2.7.12" 286 | } 287 | }, 288 | "nbformat": 4, 289 | "nbformat_minor": 0 290 | } 291 | -------------------------------------------------------------------------------- /day04/README.md: -------------------------------------------------------------------------------- 1 | # Day 4 2 | Classification vs. Regression 3 | 4 | [Slides](https://docs.google.com/a/berkeley.edu/presentation/d/16zpoPNRgFABtlnsCwcCSfMZDZSFysDjGzRR1TQ0QznA/edit?usp=sharing) 5 | 6 | [Video](https://youtu.be/nq5RLYvOwFg) 7 | 8 | ## Notebooks 9 | [Classification and Regression](https://github.com/kaggledecal/sp17/blob/master/day04/Classification + Regression.ipynb) 10 | 11 | 12 | -------------------------------------------------------------------------------- /day05/README.md: -------------------------------------------------------------------------------- 1 | # Day 5 2 | Data Cleaning and Regular Expressions 3 | [Slides](https://docs.google.com/presentation/d/14asYO4OKfnSrbe0Q7bfSx1XDjRGcYY4VKztSN7NiedI/edit?usp=sharing) -------------------------------------------------------------------------------- /day05/cleaning_regex.py: -------------------------------------------------------------------------------- 1 | 2 | from pandas import DataFrame, Series 3 | import numpy as np 4 | import pandas as pd 5 | import matplotlib.pyplot as plt 6 | 7 | 8 | #Filling in Missing Values 9 | df = pd.read_csv('~/src/kaggledecal/datasets/titanic/train.csv') 10 | df['Age'].fillna(df.Age.mean(),inplace=True) 11 | 12 | df = pd.read_csv('~/src/kaggledecal/datasets/titanic/train.csv') 13 | df['title'] = 'other' 14 | df.loc[['Master.' in n for n in df['Name']],'title'] = 'Master' 15 | df.loc[['Miss.' in n for n in df['Name']],'title'] = 'Miss' 16 | df.loc[['Mr.' in n for n in df['Name']],'title'] = 'Mr' 17 | df.loc[['Mrs.' in n for n in df['Name']],'title'] = 'Mrs' 18 | 19 | #-Brief aside -- 20 | #loc vs. iloc vs. ix 21 | #use loc if you plan on indexing by the index label of a dataframe, e.g. 
22 | df.head() 23 | #the leftmost column represents the index, so if you do df.loc[1] you will select the second row 24 | #but say the index was like so: 25 | df_copy = df.copy() 26 | df_copy.index = pd.Index(df.index.values-1) 27 | df_copy.head() 28 | df_copy.loc[1] 29 | #since the index label 1 corresponds to the third row, loc will select the third row 30 | 31 | #iloc will index on position (actual row number) starting with 0 for the first row regardless of index 32 | df.iloc[1] 33 | df_copy.iloc[1] 34 | 35 | #ix defaults to behave like loc but falls back to positional (iloc-style) indexing if a label is not in the index and the index is not integer-based 36 | #in most cases, you will use loc and iloc more often than ix. ix is useful if you want to index the rows by index label and columns 37 | #by position 38 | df.ix[:3,:2] 39 | df_copy.ix[:3,:2] 40 | #- 41 | 42 | df.boxplot(column='Age',by='title') #Mean Age is different per title 43 | plt.ylabel('Age') 44 | 45 | df['age_filled'] = df[['title','Age']].groupby('title').transform(lambda x: x.fillna(x.mean())) #Transform performs operation per group and returns values to their original index 46 | df[['title','Age','age_filled']].tail(20) 47 | df.groupby('title')['Age'].mean() 48 | 49 | df['Cabin'] 50 | #Cabin number can distinguish between port or starboard side of the Titanic. From research, it seems that 51 | #the "Women and children first" policy was implemented differently between the two sides. The starboard side 52 | #had women and children prioritized before letting men on board while the port side ONLY allowed women and 53 | #children on board. Odd cabin numbers were on the starboard side and even cabin numbers were on the port side. 54 | 55 | #Cabin letter designates the deck on the Titanic, ordered from highest to lowest - A to G. 56 | 57 | df['cabin_side'] = 'Unknown' 58 | df.loc[df['Cabin'].str[-1].isin(["1", "3", "5", "7", "9"]),'cabin_side'] = 'starboard' 59 | df.loc[df['Cabin'].str[-1].isin(["2", "4", "6", "8", "0"]),'cabin_side'] = 'port' 60 | df['cabin_side'].value_counts() 61 | 62 | df['deck'] = 'Unknown' 63 | df.loc[df['Cabin'].notnull(),'deck'] = df['Cabin'].str[0] 64 | df['deck'].value_counts() 65 | df[df['Cabin'].str[0]=='T'] #Why is there a T deck... 66 | df.loc[df['deck'] == 'T','deck'] = "Unknown" 67 | 68 | #Some cabins start with "F" followed by a space and then the actual deck letter 69 | df['Cabin'][df.Cabin.notnull()].values 70 | 71 | #Regular Expression Assignment!!!!!!!!!!!!!!!! 
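#A minimal self-contained warm-up sketch before the assignment (the cabin strings below
#are made up for illustration; they are not rows from the Titanic data). str.contains
#accepts a regular expression, and the pattern used in this assignment - a capital letter,
#a whitespace character, then another capital letter - is what picks out "F G63"-style cabins.
import re
toy_cabins = ["C85", "F G63", "B57 B59"]
print([bool(re.search("[A-Z]\s[A-Z]", c)) for c in toy_cabins])  #[False, True, False]
#Only "F G63" matches: in "B57 B59" the character before the space is a digit, not a capital letter.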
72 | 73 | 74 | pattern = "[A-Z]\s[A-Z]" #Any capital letter between A-Z followed by a whitespace followed by any letter between A-Z 75 | mask = df['Cabin'].str.contains(pattern,na=False) 76 | df.loc[mask,'Cabin'] 77 | df.loc[mask,'deck'] = df.loc[mask,'Cabin'].str[2] 78 | df.deck.value_counts() 79 | 80 | #If you also look closely, some people have multiple cabins assigned to them possibly indicating group tickets for the family 81 | #We can split these by whitespace and count them to make another variable called "number_in_group" 82 | df['Cabin'].str.split() 83 | df['num_in_group'] = df['Cabin'].str.split().apply(lambda x: len(x) if type(x)!=float else 1) 84 | df.loc[25:30] 85 | 86 | # Home Depot Data Set 87 | import re #For regular expressions 88 | 89 | hd = pd.read_csv("~/src/kaggle_decal/datasets/home_depot/train.csv", encoding="ISO-8859-1") 90 | 91 | strNum = {'zero':0,'one':1,'two':2,'three':3,'four':4,'five':5,'six':6,'seven':7,'eight':8,'nine':9} #Used to convert spelled out numbers to the actual digits 92 | 93 | 94 | #And here's the list of regex! Credits to the1owl on Kaggle for compiling all of this 95 | 96 | hd['cleaned_product_title'] = hd['product_title'].map(lambda s: re.sub(r"(\w)\.([A-Z])", r"\1 \2", s)) #Split words with a.A 97 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: s.lower()) 98 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: s.replace(" "," ")) 99 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: s.replace(",","")) #could be number / segment later 100 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: s.replace("$"," ")) 101 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: s.replace("?"," ")) 102 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: s.replace("-"," ")) 103 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: s.replace("//","/")) 104 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: s.replace("..",".")) 105 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: s.replace(" \\ "," ")) 106 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: s.replace("."," . ")) 107 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: re.sub(r"(^\.|/)", r"", s)) 108 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: re.sub(r"(\.|/)$", r"", s)) 109 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: re.sub(r"([0-9])([a-z])", r"\1 \2", s)) 110 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: re.sub(r"([a-z])([0-9])", r"\1 \2", s)) 111 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: s.replace(" x "," xbi ")) 112 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: re.sub(r"([a-z])( *)\.( *)([a-z])", r"\1 \4", s)) 113 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: re.sub(r"([a-z])( *)/( *)([a-z])", r"\1 \4", s)) 114 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: s.replace("*"," xbi ")) 115 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: s.replace(" by "," xbi ")) 116 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: re.sub(r"([0-9])( *)\.( *)([0-9])", r"\1.\4", s)) 117 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: re.sub(r"([0-9]+)( *)(inches|inch|in|')\.?", r"\1in. 
", s)) 118 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: re.sub(r"([0-9]+)( *)(foot|feet|ft|'')\.?", r"\1ft. ", s)) 119 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: re.sub(r"([0-9]+)( *)(pounds|pound|lbs|lb)\.?", r"\1lb. ", s)) 120 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: re.sub(r"([0-9]+)( *)(square|sq) ?\.?(feet|foot|ft)\.?", r"\1sq.ft. ", s)) 121 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: re.sub(r"([0-9]+)( *)(cubic|cu) ?\.?(feet|foot|ft)\.?", r"\1cu.ft. ", s)) 122 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: re.sub(r"([0-9]+)( *)(gallons|gallon|gal)\.?", r"\1gal. ", s)) 123 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: re.sub(r"([0-9]+)( *)(ounces|ounce|oz)\.?", r"\1oz. ", s)) 124 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: re.sub(r"([0-9]+)( *)(centimeters|cm)\.?", r"\1cm. ", s)) 125 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: re.sub(r"([0-9]+)( *)(milimeters|mm)\.?", r"\1mm. ", s)) 126 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: s.replace("°"," degrees ")) 127 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: re.sub(r"([0-9]+)( *)(degrees|degree)\.?", r"\1deg. ", s)) 128 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: s.replace(" v "," volts ")) 129 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: re.sub(r"([0-9]+)( *)(volts|volt)\.?", r"\1volt. ", s)) 130 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: re.sub(r"([0-9]+)( *)(watts|watt)\.?", r"\1watt. ", s)) 131 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: re.sub(r"([0-9]+)( *)(amperes|ampere|amps|amp)\.?", r"\1amp. ", s)) 132 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: s.replace(" "," ")) 133 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: s.replace(" . 
"," ")) 134 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: (" ").join([str(strNum[z]) if z in strNum else z for z in s.split(" ")])) 135 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: s.lower()) 136 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: s.replace("toliet","toilet")) 137 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: s.replace("airconditioner","air conditioner")) 138 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: s.replace("vinal","vinyl")) 139 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: s.replace("vynal","vinyl")) 140 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: s.replace("skill","skil")) 141 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: s.replace("snowbl","snow bl")) 142 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: s.replace("plexigla","plexi gla")) 143 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: s.replace("rustoleum","rust-oleum")) 144 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: s.replace("whirpool","whirlpool")) 145 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: s.replace("whirlpoolga", "whirlpool ga")) 146 | hd['cleaned_product_title'] = hd['cleaned_product_title'].map(lambda s: s.replace("whirlpoolstainless","whirlpool stainless")) 147 | 148 | 149 | #In order to better understand what each line does, let's break this down a bit: 150 | s = re.sub(r"(\w)\.([A-Z])", r"\1 \2", s) #Split words with a.A 151 | 152 | #The prefix "r" represents that whatever comes next should be interpretted as a "raw string" 153 | #Since python tries to be smart and convert, say, "\b" to mean "backspace" for you, this won't be converted correctly in regex. 154 | #Prefixing your regex patterns with "r" will tell python to hand over the "raw string" pattern for normal regex to work. 155 | 156 | #The "\1" code means "take the first regex group in parantheses and put it here". In this example, we put whatever word character 157 | #"(\w)" captures into the \1 spot and whatever capital letter "([A-Z])" catures into the \2 spot. 
158 | 159 | pattern = r"(\w)\.([A-Z])" 160 | matches = hd.product_title.str.contains(pattern) 161 | hd.product_title.iloc[np.where(matches)] 162 | 163 | s = hd.product_title.iloc[73970] 164 | s 165 | s = re.sub(r"(\w)\.([A-Z])", r"\1 \2", s) 166 | s 167 | 168 | #FB Data for Date Columns 169 | fb = pd.read_csv("~/src/kaggle_decal/datasets/fb/train.csv") 170 | 171 | fb.head() 172 | 173 | initial_date = np.datetime64('2014-01-01T01:01', dtype='datetime64[m]') #Arbitrary start date 174 | d_times = pd.DatetimeIndex(initial_date + np.timedelta64(int(mn), 'm') for mn in fb.time.values) 175 | d_times[:5] 176 | 177 | fb['hour'] = d_times.hour 178 | fb['weekday'] = d_times.weekday 179 | fb['day'] = d_times.day 180 | fb['month'] = d_times.month 181 | fb['year'] = d_times.year 182 | 183 | 184 | 185 | 186 | 187 | 188 | 189 | 190 | 191 | -------------------------------------------------------------------------------- /day06/Day06slides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kaggledecal/sp17/52649d867d00787d92ef4a239605c23f04933b57/day06/Day06slides.pdf -------------------------------------------------------------------------------- /day06/README.md: -------------------------------------------------------------------------------- 1 | # Day 6 2 | Linear Regression 3 | [Slides](https://docs.google.com/a/berkeley.edu/presentation/d/10c1cp-ZmqvT9g-aAt6MzgfEItXILHsLYwYHpnVal7IQ/edit?usp=sharing) -------------------------------------------------------------------------------- /day06/linear_reg.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Data Science with Kaggle Decal \n", 8 | "## Spring 2017\n", 9 | "## Day 6: Linear Regression" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": null, 15 | "metadata": { 16 | "collapsed": true 17 | }, 18 | "outputs": [], 19 | "source": [ 20 | "import pandas as pd \n", 21 | "import numpy as np\n", 22 | "import statsmodels.api as sm\n", 23 | "import matplotlib.pyplot as plt\n", 24 | "import matplotlib.mlab as mlab\n", 25 | "from sklearn.feature_selection import RFECV\n", 26 | "from sklearn.linear_model import LinearRegression" 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": {}, 32 | "source": [ 33 | "## NBA Salary Data" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": null, 39 | "metadata": { 40 | "collapsed": false 41 | }, 42 | "outputs": [], 43 | "source": [ 44 | "nba_sals = pd.read_csv(\"./nbasalary.csv\", index_col = 0)\n", 45 | "nba_sals = nba_sals.dropna(axis=0)\n", 46 | "nba_sals.head()" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": null, 52 | "metadata": { 53 | "collapsed": false 54 | }, 55 | "outputs": [], 56 | "source": [ 57 | "log_wage = nba_sals[\"lwage\"]\n", 58 | "wage = nba_sals[\"wage\"]\n", 59 | "points = nba_sals[\"points\"]\n", 60 | "exper = nba_sals[\"exper\"]" 61 | ] 62 | }, 63 | { 64 | "cell_type": "markdown", 65 | "metadata": {}, 66 | "source": [ 67 | "## Simple Linear Regression" 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": {}, 73 | "source": [ 74 | "##### In this section, we will compare two SLR models and see which one performs better on a validation set. 
\n", 75 | "##### Model 1: Regressing wage on points scored\n", 76 | "##### Model 2: Regressing wage on years of experience" 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": null, 82 | "metadata": { 83 | "collapsed": false 84 | }, 85 | "outputs": [], 86 | "source": [ 87 | "plt.figure(1)\n", 88 | "plt.scatter(points,wage)\n", 89 | "plt.title(\"Wage vs. Points\")\n", 90 | "plt.xlabel(\"Points\")\n", 91 | "plt.ylabel(\"Wage\")\n", 92 | "plt.show()\n", 93 | "plt.close()\n", 94 | "\n", 95 | "plt.figure(2)\n", 96 | "plt.scatter(exper,wage)\n", 97 | "plt.title(\"Wage vs. Experience\")\n", 98 | "plt.xlabel(\"Experience\")\n", 99 | "plt.ylabel(\"Wage\")\n", 100 | "\n", 101 | "plt.show()" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": null, 107 | "metadata": { 108 | "collapsed": false 109 | }, 110 | "outputs": [], 111 | "source": [ 112 | "plt.figure(1)\n", 113 | "myOLS_points = sm.OLS(wage,points).fit()\n", 114 | "plt.plot(points, myOLS_points.predict(points))\n", 115 | "plt.scatter(points, wage)\n", 116 | "plt.title(\"Wage vs. Points\")\n", 117 | "plt.xlabel(\"Points\")\n", 118 | "plt.ylabel(\"Wage\")\n", 119 | "plt.show()\n", 120 | "plt.close()\n", 121 | "\n", 122 | "plt.figure(2)\n", 123 | "myOLS_exper = sm.OLS(wage,exper).fit()\n", 124 | "plt.plot(exper, myOLS_exper.predict(exper))\n", 125 | "plt.scatter(exper, wage)\n", 126 | "plt.title(\"Wage vs. Expereince\")\n", 127 | "plt.xlabel(\"Experience\")\n", 128 | "plt.ylabel(\"Wage\")\n", 129 | "plt.show()" 130 | ] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "metadata": {}, 135 | "source": [ 136 | "## A little validation..." 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": null, 142 | "metadata": { 143 | "collapsed": false 144 | }, 145 | "outputs": [], 146 | "source": [ 147 | "wage_train = nba_sals[\"wage\"][0:214]\n", 148 | "wage_valid = nba_sals[\"wage\"][214:]\n", 149 | "points_train = nba_sals[\"points\"][0:214]\n", 150 | "points_valid = nba_sals[\"points\"][214:]\n", 151 | "exper_train = nba_sals[\"exper\"][0:214]\n", 152 | "exper_valid = nba_sals[\"exper\"][214:]" 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "metadata": {}, 158 | "source": [ 159 | "#### Regression wage on points..." 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": null, 165 | "metadata": { 166 | "collapsed": false 167 | }, 168 | "outputs": [], 169 | "source": [ 170 | "myOLS = sm.OLS(wage_train,points_train).fit()\n", 171 | "wage_hat = myOLS.predict(points_valid)\n", 172 | "mse = 1/len(wage_valid)*np.dot((wage_valid - wage_hat),(wage_valid - wage_hat))\n", 173 | "print(\"The MSE for the model wage~points is:\", mse)" 174 | ] 175 | }, 176 | { 177 | "cell_type": "markdown", 178 | "metadata": {}, 179 | "source": [ 180 | "#### Regressing wage on experience..." 
181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": null, 186 | "metadata": { 187 | "collapsed": false 188 | }, 189 | "outputs": [], 190 | "source": [ 191 | "myOLS = sm.OLS(wage_train,exper_train).fit()\n", 192 | "wage_hat = myOLS.predict(exper_valid)\n", 193 | "mse = 1/len(wage_valid)*np.dot((wage_valid - wage_hat),(wage_valid - wage_hat))\n", 194 | "print(\"The MSE for the model wage~experience is:\", mse)" 195 | ] 196 | }, 197 | { 198 | "cell_type": "markdown", 199 | "metadata": {}, 200 | "source": [ 201 | "### Conclusion: points scored is a better predictor of wage than years of experience" 202 | ] 203 | }, 204 | { 205 | "cell_type": "markdown", 206 | "metadata": {}, 207 | "source": [ 208 | "## Multiple Linear Regression" 209 | ] 210 | }, 211 | { 212 | "cell_type": "markdown", 213 | "metadata": {}, 214 | "source": [ 215 | "### Wage vs. Experience & Points" 216 | ] 217 | }, 218 | { 219 | "cell_type": "code", 220 | "execution_count": null, 221 | "metadata": { 222 | "collapsed": false, 223 | "scrolled": true 224 | }, 225 | "outputs": [], 226 | "source": [ 227 | "exp_dat = nba_sals[[\"exper\",\"points\"]]\n", 228 | "exp_dat = sm.add_constant(exp_dat)\n", 229 | "myMLR = sm.OLS(wage, exp_dat).fit()\n", 230 | "myMLR.summary()" 231 | ] 232 | }, 233 | { 234 | "cell_type": "code", 235 | "execution_count": null, 236 | "metadata": { 237 | "collapsed": false 238 | }, 239 | "outputs": [], 240 | "source": [ 241 | "from mpl_toolkits.mplot3d import Axes3D\n", 242 | "\n", 243 | "x1, x2 = np.meshgrid(np.linspace(exp_dat.exper.min(), exp_dat.exper.max(), 100), \n", 244 | "                     np.linspace(exp_dat.points.min(), exp_dat.points.max(), 100))\n", 245 | "\n", 246 | "x3 = myMLR.params[0] + myMLR.params[1] * x1 + myMLR.params[2] * x2\n", 247 | "\n", 248 | "# create matplotlib 3d axes\n", 249 | "fig = plt.figure(figsize=(12, 8))\n", 250 | "my3D = Axes3D(fig, azim=-120, elev=20)\n", 251 | "\n", 252 | "# plot hyperplane\n", 253 | "surf = my3D.plot_surface(x1, x2, x3, cmap=plt.cm.RdBu_r, alpha=0.5, linewidth=0.5)\n", 254 | "\n", 255 | "# plot data points\n", 256 | "resid = wage - myMLR.predict(exp_dat)\n", 257 | "my3D.scatter(exp_dat[resid >= 0].exper, exp_dat[resid >= 0].points, wage[resid >= 0], color='black', alpha=1.0, facecolor='white')\n", 258 | "my3D.scatter(exp_dat[resid < 0].exper, exp_dat[resid < 0].points, wage[resid < 0], color='black', alpha=1.0)\n", 259 | "\n", 260 | "# set axis labels\n", 261 | "my3D.set_xlabel('experience')\n", 262 | "my3D.set_ylabel('points')\n", 263 | "my3D.set_zlabel('wage')\n", 264 | "my3D.set_title('Regression Plane in 3D')\n", 265 | "\n", 266 | "plt.show()" 267 | ] 268 | }, 269 | { 270 | "cell_type": "markdown", 271 | "metadata": {}, 272 | "source": [ 273 | "## Residual Plots" 274 | ] 275 | }, 276 | { 277 | "cell_type": "markdown", 278 | "metadata": {}, 279 | "source": [ 280 | "#### Residuals vs. fitted values" 281 | ] 282 | }, 283 | { 284 | "cell_type": "code", 285 | "execution_count": null, 286 | "metadata": { 287 | "collapsed": false 288 | }, 289 | "outputs": [], 290 | "source": [ 291 | "wage_hat = myMLR.predict(exp_dat)\n", 292 | "resids = wage - wage_hat\n", 293 | "\n", 294 | "# Line of best fit should be Y = 0\n", 295 | "residOLS = sm.OLS(resids,wage_hat).fit()\n", 296 | "\n", 297 | "# Residuals vs. fitted values\n", 298 | "plt.plot(wage_hat, residOLS.predict(wage_hat))\n", 299 | "plt.scatter(wage_hat, resids)\n", 300 | "plt.ylabel(\"Residuals\")\n", 301 | "plt.xlabel(\"Fitted Values\")\n", 302 | "plt.title(\"Residuals vs. 
Fitted Values\")\n", 303 | "plt.plot(wage_hat, np.repeat(2,len(wage_hat)),color = \"r\")\n", 304 | "plt.plot(wage_hat, np.repeat(-2,len(wage_hat)),color = \"r\")\n", 305 | "plt.show()" 306 | ] 307 | }, 308 | { 309 | "cell_type": "markdown", 310 | "metadata": {}, 311 | "source": [ 312 | "#### Histogram of Residuals" 313 | ] 314 | }, 315 | { 316 | "cell_type": "code", 317 | "execution_count": null, 318 | "metadata": { 319 | "collapsed": false, 320 | "scrolled": true 321 | }, 322 | "outputs": [], 323 | "source": [ 324 | "n, bins, patches = plt.hist(resids, bins = 20, normed= True, facecolor='green', alpha=0.5)\n", 325 | "\n", 326 | "## pdf of a normal(0, std(resids))\n", 327 | "y = mlab.normpdf( bins, np.mean(resids), np.std(resids))\n", 328 | "l = plt.plot(bins, y, 'r', linewidth=2)\n", 329 | "plt.title(\"Histogram of Residuals\")\n", 330 | "plt.show()" 331 | ] 332 | }, 333 | { 334 | "cell_type": "markdown", 335 | "metadata": {}, 336 | "source": [ 337 | "# Recursive Feature Elimination" 338 | ] 339 | }, 340 | { 341 | "cell_type": "markdown", 342 | "metadata": {}, 343 | "source": [ 344 | "#### From the API: \"Feature ranking with recursive feature elimination and cross-validated selection of the best number of features.\"\n", 345 | "#### \"First, the estimator is trained on the initial set of features and weights are assigned to each one of them. Then, features whose absolute weights are the smallest are pruned from the current set features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.\"\n", 346 | "#### RFECV performs RFE with a cross-validation loop" 347 | ] 348 | }, 349 | { 350 | "cell_type": "code", 351 | "execution_count": null, 352 | "metadata": { 353 | "collapsed": false 354 | }, 355 | "outputs": [], 356 | "source": [ 357 | "# Let's see what we have to work with...\n", 358 | "nba_sals.columns" 359 | ] 360 | }, 361 | { 362 | "cell_type": "code", 363 | "execution_count": null, 364 | "metadata": { 365 | "collapsed": false 366 | }, 367 | "outputs": [], 368 | "source": [ 369 | "# Let's pick a subset\n", 370 | "nba_dat = nba_sals[[\"exper\", \"games\",\"minutes\",\"forward\",\"center\",\"points\",\"guard\",\"age\"]]\n", 371 | "nba_dat = sm.add_constant(nba_dat)" 372 | ] 373 | }, 374 | { 375 | "cell_type": "code", 376 | "execution_count": null, 377 | "metadata": { 378 | "collapsed": false 379 | }, 380 | "outputs": [], 381 | "source": [ 382 | "myReg = LinearRegression()\n", 383 | "myRFE = RFECV(myReg, step = 1, cv = 5)\n", 384 | "myRFE = myRFE.fit(nba_dat[[1,2,3,4,5,6,7,8]], wage)\n", 385 | "\n", 386 | "print(myRFE.support_)\n", 387 | "print(myRFE.ranking_)" 388 | ] 389 | }, 390 | { 391 | "cell_type": "markdown", 392 | "metadata": {}, 393 | "source": [ 394 | "### Best model" 395 | ] 396 | }, 397 | { 398 | "cell_type": "code", 399 | "execution_count": null, 400 | "metadata": { 401 | "collapsed": false 402 | }, 403 | "outputs": [], 404 | "source": [ 405 | "myMLR2 = sm.OLS(wage, nba_dat[[0,1,5,6,7]]).fit()\n", 406 | "myMLR2.summary()" 407 | ] 408 | }, 409 | { 410 | "cell_type": "markdown", 411 | "metadata": {}, 412 | "source": [ 413 | "### Let's get the MSE of our best model..." 
414 | ] 415 | }, 416 | { 417 | "cell_type": "code", 418 | "execution_count": null, 419 | "metadata": { 420 | "collapsed": false 421 | }, 422 | "outputs": [], 423 | "source": [ 424 | "# Split into training and validation sets\n", 425 | "nba_sals_train = nba_dat[0:214]\n", 426 | "nba_sals_valid = nba_dat[214:]\n", 427 | "\n", 428 | "wage_train = nba_sals[\"wage\"][0:214]\n", 429 | "wage_valid = nba_sals[\"wage\"][214:]\n", 430 | "\n", 431 | "nba_dat_train = nba_sals_train.iloc[:, [0,1,5,6,7]]\n", 432 | "nba_dat_train = sm.add_constant(nba_dat_train)\n", 433 | "\n", 434 | "nba_dat_valid = nba_sals_valid.iloc[:, [0,1,5,6,7]]\n", 435 | "nba_dat_valid = sm.add_constant(nba_dat_valid)" 436 | ] 437 | }, 438 | { 439 | "cell_type": "code", 440 | "execution_count": null, 441 | "metadata": { 442 | "collapsed": false 443 | }, 444 | "outputs": [], 445 | "source": [ 446 | "wage_hat = myMLR2.predict(nba_dat_valid)\n", 447 | "mse = 1/len(wage_valid)*np.dot((wage_valid - wage_hat),(wage_valid - wage_hat))\n", 448 | "print(\"The MSE of our best model is:\", mse)" 449 | ] 450 | } 451 | ], 452 | "metadata": { 453 | "anaconda-cloud": {}, 454 | "kernelspec": { 455 | "display_name": "Python [default]", 456 | "language": "python", 457 | "name": "python3" 458 | }, 459 | "language_info": { 460 | "codemirror_mode": { 461 | "name": "ipython", 462 | "version": 3 463 | }, 464 | "file_extension": ".py", 465 | "mimetype": "text/x-python", 466 | "name": "python", 467 | "nbconvert_exporter": "python", 468 | "pygments_lexer": "ipython3", 469 | "version": "3.5.2" 470 | } 471 | }, 472 | "nbformat": 4, 473 | "nbformat_minor": 1 474 | } 475 | -------------------------------------------------------------------------------- /day07/Logistic Regression.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "print(__doc__)\n", 12 | "\n", 13 | "\n", 14 | "import numpy as np\n", 15 | "import matplotlib.pyplot as plt\n", 16 | "from sklearn.datasets import make_blobs\n", 17 | "from sklearn.linear_model import LogisticRegression\n" 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": null, 23 | "metadata": { 24 | "collapsed": true 25 | }, 26 | "outputs": [], 27 | "source": [ 28 | "\n", 29 | "# make 3-class dataset for classification\n", 30 | "centers = [[-5, 0], [0, 1.5], [5, -1]]\n", 31 | "X, y = make_blobs(n_samples=1000, centers=centers, random_state=40)\n", 32 | "transformation = [[0.4, 0.2], [-0.4, 1.2]]\n", 33 | "X = np.dot(X, transformation)\n", 34 | "\n", 35 | "for multi_class in ('multinomial', 'ovr'):\n", 36 | "    \"\"\"\n", 37 | "    YOUR CODE HERE\n", 38 | "    \"\"\"\n", 39 | "    # print the training scores\n", 40 | "    print(\"training score : %.3f (%s)\" % (clf.score(X, y), multi_class))\n", 41 | "\n", 42 | "    # create a mesh to plot in\n", 43 | "    h = .02  # step size in the mesh\n", 44 | "    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1\n", 45 | "    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1\n", 46 | "    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),\n", 47 | "                         np.arange(y_min, y_max, h))\n", 48 | "\n", 49 | "    # Plot the decision boundary. 
For that, we will assign a color to each\n", 50 | " # point in the mesh [x_min, x_max]x[y_min, y_max].\n", 51 | " Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])\n", 52 | " # Put the result into a color plot\n", 53 | " Z = Z.reshape(xx.shape)\n", 54 | " plt.figure()\n", 55 | " plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)\n", 56 | " plt.title(\"Decision surface of LogisticRegression (%s)\" % multi_class)\n", 57 | " plt.axis('tight')\n", 58 | "\n", 59 | " # Plot also the training points\n", 60 | " colors = \"bry\"\n", 61 | " for i, color in zip(clf.classes_, colors):\n", 62 | " idx = np.where(y == i)\n", 63 | " plt.scatter(X[idx, 0], X[idx, 1], c=color, cmap=plt.cm.Paired)\n", 64 | "\n", 65 | " # Plot the three one-against-all classifiers\n", 66 | " xmin, xmax = plt.xlim()\n", 67 | " ymin, ymax = plt.ylim()\n", 68 | " coef = clf.coef_\n", 69 | " intercept = clf.intercept_\n", 70 | "\n", 71 | " def plot_hyperplane(c, color):\n", 72 | " def line(x0):\n", 73 | " return (-(x0 * coef[c, 0]) - intercept[c]) / coef[c, 1]\n", 74 | " plt.plot([xmin, xmax], [line(xmin), line(xmax)],\n", 75 | " ls=\"--\", color=color)\n", 76 | "\n", 77 | " for i, color in zip(clf.classes_, colors):\n", 78 | " plot_hyperplane(i, color)\n", 79 | "\n", 80 | "plt.show()\n" 81 | ] 82 | } 83 | ], 84 | "metadata": { 85 | "anaconda-cloud": {}, 86 | "kernelspec": { 87 | "display_name": "Python [Root]", 88 | "language": "python", 89 | "name": "Python [Root]" 90 | }, 91 | "language_info": { 92 | "codemirror_mode": { 93 | "name": "ipython", 94 | "version": 2 95 | }, 96 | "file_extension": ".py", 97 | "mimetype": "text/x-python", 98 | "name": "python", 99 | "nbconvert_exporter": "python", 100 | "pygments_lexer": "ipython2", 101 | "version": "2.7.12" 102 | } 103 | }, 104 | "nbformat": 4, 105 | "nbformat_minor": 0 106 | } 107 | -------------------------------------------------------------------------------- /day07/README.md: -------------------------------------------------------------------------------- 1 | # Day 7 2 | Logistic Regression 3 | [Slides](https://docs.google.com/presentation/d/1Me_75Fj2j9YZM006Hn75m9HAYMNPUJfwtpP0RLTXNIc/edit#slide=id.g1d04d3fdfd_0_11) -------------------------------------------------------------------------------- /day08/README.md: -------------------------------------------------------------------------------- 1 | # Day 8 2 | Regularization and Logistic Regression 3 | [Notebook](https://github.com/kaggledecal/sp17/blob/master/day08/MNIST.ipynb) 4 | [Slides](https://drive.google.com/open?id=1EQ_MXVTPpGRnQ6FUqOk6O49pbJldpoipaJ25P2uM9dk) 5 | [Video]() 6 | -------------------------------------------------------------------------------- /day09/Logistic Regression and Cross Validation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# The Logistic Regression Model" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Below you will find code that demonstrates how to run and interpret a logistic regression model. As before, please refer to the slides to get a full understanding of the motivations and derivations behind logistic regression and importantly its relation with the linear model." 
15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": null, 20 | "metadata": { 21 | "collapsed": true 22 | }, 23 | "outputs": [], 24 | "source": [ 25 | "from pandas import DataFrame, Series\n", 26 | "import numpy as np\n", 27 | "import pandas as pd\n", 28 | "import matplotlib.pyplot as plt\n", 29 | "import statsmodels.api as sm\n", 30 | "from sklearn.cross_validation import train_test_split\n", 31 | "\n", 32 | "%matplotlib inline" 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": null, 38 | "metadata": { 39 | "collapsed": true 40 | }, 41 | "outputs": [], 42 | "source": [ 43 | "#Read in Titanic Data\n", 44 | "titanic = pd.read_csv(\"../../datasets/titanic/train.csv\")" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": {}, 50 | "source": [ 51 | "## Dealing with Categorical Data (One-Hot-Encoding)" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "Categorical data, or data with strings that denote something other than a numeric quantity, are extremely common in datasets. The catch is that, at least in Python, the vast majority of models do not know how to deal with categorical data - they prefer numeric data types only. At least in linear and logistic regression this makes intuitive sense: it doesn't make sense to invert a matrix of strings. What we do instead is something called \"One-Hot-Encoding\"." 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": null, 64 | "metadata": { 65 | "collapsed": false 66 | }, 67 | "outputs": [], 68 | "source": [ 69 | "titanic_only = pd.get_dummies(titanic,columns=['Sex','Pclass','Embarked'],drop_first=True)\n", 70 | "titanic_only.head()" 71 | ] 72 | }, 73 | { 74 | "cell_type": "markdown", 75 | "metadata": {}, 76 | "source": [ 77 | "If you look closely, there is now more than one column representing each categorical variable! Sex is split into a male-only column (1 if the corresponding Sex element was male) and a female-only column, which is NOT shown because we dropped it via drop_first=True. drop_first drops one of the newly generated columns for each feature, and this again has to do with multicollinearity. If I know that someone is male, then I know for sure they are not female. As a result, just keeping the male column gives our model all the information it needs, and we won't need to worry about multicollinearity issues!\n", 78 | "\n", 79 | "This process of converting a categorical column into multiple columns containing 0's and 1's is called one-hot-encoding, and it is by far the most common way of feeding categorical data into a model. Another way of describing this process is getting \"dummy variables\" (hence pd.get_dummies), which just refers to the new 0/1 variables. 
" 80 | ] 81 | }, 82 | { 83 | "cell_type": "markdown", 84 | "metadata": {}, 85 | "source": [ 86 | "## Validation Method" 87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "execution_count": null, 92 | "metadata": { 93 | "collapsed": true 94 | }, 95 | "outputs": [], 96 | "source": [ 97 | "#Drop columns we don't care about (yet) or have missing values (Models don't like missing values)\n", 98 | "titanic_only.drop(['PassengerId','Name','Ticket','Age','Cabin'],axis=1,inplace=True)" 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": null, 104 | "metadata": { 105 | "collapsed": true 106 | }, 107 | "outputs": [], 108 | "source": [ 109 | "#Train Test Splitting\n", 110 | "local_train, local_test = train_test_split(titanic_only,test_size=0.2,random_state=123)" 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": null, 116 | "metadata": { 117 | "collapsed": false 118 | }, 119 | "outputs": [], 120 | "source": [ 121 | "local_train.shape" 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": null, 127 | "metadata": { 128 | "collapsed": false 129 | }, 130 | "outputs": [], 131 | "source": [ 132 | "local_test.shape" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": null, 138 | "metadata": { 139 | "collapsed": true 140 | }, 141 | "outputs": [], 142 | "source": [ 143 | "local_train_y = local_train[\"Survived\"]\n", 144 | "local_train_x = local_train.drop([\"Survived\"],axis=1)\n", 145 | "local_test_y = local_test[\"Survived\"]\n", 146 | "local_test_x = local_test.drop(\"Survived\",axis=1)" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": null, 152 | "metadata": { 153 | "collapsed": false 154 | }, 155 | "outputs": [], 156 | "source": [ 157 | "#The Model\n", 158 | "clf = sm.Logit(local_train_y,local_train_x)\n", 159 | "result = clf.fit()\n", 160 | "preds = result.predict(local_test_x)" 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": null, 166 | "metadata": { 167 | "collapsed": false 168 | }, 169 | "outputs": [], 170 | "source": [ 171 | "#Accuracy of Logistic Model\n", 172 | "np.mean((preds > 0.5) == local_test_y)" 173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": null, 178 | "metadata": { 179 | "collapsed": false 180 | }, 181 | "outputs": [], 182 | "source": [ 183 | "result.summary()" 184 | ] 185 | }, 186 | { 187 | "cell_type": "markdown", 188 | "metadata": {}, 189 | "source": [ 190 | "## Now let's put some of the Data Cleaning and Feature Engineering from before to work!" 191 | ] 192 | }, 193 | { 194 | "cell_type": "code", 195 | "execution_count": null, 196 | "metadata": { 197 | "collapsed": true 198 | }, 199 | "outputs": [], 200 | "source": [ 201 | "#Read in Titanic Data\n", 202 | "titanic = pd.read_csv(\"../../datasets/titanic/train.csv\")" 203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "execution_count": null, 208 | "metadata": { 209 | "collapsed": true 210 | }, 211 | "outputs": [], 212 | "source": [ 213 | "titanic_engineered = titanic.copy()" 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": null, 219 | "metadata": { 220 | "collapsed": true 221 | }, 222 | "outputs": [], 223 | "source": [ 224 | "#Imputing Age\n", 225 | "titanic_engineered['title'] = 'other'\n", 226 | "titanic_engineered.loc[['Master.' in n for n in titanic_engineered['Name']],'title'] = 'Master'\n", 227 | "titanic_engineered.loc[['Miss.' 
in n for n in titanic_engineered['Name']],'title'] = 'Miss'\n", 228 | "titanic_engineered.loc[['Mr.' in n for n in titanic_engineered['Name']],'title'] = 'Mr'\n", 229 | "titanic_engineered.loc[['Mrs.' in n for n in titanic_engineered['Name']],'title'] = 'Mrs'\n", 230 | "\n", 231 | "#Transform performs operation per group and returns values to their original index\n", 232 | "titanic_engineered['age_filled'] = titanic_engineered[['title','Age']].groupby('title').transform(lambda x: x.fillna(x.mean())) \n", 233 | "\n", 234 | "titanic_engineered.drop(['Age'],axis=1,inplace=True)" 235 | ] 236 | }, 237 | { 238 | "cell_type": "code", 239 | "execution_count": null, 240 | "metadata": { 241 | "collapsed": true 242 | }, 243 | "outputs": [], 244 | "source": [ 245 | "#Cabin Side Feature\n", 246 | "titanic_engineered['cabin_side'] = 'Unknown'\n", 247 | "titanic_engineered.loc[titanic_engineered['Cabin'].str[-1].isin([\"1\", \"3\", \"5\", \"7\", \"9\"]),'cabin_side'] = 'starboard'\n", 248 | "titanic_engineered.loc[titanic_engineered['Cabin'].str[-1].isin([\"2\", \"4\", \"6\", \"8\", \"0\"]),'cabin_side'] = 'port'" 249 | ] 250 | }, 251 | { 252 | "cell_type": "code", 253 | "execution_count": null, 254 | "metadata": { 255 | "collapsed": true 256 | }, 257 | "outputs": [], 258 | "source": [ 259 | "#Deck Feature (including some cleaning)\n", 260 | "titanic_engineered['deck'] = 'Unknown'\n", 261 | "titanic_engineered.loc[titanic_engineered['Cabin'].notnull(),'deck'] = titanic_engineered['Cabin'].str[0]\n", 262 | "titanic_engineered.loc[titanic_engineered['deck'] == 'T','deck'] = \"Unknown\"" 263 | ] 264 | }, 265 | { 266 | "cell_type": "code", 267 | "execution_count": null, 268 | "metadata": { 269 | "collapsed": true 270 | }, 271 | "outputs": [], 272 | "source": [ 273 | "#Deck Feature (including some cleaning)\n", 274 | "titanic_engineered['deck'] = 'Unknown'\n", 275 | "titanic_engineered.loc[titanic_engineered['Cabin'].notnull(),'deck'] = titanic_engineered['Cabin'].str[0]\n", 276 | "titanic_engineered.loc[titanic_engineered['deck'] == 'T','deck'] = \"Unknown\"\n", 277 | "\n", 278 | "pattern = \"[A-Z]\\s[A-Z]\" #Any capital letter between A-Z followed by a whitespace followed by any letter between A-Z\n", 279 | "mask = titanic_engineered['Cabin'].str.contains(pattern,na=False)\n", 280 | "titanic_engineered.loc[mask,'deck'] = titanic_engineered.loc[mask,'Cabin'].str[2]" 281 | ] 282 | }, 283 | { 284 | "cell_type": "code", 285 | "execution_count": null, 286 | "metadata": { 287 | "collapsed": true 288 | }, 289 | "outputs": [], 290 | "source": [ 291 | "#Number cabins per person\n", 292 | "titanic_engineered['num_in_group'] = titanic_engineered['Cabin'].str.split().apply(lambda x: len(x) if type(x)!=float else 1)" 293 | ] 294 | }, 295 | { 296 | "cell_type": "code", 297 | "execution_count": null, 298 | "metadata": { 299 | "collapsed": true 300 | }, 301 | "outputs": [], 302 | "source": [ 303 | "#Removing columns we don't want (that don't make sense to include anymore)\n", 304 | "#Notice we are NOT dropping the Age column anymore because we've filled in the missing values!\n", 305 | "titanic_engineered.drop(['PassengerId','Name','Ticket','Cabin','title'],axis=1,inplace=True)" 306 | ] 307 | }, 308 | { 309 | "cell_type": "code", 310 | "execution_count": null, 311 | "metadata": { 312 | "collapsed": true 313 | }, 314 | "outputs": [], 315 | "source": [ 316 | "#Getting Dummy Variables\n", 317 | "titanic_engineered = pd.get_dummies(titanic_engineered,columns=['Sex','Pclass','Embarked','cabin_side','deck'],drop_first=True)" 318 
| ] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": null, 323 | "metadata": { 324 | "collapsed": false 325 | }, 326 | "outputs": [], 327 | "source": [ 328 | "titanic_engineered.head()" 329 | ] 330 | }, 331 | { 332 | "cell_type": "code", 333 | "execution_count": null, 334 | "metadata": { 335 | "collapsed": true 336 | }, 337 | "outputs": [], 338 | "source": [ 339 | "#Train Test Splitting\n", 340 | "local_train, local_test = train_test_split(titanic_engineered,test_size=0.2,random_state=123)\n", 341 | "\n", 342 | "local_train_y = local_train[\"Survived\"]\n", 343 | "local_train_x = local_train.drop([\"Survived\"],axis=1)\n", 344 | "local_test_y = local_test[\"Survived\"]\n", 345 | "local_test_x = local_test.drop(\"Survived\",axis=1)" 346 | ] 347 | }, 348 | { 349 | "cell_type": "code", 350 | "execution_count": null, 351 | "metadata": { 352 | "collapsed": false 353 | }, 354 | "outputs": [], 355 | "source": [ 356 | "#The Model\n", 357 | "clf = sm.Logit(local_train_y,local_train_x)\n", 358 | "result = clf.fit()\n", 359 | "preds = result.predict(local_test_x)" 360 | ] 361 | }, 362 | { 363 | "cell_type": "code", 364 | "execution_count": null, 365 | "metadata": { 366 | "collapsed": false 367 | }, 368 | "outputs": [], 369 | "source": [ 370 | "#Accuracy of Logistic Model\n", 371 | "np.mean((preds > 0.5) == local_test_y)" 372 | ] 373 | }, 374 | { 375 | "cell_type": "code", 376 | "execution_count": null, 377 | "metadata": { 378 | "collapsed": false 379 | }, 380 | "outputs": [], 381 | "source": [ 382 | "result.summary()" 383 | ] 384 | }, 385 | { 386 | "cell_type": "markdown", 387 | "metadata": {}, 388 | "source": [ 389 | "## K-Fold Cross Validation (Basic Data Set)" 390 | ] 391 | }, 392 | { 393 | "cell_type": "code", 394 | "execution_count": null, 395 | "metadata": { 396 | "collapsed": true 397 | }, 398 | "outputs": [], 399 | "source": [ 400 | "from sklearn.cross_validation import KFold" 401 | ] 402 | }, 403 | { 404 | "cell_type": "code", 405 | "execution_count": null, 406 | "metadata": { 407 | "collapsed": true 408 | }, 409 | "outputs": [], 410 | "source": [ 411 | "#Splits data into our train and test indices for each fold\n", 412 | "kf = KFold(titanic_only.shape[0], n_folds=10)" 413 | ] 414 | }, 415 | { 416 | "cell_type": "code", 417 | "execution_count": null, 418 | "metadata": { 419 | "collapsed": true 420 | }, 421 | "outputs": [], 422 | "source": [ 423 | "#Saves our accuracy scores for each fold\n", 424 | "outcomes = []\n", 425 | "\n", 426 | "#Keeps track of which fold we are currently in\n", 427 | "fold = 0" 428 | ] 429 | }, 430 | { 431 | "cell_type": "code", 432 | "execution_count": null, 433 | "metadata": { 434 | "collapsed": false 435 | }, 436 | "outputs": [], 437 | "source": [ 438 | "for train_index, test_index in kf:\n", 439 | " fold += 1\n", 440 | " local_train_xy, local_test_xy = titanic_only.iloc[train_index], titanic_only.iloc[test_index]\n", 441 | " local_train_y = local_train_xy['Survived']\n", 442 | " local_train_x = local_train_xy.drop(['Survived'],axis=1)\n", 443 | " local_test_y = local_test_xy['Survived']\n", 444 | " local_test_x = local_test_xy.drop(['Survived'],axis=1)\n", 445 | "\n", 446 | " clf = sm.Logit(local_train_y,local_train_x)\n", 447 | " result = clf.fit()\n", 448 | " preds = result.predict(local_test_x)\n", 449 | " accuracy = np.mean((preds > 0.5) == local_test_y)\n", 450 | "\n", 451 | " outcomes.append(accuracy)\n", 452 | " print(\"Fold {0} accuracy: {1}\".format(fold, accuracy)) " 453 | ] 454 | }, 455 | { 456 | "cell_type": "code", 457 | 
"execution_count": null, 458 | "metadata": { 459 | "collapsed": false 460 | }, 461 | "outputs": [], 462 | "source": [ 463 | "#Final Cross Validated (average) score\n", 464 | "mean_outcome = np.mean(outcomes)\n", 465 | "mean_outcome" 466 | ] 467 | }, 468 | { 469 | "cell_type": "markdown", 470 | "metadata": {}, 471 | "source": [ 472 | "## K-Fold Cross Validation (Feature Engineered Data Set)" 473 | ] 474 | }, 475 | { 476 | "cell_type": "code", 477 | "execution_count": null, 478 | "metadata": { 479 | "collapsed": true 480 | }, 481 | "outputs": [], 482 | "source": [ 483 | "#Saves our accuracy scores for each fold\n", 484 | "outcomes = []\n", 485 | "\n", 486 | "#Keeps track of which fold we are currently in\n", 487 | "fold = 0" 488 | ] 489 | }, 490 | { 491 | "cell_type": "code", 492 | "execution_count": null, 493 | "metadata": { 494 | "collapsed": false 495 | }, 496 | "outputs": [], 497 | "source": [ 498 | "for train_index, test_index in kf:\n", 499 | " fold += 1\n", 500 | " local_train_xy, local_test_xy = titanic_engineered.iloc[train_index], titanic_engineered.iloc[test_index]\n", 501 | " local_train_y = local_train_xy['Survived']\n", 502 | " local_train_x = local_train_xy.drop(['Survived'],axis=1)\n", 503 | " local_test_y = local_test_xy['Survived']\n", 504 | " local_test_x = local_test_xy.drop(['Survived'],axis=1)\n", 505 | "\n", 506 | " clf = sm.Logit(local_train_y,local_train_x)\n", 507 | " result = clf.fit()\n", 508 | " preds = result.predict(local_test_x)\n", 509 | " accuracy = np.mean((preds > 0.5) == local_test_y)\n", 510 | "\n", 511 | " outcomes.append(accuracy)\n", 512 | " print(\"Fold {0} accuracy: {1}\".format(fold, accuracy)) \n", 513 | "\n", 514 | " " 515 | ] 516 | }, 517 | { 518 | "cell_type": "code", 519 | "execution_count": null, 520 | "metadata": { 521 | "collapsed": false 522 | }, 523 | "outputs": [], 524 | "source": [ 525 | "mean_outcome = np.mean(outcomes)\n", 526 | "mean_outcome" 527 | ] 528 | }, 529 | { 530 | "cell_type": "code", 531 | "execution_count": null, 532 | "metadata": { 533 | "collapsed": true 534 | }, 535 | "outputs": [], 536 | "source": [] 537 | } 538 | ], 539 | "metadata": { 540 | "kernelspec": { 541 | "display_name": "Python 3", 542 | "language": "python", 543 | "name": "python3" 544 | }, 545 | "language_info": { 546 | "codemirror_mode": { 547 | "name": "ipython", 548 | "version": 3 549 | }, 550 | "file_extension": ".py", 551 | "mimetype": "text/x-python", 552 | "name": "python", 553 | "nbconvert_exporter": "python", 554 | "pygments_lexer": "ipython3", 555 | "version": "3.5.0" 556 | } 557 | }, 558 | "nbformat": 4, 559 | "nbformat_minor": 0 560 | } 561 | -------------------------------------------------------------------------------- /day09/README.md: -------------------------------------------------------------------------------- 1 | # Day 9 2 | K-means and cross-validation 3 | 4 | [Kmeans Notebook](https://github.com/kaggledecal/sp17/blob/master/day09/Kmeans.ipynb) 5 | 6 | [Cross Validation Notebook](https://github.com/kaggledecal/sp17/blob/master/day09/Logistic%20Regression%20and%20Cross%20Validation.ipynb) 7 | 8 | [Slides](https://docs.google.com/presentation/d/1gEr0w_-18GFkenbIQC2LUuhCm9_nkeLUCdleuk82Mik/edit?usp=sharing) 9 | 10 | [Video]() 11 | -------------------------------------------------------------------------------- /day10/README.md: -------------------------------------------------------------------------------- 1 | #Day 10 2 | ## K-Nearest Neighbors and the Bias Variance Tradeoff 3 | [kNN 
Notebook](https://github.com/kaggledecal/sp17/blob/master/day10/knn.ipynb) 4 | 5 | [Validation Practice](https://github.com/kaggledecal/sp17/blob/master/day10/bias-variance.ipynb) 6 | 7 | 8 | [Slides](https://docs.google.com/presentation/d/1k33EgQt5GlaJKrK9XNuaCH4TN7Nmf6FY-ANKx3CpiaU/edit?usp=sharing) 9 | 10 | [Video](https://youtu.be/zCLtAJVv60c) 11 | -------------------------------------------------------------------------------- /day10/bias-variance.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import numpy as np\n", 12 | "import pandas as pd\n", 13 | "import matplotlib.pyplot as plt\n", 14 | "%matplotlib inline" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "# Lecture 11 - Model Validation\n", 22 | "You might remember this code block from last class" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": null, 28 | "metadata": { 29 | "collapsed": true 30 | }, 31 | "outputs": [], 32 | "source": [ 33 | "from sklearn import datasets\n", 34 | "from sklearn.neighbors import KNeighborsClassifier as KNN\n", 35 | "from matplotlib.colors import ListedColormap" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": null, 41 | "metadata": { 42 | "collapsed": false 43 | }, 44 | "outputs": [], 45 | "source": [ 46 | "# import some data to play with\n", 47 | "iris = datasets.load_iris()\n", 48 | "X = iris.data[:, :2] # we only take the first two features. We could\n", 49 | " # avoid this ugly slicing by using a two-dim dataset\n", 50 | "y = iris.target\n", 51 | "h = .02 # step size in the mesh\n", 52 | "\n", 53 | "# Create color maps\n", 54 | "cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])\n", 55 | "cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])\n", 56 | "\n", 57 | "for n_neighbors in [1, 5, 10, 15, 50, 100]:\n", 58 | "\n", 59 | " # we create an instance of Neighbours Classifier and fit the data.\n", 60 | " clf = KNN(n_neighbors)\n", 61 | " clf.fit(X, y)\n", 62 | "\n", 63 | " # Plot the decision boundary. For that, we will assign a color to each\n", 64 | " # point in the mesh [x_min, x_max]x[y_min, y_max].\n", 65 | " x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1\n", 66 | " y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1\n", 67 | " xx, yy = np.meshgrid(np.arange(x_min, x_max, h),\n", 68 | " np.arange(y_min, y_max, h))\n", 69 | " Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])\n", 70 | "\n", 71 | " # Put the result into a color plot\n", 72 | " Z = Z.reshape(xx.shape)\n", 73 | " plt.figure()\n", 74 | " plt.pcolormesh(xx, yy, Z, cmap=cmap_light)\n", 75 | "\n", 76 | " # Plot also the training points\n", 77 | " plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold)\n", 78 | " plt.xlim(xx.min(), xx.max())\n", 79 | " plt.ylim(yy.min(), yy.max())\n", 80 | " plt.title(\"3-Class classification (k = %i)\" % (n_neighbors))\n", 81 | "\n", 82 | "plt.show()" 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "metadata": {}, 88 | "source": [ 89 | "Which of the above plots are overfitting? underfitting? \n", 90 | "


\n", 91 | "\n", 92 | "## Cross-Validation\n", 93 | "We've touched on the idea of overfitting a few times now, but I want to explore it again to really emphasize the importance of this method. This is probably the **most important step** of any machine learning / data science process. It asserts that your model can generalize (is not overfit) and makes sure that you're not wasting a clients money or time.\n", 94 | "\n", 95 | "Before we proceed, let's import some data" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": null, 101 | "metadata": { 102 | "collapsed": true 103 | }, 104 | "outputs": [], 105 | "source": [ 106 | "from sklearn import datasets\n", 107 | "from sklearn.cross_validation import cross_val_score, train_test_split\n", 108 | "from sklearn.neighbors import KNeighborsClassifier as KNN\n", 109 | "np.random.seed(30)" 110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": null, 115 | "metadata": { 116 | "collapsed": true 117 | }, 118 | "outputs": [], 119 | "source": [ 120 | "digits = datasets.load_digits()\n", 121 | "X = digits.data\n", 122 | "y = digits.target" 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": null, 128 | "metadata": { 129 | "collapsed": false 130 | }, 131 | "outputs": [], 132 | "source": [ 133 | "# Plot digit at position\n", 134 | "digit_idx = 37 # the position of the digit to render \n", 135 | "plt.figure(1, figsize=(3, 3))\n", 136 | "plt.imshow(X[digit_idx].reshape((8,8)), cmap=plt.cm.gray_r, interpolation='nearest')\n", 137 | "plt.show()\n", 138 | "print(\"Corresponding digit\",y[digit_idx])" 139 | ] 140 | }, 141 | { 142 | "cell_type": "markdown", 143 | "metadata": {}, 144 | "source": [ 145 | "### Split test data from the training data\n", 146 | "You may be asking yourself why are we separating a test set from the training set - isn't this what cross-validation already handles? \n", 147 | "\n", 148 | "We are basically getting the ability to assess how well our model will generalize. We'll use cross-validation (making the train-validation split of this training set) to evaluate subtle changes in our algorithm. Then we will use the test set to get an accuracy measurement for how well we expect our model to perform against new data." 
149 | ] 150 | }, 151 | { 152 | "cell_type": "code", 153 | "execution_count": null, 154 | "metadata": { 155 | "collapsed": false 156 | }, 157 | "outputs": [], 158 | "source": [ 159 | "# split dataset using train_test_split" 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": null, 165 | "metadata": { 166 | "collapsed": true 167 | }, 168 | "outputs": [], 169 | "source": [ 170 | "from sklearn.cross_validation import cross_val_score" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": null, 176 | "metadata": { 177 | "collapsed": false 178 | }, 179 | "outputs": [], 180 | "source": [ 181 | "# Evaluate a KNN model using cross_val_score; save the scores as clf_score (used below)" 182 | ] 183 | }, 184 | { 185 | "cell_type": "markdown", 186 | "metadata": {}, 187 | "source": [ 188 | "As you can see, `cross_val_score` outputs the validation accuracy values for each fold in the k-fold.\n", 189 | "\n", 190 | "Let's make this into something a little bit more readable" 191 | ] 192 | }, 193 | { 194 | "cell_type": "code", 195 | "execution_count": null, 196 | "metadata": { 197 | "collapsed": false 198 | }, 199 | "outputs": [], 200 | "source": [ 201 | "def cv_stats(cv_score):\n", 202 | " \"\"\" \n", 203 | " Takes in the output of cross_val_score\n", 204 | " Returns the mean and standard deviation in a readable format\n", 205 | " \"\"\"\n", 206 | " mean = np.mean(cv_score)\n", 207 | " std = np.std(cv_score)\n", 208 | " return mean, std\n", 209 | "cv_stats(clf_score)" 210 | ] 211 | }, 212 | { 213 | "cell_type": "markdown", 214 | "metadata": {}, 215 | "source": [ 216 | "Now we're moving. How can we use this to our advantage?\n", 217 | "\n", 218 | "**To pick our model hyperparameters of course!**\n", 219 | "\n", 220 | "A good rule of thumb for picking model hyperparameters, as mentioned in last class, is to vary by order of magnitude at first, then explore the local neighborhood of values that seem relevant." 221 | ] 222 | }, 223 | { 224 | "cell_type": "code", 225 | "execution_count": null, 226 | "metadata": { 227 | "collapsed": false 228 | }, 229 | "outputs": [], 230 | "source": [ 231 | "# iterate through a range of values for k and evaluate how well KNN does on average\n", 232 | "# start off with the lowest value possible, then multiply by 2 with every iteration\n" 233 | ] 234 | }, 235 | { 236 | "cell_type": "markdown", 237 | "metadata": {}, 238 | "source": [ 239 | "Not only can we evaluate how different hyperparameters affect the accuracy of a classifier, but we're also able to compare *different learning methods* as well!\n", 240 | "\n", 241 | "Let's try logistic regression. First let's find the best value for C, the (inverse) regularization strength of logistic regression" 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": null, 247 | "metadata": { 248 | "collapsed": false 249 | }, 250 | "outputs": [], 251 | "source": [ 252 | "from sklearn.linear_model import LogisticRegression, RidgeClassifier" 253 | ] 254 | }, 255 | { 256 | "cell_type": "code", 257 | "execution_count": null, 258 | "metadata": { 259 | "collapsed": false 260 | }, 261 | "outputs": [], 262 | "source": [ 263 | "# iterate through a range of penalty values and evaluate how well LogisticRegression does on average" 264 | ] 265 | }, 266 | { 267 | "cell_type": "markdown", 268 | "metadata": {}, 269 | "source": [ 270 | "And let's do the same with RidgeClassifier. This algorithm implements Ridge Regression, a regularized variant of Linear Regression. \n", 271 | "\n", 272 | "It works by penalizing large weight values. 
If this is a little daunting, don't worry too much about it. A good explanation is located [here](https://www.quora.com/What-is-Ridge-Regression-in-laymans-terms) if you are still curious" 273 | ] 274 | }, 275 | { 276 | "cell_type": "code", 277 | "execution_count": null, 278 | "metadata": { 279 | "collapsed": false 280 | }, 281 | "outputs": [], 282 | "source": [ 283 | "# iterate through a range of penalty values and evaluate how well RidgeClassifier does on average" 284 | ] 285 | }, 286 | { 287 | "cell_type": "markdown", 288 | "metadata": {}, 289 | "source": [ 290 | "Let's choose the best model and evaluate its results" 291 | ] 292 | }, 293 | { 294 | "cell_type": "code", 295 | "execution_count": null, 296 | "metadata": { 297 | "collapsed": false 298 | }, 299 | "outputs": [], 300 | "source": [ 301 | "clf = # best classifier here w/ best parameters\n", 302 | "clf.fit(X_train, y_train)\n", 303 | "predictions = clf.predict(X_test) \n", 304 | "accuracy = sum(predictions == y_test) / len(y_test)\n", 305 | "accuracy" 306 | ] 307 | }, 308 | { 309 | "cell_type": "markdown", 310 | "metadata": {}, 311 | "source": [ 312 | "## Introduction to Confusion Matrices\n", 313 | "\"How can we understand what types of mistakes a learned model makes? \"" 314 | ] 315 | }, 316 | { 317 | "cell_type": "code", 318 | "execution_count": null, 319 | "metadata": { 320 | "collapsed": true 321 | }, 322 | "outputs": [], 323 | "source": [ 324 | "from sklearn.metrics import confusion_matrix" 325 | ] 326 | }, 327 | { 328 | "cell_type": "code", 329 | "execution_count": null, 330 | "metadata": { 331 | "collapsed": true 332 | }, 333 | "outputs": [], 334 | "source": [ 335 | "\n", 336 | "import itertools\n", 337 | "def plot_confusion_matrix(cm, classes,\n", 338 | " normalize=False,\n", 339 | " title='Confusion matrix',\n", 340 | " cmap=plt.cm.Blues):\n", 341 | " \"\"\"\n", 342 | " This function prints and plots the confusion matrix.\n", 343 | " Normalization can be applied by setting `normalize=True`.\n", 344 | " \"\"\"\n", 345 | " plt.imshow(cm, interpolation='nearest', cmap=cmap)\n", 346 | " plt.title(title)\n", 347 | " plt.colorbar()\n", 348 | " tick_marks = np.arange(len(classes))\n", 349 | " plt.xticks(tick_marks, classes, rotation=45)\n", 350 | " plt.yticks(tick_marks, classes)\n", 351 | "\n", 352 | " if normalize:\n", 353 | " cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]\n", 354 | " print(\"Normalized confusion matrix\")\n", 355 | " else:\n", 356 | " print('Confusion matrix, without normalization')\n", 357 | "\n", 358 | " print(cm)\n", 359 | "\n", 360 | " thresh = cm.max() / 2.\n", 361 | " for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):\n", 362 | " plt.text(j, i, cm[i, j],\n", 363 | " horizontalalignment=\"center\",\n", 364 | " color=\"white\" if cm[i, j] > thresh else \"black\")\n", 365 | "\n", 366 | " plt.tight_layout()\n", 367 | " plt.ylabel('True label')\n", 368 | " plt.xlabel('Predicted label')" 369 | ] 370 | }, 371 | { 372 | "cell_type": "code", 373 | "execution_count": null, 374 | "metadata": { 375 | "collapsed": false 376 | }, 377 | "outputs": [], 378 | "source": [ 379 | "cnf_matrix = confusion_matrix(y_test, predictions)\n", 380 | "plot_confusion_matrix(cnf_matrix, digits.target_names)" 381 | ] 382 | }, 383 | { 384 | "cell_type": "markdown", 385 | "metadata": { 386 | "slideshow": { 387 | "slide_type": "slide" 388 | } 389 | }, 390 | "source": [ 391 | "# Review of Models\n", 392 | "1. Linear Regression\n", 393 | "2. Logistic Regression\n", 394 | "3. K-means\n", 395 | "4. 
K-Nearest Neighbors\n", 396 | "


\n", 397 | "\n", 398 | "\n", 399 | "## Linear Regression\n", 400 | "Answers the questions - \"What's the best way to draw a line through our points?\"" 401 | ] 402 | }, 403 | { 404 | "cell_type": "code", 405 | "execution_count": null, 406 | "metadata": { 407 | "collapsed": true 408 | }, 409 | "outputs": [], 410 | "source": [ 411 | "from sklearn import datasets\n", 412 | "from sklearn.cross_validation import train_test_split\n", 413 | "from sklearn import linear_model\n", 414 | "import matplotlib.pyplot as plt" 415 | ] 416 | }, 417 | { 418 | "cell_type": "code", 419 | "execution_count": null, 420 | "metadata": { 421 | "collapsed": false 422 | }, 423 | "outputs": [], 424 | "source": [ 425 | "boston = datasets.load_boston()\n", 426 | "X = boston.data\n", 427 | "y = boston.target\n", 428 | "\n", 429 | "X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2)" 430 | ] 431 | }, 432 | { 433 | "cell_type": "code", 434 | "execution_count": null, 435 | "metadata": { 436 | "collapsed": false 437 | }, 438 | "outputs": [], 439 | "source": [ 440 | "# Initialize linear regression" 441 | ] 442 | }, 443 | { 444 | "cell_type": "markdown", 445 | "metadata": {}, 446 | "source": [ 447 | "#### Scoring our linear regression\n", 448 | "\n", 449 | "If you remember from the linear regression lecture and from any stats class you've taken, the residual square error allows us to calculate how well our preidctor does in fitting the data. \n", 450 | "The formula is as so\n", 451 | "\n", 452 | "$$R^2 = 1 - \\frac{\\sum(y_i - f(x_i))}{\\sum(y_i-\\bar{y})}$$,\n", 453 | "where $f(x_i)$ is our model's prediction for the $x_i$th datapoint.\n", 454 | "\n", 455 | "The great thing about sklearn's api is that a lot of these scoring measure are already built in. Using our trained Linear regressor, we can quickly score it using the score method as so" 456 | ] 457 | }, 458 | { 459 | "cell_type": "code", 460 | "execution_count": null, 461 | "metadata": { 462 | "collapsed": false 463 | }, 464 | "outputs": [], 465 | "source": [ 466 | "train_r2 = # fill out linreg\n", 467 | "test_r2 = # fill out linreg\n", 468 | "\n", 469 | "print(\"Train accuracy : \", train_r2)\n", 470 | "print(\"Test accuracy : \", test_r2)" 471 | ] 472 | }, 473 | { 474 | "cell_type": "code", 475 | "execution_count": null, 476 | "metadata": { 477 | "collapsed": true 478 | }, 479 | "outputs": [], 480 | "source": [] 481 | }, 482 | { 483 | "cell_type": "markdown", 484 | "metadata": {}, 485 | "source": [ 486 | "We can graphically examine how well our Linear Regressor performs by plotting the true y values, and those predicted by the linear regressor" 487 | ] 488 | }, 489 | { 490 | "cell_type": "code", 491 | "execution_count": null, 492 | "metadata": { 493 | "collapsed": false 494 | }, 495 | "outputs": [], 496 | "source": [ 497 | "y_pred = # predict on test set\n", 498 | "\n", 499 | "fig, ax = plt.subplots()\n", 500 | "ax.scatter(y_test, y_pred)\n", 501 | "ax.plot([y.min(), y.max()], [y.min(), y.max()], 'r--', lw=4)\n", 502 | "ax.set_xlabel('Test')\n", 503 | "ax.set_ylabel('Predicted')\n", 504 | "plt.title('Comparison of Actual vs Predicted Values')\n", 505 | "plt.show()" 506 | ] 507 | }, 508 | { 509 | "cell_type": "markdown", 510 | "metadata": {}, 511 | "source": [ 512 | "## Logistic Regression\n", 513 | "\"Model the probability that some class occurs\"" 514 | ] 515 | }, 516 | { 517 | "cell_type": "code", 518 | "execution_count": null, 519 | "metadata": { 520 | "collapsed": false 521 | }, 522 | "outputs": [], 523 | "source": [ 524 | "cancer = 
datasets.load_breast_cancer()\n", 525 | "X = cancer.data\n", 526 | "y = cancer.target\n", 527 | "\n", 528 | "\n", 529 | "np.random.seed(seed=133) # set seed=40 if you want an example where test accuracy is greater than train\n", 530 | "X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2)\n" 531 | ] 532 | }, 533 | { 534 | "cell_type": "code", 535 | "execution_count": null, 536 | "metadata": { 537 | "collapsed": false 538 | }, 539 | "outputs": [], 540 | "source": [ 541 | "# initialize and fit logistic regression" 542 | ] 543 | }, 544 | { 545 | "cell_type": "code", 546 | "execution_count": null, 547 | "metadata": { 548 | "collapsed": false 549 | }, 550 | "outputs": [], 551 | "source": [ 552 | "train_acc = # score Logistic Regression on train set\n", 553 | "test_acc = # score Logistic Regression on test set\n", 554 | "\n", 555 | "\n", 556 | "print(\"Train accuracy : \", train_acc)\n", 557 | "print(\"Test accuracy : \", test_acc)" 558 | ] 559 | }, 560 | { 561 | "cell_type": "markdown", 562 | "metadata": {}, 563 | "source": [ 564 | "## k-Nearest Neighbors\n", 565 | "\"Find the `k` closest points in the training set to the input datapoint\"" 566 | ] 567 | }, 568 | { 569 | "cell_type": "code", 570 | "execution_count": null, 571 | "metadata": { 572 | "collapsed": true 573 | }, 574 | "outputs": [], 575 | "source": [ 576 | "from sklearn.neighbors import KNeighborsClassifier as KNN" 577 | ] 578 | }, 579 | { 580 | "cell_type": "code", 581 | "execution_count": null, 582 | "metadata": { 583 | "collapsed": false 584 | }, 585 | "outputs": [], 586 | "source": [ 587 | "# initialize KNN here" 588 | ] 589 | }, 590 | { 591 | "cell_type": "code", 592 | "execution_count": null, 593 | "metadata": { 594 | "collapsed": false 595 | }, 596 | "outputs": [], 597 | "source": [ 598 | "train_acc = # score KNN on train set\n", 599 | "test_acc = # score KNN on test set\n", 600 | "\n", 601 | "\n", 602 | "print(\"Train accuracy : \", train_acc)\n", 603 | "print(\"Test accuracy : \", test_acc)" 604 | ] 605 | }, 606 | { 607 | "cell_type": "markdown", 608 | "metadata": {}, 609 | "source": [ 610 | "## K-means\n", 611 | "Find `k` clusters in the data based on how similar the datapoints are to one another" 612 | ] 613 | }, 614 | { 615 | "cell_type": "code", 616 | "execution_count": null, 617 | "metadata": { 618 | "collapsed": true 619 | }, 620 | "outputs": [], 621 | "source": [ 622 | "from sklearn.cluster import KMeans\n", 623 | "from matplotlib import cm" 624 | ] 625 | }, 626 | { 627 | "cell_type": "code", 628 | "execution_count": null, 629 | "metadata": { 630 | "collapsed": false 631 | }, 632 | "outputs": [], 633 | "source": [ 634 | "blobs = datasets.make_blobs(n_samples=1000)\n", 635 | "X = blobs[0]\n", 636 | "y = blobs[1]\n" 637 | ] 638 | }, 639 | { 640 | "cell_type": "code", 641 | "execution_count": null, 642 | "metadata": { 643 | "collapsed": true 644 | }, 645 | "outputs": [], 646 | "source": [ 647 | "# setup KMeans " 648 | ] 649 | }, 650 | { 651 | "cell_type": "code", 652 | "execution_count": null, 653 | "metadata": { 654 | "collapsed": false 655 | }, 656 | "outputs": [], 657 | "source": [ 658 | "y_pred = # predict KMeans" 659 | ] 660 | }, 661 | { 662 | "cell_type": "code", 663 | "execution_count": null, 664 | "metadata": { 665 | "collapsed": false 666 | }, 667 | "outputs": [], 668 | "source": [ 669 | "plt.figure()\n", 670 | "y_unique = np.unique(y)\n", 671 | "colors = cm.rainbow(np.linspace(0.0, 1.0, y_unique.size))\n", 672 | "for this_y, color in zip(y_unique, colors):\n", 673 | " this_X = 
X[y_pred == this_y]\n", 674 | "# this_sw = sw_train[y == this_y]\n", 675 | " plt.scatter(this_X[:, 0], this_X[:, 1], c=color, alpha=0.5,\n", 676 | " label=\"Class %s\" % this_y)\n", 677 | "plt.legend(loc=\"best\")\n", 678 | "plt.title(\"Data\")\n" 679 | ] 680 | }, 681 | { 682 | "cell_type": "code", 683 | "execution_count": null, 684 | "metadata": { 685 | "collapsed": true 686 | }, 687 | "outputs": [], 688 | "source": [] 689 | } 690 | ], 691 | "metadata": { 692 | "kernelspec": { 693 | "display_name": "Python 3", 694 | "language": "python", 695 | "name": "python3" 696 | }, 697 | "language_info": { 698 | "codemirror_mode": { 699 | "name": "ipython", 700 | "version": 3 701 | }, 702 | "file_extension": ".py", 703 | "mimetype": "text/x-python", 704 | "name": "python", 705 | "nbconvert_exporter": "python", 706 | "pygments_lexer": "ipython3", 707 | "version": "3.5.0" 708 | } 709 | }, 710 | "nbformat": 4, 711 | "nbformat_minor": 0 712 | } 713 | -------------------------------------------------------------------------------- /day10/images/bv.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kaggledecal/sp17/52649d867d00787d92ef4a239605c23f04933b57/day10/images/bv.png -------------------------------------------------------------------------------- /day10/images/data.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kaggledecal/sp17/52649d867d00787d92ef4a239605c23f04933b57/day10/images/data.png -------------------------------------------------------------------------------- /day10/images/data1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kaggledecal/sp17/52649d867d00787d92ef4a239605c23f04933b57/day10/images/data1.png -------------------------------------------------------------------------------- /day10/images/hoods.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kaggledecal/sp17/52649d867d00787d92ef4a239605c23f04933b57/day10/images/hoods.png -------------------------------------------------------------------------------- /day10/images/k=1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kaggledecal/sp17/52649d867d00787d92ef4a239605c23f04933b57/day10/images/k=1.png -------------------------------------------------------------------------------- /day10/images/k=10.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kaggledecal/sp17/52649d867d00787d92ef4a239605c23f04933b57/day10/images/k=10.png -------------------------------------------------------------------------------- /day10/images/k=100.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kaggledecal/sp17/52649d867d00787d92ef4a239605c23f04933b57/day10/images/k=100.png -------------------------------------------------------------------------------- /day11/Decision Trees.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 2, 6 | "metadata": { 7 | "collapsed": false 8 | }, 9 | "outputs": [ 10 | { 11 | "name": "stdout", 12 | "output_type": "stream", 13 | "text": [ 14 | "Automatically created module for IPython interactive environment\n" 15 | ] 16 | } 17 | ], 
18 | "source": [ 19 | "print(__doc__)\n", 20 | "\n", 21 | "import numpy as np\n", 22 | "import matplotlib.pyplot as plt\n", 23 | "\n", 24 | "from sklearn.datasets import load_iris\n", 25 | "from sklearn.tree import DecisionTreeClassifier\n", 26 | "\n", 27 | "# Parameters\n", 28 | "n_classes = 3\n", 29 | "plot_colors = \"bry\"\n", 30 | "plot_step = 0.02\n", 31 | "\n", 32 | "# Load data\n", 33 | "iris = load_iris()\n", 34 | "\n", 35 | "\"\"\"\n", 36 | "For the code below train with a decision tree and plot \n", 37 | "the resulting decision boundary\n", 38 | "\n", 39 | "Try this with and without max depth and see if you can notice a difference\n", 40 | "\"\"\"\n", 41 | "\n", 42 | "for pairidx, pair in enumerate([[0, 1], [0, 2], [0, 3],\n", 43 | " [1, 2], [1, 3], [2, 3]]):\n", 44 | " # We only take the two corresponding features\n", 45 | " X = iris.data[:, pair]\n", 46 | " y = iris.target\n", 47 | "\n", 48 | " # Train (your code here)\n", 49 | " clf = None # (make this your decision tree)\n", 50 | "\n", 51 | " # Plot the decision boundary\n", 52 | " plt.subplot(2, 3, pairidx + 1)\n", 53 | "\n", 54 | " x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1\n", 55 | " y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1\n", 56 | " xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),\n", 57 | " np.arange(y_min, y_max, plot_step))\n", 58 | "\n", 59 | " Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])\n", 60 | " Z = Z.reshape(xx.shape)\n", 61 | " cs = plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)\n", 62 | "\n", 63 | " plt.xlabel(iris.feature_names[pair[0]])\n", 64 | " plt.ylabel(iris.feature_names[pair[1]])\n", 65 | " plt.axis(\"tight\")\n", 66 | "\n", 67 | " # Plot the training points\n", 68 | " for i, color in zip(range(n_classes), plot_colors):\n", 69 | " idx = np.where(y == i)\n", 70 | " plt.scatter(X[idx, 0], X[idx, 1], c=color, label=iris.target_names[i],\n", 71 | " cmap=plt.cm.Paired)\n", 72 | "\n", 73 | " plt.axis(\"tight\")\n", 74 | "\n", 75 | "plt.suptitle(\"Decision surface of a decision tree using paired features\")\n", 76 | "plt.legend()\n", 77 | "plt.show()" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": null, 83 | "metadata": { 84 | "collapsed": true 85 | }, 86 | "outputs": [], 87 | "source": [] 88 | } 89 | ], 90 | "metadata": { 91 | "kernelspec": { 92 | "display_name": "Python [Root]", 93 | "language": "python", 94 | "name": "Python [Root]" 95 | }, 96 | "language_info": { 97 | "codemirror_mode": { 98 | "name": "ipython", 99 | "version": 2 100 | }, 101 | "file_extension": ".py", 102 | "mimetype": "text/x-python", 103 | "name": "python", 104 | "nbconvert_exporter": "python", 105 | "pygments_lexer": "ipython2", 106 | "version": "2.7.12" 107 | } 108 | }, 109 | "nbformat": 4, 110 | "nbformat_minor": 0 111 | } 112 | -------------------------------------------------------------------------------- /day11/README.md: -------------------------------------------------------------------------------- 1 | #Day 11 2 | ## Decision Trees 3 | [Decision Tree Notebook](https://github.com/kaggledecal/sp17/blob/master/day11/Decision%20Trees.ipynb) 4 | 5 | 6 | 7 | 8 | [Slides](https://docs.google.com/a/berkeley.edu/presentation/d/1E3xTsMv4DZ5_c9BVN4KQ-qQ3gXh__TCybWL52Nt0yXo/edit?usp=sharing) 9 | 10 | [Video] 11 | -------------------------------------------------------------------------------- /day11/calc_gini.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | """ 3 | In the below function you will be implementing a function 4 
| to calculate the Gini impurity for a given collection of labels 5 | below are some examples to test how your function behaves 6 | """ 7 | def gini_index(classes): 8 | raise Exception('Not implemented') 9 | 10 | 11 | """ 12 | Uncomment these tests to see how 13 | your function behaves 14 | 15 | """ 16 | #gini_index([0, 0, 0, 0, 0, 0]) -> 0 17 | #gini_index([0, 1, 0, 1, 0, 1]) -> 0.5 18 | 19 | -------------------------------------------------------------------------------- /day12/README.md: -------------------------------------------------------------------------------- 1 | # Day 12 2 | ## Random Forests 3 | [Random Forest Notebook](https://github.com/kaggledecal/sp17/blob/master/day12/Random%20Forest.ipynb) 4 | 5 | 6 | 7 | 8 | [Slides](https://docs.google.com/presentation/d/15CYIHRBQR_h7r2R0qOFGws8s310kx0qAd4bbZvXk_Vk/edit?usp=sharing) 9 | 10 | [Video] 11 | -------------------------------------------------------------------------------- /day12/Random Forest.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import matplotlib.pyplot as plt\n", 12 | "\n", 13 | "import numpy as np\n", 14 | "\n", 15 | "from sklearn.datasets import make_blobs\n", 16 | "from sklearn.ensemble import RandomForestClassifier\n", 17 | "from sklearn.calibration import CalibratedClassifierCV\n", 18 | "from sklearn.metrics import log_loss\n", 19 | "\n", 20 | "np.random.seed(0)\n", 21 | "\n", 22 | "# Generate data\n", 23 | "X, y = make_blobs(n_samples=1000, n_features=2, random_state=42,\n", 24 | " cluster_std=5.0)\n", 25 | "X_train, y_train = X[:600], y[:600]\n", 26 | "X_valid, y_valid = X[600:800], y[600:800]\n", 27 | "X_train_valid, y_train_valid = X[:800], y[:800]\n", 28 | "X_test, y_test = X[800:], y[800:]\n", 29 | "\n", 30 | "# Train uncalibrated random forest classifier on whole train and validation\n", 31 | "# data and evaluate on test data\n", 32 | "\n", 33 | "\"\"\"\n", 34 | "YOUR code here, replace clf with a Random Forest with 25 individual trees\n", 35 | "\"\"\"\n", 36 | "clf = None\n", 37 | "clf.fit(X_train_valid, y_train_valid)\n", 38 | "clf_probs = clf.predict_proba(X_test)\n", 39 | "score = log_loss(y_test, clf_probs)\n", 40 | "\n", 41 | "# Train random forest classifier, calibrate on validation data and evaluate\n", 42 | "# on test data\n", 43 | "clf = RandomForestClassifier(n_estimators=25)\n", 44 | "clf.fit(X_train, y_train)\n", 45 | "clf_probs = clf.predict_proba(X_test)\n", 46 | "sig_clf = CalibratedClassifierCV(clf, method=\"sigmoid\", cv=\"prefit\")\n", 47 | "sig_clf.fit(X_valid, y_valid)\n", 48 | "sig_clf_probs = sig_clf.predict_proba(X_test)\n", 49 | "sig_score = log_loss(y_test, sig_clf_probs)\n", 50 | "\n", 51 | "# Plot changes in predicted probabilities via arrows\n", 52 | "plt.figure(0)\n", 53 | "colors = [\"r\", \"g\", \"b\"]\n", 54 | "for i in range(clf_probs.shape[0]):\n", 55 | " plt.arrow(clf_probs[i, 0], clf_probs[i, 1],\n", 56 | " sig_clf_probs[i, 0] - clf_probs[i, 0],\n", 57 | " sig_clf_probs[i, 1] - clf_probs[i, 1],\n", 58 | " color=colors[y_test[i]], head_width=1e-2)\n", 59 | "\n", 60 | "# Plot perfect predictions\n", 61 | "plt.plot([1.0], [0.0], 'ro', ms=20, label=\"Class 1\")\n", 62 | "plt.plot([0.0], [1.0], 'go', ms=20, label=\"Class 2\")\n", 63 | "plt.plot([0.0], [0.0], 'bo', ms=20, label=\"Class 3\")\n", 64 | "\n", 65 | "# Plot boundaries of unit simplex\n", 66 | 
"plt.plot([0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], 'k', label=\"Simplex\")\n", 67 | "\n", 68 | "# Annotate points on the simplex\n", 69 | "plt.annotate(r'($\\frac{1}{3}$, $\\frac{1}{3}$, $\\frac{1}{3}$)',\n", 70 | " xy=(1.0/3, 1.0/3), xytext=(1.0/3, .23), xycoords='data',\n", 71 | " arrowprops=dict(facecolor='black', shrink=0.05),\n", 72 | " horizontalalignment='center', verticalalignment='center')\n", 73 | "plt.plot([1.0/3], [1.0/3], 'ko', ms=5)\n", 74 | "plt.annotate(r'($\\frac{1}{2}$, $0$, $\\frac{1}{2}$)',\n", 75 | " xy=(.5, .0), xytext=(.5, .1), xycoords='data',\n", 76 | " arrowprops=dict(facecolor='black', shrink=0.05),\n", 77 | " horizontalalignment='center', verticalalignment='center')\n", 78 | "plt.annotate(r'($0$, $\\frac{1}{2}$, $\\frac{1}{2}$)',\n", 79 | " xy=(.0, .5), xytext=(.1, .5), xycoords='data',\n", 80 | " arrowprops=dict(facecolor='black', shrink=0.05),\n", 81 | " horizontalalignment='center', verticalalignment='center')\n", 82 | "plt.annotate(r'($\\frac{1}{2}$, $\\frac{1}{2}$, $0$)',\n", 83 | " xy=(.5, .5), xytext=(.6, .6), xycoords='data',\n", 84 | " arrowprops=dict(facecolor='black', shrink=0.05),\n", 85 | " horizontalalignment='center', verticalalignment='center')\n", 86 | "plt.annotate(r'($0$, $0$, $1$)',\n", 87 | " xy=(0, 0), xytext=(.1, .1), xycoords='data',\n", 88 | " arrowprops=dict(facecolor='black', shrink=0.05),\n", 89 | " horizontalalignment='center', verticalalignment='center')\n", 90 | "plt.annotate(r'($1$, $0$, $0$)',\n", 91 | " xy=(1, 0), xytext=(1, .1), xycoords='data',\n", 92 | " arrowprops=dict(facecolor='black', shrink=0.05),\n", 93 | " horizontalalignment='center', verticalalignment='center')\n", 94 | "plt.annotate(r'($0$, $1$, $0$)',\n", 95 | " xy=(0, 1), xytext=(.1, 1), xycoords='data',\n", 96 | " arrowprops=dict(facecolor='black', shrink=0.05),\n", 97 | " horizontalalignment='center', verticalalignment='center')\n", 98 | "# Add grid\n", 99 | "plt.grid(\"off\")\n", 100 | "for x in [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]:\n", 101 | " plt.plot([0, x], [x, 0], 'k', alpha=0.2)\n", 102 | " plt.plot([0, 0 + (1-x)/2], [x, x + (1-x)/2], 'k', alpha=0.2)\n", 103 | " plt.plot([x, x + (1-x)/2], [0, 0 + (1-x)/2], 'k', alpha=0.2)\n", 104 | "\n", 105 | "plt.title(\"Change of predicted probabilities after sigmoid calibration\")\n", 106 | "plt.xlabel(\"Probability class 1\")\n", 107 | "plt.ylabel(\"Probability class 2\")\n", 108 | "plt.xlim(-0.05, 1.05)\n", 109 | "plt.ylim(-0.05, 1.05)\n", 110 | "plt.legend(loc=\"best\")\n", 111 | "\n", 112 | "print(\"Log-loss of\")\n", 113 | "print(\" * uncalibrated classifier trained on 800 datapoints: %.3f \"\n", 114 | " % score)\n", 115 | "print(\" * classifier trained on 600 datapoints and calibrated on \"\n", 116 | " \"200 datapoint: %.3f\" % sig_score)\n", 117 | "\n", 118 | "# Illustrate calibrator\n", 119 | "plt.figure(1)\n", 120 | "# generate grid over 2-simplex\n", 121 | "p1d = np.linspace(0, 1, 20)\n", 122 | "p0, p1 = np.meshgrid(p1d, p1d)\n", 123 | "p2 = 1 - p0 - p1\n", 124 | "p = np.c_[p0.ravel(), p1.ravel(), p2.ravel()]\n", 125 | "p = p[p[:, 2] >= 0]\n", 126 | "\n", 127 | "calibrated_classifier = sig_clf.calibrated_classifiers_[0]\n", 128 | "prediction = np.vstack([calibrator.predict(this_p)\n", 129 | " for calibrator, this_p in\n", 130 | " zip(calibrated_classifier.calibrators_, p.T)]).T\n", 131 | "prediction /= prediction.sum(axis=1)[:, None]\n", 132 | "\n", 133 | "# Plot modifications of calibrator\n", 134 | "for i in range(prediction.shape[0]):\n", 135 | " plt.arrow(p[i, 0], p[i, 1],\n", 136 | 
" prediction[i, 0] - p[i, 0], prediction[i, 1] - p[i, 1],\n", 137 | " head_width=1e-2, color=colors[np.argmax(p[i])])\n", 138 | "# Plot boundaries of unit simplex\n", 139 | "plt.plot([0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], 'k', label=\"Simplex\")\n", 140 | "\n", 141 | "plt.grid(\"off\")\n", 142 | "for x in [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]:\n", 143 | " plt.plot([0, x], [x, 0], 'k', alpha=0.2)\n", 144 | " plt.plot([0, 0 + (1-x)/2], [x, x + (1-x)/2], 'k', alpha=0.2)\n", 145 | " plt.plot([x, x + (1-x)/2], [0, 0 + (1-x)/2], 'k', alpha=0.2)\n", 146 | "\n", 147 | "plt.title(\"Illustration of sigmoid calibrator\")\n", 148 | "plt.xlabel(\"Probability class 1\")\n", 149 | "plt.ylabel(\"Probability class 2\")\n", 150 | "plt.xlim(-0.05, 1.05)\n", 151 | "plt.ylim(-0.05, 1.05)\n", 152 | "\n", 153 | "plt.show()" 154 | ] 155 | } 156 | ], 157 | "metadata": { 158 | "kernelspec": { 159 | "display_name": "Python [Root]", 160 | "language": "python", 161 | "name": "Python [Root]" 162 | }, 163 | "language_info": { 164 | "codemirror_mode": { 165 | "name": "ipython", 166 | "version": 2 167 | }, 168 | "file_extension": ".py", 169 | "mimetype": "text/x-python", 170 | "name": "python", 171 | "nbconvert_exporter": "python", 172 | "pygments_lexer": "ipython2", 173 | "version": "2.7.12" 174 | } 175 | }, 176 | "nbformat": 4, 177 | "nbformat_minor": 0 178 | } 179 | -------------------------------------------------------------------------------- /day13/README.md: -------------------------------------------------------------------------------- 1 | # Day 13 2 | ## AWS for Data Science 3 | [Slides](https://docs.google.com/presentation/d/1MccQLSrsiqfB4T6H6PD5Ly6_KwBivl3uUC6u_JXxiiQ/edit?usp=sharing) 4 | 5 | [Video](https://www.youtube.com/watch?v=27UYt6fhNrU&feature=youtu.be) 6 | -------------------------------------------------------------------------------- /day14/README.md: -------------------------------------------------------------------------------- 1 | # Day 14 2 | ## Intro to Neural Networks 3 | [Slides](https://docs.google.com/presentation/d/1RpvBuvBmnd03uhOBiDZ9hvfxF0S3jx8UmQk9kWdXBVQ/edit#slide=id.g2038b5ec23_0_32) 4 | 5 | [Video](https://www.youtube.com/watch?v=LLUISB1iCnw) 6 | -------------------------------------------------------------------------------- /day14/nn.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "'''\n", 12 | "A Multilayer Perceptron implementation example using TensorFlow library.\n", 13 | "This example is using the MNIST database of handwritten digits\n", 14 | "(http://yann.lecun.com/exdb/mnist/)\n", 15 | "\n", 16 | "Author: Aymeric Damien\n", 17 | "Project: https://github.com/aymericdamien/TensorFlow-Examples/\n", 18 | "'''" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 1, 24 | "metadata": { 25 | "collapsed": false 26 | }, 27 | "outputs": [ 28 | { 29 | "name": "stdout", 30 | "output_type": "stream", 31 | "text": [ 32 | "Extracting MNIST_data/train-images-idx3-ubyte.gz\n", 33 | "Extracting MNIST_data/train-labels-idx1-ubyte.gz\n", 34 | "Extracting MNIST_data/t10k-images-idx3-ubyte.gz\n", 35 | "Extracting MNIST_data/t10k-labels-idx1-ubyte.gz\n" 36 | ] 37 | } 38 | ], 39 | "source": [ 40 | "# Import MINST data\n", 41 | "from tensorflow.examples.tutorials.mnist import input_data\n", 42 | "mnist = input_data.read_data_sets(\"MNIST_data/\", 
one_hot=True)\n", 43 | "\n", 44 | "import tensorflow as tf" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": 2, 50 | "metadata": { 51 | "collapsed": true 52 | }, 53 | "outputs": [], 54 | "source": [ 55 | "# Parameters\n", 56 | "learning_rate = 0.001\n", 57 | "training_epochs = 15\n", 58 | "batch_size = 100\n", 59 | "display_step = 1\n", 60 | "\n", 61 | "# Network Parameters\n", 62 | "n_hidden_1 = 256 # 1st layer number of features\n", 63 | "n_hidden_2 = 256 # 2nd layer number of features\n", 64 | "n_input = 784 # MNIST data input (img shape: 28*28)\n", 65 | "n_classes = 10 # MNIST total classes (0-9 digits)\n", 66 | "\n", 67 | "# tf Graph input\n", 68 | "x = tf.placeholder(\"float\", [None, n_input])\n", 69 | "y = tf.placeholder(\"float\", [None, n_classes])" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": 3, 75 | "metadata": { 76 | "collapsed": true 77 | }, 78 | "outputs": [], 79 | "source": [ 80 | "# Create model\n", 81 | "def multilayer_perceptron(x, weights, biases):\n", 82 | " # Hidden layer with RELU activation\n", 83 | " layer_1 = tf.add(tf.matmul(x, weights['h1']), biases['b1'])\n", 84 | " layer_1 = tf.nn.relu(layer_1)\n", 85 | " # Hidden layer with RELU activation\n", 86 | " layer_2 = tf.add(tf.matmul(layer_1, weights['h2']), biases['b2'])\n", 87 | " layer_2 = tf.nn.relu(layer_2)\n", 88 | " # Output layer with linear activation\n", 89 | " out_layer = tf.matmul(layer_2, weights['out']) + biases['out']\n", 90 | " return out_layer" 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": 5, 96 | "metadata": { 97 | "collapsed": false 98 | }, 99 | "outputs": [], 100 | "source": [ 101 | "# Store layers weight & bias\n", 102 | "weights = {\n", 103 | " 'h1': tf.Variable(tf.random_normal([n_input, n_hidden_1])),\n", 104 | " 'h2': tf.Variable(tf.random_normal([n_hidden_1, n_hidden_2])),\n", 105 | " 'out': tf.Variable(tf.random_normal([n_hidden_2, n_classes]))\n", 106 | "}\n", 107 | "biases = {\n", 108 | " 'b1': tf.Variable(tf.random_normal([n_hidden_1])),\n", 109 | " 'b2': tf.Variable(tf.random_normal([n_hidden_2])),\n", 110 | " 'out': tf.Variable(tf.random_normal([n_classes]))\n", 111 | "}\n", 112 | "\n", 113 | "# Construct model\n", 114 | "pred = multilayer_perceptron(x, weights, biases)\n", 115 | "\n", 116 | "# Define loss and optimizer\n", 117 | "cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))\n", 118 | "optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)\n", 119 | "\n", 120 | "# Initializing the variables\n", 121 | "init = tf.global_variables_initializer()" 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": 6, 127 | "metadata": { 128 | "collapsed": false 129 | }, 130 | "outputs": [ 131 | { 132 | "name": "stdout", 133 | "output_type": "stream", 134 | "text": [ 135 | "Epoch: 0001 cost= 173.056566575\n", 136 | "Epoch: 0002 cost= 44.054413928\n", 137 | "Epoch: 0003 cost= 27.455470655\n", 138 | "Epoch: 0004 cost= 19.008652363\n", 139 | "Epoch: 0005 cost= 13.654873594\n", 140 | "Epoch: 0006 cost= 10.059267435\n", 141 | "Epoch: 0007 cost= 7.436018432\n", 142 | "Epoch: 0008 cost= 5.587794416\n", 143 | "Epoch: 0009 cost= 4.209882509\n", 144 | "Epoch: 0010 cost= 3.203879515\n", 145 | "Epoch: 0011 cost= 2.319920681\n", 146 | "Epoch: 0012 cost= 1.676204545\n", 147 | "Epoch: 0013 cost= 1.248805338\n", 148 | "Epoch: 0014 cost= 1.052676844\n", 149 | "Epoch: 0015 cost= 0.890117338\n", 150 | "Optimization Finished!\n", 151 | "Accuracy: 0.9459\n" 152 | 
] 153 | } 154 | ], 155 | "source": [ 156 | "# Launch the graph\n", 157 | "with tf.Session() as sess:\n", 158 | " sess.run(init)\n", 159 | "\n", 160 | " # Training cycle\n", 161 | " for epoch in range(training_epochs):\n", 162 | " avg_cost = 0.\n", 163 | " total_batch = int(mnist.train.num_examples/batch_size)\n", 164 | " # Loop over all batches\n", 165 | " for i in range(total_batch):\n", 166 | " batch_x, batch_y = mnist.train.next_batch(batch_size)\n", 167 | " # Run optimization op (backprop) and cost op (to get loss value)\n", 168 | " _, c = sess.run([optimizer, cost], feed_dict={x: batch_x,\n", 169 | " y: batch_y})\n", 170 | " # Compute average loss\n", 171 | " avg_cost += c / total_batch\n", 172 | " # Display logs per epoch step\n", 173 | " if epoch % display_step == 0:\n", 174 | " print \"Epoch:\", '%04d' % (epoch+1), \"cost=\", \\\n", 175 | " \"{:.9f}\".format(avg_cost)\n", 176 | " print \"Optimization Finished!\"\n", 177 | "\n", 178 | " # Test model\n", 179 | " correct_prediction = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))\n", 180 | " # Calculate accuracy\n", 181 | " accuracy = tf.reduce_mean(tf.cast(correct_prediction, \"float\"))\n", 182 | " print \"Accuracy:\", accuracy.eval({x: mnist.test.images, y: mnist.test.labels})" 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": null, 188 | "metadata": { 189 | "collapsed": true 190 | }, 191 | "outputs": [], 192 | "source": [] 193 | } 194 | ], 195 | "metadata": { 196 | "kernelspec": { 197 | "display_name": "Python 2", 198 | "language": "python", 199 | "name": "python2" 200 | }, 201 | "language_info": { 202 | "codemirror_mode": { 203 | "name": "ipython", 204 | "version": 2 205 | }, 206 | "file_extension": ".py", 207 | "mimetype": "text/x-python", 208 | "name": "python", 209 | "nbconvert_exporter": "python", 210 | "pygments_lexer": "ipython2", 211 | "version": "2.7.13" 212 | } 213 | }, 214 | "nbformat": 4, 215 | "nbformat_minor": 0 216 | } 217 | -------------------------------------------------------------------------------- /day15/README.md: -------------------------------------------------------------------------------- 1 | # Day 15 2 | ## How Neural Networks Learn 3 | [Slides](https://docs.google.com/presentation/d/1xiRn9TyQKScqkoWdThKOQMqGiRVWgMV6tziWK6ky3PA/edit?usp=sharing) 4 | 5 | [Video](https://youtu.be/yB-2lG49raE) 6 | -------------------------------------------------------------------------------- /day16/README.md: -------------------------------------------------------------------------------- 1 | # Day 16 2 | ## Convolutional Neural Networks and Keras 3 | [Slides](https://docs.google.com/presentation/d/13mOHBoUSHtNNyuljG3qv5Vm70FmXwK_U_CtmsBeqLQw/edit?usp=sharing) 4 | 5 | [Video](https://youtu.be/_XQzT5Ik6Zo) 6 | # Updates 7 | Don't forget to do the Keras homework (located in the `homework` folder) and start the final project. 
8 | -------------------------------------------------------------------------------- /day16/cat.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kaggledecal/sp17/52649d867d00787d92ef4a239605c23f04933b57/day16/cat.jpg -------------------------------------------------------------------------------- /day16/dog.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kaggledecal/sp17/52649d867d00787d92ef4a239605c23f04933b57/day16/dog.jpeg -------------------------------------------------------------------------------- /day16/imagenet_utils.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import json 3 | 4 | from keras.utils.data_utils import get_file 5 | from keras import backend as K 6 | 7 | CLASS_INDEX = None 8 | CLASS_INDEX_PATH = 'https://s3.amazonaws.com/deep-learning-models/image-models/imagenet_class_index.json' 9 | 10 | 11 | def preprocess_input(x, dim_ordering='default'): 12 | if dim_ordering == 'default': 13 | dim_ordering = K.image_dim_ordering() 14 | assert dim_ordering in {'tf', 'th'} 15 | 16 | if dim_ordering == 'th': 17 | x[:, 0, :, :] -= 103.939 18 | x[:, 1, :, :] -= 116.779 19 | x[:, 2, :, :] -= 123.68 20 | # 'RGB'->'BGR' 21 | x = x[:, ::-1, :, :] 22 | else: 23 | x[:, :, :, 0] -= 103.939 24 | x[:, :, :, 1] -= 116.779 25 | x[:, :, :, 2] -= 123.68 26 | # 'RGB'->'BGR' 27 | x = x[:, :, :, ::-1] 28 | return x 29 | 30 | 31 | def decode_predictions(preds, top=5): 32 | global CLASS_INDEX 33 | if len(preds.shape) != 2 or preds.shape[1] != 1000: 34 | raise ValueError('`decode_predictions` expects ' 35 | 'a batch of predictions ' 36 | '(i.e. a 2D array of shape (samples, 1000)). 
' 37 | 'Found array with shape: ' + str(preds.shape)) 38 | if CLASS_INDEX is None: 39 | fpath = get_file('imagenet_class_index.json', 40 | CLASS_INDEX_PATH, 41 | cache_subdir='models') 42 | CLASS_INDEX = json.load(open(fpath)) 43 | results = [] 44 | for pred in preds: 45 | top_indices = pred.argsort()[-top:][::-1] 46 | result = [tuple(CLASS_INDEX[str(i)]) + (pred[i],) for i in top_indices] 47 | results.append(result) 48 | return results -------------------------------------------------------------------------------- /day16/utils.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | import matplotlib.pyplot as plt 4 | def load_mnist_dataset(): 5 | ''' 6 | Loads the MNIST training data from a CSV file 7 | :return: The training images and their corresponding labels 8 | ''' 9 | mndata = pd.read_csv('../../datasets/MNIST/train.csv') 10 | labels_train = mndata['label'] 11 | X_train = mndata.drop('label', axis=1).as_matrix() 12 | return X_train, labels_train 13 | def one_hot(labels_train, num_classes=10): 14 | '''Convert categorical labels to standard basis vectors in R^{num_classes} ''' 15 | return np.eye(num_classes)[labels_train] 16 | def train_test_split(data, labels, test_size=1/6): 17 | assert len(data) == len(labels), "Data and labels are mismatched lengths" 18 | indices = np.random.permutation(data.shape[0]) 19 | train_test_index = int(len(labels) * test_size) 20 | test_idx, training_idx = indices[:train_test_index], indices[train_test_index:] 21 | X_train, X_test = data[training_idx,:], data[test_idx,:] 22 | y_train, y_test = labels[training_idx,:], labels[test_idx,:] 23 | return X_train, y_train, X_test, y_test 24 | def plot_mnist(digit, title=None): 25 | assert len(digit.shape) < 3, "Digit must be a vector or matrix, not bigger" 26 | if 784 in digit.shape: 27 | digit = digit.reshape((28,28)) 28 | if title: 29 | plt.title(title) 30 | plt.imshow(digit, cmap='gray') 31 | plt.show() -------------------------------------------------------------------------------- /final_proj.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kaggledecal/sp17/52649d867d00787d92ef4a239605c23f04933b57/final_proj.pdf -------------------------------------------------------------------------------- /guide/DockerCheatsheet.md: -------------------------------------------------------------------------------- 1 | # Docker Cheatsheet 2 | This is a cheatsheet that will cover most of the commands/situations you will encounter in this class! If you want to learn more, [here's a more in depth cheatsheet]( 3 | https://github.com/wsargent/docker-cheat-sheet) 4 | 5 | 6 | 7 | 8 | 9 | ## Starting a jupyter container 10 | **Start a jupyter notebook container without mounting a directory** 11 | 12 | `docker run -d -p 8888:8888 jupyter/scipy-notebook` 13 | 14 | **Start a jupyter container with your current directory mounted** 15 | 16 | `docker run -d -p 8888:8888 -v "$(pwd)":/home/jovyan/work jupyter/scipy-notebook` 17 | 18 | *Notes* 19 | * Do not change `/home/jovyan/work`. This is a setting for the container 20 | 21 | ## Stopping a container 22 | To stop a container, you'll first need to figure out the container name. 
To find a list of running containers simply run: 23 | 24 | `docker ps` 25 | 26 | and you'll get output that looks like this 27 | ``` 28 | CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES 29 | a564ee08a199 jupyter/scipy-notebook "tini -- start-notebo" 2 hours ago Up 2 hours 0.0.0.0:8888->8888/tcp sharp_dijkstra 30 | ``` 31 | 32 | Go to the NAMES column, copy the name (sharp_dijkstra in this case), then paste it in this command 33 | 34 | `docker stop <container name>` 35 | 36 | For the example this would be `docker stop sharp_dijkstra`. 37 | 38 | ## Restarting a stopped container 39 | Using the same container name, 40 | 41 | `docker start <container name>` 42 | 43 | ## Removing a stopped container 44 | Using the same container name as before 45 | 46 | `docker rm <container name>` 47 | 48 | *Notes* 49 | * Usually you'll want to remove a container well after it's been stopped. To get a list of all stopped containers, you'll want to add the `-a` flag to `docker ps` as so 50 | `docker ps -a` 51 | 52 | 53 | # FAQs 54 | * ```docker: Error response from daemon: driver failed programming external connectivity on endpoint modest_turing (96452e3e228f072c778ba9c91d7c95c232ad166cae85f14d44efa43c57363bcf): Bind for 0.0.0.0:8888 failed: port is already allocated.``` 55 | 56 | This usually means you already have a jupyter container running (or are running a jupyter server separately that is already serving port 8888). All you need to do is stop the previous container to start a new one. 57 | 58 | * ```docker: Error parsing reference: "\u00add" is not a valid repository/tag See 'docker run --help'. ``` 59 | 60 | Our apologies for this! This is what happens when PDFs encode hyphens apparently. All you need to do for this is copy the start container command from above and this issue will disappear 61 | 62 | * I am on Windows and `X` keeps happening. 63 | You can find us during office hours or take the easy way out and [install anaconda](https://www.continuum.io/downloads) 64 | 65 | ![Windoze troll](https://i.imgflip.com/1amtdy.jpg) 66 | -------------------------------------------------------------------------------- /guide/deepaws.md: -------------------------------------------------------------------------------- 1 | # Deep Learning on AWS 2 | ## By [Phillip Kuznetsov](https://github.com/philkuz/) 3 | Up to date version available [here](https://github.com/philkuz/DeepAWS). 4 | 5 | 6 | In this article you're going to learn how to set up a Deep Learning server on AWS so that you can run all of your favorite neural network models on the hardware you need. Not only that, I'll also show you how to set up a Jupyter Notebook server to make your neural network experiments that much easier. 7 | 8 | AWS is an excellent alternative to buying your own GPU. Running an Amazon GPU instance costs a fraction of the price of new hardware, and you won't have to deal with setting your machine up from scratch. 9 | 10 | Furthermore, if you're a student in high school or college, you can easily get a bunch of free AWS credits from [AWS Educate](http://www.awseducate.com/) 11 | 12 | For this guide we'll use the AMI managed by Github user [Miej](https://github.com/Miej) called [GoDeeper](https://github.com/Miej/GoDeeper). This AMI has a bunch of common deep learning packages ranging from Tensorflow, Keras, Torch and even OpenCV so that you can run all of that cutting-edge research you desire with ease. The repo has more details on what else is installed in the AMI. 13 | 14 | Now let's move on to the meat of this problem. 
If you were to go straight to running an on-demand p2 instance, you'd burn through your money rather quickly. However, there is a better method than simply launching instances. I typically use these things called spot requests. Basically, you bid for server time at a price significantly lower than that of dedicated instances. There's certainly a risk that you'll be outbid while running some program, causing you to be kicked off. However, I've avoided this problem by setting a reasonably high maximum bid that is still cheaper than the price of the dedicated instance. 15 | 16 | You'll also want to prioritize the region of your machine based on pricing. 17 | Amazon publishes the [current spot pricing](https://aws.amazon.com/ec2/spot/pricing/) for all machine types in every region. Make note that only a few regions have p2 instances. For me in the US, the two closest regions that fit this criterion are North Virginia and Oregon. I'll typically go to the spot pricing portal and check which region has the cheapest pricing at the moment. However, if you live on a different continent, you'll want to look for regions that are close to you that have p2 instances. You can use the spot pricing tool to see whether any region has a p2 instance. If you see an N/A next to the p2 instance, that means this type of machine is not available in that region. 18 | 19 | Now let's get on with the tutorial. To get started, you'll need to log in to the AWS console. Here you'll want to click Services, then EC2. At the top left you'll want to confirm that you are in a region that has GPU instances. I'll be using the Oregon region because it was the cheapest when I checked the spot pricing. On the left panel, click Spot Requests and then click the Big Blue `Request Spot Instances`. 20 | 21 | In the Spot Request wizard, you'll want to go down to the AMI drop down and click Select. You'll be met with a window. Change the dropdown to `Community AMIs`. Now go to the [GoDeeper repo](https://github.com/Miej/GoDeeper) and find the specific ami id for the region you are in. Since I am using Oregon, I will use `ami-da3096ba`. 22 | 23 | In the next entry, you'll be selecting the machine type. First, remove the default instance type by clicking the `x` in the gray box. Next, click the Select button. In the window that pops up, scroll down to the p2.xlarge row, click it, and press OK. 24 | 25 | Everything else should be set as default except Maximum Price if you want to put a limit on how much you'll bid for an instance. I'll typically set the max price to the maximum price of the past week + a few cents of leeway. You can determine this price by clicking on the price history button that appears after you select `Set max price per hour`. 26 | 27 | Click Next. You don't really need to worry about the size of the EBS volume here as the AMI already comes with 100GB. So change the volume as you please. Note that larger volumes cost slightly more money. A [pricing schedule is available here](https://aws.amazon.com/ebs/pricing/). 28 | 29 | Next you'll want to make a key-pair if you haven't already. Make sure that you save this somewhere you will remember, because you need to use it to ssh into your machine. However, make sure you don't add it to any publicly hosted git repos for security purposes. I typically save my keys in the `~/.ssh/` directory. Once you've moved the key you'll want to change the permissions to make the key safe. 
Simply run 30 | ``` 31 | sudo chmod 400 ~/.ssh/<key name> 32 | ``` 33 | 34 | Now you'll want to create a security group called Jupyter that has 3 inbound rules: 35 | 1. SSH 36 | 2. HTTPS 37 | 3. Custom TCP Rule - 8888 38 | 39 | After you've set those, you'll want to make sure that you change the source to `Anywhere` for each option. Setting the exact source certainly increases the security of your instance; however, this can be problematic if you plan to switch networks or you don't have a static ip on your network. I just leave it as `Anywhere` because I don't use my gpu instances for longer than a few days anyways. 40 | 41 | Return to the wizard, refresh the security groups panel and you should see your security group up. Select the checkbox next to it. 42 | 43 | Finally, I'll set a timeout limit for the instance. This is just to make sure I don't accidentally leave the instance running for weeks on end, racking up a bunch of charges on my account. I'll set mine for a week from now because I won't be using this instance for a very long time. 44 | 45 | Click Next, then Launch Instance. 46 | 47 | Now you'll want to confirm that the instance request has been fulfilled. Click the instances tab in the left panel and on the new page you should see an instance up and running. In the bottom tab you'll see a description. Copy the public dns that you see in the right column and navigate to your terminal. 48 | 49 | You'll want to run the command 50 | ``` 51 | ssh -i /path/to/key.pem icarus@<public dns> 52 | ``` 53 | it'll prompt you for the password, which is `changetheworld` 54 | And you should have access! 55 | 56 | ## Setting up Jupyter 57 | Now here comes the fun part: let's set up a Jupyter notebook for our server. I found this answer in a [Quora post](https://www.quora.com/How-do-I-create-Jupyter-notebook-on-AWS) a while back and have modified it slightly for this tutorial. 58 | 59 | The steps here are simple: run 60 | ``` 61 | git clone https://gist.github.com/philkuz/4b7fda8bc2eba4f9a1ba71c54321c126 nb 62 | . nb/jupyter_notebook_ec2.sh 63 | ``` 64 | From this you'll be prompted to enter a password. Then you'll be given a series of questions about the certs. I just leave them as default as they are not important for our purposes. 65 | 66 | Now run these commands 67 | ``` 68 | cd;mkdir notebook;cd notebook 69 | tmux new -s nb 70 | jupyter notebook --certfile=~/certs/mycert.pem --keyfile ~/certs/mycert.key 71 | ``` 72 | *Note: you can use screen instead of tmux - this is meant so you can exit ssh and leave the jupyter notebook running* 73 | 74 | And with that setup, all you need to do is navigate to ```https://<public dns>:8888``` and you'll have your notebook up and running and accessible! *Don't forget the `https` part of the url otherwise it won't work*. 75 | 76 | You should be met with a self-signed cert error. Even though it's scary, this is fine. You should be able to find a link that allows you to proceed anyways. 77 | 78 | As a test of whether this works, let's open and run my repo [Neural Network Zoo](https://github.com/philkuz/Neural-Network-Zoo). 79 | 80 | Start a new terminal, then enter 81 | ``` 82 | git clone https://github.com/philkuz/Neural-Network-Zoo 83 | ``` 84 | Once you see a success signal, close the window you are in, and you should see a new folder called `Neural-Network-Zoo`. Enter the folder and open any notebook you'd like, try running the cells and seeing if they work. 
The first run of a notebook will take some extra time because TensorFlow has to initialize the GPU, but you'll have no such delay on later runs. 85 | 86 | And with that, embark on your deep learning journey, friend. Let me know through GitHub issues or pull requests what you think should be changed in this guide. 87 | -------------------------------------------------------------------------------- /guide/keras.md: -------------------------------------------------------------------------------- 1 | # Keras Installation Guide 2 | ## The Docker Way 3 | If you're one of the many students we got to install Docker at the beginning of the semester, you're in luck! Installing Keras can be as simple as running a single command. 4 | 5 | In your terminal, all you need to run is 6 | ``` 7 | docker run -d -p 8888:8888 ermaker/keras-jupyter 8 | ``` 9 | If you want to mount the current directory, use 10 | ``` 11 | docker run -d -p 8888:8888 -v $(pwd):/notebook ermaker/keras-jupyter 12 | ``` 13 | and you're done! The Docker daemon will pull the Keras image, and all you need to do is load up localhost:8888 to open the Jupyter notebook server! 14 | 15 | ## The other way 16 | Setting up Keras without Docker is a little more involved. Fortunately, Python's package management system makes things a little easier. 17 | 18 | Before you go and pip install, however, you'll want to configure the "backend" for Keras. The great thing about Keras' setup is that you have the option to use either of two deep learning backends: TensorFlow or Theano. 19 | ### Mac/Linux 20 | You'll want to install TensorFlow before you install Keras. As long as you have pip installed, all you need to do is find the matching binary and run the appropriate command. 21 | 22 | I've added a list of relevant lines here; there are also instructions for other [hardware and Python versions](https://www.tensorflow.org/versions/r0.11/get_started/os_setup.html). 23 | ``` 24 | # Ubuntu/Linux 64-bit, CPU only, Python 3.4 25 | export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.11.0rc2-cp34-cp34m-linux_x86_64.whl 26 | 27 | # Ubuntu/Linux 64-bit, CPU only, Python 3.5 28 | export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.11.0rc2-cp35-cp35m-linux_x86_64.whl 29 | 30 | # Mac OS X, CPU only, Python 3.4 or 3.5: 31 | export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-0.11.0rc2-py3-none-any.whl 32 | ``` 33 | 34 | After running the export line that matches your machine, simply run 35 | ``` 36 | pip install $TF_BINARY_URL keras 37 | ``` 38 | So if I were installing this on my Mac with only a CPU, the set of commands would simply be 39 | ``` 40 | export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-0.11.0rc2-py3-none-any.whl 41 | pip install $TF_BINARY_URL keras 42 | ``` 43 | 44 | ### Windows 45 | ``` 46 | pip install keras 47 | ``` 48 | If you're on Windows, unfortunately you won't be able to use TensorFlow on your machine. However, you can still use the Docker container above! 49 | 50 | Even so, Theano will still install and will be perfectly fine for the work we'll be doing. In the future, though, you'll probably want to switch over to TensorFlow, because its benchmarks are significantly faster than Theano's.
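If you're ever unsure which backend your install ended up with, you can ask Keras directly. A minimal check (Keras reads its backend setting from `~/.keras/keras.json`, which it creates on first import):
```
import keras.backend as K

# Prints 'tensorflow' or 'theano', whichever backend Keras loaded.
print(K.backend())
```
If it prints the wrong one, edit the `backend` field in `~/.keras/keras.json` and re-import.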
51 | -------------------------------------------------------------------------------- /homework/Keras is cool!.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": { 7 | "collapsed": false 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import keras\n", 12 | "import numpy as np" 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "metadata": {}, 18 | "source": [ 19 | "# Keras\n", 20 | "Keras is a neural network framework that wraps TensorFlow (if you haven't heard of TensorFlow, it's another neural network framework) and makes it really simple to implement common neural networks. Its philosophy is to make simple things easy (but beware: implementing uncommon, custom neural networks can be pretty challenging in Keras; for the purposes of this course you will never have to do that, though, so don't worry about it). If you are ever confused during this homework, Keras has really good documentation, so just go to the [Keras Docs](https://keras.io)" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "# Datasets\n", 28 | "Keras has many datasets conveniently built into the library. We can access them from the ``keras.datasets`` module. For this homework, we will be using their housing price dataset, their image classification dataset, and their movie review sentiment dataset. For a full list of their datasets, see [Keras Datasets](https://keras.io/datasets). To use a dataset, we just import it and then call ``load_data()``; it returns two tuples, the first of which is training data and the second of which is testing data. See the example below." 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": null, 34 | "metadata": { 35 | "collapsed": false 36 | }, 37 | "outputs": [], 38 | "source": [ 39 | "from keras.datasets import boston_housing\n", 40 | "(x_train, y_train), (x_test, y_test) = boston_housing.load_data()" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": {}, 46 | "source": [ 47 | "You can also choose the proportion of the data that is held out for testing." 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": null, 53 | "metadata": { 54 | "collapsed": false 55 | }, 56 | "outputs": [], 57 | "source": [ 58 | "print(\"Size of training set before: \", x_train.shape)\n", 59 | "(x_train, y_train), (x_test, y_test) = boston_housing.load_data(test_split=0.10)\n", 60 | "print(\"Size of training set after: \", x_train.shape)" 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": null, 66 | "metadata": { 67 | "collapsed": false 68 | }, 69 | "outputs": [], 70 | "source": [ 71 | "from keras.utils import normalize\n", 72 | "x_train = normalize(x_train, axis=1)\n", 73 | "x_test = normalize(x_test, axis=1)" 74 | ] 75 | }, 76 | { 77 | "cell_type": "markdown", 78 | "metadata": {}, 79 | "source": [ 80 | "# Models\n", 81 | "Everything in Keras starts out with a model. From an initial model, we can add layers, train the model on data, evaluate the model on test sets, etc. We initialize a model with ``Sequential()``. Sequential refers to the fact that the model has a sequence of layers. Personally, I have very rarely used anything other than Sequential, so I think it's all you really need to worry about."
82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": null, 87 | "metadata": { 88 | "collapsed": false 89 | }, 90 | "outputs": [], 91 | "source": [ 92 | "from keras.models import Sequential\n", 93 | "model = Sequential()" 94 | ] 95 | }, 96 | { 97 | "cell_type": "markdown", 98 | "metadata": {}, 99 | "source": [ 100 | "Once we have a model, we can add layers to it with ``model.add``. Keras has a really good range of layers we can use. For example, if we want a basic fully connected layer, we can use ``Dense``. I will now run through an example of using Keras to build and train a fully connected neural network that regresses on housing prices from the dataset we loaded earlier." 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "execution_count": null, 106 | "metadata": { 107 | "collapsed": true 108 | }, 109 | "outputs": [], 110 | "source": [ 111 | "from keras.layers import Dense\n", 112 | "model.add(Dense(16, input_shape=(13,)))" 113 | ] 114 | }, 115 | { 116 | "cell_type": "markdown", 117 | "metadata": {}, 118 | "source": [ 119 | "This line of code adds a fully connected layer with 16 neurons. For the first layer of any model we always have to specify the input shape. In our case we will be training a fully connected network on the Boston housing data, where each data point has 13 features. That's why we use an input_shape of (13,). The nice part about Keras is that, other than the input_shape for the first layer, we don't have to worry about shapes; Keras takes care of them. This can be really useful when you are doing complicated convolutions and the like, where working out the input shape to the next layer can be non-trivial." 120 | ] 121 | }, 122 | { 123 | "cell_type": "markdown", 124 | "metadata": {}, 125 | "source": [ 126 | "Now let's add an activation function to our network after our first fully connected layer." 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": null, 132 | "metadata": { 133 | "collapsed": true 134 | }, 135 | "outputs": [], 136 | "source": [ 137 | "from keras.layers import Activation\n", 138 | "model.add(Activation('relu'))" 139 | ] 140 | }, 141 | { 142 | "cell_type": "markdown", 143 | "metadata": {}, 144 | "source": [ 145 | "Simple as that. We just added a relu activation to the whole layer. To see a list of activation functions available in Keras, go to [Keras Activations](https://keras.io/activations/). Now let's add the final layer of our model." 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": null, 151 | "metadata": { 152 | "collapsed": true 153 | }, 154 | "outputs": [], 155 | "source": [ 156 | "model.add(Dense(1))" 157 | ] 158 | }, 159 | { 160 | "cell_type": "markdown", 161 | "metadata": {}, 162 | "source": [ 163 | "Now we can use a handy utility in Keras to print out what our model looks like so far." 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": null, 169 | "metadata": { 170 | "collapsed": false 171 | }, 172 | "outputs": [], 173 | "source": [ 174 | "model.summary()" 175 | ] 176 | }, 177 | { 178 | "cell_type": "markdown", 179 | "metadata": {}, 180 | "source": [ 181 | "You can see it shows us what layers we have, the output shapes of each layer, and how many parameters there are in each layer. All this information can be really useful when trying to debug a model, or even for sharing your model architecture with others."
182 | ] 183 | }, 184 | { 185 | "cell_type": "markdown", 186 | "metadata": {}, 187 | "source": [ 188 | "# Training\n", 189 | "Now for actually training the model. Before we train a model we have to compile it. ``model.compile`` is how you specify which optimizer and which loss function to use. Choosing the right optimizer can sometimes have a significant effect on model performance. For a list of optimizers, look at [Keras Optimizers](https://keras.io/optimizers). Choosing the right optimizer is mostly just trying each one to see which works best; there is some general advice on when to use each one, but it's basically just another hyperparameter. We also have to choose a loss function. Choosing the right loss function is really important, since the loss function basically decides what the goal of the model is. Since we are doing regression, we want mean squared error, to push our output to be as close as possible to the label. " 190 | ] 191 | }, 192 | { 193 | "cell_type": "code", 194 | "execution_count": null, 195 | "metadata": { 196 | "collapsed": false 197 | }, 198 | "outputs": [], 199 | "source": [ 200 | "model.compile(optimizer='SGD', loss='mean_squared_error')" 201 | ] 202 | }, 203 | { 204 | "cell_type": "markdown", 205 | "metadata": {}, 206 | "source": [ 207 | "Now we have to actually train our model on the data. This is really easy in Keras; in fact, it only takes one line of code." 208 | ] 209 | }, 210 | { 211 | "cell_type": "code", 212 | "execution_count": null, 213 | "metadata": { 214 | "collapsed": false 215 | }, 216 | "outputs": [], 217 | "source": [ 218 | "model.fit(x_train, y_train, epochs=100)" 219 | ] 220 | }, 221 | { 222 | "cell_type": "markdown", 223 | "metadata": {}, 224 | "source": [ 225 | "# Evaluation\n", 226 | "Now that we have trained our model, we can evaluate it on our testing set. It is also just one line of code." 227 | ] 228 | }, 229 | { 230 | "cell_type": "code", 231 | "execution_count": null, 232 | "metadata": { 233 | "collapsed": false 234 | }, 235 | "outputs": [], 236 | "source": [ 237 | "print(\"Loss: \", model.evaluate(x_test, y_test, verbose=0))" 238 | ] 239 | }, 240 | { 241 | "cell_type": "markdown", 242 | "metadata": {}, 243 | "source": [ 244 | "This loss might seem very high, and it is, mostly because there aren't very many training points in the dataset (also, no effort was put into finding the best model)." 245 | ] 246 | }, 247 | { 248 | "cell_type": "markdown", 249 | "metadata": {}, 250 | "source": [ 251 | "We can also generate predictions for new data that we don't have labels for. Since we don't have new data, I will just demonstrate the idea with our testing data." 252 | ] 253 | }, 254 | { 255 | "cell_type": "code", 256 | "execution_count": null, 257 | "metadata": { 258 | "collapsed": false, 259 | "scrolled": true 260 | }, 261 | "outputs": [], 262 | "source": [ 263 | "y_predicted = model.predict(x_test)\n", 264 | "print(y_predicted)" 265 | ] 266 | }, 267 | { 268 | "cell_type": "markdown", 269 | "metadata": { 270 | "collapsed": false 271 | }, 272 | "source": [ 273 | "That's it. We have successfully (depending on your definition of success) built a fully connected neural network and trained that network on a dataset. Now it's your turn."
274 | ] 275 | }, 276 | { 277 | "cell_type": "markdown", 278 | "metadata": { 279 | "collapsed": true 280 | }, 281 | "source": [ 282 | "# Problem 1: Image Classification\n", 283 | "We are going to build a convolutional neural network to predict image classes on CIFAR-10, a dataset of images of 10 different things (i.e. 10 classes): things like airplanes, cars, deer, horses, etc. " 284 | ] 285 | }, 286 | { 287 | "cell_type": "markdown", 288 | "metadata": {}, 289 | "source": [ 290 | "**(a)** Load the cifar10 dataset from Keras. If you need a hint, go to [Keras Datasets](https://keras.io/datasets). This might take a little while to download." 291 | ] 292 | }, 293 | { 294 | "cell_type": "code", 295 | "execution_count": null, 296 | "metadata": { 297 | "collapsed": false 298 | }, 299 | "outputs": [], 300 | "source": [ 301 | "from keras.datasets import ???\n", 302 | "(cifar_x_train, cifar_y_train), (cifar_x_test, cifar_y_test) = ???" 303 | ] 304 | }, 305 | { 306 | "cell_type": "markdown", 307 | "metadata": {}, 308 | "source": [ 309 | "**(b)** Initialize a Sequential model" 310 | ] 311 | }, 312 | { 313 | "cell_type": "code", 314 | "execution_count": null, 315 | "metadata": { 316 | "collapsed": true 317 | }, 318 | "outputs": [], 319 | "source": [ 320 | "cifar_model = ???" 321 | ] 322 | }, 323 | { 324 | "cell_type": "markdown", 325 | "metadata": {}, 326 | "source": [ 327 | "**(c)** Add a ``Conv2D`` layer to the model. It should have 32 filters, a 5x5 kernel, and a 1x1 stride. The documentation [here](https://keras.io/layers/convolutional/#conv2d) will be your friend for this problem. __Hint:__ This is the first layer of the model, so you have to specify the input shape. I recommend printing ``cifar_x_train.shape`` to get an idea of what the shape of the data looks like. Then add a ```relu``` activation layer to the model." 328 | ] 329 | }, 330 | { 331 | "cell_type": "code", 332 | "execution_count": null, 333 | "metadata": { 334 | "collapsed": false 335 | }, 336 | "outputs": [], 337 | "source": [ 338 | "from keras.layers.convolutional import Conv2D\n", 339 | "print(cifar_x_train.shape)\n", 340 | "##YOUR CODE HERE" 341 | ] 342 | }, 343 | { 344 | "cell_type": "markdown", 345 | "metadata": {}, 346 | "source": [ 347 | "**(d)** Add a ``MaxPooling2D`` layer to the model. The layer should have a 2x2 pool size. The documentation for Max Pooling is [here](https://keras.io/layers/pooling/)." 348 | ] 349 | }, 350 | { 351 | "cell_type": "code", 352 | "execution_count": null, 353 | "metadata": { 354 | "collapsed": true 355 | }, 356 | "outputs": [], 357 | "source": [ 358 | "from keras.layers.pooling import MaxPooling2D\n", 359 | "##YOUR CODE HERE" 360 | ] 361 | }, 362 | { 363 | "cell_type": "markdown", 364 | "metadata": {}, 365 | "source": [ 366 | "**(e)** Add another ``Conv2D`` layer identical to the last one, then another ``relu`` activation, then another ``MaxPooling2D`` layer. __Hint:__ You've already written this code" 367 | ] 368 | }, 369 | { 370 | "cell_type": "code", 371 | "execution_count": null, 372 | "metadata": { 373 | "collapsed": true 374 | }, 375 | "outputs": [], 376 | "source": [ 377 | "##YOUR CODE HERE" 378 | ] 379 | }, 380 | { 381 | "cell_type": "markdown", 382 | "metadata": {}, 383 | "source": [ 384 | "**(f)** Add another ``Conv2D`` layer identical to the others, except with 64 filters instead of 32. Add another ``relu`` activation layer."
385 | ] 386 | }, 387 | { 388 | "cell_type": "code", 389 | "execution_count": null, 390 | "metadata": { 391 | "collapsed": false 392 | }, 393 | "outputs": [], 394 | "source": [ 395 | "##YOUR CODE HERE" 396 | ] 397 | }, 398 | { 399 | "cell_type": "markdown", 400 | "metadata": {}, 401 | "source": [ 402 | "**(g)** Now we want to move from 2D data to 1D vectors for classification; to do this we have to flatten the data. Keras has a layer for this called [Flatten](https://keras.io/layers/core/#flatten). Then add a ``Dense`` (fully connected) layer with 64 neurons, a ``relu`` activation layer, another ``Dense`` layer with 10 neurons, and a ``softmax`` activation layer." 403 | ] 404 | }, 405 | { 406 | "cell_type": "code", 407 | "execution_count": null, 408 | "metadata": { 409 | "collapsed": false 410 | }, 411 | "outputs": [], 412 | "source": [ 413 | "from keras.layers import Flatten\n", 414 | "##YOUR CODE HERE" 415 | ] 416 | }, 417 | { 418 | "cell_type": "markdown", 419 | "metadata": {}, 420 | "source": [ 421 | "Notice that we have constructed a network that takes in an image and outputs a vector of 10 numbers, then takes the softmax of these, which leaves us with a vector of 10 probabilities that sum to 1; the position of the largest probability corresponds to the class the network is predicting for that image. This is sort of the canonical way of doing image classification." 422 | ] 423 | }, 424 | { 425 | "cell_type": "markdown", 426 | "metadata": {}, 427 | "source": [ 428 | "**(h)** Now print a summary of your network." 429 | ] 430 | }, 431 | { 432 | "cell_type": "code", 433 | "execution_count": null, 434 | "metadata": { 435 | "collapsed": false 436 | }, 437 | "outputs": [], 438 | "source": [ 439 | "##YOUR CODE HERE" 440 | ] 441 | }, 442 | { 443 | "cell_type": "markdown", 444 | "metadata": {}, 445 | "source": [ 446 | "**(i)** We need to convert our labels from integers to length 10 vectors with 9 zeros and 1 one, where the integer label is the index of the 1 in the vector. Luckily, Keras has a handy function to do this for us. Have a look [here](https://keras.io/utils/#to_categorical)" 447 | ] 448 | }, 449 | { 450 | "cell_type": "code", 451 | "execution_count": null, 452 | "metadata": { 453 | "collapsed": true 454 | }, 455 | "outputs": [], 456 | "source": [ 457 | "from keras.utils import to_categorical\n", 458 | "y_train_cat = ???\n", 459 | "y_test_cat = ???" 460 | ] 461 | }, 462 | { 463 | "cell_type": "markdown", 464 | "metadata": {}, 465 | "source": [ 466 | "**(j)** Now compile the model with the SGD optimizer and the categorical_crossentropy loss function; also include ``metrics=['accuracy']`` as a parameter so we can see the accuracy of the model. Then train the model on the training data. For training we want to weight the classes in the loss function, so set the ``class_weight`` parameter of ``fit`` to be the ``class_weights`` dictionary. Be warned, training can take forever; I trained on a CPU for 20 epochs (about 30 minutes) and only got 20% accuracy. For the purposes of this assignment you don't need to worry too much about accuracy, just train for at least 1 epoch."
467 | ] 468 | }, 469 | { 470 | "cell_type": "code", 471 | "execution_count": null, 472 | "metadata": { 473 | "collapsed": false 474 | }, 475 | "outputs": [], 476 | "source": [ 477 | "##YOUR COMPILING CODE HERE" 478 | ] 479 | }, 480 | { 481 | "cell_type": "code", 482 | "execution_count": null, 483 | "metadata": { 484 | "collapsed": false 485 | }, 486 | "outputs": [], 487 | "source": [ 488 | "class_weights = {}\n", 489 | "for i in range(10):\n", 490 | " class_weights[i] = 1. / np.where(cifar_y_train==i)[0].size\n", 491 | "\n", 492 | "##YOUR TRAINING CODE HERE" 493 | ] 494 | }, 495 | { 496 | "cell_type": "markdown", 497 | "metadata": {}, 498 | "source": [ 499 | "Now we can evaluate on our test set." 500 | ] 501 | }, 502 | { 503 | "cell_type": "code", 504 | "execution_count": null, 505 | "metadata": { 506 | "collapsed": false 507 | }, 508 | "outputs": [], 509 | "source": [ 510 | "cifar_model.evaluate(cifar_x_test, y_test_cat)" 511 | ] 512 | }, 513 | { 514 | "cell_type": "markdown", 515 | "metadata": {}, 516 | "source": [ 517 | "We can also get the class labels the network predicts on our test set and look at a few examples." 518 | ] 519 | }, 520 | { 521 | "cell_type": "code", 522 | "execution_count": null, 523 | "metadata": { 524 | "collapsed": false 525 | }, 526 | "outputs": [], 527 | "source": [ 528 | "y_pred = cifar_model.predict(cifar_x_test)\n", 529 | "import matplotlib.pyplot as plt\n", 530 | "%matplotlib inline\n", 531 | "plt.imshow(cifar_x_test[1234])\n", 532 | "print(\"Predicted label: \", np.argmax(y_pred[1234]))\n", 533 | "print(\"True label: \", cifar_y_test[1234])" 534 | ] 535 | }, 536 | { 537 | "cell_type": "markdown", 538 | "metadata": {}, 539 | "source": [ 540 | "# Problem 2: Sentiment Classification" 541 | ] 542 | }, 543 | { 544 | "cell_type": "markdown", 545 | "metadata": {}, 546 | "source": [ 547 | "In this problem we will use Keras' imdb sentiment dataset. You will take in sequences of words and use an RNN to try to classify each sequence's sentiment. First we have to process the data a little bit, so that we have fixed-length sequences." 548 | ] 549 | }, 550 | { 551 | "cell_type": "code", 552 | "execution_count": null, 553 | "metadata": { 554 | "collapsed": false 555 | }, 556 | "outputs": [], 557 | "source": [ 558 | "from keras.datasets import imdb\n", 559 | "(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=1000, maxlen=200)" 560 | ] 561 | }, 562 | { 563 | "cell_type": "code", 564 | "execution_count": null, 565 | "metadata": { 566 | "collapsed": true 567 | }, 568 | "outputs": [], 569 | "source": [ 570 | "def process_data(data):\n", 571 | " processed = np.zeros(len(data) * 200).reshape((len(data), 200))\n", 572 | " for i, seq in enumerate(data):\n", 573 | " if len(seq) < 200:\n", 574 | " processed[i] = np.array(seq + [0 for _ in range(200 - len(seq))])\n", 575 | " else:\n", 576 | " processed[i] = np.array(seq)\n", 577 | " return processed" 578 | ] 579 | }, 580 | { 581 | "cell_type": "code", 582 | "execution_count": null, 583 | "metadata": { 584 | "collapsed": false 585 | }, 586 | "outputs": [], 587 | "source": [ 588 | "x_train_proc = process_data(x_train)\n", 589 | "x_test_proc = process_data(x_test)\n", 590 | "print(x_test_proc.shape)" 591 | ] 592 | }, 593 | { 594 | "cell_type": "markdown", 595 | "metadata": {}, 596 | "source": [ 597 | "The Embedding Layer is a little bit different from most of the layers, so we have provided that code for you. 
Basically, the 1000 means that we are using a vocabulary size of 1000, the 32 means that each word will be output as a vector of size 32, and ``mask_zero=True`` means that we don't care about the index 0, because we are using it for padding." 598 | ] 599 | }, 600 | { 601 | "cell_type": "code", 602 | "execution_count": null, 603 | "metadata": { 604 | "collapsed": false 605 | }, 606 | "outputs": [], 607 | "source": [ 608 | "imdb_model = Sequential()" 609 | ] 610 | }, 611 | { 612 | "cell_type": "code", 613 | "execution_count": null, 614 | "metadata": { 615 | "collapsed": true 616 | }, 617 | "outputs": [], 618 | "source": [ 619 | "from keras.layers.embeddings import Embedding\n", 620 | "imdb_model.add(Embedding(1000, 32, input_length=200, mask_zero=True))" 621 | ] 622 | }, 623 | { 624 | "cell_type": "markdown", 625 | "metadata": {}, 626 | "source": [ 627 | "**(a)** For this problem, I won't walk you through everything like I did in the last one. What you need to do is as follows: add an LSTM layer with 32 outputs, then a ``Dense`` layer with 16 neurons, then a relu activation, then a ``Dense`` layer with 1 neuron, then a sigmoid activation. Then you should print out the model summary." 628 | ] 629 | }, 630 | { 631 | "cell_type": "code", 632 | "execution_count": null, 633 | "metadata": { 634 | "collapsed": true 635 | }, 636 | "outputs": [], 637 | "source": [ 638 | "##YOUR CODE HERE" 639 | ] 640 | }, 641 | { 642 | "cell_type": "markdown", 643 | "metadata": {}, 644 | "source": [ 645 | "**(b)** Now compile the model with binary cross-entropy and the Adam optimizer. Also include accuracy as a metric in the compile. Then train the model on the processed data (no need to worry about class weights this time)." 646 | ] 647 | }, 648 | { 649 | "cell_type": "code", 650 | "execution_count": null, 651 | "metadata": { 652 | "collapsed": true 653 | }, 654 | "outputs": [], 655 | "source": [ 656 | "##YOUR CODE HERE" 657 | ] 658 | }, 659 | { 660 | "cell_type": "markdown", 661 | "metadata": {}, 662 | "source": [ 663 | "After training, we can evaluate our model on the test set." 664 | ] 665 | }, 666 | { 667 | "cell_type": "code", 668 | "execution_count": null, 669 | "metadata": { 670 | "collapsed": false 671 | }, 672 | "outputs": [], 673 | "source": [ 674 | "print(\"Accuracy: \", imdb_model.evaluate(x_test_proc, y_test)[1])" 675 | ] 676 | }, 677 | { 678 | "cell_type": "markdown", 679 | "metadata": {}, 680 | "source": [ 681 | "Now we can look at our predictions and the sentences they correspond to. 
682 | ] 683 | }, 684 | { 685 | "cell_type": "code", 686 | "execution_count": null, 687 | "metadata": { 688 | "collapsed": false 689 | }, 690 | "outputs": [], 691 | "source": [ 692 | "y_pred = imdb_model.predict(x_test_proc)" 693 | ] 694 | }, 695 | { 696 | "cell_type": "code", 697 | "execution_count": null, 698 | "metadata": { 699 | "collapsed": false 700 | }, 701 | "outputs": [], 702 | "source": [ 703 | "y_pred = np.vectorize(lambda x: int(x >= 0.5))(y_pred)\n", 704 | "correct = []\n", 705 | "incorrect = []\n", 706 | "for i, pred in enumerate(y_pred):\n", 707 | " if y_test[i] == pred:\n", 708 | " correct.append(i)\n", 709 | " else:\n", 710 | " incorrect.append(i)\n", 711 | "word_dict = {v: k for k, v in imdb.get_word_index().items()}\n", 712 | "\n", 713 | "print(list(map(lambda x: word_dict[int(x)] if x != 0 else None, x_test[correct[123]])))" 714 | ] 715 | }, 716 | { 717 | "cell_type": "markdown", 718 | "metadata": {}, 719 | "source": [ 720 | "After making this, I realized that Keras' method for converting from word indices back to words is broken right now (see this open [GitHub issue](https://github.com/fchollet/keras/issues/5912)). So we can't actually see what the sentences look like." 721 | ] 722 | } 723 | ], 724 | "metadata": { 725 | "anaconda-cloud": {}, 726 | "kernelspec": { 727 | "display_name": "Python [tensorflow]", 728 | "language": "python", 729 | "name": "Python [tensorflow]" 730 | }, 731 | "language_info": { 732 | "codemirror_mode": { 733 | "name": "ipython", 734 | "version": 3 735 | }, 736 | "file_extension": ".py", 737 | "mimetype": "text/x-python", 738 | "name": "python", 739 | "nbconvert_exporter": "python", 740 | "pygments_lexer": "ipython3", 741 | "version": "3.5.2" 742 | } 743 | }, 744 | "nbformat": 4, 745 | "nbformat_minor": 0 746 | } 747 | -------------------------------------------------------------------------------- /project1/Proj1_v3.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kaggledecal/sp17/52649d867d00787d92ef4a239605c23f04933b57/project1/Proj1_v3.pdf -------------------------------------------------------------------------------- /project2/project2.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kaggledecal/sp17/52649d867d00787d92ef4a239605c23f04933b57/project2/project2.pdf -------------------------------------------------------------------------------- /syllabus.md: -------------------------------------------------------------------------------- 1 | # Data Science with Kaggle Syllabus Spring 2017 2 | 3 | ## Introduction: 4 | 5 | Welcome to Data Science with Kaggle! Kaggle is home to an abundance of company-volunteered datasets that encourage data scientists from around the world to solve proposed, and often business-related, challenges. The platform fosters a great deal of knowledge sharing, competition, and practical relevance, and beginners and experts alike benefit from an exponentially expanding field. 6 | 7 | ## Prerequisites: 8 | 9 | This is a project-based class with a machine learning focus. You are expected to have some programming or statistics background, so the material will be of greatest benefit to sophomores or those who have taken CS61A, DATA 8, STAT 133, or equivalent. However, the first two weeks of class will be an optional Python bootcamp for those taking the course with absolutely no programming background. 
By the end, you can determine whether you are comfortable continuing through the course. 10 | 11 | Note that this is not an easy class. The student facilitators intend to provide you with a comprehensive guide to data analysis, with the goal of preparing you for industry and, if you demonstrate superb interest, for future machine learning competitions. 12 | 13 | ## Learning objectives: 14 | 15 | Listed below is a subset of topics you will learn from this course: 16 | 17 | 1. Python programming 18 | 19 | 2. Data interpretation, data munging, and visual analysis 20 | 21 | 3. Numerical and text analysis 22 | 23 | 4. Linear and Logistic Regression 24 | 25 | 5. Clustering Techniques 26 | 27 | 6. RandomForest 28 | 29 | 7. Keras 30 | 31 | 8. Neural Networks 32 | 33 | 9. AWS for Machine Learning 34 | 35 | ## Project Schedule: 36 | 37 | Each project is designed to last two weeks. Projects can be completed individually or in teams of at most 4 students. 38 | 39 | The first project will guide students through the intuition and steps of data analysis. The last two projects will be more open-ended, for students to figure out how to complete. 40 | 41 | * The first two weeks will involve analysis of the Titanic data set. This is a widely studied data set, and perhaps the most popular on Kaggle. We will guide you through the Python language and teach you basic analysis techniques you can apply. 42 | 43 | * MNIST Digit Recognizer – Clustering Techniques 44 | 45 | * Auto-Librarian Classification – RandomForest, text analysis, and clustering techniques 46 | 47 | * Company-sponsored In Class Kaggle Competition - All data science methods 48 | 49 | ## Assignment Schedule: 50 | 51 | Early assignments (during the bootcamp and possibly the first week of class) will involve handwritten conceptual questions to check for understanding. 52 | 53 | Afterwards, assignments will involve in-class Kaggle competitions where students submit their model predictions on a custom-arranged data set separate from lecture. This will give you a chance to apply what you have learned in class. These assignments should be done individually. 54 | 55 | ## Class Logistics: 56 | 57 | The first two weeks will be an optional programming bootcamp for students to get introduced to programming and to catch up to the level of skill necessary to complete the course. 58 | 59 | There will be 3 hours of lecture per week in addition to office hours. Assignments will be given at the beginning of each module and will be due at the end of each module. Note that the first few weeks of class will involve handwritten weekly assignments that test for understanding only. 60 | 61 | 62 | 63 | ## Grading: 64 | 65 | Assignments early on will be graded on completeness. In-class Kaggle assignments will be graded on whether the prediction score is above a certain threshold. Final projects will be graded on completeness, quality of response, accuracy, and teammate evaluations. There will be 2 open-ended projects throughout the latter half of the semester. All assignments, projects, and attendance will be assigned a point value, with a general weighting scheme of 40% assignments and 60% projects. In order to pass the class, you must pass the minimum accuracy threshold for all assignments and projects. 66 | 67 | ## Suggested Online Reading Schedule: 68 | 69 | These are hand-picked resources the student instructors strongly believe will help you understand lecture. 
70 | 71 | [Docker Instructions](https://github.com/kaggledecal/kaggle_fa16/blob/master/DockerCheatsheet.md) 72 | 73 | 2/20 - [https://www.kaggle.com/c/titanic/details/getting-started-with-python](https://www.kaggle.com/c/titanic/details/getting-started-with-python) 74 | 75 | 2/27 - [https://www.kaggle.com/c/titanic/details/getting-started-with-python-ii](https://www.kaggle.com/c/titanic/details/getting-started-with-python-ii) 76 | 77 | 3/8 - [http://opencvpython.blogspot.com/2012/12/k-means-clustering-1-basic-understanding.html](http://opencvpython.blogspot.com/2012/12/k-means-clustering-1-basic-understanding.html) 78 | 79 | 3/13 - [https://www.dataquest.io/blog/k-nearest-neighbors-in-python/](https://www.dataquest.io/blog/k-nearest-neighbors-in-python/) 80 | 81 | 3/15 - [https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words](https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words) 82 | 83 | 3/22 - [http://blog.yhat.com/posts/random-forests-in-python.html](http://blog.yhat.com/posts/random-forests-in-python.html) 84 | 85 | 4/3 - [http://natureofcode.com/book/chapter-10-neural-networks/](http://natureofcode.com/book/chapter-10-neural-networks/) (Introduction and Section 10.2) 86 | 87 | 4/5 - [http://neuralnetworksanddeeplearning.com/chap2.html](http://neuralnetworksanddeeplearning.com/chap2.html) 88 | 89 | 4/10 - [https://keras.io/getting-started/sequential-model-guide/](https://keras.io/getting-started/sequential-model-guide/) 90 | 91 | 4/12 - [www.asimovinstitute.org/neural-network-zoo/](http://www.asimovinstitute.org/neural-network-zoo/) 92 | 93 | ## Recommended Texts and Online Readings: 94 | 95 | Most material will be in the form of PowerPoint slides, handouts, and live demos. However, there are a few resources we recommend reading throughout the course to better understand concepts or a programming language. Some are really just fun reads. 96 | 97 | The Data Science Handbook by Carl Shan, Henry Wang, William Chen, and Max Song: [http://www.thedatasciencehandbook.com](http://www.thedatasciencehandbook.com) 98 | 99 | Python for Data Analysis by Wes McKinney: [http://shop.oreilly.com/product/0636920023784.do](http://shop.oreilly.com/product/0636920023784.do) 100 | 101 | The Signal and the Noise by Nate Silver: [https://en.wikipedia.org/wiki/The_Signal_and_the_Noise](https://en.wikipedia.org/wiki/The_Signal_and_the_Noise) 102 | 103 | Book for the nitty-gritty of neural networks: 104 | [http://neuralnetworksanddeeplearning.com/](http://neuralnetworksanddeeplearning.com/) 105 | 106 | Recurrent Neural Networks: 107 | 108 | [http://karpathy.github.io/2015/05/21/rnn-effectiveness/](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) 109 | 110 | ## Extra Credit: 111 | 112 | Students will be able to receive extra credit for completing a side project with instructor approval. 113 | 114 | ## Attendance: 115 | 116 | Since this is a project-team-based class, attendance is mandatory. We will be keeping track at the beginning of each class. However, you may have two absences for any reason. If you are working with a team, please communicate appropriately. 117 | 118 | 119 | 120 | ## Class Schedule: 121 | 122 | 2/6 - Decal Kickoff. Overview of tools that will be in play. This lecture will help you determine if you need to go to the bootcamp. 123 | 124 | 2/8 - (Optional) Python setup. Coding environment setup. Variables/Data Types. If, for, while statements. 125 | 126 | 2/13 - (Optional) Data Visualization. Reading data. 
Structure of arrays/matrices/dataframes. Objects. Histograms and missing value imputation. Summary statistics. 127 | 128 | 2/15 - (Optional) Difference between Classification and Regression. Exploring data hands-on. 129 | 130 | 2/20 - Presidents' Day: No Class 131 | 132 | 2/22 - Data cleaning. Regular expressions. 133 | 134 | 2/27 - Linear Regression. Assumptions for intuition. Residuals. Interpretation. Practical example on Titanic. Example of linear dependence. 135 | 136 | 3/1 - Logistic Regression. Assumptions for intuition. Difference from Linear Regression. Practical example on Titanic. Regression assignment due 3/5. 137 | 138 | *Project 1 released. Due March 17th* 139 | 140 | 3/6 - Introduction to the Digit Recognition data set. Linear and logistic regression. KNN. 141 | 142 | 3/8 - K-means and Cross-Validation. Sign up for AWS Educate. 143 | 144 | 3/13 - In-class implementation of KNN. Bias and variance tradeoff. 145 | 146 | 3/15 - Decision Trees. 147 | 148 | 3/20 - Random Forests. 149 | 150 | 3/22 - Lecture on Amazon Web Services (AWS). 151 | 152 | 3/27 - Spring Recess: No Class 153 | 154 | 3/29 - Spring Recess: No Class 155 | 156 | 4/3 - Introduction to Neural Networks 157 | 158 | 4/5 - How Neural Networks Learn / Gradient Descent 159 | 160 | 4/10 - Introduction to Convolutional Neural Nets through Keras 161 | 162 | 4/12 - Recurrent Neural Networks. Text recurrence. 163 | 164 | 4/17 - Anomaly Detection using LSTMs. 165 | 166 | 4/19 - Guest Lectures 167 | 168 | 4/24 - Guest Lectures 169 | 170 | 4/26 - Final Lecture. Review. 171 | --------------------------------------------------------------------------------