├── README.md
├── in_class_notebooks
│ ├── Autoencoder
│ │ └── autoencoder_part.ipynb
│ ├── DT_viz
│ │ ├── data
│ │ │ ├── Heart.csv
│ │ │ ├── Heart_cleaned.csv
│ │ │ └── residency.csv
│ │ ├── in_class_DT-Release.ipynb
│ │ └── in_class_DT_solution.ipynb
│ ├── EDA
│ │ ├── .ipynb_checkpoints
│ │ │ └── in-class_EDA-release-checkpoint.ipynb
│ │ ├── data
│ │ │ ├── breast-cancer.csv
│ │ │ ├── breast-cancer_info.names
│ │ │ └── titanic.csv
│ │ ├── in-class_EDA-release.ipynb
│ │ └── in-class_EDA-solution.ipynb
│ ├── NMF
│ │ ├── NMF_applications.ipynb
│ │ ├── NMF_rev1.ipynb
│ │ └── data
│ │ │ └── articles.pkl
│ ├── NN-intro-Keras
│ │ └── fashion_mnist_with_keras.ipynb
│ ├── RNN_Keras
│ │ └── in-class-RNN-keras-release.ipynb
│ ├── SVM
│ │ └── in_class_SVM-release.ipynb
│ └── Unsupervised_learning
│ │ ├── .ipynb_checkpoints
│ │ │ └── in_class_unsupervised_learning-release-checkpoint.ipynb
│ │ ├── hitters.csv
│ │ └── in_class_unsupervised_learning-release.ipynb
├── other_resource
│ ├── ieee_matrix_factoriztion.pdf
│ └── project_guide.md
└── slides
├── Lec10-SVM(2).pdf
├── Lec10-SVM(2)_anno.pdf
├── Lec11-NeuralNetwork_1_anno.pdf
├── Lec12-NeuralNetwork2_anno.pdf
├── Lec13-math-behind-NN-training_anno.pdf
├── Lec14-optimization-methods-NN-training_anno.pdf
├── Lec15-Unsupervised Learning-PCA_anno.pdf
├── Lec16-Unsupervised Learning-Clustering_anno.pdf
├── Lec17-Unsupervised Learning-Recommender System_anno.pdf
├── Lec18-Unsupervised Learning-Matrix Factorization_anno_ver1.pdf
├── Lec19-whiteboard_ver1.png
├── Lec1_Introduction.pdf
├── Lec1_Introduction_anno.pdf
├── Lec2-Linear-Regression.pdf
├── Lec2-Linear-Regression_anno.pdf
├── Lec20-ConvolutionalNeuralNetwork1_anno.pdf
├── Lec21-ConvolutionalNeuralNetwork2_r.pdf
├── Lec22-ConvolutionalNeuralNetwork3.pdf
├── Lec23-autoencoder_whiteboard.png
├── Lec24-RecurrentNeuralNetwork1.pdf
├── Lec3-Logistic-Regression.pdf
├── Lec3-Logistic-Regression_anno.pdf
├── Lec4-improve-training.pdf
├── Lec4-improve-training_anno.pdf
├── Lec5_Decision_Trees.pdf
├── Lec5_Decision_Trees_anno.pdf
├── Lec6_Decision_Trees_pruning.pdf
├── Lec7_random_forest.pdf
├── Lec7_random_forest_anno.pdf
├── Lec8_Boosting.pdf
├── Lec8_Boosting_anno.pdf
├── Lec9-SVM(1).pdf
└── Lec9-SVM(1)_anno.pdf
/README.md:
--------------------------------------------------------------------------------
1 | ### CSCI 4622: Undergraduate Machine Learning (University of Colorado Boulder, Spring 2020)
2 |
3 | **Note**: This schedule is a rough approximation and subject to change.
4 | Reading chapters are from the textbooks unless mentioned otherwise.
5 | Currently we have two textbooks:
6 | - ISLR: An Introduction to Statistical Learning with Applications in R
7 | - ACML: A Course in Machine Learning
8 |
9 | | Week | Date | Reading | Topic | Slides | Assignments |
10 | |:------:|:------------:| :-----------:| :----------------------------------------:|:-----------:|:----------:|
11 | | 1 | Jan. 13 | | Machine Learning Intro., KNN | [slides](https://github.com/libphy/CSCI4622-20SP-MachineLearning/tree/master/slides/Lec1_Introduction.pdf) | |
12 | | | Jan. 15 |ISLR 3.1, 3.2 | Linear Regression |[slides](https://github.com/libphy/CSCI4622-20SP-MachineLearning/tree/master/slides/Lec2-Linear-Regression.pdf) | |
13 | | | Jan. 17 | | Hands-on: EDA | [notebooks](https://github.com/libphy/CSCI4622-20SP-MachineLearning/tree/master/in_class_notebooks/EDA) | HW1 out|
14 | | 2 | Jan. 20 | | No class: MLK | | |
15 | | | Jan. 22 | ISLR 4.3 | Logistic regression |[slides](https://github.com/libphy/CSCI4622-20SP-MachineLearning/blob/master/slides/Lec3-Logistic-Regression.pdf) | |
16 | | | Jan. 24 | ISLR 6.2.1-6.2.3, 5.1 | Techniques to improve training | [slides](https://github.com/libphy/CSCI4622-20SP-MachineLearning/blob/master/slides/Lec4-improve-training.pdf) | |
17 | | 3 | Jan. 27| ISLR 8.1 | Decision Tree 1 |[slides](https://github.com/libphy/CSCI4622-20SP-MachineLearning/blob/master/slides/Lec5_Decision_Trees.pdf) | |
18 | | | Jan. 29 | ISLR 8.1 | Decision Tree 2 |[slides](https://github.com/libphy/CSCI4622-20SP-MachineLearning/blob/master/slides/Lec6_Decision_Trees_pruning.pdf) | |
19 | | | Jan. 31 | | Hands-on | | HW1 due, HW2 out |
20 | | 4 | Feb. 3 |ISLR 8.3.3 |Ensemble methods 1: Bagging|[slides](https://github.com/libphy/CSCI4622-20SP-MachineLearning/blob/master/slides/Lec7_random_forest.pdf) | |
21 | | | Feb. 5 |ISLR 8.3.4 | Ensemble methods 2: Boosting |[slides](https://github.com/libphy/CSCI4622-20SP-MachineLearning/blob/master/slides/Lec8_Boosting.pdf) |
22 | | | Feb. 7 | | Kaggle mini comp 1: [regression challenge](https://www.kaggle.com/c/cu-regression-challenge/) | | |
23 | | 5 | Feb. 10 |ISLR 9.1, 9.2 |SVM 1 |[slides](https://github.com/libphy/CSCI4622-20SP-MachineLearning/blob/master/slides/Lec9-SVM(1).pdf) | |
24 | | | Feb. 12 |ISLR 9.3-9.5 | SVM 2 |[slides](https://github.com/libphy/CSCI4622-20SP-MachineLearning/blob/master/slides/Lec10-SVM(2).pdf) | mini comp 1 closes|
25 | | | Feb. 14 | | Hands-on | | |
26 | | 6 | Feb. 17 |ACML ch4 |Neural Network 1 (perceptron) |[slides](https://github.com/libphy/CSCI4622-20SP-MachineLearning/blob/master/slides/Lec11-NeuralNetwork_1_anno.pdf) | HW2 due|
27 | | | Feb. 19 | | midterm review | | |
28 | | | Feb. 21 | | Midterm 1 | | |
29 | | 7 | Feb. 24 |[deeplearningbook](http://www.deeplearningbook.org) Ch.6.1-6.4 |Neural Network 2 (training perceptron, ANN design parameters), demo: intro to Keras |[slides](https://github.com/libphy/CSCI4622-20SP-MachineLearning/blob/master/slides/Lec12-NeuralNetwork2_anno.pdf) | HW3 out|
30 | | | Feb. 26 | [deeplearningbook](http://www.deeplearningbook.org) Ch.6.5, 8.3 |Back Propagation, Stochastic Gradient Descent | [slides](https://github.com/libphy/CSCI4622-20SP-MachineLearning/blob/master/slides/Lec13-math-behind-NN-training_anno.pdf) | |
31 | | | Feb. 28 | [deeplearningbook](http://www.deeplearningbook.org)Ch.8.3, 8.5, 8.7.1 |More optimization algorithms, Training tricks, Kaggle mini comp 2: [classification challenge](https://www.kaggle.com/c/cub-csci-4622-kaggle-2-2020/overview) |[slides](https://github.com/libphy/CSCI4622-20SP-MachineLearning/blob/master/slides/Lec14-optimization-methods-NN-training_anno.pdf) | |
32 | | 8 | Mar. 2 | ISLR 10.1, 10.2, 6.3|Unsupervised Learning 1: Dimensionality Reduction |[slides](https://github.com/libphy/CSCI4622-20SP-MachineLearning/blob/master/slides/Lec15-Unsupervised%20Learning-PCA_anno.pdf) | |
33 | | | Mar. 4 | ISLR 10.3 |Unsupervised Learning 2: Clustering |[slides](https://github.com/libphy/CSCI4622-20SP-MachineLearning/blob/master/slides/Lec16-Unsupervised%20Learning-Clustering_anno.pdf) | |
34 | | | Mar. 6 | | Hands-on | | mini comp 2 closes, HW4 out |
35 | | 9 | Mar. 9 |[MMDS](http://infolab.stanford.edu/~ullman/mmds/ch9.pdf) Ch.9.1-9.3 | Recommender System|[slides](https://github.com/libphy/CSCI4622-20SP-MachineLearning/blob/master/slides/Lec17-Unsupervised%20Learning-Recommender%20System_anno.pdf) | |
36 | | | Mar. 11 | [MMDS](http://infolab.stanford.edu/~ullman/mmds/ch9.pdf) Ch.9.4-9.6, [paper](https://github.com/libphy/CSCI4622-20SP-MachineLearning/blob/master/other_resource/ieee_matrix_factoriztion.pdf) | Matrix Factorization | [slides](https://github.com/libphy/CSCI4622-20SP-MachineLearning/blob/master/slides/Lec18-Unsupervised%20Learning-Matrix%20Factorization_anno_ver1.pdf) |HW3 due |
37 | | | Mar. 13 | | Kaggle mini comp 3: unsupervised learning (class canceled due to COVID-19) | | |
38 | | 10 | Mar. 16 | | NMF applications-Topic modeling | [notebook](https://github.com/libphy/CSCI4622-20SP-MachineLearning/blob/master/in_class_notebooks/NMF/NMF_applications.ipynb), [whiteboard](https://github.com/libphy/CSCI4622-20SP-MachineLearning/blob/master/slides/Lec19-whiteboard_ver1.png) | |
39 | | | Mar. 18 |[deeplearningbook](http://www.deeplearningbook.org) Ch. 9.1-9.3, 9.10, 9.11 | CNN 1: Basics | [slides](https://github.com/libphy/CSCI4622-20SP-MachineLearning/blob/master/slides/Lec20-ConvolutionalNeuralNetwork1_anno.pdf) | |
40 | | | Mar. 20 |[resource](http://cs231n.github.io/convolutional-networks/): Stanford's 231n course has great resources on (convolutional) neural networks. |CNN 2: Architectures & Training |[slides](https://github.com/libphy/CSCI4622-20SP-MachineLearning/blob/master/slides/Lec21-ConvolutionalNeuralNetwork2_r.pdf) | |
41 | | 11 | Mar. 23 | | No class: Spring Break | | |
42 | | | Mar. 25 | | No class: Spring Break | | |
43 | | | Mar. 27 | | No class: Spring Break | | |
44 | | 12 | Mar. 30 | | review | | |
45 | | | Apr. 1 | | review | | |
46 | | | Apr. 3 | | Midterm 2 | | |
47 | | 13 | Apr. 6 | | CNN 3: Advanced topic |[slides](https://github.com/libphy/CSCI4622-20SP-MachineLearning/blob/master/slides/Lec22-ConvolutionalNeuralNetwork3.pdf) | |
48 | | | Apr. 8 |[deeplearningbook](http://www.deeplearningbook.org) Ch. 14.1-14.5 | Unsupervised Neural Networks | [notebook](https://github.com/libphy/CSCI4622-20SP-MachineLearning/blob/master/in_class_notebooks/Autoencoder/autoencoder_part.ipynb) [whiteboard](https://github.com/libphy/CSCI4622-20SP-MachineLearning/blob/master/slides/Lec23-autoencoder_whiteboard.png)|HW4 due, HW5 out|
49 | | | Apr.10 | | Kaggle mini comp 4: [Image classification](https://www.kaggle.com/c/cuboulder-image-labelling), Keras mini tutorial| | [project announcement](https://github.com/libphy/CSCI4622-20SP-MachineLearning/blob/master/other_resource/project_guide.md)|
50 | | 14 | Apr. 13 |[deeplearningbook](http://www.deeplearningbook.org) Ch. 10.1, 10.2 | RNN 1 |[slides](https://github.com/libphy/CSCI4622-20SP-MachineLearning/blob/master/slides/Lec24-RecurrentNeuralNetwork1.pdf) | |
51 | | | Apr. 15 | | RNN 2 | | mini comp 4 closes |
52 | | | Apr. 17 | | Hands-on | | |
53 | | 15 | Apr. 20 | | project discussion | |Team formation deadline|
54 | | | Apr. 22 | | project discussion | | |
55 | | | Apr. 24 | | project discussion | |HW5 due |
56 | | 16 | Apr. 27 | | project discussion | | |
57 | | | Apr. 29 | | project discussion | | |
58 | | | May. 1 | | no class | | |
59 | |Finals week| May. 3| | | |Project deliverable due |
60 |
--------------------------------------------------------------------------------
/in_class_notebooks/DT_viz/data/Heart.csv:
--------------------------------------------------------------------------------
1 | "","Age","Sex","ChestPain","RestBP","Chol","Fbs","RestECG","MaxHR","ExAng","Oldpeak","Slope","Ca","Thal","AHD"
2 | "1",63,1,"typical",145,233,1,2,150,0,2.3,3,0,"fixed","No"
3 | "2",67,1,"asymptomatic",160,286,0,2,108,1,1.5,2,3,"normal","Yes"
4 | "3",67,1,"asymptomatic",120,229,0,2,129,1,2.6,2,2,"reversable","Yes"
5 | "4",37,1,"nonanginal",130,250,0,0,187,0,3.5,3,0,"normal","No"
6 | "5",41,0,"nontypical",130,204,0,2,172,0,1.4,1,0,"normal","No"
7 | "6",56,1,"nontypical",120,236,0,0,178,0,0.8,1,0,"normal","No"
8 | "7",62,0,"asymptomatic",140,268,0,2,160,0,3.6,3,2,"normal","Yes"
9 | "8",57,0,"asymptomatic",120,354,0,0,163,1,0.6,1,0,"normal","No"
10 | "9",63,1,"asymptomatic",130,254,0,2,147,0,1.4,2,1,"reversable","Yes"
11 | "10",53,1,"asymptomatic",140,203,1,2,155,1,3.1,3,0,"reversable","Yes"
12 | "11",57,1,"asymptomatic",140,192,0,0,148,0,0.4,2,0,"fixed","No"
13 | "12",56,0,"nontypical",140,294,0,2,153,0,1.3,2,0,"normal","No"
14 | "13",56,1,"nonanginal",130,256,1,2,142,1,0.6,2,1,"fixed","Yes"
15 | "14",44,1,"nontypical",120,263,0,0,173,0,0,1,0,"reversable","No"
16 | "15",52,1,"nonanginal",172,199,1,0,162,0,0.5,1,0,"reversable","No"
17 | "16",57,1,"nonanginal",150,168,0,0,174,0,1.6,1,0,"normal","No"
18 | "17",48,1,"nontypical",110,229,0,0,168,0,1,3,0,"reversable","Yes"
19 | "18",54,1,"asymptomatic",140,239,0,0,160,0,1.2,1,0,"normal","No"
20 | "19",48,0,"nonanginal",130,275,0,0,139,0,0.2,1,0,"normal","No"
21 | "20",49,1,"nontypical",130,266,0,0,171,0,0.6,1,0,"normal","No"
22 | "21",64,1,"typical",110,211,0,2,144,1,1.8,2,0,"normal","No"
23 | "22",58,0,"typical",150,283,1,2,162,0,1,1,0,"normal","No"
24 | "23",58,1,"nontypical",120,284,0,2,160,0,1.8,2,0,"normal","Yes"
25 | "24",58,1,"nonanginal",132,224,0,2,173,0,3.2,1,2,"reversable","Yes"
26 | "25",60,1,"asymptomatic",130,206,0,2,132,1,2.4,2,2,"reversable","Yes"
27 | "26",50,0,"nonanginal",120,219,0,0,158,0,1.6,2,0,"normal","No"
28 | "27",58,0,"nonanginal",120,340,0,0,172,0,0,1,0,"normal","No"
29 | "28",66,0,"typical",150,226,0,0,114,0,2.6,3,0,"normal","No"
30 | "29",43,1,"asymptomatic",150,247,0,0,171,0,1.5,1,0,"normal","No"
31 | "30",40,1,"asymptomatic",110,167,0,2,114,1,2,2,0,"reversable","Yes"
32 | "31",69,0,"typical",140,239,0,0,151,0,1.8,1,2,"normal","No"
33 | "32",60,1,"asymptomatic",117,230,1,0,160,1,1.4,1,2,"reversable","Yes"
34 | "33",64,1,"nonanginal",140,335,0,0,158,0,0,1,0,"normal","Yes"
35 | "34",59,1,"asymptomatic",135,234,0,0,161,0,0.5,2,0,"reversable","No"
36 | "35",44,1,"nonanginal",130,233,0,0,179,1,0.4,1,0,"normal","No"
37 | "36",42,1,"asymptomatic",140,226,0,0,178,0,0,1,0,"normal","No"
38 | "37",43,1,"asymptomatic",120,177,0,2,120,1,2.5,2,0,"reversable","Yes"
39 | "38",57,1,"asymptomatic",150,276,0,2,112,1,0.6,2,1,"fixed","Yes"
40 | "39",55,1,"asymptomatic",132,353,0,0,132,1,1.2,2,1,"reversable","Yes"
41 | "40",61,1,"nonanginal",150,243,1,0,137,1,1,2,0,"normal","No"
42 | "41",65,0,"asymptomatic",150,225,0,2,114,0,1,2,3,"reversable","Yes"
43 | "42",40,1,"typical",140,199,0,0,178,1,1.4,1,0,"reversable","No"
44 | "43",71,0,"nontypical",160,302,0,0,162,0,0.4,1,2,"normal","No"
45 | "44",59,1,"nonanginal",150,212,1,0,157,0,1.6,1,0,"normal","No"
46 | "45",61,0,"asymptomatic",130,330,0,2,169,0,0,1,0,"normal","Yes"
47 | "46",58,1,"nonanginal",112,230,0,2,165,0,2.5,2,1,"reversable","Yes"
48 | "47",51,1,"nonanginal",110,175,0,0,123,0,0.6,1,0,"normal","No"
49 | "48",50,1,"asymptomatic",150,243,0,2,128,0,2.6,2,0,"reversable","Yes"
50 | "49",65,0,"nonanginal",140,417,1,2,157,0,0.8,1,1,"normal","No"
51 | "50",53,1,"nonanginal",130,197,1,2,152,0,1.2,3,0,"normal","No"
52 | "51",41,0,"nontypical",105,198,0,0,168,0,0,1,1,"normal","No"
53 | "52",65,1,"asymptomatic",120,177,0,0,140,0,0.4,1,0,"reversable","No"
54 | "53",44,1,"asymptomatic",112,290,0,2,153,0,0,1,1,"normal","Yes"
55 | "54",44,1,"nontypical",130,219,0,2,188,0,0,1,0,"normal","No"
56 | "55",60,1,"asymptomatic",130,253,0,0,144,1,1.4,1,1,"reversable","Yes"
57 | "56",54,1,"asymptomatic",124,266,0,2,109,1,2.2,2,1,"reversable","Yes"
58 | "57",50,1,"nonanginal",140,233,0,0,163,0,0.6,2,1,"reversable","Yes"
59 | "58",41,1,"asymptomatic",110,172,0,2,158,0,0,1,0,"reversable","Yes"
60 | "59",54,1,"nonanginal",125,273,0,2,152,0,0.5,3,1,"normal","No"
61 | "60",51,1,"typical",125,213,0,2,125,1,1.4,1,1,"normal","No"
62 | "61",51,0,"asymptomatic",130,305,0,0,142,1,1.2,2,0,"reversable","Yes"
63 | "62",46,0,"nonanginal",142,177,0,2,160,1,1.4,3,0,"normal","No"
64 | "63",58,1,"asymptomatic",128,216,0,2,131,1,2.2,2,3,"reversable","Yes"
65 | "64",54,0,"nonanginal",135,304,1,0,170,0,0,1,0,"normal","No"
66 | "65",54,1,"asymptomatic",120,188,0,0,113,0,1.4,2,1,"reversable","Yes"
67 | "66",60,1,"asymptomatic",145,282,0,2,142,1,2.8,2,2,"reversable","Yes"
68 | "67",60,1,"nonanginal",140,185,0,2,155,0,3,2,0,"normal","Yes"
69 | "68",54,1,"nonanginal",150,232,0,2,165,0,1.6,1,0,"reversable","No"
70 | "69",59,1,"asymptomatic",170,326,0,2,140,1,3.4,3,0,"reversable","Yes"
71 | "70",46,1,"nonanginal",150,231,0,0,147,0,3.6,2,0,"normal","Yes"
72 | "71",65,0,"nonanginal",155,269,0,0,148,0,0.8,1,0,"normal","No"
73 | "72",67,1,"asymptomatic",125,254,1,0,163,0,0.2,2,2,"reversable","Yes"
74 | "73",62,1,"asymptomatic",120,267,0,0,99,1,1.8,2,2,"reversable","Yes"
75 | "74",65,1,"asymptomatic",110,248,0,2,158,0,0.6,1,2,"fixed","Yes"
76 | "75",44,1,"asymptomatic",110,197,0,2,177,0,0,1,1,"normal","Yes"
77 | "76",65,0,"nonanginal",160,360,0,2,151,0,0.8,1,0,"normal","No"
78 | "77",60,1,"asymptomatic",125,258,0,2,141,1,2.8,2,1,"reversable","Yes"
79 | "78",51,0,"nonanginal",140,308,0,2,142,0,1.5,1,1,"normal","No"
80 | "79",48,1,"nontypical",130,245,0,2,180,0,0.2,2,0,"normal","No"
81 | "80",58,1,"asymptomatic",150,270,0,2,111,1,0.8,1,0,"reversable","Yes"
82 | "81",45,1,"asymptomatic",104,208,0,2,148,1,3,2,0,"normal","No"
83 | "82",53,0,"asymptomatic",130,264,0,2,143,0,0.4,2,0,"normal","No"
84 | "83",39,1,"nonanginal",140,321,0,2,182,0,0,1,0,"normal","No"
85 | "84",68,1,"nonanginal",180,274,1,2,150,1,1.6,2,0,"reversable","Yes"
86 | "85",52,1,"nontypical",120,325,0,0,172,0,0.2,1,0,"normal","No"
87 | "86",44,1,"nonanginal",140,235,0,2,180,0,0,1,0,"normal","No"
88 | "87",47,1,"nonanginal",138,257,0,2,156,0,0,1,0,"normal","No"
89 | "88",53,0,"nonanginal",128,216,0,2,115,0,0,1,0,NA,"No"
90 | "89",53,0,"asymptomatic",138,234,0,2,160,0,0,1,0,"normal","No"
91 | "90",51,0,"nonanginal",130,256,0,2,149,0,0.5,1,0,"normal","No"
92 | "91",66,1,"asymptomatic",120,302,0,2,151,0,0.4,2,0,"normal","No"
93 | "92",62,0,"asymptomatic",160,164,0,2,145,0,6.2,3,3,"reversable","Yes"
94 | "93",62,1,"nonanginal",130,231,0,0,146,0,1.8,2,3,"reversable","No"
95 | "94",44,0,"nonanginal",108,141,0,0,175,0,0.6,2,0,"normal","No"
96 | "95",63,0,"nonanginal",135,252,0,2,172,0,0,1,0,"normal","No"
97 | "96",52,1,"asymptomatic",128,255,0,0,161,1,0,1,1,"reversable","Yes"
98 | "97",59,1,"asymptomatic",110,239,0,2,142,1,1.2,2,1,"reversable","Yes"
99 | "98",60,0,"asymptomatic",150,258,0,2,157,0,2.6,2,2,"reversable","Yes"
100 | "99",52,1,"nontypical",134,201,0,0,158,0,0.8,1,1,"normal","No"
101 | "100",48,1,"asymptomatic",122,222,0,2,186,0,0,1,0,"normal","No"
102 | "101",45,1,"asymptomatic",115,260,0,2,185,0,0,1,0,"normal","No"
103 | "102",34,1,"typical",118,182,0,2,174,0,0,1,0,"normal","No"
104 | "103",57,0,"asymptomatic",128,303,0,2,159,0,0,1,1,"normal","No"
105 | "104",71,0,"nonanginal",110,265,1,2,130,0,0,1,1,"normal","No"
106 | "105",49,1,"nonanginal",120,188,0,0,139,0,2,2,3,"reversable","Yes"
107 | "106",54,1,"nontypical",108,309,0,0,156,0,0,1,0,"reversable","No"
108 | "107",59,1,"asymptomatic",140,177,0,0,162,1,0,1,1,"reversable","Yes"
109 | "108",57,1,"nonanginal",128,229,0,2,150,0,0.4,2,1,"reversable","Yes"
110 | "109",61,1,"asymptomatic",120,260,0,0,140,1,3.6,2,1,"reversable","Yes"
111 | "110",39,1,"asymptomatic",118,219,0,0,140,0,1.2,2,0,"reversable","Yes"
112 | "111",61,0,"asymptomatic",145,307,0,2,146,1,1,2,0,"reversable","Yes"
113 | "112",56,1,"asymptomatic",125,249,1,2,144,1,1.2,2,1,"normal","Yes"
114 | "113",52,1,"typical",118,186,0,2,190,0,0,2,0,"fixed","No"
115 | "114",43,0,"asymptomatic",132,341,1,2,136,1,3,2,0,"reversable","Yes"
116 | "115",62,0,"nonanginal",130,263,0,0,97,0,1.2,2,1,"reversable","Yes"
117 | "116",41,1,"nontypical",135,203,0,0,132,0,0,2,0,"fixed","No"
118 | "117",58,1,"nonanginal",140,211,1,2,165,0,0,1,0,"normal","No"
119 | "118",35,0,"asymptomatic",138,183,0,0,182,0,1.4,1,0,"normal","No"
120 | "119",63,1,"asymptomatic",130,330,1,2,132,1,1.8,1,3,"reversable","Yes"
121 | "120",65,1,"asymptomatic",135,254,0,2,127,0,2.8,2,1,"reversable","Yes"
122 | "121",48,1,"asymptomatic",130,256,1,2,150,1,0,1,2,"reversable","Yes"
123 | "122",63,0,"asymptomatic",150,407,0,2,154,0,4,2,3,"reversable","Yes"
124 | "123",51,1,"nonanginal",100,222,0,0,143,1,1.2,2,0,"normal","No"
125 | "124",55,1,"asymptomatic",140,217,0,0,111,1,5.6,3,0,"reversable","Yes"
126 | "125",65,1,"typical",138,282,1,2,174,0,1.4,2,1,"normal","Yes"
127 | "126",45,0,"nontypical",130,234,0,2,175,0,0.6,2,0,"normal","No"
128 | "127",56,0,"asymptomatic",200,288,1,2,133,1,4,3,2,"reversable","Yes"
129 | "128",54,1,"asymptomatic",110,239,0,0,126,1,2.8,2,1,"reversable","Yes"
130 | "129",44,1,"nontypical",120,220,0,0,170,0,0,1,0,"normal","No"
131 | "130",62,0,"asymptomatic",124,209,0,0,163,0,0,1,0,"normal","No"
132 | "131",54,1,"nonanginal",120,258,0,2,147,0,0.4,2,0,"reversable","No"
133 | "132",51,1,"nonanginal",94,227,0,0,154,1,0,1,1,"reversable","No"
134 | "133",29,1,"nontypical",130,204,0,2,202,0,0,1,0,"normal","No"
135 | "134",51,1,"asymptomatic",140,261,0,2,186,1,0,1,0,"normal","No"
136 | "135",43,0,"nonanginal",122,213,0,0,165,0,0.2,2,0,"normal","No"
137 | "136",55,0,"nontypical",135,250,0,2,161,0,1.4,2,0,"normal","No"
138 | "137",70,1,"asymptomatic",145,174,0,0,125,1,2.6,3,0,"reversable","Yes"
139 | "138",62,1,"nontypical",120,281,0,2,103,0,1.4,2,1,"reversable","Yes"
140 | "139",35,1,"asymptomatic",120,198,0,0,130,1,1.6,2,0,"reversable","Yes"
141 | "140",51,1,"nonanginal",125,245,1,2,166,0,2.4,2,0,"normal","No"
142 | "141",59,1,"nontypical",140,221,0,0,164,1,0,1,0,"normal","No"
143 | "142",59,1,"typical",170,288,0,2,159,0,0.2,2,0,"reversable","Yes"
144 | "143",52,1,"nontypical",128,205,1,0,184,0,0,1,0,"normal","No"
145 | "144",64,1,"nonanginal",125,309,0,0,131,1,1.8,2,0,"reversable","Yes"
146 | "145",58,1,"nonanginal",105,240,0,2,154,1,0.6,2,0,"reversable","No"
147 | "146",47,1,"nonanginal",108,243,0,0,152,0,0,1,0,"normal","Yes"
148 | "147",57,1,"asymptomatic",165,289,1,2,124,0,1,2,3,"reversable","Yes"
149 | "148",41,1,"nonanginal",112,250,0,0,179,0,0,1,0,"normal","No"
150 | "149",45,1,"nontypical",128,308,0,2,170,0,0,1,0,"normal","No"
151 | "150",60,0,"nonanginal",102,318,0,0,160,0,0,1,1,"normal","No"
152 | "151",52,1,"typical",152,298,1,0,178,0,1.2,2,0,"reversable","No"
153 | "152",42,0,"asymptomatic",102,265,0,2,122,0,0.6,2,0,"normal","No"
154 | "153",67,0,"nonanginal",115,564,0,2,160,0,1.6,2,0,"reversable","No"
155 | "154",55,1,"asymptomatic",160,289,0,2,145,1,0.8,2,1,"reversable","Yes"
156 | "155",64,1,"asymptomatic",120,246,0,2,96,1,2.2,3,1,"normal","Yes"
157 | "156",70,1,"asymptomatic",130,322,0,2,109,0,2.4,2,3,"normal","Yes"
158 | "157",51,1,"asymptomatic",140,299,0,0,173,1,1.6,1,0,"reversable","Yes"
159 | "158",58,1,"asymptomatic",125,300,0,2,171,0,0,1,2,"reversable","Yes"
160 | "159",60,1,"asymptomatic",140,293,0,2,170,0,1.2,2,2,"reversable","Yes"
161 | "160",68,1,"nonanginal",118,277,0,0,151,0,1,1,1,"reversable","No"
162 | "161",46,1,"nontypical",101,197,1,0,156,0,0,1,0,"reversable","No"
163 | "162",77,1,"asymptomatic",125,304,0,2,162,1,0,1,3,"normal","Yes"
164 | "163",54,0,"nonanginal",110,214,0,0,158,0,1.6,2,0,"normal","No"
165 | "164",58,0,"asymptomatic",100,248,0,2,122,0,1,2,0,"normal","No"
166 | "165",48,1,"nonanginal",124,255,1,0,175,0,0,1,2,"normal","No"
167 | "166",57,1,"asymptomatic",132,207,0,0,168,1,0,1,0,"reversable","No"
168 | "167",52,1,"nonanginal",138,223,0,0,169,0,0,1,NA,"normal","No"
169 | "168",54,0,"nontypical",132,288,1,2,159,1,0,1,1,"normal","No"
170 | "169",35,1,"asymptomatic",126,282,0,2,156,1,0,1,0,"reversable","Yes"
171 | "170",45,0,"nontypical",112,160,0,0,138,0,0,2,0,"normal","No"
172 | "171",70,1,"nonanginal",160,269,0,0,112,1,2.9,2,1,"reversable","Yes"
173 | "172",53,1,"asymptomatic",142,226,0,2,111,1,0,1,0,"reversable","No"
174 | "173",59,0,"asymptomatic",174,249,0,0,143,1,0,2,0,"normal","Yes"
175 | "174",62,0,"asymptomatic",140,394,0,2,157,0,1.2,2,0,"normal","No"
176 | "175",64,1,"asymptomatic",145,212,0,2,132,0,2,2,2,"fixed","Yes"
177 | "176",57,1,"asymptomatic",152,274,0,0,88,1,1.2,2,1,"reversable","Yes"
178 | "177",52,1,"asymptomatic",108,233,1,0,147,0,0.1,1,3,"reversable","No"
179 | "178",56,1,"asymptomatic",132,184,0,2,105,1,2.1,2,1,"fixed","Yes"
180 | "179",43,1,"nonanginal",130,315,0,0,162,0,1.9,1,1,"normal","No"
181 | "180",53,1,"nonanginal",130,246,1,2,173,0,0,1,3,"normal","No"
182 | "181",48,1,"asymptomatic",124,274,0,2,166,0,0.5,2,0,"reversable","Yes"
183 | "182",56,0,"asymptomatic",134,409,0,2,150,1,1.9,2,2,"reversable","Yes"
184 | "183",42,1,"typical",148,244,0,2,178,0,0.8,1,2,"normal","No"
185 | "184",59,1,"typical",178,270,0,2,145,0,4.2,3,0,"reversable","No"
186 | "185",60,0,"asymptomatic",158,305,0,2,161,0,0,1,0,"normal","Yes"
187 | "186",63,0,"nontypical",140,195,0,0,179,0,0,1,2,"normal","No"
188 | "187",42,1,"nonanginal",120,240,1,0,194,0,0.8,3,0,"reversable","No"
189 | "188",66,1,"nontypical",160,246,0,0,120,1,0,2,3,"fixed","Yes"
190 | "189",54,1,"nontypical",192,283,0,2,195,0,0,1,1,"reversable","Yes"
191 | "190",69,1,"nonanginal",140,254,0,2,146,0,2,2,3,"reversable","Yes"
192 | "191",50,1,"nonanginal",129,196,0,0,163,0,0,1,0,"normal","No"
193 | "192",51,1,"asymptomatic",140,298,0,0,122,1,4.2,2,3,"reversable","Yes"
194 | "193",43,1,"asymptomatic",132,247,1,2,143,1,0.1,2,NA,"reversable","Yes"
195 | "194",62,0,"asymptomatic",138,294,1,0,106,0,1.9,2,3,"normal","Yes"
196 | "195",68,0,"nonanginal",120,211,0,2,115,0,1.5,2,0,"normal","No"
197 | "196",67,1,"asymptomatic",100,299,0,2,125,1,0.9,2,2,"normal","Yes"
198 | "197",69,1,"typical",160,234,1,2,131,0,0.1,2,1,"normal","No"
199 | "198",45,0,"asymptomatic",138,236,0,2,152,1,0.2,2,0,"normal","No"
200 | "199",50,0,"nontypical",120,244,0,0,162,0,1.1,1,0,"normal","No"
201 | "200",59,1,"typical",160,273,0,2,125,0,0,1,0,"normal","Yes"
202 | "201",50,0,"asymptomatic",110,254,0,2,159,0,0,1,0,"normal","No"
203 | "202",64,0,"asymptomatic",180,325,0,0,154,1,0,1,0,"normal","No"
204 | "203",57,1,"nonanginal",150,126,1,0,173,0,0.2,1,1,"reversable","No"
205 | "204",64,0,"nonanginal",140,313,0,0,133,0,0.2,1,0,"reversable","No"
206 | "205",43,1,"asymptomatic",110,211,0,0,161,0,0,1,0,"reversable","No"
207 | "206",45,1,"asymptomatic",142,309,0,2,147,1,0,2,3,"reversable","Yes"
208 | "207",58,1,"asymptomatic",128,259,0,2,130,1,3,2,2,"reversable","Yes"
209 | "208",50,1,"asymptomatic",144,200,0,2,126,1,0.9,2,0,"reversable","Yes"
210 | "209",55,1,"nontypical",130,262,0,0,155,0,0,1,0,"normal","No"
211 | "210",62,0,"asymptomatic",150,244,0,0,154,1,1.4,2,0,"normal","Yes"
212 | "211",37,0,"nonanginal",120,215,0,0,170,0,0,1,0,"normal","No"
213 | "212",38,1,"typical",120,231,0,0,182,1,3.8,2,0,"reversable","Yes"
214 | "213",41,1,"nonanginal",130,214,0,2,168,0,2,2,0,"normal","No"
215 | "214",66,0,"asymptomatic",178,228,1,0,165,1,1,2,2,"reversable","Yes"
216 | "215",52,1,"asymptomatic",112,230,0,0,160,0,0,1,1,"normal","Yes"
217 | "216",56,1,"typical",120,193,0,2,162,0,1.9,2,0,"reversable","No"
218 | "217",46,0,"nontypical",105,204,0,0,172,0,0,1,0,"normal","No"
219 | "218",46,0,"asymptomatic",138,243,0,2,152,1,0,2,0,"normal","No"
220 | "219",64,0,"asymptomatic",130,303,0,0,122,0,2,2,2,"normal","No"
221 | "220",59,1,"asymptomatic",138,271,0,2,182,0,0,1,0,"normal","No"
222 | "221",41,0,"nonanginal",112,268,0,2,172,1,0,1,0,"normal","No"
223 | "222",54,0,"nonanginal",108,267,0,2,167,0,0,1,0,"normal","No"
224 | "223",39,0,"nonanginal",94,199,0,0,179,0,0,1,0,"normal","No"
225 | "224",53,1,"asymptomatic",123,282,0,0,95,1,2,2,2,"reversable","Yes"
226 | "225",63,0,"asymptomatic",108,269,0,0,169,1,1.8,2,2,"normal","Yes"
227 | "226",34,0,"nontypical",118,210,0,0,192,0,0.7,1,0,"normal","No"
228 | "227",47,1,"asymptomatic",112,204,0,0,143,0,0.1,1,0,"normal","No"
229 | "228",67,0,"nonanginal",152,277,0,0,172,0,0,1,1,"normal","No"
230 | "229",54,1,"asymptomatic",110,206,0,2,108,1,0,2,1,"normal","Yes"
231 | "230",66,1,"asymptomatic",112,212,0,2,132,1,0.1,1,1,"normal","Yes"
232 | "231",52,0,"nonanginal",136,196,0,2,169,0,0.1,2,0,"normal","No"
233 | "232",55,0,"asymptomatic",180,327,0,1,117,1,3.4,2,0,"normal","Yes"
234 | "233",49,1,"nonanginal",118,149,0,2,126,0,0.8,1,3,"normal","Yes"
235 | "234",74,0,"nontypical",120,269,0,2,121,1,0.2,1,1,"normal","No"
236 | "235",54,0,"nonanginal",160,201,0,0,163,0,0,1,1,"normal","No"
237 | "236",54,1,"asymptomatic",122,286,0,2,116,1,3.2,2,2,"normal","Yes"
238 | "237",56,1,"asymptomatic",130,283,1,2,103,1,1.6,3,0,"reversable","Yes"
239 | "238",46,1,"asymptomatic",120,249,0,2,144,0,0.8,1,0,"reversable","Yes"
240 | "239",49,0,"nontypical",134,271,0,0,162,0,0,2,0,"normal","No"
241 | "240",42,1,"nontypical",120,295,0,0,162,0,0,1,0,"normal","No"
242 | "241",41,1,"nontypical",110,235,0,0,153,0,0,1,0,"normal","No"
243 | "242",41,0,"nontypical",126,306,0,0,163,0,0,1,0,"normal","No"
244 | "243",49,0,"asymptomatic",130,269,0,0,163,0,0,1,0,"normal","No"
245 | "244",61,1,"typical",134,234,0,0,145,0,2.6,2,2,"normal","Yes"
246 | "245",60,0,"nonanginal",120,178,1,0,96,0,0,1,0,"normal","No"
247 | "246",67,1,"asymptomatic",120,237,0,0,71,0,1,2,0,"normal","Yes"
248 | "247",58,1,"asymptomatic",100,234,0,0,156,0,0.1,1,1,"reversable","Yes"
249 | "248",47,1,"asymptomatic",110,275,0,2,118,1,1,2,1,"normal","Yes"
250 | "249",52,1,"asymptomatic",125,212,0,0,168,0,1,1,2,"reversable","Yes"
251 | "250",62,1,"nontypical",128,208,1,2,140,0,0,1,0,"normal","No"
252 | "251",57,1,"asymptomatic",110,201,0,0,126,1,1.5,2,0,"fixed","No"
253 | "252",58,1,"asymptomatic",146,218,0,0,105,0,2,2,1,"reversable","Yes"
254 | "253",64,1,"asymptomatic",128,263,0,0,105,1,0.2,2,1,"reversable","No"
255 | "254",51,0,"nonanginal",120,295,0,2,157,0,0.6,1,0,"normal","No"
256 | "255",43,1,"asymptomatic",115,303,0,0,181,0,1.2,2,0,"normal","No"
257 | "256",42,0,"nonanginal",120,209,0,0,173,0,0,2,0,"normal","No"
258 | "257",67,0,"asymptomatic",106,223,0,0,142,0,0.3,1,2,"normal","No"
259 | "258",76,0,"nonanginal",140,197,0,1,116,0,1.1,2,0,"normal","No"
260 | "259",70,1,"nontypical",156,245,0,2,143,0,0,1,0,"normal","No"
261 | "260",57,1,"nontypical",124,261,0,0,141,0,0.3,1,0,"reversable","Yes"
262 | "261",44,0,"nonanginal",118,242,0,0,149,0,0.3,2,1,"normal","No"
263 | "262",58,0,"nontypical",136,319,1,2,152,0,0,1,2,"normal","Yes"
264 | "263",60,0,"typical",150,240,0,0,171,0,0.9,1,0,"normal","No"
265 | "264",44,1,"nonanginal",120,226,0,0,169,0,0,1,0,"normal","No"
266 | "265",61,1,"asymptomatic",138,166,0,2,125,1,3.6,2,1,"normal","Yes"
267 | "266",42,1,"asymptomatic",136,315,0,0,125,1,1.8,2,0,"fixed","Yes"
268 | "267",52,1,"asymptomatic",128,204,1,0,156,1,1,2,0,NA,"Yes"
269 | "268",59,1,"nonanginal",126,218,1,0,134,0,2.2,2,1,"fixed","Yes"
270 | "269",40,1,"asymptomatic",152,223,0,0,181,0,0,1,0,"reversable","Yes"
271 | "270",42,1,"nonanginal",130,180,0,0,150,0,0,1,0,"normal","No"
272 | "271",61,1,"asymptomatic",140,207,0,2,138,1,1.9,1,1,"reversable","Yes"
273 | "272",66,1,"asymptomatic",160,228,0,2,138,0,2.3,1,0,"fixed","No"
274 | "273",46,1,"asymptomatic",140,311,0,0,120,1,1.8,2,2,"reversable","Yes"
275 | "274",71,0,"asymptomatic",112,149,0,0,125,0,1.6,2,0,"normal","No"
276 | "275",59,1,"typical",134,204,0,0,162,0,0.8,1,2,"normal","Yes"
277 | "276",64,1,"typical",170,227,0,2,155,0,0.6,2,0,"reversable","No"
278 | "277",66,0,"nonanginal",146,278,0,2,152,0,0,2,1,"normal","No"
279 | "278",39,0,"nonanginal",138,220,0,0,152,0,0,2,0,"normal","No"
280 | "279",57,1,"nontypical",154,232,0,2,164,0,0,1,1,"normal","Yes"
281 | "280",58,0,"asymptomatic",130,197,0,0,131,0,0.6,2,0,"normal","No"
282 | "281",57,1,"asymptomatic",110,335,0,0,143,1,3,2,1,"reversable","Yes"
283 | "282",47,1,"nonanginal",130,253,0,0,179,0,0,1,0,"normal","No"
284 | "283",55,0,"asymptomatic",128,205,0,1,130,1,2,2,1,"reversable","Yes"
285 | "284",35,1,"nontypical",122,192,0,0,174,0,0,1,0,"normal","No"
286 | "285",61,1,"asymptomatic",148,203,0,0,161,0,0,1,1,"reversable","Yes"
287 | "286",58,1,"asymptomatic",114,318,0,1,140,0,4.4,3,3,"fixed","Yes"
288 | "287",58,0,"asymptomatic",170,225,1,2,146,1,2.8,2,2,"fixed","Yes"
289 | "288",58,1,"nontypical",125,220,0,0,144,0,0.4,2,NA,"reversable","No"
290 | "289",56,1,"nontypical",130,221,0,2,163,0,0,1,0,"reversable","No"
291 | "290",56,1,"nontypical",120,240,0,0,169,0,0,3,0,"normal","No"
292 | "291",67,1,"nonanginal",152,212,0,2,150,0,0.8,2,0,"reversable","Yes"
293 | "292",55,0,"nontypical",132,342,0,0,166,0,1.2,1,0,"normal","No"
294 | "293",44,1,"asymptomatic",120,169,0,0,144,1,2.8,3,0,"fixed","Yes"
295 | "294",63,1,"asymptomatic",140,187,0,2,144,1,4,1,2,"reversable","Yes"
296 | "295",63,0,"asymptomatic",124,197,0,0,136,1,0,2,0,"normal","Yes"
297 | "296",41,1,"nontypical",120,157,0,0,182,0,0,1,0,"normal","No"
298 | "297",59,1,"asymptomatic",164,176,1,2,90,0,1,2,2,"fixed","Yes"
299 | "298",57,0,"asymptomatic",140,241,0,0,123,1,0.2,2,0,"reversable","Yes"
300 | "299",45,1,"typical",110,264,0,0,132,0,1.2,2,0,"reversable","Yes"
301 | "300",68,1,"asymptomatic",144,193,1,0,141,0,3.4,2,2,"reversable","Yes"
302 | "301",57,1,"asymptomatic",130,131,0,0,115,1,1.2,2,1,"reversable","Yes"
303 | "302",57,0,"nontypical",130,236,0,2,174,0,0,2,1,"normal","Yes"
304 | "303",38,1,"nonanginal",138,175,0,0,173,0,0,1,NA,"normal","No"
305 |
--------------------------------------------------------------------------------
/in_class_notebooks/DT_viz/data/residency.csv:
--------------------------------------------------------------------------------
1 | ,Age,Salary,Degree,Residency
2 | 0,24,40000,1,1
3 | 1,53,52000,0,0
4 | 2,23,25000,1,0
5 | 3,25,77000,1,1
6 | 4,32,48000,0,1
7 | 5,52,110000,1,1
8 | 6,22,38000,1,1
9 | 7,43,44000,1,0
10 | 8,52,27000,0,0
11 | 9,48,65000,1,1
12 |
--------------------------------------------------------------------------------
/in_class_notebooks/DT_viz/in_class_DT-Release.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## In-class notebook : Decision tree visualization and hyperparameter tuning"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "In this notebook, we will learn how to visualize a trained decision tree classifier. We will also manually tune the `hyperparameters` of the tree and visualize the results of that tuning. "
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "First we will run this cell that improrts the required libraries for this exercise. We will be using a python package called pydotplus. Make sure you install this in your python distribution. You will also have to download the graphviz software [https://graphviz.gitlab.io/download/]. "
22 | ]
23 | },
24 | {
25 | "cell_type": "code",
26 | "execution_count": null,
27 | "metadata": {},
28 | "outputs": [],
29 | "source": [
30 | "import pandas as pd\n",
31 | "import numpy as np\n",
32 | "import matplotlib.pyplot as plt\n",
33 | "%matplotlib inline\n",
34 | "\n",
35 | "from sklearn.externals.six import StringIO \n",
36 | "from IPython.display import Image \n",
37 | "from sklearn.tree import export_graphviz\n",
38 | "\n",
39 | "import pydotplus\n",
40 | "\n",
41 | "from sklearn.model_selection import cross_val_score\n",
42 | "from sklearn.model_selection import train_test_split\n",
43 | "from sklearn.tree import DecisionTreeClassifier\n"
44 | ]
45 | },
46 | {
47 | "cell_type": "markdown",
48 | "metadata": {},
49 | "source": [
50 | "To ensure that your graphviz executables are added to your PATH variable, please run the following cell. Replace the RHS of graphviz_path with your actual path to the Graphviz bin."
51 | ]
52 | },
53 | {
54 | "cell_type": "code",
55 | "execution_count": null,
56 | "metadata": {},
57 | "outputs": [],
58 | "source": [
59 | "import os\n",
60 | "graphviz_path = 'C:/Program Files (x86)/Graphviz2.38/bin/'\n",
61 | "os.environ[\"PATH\"] += os.pathsep + graphviz_path"
62 | ]
63 | },
76 | {
77 | "cell_type": "code",
78 | "execution_count": null,
79 | "metadata": {},
80 | "outputs": [],
81 | "source": [
82 | "df = pd.read_csv('data/residency.csv')\n",
83 | "df = df.drop('Unnamed: 0',axis=1)\n",
84 | "y = df['Residency']\n",
85 | "X= df.drop('Residency',axis=1)\n",
86 | "df.head()"
87 | ]
88 | },
89 | {
90 | "cell_type": "code",
91 | "execution_count": null,
92 | "metadata": {},
93 | "outputs": [],
94 | "source": [
95 | "clf = DecisionTreeClassifier(random_state=0)\n",
96 | "clf.fit(X,y)"
97 | ]
98 | },
99 | {
100 | "cell_type": "code",
101 | "execution_count": null,
102 | "metadata": {},
103 | "outputs": [],
104 | "source": [
105 | "dot_data = StringIO()\n",
106 | "export_graphviz(clf, out_file=dot_data, filled=True, rounded=True, special_characters=True)\n",
107 | "graph = pydotplus.graph_from_dot_data(dot_data.getvalue()) \n",
108 | "Image(graph.create_png())"
109 | ]
110 | },
111 | {
112 | "cell_type": "code",
113 | "execution_count": null,
114 | "metadata": {},
115 | "outputs": [],
116 | "source": [
117 | " ### There is another way to make the tree graph, but the visualization process is similar (we use export_graphviz)\n",
118 | "from sklearn import tree\n",
119 | "tree.plot_tree(clf.fit(X, y))"
120 | ]
121 | },
122 | {
123 | "cell_type": "code",
124 | "execution_count": null,
125 | "metadata": {},
126 | "outputs": [],
127 | "source": [
128 | "new_cat_features=preprocessor.transformers_[1][1]['onehot']\\\n",
129 | " .get_feature_names(categorical_features)\n",
130 | "# dot_data = tree.export_graphviz(clf, out_file=None,feature_names=list(numeric_features)+list(new_cat_features),class_names=['No','Yes'],filled=True, rounded=True,special_characters=True)\n",
131 | "dot_data = tree.export_graphviz(clf, out_file=None,feature_names=['Age','Salary','Degree'],class_names=['No','Yes'],filled=True, rounded=True,special_characters=True)"
132 | ]
133 | },
134 | {
135 | "cell_type": "code",
136 | "execution_count": null,
137 | "metadata": {},
138 | "outputs": [],
139 | "source": [
140 | "import graphviz\n",
141 | "graph = graphviz.Source(dot_data)\n"
142 | ]
143 | },
144 | {
145 | "cell_type": "code",
146 | "execution_count": null,
147 | "metadata": {
148 | "scrolled": false
149 | },
150 | "outputs": [],
151 | "source": [
152 | "graph"
153 | ]
154 | },
155 | {
156 | "cell_type": "code",
157 | "execution_count": null,
158 | "metadata": {},
159 | "outputs": [],
160 | "source": [
161 | "Image(graph.create_png())\n"
162 | ]
163 | },
164 | {
165 | "cell_type": "code",
166 | "execution_count": null,
167 | "metadata": {},
168 | "outputs": [],
169 | "source": [
170 | "df = pd.read_csv(\"data/Heart_cleaned.csv\")\n",
171 | "df = df.drop('Unnamed: 0',axis=1)\n",
172 | "\n",
173 | "y= df['AHD']\n",
174 | "X = df.drop('AHD',axis=1)"
175 | ]
176 | },
177 | {
178 | "cell_type": "code",
179 | "execution_count": null,
180 | "metadata": {},
181 | "outputs": [],
182 | "source": [
183 | "clf = DecisionTreeClassifier(random_state=0)\n",
184 | "clf.fit(X,y)"
185 | ]
186 | },
187 | {
188 | "cell_type": "code",
189 | "execution_count": null,
190 | "metadata": {},
191 | "outputs": [],
192 | "source": [
193 | "#new_cat_features=preprocessor.transformers_[1][1]['onehot']\\\n",
194 | " #.get_feature_names(categorical_features)\n",
195 | " \n",
196 | "new_cat_features = list(X.columns)\n",
197 | "dot_data = tree.export_graphviz(clf, out_file=None,feature_names=new_cat_features,class_names=['No','Yes'],filled=True, rounded=True,special_characters=True)"
198 | ]
199 | },
200 | {
201 | "cell_type": "code",
202 | "execution_count": null,
203 | "metadata": {},
204 | "outputs": [],
205 | "source": [
206 | "import graphviz\n",
207 | "graph = graphviz.Source(dot_data)\n"
208 | ]
209 | },
210 | {
211 | "cell_type": "code",
212 | "execution_count": null,
213 | "metadata": {},
214 | "outputs": [],
215 | "source": [
216 | "graph"
217 | ]
218 | },
219 | {
220 | "cell_type": "markdown",
221 | "metadata": {},
222 | "source": [
223 | "## Exercises: Manual and automatic hyperparameter tuning"
224 | ]
225 | },
226 | {
227 | "cell_type": "markdown",
228 | "metadata": {},
229 | "source": [
230 | "### 1. Grow a decision tree classifier and change its options and visualize the tree to check what's happening\n",
231 | "- 1.1 `max_depth`\n",
232 | "- 1.2 `min_samples_split`\n",
233 | "- 1.3 `min_samples_leaf`\n",
234 | "- 1.4 `max_features`\n",
235 | "- 1.5 `min_impurity_decrease` \n",
236 | "See the [document](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.fit) for details.\n",
237 | " "
238 | ]
239 | },
240 | {
241 | "cell_type": "code",
242 | "execution_count": null,
243 | "metadata": {},
244 | "outputs": [],
245 | "source": [
246 | "clf = DecisionTreeClassifier()"
247 | ]
248 | },
249 | {
250 | "cell_type": "code",
251 | "execution_count": null,
252 | "metadata": {},
253 | "outputs": [],
254 | "source": [
255 | "clf.get_params() #check the default options"
256 | ]
257 | },
258 | {
259 | "cell_type": "code",
260 | "execution_count": null,
261 | "metadata": {},
262 | "outputs": [],
263 | "source": [
264 | "# First, the max_depth is None by default. Seeing the visualization above, the depth grows over 10. \n",
265 | "# So we can pick a max_depth that's smaller than 10, for example let's pick 5.\n",
266 | "# Note that the example numbers here are for demo purpose and would not necessarily be the best choice.\n"
267 | ]
268 | },
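{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A minimal sketch (one reasonable choice, not the only one): refit with max_depth=5 and re-draw the tree\n",
"clf = DecisionTreeClassifier(max_depth=5, random_state=0)\n",
"clf.fit(X, y)\n",
"dot_data = tree.export_graphviz(clf, out_file=None, feature_names=list(X.columns), class_names=['No','Yes'], filled=True, rounded=True, special_characters=True)\n",
"graphviz.Source(dot_data)"
]
},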
269 | {
270 | "cell_type": "markdown",
271 | "metadata": {},
272 | "source": [
273 | "### 2. Pick a performance metric (for classification) and optimize those tuning parameters. Does a tree perform better when fully grown or early stopped using those parameters?"
274 | ]
275 | },
276 | {
277 | "cell_type": "code",
278 | "execution_count": null,
279 | "metadata": {},
280 | "outputs": [],
281 | "source": [
282 | "#Specify the parameter space (max_depth, min_sample_split, min_samples_leaf)\n",
283 | "param_space={\n",
284 | "'max_depth' : [3, 5, 7, 9, None],\n",
285 | "'min_samples_split' : [2, 3, 5],\n",
286 | "'min_samples_leaf' : [1, 2, 3, 5],\n",
287 | "'max_features' : [4, 8, None]}"
288 | ]
289 | },
290 | {
291 | "cell_type": "code",
292 | "execution_count": null,
293 | "metadata": {},
294 | "outputs": [],
295 | "source": [
296 | "# sklearn has a convenient function that can do grid search, as well as cross validation.\n",
297 | "# see more in https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html\n",
298 | "\n",
299 | "\n",
300 | "#If you want to use different types of cv (e.g. stratified- which also takes care of class label imbalance), you can construct cv object.\n",
301 | "# see more in https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold\n",
302 | "# and https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold"
303 | ]
304 | }
305 | ],
306 | "metadata": {
307 | "kernelspec": {
308 | "display_name": "Python 3",
309 | "language": "python",
310 | "name": "python3"
311 | },
312 | "language_info": {
313 | "codemirror_mode": {
314 | "name": "ipython",
315 | "version": 3
316 | },
317 | "file_extension": ".py",
318 | "mimetype": "text/x-python",
319 | "name": "python",
320 | "nbconvert_exporter": "python",
321 | "pygments_lexer": "ipython3",
322 | "version": "3.7.4"
323 | }
324 | },
325 | "nbformat": 4,
326 | "nbformat_minor": 2
327 | }
328 |
--------------------------------------------------------------------------------
/in_class_notebooks/EDA/.ipynb_checkpoints/in-class_EDA-release-checkpoint.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## In-Class Notebook "
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "### Numpy Review\n",
15 | "\n",
16 | "Data scientists primarily deal with structured numeric data. While tuples, lists and dictionaries are useful for general programming, *vectors* and *arrays* are more useful for mathematical calculations.\n",
17 | "\n",
18 | "[NumPy](https://docs.scipy.org/doc/numpy-1.13.0/index.html) is an *extension module* to the Python language that provides vectors and arrays. NumPy has been imported with alias `np` in the cell below. We will now go through some basic numpy operations. \n"
19 | ]
20 | },
21 | {
22 | "cell_type": "code",
23 | "execution_count": 2,
24 | "metadata": {},
25 | "outputs": [],
26 | "source": [
27 | "import numpy as np\n"
28 | ]
29 | },
30 | {
31 | "cell_type": "markdown",
32 | "metadata": {},
33 | "source": [
34 | "#### 1. Checking the numpy version you have intalled in your system"
35 | ]
36 | },
37 | {
38 | "cell_type": "code",
39 | "execution_count": 3,
40 | "metadata": {},
41 | "outputs": [
42 | {
43 | "data": {
44 | "text/plain": [
45 | "'1.16.5'"
46 | ]
47 | },
48 | "execution_count": 3,
49 | "metadata": {},
50 | "output_type": "execute_result"
51 | }
52 | ],
53 | "source": [
54 | "# Type out python command for displaying version\n",
55 | "np.__version__"
56 | ]
57 | },
58 | {
59 | "cell_type": "markdown",
60 | "metadata": {},
61 | "source": [
62 | "#### 2. Create a 10x10 matrix, in which the elements on the borders will be equal to 1, and inside 0"
63 | ]
64 | },
65 | {
66 | "cell_type": "code",
67 | "execution_count": null,
68 | "metadata": {},
69 | "outputs": [],
70 | "source": [
71 | "# Create your 10X10 matrix and store it in x (HINT: Make it all ones or zeros to begin with)\n",
72 | "x = 0\n",
73 | "\n",
74 | "# Slice indexing to modify the matrix\n",
75 | "\n",
76 | "print(x)"
77 | ]
78 | },
79 | {
80 | "cell_type": "markdown",
81 | "metadata": {},
82 | "source": [
83 | "#### 3. Compute the multiplication of two given matrixes"
84 | ]
85 | },
86 | {
87 | "cell_type": "code",
88 | "execution_count": null,
89 | "metadata": {},
90 | "outputs": [],
91 | "source": [
92 | "\n",
93 | "p = [[1, 0], [0, 1]]\n",
94 | "q = [[1, 2], [3, 4]]\n",
95 | "print(\"original matrix:\")\n",
96 | "print(p)\n",
97 | "print(q)\n",
98 | "\n",
99 | "# Enter the solution to do matrix multiplication\n",
100 | "\n",
101 | "# Return shape of resultant matrix\n",
102 | "\n",
103 | "print(\"Result of the said matrix multiplication:\")\n",
104 | "\n",
105 | "# print result\n"
106 | ]
107 | },
108 | {
109 | "cell_type": "markdown",
110 | "metadata": {},
111 | "source": [
112 | "#### 4.Generate random numbers from a normal distribution with mean 2 and standard deviation 1"
113 | ]
114 | },
115 | {
116 | "cell_type": "code",
117 | "execution_count": null,
118 | "metadata": {},
119 | "outputs": [],
120 | "source": [
121 | "mu = 2\n",
122 | "s_dev = 1.5\n",
123 | "\n",
124 | "# Enter numpy command to generate \n",
125 | "\n",
126 | "# Check shape of resulting variable \n"
127 | ]
128 | },
129 | {
130 | "cell_type": "markdown",
131 | "metadata": {},
132 | "source": [
133 | "#### 5. Shuffling Arrays : Shuffle numbers between 0 and 10"
134 | ]
135 | },
136 | {
137 | "cell_type": "code",
138 | "execution_count": null,
139 | "metadata": {},
140 | "outputs": [],
141 | "source": [
142 | "\n",
143 | "x = np.arange(10) # Creatinga vector of 10 elements\n",
144 | "\n",
145 | "# Randomly shuffle the elements of the vector\n",
146 | "\n",
147 | "# Print the result in a nice format\n"
148 | ]
149 | },
150 | {
151 | "cell_type": "markdown",
152 | "metadata": {},
153 | "source": [
154 | "## Pandas Dataframe Review\n",
155 | "\n",
156 | "Pandas is part of an ecosystem of Python software used for statistical analysis.\n",
157 | "\n",
158 | "Pandas extends Python with two datatypes used in statistical analysis: the Series and the DataFrame.\n",
159 | "\n",
160 | "The name \"Pandas\" is derived from \"Panel Data\", a particular way of representing data represented in Pandas by the DataFrame.\n",
161 | "\n",
162 | "As with NumPy, we need to import Pandas. We'll see almost all of our notebooks starting with:\n"
163 | ]
164 | },
165 | {
166 | "cell_type": "code",
167 | "execution_count": null,
168 | "metadata": {},
169 | "outputs": [],
170 | "source": [
171 | "import pandas as pd"
172 | ]
173 | },
174 | {
175 | "cell_type": "markdown",
176 | "metadata": {},
177 | "source": [
178 | "DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object.\n",
179 | "\n",
180 | "There are a number of ways you can construct a DataFrame. One of the most common is to use a python Dictionary to label the different columns."
181 | ]
182 | },
183 | {
184 | "cell_type": "code",
185 | "execution_count": null,
186 | "metadata": {},
187 | "outputs": [],
188 | "source": [
189 | "ages = np.array([20, 39, 45, 18, 56, 90])\n",
190 | "salary = np.array([10000, 40000, 50000, 8000, 55000, 5000])"
191 | ]
192 | },
193 | {
194 | "cell_type": "code",
195 | "execution_count": null,
196 | "metadata": {},
197 | "outputs": [],
198 | "source": [
199 | "\n",
200 | "# Using a python dictionary, create pandas dataframe\n",
201 | "\n",
202 | "df"
203 | ]
204 | },
205 | {
206 | "cell_type": "markdown",
207 | "metadata": {},
208 | "source": [
209 | "You can also specify the index column (_e.g._ it could be the names of the people represented in the age/salary data)"
210 | ]
211 | },
212 | {
213 | "cell_type": "code",
214 | "execution_count": null,
215 | "metadata": {},
216 | "outputs": [],
217 | "source": [
218 | "# Name your specific rows with an index columns \n",
219 | "\n",
220 | "df2"
221 | ]
222 | },
223 | {
224 | "cell_type": "markdown",
225 | "metadata": {},
226 | "source": [
227 | "#### 1. Accessing Rows of a Data Frame"
228 | ]
229 | },
230 | {
231 | "cell_type": "markdown",
232 | "metadata": {},
233 | "source": [
234 | "You can index a dataframe by row index to extract a set of rows. For integral indices, the range is specified as **from:to** to include entries from **from** to strictly less than **to**."
235 | ]
236 | },
237 | {
238 | "cell_type": "code",
239 | "execution_count": null,
240 | "metadata": {},
241 | "outputs": [],
242 | "source": [
243 | "# Select the sub-dataframe with rows 1 and 2"
244 | ]
245 | },
246 | {
247 | "cell_type": "markdown",
248 | "metadata": {},
249 | "source": [
250 | "You can do similar slices for named rows, but for inexplicible reasons, the range now includes all of the specified rows (i.e. it doesn't end before the last index)."
251 | ]
252 | },
253 | {
254 | "cell_type": "code",
255 | "execution_count": null,
256 | "metadata": {},
257 | "outputs": [],
258 | "source": [
259 | "# Select rows corresponding to the indices you used above to see the difference"
260 | ]
261 | },
262 | {
263 | "cell_type": "markdown",
264 | "metadata": {},
265 | "source": [
266 | "We can also access a single row of the data frame using index operations based on the location of the data."
267 | ]
268 | },
269 | {
270 | "cell_type": "code",
271 | "execution_count": null,
272 | "metadata": {},
273 | "outputs": [],
274 | "source": [
275 | "# Use the `loc` operator to select a particular row index"
276 | ]
277 | },
278 | {
279 | "cell_type": "markdown",
280 | "metadata": {},
281 | "source": [
282 | "#### 2. Accessing Elements and Columns of a DataFrame"
283 | ]
284 | },
285 | {
286 | "cell_type": "markdown",
287 | "metadata": {},
288 | "source": [
289 | "You can refer to each column using the name of the column. "
290 | ]
291 | },
292 | {
293 | "cell_type": "code",
294 | "execution_count": null,
295 | "metadata": {},
296 | "outputs": [],
297 | "source": [
298 | "# Select age column"
299 | ]
300 | },
301 | {
302 | "cell_type": "code",
303 | "execution_count": null,
304 | "metadata": {},
305 | "outputs": [],
306 | "source": [
307 | "# You can also use this form..\n"
308 | ]
309 | },
310 | {
311 | "cell_type": "code",
312 | "execution_count": null,
313 | "metadata": {},
314 | "outputs": [],
315 | "source": [
316 | "Once you've selected a column, you can access elements using the index for that specific row."
317 | ]
318 | },
319 | {
320 | "cell_type": "code",
321 | "execution_count": null,
322 | "metadata": {},
323 | "outputs": [],
324 | "source": [
325 | "# Select the age (or any other attribute) for a particular row"
326 | ]
327 | },
328 | {
329 | "cell_type": "code",
330 | "execution_count": null,
331 | "metadata": {},
332 | "outputs": [],
333 | "source": [
334 | "# Alternate way using the `loc` operator"
335 | ]
336 | },
337 | {
338 | "cell_type": "markdown",
339 | "metadata": {},
340 | "source": [
341 | "#### 3. Adding new columns"
342 | ]
343 | },
344 | {
345 | "cell_type": "code",
346 | "execution_count": null,
347 | "metadata": {},
348 | "outputs": [],
349 | "source": [
350 | "We can add new columns to the data frame simply by assigning to them."
351 | ]
352 | },
353 | {
354 | "cell_type": "code",
355 | "execution_count": null,
356 | "metadata": {},
357 | "outputs": [],
358 | "source": [
359 | "# Create a new column where each element is the product of the corresponding age and salary"
360 | ]
361 | },
362 | {
363 | "cell_type": "markdown",
364 | "metadata": {},
365 | "source": [
366 | "## Exploratory Data Analysis with a data set"
367 | ]
368 | },
369 | {
370 | "cell_type": "markdown",
371 | "metadata": {},
372 | "source": [
373 | "Go to https://archive.ics.uci.edu/ml/datasets/Breast+Cancer and download the file. You can change the data file format to csv."
374 | ]
375 | },
376 | {
377 | "cell_type": "code",
378 | "execution_count": null,
379 | "metadata": {},
380 | "outputs": [],
381 | "source": [
382 | "# read the csv file to a data frame (there is literally a function called read_csv)\n"
383 | ]
384 | },
385 | {
386 | "cell_type": "code",
387 | "execution_count": null,
388 | "metadata": {},
389 | "outputs": [],
390 | "source": [
391 | "df.head() # Shows the top few rows of the dataframe that you just imported "
392 | ]
393 | },
394 | {
395 | "cell_type": "code",
396 | "execution_count": null,
397 | "metadata": {},
398 | "outputs": [],
399 | "source": [
400 | "df.info()#check the data type- it's not always desired form, you can change"
401 | ]
402 | },
403 | {
404 | "cell_type": "markdown",
405 | "metadata": {},
406 | "source": [
407 | "### Data Cleaning Ideas\n",
408 | "1. Change column names \n",
409 | "2. Find Null values and clean them\n",
410 | "3. Check the data types\n",
411 | "4. Convert ordinal category strings to number\n",
412 | "5. Convert non-ordinal category strings to **dummified** array"
413 | ]
414 | },
415 | {
416 | "cell_type": "code",
417 | "execution_count": null,
418 | "metadata": {},
419 | "outputs": [],
420 | "source": [
421 | "# Let us check what columns there are"
422 | ]
423 | },
424 | {
425 | "cell_type": "code",
426 | "execution_count": null,
427 | "metadata": {},
428 | "outputs": [],
429 | "source": [
430 | "# create a dictionary and use it to rename the columns\n",
431 | "columns = ['class','age','menopause','tumor-size','inv-nodes','node-caps','deg-malig','breast','breast-quad','irradiat']\n",
432 | "\n",
433 | "# Create a list of out the dataframe column names \n",
434 | "\n",
435 | "# Pair it with elements of the `columns` list using the zip() function\n",
436 | "\n",
437 | "# Create a dictionary out of the \n",
438 | "\n",
439 | "print(dd)"
440 | ]
441 | },
442 | {
443 | "cell_type": "code",
444 | "execution_count": null,
445 | "metadata": {},
446 | "outputs": [],
447 | "source": [
448 | "df.rename(columns=dd, inplace=True)"
449 | ]
450 | },
451 | {
452 | "cell_type": "code",
453 | "execution_count": null,
454 | "metadata": {},
455 | "outputs": [],
456 | "source": [
457 | "df.head()"
458 | ]
459 | },
460 | {
461 | "cell_type": "code",
462 | "execution_count": null,
463 | "metadata": {},
464 | "outputs": [],
465 | "source": [
466 | "# we want to change \"class\" to be 0 or 1\n",
467 | "# you can change values using .apply() : 0- no-recurrence-events , 1- recurrence events; check out the lambda function to acheive this with minimum code\n",
468 | "\n",
469 | "#make sure that there is no other values beofre you use if-else. we know it from the unique values we've inspected above"
470 | ]
471 | },
472 | {
473 | "cell_type": "code",
474 | "execution_count": null,
475 | "metadata": {},
476 | "outputs": [],
477 | "source": [
478 | "df['class'].unique()"
479 | ]
480 | },
481 | {
482 | "cell_type": "code",
483 | "execution_count": null,
484 | "metadata": {},
485 | "outputs": [],
486 | "source": [
487 | "df.head()"
488 | ]
489 | },
490 | {
491 | "cell_type": "markdown",
492 | "metadata": {},
493 | "source": [
494 | "Age ranges are inconvenient to handle and it would be more convenient if this were a numeric column. Let us make each element the average of the range to achieve this."
495 | ]
496 | },
497 | {
498 | "cell_type": "code",
499 | "execution_count": null,
500 | "metadata": {},
501 | "outputs": [],
502 | "source": [
503 | "# List with average of each age range\n",
504 | "ageval=[24,35,45,55,65,75]"
505 | ]
506 | },
507 | {
508 | "cell_type": "code",
509 | "execution_count": null,
510 | "metadata": {},
511 | "outputs": [],
512 | "source": [
513 | "# Similar to what we did with the column names, let us make a dictionary to link the age range to its corresponding\n",
514 | "# age average\n",
515 | "\n"
516 | ]
517 | },
518 | {
519 | "cell_type": "code",
520 | "execution_count": null,
521 | "metadata": {},
522 | "outputs": [],
523 | "source": [
524 | "#view the dictionary that you create"
525 | ]
526 | },
527 | {
528 | "cell_type": "code",
529 | "execution_count": null,
530 | "metadata": {},
531 | "outputs": [],
532 | "source": [
533 | "#replace the age column `df['age']` with its numeric counter parts using apply and lambda"
534 | ]
535 | },
536 | {
537 | "cell_type": "code",
538 | "execution_count": null,
539 | "metadata": {},
540 | "outputs": [],
541 | "source": [
542 | "df.head()"
543 | ]
544 | },
545 | {
546 | "cell_type": "code",
547 | "execution_count": null,
548 | "metadata": {},
549 | "outputs": [],
550 | "source": [
551 | "# Use get_dummies on the menopause column since those elements don't make sense\n",
552 | "# get_dummies creates a new dataframe with columns corresponding to discrete row elements ; the elements of the new\n",
553 | "# columns will be binary\n",
554 | "\n",
555 | "# concatenate newly created df to your existing df"
556 | ]
557 | },
558 | {
559 | "cell_type": "code",
560 | "execution_count": null,
561 | "metadata": {},
562 | "outputs": [],
563 | "source": [
564 | "# Here are some tricks to extract the average of the tumor size range. \n",
565 | "\n",
566 | "tumors=sorted(list(df['tumor-size'].unique())) #we'll create ordinal variable (average of the range as representitive of the category, but again, you can assign what makes sense for you)\n",
567 | "tsize=[(int(x[0])+int(x[1]))/2 for x in [x.split('-') for x in tumors]] #just some tricks to extract numbers from the string"
568 | ]
569 | },
570 | {
571 | "cell_type": "code",
572 | "execution_count": null,
573 | "metadata": {},
574 | "outputs": [],
575 | "source": [
576 | "# Create a dictionary once again for tumorsize to convert to numeric"
577 | ]
578 | },
579 | {
580 | "cell_type": "code",
581 | "execution_count": null,
582 | "metadata": {},
583 | "outputs": [],
584 | "source": [
585 | "# replace the tumor-size column with numeric counterparts"
586 | ]
587 | },
588 | {
589 | "cell_type": "code",
590 | "execution_count": null,
591 | "metadata": {},
592 | "outputs": [],
593 | "source": []
594 | },
595 | {
596 | "cell_type": "markdown",
597 | "metadata": {},
598 | "source": [
599 | "### Visualizing the data : Using Matplotlib"
600 | ]
601 | },
602 | {
603 | "cell_type": "markdown",
604 | "metadata": {},
605 | "source": [
606 | "One way to do EDA visually is to make some basic plots of the data to extract basic information from it. Some of these plots include histograms and correlation plots. "
607 | ]
608 | },
609 | {
610 | "cell_type": "code",
611 | "execution_count": null,
612 | "metadata": {},
613 | "outputs": [],
614 | "source": [
615 | "import matplotlib.pyplot as plt"
616 | ]
617 | },
618 | {
619 | "cell_type": "code",
620 | "execution_count": null,
621 | "metadata": {},
622 | "outputs": [],
623 | "source": [
624 | "# plot histogram of ages"
625 | ]
626 | },
627 | {
628 | "cell_type": "code",
629 | "execution_count": null,
630 | "metadata": {},
631 | "outputs": [],
632 | "source": [
633 | "# Change the number of bins"
634 | ]
635 | },
636 | {
637 | "cell_type": "code",
638 | "execution_count": null,
639 | "metadata": {},
640 | "outputs": [],
641 | "source": [
642 | "# Plot the histogram of another variable of your choice"
643 | ]
644 | },
645 | {
646 | "cell_type": "code",
647 | "execution_count": null,
648 | "metadata": {},
649 | "outputs": [],
650 | "source": [
651 | "dfs = df1[['class','age','tumor-size','deg-malig','premeno']]\n",
652 | "\n",
653 | "#after changing them to numbers, we can see correlation matrix\n",
654 | "corr = dfs.corr()\n",
655 | "corr.style.background_gradient(cmap='coolwarm')"
656 | ]
657 | },
658 | {
659 | "cell_type": "code",
660 | "execution_count": null,
661 | "metadata": {},
662 | "outputs": [],
663 | "source": [
664 | "# Or using plt\n",
665 | "import matplotlib.pyplot as plt\n",
666 | "f = plt.figure(figsize=(19, 15))\n",
667 | "plt.matshow(dfs.corr(), fignum=f.number)\n",
668 | "plt.xticks(range(dfs.shape[1]), dfs.columns, fontsize=14, rotation=45)\n",
669 | "plt.yticks(range(dfs.shape[1]), dfs.columns, fontsize=14)\n",
670 | "cb = plt.colorbar()\n",
671 | "cb.ax.tick_params(labelsize=14)\n",
672 | "plt.title('Correlation Matrix', fontsize=16);"
673 | ]
674 | }
675 | ],
676 | "metadata": {
677 | "kernelspec": {
678 | "display_name": "Python 3",
679 | "language": "python",
680 | "name": "python3"
681 | },
682 | "language_info": {
683 | "codemirror_mode": {
684 | "name": "ipython",
685 | "version": 3
686 | },
687 | "file_extension": ".py",
688 | "mimetype": "text/x-python",
689 | "name": "python",
690 | "nbconvert_exporter": "python",
691 | "pygments_lexer": "ipython3",
692 | "version": "3.7.4"
693 | }
694 | },
695 | "nbformat": 4,
696 | "nbformat_minor": 2
697 | }
698 |
--------------------------------------------------------------------------------
/in_class_notebooks/EDA/data/breast-cancer.csv:
--------------------------------------------------------------------------------
1 | no-recurrence-events,30-39,premeno,30-34,0-2,no,3,left,left_low,no
2 | no-recurrence-events,40-49,premeno,20-24,0-2,no,2,right,right_up,no
3 | no-recurrence-events,40-49,premeno,20-24,0-2,no,2,left,left_low,no
4 | no-recurrence-events,60-69,ge40,15-19,0-2,no,2,right,left_up,no
5 | no-recurrence-events,40-49,premeno,0-4,0-2,no,2,right,right_low,no
6 | no-recurrence-events,60-69,ge40,15-19,0-2,no,2,left,left_low,no
7 | no-recurrence-events,50-59,premeno,25-29,0-2,no,2,left,left_low,no
8 | no-recurrence-events,60-69,ge40,20-24,0-2,no,1,left,left_low,no
9 | no-recurrence-events,40-49,premeno,50-54,0-2,no,2,left,left_low,no
10 | no-recurrence-events,40-49,premeno,20-24,0-2,no,2,right,left_up,no
11 | no-recurrence-events,40-49,premeno,0-4,0-2,no,3,left,central,no
12 | no-recurrence-events,50-59,ge40,25-29,0-2,no,2,left,left_low,no
13 | no-recurrence-events,60-69,lt40,10-14,0-2,no,1,left,right_up,no
14 | no-recurrence-events,50-59,ge40,25-29,0-2,no,3,left,right_up,no
15 | no-recurrence-events,40-49,premeno,30-34,0-2,no,3,left,left_up,no
16 | no-recurrence-events,60-69,lt40,30-34,0-2,no,1,left,left_low,no
17 | no-recurrence-events,40-49,premeno,15-19,0-2,no,2,left,left_low,no
18 | no-recurrence-events,50-59,premeno,30-34,0-2,no,3,left,left_low,no
19 | no-recurrence-events,60-69,ge40,30-34,0-2,no,3,left,left_low,no
20 | no-recurrence-events,50-59,ge40,30-34,0-2,no,1,right,right_up,no
21 | no-recurrence-events,50-59,ge40,40-44,0-2,no,2,left,left_low,no
22 | no-recurrence-events,60-69,ge40,15-19,0-2,no,2,left,left_low,no
23 | no-recurrence-events,30-39,premeno,25-29,0-2,no,2,right,left_low,no
24 | no-recurrence-events,50-59,premeno,40-44,0-2,no,2,left,left_up,no
25 | no-recurrence-events,50-59,premeno,35-39,0-2,no,2,right,left_up,no
26 | no-recurrence-events,40-49,premeno,25-29,0-2,no,2,left,left_up,no
27 | no-recurrence-events,50-59,premeno,20-24,0-2,no,1,left,left_low,no
28 | no-recurrence-events,60-69,ge40,25-29,0-2,no,3,right,left_up,no
29 | no-recurrence-events,40-49,premeno,40-44,0-2,no,2,right,left_low,no
30 | no-recurrence-events,60-69,ge40,30-34,0-2,no,2,left,left_low,no
31 | no-recurrence-events,50-59,ge40,40-44,0-2,no,3,right,left_up,no
32 | no-recurrence-events,50-59,premeno,15-19,0-2,no,2,right,left_low,no
33 | no-recurrence-events,50-59,premeno,10-14,0-2,no,3,left,left_low,no
34 | no-recurrence-events,50-59,ge40,10-14,0-2,no,1,right,left_up,no
35 | no-recurrence-events,50-59,ge40,10-14,0-2,no,1,left,left_up,no
36 | no-recurrence-events,30-39,premeno,30-34,0-2,no,2,left,left_up,no
37 | no-recurrence-events,50-59,ge40,0-4,0-2,no,2,left,central,no
38 | no-recurrence-events,50-59,ge40,15-19,0-2,no,1,right,central,no
39 | no-recurrence-events,40-49,premeno,10-14,0-2,no,2,left,left_low,no
40 | no-recurrence-events,40-49,premeno,30-34,0-2,no,1,left,left_low,no
41 | no-recurrence-events,50-59,ge40,20-24,0-2,no,1,right,left_low,no
42 | no-recurrence-events,60-69,ge40,25-29,0-2,no,2,left,left_low,no
43 | no-recurrence-events,60-69,ge40,5-9,0-2,no,1,left,central,no
44 | no-recurrence-events,40-49,premeno,10-14,0-2,no,2,left,left_up,no
45 | no-recurrence-events,50-59,ge40,50-54,0-2,no,1,right,right_up,no
46 | no-recurrence-events,50-59,ge40,30-34,0-2,no,1,left,left_up,no
47 | no-recurrence-events,40-49,premeno,25-29,0-2,no,2,right,left_low,no
48 | no-recurrence-events,50-59,premeno,25-29,0-2,no,1,right,left_up,no
49 | no-recurrence-events,40-49,premeno,20-24,0-2,no,1,right,right_up,no
50 | no-recurrence-events,40-49,premeno,20-24,0-2,no,1,right,left_low,no
51 | no-recurrence-events,50-59,lt40,15-19,0-2,no,2,left,left_low,no
52 | no-recurrence-events,30-39,premeno,20-24,0-2,no,2,left,right_low,no
53 | no-recurrence-events,50-59,premeno,15-19,0-2,no,1,left,left_low,no
54 | no-recurrence-events,70-79,ge40,20-24,0-2,no,3,left,left_up,no
55 | no-recurrence-events,70-79,ge40,40-44,0-2,no,1,right,left_up,no
56 | no-recurrence-events,70-79,ge40,40-44,0-2,no,1,right,right_up,no
57 | no-recurrence-events,50-59,ge40,0-4,0-2,no,1,right,central,no
58 | no-recurrence-events,50-59,ge40,5-9,0-2,no,2,right,right_up,no
59 | no-recurrence-events,60-69,ge40,30-34,0-2,no,1,left,left_up,no
60 | no-recurrence-events,60-69,ge40,15-19,0-2,no,1,right,left_up,no
61 | no-recurrence-events,40-49,premeno,20-24,0-2,no,2,left,central,no
62 | no-recurrence-events,40-49,premeno,10-14,0-2,no,1,right,right_low,no
63 | no-recurrence-events,50-59,ge40,0-4,0-2,no,1,left,left_low,no
64 | no-recurrence-events,20-29,premeno,35-39,0-2,no,2,right,right_up,no
65 | no-recurrence-events,40-49,premeno,25-29,0-2,no,1,left,right_low,no
66 | no-recurrence-events,40-49,premeno,10-14,0-2,no,1,right,left_up,no
67 | no-recurrence-events,40-49,premeno,25-29,0-2,no,1,right,right_low,no
68 | no-recurrence-events,50-59,ge40,20-24,0-2,no,3,left,left_up,no
69 | no-recurrence-events,50-59,ge40,35-39,0-2,no,3,left,left_low,no
70 | no-recurrence-events,60-69,ge40,50-54,0-2,no,2,left,left_low,no
71 | no-recurrence-events,60-69,ge40,10-14,0-2,no,1,left,left_low,no
72 | no-recurrence-events,40-49,premeno,25-29,0-2,no,2,right,left_up,no
73 | no-recurrence-events,60-69,ge40,20-24,0-2,no,2,left,left_up,no
74 | no-recurrence-events,50-59,premeno,15-19,0-2,no,2,right,right_low,no
75 | no-recurrence-events,30-39,premeno,5-9,0-2,no,2,left,right_low,no
76 | no-recurrence-events,50-59,ge40,10-14,0-2,no,1,left,left_low,no
77 | no-recurrence-events,50-59,ge40,10-14,0-2,no,2,left,left_low,no
78 | no-recurrence-events,30-39,premeno,25-29,0-2,no,1,left,central,no
79 | no-recurrence-events,50-59,premeno,25-29,0-2,no,2,left,left_low,no
80 | no-recurrence-events,40-49,premeno,25-29,0-2,no,2,right,central,no
81 | no-recurrence-events,50-59,ge40,10-14,0-2,no,2,right,left_low,no
82 | no-recurrence-events,60-69,ge40,10-14,0-2,no,1,left,left_up,no
83 | no-recurrence-events,60-69,ge40,15-19,0-2,no,2,right,left_low,no
84 | no-recurrence-events,50-59,ge40,15-19,0-2,no,2,right,left_low,no
85 | no-recurrence-events,40-49,premeno,20-24,0-2,no,1,left,right_low,no
86 | no-recurrence-events,50-59,ge40,35-39,0-2,no,3,left,left_up,no
87 | no-recurrence-events,60-69,ge40,25-29,0-2,no,2,right,left_low,no
88 | no-recurrence-events,70-79,ge40,0-4,0-2,no,1,left,right_low,no
89 | no-recurrence-events,50-59,ge40,20-24,0-2,no,3,right,left_up,no
90 | no-recurrence-events,40-49,premeno,40-44,0-2,no,1,right,left_up,no
91 | no-recurrence-events,30-39,premeno,0-4,0-2,no,2,right,central,no
92 | no-recurrence-events,50-59,ge40,20-24,0-2,no,3,left,left_up,no
93 | no-recurrence-events,50-59,ge40,25-29,0-2,no,2,right,left_up,no
94 | no-recurrence-events,60-69,ge40,20-24,0-2,no,2,right,left_up,no
95 | no-recurrence-events,50-59,premeno,10-14,0-2,no,1,left,left_low,no
96 | no-recurrence-events,40-49,premeno,30-34,0-2,no,2,right,right_low,no
97 | no-recurrence-events,60-69,ge40,30-34,0-2,no,2,left,left_up,no
98 | no-recurrence-events,60-69,ge40,15-19,0-2,no,2,right,left_up,no
99 | no-recurrence-events,40-49,premeno,30-34,0-2,no,1,left,right_up,no
100 | no-recurrence-events,30-39,premeno,25-29,0-2,no,2,left,left_low,no
101 | no-recurrence-events,40-49,ge40,20-24,0-2,no,3,left,left_low,no
102 | no-recurrence-events,50-59,ge40,30-34,0-2,no,3,right,left_low,no
103 | no-recurrence-events,50-59,premeno,25-29,0-2,no,2,right,right_low,no
104 | no-recurrence-events,40-49,premeno,20-24,0-2,no,2,left,right_low,no
105 | no-recurrence-events,40-49,premeno,10-14,0-2,no,2,right,left_low,no
106 | no-recurrence-events,40-49,premeno,30-34,0-2,no,1,right,left_up,no
107 | no-recurrence-events,40-49,premeno,20-24,0-2,no,2,left,left_up,no
108 | no-recurrence-events,30-39,premeno,40-44,0-2,no,2,right,right_up,no
109 | no-recurrence-events,40-49,premeno,30-34,0-2,no,3,right,right_up,no
110 | no-recurrence-events,60-69,ge40,30-34,0-2,no,1,right,left_up,no
111 | no-recurrence-events,50-59,ge40,25-29,0-2,no,1,left,left_low,no
112 | no-recurrence-events,50-59,ge40,15-19,0-2,no,1,right,central,no
113 | no-recurrence-events,40-49,premeno,20-24,0-2,no,2,right,left_up,no
114 | no-recurrence-events,40-49,premeno,10-14,0-2,no,1,right,left_up,no
115 | no-recurrence-events,40-49,premeno,35-39,0-2,no,2,right,right_up,no
116 | no-recurrence-events,50-59,ge40,20-24,0-2,no,2,right,left_up,no
117 | no-recurrence-events,30-39,premeno,15-19,0-2,no,1,left,left_low,no
118 | no-recurrence-events,40-49,ge40,20-24,0-2,no,3,left,left_up,no
119 | no-recurrence-events,30-39,premeno,10-14,0-2,no,1,right,left_low,no
120 | no-recurrence-events,60-69,ge40,15-19,0-2,no,1,left,right_low,no
121 | no-recurrence-events,60-69,ge40,20-24,0-2,no,1,left,left_low,no
122 | no-recurrence-events,50-59,ge40,15-19,0-2,no,2,right,right_up,no
123 | no-recurrence-events,50-59,ge40,40-44,0-2,no,3,left,left_up,no
124 | no-recurrence-events,50-59,ge40,30-34,0-2,no,1,right,left_low,no
125 | no-recurrence-events,60-69,ge40,10-14,0-2,no,1,right,left_low,no
126 | no-recurrence-events,70-79,ge40,10-14,0-2,no,2,left,central,no
127 | no-recurrence-events,30-39,premeno,30-34,6-8,yes,2,right,right_up,no
128 | no-recurrence-events,30-39,premeno,25-29,6-8,yes,2,right,left_up,yes
129 | no-recurrence-events,50-59,premeno,25-29,0-2,yes,2,left,left_up,no
130 | no-recurrence-events,40-49,premeno,35-39,9-11,yes,2,right,left_up,yes
131 | no-recurrence-events,40-49,premeno,35-39,9-11,yes,2,right,right_up,yes
132 | no-recurrence-events,40-49,premeno,40-44,3-5,yes,3,right,left_up,yes
133 | no-recurrence-events,40-49,premeno,30-34,6-8,no,2,left,left_up,no
134 | no-recurrence-events,50-59,ge40,40-44,0-2,no,3,left,right_up,no
135 | no-recurrence-events,60-69,ge40,30-34,0-2,no,2,left,left_low,yes
136 | no-recurrence-events,30-39,premeno,20-24,3-5,no,2,right,central,no
137 | no-recurrence-events,30-39,premeno,40-44,3-5,no,3,right,right_up,yes
138 | no-recurrence-events,40-49,premeno,5-9,0-2,no,1,left,left_low,yes
139 | no-recurrence-events,30-39,premeno,40-44,0-2,no,2,left,left_low,yes
140 | no-recurrence-events,40-49,premeno,30-34,0-2,no,2,left,right_low,no
141 | no-recurrence-events,50-59,ge40,40-44,3-5,yes,2,left,left_low,no
142 | no-recurrence-events,50-59,premeno,20-24,3-5,yes,2,left,left_low,no
143 | no-recurrence-events,60-69,ge40,10-14,0-2,no,1,left,left_up,no
144 | no-recurrence-events,40-49,premeno,45-49,0-2,no,2,left,left_low,yes
145 | no-recurrence-events,60-69,ge40,45-49,6-8,yes,3,left,central,no
146 | no-recurrence-events,40-49,premeno,25-29,0-2,?,2,left,right_low,yes
147 | no-recurrence-events,60-69,ge40,50-54,0-2,no,2,right,left_up,yes
148 | no-recurrence-events,50-59,premeno,30-34,3-5,yes,2,left,left_low,yes
149 | no-recurrence-events,30-39,premeno,20-24,0-2,no,3,left,central,no
150 | no-recurrence-events,50-59,lt40,30-34,0-2,no,3,right,left_up,no
151 | no-recurrence-events,50-59,ge40,25-29,15-17,yes,3,right,left_up,no
152 | no-recurrence-events,60-69,ge40,30-34,3-5,yes,3,left,left_low,no
153 | no-recurrence-events,50-59,ge40,35-39,15-17,no,3,left,left_low,no
154 | no-recurrence-events,60-69,ge40,15-19,0-2,no,3,right,left_up,yes
155 | no-recurrence-events,30-39,lt40,15-19,0-2,no,3,right,left_up,no
156 | no-recurrence-events,60-69,ge40,40-44,3-5,no,2,right,left_up,yes
157 | no-recurrence-events,50-59,ge40,25-29,3-5,yes,3,right,left_up,no
158 | no-recurrence-events,50-59,premeno,30-34,0-2,no,1,left,central,no
159 | no-recurrence-events,50-59,ge40,30-34,0-2,no,1,right,central,no
160 | no-recurrence-events,40-49,premeno,35-39,0-2,no,1,left,left_low,no
161 | no-recurrence-events,40-49,premeno,25-29,0-2,no,3,right,left_up,yes
162 | no-recurrence-events,40-49,premeno,30-34,3-5,yes,2,right,left_low,no
163 | no-recurrence-events,60-69,ge40,10-14,0-2,no,2,right,left_up,yes
164 | no-recurrence-events,60-69,ge40,25-29,3-5,?,1,right,left_up,yes
165 | no-recurrence-events,60-69,ge40,25-29,3-5,?,1,right,left_low,yes
166 | no-recurrence-events,40-49,premeno,20-24,3-5,no,2,right,left_up,no
167 | no-recurrence-events,40-49,premeno,20-24,3-5,no,2,right,left_low,no
168 | no-recurrence-events,40-49,ge40,40-44,15-17,yes,2,right,left_up,yes
169 | no-recurrence-events,50-59,premeno,10-14,0-2,no,2,right,left_up,no
170 | no-recurrence-events,40-49,ge40,30-34,0-2,no,2,left,left_up,yes
171 | no-recurrence-events,30-39,premeno,20-24,3-5,yes,2,right,left_up,yes
172 | no-recurrence-events,30-39,premeno,15-19,0-2,no,1,left,left_low,no
173 | no-recurrence-events,60-69,ge40,30-34,6-8,yes,2,right,right_up,no
174 | no-recurrence-events,50-59,ge40,20-24,3-5,yes,2,right,left_up,no
175 | no-recurrence-events,50-59,premeno,25-29,3-5,yes,2,left,left_low,yes
176 | no-recurrence-events,40-49,premeno,30-34,0-2,no,2,right,right_up,yes
177 | no-recurrence-events,40-49,ge40,25-29,0-2,no,2,left,left_low,no
178 | no-recurrence-events,60-69,ge40,10-14,0-2,no,2,left,left_low,no
179 | no-recurrence-events,50-59,premeno,25-29,3-5,no,2,right,left_up,yes
180 | no-recurrence-events,40-49,premeno,20-24,0-2,no,3,right,left_low,yes
181 | no-recurrence-events,40-49,premeno,35-39,0-2,yes,3,right,left_up,yes
182 | no-recurrence-events,40-49,premeno,35-39,0-2,yes,3,right,left_low,yes
183 | no-recurrence-events,40-49,premeno,25-29,0-2,no,1,right,left_low,yes
184 | no-recurrence-events,50-59,ge40,30-34,9-11,?,3,left,left_up,yes
185 | no-recurrence-events,50-59,ge40,30-34,9-11,?,3,left,left_low,yes
186 | no-recurrence-events,40-49,premeno,20-24,6-8,no,2,right,left_low,yes
187 | no-recurrence-events,50-59,ge40,25-29,0-2,no,1,left,right_low,no
188 | no-recurrence-events,60-69,ge40,15-19,0-2,no,2,left,left_up,yes
189 | no-recurrence-events,40-49,premeno,10-14,0-2,no,2,right,left_up,no
190 | no-recurrence-events,50-59,ge40,20-24,0-2,yes,2,right,left_up,no
191 | no-recurrence-events,40-49,premeno,15-19,12-14,no,3,right,right_low,yes
192 | no-recurrence-events,40-49,premeno,25-29,0-2,no,2,left,left_up,yes
193 | no-recurrence-events,50-59,ge40,30-34,6-8,yes,2,left,left_low,no
194 | no-recurrence-events,30-39,premeno,10-14,0-2,no,2,left,right_low,no
195 | no-recurrence-events,50-59,premeno,50-54,0-2,yes,2,right,left_up,yes
196 | no-recurrence-events,50-59,ge40,35-39,0-2,no,2,left,left_up,no
197 | no-recurrence-events,50-59,premeno,10-14,3-5,no,1,right,left_up,no
198 | no-recurrence-events,40-49,premeno,10-14,0-2,no,2,left,left_low,yes
199 | no-recurrence-events,50-59,ge40,15-19,0-2,yes,2,left,central,yes
200 | no-recurrence-events,50-59,premeno,25-29,0-2,no,1,left,left_low,no
201 | no-recurrence-events,60-69,ge40,25-29,0-2,no,3,right,left_low,no
202 | recurrence-events,50-59,premeno,15-19,0-2,no,2,left,left_low,no
203 | recurrence-events,40-49,premeno,40-44,0-2,no,1,left,left_low,no
204 | recurrence-events,50-59,ge40,35-39,0-2,no,2,left,left_low,no
205 | recurrence-events,50-59,premeno,25-29,0-2,no,2,left,right_up,no
206 | recurrence-events,30-39,premeno,0-4,0-2,no,2,right,central,no
207 | recurrence-events,50-59,ge40,30-34,0-2,no,3,left,?,no
208 | recurrence-events,50-59,premeno,25-29,0-2,no,2,left,right_up,no
209 | recurrence-events,50-59,premeno,30-34,0-2,no,3,left,right_up,no
210 | recurrence-events,40-49,premeno,35-39,0-2,no,1,right,left_up,no
211 | recurrence-events,40-49,premeno,20-24,0-2,no,2,left,left_low,no
212 | recurrence-events,50-59,ge40,20-24,0-2,no,2,right,central,no
213 | recurrence-events,40-49,premeno,30-34,0-2,no,3,right,right_up,no
214 | recurrence-events,50-59,premeno,25-29,0-2,no,1,right,left_up,no
215 | recurrence-events,60-69,ge40,40-44,0-2,no,2,right,left_low,no
216 | recurrence-events,40-49,ge40,20-24,0-2,no,2,right,left_up,no
217 | recurrence-events,50-59,ge40,20-24,0-2,no,2,left,left_up,no
218 | recurrence-events,40-49,premeno,15-19,0-2,no,2,left,left_up,no
219 | recurrence-events,60-69,ge40,30-34,0-2,no,3,right,central,no
220 | recurrence-events,30-39,premeno,15-19,0-2,no,1,right,left_low,no
221 | recurrence-events,40-49,premeno,25-29,0-2,no,3,left,right_up,no
222 | recurrence-events,30-39,premeno,30-34,0-2,no,1,right,left_up,no
223 | recurrence-events,60-69,ge40,25-29,0-2,no,3,left,right_low,yes
224 | recurrence-events,60-69,ge40,20-24,0-2,no,3,right,left_low,no
225 | recurrence-events,30-39,premeno,25-29,3-5,yes,3,left,left_low,yes
226 | recurrence-events,40-49,ge40,20-24,3-5,no,3,right,left_low,yes
227 | recurrence-events,40-49,premeno,30-34,15-17,yes,3,left,left_low,no
228 | recurrence-events,50-59,premeno,30-34,0-2,no,3,right,left_up,yes
229 | recurrence-events,60-69,ge40,40-44,3-5,yes,3,right,left_low,no
230 | recurrence-events,60-69,ge40,45-49,0-2,no,1,right,right_up,yes
231 | recurrence-events,50-59,premeno,50-54,9-11,yes,2,right,left_up,no
232 | recurrence-events,40-49,premeno,30-34,3-5,no,2,right,left_up,no
233 | recurrence-events,30-39,premeno,30-34,3-5,no,3,right,left_up,yes
234 | recurrence-events,70-79,ge40,15-19,9-11,?,1,left,left_low,yes
235 | recurrence-events,60-69,ge40,30-34,0-2,no,3,right,left_up,yes
236 | recurrence-events,50-59,premeno,25-29,3-5,yes,3,left,left_low,yes
237 | recurrence-events,40-49,premeno,25-29,0-2,no,2,right,left_low,no
238 | recurrence-events,40-49,premeno,25-29,0-2,no,2,right,left_low,no
239 | recurrence-events,30-39,premeno,35-39,0-2,no,3,left,left_low,no
240 | recurrence-events,40-49,premeno,20-24,3-5,yes,2,right,right_up,yes
241 | recurrence-events,60-69,ge40,20-24,3-5,no,2,left,left_low,yes
242 | recurrence-events,40-49,premeno,15-19,15-17,yes,3,left,left_low,no
243 | recurrence-events,50-59,ge40,25-29,6-8,no,3,left,left_low,yes
244 | recurrence-events,50-59,ge40,20-24,3-5,yes,3,right,right_up,no
245 | recurrence-events,40-49,premeno,30-34,12-14,yes,3,left,left_up,yes
246 | recurrence-events,30-39,premeno,30-34,9-11,no,2,right,left_up,yes
247 | recurrence-events,30-39,premeno,15-19,6-8,yes,3,left,left_low,yes
248 | recurrence-events,50-59,ge40,30-34,9-11,yes,3,left,right_low,yes
249 | recurrence-events,60-69,ge40,35-39,6-8,yes,3,left,left_low,no
250 | recurrence-events,30-39,premeno,20-24,3-5,yes,2,left,left_low,no
251 | recurrence-events,40-49,premeno,25-29,0-2,no,3,left,left_up,no
252 | recurrence-events,40-49,premeno,50-54,0-2,no,2,right,left_low,yes
253 | recurrence-events,30-39,premeno,40-44,0-2,no,1,left,left_up,no
254 | recurrence-events,60-69,ge40,50-54,0-2,no,3,right,left_up,no
255 | recurrence-events,40-49,premeno,30-34,0-2,yes,3,right,right_up,no
256 | recurrence-events,40-49,premeno,30-34,6-8,yes,3,right,left_up,no
257 | recurrence-events,40-49,premeno,30-34,0-2,no,1,left,left_low,yes
258 | recurrence-events,40-49,premeno,20-24,3-5,yes,2,left,left_low,yes
259 | recurrence-events,50-59,ge40,30-34,6-8,yes,2,left,right_low,yes
260 | recurrence-events,50-59,ge40,30-34,3-5,no,3,right,left_up,no
261 | recurrence-events,60-69,ge40,25-29,3-5,no,2,right,right_up,no
262 | recurrence-events,40-49,ge40,25-29,12-14,yes,3,left,right_low,yes
263 | recurrence-events,60-69,ge40,25-29,0-2,no,3,left,left_up,no
264 | recurrence-events,50-59,lt40,20-24,0-2,?,1,left,left_up,no
265 | recurrence-events,50-59,lt40,20-24,0-2,?,1,left,left_low,no
266 | recurrence-events,30-39,premeno,35-39,9-11,yes,3,left,left_low,no
267 | recurrence-events,40-49,premeno,30-34,3-5,yes,2,left,right_up,no
268 | recurrence-events,60-69,ge40,20-24,24-26,yes,3,left,left_low,yes
269 | recurrence-events,30-39,premeno,35-39,0-2,no,3,left,left_low,no
270 | recurrence-events,40-49,premeno,25-29,0-2,no,2,left,left_low,yes
271 | recurrence-events,50-59,ge40,30-34,6-8,yes,3,left,right_low,no
272 | recurrence-events,50-59,premeno,25-29,0-2,no,3,right,left_low,yes
273 | recurrence-events,40-49,premeno,15-19,0-2,yes,3,right,left_up,no
274 | recurrence-events,60-69,ge40,30-34,0-2,yes,2,right,right_up,yes
275 | recurrence-events,60-69,ge40,30-34,3-5,yes,2,left,central,yes
276 | recurrence-events,40-49,premeno,25-29,9-11,yes,3,right,left_up,no
277 | recurrence-events,30-39,premeno,25-29,6-8,yes,3,left,right_low,yes
278 | recurrence-events,60-69,ge40,10-14,6-8,yes,3,left,left_up,yes
279 | recurrence-events,50-59,premeno,35-39,15-17,yes,3,right,right_up,no
280 | recurrence-events,50-59,ge40,40-44,6-8,yes,3,left,left_low,yes
281 | recurrence-events,50-59,ge40,40-44,6-8,yes,3,left,left_low,yes
282 | recurrence-events,30-39,premeno,30-34,0-2,no,2,left,left_up,no
283 | recurrence-events,30-39,premeno,20-24,0-2,no,3,left,left_up,yes
284 | recurrence-events,60-69,ge40,20-24,0-2,no,1,right,left_up,no
285 | recurrence-events,40-49,ge40,30-34,3-5,no,3,left,left_low,no
286 | recurrence-events,50-59,ge40,30-34,3-5,no,3,left,left_low,no
287 |
--------------------------------------------------------------------------------
/in_class_notebooks/EDA/data/breast-cancer_info.names:
--------------------------------------------------------------------------------
1 | Citation Request:
2 | This breast cancer domain was obtained from the University Medical Centre,
3 | Institute of Oncology, Ljubljana, Yugoslavia. Thanks go to M. Zwitter and
4 | M. Soklic for providing the data. Please include this citation if you plan
5 | to use this database.
6 |
7 | 1. Title: Breast cancer data (Michalski has used this)
8 |
9 | 2. Sources:
10 | -- Matjaz Zwitter & Milan Soklic (physicians)
11 | Institute of Oncology
12 | University Medical Center
13 | Ljubljana, Yugoslavia
14 | -- Donors: Ming Tan and Jeff Schlimmer (Jeffrey.Schlimmer@a.gp.cs.cmu.edu)
15 | -- Date: 11 July 1988
16 |
17 | 3. Past Usage: (Several: here are some)
18 | -- Michalski,R.S., Mozetic,I., Hong,J., & Lavrac,N. (1986). The
19 | Multi-Purpose Incremental Learning System AQ15 and its Testing
20 | Application to Three Medical Domains. In Proceedings of the
21 | Fifth National Conference on Artificial Intelligence, 1041-1045,
22 | Philadelphia, PA: Morgan Kaufmann.
23 | -- accuracy range: 66%-72%
24 | -- Clark,P. & Niblett,T. (1987). Induction in Noisy Domains. In
25 | Progress in Machine Learning (from the Proceedings of the 2nd
26 | European Working Session on Learning), 11-30, Bled,
27 | Yugoslavia: Sigma Press.
28 | -- 8 test results given: 65%-72% accuracy range
29 | -- Tan, M., & Eshelman, L. (1988). Using weighted networks to
30 | represent classification knowledge in noisy domains. Proceedings
31 | of the Fifth International Conference on Machine Learning, 121-134,
32 | Ann Arbor, MI.
33 | -- 4 systems tested: accuracy range was 68%-73.5%
34 | -- Cestnik,G., Konenenko,I, & Bratko,I. (1987). Assistant-86: A
35 | Knowledge-Elicitation Tool for Sophisticated Users. In I.Bratko
36 | & N.Lavrac (Eds.) Progress in Machine Learning, 31-45, Sigma Press.
37 | -- Assistant-86: 78% accuracy
38 |
39 | 4. Relevant Information:
40 | This is one of three domains provided by the Oncology Institute
41 | that has repeatedly appeared in the machine learning literature.
42 | (See also lymphography and primary-tumor.)
43 |
44 | This data set includes 201 instances of one class and 85 instances of
45 | another class. The instances are described by 9 attributes, some of
46 | which are linear and some are nominal.
47 |
48 | 5. Number of Instances: 286
49 |
50 | 6. Number of Attributes: 9 + the class attribute
51 |
52 | 7. Attribute Information:
53 | 1. Class: no-recurrence-events, recurrence-events
54 | 2. age: 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99.
55 | 3. menopause: lt40, ge40, premeno.
56 | 4. tumor-size: 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44,
57 | 45-49, 50-54, 55-59.
58 | 5. inv-nodes: 0-2, 3-5, 6-8, 9-11, 12-14, 15-17, 18-20, 21-23, 24-26,
59 | 27-29, 30-32, 33-35, 36-39.
60 | 6. node-caps: yes, no.
61 | 7. deg-malig: 1, 2, 3.
62 | 8. breast: left, right.
63 | 9. breast-quad: left-up, left-low, right-up, right-low, central.
64 | 10. irradiat: yes, no.
65 |
66 | 8. Missing Attribute Values: (denoted by "?")
67 | Attribute #: Number of instances with missing values:
68 | 6. 8
69 | 9. 1.
70 |
71 | 9. Class Distribution:
72 | 1. no-recurrence-events: 201 instances
73 | 2. recurrence-events: 85 instances
--------------------------------------------------------------------------------
/in_class_notebooks/EDA/in-class_EDA-release.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## In-Class Notebook "
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "### Numpy Review\n",
15 | "\n",
16 | "Data scientists primarily deal with structured numeric data. While tuples, lists and dictionaries are useful for general programming, *vectors* and *arrays* are more useful for mathematical calculations.\n",
17 | "\n",
18 | "[NumPy](https://docs.scipy.org/doc/numpy-1.13.0/index.html) is an *extension module* to the Python language that provides vectors and arrays. NumPy has been imported with alias `np` in the cell below. We will now go through some basic numpy operations. \n"
19 | ]
20 | },
21 | {
22 | "cell_type": "code",
23 | "execution_count": null,
24 | "metadata": {},
25 | "outputs": [],
26 | "source": [
27 | "import numpy as np\n"
28 | ]
29 | },
30 | {
31 | "cell_type": "markdown",
32 | "metadata": {},
33 | "source": [
34 | "#### 1. Checking the numpy version you have intalled in your system"
35 | ]
36 | },
37 | {
38 | "cell_type": "code",
39 | "execution_count": null,
40 | "metadata": {},
41 | "outputs": [],
42 | "source": [
43 | "# Type out python command for displaying version\n"
44 | ]
45 | },
46 | {
47 | "cell_type": "markdown",
48 | "metadata": {},
49 | "source": [
50 | "#### 2. Create a 10x10 matrix, in which the elements on the borders will be equal to 1, and inside 0"
51 | ]
52 | },
53 | {
54 | "cell_type": "code",
55 | "execution_count": null,
56 | "metadata": {},
57 | "outputs": [],
58 | "source": [
59 | "# Create your 10X10 matrix and store it in x (HINT: Make it all ones or zeros to begin with)\n",
60 | "x = 0\n",
61 | "\n",
62 | "# Slice indexing to modify the matrix\n",
63 | "\n",
64 | "print(x)"
65 | ]
66 | },
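 {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
   "One possible approach, shown as a minimal sketch (other constructions work just as well):"
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
   "# Illustration only: start from all ones, then zero out the interior with slice assignment\n",
   "m = np.ones((10, 10))\n",
   "m[1:-1, 1:-1] = 0\n",
   "print(m)"
  ]
 },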
67 | {
68 | "cell_type": "markdown",
69 | "metadata": {},
70 | "source": [
71 | "#### 3. Compute the multiplication of two given matrixes"
72 | ]
73 | },
74 | {
75 | "cell_type": "code",
76 | "execution_count": null,
77 | "metadata": {},
78 | "outputs": [],
79 | "source": [
80 | "\n",
81 | "p = [[1, 0], [0, 1]]\n",
82 | "q = [[1, 2], [3, 4]]\n",
83 | "print(\"original matrix:\")\n",
84 | "print(p)\n",
85 | "print(q)\n",
86 | "\n",
87 | "# Enter the solution to do matrix multiplication\n",
88 | "\n",
89 | "# Return shape of resultant matrix\n",
90 | "\n",
91 | "print(\"Result of the said matrix multiplication:\")\n",
92 | "\n",
93 | "# print result\n"
94 | ]
95 | },
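 {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
   "A minimal sketch of matrix multiplication in NumPy (the arrays here mirror `p` and `q` above but are converted with `np.array` first):"
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
   "# Illustration only: np.dot (or the @ operator) performs matrix multiplication\n",
   "a = np.array([[1, 0], [0, 1]])\n",
   "b = np.array([[1, 2], [3, 4]])\n",
   "prod = a @ b            # same as np.dot(a, b)\n",
   "print(prod.shape)       # shape of the resulting matrix\n",
   "print(prod)"
  ]
 },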
96 | {
97 | "cell_type": "markdown",
98 | "metadata": {},
99 | "source": [
100 | "#### 4.Generate random numbers from a normal distribution with mean 2 and standard deviation 1"
101 | ]
102 | },
103 | {
104 | "cell_type": "code",
105 | "execution_count": null,
106 | "metadata": {},
107 | "outputs": [],
108 | "source": [
109 | "mu = 2\n",
110 | "s_dev = 1.5\n",
111 | "\n",
112 | "# Enter numpy command to generate \n",
113 | "\n",
114 | "# Check shape of resulting variable \n"
115 | ]
116 | },
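 {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
   "For reference, a minimal sketch (the sample size of 1000 is an arbitrary choice):"
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
   "# Illustration only: np.random.normal(loc, scale, size) draws samples from N(loc, scale^2)\n",
   "samples = np.random.normal(loc=mu, scale=s_dev, size=1000)\n",
   "print(samples.shape, samples.mean(), samples.std())"
  ]
 },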
117 | {
118 | "cell_type": "markdown",
119 | "metadata": {},
120 | "source": [
121 | "#### 5. Shuffling Arrays : Shuffle numbers between 0 and 10"
122 | ]
123 | },
124 | {
125 | "cell_type": "code",
126 | "execution_count": null,
127 | "metadata": {},
128 | "outputs": [],
129 | "source": [
130 | "\n",
131 | "x = np.arange(10) # Creatinga vector of 10 elements\n",
132 | "\n",
133 | "# Randomly shuffle the elements of the vector\n",
134 | "\n",
135 | "# Print the result in a nice format\n"
136 | ]
137 | },
138 | {
139 | "cell_type": "markdown",
140 | "metadata": {},
141 | "source": [
142 | "## Pandas Dataframe Review\n",
143 | "\n",
144 | "Pandas is part of an ecosystem of Python software used for statistical analysis.\n",
145 | "\n",
146 | "Pandas extends Python with two datatypes used in statistical analysis: the Series and the DataFrame.\n",
147 | "\n",
148 | "The name \"Pandas\" is derived from \"Panel Data\", a particular way of representing data represented in Pandas by the DataFrame.\n",
149 | "\n",
150 | "As with NumPy, we need to import Pandas. We'll see almost all of our notebooks starting with:\n"
151 | ]
152 | },
153 | {
154 | "cell_type": "code",
155 | "execution_count": null,
156 | "metadata": {},
157 | "outputs": [],
158 | "source": [
159 | "import pandas as pd"
160 | ]
161 | },
162 | {
163 | "cell_type": "markdown",
164 | "metadata": {},
165 | "source": [
166 | "DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object.\n",
167 | "\n",
168 | "There are a number of ways you can construct a DataFrame. One of the most common is to use a python Dictionary to label the different columns."
169 | ]
170 | },
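 {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
   "For instance, a minimal sketch with made-up columns (the exercise below uses its own `ages` and `salary` arrays):"
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
   "# Illustration only: each dictionary key becomes a column name, each value the column's data\n",
   "example = pd.DataFrame({'height': [1.62, 1.75, 1.80], 'weight': [55, 70, 82]})\n",
   "example"
  ]
 },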
171 | {
172 | "cell_type": "code",
173 | "execution_count": null,
174 | "metadata": {},
175 | "outputs": [],
176 | "source": [
177 | "ages = np.array([20, 39, 45, 18, 56, 90])\n",
178 | "salary = np.array([10000, 40000, 50000, 8000, 55000, 5000])"
179 | ]
180 | },
181 | {
182 | "cell_type": "code",
183 | "execution_count": null,
184 | "metadata": {},
185 | "outputs": [],
186 | "source": [
187 | "\n",
188 | "# Using a python dictionary, create pandas dataframe\n",
189 | "\n",
190 | "df"
191 | ]
192 | },
193 | {
194 | "cell_type": "markdown",
195 | "metadata": {},
196 | "source": [
197 | "You can also specify the index column (_e.g._ it could be the names of the people represented in the age/salary data)"
198 | ]
199 | },
200 | {
201 | "cell_type": "code",
202 | "execution_count": null,
203 | "metadata": {},
204 | "outputs": [],
205 | "source": [
206 | "# Name your specific rows with an index columns \n",
207 | "\n",
208 | "df2"
209 | ]
210 | },
211 | {
212 | "cell_type": "markdown",
213 | "metadata": {},
214 | "source": [
215 | "#### 1. Accessing Rows of a Data Frame"
216 | ]
217 | },
218 | {
219 | "cell_type": "markdown",
220 | "metadata": {},
221 | "source": [
222 | "You can index a dataframe by row index to extract a set of rows. For integral indices, the range is specified as **from:to** to include entries from **from** to strictly less than **to**."
223 | ]
224 | },
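 {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
   "For example, on a small made-up frame (an illustrative sketch, separate from the `df` above):"
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
   "# Illustration only: an integer slice keeps the rows at positions 0 and 1, excluding position 2\n",
   "tmp = pd.DataFrame({'a': [10, 20, 30, 40]})\n",
   "tmp[0:2]"
  ]
 },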
225 | {
226 | "cell_type": "code",
227 | "execution_count": null,
228 | "metadata": {},
229 | "outputs": [],
230 | "source": [
231 | "# Select the sub-dataframe with rows 1 and 2"
232 | ]
233 | },
234 | {
235 | "cell_type": "markdown",
236 | "metadata": {},
237 | "source": [
238 | "You can do similar slices for named rows, but for inexplicible reasons, the range now includes all of the specified rows (i.e. it doesn't end before the last index)."
239 | ]
240 | },
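 {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
   "An illustrative sketch with made-up row labels:"
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
   "# Illustration only: label-based slicing includes BOTH endpoints\n",
   "tmp = pd.DataFrame({'a': [1, 2, 3]}, index=['x', 'y', 'z'])\n",
   "tmp.loc['x':'y']  # returns the 'x' and 'y' rows"
  ]
 },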
241 | {
242 | "cell_type": "code",
243 | "execution_count": null,
244 | "metadata": {},
245 | "outputs": [],
246 | "source": [
247 | "# Select rows corresponding to the indices you used above to see the difference"
248 | ]
249 | },
250 | {
251 | "cell_type": "markdown",
252 | "metadata": {},
253 | "source": [
254 | "We can also access a single row of the data frame using index operations based on the location of the data."
255 | ]
256 | },
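 {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
   "As a sketch on a made-up frame: `.loc` selects by label and `.iloc` by integer position."
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
   "# Illustration only: the two accessors pick out the same row here\n",
   "tmp = pd.DataFrame({'a': [1, 2, 3]}, index=['x', 'y', 'z'])\n",
   "print(tmp.loc['y'])\n",
   "print(tmp.iloc[1])"
  ]
 },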
257 | {
258 | "cell_type": "code",
259 | "execution_count": null,
260 | "metadata": {},
261 | "outputs": [],
262 | "source": [
263 | "# Use the `loc` operator to select a particular row index"
264 | ]
265 | },
266 | {
267 | "cell_type": "markdown",
268 | "metadata": {},
269 | "source": [
270 | "#### 2. Accessing Elements and Columns of a DataFrame"
271 | ]
272 | },
273 | {
274 | "cell_type": "markdown",
275 | "metadata": {},
276 | "source": [
277 | "You can refer to each column using the name of the column. "
278 | ]
279 | },
280 | {
281 | "cell_type": "code",
282 | "execution_count": null,
283 | "metadata": {},
284 | "outputs": [],
285 | "source": [
286 | "# Select age column"
287 | ]
288 | },
289 | {
290 | "cell_type": "code",
291 | "execution_count": null,
292 | "metadata": {},
293 | "outputs": [],
294 | "source": [
295 | "# You can also use this form..\n"
296 | ]
297 | },
298 | {
299 | "cell_type": "code",
300 | "execution_count": null,
301 | "metadata": {},
302 | "outputs": [],
303 | "source": [
304 | "Once you've selected a column, you can access elements using the index for that specific row."
305 | ]
306 | },
307 | {
308 | "cell_type": "code",
309 | "execution_count": null,
310 | "metadata": {},
311 | "outputs": [],
312 | "source": [
313 | "# Select the age (or any other attribute) for a particular row"
314 | ]
315 | },
316 | {
317 | "cell_type": "code",
318 | "execution_count": null,
319 | "metadata": {},
320 | "outputs": [],
321 | "source": [
322 | "# Alternate way using the `loc` operator"
323 | ]
324 | },
325 | {
326 | "cell_type": "markdown",
327 | "metadata": {},
328 | "source": [
329 | "#### 3. Adding new columns"
330 | ]
331 | },
332 | {
333 | "cell_type": "code",
334 | "execution_count": null,
335 | "metadata": {},
336 | "outputs": [],
337 | "source": [
338 | "We can add new columns to the data frame simply by assigning to them."
339 | ]
340 | },
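 {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
   "A minimal sketch on a made-up frame (the exercise below does the same thing on the real `df`):"
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
   "# Illustration only: assigning to a new column name creates that column\n",
   "tmp = pd.DataFrame({'age': [20, 39], 'salary': [10000, 40000]})\n",
   "tmp['age_x_salary'] = tmp['age'] * tmp['salary']\n",
   "tmp"
  ]
 },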
341 | {
342 | "cell_type": "code",
343 | "execution_count": null,
344 | "metadata": {},
345 | "outputs": [],
346 | "source": [
347 | "# Create a new column where each element is the product of the corresponding age and salary"
348 | ]
349 | },
350 | {
351 | "cell_type": "markdown",
352 | "metadata": {},
353 | "source": [
354 | "## Exploratory Data Analysis with a data set"
355 | ]
356 | },
357 | {
358 | "cell_type": "markdown",
359 | "metadata": {},
360 | "source": [
361 | "Go to https://archive.ics.uci.edu/ml/datasets/Breast+Cancer and download the file. You can change the data file format to csv."
362 | ]
363 | },
364 | {
365 | "cell_type": "code",
366 | "execution_count": null,
367 | "metadata": {},
368 | "outputs": [],
369 | "source": [
370 | "# read the csv file to a data frame (there is literally a function called read_csv)\n"
371 | ]
372 | },
373 | {
374 | "cell_type": "code",
375 | "execution_count": null,
376 | "metadata": {},
377 | "outputs": [],
378 | "source": [
379 | "df.head() # Shows the top few rows of the dataframe that you just imported "
380 | ]
381 | },
382 | {
383 | "cell_type": "code",
384 | "execution_count": null,
385 | "metadata": {},
386 | "outputs": [],
387 | "source": [
388 | "df.info()#check the data type- it's not always desired form, you can change"
389 | ]
390 | },
391 | {
392 | "cell_type": "markdown",
393 | "metadata": {},
394 | "source": [
395 | "### Data Cleaning Ideas\n",
396 | "1. Change column names \n",
397 | "2. Find Null values and clean them\n",
398 | "3. Check the data types\n",
399 | "4. Convert ordinal category strings to number\n",
400 | "5. Convert non-ordinal category strings to **dummified** array"
401 | ]
402 | },
403 | {
404 | "cell_type": "code",
405 | "execution_count": null,
406 | "metadata": {},
407 | "outputs": [],
408 | "source": [
409 | "# Let us check what columns there are"
410 | ]
411 | },
412 | {
413 | "cell_type": "code",
414 | "execution_count": null,
415 | "metadata": {},
416 | "outputs": [],
417 | "source": [
418 | "# create a dictionary and use it to rename the columns\n",
419 | "columns = ['class','age','menopause','tumor-size','inv-nodes','node-caps','deg-malig','breast','breast-quad','irradiat']\n",
420 | "\n",
421 | "# Create a list of out the dataframe column names \n",
422 | "\n",
423 | "# Pair it with elements of the `columns` list using the zip() function\n",
424 | "\n",
425 | "# Create a dictionary out of the \n",
426 | "\n",
427 | "print(dd)"
428 | ]
429 | },
430 | {
431 | "cell_type": "code",
432 | "execution_count": null,
433 | "metadata": {},
434 | "outputs": [],
435 | "source": [
436 | "df.rename(columns=dd, inplace=True)"
437 | ]
438 | },
439 | {
440 | "cell_type": "code",
441 | "execution_count": null,
442 | "metadata": {},
443 | "outputs": [],
444 | "source": [
445 | "df.head()"
446 | ]
447 | },
448 | {
449 | "cell_type": "code",
450 | "execution_count": null,
451 | "metadata": {},
452 | "outputs": [],
453 | "source": [
454 | "# we want to change \"class\" to be 0 or 1\n",
455 | "# you can change values using .apply() : 0- no-recurrence-events , 1- recurrence events; check out the lambda function to acheive this with minimum code\n",
456 | "\n",
457 | "#make sure that there is no other values beofre you use if-else. we know it from the unique values we've inspected above"
458 | ]
459 | },
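 {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
   "A minimal sketch of the `.apply()` + `lambda` pattern on a made-up Series (the real mapping is done on `df['class']`):"
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
   "# Illustration only: map the two known string values to 0/1\n",
   "s = pd.Series(['no-recurrence-events', 'recurrence-events', 'no-recurrence-events'])\n",
   "s.apply(lambda v: 0 if v == 'no-recurrence-events' else 1)"
  ]
 },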
460 | {
461 | "cell_type": "code",
462 | "execution_count": null,
463 | "metadata": {},
464 | "outputs": [],
465 | "source": [
466 | "df['class'].unique()"
467 | ]
468 | },
469 | {
470 | "cell_type": "code",
471 | "execution_count": null,
472 | "metadata": {},
473 | "outputs": [],
474 | "source": [
475 | "df.head()"
476 | ]
477 | },
478 | {
479 | "cell_type": "markdown",
480 | "metadata": {},
481 | "source": [
482 | "Age ranges are inconvenient to handle and it would be more convenient if this were a numeric column. Let us make each element the average of the range to achieve this."
483 | ]
484 | },
485 | {
486 | "cell_type": "code",
487 | "execution_count": null,
488 | "metadata": {},
489 | "outputs": [],
490 | "source": [
491 | "# List with average of each age range\n",
492 | "ageval=[24,35,45,55,65,75]"
493 | ]
494 | },
495 | {
496 | "cell_type": "code",
497 | "execution_count": null,
498 | "metadata": {},
499 | "outputs": [],
500 | "source": [
501 | "# Similar to what we did with the column names, let us make a dictionary to link the age range to its corresponding\n",
502 | "# age average\n",
503 | "\n"
504 | ]
505 | },
506 | {
507 | "cell_type": "code",
508 | "execution_count": null,
509 | "metadata": {},
510 | "outputs": [],
511 | "source": [
512 | "#view the dictionary that you create"
513 | ]
514 | },
515 | {
516 | "cell_type": "code",
517 | "execution_count": null,
518 | "metadata": {},
519 | "outputs": [],
520 | "source": [
521 | "#replace the age column `df['age']` with its numeric counter parts using apply and lambda"
522 | ]
523 | },
524 | {
525 | "cell_type": "code",
526 | "execution_count": null,
527 | "metadata": {},
528 | "outputs": [],
529 | "source": [
530 | "df.head()"
531 | ]
532 | },
533 | {
534 | "cell_type": "code",
535 | "execution_count": null,
536 | "metadata": {},
537 | "outputs": [],
538 | "source": [
539 | "# Use get_dummies on the menopause column since those elements don't make sense\n",
540 | "# get_dummies creates a new dataframe with columns corresponding to discrete row elements ; the elements of the new\n",
541 | "# columns will be binary\n",
542 | "\n",
543 | "# concatenate newly created df to your existing df"
544 | ]
545 | },
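 {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
   "A minimal sketch of `get_dummies` plus `concat` on a made-up column (the real call targets `df['menopause']`):"
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
   "# Illustration only: one-hot encode a categorical column and attach the result\n",
   "tmp = pd.DataFrame({'meno': ['premeno', 'ge40', 'lt40']})\n",
   "dummies = pd.get_dummies(tmp['meno'])\n",
   "pd.concat([tmp, dummies], axis=1)"
  ]
 },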
546 | {
547 | "cell_type": "code",
548 | "execution_count": null,
549 | "metadata": {},
550 | "outputs": [],
551 | "source": [
552 | "# Here are some tricks to extract the average of the tumor size range. \n",
553 | "\n",
554 | "tumors=sorted(list(df['tumor-size'].unique())) #we'll create ordinal variable (average of the range as representitive of the category, but again, you can assign what makes sense for you)\n",
555 | "tsize=[(int(x[0])+int(x[1]))/2 for x in [x.split('-') for x in tumors]] #just some tricks to extract numbers from the string"
556 | ]
557 | },
558 | {
559 | "cell_type": "code",
560 | "execution_count": null,
561 | "metadata": {},
562 | "outputs": [],
563 | "source": [
564 | "# Create a dictionary once again for tumorsize to convert to numeric"
565 | ]
566 | },
567 | {
568 | "cell_type": "code",
569 | "execution_count": null,
570 | "metadata": {},
571 | "outputs": [],
572 | "source": [
573 | "# replace the tumor-size column with numeric counterparts"
574 | ]
575 | },
576 | {
577 | "cell_type": "code",
578 | "execution_count": null,
579 | "metadata": {},
580 | "outputs": [],
581 | "source": []
582 | },
583 | {
584 | "cell_type": "markdown",
585 | "metadata": {},
586 | "source": [
587 | "### Visualizing the data : Using Matplotlib"
588 | ]
589 | },
590 | {
591 | "cell_type": "markdown",
592 | "metadata": {},
593 | "source": [
594 | "One way to do EDA visually is to make some basic plots of the data to extract basic information from it. Some of these plots include histograms and correlation plots. "
595 | ]
596 | },
597 | {
598 | "cell_type": "code",
599 | "execution_count": null,
600 | "metadata": {},
601 | "outputs": [],
602 | "source": [
603 | "import matplotlib.pyplot as plt"
604 | ]
605 | },
606 | {
607 | "cell_type": "code",
608 | "execution_count": null,
609 | "metadata": {},
610 | "outputs": [],
611 | "source": [
612 | "# plot histogram of ages"
613 | ]
614 | },
615 | {
616 | "cell_type": "code",
617 | "execution_count": null,
618 | "metadata": {},
619 | "outputs": [],
620 | "source": [
621 | "# Change the number of bins"
622 | ]
623 | },
624 | {
625 | "cell_type": "code",
626 | "execution_count": null,
627 | "metadata": {},
628 | "outputs": [],
629 | "source": [
630 | "# Plot the histogram of another variable of your choice"
631 | ]
632 | },
633 | {
634 | "cell_type": "code",
635 | "execution_count": null,
636 | "metadata": {},
637 | "outputs": [],
638 | "source": [
639 | "dfs = df1[['class','age','tumor-size','deg-malig','premeno']]\n",
640 | "\n",
641 | "#after changing them to numbers, we can see correlation matrix\n",
642 | "corr = dfs.corr()\n",
643 | "corr.style.background_gradient(cmap='coolwarm')"
644 | ]
645 | },
646 | {
647 | "cell_type": "code",
648 | "execution_count": null,
649 | "metadata": {},
650 | "outputs": [],
651 | "source": [
652 | "# Or using plt\n",
653 | "import matplotlib.pyplot as plt\n",
654 | "f = plt.figure(figsize=(19, 15))\n",
655 | "plt.matshow(dfs.corr(), fignum=f.number)\n",
656 | "plt.xticks(range(dfs.shape[1]), dfs.columns, fontsize=14, rotation=45)\n",
657 | "plt.yticks(range(dfs.shape[1]), dfs.columns, fontsize=14)\n",
658 | "cb = plt.colorbar()\n",
659 | "cb.ax.tick_params(labelsize=14)\n",
660 | "plt.title('Correlation Matrix', fontsize=16);"
661 | ]
662 | }
663 | ],
664 | "metadata": {
665 | "kernelspec": {
666 | "display_name": "Python 3",
667 | "language": "python",
668 | "name": "python3"
669 | },
670 | "language_info": {
671 | "codemirror_mode": {
672 | "name": "ipython",
673 | "version": 3
674 | },
675 | "file_extension": ".py",
676 | "mimetype": "text/x-python",
677 | "name": "python",
678 | "nbconvert_exporter": "python",
679 | "pygments_lexer": "ipython3",
680 | "version": "3.7.4"
681 | }
682 | },
683 | "nbformat": 4,
684 | "nbformat_minor": 2
685 | }
686 |
--------------------------------------------------------------------------------
/in_class_notebooks/NMF/NMF_rev1.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {},
7 | "outputs": [],
8 | "source": [
9 | "import numpy as np\n",
10 | "from sklearn.decomposition import NMF\n",
11 | "import time\n",
12 | "import copy\n",
13 | "import math"
14 | ]
15 | },
16 | {
17 | "cell_type": "code",
18 | "execution_count": 2,
19 | "metadata": {},
20 | "outputs": [],
21 | "source": [
22 | "X = np.array(np.random.choice([0,1],(100,20)))"
23 | ]
24 | },
25 | {
26 | "cell_type": "code",
27 | "execution_count": 3,
28 | "metadata": {},
29 | "outputs": [
30 | {
31 | "data": {
32 | "text/plain": [
33 | "array([[0, 1, 0, ..., 1, 1, 1],\n",
34 | " [1, 0, 1, ..., 1, 0, 1],\n",
35 | " [1, 1, 1, ..., 1, 0, 1],\n",
36 | " ...,\n",
37 | " [0, 0, 1, ..., 0, 0, 0],\n",
38 | " [0, 0, 0, ..., 1, 0, 0],\n",
39 | " [1, 0, 0, ..., 0, 0, 0]])"
40 | ]
41 | },
42 | "execution_count": 3,
43 | "metadata": {},
44 | "output_type": "execute_result"
45 | }
46 | ],
47 | "source": [
48 | "X"
49 | ]
50 | },
51 | {
52 | "cell_type": "code",
53 | "execution_count": 5,
54 | "metadata": {},
55 | "outputs": [
56 | {
57 | "name": "stdout",
58 | "output_type": "stream",
59 | "text": [
60 | "0.9993326356321898 0.0047724633550728646 0.4744848605751693 11.10587651389246\n",
61 | "0.9970518224489903 0.0010757902841025402 0.4363152034895211 4.696081830045622\n"
62 | ]
63 | }
64 | ],
65 | "source": [
66 | "#use the same init\n",
67 | "n_factors=4\n",
68 | "W0=np.random.random((X.shape[0],n_factors))\n",
69 | "H0=np.random.random((n_factors,X.shape[1]))\n",
70 | "print(W0.max(), W0.min(), W0.mean(), np.linalg.norm(W0))\n",
71 | "print(H0.max(), H0.min(), H0.mean(), np.linalg.norm(H0))"
72 | ]
73 | },
74 | {
75 | "cell_type": "code",
76 | "execution_count": 21,
77 | "metadata": {},
78 | "outputs": [
79 | {
80 | "name": "stdout",
81 | "output_type": "stream",
82 | "text": [
83 | "(100, 4) (4, 20)\n",
84 | "0.8534930074730912 0.0 0.26878123064509496 6.868154390442655\n",
85 | "2.457904908102121 0.0 0.48343643704125794 6.60419177612916\n",
86 | "{'alpha': 0.0, 'beta_loss': 'frobenius', 'init': 'custom', 'l1_ratio': 0.0, 'max_iter': 500, 'n_components': 4, 'random_state': None, 'shuffle': False, 'solver': 'cd', 'tol': 0.0001, 'verbose': 0}\n",
87 | "CPU times: user 15.6 ms, sys: 0 ns, total: 15.6 ms\n",
88 | "Wall time: 16.5 ms\n"
89 | ]
90 | }
91 | ],
92 | "source": [
93 | "%%time\n",
94 | "n_factors=4\n",
95 | "nmf = NMF(n_components=n_factors, init='custom',max_iter=500)\n",
96 | "W = nmf.fit_transform(X,W=W0.astype(np.double),H=H0.astype(np.double)) #corresponding to the U matrix\n",
97 | "H = nmf.components_ #corresponding to the V matrix\n",
98 | "print(W.shape, H.shape)\n",
99 | "print(W.max(), W.min(), W.mean(), np.linalg.norm(W))\n",
100 | "print(H.max(), H.min(), H.mean(), np.linalg.norm(H))\n",
101 | "print(nmf.get_params())"
102 | ]
103 | },
104 | {
105 | "cell_type": "code",
106 | "execution_count": 7,
107 | "metadata": {},
108 | "outputs": [
109 | {
110 | "data": {
111 | "text/plain": [
112 | "375"
113 | ]
114 | },
115 | "execution_count": 7,
116 | "metadata": {},
117 | "output_type": "execute_result"
118 | }
119 | ],
120 | "source": [
121 | "nmf.n_iter_"
122 | ]
123 | },
124 | {
125 | "cell_type": "code",
126 | "execution_count": 12,
127 | "metadata": {},
128 | "outputs": [],
129 | "source": [
130 | "# manual decomposition\n",
131 | "def updateU(X,U,V,n_factors,gamma=0.01):\n",
132 | " perm = np.array(range(n_factors))\n",
133 | " np.random.shuffle(perm) \n",
134 | " for k in perm:\n",
135 | " for i in range(X.shape[0]):\n",
136 | "# grad=0\n",
137 | "# Z=0\n",
138 | "# for j in range(X.shape[1]):\n",
139 | "# grad+=(X[i,j]-np.dot(U[i,:],V[:,j])+U[i,k]*V[k,j])*V[k,j]\n",
140 | "# Z += V[k,j]*V[k,j] \n",
141 | " grad = np.sum(V[k,:]*(X[i,:]-np.dot(U[i,:],V[:])+(U[i,k]*V[k,:]))) \n",
142 | " Z = np.sum(np.square(V[k,:])) \n",
143 | " U[i,k]=U[i,k]+gamma*(grad/Z-U[i,k]) #moving average \n",
144 | " return U \n",
145 | "\n",
146 | "def updateV(X,U,V, n_factors,gamma=0.01):\n",
147 | " perm = np.array(range(n_factors))\n",
148 | " np.random.shuffle(perm)\n",
149 | " for k in perm:\n",
150 | " for j in range(X.shape[1]):\n",
151 | "# grad=0\n",
152 | "# Z=0\n",
153 | "# for i in range(X.shape[0]):\n",
154 | "# grad += (X[i,j]-np.dot(U[i,:],V[:,j])+U[i,k]*V[k,j])*U[i,k]\n",
155 | "# Z += U[i,k]*U[i,k]\n",
156 | " grad = np.sum(U[:,k]*(X[:,j]-np.dot(U[:],V[:,j])+V[k,j]*U[:,k])) \n",
157 | " Z = np.sum(np.square(U[:,k])) \n",
158 | " V[k,j]=V[k,j]+gamma*(grad/Z-V[k,j]) #moving average \n",
159 | " \n",
160 | " return V\n"
161 | ]
162 | },
163 | {
164 | "cell_type": "code",
165 | "execution_count": 13,
166 | "metadata": {},
167 | "outputs": [],
168 | "source": [
169 | "def NMF1(X,U,V,n_factors,itr=500,tol=0.0001,gamma=0.1):\n",
170 | " i=0\n",
171 | " ut = math.inf\n",
172 | " vt = math.inf\n",
173 | " while (i < itr)and((ut>tol)or(vt>tol)):\n",
174 | " Uc = copy.deepcopy(U)\n",
175 | " Vc = copy.deepcopy(V)\n",
176 | " U = updateU(X, U, V, n_factors,gamma)\n",
177 | " V = updateV(X, U, V, n_factors,gamma)\n",
178 | " ut = np.linalg.norm(Uc-U)\n",
179 | " vt = np.linalg.norm(Vc-V)\n",
180 | " i+=1\n",
181 | "\n",
182 | " meta=[i,ut,vt] \n",
183 | " return U, V, meta \n"
184 | ]
185 | },
186 | {
187 | "cell_type": "code",
188 | "execution_count": 26,
189 | "metadata": {},
190 | "outputs": [
191 | {
192 | "name": "stdout",
193 | "output_type": "stream",
194 | "text": [
195 | "1.4256359486225392 -0.7353469841280715 0.33773175031711955 9.711463203731865\n",
196 | "1.0565875214032427 -0.28051829669813505 0.3539178638357291 4.361622729372063\n",
197 | "[500, 0.004303717022975672, 0.0018612614344892034]\n",
198 | "CPU times: user 4.17 s, sys: 0 ns, total: 4.17 s\n",
199 | "Wall time: 4.17 s\n"
200 | ]
201 | }
202 | ],
203 | "source": [
204 | "%%time\n",
205 | "n_factors=4\n",
206 | "U0 = copy.deepcopy(W0)\n",
207 | "V0= copy.deepcopy(H0)\n",
208 | "U, V, meta = NMF1(X,U0,V0, n_factors,itr=500,gamma=0.1)\n",
209 | "print(U.max(), U.min(), U.mean(), np.linalg.norm(U))\n",
210 | "print(V.max(), V.min(), V.mean(), np.linalg.norm(V)) \n",
211 | "print(meta)"
212 | ]
213 | },
214 | {
215 | "cell_type": "code",
216 | "execution_count": 27,
217 | "metadata": {},
218 | "outputs": [
219 | {
220 | "data": {
221 | "text/plain": [
222 | "0.011676009680096841"
223 | ]
224 | },
225 | "execution_count": 27,
226 | "metadata": {},
227 | "output_type": "execute_result"
228 | }
229 | ],
230 | "source": [
231 | "np.linalg.norm(W-U)/(W.shape[0]*W.shape[1])"
232 | ]
233 | },
234 | {
235 | "cell_type": "code",
236 | "execution_count": 28,
237 | "metadata": {},
238 | "outputs": [
239 | {
240 | "data": {
241 | "text/plain": [
242 | "0.04298984278940894"
243 | ]
244 | },
245 | "execution_count": 28,
246 | "metadata": {},
247 | "output_type": "execute_result"
248 | }
249 | ],
250 | "source": [
251 | "np.linalg.norm(H-V)/(H.shape[0]*H.shape[1])"
252 | ]
253 | },
254 | {
255 | "cell_type": "code",
256 | "execution_count": 29,
257 | "metadata": {},
258 | "outputs": [
259 | {
260 | "data": {
261 | "text/plain": [
262 | "0.7353469841280715"
263 | ]
264 | },
265 | "execution_count": 29,
266 | "metadata": {},
267 | "output_type": "execute_result"
268 | }
269 | ],
270 | "source": [
271 | "np.abs(W-U).max()"
272 | ]
273 | },
274 | {
275 | "cell_type": "code",
276 | "execution_count": 30,
277 | "metadata": {},
278 | "outputs": [
279 | {
280 | "data": {
281 | "text/plain": [
282 | "1.5553761178059644"
283 | ]
284 | },
285 | "execution_count": 30,
286 | "metadata": {},
287 | "output_type": "execute_result"
288 | }
289 | ],
290 | "source": [
291 | "np.abs(H-V).max()"
292 | ]
293 | },
294 | {
295 | "cell_type": "code",
296 | "execution_count": 31,
297 | "metadata": {},
298 | "outputs": [
299 | {
300 | "data": {
301 | "text/plain": [
302 | "0.18365286237996753"
303 | ]
304 | },
305 | "execution_count": 31,
306 | "metadata": {},
307 | "output_type": "execute_result"
308 | }
309 | ],
310 | "source": [
311 | "np.abs(W-U).mean()"
312 | ]
313 | },
314 | {
315 | "cell_type": "code",
316 | "execution_count": 32,
317 | "metadata": {},
318 | "outputs": [
319 | {
320 | "data": {
321 | "text/plain": [
322 | "0.26195447960394697"
323 | ]
324 | },
325 | "execution_count": 32,
326 | "metadata": {},
327 | "output_type": "execute_result"
328 | }
329 | ],
330 | "source": [
331 | "np.abs(H-V).mean()"
332 | ]
333 | },
334 | {
335 | "cell_type": "markdown",
336 | "metadata": {},
337 | "source": [
338 | "Conclusion: The result is slightly different for the sklearn NMF and manual created NMF. It is because their optimization methods and regularization (thus also the loss function) are different, so the solutions they converge into can be a little different."
339 | ]
340 | }
341 | ],
342 | "metadata": {
343 | "kernelspec": {
344 | "display_name": "Python 3",
345 | "language": "python",
346 | "name": "python3"
347 | },
348 | "language_info": {
349 | "codemirror_mode": {
350 | "name": "ipython",
351 | "version": 3
352 | },
353 | "file_extension": ".py",
354 | "mimetype": "text/x-python",
355 | "name": "python",
356 | "nbconvert_exporter": "python",
357 | "pygments_lexer": "ipython3",
358 | "version": "3.7.4"
359 | }
360 | },
361 | "nbformat": 4,
362 | "nbformat_minor": 2
363 | }
364 |
--------------------------------------------------------------------------------
/in_class_notebooks/NMF/data/articles.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/libphy/CSCI4622-20SP-MachineLearning/ddd39b8369fc6a9fc9aea46bcdedcc306ff78aa5/in_class_notebooks/NMF/data/articles.pkl
--------------------------------------------------------------------------------
/in_class_notebooks/NN-intro-Keras/fashion_mnist_with_keras.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "nbformat": 4,
3 | "nbformat_minor": 0,
4 | "metadata": {
5 | "colab": {
6 | "name": "fashion-mnist-with-keras.ipynb",
7 | "provenance": [],
8 | "collapsed_sections": []
9 | },
10 | "kernelspec": {
11 | "name": "python3",
12 | "display_name": "Python 3"
13 | },
14 | "language_info": {
15 | "codemirror_mode": {
16 | "name": "ipython",
17 | "version": 3
18 | },
19 | "file_extension": ".py",
20 | "mimetype": "text/x-python",
21 | "name": "python",
22 | "nbconvert_exporter": "python",
23 | "pygments_lexer": "ipython3",
24 | "version": "3.6.7"
25 | },
26 | "accelerator": "GPU"
27 | },
28 | "cells": [
29 | {
30 | "cell_type": "markdown",
31 | "metadata": {
32 | "colab_type": "text",
33 | "id": "IZrAitlFLdEZ"
34 | },
35 | "source": [
36 | "# Fashion-MNIST with tf.keras\n",
37 | "\n",
38 | "Welcome! In this lab, you'll learn how to train an image classifier train on the [Fashion-MNIST dataset](https://github.com/zalandoresearch/fashion-mnist) using TensorFlow 2. You'll go through all the steps, including loading the data, building and training a model, calculating the accuracy, and making predictions. Our focus here is on the code.\n",
39 | "\n",
40 | "The biggest change to TensorFlow is that it runs with eager execution by default."
41 | ]
42 | },
43 | {
44 | "cell_type": "code",
45 | "metadata": {
46 | "colab_type": "code",
47 | "id": "jSmUsjJfMEqC",
48 | "colab": {
49 | "base_uri": "https://localhost:8080/",
50 | "height": 35
51 | },
52 | "outputId": "99a1a7a4-0676-4962-b1e3-5182a0be6628"
53 | },
54 | "source": [
55 | "%tensorflow_version 2.x\n",
56 | "import tensorflow as tf\n",
57 | "\n",
58 | "import numpy as np"
59 | ],
60 | "execution_count": 2,
61 | "outputs": [
62 | {
63 | "output_type": "stream",
64 | "text": [
65 | "TensorFlow 2.x selected.\n"
66 | ],
67 | "name": "stdout"
68 | }
69 | ]
70 | },
71 | {
72 | "cell_type": "code",
73 | "metadata": {
74 | "id": "r0QDiSfo_2oE",
75 | "colab_type": "code",
76 | "outputId": "14a2f4ba-07d5-4440-8bae-a82daec4ad40",
77 | "colab": {
78 | "base_uri": "https://localhost:8080/",
79 | "height": 35
80 | }
81 | },
82 | "source": [
83 | "tf.__version__"
84 | ],
85 | "execution_count": 3,
86 | "outputs": [
87 | {
88 | "output_type": "execute_result",
89 | "data": {
90 | "text/plain": [
91 | "'2.1.0'"
92 | ]
93 | },
94 | "metadata": {
95 | "tags": []
96 | },
97 | "execution_count": 3
98 | }
99 | ]
100 | },
101 | {
102 | "cell_type": "markdown",
103 | "metadata": {
104 | "colab_type": "text",
105 | "id": "B8Lhscw0NDln"
106 | },
107 | "source": [
108 | "### Step 1: Download the dataset\n",
109 | "\n",
110 | "The Fashion-MNIST dataset contains thousands of grayscale images of Zalando fashion articles."
111 | ]
112 | },
113 | {
114 | "cell_type": "code",
115 | "metadata": {
116 | "colab_type": "code",
117 | "id": "FKiwTuT-NE6f",
118 | "colab": {
119 | "base_uri": "https://localhost:8080/",
120 | "height": 162
121 | },
122 | "outputId": "7ed038f7-6c70-46f3-b63d-463b793ef30d"
123 | },
124 | "source": [
125 | "(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.fashion_mnist.load_data()"
126 | ],
127 | "execution_count": 4,
128 | "outputs": [
129 | {
130 | "output_type": "stream",
131 | "text": [
132 | "Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-labels-idx1-ubyte.gz\n",
133 | "32768/29515 [=================================] - 0s 0us/step\n",
134 | "Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-images-idx3-ubyte.gz\n",
135 | "26427392/26421880 [==============================] - 0s 0us/step\n",
136 | "Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-labels-idx1-ubyte.gz\n",
137 | "8192/5148 [===============================================] - 0s 0us/step\n",
138 | "Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-images-idx3-ubyte.gz\n",
139 | "4423680/4422102 [==============================] - 0s 0us/step\n"
140 | ],
141 | "name": "stdout"
142 | }
143 | ]
144 | },
145 | {
146 | "cell_type": "markdown",
147 | "metadata": {
148 | "colab_type": "text",
149 | "id": "e2n2NVdKNk5i"
150 | },
151 | "source": [
152 | "### Step 2) Understand the data format\n",
153 | "\n",
154 | "We are given the images as a 3-D array of integer values that is of shape (*N*, 28, 28), where *N* is the number of images in the training or test set. The labels are 1-D array of the integer values of each image."
155 | ]
156 | },
157 | {
158 | "cell_type": "code",
159 | "metadata": {
160 | "id": "pWRdG-PwDXL9",
161 | "colab_type": "code",
162 | "colab": {
163 | "base_uri": "https://localhost:8080/",
164 | "height": 53
165 | },
166 | "outputId": "7ae7a3fa-a39c-42fb-83e3-a61c451c5710"
167 | },
168 | "source": [
169 | "print(train_images.shape, train_labels.shape)\n",
170 | "print(test_images.shape, test_labels.shape)"
171 | ],
172 | "execution_count": 5,
173 | "outputs": [
174 | {
175 | "output_type": "stream",
176 | "text": [
177 | "(60000, 28, 28) (60000,)\n",
178 | "(10000, 28, 28) (10000,)\n"
179 | ],
180 | "name": "stdout"
181 | }
182 | ]
183 | },
184 | {
185 | "cell_type": "code",
186 | "metadata": {
187 | "id": "uAnYPsUKDnL_",
188 | "colab_type": "code",
189 | "colab": {
190 | "base_uri": "https://localhost:8080/",
191 | "height": 35
192 | },
193 | "outputId": "28b6b80a-7272-411c-c588-1b279e4eea1e"
194 | },
195 | "source": [
196 | "set(test_labels)"
197 | ],
198 | "execution_count": 6,
199 | "outputs": [
200 | {
201 | "output_type": "execute_result",
202 | "data": {
203 | "text/plain": [
204 | "{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}"
205 | ]
206 | },
207 | "metadata": {
208 | "tags": []
209 | },
210 | "execution_count": 6
211 | }
212 | ]
213 | },
214 | {
215 | "cell_type": "markdown",
216 | "metadata": {
217 | "colab_type": "text",
218 | "id": "eEFU58MaNPpk"
219 | },
220 | "source": [
221 | "### Step 3) Visualize the data\n",
222 | "Let's see how the images look. This function shows a random example along with it's corresponding label."
223 | ]
224 | },
225 | {
226 | "cell_type": "code",
227 | "metadata": {
228 | "colab_type": "code",
229 | "id": "AwxNOsCMNNGd",
230 | "outputId": "617be20b-d250-44de-da20-66878be0c04c",
231 | "colab": {
232 | "base_uri": "https://localhost:8080/",
233 | "height": 301
234 | }
235 | },
236 | "source": [
237 | "%matplotlib inline\n",
238 | "import random\n",
239 | "import matplotlib.pyplot as plt\n",
240 | "\n",
241 | "i = random.randint(0, 100)\n",
242 | "\n",
243 | "print(\"Label: %s\" % train_labels[i])\n",
244 | "plt.imshow(train_images[i], cmap='gray')"
245 | ],
246 | "execution_count": 6,
247 | "outputs": [
248 | {
249 | "output_type": "stream",
250 | "text": [
251 | "Label: 5\n"
252 | ],
253 | "name": "stdout"
254 | },
255 | {
256 | "output_type": "execute_result",
257 | "data": {
258 | "text/plain": [
259 | ""
260 | ]
261 | },
262 | "metadata": {
263 | "tags": []
264 | },
265 | "execution_count": 6
266 | },
267 | {
268 | "output_type": "display_data",
269 | "data": {
270 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAPsAAAD4CAYAAAAq5pAIAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0\ndHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAPZUlEQVR4nO3dXYwVZZ7H8d+fN18AlRZtEJoFCWqI\nRqYlxPiyURGCJgZJ1MAVGzU9F6MZk4mzZPZiMJtNJu666x0JE3BYM+uERHQIMTu4SHS8kNgSRWyY\nQUnj0DY0iGiDvAj896KLSYtdT7Xnpes0/+8n6Zxz6t9P1ZMDvz516qmqx9xdAC5+I8ruAIChQdiB\nIAg7EARhB4Ig7EAQo4ZyY2bGoX+gztzdBlpe1Se7mS0ys7+Y2admtqKadQGoL6t0nN3MRkr6q6QF\nkvZLel/SMnfvSLThkx2os3p8ss+T9Km773X305L+IGlxFesDUEfVhH2KpL/1e70/W/Y9ZtZmZu1m\n1l7FtgBUqe4H6Nx9taTVErvxQJmq+WTvktTS7/XUbBmABlRN2N+XNMvMZpjZGElLJW2sTbcA1FrF\nu/HufsbMnpL0J0kjJa11909q1jMANVXx0FtFG+M7O1B3dTmpBsDwQdiBIAg7EARhB4Ig7EAQhB0I\ngrADQRB2IAjCDgRB2IEgCDsQBGEHgiDsQBCEHQiCsANBEHYgCMIOBEHYgSAIOxAEYQeCIOxAEIQd\nCIKwA0EQdiAIwg4EQdiBIAg7EARhB4Ig7EAQhB0IouL52SXJzDol9Uo6K+mMu8+tRacA1F5VYc/c\n6+6Ha7AeAHXEbjwQRLVhd0mbzewDM2sb6BfMrM3M2s2svcptAaiCuXvljc2muHuXmV0r6U1JT7v7\nO4nfr3xjAAbF3W2g5VV9srt7V/bYI+k1SfOqWR+A+qk47GY21szGn38uaaGknbXqGIDaquZofLOk\n18zs/Hr+x93/tya9AlBzVX1n/9Eb4zs7UHd1+c4OYPgg7EAQhB0IgrADQRB2IAjCDgRB2IEgCDsQ\nBGEHgiDsQBCEHQiCsANBEHYgiFrccBIlGzEi/2/2uXPnqlr37bffnqwfPXo0Wd+9e3dV26+X7NLs\nXNVeDTpy5MiKt3/mzJmqtp2HT3YgCMIOBEHYgSAIOxAEYQeCIOxAEIQdCIJx9mGgaMz27NmzFa97\nwYIFyfoTTzyRrJ88eTJZf/311yuqDcaoUen/vtWcf1D0np86dSpZr+bfpF74ZAeCIOxAEIQdCIKw\nA0EQdiAIwg4EQdiBIJjF9SK3cuXKZL2lpSVZ37t3b7J+4403Juvjx4/PrR05ciTZtmiMv5G1trYm\n63fccUdubfbs2cm269evz621t7ert7e3sllczWytmfWY2c5+y5rM7E0z25M9TihaD4ByDWY3/neS\nFl2wbIWkLe4+S9KW7DWABlYYdnd/R9KF+1uLJa3Lnq+T9HCN+wWgxio9N77Z3buz5wckNef9opm1\nSWqrcDsAaqTqC2Hc3VMH3tx9taTVEgfogDJVOvR20MwmS1L22FO7LgGoh0rDvlHS8uz5ckl/rE13\nANRL4Ti7mb0i6R5JEyUdlPRrSa9LWi9pmqR9kh5z9/SgqeLuxo8ZMyZZP336dLI+b968ZH3NmjW5\ntY0bNybb9vb2Juv33Xdfsv7uu+8m63v27MmtPfLII8m2V1xxRbK+adOmZL2nJ3+H89tvv022vfnm\nm5P1SZMmJevHjx9P1lPXwxfdN/7gwYO5tQ0bNujQoUMDjrMXfmd392U5pflFbQE0Dk6XBYIg7EAQ\nhB0IgrADQRB2IIhhdYlraprbQQwhVlVP3Za4XlPsnpcaWpPSl6kuXLgw2faaa65J1teuXZusX3rp\npcn6G2+8kVvr7OxMtr3pppuS9RkzZiTrqdtFF/17p4a3BuPEiRPJ+nfffZdb27FjR7Lt5s2bk3V3\nr+wSVwAXB8IOBEHYgSAIOxAEYQeCIOxAEIQdCGJYTdlczTkBRW2L6kVT/FZj0aIL7+f5fRMmpG/e\nW82Y8KFDh5L1hx56KFm/9tprk/VVq1bl1opuQ/32228n60ePHk3WR48enVtrampKtm1uzr3TmiTp\n888/T9bfeuutZH3btm3Jej3wyQ4EQdiBIAg7EARhB4Ig7EAQhB0IgrADQQz59eyp64hT14xL6bHw\nasfBL7/88mT9/vvvz63Nn5++0e7111+frBddU/7FF18k62fPns2tPfroo8m2ZXrxxReT9VmzZiXr\n7733XrKemi766quvTrZ94YUXkvWOjo5kvZ6K7uvA9exAcIQdCIKwA0EQdiAIwg4EQdiBIAg7EMSw\num98NZ5++ulk/e67707WU9eMHzhwINn28OHDyfqcOXOS9dbW1mQ9Nf3wvffem2xbpOj+6vX8//PA\nAw8k60XvS2oq7KJ/k5deeilZL1Lm+1bxOLuZrTWzHjPb2W/ZSjPrMrMPs58Ha9lZALU3mN3430ka\n6FYq/+Xuc7Kf/Gk/ADSEwrC7+zuSjgxBXwDUUTUH6J4ysx3Zbn7uTdLMrM3M2s2svYptAahSpWFf\nJWmmpDmSuiXlXjXg7qvdfa67z61wWwBqoKKwu/tBdz/r7uck/VbSvNp2C0CtVRR2M5vc7+USSTvz\nfhdAYyi8b7yZvSLpHkkTzWy/pF9LusfM5khySZ2SflqLzlx11VXJ+mWXXZZb6+7uTrYdO3Zssr57\n9+5kvaenJ7c2ceLEZNuiOcy3bNmSrBfN9b106dJkvRrVjgffdtttubWi6/yL7kmfGkeX0mPpU6dO\nTbZ97rnnkvXUtfJS+h4DkvTll1/m1rq6upJtX3755WQ9T2HY3X3ZAIvXVLQ1AKXhdFkgCMIOBEHY\ngSAIOxAEYQeCGNIpm1taWvTss8/m1ouG3nbt2pVbmzZtWrJt0dDcV199laynpk0uGhobM2ZMsj57\n9uxkvWj635kzZ+bWii4THTlyZLK+cOHCZH3cuHHJemr468iR9CUXRUOal1xySbKeuhV1ahhXKp4O\numhorWjIMnUr69RwpSQdO3Yst7Z169bcGp/sQBCEHQiCsANBEHYgCMIOBEHYgSAIOxDEkN5Kety4\ncX7LLbfk1p9//vlk+9S4atGtgYtu91xU37dvX26taIx+1Kj06QxFl8CeOXMmWf/6669za8uXL0+2\nTV26KxVPF110DkFqrLvo3IhJkyYl66NHj07WT548mVsrOjeiKBepdQ9m/b29vbm1ovMLnnzyydza\nZ599phMnTjBlMxAZYQeCIOxAEIQdCIKwA0EQdiAIwg4EMaymbF6yZElu7fHHH0+2LZqS+corr0zW\nU+OqRePgRbcGTo2TS9I333yTrKeuKR8xIv33vGjdTU1NyXrRWHdqnL3oevaia8qLrsVvbm7OrRXd\nhrpo2+fOnUvWjx8/nqyn7q8wd
2568qT58+fn1vbu3cs4OxAdYQeCIOxAEIQdCIKwA0EQdiAIwg4E\nMazG2cvU2tqaW7v11luTbYuuZ7/zzjuT9VOnTiXrnZ2dubWi8weKpqresWNHsr59+/ZkvUzTp0/P\nrd1www3Jttddd12yXjQFeJHUOQYfffRRsm1HR0ey7u6VjbObWYuZbTWzDjP7xMx+ni1vMrM3zWxP\n9pg/iwKA0g1mN/6MpF+4+2xJt0v6mZnNlrRC0hZ3nyVpS/YaQIMqDLu7d7v79ux5r6RdkqZIWixp\nXfZr6yQ9XK9OAqjej5rrzcymS/qJpG2Smt39/Am+ByQNeCKymbVJaqu8iwBqYdBH481snKRXJT3j\n7t+7esL7jvINePDN3Ve7+1x3T5/dD6CuBhV2MxutvqD/3t03ZIsPmtnkrD5ZUvo2pQBKVTj0Zmam\nvu/kR9z9mX7L/13Sl+7+GzNbIanJ3X9ZsK5hO/QGDBd5Q2+DCftdkv4s6WNJ5y/i/ZX6vrevlzRN\n0j5Jj7l78gJlwg7UX8VhryXCDtRfxSfVALg4EHYgCMIOBEHYgSAIOxAEYQeCIOxAEIQdCIKwA0EQ\ndiAIwg4EQdiBIAg7EARhB4Ig7EAQhB0IgrADQRB2IAjCDgRB2IEgCDsQBGEHgiDsQBCEHQiCsANB\nEHYgCMIOBEHYgSAIOxBEYdjNrMXMtppZh5l9YmY/z5avNLMuM/sw+3mw/t0FUKnBzM8+WdJkd99u\nZuMlfSDpYUmPSTrm7v8x6I0xZTNQd3lTNo8aRMNuSd3Z814z2yVpSm27B6DeftR3djObLuknkrZl\ni54ysx1mttbMJuS0aTOzdjNrr6qnAKpSuBv/9180GyfpbUn/5u4bzKxZ0mFJLulf1ber/3jBOtiN\nB+osbzd+UGE3s9GSNkn6k7v/5wD16ZI2ufvNBesh7ECd5YV9MEfjTdIaSbv6Bz07cHfeEkk7q+0k\ngPoZzNH4uyT9WdLHks5li38laZmkOerbje+U9NPsYF5qXXyyA3VW1W58rRB2oP4q3o0HcHEg7EAQ\nhB0IgrADQRB2IAjCDgRB2IEgCDsQBGEHgiDsQBCEHQiCsANBEHYgCMIOBFF4w8kaOyxpX7/XE7Nl\njahR+9ao/ZLoW6Vq2bd/yCsM6fXsP9i4Wbu7zy2tAwmN2rdG7ZdE3yo1VH1jNx4IgrADQZQd9tUl\nbz+lUfvWqP2S6FulhqRvpX5nBzB0yv5kBzBECDsQRClhN7NFZvYXM/vUzFaU0Yc8ZtZpZh9n01CX\nOj9dNodej5nt7LesyczeNLM92eOAc+yV1LeGmMY7Mc14qe9d2dOfD/l3djMbKemvkhZI2i/pfUnL\n3L1jSDuSw8w6Jc1199JPwDCzf5R0TNJ/n59ay8yel3TE3X+T/aGc4O7/3CB9W6kfOY13nfqWN834\nP6nE966W059XooxP9nmSPnX3ve5+WtIfJC0uoR8Nz93fkXTkgsWLJa3Lnq9T33+WIZfTt4bg7t3u\nvj173ivp/DTjpb53iX4NiTLCPkXS3/q93q/Gmu/dJW02sw/MrK3szgygud80WwckNZfZmQEUTuM9\nlC6YZrxh3rtKpj+vFgfofugud2+V9ICkn2W7qw3J+76DNdLY6SpJM9U3B2C3pBfK7Ew2zfirkp5x\n92/618p87wbo15C8b2WEvUtSS7/XU7NlDcHdu7LHHkmvqe9rRyM5eH4G3eyxp+T+/J27H3T3s+5+\nTtJvVeJ7l00z/qqk37v7hmxx6e/dQP0aqvetjLC/L2mWmc0wszGSlkraWEI/fsDMxmYHTmRmYyUt\nVONNRb1R0vLs+XJJfyyxL9/TKNN4500zrpLfu9KnP3f3If+R9KD6jsh/JulfyuhDTr+ul/RR9vNJ\n2X2T9Ir6duu+U9+xjSckXS1pi6Q9kv5PUlMD9e1l9U3tvUN9wZpcUt/uUt8u+g5JH2Y/D5b93iX6\nNSTvG6fLAkFwgA4IgrADQRB2IAjCDgRB2IEgCDsQBGEHgvh/akg160ligkQAAAAASUVORK5CYII=\n",
271 | "text/plain": [
272 | ""
273 | ]
274 | },
275 | "metadata": {
276 | "tags": []
277 | }
278 | }
279 | ]
280 | },
281 | {
282 | "cell_type": "markdown",
283 | "metadata": {
284 | "id": "91U8S4iq_2oh",
285 | "colab_type": "text"
286 | },
287 | "source": [
288 | "Each training and test example is assigned one of the following labels:\n",
289 | "\n",
290 | "| Label | Description |\n",
291 | "| --- | --- |\n",
292 | "| 0 | T-shirt/top |\n",
293 | "| 1 | Trouser |\n",
294 | "| 2 | Pullover |\n",
295 | "| 3 | Dress |\n",
296 | "| 4 | Coat |\n",
297 | "| 5 | Sandal |\n",
298 | "| 6 | Shirt |\n",
299 | "| 7 | Sneaker |\n",
300 | "| 8 | Bag |\n",
301 | "| 9 | Ankle boot |"
302 | ]
303 | },
304 | {
305 | "cell_type": "markdown",
306 | "metadata": {
307 | "colab_type": "text",
308 | "id": "Eo_cZXaqODnZ"
309 | },
310 | "source": [
311 | "### Step 4) Reformat the images\n",
312 | "Here, we'll flatten (or unstack) the images. There are deep learning techniques that work with 2d images directly (rather than their flattened representation), but we'll start with this format. Instead of working with a 28 by 28 *image*, we'll unstack it into a 28 \\* 28 = 784 length *array*.\n",
313 | "\n",
314 | "* We want to convert the 3-D array of shape (*N*, 28, 28) to a 2-D array of shape (*N*, 784) where the second dimension is just an array of all the pixels in an image. This is called flattening, or unstacking, the images. \n",
315 | "* We also want to convert the pixel values from a number between 0 and 255 to a number between 0 and 1."
316 | ]
317 | },
318 | {
319 | "cell_type": "code",
320 | "metadata": {
321 | "colab_type": "code",
322 | "id": "OgnV5FJjP5Vz",
323 | "colab": {}
324 | },
325 | "source": [
326 | "TRAINING_SIZE = len(train_images)\n",
327 | "TEST_SIZE = len(test_images)\n",
328 | "\n",
329 | "# Reshape from (N, 28, 28) to (N, 28*28=784)\n",
330 | "train_images = np.reshape(train_images, (TRAINING_SIZE, 784))\n",
331 | "test_images = np.reshape(test_images, (TEST_SIZE, 784))\n",
332 | "\n",
333 | "# Convert the array to float32 as opposed to uint8\n",
334 | "train_images = train_images.astype(np.float32)\n",
335 | "test_images = test_images.astype(np.float32)\n",
336 | "\n",
337 | "# Convert the pixel values from integers between 0 and 255 to floats between 0 and 1\n",
338 | "train_images /= 255\n",
339 | "test_images /= 255"
340 | ],
341 | "execution_count": 0,
342 | "outputs": []
343 | },
344 | {
345 | "cell_type": "markdown",
346 | "metadata": {
347 | "colab_type": "text",
348 | "id": "GI25z0StQH-P"
349 | },
350 | "source": [
351 | "### Step 5) Reformat the labels\n",
352 | "\n",
353 | "Next, we want to convert the labels from an integer format (e.g., \"2\" or \"Pullover\"), to a [one hot encoding](https://en.wikipedia.org/wiki/One-hot) (e.g., \"0, 0, 1, 0, 0, 0, 0, 0, 0, 0\"). To do so, we'll use the `tf.keras.utils.to_categorical` [function](https://www.tensorflow.org/api_docs/python/tf/keras/utils/to_categorical) function."
354 | ]
355 | },
356 | {
357 | "cell_type": "code",
358 | "metadata": {
359 | "colab_type": "code",
360 | "id": "E9yrkEENQ9Vz",
361 | "outputId": "6e9ac878-ccba-4914-8bf1-db979f03d8f2",
362 | "colab": {
363 | "base_uri": "https://localhost:8080/",
364 | "height": 53
365 | }
366 | },
367 | "source": [
368 | "NUM_CAT = 10\n",
369 | "\n",
370 | "print(\"Before\", train_labels[0]) # The format of the labels before conversion\n",
371 | "\n",
372 | "train_labels = tf.keras.utils.to_categorical(train_labels, NUM_CAT)\n",
373 | "\n",
374 | "print(\"After\", train_labels[0]) # The format of the labels after conversion\n",
375 | "\n",
376 | "test_labels = tf.keras.utils.to_categorical(test_labels, NUM_CAT)"
377 | ],
378 | "execution_count": 8,
379 | "outputs": [
380 | {
381 | "output_type": "stream",
382 | "text": [
383 | "Before 9\n",
384 | "After [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]\n"
385 | ],
386 | "name": "stdout"
387 | }
388 | ]
389 | },
390 | {
391 | "cell_type": "markdown",
392 | "metadata": {
393 | "colab_type": "text",
394 | "id": "pjdbemHURkpv"
395 | },
396 | "source": [
397 | "### Step 6) Build the model\n",
398 | "\n",
399 | "Now, we'll create our neural network using the [Keras Sequential API](https://www.tensorflow.org/api_docs/python/tf/keras/Sequential). Keras is a high-level API to build and train deep learning models and is user friendly, modular and easy to extend. `tf.keras` is TensorFlow's implementation of this API and it supports such things as eager execution, `tf.data` pipelines and Estimators.\n",
400 | "\n",
401 | "Architecture wise, we'll use a single hidden layer network, where:\n",
402 | "* The hidden layer will have 512 units using the [ReLU](https://www.tensorflow.org/api_docs/python/tf/keras/activations/relu) activation function. \n",
403 | "* The output layer will have 10 units and use [softmax](https://www.tensorflow.org/api_docs/python/tf/keras/activations/softmax) function. \n",
404 | "* Notice, we specify the input shape on the first layer. If you add subsequent layers, this is not necessary. \n",
405 | "* We will use the [categorical crossentropy](https://www.tensorflow.org/api_docs/python/tf/keras/losses/categorical_crossentropy) loss function, and the [SGD optimizer](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/)."
406 | ]
407 | },
408 | {
409 | "cell_type": "code",
410 | "metadata": {
411 | "colab_type": "code",
412 | "id": "mNscbvHkUrMc",
413 | "outputId": "615f8dcb-2630-4c75-af98-604438de8d39",
414 | "colab": {
415 | "base_uri": "https://localhost:8080/",
416 | "height": 235
417 | }
418 | },
419 | "source": [
420 | "model = tf.keras.Sequential()\n",
421 | "model.add(tf.keras.layers.Dense(512, activation=tf.nn.relu, input_shape=(784,)))\n",
422 | "model.add(tf.keras.layers.Dense(NUM_CAT, activation=tf.nn.softmax))\n",
423 | "\n",
424 | "# We will now compile and print out a summary of our model\n",
425 | "opt = tf.keras.optimizers.SGD(learning_rate=0.1)\n",
426 | "\n",
427 | "model.compile(loss='categorical_crossentropy',\n",
428 | " optimizer=opt,\n",
429 | " metrics=['accuracy'])\n",
430 | "\n",
431 | "model.summary()"
432 | ],
433 | "execution_count": 9,
434 | "outputs": [
435 | {
436 | "output_type": "stream",
437 | "text": [
438 | "Model: \"sequential\"\n",
439 | "_________________________________________________________________\n",
440 | "Layer (type) Output Shape Param # \n",
441 | "=================================================================\n",
442 | "dense (Dense) (None, 512) 401920 \n",
443 | "_________________________________________________________________\n",
444 | "dense_1 (Dense) (None, 10) 5130 \n",
445 | "=================================================================\n",
446 | "Total params: 407,050\n",
447 | "Trainable params: 407,050\n",
448 | "Non-trainable params: 0\n",
449 | "_________________________________________________________________\n"
450 | ],
451 | "name": "stdout"
452 | }
453 | ]
454 | },
455 | {
456 | "cell_type": "markdown",
457 | "metadata": {
458 | "colab_type": "text",
459 | "id": "k3br9Yi6VuBT"
460 | },
461 | "source": [
462 | "### Step 7) Training\n",
463 | "\n",
464 | "Next, we will train the model by using the [fit method](https://www.tensorflow.org/api_docs/python/tf/keras/Sequential#fit) for 5 [epochs](https://www.quora.com/What-is-epochs-in-machine-learning). We will keep track of the training loss and accuracy as we go. Please be patient as this step may take a while depending on your hardware."
465 | ]
466 | },
467 | {
468 | "cell_type": "code",
469 | "metadata": {
470 | "colab_type": "code",
471 | "id": "gBs0LwqcVXx6",
472 | "outputId": "e5fd470d-ee43-4cd0-d21f-ecd96a94452b",
473 | "colab": {
474 | "base_uri": "https://localhost:8080/",
475 | "height": 235
476 | }
477 | },
478 | "source": [
479 | "model.fit(train_images, train_labels, epochs=5)"
480 | ],
481 | "execution_count": 10,
482 | "outputs": [
483 | {
484 | "output_type": "stream",
485 | "text": [
486 | "Train on 60000 samples\n",
487 | "Epoch 1/5\n",
488 | "60000/60000 [==============================] - 7s 113us/sample - loss: 0.5175 - accuracy: 0.8165\n",
489 | "Epoch 2/5\n",
490 | "60000/60000 [==============================] - 5s 86us/sample - loss: 0.3851 - accuracy: 0.8600\n",
491 | "Epoch 3/5\n",
492 | "60000/60000 [==============================] - 5s 87us/sample - loss: 0.3467 - accuracy: 0.8734\n",
493 | "Epoch 4/5\n",
494 | "60000/60000 [==============================] - 5s 81us/sample - loss: 0.3209 - accuracy: 0.8824\n",
495 | "Epoch 5/5\n",
496 | "60000/60000 [==============================] - 5s 82us/sample - loss: 0.3043 - accuracy: 0.8880\n"
497 | ],
498 | "name": "stdout"
499 | },
500 | {
501 | "output_type": "execute_result",
502 | "data": {
503 | "text/plain": [
504 | ""
505 | ]
506 | },
507 | "metadata": {
508 | "tags": []
509 | },
510 | "execution_count": 10
511 | }
512 | ]
513 | },
514 | {
515 | "cell_type": "markdown",
516 | "metadata": {
517 | "colab_type": "text",
518 | "id": "rcYMPkwkWIPq"
519 | },
520 | "source": [
521 | "### Step 8) Testing\n",
522 | "Now that we have trained our model, we want to evaluate it. Sure, our model is >88% accurate on the training set, but what about on data it hasn't seen before? The test accuracy is a good metric for that."
523 | ]
524 | },
525 | {
526 | "cell_type": "code",
527 | "metadata": {
528 | "colab_type": "code",
529 | "id": "iuqDe4NiWBpU",
530 | "outputId": "abf59397-0d2a-4941-ca93-770fb826391f",
531 | "colab": {
532 | "base_uri": "https://localhost:8080/",
533 | "height": 53
534 | }
535 | },
536 | "source": [
537 | "loss, accuracy = model.evaluate(test_images, test_labels)\n",
538 | "print('Test accuracy: %.2f' % (accuracy))"
539 | ],
540 | "execution_count": 11,
541 | "outputs": [
542 | {
543 | "output_type": "stream",
544 | "text": [
545 | "10000/10000 [==============================] - 1s 86us/sample - loss: 0.3447 - accuracy: 0.8745\n",
546 | "Test accuracy: 0.87\n"
547 | ],
548 | "name": "stdout"
549 | }
550 | ]
551 | },
552 | {
553 | "cell_type": "markdown",
554 | "metadata": {
555 | "colab_type": "text",
556 | "id": "jo-yoMwvXkw6"
557 | },
558 | "source": [
559 | "## To Do\n",
560 | "Congrats! You have successfully used TensorFlow Keras to train a model on the Fashion-MNIST dataset.\n",
561 | "Now, try with different hyperparameters such as:\n",
562 | "- Number of neurons in a layer\n",
563 | "- Number of layers\n",
564 | "- Learning rate\n",
565 | "- Different optimizer\n",
566 | "- Number of epochs\n",
567 | "\n",
568 | "## Question\n",
569 | "What happens if you use sigmoid activation function instead of softmax activation?\n",
570 | "How are they different?\n",
571 | "Wihch activation function should you use for this case, and which can you use for another case? (give an example)"
572 | ]
573 | },
574 | {
575 | "cell_type": "code",
576 | "metadata": {
577 | "id": "fyqXY7Bk_2pW",
578 | "colab_type": "code",
579 | "colab": {}
580 | },
581 | "source": [
582 | ""
583 | ],
584 | "execution_count": 0,
585 | "outputs": []
586 | }
587 | ]
588 | }
--------------------------------------------------------------------------------
/in_class_notebooks/RNN_Keras/in-class-RNN-keras-release.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## Recurrent Neural Networks with Keras\n",
8 | "***\n",
9 | "\n",
10 | "In this notebook we will work through an example using a recurrent neural network with the Keras wrapper on Tensorflow. \n"
11 | ]
12 | },
13 | {
14 | "cell_type": "markdown",
15 | "metadata": {},
16 | "source": [
17 | "Recurrent neural networks use a specific architecture that ensures persistence of information that it has seen in the past. This *memory* allows these networks to learn information sequences though it sees parts of the sequence one at a time.\n",
18 | "\n",
19 | "RNNs achieve this because of a recurrent connection in their hidden layer allowing information from one time step to go not only to the output layer but also back into the network. \n",
20 | "\n",
21 | "RNNs suffer from a problem called the *vanishing gradient* problem. In short, this keeps them from learning very long term dependencies, causing them to *forget* information from very early on.\n"
22 | ]
23 | },
24 | {
25 | "cell_type": "markdown",
26 | "metadata": {},
27 | "source": [
28 | "**Some Resources**:\n",
29 | "\n",
30 | "- [Christopher Olah's blogpost on LSTMs and GRUs](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)\n",
31 | "- [Jason Brownlee's detailed post on LSTMs](https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting/)\n"
32 | ]
33 | },
34 | {
35 | "cell_type": "markdown",
36 | "metadata": {},
37 | "source": [
38 | "Make sure you have Keras and Tensorflow installed. Run the cell below to import the required libraries"
39 | ]
40 | },
41 | {
42 | "cell_type": "code",
43 | "execution_count": null,
44 | "metadata": {},
45 | "outputs": [],
46 | "source": [
47 | "from __future__ import print_function\n",
48 | "from keras.models import Sequential\n",
49 | "from keras import layers\n",
50 | "import numpy as np\n",
51 | "from six.moves import range"
52 | ]
53 | },
54 | {
55 | "cell_type": "markdown",
56 | "metadata": {},
57 | "source": [
58 | "#### RNN learns to perform addition on two numbers: Sequence to sequence learning\n",
59 | "\n",
60 | "This is a simple example [available on the Keras website](https://keras.io/examples/addition_rnn/). It does offer insight into how the RNN/LSTM leverages sequential information. \n",
61 | "\n",
62 | "In this you will take in a sequence of characters denoting the sum of two numbers e.g. '301+4' and train a network to return an output sequence of characters representing the answer e.g., '305'. The input sequences will be one-hot encoded before sending it to the network.\n"
63 | ]
64 | },
65 | {
66 | "cell_type": "markdown",
67 | "metadata": {},
68 | "source": [
69 | "In the cell below the methods for encoding and decoding the character strings to and from one-hot code is written. Please run this cell to use these."
70 | ]
71 | },
72 | {
73 | "cell_type": "code",
74 | "execution_count": null,
75 | "metadata": {},
76 | "outputs": [],
77 | "source": [
78 | "class CharacterTable(object):\n",
79 | " \"\"\"Given a set of characters:\n",
80 | " + Encode them to a one-hot integer representation\n",
81 | " + Decode the one-hot or integer representation to their character output\n",
82 | " + Decode a vector of probabilities to their character output\n",
83 | " \"\"\"\n",
84 | " def __init__(self, chars):\n",
85 | " \"\"\"Initialize character table.\n",
86 | "\n",
87 | " # Arguments\n",
88 | " chars: Characters that can appear in the input.\n",
89 | " \"\"\"\n",
90 | " self.chars = sorted(set(chars))\n",
91 | " self.char_indices = dict((c, i) for i, c in enumerate(self.chars))\n",
92 | " self.indices_char = dict((i, c) for i, c in enumerate(self.chars))\n",
93 | "\n",
94 | " def encode(self, C, num_rows):\n",
95 | " \"\"\"One-hot encode given string C.\n",
96 | "\n",
97 | " # Arguments\n",
98 | " C: string, to be encoded.\n",
99 | " num_rows: Number of rows in the returned one-hot encoding. This is\n",
100 | " used to keep the # of rows for each data the same.\n",
101 | " \"\"\"\n",
102 | " x = np.zeros((num_rows, len(self.chars)))\n",
103 | " for i, c in enumerate(C):\n",
104 | " x[i, self.char_indices[c]] = 1\n",
105 | " return x\n",
106 | "\n",
107 | " def decode(self, x, calc_argmax=True):\n",
108 | " \"\"\"Decode the given vector or 2D array to their character output.\n",
109 | "\n",
110 | " # Arguments\n",
111 | " x: A vector or a 2D array of probabilities or one-hot representations;\n",
112 | " or a vector of character indices (used with `calc_argmax=False`).\n",
113 | " calc_argmax: Whether to find the character index with maximum\n",
114 | " probability, defaults to `True`.\n",
115 | " \"\"\"\n",
116 | " if calc_argmax:\n",
117 | " x = x.argmax(axis=-1)\n",
118 | " return ''.join(self.indices_char[x] for x in x)\n",
119 | "\n",
120 | "\n",
121 | "class colors:\n",
122 | " ok = '\\033[92m'\n",
123 | " fail = '\\033[91m'\n",
124 | " close = '\\033[0m'\n",
125 | "\n"
126 | ]
127 | },
128 | {
129 | "cell_type": "markdown",
130 | "metadata": {},
131 | "source": [
132 | "**Step 1:** We will generate a number of example sequences and corresponding expected results for the addition in the cell below. `TRAINING_SIZE` determines the total number of examples generated. `DIGITS` represents maximum number of digits in each number in the addition. "
133 | ]
134 | },
135 | {
136 | "cell_type": "code",
137 | "execution_count": null,
138 | "metadata": {},
139 | "outputs": [],
140 | "source": [
141 | "# Parameters for the model and dataset.\n",
142 | "TRAINING_SIZE = 50000\n",
143 | "DIGITS = 3\n",
144 | "REVERSE = True\n",
145 | "\n",
146 | "# Maximum length of input is 'int + int' (e.g., '345+678'). Maximum length of\n",
147 | "# int is DIGITS.\n",
148 | "MAXLEN = DIGITS + 1 + DIGITS\n",
149 | "\n",
150 | "# All the numbers, plus sign and space for padding.\n",
151 | "chars = '0123456789+ '\n",
152 | "ctable = CharacterTable(chars)\n",
153 | "\n",
154 | "questions = []\n",
155 | "expected = []\n",
156 | "seen = set()\n",
157 | "print('Generating data...')\n",
158 | "while len(questions) < TRAINING_SIZE:\n",
159 | " f = lambda: int(''.join(np.random.choice(list('0123456789'))\n",
160 | " for i in range(np.random.randint(1, DIGITS + 1))))\n",
161 | " a, b = f(), f()\n",
162 | " # Skip any addition questions we've already seen\n",
163 | " # Also skip any such that x+Y == Y+x (hence the sorting).\n",
164 | " key = tuple(sorted((a, b)))\n",
165 | " if key in seen:\n",
166 | " continue\n",
167 | " seen.add(key)\n",
168 | " # Pad the data with spaces such that it is always MAXLEN.\n",
169 | " q = '{}+{}'.format(a, b)\n",
170 | " query = q + ' ' * (MAXLEN - len(q))\n",
171 | " ans = str(a + b)\n",
172 | " # Answers can be of maximum size DIGITS + 1.\n",
173 | " ans += ' ' * (DIGITS + 1 - len(ans))\n",
174 | " if REVERSE:\n",
175 | " # Reverse the query, e.g., '12+345 ' becomes ' 543+21'. (Note the\n",
176 | " # space used for padding.)\n",
177 | " query = query[::-1]\n",
178 | " questions.append(query)\n",
179 | " expected.append(ans)\n",
180 | "print('Total addition questions:', len(questions))\n",
181 | "\n"
182 | ]
183 | },
184 | {
185 | "cell_type": "code",
186 | "execution_count": null,
187 | "metadata": {},
188 | "outputs": [],
189 | "source": [
190 | "# TODO: use this cell to figure out what's going on in the data: look at the form of questions and expected"
191 | ]
192 | },
193 | {
194 | "cell_type": "markdown",
195 | "metadata": {},
196 | "source": [
197 | "**Step 2:** In this next step you will encode the examples above and create training and validation data to train the recurrent neural network on.\n",
198 | "\n",
199 | "Remember, if the question isn't maximum length then it will padded with spaces. "
200 | ]
201 | },
202 | {
203 | "cell_type": "code",
204 | "execution_count": null,
205 | "metadata": {},
206 | "outputs": [],
207 | "source": [
208 | "print('Vectorization...')\n",
209 | "x = np.zeros((len(questions), MAXLEN, len(chars)), dtype=np.bool)\n",
210 | "y = np.zeros((len(questions), DIGITS + 1, len(chars)), dtype=np.bool)\n",
211 | "\n",
212 | "# To do: in x and y generate and store the encoded forms of the questions above, \n",
213 | "# use the encode function in the class CharacterTable\n",
214 | "\n",
215 | "\n",
216 | "# Shuffle (x, y) in unison as the later parts of x will almost all be larger\n",
217 | "# digits.\n",
218 | "indices = np.arange(len(y))\n",
219 | "np.random.shuffle(indices)\n",
220 | "x = x[indices]\n",
221 | "y = y[indices]\n",
222 | "\n",
223 | "# TODO: Explicitly set apart 10% for validation data that we never train over.\n",
224 | "\n",
225 | "\n",
226 | "# Check the shape of your data"
227 | ]
228 | },
229 | {
230 | "cell_type": "markdown",
231 | "metadata": {},
232 | "source": [
233 | "**Step 3:** Now the model will be created to test a recurrent neural network architecture.\n",
234 | "***\n",
235 | "\n",
236 | "The architecture has 2 main components: the encoder and the decoder, both of which exploit abilities to learn sequential relations between data.\n",
237 | "\n",
238 | "Encoder:\n",
239 | "Here we will use a recurrent unit (simpleRNN or LSTM) as the encoder. The goal of the encoder is to take in the clunky one-hot encoded input data say '53+21 ' character by character and encode it into a dense representation. '\n",
240 | "\n",
241 | "Decoder: \n",
242 | "For this we will use another recurrent unit of choice. Here each time delayed unit (rolling out the recurrent connection) will obtain the encoded representation from earlier. The output required has a maximum size of `DIGITS+1`. Therefore as input we are going to feed in the dense representation as many times, using `RepeatVector` \"layer\". \n",
243 | "\n",
244 | "The loss function used is therefore a categorical crossentropy loss function. \n",
245 | "\n",
246 | "Finally, the output of the decoder is a temporal sequence of probability outputs which returns the probability about what digit (or plus sign or space) the character can respresent. For this we apply a dense layer to each temporal slice of the output of the decoder, i.e., for each character of the sequence we generate a probability for that digit. This is done by using the `TimeDistributed` layer."
247 | ]
248 | },
249 | {
250 | "cell_type": "code",
251 | "execution_count": null,
252 | "metadata": {},
253 | "outputs": [],
254 | "source": [
255 | "# Try replacing GRU, or SimpleRNN.\n",
256 | "# RNN = layers.LSTM\n",
257 | "RNN = layers.SimpleRNN\n",
258 | "HIDDEN_SIZE = 128\n",
259 | "BATCH_SIZE = 128\n",
260 | "LAYERS = 1\n",
261 | "\n",
262 | "print('Build model...')\n",
263 | "model = Sequential()\n",
264 | "# TODO: \n",
265 | "# \"Encode\" the input sequence using an RNN, producing an output of HIDDEN_SIZE.\n",
266 | "# Note: In a situation where your input sequences have a variable length,\n",
267 | "# use input_shape=(None, num_feature).\n",
268 | "\n",
269 | "\n",
270 | "\n",
271 | "# As the decoder RNN's input, repeatedly provide with the last output of\n",
272 | "# RNN for each time step. Repeat 'DIGITS + 1' times as that's the maximum\n",
273 | "# length of output, e.g., when DIGITS=3, max output is 999+999=1998.\n",
274 | "model.add(layers.RepeatVector(DIGITS + 1))\n",
275 | "\n",
276 | "# Build the Decoder : \n",
277 | "# TODO: \n",
278 | "# The decoder RNN could be multiple layers stacked or a single layer.\n",
279 | "# Set return_sequences to True, return not only the last output but\n",
280 | "# all the outputs so far in the form of (num_samples, timesteps,\n",
281 | "# output_dim). This is necessary as TimeDistributed in the below expects\n",
282 | "# the first dimension to be the timesteps.\n",
283 | "\n",
284 | "\n",
285 | "\n",
286 | "\n",
287 | "# Apply a dense layer to the every temporal slice of an input. For each of step\n",
288 | "# of the output sequence, decide which character should be chosen.\n",
289 | "model.add(layers.TimeDistributed(layers.Dense(len(chars), activation='softmax')))\n",
290 | "model.compile(loss='categorical_crossentropy',\n",
291 | " optimizer='adam',\n",
292 | " metrics=['accuracy'])\n",
293 | "model.summary()\n",
294 | "\n"
295 | ]
296 | },
297 | {
298 | "cell_type": "markdown",
299 | "metadata": {},
300 | "source": [
301 | "**Step 4:** Here we actually train the network using the `fit()` method. Then the code se"
302 | ]
303 | },
304 | {
305 | "cell_type": "code",
306 | "execution_count": null,
307 | "metadata": {},
308 | "outputs": [],
309 | "source": [
310 | "# Train the model each generation and show predictions against the validation\n",
311 | "# dataset.\n",
312 | "for iteration in range(1, 200):\n",
313 | " print()\n",
314 | " print('-' * 50)\n",
315 | " print('Iteration', iteration)\n",
316 | " # TODO: run the model fit method for one epoch at a time\n",
317 | "\n",
318 | " # Select 10 samples from the validation set at random so we can visualize\n",
319 | " # errors.\n",
320 | " for i in range(10):\n",
321 | " ind = np.random.randint(0, len(x_val))\n",
322 | " rowx, rowy = x_val[np.array([ind])], y_val[np.array([ind])]\n",
323 | " preds = model.predict_classes(rowx, verbose=0)\n",
324 | " q = ctable.decode(rowx[0])\n",
325 | " correct = ctable.decode(rowy[0])\n",
326 | " guess = ctable.decode(preds[0], calc_argmax=False)\n",
327 | " print('Q', q[::-1] if REVERSE else q, end=' ')\n",
328 | " print('T', correct, end=' ')\n",
329 | " if correct == guess:\n",
330 | " print(colors.ok + '☑' + colors.close, end=' ')\n",
331 | " else:\n",
332 | " print(colors.fail + '☒' + colors.close, end=' ')\n",
333 | " print(guess)"
334 | ]
335 | },
336 | {
337 | "cell_type": "code",
338 | "execution_count": null,
339 | "metadata": {},
340 | "outputs": [],
341 | "source": []
342 | }
343 | ],
344 | "metadata": {
345 | "kernelspec": {
346 | "display_name": "Python 3",
347 | "language": "python",
348 | "name": "python3"
349 | },
350 | "language_info": {
351 | "codemirror_mode": {
352 | "name": "ipython",
353 | "version": 3
354 | },
355 | "file_extension": ".py",
356 | "mimetype": "text/x-python",
357 | "name": "python",
358 | "nbconvert_exporter": "python",
359 | "pygments_lexer": "ipython3",
360 | "version": "3.7.4"
361 | }
362 | },
363 | "nbformat": 4,
364 | "nbformat_minor": 2
365 | }
366 |
--------------------------------------------------------------------------------
/in_class_notebooks/SVM/in_class_SVM-release.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Hands-On Support Vector Machines\n",
8 | "***\n",
9 | "\n",
10 | "In this notebook we'll explore the details of the Soft-Margin SVM and look at how the choice of tuning parameters affects the learned models. We'll also look at kernel SVMs for non-linearly separable and methods for choosing and visualizing good hyperparameters. \n",
11 | "\n",
12 | "To understand visually how SVMs work with some advanced math look at [this paper](http://www.robots.ox.ac.uk/~cvrg/bennett00duality.pdf).\n",
13 | "\n",
14 | "Other info on the practicalities of SVM:\n",
15 | "\n",
16 | "[Very thorough guide](https://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf) on the practical use and tuning of SVMs, from the creators of libsvm.\n",
17 | "\n",
18 | "[The time and space complexity of SVM in Scikit-Learn](https://scikit-learn.org/stable/modules/svm.html#complexity)\n",
19 | "\n",
20 | "Some advanced things:\n",
21 | "\n",
22 | "[The optimization algorithm used in Scikit-Learn for SVM](http://www.jmlr.org/papers/volume6/fan05a/fan05a.pdf)\n",
23 | "\n",
24 | "[The SMO Algorithm](http://cs229.stanford.edu/materials/smo.pdf)\n",
25 | "\n",
26 | "\n",
27 | "**Note**: There are some helper functions at the bottom of this notebook. Scroll down and execute those cells before continuing. \n"
28 | ]
29 | },
30 | {
31 | "cell_type": "code",
32 | "execution_count": null,
33 | "metadata": {},
34 | "outputs": [],
35 | "source": [
36 | "import numpy as np"
37 | ]
38 | },
39 | {
40 | "cell_type": "markdown",
41 | "metadata": {},
42 | "source": [
43 | "### Part 1: Soft-Margin SVM Details\n",
44 | "***\n",
45 | "\n",
46 | "Suppose you have the following labeled data set (assume here that red corresponds to $y=1$ and blue corresponds to $y = -1$) and suppose the SVM decision boundary is defined by the weights ${\\bf w} = [-1/4, ~ 1/4]^T$ and $b = -1/4$. \n"
47 | ]
48 | },
49 | {
50 | "cell_type": "code",
51 | "execution_count": null,
52 | "metadata": {},
53 | "outputs": [],
54 | "source": [
55 | "# Data and Labels \n",
56 | "X = np.array([[1,8],[7,2],[6,-1],[-5,0], [-5,1], [-5,2],[6,3],[6,1],[5,2]])\n",
57 | "y = np.array([1,-1,-1,1,-1,1,1,-1,-1])\n",
58 | "\n",
59 | "# Support vector parameters \n",
60 | "w, b = np.array([-1/4, 1/4]), -1/4\n",
61 | "# w, b = np.array([-1/2, 1/2]), -1/2\n",
62 | "\n",
63 | "# Plot the data and support vector boundaries \n",
64 | "linear_plot(X, y, w=w, b=b)"
65 | ]
66 | },
67 | {
68 | "cell_type": "markdown",
69 | "metadata": {},
70 | "source": [
71 | "**Part A**: What is the margin of this particular SVM? "
72 | ]
73 | },
74 | {
75 | "cell_type": "markdown",
76 | "metadata": {},
77 | "source": [
78 | "**Solution**: The margin is given by $M = 2 / \\|{\\bf w}\\|$, or $1/ \\|{\\bf w}\\|$ for half the margin. "
79 | ]
80 | },
81 | {
82 | "cell_type": "code",
83 | "execution_count": null,
84 | "metadata": {},
85 | "outputs": [],
86 | "source": [
87 | "# TODO: compute margin according to the formula and print\n",
88 | "\n",
89 | "margin=0\n",
90 | "\n",
91 | "print(\"M = {:.3f}\".format(margin))"
92 | ]
93 | },
94 | {
95 | "cell_type": "markdown",
96 | "metadata": {},
97 | "source": [
98 | "**Part B**: Which training examples are the support vectors? "
99 | ]
100 | },
101 | {
102 | "cell_type": "markdown",
103 | "metadata": {},
104 | "source": [
105 | "**Solution**: From the data it is a fair assumption that red points above the decision boundary are correctly classified and blue points below the decision boundary are correctly classified. From this we see that $(-5,0)$ and $(5,2)$ are the support vectors. "
106 | ]
107 | },
108 | {
109 | "cell_type": "markdown",
110 | "metadata": {},
111 | "source": [
112 | "**Part C**: Which training examples have nonzero slack? "
113 | ]
114 | },
115 | {
116 | "cell_type": "markdown",
117 | "metadata": {},
118 | "source": []
119 | },
120 | {
121 | "cell_type": "markdown",
122 | "metadata": {},
123 | "source": [
124 | "**Part D**: Compute the slack $\\xi_i$ associated with the misclassified points. Do these values jive with the plot of the data and the support vector boundaries? "
125 | ]
126 | },
127 | {
128 | "cell_type": "markdown",
129 | "metadata": {},
130 | "source": [
131 | "**How to find the slack:**: We will use the definition $(1/\\|{\\bf w}\\|)y_i({\\bf w}^T{\\bf x}_i + b ) = (1/\\|{\\bf w}\\| - \\xi_i)$ to compute the slack. Solving for $\\xi_i$ gives \n",
132 | "\n",
133 | "$$\n",
134 | "\\xi_i = (1 - y_i({\\bf w}^T{\\bf x}_i + b ))(1/\\| {\\bf w} \\| ) \n",
135 | "$$"
136 | ]
137 | },
138 | {
139 | "cell_type": "code",
140 | "execution_count": null,
141 | "metadata": {},
142 | "outputs": [],
143 | "source": [
144 | "print(\"Margin: {}\".format(1/np.linalg.norm(w)))\n",
145 | "\n",
146 | "# Indices of the points with slack are 4 and 6\n",
147 | "# TODO: compute slack according to the formula above for the two points and subsequently print it\n",
148 | "\n",
149 | "xi4 = 0\n",
150 | "print(\"xi_4 = {:.3f}\".format(xi4))\n",
151 | "\n",
152 | "xi6 = 0\n",
153 | "print(\"xi_6 = {:.3f}\".format(xi6))"
154 | ]
155 | },
156 | {
157 | "cell_type": "markdown",
158 | "metadata": {},
159 | "source": [
160 | "### Part 2: The Margin vs Slack \n",
161 | "***\n",
162 | "\n",
163 | "In this problem we'll figure out how to fit linear SVM models to data using sklearn. Consider the data shown below. \n"
164 | ]
165 | },
166 | {
167 | "cell_type": "code",
168 | "execution_count": null,
169 | "metadata": {},
170 | "outputs": [],
171 | "source": [
172 | "X, y = part2data()\n",
173 | "linear_plot(X, y)"
174 | ]
175 | },
176 | {
177 | "cell_type": "code",
178 | "execution_count": null,
179 | "metadata": {},
180 | "outputs": [],
181 | "source": []
182 | },
183 | {
184 | "cell_type": "markdown",
185 | "metadata": {},
186 | "source": [
187 | "**Part A**: Let's fit a linear Soft-Margin SVM to the data above. For SVMs with a linear kernel we'll use the [`LinearSVM`](http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html) method from sklearn's `svm` module. Go now and look at the documentation. \n",
188 | "\n",
189 | "Recall that the primal objective function for the linear kernel SVM is as follows \n",
190 | "\n",
191 | "\n",
192 | "$$\n",
193 | "\\min_{{\\bf w}, b, {\\bf \\xi}} \\frac{1}{2}\\|{\\bf w}\\|^2 + C \\sum_{i=1}^m \\xi_i^p\n",
194 | "$$\n",
195 | "\n",
196 | "The two optional parameters in `LinearSVM` that we'll be most concerned with are `C`, the hyperparameter weighting the slackness contribution to the primal objective function, and `loss`, which determines the exponent on the slack variables in the sum. \n",
197 | "\n",
198 | "Write some code below to train a linear SVM with $C=1$ and $p=1$, get the computed weight vector and bias, and the plot the resulting model. "
199 | ]
200 | },
201 | {
202 | "cell_type": "code",
203 | "execution_count": null,
204 | "metadata": {},
205 | "outputs": [],
206 | "source": [
207 | "from sklearn.svm import LinearSVC\n",
208 | "\n",
209 | "# TODO: Train the model and get the parameters\n",
210 | "\n",
211 | "linear_plot(X, y, w=w, b=b)"
212 | ]
213 | },
214 | {
215 | "cell_type": "markdown",
216 | "metadata": {},
217 | "source": [
218 | "**Part B**: Experiment with different values of `C`. How does the choice of `C` affect the nature of the decision boundary and the associated margin? "
219 | ]
220 | },
221 | {
222 | "cell_type": "code",
223 | "execution_count": null,
224 | "metadata": {},
225 | "outputs": [],
226 | "source": [
227 | "from sklearn.svm import LinearSVC\n",
228 | "\n",
229 | "# TODO: Train the model and get the parameters\n",
230 | "\n",
231 | "linear_plot(X, y, w=w, b=b)"
232 | ]
233 | },
234 | {
235 | "cell_type": "code",
236 | "execution_count": null,
237 | "metadata": {},
238 | "outputs": [],
239 | "source": [
240 | "from sklearn.svm import LinearSVC\n",
241 | "\n",
242 | "# TODO: Train the model and get the parameters\n",
243 | "\n",
244 | "linear_plot(X, y, w=w, b=b)"
245 | ]
246 | },
247 | {
248 | "cell_type": "markdown",
249 | "metadata": {},
250 | "source": [
251 | "**Part C**: Set `C=3` and compare the results you get when using the `hinge` vs the `squared_hinge` values for the `loss` parameter. Explain your observations. "
252 | ]
253 | },
254 | {
255 | "cell_type": "code",
256 | "execution_count": null,
257 | "metadata": {},
258 | "outputs": [],
259 | "source": [
260 | "from sklearn.svm import LinearSVC\n",
261 | "\n",
262 | "# TODO: Train the model and get the parameters, pay attention to the loss parameter\n",
263 | "\n",
264 | "linear_plot(X, y, w=w, b=b)"
265 | ]
266 | },
267 | {
268 | "cell_type": "code",
269 | "execution_count": null,
270 | "metadata": {},
271 | "outputs": [],
272 | "source": []
273 | },
274 | {
275 | "cell_type": "markdown",
276 | "metadata": {},
277 | "source": [
278 | "**Part D**: In general, how does the choice of `C` affect the bias and variance of the model? "
279 | ]
280 | },
281 | {
282 | "cell_type": "markdown",
283 | "metadata": {},
284 | "source": []
285 | },
286 | {
287 | "cell_type": "markdown",
288 | "metadata": {},
289 | "source": [
290 | "### Part 3: Nonlinear SVM, Parameter Tuning, Accuracy, and Cross-Validation \n",
291 | "***\n",
292 | "\n",
293 | "Any support vector machine classifier will have at least one parameter that needs to be tuned based on the training data. The guaranteed parameter is the $C$ associated with the slack variables in the primal objective function, i.e. \n",
294 | "\n",
295 | "$$\n",
296 | "\\min_{{\\bf w}, b, {\\bf \\xi}} \\frac{1}{2}\\|{\\bf w}\\|^2 + C \\sum_{i=1}^m \\xi_i\n",
297 | "$$\n",
298 | "\n",
299 | "If you use a kernel fancier than the linear kernel then you will likely have other parameters as well. For instance in the polynomial kernel $K({\\bf x}, {\\bf z}) = ({\\bf x}^T{\\bf z} + c)^d$ you have to select the shift $c$ and the polynomial degree $d$. Similarly the rbf kernel\n",
300 | "\n",
301 | "$$\n",
302 | "K({\\bf x}, {\\bf z}) = \\exp\\left[-\\gamma\\|{\\bf x} - {\\bf z}\\|^2\\right]\n",
303 | "$$\n",
304 | "\n",
305 | "has one tuning parameter, namely $\\gamma$, which controls how fast the similarity measure drops off with distance between ${\\bf x}$ and ${\\bf z}$. \n",
306 | "\n",
307 | "For our examples we'll consider the rbf kernel, which gives us two parameters to tune, namely $C$ and $\\gamma$. \n",
308 | "\n",
309 | "Consider the following two dimensional data"
310 | ]
311 | },
312 | {
313 | "cell_type": "code",
314 | "execution_count": null,
315 | "metadata": {},
316 | "outputs": [],
317 | "source": [
318 | "X, y = part3data(N=300, seed=1235)\n",
319 | "nonlinear_plot(X, y)"
320 | ]
321 | },
322 | {
323 | "cell_type": "markdown",
324 | "metadata": {},
325 | "source": [
326 | "**Part A**: We can use the method [SVC](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) from sklearn's `svm` module to fit an SVM with a nonlinear kernel to the data. Go now and look at the documentation. Note that we pass the `kernel=\"rbf\"` parameter to use the RBF kernel. The other two parameters we'll be concerned with are `C` and the RBF parameter `gamma`. \n",
327 | "\n",
328 | "Write some code to fit an SVM with RBF kernel to the data and plot the results. Use the parameter values `C=1` and `gamma=1`. "
329 | ]
330 | },
331 | {
332 | "cell_type": "code",
333 | "execution_count": null,
334 | "metadata": {},
335 | "outputs": [],
336 | "source": [
337 | "from sklearn.svm import SVC\n",
338 | "\n",
339 | "# TODO: create a non linear SVM classifier with rbf kernel and experment with C=1 and gamma=1\n",
340 | "\n",
341 | "\n",
342 | "nonlinear_plot(X, y, nlsvm)"
343 | ]
344 | },
345 | {
346 | "cell_type": "markdown",
347 | "metadata": {},
348 | "source": [
349 | "**Part B**: In this part we'll use cross-validation to estimate the validation accuracy achieved by our model. Experiment with the values of the hyperparameters to see if you can get a good validation accuracy. How do the choice of `C` and `gamma` affect the resulting decision boundary? \n"
350 | ]
351 | },
352 | {
353 | "cell_type": "code",
354 | "execution_count": null,
355 | "metadata": {},
356 | "outputs": [],
357 | "source": [
358 | "from sklearn.model_selection import cross_val_score\n",
359 | "\n",
360 | "\n",
361 | "# TODO: create a non linear SVM classifier with rbf kernel and experment with different C and gamma values\n",
362 | "\n",
363 | "\n",
364 | "print(\"cross-val mean-accuracy: {:.3f}\".format(np.mean(scores)))\n",
365 | "\n",
366 | "nonlinear_plot(X, y, nlsvm)"
367 | ]
368 | },
369 | {
370 | "cell_type": "markdown",
371 | "metadata": {},
372 | "source": [
373 | "**Part C**: (On your own after class)\n",
374 | "\n",
375 | "Experiment with the different classifiers produced by different kernels.\n"
376 | ]
377 | },
378 | {
379 | "cell_type": "markdown",
380 | "metadata": {},
381 | "source": []
382 | },
383 | {
384 | "cell_type": "markdown",
385 | "metadata": {},
386 | "source": [
387 | "### Part 4: Automating the Parameter Search \n",
388 | "***\n",
389 | "\n",
390 | "On the previous problem we were able to choose some OK parameters just by hand-tuning. But in real life (where time is money) it would be better to do something a little more automated. One common thing to do is a **grid-search** over a predefined range of the parameters. In this case you will loop over all possible combinations of parameters, estimate the accuracy of your model using K-Folds cross-validation, and then choose the parameter combination that produces the highest validation accuracy. \n",
391 | "\n",
392 | "**Part A**: Below is an experiment where we search over a logarithmic range between $2^{-5}$ and $2^{5}$ for $C$ and a range between $2^{-5}$ and $2^{5}$ for $\\gamma$. For the accuracy measure we use K-Folds CV with $K=3$."
393 | ]
394 | },
395 | {
396 | "cell_type": "code",
397 | "execution_count": null,
398 | "metadata": {},
399 | "outputs": [],
400 | "source": [
401 | "from sklearn.model_selection import cross_val_score, GridSearchCV\n",
402 | "\n",
403 | "# TODO: specify range of C and gamma and form a param_grid. \n",
404 | "c_range = 0\n",
405 | "g_range = 0\n",
406 | "param_grid = dict(gamma=g_range, C=c_range)\n",
407 | "\n",
408 | "# TODO : create GridSearchCV object with the current SVC classifier with rbf kernel as your base "
409 | ]
410 | },
411 | {
412 | "cell_type": "markdown",
413 | "metadata": {},
414 | "source": [
415 | "**Part B**: The following function will plot a heat-map of the cross-validation accuracies for each combination of parameters. Which combination looks the best?"
416 | ]
417 | },
418 | {
419 | "cell_type": "code",
420 | "execution_count": null,
421 | "metadata": {},
422 | "outputs": [],
423 | "source": [
424 | "# TODO: use the plotSearchGrid helper function to plot the grid of searched points and corresponding model accuracy \n",
425 | "# (argument is the GridSearchCV object)"
426 | ]
427 | },
428 | {
429 | "cell_type": "code",
430 | "execution_count": null,
431 | "metadata": {},
432 | "outputs": [],
433 | "source": [
434 | "grid.best_params_"
435 | ]
436 | },
437 | {
438 | "cell_type": "code",
439 | "execution_count": null,
440 | "metadata": {},
441 | "outputs": [],
442 | "source": []
443 | },
444 | {
445 | "cell_type": "markdown",
446 | "metadata": {},
447 | "source": [
448 | "**Part C**: The GridSearchCV object scores, among other things, the best combination of parameters as well as the cross-validation accuracy achieved with those parameters. Print those quantities for our model. "
449 | ]
450 | },
451 | {
452 | "cell_type": "code",
453 | "execution_count": null,
454 | "metadata": {},
455 | "outputs": [],
456 | "source": [
457 | "# TODO: plot the decision boundaries for the best classifier using `nonlinear_plot` helper funciton\n",
458 | "\n"
459 | ]
460 | },
461 | {
462 | "cell_type": "code",
463 | "execution_count": null,
464 | "metadata": {},
465 | "outputs": [],
466 | "source": []
467 | },
468 | {
469 | "cell_type": "code",
470 | "execution_count": null,
471 | "metadata": {},
472 | "outputs": [],
473 | "source": []
474 | },
475 | {
476 | "cell_type": "code",
477 | "execution_count": null,
478 | "metadata": {},
479 | "outputs": [],
480 | "source": []
481 | },
482 | {
483 | "cell_type": "markdown",
484 | "metadata": {},
485 | "source": [
486 | "
\n",
487 | "
\n",
488 | "
\n",
489 | "\n",
490 | "### Helper Functions\n",
491 | "***"
492 | ]
493 | },
494 | {
495 | "cell_type": "code",
496 | "execution_count": null,
497 | "metadata": {},
498 | "outputs": [],
499 | "source": [
500 | "import numpy as np\n",
501 | "from sklearn.datasets import make_blobs\n",
502 | "from matplotlib.colors import Normalize\n",
503 | "import matplotlib.pyplot as plt\n",
504 | "%matplotlib inline\n",
505 | "\n",
506 | "def linear_plot(X, y, w=None, b=None):\n",
507 | " \n",
508 | " mycolors = {\"blue\": \"steelblue\", \"red\": \"#a76c6e\", \"green\": \"#6a9373\"}\n",
509 | " colors = [mycolors[\"red\"] if yi==1 else mycolors[\"blue\"] for yi in y]\n",
510 | " \n",
511 | " # Plot data \n",
512 | " fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(8,8))\n",
513 | " ax.scatter(X[:,0], X[:,1], color=colors, s=150, alpha=0.95, zorder=2)\n",
514 | " \n",
515 | " # Plot boundaries \n",
516 | " lower_left = np.min([np.min(X[:,0]), np.min(X[:,1])])\n",
517 | " upper_right = np.max([np.max(X[:,0]), np.max(X[:,1])])\n",
518 | " gap = .1*(upper_right-lower_left)\n",
519 | " xplot = np.linspace(lower_left-gap, upper_right+gap, 20)\n",
520 | " if w is not None and b is not None: \n",
521 | " ax.plot(xplot, (-b - w[0]*xplot)/w[1], color=\"gray\", lw=2, zorder=1)\n",
522 | " ax.plot(xplot, ( 1 -b - w[0]*xplot)/w[1], color=\"gray\", lw=2, ls=\"--\", zorder=1)\n",
523 | " ax.plot(xplot, (-1 -b - w[0]*xplot)/w[1], color=\"gray\", lw=2, ls=\"--\", zorder=1)\n",
524 | " \n",
525 | " \n",
526 | " ax.set_xlim([lower_left-gap, upper_right+gap])\n",
527 | " ax.set_ylim([lower_left-gap, upper_right+gap])\n",
528 | " \n",
529 | " ax.grid(alpha=0.25)\n",
530 | " \n",
531 | "def part2data():\n",
532 | " \n",
533 | " '''\n",
534 | " X = np.zeros((22,2))\n",
535 | " X[0:10,0] = 1.5*np.random.rand(10) \n",
536 | " X[0:10,1] = 1.5*np.random.rand(10)\n",
537 | " X[10:20,0] = 1.5*np.random.rand(10) + 1.75\n",
538 | " X[10:20,1] = 1.5*np.random.rand(10) + 1\n",
539 | " X[20,0] = 1.5\n",
540 | " X[20,1] = 2.25\n",
541 | " X[21,0] = 1.6\n",
542 | " X[21,1] = 0.25\n",
543 | " \n",
544 | " X = X + np.random.standard_normal(X.shape)*0.5\n",
545 | " '''\n",
546 | " y = np.ones(22)\n",
547 | " y[11:] = -1 \n",
548 | " \n",
549 | " X = np.zeros((22,2))\n",
550 | " X[:11,:] = np.random.randn(11,2)+np.array([0, 1.5])\n",
551 | " X[11:,:] = np.random.randn(11,2)+np.array([3.5,1.5])\n",
552 | " \n",
553 | " return X, y\n",
554 | "\n",
555 | "def part3data(N=100, seed=1235):\n",
556 | " \n",
557 | " np.random.seed(seed)\n",
558 | " \n",
559 | " X = np.random.uniform(-1,1,(N,2))\n",
560 | " y = np.array([1 if y-x > 0 else -1 for (x,y) in zip(X[:,0]**2 * np.sin(2*np.pi*X[:,0]), X[:,1])])\n",
561 | " X = X + np.random.normal(0,.1,(N,2))\n",
562 | " \n",
563 | " return X, y\n",
564 | "\n",
565 | "def nonlinear_plot(X, y, clf=None): \n",
566 | " \n",
567 | " mycolors = {\"blue\": \"steelblue\", \"red\": \"#a76c6e\", \"green\": \"#6a9373\"}\n",
568 | " \n",
569 | " fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(10,10))\n",
570 | " \n",
571 | " colors = [mycolors[\"red\"] if yi==1 else mycolors[\"blue\"] for yi in y]\n",
572 | " ax.scatter(X[:,0],X[:,1], marker='o', color=colors, s=100, alpha=0.5)\n",
573 | " \n",
574 | " ax.arrow(-1.25,0,2.5,0, head_length=0.05, head_width=0.05, fc=\"gray\", ec=\"gray\", lw=2, alpha=0.25)\n",
575 | " ax.arrow(0,-1.25,0,2.5, head_length=0.05, head_width=0.05, fc=\"gray\", ec=\"gray\", lw=2, alpha=0.25)\n",
576 | " z = np.linspace(0.25,3.5,10)\n",
577 | " \n",
578 | " ax.set_xlim([-1.50,1.50])\n",
579 | " ax.set_ylim([-1.50,1.50])\n",
580 | " \n",
581 | " ax.spines['top'].set_visible(False)\n",
582 | " ax.spines['right'].set_visible(False)\n",
583 | " ax.spines['bottom'].set_visible(False)\n",
584 | " ax.spines['left'].set_visible(False)\n",
585 | " plt.xticks([], fontsize=16)\n",
586 | " plt.yticks([], fontsize=16)\n",
587 | " \n",
588 | "\n",
589 | " if clf: \n",
590 | " \n",
591 | " clf.fit(X,y)\n",
592 | "\n",
593 | " x_min = X[:, 0].min()+.00\n",
594 | " x_max = X[:, 0].max()-.00\n",
595 | " y_min = X[:, 1].min()+.00\n",
596 | " y_max = X[:, 1].max()-.00\n",
597 | "\n",
598 | " colors = [mycolors[\"red\"] if yi==1 else mycolors[\"blue\"] for yi in y]\n",
599 | "\n",
600 | " XX, YY = np.mgrid[x_min:x_max:200j, y_min:y_max:200j]\n",
601 | " Z = clf.decision_function(np.c_[XX.ravel(), YY.ravel()])\n",
602 | "\n",
603 | " # Put the result into a color plot\n",
604 | " Z = Z.reshape(XX.shape)\n",
605 | " plt.contour(XX, YY, Z, colors=[mycolors[\"blue\"], \"gray\", mycolors[\"red\"]], linestyles=['--', '-', '--'],\n",
606 | " levels=[-1.0, 0, 1.0], linewidths=[2,2,2], alpha=0.9)\n",
607 | " \n",
608 | "\n",
609 | "class MidpointNormalize(Normalize):\n",
610 | "\n",
611 | " def __init__(self, vmin=None, vmax=None, midpoint=None, clip=False):\n",
612 | " self.midpoint = midpoint\n",
613 | " Normalize.__init__(self, vmin, vmax, clip)\n",
614 | "\n",
615 | " def __call__(self, value, clip=None):\n",
616 | " x, y = [self.vmin, self.midpoint, self.vmax], [0, 0.5, 1]\n",
617 | " return np.ma.masked_array(np.interp(value, x, y))\n",
618 | " \n",
619 | "def plotSearchGrid(grid):\n",
620 | " \n",
621 | " scores = [x for x in grid.cv_results_[\"mean_test_score\"]]\n",
622 | " scores = np.array(scores).reshape(len(grid.param_grid[\"C\"]), len(grid.param_grid[\"gamma\"]))\n",
623 | "\n",
624 | " \n",
625 | " plt.figure(figsize=(10, 8))\n",
626 | " plt.subplots_adjust(left=.2, right=0.95, bottom=0.15, top=0.95)\n",
627 | " plt.imshow(scores, interpolation='nearest', cmap=plt.cm.hot,\n",
628 | " norm=MidpointNormalize(vmin=0.2, midpoint=0.92))\n",
629 | " plt.xlabel('gamma')\n",
630 | " plt.ylabel('C')\n",
631 | " plt.colorbar()\n",
632 | " plt.xticks(np.arange(len(grid.param_grid[\"gamma\"])), grid.param_grid[\"gamma\"], rotation=45)\n",
633 | " plt.yticks(np.arange(len(grid.param_grid[\"C\"])), grid.param_grid[\"C\"])\n",
634 | " plt.title('Validation accuracy')\n",
635 | " plt.show()\n",
636 | "\n",
637 | "from IPython.core.display import HTML\n",
638 | "HTML(\"\"\"\n",
639 | "\n",
642 | "\"\"\")"
643 | ]
644 | },
645 | {
646 | "cell_type": "code",
647 | "execution_count": null,
648 | "metadata": {},
649 | "outputs": [],
650 | "source": []
651 | }
652 | ],
653 | "metadata": {
654 | "kernelspec": {
655 | "display_name": "Python 3",
656 | "language": "python",
657 | "name": "python3"
658 | },
659 | "language_info": {
660 | "codemirror_mode": {
661 | "name": "ipython",
662 | "version": 3
663 | },
664 | "file_extension": ".py",
665 | "mimetype": "text/x-python",
666 | "name": "python",
667 | "nbconvert_exporter": "python",
668 | "pygments_lexer": "ipython3",
669 | "version": "3.7.4"
670 | }
671 | },
672 | "nbformat": 4,
673 | "nbformat_minor": 2
674 | }
675 |
--------------------------------------------------------------------------------
/in_class_notebooks/Unsupervised_learning/in_class_unsupervised_learning-release.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Hands-On Unsupervised Learning Exercises\n",
8 | "***\n",
9 | "\n",
10 | "We will look at exercises to identify important features to get better regression results and also perform clustering with K-means clustering algorithm.\n"
11 | ]
12 | },
13 | {
14 | "cell_type": "markdown",
15 | "metadata": {},
16 | "source": [
17 | "## Regression using PCA"
18 | ]
19 | },
20 | {
21 | "cell_type": "code",
22 | "execution_count": null,
23 | "metadata": {},
24 | "outputs": [],
25 | "source": [
26 | "import numpy as np\n",
27 | "import pandas as pd\n",
28 | "from sklearn.preprocessing import scale \n",
29 | "from sklearn import model_selection\n",
30 | "from sklearn.decomposition import PCA\n",
31 | "from sklearn.linear_model import LinearRegression\n",
32 | "from sklearn.metrics import mean_squared_error\n",
33 | "from sklearn.datasets import fetch_openml\n",
34 | "import matplotlib.pyplot as plt\n",
35 | "from mpl_toolkits import mplot3d\n",
36 | "%matplotlib inline"
37 | ]
38 | },
39 | {
40 | "cell_type": "code",
41 | "execution_count": null,
42 | "metadata": {},
43 | "outputs": [],
44 | "source": [
45 | "df = pd.read_csv('hitters.csv').dropna().drop('Unnamed: 0', axis=1)\n",
46 | "dummies = pd.get_dummies(df[['League', 'Division', 'NewLeague']])\n",
47 | "\n",
48 | "y = df.Salary\n",
49 | "\n",
50 | "# Drop the column with the independent variable (Salary), and columns for which we created dummy variables\n",
51 | "X_ = df.drop(['Salary', 'League', 'Division', 'NewLeague'], axis=1).astype('float64')\n",
52 | "\n",
53 | "# Define the feature set X.\n",
54 | "X = pd.concat([X_, dummies[['League_N', 'Division_W', 'NewLeague_N']]], axis=1)"
55 | ]
56 | },
57 | {
58 | "cell_type": "code",
59 | "execution_count": null,
60 | "metadata": {},
61 | "outputs": [],
62 | "source": [
63 | "#TODO: perform PCA on the feature vectors\n",
64 | "\n",
65 | "pca = PCA()\n",
66 | "X_reduced = pca.fit_transform(scale(X))"
67 | ]
68 | },
69 | {
70 | "cell_type": "code",
71 | "execution_count": null,
72 | "metadata": {},
73 | "outputs": [],
74 | "source": [
75 | "pd.DataFrame(pca.components_.T).loc[:4,:5]"
76 | ]
77 | },
78 | {
79 | "cell_type": "code",
80 | "execution_count": null,
81 | "metadata": {},
82 | "outputs": [],
83 | "source": [
84 | "# Split into training and test sets\n",
85 | "X_train, X_test , y_train, y_test = model_selection.train_test_split(X, y, test_size=0.5, random_state=1)\n",
86 | "\n",
87 | "pca = PCA()\n",
88 | "X_reduced = pca.fit_transform(scale(X_train))\n",
89 | "\n",
90 | "# 10-fold CV, with shuffle\n",
91 | "n = len(X_reduced)\n",
92 | "kf_10 = model_selection.KFold( n_splits=10, shuffle=True, random_state=1)\n",
93 | "\n",
94 | "regr = LinearRegression()\n",
95 | "mse = []\n",
96 | "\n",
97 | "# Calculate MSE with only the intercept (no principal components in regression)\n",
98 | "score = -1*model_selection.cross_val_score(regr, np.ones((n,1)), y_train.ravel(), cv=kf_10, scoring='neg_mean_squared_error').mean() \n",
99 | "mse.append(score)\n",
100 | "print('no PCA',mse)\n",
101 | "\n",
102 | "# Calculate MSE using CV for the 19 principle components, adding one component at the time.\n",
103 | "for i in np.arange(1, X.shape[-1]):\n",
104 | " score = -1*model_selection.cross_val_score(regr, X_reduced[:,:i], y_train.ravel(), cv=kf_10, scoring='neg_mean_squared_error').mean()\n",
105 | " mse.append(score)\n",
106 | " \n",
107 | "# Plot results \n",
108 | "plt.plot(np.arange(0,X.shape[-1],1),mse, '-v')\n",
109 | "plt.xlabel('Number of principal components in regression')\n",
110 | "plt.ylabel('MSE')\n",
111 | "plt.title('Salary')\n",
112 | "plt.xlim(xmin=-1)\n",
113 | "plt.xticks(np.arange(0,X.shape[-1], step=1));"
114 | ]
115 | },
116 | {
117 | "cell_type": "markdown",
118 | "metadata": {},
119 | "source": [
120 | "## Visualization and clustering\n",
121 | "\n",
122 | "We are going to explore some dimensionality reduction techniques on the Iris dataset and visualize the low-dimensional representation. The goal is to provide a better understanding of what the dimensionality reduction techniques are doing.\n",
123 | "\n",
124 | "For more resources on data dimensionality reduction refer to:\n",
125 | "\n",
126 | "- [Really nice blogpost detailing dimension reduction and visualization with MNIST](https://colah.github.io/posts/2014-10-Visualizing-MNIST/)\n",
127 | "- [The paper on TSNE, if you're interested in the math](http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf)"
128 | ]
129 | },
130 | {
131 | "cell_type": "markdown",
132 | "metadata": {},
133 | "source": [
134 | "### PCA on Iris dataset:\n",
135 | "Here we will perform Principal Components Analysis on the iris dataset using SKlearn. "
136 | ]
137 | },
138 | {
139 | "cell_type": "code",
140 | "execution_count": null,
141 | "metadata": {},
142 | "outputs": [],
143 | "source": [
144 | "from sklearn.cluster import KMeans\n",
145 | "from sklearn.datasets import load_iris"
146 | ]
147 | },
148 | {
149 | "cell_type": "code",
150 | "execution_count": null,
151 | "metadata": {},
152 | "outputs": [],
153 | "source": [
154 | "# Running this cell will take a few seconds\n",
155 | "\n",
156 | "X,y = load_iris(return_X_y = True)"
157 | ]
158 | },
159 | {
160 | "cell_type": "markdown",
161 | "metadata": {},
162 | "source": [
163 | "#### Next, for easy visualization, we are going to work with a subset of the data.\n",
164 | "\n",
165 | "In the next cell, extract a subset of the data and perform PCA on it. "
166 | ]
167 | },
168 | {
169 | "cell_type": "code",
170 | "execution_count": null,
171 | "metadata": {},
172 | "outputs": [],
173 | "source": [
174 | "train_x, train_y = X[:500,:], y[:500]\n",
175 | "\n",
176 | "#TO DO: perform PCA on the data\n"
177 | ]
178 | },
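{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A possible sketch (one of many): project the subset onto its principal\n",
"# components with scikit-learn. `train_x` comes from the cell above; the\n",
"# name `X_pca` is introduced here for the plotting cells further down.\n",
"pca = PCA(n_components=3)\n",
"X_pca = pca.fit_transform(train_x)\n",
"print(pca.explained_variance_ratio_)"
]
},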
179 | {
180 | "cell_type": "markdown",
181 | "metadata": {},
182 | "source": [
183 | "We will now visualize this data using the two principal dimensions. "
184 | ]
185 | },
186 | {
187 | "cell_type": "code",
188 | "execution_count": null,
189 | "metadata": {},
190 | "outputs": [],
191 | "source": [
192 | "fig = plt.figure(figsize=(18, 16))\n",
193 | "ax = plt.axes()\n",
194 | "\n",
195 | "\n",
196 | "# TO DO: make a scatter plot of the data using the first two components of your \n"
197 | ]
198 | },
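{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Example sketch, assuming `X_pca` is the PCA projection from the sketch above\n",
"# and `train_y` holds the class labels.\n",
"fig, ax = plt.subplots(figsize=(8, 6))\n",
"ax.scatter(X_pca[:, 0], X_pca[:, 1], c=train_y, cmap='viridis', s=50)\n",
"ax.set_xlabel('PC 1')\n",
"ax.set_ylabel('PC 2')\n",
"ax.set_title('Iris projected onto the first two principal components');"
]
},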
199 | {
200 | "cell_type": "markdown",
201 | "metadata": {},
202 | "source": [
203 | "We will now visualize this data using the ***three*** principal dimensions. See, if it gets any better."
204 | ]
205 | },
206 | {
207 | "cell_type": "code",
208 | "execution_count": null,
209 | "metadata": {
210 | "scrolled": false
211 | },
212 | "outputs": [],
213 | "source": [
214 | "\n",
215 | "\n",
216 | "fig = plt.figure(figsize=(18, 16))\n",
217 | "ax = plt.axes(projection='3d')\n",
218 | "\n",
219 | "# TODO: create 3d scatter plot to visualize the projection.\n"
220 | ]
221 | },
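{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Example sketch, assuming `X_pca` has at least three components and `train_y`\n",
"# holds the class labels (both from the sketches above).\n",
"fig = plt.figure(figsize=(10, 8))\n",
"ax = plt.axes(projection='3d')\n",
"ax.scatter3D(X_pca[:, 0], X_pca[:, 1], X_pca[:, 2], c=train_y, cmap='viridis', s=50)\n",
"ax.set_xlabel('PC 1')\n",
"ax.set_ylabel('PC 2')\n",
"ax.set_zlabel('PC 3');"
]
},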
222 | {
223 | "cell_type": "markdown",
224 | "metadata": {},
225 | "source": [
226 | "***K-means clustering*** on the Iris dataset using the reduced data using sklearn."
227 | ]
228 | },
229 | {
230 | "cell_type": "code",
231 | "execution_count": null,
232 | "metadata": {},
233 | "outputs": [],
234 | "source": [
235 | "# TODO: using scikit learn implement k means clustering. \n",
236 | "# Since this is Iris, we know that the number of outut categories is 3. Use that as number of clusters\n",
237 | "\n",
238 | "\n",
239 | "\n",
240 | "# TODO, next extract the labels of the output k-means. (These labels are just membership to different clusters)\n",
241 | "# Remember: K-means does not have any information about the actual labels.\n",
242 | "labels = []\n"
243 | ]
244 | },
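{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# One possible sketch: fit k-means with 3 clusters on the PCA-reduced data\n",
"# (`X_pca` is assumed from the PCA sketch above).\n",
"kmeans = KMeans(n_clusters=3, random_state=0)\n",
"kmeans.fit(X_pca)\n",
"labels = kmeans.labels_   # cluster membership, not the true iris classes\n",
"print(labels[:10])"
]
},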
245 | {
246 | "cell_type": "markdown",
247 | "metadata": {},
248 | "source": [
249 | "We want to visualize how the k-means algorithm has clustered the data."
250 | ]
251 | },
252 | {
253 | "cell_type": "code",
254 | "execution_count": null,
255 | "metadata": {},
256 | "outputs": [],
257 | "source": [
258 | "# TODO: create a 2- D scatter plot to show this. You can use colors to represent the categorization by K-means. \n",
259 | "# Also consider annotating the points with the true labels to visually see how K-means has done.\n",
260 | "\n",
261 | "fig = plt.figure(figsize=(18, 16))\n",
262 | "ax = plt.axes()\n",
263 | "\n"
264 | ]
265 | },
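{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Example sketch: color points by the k-means assignment (`labels`) and\n",
"# annotate a few points with the true labels `train_y` for comparison\n",
"# (both names assumed from the sketches above).\n",
"fig, ax = plt.subplots(figsize=(8, 6))\n",
"ax.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='viridis', s=50)\n",
"for i in range(0, len(train_y), 10):   # annotate every 10th point to keep the plot readable\n",
"    ax.annotate(str(train_y[i]), (X_pca[i, 0], X_pca[i, 1]))\n",
"ax.set_xlabel('PC 1')\n",
"ax.set_ylabel('PC 2');"
]
},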
266 | {
267 | "cell_type": "code",
268 | "execution_count": null,
269 | "metadata": {},
270 | "outputs": [],
271 | "source": [
272 | "# TODO: create a 3- D scatter plot to show this. You can use colors to represent the categorization by K-means. \n",
273 | "# Also consider annotating the points with the true labels to visually see how K-means has done.\n",
274 | "\n",
275 | "\n",
276 | "fig = plt.figure(figsize=(18, 16))\n",
277 | "ax = plt.axes(projection='3d')\n"
278 | ]
279 | },
280 | {
281 | "cell_type": "markdown",
282 | "metadata": {},
283 | "source": [
284 | "**Play around with the hyperparameters of K-means and visualize the output. See how the clustering changes with each.**\n",
285 | "\n",
286 | "- Initialization method (`init`)\n",
287 | "- Number of clusters (`n_clusters`)\n",
288 | "- Number of initializations for k means (`n_init`)"
289 | ]
290 | },
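{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A small sketch of the kind of experiment suggested above: vary `init`,\n",
"# `n_clusters`, and `n_init`, and compare the resulting inertia\n",
"# (`X_pca` is assumed from the PCA sketch above).\n",
"for init in ['k-means++', 'random']:\n",
"    for n_clusters in [2, 3, 4]:\n",
"        km = KMeans(n_clusters=n_clusters, init=init, n_init=10, random_state=0)\n",
"        km.fit(X_pca)\n",
"        print(init, n_clusters, km.inertia_)"
]
},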
291 | {
292 | "cell_type": "code",
293 | "execution_count": null,
294 | "metadata": {},
295 | "outputs": [],
296 | "source": []
297 | },
298 | {
299 | "cell_type": "code",
300 | "execution_count": null,
301 | "metadata": {},
302 | "outputs": [],
303 | "source": [
304 | "#TODO: use this cell to visualize\n",
305 | "\n",
306 | "fig = plt.figure(figsize=(18, 16))\n",
307 | "ax = plt.axes()"
308 | ]
309 | },
310 | {
311 | "cell_type": "markdown",
312 | "metadata": {},
313 | "source": [
314 | "### Elbow plot to determine number of clusters:\n",
315 | "\n",
316 | "For this example, we know exactly what the number of categories are. We therefore know how many clusters there should be. However, for truly unsupervised problems, we need to identify a good value for `k`. \n",
317 | "\n",
318 | "The elbow plot essentially plots how the sum of square distances of the data points to their closest cluster centers. You can determine this using the attribute `inertia_` in your k-means object."
319 | ]
320 | },
321 | {
322 | "cell_type": "code",
323 | "execution_count": null,
324 | "metadata": {},
325 | "outputs": [],
326 | "source": [
327 | "sum_sq_distances = []\n",
328 | "\n",
329 | "# TODO: append values of the sum of squared distance \n",
330 | "\n",
331 | "\n",
332 | " \n",
333 | "plt.plot(range(1,20), sum_sq_distances, 'bX-')\n",
334 | "plt.xlabel('K')\n",
335 | "plt.ylabel('Sum of squared distances')\n",
336 | "plt.show()"
337 | ]
338 | },
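{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A sketch of the elbow computation (assumes `X_pca` from the PCA sketch above):\n",
"# fit k-means for k = 1..19 and record each model's inertia_.\n",
"sum_sq_distances = []\n",
"for k in range(1, 20):\n",
"    km = KMeans(n_clusters=k, random_state=0).fit(X_pca)\n",
"    sum_sq_distances.append(km.inertia_)\n",
"\n",
"plt.plot(range(1, 20), sum_sq_distances, 'bX-')\n",
"plt.xlabel('K')\n",
"plt.ylabel('Sum of squared distances')\n",
"plt.show()"
]
},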
339 | {
340 | "cell_type": "markdown",
341 | "metadata": {},
342 | "source": [
343 | "**What do you observe?**"
344 | ]
345 | },
346 | {
347 | "cell_type": "code",
348 | "execution_count": null,
349 | "metadata": {},
350 | "outputs": [],
351 | "source": []
352 | },
353 | {
354 | "cell_type": "markdown",
355 | "metadata": {},
356 | "source": [
357 | "**How does this compare with the clusters you chose to begin with?**"
358 | ]
359 | },
360 | {
361 | "cell_type": "code",
362 | "execution_count": null,
363 | "metadata": {},
364 | "outputs": [],
365 | "source": []
366 | },
367 | {
368 | "cell_type": "markdown",
369 | "metadata": {},
370 | "source": [
371 | "### Dimensionality Reduction on MNIST for visualization"
372 | ]
373 | },
374 | {
375 | "cell_type": "code",
376 | "execution_count": null,
377 | "metadata": {},
378 | "outputs": [],
379 | "source": [
380 | "# Running this cell will take a few seconds\n",
381 | "\n",
382 | "X, y = fetch_openml('mnist_784', version=1, return_X_y=True)\n",
383 | " "
384 | ]
385 | },
386 | {
387 | "cell_type": "markdown",
388 | "metadata": {},
389 | "source": [
390 | "#### Next, for easy visualization, we are going to work with a subset of the data.\n",
391 | "\n",
392 | "In the next cell, extract a subset of the data and perform PCA on it. "
393 | ]
394 | },
395 | {
396 | "cell_type": "code",
397 | "execution_count": null,
398 | "metadata": {},
399 | "outputs": [],
400 | "source": [
401 | "train_x, train_y = X[:5000,:], y[:5000]\n",
402 | "\n",
403 | "# TODO: perform PCA on the MNIST digits\n"
404 | ]
405 | },
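{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: a 2-component PCA on the MNIST subset (assumes `train_x` from the\n",
"# cell above; `mnist_pca` is a name introduced here for the plot below).\n",
"pca = PCA(n_components=2)\n",
"mnist_pca = pca.fit_transform(train_x)\n",
"print(mnist_pca.shape)"
]
},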
406 | {
407 | "cell_type": "markdown",
408 | "metadata": {},
409 | "source": [
410 | "Visualizing the data using the prinicpal components."
411 | ]
412 | },
413 | {
414 | "cell_type": "code",
415 | "execution_count": null,
416 | "metadata": {},
417 | "outputs": [],
418 | "source": [
419 | "fig = plt.figure(figsize=(18, 16))\n",
420 | "ax = plt.axes()\n",
421 | "\n",
422 | "\n",
423 | "# TO DO: make a scatter plot of the data using the first two components of your \n"
424 | ]
425 | },
426 | {
427 | "cell_type": "markdown",
428 | "metadata": {},
429 | "source": [
430 | "### Using t-SNE for visualizing MNIST data:"
431 | ]
432 | },
433 | {
434 | "cell_type": "code",
435 | "execution_count": null,
436 | "metadata": {},
437 | "outputs": [],
438 | "source": [
439 | "from sklearn.manifold import TSNE"
440 | ]
441 | },
442 | {
443 | "cell_type": "code",
444 | "execution_count": null,
445 | "metadata": {},
446 | "outputs": [],
447 | "source": [
448 | "#TODO: use sklearn tsne to 'embed' the MNIST data into 2-dimensional space using stochastic neighbor embedding\n"
449 | ]
450 | },
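{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A minimal sketch: embed the 5000-sample MNIST subset into 2-D with t-SNE\n",
"# (this can take a minute or two). `train_x` is assumed from the PCA section\n",
"# above; `mnist_tsne` is a name introduced here for the plot below.\n",
"tsne = TSNE(n_components=2, random_state=0)\n",
"mnist_tsne = tsne.fit_transform(train_x)\n",
"print(mnist_tsne.shape)"
]
},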
451 | {
452 | "cell_type": "markdown",
453 | "metadata": {},
454 | "source": [
455 | "Make a scatter plot of the embedding in the next cell. Color the points based on the true labels of the digits."
456 | ]
457 | },
458 | {
459 | "cell_type": "code",
460 | "execution_count": null,
461 | "metadata": {},
462 | "outputs": [],
463 | "source": [
464 | "\n",
465 | "\n",
466 | "fig = plt.figure(figsize=(18, 16))\n",
467 | "ax = plt.axes()\n",
468 | "\n",
469 | "\n",
470 | "# TO DO: make a scatter plot of the data using the first two components of your \n",
471 | "\n"
472 | ]
473 | },
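{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Example sketch of the embedding plot, assuming `mnist_tsne` from the sketch\n",
"# above and `train_y` holding the digit labels (strings from fetch_openml).\n",
"fig, ax = plt.subplots(figsize=(10, 8))\n",
"scatter = ax.scatter(mnist_tsne[:, 0], mnist_tsne[:, 1],\n",
"                     c=train_y.astype(int), cmap='tab10', s=10)\n",
"plt.colorbar(scatter, ax=ax);"
]
},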
474 | {
475 | "cell_type": "markdown",
476 | "metadata": {},
477 | "source": [
478 | "#### Exercises for you"
479 | ]
480 | },
481 | {
482 | "cell_type": "markdown",
483 | "metadata": {},
484 | "source": [
485 | "1. How does the t-sne visualization look compared to the PCA visualization?"
486 | ]
487 | },
488 | {
489 | "cell_type": "markdown",
490 | "metadata": {},
491 | "source": [
492 | "2. Play around with hyperparameters of the two to see how it changes?"
493 | ]
494 | },
495 | {
496 | "cell_type": "code",
497 | "execution_count": null,
498 | "metadata": {},
499 | "outputs": [],
500 | "source": []
501 | }
502 | ],
503 | "metadata": {
504 | "kernelspec": {
505 | "display_name": "Python 3",
506 | "language": "python",
507 | "name": "python3"
508 | },
509 | "language_info": {
510 | "codemirror_mode": {
511 | "name": "ipython",
512 | "version": 3
513 | },
514 | "file_extension": ".py",
515 | "mimetype": "text/x-python",
516 | "name": "python",
517 | "nbconvert_exporter": "python",
518 | "pygments_lexer": "ipython3",
519 | "version": "3.7.4"
520 | }
521 | },
522 | "nbformat": 4,
523 | "nbformat_minor": 2
524 | }
525 |
--------------------------------------------------------------------------------
/other_resource/ieee_matrix_factoriztion.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/libphy/CSCI4622-20SP-MachineLearning/ddd39b8369fc6a9fc9aea46bcdedcc306ff78aa5/other_resource/ieee_matrix_factoriztion.pdf
--------------------------------------------------------------------------------
/other_resource/project_guide.md:
--------------------------------------------------------------------------------
1 | # Project Guide
2 |
3 | For the final project, your team is required to:
4 |
5 | 1. Identify a machine learning problem.
6 | If you have a relationship with a company or organization that is willing to provide a project task or data, you're welcome to work on their problem. You do not need to share the data, but you should be able to make a public git repository (that includes your notebook or code- at least the ML analysis part if sharing all of the code is not possible) and a video presentation.
7 | If you're going to just use a kaggle competition or similar, you must put more focus on model building and/or analysis for it to be a valid project (replicating what's in the kaggle kernel or other notebooks available online is not a valid project; if you add other approaches and compare with those, it's valid). If you find a research paper and want to replicate the experiments or implement an algorithm, that works too. Please consult the teaching staff to check whether your project idea has an appropriate scope.
8 | 2. Gather data, determine the method of data collection and provenance of the data. The data can be web-scraped, or downloaded from any sources as long as it's legal and ethical, and does not violate their policy or intellectual property/copyrights.
9 | 3. Clean the data, do EDA (exploratory data analysis- e.g., inspecting and visualization of the data)
10 | 4. Perform analysis using machine learning models of your choice.
11 | 5. Deliver the results: These deliverables serve two purposes- your grade for this course and a project portfolio that you can show when you apply for jobs in the near future.
12 |
13 | **[Deliverable 1]** Jupyter notebook showing a brief problem description, EDA procedure, analysis (model building and training), results, and discussion/conclusion. If your work becomes so large that it doesn't fit into one notebook (or you think one large notebook would be less readable), you can make several notebooks or scripts in the git repository (as deliverable 3) and submit a report-style notebook or pdf instead.
14 | In case your project doesn't fit the jupyter notebook format (e.g. you built an app that uses ML), write your approach as a report and submit it in pdf form. Please include all of your team members' names in the notebook or report.
15 |
16 | **[Deliverable 2]** Video presentation- record a video of a presentation or demo of your work. The presentation should be a condensed version, as if you're doing a short pitch to advertise your work, so please focus on the highlights (1. what problem you solve, 2. what ML approach you use or what methods your app uses, 3. what the result is, or a running app demo). Minimum video length is 5 min, maximum length is 10 min (we will not grade anything after 10 min). Submit the video in .mp4 format via moodle (only one file per team allowed).
17 |
18 | **[Deliverable 3]** Create a public project Git repository with your work (please also include the git repo url in your notebook/report and slides). Data by-product: in case your project creates data that you want to share, please do not upload the data to git; a good way to share it would be through a kaggle dataset or similar. Similarly, please do not upload videos to git- if you want, you can upload them to youtube and post the link(s) in your git repo.
19 |
20 | Here is the rough timeline for the next 4 weeks:
21 |
22 | 1. **Project week 1** In the earlier phase, you were to make the initial selection of a data source and problem to evaluate and discuss it on Piazza and through OHs. In this stage, you're going to go through the initial data cleaning and EDA, and judge whether you need to collect more data or different data, etc.
23 |
24 | 2. **Project week 2** Continue more EDA if needed. Start the main analysis (main analysis refers to *machine learning* such as classification or regression, prediction or inference, etc., OR *other statistical analysis*). You are on the right track if you start the main analysis by the end of this week at the latest. Depending on your tasks, you may have one model or more. Generally, it is deemed a higher-quality project if you compare multiple models and show your understanding of why certain models work better than others, or what limitations or cautions certain models may have (and for machine learning models, show enough effort on hyperparameter optimization).
25 |
26 | 3. **Project week 3** Continue more main analysis. Hyperparameter optimization. Compare results from your models. Wrap up.
27 |
28 | 4. **Project week 4** Wrap up and finalize your jupyter notebook write-up. Prepare the presentation. Organize your git repository. Submit the 3 deliverables.
29 | [1] finished jupyter notebook or pdf write-up: Your final report should be in the pdf format or a Jupyter notebook with Markdown (please make sure you actually write the report not just comments in case you use jupyter notebook).
30 | [2] a link to your project git repository (please include the link in both report and video presentation): Your github repo must be public and accessible. You do not need to upload data file there (a description and some pandas dataframe of data in the write-up is enough).
31 | [3] the video presentation in .mp4 file format- please check your audio/video works before you submit.
32 | When submitting in Moodle submission box, you can either submit those files or a zipped file.
33 |
34 | ### EDA procedure example (for the Project Week 1 & 2)
35 | - Describe the data sources and the hypothesis or problem you wish to analyze, and then describe the factors or components that make up the dataset (the "factors" here are called "features" in machine learning terms; these factors are often columns in the tabulated data). For each factor, use a box plot, scatter plot, histogram, etc. to describe the distribution of the data as appropriate.
36 | - Describe correlations between different factors of the dataset and justify your assumption that they are correlated or not correlated. You may use numeric or qualitative / graphical analysis for this step.
37 | - Determine if any data needs to be transformed. For example, if you're planning on using an SVM method for prediction, you may need to normalize or scale the data if there is a considerable difference in the ranges of the features (see the sketch after this list).
38 | - Using your hypothesis, indicate if it's likely that you should transform data, such as using a log transform or other transformation of the dataset.
39 | - You should determine if your data has outliers or needs to be cleaned in any way. Are there missing data values for specific factors? How will you handle the data cleaning? Will you discard, interpolate or otherwise substitute data values?
40 | - If you believe that specific factors will be more important than others in your analysis, you should mention which and why. You will use this to confirm your intuitions in your final writeup.
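
A minimal sketch of the scaling step mentioned above, assuming your data is in a pandas DataFrame; the file name and column selection are placeholders to be replaced with your own:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("your_data.csv")                           # hypothetical file name
numeric_cols = df.select_dtypes(include="number").columns   # only scale numeric factors
df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])
```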
41 |
42 | ### How to find data
43 | There is a plethora of data resources these days. Here are a few popular (classic ML data) ones.
44 | - [UCI ML data repository](https://archive.ics.uci.edu/ml/datasets.php): Their data is from researchers mostly and is relatively clean. Also the task types are mostly for classification or regression, therefore many of them are suitable for the scope of this course's project.
45 | - [Kaggle](https://www.kaggle.com/): Perhaps one of the most popular data science/ML data repositories today, they have many interesting ongoing or past competitions. But most of the recent datasets/tasks will be beyond the scope of this course. Should you still be interested, choose problems that have tabulated data and classification- or regression-type tasks.
46 | - [Data.gov](https://www.data.gov/) has many government sources of data. You can filter for a specific topic (e.g. Finance) and then restrict your attention to e.g. CSV data, which should be easier to process.
47 | - [Grand Challenge](https://grand-challenge.org/) has various biomed image computer vision competitions.
48 |
49 | Some internet blogs about list of dataset:
50 | - https://medium.com/towards-artificial-intelligence/the-50-best-public-datasets-for-machine-learning-d80e9f030279
51 | - https://towardsdatascience.com/top-sources-for-machine-learning-datasets-bb6d0dc3378b
52 | - https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research
53 | - http://www.statsmodels.org/devel/datasets/index.html
54 | - https://pathmind.com/wiki/open-datasets
55 |
56 | ### Example Topics
57 | Here is a link to a list of ML projects from the [previous classes](https://docs.google.com/spreadsheets/d/1RG2TrTRbjli5ySZF5E1WRqA0r_o4uETKx_lPhaIN3cU/edit?usp=sharing) I've taught. These are from master's-level data science classes- some students there had backgrounds in fields such as medicine or software engineering, but most were first-time learners of ML and many didn't have a CS degree, so the skill levels vary widely.
58 |
--------------------------------------------------------------------------------
/slides/Lec10-SVM(2).pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/libphy/CSCI4622-20SP-MachineLearning/ddd39b8369fc6a9fc9aea46bcdedcc306ff78aa5/slides/Lec10-SVM(2).pdf
--------------------------------------------------------------------------------
/slides/Lec10-SVM(2)_anno.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/libphy/CSCI4622-20SP-MachineLearning/ddd39b8369fc6a9fc9aea46bcdedcc306ff78aa5/slides/Lec10-SVM(2)_anno.pdf
--------------------------------------------------------------------------------
/slides/Lec11-NeuralNetwork_1_anno.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/libphy/CSCI4622-20SP-MachineLearning/ddd39b8369fc6a9fc9aea46bcdedcc306ff78aa5/slides/Lec11-NeuralNetwork_1_anno.pdf
--------------------------------------------------------------------------------
/slides/Lec12-NeuralNetwork2_anno.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/libphy/CSCI4622-20SP-MachineLearning/ddd39b8369fc6a9fc9aea46bcdedcc306ff78aa5/slides/Lec12-NeuralNetwork2_anno.pdf
--------------------------------------------------------------------------------
/slides/Lec13-math-behind-NN-training_anno.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/libphy/CSCI4622-20SP-MachineLearning/ddd39b8369fc6a9fc9aea46bcdedcc306ff78aa5/slides/Lec13-math-behind-NN-training_anno.pdf
--------------------------------------------------------------------------------
/slides/Lec14-optimization-methods-NN-training_anno.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/libphy/CSCI4622-20SP-MachineLearning/ddd39b8369fc6a9fc9aea46bcdedcc306ff78aa5/slides/Lec14-optimization-methods-NN-training_anno.pdf
--------------------------------------------------------------------------------
/slides/Lec15-Unsupervised Learning-PCA_anno.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/libphy/CSCI4622-20SP-MachineLearning/ddd39b8369fc6a9fc9aea46bcdedcc306ff78aa5/slides/Lec15-Unsupervised Learning-PCA_anno.pdf
--------------------------------------------------------------------------------
/slides/Lec16-Unsupervised Learning-Clustering_anno.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/libphy/CSCI4622-20SP-MachineLearning/ddd39b8369fc6a9fc9aea46bcdedcc306ff78aa5/slides/Lec16-Unsupervised Learning-Clustering_anno.pdf
--------------------------------------------------------------------------------
/slides/Lec17-Unsupervised Learning-Recommender System_anno.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/libphy/CSCI4622-20SP-MachineLearning/ddd39b8369fc6a9fc9aea46bcdedcc306ff78aa5/slides/Lec17-Unsupervised Learning-Recommender System_anno.pdf
--------------------------------------------------------------------------------
/slides/Lec18-Unsupervised Learning-Matrix Factorization_anno_ver1.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/libphy/CSCI4622-20SP-MachineLearning/ddd39b8369fc6a9fc9aea46bcdedcc306ff78aa5/slides/Lec18-Unsupervised Learning-Matrix Factorization_anno_ver1.pdf
--------------------------------------------------------------------------------
/slides/Lec19-whiteboard_ver1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/libphy/CSCI4622-20SP-MachineLearning/ddd39b8369fc6a9fc9aea46bcdedcc306ff78aa5/slides/Lec19-whiteboard_ver1.png
--------------------------------------------------------------------------------
/slides/Lec1_Introduction.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/libphy/CSCI4622-20SP-MachineLearning/ddd39b8369fc6a9fc9aea46bcdedcc306ff78aa5/slides/Lec1_Introduction.pdf
--------------------------------------------------------------------------------
/slides/Lec1_Introduction_anno.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/libphy/CSCI4622-20SP-MachineLearning/ddd39b8369fc6a9fc9aea46bcdedcc306ff78aa5/slides/Lec1_Introduction_anno.pdf
--------------------------------------------------------------------------------
/slides/Lec2-Linear-Regression.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/libphy/CSCI4622-20SP-MachineLearning/ddd39b8369fc6a9fc9aea46bcdedcc306ff78aa5/slides/Lec2-Linear-Regression.pdf
--------------------------------------------------------------------------------
/slides/Lec2-Linear-Regression_anno.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/libphy/CSCI4622-20SP-MachineLearning/ddd39b8369fc6a9fc9aea46bcdedcc306ff78aa5/slides/Lec2-Linear-Regression_anno.pdf
--------------------------------------------------------------------------------
/slides/Lec20-ConvolutionalNeuralNetwork1_anno.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/libphy/CSCI4622-20SP-MachineLearning/ddd39b8369fc6a9fc9aea46bcdedcc306ff78aa5/slides/Lec20-ConvolutionalNeuralNetwork1_anno.pdf
--------------------------------------------------------------------------------
/slides/Lec21-ConvolutionalNeuralNetwork2_r.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/libphy/CSCI4622-20SP-MachineLearning/ddd39b8369fc6a9fc9aea46bcdedcc306ff78aa5/slides/Lec21-ConvolutionalNeuralNetwork2_r.pdf
--------------------------------------------------------------------------------
/slides/Lec22-ConvolutionalNeuralNetwork3.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/libphy/CSCI4622-20SP-MachineLearning/ddd39b8369fc6a9fc9aea46bcdedcc306ff78aa5/slides/Lec22-ConvolutionalNeuralNetwork3.pdf
--------------------------------------------------------------------------------
/slides/Lec23-autoencoder_whiteboard.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/libphy/CSCI4622-20SP-MachineLearning/ddd39b8369fc6a9fc9aea46bcdedcc306ff78aa5/slides/Lec23-autoencoder_whiteboard.png
--------------------------------------------------------------------------------
/slides/Lec24-RecurrentNeuralNetwork1.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/libphy/CSCI4622-20SP-MachineLearning/ddd39b8369fc6a9fc9aea46bcdedcc306ff78aa5/slides/Lec24-RecurrentNeuralNetwork1.pdf
--------------------------------------------------------------------------------
/slides/Lec3-Logistic-Regression.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/libphy/CSCI4622-20SP-MachineLearning/ddd39b8369fc6a9fc9aea46bcdedcc306ff78aa5/slides/Lec3-Logistic-Regression.pdf
--------------------------------------------------------------------------------
/slides/Lec3-Logistic-Regression_anno.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/libphy/CSCI4622-20SP-MachineLearning/ddd39b8369fc6a9fc9aea46bcdedcc306ff78aa5/slides/Lec3-Logistic-Regression_anno.pdf
--------------------------------------------------------------------------------
/slides/Lec4-improve-training.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/libphy/CSCI4622-20SP-MachineLearning/ddd39b8369fc6a9fc9aea46bcdedcc306ff78aa5/slides/Lec4-improve-training.pdf
--------------------------------------------------------------------------------
/slides/Lec4-improve-training_anno.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/libphy/CSCI4622-20SP-MachineLearning/ddd39b8369fc6a9fc9aea46bcdedcc306ff78aa5/slides/Lec4-improve-training_anno.pdf
--------------------------------------------------------------------------------
/slides/Lec5_Decision_Trees.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/libphy/CSCI4622-20SP-MachineLearning/ddd39b8369fc6a9fc9aea46bcdedcc306ff78aa5/slides/Lec5_Decision_Trees.pdf
--------------------------------------------------------------------------------
/slides/Lec5_Decision_Trees_anno.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/libphy/CSCI4622-20SP-MachineLearning/ddd39b8369fc6a9fc9aea46bcdedcc306ff78aa5/slides/Lec5_Decision_Trees_anno.pdf
--------------------------------------------------------------------------------
/slides/Lec6_Decision_Trees_pruning.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/libphy/CSCI4622-20SP-MachineLearning/ddd39b8369fc6a9fc9aea46bcdedcc306ff78aa5/slides/Lec6_Decision_Trees_pruning.pdf
--------------------------------------------------------------------------------
/slides/Lec7_random_forest.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/libphy/CSCI4622-20SP-MachineLearning/ddd39b8369fc6a9fc9aea46bcdedcc306ff78aa5/slides/Lec7_random_forest.pdf
--------------------------------------------------------------------------------
/slides/Lec7_random_forest_anno.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/libphy/CSCI4622-20SP-MachineLearning/ddd39b8369fc6a9fc9aea46bcdedcc306ff78aa5/slides/Lec7_random_forest_anno.pdf
--------------------------------------------------------------------------------
/slides/Lec8_Boosting.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/libphy/CSCI4622-20SP-MachineLearning/ddd39b8369fc6a9fc9aea46bcdedcc306ff78aa5/slides/Lec8_Boosting.pdf
--------------------------------------------------------------------------------
/slides/Lec8_Boosting_anno.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/libphy/CSCI4622-20SP-MachineLearning/ddd39b8369fc6a9fc9aea46bcdedcc306ff78aa5/slides/Lec8_Boosting_anno.pdf
--------------------------------------------------------------------------------
/slides/Lec9-SVM(1).pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/libphy/CSCI4622-20SP-MachineLearning/ddd39b8369fc6a9fc9aea46bcdedcc306ff78aa5/slides/Lec9-SVM(1).pdf
--------------------------------------------------------------------------------
/slides/Lec9-SVM(1)_anno.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/libphy/CSCI4622-20SP-MachineLearning/ddd39b8369fc6a9fc9aea46bcdedcc306ff78aa5/slides/Lec9-SVM(1)_anno.pdf
--------------------------------------------------------------------------------