├── .gitignore ├── .ipynb_checkpoints ├── Syllabus-checkpoint.md └── Untitled-checkpoint.ipynb ├── Syllabus.md ├── Syllabus.pdf ├── cheat_sheets └── git │ ├── figures │ ├── clone_screen.png │ ├── create_branch_website.png │ ├── download.png │ ├── example_notebook.png │ ├── fork.png │ ├── git_commit.png │ ├── git_status.png │ ├── init_profile.png │ ├── init_repo.png │ ├── jupyter.png │ ├── jupyter_dir.png │ ├── jupyter_dropdown.png │ ├── jupyter_shutdown.png │ ├── new_repo.png │ ├── open_pull_request.png │ ├── open_pull_request2.png │ ├── pull_request.png │ ├── pull_request2.png │ ├── rename_notebook.png │ ├── repo_screen.png │ └── updated_repo.png │ ├── git_cheat_sheet.pdf │ └── git_cheat_sheet.tex ├── homeworks ├── .ipynb_checkpoints │ ├── Final Project-checkpoint.ipynb │ ├── Homework_1-checkpoint.ipynb │ ├── Homework_2-checkpoint.ipynb │ ├── Homework_3-checkpoint.ipynb │ ├── Homework_4-checkpoint.ipynb │ └── Homework_5-checkpoint.ipynb ├── Final Project.ipynb ├── Homework_1.ipynb ├── Homework_2.ipynb ├── Homework_3.ipynb ├── Homework_4.ipynb └── Homework_5.ipynb ├── lectures ├── .ipynb_checkpoints │ ├── Lecture_10_Random_Forest-checkpoint.ipynb │ ├── Lecture_11_Boosting-checkpoint.ipynb │ ├── Lecture_1_Introduction-checkpoint.ipynb │ ├── Lecture_2_Python_Git-checkpoint.ipynb │ ├── Lecture_3_Get_Clean_Visualize_Data-checkpoint.ipynb │ ├── Lecture_4_Linear_Regression_and_Evaluation-checkpoint.ipynb │ ├── Lecture_5_Logistic_Regression_and_Evaluation-checkpoint.ipynb │ ├── Lecture_6_K_Nearest_Neighbors-checkpoint.ipynb │ ├── Lecture_7_Naive_Bayes-checkpoint.ipynb │ ├── Lecture_8_SVM-checkpoint.ipynb │ ├── Lecture_9_Decision_Trees-checkpoint.ipynb │ └── Random Notes-checkpoint.ipynb ├── Extra_Encoder_Decoder.ipynb ├── In_Progress_Encoder_Decoder_Attention.ipynb ├── Lecture_10_Random_Forest.ipynb ├── Lecture_11_Boosting.ipynb ├── Lecture_12_Dimensionality Reduction.ipynb ├── Lecture_13_Clustering.ipynb ├── Lecture_14_MLP.ipynb ├── Lecture_15_Conv_Nets.ipynb ├── Lecture_16_RNNs.ipynb ├── Lecture_17_Recommender_Systems.ipynb ├── Lecture_1_Introduction.ipynb ├── Lecture_2_Python_Git.ipynb ├── Lecture_3_Get_Clean_Visualize_Data.ipynb ├── Lecture_4_Linear_Regression_and_Evaluation.ipynb ├── Lecture_5_Logistic_Regression_and_Evaluation.ipynb ├── Lecture_6_K_Nearest_Neighbors.ipynb ├── Lecture_7_Naive_Bayes.ipynb ├── Lecture_8_SVM.ipynb └── Lecture_9_Decision_Trees.ipynb └── small_data ├── Wholesale customers data.csv ├── baby_names ├── yob2000.txt ├── yob2001.txt ├── yob2002.txt ├── yob2003.txt ├── yob2004.txt ├── yob2005.txt ├── yob2006.txt ├── yob2007.txt ├── yob2008.txt ├── yob2009.txt ├── yob2010.txt ├── yob2011.txt ├── yob2012.txt ├── yob2013.txt ├── yob2014.txt ├── yob2015.txt └── yob2016.txt ├── elonmusk_tweets.csv ├── male_names.txt ├── movie_data.tsv └── u.item /.gitignore: -------------------------------------------------------------------------------- 1 | large_data/ 2 | *.swp 3 | Syllabus.docx 4 | drafts/ 5 | *.aux 6 | *.log 7 | *.synctex.gz 8 | *.toc 9 | *.DS_Store 10 | *lectures/.ipynb_checkpoints/* 11 | models/ 12 | -------------------------------------------------------------------------------- /.ipynb_checkpoints/Syllabus-checkpoint.md: -------------------------------------------------------------------------------- 1 | **Applied Machine Learning Syllabus** 2 | 3 | Economics 213R 4 | 5 | Professor: Tyler Folkman 6 | 7 | 8 | ## Contact Information 9 | 10 | I have created a slack group that we will use for most communication. 
I will provide more information on how to join and use slack when the course starts. 11 | 12 | You can also always reach me at tylerfolkman at gmail 13 | 14 | 15 | ## Office Hours 16 | 17 | To be determined. I will have an hour twice a week for office hours, but I want to discuss with the class the best times. My office hours will most likely be remote using Zoom video conference. 18 | 19 | 20 | ## Grading 21 | 22 | Final Project: 30% 23 | 24 | Homeworks: 50% 25 | 26 | Participation: 10% 27 | 28 | Quizzes: 10% 29 | 30 | Grading will be done on a curve and not upon absolute thresholds. 31 | 32 | 33 | ## Class Material 34 | 35 | I will be leveraging material from many sources as well as my own content. Below is a listed of sources I used. Many are freely (and legally) available online. Not required to purchase any, though, I find [1] to be quite good as well as [4]. [4] is available for free online. 36 | 37 | 1. Hands-On Machine Learning with Scikit-Learn and Tensorflow 38 | 2. Data Science From Scratch 39 | 3. Python for Data Analysis 40 | 4. Introduction to Statistical Learning 41 | 5. Coursera: Machine Learning 42 | 6. Fastai course 1 43 | 7. https://distill.pub/2016/misread-tsne/ 44 | 45 | 46 | ## Course Description 47 | 48 | The amount of data being created in the world is staggering. More data has been created in the past two years than in the entire previous history of the human race and companies are taking notice. Some of the largest companies in the world - Google, Microsoft, Facebook, and Amazon to name a few - are leveraging these data to create valuable products. Amazon's Alexa, Google's search, Facebook's timeline are all powered by machine learning. These same companies are also scrambling to get as much machine learning talent as they can. 49 | 50 | This course will teach you the fundamental building blocks of machine learning. We will go from how to gather and clean data to writing Python code to create and evaluate our own models. We will cover enough theory to ensure an understanding of machine learning models, but the focus is on the application of machine learning. To that end, we will also cover machine learning in industry and how to avoid some of the pitfalls that cause machine learning initiatives to fail. 51 | 52 | The course will use Python heavily and while not a pre-requisite any previous experience with coding and/or Python will be very helpful. 53 | 54 | 55 | ## Prerequisites 56 | 57 | Econ 110 and Econ 378 58 | 59 | 60 | ## Tentative Outline 61 | 62 | ### Introduction to Machine Learning 63 | 64 | 1. What, why, how, and challenges 65 | 2. Differences between stats, ML, and econometrics 66 | 67 | References: [1] Chapter 1 68 | 69 | 70 | ### Introduction to Python 71 | 72 | 1. Functions, Lists, Dictionaries, Counters, Sets, List Comprehensions, Generators 73 | 2. Data science stack: numpy, pandas, scikit-learn, scipy, jupyter notebooks 74 | 3. Git and github 75 | 76 | References: [2] Chapter 2, [3] All Chapters 77 | 78 | 79 | ### How to get and clean data 80 | 81 | 1. Loading from text, SQL, and web pages 82 | 2. Cleaning data: missing, outliers, scaling data, reformatting 83 | 84 | References: [3] Chapters 5-7, [2] Chapters 9-10 and 23, [1] Chapter 2 85 | 86 | 87 | ### How to visualize and describe data 88 | 89 | 1. Data description techniques: mean, median, percentiles, correlations 90 | 2. 
Data visualization techniques: various plots, Bokeh 91 | 92 | References: [1] Chapter 2, [2] Chapters 3 and 5, [3] Chapters 5, 8 and 9 93 | 94 | 95 | ### Regression 96 | 97 | 1. Linear Regression 98 | 2. Logistic Regression 99 | 3. Regularized Regression 100 | 101 | References: [2] Chapters 14-16, [1] Chapter 4, [4] Chapter 3, [5] Weeks 1-3 102 | 103 | 104 | ### Gradient Descent 105 | 106 | 1. Cost functions, learning rates, and gradients 107 | 108 | References: [2] Chapter 8, [1] Chapter 4 109 | 110 | 111 | ### How to evaluate and tune models 112 | 113 | 1. Training/Testing and Cross validation 114 | 2. MSE 115 | 3. Confusion matrix, precision and recall, ROC 116 | 4. Learning Curves 117 | 5. Bias and variance / over-fitting under-fitting 118 | 6. Tuning hyperparameters 119 | 7. Feature importances 120 | 121 | References: [2] Chapter 11, [1] Chapter 2 and 3, [4] Chapter 2, 5, [5] Week 6 122 | 123 | 124 | ### Classification Models 125 | 126 | 1. Naive Bayes, K-nearest neighbors 127 | 128 | References: [2] Chapters 12-13, [4] Chapter 4 129 | 130 | 131 | ### SVMs 132 | 133 | 1. Max-margin, kernel trick, more than 2 classes 134 | 135 | References: [1] Chapter 5, [4] Chapter 9, [5] Week 7 136 | 137 | 138 | ### Trees and Ensembles 139 | 140 | 1. Decision Trees, Random Forests, and Gradient Boosted Trees 141 | 2. XGBoost 142 | 143 | References: [2] Chapter 17, [1] Chapters 6-7, [4] Chapter 8 144 | 145 | 146 | ### Dimensionality Reduction 147 | 148 | 1. Curse of dimensionality 149 | 2. PCA and SVD 150 | 3. T-SNE 151 | 152 | References: [1] Chapter 8, [7]. [4] Chapter 6, 10, [5] Week 8 153 | 154 | 155 | ### Clustering 156 | 157 | 1. K-means, Hierarchical, DBScan 158 | 159 | References: [2] Chapter 19, [4] Chapter 10, [5] Week 8 160 | 161 | 162 | ### Recommender Systems 163 | 164 | 1. Collaborative filtering and deep collaborative filtering 165 | 166 | References: [2] Chapter 22, [5] Week 9 167 | 168 | 169 | ### Data Science at Scale 170 | 171 | 1. Parallelism - dask and blaze 172 | 2. Spark and Mapreduce 173 | 3. AWS 174 | 4. Online learning and stochastic gradient descent 175 | 176 | References: [5] Week 10, [2] Chapter 24 177 | 178 | 179 | ### Deep Learning 180 | 181 | 1. Perceptron, simple networks, and backprop 182 | 2. GPUs 183 | 3. Tensorflow, Keras, Pytorch 184 | 4. CNNs 185 | 5. RNNs and word2vec 186 | 187 | References: [2] Chapter 18, [1] Chapters 9-16, [5] Weeks 4-5, [6] All Lessons 188 | 189 | 190 | ### Deploying Machine Learning 191 | 192 | 1. Docker 193 | 2. Flask and REST Endpoints 194 | 195 | 196 | ### Machine Learning in Industry 197 | 198 | 1. Guest lecturers / companies 199 | 2. Things Kaggle challenges won’t teach you 200 | 3. Why most data science initiatives fail 201 | 4. Importance of communication and collaboration 202 | 203 | 204 | ## University 205 | 206 | ### Honor Code 207 | In keeping with the principles of the BYU Honor Code, students are expected to be honest in all of their academic work. Academic honesty means, most fundamentally, that any work you present as your own must in fact be your own work and not that of another. Violations of this principle may result in a failing grade in the course and additional disciplinary action by the university. Students are also expected to adhere to the Dress and Grooming Standards. Adherence demonstrates respect for yourself and others and ensures an effective learning and working environment. It is the university's expectation, and my own expectation in class, that each student will abide by all Honor Code standards. 
Please call the Honor Code Office at 422-2847 if you have questions about those standards. 208 | 209 | ### Preventing & Responding to Sexual Misconduct 210 | In accordance with Title IX of the Education Amendments of 1972, Brigham Young University prohibits unlawful sex discrimination against any participant in its education programs or activities. The university also prohibits sexual harassment—including sexual violence—committed by or against students, university employees, and visitors to campus. As outlined in university policy, sexual harassment, dating violence, domestic violence, sexual assault, and stalking are considered forms of "Sexual Misconduct" prohibited by the university. University policy requires all university employees in a teaching, managerial, or supervisory role to report all incidents of Sexual Misconduct that come to their attention in any way, including but not limited to face-to-face conversations, a written class assignment or paper, class discussion, email, text, or social media post. Incidents of Sexual Misconduct should be reported to the Title IX Coordinator at t9coordinator@byu.edu or (801) 422-8692. Reports may also be submitted through EthicsPoint at https://titleix.byu.edu/report or 1-888-238-1062 (24-hours a day). BYU offers confidential resources for those affected by Sexual Misconduct, including the university’s Victim Advocate, as well as a number of non-confidential resources and services that may be helpful. Additional information about Title IX, the university’s Sexual Misconduct Policy, reporting requirements, and resources can be found at http://titleix.byu.edu or by contacting the university’s Title IX Coordinator. 211 | 212 | ### Student Disability 213 | Brigham Young University is committed to providing a working and learning atmosphere that reasonably accommodates qualified persons with disabilities. If you have any disability which may impair your ability to complete this course successfully, please contact the Services for Students with Disabilities Office (422-2767). Reasonable academic accommodations are reviewed for all students who have qualified, documented disabilities. Services are coordinated with the student and instructor by the SSD Office. If you need assistance or if you feel you have been unlawfully discriminated against on the basis of disability, you may seek resolution through established grievance policy and procedures by contacting the Equal Employment Office at 422-5895,D-285 ASB. 214 | -------------------------------------------------------------------------------- /.ipynb_checkpoints/Untitled-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [], 3 | "metadata": {}, 4 | "nbformat": 4, 5 | "nbformat_minor": 2 6 | } 7 | -------------------------------------------------------------------------------- /Syllabus.md: -------------------------------------------------------------------------------- 1 | **Applied Machine Learning Syllabus** 2 | 3 | Economics 213R 4 | 5 | Professor: Tyler Folkman 6 | 7 | 8 | ## Contact Information 9 | 10 | I have created a slack group that we will use for most communication. I will provide more information on how to join and use slack when the course starts. 11 | 12 | You can also always reach me at tylerfolkman at gmail 13 | 14 | 15 | ## Office Hours 16 | 17 | To be determined. I will have an hour twice a week for office hours, but I want to discuss with the class the best times. 
My office hours will most likely be remote using Zoom video conference. 18 | 19 | 20 | ## Grading 21 | 22 | Final Project: 30% 23 | 24 | Homeworks: 50% 25 | 26 | Participation: 10% 27 | 28 | Quizzes: 10% 29 | 30 | Grading will be done on a curve and not upon absolute thresholds. 31 | 32 | 33 | ## Class Material 34 | 35 | I will be leveraging material from many sources as well as my own content. Below is a listed of sources I used. Many are freely (and legally) available online. Not required to purchase any, though, I find [1] to be quite good as well as [4]. [4] is available for free online. 36 | 37 | 1. Hands-On Machine Learning with Scikit-Learn and Tensorflow 38 | 2. Data Science From Scratch 39 | 3. Python for Data Analysis 40 | 4. Introduction to Statistical Learning 41 | 5. Coursera: Machine Learning 42 | 6. Fastai course 1 43 | 7. https://distill.pub/2016/misread-tsne/ 44 | 45 | 46 | ## Course Description 47 | 48 | The amount of data being created in the world is staggering. More data has been created in the past two years than in the entire previous history of the human race and companies are taking notice. Some of the largest companies in the world - Google, Microsoft, Facebook, and Amazon to name a few - are leveraging these data to create valuable products. Amazon's Alexa, Google's search, Facebook's timeline are all powered by machine learning. These same companies are also scrambling to get as much machine learning talent as they can. 49 | 50 | This course will teach you the fundamental building blocks of machine learning. We will go from how to gather and clean data to writing Python code to create and evaluate our own models. We will cover enough theory to ensure an understanding of machine learning models, but the focus is on the application of machine learning. To that end, we will also cover machine learning in industry and how to avoid some of the pitfalls that cause machine learning initiatives to fail. 51 | 52 | The course will use Python heavily and while not a pre-requisite any previous experience with coding and/or Python will be very helpful. 53 | 54 | 55 | ## Prerequisites 56 | 57 | Econ 110 and Econ 378 58 | 59 | 60 | ## Tentative Outline 61 | 62 | ### Introduction to Machine Learning 63 | 64 | 1. What, why, how, and challenges 65 | 2. Differences between stats, ML, and econometrics 66 | 67 | References: [1] Chapter 1 68 | 69 | 70 | ### Introduction to Python 71 | 72 | 1. Functions, Lists, Dictionaries, Counters, Sets, List Comprehensions, Generators 73 | 2. Data science stack: numpy, pandas, scikit-learn, scipy, jupyter notebooks 74 | 3. Git and github 75 | 76 | References: [2] Chapter 2, [3] All Chapters 77 | 78 | 79 | ### How to get and clean data 80 | 81 | 1. Loading from text, SQL, and web pages 82 | 2. Cleaning data: missing, outliers, scaling data, reformatting 83 | 84 | References: [3] Chapters 5-7, [2] Chapters 9-10 and 23, [1] Chapter 2 85 | 86 | 87 | ### How to visualize and describe data 88 | 89 | 1. Data description techniques: mean, median, percentiles, correlations 90 | 2. Data visualization techniques: various plots, Bokeh 91 | 92 | References: [1] Chapter 2, [2] Chapters 3 and 5, [3] Chapters 5, 8 and 9 93 | 94 | 95 | ### Regression 96 | 97 | 1. Linear Regression 98 | 2. Logistic Regression 99 | 3. Regularized Regression 100 | 101 | References: [2] Chapters 14-16, [1] Chapter 4, [4] Chapter 3, [5] Weeks 1-3 102 | 103 | 104 | ### Gradient Descent 105 | 106 | 1. 
Cost functions, learning rates, and gradients 107 | 108 | References: [2] Chapter 8, [1] Chapter 4 109 | 110 | 111 | ### How to evaluate and tune models 112 | 113 | 1. Training/Testing and Cross validation 114 | 2. MSE 115 | 3. Confusion matrix, precision and recall, ROC 116 | 4. Learning Curves 117 | 5. Bias and variance / over-fitting under-fitting 118 | 6. Tuning hyperparameters 119 | 7. Feature importances 120 | 121 | References: [2] Chapter 11, [1] Chapter 2 and 3, [4] Chapter 2, 5, [5] Week 6 122 | 123 | 124 | ### Classification Models 125 | 126 | 1. Naive Bayes, K-nearest neighbors 127 | 128 | References: [2] Chapters 12-13, [4] Chapter 4 129 | 130 | 131 | ### SVMs 132 | 133 | 1. Max-margin, kernel trick, more than 2 classes 134 | 135 | References: [1] Chapter 5, [4] Chapter 9, [5] Week 7 136 | 137 | 138 | ### Trees and Ensembles 139 | 140 | 1. Decision Trees, Random Forests, and Gradient Boosted Trees 141 | 2. XGBoost 142 | 143 | References: [2] Chapter 17, [1] Chapters 6-7, [4] Chapter 8 144 | 145 | 146 | ### Dimensionality Reduction 147 | 148 | 1. Curse of dimensionality 149 | 2. PCA and SVD 150 | 3. T-SNE 151 | 152 | References: [1] Chapter 8, [7]. [4] Chapter 6, 10, [5] Week 8 153 | 154 | 155 | ### Clustering 156 | 157 | 1. K-means, Hierarchical, DBScan 158 | 159 | References: [2] Chapter 19, [4] Chapter 10, [5] Week 8 160 | 161 | 162 | ### Recommender Systems 163 | 164 | 1. Collaborative filtering and deep collaborative filtering 165 | 166 | References: [2] Chapter 22, [5] Week 9 167 | 168 | 169 | ### Data Science at Scale 170 | 171 | 1. Parallelism - dask and blaze 172 | 2. Spark and Mapreduce 173 | 3. AWS 174 | 4. Online learning and stochastic gradient descent 175 | 176 | References: [5] Week 10, [2] Chapter 24 177 | 178 | 179 | ### Deep Learning 180 | 181 | 1. Perceptron, simple networks, and backprop 182 | 2. GPUs 183 | 3. Tensorflow, Keras, Pytorch 184 | 4. CNNs 185 | 5. RNNs and word2vec 186 | 187 | References: [2] Chapter 18, [1] Chapters 9-16, [5] Weeks 4-5, [6] All Lessons 188 | 189 | 190 | ### Deploying Machine Learning 191 | 192 | 1. Docker 193 | 2. Flask and REST Endpoints 194 | 195 | 196 | ### Machine Learning in Industry 197 | 198 | 1. Guest lecturers / companies 199 | 2. Things Kaggle challenges won’t teach you 200 | 3. Why most data science initiatives fail 201 | 4. Importance of communication and collaboration 202 | 203 | 204 | ## University 205 | 206 | ### Honor Code 207 | In keeping with the principles of the BYU Honor Code, students are expected to be honest in all of their academic work. Academic honesty means, most fundamentally, that any work you present as your own must in fact be your own work and not that of another. Violations of this principle may result in a failing grade in the course and additional disciplinary action by the university. Students are also expected to adhere to the Dress and Grooming Standards. Adherence demonstrates respect for yourself and others and ensures an effective learning and working environment. It is the university's expectation, and my own expectation in class, that each student will abide by all Honor Code standards. Please call the Honor Code Office at 422-2847 if you have questions about those standards. 208 | 209 | ### Preventing & Responding to Sexual Misconduct 210 | In accordance with Title IX of the Education Amendments of 1972, Brigham Young University prohibits unlawful sex discrimination against any participant in its education programs or activities. 
The university also prohibits sexual harassment—including sexual violence—committed by or against students, university employees, and visitors to campus. As outlined in university policy, sexual harassment, dating violence, domestic violence, sexual assault, and stalking are considered forms of "Sexual Misconduct" prohibited by the university. University policy requires all university employees in a teaching, managerial, or supervisory role to report all incidents of Sexual Misconduct that come to their attention in any way, including but not limited to face-to-face conversations, a written class assignment or paper, class discussion, email, text, or social media post. Incidents of Sexual Misconduct should be reported to the Title IX Coordinator at t9coordinator@byu.edu or (801) 422-8692. Reports may also be submitted through EthicsPoint at https://titleix.byu.edu/report or 1-888-238-1062 (24-hours a day). BYU offers confidential resources for those affected by Sexual Misconduct, including the university’s Victim Advocate, as well as a number of non-confidential resources and services that may be helpful. Additional information about Title IX, the university’s Sexual Misconduct Policy, reporting requirements, and resources can be found at http://titleix.byu.edu or by contacting the university’s Title IX Coordinator. 211 | 212 | ### Student Disability 213 | Brigham Young University is committed to providing a working and learning atmosphere that reasonably accommodates qualified persons with disabilities. If you have any disability which may impair your ability to complete this course successfully, please contact the Services for Students with Disabilities Office (422-2767). Reasonable academic accommodations are reviewed for all students who have qualified, documented disabilities. Services are coordinated with the student and instructor by the SSD Office. If you need assistance or if you feel you have been unlawfully discriminated against on the basis of disability, you may seek resolution through established grievance policy and procedures by contacting the Equal Employment Office at 422-5895,D-285 ASB. 
214 | -------------------------------------------------------------------------------- /Syllabus.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/Syllabus.pdf -------------------------------------------------------------------------------- /cheat_sheets/git/figures/clone_screen.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/cheat_sheets/git/figures/clone_screen.png -------------------------------------------------------------------------------- /cheat_sheets/git/figures/create_branch_website.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/cheat_sheets/git/figures/create_branch_website.png -------------------------------------------------------------------------------- /cheat_sheets/git/figures/download.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/cheat_sheets/git/figures/download.png -------------------------------------------------------------------------------- /cheat_sheets/git/figures/example_notebook.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/cheat_sheets/git/figures/example_notebook.png -------------------------------------------------------------------------------- /cheat_sheets/git/figures/fork.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/cheat_sheets/git/figures/fork.png -------------------------------------------------------------------------------- /cheat_sheets/git/figures/git_commit.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/cheat_sheets/git/figures/git_commit.png -------------------------------------------------------------------------------- /cheat_sheets/git/figures/git_status.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/cheat_sheets/git/figures/git_status.png -------------------------------------------------------------------------------- /cheat_sheets/git/figures/init_profile.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/cheat_sheets/git/figures/init_profile.png -------------------------------------------------------------------------------- /cheat_sheets/git/figures/init_repo.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/cheat_sheets/git/figures/init_repo.png -------------------------------------------------------------------------------- /cheat_sheets/git/figures/jupyter.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/cheat_sheets/git/figures/jupyter.png -------------------------------------------------------------------------------- /cheat_sheets/git/figures/jupyter_dir.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/cheat_sheets/git/figures/jupyter_dir.png -------------------------------------------------------------------------------- /cheat_sheets/git/figures/jupyter_dropdown.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/cheat_sheets/git/figures/jupyter_dropdown.png -------------------------------------------------------------------------------- /cheat_sheets/git/figures/jupyter_shutdown.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/cheat_sheets/git/figures/jupyter_shutdown.png -------------------------------------------------------------------------------- /cheat_sheets/git/figures/new_repo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/cheat_sheets/git/figures/new_repo.png -------------------------------------------------------------------------------- /cheat_sheets/git/figures/open_pull_request.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/cheat_sheets/git/figures/open_pull_request.png -------------------------------------------------------------------------------- /cheat_sheets/git/figures/open_pull_request2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/cheat_sheets/git/figures/open_pull_request2.png -------------------------------------------------------------------------------- /cheat_sheets/git/figures/pull_request.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/cheat_sheets/git/figures/pull_request.png -------------------------------------------------------------------------------- /cheat_sheets/git/figures/pull_request2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/cheat_sheets/git/figures/pull_request2.png -------------------------------------------------------------------------------- 
/cheat_sheets/git/figures/rename_notebook.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/cheat_sheets/git/figures/rename_notebook.png -------------------------------------------------------------------------------- /cheat_sheets/git/figures/repo_screen.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/cheat_sheets/git/figures/repo_screen.png -------------------------------------------------------------------------------- /cheat_sheets/git/figures/updated_repo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/cheat_sheets/git/figures/updated_repo.png -------------------------------------------------------------------------------- /cheat_sheets/git/git_cheat_sheet.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/cheat_sheets/git/git_cheat_sheet.pdf -------------------------------------------------------------------------------- /cheat_sheets/git/git_cheat_sheet.tex: -------------------------------------------------------------------------------- 1 | \documentclass[11pt,a4paper]{article} 2 | 3 | \usepackage[utf8]{inputenc} 4 | \usepackage{url} 5 | \usepackage[margin=1in]{geometry} 6 | \usepackage{graphicx} 7 | 8 | \title{GitHub Cheat Sheet} 9 | \date{} 10 | \author{Econ 213R - Applied Machine Learning} 11 | 12 | \begin{document} 13 | \vspace*{-75pt} 14 | {\let\newpage\relax\maketitle} 15 | 16 | \tableofcontents 17 | 18 | \section{What is GitHub?} 19 | GitHub is a place where developers can store projects and collaborate with other developers to create and update code. 20 | Below is a list of terms you should know before you get started with GitHub. 21 | 22 | \begin{itemize} 23 | \item \textbf{repository}: place where all folders and files for a project are stored 24 | \item \textbf{push}: update changes to a repository 25 | \item \textbf{pull}: retrieve and merge changes to a repository 26 | \item \textbf{commit}: an individual revision to a file or a set of files 27 | \item \textbf{branch}: a parallel version of the repository contained within the repository but not affecting the primary branch (which is most often the master branch) 28 | \item \textbf{clone}: a copy of a repository on your local machine 29 | \item \textbf{fork}: a personal copy of another user's repository 30 | \end{itemize} 31 | 32 | Some of these definitions came from \url{https://help.github.com/articles/github-glossary/}, a useful guide for many GitHub terms. 33 | 34 | \section{GitHub Registration} 35 | You need an account with GitHub. 36 | Go to \url{www.github.com}. 37 | Sign up from the home page and follow the instructions. 38 | Note that private repositories require monthly payments; however, you can have an unlimited number of public repositories for free. 39 | Once you sign up, you are ready to create and fork repositories. 40 | 41 | \section{Creating Repositories} 42 | Navigate to your GitHub profile. 43 | It should look something like Figure \ref{fig:init-prof}. 44 | 45 | \begin{figure}[h!] 
46 | \centering 47 | \includegraphics[width=.7\textwidth]{figures/init_profile.png} 48 | \caption{Profile Screen} 49 | \label{fig:init-prof} 50 | \end{figure} 51 | 52 | Click the ``Repositories" tab. 53 | Click the button that says ``New" (see Figure \ref{fig:new-repo}). 54 | 55 | \begin{figure}[h!] 56 | \centering 57 | \includegraphics[width=.7\textwidth]{figures/new_repo.png} 58 | \caption{New Repository} 59 | \label{fig:new-repo} 60 | \end{figure} 61 | 62 | Name your repository. 63 | It is customary to use dashes instead of underscores in repository names. 64 | Make sure to initialize the repository with a README.md (see Figure \ref{fig:init-repo}). 65 | If you accidentally forget to do this, that's okay; you can always add a README later. 66 | A README is a short description of the repository. 67 | 68 | \begin{figure}[h!] 69 | \centering 70 | \includegraphics[width=.7\textwidth]{figures/init_repo.png} 71 | \caption{Initialize the repository.} 72 | \label{fig:init-repo} 73 | \end{figure} 74 | 75 | Click ``Create Repository" at the bottom of the page. 76 | You should see a page that looks like Figure \ref{fig:repo-screen}. 77 | 78 | \begin{figure}[h!] 79 | \centering 80 | \includegraphics[width=.7\textwidth]{figures/repo_screen.png} 81 | \caption{Repository screen.} 82 | \label{fig:repo-screen} 83 | \end{figure} 84 | 85 | Congratulations! 86 | You have just created your first GitHub repository. 87 | 88 | \section{Making Changes to Repositories} 89 | There are a few ways to upload and change files in a GitHub repository. 90 | The most common way to modify repositories is via the command line; see Section \ref{command-line} for command line instructions. 91 | It is highly, \textit{highly} recommended that you learn the command line to modify repositories. 92 | The command line allows you to more freedom to modify your repositories and branches, and can streamline your workflow. 93 | However, you can use the GitHub website to upload and download files to and from repositories. 94 | You can also modify files using the website. 95 | Note that you can't run Jupyter notebooks on the GitHub website, and when you try to edit Jupyter notebooks on the website, you have to comb through lots of nasty JSON lingo. 96 | If you choose to use the website to make changes to the GitHub repository, you will have to download the Jupyter notebook, modify it on a local machine, and upload it back to the repository. 97 | Instructions for using the website are in Section \ref{website}. 98 | 99 | If you do not have Git installed on your local machine, use the links below to download Git. 100 | \begin{itemize} 101 | \item[] Macs: \url{https://sourceforge.net/projects/git-osx-installer/files/} 102 | \item[] Windows: \url{http://gitforwindows.org/} 103 | \item[] Linux: In the Terminal, first type \texttt{sudo apt-get update} and hit Enter. Then type \texttt{sudo apt-get install git} and hit Enter. 104 | \end{itemize} 105 | 106 | \subsection{Using Command Line} \label{command-line} 107 | Let's walk through an example together. 108 | In this example, we will configure Git on a local machine, clone the repository to a local machine, create a Jupyter notebook, add it to the repository, and push the changes. 109 | 110 | First, set your username and email in Git. 111 | Use the commands below to do this. 
112 | 113 | \begin{itemize} 114 | \item[] \texttt{git config --global user.name "Your Name Here"} 115 | \item[] \texttt{git config --global user.email "your\_email@website.com"} 116 | \end{itemize} 117 | 118 | Now Git is configured on your machine, and you are ready to clone the repository you made. 119 | First, go to the repository screen on the GitHub website (Figure \ref{fig:repo-screen}). 120 | Click the button that says ``Clone or download". 121 | A drop-down window should appear that contains a link to your repository (see Figure \ref{fig:clone-screen}). 122 | Copy the link in the drop-down window. 123 | 124 | \begin{figure}[h!] 125 | \centering 126 | \includegraphics[width=.7\textwidth]{figures/clone_screen.png} 127 | \caption{Jupyter notebook screen.} 128 | \label{fig:clone-screen} 129 | \end{figure} 130 | 131 | Now open your command line. 132 | Navigate to the location where you want your repository to be (for example, I store all of my files on my desktop, so I would navigate to my desktop from my command line). 133 | Once you have navigated to the location, type the following command: \texttt{git clone link\_to\_your\_repo}, where \texttt{link\_to\_your\_repo} is the link you copied above.. 134 | When your computer is done downloading the repository, you should see a folder with the repository name in the location where you wanted it. 135 | Now you are ready to make some changes and push them to your master branch. 136 | 137 | \subsubsection{Making Changes} 138 | For this example, we will make a Jupyter notebook and push it to the master branch. 139 | In the command line, navigate to your repository if you're not there already. 140 | Type \texttt{jupyter notebook} and hit enter. 141 | A web browser should open up with a page that looks like Figure \ref{fig:jupyter-dir}. 142 | If a web browser did not automatically open, the command line should have provided a URL that you can copy and past directly into a browser to access the Jupyter notebook. 143 | 144 | \begin{figure}[h!] 145 | \centering 146 | \includegraphics[width=.6\textwidth]{figures/jupyter_dir.png} 147 | \caption{Jupyter notebook main screen.} 148 | \label{fig:jupyter-dir} 149 | \end{figure} 150 | 151 | In the upper right-hand corner, click ``New". 152 | A dropdown menu will appear. 153 | Under the ``Notebooks" heading, click on Python 3 (see Figure \ref{fig:jupyter-dropdown}). 154 | When you click ``Python 3", a new window will open in your browser with a screen that looks like Figure \ref{fig:jupyter}. 155 | 156 | \begin{figure}[h!] 157 | \centering 158 | \includegraphics[width=.2\textwidth]{figures/jupyter_dropdown.png} 159 | \caption{Create a Python 3 Jupyter notebook.} 160 | \label{fig:jupyter-dropdown} 161 | \end{figure} 162 | 163 | \begin{figure}[h!] 164 | \centering 165 | \includegraphics[width=.6\textwidth]{figures/jupyter.png} 166 | \caption{A Jupyter notebook.} 167 | \label{fig:jupyter} 168 | \end{figure} 169 | 170 | Change the name of the notebook from``Untitled" to ``example". 171 | To do this, click the ``Untitled" text at the top of the notebook. 172 | A window will appear; replace the ``Untitled" text with ``example". 173 | See Figure \ref{fig:rename-notebook}. 174 | 175 | \begin{figure}[h!] 176 | \centering 177 | \includegraphics[width=.7\textwidth]{figures/rename_notebook.png} 178 | \caption{Rename the notebook.} 179 | \label{fig:rename-notebook} 180 | \end{figure} 181 | 182 | In the first cell, type \texttt{print("Hello World")}. 183 | Then press Shift+Enter. 184 | This key combination runs the cell. 
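If the cell ran correctly, the text \texttt{Hello World} is printed directly below it.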
185 | A new cell will automatically appear below the cell with the print statement. 186 | To save the notebook, hit Command+S. 187 | Your notebook should now look like Figure \ref{fig:example-notebook}. 188 | 189 | \begin{figure}[h!] 190 | \centering 191 | \includegraphics[width=.7\textwidth]{figures/example_notebook.png} 192 | \caption{A Jupyter notebook.} 193 | \label{fig:example-notebook} 194 | \end{figure} 195 | 196 | Now navigate to the command line where you launched Jupyter. 197 | To stop the Jupyter server from running, hit Ctrl+c twice. 198 | This will shut down the Jupyter notebook. 199 | If you didn't exit the browser window yet, it will look like Figure \ref{fig:jupyter-shutdown} after you shut down the Jupyter server. 200 | You are safe to close all browser windows related to this Jupyter notebook. 201 | 202 | \begin{figure}[h!] 203 | \centering 204 | \includegraphics[width=.7\textwidth]{figures/jupyter_shutdown.png} 205 | \caption{Shutting down a Jupyter notebook.} 206 | \label{fig:jupyter-shutdown} 207 | \end{figure} 208 | 209 | \subsubsection{Committing Changes} 210 | In the command prompt, navigate to the repository, if you aren't already there. 211 | Type \texttt{git status} and hit enter. 212 | You should see a message that looks similar to Figure \ref{fig:git-status}. 213 | 214 | \begin{figure}[h!] 215 | \centering 216 | \includegraphics[width=.7\textwidth]{figures/git_status.png} 217 | \caption{\texttt{git status}} 218 | \label{fig:git-status} 219 | \end{figure} 220 | 221 | You want to add the notebook you just created to the Git index so Git knows to keep track of this file and any future changes made to it. 222 | To do this, type \texttt{git add example.ipynb} and hit enter. 223 | Now we are going to commit this change to the branch. 224 | Then, type \texttt{git commit -m "example notebook"}. 225 | A message that looks like Figure \ref{fig:git-commit} will appear. 226 | Note that you can type any descriptive message you want in the quotes. 227 | These messages are to help you keep track of the changes you make in each commit. 228 | Now we want to push these changes up the master branch, so that any other machines with this repository can pull the changes you made. 229 | Type \texttt{git push origin master} and hit enter. 230 | Congratulations! 231 | You have just pushed your first change to your GitHub repository. 232 | If you navigate to the web page of your repository, you should see the example.ipynb file there, as in Figure \ref{fig:updated-repo}. 233 | 234 | \begin{figure}[h!] 235 | \centering 236 | \includegraphics[width=.4\textwidth]{figures/git_commit.png} 237 | \caption{\texttt{git commit}} 238 | \label{fig:git-commit} 239 | \end{figure} 240 | 241 | \begin{figure}[h!] 242 | \centering 243 | \includegraphics[width=.7\textwidth]{figures/updated_repo.png} 244 | \caption{Updated Repository} 245 | \label{fig:updated-repo} 246 | \end{figure} 247 | 248 | \subsubsection{Pulling Changes} 249 | If you ever push a change to a repository branch on one machine, and you have the repository branch on another machine, you can still access the changes you made elsewhere. 250 | Type \texttt{git pull origin branch\_name} to retrieve changes you have made to the branch elsewhere. 251 | 252 | \subsection{Using the Website} \label{website} 253 | \subsubsection{Downloading Files} 254 | To download a file from the website of your repository, find the file you wish to download, and click on its name. 255 | Click ``Raw" (Figure \ref{fig:download}). 
256 | This will take you to a new web page with the raw version of the file you want to download (for example, if you are trying to download a Jupyter notebook, you will see file in JSON format). 257 | Control-click on this web page and select ``Save As...". 258 | You can save the file anywhere you'd like on your local machine. 259 | 260 | \begin{figure}[h!] 261 | \centering 262 | \includegraphics[width=.7\textwidth]{figures/download.png} 263 | \caption{Download files from the website.} 264 | \label{fig:download} 265 | \end{figure} 266 | 267 | \subsubsection{Uploading Files} 268 | To upload files to your repository using the website, navigate to the main repository page and click ``Upload Files". 269 | You can drag and drop the files you wish to upload or use a navigation tool to find them on your local machine. 270 | Then you have two options: you can either commit those changes directly to the branch you are in, or you can commit those changes to a new branch. 271 | 272 | \section{Forking Repositories} 273 | Navigate to the class repository. 274 | The link is \url{https://github.com/tfolkman/byu_econ_applied_machine_learning}. 275 | Click the ``Fork" button in the upper-left corner (Figure \ref{fig:fork}). 276 | Now you have forked the repository, and it will show up in your list of repositories. 277 | 278 | \begin{figure}[h!] 279 | \centering 280 | \includegraphics[width=.7\textwidth]{figures/fork.png} 281 | \caption{Fork a Repository} 282 | \label{fig:fork} 283 | \end{figure} 284 | 285 | \subsection{Updating a Fork} 286 | 287 | Suppose a contributor of the original repository (the one from which you forked) makes a change, and you want to update your fork with that change. 288 | To make sure you are tracking the correct remote branch, type \texttt{git remote add upstream https://github.com/tfolkman/byu\_econ\_applied\_machine\_learning.git} in the command line. 289 | This adds a remote (a pointer or link) to the class repository called \texttt{upstream}. 290 | To update your fork's master branch, use the command \texttt{git pull upstream master} (make sure you are on the master branch first!). 291 | Now your forked repository should reflect the changes made in the original repository. 292 | 293 | \section{Deleting Repositories} 294 | \textbf{PROCEED WITH CAUTION.} 295 | This action cannot be undone. 296 | 297 | To delete a repository, first navigate to the repository's main page. 298 | Click ``Settings". 299 | Scroll down to the bottom of the screen until you reach the ``Danger Zone". 300 | Click ``Delete the repository". 301 | You will have to type the name of the repository to verify that you want to completely delete it. 302 | Then you can erase it from your GitHub. 303 | 304 | \section{Appendix} 305 | 306 | \subsection{List of Git Commands} \label{git-commands} 307 | 308 | \begin{itemize} 309 | \item \texttt{git clone repository\_link}: Clone the repository to a local machine. 310 | \item \texttt{git status}: Display the current status of the Git index. 311 | \item \texttt{git add filename}: Add filename to Git index to be tracked. 312 | \item \texttt{git branch}: Check which branch you are currently on 313 | \item \texttt{git commit -m "message"}: Commit the changes to the file or files. 314 | \item \texttt{git push origin branch\_name}: Submit the changes to the branch. 315 | \item \texttt{git pull origin branch\_name}: Retrieve changes to the branch. 316 | \item \texttt{git checkout -b branch\_name}: Create and switch to a new branch. 
317 | \item \texttt{git checkout branch\_name}: Switch to an existing branch. 318 | \item \texttt{git branch -d branch\_name}: Delete an existing branch. 319 | \item \texttt{git remote add upstream repository\_link}: Add a remote to a repository. 320 | \item \texttt{git pull upstream master}: Update whatever branch you're on in your forked repository to reflect the changes in the master branch of the repository you forked. 321 | \end{itemize} 322 | 323 | \subsection{Making New Branches} 324 | 325 | \subsubsection{Command Line} 326 | You might want to make different branches for different tasks. 327 | To switch to a new, nonexistent branch, type \texttt{git checkout -b branch\_name}, where ``branch\_name" is the name you want to give the new branch. 328 | Now you can use the previous steps to add, commit, and push changes. 329 | The only difference is when you push, you will type \texttt{git push origin branch\_name} (instead of pushing to the master branch). 330 | To navigate to an existing branch, type \texttt{git checkout branch\_name}, where ``branch\_name" is the name of the existing branch. 331 | To delete an existing branch, type \texttt{git branch -d branch\_name}. 332 | Be careful with this last command. 333 | 334 | \subsubsection{Using the Website} 335 | To make a new branch on the GitHub website, navigate to the main repository page. 336 | Click ``Branch: master". 337 | A dropdown menu should appear (Figure \ref{fig:create-branch}). 338 | Start typing the name of the branch you wish to create in the text field. 339 | An option will appear that says ``Create branch". 340 | Click it, and you will now be on the main page for that branch. 341 | You can use that dropdown menu to switch between branches. 342 | 343 | \begin{figure}[h!] 344 | \centering 345 | \includegraphics[width=.4\textwidth]{figures/create_branch_website.png} 346 | \caption{Create a Branch} 347 | \label{fig:create-branch} 348 | \end{figure} 349 | 350 | 351 | \subsection{Pull Requests} 352 | If you committed on a branch other than master, and you want to merge these changes into master, you can submit what is called a pull request. 353 | Pull requests are very useful when you collaborate on a project with other people. 354 | Pull requests allow others to see the differences between the changes you made and the branch with which you are requesting to merge. 355 | To submit a pull request, navigate to the main screen of your repository on the GitHub website. 356 | Click the ``Create Pull Request" button (Figure \ref{fig:pull-request}). 357 | 358 | \begin{figure}[h!] 359 | \centering 360 | \includegraphics[width=.7\textwidth]{figures/pull_request.png} 361 | \caption{Pull Request} 362 | \label{fig:pull-request} 363 | \end{figure} 364 | 365 | This will bring you to a new screen that looks like Figure \ref{fig:pull-request2}. 366 | 367 | \begin{figure}[h!] 368 | \centering 369 | \includegraphics[width=.7\textwidth]{figures/pull_request2.png} 370 | \caption{Pull Request} 371 | \label{fig:pull-request2} 372 | \end{figure} 373 | 374 | The base branch is the branch that does not have the changes you made. 375 | The compare branch is the branch with the changes you want merged into the base branch. 376 | For example, if you made a branch for data cleaning called data\_cleaning, and you want those changes to be reflected in the master branch, you would choose the base branch to be master, and the compare branch to be data\_cleaning. 377 | Once you have chosen the appropriate branches, click ``Create Pull Request".
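Note that a branch only shows up in these drop-down menus once it has been pushed to GitHub; if you created the compare branch locally, push it first with, for example, \texttt{git push origin data\_cleaning}.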
378 | Then you will see a page that looks like Figure \ref{fig:open-pull-request}. 379 | 380 | \begin{figure}[h!] 381 | \centering 382 | \includegraphics[width=.7\textwidth]{figures/open_pull_request.png} 383 | \caption{Pull Request} 384 | \label{fig:open-pull-request} 385 | \end{figure} 386 | 387 | In the ``Write" section, it is helpful if you describe what you changed. 388 | In a collaboration project with dozens of files and thousands of lines of code, it is helpful to other reviewers to see an outline of what you changed at a glance. 389 | If you scroll down, you will see sections that describe what has been changed and what is different between the two branches. 390 | You can request reviewers, and you and other collaborators can make comments about changes. 391 | When you are ready to make the pull request official, click ``Create Pull Request". 392 | After clicking this, you will see a web page that looks like Figure \ref{fig:open-pull-request2}. 393 | 394 | \begin{figure}[h!] 395 | \centering 396 | \includegraphics[width=.7\textwidth]{figures/open_pull_request2.png} 397 | \caption{Pull Request} 398 | \label{fig:open-pull-request2} 399 | \end{figure} 400 | 401 | When you and/or your collaborators have reviewed your changes and are ready to merge them to the base branch (and the compare branch has no conflicts with the base branch), click ``Merge Pull Request". 402 | After the branches are merged, you are safe to delete the compare branch. 403 | 404 | \end{document} 405 | -------------------------------------------------------------------------------- /homeworks/.ipynb_checkpoints/Final Project-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Final Project\n", 8 | "\n", 9 | "The final project is your opportunity to do your own data project end to end and present the results to some local companies.\n", 10 | "\n", 11 | "### Rubric\n", 12 | "\n", 13 | "Code Quality - 20%\n", 14 | "\n", 15 | "Describing, cleaning, and visualizing data - 25%\n", 16 | "\n", 17 | "Modeling - 25%\n", 18 | "\n", 19 | "Presentation / Storytelling - 30%" 20 | ] 21 | } 22 | ], 23 | "metadata": { 24 | "kernelspec": { 25 | "display_name": "Python 3", 26 | "language": "python", 27 | "name": "python3" 28 | }, 29 | "language_info": { 30 | "codemirror_mode": { 31 | "name": "ipython", 32 | "version": 3 33 | }, 34 | "file_extension": ".py", 35 | "mimetype": "text/x-python", 36 | "name": "python", 37 | "nbconvert_exporter": "python", 38 | "pygments_lexer": "ipython3", 39 | "version": "3.5.3" 40 | } 41 | }, 42 | "nbformat": 4, 43 | "nbformat_minor": 2 44 | } 45 | -------------------------------------------------------------------------------- /homeworks/.ipynb_checkpoints/Homework_1-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Data Cleaning, Describing, and Visualization\n", 8 | "\n", 9 | "### Step 1 - Get your environment setup\n", 10 | "\n", 11 | "1. Install Git on your computer and fork the class repository on [Github](https://github.com/tfolkman/byu_econ_applied_machine_learning).\n", 12 | "2. Install [Anaconda](https://conda.io/docs/install/quick.html) and get it working.\n", 13 | "\n", 14 | "### Step 2 - Explore Datasets\n", 15 | "\n", 16 | "The goals of this project are:\n", 17 | "\n", 18 | "1. 
Read in data from multiple sources\n", 19 | "2. Gain practice cleaning, describing, and visualizing data\n", 20 | "\n", 21 | "To this end, you need to find from three different sources. For example: CSV, JSON, and API, SQL, or web scraping. For each of these data sets, you must perform the following:\n", 22 | "\n", 23 | "1. Data cleaning. Some options your might consider: handle missing data, handle outliers, scale the data, convert some data to categorical.\n", 24 | "2. Describe data. Provide tables, statistics, and summaries of your data.\n", 25 | "3. Visualize data. Provide visualizations of your data.\n", 26 | "\n", 27 | "These are the typical first steps of any data science project and are often the most time consuming. My hope is that in going through this process 3 different times, that you will gain a sense for it.\n", 28 | "\n", 29 | "Also, as you are doing this, please tell us a story. Explain in your notebook why are doing what you are doing and to what end. Telling a story in your analysis is a crucial skill for data scientists. There are almost an infinite amount of ways to analyze a data set; help us understand why you choose your particular path and why we should care.\n", 30 | "\n", 31 | "Also - this homework is very open-ended and we provided you with basically no starting point. I realize this increases the difficulty and complexity, but I think it is worth it. It is much closer to what you might experience in industry and allows you to find data that might excite you!\n", 32 | "\n", 33 | "### Grading\n", 34 | "\n", 35 | "This homework is due **Jan. 31, 2019 by 4:00pm Utah time.** By that time, you need to have committed all your code to your github and submitted a link to your work to the TA. We can see on your Github account when you last committed code. :)\n", 36 | "\n", 37 | "Rubric:\n", 38 | "\n", 39 | "* Code Quality - 10%\n", 40 | "* Ability to read in data - 10%\n", 41 | "* Ability to describe data - 20%\n", 42 | "* Ability to visualize data - 20%\n", 43 | "* Ability to clean data - 20%\n", 44 | "* Storytelling - 20%" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": null, 50 | "metadata": {}, 51 | "outputs": [], 52 | "source": [] 53 | } 54 | ], 55 | "metadata": { 56 | "kernelspec": { 57 | "display_name": "Python 3", 58 | "language": "python", 59 | "name": "python3" 60 | }, 61 | "language_info": { 62 | "codemirror_mode": { 63 | "name": "ipython", 64 | "version": 3 65 | }, 66 | "file_extension": ".py", 67 | "mimetype": "text/x-python", 68 | "name": "python", 69 | "nbconvert_exporter": "python", 70 | "pygments_lexer": "ipython3", 71 | "version": "3.6.6" 72 | } 73 | }, 74 | "nbformat": 4, 75 | "nbformat_minor": 2 76 | } 77 | -------------------------------------------------------------------------------- /homeworks/.ipynb_checkpoints/Homework_2-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Introduction to Modeling with Python\n", 8 | "\n", 9 | "Now that we have seen some examples of modeling and using Python for modeling, we wanted to give you a chance to try your hand!\n", 10 | "\n", 11 | "To that goal, we choose a well structured problem with plenty of resources online to help you along the way. 
That problem is predicting housing prices and is hosted on Kaggle:\n", 12 | "\n", 13 | "https://www.kaggle.com/c/house-prices-advanced-regression-techniques\n", 14 | "\n", 15 | "First, make sure you are signed up on Kaggle and then download the data:\n", 16 | "\n", 17 | "https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data\n", 18 | "\n", 19 | "The data includes both testing and training sets as well as a sample submission file. \n", 20 | "\n", 21 | "Your goal is the predict the sales price for each house where root mean squared error is the evaluation metric. To get some ideas on where to start, feel free to check out Kaggle Kernels:\n", 22 | "\n", 23 | "https://www.kaggle.com/c/house-prices-advanced-regression-techniques/kernels\n", 24 | "\n", 25 | "And the discussion board:\n", 26 | "\n", 27 | "https://www.kaggle.com/c/house-prices-advanced-regression-techniques/discussion\n", 28 | "\n", 29 | "Again - the goal of this homework is to get you exposed to modeling with Python. Feel free to use online resources to help guide you, but we expect original thought as well. Our hope is by the end of this homework you will feel comfortable exploring data in Python and building models to make predictions. Also please submit your test results to Kaggle and let us know your ranking and score!\n", 30 | "\n", 31 | "\n", 32 | "### Grading\n", 33 | "\n", 34 | "This homework is due **Feb. 21, 2019 by 4:00pm Utah time.** By that time, you need to have committed all your code to your github and submitted a link to your work to the TA. We can see on your Github account when you last committed code. :)\n", 35 | "\n", 36 | "Rubric:\n", 37 | "\n", 38 | "* Code Quality - 10%\n", 39 | "* Storytelling - 10%\n", 40 | "* Result on Kaggle - 5%\n", 41 | "* Describing, Cleaning, and Visualizing data - 25%\n", 42 | "* Modeling - 50%\n", 43 | "\n", 44 | "More specifically, for modeling we will look for: \n", 45 | "\n", 46 | "* Model Selection: Did you try multiple models? Why did you choose these models? How do they work? What are they assumptions? And how did you test/account for them? How did you select hyper-parameters?\n", 47 | "* Model interpretation: What do the model results tell you? Which variables are important? High bias or variance and how did you / could you fix this? How confident are you in your results? \n", 48 | "* Model usefulness: Do you think your final model was useful? If so, how would you recommend using it? Convince us, that if we were a company, we would feel comfortable using your model with our users. Think about edge cases as well - are there certain areas that the model performs poorly on? Best on? How would you handle these cases, if say Zillow wanted to leverage your model realizing that bad recommendations on sale prices would hurt customer trust and your brand. This section also falls into the storytelling aspect of the grading." 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "metadata": {}, 54 | "source": [ 55 | "## Introduction to Modeling with Python\n", 56 | "\n", 57 | "Now that we have seen some examples of modeling and using Python for modeling, we wanted to give you a chance to try your hand!\n", 58 | "\n", 59 | "To that goal, we choose a well structured problem with plenty of resources online to help you along the way. 
That problem is predicting housing prices and is hosted on Kaggle:\n", 60 | "\n", 61 | "https://www.kaggle.com/c/house-prices-advanced-regression-techniques\n", 62 | "\n", 63 | "First, make sure you are signed up on Kaggle and then download the data:\n", 64 | "\n", 65 | "https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data\n", 66 | "\n", 67 | "The data includes both testing and training sets as well as a sample submission file. \n", 68 | "\n", 69 | "Your goal is the predict the sales price for each house where root mean squared error is the evaluation metric. To get some ideas on where to start, feel free to check out Kaggle Kernels:\n", 70 | "\n", 71 | "https://www.kaggle.com/c/house-prices-advanced-regression-techniques/kernels\n", 72 | "\n", 73 | "And the discussion board:\n", 74 | "\n", 75 | "https://www.kaggle.com/c/house-prices-advanced-regression-techniques/discussion\n", 76 | "\n", 77 | "Again - the goal of this homework is to get you exposed to modeling with Python. Feel free to use online resources to help guide you, but we expect original thought as well. Our hope is by the end of this homework you will feel comfortable exploring data in Python and building models to make predictions. Also please submit your test results to Kaggle and let us know your ranking and score!\n", 78 | "\n", 79 | "\n", 80 | "### Grading\n", 81 | "\n", 82 | "This homework is due **Feb. 20, 2018 by 3:00pm Utah time.** By that time, you need to have committed all your code to your github and submitted a link to your work to the TA. We can see on your Github account when you last committed code. :)\n", 83 | "\n", 84 | "Rubric:\n", 85 | "\n", 86 | "* Code Quality - 10%\n", 87 | "* Storytelling - 10%\n", 88 | "* Result on Kaggle - 5%\n", 89 | "* Describing, Cleaning, and Visualizing data - 25%\n", 90 | "* Modeling - 50%\n", 91 | "\n", 92 | "More specifically, for modeling we will look for: \n", 93 | "\n", 94 | "* Model Selection: Did you try multiple models? Why did you choose these models? How do they work? What are they assumptions? And how did you test/account for them? How did you select hyper-parameters?\n", 95 | "* Model interpretation: What do the model results tell you? Which variables are important? High bias or variance and how did you / could you fix this? How confident are you in your results? \n", 96 | "* Model usefulness: Do you think your final model was useful? If so, how would you recommend using it? Convince us, that if we were a company, we would feel comfortable using your model with our users. Think about edge cases as well - are there certain areas that the model performs poorly on? Best on? How would you handle these cases, if say Zillow wanted to leverage your model realizing that bad recommendations on sale prices would hurt customer trust and your brand. This section also falls into the storytelling aspect of the grading." 
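To make the starting point a little less blank, here is one minimal way to get a first model and a cross-validated RMSE. This is only a sketch of the workflow, not a recommended solution: it assumes the competition's train.csv has been downloaded into the working directory with the usual SalePrice target column, the feature handling is deliberately naive (numeric columns only, median imputation), and modeling the log of the price is just a common choice for this competition.

```python
# A minimal first-pass sketch (assumes train.csv from the Kaggle competition
# is in the working directory and "SalePrice" is the target column).
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

train = pd.read_csv("train.csv")

# Deliberately naive features: numeric columns only, median-imputed.
X = train.drop(columns=["SalePrice"]).select_dtypes(include=[np.number])
X = X.fillna(X.median())
y = np.log1p(train["SalePrice"])  # modeling the log of the price is a common choice here

model = RandomForestRegressor(n_estimators=200, random_state=42)
neg_mse = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5)
print("CV RMSE on the log target: {:.4f}".format(np.sqrt(-neg_mse.mean())))
```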
97 | ] 98 | } 99 | ], 100 | "metadata": { 101 | "kernelspec": { 102 | "display_name": "Python 3", 103 | "language": "python", 104 | "name": "python3" 105 | }, 106 | "language_info": { 107 | "codemirror_mode": { 108 | "name": "ipython", 109 | "version": 3 110 | }, 111 | "file_extension": ".py", 112 | "mimetype": "text/x-python", 113 | "name": "python", 114 | "nbconvert_exporter": "python", 115 | "pygments_lexer": "ipython3", 116 | "version": "3.6.6" 117 | } 118 | }, 119 | "nbformat": 4, 120 | "nbformat_minor": 2 121 | } 122 | -------------------------------------------------------------------------------- /homeworks/.ipynb_checkpoints/Homework_3-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Project Abstract\n", 8 | "\n", 9 | "To help people get started on their project and to make sure you are selecting an appropriate task, we will have all the teams submit an abstract. Please only submit one abstract per team.\n", 10 | "\n", 11 | "The abstract should include (at least):\n", 12 | "\n", 13 | "* Team members\n", 14 | "* Problem statement\n", 15 | "* Data you will use to solve the problem\n", 16 | "* Outline of how you plan on solving the problem with the data. For example, what pre-processing steps might you need to do, what models, etc. \n", 17 | "* Supporting documents if necessary citing past research in the area and methods used to solve the problem.\n", 18 | "\n", 19 | "The goal of this abstract is for you to think deeply about the project you will be undertaking and convince yourself (and us) that it is a meaningful and achievable project for this class.\n", 20 | "\n", 21 | "This homework is due **February 28, 2018 by 4:00 pm Utah time.** and will be submitted on learning suite." 22 | ] 23 | } 24 | ], 25 | "metadata": { 26 | "kernelspec": { 27 | "display_name": "Python 3", 28 | "language": "python", 29 | "name": "python3" 30 | }, 31 | "language_info": { 32 | "codemirror_mode": { 33 | "name": "ipython", 34 | "version": 3 35 | }, 36 | "file_extension": ".py", 37 | "mimetype": "text/x-python", 38 | "name": "python", 39 | "nbconvert_exporter": "python", 40 | "pygments_lexer": "ipython3", 41 | "version": "3.6.6" 42 | } 43 | }, 44 | "nbformat": 4, 45 | "nbformat_minor": 2 46 | } 47 | -------------------------------------------------------------------------------- /homeworks/.ipynb_checkpoints/Homework_4-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Classification with Python\n", 8 | "\n", 9 | "Hopefully now you are feeling a bit more comfortable with Python, Kaggle, and modeling. \n", 10 | "\n", 11 | "This next homework will test your classification abilities. We will be trying to predict the poverty levels of various households in Costa Rica:\n", 12 | "\n", 13 | "https://www.kaggle.com/c/costa-rican-household-poverty-prediction\n", 14 | "\n", 15 | "The evalution metric for Kaggle is accuracy, but please also explore how well your model does on multiple metrics like F1, precision, recall, and area under the ROC curve.\n", 16 | "\n", 17 | "### Grading\n", 18 | "\n", 19 | "This homework is due **March 21, 2019 by 4:00pm Utah time.** By that time, you need to have committed all your code to your github and submitted a link to your work to the TA. 
We can see on your Github account when you last committed code. :)\n", 20 | "\n", 21 | "Rubric:\n", 22 | "\n", 23 | "* Code Quality - 10%\n", 24 | "* Storytelling - 10%\n", 25 | "* Result on Kaggle - 5%\n", 26 | "* Describing, Cleaning, and Visualizing data - 25%\n", 27 | "* Modeling - 50%\n", 28 | "\n", 29 | "More specifically, for modeling we will look for: \n", 30 | "\n", 31 | "* Model Selection: Did you try multiple models? Why did you choose these models? How do they work? What are they assumptions? And how did you test/account for them? How did you select hyper-parameters?\n", 32 | "* Model evaluation: Did you evaluate your model on multiple metrics? Where does your model do well? Where could it be improved? How are the metrics different?\n", 33 | "* Model interpretation: What do the model results tell you? Which variables are important? High bias or variance and how did you / could you fix this? How confident are you in your results? \n", 34 | "* Model usefulness: Do you think your final model was useful? If so, how would you recommend using it? Convince us, that if we were a company, we would feel comfortable using your model with our users. Think about edge cases as well - are there certain areas that the model performs poorly on? Best on? How would you handle these cases, if say Zillow wanted to leverage your model realizing that bad recommendations on sale prices would hurt customer trust and your brand. This section also falls into the storytelling aspect of the grading." 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": null, 40 | "metadata": {}, 41 | "outputs": [], 42 | "source": [] 43 | } 44 | ], 45 | "metadata": { 46 | "kernelspec": { 47 | "display_name": "Python 3", 48 | "language": "python", 49 | "name": "python3" 50 | }, 51 | "language_info": { 52 | "codemirror_mode": { 53 | "name": "ipython", 54 | "version": 3 55 | }, 56 | "file_extension": ".py", 57 | "mimetype": "text/x-python", 58 | "name": "python", 59 | "nbconvert_exporter": "python", 60 | "pygments_lexer": "ipython3", 61 | "version": "3.6.6" 62 | } 63 | }, 64 | "nbformat": 4, 65 | "nbformat_minor": 2 66 | } 67 | -------------------------------------------------------------------------------- /homeworks/.ipynb_checkpoints/Homework_5-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Dimensionality Reduction and Clustering\n", 8 | "\n", 9 | "For this homework we will be using some image data! Specifically, the MNIST data set. You can load this data easily with the following commands:" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 1, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "from sklearn.datasets import fetch_mldata\n", 19 | "\n", 20 | "mnist = fetch_mldata(\"MNIST original\")\n", 21 | "X = mnist.data / 255.0\n", 22 | "y = mnist.target" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "The MNIST data set is hand-drawn digits, from zero through nine.\n", 30 | "\n", 31 | "Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels in total. Each pixel has a single pixel-value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. 
This pixel-value is an integer between 0 and 255, inclusive.\n", 32 | "\n", 33 | "Source: https://www.kaggle.com/c/digit-recognizer/data\n", 34 | "\n", 35 | "For this homework, perform the following with the MNIST data:\n", 36 | "\n", 37 | "1. Use PCA to reduce the dimensionality\n", 38 | "\n", 39 | " a. How many components did you use? Why?\n", 40 | " \n", 41 | " b. Plot the first two components. Do you notice any trends? What is this plot showing us?\n", 42 | " \n", 43 | " c. Why would you use PCA? What is it doing? And what are the drawbacks?\n", 44 | " \n", 45 | " d. Plot some of the images, then compress them using PCA and plot again. How does it look?\n", 46 | " \n", 47 | "2. Use t-SNE to plot the first two components (you should probably random sample around 10000 points):\n", 48 | "\n", 49 | " a. How does this plot differ from your PCA plot?\n", 50 | " \n", 51 | " b. How robust is it to changes in perplexity?\n", 52 | " \n", 53 | " c. How robust is it to different learning rate and number of iterations?\n", 54 | " \n", 55 | "3. Perform k-means clustering:\n", 56 | "\n", 57 | " a. How did you choose k?\n", 58 | " \n", 59 | " b. How did you evaluate your clustering?\n", 60 | " \n", 61 | " c. Visualize your clusters using t-sne\n", 62 | " \n", 63 | " d. Did you scale your data?\n", 64 | " \n", 65 | " e. How robust is your clustering?\n", 66 | " \n", 67 | "4. Perform hierarchical clustering:\n", 68 | "\n", 69 | " a. Plot your dendrogram\n", 70 | " \n", 71 | " b. How many clusters seem reasonable based off your graph?\n", 72 | " \n", 73 | " c. How does your dendrogram change with different linkage methods?" 74 | ] 75 | }, 76 | { 77 | "cell_type": "markdown", 78 | "metadata": {}, 79 | "source": [ 80 | "### Grading\n", 81 | "\n", 82 | 83 | "This homework is due **April 4, 2019 by midnight Utah time.** By that time, you need to have committed all your code to your github and submitted a link to your work to the TA. We can see on your Github account when you last committed code. 
:)\n", 84 | 85 | 86 | "\n", 87 | "Rubric:\n", 88 | "\n", 89 | "* Code Quality - 10%\n", 90 | "* Storytelling - 10%\n", 91 | "* PCA - 20%\n", 92 | "* T-SNE - 20%\n", 93 | "* K-means - 20%\n", 94 | "* Hierarchical Clustering - 20%" 95 | ] 96 | } 97 | ], 98 | "metadata": { 99 | "kernelspec": { 100 | "display_name": "Python 3", 101 | "language": "python", 102 | "name": "python3" 103 | }, 104 | "language_info": { 105 | "codemirror_mode": { 106 | "name": "ipython", 107 | "version": 3 108 | }, 109 | "file_extension": ".py", 110 | "mimetype": "text/x-python", 111 | "name": "python", 112 | "nbconvert_exporter": "python", 113 | "pygments_lexer": "ipython3", 114 | "version": "3.6.6" 115 | } 116 | }, 117 | "nbformat": 4, 118 | "nbformat_minor": 2 119 | } 120 | -------------------------------------------------------------------------------- /homeworks/Final Project.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Final Project\n", 8 | "\n", 9 | "The final project is your opportunity to do your own data project end to end and present the results to some local companies.\n", 10 | "\n", 11 | "### Rubric\n", 12 | "\n", 13 | "Code Quality - 20%\n", 14 | "\n", 15 | "Describing, cleaning, and visualizing data - 25%\n", 16 | "\n", 17 | "Modeling - 25%\n", 18 | "\n", 19 | "Presentation / Storytelling - 30%" 20 | ] 21 | } 22 | ], 23 | "metadata": { 24 | "kernelspec": { 25 | "display_name": "Python 3", 26 | "language": "python", 27 | "name": "python3" 28 | }, 29 | "language_info": { 30 | "codemirror_mode": { 31 | "name": "ipython", 32 | "version": 3 33 | }, 34 | "file_extension": ".py", 35 | "mimetype": "text/x-python", 36 | "name": "python", 37 | "nbconvert_exporter": "python", 38 | "pygments_lexer": "ipython3", 39 | "version": "3.5.3" 40 | } 41 | }, 42 | "nbformat": 4, 43 | "nbformat_minor": 2 44 | } 45 | -------------------------------------------------------------------------------- /homeworks/Homework_1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Data Cleaning, Describing, and Visualization\n", 8 | "\n", 9 | "### Step 1 - Get your environment setup\n", 10 | "\n", 11 | "1. Install Git on your computer and fork the class repository on [Github](https://github.com/tfolkman/byu_econ_applied_machine_learning).\n", 12 | "2. Install [Anaconda](https://conda.io/docs/install/quick.html) and get it working.\n", 13 | "\n", 14 | "### Step 2 - Explore Datasets\n", 15 | "\n", 16 | "The goals of this project are:\n", 17 | "\n", 18 | "1. Read in data from multiple sources\n", 19 | "2. Gain practice cleaning, describing, and visualizing data\n", 20 | "\n", 21 | "To this end, you need to find from three different sources. For example: CSV, JSON, and API, SQL, or web scraping. For each of these data sets, you must perform the following:\n", 22 | "\n", 23 | "1. Data cleaning. Some options your might consider: handle missing data, handle outliers, scale the data, convert some data to categorical.\n", 24 | "2. Describe data. Provide tables, statistics, and summaries of your data.\n", 25 | "3. Visualize data. Provide visualizations of your data.\n", 26 | "\n", 27 | "These are the typical first steps of any data science project and are often the most time consuming. 
My hope is that in going through this process 3 different times, that you will gain a sense for it.\n", 28 | "\n", 29 | "Also, as you are doing this, please tell us a story. Explain in your notebook why are doing what you are doing and to what end. Telling a story in your analysis is a crucial skill for data scientists. There are almost an infinite amount of ways to analyze a data set; help us understand why you choose your particular path and why we should care.\n", 30 | "\n", 31 | "Also - this homework is very open-ended and we provided you with basically no starting point. I realize this increases the difficulty and complexity, but I think it is worth it. It is much closer to what you might experience in industry and allows you to find data that might excite you!\n", 32 | "\n", 33 | "### Grading\n", 34 | "\n", 35 | "This homework is due **Jan. 31, 2019 by 4:00pm Utah time.** By that time, you need to have committed all your code to your github and submitted a link to your work to the TA. We can see on your Github account when you last committed code. :)\n", 36 | "\n", 37 | "Rubric:\n", 38 | "\n", 39 | "* Code Quality - 10%\n", 40 | "* Ability to read in data - 10%\n", 41 | "* Ability to describe data - 20%\n", 42 | "* Ability to visualize data - 20%\n", 43 | "* Ability to clean data - 20%\n", 44 | "* Storytelling - 20%" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": null, 50 | "metadata": {}, 51 | "outputs": [], 52 | "source": [] 53 | } 54 | ], 55 | "metadata": { 56 | "kernelspec": { 57 | "display_name": "Python 3", 58 | "language": "python", 59 | "name": "python3" 60 | }, 61 | "language_info": { 62 | "codemirror_mode": { 63 | "name": "ipython", 64 | "version": 3 65 | }, 66 | "file_extension": ".py", 67 | "mimetype": "text/x-python", 68 | "name": "python", 69 | "nbconvert_exporter": "python", 70 | "pygments_lexer": "ipython3", 71 | "version": "3.6.6" 72 | } 73 | }, 74 | "nbformat": 4, 75 | "nbformat_minor": 2 76 | } 77 | -------------------------------------------------------------------------------- /homeworks/Homework_2.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Introduction to Modeling with Python\n", 8 | "\n", 9 | "Now that we have seen some examples of modeling and using Python for modeling, we wanted to give you a chance to try your hand!\n", 10 | "\n", 11 | "To that goal, we choose a well structured problem with plenty of resources online to help you along the way. That problem is predicting housing prices and is hosted on Kaggle:\n", 12 | "\n", 13 | "https://www.kaggle.com/c/house-prices-advanced-regression-techniques\n", 14 | "\n", 15 | "First, make sure you are signed up on Kaggle and then download the data:\n", 16 | "\n", 17 | "https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data\n", 18 | "\n", 19 | "The data includes both testing and training sets as well as a sample submission file. \n", 20 | "\n", 21 | "Your goal is the predict the sales price for each house where root mean squared error is the evaluation metric. 
To get some ideas on where to start, feel free to check out Kaggle Kernels:\n", 22 | "\n", 23 | "https://www.kaggle.com/c/house-prices-advanced-regression-techniques/kernels\n", 24 | "\n", 25 | "And the discussion board:\n", 26 | "\n", 27 | "https://www.kaggle.com/c/house-prices-advanced-regression-techniques/discussion\n", 28 | "\n", 29 | "Again - the goal of this homework is to get you exposed to modeling with Python. Feel free to use online resources to help guide you, but we expect original thought as well. Our hope is by the end of this homework you will feel comfortable exploring data in Python and building models to make predictions. Also please submit your test results to Kaggle and let us know your ranking and score!\n", 30 | "\n", 31 | "\n", 32 | "### Grading\n", 33 | "\n", 34 | "This homework is due **Feb. 21, 2019 by 4:00pm Utah time.** By that time, you need to have committed all your code to your github and submitted a link to your work to the TA. We can see on your Github account when you last committed code. :)\n", 35 | "\n", 36 | "Rubric:\n", 37 | "\n", 38 | "* Code Quality - 10%\n", 39 | "* Storytelling - 10%\n", 40 | "* Result on Kaggle - 5%\n", 41 | "* Describing, Cleaning, and Visualizing data - 25%\n", 42 | "* Modeling - 50%\n", 43 | "\n", 44 | "More specifically, for modeling we will look for: \n", 45 | "\n", 46 | "* Model Selection: Did you try multiple models? Why did you choose these models? How do they work? What are they assumptions? And how did you test/account for them? How did you select hyper-parameters?\n", 47 | "* Model interpretation: What do the model results tell you? Which variables are important? High bias or variance and how did you / could you fix this? How confident are you in your results? \n", 48 | "* Model usefulness: Do you think your final model was useful? If so, how would you recommend using it? Convince us, that if we were a company, we would feel comfortable using your model with our users. Think about edge cases as well - are there certain areas that the model performs poorly on? Best on? How would you handle these cases, if say Zillow wanted to leverage your model realizing that bad recommendations on sale prices would hurt customer trust and your brand. This section also falls into the storytelling aspect of the grading." 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "metadata": {}, 54 | "source": [ 55 | "## Introduction to Modeling with Python\n", 56 | "\n", 57 | "Now that we have seen some examples of modeling and using Python for modeling, we wanted to give you a chance to try your hand!\n", 58 | "\n", 59 | "To that goal, we choose a well structured problem with plenty of resources online to help you along the way. That problem is predicting housing prices and is hosted on Kaggle:\n", 60 | "\n", 61 | "https://www.kaggle.com/c/house-prices-advanced-regression-techniques\n", 62 | "\n", 63 | "First, make sure you are signed up on Kaggle and then download the data:\n", 64 | "\n", 65 | "https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data\n", 66 | "\n", 67 | "The data includes both testing and training sets as well as a sample submission file. \n", 68 | "\n", 69 | "Your goal is the predict the sales price for each house where root mean squared error is the evaluation metric. 
To get some ideas on where to start, feel free to check out Kaggle Kernels:\n", 70 | "\n", 71 | "https://www.kaggle.com/c/house-prices-advanced-regression-techniques/kernels\n", 72 | "\n", 73 | "And the discussion board:\n", 74 | "\n", 75 | "https://www.kaggle.com/c/house-prices-advanced-regression-techniques/discussion\n", 76 | "\n", 77 | "Again - the goal of this homework is to get you exposed to modeling with Python. Feel free to use online resources to help guide you, but we expect original thought as well. Our hope is by the end of this homework you will feel comfortable exploring data in Python and building models to make predictions. Also please submit your test results to Kaggle and let us know your ranking and score!\n", 78 | "\n", 79 | "\n", 80 | "### Grading\n", 81 | "\n", 82 | "This homework is due **Oct. 16, 2018 by 4:00pm Utah time.** By that time, you need to have committed all your code to your github and submitted a link to your work to the TA. We can see on your Github account when you last committed code. :)\n", 83 | "\n", 84 | "Rubric:\n", 85 | "\n", 86 | "* Code Quality - 10%\n", 87 | "* Storytelling - 10%\n", 88 | "* Result on Kaggle - 5%\n", 89 | "* Describing, Cleaning, and Visualizing data - 25%\n", 90 | "* Modeling - 50%\n", 91 | "\n", 92 | "More specifically, for modeling we will look for: \n", 93 | "\n", 94 | "* Model Selection: Did you try multiple models? Why did you choose these models? How do they work? What are they assumptions? And how did you test/account for them? How did you select hyper-parameters?\n", 95 | "* Model interpretation: What do the model results tell you? Which variables are important? High bias or variance and how did you / could you fix this? How confident are you in your results? \n", 96 | "* Model usefulness: Do you think your final model was useful? If so, how would you recommend using it? Convince us, that if we were a company, we would feel comfortable using your model with our users. Think about edge cases as well - are there certain areas that the model performs poorly on? Best on? How would you handle these cases, if say Zillow wanted to leverage your model realizing that bad recommendations on sale prices would hurt customer trust and your brand. This section also falls into the storytelling aspect of the grading." 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": null, 102 | "metadata": {}, 103 | "outputs": [], 104 | "source": [] 105 | } 106 | ], 107 | "metadata": { 108 | "kernelspec": { 109 | "display_name": "Python 3", 110 | "language": "python", 111 | "name": "python3" 112 | }, 113 | "language_info": { 114 | "codemirror_mode": { 115 | "name": "ipython", 116 | "version": 3 117 | }, 118 | "file_extension": ".py", 119 | "mimetype": "text/x-python", 120 | "name": "python", 121 | "nbconvert_exporter": "python", 122 | "pygments_lexer": "ipython3", 123 | "version": "3.6.6" 124 | } 125 | }, 126 | "nbformat": 4, 127 | "nbformat_minor": 2 128 | } 129 | -------------------------------------------------------------------------------- /homeworks/Homework_3.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Project Abstract\n", 8 | "\n", 9 | "To help people get started on their project and to make sure you are selecting an appropriate task, we will have all the teams submit an abstract. 
Please only submit one abstract per team.\n", 10 | "\n", 11 | "The abstract should include (at least):\n", 12 | "\n", 13 | "* Team members\n", 14 | "* Problem statement\n", 15 | "* Data you will use to solve the problem\n", 16 | "* Outline of how you plan on solving the problem with the data. For example, what pre-processing steps might you need to do, what models, etc. \n", 17 | "* Supporting documents if necessary citing past research in the area and methods used to solve the problem.\n", 18 | "\n", 19 | "The goal of this abstract is for you to think deeply about the project you will be undertaking and convince yourself (and us) that it is a meaningful and achievable project for this class.\n", 20 | "\n", 21 | "This homework is due **February 28, 2018 by 4:00 pm Utah time.** and will be submitted on learning suite." 22 | ] 23 | } 24 | ], 25 | "metadata": { 26 | "kernelspec": { 27 | "display_name": "Python 3", 28 | "language": "python", 29 | "name": "python3" 30 | }, 31 | "language_info": { 32 | "codemirror_mode": { 33 | "name": "ipython", 34 | "version": 3 35 | }, 36 | "file_extension": ".py", 37 | "mimetype": "text/x-python", 38 | "name": "python", 39 | "nbconvert_exporter": "python", 40 | "pygments_lexer": "ipython3", 41 | "version": "3.6.6" 42 | } 43 | }, 44 | "nbformat": 4, 45 | "nbformat_minor": 2 46 | } 47 | -------------------------------------------------------------------------------- /homeworks/Homework_4.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Classification with Python\n", 8 | "\n", 9 | "Hopefully now you are feeling a bit more comfortable with Python, Kaggle, and modeling. \n", 10 | "\n", 11 | "This next homework will test your classification abilities. We will be trying to predict the poverty levels of various households in Costa Rica:\n", 12 | "\n", 13 | "https://www.kaggle.com/c/costa-rican-household-poverty-prediction\n", 14 | "\n", 15 | "The evalution metric for Kaggle is accuracy, but please also explore how well your model does on multiple metrics like F1, precision, recall, and area under the ROC curve.\n", 16 | "\n", 17 | "### Grading\n", 18 | "\n", 19 | "This homework is due **March 21, 2019 by 4:00pm Utah time.** By that time, you need to have committed all your code to your github and submitted a link to your work to the TA. We can see on your Github account when you last committed code. :)\n", 20 | "\n", 21 | "Rubric:\n", 22 | "\n", 23 | "* Code Quality - 10%\n", 24 | "* Storytelling - 10%\n", 25 | "* Result on Kaggle - 5%\n", 26 | "* Describing, Cleaning, and Visualizing data - 25%\n", 27 | "* Modeling - 50%\n", 28 | "\n", 29 | "More specifically, for modeling we will look for: \n", 30 | "\n", 31 | "* Model Selection: Did you try multiple models? Why did you choose these models? How do they work? What are they assumptions? And how did you test/account for them? How did you select hyper-parameters?\n", 32 | "* Model evaluation: Did you evaluate your model on multiple metrics? Where does your model do well? Where could it be improved? How are the metrics different?\n", 33 | "* Model interpretation: What do the model results tell you? Which variables are important? High bias or variance and how did you / could you fix this? How confident are you in your results? \n", 34 | "* Model usefulness: Do you think your final model was useful? If so, how would you recommend using it? 
Convince us, that if we were a company, we would feel comfortable using your model with our users. Think about edge cases as well - are there certain areas that the model performs poorly on? Best on? How would you handle these cases, if say Zillow wanted to leverage your model realizing that bad recommendations on sale prices would hurt customer trust and your brand. This section also falls into the storytelling aspect of the grading." 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": null, 40 | "metadata": {}, 41 | "outputs": [], 42 | "source": [] 43 | } 44 | ], 45 | "metadata": { 46 | "kernelspec": { 47 | "display_name": "Python 3", 48 | "language": "python", 49 | "name": "python3" 50 | }, 51 | "language_info": { 52 | "codemirror_mode": { 53 | "name": "ipython", 54 | "version": 3 55 | }, 56 | "file_extension": ".py", 57 | "mimetype": "text/x-python", 58 | "name": "python", 59 | "nbconvert_exporter": "python", 60 | "pygments_lexer": "ipython3", 61 | "version": "3.6.6" 62 | } 63 | }, 64 | "nbformat": 4, 65 | "nbformat_minor": 2 66 | } 67 | -------------------------------------------------------------------------------- /homeworks/Homework_5.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Dimensionality Reduction and Clustering\n", 8 | "\n", 9 | "For this homework we will be using some image data! Specifically, the MNIST data set. You can load this data easily with the following commands:" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 1, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "from sklearn.datasets import fetch_mldata\n", 19 | "\n", 20 | "mnist = fetch_mldata(\"MNIST original\")\n", 21 | "X = mnist.data / 255.0\n", 22 | "y = mnist.target" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "The MNIST data set is hand-drawn digits, from zero through nine.\n", 30 | "\n", 31 | "Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels in total. Each pixel has a single pixel-value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel-value is an integer between 0 and 255, inclusive.\n", 32 | "\n", 33 | "Source: https://www.kaggle.com/c/digit-recognizer/data\n", 34 | "\n", 35 | "For this homework, perform the following with the MNIST data:\n", 36 | "\n", 37 | "1. Use PCA to reduce the dimensionality\n", 38 | "\n", 39 | " a. How many components did you use? Why?\n", 40 | " \n", 41 | " b. Plot the first two components. Do you notice any trends? What is this plot showing us?\n", 42 | " \n", 43 | " c. Why would you use PCA? What is it doing? And what are the drawbacks?\n", 44 | " \n", 45 | " d. Plot some of the images, then compress them using PCA and plot again. How does it look?\n", 46 | " \n", 47 | "2. Use t-SNE to plot the first two components (you should probably random sample around 10000 points):\n", 48 | "\n", 49 | " a. How does this plot differ from your PCA plot?\n", 50 | " \n", 51 | " b. How robust is it to changes in perplexity?\n", 52 | " \n", 53 | " c. How robust is it to different learning rate and number of iterations?\n", 54 | " \n", 55 | "3. Perform k-means clustering:\n", 56 | "\n", 57 | " a. How did you choose k?\n", 58 | " \n", 59 | " b. How did you evaluate your clustering?\n", 60 | " \n", 61 | " c. 
Visualize your clusters using t-sne\n", 62 | " \n", 63 | " d. Did you scale your data?\n", 64 | " \n", 65 | " e. How robust is your clustering?\n", 66 | " \n", 67 | "4. Perform hierarchical clustering:\n", 68 | "\n", 69 | " a. Plot your dendrogram\n", 70 | " \n", 71 | " b. How many clusters seem reasonable based off your graph?\n", 72 | " \n", 73 | " c. How does your dendrogram change with different linkage methods?" 74 | ] 75 | }, 76 | { 77 | "cell_type": "markdown", 78 | "metadata": {}, 79 | "source": [ 80 | "### Grading\n", 81 | "\n", 82 | 83 | "This homework is due **April 4, 2019 by midnight Utah time.** By that time, you need to have committed all your code to your github and submitted a link to your work to the TA. We can see on your Github account when you last committed code. :)\n", 84 | 85 | "\n", 86 | "Rubric:\n", 87 | "\n", 88 | "* Code Quality - 10%\n", 89 | "* Storytelling - 10%\n", 90 | "* PCA - 20%\n", 91 | "* T-SNE - 20%\n", 92 | "* K-means - 20%\n", 93 | "* Hierarchical Clustering - 20%" 94 | ] 95 | } 96 | ], 97 | "metadata": { 98 | "kernelspec": { 99 | "display_name": "Python 3", 100 | "language": "python", 101 | "name": "python3" 102 | }, 103 | "language_info": { 104 | "codemirror_mode": { 105 | "name": "ipython", 106 | "version": 3 107 | }, 108 | "file_extension": ".py", 109 | "mimetype": "text/x-python", 110 | "name": "python", 111 | "nbconvert_exporter": "python", 112 | "pygments_lexer": "ipython3", 113 | "version": "3.6.6" 114 | } 115 | }, 116 | "nbformat": 4, 117 | "nbformat_minor": 2 118 | } 119 | -------------------------------------------------------------------------------- /lectures/.ipynb_checkpoints/Lecture_11_Boosting-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Boosting\n", 8 | "\n", 9 | "Boosting is another ensemble techinque like random forests. With random forests, we trained multiple decision trees independently and created independence by randomly sampling samples and features.\n", 10 | "\n", 11 | "Boosting on the other hand trains multiple models sequentially where each each model tries to improve on the areas where the previous models performed poorly. \n", 12 | "\n", 13 | "### Adaboost\n", 14 | "\n", 15 | "Adaboost works by starting with a single model. From that model we then make predictions on the **training** set. We can then see which training samples this first model got right and which it got wrong. Adaboost then trains the next model, but puts more weight on the training samples that the first model got wrong. This process continues for N number of models where N is a hyper-parameter.\n", 16 | "\n", 17 | "It is important to note that the sequential nature of boosting makes it harder to scale and parallelize relative the random forest models. That being said, work has been done with boosting methods to allow them to be more parallelizable. A very popular library that does this is [XGboost](https://github.com/dmlc/xgboost).\n", 18 | "\n", 19 | "So - at the end of your training, you now have N models where each model is trained to do better on the instances that the previous models did poorly on. 
You can now combine these models via a weighted voting or averaging method very similar to random forest, except you weight each vote by how accurate the model was overall.\n", 20 | "\n", 21 | "Adaboost also has a hyper-parameter called **learning rate.** This hyper-parameter adjusts the contribution of each classifier. When decreasing the value, each new classifier makes smaller adjustments to the weights of mis-classified samples, meaning Adaboost learns more slowly per tree. Typically a lower learning rate requires more trees to perform well. This value and the number of trees can be tuned using cross-validation. \n", 22 | "\n", 23 | "#### Math\n", 24 | "\n", 25 | "So, how exactly do we do this?\n", 26 | "\n", 27 | "**Step 1**: Set all sample weights to $1/m$, where $m$ is the number of training examples\n", 28 | "\n", 29 | "**Step 2**: Train the first model\n", 30 | "\n", 31 | "**Step 3**: Calculate the weighted error of this model, which is simply the sum of the weights of the misclassified\n", 32 | "examples divided by the total weight of all samples. We will call this $r_j$ for the $j$th model. The best value is zero and the worst is 1.\n", 33 | "\n", 34 | "**Step 4**: Calculate the predictor's weight, where being more accurate earns a higher weight:\n", 35 | "\n", 36 | "$$\\alpha_{j} = \\eta \\log \\frac{1-r_{j}}{r_{j}}$$\n", 37 | "\n", 38 | "If the model is wrong more often than it is right, it gets a negative weight, and if it is close to random, its weight is close to zero.\n", 39 | "\n", 40 | "**Step 5**: Update the weights of your training samples: if a sample was classified correctly, its weight remains the same. If it was misclassified, its new weight is:\n", 41 | "\n", 42 | "$$\\text{new weight} = \\text{old weight} \\times \\exp(\\alpha_{j})$$\n", 43 | "\n", 44 | "Then normalize all weights to sum to 1 by dividing each weight by the sum of the weights.\n", 45 | "\n", 46 | "You can see that a good predictor adds extra weight to its mis-classifications, putting a strong emphasis on them for the next model. Also, $\\eta$ is our learning rate and can decrease the impact of a tree on the weight updates by being less than 1.\n", 47 | "\n", 48 | "**Step 6**: Repeat from step 2 with the next model and continue until the required number of models is reached.\n", 49 | "\n", 50 | "That's it!\n", 51 | "\n", 52 | "Predictions are made by running a new sample through all the trained models, getting the most likely class (for classification) from each one, and then doing a weighted vote where the weight is the value from step 4. For regression, just take a weighted average.\n", 53 | "\n", 54 | "Note: Almost always the model chosen for boosting is a decision tree.\n", 55 | "\n", 56 | "### Gradient Boosting\n", 57 | "\n", 58 | "Gradient boosting is also sequential, but instead of changing weights, it uses the residual errors from the previous model as the targets.\n", 59 | "\n", 60 | "Basically, for regression, here are our steps:\n", 61 | "\n", 62 | "1. Fit a decision tree (assuming this is our base model; it is the most common choice)\n", 63 | "2. Calculate the residuals: true training values - predicted training values. Note: it turns out that this is the same as taking the negative gradient of the loss function, so this can generalize to other loss functions for classification and ranking tasks.\n", 64 | "3. Train a second decision tree where the residuals are the targets\n", 65 | "4. Continue the process for the number of defined estimators (hyper-parameter)\n", 66 | "\n", 67 | "At prediction time, we make a prediction with each tree and **sum** them. 
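To make the residual-fitting idea concrete, here is a minimal from-scratch sketch of gradient boosting for regression with squared-error loss. The base learner, tree depth, learning rate, and the choice of the target mean as the initial prediction are illustrative assumptions, not the only (or necessarily the best) settings.

```python
# A minimal gradient boosting sketch for regression with squared-error loss.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbr(X, y, n_estimators=100, learning_rate=0.1, max_depth=2):
    init = y.mean()              # start from a constant prediction
    residuals = y - init         # residuals = negative gradient of squared error
    trees = []
    for _ in range(n_estimators):
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)   # each new tree is fit to the current residuals
        trees.append(tree)
        residuals = residuals - learning_rate * tree.predict(X)  # shrink each tree's contribution
    return init, trees

def predict_gbr(init, trees, X, learning_rate=0.1):
    # prediction = initial constant + (scaled) sum of every tree's prediction
    pred = np.full(X.shape[0], init, dtype=float)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred

# Tiny usage example on synthetic data.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)
init, trees = fit_gbr(X, y)
print("Train MSE:", np.mean((predict_gbr(init, trees, X) - y) ** 2))
```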
\n", 68 | "\n", 69 | "### Learning Rate\n", 70 | "\n", 71 | "The learning rate for gradient boosting trees, scales the contribution of each tree. \n", 72 | "\n", 73 | "\n", 74 | "### Early Stopping\n", 75 | "\n", 76 | "A good way to decide how many trees are needed is to use early stopping. See page 199 of Hands on Machine Learning for an example of this with sklearn. Basically, as you add an additional tree, you check your validation error and when your validation error stops getting better, you stop adding trees.\n", 77 | "\n", 78 | "\n", 79 | "### Some final notes\n", 80 | "\n", 81 | "* Boosting is more likely to overfit than random forests when the number of estimators is large. Though, usually slow to overfit.\n", 82 | "* Typical learning rates are around 0.01 or 0.001. And small learning rates can require a large number of estimators to achieve good results.\n", 83 | "* The max depth of the trees controls the complexity of the model and a max depth of 1 can often work well. This results in an additive model.\n", 84 | "* Can also get feature importance scores as with Random Forest.\n", 85 | "\n", 86 | "### Stacking\n", 87 | "\n", 88 | "We won't spend time discussing stacking as it tends to be too complex for industry. That being said, it is a good technique to be familiar with. Take a look at p. 200 of Hands on Machine Learning." 89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "metadata": {}, 94 | "source": [ 95 | "## SKlearn Example" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": 2, 101 | "metadata": { 102 | "collapsed": true 103 | }, 104 | "outputs": [], 105 | "source": [ 106 | "from sklearn.datasets import load_breast_cancer\n", 107 | "from sklearn.ensemble import GradientBoostingClassifier\n", 108 | "from sklearn.model_selection import train_test_split, GridSearchCV\n", 109 | "from sklearn.metrics import f1_score, classification_report\n", 110 | "from collections import Counter\n", 111 | "import numpy as np\n", 112 | "import pandas as pd\n", 113 | "%matplotlib inline" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": 4, 119 | "metadata": { 120 | "collapsed": true 121 | }, 122 | "outputs": [], 123 | "source": [ 124 | "data = load_breast_cancer()\n", 125 | "X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.20, random_state=42)" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": 5, 131 | "metadata": {}, 132 | "outputs": [ 133 | { 134 | "data": { 135 | "text/plain": [ 136 | "GridSearchCV(cv=None, error_score='raise',\n", 137 | " estimator=GradientBoostingClassifier(criterion='friedman_mse', init=None,\n", 138 | " learning_rate=0.1, loss='deviance', max_depth=3,\n", 139 | " max_features=None, max_leaf_nodes=None,\n", 140 | " min_impurity_decrease=0.0, min_impurity_split=None,\n", 141 | " min_samples_leaf=1, min_samples_split=2,\n", 142 | " min_weight_fraction_leaf=0.0, n_estimators=100,\n", 143 | " presort='auto', random_state=None, subsample=1.0, verbose=0,\n", 144 | " warm_start=False),\n", 145 | " fit_params=None, iid=True, n_jobs=1,\n", 146 | " param_grid={'learning_rate': [0.1, 0.01, 0.001], 'n_estimators': [100, 1000, 5000], 'max_depth': [1, 2, 3]},\n", 147 | " pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',\n", 148 | " scoring='f1', verbose=0)" 149 | ] 150 | }, 151 | "execution_count": 5, 152 | "metadata": {}, 153 | "output_type": "execute_result" 154 | } 155 | ], 156 | "source": [ 157 | "clf = GradientBoostingClassifier()\n", 158 | 
"gridsearch = GridSearchCV(clf, {\"learning_rate\": [.1, .01, .001], \"n_estimators\": [100, 1000, 5000], \n", 159 | " 'max_depth': [1, 2, 3]}, scoring='f1')\n", 160 | "gridsearch.fit(X_train, y_train)" 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": 10, 166 | "metadata": {}, 167 | "outputs": [ 168 | { 169 | "name": "stdout", 170 | "output_type": "stream", 171 | "text": [ 172 | "Best Params: {'learning_rate': 0.01, 'max_depth': 1, 'n_estimators': 5000}\n", 173 | "\n", 174 | "Classification Report:\n", 175 | " precision recall f1-score support\n", 176 | "\n", 177 | " 0 0.95 0.93 0.94 43\n", 178 | " 1 0.96 0.97 0.97 71\n", 179 | "\n", 180 | "avg / total 0.96 0.96 0.96 114\n", 181 | "\n" 182 | ] 183 | } 184 | ], 185 | "source": [ 186 | "print(\"Best Params: {}\".format(gridsearch.best_params_))\n", 187 | "print(\"\\nClassification Report:\")\n", 188 | "print(classification_report(y_test, gridsearch.predict(X_test)))" 189 | ] 190 | } 191 | ], 192 | "metadata": { 193 | "kernelspec": { 194 | "display_name": "Python 3", 195 | "language": "python", 196 | "name": "python3" 197 | }, 198 | "language_info": { 199 | "codemirror_mode": { 200 | "name": "ipython", 201 | "version": 3 202 | }, 203 | "file_extension": ".py", 204 | "mimetype": "text/x-python", 205 | "name": "python", 206 | "nbconvert_exporter": "python", 207 | "pygments_lexer": "ipython3", 208 | "version": "3.6.1" 209 | } 210 | }, 211 | "nbformat": 4, 212 | "nbformat_minor": 2 213 | } 214 | -------------------------------------------------------------------------------- /lectures/.ipynb_checkpoints/Lecture_1_Introduction-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### Introductions\n", 8 | "\n", 9 | "* Importance of getting to know people in the class\n", 10 | "* Review syllabus\n", 11 | "* Give out slack information\n", 12 | "* When office hours?\n", 13 | "* Participation: ML news / paper for the day / discuss homework ideas / class discussions\n", 14 | "* Homework: Target 5 homeworks which will be pretty open-ended and almost like mini-projects\n", 15 | "* Quizzes: In class and around 3. Will test knowledge of the subject not coding\n", 16 | "* First time class being taught and I am very open to feedback. Also, feel free to submit pull requests to correct or add to content.\n", 17 | "* Review first homework" 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": {}, 23 | "source": [ 24 | "## What is Machine Learning?\n", 25 | "\n", 26 | "[Wikipedia](https://en.wikipedia.org/wiki/Machine_learning) tells us that Machine learning is, \"a field of computer science that gives computers **the ability to learn without being explicitly programmed**.\" It goes on to say, \"machine learning explores the study and construction of algorithms that can learn from and make predictions on data – such algorithms overcome following strictly static program instructions by making data-driven predictions or decisions, through **building a model from sample inputs**.\"\n", 27 | "\n", 28 | "\n", 29 | "### Learning from inputs\n", 30 | "\n", 31 | "What does it mean to learn from inputs without being explicitly programmed? Let us consider the classical machine learning problem: spam filtering.\n", 32 | "\n", 33 | "Let us imagine that we no nothing about machine learning, but are tasked with determining whether an email is spam or not. 
How might we do this?\n", 34 | "\n", 35 | "What do we like and not like about our approach? How would we keep it up to date with new types of spam?\n", 36 | "\n", 37 | "Now imagine we had a black-box that, if given a great many examples of emails that are spam and not spam, could take these examples and learn from the text what spam looks like. For this class, we will call that black-box (though it won't be black for long) machine learning. And often the data we feed it to understand a problem are called features.\n", 38 | "\n", 39 | "What does learning from inputs look like? Let us consider [flappy bird](https://www.youtube.com/watch?v=79BWQUN_Njc).\n", 40 | "\n", 41 | "\n", 42 | "## Why machine learning?\n", 43 | "\n", 44 | "From our discussion, why do you think machine learning might be valuable?\n", 45 | "\n", 46 | "\n", 47 | "## Types of machine learning problems\n", 48 | "\n", 49 | "**Supervised** machine learning problems are ones for which you have labeled data. Labeled data means you give the algorithm the solution with the data and these solutions are called labels. For example, with spam classification the labels would be \"spam\" or \"not spam.\" Linear regression would be considered a supervised problem.\n", 50 | "\n", 51 | "**Unsupervised** machine learning is the opposite. It is not given any labels. These algorithms are often not as powerful as they don't get the benefit of labels, but they can be extremely valuable when getting labeled data is expensive or impossible. An example would be clustering.\n", 52 | "\n", 53 | "**Regression** problems are a class of problems for which you are trying to predict a real number. For example, linear regression outputs a real number and could be used to predict housing prices.\n", 54 | "\n", 55 | "**Classification** problems are problems for which you are predicting a class. For example, spam prediction is a classification problem because you want to know whether your input falls into one of two classes. Logistic regression is an algorithm used for classification.\n", 56 | "\n", 57 | "**Ranking** problems are very popular in eCommerce. These models try to rank the items by how valuable they are to a user. For example, Netflix's movie recommendations. An example model is collaborative filtering.\n", 58 | "\n", 59 | "**Reinforcement Learning** is when you have an agent in an environment that gets to perform actions and receive rewards for those actions. The model here learns the best actions to take to maximize rewards. The flappy bird video is an example of reinforcement learning. An example model is deep Q-networks.\n", 60 | "\n", 61 | "\n", 62 | "## Machine Learning and Econometrics\n", 63 | "\n", 64 | "How are they different?\n", 65 | "\n", 66 | "For one, they use different lingo. \n", 67 | "\n", 68 | "For another, econometrics is often more interested in understanding why things happen, while machine learning often cares more about just the actual prediction being correct.\n", 69 | "\n", 70 | "Economic theory is often a driver in the development of econometric models, while machine learning often relies on the data to deliver insights.\n", 71 | "\n", 72 | "The two worlds have a lot of overlap and continue to grow closer. 
Machine learning is getting better and providing both predictions and understandings and econometrics is finding value in the scalability and accuracy of some machine learning models.\n", 73 | "\n", 74 | "\n", 75 | "## Challenges of Machine Learning" 76 | ] 77 | }, 78 | { 79 | "cell_type": "markdown", 80 | "metadata": {}, 81 | "source": [ 82 | "Perhaps my favorite part from the Wikipedia page on machine learning is, \"As of 2016, **machine learning is a buzzword**, and according to the Gartner hype cycle of 2016, at its peak of inflated expectations. Effective machine learning is difficult because finding patterns is hard and often not enough training data is available; as a result, **machine-learning programs often fail to deliver**.\"\n", 83 | "\n", 84 | "* There isn't a clear problem to solve\n", 85 | "\n", 86 | "Some executive heard machine learning is the next big thing, so they hired a data science team. Unfortunately, there isn't a clear idea on what problems to solve, so the team flounders for a year.\n", 87 | "\n", 88 | "* Labeled data can be extremely important to building machine learning models, but can also be extremely costly.\n", 89 | "\n", 90 | "First off, you often need a lot of data. [Google](https://research.googleblog.com/2017/07/revisiting-unreasonable-effectiveness.html) found for representation learning, performance increases logarithmically based on volume of training data:\n", 91 | "\n", 92 | "![image.png](attachment:image.png)\n", 93 | "\n", 94 | "Secondly, you need to get data that represents the full distribution of the problem you are trying to solve. For example, for our spam classification problem, what kinds of emails might we want to gather? What if we only had emails that came from US IP addresses?\n", 95 | "\n", 96 | "Lastly, just getting data labeled can be time consuming and cost a lot of money. Who is going to label 1 million emails as spam or not spam?\n", 97 | "\n", 98 | "* Data can be very messy\n", 99 | "\n", 100 | "Often data in the real world has errors, outliers, missing data, and noise. How you handle these can greatly influence the outcome of your model.\n", 101 | "\n", 102 | "* Feature engineering\n", 103 | "\n", 104 | "Once you have your data and labels, deciding on how to represent it to your model can be very challenging. For example, for spam classification would you just feed it the raw text? What about the origin of the IP addres? What about a timestamp?\n", 105 | "\n", 106 | "* Your model might not generalize\n", 107 | "\n", 108 | "After all of this, you might still end up with a model that either is too simple to be effective (underfitting) or too complex to generalize well (overfitting). You have to develop a model that is just right. :)\n", 109 | "\n", 110 | "* Evaluation is non-trivial\n", 111 | "\n", 112 | "Let's say we develop a machine learning model for spam classification. How do we evaluate it? Do we care more about precision or recall? How do we tie our scientific metrics to business metrics?\n", 113 | "\n", 114 | "* Getting into production can be hard\n", 115 | "\n", 116 | "You have a beautiful model built in Python only to discover the back-end is in Java and has to run in under 5ms, so micro-services are not an option. 
So you convert your model to PMML, but engineers won't let you push code, so you are now blocked and putting your model in production isn't high on their priorities.\n", 117 | "\n", 118 | "\n", 119 | "## There is hope\n", 120 | "\n", 121 | "While many machine learning initiatives do fail, many also succeed and are running some of the most valuable companies in the world. Companies like Google, Facebook, Amazon, AirBnB, and Netflix have all found successful ways to leverage machine learning and are reaping large rewards.\n", 122 | "\n", 123 | "Google CEO Sundar Pichai even recently said, \"an important shift from a mobile first world to an AI first world.\"\n", 124 | "\n", 125 | "And Mark Cuban said, \"Artificial Intelligence, deep learning, machine learning — whatever you're doing if you don't understand it — learn it. Because otherwise you're going to be a dinosaur within 3 years\"\n", 126 | "\n", 127 | "And lastly, [Harvard Business Review](https://hbr.org/2012/10/big-data-the-management-revolution) found, \"companies in the top third of their industry in the use of data-driven decision making were, on average, 5% more productive and 6% more profitable than their competitors.\"\n", 128 | "\n", 129 | "The goal of this course is to prepare you for this world, so that you will not only know how to build the machine learning models to predict the future, but also understand the key ingredients of a successful machine learning initiative and how to overcome the challenges." 130 | ] 131 | } 132 | ], 133 | "metadata": { 134 | "kernelspec": { 135 | "display_name": "Python 2", 136 | "language": "python", 137 | "name": "python2" 138 | }, 139 | "language_info": { 140 | "codemirror_mode": { 141 | "name": "ipython", 142 | "version": 2 143 | }, 144 | "file_extension": ".py", 145 | "mimetype": "text/x-python", 146 | "name": "python", 147 | "nbconvert_exporter": "python", 148 | "pygments_lexer": "ipython2", 149 | "version": "2.7.13" 150 | } 151 | }, 152 | "nbformat": 4, 153 | "nbformat_minor": 2 154 | } 155 | -------------------------------------------------------------------------------- /lectures/.ipynb_checkpoints/Lecture_6_K_Nearest_Neighbors-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## K-Nearest Neighbors\n", 8 | "\n", 9 | "This is one of the simplest and easiest to understand algorithms. It can be used for both classification and regression tasks, but is more common in classification, so we will focus there. The principles, though, can be used in both cases and sklearn supports both.\n", 10 | "\n", 11 | "Basically, here is the algorithm:\n", 12 | "\n", 13 | "1. Define $k$\n", 14 | "2. Define a distance metric - usually Euclidean distance\n", 15 | "3. For a new data point, find the $k$ nearest training points and combine their classes in some way - usually voting - to get a predicted class\n", 16 | "\n", 17 | "That's it!\n", 18 | "\n", 19 | "**Some of the benefits:**\n", 20 | "\n", 21 | "* Doesn't really require any training in the traditional sense. You just need a fast way to find the nearest neighbors.\n", 22 | "* Easy to understand \n", 23 | "\n", 24 | "**Some of the negatives:**\n", 25 | "\n", 26 | "* Need to define k, which is a hyper-parameter, so it can be tuned with cross-validation. 
A higher value for k increases bias and a lower value increases variance.\n", 27 | "* Have to choose a distance metric and could get very different results depending on the metric. Again, you can use cross-validation.\n", 28 | "* Doesn't really offer insights into which features might be important.\n", 29 | "* Can suffer with high dimensional data due to the curse of dimensionality.\n", 30 | "\n", 31 | "**Basic assumption:**\n", 32 | "\n", 33 | "* Data points that are close are similar for our target\n", 34 | "\n", 35 | "\n", 36 | "### Curse of Dimensionality\n", 37 | "\n", 38 | "Basically, the curse of dimensionality means that in high dimensions, it is likely that close points are not much closer than the average distance, which means being close doesnt mean much. In high dimensions, the data becomes very spread out, which creates this phenomenon. \n", 39 | "\n", 40 | "There are so many good resources for this online, that I won't go any deeper. Here is one you might look at:\n", 41 | "\n", 42 | "http://blog.galvanize.com/how-to-manage-dimensionality/\n", 43 | "\n", 44 | "### Euclidean Distance\n", 45 | "\n", 46 | "For vectors, q and p that are being compared (these would be our feature vectors):\n", 47 | "\n", 48 | "$$\\sqrt{\\sum_{i=1}^{N}{(p_{i} - q_{i})^2}}$$\n", 49 | "\n", 50 | "\n", 51 | "### SKlearn Example\n", 52 | "\n" 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": 15, 58 | "metadata": { 59 | "collapsed": false 60 | }, 61 | "outputs": [], 62 | "source": [ 63 | "from sklearn.datasets import load_breast_cancer\n", 64 | "from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor\n", 65 | "from sklearn.model_selection import train_test_split, GridSearchCV\n", 66 | "from sklearn.metrics import f1_score, classification_report, accuracy_score, mean_squared_error" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": 3, 72 | "metadata": { 73 | "collapsed": false 74 | }, 75 | "outputs": [ 76 | { 77 | "name": "stdout", 78 | "output_type": "stream", 79 | "text": [ 80 | "Best Params: {'n_neighbors': 5, 'p': 1, 'weights': 'uniform'}\n", 81 | "Train F1: 0.9606837606837607\n", 82 | "Test Classification Report:\n", 83 | " precision recall f1-score support\n", 84 | "\n", 85 | " 0 0.97 0.88 0.93 43\n", 86 | " 1 0.93 0.99 0.96 71\n", 87 | "\n", 88 | "avg / total 0.95 0.95 0.95 114\n", 89 | "\n", 90 | "Train Accuracy: 0.9494505494505494\tTest accuracy: 0.9473684210526315\n" 91 | ] 92 | } 93 | ], 94 | "source": [ 95 | "data = load_breast_cancer()\n", 96 | "X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.20, random_state=42)\n", 97 | "clf = KNeighborsClassifier()\n", 98 | "gridsearch = GridSearchCV(clf, {\"n_neighbors\": [1, 3, 5, 7, 9, 11], \"weights\": ['uniform', 'distance'], \n", 99 | " 'p': [1, 2, 3]}, scoring='f1')\n", 100 | "gridsearch.fit(X_train, y_train)\n", 101 | "print(\"Best Params: {}\".format(gridsearch.best_params_))\n", 102 | "y_pred_train = gridsearch.predict(X_train)\n", 103 | "print(\"Train F1: {}\".format(f1_score(y_train, y_pred_train)))\n", 104 | "print(\"Test Classification Report:\")\n", 105 | "y_pred_test = gridsearch.predict(X_test)\n", 106 | "print(classification_report(y_test, y_pred_test))\n", 107 | "print(\"Train Accuracy: {}\\tTest accuracy: {}\".format(accuracy_score(y_train, y_pred_train),\n", 108 | " accuracy_score(y_test, y_pred_test)))" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "### Voting Methods\n", 116 | 
"\n", 117 | "* Majority Voting: After you take the $k$ nearest neighbors, you take a \"vote\" of those neighbors' classes. The new data point is classified with whatever the majority class of the neighbors is. If you are doing a binary classification, it is recommended that you use an odd number of neighbors to avoid tied votes. However, in a multi-class problem, it is harder to avoid ties. A common solution to this is to decrease $k$ until the tie is broken.\n", 118 | "* Distance Weighting: Instead of directly taking votes of the nearest neighbors, you weight each vote by the distance of that instance from the new data point. A common weighting method is $\\hat{y} = \\frac{\\sum_{i=1}^Nw_iq_i}{\\sum_{i=1}^N w_i}$, where $w_i=\\frac{1}{\\sum_{i=1}^{N}{(p_{i} - q_{i})^2}}$, or one over the distance between the new data point and the training point. The new data point is added into the class with the largest added weight. Not only does this decrease the chances of ties, it also reduces the effect of a skewed representation of data.\n", 119 | "\n", 120 | "### Distance Metrics\n", 121 | "Euclidean distance, or the 2-norm, is a very common distance metric to use for $k$-nearest neighbors. Any $p$-norm can be used (the $p$-norm is defined as $||\\mathbf{x}_i-\\mathbf{y}_i||_p = (\\sum_{i=1}^N|x_i-y_i|^p)^\\frac{1}{p}$. For categorical data, however, this can be a problem. For example, if we have encoded a feature for car color from red, blue, and green to 0, 1, 2, how can the \"distance\" between green and red be measured? You could make dummy variables, but if a feature has 15 possible categories, that means adding 14 more variables to your feature set, and we run into the curse of dimensionality. There are several ways to confront this issue, but they are beyond the scope of this lecture. But be aware of the effect categorical features can have on your nearest neighbors classifier.\n", 122 | "\n", 123 | "### Search Algorithm\n", 124 | "\n", 125 | "Imagine the data set contains 2000 points. A brute-force search for the 3 nearest neighbors to one point does not take very long. But if the data set contains 2000000 points, a brute-force search can become quite costly, especially if the dimension of the data is large. Other search algorithms sacrifice an exhaustive search for a faster run time. Structures such as KDTrees (see https://en.wikipedia.org/wiki/K-d_tree) or Ball trees (see https://en.wikipedia.org/wiki/Ball_tree) are used for faster run times. While we won't dive into the details of these structures, be aware of them and how they can optimize your run time (although, training time does increase).\n", 126 | "\n", 127 | "### Radius Neighbors Classifier\n", 128 | "\n", 129 | "This is the same idea as a $k$ nearest neighbors classifier, but instead of finding the $k$ nearest neighbors, you find all the neighbors within a given radius. Setting the radius requires some domain knowledge; if your points are closely packed together, you'd want to use a smaller radius to avoid having nearly every point vote.\n", 130 | "\n", 131 | "### K Neighbors Regressor\n", 132 | "To change our problem from classification to regressing, all we have to do is find the weighted average of the $k$ nearest neighbors. Instead of taking the majority class, we calculate a weighted average of these nearest values, using the same weighting method as above.\n", 133 | "\n", 134 | "Let's try predicting the area of tissue based on the other features using a $k$-neighbors regressor." 
135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": 19, 140 | "metadata": { 141 | "collapsed": false 142 | }, 143 | "outputs": [ 144 | { 145 | "name": "stdout", 146 | "output_type": "stream", 147 | "text": [ 148 | "Best Params: {'n_neighbors': 5, 'p': 1, 'weights': 'distance'}\n", 149 | "Train MSE: 0.0\tTest MSE: 878.1482686614424\n" 150 | ] 151 | } 152 | ], 153 | "source": [ 154 | "target = data.data[:,3]\n", 155 | "X = data.data[:,[0,1,2,4,5,6,7,8]]\n", 156 | "X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.20, random_state=42)\n", 157 | "reg = KNeighborsRegressor()\n", 158 | "gridsearch = GridSearchCV(reg, {\"n_neighbors\": [1, 3, 5, 7, 9, 11], \"weights\": ['uniform', 'distance'], \n", 159 | " 'p': [1, 2, 3]}, scoring='neg_mean_squared_error')\n", 160 | "gridsearch.fit(X_train, y_train)\n", 161 | "print(\"Best Params: {}\".format(gridsearch.best_params_))\n", 162 | "y_pred_train = gridsearch.predict(X_train)\n", 163 | "y_pred_test = gridsearch.predict(X_test)\n", 164 | "print(\"Train MSE: {}\\tTest MSE: {}\".format(mean_squared_error(y_train, y_pred_train),\n", 165 | " mean_squared_error(y_test, y_pred_test)))" 166 | ] 167 | }, 168 | { 169 | "cell_type": "markdown", 170 | "metadata": {}, 171 | "source": [ 172 | "### Outlier Detection\n", 173 | "In $k$-nearest neighbors, the data is naturally clustered. Within these clusters, we can find the average distance between points (either exhaustively or from the centroid of the cluster). If we find a few points that are much farther than the average distance to other points or to the centroid, it is reasonable (but not always correct) to think they could be outliers. We can use this process on new data points as well. " 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": null, 179 | "metadata": { 180 | "collapsed": true 181 | }, 182 | "outputs": [], 183 | "source": [] 184 | } 185 | ], 186 | "metadata": { 187 | "kernelspec": { 188 | "display_name": "Python 3", 189 | "language": "python", 190 | "name": "python3" 191 | }, 192 | "language_info": { 193 | "codemirror_mode": { 194 | "name": "ipython", 195 | "version": 3 196 | }, 197 | "file_extension": ".py", 198 | "mimetype": "text/x-python", 199 | "name": "python", 200 | "nbconvert_exporter": "python", 201 | "pygments_lexer": "ipython3", 202 | "version": "3.5.3" 203 | } 204 | }, 205 | "nbformat": 4, 206 | "nbformat_minor": 2 207 | } 208 | -------------------------------------------------------------------------------- /lectures/.ipynb_checkpoints/Lecture_9_Decision_Trees-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "collapsed": true 7 | }, 8 | "source": [ 9 | "## Decision Trees\n", 10 | "\n", 11 | "To better understand decision trees, let's take a look at a simple example:\n", 12 | "\n", 13 | "https://zhengtianyu.files.wordpress.com/2013/12/decision-tree.png\n", 14 | "\n", 15 | "Clearly, decision trees are very simple and pretty easy to understand. Once you have a tree, you just follow the forks down for your example to get to an end node. Then at that end node you take the most common class or average value for regression. In fact, with classification you can get probabilistic estimates by taking the fraction of examples in that class. Nice! 
And that is one of the biggest benefits of decision trees - they are so easy to explain and understand.\n", 16 | "\n", 17 | "Most decision trees use binary trees - each split is either yes or no. When you graph these decision boundaries you end up with a bunch of vertical or horizontal lines. Here is an example with the iris data set:\n", 18 | "\n", 19 | "http://statweb.stanford.edu/~jtaylo/courses/stats202/trees.html\n", 20 | "\n", 21 | "## How To Grow A Tree\n", 22 | "\n", 23 | "So - how does one learn a tree? Do we just randomly pick binary splitting points and see what comes out? Of course not! We leverage our data and a definition of impurity. \n", 24 | "\n", 25 | "If you think about it, what we really want from our tree is pure leaf nodes. Meaning that, for classification, we would like each leaf node to end up with examples of only a single class. This greatly increases our confidence that if a testing example ends up at that leaf node it is in fact of that class.\n", 26 | "\n", 27 | "Imagine the opposite, a leaf node in a 2 class classification problem that has 50% of the examples in one class and 50% in the other. That really doesn't help us at all! If a testing example ends up at that leaf node, all we can offer is a 50/50 guess.\n", 28 | "\n", 29 | "So, how do we define impurity?\n", 30 | "\n", 31 | "**Gini Impurity**\n", 32 | "\n", 33 | "$$Gini_{i} = 1 - \\sum_{k=1}^{n}{p_{i,k}^2}$$\n", 34 | "\n", 35 | "Where $i$ is the node of interest, $n$ is the total number of classes, and $p_{i,k}$ is the fraction of class $k$ in node $i$. The Gini of a node is 0 when all the examples belong to the same class. Gini is highest when there is an equal probability of being in each class. \n", 36 | "\n", 37 | "Let's consider a 2-class example. Say I am trying to predict whether someone is male or female and I branch on height being less than 5 1/2 feet. In this node, I find 50 examples of which 40 are female and 10 are male. Gini is then:\n", 38 | "\n", 39 | "$$1 - (40/50)^2 - (10/50)^2 = 0.32$$\n", 40 | "\n", 41 | "Very nice - now we can basically evaluate how good a node is. With this we can now start to **greedily** grow our tree. What I mean by greedily is that we will not consider every possible tree, but instead start with the most pure feature split and then from there pick the next most pure, etc.\n", 42 | "\n", 43 | "Thus, our algorithm looks at all the features and all the splitting points for our features (note: this process runs much faster if your features have a small number of unique values, like categories, as opposed to real numbers). To evaluate a feature and split-point pair we do the following:\n", 44 | "\n", 45 | "Take the weighted average of the Gini impurity for the two nodes created by the split, with the weights being the number of examples in each node.\n", 46 | "\n", 47 | "That's it! Basically, the feature split that produces the lowest weighted average Gini impurity is considered best, then we move on and find the next best feature split given all the previous feature splits.\n", 48 | "\n", 49 | "### When to stop?\n", 50 | "\n", 51 | "Decision trees have many hyper-parameters to help control when to stop growing the tree.
These include max depth, min_samples_split, min_samples_leaf, min_weight_fraction_leaf, and max_leaf_nodes.\n", 52 | "\n", 53 | "Since we have to consider all these splits, training can be slow, but predictions are very fast since we just have to traverse the tree.\n", 54 | "\n", 55 | "Also, there are other impurity measures that are used with decision trees. Another very popular one is entropy. See page 173 of Hands On Machine Learning for its definition. Usually, which one you choose doesn't matter too much. Gini is slightly faster to compute, while entropy produces slightly more balanced trees.\n", 56 | "\n", 57 | "### Bias and variance\n", 58 | "\n", 59 | "Decision trees can be very prone to overfitting if you let them grow too deep. Thus, decreasing the depth can decrease variance / increase bias. It is important to use cross-validation to do hyper-parameter selection. \n", 60 | "\n", 61 | "### Regression\n", 62 | "\n", 63 | "Decision trees can be applied to regression problems in exactly the same way, but instead of using Gini impurity you would use mean squared error. You calculate the MSE of a node by setting the prediction for all values in that node as the average $y$ value of the examples in that node. For example, if your node had these values: 5, 2, 1, 6 then your predicted value would be (5+2+1+6)/4 = 3.5. You would take the squared difference between 3.5 and each of the values and then take the mean. \n", 64 | "\n", 65 | "Prediction is done by traversing to the leaf node for an example and taking the average value at that node.\n", 66 | "\n", 67 | "### Pros\n", 68 | "\n", 69 | "* Easy to explain\n", 70 | "* Can be visualized\n", 71 | "* Can handle categorical variables and missing data well\n", 72 | "\n", 73 | "### Cons\n", 74 | "\n", 75 | "* Typically don't have very strong prediction accuracy\n", 76 | "* Very sensitive to small changes in training data" 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": null, 82 | "metadata": { 83 | "collapsed": true 84 | }, 85 | "outputs": [], 86 | "source": [] 87 | } 88 | ], 89 | "metadata": { 90 | "kernelspec": { 91 | "display_name": "Python 2", 92 | "language": "python", 93 | "name": "python2" 94 | }, 95 | "language_info": { 96 | "codemirror_mode": { 97 | "name": "ipython", 98 | "version": 2 99 | }, 100 | "file_extension": ".py", 101 | "mimetype": "text/x-python", 102 | "name": "python", 103 | "nbconvert_exporter": "python", 104 | "pygments_lexer": "ipython2", 105 | "version": "2.7.13" 106 | } 107 | }, 108 | "nbformat": 4, 109 | "nbformat_minor": 2 110 | } 111 | -------------------------------------------------------------------------------- /lectures/.ipynb_checkpoints/Random Notes-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [], 3 | "metadata": {}, 4 | "nbformat": 4, 5 | "nbformat_minor": 0 6 | } 7 | -------------------------------------------------------------------------------- /lectures/Lecture_11_Boosting.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Boosting\n", 8 | "\n", 9 | "Boosting is another ensemble technique like random forests.
With random forests, we trained multiple decision trees independently and created independence by randomly sampling samples and features.\n", 10 | "\n", 11 | "Boosting, on the other hand, trains multiple models sequentially, where each model tries to improve on the areas where the previous models performed poorly. \n", 12 | "\n", 13 | "### Adaboost\n", 14 | "\n", 15 | "Adaboost works by starting with a single model. From that model we then make predictions on the **training** set. We can then see which training samples this first model got right and which it got wrong. Adaboost then trains the next model, but puts more weight on the training samples that the first model got wrong. This process continues for N models, where N is a hyper-parameter.\n", 16 | "\n", 17 | "It is important to note that the sequential nature of boosting makes it harder to scale and parallelize relative to random forest models. That being said, work has been done with boosting methods to allow them to be more parallelizable. A very popular library that does this is [XGboost](https://github.com/dmlc/xgboost).\n", 18 | "\n", 19 | "So - at the end of your training, you now have N models where each model is trained to do better on the instances that the previous models did poorly on. You can now combine these models via a weighted voting or averaging method very similar to random forest, except you weight each vote by how accurate the model was overall.\n", 20 | "\n", 21 | "Adaboost also has a hyper-parameter called **learning rate.** This hyper-parameter adjusts the contribution of each classifier. When decreasing the value, each new classifier makes smaller adjustments to the weights of mis-classified samples; basically, Adaboost learns more slowly per tree. Typically a lower learning rate requires more trees to perform well. This value and the number of trees can be tuned using cross-validation. \n", 22 | "\n", 23 | "#### Math\n", 24 | "\n", 25 | "So, how exactly do we do this?\n", 26 | "\n", 27 | "**Step 1**: Set all sample weights to $1/m$ when we have $m$ training examples\n", 28 | "\n", 29 | "**Step 2**: Train the first model\n", 30 | "\n", 31 | "**Step 3**: Calculate the weighted error of this model, which is simply the sum of the weights of the misclassified\n", 32 | "examples divided by the total weight of all samples. We will call this $r_j$ for the $j$th model. The best value is zero and the worst is 1.\n", 33 | "\n", 34 | "**Step 4**: Calculate the predictor's weight, where being more accurate gets a higher weight:\n", 35 | "\n", 36 | "$$\\alpha_{j} = \\eta \\log \\frac{1-r_{j}}{r_j}$$\n", 37 | "\n", 38 | "If the model is wrong more often than it is right, it gets a negative weight; if it is close to random, it gets a weight close to zero.\n", 39 | "\n", 40 | "**Step 5**: Update the weights of your training samples: if the model got a sample right, its weight remains the same. If it got it wrong, the new weight is:\n", 41 | "\n", 42 | "$$\\text{new weight} = \\text{old weight} \\times \\exp(\\alpha_{j})$$\n", 43 | "\n", 44 | "Then normalize all weights to sum to 1 by dividing each weight by the sum of the weights.\n", 45 | "\n", 46 | "You can see that a good predictor adds extra weight to its mis-classifications, putting a strong emphasis on them for the next model.
Also, $\\eta$ is our learning rate and can decrease the impact of a tree on weight updates by being less than 1.\n", 47 | "\n", 48 | "**Step 6**: Repeat from step 2 with next model and continue repeating until required number of models.\n", 49 | "\n", 50 | "That's it!\n", 51 | "\n", 52 | "Predictions are made by running a new sample through all the trained models, getting the most likely class (for classification) for each one, and then doing a weighted vote where the weight is the value from step 4. For regression just weighted average.\n", 53 | "\n", 54 | "Note: Almost always the model choosen for boosting is a decision tree.\n", 55 | "\n", 56 | "### Gradient Boosting\n", 57 | "\n", 58 | "Gradient boosting is also sequential, but instead of changing weights, it uses the residual errors from the previous model as the targets.\n", 59 | "\n", 60 | "Basically for regression, here are our steps:\n", 61 | "\n", 62 | "1. Fit a decision tree (assuming this is our base model and it is the most common)\n", 63 | "2. Calculate the residuals: true training values - predicted training values. Note: it turns out that this is the same as taking the negative gradient of the loss function, so this can generalize to other loss functions for classification and ranking tasks.\n", 64 | "3. Train a second decision tree where the residuals are the targets\n", 65 | "4. Continue the process for the number of defined estimators (hyper-parameter)\n", 66 | "\n", 67 | "At prediction time, we make a prediction with each tree and **sum** them. \n", 68 | "\n", 69 | "### Learning Rate\n", 70 | "\n", 71 | "The learning rate for gradient boosting trees, scales the contribution of each tree. \n", 72 | "\n", 73 | "\n", 74 | "### Early Stopping\n", 75 | "\n", 76 | "A good way to decide how many trees are needed is to use early stopping. See page 199 of Hands on Machine Learning for an example of this with sklearn. Basically, as you add an additional tree, you check your validation error and when your validation error stops getting better, you stop adding trees.\n", 77 | "\n", 78 | "\n", 79 | "### Some final notes\n", 80 | "\n", 81 | "* Boosting is more likely to overfit than random forests when the number of estimators is large. Though, usually slow to overfit.\n", 82 | "* Typical learning rates are around 0.01 or 0.001. And small learning rates can require a large number of estimators to achieve good results.\n", 83 | "* The max depth of the trees controls the complexity of the model and a max depth of 1 can often work well. This results in an additive model.\n", 84 | "* Can also get feature importance scores as with Random Forest.\n", 85 | "\n", 86 | "### Stacking\n", 87 | "\n", 88 | "We won't spend time discussing stacking as it tends to be too complex for industry. That being said, it is a good technique to be familiar with. Take a look at p. 200 of Hands on Machine Learning." 
89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "metadata": {}, 94 | "source": [ 95 | "## SKlearn Example" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": 2, 101 | "metadata": { 102 | "collapsed": true 103 | }, 104 | "outputs": [], 105 | "source": [ 106 | "from sklearn.datasets import load_breast_cancer\n", 107 | "from sklearn.ensemble import GradientBoostingClassifier\n", 108 | "from sklearn.model_selection import train_test_split, GridSearchCV\n", 109 | "from sklearn.metrics import f1_score, classification_report\n", 110 | "from collections import Counter\n", 111 | "import numpy as np\n", 112 | "import pandas as pd\n", 113 | "%matplotlib inline" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": 4, 119 | "metadata": { 120 | "collapsed": true 121 | }, 122 | "outputs": [], 123 | "source": [ 124 | "data = load_breast_cancer()\n", 125 | "X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.20, random_state=42)" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": 5, 131 | "metadata": { 132 | "collapsed": false 133 | }, 134 | "outputs": [ 135 | { 136 | "data": { 137 | "text/plain": [ 138 | "GridSearchCV(cv=None, error_score='raise',\n", 139 | " estimator=GradientBoostingClassifier(criterion='friedman_mse', init=None,\n", 140 | " learning_rate=0.1, loss='deviance', max_depth=3,\n", 141 | " max_features=None, max_leaf_nodes=None,\n", 142 | " min_impurity_decrease=0.0, min_impurity_split=None,\n", 143 | " min_samples_leaf=1, min_samples_split=2,\n", 144 | " min_weight_fraction_leaf=0.0, n_estimators=100,\n", 145 | " presort='auto', random_state=None, subsample=1.0, verbose=0,\n", 146 | " warm_start=False),\n", 147 | " fit_params=None, iid=True, n_jobs=1,\n", 148 | " param_grid={'learning_rate': [0.1, 0.01, 0.001], 'n_estimators': [100, 1000, 5000], 'max_depth': [1, 2, 3]},\n", 149 | " pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',\n", 150 | " scoring='f1', verbose=0)" 151 | ] 152 | }, 153 | "execution_count": 5, 154 | "metadata": {}, 155 | "output_type": "execute_result" 156 | } 157 | ], 158 | "source": [ 159 | "clf = GradientBoostingClassifier()\n", 160 | "gridsearch = GridSearchCV(clf, {\"learning_rate\": [.1, .01, .001], \"n_estimators\": [100, 1000, 5000], \n", 161 | " 'max_depth': [1, 2, 3]}, scoring='f1')\n", 162 | "gridsearch.fit(X_train, y_train)" 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": 10, 168 | "metadata": { 169 | "collapsed": false 170 | }, 171 | "outputs": [ 172 | { 173 | "name": "stdout", 174 | "output_type": "stream", 175 | "text": [ 176 | "Best Params: {'learning_rate': 0.01, 'max_depth': 1, 'n_estimators': 5000}\n", 177 | "\n", 178 | "Classification Report:\n", 179 | " precision recall f1-score support\n", 180 | "\n", 181 | " 0 0.95 0.93 0.94 43\n", 182 | " 1 0.96 0.97 0.97 71\n", 183 | "\n", 184 | "avg / total 0.96 0.96 0.96 114\n", 185 | "\n" 186 | ] 187 | } 188 | ], 189 | "source": [ 190 | "print(\"Best Params: {}\".format(gridsearch.best_params_))\n", 191 | "print(\"\\nClassification Report:\")\n", 192 | "print(classification_report(y_test, gridsearch.predict(X_test)))" 193 | ] 194 | } 195 | ], 196 | "metadata": { 197 | "kernelspec": { 198 | "display_name": "Python 3", 199 | "language": "python", 200 | "name": "python3" 201 | }, 202 | "language_info": { 203 | "codemirror_mode": { 204 | "name": "ipython", 205 | "version": 3 206 | }, 207 | "file_extension": ".py", 208 | "mimetype": "text/x-python", 209 | 
"name": "python", 210 | "nbconvert_exporter": "python", 211 | "pygments_lexer": "ipython3", 212 | "version": "3.5.3" 213 | } 214 | }, 215 | "nbformat": 4, 216 | "nbformat_minor": 2 217 | } 218 | -------------------------------------------------------------------------------- /lectures/Lecture_14_MLP.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Multi-layer perceptron\n", 8 | "\n", 9 | "I am not even going to try and write a better intro. to neural nets than this...\n", 10 | "\n", 11 | "https://jalammar.github.io/visual-interactive-guide-basics-neural-networks/\n", 12 | "\n", 13 | "### Softmax Equation\n", 14 | "\n", 15 | "Given an array of values of length n, the softmax of value i in the array is:\n", 16 | "\n", 17 | "$$\\frac{e^{i}}{\\sum_{j}^{n}e^{j}}$$\n", 18 | "\n", 19 | "### Deep Neural Network\n", 20 | "\n", 21 | "When you have multiple hidden layers - the layers in between the input and softmax layers, the network is called deep.\n", 22 | "\n", 23 | "### Backpropagation\n", 24 | "\n", 25 | "Neural nets are trained using a technique called backpropagation. At a very high level, you pass a training example through your network (forward pass), then measure its error, and then you go backwards through each layer to measure the contribution of each connection to the error (backwards pass). You then use this information to adjust the weights of your connections using gradient descent. \n", 26 | "\n", 27 | "### Vanishing/Exploding gradients\n", 28 | "\n", 29 | "When your gradients start to get too small or too large this can negatively effect learning. For example, a zero gradient will stop learning all together and when you gradients get too large your learning can diverge.\n", 30 | "\n", 31 | "### Activation Functions\n", 32 | "\n", 33 | "The article above does not talk much about activation functions. Typically, in an MLP after you pass connections to a neuron you then apply an activation function. Historically, that activation function was a logistic function, which then is basically logistic regression. This tends to suffer from vanishing gradient problem.\n", 34 | "\n", 35 | "Another very popular activation function now is relu. Relu(z) = max(0,z). This is very fast to compute and in practice works very well. This function suffers less from the vanishing gradient problem.\n", 36 | "\n", 37 | "One problem with relu is that the connections can die. This happens if the inputs to a neuron end up negative resulting in a zero gradient. Thus, the **leaky relu** was invented: Leaky Relu(x) = max($\\alpha$x, x) where $\\alpha$ is usually a value of 0.01 or 0.02. The $\\alpha$ value is the slope when x < 0 and ensures that the activation never truly dies, though it can become quite small.\n", 38 | "\n", 39 | "**Elu** is another activation function which generally performs the best but is slower to compute then a leaky relu. Again, when x > 0 you just get x. But when x < 0 you get $\\alpha$(exp(x) -1). $\\alpha$ represents the value that the function approaches when x is a large negative number. Usually, it is set to 1. This function is also smooth everywhere, including zero.\n", 40 | "\n", 41 | "### Batch Normalization\n", 42 | "\n", 43 | "As we have learned it is important to scale - or normalize - your data before feeding it to a neural net. 
Another important normalization step is right before your activation function, where you again normalize your data by subtracting the mean and dividing by the standard deviation. Since you are working with a batch, you use the batch mean and standard deviation. You also allow each batch normalization layer to learn an appropriate scaling and shifting factor for your standardized values. \n", 44 | "\n", 45 | "This technique has been shown to reduce the vanishing/exploding gradient problem, allow the use of larger learning rates, and be less sensitive to initialization. On the downside, it reduces runtime prediction speed.\n", 46 | "\n", 47 | "\n", 48 | "### Cross-entropy\n", 49 | "\n", 50 | "$$-\\frac{1}{m}\\sum_{i=1}^{m}\\sum_{k=1}^{K}y_{k}^{i}\\log(p_{k}^{i})$$\n", 51 | "\n", 52 | "Where:\n", 53 | "\n", 54 | "* $m$ - the number of data points\n", 55 | "* $K$ - the number of classes\n", 56 | "* $y_{k}^{i}$ - the true class value for row i, class k. Either a zero or one depending on whether k is the correct class\n", 57 | "* $p_{k}^{i}$ - the value predicted by your model for class k, row i. Usually from your softmax\n", 58 | "\n", 59 | "This is the cost function you are trying to minimize.\n", 60 | "\n", 61 | "### Important to Remember\n", 62 | "\n", 63 | "* Scale data - usually zero to one\n", 64 | "* Shuffle data\n", 65 | "\n", 66 | "### Tuning Hyper-parameters\n", 67 | "\n", 68 | "* Better to use random search\n", 69 | "* Start with reasonable, known architectures\n", 70 | "* Number of hidden layers:\n", 71 | " * It can often be valuable to have a deep network to learn a hierarchy of features. Deeper networks usually converge faster and generalize better. \n", 72 | " * More complex problems can often require deeper networks and more data\n", 73 | "* Number of neurons:\n", 74 | " * Typically size the layers to form a type of funnel with fewer and fewer neurons at each layer. This comes back to the hierarchy idea, where you might need more neurons to learn lower level features. \n", 75 | " * Also can try picking the same number of neurons for all layers to have fewer parameters to tune\n", 76 | "* Usually more value in going deeper than wider\n", 77 | "* Can try going deeper and wider than you think necessary and use regularization techniques, such as early stopping, to prevent overfitting.\n", 78 | "\n", 79 | "### Initialization\n", 80 | "\n", 81 | "It turns out with neural nets that how you initialize your weights can be quite important. Instead of purely random initialization, it is usually preferred to use either Xavier or He initialization. \n", 82 | "\n", 83 | "P. 278 of Hands on Machine Learning has a good description of these initializations.\n", 84 | "\n", 85 | "If you are going to use Relu or Elu activation functions, I would recommend He, which is supported by Keras:\n", 86 | "\n", 87 | "https://keras.io/initializers/#he_normal\n", 88 | "\n", 89 | "For He normal you initialize from a truncated normal distribution centered around 0 and with a standard deviation of sqrt(2 / number of inputs)\n", 90 | "\n", 91 | "### Transfer Learning\n", 92 | "\n", 93 | "It turns out that the weights of a neural network can be used by other networks with the same architecture. For example, imagine Google has trained a neural network on millions of images from google search to predict 100 categories.
Now, you would like to take a few thousand photos from your own photo collection and train a neural network to detect whether or not you are in a photo (binary classification).\n", 94 | "\n", 95 | "It turns out that you can start with Google's network and weights (Assuming you can get them) and use them as a starting place for your network. Assuming you are okay with the rest of their architecture, you would just need to change the last layer to 2 nodes intead of 100 and learn those weights from scratch.\n", 96 | "\n", 97 | "This is a really powerful idea and allows you to train much faster and with less data.\n", 98 | "\n", 99 | "This is such a good idea that you are almost always better starting with pre-trained weights if you can find them even if the problem they were trained on is not that close to your problem. Obviously, the closer the problems the better.\n", 100 | "\n", 101 | "Many deep learning frameworks have what are called model zoos where you can find pre-trained models. Keras' model zoo is here: https://keras.io/applications/\n", 102 | "\n", 103 | "You can find more details on how to perform some of these techniques using Keras here: https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html\n", 104 | "\n", 105 | "Lastly, another valuable option when you have little data is to pre-train your own network on related data. For example, if you want to classify whether you are in an image or not, but only have a few images of yourself. You can first train your network on images of people in general and then fine-tune your network with the images of yourself.\n", 106 | "\n", 107 | "### Optimizers\n", 108 | "\n", 109 | "We have previously discussed vanilla gradient descent where you move in the direction negative to the gradient in proportion to the learning rate. It turns out that there are faster techniques for finiding the minimum - or a minimum - in your cost function. These faster techniques are very valuable with neural nets which already take a long time to train.\n", 110 | "\n", 111 | "We won't cover these in too much detail, but there is a good description here:\n", 112 | "\n", 113 | "http://ruder.io/optimizing-gradient-descent/index.html#whichoptimizertochoose\n", 114 | "\n", 115 | "And starting on p.295 of Hands on Machine Learning.\n", 116 | "\n", 117 | "Generally, a good place to start can be the Adam optimizer.\n", 118 | "\n", 119 | "### Regularization\n", 120 | "\n", 121 | "As we have discussed, neural nets can be quite prone to over fitting. Thus, we have some techniques to combat this:\n", 122 | "\n", 123 | "* **Early Stopping:** Keep track of your validation error after every iteration and stop training when it stops going down. Usually, you would say something like: if the validation error has not decreased in 5 continuous iterations, stop.\n", 124 | "* **L1 and L2:** Just like logistic and linear regression, we can add a penalty term for large weights.\n", 125 | "* ** Dropout:** This is probably the most popular regularization method and is seen in many architectures. It is simple: at every iteration, every neuron has some probability of being turned off or inactive during that iteration (except the output neurons). This probability is usually referred to as the dropout rate and is a hyper parameter you have to choose. What this does, is it forces the network to become pretty robust. At anytime, it can lose a neuron and thus can't learn to become too dependent on a small set of neurons - including the input. 
This isn't too different from random forest where each decision tree sees slightly different samples and features. With dropout, every iteration is a slightly different neural net that sees different features (or neurons). \n", 126 | "\n", 127 | "Keras has a dropout layer: https://keras.io/layers/core/#dropout\n", 128 | "\n", 129 | "### Data Augmentation\n", 130 | "\n", 131 | "Neural nets - especially deep ones - love data. Sometimes you don't have a lot of data or would like more data, but getting additional samples can be costly. One way of dealing with this is by augmenting your current data via transformations.\n", 132 | "\n", 133 | "This idea is quite popular in computer vision. Say your task is to predict whether or not a dog is in an image and you have 5,000 images. To get more images you can randomly transform the 5,000 images you have. For example, you can change the rotation, brightness, size, etc. This then creates additional data while still not changing the label (the picture still contains a dog or not).\n", 134 | "\n", 135 | "These augmentations usually lead to your net being more robust to the transformations you applied and less prone to over-fit.\n", 136 | "\n", 137 | "Keras supports image data augmentation: https://keras.io/preprocessing/image/\n", 138 | "\n", 139 | "Even if your data are not images, though, you may be able to think of some creative ways of augmenting your data." 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": 1, 145 | "metadata": {}, 146 | "outputs": [ 147 | { 148 | "name": "stdout", 149 | "output_type": "stream", 150 | "text": [ 151 | "[0.0, 0.0, 0.02, 0.0, 0.98]\n", 152 | "1.0\n" 153 | ] 154 | } 155 | ], 156 | "source": [ 157 | "import numpy as np\n", 158 | "\n", 159 | "values = np.array([1.0, 3.0, 8.0, 4.0, 12.0])\n", 160 | "exp_values = np.exp(values)\n", 161 | "softmax = exp_values / sum(exp_values)\n", 162 | "print([round(x,2) for x in softmax])\n", 163 | "print(sum(softmax))" 164 | ] 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "metadata": {}, 169 | "source": [ 170 | "## Example using Python" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": 35, 176 | "metadata": {}, 177 | "outputs": [], 178 | "source": [ 179 | "from keras.models import Sequential\n", 180 | "from keras.layers import Dense, Activation, Dropout, BatchNormalization\n", 181 | "from keras.utils import np_utils\n", 182 | "from keras.datasets import mnist\n", 183 | "from sklearn.metrics import confusion_matrix\n", 184 | "import numpy as np\n", 185 | "from __future__ import division" 186 | ] 187 | }, 188 | { 189 | "cell_type": "code", 190 | "execution_count": 3, 191 | "metadata": {}, 192 | "outputs": [ 193 | { 194 | "name": "stdout", 195 | "output_type": "stream", 196 | "text": [ 197 | "Downloading data from https://s3.amazonaws.com/img-datasets/mnist.npz\n", 198 | "11337728/11490434 [============================>.] 
- ETA: 0s" 199 | ] 200 | } 201 | ], 202 | "source": [ 203 | "(x_train, y_train), (x_test, y_test) = mnist.load_data()\n", 204 | "\n", 205 | "y_train = np_utils.to_categorical(y_train, 10)\n", 206 | "y_test = np_utils.to_categorical(y_test, 10)\n", 207 | "\n", 208 | "def vectorize_image(images):\n", 209 | " scaled_images = images / 255\n", 210 | " return images.reshape(scaled_images.shape[0],-1)\n", 211 | "\n", 212 | "x_train = vectorize_image(x_train)\n", 213 | "x_test = vectorize_image(x_test)" 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": 51, 219 | "metadata": {}, 220 | "outputs": [], 221 | "source": [ 222 | "model = Sequential([\n", 223 | " Dense(128, input_shape=(784,), activation='elu', kernel_initializer='he_normal'),\n", 224 | " BatchNormalization(),\n", 225 | " Dropout(0.5),\n", 226 | " Dense(64, activation='elu', kernel_initializer='he_normal'),\n", 227 | " BatchNormalization(),\n", 228 | " Dropout(0.5),\n", 229 | " Dense(10, activation='softmax')\n", 230 | "])" 231 | ] 232 | }, 233 | { 234 | "cell_type": "code", 235 | "execution_count": 52, 236 | "metadata": {}, 237 | "outputs": [ 238 | { 239 | "name": "stdout", 240 | "output_type": "stream", 241 | "text": [ 242 | "_________________________________________________________________\n", 243 | "Layer (type) Output Shape Param # \n", 244 | "=================================================================\n", 245 | "dense_17 (Dense) (None, 128) 100480 \n", 246 | "_________________________________________________________________\n", 247 | "batch_normalization_3 (Batch (None, 128) 512 \n", 248 | "_________________________________________________________________\n", 249 | "dropout_5 (Dropout) (None, 128) 0 \n", 250 | "_________________________________________________________________\n", 251 | "dense_18 (Dense) (None, 64) 8256 \n", 252 | "_________________________________________________________________\n", 253 | "batch_normalization_4 (Batch (None, 64) 256 \n", 254 | "_________________________________________________________________\n", 255 | "dropout_6 (Dropout) (None, 64) 0 \n", 256 | "_________________________________________________________________\n", 257 | "dense_19 (Dense) (None, 10) 650 \n", 258 | "=================================================================\n", 259 | "Total params: 110,154\n", 260 | "Trainable params: 109,770\n", 261 | "Non-trainable params: 384\n", 262 | "_________________________________________________________________\n" 263 | ] 264 | } 265 | ], 266 | "source": [ 267 | "model.summary()" 268 | ] 269 | }, 270 | { 271 | "cell_type": "code", 272 | "execution_count": 53, 273 | "metadata": {}, 274 | "outputs": [], 275 | "source": [ 276 | "model.compile(optimizer='adam',\n", 277 | " loss='categorical_crossentropy')" 278 | ] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "execution_count": 54, 283 | "metadata": {}, 284 | "outputs": [ 285 | { 286 | "name": "stdout", 287 | "output_type": "stream", 288 | "text": [ 289 | "Train on 54000 samples, validate on 6000 samples\n", 290 | "Epoch 1/20\n", 291 | "54000/54000 [==============================] - 5s - loss: 0.5970 - val_loss: 0.1660\n", 292 | "Epoch 2/20\n", 293 | "54000/54000 [==============================] - 4s - loss: 0.3253 - val_loss: 0.1273\n", 294 | "Epoch 3/20\n", 295 | "54000/54000 [==============================] - 5s - loss: 0.2739 - val_loss: 0.1079\n", 296 | "Epoch 4/20\n", 297 | "54000/54000 [==============================] - 5s - loss: 0.2447 - val_loss: 0.0962\n", 298 | "Epoch 5/20\n", 299 | "54000/54000 
[==============================] - 5s - loss: 0.2249 - val_loss: 0.0907\n", 300 | "Epoch 6/20\n", 301 | "54000/54000 [==============================] - 5s - loss: 0.2093 - val_loss: 0.0887\n", 302 | "Epoch 7/20\n", 303 | "54000/54000 [==============================] - 5s - loss: 0.1976 - val_loss: 0.0876\n", 304 | "Epoch 8/20\n", 305 | "54000/54000 [==============================] - 6s - loss: 0.1897 - val_loss: 0.0849\n", 306 | "Epoch 9/20\n", 307 | "54000/54000 [==============================] - 6s - loss: 0.1842 - val_loss: 0.0833\n", 308 | "Epoch 10/20\n", 309 | "54000/54000 [==============================] - 6s - loss: 0.1738 - val_loss: 0.0832\n", 310 | "Epoch 11/20\n", 311 | "54000/54000 [==============================] - 6s - loss: 0.1679 - val_loss: 0.0813\n", 312 | "Epoch 12/20\n", 313 | "54000/54000 [==============================] - 7s - loss: 0.1617 - val_loss: 0.0772\n", 314 | "Epoch 13/20\n", 315 | "54000/54000 [==============================] - 7s - loss: 0.1620 - val_loss: 0.0801\n", 316 | "Epoch 14/20\n", 317 | "54000/54000 [==============================] - 7s - loss: 0.1539 - val_loss: 0.0719\n", 318 | "Epoch 15/20\n", 319 | "54000/54000 [==============================] - 8s - loss: 0.1522 - val_loss: 0.0809\n", 320 | "Epoch 16/20\n", 321 | "54000/54000 [==============================] - 9s - loss: 0.1486 - val_loss: 0.0782\n", 322 | "Epoch 17/20\n", 323 | "54000/54000 [==============================] - 9s - loss: 0.1483 - val_loss: 0.0712\n", 324 | "Epoch 18/20\n", 325 | "54000/54000 [==============================] - 10s - loss: 0.1437 - val_loss: 0.0733\n", 326 | "Epoch 19/20\n", 327 | "54000/54000 [==============================] - 10s - loss: 0.1402 - val_loss: 0.0708\n", 328 | "Epoch 20/20\n", 329 | "54000/54000 [==============================] - 10s - loss: 0.1379 - val_loss: 0.0707\n" 330 | ] 331 | }, 332 | { 333 | "data": { 334 | "text/plain": [ 335 | "" 336 | ] 337 | }, 338 | "execution_count": 54, 339 | "metadata": {}, 340 | "output_type": "execute_result" 341 | } 342 | ], 343 | "source": [ 344 | "model.fit(x_train, y_train, epochs=20, batch_size=64, validation_split=0.1)" 345 | ] 346 | }, 347 | { 348 | "cell_type": "code", 349 | "execution_count": 55, 350 | "metadata": {}, 351 | "outputs": [], 352 | "source": [ 353 | "test_predictions = np.argmax(model.predict(x_test),1)\n", 354 | "y_test_sparse = np.argmax(y_test, 1)" 355 | ] 356 | }, 357 | { 358 | "cell_type": "code", 359 | "execution_count": 56, 360 | "metadata": {}, 361 | "outputs": [ 362 | { 363 | "data": { 364 | "text/plain": [ 365 | "array([[ 969, 0, 1, 1, 1, 2, 3, 1, 2, 0],\n", 366 | " [ 0, 1126, 3, 1, 0, 0, 1, 0, 4, 0],\n", 367 | " [ 1, 3, 1006, 5, 2, 0, 1, 9, 5, 0],\n", 368 | " [ 1, 0, 6, 991, 0, 3, 0, 6, 3, 0],\n", 369 | " [ 0, 0, 3, 0, 966, 0, 6, 1, 1, 5],\n", 370 | " [ 2, 1, 0, 10, 1, 863, 5, 3, 5, 2],\n", 371 | " [ 5, 3, 1, 0, 2, 6, 937, 0, 4, 0],\n", 372 | " [ 1, 4, 9, 2, 0, 0, 0, 1008, 0, 4],\n", 373 | " [ 5, 2, 2, 6, 5, 2, 3, 5, 942, 2],\n", 374 | " [ 4, 6, 0, 10, 14, 1, 0, 8, 4, 962]])" 375 | ] 376 | }, 377 | "execution_count": 56, 378 | "metadata": {}, 379 | "output_type": "execute_result" 380 | } 381 | ], 382 | "source": [ 383 | "confusion_matrix(y_test_sparse, test_predictions)" 384 | ] 385 | }, 386 | { 387 | "cell_type": "code", 388 | "execution_count": 57, 389 | "metadata": {}, 390 | "outputs": [ 391 | { 392 | "data": { 393 | "text/plain": [ 394 | "array([0.977])" 395 | ] 396 | }, 397 | "execution_count": 57, 398 | "metadata": {}, 399 | "output_type": "execute_result" 400 | } 401 | ], 
402 | "source": [ 403 | "np.sum(y_test_sparse == test_predictions) / test_predictions.shape" 404 | ] 405 | } 406 | ], 407 | "metadata": { 408 | "kernelspec": { 409 | "display_name": "Python 3", 410 | "language": "python", 411 | "name": "python3" 412 | }, 413 | "language_info": { 414 | "codemirror_mode": { 415 | "name": "ipython", 416 | "version": 3 417 | }, 418 | "file_extension": ".py", 419 | "mimetype": "text/x-python", 420 | "name": "python", 421 | "nbconvert_exporter": "python", 422 | "pygments_lexer": "ipython3", 423 | "version": "3.5.3" 424 | } 425 | }, 426 | "nbformat": 4, 427 | "nbformat_minor": 2 428 | } 429 | -------------------------------------------------------------------------------- /lectures/Lecture_15_Conv_Nets.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Convolutional Neural Networks\n", 8 | "\n", 9 | "Again, I am not going to even try to do a better job than this post...:\n", 10 | "\n", 11 | "https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/\n", 12 | "\n", 13 | "Now, let's review some of the famous CNN architectures on starting on p.367 of Hands on Machine Learning.\n", 14 | "\n", 15 | "## Python Conv Net" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 1, 21 | "metadata": {}, 22 | "outputs": [ 23 | { 24 | "name": "stderr", 25 | "output_type": "stream", 26 | "text": [ 27 | "Using TensorFlow backend.\n" 28 | ] 29 | } 30 | ], 31 | "source": [ 32 | "from keras.models import Sequential, Model\n", 33 | "from keras.layers import Dense, Activation, Conv2D, MaxPooling2D, Flatten, BatchNormalization, BatchNormalization, Dropout\n", 34 | "from keras.datasets import mnist\n", 35 | "from sklearn.metrics import confusion_matrix\n", 36 | "from keras import applications\n", 37 | "from keras.preprocessing.image import ImageDataGenerator\n", 38 | "from skimage.transform import rescale, resize\n", 39 | "import matplotlib.pyplot as plt\n", 40 | "from keras.preprocessing.image import ImageDataGenerator\n", 41 | "%matplotlib inline\n", 42 | "import numpy as np" 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": 2, 48 | "metadata": {}, 49 | "outputs": [], 50 | "source": [ 51 | "(x_train, y_train), (x_test, y_test) = mnist.load_data()\n", 52 | "\n", 53 | "def scale_image(images):\n", 54 | " images = images.reshape(images.shape[0], 28, 28, 1)\n", 55 | " images = images.astype('float32')\n", 56 | " scaled_images = images / 255\n", 57 | " return scaled_images\n", 58 | "\n", 59 | "x_train = scale_image(x_train)\n", 60 | "x_test = scale_image(x_test)" 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": 3, 66 | "metadata": {}, 67 | "outputs": [ 68 | { 69 | "data": { 70 | "text/plain": [ 71 | "(60000, 28, 28, 1)" 72 | ] 73 | }, 74 | "execution_count": 3, 75 | "metadata": {}, 76 | "output_type": "execute_result" 77 | } 78 | ], 79 | "source": [ 80 | "x_train.shape" 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": 4, 86 | "metadata": {}, 87 | "outputs": [], 88 | "source": [ 89 | "model = Sequential()\n", 90 | "\n", 91 | "model.add(Conv2D(32, kernel_size=(5, 5), activation='elu', input_shape=(28, 28, 1), kernel_initializer='he_normal'))\n", 92 | "model.add(BatchNormalization())\n", 93 | "model.add(MaxPooling2D(pool_size=(2, 2)))\n", 94 | "model.add(Dropout(0.5))\n", 95 | "\n", 96 | "model.add(Conv2D(64, kernel_size=(5, 5), activation='elu', 
kernel_initializer='he_normal'))\n", 97 | "model.add(BatchNormalization())\n", 98 | "model.add(MaxPooling2D(pool_size=(2, 2)))\n", 99 | "model.add(Dropout(0.5))\n", 100 | "\n", 101 | "model.add(Flatten())\n", 102 | "model.add(Dense(256, activation='elu', kernel_initializer='he_normal'))\n", 103 | "model.add(Dropout(0.5))\n", 104 | "model.add(Dense(128, activation='elu', kernel_initializer='he_normal'))\n", 105 | "model.add(Dropout(0.5))\n", 106 | "model.add(Dense(10, activation='softmax'))" 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": 5, 112 | "metadata": {}, 113 | "outputs": [ 114 | { 115 | "name": "stdout", 116 | "output_type": "stream", 117 | "text": [ 118 | "_________________________________________________________________\n", 119 | "Layer (type) Output Shape Param # \n", 120 | "=================================================================\n", 121 | "conv2d_1 (Conv2D) (None, 24, 24, 32) 832 \n", 122 | "_________________________________________________________________\n", 123 | "batch_normalization_1 (Batch (None, 24, 24, 32) 128 \n", 124 | "_________________________________________________________________\n", 125 | "max_pooling2d_1 (MaxPooling2 (None, 12, 12, 32) 0 \n", 126 | "_________________________________________________________________\n", 127 | "dropout_1 (Dropout) (None, 12, 12, 32) 0 \n", 128 | "_________________________________________________________________\n", 129 | "conv2d_2 (Conv2D) (None, 8, 8, 64) 51264 \n", 130 | "_________________________________________________________________\n", 131 | "batch_normalization_2 (Batch (None, 8, 8, 64) 256 \n", 132 | "_________________________________________________________________\n", 133 | "max_pooling2d_2 (MaxPooling2 (None, 4, 4, 64) 0 \n", 134 | "_________________________________________________________________\n", 135 | "dropout_2 (Dropout) (None, 4, 4, 64) 0 \n", 136 | "_________________________________________________________________\n", 137 | "flatten_1 (Flatten) (None, 1024) 0 \n", 138 | "_________________________________________________________________\n", 139 | "dense_1 (Dense) (None, 256) 262400 \n", 140 | "_________________________________________________________________\n", 141 | "dropout_3 (Dropout) (None, 256) 0 \n", 142 | "_________________________________________________________________\n", 143 | "dense_2 (Dense) (None, 128) 32896 \n", 144 | "_________________________________________________________________\n", 145 | "dropout_4 (Dropout) (None, 128) 0 \n", 146 | "_________________________________________________________________\n", 147 | "dense_3 (Dense) (None, 10) 1290 \n", 148 | "=================================================================\n", 149 | "Total params: 349,066\n", 150 | "Trainable params: 348,874\n", 151 | "Non-trainable params: 192\n", 152 | "_________________________________________________________________\n" 153 | ] 154 | } 155 | ], 156 | "source": [ 157 | "model.summary()" 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": 6, 163 | "metadata": {}, 164 | "outputs": [], 165 | "source": [ 166 | "model.compile(optimizer='adam',\n", 167 | " loss='sparse_categorical_crossentropy')" 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": 7, 173 | "metadata": {}, 174 | "outputs": [ 175 | { 176 | "name": "stdout", 177 | "output_type": "stream", 178 | "text": [ 179 | "Train on 48000 samples, validate on 12000 samples\n", 180 | "Epoch 1/5\n", 181 | "48000/48000 [==============================] - 16s - loss: 0.6392 - 
val_loss: 0.1036\n", 182 | "Epoch 2/5\n", 183 | "48000/48000 [==============================] - 14s - loss: 0.2201 - val_loss: 0.0776\n", 184 | "Epoch 3/5\n", 185 | "48000/48000 [==============================] - 14s - loss: 0.1676 - val_loss: 0.0638\n", 186 | "Epoch 4/5\n", 187 | "48000/48000 [==============================] - 14s - loss: 0.1468 - val_loss: 0.0494\n", 188 | "Epoch 5/5\n", 189 | "48000/48000 [==============================] - 14s - loss: 0.1259 - val_loss: 0.0466\n" 190 | ] 191 | }, 192 | { 193 | "data": { 194 | "text/plain": [ 195 | "" 196 | ] 197 | }, 198 | "execution_count": 7, 199 | "metadata": {}, 200 | "output_type": "execute_result" 201 | } 202 | ], 203 | "source": [ 204 | "model.fit(x_train, y_train, epochs=5, batch_size=32, validation_split=0.2)" 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": 8, 210 | "metadata": {}, 211 | "outputs": [], 212 | "source": [ 213 | "test_predictions = np.argmax(model.predict(x_test),1)" 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": 9, 219 | "metadata": {}, 220 | "outputs": [ 221 | { 222 | "data": { 223 | "text/plain": [ 224 | "array([[ 974, 0, 1, 0, 0, 0, 4, 1, 0, 0],\n", 225 | " [ 0, 1131, 1, 0, 0, 1, 1, 1, 0, 0],\n", 226 | " [ 1, 3, 1024, 0, 1, 0, 0, 3, 0, 0],\n", 227 | " [ 0, 0, 4, 993, 0, 11, 0, 1, 1, 0],\n", 228 | " [ 0, 0, 0, 0, 969, 0, 5, 0, 0, 8],\n", 229 | " [ 2, 0, 0, 2, 0, 886, 1, 1, 0, 0],\n", 230 | " [ 3, 2, 0, 0, 1, 4, 947, 0, 1, 0],\n", 231 | " [ 1, 4, 7, 2, 0, 0, 0, 1012, 1, 1],\n", 232 | " [ 4, 0, 1, 0, 1, 0, 1, 2, 961, 4],\n", 233 | " [ 2, 4, 0, 2, 4, 7, 0, 4, 0, 986]])" 234 | ] 235 | }, 236 | "execution_count": 9, 237 | "metadata": {}, 238 | "output_type": "execute_result" 239 | } 240 | ], 241 | "source": [ 242 | "confusion_matrix(y_test, test_predictions)" 243 | ] 244 | }, 245 | { 246 | "cell_type": "code", 247 | "execution_count": 10, 248 | "metadata": {}, 249 | "outputs": [ 250 | { 251 | "data": { 252 | "text/plain": [ 253 | "array([ 0.9883])" 254 | ] 255 | }, 256 | "execution_count": 10, 257 | "metadata": {}, 258 | "output_type": "execute_result" 259 | } 260 | ], 261 | "source": [ 262 | "np.sum(y_test == test_predictions) / test_predictions.shape" 263 | ] 264 | }, 265 | { 266 | "cell_type": "markdown", 267 | "metadata": {}, 268 | "source": [ 269 | "## Pre-training and Data Augmentation\n", 270 | "\n", 271 | "Source: https://gist.github.com/fchollet/7eb39b44eb9e16e59632d25fb3119975" 272 | ] 273 | }, 274 | { 275 | "cell_type": "code", 276 | "execution_count": 11, 277 | "metadata": {}, 278 | "outputs": [ 279 | { 280 | "name": "stdout", 281 | "output_type": "stream", 282 | "text": [ 283 | "Model loaded.\n" 284 | ] 285 | } 286 | ], 287 | "source": [ 288 | "base_model = applications.VGG16(weights='imagenet', include_top=False, input_shape=(48,48,3))\n", 289 | "print('Model loaded.')\n", 290 | "\n", 291 | "# build a classifier model to put on top of the convolutional model\n", 292 | "top_model = Sequential([\n", 293 | " Flatten(input_shape=base_model.output_shape[1:]),\n", 294 | " Dense(128, activation='elu', kernel_initializer='he_normal'),\n", 295 | " BatchNormalization(),\n", 296 | " Dropout(0.5),\n", 297 | " Dense(64, activation='elu', kernel_initializer='he_normal'),\n", 298 | " BatchNormalization(),\n", 299 | " Dropout(0.5),\n", 300 | " Dense(10, activation='softmax')\n", 301 | "])\n", 302 | "\n", 303 | "\n", 304 | "# compile the model with a SGD/momentum optimizer\n", 305 | "# and a very slow learning rate.\n", 306 | 
"top_model.compile(optimizer='adam',\n", 307 | " loss='sparse_categorical_crossentropy',\n", 308 | " metrics=['accuracy'])" 309 | ] 310 | }, 311 | { 312 | "cell_type": "code", 313 | "execution_count": 12, 314 | "metadata": {}, 315 | "outputs": [], 316 | "source": [ 317 | "def preprocess_images(images, new_size=(48,48)):\n", 318 | " resized_images = np.array([resize(x, new_size, mode='reflect') for x in images])\n", 319 | " rows, width, height, _ = resized_images.shape\n", 320 | " three_channels = np.zeros((rows, width, height, 3))\n", 321 | " three_channels[:,:,:,0] = resized_images[:,:,:,0]\n", 322 | " three_channels[:,:,:,1] = resized_images[:,:,:,0]\n", 323 | " three_channels[:,:,:,2] = resized_images[:,:,:,0]\n", 324 | " return three_channels" 325 | ] 326 | }, 327 | { 328 | "cell_type": "code", 329 | "execution_count": 13, 330 | "metadata": {}, 331 | "outputs": [], 332 | "source": [ 333 | "x_train_processed = preprocess_images(x_train)\n", 334 | "x_test_processed = preprocess_images(x_test)" 335 | ] 336 | }, 337 | { 338 | "cell_type": "code", 339 | "execution_count": 14, 340 | "metadata": {}, 341 | "outputs": [], 342 | "source": [ 343 | "## could augment data here\n", 344 | "datagen = ImageDataGenerator()\n", 345 | "generator = datagen.flow(x_train_processed, y_train, batch_size=32, shuffle=False)" 346 | ] 347 | }, 348 | { 349 | "cell_type": "code", 350 | "execution_count": 15, 351 | "metadata": {}, 352 | "outputs": [ 353 | { 354 | "name": "stdout", 355 | "output_type": "stream", 356 | "text": [ 357 | "1874/1875 [============================>.] - ETA: 0s" 358 | ] 359 | } 360 | ], 361 | "source": [ 362 | "bottleneck_features_train = base_model.predict_generator(generator, x_train_processed.shape[0] // 32, verbose=1)" 363 | ] 364 | }, 365 | { 366 | "cell_type": "code", 367 | "execution_count": 16, 368 | "metadata": {}, 369 | "outputs": [], 370 | "source": [ 371 | "bottleneck_features_test = base_model.predict(x_test_processed)" 372 | ] 373 | }, 374 | { 375 | "cell_type": "code", 376 | "execution_count": 17, 377 | "metadata": {}, 378 | "outputs": [ 379 | { 380 | "name": "stdout", 381 | "output_type": "stream", 382 | "text": [ 383 | "Train on 54000 samples, validate on 6000 samples\n", 384 | "Epoch 1/20\n", 385 | "54000/54000 [==============================] - 11s - loss: 0.3863 - acc: 0.8813 - val_loss: 0.0922 - val_acc: 0.9698\n", 386 | "Epoch 2/20\n", 387 | "54000/54000 [==============================] - 11s - loss: 0.1983 - acc: 0.9400 - val_loss: 0.0990 - val_acc: 0.9668\n", 388 | "Epoch 3/20\n", 389 | "54000/54000 [==============================] - 11s - loss: 0.1696 - acc: 0.9482 - val_loss: 0.0675 - val_acc: 0.9783\n", 390 | "Epoch 4/20\n", 391 | "54000/54000 [==============================] - 11s - loss: 0.1568 - acc: 0.9523 - val_loss: 0.0759 - val_acc: 0.9752\n", 392 | "Epoch 5/20\n", 393 | "54000/54000 [==============================] - 12s - loss: 0.1469 - acc: 0.9560 - val_loss: 0.0801 - val_acc: 0.9725\n", 394 | "Epoch 6/20\n", 395 | "54000/54000 [==============================] - 11s - loss: 0.1385 - acc: 0.9590 - val_loss: 0.0854 - val_acc: 0.9732\n", 396 | "Epoch 7/20\n", 397 | "54000/54000 [==============================] - 11s - loss: 0.1343 - acc: 0.9594 - val_loss: 0.0573 - val_acc: 0.9833\n", 398 | "Epoch 8/20\n", 399 | "54000/54000 [==============================] - 11s - loss: 0.1284 - acc: 0.9609 - val_loss: 0.0647 - val_acc: 0.9787\n", 400 | "Epoch 9/20\n", 401 | "54000/54000 [==============================] - 10s - loss: 0.1286 - acc: 0.9614 - val_loss: 0.0634 - 
val_acc: 0.9820\n", 402 | "Epoch 10/20\n", 403 | "54000/54000 [==============================] - 12s - loss: 0.1216 - acc: 0.9630 - val_loss: 0.0620 - val_acc: 0.9807\n", 404 | "Epoch 11/20\n", 405 | "54000/54000 [==============================] - 12s - loss: 0.1209 - acc: 0.9634 - val_loss: 0.0558 - val_acc: 0.9825\n", 406 | "Epoch 12/20\n", 407 | "54000/54000 [==============================] - 10s - loss: 0.1232 - acc: 0.9632 - val_loss: 0.0587 - val_acc: 0.9812\n", 408 | "Epoch 13/20\n", 409 | "54000/54000 [==============================] - 10s - loss: 0.1173 - acc: 0.9643 - val_loss: 0.0563 - val_acc: 0.9820\n", 410 | "Epoch 14/20\n", 411 | "54000/54000 [==============================] - 11s - loss: 0.1144 - acc: 0.9646 - val_loss: 0.0820 - val_acc: 0.9738\n", 412 | "Epoch 15/20\n", 413 | "54000/54000 [==============================] - 11s - loss: 0.1128 - acc: 0.9656 - val_loss: 0.0613 - val_acc: 0.9812\n", 414 | "Epoch 16/20\n", 415 | "54000/54000 [==============================] - 11s - loss: 0.1146 - acc: 0.9660 - val_loss: 0.0552 - val_acc: 0.9842\n", 416 | "Epoch 17/20\n", 417 | "54000/54000 [==============================] - 11s - loss: 0.1110 - acc: 0.9674 - val_loss: 0.0631 - val_acc: 0.9798\n", 418 | "Epoch 18/20\n", 419 | "54000/54000 [==============================] - 11s - loss: 0.1078 - acc: 0.9671 - val_loss: 0.0539 - val_acc: 0.9838\n", 420 | "Epoch 19/20\n", 421 | "54000/54000 [==============================] - 11s - loss: 0.1095 - acc: 0.9671 - val_loss: 0.0523 - val_acc: 0.9838\n", 422 | "Epoch 20/20\n", 423 | "54000/54000 [==============================] - 11s - loss: 0.1076 - acc: 0.9667 - val_loss: 0.0567 - val_acc: 0.9835\n" 424 | ] 425 | }, 426 | { 427 | "data": { 428 | "text/plain": [ 429 | "" 430 | ] 431 | }, 432 | "execution_count": 17, 433 | "metadata": {}, 434 | "output_type": "execute_result" 435 | } 436 | ], 437 | "source": [ 438 | "top_model.fit(bottleneck_features_train, y_train, epochs=20, batch_size=32, validation_split=0.1)" 439 | ] 440 | }, 441 | { 442 | "cell_type": "code", 443 | "execution_count": 18, 444 | "metadata": {}, 445 | "outputs": [], 446 | "source": [ 447 | "test_predictions = np.argmax(top_model.predict(bottleneck_features_test),1)" 448 | ] 449 | }, 450 | { 451 | "cell_type": "code", 452 | "execution_count": 19, 453 | "metadata": {}, 454 | "outputs": [ 455 | { 456 | "data": { 457 | "text/plain": [ 458 | "array([[ 966, 0, 2, 0, 0, 4, 2, 2, 3, 1],\n", 459 | " [ 0, 1128, 0, 1, 1, 0, 0, 4, 1, 0],\n", 460 | " [ 0, 1, 969, 17, 2, 15, 0, 21, 5, 2],\n", 461 | " [ 0, 0, 4, 989, 0, 13, 0, 2, 2, 0],\n", 462 | " [ 0, 2, 2, 0, 962, 1, 0, 5, 5, 5],\n", 463 | " [ 0, 0, 2, 6, 0, 877, 0, 2, 5, 0],\n", 464 | " [ 6, 2, 1, 0, 1, 2, 945, 0, 1, 0],\n", 465 | " [ 1, 3, 3, 0, 5, 1, 0, 1014, 0, 1],\n", 466 | " [ 0, 0, 2, 2, 3, 3, 1, 1, 960, 2],\n", 467 | " [ 1, 0, 4, 1, 10, 2, 0, 12, 8, 971]])" 468 | ] 469 | }, 470 | "execution_count": 19, 471 | "metadata": {}, 472 | "output_type": "execute_result" 473 | } 474 | ], 475 | "source": [ 476 | "confusion_matrix(y_test, test_predictions)" 477 | ] 478 | }, 479 | { 480 | "cell_type": "code", 481 | "execution_count": 20, 482 | "metadata": {}, 483 | "outputs": [ 484 | { 485 | "data": { 486 | "text/plain": [ 487 | "array([ 0.9781])" 488 | ] 489 | }, 490 | "execution_count": 20, 491 | "metadata": {}, 492 | "output_type": "execute_result" 493 | } 494 | ], 495 | "source": [ 496 | "np.sum(y_test == test_predictions) / test_predictions.shape" 497 | ] 498 | } 499 | ], 500 | "metadata": { 501 | "kernelspec": { 502 | 
"display_name": "Python 3", 503 | "language": "python", 504 | "name": "python3" 505 | }, 506 | "language_info": { 507 | "codemirror_mode": { 508 | "name": "ipython", 509 | "version": 3 510 | }, 511 | "file_extension": ".py", 512 | "mimetype": "text/x-python", 513 | "name": "python", 514 | "nbconvert_exporter": "python", 515 | "pygments_lexer": "ipython3", 516 | "version": "3.5.3" 517 | } 518 | }, 519 | "nbformat": 4, 520 | "nbformat_minor": 2 521 | } 522 | -------------------------------------------------------------------------------- /lectures/Lecture_1_Introduction.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### Introductions\n", 8 | "\n", 9 | "* Importance of getting to know people in the class\n", 10 | "* Review syllabus\n", 11 | "* Give out slack information\n", 12 | "* When office hours?\n", 13 | "* Participation: ML news / paper for the day / discuss homework ideas / class discussions\n", 14 | "* Homework: Target 5 homeworks which will be pretty open-ended and almost like mini-projects\n", 15 | "* Quizzes: In class and around 3. Will test knowledge of the subject not coding\n", 16 | "* First time class being taught and I am very open to feedback. Also, feel free to submit pull requests to correct or add to content.\n", 17 | "* Review first homework" 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": {}, 23 | "source": [ 24 | "## What is Machine Learning?\n", 25 | "\n", 26 | "[Wikipedia](https://en.wikipedia.org/wiki/Machine_learning) tells us that Machine learning is, \"a field of computer science that gives computers **the ability to learn without being explicitly programmed**.\" It goes on to say, \"machine learning explores the study and construction of algorithms that can learn from and make predictions on data – such algorithms overcome following strictly static program instructions by making data-driven predictions or decisions, through **building a model from sample inputs**.\"\n", 27 | "\n", 28 | "\n", 29 | "### Learning from inputs\n", 30 | "\n", 31 | "What does it mean to learn from inputs without being explicitly programmed? Let us consider the classical machine learning problem: spam filtering.\n", 32 | "\n", 33 | "Let us imagine that we know nothing about machine learning, but are tasked with determining whether an email is spam or not. How might we do this?\n", 34 | "\n", 35 | "What do we like and not like about our approach? How would we keep it up to date with new types of spam?\n", 36 | "\n", 37 | "Now imagine we had a black-box that if given a great many examples of emails that are spam and not spam, could take these examples and learn from the text what spam looks like. For this class, we will call that black-box (though it won't be black for long) machine learning. And often the data we feed it to understand a problem are called features.\n", 38 | "\n", 39 | "What does learning from inputs look like? Let us consider [flappy bird](https://www.youtube.com/watch?v=79BWQUN_Njc).\n", 40 | "\n", 41 | "\n", 42 | "## Why machine learning?\n", 43 | "\n", 44 | "From our discussion, why do you think machine learning might be valuable?\n", 45 | "\n", 46 | "\n", 47 | "## Types of machine learning problems\n", 48 | "\n", 49 | "**Supervised** machine learning problems are ones for which you have labeled data. Labeled data means you give the algorithm the solution with the data and these solutions are called labels. 
For example, with spam classification the labels would be \"spam\" or \"not spam.\" Linear regression would be considered a supervised problem.\n", 50 | "\n", 51 | "**Unsupervised** machine learning is the opposite. It is not given any labels. These algorithms are often not as powerful, since they don't get the benefit of labels, but they can be extremely valuable when getting labeled data is expensive or impossible. An example would be clustering.\n", 52 | "\n", 53 | "**Regression** problems are a class of problems for which you are trying to predict a real number. For example, linear regression outputs a real number and could be used to predict housing prices.\n", 54 | "\n", 55 | "**Classification** problems are problems for which you are predicting a class. For example, spam prediction is a classification problem because you want to know whether your input falls into one of two classes. Logistic regression is an algorithm used for classification.\n", 56 | "\n", 57 | "**Ranking** problems are very popular in eCommerce. These models try to rank the items by how valuable they are to a user. For example, Netflix's movie recommendations. An example model is collaborative filtering.\n", 58 | "\n", 59 | "**Reinforcement Learning** is when you have an agent in an environment that gets to perform actions and receive rewards for actions. The model here learns the best actions to take to maximize rewards. The flappy bird video is an example of reinforcement learning. An example model is deep Q-networks.\n", 60 | "\n", 61 | "\n", 62 | "## Machine Learning and Econometrics\n", 63 | "\n", 64 | "How are they different?\n", 65 | "\n", 66 | "For one, they use different lingo. \n", 67 | "\n", 68 | "For another, econometrics is often more interested in understanding why things happen while machine learning often cares more about just the actual prediction being correct.\n", 69 | "\n", 70 | "Economic theory is often a driver in the development of econometric models, while machine learning often relies on the data to deliver insights.\n", 71 | "\n", 72 | "The two worlds have a lot of overlap and continue to grow closer. Machine learning is getting better at providing both predictions and understanding, and econometrics is finding value in the scalability and accuracy of some machine learning models.\n", 73 | "\n", 74 | "\n", 75 | "## Challenges of Machine Learning" 76 | ] 77 | }, 78 | { 79 | "cell_type": "markdown", 80 | "metadata": {}, 81 | "source": [ 82 | "Perhaps my favorite part from the Wikipedia page on machine learning is, \"As of 2016, **machine learning is a buzzword**, and according to the Gartner hype cycle of 2016, at its peak of inflated expectations. Effective machine learning is difficult because finding patterns is hard and often not enough training data is available; as a result, **machine-learning programs often fail to deliver**.\"\n", 83 | "\n", 84 | "* There isn't a clear problem to solve\n", 85 | "\n", 86 | "Some executive heard machine learning is the next big thing, so they hired a data science team. Unfortunately, there isn't a clear idea on what problems to solve, so the team flounders for a year.\n", 87 | "\n", 88 | "* Labeled data can be extremely important to building machine learning models, but can also be extremely costly.\n", 89 | "\n", 90 | "First off, you often need a lot of data. 
[Google](https://research.googleblog.com/2017/07/revisiting-unreasonable-effectiveness.html) found for representation learning, performance increases logarithmically based on volume of training data:\n", 91 | "\n", 92 | "![image.png](attachment:image.png)\n", 93 | "\n", 94 | "Secondly, you need to get data that represents the full distribution of the problem you are trying to solve. For example, for our spam classification problem, what kinds of emails might we want to gather? What if we only had emails that came from US IP addresses?\n", 95 | "\n", 96 | "Lastly, just getting data labeled can be time consuming and cost a lot of money. Who is going to label 1 million emails as spam or not spam?\n", 97 | "\n", 98 | "* Data can be very messy\n", 99 | "\n", 100 | "Often data in the real world has errors, outliers, missing data, and noise. How you handle these can greatly influence the outcome of your model.\n", 101 | "\n", 102 | "* Feature engineering\n", 103 | "\n", 104 | "Once you have your data and labels, deciding on how to represent it to your model can be very challenging. For example, for spam classification would you just feed it the raw text? What about the origin of the IP address? What about a timestamp?\n", 105 | "\n", 106 | "* Your model might not generalize\n", 107 | "\n", 108 | "After all of this, you might still end up with a model that either is too simple to be effective (underfitting) or too complex to generalize well (overfitting). You have to develop a model that is just right. :)\n", 109 | "\n", 110 | "* Evaluation is non-trivial\n", 111 | "\n", 112 | "Let's say we develop a machine learning model for spam classification. How do we evaluate it? Do we care more about precision or recall? How do we tie our scientific metrics to business metrics?\n", 113 | "\n", 114 | "* Getting into production can be hard\n", 115 | "\n", 116 | "You have a beautiful model built in Python only to discover the back-end is in Java and has to run in under 5ms, so micro-services are not an option. So you convert your model to PMML, but engineers won't let you push code, so you are now blocked and putting your model in production isn't high on their priorities.\n", 117 | "\n", 118 | "\n", 119 | "## There is hope\n", 120 | "\n", 121 | "While many machine learning initiatives do fail, many also succeed and are running some of the most valuable companies in the world. Companies like Google, Facebook, Amazon, AirBnB, and Netflix have all found successful ways to leverage machine learning and are reaping large rewards.\n", 122 | "\n", 123 | "Google CEO Sundar Pichai even recently said, \"an important shift from a mobile first world to an AI first world.\"\n", 124 | "\n", 125 | "And Mark Cuban said, \"Artificial Intelligence, deep learning, machine learning — whatever you’re doing if you don’t understand it — learn it. Because otherwise you’re going to be a dinosaur within 3 years.\"\n", 126 | "\n", 127 | "And lastly, [Harvard Business Review](https://hbr.org/2012/10/big-data-the-management-revolution) found, \"companies in the top third of their industry in the use of data-driven decision making were, on average, 5% more productive and 6% more profitable than their competitors.\"\n", 128 | "\n", 129 | "The goal of this course is to prepare you for this world. So that you will not only know how to build the machine learning models to predict the future, but also understand the key ingredients of a successful machine learning initiative and how to overcome the challenges." 
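To make the evaluation question raised above concrete (precision vs. recall for a spam filter), here is a minimal sketch. The labels below are made up purely for illustration and are not course data; only the two scikit-learn metric functions are assumed.

```python
from sklearn.metrics import precision_score, recall_score

# 1 = spam, 0 = not spam; ten made-up emails
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]  # a cautious filter that misses some spam

# Precision: of the emails we flagged as spam, how many really were spam? (2 of 3)
print(precision_score(y_true, y_pred))
# Recall: of the real spam, how much did we catch? (2 of 4)
print(recall_score(y_true, y_pred))
```

A filter that flags almost nothing can have high precision and terrible recall, and vice versa; which error matters more is ultimately a business decision, not just a scientific one.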
130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": null, 135 | "metadata": { 136 | "collapsed": true 137 | }, 138 | "outputs": [], 139 | "source": [] 140 | } 141 | ], 142 | "metadata": { 143 | "kernelspec": { 144 | "display_name": "Python 2", 145 | "language": "python", 146 | "name": "python2" 147 | }, 148 | "language_info": { 149 | "codemirror_mode": { 150 | "name": "ipython", 151 | "version": 2 152 | }, 153 | "file_extension": ".py", 154 | "mimetype": "text/x-python", 155 | "name": "python", 156 | "nbconvert_exporter": "python", 157 | "pygments_lexer": "ipython2", 158 | "version": "2.7.12" 159 | } 160 | }, 161 | "nbformat": 4, 162 | "nbformat_minor": 2 163 | } 164 | -------------------------------------------------------------------------------- /lectures/Lecture_6_K_Nearest_Neighbors.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## K-Nearest Neighbors\n", 8 | "\n", 9 | "This is one of the simplest and easiest to understand algorithms. It can be used for both classification and regression tasks, but is more common in classification, so we will focus there. The principles, though, can be used in both cases and sklearn supports both.\n", 10 | "\n", 11 | "Basically, here is the algorithm:\n", 12 | "\n", 13 | "1. Define $k$\n", 14 | "2. Define a distance metric - usually Euclidean distance\n", 15 | "3. For a new data point, find the $k$ nearest training points and combine their classes in some way - usually voting - to get a predicted class\n", 16 | "\n", 17 | "That's it!\n", 18 | "\n", 19 | "**Some of the benefits:**\n", 20 | "\n", 21 | "* Doesn't really require any training in the traditional sense. You just need a fast way to find the nearest neighbors.\n", 22 | "* Easy to understand \n", 23 | "\n", 24 | "**Some of the negatives:**\n", 25 | "\n", 26 | "* Need to define k, which is a hyper-parameter, so it can be tuned with cross-validation. A higher value for k increases bias and a lower value increases variance.\n", 27 | "* Have to choose a distance metric and could get very different results depending on the metric. Again, you can use cross-validation.\n", 28 | "* Doesn't really offer insights into which features might be important.\n", 29 | "* Can suffer with high-dimensional data due to the curse of dimensionality.\n", 30 | "\n", 31 | "**Basic assumption:**\n", 32 | "\n", 33 | "* Data points that are close are similar for our target\n", 34 | "\n", 35 | "\n", 36 | "### Curse of Dimensionality\n", 37 | "\n", 38 | "Basically, the curse of dimensionality means that in high dimensions, it is likely that close points are not much closer than the average distance, which means being close doesn't mean much. In high dimensions, the data becomes very spread out, which creates this phenomenon. \n", 39 | "\n", 40 | "There are so many good resources for this online that I won't go any deeper. 
Here is one you might look at:\n", 41 | "\n", 42 | "http://blog.galvanize.com/how-to-manage-dimensionality/\n", 43 | "\n", 44 | "### Euclidean Distance\n", 45 | "\n", 46 | "For vectors, q and p that are being compared (these would be our feature vectors):\n", 47 | "\n", 48 | "$$\\sqrt{\\sum_{i=1}^{N}{(p_{i} - q_{i})^2}}$$\n", 49 | "\n", 50 | "\n", 51 | "### SKlearn Example\n", 52 | "\n" 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": 15, 58 | "metadata": { 59 | "collapsed": false 60 | }, 61 | "outputs": [], 62 | "source": [ 63 | "from sklearn.datasets import load_breast_cancer\n", 64 | "from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor\n", 65 | "from sklearn.model_selection import train_test_split, GridSearchCV\n", 66 | "from sklearn.metrics import f1_score, classification_report, accuracy_score, mean_squared_error" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": 3, 72 | "metadata": { 73 | "collapsed": false 74 | }, 75 | "outputs": [ 76 | { 77 | "name": "stdout", 78 | "output_type": "stream", 79 | "text": [ 80 | "Best Params: {'n_neighbors': 5, 'p': 1, 'weights': 'uniform'}\n", 81 | "Train F1: 0.9606837606837607\n", 82 | "Test Classification Report:\n", 83 | " precision recall f1-score support\n", 84 | "\n", 85 | " 0 0.97 0.88 0.93 43\n", 86 | " 1 0.93 0.99 0.96 71\n", 87 | "\n", 88 | "avg / total 0.95 0.95 0.95 114\n", 89 | "\n", 90 | "Train Accuracy: 0.9494505494505494\tTest accuracy: 0.9473684210526315\n" 91 | ] 92 | } 93 | ], 94 | "source": [ 95 | "data = load_breast_cancer()\n", 96 | "X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.20, random_state=42)\n", 97 | "clf = KNeighborsClassifier()\n", 98 | "gridsearch = GridSearchCV(clf, {\"n_neighbors\": [1, 3, 5, 7, 9, 11], \"weights\": ['uniform', 'distance'], \n", 99 | " 'p': [1, 2, 3]}, scoring='f1')\n", 100 | "gridsearch.fit(X_train, y_train)\n", 101 | "print(\"Best Params: {}\".format(gridsearch.best_params_))\n", 102 | "y_pred_train = gridsearch.predict(X_train)\n", 103 | "print(\"Train F1: {}\".format(f1_score(y_train, y_pred_train)))\n", 104 | "print(\"Test Classification Report:\")\n", 105 | "y_pred_test = gridsearch.predict(X_test)\n", 106 | "print(classification_report(y_test, y_pred_test))\n", 107 | "print(\"Train Accuracy: {}\\tTest accuracy: {}\".format(accuracy_score(y_train, y_pred_train),\n", 108 | " accuracy_score(y_test, y_pred_test)))" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "### Voting Methods\n", 116 | "\n", 117 | "* Majority Voting: After you take the $k$ nearest neighbors, you take a \"vote\" of those neighbors' classes. The new data point is classified with whatever the majority class of the neighbors is. If you are doing a binary classification, it is recommended that you use an odd number of neighbors to avoid tied votes. However, in a multi-class problem, it is harder to avoid ties. A common solution to this is to decrease $k$ until the tie is broken.\n", 118 | "* Distance Weighting: Instead of directly taking votes of the nearest neighbors, you weight each vote by the distance of that instance from the new data point. A common weighting method is $\\hat{y} = \\frac{\\sum_{i=1}^Nw_iq_i}{\\sum_{i=1}^N w_i}$, where $w_i=\\frac{1}{\\sum_{i=1}^{N}{(p_{i} - q_{i})^2}}$, or one over the distance between the new data point and the training point. The new data point is added into the class with the largest added weight. 
Not only does this decrease the chances of ties, it also reduces the effect of a skewed representation of data.\n", 119 | "\n", 120 | "### Distance Metrics\n", 121 | "Euclidean distance, or the 2-norm, is a very common distance metric to use for $k$-nearest neighbors. Any $p$-norm can be used (the $p$-norm is defined as $||\\mathbf{x}-\\mathbf{y}||_p = (\\sum_{i=1}^N|x_i-y_i|^p)^\\frac{1}{p}$). For categorical data, however, this can be a problem. For example, if we have encoded a feature for car color from red, blue, and green to 0, 1, 2, how can the \"distance\" between green and red be measured? You could make dummy variables, but if a feature has 15 possible categories, that means adding 14 more variables to your feature set, and we run into the curse of dimensionality. There are several ways to confront this issue, but they are beyond the scope of this lecture. But be aware of the effect categorical features can have on your nearest neighbors classifier.\n", 122 | "\n", 123 | "### Search Algorithm\n", 124 | "\n", 125 | "Imagine the data set contains 2000 points. A brute-force search for the 3 nearest neighbors to one point does not take very long. But if the data set contains 2000000 points, a brute-force search can become quite costly, especially if the dimension of the data is large. Other search algorithms sacrifice an exhaustive search for a faster run time. Structures such as KDTrees (see https://en.wikipedia.org/wiki/K-d_tree) or Ball trees (see https://en.wikipedia.org/wiki/Ball_tree) are used for faster run times. While we won't dive into the details of these structures, be aware of them and how they can optimize your run time (although training time does increase).\n", 126 | "\n", 127 | "### Radius Neighbors Classifier\n", 128 | "\n", 129 | "This is the same idea as a $k$ nearest neighbors classifier, but instead of finding the $k$ nearest neighbors, you find all the neighbors within a given radius. Setting the radius requires some domain knowledge; if your points are closely packed together, you'd want to use a smaller radius to avoid having nearly every point vote.\n", 130 | "\n", 131 | "### K Neighbors Regressor\n", 132 | "To change our problem from classification to regression, all we have to do is find the weighted average of the $k$ nearest neighbors. Instead of taking the majority class, we calculate a weighted average of these nearest values, using the same weighting method as above.\n", 133 | "\n", 134 | "Let's try predicting the area of tissue based on the other features using a $k$-neighbors regressor."
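Before turning to scikit-learn's `KNeighborsRegressor` in the next cell, here is a minimal from-scratch sketch of the distance-weighted prediction described above. The toy arrays and the function name are purely illustrative, and a small epsilon is added to the squared distance so an exact match does not divide by zero.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Distance-weighted k-NN prediction for a single query point.

    As written this is regression; for classification you would instead sum
    the weights per class and pick the class with the largest total weight.
    """
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))  # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]                        # indices of the k closest training points
    weights = 1.0 / (dists[nearest] ** 2 + 1e-8)           # w_i = 1 / squared distance
    return np.sum(weights * y_train[nearest]) / np.sum(weights)

# Tiny made-up example: predict a value for a new 2-D point
X = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [8.0, 8.0]])
y = np.array([10.0, 20.0, 30.0, 80.0])
print(knn_predict(X, y, np.array([2.1, 2.0]), k=3))  # close to 20, dominated by the nearest point
```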
135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": 19, 140 | "metadata": { 141 | "collapsed": false 142 | }, 143 | "outputs": [ 144 | { 145 | "name": "stdout", 146 | "output_type": "stream", 147 | "text": [ 148 | "Best Params: {'n_neighbors': 5, 'p': 1, 'weights': 'distance'}\n", 149 | "Train MSE: 0.0\tTest MSE: 878.1482686614424\n" 150 | ] 151 | } 152 | ], 153 | "source": [ 154 | "target = data.data[:,3]\n", 155 | "X = data.data[:,[0,1,2,4,5,6,7,8]]\n", 156 | "X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.20, random_state=42)\n", 157 | "reg = KNeighborsRegressor()\n", 158 | "gridsearch = GridSearchCV(reg, {\"n_neighbors\": [1, 3, 5, 7, 9, 11], \"weights\": ['uniform', 'distance'], \n", 159 | " 'p': [1, 2, 3]}, scoring='neg_mean_squared_error')\n", 160 | "gridsearch.fit(X_train, y_train)\n", 161 | "print(\"Best Params: {}\".format(gridsearch.best_params_))\n", 162 | "y_pred_train = gridsearch.predict(X_train)\n", 163 | "y_pred_test = gridsearch.predict(X_test)\n", 164 | "print(\"Train MSE: {}\\tTest MSE: {}\".format(mean_squared_error(y_train, y_pred_train),\n", 165 | " mean_squared_error(y_test, y_pred_test)))" 166 | ] 167 | }, 168 | { 169 | "cell_type": "markdown", 170 | "metadata": {}, 171 | "source": [ 172 | "### Outlier Detection\n", 173 | "In $k$-nearest neighbors, the data is naturally clustered. Within these clusters, we can find the average distance between points (either exhaustively or from the centroid of the cluster). If we find a few points that are much farther than the average distance to other points or to the centroid, it is reasonable (but not always correct) to think they could be outliers. We can use this process on new data points as well. " 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": null, 179 | "metadata": { 180 | "collapsed": true 181 | }, 182 | "outputs": [], 183 | "source": [] 184 | } 185 | ], 186 | "metadata": { 187 | "kernelspec": { 188 | "display_name": "Python 3", 189 | "language": "python", 190 | "name": "python3" 191 | }, 192 | "language_info": { 193 | "codemirror_mode": { 194 | "name": "ipython", 195 | "version": 3 196 | }, 197 | "file_extension": ".py", 198 | "mimetype": "text/x-python", 199 | "name": "python", 200 | "nbconvert_exporter": "python", 201 | "pygments_lexer": "ipython3", 202 | "version": "3.5.3" 203 | } 204 | }, 205 | "nbformat": 4, 206 | "nbformat_minor": 2 207 | } 208 | -------------------------------------------------------------------------------- /lectures/Lecture_9_Decision_Trees.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "collapsed": true 7 | }, 8 | "source": [ 9 | "## Decision Trees\n", 10 | "\n", 11 | "To better understand decision trees, let's take a look at a simple example:\n", 12 | "\n", 13 | "https://zhengtianyu.files.wordpress.com/2013/12/decision-tree.png\n", 14 | "\n", 15 | "Clearly, decision trees are very simple and pretty easy to understand. Once you have a tree, you just follow the forks down for your example to get to an end node. Then at that end node you take the most common class or average value for regression. In fact, with classification you can get probabilistic estimates by taking the fraction of examples in that class. Nice! 
And that is one of the biggest benefits of decision trees - they are so easy to explain and understand.\n", 16 | "\n", 17 | "Most decision trees use binary trees - each split is either yes or no. When you graph these decision boundaries you end up with a bunch of vertical or horizontal lines. Here is an example with the iris data set:\n", 18 | "\n", 19 | "http://statweb.stanford.edu/~jtaylo/courses/stats202/trees.html\n", 20 | "\n", 21 | "## How To Grow A Tree\n", 22 | "\n", 23 | "So - how does one learn a tree? Do we just randomly pick binary splitting points and see what comes out? Of course not! We leverage our data and a definition of impurity. \n", 24 | "\n", 25 | "If you think about it, what we really want from our tree is pure leaf nodes. Meaning that for classification we would like each leaf node to end up with examples of only a single class. This greatly increases our confidence that if a testing example ends up at that leaf node that it is in fact of that class.\n", 26 | "\n", 27 | "Imagine the opposite, a leaf node in a 2 class classification problem that has 50% of the examples in one class and 50% in the other. That really doesn't help us at all! If a testing example ends up at that leaf node, all we can do is make a 50/50 guess.\n", 28 | "\n", 29 | "So, how do we define impurity?\n", 30 | "\n", 31 | "**Gini Impurity**\n", 32 | "\n", 33 | "$$Gini_{i} = 1 - \\sum_{k=1}^{n}{p_{i,k}^2}$$\n", 34 | "\n", 35 | "Where $i$ is the node of interest, $n$ is the total number of classes, and $p_{i,k}$ is the fraction of class $k$ in node $i$. The Gini of a node is 0 when all the examples belong to the same class. Gini is the highest when there is an equal probability of being in each class. \n", 36 | "\n", 37 | "Let's consider a 2-class example. Say I am trying to predict whether someone is male or female and I branch on height being less than 5 1/2 feet. In this node, I find 50 examples of which 40 are female and 10 are male. Gini is then:\n", 38 | "\n", 39 | "$$1 - (40/50)^2 - (10/50)^2 = 0.32$$\n", 40 | "\n", 41 | "Very nice - now we can basically evaluate how good a node is. With this we can now start to **greedily** grow our tree. What I mean by greedily is that we will not consider every possible tree, but instead start with the most pure feature split and then from there pick the next most pure, etc.\n", 42 | "\n", 43 | "Thus, our algorithm looks at all the features and all the splitting points for our features (note: this process runs much faster if your features have a small number of unique values, like categories, as opposed to real numbers). To evaluate a feature/split-point pair, we do the following:\n", 44 | "\n", 45 | "Take the weighted average of the Gini impurity for the two nodes created by the split, with the weights being the number of examples in the nodes.\n", 46 | "\n", 47 | "That's it! Basically, the feature split that produces the lowest weighted average Gini impurity is considered best, then we move on and find the next best feature split given all the previous feature splits.\n", 48 | "\n",
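To make the split-selection rule above concrete, here is a minimal sketch of Gini impurity and the weighted-average Gini of a candidate split. The function names and the extra class counts (beyond the 40/10 height example) are made up for illustration.

```python
import numpy as np

def gini(class_counts):
    """Gini impurity of a node, given the number of examples in each class."""
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

def weighted_split_gini(left_counts, right_counts):
    """Weighted average Gini of the two child nodes created by a split."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    return (n_left / n) * gini(left_counts) + (n_right / n) * gini(right_counts)

print(gini([40, 10]))                           # the height example: 1 - 0.8**2 - 0.2**2 = 0.32
print(weighted_split_gini([40, 10], [5, 45]))   # fairly pure children -> lower weighted Gini (0.25)
print(weighted_split_gini([25, 25], [20, 30]))  # impure children -> higher weighted Gini (0.49)
```

In scikit-learn, `DecisionTreeClassifier` uses Gini impurity as its default splitting criterion, so this is essentially the calculation performed at every candidate split.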
49 | "### When to stop?\n", 50 | "\n", 51 | "Decision trees have many hyper-parameters to help control when to stop growing the tree. These include max depth, min_samples_split, min_samples_leaf, min_weight_fraction_leaf, and max_leaf_nodes.\n", 52 | "\n", 53 | "Since we have to consider all these splits, training can be slow, but predictions are very fast since we just have to traverse the tree.\n", 54 | "\n", 55 | "Also, there are other impurity measures that are used with decision trees. Another very popular one is Entropy. See page 173 of Hands On Machine Learning for its definition. Usually, which you choose doesn't matter too much. Gini is slightly faster to compute, while entropy produces slightly more balanced trees.\n", 56 | "\n", 57 | "### Bias and variance\n", 58 | "\n", 59 | "Decision trees can be very prone to overfitting if you let them grow too deep. Thus, decreasing the depth can decrease variance / increase bias. It is important to use cross-validation to do hyper-parameter selection. \n", 60 | "\n", 61 | "### Regression\n", 62 | "\n", 63 | "Decision trees can be applied to regression problems in exactly the same way, but instead of using Gini impurity you would use mean squared error. You calculate the MSE of a node by setting the prediction for all values in that node as the average $y$ value of the examples in that node. For example, if your node had these values: 5, 2, 1, 6 then your predicted value would be (5+2+1+6)/4 = 3.5. You would do the squared difference between 3.5 and all the values and take the mean. \n", 64 | "\n", 65 | "Prediction is done by traversing to the leaf node for an example and taking the average value at that node.\n", 66 | "\n", 67 | "### Pros\n", 68 | "\n", 69 | "* Easy to explain\n", 70 | "* Can be visualized\n", 71 | "* Can handle categorical variables and missing data well\n", 72 | "* Fast prediction\n", 73 | "\n", 74 | "### Cons\n", 75 | "\n", 76 | "* Typically don't have very strong prediction accuracy\n", 77 | "* Very sensitive to small changes in training data\n", 78 | "* Can be slow to train" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": null, 84 | "metadata": { 85 | "collapsed": true 86 | }, 87 | "outputs": [], 88 | "source": [] 89 | } 90 | ], 91 | "metadata": { 92 | "kernelspec": { 93 | "display_name": "Python 3", 94 | "language": "python", 95 | "name": "python3" 96 | }, 97 | "language_info": { 98 | "codemirror_mode": { 99 | "name": "ipython", 100 | "version": 3 101 | }, 102 | "file_extension": ".py", 103 | "mimetype": "text/x-python", 104 | "name": "python", 105 | "nbconvert_exporter": "python", 106 | "pygments_lexer": "ipython3", 107 | "version": "3.5.3" 108 | } 109 | }, 110 | "nbformat": 4, 111 | "nbformat_minor": 2 112 | } 113 | -------------------------------------------------------------------------------- /small_data/Wholesale customers data.csv: -------------------------------------------------------------------------------- 1 | Channel,Region,Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicassen 2 | 2,3,12669,9656,7561,214,2674,1338 3 | 2,3,7057,9810,9568,1762,3293,1776 4 | 2,3,6353,8808,7684,2405,3516,7844 5 | 1,3,13265,1196,4221,6404,507,1788 6 | 2,3,22615,5410,7198,3915,1777,5185 7 | 2,3,9413,8259,5126,666,1795,1451 8 | 2,3,12126,3199,6975,480,3140,545 9 | 2,3,7579,4956,9426,1669,3321,2566 10 | 1,3,5963,3648,6192,425,1716,750 11 | 2,3,6006,11093,18881,1159,7425,2098 12 | 2,3,3366,5403,12974,4400,5977,1744 13 | 2,3,13146,1124,4523,1420,549,497 14 | 2,3,31714,12319,11757,287,3881,2931 15 | 2,3,21217,6208,14982,3095,6707,602 16 | 2,3,24653,9465,12091,294,5058,2168 17 | 1,3,10253,1114,3821,397,964,412 18 | 
2,3,1020,8816,12121,134,4508,1080 19 | 1,3,5876,6157,2933,839,370,4478 20 | 2,3,18601,6327,10099,2205,2767,3181 21 | 1,3,7780,2495,9464,669,2518,501 22 | 2,3,17546,4519,4602,1066,2259,2124 23 | 1,3,5567,871,2010,3383,375,569 24 | 1,3,31276,1917,4469,9408,2381,4334 25 | 2,3,26373,36423,22019,5154,4337,16523 26 | 2,3,22647,9776,13792,2915,4482,5778 27 | 2,3,16165,4230,7595,201,4003,57 28 | 1,3,9898,961,2861,3151,242,833 29 | 1,3,14276,803,3045,485,100,518 30 | 2,3,4113,20484,25957,1158,8604,5206 31 | 1,3,43088,2100,2609,1200,1107,823 32 | 1,3,18815,3610,11107,1148,2134,2963 33 | 1,3,2612,4339,3133,2088,820,985 34 | 1,3,21632,1318,2886,266,918,405 35 | 1,3,29729,4786,7326,6130,361,1083 36 | 1,3,1502,1979,2262,425,483,395 37 | 2,3,688,5491,11091,833,4239,436 38 | 1,3,29955,4362,5428,1729,862,4626 39 | 2,3,15168,10556,12477,1920,6506,714 40 | 2,3,4591,15729,16709,33,6956,433 41 | 1,3,56159,555,902,10002,212,2916 42 | 1,3,24025,4332,4757,9510,1145,5864 43 | 1,3,19176,3065,5956,2033,2575,2802 44 | 2,3,10850,7555,14961,188,6899,46 45 | 2,3,630,11095,23998,787,9529,72 46 | 2,3,9670,7027,10471,541,4618,65 47 | 2,3,5181,22044,21531,1740,7353,4985 48 | 2,3,3103,14069,21955,1668,6792,1452 49 | 2,3,44466,54259,55571,7782,24171,6465 50 | 2,3,11519,6152,10868,584,5121,1476 51 | 2,3,4967,21412,28921,1798,13583,1163 52 | 1,3,6269,1095,1980,3860,609,2162 53 | 1,3,3347,4051,6996,239,1538,301 54 | 2,3,40721,3916,5876,532,2587,1278 55 | 2,3,491,10473,11532,744,5611,224 56 | 1,3,27329,1449,1947,2436,204,1333 57 | 1,3,5264,3683,5005,1057,2024,1130 58 | 2,3,4098,29892,26866,2616,17740,1340 59 | 2,3,5417,9933,10487,38,7572,1282 60 | 1,3,13779,1970,1648,596,227,436 61 | 1,3,6137,5360,8040,129,3084,1603 62 | 2,3,8590,3045,7854,96,4095,225 63 | 2,3,35942,38369,59598,3254,26701,2017 64 | 2,3,7823,6245,6544,4154,4074,964 65 | 2,3,9396,11601,15775,2896,7677,1295 66 | 1,3,4760,1227,3250,3724,1247,1145 67 | 2,3,85,20959,45828,36,24231,1423 68 | 1,3,9,1534,7417,175,3468,27 69 | 2,3,19913,6759,13462,1256,5141,834 70 | 1,3,2446,7260,3993,5870,788,3095 71 | 1,3,8352,2820,1293,779,656,144 72 | 1,3,16705,2037,3202,10643,116,1365 73 | 1,3,18291,1266,21042,5373,4173,14472 74 | 1,3,4420,5139,2661,8872,1321,181 75 | 2,3,19899,5332,8713,8132,764,648 76 | 2,3,8190,6343,9794,1285,1901,1780 77 | 1,3,20398,1137,3,4407,3,975 78 | 1,3,717,3587,6532,7530,529,894 79 | 2,3,12205,12697,28540,869,12034,1009 80 | 1,3,10766,1175,2067,2096,301,167 81 | 1,3,1640,3259,3655,868,1202,1653 82 | 1,3,7005,829,3009,430,610,529 83 | 2,3,219,9540,14403,283,7818,156 84 | 2,3,10362,9232,11009,737,3537,2342 85 | 1,3,20874,1563,1783,2320,550,772 86 | 2,3,11867,3327,4814,1178,3837,120 87 | 2,3,16117,46197,92780,1026,40827,2944 88 | 2,3,22925,73498,32114,987,20070,903 89 | 1,3,43265,5025,8117,6312,1579,14351 90 | 1,3,7864,542,4042,9735,165,46 91 | 1,3,24904,3836,5330,3443,454,3178 92 | 1,3,11405,596,1638,3347,69,360 93 | 1,3,12754,2762,2530,8693,627,1117 94 | 2,3,9198,27472,32034,3232,18906,5130 95 | 1,3,11314,3090,2062,35009,71,2698 96 | 2,3,5626,12220,11323,206,5038,244 97 | 1,3,3,2920,6252,440,223,709 98 | 2,3,23,2616,8118,145,3874,217 99 | 1,3,403,254,610,774,54,63 100 | 1,3,503,112,778,895,56,132 101 | 1,3,9658,2182,1909,5639,215,323 102 | 2,3,11594,7779,12144,3252,8035,3029 103 | 2,3,1420,10810,16267,1593,6766,1838 104 | 2,3,2932,6459,7677,2561,4573,1386 105 | 1,3,56082,3504,8906,18028,1480,2498 106 | 1,3,14100,2132,3445,1336,1491,548 107 | 1,3,15587,1014,3970,910,139,1378 108 | 2,3,1454,6337,10704,133,6830,1831 109 | 2,3,8797,10646,14886,2471,8969,1438 110 
| 2,3,1531,8397,6981,247,2505,1236 111 | 2,3,1406,16729,28986,673,836,3 112 | 1,3,11818,1648,1694,2276,169,1647 113 | 2,3,12579,11114,17569,805,6457,1519 114 | 1,3,19046,2770,2469,8853,483,2708 115 | 1,3,14438,2295,1733,3220,585,1561 116 | 1,3,18044,1080,2000,2555,118,1266 117 | 1,3,11134,793,2988,2715,276,610 118 | 1,3,11173,2521,3355,1517,310,222 119 | 1,3,6990,3880,5380,1647,319,1160 120 | 1,3,20049,1891,2362,5343,411,933 121 | 1,3,8258,2344,2147,3896,266,635 122 | 1,3,17160,1200,3412,2417,174,1136 123 | 1,3,4020,3234,1498,2395,264,255 124 | 1,3,12212,201,245,1991,25,860 125 | 2,3,11170,10769,8814,2194,1976,143 126 | 1,3,36050,1642,2961,4787,500,1621 127 | 1,3,76237,3473,7102,16538,778,918 128 | 1,3,19219,1840,1658,8195,349,483 129 | 2,3,21465,7243,10685,880,2386,2749 130 | 1,3,140,8847,3823,142,1062,3 131 | 1,3,42312,926,1510,1718,410,1819 132 | 1,3,7149,2428,699,6316,395,911 133 | 1,3,2101,589,314,346,70,310 134 | 1,3,14903,2032,2479,576,955,328 135 | 1,3,9434,1042,1235,436,256,396 136 | 1,3,7388,1882,2174,720,47,537 137 | 1,3,6300,1289,2591,1170,199,326 138 | 1,3,4625,8579,7030,4575,2447,1542 139 | 1,3,3087,8080,8282,661,721,36 140 | 1,3,13537,4257,5034,155,249,3271 141 | 1,3,5387,4979,3343,825,637,929 142 | 1,3,17623,4280,7305,2279,960,2616 143 | 1,3,30379,13252,5189,321,51,1450 144 | 1,3,37036,7152,8253,2995,20,3 145 | 1,3,10405,1596,1096,8425,399,318 146 | 1,3,18827,3677,1988,118,516,201 147 | 2,3,22039,8384,34792,42,12591,4430 148 | 1,3,7769,1936,2177,926,73,520 149 | 1,3,9203,3373,2707,1286,1082,526 150 | 1,3,5924,584,542,4052,283,434 151 | 1,3,31812,1433,1651,800,113,1440 152 | 1,3,16225,1825,1765,853,170,1067 153 | 1,3,1289,3328,2022,531,255,1774 154 | 1,3,18840,1371,3135,3001,352,184 155 | 1,3,3463,9250,2368,779,302,1627 156 | 1,3,622,55,137,75,7,8 157 | 2,3,1989,10690,19460,233,11577,2153 158 | 2,3,3830,5291,14855,317,6694,3182 159 | 1,3,17773,1366,2474,3378,811,418 160 | 2,3,2861,6570,9618,930,4004,1682 161 | 2,3,355,7704,14682,398,8077,303 162 | 2,3,1725,3651,12822,824,4424,2157 163 | 1,3,12434,540,283,1092,3,2233 164 | 1,3,15177,2024,3810,2665,232,610 165 | 2,3,5531,15726,26870,2367,13726,446 166 | 2,3,5224,7603,8584,2540,3674,238 167 | 2,3,15615,12653,19858,4425,7108,2379 168 | 2,3,4822,6721,9170,993,4973,3637 169 | 1,3,2926,3195,3268,405,1680,693 170 | 1,3,5809,735,803,1393,79,429 171 | 1,3,5414,717,2155,2399,69,750 172 | 2,3,260,8675,13430,1116,7015,323 173 | 2,3,200,25862,19816,651,8773,6250 174 | 1,3,955,5479,6536,333,2840,707 175 | 2,3,514,7677,19805,937,9836,716 176 | 1,3,286,1208,5241,2515,153,1442 177 | 2,3,2343,7845,11874,52,4196,1697 178 | 1,3,45640,6958,6536,7368,1532,230 179 | 1,3,12759,7330,4533,1752,20,2631 180 | 1,3,11002,7075,4945,1152,120,395 181 | 1,3,3157,4888,2500,4477,273,2165 182 | 1,3,12356,6036,8887,402,1382,2794 183 | 1,3,112151,29627,18148,16745,4948,8550 184 | 1,3,694,8533,10518,443,6907,156 185 | 1,3,36847,43950,20170,36534,239,47943 186 | 1,3,327,918,4710,74,334,11 187 | 1,3,8170,6448,1139,2181,58,247 188 | 1,3,3009,521,854,3470,949,727 189 | 1,3,2438,8002,9819,6269,3459,3 190 | 2,3,8040,7639,11687,2758,6839,404 191 | 2,3,834,11577,11522,275,4027,1856 192 | 1,3,16936,6250,1981,7332,118,64 193 | 1,3,13624,295,1381,890,43,84 194 | 1,3,5509,1461,2251,547,187,409 195 | 2,3,180,3485,20292,959,5618,666 196 | 1,3,7107,1012,2974,806,355,1142 197 | 1,3,17023,5139,5230,7888,330,1755 198 | 1,1,30624,7209,4897,18711,763,2876 199 | 2,1,2427,7097,10391,1127,4314,1468 200 | 1,1,11686,2154,6824,3527,592,697 201 | 1,1,9670,2280,2112,520,402,347 202 | 
2,1,3067,13240,23127,3941,9959,731 203 | 2,1,4484,14399,24708,3549,14235,1681 204 | 1,1,25203,11487,9490,5065,284,6854 205 | 1,1,583,685,2216,469,954,18 206 | 1,1,1956,891,5226,1383,5,1328 207 | 2,1,1107,11711,23596,955,9265,710 208 | 1,1,6373,780,950,878,288,285 209 | 2,1,2541,4737,6089,2946,5316,120 210 | 1,1,1537,3748,5838,1859,3381,806 211 | 2,1,5550,12729,16767,864,12420,797 212 | 1,1,18567,1895,1393,1801,244,2100 213 | 2,1,12119,28326,39694,4736,19410,2870 214 | 1,1,7291,1012,2062,1291,240,1775 215 | 1,1,3317,6602,6861,1329,3961,1215 216 | 2,1,2362,6551,11364,913,5957,791 217 | 1,1,2806,10765,15538,1374,5828,2388 218 | 2,1,2532,16599,36486,179,13308,674 219 | 1,1,18044,1475,2046,2532,130,1158 220 | 2,1,18,7504,15205,1285,4797,6372 221 | 1,1,4155,367,1390,2306,86,130 222 | 1,1,14755,899,1382,1765,56,749 223 | 1,1,5396,7503,10646,91,4167,239 224 | 1,1,5041,1115,2856,7496,256,375 225 | 2,1,2790,2527,5265,5612,788,1360 226 | 1,1,7274,659,1499,784,70,659 227 | 1,1,12680,3243,4157,660,761,786 228 | 2,1,20782,5921,9212,1759,2568,1553 229 | 1,1,4042,2204,1563,2286,263,689 230 | 1,1,1869,577,572,950,4762,203 231 | 1,1,8656,2746,2501,6845,694,980 232 | 2,1,11072,5989,5615,8321,955,2137 233 | 1,1,2344,10678,3828,1439,1566,490 234 | 1,1,25962,1780,3838,638,284,834 235 | 1,1,964,4984,3316,937,409,7 236 | 1,1,15603,2703,3833,4260,325,2563 237 | 1,1,1838,6380,2824,1218,1216,295 238 | 1,1,8635,820,3047,2312,415,225 239 | 1,1,18692,3838,593,4634,28,1215 240 | 1,1,7363,475,585,1112,72,216 241 | 1,1,47493,2567,3779,5243,828,2253 242 | 1,1,22096,3575,7041,11422,343,2564 243 | 1,1,24929,1801,2475,2216,412,1047 244 | 1,1,18226,659,2914,3752,586,578 245 | 1,1,11210,3576,5119,561,1682,2398 246 | 1,1,6202,7775,10817,1183,3143,1970 247 | 2,1,3062,6154,13916,230,8933,2784 248 | 1,1,8885,2428,1777,1777,430,610 249 | 1,1,13569,346,489,2077,44,659 250 | 1,1,15671,5279,2406,559,562,572 251 | 1,1,8040,3795,2070,6340,918,291 252 | 1,1,3191,1993,1799,1730,234,710 253 | 2,1,6134,23133,33586,6746,18594,5121 254 | 1,1,6623,1860,4740,7683,205,1693 255 | 1,1,29526,7961,16966,432,363,1391 256 | 1,1,10379,17972,4748,4686,1547,3265 257 | 1,1,31614,489,1495,3242,111,615 258 | 1,1,11092,5008,5249,453,392,373 259 | 1,1,8475,1931,1883,5004,3593,987 260 | 1,1,56083,4563,2124,6422,730,3321 261 | 1,1,53205,4959,7336,3012,967,818 262 | 1,1,9193,4885,2157,327,780,548 263 | 1,1,7858,1110,1094,6818,49,287 264 | 1,1,23257,1372,1677,982,429,655 265 | 1,1,2153,1115,6684,4324,2894,411 266 | 2,1,1073,9679,15445,61,5980,1265 267 | 1,1,5909,23527,13699,10155,830,3636 268 | 2,1,572,9763,22182,2221,4882,2563 269 | 1,1,20893,1222,2576,3975,737,3628 270 | 2,1,11908,8053,19847,1069,6374,698 271 | 1,1,15218,258,1138,2516,333,204 272 | 1,1,4720,1032,975,5500,197,56 273 | 1,1,2083,5007,1563,1120,147,1550 274 | 1,1,514,8323,6869,529,93,1040 275 | 1,3,36817,3045,1493,4802,210,1824 276 | 1,3,894,1703,1841,744,759,1153 277 | 1,3,680,1610,223,862,96,379 278 | 1,3,27901,3749,6964,4479,603,2503 279 | 1,3,9061,829,683,16919,621,139 280 | 1,3,11693,2317,2543,5845,274,1409 281 | 2,3,17360,6200,9694,1293,3620,1721 282 | 1,3,3366,2884,2431,977,167,1104 283 | 2,3,12238,7108,6235,1093,2328,2079 284 | 1,3,49063,3965,4252,5970,1041,1404 285 | 1,3,25767,3613,2013,10303,314,1384 286 | 1,3,68951,4411,12609,8692,751,2406 287 | 1,3,40254,640,3600,1042,436,18 288 | 1,3,7149,2247,1242,1619,1226,128 289 | 1,3,15354,2102,2828,8366,386,1027 290 | 1,3,16260,594,1296,848,445,258 291 | 1,3,42786,286,471,1388,32,22 292 | 1,3,2708,2160,2642,502,965,1522 293 | 
1,3,6022,3354,3261,2507,212,686 294 | 1,3,2838,3086,4329,3838,825,1060 295 | 2,2,3996,11103,12469,902,5952,741 296 | 1,2,21273,2013,6550,909,811,1854 297 | 2,2,7588,1897,5234,417,2208,254 298 | 1,2,19087,1304,3643,3045,710,898 299 | 2,2,8090,3199,6986,1455,3712,531 300 | 2,2,6758,4560,9965,934,4538,1037 301 | 1,2,444,879,2060,264,290,259 302 | 2,2,16448,6243,6360,824,2662,2005 303 | 2,2,5283,13316,20399,1809,8752,172 304 | 2,2,2886,5302,9785,364,6236,555 305 | 2,2,2599,3688,13829,492,10069,59 306 | 2,2,161,7460,24773,617,11783,2410 307 | 2,2,243,12939,8852,799,3909,211 308 | 2,2,6468,12867,21570,1840,7558,1543 309 | 1,2,17327,2374,2842,1149,351,925 310 | 1,2,6987,1020,3007,416,257,656 311 | 2,2,918,20655,13567,1465,6846,806 312 | 1,2,7034,1492,2405,12569,299,1117 313 | 1,2,29635,2335,8280,3046,371,117 314 | 2,2,2137,3737,19172,1274,17120,142 315 | 1,2,9784,925,2405,4447,183,297 316 | 1,2,10617,1795,7647,1483,857,1233 317 | 2,2,1479,14982,11924,662,3891,3508 318 | 1,2,7127,1375,2201,2679,83,1059 319 | 1,2,1182,3088,6114,978,821,1637 320 | 1,2,11800,2713,3558,2121,706,51 321 | 2,2,9759,25071,17645,1128,12408,1625 322 | 1,2,1774,3696,2280,514,275,834 323 | 1,2,9155,1897,5167,2714,228,1113 324 | 1,2,15881,713,3315,3703,1470,229 325 | 1,2,13360,944,11593,915,1679,573 326 | 1,2,25977,3587,2464,2369,140,1092 327 | 1,2,32717,16784,13626,60869,1272,5609 328 | 1,2,4414,1610,1431,3498,387,834 329 | 1,2,542,899,1664,414,88,522 330 | 1,2,16933,2209,3389,7849,210,1534 331 | 1,2,5113,1486,4583,5127,492,739 332 | 1,2,9790,1786,5109,3570,182,1043 333 | 2,2,11223,14881,26839,1234,9606,1102 334 | 1,2,22321,3216,1447,2208,178,2602 335 | 2,2,8565,4980,67298,131,38102,1215 336 | 2,2,16823,928,2743,11559,332,3486 337 | 2,2,27082,6817,10790,1365,4111,2139 338 | 1,2,13970,1511,1330,650,146,778 339 | 1,2,9351,1347,2611,8170,442,868 340 | 1,2,3,333,7021,15601,15,550 341 | 1,2,2617,1188,5332,9584,573,1942 342 | 2,3,381,4025,9670,388,7271,1371 343 | 2,3,2320,5763,11238,767,5162,2158 344 | 1,3,255,5758,5923,349,4595,1328 345 | 2,3,1689,6964,26316,1456,15469,37 346 | 1,3,3043,1172,1763,2234,217,379 347 | 1,3,1198,2602,8335,402,3843,303 348 | 2,3,2771,6939,15541,2693,6600,1115 349 | 2,3,27380,7184,12311,2809,4621,1022 350 | 1,3,3428,2380,2028,1341,1184,665 351 | 2,3,5981,14641,20521,2005,12218,445 352 | 1,3,3521,1099,1997,1796,173,995 353 | 2,3,1210,10044,22294,1741,12638,3137 354 | 1,3,608,1106,1533,830,90,195 355 | 2,3,117,6264,21203,228,8682,1111 356 | 1,3,14039,7393,2548,6386,1333,2341 357 | 1,3,190,727,2012,245,184,127 358 | 1,3,22686,134,218,3157,9,548 359 | 2,3,37,1275,22272,137,6747,110 360 | 1,3,759,18664,1660,6114,536,4100 361 | 1,3,796,5878,2109,340,232,776 362 | 1,3,19746,2872,2006,2601,468,503 363 | 1,3,4734,607,864,1206,159,405 364 | 1,3,2121,1601,2453,560,179,712 365 | 1,3,4627,997,4438,191,1335,314 366 | 1,3,2615,873,1524,1103,514,468 367 | 2,3,4692,6128,8025,1619,4515,3105 368 | 1,3,9561,2217,1664,1173,222,447 369 | 1,3,3477,894,534,1457,252,342 370 | 1,3,22335,1196,2406,2046,101,558 371 | 1,3,6211,337,683,1089,41,296 372 | 2,3,39679,3944,4955,1364,523,2235 373 | 1,3,20105,1887,1939,8164,716,790 374 | 1,3,3884,3801,1641,876,397,4829 375 | 2,3,15076,6257,7398,1504,1916,3113 376 | 1,3,6338,2256,1668,1492,311,686 377 | 1,3,5841,1450,1162,597,476,70 378 | 2,3,3136,8630,13586,5641,4666,1426 379 | 1,3,38793,3154,2648,1034,96,1242 380 | 1,3,3225,3294,1902,282,68,1114 381 | 2,3,4048,5164,10391,130,813,179 382 | 1,3,28257,944,2146,3881,600,270 383 | 1,3,17770,4591,1617,9927,246,532 384 | 
1,3,34454,7435,8469,2540,1711,2893 385 | 1,3,1821,1364,3450,4006,397,361 386 | 1,3,10683,21858,15400,3635,282,5120 387 | 1,3,11635,922,1614,2583,192,1068 388 | 1,3,1206,3620,2857,1945,353,967 389 | 1,3,20918,1916,1573,1960,231,961 390 | 1,3,9785,848,1172,1677,200,406 391 | 1,3,9385,1530,1422,3019,227,684 392 | 1,3,3352,1181,1328,5502,311,1000 393 | 1,3,2647,2761,2313,907,95,1827 394 | 1,3,518,4180,3600,659,122,654 395 | 1,3,23632,6730,3842,8620,385,819 396 | 1,3,12377,865,3204,1398,149,452 397 | 1,3,9602,1316,1263,2921,841,290 398 | 2,3,4515,11991,9345,2644,3378,2213 399 | 1,3,11535,1666,1428,6838,64,743 400 | 1,3,11442,1032,582,5390,74,247 401 | 1,3,9612,577,935,1601,469,375 402 | 1,3,4446,906,1238,3576,153,1014 403 | 1,3,27167,2801,2128,13223,92,1902 404 | 1,3,26539,4753,5091,220,10,340 405 | 1,3,25606,11006,4604,127,632,288 406 | 1,3,18073,4613,3444,4324,914,715 407 | 1,3,6884,1046,1167,2069,593,378 408 | 1,3,25066,5010,5026,9806,1092,960 409 | 2,3,7362,12844,18683,2854,7883,553 410 | 2,3,8257,3880,6407,1646,2730,344 411 | 1,3,8708,3634,6100,2349,2123,5137 412 | 1,3,6633,2096,4563,1389,1860,1892 413 | 1,3,2126,3289,3281,1535,235,4365 414 | 1,3,97,3605,12400,98,2970,62 415 | 1,3,4983,4859,6633,17866,912,2435 416 | 1,3,5969,1990,3417,5679,1135,290 417 | 2,3,7842,6046,8552,1691,3540,1874 418 | 2,3,4389,10940,10908,848,6728,993 419 | 1,3,5065,5499,11055,364,3485,1063 420 | 2,3,660,8494,18622,133,6740,776 421 | 1,3,8861,3783,2223,633,1580,1521 422 | 1,3,4456,5266,13227,25,6818,1393 423 | 2,3,17063,4847,9053,1031,3415,1784 424 | 1,3,26400,1377,4172,830,948,1218 425 | 2,3,17565,3686,4657,1059,1803,668 426 | 2,3,16980,2884,12232,874,3213,249 427 | 1,3,11243,2408,2593,15348,108,1886 428 | 1,3,13134,9347,14316,3141,5079,1894 429 | 1,3,31012,16687,5429,15082,439,1163 430 | 1,3,3047,5970,4910,2198,850,317 431 | 1,3,8607,1750,3580,47,84,2501 432 | 1,3,3097,4230,16483,575,241,2080 433 | 1,3,8533,5506,5160,13486,1377,1498 434 | 1,3,21117,1162,4754,269,1328,395 435 | 1,3,1982,3218,1493,1541,356,1449 436 | 1,3,16731,3922,7994,688,2371,838 437 | 1,3,29703,12051,16027,13135,182,2204 438 | 1,3,39228,1431,764,4510,93,2346 439 | 2,3,14531,15488,30243,437,14841,1867 440 | 1,3,10290,1981,2232,1038,168,2125 441 | 1,3,2787,1698,2510,65,477,52 442 | -------------------------------------------------------------------------------- /small_data/u.item: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/small_data/u.item --------------------------------------------------------------------------------