├── .gitignore ├── .ipynb_checkpoints ├── Syllabus-checkpoint.md └── Untitled-checkpoint.ipynb ├── Syllabus.md ├── Syllabus.pdf ├── cheat_sheets └── git │ ├── figures │ ├── clone_screen.png │ ├── create_branch_website.png │ ├── download.png │ ├── example_notebook.png │ ├── fork.png │ ├── git_commit.png │ ├── git_status.png │ ├── init_profile.png │ ├── init_repo.png │ ├── jupyter.png │ ├── jupyter_dir.png │ ├── jupyter_dropdown.png │ ├── jupyter_shutdown.png │ ├── new_repo.png │ ├── open_pull_request.png │ ├── open_pull_request2.png │ ├── pull_request.png │ ├── pull_request2.png │ ├── rename_notebook.png │ ├── repo_screen.png │ └── updated_repo.png │ ├── git_cheat_sheet.pdf │ └── git_cheat_sheet.tex ├── homeworks ├── .ipynb_checkpoints │ ├── Final Project-checkpoint.ipynb │ ├── Homework_1-checkpoint.ipynb │ ├── Homework_2-checkpoint.ipynb │ ├── Homework_3-checkpoint.ipynb │ ├── Homework_4-checkpoint.ipynb │ └── Homework_5-checkpoint.ipynb ├── Final Project.ipynb ├── Homework_1.ipynb ├── Homework_2.ipynb ├── Homework_3.ipynb ├── Homework_4.ipynb └── Homework_5.ipynb ├── lectures ├── .ipynb_checkpoints │ ├── Lecture_10_Random_Forest-checkpoint.ipynb │ ├── Lecture_11_Boosting-checkpoint.ipynb │ ├── Lecture_1_Introduction-checkpoint.ipynb │ ├── Lecture_2_Python_Git-checkpoint.ipynb │ ├── Lecture_3_Get_Clean_Visualize_Data-checkpoint.ipynb │ ├── Lecture_4_Linear_Regression_and_Evaluation-checkpoint.ipynb │ ├── Lecture_5_Logistic_Regression_and_Evaluation-checkpoint.ipynb │ ├── Lecture_6_K_Nearest_Neighbors-checkpoint.ipynb │ ├── Lecture_7_Naive_Bayes-checkpoint.ipynb │ ├── Lecture_8_SVM-checkpoint.ipynb │ ├── Lecture_9_Decision_Trees-checkpoint.ipynb │ └── Random Notes-checkpoint.ipynb ├── Extra_Encoder_Decoder.ipynb ├── In_Progress_Encoder_Decoder_Attention.ipynb ├── Lecture_10_Random_Forest.ipynb ├── Lecture_11_Boosting.ipynb ├── Lecture_12_Dimensionality Reduction.ipynb ├── Lecture_13_Clustering.ipynb ├── Lecture_14_MLP.ipynb ├── Lecture_15_Conv_Nets.ipynb ├── Lecture_16_RNNs.ipynb ├── Lecture_17_Recommender_Systems.ipynb ├── Lecture_1_Introduction.ipynb ├── Lecture_2_Python_Git.ipynb ├── Lecture_3_Get_Clean_Visualize_Data.ipynb ├── Lecture_4_Linear_Regression_and_Evaluation.ipynb ├── Lecture_5_Logistic_Regression_and_Evaluation.ipynb ├── Lecture_6_K_Nearest_Neighbors.ipynb ├── Lecture_7_Naive_Bayes.ipynb ├── Lecture_8_SVM.ipynb └── Lecture_9_Decision_Trees.ipynb └── small_data ├── Wholesale customers data.csv ├── baby_names ├── yob2000.txt ├── yob2001.txt ├── yob2002.txt ├── yob2003.txt ├── yob2004.txt ├── yob2005.txt ├── yob2006.txt ├── yob2007.txt ├── yob2008.txt ├── yob2009.txt ├── yob2010.txt ├── yob2011.txt ├── yob2012.txt ├── yob2013.txt ├── yob2014.txt ├── yob2015.txt └── yob2016.txt ├── elonmusk_tweets.csv ├── male_names.txt ├── movie_data.tsv └── u.item /.gitignore: -------------------------------------------------------------------------------- 1 | large_data/ 2 | *.swp 3 | Syllabus.docx 4 | drafts/ 5 | *.aux 6 | *.log 7 | *.synctex.gz 8 | *.toc 9 | *.DS_Store 10 | *lectures/.ipynb_checkpoints/* 11 | models/ 12 | -------------------------------------------------------------------------------- /.ipynb_checkpoints/Syllabus-checkpoint.md: -------------------------------------------------------------------------------- 1 | **Applied Machine Learning Syllabus** 2 | 3 | Economics 213R 4 | 5 | Professor: Tyler Folkman 6 | 7 | 8 | ## Contact Information 9 | 10 | I have created a slack group that we will use for most communication. 
I will provide more information on how to join and use slack when the course starts. 11 | 12 | You can also always reach me at tylerfolkman at gmail 13 | 14 | 15 | ## Office Hours 16 | 17 | To be determined. I will have an hour twice a week for office hours, but I want to discuss with the class the best times. My office hours will most likely be remote using Zoom video conference. 18 | 19 | 20 | ## Grading 21 | 22 | Final Project: 30% 23 | 24 | Homeworks: 50% 25 | 26 | Participation: 10% 27 | 28 | Quizzes: 10% 29 | 30 | Grading will be done on a curve and not upon absolute thresholds. 31 | 32 | 33 | ## Class Material 34 | 35 | I will be leveraging material from many sources as well as my own content. Below is a listed of sources I used. Many are freely (and legally) available online. Not required to purchase any, though, I find [1] to be quite good as well as [4]. [4] is available for free online. 36 | 37 | 1. Hands-On Machine Learning with Scikit-Learn and Tensorflow 38 | 2. Data Science From Scratch 39 | 3. Python for Data Analysis 40 | 4. Introduction to Statistical Learning 41 | 5. Coursera: Machine Learning 42 | 6. Fastai course 1 43 | 7. https://distill.pub/2016/misread-tsne/ 44 | 45 | 46 | ## Course Description 47 | 48 | The amount of data being created in the world is staggering. More data has been created in the past two years than in the entire previous history of the human race and companies are taking notice. Some of the largest companies in the world - Google, Microsoft, Facebook, and Amazon to name a few - are leveraging these data to create valuable products. Amazon's Alexa, Google's search, Facebook's timeline are all powered by machine learning. These same companies are also scrambling to get as much machine learning talent as they can. 49 | 50 | This course will teach you the fundamental building blocks of machine learning. We will go from how to gather and clean data to writing Python code to create and evaluate our own models. We will cover enough theory to ensure an understanding of machine learning models, but the focus is on the application of machine learning. To that end, we will also cover machine learning in industry and how to avoid some of the pitfalls that cause machine learning initiatives to fail. 51 | 52 | The course will use Python heavily and while not a pre-requisite any previous experience with coding and/or Python will be very helpful. 53 | 54 | 55 | ## Prerequisites 56 | 57 | Econ 110 and Econ 378 58 | 59 | 60 | ## Tentative Outline 61 | 62 | ### Introduction to Machine Learning 63 | 64 | 1. What, why, how, and challenges 65 | 2. Differences between stats, ML, and econometrics 66 | 67 | References: [1] Chapter 1 68 | 69 | 70 | ### Introduction to Python 71 | 72 | 1. Functions, Lists, Dictionaries, Counters, Sets, List Comprehensions, Generators 73 | 2. Data science stack: numpy, pandas, scikit-learn, scipy, jupyter notebooks 74 | 3. Git and github 75 | 76 | References: [2] Chapter 2, [3] All Chapters 77 | 78 | 79 | ### How to get and clean data 80 | 81 | 1. Loading from text, SQL, and web pages 82 | 2. Cleaning data: missing, outliers, scaling data, reformatting 83 | 84 | References: [3] Chapters 5-7, [2] Chapters 9-10 and 23, [1] Chapter 2 85 | 86 | 87 | ### How to visualize and describe data 88 | 89 | 1. Data description techniques: mean, median, percentiles, correlations 90 | 2. 
Data visualization techniques: various plots, Bokeh 91 | 92 | References: [1] Chapter 2, [2] Chapters 3 and 5, [3] Chapters 5, 8 and 9 93 | 94 | 95 | ### Regression 96 | 97 | 1. Linear Regression 98 | 2. Logistic Regression 99 | 3. Regularized Regression 100 | 101 | References: [2] Chapters 14-16, [1] Chapter 4, [4] Chapter 3, [5] Weeks 1-3 102 | 103 | 104 | ### Gradient Descent 105 | 106 | 1. Cost functions, learning rates, and gradients 107 | 108 | References: [2] Chapter 8, [1] Chapter 4 109 | 110 | 111 | ### How to evaluate and tune models 112 | 113 | 1. Training/Testing and Cross validation 114 | 2. MSE 115 | 3. Confusion matrix, precision and recall, ROC 116 | 4. Learning Curves 117 | 5. Bias and variance / over-fitting under-fitting 118 | 6. Tuning hyperparameters 119 | 7. Feature importances 120 | 121 | References: [2] Chapter 11, [1] Chapter 2 and 3, [4] Chapter 2, 5, [5] Week 6 122 | 123 | 124 | ### Classification Models 125 | 126 | 1. Naive Bayes, K-nearest neighbors 127 | 128 | References: [2] Chapters 12-13, [4] Chapter 4 129 | 130 | 131 | ### SVMs 132 | 133 | 1. Max-margin, kernel trick, more than 2 classes 134 | 135 | References: [1] Chapter 5, [4] Chapter 9, [5] Week 7 136 | 137 | 138 | ### Trees and Ensembles 139 | 140 | 1. Decision Trees, Random Forests, and Gradient Boosted Trees 141 | 2. XGBoost 142 | 143 | References: [2] Chapter 17, [1] Chapters 6-7, [4] Chapter 8 144 | 145 | 146 | ### Dimensionality Reduction 147 | 148 | 1. Curse of dimensionality 149 | 2. PCA and SVD 150 | 3. T-SNE 151 | 152 | References: [1] Chapter 8, [7]. [4] Chapter 6, 10, [5] Week 8 153 | 154 | 155 | ### Clustering 156 | 157 | 1. K-means, Hierarchical, DBScan 158 | 159 | References: [2] Chapter 19, [4] Chapter 10, [5] Week 8 160 | 161 | 162 | ### Recommender Systems 163 | 164 | 1. Collaborative filtering and deep collaborative filtering 165 | 166 | References: [2] Chapter 22, [5] Week 9 167 | 168 | 169 | ### Data Science at Scale 170 | 171 | 1. Parallelism - dask and blaze 172 | 2. Spark and Mapreduce 173 | 3. AWS 174 | 4. Online learning and stochastic gradient descent 175 | 176 | References: [5] Week 10, [2] Chapter 24 177 | 178 | 179 | ### Deep Learning 180 | 181 | 1. Perceptron, simple networks, and backprop 182 | 2. GPUs 183 | 3. Tensorflow, Keras, Pytorch 184 | 4. CNNs 185 | 5. RNNs and word2vec 186 | 187 | References: [2] Chapter 18, [1] Chapters 9-16, [5] Weeks 4-5, [6] All Lessons 188 | 189 | 190 | ### Deploying Machine Learning 191 | 192 | 1. Docker 193 | 2. Flask and REST Endpoints 194 | 195 | 196 | ### Machine Learning in Industry 197 | 198 | 1. Guest lecturers / companies 199 | 2. Things Kaggle challenges won’t teach you 200 | 3. Why most data science initiatives fail 201 | 4. Importance of communication and collaboration 202 | 203 | 204 | ## University 205 | 206 | ### Honor Code 207 | In keeping with the principles of the BYU Honor Code, students are expected to be honest in all of their academic work. Academic honesty means, most fundamentally, that any work you present as your own must in fact be your own work and not that of another. Violations of this principle may result in a failing grade in the course and additional disciplinary action by the university. Students are also expected to adhere to the Dress and Grooming Standards. Adherence demonstrates respect for yourself and others and ensures an effective learning and working environment. It is the university's expectation, and my own expectation in class, that each student will abide by all Honor Code standards. 
Please call the Honor Code Office at 422-2847 if you have questions about those standards. 208 | 209 | ### Preventing & Responding to Sexual Misconduct 210 | In accordance with Title IX of the Education Amendments of 1972, Brigham Young University prohibits unlawful sex discrimination against any participant in its education programs or activities. The university also prohibits sexual harassment—including sexual violence—committed by or against students, university employees, and visitors to campus. As outlined in university policy, sexual harassment, dating violence, domestic violence, sexual assault, and stalking are considered forms of "Sexual Misconduct" prohibited by the university. University policy requires all university employees in a teaching, managerial, or supervisory role to report all incidents of Sexual Misconduct that come to their attention in any way, including but not limited to face-to-face conversations, a written class assignment or paper, class discussion, email, text, or social media post. Incidents of Sexual Misconduct should be reported to the Title IX Coordinator at t9coordinator@byu.edu or (801) 422-8692. Reports may also be submitted through EthicsPoint at https://titleix.byu.edu/report or 1-888-238-1062 (24-hours a day). BYU offers confidential resources for those affected by Sexual Misconduct, including the university’s Victim Advocate, as well as a number of non-confidential resources and services that may be helpful. Additional information about Title IX, the university’s Sexual Misconduct Policy, reporting requirements, and resources can be found at http://titleix.byu.edu or by contacting the university’s Title IX Coordinator. 211 | 212 | ### Student Disability 213 | Brigham Young University is committed to providing a working and learning atmosphere that reasonably accommodates qualified persons with disabilities. If you have any disability which may impair your ability to complete this course successfully, please contact the Services for Students with Disabilities Office (422-2767). Reasonable academic accommodations are reviewed for all students who have qualified, documented disabilities. Services are coordinated with the student and instructor by the SSD Office. If you need assistance or if you feel you have been unlawfully discriminated against on the basis of disability, you may seek resolution through established grievance policy and procedures by contacting the Equal Employment Office at 422-5895,D-285 ASB. 214 | -------------------------------------------------------------------------------- /.ipynb_checkpoints/Untitled-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [], 3 | "metadata": {}, 4 | "nbformat": 4, 5 | "nbformat_minor": 2 6 | } 7 | -------------------------------------------------------------------------------- /Syllabus.md: -------------------------------------------------------------------------------- 1 | **Applied Machine Learning Syllabus** 2 | 3 | Economics 213R 4 | 5 | Professor: Tyler Folkman 6 | 7 | 8 | ## Contact Information 9 | 10 | I have created a slack group that we will use for most communication. I will provide more information on how to join and use slack when the course starts. 11 | 12 | You can also always reach me at tylerfolkman at gmail 13 | 14 | 15 | ## Office Hours 16 | 17 | To be determined. I will have an hour twice a week for office hours, but I want to discuss with the class the best times. 
My office hours will most likely be remote using Zoom video conference. 18 | 19 | 20 | ## Grading 21 | 22 | Final Project: 30% 23 | 24 | Homeworks: 50% 25 | 26 | Participation: 10% 27 | 28 | Quizzes: 10% 29 | 30 | Grading will be done on a curve and not upon absolute thresholds. 31 | 32 | 33 | ## Class Material 34 | 35 | I will be leveraging material from many sources as well as my own content. Below is a listed of sources I used. Many are freely (and legally) available online. Not required to purchase any, though, I find [1] to be quite good as well as [4]. [4] is available for free online. 36 | 37 | 1. Hands-On Machine Learning with Scikit-Learn and Tensorflow 38 | 2. Data Science From Scratch 39 | 3. Python for Data Analysis 40 | 4. Introduction to Statistical Learning 41 | 5. Coursera: Machine Learning 42 | 6. Fastai course 1 43 | 7. https://distill.pub/2016/misread-tsne/ 44 | 45 | 46 | ## Course Description 47 | 48 | The amount of data being created in the world is staggering. More data has been created in the past two years than in the entire previous history of the human race and companies are taking notice. Some of the largest companies in the world - Google, Microsoft, Facebook, and Amazon to name a few - are leveraging these data to create valuable products. Amazon's Alexa, Google's search, Facebook's timeline are all powered by machine learning. These same companies are also scrambling to get as much machine learning talent as they can. 49 | 50 | This course will teach you the fundamental building blocks of machine learning. We will go from how to gather and clean data to writing Python code to create and evaluate our own models. We will cover enough theory to ensure an understanding of machine learning models, but the focus is on the application of machine learning. To that end, we will also cover machine learning in industry and how to avoid some of the pitfalls that cause machine learning initiatives to fail. 51 | 52 | The course will use Python heavily and while not a pre-requisite any previous experience with coding and/or Python will be very helpful. 53 | 54 | 55 | ## Prerequisites 56 | 57 | Econ 110 and Econ 378 58 | 59 | 60 | ## Tentative Outline 61 | 62 | ### Introduction to Machine Learning 63 | 64 | 1. What, why, how, and challenges 65 | 2. Differences between stats, ML, and econometrics 66 | 67 | References: [1] Chapter 1 68 | 69 | 70 | ### Introduction to Python 71 | 72 | 1. Functions, Lists, Dictionaries, Counters, Sets, List Comprehensions, Generators 73 | 2. Data science stack: numpy, pandas, scikit-learn, scipy, jupyter notebooks 74 | 3. Git and github 75 | 76 | References: [2] Chapter 2, [3] All Chapters 77 | 78 | 79 | ### How to get and clean data 80 | 81 | 1. Loading from text, SQL, and web pages 82 | 2. Cleaning data: missing, outliers, scaling data, reformatting 83 | 84 | References: [3] Chapters 5-7, [2] Chapters 9-10 and 23, [1] Chapter 2 85 | 86 | 87 | ### How to visualize and describe data 88 | 89 | 1. Data description techniques: mean, median, percentiles, correlations 90 | 2. Data visualization techniques: various plots, Bokeh 91 | 92 | References: [1] Chapter 2, [2] Chapters 3 and 5, [3] Chapters 5, 8 and 9 93 | 94 | 95 | ### Regression 96 | 97 | 1. Linear Regression 98 | 2. Logistic Regression 99 | 3. Regularized Regression 100 | 101 | References: [2] Chapters 14-16, [1] Chapter 4, [4] Chapter 3, [5] Weeks 1-3 102 | 103 | 104 | ### Gradient Descent 105 | 106 | 1. 
Cost functions, learning rates, and gradients 107 | 108 | References: [2] Chapter 8, [1] Chapter 4 109 | 110 | 111 | ### How to evaluate and tune models 112 | 113 | 1. Training/Testing and Cross validation 114 | 2. MSE 115 | 3. Confusion matrix, precision and recall, ROC 116 | 4. Learning Curves 117 | 5. Bias and variance / over-fitting under-fitting 118 | 6. Tuning hyperparameters 119 | 7. Feature importances 120 | 121 | References: [2] Chapter 11, [1] Chapter 2 and 3, [4] Chapter 2, 5, [5] Week 6 122 | 123 | 124 | ### Classification Models 125 | 126 | 1. Naive Bayes, K-nearest neighbors 127 | 128 | References: [2] Chapters 12-13, [4] Chapter 4 129 | 130 | 131 | ### SVMs 132 | 133 | 1. Max-margin, kernel trick, more than 2 classes 134 | 135 | References: [1] Chapter 5, [4] Chapter 9, [5] Week 7 136 | 137 | 138 | ### Trees and Ensembles 139 | 140 | 1. Decision Trees, Random Forests, and Gradient Boosted Trees 141 | 2. XGBoost 142 | 143 | References: [2] Chapter 17, [1] Chapters 6-7, [4] Chapter 8 144 | 145 | 146 | ### Dimensionality Reduction 147 | 148 | 1. Curse of dimensionality 149 | 2. PCA and SVD 150 | 3. T-SNE 151 | 152 | References: [1] Chapter 8, [7]. [4] Chapter 6, 10, [5] Week 8 153 | 154 | 155 | ### Clustering 156 | 157 | 1. K-means, Hierarchical, DBScan 158 | 159 | References: [2] Chapter 19, [4] Chapter 10, [5] Week 8 160 | 161 | 162 | ### Recommender Systems 163 | 164 | 1. Collaborative filtering and deep collaborative filtering 165 | 166 | References: [2] Chapter 22, [5] Week 9 167 | 168 | 169 | ### Data Science at Scale 170 | 171 | 1. Parallelism - dask and blaze 172 | 2. Spark and Mapreduce 173 | 3. AWS 174 | 4. Online learning and stochastic gradient descent 175 | 176 | References: [5] Week 10, [2] Chapter 24 177 | 178 | 179 | ### Deep Learning 180 | 181 | 1. Perceptron, simple networks, and backprop 182 | 2. GPUs 183 | 3. Tensorflow, Keras, Pytorch 184 | 4. CNNs 185 | 5. RNNs and word2vec 186 | 187 | References: [2] Chapter 18, [1] Chapters 9-16, [5] Weeks 4-5, [6] All Lessons 188 | 189 | 190 | ### Deploying Machine Learning 191 | 192 | 1. Docker 193 | 2. Flask and REST Endpoints 194 | 195 | 196 | ### Machine Learning in Industry 197 | 198 | 1. Guest lecturers / companies 199 | 2. Things Kaggle challenges won’t teach you 200 | 3. Why most data science initiatives fail 201 | 4. Importance of communication and collaboration 202 | 203 | 204 | ## University 205 | 206 | ### Honor Code 207 | In keeping with the principles of the BYU Honor Code, students are expected to be honest in all of their academic work. Academic honesty means, most fundamentally, that any work you present as your own must in fact be your own work and not that of another. Violations of this principle may result in a failing grade in the course and additional disciplinary action by the university. Students are also expected to adhere to the Dress and Grooming Standards. Adherence demonstrates respect for yourself and others and ensures an effective learning and working environment. It is the university's expectation, and my own expectation in class, that each student will abide by all Honor Code standards. Please call the Honor Code Office at 422-2847 if you have questions about those standards. 208 | 209 | ### Preventing & Responding to Sexual Misconduct 210 | In accordance with Title IX of the Education Amendments of 1972, Brigham Young University prohibits unlawful sex discrimination against any participant in its education programs or activities. 
The university also prohibits sexual harassment—including sexual violence—committed by or against students, university employees, and visitors to campus. As outlined in university policy, sexual harassment, dating violence, domestic violence, sexual assault, and stalking are considered forms of "Sexual Misconduct" prohibited by the university. University policy requires all university employees in a teaching, managerial, or supervisory role to report all incidents of Sexual Misconduct that come to their attention in any way, including but not limited to face-to-face conversations, a written class assignment or paper, class discussion, email, text, or social media post. Incidents of Sexual Misconduct should be reported to the Title IX Coordinator at t9coordinator@byu.edu or (801) 422-8692. Reports may also be submitted through EthicsPoint at https://titleix.byu.edu/report or 1-888-238-1062 (24-hours a day). BYU offers confidential resources for those affected by Sexual Misconduct, including the university’s Victim Advocate, as well as a number of non-confidential resources and services that may be helpful. Additional information about Title IX, the university’s Sexual Misconduct Policy, reporting requirements, and resources can be found at http://titleix.byu.edu or by contacting the university’s Title IX Coordinator. 211 | 212 | ### Student Disability 213 | Brigham Young University is committed to providing a working and learning atmosphere that reasonably accommodates qualified persons with disabilities. If you have any disability which may impair your ability to complete this course successfully, please contact the Services for Students with Disabilities Office (422-2767). Reasonable academic accommodations are reviewed for all students who have qualified, documented disabilities. Services are coordinated with the student and instructor by the SSD Office. If you need assistance or if you feel you have been unlawfully discriminated against on the basis of disability, you may seek resolution through established grievance policy and procedures by contacting the Equal Employment Office at 422-5895,D-285 ASB. 
214 | -------------------------------------------------------------------------------- /Syllabus.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/Syllabus.pdf -------------------------------------------------------------------------------- /cheat_sheets/git/figures/clone_screen.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/cheat_sheets/git/figures/clone_screen.png -------------------------------------------------------------------------------- /cheat_sheets/git/figures/create_branch_website.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/cheat_sheets/git/figures/create_branch_website.png -------------------------------------------------------------------------------- /cheat_sheets/git/figures/download.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/cheat_sheets/git/figures/download.png -------------------------------------------------------------------------------- /cheat_sheets/git/figures/example_notebook.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/cheat_sheets/git/figures/example_notebook.png -------------------------------------------------------------------------------- /cheat_sheets/git/figures/fork.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/cheat_sheets/git/figures/fork.png -------------------------------------------------------------------------------- /cheat_sheets/git/figures/git_commit.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/cheat_sheets/git/figures/git_commit.png -------------------------------------------------------------------------------- /cheat_sheets/git/figures/git_status.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/cheat_sheets/git/figures/git_status.png -------------------------------------------------------------------------------- /cheat_sheets/git/figures/init_profile.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/cheat_sheets/git/figures/init_profile.png -------------------------------------------------------------------------------- /cheat_sheets/git/figures/init_repo.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/cheat_sheets/git/figures/init_repo.png -------------------------------------------------------------------------------- /cheat_sheets/git/figures/jupyter.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/cheat_sheets/git/figures/jupyter.png -------------------------------------------------------------------------------- /cheat_sheets/git/figures/jupyter_dir.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/cheat_sheets/git/figures/jupyter_dir.png -------------------------------------------------------------------------------- /cheat_sheets/git/figures/jupyter_dropdown.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/cheat_sheets/git/figures/jupyter_dropdown.png -------------------------------------------------------------------------------- /cheat_sheets/git/figures/jupyter_shutdown.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/cheat_sheets/git/figures/jupyter_shutdown.png -------------------------------------------------------------------------------- /cheat_sheets/git/figures/new_repo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/cheat_sheets/git/figures/new_repo.png -------------------------------------------------------------------------------- /cheat_sheets/git/figures/open_pull_request.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/cheat_sheets/git/figures/open_pull_request.png -------------------------------------------------------------------------------- /cheat_sheets/git/figures/open_pull_request2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/cheat_sheets/git/figures/open_pull_request2.png -------------------------------------------------------------------------------- /cheat_sheets/git/figures/pull_request.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/cheat_sheets/git/figures/pull_request.png -------------------------------------------------------------------------------- /cheat_sheets/git/figures/pull_request2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/cheat_sheets/git/figures/pull_request2.png -------------------------------------------------------------------------------- 
/cheat_sheets/git/figures/rename_notebook.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/cheat_sheets/git/figures/rename_notebook.png -------------------------------------------------------------------------------- /cheat_sheets/git/figures/repo_screen.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/cheat_sheets/git/figures/repo_screen.png -------------------------------------------------------------------------------- /cheat_sheets/git/figures/updated_repo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/cheat_sheets/git/figures/updated_repo.png -------------------------------------------------------------------------------- /cheat_sheets/git/git_cheat_sheet.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/cheat_sheets/git/git_cheat_sheet.pdf -------------------------------------------------------------------------------- /cheat_sheets/git/git_cheat_sheet.tex: -------------------------------------------------------------------------------- 1 | \documentclass[11pt,a4paper]{article} 2 | 3 | \usepackage[utf8]{inputenc} 4 | \usepackage{url} 5 | \usepackage[margin=1in]{geometry} 6 | \usepackage{graphicx} 7 | 8 | \title{GitHub Cheat Sheet} 9 | \date{} 10 | \author{Econ 213R - Applied Machine Learning} 11 | 12 | \begin{document} 13 | \vspace*{-75pt} 14 | {\let\newpage\relax\maketitle} 15 | 16 | \tableofcontents 17 | 18 | \section{What is GitHub?} 19 | GitHub is a place where developers can store projects and collaborate with other developers to create and update code. 20 | Below is a list of terms you should know before you get started with GitHub. 21 | 22 | \begin{itemize} 23 | \item \textbf{repository}: place where all folders and files for a project are stored 24 | \item \textbf{push}: update changes to a repository 25 | \item \textbf{pull}: retrieve and merge changes to a repository 26 | \item \textbf{commit}: an individual revision to a file or a set of files 27 | \item \textbf{branch}: a parallel version of the repository contained within the repository but not affecting the primary branch (which is most often the master branch) 28 | \item \textbf{clone}: a copy of a repository on your local machine 29 | \item \textbf{fork}: a personal copy of another user's repository 30 | \end{itemize} 31 | 32 | Some of these definitions came from \url{https://help.github.com/articles/github-glossary/}, a useful guide for many GitHub terms. 33 | 34 | \section{GitHub Registration} 35 | You need an account with GitHub. 36 | Go to \url{www.github.com}. 37 | Sign up from the home page and follow the instructions. 38 | Note that private repositories require monthly payments; however, you can have an unlimited number of public repositories for free. 39 | Once you sign up, you are ready to create and fork repositories. 40 | 41 | \section{Creating Repositories} 42 | Navigate to your GitHub profile. 43 | It should look something like Figure \ref{fig:init-prof}. 44 | 45 | \begin{figure}[h!] 
46 | \centering 47 | \includegraphics[width=.7\textwidth]{figures/init_profile.png} 48 | \caption{Profile Screen} 49 | \label{fig:init-prof} 50 | \end{figure} 51 | 52 | Click the ``Repositories" tab. 53 | Click the button that says ``New" (see Figure \ref{fig:new-repo}). 54 | 55 | \begin{figure}[h!] 56 | \centering 57 | \includegraphics[width=.7\textwidth]{figures/new_repo.png} 58 | \caption{New Repository} 59 | \label{fig:new-repo} 60 | \end{figure} 61 | 62 | Name your repository. 63 | It is customary to use dashes instead of underscores in repository names. 64 | Make sure to initialize the repository with a README.md (see Figure \ref{fig:init-repo}). 65 | If you accidentally forget to do this, that's okay; you can always add a README later. 66 | A README is a short description of the repository. 67 | 68 | \begin{figure}[h!] 69 | \centering 70 | \includegraphics[width=.7\textwidth]{figures/init_repo.png} 71 | \caption{Initialize the repository.} 72 | \label{fig:init-repo} 73 | \end{figure} 74 | 75 | Click ``Create Repository" at the bottom of the page. 76 | You should see a page that looks like Figure \ref{fig:repo-screen}. 77 | 78 | \begin{figure}[h!] 79 | \centering 80 | \includegraphics[width=.7\textwidth]{figures/repo_screen.png} 81 | \caption{Repository screen.} 82 | \label{fig:repo-screen} 83 | \end{figure} 84 | 85 | Congratulations! 86 | You have just created your first GitHub repository. 87 | 88 | \section{Making Changes to Repositories} 89 | There are a few ways to upload and change files in a GitHub repository. 90 | The most common way to modify repositories is via the command line; see Section \ref{command-line} for command line instructions. 91 | It is highly, \textit{highly} recommended that you learn the command line to modify repositories. 92 | The command line allows you to more freedom to modify your repositories and branches, and can streamline your workflow. 93 | However, you can use the GitHub website to upload and download files to and from repositories. 94 | You can also modify files using the website. 95 | Note that you can't run Jupyter notebooks on the GitHub website, and when you try to edit Jupyter notebooks on the website, you have to comb through lots of nasty JSON lingo. 96 | If you choose to use the website to make changes to the GitHub repository, you will have to download the Jupyter notebook, modify it on a local machine, and upload it back to the repository. 97 | Instructions for using the website are in Section \ref{website}. 98 | 99 | If you do not have Git installed on your local machine, use the links below to download Git. 100 | \begin{itemize} 101 | \item[] Macs: \url{https://sourceforge.net/projects/git-osx-installer/files/} 102 | \item[] Windows: \url{http://gitforwindows.org/} 103 | \item[] Linux: In the Terminal, first type \texttt{sudo apt-get update} and hit Enter. Then type \texttt{sudo apt-get install git} and hit Enter. 104 | \end{itemize} 105 | 106 | \subsection{Using Command Line} \label{command-line} 107 | Let's walk through an example together. 108 | In this example, we will configure Git on a local machine, clone the repository to a local machine, create a Jupyter notebook, add it to the repository, and push the changes. 109 | 110 | First, set your username and email in Git. 111 | Use the commands below to do this. 
112 | 113 | \begin{itemize} 114 | \item[] \texttt{git config --global user.name "Your Name Here"} 115 | \item[] \texttt{git config --global user.email "your\_email@website.com"} 116 | \end{itemize} 117 | 118 | Now Git is configured on your machine, and you are ready to clone the repository you made. 119 | First, go to the repository screen on the GitHub website (Figure \ref{fig:repo-screen}). 120 | Click the button that says ``Clone or download". 121 | A drop-down window should appear that contains a link to your repository (see Figure \ref{fig:clone-screen}). 122 | Copy the link in the drop-down window. 123 | 124 | \begin{figure}[h!] 125 | \centering 126 | \includegraphics[width=.7\textwidth]{figures/clone_screen.png} 127 | \caption{Jupyter notebook screen.} 128 | \label{fig:clone-screen} 129 | \end{figure} 130 | 131 | Now open your command line. 132 | Navigate to the location where you want your repository to be (for example, I store all of my files on my desktop, so I would navigate to my desktop from my command line). 133 | Once you have navigated to the location, type the following command: \texttt{git clone link\_to\_your\_repo}, where \texttt{link\_to\_your\_repo} is the link you copied above.. 134 | When your computer is done downloading the repository, you should see a folder with the repository name in the location where you wanted it. 135 | Now you are ready to make some changes and push them to your master branch. 136 | 137 | \subsubsection{Making Changes} 138 | For this example, we will make a Jupyter notebook and push it to the master branch. 139 | In the command line, navigate to your repository if you're not there already. 140 | Type \texttt{jupyter notebook} and hit enter. 141 | A web browser should open up with a page that looks like Figure \ref{fig:jupyter-dir}. 142 | If a web browser did not automatically open, the command line should have provided a URL that you can copy and past directly into a browser to access the Jupyter notebook. 143 | 144 | \begin{figure}[h!] 145 | \centering 146 | \includegraphics[width=.6\textwidth]{figures/jupyter_dir.png} 147 | \caption{Jupyter notebook main screen.} 148 | \label{fig:jupyter-dir} 149 | \end{figure} 150 | 151 | In the upper right-hand corner, click ``New". 152 | A dropdown menu will appear. 153 | Under the ``Notebooks" heading, click on Python 3 (see Figure \ref{fig:jupyter-dropdown}). 154 | When you click ``Python 3", a new window will open in your browser with a screen that looks like Figure \ref{fig:jupyter}. 155 | 156 | \begin{figure}[h!] 157 | \centering 158 | \includegraphics[width=.2\textwidth]{figures/jupyter_dropdown.png} 159 | \caption{Create a Python 3 Jupyter notebook.} 160 | \label{fig:jupyter-dropdown} 161 | \end{figure} 162 | 163 | \begin{figure}[h!] 164 | \centering 165 | \includegraphics[width=.6\textwidth]{figures/jupyter.png} 166 | \caption{A Jupyter notebook.} 167 | \label{fig:jupyter} 168 | \end{figure} 169 | 170 | Change the name of the notebook from``Untitled" to ``example". 171 | To do this, click the ``Untitled" text at the top of the notebook. 172 | A window will appear; replace the ``Untitled" text with ``example". 173 | See Figure \ref{fig:rename-notebook}. 174 | 175 | \begin{figure}[h!] 176 | \centering 177 | \includegraphics[width=.7\textwidth]{figures/rename_notebook.png} 178 | \caption{Rename the notebook.} 179 | \label{fig:rename-notebook} 180 | \end{figure} 181 | 182 | In the first cell, type \texttt{print("Hello World")}. 183 | Then press Shift+Enter. 184 | This key combination runs the cell. 
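If the cell ran correctly, the text \texttt{Hello World} is printed directly below it.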
185 | A new cell will automatically appear below the cell with the print statement. 186 | To save the notebook, hit Command+S. 187 | Your notebook should now look like Figure \ref{fig:example-notebook}. 188 | 189 | \begin{figure}[h!] 190 | \centering 191 | \includegraphics[width=.7\textwidth]{figures/example_notebook.png} 192 | \caption{A Jupyter notebook.} 193 | \label{fig:example-notebook} 194 | \end{figure} 195 | 196 | Now navigate to the command line where you launched Jupyter. 197 | To stop the Jupyter server from running, hit Ctrl+c twice. 198 | This will shut down the Jupyter notebook. 199 | If you didn't exit the browser window yet, it will look like Figure \ref{fig:jupyter-shutdown} after you shut down the Jupyter server. 200 | You are safe to close all browser windows related to this Jupyter notebook. 201 | 202 | \begin{figure}[h!] 203 | \centering 204 | \includegraphics[width=.7\textwidth]{figures/jupyter_shutdown.png} 205 | \caption{Shutting down a Jupyter notebook.} 206 | \label{fig:jupyter-shutdown} 207 | \end{figure} 208 | 209 | \subsubsection{Committing Changes} 210 | In the command prompt, navigate to the repository, if you aren't already there. 211 | Type \texttt{git status} and hit enter. 212 | You should see a message that looks similar to Figure \ref{fig:git-status}. 213 | 214 | \begin{figure}[h!] 215 | \centering 216 | \includegraphics[width=.7\textwidth]{figures/git_status.png} 217 | \caption{\texttt{git status}} 218 | \label{fig:git-status} 219 | \end{figure} 220 | 221 | You want to add the notebook you just created to the Git index so Git knows to keep track of this file and any future changes made to it. 222 | To do this, type \texttt{git add example.ipynb} and hit enter. 223 | Now we are going to commit this change to the branch. 224 | Then, type \texttt{git commit -m "example notebook"}. 225 | A message that looks like Figure \ref{fig:git-commit} will appear. 226 | Note that you can type any descriptive message you want in the quotes. 227 | These messages are to help you keep track of the changes you make in each commit. 228 | Now we want to push these changes up the master branch, so that any other machines with this repository can pull the changes you made. 229 | Type \texttt{git push origin master} and hit enter. 230 | Congratulations! 231 | You have just pushed your first change to your GitHub repository. 232 | If you navigate to the web page of your repository, you should see the example.ipynb file there, as in Figure \ref{fig:updated-repo}. 233 | 234 | \begin{figure}[h!] 235 | \centering 236 | \includegraphics[width=.4\textwidth]{figures/git_commit.png} 237 | \caption{\texttt{git commit}} 238 | \label{fig:git-commit} 239 | \end{figure} 240 | 241 | \begin{figure}[h!] 242 | \centering 243 | \includegraphics[width=.7\textwidth]{figures/updated_repo.png} 244 | \caption{Updated Repository} 245 | \label{fig:updated-repo} 246 | \end{figure} 247 | 248 | \subsubsection{Pulling Changes} 249 | If you ever push a change to a repository branch on one machine, and you have the repository branch on another machine, you can still access the changes you made elsewhere. 250 | Type \texttt{git pull origin branch\_name} to retrieve changes you have made to the branch elsewhere. 251 | 252 | \subsection{Using the Website} \label{website} 253 | \subsubsection{Downloading Files} 254 | To download a file from the website of your repository, find the file you wish to download, and click on its name. 255 | Click ``Raw" (Figure \ref{fig:download}). 
256 | This will take you to a new web page with the raw version of the file you want to download (for example, if you are trying to download a Jupyter notebook, you will see file in JSON format). 257 | Control-click on this web page and select ``Save As...". 258 | You can save the file anywhere you'd like on your local machine. 259 | 260 | \begin{figure}[h!] 261 | \centering 262 | \includegraphics[width=.7\textwidth]{figures/download.png} 263 | \caption{Download files from the website.} 264 | \label{fig:download} 265 | \end{figure} 266 | 267 | \subsubsection{Uploading Files} 268 | To upload files to your repository using the website, navigate to the main repository page and click ``Upload Files". 269 | You can drag and drop the files you wish to upload or use a navigation tool to find them on your local machine. 270 | Then you have two options: you can either commit those changes directly to the branch you are in, or you can commit those changes to a new branch. 271 | 272 | \section{Forking Repositories} 273 | Navigate to the class repository. 274 | The link is \url{https://github.com/tfolkman/byu_econ_applied_machine_learning}. 275 | Click the ``Fork" button in the upper-left corner (Figure \ref{fig:fork}). 276 | Now you have forked the repository, and it will show up in your list of repositories. 277 | 278 | \begin{figure}[h!] 279 | \centering 280 | \includegraphics[width=.7\textwidth]{figures/fork.png} 281 | \caption{Fork a Repository} 282 | \label{fig:fork} 283 | \end{figure} 284 | 285 | \subsection{Updating a Fork} 286 | 287 | Suppose a contributor of the original repository (the one from which you forked) makes a change, and you want to update your fork with that change. 288 | To make sure you are tracking the correct remote branch, type \texttt{git remote add upstream https://github.com/tfolkman/byu\_econ\_applied\_machine\_learning.git} in the command line. 289 | This adds a remote (a pointer or link) to the class repository called \texttt{upstream}. 290 | To update your fork's master branch, use the command \texttt{git pull upstream master} (make sure you are on the master branch first!). 291 | Now your forked repository should reflect the changes made in the original repository. 292 | 293 | \section{Deleting Repositories} 294 | \textbf{PROCEED WITH CAUTION.} 295 | This action cannot be undone. 296 | 297 | To delete a repository, first navigate to the repository's main page. 298 | Click ``Settings". 299 | Scroll down to the bottom of the screen until you reach the ``Danger Zone". 300 | Click ``Delete the repository". 301 | You will have to type the name of the repository to verify that you want to completely delete it. 302 | Then you can erase it from your GitHub. 303 | 304 | \section{Appendix} 305 | 306 | \subsection{List of Git Commands} \label{git-commands} 307 | 308 | \begin{itemize} 309 | \item \texttt{git clone repository\_link}: Clone the repository to a local machine. 310 | \item \texttt{git status}: Display the current status of the Git index. 311 | \item \texttt{git add filename}: Add filename to Git index to be tracked. 312 | \item \texttt{git branch}: Check which branch you are currently on 313 | \item \texttt{git commit -m "message"}: Commit the changes to the file or files. 314 | \item \texttt{git push origin branch\_name}: Submit the changes to the branch. 315 | \item \texttt{git pull origin branch\_name}: Retrieve changes to the branch. 316 | \item \texttt{git checkout -b branch\_name}: Create and switch to a new branch. 
317 | \item \texttt{git checkout branch\_name}: Switch to an existing branch. 318 | \item \texttt{git branch -d branch\_name}: Delete an existing branch. 319 | \item \texttt{git remote add upstream repository\_link}: Add a remote to a repository. 320 | \item \texttt{git pull upstream master}: Update whatever branch you're on in your forked repository to reflect the changes in the master branch of the repository you forked. 321 | \end{itemize} 322 | 323 | \subsection{Making New Branches} 324 | 325 | \subsubsection{Command Line} 326 | You might want to make different branches for different tasks. 327 | To switch to a new, nonexistent branch, type \texttt{git checkout -b branch\_name}, where ``branch\_name" is the name you want to give the new branch. 328 | Now you can use the previous steps to add, commit, and push changes. 329 | The only difference is when you push, you will type \texttt{git push origin branch\_name} (instead of pushing to the master branch). 330 | To navigate to an existing branch, type \texttt{git checkout branch\_name}, where ``branch\_name" is the name of the existing branch. 331 | To delete an existing branch, type \texttt{git branch -d branch\_name}. 332 | Be careful with this last command. 333 | 334 | \subsubsection{Using the Website} 335 | To make a new branch on the GitHub website, navigate to the main repository page. 336 | Click ``Branch: master". 337 | A dropdown menu should appear (Figure \ref{fig:create-branch}). 338 | Start typing the name of the branch you wish to create in the text field. 339 | An option will appear that says ``Create branch". 340 | Click it, and you will now be on the main page for that branch. 341 | You can use that dropdown menu to switch between branches. 342 | 343 | \begin{figure}[h!] 344 | \centering 345 | \includegraphics[width=.4\textwidth]{figures/create_branch_website.png} 346 | \caption{Create a Branch} 347 | \label{fig:create-branch} 348 | \end{figure} 349 | 350 | 351 | \subsection{Pull Requests} 352 | If you committed on a branch other than master, and you want to merge these changes into master, you can submit what is called a pull request. 353 | Pull requests are very useful when you collaborate on a project with other people. 354 | Pull requests allow others to see the differences between the changes you made and the branch with which you are requesting to merge. 355 | To submit a pull request, navigate to the main screen of your repository on the GitHub website. 356 | Click the ``Create Pull Request" button (Figure \ref{fig:pull-request}). 357 | 358 | \begin{figure}[h!] 359 | \centering 360 | \includegraphics[width=.7\textwidth]{figures/pull_request.png} 361 | \caption{Pull Request} 362 | \label{fig:pull-request} 363 | \end{figure} 364 | 365 | This will bring you to a new screen that looks like Figure \ref{fig:pull-request2}. 366 | 367 | \begin{figure}[h!] 368 | \centering 369 | \includegraphics[width=.7\textwidth]{figures/pull_request2.png} 370 | \caption{Pull Request} 371 | \label{fig:pull-request2} 372 | \end{figure} 373 | 374 | The base branch is the branch that does not have the changes you made. 375 | The compare branch is the branch with the changes you want merged into the base branch. 376 | For example, if you made a branch for data cleaning called data\_cleaning, and you want those changes to be reflected in the master branch, you would choose the base branch to be master, and the compare branch to be data\_cleaning. 377 | Once you have chosen the appropriate branches, click ``Create Pull Request".
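Note that a branch only shows up in these drop-down menus once it has been pushed to GitHub; if you created the compare branch locally, push it first with, for example, \texttt{git push origin data\_cleaning}.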
378 | Then you will see a page that looks like Figure \ref{fig:open-pull-request}. 379 | 380 | \begin{figure}[h!] 381 | \centering 382 | \includegraphics[width=.7\textwidth]{figures/open_pull_request.png} 383 | \caption{Pull Request} 384 | \label{fig:open-pull-request} 385 | \end{figure} 386 | 387 | In the ``Write" section, it is helpful if you describe what you changed. 388 | In a collaboration project with dozens of files and thousands of lines of code, it is helpful to other reviewers to see an outline of what you changed at a glance. 389 | If you scroll down, you will see sections that describe what has been changed and what is different between the two branches. 390 | You can request reviewers, and you and other collaborators can make comments about changes. 391 | When you are ready to make the pull request official, click ``Create Pull Request". 392 | After clicking this, you will see a web page that looks like Figure \ref{fig:open-pull-request2}. 393 | 394 | \begin{figure}[h!] 395 | \centering 396 | \includegraphics[width=.7\textwidth]{figures/open_pull_request2.png} 397 | \caption{Pull Request} 398 | \label{fig:open-pull-request2} 399 | \end{figure} 400 | 401 | When you and/or your collaborators have reviewed your changes and are ready to merge them to the base branch (and the compare branch has no conflicts with the base branch), click ``Merge Pull Request". 402 | After the branches are merged, you are safe to delete the compare branch. 403 | 404 | \end{document} 405 | -------------------------------------------------------------------------------- /homeworks/.ipynb_checkpoints/Final Project-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Final Project\n", 8 | "\n", 9 | "The final project is your opportunity to do your own data project end to end and present the results to some local companies.\n", 10 | "\n", 11 | "### Rubric\n", 12 | "\n", 13 | "Code Quality - 20%\n", 14 | "\n", 15 | "Describing, cleaning, and visualizing data - 25%\n", 16 | "\n", 17 | "Modeling - 25%\n", 18 | "\n", 19 | "Presentation / Storytelling - 30%" 20 | ] 21 | } 22 | ], 23 | "metadata": { 24 | "kernelspec": { 25 | "display_name": "Python 3", 26 | "language": "python", 27 | "name": "python3" 28 | }, 29 | "language_info": { 30 | "codemirror_mode": { 31 | "name": "ipython", 32 | "version": 3 33 | }, 34 | "file_extension": ".py", 35 | "mimetype": "text/x-python", 36 | "name": "python", 37 | "nbconvert_exporter": "python", 38 | "pygments_lexer": "ipython3", 39 | "version": "3.5.3" 40 | } 41 | }, 42 | "nbformat": 4, 43 | "nbformat_minor": 2 44 | } 45 | -------------------------------------------------------------------------------- /homeworks/.ipynb_checkpoints/Homework_1-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Data Cleaning, Describing, and Visualization\n", 8 | "\n", 9 | "### Step 1 - Get your environment setup\n", 10 | "\n", 11 | "1. Install Git on your computer and fork the class repository on [Github](https://github.com/tfolkman/byu_econ_applied_machine_learning).\n", 12 | "2. Install [Anaconda](https://conda.io/docs/install/quick.html) and get it working.\n", 13 | "\n", 14 | "### Step 2 - Explore Datasets\n", 15 | "\n", 16 | "The goals of this project are:\n", 17 | "\n", 18 | "1. 
Read in data from multiple sources\n", 19 | "2. Gain practice cleaning, describing, and visualizing data\n", 20 | "\n", 21 | "To this end, you need to find from three different sources. For example: CSV, JSON, and API, SQL, or web scraping. For each of these data sets, you must perform the following:\n", 22 | "\n", 23 | "1. Data cleaning. Some options your might consider: handle missing data, handle outliers, scale the data, convert some data to categorical.\n", 24 | "2. Describe data. Provide tables, statistics, and summaries of your data.\n", 25 | "3. Visualize data. Provide visualizations of your data.\n", 26 | "\n", 27 | "These are the typical first steps of any data science project and are often the most time consuming. My hope is that in going through this process 3 different times, that you will gain a sense for it.\n", 28 | "\n", 29 | "Also, as you are doing this, please tell us a story. Explain in your notebook why are doing what you are doing and to what end. Telling a story in your analysis is a crucial skill for data scientists. There are almost an infinite amount of ways to analyze a data set; help us understand why you choose your particular path and why we should care.\n", 30 | "\n", 31 | "Also - this homework is very open-ended and we provided you with basically no starting point. I realize this increases the difficulty and complexity, but I think it is worth it. It is much closer to what you might experience in industry and allows you to find data that might excite you!\n", 32 | "\n", 33 | "### Grading\n", 34 | "\n", 35 | "This homework is due **Jan. 31, 2019 by 4:00pm Utah time.** By that time, you need to have committed all your code to your github and submitted a link to your work to the TA. We can see on your Github account when you last committed code. :)\n", 36 | "\n", 37 | "Rubric:\n", 38 | "\n", 39 | "* Code Quality - 10%\n", 40 | "* Ability to read in data - 10%\n", 41 | "* Ability to describe data - 20%\n", 42 | "* Ability to visualize data - 20%\n", 43 | "* Ability to clean data - 20%\n", 44 | "* Storytelling - 20%" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": null, 50 | "metadata": {}, 51 | "outputs": [], 52 | "source": [] 53 | } 54 | ], 55 | "metadata": { 56 | "kernelspec": { 57 | "display_name": "Python 3", 58 | "language": "python", 59 | "name": "python3" 60 | }, 61 | "language_info": { 62 | "codemirror_mode": { 63 | "name": "ipython", 64 | "version": 3 65 | }, 66 | "file_extension": ".py", 67 | "mimetype": "text/x-python", 68 | "name": "python", 69 | "nbconvert_exporter": "python", 70 | "pygments_lexer": "ipython3", 71 | "version": "3.6.6" 72 | } 73 | }, 74 | "nbformat": 4, 75 | "nbformat_minor": 2 76 | } 77 | -------------------------------------------------------------------------------- /homeworks/.ipynb_checkpoints/Homework_2-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Introduction to Modeling with Python\n", 8 | "\n", 9 | "Now that we have seen some examples of modeling and using Python for modeling, we wanted to give you a chance to try your hand!\n", 10 | "\n", 11 | "To that goal, we choose a well structured problem with plenty of resources online to help you along the way. 
That problem is predicting housing prices and is hosted on Kaggle:\n", 12 | "\n", 13 | "https://www.kaggle.com/c/house-prices-advanced-regression-techniques\n", 14 | "\n", 15 | "First, make sure you are signed up on Kaggle and then download the data:\n", 16 | "\n", 17 | "https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data\n", 18 | "\n", 19 | "The data includes both testing and training sets as well as a sample submission file. \n", 20 | "\n", 21 | "Your goal is the predict the sales price for each house where root mean squared error is the evaluation metric. To get some ideas on where to start, feel free to check out Kaggle Kernels:\n", 22 | "\n", 23 | "https://www.kaggle.com/c/house-prices-advanced-regression-techniques/kernels\n", 24 | "\n", 25 | "And the discussion board:\n", 26 | "\n", 27 | "https://www.kaggle.com/c/house-prices-advanced-regression-techniques/discussion\n", 28 | "\n", 29 | "Again - the goal of this homework is to get you exposed to modeling with Python. Feel free to use online resources to help guide you, but we expect original thought as well. Our hope is by the end of this homework you will feel comfortable exploring data in Python and building models to make predictions. Also please submit your test results to Kaggle and let us know your ranking and score!\n", 30 | "\n", 31 | "\n", 32 | "### Grading\n", 33 | "\n", 34 | "This homework is due **Feb. 21, 2019 by 4:00pm Utah time.** By that time, you need to have committed all your code to your github and submitted a link to your work to the TA. We can see on your Github account when you last committed code. :)\n", 35 | "\n", 36 | "Rubric:\n", 37 | "\n", 38 | "* Code Quality - 10%\n", 39 | "* Storytelling - 10%\n", 40 | "* Result on Kaggle - 5%\n", 41 | "* Describing, Cleaning, and Visualizing data - 25%\n", 42 | "* Modeling - 50%\n", 43 | "\n", 44 | "More specifically, for modeling we will look for: \n", 45 | "\n", 46 | "* Model Selection: Did you try multiple models? Why did you choose these models? How do they work? What are they assumptions? And how did you test/account for them? How did you select hyper-parameters?\n", 47 | "* Model interpretation: What do the model results tell you? Which variables are important? High bias or variance and how did you / could you fix this? How confident are you in your results? \n", 48 | "* Model usefulness: Do you think your final model was useful? If so, how would you recommend using it? Convince us, that if we were a company, we would feel comfortable using your model with our users. Think about edge cases as well - are there certain areas that the model performs poorly on? Best on? How would you handle these cases, if say Zillow wanted to leverage your model realizing that bad recommendations on sale prices would hurt customer trust and your brand. This section also falls into the storytelling aspect of the grading." 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "metadata": {}, 54 | "source": [ 55 | "## Introduction to Modeling with Python\n", 56 | "\n", 57 | "Now that we have seen some examples of modeling and using Python for modeling, we wanted to give you a chance to try your hand!\n", 58 | "\n", 59 | "To that goal, we choose a well structured problem with plenty of resources online to help you along the way. 
That problem is predicting housing prices and is hosted on Kaggle:\n", 60 | "\n", 61 | "https://www.kaggle.com/c/house-prices-advanced-regression-techniques\n", 62 | "\n", 63 | "First, make sure you are signed up on Kaggle and then download the data:\n", 64 | "\n", 65 | "https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data\n", 66 | "\n", 67 | "The data includes both testing and training sets as well as a sample submission file. \n", 68 | "\n", 69 | "Your goal is the predict the sales price for each house where root mean squared error is the evaluation metric. To get some ideas on where to start, feel free to check out Kaggle Kernels:\n", 70 | "\n", 71 | "https://www.kaggle.com/c/house-prices-advanced-regression-techniques/kernels\n", 72 | "\n", 73 | "And the discussion board:\n", 74 | "\n", 75 | "https://www.kaggle.com/c/house-prices-advanced-regression-techniques/discussion\n", 76 | "\n", 77 | "Again - the goal of this homework is to get you exposed to modeling with Python. Feel free to use online resources to help guide you, but we expect original thought as well. Our hope is by the end of this homework you will feel comfortable exploring data in Python and building models to make predictions. Also please submit your test results to Kaggle and let us know your ranking and score!\n", 78 | "\n", 79 | "\n", 80 | "### Grading\n", 81 | "\n", 82 | "This homework is due **Feb. 20, 2018 by 3:00pm Utah time.** By that time, you need to have committed all your code to your github and submitted a link to your work to the TA. We can see on your Github account when you last committed code. :)\n", 83 | "\n", 84 | "Rubric:\n", 85 | "\n", 86 | "* Code Quality - 10%\n", 87 | "* Storytelling - 10%\n", 88 | "* Result on Kaggle - 5%\n", 89 | "* Describing, Cleaning, and Visualizing data - 25%\n", 90 | "* Modeling - 50%\n", 91 | "\n", 92 | "More specifically, for modeling we will look for: \n", 93 | "\n", 94 | "* Model Selection: Did you try multiple models? Why did you choose these models? How do they work? What are they assumptions? And how did you test/account for them? How did you select hyper-parameters?\n", 95 | "* Model interpretation: What do the model results tell you? Which variables are important? High bias or variance and how did you / could you fix this? How confident are you in your results? \n", 96 | "* Model usefulness: Do you think your final model was useful? If so, how would you recommend using it? Convince us, that if we were a company, we would feel comfortable using your model with our users. Think about edge cases as well - are there certain areas that the model performs poorly on? Best on? How would you handle these cases, if say Zillow wanted to leverage your model realizing that bad recommendations on sale prices would hurt customer trust and your brand. This section also falls into the storytelling aspect of the grading." 
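To make the starting point a little less blank, here is one minimal way to get a first model and a cross-validated RMSE. This is only a sketch of the workflow, not a recommended solution: it assumes the competition's train.csv has been downloaded into the working directory with the usual SalePrice target column, the feature handling is deliberately naive (numeric columns only, median imputation), and modeling the log of the price is just a common choice for this competition.

```python
# A minimal first-pass sketch (assumes train.csv from the Kaggle competition
# is in the working directory and "SalePrice" is the target column).
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

train = pd.read_csv("train.csv")

# Deliberately naive features: numeric columns only, median-imputed.
X = train.drop(columns=["SalePrice"]).select_dtypes(include=[np.number])
X = X.fillna(X.median())
y = np.log1p(train["SalePrice"])  # modeling the log of the price is a common choice here

model = RandomForestRegressor(n_estimators=200, random_state=42)
neg_mse = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5)
print("CV RMSE on the log target: {:.4f}".format(np.sqrt(-neg_mse.mean())))
```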
97 | ] 98 | } 99 | ], 100 | "metadata": { 101 | "kernelspec": { 102 | "display_name": "Python 3", 103 | "language": "python", 104 | "name": "python3" 105 | }, 106 | "language_info": { 107 | "codemirror_mode": { 108 | "name": "ipython", 109 | "version": 3 110 | }, 111 | "file_extension": ".py", 112 | "mimetype": "text/x-python", 113 | "name": "python", 114 | "nbconvert_exporter": "python", 115 | "pygments_lexer": "ipython3", 116 | "version": "3.6.6" 117 | } 118 | }, 119 | "nbformat": 4, 120 | "nbformat_minor": 2 121 | } 122 | -------------------------------------------------------------------------------- /homeworks/.ipynb_checkpoints/Homework_3-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Project Abstract\n", 8 | "\n", 9 | "To help people get started on their project and to make sure you are selecting an appropriate task, we will have all the teams submit an abstract. Please only submit one abstract per team.\n", 10 | "\n", 11 | "The abstract should include (at least):\n", 12 | "\n", 13 | "* Team members\n", 14 | "* Problem statement\n", 15 | "* Data you will use to solve the problem\n", 16 | "* Outline of how you plan on solving the problem with the data. For example, what pre-processing steps might you need to do, what models, etc. \n", 17 | "* Supporting documents if necessary citing past research in the area and methods used to solve the problem.\n", 18 | "\n", 19 | "The goal of this abstract is for you to think deeply about the project you will be undertaking and convince yourself (and us) that it is a meaningful and achievable project for this class.\n", 20 | "\n", 21 | "This homework is due **February 28, 2018 by 4:00 pm Utah time.** and will be submitted on learning suite." 22 | ] 23 | } 24 | ], 25 | "metadata": { 26 | "kernelspec": { 27 | "display_name": "Python 3", 28 | "language": "python", 29 | "name": "python3" 30 | }, 31 | "language_info": { 32 | "codemirror_mode": { 33 | "name": "ipython", 34 | "version": 3 35 | }, 36 | "file_extension": ".py", 37 | "mimetype": "text/x-python", 38 | "name": "python", 39 | "nbconvert_exporter": "python", 40 | "pygments_lexer": "ipython3", 41 | "version": "3.6.6" 42 | } 43 | }, 44 | "nbformat": 4, 45 | "nbformat_minor": 2 46 | } 47 | -------------------------------------------------------------------------------- /homeworks/.ipynb_checkpoints/Homework_4-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Classification with Python\n", 8 | "\n", 9 | "Hopefully now you are feeling a bit more comfortable with Python, Kaggle, and modeling. \n", 10 | "\n", 11 | "This next homework will test your classification abilities. We will be trying to predict the poverty levels of various households in Costa Rica:\n", 12 | "\n", 13 | "https://www.kaggle.com/c/costa-rican-household-poverty-prediction\n", 14 | "\n", 15 | "The evalution metric for Kaggle is accuracy, but please also explore how well your model does on multiple metrics like F1, precision, recall, and area under the ROC curve.\n", 16 | "\n", 17 | "### Grading\n", 18 | "\n", 19 | "This homework is due **March 21, 2019 by 4:00pm Utah time.** By that time, you need to have committed all your code to your github and submitted a link to your work to the TA. 
We can see on your Github account when you last committed code. :)\n", 20 | "\n", 21 | "Rubric:\n", 22 | "\n", 23 | "* Code Quality - 10%\n", 24 | "* Storytelling - 10%\n", 25 | "* Result on Kaggle - 5%\n", 26 | "* Describing, Cleaning, and Visualizing data - 25%\n", 27 | "* Modeling - 50%\n", 28 | "\n", 29 | "More specifically, for modeling we will look for: \n", 30 | "\n", 31 | "* Model Selection: Did you try multiple models? Why did you choose these models? How do they work? What are they assumptions? And how did you test/account for them? How did you select hyper-parameters?\n", 32 | "* Model evaluation: Did you evaluate your model on multiple metrics? Where does your model do well? Where could it be improved? How are the metrics different?\n", 33 | "* Model interpretation: What do the model results tell you? Which variables are important? High bias or variance and how did you / could you fix this? How confident are you in your results? \n", 34 | "* Model usefulness: Do you think your final model was useful? If so, how would you recommend using it? Convince us, that if we were a company, we would feel comfortable using your model with our users. Think about edge cases as well - are there certain areas that the model performs poorly on? Best on? How would you handle these cases, if say Zillow wanted to leverage your model realizing that bad recommendations on sale prices would hurt customer trust and your brand. This section also falls into the storytelling aspect of the grading." 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": null, 40 | "metadata": {}, 41 | "outputs": [], 42 | "source": [] 43 | } 44 | ], 45 | "metadata": { 46 | "kernelspec": { 47 | "display_name": "Python 3", 48 | "language": "python", 49 | "name": "python3" 50 | }, 51 | "language_info": { 52 | "codemirror_mode": { 53 | "name": "ipython", 54 | "version": 3 55 | }, 56 | "file_extension": ".py", 57 | "mimetype": "text/x-python", 58 | "name": "python", 59 | "nbconvert_exporter": "python", 60 | "pygments_lexer": "ipython3", 61 | "version": "3.6.6" 62 | } 63 | }, 64 | "nbformat": 4, 65 | "nbformat_minor": 2 66 | } 67 | -------------------------------------------------------------------------------- /homeworks/.ipynb_checkpoints/Homework_5-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Dimensionality Reduction and Clustering\n", 8 | "\n", 9 | "For this homework we will be using some image data! Specifically, the MNIST data set. You can load this data easily with the following commands:" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 1, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "from sklearn.datasets import fetch_mldata\n", 19 | "\n", 20 | "mnist = fetch_mldata(\"MNIST original\")\n", 21 | "X = mnist.data / 255.0\n", 22 | "y = mnist.target" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "The MNIST data set is hand-drawn digits, from zero through nine.\n", 30 | "\n", 31 | "Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels in total. Each pixel has a single pixel-value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. 
This pixel-value is an integer between 0 and 255, inclusive.\n", 32 | "\n", 33 | "Source: https://www.kaggle.com/c/digit-recognizer/data\n", 34 | "\n", 35 | "For this homework, perform the following with the MNIST data:\n", 36 | "\n", 37 | "1. Use PCA to reduce the dimensionality\n", 38 | "\n", 39 | " a. How many components did you use? Why?\n", 40 | " \n", 41 | " b. Plot the first two components. Do you notice any trends? What is this plot showing us?\n", 42 | " \n", 43 | " c. Why would you use PCA? What is it doing? And what are the drawbacks?\n", 44 | " \n", 45 | " d. Plot some of the images, then compress them using PCA and plot again. How does it look?\n", 46 | " \n", 47 | "2. Use t-SNE to plot the first two components (you should probably random sample around 10000 points):\n", 48 | "\n", 49 | " a. How does this plot differ from your PCA plot?\n", 50 | " \n", 51 | " b. How robust is it to changes in perplexity?\n", 52 | " \n", 53 | " c. How robust is it to different learning rate and number of iterations?\n", 54 | " \n", 55 | "3. Perform k-means clustering:\n", 56 | "\n", 57 | " a. How did you choose k?\n", 58 | " \n", 59 | " b. How did you evaluate your clustering?\n", 60 | " \n", 61 | " c. Visualize your clusters using t-sne\n", 62 | " \n", 63 | " d. Did you scale your data?\n", 64 | " \n", 65 | " e. How robust is your clustering?\n", 66 | " \n", 67 | "4. Perform hierarchical clustering:\n", 68 | "\n", 69 | " a. Plot your dendrogram\n", 70 | " \n", 71 | " b. How many clusters seem reasonable based off your graph?\n", 72 | " \n", 73 | " c. How does your dendrogram change with different linkage methods?" 74 | ] 75 | }, 76 | { 77 | "cell_type": "markdown", 78 | "metadata": {}, 79 | "source": [ 80 | "### Grading\n", 81 | "\n", 82 | 83 | "This homework is due **April 4, 2019 by midnight Utah time.** By that time, you need to have committed all your code to your github and submitted a link to your work to the TA. We can see on your Github account when you last committed code. 
:)\n", 84 | 85 | 86 | "\n", 87 | "Rubric:\n", 88 | "\n", 89 | "* Code Quality - 10%\n", 90 | "* Storytelling - 10%\n", 91 | "* PCA - 20%\n", 92 | "* T-SNE - 20%\n", 93 | "* K-means - 20%\n", 94 | "* Hierarchical Clustering - 20%" 95 | ] 96 | } 97 | ], 98 | "metadata": { 99 | "kernelspec": { 100 | "display_name": "Python 3", 101 | "language": "python", 102 | "name": "python3" 103 | }, 104 | "language_info": { 105 | "codemirror_mode": { 106 | "name": "ipython", 107 | "version": 3 108 | }, 109 | "file_extension": ".py", 110 | "mimetype": "text/x-python", 111 | "name": "python", 112 | "nbconvert_exporter": "python", 113 | "pygments_lexer": "ipython3", 114 | "version": "3.6.6" 115 | } 116 | }, 117 | "nbformat": 4, 118 | "nbformat_minor": 2 119 | } 120 | -------------------------------------------------------------------------------- /homeworks/Final Project.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Final Project\n", 8 | "\n", 9 | "The final project is your opportunity to do your own data project end to end and present the results to some local companies.\n", 10 | "\n", 11 | "### Rubric\n", 12 | "\n", 13 | "Code Quality - 20%\n", 14 | "\n", 15 | "Describing, cleaning, and visualizing data - 25%\n", 16 | "\n", 17 | "Modeling - 25%\n", 18 | "\n", 19 | "Presentation / Storytelling - 30%" 20 | ] 21 | } 22 | ], 23 | "metadata": { 24 | "kernelspec": { 25 | "display_name": "Python 3", 26 | "language": "python", 27 | "name": "python3" 28 | }, 29 | "language_info": { 30 | "codemirror_mode": { 31 | "name": "ipython", 32 | "version": 3 33 | }, 34 | "file_extension": ".py", 35 | "mimetype": "text/x-python", 36 | "name": "python", 37 | "nbconvert_exporter": "python", 38 | "pygments_lexer": "ipython3", 39 | "version": "3.5.3" 40 | } 41 | }, 42 | "nbformat": 4, 43 | "nbformat_minor": 2 44 | } 45 | -------------------------------------------------------------------------------- /homeworks/Homework_1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Data Cleaning, Describing, and Visualization\n", 8 | "\n", 9 | "### Step 1 - Get your environment setup\n", 10 | "\n", 11 | "1. Install Git on your computer and fork the class repository on [Github](https://github.com/tfolkman/byu_econ_applied_machine_learning).\n", 12 | "2. Install [Anaconda](https://conda.io/docs/install/quick.html) and get it working.\n", 13 | "\n", 14 | "### Step 2 - Explore Datasets\n", 15 | "\n", 16 | "The goals of this project are:\n", 17 | "\n", 18 | "1. Read in data from multiple sources\n", 19 | "2. Gain practice cleaning, describing, and visualizing data\n", 20 | "\n", 21 | "To this end, you need to find from three different sources. For example: CSV, JSON, and API, SQL, or web scraping. For each of these data sets, you must perform the following:\n", 22 | "\n", 23 | "1. Data cleaning. Some options your might consider: handle missing data, handle outliers, scale the data, convert some data to categorical.\n", 24 | "2. Describe data. Provide tables, statistics, and summaries of your data.\n", 25 | "3. Visualize data. Provide visualizations of your data.\n", 26 | "\n", 27 | "These are the typical first steps of any data science project and are often the most time consuming. 
My hope is that in going through this process 3 different times, that you will gain a sense for it.\n", 28 | "\n", 29 | "Also, as you are doing this, please tell us a story. Explain in your notebook why are doing what you are doing and to what end. Telling a story in your analysis is a crucial skill for data scientists. There are almost an infinite amount of ways to analyze a data set; help us understand why you choose your particular path and why we should care.\n", 30 | "\n", 31 | "Also - this homework is very open-ended and we provided you with basically no starting point. I realize this increases the difficulty and complexity, but I think it is worth it. It is much closer to what you might experience in industry and allows you to find data that might excite you!\n", 32 | "\n", 33 | "### Grading\n", 34 | "\n", 35 | "This homework is due **Jan. 31, 2019 by 4:00pm Utah time.** By that time, you need to have committed all your code to your github and submitted a link to your work to the TA. We can see on your Github account when you last committed code. :)\n", 36 | "\n", 37 | "Rubric:\n", 38 | "\n", 39 | "* Code Quality - 10%\n", 40 | "* Ability to read in data - 10%\n", 41 | "* Ability to describe data - 20%\n", 42 | "* Ability to visualize data - 20%\n", 43 | "* Ability to clean data - 20%\n", 44 | "* Storytelling - 20%" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": null, 50 | "metadata": {}, 51 | "outputs": [], 52 | "source": [] 53 | } 54 | ], 55 | "metadata": { 56 | "kernelspec": { 57 | "display_name": "Python 3", 58 | "language": "python", 59 | "name": "python3" 60 | }, 61 | "language_info": { 62 | "codemirror_mode": { 63 | "name": "ipython", 64 | "version": 3 65 | }, 66 | "file_extension": ".py", 67 | "mimetype": "text/x-python", 68 | "name": "python", 69 | "nbconvert_exporter": "python", 70 | "pygments_lexer": "ipython3", 71 | "version": "3.6.6" 72 | } 73 | }, 74 | "nbformat": 4, 75 | "nbformat_minor": 2 76 | } 77 | -------------------------------------------------------------------------------- /homeworks/Homework_2.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Introduction to Modeling with Python\n", 8 | "\n", 9 | "Now that we have seen some examples of modeling and using Python for modeling, we wanted to give you a chance to try your hand!\n", 10 | "\n", 11 | "To that goal, we choose a well structured problem with plenty of resources online to help you along the way. That problem is predicting housing prices and is hosted on Kaggle:\n", 12 | "\n", 13 | "https://www.kaggle.com/c/house-prices-advanced-regression-techniques\n", 14 | "\n", 15 | "First, make sure you are signed up on Kaggle and then download the data:\n", 16 | "\n", 17 | "https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data\n", 18 | "\n", 19 | "The data includes both testing and training sets as well as a sample submission file. \n", 20 | "\n", 21 | "Your goal is the predict the sales price for each house where root mean squared error is the evaluation metric. 
To get some ideas on where to start, feel free to check out Kaggle Kernels:\n", 22 | "\n", 23 | "https://www.kaggle.com/c/house-prices-advanced-regression-techniques/kernels\n", 24 | "\n", 25 | "And the discussion board:\n", 26 | "\n", 27 | "https://www.kaggle.com/c/house-prices-advanced-regression-techniques/discussion\n", 28 | "\n", 29 | "Again - the goal of this homework is to get you exposed to modeling with Python. Feel free to use online resources to help guide you, but we expect original thought as well. Our hope is by the end of this homework you will feel comfortable exploring data in Python and building models to make predictions. Also please submit your test results to Kaggle and let us know your ranking and score!\n", 30 | "\n", 31 | "\n", 32 | "### Grading\n", 33 | "\n", 34 | "This homework is due **Feb. 21, 2019 by 4:00pm Utah time.** By that time, you need to have committed all your code to your github and submitted a link to your work to the TA. We can see on your Github account when you last committed code. :)\n", 35 | "\n", 36 | "Rubric:\n", 37 | "\n", 38 | "* Code Quality - 10%\n", 39 | "* Storytelling - 10%\n", 40 | "* Result on Kaggle - 5%\n", 41 | "* Describing, Cleaning, and Visualizing data - 25%\n", 42 | "* Modeling - 50%\n", 43 | "\n", 44 | "More specifically, for modeling we will look for: \n", 45 | "\n", 46 | "* Model Selection: Did you try multiple models? Why did you choose these models? How do they work? What are they assumptions? And how did you test/account for them? How did you select hyper-parameters?\n", 47 | "* Model interpretation: What do the model results tell you? Which variables are important? High bias or variance and how did you / could you fix this? How confident are you in your results? \n", 48 | "* Model usefulness: Do you think your final model was useful? If so, how would you recommend using it? Convince us, that if we were a company, we would feel comfortable using your model with our users. Think about edge cases as well - are there certain areas that the model performs poorly on? Best on? How would you handle these cases, if say Zillow wanted to leverage your model realizing that bad recommendations on sale prices would hurt customer trust and your brand. This section also falls into the storytelling aspect of the grading." 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "metadata": {}, 54 | "source": [ 55 | "## Introduction to Modeling with Python\n", 56 | "\n", 57 | "Now that we have seen some examples of modeling and using Python for modeling, we wanted to give you a chance to try your hand!\n", 58 | "\n", 59 | "To that goal, we choose a well structured problem with plenty of resources online to help you along the way. That problem is predicting housing prices and is hosted on Kaggle:\n", 60 | "\n", 61 | "https://www.kaggle.com/c/house-prices-advanced-regression-techniques\n", 62 | "\n", 63 | "First, make sure you are signed up on Kaggle and then download the data:\n", 64 | "\n", 65 | "https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data\n", 66 | "\n", 67 | "The data includes both testing and training sets as well as a sample submission file. \n", 68 | "\n", 69 | "Your goal is the predict the sales price for each house where root mean squared error is the evaluation metric. 
To get some ideas on where to start, feel free to check out Kaggle Kernels:\n", 70 | "\n", 71 | "https://www.kaggle.com/c/house-prices-advanced-regression-techniques/kernels\n", 72 | "\n", 73 | "And the discussion board:\n", 74 | "\n", 75 | "https://www.kaggle.com/c/house-prices-advanced-regression-techniques/discussion\n", 76 | "\n", 77 | "Again - the goal of this homework is to get you exposed to modeling with Python. Feel free to use online resources to help guide you, but we expect original thought as well. Our hope is by the end of this homework you will feel comfortable exploring data in Python and building models to make predictions. Also please submit your test results to Kaggle and let us know your ranking and score!\n", 78 | "\n", 79 | "\n", 80 | "### Grading\n", 81 | "\n", 82 | "This homework is due **Oct. 16, 2018 by 4:00pm Utah time.** By that time, you need to have committed all your code to your github and submitted a link to your work to the TA. We can see on your Github account when you last committed code. :)\n", 83 | "\n", 84 | "Rubric:\n", 85 | "\n", 86 | "* Code Quality - 10%\n", 87 | "* Storytelling - 10%\n", 88 | "* Result on Kaggle - 5%\n", 89 | "* Describing, Cleaning, and Visualizing data - 25%\n", 90 | "* Modeling - 50%\n", 91 | "\n", 92 | "More specifically, for modeling we will look for: \n", 93 | "\n", 94 | "* Model Selection: Did you try multiple models? Why did you choose these models? How do they work? What are they assumptions? And how did you test/account for them? How did you select hyper-parameters?\n", 95 | "* Model interpretation: What do the model results tell you? Which variables are important? High bias or variance and how did you / could you fix this? How confident are you in your results? \n", 96 | "* Model usefulness: Do you think your final model was useful? If so, how would you recommend using it? Convince us, that if we were a company, we would feel comfortable using your model with our users. Think about edge cases as well - are there certain areas that the model performs poorly on? Best on? How would you handle these cases, if say Zillow wanted to leverage your model realizing that bad recommendations on sale prices would hurt customer trust and your brand. This section also falls into the storytelling aspect of the grading." 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": null, 102 | "metadata": {}, 103 | "outputs": [], 104 | "source": [] 105 | } 106 | ], 107 | "metadata": { 108 | "kernelspec": { 109 | "display_name": "Python 3", 110 | "language": "python", 111 | "name": "python3" 112 | }, 113 | "language_info": { 114 | "codemirror_mode": { 115 | "name": "ipython", 116 | "version": 3 117 | }, 118 | "file_extension": ".py", 119 | "mimetype": "text/x-python", 120 | "name": "python", 121 | "nbconvert_exporter": "python", 122 | "pygments_lexer": "ipython3", 123 | "version": "3.6.6" 124 | } 125 | }, 126 | "nbformat": 4, 127 | "nbformat_minor": 2 128 | } 129 | -------------------------------------------------------------------------------- /homeworks/Homework_3.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Project Abstract\n", 8 | "\n", 9 | "To help people get started on their project and to make sure you are selecting an appropriate task, we will have all the teams submit an abstract. 
Please only submit one abstract per team.\n", 10 | "\n", 11 | "The abstract should include (at least):\n", 12 | "\n", 13 | "* Team members\n", 14 | "* Problem statement\n", 15 | "* Data you will use to solve the problem\n", 16 | "* Outline of how you plan on solving the problem with the data. For example, what pre-processing steps might you need to do, what models, etc. \n", 17 | "* Supporting documents if necessary citing past research in the area and methods used to solve the problem.\n", 18 | "\n", 19 | "The goal of this abstract is for you to think deeply about the project you will be undertaking and convince yourself (and us) that it is a meaningful and achievable project for this class.\n", 20 | "\n", 21 | "This homework is due **February 28, 2018 by 4:00 pm Utah time.** and will be submitted on learning suite." 22 | ] 23 | } 24 | ], 25 | "metadata": { 26 | "kernelspec": { 27 | "display_name": "Python 3", 28 | "language": "python", 29 | "name": "python3" 30 | }, 31 | "language_info": { 32 | "codemirror_mode": { 33 | "name": "ipython", 34 | "version": 3 35 | }, 36 | "file_extension": ".py", 37 | "mimetype": "text/x-python", 38 | "name": "python", 39 | "nbconvert_exporter": "python", 40 | "pygments_lexer": "ipython3", 41 | "version": "3.6.6" 42 | } 43 | }, 44 | "nbformat": 4, 45 | "nbformat_minor": 2 46 | } 47 | -------------------------------------------------------------------------------- /homeworks/Homework_4.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Classification with Python\n", 8 | "\n", 9 | "Hopefully now you are feeling a bit more comfortable with Python, Kaggle, and modeling. \n", 10 | "\n", 11 | "This next homework will test your classification abilities. We will be trying to predict the poverty levels of various households in Costa Rica:\n", 12 | "\n", 13 | "https://www.kaggle.com/c/costa-rican-household-poverty-prediction\n", 14 | "\n", 15 | "The evalution metric for Kaggle is accuracy, but please also explore how well your model does on multiple metrics like F1, precision, recall, and area under the ROC curve.\n", 16 | "\n", 17 | "### Grading\n", 18 | "\n", 19 | "This homework is due **March 21, 2019 by 4:00pm Utah time.** By that time, you need to have committed all your code to your github and submitted a link to your work to the TA. We can see on your Github account when you last committed code. :)\n", 20 | "\n", 21 | "Rubric:\n", 22 | "\n", 23 | "* Code Quality - 10%\n", 24 | "* Storytelling - 10%\n", 25 | "* Result on Kaggle - 5%\n", 26 | "* Describing, Cleaning, and Visualizing data - 25%\n", 27 | "* Modeling - 50%\n", 28 | "\n", 29 | "More specifically, for modeling we will look for: \n", 30 | "\n", 31 | "* Model Selection: Did you try multiple models? Why did you choose these models? How do they work? What are they assumptions? And how did you test/account for them? How did you select hyper-parameters?\n", 32 | "* Model evaluation: Did you evaluate your model on multiple metrics? Where does your model do well? Where could it be improved? How are the metrics different?\n", 33 | "* Model interpretation: What do the model results tell you? Which variables are important? High bias or variance and how did you / could you fix this? How confident are you in your results? \n", 34 | "* Model usefulness: Do you think your final model was useful? If so, how would you recommend using it? 
Convince us, that if we were a company, we would feel comfortable using your model with our users. Think about edge cases as well - are there certain areas that the model performs poorly on? Best on? How would you handle these cases, if say Zillow wanted to leverage your model realizing that bad recommendations on sale prices would hurt customer trust and your brand. This section also falls into the storytelling aspect of the grading." 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": null, 40 | "metadata": {}, 41 | "outputs": [], 42 | "source": [] 43 | } 44 | ], 45 | "metadata": { 46 | "kernelspec": { 47 | "display_name": "Python 3", 48 | "language": "python", 49 | "name": "python3" 50 | }, 51 | "language_info": { 52 | "codemirror_mode": { 53 | "name": "ipython", 54 | "version": 3 55 | }, 56 | "file_extension": ".py", 57 | "mimetype": "text/x-python", 58 | "name": "python", 59 | "nbconvert_exporter": "python", 60 | "pygments_lexer": "ipython3", 61 | "version": "3.6.6" 62 | } 63 | }, 64 | "nbformat": 4, 65 | "nbformat_minor": 2 66 | } 67 | -------------------------------------------------------------------------------- /homeworks/Homework_5.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Dimensionality Reduction and Clustering\n", 8 | "\n", 9 | "For this homework we will be using some image data! Specifically, the MNIST data set. You can load this data easily with the following commands:" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 1, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "from sklearn.datasets import fetch_mldata\n", 19 | "\n", 20 | "mnist = fetch_mldata(\"MNIST original\")\n", 21 | "X = mnist.data / 255.0\n", 22 | "y = mnist.target" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "The MNIST data set is hand-drawn digits, from zero through nine.\n", 30 | "\n", 31 | "Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels in total. Each pixel has a single pixel-value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel-value is an integer between 0 and 255, inclusive.\n", 32 | "\n", 33 | "Source: https://www.kaggle.com/c/digit-recognizer/data\n", 34 | "\n", 35 | "For this homework, perform the following with the MNIST data:\n", 36 | "\n", 37 | "1. Use PCA to reduce the dimensionality\n", 38 | "\n", 39 | " a. How many components did you use? Why?\n", 40 | " \n", 41 | " b. Plot the first two components. Do you notice any trends? What is this plot showing us?\n", 42 | " \n", 43 | " c. Why would you use PCA? What is it doing? And what are the drawbacks?\n", 44 | " \n", 45 | " d. Plot some of the images, then compress them using PCA and plot again. How does it look?\n", 46 | " \n", 47 | "2. Use t-SNE to plot the first two components (you should probably random sample around 10000 points):\n", 48 | "\n", 49 | " a. How does this plot differ from your PCA plot?\n", 50 | " \n", 51 | " b. How robust is it to changes in perplexity?\n", 52 | " \n", 53 | " c. How robust is it to different learning rate and number of iterations?\n", 54 | " \n", 55 | "3. Perform k-means clustering:\n", 56 | "\n", 57 | " a. How did you choose k?\n", 58 | " \n", 59 | " b. How did you evaluate your clustering?\n", 60 | " \n", 61 | " c. 
Visualize your clusters using t-sne\n", 62 | " \n", 63 | " d. Did you scale your data?\n", 64 | " \n", 65 | " e. How robust is your clustering?\n", 66 | " \n", 67 | "4. Perform hierarchical clustering:\n", 68 | "\n", 69 | " a. Plot your dendrogram\n", 70 | " \n", 71 | " b. How many clusters seem reasonable based off your graph?\n", 72 | " \n", 73 | " c. How does your dendrogram change with different linkage methods?" 74 | ] 75 | }, 76 | { 77 | "cell_type": "markdown", 78 | "metadata": {}, 79 | "source": [ 80 | "### Grading\n", 81 | "\n", 82 | 83 | "This homework is due **April 4, 2019 by midnight Utah time.** By that time, you need to have committed all your code to your github and submitted a link to your work to the TA. We can see on your Github account when you last committed code. :)\n", 84 | 85 | "\n", 86 | "Rubric:\n", 87 | "\n", 88 | "* Code Quality - 10%\n", 89 | "* Storytelling - 10%\n", 90 | "* PCA - 20%\n", 91 | "* T-SNE - 20%\n", 92 | "* K-means - 20%\n", 93 | "* Hierarchical Clustering - 20%" 94 | ] 95 | } 96 | ], 97 | "metadata": { 98 | "kernelspec": { 99 | "display_name": "Python 3", 100 | "language": "python", 101 | "name": "python3" 102 | }, 103 | "language_info": { 104 | "codemirror_mode": { 105 | "name": "ipython", 106 | "version": 3 107 | }, 108 | "file_extension": ".py", 109 | "mimetype": "text/x-python", 110 | "name": "python", 111 | "nbconvert_exporter": "python", 112 | "pygments_lexer": "ipython3", 113 | "version": "3.6.6" 114 | } 115 | }, 116 | "nbformat": 4, 117 | "nbformat_minor": 2 118 | } 119 | -------------------------------------------------------------------------------- /lectures/.ipynb_checkpoints/Lecture_11_Boosting-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Boosting\n", 8 | "\n", 9 | "Boosting is another ensemble techinque like random forests. With random forests, we trained multiple decision trees independently and created independence by randomly sampling samples and features.\n", 10 | "\n", 11 | "Boosting on the other hand trains multiple models sequentially where each each model tries to improve on the areas where the previous models performed poorly. \n", 12 | "\n", 13 | "### Adaboost\n", 14 | "\n", 15 | "Adaboost works by starting with a single model. From that model we then make predictions on the **training** set. We can then see which training samples this first model got right and which it got wrong. Adaboost then trains the next model, but puts more weight on the training samples that the first model got wrong. This process continues for N number of models where N is a hyper-parameter.\n", 16 | "\n", 17 | "It is important to note that the sequential nature of boosting makes it harder to scale and parallelize relative the random forest models. That being said, work has been done with boosting methods to allow them to be more parallelizable. A very popular library that does this is [XGboost](https://github.com/dmlc/xgboost).\n", 18 | "\n", 19 | "So - at the end of your training, you now have N models where each model is trained to do better on the instances that the previous models did poorly on. 
You can now combine these models via a weighted voting or averaging method very similar to random forest, except you weight each vote by how accurate the model was overall.\n", 20 | "\n", 21 | "Adaboost also has a hyper-parameter called **learning rate.** This hyper-parameter adjusts the contribution of each classifier. When decreasing the value, each new classifier makes smaller adjustments to the weights of mis-classified samples, meaning Adaboost learns more slowly per tree. Typically a lower learning rate requires more trees to perform well. This value and the number of trees can be tuned using cross-validation. \n", 22 | "\n", 23 | "#### Math\n", 24 | "\n", 25 | "So, how exactly do we do this?\n", 26 | "\n", 27 | "**Step 1**: Set all sample weights to $1/m$, where $m$ is the number of training examples\n", 28 | "\n", 29 | "**Step 2**: Train the first model\n", 30 | "\n", 31 | "**Step 3**: Calculate the weighted error of this model, which is simply the sum of the weights of the misclassified\n", 32 | "examples divided by the total weight of all samples. We will call this $r_j$ for the $j$th model. The best value is zero and the worst is 1.\n", 33 | "\n", 34 | "**Step 4**: Calculate the predictor's weight, where being more accurate earns a higher weight:\n", 35 | "\n", 36 | "$$\\alpha_{j} = \\eta \\log \\frac{1-r_{j}}{r_{j}}$$\n", 37 | "\n", 38 | "If the model is wrong more often than it is right, it gets a negative weight, and if it is close to random, its weight is close to zero.\n", 39 | "\n", 40 | "**Step 5**: Update the weights of your training samples: if a sample was classified correctly, its weight remains the same. If it was misclassified, its new weight is:\n", 41 | "\n", 42 | "$$\\text{new weight} = \\text{old weight} \\times \\exp(\\alpha_{j})$$\n", 43 | "\n", 44 | "Then normalize all weights to sum to 1 by dividing each weight by the sum of the weights.\n", 45 | "\n", 46 | "You can see that a good predictor adds extra weight to its mis-classifications, putting a strong emphasis on them for the next model. Also, $\\eta$ is our learning rate and can decrease the impact of a tree on the weight updates by being less than 1.\n", 47 | "\n", 48 | "**Step 6**: Repeat from step 2 with the next model and continue until the required number of models is reached.\n", 49 | "\n", 50 | "That's it!\n", 51 | "\n", 52 | "Predictions are made by running a new sample through all the trained models, getting the most likely class (for classification) from each one, and then doing a weighted vote where the weight is the value from step 4. For regression, just take a weighted average.\n", 53 | "\n", 54 | "Note: Almost always the model chosen for boosting is a decision tree.\n", 55 | "\n", 56 | "### Gradient Boosting\n", 57 | "\n", 58 | "Gradient boosting is also sequential, but instead of changing weights, it uses the residual errors from the previous model as the targets.\n", 59 | "\n", 60 | "Basically, for regression, here are our steps:\n", 61 | "\n", 62 | "1. Fit a decision tree (assuming this is our base model; it is the most common choice)\n", 63 | "2. Calculate the residuals: true training values - predicted training values. Note: it turns out that this is the same as taking the negative gradient of the loss function, so this can generalize to other loss functions for classification and ranking tasks.\n", 64 | "3. Train a second decision tree where the residuals are the targets\n", 65 | "4. Continue the process for the number of defined estimators (hyper-parameter)\n", 66 | "\n", 67 | "At prediction time, we make a prediction with each tree and **sum** them. 
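To make the residual-fitting idea concrete, here is a minimal from-scratch sketch of gradient boosting for regression with squared-error loss. The base learner, tree depth, learning rate, and the choice of the target mean as the initial prediction are illustrative assumptions, not the only (or necessarily the best) settings.

```python
# A minimal gradient boosting sketch for regression with squared-error loss.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbr(X, y, n_estimators=100, learning_rate=0.1, max_depth=2):
    init = y.mean()              # start from a constant prediction
    residuals = y - init         # residuals = negative gradient of squared error
    trees = []
    for _ in range(n_estimators):
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)   # each new tree is fit to the current residuals
        trees.append(tree)
        residuals = residuals - learning_rate * tree.predict(X)  # shrink each tree's contribution
    return init, trees

def predict_gbr(init, trees, X, learning_rate=0.1):
    # prediction = initial constant + (scaled) sum of every tree's prediction
    pred = np.full(X.shape[0], init, dtype=float)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred

# Tiny usage example on synthetic data.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)
init, trees = fit_gbr(X, y)
print("Train MSE:", np.mean((predict_gbr(init, trees, X) - y) ** 2))
```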
\n", 68 | "\n", 69 | "### Learning Rate\n", 70 | "\n", 71 | "The learning rate for gradient boosting trees, scales the contribution of each tree. \n", 72 | "\n", 73 | "\n", 74 | "### Early Stopping\n", 75 | "\n", 76 | "A good way to decide how many trees are needed is to use early stopping. See page 199 of Hands on Machine Learning for an example of this with sklearn. Basically, as you add an additional tree, you check your validation error and when your validation error stops getting better, you stop adding trees.\n", 77 | "\n", 78 | "\n", 79 | "### Some final notes\n", 80 | "\n", 81 | "* Boosting is more likely to overfit than random forests when the number of estimators is large. Though, usually slow to overfit.\n", 82 | "* Typical learning rates are around 0.01 or 0.001. And small learning rates can require a large number of estimators to achieve good results.\n", 83 | "* The max depth of the trees controls the complexity of the model and a max depth of 1 can often work well. This results in an additive model.\n", 84 | "* Can also get feature importance scores as with Random Forest.\n", 85 | "\n", 86 | "### Stacking\n", 87 | "\n", 88 | "We won't spend time discussing stacking as it tends to be too complex for industry. That being said, it is a good technique to be familiar with. Take a look at p. 200 of Hands on Machine Learning." 89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "metadata": {}, 94 | "source": [ 95 | "## SKlearn Example" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": 2, 101 | "metadata": { 102 | "collapsed": true 103 | }, 104 | "outputs": [], 105 | "source": [ 106 | "from sklearn.datasets import load_breast_cancer\n", 107 | "from sklearn.ensemble import GradientBoostingClassifier\n", 108 | "from sklearn.model_selection import train_test_split, GridSearchCV\n", 109 | "from sklearn.metrics import f1_score, classification_report\n", 110 | "from collections import Counter\n", 111 | "import numpy as np\n", 112 | "import pandas as pd\n", 113 | "%matplotlib inline" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": 4, 119 | "metadata": { 120 | "collapsed": true 121 | }, 122 | "outputs": [], 123 | "source": [ 124 | "data = load_breast_cancer()\n", 125 | "X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.20, random_state=42)" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": 5, 131 | "metadata": {}, 132 | "outputs": [ 133 | { 134 | "data": { 135 | "text/plain": [ 136 | "GridSearchCV(cv=None, error_score='raise',\n", 137 | " estimator=GradientBoostingClassifier(criterion='friedman_mse', init=None,\n", 138 | " learning_rate=0.1, loss='deviance', max_depth=3,\n", 139 | " max_features=None, max_leaf_nodes=None,\n", 140 | " min_impurity_decrease=0.0, min_impurity_split=None,\n", 141 | " min_samples_leaf=1, min_samples_split=2,\n", 142 | " min_weight_fraction_leaf=0.0, n_estimators=100,\n", 143 | " presort='auto', random_state=None, subsample=1.0, verbose=0,\n", 144 | " warm_start=False),\n", 145 | " fit_params=None, iid=True, n_jobs=1,\n", 146 | " param_grid={'learning_rate': [0.1, 0.01, 0.001], 'n_estimators': [100, 1000, 5000], 'max_depth': [1, 2, 3]},\n", 147 | " pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',\n", 148 | " scoring='f1', verbose=0)" 149 | ] 150 | }, 151 | "execution_count": 5, 152 | "metadata": {}, 153 | "output_type": "execute_result" 154 | } 155 | ], 156 | "source": [ 157 | "clf = GradientBoostingClassifier()\n", 158 | 
"gridsearch = GridSearchCV(clf, {\"learning_rate\": [.1, .01, .001], \"n_estimators\": [100, 1000, 5000], \n", 159 | " 'max_depth': [1, 2, 3]}, scoring='f1')\n", 160 | "gridsearch.fit(X_train, y_train)" 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": 10, 166 | "metadata": {}, 167 | "outputs": [ 168 | { 169 | "name": "stdout", 170 | "output_type": "stream", 171 | "text": [ 172 | "Best Params: {'learning_rate': 0.01, 'max_depth': 1, 'n_estimators': 5000}\n", 173 | "\n", 174 | "Classification Report:\n", 175 | " precision recall f1-score support\n", 176 | "\n", 177 | " 0 0.95 0.93 0.94 43\n", 178 | " 1 0.96 0.97 0.97 71\n", 179 | "\n", 180 | "avg / total 0.96 0.96 0.96 114\n", 181 | "\n" 182 | ] 183 | } 184 | ], 185 | "source": [ 186 | "print(\"Best Params: {}\".format(gridsearch.best_params_))\n", 187 | "print(\"\\nClassification Report:\")\n", 188 | "print(classification_report(y_test, gridsearch.predict(X_test)))" 189 | ] 190 | } 191 | ], 192 | "metadata": { 193 | "kernelspec": { 194 | "display_name": "Python 3", 195 | "language": "python", 196 | "name": "python3" 197 | }, 198 | "language_info": { 199 | "codemirror_mode": { 200 | "name": "ipython", 201 | "version": 3 202 | }, 203 | "file_extension": ".py", 204 | "mimetype": "text/x-python", 205 | "name": "python", 206 | "nbconvert_exporter": "python", 207 | "pygments_lexer": "ipython3", 208 | "version": "3.6.1" 209 | } 210 | }, 211 | "nbformat": 4, 212 | "nbformat_minor": 2 213 | } 214 | -------------------------------------------------------------------------------- /lectures/.ipynb_checkpoints/Lecture_1_Introduction-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### Introductions\n", 8 | "\n", 9 | "* Importance of getting to know people in the class\n", 10 | "* Review syllabus\n", 11 | "* Give out slack information\n", 12 | "* When office hours?\n", 13 | "* Participation: ML news / paper for the day / discuss homework ideas / class discussions\n", 14 | "* Homework: Target 5 homeworks which will be pretty open-ended and almost like mini-projects\n", 15 | "* Quizzes: In class and around 3. Will test knowledge of the subject not coding\n", 16 | "* First time class being taught and I am very open to feedback. Also, feel free to submit pull requests to correct or add to content.\n", 17 | "* Review first homework" 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": {}, 23 | "source": [ 24 | "## What is Machine Learning?\n", 25 | "\n", 26 | "[Wikipedia](https://en.wikipedia.org/wiki/Machine_learning) tells us that Machine learning is, \"a field of computer science that gives computers **the ability to learn without being explicitly programmed**.\" It goes on to say, \"machine learning explores the study and construction of algorithms that can learn from and make predictions on data – such algorithms overcome following strictly static program instructions by making data-driven predictions or decisions, through **building a model from sample inputs**.\"\n", 27 | "\n", 28 | "\n", 29 | "### Learning from inputs\n", 30 | "\n", 31 | "What does it mean to learn from inputs without being explicitly programmed? Let us consider the classical machine learning problem: spam filtering.\n", 32 | "\n", 33 | "Let us imagine that we no nothing about machine learning, but are tasked with determining whether an email is spam or not. 
How might we do this?\n", 34 | "\n", 35 | "What do we like and not like about our approach? How would we keep it up to date with new types of spam?\n", 36 | "\n", 37 | "Now imagine we had a black-box that, if given a great many examples of emails that are spam and not spam, could take these examples and learn from the text what spam looks like. For this class, we will call that black-box (though it won't be black for long) machine learning. And often the data we feed it to understand a problem are called features.\n", 38 | "\n", 39 | "What does learning from inputs look like? Let us consider [flappy bird](https://www.youtube.com/watch?v=79BWQUN_Njc).\n", 40 | "\n", 41 | "\n", 42 | "## Why machine learning?\n", 43 | "\n", 44 | "From our discussion, why do you think machine learning might be valuable?\n", 45 | "\n", 46 | "\n", 47 | "## Types of machine learning problems\n", 48 | "\n", 49 | "**Supervised** machine learning problems are ones for which you have labeled data. Labeled data means you give the algorithm the solution with the data and these solutions are called labels. For example, with spam classification the labels would be \"spam\" or \"not spam.\" Linear regression would be considered a supervised problem.\n", 50 | "\n", 51 | "**Unsupervised** machine learning is the opposite. It is not given any labels. These algorithms are often not as powerful as they don't get the benefit of labels, but they can be extremely valuable when getting labeled data is expensive or impossible. An example would be clustering.\n", 52 | "\n", 53 | "**Regression** problems are a class of problems for which you are trying to predict a real number. For example, linear regression outputs a real number and could be used to predict housing prices.\n", 54 | "\n", 55 | "**Classification** problems are problems for which you are predicting a class. For example, spam prediction is a classification problem because you want to know whether your input falls into one of two classes. Logistic regression is an algorithm used for classification.\n", 56 | "\n", 57 | "**Ranking** problems are very popular in eCommerce. These models try to rank the items by how valuable they are to a user. For example, Netflix's movie recommendations. An example model is collaborative filtering.\n", 58 | "\n", 59 | "**Reinforcement Learning** is when you have an agent in an environment that gets to perform actions and receive rewards for those actions. The model here learns the best actions to take to maximize rewards. The flappy bird video is an example of reinforcement learning. An example model is deep Q-networks.\n", 60 | "\n", 61 | "\n", 62 | "## Machine Learning and Econometrics\n", 63 | "\n", 64 | "How are they different?\n", 65 | "\n", 66 | "For one, they use different lingo. \n", 67 | "\n", 68 | "For another, econometrics is often more interested in understanding why things happen, while machine learning often cares more about just the actual prediction being correct.\n", 69 | "\n", 70 | "Economic theory is often a driver in the development of econometric models, while machine learning often relies on the data to deliver insights.\n", 71 | "\n", 72 | "The two worlds have a lot of overlap and continue to grow closer. 
Machine learning is getting better and providing both predictions and understandings and econometrics is finding value in the scalability and accuracy of some machine learning models.\n", 73 | "\n", 74 | "\n", 75 | "## Challenges of Machine Learning" 76 | ] 77 | }, 78 | { 79 | "cell_type": "markdown", 80 | "metadata": {}, 81 | "source": [ 82 | "Perhaps my favorite part from the Wikipedia page on machine learning is, \"As of 2016, **machine learning is a buzzword**, and according to the Gartner hype cycle of 2016, at its peak of inflated expectations. Effective machine learning is difficult because finding patterns is hard and often not enough training data is available; as a result, **machine-learning programs often fail to deliver**.\"\n", 83 | "\n", 84 | "* There isn't a clear problem to solve\n", 85 | "\n", 86 | "Some executive heard machine learning is the next big thing, so they hired a data science team. Unfortunately, there isn't a clear idea on what problems to solve, so the team flounders for a year.\n", 87 | "\n", 88 | "* Labeled data can be extremely important to building machine learning models, but can also be extremely costly.\n", 89 | "\n", 90 | "First off, you often need a lot of data. [Google](https://research.googleblog.com/2017/07/revisiting-unreasonable-effectiveness.html) found for representation learning, performance increases logarithmically based on volume of training data:\n", 91 | "\n", 92 | "![image.png](attachment:image.png)\n", 93 | "\n", 94 | "Secondly, you need to get data that represents the full distribution of the problem you are trying to solve. For example, for our spam classification problem, what kinds of emails might we want to gather? What if we only had emails that came from US IP addresses?\n", 95 | "\n", 96 | "Lastly, just getting data labeled can be time consuming and cost a lot of money. Who is going to label 1 million emails as spam or not spam?\n", 97 | "\n", 98 | "* Data can be very messy\n", 99 | "\n", 100 | "Often data in the real world has errors, outliers, missing data, and noise. How you handle these can greatly influence the outcome of your model.\n", 101 | "\n", 102 | "* Feature engineering\n", 103 | "\n", 104 | "Once you have your data and labels, deciding on how to represent it to your model can be very challenging. For example, for spam classification would you just feed it the raw text? What about the origin of the IP addres? What about a timestamp?\n", 105 | "\n", 106 | "* Your model might not generalize\n", 107 | "\n", 108 | "After all of this, you might still end up with a model that either is too simple to be effective (underfitting) or too complex to generalize well (overfitting). You have to develop a model that is just right. :)\n", 109 | "\n", 110 | "* Evaluation is non-trivial\n", 111 | "\n", 112 | "Let's say we develop a machine learning model for spam classification. How do we evaluate it? Do we care more about precision or recall? How do we tie our scientific metrics to business metrics?\n", 113 | "\n", 114 | "* Getting into production can be hard\n", 115 | "\n", 116 | "You have a beautiful model built in Python only to discover the back-end is in Java and has to run in under 5ms, so micro-services are not an option. 
So you convert your model to PMML, but engineers won't let you push code, so you are now blocked and putting your model in production isn't high on their priorities.\n", 117 | "\n", 118 | "\n", 119 | "## There is hope\n", 120 | "\n", 121 | "While many machine learning initiatives do fail, many also succeed and are running some of the most valuable companies in the world. Companies like Google, Facebook, Amazon, AirBnB, and Netflix have all found successful ways to leverage machine learning and are reaping large rewards.\n", 122 | "\n", 123 | "Google CEO Sundar Pichai even recently said, \"an important shift from a mobile first world to an AI first world.\"\n", 124 | "\n", 125 | "And Mark Cuban said, \"Artificial Intelligence, deep learning, machine learning — whatever you're doing if you don't understand it — learn it. Because otherwise you're going to be a dinosaur within 3 years\"\n", 126 | "\n", 127 | "And lastly, [Harvard Business Review](https://hbr.org/2012/10/big-data-the-management-revolution) found, \"companies in the top third of their industry in the use of data-driven decision making were, on average, 5% more productive and 6% more profitable than their competitors.\"\n", 128 | "\n", 129 | "The goal of this course is to prepare you for this world, so that you will not only know how to build the machine learning models to predict the future, but also understand the key ingredients of a successful machine learning initiative and how to overcome the challenges." 130 | ] 131 | } 132 | ], 133 | "metadata": { 134 | "kernelspec": { 135 | "display_name": "Python 2", 136 | "language": "python", 137 | "name": "python2" 138 | }, 139 | "language_info": { 140 | "codemirror_mode": { 141 | "name": "ipython", 142 | "version": 2 143 | }, 144 | "file_extension": ".py", 145 | "mimetype": "text/x-python", 146 | "name": "python", 147 | "nbconvert_exporter": "python", 148 | "pygments_lexer": "ipython2", 149 | "version": "2.7.13" 150 | } 151 | }, 152 | "nbformat": 4, 153 | "nbformat_minor": 2 154 | } 155 | -------------------------------------------------------------------------------- /lectures/.ipynb_checkpoints/Lecture_6_K_Nearest_Neighbors-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## K-Nearest Neighbors\n", 8 | "\n", 9 | "This is one of the simplest and easiest to understand algorithms. It can be used for both classification and regression tasks, but is more common in classification, so we will focus there. The principles, though, can be used in both cases and sklearn supports both.\n", 10 | "\n", 11 | "Basically, here is the algorithm:\n", 12 | "\n", 13 | "1. Define $k$\n", 14 | "2. Define a distance metric - usually Euclidean distance\n", 15 | "3. For a new data point, find the $k$ nearest training points and combine their classes in some way - usually voting - to get a predicted class\n", 16 | "\n", 17 | "That's it!\n", 18 | "\n", 19 | "**Some of the benefits:**\n", 20 | "\n", 21 | "* Doesn't really require any training in the traditional sense. You just need a fast way to find the nearest neighbors.\n", 22 | "* Easy to understand \n", 23 | "\n", 24 | "**Some of the negatives:**\n", 25 | "\n", 26 | "* Need to define k, which is a hyper-parameter, so it can be tuned with cross-validation. 
A higher value for k increases bias and a lower value increases variance.\n", 27 | "* Have to choose a distance metric and could get very different results depending on the metric. Again, you can use cross-validation.\n", 28 | "* Doesn't really offer insights into which features might be important.\n", 29 | "* Can suffer with high dimensional data due to the curse of dimensionality.\n", 30 | "\n", 31 | "**Basic assumption:**\n", 32 | "\n", 33 | "* Data points that are close are similar for our target\n", 34 | "\n", 35 | "\n", 36 | "### Curse of Dimensionality\n", 37 | "\n", 38 | "Basically, the curse of dimensionality means that in high dimensions, it is likely that close points are not much closer than the average distance, which means being close doesnt mean much. In high dimensions, the data becomes very spread out, which creates this phenomenon. \n", 39 | "\n", 40 | "There are so many good resources for this online, that I won't go any deeper. Here is one you might look at:\n", 41 | "\n", 42 | "http://blog.galvanize.com/how-to-manage-dimensionality/\n", 43 | "\n", 44 | "### Euclidean Distance\n", 45 | "\n", 46 | "For vectors, q and p that are being compared (these would be our feature vectors):\n", 47 | "\n", 48 | "$$\\sqrt{\\sum_{i=1}^{N}{(p_{i} - q_{i})^2}}$$\n", 49 | "\n", 50 | "\n", 51 | "### SKlearn Example\n", 52 | "\n" 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": 15, 58 | "metadata": { 59 | "collapsed": false 60 | }, 61 | "outputs": [], 62 | "source": [ 63 | "from sklearn.datasets import load_breast_cancer\n", 64 | "from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor\n", 65 | "from sklearn.model_selection import train_test_split, GridSearchCV\n", 66 | "from sklearn.metrics import f1_score, classification_report, accuracy_score, mean_squared_error" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": 3, 72 | "metadata": { 73 | "collapsed": false 74 | }, 75 | "outputs": [ 76 | { 77 | "name": "stdout", 78 | "output_type": "stream", 79 | "text": [ 80 | "Best Params: {'n_neighbors': 5, 'p': 1, 'weights': 'uniform'}\n", 81 | "Train F1: 0.9606837606837607\n", 82 | "Test Classification Report:\n", 83 | " precision recall f1-score support\n", 84 | "\n", 85 | " 0 0.97 0.88 0.93 43\n", 86 | " 1 0.93 0.99 0.96 71\n", 87 | "\n", 88 | "avg / total 0.95 0.95 0.95 114\n", 89 | "\n", 90 | "Train Accuracy: 0.9494505494505494\tTest accuracy: 0.9473684210526315\n" 91 | ] 92 | } 93 | ], 94 | "source": [ 95 | "data = load_breast_cancer()\n", 96 | "X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.20, random_state=42)\n", 97 | "clf = KNeighborsClassifier()\n", 98 | "gridsearch = GridSearchCV(clf, {\"n_neighbors\": [1, 3, 5, 7, 9, 11], \"weights\": ['uniform', 'distance'], \n", 99 | " 'p': [1, 2, 3]}, scoring='f1')\n", 100 | "gridsearch.fit(X_train, y_train)\n", 101 | "print(\"Best Params: {}\".format(gridsearch.best_params_))\n", 102 | "y_pred_train = gridsearch.predict(X_train)\n", 103 | "print(\"Train F1: {}\".format(f1_score(y_train, y_pred_train)))\n", 104 | "print(\"Test Classification Report:\")\n", 105 | "y_pred_test = gridsearch.predict(X_test)\n", 106 | "print(classification_report(y_test, y_pred_test))\n", 107 | "print(\"Train Accuracy: {}\\tTest accuracy: {}\".format(accuracy_score(y_train, y_pred_train),\n", 108 | " accuracy_score(y_test, y_pred_test)))" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "### Voting Methods\n", 116 | 
"\n", 117 | "* Majority Voting: After you take the $k$ nearest neighbors, you take a \"vote\" of those neighbors' classes. The new data point is classified with whatever the majority class of the neighbors is. If you are doing a binary classification, it is recommended that you use an odd number of neighbors to avoid tied votes. However, in a multi-class problem, it is harder to avoid ties. A common solution to this is to decrease $k$ until the tie is broken.\n", 118 | "* Distance Weighting: Instead of directly taking votes of the nearest neighbors, you weight each vote by the distance of that instance from the new data point. A common weighting method is $\\hat{y} = \\frac{\\sum_{i=1}^Nw_iq_i}{\\sum_{i=1}^N w_i}$, where $w_i=\\frac{1}{\\sum_{i=1}^{N}{(p_{i} - q_{i})^2}}$, or one over the distance between the new data point and the training point. The new data point is added into the class with the largest added weight. Not only does this decrease the chances of ties, it also reduces the effect of a skewed representation of data.\n", 119 | "\n", 120 | "### Distance Metrics\n", 121 | "Euclidean distance, or the 2-norm, is a very common distance metric to use for $k$-nearest neighbors. Any $p$-norm can be used (the $p$-norm is defined as $||\\mathbf{x}_i-\\mathbf{y}_i||_p = (\\sum_{i=1}^N|x_i-y_i|^p)^\\frac{1}{p}$. For categorical data, however, this can be a problem. For example, if we have encoded a feature for car color from red, blue, and green to 0, 1, 2, how can the \"distance\" between green and red be measured? You could make dummy variables, but if a feature has 15 possible categories, that means adding 14 more variables to your feature set, and we run into the curse of dimensionality. There are several ways to confront this issue, but they are beyond the scope of this lecture. But be aware of the effect categorical features can have on your nearest neighbors classifier.\n", 122 | "\n", 123 | "### Search Algorithm\n", 124 | "\n", 125 | "Imagine the data set contains 2000 points. A brute-force search for the 3 nearest neighbors to one point does not take very long. But if the data set contains 2000000 points, a brute-force search can become quite costly, especially if the dimension of the data is large. Other search algorithms sacrifice an exhaustive search for a faster run time. Structures such as KDTrees (see https://en.wikipedia.org/wiki/K-d_tree) or Ball trees (see https://en.wikipedia.org/wiki/Ball_tree) are used for faster run times. While we won't dive into the details of these structures, be aware of them and how they can optimize your run time (although, training time does increase).\n", 126 | "\n", 127 | "### Radius Neighbors Classifier\n", 128 | "\n", 129 | "This is the same idea as a $k$ nearest neighbors classifier, but instead of finding the $k$ nearest neighbors, you find all the neighbors within a given radius. Setting the radius requires some domain knowledge; if your points are closely packed together, you'd want to use a smaller radius to avoid having nearly every point vote.\n", 130 | "\n", 131 | "### K Neighbors Regressor\n", 132 | "To change our problem from classification to regressing, all we have to do is find the weighted average of the $k$ nearest neighbors. Instead of taking the majority class, we calculate a weighted average of these nearest values, using the same weighting method as above.\n", 133 | "\n", 134 | "Let's try predicting the area of tissue based on the other features using a $k$-neighbors regressor." 
135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": 19, 140 | "metadata": { 141 | "collapsed": false 142 | }, 143 | "outputs": [ 144 | { 145 | "name": "stdout", 146 | "output_type": "stream", 147 | "text": [ 148 | "Best Params: {'n_neighbors': 5, 'p': 1, 'weights': 'distance'}\n", 149 | "Train MSE: 0.0\tTest MSE: 878.1482686614424\n" 150 | ] 151 | } 152 | ], 153 | "source": [ 154 | "target = data.data[:,3]\n", 155 | "X = data.data[:,[0,1,2,4,5,6,7,8]]\n", 156 | "X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.20, random_state=42)\n", 157 | "reg = KNeighborsRegressor()\n", 158 | "gridsearch = GridSearchCV(reg, {\"n_neighbors\": [1, 3, 5, 7, 9, 11], \"weights\": ['uniform', 'distance'], \n", 159 | " 'p': [1, 2, 3]}, scoring='neg_mean_squared_error')\n", 160 | "gridsearch.fit(X_train, y_train)\n", 161 | "print(\"Best Params: {}\".format(gridsearch.best_params_))\n", 162 | "y_pred_train = gridsearch.predict(X_train)\n", 163 | "y_pred_test = gridsearch.predict(X_test)\n", 164 | "print(\"Train MSE: {}\\tTest MSE: {}\".format(mean_squared_error(y_train, y_pred_train),\n", 165 | " mean_squared_error(y_test, y_pred_test)))" 166 | ] 167 | }, 168 | { 169 | "cell_type": "markdown", 170 | "metadata": {}, 171 | "source": [ 172 | "### Outlier Detection\n", 173 | "In $k$-nearest neighbors, the data is naturally clustered. Within these clusters, we can find the average distance between points (either exhaustively or from the centroid of the cluster). If we find a few points that are much farther than the average distance to other points or to the centroid, it is reasonable (but not always correct) to think they could be outliers. We can use this process on new data points as well. " 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": null, 179 | "metadata": { 180 | "collapsed": true 181 | }, 182 | "outputs": [], 183 | "source": [] 184 | } 185 | ], 186 | "metadata": { 187 | "kernelspec": { 188 | "display_name": "Python 3", 189 | "language": "python", 190 | "name": "python3" 191 | }, 192 | "language_info": { 193 | "codemirror_mode": { 194 | "name": "ipython", 195 | "version": 3 196 | }, 197 | "file_extension": ".py", 198 | "mimetype": "text/x-python", 199 | "name": "python", 200 | "nbconvert_exporter": "python", 201 | "pygments_lexer": "ipython3", 202 | "version": "3.5.3" 203 | } 204 | }, 205 | "nbformat": 4, 206 | "nbformat_minor": 2 207 | } 208 | -------------------------------------------------------------------------------- /lectures/.ipynb_checkpoints/Lecture_9_Decision_Trees-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "collapsed": true 7 | }, 8 | "source": [ 9 | "## Decision Trees\n", 10 | "\n", 11 | "To better understand decision trees, let's take a look at a simple example:\n", 12 | "\n", 13 | "https://zhengtianyu.files.wordpress.com/2013/12/decision-tree.png\n", 14 | "\n", 15 | "Clearly, decision trees are very simple and pretty easy to understand. Once you have a tree, you just follow the forks down for your example to get to an end node. Then at that end node you take the most common class or average value for regression. In fact, with classification you can get probabilistic estimates by taking the fraction of examples in that class. Nice! 
And that is one of the biggest benefits of decision trees - they are so easy to explain and understand.\n", 16 | "\n", 17 | "Most decision trees use binary trees - each split is either yes or no. When you graph these decision boundaries you end up with a bunch of vertical or horizontal lines. Here is an example with the iris data set:\n", 18 | "\n", 19 | "http://statweb.stanford.edu/~jtaylo/courses/stats202/trees.html\n", 20 | "\n", 21 | "## How To Grow A Tree\n", 22 | "\n", 23 | "So - how does one learn a tree? Do we just randomly pick binary splitting points and see what comes out? Of course not! We leverage our data and a definition of impurity. \n", 24 | "\n", 25 | "If you think about it, what we really want from our tree is pure leaf nodes. Meaning that, for classification, we would like each leaf node to end up with examples of only a single class. This greatly increases our confidence that if a testing example ends up at that leaf node it is in fact of that class.\n", 26 | "\n", 27 | "Imagine the opposite, a leaf node in a 2 class classification problem that has 50% of the examples in one class and 50% in the other. That really doesn't help us at all! If a testing example ends up at that leaf node, all we can offer is a 50/50 guess.\n", 28 | "\n", 29 | "So, how do we define impurity?\n", 30 | "\n", 31 | "**Gini Impurity**\n", 32 | "\n", 33 | "$$Gini_{i} = 1 - \\sum_{k=1}^{n}{p_{i,k}^2}$$\n", 34 | "\n", 35 | "Where $i$ is the node of interest, $n$ is the total number of classes, and $p_{i,k}$ is the fraction of class $k$ in node $i$. The Gini of a node is 0 when all the examples belong to the same class. Gini is highest when there is an equal probability of being in each class. \n", 36 | "\n", 37 | "Let's consider a 2-class example. Say I am trying to predict whether someone is male or female and I branch on height being less than 5 1/2 feet. In this node, I find 50 examples of which 40 are female and 10 are male. Gini is then:\n", 38 | "\n", 39 | "$$1 - (40/50)^2 - (10/50)^2 = 0.32$$\n", 40 | "\n", 41 | "Very nice - now we can basically evaluate how good a node is. With this we can now start to **greedily** grow our tree. What I mean by greedily is that we will not consider every possible tree, but instead start with the most pure feature split and then from there pick the next most pure, etc.\n", 42 | "\n", 43 | "Thus, our algorithm looks at all the features and all the splitting points for our features (note: this process runs much faster if your features have a small number of unique values, like categories, as opposed to real numbers). To evaluate a feature and split-point pair we do the following:\n", 44 | "\n", 45 | "Take the weighted average of the Gini impurity for the two nodes created by the split, with the weights being the number of examples in each node.\n", 46 | "\n", 47 | "That's it! Basically, the feature split that produces the lowest weighted average Gini impurity is considered best, then we move on and find the next best feature split given all the previous feature splits.\n", 48 | "\n", 49 | "### When to stop?\n", 50 | "\n", 51 | "Decision trees have many hyper-parameters to help control when to stop growing the tree.
These include max depth, min_samples_split, min_samples_leaf, min_weight_fraction_leaf, and max_leaf_nodes.\n", 52 | "\n", 53 | "Since we have to consider all these splits, training can be slow, but predictions are very fast since we just have to traverse the tree.\n", 54 | "\n", 55 | "Also, there are other impurity measures that are used with decision trees. Another very popular one is entropy. See page 173 of Hands On Machine Learning for its definition. Usually, which one you choose doesn't matter too much. Gini is slightly faster to compute, while entropy produces slightly more balanced trees.\n", 56 | "\n", 57 | "### Bias and variance\n", 58 | "\n", 59 | "Decision trees can be very prone to overfitting if you let them grow too deep. Thus, decreasing the depth can decrease variance / increase bias. It is important to use cross-validation to do hyper-parameter selection. \n", 60 | "\n", 61 | "### Regression\n", 62 | "\n", 63 | "Decision trees can be applied to regression problems in exactly the same way, but instead of using Gini impurity you would use mean squared error. You calculate the MSE of a node by setting the prediction for all values in that node as the average $y$ value of the examples in that node. For example, if your node had these values: 5, 2, 1, 6 then your predicted value would be (5+2+1+6)/4 = 3.5. You would take the squared difference between 3.5 and each of the values and then take the mean. \n", 64 | "\n", 65 | "Prediction is done by traversing to the leaf node for an example and taking the average value at that node.\n", 66 | "\n", 67 | "### Pros\n", 68 | "\n", 69 | "* Easy to explain\n", 70 | "* Can be visualized\n", 71 | "* Can handle categorical variables and missing data well\n", 72 | "\n", 73 | "### Cons\n", 74 | "\n", 75 | "* Typically don't have very strong prediction accuracy\n", 76 | "* Very sensitive to small changes in training data" 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": null, 82 | "metadata": { 83 | "collapsed": true 84 | }, 85 | "outputs": [], 86 | "source": [] 87 | } 88 | ], 89 | "metadata": { 90 | "kernelspec": { 91 | "display_name": "Python 2", 92 | "language": "python", 93 | "name": "python2" 94 | }, 95 | "language_info": { 96 | "codemirror_mode": { 97 | "name": "ipython", 98 | "version": 2 99 | }, 100 | "file_extension": ".py", 101 | "mimetype": "text/x-python", 102 | "name": "python", 103 | "nbconvert_exporter": "python", 104 | "pygments_lexer": "ipython2", 105 | "version": "2.7.13" 106 | } 107 | }, 108 | "nbformat": 4, 109 | "nbformat_minor": 2 110 | } 111 | -------------------------------------------------------------------------------- /lectures/.ipynb_checkpoints/Random Notes-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [], 3 | "metadata": {}, 4 | "nbformat": 4, 5 | "nbformat_minor": 0 6 | } 7 | -------------------------------------------------------------------------------- /lectures/Lecture_11_Boosting.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Boosting\n", 8 | "\n", 9 | "Boosting is another ensemble technique like random forests.
With random forests, we trained multiple decision trees independently and created independence by randomly sampling samples and features.\n", 10 | "\n", 11 | "Boosting, on the other hand, trains multiple models sequentially, where each model tries to improve on the areas where the previous models performed poorly. \n", 12 | "\n", 13 | "### Adaboost\n", 14 | "\n", 15 | "Adaboost works by starting with a single model. From that model we then make predictions on the **training** set. We can then see which training samples this first model got right and which it got wrong. Adaboost then trains the next model, but puts more weight on the training samples that the first model got wrong. This process continues for N models, where N is a hyper-parameter.\n", 16 | "\n", 17 | "It is important to note that the sequential nature of boosting makes it harder to scale and parallelize relative to random forest models. That being said, work has been done with boosting methods to allow them to be more parallelizable. A very popular library that does this is [XGboost](https://github.com/dmlc/xgboost).\n", 18 | "\n", 19 | "So - at the end of your training, you now have N models where each model is trained to do better on the instances that the previous models did poorly on. You can now combine these models via a weighted voting or averaging method very similar to random forest, except you weight each vote by how accurate the model was overall.\n", 20 | "\n", 21 | "Adaboost also has a hyper-parameter called **learning rate.** This hyper-parameter adjusts the contribution of each classifier. When decreasing the value, each new classifier makes smaller adjustments to the weights of mis-classified samples; basically, Adaboost learns more slowly per tree. Typically a lower learning rate requires more trees to perform well. This value and the number of trees can be tuned using cross-validation. \n", 22 | "\n", 23 | "#### Math\n", 24 | "\n", 25 | "So, how exactly do we do this?\n", 26 | "\n", 27 | "**Step 1**: Set all sample weights to $1/m$ when we have $m$ training examples\n", 28 | "\n", 29 | "**Step 2**: Train the first model\n", 30 | "\n", 31 | "**Step 3**: Calculate the weighted error of this model, which is simply the sum of the weights of the misclassified\n", 32 | "examples divided by the total weight of all samples. We will call this $r_j$ for the $j$th model. The best value is zero and the worst is 1.\n", 33 | "\n", 34 | "**Step 4**: Calculate the predictor's weight, where being more accurate gets a higher weight:\n", 35 | "\n", 36 | "$$\\alpha_{j} = \\eta \\log \\frac{1-r_{j}}{r_j}$$\n", 37 | "\n", 38 | "If the model is wrong more often than it is right, it gets a negative weight; if it is close to random, it gets a weight close to zero.\n", 39 | "\n", 40 | "**Step 5**: Update the weights of your training samples: if the model got a sample right, its weight remains the same. If it got it wrong, the new weight is:\n", 41 | "\n", 42 | "$$\\text{new weight} = \\text{old weight} \\times \\exp(\\alpha_{j})$$\n", 43 | "\n", 44 | "Then normalize all weights to sum to 1 by dividing each weight by the sum of the weights.\n", 45 | "\n", 46 | "You can see that a good predictor adds extra weight to its mis-classifications, putting a strong emphasis on them for the next model.
Also, $\\eta$ is our learning rate and can decrease the impact of a tree on weight updates by being less than 1.\n", 47 | "\n", 48 | "**Step 6**: Repeat from step 2 with next model and continue repeating until required number of models.\n", 49 | "\n", 50 | "That's it!\n", 51 | "\n", 52 | "Predictions are made by running a new sample through all the trained models, getting the most likely class (for classification) for each one, and then doing a weighted vote where the weight is the value from step 4. For regression just weighted average.\n", 53 | "\n", 54 | "Note: Almost always the model choosen for boosting is a decision tree.\n", 55 | "\n", 56 | "### Gradient Boosting\n", 57 | "\n", 58 | "Gradient boosting is also sequential, but instead of changing weights, it uses the residual errors from the previous model as the targets.\n", 59 | "\n", 60 | "Basically for regression, here are our steps:\n", 61 | "\n", 62 | "1. Fit a decision tree (assuming this is our base model and it is the most common)\n", 63 | "2. Calculate the residuals: true training values - predicted training values. Note: it turns out that this is the same as taking the negative gradient of the loss function, so this can generalize to other loss functions for classification and ranking tasks.\n", 64 | "3. Train a second decision tree where the residuals are the targets\n", 65 | "4. Continue the process for the number of defined estimators (hyper-parameter)\n", 66 | "\n", 67 | "At prediction time, we make a prediction with each tree and **sum** them. \n", 68 | "\n", 69 | "### Learning Rate\n", 70 | "\n", 71 | "The learning rate for gradient boosting trees, scales the contribution of each tree. \n", 72 | "\n", 73 | "\n", 74 | "### Early Stopping\n", 75 | "\n", 76 | "A good way to decide how many trees are needed is to use early stopping. See page 199 of Hands on Machine Learning for an example of this with sklearn. Basically, as you add an additional tree, you check your validation error and when your validation error stops getting better, you stop adding trees.\n", 77 | "\n", 78 | "\n", 79 | "### Some final notes\n", 80 | "\n", 81 | "* Boosting is more likely to overfit than random forests when the number of estimators is large. Though, usually slow to overfit.\n", 82 | "* Typical learning rates are around 0.01 or 0.001. And small learning rates can require a large number of estimators to achieve good results.\n", 83 | "* The max depth of the trees controls the complexity of the model and a max depth of 1 can often work well. This results in an additive model.\n", 84 | "* Can also get feature importance scores as with Random Forest.\n", 85 | "\n", 86 | "### Stacking\n", 87 | "\n", 88 | "We won't spend time discussing stacking as it tends to be too complex for industry. That being said, it is a good technique to be familiar with. Take a look at p. 200 of Hands on Machine Learning." 
89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "metadata": {}, 94 | "source": [ 95 | "## SKlearn Example" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": 2, 101 | "metadata": { 102 | "collapsed": true 103 | }, 104 | "outputs": [], 105 | "source": [ 106 | "from sklearn.datasets import load_breast_cancer\n", 107 | "from sklearn.ensemble import GradientBoostingClassifier\n", 108 | "from sklearn.model_selection import train_test_split, GridSearchCV\n", 109 | "from sklearn.metrics import f1_score, classification_report\n", 110 | "from collections import Counter\n", 111 | "import numpy as np\n", 112 | "import pandas as pd\n", 113 | "%matplotlib inline" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": 4, 119 | "metadata": { 120 | "collapsed": true 121 | }, 122 | "outputs": [], 123 | "source": [ 124 | "data = load_breast_cancer()\n", 125 | "X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.20, random_state=42)" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": 5, 131 | "metadata": { 132 | "collapsed": false 133 | }, 134 | "outputs": [ 135 | { 136 | "data": { 137 | "text/plain": [ 138 | "GridSearchCV(cv=None, error_score='raise',\n", 139 | " estimator=GradientBoostingClassifier(criterion='friedman_mse', init=None,\n", 140 | " learning_rate=0.1, loss='deviance', max_depth=3,\n", 141 | " max_features=None, max_leaf_nodes=None,\n", 142 | " min_impurity_decrease=0.0, min_impurity_split=None,\n", 143 | " min_samples_leaf=1, min_samples_split=2,\n", 144 | " min_weight_fraction_leaf=0.0, n_estimators=100,\n", 145 | " presort='auto', random_state=None, subsample=1.0, verbose=0,\n", 146 | " warm_start=False),\n", 147 | " fit_params=None, iid=True, n_jobs=1,\n", 148 | " param_grid={'learning_rate': [0.1, 0.01, 0.001], 'n_estimators': [100, 1000, 5000], 'max_depth': [1, 2, 3]},\n", 149 | " pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',\n", 150 | " scoring='f1', verbose=0)" 151 | ] 152 | }, 153 | "execution_count": 5, 154 | "metadata": {}, 155 | "output_type": "execute_result" 156 | } 157 | ], 158 | "source": [ 159 | "clf = GradientBoostingClassifier()\n", 160 | "gridsearch = GridSearchCV(clf, {\"learning_rate\": [.1, .01, .001], \"n_estimators\": [100, 1000, 5000], \n", 161 | " 'max_depth': [1, 2, 3]}, scoring='f1')\n", 162 | "gridsearch.fit(X_train, y_train)" 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": 10, 168 | "metadata": { 169 | "collapsed": false 170 | }, 171 | "outputs": [ 172 | { 173 | "name": "stdout", 174 | "output_type": "stream", 175 | "text": [ 176 | "Best Params: {'learning_rate': 0.01, 'max_depth': 1, 'n_estimators': 5000}\n", 177 | "\n", 178 | "Classification Report:\n", 179 | " precision recall f1-score support\n", 180 | "\n", 181 | " 0 0.95 0.93 0.94 43\n", 182 | " 1 0.96 0.97 0.97 71\n", 183 | "\n", 184 | "avg / total 0.96 0.96 0.96 114\n", 185 | "\n" 186 | ] 187 | } 188 | ], 189 | "source": [ 190 | "print(\"Best Params: {}\".format(gridsearch.best_params_))\n", 191 | "print(\"\\nClassification Report:\")\n", 192 | "print(classification_report(y_test, gridsearch.predict(X_test)))" 193 | ] 194 | } 195 | ], 196 | "metadata": { 197 | "kernelspec": { 198 | "display_name": "Python 3", 199 | "language": "python", 200 | "name": "python3" 201 | }, 202 | "language_info": { 203 | "codemirror_mode": { 204 | "name": "ipython", 205 | "version": 3 206 | }, 207 | "file_extension": ".py", 208 | "mimetype": "text/x-python", 209 | 
"name": "python", 210 | "nbconvert_exporter": "python", 211 | "pygments_lexer": "ipython3", 212 | "version": "3.5.3" 213 | } 214 | }, 215 | "nbformat": 4, 216 | "nbformat_minor": 2 217 | } 218 | -------------------------------------------------------------------------------- /lectures/Lecture_14_MLP.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Multi-layer perceptron\n", 8 | "\n", 9 | "I am not even going to try and write a better intro. to neural nets than this...\n", 10 | "\n", 11 | "https://jalammar.github.io/visual-interactive-guide-basics-neural-networks/\n", 12 | "\n", 13 | "### Softmax Equation\n", 14 | "\n", 15 | "Given an array of values of length n, the softmax of value i in the array is:\n", 16 | "\n", 17 | "$$\\frac{e^{i}}{\\sum_{j}^{n}e^{j}}$$\n", 18 | "\n", 19 | "### Deep Neural Network\n", 20 | "\n", 21 | "When you have multiple hidden layers - the layers in between the input and softmax layers, the network is called deep.\n", 22 | "\n", 23 | "### Backpropagation\n", 24 | "\n", 25 | "Neural nets are trained using a technique called backpropagation. At a very high level, you pass a training example through your network (forward pass), then measure its error, and then you go backwards through each layer to measure the contribution of each connection to the error (backwards pass). You then use this information to adjust the weights of your connections using gradient descent. \n", 26 | "\n", 27 | "### Vanishing/Exploding gradients\n", 28 | "\n", 29 | "When your gradients start to get too small or too large this can negatively effect learning. For example, a zero gradient will stop learning all together and when you gradients get too large your learning can diverge.\n", 30 | "\n", 31 | "### Activation Functions\n", 32 | "\n", 33 | "The article above does not talk much about activation functions. Typically, in an MLP after you pass connections to a neuron you then apply an activation function. Historically, that activation function was a logistic function, which then is basically logistic regression. This tends to suffer from vanishing gradient problem.\n", 34 | "\n", 35 | "Another very popular activation function now is relu. Relu(z) = max(0,z). This is very fast to compute and in practice works very well. This function suffers less from the vanishing gradient problem.\n", 36 | "\n", 37 | "One problem with relu is that the connections can die. This happens if the inputs to a neuron end up negative resulting in a zero gradient. Thus, the **leaky relu** was invented: Leaky Relu(x) = max($\\alpha$x, x) where $\\alpha$ is usually a value of 0.01 or 0.02. The $\\alpha$ value is the slope when x < 0 and ensures that the activation never truly dies, though it can become quite small.\n", 38 | "\n", 39 | "**Elu** is another activation function which generally performs the best but is slower to compute then a leaky relu. Again, when x > 0 you just get x. But when x < 0 you get $\\alpha$(exp(x) -1). $\\alpha$ represents the value that the function approaches when x is a large negative number. Usually, it is set to 1. This function is also smooth everywhere, including zero.\n", 40 | "\n", 41 | "### Batch Normalization\n", 42 | "\n", 43 | "As we have learned it is important to scale - or normalize - your data before feeding it to a neural net. 
Another important normalization step is right before your activation function, where you again normalize your data by subtracting the mean and dividing by the standard deviation. Since you are working with a batch, you use the batch mean and standard deviation. You also allow each batch normalization layer to learn an appropriate scaling and shifting factor for your standardized values. \n", 44 | "\n", 45 | "This technique has been shown to reduce the vanishing/exploding gradient problem, allow the use of larger learning rates, and be less sensitive to initialization. On the downside, it reduces runtime prediction speed.\n", 46 | "\n", 47 | "\n", 48 | "### Cross-entropy\n", 49 | "\n", 50 | "$$-\\frac{1}{m}\\sum_{i=1}^{m}\\sum_{k=1}^{K}y_{k}^{i}\\log(p_{k}^{i})$$\n", 51 | "\n", 52 | "Where:\n", 53 | "\n", 54 | "* $m$ - the number of data points\n", 55 | "* $K$ - the number of classes\n", 56 | "* $y_{k}^{i}$ - the true class value for row i, class k. Either a zero or one depending on whether k is the correct class\n", 57 | "* $p_{k}^{i}$ - the value predicted by your model for class k, row i. Usually from your softmax\n", 58 | "\n", 59 | "This is the cost function you are trying to minimize.\n", 60 | "\n", 61 | "### Important to Remember\n", 62 | "\n", 63 | "* Scale data - usually zero to one\n", 64 | "* Shuffle data\n", 65 | "\n", 66 | "### Tuning Hyper-parameters\n", 67 | "\n", 68 | "* Better to use random search\n", 69 | "* Start with reasonable, known architectures\n", 70 | "* Number of hidden layers:\n", 71 | " * It can often be valuable to have a deep network to learn a hierarchy of features. Deeper networks usually converge faster and generalize better. \n", 72 | " * More complex problems can often require deeper networks and more data\n", 73 | "* Number of neurons:\n", 74 | " * Typically size the layers to form a type of funnel with fewer and fewer neurons at each layer. This comes back to the hierarchy idea, where you might need more neurons to learn lower level features. \n", 75 | " * Also can try picking the same number of neurons for all layers to have fewer parameters to tune\n", 76 | "* Usually more value in going deeper than wider\n", 77 | "* Can try going deeper and wider than you think necessary and use regularization techniques, such as early stopping, to prevent overfitting.\n", 78 | "\n", 79 | "### Initialization\n", 80 | "\n", 81 | "It turns out with neural nets that how you initialize your weights can be quite important. Instead of purely random initialization, it is usually preferred to use either Xavier or He initialization. \n", 82 | "\n", 83 | "P. 278 of Hands on Machine Learning has a good description of these initializations.\n", 84 | "\n", 85 | "If you are going to use Relu or Elu activation functions, I would recommend He, which is supported by Keras:\n", 86 | "\n", 87 | "https://keras.io/initializers/#he_normal\n", 88 | "\n", 89 | "For He normal you initialize from a truncated normal distribution centered around 0 and with a standard deviation of sqrt(2 / number of inputs)\n", 90 | "\n", 91 | "### Transfer Learning\n", 92 | "\n", 93 | "It turns out that the weights of a neural network can be used by other networks with the same architecture. For example, imagine Google has trained a neural network on millions of images from google search to predict 100 categories.
Now, you would like to take a few thousand photos from your own photo collection and train a neural network to detect whether or not you are in a photo (binary classification).\n", 94 | "\n", 95 | "It turns out that you can start with Google's network and weights (Assuming you can get them) and use them as a starting place for your network. Assuming you are okay with the rest of their architecture, you would just need to change the last layer to 2 nodes intead of 100 and learn those weights from scratch.\n", 96 | "\n", 97 | "This is a really powerful idea and allows you to train much faster and with less data.\n", 98 | "\n", 99 | "This is such a good idea that you are almost always better starting with pre-trained weights if you can find them even if the problem they were trained on is not that close to your problem. Obviously, the closer the problems the better.\n", 100 | "\n", 101 | "Many deep learning frameworks have what are called model zoos where you can find pre-trained models. Keras' model zoo is here: https://keras.io/applications/\n", 102 | "\n", 103 | "You can find more details on how to perform some of these techniques using Keras here: https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html\n", 104 | "\n", 105 | "Lastly, another valuable option when you have little data is to pre-train your own network on related data. For example, if you want to classify whether you are in an image or not, but only have a few images of yourself. You can first train your network on images of people in general and then fine-tune your network with the images of yourself.\n", 106 | "\n", 107 | "### Optimizers\n", 108 | "\n", 109 | "We have previously discussed vanilla gradient descent where you move in the direction negative to the gradient in proportion to the learning rate. It turns out that there are faster techniques for finiding the minimum - or a minimum - in your cost function. These faster techniques are very valuable with neural nets which already take a long time to train.\n", 110 | "\n", 111 | "We won't cover these in too much detail, but there is a good description here:\n", 112 | "\n", 113 | "http://ruder.io/optimizing-gradient-descent/index.html#whichoptimizertochoose\n", 114 | "\n", 115 | "And starting on p.295 of Hands on Machine Learning.\n", 116 | "\n", 117 | "Generally, a good place to start can be the Adam optimizer.\n", 118 | "\n", 119 | "### Regularization\n", 120 | "\n", 121 | "As we have discussed, neural nets can be quite prone to over fitting. Thus, we have some techniques to combat this:\n", 122 | "\n", 123 | "* **Early Stopping:** Keep track of your validation error after every iteration and stop training when it stops going down. Usually, you would say something like: if the validation error has not decreased in 5 continuous iterations, stop.\n", 124 | "* **L1 and L2:** Just like logistic and linear regression, we can add a penalty term for large weights.\n", 125 | "* ** Dropout:** This is probably the most popular regularization method and is seen in many architectures. It is simple: at every iteration, every neuron has some probability of being turned off or inactive during that iteration (except the output neurons). This probability is usually referred to as the dropout rate and is a hyper parameter you have to choose. What this does, is it forces the network to become pretty robust. At anytime, it can lose a neuron and thus can't learn to become too dependent on a small set of neurons - including the input. 
This isn't too different from random forest where each decision tree sees slightly different samples and features. With dropout, every iteration is a slightly different neural net that sees different features (or neurons). \n", 126 | "\n", 127 | "Keras has a dropout layer: https://keras.io/layers/core/#dropout\n", 128 | "\n", 129 | "### Data Augmentation\n", 130 | "\n", 131 | "Neural nets - especially deep ones - love data. Sometimes you don't have a lot of data or would like more data, but getting additional samples can be costly. One way of dealing with this is by augmenting your current data via transformations.\n", 132 | "\n", 133 | "This idea is quite popular in computer vision. Say your task is to predict whether or not a dog is in an image and you have 5,000 images. To get more images you can randomly transform the 5,000 images you have. For example, you can change the rotation, brightness, size, etc. This then creates additional data while still not changing the label (the picture still contains a dog or not).\n", 134 | "\n", 135 | "These augmentations usually lead to your net being more robust to the transformations you applied and less prone to over-fit.\n", 136 | "\n", 137 | "Keras supports image data augmentation: https://keras.io/preprocessing/image/\n", 138 | "\n", 139 | "Even if your data are not images, though, you may be able to think of some creative ways of augmenting your data." 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": 1, 145 | "metadata": {}, 146 | "outputs": [ 147 | { 148 | "name": "stdout", 149 | "output_type": "stream", 150 | "text": [ 151 | "[0.0, 0.0, 0.02, 0.0, 0.98]\n", 152 | "1.0\n" 153 | ] 154 | } 155 | ], 156 | "source": [ 157 | "import numpy as np\n", 158 | "\n", 159 | "values = np.array([1.0, 3.0, 8.0, 4.0, 12.0])\n", 160 | "exp_values = np.exp(values)\n", 161 | "softmax = exp_values / sum(exp_values)\n", 162 | "print([round(x,2) for x in softmax])\n", 163 | "print(sum(softmax))" 164 | ] 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "metadata": {}, 169 | "source": [ 170 | "## Example using Python" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": 35, 176 | "metadata": {}, 177 | "outputs": [], 178 | "source": [ 179 | "from keras.models import Sequential\n", 180 | "from keras.layers import Dense, Activation, Dropout, BatchNormalization\n", 181 | "from keras.utils import np_utils\n", 182 | "from keras.datasets import mnist\n", 183 | "from sklearn.metrics import confusion_matrix\n", 184 | "import numpy as np\n", 185 | "from __future__ import division" 186 | ] 187 | }, 188 | { 189 | "cell_type": "code", 190 | "execution_count": 3, 191 | "metadata": {}, 192 | "outputs": [ 193 | { 194 | "name": "stdout", 195 | "output_type": "stream", 196 | "text": [ 197 | "Downloading data from https://s3.amazonaws.com/img-datasets/mnist.npz\n", 198 | "11337728/11490434 [============================>.] 
- ETA: 0s" 199 | ] 200 | } 201 | ], 202 | "source": [ 203 | "(x_train, y_train), (x_test, y_test) = mnist.load_data()\n", 204 | "\n", 205 | "y_train = np_utils.to_categorical(y_train, 10)\n", 206 | "y_test = np_utils.to_categorical(y_test, 10)\n", 207 | "\n", 208 | "def vectorize_image(images):\n", 209 | " scaled_images = images / 255\n", 210 | " return images.reshape(scaled_images.shape[0],-1)\n", 211 | "\n", 212 | "x_train = vectorize_image(x_train)\n", 213 | "x_test = vectorize_image(x_test)" 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": 51, 219 | "metadata": {}, 220 | "outputs": [], 221 | "source": [ 222 | "model = Sequential([\n", 223 | " Dense(128, input_shape=(784,), activation='elu', kernel_initializer='he_normal'),\n", 224 | " BatchNormalization(),\n", 225 | " Dropout(0.5),\n", 226 | " Dense(64, activation='elu', kernel_initializer='he_normal'),\n", 227 | " BatchNormalization(),\n", 228 | " Dropout(0.5),\n", 229 | " Dense(10, activation='softmax')\n", 230 | "])" 231 | ] 232 | }, 233 | { 234 | "cell_type": "code", 235 | "execution_count": 52, 236 | "metadata": {}, 237 | "outputs": [ 238 | { 239 | "name": "stdout", 240 | "output_type": "stream", 241 | "text": [ 242 | "_________________________________________________________________\n", 243 | "Layer (type) Output Shape Param # \n", 244 | "=================================================================\n", 245 | "dense_17 (Dense) (None, 128) 100480 \n", 246 | "_________________________________________________________________\n", 247 | "batch_normalization_3 (Batch (None, 128) 512 \n", 248 | "_________________________________________________________________\n", 249 | "dropout_5 (Dropout) (None, 128) 0 \n", 250 | "_________________________________________________________________\n", 251 | "dense_18 (Dense) (None, 64) 8256 \n", 252 | "_________________________________________________________________\n", 253 | "batch_normalization_4 (Batch (None, 64) 256 \n", 254 | "_________________________________________________________________\n", 255 | "dropout_6 (Dropout) (None, 64) 0 \n", 256 | "_________________________________________________________________\n", 257 | "dense_19 (Dense) (None, 10) 650 \n", 258 | "=================================================================\n", 259 | "Total params: 110,154\n", 260 | "Trainable params: 109,770\n", 261 | "Non-trainable params: 384\n", 262 | "_________________________________________________________________\n" 263 | ] 264 | } 265 | ], 266 | "source": [ 267 | "model.summary()" 268 | ] 269 | }, 270 | { 271 | "cell_type": "code", 272 | "execution_count": 53, 273 | "metadata": {}, 274 | "outputs": [], 275 | "source": [ 276 | "model.compile(optimizer='adam',\n", 277 | " loss='categorical_crossentropy')" 278 | ] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "execution_count": 54, 283 | "metadata": {}, 284 | "outputs": [ 285 | { 286 | "name": "stdout", 287 | "output_type": "stream", 288 | "text": [ 289 | "Train on 54000 samples, validate on 6000 samples\n", 290 | "Epoch 1/20\n", 291 | "54000/54000 [==============================] - 5s - loss: 0.5970 - val_loss: 0.1660\n", 292 | "Epoch 2/20\n", 293 | "54000/54000 [==============================] - 4s - loss: 0.3253 - val_loss: 0.1273\n", 294 | "Epoch 3/20\n", 295 | "54000/54000 [==============================] - 5s - loss: 0.2739 - val_loss: 0.1079\n", 296 | "Epoch 4/20\n", 297 | "54000/54000 [==============================] - 5s - loss: 0.2447 - val_loss: 0.0962\n", 298 | "Epoch 5/20\n", 299 | "54000/54000 
[==============================] - 5s - loss: 0.2249 - val_loss: 0.0907\n", 300 | "Epoch 6/20\n", 301 | "54000/54000 [==============================] - 5s - loss: 0.2093 - val_loss: 0.0887\n", 302 | "Epoch 7/20\n", 303 | "54000/54000 [==============================] - 5s - loss: 0.1976 - val_loss: 0.0876\n", 304 | "Epoch 8/20\n", 305 | "54000/54000 [==============================] - 6s - loss: 0.1897 - val_loss: 0.0849\n", 306 | "Epoch 9/20\n", 307 | "54000/54000 [==============================] - 6s - loss: 0.1842 - val_loss: 0.0833\n", 308 | "Epoch 10/20\n", 309 | "54000/54000 [==============================] - 6s - loss: 0.1738 - val_loss: 0.0832\n", 310 | "Epoch 11/20\n", 311 | "54000/54000 [==============================] - 6s - loss: 0.1679 - val_loss: 0.0813\n", 312 | "Epoch 12/20\n", 313 | "54000/54000 [==============================] - 7s - loss: 0.1617 - val_loss: 0.0772\n", 314 | "Epoch 13/20\n", 315 | "54000/54000 [==============================] - 7s - loss: 0.1620 - val_loss: 0.0801\n", 316 | "Epoch 14/20\n", 317 | "54000/54000 [==============================] - 7s - loss: 0.1539 - val_loss: 0.0719\n", 318 | "Epoch 15/20\n", 319 | "54000/54000 [==============================] - 8s - loss: 0.1522 - val_loss: 0.0809\n", 320 | "Epoch 16/20\n", 321 | "54000/54000 [==============================] - 9s - loss: 0.1486 - val_loss: 0.0782\n", 322 | "Epoch 17/20\n", 323 | "54000/54000 [==============================] - 9s - loss: 0.1483 - val_loss: 0.0712\n", 324 | "Epoch 18/20\n", 325 | "54000/54000 [==============================] - 10s - loss: 0.1437 - val_loss: 0.0733\n", 326 | "Epoch 19/20\n", 327 | "54000/54000 [==============================] - 10s - loss: 0.1402 - val_loss: 0.0708\n", 328 | "Epoch 20/20\n", 329 | "54000/54000 [==============================] - 10s - loss: 0.1379 - val_loss: 0.0707\n" 330 | ] 331 | }, 332 | { 333 | "data": { 334 | "text/plain": [ 335 | "" 336 | ] 337 | }, 338 | "execution_count": 54, 339 | "metadata": {}, 340 | "output_type": "execute_result" 341 | } 342 | ], 343 | "source": [ 344 | "model.fit(x_train, y_train, epochs=20, batch_size=64, validation_split=0.1)" 345 | ] 346 | }, 347 | { 348 | "cell_type": "code", 349 | "execution_count": 55, 350 | "metadata": {}, 351 | "outputs": [], 352 | "source": [ 353 | "test_predictions = np.argmax(model.predict(x_test),1)\n", 354 | "y_test_sparse = np.argmax(y_test, 1)" 355 | ] 356 | }, 357 | { 358 | "cell_type": "code", 359 | "execution_count": 56, 360 | "metadata": {}, 361 | "outputs": [ 362 | { 363 | "data": { 364 | "text/plain": [ 365 | "array([[ 969, 0, 1, 1, 1, 2, 3, 1, 2, 0],\n", 366 | " [ 0, 1126, 3, 1, 0, 0, 1, 0, 4, 0],\n", 367 | " [ 1, 3, 1006, 5, 2, 0, 1, 9, 5, 0],\n", 368 | " [ 1, 0, 6, 991, 0, 3, 0, 6, 3, 0],\n", 369 | " [ 0, 0, 3, 0, 966, 0, 6, 1, 1, 5],\n", 370 | " [ 2, 1, 0, 10, 1, 863, 5, 3, 5, 2],\n", 371 | " [ 5, 3, 1, 0, 2, 6, 937, 0, 4, 0],\n", 372 | " [ 1, 4, 9, 2, 0, 0, 0, 1008, 0, 4],\n", 373 | " [ 5, 2, 2, 6, 5, 2, 3, 5, 942, 2],\n", 374 | " [ 4, 6, 0, 10, 14, 1, 0, 8, 4, 962]])" 375 | ] 376 | }, 377 | "execution_count": 56, 378 | "metadata": {}, 379 | "output_type": "execute_result" 380 | } 381 | ], 382 | "source": [ 383 | "confusion_matrix(y_test_sparse, test_predictions)" 384 | ] 385 | }, 386 | { 387 | "cell_type": "code", 388 | "execution_count": 57, 389 | "metadata": {}, 390 | "outputs": [ 391 | { 392 | "data": { 393 | "text/plain": [ 394 | "array([0.977])" 395 | ] 396 | }, 397 | "execution_count": 57, 398 | "metadata": {}, 399 | "output_type": "execute_result" 400 | } 401 | ], 
402 | "source": [ 403 | "np.sum(y_test_sparse == test_predictions) / test_predictions.shape" 404 | ] 405 | } 406 | ], 407 | "metadata": { 408 | "kernelspec": { 409 | "display_name": "Python 3", 410 | "language": "python", 411 | "name": "python3" 412 | }, 413 | "language_info": { 414 | "codemirror_mode": { 415 | "name": "ipython", 416 | "version": 3 417 | }, 418 | "file_extension": ".py", 419 | "mimetype": "text/x-python", 420 | "name": "python", 421 | "nbconvert_exporter": "python", 422 | "pygments_lexer": "ipython3", 423 | "version": "3.5.3" 424 | } 425 | }, 426 | "nbformat": 4, 427 | "nbformat_minor": 2 428 | } 429 | -------------------------------------------------------------------------------- /lectures/Lecture_15_Conv_Nets.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Convolutional Neural Networks\n", 8 | "\n", 9 | "Again, I am not going to even try to do a better job than this post...:\n", 10 | "\n", 11 | "https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/\n", 12 | "\n", 13 | "Now, let's review some of the famous CNN architectures on starting on p.367 of Hands on Machine Learning.\n", 14 | "\n", 15 | "## Python Conv Net" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 1, 21 | "metadata": {}, 22 | "outputs": [ 23 | { 24 | "name": "stderr", 25 | "output_type": "stream", 26 | "text": [ 27 | "Using TensorFlow backend.\n" 28 | ] 29 | } 30 | ], 31 | "source": [ 32 | "from keras.models import Sequential, Model\n", 33 | "from keras.layers import Dense, Activation, Conv2D, MaxPooling2D, Flatten, BatchNormalization, BatchNormalization, Dropout\n", 34 | "from keras.datasets import mnist\n", 35 | "from sklearn.metrics import confusion_matrix\n", 36 | "from keras import applications\n", 37 | "from keras.preprocessing.image import ImageDataGenerator\n", 38 | "from skimage.transform import rescale, resize\n", 39 | "import matplotlib.pyplot as plt\n", 40 | "from keras.preprocessing.image import ImageDataGenerator\n", 41 | "%matplotlib inline\n", 42 | "import numpy as np" 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": 2, 48 | "metadata": {}, 49 | "outputs": [], 50 | "source": [ 51 | "(x_train, y_train), (x_test, y_test) = mnist.load_data()\n", 52 | "\n", 53 | "def scale_image(images):\n", 54 | " images = images.reshape(images.shape[0], 28, 28, 1)\n", 55 | " images = images.astype('float32')\n", 56 | " scaled_images = images / 255\n", 57 | " return scaled_images\n", 58 | "\n", 59 | "x_train = scale_image(x_train)\n", 60 | "x_test = scale_image(x_test)" 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": 3, 66 | "metadata": {}, 67 | "outputs": [ 68 | { 69 | "data": { 70 | "text/plain": [ 71 | "(60000, 28, 28, 1)" 72 | ] 73 | }, 74 | "execution_count": 3, 75 | "metadata": {}, 76 | "output_type": "execute_result" 77 | } 78 | ], 79 | "source": [ 80 | "x_train.shape" 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": 4, 86 | "metadata": {}, 87 | "outputs": [], 88 | "source": [ 89 | "model = Sequential()\n", 90 | "\n", 91 | "model.add(Conv2D(32, kernel_size=(5, 5), activation='elu', input_shape=(28, 28, 1), kernel_initializer='he_normal'))\n", 92 | "model.add(BatchNormalization())\n", 93 | "model.add(MaxPooling2D(pool_size=(2, 2)))\n", 94 | "model.add(Dropout(0.5))\n", 95 | "\n", 96 | "model.add(Conv2D(64, kernel_size=(5, 5), activation='elu', 
kernel_initializer='he_normal'))\n", 97 | "model.add(BatchNormalization())\n", 98 | "model.add(MaxPooling2D(pool_size=(2, 2)))\n", 99 | "model.add(Dropout(0.5))\n", 100 | "\n", 101 | "model.add(Flatten())\n", 102 | "model.add(Dense(256, activation='elu', kernel_initializer='he_normal'))\n", 103 | "model.add(Dropout(0.5))\n", 104 | "model.add(Dense(128, activation='elu', kernel_initializer='he_normal'))\n", 105 | "model.add(Dropout(0.5))\n", 106 | "model.add(Dense(10, activation='softmax'))" 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": 5, 112 | "metadata": {}, 113 | "outputs": [ 114 | { 115 | "name": "stdout", 116 | "output_type": "stream", 117 | "text": [ 118 | "_________________________________________________________________\n", 119 | "Layer (type) Output Shape Param # \n", 120 | "=================================================================\n", 121 | "conv2d_1 (Conv2D) (None, 24, 24, 32) 832 \n", 122 | "_________________________________________________________________\n", 123 | "batch_normalization_1 (Batch (None, 24, 24, 32) 128 \n", 124 | "_________________________________________________________________\n", 125 | "max_pooling2d_1 (MaxPooling2 (None, 12, 12, 32) 0 \n", 126 | "_________________________________________________________________\n", 127 | "dropout_1 (Dropout) (None, 12, 12, 32) 0 \n", 128 | "_________________________________________________________________\n", 129 | "conv2d_2 (Conv2D) (None, 8, 8, 64) 51264 \n", 130 | "_________________________________________________________________\n", 131 | "batch_normalization_2 (Batch (None, 8, 8, 64) 256 \n", 132 | "_________________________________________________________________\n", 133 | "max_pooling2d_2 (MaxPooling2 (None, 4, 4, 64) 0 \n", 134 | "_________________________________________________________________\n", 135 | "dropout_2 (Dropout) (None, 4, 4, 64) 0 \n", 136 | "_________________________________________________________________\n", 137 | "flatten_1 (Flatten) (None, 1024) 0 \n", 138 | "_________________________________________________________________\n", 139 | "dense_1 (Dense) (None, 256) 262400 \n", 140 | "_________________________________________________________________\n", 141 | "dropout_3 (Dropout) (None, 256) 0 \n", 142 | "_________________________________________________________________\n", 143 | "dense_2 (Dense) (None, 128) 32896 \n", 144 | "_________________________________________________________________\n", 145 | "dropout_4 (Dropout) (None, 128) 0 \n", 146 | "_________________________________________________________________\n", 147 | "dense_3 (Dense) (None, 10) 1290 \n", 148 | "=================================================================\n", 149 | "Total params: 349,066\n", 150 | "Trainable params: 348,874\n", 151 | "Non-trainable params: 192\n", 152 | "_________________________________________________________________\n" 153 | ] 154 | } 155 | ], 156 | "source": [ 157 | "model.summary()" 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": 6, 163 | "metadata": {}, 164 | "outputs": [], 165 | "source": [ 166 | "model.compile(optimizer='adam',\n", 167 | " loss='sparse_categorical_crossentropy')" 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": 7, 173 | "metadata": {}, 174 | "outputs": [ 175 | { 176 | "name": "stdout", 177 | "output_type": "stream", 178 | "text": [ 179 | "Train on 48000 samples, validate on 12000 samples\n", 180 | "Epoch 1/5\n", 181 | "48000/48000 [==============================] - 16s - loss: 0.6392 - 
val_loss: 0.1036\n", 182 | "Epoch 2/5\n", 183 | "48000/48000 [==============================] - 14s - loss: 0.2201 - val_loss: 0.0776\n", 184 | "Epoch 3/5\n", 185 | "48000/48000 [==============================] - 14s - loss: 0.1676 - val_loss: 0.0638\n", 186 | "Epoch 4/5\n", 187 | "48000/48000 [==============================] - 14s - loss: 0.1468 - val_loss: 0.0494\n", 188 | "Epoch 5/5\n", 189 | "48000/48000 [==============================] - 14s - loss: 0.1259 - val_loss: 0.0466\n" 190 | ] 191 | }, 192 | { 193 | "data": { 194 | "text/plain": [ 195 | "" 196 | ] 197 | }, 198 | "execution_count": 7, 199 | "metadata": {}, 200 | "output_type": "execute_result" 201 | } 202 | ], 203 | "source": [ 204 | "model.fit(x_train, y_train, epochs=5, batch_size=32, validation_split=0.2)" 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": 8, 210 | "metadata": {}, 211 | "outputs": [], 212 | "source": [ 213 | "test_predictions = np.argmax(model.predict(x_test),1)" 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": 9, 219 | "metadata": {}, 220 | "outputs": [ 221 | { 222 | "data": { 223 | "text/plain": [ 224 | "array([[ 974, 0, 1, 0, 0, 0, 4, 1, 0, 0],\n", 225 | " [ 0, 1131, 1, 0, 0, 1, 1, 1, 0, 0],\n", 226 | " [ 1, 3, 1024, 0, 1, 0, 0, 3, 0, 0],\n", 227 | " [ 0, 0, 4, 993, 0, 11, 0, 1, 1, 0],\n", 228 | " [ 0, 0, 0, 0, 969, 0, 5, 0, 0, 8],\n", 229 | " [ 2, 0, 0, 2, 0, 886, 1, 1, 0, 0],\n", 230 | " [ 3, 2, 0, 0, 1, 4, 947, 0, 1, 0],\n", 231 | " [ 1, 4, 7, 2, 0, 0, 0, 1012, 1, 1],\n", 232 | " [ 4, 0, 1, 0, 1, 0, 1, 2, 961, 4],\n", 233 | " [ 2, 4, 0, 2, 4, 7, 0, 4, 0, 986]])" 234 | ] 235 | }, 236 | "execution_count": 9, 237 | "metadata": {}, 238 | "output_type": "execute_result" 239 | } 240 | ], 241 | "source": [ 242 | "confusion_matrix(y_test, test_predictions)" 243 | ] 244 | }, 245 | { 246 | "cell_type": "code", 247 | "execution_count": 10, 248 | "metadata": {}, 249 | "outputs": [ 250 | { 251 | "data": { 252 | "text/plain": [ 253 | "array([ 0.9883])" 254 | ] 255 | }, 256 | "execution_count": 10, 257 | "metadata": {}, 258 | "output_type": "execute_result" 259 | } 260 | ], 261 | "source": [ 262 | "np.sum(y_test == test_predictions) / test_predictions.shape" 263 | ] 264 | }, 265 | { 266 | "cell_type": "markdown", 267 | "metadata": {}, 268 | "source": [ 269 | "## Pre-training and Data Augmentation\n", 270 | "\n", 271 | "Source: https://gist.github.com/fchollet/7eb39b44eb9e16e59632d25fb3119975" 272 | ] 273 | }, 274 | { 275 | "cell_type": "code", 276 | "execution_count": 11, 277 | "metadata": {}, 278 | "outputs": [ 279 | { 280 | "name": "stdout", 281 | "output_type": "stream", 282 | "text": [ 283 | "Model loaded.\n" 284 | ] 285 | } 286 | ], 287 | "source": [ 288 | "base_model = applications.VGG16(weights='imagenet', include_top=False, input_shape=(48,48,3))\n", 289 | "print('Model loaded.')\n", 290 | "\n", 291 | "# build a classifier model to put on top of the convolutional model\n", 292 | "top_model = Sequential([\n", 293 | " Flatten(input_shape=base_model.output_shape[1:]),\n", 294 | " Dense(128, activation='elu', kernel_initializer='he_normal'),\n", 295 | " BatchNormalization(),\n", 296 | " Dropout(0.5),\n", 297 | " Dense(64, activation='elu', kernel_initializer='he_normal'),\n", 298 | " BatchNormalization(),\n", 299 | " Dropout(0.5),\n", 300 | " Dense(10, activation='softmax')\n", 301 | "])\n", 302 | "\n", 303 | "\n", 304 | "# compile the model with a SGD/momentum optimizer\n", 305 | "# and a very slow learning rate.\n", 306 | 
"top_model.compile(optimizer='adam',\n", 307 | " loss='sparse_categorical_crossentropy',\n", 308 | " metrics=['accuracy'])" 309 | ] 310 | }, 311 | { 312 | "cell_type": "code", 313 | "execution_count": 12, 314 | "metadata": {}, 315 | "outputs": [], 316 | "source": [ 317 | "def preprocess_images(images, new_size=(48,48)):\n", 318 | " resized_images = np.array([resize(x, new_size, mode='reflect') for x in images])\n", 319 | " rows, width, height, _ = resized_images.shape\n", 320 | " three_channels = np.zeros((rows, width, height, 3))\n", 321 | " three_channels[:,:,:,0] = resized_images[:,:,:,0]\n", 322 | " three_channels[:,:,:,1] = resized_images[:,:,:,0]\n", 323 | " three_channels[:,:,:,2] = resized_images[:,:,:,0]\n", 324 | " return three_channels" 325 | ] 326 | }, 327 | { 328 | "cell_type": "code", 329 | "execution_count": 13, 330 | "metadata": {}, 331 | "outputs": [], 332 | "source": [ 333 | "x_train_processed = preprocess_images(x_train)\n", 334 | "x_test_processed = preprocess_images(x_test)" 335 | ] 336 | }, 337 | { 338 | "cell_type": "code", 339 | "execution_count": 14, 340 | "metadata": {}, 341 | "outputs": [], 342 | "source": [ 343 | "## could augment data here\n", 344 | "datagen = ImageDataGenerator()\n", 345 | "generator = datagen.flow(x_train_processed, y_train, batch_size=32, shuffle=False)" 346 | ] 347 | }, 348 | { 349 | "cell_type": "code", 350 | "execution_count": 15, 351 | "metadata": {}, 352 | "outputs": [ 353 | { 354 | "name": "stdout", 355 | "output_type": "stream", 356 | "text": [ 357 | "1874/1875 [============================>.] - ETA: 0s" 358 | ] 359 | } 360 | ], 361 | "source": [ 362 | "bottleneck_features_train = base_model.predict_generator(generator, x_train_processed.shape[0] // 32, verbose=1)" 363 | ] 364 | }, 365 | { 366 | "cell_type": "code", 367 | "execution_count": 16, 368 | "metadata": {}, 369 | "outputs": [], 370 | "source": [ 371 | "bottleneck_features_test = base_model.predict(x_test_processed)" 372 | ] 373 | }, 374 | { 375 | "cell_type": "code", 376 | "execution_count": 17, 377 | "metadata": {}, 378 | "outputs": [ 379 | { 380 | "name": "stdout", 381 | "output_type": "stream", 382 | "text": [ 383 | "Train on 54000 samples, validate on 6000 samples\n", 384 | "Epoch 1/20\n", 385 | "54000/54000 [==============================] - 11s - loss: 0.3863 - acc: 0.8813 - val_loss: 0.0922 - val_acc: 0.9698\n", 386 | "Epoch 2/20\n", 387 | "54000/54000 [==============================] - 11s - loss: 0.1983 - acc: 0.9400 - val_loss: 0.0990 - val_acc: 0.9668\n", 388 | "Epoch 3/20\n", 389 | "54000/54000 [==============================] - 11s - loss: 0.1696 - acc: 0.9482 - val_loss: 0.0675 - val_acc: 0.9783\n", 390 | "Epoch 4/20\n", 391 | "54000/54000 [==============================] - 11s - loss: 0.1568 - acc: 0.9523 - val_loss: 0.0759 - val_acc: 0.9752\n", 392 | "Epoch 5/20\n", 393 | "54000/54000 [==============================] - 12s - loss: 0.1469 - acc: 0.9560 - val_loss: 0.0801 - val_acc: 0.9725\n", 394 | "Epoch 6/20\n", 395 | "54000/54000 [==============================] - 11s - loss: 0.1385 - acc: 0.9590 - val_loss: 0.0854 - val_acc: 0.9732\n", 396 | "Epoch 7/20\n", 397 | "54000/54000 [==============================] - 11s - loss: 0.1343 - acc: 0.9594 - val_loss: 0.0573 - val_acc: 0.9833\n", 398 | "Epoch 8/20\n", 399 | "54000/54000 [==============================] - 11s - loss: 0.1284 - acc: 0.9609 - val_loss: 0.0647 - val_acc: 0.9787\n", 400 | "Epoch 9/20\n", 401 | "54000/54000 [==============================] - 10s - loss: 0.1286 - acc: 0.9614 - val_loss: 0.0634 - 
val_acc: 0.9820\n", 402 | "Epoch 10/20\n", 403 | "54000/54000 [==============================] - 12s - loss: 0.1216 - acc: 0.9630 - val_loss: 0.0620 - val_acc: 0.9807\n", 404 | "Epoch 11/20\n", 405 | "54000/54000 [==============================] - 12s - loss: 0.1209 - acc: 0.9634 - val_loss: 0.0558 - val_acc: 0.9825\n", 406 | "Epoch 12/20\n", 407 | "54000/54000 [==============================] - 10s - loss: 0.1232 - acc: 0.9632 - val_loss: 0.0587 - val_acc: 0.9812\n", 408 | "Epoch 13/20\n", 409 | "54000/54000 [==============================] - 10s - loss: 0.1173 - acc: 0.9643 - val_loss: 0.0563 - val_acc: 0.9820\n", 410 | "Epoch 14/20\n", 411 | "54000/54000 [==============================] - 11s - loss: 0.1144 - acc: 0.9646 - val_loss: 0.0820 - val_acc: 0.9738\n", 412 | "Epoch 15/20\n", 413 | "54000/54000 [==============================] - 11s - loss: 0.1128 - acc: 0.9656 - val_loss: 0.0613 - val_acc: 0.9812\n", 414 | "Epoch 16/20\n", 415 | "54000/54000 [==============================] - 11s - loss: 0.1146 - acc: 0.9660 - val_loss: 0.0552 - val_acc: 0.9842\n", 416 | "Epoch 17/20\n", 417 | "54000/54000 [==============================] - 11s - loss: 0.1110 - acc: 0.9674 - val_loss: 0.0631 - val_acc: 0.9798\n", 418 | "Epoch 18/20\n", 419 | "54000/54000 [==============================] - 11s - loss: 0.1078 - acc: 0.9671 - val_loss: 0.0539 - val_acc: 0.9838\n", 420 | "Epoch 19/20\n", 421 | "54000/54000 [==============================] - 11s - loss: 0.1095 - acc: 0.9671 - val_loss: 0.0523 - val_acc: 0.9838\n", 422 | "Epoch 20/20\n", 423 | "54000/54000 [==============================] - 11s - loss: 0.1076 - acc: 0.9667 - val_loss: 0.0567 - val_acc: 0.9835\n" 424 | ] 425 | }, 426 | { 427 | "data": { 428 | "text/plain": [ 429 | "" 430 | ] 431 | }, 432 | "execution_count": 17, 433 | "metadata": {}, 434 | "output_type": "execute_result" 435 | } 436 | ], 437 | "source": [ 438 | "top_model.fit(bottleneck_features_train, y_train, epochs=20, batch_size=32, validation_split=0.1)" 439 | ] 440 | }, 441 | { 442 | "cell_type": "code", 443 | "execution_count": 18, 444 | "metadata": {}, 445 | "outputs": [], 446 | "source": [ 447 | "test_predictions = np.argmax(top_model.predict(bottleneck_features_test),1)" 448 | ] 449 | }, 450 | { 451 | "cell_type": "code", 452 | "execution_count": 19, 453 | "metadata": {}, 454 | "outputs": [ 455 | { 456 | "data": { 457 | "text/plain": [ 458 | "array([[ 966, 0, 2, 0, 0, 4, 2, 2, 3, 1],\n", 459 | " [ 0, 1128, 0, 1, 1, 0, 0, 4, 1, 0],\n", 460 | " [ 0, 1, 969, 17, 2, 15, 0, 21, 5, 2],\n", 461 | " [ 0, 0, 4, 989, 0, 13, 0, 2, 2, 0],\n", 462 | " [ 0, 2, 2, 0, 962, 1, 0, 5, 5, 5],\n", 463 | " [ 0, 0, 2, 6, 0, 877, 0, 2, 5, 0],\n", 464 | " [ 6, 2, 1, 0, 1, 2, 945, 0, 1, 0],\n", 465 | " [ 1, 3, 3, 0, 5, 1, 0, 1014, 0, 1],\n", 466 | " [ 0, 0, 2, 2, 3, 3, 1, 1, 960, 2],\n", 467 | " [ 1, 0, 4, 1, 10, 2, 0, 12, 8, 971]])" 468 | ] 469 | }, 470 | "execution_count": 19, 471 | "metadata": {}, 472 | "output_type": "execute_result" 473 | } 474 | ], 475 | "source": [ 476 | "confusion_matrix(y_test, test_predictions)" 477 | ] 478 | }, 479 | { 480 | "cell_type": "code", 481 | "execution_count": 20, 482 | "metadata": {}, 483 | "outputs": [ 484 | { 485 | "data": { 486 | "text/plain": [ 487 | "array([ 0.9781])" 488 | ] 489 | }, 490 | "execution_count": 20, 491 | "metadata": {}, 492 | "output_type": "execute_result" 493 | } 494 | ], 495 | "source": [ 496 | "np.sum(y_test == test_predictions) / test_predictions.shape" 497 | ] 498 | } 499 | ], 500 | "metadata": { 501 | "kernelspec": { 502 | 
"display_name": "Python 3", 503 | "language": "python", 504 | "name": "python3" 505 | }, 506 | "language_info": { 507 | "codemirror_mode": { 508 | "name": "ipython", 509 | "version": 3 510 | }, 511 | "file_extension": ".py", 512 | "mimetype": "text/x-python", 513 | "name": "python", 514 | "nbconvert_exporter": "python", 515 | "pygments_lexer": "ipython3", 516 | "version": "3.5.3" 517 | } 518 | }, 519 | "nbformat": 4, 520 | "nbformat_minor": 2 521 | } 522 | -------------------------------------------------------------------------------- /lectures/Lecture_1_Introduction.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### Introductions\n", 8 | "\n", 9 | "* Importance of getting to know people in the class\n", 10 | "* Review syllabus\n", 11 | "* Give out slack information\n", 12 | "* When office hours?\n", 13 | "* Participation: ML news / paper for the day / discuss homework ideas / class discussions\n", 14 | "* Homework: Target 5 homeworks which will be pretty open-ended and almost like mini-projects\n", 15 | "* Quizzes: In class and around 3. Will test knowledge of the subject not coding\n", 16 | "* First time class being taught and I am very open to feedback. Also, feel free to submit pull requests to correct or add to content.\n", 17 | "* Review first homework" 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": {}, 23 | "source": [ 24 | "## What is Machine Learning?\n", 25 | "\n", 26 | "[Wikipedia](https://en.wikipedia.org/wiki/Machine_learning) tells us that Machine learning is, \"a field of computer science that gives computers **the ability to learn without being explicitly programmed**.\" It goes on to say, \"machine learning explores the study and construction of algorithms that can learn from and make predictions on data – such algorithms overcome following strictly static program instructions by making data-driven predictions or decisions, through **building a model from sample inputs**.\"\n", 27 | "\n", 28 | "\n", 29 | "### Learning from inputs\n", 30 | "\n", 31 | "What does it mean to learn from inputs without being explicitly programmed? Let us consider the classical machine learning problem: spam filtering.\n", 32 | "\n", 33 | "Let us imagine that we know nothing about machine learning, but are tasked with determining whether an email is spam or not. How might we do this?\n", 34 | "\n", 35 | "What do we like and not like about our approach? How would we keep it up to date with new types of spam?\n", 36 | "\n", 37 | "Now imagine we had a black-box that if given a great many examples of emails that are spam and not spam, could take these examples and learn from the text what spam looks like. For this class, we will call that black-box (though it won't be black for long) machine learning. And often the data we feed it to understand a problem are called features.\n", 38 | "\n", 39 | "What does learning from inputs look like? Let us consider [flappy bird](https://www.youtube.com/watch?v=79BWQUN_Njc).\n", 40 | "\n", 41 | "\n", 42 | "## Why machine learning?\n", 43 | "\n", 44 | "From our discussion, why do you think machine learning might be valuable?\n", 45 | "\n", 46 | "\n", 47 | "## Types of machine learning problems\n", 48 | "\n", 49 | "**Supervised** machine learning problems are ones for which you have labeled data. Labeled data means you give the algorithm the solution with the data and these solutions are called labels. 
For example, with spam classification the labels would be \"spam\" or \"not spam.\" Linear regression would be considered a supervised problem.\n", 50 | "\n", 51 | "**Unsupervised** machine learning is the opposite. It is not given any labels. These algorithms are often not as powerful, since they don't get the benefit of labels, but they can be extremely valuable when getting labeled data is expensive or impossible. An example would be clustering.\n", 52 | "\n", 53 | "**Regression** problems are a class of problems for which you are trying to predict a real number. For example, linear regression outputs a real number and could be used to predict housing prices.\n", 54 | "\n", 55 | "**Classification** problems are problems for which you are predicting a class. For example, spam prediction is a classification problem because you want to know whether your input falls into one of two classes. Logistic regression is an algorithm used for classification.\n", 56 | "\n", 57 | "**Ranking** problems are very popular in eCommerce. These models try to rank the items by how valuable they are to a user. For example, Netflix's movie recommendations. An example model is collaborative filtering.\n", 58 | "\n", 59 | "**Reinforcement Learning** is when you have an agent in an environment that gets to perform actions and receive rewards for actions. The model here learns the best actions to take to maximize rewards. The flappy bird video is an example of reinforcement learning. An example model is deep Q-networks.\n", 60 | "\n", 61 | "\n", 62 | "## Machine Learning and Econometrics\n", 63 | "\n", 64 | "How are they different?\n", 65 | "\n", 66 | "For one, they use different lingo. \n", 67 | "\n", 68 | "For another, econometrics is often more interested in understanding why things happen while machine learning often cares more about just the actual prediction being correct.\n", 69 | "\n", 70 | "Economic theory is often a driver in the development of econometric models, while machine learning often relies on the data to deliver insights.\n", 71 | "\n", 72 | "The two worlds have a lot of overlap and continue to grow closer. Machine learning is getting better at providing both predictions and understanding, and econometrics is finding value in the scalability and accuracy of some machine learning models.\n", 73 | "\n", 74 | "\n", 75 | "## Challenges of Machine Learning" 76 | ] 77 | }, 78 | { 79 | "cell_type": "markdown", 80 | "metadata": {}, 81 | "source": [ 82 | "Perhaps my favorite part from the Wikipedia page on machine learning is, \"As of 2016, **machine learning is a buzzword**, and according to the Gartner hype cycle of 2016, at its peak of inflated expectations. Effective machine learning is difficult because finding patterns is hard and often not enough training data is available; as a result, **machine-learning programs often fail to deliver**.\"\n", 83 | "\n", 84 | "* There isn't a clear problem to solve\n", 85 | "\n", 86 | "Some executive heard machine learning is the next big thing, so they hired a data science team. Unfortunately, there isn't a clear idea on what problems to solve, so the team flounders for a year.\n", 87 | "\n", 88 | "* Labeled data can be extremely important to building machine learning models, but can also be extremely costly.\n", 89 | "\n", 90 | "First off, you often need a lot of data. 
[Google](https://research.googleblog.com/2017/07/revisiting-unreasonable-effectiveness.html) found for representation learning, performance increases logarithmically based on volume of training data:\n", 91 | "\n", 92 | "![image.png](attachment:image.png)\n", 93 | "\n", 94 | "Secondly, you need to get data that represents the full distribution of the problem you are trying to solve. For example, for our spam classification problem, what kinds of emails might we want to gather? What if we only had emails that came from US IP addresses?\n", 95 | "\n", 96 | "Lastly, just getting data labeled can be time consuming and cost a lot of money. Who is going to label 1 million emails as spam or not spam?\n", 97 | "\n", 98 | "* Data can be very messy\n", 99 | "\n", 100 | "Often data in the real world has errors, outliers, missing data, and noise. How you handle these can greatly influence the outcome of your model.\n", 101 | "\n", 102 | "* Feature engineering\n", 103 | "\n", 104 | "Once you have your data and labels, deciding on how to represent it to your model can be very challenging. For example, for spam classification would you just feed it the raw text? What about the origin of the IP address? What about a timestamp?\n", 105 | "\n", 106 | "* Your model might not generalize\n", 107 | "\n", 108 | "After all of this, you might still end up with a model that either is too simple to be effective (underfitting) or too complex to generalize well (overfitting). You have to develop a model that is just right. :)\n", 109 | "\n", 110 | "* Evaluation is non-trivial\n", 111 | "\n", 112 | "Let's say we develop a machine learning model for spam classification. How do we evaluate it? Do we care more about precision or recall? How do we tie our scientific metrics to business metrics?\n", 113 | "\n", 114 | "* Getting into production can be hard\n", 115 | "\n", 116 | "You have a beautiful model built in Python only to discover the back-end is in Java and has to run in under 5ms, so micro-services are not an option. So you convert your model to PMML, but engineers won't let you push code, so you are now blocked and putting your model in production isn't high on their priorities.\n", 117 | "\n", 118 | "\n", 119 | "## There is hope\n", 120 | "\n", 121 | "While many machine learning initiatives do fail, many also succeed and are running some of the most valuable companies in the world. Companies like Google, Facebook, Amazon, AirBnB, and Netflix have all found successful ways to leverage machine learning and are reaping large rewards.\n", 122 | "\n", 123 | "Google CEO Sundar Pichai even recently said, \"an important shift from a mobile first world to an AI first world.\"\n", 124 | "\n", 125 | "And Mark Cuban said, \"Artificial Intelligence, deep learning, machine learning — whatever you’re doing if you don’t understand it — learn it. Because otherwise you’re going to be a dinosaur within 3 years.\"\n", 126 | "\n", 127 | "And lastly, [Harvard Business Review](https://hbr.org/2012/10/big-data-the-management-revolution) found, \"companies in the top third of their industry in the use of data-driven decision making were, on average, 5% more productive and 6% more profitable than their competitors.\"\n", 128 | "\n", 129 | "The goal of this course is to prepare you for this world. So that you will not only know how to build the machine learning models to predict the future, but also understand the key ingredients of a successful machine learning initiative and how to overcome the challenges." 
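To make the evaluation question raised above concrete (precision vs. recall for a spam filter), here is a minimal sketch. The labels below are made up purely for illustration and are not course data; only the two scikit-learn metric functions are assumed.

```python
from sklearn.metrics import precision_score, recall_score

# 1 = spam, 0 = not spam; ten made-up emails
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]  # a cautious filter that misses some spam

# Precision: of the emails we flagged as spam, how many really were spam? (2 of 3)
print(precision_score(y_true, y_pred))
# Recall: of the real spam, how much did we catch? (2 of 4)
print(recall_score(y_true, y_pred))
```

A filter that flags almost nothing can have high precision and terrible recall, and vice versa; which error matters more is ultimately a business decision, not just a scientific one.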
130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": null, 135 | "metadata": { 136 | "collapsed": true 137 | }, 138 | "outputs": [], 139 | "source": [] 140 | } 141 | ], 142 | "metadata": { 143 | "kernelspec": { 144 | "display_name": "Python 2", 145 | "language": "python", 146 | "name": "python2" 147 | }, 148 | "language_info": { 149 | "codemirror_mode": { 150 | "name": "ipython", 151 | "version": 2 152 | }, 153 | "file_extension": ".py", 154 | "mimetype": "text/x-python", 155 | "name": "python", 156 | "nbconvert_exporter": "python", 157 | "pygments_lexer": "ipython2", 158 | "version": "2.7.12" 159 | } 160 | }, 161 | "nbformat": 4, 162 | "nbformat_minor": 2 163 | } 164 | -------------------------------------------------------------------------------- /lectures/Lecture_6_K_Nearest_Neighbors.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## K-Nearest Neighbors\n", 8 | "\n", 9 | "This is one of the simplest and easiest to understand algorithms. It can be used for both classification and regression tasks, but is more common in classification, so we will focus there. The principles, though, can be used in both cases and sklearn supports both.\n", 10 | "\n", 11 | "Basically, here is the algorithm:\n", 12 | "\n", 13 | "1. Define $k$\n", 14 | "2. Define a distance metric - usually Euclidean distance\n", 15 | "3. For a new data point, find the $k$ nearest training points and combine their classes in some way - usually voting - to get a predicted class\n", 16 | "\n", 17 | "That's it!\n", 18 | "\n", 19 | "**Some of the benefits:**\n", 20 | "\n", 21 | "* Doesn't really require any training in the traditional sense. You just need a fast way to find the nearest neighbors.\n", 22 | "* Easy to understand \n", 23 | "\n", 24 | "**Some of the negatives:**\n", 25 | "\n", 26 | "* Need to define k, which is a hyper-parameter, so it can be tuned with cross-validation. A higher value for k increases bias and a lower value increases variance.\n", 27 | "* Have to choose a distance metric and could get very different results depending on the metric. Again, you can use cross-validation.\n", 28 | "* Doesn't really offer insights into which features might be important.\n", 29 | "* Can suffer with high-dimensional data due to the curse of dimensionality.\n", 30 | "\n", 31 | "**Basic assumption:**\n", 32 | "\n", 33 | "* Data points that are close are similar for our target\n", 34 | "\n", 35 | "\n", 36 | "### Curse of Dimensionality\n", 37 | "\n", 38 | "Basically, the curse of dimensionality means that in high dimensions, it is likely that close points are not much closer than the average distance, which means being close doesn't mean much. In high dimensions, the data becomes very spread out, which creates this phenomenon. \n", 39 | "\n", 40 | "There are so many good resources for this online that I won't go any deeper. 
Here is one you might look at:\n", 41 | "\n", 42 | "http://blog.galvanize.com/how-to-manage-dimensionality/\n", 43 | "\n", 44 | "### Euclidean Distance\n", 45 | "\n", 46 | "For vectors, q and p that are being compared (these would be our feature vectors):\n", 47 | "\n", 48 | "$$\\sqrt{\\sum_{i=1}^{N}{(p_{i} - q_{i})^2}}$$\n", 49 | "\n", 50 | "\n", 51 | "### SKlearn Example\n", 52 | "\n" 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": 15, 58 | "metadata": { 59 | "collapsed": false 60 | }, 61 | "outputs": [], 62 | "source": [ 63 | "from sklearn.datasets import load_breast_cancer\n", 64 | "from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor\n", 65 | "from sklearn.model_selection import train_test_split, GridSearchCV\n", 66 | "from sklearn.metrics import f1_score, classification_report, accuracy_score, mean_squared_error" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": 3, 72 | "metadata": { 73 | "collapsed": false 74 | }, 75 | "outputs": [ 76 | { 77 | "name": "stdout", 78 | "output_type": "stream", 79 | "text": [ 80 | "Best Params: {'n_neighbors': 5, 'p': 1, 'weights': 'uniform'}\n", 81 | "Train F1: 0.9606837606837607\n", 82 | "Test Classification Report:\n", 83 | " precision recall f1-score support\n", 84 | "\n", 85 | " 0 0.97 0.88 0.93 43\n", 86 | " 1 0.93 0.99 0.96 71\n", 87 | "\n", 88 | "avg / total 0.95 0.95 0.95 114\n", 89 | "\n", 90 | "Train Accuracy: 0.9494505494505494\tTest accuracy: 0.9473684210526315\n" 91 | ] 92 | } 93 | ], 94 | "source": [ 95 | "data = load_breast_cancer()\n", 96 | "X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.20, random_state=42)\n", 97 | "clf = KNeighborsClassifier()\n", 98 | "gridsearch = GridSearchCV(clf, {\"n_neighbors\": [1, 3, 5, 7, 9, 11], \"weights\": ['uniform', 'distance'], \n", 99 | " 'p': [1, 2, 3]}, scoring='f1')\n", 100 | "gridsearch.fit(X_train, y_train)\n", 101 | "print(\"Best Params: {}\".format(gridsearch.best_params_))\n", 102 | "y_pred_train = gridsearch.predict(X_train)\n", 103 | "print(\"Train F1: {}\".format(f1_score(y_train, y_pred_train)))\n", 104 | "print(\"Test Classification Report:\")\n", 105 | "y_pred_test = gridsearch.predict(X_test)\n", 106 | "print(classification_report(y_test, y_pred_test))\n", 107 | "print(\"Train Accuracy: {}\\tTest accuracy: {}\".format(accuracy_score(y_train, y_pred_train),\n", 108 | " accuracy_score(y_test, y_pred_test)))" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "### Voting Methods\n", 116 | "\n", 117 | "* Majority Voting: After you take the $k$ nearest neighbors, you take a \"vote\" of those neighbors' classes. The new data point is classified with whatever the majority class of the neighbors is. If you are doing a binary classification, it is recommended that you use an odd number of neighbors to avoid tied votes. However, in a multi-class problem, it is harder to avoid ties. A common solution to this is to decrease $k$ until the tie is broken.\n", 118 | "* Distance Weighting: Instead of directly taking votes of the nearest neighbors, you weight each vote by the distance of that instance from the new data point. A common weighting method is $\\hat{y} = \\frac{\\sum_{i=1}^Nw_iq_i}{\\sum_{i=1}^N w_i}$, where $w_i=\\frac{1}{\\sum_{i=1}^{N}{(p_{i} - q_{i})^2}}$, or one over the distance between the new data point and the training point. The new data point is added into the class with the largest added weight. 
Not only does this decrease the chances of ties, it also reduces the effect of a skewed representation of data.\n", 119 | "\n", 120 | "### Distance Metrics\n", 121 | "Euclidean distance, or the 2-norm, is a very common distance metric to use for $k$-nearest neighbors. Any $p$-norm can be used (the $p$-norm is defined as $||\\mathbf{x}-\\mathbf{y}||_p = (\\sum_{i=1}^N|x_i-y_i|^p)^\\frac{1}{p}$). For categorical data, however, this can be a problem. For example, if we have encoded a feature for car color from red, blue, and green to 0, 1, 2, how can the \"distance\" between green and red be measured? You could make dummy variables, but if a feature has 15 possible categories, that means adding 14 more variables to your feature set, and we run into the curse of dimensionality. There are several ways to confront this issue, but they are beyond the scope of this lecture. But be aware of the effect categorical features can have on your nearest neighbors classifier.\n", 122 | "\n", 123 | "### Search Algorithm\n", 124 | "\n", 125 | "Imagine the data set contains 2000 points. A brute-force search for the 3 nearest neighbors to one point does not take very long. But if the data set contains 2000000 points, a brute-force search can become quite costly, especially if the dimension of the data is large. Other search algorithms sacrifice an exhaustive search for a faster run time. Structures such as KDTrees (see https://en.wikipedia.org/wiki/K-d_tree) or Ball trees (see https://en.wikipedia.org/wiki/Ball_tree) are used for faster run times. While we won't dive into the details of these structures, be aware of them and how they can optimize your run time (although training time does increase).\n", 126 | "\n", 127 | "### Radius Neighbors Classifier\n", 128 | "\n", 129 | "This is the same idea as a $k$ nearest neighbors classifier, but instead of finding the $k$ nearest neighbors, you find all the neighbors within a given radius. Setting the radius requires some domain knowledge; if your points are closely packed together, you'd want to use a smaller radius to avoid having nearly every point vote.\n", 130 | "\n", 131 | "### K Neighbors Regressor\n", 132 | "To change our problem from classification to regression, all we have to do is find the weighted average of the $k$ nearest neighbors. Instead of taking the majority class, we calculate a weighted average of these nearest values, using the same weighting method as above.\n", 133 | "\n", 134 | "Let's try predicting the area of tissue based on the other features using a $k$-neighbors regressor."
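Before turning to scikit-learn's `KNeighborsRegressor` in the next cell, here is a minimal from-scratch sketch of the distance-weighted prediction described above. The toy arrays and the function name are purely illustrative, and a small epsilon is added to the squared distance so an exact match does not divide by zero.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Distance-weighted k-NN prediction for a single query point.

    As written this is regression; for classification you would instead sum
    the weights per class and pick the class with the largest total weight.
    """
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))  # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]                        # indices of the k closest training points
    weights = 1.0 / (dists[nearest] ** 2 + 1e-8)           # w_i = 1 / squared distance
    return np.sum(weights * y_train[nearest]) / np.sum(weights)

# Tiny made-up example: predict a value for a new 2-D point
X = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [8.0, 8.0]])
y = np.array([10.0, 20.0, 30.0, 80.0])
print(knn_predict(X, y, np.array([2.1, 2.0]), k=3))  # close to 20, dominated by the nearest point
```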
135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": 19, 140 | "metadata": { 141 | "collapsed": false 142 | }, 143 | "outputs": [ 144 | { 145 | "name": "stdout", 146 | "output_type": "stream", 147 | "text": [ 148 | "Best Params: {'n_neighbors': 5, 'p': 1, 'weights': 'distance'}\n", 149 | "Train MSE: 0.0\tTest MSE: 878.1482686614424\n" 150 | ] 151 | } 152 | ], 153 | "source": [ 154 | "target = data.data[:,3]\n", 155 | "X = data.data[:,[0,1,2,4,5,6,7,8]]\n", 156 | "X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.20, random_state=42)\n", 157 | "reg = KNeighborsRegressor()\n", 158 | "gridsearch = GridSearchCV(reg, {\"n_neighbors\": [1, 3, 5, 7, 9, 11], \"weights\": ['uniform', 'distance'], \n", 159 | " 'p': [1, 2, 3]}, scoring='neg_mean_squared_error')\n", 160 | "gridsearch.fit(X_train, y_train)\n", 161 | "print(\"Best Params: {}\".format(gridsearch.best_params_))\n", 162 | "y_pred_train = gridsearch.predict(X_train)\n", 163 | "y_pred_test = gridsearch.predict(X_test)\n", 164 | "print(\"Train MSE: {}\\tTest MSE: {}\".format(mean_squared_error(y_train, y_pred_train),\n", 165 | " mean_squared_error(y_test, y_pred_test)))" 166 | ] 167 | }, 168 | { 169 | "cell_type": "markdown", 170 | "metadata": {}, 171 | "source": [ 172 | "### Outlier Detection\n", 173 | "In $k$-nearest neighbors, the data is naturally clustered. Within these clusters, we can find the average distance between points (either exhaustively or from the centroid of the cluster). If we find a few points that are much farther than the average distance to other points or to the centroid, it is reasonable (but not always correct) to think they could be outliers. We can use this process on new data points as well. " 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": null, 179 | "metadata": { 180 | "collapsed": true 181 | }, 182 | "outputs": [], 183 | "source": [] 184 | } 185 | ], 186 | "metadata": { 187 | "kernelspec": { 188 | "display_name": "Python 3", 189 | "language": "python", 190 | "name": "python3" 191 | }, 192 | "language_info": { 193 | "codemirror_mode": { 194 | "name": "ipython", 195 | "version": 3 196 | }, 197 | "file_extension": ".py", 198 | "mimetype": "text/x-python", 199 | "name": "python", 200 | "nbconvert_exporter": "python", 201 | "pygments_lexer": "ipython3", 202 | "version": "3.5.3" 203 | } 204 | }, 205 | "nbformat": 4, 206 | "nbformat_minor": 2 207 | } 208 | -------------------------------------------------------------------------------- /lectures/Lecture_9_Decision_Trees.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "collapsed": true 7 | }, 8 | "source": [ 9 | "## Decision Trees\n", 10 | "\n", 11 | "To better understand decision trees, let's take a look at a simple example:\n", 12 | "\n", 13 | "https://zhengtianyu.files.wordpress.com/2013/12/decision-tree.png\n", 14 | "\n", 15 | "Clearly, decision trees are very simple and pretty easy to understand. Once you have a tree, you just follow the forks down for your example to get to an end node. Then at that end node you take the most common class or average value for regression. In fact, with classification you can get probabilistic estimates by taking the fraction of examples in that class. Nice! 
And that is one of the biggest benefits of decision trees - they are so easy to explain and understand.\n", 16 | "\n", 17 | "Most decision trees use binary trees - each split is either yes or no. When you graph these decision boundaries you end up with a bunch of vertical or horizontal lines. Here is an example with the iris data set:\n", 18 | "\n", 19 | "http://statweb.stanford.edu/~jtaylo/courses/stats202/trees.html\n", 20 | "\n", 21 | "## How To Grow A Tree\n", 22 | "\n", 23 | "So - how does one learn a tree? Do we just randomly pick binary splitting points and see what comes out? Of course not! We leverage our data and a definition of impurity. \n", 24 | "\n", 25 | "If you think about it, what we really want from our tree is pure leaf nodes. Meaning that for classification we would like each leaf node to end up with examples of only a single class. This greatly increases our confidence that if a testing example ends up at that leaf node that it is in fact of that class.\n", 26 | "\n", 27 | "Imagine the opposite, a leaf node in a 2 class classification problem that has 50% of the examples in one class and 50% in the other. That really doesn't help us at all! If a testing example ends up at that leaf node, all we can do is make a 50/50 guess.\n", 28 | "\n", 29 | "So, how do we define impurity?\n", 30 | "\n", 31 | "**Gini Impurity**\n", 32 | "\n", 33 | "$$Gini_{i} = 1 - \\sum_{k=1}^{n}{p_{i,k}^2}$$\n", 34 | "\n", 35 | "Where $i$ is the node of interest, $n$ is the total number of classes, and $p_{i,k}$ is the fraction of class $k$ in node $i$. The Gini of a node is 0 when all the examples belong to the same class. Gini is the highest when there is an equal probability of being in each class. \n", 36 | "\n", 37 | "Let's consider a 2-class example. Say I am trying to predict whether someone is male or female and I branch on height being less than 5 1/2 feet. In this node, I find 50 examples of which 40 are female and 10 are male. Gini is then:\n", 38 | "\n", 39 | "$$1 - (40/50)^2 - (10/50)^2 = 0.32$$\n", 40 | "\n", 41 | "Very nice - now we can basically evaluate how good a node is. With this we can now start to **greedily** grow our tree. What I mean by greedily is that we will not consider every possible tree, but instead start with the most pure feature split and then from there pick the next most pure, etc.\n", 42 | "\n", 43 | "Thus, our algorithm looks at all the features and all the splitting points for our features (note: this process runs much faster if your features have a small number of unique values, like categories, as opposed to real numbers). To evaluate a feature/split-point pair, we do the following:\n", 44 | "\n", 45 | "Take the weighted average of the Gini impurity for the two nodes created by the split, with the weights being the number of examples in the nodes.\n", 46 | "\n", 47 | "That's it! Basically, the feature split that produces the lowest weighted average Gini impurity is considered best, then we move on and find the next best feature split given all the previous feature splits.\n", 48 | "\n",
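To make the split-selection rule above concrete, here is a minimal sketch of Gini impurity and the weighted-average Gini of a candidate split. The function names and the extra class counts (beyond the 40/10 height example) are made up for illustration.

```python
import numpy as np

def gini(class_counts):
    """Gini impurity of a node, given the number of examples in each class."""
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

def weighted_split_gini(left_counts, right_counts):
    """Weighted average Gini of the two child nodes created by a split."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    return (n_left / n) * gini(left_counts) + (n_right / n) * gini(right_counts)

print(gini([40, 10]))                           # the height example: 1 - 0.8**2 - 0.2**2 = 0.32
print(weighted_split_gini([40, 10], [5, 45]))   # fairly pure children -> lower weighted Gini (0.25)
print(weighted_split_gini([25, 25], [20, 30]))  # impure children -> higher weighted Gini (0.49)
```

In scikit-learn, `DecisionTreeClassifier` uses Gini impurity as its default splitting criterion, so this is essentially the calculation performed at every candidate split.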
49 | "### When to stop?\n", 50 | "\n", 51 | "Decision trees have many hyper-parameters to help control when to stop growing the tree. These include max depth, min_samples_split, min_samples_leaf, min_weight_fraction_leaf, and max_leaf_nodes.\n", 52 | "\n", 53 | "Since we have to consider all these splits, training can be slow, but predictions are very fast since we just have to traverse the tree.\n", 54 | "\n", 55 | "Also, there are other impurity measures that are used with decision trees. Another very popular one is Entropy. See page 173 of Hands On Machine Learning for its definition. Usually, which you choose doesn't matter too much. Gini is slightly faster to compute, while entropy produces slightly more balanced trees.\n", 56 | "\n", 57 | "### Bias and variance\n", 58 | "\n", 59 | "Decision trees can be very prone to overfitting if you let them grow too deep. Thus, decreasing the depth can decrease variance / increase bias. It is important to use cross-validation to do hyper-parameter selection. \n", 60 | "\n", 61 | "### Regression\n", 62 | "\n", 63 | "Decision trees can be applied to regression problems in exactly the same way, but instead of using Gini impurity you would use mean squared error. You calculate the MSE of a node by setting the prediction for all values in that node as the average $y$ value of the examples in that node. For example, if your node had these values: 5, 2, 1, 6 then your predicted value would be (5+2+1+6)/4 = 3.5. You would do the squared difference between 3.5 and all the values and take the mean. \n", 64 | "\n", 65 | "Prediction is done by traversing to the leaf node for an example and taking the average value at that node.\n", 66 | "\n", 67 | "### Pros\n", 68 | "\n", 69 | "* Easy to explain\n", 70 | "* Can be visualized\n", 71 | "* Can handle categorical variables and missing data well\n", 72 | "* Fast prediction\n", 73 | "\n", 74 | "### Cons\n", 75 | "\n", 76 | "* Typically don't have very strong prediction accuracy\n", 77 | "* Very sensitive to small changes in training data\n", 78 | "* Can be slow to train" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": null, 84 | "metadata": { 85 | "collapsed": true 86 | }, 87 | "outputs": [], 88 | "source": [] 89 | } 90 | ], 91 | "metadata": { 92 | "kernelspec": { 93 | "display_name": "Python 3", 94 | "language": "python", 95 | "name": "python3" 96 | }, 97 | "language_info": { 98 | "codemirror_mode": { 99 | "name": "ipython", 100 | "version": 3 101 | }, 102 | "file_extension": ".py", 103 | "mimetype": "text/x-python", 104 | "name": "python", 105 | "nbconvert_exporter": "python", 106 | "pygments_lexer": "ipython3", 107 | "version": "3.5.3" 108 | } 109 | }, 110 | "nbformat": 4, 111 | "nbformat_minor": 2 112 | } 113 | -------------------------------------------------------------------------------- /small_data/Wholesale customers data.csv: -------------------------------------------------------------------------------- 1 | Channel,Region,Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicassen 2 | 2,3,12669,9656,7561,214,2674,1338 3 | 2,3,7057,9810,9568,1762,3293,1776 4 | 2,3,6353,8808,7684,2405,3516,7844 5 | 1,3,13265,1196,4221,6404,507,1788 6 | 2,3,22615,5410,7198,3915,1777,5185 7 | 2,3,9413,8259,5126,666,1795,1451 8 | 2,3,12126,3199,6975,480,3140,545 9 | 2,3,7579,4956,9426,1669,3321,2566 10 | 1,3,5963,3648,6192,425,1716,750 11 | 2,3,6006,11093,18881,1159,7425,2098 12 | 2,3,3366,5403,12974,4400,5977,1744 13 | 2,3,13146,1124,4523,1420,549,497 14 | 2,3,31714,12319,11757,287,3881,2931 15 | 2,3,21217,6208,14982,3095,6707,602 16 | 2,3,24653,9465,12091,294,5058,2168 17 | 1,3,10253,1114,3821,397,964,412 18 | 
2,3,1020,8816,12121,134,4508,1080 19 | 1,3,5876,6157,2933,839,370,4478 20 | 2,3,18601,6327,10099,2205,2767,3181 21 | 1,3,7780,2495,9464,669,2518,501 22 | 2,3,17546,4519,4602,1066,2259,2124 23 | 1,3,5567,871,2010,3383,375,569 24 | 1,3,31276,1917,4469,9408,2381,4334 25 | 2,3,26373,36423,22019,5154,4337,16523 26 | 2,3,22647,9776,13792,2915,4482,5778 27 | 2,3,16165,4230,7595,201,4003,57 28 | 1,3,9898,961,2861,3151,242,833 29 | 1,3,14276,803,3045,485,100,518 30 | 2,3,4113,20484,25957,1158,8604,5206 31 | 1,3,43088,2100,2609,1200,1107,823 32 | 1,3,18815,3610,11107,1148,2134,2963 33 | 1,3,2612,4339,3133,2088,820,985 34 | 1,3,21632,1318,2886,266,918,405 35 | 1,3,29729,4786,7326,6130,361,1083 36 | 1,3,1502,1979,2262,425,483,395 37 | 2,3,688,5491,11091,833,4239,436 38 | 1,3,29955,4362,5428,1729,862,4626 39 | 2,3,15168,10556,12477,1920,6506,714 40 | 2,3,4591,15729,16709,33,6956,433 41 | 1,3,56159,555,902,10002,212,2916 42 | 1,3,24025,4332,4757,9510,1145,5864 43 | 1,3,19176,3065,5956,2033,2575,2802 44 | 2,3,10850,7555,14961,188,6899,46 45 | 2,3,630,11095,23998,787,9529,72 46 | 2,3,9670,7027,10471,541,4618,65 47 | 2,3,5181,22044,21531,1740,7353,4985 48 | 2,3,3103,14069,21955,1668,6792,1452 49 | 2,3,44466,54259,55571,7782,24171,6465 50 | 2,3,11519,6152,10868,584,5121,1476 51 | 2,3,4967,21412,28921,1798,13583,1163 52 | 1,3,6269,1095,1980,3860,609,2162 53 | 1,3,3347,4051,6996,239,1538,301 54 | 2,3,40721,3916,5876,532,2587,1278 55 | 2,3,491,10473,11532,744,5611,224 56 | 1,3,27329,1449,1947,2436,204,1333 57 | 1,3,5264,3683,5005,1057,2024,1130 58 | 2,3,4098,29892,26866,2616,17740,1340 59 | 2,3,5417,9933,10487,38,7572,1282 60 | 1,3,13779,1970,1648,596,227,436 61 | 1,3,6137,5360,8040,129,3084,1603 62 | 2,3,8590,3045,7854,96,4095,225 63 | 2,3,35942,38369,59598,3254,26701,2017 64 | 2,3,7823,6245,6544,4154,4074,964 65 | 2,3,9396,11601,15775,2896,7677,1295 66 | 1,3,4760,1227,3250,3724,1247,1145 67 | 2,3,85,20959,45828,36,24231,1423 68 | 1,3,9,1534,7417,175,3468,27 69 | 2,3,19913,6759,13462,1256,5141,834 70 | 1,3,2446,7260,3993,5870,788,3095 71 | 1,3,8352,2820,1293,779,656,144 72 | 1,3,16705,2037,3202,10643,116,1365 73 | 1,3,18291,1266,21042,5373,4173,14472 74 | 1,3,4420,5139,2661,8872,1321,181 75 | 2,3,19899,5332,8713,8132,764,648 76 | 2,3,8190,6343,9794,1285,1901,1780 77 | 1,3,20398,1137,3,4407,3,975 78 | 1,3,717,3587,6532,7530,529,894 79 | 2,3,12205,12697,28540,869,12034,1009 80 | 1,3,10766,1175,2067,2096,301,167 81 | 1,3,1640,3259,3655,868,1202,1653 82 | 1,3,7005,829,3009,430,610,529 83 | 2,3,219,9540,14403,283,7818,156 84 | 2,3,10362,9232,11009,737,3537,2342 85 | 1,3,20874,1563,1783,2320,550,772 86 | 2,3,11867,3327,4814,1178,3837,120 87 | 2,3,16117,46197,92780,1026,40827,2944 88 | 2,3,22925,73498,32114,987,20070,903 89 | 1,3,43265,5025,8117,6312,1579,14351 90 | 1,3,7864,542,4042,9735,165,46 91 | 1,3,24904,3836,5330,3443,454,3178 92 | 1,3,11405,596,1638,3347,69,360 93 | 1,3,12754,2762,2530,8693,627,1117 94 | 2,3,9198,27472,32034,3232,18906,5130 95 | 1,3,11314,3090,2062,35009,71,2698 96 | 2,3,5626,12220,11323,206,5038,244 97 | 1,3,3,2920,6252,440,223,709 98 | 2,3,23,2616,8118,145,3874,217 99 | 1,3,403,254,610,774,54,63 100 | 1,3,503,112,778,895,56,132 101 | 1,3,9658,2182,1909,5639,215,323 102 | 2,3,11594,7779,12144,3252,8035,3029 103 | 2,3,1420,10810,16267,1593,6766,1838 104 | 2,3,2932,6459,7677,2561,4573,1386 105 | 1,3,56082,3504,8906,18028,1480,2498 106 | 1,3,14100,2132,3445,1336,1491,548 107 | 1,3,15587,1014,3970,910,139,1378 108 | 2,3,1454,6337,10704,133,6830,1831 109 | 2,3,8797,10646,14886,2471,8969,1438 110 
| 2,3,1531,8397,6981,247,2505,1236 111 | 2,3,1406,16729,28986,673,836,3 112 | 1,3,11818,1648,1694,2276,169,1647 113 | 2,3,12579,11114,17569,805,6457,1519 114 | 1,3,19046,2770,2469,8853,483,2708 115 | 1,3,14438,2295,1733,3220,585,1561 116 | 1,3,18044,1080,2000,2555,118,1266 117 | 1,3,11134,793,2988,2715,276,610 118 | 1,3,11173,2521,3355,1517,310,222 119 | 1,3,6990,3880,5380,1647,319,1160 120 | 1,3,20049,1891,2362,5343,411,933 121 | 1,3,8258,2344,2147,3896,266,635 122 | 1,3,17160,1200,3412,2417,174,1136 123 | 1,3,4020,3234,1498,2395,264,255 124 | 1,3,12212,201,245,1991,25,860 125 | 2,3,11170,10769,8814,2194,1976,143 126 | 1,3,36050,1642,2961,4787,500,1621 127 | 1,3,76237,3473,7102,16538,778,918 128 | 1,3,19219,1840,1658,8195,349,483 129 | 2,3,21465,7243,10685,880,2386,2749 130 | 1,3,140,8847,3823,142,1062,3 131 | 1,3,42312,926,1510,1718,410,1819 132 | 1,3,7149,2428,699,6316,395,911 133 | 1,3,2101,589,314,346,70,310 134 | 1,3,14903,2032,2479,576,955,328 135 | 1,3,9434,1042,1235,436,256,396 136 | 1,3,7388,1882,2174,720,47,537 137 | 1,3,6300,1289,2591,1170,199,326 138 | 1,3,4625,8579,7030,4575,2447,1542 139 | 1,3,3087,8080,8282,661,721,36 140 | 1,3,13537,4257,5034,155,249,3271 141 | 1,3,5387,4979,3343,825,637,929 142 | 1,3,17623,4280,7305,2279,960,2616 143 | 1,3,30379,13252,5189,321,51,1450 144 | 1,3,37036,7152,8253,2995,20,3 145 | 1,3,10405,1596,1096,8425,399,318 146 | 1,3,18827,3677,1988,118,516,201 147 | 2,3,22039,8384,34792,42,12591,4430 148 | 1,3,7769,1936,2177,926,73,520 149 | 1,3,9203,3373,2707,1286,1082,526 150 | 1,3,5924,584,542,4052,283,434 151 | 1,3,31812,1433,1651,800,113,1440 152 | 1,3,16225,1825,1765,853,170,1067 153 | 1,3,1289,3328,2022,531,255,1774 154 | 1,3,18840,1371,3135,3001,352,184 155 | 1,3,3463,9250,2368,779,302,1627 156 | 1,3,622,55,137,75,7,8 157 | 2,3,1989,10690,19460,233,11577,2153 158 | 2,3,3830,5291,14855,317,6694,3182 159 | 1,3,17773,1366,2474,3378,811,418 160 | 2,3,2861,6570,9618,930,4004,1682 161 | 2,3,355,7704,14682,398,8077,303 162 | 2,3,1725,3651,12822,824,4424,2157 163 | 1,3,12434,540,283,1092,3,2233 164 | 1,3,15177,2024,3810,2665,232,610 165 | 2,3,5531,15726,26870,2367,13726,446 166 | 2,3,5224,7603,8584,2540,3674,238 167 | 2,3,15615,12653,19858,4425,7108,2379 168 | 2,3,4822,6721,9170,993,4973,3637 169 | 1,3,2926,3195,3268,405,1680,693 170 | 1,3,5809,735,803,1393,79,429 171 | 1,3,5414,717,2155,2399,69,750 172 | 2,3,260,8675,13430,1116,7015,323 173 | 2,3,200,25862,19816,651,8773,6250 174 | 1,3,955,5479,6536,333,2840,707 175 | 2,3,514,7677,19805,937,9836,716 176 | 1,3,286,1208,5241,2515,153,1442 177 | 2,3,2343,7845,11874,52,4196,1697 178 | 1,3,45640,6958,6536,7368,1532,230 179 | 1,3,12759,7330,4533,1752,20,2631 180 | 1,3,11002,7075,4945,1152,120,395 181 | 1,3,3157,4888,2500,4477,273,2165 182 | 1,3,12356,6036,8887,402,1382,2794 183 | 1,3,112151,29627,18148,16745,4948,8550 184 | 1,3,694,8533,10518,443,6907,156 185 | 1,3,36847,43950,20170,36534,239,47943 186 | 1,3,327,918,4710,74,334,11 187 | 1,3,8170,6448,1139,2181,58,247 188 | 1,3,3009,521,854,3470,949,727 189 | 1,3,2438,8002,9819,6269,3459,3 190 | 2,3,8040,7639,11687,2758,6839,404 191 | 2,3,834,11577,11522,275,4027,1856 192 | 1,3,16936,6250,1981,7332,118,64 193 | 1,3,13624,295,1381,890,43,84 194 | 1,3,5509,1461,2251,547,187,409 195 | 2,3,180,3485,20292,959,5618,666 196 | 1,3,7107,1012,2974,806,355,1142 197 | 1,3,17023,5139,5230,7888,330,1755 198 | 1,1,30624,7209,4897,18711,763,2876 199 | 2,1,2427,7097,10391,1127,4314,1468 200 | 1,1,11686,2154,6824,3527,592,697 201 | 1,1,9670,2280,2112,520,402,347 202 | 
2,1,3067,13240,23127,3941,9959,731 203 | 2,1,4484,14399,24708,3549,14235,1681 204 | 1,1,25203,11487,9490,5065,284,6854 205 | 1,1,583,685,2216,469,954,18 206 | 1,1,1956,891,5226,1383,5,1328 207 | 2,1,1107,11711,23596,955,9265,710 208 | 1,1,6373,780,950,878,288,285 209 | 2,1,2541,4737,6089,2946,5316,120 210 | 1,1,1537,3748,5838,1859,3381,806 211 | 2,1,5550,12729,16767,864,12420,797 212 | 1,1,18567,1895,1393,1801,244,2100 213 | 2,1,12119,28326,39694,4736,19410,2870 214 | 1,1,7291,1012,2062,1291,240,1775 215 | 1,1,3317,6602,6861,1329,3961,1215 216 | 2,1,2362,6551,11364,913,5957,791 217 | 1,1,2806,10765,15538,1374,5828,2388 218 | 2,1,2532,16599,36486,179,13308,674 219 | 1,1,18044,1475,2046,2532,130,1158 220 | 2,1,18,7504,15205,1285,4797,6372 221 | 1,1,4155,367,1390,2306,86,130 222 | 1,1,14755,899,1382,1765,56,749 223 | 1,1,5396,7503,10646,91,4167,239 224 | 1,1,5041,1115,2856,7496,256,375 225 | 2,1,2790,2527,5265,5612,788,1360 226 | 1,1,7274,659,1499,784,70,659 227 | 1,1,12680,3243,4157,660,761,786 228 | 2,1,20782,5921,9212,1759,2568,1553 229 | 1,1,4042,2204,1563,2286,263,689 230 | 1,1,1869,577,572,950,4762,203 231 | 1,1,8656,2746,2501,6845,694,980 232 | 2,1,11072,5989,5615,8321,955,2137 233 | 1,1,2344,10678,3828,1439,1566,490 234 | 1,1,25962,1780,3838,638,284,834 235 | 1,1,964,4984,3316,937,409,7 236 | 1,1,15603,2703,3833,4260,325,2563 237 | 1,1,1838,6380,2824,1218,1216,295 238 | 1,1,8635,820,3047,2312,415,225 239 | 1,1,18692,3838,593,4634,28,1215 240 | 1,1,7363,475,585,1112,72,216 241 | 1,1,47493,2567,3779,5243,828,2253 242 | 1,1,22096,3575,7041,11422,343,2564 243 | 1,1,24929,1801,2475,2216,412,1047 244 | 1,1,18226,659,2914,3752,586,578 245 | 1,1,11210,3576,5119,561,1682,2398 246 | 1,1,6202,7775,10817,1183,3143,1970 247 | 2,1,3062,6154,13916,230,8933,2784 248 | 1,1,8885,2428,1777,1777,430,610 249 | 1,1,13569,346,489,2077,44,659 250 | 1,1,15671,5279,2406,559,562,572 251 | 1,1,8040,3795,2070,6340,918,291 252 | 1,1,3191,1993,1799,1730,234,710 253 | 2,1,6134,23133,33586,6746,18594,5121 254 | 1,1,6623,1860,4740,7683,205,1693 255 | 1,1,29526,7961,16966,432,363,1391 256 | 1,1,10379,17972,4748,4686,1547,3265 257 | 1,1,31614,489,1495,3242,111,615 258 | 1,1,11092,5008,5249,453,392,373 259 | 1,1,8475,1931,1883,5004,3593,987 260 | 1,1,56083,4563,2124,6422,730,3321 261 | 1,1,53205,4959,7336,3012,967,818 262 | 1,1,9193,4885,2157,327,780,548 263 | 1,1,7858,1110,1094,6818,49,287 264 | 1,1,23257,1372,1677,982,429,655 265 | 1,1,2153,1115,6684,4324,2894,411 266 | 2,1,1073,9679,15445,61,5980,1265 267 | 1,1,5909,23527,13699,10155,830,3636 268 | 2,1,572,9763,22182,2221,4882,2563 269 | 1,1,20893,1222,2576,3975,737,3628 270 | 2,1,11908,8053,19847,1069,6374,698 271 | 1,1,15218,258,1138,2516,333,204 272 | 1,1,4720,1032,975,5500,197,56 273 | 1,1,2083,5007,1563,1120,147,1550 274 | 1,1,514,8323,6869,529,93,1040 275 | 1,3,36817,3045,1493,4802,210,1824 276 | 1,3,894,1703,1841,744,759,1153 277 | 1,3,680,1610,223,862,96,379 278 | 1,3,27901,3749,6964,4479,603,2503 279 | 1,3,9061,829,683,16919,621,139 280 | 1,3,11693,2317,2543,5845,274,1409 281 | 2,3,17360,6200,9694,1293,3620,1721 282 | 1,3,3366,2884,2431,977,167,1104 283 | 2,3,12238,7108,6235,1093,2328,2079 284 | 1,3,49063,3965,4252,5970,1041,1404 285 | 1,3,25767,3613,2013,10303,314,1384 286 | 1,3,68951,4411,12609,8692,751,2406 287 | 1,3,40254,640,3600,1042,436,18 288 | 1,3,7149,2247,1242,1619,1226,128 289 | 1,3,15354,2102,2828,8366,386,1027 290 | 1,3,16260,594,1296,848,445,258 291 | 1,3,42786,286,471,1388,32,22 292 | 1,3,2708,2160,2642,502,965,1522 293 | 
1,3,6022,3354,3261,2507,212,686 294 | 1,3,2838,3086,4329,3838,825,1060 295 | 2,2,3996,11103,12469,902,5952,741 296 | 1,2,21273,2013,6550,909,811,1854 297 | 2,2,7588,1897,5234,417,2208,254 298 | 1,2,19087,1304,3643,3045,710,898 299 | 2,2,8090,3199,6986,1455,3712,531 300 | 2,2,6758,4560,9965,934,4538,1037 301 | 1,2,444,879,2060,264,290,259 302 | 2,2,16448,6243,6360,824,2662,2005 303 | 2,2,5283,13316,20399,1809,8752,172 304 | 2,2,2886,5302,9785,364,6236,555 305 | 2,2,2599,3688,13829,492,10069,59 306 | 2,2,161,7460,24773,617,11783,2410 307 | 2,2,243,12939,8852,799,3909,211 308 | 2,2,6468,12867,21570,1840,7558,1543 309 | 1,2,17327,2374,2842,1149,351,925 310 | 1,2,6987,1020,3007,416,257,656 311 | 2,2,918,20655,13567,1465,6846,806 312 | 1,2,7034,1492,2405,12569,299,1117 313 | 1,2,29635,2335,8280,3046,371,117 314 | 2,2,2137,3737,19172,1274,17120,142 315 | 1,2,9784,925,2405,4447,183,297 316 | 1,2,10617,1795,7647,1483,857,1233 317 | 2,2,1479,14982,11924,662,3891,3508 318 | 1,2,7127,1375,2201,2679,83,1059 319 | 1,2,1182,3088,6114,978,821,1637 320 | 1,2,11800,2713,3558,2121,706,51 321 | 2,2,9759,25071,17645,1128,12408,1625 322 | 1,2,1774,3696,2280,514,275,834 323 | 1,2,9155,1897,5167,2714,228,1113 324 | 1,2,15881,713,3315,3703,1470,229 325 | 1,2,13360,944,11593,915,1679,573 326 | 1,2,25977,3587,2464,2369,140,1092 327 | 1,2,32717,16784,13626,60869,1272,5609 328 | 1,2,4414,1610,1431,3498,387,834 329 | 1,2,542,899,1664,414,88,522 330 | 1,2,16933,2209,3389,7849,210,1534 331 | 1,2,5113,1486,4583,5127,492,739 332 | 1,2,9790,1786,5109,3570,182,1043 333 | 2,2,11223,14881,26839,1234,9606,1102 334 | 1,2,22321,3216,1447,2208,178,2602 335 | 2,2,8565,4980,67298,131,38102,1215 336 | 2,2,16823,928,2743,11559,332,3486 337 | 2,2,27082,6817,10790,1365,4111,2139 338 | 1,2,13970,1511,1330,650,146,778 339 | 1,2,9351,1347,2611,8170,442,868 340 | 1,2,3,333,7021,15601,15,550 341 | 1,2,2617,1188,5332,9584,573,1942 342 | 2,3,381,4025,9670,388,7271,1371 343 | 2,3,2320,5763,11238,767,5162,2158 344 | 1,3,255,5758,5923,349,4595,1328 345 | 2,3,1689,6964,26316,1456,15469,37 346 | 1,3,3043,1172,1763,2234,217,379 347 | 1,3,1198,2602,8335,402,3843,303 348 | 2,3,2771,6939,15541,2693,6600,1115 349 | 2,3,27380,7184,12311,2809,4621,1022 350 | 1,3,3428,2380,2028,1341,1184,665 351 | 2,3,5981,14641,20521,2005,12218,445 352 | 1,3,3521,1099,1997,1796,173,995 353 | 2,3,1210,10044,22294,1741,12638,3137 354 | 1,3,608,1106,1533,830,90,195 355 | 2,3,117,6264,21203,228,8682,1111 356 | 1,3,14039,7393,2548,6386,1333,2341 357 | 1,3,190,727,2012,245,184,127 358 | 1,3,22686,134,218,3157,9,548 359 | 2,3,37,1275,22272,137,6747,110 360 | 1,3,759,18664,1660,6114,536,4100 361 | 1,3,796,5878,2109,340,232,776 362 | 1,3,19746,2872,2006,2601,468,503 363 | 1,3,4734,607,864,1206,159,405 364 | 1,3,2121,1601,2453,560,179,712 365 | 1,3,4627,997,4438,191,1335,314 366 | 1,3,2615,873,1524,1103,514,468 367 | 2,3,4692,6128,8025,1619,4515,3105 368 | 1,3,9561,2217,1664,1173,222,447 369 | 1,3,3477,894,534,1457,252,342 370 | 1,3,22335,1196,2406,2046,101,558 371 | 1,3,6211,337,683,1089,41,296 372 | 2,3,39679,3944,4955,1364,523,2235 373 | 1,3,20105,1887,1939,8164,716,790 374 | 1,3,3884,3801,1641,876,397,4829 375 | 2,3,15076,6257,7398,1504,1916,3113 376 | 1,3,6338,2256,1668,1492,311,686 377 | 1,3,5841,1450,1162,597,476,70 378 | 2,3,3136,8630,13586,5641,4666,1426 379 | 1,3,38793,3154,2648,1034,96,1242 380 | 1,3,3225,3294,1902,282,68,1114 381 | 2,3,4048,5164,10391,130,813,179 382 | 1,3,28257,944,2146,3881,600,270 383 | 1,3,17770,4591,1617,9927,246,532 384 | 
1,3,34454,7435,8469,2540,1711,2893 385 | 1,3,1821,1364,3450,4006,397,361 386 | 1,3,10683,21858,15400,3635,282,5120 387 | 1,3,11635,922,1614,2583,192,1068 388 | 1,3,1206,3620,2857,1945,353,967 389 | 1,3,20918,1916,1573,1960,231,961 390 | 1,3,9785,848,1172,1677,200,406 391 | 1,3,9385,1530,1422,3019,227,684 392 | 1,3,3352,1181,1328,5502,311,1000 393 | 1,3,2647,2761,2313,907,95,1827 394 | 1,3,518,4180,3600,659,122,654 395 | 1,3,23632,6730,3842,8620,385,819 396 | 1,3,12377,865,3204,1398,149,452 397 | 1,3,9602,1316,1263,2921,841,290 398 | 2,3,4515,11991,9345,2644,3378,2213 399 | 1,3,11535,1666,1428,6838,64,743 400 | 1,3,11442,1032,582,5390,74,247 401 | 1,3,9612,577,935,1601,469,375 402 | 1,3,4446,906,1238,3576,153,1014 403 | 1,3,27167,2801,2128,13223,92,1902 404 | 1,3,26539,4753,5091,220,10,340 405 | 1,3,25606,11006,4604,127,632,288 406 | 1,3,18073,4613,3444,4324,914,715 407 | 1,3,6884,1046,1167,2069,593,378 408 | 1,3,25066,5010,5026,9806,1092,960 409 | 2,3,7362,12844,18683,2854,7883,553 410 | 2,3,8257,3880,6407,1646,2730,344 411 | 1,3,8708,3634,6100,2349,2123,5137 412 | 1,3,6633,2096,4563,1389,1860,1892 413 | 1,3,2126,3289,3281,1535,235,4365 414 | 1,3,97,3605,12400,98,2970,62 415 | 1,3,4983,4859,6633,17866,912,2435 416 | 1,3,5969,1990,3417,5679,1135,290 417 | 2,3,7842,6046,8552,1691,3540,1874 418 | 2,3,4389,10940,10908,848,6728,993 419 | 1,3,5065,5499,11055,364,3485,1063 420 | 2,3,660,8494,18622,133,6740,776 421 | 1,3,8861,3783,2223,633,1580,1521 422 | 1,3,4456,5266,13227,25,6818,1393 423 | 2,3,17063,4847,9053,1031,3415,1784 424 | 1,3,26400,1377,4172,830,948,1218 425 | 2,3,17565,3686,4657,1059,1803,668 426 | 2,3,16980,2884,12232,874,3213,249 427 | 1,3,11243,2408,2593,15348,108,1886 428 | 1,3,13134,9347,14316,3141,5079,1894 429 | 1,3,31012,16687,5429,15082,439,1163 430 | 1,3,3047,5970,4910,2198,850,317 431 | 1,3,8607,1750,3580,47,84,2501 432 | 1,3,3097,4230,16483,575,241,2080 433 | 1,3,8533,5506,5160,13486,1377,1498 434 | 1,3,21117,1162,4754,269,1328,395 435 | 1,3,1982,3218,1493,1541,356,1449 436 | 1,3,16731,3922,7994,688,2371,838 437 | 1,3,29703,12051,16027,13135,182,2204 438 | 1,3,39228,1431,764,4510,93,2346 439 | 2,3,14531,15488,30243,437,14841,1867 440 | 1,3,10290,1981,2232,1038,168,2125 441 | 1,3,2787,1698,2510,65,477,52 442 | -------------------------------------------------------------------------------- /small_data/u.item: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tfolkman/byu_econ_applied_machine_learning/add2b554e6aff6418dbe195ee75ec349b045adcd/small_data/u.item --------------------------------------------------------------------------------