├── LICENSE
├── CONTRIBUTORS
├── ar
│   ├── cheatsheet-machine-learning-tips-and-tricks.md
│   └── cheatsheet-unsupervised-learning.md
├── de
│   ├── cheatsheet-machine-learning-tips-and-tricks.md
│   ├── cheatsheet-deep-learning.md
│   ├── refresher-linear-algebra.md
│   └── cheatsheet-unsupervised-learning.md
├── he
│   ├── cheatsheet-machine-learning-tips-and-tricks.md
│   ├── cheatsheet-deep-learning.md
│   ├── refresher-linear-algebra.md
│   └── cheatsheet-unsupervised-learning.md
├── hi
│   ├── cheatsheet-machine-learning-tips-and-tricks.md
│   ├── cheatsheet-deep-learning.md
│   ├── refresher-linear-algebra.md
│   └── cheatsheet-unsupervised-learning.md
├── ru
│   ├── cheatsheet-machine-learning-tips-and-tricks.md
│   ├── cheatsheet-deep-learning.md
│   ├── refresher-linear-algebra.md
│   └── cheatsheet-unsupervised-learning.md
├── zh
│   ├── cheatsheet-machine-learning-tips-and-tricks.md
│   ├── cheatsheet-deep-learning.md
│   ├── refresher-linear-algebra.md
│   └── cheatsheet-unsupervised-learning.md
├── template
│   ├── cheatsheet-machine-learning-tips-and-tricks.md
│   ├── cheatsheet-deep-learning.md
│   ├── refresher-linear-algebra.md
│   └── cheatsheet-unsupervised-learning.md
├── README.md
└── ko
    └── cheatsheet-machine-learning-tips-and-tricks.md

/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2018 Shervine Amidi and Afshine Amidi
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------

/CONTRIBUTORS:
--------------------------------------------------------------------------------
1 | --ar
2 | Amjad Khatabi (translation of deep learning)
3 | Zaid Alyafeai (review of deep learning)
4 |
5 | Zaid Alyafeai (translation of linear algebra)
6 | Amjad Khatabi (review of linear algebra)
7 | Mazen Melibari (review of linear algebra)
8 |
9 | --de
10 |
11 | --es
12 | Erick Gabriel Mendoza Flores (translation of deep learning)
13 | Fernando Diaz (review of deep learning)
14 | Fernando González-Herrera (review of deep learning)
15 | Mariano Ramirez (review of deep learning)
16 | Juan P. Chavat (review of deep learning)
17 | Alonso Melgar López (review of deep learning)
18 | Gustavo Velasco-Hernández (review of deep learning)
19 | Juan Manuel Nava Zamudio (review of deep learning)
20 |
21 | Fernando González-Herrera (translation of linear algebra)
22 | Fernando Diaz (review of linear algebra)
23 | Gustavo Velasco-Hernández (review of linear algebra)
24 | Juan P.
Chavat (review of linear algebra)
25 |
26 | David Jiménez Paredes (translation of machine learning tips and tricks)
27 | Fernando Diaz (translation of machine learning tips and tricks)
28 | Gustavo Velasco-Hernández (review of machine learning tips and tricks)
29 | Alonso Melgar-Lopez (review of machine learning tips and tricks)
30 |
31 | Fermin Ordaz (translation of probabilities and statistics)
32 | Fernando González-Herrera (review of probabilities and statistics)
33 | Alonso Melgar López (review of probabilities and statistics)
34 |
35 | Juan P. Chavat (translation of supervised learning)
36 | Fernando Gonzalez-Herrera (review of supervised learning)
37 | Fernando Diaz (review of supervised learning)
38 | Alonso Melgar-Lopez (review of supervised learning)
39 |
40 | Jaime Noel Alvarez Luna (translation of unsupervised learning)
41 | Alonso Melgar López (review of unsupervised learning)
42 | Fernando Diaz (review of unsupervised learning)
43 |
44 | --fa
45 | AlisterTA (translation of convolutional neural networks)
46 | Ehsan Kermani (translation of convolutional neural networks)
47 | Erfan Noury (review of convolutional neural networks)
48 |
49 | AlisterTA (translation of deep learning)
50 | Mohammad Karimi (review of deep learning)
51 | Erfan Noury (review of deep learning)
52 |
53 | AlisterTA (translation of deep learning tips and tricks)
54 | Erfan Noury (review of deep learning tips and tricks)
55 |
56 | Erfan Noury (translation of linear algebra)
57 | Mohammad Karimi (review of linear algebra)
58 |
59 | AlisterTA (translation of machine learning tips and tricks)
60 | Mohammad Reza (translation of machine learning tips and tricks)
61 | Erfan Noury (review of machine learning tips and tricks)
62 | Mohammad Karimi (review of machine learning tips and tricks)
63 |
64 | Erfan Noury (translation of probabilities and statistics)
65 | Mohammad Karimi (review of probabilities and statistics)
66 |
67 | AlisterTA (translation of recurrent neural networks)
68 | Erfan
Noury (review of recurrent neural networks)
69 |
70 | Amirhosein Kazemnejad (translation of supervised learning)
71 | Erfan Noury (review of supervised learning)
72 | Mohammad Karimi (review of supervised learning)
73 |
74 | Erfan Noury (translation of unsupervised learning)
75 | Mohammad Karimi (review of unsupervised learning)
76 |
77 | --fr
78 | Original authors
79 |
80 | --he
81 |
82 | --hi
83 |
84 | --ko
85 | Wooil Jeong (translation of machine learning tips and tricks)
86 |
87 | Wooil Jeong (translation of probabilities and statistics)
88 |
89 | Kwang Hyeok Ahn (translation of Unsupervised Learning)
90 |
91 | --ja
92 |
93 | --pt
94 | Leticia Portella (translation of convolutional neural networks)
95 | Gabriel Aparecido Fonseca (review of convolutional neural networks)
96 |
97 | Gabriel Fonseca (translation of deep learning)
98 | Leticia Portella (review of deep learning)
99 |
100 | Gabriel Fonseca (translation of linear algebra)
101 | Leticia Portella (review of linear algebra)
102 |
103 | Fernando Santos (translation of machine learning tips and tricks)
104 | Leticia Portella (review of machine learning tips and tricks)
105 | Gabriel Fonseca (review of machine learning tips and tricks)
106 |
107 | Leticia Portella (translation of probabilities and statistics)
108 | Flavio Clesio (review of probabilities and statistics)
109 |
110 | Leticia Portella (translation of supervised learning)
111 | Gabriel Fonseca (review of supervised learning)
112 | Flavio Clesio (review of supervised learning)
113 |
114 | Gabriel Fonseca (translation of unsupervised learning)
115 | Tiago Danin (review of unsupervised learning)
116 |
117 | --tr
118 | Ayyüce Kızrak (translation of convolutional neural networks)
119 | Yavuz Kömeçoğlu (review of convolutional neural networks)
120 |
121 | Ekrem Çetinkaya (translation of deep learning)
122 | Omer Bukte (review of deep learning)
123 |
124 | Ayyüce Kızrak (translation of deep learning tips and tricks)
125 | Yavuz Kömeçoğlu (review of deep
learning tips and tricks)
126 |
127 | Kadir Tekeli (translation of linear algebra)
128 | Ekrem Çetinkaya (review of linear algebra)
129 |
130 | Seray Beşer (translation of machine learning tips and tricks)
131 | Ayyüce Kızrak (review of machine learning tips and tricks)
132 | Yavuz Kömeçoğlu (review of machine learning tips and tricks)
133 |
134 | Ayyüce Kızrak (translation of probabilities and statistics)
135 | Başak Buluz (review of probabilities and statistics)
136 |
137 | Başak Buluz (translation of recurrent neural networks)
138 | Yavuz Kömeçoğlu (review of recurrent neural networks)
139 |
140 | Başak Buluz (translation of supervised learning)
141 | Ayyüce Kızrak (review of supervised learning)
142 |
143 | Yavuz Kömeçoğlu (translation of unsupervised learning)
144 | Başak Buluz (review of unsupervised learning)
145 |
146 | --uk
147 | Gregory Reshetniak (translation of probabilities and statistics)
148 | Denys (review of probabilities and statistics)
149 |
150 | --zh
151 | Wang Hongnian (translation of supervised learning)
152 | Xiaohu Zhu (朱小虎) (review of supervised learning)
153 | Chaoying Xue (review of supervised learning)
154 |
155 | --zh-tw
156 | kevingo (translation of deep learning)
157 | TobyOoO (review of deep learning)
158 |
159 | kevingo (translation of linear algebra)
160 | Miyaya (review of linear algebra)
161 |
162 | kevingo (translation of probabilities and statistics)
163 | johnnychhsu (review of probabilities and statistics)
164 |
165 | kevingo (translation of supervised learning)
166 | accelsao (review of supervised learning)
167 |
168 | kevingo (translation of unsupervised learning)
169 | imironhead (review of unsupervised learning)
170 | johnnychhsu (review of unsupervised learning)
171 |
--------------------------------------------------------------------------------

/ar/cheatsheet-machine-learning-tips-and-tricks.md:
--------------------------------------------------------------------------------
1 | **1.
Machine Learning tips and tricks cheatsheet** 2 | 3 | ⟶ 4 | 5 |
6 | 7 | **2. Classification metrics** 8 | 9 | ⟶ 10 | 11 |
12 | 13 | **3. In the context of binary classification, here are the main metrics that are important to track in order to assess the performance of the model.** 14 | 15 | ⟶ 16 | 17 | <br> 
18 | 19 | **4. Confusion matrix ― The confusion matrix is used to have a more complete picture when assessing the performance of a model. It is defined as follows:** 20 | 21 | ⟶ 22 | 23 |
24 | 25 | **5. [Predicted class, Actual class]** 26 | 27 | ⟶ 28 | 29 |
30 | 31 | **6. Main metrics ― The following metrics are commonly used to assess the performance of classification models:** 32 | 33 | ⟶ 34 | 35 |
36 | 37 | **7. [Metric, Formula, Interpretation]** 38 | 39 | ⟶ 40 | 41 |
42 | 43 | **8. Overall performance of model** 44 | 45 | ⟶ 46 | 47 |
48 | 49 | **9. How accurate the positive predictions are** 50 | 51 | ⟶ 52 | 53 |
54 | 55 | **10. Coverage of actual positive sample** 56 | 57 | ⟶ 58 | 59 |
60 | 61 | **11. Coverage of actual negative sample** 62 | 63 | ⟶ 64 | 65 |
66 | 67 | **12. Hybrid metric useful for unbalanced classes** 68 | 69 | ⟶ 70 | 71 |
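As a quick illustration of entries 8-12, the main classification metrics can all be computed directly from confusion-matrix counts. The TP/FP/FN/TN values below are made-up illustrative numbers, not taken from the cheatsheet:

```python
# Hypothetical confusion-matrix counts for a binary classifier
# (illustrative numbers only).
TP, FP, FN, TN = 40, 10, 5, 45

accuracy = (TP + TN) / (TP + TN + FP + FN)           # overall performance of model
precision = TP / (TP + FP)                           # how accurate the positive predictions are
recall = TP / (TP + FN)                              # coverage of actual positives (TPR)
specificity = TN / (TN + FP)                         # coverage of actual negatives
f1 = 2 * precision * recall / (precision + recall)   # hybrid metric for unbalanced classes

print(accuracy, precision, round(recall, 3), round(specificity, 3), round(f1, 3))
```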
72 | 73 | **13. ROC ― The receiver operating curve, also noted ROC, is the plot of TPR versus FPR by varying the threshold. These metrics are summed up in the table below:** 74 | 75 | ⟶ 76 | 77 | <br> 
78 | 79 | **14. [Metric, Formula, Equivalent]** 80 | 81 | ⟶ 82 | 83 |
84 | 85 | **15. AUC ― The area under the receiver operating curve, also noted AUC or AUROC, is the area below the ROC as shown in the following figure:** 86 | 87 | ⟶ 88 | 89 | <br> 
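A minimal sketch of entries 13 and 15: sweep the decision threshold over the scores to get (FPR, TPR) points, then integrate with the trapezoidal rule for the AUC. The labels/scores are illustrative data and `roc_points`/`auc` are hypothetical helper names, not part of the cheatsheet:

```python
# Illustrative labels and classifier scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

def roc_points(labels, scores):
    # One (FPR, TPR) point per distinct threshold, highest threshold first.
    thresholds = sorted(set(scores), reverse=True)
    P = sum(labels)
    N = len(labels) - P
    pts = [(0.0, 0.0)]
    for t in thresholds:
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        pts.append((fp / N, tp / P))
    return pts

def auc(pts):
    # Area under the ROC via the trapezoidal rule.
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

print(auc(roc_points(labels, scores)))
```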
90 | 91 | **16. [Actual, Predicted]** 92 | 93 | ⟶ 94 | 95 |
96 | 97 | **17. Basic metrics ― Given a regression model f, the following metrics are commonly used to assess the performance of the model:** 98 | 99 | ⟶ 100 | 101 |
102 | 103 | **18. [Total sum of squares, Explained sum of squares, Residual sum of squares]** 104 | 105 | ⟶ 106 | 107 |
108 | 109 | **19. Coefficient of determination ― The coefficient of determination, often noted R2 or r2, provides a measure of how well the observed outcomes are replicated by the model and is defined as follows:** 110 | 111 | ⟶ 112 | 113 |
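Entry 19's definition can be written as R² = 1 − SS_res/SS_tot, using the sums of squares from entry 18. The observed and predicted values below are illustrative only:

```python
# Illustrative observed outcomes and model predictions.
y = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.8, 5.1, 7.2, 8.9]

mean_y = sum(y) / len(y)
ss_tot = sum((yi - mean_y) ** 2 for yi in y)               # total sum of squares
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))   # residual sum of squares
r2 = 1 - ss_res / ss_tot                                   # coefficient of determination

print(round(r2, 4))
```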
114 | 115 | **20. Main metrics ― The following metrics are commonly used to assess the performance of regression models, taking into account the number of variables n that they use:** 116 | 117 | ⟶ 118 | 119 | <br> 
120 | 121 | **21. where L is the likelihood and ˆσ2 is an estimate of the variance associated with each response.** 122 | 123 | ⟶ 124 | 125 |
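Under a Gaussian-noise assumption, the maximized log-likelihood of entry 21 reduces, up to an additive constant, to −m/2 · ln(RSS/m). This gives a hedged sketch of AIC and BIC in their standard forms (2k − 2 ln L and k ln m − 2 ln L), rather than the cheatsheet's exact notation; the `aic_bic` helper and its inputs are illustrative:

```python
import math

def aic_bic(rss, k, m):
    # Gaussian-error assumption: log-likelihood up to an additive constant.
    log_lik = -0.5 * m * math.log(rss / m)
    aic = 2 * k - 2 * log_lik            # AIC = 2k - 2 ln L
    bic = k * math.log(m) - 2 * log_lik  # BIC = k ln m - 2 ln L
    return aic, bic

# Illustrative residual sum of squares, parameter count, and sample size.
aic, bic = aic_bic(rss=12.5, k=3, m=50)
print(aic < bic)  # for m > e^2 samples, BIC penalizes extra variables more than AIC
```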
126 | 127 | **22. Model selection** 128 | 129 | ⟶ 130 | 131 |
132 | 133 | **23. Vocabulary ― When selecting a model, we distinguish 3 different parts of the data that we have as follows:** 134 | 135 | ⟶ 136 | 137 |
138 | 139 | **24. [Training set, Validation set, Testing set]** 140 | 141 | ⟶ 142 | 143 |
144 | 145 | **25. [Model is trained, Model is assessed, Model gives predictions]** 146 | 147 | ⟶ 148 | 149 |
150 | 151 | **26. [Usually 80% of the dataset, Usually 20% of the dataset]** 152 | 153 | ⟶ 154 | 155 |
156 | 157 | **27. [Also called hold-out or development set, Unseen data]** 158 | 159 | ⟶ 160 | 161 |
162 | 163 | **28. Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set. These are represented in the figure below:** 164 | 165 | ⟶ 166 | 167 |
168 | 169 | **29. Cross-validation ― Cross-validation, also noted CV, is a method that is used to select a model that does not rely too much on the initial training set. The different types are summed up in the table below:** 170 | 171 | ⟶ 172 | 173 |
174 | 175 | **30. [Training on k−1 folds and assessment on the remaining one, Training on n−p observations and assessment on the p remaining ones]** 176 | 177 | ⟶ 178 | 179 |
180 | 181 | **31. [Generally k=5 or 10, Case p=1 is called leave-one-out]** 182 | 183 | ⟶ 184 | 185 |
186 | 187 | **32. The most commonly used method is called k-fold cross-validation and splits the training data into k folds to validate the model on one fold while training the model on the k−1 other folds, all of this k times. The error is then averaged over the k folds and is named cross-validation error.** 188 | 189 | ⟶ 190 | 191 |
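The k-fold procedure of entry 32 can be sketched without any library: split the indices into k folds, hold one fold out at a time, and average the per-fold errors. `model_error` below is a hypothetical stand-in for fitting a model on the training folds and scoring it on the validation fold:

```python
def k_fold_indices(n, k):
    # Distribute n indices into k (nearly) equal contiguous folds, no shuffling.
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validation_error(n, k, model_error):
    # model_error(train_idx, val_idx) stands in for "fit, then score".
    folds = k_fold_indices(n, k)
    errors = []
    for i, val_idx in enumerate(folds):
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        errors.append(model_error(train_idx, val_idx))
    return sum(errors) / k  # the cross-validation error

# Toy run: the "error" is just the validation-fold size, so the average is n/k.
print(cross_validation_error(10, 5, lambda train, val: len(val)))
```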
192 | 193 | **33. Regularization ― The regularization procedure aims at preventing the model from overfitting the data and thus deals with high variance issues. The following table sums up the different types of commonly used regularization techniques:** 194 | 195 | ⟶ 196 | 197 | <br> 
198 | 199 | **34. [Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]** 200 | 201 | ⟶ 202 | 203 |
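The coefficient shrinkage described in entry 34 can be seen in the simplest ridge (L2) setting — one feature, no intercept — where the closed form w = Σxy / (Σx² + λ) applies. The data and the `ridge_1d` helper are illustrative:

```python
def ridge_1d(x, y, lam):
    # Closed-form ridge coefficient for a single feature without intercept:
    # minimizing sum((y - w*x)^2) + lam * w^2 gives w = sum(xy) / (sum(x^2) + lam).
    return sum(xi * yi for xi, yi in zip(x, y)) / (sum(xi * xi for xi in x) + lam)

x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
print(ridge_1d(x, y, 0.0))  # lam=0 is ordinary least squares
print(ridge_1d(x, y, 1.0))  # the L2 penalty shrinks the coefficient toward 0
```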
204 | 205 | **35. Diagnostics** 206 | 207 | ⟶ 208 | 209 |
210 | 211 | **36. Bias ― The bias of a model is the difference between the expected prediction and the correct values that we try to predict for given data points.** 212 | 213 | ⟶ 214 | 215 | <br> 
216 | 217 | **37. Variance ― The variance of a model is the variability of the model prediction for given data points.** 218 | 219 | ⟶ 220 | 221 |
222 | 223 | **38. Bias/variance tradeoff ― The simpler the model, the higher the bias, and the more complex the model, the higher the variance.** 224 | 225 | ⟶ 226 | 227 |
228 | 229 | **39. [Symptoms, Regression illustration, classification illustration, deep learning illustration, possible remedies]** 230 | 231 | ⟶ 232 | 233 |
234 | 235 | **40. [High training error, Training error close to test error, High bias, Training error slightly lower than test error, Very low training error, Training error much lower than test error, High variance]** 236 | 237 | ⟶ 238 | 239 |
240 | 241 | **41. [Complexify model, Add more features, Train longer, Perform regularization, Get more data]** 242 | 243 | ⟶ 244 | 245 |
246 | 247 | **42. Error analysis ― Error analysis is analyzing the root cause of the difference in performance between the current and the perfect models.** 248 | 249 | ⟶ 250 | 251 |
252 | 253 | **43. Ablative analysis ― Ablative analysis is analyzing the root cause of the difference in performance between the current and the baseline models.** 254 | 255 | ⟶ 256 | 257 |
258 | 259 | **44. Regression metrics** 260 | 261 | ⟶ 262 | 263 |
264 | 265 | **45. [Classification metrics, confusion matrix, accuracy, precision, recall, F1 score, ROC]** 266 | 267 | ⟶ 268 | 269 |
270 | 271 | **46. [Regression metrics, R squared, Mallow's CP, AIC, BIC]** 272 | 273 | ⟶ 274 | 275 |
276 | 277 | **47. [Model selection, cross-validation, regularization]** 278 | 279 | ⟶ 280 | 281 |
282 | 283 | **48. [Diagnostics, Bias/variance tradeoff, error/ablative analysis]** 284 | 285 | ⟶ 286 | -------------------------------------------------------------------------------- /de/cheatsheet-machine-learning-tips-and-tricks.md: -------------------------------------------------------------------------------- 1 | **1. Machine Learning tips and tricks cheatsheet** 2 | 3 | ⟶ 4 | 5 |
6 | 7 | **2. Classification metrics** 8 | 9 | ⟶ 10 | 11 |
12 | 13 | **3. In the context of binary classification, here are the main metrics that are important to track in order to assess the performance of the model.** 14 | 15 | ⟶ 16 | 17 | <br> 
18 | 19 | **4. Confusion matrix ― The confusion matrix is used to have a more complete picture when assessing the performance of a model. It is defined as follows:** 20 | 21 | ⟶ 22 | 23 |
24 | 25 | **5. [Predicted class, Actual class]** 26 | 27 | ⟶ 28 | 29 |
30 | 31 | **6. Main metrics ― The following metrics are commonly used to assess the performance of classification models:** 32 | 33 | ⟶ 34 | 35 |
36 | 37 | **7. [Metric, Formula, Interpretation]** 38 | 39 | ⟶ 40 | 41 |
42 | 43 | **8. Overall performance of model** 44 | 45 | ⟶ 46 | 47 |
48 | 49 | **9. How accurate the positive predictions are** 50 | 51 | ⟶ 52 | 53 |
54 | 55 | **10. Coverage of actual positive sample** 56 | 57 | ⟶ 58 | 59 |
60 | 61 | **11. Coverage of actual negative sample** 62 | 63 | ⟶ 64 | 65 |
66 | 67 | **12. Hybrid metric useful for unbalanced classes** 68 | 69 | ⟶ 70 | 71 |
72 | 73 | **13. ROC ― The receiver operating curve, also noted ROC, is the plot of TPR versus FPR by varying the threshold. These metrics are summed up in the table below:** 74 | 75 | ⟶ 76 | 77 | <br> 
78 | 79 | **14. [Metric, Formula, Equivalent]** 80 | 81 | ⟶ 82 | 83 |
84 | 85 | **15. AUC ― The area under the receiver operating curve, also noted AUC or AUROC, is the area below the ROC as shown in the following figure:** 86 | 87 | ⟶ 88 | 89 | <br> 
90 | 91 | **16. [Actual, Predicted]** 92 | 93 | ⟶ 94 | 95 |
96 | 97 | **17. Basic metrics ― Given a regression model f, the following metrics are commonly used to assess the performance of the model:** 98 | 99 | ⟶ 100 | 101 |
102 | 103 | **18. [Total sum of squares, Explained sum of squares, Residual sum of squares]** 104 | 105 | ⟶ 106 | 107 |
108 | 109 | **19. Coefficient of determination ― The coefficient of determination, often noted R2 or r2, provides a measure of how well the observed outcomes are replicated by the model and is defined as follows:** 110 | 111 | ⟶ 112 | 113 |
114 | 115 | **20. Main metrics ― The following metrics are commonly used to assess the performance of regression models, taking into account the number of variables n that they use:** 116 | 117 | ⟶ 118 | 119 | <br> 
120 | 121 | **21. where L is the likelihood and ˆσ2 is an estimate of the variance associated with each response.** 122 | 123 | ⟶ 124 | 125 |
126 | 127 | **22. Model selection** 128 | 129 | ⟶ 130 | 131 |
132 | 133 | **23. Vocabulary ― When selecting a model, we distinguish 3 different parts of the data that we have as follows:** 134 | 135 | ⟶ 136 | 137 |
138 | 139 | **24. [Training set, Validation set, Testing set]** 140 | 141 | ⟶ 142 | 143 |
144 | 145 | **25. [Model is trained, Model is assessed, Model gives predictions]** 146 | 147 | ⟶ 148 | 149 |
150 | 151 | **26. [Usually 80% of the dataset, Usually 20% of the dataset]** 152 | 153 | ⟶ 154 | 155 |
156 | 157 | **27. [Also called hold-out or development set, Unseen data]** 158 | 159 | ⟶ 160 | 161 |
162 | 163 | **28. Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set. These are represented in the figure below:** 164 | 165 | ⟶ 166 | 167 |
168 | 169 | **29. Cross-validation ― Cross-validation, also noted CV, is a method that is used to select a model that does not rely too much on the initial training set. The different types are summed up in the table below:** 170 | 171 | ⟶ 172 | 173 |
174 | 175 | **30. [Training on k−1 folds and assessment on the remaining one, Training on n−p observations and assessment on the p remaining ones]** 176 | 177 | ⟶ 178 | 179 |
180 | 181 | **31. [Generally k=5 or 10, Case p=1 is called leave-one-out]** 182 | 183 | ⟶ 184 | 185 |
186 | 187 | **32. The most commonly used method is called k-fold cross-validation and splits the training data into k folds to validate the model on one fold while training the model on the k−1 other folds, all of this k times. The error is then averaged over the k folds and is named cross-validation error.** 188 | 189 | ⟶ 190 | 191 |
192 | 193 | **33. Regularization ― The regularization procedure aims at preventing the model from overfitting the data and thus deals with high variance issues. The following table sums up the different types of commonly used regularization techniques:** 194 | 195 | ⟶ 196 | 197 | <br> 
198 | 199 | **34. [Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]** 200 | 201 | ⟶ 202 | 203 |
204 | 205 | **35. Diagnostics** 206 | 207 | ⟶ 208 | 209 |
210 | 211 | **36. Bias ― The bias of a model is the difference between the expected prediction and the correct values that we try to predict for given data points.** 212 | 213 | ⟶ 214 | 215 | <br> 
216 | 217 | **37. Variance ― The variance of a model is the variability of the model prediction for given data points.** 218 | 219 | ⟶ 220 | 221 |
222 | 223 | **38. Bias/variance tradeoff ― The simpler the model, the higher the bias, and the more complex the model, the higher the variance.** 224 | 225 | ⟶ 226 | 227 |
228 | 229 | **39. [Symptoms, Regression illustration, classification illustration, deep learning illustration, possible remedies]** 230 | 231 | ⟶ 232 | 233 |
234 | 235 | **40. [High training error, Training error close to test error, High bias, Training error slightly lower than test error, Very low training error, Training error much lower than test error, High variance]** 236 | 237 | ⟶ 238 | 239 |
240 | 241 | **41. [Complexify model, Add more features, Train longer, Perform regularization, Get more data]** 242 | 243 | ⟶ 244 | 245 |
246 | 247 | **42. Error analysis ― Error analysis is analyzing the root cause of the difference in performance between the current and the perfect models.** 248 | 249 | ⟶ 250 | 251 |
252 | 253 | **43. Ablative analysis ― Ablative analysis is analyzing the root cause of the difference in performance between the current and the baseline models.** 254 | 255 | ⟶ 256 | 257 |
258 | 259 | **44. Regression metrics** 260 | 261 | ⟶ 262 | 263 |
264 | 265 | **45. [Classification metrics, confusion matrix, accuracy, precision, recall, F1 score, ROC]** 266 | 267 | ⟶ 268 | 269 |
270 | 271 | **46. [Regression metrics, R squared, Mallow's CP, AIC, BIC]** 272 | 273 | ⟶ 274 | 275 |
276 | 277 | **47. [Model selection, cross-validation, regularization]** 278 | 279 | ⟶ 280 | 281 |
282 | 283 | **48. [Diagnostics, Bias/variance tradeoff, error/ablative analysis]** 284 | 285 | ⟶ 286 | -------------------------------------------------------------------------------- /he/cheatsheet-machine-learning-tips-and-tricks.md: -------------------------------------------------------------------------------- 1 | **1. Machine Learning tips and tricks cheatsheet** 2 | 3 | ⟶ 4 | 5 |
6 | 7 | **2. Classification metrics** 8 | 9 | ⟶ 10 | 11 |
12 | 13 | **3. In the context of binary classification, here are the main metrics that are important to track in order to assess the performance of the model.** 14 | 15 | ⟶ 16 | 17 | <br> 
18 | 19 | **4. Confusion matrix ― The confusion matrix is used to have a more complete picture when assessing the performance of a model. It is defined as follows:** 20 | 21 | ⟶ 22 | 23 |
24 | 25 | **5. [Predicted class, Actual class]** 26 | 27 | ⟶ 28 | 29 |
30 | 31 | **6. Main metrics ― The following metrics are commonly used to assess the performance of classification models:** 32 | 33 | ⟶ 34 | 35 |
36 | 37 | **7. [Metric, Formula, Interpretation]** 38 | 39 | ⟶ 40 | 41 |
42 | 43 | **8. Overall performance of model** 44 | 45 | ⟶ 46 | 47 |
48 | 49 | **9. How accurate the positive predictions are** 50 | 51 | ⟶ 52 | 53 |
54 | 55 | **10. Coverage of actual positive sample** 56 | 57 | ⟶ 58 | 59 |
60 | 61 | **11. Coverage of actual negative sample** 62 | 63 | ⟶ 64 | 65 |
66 | 67 | **12. Hybrid metric useful for unbalanced classes** 68 | 69 | ⟶ 70 | 71 |
72 | 73 | **13. ROC ― The receiver operating curve, also noted ROC, is the plot of TPR versus FPR by varying the threshold. These metrics are summed up in the table below:** 74 | 75 | ⟶ 76 | 77 | <br> 
78 | 79 | **14. [Metric, Formula, Equivalent]** 80 | 81 | ⟶ 82 | 83 |
84 | 85 | **15. AUC ― The area under the receiver operating curve, also noted AUC or AUROC, is the area below the ROC as shown in the following figure:** 86 | 87 | ⟶ 88 | 89 | <br> 
90 | 91 | **16. [Actual, Predicted]** 92 | 93 | ⟶ 94 | 95 |
96 | 97 | **17. Basic metrics ― Given a regression model f, the following metrics are commonly used to assess the performance of the model:** 98 | 99 | ⟶ 100 | 101 |
102 | 103 | **18. [Total sum of squares, Explained sum of squares, Residual sum of squares]** 104 | 105 | ⟶ 106 | 107 |
108 | 109 | **19. Coefficient of determination ― The coefficient of determination, often noted R2 or r2, provides a measure of how well the observed outcomes are replicated by the model and is defined as follows:** 110 | 111 | ⟶ 112 | 113 |
114 | 115 | **20. Main metrics ― The following metrics are commonly used to assess the performance of regression models, taking into account the number of variables n that they use:** 116 | 117 | ⟶ 118 | 119 | <br> 
120 | 121 | **21. where L is the likelihood and ˆσ2 is an estimate of the variance associated with each response.** 122 | 123 | ⟶ 124 | 125 |
126 | 127 | **22. Model selection** 128 | 129 | ⟶ 130 | 131 |
132 | 133 | **23. Vocabulary ― When selecting a model, we distinguish 3 different parts of the data that we have as follows:** 134 | 135 | ⟶ 136 | 137 |
138 | 139 | **24. [Training set, Validation set, Testing set]** 140 | 141 | ⟶ 142 | 143 |
144 | 145 | **25. [Model is trained, Model is assessed, Model gives predictions]** 146 | 147 | ⟶ 148 | 149 |
150 | 151 | **26. [Usually 80% of the dataset, Usually 20% of the dataset]** 152 | 153 | ⟶ 154 | 155 |
156 | 157 | **27. [Also called hold-out or development set, Unseen data]** 158 | 159 | ⟶ 160 | 161 |
162 | 163 | **28. Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set. These are represented in the figure below:** 164 | 165 | ⟶ 166 | 167 |
168 | 169 | **29. Cross-validation ― Cross-validation, also noted CV, is a method that is used to select a model that does not rely too much on the initial training set. The different types are summed up in the table below:** 170 | 171 | ⟶ 172 | 173 |
174 | 175 | **30. [Training on k−1 folds and assessment on the remaining one, Training on n−p observations and assessment on the p remaining ones]** 176 | 177 | ⟶ 178 | 179 |
180 | 181 | **31. [Generally k=5 or 10, Case p=1 is called leave-one-out]** 182 | 183 | ⟶ 184 | 185 |
186 | 187 | **32. The most commonly used method is called k-fold cross-validation and splits the training data into k folds to validate the model on one fold while training the model on the k−1 other folds, all of this k times. The error is then averaged over the k folds and is named cross-validation error.** 188 | 189 | ⟶ 190 | 191 |
192 | 193 | **33. Regularization ― The regularization procedure aims at preventing the model from overfitting the data and thus deals with high variance issues. The following table sums up the different types of commonly used regularization techniques:** 194 | 195 | ⟶ 196 | 197 | <br> 
198 | 199 | **34. [Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]** 200 | 201 | ⟶ 202 | 203 |
204 | 205 | **35. Diagnostics** 206 | 207 | ⟶ 208 | 209 |
210 | 211 | **36. Bias ― The bias of a model is the difference between the expected prediction and the correct values that we try to predict for given data points.** 212 | 213 | ⟶ 214 | 215 | <br> 
216 | 217 | **37. Variance ― The variance of a model is the variability of the model prediction for given data points.** 218 | 219 | ⟶ 220 | 221 |
222 | 223 | **38. Bias/variance tradeoff ― The simpler the model, the higher the bias, and the more complex the model, the higher the variance.** 224 | 225 | ⟶ 226 | 227 |
228 | 229 | **39. [Symptoms, Regression illustration, classification illustration, deep learning illustration, possible remedies]** 230 | 231 | ⟶ 232 | 233 |
234 | 235 | **40. [High training error, Training error close to test error, High bias, Training error slightly lower than test error, Very low training error, Training error much lower than test error, High variance]** 236 | 237 | ⟶ 238 | 239 |
240 | 241 | **41. [Complexify model, Add more features, Train longer, Perform regularization, Get more data]** 242 | 243 | ⟶ 244 | 245 |
246 | 247 | **42. Error analysis ― Error analysis is analyzing the root cause of the difference in performance between the current and the perfect models.** 248 | 249 | ⟶ 250 | 251 |
252 | 253 | **43. Ablative analysis ― Ablative analysis is analyzing the root cause of the difference in performance between the current and the baseline models.** 254 | 255 | ⟶ 256 | 257 |
258 | 259 | **44. Regression metrics** 260 | 261 | ⟶ 262 | 263 |
264 | 265 | **45. [Classification metrics, confusion matrix, accuracy, precision, recall, F1 score, ROC]** 266 | 267 | ⟶ 268 | 269 |
270 | 271 | **46. [Regression metrics, R squared, Mallow's CP, AIC, BIC]** 272 | 273 | ⟶ 274 | 275 |
276 | 277 | **47. [Model selection, cross-validation, regularization]** 278 | 279 | ⟶ 280 | 281 |
282 | 283 | **48. [Diagnostics, Bias/variance tradeoff, error/ablative analysis]** 284 | 285 | ⟶ 286 | -------------------------------------------------------------------------------- /hi/cheatsheet-machine-learning-tips-and-tricks.md: -------------------------------------------------------------------------------- 1 | **1. Machine Learning tips and tricks cheatsheet** 2 | 3 | ⟶ 4 | 5 |
6 | 7 | **2. Classification metrics** 8 | 9 | ⟶ 10 | 11 |
12 | 13 | **3. In the context of binary classification, here are the main metrics that are important to track in order to assess the performance of the model.** 14 | 15 | ⟶ 16 | 17 | <br> 
18 | 19 | **4. Confusion matrix ― The confusion matrix is used to have a more complete picture when assessing the performance of a model. It is defined as follows:** 20 | 21 | ⟶ 22 | 23 |
24 | 25 | **5. [Predicted class, Actual class]** 26 | 27 | ⟶ 28 | 29 |
30 | 31 | **6. Main metrics ― The following metrics are commonly used to assess the performance of classification models:** 32 | 33 | ⟶ 34 | 35 |
36 | 37 | **7. [Metric, Formula, Interpretation]** 38 | 39 | ⟶ 40 | 41 |
42 | 43 | **8. Overall performance of model** 44 | 45 | ⟶ 46 | 47 |
48 | 49 | **9. How accurate the positive predictions are** 50 | 51 | ⟶ 52 | 53 |
54 | 55 | **10. Coverage of actual positive sample** 56 | 57 | ⟶ 58 | 59 |
60 | 61 | **11. Coverage of actual negative sample** 62 | 63 | ⟶ 64 | 65 |
66 | 67 | **12. Hybrid metric useful for unbalanced classes** 68 | 69 | ⟶ 70 | 71 |
72 | 73 | **13. ROC ― The receiver operating curve, also noted ROC, is the plot of TPR versus FPR by varying the threshold. These metrics are summed up in the table below:** 74 | 75 | ⟶ 76 | 77 | <br> 
78 | 79 | **14. [Metric, Formula, Equivalent]** 80 | 81 | ⟶ 82 | 83 |
84 | 85 | **15. AUC ― The area under the receiver operating curve, also noted AUC or AUROC, is the area below the ROC as shown in the following figure:** 86 | 87 | ⟶ 88 | 89 | <br> 
90 | 91 | **16. [Actual, Predicted]** 92 | 93 | ⟶ 94 | 95 |
96 | 97 | **17. Basic metrics ― Given a regression model f, the following metrics are commonly used to assess the performance of the model:** 98 | 99 | ⟶ 100 | 101 |
102 | 103 | **18. [Total sum of squares, Explained sum of squares, Residual sum of squares]** 104 | 105 | ⟶ 106 | 107 |
108 | 109 | **19. Coefficient of determination ― The coefficient of determination, often noted R2 or r2, provides a measure of how well the observed outcomes are replicated by the model and is defined as follows:** 110 | 111 | ⟶ 112 | 113 |
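A minimal numeric sketch of this definition, assuming small made-up arrays of observed and predicted values:

```python
import numpy as np

# Coefficient of determination: R^2 = 1 - SSres / SStot.
# y and y_hat are assumed toy values for illustration.
y = np.array([3.0, 5.0, 7.0, 9.0])      # observed outcomes
y_hat = np.array([2.8, 5.1, 7.2, 8.9])  # model predictions

ss_res = np.sum((y - y_hat) ** 2)       # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)    # total sum of squares
r2 = 1 - ss_res / ss_tot                # close to 1 when predictions replicate y well
```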
114 | 115 | **20. Main metrics ― The following metrics are commonly used to assess the performance of regression models, taking into account the number of variables n that they consider:** 116 | 117 | ⟶ 118 | 119 |
120 | 121 | **21. where L is the likelihood and ˆσ2 is an estimate of the variance associated with each response.** 122 | 123 | ⟶ 124 | 125 |
126 | 127 | **22. Model selection** 128 | 129 | ⟶ 130 | 131 |
132 | 133 | **23. Vocabulary ― When selecting a model, we distinguish 3 different parts of the data that we have as follows:** 134 | 135 | ⟶ 136 | 137 |
138 | 139 | **24. [Training set, Validation set, Testing set]** 140 | 141 | ⟶ 142 | 143 |
144 | 145 | **25. [Model is trained, Model is assessed, Model gives predictions]** 146 | 147 | ⟶ 148 | 149 |
150 | 151 | **26. [Usually 80% of the dataset, Usually 20% of the dataset]** 152 | 153 | ⟶ 154 | 155 |
156 | 157 | **27. [Also called hold-out or development set, Unseen data]** 158 | 159 | ⟶ 160 | 161 |
162 | 163 | **28. Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set. These are represented in the figure below:** 164 | 165 | ⟶ 166 | 167 |
168 | 169 | **29. Cross-validation ― Cross-validation, also noted CV, is a method that is used to select a model that does not rely too much on the initial training set. The different types are summed up in the table below:** 170 | 171 | ⟶ 172 | 173 |
174 | 175 | **30. [Training on k−1 folds and assessment on the remaining one, Training on n−p observations and assessment on the p remaining ones]** 176 | 177 | ⟶ 178 | 179 |
180 | 181 | **31. [Generally k=5 or 10, Case p=1 is called leave-one-out]** 182 | 183 | ⟶ 184 | 185 |
186 | 187 | **32. The most commonly used method is called k-fold cross-validation and splits the training data into k folds to validate the model on one fold while training the model on the k−1 other folds, all of this k times. The error is then averaged over the k folds and is named cross-validation error.** 188 | 189 | ⟶ 190 | 191 |
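The fold bookkeeping described above can be sketched without any learning library (k and the dataset size are assumed toy values):

```python
import numpy as np

# k-fold cross-validation indices: split the data into k folds,
# validate on one fold while training on the other k-1, k times.
n, k = 20, 5
idx = np.arange(n)
folds = np.array_split(idx, k)

splits = []
for i in range(k):
    val = folds[i]                                               # held-out fold
    train = np.concatenate([folds[j] for j in range(k) if j != i])
    splits.append((train, val))
# Averaging the k validation errors gives the cross-validation error.
```

Every observation appears exactly once as validation data, which is what makes the averaged error a less training-set-dependent estimate.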
192 | 193 | **33. Regularization ― The regularization procedure aims at preventing the model from overfitting the data and thus deals with high variance issues. The following table sums up the different types of commonly used regularization techniques:** 194 | 195 | ⟶ 196 | 197 |
198 | 199 | **34. [Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]** 200 | 201 | ⟶ 202 | 203 |
204 | 205 | **35. Diagnostics** 206 | 207 | ⟶ 208 | 209 |
210 | 211 | **36. Bias ― The bias of a model is the difference between the expected prediction and the correct model that we try to predict for given data points.** 212 | 213 | ⟶ 214 | 215 |
216 | 217 | **37. Variance ― The variance of a model is the variability of the model prediction for given data points.** 218 | 219 | ⟶ 220 | 221 |
222 | 223 | **38. Bias/variance tradeoff ― The simpler the model, the higher the bias, and the more complex the model, the higher the variance.** 224 | 225 | ⟶ 226 | 227 |
228 | 229 | **39. [Symptoms, Regression illustration, classification illustration, deep learning illustration, possible remedies]** 230 | 231 | ⟶ 232 | 233 |
234 | 235 | **40. [High training error, Training error close to test error, High bias, Training error slightly lower than test error, Very low training error, Training error much lower than test error, High variance]** 236 | 237 | ⟶ 238 | 239 |
240 | 241 | **41. [Complexify model, Add more features, Train longer, Perform regularization, Get more data]** 242 | 243 | ⟶ 244 | 245 |
246 | 247 | **42. Error analysis ― Error analysis consists of analyzing the root cause of the difference in performance between the current model and the perfect model.** 248 | 249 | ⟶ 250 | 251 |
252 | 253 | **43. Ablative analysis ― Ablative analysis consists of analyzing the root cause of the difference in performance between the current model and the baseline model.** 254 | 255 | ⟶ 256 | 257 |
258 | 259 | **44. Regression metrics** 260 | 261 | ⟶ 262 | 263 |
264 | 265 | **45. [Classification metrics, confusion matrix, accuracy, precision, recall, F1 score, ROC]** 266 | 267 | ⟶ 268 | 269 |
270 | 271 | **46. [Regression metrics, R squared, Mallow's CP, AIC, BIC]** 272 | 273 | ⟶ 274 | 275 |
276 | 277 | **47. [Model selection, cross-validation, regularization]** 278 | 279 | ⟶ 280 | 281 |
282 | 283 | **48. [Diagnostics, Bias/variance tradeoff, error/ablative analysis]** 284 | 285 | ⟶ 286 | -------------------------------------------------------------------------------- /ru/cheatsheet-machine-learning-tips-and-tricks.md: -------------------------------------------------------------------------------- 1 | **1. Machine Learning tips and tricks cheatsheet** 2 | 3 | ⟶ 4 | 5 |
6 | 7 | **2. Classification metrics** 8 | 9 | ⟶ 10 | 11 |
12 | 13 | **3. In the context of binary classification, here are the main metrics that are important to track in order to assess the performance of the model.** 14 | 15 | ⟶ 16 | 17 |
18 | 19 | **4. Confusion matrix ― The confusion matrix is used to have a more complete picture when assessing the performance of a model. It is defined as follows:** 20 | 21 | ⟶ 22 | 23 |
24 | 25 | **5. [Predicted class, Actual class]** 26 | 27 | ⟶ 28 | 29 |
30 | 31 | **6. Main metrics ― The following metrics are commonly used to assess the performance of classification models:** 32 | 33 | ⟶ 34 | 35 |
36 | 37 | **7. [Metric, Formula, Interpretation]** 38 | 39 | ⟶ 40 | 41 |
42 | 43 | **8. Overall performance of model** 44 | 45 | ⟶ 46 | 47 |
48 | 49 | **9. How accurate the positive predictions are** 50 | 51 | ⟶ 52 | 53 |
54 | 55 | **10. Coverage of actual positive sample** 56 | 57 | ⟶ 58 | 59 |
60 | 61 | **11. Coverage of actual negative sample** 62 | 63 | ⟶ 64 | 65 |
66 | 67 | **12. Hybrid metric useful for unbalanced classes** 68 | 69 | ⟶ 70 | 71 |
72 | 73 | **13. ROC ― The receiver operating characteristic curve, also noted ROC, is the plot of TPR versus FPR obtained by varying the threshold. These metrics are summed up in the table below:** 74 | 75 | ⟶ 76 | 77 |
78 | 79 | **14. [Metric, Formula, Equivalent]** 80 | 81 | ⟶ 82 | 83 |
84 | 85 | **15. AUC ― The area under the receiver operating characteristic curve, also noted AUC or AUROC, is the area below the ROC as shown in the following figure:** 86 | 87 | ⟶ 88 | 89 |
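A useful way to see the AUC is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A minimal sketch, with assumed scores and labels:

```python
import numpy as np

# Rank-based view of AUC: fraction of (positive, negative) pairs
# where the positive example is scored higher. Toy data below.
scores = np.array([0.9, 0.8, 0.35, 0.7, 0.2, 0.1])
labels = np.array([1, 1, 1, 0, 0, 0])

pos = scores[labels == 1]
neg = scores[labels == 0]
# Compare every positive score against every negative score.
auc = np.mean(pos[:, None] > neg[None, :])
```

Here one positive (0.35) is outscored by one negative (0.7), so 8 of the 9 pairs are correctly ordered. Ties are ignored in this sketch; a fuller treatment counts them as half.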
90 | 91 | **16. [Actual, Predicted]** 92 | 93 | ⟶ 94 | 95 |
96 | 97 | **17. Basic metrics ― Given a regression model f, the following metrics are commonly used to assess the performance of the model:** 98 | 99 | ⟶ 100 | 101 |
102 | 103 | **18. [Total sum of squares, Explained sum of squares, Residual sum of squares]** 104 | 105 | ⟶ 106 | 107 |
108 | 109 | **19. Coefficient of determination ― The coefficient of determination, often noted R2 or r2, provides a measure of how well the observed outcomes are replicated by the model and is defined as follows:** 110 | 111 | ⟶ 112 | 113 |
114 | 115 | **20. Main metrics ― The following metrics are commonly used to assess the performance of regression models, taking into account the number of variables n that they consider:** 116 | 117 | ⟶ 118 | 119 |
120 | 121 | **21. where L is the likelihood and ˆσ2 is an estimate of the variance associated with each response.** 122 | 123 | ⟶ 124 | 125 |
126 | 127 | **22. Model selection** 128 | 129 | ⟶ 130 | 131 |
132 | 133 | **23. Vocabulary ― When selecting a model, we distinguish 3 different parts of the data that we have as follows:** 134 | 135 | ⟶ 136 | 137 |
138 | 139 | **24. [Training set, Validation set, Testing set]** 140 | 141 | ⟶ 142 | 143 |
144 | 145 | **25. [Model is trained, Model is assessed, Model gives predictions]** 146 | 147 | ⟶ 148 | 149 |
150 | 151 | **26. [Usually 80% of the dataset, Usually 20% of the dataset]** 152 | 153 | ⟶ 154 | 155 |
156 | 157 | **27. [Also called hold-out or development set, Unseen data]** 158 | 159 | ⟶ 160 | 161 |
162 | 163 | **28. Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set. These are represented in the figure below:** 164 | 165 | ⟶ 166 | 167 |
168 | 169 | **29. Cross-validation ― Cross-validation, also noted CV, is a method that is used to select a model that does not rely too much on the initial training set. The different types are summed up in the table below:** 170 | 171 | ⟶ 172 | 173 |
174 | 175 | **30. [Training on k−1 folds and assessment on the remaining one, Training on n−p observations and assessment on the p remaining ones]** 176 | 177 | ⟶ 178 | 179 |
180 | 181 | **31. [Generally k=5 or 10, Case p=1 is called leave-one-out]** 182 | 183 | ⟶ 184 | 185 |
186 | 187 | **32. The most commonly used method is called k-fold cross-validation and splits the training data into k folds to validate the model on one fold while training the model on the k−1 other folds, all of this k times. The error is then averaged over the k folds and is named cross-validation error.** 188 | 189 | ⟶ 190 | 191 |
192 | 193 | **33. Regularization ― The regularization procedure aims at preventing the model from overfitting the data and thus deals with high variance issues. The following table sums up the different types of commonly used regularization techniques:** 194 | 195 | ⟶ 196 | 197 |
198 | 199 | **34. [Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]** 200 | 201 | ⟶ 202 | 203 |
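The "makes coefficients smaller" effect of L2 (ridge) regularization can be seen directly from its closed-form solution. A sketch on made-up data (the design matrix, true weights, and λ below are all assumed values):

```python
import numpy as np

# Ridge regression: w = (X^T X + lam * I)^-1 X^T y.
# lam = 0 recovers ordinary least squares; larger lam shrinks w.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)

def ridge(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_ols = ridge(X, y, 0.0)     # unregularized fit
w_ridge = ridge(X, y, 10.0)  # shrunk coefficients: higher bias, lower variance
```

L1 (lasso) instead drives some coefficients exactly to 0, which is why it is the one described as good for variable selection; it has no closed form like the above.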
204 | 205 | **35. Diagnostics** 206 | 207 | ⟶ 208 | 209 |
210 | 211 | **36. Bias ― The bias of a model is the difference between the expected prediction and the correct model that we try to predict for given data points.** 212 | 213 | ⟶ 214 | 215 |
216 | 217 | **37. Variance ― The variance of a model is the variability of the model prediction for given data points.** 218 | 219 | ⟶ 220 | 221 |
222 | 223 | **38. Bias/variance tradeoff ― The simpler the model, the higher the bias, and the more complex the model, the higher the variance.** 224 | 225 | ⟶ 226 | 227 |
228 | 229 | **39. [Symptoms, Regression illustration, classification illustration, deep learning illustration, possible remedies]** 230 | 231 | ⟶ 232 | 233 |
234 | 235 | **40. [High training error, Training error close to test error, High bias, Training error slightly lower than test error, Very low training error, Training error much lower than test error, High variance]** 236 | 237 | ⟶ 238 | 239 |
240 | 241 | **41. [Complexify model, Add more features, Train longer, Perform regularization, Get more data]** 242 | 243 | ⟶ 244 | 245 |
246 | 247 | **42. Error analysis ― Error analysis consists of analyzing the root cause of the difference in performance between the current model and the perfect model.** 248 | 249 | ⟶ 250 | 251 |
252 | 253 | **43. Ablative analysis ― Ablative analysis consists of analyzing the root cause of the difference in performance between the current model and the baseline model.** 254 | 255 | ⟶ 256 | 257 |
258 | 259 | **44. Regression metrics** 260 | 261 | ⟶ 262 | 263 |
264 | 265 | **45. [Classification metrics, confusion matrix, accuracy, precision, recall, F1 score, ROC]** 266 | 267 | ⟶ 268 | 269 |
270 | 271 | **46. [Regression metrics, R squared, Mallow's CP, AIC, BIC]** 272 | 273 | ⟶ 274 | 275 |
276 | 277 | **47. [Model selection, cross-validation, regularization]** 278 | 279 | ⟶ 280 | 281 |
282 | 283 | **48. [Diagnostics, Bias/variance tradeoff, error/ablative analysis]** 284 | 285 | ⟶ 286 | -------------------------------------------------------------------------------- /zh/cheatsheet-machine-learning-tips-and-tricks.md: -------------------------------------------------------------------------------- 1 | 1. **Machine Learning tips and tricks cheatsheet** 2 | 3 | ⟶ 4 | 5 |
6 | 7 | 2. **Classification metrics** 8 | 9 | ⟶ 10 | 11 |
12 | 13 | 3. **In the context of binary classification, here are the main metrics that are important to track in order to assess the performance of the model.** 14 | 15 | ⟶ 16 | 17 |
18 | 19 | 4. **Confusion matrix ― The confusion matrix is used to have a more complete picture when assessing the performance of a model. It is defined as follows:** 20 | 21 | ⟶ 22 | 23 |
24 | 25 | 5. **[Predicted class, Actual class]** 26 | 27 | ⟶ 28 | 29 |
30 | 31 | 6. **Main metrics ― The following metrics are commonly used to assess the performance of classification models:** 32 | 33 | ⟶ 34 | 35 |
36 | 37 | 7. **[Metric, Formula, Interpretation]** 38 | 39 | ⟶ 40 | 41 |
42 | 43 | 8. **Overall performance of model** 44 | 45 | ⟶ 46 | 47 |
48 | 49 | 9. **How accurate the positive predictions are** 50 | 51 | ⟶ 52 | 53 |
54 | 55 | 10. **Coverage of actual positive sample** 56 | 57 | ⟶ 58 | 59 |
60 | 61 | 11. **Coverage of actual negative sample** 62 | 63 | ⟶ 64 | 65 |
66 | 67 | 12. **Hybrid metric useful for unbalanced classes** 68 | 69 | ⟶ 70 | 71 |
72 | 73 | 13. **ROC ― The receiver operating characteristic curve, also noted ROC, is the plot of TPR versus FPR obtained by varying the threshold. These metrics are summed up in the table below:** 74 | 75 | ⟶ 76 | 77 |
78 | 79 | 14. **[Metric, Formula, Equivalent]** 80 | 81 | ⟶ 82 | 83 |
84 | 85 | 15. **AUC ― The area under the receiver operating characteristic curve, also noted AUC or AUROC, is the area below the ROC as shown in the following figure:** 86 | 87 | ⟶ 88 | 89 |
90 | 91 | 16. **[Actual, Predicted]** 92 | 93 | ⟶ 94 | 95 |
96 | 97 | 17. **Basic metrics ― Given a regression model f, the following metrics are commonly used to assess the performance of the model:** 98 | 99 | ⟶ 100 | 101 |
102 | 103 | 18. **[Total sum of squares, Explained sum of squares, Residual sum of squares]** 104 | 105 | ⟶ 106 | 107 |
108 | 109 | 19. **Coefficient of determination ― The coefficient of determination, often noted R2 or r2, provides a measure of how well the observed outcomes are replicated by the model and is defined as follows:** 110 | 111 | ⟶ 112 | 113 |
114 | 115 | 20. **Main metrics ― The following metrics are commonly used to assess the performance of regression models, taking into account the number of variables n that they consider:** 116 | 117 | ⟶ 118 | 119 |
120 | 121 | 21. **where L is the likelihood and ˆσ2 is an estimate of the variance associated with each response.** 122 | 123 | ⟶ 124 | 125 |
126 | 127 | 22. **Model selection** 128 | 129 | ⟶ 130 | 131 |
132 | 133 | 23. **Vocabulary ― When selecting a model, we distinguish 3 different parts of the data that we have as follows:** 134 | 135 | ⟶ 136 | 137 |
138 | 139 | 24. **[Training set, Validation set, Testing set]** 140 | 141 | ⟶ 142 | 143 |
144 | 145 | 25. **[Model is trained, Model is assessed, Model gives predictions]** 146 | 147 | ⟶ 148 | 149 |
150 | 151 | 26. **[Usually 80% of the dataset, Usually 20% of the dataset]** 152 | 153 | ⟶ 154 | 155 |
156 | 157 | 27. **[Also called hold-out or development set, Unseen data]** 158 | 159 | ⟶ 160 | 161 |
162 | 163 | 28. **Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set. These are represented in the figure below:** 164 | 165 | ⟶ 166 | 167 |
168 | 169 | 29. **Cross-validation ― Cross-validation, also noted CV, is a method that is used to select a model that does not rely too much on the initial training set. The different types are summed up in the table below:** 170 | 171 | ⟶ 172 | 173 |
174 | 175 | 30. **[Training on k−1 folds and assessment on the remaining one, Training on n−p observations and assessment on the p remaining ones]** 176 | 177 | ⟶ 178 | 179 |
180 | 181 | 31. **[Generally k=5 or 10, Case p=1 is called leave-one-out]** 182 | 183 | ⟶ 184 | 185 |
186 | 187 | 32. **The most commonly used method is called k-fold cross-validation and splits the training data into k folds to validate the model on one fold while training the model on the k−1 other folds, all of this k times. The error is then averaged over the k folds and is named cross-validation error.** 188 | 189 | ⟶ 190 | 191 |
192 | 193 | 33. **Regularization ― The regularization procedure aims at preventing the model from overfitting the data and thus deals with high variance issues. The following table sums up the different types of commonly used regularization techniques:** 194 | 195 | ⟶ 196 | 197 |
198 | 199 | 34. **[Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]** 200 | 201 | ⟶ 202 | 203 |
204 | 205 | 35. **Diagnostics** 206 | 207 | ⟶ 208 | 209 |
210 | 211 | 36. **Bias ― The bias of a model is the difference between the expected prediction and the correct model that we try to predict for given data points.** 212 | 213 | ⟶ 214 | 215 |
216 | 217 | 37. **Variance ― The variance of a model is the variability of the model prediction for given data points.** 218 | 219 | ⟶ 220 | 221 |
222 | 223 | 38. **Bias/variance tradeoff ― The simpler the model, the higher the bias, and the more complex the model, the higher the variance.** 224 | 225 | ⟶ 226 | 227 |
228 | 229 | 39. **[Symptoms, Regression illustration, classification illustration, deep learning illustration, possible remedies]** 230 | 231 | ⟶ 232 | 233 |
234 | 235 | 40. **[High training error, Training error close to test error, High bias, Training error slightly lower than test error, Very low training error, Training error much lower than test error, High variance]** 236 | 237 | ⟶ 238 | 239 |
240 | 241 | 41. **[Complexify model, Add more features, Train longer, Perform regularization, Get more data]** 242 | 243 | ⟶ 244 | 245 |
246 | 247 | 42. **Error analysis ― Error analysis consists of analyzing the root cause of the difference in performance between the current model and the perfect model.** 248 | 249 | ⟶ 250 | 251 |
252 | 253 | 43. **Ablative analysis ― Ablative analysis consists of analyzing the root cause of the difference in performance between the current model and the baseline model.** 254 | 255 | ⟶ 256 | 257 |
258 | 259 | 44. **Regression metrics** 260 | 261 | ⟶ 262 | 263 |
264 | 265 | 45. **[Classification metrics, confusion matrix, accuracy, precision, recall, F1 score, ROC]** 266 | 267 | ⟶ 268 | 269 |
270 | 271 | 46. **[Regression metrics, R squared, Mallow's CP, AIC, BIC]** 272 | 273 | ⟶ 274 | 275 |
276 | 277 | 47. **[Model selection, cross-validation, regularization]** 278 | 279 | ⟶ 280 | 281 |
282 | 283 | 48. **[Diagnostics, Bias/variance tradeoff, error/ablative analysis]** 284 | 285 | ⟶ 286 | -------------------------------------------------------------------------------- /template/cheatsheet-machine-learning-tips-and-tricks.md: -------------------------------------------------------------------------------- 1 | **1. Machine Learning tips and tricks cheatsheet** 2 | 3 | ⟶ 4 | 5 |
6 | 7 | **2. Classification metrics** 8 | 9 | ⟶ 10 | 11 |
12 | 13 | **3. In the context of binary classification, here are the main metrics that are important to track in order to assess the performance of the model.** 14 | 15 | ⟶ 16 | 17 |
18 | 19 | **4. Confusion matrix ― The confusion matrix is used to have a more complete picture when assessing the performance of a model. It is defined as follows:** 20 | 21 | ⟶ 22 | 23 |
24 | 25 | **5. [Predicted class, Actual class]** 26 | 27 | ⟶ 28 | 29 |
30 | 31 | **6. Main metrics ― The following metrics are commonly used to assess the performance of classification models:** 32 | 33 | ⟶ 34 | 35 |
36 | 37 | **7. [Metric, Formula, Interpretation]** 38 | 39 | ⟶ 40 | 41 |
42 | 43 | **8. Overall performance of model** 44 | 45 | ⟶ 46 | 47 |
48 | 49 | **9. How accurate the positive predictions are** 50 | 51 | ⟶ 52 | 53 |
54 | 55 | **10. Coverage of actual positive sample** 56 | 57 | ⟶ 58 | 59 |
60 | 61 | **11. Coverage of actual negative sample** 62 | 63 | ⟶ 64 | 65 |
66 | 67 | **12. Hybrid metric useful for unbalanced classes** 68 | 69 | ⟶ 70 | 71 |
72 | 73 | **13. ROC ― The receiver operating characteristic curve, also noted ROC, is the plot of TPR versus FPR obtained by varying the threshold. These metrics are summed up in the table below:** 74 | 75 | ⟶ 76 | 77 |
78 | 79 | **14. [Metric, Formula, Equivalent]** 80 | 81 | ⟶ 82 | 83 |
84 | 85 | **15. AUC ― The area under the receiver operating characteristic curve, also noted AUC or AUROC, is the area below the ROC as shown in the following figure:** 86 | 87 | ⟶ 88 | 89 |
90 | 91 | **16. [Actual, Predicted]** 92 | 93 | ⟶ 94 | 95 |
96 | 97 | **17. Basic metrics ― Given a regression model f, the following metrics are commonly used to assess the performance of the model:** 98 | 99 | ⟶ 100 | 101 |
102 | 103 | **18. [Total sum of squares, Explained sum of squares, Residual sum of squares]** 104 | 105 | ⟶ 106 | 107 |
108 | 109 | **19. Coefficient of determination ― The coefficient of determination, often noted R2 or r2, provides a measure of how well the observed outcomes are replicated by the model and is defined as follows:** 110 | 111 | ⟶ 112 | 113 |
114 | 115 | **20. Main metrics ― The following metrics are commonly used to assess the performance of regression models, taking into account the number of variables n that they consider:** 116 | 117 | ⟶ 118 | 119 |
120 | 121 | **21. where L is the likelihood and ˆσ2 is an estimate of the variance associated with each response.** 122 | 123 | ⟶ 124 | 125 |
126 | 127 | **22. Model selection** 128 | 129 | ⟶ 130 | 131 |
132 | 133 | **23. Vocabulary ― When selecting a model, we distinguish 3 different parts of the data that we have as follows:** 134 | 135 | ⟶ 136 | 137 |
138 | 139 | **24. [Training set, Validation set, Testing set]** 140 | 141 | ⟶ 142 | 143 |
144 | 145 | **25. [Model is trained, Model is assessed, Model gives predictions]** 146 | 147 | ⟶ 148 | 149 |
150 | 151 | **26. [Usually 80% of the dataset, Usually 20% of the dataset]** 152 | 153 | ⟶ 154 | 155 |
156 | 157 | **27. [Also called hold-out or development set, Unseen data]** 158 | 159 | ⟶ 160 | 161 |
162 | 163 | **28. Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set. These are represented in the figure below:** 164 | 165 | ⟶ 166 | 167 |
168 | 169 | **29. Cross-validation ― Cross-validation, also noted CV, is a method that is used to select a model that does not rely too much on the initial training set. The different types are summed up in the table below:** 170 | 171 | ⟶ 172 | 173 |
174 | 175 | **30. [Training on k−1 folds and assessment on the remaining one, Training on n−p observations and assessment on the p remaining ones]** 176 | 177 | ⟶ 178 | 179 |
180 | 181 | **31. [Generally k=5 or 10, Case p=1 is called leave-one-out]** 182 | 183 | ⟶ 184 | 185 |
186 | 187 | **32. The most commonly used method is called k-fold cross-validation and splits the training data into k folds to validate the model on one fold while training the model on the k−1 other folds, all of this k times. The error is then averaged over the k folds and is named cross-validation error.** 188 | 189 | ⟶ 190 | 191 |
192 | 193 | **33. Regularization ― The regularization procedure aims at preventing the model from overfitting the data and thus deals with high variance issues. The following table sums up the different types of commonly used regularization techniques:** 194 | 195 | ⟶ 196 | 197 |
198 | 199 | **34. [Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]** 200 | 201 | ⟶ 202 | 203 |
204 | 205 | **35. Diagnostics** 206 | 207 | ⟶ 208 | 209 |
210 | 211 | **36. Bias ― The bias of a model is the difference between the expected prediction and the correct model that we try to predict for given data points.** 212 | 213 | ⟶ 214 | 215 |
216 | 217 | **37. Variance ― The variance of a model is the variability of the model prediction for given data points.** 218 | 219 | ⟶ 220 | 221 |
222 | 223 | **38. Bias/variance tradeoff ― The simpler the model, the higher the bias, and the more complex the model, the higher the variance.** 224 | 225 | ⟶ 226 | 227 |
228 | 229 | **39. [Symptoms, Regression illustration, classification illustration, deep learning illustration, possible remedies]** 230 | 231 | ⟶ 232 | 233 |
234 | 235 | **40. [High training error, Training error close to test error, High bias, Training error slightly lower than test error, Very low training error, Training error much lower than test error, High variance]** 236 | 237 | ⟶ 238 | 239 |
240 | 241 | **41. [Complexify model, Add more features, Train longer, Perform regularization, Get more data]** 242 | 243 | ⟶ 244 | 245 |
246 | 247 | **42. Error analysis ― Error analysis consists of analyzing the root cause of the difference in performance between the current model and the perfect model.** 248 | 249 | ⟶ 250 | 251 |
252 | 253 | **43. Ablative analysis ― Ablative analysis consists of analyzing the root cause of the difference in performance between the current model and the baseline model.** 254 | 255 | ⟶ 256 | 257 |
258 | 259 | **44. Regression metrics** 260 | 261 | ⟶ 262 | 263 |
264 | 265 | **45. [Classification metrics, confusion matrix, accuracy, precision, recall, F1 score, ROC]** 266 | 267 | ⟶ 268 | 269 |
270 | 271 | **46. [Regression metrics, R squared, Mallow's CP, AIC, BIC]** 272 | 273 | ⟶ 274 | 275 |
276 | 277 | **47. [Model selection, cross-validation, regularization]** 278 | 279 | ⟶ 280 | 281 |
282 | 283 | **48. [Diagnostics, Bias/variance tradeoff, error/ablative analysis]** 284 | 285 | ⟶ 286 | -------------------------------------------------------------------------------- /de/cheatsheet-deep-learning.md: -------------------------------------------------------------------------------- 1 | **1. Deep Learning cheatsheet** 2 | 3 | ⟶ 4 | 5 |
6 | 7 | **2. Neural Networks** 8 | 9 | ⟶ 10 | 11 |
12 | 13 | **3. Neural networks are a class of models that are built with layers. Commonly used types of neural networks include convolutional and recurrent neural networks.** 14 | 15 | ⟶ 16 | 17 |
18 | 19 | **4. Architecture ― The vocabulary around neural networks architectures is described in the figure below:** 20 | 21 | ⟶ 22 | 23 |
24 | 25 | **5. [Input layer, hidden layer, output layer]** 26 | 27 | ⟶ 28 | 29 |
30 | 31 | **6. By noting i the ith layer of the network and j the jth hidden unit of the layer, we have:** 32 | 33 | ⟶ 34 | 35 |
36 | 37 | **7. where we note w, b, z the weight, bias and output respectively.** 38 | 39 | ⟶ 40 | 41 |
42 | 43 | **8. Activation function ― Activation functions are used at the end of a hidden unit to introduce non-linear complexities to the model. Here are the most common ones:** 44 | 45 | ⟶ 46 | 47 |
48 | 49 | **9. [Sigmoid, Tanh, ReLU, Leaky ReLU]** 50 | 51 | ⟶ 52 | 53 |
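The four activations listed above can be written in a few lines of NumPy (the leaky-ReLU slope α=0.01 is an assumed, conventional value):

```python
import numpy as np

# Common activation functions applied at the end of hidden units.
def sigmoid(z):
    return 1 / (1 + np.exp(-z))       # squashes to (0, 1)

def tanh(z):
    return np.tanh(z)                 # squashes to (-1, 1), zero-centered

def relu(z):
    return np.maximum(0, z)           # zero for negative inputs

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)  # small slope alpha for z < 0
```

All four are applied element-wise; their non-linearity is what lets stacked layers represent more than a single linear map.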
54 | 55 | **10. Cross-entropy loss ― In the context of neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:** 56 | 57 | ⟶ 58 | 59 |
60 | 61 | **11. Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. It can be fixed or adaptively changed. The most popular current method is Adam, which adapts the learning rate.** 62 | 63 | ⟶ 64 | 65 |
66 | 67 | **12. Backpropagation ― Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to weight w is computed using chain rule and is of the following form:** 68 | 69 | ⟶ 70 | 71 |
72 | 73 | **13. As a result, the weight is updated as follows:** 74 | 75 | ⟶ 76 | 77 |
78 | 79 | **14. Updating weights ― In a neural network, weights are updated as follows:** 80 | 81 | ⟶ 82 | 83 |
84 | 85 | **15. Step 1: Take a batch of training data.** 86 | 87 | ⟶ 88 | 89 |
90 | 91 | **16. Step 2: Perform forward propagation to obtain the corresponding loss.** 92 | 93 | ⟶ 94 | 95 |
96 | 97 | **17. Step 3: Backpropagate the loss to get the gradients.** 98 | 99 | ⟶ 100 | 101 |
102 | 103 | **18. Step 4: Use the gradients to update the weights of the network.** 104 | 105 | ⟶ 106 | 107 |
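The four steps above can be sketched for a single linear unit with squared loss; the data, learning rate α, and iteration count are assumed toy values:

```python
import numpy as np

X = np.array([[1.0], [2.0], [3.0]])   # Step 1: a (tiny) batch of training data
y = np.array([[2.0], [4.0], [6.0]])
w, alpha = np.array([[0.0]]), 0.1

for _ in range(100):
    z = X @ w                          # Step 2: forward propagation...
    loss = np.mean((z - y) ** 2)       # ...and the corresponding loss
    grad = 2 * X.T @ (z - y) / len(X)  # Step 3: backpropagate to get dL/dw
    w = w - alpha * grad               # Step 4: update the weights

# w converges toward the true slope of 2.
```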
108 | 109 | **19. Dropout ― Dropout is a technique meant to prevent overfitting the training data by dropping out units in a neural network. In practice, neurons are either dropped with probability p or kept with probability 1−p.** 110 | 111 | ⟶ 112 | 113 |
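The common "inverted dropout" implementation also rescales the kept units by 1/(1−p) at training time so the expected activation is unchanged; p and the activations below are assumed values:

```python
import numpy as np

rng = np.random.default_rng(42)
p = 0.5                           # drop probability (assumed)
a = np.ones(10_000)               # toy activations

mask = rng.random(a.shape) >= p   # keep each unit with probability 1-p
a_train = a * mask / (1 - p)      # inverted dropout: rescale kept units

# E[a_train] is approximately a, so no rescaling is needed at test time.
```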
114 | 115 | **20. Convolutional Neural Networks** 116 | 117 | ⟶ 118 | 119 |
120 | 121 | **21. Convolutional layer requirement ― By noting W the input volume size, F the size of the convolutional layer neurons, and P the amount of zero padding, the number of neurons N that fit in a given volume is such that:** 122 | 123 | ⟶ 124 | 125 |
126 | 127 | **22. Batch normalization ― It is a step, with hyperparameters γ,β, that normalizes the batch {xi}. By noting μB,σ2B the mean and variance of the batch that we want to correct, it is done as follows:** 128 | 129 | ⟶ 130 | 131 |
132 | 133 | **23. It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.** 134 | 135 | ⟶ 136 | 137 |
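A minimal sketch of the normalization step itself (the batch values, γ, β, and ε are assumed; in training, γ and β are learned and running statistics are kept for test time):

```python
import numpy as np

# Batch normalization: normalize with batch mean/variance,
# then scale and shift with learnable gamma, beta.
x = np.array([1.0, 2.0, 3.0, 4.0])    # one feature over a toy batch
gamma, beta, eps = 1.0, 0.0, 1e-5

mu, var = x.mean(), x.var()
x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance
out = gamma * x_hat + beta             # restore representational freedom
```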
138 | 139 | **24. Recurrent Neural Networks** 140 | 141 | ⟶ 142 | 143 |
144 | 145 | **25. Types of gates ― Here are the different types of gates that we encounter in a typical recurrent neural network:** 146 | 147 | ⟶ 148 | 149 |
150 | 151 | **26. [Input gate, forget gate, gate, output gate]** 152 | 153 | ⟶ 154 | 155 |
156 | 157 | **27. [Write to cell or not?, Erase a cell or not?, How much to write to cell?, How much to reveal cell?]** 158 | 159 | ⟶ 160 | 161 |
162 | 163 | **28. LSTM ― A long short-term memory (LSTM) network is a type of RNN model that avoids the vanishing gradient problem by adding 'forget' gates.** 164 | 165 | ⟶ 166 | 167 |
168 | 169 | **29. Reinforcement Learning and Control** 170 | 171 | ⟶ 172 | 173 |
174 | 175 | **30. The goal of reinforcement learning is for an agent to learn how to evolve in an environment.** 176 | 177 | ⟶ 178 | 179 |
180 | 181 | **31. Definitions** 182 | 183 | ⟶ 184 | 185 |
186 | 187 | **32. Markov decision processes ― A Markov decision process (MDP) is a 5-tuple (S,A,{Psa},γ,R) where:** 188 | 189 | ⟶ 190 | 191 |
192 | 193 | **33. S is the set of states** 194 | 195 | ⟶ 196 | 197 |
198 | 199 | **34. A is the set of actions** 200 | 201 | ⟶ 202 | 203 |
204 | 205 | **35. {Psa} are the state transition probabilities for s∈S and a∈A** 206 | 207 | ⟶ 208 | 209 |
210 | 211 | **36. γ∈[0,1[ is the discount factor** 212 | 213 | ⟶ 214 | 215 |
216 | 217 | **37. R:S×A⟶R or R:S⟶R is the reward function that the algorithm wants to maximize** 218 | 219 | ⟶ 220 | 221 |
222 | 223 | **38. Policy ― A policy π is a function π:S⟶A that maps states to actions.** 224 | 225 | ⟶ 226 | 227 |
228 | 229 | **39. Remark: we say that we execute a given policy π if given a state s we take the action a=π(s).** 230 | 231 | ⟶ 232 | 233 |
234 | 235 | **40. Value function ― For a given policy π and a given state s, we define the value function Vπ as follows:** 236 | 237 | ⟶ 238 | 239 |
240 | 241 | **41. Bellman equation ― The optimal Bellman equation characterizes the value function Vπ∗ of the optimal policy π∗:** 242 | 243 | ⟶ 244 | 245 |
246 | 247 | **42. Remark: we note that the optimal policy π∗ for a given state s is such that:** 248 | 249 | ⟶ 250 | 251 |
252 | 253 | **43. Value iteration algorithm ― The value iteration algorithm is in two steps:** 254 | 255 | ⟶ 256 | 257 |
258 | 259 | **44. 1) We initialize the value:** 260 | 261 | ⟶ 262 | 263 |
264 | 265 | **45. 2) We iterate the value based on the values before:** 266 | 267 | ⟶ 268 | 269 |
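The two steps above can be sketched on a tiny 2-state, 2-action MDP; the transition probabilities P, rewards R, and discount γ are assumed toy values:

```python
import numpy as np

P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s']: transition probabilities
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],                 # R[s, a]: rewards
              [0.0, 2.0]])
gamma = 0.9

V = np.zeros(2)                           # 1) initialize the value
for _ in range(500):                      # 2) iterate using the Bellman update
    V = np.max(R + gamma * P @ V, axis=1)
# V converges to the optimal value function V*.
```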
270 | 271 | **46. Maximum likelihood estimate ― The maximum likelihood estimates for the state transition probabilities are as follows:** 272 | 273 | ⟶ 274 | 275 |
276 | 277 | **47. times took action a in state s and got to s′** 278 | 279 | ⟶ 280 | 281 |
282 | 283 | **48. times took action a in state s** 284 | 285 | ⟶ 286 | 287 |
288 | 289 | **49. Q-learning ― Q-learning is a model-free estimation of Q, which is done as follows:** 290 | 291 | ⟶ 292 | 293 |
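Because Q-learning is model-free, the update only needs sampled transitions (s, a, r, s′) rather than the transition probabilities themselves. A sketch on an invented three-state chain, where α, γ, ε and the dynamics are all illustrative choices:

```python
import random

import numpy as np

# Tabular Q-learning on a made-up 3-state chain; reaching state 2 pays 1.
n_states, n_actions = 3, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.9, 0.1

def step(s, a):
    """Illustrative dynamics: action 1 moves right, action 0 stays."""
    s_next = min(s + a, n_states - 1)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward

random.seed(0)
for _ in range(2000):
    s = random.randrange(n_states - 1)  # start anywhere but the goal
    # epsilon-greedy action choice, then the model-free update:
    # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
    a = random.randrange(n_actions) if random.random() < eps else int(np.argmax(Q[s]))
    s_next, r = step(s, a)
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
```

After enough samples, the greedy policy argmaxa Q(s,a) moves right toward the rewarding state.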
294 | 295 | **50. View PDF version on GitHub** 296 | 297 | ⟶ 298 | 299 |
300 | 301 | **51. [Neural Networks, Architecture, Activation function, Backpropagation, Dropout]** 302 | 303 | ⟶ 304 | 305 |
306 | 307 | **52. [Convolutional Neural Networks, Convolutional layer, Batch normalization]** 308 | 309 | ⟶ 310 | 311 |
312 | 313 | **53. [Recurrent Neural Networks, Gates, LSTM]** 314 | 315 | ⟶ 316 | 317 |
318 | 319 | **54. [Reinforcement learning, Markov decision processes, Value/policy iteration, Approximate dynamic programming, Policy search]** 320 | 321 | ⟶ 322 | -------------------------------------------------------------------------------- /he/cheatsheet-deep-learning.md: -------------------------------------------------------------------------------- 1 | **1. Deep Learning cheatsheet** 2 | 3 | ⟶ 4 | 5 |
6 | 7 | **2. Neural Networks** 8 | 9 | ⟶ 10 | 11 |
12 | 13 | **3. Neural networks are a class of models that are built with layers. Commonly used types of neural networks include convolutional and recurrent neural networks.** 14 | 15 | ⟶ 16 | 17 |
18 | 19 | **4. Architecture ― The vocabulary around neural network architectures is described in the figure below:** 20 | 21 | ⟶ 22 | 23 |
24 | 25 | **5. [Input layer, hidden layer, output layer]** 26 | 27 | ⟶ 28 | 29 |
30 | 31 | **6. By noting i the ith layer of the network and j the jth hidden unit of the layer, we have:** 32 | 33 | ⟶ 34 | 35 |
36 | 37 | **7. where we note w, b, z the weight, bias and output respectively.** 38 | 39 | ⟶ 40 | 41 |
42 | 43 | **8. Activation function ― Activation functions are used at the end of a hidden unit to introduce non-linear complexities to the model. Here are the most common ones:** 44 | 45 | ⟶ 46 | 47 |
48 | 49 | **9. [Sigmoid, Tanh, ReLU, Leaky ReLU]** 50 | 51 | ⟶ 52 | 53 |
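The four activations listed above can be written directly with NumPy; the leaky slope of 0.01 is a common default, not a value the cheatsheet fixes:

```python
import numpy as np

def sigmoid(z):
    # 1 / (1 + e^(-z)), squashes to (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # squashes to (-1, 1)
    return np.tanh(z)

def relu(z):
    # max(0, z)
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # like ReLU but with a small slope (assumed 0.01) for negative inputs
    return np.where(z > 0, z, alpha * z)
```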
54 | 55 | **10. Cross-entropy loss ― In the context of neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:** 56 | 57 | ⟶ 58 | 59 |
60 | 61 | **11. Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. This can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.** 62 | 63 | ⟶ 64 | 65 |
66 | 67 | **12. Backpropagation ― Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to weight w is computed using the chain rule and is of the following form:** 68 | 69 | ⟶ 70 | 71 |
72 | 73 | **13. As a result, the weight is updated as follows:** 74 | 75 | ⟶ 76 | 77 |
78 | 79 | **14. Updating weights ― In a neural network, weights are updated as follows:** 80 | 81 | ⟶ 82 | 83 |
84 | 85 | **15. Step 1: Take a batch of training data.** 86 | 87 | ⟶ 88 | 89 |
90 | 91 | **16. Step 2: Perform forward propagation to obtain the corresponding loss.** 92 | 93 | ⟶ 94 | 95 |
96 | 97 | **17. Step 3: Backpropagate the loss to get the gradients.** 98 | 99 | ⟶ 100 | 101 |
102 | 103 | **18. Step 4: Use the gradients to update the weights of the network.** 104 | 105 | ⟶ 106 | 107 |
108 | 109 | **19. Dropout ― Dropout is a technique meant to prevent overfitting the training data by dropping out units in a neural network. In practice, neurons are either dropped with probability p or kept with probability 1−p.** 110 | 111 | ⟶ 112 | 113 |
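A sketch of how the drop/keep rule above is often implemented; the 1/(1−p) rescaling of surviving units ("inverted dropout") is a common convention the cheatsheet does not spell out:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p, train=True):
    """Drop each unit with probability p during training; scale survivors
    by 1/(1 - p) ("inverted dropout", an assumed convention) so that no
    rescaling is needed at test time."""
    if not train:
        return x
    mask = rng.random(x.shape) >= p  # keep each unit with probability 1 - p
    return x * mask / (1.0 - p)

a = np.ones(10000)
out = dropout(a, p=0.3)
```

At test time (`train=False`) the layer is the identity, and the rescaling keeps the expected activation unchanged between training and inference.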
114 | 115 | **20. Convolutional Neural Networks** 116 | 117 | ⟶ 118 | 119 |
120 | 121 | **21. Convolutional layer requirement ― By noting W the input volume size, F the size of the convolutional layer neurons, P the amount of zero padding, then the number of neurons N that fit in a given volume is such that:** 122 | 123 | ⟶ 124 | 125 |
126 | 127 | **22. Batch normalization ― It is a step of hyperparameters γ,β that normalizes the batch {xi}. By noting μB,σ2B the mean and variance of the batch that we want to correct, it is done as follows:** 128 | 129 | ⟶ 130 | 131 |
132 | 133 | **23. It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.** 134 | 135 | ⟶ 136 | 137 |
138 | 139 | **24. Recurrent Neural Networks** 140 | 141 | ⟶ 142 | 143 |
144 | 145 | **25. Types of gates ― Here are the different types of gates that we encounter in a typical recurrent neural network:** 146 | 147 | ⟶ 148 | 149 |
150 | 151 | **26. [Input gate, forget gate, gate, output gate]** 152 | 153 | ⟶ 154 | 155 |
156 | 157 | **27. [Write to cell or not?, Erase a cell or not?, How much to write to cell?, How much to reveal cell?]** 158 | 159 | ⟶ 160 | 161 |
162 | 163 | **28. LSTM ― A long short-term memory (LSTM) network is a type of RNN model that avoids the vanishing gradient problem by adding 'forget' gates.** 164 | 165 | ⟶ 166 | 167 |
168 | 169 | **29. Reinforcement Learning and Control** 170 | 171 | ⟶ 172 | 173 |
174 | 175 | **30. The goal of reinforcement learning is for an agent to learn how to evolve in an environment.** 176 | 177 | ⟶ 178 | 179 |
180 | 181 | **31. Definitions** 182 | 183 | ⟶ 184 | 185 |
186 | 187 | **32. Markov decision processes ― A Markov decision process (MDP) is a 5-tuple (S,A,{Psa},γ,R) where:** 188 | 189 | ⟶ 190 | 191 |
192 | 193 | **33. S is the set of states** 194 | 195 | ⟶ 196 | 197 |
198 | 199 | **34. A is the set of actions** 200 | 201 | ⟶ 202 | 203 |
204 | 205 | **35. {Psa} are the state transition probabilities for s∈S and a∈A** 206 | 207 | ⟶ 208 | 209 |
210 | 211 | **36. γ∈[0,1[ is the discount factor** 212 | 213 | ⟶ 214 | 215 |
216 | 217 | **37. R:S×A⟶R or R:S⟶R is the reward function that the algorithm wants to maximize** 218 | 219 | ⟶ 220 | 221 |
222 | 223 | **38. Policy ― A policy π is a function π:S⟶A that maps states to actions.** 224 | 225 | ⟶ 226 | 227 |
228 | 229 | **39. Remark: we say that we execute a given policy π if given a state s we take the action a=π(s).** 230 | 231 | ⟶ 232 | 233 |
234 | 235 | **40. Value function ― For a given policy π and a given state s, we define the value function Vπ as follows:** 236 | 237 | ⟶ 238 | 239 |
240 | 241 | **41. Bellman equation ― The optimal Bellman equations characterize the value function Vπ∗ of the optimal policy π∗:** 242 | 243 | ⟶ 244 | 245 |
246 | 247 | **42. Remark: we note that the optimal policy π∗ for a given state s is such that:** 248 | 249 | ⟶ 250 | 251 |
252 | 253 | **43. Value iteration algorithm ― The value iteration algorithm is in two steps:** 254 | 255 | ⟶ 256 | 257 |
258 | 259 | **44. 1) We initialize the value:** 260 | 261 | ⟶ 262 | 263 |
264 | 265 | **45. 2) We iterate the value based on the values before:** 266 | 267 | ⟶ 268 | 269 |
270 | 271 | **46. Maximum likelihood estimate ― The maximum likelihood estimates for the state transition probabilities are as follows:** 272 | 273 | ⟶ 274 | 275 |
276 | 277 | **47. times took action a in state s and got to s′** 278 | 279 | ⟶ 280 | 281 |
282 | 283 | **48. times took action a in state s** 284 | 285 | ⟶ 286 | 287 |
288 | 289 | **49. Q-learning ― Q-learning is a model-free estimation of Q, which is done as follows:** 290 | 291 | ⟶ 292 | 293 |
294 | 295 | **50. View PDF version on GitHub** 296 | 297 | ⟶ 298 | 299 |
300 | 301 | **51. [Neural Networks, Architecture, Activation function, Backpropagation, Dropout]** 302 | 303 | ⟶ 304 | 305 |
306 | 307 | **52. [Convolutional Neural Networks, Convolutional layer, Batch normalization]** 308 | 309 | ⟶ 310 | 311 |
312 | 313 | **53. [Recurrent Neural Networks, Gates, LSTM]** 314 | 315 | ⟶ 316 | 317 |
318 | 319 | **54. [Reinforcement learning, Markov decision processes, Value/policy iteration, Approximate dynamic programming, Policy search]** 320 | 321 | ⟶ 322 | -------------------------------------------------------------------------------- /hi/cheatsheet-deep-learning.md: -------------------------------------------------------------------------------- 1 | **1. Deep Learning cheatsheet** 2 | 3 | ⟶ 4 | 5 |
6 | 7 | **2. Neural Networks** 8 | 9 | ⟶ 10 | 11 |
12 | 13 | **3. Neural networks are a class of models that are built with layers. Commonly used types of neural networks include convolutional and recurrent neural networks.** 14 | 15 | ⟶ 16 | 17 |
18 | 19 | **4. Architecture ― The vocabulary around neural network architectures is described in the figure below:** 20 | 21 | ⟶ 22 | 23 |
24 | 25 | **5. [Input layer, hidden layer, output layer]** 26 | 27 | ⟶ 28 | 29 |
30 | 31 | **6. By noting i the ith layer of the network and j the jth hidden unit of the layer, we have:** 32 | 33 | ⟶ 34 | 35 |
36 | 37 | **7. where we note w, b, z the weight, bias and output respectively.** 38 | 39 | ⟶ 40 | 41 |
42 | 43 | **8. Activation function ― Activation functions are used at the end of a hidden unit to introduce non-linear complexities to the model. Here are the most common ones:** 44 | 45 | ⟶ 46 | 47 |
48 | 49 | **9. [Sigmoid, Tanh, ReLU, Leaky ReLU]** 50 | 51 | ⟶ 52 | 53 |
54 | 55 | **10. Cross-entropy loss ― In the context of neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:** 56 | 57 | ⟶ 58 | 59 |
60 | 61 | **11. Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. This can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.** 62 | 63 | ⟶ 64 | 65 |
66 | 67 | **12. Backpropagation ― Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to weight w is computed using the chain rule and is of the following form:** 68 | 69 | ⟶ 70 | 71 |
72 | 73 | **13. As a result, the weight is updated as follows:** 74 | 75 | ⟶ 76 | 77 |
78 | 79 | **14. Updating weights ― In a neural network, weights are updated as follows:** 80 | 81 | ⟶ 82 | 83 |
84 | 85 | **15. Step 1: Take a batch of training data.** 86 | 87 | ⟶ 88 | 89 |
90 | 91 | **16. Step 2: Perform forward propagation to obtain the corresponding loss.** 92 | 93 | ⟶ 94 | 95 |
96 | 97 | **17. Step 3: Backpropagate the loss to get the gradients.** 98 | 99 | ⟶ 100 | 101 |
102 | 103 | **18. Step 4: Use the gradients to update the weights of the network.** 104 | 105 | ⟶ 106 | 107 |
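Steps 1–4 above can be sketched end to end on a one-weight linear model; the data, batch size, and learning rate below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: learn y = 2x with a single linear unit (illustrative only).
X = rng.normal(size=(64, 1))
y = 2.0 * X

w = np.zeros((1, 1))
alpha = 0.1  # learning rate

for _ in range(100):
    batch = rng.choice(64, size=16, replace=False)  # Step 1: take a batch
    xb, yb = X[batch], y[batch]
    z = xb @ w                                      # Step 2: forward pass...
    loss = ((z - yb) ** 2).mean()                   # ...and its loss (MSE)
    grad = 2 * xb.T @ (z - yb) / len(xb)            # Step 3: backpropagate
    w -= alpha * grad                               # Step 4: update weights
```

The learned weight approaches 2, the slope used to generate the toy data.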
108 | 109 | **19. Dropout ― Dropout is a technique meant to prevent overfitting the training data by dropping out units in a neural network. In practice, neurons are either dropped with probability p or kept with probability 1−p.** 110 | 111 | ⟶ 112 | 113 |
114 | 115 | **20. Convolutional Neural Networks** 116 | 117 | ⟶ 118 | 119 |
120 | 121 | **21. Convolutional layer requirement ― By noting W the input volume size, F the size of the convolutional layer neurons, P the amount of zero padding, then the number of neurons N that fit in a given volume is such that:** 122 | 123 | ⟶ 124 | 125 |
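The requirement above is the usual fit condition N = (W − F + 2P)/S + 1; the stride S belongs in the full formula even though this chunk does not name it, so it is included here as an assumed parameter:

```python
def conv_output_size(W, F, P, S=1):
    """Number of neurons N fitting along one dimension:
    N = (W - F + 2P) / S + 1. The stride S (assumed here, default 1)
    must divide W - F + 2P exactly for the layer to be valid."""
    span = W - F + 2 * P
    if span % S != 0:
        raise ValueError("filter does not tile the padded input evenly")
    return span // S + 1
```

For example, a 32-wide input with 5-wide filters, padding 2 and stride 1 gives 32 output neurons, which is why that padding choice preserves spatial size.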
126 | 127 | **22. Batch normalization ― It is a step of hyperparameters γ,β that normalizes the batch {xi}. By noting μB,σ2B the mean and variance of the batch that we want to correct, it is done as follows:** 128 | 129 | ⟶ 130 | 131 |
132 | 133 | **23. It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.** 134 | 135 | ⟶ 136 | 137 |
138 | 139 | **24. Recurrent Neural Networks** 140 | 141 | ⟶ 142 | 143 |
144 | 145 | **25. Types of gates ― Here are the different types of gates that we encounter in a typical recurrent neural network:** 146 | 147 | ⟶ 148 | 149 |
150 | 151 | **26. [Input gate, forget gate, gate, output gate]** 152 | 153 | ⟶ 154 | 155 |
156 | 157 | **27. [Write to cell or not?, Erase a cell or not?, How much to write to cell?, How much to reveal cell?]** 158 | 159 | ⟶ 160 | 161 |
162 | 163 | **28. LSTM ― A long short-term memory (LSTM) network is a type of RNN model that avoids the vanishing gradient problem by adding 'forget' gates.** 164 | 165 | ⟶ 166 | 167 |
168 | 169 | **29. Reinforcement Learning and Control** 170 | 171 | ⟶ 172 | 173 |
174 | 175 | **30. The goal of reinforcement learning is for an agent to learn how to evolve in an environment.** 176 | 177 | ⟶ 178 | 179 |
180 | 181 | **31. Definitions** 182 | 183 | ⟶ 184 | 185 |
186 | 187 | **32. Markov decision processes ― A Markov decision process (MDP) is a 5-tuple (S,A,{Psa},γ,R) where:** 188 | 189 | ⟶ 190 | 191 |
192 | 193 | **33. S is the set of states** 194 | 195 | ⟶ 196 | 197 |
198 | 199 | **34. A is the set of actions** 200 | 201 | ⟶ 202 | 203 |
204 | 205 | **35. {Psa} are the state transition probabilities for s∈S and a∈A** 206 | 207 | ⟶ 208 | 209 |
210 | 211 | **36. γ∈[0,1[ is the discount factor** 212 | 213 | ⟶ 214 | 215 |
216 | 217 | **37. R:S×A⟶R or R:S⟶R is the reward function that the algorithm wants to maximize** 218 | 219 | ⟶ 220 | 221 |
222 | 223 | **38. Policy ― A policy π is a function π:S⟶A that maps states to actions.** 224 | 225 | ⟶ 226 | 227 |
228 | 229 | **39. Remark: we say that we execute a given policy π if given a state s we take the action a=π(s).** 230 | 231 | ⟶ 232 | 233 |
234 | 235 | **40. Value function ― For a given policy π and a given state s, we define the value function Vπ as follows:** 236 | 237 | ⟶ 238 | 239 |
240 | 241 | **41. Bellman equation ― The optimal Bellman equations characterize the value function Vπ∗ of the optimal policy π∗:** 242 | 243 | ⟶ 244 | 245 |
246 | 247 | **42. Remark: we note that the optimal policy π∗ for a given state s is such that:** 248 | 249 | ⟶ 250 | 251 |
252 | 253 | **43. Value iteration algorithm ― The value iteration algorithm is in two steps:** 254 | 255 | ⟶ 256 | 257 |
258 | 259 | **44. 1) We initialize the value:** 260 | 261 | ⟶ 262 | 263 |
264 | 265 | **45. 2) We iterate the value based on the values before:** 266 | 267 | ⟶ 268 | 269 |
270 | 271 | **46. Maximum likelihood estimate ― The maximum likelihood estimates for the state transition probabilities are as follows:** 272 | 273 | ⟶ 274 | 275 |
276 | 277 | **47. times took action a in state s and got to s′** 278 | 279 | ⟶ 280 | 281 |
282 | 283 | **48. times took action a in state s** 284 | 285 | ⟶ 286 | 287 |
288 | 289 | **49. Q-learning ― Q-learning is a model-free estimation of Q, which is done as follows:** 290 | 291 | ⟶ 292 | 293 |
294 | 295 | **50. View PDF version on GitHub** 296 | 297 | ⟶ 298 | 299 |
300 | 301 | **51. [Neural Networks, Architecture, Activation function, Backpropagation, Dropout]** 302 | 303 | ⟶ 304 | 305 |
306 | 307 | **52. [Convolutional Neural Networks, Convolutional layer, Batch normalization]** 308 | 309 | ⟶ 310 | 311 |
312 | 313 | **53. [Recurrent Neural Networks, Gates, LSTM]** 314 | 315 | ⟶ 316 | 317 |
318 | 319 | **54. [Reinforcement learning, Markov decision processes, Value/policy iteration, Approximate dynamic programming, Policy search]** 320 | 321 | ⟶ 322 | -------------------------------------------------------------------------------- /ru/cheatsheet-deep-learning.md: -------------------------------------------------------------------------------- 1 | **1. Deep Learning cheatsheet** 2 | 3 | ⟶ 4 | 5 |
6 | 7 | **2. Neural Networks** 8 | 9 | ⟶ 10 | 11 |
12 | 13 | **3. Neural networks are a class of models that are built with layers. Commonly used types of neural networks include convolutional and recurrent neural networks.** 14 | 15 | ⟶ 16 | 17 |
18 | 19 | **4. Architecture ― The vocabulary around neural network architectures is described in the figure below:** 20 | 21 | ⟶ 22 | 23 |
24 | 25 | **5. [Input layer, hidden layer, output layer]** 26 | 27 | ⟶ 28 | 29 |
30 | 31 | **6. By noting i the ith layer of the network and j the jth hidden unit of the layer, we have:** 32 | 33 | ⟶ 34 | 35 |
36 | 37 | **7. where we note w, b, z the weight, bias and output respectively.** 38 | 39 | ⟶ 40 | 41 |
42 | 43 | **8. Activation function ― Activation functions are used at the end of a hidden unit to introduce non-linear complexities to the model. Here are the most common ones:** 44 | 45 | ⟶ 46 | 47 |
48 | 49 | **9. [Sigmoid, Tanh, ReLU, Leaky ReLU]** 50 | 51 | ⟶ 52 | 53 |
54 | 55 | **10. Cross-entropy loss ― In the context of neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:** 56 | 57 | ⟶ 58 | 59 |
60 | 61 | **11. Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. This can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.** 62 | 63 | ⟶ 64 | 65 |
66 | 67 | **12. Backpropagation ― Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to weight w is computed using the chain rule and is of the following form:** 68 | 69 | ⟶ 70 | 71 |
72 | 73 | **13. As a result, the weight is updated as follows:** 74 | 75 | ⟶ 76 | 77 |
78 | 79 | **14. Updating weights ― In a neural network, weights are updated as follows:** 80 | 81 | ⟶ 82 | 83 |
84 | 85 | **15. Step 1: Take a batch of training data.** 86 | 87 | ⟶ 88 | 89 |
90 | 91 | **16. Step 2: Perform forward propagation to obtain the corresponding loss.** 92 | 93 | ⟶ 94 | 95 |
96 | 97 | **17. Step 3: Backpropagate the loss to get the gradients.** 98 | 99 | ⟶ 100 | 101 |
102 | 103 | **18. Step 4: Use the gradients to update the weights of the network.** 104 | 105 | ⟶ 106 | 107 |
108 | 109 | **19. Dropout ― Dropout is a technique meant to prevent overfitting the training data by dropping out units in a neural network. In practice, neurons are either dropped with probability p or kept with probability 1−p.** 110 | 111 | ⟶ 112 | 113 |
114 | 115 | **20. Convolutional Neural Networks** 116 | 117 | ⟶ 118 | 119 |
120 | 121 | **21. Convolutional layer requirement ― By noting W the input volume size, F the size of the convolutional layer neurons, P the amount of zero padding, then the number of neurons N that fit in a given volume is such that:** 122 | 123 | ⟶ 124 | 125 |
126 | 127 | **22. Batch normalization ― It is a step of hyperparameters γ,β that normalizes the batch {xi}. By noting μB,σ2B the mean and variance of the batch that we want to correct, it is done as follows:** 128 | 129 | ⟶ 130 | 131 |
132 | 133 | **23. It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.** 134 | 135 | ⟶ 136 | 137 |
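A sketch of the normalization described above: center and scale by the batch statistics μB, σ²B, then apply the learnable γ, β. The small ε added to the variance is a standard numerical-stability term the cheatsheet leaves implicit:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize the batch to zero mean / unit variance per feature,
    then rescale by the learnable gamma, beta. eps (assumed value)
    avoids division by zero."""
    mu = x.mean(axis=0)        # mu_B
    var = x.var(axis=0)        # sigma^2_B
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(128, 4))
y = batch_norm(x, gamma=1.0, beta=0.0)
```

With γ = 1, β = 0 the output is simply the standardized batch, which is what lets the next layer see inputs with a stable distribution.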
138 | 139 | **24. Recurrent Neural Networks** 140 | 141 | ⟶ 142 | 143 |
144 | 145 | **25. Types of gates ― Here are the different types of gates that we encounter in a typical recurrent neural network:** 146 | 147 | ⟶ 148 | 149 |
150 | 151 | **26. [Input gate, forget gate, gate, output gate]** 152 | 153 | ⟶ 154 | 155 |
156 | 157 | **27. [Write to cell or not?, Erase a cell or not?, How much to write to cell?, How much to reveal cell?]** 158 | 159 | ⟶ 160 | 161 |
162 | 163 | **28. LSTM ― A long short-term memory (LSTM) network is a type of RNN model that avoids the vanishing gradient problem by adding 'forget' gates.** 164 | 165 | ⟶ 166 | 167 |
168 | 169 | **29. Reinforcement Learning and Control** 170 | 171 | ⟶ 172 | 173 |
174 | 175 | **30. The goal of reinforcement learning is for an agent to learn how to evolve in an environment.** 176 | 177 | ⟶ 178 | 179 |
180 | 181 | **31. Definitions** 182 | 183 | ⟶ 184 | 185 |
186 | 187 | **32. Markov decision processes ― A Markov decision process (MDP) is a 5-tuple (S,A,{Psa},γ,R) where:** 188 | 189 | ⟶ 190 | 191 |
192 | 193 | **33. S is the set of states** 194 | 195 | ⟶ 196 | 197 |
198 | 199 | **34. A is the set of actions** 200 | 201 | ⟶ 202 | 203 |
204 | 205 | **35. {Psa} are the state transition probabilities for s∈S and a∈A** 206 | 207 | ⟶ 208 | 209 |
210 | 211 | **36. γ∈[0,1[ is the discount factor** 212 | 213 | ⟶ 214 | 215 |
216 | 217 | **37. R:S×A⟶R or R:S⟶R is the reward function that the algorithm wants to maximize** 218 | 219 | ⟶ 220 | 221 |
222 | 223 | **38. Policy ― A policy π is a function π:S⟶A that maps states to actions.** 224 | 225 | ⟶ 226 | 227 |
228 | 229 | **39. Remark: we say that we execute a given policy π if given a state s we take the action a=π(s).** 230 | 231 | ⟶ 232 | 233 |
234 | 235 | **40. Value function ― For a given policy π and a given state s, we define the value function Vπ as follows:** 236 | 237 | ⟶ 238 | 239 |
240 | 241 | **41. Bellman equation ― The optimal Bellman equations characterize the value function Vπ∗ of the optimal policy π∗:** 242 | 243 | ⟶ 244 | 245 |
246 | 247 | **42. Remark: we note that the optimal policy π∗ for a given state s is such that:** 248 | 249 | ⟶ 250 | 251 |
252 | 253 | **43. Value iteration algorithm ― The value iteration algorithm is in two steps:** 254 | 255 | ⟶ 256 | 257 |
258 | 259 | **44. 1) We initialize the value:** 260 | 261 | ⟶ 262 | 263 |
264 | 265 | **45. 2) We iterate the value based on the values before:** 266 | 267 | ⟶ 268 | 269 |
270 | 271 | **46. Maximum likelihood estimate ― The maximum likelihood estimates for the state transition probabilities are as follows:** 272 | 273 | ⟶ 274 | 275 |
276 | 277 | **47. times took action a in state s and got to s′** 278 | 279 | ⟶ 280 | 281 |
282 | 283 | **48. times took action a in state s** 284 | 285 | ⟶ 286 | 287 |
288 | 289 | **49. Q-learning ― Q-learning is a model-free estimation of Q, which is done as follows:** 290 | 291 | ⟶ 292 | 293 |
294 | 295 | **50. View PDF version on GitHub** 296 | 297 | ⟶ 298 | 299 |
300 | 301 | **51. [Neural Networks, Architecture, Activation function, Backpropagation, Dropout]** 302 | 303 | ⟶ 304 | 305 |
306 | 307 | **52. [Convolutional Neural Networks, Convolutional layer, Batch normalization]** 308 | 309 | ⟶ 310 | 311 |
312 | 313 | **53. [Recurrent Neural Networks, Gates, LSTM]** 314 | 315 | ⟶ 316 | 317 |
318 | 319 | **54. [Reinforcement learning, Markov decision processes, Value/policy iteration, Approximate dynamic programming, Policy search]** 320 | 321 | ⟶ 322 | -------------------------------------------------------------------------------- /zh/cheatsheet-deep-learning.md: -------------------------------------------------------------------------------- 1 | 1. **Deep Learning cheatsheet** 2 | 3 | ⟶ 4 | 5 |
6 | 7 | 2. **Neural Networks** 8 | 9 | ⟶ 10 | 11 |
12 | 13 | 3. **Neural networks are a class of models that are built with layers. Commonly used types of neural networks include convolutional and recurrent neural networks.** 14 | 15 | ⟶ 16 | 17 |
18 | 19 | 4. **Architecture ― The vocabulary around neural network architectures is described in the figure below:** 20 | 21 | ⟶ 22 | 23 |
24 | 25 | 5. **[Input layer, hidden layer, output layer]** 26 | 27 | ⟶ 28 | 29 |
30 | 31 | 6. **By noting i the ith layer of the network and j the jth hidden unit of the layer, we have:** 32 | 33 | ⟶ 34 | 35 |
36 | 37 | 7. **where we note w, b, z the weight, bias and output respectively.** 38 | 39 | ⟶ 40 | 41 |
42 | 43 | 8. **Activation function ― Activation functions are used at the end of a hidden unit to introduce non-linear complexities to the model. Here are the most common ones:** 44 | 45 | ⟶ 46 | 47 |
48 | 49 | 9. **[Sigmoid, Tanh, ReLU, Leaky ReLU]** 50 | 51 | ⟶ 52 | 53 |
54 | 55 | 10. **Cross-entropy loss ― In the context of neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:** 56 | 57 | ⟶ 58 | 59 |
60 | 61 | 11. **Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. This can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.** 62 | 63 | ⟶ 64 | 65 |
66 | 67 | 12. **Backpropagation ― Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to weight w is computed using the chain rule and is of the following form:** 68 | 69 | ⟶ 70 | 71 |
72 | 73 | 13. **As a result, the weight is updated as follows:** 74 | 75 | ⟶ 76 | 77 |
78 | 79 | 14. **Updating weights ― In a neural network, weights are updated as follows:** 80 | 81 | ⟶ 82 | 83 |
84 | 85 | 15. **Step 1: Take a batch of training data.** 86 | 87 | ⟶ 88 | 89 |
90 | 91 | 16. **Step 2: Perform forward propagation to obtain the corresponding loss.** 92 | 93 | ⟶ 94 | 95 |
96 | 97 | 17. **Step 3: Backpropagate the loss to get the gradients.** 98 | 99 | ⟶ 100 | 101 |
102 | 103 | 18. **Step 4: Use the gradients to update the weights of the network.** 104 | 105 | ⟶ 106 | 107 |
108 | 109 | 19. **Dropout ― Dropout is a technique meant to prevent overfitting the training data by dropping out units in a neural network. In practice, neurons are either dropped with probability p or kept with probability 1−p.** 110 | 111 | ⟶ 112 | 113 |
114 | 115 | 20. **Convolutional Neural Networks** 116 | 117 | ⟶ 118 | 119 |
120 | 121 | 21. **Convolutional layer requirement ― By noting W the input volume size, F the size of the convolutional layer neurons, P the amount of zero padding, then the number of neurons N that fit in a given volume is such that:** 122 | 123 | ⟶ 124 | 125 |
126 | 127 | 22. **Batch normalization ― It is a step of hyperparameters γ,β that normalizes the batch {xi}. By noting μB,σ2B the mean and variance of the batch that we want to correct, it is done as follows:** 128 | 129 | ⟶ 130 | 131 |
132 | 133 | 23. **It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.** 134 | 135 | ⟶ 136 | 137 |
138 | 139 | 24. **Recurrent Neural Networks** 140 | 141 | ⟶ 142 | 143 |
144 | 145 | 25. **Types of gates ― Here are the different types of gates that we encounter in a typical recurrent neural network:** 146 | 147 | ⟶ 148 | 149 |
150 | 151 | 26. **[Input gate, forget gate, gate, output gate]** 152 | 153 | ⟶ 154 | 155 |
156 | 157 | 27. **[Write to cell or not?, Erase a cell or not?, How much to write to cell?, How much to reveal cell?]** 158 | 159 | ⟶ 160 | 161 |
162 | 163 | 28. **LSTM ― A long short-term memory (LSTM) network is a type of RNN model that avoids the vanishing gradient problem by adding 'forget' gates.** 164 | 165 | ⟶ 166 | 167 |
168 | 169 | 29. **Reinforcement Learning and Control** 170 | 171 | ⟶ 172 | 173 |
174 | 175 | 30. **The goal of reinforcement learning is for an agent to learn how to evolve in an environment.** 176 | 177 | ⟶ 178 | 179 |
180 | 181 | 31. **Definitions** 182 | 183 | ⟶ 184 | 185 |
186 | 187 | 32. **Markov decision processes ― A Markov decision process (MDP) is a 5-tuple (S,A,{Psa},γ,R) where:** 188 | 189 | ⟶ 190 | 191 |
192 | 193 | 33. **S is the set of states** 194 | 195 | ⟶ 196 | 197 |
198 | 199 | 34. **A is the set of actions** 200 | 201 | ⟶ 202 | 203 |
204 | 205 | 35. **{Psa} are the state transition probabilities for s∈S and a∈A** 206 | 207 | ⟶ 208 | 209 |
210 | 211 | 36. **γ∈[0,1[ is the discount factor** 212 | 213 | ⟶ 214 | 215 |
216 | 217 | 37. **R:S×A⟶R or R:S⟶R is the reward function that the algorithm wants to maximize** 218 | 219 | ⟶ 220 | 221 |
222 | 223 | 38. **Policy ― A policy π is a function π:S⟶A that maps states to actions.** 224 | 225 | ⟶ 226 | 227 |
228 | 229 | 39. **Remark: we say that we execute a given policy π if given a state s we take the action a=π(s).** 230 | 231 | ⟶ 232 | 233 |
234 | 235 | 40. **Value function ― For a given policy π and a given state s, we define the value function Vπ as follows:** 236 | 237 | ⟶ 238 | 239 |
240 | 241 | 41. **Bellman equation ― The optimal Bellman equations characterize the value function Vπ∗ of the optimal policy π∗:** 242 | 243 | ⟶ 244 | 245 |
246 | 247 | 42. **Remark: we note that the optimal policy π∗ for a given state s is such that:** 248 | 249 | ⟶ 250 | 251 |
252 | 253 | 43. **Value iteration algorithm ― The value iteration algorithm is in two steps:** 254 | 255 | ⟶ 256 | 257 |
258 | 259 | 44. **1) We initialize the value:** 260 | 261 | ⟶ 262 | 263 |
264 | 265 | 45. **2) We iterate the value based on the values before:** 266 | 267 | ⟶ 268 | 269 |
270 | 271 | 46. **Maximum likelihood estimate ― The maximum likelihood estimates for the state transition probabilities are as follows:** 272 | 273 | ⟶ 274 | 275 |
276 | 277 | 47. **times took action a in state s and got to s′** 278 | 279 | ⟶ 280 | 281 |
282 | 283 | 48. **times took action a in state s** 284 | 285 | ⟶ 286 | 287 |
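The two counts above (numerator: times took action a in state s and got to s′; denominator: times took action a in state s) translate directly into code. The logged transitions below are made up, and state–action pairs that were never visited are left undefined (NaN) rather than guessed:

```python
import numpy as np

# Estimate P(s' | s, a) by counting over logged transitions (made-up log).
n_states, n_actions = 2, 2
transitions = [(0, 1, 1), (0, 1, 1), (0, 1, 0), (0, 0, 0), (1, 0, 1)]

counts = np.zeros((n_states, n_actions, n_states))
for s, a, s_next in transitions:
    counts[s, a, s_next] += 1  # times took action a in state s and got to s'

totals = counts.sum(axis=2, keepdims=True)  # times took action a in state s
P_hat = np.divide(counts, totals,
                  out=np.full_like(counts, np.nan),
                  where=totals > 0)  # unvisited (s, a) pairs stay NaN
```

Here P̂(s′=1 | s=0, a=1) = 2/3 because action 1 in state 0 led to state 1 in two of its three logged occurrences.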
288 | 289 | 49. **Q-learning ― Q-learning is a model-free estimation of Q, which is done as follows:** 290 | 291 | ⟶ 292 | 293 |
294 | 295 | 50. **View PDF version on GitHub** 296 | 297 | ⟶ 298 | 299 |
300 | 301 | 51. **[Neural Networks, Architecture, Activation function, Backpropagation, Dropout]** 302 | 303 | ⟶ 304 | 305 |
306 | 307 | 52. **[Convolutional Neural Networks, Convolutional layer, Batch normalization]** 308 | 309 | ⟶ 310 | 311 |
312 | 313 | 53. **[Recurrent Neural Networks, Gates, LSTM]** 314 | 315 | ⟶ 316 | 317 |
318 | 319 | 54. **[Reinforcement learning, Markov decision processes, Value/policy iteration, Approximate dynamic programming, Policy search]** 320 | 321 | ⟶ 322 | -------------------------------------------------------------------------------- /template/cheatsheet-deep-learning.md: -------------------------------------------------------------------------------- 1 | **1. Deep Learning cheatsheet** 2 | 3 | ⟶ 4 | 5 |
6 | 7 | **2. Neural Networks** 8 | 9 | ⟶ 10 | 11 |
12 | 13 | **3. Neural networks are a class of models that are built with layers. Commonly used types of neural networks include convolutional and recurrent neural networks.** 14 | 15 | ⟶ 16 | 17 |
18 | 19 | **4. Architecture ― The vocabulary around neural network architectures is described in the figure below:** 20 | 21 | ⟶ 22 | 23 |
24 | 25 | **5. [Input layer, hidden layer, output layer]** 26 | 27 | ⟶ 28 | 29 |
30 | 31 | **6. By noting i the ith layer of the network and j the jth hidden unit of the layer, we have:** 32 | 33 | ⟶ 34 | 35 |
36 | 37 | **7. where we note w, b, z the weight, bias and output respectively.** 38 | 39 | ⟶ 40 | 41 |
42 | 43 | **8. Activation function ― Activation functions are used at the end of a hidden unit to introduce non-linear complexities to the model. Here are the most common ones:** 44 | 45 | ⟶ 46 | 47 |
48 | 49 | **9. [Sigmoid, Tanh, ReLU, Leaky ReLU]** 50 | 51 | ⟶ 52 | 53 |
54 | 55 | **10. Cross-entropy loss ― In the context of neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:** 56 | 57 | ⟶ 58 | 59 |
60 | 61 | **11. Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. This can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.** 62 | 63 | ⟶ 64 | 65 |
66 | 67 | **12. Backpropagation ― Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to weight w is computed using the chain rule and is of the following form:** 68 | 69 | ⟶ 70 | 71 |
72 | 73 | **13. As a result, the weight is updated as follows:** 74 | 75 | ⟶ 76 | 77 |
78 | 79 | **14. Updating weights ― In a neural network, weights are updated as follows:** 80 | 81 | ⟶ 82 | 83 |
84 | 85 | **15. Step 1: Take a batch of training data.** 86 | 87 | ⟶ 88 | 89 |
90 | 91 | **16. Step 2: Perform forward propagation to obtain the corresponding loss.** 92 | 93 | ⟶ 94 | 95 |
96 | 97 | **17. Step 3: Backpropagate the loss to get the gradients.** 98 | 99 | ⟶ 100 | 101 |
102 | 103 | **18. Step 4: Use the gradients to update the weights of the network.** 104 | 105 | ⟶ 106 | 107 |
108 | 109 | **19. Dropout ― Dropout is a technique meant to prevent overfitting of the training data by dropping out units in a neural network. In practice, neurons are either dropped with probability p or kept with probability 1−p.** 110 | 111 | ⟶ 112 | 113 |
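A sketch of the "inverted dropout" variant (the train-time rescaling by 1/(1−p) is an implementation detail assumed here; it keeps expected activations unchanged between train and test):

```python
import numpy as np

def dropout(a, p, rng, train=True):
    # at train time, drop each unit with probability p and rescale the survivors;
    # at test time, the activations pass through unchanged
    if not train:
        return a
    mask = rng.random(a.shape) >= p          # kept with probability 1 - p
    return a * mask / (1 - p)
```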
114 | 115 | **20. Convolutional Neural Networks** 116 | 117 | ⟶ 118 | 119 |
120 | 121 | **21. Convolutional layer requirement ― By noting W the input volume size, F the size of the convolutional layer neurons, and P the amount of zero padding, the number of neurons N that fit in a given volume is such that:** 122 | 123 | ⟶ 124 | 125 |
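Numerically, N = (W−F+2P)/S+1 along each spatial dimension; the stride S is not part of the statement above, so it defaults to 1 in this sketch:

```python
def conv_output_size(W, F, P, S=1):
    # number of neurons N that fit along one spatial dimension:
    # N = (W - F + 2P) / S + 1
    assert (W - F + 2 * P) % S == 0, "filter placements must tile the padded input"
    return (W - F + 2 * P) // S + 1
```

For instance, a 32-wide input with 5-wide filters and padding 2 keeps its size.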
126 | 127 | **22. Batch normalization ― It is a step with hyperparameters γ,β that normalizes the batch {xi}. By noting μB,σ2B the mean and variance of the batch that we want to correct, it is done as follows:** 128 | 129 | ⟶ 130 | 131 |
132 | 133 | **23. It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.** 134 | 135 | ⟶ 136 | 137 |
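A minimal forward-pass sketch (the small eps added to the variance is a standard numerical-stability assumption; the running statistics used at test time are omitted):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # normalize each feature with the batch mean mu_B and variance sigma2_B,
    # then rescale and shift with the learnable parameters gamma, beta
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```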
138 | 139 | **24. Recurrent Neural Networks** 140 | 141 | ⟶ 142 | 143 |
144 | 145 | **25. Types of gates ― Here are the different types of gates that we encounter in a typical recurrent neural network:** 146 | 147 | ⟶ 148 | 149 |
150 | 151 | **26. [Input gate, forget gate, gate, output gate]** 152 | 153 | ⟶ 154 | 155 |
156 | 157 | **27. [Write to cell or not?, Erase a cell or not?, How much to write to cell?, How much to reveal cell?]** 158 | 159 | ⟶ 160 | 161 |
162 | 163 | **28. LSTM ― A long short-term memory (LSTM) network is a type of RNN model that avoids the vanishing gradient problem by adding 'forget' gates.** 164 | 165 | ⟶ 166 | 167 |
168 | 169 | **29. Reinforcement Learning and Control** 170 | 171 | ⟶ 172 | 173 |
174 | 175 | **30. The goal of reinforcement learning is for an agent to learn how to evolve in an environment.** 176 | 177 | ⟶ 178 | 179 |
180 | 181 | **31. Definitions** 182 | 183 | ⟶ 184 | 185 |
186 | 187 | **32. Markov decision processes ― A Markov decision process (MDP) is a 5-tuple (S,A,{Psa},γ,R) where:** 188 | 189 | ⟶ 190 | 191 |
192 | 193 | **33. S is the set of states** 194 | 195 | ⟶ 196 | 197 |
198 | 199 | **34. A is the set of actions** 200 | 201 | ⟶ 202 | 203 |
204 | 205 | **35. {Psa} are the state transition probabilities for s∈S and a∈A** 206 | 207 | ⟶ 208 | 209 |
210 | 211 | **36. γ∈[0,1[ is the discount factor** 212 | 213 | ⟶ 214 | 215 |
216 | 217 | **37. R:S×A⟶R or R:S⟶R is the reward function that the algorithm wants to maximize** 218 | 219 | ⟶ 220 | 221 |
222 | 223 | **38. Policy ― A policy π is a function π:S⟶A that maps states to actions.** 224 | 225 | ⟶ 226 | 227 |
228 | 229 | **39. Remark: we say that we execute a given policy π if given a state s we take the action a=π(s).** 230 | 231 | ⟶ 232 | 233 |
234 | 235 | **40. Value function ― For a given policy π and a given state s, we define the value function Vπ as follows:** 236 | 237 | ⟶ 238 | 239 |
240 | 241 | **41. Bellman equation ― The optimal Bellman equation characterizes the value function Vπ∗ of the optimal policy π∗:** 242 | 243 | ⟶ 244 | 245 |
246 | 247 | **42. Remark: we note that the optimal policy π∗ for a given state s is such that:** 248 | 249 | ⟶ 250 | 251 |
252 | 253 | **43. Value iteration algorithm ― The value iteration algorithm is in two steps:** 254 | 255 | ⟶ 256 | 257 |
258 | 259 | **44. 1) We initialize the value:** 260 | 261 | ⟶ 262 | 263 |
264 | 265 | **45. 2) We iterate the value based on the values before:** 266 | 267 | ⟶ 268 | 269 |
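The two steps above can be run on a tiny hypothetical MDP (two states, two actions, with the reward taken as a function of the state only):

```python
import numpy as np

# P[a][s][s']: probability of moving to s' after taking action a in state s
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.7, 0.3]]])
R = np.array([0.0, 1.0])    # reward of each state
gamma = 0.9                 # discount factor

V = np.zeros(2)                               # 1) initialize the value
for _ in range(500):                          # 2) iterate based on the values before
    V = R + gamma * np.max(P @ V, axis=0)     # Bellman update: max over actions
```

After convergence V satisfies the optimal Bellman equation, and the optimal policy reads off the maximizing action in each state.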
270 | 271 | **46. Maximum likelihood estimate ― The maximum likelihood estimates for the state transition probabilities are as follows:** 272 | 273 | ⟶ 274 | 275 |
276 | 277 | **47. times took action a in state s and got to s′** 278 | 279 | ⟶ 280 | 281 |
282 | 283 | **48. times took action a in state s** 284 | 285 | ⟶ 286 | 287 |
288 | 289 | **49. Q-learning ― Q-learning is a model-free estimation of Q, which is done as follows:** 290 | 291 | ⟶ 292 | 293 |
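A single tabular Q-learning update can be sketched as follows (the learning-rate α and discount γ defaults are illustrative choices):

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # model-free update: move Q[s][a] toward the sampled target
    # r + gamma * max_a' Q[s'][a'], without ever estimating the transition model
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
```

Repeated over observed transitions (s, a, r, s′), the table moves toward the optimal action-value function under the usual step-size conditions.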
294 | 295 | **50. View PDF version on GitHub** 296 | 297 | ⟶ 298 | 299 |
300 | 301 | **51. [Neural Networks, Architecture, Activation function, Backpropagation, Dropout]** 302 | 303 | ⟶ 304 | 305 |
306 | 307 | **52. [Convolutional Neural Networks, Convolutional layer, Batch normalization]** 308 | 309 | ⟶ 310 | 311 |
312 | 313 | **53. [Recurrent Neural Networks, Gates, LSTM]** 314 | 315 | ⟶ 316 | 317 |
318 | 319 | **54. [Reinforcement learning, Markov decision processes, Value/policy iteration, Approximate dynamic programming, Policy search]** 320 | 321 | ⟶ 322 | -------------------------------------------------------------------------------- /de/refresher-linear-algebra.md: -------------------------------------------------------------------------------- 1 | **1. Linear Algebra and Calculus refresher** 2 | 3 | ⟶ 4 | 5 |
6 | 7 | **2. General notations** 8 | 9 | ⟶ 10 | 11 |
12 | 13 | **3. Definitions** 14 | 15 | ⟶ 16 | 17 |
18 | 19 | **4. Vector ― We note x∈Rn a vector with n entries, where xi∈R is the ith entry:** 20 | 21 | ⟶ 22 | 23 |
24 | 25 | **5. Matrix ― We note A∈Rm×n a matrix with m rows and n columns, where Ai,j∈R is the entry located in the ith row and jth column:** 26 | 27 | ⟶ 28 | 29 |
30 | 31 | **6. Remark: the vector x defined above can be viewed as a n×1 matrix and is more particularly called a column-vector.** 32 | 33 | ⟶ 34 | 35 |
36 | 37 | **7. Main matrices** 38 | 39 | ⟶ 40 | 41 |
42 | 43 | **8. Identity matrix ― The identity matrix I∈Rn×n is a square matrix with ones in its diagonal and zero everywhere else:** 44 | 45 | ⟶ 46 | 47 |
48 | 49 | **9. Remark: for all matrices A∈Rn×n, we have A×I=I×A=A.** 50 | 51 | ⟶ 52 | 53 |
54 | 55 | **10. Diagonal matrix ― A diagonal matrix D∈Rn×n is a square matrix with nonzero values in its diagonal and zero everywhere else:** 56 | 57 | ⟶ 58 | 59 |
60 | 61 | **11. Remark: we also note D as diag(d1,...,dn).** 62 | 63 | ⟶ 64 | 65 |
66 | 67 | **12. Matrix operations** 68 | 69 | ⟶ 70 | 71 |
72 | 73 | **13. Multiplication** 74 | 75 | ⟶ 76 | 77 |
78 | 79 | **14. Vector-vector ― There are two types of vector-vector products:** 80 | 81 | ⟶ 82 | 83 |
84 | 85 | **15. inner product: for x,y∈Rn, we have:** 86 | 87 | ⟶ 88 | 89 |
90 | 91 | **16. outer product: for x∈Rm,y∈Rn, we have:** 92 | 93 | ⟶ 94 | 95 |
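The two products differ in shape: the inner product xTy is a scalar, while the outer product xyT is an m×n matrix. For example:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

inner = x @ y           # x^T y = 1*4 + 2*5 + 3*6 = 32, a scalar
outer = np.outer(x, y)  # x y^T, a 3x3 matrix with entries x_i * y_j
```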
96 | 97 | **17. Matrix-vector ― The product of matrix A∈Rm×n and vector x∈Rn is a vector of size Rm, such that:** 98 | 99 | ⟶ 100 | 101 |
102 | 103 | **18. where aTr,i are the vector rows and ac,j are the vector columns of A, and xi are the entries of x.** 104 | 105 | ⟶ 106 | 107 |
108 | 109 | **19. Matrix-matrix ― The product of matrices A∈Rm×n and B∈Rn×p is a matrix of size Rm×p, such that:** 110 | 111 | ⟶ 112 | 113 |
114 | 115 | **20. where aTr,i,bTr,i are the vector rows and ac,j,bc,j are the vector columns of A and B respectively** 116 | 117 | ⟶ 118 | 119 |
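The output shapes are easy to verify numerically: A∈Rm×n times x∈Rn lives in Rm, and A times B∈Rn×p lives in Rm×p:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))   # A in R^{m x n} with m=4, n=3
B = rng.normal(size=(3, 2))   # B in R^{n x p} with p=2
x = rng.normal(size=3)        # x in R^n

Ax = A @ x   # matrix-vector product: a vector in R^m
AB = A @ B   # matrix-matrix product: a matrix in R^{m x p}
```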
120 | 121 | **21. Other operations** 122 | 123 | ⟶ 124 | 125 |
126 | 127 | **22. Transpose ― The transpose of a matrix A∈Rm×n, noted AT, is such that its entries are flipped:** 128 | 129 | ⟶ 130 | 131 |
132 | 133 | **23. Remark: for matrices A,B, we have (AB)T=BTAT** 134 | 135 | ⟶ 136 | 137 |
138 | 139 | **24. Inverse ― The inverse of an invertible square matrix A is noted A−1 and is the only matrix such that:** 140 | 141 | ⟶ 142 | 143 |
144 | 145 | **25. Remark: not all square matrices are invertible. Also, for matrices A,B, we have (AB)−1=B−1A−1** 146 | 147 | ⟶ 148 | 149 |
150 | 151 | **26. Trace ― The trace of a square matrix A, noted tr(A), is the sum of its diagonal entries:** 152 | 153 | ⟶ 154 | 155 |
156 | 157 | **27. Remark: for matrices A,B, we have tr(AT)=tr(A) and tr(AB)=tr(BA)** 158 | 159 | ⟶ 160 | 161 |
162 | 163 | **28. Determinant ― The determinant of a square matrix A∈Rn×n, noted |A| or det(A) is expressed recursively in terms of A∖i,∖j, which is the matrix A without its ith row and jth column, as follows:** 164 | 165 | ⟶ 166 | 167 |
168 | 169 | **29. Remark: A is invertible if and only if |A|≠0. Also, |AB|=|A||B| and |AT|=|A|.** 170 | 171 | ⟶ 172 | 173 |
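The transpose, inverse, trace and determinant identities above can be checked on random matrices (Gaussian matrices are invertible almost surely):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(3, 3))

t_ok = np.allclose((A @ B).T, B.T @ A.T)                      # (AB)^T  = B^T A^T
inv_ok = np.allclose(np.linalg.inv(A @ B),
                     np.linalg.inv(B) @ np.linalg.inv(A))     # (AB)^-1 = B^-1 A^-1
tr_ok = np.isclose(np.trace(A @ B), np.trace(B @ A))          # tr(AB)  = tr(BA)
det_ok = np.isclose(np.linalg.det(A @ B),
                    np.linalg.det(A) * np.linalg.det(B))      # |AB|    = |A||B|
```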
174 | 175 | **30. Matrix properties** 176 | 177 | ⟶ 178 | 179 |
180 | 181 | **31. Definitions** 182 | 183 | ⟶ 184 | 185 |
186 | 187 | **32. Symmetric decomposition ― A given matrix A can be expressed in terms of its symmetric and antisymmetric parts as follows:** 188 | 189 | ⟶ 190 | 191 |
192 | 193 | **33. [Symmetric, Antisymmetric]** 194 | 195 | ⟶ 196 | 197 |
198 | 199 | **34. Norm ― A norm is a function N:V⟶[0,+∞[ where V is a vector space, and such that for all x,y∈V, we have:** 200 | 201 | ⟶ 202 | 203 |
204 | 205 | **35. N(ax)=|a|N(x) for a scalar** 206 | 207 | ⟶ 208 | 209 |
210 | 211 | **36. if N(x)=0, then x=0** 212 | 213 | ⟶ 214 | 215 |
216 | 217 | **37. For x∈V, the most commonly used norms are summed up in the table below:** 218 | 219 | ⟶ 220 | 221 |
222 | 223 | **38. [Norm, Notation, Definition, Use case]** 224 | 225 | ⟶ 226 | 227 |
228 | 229 | **39. Linear dependence ― A set of vectors is said to be linearly dependent if one of the vectors in the set can be defined as a linear combination of the others.** 230 | 231 | ⟶ 232 | 233 |
234 | 235 | **40. Remark: if no vector can be written this way, then the vectors are said to be linearly independent** 236 | 237 | ⟶ 238 | 239 |
240 | 241 | **41. Matrix rank ― The rank of a given matrix A is noted rank(A) and is the dimension of the vector space generated by its columns. This is equivalent to the maximum number of linearly independent columns of A.** 242 | 243 | ⟶ 244 | 245 |
246 | 247 | **42. Positive semi-definite matrix ― A matrix A∈Rn×n is positive semi-definite (PSD) and is noted A⪰0 if we have:** 248 | 249 | ⟶ 250 | 251 |
252 | 253 | **43. Remark: similarly, a matrix A is said to be positive definite, and is noted A≻0, if it is a PSD matrix which satisfies for all non-zero vector x, xTAx>0.** 254 | 255 | ⟶ 256 | 257 |
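A quick way to produce and test a PSD matrix: for any real M, the matrix MTM is PSD, its (real) eigenvalues are non-negative, and every quadratic form xTAx = ‖Mx‖2 ≥ 0:

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.normal(size=(3, 3))
A = M.T @ M                        # M^T M is symmetric positive semi-definite

eigvals = np.linalg.eigvalsh(A)    # real eigenvalues of the symmetric matrix A
x = rng.normal(size=3)
quad = x @ A @ x                   # the quadratic form x^T A x
```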
258 | 259 | **44. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:** 260 | 261 | ⟶ 262 | 263 |
264 | 265 | **45. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:** 266 | 267 | ⟶ 268 | 269 |
270 | 271 | **46. diagonal** 272 | 273 | ⟶ 274 | 275 |
276 | 277 | **47. Singular-value decomposition ― For a given matrix A of dimensions m×n, the singular-value decomposition (SVD) is a factorization technique that guarantees the existence of U m×m unitary, Σ m×n diagonal and V n×n unitary matrices, such that:** 278 | 279 | ⟶ 280 | 281 |
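NumPy returns the diagonal of Σ as a vector, so rebuilding A = UΣVT takes one extra step (for real matrices, U and V are orthogonal):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(4, 3))        # m=4, n=3

U, s, Vt = np.linalg.svd(A)        # U is 4x4, Vt is 3x3, s holds the singular values
Sigma = np.zeros((4, 3))
Sigma[:3, :3] = np.diag(s)         # embed the singular values in an m x n matrix

A_rebuilt = U @ Sigma @ Vt         # A = U Sigma V^T
```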
282 | 283 | **48. Matrix calculus** 284 | 285 | ⟶ 286 | 287 |
288 | 289 | **49. Gradient ― Let f:Rm×n→R be a function and A∈Rm×n be a matrix. The gradient of f with respect to A is a m×n matrix, noted ∇Af(A), such that:** 290 | 291 | ⟶ 292 | 293 |
294 | 295 | **50. Remark: the gradient of f is only defined when f is a function that returns a scalar.** 296 | 297 | ⟶ 298 | 299 |
300 | 301 | **51. Hessian ― Let f:Rn→R be a function and x∈Rn be a vector. The hessian of f with respect to x is a n×n symmetric matrix, noted ∇2xf(x), such that:** 302 | 303 | ⟶ 304 | 305 |
306 | 307 | **52. Remark: the hessian of f is only defined when f is a function that returns a scalar** 308 | 309 | ⟶ 310 | 311 |
312 | 313 | **53. Gradient operations ― For matrices A,B,C, the following gradient properties are worth having in mind:** 314 | 315 | ⟶ 316 | 317 |
318 | 319 | **54. [General notations, Definitions, Main matrices]** 320 | 321 | ⟶ 322 | 323 |
324 | 325 | **55. [Matrix operations, Multiplication, Other operations]** 326 | 327 | ⟶ 328 | 329 |
330 | 331 | **56. [Matrix properties, Norm, Eigenvalue/Eigenvector, Singular-value decomposition]** 332 | 333 | ⟶ 334 | 335 |
336 | 337 | **57. [Matrix calculus, Gradient, Hessian, Operations]** 338 | 339 | ⟶ 340 | -------------------------------------------------------------------------------- /he/refresher-linear-algebra.md: -------------------------------------------------------------------------------- 1 | **1. Linear Algebra and Calculus refresher** 2 | 3 | ⟶ 4 | 5 |
6 | 7 | **2. General notations** 8 | 9 | ⟶ 10 | 11 |
12 | 13 | **3. Definitions** 14 | 15 | ⟶ 16 | 17 |
18 | 19 | **4. Vector ― We note x∈Rn a vector with n entries, where xi∈R is the ith entry:** 20 | 21 | ⟶ 22 | 23 |
24 | 25 | **5. Matrix ― We note A∈Rm×n a matrix with m rows and n columns, where Ai,j∈R is the entry located in the ith row and jth column:** 26 | 27 | ⟶ 28 | 29 |
30 | 31 | **6. Remark: the vector x defined above can be viewed as a n×1 matrix and is more particularly called a column-vector.** 32 | 33 | ⟶ 34 | 35 |
36 | 37 | **7. Main matrices** 38 | 39 | ⟶ 40 | 41 |
42 | 43 | **8. Identity matrix ― The identity matrix I∈Rn×n is a square matrix with ones in its diagonal and zero everywhere else:** 44 | 45 | ⟶ 46 | 47 |
48 | 49 | **9. Remark: for all matrices A∈Rn×n, we have A×I=I×A=A.** 50 | 51 | ⟶ 52 | 53 |
54 | 55 | **10. Diagonal matrix ― A diagonal matrix D∈Rn×n is a square matrix with nonzero values in its diagonal and zero everywhere else:** 56 | 57 | ⟶ 58 | 59 |
60 | 61 | **11. Remark: we also note D as diag(d1,...,dn).** 62 | 63 | ⟶ 64 | 65 |
66 | 67 | **12. Matrix operations** 68 | 69 | ⟶ 70 | 71 |
72 | 73 | **13. Multiplication** 74 | 75 | ⟶ 76 | 77 |
78 | 79 | **14. Vector-vector ― There are two types of vector-vector products:** 80 | 81 | ⟶ 82 | 83 |
84 | 85 | **15. inner product: for x,y∈Rn, we have:** 86 | 87 | ⟶ 88 | 89 |
90 | 91 | **16. outer product: for x∈Rm,y∈Rn, we have:** 92 | 93 | ⟶ 94 | 95 |
96 | 97 | **17. Matrix-vector ― The product of matrix A∈Rm×n and vector x∈Rn is a vector of size Rm, such that:** 98 | 99 | ⟶ 100 | 101 |
102 | 103 | **18. where aTr,i are the vector rows and ac,j are the vector columns of A, and xi are the entries of x.** 104 | 105 | ⟶ 106 | 107 |
108 | 109 | **19. Matrix-matrix ― The product of matrices A∈Rm×n and B∈Rn×p is a matrix of size Rm×p, such that:** 110 | 111 | ⟶ 112 | 113 |
114 | 115 | **20. where aTr,i,bTr,i are the vector rows and ac,j,bc,j are the vector columns of A and B respectively** 116 | 117 | ⟶ 118 | 119 |
120 | 121 | **21. Other operations** 122 | 123 | ⟶ 124 | 125 |
126 | 127 | **22. Transpose ― The transpose of a matrix A∈Rm×n, noted AT, is such that its entries are flipped:** 128 | 129 | ⟶ 130 | 131 |
132 | 133 | **23. Remark: for matrices A,B, we have (AB)T=BTAT** 134 | 135 | ⟶ 136 | 137 |
138 | 139 | **24. Inverse ― The inverse of an invertible square matrix A is noted A−1 and is the only matrix such that:** 140 | 141 | ⟶ 142 | 143 |
144 | 145 | **25. Remark: not all square matrices are invertible. Also, for matrices A,B, we have (AB)−1=B−1A−1** 146 | 147 | ⟶ 148 | 149 |
150 | 151 | **26. Trace ― The trace of a square matrix A, noted tr(A), is the sum of its diagonal entries:** 152 | 153 | ⟶ 154 | 155 |
156 | 157 | **27. Remark: for matrices A,B, we have tr(AT)=tr(A) and tr(AB)=tr(BA)** 158 | 159 | ⟶ 160 | 161 |
162 | 163 | **28. Determinant ― The determinant of a square matrix A∈Rn×n, noted |A| or det(A) is expressed recursively in terms of A∖i,∖j, which is the matrix A without its ith row and jth column, as follows:** 164 | 165 | ⟶ 166 | 167 |
168 | 169 | **29. Remark: A is invertible if and only if |A|≠0. Also, |AB|=|A||B| and |AT|=|A|.** 170 | 171 | ⟶ 172 | 173 |
174 | 175 | **30. Matrix properties** 176 | 177 | ⟶ 178 | 179 |
180 | 181 | **31. Definitions** 182 | 183 | ⟶ 184 | 185 |
186 | 187 | **32. Symmetric decomposition ― A given matrix A can be expressed in terms of its symmetric and antisymmetric parts as follows:** 188 | 189 | ⟶ 190 | 191 |
192 | 193 | **33. [Symmetric, Antisymmetric]** 194 | 195 | ⟶ 196 | 197 |
198 | 199 | **34. Norm ― A norm is a function N:V⟶[0,+∞[ where V is a vector space, and such that for all x,y∈V, we have:** 200 | 201 | ⟶ 202 | 203 |
204 | 205 | **35. N(ax)=|a|N(x) for a scalar** 206 | 207 | ⟶ 208 | 209 |
210 | 211 | **36. if N(x)=0, then x=0** 212 | 213 | ⟶ 214 | 215 |
216 | 217 | **37. For x∈V, the most commonly used norms are summed up in the table below:** 218 | 219 | ⟶ 220 | 221 |
222 | 223 | **38. [Norm, Notation, Definition, Use case]** 224 | 225 | ⟶ 226 | 227 |
228 | 229 | **39. Linear dependence ― A set of vectors is said to be linearly dependent if one of the vectors in the set can be defined as a linear combination of the others.** 230 | 231 | ⟶ 232 | 233 |
234 | 235 | **40. Remark: if no vector can be written this way, then the vectors are said to be linearly independent** 236 | 237 | ⟶ 238 | 239 |
240 | 241 | **41. Matrix rank ― The rank of a given matrix A is noted rank(A) and is the dimension of the vector space generated by its columns. This is equivalent to the maximum number of linearly independent columns of A.** 242 | 243 | ⟶ 244 | 245 |
246 | 247 | **42. Positive semi-definite matrix ― A matrix A∈Rn×n is positive semi-definite (PSD) and is noted A⪰0 if we have:** 248 | 249 | ⟶ 250 | 251 |
252 | 253 | **43. Remark: similarly, a matrix A is said to be positive definite, and is noted A≻0, if it is a PSD matrix which satisfies for all non-zero vector x, xTAx>0.** 254 | 255 | ⟶ 256 | 257 |
258 | 259 | **44. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:** 260 | 261 | ⟶ 262 | 263 |
264 | 265 | **45. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:** 266 | 267 | ⟶ 268 | 269 |
270 | 271 | **46. diagonal** 272 | 273 | ⟶ 274 | 275 |
276 | 277 | **47. Singular-value decomposition ― For a given matrix A of dimensions m×n, the singular-value decomposition (SVD) is a factorization technique that guarantees the existence of U m×m unitary, Σ m×n diagonal and V n×n unitary matrices, such that:** 278 | 279 | ⟶ 280 | 281 |
282 | 283 | **48. Matrix calculus** 284 | 285 | ⟶ 286 | 287 |
288 | 289 | **49. Gradient ― Let f:Rm×n→R be a function and A∈Rm×n be a matrix. The gradient of f with respect to A is a m×n matrix, noted ∇Af(A), such that:** 290 | 291 | ⟶ 292 | 293 |
294 | 295 | **50. Remark: the gradient of f is only defined when f is a function that returns a scalar.** 296 | 297 | ⟶ 298 | 299 |
300 | 301 | **51. Hessian ― Let f:Rn→R be a function and x∈Rn be a vector. The hessian of f with respect to x is a n×n symmetric matrix, noted ∇2xf(x), such that:** 302 | 303 | ⟶ 304 | 305 |
306 | 307 | **52. Remark: the hessian of f is only defined when f is a function that returns a scalar** 308 | 309 | ⟶ 310 | 311 |
312 | 313 | **53. Gradient operations ― For matrices A,B,C, the following gradient properties are worth having in mind:** 314 | 315 | ⟶ 316 | 317 |
318 | 319 | **54. [General notations, Definitions, Main matrices]** 320 | 321 | ⟶ 322 | 323 |
324 | 325 | **55. [Matrix operations, Multiplication, Other operations]** 326 | 327 | ⟶ 328 | 329 |
330 | 331 | **56. [Matrix properties, Norm, Eigenvalue/Eigenvector, Singular-value decomposition]** 332 | 333 | ⟶ 334 | 335 |
336 | 337 | **57. [Matrix calculus, Gradient, Hessian, Operations]** 338 | 339 | ⟶ 340 | -------------------------------------------------------------------------------- /hi/refresher-linear-algebra.md: -------------------------------------------------------------------------------- 1 | **1. Linear Algebra and Calculus refresher** 2 | 3 | ⟶ 4 | 5 |
6 | 7 | **2. General notations** 8 | 9 | ⟶ 10 | 11 |
12 | 13 | **3. Definitions** 14 | 15 | ⟶ 16 | 17 |
18 | 19 | **4. Vector ― We note x∈Rn a vector with n entries, where xi∈R is the ith entry:** 20 | 21 | ⟶ 22 | 23 |
24 | 25 | **5. Matrix ― We note A∈Rm×n a matrix with m rows and n columns, where Ai,j∈R is the entry located in the ith row and jth column:** 26 | 27 | ⟶ 28 | 29 |
30 | 31 | **6. Remark: the vector x defined above can be viewed as a n×1 matrix and is more particularly called a column-vector.** 32 | 33 | ⟶ 34 | 35 |
36 | 37 | **7. Main matrices** 38 | 39 | ⟶ 40 | 41 |
42 | 43 | **8. Identity matrix ― The identity matrix I∈Rn×n is a square matrix with ones in its diagonal and zero everywhere else:** 44 | 45 | ⟶ 46 | 47 |
48 | 49 | **9. Remark: for all matrices A∈Rn×n, we have A×I=I×A=A.** 50 | 51 | ⟶ 52 | 53 |
54 | 55 | **10. Diagonal matrix ― A diagonal matrix D∈Rn×n is a square matrix with nonzero values in its diagonal and zero everywhere else:** 56 | 57 | ⟶ 58 | 59 |
60 | 61 | **11. Remark: we also note D as diag(d1,...,dn).** 62 | 63 | ⟶ 64 | 65 |
66 | 67 | **12. Matrix operations** 68 | 69 | ⟶ 70 | 71 |
72 | 73 | **13. Multiplication** 74 | 75 | ⟶ 76 | 77 |
78 | 79 | **14. Vector-vector ― There are two types of vector-vector products:** 80 | 81 | ⟶ 82 | 83 |
84 | 85 | **15. inner product: for x,y∈Rn, we have:** 86 | 87 | ⟶ 88 | 89 |
90 | 91 | **16. outer product: for x∈Rm,y∈Rn, we have:** 92 | 93 | ⟶ 94 | 95 |
96 | 97 | **17. Matrix-vector ― The product of matrix A∈Rm×n and vector x∈Rn is a vector of size Rm, such that:** 98 | 99 | ⟶ 100 | 101 |
102 | 103 | **18. where aTr,i are the vector rows and ac,j are the vector columns of A, and xi are the entries of x.** 104 | 105 | ⟶ 106 | 107 |
108 | 109 | **19. Matrix-matrix ― The product of matrices A∈Rm×n and B∈Rn×p is a matrix of size Rm×p, such that:** 110 | 111 | ⟶ 112 | 113 |
114 | 115 | **20. where aTr,i,bTr,i are the vector rows and ac,j,bc,j are the vector columns of A and B respectively** 116 | 117 | ⟶ 118 | 119 |
120 | 121 | **21. Other operations** 122 | 123 | ⟶ 124 | 125 |
126 | 127 | **22. Transpose ― The transpose of a matrix A∈Rm×n, noted AT, is such that its entries are flipped:** 128 | 129 | ⟶ 130 | 131 |
132 | 133 | **23. Remark: for matrices A,B, we have (AB)T=BTAT** 134 | 135 | ⟶ 136 | 137 |
138 | 139 | **24. Inverse ― The inverse of an invertible square matrix A is noted A−1 and is the only matrix such that:** 140 | 141 | ⟶ 142 | 143 |
144 | 145 | **25. Remark: not all square matrices are invertible. Also, for matrices A,B, we have (AB)−1=B−1A−1** 146 | 147 | ⟶ 148 | 149 |
150 | 151 | **26. Trace ― The trace of a square matrix A, noted tr(A), is the sum of its diagonal entries:** 152 | 153 | ⟶ 154 | 155 |
156 | 157 | **27. Remark: for matrices A,B, we have tr(AT)=tr(A) and tr(AB)=tr(BA)** 158 | 159 | ⟶ 160 | 161 |
162 | 163 | **28. Determinant ― The determinant of a square matrix A∈Rn×n, noted |A| or det(A) is expressed recursively in terms of A∖i,∖j, which is the matrix A without its ith row and jth column, as follows:** 164 | 165 | ⟶ 166 | 167 |
168 | 169 | **29. Remark: A is invertible if and only if |A|≠0. Also, |AB|=|A||B| and |AT|=|A|.** 170 | 171 | ⟶ 172 | 173 |
174 | 175 | **30. Matrix properties** 176 | 177 | ⟶ 178 | 179 |
180 | 181 | **31. Definitions** 182 | 183 | ⟶ 184 | 185 |
186 | 187 | **32. Symmetric decomposition ― A given matrix A can be expressed in terms of its symmetric and antisymmetric parts as follows:** 188 | 189 | ⟶ 190 | 191 |
192 | 193 | **33. [Symmetric, Antisymmetric]** 194 | 195 | ⟶ 196 | 197 |
198 | 199 | **34. Norm ― A norm is a function N:V⟶[0,+∞[ where V is a vector space, and such that for all x,y∈V, we have:** 200 | 201 | ⟶ 202 | 203 |
204 | 205 | **35. N(ax)=|a|N(x) for a scalar** 206 | 207 | ⟶ 208 | 209 |
210 | 211 | **36. if N(x)=0, then x=0** 212 | 213 | ⟶ 214 | 215 |
216 | 217 | **37. For x∈V, the most commonly used norms are summed up in the table below:** 218 | 219 | ⟶ 220 | 221 |
222 | 223 | **38. [Norm, Notation, Definition, Use case]** 224 | 225 | ⟶ 226 | 227 |
228 | 229 | **39. Linear dependence ― A set of vectors is said to be linearly dependent if one of the vectors in the set can be defined as a linear combination of the others.** 230 | 231 | ⟶ 232 | 233 |
234 | 235 | **40. Remark: if no vector can be written this way, then the vectors are said to be linearly independent** 236 | 237 | ⟶ 238 | 239 |
240 | 241 | **41. Matrix rank ― The rank of a given matrix A is noted rank(A) and is the dimension of the vector space generated by its columns. This is equivalent to the maximum number of linearly independent columns of A.** 242 | 243 | ⟶ 244 | 245 |
246 | 247 | **42. Positive semi-definite matrix ― A matrix A∈Rn×n is positive semi-definite (PSD) and is noted A⪰0 if we have:** 248 | 249 | ⟶ 250 | 251 |
252 | 253 | **43. Remark: similarly, a matrix A is said to be positive definite, and is noted A≻0, if it is a PSD matrix which satisfies for all non-zero vector x, xTAx>0.** 254 | 255 | ⟶ 256 | 257 |
258 | 259 | **44. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:** 260 | 261 | ⟶ 262 | 263 |
264 | 265 | **45. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:** 266 | 267 | ⟶ 268 | 269 |
270 | 271 | **46. diagonal** 272 | 273 | ⟶ 274 | 275 |
276 | 277 | **47. Singular-value decomposition ― For a given matrix A of dimensions m×n, the singular-value decomposition (SVD) is a factorization technique that guarantees the existence of U m×m unitary, Σ m×n diagonal and V n×n unitary matrices, such that:** 278 | 279 | ⟶ 280 | 281 |
282 | 283 | **48. Matrix calculus** 284 | 285 | ⟶ 286 | 287 |
288 | 289 | **49. Gradient ― Let f:Rm×n→R be a function and A∈Rm×n be a matrix. The gradient of f with respect to A is a m×n matrix, noted ∇Af(A), such that:** 290 | 291 | ⟶ 292 | 293 |
294 | 295 | **50. Remark: the gradient of f is only defined when f is a function that returns a scalar.** 296 | 297 | ⟶ 298 | 299 |
300 | 301 | **51. Hessian ― Let f:Rn→R be a function and x∈Rn be a vector. The hessian of f with respect to x is a n×n symmetric matrix, noted ∇2xf(x), such that:** 302 | 303 | ⟶ 304 | 305 |
306 | 307 | **52. Remark: the hessian of f is only defined when f is a function that returns a scalar** 308 | 309 | ⟶ 310 | 311 |
312 | 313 | **53. Gradient operations ― For matrices A,B,C, the following gradient properties are worth having in mind:** 314 | 315 | ⟶ 316 | 317 |
318 | 319 | **54. [General notations, Definitions, Main matrices]** 320 | 321 | ⟶ 322 | 323 |
324 | 325 | **55. [Matrix operations, Multiplication, Other operations]** 326 | 327 | ⟶ 328 | 329 |
330 | 331 | **56. [Matrix properties, Norm, Eigenvalue/Eigenvector, Singular-value decomposition]** 332 | 333 | ⟶ 334 | 335 |
336 | 337 | **57. [Matrix calculus, Gradient, Hessian, Operations]** 338 | 339 | ⟶ 340 | -------------------------------------------------------------------------------- /ru/refresher-linear-algebra.md: -------------------------------------------------------------------------------- 1 | **1. Linear Algebra and Calculus refresher** 2 | 3 | ⟶ 4 | 5 |
6 | 7 | **2. General notations** 8 | 9 | ⟶ 10 | 11 |
12 | 13 | **3. Definitions** 14 | 15 | ⟶ 16 | 17 |
18 | 19 | **4. Vector ― We note x∈Rn a vector with n entries, where xi∈R is the ith entry:** 20 | 21 | ⟶ 22 | 23 |
24 | 25 | **5. Matrix ― We note A∈Rm×n a matrix with m rows and n columns, where Ai,j∈R is the entry located in the ith row and jth column:** 26 | 27 | ⟶ 28 | 29 |
30 | 31 | **6. Remark: the vector x defined above can be viewed as a n×1 matrix and is more particularly called a column-vector.** 32 | 33 | ⟶ 34 | 35 |
36 | 37 | **7. Main matrices** 38 | 39 | ⟶ 40 | 41 |
42 | 43 | **8. Identity matrix ― The identity matrix I∈Rn×n is a square matrix with ones in its diagonal and zero everywhere else:** 44 | 45 | ⟶ 46 | 47 |
48 | 49 | **9. Remark: for all matrices A∈Rn×n, we have A×I=I×A=A.** 50 | 51 | ⟶ 52 | 53 |
54 | 55 | **10. Diagonal matrix ― A diagonal matrix D∈Rn×n is a square matrix with nonzero values in its diagonal and zero everywhere else:** 56 | 57 | ⟶ 58 | 59 |
60 | 61 | **11. Remark: we also note D as diag(d1,...,dn).** 62 | 63 | ⟶ 64 | 65 |
66 | 67 | **12. Matrix operations** 68 | 69 | ⟶ 70 | 71 |
72 | 73 | **13. Multiplication** 74 | 75 | ⟶ 76 | 77 |
78 | 79 | **14. Vector-vector ― There are two types of vector-vector products:** 80 | 81 | ⟶ 82 | 83 |
84 | 85 | **15. inner product: for x,y∈Rn, we have:** 86 | 87 | ⟶ 88 | 89 |
90 | 91 | **16. outer product: for x∈Rm,y∈Rn, we have:** 92 | 93 | ⟶ 94 | 95 |
96 | 97 | **17. Matrix-vector ― The product of matrix A∈Rm×n and vector x∈Rn is a vector of size Rm, such that:** 98 | 99 | ⟶ 100 | 101 |
102 | 103 | **18. where aTr,i are the vector rows and ac,j are the vector columns of A, and xi are the entries of x.** 104 | 105 | ⟶ 106 | 107 |
108 | 109 | **19. Matrix-matrix ― The product of matrices A∈Rm×n and B∈Rn×p is a matrix of size Rm×p, such that:** 110 | 111 | ⟶ 112 | 113 |
114 | 115 | **20. where aTr,i,bTr,i are the vector rows and ac,j,bc,j are the vector columns of A and B respectively** 116 | 117 | ⟶ 118 | 119 |
120 | 121 | **21. Other operations** 122 | 123 | ⟶ 124 | 125 |
126 | 127 | **22. Transpose ― The transpose of a matrix A∈Rm×n, noted AT, is such that its entries are flipped:** 128 | 129 | ⟶ 130 | 131 |
132 | 133 | **23. Remark: for matrices A,B, we have (AB)T=BTAT** 134 | 135 | ⟶ 136 | 137 |
138 | 139 | **24. Inverse ― The inverse of an invertible square matrix A is noted A−1 and is the only matrix such that:** 140 | 141 | ⟶ 142 | 143 |
144 | 145 | **25. Remark: not all square matrices are invertible. Also, for matrices A,B, we have (AB)−1=B−1A−1** 146 | 147 | ⟶ 148 | 149 |
150 | 151 | **26. Trace ― The trace of a square matrix A, noted tr(A), is the sum of its diagonal entries:** 152 | 153 | ⟶ 154 | 155 |
156 | 157 | **27. Remark: for matrices A,B, we have tr(AT)=tr(A) and tr(AB)=tr(BA)** 158 | 159 | ⟶ 160 | 161 |
162 | 163 | **28. Determinant ― The determinant of a square matrix A∈Rn×n, noted |A| or det(A) is expressed recursively in terms of A∖i,∖j, which is the matrix A without its ith row and jth column, as follows:** 164 | 165 | ⟶ 166 | 167 |
168 | 169 | **29. Remark: A is invertible if and only if |A|≠0. Also, |AB|=|A||B| and |AT|=|A|.** 170 | 171 | ⟶ 172 | 173 |
174 | 175 | **30. Matrix properties** 176 | 177 | ⟶ 178 | 179 |
180 | 181 | **31. Definitions** 182 | 183 | ⟶ 184 | 185 |
186 | 187 | **32. Symmetric decomposition ― A given matrix A can be expressed in terms of its symmetric and antisymmetric parts as follows:** 188 | 189 | ⟶ 190 | 191 |
192 | 193 | **33. [Symmetric, Antisymmetric]** 194 | 195 | ⟶ 196 | 197 |
198 | 199 | **34. Norm ― A norm is a function N:V⟶[0,+∞[ where V is a vector space, and such that for all x,y∈V, we have:** 200 | 201 | ⟶ 202 | 203 |
204 | 205 | **35. N(ax)=|a|N(x) for a scalar** 206 | 207 | ⟶ 208 | 209 |
210 | 211 | **36. if N(x)=0, then x=0** 212 | 213 | ⟶ 214 | 215 |
216 | 217 | **37. For x∈V, the most commonly used norms are summed up in the table below:** 218 | 219 | ⟶ 220 | 221 |
222 | 223 | **38. [Norm, Notation, Definition, Use case]** 224 | 225 | ⟶ 226 | 227 |
228 | 229 | **39. Linear dependence ― A set of vectors is said to be linearly dependent if one of the vectors in the set can be defined as a linear combination of the others.** 230 | 231 | ⟶ 232 | 233 |
234 | 235 | **40. Remark: if no vector can be written this way, then the vectors are said to be linearly independent** 236 | 237 | ⟶ 238 | 239 |
240 | 241 | **41. Matrix rank ― The rank of a given matrix A is noted rank(A) and is the dimension of the vector space generated by its columns. This is equivalent to the maximum number of linearly independent columns of A.** 242 | 243 | ⟶ 244 | 245 |
246 | 247 | **42. Positive semi-definite matrix ― A matrix A∈Rn×n is positive semi-definite (PSD) and is noted A⪰0 if we have:** 248 | 249 | ⟶ 250 | 251 |
252 | 253 | **43. Remark: similarly, a matrix A is said to be positive definite, and is noted A≻0, if it is a PSD matrix which satisfies for all non-zero vector x, xTAx>0.** 254 | 255 | ⟶ 256 | 257 |
258 | 259 | **44. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:** 260 | 261 | ⟶ 262 | 263 |
264 | 265 | **45. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:** 266 | 267 | ⟶ 268 | 269 |
270 | 271 | **46. diagonal** 272 | 273 | ⟶ 274 | 275 |
276 | 277 | **47. Singular-value decomposition ― For a given matrix A of dimensions m×n, the singular-value decomposition (SVD) is a factorization technique that guarantees the existence of U m×m unitary, Σ m×n diagonal and V n×n unitary matrices, such that:** 278 | 279 | ⟶ 280 | 281 |
282 | 283 | **48. Matrix calculus** 284 | 285 | ⟶ 286 | 287 |
288 | 289 | **49. Gradient ― Let f:Rm×n→R be a function and A∈Rm×n be a matrix. The gradient of f with respect to A is an m×n matrix, noted ∇Af(A), such that:** 290 | 291 | ⟶ 292 | 293 |
294 | 295 | **50. Remark: the gradient of f is only defined when f is a function that returns a scalar.** 296 | 297 | ⟶ 298 | 299 |
300 | 301 | **51. Hessian ― Let f:Rn→R be a function and x∈Rn be a vector. The Hessian of f with respect to x is an n×n symmetric matrix, noted ∇2xf(x), such that:** 302 | 303 | ⟶ 304 | 305 |
306 | 307 | **52. Remark: the Hessian of f is only defined when f is a function that returns a scalar.** 308 | 309 | ⟶ 310 | 311 |
312 | 313 | **53. Gradient operations ― For matrices A,B,C, the following gradient properties are worth having in mind:** 314 | 315 | ⟶ 316 | 317 |
318 | 319 | **54. [General notations, Definitions, Main matrices]** 320 | 321 | ⟶ 322 | 323 |
324 | 325 | **55. [Matrix operations, Multiplication, Other operations]** 326 | 327 | ⟶ 328 | 329 |
330 | 331 | **56. [Matrix properties, Norm, Eigenvalue/Eigenvector, Singular-value decomposition]** 332 | 333 | ⟶ 334 | 335 |
336 | 337 | **57. [Matrix calculus, Gradient, Hessian, Operations]** 338 | 339 | ⟶ 340 | -------------------------------------------------------------------------------- /zh/refresher-linear-algebra.md: -------------------------------------------------------------------------------- 1 | 1. **Linear Algebra and Calculus refresher** 2 | 3 | ⟶ 4 | 5 |
6 | 7 | 2. **General notations** 8 | 9 | ⟶ 10 | 11 |
12 | 13 | 3. **Definitions** 14 | 15 | ⟶ 16 | 17 |
18 | 19 | 4. **Vector ― We note x∈Rn a vector with n entries, where xi∈R is the ith entry:** 20 | 21 | ⟶ 22 | 23 |
24 | 25 | 5. **Matrix ― We note A∈Rm×n a matrix with m rows and n columns, where Ai,j∈R is the entry located in the ith row and jth column:** 26 | 27 | ⟶ 28 | 29 |
30 | 31 | 6. **Remark: the vector x defined above can be viewed as a n×1 matrix and is more particularly called a column-vector.** 32 | 33 | ⟶ 34 | 35 |
36 | 37 | 7. **Main matrices** 38 | 39 | ⟶ 40 | 41 |
42 | 43 | 8. **Identity matrix ― The identity matrix I∈Rn×n is a square matrix with ones in its diagonal and zero everywhere else:** 44 | 45 | ⟶ 46 | 47 |
48 | 49 | 9. **Remark: for all matrices A∈Rn×n, we have A×I=I×A=A.** 50 | 51 | ⟶ 52 | 53 |
54 | 55 | 10. **Diagonal matrix ― A diagonal matrix D∈Rn×n is a square matrix with nonzero values in its diagonal and zero everywhere else:** 56 | 57 | ⟶ 58 | 59 |
60 | 61 | 11. **Remark: we also note D as diag(d1,...,dn).** 62 | 63 | ⟶ 64 | 65 |
66 | 67 | 12. **Matrix operations** 68 | 69 | ⟶ 70 | 71 |
72 | 73 | 13. **Multiplication** 74 | 75 | ⟶ 76 | 77 |
78 | 79 | 14. **Vector-vector ― There are two types of vector-vector products:** 80 | 81 | ⟶ 82 | 83 |
84 | 85 | 15. **inner product: for x,y∈Rn, we have:** 86 | 87 | ⟶ 88 | 89 |
90 | 91 | 16. **outer product: for x∈Rm,y∈Rn, we have:** 92 | 93 | ⟶ 94 | 95 |
96 | 97 | 17. **Matrix-vector ― The product of matrix A∈Rm×n and vector x∈Rn is a vector of size Rm, such that:** 98 | 99 | ⟶ 100 | 101 |
102 | 103 | 18. **where aTr,i are the vector rows and ac,j are the vector columns of A, and xi are the entries of x.** 104 | 105 | ⟶ 106 | 107 |
108 | 109 | 19. **Matrix-matrix ― The product of matrices A∈Rm×n and B∈Rn×p is a matrix of size Rm×p, such that:** 110 | 111 | ⟶ 112 | 113 |
114 | 115 | 20. **where aTr,i,bTr,i are the vector rows and ac,j,bc,j are the vector columns of A and B respectively** 116 | 117 | ⟶ 118 | 119 |
120 | 121 | 21. **Other operations** 122 | 123 | ⟶ 124 | 125 |
126 | 127 | 22. **Transpose ― The transpose of a matrix A∈Rm×n, noted AT, is such that its entries are flipped:** 128 | 129 | ⟶ 130 | 131 |
132 | 133 | 23. **Remark: for matrices A,B, we have (AB)T=BTAT** 134 | 135 | ⟶ 136 | 137 |
138 | 139 | 24. **Inverse ― The inverse of an invertible square matrix A is noted A−1 and is the only matrix such that:** 140 | 141 | ⟶ 142 | 143 |
144 | 145 | 25. **Remark: not all square matrices are invertible. Also, for matrices A,B, we have (AB)−1=B−1A−1** 146 | 147 | ⟶ 148 | 149 |
150 | 151 | 26. **Trace ― The trace of a square matrix A, noted tr(A), is the sum of its diagonal entries:** 152 | 153 | ⟶ 154 | 155 |
156 | 157 | 27. **Remark: for matrices A,B, we have tr(AT)=tr(A) and tr(AB)=tr(BA)** 158 | 159 | ⟶ 160 | 161 |
162 | 163 | 28. **Determinant ― The determinant of a square matrix A∈Rn×n, noted |A| or det(A) is expressed recursively in terms of A∖i,∖j, which is the matrix A without its ith row and jth column, as follows:** 164 | 165 | ⟶ 166 | 167 |
168 | 169 | 29. **Remark: A is invertible if and only if |A|≠0. Also, |AB|=|A||B| and |AT|=|A|.** 170 | 171 | ⟶ 172 | 173 |
174 | 175 | 30. **Matrix properties** 176 | 177 | ⟶ 178 | 179 |
180 | 181 | 31. **Definitions** 182 | 183 | ⟶ 184 | 185 |
186 | 187 | 32. **Symmetric decomposition ― A given matrix A can be expressed in terms of its symmetric and antisymmetric parts as follows:** 188 | 189 | ⟶ 190 | 191 |
192 | 193 | 33. **[Symmetric, Antisymmetric]** 194 | 195 | ⟶ 196 | 197 |
198 | 199 | 34. **Norm ― A norm is a function N:V⟶[0,+∞[ where V is a vector space, and such that for all x,y∈V, we have:** 200 | 201 | ⟶ 202 | 203 |
204 | 205 | 35. **N(ax)=|a|N(x) for a scalar** 206 | 207 | ⟶ 208 | 209 |
210 | 211 | 36. **if N(x)=0, then x=0** 212 | 213 | ⟶ 214 | 215 |
216 | 217 | 37. **For x∈V, the most commonly used norms are summed up in the table below:** 218 | 219 | ⟶ 220 | 221 |
222 | 223 | 38. **[Norm, Notation, Definition, Use case]** 224 | 225 | ⟶ 226 | 227 |
228 | 229 | 39. **Linear dependence ― A set of vectors is said to be linearly dependent if one of the vectors in the set can be defined as a linear combination of the others.** 230 | 231 | ⟶ 232 | 233 |
234 | 235 | 40. **Remark: if no vector can be written this way, then the vectors are said to be linearly independent** 236 | 237 | ⟶ 238 | 239 |
240 | 241 | 41. **Matrix rank ― The rank of a given matrix A is noted rank(A) and is the dimension of the vector space generated by its columns. This is equivalent to the maximum number of linearly independent columns of A.** 242 | 243 | ⟶ 244 | 245 |
246 | 247 | 42. **Positive semi-definite matrix ― A matrix A∈Rn×n is positive semi-definite (PSD) and is noted A⪰0 if we have:** 248 | 249 | ⟶ 250 | 251 |
252 | 253 | 43. **Remark: similarly, a matrix A is said to be positive definite, and is noted A≻0, if it is a PSD matrix which satisfies, for all non-zero vectors x, xTAx>0.** 254 | 255 | ⟶ 256 | 257 |
258 | 259 | 44. **Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:** 260 | 261 | ⟶ 262 | 263 |
264 | 265 | 45. **Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:** 266 | 267 | ⟶ 268 | 269 |
270 | 271 | 46. **diagonal** 272 | 273 | ⟶ 274 | 275 |
276 | 277 | 47. **Singular-value decomposition ― For a given matrix A of dimensions m×n, the singular-value decomposition (SVD) is a factorization technique that guarantees the existence of U m×m unitary, Σ m×n diagonal and V n×n unitary matrices, such that:** 278 | 279 | ⟶ 280 | 281 |
282 | 283 | 48. **Matrix calculus** 284 | 285 | ⟶ 286 | 287 |
288 | 289 | 49. **Gradient ― Let f:Rm×n→R be a function and A∈Rm×n be a matrix. The gradient of f with respect to A is an m×n matrix, noted ∇Af(A), such that:** 290 | 291 | ⟶ 292 | 293 |
294 | 295 | 50. **Remark: the gradient of f is only defined when f is a function that returns a scalar.** 296 | 297 | ⟶ 298 | 299 |
300 | 301 | 51. **Hessian ― Let f:Rn→R be a function and x∈Rn be a vector. The Hessian of f with respect to x is an n×n symmetric matrix, noted ∇2xf(x), such that:** 302 | 303 | ⟶ 304 | 305 |
306 | 307 | 52. **Remark: the Hessian of f is only defined when f is a function that returns a scalar.** 308 | 309 | ⟶ 310 | 311 |
312 | 313 | 53. **Gradient operations ― For matrices A,B,C, the following gradient properties are worth having in mind:** 314 | 315 | ⟶ 316 | 317 |
318 | 319 | 54. **[General notations, Definitions, Main matrices]** 320 | 321 | ⟶ 322 | 323 |
324 | 325 | 55. **[Matrix operations, Multiplication, Other operations]** 326 | 327 | ⟶ 328 | 329 |
330 | 331 | 56. **[Matrix properties, Norm, Eigenvalue/Eigenvector, Singular-value decomposition]** 332 | 333 | ⟶ 334 | 335 |
336 | 337 | 57. **[Matrix calculus, Gradient, Hessian, Operations]** 338 | 339 | ⟶ 340 | -------------------------------------------------------------------------------- /template/refresher-linear-algebra.md: -------------------------------------------------------------------------------- 1 | **1. Linear Algebra and Calculus refresher** 2 | 3 | ⟶ 4 | 5 |
6 | 7 | **2. General notations** 8 | 9 | ⟶ 10 | 11 |
12 | 13 | **3. Definitions** 14 | 15 | ⟶ 16 | 17 |
18 | 19 | **4. Vector ― We note x∈Rn a vector with n entries, where xi∈R is the ith entry:** 20 | 21 | ⟶ 22 | 23 |
24 | 25 | **5. Matrix ― We note A∈Rm×n a matrix with m rows and n columns, where Ai,j∈R is the entry located in the ith row and jth column:** 26 | 27 | ⟶ 28 | 29 |
30 | 31 | **6. Remark: the vector x defined above can be viewed as a n×1 matrix and is more particularly called a column-vector.** 32 | 33 | ⟶ 34 | 35 |
36 | 37 | **7. Main matrices** 38 | 39 | ⟶ 40 | 41 |
42 | 43 | **8. Identity matrix ― The identity matrix I∈Rn×n is a square matrix with ones in its diagonal and zero everywhere else:** 44 | 45 | ⟶ 46 | 47 |
48 | 49 | **9. Remark: for all matrices A∈Rn×n, we have A×I=I×A=A.** 50 | 51 | ⟶ 52 | 53 |
54 | 55 | **10. Diagonal matrix ― A diagonal matrix D∈Rn×n is a square matrix with nonzero values in its diagonal and zero everywhere else:** 56 | 57 | ⟶ 58 | 59 |
60 | 61 | **11. Remark: we also note D as diag(d1,...,dn).** 62 | 63 | ⟶ 64 | 65 |
66 | 67 | **12. Matrix operations** 68 | 69 | ⟶ 70 | 71 |
72 | 73 | **13. Multiplication** 74 | 75 | ⟶ 76 | 77 |
78 | 79 | **14. Vector-vector ― There are two types of vector-vector products:** 80 | 81 | ⟶ 82 | 83 |
84 | 85 | **15. inner product: for x,y∈Rn, we have:** 86 | 87 | ⟶ 88 | 89 |
90 | 91 | **16. outer product: for x∈Rm,y∈Rn, we have:** 92 | 93 | ⟶ 94 | 95 |
96 | 97 | **17. Matrix-vector ― The product of matrix A∈Rm×n and vector x∈Rn is a vector of size Rm, such that:** 98 | 99 | ⟶ 100 | 101 |
102 | 103 | **18. where aTr,i are the vector rows and ac,j are the vector columns of A, and xi are the entries of x.** 104 | 105 | ⟶ 106 | 107 |
108 | 109 | **19. Matrix-matrix ― The product of matrices A∈Rm×n and B∈Rn×p is a matrix of size Rm×p, such that:** 110 | 111 | ⟶ 112 | 113 |
114 | 115 | **20. where aTr,i,bTr,i are the vector rows and ac,j,bc,j are the vector columns of A and B respectively** 116 | 117 | ⟶ 118 | 119 |
120 | 121 | **21. Other operations** 122 | 123 | ⟶ 124 | 125 |
126 | 127 | **22. Transpose ― The transpose of a matrix A∈Rm×n, noted AT, is such that its entries are flipped:** 128 | 129 | ⟶ 130 | 131 |
132 | 133 | **23. Remark: for matrices A,B, we have (AB)T=BTAT** 134 | 135 | ⟶ 136 | 137 |
138 | 139 | **24. Inverse ― The inverse of an invertible square matrix A is noted A−1 and is the only matrix such that:** 140 | 141 | ⟶ 142 | 143 |
144 | 145 | **25. Remark: not all square matrices are invertible. Also, for matrices A,B, we have (AB)−1=B−1A−1** 146 | 147 | ⟶ 148 | 149 |
150 | 151 | **26. Trace ― The trace of a square matrix A, noted tr(A), is the sum of its diagonal entries:** 152 | 153 | ⟶ 154 | 155 |
156 | 157 | **27. Remark: for matrices A,B, we have tr(AT)=tr(A) and tr(AB)=tr(BA)** 158 | 159 | ⟶ 160 | 161 |
162 | 163 | **28. Determinant ― The determinant of a square matrix A∈Rn×n, noted |A| or det(A) is expressed recursively in terms of A∖i,∖j, which is the matrix A without its ith row and jth column, as follows:** 164 | 165 | ⟶ 166 | 167 |
168 | 169 | **29. Remark: A is invertible if and only if |A|≠0. Also, |AB|=|A||B| and |AT|=|A|.** 170 | 171 | ⟶ 172 | 173 |
174 | 175 | **30. Matrix properties** 176 | 177 | ⟶ 178 | 179 |
180 | 181 | **31. Definitions** 182 | 183 | ⟶ 184 | 185 |
186 | 187 | **32. Symmetric decomposition ― A given matrix A can be expressed in terms of its symmetric and antisymmetric parts as follows:** 188 | 189 | ⟶ 190 | 191 |
192 | 193 | **33. [Symmetric, Antisymmetric]** 194 | 195 | ⟶ 196 | 197 |
198 | 199 | **34. Norm ― A norm is a function N:V⟶[0,+∞[ where V is a vector space, and such that for all x,y∈V, we have:** 200 | 201 | ⟶ 202 | 203 |
204 | 205 | **35. N(ax)=|a|N(x) for a scalar** 206 | 207 | ⟶ 208 | 209 |
210 | 211 | **36. if N(x)=0, then x=0** 212 | 213 | ⟶ 214 | 215 |
216 | 217 | **37. For x∈V, the most commonly used norms are summed up in the table below:** 218 | 219 | ⟶ 220 | 221 |
222 | 223 | **38. [Norm, Notation, Definition, Use case]** 224 | 225 | ⟶ 226 | 227 |
228 | 229 | **39. Linear dependence ― A set of vectors is said to be linearly dependent if one of the vectors in the set can be defined as a linear combination of the others.** 230 | 231 | ⟶ 232 | 233 |
234 | 235 | **40. Remark: if no vector can be written this way, then the vectors are said to be linearly independent** 236 | 237 | ⟶ 238 | 239 |
240 | 241 | **41. Matrix rank ― The rank of a given matrix A is noted rank(A) and is the dimension of the vector space generated by its columns. This is equivalent to the maximum number of linearly independent columns of A.** 242 | 243 | ⟶ 244 | 245 |
246 | 247 | **42. Positive semi-definite matrix ― A matrix A∈Rn×n is positive semi-definite (PSD) and is noted A⪰0 if we have:** 248 | 249 | ⟶ 250 | 251 |
252 | 253 | **43. Remark: similarly, a matrix A is said to be positive definite, and is noted A≻0, if it is a PSD matrix which satisfies, for all non-zero vectors x, xTAx>0.** 254 | 255 | ⟶ 256 | 257 |
258 | 259 | **44. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:** 260 | 261 | ⟶ 262 | 263 |
264 | 265 | **45. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:** 266 | 267 | ⟶ 268 | 269 |
270 | 271 | **46. diagonal** 272 | 273 | ⟶ 274 | 275 |
276 | 277 | **47. Singular-value decomposition ― For a given matrix A of dimensions m×n, the singular-value decomposition (SVD) is a factorization technique that guarantees the existence of U m×m unitary, Σ m×n diagonal and V n×n unitary matrices, such that:** 278 | 279 | ⟶ 280 | 281 |
282 | 283 | **48. Matrix calculus** 284 | 285 | ⟶ 286 | 287 |
288 | 289 | **49. Gradient ― Let f:Rm×n→R be a function and A∈Rm×n be a matrix. The gradient of f with respect to A is an m×n matrix, noted ∇Af(A), such that:** 290 | 291 | ⟶ 292 | 293 |
294 | 295 | **50. Remark: the gradient of f is only defined when f is a function that returns a scalar.** 296 | 297 | ⟶ 298 | 299 |
300 | 301 | **51. Hessian ― Let f:Rn→R be a function and x∈Rn be a vector. The Hessian of f with respect to x is an n×n symmetric matrix, noted ∇2xf(x), such that:** 302 | 303 | ⟶ 304 | 305 |
306 | 307 | **52. Remark: the Hessian of f is only defined when f is a function that returns a scalar.** 308 | 309 | ⟶ 310 | 311 |
312 | 313 | **53. Gradient operations ― For matrices A,B,C, the following gradient properties are worth having in mind:** 314 | 315 | ⟶ 316 | 317 |
318 | 319 | **54. [General notations, Definitions, Main matrices]** 320 | 321 | ⟶ 322 | 323 |
324 | 325 | **55. [Matrix operations, Multiplication, Other operations]** 326 | 327 | ⟶ 328 | 329 |
330 | 331 | **56. [Matrix properties, Norm, Eigenvalue/Eigenvector, Singular-value decomposition]** 332 | 333 | ⟶ 334 | 335 |
336 | 337 | **57. [Matrix calculus, Gradient, Hessian, Operations]** 338 | 339 | ⟶ 340 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Translation of VIP Cheatsheets 2 | ## Goal 3 | This repository aims at collaboratively translating our [Machine Learning](https://github.com/afshinea/stanford-cs-229-machine-learning) and [Deep Learning](https://github.com/afshinea/stanford-cs-230-deep-learning) cheatsheets into a ton of languages, so that this content can be enjoyed by anyone from any part of the world! 4 | 5 | ## Contribution guidelines 6 | The translation process of each cheatsheet contains two steps: 7 | - the **translation** step, where contributors follow a template of items to translate, 8 | - the **review** step, where contributors go through each expression translated by their peers, on top of which they add their suggestions and remarks. 9 | 10 | ### Translators 11 | 0. Check for [existing pull requests](https://github.com/shervinea/cheatsheet-translation/pulls) to see which cheatsheet is yet to be translated. 12 | 13 | 1. Fork the repository. 14 | 15 | 2. Copy the template of the cheatsheet you wish to translate (provided in the `template/` folder) into the language folder with a naming that follows the [ISO 639-1 notation](https://www.loc.gov/standards/iso639-2/php/code_list.php). 16 | 17 | 3. Translate anything you want by keeping the [following template](https://github.com/shervinea/cheatsheet-translation/tree/master/template): 18 | > 34. **English blabla** 19 | > 20 | > ⟶ Translated blabla 21 | 22 | 4. Commit the changes to your forked repository. 23 | 24 | 5. Submit a [pull request](https://help.github.com/articles/creating-a-pull-request/) and call it `[code of language name] Topic name`. For example, a translation in Spanish of the deep learning cheatsheet will be called `[es] Deep learning`. 
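For translators who want to check their progress before opening a pull request, a small helper script can count how many items of a copied template still have an empty `⟶` line. This helper is hypothetical (it is not part of this repository) and only assumes the template format shown above:

```python
def count_untranslated(text: str) -> tuple[int, int]:
    """Return (untranslated, total) items in a cheatsheet template.

    Each item in a template is an English sentence followed by a line
    starting with the arrow marker; an item counts as untranslated
    when nothing follows the arrow.
    """
    total = 0
    untranslated = 0
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("⟶"):
            total += 1
            if stripped == "⟶":
                untranslated += 1
    return untranslated, total


# Hypothetical excerpt of a partially translated template:
sample = """**1. Unsupervised Learning cheatsheet**

⟶ 无监督学习速查表

**2. Introduction to Unsupervised Learning**

⟶
"""
print(count_untranslated(sample))  # (1, 2)
```

Running it on a freshly copied template reports every item as untranslated; the first count drops to zero once the translation is complete.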
25 | 26 | ### Reviewers 27 | 1. Go to the [list of pull requests](https://github.com/shervinea/cheatsheet-translation/pulls) and filter them by your native language (e.g. `[es]` for Spanish, `[zh]` for Mandarin Chinese). 28 | 29 | 2. Locate pull requests where help is needed. Those contain the tag `reviewer wanted`. 30 | 31 | 3. Review the content line by line and add comments and suggestions when necessary. 32 | 33 | ### Important note 34 | Please make sure to propose the translation of **only one** cheatsheet per pull request -- it greatly simplifies the review process. 35 | 36 | 37 | ## Progression for CS 230 (Deep Learning) 38 | |Cheatsheet topic|Español|فارسی|Français|日本語|Português|中文| 39 | |:---|:---:|:---:|:---:|:---:|:---:|:---:| 40 | |Convolutional Neural Nets|not started|done|done|not started|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/128)|not started| 41 | |Recurrent Neural Nets|not started|done|done|not started|not started|not started| 42 | |DL tips and tricks|not started|done|done|not started|not started|not started| 43 | 44 | |Cheatsheet topic|العَرَبِيَّة|עִבְרִית|हिन्दी|Türkçe|Русский|Italiano| 45 | |:---|:---:|:---:|:---:|:---:|:---:|:---:| 46 | |Convolutional Neural Nets|not started|not started|not started|done|not started|not started| 47 | |Recurrent Neural Nets|not started|not started|not started|done|not started|not started| 48 | |DL tips and tricks|not started|not started|not started|done|not started|not started| 49 | 50 | |Cheatsheet topic|Polski|Suomi|Català|Українська|한국어| 51 | |:---|:---:|:---:|:---:|:---:|:---:| 52 | |Convolutional Neural Nets|not started|not started|not started|not started|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/109)| 53 | |Recurrent Neural Nets|not started|not started|not started|not started|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/107)| 54 | |DL tips and tricks|not started|not started|not started|not started|[in 
progress](https://github.com/shervinea/cheatsheet-translation/pull/108)| 55 | 56 | ## Progression for CS 229 (Machine Learning) 57 | |Cheatsheet topic|Español|فارسی|Français|日本語|Português|中文| 58 | |:---|:---:|:---:|:---:|:---:|:---:|:---:| 59 | |Deep learning|done|done|done|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/96)|done|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/12)| 60 | |Supervised learning|done|done|done|not started|done|done| 61 | |Unsupervised learning|done|done|done|not started|done|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/48)| 62 | |ML tips and tricks|done|done|done|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/99)|done|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/7)| 63 | |Probabilities and Statistics|done|done|done|not started|done|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/73)| 64 | |Linear algebra|done|done|done|not started|done|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/72)| 65 | 66 | |Cheatsheet topic|العَرَبِيَّة|עִבְרִית|हिन्दी|Türkçe|Русский|Italiano| 67 | |:---|:---:|:---:|:---:|:---:|:---:|:---:| 68 | |Deep learning|done|not started|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/37)|done|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/21)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/78)| 69 | |Supervised learning|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/87)|not started|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/46)|done|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/21)|not started| 70 | |Unsupervised learning|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/88)|not started|not started|done|not started|not started| 71 | |ML tips and tricks|[in 
progress](https://github.com/shervinea/cheatsheet-translation/pull/83)|not started|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/40)|done|not started|not started| 72 | |Probabilities and Statistics|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/89)|not started|not started|done|not started|not started| 73 | |Linear algebra|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/85)|not started|not started|done|not started|not started| 74 | 75 | 76 | |Cheatsheet topic|Polski|Suomi|Català|Українська|한국어| 77 | |:---|:---:|:---:|:---:|:---:|:---:| 78 | |Deep learning|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/8)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/34)|not started|not started|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/80)| 79 | |Supervised learning|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/8)|not started|not started|not started|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/90)| 80 | |Unsupervised learning|not started|not started|not started|not started|done| 81 | |ML tips and tricks|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/8)|not started|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/47)|not started|done| 82 | |Probabilities and Statistics|not started|not started|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/47)|done|done| 83 | |Linear algebra|not started|not started|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/47)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/95)|done| 84 | 85 | 86 | |Cheatsheet topic|Magyar|Deutsch| 87 | |:---|:---:|:---:| 88 | |Deep learning|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/124/files#diff-5e2ba65ef08acd57024e82d0ae94b923)|[in 
progress](https://github.com/shervinea/cheatsheet-translation/pull/106)| 89 | |Supervised learning|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/124/files#diff-5e2ba65ef08acd57024e82d0ae94b923)|not started| 90 | |Unsupervised learning|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/124/files#diff-5e2ba65ef08acd57024e82d0ae94b923)|not started| 91 | |ML tips and tricks|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/124/files#diff-5e2ba65ef08acd57024e82d0ae94b923)|not started| 92 | |Probabilities and Statistics|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/124/files#diff-5e2ba65ef08acd57024e82d0ae94b923)|not started| 93 | |Linear algebra|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/124/files#diff-5e2ba65ef08acd57024e82d0ae94b923)|not started| 94 | 95 | 96 | ## Acknowledgements 97 | Thank you everyone for your help! Please do not forget to add your name to the `CONTRIBUTORS` file so that we can give you proper credit in the cheatsheets' [official website](https://stanford.edu/~shervine/teaching). 98 | -------------------------------------------------------------------------------- /zh/cheatsheet-unsupervised-learning.md: -------------------------------------------------------------------------------- 1 | 1. **Unsupervised Learning cheatsheet** 2 | 3 | ⟶ 4 | 5 |
6 | 7 | 2. **Introduction to Unsupervised Learning** 8 | 9 | ⟶ 10 | 11 |
12 | 13 | 3. **Motivation ― The goal of unsupervised learning is to find hidden patterns in unlabeled data {x(1),...,x(m)}.** 14 | 15 | ⟶ 16 | 17 |
18 | 19 | 4. **Jensen's inequality ― Let f be a convex function and X a random variable. We have the following inequality:** 20 | 21 | ⟶ 22 | 23 |
24 | 25 | 5. **Clustering** 26 | 27 | ⟶ 28 | 29 |
30 | 31 | 6. **Expectation-Maximization** 32 | 33 | ⟶ 34 | 35 |
36 | 37 | 7. **Latent variables ― Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted z. Here are the most common settings where there are latent variables:** 38 | 39 | ⟶ 40 | 41 |
42 | 43 | 8. **[Setting, Latent variable z, Comments]** 44 | 45 | ⟶ 46 | 47 |
48 | 49 | 9. **[Mixture of k Gaussians, Factor analysis]** 50 | 51 | ⟶ 52 | 53 |
54 | 55 | 10. **Algorithm ― The Expectation-Maximization (EM) algorithm gives an efficient method for estimating the parameter θ through maximum likelihood estimation by repeatedly constructing a lower bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:** 56 | 57 | ⟶ 58 | 59 |
60 | 61 | 11. **E-step: Evaluate the posterior probability Qi(z(i)) that each data point x(i) came from a particular cluster z(i) as follows:** 62 | 63 | ⟶ 64 | 65 |
66 | 67 | 12. **M-step: Use the posterior probabilities Qi(z(i)) as cluster specific weights on data points x(i) to separately re-estimate each cluster model as follows:** 68 | 69 | ⟶ 70 | 71 |
72 | 73 | 13. **[Gaussians initialization, Expectation step, Maximization step, Convergence]** 74 | 75 | ⟶ 76 | 77 |
78 | 79 | 14. **k-means clustering** 80 | 81 | ⟶ 82 | 83 |
84 | 85 | 15. **We note c(i) the cluster of data point i and μj the center of cluster j.** 86 | 87 | ⟶ 88 | 89 |
90 | 91 | 16. **Algorithm ― After randomly initializing the cluster centroids μ1,μ2,...,μk∈Rn, the k-means algorithm repeats the following step until convergence:** 92 | 93 | ⟶ 94 | 95 |
96 | 97 | 17. **[Means initialization, Cluster assignment, Means update, Convergence]** 98 | 99 | ⟶ 100 | 101 |
102 | 103 | 18. **Distortion function ― In order to see if the algorithm converges, we look at the distortion function defined as follows:** 104 | 105 | ⟶ 106 | 107 |
108 | 109 | 19. **Hierarchical clustering** 110 | 111 | ⟶ 112 | 113 |
114 | 115 | 20. **Algorithm ― It is a clustering algorithm with an agglomerative hierarchical approach that builds nested clusters in a successive manner.** 116 | 117 | ⟶ 118 | 119 |
120 | 121 | 21. **Types ― There are different sorts of hierarchical clustering algorithms that aim at optimizing different objective functions, which are summed up in the table below:** 122 | 123 | ⟶ 124 | 125 |
126 | 127 | 22. **[Ward linkage, Average linkage, Complete linkage]** 128 | 129 | ⟶ 130 | 131 |
132 | 133 | 23. **[Minimize within cluster distance, Minimize average distance between cluster pairs, Minimize maximum distance between cluster pairs]** 134 | 135 | ⟶ 136 | 137 |
138 | 139 | 24. **Clustering assessment metrics** 140 | 141 | ⟶ 142 | 143 |
144 | 145 | 25. **In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground truth labels as was the case in the supervised learning setting.** 146 | 147 | ⟶ 148 | 149 |
150 | 151 | 26. **Silhouette coefficient ― By noting a and b the mean distance between a sample and all other points in the same class, and between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single sample is defined as follows:** 152 | 153 | ⟶ 154 | 155 |
156 | 157 | 27. **Calinski-Harabaz index ― By noting k the number of clusters, Bk and Wk the between and within-clustering dispersion matrices respectively defined as** 158 | 159 | ⟶ 160 | 161 |
162 | 163 | 28. **the Calinski-Harabaz index s(k) indicates how well a clustering model defines its clusters, such that the higher the score, the more dense and well separated the clusters are. It is defined as follows:** 164 | 165 | ⟶ 166 | 167 |
168 | 169 | 29. **Dimension reduction** 170 | 171 | ⟶ 172 | 173 |
174 | 175 | 30. **Principal component analysis** 176 | 177 | ⟶ 178 | 179 |
180 | 181 | 31. **It is a dimension reduction technique that finds the variance maximizing directions onto which to project the data.** 182 | 183 | ⟶ 184 | 185 |
186 | 187 | 32. **Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:** 188 | 189 | ⟶ 190 | 191 |
192 | 193 | 33. **Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:** 194 | 195 | ⟶ 196 | 197 |
198 | 199 | 34. **diagonal** 200 | 201 | ⟶ 202 | 203 |
204 | 205 | 35. **Remark: the eigenvector associated with the largest eigenvalue is called principal eigenvector of matrix A.** 206 | 207 | ⟶ 208 | 209 |
210 | 211 | 36. **Algorithm ― The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k dimensions by maximizing the variance of the data as follows:** 212 | 213 | ⟶ 214 | 215 |
216 | 217 | 37. **Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.** 218 | 219 | ⟶ 220 | 221 |
222 | 223 | 38. **Step 2: Compute Σ=1mm∑i=1x(i)x(i)T∈Rn×n, which is symmetric with real eigenvalues.** 224 | 225 | ⟶ 226 | 227 |
228 | 229 | 39. **Step 3: Compute u1,...,uk∈Rn the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors of the k largest eigenvalues.** 230 | 231 | ⟶ 232 | 233 |
234 | 235 | 40. **Step 4: Project the data on spanR(u1,...,uk).** 236 | 237 | ⟶ 238 | 239 |
240 | 241 | 41. **This procedure maximizes the variance among all k-dimensional spaces.** 242 | 243 | ⟶ 244 | 245 |
246 | 247 | 42. **[Data in feature space, Find principal components, Data in principal components space]** 248 | 249 | ⟶ 250 | 251 |
252 | 253 | 43. **Independent component analysis** 254 | 255 | ⟶ 256 | 257 |
258 | 259 | 44. **It is a technique meant to find the underlying generating sources.** 260 | 261 | ⟶ 262 | 263 |
264 | 265 | 45. **Assumptions ― We assume that our data x has been generated by the n-dimensional source vector s=(s1,...,sn), where si are independent random variables, via a mixing and non-singular matrix A as follows:** 266 | 267 | ⟶ 268 | 269 |
270 | 271 | 46. **The goal is to find the unmixing matrix W=A−1.** 272 | 273 | ⟶ 274 | 275 |
276 | 277 | 47. **Bell and Sejnowski ICA algorithm ― This algorithm finds the unmixing matrix W by following the steps below:** 278 | 279 | ⟶ 280 | 281 |
282 | 283 | 48. **Write the probability of x=As=W−1s as:** 284 | 285 | ⟶ 286 | 287 |
288 | 289 | 49. **Write the log likelihood given our training data {x(i),i∈[[1,m]]} and by noting g the sigmoid function as:** 290 | 291 | ⟶ 292 | 293 |
294 | 295 | 50. **Therefore, the stochastic gradient ascent learning rule is such that for each training example x(i), we update W as follows:** 296 | 297 | ⟶ 298 | 299 |
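The Bell and Sejnowski procedure of items 47–50 can be sketched as follows; the Laplace-distributed sources, the mixing matrix A and the learning rate are illustrative choices, and the sigmoid-based update W ← W + α((1−2g(Wx(i)))x(i)ᵀ + (Wᵀ)⁻¹) is the stochastic gradient ascent rule referenced in item 50:

```python
import numpy as np

# Bell & Sejnowski ICA sketch: stochastic gradient ascent on the
# log-likelihood with sigmoid g, one update per training example x(i).
rng = np.random.default_rng(4)
m = 2000
s = rng.laplace(size=(m, 2))             # independent source signals
A = np.array([[2.0, 1.0], [1.0, 2.0]])   # non-singular mixing matrix
x = s @ A.T                              # observed data x(i) = A s(i)

def g(z):
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid

W = np.eye(2)                            # unmixing matrix estimate
alpha = 0.001                            # learning rate (illustrative)
for i in range(m):                       # one SGD pass over the data
    xi = x[i]
    W += alpha * (np.outer(1 - 2 * g(W @ xi), xi) + np.linalg.inv(W.T))

print(np.round(W, 3))
```

With more passes and a tuned step size, WA approaches a scaled permutation of the identity, i.e. W recovers A−1 up to the usual ICA scale and order ambiguities.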
300 | 301 | 51. **The Machine Learning cheatsheets are now available in Mandarin.** 302 | 303 | ⟶ 304 | 305 |
306 | 307 | 52. **Original authors** 308 | 309 | ⟶ 310 | 311 |
312 | 313 | 53. **Translated by X, Y and Z** 314 | 315 | ⟶ 316 | 317 |
318 | 319 | 54. **Reviewed by X, Y and Z** 320 | 321 | ⟶ 322 | 323 |
324 | 325 | 55. **[Introduction, Motivation, Jensen's inequality]** 326 | 327 | ⟶ 328 | 329 |
330 | 331 | 56. **[Clustering, Expectation-Maximization, k-means, Hierarchical clustering, Metrics]** 332 | 333 | ⟶ 334 | 335 |
336 | 337 | 57. **[Dimension reduction, PCA, ICA]** 338 | 339 | ⟶ 340 | -------------------------------------------------------------------------------- /ar/cheatsheet-unsupervised-learning.md: -------------------------------------------------------------------------------- 1 | **1. Unsupervised Learning cheatsheet** 2 | 3 | ⟶ 4 | 5 |
6 | 7 | **2. Introduction to Unsupervised Learning** 8 | 9 | ⟶ 10 | 11 |
12 | 13 | **3. Motivation ― The goal of unsupervised learning is to find hidden patterns in unlabeled data {x(1),...,x(m)}.** 14 | 15 | ⟶ 16 | 17 |
18 | 19 | **4. Jensen's inequality ― Let f be a convex function and X a random variable. We have the following inequality:** 20 | 21 | ⟶ 22 | 23 |
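Jensen's inequality E[f(X)] ≥ f(E[X]) from item 4 can be checked numerically; the convex function f(x)=x² and the uniform sample points are illustrative:

```python
# Numerical check of Jensen's inequality E[f(X)] >= f(E[X]) for convex f.
xs = [-2.0, -1.0, 0.0, 3.0, 5.0]        # equally likely outcomes of X

def f(x):
    return x ** 2                        # a convex function

e_x = sum(xs) / len(xs)                  # E[X]  = 1.0
e_fx = sum(f(x) for x in xs) / len(xs)   # E[f(X)] = 7.8

assert e_fx >= f(e_x)                    # Jensen: E[f(X)] >= f(E[X])
print(e_fx, f(e_x))
```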
24 | 25 | **5. Clustering** 26 | 27 | ⟶ 28 | 29 |
30 | 31 | **6. Expectation-Maximization** 32 | 33 | ⟶ 34 | 35 |
36 | 37 | **7. Latent variables ― Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted z. Here are the most common settings where there are latent variables:** 38 | 39 | ⟶ 40 | 41 |
42 | 43 | **8. [Setting, Latent variable z, Comments]** 44 | 45 | ⟶ 46 | 47 |
48 | 49 | **9. [Mixture of k Gaussians, Factor analysis]** 50 | 51 | ⟶ 52 | 53 |
54 | 55 | **10. Algorithm ― The Expectation-Maximization (EM) algorithm gives an efficient method for estimating the parameter θ through maximum likelihood estimation by repeatedly constructing a lower bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:** 56 | 57 | ⟶ 58 | 59 |
60 | 61 | **11. E-step: Evaluate the posterior probability Qi(z(i)) that each data point x(i) came from a particular cluster z(i) as follows:** 62 | 63 | ⟶ 64 | 65 |
66 | 67 | **12. M-step: Use the posterior probabilities Qi(z(i)) as cluster specific weights on data points x(i) to separately re-estimate each cluster model as follows:** 68 | 69 | ⟶ 70 | 71 |
72 | 73 | **13. [Gaussians initialization, Expectation step, Maximization step, Convergence]** 74 | 75 | ⟶ 76 | 77 |
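The E-step and M-step of items 10–12 can be sketched for a mixture of two 1-D Gaussians; the data, the initialization and the iteration count are illustrative:

```python
import numpy as np

# EM sketch for a mixture of two 1-D Gaussians.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3, 1, 200), rng.normal(3, 1, 200)])

phi = np.array([0.5, 0.5])   # mixture weights
mu = np.array([-1.0, 1.0])   # means (illustrative init)
var = np.array([1.0, 1.0])   # variances

def gaussian(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(50):
    # E-step: posterior probability Q_i(z) that x(i) came from each cluster
    q = phi * gaussian(x[:, None], mu, var)      # shape (m, 2)
    q /= q.sum(axis=1, keepdims=True)
    # M-step: use the posteriors as weights to re-estimate each cluster
    nk = q.sum(axis=0)
    phi = nk / len(x)
    mu = (q * x[:, None]).sum(axis=0) / nk
    var = (q * (x[:, None] - mu) ** 2).sum(axis=0) / nk

print(np.sort(mu))
```

On this well-separated data the estimated means settle near the true values −3 and 3 within a few iterations.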
78 | 79 | **14. k-means clustering** 80 | 81 | ⟶ 82 | 83 |
84 | 85 | **15. We note c(i) the cluster of data point i and μj the center of cluster j.** 86 | 87 | ⟶ 88 | 89 |
90 | 91 | **16. Algorithm ― After randomly initializing the cluster centroids μ1,μ2,...,μk∈Rn, the k-means algorithm repeats the following step until convergence:** 92 | 93 | ⟶ 94 | 95 |
96 | 97 | **17. [Means initialization, Cluster assignment, Means update, Convergence]** 98 | 99 | ⟶ 100 | 101 |
102 | 103 | **18. Distortion function ― In order to see if the algorithm converges, we look at the distortion function defined as follows:** 104 | 105 | ⟶ 106 | 107 |
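The k-means loop of item 16 together with the distortion function of item 18 can be sketched as follows; the two-blob data and the deterministic centroid initialization are illustrative:

```python
import numpy as np

# k-means sketch: alternate cluster assignment and means update,
# then evaluate the distortion J.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, .5, (50, 2)), rng.normal(5, .5, (50, 2))])
mu = X[[0, 50]]                          # one seed point per blob (illustrative)

for _ in range(10):
    d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)  # squared distances
    c = d.argmin(axis=1)                 # cluster assignment step
    mu = np.array([X[c == j].mean(axis=0) for j in range(2)])  # means update

J = ((X - mu[c]) ** 2).sum()             # distortion function
print(J)
```

A production implementation would also guard against empty clusters and random restarts; the deterministic initialization here sidesteps both for the sake of the sketch.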
108 | 109 | **19. Hierarchical clustering** 110 | 111 | ⟶ 112 | 113 |
114 | 115 | **20. Algorithm ― It is a clustering algorithm with an agglomerative hierarchical approach that builds nested clusters in a successive manner.** 116 | 117 | ⟶ 118 | 119 |
120 | 121 | **21. Types ― There are different sorts of hierarchical clustering algorithms that aim at optimizing different objective functions, as summed up in the table below:** 122 | 123 | ⟶ 124 | 125 |
126 | 127 | **22. [Ward linkage, Average linkage, Complete linkage]** 128 | 129 | ⟶ 130 | 131 |
132 | 133 | **23. [Minimize within-cluster distance, Minimize average distance between cluster pairs, Minimize maximum distance between cluster pairs]** 134 | 135 | ⟶ 136 | 137 |
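The agglomerative procedure of items 20–23 can be sketched with complete linkage; the toy points and the stopping criterion of two clusters are illustrative:

```python
import numpy as np

# Agglomerative clustering sketch with complete linkage: start from
# singleton clusters and successively merge the pair whose maximum
# pairwise distance is smallest.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [9.0, 0.0]])
clusters = [[i] for i in range(len(X))]

def complete_linkage(a, b):
    # maximum distance between members of clusters a and b
    return max(np.linalg.norm(X[i] - X[j]) for i in a for j in b)

while len(clusters) > 2:                 # stop once two nested clusters remain
    pairs = [(complete_linkage(a, b), ia, ib)
             for ia, a in enumerate(clusters)
             for ib, b in enumerate(clusters) if ia < ib]
    _, ia, ib = min(pairs)               # closest pair under the linkage
    clusters[ia] = clusters[ia] + clusters[ib]
    del clusters[ib]

print(sorted(map(sorted, clusters)))
```

Swapping `complete_linkage` for an average-distance or Ward criterion gives the other variants in the table.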
138 | 139 | **24. Clustering assessment metrics** 140 | 141 | ⟶ 142 | 143 |
144 | 145 | **25. In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground truth labels as was the case in the supervised learning setting.** 146 | 147 | ⟶ 148 | 149 |
150 | 151 | **26. Silhouette coefficient ― By noting a and b the mean distance between a sample and all other points in the same class, and between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single sample is defined as follows:** 152 | 153 | ⟶ 154 | 155 |
156 | 157 | **27. Calinski-Harabaz index ― By noting k the number of clusters, Bk and Wk the between- and within-cluster dispersion matrices respectively defined as** 158 | 159 | ⟶ 160 | 161 |
162 | 163 | **28. the Calinski-Harabaz index s(k) indicates how well a clustering model defines its clusters, such that the higher the score, the more dense and well separated the clusters are. It is defined as follows:** 164 | 165 | ⟶ 166 | 167 |
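The score of items 27–28 can be sketched using the traces of the two dispersion matrices; the data and labels below are illustrative:

```python
import numpy as np

# Calinski-Harabaz-style score for a labeled clustering: ratio of
# between- to within-cluster dispersion (traces), scaled by (m-k)/(k-1).
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, .5, (50, 2)), rng.normal(4, .5, (50, 2))])
labels = np.array([0] * 50 + [1] * 50)

def ch_index(X, labels):
    m, k = len(X), len(set(labels))
    overall = X.mean(axis=0)
    Bk = sum((labels == j).sum() * ((X[labels == j].mean(0) - overall) ** 2).sum()
             for j in range(k))          # between-cluster dispersion (trace)
    Wk = sum(((X[labels == j] - X[labels == j].mean(0)) ** 2).sum()
             for j in range(k))          # within-cluster dispersion (trace)
    return (Bk / Wk) * (m - k) / (k - 1)

good = ch_index(X, labels)
bad = ch_index(X, rng.permutation(labels))  # shuffled labels score lower
print(good > bad)
```

Dense, well-separated clusters drive Bk up and Wk down, so the correct labeling scores far higher than a random one.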
168 | 169 | **29. Dimension reduction** 170 | 171 | ⟶ 172 | 173 |
174 | 175 | **30. Principal component analysis** 176 | 177 | ⟶ 178 | 179 |
180 | 181 | **31. It is a dimension reduction technique that finds the variance maximizing directions onto which to project the data.** 182 | 183 | ⟶ 184 | 185 |
186 | 187 | **32. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:** 188 | 189 | ⟶ 190 | 191 |
192 | 193 | **33. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:** 194 | 195 | ⟶ 196 | 197 |
198 | 199 | **34. diagonal** 200 | 201 | ⟶ 202 | 203 |
204 | 205 | **35. Remark: the eigenvector associated with the largest eigenvalue is called the principal eigenvector of matrix A.** 206 | 207 | ⟶ 208 | 209 |
210 | 211 | **36. Algorithm ― The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k 212 | dimensions by maximizing the variance of the data as follows:** 213 | 214 | ⟶ 215 | 216 |
217 | 218 | **37. Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.** 219 | 220 | ⟶ 221 | 222 |
223 | 224 | **38. Step 2: Compute Σ=(1/m)∑mi=1x(i)x(i)T∈Rn×n, which is symmetric with real eigenvalues.** 225 | 226 | ⟶ 227 | 228 |
229 | 230 | **39. Step 3: Compute u1,...,uk∈Rn the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors associated with the k largest eigenvalues.** 231 | 232 | ⟶ 233 | 234 |
235 | 236 | **40. Step 4: Project the data on spanR(u1,...,uk).** 237 | 238 | ⟶ 239 | 240 |
241 | 242 | **41. This procedure maximizes the variance among all k-dimensional spaces.** 243 | 244 | ⟶ 245 | 246 |
247 | 248 | **42. [Data in feature space, Find principal components, Data in principal components space]** 249 | 250 | ⟶ 251 | 252 |
253 | 254 | **43. Independent component analysis** 255 | 256 | ⟶ 257 | 258 |
259 | 260 | **44. It is a technique meant to find the underlying generating sources.** 261 | 262 | ⟶ 263 | 264 |
265 | 266 | **45. Assumptions ― We assume that our data x has been generated by the n-dimensional source vector s=(s1,...,sn), where si are independent random variables, via a non-singular mixing matrix A as follows:** 267 | 268 | ⟶ 269 | 270 |
271 | 272 | **46. The goal is to find the unmixing matrix W=A−1.** 273 | 274 | ⟶ 275 | 276 |
277 | 278 | **47. Bell and Sejnowski ICA algorithm ― This algorithm finds the unmixing matrix W by following the steps below:** 279 | 280 | ⟶ 281 | 282 |
283 | 284 | **48. Write the probability of x=As=W−1s as:** 285 | 286 | ⟶ 287 | 288 |
289 | 290 | **49. Write the log likelihood given our training data {x(i),i∈[[1,m]]} and by noting g the sigmoid function as:** 291 | 292 | ⟶ 293 | 294 |
295 | 296 | **50. Therefore, the stochastic gradient ascent learning rule is such that for each training example x(i), we update W as follows:** 297 | 298 | ⟶ 299 | 300 |
301 | 302 | **51. The Machine Learning cheatsheets are now available in Arabic.** 303 | 304 | ⟶ 305 | 306 |
307 | 308 | **52. Original authors** 309 | 310 | ⟶ 311 | 312 |
313 | 314 | **53. Translated by X, Y and Z** 315 | 316 | ⟶ 317 | 318 |
319 | 320 | **54. Reviewed by X, Y and Z** 321 | 322 | ⟶ 323 | 324 |
325 | 326 | **55. [Introduction, Motivation, Jensen's inequality]** 327 | 328 | ⟶ 329 | 330 |
331 | 332 | **56. [Clustering, Expectation-Maximization, k-means, Hierarchical clustering, Metrics]** 333 | 334 | ⟶ 335 | 336 |
337 | 338 | **57. [Dimension reduction, PCA, ICA]** 339 | 340 | ⟶ 341 | -------------------------------------------------------------------------------- /de/cheatsheet-unsupervised-learning.md: -------------------------------------------------------------------------------- 1 | **1. Unsupervised Learning cheatsheet** 2 | 3 | ⟶ 4 | 5 |
6 | 7 | **2. Introduction to Unsupervised Learning** 8 | 9 | ⟶ 10 | 11 |
12 | 13 | **3. Motivation ― The goal of unsupervised learning is to find hidden patterns in unlabeled data {x(1),...,x(m)}.** 14 | 15 | ⟶ 16 | 17 |
18 | 19 | **4. Jensen's inequality ― Let f be a convex function and X a random variable. We have the following inequality:** 20 | 21 | ⟶ 22 | 23 |
24 | 25 | **5. Clustering** 26 | 27 | ⟶ 28 | 29 |
30 | 31 | **6. Expectation-Maximization** 32 | 33 | ⟶ 34 | 35 |
36 | 37 | **7. Latent variables ― Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted z. Here are the most common settings where there are latent variables:** 38 | 39 | ⟶ 40 | 41 |
42 | 43 | **8. [Setting, Latent variable z, Comments]** 44 | 45 | ⟶ 46 | 47 |
48 | 49 | **9. [Mixture of k Gaussians, Factor analysis]** 50 | 51 | ⟶ 52 | 53 |
54 | 55 | **10. Algorithm ― The Expectation-Maximization (EM) algorithm gives an efficient method for estimating the parameter θ through maximum likelihood estimation by repeatedly constructing a lower bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:** 56 | 57 | ⟶ 58 | 59 |
60 | 61 | **11. E-step: Evaluate the posterior probability Qi(z(i)) that each data point x(i) came from a particular cluster z(i) as follows:** 62 | 63 | ⟶ 64 | 65 |
66 | 67 | **12. M-step: Use the posterior probabilities Qi(z(i)) as cluster specific weights on data points x(i) to separately re-estimate each cluster model as follows:** 68 | 69 | ⟶ 70 | 71 |
72 | 73 | **13. [Gaussians initialization, Expectation step, Maximization step, Convergence]** 74 | 75 | ⟶ 76 | 77 |
78 | 79 | **14. k-means clustering** 80 | 81 | ⟶ 82 | 83 |
84 | 85 | **15. We note c(i) the cluster of data point i and μj the center of cluster j.** 86 | 87 | ⟶ 88 | 89 |
90 | 91 | **16. Algorithm ― After randomly initializing the cluster centroids μ1,μ2,...,μk∈Rn, the k-means algorithm repeats the following step until convergence:** 92 | 93 | ⟶ 94 | 95 |
96 | 97 | **17. [Means initialization, Cluster assignment, Means update, Convergence]** 98 | 99 | ⟶ 100 | 101 |
102 | 103 | **18. Distortion function ― In order to see if the algorithm converges, we look at the distortion function defined as follows:** 104 | 105 | ⟶ 106 | 107 |
108 | 109 | **19. Hierarchical clustering** 110 | 111 | ⟶ 112 | 113 |
114 | 115 | **20. Algorithm ― It is a clustering algorithm with an agglomerative hierarchical approach that builds nested clusters in a successive manner.** 116 | 117 | ⟶ 118 | 119 |
120 | 121 | **21. Types ― There are different sorts of hierarchical clustering algorithms that aim at optimizing different objective functions, as summed up in the table below:** 122 | 123 | ⟶ 124 | 125 |
126 | 127 | **22. [Ward linkage, Average linkage, Complete linkage]** 128 | 129 | ⟶ 130 | 131 |
132 | 133 | **23. [Minimize within-cluster distance, Minimize average distance between cluster pairs, Minimize maximum distance between cluster pairs]** 134 | 135 | ⟶ 136 | 137 |
138 | 139 | **24. Clustering assessment metrics** 140 | 141 | ⟶ 142 | 143 |
144 | 145 | **25. In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground truth labels as was the case in the supervised learning setting.** 146 | 147 | ⟶ 148 | 149 |
150 | 151 | **26. Silhouette coefficient ― By noting a and b the mean distance between a sample and all other points in the same class, and between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single sample is defined as follows:** 152 | 153 | ⟶ 154 | 155 |
156 | 157 | **27. Calinski-Harabaz index ― By noting k the number of clusters, Bk and Wk the between- and within-cluster dispersion matrices respectively defined as** 158 | 159 | ⟶ 160 | 161 |
162 | 163 | **28. the Calinski-Harabaz index s(k) indicates how well a clustering model defines its clusters, such that the higher the score, the more dense and well separated the clusters are. It is defined as follows:** 164 | 165 | ⟶ 166 | 167 |
168 | 169 | **29. Dimension reduction** 170 | 171 | ⟶ 172 | 173 |
174 | 175 | **30. Principal component analysis** 176 | 177 | ⟶ 178 | 179 |
180 | 181 | **31. It is a dimension reduction technique that finds the variance maximizing directions onto which to project the data.** 182 | 183 | ⟶ 184 | 185 |
186 | 187 | **32. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:** 188 | 189 | ⟶ 190 | 191 |
192 | 193 | **33. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:** 194 | 195 | ⟶ 196 | 197 |
198 | 199 | **34. diagonal** 200 | 201 | ⟶ 202 | 203 |
204 | 205 | **35. Remark: the eigenvector associated with the largest eigenvalue is called the principal eigenvector of matrix A.** 206 | 207 | ⟶ 208 | 209 |
210 | 211 | **36. Algorithm ― The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k 212 | dimensions by maximizing the variance of the data as follows:** 213 | 214 | ⟶ 215 | 216 |
217 | 218 | **37. Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.** 219 | 220 | ⟶ 221 | 222 |
223 | 224 | **38. Step 2: Compute Σ=(1/m)∑mi=1x(i)x(i)T∈Rn×n, which is symmetric with real eigenvalues.** 225 | 226 | ⟶ 227 | 228 |
229 | 230 | **39. Step 3: Compute u1,...,uk∈Rn the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors associated with the k largest eigenvalues.** 231 | 232 | ⟶ 233 | 234 |
235 | 236 | **40. Step 4: Project the data on spanR(u1,...,uk).** 237 | 238 | ⟶ 239 | 240 |
241 | 242 | **41. This procedure maximizes the variance among all k-dimensional spaces.** 243 | 244 | ⟶ 245 | 246 |
247 | 248 | **42. [Data in feature space, Find principal components, Data in principal components space]** 249 | 250 | ⟶ 251 | 252 |
253 | 254 | **43. Independent component analysis** 255 | 256 | ⟶ 257 | 258 |
259 | 260 | **44. It is a technique meant to find the underlying generating sources.** 261 | 262 | ⟶ 263 | 264 |
265 | 266 | **45. Assumptions ― We assume that our data x has been generated by the n-dimensional source vector s=(s1,...,sn), where si are independent random variables, via a non-singular mixing matrix A as follows:** 267 | 268 | ⟶ 269 | 270 |
271 | 272 | **46. The goal is to find the unmixing matrix W=A−1.** 273 | 274 | ⟶ 275 | 276 |
277 | 278 | **47. Bell and Sejnowski ICA algorithm ― This algorithm finds the unmixing matrix W by following the steps below:** 279 | 280 | ⟶ 281 | 282 |
283 | 284 | **48. Write the probability of x=As=W−1s as:** 285 | 286 | ⟶ 287 | 288 |
289 | 290 | **49. Write the log likelihood given our training data {x(i),i∈[[1,m]]} and by noting g the sigmoid function as:** 291 | 292 | ⟶ 293 | 294 |
295 | 296 | **50. Therefore, the stochastic gradient ascent learning rule is such that for each training example x(i), we update W as follows:** 297 | 298 | ⟶ 299 | 300 |
301 | 302 | **51. The Machine Learning cheatsheets are now available in German.** 303 | 304 | ⟶ 305 | 306 |
307 | 308 | **52. Original authors** 309 | 310 | ⟶ 311 | 312 |
313 | 314 | **53. Translated by X, Y and Z** 315 | 316 | ⟶ 317 | 318 |
319 | 320 | **54. Reviewed by X, Y and Z** 321 | 322 | ⟶ 323 | 324 |
325 | 326 | **55. [Introduction, Motivation, Jensen's inequality]** 327 | 328 | ⟶ 329 | 330 |
331 | 332 | **56. [Clustering, Expectation-Maximization, k-means, Hierarchical clustering, Metrics]** 333 | 334 | ⟶ 335 | 336 |
337 | 338 | **57. [Dimension reduction, PCA, ICA]** 339 | 340 | ⟶ 341 | -------------------------------------------------------------------------------- /he/cheatsheet-unsupervised-learning.md: -------------------------------------------------------------------------------- 1 | **1. Unsupervised Learning cheatsheet** 2 | 3 | ⟶ 4 | 5 |
6 | 7 | **2. Introduction to Unsupervised Learning** 8 | 9 | ⟶ 10 | 11 |
12 | 13 | **3. Motivation ― The goal of unsupervised learning is to find hidden patterns in unlabeled data {x(1),...,x(m)}.** 14 | 15 | ⟶ 16 | 17 |
18 | 19 | **4. Jensen's inequality ― Let f be a convex function and X a random variable. We have the following inequality:** 20 | 21 | ⟶ 22 | 23 |
24 | 25 | **5. Clustering** 26 | 27 | ⟶ 28 | 29 |
30 | 31 | **6. Expectation-Maximization** 32 | 33 | ⟶ 34 | 35 |
36 | 37 | **7. Latent variables ― Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted z. Here are the most common settings where there are latent variables:** 38 | 39 | ⟶ 40 | 41 |
42 | 43 | **8. [Setting, Latent variable z, Comments]** 44 | 45 | ⟶ 46 | 47 |
48 | 49 | **9. [Mixture of k Gaussians, Factor analysis]** 50 | 51 | ⟶ 52 | 53 |
54 | 55 | **10. Algorithm ― The Expectation-Maximization (EM) algorithm gives an efficient method for estimating the parameter θ through maximum likelihood estimation by repeatedly constructing a lower bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:** 56 | 57 | ⟶ 58 | 59 |
60 | 61 | **11. E-step: Evaluate the posterior probability Qi(z(i)) that each data point x(i) came from a particular cluster z(i) as follows:** 62 | 63 | ⟶ 64 | 65 |
66 | 67 | **12. M-step: Use the posterior probabilities Qi(z(i)) as cluster specific weights on data points x(i) to separately re-estimate each cluster model as follows:** 68 | 69 | ⟶ 70 | 71 |
72 | 73 | **13. [Gaussians initialization, Expectation step, Maximization step, Convergence]** 74 | 75 | ⟶ 76 | 77 |
78 | 79 | **14. k-means clustering** 80 | 81 | ⟶ 82 | 83 |
84 | 85 | **15. We note c(i) the cluster of data point i and μj the center of cluster j.** 86 | 87 | ⟶ 88 | 89 |
90 | 91 | **16. Algorithm ― After randomly initializing the cluster centroids μ1,μ2,...,μk∈Rn, the k-means algorithm repeats the following step until convergence:** 92 | 93 | ⟶ 94 | 95 |
96 | 97 | **17. [Means initialization, Cluster assignment, Means update, Convergence]** 98 | 99 | ⟶ 100 | 101 |
102 | 103 | **18. Distortion function ― In order to see if the algorithm converges, we look at the distortion function defined as follows:** 104 | 105 | ⟶ 106 | 107 |
108 | 109 | **19. Hierarchical clustering** 110 | 111 | ⟶ 112 | 113 |
114 | 115 | **20. Algorithm ― It is a clustering algorithm with an agglomerative hierarchical approach that builds nested clusters in a successive manner.** 116 | 117 | ⟶ 118 | 119 |
120 | 121 | **21. Types ― There are different sorts of hierarchical clustering algorithms that aim at optimizing different objective functions, as summed up in the table below:** 122 | 123 | ⟶ 124 | 125 |
126 | 127 | **22. [Ward linkage, Average linkage, Complete linkage]** 128 | 129 | ⟶ 130 | 131 |
132 | 133 | **23. [Minimize within-cluster distance, Minimize average distance between cluster pairs, Minimize maximum distance between cluster pairs]** 134 | 135 | ⟶ 136 | 137 |
138 | 139 | **24. Clustering assessment metrics** 140 | 141 | ⟶ 142 | 143 |
144 | 145 | **25. In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground truth labels as was the case in the supervised learning setting.** 146 | 147 | ⟶ 148 | 149 |
150 | 151 | **26. Silhouette coefficient ― By noting a and b the mean distance between a sample and all other points in the same class, and between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single sample is defined as follows:** 152 | 153 | ⟶ 154 | 155 |
156 | 157 | **27. Calinski-Harabaz index ― By noting k the number of clusters, Bk and Wk the between- and within-cluster dispersion matrices respectively defined as** 158 | 159 | ⟶ 160 | 161 |
162 | 163 | **28. the Calinski-Harabaz index s(k) indicates how well a clustering model defines its clusters, such that the higher the score, the more dense and well separated the clusters are. It is defined as follows:** 164 | 165 | ⟶ 166 | 167 |
168 | 169 | **29. Dimension reduction** 170 | 171 | ⟶ 172 | 173 |
174 | 175 | **30. Principal component analysis** 176 | 177 | ⟶ 178 | 179 |
180 | 181 | **31. It is a dimension reduction technique that finds the variance maximizing directions onto which to project the data.** 182 | 183 | ⟶ 184 | 185 |
186 | 187 | **32. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:** 188 | 189 | ⟶ 190 | 191 |
192 | 193 | **33. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:** 194 | 195 | ⟶ 196 | 197 |
198 | 199 | **34. diagonal** 200 | 201 | ⟶ 202 | 203 |
204 | 205 | **35. Remark: the eigenvector associated with the largest eigenvalue is called the principal eigenvector of matrix A.** 206 | 207 | ⟶ 208 | 209 |
210 | 211 | **36. Algorithm ― The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k 212 | dimensions by maximizing the variance of the data as follows:** 213 | 214 | ⟶ 215 | 216 |
217 | 218 | **37. Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.** 219 | 220 | ⟶ 221 | 222 |
223 | 224 | **38. Step 2: Compute Σ=(1/m)∑mi=1x(i)x(i)T∈Rn×n, which is symmetric with real eigenvalues.** 225 | 226 | ⟶ 227 | 228 |
229 | 230 | **39. Step 3: Compute u1,...,uk∈Rn the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors associated with the k largest eigenvalues.** 231 | 232 | ⟶ 233 | 234 |
235 | 236 | **40. Step 4: Project the data on spanR(u1,...,uk).** 237 | 238 | ⟶ 239 | 240 |
241 | 242 | **41. This procedure maximizes the variance among all k-dimensional spaces.** 243 | 244 | ⟶ 245 | 246 |
247 | 248 | **42. [Data in feature space, Find principal components, Data in principal components space]** 249 | 250 | ⟶ 251 | 252 |
253 | 254 | **43. Independent component analysis** 255 | 256 | ⟶ 257 | 258 |
259 | 260 | **44. It is a technique meant to find the underlying generating sources.** 261 | 262 | ⟶ 263 | 264 |
265 | 266 | **45. Assumptions ― We assume that our data x has been generated by the n-dimensional source vector s=(s1,...,sn), where si are independent random variables, via a non-singular mixing matrix A as follows:** 267 | 268 | ⟶ 269 | 270 |
271 | 272 | **46. The goal is to find the unmixing matrix W=A−1.** 273 | 274 | ⟶ 275 | 276 |
277 | 278 | **47. Bell and Sejnowski ICA algorithm ― This algorithm finds the unmixing matrix W by following the steps below:** 279 | 280 | ⟶ 281 | 282 |
283 | 284 | **48. Write the probability of x=As=W−1s as:** 285 | 286 | ⟶ 287 | 288 |
289 | 290 | **49. Write the log likelihood given our training data {x(i),i∈[[1,m]]} and by noting g the sigmoid function as:** 291 | 292 | ⟶ 293 | 294 |
295 | 296 | **50. Therefore, the stochastic gradient ascent learning rule is such that for each training example x(i), we update W as follows:** 297 | 298 | ⟶ 299 | 300 |
301 | 302 | **51. The Machine Learning cheatsheets are now available in Hebrew.** 303 | 304 | ⟶ 305 | 306 |
307 | 308 | **52. Original authors** 309 | 310 | ⟶ 311 | 312 |
313 | 314 | **53. Translated by X, Y and Z** 315 | 316 | ⟶ 317 | 318 |
319 | 320 | **54. Reviewed by X, Y and Z** 321 | 322 | ⟶ 323 | 324 |
325 | 326 | **55. [Introduction, Motivation, Jensen's inequality]** 327 | 328 | ⟶ 329 | 330 |
331 | 332 | **56. [Clustering, Expectation-Maximization, k-means, Hierarchical clustering, Metrics]** 333 | 334 | ⟶ 335 | 336 |
337 | 338 | **57. [Dimension reduction, PCA, ICA]** 339 | 340 | ⟶ 341 | -------------------------------------------------------------------------------- /hi/cheatsheet-unsupervised-learning.md: -------------------------------------------------------------------------------- 1 | **1. Unsupervised Learning cheatsheet** 2 | 3 | ⟶ 4 | 5 |
6 | 7 | **2. Introduction to Unsupervised Learning** 8 | 9 | ⟶ 10 | 11 |
12 | 13 | **3. Motivation ― The goal of unsupervised learning is to find hidden patterns in unlabeled data {x(1),...,x(m)}.** 14 | 15 | ⟶ 16 | 17 |
18 | 19 | **4. Jensen's inequality ― Let f be a convex function and X a random variable. We have the following inequality:** 20 | 21 | ⟶ 22 | 23 |
24 | 25 | **5. Clustering** 26 | 27 | ⟶ 28 | 29 |
30 | 31 | **6. Expectation-Maximization** 32 | 33 | ⟶ 34 | 35 |
36 | 37 | **7. Latent variables ― Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted z. Here are the most common settings where there are latent variables:** 38 | 39 | ⟶ 40 | 41 |
42 | 43 | **8. [Setting, Latent variable z, Comments]** 44 | 45 | ⟶ 46 | 47 |
48 | 49 | **9. [Mixture of k Gaussians, Factor analysis]** 50 | 51 | ⟶ 52 | 53 |
54 | 55 | **10. Algorithm ― The Expectation-Maximization (EM) algorithm gives an efficient method for estimating the parameter θ through maximum likelihood estimation by repeatedly constructing a lower bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:** 56 | 57 | ⟶ 58 | 59 |
60 | 61 | **11. E-step: Evaluate the posterior probability Qi(z(i)) that each data point x(i) came from a particular cluster z(i) as follows:** 62 | 63 | ⟶ 64 | 65 |
66 | 67 | **12. M-step: Use the posterior probabilities Qi(z(i)) as cluster specific weights on data points x(i) to separately re-estimate each cluster model as follows:** 68 | 69 | ⟶ 70 | 71 |
72 | 73 | **13. [Gaussians initialization, Expectation step, Maximization step, Convergence]** 74 | 75 | ⟶ 76 | 77 |
78 | 79 | **14. k-means clustering** 80 | 81 | ⟶ 82 | 83 |
84 | 85 | **15. We note c(i) the cluster of data point i and μj the center of cluster j.** 86 | 87 | ⟶ 88 | 89 |
90 | 91 | **16. Algorithm ― After randomly initializing the cluster centroids μ1,μ2,...,μk∈Rn, the k-means algorithm repeats the following step until convergence:** 92 | 93 | ⟶ 94 | 95 |
96 | 97 | **17. [Means initialization, Cluster assignment, Means update, Convergence]** 98 | 99 | ⟶ 100 | 101 |
102 | 103 | **18. Distortion function ― In order to see if the algorithm converges, we look at the distortion function defined as follows:** 104 | 105 | ⟶ 106 | 107 |
108 | 109 | **19. Hierarchical clustering** 110 | 111 | ⟶ 112 | 113 |
114 | 115 | **20. Algorithm ― It is a clustering algorithm with an agglomerative hierarchical approach that builds nested clusters in a successive manner.** 116 | 117 | ⟶ 118 | 119 |
120 | 121 | **21. Types ― There are different sorts of hierarchical clustering algorithms that aim at optimizing different objective functions, as summed up in the table below:** 122 | 123 | ⟶ 124 | 125 |
126 | 127 | **22. [Ward linkage, Average linkage, Complete linkage]** 128 | 129 | ⟶ 130 | 131 |
132 | 133 | **23. [Minimize within-cluster distance, Minimize average distance between cluster pairs, Minimize maximum distance between cluster pairs]** 134 | 135 | ⟶ 136 | 137 |
138 | 139 | **24. Clustering assessment metrics** 140 | 141 | ⟶ 142 | 143 |
144 | 145 | **25. In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground truth labels as was the case in the supervised learning setting.** 146 | 147 | ⟶ 148 | 149 |
150 | 151 | **26. Silhouette coefficient ― By noting a and b the mean distance between a sample and all other points in the same class, and between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single sample is defined as follows:** 152 | 153 | ⟶ 154 | 155 |
156 | 157 | **27. Calinski-Harabaz index ― By noting k the number of clusters, Bk and Wk the between- and within-cluster dispersion matrices respectively defined as** 158 | 159 | ⟶ 160 | 161 |
162 | 163 | **28. the Calinski-Harabaz index s(k) indicates how well a clustering model defines its clusters, such that the higher the score, the more dense and well separated the clusters are. It is defined as follows:** 164 | 165 | ⟶ 166 | 167 |
168 | 169 | **29. Dimension reduction** 170 | 171 | ⟶ 172 | 173 |
174 | 175 | **30. Principal component analysis** 176 | 177 | ⟶ 178 | 179 |
180 | 181 | **31. It is a dimension reduction technique that finds the variance maximizing directions onto which to project the data.** 182 | 183 | ⟶ 184 | 185 |
186 | 187 | **32. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:** 188 | 189 | ⟶ 190 | 191 |
192 | 193 | **33. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:** 194 | 195 | ⟶ 196 | 197 |
198 | 199 | **34. diagonal** 200 | 201 | ⟶ 202 | 203 |
204 | 205 | **35. Remark: the eigenvector associated with the largest eigenvalue is called the principal eigenvector of matrix A.** 206 | 207 | ⟶ 208 | 209 |
210 | 211 | **36. Algorithm ― The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k 212 | dimensions by maximizing the variance of the data as follows:** 213 | 214 | ⟶ 215 | 216 |
217 | 218 | **37. Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.** 219 | 220 | ⟶ 221 | 222 |
223 | 224 | **38. Step 2: Compute Σ=(1/m)∑mi=1x(i)x(i)T∈Rn×n, which is symmetric with real eigenvalues.** 225 | 226 | ⟶ 227 | 228 |
229 | 230 | **39. Step 3: Compute u1,...,uk∈Rn the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors associated with the k largest eigenvalues.** 231 | 232 | ⟶ 233 | 234 |
235 | 236 | **40. Step 4: Project the data on spanR(u1,...,uk).** 237 | 238 | ⟶ 239 | 240 |
241 | 242 | **41. This procedure maximizes the variance among all k-dimensional spaces.** 243 | 244 | ⟶ 245 | 246 |
247 | 248 | **42. [Data in feature space, Find principal components, Data in principal components space]** 249 | 250 | ⟶ 251 | 252 |
253 | 254 | **43. Independent component analysis** 255 | 256 | ⟶ 257 | 258 |
259 | 260 | **44. It is a technique meant to find the underlying generating sources.** 261 | 262 | ⟶ 263 | 264 |
265 | 266 | **45. Assumptions ― We assume that our data x has been generated by the n-dimensional source vector s=(s1,...,sn), where si are independent random variables, via a non-singular mixing matrix A as follows:** 267 | 268 | ⟶ 269 | 270 |
271 | 272 | **46. The goal is to find the unmixing matrix W=A−1.** 273 | 274 | ⟶ 275 | 276 |
277 | 278 | **47. Bell and Sejnowski ICA algorithm ― This algorithm finds the unmixing matrix W by following the steps below:** 279 | 280 | ⟶ 281 | 282 |
283 | 284 | **48. Write the probability of x=As=W−1s as:** 285 | 286 | ⟶ 287 | 288 |
289 | 290 | **49. Write the log likelihood given our training data {x(i),i∈[[1,m]]} and by noting g the sigmoid function as:** 291 | 292 | ⟶ 293 | 294 |
295 | 296 | **50. Therefore, the stochastic gradient ascent learning rule is such that for each training example x(i), we update W as follows:** 297 | 298 | ⟶ 299 | 300 |
301 | 302 | **51. The Machine Learning cheatsheets are now available in Hindi.** 303 | 304 | ⟶ 305 | 306 |
307 | 308 | **52. Original authors** 309 | 310 | ⟶ 311 | 312 |
313 | 314 | **53. Translated by X, Y and Z** 315 | 316 | ⟶ 317 | 318 |
319 | 320 | **54. Reviewed by X, Y and Z** 321 | 322 | ⟶ 323 | 324 |
325 | 326 | **55. [Introduction, Motivation, Jensen's inequality]** 327 | 328 | ⟶ 329 | 330 |
331 | 332 | **56. [Clustering, Expectation-Maximization, k-means, Hierarchical clustering, Metrics]** 333 | 334 | ⟶ 335 | 336 |
337 | 338 | **57. [Dimension reduction, PCA, ICA]** 339 | 340 | ⟶ 341 | -------------------------------------------------------------------------------- /ru/cheatsheet-unsupervised-learning.md: -------------------------------------------------------------------------------- 1 | **1. Unsupervised Learning cheatsheet** 2 | 3 | ⟶ 4 | 5 |
6 | 7 | **2. Introduction to Unsupervised Learning** 8 | 9 | ⟶ 10 | 11 |
12 | 13 | **3. Motivation ― The goal of unsupervised learning is to find hidden patterns in unlabeled data {x(1),...,x(m)}.** 14 | 15 | ⟶ 16 | 17 |
18 | 19 | **4. Jensen's inequality ― Let f be a convex function and X a random variable. We have the following inequality:** 20 | 21 | ⟶ 22 | 23 |
24 | 25 | **5. Clustering** 26 | 27 | ⟶ 28 | 29 |
30 | 31 | **6. Expectation-Maximization** 32 | 33 | ⟶ 34 | 35 |
36 | 37 | **7. Latent variables ― Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted z. Here are the most common settings where there are latent variables:** 38 | 39 | ⟶ 40 | 41 |
42 | 43 | **8. [Setting, Latent variable z, Comments]** 44 | 45 | ⟶ 46 | 47 |
48 | 49 | **9. [Mixture of k Gaussians, Factor analysis]** 50 | 51 | ⟶ 52 | 53 |
54 | 55 | **10. Algorithm ― The Expectation-Maximization (EM) algorithm gives an efficient method for estimating the parameter θ through maximum likelihood estimation by repeatedly constructing a lower-bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:** 56 | 57 | ⟶ 58 | 59 |
60 | 61 | **11. E-step: Evaluate the posterior probability Qi(z(i)) that each data point x(i) came from a particular cluster z(i) as follows:** 62 | 63 | ⟶ 64 | 65 |
66 | 67 | **12. M-step: Use the posterior probabilities Qi(z(i)) as cluster specific weights on data points x(i) to separately re-estimate each cluster model as follows:** 68 | 69 | ⟶ 70 | 71 |
72 | 73 | **13. [Gaussians initialization, Expectation step, Maximization step, Convergence]** 74 | 75 | ⟶ 76 | 77 |
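The E-step/M-step loop above can be sketched for a mixture of two 1-D Gaussians; the initialization and the iteration count below are illustrative choices, not prescribed by the cheatsheet:

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    """EM for a two-component 1-D Gaussian mixture (E-step then M-step, repeated)."""
    mu = np.array([x.min(), x.max()])     # crude Gaussians initialization
    var = np.array([x.var(), x.var()])
    phi = np.array([0.5, 0.5])            # mixture weights
    for _ in range(n_iter):
        # E-step: posterior probability Q_i(z) of each cluster for each point
        pdf = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        q = phi * pdf
        q /= q.sum(axis=1, keepdims=True)
        # M-step: re-estimate each cluster with the posteriors as weights
        nk = q.sum(axis=0)
        phi = nk / len(x)
        mu = (q * x[:, None]).sum(axis=0) / nk
        var = (q * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return phi, mu, var

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-4, 1, 300), rng.normal(4, 1, 300)])
phi, mu, var = em_gmm_1d(x)
```

On this well-separated toy mixture, the estimated means converge near the true component means −4 and 4.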
78 | 79 | **14. k-means clustering** 80 | 81 | ⟶ 82 | 83 |
84 | 85 | **15. We note c(i) the cluster of data point i and μj the center of cluster j.** 86 | 87 | ⟶ 88 | 89 |
90 | 91 | **16. Algorithm ― After randomly initializing the cluster centroids μ1,μ2,...,μk∈Rn, the k-means algorithm repeats the following step until convergence:** 92 | 93 | ⟶ 94 | 95 |
96 | 97 | **17. [Means initialization, Cluster assignment, Means update, Convergence]** 98 | 99 | ⟶ 100 | 101 |
102 | 103 | **18. Distortion function ― In order to see if the algorithm converges, we look at the distortion function defined as follows:** 104 | 105 | ⟶ 106 | 107 |
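The assignment/update loop and the distortion function above can be sketched as follows (initializing the centroids from random data points and keeping a previous centroid for an empty cluster are illustrative choices):

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain k-means: repeat cluster assignment and means update."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]   # means initialization
    for _ in range(n_iter):
        # cluster assignment: each point joins its nearest centroid
        c = np.argmin(((X[:, None, :] - mu) ** 2).sum(axis=2), axis=1)
        # means update (an empty cluster keeps its previous centroid)
        mu = np.array([X[c == j].mean(axis=0) if np.any(c == j) else mu[j]
                       for j in range(k)])
    return c, mu

def distortion(X, c, mu):
    # the distortion function J, monitored to check convergence
    return ((X - mu[c]) ** 2).sum()

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(10, 1, (100, 2))])
c, mu = kmeans(X, 2)
```

On two well-separated blobs, the two centroids settle on the blob centers and the distortion stops decreasing.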
108 | 109 | **19. Hierarchical clustering** 110 | 111 | ⟶ 112 | 113 |
114 | 115 | **20. Algorithm ― It is a clustering algorithm with an agglomerative hierarchical approach that builds nested clusters in a successive manner.** 116 | 117 | ⟶ 118 | 119 |
120 | 121 | **21. Types ― There are different sorts of hierarchical clustering algorithms that aim at optimizing different objective functions, which are summed up in the table below:** 122 | 123 | ⟶ 124 | 125 |
126 | 127 | **22. [Ward linkage, Average linkage, Complete linkage]** 128 | 129 | ⟶ 130 | 131 |
132 | 133 | **23. [Minimize within cluster distance, Minimize average distance between cluster pairs, Minimize maximum distance between cluster pairs]** 134 | 135 | ⟶ 136 | 137 |
138 | 139 | **24. Clustering assessment metrics** 140 | 141 | ⟶ 142 | 143 |
144 | 145 | **25. In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground truth labels as was the case in the supervised learning setting.** 146 | 147 | ⟶ 148 | 149 |
150 | 151 | **26. Silhouette coefficient ― By noting a and b the mean distance between a sample and all other points in the same class, and between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single sample is defined as follows:** 152 | 153 | ⟶ 154 | 155 |
156 | 157 | **27. Calinski-Harabasz index ― By noting k the number of clusters, Bk and Wk the between and within-clustering dispersion matrices respectively defined as** 158 | 159 | ⟶ 160 | 161 |
162 | 163 | **28. the Calinski-Harabasz index s(k) indicates how well a clustering model defines its clusters, such that the higher the score, the denser and better separated the clusters are. It is defined as follows:** 164 | 165 | ⟶ 166 | 167 |
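The silhouette coefficient s=(b−a)/max(a,b) for a single sample can be computed as in this minimal sketch (the helper name and the toy data are illustrative):

```python
import numpy as np

def silhouette_sample(X, labels, i):
    """Silhouette coefficient s = (b - a) / max(a, b) for sample i."""
    d = np.linalg.norm(X - X[i], axis=1)
    same = (labels == labels[i]) & (np.arange(len(X)) != i)
    a = d[same].mean()                     # mean distance within own cluster
    b = min(d[labels == l].mean()          # mean distance to the nearest other cluster
            for l in np.unique(labels) if l != labels[i])
    return (b - a) / max(a, b)

# two tight, well-separated clusters -> silhouette close to 1
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])
s = silhouette_sample(X, labels, 0)
```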
168 | 169 | **29. Dimension reduction** 170 | 171 | ⟶ 172 | 173 |
174 | 175 | **30. Principal component analysis** 176 | 177 | ⟶ 178 | 179 |
180 | 181 | **31. It is a dimension reduction technique that finds the variance maximizing directions onto which to project the data.** 182 | 183 | ⟶ 184 | 185 |
186 | 187 | **32. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:** 188 | 189 | ⟶ 190 | 191 |
192 | 193 | **33. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:** 194 | 195 | ⟶ 196 | 197 |
198 | 199 | **34. diagonal** 200 | 201 | ⟶ 202 | 203 |
204 | 205 | **35. Remark: the eigenvector associated with the largest eigenvalue is called principal eigenvector of matrix A.** 206 | 207 | ⟶ 208 | 209 |
210 | 211 | **36. Algorithm ― The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k 212 | dimensions by maximizing the variance of the data as follows:** 213 | 214 | ⟶ 215 | 216 |
217 | 218 | **37. Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.** 219 | 220 | ⟶ 221 | 222 |
223 | 224 | **38. Step 2: Compute Σ=(1/m)∑mi=1x(i)x(i)T∈Rn×n, which is symmetric with real eigenvalues.** 225 | 226 | ⟶ 227 | 228 |
229 | 230 | **39. Step 3: Compute u1,...,uk∈Rn the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors of the k largest eigenvalues.** 231 | 232 | ⟶ 233 | 234 |
235 | 236 | **40. Step 4: Project the data on spanR(u1,...,uk).** 237 | 238 | ⟶ 239 | 240 |
241 | 242 | **41. This procedure maximizes the variance among all k-dimensional spaces.** 243 | 244 | ⟶ 245 | 246 |
247 | 248 | **42. [Data in feature space, Find principal components, Data in principal components space]** 249 | 250 | ⟶ 251 | 252 |
253 | 254 | **43. Independent component analysis** 255 | 256 | ⟶ 257 | 258 |
259 | 260 | **44. It is a technique meant to find the underlying generating sources.** 261 | 262 | ⟶ 263 | 264 |
265 | 266 | **45. Assumptions ― We assume that our data x has been generated by the n-dimensional source vector s=(s1,...,sn), where si are independent random variables, via a mixing and non-singular matrix A as follows:** 267 | 268 | ⟶ 269 | 270 |
271 | 272 | **46. The goal is to find the unmixing matrix W=A−1.** 273 | 274 | ⟶ 275 | 276 |
277 | 278 | **47. Bell and Sejnowski ICA algorithm ― This algorithm finds the unmixing matrix W by following the steps below:** 279 | 280 | ⟶ 281 | 282 |
283 | 284 | **48. Write the probability of x=As=W−1s as:** 285 | 286 | ⟶ 287 | 288 |
289 | 290 | **49. Write the log likelihood given our training data {x(i),i∈[[1,m]]} and by noting g the sigmoid function as:** 291 | 292 | ⟶ 293 | 294 |
295 | 296 | **50. Therefore, the stochastic gradient ascent learning rule is such that for each training example x(i), we update W as follows:** 297 | 298 | ⟶ 299 | 300 |
301 | 302 | **51. The Machine Learning cheatsheets are now available in Russian.** 303 | 304 | ⟶ 305 | 306 |
307 | 308 | **52. Original authors** 309 | 310 | ⟶ 311 | 312 |
313 | 314 | **53. Translated by X, Y and Z** 315 | 316 | ⟶ 317 | 318 |
319 | 320 | **54. Reviewed by X, Y and Z** 321 | 322 | ⟶ 323 | 324 |
325 | 326 | **55. [Introduction, Motivation, Jensen's inequality]** 327 | 328 | ⟶ 329 | 330 |
331 | 332 | **56. [Clustering, Expectation-Maximization, k-means, Hierarchical clustering, Metrics]** 333 | 334 | ⟶ 335 | 336 |
337 | 338 | **57. [Dimension reduction, PCA, ICA]** 339 | 340 | ⟶ 341 | -------------------------------------------------------------------------------- /template/cheatsheet-unsupervised-learning.md: -------------------------------------------------------------------------------- 1 | **1. Unsupervised Learning cheatsheet** 2 | 3 | ⟶ 4 | 5 |
6 | 7 | **2. Introduction to Unsupervised Learning** 8 | 9 | ⟶ 10 | 11 |
12 | 13 | **3. Motivation ― The goal of unsupervised learning is to find hidden patterns in unlabeled data {x(1),...,x(m)}.** 14 | 15 | ⟶ 16 | 17 |
18 | 19 | **4. Jensen's inequality ― Let f be a convex function and X a random variable. We have the following inequality:** 20 | 21 | ⟶ 22 | 23 |
24 | 25 | **5. Clustering** 26 | 27 | ⟶ 28 | 29 |
30 | 31 | **6. Expectation-Maximization** 32 | 33 | ⟶ 34 | 35 |
36 | 37 | **7. Latent variables ― Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted z. Here are the most common settings where there are latent variables:** 38 | 39 | ⟶ 40 | 41 |
42 | 43 | **8. [Setting, Latent variable z, Comments]** 44 | 45 | ⟶ 46 | 47 |
48 | 49 | **9. [Mixture of k Gaussians, Factor analysis]** 50 | 51 | ⟶ 52 | 53 |
54 | 55 | **10. Algorithm ― The Expectation-Maximization (EM) algorithm gives an efficient method for estimating the parameter θ through maximum likelihood estimation by repeatedly constructing a lower-bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:** 56 | 57 | ⟶ 58 | 59 |
60 | 61 | **11. E-step: Evaluate the posterior probability Qi(z(i)) that each data point x(i) came from a particular cluster z(i) as follows:** 62 | 63 | ⟶ 64 | 65 |
66 | 67 | **12. M-step: Use the posterior probabilities Qi(z(i)) as cluster specific weights on data points x(i) to separately re-estimate each cluster model as follows:** 68 | 69 | ⟶ 70 | 71 |
72 | 73 | **13. [Gaussians initialization, Expectation step, Maximization step, Convergence]** 74 | 75 | ⟶ 76 | 77 |
78 | 79 | **14. k-means clustering** 80 | 81 | ⟶ 82 | 83 |
84 | 85 | **15. We note c(i) the cluster of data point i and μj the center of cluster j.** 86 | 87 | ⟶ 88 | 89 |
90 | 91 | **16. Algorithm ― After randomly initializing the cluster centroids μ1,μ2,...,μk∈Rn, the k-means algorithm repeats the following step until convergence:** 92 | 93 | ⟶ 94 | 95 |
96 | 97 | **17. [Means initialization, Cluster assignment, Means update, Convergence]** 98 | 99 | ⟶ 100 | 101 |
102 | 103 | **18. Distortion function ― In order to see if the algorithm converges, we look at the distortion function defined as follows:** 104 | 105 | ⟶ 106 | 107 |
108 | 109 | **19. Hierarchical clustering** 110 | 111 | ⟶ 112 | 113 |
114 | 115 | **20. Algorithm ― It is a clustering algorithm with an agglomerative hierarchical approach that builds nested clusters in a successive manner.** 116 | 117 | ⟶ 118 | 119 |
120 | 121 | **21. Types ― There are different sorts of hierarchical clustering algorithms that aim at optimizing different objective functions, which are summed up in the table below:** 122 | 123 | ⟶ 124 | 125 |
126 | 127 | **22. [Ward linkage, Average linkage, Complete linkage]** 128 | 129 | ⟶ 130 | 131 |
132 | 133 | **23. [Minimize within cluster distance, Minimize average distance between cluster pairs, Minimize maximum distance between cluster pairs]** 134 | 135 | ⟶ 136 | 137 |
138 | 139 | **24. Clustering assessment metrics** 140 | 141 | ⟶ 142 | 143 |
144 | 145 | **25. In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground truth labels as was the case in the supervised learning setting.** 146 | 147 | ⟶ 148 | 149 |
150 | 151 | **26. Silhouette coefficient ― By noting a and b the mean distance between a sample and all other points in the same class, and between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single sample is defined as follows:** 152 | 153 | ⟶ 154 | 155 |
156 | 157 | **27. Calinski-Harabasz index ― By noting k the number of clusters, Bk and Wk the between and within-clustering dispersion matrices respectively defined as** 158 | 159 | ⟶ 160 | 161 |
162 | 163 | **28. the Calinski-Harabasz index s(k) indicates how well a clustering model defines its clusters, such that the higher the score, the denser and better separated the clusters are. It is defined as follows:** 164 | 165 | ⟶ 166 | 167 |
168 | 169 | **29. Dimension reduction** 170 | 171 | ⟶ 172 | 173 |
174 | 175 | **30. Principal component analysis** 176 | 177 | ⟶ 178 | 179 |
180 | 181 | **31. It is a dimension reduction technique that finds the variance maximizing directions onto which to project the data.** 182 | 183 | ⟶ 184 | 185 |
186 | 187 | **32. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:** 188 | 189 | ⟶ 190 | 191 |
192 | 193 | **33. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:** 194 | 195 | ⟶ 196 | 197 |
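The spectral theorem can be checked numerically with `numpy.linalg.eigh`, which returns the eigenvalues forming Λ and the orthogonal matrix U for a symmetric input:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))
A = (M + M.T) / 2                       # a symmetric matrix

# eigh diagonalizes symmetric matrices: A = U diag(lam) U^T with U orthogonal
lam, U = np.linalg.eigh(A)

recon = U @ np.diag(lam) @ U.T          # should reconstruct A
gram = U.T @ U                          # should be the identity (U orthogonal)
```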
198 | 199 | **34. diagonal** 200 | 201 | ⟶ 202 | 203 |
204 | 205 | **35. Remark: the eigenvector associated with the largest eigenvalue is called principal eigenvector of matrix A.** 206 | 207 | ⟶ 208 | 209 |
210 | 211 | **36. Algorithm ― The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k 212 | dimensions by maximizing the variance of the data as follows:** 213 | 214 | ⟶ 215 | 216 |
217 | 218 | **37. Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.** 219 | 220 | ⟶ 221 | 222 |
223 | 224 | **38. Step 2: Compute Σ=(1/m)∑mi=1x(i)x(i)T∈Rn×n, which is symmetric with real eigenvalues.** 225 | 226 | ⟶ 227 | 228 |
229 | 230 | **39. Step 3: Compute u1,...,uk∈Rn the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors of the k largest eigenvalues.** 231 | 232 | ⟶ 233 | 234 |
235 | 236 | **40. Step 4: Project the data on spanR(u1,...,uk).** 237 | 238 | ⟶ 239 | 240 |
241 | 242 | **41. This procedure maximizes the variance among all k-dimensional spaces.** 243 | 244 | ⟶ 245 | 246 |
247 | 248 | **42. [Data in feature space, Find principal components, Data in principal components space]** 249 | 250 | ⟶ 251 | 252 |
253 | 254 | **43. Independent component analysis** 255 | 256 | ⟶ 257 | 258 |
259 | 260 | **44. It is a technique meant to find the underlying generating sources.** 261 | 262 | ⟶ 263 | 264 |
265 | 266 | **45. Assumptions ― We assume that our data x has been generated by the n-dimensional source vector s=(s1,...,sn), where si are independent random variables, via a mixing and non-singular matrix A as follows:** 267 | 268 | ⟶ 269 | 270 |
271 | 272 | **46. The goal is to find the unmixing matrix W=A−1.** 273 | 274 | ⟶ 275 | 276 |
277 | 278 | **47. Bell and Sejnowski ICA algorithm ― This algorithm finds the unmixing matrix W by following the steps below:** 279 | 280 | ⟶ 281 | 282 |
283 | 284 | **48. Write the probability of x=As=W−1s as:** 285 | 286 | ⟶ 287 | 288 |
289 | 290 | **49. Write the log likelihood given our training data {x(i),i∈[[1,m]]} and by noting g the sigmoid function as:** 291 | 292 | ⟶ 293 | 294 |
295 | 296 | **50. Therefore, the stochastic gradient ascent learning rule is such that for each training example x(i), we update W as follows:** 297 | 298 | ⟶ 299 | 300 |
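The stochastic gradient ascent rule above can be sketched as follows; the learning rate, epoch count, and the Laplace test sources are illustrative assumptions, and the result is only defined up to scaling and permutation of the rows of W:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ica(X, lr=0.001, n_epochs=5, seed=0):
    """Bell & Sejnowski update for the unmixing matrix W.
    Each row of X is one training example x(i)."""
    rng = np.random.default_rng(seed)
    W = np.eye(X.shape[1])
    for _ in range(n_epochs):
        for x in X[rng.permutation(len(X))]:
            x = x[:, None]              # column vector
            # W <- W + lr * ((1 - 2 g(Wx)) x^T + (W^T)^-1)
            W += lr * ((1 - 2 * sigmoid(W @ x)) @ x.T + np.linalg.inv(W.T))
    return W

rng = np.random.default_rng(3)
S = rng.laplace(size=(500, 2))            # super-Gaussian sources s
A = np.array([[1.0, 0.5], [0.5, 1.0]])    # mixing matrix
X = S @ A.T                               # observed signals x = As
W = ica(X)
```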
301 | 302 | **51. The Machine Learning cheatsheets are now available in [target language].** 303 | 304 | ⟶ 305 | 306 |
307 | 308 | **52. Original authors** 309 | 310 | ⟶ 311 | 312 |
313 | 314 | **53. Translated by X, Y and Z** 315 | 316 | ⟶ 317 | 318 |
319 | 320 | **54. Reviewed by X, Y and Z** 321 | 322 | ⟶ 323 | 324 |
325 | 326 | **55. [Introduction, Motivation, Jensen's inequality]** 327 | 328 | ⟶ 329 | 330 |
331 | 332 | **56. [Clustering, Expectation-Maximization, k-means, Hierarchical clustering, Metrics]** 333 | 334 | ⟶ 335 | 336 |
337 | 338 | **57. [Dimension reduction, PCA, ICA]** 339 | 340 | ⟶ 341 | -------------------------------------------------------------------------------- /ko/cheatsheet-machine-learning-tips-and-tricks.md: -------------------------------------------------------------------------------- 1 | **1. Machine Learning tips and tricks cheatsheet** 2 | 3 | ⟶머신러닝 팁과 트릭 치트시트 4 | 5 |
6 | 7 | **2. Classification metrics** 8 | 9 | ⟶분류 측정 항목 10 | 11 |
12 | 13 | **3. In a context of a binary classification, here are the main metrics that are important to track in order to assess the performance of the model.** 14 | 15 | ⟶이진 분류 상황에서 모델의 성능을 평가하기 위해 눈 여겨 봐야하는 주요 측정 항목이 여기에 있습니다. 16 | 17 |
18 | 19 | **4. Confusion matrix ― The confusion matrix is used to have a more complete picture when assessing the performance of a model. It is defined as follows:** 20 | 21 | ⟶혼동 행렬 ― 혼동 행렬은 모델의 성능을 평가할 때, 보다 큰 그림을 보기위해 사용됩니다. 이는 다음과 같이 정의됩니다. 22 | 23 |
24 | 25 | **5. [Predicted class, Actual class]** 26 | 27 | ⟶[예측된 클래스, 실제 클래스] 28 | 29 |
30 | 31 | **6. Main metrics ― The following metrics are commonly used to assess the performance of classification models:** 32 | 33 | ⟶주요 측정 항목들 ― 다음 측정 항목들은 주로 분류 모델의 성능을 평가할 때 사용됩니다. 34 | 35 |
36 | 37 | **7. [Metric, Formula, Interpretation]** 38 | 39 | ⟶[측정 항목, 공식, 해석] 40 | 41 |
42 | 43 | **8. Overall performance of model** 44 | 45 | ⟶전반적인 모델의 성능 46 | 47 |
48 | 49 | **9. How accurate the positive predictions are** 50 | 51 | ⟶예측된 양성이 정확한 정도 52 | 53 |
54 | 55 | **10. Coverage of actual positive sample** 56 | 57 | ⟶실제 양성의 예측 정도 58 | 59 |
60 | 61 | **11. Coverage of actual negative sample** 62 | 63 | ⟶실제 음성의 예측 정도 64 | 65 |
66 | 67 | **12. Hybrid metric useful for unbalanced classes** 68 | 69 | ⟶불균형 클래스에 유용한 하이브리드 측정 항목 70 | 71 |
72 | 73 | **13. ROC ― The receiver operating curve, also noted ROC, is the plot of TPR versus FPR by varying the threshold. These metrics are summed up in the table below:** 74 | 75 | ⟶ROC(Receiver Operating Curve) ― ROC 곡선은 임계값의 변화에 따른 TPR 대 FPR의 플롯입니다. 이 측정 항목은 아래 표에 요약되어 있습니다: 76 | 77 |
78 | 79 | **14. [Metric, Formula, Equivalent]** 80 | 81 | ⟶[측정 항목, 공식, 같은 측도] 82 | 83 |
84 | 85 | **15. AUC ― The area under the receiver operating curve, also noted AUC or AUROC, is the area below the ROC as shown in the following figure:** 86 | 87 | ⟶AUC(Area Under the Receiver Operating Curve) ― AUC 또는 AUROC라고도 하는 이 측정 항목은 다음 그림과 같이 ROC 곡선 아래의 영역입니다: 88 | 89 |
90 | 91 | **16. [Actual, Predicted]** 92 | 93 | ⟶[실제값, 예측된 값] 94 | 95 |
96 | 97 | **17. Basic metrics ― Given a regression model f, the following metrics are commonly used to assess the performance of the model:** 98 | 99 | ⟶기본 측정 항목 ― 회귀 모델 f가 주어졌을때, 다음의 측정 항목들은 모델의 성능을 평가할 때 주로 사용됩니다: 100 | 101 |
102 | 103 | **18. [Total sum of squares, Explained sum of squares, Residual sum of squares]** 104 | 105 | ⟶[총 제곱합, 설명된 제곱합, 잔차 제곱합] 106 | 107 |
108 | 109 | **19. Coefficient of determination ― The coefficient of determination, often noted R2 or r2, provides a measure of how well the observed outcomes are replicated by the model and is defined as follows:** 110 | 111 | ⟶결정 계수 ― 종종 R2 또는 r2로 표시되는 결정 계수는 관측된 결과가 모델에 의해 얼마나 잘 재현되는지를 측정하는 측도로서 다음과 같이 정의됩니다: 112 | 113 |
114 | 115 | **20. Main metrics ― The following metrics are commonly used to assess the performance of regression models, by taking into account the number of variables n that they consider:** 116 | 117 | ⟶주요 측정 항목들 ― 다음 측정 항목들은 주로 변수의 수를 고려하여 회귀 모델의 성능을 평가할 때 사용됩니다: 118 | 119 |
120 | 121 | **21. where L is the likelihood and ˆσ2 is an estimate of the variance associated with each response.** 122 | 123 | ⟶여기서 L은 가능도이고 ^σ2는 각각의 반응과 관련된 분산의 추정값입니다. 124 | 125 |
126 | 127 | **22. Model selection** 128 | 129 | ⟶모델 선택 130 | 131 |
132 | 133 | **23. Vocabulary ― When selecting a model, we distinguish 3 different parts of the data that we have as follows:** 134 | 135 | ⟶어휘 ― 모델을 선택할 때 우리는 다음과 같이 가지고 있는 데이터를 세 부분으로 구분합니다: 136 | 137 |
138 | 139 | **24. [Training set, Validation set, Testing set]** 140 | 141 | ⟶[학습 세트, 검증 세트, 테스트 세트] 142 | 143 |
144 | 145 | **25. [Model is trained, Model is assessed, Model gives predictions]** 146 | 147 | ⟶[모델 훈련, 모델 평가, 모델 예측] 148 | 149 |
150 | 151 | **26. [Usually 80% of the dataset, Usually 20% of the dataset]** 152 | 153 | ⟶[주로 데이터 세트의 80%, 주로 데이터 세트의 20%] 154 | 155 |
156 | 157 | **27. [Also called hold-out or development set, Unseen data]** 158 | 159 | ⟶[홀드아웃 또는 개발 세트라고도하는, 보지 않은 데이터] 160 | 161 |
162 | 163 | **28. Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set. These are represented in the figure below:** 164 | 165 | ⟶모델이 선택되면 전체 데이터 세트에 대해 학습을 하고 보지 않은 데이터에서 테스트합니다. 이는 아래 그림에 나타나있습니다. 166 | 167 |
168 | 169 | **29. Cross-validation ― Cross-validation, also noted CV, is a method that is used to select a model that does not rely too much on the initial training set. The different types are summed up in the table below:** 170 | 171 | ⟶교차-검증 ― CV라고도하는 교차-검증은 초기의 학습 세트에 지나치게 의존하지 않는 모델을 선택하는데 사용되는 방법입니다. 다양한 유형이 아래 표에 요약되어 있습니다: 172 | 173 |
174 | 175 | **30. [Training on k−1 folds and assessment on the remaining one, Training on n−p observations and assessment on the p remaining ones]** 176 | 177 | ⟶[k-1 폴드에 대한 학습과 나머지 1폴드에 대한 평가, n-p개 관측치에 대한 학습과 나머지 p개 관측치에 대한 평가] 178 | 179 |
180 | 181 | **31. [Generally k=5 or 10, Case p=1 is called leave-one-out]** 182 | 183 | ⟶[일반적으로 k=5 또는 10, p=1인 케이스는 leave-one-out] 184 | 185 |
186 | 187 | **32. The most commonly used method is called k-fold cross-validation and splits the training data into k folds to validate the model on one fold while training the model on the k−1 other folds, all of this k times. The error is then averaged over the k folds and is named cross-validation error.** 188 | 189 | ⟶가장 일반적으로 사용되는 방법은 k-폴드 교차-검증이라고하며 이는 학습 데이터를 k개의 폴드로 분할하고, 그 중 k-1개의 폴드로 모델을 학습하는 동시에 나머지 1개의 폴드로 모델을 검증합니다. 이 작업을 k번 수행합니다. 오류는 k 폴드에 대해 평균화되고 교차-검증 오류라고 부릅니다. 190 | 191 |
192 | 193 | **33. Regularization ― The regularization procedure aims at avoiding the model to overfit the data and thus deals with high variance issues. The following table sums up the different types of commonly used regularization techniques:** 194 | 195 | ⟶정규화 ― 정규화 절차는 데이터에 대한 모델의 과적합을 피하고 분산이 커지는 문제를 처리하는 것을 목표로 합니다. 다음의 표는 일반적으로 사용되는 정규화 기법의 여러 유형을 요약한 것입니다: 196 | 197 |
198 | 199 | **34. [Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]** 200 | 201 | ⟶[계수를 0으로 축소, 변수 선택에 좋음, 계수를 작게 함, 변수 선택과 작은 계수 간의 트래이드오프] 202 | 203 |
204 | 205 | **35. Diagnostics** 206 | 207 | ⟶진단 208 | 209 |
210 | 211 | **36. Bias ― The bias of a model is the difference between the expected prediction and the correct model that we try to predict for given data points.** 212 | 213 | ⟶편향 ― 모델의 편향은 기대되는 예측과 주어진 데이터 포인트에 대해 예측하려고하는 올바른 모델 간의 차이입니다. 214 | 215 |
216 | 217 | **37. Variance ― The variance of a model is the variability of the model prediction for given data points.** 218 | 219 | ⟶분산 ― 모델의 분산은 주어진 데이터 포인트에 대한 모델 예측의 가변성입니다. 220 | 221 |
222 | 223 | **38. Bias/variance tradeoff ― The simpler the model, the higher the bias, and the more complex the model, the higher the variance.** 224 | 225 | ⟶편향/분산 트래이드오프 ― 모델이 간단할수록 편향이 높아지고 모델이 복잡할수록 분산이 커집니다. 226 | 227 |
228 | 229 | **39. [Symptoms, Regression illustration, classification illustration, deep learning illustration, possible remedies]** 230 | 231 | ⟶[증상, 회귀 일러스트레이션, 분류 일러스트레이션, 딥러닝 일러스트레이션, 가능한 처리방법] 232 | 233 |
234 | 235 | **40. [High training error, Training error close to test error, High bias, Training error slightly lower than test error, Very low training error, Training error much lower than test error, High variance]** 236 | 237 | ⟶[높은 학습 오류, 테스트 오류에 가까운 학습 오류, 높은 편향, 테스트 에러 보다 약간 낮은 학습 오류, 매우 낮은 학습 오류, 테스트 오류보다 훨씬 낮은 학습 오류, 높은 분산] 238 | 239 |
240 | 241 | **41. [Complexify model, Add more features, Train longer, Perform regularization, Get more data]** 242 | 243 | ⟶[모델 복잡화, 특징 추가, 학습 증대, 정규화 수행, 추가 데이터 수집] 244 | 245 |
246 | 247 | **42. Error analysis ― Error analysis is analyzing the root cause of the difference in performance between the current and the perfect models.** 248 | 249 | ⟶오류 분석 ― 오류 분석은 현재 모델과 완벽한 모델 간의 성능 차이의 근본 원인을 분석합니다. 250 | 251 |
252 | 253 | **43. Ablative analysis ― Ablative analysis is analyzing the root cause of the difference in performance between the current and the baseline models.** 254 | 255 | ⟶애블러티브 분석 ― 애블러티브 분석은 현재 모델과 베이스라인 모델 간의 성능 차이의 근본 원인을 분석합니다. 256 | 257 |
258 | 259 | **44. Regression metrics** 260 | 261 | ⟶회귀 측정 항목 262 | 263 |
264 | 265 | **45. [Classification metrics, confusion matrix, accuracy, precision, recall, F1 score, ROC]** 266 | 267 | ⟶[분류 측정 항목, 혼동 행렬, 정확도, 정밀도, 리콜, F1 스코어, ROC] 268 | 269 |
270 | 271 | **46. [Regression metrics, R squared, Mallow's CP, AIC, BIC]** 272 | 273 | ⟶[회귀 측정 항목, R 스퀘어, 맬로우의 CP, AIC, BIC] 274 | 275 |
276 | 277 | **47. [Model selection, cross-validation, regularization]** 278 | 279 | ⟶[모델 선택, 교차-검증, 정규화] 280 | 281 |
282 | 283 | **48. [Diagnostics, Bias/variance tradeoff, error/ablative analysis]** 284 | 285 | ⟶[진단, 편향/분산 트래이드오프, 오류/애블러티브 분석] 286 | --------------------------------------------------------------------------------