├── LICENSE
├── template
│   ├── cs-229-machine-learning-tips-and-tricks.md
│   ├── cs-229-deep-learning.md
│   ├── cs-229-linear-algebra.md
│   ├── cs-229-unsupervised-learning.md
│   ├── cs-229-probability.md
│   ├── cs-221-logic-models.md
│   └── cs-230-deep-learning-tips-and-tricks.md
├── zh-tw
│   ├── cs-229-machine-learning-tips-and-tricks.md
│   ├── cs-229-deep-learning.md
│   ├── cs-229-unsupervised-learning.md
│   ├── cs-229-linear-algebra.md
│   └── cs-229-probability.md
├── ja
│   ├── cs-229-machine-learning-tips-and-tricks.md
│   ├── cs-229-deep-learning.md
│   ├── cs-229-linear-algebra.md
│   └── cs-229-unsupervised-learning.md
├── ko
│   ├── cs-229-machine-learning-tips-and-tricks.md
│   ├── cs-229-linear-algebra.md
│   ├── cheatsheet-deep-learning.md
│   └── cs-229-unsupervised-learning.md
├── CONTRIBUTORS
├── README.md
└── vi
    └── cs-229-machine-learning-tips-and-tricks.md
/LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 Shervine Amidi and Afshine Amidi 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /template/cs-229-machine-learning-tips-and-tricks.md: -------------------------------------------------------------------------------- 1 | **Machine Learning tips and tricks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-229/cheatsheet-machine-learning-tips-and-tricks) 2 | 3 |
4 | 5 | **1. Machine Learning tips and tricks cheatsheet** 6 | 7 | ⟶ 8 | 9 |
10 | 11 | **2. Classification metrics** 12 | 13 | ⟶ 14 | 15 |
16 | 17 | **3. In the context of binary classification, here are the main metrics that are important to track in order to assess the performance of the model.** 18 | 19 | ⟶ 20 | 21 | <br>
22 | 23 | **4. Confusion matrix ― The confusion matrix is used to have a more complete picture when assessing the performance of a model. It is defined as follows:** 24 | 25 | ⟶ 26 | 27 |
28 | 29 | **5. [Predicted class, Actual class]** 30 | 31 | ⟶ 32 | 33 |
34 | 35 | **6. Main metrics ― The following metrics are commonly used to assess the performance of classification models:** 36 | 37 | ⟶ 38 | 39 |
40 | 41 | **7. [Metric, Formula, Interpretation]** 42 | 43 | ⟶ 44 | 45 |
46 | 47 | **8. Overall performance of model** 48 | 49 | ⟶ 50 | 51 |
52 | 53 | **9. How accurate the positive predictions are** 54 | 55 | ⟶ 56 | 57 |
58 | 59 | **10. Coverage of actual positive sample** 60 | 61 | ⟶ 62 | 63 |
64 | 65 | **11. Coverage of actual negative sample** 66 | 67 | ⟶ 68 | 69 |
70 | 71 | **12. Hybrid metric useful for unbalanced classes** 72 | 73 | ⟶ 74 | 75 |
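All of the metrics above follow from the four confusion-matrix counts; a minimal sketch with made-up counts (TP, TN, FP, FN are illustrative values, not from the cheatsheet):

```python
# Hypothetical example counts from a confusion matrix
TP, TN, FP, FN = 40, 45, 5, 10

accuracy = (TP + TN) / (TP + TN + FP + FN)          # overall performance of model
precision = TP / (TP + FP)                          # how accurate the positive predictions are
recall = TP / (TP + FN)                             # coverage of actual positive samples (TPR)
specificity = TN / (TN + FP)                        # coverage of actual negative samples (TNR)
f1 = 2 * precision * recall / (precision + recall)  # hybrid metric for unbalanced classes
```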
76 | 77 | **13. ROC ― The receiver operating curve, also noted ROC, is the plot of TPR versus FPR by varying the threshold. These metrics are summed up in the table below:** 78 | 79 | ⟶ 80 | 81 | <br>
82 | 83 | **14. [Metric, Formula, Equivalent]** 84 | 85 | ⟶ 86 | 87 |
88 | 89 | **15. AUC ― The area under the receiver operating curve, also noted AUC or AUROC, is the area below the ROC as shown in the following figure:** 90 | 91 | ⟶ 92 | 93 | <br>
94 | 95 | **16. [Actual, Predicted]** 96 | 97 | ⟶ 98 | 99 |
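The ROC/AUC definitions above can be sketched directly: sweep the threshold over example scores, record (FPR, TPR) points, and integrate with the trapezoidal rule (the scores and labels below are made up for illustration):

```python
# Hypothetical scores and binary labels
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1,   1,   0,   1,   0,   0]

points = []
for t in sorted(set(scores)) + [1.1]:        # one threshold per distinct score, plus one above all
    tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
    tpr = tp / sum(labels)                   # true positive rate
    fpr = fp / (len(labels) - sum(labels))   # false positive rate
    points.append((fpr, tpr))

points.sort()
auc = sum((x2 - x1) * (y1 + y2) / 2          # trapezoidal rule under the ROC
          for (x1, y1), (x2, y2) in zip(points, points[1:]))
```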
100 | 101 | **17. Basic metrics ― Given a regression model f, the following metrics are commonly used to assess the performance of the model:** 102 | 103 | ⟶ 104 | 105 |
106 | 107 | **18. [Total sum of squares, Explained sum of squares, Residual sum of squares]** 108 | 109 | ⟶ 110 | 111 |
112 | 113 | **19. Coefficient of determination ― The coefficient of determination, often noted R2 or r2, provides a measure of how well the observed outcomes are replicated by the model and is defined as follows:** 114 | 115 | ⟶ 116 | 117 |
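The sums of squares and R2 above can be sketched on made-up observations and predictions (the numbers below are illustrative only):

```python
y     = [3.0, 5.0, 7.0, 9.0]        # observed outcomes
y_hat = [2.8, 5.1, 7.2, 8.9]        # model predictions

mean_y = sum(y) / len(y)
ss_tot = sum((yi - mean_y) ** 2 for yi in y)              # total sum of squares
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # residual sum of squares
r2 = 1 - ss_res / ss_tot                                  # coefficient of determination
```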
118 | 119 | **20. Main metrics ― The following metrics are commonly used to assess the performance of regression models, taking into account the number n of variables that the model considers:** 120 | 121 | ⟶ 122 | 123 | <br>
124 | 125 | **21. where L is the likelihood and ˆσ2 is an estimate of the variance associated with each response.** 126 | 127 | ⟶ 128 | 129 |
130 | 131 | **22. Model selection** 132 | 133 | ⟶ 134 | 135 |
136 | 137 | **23. Vocabulary ― When selecting a model, we distinguish 3 different parts of the data that we have as follows:** 138 | 139 | ⟶ 140 | 141 |
142 | 143 | **24. [Training set, Validation set, Testing set]** 144 | 145 | ⟶ 146 | 147 |
148 | 149 | **25. [Model is trained, Model is assessed, Model gives predictions]** 150 | 151 | ⟶ 152 | 153 |
154 | 155 | **26. [Usually 80% of the dataset, Usually 20% of the dataset]** 156 | 157 | ⟶ 158 | 159 |
160 | 161 | **27. [Also called hold-out or development set, Unseen data]** 162 | 163 | ⟶ 164 | 165 |
166 | 167 | **28. Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set. These are represented in the figure below:** 168 | 169 | ⟶ 170 | 171 |
172 | 173 | **29. Cross-validation ― Cross-validation, also noted CV, is a method that is used to select a model that does not rely too much on the initial training set. The different types are summed up in the table below:** 174 | 175 | ⟶ 176 | 177 |
178 | 179 | **30. [Training on k−1 folds and assessment on the remaining one, Training on n−p observations and assessment on the p remaining ones]** 180 | 181 | ⟶ 182 | 183 |
184 | 185 | **31. [Generally k=5 or 10, Case p=1 is called leave-one-out]** 186 | 187 | ⟶ 188 | 189 |
190 | 191 | **32. The most commonly used method is called k-fold cross-validation and splits the training data into k folds to validate the model on one fold while training the model on the k−1 other folds, all of this k times. The error is then averaged over the k folds and is named cross-validation error.** 192 | 193 | ⟶ 194 | 195 |
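The k-fold procedure above can be sketched as a short function; `train` and `error` below are stand-in callables for illustration, not a real library API:

```python
# Hypothetical sketch of k-fold cross-validation
def k_fold_cv(data, k, train, error):
    folds = [data[i::k] for i in range(k)]         # split data into k roughly equal folds
    errs = []
    for i in range(k):
        held_out = folds[i]                        # validate on fold i...
        train_set = [x for j, f in enumerate(folds) if j != i for x in f]  # ...train on the k-1 others
        model = train(train_set)
        errs.append(error(model, held_out))
    return sum(errs) / k                           # average = cross-validation error

# toy usage: the "model" is the training-set mean, the error is mean squared deviation
cv_err = k_fold_cv(
    list(range(10)), k=5,
    train=lambda d: sum(d) / len(d),
    error=lambda m, held: sum((x - m) ** 2 for x in held) / len(held),
)
```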
196 | 197 | **33. Regularization ― The regularization procedure aims at preventing the model from overfitting the data and thus deals with high variance issues. The following table sums up the different types of commonly used regularization techniques:** 198 | 199 | ⟶ 200 | 201 | <br>
202 | 203 | **34. [Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]** 204 | 205 | ⟶ 206 | 207 |
208 | 209 | **35. Diagnostics** 210 | 211 | ⟶ 212 | 213 |
214 | 215 | **36. Bias ― The bias of a model is the difference between the expected prediction and the correct model that we try to predict for given data points.** 216 | 217 | ⟶ 218 | 219 |
220 | 221 | **37. Variance ― The variance of a model is the variability of the model prediction for given data points.** 222 | 223 | ⟶ 224 | 225 |
226 | 227 | **38. Bias/variance tradeoff ― The simpler the model, the higher the bias, and the more complex the model, the higher the variance.** 228 | 229 | ⟶ 230 | 231 |
232 | 233 | **39. [Symptoms, Regression illustration, classification illustration, deep learning illustration, possible remedies]** 234 | 235 | ⟶ 236 | 237 |
238 | 239 | **40. [High training error, Training error close to test error, High bias, Training error slightly lower than test error, Very low training error, Training error much lower than test error, High variance]** 240 | 241 | ⟶ 242 | 243 |
244 | 245 | **41. [Complexify model, Add more features, Train longer, Perform regularization, Get more data]** 246 | 247 | ⟶ 248 | 249 |
250 | 251 | **42. Error analysis ― Error analysis is analyzing the root cause of the difference in performance between the current and the perfect models.** 252 | 253 | ⟶ 254 | 255 |
256 | 257 | **43. Ablative analysis ― Ablative analysis is analyzing the root cause of the difference in performance between the current and the baseline models.** 258 | 259 | ⟶ 260 | 261 |
262 | 263 | **44. Regression metrics** 264 | 265 | ⟶ 266 | 267 |
268 | 269 | **45. [Classification metrics, confusion matrix, accuracy, precision, recall, F1 score, ROC]** 270 | 271 | ⟶ 272 | 273 |
274 | 275 | **46. [Regression metrics, R squared, Mallow's CP, AIC, BIC]** 276 | 277 | ⟶ 278 | 279 |
280 | 281 | **47. [Model selection, cross-validation, regularization]** 282 | 283 | ⟶ 284 | 285 |
286 | 287 | **48. [Diagnostics, Bias/variance tradeoff, error/ablative analysis]** 288 | 289 | ⟶ 290 | -------------------------------------------------------------------------------- /template/cs-229-deep-learning.md: -------------------------------------------------------------------------------- 1 | **Deep learning translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-229/cheatsheet-deep-learning) 2 | 3 |
4 | 5 | **1. Deep Learning cheatsheet** 6 | 7 | ⟶ 8 | 9 |
10 | 11 | **2. Neural Networks** 12 | 13 | ⟶ 14 | 15 |
16 | 17 | **3. Neural networks are a class of models that are built with layers. Commonly used types of neural networks include convolutional and recurrent neural networks.** 18 | 19 | ⟶ 20 | 21 |
22 | 23 | **4. Architecture ― The vocabulary around neural networks architectures is described in the figure below:** 24 | 25 | ⟶ 26 | 27 |
28 | 29 | **5. [Input layer, hidden layer, output layer]** 30 | 31 | ⟶ 32 | 33 |
34 | 35 | **6. By noting i the ith layer of the network and j the jth hidden unit of the layer, we have:** 36 | 37 | ⟶ 38 | 39 |
40 | 41 | **7. where we note w, b, z the weight, bias and output respectively.** 42 | 43 | ⟶ 44 | 45 |
46 | 47 | **8. Activation function ― Activation functions are used at the end of a hidden unit to introduce non-linear complexities to the model. Here are the most common ones:** 48 | 49 | ⟶ 50 | 51 |
52 | 53 | **9. [Sigmoid, Tanh, ReLU, Leaky ReLU]** 54 | 55 | ⟶ 56 | 57 |
58 | 59 | **10. Cross-entropy loss ― In the context of neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:** 60 | 61 | ⟶ 62 | 63 |
64 | 65 | **11. Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. This can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.** 66 | 67 | ⟶ 68 | 69 |
70 | 71 | **12. Backpropagation ― Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to weight w is computed using chain rule and is of the following form:** 72 | 73 | ⟶ 74 | 75 |
76 | 77 | **13. As a result, the weight is updated as follows:** 78 | 79 | ⟶ 80 | 81 |
82 | 83 | **14. Updating weights ― In a neural network, weights are updated as follows:** 84 | 85 | ⟶ 86 | 87 |
88 | 89 | **15. Step 1: Take a batch of training data.** 90 | 91 | ⟶ 92 | 93 |
94 | 95 | **16. Step 2: Perform forward propagation to obtain the corresponding loss.** 96 | 97 | ⟶ 98 | 99 |
100 | 101 | **17. Step 3: Backpropagate the loss to get the gradients.** 102 | 103 | ⟶ 104 | 105 |
106 | 107 | **18. Step 4: Use the gradients to update the weights of the network.** 108 | 109 | ⟶ 110 | 111 |
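Steps 1-4 above can be sketched on a single linear unit with squared loss; the data, shapes, and learning rate below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))                 # step 1: take a batch of training data
y = X @ np.array([1.0, -2.0, 0.5])          # targets generated by a known weight vector
w = np.zeros(3)
alpha = 0.1                                 # learning rate

for _ in range(2000):
    z = X @ w                               # step 2: forward propagation...
    loss = ((z - y) ** 2).mean()            # ...to obtain the corresponding loss
    grad = 2 * X.T @ (z - y) / len(X)       # step 3: backpropagate the loss to get the gradients
    w -= alpha * grad                       # step 4: use the gradients to update the weights
```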
112 | 113 | **19. Dropout ― Dropout is a technique meant to prevent overfitting the training data by dropping out units in a neural network. In practice, neurons are either dropped with probability p or kept with probability 1−p.** 114 | 115 | ⟶ 116 | 117 | <br>
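The drop-with-probability-p rule can be sketched as a random mask; the common "inverted dropout" rescaling by 1/(1−p) below is an extra convention (it keeps the expected activation unchanged) and is not stated in the cheatsheet:

```python
import numpy as np

rng = np.random.default_rng(42)
p = 0.5                                   # probability of dropping a unit
a = np.ones(10_000)                       # made-up activations of a hidden layer

mask = rng.random(a.shape) >= p           # each unit kept with probability 1 - p
a_dropped = a * mask / (1 - p)            # inverted-dropout scaling
```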
118 | 119 | **20. Convolutional Neural Networks** 120 | 121 | ⟶ 122 | 123 |
124 | 125 | **21. Convolutional layer requirement ― By noting W the input volume size, F the size of the convolutional layer neurons, P the amount of zero padding, then the number of neurons N that fit in a given volume is such that:** 126 | 127 | ⟶ 128 | 129 |
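A minimal sketch of the usual output-size formula N = (W − F + 2P)/S + 1; note the stride S is an extra assumption here, since the statement above fixes only W, F, and P:

```python
def conv_output_size(W, F, P, S=1):
    """Number of neurons N that fit along one dimension of the input volume.

    W: input volume size, F: filter size, P: zero padding, S: stride (assumed).
    """
    assert (W - F + 2 * P) % S == 0, "filter placements must tile the volume exactly"
    return (W - F + 2 * P) // S + 1

n = conv_output_size(W=32, F=5, P=2, S=1)  # e.g. a 32-wide input with 5-wide filters
```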
130 | 131 | **22. Batch normalization ― It is a step of hyperparameters γ,β that normalizes the batch {xi}. By noting μB,σ2B the mean and variance of the batch that we want to correct, it is done as follows:** 132 | 133 | ⟶ 134 | 135 | <br>
136 | 137 | **23. It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.** 138 | 139 | ⟶ 140 | 141 |
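The normalization step above can be sketched directly: subtract the batch mean μB, divide by the batch standard deviation, then rescale with learnable γ, β (the batch values and hyperparameters below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))   # a made-up batch of activations
gamma, beta, eps = 1.0, 0.0, 1e-5                  # hyperparameters γ, β and a stability constant

mu_b = x.mean(axis=0)                              # batch mean μB
var_b = x.var(axis=0)                              # batch variance σ²B
x_norm = (x - mu_b) / np.sqrt(var_b + eps)         # zero mean, unit variance per feature
out = gamma * x_norm + beta                        # learnable rescaling
```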
142 | 143 | **24. Recurrent Neural Networks** 144 | 145 | ⟶ 146 | 147 |
148 | 149 | **25. Types of gates ― Here are the different types of gates that we encounter in a typical recurrent neural network:** 150 | 151 | ⟶ 152 | 153 |
154 | 155 | **26. [Input gate, forget gate, gate, output gate]** 156 | 157 | ⟶ 158 | 159 |
160 | 161 | **27. [Write to cell or not?, Erase a cell or not?, How much to write to cell?, How much to reveal cell?]** 162 | 163 | ⟶ 164 | 165 |
166 | 167 | **28. LSTM ― A long short-term memory (LSTM) network is a type of RNN model that avoids the vanishing gradient problem by adding 'forget' gates.** 168 | 169 | ⟶ 170 | 171 |
172 | 173 | **29. Reinforcement Learning and Control** 174 | 175 | ⟶ 176 | 177 |
178 | 179 | **30. The goal of reinforcement learning is for an agent to learn how to evolve in an environment.** 180 | 181 | ⟶ 182 | 183 |
184 | 185 | **31. Definitions** 186 | 187 | ⟶ 188 | 189 |
190 | 191 | **32. Markov decision processes ― A Markov decision process (MDP) is a 5-tuple (S,A,{Psa},γ,R) where:** 192 | 193 | ⟶ 194 | 195 |
196 | 197 | **33. S is the set of states** 198 | 199 | ⟶ 200 | 201 |
202 | 203 | **34. A is the set of actions** 204 | 205 | ⟶ 206 | 207 |
208 | 209 | **35. {Psa} are the state transition probabilities for s∈S and a∈A** 210 | 211 | ⟶ 212 | 213 |
214 | 215 | **36. γ∈[0,1[ is the discount factor** 216 | 217 | ⟶ 218 | 219 |
220 | 221 | **37. R:S×A⟶R or R:S⟶R is the reward function that the algorithm wants to maximize** 222 | 223 | ⟶ 224 | 225 |
226 | 227 | **38. Policy ― A policy π is a function π:S⟶A that maps states to actions.** 228 | 229 | ⟶ 230 | 231 |
232 | 233 | **39. Remark: we say that we execute a given policy π if given a state s we take the action a=π(s).** 234 | 235 | ⟶ 236 | 237 |
238 | 239 | **40. Value function ― For a given policy π and a given state s, we define the value function Vπ as follows:** 240 | 241 | ⟶ 242 | 243 |
244 | 245 | **41. Bellman equation ― The optimal Bellman equations characterize the value function Vπ∗ of the optimal policy π∗:** 246 | 247 | ⟶ 248 | 249 | <br>
250 | 251 | **42. Remark: we note that the optimal policy π∗ for a given state s is such that:** 252 | 253 | ⟶ 254 | 255 |
256 | 257 | **43. Value iteration algorithm ― The value iteration algorithm is in two steps:** 258 | 259 | ⟶ 260 | 261 |
262 | 263 | **44. 1) We initialize the value:** 264 | 265 | ⟶ 266 | 267 |
268 | 269 | **45. 2) We iterate the value based on the values before:** 270 | 271 | ⟶ 272 | 273 |
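The two steps above can be sketched on a tiny made-up MDP (two states, two actions, deterministic transitions, a state-only reward R:S⟶R, and γ=0.9 — all illustrative choices):

```python
S, A = [0, 1], [0, 1]
gamma = 0.9
P = {(0, 0): 0, (0, 1): 1, (1, 0): 0, (1, 1): 1}   # next state for each (s, a)
R = {0: 0.0, 1: 1.0}                               # reward of being in a state

V = {s: 0.0 for s in S}                            # 1) initialize the value
for _ in range(200):                               # 2) iterate using the Bellman update
    V = {s: R[s] + gamma * max(V[P[(s, a)]] for a in A) for s in S}
```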
274 | 275 | **46. Maximum likelihood estimate ― The maximum likelihood estimates for the state transition probabilities are as follows:** 276 | 277 | ⟶ 278 | 279 |
280 | 281 | **47. times took action a in state s and got to s′** 282 | 283 | ⟶ 284 | 285 |
286 | 287 | **48. times took action a in state s** 288 | 289 | ⟶ 290 | 291 |
292 | 293 | **49. Q-learning ― Q-learning is a model-free estimation of Q, which is done as follows:** 294 | 295 | ⟶ 296 | 297 |
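A model-free sketch of the tabular Q-learning update Q(s,a) ← Q(s,a) + α[R + γ max_a′ Q(s′,a′) − Q(s,a)], on the same style of toy MDP (transitions, rewards, α, and γ are made-up choices):

```python
import random

random.seed(0)
gamma, alpha = 0.9, 0.1
step = {(0, 0): 0, (0, 1): 1, (1, 0): 0, (1, 1): 1}  # deterministic toy transitions
reward = {0: 0.0, 1: 1.0}                            # reward of the state reached

Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
s = 0
for _ in range(20_000):
    a = random.choice((0, 1))                        # explore actions uniformly
    s_next = step[(s, a)]
    r = reward[s_next]
    # model-free update: no transition probabilities are ever used
    Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, 0)], Q[(s_next, 1)]) - Q[(s, a)])
    s = s_next
```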
298 | 299 | **50. View PDF version on GitHub** 300 | 301 | ⟶ 302 | 303 |
304 | 305 | **51. [Neural Networks, Architecture, Activation function, Backpropagation, Dropout]** 306 | 307 | ⟶ 308 | 309 |
310 | 311 | **52. [Convolutional Neural Networks, Convolutional layer, Batch normalization]** 312 | 313 | ⟶ 314 | 315 |
316 | 317 | **53. [Recurrent Neural Networks, Gates, LSTM]** 318 | 319 | ⟶ 320 | 321 |
322 | 323 | **54. [Reinforcement learning, Markov decision processes, Value/policy iteration, Approximate dynamic programming, Policy search]** 324 | 325 | ⟶ 326 | -------------------------------------------------------------------------------- /template/cs-229-linear-algebra.md: -------------------------------------------------------------------------------- 1 | **Linear Algebra and Calculus translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-229/refresher-algebra-calculus) 2 | 3 |
4 | 5 | **1. Linear Algebra and Calculus refresher** 6 | 7 | ⟶ 8 | 9 |
10 | 11 | **2. General notations** 12 | 13 | ⟶ 14 | 15 |
16 | 17 | **3. Definitions** 18 | 19 | ⟶ 20 | 21 |
22 | 23 | **4. Vector ― We note x∈Rn a vector with n entries, where xi∈R is the ith entry:** 24 | 25 | ⟶ 26 | 27 |
28 | 29 | **5. Matrix ― We note A∈Rm×n a matrix with m rows and n columns, where Ai,j∈R is the entry located in the ith row and jth column:** 30 | 31 | ⟶ 32 | 33 |
34 | 35 | **6. Remark: the vector x defined above can be viewed as a n×1 matrix and is more particularly called a column-vector.** 36 | 37 | ⟶ 38 | 39 |
40 | 41 | **7. Main matrices** 42 | 43 | ⟶ 44 | 45 |
46 | 47 | **8. Identity matrix ― The identity matrix I∈Rn×n is a square matrix with ones in its diagonal and zero everywhere else:** 48 | 49 | ⟶ 50 | 51 |
52 | 53 | **9. Remark: for all matrices A∈Rn×n, we have A×I=I×A=A.** 54 | 55 | ⟶ 56 | 57 |
58 | 59 | **10. Diagonal matrix ― A diagonal matrix D∈Rn×n is a square matrix with nonzero values in its diagonal and zero everywhere else:** 60 | 61 | ⟶ 62 | 63 |
64 | 65 | **11. Remark: we also note D as diag(d1,...,dn).** 66 | 67 | ⟶ 68 | 69 |
70 | 71 | **12. Matrix operations** 72 | 73 | ⟶ 74 | 75 |
76 | 77 | **13. Multiplication** 78 | 79 | ⟶ 80 | 81 |
82 | 83 | **14. Vector-vector ― There are two types of vector-vector products:** 84 | 85 | ⟶ 86 | 87 |
88 | 89 | **15. inner product: for x,y∈Rn, we have:** 90 | 91 | ⟶ 92 | 93 |
94 | 95 | **16. outer product: for x∈Rm,y∈Rn, we have:** 96 | 97 | ⟶ 98 | 99 |
100 | 101 | **17. Matrix-vector ― The product of matrix A∈Rm×n and vector x∈Rn is a vector of size Rm, such that:** 102 | 103 | ⟶ 104 | 105 | <br>
106 | 107 | **18. where aTr,i are the vector rows and ac,j are the vector columns of A, and xi are the entries of x.** 108 | 109 | ⟶ 110 | 111 |
112 | 113 | **19. Matrix-matrix ― The product of matrices A∈Rm×n and B∈Rn×p is a matrix of size Rm×p, such that:** 114 | 115 | ⟶ 116 | 117 | <br>
118 | 119 | **20. where aTr,i,bTr,i are the vector rows and ac,j,bc,j are the vector columns of A and B respectively** 120 | 121 | ⟶ 122 | 123 |
124 | 125 | **21. Other operations** 126 | 127 | ⟶ 128 | 129 |
130 | 131 | **22. Transpose ― The transpose of a matrix A∈Rm×n, noted AT, is such that its entries are flipped:** 132 | 133 | ⟶ 134 | 135 |
136 | 137 | **23. Remark: for matrices A,B, we have (AB)T=BTAT** 138 | 139 | ⟶ 140 | 141 |
142 | 143 | **24. Inverse ― The inverse of an invertible square matrix A is noted A−1 and is the only matrix such that:** 144 | 145 | ⟶ 146 | 147 |
148 | 149 | **25. Remark: not all square matrices are invertible. Also, for matrices A,B, we have (AB)−1=B−1A−1** 150 | 151 | ⟶ 152 | 153 |
154 | 155 | **26. Trace ― The trace of a square matrix A, noted tr(A), is the sum of its diagonal entries:** 156 | 157 | ⟶ 158 | 159 |
160 | 161 | **27. Remark: for matrices A,B, we have tr(AT)=tr(A) and tr(AB)=tr(BA)** 162 | 163 | ⟶ 164 | 165 |
166 | 167 | **28. Determinant ― The determinant of a square matrix A∈Rn×n, noted |A| or det(A) is expressed recursively in terms of A∖i,∖j, which is the matrix A without its ith row and jth column, as follows:** 168 | 169 | ⟶ 170 | 171 |
172 | 173 | **29. Remark: A is invertible if and only if |A|≠0. Also, |AB|=|A||B| and |AT|=|A|.** 174 | 175 | ⟶ 176 | 177 |
178 | 179 | **30. Matrix properties** 180 | 181 | ⟶ 182 | 183 |
184 | 185 | **31. Definitions** 186 | 187 | ⟶ 188 | 189 |
190 | 191 | **32. Symmetric decomposition ― A given matrix A can be expressed in terms of its symmetric and antisymmetric parts as follows:** 192 | 193 | ⟶ 194 | 195 |
196 | 197 | **33. [Symmetric, Antisymmetric]** 198 | 199 | ⟶ 200 | 201 |
202 | 203 | **34. Norm ― A norm is a function N:V⟶[0,+∞[ where V is a vector space, and such that for all x,y∈V, we have:** 204 | 205 | ⟶ 206 | 207 |
208 | 209 | **35. N(ax)=|a|N(x) for a scalar** 210 | 211 | ⟶ 212 | 213 |
214 | 215 | **36. if N(x)=0, then x=0** 216 | 217 | ⟶ 218 | 219 |
220 | 221 | **37. For x∈V, the most commonly used norms are summed up in the table below:** 222 | 223 | ⟶ 224 | 225 |
226 | 227 | **38. [Norm, Notation, Definition, Use case]** 228 | 229 | ⟶ 230 | 231 |
232 | 233 | **39. Linear dependence ― A set of vectors is said to be linearly dependent if one of the vectors in the set can be defined as a linear combination of the others.** 234 | 235 | ⟶ 236 | 237 | <br>
238 | 239 | **40. Remark: if no vector can be written this way, then the vectors are said to be linearly independent** 240 | 241 | ⟶ 242 | 243 |
244 | 245 | **41. Matrix rank ― The rank of a given matrix A is noted rank(A) and is the dimension of the vector space generated by its columns. This is equivalent to the maximum number of linearly independent columns of A.** 246 | 247 | ⟶ 248 | 249 |
250 | 251 | **42. Positive semi-definite matrix ― A matrix A∈Rn×n is positive semi-definite (PSD) and is noted A⪰0 if we have:** 252 | 253 | ⟶ 254 | 255 |
256 | 257 | **43. Remark: similarly, a matrix A is said to be positive definite, and is noted A≻0, if it is a PSD matrix which satisfies for all non-zero vector x, xTAx>0.** 258 | 259 | ⟶ 260 | 261 |
262 | 263 | **44. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:** 264 | 265 | ⟶ 266 | 267 |
268 | 269 | **45. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:** 270 | 271 | ⟶ 272 | 273 |
274 | 275 | **46. diagonal** 276 | 277 | ⟶ 278 | 279 |
280 | 281 | **47. Singular-value decomposition ― For a given matrix A of dimensions m×n, the singular-value decomposition (SVD) is a factorization technique that guarantees the existence of U m×m unitary, Σ m×n diagonal and V n×n unitary matrices, such that:** 282 | 283 | ⟶ 284 | 285 |
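The factorization above can be checked numerically with NumPy's SVD, which returns U (m×m), the singular values of Σ, and Vᵀ (n×n); the example matrix is made up:

```python
import numpy as np

A = np.array([[ 3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])           # a 2×3 example matrix

U, s, Vt = np.linalg.svd(A)                # full matrices: U is 2×2, Vt is 3×3
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)       # embed the singular values in an m×n Σ
A_rebuilt = U @ Sigma @ Vt                 # A = U Σ Vᵀ
```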
286 | 287 | **48. Matrix calculus** 288 | 289 | ⟶ 290 | 291 |
292 | 293 | **49. Gradient ― Let f:Rm×n→R be a function and A∈Rm×n be a matrix. The gradient of f with respect to A is a m×n matrix, noted ∇Af(A), such that:** 294 | 295 | ⟶ 296 | 297 |
298 | 299 | **50. Remark: the gradient of f is only defined when f is a function that returns a scalar.** 300 | 301 | ⟶ 302 | 303 |
304 | 305 | **51. Hessian ― Let f:Rn→R be a function and x∈Rn be a vector. The hessian of f with respect to x is a n×n symmetric matrix, noted ∇2xf(x), such that:** 306 | 307 | ⟶ 308 | 309 |
310 | 311 | **52. Remark: the hessian of f is only defined when f is a function that returns a scalar** 312 | 313 | ⟶ 314 | 315 |
316 | 317 | **53. Gradient operations ― For matrices A,B,C, the following gradient properties are worth having in mind:** 318 | 319 | ⟶ 320 | 321 |
322 | 323 | **54. [General notations, Definitions, Main matrices]** 324 | 325 | ⟶ 326 | 327 |
328 | 329 | **55. [Matrix operations, Multiplication, Other operations]** 330 | 331 | ⟶ 332 | 333 |
334 | 335 | **56. [Matrix properties, Norm, Eigenvalue/Eigenvector, Singular-value decomposition]** 336 | 337 | ⟶ 338 | 339 |
340 | 341 | **57. [Matrix calculus, Gradient, Hessian, Operations]** 342 | 343 | ⟶ 344 | -------------------------------------------------------------------------------- /zh-tw/cs-229-machine-learning-tips-and-tricks.md: -------------------------------------------------------------------------------- 1 | 1. **Machine Learning tips and tricks cheatsheet** 2 | 3 | ⟶ 4 | 機器學習秘訣和技巧參考手冊 5 |
6 | 7 | 2. **Classification metrics** 8 | 9 | ⟶ 10 | 分類器的評估指標 11 |
12 | 13 | 3. **In a context of a binary classification, here are the main metrics that are important to track in order to assess the performance of the model.** 14 | 15 | ⟶ 16 | 在二元分類的問題上,底下是主要用來衡量模型表現的指標 17 |
18 | 19 | 4. **Confusion matrix ― The confusion matrix is used to have a more complete picture when assessing the performance of a model. It is defined as follows:** 20 | 21 | ⟶ 22 | 混淆矩陣 - 混淆矩陣是用來衡量模型整體表現的指標 23 |
24 | 25 | 5. **[Predicted class, Actual class]** 26 | 27 | ⟶ 28 | [預測類別, 真實類別] 29 |
30 | 31 | 6. **Main metrics ― The following metrics are commonly used to assess the performance of classification models:** 32 | 33 | ⟶ 34 | 主要的衡量指標 - 底下的指標經常用在評估分類模型的表現 35 |
36 | 37 | 7. **[Metric, Formula, Interpretation]** 38 | 39 | ⟶ 40 | [指標, 公式, 解釋] 41 |
42 | 43 | 8. **Overall performance of model** 44 | 45 | ⟶ 46 | 模型的整體表現 47 |
48 | 49 | 9. **How accurate the positive predictions are** 50 | 51 | ⟶ 52 | 正類別的預測有多精準 53 | <br>
54 | 55 | 10. **Coverage of actual positive sample** 56 | 57 | ⟶ 58 | 實際正的樣本的覆蓋率有多少 59 |
60 | 61 | 11. **Coverage of actual negative sample** 62 | 63 | ⟶ 64 | 實際負的樣本的覆蓋率 65 |
66 | 67 | 12. **Hybrid metric useful for unbalanced classes** 68 | 69 | ⟶ 70 | 對於非平衡類別相當有用的混合指標 71 |
72 | 73 | 13. **ROC ― The receiver operating curve, also noted ROC, is the plot of TPR versus FPR by varying the threshold. These metrics are summed up in the table below:** 74 | 75 | ⟶ 76 | ROC - 接收者操作特徵曲線 (ROC Curve),又被稱為 ROC,是透過改變閥值來表示 TPR 和 FPR 之間關係的圖形。這些指標總結如下: 77 | <br>
78 | 79 | 14. **[Metric, Formula, Equivalent]** 80 | 81 | ⟶ 82 | [衡量指標, 公式, 等同於] 83 |
84 | 85 | 15. **AUC ― The area under the receiver operating curve, also noted AUC or AUROC, is the area below the ROC as shown in the following figure:** 86 | 87 | ⟶ 88 | AUC - 在接收者操作特徵曲線 (ROC) 底下的面積,也稱為 AUC 或 AUROC: 89 | <br>
90 | 91 | 16. **[Actual, Predicted]** 92 | 93 | ⟶ 94 | [實際值, 預測值] 95 |
96 | 97 | 17. **Basic metrics ― Given a regression model f, the following metrics are commonly used to assess the performance of the model:** 98 | 99 | ⟶ 100 | 基本的指標 - 給定一個迴歸模型 f,底下是經常用來評估此模型的指標: 101 |
102 | 103 | 18. **[Total sum of squares, Explained sum of squares, Residual sum of squares]** 104 | 105 | ⟶ 106 | [總平方和, 被解釋平方和, 殘差平方和] 107 |
108 | 109 | 19. **Coefficient of determination ― The coefficient of determination, often noted R2 or r2, provides a measure of how well the observed outcomes are replicated by the model and is defined as follows:** 110 | 111 | ⟶ 112 | 決定係數 - 決定係數又被稱為 R2 or r2,它提供了模型是否具備復現觀測結果的能力。定義如下: 113 |
114 | 115 | 20. **Main metrics ― The following metrics are commonly used to assess the performance of regression models, by taking into account the number of variables n that they take into consideration:** 116 | 117 | ⟶ 118 | 主要的衡量指標 - 藉由考量變數 n 的數量,我們經常用使用底下的指標來衡量迴歸模型的表現: 119 |
120 | 121 | 21. **where L is the likelihood and ˆσ2 is an estimate of the variance associated with each response.** 122 | 123 | ⟶ 124 | 當中,L 代表的是概似估計,ˆσ2 則是變異數的估計 125 |
126 | 127 | 22. **Model selection** 128 | 129 | ⟶ 130 | 模型選擇 131 |
132 | 133 | 23. **Vocabulary ― When selecting a model, we distinguish 3 different parts of the data that we have as follows:** 134 | 135 | ⟶ 136 | 詞彙 - 當進行模型選擇時,我們會針對資料進行以下區分: 137 |
138 | 139 | 24. **[Training set, Validation set, Testing set]** 140 | 141 | ⟶ 142 | [訓練資料集, 驗證資料集, 測試資料集] 143 |
144 | 145 | 25. **[Model is trained, Model is assessed, Model gives predictions]** 146 | 147 | ⟶ 148 | [用來訓練模型, 用來評估模型, 模型用來預測用的資料集] 149 |
150 | 151 | 26. **[Usually 80% of the dataset, Usually 20% of the dataset]** 152 | 153 | ⟶ 154 | [通常是 80% 的資料集, 通常是 20% 的資料集] 155 |
156 | 157 | 27. **[Also called hold-out or development set, Unseen data]** 158 | 159 | ⟶ 160 | [又被稱為 hold-out 資料集或開發資料集, 模型沒看過的資料集] 161 |
162 | 163 | 28. **Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set. These are represented in the figure below:** 164 | 165 | ⟶ 166 | 當模型被選擇後,就會使用整個資料集來做訓練,並且在沒看過的資料集上做測試。你可以參考以下的圖表: 167 |
168 | 169 | 29. **Cross-validation ― Cross-validation, also noted CV, is a method that is used to select a model that does not rely too much on the initial training set. The different types are summed up in the table below:** 170 | 171 | ⟶ 172 | 交叉驗證 - 交叉驗證,又稱之為 CV,它是一種不特別依賴初始訓練集來挑選模型的方法。幾種不同的方法如下: 173 |
174 | 175 | 30. **[Training on k−1 folds and assessment on the remaining one, Training on n−p observations and assessment on the p remaining ones]** 176 | 177 | ⟶ 178 | [把資料分成 k 份,利用 k-1 份資料來訓練,剩下的一份用來評估模型效能, 在 n-p 份資料上進行訓練,剩下的 p 份資料用來評估模型效能] 179 |
180 | 181 | 31. **[Generally k=5 or 10, Case p=1 is called leave-one-out]** 182 | 183 | ⟶ 184 | [一般來說 k=5 或 10, 當 p=1 時,又稱為 leave-one-out] 185 |
186 | 187 | 32. **The most commonly used method is called k-fold cross-validation and splits the training data into k folds to validate the model on one fold while training the model on the k−1 other folds, all of this k times. The error is then averaged over the k folds and is named cross-validation error.** 188 | 189 | ⟶ 190 | 最常用到的方法叫做 k-fold 交叉驗證。它將訓練資料切成 k 份,在 k-1 份資料上進行訓練,而剩下的一份用來評估模型的效能,這樣的流程會重複 k 次。最後計算出來的模型損失是 k 次結果的平均,又稱為交叉驗證損失值。 191 | <br>
192 | 193 | 33. **Regularization ― The regularization procedure aims at preventing the model from overfitting the data and thus deals with high variance issues. The following table sums up the different types of commonly used regularization techniques:** 194 | 195 | ⟶ 196 | 正規化 - 正規化的目的是為了避免模型對於訓練資料過擬合,進而導致高方差。底下的表格整理了常見的正規化技巧: 197 | <br>
198 | 199 | 34. **[Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]** 200 | 201 | ⟶ 202 | [將係數縮減為 0, 有利變數的選擇, 將係數變得更小, 在變數的選擇和小係數之間作權衡] 203 |
204 | 205 | 35. **Diagnostics** 206 | 207 | ⟶ 208 | 診斷 209 |
210 | 211 | 36. **Bias ― The bias of a model is the difference between the expected prediction and the correct model that we try to predict for given data points.** 212 | 213 | ⟶ 214 | 偏差 - 模型的偏差指的是模型預測值與實際值之間的差異 215 |
216 | 217 | 37. **Variance ― The variance of a model is the variability of the model prediction for given data points.** 218 | 219 | ⟶ 220 | 變異 - 變異指的是模型在預測資料時的變異程度 221 |
222 | 223 | 38. **Bias/variance tradeoff ― The simpler the model, the higher the bias, and the more complex the model, the higher the variance.** 224 | 225 | ⟶ 226 | 偏差/變異的權衡 - 越簡單的模型,偏差就越大。而越複雜的模型,變異就越大 227 |
228 | 229 | 39. **[Symptoms, Regression illustration, classification illustration, deep learning illustration, possible remedies]** 230 | 231 | ⟶ 232 | [現象, 迴歸圖示, 分類圖示, 深度學習圖示, 可能的解法] 233 |
234 | 235 | 40. **[High training error, Training error close to test error, High bias, Training error slightly lower than test error, Very low training error, Training error much lower than test error, High variance]** 236 | 237 | ⟶ 238 | [訓練錯誤較高, 訓練錯誤和測試錯誤接近, 高偏差, 訓練誤差會稍微比測試誤差低, 訓練誤差很低, 訓練誤差比測試誤差低很多, 高變異] 239 |
240 | 241 | 41. **[Complexify model, Add more features, Train longer, Perform regularization, Get more data]** 242 | 243 | ⟶ 244 | [使用較複雜的模型, 增加更多特徵, 訓練更久, 採用正規化的方法, 取得更多資料] 245 | <br>
246 | 247 | 42. **Error analysis ― Error analysis is analyzing the root cause of the difference in performance between the current and the perfect models.** 248 | 249 | ⟶ 250 | 誤差分析 - 誤差分析指的是分析目前使用的模型和最佳模型之間差距的根本原因 251 |
252 | 253 | 43. **Ablative analysis ― Ablative analysis is analyzing the root cause of the difference in performance between the current and the baseline models.** 254 | 255 | ⟶ 256 | 銷蝕分析 (Ablative analysis) - 銷蝕分析指的是分析目前模型和基準模型之間差異的根本原因 257 |
258 | -------------------------------------------------------------------------------- /template/cs-229-unsupervised-learning.md: -------------------------------------------------------------------------------- 1 | **Unsupervised Learning translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-229/cheatsheet-unsupervised-learning) 2 | 3 |
4 | 5 | **1. Unsupervised Learning cheatsheet** 6 | 7 | ⟶ 8 | 9 |
10 | 11 | **2. Introduction to Unsupervised Learning** 12 | 13 | ⟶ 14 | 15 |
16 | 17 | **3. Motivation ― The goal of unsupervised learning is to find hidden patterns in unlabeled data {x(1),...,x(m)}.** 18 | 19 | ⟶ 20 | 21 |
22 | 23 | **4. Jensen's inequality ― Let f be a convex function and X a random variable. We have the following inequality:** 24 | 25 | ⟶ 26 | 27 |
28 | 29 | **5. Clustering** 30 | 31 | ⟶ 32 | 33 |
34 | 35 | **6. Expectation-Maximization** 36 | 37 | ⟶ 38 | 39 |
40 | 41 | **7. Latent variables ― Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted z. Here are the most common settings where there are latent variables:** 42 | 43 | ⟶ 44 | 45 |
46 | 47 | **8. [Setting, Latent variable z, Comments]** 48 | 49 | ⟶ 50 | 51 |
52 | 53 | **9. [Mixture of k Gaussians, Factor analysis]** 54 | 55 | ⟶ 56 | 57 |
58 | 59 | **10. Algorithm ― The Expectation-Maximization (EM) algorithm gives an efficient method at estimating the parameter θ through maximum likelihood estimation by repeatedly constructing a lower-bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:** 60 | 61 | ⟶ 62 | 63 |
64 | 65 | **11. E-step: Evaluate the posterior probability Qi(z(i)) that each data point x(i) came from a particular cluster z(i) as follows:** 66 | 67 | ⟶ 68 | 69 |
70 | 71 | **12. M-step: Use the posterior probabilities Qi(z(i)) as cluster specific weights on data points x(i) to separately re-estimate each cluster model as follows:** 72 | 73 | ⟶ 74 | 75 |
76 | 77 | **13. [Gaussians initialization, Expectation step, Maximization step, Convergence]** 78 | 79 | ⟶ 80 | 81 |
82 | 83 | **14. k-means clustering** 84 | 85 | ⟶ 86 | 87 |
88 | 89 | **15. We note c(i) the cluster of data point i and μj the center of cluster j.** 90 | 91 | ⟶ 92 | 93 |
94 | 95 | **16. Algorithm ― After randomly initializing the cluster centroids μ1,μ2,...,μk∈Rn, the k-means algorithm repeats the following step until convergence:** 96 | 97 | ⟶ 98 | 99 |
100 | 101 | **17. [Means initialization, Cluster assignment, Means update, Convergence]** 102 | 103 | ⟶ 104 | 105 |
106 | 107 | **18. Distortion function ― In order to see if the algorithm converges, we look at the distortion function defined as follows:** 108 | 109 | ⟶ 110 | 111 |
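To make items 16-18 concrete, here is a minimal pure-Python sketch of the k-means loop together with the distortion function used to check convergence. The toy points and the initial centroids are hypothetical illustration values, not from the cheatsheet:

```python
# Minimal k-means sketch: alternate cluster assignment and means update,
# then evaluate the distortion J (sum of squared distances to centroids).
def assign(points, centroids):
    def d2(p, c):
        return sum((pi - ci) ** 2 for pi, ci in zip(p, c))
    return [min(range(len(centroids)), key=lambda j: d2(p, centroids[j]))
            for p in points]

def update(points, labels, k):
    # assumes no cluster becomes empty (true for this toy initialization)
    cents = []
    for j in range(k):
        members = [p for p, l in zip(points, labels) if l == j]
        cents.append(tuple(sum(coord) / len(members) for coord in zip(*members)))
    return cents

def distortion(points, labels, centroids):
    return sum(sum((pi - ci) ** 2 for pi, ci in zip(p, centroids[l]))
               for p, l in zip(points, labels))

pts = [(0.0, 0.0), (0.0, 1.0), (9.0, 9.0), (10.0, 10.0)]
cents = [(0.0, 0.0), (10.0, 10.0)]   # hypothetical random initialization
for _ in range(5):                   # a few iterations reach convergence here
    labels = assign(pts, cents)
    cents = update(pts, labels, k=2)
J = distortion(pts, labels, cents)
```

A plateauing J across iterations is the practical convergence signal described in item 18.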
112 | 113 | **19. Hierarchical clustering** 114 | 115 | ⟶ 116 | 117 |
118 | 119 | **20. Algorithm ― It is a clustering algorithm with an agglomerative hierarchical approach that build nested clusters in a successive manner.** 120 | 121 | ⟶ 122 | 123 |
124 | 125 | **21. Types ― There are different sorts of hierarchical clustering algorithms that aims at optimizing different objective functions, which is summed up in the table below:** 126 | 127 | ⟶ 128 | 129 |
130 | 131 | **22. [Ward linkage, Average linkage, Complete linkage]** 132 | 133 | ⟶ 134 | 135 |
136 | 137 | **23. [Minimize within cluster distance, Minimize average distance between cluster pairs, Minimize maximum distance of between cluster pairs]** 138 | 139 | ⟶ 140 | 141 |
142 | 143 | **24. Clustering assessment metrics** 144 | 145 | ⟶ 146 | 147 |
148 | 149 | **25. In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground truth labels as was the case in the supervised learning setting.** 150 | 151 | ⟶ 152 | 153 |
154 | 155 | **26. Silhouette coefficient ― By noting a and b the mean distance between a sample and all other points in the same class, and between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single sample is defined as follows:** 156 | 157 | ⟶ 158 | 159 |
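The silhouette coefficient of item 26 reduces to s=(b−a)/max(a,b) once a and b are computed. A small sketch with hypothetical 1-D toy values (`own` is the sample's cluster, `nearest_other` the next nearest cluster):

```python
# Silhouette coefficient for one sample:
# a = mean distance to its own cluster, b = mean distance to the
# nearest other cluster, s = (b - a) / max(a, b), so s lies in [-1, 1].
def mean_dist(p, others):
    return sum(abs(p - q) for q in others) / len(others)

def silhouette(p, own, nearest_other):
    a = mean_dist(p, own)            # cohesion
    b = mean_dist(p, nearest_other)  # separation
    return (b - a) / max(a, b)

s = silhouette(0.0, own=[1.0, 2.0], nearest_other=[10.0, 12.0])
```

Values near 1 indicate a well-clustered sample; values near or below 0 suggest it sits between clusters.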
160 | 161 | **27. Calinski-Harabaz index ― By noting k the number of clusters, Bk and Wk the between and within-clustering dispersion matrices respectively defined as** 162 | 163 | ⟶ 164 | 165 |
166 | 167 | **28. the Calinski-Harabaz index s(k) indicates how well a clustering model defines its clusters, such that the higher the score, the more dense and well separated the clusters are. It is defined as follows:** 168 | 169 | ⟶ 170 | 171 |
172 | 173 | **29. Dimension reduction** 174 | 175 | ⟶ 176 | 177 |
178 | 179 | **30. Principal component analysis** 180 | 181 | ⟶ 182 | 183 |
184 | 185 | **31. It is a dimension reduction technique that finds the variance maximizing directions onto which to project the data.** 186 | 187 | ⟶ 188 | 189 |
190 | 191 | **32. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:** 192 | 193 | ⟶ 194 | 195 |
196 | 197 | **33. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:** 198 | 199 | ⟶ 200 | 201 |
202 | 203 | **34. diagonal** 204 | 205 | ⟶ 206 | 207 |
208 | 209 | **35. Remark: the eigenvector associated with the largest eigenvalue is called principal eigenvector of matrix A.** 210 | 211 | ⟶ 212 | 213 |
214 | 215 | **36. Algorithm ― The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k 216 | dimensions by maximizing the variance of the data as follows:** 217 | 218 | ⟶ 219 | 220 |
221 | 222 | **37. Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.** 223 | 224 | ⟶ 225 | 226 |
227 | 228 | **38. Step 2: Compute Σ=1mm∑i=1x(i)x(i)T∈Rn×n, which is symmetric with real eigenvalues.** 229 | 230 | ⟶ 231 | 232 |
233 | 234 | **39. Step 3: Compute u1,...,uk∈Rn the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors of the k largest eigenvalues.** 235 | 236 | ⟶ 237 | 238 |
239 | 240 | **40. Step 4: Project the data on spanR(u1,...,uk).** 241 | 242 | ⟶ 243 | 244 |
245 | 246 | **41. This procedure maximizes the variance among all k-dimensional spaces.** 247 | 248 | ⟶ 249 | 250 |
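Steps 1-4 of the PCA procedure can be sketched in pure Python for the 2-D, k=1 case (the toy data points are hypothetical; the principal eigenvector is extracted by power iteration, one of several ways to do Step 3):

```python
# PCA sketch, 2-D data projected onto k=1 dimension:
# normalize, build Sigma = (1/m) sum_i x_i x_i^T, take the principal
# eigenvector u1 by power iteration, project onto span(u1).
def pca_1d(data):
    m = len(data)
    # Step 1: normalize each coordinate to mean 0 and std 1
    cols = list(zip(*data))
    mus = [sum(c) / m for c in cols]
    sds = [(sum((v - mu) ** 2 for v in c) / m) ** 0.5
           for c, mu in zip(cols, mus)]
    X = [[(v - mu) / sd for v, mu, sd in zip(row, mus, sds)] for row in data]
    # Step 2: Sigma is 2x2 and symmetric with real eigenvalues
    S = [[sum(x[a] * x[b] for x in X) / m for b in range(2)] for a in range(2)]
    # Step 3: principal eigenvector via power iteration
    u = [1.0, 1.0]
    for _ in range(100):
        v = [S[0][0] * u[0] + S[0][1] * u[1],
             S[1][0] * u[0] + S[1][1] * u[1]]
        n = (v[0] ** 2 + v[1] ** 2) ** 0.5
        u = [v[0] / n, v[1] / n]
    # Step 4: project each normalized point onto span(u1)
    return [x[0] * u[0] + x[1] * u[1] for x in X], u

scores, u1 = pca_1d([(1.0, 2.0), (2.0, 4.1), (3.0, 6.0), (4.0, 7.9)])
```

Since the toy data is almost perfectly correlated, u1 comes out close to (1,1)/√2, the direction of maximal variance after normalization.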
251 | 252 | **42. [Data in feature space, Find principal components, Data in principal components space]** 253 | 254 | ⟶ 255 | 256 |
257 | 258 | **43. Independent component analysis** 259 | 260 | ⟶ 261 | 262 |
263 | 264 | **44. It is a technique meant to find the underlying generating sources.** 265 | 266 | ⟶ 267 | 268 |
269 | 270 | **45. Assumptions ― We assume that our data x has been generated by the n-dimensional source vector s=(s1,...,sn), where si are independent random variables, via a mixing and non-singular matrix A as follows:** 271 | 272 | ⟶ 273 | 274 |
275 | 276 | **46. The goal is to find the unmixing matrix W=A−1.** 277 | 278 | ⟶ 279 | 280 |
281 | 282 | **47. Bell and Sejnowski ICA algorithm ― This algorithm finds the unmixing matrix W by following the steps below:** 283 | 284 | ⟶ 285 | 286 |
287 | 288 | **48. Write the probability of x=As=W−1s as:** 289 | 290 | ⟶ 291 | 292 |
293 | 294 | **49. Write the log likelihood given our training data {x(i),i∈[[1,m]]} and by noting g the sigmoid function as:** 295 | 296 | ⟶ 297 | 298 |
299 | 300 | **50. Therefore, the stochastic gradient ascent learning rule is such that for each training example x(i), we update W as follows:** 301 | 302 | ⟶ 303 | 304 |
305 | 306 | **51. The Machine Learning cheatsheets are now available in [target language].** 307 | 308 | ⟶ 309 | 310 |
311 | 312 | **52. Original authors** 313 | 314 | ⟶ 315 | 316 |
317 | 318 | **53. Translated by X, Y and Z** 319 | 320 | ⟶ 321 | 322 |
323 | 324 | **54. Reviewed by X, Y and Z** 325 | 326 | ⟶ 327 | 328 |
329 | 330 | **55. [Introduction, Motivation, Jensen's inequality]** 331 | 332 | ⟶ 333 | 334 |
335 | 336 | **56. [Clustering, Expectation-Maximization, k-means, Hierarchical clustering, Metrics]** 337 | 338 | ⟶ 339 | 340 |
341 | 342 | **57. [Dimension reduction, PCA, ICA]** 343 | 344 | ⟶ 345 | -------------------------------------------------------------------------------- /ja/cs-229-machine-learning-tips-and-tricks.md: -------------------------------------------------------------------------------- 1 | **1. Machine Learning tips and tricks cheatsheet** 2 | 3 | ⟶ 機械学習のアドバイスやコツのチートシート 4 | 5 |
6 | 7 | **2. Classification metrics** 8 | 9 | ⟶ 分類評価指標 10 | 11 |
12 | 13 | **3. In a context of a binary classification, here are the main metrics that are important to track in order to assess the performance of the model.** 14 | 15 | ⟶ 二値分類において、モデルの性能を評価する際の主要な指標として次のものがあります。 16 | 17 |
18 | 19 | **4. Confusion matrix ― The confusion matrix is used to have a more complete picture when assessing the performance of a model. It is defined as follows:** 20 | 21 | ⟶ 混同行列 ― 混同行列はモデルの性能を評価する際に、より完全に理解するために用いられます。次のように定義されます: 22 | 23 |
24 | 25 | **5. [Predicted class, Actual class]** 26 | 27 | ⟶ [予測したクラス, 実際のクラス] 28 | 29 |
30 | 31 | **6. Main metrics ― The following metrics are commonly used to assess the performance of classification models:** 32 | 33 | ⟶ 主要な評価指標 ― 分類モデルの性能を評価するために、一般的に次の指標が用いられます。 34 | 35 |
36 | 37 | **7. [Metric, Formula, Interpretation]** 38 | 39 | ⟶ [評価指標,式,解釈] 40 | 41 |
42 | 43 | **8. Overall performance of model** 44 | 45 | ⟶ モデルの全体的な性能 46 | 47 |
48 | 49 | **9. How accurate the positive predictions are** 50 | 51 | ⟶ 陽性と予測したもののうち、どれだけ正確であるか 52 | 53 |
54 | 55 | **10. Coverage of actual positive sample** 56 | 57 | ⟶ 実際に陽性であるサンプルをどれだけ網羅できているか 58 | 59 |
60 | 61 | **11. Coverage of actual negative sample** 62 | 63 | ⟶ 実際に陰性であるサンプルをどれだけ網羅できているか 64 | 65 |
66 | 67 | **12. Hybrid metric useful for unbalanced classes** 68 | 69 | ⟶ 不均衡データに対する有用な複合指標 70 | 71 |
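Items 8-12 all derive from the four confusion-matrix counts. A minimal sketch (the TP/FP/TN/FN counts below are hypothetical example values):

```python
# Main classification metrics from confusion-matrix counts:
# accuracy, precision (item 9), recall/TPR (item 10),
# specificity/TNR (item 11), and the F1 score (item 12).
def classification_metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)           # coverage of actual positives
    specificity = tn / (tn + fp)      # coverage of actual negatives
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "specificity": specificity, "f1": f1}

m = classification_metrics(tp=8, fp=2, tn=85, fn=5)
```

Note how accuracy looks high on this imbalanced toy example while F1, which balances precision and recall, is noticeably lower.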
72 | 73 | **13. ROC ― The receiver operating curve, also noted ROC, is the plot of TPR versus FPR by varying the threshold. These metrics are summed up in the table below:** 74 | 75 | ⟶ ROC曲線 ― 受信者動作特性曲線(ROC)は閾値を変えていく際のFPRに対するTPRのグラフです。これらの指標は下表の通りまとめられます。 76 | 77 |
78 | 79 | **14. [Metric, Formula, Equivalent]** 80 | 81 | ⟶ [評価指標,式,等価な指標] 82 | 83 |
84 | 85 | **15. AUC ― The area under the receiving operating curve, also noted AUC or AUROC, is the area below the ROC as shown in the following figure:** 86 | 87 | ⟶ AUC ― ROC曲線下面積(AUC,AUROC)は次の図に示される通りROC曲線の下側面積のことです。 88 | 89 |
90 | 91 | **16. [Actual, Predicted]** 92 | 93 | ⟶ [実際,予測] 94 | 95 |
96 | 97 | **17. Basic metrics ― Given a regression model f, the following metrics are commonly used to assess the performance of the model:** 98 | 99 | ⟶ 基本的な評価指標 ― 回帰モデルfが与えられたとき,次のような評価指標がモデルの性能を評価するために一般的に用いられます。 100 | 101 |
102 | 103 | **18. [Total sum of squares, Explained sum of squares, Residual sum of squares]** 104 | 105 | ⟶ [全平方和,回帰平方和,残差平方和] 106 | 107 |
108 | 109 | **19. Coefficient of determination ― The coefficient of determination, often noted R2 or r2, provides a measure of how well the observed outcomes are replicated by the model and is defined as follows:** 110 | 111 | ⟶ 決定係数 ― よくR2やr2と書かれる決定係数は,実際の結果がモデルによってどの程度よく再現されているかを測る評価指標であり,次のように定義されます。 112 | 113 |
114 | 115 | **20. Main metrics ― The following metrics are commonly used to assess the performance of regression models, by taking into account the number of variables n that they take into consideration:** 116 | 117 | ⟶ 主要な評価指標 ― 次の評価指標は説明変数の数を考慮して回帰モデルの性能を評価するために,一般的に用いられています。 118 | 119 |
120 | 121 | **21. where L is the likelihood and ˆσ2 is an estimate of the variance associated with each response.** 122 | 123 | ⟶ ここでLは尤度であり,ˆσ2は各応答に対する誤差分散の推定値です。 124 | 125 |
126 | 127 | **22. Model selection** 128 | 129 | ⟶ モデル選択 130 | 131 |
132 | 133 | **23. Vocabulary ― When selecting a model, we distinguish 3 different parts of the data that we have as follows:** 134 | 135 | ⟶ 用語 ― モデルを選択するときには,次のようにデータの種類を異なる3つに区別します。 136 | 137 |
138 | 139 | **24. [Training set, Validation set, Testing set]** 140 | 141 | ⟶ [訓練セット,検証セット,テストセット] 142 | 143 |
144 | 145 | **25. [Model is trained, Model is assessed, Model gives predictions]** 146 | 147 | ⟶ [モデルを学習させる,モデルを評価する,モデルが予測する] 148 | 149 |
150 | 151 | **26. [Usually 80% of the dataset, Usually 20% of the dataset]** 152 | 153 | ⟶ [通常はデータセットの80%,通常はデータセットの20%] 154 | 155 |
156 | 157 | **27. [Also called hold-out or development set, Unseen data]** 158 | 159 | ⟶ [ホールドアウトセットや,開発セットとも呼ばれる,未知のデータ] 160 | 161 |
162 | 163 | **28. Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set. These are represented in the figure below:** 164 | 165 | ⟶ 一度モデル選択が行われた場合,学習にはデータセット全体が用いられ,テストには未知のテストセットが使用されます。これらは次の図のように表されます。 166 | 167 |
168 | 169 | **29. Cross-validation ― Cross-validation, also noted CV, is a method that is used to select a model that does not rely too much on the initial training set. The different types are summed up in the table below:** 170 | 171 | ⟶ 交差検証 ― 交差検証(CV)は,初期の学習データセットに強く依存しないようにモデル選択を行う方法です。2つの方法を下表にまとめました。 172 | 173 |
174 | 175 | **30. [Training on k−1 folds and assessment on the remaining one, Training on n−p observations and assessment on the p remaining ones]** 176 | 177 | ⟶ [k-1群で学習,残りの1群で評価,n-p個で学習,残りのp個で評価] 178 | 179 |
180 | 181 | **31. [Generally k=5 or 10, Case p=1 is called leave-one-out]** 182 | 183 | ⟶ [一般的にはk=5または10,p=1の場合は一個抜き交差検証と呼ばれます] 184 | 185 |
186 | 187 | **32. The most commonly used method is called k-fold cross-validation and splits the training data into k folds to validate the model on one fold while training the model on the k−1 other folds, all of this k times. The error is then averaged over the k folds and is named cross-validation error.** 188 | 189 | ⟶ 最も一般的に用いられている方法はk交差検証法です.データセットをk群に分けた後,1群を検証に使用し残りのk-1群を学習に使用するという操作を順番にk回繰り返します。求められた検証誤差はk群すべてにわたって平均化されます。この平均された誤差のことを交差検証誤差と呼びます。 190 | 191 |
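The k-fold splitting described in item 32 can be sketched in a few lines of pure Python (index bookkeeping only; the actual model training and error averaging are left out):

```python
# k-fold split of m sample indices: each fold serves once as the
# validation set while the remaining k-1 folds form the training set.
def k_fold_indices(m, k):
    fold_sizes = [m // k + (1 if i < m % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    splits = []
    for i in range(k):
        val = folds[i]
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        splits.append((train, val))
    return splits

splits = k_fold_indices(m=10, k=5)
```

In practice one would shuffle the indices first and average the k validation errors to obtain the cross-validation error.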
192 | 193 | **33. Regularization ― The regularization procedure aims at avoiding the model to overfit the data and thus deals with high variance issues. The following table sums up the different types of commonly used regularization techniques:** 194 | 195 | ⟶ 正則化 ― 正則化はモデルの過学習状態を回避することが目的であり,したがってハイバリアンス問題(オーバーフィット問題)に対処できます。一般的に使用されるいくつかの正則化法を下表にまとめました。 196 | 197 |
198 | 199 | **34. [Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]** 200 | 201 | ⟶ [係数を0にする,変数選択に適する,係数を小さくする,変数選択と係数を小さくすることのトレードオフ] 202 | 203 |
204 | 205 | **35. Diagnostics** 206 | 207 | ⟶ 診断方法 208 | 209 |
210 | 211 | **36. Bias ― The bias of a model is the difference between the expected prediction and the correct model that we try to predict for given data points.** 212 | 213 | ⟶ バイアス ― モデルのバイアスとは,ある標本値群に対して予測しようとする正しいモデルと予測の期待値との差異のことです。 214 | 215 |
216 | 217 | **37. Variance ― The variance of a model is the variability of the model prediction for given data points.** 218 | 219 | ⟶ バリアンス ― モデルのバリアンスとは,ある標本値群に対するモデルの予測値のばらつきのことです。 220 | 221 |
222 | 223 | **38. Bias/variance tradeoff ― The simpler the model, the higher the bias, and the more complex the model, the higher the variance.** 224 | 225 | ⟶ バイアス・バリアンストレードオフ ― よりシンプルなモデルではバイアスが高くなり,より複雑なモデルはバリアンスが高くなります。 226 | 227 |
228 | 229 | **39. [Symptoms, Regression illustration, classification illustration, deep learning illustration, possible remedies]** 230 | 231 | ⟶ [症状,回帰モデルでの図,分類モデルでの図,深層学習での図,可能な解決策] 232 | 233 |
234 | 235 | **40. [High training error, Training error close to test error, High bias, Training error slightly lower than test error, Very low training error, Training error much lower than test error, High variance]** 236 | 237 | ⟶ [高い訓練誤差,訓練誤差がテスト誤差に近い,高いバイアス,訓練誤差がテスト誤差より少しだけ小さい,極端に小さい訓練誤差,訓練誤差がテスト誤差に比べて非常に小さい,高いバリアンス] 238 | 239 |
240 | 241 | **41. [Complexify model, Add more features, Train longer, Perform regularization, Get more data]** 242 | 243 | ⟶ [より複雑なモデルを試す,特徴量を増やす,より長く学習する,正則化を導入する,データ数を増やす] 244 | 245 |
246 | 247 | **42. Error analysis ― Error analysis is analyzing the root cause of the difference in performance between the current and the perfect models.** 248 | 249 | ⟶ エラー分析 ― エラー分析は現在のモデルと完璧なモデル間の性能差の根本原因を分析することです。 250 | 251 |
252 | 253 | **43. Ablative analysis ― Ablative analysis is analyzing the root cause of the difference in performance between the current and the baseline models.** 254 | 255 | ⟶ アブレーション分析 ― アブレーション分析は,現在のモデルとベースラインモデル間の性能差の根本原因を分析することです。 256 | 257 |
258 | 259 | **44. Regression metrics** 260 | 261 | ⟶ 回帰評価指標 262 | 263 |
264 | 265 | **45. [Classification metrics, confusion matrix, accuracy, precision, recall, F1 score, ROC]** 266 | 267 | ⟶ [分類評価指標,混同行列,正解率,適合率,再現率,F値,ROC曲線] 268 | 269 |
270 | 271 | **46. [Regression metrics, R squared, Mallow's CP, AIC, BIC]** 272 | 273 | ⟶ [回帰評価指標,R二乗,マローズのCp,AIC,BIC] 274 | 275 |
276 | 277 | **47. [Model selection, cross-validation, regularization]** 278 | 279 | ⟶ [モデルの選択,交差検証,正則化] 280 | 281 |
282 | 283 | **48. [Diagnostics, Bias/variance tradeoff, error/ablative analysis]** 284 | 285 | ⟶ [診断方法,バイアス・バリアンストレードオフ,エラー・アブレーション分析] 286 | -------------------------------------------------------------------------------- /ko/cs-229-machine-learning-tips-and-tricks.md: -------------------------------------------------------------------------------- 1 | **1. Machine Learning tips and tricks cheatsheet** 2 | 3 | ⟶머신러닝 팁과 트릭 치트시트 4 | 5 |
6 | 7 | **2. Classification metrics** 8 | 9 | ⟶분류 측정 항목 10 | 11 |
12 | 13 | **3. In a context of a binary classification, here are the main metrics that are important to track in order to assess the performance of the model.** 14 | 15 | ⟶이진 분류 상황에서 모델의 성능을 평가하기 위해 눈 여겨 봐야하는 주요 측정 항목이 여기에 있습니다. 16 | 17 |
18 | 19 | **4. Confusion matrix ― The confusion matrix is used to have a more complete picture when assessing the performance of a model. It is defined as follows:** 20 | 21 | ⟶혼동 행렬 ― 혼동 행렬은 모델의 성능을 평가할 때, 보다 큰 그림을 보기위해 사용됩니다. 이는 다음과 같이 정의됩니다. 22 | 23 |
24 | 25 | **5. [Predicted class, Actual class]** 26 | 27 | ⟶[예측된 클래스, 실제 클래스] 28 | 29 |
30 | 31 | **6. Main metrics ― The following metrics are commonly used to assess the performance of classification models:** 32 | 33 | ⟶주요 측정 항목들 ― 다음 측정 항목들은 주로 분류 모델의 성능을 평가할 때 사용됩니다. 34 | 35 |
36 | 37 | **7. [Metric, Formula, Interpretation]** 38 | 39 | ⟶[측정 항목, 공식, 해석] 40 | 41 |
42 | 43 | **8. Overall performance of model** 44 | 45 | ⟶전반적인 모델의 성능 46 | 47 |
48 | 49 | **9. How accurate the positive predictions are** 50 | 51 | ⟶예측된 양성이 정확한 정도 52 | 53 |
54 | 55 | **10. Coverage of actual positive sample** 56 | 57 | ⟶실제 양성의 예측 정도 58 | 59 |
60 | 61 | **11. Coverage of actual negative sample** 62 | 63 | ⟶실제 음성의 예측 정도 64 | 65 |
66 | 67 | **12. Hybrid metric useful for unbalanced classes** 68 | 69 | ⟶불균형 클래스에 유용한 하이브리드 측정 항목 70 | 71 |
72 | 73 | **13. ROC ― The receiver operating curve, also noted ROC, is the plot of TPR versus FPR by varying the threshold. These metrics are summed up in the table below:** 74 | 75 | ⟶ROC(Receiver Operating Curve) ― ROC 곡선은 임계값의 변화에 따른 TPR 대 FPR의 플롯입니다. 이 측정 항목은 아래 표에 요약되어 있습니다: 76 | 77 |
78 | 79 | **14. [Metric, Formula, Equivalent]** 80 | 81 | ⟶[측정 항목, 공식, 같은 측도] 82 | 83 |
84 | 85 | **15. AUC ― The area under the receiving operating curve, also noted AUC or AUROC, is the area below the ROC as shown in the following figure:** 86 | 87 | ⟶AUC(Area Under the receiving operating Curve) ― AUC 또는 AUROC라고도 하는 이 측정 항목은 다음 그림과 같이 ROC 곡선 아래의 영역입니다: 88 | 89 |
90 | 91 | **16. [Actual, Predicted]** 92 | 93 | ⟶[실제값, 예측된 값] 94 | 95 |
96 | 97 | **17. Basic metrics ― Given a regression model f, the following metrics are commonly used to assess the performance of the model:** 98 | 99 | ⟶기본 측정 항목 ― 회귀 모델 f가 주어졌을때, 다음의 측정 항목들은 모델의 성능을 평가할 때 주로 사용됩니다: 100 | 101 |
102 | 103 | **18. [Total sum of squares, Explained sum of squares, Residual sum of squares]** 104 | 105 | ⟶[총 제곱합, 설명된 제곱합, 잔차 제곱합] 106 | 107 |
108 | 109 | **19. Coefficient of determination ― The coefficient of determination, often noted R2 or r2, provides a measure of how well the observed outcomes are replicated by the model and is defined as follows:** 110 | 111 | ⟶결정 계수 ― 종종 R2 또는 r2로 표시되는 결정 계수는 관측된 결과가 모델에 의해 얼마나 잘 재현되는지를 측정하는 측도로서 다음과 같이 정의됩니다: 112 | 113 |
114 | 115 | **20. Main metrics ― The following metrics are commonly used to assess the performance of regression models, by taking into account the number of variables n that they take into consideration:** 116 | 117 | ⟶주요 측정 항목들 ― 다음 측정 항목들은 주로 변수의 수를 고려하여 회귀 모델의 성능을 평가할 때 사용됩니다: 118 | 119 |
120 | 121 | **21. where L is the likelihood and ˆσ2 is an estimate of the variance associated with each response.** 122 | 123 | ⟶여기서 L은 가능도이고 ^σ2는 각각의 반응과 관련된 분산의 추정값입니다. 124 | 125 |
126 | 127 | **22. Model selection** 128 | 129 | ⟶모델 선택 130 | 131 |
132 | 133 | **23. Vocabulary ― When selecting a model, we distinguish 3 different parts of the data that we have as follows:** 134 | 135 | ⟶어휘 ― 모델을 선택할 때 우리는 다음과 같이 가지고 있는 데이터를 세 부분으로 구분합니다: 136 | 137 |
138 | 139 | **24. [Training set, Validation set, Testing set]** 140 | 141 | ⟶[학습 세트, 검증 세트, 테스트 세트] 142 | 143 |
144 | 145 | **25. [Model is trained, Model is assessed, Model gives predictions]** 146 | 147 | ⟶[모델 훈련, 모델 평가, 모델 예측] 148 | 149 |
150 | 151 | **26. [Usually 80% of the dataset, Usually 20% of the dataset]** 152 | 153 | ⟶[주로 데이터 세트의 80%, 주로 데이터 세트의 20%] 154 | 155 |
156 | 157 | **27. [Also called hold-out or development set, Unseen data]** 158 | 159 | ⟶[홀드아웃 또는 개발 세트라고도하는, 보지 않은 데이터] 160 | 161 |
162 | 163 | **28. Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set. These are represented in the figure below:** 164 | 165 | ⟶모델이 선택되면 전체 데이터 세트에 대해 학습을 하고 보지 않은 데이터에서 테스트합니다. 이는 아래 그림에 나타나있습니다. 166 | 167 |
168 | 169 | **29. Cross-validation ― Cross-validation, also noted CV, is a method that is used to select a model that does not rely too much on the initial training set. The different types are summed up in the table below:** 170 | 171 | ⟶교차-검증 ― CV라고도하는 교차-검증은 초기의 학습 세트에 지나치게 의존하지 않는 모델을 선택하는데 사용되는 방법입니다. 다양한 유형이 아래 표에 요약되어 있습니다: 172 | 173 |
174 | 175 | **30. [Training on k−1 folds and assessment on the remaining one, Training on n−p observations and assessment on the p remaining ones]** 176 | 177 | ⟶[k-1 폴드에 대한 학습과 나머지 1폴드에 대한 평가, n-p개 관측치에 대한 학습과 나머지 p개 관측치에 대한 평가] 178 | 179 |
180 | 181 | **31. [Generally k=5 or 10, Case p=1 is called leave-one-out]** 182 | 183 | ⟶[일반적으로 k=5 또는 10, p=1인 케이스는 leave-one-out] 184 | 185 |
186 | 187 | **32. The most commonly used method is called k-fold cross-validation and splits the training data into k folds to validate the model on one fold while training the model on the k−1 other folds, all of this k times. The error is then averaged over the k folds and is named cross-validation error.** 188 | 189 | ⟶가장 일반적으로 사용되는 방법은 k-폴드 교차-검증이라고하며 이는 학습 데이터를 k개의 폴드로 분할하고, 그 중 k-1개의 폴드로 모델을 학습하는 동시에 나머지 1개의 폴드로 모델을 검증합니다. 이 작업을 k번 수행합니다. 오류는 k 폴드에 대해 평균화되고 교차-검증 오류라고 부릅니다. 190 | 191 |
192 | 193 | **33. Regularization ― The regularization procedure aims at avoiding the model to overfit the data and thus deals with high variance issues. The following table sums up the different types of commonly used regularization techniques:** 194 | 195 | ⟶정규화 ― 정규화 절차는 데이터에 대한 모델의 과적합을 피하고 분산이 커지는 문제를 처리하는 것을 목표로 합니다. 다음의 표는 일반적으로 사용되는 정규화 기법의 여러 유형을 요약한 것입니다: 196 | 197 |
198 | 199 | **34. [Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]** 200 | 201 | ⟶[계수를 0으로 축소, 변수 선택에 좋음, 계수를 작게 함, 변수 선택과 작은 계수 간의 트래이드오프] 202 | 203 |
204 | 205 | **35. Diagnostics** 206 | 207 | ⟶진단 208 | 209 |
210 | 211 | **36. Bias ― The bias of a model is the difference between the expected prediction and the correct model that we try to predict for given data points.** 212 | 213 | ⟶편향 ― 모델의 편향은 기대되는 예측과 주어진 데이터 포인트에 대해 예측하려고하는 올바른 모델 간의 차이입니다. 214 | 215 |
216 | 217 | **37. Variance ― The variance of a model is the variability of the model prediction for given data points.** 218 | 219 | ⟶분산 ― 모델의 분산은 주어진 데이터 포인트에 대한 모델 예측의 가변성입니다. 220 | 221 |
222 | 223 | **38. Bias/variance tradeoff ― The simpler the model, the higher the bias, and the more complex the model, the higher the variance.** 224 | 225 | ⟶편향/분산 트래이드오프 ― 모델이 간단할수록 편향이 높아지고 모델이 복잡할수록 분산이 커집니다. 226 | 227 |
228 | 229 | **39. [Symptoms, Regression illustration, classification illustration, deep learning illustration, possible remedies]** 230 | 231 | ⟶[증상, 회귀 일러스트레이션, 분류 일러스트레이션, 딥러닝 일러스트레이션, 가능한 처리방법] 232 | 233 |
234 | 235 | **40. [High training error, Training error close to test error, High bias, Training error slightly lower than test error, Very low training error, Training error much lower than test error, High variance]** 236 | 237 | ⟶[높은 학습 오류, 테스트 오류에 가까운 학습 오류, 높은 편향, 테스트 에러 보다 약간 낮은 학습 오류, 매우 낮은 학습 오류, 테스트 오류보다 훨씬 낮은 학습 오류, 높은 분산] 238 | 239 |
240 | 241 | **41. [Complexify model, Add more features, Train longer, Perform regularization, Get more data]** 242 | 243 | ⟶[모델 복잡화, 특징 추가, 학습 증대, 정규화 수행, 추가 데이터 수집] 244 | 245 |
246 | 247 | **42. Error analysis ― Error analysis is analyzing the root cause of the difference in performance between the current and the perfect models.** 248 | 249 | ⟶오류 분석 ― 오류 분석은 현재 모델과 완벽한 모델 간의 성능 차이의 근본 원인을 분석합니다. 250 | 251 |
252 | 253 | **43. Ablative analysis ― Ablative analysis is analyzing the root cause of the difference in performance between the current and the baseline models.** 254 | 255 | ⟶애블러티브 분석 ― 애블러티브 분석은 현재 모델과 베이스라인 모델 간의 성능 차이의 근본 원인을 분석합니다. 256 | 257 |
258 | 259 | **44. Regression metrics** 260 | 261 | ⟶회귀 측정 항목 262 | 263 |
264 | 265 | **45. [Classification metrics, confusion matrix, accuracy, precision, recall, F1 score, ROC]** 266 | 267 | ⟶[분류 측정 항목, 혼동 행렬, 정확도, 정밀도, 리콜, F1 스코어, ROC] 268 | 269 |
270 | 271 | **46. [Regression metrics, R squared, Mallow's CP, AIC, BIC]** 272 | 273 | ⟶[회귀 측정 항목, R 스퀘어, 맬로우의 CP, AIC, BIC] 274 | 275 |
276 | 277 | **47. [Model selection, cross-validation, regularization]** 278 | 279 | ⟶[모델 선택, 교차-검증, 정규화] 280 | 281 |
282 | 283 | **48. [Diagnostics, Bias/variance tradeoff, error/ablative analysis]** 284 | 285 | ⟶[진단, 편향/분산 트래이드오프, 오류/애블러티브 분석] 286 | -------------------------------------------------------------------------------- /zh-tw/cs-229-deep-learning.md: -------------------------------------------------------------------------------- 1 | 1. **Deep Learning cheatsheet** 2 | 3 | ⟶ 4 | 深度學習參考手冊 5 |
6 | 7 | 2. **Neural Networks** 8 | 9 | ⟶ 10 | 神經網路 11 |
12 | 13 | 3. **Neural networks are a class of models that are built with layers. Commonly used types of neural networks include convolutional and recurrent neural networks.** 14 | 15 | ⟶ 16 | 神經網路是一種透過 layer 來建構的模型。經常被使用的神經網路模型包括了卷積神經網路 (CNN) 和遞迴式神經網路 (RNN)。 17 |
18 | 19 | 4. **Architecture ― The vocabulary around neural networks architectures is described in the figure below:** 20 | 21 | ⟶ 22 | 架構 - 神經網路架構所需要用到的詞彙描述如下: 23 |
24 | 25 | 5. **[Input layer, hidden layer, output layer]** 26 | 27 | ⟶ 28 | [輸入層、隱藏層、輸出層] 29 |
30 | 31 | 6. **By noting i the ith layer of the network and j the jth hidden unit of the layer, we have:** 32 | 33 | ⟶ 34 | 我們使用 i 來代表網路的第 i 層、j 來代表某一層中第 j 個隱藏神經元的話,我們可以得到下面得等式: 35 |
36 | 37 | 7. **where we note w, b, z the weight, bias and output respectively.** 38 | 39 | ⟶ 40 | 其中,我們分別使用 w 來代表權重、b 代表偏差項、z 代表輸出的結果。 41 |
42 | 43 | 8. **Activation function ― Activation functions are used at the end of a hidden unit to introduce non-linear complexities to the model. Here are the most common ones:** 44 | 45 | ⟶ 46 | Activation function - Activation function 是為了在每一層尾端的神經元帶入非線性轉換而設計的。底下是一些常見 Activation function: 47 |
48 | 49 | 9. **[Sigmoid, Tanh, ReLU, Leaky ReLU]** 50 | 51 | ⟶ 52 | [Sigmoid, Tanh, ReLU, Leaky ReLU] 53 |
55 | 56 | 10. **Cross-entropy loss ― In the context of neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:** 57 | 58 | ⟶ 59 | 交叉熵損失函式 - 在神經網路中,交叉熵損失函式 L(z,y) 經常被使用,其定義如下: 60 |
60 | 61 | 11. **Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. This can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.** 62 | 63 | ⟶ 64 | 學習速率 - 學習速率通常用 α 或 η 來表示,目的是用來控制權重更新的速度。學習速度可以是一個固定值,或是隨著訓練的過程改變。現在最熱門的最佳化方法叫作 Adam,是一種隨著訓練過程改變學習速率的最佳化方法。 65 |
66 | 67 | 12. **Backpropagation ― Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to weight w is computed using chain rule and is of the following form:** 68 | 69 | ⟶ 70 | 反向傳播演算法 - 反向傳播演算法是一種在神經網路中用來更新權重的方法,更新的基準是根據神經網路的實際輸出值和期望輸出值之間的關係。權重的導數是根據連鎖律 (chain rule) 來計算,通常會表示成下面的形式: 71 |
72 | 73 | 13. **As a result, the weight is updated as follows:** 74 | 75 | ⟶ 76 | 因此,權重會透過以下的方式來更新: 77 |
78 | 79 | 14. **Updating weights ― In a neural network, weights are updated as follows:** 80 | 81 | ⟶ 82 | 更新權重 - 在神經網路中,權重的更新會透過以下步驟進行: 83 |
84 | 85 | 15. **Step 1: Take a batch of training data.** 86 | 87 | ⟶ 88 | 步驟一:取出一個批次 (batch) 的資料 89 |
90 | 91 | 16. **Step 2: Perform forward propagation to obtain the corresponding loss.** 92 | 93 | ⟶ 94 | 步驟二:執行前向傳播演算法 (forward propagation) 來得到對應的損失值 95 |
96 | 97 | 17. **Step 3: Backpropagate the loss to get the gradients.** 98 | 99 | ⟶ 100 | 步驟三:將損失值透過反向傳播演算法來得到梯度 101 |
102 | 103 | 18. **Step 4: Use the gradients to update the weights of the network.** 104 | 105 | ⟶ 106 | 步驟四:使用梯度來更新網路的權重 107 |
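Steps 1-4 above can be sketched for the simplest possible case, a single linear neuron z=wx+b trained with a squared loss (the loss choice, toy batch, and learning rate are illustrative assumptions, not from the cheatsheet):

```python
# One weight-update cycle for a single linear neuron with
# squared loss L = (z - y)^2 / 2, following steps 1-4 above.
def train_step(w, b, batch, alpha):
    dw, db, loss = 0.0, 0.0, 0.0
    for x, y in batch:
        z = w * x + b                 # step 2: forward propagation
        loss += 0.5 * (z - y) ** 2
        dw += (z - y) * x             # step 3: backpropagate via chain rule
        db += (z - y)
    m = len(batch)
    w -= alpha * dw / m               # step 4: gradient update of the weights
    b -= alpha * db / m
    return w, b, loss / m

batch = [(1.0, 2.0), (2.0, 4.0)]      # step 1: toy batch generated by y = 2x
w, b = 0.0, 0.0
for _ in range(1000):
    w, b, loss = train_step(w, b, batch, alpha=0.1)
```

After enough iterations w approaches 2 and b approaches 0, recovering the generating rule of the toy batch.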
109 | 110 | 19. **Dropout ― Dropout is a technique meant at preventing overfitting the training data by dropping out units in a neural network. In practice, neurons are either dropped with probability p or kept with probability 1−p** 111 | 112 | ⟶ 113 | Dropout - Dropout 是一種藉由丟棄神經網路中部分神經元來避免過擬合訓練資料的技巧。在實務上,神經元會以機率 p 被丟棄,或以機率 1−p 被保留 114 |
114 | 115 | 20. **Convolutional Neural Networks** 116 | 117 | ⟶ 118 | 卷積神經網絡 119 |
120 | 121 | 21. **Convolutional layer requirement ― By noting W the input volume size, F the size of the convolutional layer neurons, P the amount of zero padding, then the number of neurons N that fit in a given volume is such that:** 122 | 123 | ⟶ 124 | 卷積層的需求 - 我們使用 W 代表輸入資料的維度大小、F 代表卷積層神經元 (filter) 的尺寸、P 代表墊零 (zero padding) 的數量,則在給定維度中可容納的神經元數量 N 滿足: 125 |
126 | 127 | 22. **Batch normalization ― It is a step of hyperparameter γ,β that normalizes the batch {xi}. By noting μB,σ2B the mean and variance of that we want to correct to the batch, it is done as follows:** 128 | 129 | ⟶ 130 | 批次正規化 (Batch normalization) - 它是一個藉由 γ,β 兩個超參數來正規化每個批次 {xi} 的過程。每一次正規化的過程,我們使用 μB,σ2B 分別代表平均數和變異數。請參考以下公式: 131 |
132 | 133 | 23. **It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.** 134 | 135 | ⟶ 136 | 批次正規化的動作通常在全連接層/卷積層之後、在非線性層之前進行。目的在於接納更高的學習速率,並且減少該批次學習初期對取樣資料特徵的依賴性。 137 |
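Item 22's batch normalization step can be sketched in pure Python for a 1-D batch (the batch values and the γ, β settings are hypothetical; ε is the usual small constant for numerical stability):

```python
# Batch normalization of a batch {x_i}: normalize with the batch
# mean mu_B and variance sigma^2_B, then scale and shift with the
# learnable hyperparameters gamma and beta.
def batch_norm(xs, gamma, beta, eps=1e-5):
    m = len(xs)
    mu = sum(xs) / m
    var = sum((x - mu) ** 2 for x in xs) / m
    return [gamma * (x - mu) / (var + eps) ** 0.5 + beta for x in xs]

out = batch_norm([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=1.0)
```

The output batch has mean β and standard deviation approximately γ, regardless of the input's original scale, which is what reduces the dependence on initialization.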
138 | 139 | 24. **Recurrent Neural Networks** 140 | 141 | ⟶ 142 | 遞歸神經網路 (RNN) 143 |
144 | 145 | 25. **Types of gates ― Here are the different types of gates that we encounter in a typical recurrent neural network:** 146 | 147 | ⟶ 148 | 閘的種類 - 在傳統的遞歸神經網路中,你會遇到幾種閘: 149 |
150 | 151 | 26. **[Input gate, forget gate, gate, output gate]** 152 | 153 | ⟶ 154 | [輸入閘、遺忘閘、閘、輸出閘] 155 |
156 | 157 | 27. **[Write to cell or not?, Erase a cell or not?, How much to write to cell?, How much to reveal cell?]** 158 | 159 | ⟶ 160 | 要不要將資料寫入到記憶區塊中?要不要將存在在記憶區塊中的資料清除?要寫多少資料到記憶區塊?要不要將資料從記憶區塊中取出? 161 |
162 | 163 | 28. **LSTM ― A long short-term memory (LSTM) network is a type of RNN model that avoids the vanishing gradient problem by adding 'forget' gates.** 164 | 165 | ⟶ 166 | 長短期記憶模型 - 長短期記憶模型是一種遞歸神經網路,藉由導入遺忘閘的設計來避免梯度消失的問題 167 |
168 | 169 | 29. **Reinforcement Learning and Control** 170 | 171 | ⟶ 172 | 強化學習及控制 173 |
174 | 175 | 30. **The goal of reinforcement learning is for an agent to learn how to evolve in an environment.** 176 | 177 | ⟶ 178 | 強化學習的目標就是為了讓代理 (agent) 能夠學習在環境中進化 179 |
180 | 181 | 31. **Definitions** 182 | 183 | ⟶ 184 | 定義 185 |
186 | 187 | 32. **Markov decision processes ― A Markov decision process (MDP) is a 5-tuple (S,A,{Psa},γ,R) where:** 188 | 189 | ⟶ 190 | 馬可夫決策過程 - 一個馬可夫決策過程 (MDP) 是一個五元組 (S,A,{Psa},γ,R),其中: 191 |
192 | 193 | 33. **S is the set of states** 194 | 195 | ⟶ 196 | S 是一組狀態的集合 197 |
198 | 199 | 34. **A is the set of actions** 200 | 201 | ⟶ 202 | A 是一組行為的集合 203 |
204 | 205 | 35. **{Psa} are the state transition probabilities for s∈S and a∈A** 206 | 207 | ⟶ 208 | {Psa} 指的是,當 s∈S、a∈A 時,狀態轉移的機率 209 |
210 | 211 | 36. **γ∈[0,1[ is the discount factor** 212 | 213 | ⟶ 214 | γ∈[0,1[ 是衰減係數 215 |
216 | 217 | 37. **R:S×A⟶R or R:S⟶R is the reward function that the algorithm wants to maximize** 218 | 219 | ⟶ 220 | R:S×A⟶R 或 R:S⟶R 指的是獎勵函數,也就是演算法想要去最大化的目標函數 221 |
222 | 223 | 38. **Policy ― A policy π is a function π:S⟶A that maps states to actions.** 224 | 225 | ⟶ 226 | 策略 - 一個策略 π 指的是一個函數 π:S⟶A,這個函數會將狀態映射到行為 227 |
228 | 229 | 39. **Remark: we say that we execute a given policy π if given a state s we take the action a=π(s).** 230 | 231 | ⟶ 232 | 注意:當給定一個狀態 s 時,我們採取行動 a=π(s),就稱為執行給定的策略 π 233 |
234 | 235 | 40. **Value function ― For a given policy π and a given state s, we define the value function Vπ as follows:** 236 | 237 | ⟶ 238 | 價值函數 - 給定一個策略 π 和狀態 s,我們定義價值函數 Vπ 為: 239 |
240 | 241 | 41. **Bellman equation ― The optimal Bellman equation characterizes the value function Vπ∗ of the optimal policy π∗:** 242 | 243 | ⟶ 244 | 貝爾曼方程 - 最佳貝爾曼方程描述了最佳策略 π∗ 的價值函數 Vπ∗: 245 |
246 | 247 | 42. **Remark: we note that the optimal policy π∗ for a given state s is such that:** 248 | 249 | ⟶ 250 | 注意:對於給定一個狀態 s,最佳的策略 π∗ 是: 251 |
252 | 253 | 43. **Value iteration algorithm ― The value iteration algorithm is in two steps:** 254 | 255 | ⟶ 256 | 價值迭代演算法 - 價值迭代演算法包含兩個步驟: 257 |
258 | 259 | 44. **1) We initialize the value:** 260 | 261 | ⟶ 262 | 1) 針對價值初始化: 263 |
264 | 265 | 45. **2) We iterate the value based on the values before:** 266 | 267 | ⟶ 268 | 2) 根據之前的值進行迭代: 269 |
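The two steps of value iteration in entries 44-45 can be sketched on a toy MDP (all transition probabilities and rewards below are hypothetical, chosen only for illustration):

```python
import numpy as np

# Toy 2-state, 2-action MDP: P[s, a, s2] is the probability of landing in s2
# after taking action a in state s; R[s] is the state reward.
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([1.0, 0.0])

V = np.zeros(2)                    # 1) initialize the value V0(s) = 0
for _ in range(500):               # 2) iterate: V(s) <- R(s) + gamma * max_a sum_s2 P[s,a,s2] V(s2)
    V = R + gamma * np.max(P @ V, axis=1)

policy = np.argmax(P @ V, axis=1)  # greedy policy with respect to the converged values
```

With γ<1 this update is a contraction, so repeating it converges to the fixed point of the optimal Bellman equation.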
270 | 271 | 46. **Maximum likelihood estimate ― The maximum likelihood estimates for the state transition probabilities are as follows:** 272 | 273 | ⟶ 274 | 最大概似估計 - 針對狀態轉移機率的最大概似估計為: 275 |
276 | 277 | 47. **times took action a in state s and got to s′** 278 | 279 | ⟶ 280 | 在狀態 s 採取行為 a 並到達 s′ 的次數 281 |
282 | 283 | 48. **times took action a in state s** 284 | 285 | ⟶ 286 | 在狀態 s 採取行為 a 的次數 287 |
288 | 289 | 49. **Q-learning ― Q-learning is a model-free estimation of Q, which is done as follows:** 290 | 291 | ⟶ 292 | Q-learning 演算法 - Q-learning 演算法是針對 Q 的一個 model-free 的估計,如下: 293 | 294 | 50. **View PDF version on GitHub** 295 | 296 | ⟶ 297 | 前往 GitHub 閱讀 PDF 版本 298 |
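The model-free update of entry 49, Q(s,a)←Q(s,a)+α[R(s′)+γ max⁡a′Q(s′,a′)−Q(s,a)], can be sketched on a tiny hypothetical chain environment (the environment, α, γ, and the ε-greedy exploration scheme are all illustrative assumptions):

```python
import random

# Toy deterministic 3-state chain: action 1 moves right, action 0 stays.
# Reward 1 is given upon reaching the terminal state n_states - 1.
alpha, gamma, eps = 0.5, 0.9, 0.3
n_states, n_actions = 3, 2
Q = [[0.0] * n_actions for _ in range(n_states)]

def step(s, a):
    s2 = min(s + a, n_states - 1)
    return s2, (1.0 if s2 == n_states - 1 else 0.0)

random.seed(0)
for _ in range(200):                       # epsilon-greedy episodes
    s = 0
    while s != n_states - 1:
        if random.random() < eps:
            a = random.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda i: Q[s][i])
        s2, r = step(s, a)
        # model-free update toward r + gamma * max_a' Q(s', a')
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2
```

Note that no transition probabilities are ever estimated: the agent learns Q directly from sampled transitions, which is what "model-free" means here.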
299 | 300 | 51. **[Neural Networks, Architecture, Activation function, Backpropagation, Dropout]** 301 | 302 | ⟶ 303 | [神經網路, 架構, Activation function, 反向傳播演算法, Dropout] 304 |
305 | 306 | 52. **[Convolutional Neural Networks, Convolutional layer, Batch normalization]** 307 | 308 | ⟶ 309 | [卷積神經網絡, 卷積層, 批次正規化] 310 |
311 | 312 | 53. **[Recurrent Neural Networks, Gates, LSTM]** 313 | 314 | ⟶ 315 | [遞歸神經網路 (RNN), 閘, 長短期記憶模型] 316 |
317 | 318 | 54. **[Reinforcement learning, Markov decision processes, Value/policy iteration, Approximate dynamic programming, Policy search]** 319 | 320 | ⟶ 321 | [強化學習, 馬可夫決策過程, 價值/策略迭代, 近似動態規劃, 策略搜尋] -------------------------------------------------------------------------------- /zh-tw/cs-229-unsupervised-learning.md: -------------------------------------------------------------------------------- 1 | 1. **Unsupervised Learning cheatsheet** 2 | 3 | ⟶ 4 | 非監督式學習參考手冊 5 |
6 | 7 | 2. **Introduction to Unsupervised Learning** 8 | 9 | ⟶ 10 | 非監督式學習介紹 11 |
12 | 13 | 3. **Motivation ― The goal of unsupervised learning is to find hidden patterns in unlabeled data {x(1),...,x(m)}.** 14 | 15 | ⟶ 16 | 動機 - 非監督式學習的目的是要找出未標籤資料 {x(1),...,x(m)} 之間的隱藏模式 17 |
18 | 19 | 4. **Jensen's inequality ― Let f be a convex function and X a random variable. We have the following inequality:** 20 | 21 | ⟶ 22 | Jensen's 不等式 - 令 f 為一個凸函數、X 為一個隨機變數,我們可以得到底下這個不等式: 23 |
24 | 25 | 5. **Clustering** 26 | 27 | ⟶ 28 | 分群 29 |
30 | 31 | 6. **Expectation-Maximization** 32 | 33 | ⟶ 34 | 最大期望值 35 |
36 | 37 | 7. **Latent variables ― Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted z. Here are the most common settings where there are latent variables:** 38 | 39 | ⟶ 40 | 潛在變數 (Latent variables) - 潛在變數指的是隱藏/沒有觀察到的變數,這會讓問題的估計變得困難,我們通常使用 z 來代表它。底下是潛在變數的常見設定: 41 |
42 | 43 | 8. **[Setting, Latent variable z, Comments]** 44 | 45 | ⟶ 46 | [設定, 潛在變數 z, 評論] 47 |
48 | 49 | 9. **[Mixture of k Gaussians, Factor analysis]** 50 | 51 | ⟶ 52 | [k 元高斯模型, 因素分析] 53 |
54 | 55 | 10. **Algorithm ― The Expectation-Maximization (EM) algorithm gives an efficient method at estimating the parameter θ through maximum likelihood estimation by repeatedly constructing a lower-bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:** 56 | 57 | ⟶ 58 | 演算法 - 最大期望演算法 (EM Algorithm) 透過重複建構概似函數的下界 (E-step) 並最佳化該下界 (M-step),以最大概似估計法有效率地估計參數 θ,如下: 59 |
60 | 61 | 11. **E-step: Evaluate the posterior probability Qi(z(i)) that each data point x(i) came from a particular cluster z(i) as follows:** 62 | 63 | ⟶ 64 | E-step: 評估後驗機率 Qi(z(i)),其中每個資料點 x(i) 來自於一個特定的群集 z(i),如下: 65 |
66 | 67 | 12. **M-step: Use the posterior probabilities Qi(z(i)) as cluster specific weights on data points x(i) to separately re-estimate each cluster model as follows:** 68 | 69 | ⟶ 70 | M-step: 使用後驗機率 Qi(z(i)) 作為資料點 x(i) 在群集中特定的權重,用來分別重新估計每個群集,如下: 71 |
72 | 73 | 13. **[Gaussians initialization, Expectation step, Maximization step, Convergence]** 74 | 75 | ⟶ 76 | [高斯分佈初始化, E-Step, M-Step, 收斂] 77 |
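The E-step/M-step loop of entries 10-12 can be sketched in NumPy for a 1-D mixture of two Gaussians (the toy data, seed, and initialization are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# toy 1-D data drawn from two well-separated Gaussian sources
x = np.concatenate([rng.normal(-3, 1, 200), rng.normal(3, 1, 200)])

phi = np.array([0.5, 0.5])      # mixing proportions
mu = np.array([-1.0, 1.0])      # initial means
sigma2 = np.array([1.0, 1.0])   # initial variances

for _ in range(50):
    # E-step: posterior Q_i(z) that each x_i came from each component
    dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    w = phi * dens
    w /= w.sum(axis=1, keepdims=True)
    # M-step: re-estimate each component with posterior-weighted data
    phi = w.mean(axis=0)
    mu = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)
    sigma2 = (w * (x[:, None] - mu) ** 2).sum(axis=0) / w.sum(axis=0)
```

Each iteration is guaranteed not to decrease the likelihood, so the estimated means drift from the rough initialization toward the true source locations.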
78 | 79 | 14. **k-means clustering** 80 | 81 | ⟶ 82 | k-means 分群法 83 |
84 | 85 | 15. **We note c(i) the cluster of data point i and μj the center of cluster j.** 86 | 87 | ⟶ 88 | 我們使用 c(i) 表示資料 i 屬於某群,而 μj 則是群 j 的中心 89 |
90 | 91 | 16. **Algorithm ― After randomly initializing the cluster centroids μ1,μ2,...,μk∈Rn, the k-means algorithm repeats the following step until convergence:** 92 | 93 | ⟶ 94 | 演算法 - 在隨機初始化群集中心點 μ1,μ2,...,μk∈Rn 後,k-means 演算法重複以下步驟直到收斂: 95 |
96 | 97 | 17. **[Means initialization, Cluster assignment, Means update, Convergence]** 98 | 99 | ⟶ 100 | [中心點初始化, 指定群集, 更新中心點, 收斂] 101 |
102 | 103 | 18. **Distortion function ― In order to see if the algorithm converges, we look at the distortion function defined as follows:** 104 | 105 | ⟶ 106 | 畸變函數 - 為了確認演算法是否收斂,我們定義以下的畸變函數: 107 |
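Entries 16-18 (cluster assignment, means update, distortion) can be sketched as follows (the two well-separated blobs and the seed are hypothetical toy choices):

```python
import numpy as np

rng = np.random.default_rng(1)
# two well-separated toy blobs in R^2
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
mu = X[rng.choice(len(X), 2, replace=False)]          # random means initialization

for _ in range(20):
    d = np.linalg.norm(X[:, None] - mu[None], axis=2) # distance to each centroid
    c = d.argmin(axis=1)                              # cluster assignment c(i)
    mu = np.array([X[c == j].mean(axis=0) for j in range(2)])  # means update

distortion = ((X - mu[c]) ** 2).sum()                 # J(c, mu)
```

The distortion J(c,μ) is non-increasing across iterations, which is exactly how convergence is monitored.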
108 | 109 | 19. **Hierarchical clustering** 110 | 111 | ⟶ 112 | 階層式分群法 113 |
114 | 115 | 20. **Algorithm ― It is a clustering algorithm with an agglomerative hierarchical approach that builds nested clusters in a successive manner.** 116 | 117 | ⟶ 118 | 演算法 - 這是一種採用聚合式 (agglomerative) 階層架構的分群演算法,以連續的方式建立巢狀群集。 119 |
120 | 121 | 21. **Types ― There are different sorts of hierarchical clustering algorithms that aims at optimizing different objective functions, which is summed up in the table below:** 122 | 123 | ⟶ 124 | 類型 - 底下是幾種不同類型的階層式分群法,差別在於要最佳化的目標函式的不同,請參考底下: 125 |
126 | 127 | 22. **[Ward linkage, Average linkage, Complete linkage]** 128 | 129 | ⟶ 130 | [Ward 鏈結距離, 平均鏈結距離, 完整鏈結距離] 131 |
132 | 133 | 23. **[Minimize within cluster distance, Minimize average distance between cluster pairs, Minimize maximum distance of between cluster pairs]** 134 | 135 | ⟶ 136 | [最小化群內距離, 最小化各群彼此的平均距離, 最小化各群彼此的最大距離] 137 |
138 | 139 | 24. **Clustering assessment metrics** 140 | 141 | ⟶ 142 | 分群衡量指標 143 |
144 | 145 | 25. **In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground truth labels as was the case in the supervised learning setting.** 146 | 147 | ⟶ 148 | 在非監督式學習中,通常很難評估一個模型的表現,因為我們不像在監督式學習中那樣擁有正確答案的標籤 149 |
150 | 151 | 26. **Silhouette coefficient ― By noting a and b the mean distance between a sample and all other points in the same class, and between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single sample is defined as follows:** 152 | 153 | ⟶ 154 | 輪廓係數 (Silhouette coefficient) - 我們指定 a 為一個樣本點和相同群集中其他資料點的平均距離、b 為一個樣本點和下一個最接近群集其他資料點的平均距離,輪廓係數 s 對於此一樣本點的定義為: 155 |
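A direct transcription of the definition s=(b−a)/max(a,b) for a single sample (the toy points are illustrative):

```python
import numpy as np

def silhouette_sample(x, same_cluster, next_cluster):
    """Silhouette coefficient for one sample given its own and nearest other cluster."""
    a = np.linalg.norm(same_cluster - x, axis=1).mean()  # mean intra-cluster distance
    b = np.linalg.norm(next_cluster - x, axis=1).mean()  # mean distance to next nearest cluster
    return (b - a) / max(a, b)

x = np.array([0.0, 0.0])
own = np.array([[0.1, 0.0], [0.0, 0.1]])       # tight cluster around x
other = np.array([[5.0, 5.0], [5.0, 4.0]])     # far-away cluster
s = silhouette_sample(x, own, other)           # close to 1: well-clustered sample
```

Values near 1 indicate a sample much closer to its own cluster than to the next one; values near -1 suggest a likely misassignment.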
156 | 157 | 27. **Calinski-Harabaz index ― By noting k the number of clusters, Bk and Wk the between and within-clustering dispersion matrices respectively defined as** 158 | 159 | ⟶ 160 | Calinski-Harabaz 指標 - 定義 k 是群集的數量,Bk 和 Wk 分別是群集之間和群集之內的離差矩陣 (dispersion matrices): 161 |
162 | 163 | 28. **the Calinski-Harabaz index s(k) indicates how well a clustering model defines its clusters, such that the higher the score, the more dense and well separated the clusters are. It is defined as follows:** 164 | 165 | ⟶ 166 | Calinski-Harabaz 指標 s(k) 指出分群模型的好壞,此指標的值越高,代表分群模型的表現越好。定義如下: 167 |
168 | 169 | 29. **Dimension reduction** 170 | 171 | ⟶ 172 | 維度縮減 173 |
174 | 175 | 30. **Principal component analysis** 176 | 177 | ⟶ 178 | 主成份分析 179 |
180 | 181 | 31. **It is a dimension reduction technique that finds the variance maximizing directions onto which to project the data.** 182 | 183 | ⟶ 184 | 這是一種維度縮減技巧,目的在於找出能使投影後資料變異數最大化的方向 185 |
186 | 187 | 32. **Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:** 188 | 189 | ⟶ 190 | 特徵值、特徵向量 - 給定一個矩陣 A∈Rn×n,我們說 λ 是 A 的特徵值,當存在一個特徵向量 z∈Rn∖{0},使得: 191 |
192 | 193 | 33. **Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:** 194 | 195 | ⟶ 196 | 譜定理 - 令 A∈Rn×n,如果 A 是對稱的,則 A 可以透過實數正交矩陣 U∈Rn×n 對角化。令 Λ=diag(λ1,...,λn),我們得到: 197 |
198 | 199 | 34. **diagonal** 200 | 201 | ⟶ 202 | 對角線 203 |
204 | 205 | 35. **Remark: the eigenvector associated with the largest eigenvalue is called principal eigenvector of matrix A.** 206 | 207 | ⟶ 208 | 注意:與最大特徵值相關聯的特徵向量稱為矩陣 A 的主特徵向量 209 |
210 | 211 | 36. **Algorithm ― The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k dimensions by maximizing the variance of the data as follows:** 212 | 213 | ⟶ 214 | 演算法 - 主成份分析 (PCA) 是一種維度縮減的技巧,它會透過尋找資料最大變異的方式,將資料投影在 k 維空間上: 215 |
216 | 217 | 37. **Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.** 218 | 219 | ⟶ 220 | 第一步:正規化資料,讓資料平均為 0,變異數為 1 221 |
222 | 223 | 38. **Step 2: Compute Σ=1mm∑i=1x(i)x(i)T∈Rn×n, which is symmetric with real eigenvalues.** 224 | 225 | ⟶ 226 | 第二步:計算 Σ=1mm∑i=1x(i)x(i)T∈Rn×n,它是一個具有實數特徵值的對稱矩陣 227 |
228 | 229 | 39. **Step 3: Compute u1,...,uk∈Rn the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors of the k largest eigenvalues.** 230 | 231 | ⟶ 232 | 第三步:計算 Σ 的 k 個正交主特徵向量 u1,...,uk∈Rn,也就是 k 個最大特徵值所對應的正交特徵向量 233 |
234 | 235 | 40. **Step 4: Project the data on spanR(u1,...,uk).** 236 | 237 | ⟶ 238 | 第四步:將資料投影到 spanR(u1,...,uk) 239 |
240 | 241 | 41. **This procedure maximizes the variance among all k-dimensional spaces.** 242 | 243 | ⟶ 244 | 這個步驟會最大化所有 k 維空間的變異數 245 |
246 | 247 | 42. **[Data in feature space, Find principal components, Data in principal components space]** 248 | 249 | ⟶ 250 | [資料在特徵空間, 尋找主成分, 資料在主成分空間] 251 |
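Steps 1-4 of the PCA procedure can be sketched with NumPy's symmetric eigendecomposition (the nearly one-dimensional toy data and seed are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
# nearly one-dimensional toy data in R^2: a latent factor plus small noise
X = rng.normal(size=(200, 1)) @ np.array([[2.0, 1.0]]) + 0.1 * rng.normal(size=(200, 2))

Xn = (X - X.mean(axis=0)) / X.std(axis=0)   # step 1: zero mean, unit variance
Sigma = Xn.T @ Xn / len(Xn)                 # step 2: empirical covariance matrix
vals, vecs = np.linalg.eigh(Sigma)          # step 3: eigendecomposition (ascending eigenvalues)
U = vecs[:, ::-1][:, :1]                    # principal eigenvector (largest eigenvalue)
Z = Xn @ U                                  # step 4: projection on span(u1)
```

Because the data is almost one-dimensional, the top eigenvalue captures nearly all of the variance, which is exactly the property PCA exploits.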
252 | 253 | 43. **Independent component analysis** 254 | 255 | ⟶ 256 | 獨立成分分析 257 |
258 | 259 | 44. **It is a technique meant to find the underlying generating sources.** 260 | 261 | ⟶ 262 | 這是用來尋找潛在生成來源的技巧 263 |
264 | 265 | 45. **Assumptions ― We assume that our data x has been generated by the n-dimensional source vector s=(s1,...,sn), where si are independent random variables, via a mixing and non-singular matrix A as follows:** 266 | 267 | ⟶ 268 | 假設 - 我們假設資料 x 是從 n 維的來源向量 s=(s1,...,sn) 產生,si 為獨立變數,透過一個混合與非奇異矩陣 A 產生如下: 269 |
270 | 271 | 46. **The goal is to find the unmixing matrix W=A−1.** 272 | 273 | ⟶ 274 | 目的在於找到一個 unmixing 矩陣 W=A−1 275 |
276 | 277 | 47. **Bell and Sejnowski ICA algorithm ― This algorithm finds the unmixing matrix W by following the steps below:** 278 | 279 | ⟶ 280 | Bell 和 Sejnowski 獨立成份分析演算法 - 此演算法透過以下步驟來找到 unmixing 矩陣: 281 |
282 | 283 | 48. **Write the probability of x=As=W−1s as:** 284 | 285 | ⟶ 286 | 將 x=As=W−1s 的機率寫成: 287 |
288 | 289 | 49. **Write the log likelihood given our training data {x(i),i∈[[1,m]]} and by noting g the sigmoid function as:** 290 | 291 | ⟶ 292 | 在給定訓練資料 {x(i),i∈[[1,m]]} 並令 g 為 sigmoid 函數的情況下,將對數概似函數寫成: 293 |
294 | 295 | 50. **Therefore, the stochastic gradient ascent learning rule is such that for each training example x(i), we update W as follows:** 296 | 297 | ⟶ 298 | 因此,隨機梯度上升 (stochastic gradient ascent) 學習規則是,對於每個訓練樣本 x(i),我們透過以下方法來更新 W: 299 |
4 | 5 | **1. Probabilities and Statistics refresher** 6 | 7 | ⟶ 8 | 9 |
10 | 11 | **2. Introduction to Probability and Combinatorics** 12 | 13 | ⟶ 14 | 15 |
16 | 17 | **3. Sample space ― The set of all possible outcomes of an experiment is known as the sample space of the experiment and is denoted by S.** 18 | 19 | ⟶ 20 | 21 |
22 | 23 | **4. Event ― Any subset E of the sample space is known as an event. That is, an event is a set consisting of possible outcomes of the experiment. If the outcome of the experiment is contained in E, then we say that E has occurred.** 24 | 25 | ⟶ 26 | 27 |
28 | 29 | **5. Axioms of probability For each event E, we denote P(E) as the probability of event E occuring.** 30 | 31 | ⟶ 32 | 33 |
34 | 35 | **6. Axiom 1 ― Every probability is between 0 and 1 included, i.e:** 36 | 37 | ⟶ 38 | 39 |
40 | 41 | **7. Axiom 2 ― The probability that at least one of the elementary events in the entire sample space will occur is 1, i.e:** 42 | 43 | ⟶ 44 | 45 |
46 | 47 | **8. Axiom 3 ― For any sequence of mutually exclusive events E1,...,En, we have:** 48 | 49 | ⟶ 50 | 51 |
52 | 53 | **9. Permutation ― A permutation is an arrangement of r objects from a pool of n objects, in a given order. The number of such arrangements is given by P(n,r), defined as:** 54 | 55 | ⟶ 56 | 57 |
58 | 59 | **10. Combination ― A combination is an arrangement of r objects from a pool of n objects, where the order does not matter. The number of such arrangements is given by C(n,r), defined as:** 60 | 61 | ⟶ 62 | 63 |
64 | 65 | **11. Remark: we note that for 0⩽r⩽n, we have P(n,r)⩾C(n,r)** 66 | 67 | ⟶ 68 | 69 |
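Both counts, and the remark P(n,r)⩾C(n,r), can be checked directly with Python's standard library:

```python
from math import comb, factorial, perm

# P(n,r) counts ordered arrangements; C(n,r) counts unordered ones, C(n,r) = P(n,r)/r!.
n, r = 5, 2
P_nr = perm(n, r)     # n!/(n-r)! = 5*4 = 20
C_nr = comb(n, r)     # P(n,r)/r! = 10
```

`math.perm` and `math.comb` (Python 3.8+) implement exactly the formulas P(n,r)=n!/(n−r)! and C(n,r)=n!/(r!(n−r)!).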
70 | 71 | **12. Conditional Probability** 72 | 73 | ⟶ 74 | 75 |
76 | 77 | **13. Bayes' rule ― For events A and B such that P(B)>0, we have:** 78 | 79 | ⟶ 80 | 81 |
82 | 83 | **14. Remark: we have P(A∩B)=P(A)P(B|A)=P(A|B)P(B)** 84 | 85 | ⟶ 86 | 87 |
88 | 89 | **15. Partition ― Let {Ai,i∈[[1,n]]} be such that for all i, Ai≠∅. We say that {Ai} is a partition if we have:** 90 | 91 | ⟶ 92 | 93 |
94 | 95 | **16. Remark: for any event B in the sample space, we have P(B)=n∑i=1P(B|Ai)P(Ai).** 96 | 97 | ⟶ 98 | 99 |
100 | 101 | **17. Extended form of Bayes' rule ― Let {Ai,i∈[[1,n]]} be a partition of the sample space. We have:** 102 | 103 | ⟶ 104 | 105 |
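A numeric sketch of the extended form of Bayes' rule on a hypothetical two-machine defect example (all probabilities below are invented for illustration):

```python
# Partition {A1, A2}: which of two machines produced a part.
p_A = [0.6, 0.4]            # P(A1), P(A2): production shares
p_B_given_A = [0.01, 0.03]  # P(B | Ai): defect rate of each machine

# Total probability: P(B) = sum_i P(B | Ai) P(Ai)
p_B = sum(pb * pa for pb, pa in zip(p_B_given_A, p_A))

# Extended Bayes' rule: P(A1 | B) = P(B | A1) P(A1) / P(B)
p_A1_given_B = p_B_given_A[0] * p_A[0] / p_B
```

Here a defective part came from machine 1 with posterior probability 0.006/0.018 = 1/3, even though machine 1 produces the majority of parts.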
106 | 107 | **18. Independence ― Two events A and B are independent if and only if we have:** 108 | 109 | ⟶ 110 | 111 |
112 | 113 | **19. Random Variables** 114 | 115 | ⟶ 116 | 117 |
118 | 119 | **20. Definitions** 120 | 121 | ⟶ 122 | 123 |
124 | 125 | **21. Random variable ― A random variable, often noted X, is a function that maps every element in a sample space to a real line.** 126 | 127 | ⟶ 128 | 129 |
130 | 131 | **22. Cumulative distribution function (CDF) ― The cumulative distribution function F, which is monotonically non-decreasing and is such that limx→−∞F(x)=0 and limx→+∞F(x)=1, is defined as:** 132 | 133 | ⟶ 134 | 135 |
136 | 137 | **23. Remark: we have P(a 142 | 143 | **24. Probability density function (PDF) ― The probability density function f is the probability that X takes on values between two adjacent realizations of the random variable.** 144 | 145 | ⟶ 146 | 147 |
148 | 149 | **25. Relationships involving the PDF and CDF ― Here are the important properties to know in the discrete (D) and the continuous (C) cases.** 150 | 151 | ⟶ 152 | 153 |
154 | 155 | **26. [Case, CDF F, PDF f, Properties of PDF]** 156 | 157 | ⟶ 158 | 159 |
160 | 161 | **27. Expectation and Moments of the Distribution ― Here are the expressions of the expected value E[X], generalized expected value E[g(X)], kth moment E[Xk] and characteristic function ψ(ω) for the discrete and continuous cases:** 162 | 163 | ⟶ 164 | 165 |
166 | 167 | **28. Variance ― The variance of a random variable, often noted Var(X) or σ2, is a measure of the spread of its distribution function. It is determined as follows:** 168 | 169 | ⟶ 170 | 171 |
172 | 173 | **29. Standard deviation ― The standard deviation of a random variable, often noted σ, is a measure of the spread of its distribution function which is compatible with the units of the actual random variable. It is determined as follows:** 174 | 175 | ⟶ 176 | 177 |
178 | 179 | **30. Transformation of random variables ― Let the variables X and Y be linked by some function. By noting fX and fY the distribution function of X and Y respectively, we have:** 180 | 181 | ⟶ 182 | 183 |
184 | 185 | **31. Leibniz integral rule ― Let g be a function of x and potentially c, and a,b boundaries that may depend on c. We have:** 186 | 187 | ⟶ 188 | 189 |
190 | 191 | **32. Probability Distributions** 192 | 193 | ⟶ 194 | 195 |
196 | 197 | **33. Chebyshev's inequality ― Let X be a random variable with expected value μ. For k,σ>0, we have the following inequality:** 198 | 199 | ⟶ 200 | 201 |
202 | 203 | **34. Main distributions ― Here are the main distributions to have in mind:** 204 | 205 | ⟶ 206 | 207 |
208 | 209 | **35. [Type, Distribution]** 210 | 211 | ⟶ 212 | 213 |
214 | 215 | **36. Jointly Distributed Random Variables** 216 | 217 | ⟶ 218 | 219 |
220 | 221 | **37. Marginal density and cumulative distribution ― From the joint density probability function fXY , we have** 222 | 223 | ⟶ 224 | 225 |
226 | 227 | **38. [Case, Marginal density, Cumulative function]** 228 | 229 | ⟶ 230 | 231 |
232 | 233 | **39. Conditional density ― The conditional density of X with respect to Y, often noted fX|Y, is defined as follows:** 234 | 235 | ⟶ 236 | 237 |
238 | 239 | **40. Independence ― Two random variables X and Y are said to be independent if we have:** 240 | 241 | ⟶ 242 | 243 |
244 | 245 | **41. Covariance ― We define the covariance of two random variables X and Y, that we note σ2XY or more commonly Cov(X,Y), as follows:** 246 | 247 | ⟶ 248 | 249 |
250 | 251 | **42. Correlation ― By noting σX,σY the standard deviations of X and Y, we define the correlation between the random variables X and Y, noted ρXY, as follows:** 252 | 253 | ⟶ 254 | 255 |
256 | 257 | **43. Remark 1: we note that for any random variables X,Y, we have ρXY∈[−1,1].** 258 | 259 | ⟶ 260 | 261 |
262 | 263 | **44. Remark 2: If X and Y are independent, then ρXY=0.** 264 | 265 | ⟶ 266 | 267 |
268 | 269 | **45. Parameter estimation** 270 | 271 | ⟶ 272 | 273 |
274 | 275 | **46. Definitions** 276 | 277 | ⟶ 278 | 279 |
280 | 281 | **47. Random sample ― A random sample is a collection of n random variables X1,...,Xn that are independent and identically distributed with X.** 282 | 283 | ⟶ 284 | 285 |
286 | 287 | **48. Estimator ― An estimator is a function of the data that is used to infer the value of an unknown parameter in a statistical model.** 288 | 289 | ⟶ 290 | 291 |
292 | 293 | **49. Bias ― The bias of an estimator ^θ is defined as being the difference between the expected value of the distribution of ^θ and the true value, i.e.:** 294 | 295 | ⟶ 296 | 297 |
298 | 299 | **50. Remark: an estimator is said to be unbiased when we have E[^θ]=θ.** 300 | 301 | ⟶ 302 | 303 |
304 | 305 | **51. Estimating the mean** 306 | 307 | ⟶ 308 | 309 |
310 | 311 | **52. Sample mean ― The sample mean of a random sample is used to estimate the true mean μ of a distribution, is often noted ¯¯¯¯¯X and is defined as follows:** 312 | 313 | ⟶ 314 | 315 |
316 | 317 | **53. Remark: the sample mean is unbiased, i.e E[¯¯¯¯¯X]=μ.** 318 | 319 | ⟶ 320 | 321 |
322 | 323 | **54. Central Limit Theorem ― Let us have a random sample X1,...,Xn following a given distribution with mean μ and variance σ2, then we have:** 324 | 325 | ⟶ 326 | 327 |
328 | 329 | **55. Estimating the variance** 330 | 331 | ⟶ 332 | 333 |
334 | 335 | **56. Sample variance ― The sample variance of a random sample is used to estimate the true variance σ2 of a distribution, is often noted s2 or ^σ2 and is defined as follows:** 336 | 337 | ⟶ 338 | 339 |
340 | 341 | **57. Remark: the sample variance is unbiased, i.e E[s2]=σ2.** 342 | 343 | ⟶ 344 | 345 |
346 | 347 | **58. Chi-Squared relation with sample variance ― Let s2 be the sample variance of a random sample. We have:** 348 | 349 | ⟶ 350 | 351 |
352 | 353 | **59. [Introduction, Sample space, Event, Permutation]** 354 | 355 | ⟶ 356 | 357 |
358 | 359 | **60. [Conditional probability, Bayes' rule, Independence]** 360 | 361 | ⟶ 362 | 363 |
364 | 365 | **61. [Random variables, Definitions, Expectation, Variance]** 366 | 367 | ⟶ 368 | 369 |
370 | 371 | **62. [Probability distributions, Chebyshev's inequality, Main distributions]** 372 | 373 | ⟶ 374 | 375 |
376 | 377 | **63. [Jointly distributed random variables, Density, Covariance, Correlation]** 378 | 379 | ⟶ 380 | 381 |
382 | 383 | **64. [Parameter estimation, Mean, Variance]** 384 | 385 | ⟶ 386 | -------------------------------------------------------------------------------- /zh-tw/cs-229-linear-algebra.md: -------------------------------------------------------------------------------- 1 | 1. **Linear Algebra and Calculus refresher** 2 | 3 | ⟶ 4 | 線性代數與微積分回顧 5 |
6 | 7 | 2. **General notations** 8 | 9 | ⟶ 10 | 通用符號 11 |
12 | 13 | 3. **Definitions** 14 | 15 | ⟶ 16 | 定義 17 |
18 | 19 | 4. **Vector ― We note x∈Rn a vector with n entries, where xi∈R is the ith entry:** 20 | 21 | ⟶ 22 | 向量 - 我們定義 x∈Rn 是一個向量,包含 n 維元素,xi∈R 是第 i 維元素: 23 |
24 | 25 | 5. **Matrix ― We note A∈Rm×n a matrix with m rows and n columns, where Ai,j∈R is the entry located in the ith row and jth column:** 26 | 27 | ⟶ 28 | 矩陣 - 我們定義 A∈Rm×n 是一個 m 列 n 行的矩陣,Ai,j∈R 代表位在第 i 列第 j 行的元素: 29 |
30 | 31 | 6. **Remark: the vector x defined above can be viewed as a n×1 matrix and is more particularly called a column-vector.** 32 | 33 | ⟶ 34 | 注意:上述定義的向量 x 可以視為 nx1 的矩陣,或是更常被稱為行向量 35 |
36 | 37 | 7. **Main matrices** 38 | 39 | ⟶ 40 | 主要的矩陣 41 |
42 | 43 | 8. **Identity matrix ― The identity matrix I∈Rn×n is a square matrix with ones in its diagonal and zero everywhere else:** 44 | 45 | ⟶ 46 | 單位矩陣 - 單位矩陣 I∈Rn×n 是一個方陣,其主對角線皆為 1,其餘皆為 0 47 |
48 | 49 | 9. **Remark: for all matrices A∈Rn×n, we have A×I=I×A=A.** 50 | 51 | ⟶ 52 | 注意:對於所有矩陣 A∈Rn×n,我們有 A×I=I×A=A 53 |
54 | 55 | 10. **Diagonal matrix ― A diagonal matrix D∈Rn×n is a square matrix with nonzero values in its diagonal and zero everywhere else:** 56 | 57 | ⟶ 58 | 對角矩陣 - 對角矩陣 D∈Rn×n 是一個方陣,其主對角線為非 0,其餘皆為 0 59 |
60 | 61 | 11. **Remark: we also note D as diag(d1,...,dn).** 62 | 63 | ⟶ 64 | 注意:我們令 D 為 diag(d1,...,dn) 65 |
66 | 67 | 12. **Matrix operations** 68 | 69 | ⟶ 70 | 矩陣運算 71 |
72 | 73 | 13. **Multiplication** 74 | 75 | ⟶ 76 | 乘法 77 |
78 | 79 | 14. **Vector-vector ― There are two types of vector-vector products:** 80 | 81 | ⟶ 82 | 向量-向量 - 有兩種類型的向量-向量相乘: 83 |
84 | 85 | 15. **inner product: for x,y∈Rn, we have:** 86 | 87 | ⟶ 88 | 內積:對於 x,y∈Rn,我們可以得到: 89 |
90 | 91 | 16. **outer product: for x∈Rm,y∈Rn, we have:** 92 | 93 | ⟶ 94 | 外積:對於 x∈Rm,y∈Rn,我們可以得到: 95 |
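The two vector-vector products of entries 15-16 can be sketched in NumPy:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
inner = x @ y            # x^T y = 1*4 + 2*5 + 3*6 = 32, a scalar
outer = np.outer(x, y)   # x y^T, a 3x3 matrix with outer[i, j] = x[i] * y[j]
```

The inner product collapses two vectors to a scalar, while the outer product of x∈Rm and y∈Rn expands them into an m×n matrix.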
96 | 97 | 17. **Matrix-vector ― The product of matrix A∈Rm×n and vector x∈Rn is a vector of size Rm, such that:** 98 | 99 | ⟶ 100 | 矩陣-向量 - 矩陣 A∈Rm×n 和向量 x∈Rn 的乘積是一個大小為 Rm 的向量,使得: 101 |
102 | 103 | 18. **where aTr,i are the vector rows and ac,j are the vector columns of A, and xi are the entries of x.** 104 | 105 | ⟶ 106 | 其中 aTr,i 是 A 的列向量、ac,j 是 A 的行向量、xi 是 x 的元素 107 |
108 | 109 | 19. **Matrix-matrix ― The product of matrices A∈Rm×n and B∈Rn×p is a matrix of size Rm×p, such that:** 110 | 111 | ⟶ 112 | 矩陣-矩陣 - 矩陣 A∈Rm×n 和 B∈Rn×p 的乘積為一個大小 Rm×p 的矩陣,使得: 113 |
114 | 115 | 20. **where aTr,i,bTr,i are the vector rows and ac,j,bc,j are the vector columns of A and B respectively** 116 | 117 | ⟶ 118 | 其中,aTr,i,bTr,i 和 ac,j,bc,j 分別是 A 和 B 的列向量與行向量 119 |
120 | 121 | 21. **Other operations** 122 | 123 | ⟶ 124 | 其他操作 125 |
126 | 127 | 22. **Transpose ― The transpose of a matrix A∈Rm×n, noted AT, is such that its entries are flipped:** 128 | 129 | ⟶ 130 | 轉置 - 一個矩陣的轉置矩陣 A∈Rm×n,記作 AT,指的是其中元素的翻轉: 131 |
132 | 133 | 23. **Remark: for matrices A,B, we have (AB)T=BTAT** 134 | 135 | ⟶ 136 | 注意:對於矩陣 A、B,我們有 (AB)T=BTAT 137 |
138 | 139 | 24. **Inverse ― The inverse of an invertible square matrix A is noted A−1 and is the only matrix such that:** 140 | 141 | ⟶ 142 | 反矩陣 - 一個可逆方陣 A 的反矩陣記作 A−1,是唯一滿足下式的矩陣: 143 |
144 | 145 | 25. **Remark: not all square matrices are invertible. Also, for matrices A,B, we have (AB)−1=B−1A−1** 146 | 147 | ⟶ 148 | 注意:並非所有的方陣都是可逆的。同樣的,對於矩陣 A、B 來說,我們有 (AB)−1=B−1A−1 149 |
150 | 151 | 26. **Trace ― The trace of a square matrix A, noted tr(A), is the sum of its diagonal entries:** 152 | 153 | ⟶ 154 | 跡 - 一個方陣 A 的跡,記作 tr(A),指的是主對角線元素之合: 155 |
156 | 157 | 27. **Remark: for matrices A,B, we have tr(AT)=tr(A) and tr(AB)=tr(BA)** 158 | 159 | ⟶ 160 | 注意:對於矩陣 A、B 來說,我們有 tr(AT)=tr(A) 及 tr(AB)=tr(BA) 161 |
162 | 163 | 28. **Determinant ― The determinant of a square matrix A∈Rn×n, noted |A| or det(A) is expressed recursively in terms of A∖i,∖j, which is the matrix A without its ith row and jth column, as follows:** 164 | 165 | ⟶ 166 | 行列式 - 一個方陣 A∈Rn×n 的行列式,記作|A| 或 det(A),可以透過 A∖i,∖j 來遞迴表示,它是一個沒有第 i 列和第 j 行的矩陣 A: 167 |
168 | 169 | 29. **Remark: A is invertible if and only if |A|≠0. Also, |AB|=|A||B| and |AT|=|A|.** 170 | 171 | ⟶ 172 | 注意:A 是一個可逆矩陣,若且唯若 |A|≠0。同樣的,|AB|=|A||B| 且 |AT|=|A| 173 |
174 | 175 | 30. **Matrix properties** 176 | 177 | ⟶ 178 | 矩陣的性質 179 |
180 | 181 | 31. **Definitions** 182 | 183 | ⟶ 184 | 定義 185 |
186 | 187 | 32. **Symmetric decomposition ― A given matrix A can be expressed in terms of its symmetric and antisymmetric parts as follows:** 188 | 189 | ⟶ 190 | 對稱分解 - 給定一個矩陣 A,它可以透過其對稱和反對稱的部分表示如下: 191 |
192 | 193 | 33. **[Symmetric, Antisymmetric]** 194 | 195 | ⟶ 196 | [對稱, 反對稱] 197 |
198 | 199 | 34. **Norm ― A norm is a function N:V⟶[0,+∞[ where V is a vector space, and such that for all x,y∈V, we have:** 200 | 201 | ⟶ 202 | 範數 - 範數指的是一個函式 N:V⟶[0,+∞[,其中 V 是一個向量空間,且對於所有 x,y∈V,我們有: 203 |
204 | 205 | 35. **N(ax)=|a|N(x) for a scalar** 206 | 207 | ⟶ 208 | 對一個純量來說,我們有 N(ax)=|a|N(x) 209 |
210 | 211 | 36. **if N(x)=0, then x=0** 212 | 213 | ⟶ 214 | 若 N(x)=0 時,則 x=0 215 |
216 | 217 | 37. **For x∈V, the most commonly used norms are summed up in the table below:** 218 | 219 | ⟶ 220 | 對於 x∈V,最常用的範數總結如下表: 221 |
222 | 223 | 38. **[Norm, Notation, Definition, Use case]** 224 | 225 | ⟶ 226 | [範數, 表示法, 定義, 使用情境] 227 |
228 | 229 | 39. **Linear dependence ― A set of vectors is said to be linearly dependent if one of the vectors in the set can be defined as a linear combination of the others.** 230 | 231 | ⟶ 232 | 線性相關 - 當集合中的一個向量可以被表示為集合中其他向量的線性組合時,則稱此集合的向量為線性相關 233 |
234 | 235 | 40. **Remark: if no vector can be written this way, then the vectors are said to be linearly independent** 236 | 237 | ⟶ 238 | 注意:如果沒有向量可以如上表示時,則稱此集合的向量彼此為線性獨立 239 |
240 | 241 | 41. **Matrix rank ― The rank of a given matrix A is noted rank(A) and is the dimension of the vector space generated by its columns. This is equivalent to the maximum number of linearly independent columns of A.** 242 | 243 | ⟶ 244 | 矩陣的秩 - 一個矩陣 A 的秩記作 rank(A),指的是由其行向量所生成的向量空間之維度,等價於 A 線性獨立行向量的最大數目 245 |
246 | 247 | 42. **Positive semi-definite matrix ― A matrix A∈Rn×n is positive semi-definite (PSD) and is noted A⪰0 if we have:** 248 | 249 | ⟶ 250 | 半正定矩陣 - 當以下成立時,一個矩陣 A∈Rn×n 是半正定矩陣 (PSD),且記作A⪰0: 251 |
252 | 253 | 43. **Remark: similarly, a matrix A is said to be positive definite, and is noted A≻0, if it is a PSD matrix which satisfies for all non-zero vector x, xTAx>0.** 254 | 255 | ⟶ 256 | 注意:同樣的,一個矩陣 A 是一個半正定矩陣 (PSD),且滿足所有非零向量 x,xTAx>0 時,稱之為正定矩陣,記作 A≻0 257 |
258 | 259 | 44. **Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:** 260 | 261 | ⟶ 262 | 特徵值、特徵向量 - 給定一個矩陣 A∈Rn×n,當存在一個向量 z∈Rn∖{0} 時,此向量被稱為特徵向量,λ 稱之為 A 的特徵值,且滿足: 263 |
264 | 265 | 45. **Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:** 266 | 267 | ⟶ 268 | 譜分解 - 令 A∈Rn×n,如果 A 是對稱的,則 A 可以被一個實數正交矩陣 U∈Rn×n 給對角化。令 Λ=diag(λ1,...,λn),我們得到: 269 |
270 | 271 | 46. **diagonal** 272 | 273 | ⟶ 274 | 對角線 275 |
276 | 277 | 47. **Singular-value decomposition ― For a given matrix A of dimensions m×n, the singular-value decomposition (SVD) is a factorization technique that guarantees the existence of U m×m unitary, Σ m×n diagonal and V n×n unitary matrices, such that:** 278 | 279 | ⟶ 280 | 奇異值分解 - 對於給定維度為 mxn 的矩陣 A,其奇異值分解指的是一種因子分解技巧,保證存在 mxm 的單式矩陣 U、對角線矩陣 Σ m×n 和 nxn 的單式矩陣 V,滿足: 281 |
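A sketch of the factorization A=UΣVᵀ with NumPy (the matrix entries are arbitrary); `full_matrices=True` returns the m×m and n×n unitary factors described above:

```python
import numpy as np

A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])                  # arbitrary 2x3 matrix
U, s, Vt = np.linalg.svd(A, full_matrices=True)   # U: 2x2, Vt: 3x3, s: singular values
Sigma = np.zeros_like(A)
Sigma[:len(s), :len(s)] = np.diag(s)              # embed singular values into the 2x3 Sigma
reconstructed = U @ Sigma @ Vt                    # recovers A
```

NumPy returns the singular values as a 1-D array in descending order, so Σ has to be rebuilt as an m×n matrix before multiplying the factors back together.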
282 | 283 | 48. **Matrix calculus** 284 | 285 | ⟶ 286 | 矩陣導數 287 |
288 | 289 | 49. **Gradient ― Let f:Rm×n→R be a function and A∈Rm×n be a matrix. The gradient of f with respect to A is a m×n matrix, noted ∇Af(A), such that:** 290 | 291 | ⟶ 292 | 梯度 - 令 f:Rm×n→R 是一個函式,且 A∈Rm×n 是一個矩陣。f 相對於 A 的梯度是一個 mxn 的矩陣,記作 ∇Af(A),滿足: 293 |
294 | 295 | 50. **Remark: the gradient of f is only defined when f is a function that returns a scalar.** 296 | 297 | ⟶ 298 | 注意:f 的梯度僅在 f 為一個函數且該函數回傳一個純量時有效 299 |
300 | 301 | 51. **Hessian ― Let f:Rn→R be a function and x∈Rn be a vector. The hessian of f with respect to x is a n×n symmetric matrix, noted ∇2xf(x), such that:** 302 | 303 | ⟶ 304 | 海森 - 令 f:Rn→R 是一個函式,且 x∈Rn 是一個向量,則一個 f 的海森對於向量 x 是一個 nxn 的對稱矩陣,記作 ∇2xf(x),滿足: 305 |
306 | 307 | 52. **Remark: the hessian of f is only defined when f is a function that returns a scalar** 308 | 309 | ⟶ 310 | 注意:f 的海森僅在 f 為一個函數且該函數回傳一個純量時有效 311 |
312 | 313 | 53. **Gradient operations ― For matrices A,B,C, the following gradient properties are worth having in mind:** 314 | 315 | ⟶ 316 | 梯度運算 - 對於矩陣 A、B、C,下列的梯度性質值得牢牢記住: 317 | 54. **[General notations, Definitions, Main matrices]** 318 | 319 | ⟶ 320 | [通用符號, 定義, 主要矩陣] 321 |
322 | 323 | 55. **[Matrix operations, Multiplication, Other operations]** 324 | 325 | ⟶ 326 | [矩陣運算, 矩陣乘法, 其他運算] 327 |
328 | 329 | 56. **[Matrix properties, Norm, Eigenvalue/Eigenvector, Singular-value decomposition]** 330 | 331 | ⟶ 332 | [矩陣性質, 範數, 特徵值/特徵向量, 奇異值分解] 333 |
334 | 335 | 57. **[Matrix calculus, Gradient, Hessian, Operations]** 336 | 337 | ⟶ 338 | [矩陣導數, 梯度, 海森, 運算] -------------------------------------------------------------------------------- /ja/cs-229-deep-learning.md: -------------------------------------------------------------------------------- 1 | **1. Deep Learning cheatsheet** 2 | 3 | ⟶ ディープラーニングチートシート 4 | 5 |
6 | 7 | **2. Neural Networks** 8 | 9 | ⟶ ニューラルネットワーク 10 | 11 |
12 | 13 | **3. Neural networks are a class of models that are built with layers. Commonly used types of neural networks include convolutional and recurrent neural networks.** 14 | 15 | ⟶ ニューラルネットワークとは複数の層を用いて構成されたモデルの種類を指します。一般的に利用されるネットワークとして畳み込みニューラルネットワークとリカレントニューラルネットワークが挙げられます。 16 | 17 |
18 | 19 | **4. Architecture ― The vocabulary around neural networks architectures is described in the figure below:** 20 | 21 | ⟶ アーキテクチャ - ニューラルネットワークに関する用語は以下の図により説明されます: 22 | 23 |
24 | 25 | **5. [Input layer, hidden layer, output layer]** 26 | 27 | ⟶ [入力層, 隠れ層, 出力層] 28 | 29 |
30 | 31 | **6. By noting i the ith layer of the network and j the jth hidden unit of the layer, we have:** 32 | 33 | ⟶ iをネットワーク上のi層目の層とし、隠れ層のj個目のユニットをjとすると: 34 | 35 |
36 | 37 | **7. where we note w, b, z the weight, bias and output respectively.** 38 | 39 | ⟶ この場合重みをw、バイアス項をb、出力をzとします。 40 | 41 |
42 | 43 | **8. Activation function ― Activation functions are used at the end of a hidden unit to introduce non-linear complexities to the model. Here are the most common ones:** 44 | 45 | ⟶ 活性化関数 ー 活性化関数はモデルに非線形性を与えるために隠れユニットの最後で利用されます。一般的には以下の関数がよく使われます: 46 | 47 |
48 | 49 | **9. [Sigmoid, Tanh, ReLU, Leaky ReLU]** 50 | 51 | ⟶ [Sigmoid(シグモイド関数), Tanh(双曲線関数), ReLU(正規化線形ユニット), Leaky ReLU(漏洩ReLU)] 52 | 53 |
54 | 55 | **10. Cross-entropy loss ― In the context of neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:** 56 | 57 | ⟶ 交差エントロピーロス ー ニューラルネットにおいて交差エントロピーロスL(z,y)は一般的に使われ、以下のように定義されています: 58 | 59 |
60 | 61 | **11. Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. This can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.** 62 | 63 | ⟶ 学習率 ー 学習率は多くの場合α、しばしばηで表記され、重みが更新される速度を表します。学習率は固定または適応的に変更することができます。現在最も一般的に使われている手法はAdam(アダム)であり、学習率を適応的に変更する方法です。 64 | 65 | <br>
66 | 67 | **12. Backpropagation ― Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to weight w is computed using chain rule and is of the following form:** 68 | 69 | ⟶ 誤差逆伝播法(backpropagation)ー 誤差逆伝播法はニューラルネットにおいて実際の出力と期待される出力との差異を考慮して重みを更新する方法の一つです。重みwに関する導関数は連鎖律を使用して計算され、次の形式で表されます: 70 | 71 |
72 | 73 | **13. As a result, the weight is updated as follows:** 74 | 75 | ⟶ 結果として、重みは以下のように更新されます: 76 | 77 |
78 | 79 | **14. Updating weights ― In a neural network, weights are updated as follows:** 80 | 81 | ⟶ 重みの更新 ー ニューラルネットでは以下のように重みが更新されます: 82 | 83 |
84 | 85 | **15. Step 1: Take a batch of training data.** 86 | 87 | ⟶ ステップ1: 訓練データを1バッチ用意する。 88 | 89 |
90 | 91 | **16. Step 2: Perform forward propagation to obtain the corresponding loss.** 92 | 93 | ⟶ ステップ2: 順伝播を行いそれに対する誤差を求める。 94 | 95 |
96 | 97 | **17. Step 3: Backpropagate the loss to get the gradients.** 98 | 99 | ⟶ ステップ3: 誤差を逆伝播し、勾配を計算する。 100 | 101 | <br>
102 | 103 | **18. Step 4: Use the gradients to update the weights of the network.** 104 | 105 | ⟶ ステップ4: 勾配を使いネットワークの重みを更新する。 106 | 107 | <br>
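The four steps above can be sketched as plain gradient descent on a toy logistic unit (the data, model, and learning rate below are made up for illustration, not part of the cheatsheet):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))                    # Step 1: take a batch of training data
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)

w = np.zeros(3)
alpha = 0.1                                    # learning rate (our choice)
losses = []

for _ in range(200):
    z = 1 / (1 + np.exp(-X @ w))               # Step 2: forward propagation...
    loss = -np.mean(y * np.log(z + 1e-12) + (1 - y) * np.log(1 - z + 1e-12))
    losses.append(loss)                        # ...and the corresponding loss
    grad = X.T @ (z - y) / len(y)              # Step 3: backpropagate to get gradients
    w -= alpha * grad                          # Step 4: update the weights

print(losses[0], losses[-1])                   # the loss decreases over iterations
```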
108 | 109 | **19. Dropout ― Dropout is a technique meant at preventing overfitting the training data by dropping out units in a neural network. In practice, neurons are either dropped with probability p or kept with probability 1−p** 110 | 111 | ⟶ ドロップアウト ー ドロップアウトはニューラルネット内の一部のユニットを無効にすることで学習データへの過学習を防ぐテクニックです。実際には、ニューロンは確率pで無効、確率1-pで有効のどちらかになります。 112 | 113 | <br>
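A minimal sketch of dropout with NumPy (the "inverted dropout" scaling by 1/(1−p) is a common convention we assume here so that no rescaling is needed at test time; it is not stated in the text above):

```python
import numpy as np

def dropout(h, p, rng, train=True):
    """Drop each unit of activation h with probability p at training time.

    Uses inverted dropout: surviving units are scaled by 1/(1-p), so the
    expected activation is unchanged and test time needs no rescaling.
    """
    if not train:
        return h
    mask = rng.random(h.shape) >= p       # keep each unit with probability 1 - p
    return h * mask / (1 - p)

rng = np.random.default_rng(0)
h = np.ones(10_000)
out = dropout(h, p=0.3, rng=rng)
print(out.mean())                         # close to 1.0 in expectation
```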
114 | 115 | **20. Convolutional Neural Networks** 116 | 117 | ⟶ 畳み込みニューラルネットワーク 118 | 119 |
120 | 121 | **21. Convolutional layer requirement ― By noting W the input volume size, F the size of the convolutional layer neurons, P the amount of zero padding, then the number of neurons N that fit in a given volume is such that:** 122 | 123 | ⟶ 畳み込みレイヤーの条件 ー Wを入力サイズ、Fを畳み込み層のニューロンのサイズ、Pをゼロ埋めの量とすると、これらに対応するニューロンの数Nは次のようになります。 124 | 125 |
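With the usual stride S made explicit (the sentence above leaves it implicit), the neuron count is N=(W−F+2P)/S+1; a small sketch:

```python
def conv_output_size(W, F, P, S=1):
    """Number of neurons N that fit along one spatial dimension:
    N = (W - F + 2P) / S + 1, assuming stride S (implicit in the text above)."""
    span = W - F + 2 * P
    if span % S != 0:
        raise ValueError("the filter does not tile the padded input evenly")
    return span // S + 1

# 'Same' padding preserves the size; no padding shrinks it.
print(conv_output_size(W=32, F=5, P=2))  # 32
print(conv_output_size(W=32, F=5, P=0))  # 28
```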
126 | 127 | **22. Batch normalization ― It is a step of hyperparameter γ,β that normalizes the batch {xi}. By noting μB,σ2B the mean and variance of that we want to correct to the batch, it is done as follows:** 128 | 129 | ⟶ バッチ正規化 ー バッチ{xi}を正規化するハイパーパラメータγ、βのステップです。補正を行うバッチの平均と分散をμB,σ2Bとすると、正規化は以下のように行われます: 130 | 131 |
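A sketch of the normalization step above with NumPy (the small constant `eps` is the usual numerical-stability addition, and the scalar γ, β are our simplification; in practice they are learned per feature):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize the batch {xi} with parameters gamma, beta:
    x_norm = (x - mu_B) / sqrt(sigma2_B + eps);  out = gamma * x_norm + beta."""
    mu = x.mean(axis=0)        # mu_B: per-feature batch mean
    var = x.var(axis=0)        # sigma2_B: per-feature batch variance
    x_norm = (x - mu) / np.sqrt(var + eps)
    return gamma * x_norm + beta

x = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(64, 4))
out = batch_norm(x, gamma=1.0, beta=0.0)
print(out.mean(axis=0))        # ~0 per feature
print(out.std(axis=0))         # ~1 per feature
```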
132 | 133 | **23. It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.** 134 | 135 | ⟶ これは通常、学習率を高め、初期値への依存性を減らすことを目的でFully Connected層と畳み込み層の後、非線形化を行う前に行われます。 136 | 137 |
138 | 139 | **24. Recurrent Neural Networks** 140 | 141 | ⟶ リカレントニューラルネットワーク (RNN) 142 | 143 |
144 | 145 | **25. Types of gates ― Here are the different types of gates that we encounter in a typical recurrent neural network:** 146 | 147 | ⟶ ゲートの種類 ー 典型的なRNNに使われるゲートです: 148 | 149 |
150 | 151 | **26. [Input gate, forget gate, gate, output gate]** 152 | 153 | ⟶ [入力ゲート, 忘却ゲート, ゲート, 出力ゲート] 154 | 155 |
156 | 157 | **27. [Write to cell or not?, Erase a cell or not?, How much to write to cell?, How much to reveal cell?]** 158 | 159 | ⟶ [セルに追加するべき?, セルを削除するべき?, 情報をどれだけセルに追加するべき?, セルをどの程度通すべき?] 160 | 161 |
162 | 163 | **28. LSTM ― A long short-term memory (LSTM) network is a type of RNN model that avoids the vanishing gradient problem by adding 'forget' gates.** 164 | 165 | ⟶ LSTM - 長・短期記憶(long short-term memory; LSTM)ネットワークは、忘却ゲートを追加することで勾配消失問題を回避するRNNモデルの一種です。 166 | 167 | <br>
168 | 169 | **29. Reinforcement Learning and Control** 170 | 171 | ⟶ 強化学習と制御 172 | 173 |
174 | 175 | **30. The goal of reinforcement learning is for an agent to learn how to evolve in an environment.** 176 | 177 | ⟶ 強化学習はある環境内においてエージェントが学習し、進化することを目標とします。 178 | 179 |
180 | 181 | **31. Definitions** 182 | 183 | ⟶ 定義 184 | 185 |
186 | 187 | **32. Markov decision processes ― A Markov decision process (MDP) is a 5-tuple (S,A,{Psa},γ,R) where:** 188 | 189 | ⟶ マルコフ決定過程 ー マルコフ決定過程(Markov decision process; MDP)を5タプル(S,A,{Psa},γ,R)としたとき: 190 | 191 |
192 | 193 | **33. S is the set of states** 194 | 195 | ⟶ Sは状態の集合 196 | 197 |
198 | 199 | **34. A is the set of actions** 200 | 201 | ⟶ Aは行動の集合 202 | 203 |
204 | 205 | **35. {Psa} are the state transition probabilities for s∈S and a∈A** 206 | 207 | ⟶ {Psa}は状態s∈Sと行動a∈Aの状態遷移確率 208 | 209 |
210 | 211 | **36. γ∈[0,1[ is the discount factor** 212 | 213 | ⟶ γ∈[0,1[は割引因子 214 | 215 |
216 | 217 | **37. R:S×A⟶R or R:S⟶R is the reward function that the algorithm wants to maximize** 218 | 219 | ⟶ R:S×A⟶R or R:S⟶Rはアルゴリズムが最大化したい報酬関数 220 | 221 |
222 | 223 | **38. Policy ― A policy π is a function π:S⟶A that maps states to actions.** 224 | 225 | ⟶ 方策 - 方策πは状態を行動に写像する関数π:S⟶A 226 | 227 |
228 | 229 | **39. Remark: we say that we execute a given policy π if given a state s we take the action a=π(s).** 230 | 231 | ⟶ 備考: 状態sを与えられた際に行動a=π(s)を行うことを方策πを実行すると言います。 232 | 233 |
234 | 235 | **40. Value function ― For a given policy π and a given state s, we define the value function Vπ as follows:** 236 | 237 | ⟶ 価値関数 - ある方策πとある状態sにおける価値関数Vπを以下のように定義します: 238 | 239 |
240 | 241 | **41. Bellman equation ― The optimal Bellman equations characterizes the value function Vπ∗ of the optimal policy π∗:** 242 | 243 | ⟶ ベルマン方程式 - 最適ベルマン方程式は最適方策π∗の価値関数Vπ∗を特徴づけます: 244 | 245 | <br>
246 | 247 | **42. Remark: we note that the optimal policy π∗ for a given state s is such that:** 248 | 249 | ⟶ 備考: 与えられた状態sに対する最適方策π*はこのようになります: 250 | 251 |
252 | 253 | **43. Value iteration algorithm ― The value iteration algorithm is in two steps:** 254 | 255 | ⟶ 価値反復法アルゴリズム - 価値反復法アルゴリズムは2段階で行われます: 256 | 257 |
258 | 259 | **44. 1) We initialize the value:** 260 | 261 | ⟶ 1) 値を初期化する: 262 | 263 |
264 | 265 | **45. 2) We iterate the value based on the values before:** 266 | 267 | ⟶ 2) 前の値を元に値を反復的に更新する: 268 | 269 | <br>
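The two steps above can be sketched on a tiny made-up MDP (all transition probabilities, rewards, and the discount factor below are invented for illustration):

```python
import numpy as np

# Toy 2-state, 2-action MDP: P[s, a, s'] transition probabilities, R[s] rewards.
P = np.array([[[0.8, 0.2],     # from state 0, action 0
               [0.1, 0.9]],    # from state 0, action 1
              [[0.5, 0.5],     # from state 1, action 0
               [0.0, 1.0]]])   # from state 1, action 1
R = np.array([0.0, 1.0])
gamma = 0.9

V = np.zeros(2)                            # step 1: initialize the value
for _ in range(500):                       # step 2: iterate using previous values
    V = R + gamma * (P @ V).max(axis=1)    # Bellman optimality update

print(V)                                   # V converges to the optimal values
```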
270 | 271 | **46. Maximum likelihood estimate ― The maximum likelihood estimates for the state transition probabilities are as follows:** 272 | 273 | ⟶ 最尤推定 ー 状態遷移確率の最尤推定(maximum likelihood estimate; MLE): 274 | 275 |
276 | 277 | **47. times took action a in state s and got to s′** 278 | 279 | ⟶ 状態sで行動aを行い状態s′に遷移した回数 280 | 281 |
282 | 283 | **48. times took action a in state s** 284 | 285 | ⟶ 状態sで行動aを行った回数 286 | 287 |
288 | 289 | **49. Q-learning ― Q-learning is a model-free estimation of Q, which is done as follows:** 290 | 291 | ⟶ Q学習 ― Q学習はモデルフリーのQ値の推定であり、以下のように行われます: 292 | 293 |
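A minimal sketch of the model-free Q-learning update Q(s,a)←Q(s,a)+α[r+γ maxa′Q(s′,a′)−Q(s,a)] on a made-up two-state environment (the dynamics, rewards, α and γ are all our own toy choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 2, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9

def step(s, a):
    """Made-up dynamics: action 1 always moves to state 1 (reward 1),
    action 0 moves to a random state (reward 1 only when it lands in state 1)."""
    s_next = int(rng.random() < 0.5) if a == 0 else 1
    reward = 1.0 if s_next == 1 else 0.0
    return s_next, reward

s = 0
for _ in range(5000):
    a = int(rng.integers(n_actions))      # explore uniformly, for simplicity
    s_next, r = step(s, a)
    # Model-free update toward the Bellman target r + gamma * max_a' Q(s', a'):
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

print(Q)   # action 1 should look better than action 0 in both states
```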
294 | 295 | **50. View PDF version on GitHub** 296 | 297 | ⟶ GitHubでPDF版を見る 298 | 299 |
300 | 301 | **51. [Neural Networks, Architecture, Activation function, Backpropagation, Dropout]** 302 | 303 | ⟶ [ニューラルネットワーク, アーキテクチャ, 活性化関数, 誤差逆伝播法, ドロップアウト] 304 | 305 |
306 | 307 | **52. [Convolutional Neural Networks, Convolutional layer, Batch normalization]** 308 | 309 | ⟶ [畳み込みニューラルネットワーク, 畳み込み層, バッチ正規化] 310 | 311 |
312 | 313 | **53. [Recurrent Neural Networks, Gates, LSTM]** 314 | 315 | ⟶ [リカレントニューラルネットワーク, ゲート, LSTM] 316 | 317 |
318 | 319 | **54. [Reinforcement learning, Markov decision processes, Value/policy iteration, Approximate dynamic programming, Policy search]** 320 | 321 | ⟶ [強化学習, マルコフ決定過程, 価値/方策反復, 近似動的計画法, 方策探索] 322 | -------------------------------------------------------------------------------- /ja/cs-229-linear-algebra.md: -------------------------------------------------------------------------------- 1 | **1. Linear Algebra and Calculus refresher** 2 | 3 | ⟶ 4 | 線形代数と微積分の復習 5 |
6 | 7 | **2. General notations** 8 | 9 | ⟶ 10 | 一般表記 11 |
12 | 13 | **3. Definitions** 14 | 15 | ⟶ 16 | 定義 17 |
18 | 19 | **4. Vector ― We note x∈Rn a vector with n entries, where xi∈R is the ith entry:** 20 | 21 | ⟶ 22 | ベクトル - x∈Rn はn個の要素を持つベクトルを表し、xi∈R はi番目の要素を表します。 23 |
24 | 25 | **5. Matrix ― We note A∈Rm×n a matrix with m rows and n columns, where Ai,j∈R is the entry located in the ith row and jth column:** 26 | 27 | ⟶ 28 | 行列 - m行n列の行列を A∈Rm×n と表記し、Ai,j∈R はi行目のj列目の要素を指します。 29 |
30 | 31 | **6. Remark: the vector x defined above can be viewed as a n×1 matrix and is more particularly called a column-vector.** 32 | 33 | ⟶ 34 | 備考:上記で定義されたベクトル x は n×1 の行列と見なすことができ、列ベクトルと呼ばれます。 35 |
36 | 37 | **7. Main matrices** 38 | 39 | ⟶ 40 | 主な行列の種類 41 |
42 | 43 | **8. Identity matrix ― The identity matrix I∈Rn×n is a square matrix with ones in its diagonal and zero everywhere else:** 44 | 45 | ⟶ 46 | 単位行列 - 単位行列 I∈Rn×n は、対角成分に 1 が並び、他は全て 0 となる正方行列です。 47 |
48 | 49 | **9. Remark: for all matrices A∈Rn×n, we have A×I=I×A=A.** 50 | 51 | ⟶ 52 | 備考:すべての行列 A∈Rn×n に対して、A×I=I×A=A となります。 53 |
54 | 55 | **10. Diagonal matrix ― A diagonal matrix D∈Rn×n is a square matrix with nonzero values in its diagonal and zero everywhere else:** 56 | 57 | ⟶ 58 | 対角行列 - 対角行列 D∈Rn×n は、対角成分の値が 0 以外で、それ以外は 0 である正方行列です。 59 |
60 | 61 | **11. Remark: we also note D as diag(d1,...,dn).** 62 | 63 | ⟶ 64 | 備考:Dをdiag(d1,...,dn) とも表記します。 65 |
66 | 67 | **12. Matrix operations** 68 | 69 | ⟶ 70 | 行列演算 71 |
72 | 73 | **13. Multiplication** 74 | 75 | ⟶ 76 | 行列乗算 77 |
78 | 79 | **14. Vector-vector ― There are two types of vector-vector products:** 80 | 81 | ⟶ 82 | ベクトル-ベクトル - ベクトル-ベクトル積には2種類あります。 83 |
84 | 85 | **15. inner product: for x,y∈Rn, we have:** 86 | 87 | ⟶ 88 | 内積: x,y∈Rn に対して、内積の定義は下記の通りです: 89 |
90 | 91 | **16. outer product: for x∈Rm,y∈Rn, we have:** 92 | 93 | ⟶ 94 | 外積: x∈Rm,y∈Rn に対して、外積の定義は下記の通りです: 95 |
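The two vector-vector products above, written out with NumPy:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

inner = x @ y            # x^T y: a scalar, the sum of x_i * y_i
outer = np.outer(x, y)   # x y^T: an m x n matrix with entries x_i * y_j

print(inner)             # 32.0
print(outer.shape)       # (3, 3)
```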
96 | 97 | **17. Matrix-vector ― The product of matrix A∈Rm×n and vector x∈Rn is a vector of size Rm, such that:** 98 | 99 | ⟶ 100 | 行列-ベクトル - 行列 A∈Rm×n とベクトル x∈Rn の積は以下の条件を満たすようなサイズ Rm のベクトルです。 101 | <br>
102 | 103 | **18. where aTr,i are the vector rows and ac,j are the vector columns of A, and xi are the entries of x.** 104 | 105 | ⟶ 106 | 上記 aTr,i は A の行ベクトルで、ac,j は A の列ベクトルです。 xi は x の要素です。 107 |
108 | 109 | **19. Matrix-matrix ― The product of matrices A∈Rm×n and B∈Rn×p is a matrix of size Rm×p, such that:** 110 | 111 | ⟶ 112 | 行列-行列 - 行列 A∈Rm×n と B∈Rn×p の積は以下の条件を満たすようなサイズ Rm×p の行列です。 113 | <br>
114 | 115 | **20. where aTr,i,bTr,i are the vector rows and ac,j,bc,j are the vector columns of A and B respectively** 116 | 117 | ⟶ 118 | aTr,i,bTr,i は A と B の行ベクトルで ac,j,bc,j は A と B の列ベクトルです。 119 |
120 | 121 | **21. Other operations** 122 | 123 | ⟶ 124 | その他の演算 125 |
126 | 127 | **22. Transpose ― The transpose of a matrix A∈Rm×n, noted AT, is such that its entries are flipped:** 128 | 129 | ⟶ 130 | 転置 ― A∈Rm×n の転置行列は AT と表記し、A の行と列の要素を入れ替えた行列です。 131 | <br>
132 | 133 | **23. Remark: for matrices A,B, we have (AB)T=BTAT** 134 | 135 | ⟶ 136 | 備考: 行列 A,B に対して、(AB)T=BTAT となります。 137 | <br>
138 | 139 | **24. Inverse ― The inverse of an invertible square matrix A is noted A−1 and is the only matrix such that:** 140 | 141 | ⟶ 142 | 逆行列 ― 可逆正方行列 A の逆行列は A-1 と表記し、 以下の条件を満たす唯一の行列です。 143 |
144 | 145 | **25. Remark: not all square matrices are invertible. Also, for matrices A,B, we have (AB)−1=B−1A−1** 146 | 147 | ⟶ 148 | 備考: すべての正方行列が可逆とは限りません。 行列 A,B については、(AB)−1=B−1A−1 149 |
150 | 151 | **26. Trace ― The trace of a square matrix A, noted tr(A), is the sum of its diagonal entries:** 152 | 153 | ⟶ 154 | 跡 - 正方行列 A の跡は、tr(A) と表記し、その対角成分の要素の和です。 155 |
156 | 157 | **27. Remark: for matrices A,B, we have tr(AT)=tr(A) and tr(AB)=tr(BA)** 158 | 159 | ⟶ 160 | 備考: 行列 A,B に対して、tr(AT)=tr(A) および tr(AB)=tr(BA) となります。 161 | <br>
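The identities (AB)T=BTAT, (AB)−1=B−1A−1, tr(AT)=tr(A) and tr(AB)=tr(BA) from the sections above can be sanity-checked numerically (random 3×3 matrices, which are almost surely invertible; this is a demonstration, not a proof):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(3, 3))

# (AB)^T = B^T A^T
assert np.allclose((A @ B).T, B.T @ A.T)
# (AB)^-1 = B^-1 A^-1  (A and B are invertible with probability 1)
assert np.allclose(np.linalg.inv(A @ B), np.linalg.inv(B) @ np.linalg.inv(A))
# tr(A^T) = tr(A) and tr(AB) = tr(BA)
assert np.isclose(np.trace(A.T), np.trace(A))
assert np.isclose(np.trace(A @ B), np.trace(B @ A))
```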
162 | 163 | **28. Determinant ― The determinant of a square matrix A∈Rn×n, noted |A| or det(A) is expressed recursively in terms of A∖i,∖j, which is the matrix A without its ith row and jth column, as follows:** 164 | 165 | ⟶ 166 | 行列式 ― 正方行列 A∈Rn×n の行列式は |A| または det(A) と表記し、i番目の行とj番目の列を取り除いた行列 A∖i,∖j を用いて、以下のように再帰的に表現されます: 167 | 168 | <br>
169 | 170 | **29. Remark: A is invertible if and only if |A|≠0. Also, |AB|=|A||B| and |AT|=|A|.** 171 | 172 | ⟶ 173 | 備考: |A|≠0の場合に限り、行列は可逆行列です。また |AB|=|A||B| と |AT|=|A|。 174 |
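The determinant properties |AB|=|A||B| and |AT|=|A| can likewise be checked with NumPy (random matrices, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(3, 3))

# |AB| = |A||B| and |A^T| = |A|
assert np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B))
assert np.isclose(np.linalg.det(A.T), np.linalg.det(A))

# A singular matrix (repeated row) has determinant 0 and is not invertible.
S = np.array([[1.0, 2.0], [2.0, 4.0]])
print(np.linalg.det(S))   # ~0
```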
175 | 176 | **30. Matrix properties** 177 | 178 | ⟶ 179 | 行列の性質 180 |
181 | 182 | **31. Definitions** 183 | 184 | ⟶ 185 | 定義 186 |
187 | 188 | **32. Symmetric decomposition ― A given matrix A can be expressed in terms of its symmetric and antisymmetric parts as follows:** 189 | 190 | ⟶ 191 | 対称分解 ― 行列Aは次のように対称および反対称的な部分で表現できます。 192 |
193 | 194 | **33. [Symmetric, Antisymmetric]** 195 | 196 | ⟶ 197 | [対称、反対称] 198 |
199 | 200 | **34. Norm ― A norm is a function N:V⟶[0,+∞[ where V is a vector space, and such that for all x,y∈V, we have:** 201 | 202 | ⟶ 203 | ノルム ― ノルムとは関数 N:V⟶[0,+∞[ のことで、ここで V はベクトル空間であり、すべての x,y∈V に対して以下の条件を満たします: 204 | 205 | <br>
206 | 207 | **35. N(ax)=|a|N(x) for a scalar** 208 | 209 | ⟶ 210 | スカラー a に対して N(ax)=|a|N(x) 211 |
212 | 213 | **36. if N(x)=0, then x=0** 214 | 215 | ⟶ 216 | N(x)= 0ならば x = 0 217 |
218 | 219 | **37. For x∈V, the most commonly used norms are summed up in the table below:** 220 | 221 | ⟶ 222 | x∈Vに対して、最も多用されているノルムは、以下の表にまとめられています。 223 |
224 | 225 | **38. [Norm, Notation, Definition, Use case]** 226 | 227 | ⟶ 228 | [ノルム、表記法、定義、使用事例] 229 |
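For a concrete vector, the usual norms from the table can be computed directly (here for x=(3,−4)):

```python
import numpy as np

x = np.array([3.0, -4.0])

l1 = np.abs(x).sum()           # L1 (Manhattan) norm, used e.g. in LASSO regularization
l2 = np.sqrt((x ** 2).sum())   # L2 (Euclidean) norm, used e.g. in ridge regularization
linf = np.abs(x).max()         # L-infinity (max) norm

print(l1, l2, linf)            # 7.0 5.0 4.0
```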
230 | 231 | **39. Linearly dependence ― A set of vectors is said to be linearly dependent if one of the vectors in the set can be defined as a linear combination of the others.** 232 | 233 | ⟶ 234 | 線形従属 ― ベクトルの集合に対して、少なくともどれか一つのベクトルを他のベクトルの線形結合として定義できる場合、その集合が線形従属であるといいます。 235 |
236 | 237 | **40. Remark: if no vector can be written this way, then the vectors are said to be linearly independent** 238 | 239 | ⟶ 240 | 備考:どのベクトルもこのように書くことができない場合、それらのベクトルは線形独立であると言います。 241 | <br>
242 | 243 | **41. Matrix rank ― The rank of a given matrix A is noted rank(A) and is the dimension of the vector space generated by its columns. This is equivalent to the maximum number of linearly independent columns of A.** 244 | 245 | ⟶ 246 | 行列の階数 ― 行列Aの階数は rank(A) と表記し、列空間の次元を表します。これは、Aの線形独立の列の最大数に相当します。 247 |
248 | 249 | **42. Positive semi-definite matrix ― A matrix A∈Rn×n is positive semi-definite (PSD) and is noted A⪰0 if we have:** 250 | 251 | ⟶ 252 | 半正定値行列 ― 行列A、A∈Rn×nに対して、以下の式が成り立つならば、 Aを半正定値(PSD)といい、A⪰0 と表記します。 253 |
254 | 255 | **43. Remark: similarly, a matrix A is said to be positive definite, and is noted A≻0, if it is a PSD matrix which satisfies for all non-zero vector x, xTAx>0.** 256 | 257 | ⟶ 258 | 備考: 同様に、全ての非ゼロベクトルx、xTAx>0 に対して条件を満たすような行列Aは正定値行列といい、A≻0 と表記します。 259 |
260 | 261 | **44. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:** 262 | 263 | ⟶ 264 | 固有値、固有ベクトル ― 行列A、A∈Rn×n に対して、以下の条件を満たすようなベクトルz、z∈Rn∖{0} が存在するならば、λ は固有値といい、z は固有ベクトルといいます。 265 |
266 | 267 | **45. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:** 268 | 269 | ⟶ 270 | スペクトル定理 ― A∈Rn×n とします。A が対称ならば、A は実直交行列 U∈Rn×n によって対角化可能です。Λ=diag(λ1,...,λn) と表記すると、次のように表現できます。 271 |
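A numerical check of the spectral theorem A=UΛUT for a random symmetric matrix (np.linalg.eigh is NumPy's eigendecomposition routine for symmetric/Hermitian input):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))
A = (M + M.T) / 2                       # symmetrize to get a symmetric matrix

eigvals, U = np.linalg.eigh(A)          # real eigenvalues, orthonormal eigenvectors
Lam = np.diag(eigvals)

assert np.allclose(U @ Lam @ U.T, A)    # A = U Λ U^T
assert np.allclose(U.T @ U, np.eye(4))  # U is orthogonal
```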
272 | 273 | **46. diagonal** 274 | 275 | ⟶ 276 | 対角 277 |
278 | 279 | **47. Singular-value decomposition ― For a given matrix A of dimensions m×n, the singular-value decomposition (SVD) is a factorization technique that guarantees the existence of U m×m unitary, Σ m×n diagonal and V n×n unitary matrices, such that:** 280 | 281 | ⟶ 282 | 特異値分解 ― A を m×n の行列とします。特異値分解(SVD)は、ユニタリ行列 U m×m、Σ m×n の対角行列、およびユニタリ行列 V n×n の存在を保証する因数分解手法で、以下の条件を満たします。 283 |
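A numerical check of A=UΣVT with NumPy (note that np.linalg.svd returns VT and only the singular values, so Σ has to be embedded into an m×n matrix by hand):

```python
import numpy as np

A = np.random.default_rng(0).normal(size=(5, 3))
U, s, Vt = np.linalg.svd(A)      # U is 5x5 unitary, Vt is 3x3 unitary

Sigma = np.zeros((5, 3))         # embed singular values into an m x n Σ
Sigma[:3, :3] = np.diag(s)

assert np.allclose(U @ Sigma @ Vt, A)   # A = U Σ V^T
print(s)                                # singular values, in decreasing order
```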
284 | 285 | **48. Matrix calculus** 286 | 287 | ⟶ 288 | 行列微積分 289 |
290 | 291 | **49. Gradient ― Let f:Rm×n→R be a function and A∈Rm×n be a matrix. The gradient of f with respect to A is a m×n matrix, noted ∇Af(A), such that:** 292 | 293 | ⟶ 294 | 勾配 ― f:Rm×n→R を関数とし、A∈Rm×n を行列とします。 A に対する f の勾配は m×n 行列で、∇Af(A) と表記し、次の条件を満たします。 295 |
296 | 297 | **50. Remark: the gradient of f is only defined when f is a function that returns a scalar.** 298 | 299 | ⟶ 300 | 備考: f の勾配は、f がスカラーを返す関数であるときに限り存在します。 301 |
302 | 303 | **51. Hessian ― Let f:Rn→R be a function and x∈Rn be a vector. The hessian of f with respect to x is a n×n symmetric matrix, noted ∇2xf(x), such that:** 304 | 305 | ⟶ 306 | ヘッセ行列 ― f:Rn→R を関数とし、x∈Rn をベクトルとします。 x に対する f のヘッセ行列は、n×n 対称行列で ∇2xf(x) と表記し、以下の条件を満たします。 307 |
308 | 309 | **52. Remark: the hessian of f is only defined when f is a function that returns a scalar** 310 | 311 | ⟶ 312 | 備考: f のヘッセ行列は、f がスカラーを返す関数である場合に限り存在します。 313 |
314 | 315 | **53. Gradient operations ― For matrices A,B,C, the following gradient properties are worth having in mind:** 316 | 317 | ⟶ 318 | 勾配演算 ― 行列 A,B,C に対して、以下の勾配の性質は覚えておく価値があります。 319 | <br>
320 | 321 | **54. [General notations, Definitions, Main matrices]** 322 | 323 | ⟶ 324 | [一般表記, 定義, 主な行列の種類] 325 | <br>
326 | 327 | **55. [Matrix operations, Multiplication, Other operations]** 328 | 329 | ⟶ 330 | [行列演算, 乗算, その他の演算] 331 |
332 | 333 | **56. [Matrix properties, Norm, Eigenvalue/Eigenvector, Singular-value decomposition]** 334 | 335 | ⟶ 336 | [行列の性質, ノルム, 固有値/固有ベクトル, 特異値分解] 337 | <br>
338 | 339 | **57. [Matrix calculus, Gradient, Hessian, Operations]** 340 | 341 | ⟶ 342 | [行列微積分, 勾配, ヘッセ行列, 演算] 343 | -------------------------------------------------------------------------------- /ko/cs-229-linear-algebra.md: -------------------------------------------------------------------------------- 1 | **1. Linear Algebra and Calculus refresher** 2 | 3 | ⟶ 선형대수와 미적분학 복습 4 | 5 |
6 | 7 | **2. General notations** 8 | 9 | ⟶ 일반적인 표기법 10 | 11 |
12 | 13 | **3. Definitions** 14 | 15 | ⟶ 정의 16 | 17 |
18 | 19 | **4. Vector ― We note x∈Rn a vector with n entries, where xi∈R is the ith entry:** 20 | 21 | ⟶ 벡터 - x∈Rn는 n개의 요소를 가진 벡터이고, xi∈R는 i번째 요소이다. 22 | 23 |
24 | 25 | **5. Matrix ― We note A∈Rm×n a matrix with m rows and n columns, where Ai,j∈R is the entry located in the ith row and jth column:** 26 | 27 | ⟶ 행렬 - A∈Rm×n는 m개의 행과 n개의 열을 가진 행렬이고, Ai,j∈R는 i번째 행, j번째 열에 있는 원소이다. 28 | 29 |
30 | 31 | **6. Remark: the vector x defined above can be viewed as a n×1 matrix and is more particularly called a column-vector.** 32 | 33 | ⟶ 비고 : 위에서 정의된 벡터 x는 n×1행렬로 볼 수 있으며, 열벡터라고도 불린다. 34 | 35 |
36 | 37 | **7. Main matrices** 38 | 39 | ⟶ 주요 행렬 40 | 41 |
42 | 43 | **8. Identity matrix ― The identity matrix I∈Rn×n is a square matrix with ones in its diagonal and zero everywhere else:** 44 | 45 | ⟶ 단위행렬 - 단위행렬 I∈Rn×n는 대각성분이 모두 1이고 대각성분이 아닌 성분은 모두 0인 정사각행렬이다. 46 | 47 |
48 | 49 | **9. Remark: for all matrices A∈Rn×n, we have A×I=I×A=A.** 50 | 51 | ⟶ 비고 : 모든 행렬 A∈Rn×n에 대하여, A×I=I×A=A를 만족한다. 52 | 53 |
54 | 55 | **10. Diagonal matrix ― A diagonal matrix D∈Rn×n is a square matrix with nonzero values in its diagonal and zero everywhere else:** 56 | 57 | ⟶ 대각행렬 - 대각행렬 D∈Rn×n는 대각성분은 모두 0이 아니고, 대각성분이 아닌 성분은 모두 0인 정사각행렬이다. 58 | 59 |
60 | 61 | **11. Remark: we also note D as diag(d1,...,dn).** 62 | 63 | ⟶ 비고 : D를 diag(d1,...,dn)라고도 표시한다. 64 | 65 |
66 | 67 | **12. Matrix operations** 68 | 69 | ⟶ 행렬 연산 70 | 71 |
72 | 73 | **13. Multiplication** 74 | 75 | ⟶ 곱셈 76 | 77 |
78 | 79 | **14. Vector-vector ― There are two types of vector-vector products:** 80 | 81 | ⟶ 벡터-벡터 – 벡터 간 연산에는 두 가지 종류가 있다. 82 | 83 |
84 | 85 | **15. inner product: for x,y∈Rn, we have:** 86 | 87 | ⟶ 내적 : x,y∈Rn에 대하여, 88 | 89 |
90 | 91 | **16. outer product: for x∈Rm,y∈Rn, we have:** 92 | 93 | ⟶ 외적 : x∈Rm,y∈Rn에 대하여, 94 | 95 |
96 | 97 | **17. Matrix-vector ― The product of matrix A∈Rm×n and vector x∈Rn is a vector of size Rm, such that:** 98 | 99 | ⟶ 행렬-벡터 - 행렬 A∈Rm×n와 벡터 x∈Rn의 곱은 다음을 만족하는 Rm크기의 벡터이다. 100 | 101 | <br>
102 | 103 | **18. where aTr,i are the vector rows and ac,j are the vector columns of A, and xi are the entries of x.** 104 | 105 | ⟶ aTr,i는 A의 벡터행, ac,j는 A의 벡터열, xi는 x의 성분이다. 106 | 107 |
108 | 109 | **19. Matrix-matrix ― The product of matrices A∈Rm×n and B∈Rn×p is a matrix of size Rm×p, such that:** 110 | 111 | ⟶ 행렬-행렬 - 행렬 A∈Rm×n와 행렬 B∈Rn×p의 곱은 다음을 만족하는 Rm×p크기의 행렬이다. 112 | 113 | <br>
114 | 115 | **20. where aTr,i,bTr,i are the vector rows and ac,j,bc,j are the vector columns of A and B respectively** 116 | 117 | ⟶ aTr,i,bTr,i는 A,B의 벡터행, ac,j,bc,j는 A,B의 벡터열이다. 118 | 119 |
120 | 121 | **21. Other operations** 122 | 123 | ⟶ 그 외 연산 124 | 125 |
126 | 127 | **22. Transpose ― The transpose of a matrix A∈Rm×n, noted AT, is such that its entries are flipped:** 128 | 129 | ⟶ 전치 - 행렬 A∈Rm×n의 전치 AT는 모든 성분을 뒤집은 것이다. 130 | 131 |
132 | 133 | **23. Remark: for matrices A,B, we have (AB)T=BTAT** 134 | 135 | ⟶ 비고 : 행렬 A,B에 대하여, (AB)T=BTAT가 성립한다. 136 | 137 | <br>
138 | 139 | **24. Inverse ― The inverse of an invertible square matrix A is noted A−1 and is the only matrix such that:** 140 | 141 | ⟶ 역행렬 - 가역행렬 A의 역행렬은 A−1로 표기하며, 다음을 만족하는 유일한 행렬이다. 142 | 143 | <br>
144 | 145 | **25. Remark: not all square matrices are invertible. Also, for matrices A,B, we have (AB)−1=B−1A−1** 146 | 147 | ⟶ 비고 : 모든 정사각행렬이 역행렬을 갖는 것은 아니다. 그리고 행렬 A,B에 대하여 (AB)−1=B−1A−1가 성립한다. 148 | 149 | <br>
150 | 151 | **26. Trace ― The trace of a square matrix A, noted tr(A), is the sum of its diagonal entries:** 152 | 153 | ⟶ 대각합 – 정사각행렬 A의 대각합 tr(A)는 대각성분의 합이다. 154 | 155 |
156 | 157 | **27. Remark: for matrices A,B, we have tr(AT)=tr(A) and tr(AB)=tr(BA)** 158 | 159 | ⟶ 비고 : 행렬 A,B에 대하여, tr(AT)=tr(A)와 tr(AB)=tr(BA)가 성립한다. 160 | 161 | <br>
162 | 163 | **28. Determinant ― The determinant of a square matrix A∈Rn×n, noted |A| or det(A) is expressed recursively in terms of A∖i,∖j, which is the matrix A without its ith row and jth column, as follows:** 164 | 165 | ⟶ 행렬식 - 정사각행렬 A∈Rn×n의 행렬식 |A| 또는 det(A)는 i번째 행과 j번째 열이 없는 행렬 A인 A∖i,∖j에 대해 재귀적으로 표현된다. 166 | 167 |
168 | 169 | **29. Remark: A is invertible if and only if |A|≠0. Also, |AB|=|A||B| and |AT|=|A|.** 170 | 171 | ⟶ 비고 : A가 가역일 필요충분조건은 |A|≠0이다. 또한 |AB|=|A||B|와 |AT|=|A|가 성립한다. 172 | 173 | <br>
174 | 175 | **30. Matrix properties** 176 | 177 | ⟶ 행렬의 성질 178 | 179 |
180 | 181 | **31. Definitions** 182 | 183 | ⟶ 정의 184 | 185 |
186 | 187 | **32. Symmetric decomposition ― A given matrix A can be expressed in terms of its symmetric and antisymmetric parts as follows:** 188 | 189 | ⟶ 대칭 분해 - 주어진 행렬 A는 다음과 같이 대칭과 비대칭 부분으로 표현될 수 있다. 190 | 191 |
192 | 193 | **33. [Symmetric, Antisymmetric]** 194 | 195 | ⟶ [대칭, 비대칭] 196 | 197 |
198 | 199 | **34. Norm ― A norm is a function N:V⟶[0,+∞[ where V is a vector space, and such that for all x,y∈V, we have:** 200 | 201 | ⟶ 노름 – V는 벡터공간일 때, 노름은 모든 x,y∈V에 대해 다음을 만족하는 함수 N:V⟶[0,+∞[이다. 202 | 203 | <br>
204 | 205 | **35. N(ax)=|a|N(x) for a scalar** 206 | 207 | ⟶ scalar a에 대해서 N(ax)=|a|N(x)를 만족한다. 208 | 209 |
210 | 211 | **36. if N(x)=0, then x=0** 212 | 213 | ⟶ N(x)=0이면 x=0이다. 214 | 215 |
216 | 217 | **37. For x∈V, the most commonly used norms are summed up in the table below:** 218 | 219 | ⟶ x∈V에 대해, 가장 일반적으로 사용되는 노름이 아래 표에 요약되어 있다. 220 | 221 | <br>
222 | 223 | **38. [Norm, Notation, Definition, Use case]** 224 | 225 | ⟶ [노름, 표기법, 정의, 사용 사례] 226 | 227 | <br>
228 | 229 | **39. Linearly dependence ― A set of vectors is said to be linearly dependent if one of the vectors in the set can be defined as a linear combination of the others.** 230 | 231 | ⟶ 일차 종속 - 집합 내의 벡터 중 하나가 다른 벡터들의 선형결합으로 정의될 수 있으면, 그 벡터 집합은 일차 종속이라고 한다. 232 | 233 |
234 | 235 | **40. Remark: if no vector can be written this way, then the vectors are said to be linearly independent** 236 | 237 | ⟶ 비고 : 어느 벡터도 이런 방식으로 표현될 수 없다면, 그 벡터들은 일차 독립이라고 한다. 238 | 239 |
240 | 241 | **41. Matrix rank ― The rank of a given matrix A is noted rank(A) and is the dimension of the vector space generated by its columns. This is equivalent to the maximum number of linearly independent columns of A.** 242 | 243 | ⟶ 행렬 랭크 - 주어진 행렬 A의 랭크는 열에 의해 생성된 벡터공간의 차원이고, rank(A)라고 쓴다. 이는 A의 선형독립인 열의 최대 수와 동일하다. 244 | 245 |
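The rank definition above can be illustrated with a matrix whose third column is the sum of the first two, so only two columns are linearly independent:

```python
import numpy as np

# Column 3 = column 1 + column 2, so rank(A) = 2 even though A is 3x3.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0]])

print(np.linalg.matrix_rank(A))   # 2
```

np.linalg.matrix_rank computes the rank via the SVD (counting singular values above a tolerance), which is robust to floating-point noise.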
246 | 247 | **42. Positive semi-definite matrix ― A matrix A∈Rn×n is positive semi-definite (PSD) and is noted A⪰0 if we have:** 248 | 249 | ⟶ 양의 준정부호 행렬 – 행렬 A∈Rn×n는 다음을 만족하면 양의 준정부호(PSD)라고 하고 A⪰0라고 쓴다. 250 | 251 |
252 | 253 | **43. Remark: similarly, a matrix A is said to be positive definite, and is noted A≻0, if it is a PSD matrix which satisfies for all non-zero vector x, xTAx>0.** 254 | 255 | ⟶ 비고 : 마찬가지로 PSD 행렬이 모든 0이 아닌 벡터 x에 대하여 xTAx>0를 만족하면 행렬 A를 양의 정부호라고 말하고 A≻0라고 쓴다. 256 | 257 |
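A standard way to produce a PSD matrix is BTB for any B; a symmetric PSD matrix has only non-negative eigenvalues, which we can check numerically (the random B below is our own toy choice):

```python
import numpy as np

B = np.random.default_rng(0).normal(size=(3, 3))
A = B.T @ B          # B^T B is always symmetric positive semi-definite

# x^T A x = ||Bx||^2 >= 0 for every x, so all eigenvalues of A are >= 0.
eigvals = np.linalg.eigvalsh(A)
print(eigvals)       # non-negative up to floating-point error
```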
258 | 259 | **44. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:** 260 | 261 | ⟶ 고유값, 고유벡터 - 주어진 행렬 A∈Rn×n에 대하여, 다음을 만족하는 벡터 z∈Rn∖{0}가 존재하면, z를 고유벡터라고 부르고, λ를 A의 고유값이라고 부른다. 262 | 263 |
264 | 265 | **45. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:** 266 | 267 | ⟶ 스펙트럼 정리 – A∈Rn×n라고 하자. A가 대칭이면, A는 실수 직교행렬 U∈Rn×n에 의해 대각화 가능하다. Λ=diag(λ1,...,λn)로 표기하면, 다음을 만족한다. 268 | 269 | <br>
270 | 271 | **46. diagonal** 272 | 273 | ⟶ 대각 274 | 275 |
276 | 277 | **47. Singular-value decomposition ― For a given matrix A of dimensions m×n, the singular-value decomposition (SVD) is a factorization technique that guarantees the existence of U m×m unitary, Σ m×n diagonal and V n×n unitary matrices, such that:** 278 | 279 | ⟶ 특이값 분해 – 주어진 m×n차원 행렬 A에 대하여, 특이값 분해(SVD)는 다음과 같이 U m×m 유니터리와 Σ m×n 대각 및 V n×n 유니터리 행렬의 존재를 보증하는 인수분해 기술이다. 280 | 281 |
282 | 283 | **48. Matrix calculus** 284 | 285 | ⟶ 행렬 미적분 286 | 287 |
288 | 289 | **49. Gradient ― Let f:Rm×n→R be a function and A∈Rm×n be a matrix. The gradient of f with respect to A is a m×n matrix, noted ∇Af(A), such that:** 290 | 291 | ⟶ 그라디언트 – f:Rm×n→R는 함수이고 A∈Rm×n는 행렬이라 하자. A에 대한 f의 그라디언트 ∇Af(A)는 다음을 만족하는 m×n 행렬이다. 292 | 293 |
294 | 295 | **50. Remark: the gradient of f is only defined when f is a function that returns a scalar.** 296 | 297 | ⟶ 비고 : f의 그라디언트는 f가 스칼라를 반환하는 함수일 때만 정의된다. 298 | 299 |
300 | 301 | **51. Hessian ― Let f:Rn→R be a function and x∈Rn be a vector. The hessian of f with respect to x is a n×n symmetric matrix, noted ∇2xf(x), such that:** 302 | 303 | ⟶ 헤시안 – f:Rn→R는 함수이고 x∈Rn는 벡터라고 하자. x에 대한 f의 헤시안 ∇2xf(x)는 다음을 만족하는 n×n 대칭행렬이다. 304 | 305 |
306 | 307 | **52. Remark: the hessian of f is only defined when f is a function that returns a scalar** 308 | 309 | ⟶ 비고 : f의 헤시안은 f가 스칼라를 반환하는 함수일 때만 정의된다. 310 | 311 |
312 | 313 | **53. Gradient operations ― For matrices A,B,C, the following gradient properties are worth having in mind:** 314 | 315 | ⟶ 그라디언트 연산 – 행렬 A,B,C에 대하여, 다음 그라디언트 성질을 염두해두는 것이 좋다. 316 | 317 |
318 | 319 | **54. [General notations, Definitions, Main matrices]** 320 | 321 | ⟶ [일반적인 표기법, 정의, 주요 행렬] 322 | 323 |
324 | 325 | **55. [Matrix operations, Multiplication, Other operations]** 326 | 327 | ⟶ [행렬 연산, 곱셈, 다른 연산] 328 | 329 |
330 | 331 | **56. [Matrix properties, Norm, Eigenvalue/Eigenvector, Singular-value decomposition]** 332 | 333 | ⟶ [행렬 성질, 노름, 고유값/고유벡터, 특이값 분해] 334 | 335 |
336 | 337 | **57. [Matrix calculus, Gradient, Hessian, Operations]** 338 | 339 | ⟶ [행렬 미적분, 그라디언트, 헤시안, 연산] 340 | 341 | -------------------------------------------------------------------------------- /ko/cheatsheet-deep-learning.md: -------------------------------------------------------------------------------- 1 | **1. Deep Learning cheatsheet** 2 | 3 | ⟶ 딥러닝 치트시트 4 | 5 |
6 | 7 | **2. Neural Networks** 8 | 9 | ⟶ 신경망 10 | 11 |
12 | 13 | **3. Neural networks are a class of models that are built with layers. Commonly used types of neural networks include convolutional and recurrent neural networks.** 14 | 15 | ⟶ 신경망(neural network)은 층(layer)으로 구성되는 모델입니다. 합성곱 신경망(convolutional neural network)과 순환 신경망(recurrent neural network)이 널리 사용되는 신경망입니다. 16 | 17 |
18 | 19 | **4. Architecture ― The vocabulary around neural networks architectures is described in the figure below:** 20 | 21 | ⟶ 구조 - 다음 그림에 신경망 구조에 관한 용어가 표현되어 있습니다: 22 | 23 |
24 | 25 | **5. [Input layer, hidden layer, output layer]** 26 | 27 | ⟶ [입력층(input layer), 은닉층(hidden layer), 출력층(output layer)] 28 | 29 |
30 | 31 | **6. By noting i the ith layer of the network and j the jth hidden unit of the layer, we have:** 32 | 33 | ⟶ i는 네트워크의 i 번째 층을 나타내고 j는 각 층의 j 번째 은닉 유닛(hidden unit)을 지칭합니다: 34 | 35 |
36 | 37 | **7. where we note w, b, z the weight, bias and output respectively.** 38 | 39 | ⟶ 여기에서 w, b, z는 각각 가중치(weight), 절편(bias), 출력입니다. 40 | 41 |
42 | 43 | **8. Activation function ― Activation functions are used at the end of a hidden unit to introduce non-linear complexities to the model. Here are the most common ones:** 44 | 45 | ⟶ 활성화 함수(activation function) - 활성화 함수는 은닉 유닛 다음에 추가하여 모델에 비선형성을 추가합니다. 다음과 같은 함수들을 자주 사용합니다: 46 | 47 |
48 | 49 | **9. [Sigmoid, Tanh, ReLU, Leaky ReLU]** 50 | 51 | ⟶ [시그모이드(Sigmoid), 하이퍼볼릭탄젠트(Tanh), 렐루(ReLU), Leaky 렐루(Leaky ReLU)] 52 | 53 |
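The four activation functions listed above can be sketched in a few lines of NumPy (the leak coefficient 0.01 for Leaky ReLU is a common default we assume here, not something the cheatsheet fixes):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))          # squashes into (0, 1)

def relu(x):
    return np.maximum(0.0, x)            # zero for negative inputs

def leaky_relu(x, eps=0.01):
    return np.where(x > 0, x, eps * x)   # small slope eps for negative inputs

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x))        # values in (0, 1)
print(np.tanh(x))        # values in (-1, 1)
print(relu(x))
print(leaky_relu(x))
```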
54 | 55 | **10. Cross-entropy loss ― In the context of neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:** 56 | 57 | ⟶ 크로스 엔트로피(cross-entropy) 손실 - 신경망에서 널리 사용되는 크로스 엔트로피 손실 함수 L(z,y)는 다음과 같이 정의합니다: 58 | 59 |
60 | 61 | **11. Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. This can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.** 62 | 63 | ⟶ 학습률 - 학습률은 종종 α 또는 η로 표시하며 가중치 업데이트 양을 조절합니다. 학습률을 일정하게 고정하거나 적응적으로 바꿀 수도 있습니다. 적응적 학습률 방법인 Adam이 현재 가장 인기가 많습니다. 64 | 65 |
66 | 67 | **12. Backpropagation ― Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to weight w is computed using chain rule and is of the following form:** 68 | 69 | ⟶ 역전파(backpropagation) - 역전파는 실제 출력과 기대 출력을 비교하여 신경망의 가중치를 업데이트하는 방법입니다. 연쇄 법칙(chain rule)으로 표현된 가중치 w에 대한 도함수는 다음과 같이 쓸 수 있습니다: 70 | 71 |
72 | 73 | **13. As a result, the weight is updated as follows:** 74 | 75 | ⟶ 결국 가중치는 다음과 같이 업데이트됩니다: 76 | 77 |
78 | 79 | **14. Updating weights ― In a neural network, weights are updated as follows:** 80 | 81 | ⟶ 가중치 업데이트 - 신경망에서 가중치는 다음 단계를 따라 업데이트됩니다: 82 | 83 |
84 | 85 | **15. Step 1: Take a batch of training data.** 86 | 87 | ⟶ 1 단계: 훈련 데이터의 배치(batch)를 만듭니다. 88 | 89 |
90 | 91 | **16. Step 2: Perform forward propagation to obtain the corresponding loss.** 92 | 93 | ⟶ 2 단계: 정방향 계산을 수행하여 배치에 해당하는 손실(loss)을 얻습니다. 94 | 95 |
96 | 97 | **17. Step 3: Backpropagate the loss to get the gradients.** 98 | 99 | ⟶ 3 단계: 손실을 역전파하여 그래디언트(gradient)를 구합니다. 100 | 101 |
102 | 103 | **18. Step 4: Use the gradients to update the weights of the network.** 104 | 105 | ⟶ 4 단계: 그래디언트를 사용해 네트워크의 가중치를 업데이트합니다. 106 | 107 |
108 | 109 | **19. Dropout ― Dropout is a technique meant at preventing overfitting the training data by dropping out units in a neural network. In practice, neurons are either dropped with probability p or kept with probability 1−p** 110 | 111 | ⟶ 드롭아웃(dropout) - 드롭아웃은 신경망의 유닛을 꺼서 훈련 데이터에 과대적합(overfitting)되는 것을 막는 기법입니다. 실전에서는 확률 p로 유닛을 끄거나 확률 1-p로 유닛을 작동시킵니다. 112 | 113 |
114 | 115 | **20. Convolutional Neural Networks** 116 | 117 | ⟶ 합성곱 신경망 118 | 119 |
120 | 121 | **21. Convolutional layer requirement ― By noting W the input volume size, F the size of the convolutional layer neurons, P the amount of zero padding, then the number of neurons N that fit in a given volume is such that:** 122 | 123 | ⟶ 합성곱 층의 조건 - 입력 크기를 W, 합성곱 층의 커널(kernel) 크기를 F, 제로 패딩(padding)을 P, 스트라이드(stride)를 S라 했을 때 필요한 뉴런의 수 N은 다음과 같습니다: 124 | 125 |
126 | 127 | **22. Batch normalization ― It is a step of hyperparameter γ,β that normalizes the batch {xi}. By noting μB,σ2B the mean and variance of that we want to correct to the batch, it is done as follows:** 128 | 129 | ⟶ 배치 정규화(batch normalization) - 하이퍼파라미터 γ,β로 배치 {xi}를 정규화하는 단계입니다. 조정하려는 배치의 평균과 분산을 각각 μB,σ2B라고 했을 때 배치 정규화는 다음과 같이 계산됩니다: 130 | 131 |
132 | 133 | **23. It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.** 134 | 135 | ⟶ 보통 완전 연결(fully connected)/합성곱 층과 활성화 함수 사이에 위치합니다. 배치 정규화를 적용하면 학습률을 높일 수 있고 초기화에 대한 의존도를 줄일 수 있습니다. 136 | 137 |
138 | 139 | **24. Recurrent Neural Networks** 140 | 141 | ⟶ 순환 신경망 142 | 143 |
144 | 145 | **25. Types of gates ― Here are the different types of gates that we encounter in a typical recurrent neural network:** 146 | 147 | ⟶ 게이트(gate) 종류 - 전형적인 순환 신경망에서 볼 수 있는 게이트 종류는 다음과 같습니다: 148 | 149 |
150 | 151 | **26. [Input gate, forget gate, gate, output gate]** 152 | 153 | ⟶ [입력 게이트, 삭제 게이트, 게이트, 출력 게이트] 154 | 155 |
156 | 157 | **27. [Write to cell or not?, Erase a cell or not?, How much to write to cell?, How much to reveal cell?]** 158 | 159 | ⟶ [셀(cell) 정보의 기록 여부, 셀 정보의 삭제 여부, 셀의 입력 조절, 셀의 출력 조절] 160 | 161 |
162 | 163 | **28. LSTM ― A long short-term memory (LSTM) network is a type of RNN model that avoids the vanishing gradient problem by adding 'forget' gates.** 164 | 165 | ⟶ LSTM - 장 단기 메모리(long short-term memory, LSTM) 네트워크는 삭제 게이트를 추가하여 그래디언트 소실 문제를 완화한 RNN 모델입니다. 166 | 167 |
168 | 169 | **29. Reinforcement Learning and Control** 170 | 171 | ⟶ 강화 학습(reinforcement learning)과 제어 172 | 173 | <br>
174 | 175 | **30. The goal of reinforcement learning is for an agent to learn how to evolve in an environment.** 176 | 177 | ⟶ 강화 학습의 목표는 에이전트가 주어진 환경 안에서 행동하는 법을 학습하게 하는 것입니다. 178 | 179 | <br>
180 | 181 | **31. Definitions** 182 | 183 | ⟶ 정의 184 | 185 |
186 | 187 | **32. Markov decision processes ― A Markov decision process (MDP) is a 5-tuple (S,A,{Psa},γ,R) where:** 188 | 189 | ⟶ 마르코프 결정 과정(Markov decision process) - 마르코프 결정 과정(MDP)은 다섯 개의 요소 (S,A,{Psa},γ,R)로 구성됩니다: 190 | 191 |
192 | 193 | **33. S is the set of states** 194 | 195 | ⟶ S는 상태(state)의 집합입니다. 196 | 197 |
198 | 199 | **34. A is the set of actions** 200 | 201 | ⟶ A는 행동(action)의 집합입니다. 202 | 203 |
204 | 205 | **35. {Psa} are the state transition probabilities for s∈S and a∈A** 206 | 207 | ⟶ {Psa}는 s∈S와 a∈A에 대한 상태 전이 확률(state transition probability)입니다. 208 | 209 | <br>
210 | 211 | **36. γ∈[0,1[ is the discount factor** 212 | 213 | ⟶ γ∈[0,1[는 할인 계수(discount factor)입니다. 214 | 215 | <br>
216 | 217 | **37. R:S×A⟶R or R:S⟶R is the reward function that the algorithm wants to maximize** 218 | 219 | ⟶ R:S×A⟶R 또는 R:S⟶R 는 알고리즘이 최대화하려는 보상 함수(reward function)입니다. 220 | 221 |
222 | 223 | **38. Policy ― A policy π is a function π:S⟶A that maps states to actions.** 224 | 225 | ⟶ 정책(policy) - 정책 π는 상태와 행동을 매핑하는 함수 π:S⟶A 입니다. 226 | 227 |
228 | 229 | **39. Remark: we say that we execute a given policy π if given a state s we take the action a=π(s).** 230 | 231 | ⟶ 참고: 상태 s가 주어졌을 때 정책 π를 실행하여 행동 a=π(s)를 선택한다고 말합니다. 232 | 233 |
234 | 235 | **40. Value function ― For a given policy π and a given state s, we define the value function Vπ as follows:** 236 | 237 | ⟶ 가치 함수(value function) - 정책 π와 상태 s가 주어졌을 때 가치 함수 Vπ를 다음과 같이 정의합니다: 238 | 239 |
240 | 241 | **41. Bellman equation ― The optimal Bellman equations characterizes the value function Vπ∗ of the optimal policy π∗:** 242 | 243 | ⟶ 벨만(Bellman) 방정식 - 벨만 최적 방정식은 최적 정책 π∗의 가치 함수 Vπ∗를 특징짓습니다: 244 | 245 | <br>
246 | 247 | **42. Remark: we note that the optimal policy π∗ for a given state s is such that:** 248 | 249 | ⟶ 참고: 주어진 상태 s에 대한 최적 정책 π∗는 다음과 같이 나타냅니다: 250 | 251 |
252 | 253 | **43. Value iteration algorithm ― The value iteration algorithm is in two steps:** 254 | 255 | ⟶ 가치 반복 알고리즘 - 가치 반복 알고리즘은 두 단계를 가집니다: 256 | 257 |
258 | 259 | **44. 1) We initialize the value:** 260 | 261 | ⟶ 1) 가치를 초기화합니다: 262 | 263 |
264 | 265 | **45. 2) We iterate the value based on the values before:** 266 | 267 | ⟶ 2) 이전 가치를 기반으로 다음 가치를 반복합니다: 268 | 269 |
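The two steps above amount to a short fixed-point loop. The sketch below runs them on a toy 2-state, 2-action MDP whose transition probabilities and rewards are made up for illustration:

```python
import numpy as np

# P[s, a, s'] = Psa(s'), R[s, a] = reward for taking a in s
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

V = np.zeros(2)                              # 1) initialize the value
for _ in range(500):                         # 2) iterate the Bellman update
    V = np.max(R + gamma * (P @ V), axis=1)  # max over actions
```

At convergence, applying the update once more leaves V unchanged, which is exactly the optimal Bellman equation.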
270 | 271 | **46. Maximum likelihood estimate ― The maximum likelihood estimates for the state transition probabilities are as follows:** 272 | 273 | ⟶ 최대 가능도 추정 - 상태 전이 확률에 대한 최대 가능도(maximum likelihood) 추정은 다음과 같습니다: 274 | 275 | <br>
276 | 277 | **47. times took action a in state s and got to s′** 278 | 279 | ⟶ 상태 s에서 행동 a를 선택하여 s′에 도달한 횟수 280 | 281 | <br>
282 | 283 | **48. times took action a in state s** 284 | 285 | ⟶ 상태 s에서 행동 a를 선택한 횟수 286 | 287 |
288 | 289 | **49. Q-learning ― Q-learning is a model-free estimation of Q, which is done as follows:** 290 | 291 | ⟶ Q-러닝(learning) - Q-러닝은 다음과 같은 Q의 모델-프리(model-free) 추정입니다: 292 | 293 |
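The model-free update can be sketched with a tabular example; the 5-state corridor environment below is hypothetical (action 0 steps left, bounded at state 0; action 1 steps right; reaching state 4 gives reward 1 and ends the episode):

```python
import random

random.seed(0)
n_states, n_actions = 5, 2
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma = 0.5, 0.9

for _ in range(2000):
    s = 0
    while s != 4:
        a = random.randrange(n_actions)              # explore at random
        s2 = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s2 == 4 else 0.0
        # model-free update toward r + gamma * max_a' Q(s', a')
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2
```

No transition model is ever estimated: Q is learned purely from sampled (s, a, r, s′) tuples, and "step right" ends up preferred in every non-terminal state.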
294 | 295 | **50. View PDF version on GitHub** 296 | 297 | ⟶ 깃허브(GitHub)에서 PDF 버전으로 보기 298 | 299 |
300 | 301 | **51. [Neural Networks, Architecture, Activation function, Backpropagation, Dropout]** 302 | 303 | ⟶ [신경망, 구조, 활성화 함수, 역전파, 드롭아웃] 304 | 305 | <br>
306 | 307 | **52. [Convolutional Neural Networks, Convolutional layer, Batch normalization]** 308 | 309 | ⟶ [합성곱 신경망, 합성곱 층, 배치 정규화] 310 | 311 | <br>
312 | 313 | **53. [Recurrent Neural Networks, Gates, LSTM]** 314 | 315 | ⟶ [순환 신경망, 게이트, LSTM] 316 | 317 |
318 | 319 | **54. [Reinforcement learning, Markov decision processes, Value/policy iteration, Approximate dynamic programming, Policy search]** 320 | 321 | ⟶ [강화 학습, 마르코프 결정 과정, 가치/정책 반복, 근사 동적 계획법, 정책 탐색] 322 | -------------------------------------------------------------------------------- /CONTRIBUTORS: -------------------------------------------------------------------------------- 1 | --ar 2 | cs-229-deep-learning 3 | Amjad Khatabi (translation) 4 | Zaid Alyafeai (review) 5 | 6 | cs-229-linear-algebra 7 | Zaid Alyafeai (translation) 8 | Amjad Khatabi (review) 9 | Mazen Melibari (review) 10 | 11 | cs-229-machine-learning-tips-and-tricks 12 | Fares Al-Qunaieer (translation) 13 | Zaid Alyafeai (review) 14 | 15 | cs-229-probability 16 | Mahmoud Aslan (translation) 17 | Fares Al-Qunaieer (review) 18 | 19 | cs-229-supervised-learning 20 | Fares Al-Qunaieer (translation) 21 | Zaid Alyafeai (review) 22 | 23 | cs-229-unsupervised-learning 24 | Redouane Lguensat (translation) 25 | Fares Al-Qunaieer (review) 26 | 27 | --de 28 | cs-229-deep-learning 29 | Philip Düe (translation) 30 | Bettina Schlager (review) 31 | 32 | --es 33 | cs-229-deep-learning 34 | Erick Gabriel Mendoza Flores (translation) 35 | Fernando Diaz (review) 36 | Fernando González-Herrera (review) 37 | Mariano Ramirez (review) 38 | Juan P. Chavat (review) 39 | Alonso Melgar López (review) 40 | Gustavo Velasco-Hernández (review) 41 | Juan Manuel Nava Zamudio (review) 42 | 43 | cs-229-linear-algebra 44 | Fernando González-Herrera (translation) 45 | Fernando Diaz (review) 46 | Gustavo Velasco-Hernández (review) 47 | Juan P. 
Chavat (review) 48 | 49 | cs-229-machine-learning-tips-and-tricks 50 | David Jiménez Paredes (translation) 51 | Fernando Diaz (translation) 52 | Gustavo Velasco-Hernández (review) 53 | Alonso Melgar-Lopez (review) 54 | 55 | cs-229-probability 56 | Fermin Ordaz (translation) 57 | Fernando González-Herrera (review) 58 | Alonso Melgar López (review) 59 | 60 | cs-229-supervised-learning 61 | Juan P. Chavat (translation) 62 | Fernando Gonzalez-Herrera (review) 63 | Fernando Diaz (review) 64 | Alonso Melgar-Lopez (review) 65 | 66 | cs-229-unsupervised-learning 67 | Jaime Noel Alvarez Luna (translation) 68 | Alonso Melgar López (review) 69 | Fernando Diaz (review) 70 | 71 | --et 72 | cs-229-machine-learning-tips-and-tricks 73 | kenkyusha (translation) 74 | stemajo (review) 75 | 76 | --fa 77 | cs-229-deep-learning 78 | AlisterTA (translation) 79 | Mohammad Karimi (review) 80 | Erfan Noury (review) 81 | 82 | cs-229-linear-algebra 83 | Erfan Noury (translation) 84 | Mohammad Karimi (review) 85 | 86 | cs-229-machine-learning-tips-and-tricks 87 | AlisterTA (translation) 88 | Mohammad Reza (translation) 89 | Erfan Noury (review) 90 | Mohammad Karimi (review) 91 | 92 | cs-229-probability 93 | Erfan Noury (translation) 94 | Mohammad Karimi (review) 95 | 96 | cs-229-supervised-learning 97 | Amirhosein Kazemnejad (translation) 98 | Erfan Noury (review) 99 | Mohammad Karimi (review) 100 | 101 | cs-229-unsupervised-learning 102 | Erfan Noury (translation) 103 | Mohammad Karimi (review) 104 | 105 | cs-230-convolutional-neural-networks 106 | AlisterTA (translation) 107 | Ehsan Kermani (translation) 108 | Erfan Noury (review) 109 | 110 | cs-230-deep-learning-tips-and-tricks 111 | AlisterTA (translation) 112 | Erfan Noury (review) 113 | 114 | cs-230-recurrent-neural-networks 115 | AlisterTA (translation) 116 | Erfan Noury (review) 117 | 118 | --fr 119 | Original authors 120 | 121 | --he 122 | 123 | --hi 124 | 125 | --id 126 | cs-229-linear-algebra 127 | Prasetia Utama Putra (translation) 
128 | Gunawan Tri (review) 129 | Jimmy The Lecturer (review) 130 | 131 | cs-229-probability 132 | Prasetia Utama Putra (translation) 133 | Jimmy The Lecturer (review) 134 | 135 | cs-230-convolutional-neural-networks 136 | Prasetia Utama Putra (translation) 137 | Gunawan Tri (review) 138 | Jimmy The Lecturer (review) 139 | 140 | --it 141 | cs-229-linear-algebra 142 | Alessandro Piotti (translation) 143 | Nicola Dall'Asen (review) 144 | 145 | cs-229-probability 146 | Nicola Dall'Asen (translation) 147 | Alessandro Piotti (review) 148 | 149 | --ko 150 | cs-229-deep-learning 151 | Haesun Park (translation) 152 | Danny Toeun Kim (review) 153 | 154 | cs-229-linear-algebra 155 | Soyoung Lee (translation) 156 | 157 | cs-229-machine-learning-tips-and-tricks 158 | Wooil Jeong (translation) 159 | 160 | cs-229-probability 161 | Wooil Jeong (translation) 162 | 163 | cs-229-supervised-learning 164 | Haesun Park (translation) 165 | Danny Toeun Kim (translation) 166 | 167 | cs-229-unsupervised-learning 168 | Kwang Hyeok Ahn (translation) 169 | 170 | cs-230-convolutional-neural-networks 171 | Soyoung Lee (translation) 172 | Jack Kang (review) 173 | 174 | --ja 175 | cs-229-deep-learning 176 | Taichi Kato (translation) 177 | Dan Lillrank (review) 178 | Yoshiyuki Nakai (review) 179 | Yuki Tokyo (review) 180 | 181 | cs-229-linear-algebra 182 | Robert Altena (translation) 183 | Kamuela Lau (review) 184 | 185 | cs-229-machine-learning-tips-and-tricks 186 | UMU (translation) 187 | Hiroki Mori (review) 188 | H. 
Hamano (review) 189 | Tian-Jian Jiang (review) 190 | Yuta Kanzawa (review) 191 | 192 | cs-229-probability 193 | Takatoshi Nao (translation) 194 | Yuta Kanzawa (review) 195 | 196 | cs-229-supervised-learning 197 | Yuta Kanzawa (translation) 198 | Tran Tuan Anh (review) 199 | 200 | cs-229-unsupervised-learning 201 | Tran Tuan Anh (translation) 202 | Yoshiyuki Nakai (review) 203 | Yuta Kanzawa (review) 204 | Dan Lillrank (review) 205 | 206 | cs-230-convolutional-neural-networks 207 | Tran Tuan Anh (translation) 208 | Yoshiyuki Nakai (review) 209 | Linh Dang (review) 210 | 211 | cs-230-deep-learning-tips-and-tricks 212 | Kamuela Lau (translation) 213 | Yoshiyuki Nakai (review) 214 | Hiroki Mori (review) 215 | 216 | cs-230-recurrent-neural-networks 217 | H. Hamano (translation) 218 | Yoshiyuki Nakai (review) 219 | 220 | --pt 221 | cs-229-deep-learning 222 | Gabriel Fonseca (translation) 223 | Leticia Portella (review) 224 | Renato Kano (review) 225 | 226 | cs-229-linear-algebra 227 | Gabriel Fonseca (translation) 228 | Leticia Portella (review) 229 | 230 | cs-229-machine-learning-tips-and-tricks 231 | Fernando Santos (translation) 232 | Leticia Portella (review) 233 | Gabriel Fonseca (review) 234 | 235 | cs-229-probability 236 | Leticia Portella (translation) 237 | Flavio Clesio (review) 238 | 239 | cs-229-supervised-learning 240 | Leticia Portella (translation) 241 | Gabriel Fonseca (review) 242 | Flavio Clesio (review) 243 | 244 | cs-229-unsupervised-learning 245 | Gabriel Fonseca (translation) 246 | Tiago Danin (review) 247 | 248 | cs-230-convolutional-neural-networks 249 | Leticia Portella (translation) 250 | Gabriel Aparecido Fonseca (review) 251 | 252 | --tr 253 | cs-221-logic-models 254 | Ayyüce Kızrak (translation) 255 | Başak Buluz (review) 256 | 257 | cs-221-reflex-models 258 | Yavuz Kömeçoğlu (translation) 259 | Ayyüce Kızrak (review) 260 | 261 | cs-221-states-models 262 | Cemal Gurpinar (translation) 263 | Başak Buluz (review) 264 | 265 | 
cs-221-variables-models 266 | Başak Buluz (translation) 267 | Ayyüce Kızrak (review) 268 | 269 | cs-229-deep-learning 270 | Ekrem Çetinkaya (translation) 271 | Omer Bukte (review) 272 | 273 | cs-229-linear-algebra 274 | Kadir Tekeli (translation) 275 | Ekrem Çetinkaya (review) 276 | 277 | cs-229-machine-learning-tips-and-tricks 278 | Seray Beşer (translation) 279 | Ayyüce Kızrak (review) 280 | Yavuz Kömeçoğlu (review) 281 | 282 | cs-229-probability 283 | Ayyüce Kızrak (translation) 284 | Başak Buluz (review) 285 | 286 | cs-229-supervised-learning 287 | Başak Buluz (translation) 288 | Ayyüce Kızrak (review) 289 | 290 | cs-229-unsupervised-learning 291 | Yavuz Kömeçoğlu (translation) 292 | Başak Buluz (review) 293 | 294 | cs-230-convolutional-neural-networks 295 | Ayyüce Kızrak (translation) 296 | Yavuz Kömeçoğlu (review) 297 | 298 | cs-230-deep-learning-tips-and-tricks 299 | Ayyüce Kızrak (translation) 300 | Yavuz Kömeçoğlu (review) 301 | 302 | cs-230-recurrent-neural-networks 303 | Başak Buluz (translation) 304 | Yavuz Kömeçoğlu (review) 305 | 306 | --uk 307 | cs-229-probability 308 | Gregory Reshetniak (translation) 309 | Denys (review) 310 | 311 | --vi 312 | cs-221-logic-models 313 | Hoàng Minh Tuấn (translation) 314 | Đàm Minh Tiến (review) 315 | 316 | cs-229-deep-learning 317 | Trần Tuấn Anh (translation) 318 | Phạm Hồng Vinh (review) 319 | Đàm Minh Tiến (review) 320 | Nguyễn Khánh Hưng (review) 321 | Hoàng Vũ Đạt (review) 322 | Nguyễn Trí Minh (review) 323 | 324 | cs-229-linear-algebra 325 | Hoàng Minh Tuấn (translation) 326 | Phạm Hồng Vinh (review) 327 | 328 | cs-229-machine-learning-tips-and-tricks 329 | Trần Tuấn Anh (translation) 330 | Nguyễn Trí Minh (review) 331 | Vinh Pham (review) 332 | Đàm Minh Tiến (review) 333 | 334 | cs-229-probability 335 | Hoàng Minh Tuấn (translation) 336 | Hung Nguyễn (review) 337 | 338 | cs-229-supervised-learning 339 | Trần Tuấn Anh (translation) 340 | Đàm Minh Tiến (review) 341 | Hung Nguyễn (review) 342 | Nguyễn Trí Minh 
(review) 343 | 344 | cs-229-unsupervised-learning 345 | Trần Tuấn Anh (translation) 346 | Đàm Minh Tiến (review) 347 | 348 | cs-230-convolutional-neural-networks 349 | Phạm Hồng Vinh (translation) 350 | Đàm Minh Tiến (review) 351 | 352 | cs-230-deep-learning-tips-and-tricks 353 | Hoàng Minh Tuấn (translation) 354 | Trần Tuấn Anh (review) 355 | Đàm Minh Tiến (review) 356 | 357 | cs-230-recurrent-neural-networks 358 | Trần Tuấn Anh (translation) 359 | Đàm Minh Tiến (review) 360 | Hung Nguyễn (review) 361 | Nguyễn Trí Minh (review) 362 | 363 | --zh 364 | cs-229-supervised-learning 365 | Wang Hongnian (translation) 366 | Xiaohu Zhu (朱小虎) (review) 367 | Chaoying Xue (review) 368 | 369 | --zh-tw 370 | cs-229-deep-learning 371 | kevingo (translation) 372 | TobyOoO (review) 373 | 374 | cs-229-linear-algebra 375 | kevingo (translation) 376 | Miyaya (review) 377 | 378 | cs-229-probability 379 | kevingo (translation) 380 | johnnychhsu (review) 381 | 382 | cs-229-supervised-learning 383 | kevingo (translation) 384 | accelsao (review) 385 | 386 | cs-229-unsupervised-learning 387 | kevingo (translation) 388 | imironhead (review) 389 | johnnychhsu (review) 390 | 391 | cs-229-machine-learning-tips-and-tricks 392 | kevingo (translation) 393 | kentropy (review) 394 | 395 | cs-230-convolutional-neural-networks 396 | kentropy (translation) 397 | kevingo (review) 398 | -------------------------------------------------------------------------------- /ja/cs-229-unsupervised-learning.md: -------------------------------------------------------------------------------- 1 | **1. Unsupervised Learning cheatsheet** 2 | 3 | ⟶教師なし学習チートシート 4 | 5 |
6 | 7 | **2. Introduction to Unsupervised Learning** 8 | 9 | ⟶教師なし学習の概要 10 | 11 |
12 | 13 | **3. Motivation ― The goal of unsupervised learning is to find hidden patterns in unlabeled data {x(1),...,x(m)}.** 14 | 15 | ⟶モチベーション - 教師なし学習の目的はラベルのないデータ{x(1),...,x(m)}に隠されたパターンを探すことです。 16 | 17 |
18 | 19 | **4. Jensen's inequality ― Let f be a convex function and X a random variable. We have the following inequality:** 20 | 21 | ⟶イェンセンの不等式 - fを凸関数、Xを確率変数とすると、次の不等式が成り立ちます: 22 | 23 |
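The inequality E[f(X)] ≥ f(E[X]) is easy to check empirically; the sketch below uses the convex function f(x) = x² and a standard normal sample (an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100_000)   # any distribution of X works
lhs = np.mean(X ** 2)          # E[f(X)], here approximately Var(X) = 1
rhs = np.mean(X) ** 2          # f(E[X]), here approximately 0
```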
24 | 25 | **5. Clustering** 26 | 27 | ⟶クラスタリング 28 | 29 |
30 | 31 | **6. Expectation-Maximization** 32 | 33 | ⟶期待値最大化 34 | 35 |
36 | 37 | **7. Latent variables ― Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted z. Here are the most common settings where there are latent variables:** 38 | 39 | ⟶潜在変数 - 潜在変数は推定問題を困難にする隠れた/観測されていない変数であり、多くの場合zで示されます。潜在変数がある最も一般的な設定は次のとおりです: 40 | 41 |
42 | 43 | **8. [Setting, Latent variable z, Comments]** 44 | 45 | ⟶[設定、潜在変数z、コメント] 46 | 47 |
48 | 49 | **9. [Mixture of k Gaussians, Factor analysis]** 50 | 51 | ⟶[k個のガウス分布の混合、因子分析] 52 | 53 |
54 | 55 | **10. Algorithm ― The Expectation-Maximization (EM) algorithm gives an efficient method at estimating the parameter θ through maximum likelihood estimation by repeatedly constructing a lower-bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:** 56 | 57 | ⟶アルゴリズム - EMアルゴリズムは次のように尤度の下限の構築(E-ステップ)と、その下限の最適化(M-ステップ)を繰り返し行うことによる最尤推定によりパラメーターθを推定する効率的な方法を提供します: 58 | 59 |
60 | 61 | **11. E-step: Evaluate the posterior probability Qi(z(i)) that each data point x(i) came from a particular cluster z(i) as follows:** 62 | 63 | ⟶E-ステップ: 各データポイントx(i)が特定クラスターz(i)に由来する事後確率Qi(z(i))を次のように評価します: 64 | 65 |
66 | 67 | **12. M-step: Use the posterior probabilities Qi(z(i)) as cluster specific weights on data points x(i) to separately re-estimate each cluster model as follows:** 68 | 69 | ⟶M-ステップ: 事後確率Qi(z(i))をデータポイントx(i)のクラスター固有の重みとして使い、次のように各クラスターモデルを個別に再推定します: 70 | 71 |
72 | 73 | **13. [Gaussians initialization, Expectation step, Maximization step, Convergence]** 74 | 75 | ⟶[ガウス分布初期化、期待値ステップ、最大化ステップ、収束] 76 | 77 |
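A compact illustration of the E-step/M-step alternation, assuming a hypothetical 1-D mixture of two Gaussians with equal weights and fixed unit variances so that only the means are re-estimated (a full EM would also update weights and variances):

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = np.array([-1.2, -0.8, -1.0, 3.9, 4.1, 4.0])   # made-up data, two groups
mu = np.array([0.0, 1.0])                         # initial component means

for _ in range(20):
    # E-step: posterior Q_i(z) that x_i came from each component
    p = normal_pdf(x[:, None], mu[None, :], 1.0)
    q = p / p.sum(axis=1, keepdims=True)
    # M-step: re-estimate each mean with the posteriors as weights
    mu = (q * x[:, None]).sum(axis=0) / q.sum(axis=0)
```

The means converge to roughly −1 and 4, the centers of the two groups.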
78 | 79 | **14. k-means clustering** 80 | 81 | ⟶k平均法 82 | 83 |
84 | 85 | **15. We note c(i) the cluster of data point i and μj the center of cluster j.** 86 | 87 | ⟶データポイントiのクラスタをc(i)、クラスタjの中心をμjと表記します。 88 | 89 |
90 | 91 | **16. Algorithm ― After randomly initializing the cluster centroids μ1,μ2,...,μk∈Rn, the k-means algorithm repeats the following step until convergence:** 92 | 93 | ⟶クラスターの重心μ1,μ2,...,μk∈Rnをランダムに初期化後、k-meansアルゴリズムが収束するまで次のようなステップを繰り返します: 94 | 95 |
96 | 97 | **17. [Means initialization, Cluster assignment, Means update, Convergence]** 98 | 99 | ⟶ [平均の初期化、クラスター割り当て、平均の更新、収束] 100 | 101 |
102 | 103 | **18. Distortion function ― In order to see if the algorithm converges, we look at the distortion function defined as follows:** 104 | 105 | ⟶ひずみ関数 - アルゴリズムが収束するかどうかを確認するため、次のように定義されたひずみ関数を参照します: 106 | 107 |
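The assignment/update loop and the distortion J can be sketched together. The two blobs below are synthetic, and the centroids are initialized here from one known point per blob for determinism (the algorithm normally initializes them at random):

```python
import numpy as np

def kmeans(X, mu, n_iter=20):
    """Alternate cluster assignment and mean update; return the
    assignments c, the means mu, and the distortion J(c, mu)."""
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        c = d.argmin(axis=1)                               # cluster assignment
        mu = np.array([X[c == j].mean(axis=0) for j in range(len(mu))])
    J = ((X - mu[c]) ** 2).sum()                           # distortion
    return c, mu, J

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),    # hypothetical blob near (0, 0)
               rng.normal(5, 0.5, (50, 2))])   # hypothetical blob near (5, 5)
c, mu, J = kmeans(X, mu=X[[0, 50]].copy())
```

Since the assignment and update steps each decrease J (or leave it unchanged), a non-increasing distortion is the convergence signal mentioned above.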
108 | 109 | **19. Hierarchical clustering** 110 | 111 | ⟶ 階層的クラスタリング 112 | 113 |
114 | 115 | **20. Algorithm ― It is a clustering algorithm with an agglomerative hierarchical approach that build nested clusters in a successive manner.** 116 | 117 | ⟶アルゴリズム - これは入れ子になったクラスタを逐次的に構築する凝集階層アプローチによるクラスタリングアルゴリズムです。 118 | 119 |
120 | 121 | **21. Types ― There are different sorts of hierarchical clustering algorithms that aims at optimizing different objective functions, which is summed up in the table below:** 122 | 123 | ⟶ 種類 ― 様々な目的関数を最適化するための様々な種類の階層クラスタリングアルゴリズムが以下の表にまとめられています。 124 | 125 |
126 | 127 | **22. [Ward linkage, Average linkage, Complete linkage]** 128 | 129 | ⟶ [ウォードリンケージ、平均リンケージ、完全リンケージ] 130 | 131 |
132 | 133 | **23. [Minimize within cluster distance, Minimize average distance between cluster pairs, Minimize maximum distance of between cluster pairs]** 134 | 135 | ⟶ [クラスター内の距離最小化、クラスターペア間の平均距離の最小化、クラスターペア間の最大距離の最小化] 136 | 137 |
138 | 139 | **24. Clustering assessment metrics** 140 | 141 | ⟶ クラスタリング評価指標 142 | 143 |
144 | 145 | **25. In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground truth labels as was the case in the supervised learning setting.** 146 | 147 | ⟶ 教師なし学習では、教師あり学習の場合のような正解ラベルがないため、モデルの性能を評価することが困難な場合が多くあります。 148 | 149 |
150 | 151 | **26. Silhouette coefficient ― By noting a and b the mean distance between a sample and all other points in the same class, and between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single sample is defined as follows:** 152 | 153 | ⟶ シルエット係数 ― ある1つのサンプルと同じクラス内のその他全ての点との平均距離をa、そのサンプルから最も近いクラスタ内の全ての点との平均距離をbと表記すると、そのサンプルのシルエット係数sは次のように定義されます: 154 | 155 |
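The per-sample definition s = (b − a)/max(a, b) is a one-liner; the distances in the example calls are made-up values for illustration:

```python
def silhouette(a, b):
    """Silhouette coefficient of one sample, given its mean intra-cluster
    distance a and its mean distance b to the next nearest cluster."""
    return (b - a) / max(a, b)

s_good = silhouette(a=0.5, b=2.0)   # 0.75: tight cluster, distant neighbor
s_edge = silhouette(a=2.0, b=2.0)   # 0.0: the sample sits between clusters
```

Values range over [−1, 1]; negative values mean the sample is on average closer to the neighboring cluster than to its own.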
156 | 157 | **27. Calinski-Harabaz index ― By noting k the number of clusters, Bk and Wk the between and within-clustering dispersion matrices respectively defined as** 158 | 159 | ⟶ Calinski-Harabazインデックス ― クラスタの数をkと表記すると、クラスタ間およびクラスタ内の分散行列であるBkおよびWkはそれぞれ以下のように定義されます。 160 | 161 |
162 | 163 | **28. the Calinski-Harabaz index s(k) indicates how well a clustering model defines its clusters, such that the higher the score, the more dense and well separated the clusters are. It is defined as follows:** 164 | 165 | ⟶ Calinski-Harabazインデックスs(k)はクラスタリングモデルが各クラスタをどの程度適切に定義しているかを示します。つまり、スコアが高いほど、各クラスタはより密で、十分に分離されています。それは次のように定義されます: 166 | 167 |
168 | 169 | **29. Dimension reduction** 170 | 171 | ⟶ 次元削減 172 | 173 |
174 | 175 | **30. Principal component analysis** 176 | 177 | ⟶ 主成分分析 178 | 179 |
180 | 181 | **31. It is a dimension reduction technique that finds the variance maximizing directions onto which to project the data.** 182 | 183 | ⟶ これは分散を最大にするデータの射影方向を見つける次元削減手法です。 184 | 185 |
186 | 187 | **32. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:** 188 | 189 | ⟶ 固有値、固有ベクトル - 行列 A∈Rn×nが与えられたとき、次の式で固有ベクトルと呼ばれるベクトルz∈Rn∖{0}が存在した場合に、λはAの固有値と呼ばれます。 190 | 191 |
192 | 193 | **33. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:** 194 | 195 | ⟶ スペクトル定理 - A∈Rn×nとする。Aが対称のとき、Aは実直交行列U∈Rn×nを用いて対角化可能です。Λ=diag(λ1,...,λn)と表記することで、次の式を得ます。 196 | 197 |
198 | 199 | **34. diagonal** 200 | 201 | ⟶ 対角 202 | 203 | <br>
204 | 205 | **35. Remark: the eigenvector associated with the largest eigenvalue is called principal eigenvector of matrix A.** 206 | 207 | ⟶ 注釈: 最大固有値に対応する固有ベクトルは行列Aの第1固有ベクトルと呼ばれる。 208 | 209 |
210 | 211 | **36. Algorithm ― The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k dimensions by maximizing the variance of the data as follows:** 212 | 213 | ⟶ アルゴリズム ― 主成分分析(PCA)の過程は、次のようにデータの分散を最大化することによりデータをk次元に射影する次元削減の技術です。 214 | 215 |
216 | 217 | **37. Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.** 218 | 219 | ⟶ ステップ1:平均が0で標準偏差が1となるようにデータを正規化します。 220 | 221 |
222 | 223 | **38. Step 2: Compute Σ=1mm∑i=1x(i)x(i)T∈Rn×n, which is symmetric with real eigenvalues.** 224 | 225 | ⟶ ステップ2:実固有値を持つ対称行列Σ=1mm∑i=1x(i)x(i)T∈Rn×nを計算します。 226 | 227 | <br>
228 | 229 | **39. Step 3: Compute u1,...,uk∈Rn the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors of the k largest eigenvalues.** 230 | 231 | ⟶ ステップ3:Σのk個の直交する主固有ベクトルu1,...,uk∈Rn、すなわち上位k個の固有値に対応する直交固有ベクトルを計算します。 232 | 233 | <br>
234 | 235 | **40. Step 4: Project the data on spanR(u1,...,uk).** 236 | 237 | ⟶ ステップ4:データをspanR(u1,...,uk)に射影します。 238 | 239 |
240 | 241 | **41. This procedure maximizes the variance among all k-dimensional spaces.** 242 | 243 | ⟶ この過程は全てのk次元空間の間の分散を最大化します。 244 | 245 |
246 | 247 | **42. [Data in feature space, Find principal components, Data in principal components space]** 248 | 249 | ⟶ [特徴空間内のデータ、主成分の探索、主成分空間内のデータ] 250 | 251 |
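Steps 1–4 of the PCA procedure map onto a few numpy lines. The sketch below uses synthetic data lying near a line, and the choice k = 1 is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1)) @ np.array([[3.0, 1.0, 0.0]]) \
    + 0.1 * rng.normal(size=(200, 3))      # 3-D data near a 1-D subspace

Z = (X - X.mean(axis=0)) / X.std(axis=0)   # step 1: mean 0, std 1
Sigma = Z.T @ Z / len(Z)                   # step 2: symmetric, real eigenvalues
vals, vecs = np.linalg.eigh(Sigma)         # eigenvalues in ascending order
U = vecs[:, ::-1][:, :1]                   # step 3: top k = 1 eigenvector(s)
proj = Z @ U                               # step 4: project on span(u1)
```

Because `eigh` sorts eigenvalues in ascending order, the columns are reversed before taking the top k.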
252 | 253 | **43. Independent component analysis** 254 | 255 | ⟶ 独立成分分析 256 | 257 |
258 | 259 | **44. It is a technique meant to find the underlying generating sources.** 260 | 261 | ⟶ 隠れた生成源を見つけることを意図した技術です。 262 | 263 |
264 | 265 | **45. Assumptions ― We assume that our data x has been generated by the n-dimensional source vector s=(s1,...,sn), where si are independent random variables, via a mixing and non-singular matrix A as follows:** 266 | 267 | ⟶ 仮定 ― 混合かつ非特異行列Aを通じて、データxはn次元の元となるベクトルs=(s1,...,sn)から次のように生成されると仮定します。ただしsiは独立でランダムな変数です: 268 | 269 |
270 | 271 | **46. The goal is to find the unmixing matrix W=A−1.** 272 | 273 | ⟶ 非混合行列W=A−1を見つけることが目的です。 274 | 275 |
276 | 277 | **47. Bell and Sejnowski ICA algorithm ― This algorithm finds the unmixing matrix W by following the steps below:** 278 | 279 | ⟶ ベルとシノスキーのICAアルゴリズム ― このアルゴリズムは非混合行列Wを次のステップによって見つけます: 280 | 281 |
282 | 283 | **48. Write the probability of x=As=W−1s as:** 284 | 285 | ⟶ x=As=W−1sの確率を次のように表します: 286 | 287 |
288 | 289 | **49. Write the log likelihood given our training data {x(i),i∈[[1,m]]} and by noting g the sigmoid function as:** 290 | 291 | ⟶ 学習データを{x(i),i∈[[1,m]]}、シグモイド関数をgとし、対数尤度を次のように表します: 292 | 293 |
294 | 295 | **50. Therefore, the stochastic gradient ascent learning rule is such that for each training example x(i), we update W as follows:** 296 | 297 | ⟶ そのため、確率的勾配上昇法の学習規則は、学習サンプルx(i)に対して次のようにWを更新するものです: 298 | 299 |
300 | 301 | **51. The Machine Learning cheatsheets are now available in [target language].** 302 | 303 | ⟶ 機械学習チートシートは日本語で読めます。 304 | 305 |
306 | 307 | **52. Original authors** 308 | 309 | ⟶ 原著者 310 | 311 |
312 | 313 | **53. Translated by X, Y and Z** 314 | 315 | ⟶ X・Y・Z 訳 316 | 317 |
318 | 319 | **54. Reviewed by X, Y and Z** 320 | 321 | ⟶ X・Y・Z 校正 322 | 323 |
324 | 325 | **55. [Introduction, Motivation, Jensen's inequality]** 326 | 327 | ⟶ [導入、動機、イェンセンの不等式] 328 | 329 |
330 | 331 | **56. [Clustering, Expectation-Maximization, k-means, Hierarchical clustering, Metrics]** 332 | 333 | ⟶[クラスタリング、期待値最大化法、k-means、階層クラスタリング、指標] 334 | 335 |
336 | 337 | **57. [Dimension reduction, PCA, ICA]** 338 | 339 | ⟶ [次元削減、PCA、ICA] 340 | -------------------------------------------------------------------------------- /template/cs-221-logic-models.md: -------------------------------------------------------------------------------- 1 | **Logic-based models translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-221/cheatsheet-logic-models) 2 | 3 |
4 | 5 | **1. Logic-based models with propositional and first-order logic** 6 | 7 | ⟶ 8 | 9 |
10 | 11 | 12 | **2. Basics** 13 | 14 | ⟶ 15 | 16 |
17 | 18 | 19 | **3. Syntax of propositional logic ― By noting f,g formulas, and ¬,∧,∨,→,↔ connectives, we can write the following logical expressions:** 20 | 21 | ⟶ 22 | 23 |
24 | 25 | 26 | **4. [Name, Symbol, Meaning, Illustration]** 27 | 28 | ⟶ 29 | 30 |
31 | 32 | 33 | **5. [Affirmation, Negation, Conjunction, Disjunction, Implication, Biconditional]** 34 | 35 | ⟶ 36 | 37 |
38 | 39 | 40 | **6. [not f, f and g, f or g, if f then g, f, that is to say g]** 41 | 42 | ⟶ 43 | 44 |
45 | 46 | 47 | **7. Remark: formulas can be built up recursively out of these connectives.** 48 | 49 | ⟶ 50 | 51 |
52 | 53 | 54 | **8. Model ― A model w denotes an assignment of binary weights to propositional symbols.** 55 | 56 | ⟶ 57 | 58 |
59 | 60 | 61 | **9. Example: the set of truth values w={A:0,B:1,C:0} is one possible model to the propositional symbols A, B and C.** 62 | 63 | ⟶ 64 | 65 |
66 | 67 | 68 | **10. Interpretation function ― The interpretation function I(f,w) outputs whether model w satisfies formula f:** 69 | 70 | ⟶ 71 | 72 |
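A recursive evaluator makes the interpretation function concrete. The nested-tuple encoding of formulas below is a representation invented for this sketch, and the model w reuses the example of item 9:

```python
def I(f, w):
    """Return 1 if model w satisfies formula f, else 0."""
    if isinstance(f, str):                              # propositional symbol
        return w[f]
    op, *args = f
    if op == "not":
        return 1 - I(args[0], w)
    if op == "and":
        return I(args[0], w) & I(args[1], w)
    if op == "or":
        return I(args[0], w) | I(args[1], w)
    if op == "implies":                                 # f -> g == not f or g
        return max(1 - I(args[0], w), I(args[1], w))
    if op == "iff":
        return int(I(args[0], w) == I(args[1], w))
    raise ValueError(f"unknown connective: {op}")

w = {"A": 0, "B": 1, "C": 0}
```

Enumerating all 2^n assignments and keeping those with I(f, w) = 1 would then yield the set of models M(f) defined next.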
73 | 74 | 75 | **11. Set of models ― M(f) denotes the set of models w that satisfy formula f. Mathematically speaking, we define it as follows:** 76 | 77 | ⟶ 78 | 79 |
80 | 81 | 82 | **12. Knowledge base** 83 | 84 | ⟶ 85 | 86 |
87 | 88 | 89 | **13. Definition ― The knowledge base KB is the conjunction of all formulas that have been considered so far. The set of models of the knowledge base is the intersection of the set of models that satisfy each formula. In other words:** 90 | 91 | ⟶ 92 | 93 |
94 | 95 | 96 | **14. Probabilistic interpretation ― The probability that query f is evaluated to 1 can be seen as the proportion of models w of the knowledge base KB that satisfy f, i.e.:** 97 | 98 | ⟶ 99 | 100 |
101 | 102 | 103 | **15. Satisfiability ― The knowledge base KB is said to be satisfiable if at least one model w satisfies all its constraints. In other words:** 104 | 105 | ⟶ 106 | 107 |
108 | 109 | 110 | **16. satisfiable** 111 | 112 | ⟶ 113 | 114 |
115 | 116 | 117 | **17. Remark: M(KB) denotes the set of models compatible with all the constraints of the knowledge base.** 118 | 119 | ⟶ 120 | 121 |
122 | 123 | 124 | **18. Relation between formulas and knowledge base - We define the following properties between the knowledge base KB and a new formula f:** 125 | 126 | ⟶ 127 | 128 |
129 | 130 | 131 | **19. [Name, Mathematical formulation, Illustration, Notes]** 132 | 133 | ⟶ 134 | 135 |
136 | 137 | 138 | **20. [KB entails f, KB contradicts f, f contingent to KB]** 139 | 140 | ⟶ 141 | 142 |
143 | 144 | 145 | **21. [f does not bring any new information, Also written KB⊨f, No model satisfies the constraints after adding f, Equivalent to KB⊨¬f, f does not contradict KB, f adds a non-trivial amount of information to KB]** 146 | 147 | ⟶ 148 | 149 |
150 | 151 | 152 | **22. Model checking ― A model checking algorithm takes as input a knowledge base KB and outputs whether it is satisfiable or not.** 153 | 154 | ⟶ 155 | 156 |
157 | 158 | 159 | **23. Remark: popular model checking algorithms include DPLL and WalkSat.** 160 | 161 | ⟶ 162 | 163 |
164 | 165 | 166 | **24. Inference rule ― An inference rule of premises f1,...,fk and conclusion g is written:** 167 | 168 | ⟶ 169 | 170 |
171 | 172 | 173 | **25. Forward inference algorithm ― From a set of inference rules Rules, this algorithm goes through all possible f1,...,fk and adds g to the knowledge base KB if a matching rule exists. This process is repeated until no more additions can be made to KB.** 174 | 175 | ⟶ 176 | 177 |
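For definite (Horn) rules, the forward inference algorithm is a small fixed-point loop; the rule encoding and the weather symbols below are made up for illustration:

```python
def forward_inference(facts, rules):
    """Repeatedly apply modus ponens: each rule is (premises, conclusion);
    add the conclusion whenever all premises are in the knowledge base,
    and stop when no more additions can be made."""
    kb = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if set(premises) <= kb and conclusion not in kb:
                kb.add(conclusion)
                changed = True
    return kb

rules = [(("rain",), "wet"), (("wet", "cold"), "ice")]
kb = forward_inference({"rain", "cold"}, rules)   # derives "wet", then "ice"
```

Here KB⊢ice holds from {rain, cold}, but not from {rain} alone, since the second rule's premises are never all satisfied.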
178 | 179 | 180 | **26. Derivation ― We say that KB derives f (written KB⊢f) with rules Rules if f already is in KB or gets added during the forward inference algorithm using the set of rules Rules.** 181 | 182 | ⟶ 183 | 184 |
185 | 186 | 187 | **27. Properties of inference rules ― A set of inference rules Rules can have the following properties:** 188 | 189 | ⟶ 190 | 191 |
192 | 193 | 194 | **28. [Name, Mathematical formulation, Notes]** 195 | 196 | ⟶ 197 | 198 |
199 | 200 | 201 | **29. [Soundness, Completeness]** 202 | 203 | ⟶ 204 | 205 |
206 | 207 | 208 | **30. [Inferred formulas are entailed by KB, Can be checked one rule at a time, "Nothing but the truth", Formulas entailing KB are either already in the knowledge base or inferred from it, "The whole truth"]** 209 | 210 | ⟶ 211 | 212 |
213 | 214 | 215 | **31. Propositional logic** 216 | 217 | ⟶ 218 | 219 |
220 | 221 | 222 | **32. In this section, we will go through logic-based models that use logical formulas and inference rules. The idea here is to balance expressivity and computational efficiency.** 223 | 224 | ⟶ 225 | 226 |
227 | 228 | 229 | **33. Horn clause ― By noting p1,...,pk and q propositional symbols, a Horn clause has the form:** 230 | 231 | ⟶ 232 | 233 |
234 | 235 | 236 | **34. Remark: when q=false, it is called a "goal clause", otherwise we denote it as a "definite clause".** 237 | 238 | ⟶ 239 | 240 |
241 | 242 | 243 | **35. Modus ponens ― For propositional symbols f1,...,fk and p, the modus ponens rule is written:** 244 | 245 | ⟶ 246 | 247 |
248 | 249 | 250 | **36. Remark: it takes linear time to apply this rule, as each application generates a clause that contains a single propositional symbol.** 251 | 252 | ⟶ 253 | 254 | <br>
255 | 256 | 257 | **37. Completeness ― Modus ponens is complete with respect to Horn clauses if we suppose that KB contains only Horn clauses and p is an entailed propositional symbol. Applying modus ponens will then derive p.** 258 | 259 | ⟶ 260 | 261 |
262 | 263 | 264 | **38. Conjunctive normal form ― A conjunctive normal form (CNF) formula is a conjunction of clauses, where each clause is a disjunction of atomic formulas.** 265 | 266 | ⟶ 267 | 268 |
269 | 270 | 271 | **39. Remark: in other words, CNFs are ∧ of ∨.** 272 | 273 | ⟶ 274 | 275 |
276 | 277 | 278 | **40. Equivalent representation ― Every formula in propositional logic can be written into an equivalent CNF formula. The table below presents general conversion properties:** 279 | 280 | ⟶ 281 | 282 |
283 | 284 | 285 | **41. [Rule name, Initial, Converted, Eliminate, Distribute, over]** 286 | 287 | ⟶ 288 | 289 |
290 | 291 | 292 | **42. Resolution rule ― For propositional symbols f1,...,fn, and g1,...,gm as well as p, the resolution rule is written:** 293 | 294 | ⟶ 295 | 296 |
297 | 298 | 299 | **43. Remark: it can take exponential time to apply this rule, as each application generates a clause that has a subset of the propositional symbols.** 300 | 301 | ⟶ 302 | 303 |
304 | 305 | 306 | **44. [Resolution-based inference ― The resolution-based inference algorithm follows the following steps:, Step 1: Convert all formulas into CNF, Step 2: Repeatedly apply resolution rule, Step 3: Return unsatisfiable if and only if False, is derived]** 307 | 308 | ⟶ 309 | 310 |
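Item 44's three-step procedure can be sketched in Python as a side illustration (not part of the template text). It assumes the KB has already been converted to CNF (step 1) and encodes each clause as a set of `(symbol, polarity)` literals, an encoding chosen for this sketch:

```python
from itertools import combinations

def resolve(c1, c2):
    """All resolvents of two clauses; a clause is a frozenset of (symbol, polarity) literals."""
    out = []
    for sym, pol in c1:
        if (sym, not pol) in c2:          # complementary pair found
            out.append((c1 - {(sym, pol)}) | (c2 - {(sym, not pol)}))
    return out

def unsatisfiable(cnf_clauses):
    """Steps 2-3: saturate under the resolution rule; True iff the empty clause (False) is derived."""
    clauses = set(map(frozenset, cnf_clauses))
    while True:
        new = set()
        for c1, c2 in combinations(clauses, 2):
            for r in resolve(c1, c2):
                if not r:                 # empty clause = False derived
                    return True
                new.add(r)
        if new <= clauses:                # saturated without deriving False
            return False
        clauses |= new
```

For instance, {A}, {¬A ∨ B}, {¬B} is unsatisfiable, while {A}, {¬A ∨ B} is not; the exponential worst case noted in item 43 shows up in the saturation loop.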
311 | 312 | 313 | **45. First-order logic** 314 | 315 | ⟶ 316 | 317 |
318 | 319 | 320 | **46. The idea here is to use variables to yield more compact knowledge representations.** 321 | 322 | ⟶ 323 | 324 |
325 | 326 | 327 | **47. [Model ― A model w in first-order logic maps:, constant symbols to objects, predicate symbols to tuple of objects]** 328 | 329 | ⟶ 330 | 331 |
332 | 333 | 334 | **48. Horn clause ― By noting x1,...,xn variables and a1,...,ak,b atomic formulas, the first-order logic version of a Horn clause has the form:** 335 | 336 | ⟶ 337 | 338 | <br>
339 | 340 | 341 | **49. Substitution ― A substitution θ maps variables to terms and Subst[θ,f] denotes the result of substitution θ on f.** 342 | 343 | ⟶ 344 | 345 |
346 | 347 | 348 | **50. Unification ― Unification takes two formulas f and g and returns the most general substitution θ that makes them equal:** 349 | 350 | ⟶ 351 | 352 |
353 | 354 | 355 | **51. such that** 356 | 357 | ⟶ 358 | 359 |
360 | 361 | 362 | **52. Note: Unify[f,g] returns Fail if no such θ exists.** 363 | 364 | ⟶ 365 | 366 |
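Items 50-52 can be illustrated with a small unifier (again a side sketch, not template text). Representing variables as strings starting with `?`, compound terms as `(functor, arg1, ..., argn)` tuples, and omitting the occurs check are all assumptions of this sketch:

```python
def walk(term, theta):
    """Follow variable bindings in theta; variables are strings starting with '?'."""
    while isinstance(term, str) and term.startswith("?") and term in theta:
        term = theta[term]
    return term

def unify(f, g, theta=None):
    """Most general unifier of two terms, or None (= Fail)."""
    theta = {} if theta is None else theta
    f, g = walk(f, theta), walk(g, theta)
    if f == g:
        return theta
    if isinstance(f, str) and f.startswith("?"):
        return {**theta, f: g}
    if isinstance(g, str) and g.startswith("?"):
        return {**theta, g: f}
    if isinstance(f, tuple) and isinstance(g, tuple) \
            and f[0] == g[0] and len(f) == len(g):
        for a, b in zip(f[1:], g[1:]):
            theta = unify(a, b, theta)
            if theta is None:
                return None               # Fail
        return theta
    return None                           # Fail
```

For example, unifying Knows(?x, Mother) with Knows(John, ?y) yields θ = {?x ↦ John, ?y ↦ Mother}.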
367 | 368 | 369 | **53. Modus ponens ― By noting x1,...,xn variables, a1,...,ak and a′1,...,a′k atomic formulas and by calling θ=Unify(a′1∧...∧a′k,a1∧...∧ak) the first-order logic version of modus ponens can be written:** 370 | 371 | ⟶ 372 | 373 |
374 | 375 | 376 | **54. Completeness ― Modus ponens is complete for first-order logic with only Horn clauses.** 377 | 378 | ⟶ 379 | 380 |
381 | 382 | 383 | **55. Resolution rule ― By noting f1,...,fn, g1,...,gm, p, q formulas and by calling θ=Unify(p,q), the first-order logic version of the resolution rule can be written:** 384 | 385 | ⟶ 386 | 387 |
388 | 389 | 390 | **56. [Semi-decidability ― First-order logic, even restricted to only Horn clauses, is semi-decidable., if KB⊨f, forward inference on complete inference rules will prove f in finite time, if KB⊭f, no algorithm can show this in finite time]** 391 | 392 | ⟶ 393 | 394 |
395 | 396 | 397 | **57. [Basics, Notations, Model, Interpretation function, Set of models]** 398 | 399 | ⟶ 400 | 401 |
402 | 403 | 404 | **58. [Knowledge base, Definition, Probabilistic interpretation, Satisfiability, Relationship with formulas, Forward inference, Rule properties]** 405 | 406 | ⟶ 407 | 408 |
409 | 410 | 411 | **59. [Propositional logic, Clauses, Modus ponens, Conjunctive normal form, Representation equivalence, Resolution]** 412 | 413 | ⟶ 414 | 415 |
416 | 417 | 418 | **60. [First-order logic, Substitution, Unification, Resolution rule, Modus ponens, Resolution, Semi-decidability]** 419 | 420 | ⟶ 421 | 422 |
423 | 424 | 425 | **61. View PDF version on GitHub** 426 | 427 | ⟶ 428 | 429 |
430 | 431 | 432 | **62. Original authors** 433 | 434 | ⟶ 435 | 436 |
437 | 438 | 439 | **63. Translated by X, Y and Z** 440 | 441 | ⟶ 442 | 443 |
444 | 445 | 446 | **64. Reviewed by X, Y and Z** 447 | 448 | ⟶ 449 | 450 |
451 | 452 | 453 | **65. By X and Y** 454 | 455 | ⟶ 456 | 457 |
458 | 459 | 460 | **66. The Artificial Intelligence cheatsheets are now available in [target language].** 461 | 462 | ⟶ 463 | -------------------------------------------------------------------------------- /ko/cs-229-unsupervised-learning.md: -------------------------------------------------------------------------------- 1 | **1. Unsupervised Learning cheatsheet** 2 | 3 | ⟶ 비지도 학습 cheatsheet 4 | 5 |
6 | 7 | **2. Introduction to Unsupervised Learning** 8 | 9 | ⟶ 비지도 학습 소개 10 | 11 |
12 | 13 | **3. Motivation ― The goal of unsupervised learning is to find hidden patterns in unlabeled data {x(1),...,x(m)}.** 14 | 15 | ⟶ 동기부여 - 비지도학습의 목표는 {x(1),...,x(m)}와 같이 라벨링이 되어있지 않은 데이터 내의 숨겨진 패턴을 찾는것이다. 16 | 17 |
18 | 19 | **4. Jensen's inequality ― Let f be a convex function and X a random variable. We have the following inequality:** 20 | 21 | ⟶ 옌센 부등식 - f를 볼록함수로 하며 X는 확률변수로 두고 아래와 같은 부등식을 따르도록 하자. 22 | 23 |
24 | 25 | **5. Clustering** 26 | 27 | ⟶ 군집화 28 | 29 |
30 | 31 | **6. Expectation-Maximization** 32 | 33 | ⟶ 기댓값 최대화 34 | 35 |
36 | 37 | **7. Latent variables ― Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted z. Here are the most common settings where there are latent variables:** 38 | 39 | ⟶ 잠재변수 - 잠재변수들은 숨겨져있거나 관측되지 않는 변수들을 말하며, 이러한 변수들은 추정문제의 어려움을 가져온다. 그리고 잠재변수는 종종 z로 표기되어진다. 일반적인 잠재변수로 구성되어져있는 형태들을 살펴보자 40 | 41 |
42 | 43 | **8. [Setting, Latent variable z, Comments]** 44 | 45 | ⟶ 표기형태, 잠재변수 z, 주석 46 | 47 |
48 | 49 | **9. [Mixture of k Gaussians, Factor analysis]** 50 | 51 | ⟶ 가우시안 혼합모델, 요인분석 52 | 53 |
54 | 55 | **10. Algorithm ― The Expectation-Maximization (EM) algorithm gives an efficient method at estimating the parameter θ through maximum likelihood estimation by repeatedly constructing a lower-bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:** 56 | 57 | ⟶ 알고리즘 - 기댓값 최대화 (EM) 알고리즘은 모수 θ를 추정하는 효율적인 방법을 제공해준다. 모수 θ의 추정은 아래와 같이 우도의 아래 경계지점을 구성하는(E-step)과 그 우도의 아래 경계지점을 최적화하는(M-step)들의 반복적인 최대우도측정을 통해 추정된다. 58 | 59 |
60 | 61 | **11. E-step: Evaluate the posterior probability Qi(z(i)) that each data point x(i) came from a particular cluster z(i) as follows:** 62 | 63 | ⟶ E-step : 각 데이터 포인트 x(i)가 특정 클러스터 z(i)로부터 발생했을 사후확률 Qi(z(i))를 평가한다. 아래의 식 참조 64 | 65 | <br>
66 | 67 | **12. M-step: Use the posterior probabilities Qi(z(i)) as cluster specific weights on data points x(i) to separately re-estimate each cluster model as follows:** 68 | 69 | ⟶ M-step : 데이터 포인트 x(i)에 대한 클러스트의 특정 가중치로 사후확률 Qi(z(i))을 사용, 각 클러스트 모델을 개별적으로 재평가한다. 아래의 식 참조 70 | 71 |
72 | 73 | **13. [Gaussians initialization, Expectation step, Maximization step, Convergence]** 74 | 75 | ⟶ Gaussians 초기값, 기대 단계, 최대화 단계, 수렴 76 | 77 |
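The E-step/M-step loop of items 10-13 can be sketched for a one-dimensional mixture of two Gaussians (a side illustration for readers of this translation file; the crude initialization below is an assumption of the sketch):

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(xs, n_iter=50):
    """EM for a 1-D mixture of two Gaussians."""
    phi, mu, var = [0.5, 0.5], [min(xs), max(xs)], [1.0, 1.0]
    for _ in range(n_iter):
        # E-step: posterior Q_i(z) that point i came from each cluster
        q = []
        for x in xs:
            w = [phi[j] * normal_pdf(x, mu[j], var[j]) for j in range(2)]
            s = sum(w)
            q.append([wj / s for wj in w])
        # M-step: re-estimate each cluster with the posteriors as weights
        for j in range(2):
            nj = sum(qi[j] for qi in q)
            phi[j] = nj / len(xs)
            mu[j] = sum(qi[j] * x for qi, x in zip(q, xs)) / nj
            var[j] = sum(qi[j] * (x - mu[j]) ** 2 for qi, x in zip(q, xs)) / nj + 1e-9
    return phi, mu, var
```

Each pass constructs the lower bound on the likelihood (E-step) and then optimizes it (M-step), as the item states.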
78 | 79 | **14. k-means clustering** 80 | 81 | ⟶ k-평균 군집화 82 | 83 |
84 | 85 | **15. We note c(i) the cluster of data point i and μj the center of cluster j.** 86 | 87 | ⟶ c(i)는 데이터 포인트 i가 속한 군집을, μj는 군집 j의 중심을 나타낸다. 88 | 89 | <br>
90 | 91 | **16. Algorithm ― After randomly initializing the cluster centroids μ1,μ2,...,μk∈Rn, the k-means algorithm repeats the following step until convergence:** 92 | 93 | ⟶ 알고리즘 - 군집 중앙에 μ1,μ2,...,μk∈Rn 와 같이 무작위로 초기값을 잡은 후, k-평균 알고리즘이 수렴될때 까지 아래와 같은 단계를 반복한다. 94 | 95 |
96 | 97 | **17. [Means initialization, Cluster assignment, Means update, Convergence]** 98 | 99 | ⟶ 평균 초기값, 군집분할, 평균 재조정, 수렴 100 | 101 |
102 | 103 | **18. Distortion function ― In order to see if the algorithm converges, we look at the distortion function defined as follows:** 104 | 105 | ⟶ 왜곡 함수 - 알고리즘이 수렴하는지를 확인하기 위해서는 아래와 같은 왜곡함수를 정의해야 한다. 106 | 107 |
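Items 16-18 together amount to Lloyd's iterations plus a distortion check; a one-dimensional sketch (the 1-D restriction and explicit initial centroids are simplifications chosen here):

```python
def kmeans(points, k, centroids, n_iter=20):
    """k-means on 1-D points; returns assignments c, centroids mu, and distortion J."""
    for _ in range(n_iter):
        # cluster assignment: c(i) = argmin_j (x_i - mu_j)^2
        c = [min(range(k), key=lambda j: (x - centroids[j]) ** 2) for x in points]
        # means update: mu_j = mean of the points currently assigned to cluster j
        for j in range(k):
            members = [x for x, ci in zip(points, c) if ci == j]
            if members:
                centroids[j] = sum(members) / len(members)
    # distortion J(c, mu) = sum_i (x_i - mu_c(i))^2, non-increasing across iterations
    distortion = sum((x - centroids[ci]) ** 2 for x, ci in zip(points, c))
    return c, centroids, distortion
```

Monitoring the distortion across iterations is exactly the convergence check of item 18.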
108 | 109 | **19. Hierarchical clustering** 110 | 111 | ⟶ 계층적 군집분석 112 | 113 |
114 | 115 | **20. Algorithm ― It is a clustering algorithm with an agglomerative hierarchical approach that build nested clusters in a successive manner.** 116 | 117 | ⟶ 알고리즘 - 연속적 방식으로 중첩된 클러스트를 구축하는 결합형 계층적 접근방식을 사용하는 군집 알고리즘이다. 118 | 119 |
120 | 121 | **21. Types ― There are different sorts of hierarchical clustering algorithms that aims at optimizing different objective functions, which is summed up in the table below:** 122 | 123 | ⟶ 종류 - 다양한 목적함수의 최적화를 목표로하는 다양한 종류의 계층적 군집분석 알고리즘들이 있으며, 아래 표와 같이 요약되어있다. 124 | 125 |
126 | 127 | **22. [Ward linkage, Average linkage, Complete linkage]** 128 | 129 | ⟶ Ward 연결법, 평균 연결법, 완전 연결법 130 | 131 |
132 | 133 | **23. [Minimize within cluster distance, Minimize average distance between cluster pairs, Minimize maximum distance of between cluster pairs]** 134 | 135 | ⟶ 군집 거리 내에서의 최소화, 한쌍의 군집간 평균거리의 최소화, 한쌍의 군집간 최대거리의 최소화 136 | 137 |
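The linkage criteria of items 22-23 can be illustrated on one-dimensional clusters (using absolute difference as the distance is an assumption of this sketch):

```python
def average_linkage(c1, c2):
    """Mean pairwise distance between two 1-D clusters (what average linkage minimizes)."""
    return sum(abs(a - b) for a in c1 for b in c2) / (len(c1) * len(c2))

def complete_linkage(c1, c2):
    """Maximum pairwise distance between two 1-D clusters (what complete linkage minimizes)."""
    return max(abs(a - b) for a in c1 for b in c2)
```

At each agglomerative step, the pair of clusters with the smallest linkage value is merged.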
138 | 139 | **24. Clustering assessment metrics** 140 | 141 | ⟶ 군집화 평가 metrics 142 | 143 |
144 | 145 | **25. In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground truth labels as was the case in the supervised learning setting.** 146 | 147 | ⟶ 비지도학습 환경에서는, 지도학습 환경과는 다르게 실측자료에 라벨링이 없기 때문에 종종 모델에 대한 성능평가가 어렵다. 148 | 149 |
150 | 151 | **26. Silhouette coefficient ― By noting a and b the mean distance between a sample and all other points in the same class, and between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single sample is defined as follows:** 152 | 153 | ⟶ 실루엣 계수 - a와 b를 같은 클래스의 다른 모든점과 샘플 사이의 평균거리와 다음 가장 가까운 군집의 다른 모든 점과 샘플사이의 평균거리로 표기하면 단일 샘플에 대한 실루엣 계수 s는 다음과 같이 정의할 수 있다. 154 | 155 |
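Item 26's formula s=(b−a)/max(a,b) for a single sample, sketched on one-dimensional points (a simplification chosen for this sketch):

```python
def silhouette(x, own_cluster, nearest_cluster):
    """Silhouette coefficient s for one 1-D sample x.
    a: mean distance to the other points of x's own cluster,
    b: mean distance to the points of the next nearest cluster."""
    a = sum(abs(x - p) for p in own_cluster) / len(own_cluster)
    b = sum(abs(x - p) for p in nearest_cluster) / len(nearest_cluster)
    return (b - a) / max(a, b)
```

Values near 1 indicate the sample sits well inside its own cluster, values near −1 that it would fit the neighboring cluster better.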
156 | 157 | **27. Calinski-Harabaz index ― By noting k the number of clusters, Bk and Wk the between and within-clustering dispersion matrices respectively defined as** 158 | 159 | ⟶ Calinski-Harabaz 색인 - k개 군집에 Bk와 Wk를 표기하면, 다음과 같이 각각 정의 된 군집간 분산행렬이다. 160 | 161 |
162 | 163 | **28. the Calinski-Harabaz index s(k) indicates how well a clustering model defines its clusters, such that the higher the score, the more dense and well separated the clusters are. It is defined as follows:** 164 | 165 | ⟶ Calinski-Harabaz 색인 s(k)는 군집모델이 군집화를 얼마나 잘 정의하는지를 나타낸다. 가령 높은 점수일수록 군집이 더욱 밀도있으며 잘 분리되는 형태이다. 아래와 같은 정의를 따른다. 166 | 167 |
168 | 169 | **29. Dimension reduction** 170 | 171 | ⟶ 차원 축소 172 | 173 |
174 | 175 | **30. Principal component analysis** 176 | 177 | ⟶ 주성분 분석 178 | 179 |
180 | 181 | **31. It is a dimension reduction technique that finds the variance maximizing directions onto which to project the data.** 182 | 183 | ⟶ 데이터를 투영할 분산 최대화 방향을 찾는 차원축소 기법이다. 184 | 185 | <br>
186 | 187 | **32. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:** 188 | 189 | ⟶ 고유값, 고유벡터 - 행렬 A∈Rn×n가 주어질 때, 고유벡터라 불리는 벡터 z∈Rn∖{0}가 존재하여 다음을 만족하면 λ를 A의 고유값이라 한다: 190 | 191 | <br>
192 | 193 | **33. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:** 194 | 195 | ⟶ 스펙트럼 정리 - A∈Rn×n 이라고 하자 만약 A가 대칭이라면, A는 실수 직교 행렬 U∈Rn×n에 의해 대각행렬로 만들 수 있다. 196 | 197 |
198 | 199 | **34. diagonal** 200 | 201 | ⟶ 대각선 202 | 203 |
204 | 205 | **35. Remark: the eigenvector associated with the largest eigenvalue is called principal eigenvector of matrix A.** 206 | 207 | ⟶ 참조: 가장 큰 고유값과 연관된 고유 벡터를 행렬 A의 주요 고유벡터라고 부른다 208 | 209 |
210 | 211 | **36. Algorithm ― The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k 212 | dimensions by maximizing the variance of the data as follows:** 213 | 214 | ⟶ 알고리즘 - 주성분 분석(PCA) 절차는 데이터 분산을 최대화하여 k 차원의 데이터를 투영하는 차원 축소 기술로 다음과 같이 따른다. 215 | 216 |
217 | 218 | **37. Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.** 219 | 220 | ⟶ 1단계: 평균을 0으로 표준편차가 1이되도록 데이터를 표준화한다. 221 | 222 |
223 | 224 | **38. Step 2: Compute Σ=1mm∑i=1x(i)x(i)T∈Rn×n, which is symmetric with real eigenvalues.** 225 | 226 | ⟶ 2단계: 실수 고유값을 가지는 대칭행렬 Σ=1mm∑i=1x(i)x(i)T∈Rn×n를 계산한다. 227 | 228 | <br>
229 | 230 | **39. Step 3: Compute u1,...,uk∈Rn the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors of the k largest eigenvalues.** 231 | 232 | ⟶ 3단계: Σ의 k개 직교 주요 고유벡터 u1,...,uk∈Rn를 계산한다. 다시 말하면, 가장 큰 k개의 고유값에 대응하는 직교 고유벡터들이다. 233 | 234 | <br>
235 | 236 | **40. Step 4: Project the data on spanR(u1,...,uk).** 237 | 238 | ⟶ 4단계: R(u1,...,uk) 범위에 데이터를 투영하자. 239 | 240 |
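Steps 1-3 of items 36-40 can be sketched for two-dimensional data, using the closed-form eigendecomposition of the 2×2 matrix Σ (restricting to 2-D and to the first component are this sketch's simplifications):

```python
import math

def pca_first_component(data):
    """Steps 1-3 on 2-D points: normalize, build Sigma, take its principal eigenvector."""
    m = len(data)
    # Step 1: normalize each coordinate to mean 0 and standard deviation 1
    cols = list(zip(*data))
    means = [sum(col) / m for col in cols]
    stds = [math.sqrt(sum((v - mu) ** 2 for v in col) / m)
            for col, mu in zip(cols, means)]
    z = [tuple((v - mu) / s for v, mu, s in zip(row, means, stds)) for row in data]
    # Step 2: Sigma = (1/m) sum_i x(i) x(i)^T, a symmetric 2x2 matrix [[a, b], [b, c]]
    a = sum(x * x for x, _ in z) / m
    b = sum(x * y for x, y in z) / m
    c = sum(y * y for _, y in z) / m
    # Step 3: largest eigenvalue and its (unit) eigenvector, closed form for 2x2
    lam = (a + c + math.sqrt((a - c) ** 2 + 4 * b * b)) / 2
    v = (b, lam - a) if b != 0 else (1.0, 0.0)
    norm = math.hypot(*v)
    return lam, (v[0] / norm, v[1] / norm)
```

Step 4 is then the projection x ↦ ⟨x, u1⟩ of each normalized point onto the returned vector.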
241 | 242 | **41. This procedure maximizes the variance among all k-dimensional spaces.** 243 | 244 | ⟶ 해당 절차는 모든 k-차원의 공간들 사이에 분산을 최대화 하는것이다. 245 | 246 |
247 | 248 | **42. [Data in feature space, Find principal components, Data in principal components space]** 249 | 250 | ⟶ 변수공간의 데이터, 주요성분들 찾기, 주요성분공간의 데이터 251 | 252 |
253 | 254 | **43. Independent component analysis** 255 | 256 | ⟶ 독립성분분석 257 | 258 |
259 | 260 | **44. It is a technique meant to find the underlying generating sources.** 261 | 262 | ⟶ 근원적인 생성원을 찾기위한 기술을 의미한다. 263 | 264 |
265 | 266 | **45. Assumptions ― We assume that our data x has been generated by the n-dimensional source vector s=(s1,...,sn), where si are independent random variables, via a mixing and non-singular matrix A as follows:** 267 | 268 | ⟶ 가정 - 다음과 같이 우리는 데이터 x가 n차원의 소스벡터 s=(s1,...,sn)에서부터 생성되었음을 가정한다. 이때 si는 독립적인 확률변수에서 나왔으며, 혼합 및 비특이 행렬 A를 통해 생성된다고 가정한다. 269 | 270 |
271 | 272 | **46. The goal is to find the unmixing matrix W=A−1.** 273 | 274 | ⟶ 비혼합 행렬 W=A−1를 찾는 것을 목표로 한다. 275 | 276 |
277 | 278 | **47. Bell and Sejnowski ICA algorithm ― This algorithm finds the unmixing matrix W by following the steps below:** 279 | 280 | ⟶ Bell과 Sejnowski 독립성분분석(ICA) 알고리즘 - 다음의 단계들을 따르는 비혼합 행렬 W를 찾는 알고리즘이다. 281 | 282 |
283 | 284 | **48. Write the probability of x=As=W−1s as:** 285 | 286 | ⟶ x=As=W−1s의 확률을 다음과 같이 기술한다. 287 | 288 |
289 | 290 | **49. Write the log likelihood given our training data {x(i),i∈[[1,m]]} and by noting g the sigmoid function as:** 291 | 292 | ⟶ 주어진 학습데이터 {x(i),i∈[[1,m]]}에 로그우도를 기술하고 시그모이드 함수 g를 다음과 같이 표기한다. 293 | 294 |
295 | 296 | **50. Therefore, the stochastic gradient ascent learning rule is such that for each training example x(i), we update W as follows:** 297 | 298 | ⟶ 그러므로, 확률적 경사상승 학습 규칙은 각 학습예제 x(i)에 대해서 다음과 같이 W를 업데이트하는 것과 같다. 299 | 300 |
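Item 50's update W ← W + α[(1−2g(Wx))xT + (WT)−1] can be sketched for one training example in the 2×2 case (fixing the dimension at 2, so the inverse has a closed form, is this sketch's assumption):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def ica_update(W, x, alpha=0.01):
    """One stochastic gradient ascent step of the Bell and Sejnowski ICA rule, 2x2 case:
    W <- W + alpha * [ (1 - 2 g(W x)) x^T + (W^T)^(-1) ]  with g the sigmoid."""
    wx = [W[0][0] * x[0] + W[0][1] * x[1],
          W[1][0] * x[0] + W[1][1] * x[1]]
    g = [1.0 - 2.0 * sigmoid(v) for v in wx]
    det = W[0][0] * W[1][1] - W[0][1] * W[1][0]
    # (W^T)^(-1) = (W^(-1))^T, written out explicitly for the 2x2 case
    wt_inv = [[ W[1][1] / det, -W[1][0] / det],
              [-W[0][1] / det,  W[0][0] / det]]
    return [[W[i][j] + alpha * (g[i] * x[j] + wt_inv[i][j]) for j in range(2)]
            for i in range(2)]
```

Iterating this update over the training examples drives W toward the unmixing matrix A−1.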
301 | 302 | **51. The Machine Learning cheatsheets are now available in Korean.** 303 | 304 | ⟶ 머신러닝 cheatsheets는 현재 한국어로 제공된다. 305 | 306 |
307 | 308 | **52. Original authors** 309 | 310 | ⟶ 원저자 311 | 312 |
313 | 314 | **53. Translated by X, Y and Z** 315 | 316 | ⟶ X,Y,Z에 의해 번역되다. 317 | 318 |
319 | 320 | **54. Reviewed by X, Y and Z** 321 | 322 | ⟶ X,Y,Z에 의해 검토되다. 323 | 324 |
325 | 326 | **55. [Introduction, Motivation, Jensen's inequality]** 327 | 328 | ⟶ 소개, 동기부여, 얀센 부등식 329 | 330 |
331 | 332 | **56. [Clustering, Expectation-Maximization, k-means, Hierarchical clustering, Metrics]** 333 | 334 | ⟶ 군집화, 기댓값-최대화, k-means, 계층적 군집화, 측정지표 335 | 336 |
337 | 338 | **57. [Dimension reduction, PCA, ICA]** 339 | 340 | ⟶ 차원축소, 주성분분석(PCA), 독립성분분석(ICA) 341 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Translation of VIP Cheatsheets 2 | ## Goal 3 | This repository aims at collaboratively translating our [Machine Learning](https://github.com/afshinea/stanford-cs-229-machine-learning), [Deep Learning](https://github.com/afshinea/stanford-cs-230-deep-learning) and [Artificial Intelligence](https://github.com/afshinea/stanford-cs-221-artificial-intelligence) cheatsheets into a ton of languages, so that this content can be enjoyed by anyone from any part of the world! 4 | 5 | ## Contribution guidelines 6 | The translation process of each cheatsheet contains two steps: 7 | - the **translation** step, where contributors follow a template of items to translate, 8 | - the **review** step, where contributors go through each expression translated by their peers, on top of which they add their suggestions and remarks. 9 | 10 | ### Translators 11 | 0. Check for [existing pull requests](https://github.com/shervinea/cheatsheet-translation/pulls) to see which cheatsheet is yet to be translated. 12 | 13 | 1. Fork the repository. 14 | 15 | 2. Copy [the template](https://github.com/shervinea/cheatsheet-translation/tree/master/template) of the cheatsheet you wish to translate into the language folder with a naming that follows the [ISO 639-1 notation](https://www.loc.gov/standards/iso639-2/php/code_list.php) (e.g. `[es]` for Spanish, `[zh]` for Mandarin Chinese). 16 | 17 | 3. Translate sentences by keeping the following structure: 18 | > 34. **English blabla** 19 | > 20 | > ⟶ Translated blabla 21 | 22 | 4. Commit the changes to your forked repository. 23 | 24 | 5. Submit a [pull request](https://help.github.com/articles/creating-a-pull-request/) and call it `[language code] file-name`. 
For example, the PR related to the translation in Spanish of the `template/cs-229-deep-learning.md` cheatsheet will be entitled `[es] cs-229-deep-learning`. 25 | 26 | ### Reviewers 27 | 1. Go to the [list of pull requests](https://github.com/shervinea/cheatsheet-translation/pulls) and filter them by your native language. 28 | 29 | 2. Locate pull requests where help is needed. Those contain the tag `reviewer wanted`. 30 | 31 | 3. Review the content line per line and add comments and suggestions when necessary. 32 | 33 | ### Important note 34 | Please make sure to propose the translation of **only one** cheatsheet per pull request -- it simplifies a lot the review process. 35 | 36 | ## Progression 37 | ### CS 221 (Artificial Intelligence) 38 | | |[Reflex models](https://github.com/shervinea/cheatsheet-translation/blob/master/template/cs-221-reflex-models.md)|[States models](https://github.com/shervinea/cheatsheet-translation/blob/master/template/cs-221-states-models.md)|[Variables models](https://github.com/shervinea/cheatsheet-translation/blob/master/template/cs-221-variables-models.md)|[Logic models](https://github.com/shervinea/cheatsheet-translation/blob/master/template/cs-221-logic-models.md)| 39 | |:---|:---:|:---:|:---:|:---:| 40 | |**Deutsch**|not started|not started|not started|not started| 41 | |**Español**|not started|not started|not started|not started| 42 | |**فارسی**|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/200)|not started|not started|not started| 43 | |**Français**|done|done|done|done| 44 | |**עִבְרִית**|not started|not started|not started|not started| 45 | |**Italiano**|not started|not started|not started|not started| 46 | |**日本語**|not started|not started|not started|not started| 47 | |**한국어**|not started|not started|not started|not started| 48 | |**Português**|not started|not started|not started|not started| 49 | |**Русский**|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/218)|[in 
progress](https://github.com/shervinea/cheatsheet-translation/pull/219)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/220)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/217)| 50 | |**Türkçe**|done|done|done|done| 51 | |**Tiếng Việt**|not started|not started|not started|done| 52 | |**简体中文**|not started|not started|not started|not started| 53 | |**繁體中文**|not started|not started|not started|not started| 54 | 55 | ### CS 229 (Machine Learning) 56 | | |[Deep learning](https://github.com/shervinea/cheatsheet-translation/blob/master/template/cs-229-deep-learning.md)|[Supervised](https://github.com/shervinea/cheatsheet-translation/blob/master/template/cs-229-supervised-learning.md)|[Unsupervised](https://github.com/shervinea/cheatsheet-translation/blob/master/template/cs-229-unsupervised-learning.md)|[ML tips](https://github.com/shervinea/cheatsheet-translation/blob/master/template/cs-229-machine-learning-tips-and-tricks.md)|[Probabilities](https://github.com/shervinea/cheatsheet-translation/blob/master/template/cs-229-probability.md)|[Algebra](https://github.com/shervinea/cheatsheet-translation/blob/master/template/cs-229-linear-algebra.md)| 57 | |:---|:---:|:---:|:---:|:---:|:---:|:---:| 58 | |**العَرَبِيَّة**|done|done|done|done|done|done| 59 | |**Català**|not started|not started|not started|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/47)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/47)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/47)| 60 | |**Deutsch**|done|not started|not started|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/135)|not started|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/136)| 61 | |**Ελληνικά**|not started|not started|not started|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/209)|not started|not started| 62 | 
|**Español**|done|done|done|done|done|done| 63 | |**Eesti**|not started|not started|not started|done|not started|not started| 64 | |**فارسی**|done|done|done|done|done|done| 65 | |**Suomi**|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/34)|not started|not started|not started|not started|not started| 66 | |**Français**|done|done|done|done|done|done| 67 | |**עִבְרִית**|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/156)|not started|not started|not started|not started|not started| 68 | |**हिन्दी**|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/37)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/46)|not started|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/40)|not started|not started| 69 | |**Magyar**|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/124)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/124)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/124)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/124)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/124)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/124)| 70 | |**Bahasa Indonesia**|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/154)|not started|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/139)|not started|done|done| 71 | |**Italiano**|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/78)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/207)|not started|not started|done|done| 72 | |**日本語**|done|done|done|done|done|done| 73 | |**한국어**|done|done|done|done|done|done| 74 | |**Polski**|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/8)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/8)|not started|[in 
progress](https://github.com/shervinea/cheatsheet-translation/pull/8)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/208)|not started| 75 | |**Português**|done|done|done|done|done|done| 76 | |**Русский**|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/221)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/225)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/226)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/223)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/224)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/222)| 77 | |**Türkçe**|done|done|done|done|done|done| 78 | |**Українська**|not started|not started|not started|not started|done|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/95)| 79 | |**Tiếng Việt**|done|done|done|done|done|done| 80 | |**简体中文**|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/12)|done|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/48)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/7)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/73)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/72)| 81 | |**繁體中文**|done|done|done|done|done|done| 82 | 83 | ### CS 230 (Deep Learning) 84 | | |[Convolutional Neural Networks](https://github.com/shervinea/cheatsheet-translation/blob/master/template/cs-230-convolutional-neural-networks.md)|[Recurrent Neural Networks](https://github.com/shervinea/cheatsheet-translation/blob/master/template/cs-230-recurrent-neural-networks.md)|[Deep Learning tips](https://github.com/shervinea/cheatsheet-translation/blob/master/template/cs-230-deep-learning-tips-and-tricks.md)| 85 | |:---|:---:|:---:|:---:| 86 | |**العَرَبِيَّة**|not started|not started|not started| 87 | |**Català**|not started|not started|not started| 88 | 
|**Deutsch**|not started|not started|not started| 89 | |**Español**|not started|not started|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/210)| 90 | |**فارسی**|done|done|done| 91 | |**Suomi**|not started|not started|not started| 92 | |**Français**|done|done|done| 93 | |**עִבְרִית**|not started|not started|not started| 94 | |**हिन्दी**|not started|not started|not started| 95 | |**Magyar**|not started|not started|not started| 96 | |**Bahasa Indonesia**|done|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/152)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/153)| 97 | |**Italiano**|not started|not started|not started| 98 | |**日本語**|done|done|done| 99 | |**한국어**|done|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/107)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/108)| 100 | |**Polski**|not started|not started|not started| 101 | |**Português**|done|not started|not started| 102 | |**Русский**|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/227)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/229)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/228)| 103 | |**Türkçe**|done|done|done| 104 | |**Українська**|not started|not started|not started| 105 | |**Tiếng Việt**|done|done|done| 106 | |**简体中文**|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/212)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/181)|not started| 107 | |**繁體中文**|done|not started|not started| 108 | 109 | ## Acknowledgements 110 | Thank you everyone for your help! Please do not forget to add your name to the `CONTRIBUTORS` file so that we can give you proper credit in the cheatsheets' [official website](https://stanford.edu/~shervine/teaching). 
111 | -------------------------------------------------------------------------------- /zh-tw/cs-229-probability.md: -------------------------------------------------------------------------------- 1 | 1. **Probabilities and Statistics refresher** 2 | 3 | ⟶ 4 | 機率和統計回顧 5 |
6 | 7 | 2. **Introduction to Probability and Combinatorics** 8 | 9 | ⟶ 10 | 幾率與組合數學介紹 11 |
12 | 13 | 3. **Sample space ― The set of all possible outcomes of an experiment is known as the sample space of the experiment and is denoted by S.** 14 | 15 | ⟶ 16 | 樣本空間 - 一個實驗的所有可能結果的集合稱之為這個實驗的樣本空間,記做 S 17 |
18 | 19 | 4. **Event ― Any subset E of the sample space is known as an event. That is, an event is a set consisting of possible outcomes of the experiment. If the outcome of the experiment is contained in E, then we say that E has occurred.** 20 | 21 | ⟶ 22 | 事件 - 樣本空間的任何子集合 E 被稱之為一個事件。也就是說,一個事件是實驗的可能結果的集合。如果該實驗的結果包含在 E 中,我們稱 E 發生 23 | <br>
24 | 25 | 5. **Axioms of probability For each event E, we denote P(E) as the probability of event E occuring.** 26 | 27 | ⟶ 28 | 機率公理。對於每個事件 E,我們用 P(E) 表示事件 E 發生的機率 29 |
30 | 31 | 6. **Axiom 1 ― Every probability is between 0 and 1 included, i.e:** 32 | 33 | ⟶ 34 | 公理 1 - 每一個機率值介於 0 到 1 之間,包含兩端點。即: 35 |
36 | 37 | 7. **Axiom 2 ― The probability that at least one of the elementary events in the entire sample space will occur is 1, i.e:** 38 | 39 | ⟶ 40 | 公理 2 - 至少一個基本事件出現在整個樣本空間中的機率是 1。即: 41 |
42 | 43 | 8. **Axiom 3 ― For any sequence of mutually exclusive events E1,...,En, we have:** 44 | 45 | ⟶ 46 | 公理 3 - 對於任何互斥的事件 E1,...,En,我們定義如下: 47 |
48 | 49 | 9. **Permutation ― A permutation is an arrangement of r objects from a pool of n objects, in a given order. The number of such arrangements is given by P(n,r), defined as:** 50 | 51 | ⟶ 52 | 排列 - 排列指的是從 n 個相異的物件中,取出 r 個物件按照固定順序重新安排,這樣安排的數量用 P(n,r) 來表示,定義為: 53 |
54 | 55 | 10. **Combination ― A combination is an arrangement of r objects from a pool of n objects, where the order does not matter. The number of such arrangements is given by C(n,r), defined as:** 56 | 57 | ⟶ 58 | 組合 - 組合指的是從 n 個物件中,取出 r 個物件,但不考慮他的順序。這樣組合要考慮的數量用 C(n,r) 來表示,定義為: 59 |
60 | 61 | 11. **Remark: we note that for 0⩽r⩽n, we have P(n,r)⩾C(n,r)** 62 | 63 | ⟶ 64 | 注意:對於 0⩽r⩽n,我們會有 P(n,r)⩾C(n,r) 65 |
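Items 9-11 in code, as a side illustration for readers of this translation: P(n,r)=n!/(n−r)! and C(n,r)=n!/(r!(n−r)!), with the remark P(n,r)⩾C(n,r) checkable directly:

```python
import math

def P(n, r):
    """Ordered arrangements of r objects out of n: n! / (n - r)!"""
    return math.factorial(n) // math.factorial(n - r)

def C(n, r):
    """Unordered selections of r objects out of n: n! / (r! (n - r)!)"""
    return P(n, r) // math.factorial(r)
```

For instance, P(5, 2) == 20 while C(5, 2) == 10, since each unordered pair corresponds to 2! orderings.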
66 | 67 | 12. **Conditional Probability** 68 | 69 | ⟶ 70 | 條件機率 71 |
72 | 73 | 13. **Bayes' rule ― For events A and B such that P(B)>0, we have:** 74 | 75 | ⟶ 76 | 貝氏定理 - 對於事件 A 和 B 滿足 P(B)>0 時,我們定義如下: 77 |
78 | 79 | 14. **Remark: we have P(A∩B)=P(A)P(B|A)=P(A|B)P(B)** 80 | 81 | ⟶ 82 | 注意:P(A∩B)=P(A)P(B|A)=P(A|B)P(B) 83 |
84 | 85 | 15. **Partition ― Let {Ai,i∈[[1,n]]} be such that for all i, Ai≠∅. We say that {Ai} is a partition if we have:** 86 | 87 | ⟶ 88 | 分割 - 令 {Ai,i∈[[1,n]]} 對所有的 i,Ai≠∅,我們說 {Ai} 是一個分割,當底下成立時: 89 |
90 | 91 | 16. **Remark: for any event B in the sample space, we have P(B)=n∑i=1P(B|Ai)P(Ai).** 92 | 93 | ⟶ 94 | 注意:對於任何在樣本空間的事件 B 來說,P(B)=n∑i=1P(B|Ai)P(Ai) 95 |
96 | 97 | 17. **Extended form of Bayes' rule ― Let {Ai,i∈[[1,n]]} be a partition of the sample space. We have:** 98 | 99 | ⟶ 100 | 貝氏定理的擴展 - 令 {Ai,i∈[[1,n]]} 為樣本空間的一個分割,我們定義: 101 |
102 | 103 | 18. **Independence ― Two events A and B are independent if and only if we have:** 104 | 105 | ⟶ 106 | 獨立 - 當以下條件滿足時,兩個事件 A 和 B 為獨立事件: 107 |
108 | 109 | 19. **Random Variables** 110 | 111 | ⟶ 112 | 隨機變數 113 |
114 | 115 | 20. **Definitions** 116 | 117 | ⟶ 118 | 定義 119 |
120 | 121 | 21. **Random variable ― A random variable, often noted X, is a function that maps every element in a sample space to a real line.** 122 | 123 | ⟶ 124 | 隨機變數 - 一個隨機變數 X,它是一個將樣本空間中的每個元素映射到實數域的函數 125 |
126 | 127 | 22. **Cumulative distribution function (CDF) ― The cumulative distribution function F, which is monotonically non-decreasing and is such that limx→−∞F(x)=0 and limx→+∞F(x)=1, is defined as:** 128 | 129 | ⟶ 130 | 累積分佈函數 (CDF) - 累積分佈函數 F 是單調非遞減的函數,其 limx→−∞F(x)=0 且 limx→+∞F(x)=1,定義如下: 131 | <br>
132 | 133 | 23. **Remark: we have P(a<X⩽b)=F(b)−F(a)** 134 | 135 | ⟶ 136 | 注意:P(a<X⩽b)=F(b)−F(a) 137 | <br>
138 | 139 | 24. **Probability density function (PDF) ― The probability density function f is the probability that X takes on values between two adjacent realizations of the random variable.** 140 | 141 | ⟶ 142 | 機率密度函數 - 機率密度函數 f 是隨機變數 X 在兩個相鄰的實數值附近取值的機率 143 | <br>
144 | 145 | 25. **Relationships involving the PDF and CDF ― Here are the important properties to know in the discrete (D) and the continuous (C) cases.** 146 | 147 | ⟶ 148 | 機率密度函數和累積分佈函數的關係 - 底下是一些關於離散 (D) 和連續 (C) 的情況下的重要屬性 149 |
150 | 151 | 26. **[Case, CDF F, PDF f, Properties of PDF]** 152 | 153 | ⟶ 154 | [情況, 累積分佈函數 F, 機率密度函數 f, 機率密度函數的屬性] 155 |
156 | 157 | 27. **Expectation and Moments of the Distribution ― Here are the expressions of the expected value E[X], generalized expected value E[g(X)], kth moment E[Xk] and characteristic function ψ(ω) for the discrete and continuous cases:** 158 | 159 | ⟶ 160 | 分佈的期望值和動差 - 底下是期望值 E[X]、一般期望值 E[g(X)]、第 k 個動差和特徵函數 ψ(ω) 在離散和連續的情況下的表示式: 161 |
162 | 163 | 28. **Variance ― The variance of a random variable, often noted Var(X) or σ2, is a measure of the spread of its distribution function. It is determined as follows:** 164 | 165 | ⟶ 166 | 變異數 - 隨機變數的變異數通常表示為 Var(X) 或 σ2,用來衡量一個分佈離散程度的指標。其表示如下: 167 |
168 | 169 | 29. **Standard deviation ― The standard deviation of a random variable, often noted σ, is a measure of the spread of its distribution function which is compatible with the units of the actual random variable. It is determined as follows:** 170 | 171 | ⟶ 172 | 標準差 - 一個隨機變數的標準差通常表示為 σ,用來衡量一個分佈離散程度的指標,其單位和實際的隨機變數相容,表示如下: 173 |
174 | 175 | 30. **Transformation of random variables ― Let the variables X and Y be linked by some function. By noting fX and fY the distribution function of X and Y respectively, we have:** 176 | 177 | ⟶ 178 | 隨機變數的轉換 - 令變數 X 和 Y 由某個函式連結在一起。我們定義 fX 和 fY 是 X 和 Y 的分佈函式,可以得到: 179 |
180 | 181 | 31. **Leibniz integral rule ― Let g be a function of x and potentially c, and a,b boundaries that may depend on c. We have:** 182 | 183 | ⟶ 184 | 萊布尼茲積分法則 - 令 g 為 x(可能還有 c)的函數,a 和 b 為可能依賴於 c 的邊界,我們得到: 185 | <br>
186 | 187 | 32. **Probability Distributions** 188 | 189 | ⟶ 190 | 機率分佈 191 |
192 | 193 | 33. **Chebyshev's inequality ― Let X be a random variable with expected value μ. For k,σ>0, we have the following inequality:** 194 | 195 | ⟶ 196 | 柴比雪夫不等式 - 令 X 是一隨機變數,期望值為 μ。對於 k, σ>0,我們有以下不等式: 197 |
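Item 33's bound P(|X−μ|⩾kσ)⩽1/k² can be checked numerically for a finite discrete distribution (a side sketch; the example distribution in the test is made up for illustration):

```python
def chebyshev_holds(values, probs, k):
    """Verify P(|X - mu| >= k*sigma) <= 1/k^2 for a finite discrete distribution."""
    mu = sum(v * p for v, p in zip(values, probs))
    var = sum((v - mu) ** 2 * p for v, p in zip(values, probs))
    sigma = var ** 0.5
    tail = sum(p for v, p in zip(values, probs) if abs(v - mu) >= k * sigma)
    return tail <= 1 / k ** 2 + 1e-12   # small slack for floating point
```

The bound holds for any distribution with finite variance, which is what makes it useful despite being loose.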
198 | 199 | 34. **Main distributions ― Here are the main distributions to have in mind:** 200 | 201 | ⟶ 202 | 主要的分佈 - 底下是我們需要熟悉的幾個主要的分佈: 203 | <br>
204 | 205 | 35. **[Type, Distribution]** 206 | 207 | ⟶ 208 | [種類, 分佈] 209 |
210 | 211 | 36. **Jointly Distributed Random Variables** 212 | 213 | ⟶ 214 | 聯合分佈隨機變數 215 |
216 | 217 | 37. **Marginal density and cumulative distribution ― From the joint density probability function fXY , we have** 218 | 219 | ⟶ 220 | 邊緣密度和累積分佈 - 從聯合密度機率函數 fXY 中我們可以得到: 221 |
222 | 223 | 38. **[Case, Marginal density, Cumulative function]** 224 | 225 | ⟶ 226 | [種類, 邊緣密度函數, 累積函數] 227 |
228 | 229 | 39. **Conditional density ― The conditional density of X with respect to Y, often noted fX|Y, is defined as follows:** 230 | 231 | ⟶ 232 | 條件密度 - X 對於 Y 的條件密度,通常用 fX|Y 表示如下: 233 |
234 | 235 | 40. **Independence ― Two random variables X and Y are said to be independent if we have:** 236 | 237 | ⟶ 238 | 獨立 - 當滿足以下條件時,我們稱隨機變數 X 和 Y 互相獨立: 239 |
240 | 241 | 41. **Covariance ― We define the covariance of two random variables X and Y, that we note σ2XY or more commonly Cov(X,Y), as follows:** 242 | 243 | ⟶ 244 | 共變異數 - 我們定義隨機變數 X 和 Y 的共變異數為 σ2XY 或 Cov(X,Y) 如下: 245 |
246 | 247 | 42. **Correlation ― By noting σX,σY the standard deviations of X and Y, we define the correlation between the random variables X and Y, noted ρXY, as follows:** 248 | 249 | ⟶ 250 | 相關性 - 我們定義 σX、σY 為 X 和 Y 的標準差,而 X 和 Y 的相關係數 ρXY 定義如下: 251 |
252 | 253 | 43. **Remark 1: we note that for any random variables X,Y, we have ρXY∈[−1,1].** 254 | 255 | ⟶ 256 | 注意一:對於任何隨機變數 X 和 Y 來說,ρXY∈[−1,1] 成立 257 |
258 | 259 | 44. **Remark 2: If X and Y are independent, then ρXY=0.** 260 | 261 | ⟶ 262 | 注意二:當 X 和 Y 獨立時,ρXY=0 263 |
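The covariance and correlation formulas above can be computed two ways as a cross-check (a numpy sketch with made-up data; since y = 2x, ρXY comes out exactly 1, consistent with Remark 1):

```python
import numpy as np

# Cov(X,Y) = E[XY] - E[X]E[Y] and rho_XY = Cov(X,Y) / (sigma_X * sigma_Y).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])   # y = 2x, perfectly linearly related

cov = (x * y).mean() - x.mean() * y.mean()
rho = cov / (x.std() * y.std())

# cross-check against numpy's population covariance
assert np.isclose(cov, np.cov(x, y, bias=True)[0, 1])
assert np.isclose(rho, 1.0)
print(cov, rho)
```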
264 | 265 | 45. **Parameter estimation** 266 | 267 | ⟶ 268 | 參數估計 269 |
270 | 271 | 46. **Definitions** 272 | 273 | ⟶ 274 | 定義 275 |
276 | 277 | 47. **Random sample ― A random sample is a collection of n random variables X1,...,Xn that are independent and identically distributed with X.** 278 | 279 | ⟶ 280 | 隨機樣本 - 隨機樣本指的是由 n 個與 X 獨立且同分佈的隨機變數 X1,...,Xn 所組成的集合 281 | <br>
282 | 283 | 48. **Estimator ― An estimator is a function of the data that is used to infer the value of an unknown parameter in a statistical model.** 284 | 285 | ⟶ 286 | 估計量 - 估計量是一個資料的函數,用來推斷在統計模型中未知參數的值 287 |
288 | 289 | 49. **Bias ― The bias of an estimator ^θ is defined as being the difference between the expected value of the distribution of ^θ and the true value, i.e.:** 290 | 291 | ⟶ 292 | 偏差 - 估計量 ^θ 的偏差定義為 ^θ 分佈的期望值和真實值之間的差距,即: 293 | <br>
294 | 295 | 50. **Remark: an estimator is said to be unbiased when we have E[^θ]=θ.** 296 | 297 | ⟶ 298 | 注意:當 E[^θ]=θ 時,我們稱為不偏估計量 299 |
300 | 301 | 51. **Estimating the mean** 302 | 303 | ⟶ 304 | 估計平均數 305 | <br>
306 | 307 | 52. **Sample mean ― The sample mean of a random sample is used to estimate the true mean μ of a distribution, is often noted ¯X and is defined as follows:** 308 | 309 | ⟶ 310 | 樣本平均 - 一個隨機樣本的樣本平均是用來預估一個分佈的真實平均 μ,通常我們用 ¯X 來表示,定義如下: 311 |
312 | 313 | 53. **Remark: the sample mean is unbiased, i.e E[¯X]=μ.** 314 | 315 | ⟶ 316 | 注意:樣本平均是不偏的,即 E[¯X]=μ 317 | <br>
318 | 319 | 54. **Central Limit Theorem ― Let us have a random sample X1,...,Xn following a given distribution with mean μ and variance σ2, then we have:** 320 | 321 | ⟶ 322 | 中央極限定理 - 當我們有一個隨機樣本 X1,...,Xn 滿足一個給定的分佈,其平均數為 μ,變異數為 σ2,我們有: 323 |
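The central limit theorem can be seen in simulation: the standardized sample mean √n(X̄−μ)/σ of non-normal samples behaves like N(0,1). A numpy sketch, again using Exp(1) as an arbitrary non-normal distribution:

```python
import numpy as np

# CLT sketch: sqrt(n) * (Xbar - mu) / sigma approaches a standard normal.
rng = np.random.default_rng(0)
n, trials = 200, 20_000
mu, sigma = 1.0, 1.0                 # Exp(1) has mean 1 and variance 1

samples = rng.exponential(1.0, size=(trials, n))
z = np.sqrt(n) * (samples.mean(axis=1) - mu) / sigma

# z should have ~zero mean and ~unit standard deviation
print(z.mean(), z.std())
assert abs(z.mean()) < 0.05 and abs(z.std() - 1) < 0.05
```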
324 | 325 | 55. **Estimating the variance** 326 | 327 | ⟶ 328 | 估計變異數 329 |
330 | 331 | 56. **Sample variance ― The sample variance of a random sample is used to estimate the true variance σ2 of a distribution, is often noted s2 or ^σ2 and is defined as follows:** 332 | 333 | ⟶ 334 | 樣本變異數 - 一個隨機樣本的樣本變異數是用來估計一個分佈的真實變異數 σ2,通常使用 s2 或 ^σ2 來表示,定義如下: 335 |
336 | 337 | 57. **Remark: the sample variance is unbiased, i.e E[s2]=σ2.** 338 | 339 | ⟶ 340 | 注意:樣本變異數是不偏的,即 E[s2]=σ2 341 | <br>
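The unbiasedness of s² shows up in simulation: averaging the 1/(n−1) estimator over many random samples approaches the true σ², while the 1/n version systematically underestimates it (a numpy sketch, σ²=4 chosen arbitrarily):

```python
import numpy as np

# Compare the unbiased (ddof=1) and biased (ddof=0) variance estimators.
rng = np.random.default_rng(0)
sigma2 = 4.0
samples = rng.normal(0.0, np.sqrt(sigma2), size=(50_000, 10))

s2 = samples.var(axis=1, ddof=1)      # 1/(n-1) estimator: E[s^2] = sigma^2
biased = samples.var(axis=1, ddof=0)  # 1/n estimator: E = (n-1)/n * sigma^2

print(s2.mean(), biased.mean())       # ~4.0 vs ~3.6
assert abs(s2.mean() - sigma2) < 0.1
assert biased.mean() < s2.mean()
```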
342 | 343 | 58. **Chi-Squared relation with sample variance ― Let s2 be the sample variance of a random sample. We have:** 344 | 345 | ⟶ 346 | 與樣本變異數的卡方關聯 - 令 s2 是一個隨機樣本的樣本變異數,我們可以得到: 347 |
348 | 349 | **59. [Introduction, Sample space, Event, Permutation]** 350 | 351 | ⟶ 352 | [介紹, 樣本空間, 事件, 排列] 353 |
354 | 355 | **60. [Conditional probability, Bayes' rule, Independence]** 356 | 357 | ⟶ 358 | [條件機率, 貝氏定理, 獨立性] 359 |
360 | 361 | **61. [Random variables, Definitions, Expectation, Variance]** 362 | 363 | ⟶ 364 | [隨機變數, 定義, 期望值, 變異數] 365 |
366 | 367 | **62. [Probability distributions, Chebyshev's inequality, Main distributions]** 368 | 369 | ⟶ 370 | [機率分佈, 柴比雪夫不等式, 主要分佈] 371 |
372 | 373 | **63. [Jointly distributed random variables, Density, Covariance, Correlation]** 374 | 375 | ⟶ 376 | [聯合分佈隨機變數, 密度, 共變異數, 相關] 377 |
378 | 379 | **64. [Parameter estimation, Mean, Variance]** 380 | 381 | ⟶ 382 | [參數估計, 平均數, 變異數] -------------------------------------------------------------------------------- /template/cs-230-deep-learning-tips-and-tricks.md: -------------------------------------------------------------------------------- 1 | **Deep Learning Tips and Tricks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-deep-learning-tips-and-tricks) 2 | 3 |
4 | 5 | **1. Deep Learning Tips and Tricks cheatsheet** 6 | 7 | ⟶ 8 | 9 |
10 | 11 | 12 | **2. CS 230 - Deep Learning** 13 | 14 | ⟶ 15 | 16 |
17 | 18 | 19 | **3. Tips and tricks** 20 | 21 | ⟶ 22 | 23 |
24 | 25 | 26 | **4. [Data processing, Data augmentation, Batch normalization]** 27 | 28 | ⟶ 29 | 30 |
31 | 32 | 33 | **5. [Training a neural network, Epoch, Mini-batch, Cross-entropy loss, Backpropagation, Gradient descent, Updating weights, Gradient checking]** 34 | 35 | ⟶ 36 | 37 |
38 | 39 | 40 | **6. [Parameter tuning, Xavier initialization, Transfer learning, Learning rate, Adaptive learning rates]** 41 | 42 | ⟶ 43 | 44 |
45 | 46 | 47 | **7. [Regularization, Dropout, Weight regularization, Early stopping]** 48 | 49 | ⟶ 50 | 51 |
52 | 53 | 54 | **8. [Good practices, Overfitting small batch, Gradient checking]** 55 | 56 | ⟶ 57 | 58 |
59 | 60 | 61 | **9. View PDF version on GitHub** 62 | 63 | ⟶ 64 | 65 |
66 | 67 | 68 | **10. Data processing** 69 | 70 | ⟶ 71 | 72 |
73 | 74 | 75 | **11. Data augmentation ― Deep learning models usually need a lot of data to be properly trained. It is often useful to get more data from the existing ones using data augmentation techniques. The main ones are summed up in the table below. More precisely, given the following input image, here are the techniques that we can apply:** 76 | 77 | ⟶ 78 | 79 |
80 | 81 | 82 | **12. [Original, Flip, Rotation, Random crop]** 83 | 84 | ⟶ 85 | 86 |
87 | 88 | 89 | **13. [Image without any modification, Flipped with respect to an axis for which the meaning of the image is preserved, Rotation with a slight angle, Simulates incorrect horizon calibration, Random focus on one part of the image, Several random crops can be done in a row]** 90 | 91 | ⟶ 92 | 93 |
94 | 95 | 96 | **14. [Color shift, Noise addition, Information loss, Contrast change]** 97 | 98 | ⟶ 99 | 100 |
101 | 102 | 103 | **15. [Nuances of RGB is slightly changed, Captures noise that can occur with light exposure, Addition of noise, More tolerance to quality variation of inputs, Parts of image ignored, Mimics potential loss of parts of image, Luminosity changes, Controls difference in exposition due to time of day]** 104 | 105 | ⟶ 106 | 107 |
108 | 109 | 110 | **16. Remark: data is usually augmented on the fly during training.** 111 | 112 | ⟶ 113 | 114 |
115 | 116 | 117 | **17. Batch normalization ― It is a step of hyperparameter γ,β that normalizes the batch {xi}. By noting μB,σ2B the mean and variance of that we want to correct to the batch, it is done as follows:** 118 | 119 | ⟶ 120 | 121 |
122 | 123 | 124 | **18. It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.** 125 | 126 | ⟶ 127 | 128 |
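The batch normalization step from items 17-18 can be sketched in numpy as follows (γ and β are the learnable parameters; the input statistics here are made up):

```python
import numpy as np

# Batch norm sketch: normalize each feature with its batch mean/variance,
# then scale and shift with learnable gamma, beta.
def batch_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                  # mu_B, per-feature batch mean
    var = x.var(axis=0)                  # sigma^2_B, per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.default_rng(0).normal(5.0, 3.0, size=(32, 4))
out = batch_norm(x, gamma=1.0, beta=0.0)

# after normalization each feature has ~zero mean and ~unit variance
assert np.allclose(out.mean(axis=0), 0.0, atol=1e-6)
assert np.allclose(out.std(axis=0), 1.0, atol=1e-3)
```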
129 | 130 | 131 | **19. Training a neural network** 132 | 133 | ⟶ 134 | 135 |
136 | 137 | 138 | **20. Definitions** 139 | 140 | ⟶ 141 | 142 |
143 | 144 | 145 | **21. Epoch ― In the context of training a model, epoch is a term used to refer to one iteration where the model sees the whole training set to update its weights.** 146 | 147 | ⟶ 148 | 149 |
150 | 151 | 152 | **22. Mini-batch gradient descent ― During the training phase, updating weights is usually not based on the whole training set at once due to computation complexities or one data point due to noise issues. Instead, the update step is done on mini-batches, where the number of data points in a batch is a hyperparameter that we can tune.** 153 | 154 | ⟶ 155 | 156 |
157 | 158 | 159 | **23. Loss function ― In order to quantify how a given model performs, the loss function L is usually used to evaluate to what extent the actual outputs y are correctly predicted by the model outputs z.** 160 | 161 | ⟶ 162 | 163 |
164 | 165 | 166 | **24. Cross-entropy loss ― In the context of binary classification in neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:** 167 | 168 | ⟶ 169 | 170 |
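The binary cross-entropy loss above, L(z,y) = −[y log(z) + (1−y) log(1−z)], as a minimal numpy sketch (the clipping constant is an implementation detail to avoid log(0)):

```python
import numpy as np

# Binary cross-entropy: z is the predicted probability, y the true label.
def cross_entropy(z, y, eps=1e-12):
    z = np.clip(z, eps, 1 - eps)        # avoid log(0)
    return -(y * np.log(z) + (1 - y) * np.log(1 - z))

print(cross_entropy(0.9, 1))            # small loss: confident and correct
print(cross_entropy(0.1, 1))            # large loss: confident and wrong
assert cross_entropy(0.9, 1) < cross_entropy(0.1, 1)
```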
171 | 172 | 173 | **25. Finding optimal weights** 174 | 175 | ⟶ 176 | 177 |
178 | 179 | 180 | **26. Backpropagation ― Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to each weight w is computed using the chain rule.** 181 | 182 | ⟶ 183 | 184 |
185 | 186 | 187 | **27. Using this method, each weight is updated with the rule:** 188 | 189 | ⟶ 190 | 191 |
192 | 193 | 194 | **28. Updating weights ― In a neural network, weights are updated as follows:** 195 | 196 | ⟶ 197 | 198 |
199 | 200 | 201 | **29. [Step 1: Take a batch of training data and perform forward propagation to compute the loss, Step 2: Backpropagate the loss to get the gradient of the loss with respect to each weight, Step 3: Use the gradients to update the weights of the network.]** 202 | 203 | ⟶ 204 | 205 |
206 | 207 | 208 | **30. [Forward propagation, Backpropagation, Weights update]** 209 | 210 | ⟶ 211 | 212 |
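The three steps above (forward propagation, backpropagation, weight update) can be sketched on a one-parameter linear model with squared loss — an illustrative toy, not the cheatsheet's own notation:

```python
import numpy as np

# Gradient-descent training loop on z = w * x with squared loss.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x                          # targets generated with w_true = 3

w, alpha = 0.0, 0.1                  # initial weight and learning rate
for _ in range(100):
    z = w * x                        # Step 1: forward propagation
    grad = np.mean(2 * (z - y) * x)  # Step 2: backpropagate d(loss)/dw
    w -= alpha * grad                # Step 3: update the weight

print(w)                             # converges towards 3.0
assert abs(w - 3.0) < 1e-3
```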
213 | 214 | 215 | **31. Parameter tuning** 216 | 217 | ⟶ 218 | 219 |
220 | 221 | 222 | **32. Weights initialization** 223 | 224 | ⟶ 225 | 226 |
227 | 228 | 229 | **33. Xavier initialization ― Instead of initializing the weights in a purely random manner, Xavier initialization enables to have initial weights that take into account characteristics that are unique to the architecture.** 230 | 231 | ⟶ 232 | 233 |
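One common form of Xavier (Glorot) initialization is the uniform variant with limit √(6/(fan_in+fan_out)), which gives weight variance 2/(fan_in+fan_out); a sketch:

```python
import numpy as np

# Xavier uniform init: spread scaled by the layer's fan-in and fan-out.
def xavier_uniform(fan_in, fan_out, rng):
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
w = xavier_uniform(256, 128, rng)

# empirical variance should be close to 2 / (fan_in + fan_out)
print(w.var(), 2.0 / (256 + 128))
assert abs(w.var() - 2.0 / 384) < 1e-3
```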
234 | 235 | 236 | **34. Transfer learning ― Training a deep learning model requires a lot of data and more importantly a lot of time. It is often useful to take advantage of pre-trained weights on huge datasets that took days/weeks to train, and leverage it towards our use case. Depending on how much data we have at hand, here are the different ways to leverage this:** 237 | 238 | ⟶ 239 | 240 |
241 | 242 | 243 | **35. [Training size, Illustration, Explanation]** 244 | 245 | ⟶ 246 | 247 |
248 | 249 | 250 | **36. [Small, Medium, Large]** 251 | 252 | ⟶ 253 | 254 |
255 | 256 | 257 | **37. [Freezes all layers, trains weights on softmax, Freezes most layers, trains weights on last layers and softmax, Trains weights on layers and softmax by initializing weights on pre-trained ones]** 258 | 259 | ⟶ 260 | 261 |
262 | 263 | 264 | **38. Optimizing convergence** 265 | 266 | ⟶ 267 | 268 |
269 | 270 | 271 | **39. Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. It can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.** 272 | 273 | ⟶ 274 | 275 |
276 | 277 | 278 | **40. Adaptive learning rates ― Letting the learning rate vary when training a model can reduce the training time and improve the numerical optimal solution. While Adam optimizer is the most commonly used technique, others can also be useful. They are summed up in the table below:** 279 | 280 | ⟶ 281 | 282 |
283 | 284 | 285 | **41. [Method, Explanation, Update of w, Update of b]** 286 | 287 | ⟶ 288 | 289 |
290 | 291 | 292 | **42. [Momentum, Dampens oscillations, Improvement to SGD, 2 parameters to tune]** 293 | 294 | ⟶ 295 | 296 |
297 | 298 | 299 | **43. [RMSprop, Root Mean Square propagation, Speeds up learning algorithm by controlling oscillations]** 300 | 301 | ⟶ 302 | 303 |
304 | 305 | 306 | **44. [Adam, Adaptive Moment estimation, Most popular method, 4 parameters to tune]** 307 | 308 | ⟶ 309 | 310 |
311 | 312 | 313 | **45. Remark: other methods include Adadelta, Adagrad and SGD.** 314 | 315 | ⟶ 316 | 317 |
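As an illustration of the momentum method from the table above, a sketch minimizing f(w) = w² (the α and β values are typical defaults, chosen here purely for illustration):

```python
# Momentum update: v accumulates a moving average of gradients,
# which dampens oscillations compared to plain SGD.
def momentum_step(w, v, grad, alpha=0.01, beta=0.9):
    v = beta * v + (1 - beta) * grad    # exponential moving average of gradients
    w = w - alpha * v                   # parameter update
    return w, v

# minimize f(w) = w^2 (gradient 2w) starting from w = 5
w, v = 5.0, 0.0
for _ in range(2000):
    w, v = momentum_step(w, v, grad=2 * w)

print(w)                                # converges towards the minimum at 0
assert abs(w) < 1e-3
```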
318 | 319 | 320 | **46. Regularization** 321 | 322 | ⟶ 323 | 324 |
325 | 326 | 327 | **47. Dropout ― Dropout is a technique used in neural networks to prevent overfitting the training data by dropping out neurons with probability p>0. It forces the model to avoid relying too much on particular sets of features.** 328 | 329 | ⟶ 330 | 331 |
332 | 333 | 334 | **48. Remark: most deep learning frameworks parametrize dropout through the 'keep' parameter 1−p.** 335 | 336 | ⟶ 337 | 338 |
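A sketch of inverted dropout, which folds the rescaling by the keep probability 1−p into training so that expected activations are unchanged:

```python
import numpy as np

# Inverted dropout: drop units with probability p, rescale survivors by 1/(1-p).
def dropout(a, p, rng):
    keep = 1.0 - p
    mask = rng.random(a.shape) < keep    # keep each unit with probability 1-p
    return a * mask / keep               # rescale so E[output] == a

rng = np.random.default_rng(0)
a = np.ones(100_000)
out = dropout(a, p=0.3, rng=rng)

print(out.mean())                        # close to 1.0 on average
assert abs(out.mean() - 1.0) < 0.01
```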
339 | 340 | 341 | **49. Weight regularization ― In order to make sure that the weights are not too large and that the model is not overfitting the training set, regularization techniques are usually performed on the model weights. The main ones are summed up in the table below:** 342 | 343 | ⟶ 344 | 345 |
346 | 347 | 348 | **50. [LASSO, Ridge, Elastic Net, Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]** 349 | 350 | ⟶ 351 | 352 |
353 | 354 | **51. Early stopping ― This regularization technique stops the training process as soon as the validation loss reaches a plateau or starts to increase.** 355 | 356 | ⟶ 357 | 358 |
359 | 360 | 361 | **52. [Error, Validation, Training, early stopping, Epochs]** 362 | 363 | ⟶ 364 | 365 |
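Early stopping can be sketched as a patience rule on the validation loss — stop once the loss has not improved for a fixed number of epochs (the loss values and patience here are illustrative):

```python
# Return the epoch at which training would stop: validation loss has not
# improved on its best value for `patience` consecutive epochs.
def early_stopping_epoch(val_losses, patience=2):
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch                 # plateau / increase detected: stop here
    return len(val_losses) - 1

losses = [1.0, 0.6, 0.4, 0.35, 0.37, 0.39, 0.41]
print(early_stopping_epoch(losses))      # -> 5, two epochs past the best epoch 3
```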
366 | 367 | 368 | **53. Good practices** 369 | 370 | ⟶ 371 | 372 |
373 | 374 | 375 | **54. Overfitting small batch ― When debugging a model, it is often useful to make quick tests to see if there is any major issue with the architecture of the model itself. In particular, in order to make sure that the model can be properly trained, a mini-batch is passed inside the network to see if it can overfit on it. If it cannot, it means that the model is either too complex or not complex enough to even overfit on a small batch, let alone a normal-sized training set.** 376 | 377 | ⟶ 378 | 379 |
380 | 381 | 382 | **55. Gradient checking ― Gradient checking is a method used during the implementation of the backward pass of a neural network. It compares the value of the analytical gradient to the numerical gradient at given points and plays the role of a sanity-check for correctness.** 383 | 384 | ⟶ 385 | 386 |
387 | 388 | 389 | **56. [Type, Numerical gradient, Analytical gradient]** 390 | 391 | ⟶ 392 | 393 |
394 | 395 | 396 | **57. [Formula, Comments]** 397 | 398 | ⟶ 399 | 400 |
401 | 402 | 403 | **58. [Expensive; loss has to be computed two times per dimension, Used to verify correctness of analytical implementation, Trade-off in choosing h not too small (numerical instability) nor too large (poor gradient approximation)]** 404 | 405 | ⟶ 406 | 407 |
408 | 409 | 410 | **59. ['Exact' result, Direct computation, Used in the final implementation]** 411 | 412 | ⟶ 413 | 414 |
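A minimal sketch of gradient checking on f(w) = w², comparing the centered numerical estimate with the analytical derivative (note the two loss evaluations per dimension mentioned in the table):

```python
# Compare the analytical gradient with the numerical estimate
# (f(w+h) - f(w-h)) / (2h) as a sanity check on the backward pass.
def f(w):
    return w ** 2

def analytical_grad(w):
    return 2 * w

w, h = 3.0, 1e-5
numerical = (f(w + h) - f(w - h)) / (2 * h)   # two loss evaluations per dimension
analytical = analytical_grad(w)

print(numerical, analytical)
assert abs(numerical - analytical) < 1e-6     # they should agree closely
```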
415 | 416 | 417 | **60. The Deep Learning cheatsheets are now available in [target language].** 418 | 419 | ⟶ 420 | 421 | 422 | **61. Original authors** 423 | 424 | ⟶ 425 | 426 |
427 | 428 | **62. Translated by X, Y and Z** 429 | 430 | ⟶ 431 | 432 | <br>
433 | 434 | **63. Reviewed by X, Y and Z** 435 | 436 | ⟶ 437 | 438 | <br>
439 | 440 | **64. View PDF version on GitHub** 441 | 442 | ⟶ 443 | 444 | <br>
445 | 446 | **65. By X and Y** 447 | 448 | ⟶ 449 | 450 | <br>
451 | -------------------------------------------------------------------------------- /vi/cs-229-machine-learning-tips-and-tricks.md: -------------------------------------------------------------------------------- 1 | **1. Machine Learning tips and tricks cheatsheet** 2 | 3 | ⟶ Các mẹo và thủ thuật trong Machine Learning (Học máy) 4 | 5 |
6 | 7 | **2. Classification metrics** 8 | 9 | ⟶ Độ đo phân loại 10 | 11 |
12 | 13 | **3. In a context of a binary classification, here are the main metrics that are important to track in order to assess the performance of the model.** 14 | 15 | ⟶ Trong ngữ cảnh phân loại nhị phân (binary classification), dưới đây là các độ đo chính cần theo dõi để đánh giá hiệu năng của mô hình. 16 | 17 | <br>
18 | 19 | **4. Confusion matrix ― The confusion matrix is used to have a more complete picture when assessing the performance of a model. It is defined as follows:** 20 | 21 | ⟶ Ma trận nhầm lẫn (Confusion matrix) - Confusion matrix được sử dụng để có kết quả hoàn chỉnh hơn khi đánh giá hiệu năng của model. Nó được định nghĩa như sau: 22 | 23 |
24 | 25 | **5. [Predicted class, Actual class]** 26 | 27 | ⟶ [Lớp dự đoán, lớp thực sự] 28 | 29 |
30 | 31 | **6. Main metrics ― The following metrics are commonly used to assess the performance of classification models:** 32 | 33 | ⟶ Độ đo chính - Các độ đo sau thường được sử dụng để đánh giá hiệu năng của mô hình phân loại: 34 | 35 |
36 | 37 | **7. [Metric, Formula, Interpretation]** 38 | 39 | ⟶ [Độ đo, Công thức, Diễn giải] 40 | 41 |
42 | 43 | **8. Overall performance of model** 44 | 45 | ⟶ Hiệu năng tổng thể của mô hình 46 | 47 |
48 | 49 | **9. How accurate the positive predictions are** 50 | 51 | ⟶ Độ chính xác của các dự đoán positive 52 | 53 |
54 | 55 | **10. Coverage of actual positive sample** 56 | 57 | ⟶ Bao phủ các mẫu thử chính xác (positive) thực sự 58 | 59 |
60 | 61 | **11. Coverage of actual negative sample** 62 | 63 | ⟶ Bao phủ các mẫu thử sai (negative) thực sự 64 | 65 |
66 | 67 | **12. Hybrid metric useful for unbalanced classes** 68 | 69 | ⟶ Độ đo Hybrid hữu ích cho các lớp không cân bằng (unbalanced classes) 70 | 71 |
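The main metrics above can be computed directly from confusion-matrix counts; a sketch with made-up TP/TN/FP/FN values:

```python
# Classification metrics from confusion-matrix counts (illustrative values).
TP, TN, FP, FN = 40, 45, 5, 10

accuracy = (TP + TN) / (TP + TN + FP + FN)          # overall performance
precision = TP / (TP + FP)                          # how accurate positive predictions are
recall = TP / (TP + FN)                             # coverage of actual positive samples
specificity = TN / (TN + FP)                        # coverage of actual negative samples
f1 = 2 * precision * recall / (precision + recall)  # hybrid metric for unbalanced classes

print(accuracy, precision, recall, specificity, f1)
```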
72 | 73 | **13. ROC ― The receiver operating curve, also noted ROC, is the plot of TPR versus FPR by varying the threshold. These metrics are summed up in the table below:** 74 | 75 | ⟶ ROC - Đường cong thao tác nhận, được kí hiệu là ROC, là minh hoạ của TPR với FPR bằng việc thay đổi ngưỡng (threshold). Các độ đo này được tổng kết ở bảng bên dưới: 76 | 77 | <br>
78 | 79 | **14. [Metric, Formula, Equivalent]** 80 | 81 | ⟶ [Độ đo, Công thức, Tương đương] 82 | 83 |
84 | 85 | **15. AUC ― The area under the receiving operating curve, also noted AUC or AUROC, is the area below the ROC as shown in the following figure:** 86 | 87 | ⟶ AUC - Khu vực phía dưới đường cong thao tác nhận, còn được gọi tắt là AUC hoặc AUROC, là khu vực phía dưới ROC như hình minh hoạ phía dưới: 88 | 89 |
90 | 91 | **16. [Actual, Predicted]** 92 | 93 | ⟶ [Thực sự, Dự đoán] 94 | 95 |
96 | 97 | **17. Basic metrics ― Given a regression model f, the following metrics are commonly used to assess the performance of the model:** 98 | 99 | ⟶ Độ đo cơ bản - Cho trước mô hình hồi quy f, độ đo sau được sử dụng phổ biến để đánh giá hiệu năng của mô hình: 100 | 101 |
102 | 103 | **18. [Total sum of squares, Explained sum of squares, Residual sum of squares]** 104 | 105 | ⟶ [Tổng bình phương toàn phần, Tổng bình phương được giải thích, Tổng bình phương dư] 106 | 107 | <br>
108 | 109 | **19. Coefficient of determination ― The coefficient of determination, often noted R2 or r2, provides a measure of how well the observed outcomes are replicated by the model and is defined as follows:** 110 | 111 | ⟶ Hệ số quyết định - Hệ số quyết định, thường được kí hiệu là R2 hoặc r2, cung cấp độ đo mức độ tốt của kết quả quan sát đầu ra (được nhân rộng bởi mô hình), và được định nghĩa như sau: 112 | 113 |
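A numpy sketch of the coefficient of determination, combining the sums of squares defined above (the data and predictions are made up):

```python
import numpy as np

# R^2 = 1 - SS_res / SS_tot for a regression model's predictions.
y = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.2, 8.9])

ss_res = np.sum((y - y_pred) ** 2)    # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot

print(r2)                             # close to 1: outcomes well replicated
assert 0.99 < r2 <= 1.0
```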
114 | 115 | **20. Main metrics ― The following metrics are commonly used to assess the performance of regression models, by taking into account the number of variables n that they take into consideration:** 116 | 117 | ⟶ Độ đo chính - Độ đo sau đây thường được sử dụng để đánh giá hiệu năng của mô hình hồi quy, bằng cách tính số lượng các biến n mà độ đo đó sẽ cân nhắc: 118 | 119 |
120 | 121 | **21. where L is the likelihood and ˆσ2 is an estimate of the variance associated with each response.** 122 | 123 | ⟶ trong đó L là hàm hợp lý (likelihood) và ˆσ2 là giá trị ước tính của phương sai tương ứng với mỗi hồi đáp (response). 124 | 125 | <br>
126 | 127 | **22. Model selection** 128 | 129 | ⟶ Lựa chọn model (mô hình) 130 | 131 |
132 | 133 | **23. Vocabulary ― When selecting a model, we distinguish 3 different parts of the data that we have as follows:** 134 | 135 | ⟶ Vocabulary - Khi lựa chọn mô hình, chúng ta chia tập dữ liệu thành 3 tập con như sau: 136 | 137 |
138 | 139 | **24. [Training set, Validation set, Testing set]** 140 | 141 | ⟶ [Tập huấn luyện, Tập xác thực, Tập kiểm tra (testing)] 142 | 143 |
144 | 145 | **25. [Model is trained, Model is assessed, Model gives predictions]** 146 | 147 | ⟶ [Mô hình được huấn luyện, mô hình được xác thực, mô hình đưa ra dự đoán] 148 | 149 |
150 | 151 | **26. [Usually 80% of the dataset, Usually 20% of the dataset]** 152 | 153 | ⟶ [Thường là 80% tập dữ liệu, Thường là 20% tập dữ liệu] 154 | 155 |
156 | 157 | **27. [Also called hold-out or development set, Unseen data]** 158 | 159 | ⟶ [Cũng được gọi là hold-out hoặc development set (tập phát triển), Dữ liệu chưa được biết] 160 | 161 |
162 | 163 | **28. Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set. These are represented in the figure below:** 164 | 165 | ⟶ Khi mô hình đã được chọn, nó sẽ được huấn luyện trên toàn bộ tập dữ liệu và được kiểm tra trên tập test chưa từng thấy. Tất cả được minh hoạ ở hình bên dưới: 166 | 167 | <br>
168 | 169 | **29. Cross-validation ― Cross-validation, also noted CV, is a method that is used to select a model that does not rely too much on the initial training set. The different types are summed up in the table below:** 170 | 171 | ⟶ Cross-validation - Cross-validation, còn được gọi là CV, một phương thức được sử dụng để chọn ra một mô hình không dựa quá nhiều vào tập dữ liệu huấn luyện ban đầu. Các loại khác nhau được tổng kết ở bảng bên dưới: 172 | 173 |
174 | 175 | **30. [Training on k−1 folds and assessment on the remaining one, Training on n−p observations and assessment on the p remaining ones]** 176 | 177 | ⟶ [Huấn luyện trên k-1 phần và đánh giá trên 1 phần còn lại, Huấn luyện trên n-p phần và đánh giá trên p phần còn lại] 178 | 179 |
180 | 181 | **31. [Generally k=5 or 10, Case p=1 is called leave-one-out]** 182 | 183 | ⟶ [Thường thì k=5 hoặc 10, Trường hợp p=1 được gọi là leave-one-out] 184 | 185 |
186 | 187 | **32. The most commonly used method is called k-fold cross-validation and splits the training data into k folds to validate the model on one fold while training the model on the k−1 other folds, all of this k times. The error is then averaged over the k folds and is named cross-validation error.** 188 | 189 | ⟶ Phương thức hay được sử dụng được gọi là k-fold cross-validation và chia dữ liệu huấn luyện thành k phần, đánh giá mô hình trên 1 phần trong khi huấn luyện mô hình trên k-1 phần còn lại, tất cả k lần. Lỗi sau đó được tính trung bình trên k phần và được đặt tên là cross-validation error. 190 | 191 |
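The k-fold procedure above can be sketched as follows (the model-fitting step is left as a labeled placeholder, since any model can be plugged in):

```python
import numpy as np

# k-fold cross-validation: validate on one fold while training on the
# other k-1 folds, then average the k validation errors.
def kfold_indices(n, k, rng):
    idx = rng.permutation(n)
    return np.array_split(idx, k)

rng = np.random.default_rng(0)
folds = kfold_indices(n=100, k=5, rng=rng)

errors = []
for i, val_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # ... fit the model on train_idx, evaluate on val_idx ...
    errors.append(0.0)               # placeholder for this fold's validation error

cv_error = np.mean(errors)           # the cross-validation error
assert len(folds) == 5 and sum(len(f) for f in folds) == 100
```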
192 | 193 | **33. Regularization ― The regularization procedure aims at avoiding the model to overfit the data and thus deals with high variance issues. The following table sums up the different types of commonly used regularization techniques:** 194 | 195 | ⟶ Chuẩn hoá - Mục đích của thủ tục chuẩn hoá là tránh cho mô hình bị overfit với dữ liệu, do đó gặp phải vấn đề phương sai lớn. Bảng sau đây sẽ tổng kết các loại kĩ thuật chuẩn hoá khác nhau hay được sử dụng: 196 | 197 |
198 | 199 | **34. [Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]** 200 | 201 | ⟶ [Giảm hệ số xuống còn 0, Tốt cho việc lựa chọn biến, Làm cho hệ số nhỏ hơn, Thay đổi giữa chọn biến và hệ số nhỏ hơn] 202 | 203 |
204 | 205 | **35. Diagnostics** 206 | 207 | ⟶ Chẩn đoán (Diagnostics) 208 | 209 | <br>
210 | 211 | **36. Bias ― The bias of a model is the difference between the expected prediction and the correct model that we try to predict for given data points.** 212 | 213 | ⟶ Bias - Bias của mô hình là sai số giữa dự đoán mong đợi và dự đoán của mô hình trên các điểm dữ liệu cho trước. 214 | 215 |
216 | 217 | **37. Variance ― The variance of a model is the variability of the model prediction for given data points.** 218 | 219 | ⟶ Phương sai - Phương sai của một mô hình là sự thay đổi dự đoán của mô hình trên các điểm dữ liệu cho trước. 220 | 221 |
222 | 223 | **38. Bias/variance tradeoff ― The simpler the model, the higher the bias, and the more complex the model, the higher the variance.** 224 | 225 | ⟶ Đánh đổi bias/phương sai - Mô hình càng đơn giản thì bias càng cao, mô hình càng phức tạp thì phương sai càng cao. 226 | 227 | <br>
228 | 229 | **39. [Symptoms, Regression illustration, classification illustration, deep learning illustration, possible remedies]** 230 | 231 | ⟶ [Triệu chứng, Minh hoạ hồi quy, Minh hoạ phân loại, Minh hoạ deep learning (học sâu), Biện pháp khắc phục có thể dùng] 232 | 233 | <br>
234 | 235 | **40. [High training error, Training error close to test error, High bias, Training error slightly lower than test error, Very low training error, Training error much lower than test error, High variance]** 236 | 237 | ⟶ [Lỗi huấn luyện cao, Lỗi huấn luyện tiến gần tới lỗi test, Bias cao, Lỗi huấn luyện thấp hơn một chút so với lỗi test, Lỗi huấn luyện rất thấp, Lỗi huấn luyện thấp hơn lỗi test rất nhiều, Phương sai cao] 238 | 239 |
240 | 241 | **41. [Complexify model, Add more features, Train longer, Perform regularization, Get more data]** 242 | 243 | ⟶ [Làm mô hình phức tạp hơn, Thêm nhiều đặc trưng, Huấn luyện lâu hơn, Thực hiện chuẩn hoá, Thu thập thêm dữ liệu] 244 | 245 | <br>
246 | 247 | **42. Error analysis ― Error analysis is analyzing the root cause of the difference in performance between the current and the perfect models.** 248 | 249 | ⟶ Phân tích lỗi - Phân tích lỗi là phân tích nguyên nhân của sự khác biệt trong hiệu năng giữa mô hình hiện tại và mô hình lí tưởng. 250 | 251 |
252 | 253 | **43. Ablative analysis ― Ablative analysis is analyzing the root cause of the difference in performance between the current and the baseline models.** 254 | 255 | ⟶ Phân tích Ablative - Phân tích Ablative là phân tích nguyên nhân của sự khác biệt giữa hiệu năng của mô hình hiện tại và mô hình cơ sở. 256 | 257 |
258 | 259 | **44. Regression metrics** 260 | 261 | ⟶ Độ đo hồi quy 262 | 263 |
264 | 265 | **45. [Classification metrics, confusion matrix, accuracy, precision, recall, F1 score, ROC]** 266 | 267 | ⟶ [Độ đo phân loại, ma trận nhầm lẫn, độ chính xác (accuracy), precision, recall, điểm F1, ROC] 268 | 269 | <br>
270 | 271 | **46. [Regression metrics, R squared, Mallow's CP, AIC, BIC]** 272 | 273 | ⟶ [Độ đo hồi quy, Bình phương R, CP của Mallow, AIC, BIC] 274 | 275 |
276 | 277 | **47. [Model selection, cross-validation, regularization]** 278 | 279 | ⟶ [Lựa chọn mô hình, cross-validation, Chuẩn hoá (regularization)] 280 | 281 |
282 | 283 | **48. [Diagnostics, Bias/variance tradeoff, error/ablative analysis]** 284 | 285 | ⟶ [Chẩn đoán, Đánh đổi bias/phương sai, Phân tích lỗi/ablative] 286 |