├── .gitignore
├── LICENSE
├── README.md
├── lessons
│   ├── ml_p1_lesson_00_intro.md
│   ├── ml_p1_lesson_01_decision_trees.md
│   ├── ml_p1_lesson_02_regression_intro.md
│   ├── ml_p1_lesson_03_neural_networks.md
│   ├── ml_p1_lesson_04_instance_based_learning.md
│   ├── ml_p1_lesson_05_ensemble_learning.md
│   ├── ml_p1_lesson_06_support_vector_machines.md
│   ├── ml_p1_lesson_07_computational_learning_theory.md
│   ├── ml_p1_lesson_08_vc_dimension.md
│   └── ml_p1_lesson_09_bayesian_learning.md
└── textbook
    ├── notes_ch01.md
    └── notes_ch03.md

/.gitignore:
--------------------------------------------------------------------------------
*.acn
*.acr
*.alg
*.aux
*.bbl
*.blg
*.dvi
*.fdb_latexmk
*.glg
*.glo
*.gls
*.idx
*.ilg
*.ind
*.ist
*.lof
*.log
*.lot
*.maf
*.mtc
*.mtc0
*.nav
*.nlo
*.out
*.pdf
*.pdfsync
*.ps
*.snm
*.synctex.gz
*.toc
*.vrb
*.xdy
*.tdo

# Temporary files used by editors.
*~
*.swp
.#*
\#*
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
This is free and unencumbered software released into the public domain.

Anyone is free to copy, modify, publish, use, compile, sell, or
distribute this software, either in source code form or as a compiled
binary, for any purpose, commercial or non-commercial, and by any
means.

In jurisdictions that recognize copyright laws, the author or authors
of this software dedicate any and all copyright interest in the
software to the public domain. We make this dedication for the benefit
of the public at large and to the detriment of our heirs and
successors. We intend this dedication to be an overt act of
relinquishment in perpetuity of all present and future rights to this
software under copyright law.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR
OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.

For more information, please refer to <https://unlicense.org>
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
cs7641
======

Notes from Georgia Tech's CS7641 and Tom Mitchell's "Machine Learning."
--------------------------------------------------------------------------------
/lessons/ml_p1_lesson_00_intro.md:
--------------------------------------------------------------------------------
ML is teh r0x
=============


Definition of ML
----------------

Machine learning theory is about how well data can be statistically
modeled. Machine learning in practice is about building machines which
can learn from data.

There are three types of ML: supervised learning, unsupervised learning,
and reinforcement learning.

Supervised learning
-------------------

Supervised learning gleans "information" from labeled data to label new
unlabeled data.
This is simply discrete function approximation.
Supervised learning/function approximation must make assumptions about
the given data in order to *generalize* to new data.

[Note on] Induction and deduction
---------------------------------

*Induction* takes specific examples to create a general rule. *Inductive
bias* is necessary to come up with "useful" generalizations. *Deduction*
takes a general rule and applies it in specific cases.

Unsupervised learning
---------------------

Given inputs with no labels, derive some structure using the
relationships between the inputs. Often a summarization of data into
clusters. In practice, unsupervised learning is useful even in
supervised contexts to gain insight into the data.

Reinforcement learning
----------------------

Learning from delayed supervision. For example: playing a [turn-based]
game [with potentially non-deterministic environmental factors] where
you find out whether you have won or lost only near the end. Somewhat
difficult.

Comparison of these parts of ML
-------------------------------

All learning is "optimization." SL wants to label data well. RL wants to
score well. UL has scientist-imposed criteria for correctness.
--------------------------------------------------------------------------------
/lessons/ml_p1_lesson_01_decision_trees.md:
--------------------------------------------------------------------------------
Lesson 1: Decision Trees
========================

Classification and regression
-----------------------------

In supervised learning, you are presented with instances (e.g. images of individuals) with labels (e.g. "BOY" or "GIRL") as training data. The task is to label new unlabeled instances (assuming each instance carries sufficient information).

* **Classification** -- labels are discrete values (often finite, often true or false).
    * Better definition [Shashir]: labels (the codomain of the target function) have no meaningful order.
* **Regression** -- labels are reals.
    * Better definition [Shashir]: labels have a meaningful order.


Classification learning
-----------------------

* **Instances** -- inputs (vectors of features).
* **Concept** -- a function which maps instances to labels (there are many concepts: |labels|^|instance space|).
    * [Shashir] Concepts are the set of functions mapping from the instance space to the label space.
* **Target concept** -- the function which maps instances to the **correct** labels.
    * [Shashir] The target concept is a specific concept which we wish to model.
* **Hypothesis** -- the set of concepts which we are willing to search through for the best approximation of the target concept.
    * [Shashir] A subset of the concept set. An easier space to search through, but it introduces *inductive bias*.
* **Sample (training set)** -- set of inputs with correct labels.
* **Candidate** -- the "best" concept chosen from the hypothesis set by the learning algorithm using the sample.
    * [Shashir] The element of the hypothesis set which best approximates the target concept according to our learning algorithm.
* **Testing set** -- set of instances with correct labels, similar to the training set, but used to measure how well the candidate performs on novel data.

The testing set should contain many examples not found in the training set.
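To make this vocabulary concrete, here is a minimal Python sketch (the bit-vector data, the majority target concept, and the single-bit hypothesis set are all hypothetical, chosen purely for illustration): a candidate is picked from a small hypothesis set using the sample, and a disjoint testing set measures how well it approximates the target concept on novel instances.

```python
import random

random.seed(0)

# Hypothetical instance space: 8-bit feature vectors.
# Hypothetical target concept: "the majority of bits are 1."
def target_concept(x):
    return sum(x) > len(x) // 2

instances = [tuple(random.randint(0, 1) for _ in range(8)) for _ in range(200)]
labeled = [(x, target_concept(x)) for x in instances]

# The sample (training set) and testing set are kept disjoint, so the
# testing set contains instances the learner never saw.
sample, testing_set = labeled[:150], labeled[150:]

def accuracy(h, data):
    return sum(h(x) == y for x, y in data) / len(data)

# A deliberately tiny hypothesis set: "look at a single bit."
# The candidate is the element that does best on the sample.
hypothesis_set = [lambda x, i=i: x[i] == 1 for i in range(8)]
candidate = max(hypothesis_set, key=lambda h: accuracy(h, sample))

print("sample accuracy: ", accuracy(candidate, sample))
print("testing accuracy:", accuracy(candidate, testing_set))
```

The gap between the two printed numbers is one crude measure of how well the candidate generalizes.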

Decision trees
--------------

A sequence of tests (a path of nodes starting from a root node) applied to every instance in order to arrive at its label (a leaf).

Decision trees: learning
------------------------

1. Pick the best attribute to split the data.
2. Ask a question about that attribute (test its possible values).
3. Follow the correct answer path.
4. Go to 1 until the possibilities have been narrowed to one answer.

How can this intuition be used to build a "tree" for all possible instances of the problem?

Decision trees: expressiveness
------------------------------

Consider the inclusive disjunction: **OR(a1, a2, ..., an)** (any). Note that the tree is "linear" and has a height of n.
```
--->a1--T-->a2--T-->...--T-->an--T-->TRUE
     |       |            |
     F       F            F
     |       |            |
     v       v            v
   FALSE   FALSE        FALSE
```

Next, consider the exclusive disjunction: **XOR(a1, a2, ..., an)** (odd parity). Note that the tree is a balanced tree of height n with O(2^n) nodes.

```
--->a1--T-->a2--T-->a3--T-->TRUE
    |       |       |
    |       |       F----->FALSE
    |       |
    |       F----->a3--T-->FALSE
    |              |
    |              F----->TRUE
    |
    F----->a2--T-->a3--T-->FALSE
           |       |
           |       F----->TRUE
           |
           F----->a3--T-->FALSE
                  |
                  F----->TRUE
```
It's better just to add the inputs as integers mod 2.

Decision trees: expressiveness (search space)
---------------------------------------------

Given n binary attributes, how many possible decision trees are there? 2^(2^n)

Intuition:
* There are 2^n possible configurations of the attributes.
* Each "unique" decision tree maps these 2^n configurations to a 2^n-sized bit vector.
* There are 2^(2^n) possible bit vectors of size 2^n.
* Therefore, there must be 2^(2^n) possible unique classifiers.
* Note that more than one tree may map to a single classifier, so the hypothesis space is even larger (thanks to inductive bias, we can cut down the problem significantly).

ID3 Algorithm
-------------

```
A <- best attribute from remaining attributes (initially, all attributes)
Assign A as decision attribute for Node
For each value of A, create a new descendant of Node
Sort training examples to leaves
If examples are perfectly classified, STOP.
Else if we ran out of attributes, STOP
Else, start over for each leaf (with corresponding set of training examples)
```

The **best attribute** is the one with the greatest information gain.

GAIN(S, A) = ENTROPY(S) - \sum_v \frac{|S_{A=v}|}{|S|}ENTROPY(S_{A=v})

Where ENTROPY is defined as:

ENTROPY(S) = -\sum_v p(v)\log p(v)

Equivalently, the **best attribute** is the one that splits the data into subsets whose entropies' weighted sum is the least (maximizing the information gain).

A^* = {\textrm{argmin}_A\textrm{ }} \sum_v|S_v|ENTROPY(S_v)


The **inductive bias** of the ID3 algorithm:
* The best splitters appear earlier (closer to the root).
* Produces shorter trees.
* Prefers correct classifiers over incorrect trees (thanks for that)


Decision trees: other considerations
------------------------------------

* How do we handle continuous attributes?
    * Use intervals.
    * Split age range 0-90 into 0-40 and 40-90 -- perhaps even use a modified ID3 to find the best splitting age.
* Does it make sense to repeat an attribute along a path in the tree?
    * No for finite-valued attributes.
    * However, continuous attributes can be tested with different questions.
    * No question needs to be asked twice.
* When do we stop?
    * Everything is classified correctly (or nearly correctly -- we do not want to **overfit**).
    * No more attributes.
    * Do not **overfit**.
        * Try not to have a tree which is too big.
        * Try many trees and use cross-validation.
        * Variant of cross-validation where you hold out a subset of the data and build a tree breadth-first on the remaining data. Stop when error is "low enough."
        * Build the whole tree and prune (vote if the classification is not perfect).
* Regression
    * Model outputs and group them (round off or cluster).
    * Report the average on leaves, or vote, or locally fit a line (hybrid).


Decision trees
--------------

We learned:
* Representation (tree... set of questions)
* ID3: a top down learning algorithm
* Expressiveness of DTs
* Bias of ID3
* "Best" attribute Gain(S, A)
* Dealing with overfitting.
--------------------------------------------------------------------------------
/lessons/ml_p1_lesson_02_regression_intro.md:
--------------------------------------------------------------------------------
Lesson 2: Regression & Classification
=====================================

What is regression?
-------------------

The term regression became overloaded over time. Initially it described how children of very tall people and children of very short people both tended to be a bit closer to the average height than their parents.

The "function" child_height(parent_height) looks like a line and has slope less than 1.


Linear regression
-----------------

It's possible to fit affine hyperplanes to data. If we transform the input vectors with squared or cubed terms, etc., we can also fit any higher-degree polynomial with the same technique.

We would like the best-fit "curve" (surface), the one that minimizes some error measure. However, there are certain inductive biases we can make. What degree polynomial should we fit? The higher the degree, the better we can fit a set of points, but the more "overfit" the curve will be.

When we have an overfit candidate function, though it produces no error on the training set, it will not generalize.


Performing linear regression
----------------------------

Let our instances be elements of R^n. Suppose we have a training set {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)} where the x_i are in R^n and the y_i are in R. We would like to find w such that x dot w approximates y.

Let w be a row vector of weights and let the x_i be column vectors. Define the matrix X = [x_1 x_2 x_3 ... x_m] (instances as columns) and Y = [y_1 ... y_m]. We want to find w such that wX = Y (approximately).

If we right-multiply both sides by X^T, we will very likely have an invertible matrix XX^T. So we arrive at:

w = Y(X^T)(XX^T)^-1

For details: http://en.wikipedia.org/wiki/Linear_regression

This formula minimizes the squared error (prove it by differentiating L(w) = ||Y - wX||^2).


Errors
------

All data has noise/errors arising from sensor error, malicious agents, transcription error, unmodeled influences, etc.
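Tying the last two sections together, here is a minimal numpy sketch of the normal-equation solution w = Y(X^T)(XX^T)^-1 (the data and true weights are synthetic, assumed purely for illustration); even with noisy targets, the recovered w stays close to the true weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# m = 50 instances in R^3, stored as columns of X (the convention above),
# with targets Y generated by a known weight vector plus a little noise.
n, m = 3, 50
X = rng.normal(size=(n, m))
true_w = np.array([2.0, -1.0, 0.5])
Y = true_w @ X + 0.1 * rng.normal(size=m)

# w = Y X^T (X X^T)^{-1}; in practice np.linalg.solve or lstsq is more
# numerically stable than forming an explicit inverse.
w = Y @ X.T @ np.linalg.inv(X @ X.T)

mse = np.mean((w @ X - Y) ** 2)
print("recovered w:", w, " training MSE:", mse)
```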

We should not overfit the training data, so as not to model the errors.

We want to train until we minimize the mean squared error (MSE) on the training data.

MSE = \frac{1}{n} \sum_i (w \cdot x_i - y_i)^2

Next we want to compute the error on various testing sets to make sure that the model is not so complicated that it overfits and doesn't generalize.

Cross validation
----------------

To check that the model generalizes well, hold aside a subset of the training data as "testing data," train on the remaining set, and test against the "testing data." Choose our inductive bias (picking the model or model complexity) such that when the model is trained on the training set, its error on the testing set is still minimal.


Other input spaces
------------------

Our instance elements need not all be reals (though the regression would be better under certain special circumstances). Ideally the values of each feature should have a natural order. Boolean vectors are often a good choice.
--------------------------------------------------------------------------------
/lessons/ml_p1_lesson_03_neural_networks.md:
--------------------------------------------------------------------------------
Lesson 3: Neural Networks
=========================

Artificial neural networks
--------------------------

Based on the biological neuron (cell body, axon, and synapses -- cell bodies fire signals down axons to synapses if their activation threshold is surpassed).

Perceptron
----------

The perceptron hypothesis boils down to the sign of the dot product of the weights and the input. Let's say all inputs have a zeroth component which is 1 and the zeroth component of the weight vector is the negative threshold.

h(\mathbf{w})(\mathbf{x}) = \{\mathbf{w \cdot x} \ge 0\}

The output is 1 if the dot product is nonnegative, 0 otherwise.

How powerful is a perceptron unit
---------------------------------

Perceptrons define a discriminant line which splits the plane in half (the analogy applies in higher dimensions also).

When the components of the input are boolean, the perceptron can define boolean logic such as AND and OR, but a single perceptron cannot represent XOR and others.

Given a vector <-theta, w_0, w_1>, be able to visualize the half-plane; and given a half-plane, be able to write the weights vector.

XOR as a perceptron network
---------------------------

A more complex discriminant "curve" can be contrived from a "network" of perceptrons.


Perceptron training
-------------------

Two ways to learn weights:

* Perceptron learning rule
* Gradient descent/delta rule

The perceptron learning rule defines how to update the weights in iterative stages. Over time, it is capable of converging on the best classifier on the training set if the training set is linearly separable. However, it will not converge if the data is not linearly separable.

```
eta <- learning rate
initialize(w) // Set every component w_i of w to some random initial value
do until satisfied:
    for each training datum (x, y) from the training set:
        y_hat <- w dot x // Compute the predicted output
        w <- w + eta * (y - y_hat) * x // Update the weights vector; (y - y_hat) is a scalar here, so this simply scales the vector x

```

This is online learning, and you can stop when the average error on the training set falls below some threshold.

If the data is "nearly" linearly separable, perhaps reduce the learning rate over time such that the discriminant line at least appears to converge?

Perceptron learning cannot be applied in the conventional sense to a multiple-layer neural network. It is only defined for a single layer.


Gradient descent
----------------

Is there a learning algorithm which is robust to data that is not linearly separable?

We would like to perform gradient descent on the total error on the training set such that we can arrive at the weights which minimize the error.

Let D be the training set with ordered pairs (x, y). Let y be the target. Instead of taking the sign, we will simply take the dot product of w and x (this is essentially simple linear regression). We want to minimize:

E({\mathbf w}) = \frac{1}{2}\sum_{({\mathbf x}, y) \in D} (y - {\mathbf w} \cdot {\mathbf x})^2

The partials with respect to the components of w:

\frac{\partial E}{\partial w_i} = -\sum_{({\mathbf x}, y) \in D} x_i(y - \mathbf{w} \cdot \mathbf{x})

The error decreases fastest in the direction opposite to the gradient. Walk in small steps opposite to the gradient.

This is robust to non-(linearly separable) data and converges in the limit (eta must be small so the iterates do not yo-yo around the well).

Comparison of learning rules
----------------------------

The perceptron learning rule uses the following weight update.

\Delta w_i = \eta(y - \{\mathbf{w \cdot x} \ge 0\})x_i

Learning a classifier using gradient descent on a linear regression model uses the following weight update.

\Delta w_i = \eta(y - \mathbf{w \cdot x})x_i

Sigmoid
-------

Why not do gradient descent on the original perceptron itself rather than remodeling it as linear regression? The step function is not differentiable. We can get around this with the sigmoid function (hyperbolic tan, arctan, logistic sigmoid).

\sigma(a) = \frac{1}{1 + e^{-a}}

Note that the derivative has a nice form:

D\sigma(a) = \sigma(a)(1 - \sigma(a))

Now we can use the chain rule to extend gradient descent to neurons with sigmoid activation functions.

Neural Networks
---------------

Hook up nodes which use the sigmoid of the dot product of the weights and the input. The input layer can be fed into the hidden layers and eventually into the output "layer" (a single node).

Every node can be trained based on its respective inputs and output using the gradient descent rule on the now differentiable activation function (the sigmoid). For each input, the output can be computed. The output layer can be trained first. Then the last hidden layer can be trained. Then the next, and so on. This careful bookkeeping is known as backpropagation.

With networks, there may be many local optima. We need to use clever techniques to arrive at a suitable set of weights.

Optimizing weights
------------------

How do we arrive at a suitable set of weights? It is an optimization problem.

* Use momentum terms in the gradient (simulated annealing?)
* higher order derivatives (Hamiltonians, etc.)
* randomized optimization (later on in the course)
* penalty for "complexity" (overfitting) -- too many hidden layers? too many nodes in a layer? large-valued weights (interesting)


Restriction bias
----------------

Restriction bias -- the representational power and the set of hypotheses we will consider.

Perceptrons are linear models (discriminant lines splitting planes). Networks of perceptrons can approximate more complex functions. Sigmoids allow even better learning and can fit more interesting functions.

* Boolean functions -- a network of threshold-like units.
* Continuous functions -- use sigmoid activations and a hidden layer.
* Arbitrary functions -- stitch together two or more functions with discontinuities using two hidden layers.

Overfitting is a problem with more complex networks.

* We can bound the architecture of the network.
* Cross validate to better bound the architecture or bound the weights.
* We should also find the number of iterations at which the network best generalizes.

Not really much of a restriction bias.


Preference bias
---------------

Preference bias -- which representation is preferred by the algorithm.

Generally we initialize the network with small random weights -- we prefer "simpler"/less complex representations. (This can hit a local minimum. We generally run the training multiple times to avoid local minima.)

Practically, this preference (Occam's Razor) produces simpler and more generalizable representations.

Summary
-------

* Perceptron (linear and thresholded)
* Perceptron learning rule
* Gradient descent
* Sigmoid function
* Restriction bias
* Preference bias
--------------------------------------------------------------------------------
/lessons/ml_p1_lesson_04_instance_based_learning.md:
--------------------------------------------------------------------------------
Lesson 4: Instance Based Learning
=================================

Instance based learning
-----------------------

Instance based learning covers machine learning models which perform "lazy learning" at scoring time, directly from the training data. This is different from other machine learning models, which lossily compress the training data into a simpler hypothesis function.

Some benefits:

* Remembers all training data points.
* Fast for known points (a lookup -- depending on the properties of the data structure used for storage).
* "Simple"

Possible problems:

* Overfits and generalizes badly
* Must handle conflicting training data

Since instance based learning often requires traversing the training data itself, it may be slow to score unknown inputs, but it is easily adaptable to new training data points.

k-Nearest Neighbors
-------------------

The number of neighbors k is a free parameter which must be learned.

Given training data D = {x_i, y_i}, a distance metric d(q, x), the required number of neighbors k, and the query point q:

NearestNeighbors(k)(D)(q) = {a_1, a_2, ..., a_k} -- the k points of D closest to q under d (the k smallest distances).

For classification: vote of the y_i, or a weighted vote, or some other strategy.
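A minimal Python sketch of this classification case (assuming a Euclidean metric and an unweighted majority vote; ties and the distance-conflict fudging mentioned below are ignored):

```python
from collections import Counter
import math

def knn_classify(D, q, k):
    """k-NN vote: D is a list of (x, y) pairs, q is the query point."""
    # Sort the training data by distance to q and keep the k nearest.
    neighbors = sorted(D, key=lambda xy: math.dist(xy[0], q))[:k]
    # Majority vote over the neighbors' labels.
    votes = Counter(y for _, y in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two clusters labeled "A" and "B".
D = [((0, 0), "A"), ((0, 1), "A"), ((5, 5), "B"), ((6, 5), "B")]
print(knn_classify(D, (1, 1), k=3))   # -> "A"
```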

For regression: take the mean or a weighted mean (weights are 1/distance).

If there are distance ties, "take all of them" the way college rankings do (this fudges the k).

Comparing learning and query times
----------------------------------

Given a list of n data points (x_i, y_i) sorted by x_i, what is the running time and space consumption of "learning" (assume no effort for ETL -- which makes little sense in practice) and "query" (do not consider the learned data structure's space consumption) for 1-NN, k-NN, and linear regression?

* 1-NN learning runs in O(1) and takes O(n) space.
* 1-NN query runs in O(log n) (through binary search) and takes O(1) space.
* k-NN learning runs in O(1) and takes O(n) space.
* k-NN query runs in O(k + log n) (binary search, then check the 2k adjacent points) and takes O(1) space.
* Simple linear regression learning runs in O(n) (due to the bound on the dimension of x_i, even the normal-equations solution is "linear" -- though this is a bit misleading) and takes O(1) space.
* Simple linear regression query runs in O(1) time and takes O(1) space.

Querying in instance based algorithms may be slower (depending on the data structure), though assuming no ETL costs, learning in instance based algorithms is faster. Moreover, incremental learning of new data points is possible and probably very fast.

IBL is more lazy about learning (it pushes the work off until query time). So k-NN is a lazy learner (or "just in time" learner) whereas linear regression is an eager learner.

K-NN Bias
---------

Preference bias -- when searching the hypothesis space, the search may "prefer" a particular subset of the hypothesis space over another. Perhaps the hypothesis space is not completely searched and rather only a subset is searched. E.g. simpler trees, simpler functions, Occam's razor. Compare to restriction bias, which is the total representational power of the hypothesis space itself.

What about k-NN?

* Locality -- near points are similar; how nearness is defined comes from domain knowledge. Moreover, distance is defined on the inputs, not the labels/outputs.
* Smoothness -- averaging produces smoothness as opposed to discontinuities.
* All features seem to matter "equally" -- this comes from the distance function too.

Curse of dimensionality
-----------------------

As the number of features or dimensions grows, the amount of data we need to generalize accurately grows exponentially. Comes from Richard Bellman, the dynamic programming guy.

Covering the "same space" in higher dimensions requires exponentially more points. ("Same space" is not well-defined.)

Applies in all of ML and not just k-NN.

Some other points
-----------------

Distance metrics -- e.g. Euclidean, Manhattan, weighted components, mismatches (for classification), etc.

Implications of k:

* When k is n, we get a constant function (assuming a vote or regular average).
* With a weighted average, the computation may not be so easy, but we get a "smooth"-ish hypothesis. Points closer have greater say.
* Perhaps instead of a weighted average, do "local" regression on the k closest points. Known as "locally weighted regression." We can use decision trees, neural nets, etc. there. More sophisticated curves and more complex hypotheses (removes some restriction bias, but may overfit).

What have we learned?
---------------------

* Instance based learning
* Eager and lazy learning -- lazy puts off work until needed, eager generally learns from training data ahead of time.
* k-NN
* nearest-neighbor; similarity (distance)
* Domain knowledge matters (distance)
* Classification versus regression
* "Averaging"/combining results.
* Composing learning algorithms together (locally weighted regression = k-NN composed with regression).
* Curse of dimensionality. Required data is O(2^d).
* No free lunch theorem(?) -- averaging over all possible data makes any learning algorithm no better than random. Domain knowledge helps.


--------------------------------------------------------------------------------
/lessons/ml_p1_lesson_05_ensemble_learning.md:
--------------------------------------------------------------------------------
Lesson 5: Ensemble Learning: Boosting
=====================================

Spam/whitelist rules ("from: spouse", "Nigeria", "has pr0n", etc.) apply to many emails, but not all. How do we generate rules? How do we combine rules? Generating rules is similar to decision trees. If you already know the rules, the combining step may be like neural networks.

Boosting
--------

Combine simple rules into a complex rule.

* Learn over a subset of the data to generate a simple rule.
* Learn over another subset of the data to generate another simple rule.
* ...
* ...
* Combine all rules.

Subsetting is non-trivial.

* Uniformly randomly pick a subset of data. Apply any learning algorithm to it.

Combining is non-trivial.

* Take the mean.

In the case where we pick single points and take the overall mean, the result is the same as n-NN with no weighting.

**Bagging/bootstrap aggregation** -- taking random subsets, learning, and taking the mean of the individual machines.

*Instead of picking subsets randomly, take advantage of what we are learning as we go along. In particular, pick examples we are bad at. Take the "hardest" subsets.*

How about using a weighted mean?

Previously we defined error in classification as the number of mismatches. Instead, define **error** as Pr_D(h(x) != c(x)), where D is the distribution from which x is acquired. This allows more probable examples to produce a greater error contribution than less likely examples. We want to be good at the common points.

**Weak learner** -- always does better than chance: for all D, Pr_D(h(x) != c(x)) <= 1/2 - \epsilon.

In some problems, there may exist distributions such that there is no weak learner. This may require an expansion of the hypothesis space (relieve the restriction bias).

Boosting in code
----------------

```
Given D = {(x_i, y_i)} where the y_i are -1 or +1
For t = 1 to T:
    construct D_t (a distribution)
    find a weak classifier h_t with small error e_t = Pr_{D_t}(h_t(x_i) != y_i) (???)
Output H_final
```

How to construct D_t?
---------------------

D_1(i) = 1/n (uniform)
D_(t+1)(i) = D_t(i)*e^(-a_t*y_i*h_t(x_i))/z_t
where a_t = 1/2 * ln((1-e_t)/e_t) and z_t normalizes D_(t+1) into a proper distribution

"Generally" this puts more probability on incorrect instances and less probability on correct instances. We have an issue if the error is exactly half (a_t = 0) or exactly 0 (a_t blows up).

How to combine smaller classifiers into H_final?
------------------------------------------------

Take a weighted vote using the a_t.

H_final(x) = sgn(sum_t a_t*h_t(x))

Hypotheses/ensemble methods
---------------------------

Compare to decision trees.

Compare to locally weighted linear regression (as a modification of k-NN), where you stitch simpler models into a complex one. This characterizes all ensemble methods.

Why is boosting good?
---------------------

Every iteration of boosting only "gains" more information by better classifying badly classified examples.

Ensemble learning
-----------------

* Ensembles are good
* Bagging is good
* Combines simple into complex
* Boosting is really good -- agnostic to the learner
* Weak learners
* Error based on distributions
* Cool points: boosting does not seem to overfit over arbitrary iterations in practice (or does it?)
--------------------------------------------------------------------------------
/lessons/ml_p1_lesson_06_support_vector_machines.md:
--------------------------------------------------------------------------------
Lesson 6: Support Vector Machines
=================================

The best line
-------------

The discriminant line that generalizes best is the farthest from the two classes ("commits least to the training data").

Support vector machine (massive handwaving)
-------------------------------------------

We are interested in maximizing the size of the margin.

The general linear classifier formula is: y = w^T x + b (here y is the classification).

The equation for the discriminant line is 0 = w^T x + b.

Translate this line up or down to get 1 = w^T x + b or -1 = w^T x + b.

We want w such that the distance between these two lines is maximal (and the choice of 1 is really arbitrary -- we just want w "up to" scaling).

Take the difference: w^T (x_1 - x_2) = 2 (where x_1 and x_2 are support vectors).

Divide by ||w||. Then w^T/||w|| (x_1 - x_2) = 2/||w||.

Now we have the difference of x_1 and x_2 projected onto the normalized w vector, and w is parallel to the vector from x_1 to x_2. Some handwaving magic. The left-hand side is the margin m.

m = 2/||w||

So, handwavily, the "distance" between the -1 line and the +1 line is 2/||w||, and we can maximize this by minimizing w's magnitude. What about direction?

Support vector machines
-----------------------

Maximize 2/||w|| while classifying everything correctly. The constraint is: y_i(w^T x_i + b) >= 1 for all i.

Alternatively, minimize 1/2 ||w||^2 subject to the same constraints.

This is a quadratic programming problem (which is a load of bullshit from shoddy under-formalized areas of math) which can be solved by jumping off a cliff.

Some properties of the solution:

* w = sum_i a_i y_i x_i and b = easy
* The a_i are mostly 0 (which means only some of the x_i matter).
* The data points are being dotted with each other. A similar direction means greater weight... the dot product measures how "similar" they are to each other.

Like k-NN, except we learn which points are important (as opposed to considering all neighbors).

SVMs: linearly married
----------------------

What if some points cannot be classified correctly? Have a cost for misclassification.

What if lines are a bad hypothesis?
Transform the data points and throw them into a higher-dimensional space where a line might be a better hypothesis.

Phi(q) = <q_1^2, q_2^2, sqrt(2)*q_1*q_2>

Phi(q) . Phi(p) = (q . p)^2 (which makes the classifier into a circle instead of a line).

Instead of computing Phi, just square the dot product -- **the kernel trick**.

Kernels
-------

We can replace the dot product with a kernel.

This implicitly throws the data points into a higher dimensional space and gets back a similarity.

Kernel choice requires domain knowledge.

Polynomial kernel: K(x, y) = (x^T y + c)^p

RBF kernel: K(x, y) = exp(-(||x - y||^2)/(2*sigma^2))

Kernels must satisfy the Mercer condition -- they must act like an inner product.

SVMs -- review
--------------

* margins ~ generalization and overfitting
* big is better
* optimization problem for finding max margins: quadratic programming and the dual problem
* support vectors
* the kernel trick, which extends the inner product
* Mercer condition

Back to boosting
----------------

Boosting doesn't seem to conventionally overfit (testing error keeps going down).

**Confidence** -- how confident is a classification? (The number of neighbors agreeing in k-NN, or the local variation in linear regression.)

Explain the difference between confidence and error.

We can examine each iteration of boosted classifiers by normalizing the argument of the sign. This is the confidence. Hard examples fall near the center (toward 0). As we boost further, the hard points also start having higher confidence. The margin keeps increasing and does not overfit.

Boosting will overfit if the individual learners always overfit.

**Pink noise** -- uniform noise. Boosting overfits.

**White noise** -- Gaussian noise.

Summary
-------

* Margins -- important to SVMs and to generalization
* big is better
* optimization problem for finding max margins: quadratic programming and the dual problem
* support vectors
* the kernel trick, which extends the inner product
* Mercer condition
--------------------------------------------------------------------------------
/lessons/ml_p1_lesson_07_computational_learning_theory.md:
--------------------------------------------------------------------------------
Computational Learning Theory
=============================

* Defining learning problems.
* Showing that specific algorithms work.
* Showing that these problems are fundamentally hard.

Resources in machine learning
-----------------------------

* Time
* Space
* Training samples

Defining inductive learning
---------------------------

* Probability of successful training: 1 - \delta
* Number of training examples: m
* Complexity of the hypothesis class.
* Accuracy to which the target concept is approximated: \epsilon
* Manner in which training examples are presented (especially matters in online training as opposed to batch).
* Manner in which training examples are selected.

Selecting training examples
---------------------------

1. Learner asks questions of the teacher.
2. Teacher gives examples to help the learner.
3. Some natural fixed distribution.
4. Evil distribution.

Teaching via 20 questions
-------------------------

H -- set of possible people (the hypotheses)
X -- set of questions

* The teacher needs to choose only one question (he knows the right answer).
* The learner needs log |H| questions (binary search over the hypotheses).

Teacher with constrained queries
--------------------------------

H: conjunctions of literals (a literal is a variable or its negation) -- but not negations of conjunctions (find the right conjunction)
X: x1x2x3...xk, k-bit inputs

To solve:
* Show what is irrelevant... two positive examples which are unperturbed by a changing feature.
* Show what is relevant... k negative examples which validate that perturbing a relevant feature matters.

Even though there are 3^k hypotheses (each variable positive, absent, or negated), the smart teacher can show just k + 2 examples.

Learner with constrained queries
--------------------------------

H: conjunctions of literals (as above)
X: x1x2x3...xk, k-bit inputs

The learner does not know the actual answer like the teacher does, so it does not know which training examples are the right ones to ask for.

Start enumerating every data point from 0,...,0 to 1,...,1 -- which is 2^k possibilities.

The first positive result helps us significantly.

Learner with mistake bounds
---------------------------

H: conjunctions of literals (as above)
X: x1x2x3...xk, k-bit inputs

1. Input arrives.
2. Learner guesses the answer.
3. A wrong answer is **charged**.
4. Go to 1.

The algorithm:

1. Assume each feature is both positive and negated.
2. Given an input, compute the output.
3. If wrong, set all positive features that were 0 to absent, and all negated features that were 1 to absent. Go to 2.

Never make more than k + 1 mistakes.

Definitions
-----------

* Learner chooses examples
* Teacher chooses examples
* Nature chooses examples
* Mean teacher chooses examples

**Computational complexity** -- how much computational effort is needed for a learner to "converge."
**Sample complexity** -- batch; how many training examples are needed for a learner to create a successful hypothesis.
**Mistake bounds** -- online; how many misclassifications can a learner make over an infinite run?

Version spaces
--------------

True/target hypothesis: c in H
Candidate hypothesis: h in H
Training set: S, a subset of X
**Consistent learner** -- produces h such that c(x) = h(x) for x in S (always produces a hypothesis that is consistent with the data).
**Version space** -- VS(S) = {h in H s.t. h is consistent with S} -- the hypotheses consistent with the examples.

PAC learning -- error of h
--------------------------

Training error -- fraction of training examples misclassified by h.
True error -- fraction of examples that would be misclassified on samples drawn from distribution D.

error_D(h) = Pr_{x from D}[c(x) != h(x)]

PAC learning
------------

* c -- concept class
* L -- learner
* H -- hypothesis space
* n -- |H|, the size of the hypothesis space
* D -- distribution over inputs
* 0 <= \epsilon <= 1/2 (our error goal)
* 0 <= \delta <= 1/2 (failure-probability/certainty goal -- with probability 1 - \delta, we must produce true error at most \epsilon)


PAC -- probably (1 - \delta) approximately (\epsilon) correct (error_D(h) = 0)!

"C is PAC-learnable by L using H iff learner L will, with probability 1 - \delta, output a hypothesis h in H such that error_D(h) <= \epsilon, in time and samples polynomial in 1/\epsilon, 1/\delta, and n."

Quiz: PAC-learnable
-------------------

C = H = {h_i(x) = x_i} (k-bit inputs)

There are k hypotheses. So n = |H| = k.

Pick a hypothesis uniformly from VS(S, H).

\epsilon-exhausted version space
--------------------------------

VS(S) is \epsilon-exhausted iff for all h in VS(S), error_D(h) <= \epsilon.

Every element in the version space has true error at most epsilon.

Haussler Theorem -- bound true error
------------------------------------

Let h1,...,hk be the hypotheses in H with error_D(h_i) > \epsilon -- the ones with high true error.

How much data do we need to knock out these hypotheses?

Pr_{x from D}(h_i(x) = c(x)) <= 1 - \epsilon
Pr(h_i consistent with c on m examples) <= (1 - \epsilon)^m
Pr(at least one of h1,...,hk consistent with c on m examples) <= k(1 - \epsilon)^m <= |H|(1 - \epsilon)^m

Note that (1 - \epsilon)^m <= exp(-\epsilon*m) (this comes from ln(1 - \epsilon) <= -\epsilon).

So we have Pr(at least one high-error hypothesis is consistent with c on m examples) <= |H|*exp(-\epsilon*m), and we want this <= \delta.

This is an upper bound on the probability that the version space is **not** \epsilon-exhausted after m samples, and we want \delta to bound it.

ln |H| - \epsilon * m <= ln \delta
m >= 1/\epsilon (ln |H| + ln (1/\delta)) (i.e. polynomial)

Satisfying this will satisfy PAC-learnability.

Quiz: PAC-learnable example
---------------------------

H = {h_i(x) = x_i} (where x is 10 bits)
\epsilon = 0.1
\delta = 0.2
D: uniform

1/0.1 * (ln 10 + ln (1/0.2)) = 10*(ln 10 + ln 5) = 10*ln 50 ≈ 39.1, so m = 40 samples suffice.

This bound is agnostic to the distribution of nature's data.

What we learned
---------------

* Teachers versus learners and their interaction
    - What is learnable? Like complexity theory for ML.
* Sample complexity -- data
* Types of interactions:
    - learner picks questions
    - teacher picks questions
    - nature picks questions
    - evil teacher picks questions
* Mistake bounds (as opposed to how many samples you need)
* PAC learning: version spaces, training/test/true error, distribution.
* m >= 1/\epsilon (ln |H| + ln (1/\delta)) (i.e. polynomial)
    - this assumed the target is in the hypothesis space
    - otherwise **agnostic**, and the bound is slightly different, but still polynomial.
    - an infinite hypothesis space can be a problem
--------------------------------------------------------------------------------
/lessons/ml_p1_lesson_08_vc_dimension.md:
--------------------------------------------------------------------------------
Lesson 8: VC Dimensions
=======================

Infinite hypothesis spaces
--------------------------

The Haussler theorem bound breaks with infinite hypothesis spaces.

Consider the example:

X: {1, 2, ..., 10}
H: h(x) = {x >= t, where t is real}

Though there are infinitely many hypotheses, there are only finitely many semantically distinct hypotheses (11).

Power of a hypothesis space
---------------------------

What is the largest set of inputs that the hypothesis class can label in all possible ways?

In the above hypothesis space, there is no way for a pair of data points to be labeled in all possible ways. It can, however, label **one** point in all possible ways (two ways).

What VC stands for
------------------

**Shattering** -- labeling in all possible ways.

Vapnik-Chervonenkis dimension -- the size of the largest set of inputs that the hypothesis class can shatter.

Quiz: interval training
-----------------------

X = the reals
H = {h(x) = {x in [a, b]} -- real intervals}

Has a VC dimension of 2.

Note on logic
-------------

To prove a VC dimension of at least d: there exists a set of d points such that for all labelings there exists a consistent hypothesis.
To prove it is less than d: not(there exist points such that for all labelings there exists a hypothesis) = for all points, there exists a labeling for which no hypothesis is consistent.

Quiz: linear separators
-----------------------

X = R^2
H = {h(x) = w^T x >= \theta}

Has a VC dimension of 3.

The rings
---------

The d-dimensional hyperplane has VC dimension d + 1.

Quiz: polygons (convex)
-----------------------

X: R^2
H: points inside some convex polygon

The VC dimension is infinite. (Put the points on a circle; for any labeling, use the convex polygon whose vertices are the positively labeled points.)

Sample complexity and VC dimension
----------------------------------

Update Haussler's theorem with the VC dimension:

m >= 1/\epsilon (8*VC(H)*lg(13/\epsilon) + 4*lg(2/\delta)) (infinite case)
m >= 1/\epsilon (ln |H| + ln (1/\delta)) (finite case)

VC dimension of finite H
------------------------

Upper bound: if d = VC(H), then there must exist 2^d distinct concepts (each labeling of the shattered set gets a different h).
So 2^d <= |H|, i.e. d <= lg |H|.

**Fundamental Theorem of Machine Learning**
H is PAC-learnable if and only if its VC dimension is finite.

Summary
-------
* VC dimension. Shattering.
* VC relates to hypothesis space parameters (the "true" number of parameters).
* VC relates to finite hypothesis space size.
* Sample complexity relates to VC dimension.
* VC dimension captures PAC-learnability.
--------------------------------------------------------------------------------
/lessons/ml_p1_lesson_09_bayesian_learning.md:
--------------------------------------------------------------------------------
Lesson 9: Bayesian Learning
===========================

Bayesian learning
-----------------

* Learn the **best** hypothesis given data and some domain knowledge.
* Learn the **most probable** hypothesis given data and domain knowledge.
(**best** == **most probable**)

argmax_{h in H} Pr(h | D)

Bayes rule
----------

P(h|D) = P(D|h)P(h)/P(D)

P(D) -- prior on the data
P(h) -- prior on h (this encapsulates our domain knowledge)
P(D|h) -- probability of the data given the hypothesis (in a world where the hypothesis is true, the likelihood of seeing this data)

Quiz: Bayes' rule
-----------------

A test for a disease: 98% true positive rate (2% false negatives); 97% true negative rate (3% false positives).

Only 0.8% (0.008) of the population has the disease.

P(T|P) = P(P|T)*P(T)/P(P) = .98*.008/.0376 ~ 0.209

P(not T|P) = P(P|not T)*P(not T)/P(P) = .03*.992/.0376 ~ 0.791

If the prior 0.008 were higher, the test would be more useful.


Bayesian learning
-----------------

For each h in H:

h_MAP = argmax_h P(h|D) = argmax_h P(D|h)P(h) (maximum a posteriori)
h_ML = argmax_h P(D|h) (maximum likelihood -- assume the prior P(h) is uniform)

Direct computation is not practical for large hypothesis spaces.

Bayesian learning in action
---------------------------
Assume:
* We have labeled training data {<x_i, d_i>} given as noise-free examples of c.
* c is in H.
* A uniform ("uninformed") prior over the hypothesis space.

P(h|D) = P(D|h)P(h)/P(D)

P(h) = 1/|H|
P(D|h) = {1 if d_i = h(x_i) for all i; 0 otherwise} = {1 if h is in VS(D)}
P(D) = sum_i P(D|h_i)P(h_i) (total probability)
     = sum_{h in VS_H(D)} 1 * 1/|H| = |VS|/|H|

P(h|D) = (1 * 1/|H|)/(|VS|/|H|) = 1/|VS|

This means, given a bunch of data, the probability of a particular hypothesis being correct is uniform over the elements of the version space.

Any element of the version space is good.

Bayesian learning with noise
----------------------------

Assume:
* We have labeled training data {<x_i, d_i>}.
* d_i = f(x_i) + e_i (where e_i is error)
* e_i ~ N(0, s^2) (i.i.d.)

h_ML = argmax_h P(D|h) = argmax_h product_i P(d_i | h)

Each factor is a Gaussian density: stuff*exp(-(d_i - h(x_i))^2/(2*s^2))... we can take the log for the purpose of the argmax.

h_ML = argmax_h sum_i -(d_i - h(x_i))^2 = argmin_h sum_i (d_i - h(x_i))^2

This is the sum of squared errors... derived from the Gaussian noise model and the maximum likelihood assumption.

Bayesian learning
-----------------

h_MAP = argmax P(D|h)P(h) = argmax [lg(P(D|h)) + lg(P(h))] = argmin [-lg(P(D|h)) - lg(P(h))]

An event with probability p has optimal description length -lg p.

"Minimizing length(D|h) + length(h)."

If the hypothesis fits the data well, the data is superfluous information (cheap to describe); otherwise, we need a lot of bits to describe it. So the first term corresponds to misclassifications/errors. The second term penalizes the complexity (size) of the hypothesis.

**Minimum description length**

Bayesian classification
-----------------------

The consensus of the non-MAP or non-ML hypotheses may differ from the MAP or ML hypotheses. Vote?

In some sense we are trading out single hypotheses for "boosted" superhypotheses.

value_MAP = argmax_v sum_h P(v|h)P(h|D) (weighted vote of h in H) -- the **Bayes Optimal Classifier**

Summary
-------

* Bayes rule (swap "cause and effect")
    + P(h|D) ~ P(D|h)P(h)
* priors matter
* h_MAP, h_ML
* derived least squares from the Gaussian-noise h_ML
* best classification is a consensus of all classifiers: the **Bayes Optimal Classifier** (the best classification you can possibly do)

--------------------------------------------------------------------------------
/textbook/notes_ch01.md:
--------------------------------------------------------------------------------
Chapter 1: Introduction
=======================

1.1 Well-posed learning problems
--------------------------------

Given a machine, let $P(T, E)$ be the performance of that machine at
task $T$ given experience $E$. If $P(T, E') > P(T, E)$, then the machine
has **learned** more from experience $E'$ than from experience $E$.

An alternative interpretation:

If $E' > E$ implies $P(T, E') > P(T, E)$, then the machine has
**learned**.
--------------------------------------------------------------------------------
/textbook/notes_ch03.md:
--------------------------------------------------------------------------------
Chapter 3: Decision tree learning
=================================


3.1-3.2 Decision tree representation
------------------------------------

The usage of decision trees as classifiers is obvious. Given an
instance, the machine starts at the root node. At each node of the tree,
the machine tests the value of some attribute of the instance and
accordingly directs the machine to the next node until the machine
arrives at a leaf node. The value of the leaf node is emitted as the
final classification.

*Each path from the tree root to a leaf corresponds to a conjunction of
attribute tests, and the tree itself to a disjunction of these
conjunctions.* Since the conjunctions are mutually exclusive, this
disjunction can actually be an exclusive disjunction (XOR).

3.3 Appropriate problems for decision tree learning
---------------------------------------------------

- *Instances are represented by attribute-value pairs.*

- *The target function has discrete output values.*

- *Disjunctive descriptions may be required.* This makes decision
  trees highly interpretable.

- *The training data may contain errors.*

- *The training data may contain missing attribute values.*

The basic decision tree learning algorithm
------------------------------------------

ID3 (or *this* variation of it) is a recursive algorithm that passes
through all the instances and their attributes to find the attribute
which best classifies the data by itself. Then at each descendant node,
the process is repeated sans the previously considered attribute.
Eventually, all attributes will be considered and the algorithm is
guaranteed to halt.

The *best* attribute at each iteration is the one that maximizes
*information gain*. Information gain depends on *entropy*. Let's define
the entropy of a set $S$ containing values $i$ with frequencies $p_i$
(idealized probabilities):

$$Entropy(S) = - \sum p_i \log p_i = -\log \prod_i p_i^{p_i}$$

Entropy can be interpreted as the expected number of bits required to
optimally encode a stream of values $i$. (Confirm.)

The following is an adaptation of the ID3 algorithm.

$ID3(EXAMPLES, TARGET, ATTRIBUTES)$

- Add $ROOT$ node to $TREE$.

- If all elements are positive, label $ROOT$ node $+$ and return
  $TREE$.

- Else if all elements are negative, label $ROOT$ node $-$ and return
  $TREE$.

- Else if $ATTRIBUTES$ is empty, label $ROOT$ node with the most common
  $TARGET$ attribute value and return $TREE$.

- Else

  - $A$ <- the attribute from $ATTRIBUTES$ that best classifies
    $EXAMPLES$.

  - The decision attribute for $ROOT$ <- $A$.

  - For each possible value $v$ of $A$, add a branch below $ROOT$ for
    the test $A = v$, and let $EXAMPLES_v$ be the subset of $EXAMPLES$
    with value $v$ for $A$.

  - If $EXAMPLES_v$ is empty, add below the branch a leaf labeled with
    the most common $TARGET$ value in $EXAMPLES$; otherwise, add the
    subtree $ID3(EXAMPLES_v, TARGET, ATTRIBUTES - \{A\})$.

  - Return $TREE$.

--------------------------------------------------------------------------------
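To connect these chapter notes back to the ID3 discussion in Lesson 1, here is a minimal Python sketch of the entropy and information-gain computations that pick the *best* attribute (the four-example dataset is hypothetical, chosen purely for illustration):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_i p_i lg p_i over the label frequencies in S."""
    m = len(labels)
    return -sum((c / m) * math.log2(c / m) for c in Counter(labels).values())

def information_gain(examples, attribute, labels):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    m = len(examples)
    gain = entropy(labels)
    for v in set(x[attribute] for x in examples):
        subset = [y for x, y in zip(examples, labels) if x[attribute] == v]
        gain -= len(subset) / m * entropy(subset)
    return gain

# Hypothetical toy data: attribute 0 predicts the label perfectly,
# attribute 1 is pure noise.
X = [(1, 0), (1, 1), (0, 0), (0, 1)]
y = ["+", "+", "-", "-"]
print(information_gain(X, 0, y))  # 1.0 bit  (the best attribute)
print(information_gain(X, 1, y))  # 0.0 bits (a useless attribute)
```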