├── .gitignore
├── LICENSE
├── README.md
├── lessons
│   ├── ml_p1_lesson_00_intro.md
│   ├── ml_p1_lesson_01_decision_trees.md
│   ├── ml_p1_lesson_02_regression_intro.md
│   ├── ml_p1_lesson_03_neural_networks.md
│   ├── ml_p1_lesson_04_instance_based_learning.md
│   ├── ml_p1_lesson_05_ensemble_learning.md
│   ├── ml_p1_lesson_06_support_vector_machines.md
│   ├── ml_p1_lesson_07_computational_learning_theory.md
│   ├── ml_p1_lesson_08_vc_dimension.md
│   └── ml_p1_lesson_09_bayesian_learning.md
└── textbook
    ├── notes_ch01.md
    └── notes_ch03.md

/.gitignore:
--------------------------------------------------------------------------------
*.acn
*.acr
*.alg
*.aux
*.bbl
*.blg
*.dvi
*.fdb_latexmk
*.glg
*.glo
*.gls
*.idx
*.ilg
*.ind
*.ist
*.lof
*.log
*.lot
*.maf
*.mtc
*.mtc0
*.nav
*.nlo
*.out
*.pdf
*.pdfsync
*.ps
*.snm
*.synctex.gz
*.toc
*.vrb
*.xdy
*.tdo

# Temporary files used by editors.
*~
*.swp
.#*
\#*
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
This is free and unencumbered software released into the public domain.

Anyone is free to copy, modify, publish, use, compile, sell, or
distribute this software, either in source code form or as a compiled
binary, for any purpose, commercial or non-commercial, and by any
means.

In jurisdictions that recognize copyright laws, the author or authors
of this software dedicate any and all copyright interest in the
software to the public domain. We make this dedication for the benefit
of the public at large and to the detriment of our heirs and
successors. We intend this dedication to be an overt act of
relinquishment in perpetuity of all present and future rights to this
software under copyright law.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR
OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.

For more information, please refer to <https://unlicense.org>
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
cs7641
======

Notes from Georgia Tech's CS7641 and Tom Mitchell's "Machine Learning."
--------------------------------------------------------------------------------
/lessons/ml_p1_lesson_00_intro.md:
--------------------------------------------------------------------------------
ML is teh r0x
=============


Definition of ML
----------------

Machine learning theory is about how well data can be statistically
modeled. Machine learning in practice is about building machines which
can learn from data.

There are three types of ML: supervised learning, unsupervised learning,
and reinforcement learning.

Supervised learning
-------------------

Supervised learning gleans "information" from labeled data to label new
unlabeled data.
This is simply discrete function approximation.
Supervised learning/function approximation must make assumptions about
the given data in order to *generalize* to new data.

[Note on] Induction and deduction
---------------------------------

*Induction* takes specific examples to create a general rule. *Inductive
bias* is necessary to come up with "useful" generalizations. *Deduction*
takes a general rule and applies it in specific cases.

Unsupervised learning
---------------------

Given inputs with no labels, derive some structure using the
relationships between the inputs. Often a summarization of data into
clusters. In practice, unsupervised learning is useful even in
supervised contexts to gain insight into the data.

Reinforcement learning
----------------------

Learning from delayed supervision. For example: playing a [turn-based]
game [with potentially non-deterministic environmental factors] where
you find out whether you have won or lost only near the end. Somewhat
difficult.

Comparison of these parts of ML
-------------------------------

All learning is "optimization." SL wants to label data well. RL wants to
score well. UL has scientist-imposed criteria for correctness.
--------------------------------------------------------------------------------
/lessons/ml_p1_lesson_01_decision_trees.md:
--------------------------------------------------------------------------------
Lesson 1: Decision Trees
========================

Classification and regression
-----------------------------

In supervised learning, you are presented with instances (e.g. images of individuals) with labels (e.g. "BOY" or "GIRL") as training data. The task is to label new unlabeled instances (assuming each instance carries sufficient information).

* **Classification** -- labels are discrete values (often finite, often true or false).
    * Better definition [Shashir]: labels (the codomain of the target function) have no meaningful order.
* **Regression** -- labels are reals.
    * Better definition [Shashir]: labels have a meaningful order.


Classification learning
-----------------------

* **Instances** -- inputs (vectors of features).
* **Concept** -- a function which maps instances to labels (there are many concepts: |labels|^|instance space|).
    * [Shashir] Concepts are the set of functions mapping from the instance space to the label space.
* **Target concept** -- the function which maps instances to the **correct** labels.
    * [Shashir] The target concept is a specific concept which we wish to model.
* **Hypothesis** -- the set of concepts which we are willing to search through for the best approximation of the target concept.
    * [Shashir] A subset of the concept set. An easier space to search through, but it introduces *inductive bias*.
* **Sample (training set)** -- set of inputs with correct labels.
* **Candidate** -- the "best" concept chosen from the hypothesis set by the learning algorithm using the sample.
    * [Shashir] The element of the hypothesis set which best approximates the target concept according to our learning algorithm.
* **Testing set** -- set of instances with correct labels, similar to the training set, but used to measure how well the candidate performs on novel data.

The testing set should contain many examples not found in the training set.
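To make this vocabulary concrete, here is a minimal Python sketch (the bit-vector data, the majority target concept, and the single-bit hypothesis set are all hypothetical, chosen purely for illustration): a candidate is picked from a small hypothesis set using the sample, and a disjoint testing set measures how well it approximates the target concept on novel instances.

```python
import random

random.seed(0)

# Hypothetical instance space: 8-bit feature vectors.
# Hypothetical target concept: "the majority of bits are 1."
def target_concept(x):
    return sum(x) > len(x) // 2

instances = [tuple(random.randint(0, 1) for _ in range(8)) for _ in range(200)]
labeled = [(x, target_concept(x)) for x in instances]

# The sample (training set) and testing set are kept disjoint, so the
# testing set contains instances the learner never saw.
sample, testing_set = labeled[:150], labeled[150:]

def accuracy(h, data):
    return sum(h(x) == y for x, y in data) / len(data)

# A deliberately tiny hypothesis set: "look at a single bit."
# The candidate is the element that does best on the sample.
hypothesis_set = [lambda x, i=i: x[i] == 1 for i in range(8)]
candidate = max(hypothesis_set, key=lambda h: accuracy(h, sample))

print("sample accuracy: ", accuracy(candidate, sample))
print("testing accuracy:", accuracy(candidate, testing_set))
```

The gap between the two printed numbers is one crude measure of how well the candidate generalizes.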

Decision trees
--------------

A sequence of tests (a path of nodes starting from a root node) applied to every instance in order to arrive at its label (a leaf).

Decision trees: learning
------------------------

1. Pick the best attribute to split the data.
2. Ask a question about that attribute (test its possible values).
3. Follow the correct answer path.
4. Go to 1 until the possibilities have been narrowed to one answer.

How can this intuition be used to build a "tree" for all possible instances of the problem?

Decision trees: expressiveness
------------------------------

Consider the inclusive disjunction: **OR(a1, a2, ..., an)** (any). Note that the tree is "linear" and has a height of n.
```
--->a1--T-->a2--T-->...--T-->an--T-->TRUE
     |       |            |
     F       F            F
     |       |            |
     v       v            v
   FALSE   FALSE        FALSE
```

Next, consider the exclusive disjunction: **XOR(a1, a2, ..., an)** (odd parity). Note that the tree is a balanced tree of height n with O(2^n) nodes.

```
--->a1--T-->a2--T-->a3--T-->TRUE
    |       |       |
    |       |       F----->FALSE
    |       |
    |       F----->a3--T-->FALSE
    |              |
    |              F----->TRUE
    |
    F----->a2--T-->a3--T-->FALSE
           |       |
           |       F----->TRUE
           |
           F----->a3--T-->FALSE
                  |
                  F----->TRUE
```
It's better just to add the inputs as integers mod 2.

Decision trees: expressiveness (search space)
---------------------------------------------

Given n binary attributes, how many possible decision trees are there? 2^(2^n)

Intuition:
* There are 2^n possible configurations of the attributes.
* Each "unique" decision tree maps these 2^n configurations to a 2^n-sized bit vector.
* There are 2^(2^n) possible bit vectors of size 2^n.
* Therefore, there must be 2^(2^n) possible unique classifiers.
* Note that more than one tree may map to a single classifier, so the hypothesis space is even larger (thanks to inductive bias, we can cut down the problem significantly).

ID3 Algorithm
-------------

```
A <- best attribute from remaining attributes (initially, all attributes)
Assign A as decision attribute for Node
For each value of A, create a new descendant of Node
Sort training examples to leaves
If examples are perfectly classified, STOP.
Else if we ran out of attributes, STOP
Else, start over for each leaf (with corresponding set of training examples)
```

The **best attribute** is the one with the greatest information gain.

GAIN(S, A) = ENTROPY(S) - \sum_v \frac{|S_{A=v}|}{|S|}ENTROPY(S_{A=v})

Where ENTROPY is defined as:

ENTROPY(S) = -\sum_v p(v)\log p(v)

Equivalently, the **best attribute** is the one that splits the data into subsets whose entropies' weighted sum is the least (maximizing the information gain).

A^* = {\textrm{argmin}_A\textrm{ }} \sum_v|S_v|ENTROPY(S_v)


The **inductive bias** of the ID3 algorithm:
* The best splitters appear earlier (closer to the root).
* Produces shorter trees.
* Prefers correct classifiers over incorrect trees (thanks for that)


Decision trees: other considerations
------------------------------------

* How do we handle continuous attributes?
    * Use intervals.
    * Split age range 0-90 into 0-40 and 40-90 -- perhaps even use a modified ID3 to find the best splitting age.
* Does it make sense to repeat an attribute along a path in the tree?
    * No for finite-valued attributes.
    * However, continuous attributes can be tested with different questions.
    * No question needs to be asked twice.
* When do we stop?
    * Everything is classified correctly (or nearly correctly -- we do not want to **overfit**).
    * No more attributes.
    * Do not **overfit**.
        * Try not to have a tree which is too big.
        * Try many trees and use cross-validation.
        * Variant of cross-validation where you hold out a subset of the data and build a tree breadth-first on the remaining data. Stop when error is "low enough."
        * Build the whole tree and prune (vote if the classification is not perfect).
* Regression
    * Model outputs and group them (round off or cluster).
    * Report the average on leaves, or vote, or locally fit a line (hybrid).


Decision trees
--------------

We learned:
* Representation (tree... set of questions)
* ID3: a top down learning algorithm
* Expressiveness of DTs
* Bias of ID3
* "Best" attribute Gain(S, A)
* Dealing with overfitting.
--------------------------------------------------------------------------------
/lessons/ml_p1_lesson_02_regression_intro.md:
--------------------------------------------------------------------------------
Lesson 2: Regression & Classification
=====================================

What is regression?
-------------------

The term regression became overloaded over time. Initially it described how children of very tall people and children of very short people both tended to be a bit closer to the average height than their parents.

The "function" child_height(parent_height) looks like a line and has slope less than 1.


Linear regression
-----------------

It's possible to fit affine hyperplanes to data. If we transform the input vectors with squared or cubed terms, etc., we can also fit any higher-degree polynomial with the same technique.

We would like the best-fit "curve" (surface), the one that minimizes some error measure. However, there are certain inductive biases we can make. What degree polynomial should we fit? The higher the degree, the better we can fit a set of points, but the more "overfit" the curve will be.

When we have an overfit candidate function, though it produces no error on the training set, it will not generalize.


Performing linear regression
----------------------------

Let our instances be elements of R^n. Suppose we have a training set {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)} where the x_i are in R^n and the y_i are in R. We would like to find w such that x dot w approximates y.

Let w be a row vector of weights and let the x_i be column vectors. Define the matrix X = [x_1 x_2 x_3 ... x_m] (instances as columns) and Y = [y_1 ... y_m]. We want to find w such that wX = Y (approximately).

If we right-multiply both sides by X^T, we will very likely have an invertible matrix XX^T. So we arrive at:

w = Y(X^T)(XX^T)^-1

For details: http://en.wikipedia.org/wiki/Linear_regression

This formula minimizes the squared error (prove it by differentiating L(w) = ||Y - wX||^2).


Errors
------

All data has noise/errors arising from sensor error, malicious agents, transcription error, unmodeled influences, etc.
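Tying the last two sections together, here is a minimal numpy sketch of the normal-equation solution w = Y(X^T)(XX^T)^-1 (the data and true weights are synthetic, assumed purely for illustration); even with noisy targets, the recovered w stays close to the true weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# m = 50 instances in R^3, stored as columns of X (the convention above),
# with targets Y generated by a known weight vector plus a little noise.
n, m = 3, 50
X = rng.normal(size=(n, m))
true_w = np.array([2.0, -1.0, 0.5])
Y = true_w @ X + 0.1 * rng.normal(size=m)

# w = Y X^T (X X^T)^{-1}; in practice np.linalg.solve or lstsq is more
# numerically stable than forming an explicit inverse.
w = Y @ X.T @ np.linalg.inv(X @ X.T)

mse = np.mean((w @ X - Y) ** 2)
print("recovered w:", w, " training MSE:", mse)
```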

We should not overfit the training data, so as not to model the errors.

We want to train until we minimize the mean squared error (MSE) on the training data.

MSE = \frac{1}{n} \sum_i (w \cdot x_i - y_i)^2

Next we want to compute the error on various testing sets to make sure that the model is not so complicated that it overfits and doesn't generalize.

Cross validation
----------------

To check that the model generalizes well, hold aside a subset of the training data as "testing data," train on the remaining set, and test against the "testing data." Choose our inductive bias (picking the model or model complexity) such that when the model is trained on the training set, its error on the testing set is still minimal.


Other input spaces
------------------

Our instance elements need not all be reals (though the regression would be better under certain special circumstances). Ideally the values of each feature should have a natural order. Boolean vectors are often a good choice.
--------------------------------------------------------------------------------
/lessons/ml_p1_lesson_03_neural_networks.md:
--------------------------------------------------------------------------------
Lesson 3: Neural Networks
=========================

Artificial neural networks
--------------------------

Based on the biological neuron (cell body, axon, and synapses -- cell bodies fire signals down axons to synapses if their activation threshold is surpassed).

Perceptron
----------

The perceptron hypothesis boils down to the sign of the dot product of the weights and the input. Let's say all inputs have a zeroth component which is 1 and the zeroth component of the weight vector is the negative threshold.

h(\mathbf{w})(\mathbf{x}) = \{\mathbf{w \cdot x} \ge 0\}

The output is 1 if the dot product is nonnegative, 0 otherwise.

How powerful is a perceptron unit
---------------------------------

Perceptrons define a discriminant line which splits the plane in half (the analogy applies in higher dimensions also).

When the components of the input are boolean, the perceptron can define boolean logic such as AND and OR, but a single perceptron cannot represent XOR and others.

Given a vector <-theta, w_0, w_1>, be able to visualize the half-plane; and given a half-plane, be able to write the weights vector.

XOR as a perceptron network
---------------------------

A more complex discriminant "curve" can be contrived from a "network" of perceptrons.


Perceptron training
-------------------

Two ways to learn weights:

* Perceptron learning rule
* Gradient descent/delta rule

The perceptron learning rule defines how to update the weights in iterative stages. Over time, it is capable of converging on the best classifier on the training set if the training set is linearly separable. However, it will not converge if the data is not linearly separable.

```
eta <- learning rate
initialize(w) // Set every component w_i of w to some random initial value
do until satisfied:
    for each training datum (x, y) from the training set:
        y_hat <- w dot x // Compute the predicted output
        w <- w + eta * (y - y_hat) * x // Update the weights vector; (y - y_hat) is a scalar here, so this simply scales the vector x

```

This is online learning, and you can stop when the average error on the training set falls below some threshold.

If the data is "nearly" linearly separable, perhaps reduce the learning rate over time such that the discriminant line at least appears to converge?

Perceptron learning cannot be applied in the conventional sense to a multiple-layer neural network. It is only defined for a single layer.


Gradient descent
----------------

Is there a learning algorithm which is robust to data that is not linearly separable?

We would like to perform gradient descent on the total error on the training set such that we can arrive at the weights which minimize the error.

Let D be the training set with ordered pairs (x, y). Let y be the target. Instead of taking the sign, we will simply take the dot product of w and x (this is essentially simple linear regression). We want to minimize:

E({\mathbf w}) = \frac{1}{2}\sum_{({\mathbf x}, y) \in D} (y - {\mathbf w} \cdot {\mathbf x})^2

The partials with respect to the components of w:

\frac{\partial E}{\partial w_i} = -\sum_{({\mathbf x}, y) \in D} x_i(y - \mathbf{w} \cdot \mathbf{x})

The error decreases fastest in the direction opposite to the gradient. Walk in small steps opposite to the gradient.

This is robust to non-(linearly separable) data and converges in the limit (eta must be small so the iterates do not yo-yo around the well).

Comparison of learning rules
----------------------------

The perceptron learning rule uses the following weight update.

\Delta w_i = \eta(y - \{\mathbf{w \cdot x} \ge 0\})x_i

Learning a classifier using gradient descent on a linear regression model uses the following weight update.

\Delta w_i = \eta(y - \mathbf{w \cdot x})x_i

Sigmoid
-------

Why not do gradient descent on the original perceptron itself rather than remodeling it as linear regression? The step function is not differentiable. We can get around this with the sigmoid function (hyperbolic tan, arctan, logistic sigmoid).

\sigma(a) = \frac{1}{1 + e^{-a}}

Note that the derivative has a nice form:

D\sigma(a) = \sigma(a)(1 - \sigma(a))

Now we can use the chain rule to extend gradient descent to neurons with sigmoid activation functions.

Neural Networks
---------------

Hook up nodes which use the sigmoid of the dot product of the weights and the input. The input layer can be fed into the hidden layers and eventually into the output "layer" (a single node).

Every node can be trained based on its respective inputs and output using the gradient descent rule on the now differentiable activation function (the sigmoid). For each input, the output can be computed. The output layer can be trained first. Then the last hidden layer can be trained. Then the next, and so on. This careful bookkeeping is known as backpropagation.

With networks, there may be many local optima. We need to use clever techniques to arrive at a suitable set of weights.

Optimizing weights
------------------

How do we arrive at a suitable set of weights? It is an optimization problem.

* Use momentum terms in the gradient (simulated annealing?)
* higher order derivatives (Hamiltonians, etc.)
* randomized optimization (later on in the course)
* penalty for "complexity" (overfitting) -- too many hidden layers? too many nodes in a layer? large-valued weights (interesting)


Restriction bias
----------------

Restriction bias -- the representational power and the set of hypotheses we will consider.

Perceptrons are linear models (discriminant lines splitting planes). Networks of perceptrons can approximate more complex functions. Sigmoids allow even better learning and can fit more interesting functions.

* Boolean functions -- a network of threshold-like units.
* Continuous functions -- use sigmoid activations and a hidden layer.
* Arbitrary functions -- stitch together two or more functions with discontinuities using two hidden layers.

Overfitting is a problem with more complex networks.

* We can bound the architecture of the network.
* Cross validate to better bound the architecture or bound the weights.
* We should also find the number of iterations at which the network best generalizes.

Not really much of a restriction bias.


Preference bias
---------------

Preference bias -- which representation is preferred by the algorithm.

Generally we initialize the network with small random weights -- we prefer "simpler"/less complex representations. (This can hit a local minimum. We generally run the training multiple times to avoid local minima.)

Practically, this preference (Occam's Razor) produces simpler and more generalizable representations.

Summary
-------

* Perceptron (linear and thresholded)
* Perceptron learning rule
* Gradient descent
* Sigmoid function
* Restriction bias
* Preference bias
--------------------------------------------------------------------------------
/lessons/ml_p1_lesson_04_instance_based_learning.md:
--------------------------------------------------------------------------------
Lesson 4: Instance Based Learning
=================================

Instance based learning
-----------------------

Instance based learning covers machine learning models which perform "lazy learning" at scoring time, directly from the training data. This is different from other machine learning models, which lossily compress the training data into a simpler hypothesis function.

Some benefits:

* Remembers all training data points.
* Fast for known points (a lookup -- depending on the properties of the data structure used for storage).
* "Simple"

Possible problems:

* Overfits and generalizes badly
* Must handle conflicting training data

Since instance based learning often requires traversing the training data itself, it may be slow to score unknown inputs, but it is easily adaptable to new training data points.

k-Nearest Neighbors
-------------------

The number of neighbors k is a free parameter which must be learned.

Given training data D = {x_i, y_i}, a distance metric d(q, x), the required number of neighbors k, and the query point q:

NearestNeighbors(k)(D)(q) = {a_1, a_2, ..., a_k} -- the k points of D closest to q under d (the k smallest distances).

For classification: vote of the y_i, or a weighted vote, or some other strategy.
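A minimal Python sketch of this classification case (assuming a Euclidean metric and an unweighted majority vote; ties and the distance-conflict fudging mentioned below are ignored):

```python
from collections import Counter
import math

def knn_classify(D, q, k):
    """k-NN vote: D is a list of (x, y) pairs, q is the query point."""
    # Sort the training data by distance to q and keep the k nearest.
    neighbors = sorted(D, key=lambda xy: math.dist(xy[0], q))[:k]
    # Majority vote over the neighbors' labels.
    votes = Counter(y for _, y in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two clusters labeled "A" and "B".
D = [((0, 0), "A"), ((0, 1), "A"), ((5, 5), "B"), ((6, 5), "B")]
print(knn_classify(D, (1, 1), k=3))   # -> "A"
```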

For regression: take the mean or a weighted mean (weights are 1/distance).

If there are distance ties, "take all of them" the way college rankings do (this fudges the k).

Comparing learning and query times
----------------------------------

Given a list of n data points (x_i, y_i) sorted by x_i, what is the running time and space consumption of "learning" (assume no effort for ETL -- which makes little sense in practice) and "query" (do not consider the learned data structure's space consumption) for 1-NN, k-NN, and linear regression?

* 1-NN learning runs in O(1) and takes O(n) space.
* 1-NN query runs in O(log n) (through binary search) and takes O(1) space.
* k-NN learning runs in O(1) and takes O(n) space.
* k-NN query runs in O(k + log n) (binary search, then check the 2k adjacent points) and takes O(1) space.
* Simple linear regression learning runs in O(n) (due to the bound on the dimension of x_i, even the normal-equations solution is "linear" -- though this is a bit misleading) and takes O(1) space.
* Simple linear regression query runs in O(1) time and takes O(1) space.

Querying in instance based algorithms may be slower (depending on the data structure), though assuming no ETL costs, learning in instance based algorithms is faster. Moreover, incremental learning of new data points is possible and probably very fast.

IBL is more lazy about learning (it pushes the work off until query time). So k-NN is a lazy learner (or "just in time" learner) whereas linear regression is an eager learner.

K-NN Bias
---------

Preference bias -- when searching the hypothesis space, the search may "prefer" a particular subset of the hypothesis space over another. Perhaps the hypothesis space is not completely searched and rather only a subset is searched. E.g. simpler trees, simpler functions, Occam's razor. Compare to restriction bias, which is the total representational power of the hypothesis space itself.

What about k-NN?

* Locality -- near points are similar; how nearness is defined comes from domain knowledge. Moreover, distance is defined on the inputs, not the labels/outputs.
* Smoothness -- averaging produces smoothness as opposed to discontinuities.
* All features seem to matter "equally" -- this comes from the distance function too.

Curse of dimensionality
-----------------------

As the number of features or dimensions grows, the amount of data we need to generalize accurately grows exponentially. Comes from Richard Bellman, the dynamic programming guy.

Covering the "same space" in higher dimensions requires exponentially more points. ("Same space" is not well-defined.)

Applies in all of ML and not just k-NN.

Some other points
-----------------

Distance metrics -- e.g. Euclidean, Manhattan, weighted components, mismatches (for classification), etc.

Implications of k:

* When k is n, we get a constant function (assuming a vote or regular average).
* With a weighted average, the computation may not be so easy, but we get a "smooth"-ish hypothesis. Points closer have greater say.
* Perhaps instead of a weighted average, do "local" regression on the k closest points. Known as "locally weighted regression." We can use decision trees, neural nets, etc. there. More sophisticated curves and more complex hypotheses (removes some restriction bias, but may overfit).

What have we learned?
---------------------

* Instance based learning
* Eager and lazy learning -- lazy puts off work until needed, eager generally learns from training data ahead of time.
* k-NN
* nearest-neighbor; similarity (distance)
* Domain knowledge matters (distance)
* Classification versus regression
* "Averaging"/combining results.
* Composing learning algorithms together (locally weighted regression = k-NN composed with regression).
* Curse of dimensionality. Required data is O(2^d).
* No free lunch theorem(?) -- averaging over all possible data makes any learning algorithm no better than random. Domain knowledge helps.


--------------------------------------------------------------------------------
/lessons/ml_p1_lesson_05_ensemble_learning.md:
--------------------------------------------------------------------------------
Lesson 5: Ensemble Learning: Boosting
=====================================

Spam/whitelist rules ("from: spouse", "Nigeria", "has pr0n", etc.) apply to many emails, but not all. How do we generate rules? How do we combine rules? Generating rules is similar to decision trees. If you already know the rules, the combining step may be like neural networks.

Boosting
--------

Combine simple rules into a complex rule.

* Learn over a subset of the data to generate a simple rule.
* Learn over another subset of the data to generate another simple rule.
* ...
* ...
* Combine all rules.

Subsetting is non-trivial.

* Uniformly randomly pick a subset of data. Apply any learning algorithm to it.

Combining is non-trivial.

* Take the mean.

In the case where we pick single points and take the overall mean, the result is the same as n-NN with no weighting.

**Bagging/bootstrap aggregation** -- taking random subsets, learning, and taking the mean of the individual machines.

*Instead of picking subsets randomly, take advantage of what we are learning as we go along. In particular, pick examples we are bad at. Take the "hardest" subsets.*

How about using a weighted mean?

Previously we defined error in classification as the number of mismatches. Instead, define **error** as Pr_D(h(x) != c(x)), where D is the distribution from which x is acquired. This allows more probable examples to produce a greater error contribution than less likely examples. We want to be good at the common points.

**Weak learner** -- always does better than chance: for all D, Pr_D(h(x) != c(x)) <= 1/2 - \epsilon.

In some problems, there may exist distributions such that there is no weak learner. This may require an expansion of the hypothesis space (relieve the restriction bias).

Boosting in code
----------------

```
Given D = {(x_i, y_i)} where the y_i are -1 or +1
For t = 1 to T:
    construct D_t (a distribution)
    find a weak classifier h_t with small error e_t = Pr_{D_t}(h_t(x_i) != y_i) (???)
Output H_final
```

How to construct D_t?
---------------------

D_1(i) = 1/n (uniform)
D_(t+1)(i) = D_t(i)*e^(-a_t*y_i*h_t(x_i))/z_t
where a_t = 1/2 * ln((1-e_t)/e_t) and z_t normalizes D_(t+1) into a proper distribution

"Generally" this puts more probability on incorrect instances and less probability on correct instances. We have an issue if the error is exactly half (a_t = 0) or exactly 0 (a_t blows up).

How to combine smaller classifiers into H_final?
------------------------------------------------

Take a weighted vote using the a_t.

H_final(x) = sgn(sum_t a_t*h_t(x))

Hypotheses/ensemble methods
---------------------------

Compare to decision trees.

Compare to locally weighted linear regression (as a modification of k-NN), where you stitch simpler models into a complex one. This characterizes all ensemble methods.

Why is boosting good?
---------------------

Every iteration of boosting only "gains" more information by better classifying badly classified examples.

Ensemble learning
-----------------

* Ensembles are good
* Bagging is good
* Combines simple into complex
* Boosting is really good -- agnostic to the learner
* Weak learners
* Error based on distributions
* Cool points: boosting does not seem to overfit over arbitrary iterations in practice (or does it?)
--------------------------------------------------------------------------------
/lessons/ml_p1_lesson_06_support_vector_machines.md:
--------------------------------------------------------------------------------
Lesson 6: Support Vector Machines
=================================

The best line
-------------

The discriminant line that generalizes best is the farthest from the two classes ("commits least to the training data").

Support vector machine (massive handwaving)
-------------------------------------------

We are interested in maximizing the size of the margin.

The general linear classifier formula is: y = w^T x + b (here y is the classification).

The equation for the discriminant line is 0 = w^T x + b.

Translate this line up or down to get 1 = w^T x + b or -1 = w^T x + b.

We want w such that the distance between these two lines is maximal (and the choice of 1 is really arbitrary -- we just want w "up to" scaling).

Take the difference: w^T (x_1 - x_2) = 2 (where x_1 and x_2 are support vectors).

Divide by ||w||. Then w^T/||w|| (x_1 - x_2) = 2/||w||.

Now we have the difference of x_1 and x_2 projected onto the normalized w vector, and w is parallel to the vector from x_1 to x_2. Some handwaving magic. The left-hand side is the margin m.

m = 2/||w||

So, handwavily, the "distance" between the -1 line and the +1 line is 2/||w||, and we can maximize this by minimizing w's magnitude. What about direction?

Support vector machines
-----------------------

Maximize 2/||w|| while classifying everything correctly. The constraint is: y_i(w^T x_i + b) >= 1 for all i.

Alternatively, minimize 1/2 ||w||^2 subject to the same constraints.

This is a quadratic programming problem (which is a load of bullshit from shoddy under-formalized areas of math) which can be solved by jumping off a cliff.

Some properties of the solution:

* w = sum_i a_i y_i x_i and b = easy
* The a_i are mostly 0 (which means only some of the x_i matter).
* The data points are being dotted with each other. A similar direction means greater weight... the dot product measures how "similar" they are to each other.

Like k-NN, except we learn which points are important (as opposed to considering all neighbors).

SVMs: linearly married
----------------------

What if some points cannot be classified correctly? Have a cost for misclassification.

What if lines are a bad hypothesis?
Transform the data points and throw them into a higher-dimensional space where a line might be a better hypothesis.

Phi(q) = <q_1^2, q_2^2, sqrt(2)*q_1*q_2>

Phi(q) . Phi(p) = (q . p)^2 (which makes the classifier into a circle instead of a line).

Instead of computing Phi, just square the dot product -- **the kernel trick**.

Kernels
-------

We can replace the dot product with a kernel.

This implicitly throws the data points into a higher dimensional space and gets back a similarity.

Kernel choice requires domain knowledge.

Polynomial kernel: K(x, y) = (x^T y + c)^p

RBF kernel: K(x, y) = exp(-(||x - y||^2)/(2*sigma^2))

Kernels must satisfy the Mercer condition -- they must act like an inner product.

SVMs -- review
--------------

* margins ~ generalization and overfitting
* big is better
* optimization problem for finding max margins: quadratic programming and the dual problem
* support vectors
* the kernel trick, which extends the inner product
* Mercer condition

Back to boosting
----------------

Boosting doesn't seem to conventionally overfit (testing error keeps going down).

**Confidence** -- how confident is a classification? (The number of neighbors agreeing in k-NN, or the local variation in linear regression.)

Explain the difference between confidence and error.

We can examine each iteration of boosted classifiers by normalizing the argument of the sign. This is the confidence. Hard examples fall near the center (toward 0). As we boost further, the hard points also start having higher confidence. The margin keeps increasing and does not overfit.

Boosting will overfit if the individual learners always overfit.

**Pink noise** -- uniform noise. Boosting overfits.

**White noise** -- Gaussian noise.

Summary
-------

* Margins -- important to SVMs and to generalization
* big is better
* optimization problem for finding max margins: quadratic programming and the dual problem
* support vectors
* the kernel trick, which extends the inner product
* Mercer condition
--------------------------------------------------------------------------------
/lessons/ml_p1_lesson_07_computational_learning_theory.md:
--------------------------------------------------------------------------------
Computational Learning Theory
=============================

* Defining learning problems.
* Showing that specific algorithms work.
* Showing that these problems are fundamentally hard.

Resources in machine learning
-----------------------------

* Time
* Space
* Training samples

Defining inductive learning
---------------------------

* Probability of successful training: 1 - \delta
* Number of training examples: m
* Complexity of the hypothesis class.
* Accuracy to which the target concept is approximated: \epsilon
* Manner in which training examples are presented (especially matters in online training as opposed to batch).
* Manner in which training examples are selected.

Selecting training examples
---------------------------

1. Learner asks questions of the teacher.
2. Teacher gives examples to help the learner.
3. Some natural fixed distribution.
4. Evil distribution.

Teaching via 20 questions
-------------------------

H -- set of possible people (the hypotheses)
X -- set of questions

* The teacher needs to choose only one question (he knows the right answer).
* The learner needs log |H| questions (binary search over the hypotheses).

Teacher with constrained queries
--------------------------------

H: conjunctions of literals (a literal is a variable or its negation) -- but not negations of conjunctions (find the right conjunction)
X: x1x2x3...xk, k-bit inputs

To solve:
* Show what is irrelevant... two positive examples which are unperturbed by a changing feature.
* Show what is relevant... k negative examples which validate that perturbing a relevant feature matters.

Even though there are 3^k hypotheses (each variable positive, absent, or negated), the smart teacher can show just k + 2 examples.

Learner with constrained queries
--------------------------------

H: conjunctions of literals (as above)
X: x1x2x3...xk, k-bit inputs

The learner does not know the actual answer like the teacher does, so it does not know which training examples are the right ones to ask for.

Start enumerating every data point from 0,...,0 to 1,...,1 -- which is 2^k possibilities.

The first positive result helps us significantly.

Learner with mistake bounds
---------------------------

H: conjunctions of literals (as above)
X: x1x2x3...xk, k-bit inputs

1. Input arrives.
2. Learner guesses the answer.
3. A wrong answer is **charged**.
4. Go to 1.

The algorithm:

1. Assume each feature is both positive and negated.
2. Given an input, compute the output.
3. If wrong, set all positive features that were 0 to absent, and all negated features that were 1 to absent. Go to 2.

Never make more than k + 1 mistakes.

Definitions
-----------

* Learner chooses examples
* Teacher chooses examples
* Nature chooses examples
* Mean teacher chooses examples

**Computational complexity** -- how much computational effort is needed for a learner to "converge."
**Sample complexity** -- batch; how many training examples are needed for a learner to create a successful hypothesis.
**Mistake bounds** -- online; how many misclassifications can a learner make over an infinite run?

Version spaces
--------------

True/target hypothesis: c in H
Candidate hypothesis: h in H
Training set: S, a subset of X
**Consistent learner** -- produces h such that c(x) = h(x) for x in S (always produces a hypothesis that is consistent with the data).
**Version space** -- VS(S) = {h in H s.t. h is consistent with S} -- the hypotheses consistent with the examples.

PAC learning -- error of h
--------------------------

Training error -- fraction of training examples misclassified by h.
True error -- fraction of examples that would be misclassified on samples drawn from distribution D.

error_D(h) = Pr_{x from D}[c(x) != h(x)]

PAC learning
------------

* c -- concept class
* L -- learner
* H -- hypothesis space
* n -- |H|, the size of the hypothesis space
* D -- distribution over inputs
* 0 <= \epsilon <= 1/2 (our error goal)
* 0 <= \delta <= 1/2 (failure-probability/certainty goal -- with probability 1 - \delta, we must produce true error at most \epsilon)


PAC -- probably (1 - \delta) approximately (\epsilon) correct (error_D(h) = 0)!

"C is PAC-learnable by L using H iff learner L will, with probability 1 - \delta, output a hypothesis h in H such that error_D(h) <= \epsilon, in time and samples polynomial in 1/\epsilon, 1/\delta, and n."

Quiz: PAC-learnable
-------------------

C = H = {h_i(x) = x_i} (k-bit inputs)

There are k hypotheses. So n = |H| = k.

Pick a hypothesis uniformly from VS(S, H).

\epsilon-exhausted version space
--------------------------------

VS(S) is \epsilon-exhausted iff for all h in VS(S), error_D(h) <= \epsilon.

Every element in the version space has true error at most epsilon.

Haussler Theorem -- bound true error
------------------------------------

Let h1,...,hk be the hypotheses in H with error_D(h_i) > \epsilon -- the ones with high true error.

How much data do we need to knock out these hypotheses?

Pr_{x from D}(h_i(x) = c(x)) <= 1 - \epsilon
Pr(h_i consistent with c on m examples) <= (1 - \epsilon)^m
Pr(at least one of h1,...,hk consistent with c on m examples) <= k(1 - \epsilon)^m <= |H|(1 - \epsilon)^m

Note that (1 - \epsilon)^m <= exp(-\epsilon*m) (this comes from ln(1 - \epsilon) <= -\epsilon).

So we have Pr(at least one high-error hypothesis is consistent with c on m examples) <= |H|*exp(-\epsilon*m), and we want this <= \delta.

This is an upper bound on the probability that the version space is **not** \epsilon-exhausted after m samples, and we want \delta to bound it.

ln |H| - \epsilon * m <= ln \delta
m >= 1/\epsilon (ln |H| + ln (1/\delta)) (i.e. polynomial)

Satisfying this will satisfy PAC-learnability.

Quiz: PAC-learnable example
---------------------------

H = {h_i(x) = x_i} (where x is 10 bits)
\epsilon = 0.1
\delta = 0.2
D: uniform

1/0.1 * (ln 10 + ln (1/0.2)) = 10*(ln 10 + ln 5) = 10*ln 50 ≈ 39.1, so m = 40 samples suffice.

This bound is agnostic to the distribution of nature's data.

What we learned
---------------

* Teachers versus learners and their interaction
    - What is learnable? Like complexity theory for ML.
* Sample complexity -- data
* Types of interactions:
    - learner picks questions
    - teacher picks questions
    - nature picks questions
    - evil teacher picks questions
* Mistake bounds (as opposed to how many samples you need)
* PAC learning: version spaces, training/test/true error, distribution.
* m >= 1/\epsilon (ln |H| + ln (1/\delta)) (i.e. polynomial)
    - this assumed the target is in the hypothesis space
    - otherwise **agnostic**, and the bound is slightly different, but still polynomial.
    - an infinite hypothesis space can be a problem
--------------------------------------------------------------------------------
/lessons/ml_p1_lesson_08_vc_dimension.md:
--------------------------------------------------------------------------------
Lesson 8: VC Dimensions
=======================

Infinite hypothesis spaces
--------------------------

The Haussler theorem bound breaks with infinite hypothesis spaces.

Consider the example:

X: {1, 2, ..., 10}
H: h(x) = {x >= t, where t is real}

Though there are infinitely many hypotheses, there are only finitely many semantically distinct hypotheses (11).

Power of a hypothesis space
---------------------------

What is the largest set of inputs that the hypothesis class can label in all possible ways?

In the above hypothesis space, there is no way for a pair of data points to be labeled in all possible ways. It can, however, label **one** point in all possible ways (two ways).

What VC stands for
------------------

**Shattering** -- labeling in all possible ways.

Vapnik-Chervonenkis dimension -- the size of the largest set of inputs that the hypothesis class can shatter.

Quiz: interval training
-----------------------

X = the reals
H = {h(x) = {x in [a, b]} -- real intervals}

Has a VC dimension of 2.

Note on logic
-------------

To prove a VC dimension of at least d: there exists a set of d points such that for all labelings there exists a consistent hypothesis.
To prove it is less than d: not(there exist points such that for all labelings there exists a hypothesis) = for all points, there exists a labeling for which no hypothesis is consistent.

Quiz: linear separators
-----------------------

X = R^2
H = {h(x) = w^T x >= \theta}

Has a VC dimension of 3.

The rings
---------

The d-dimensional hyperplane has VC dimension d + 1.

Quiz: polygons (convex)
-----------------------

X: R^2
H: points inside some convex polygon

The VC dimension is infinite. (Put the points on a circle; for any labeling, use the convex polygon whose vertices are the positively labeled points.)

Sample complexity and VC dimension
----------------------------------

Update Haussler's theorem with the VC dimension:

m >= 1/\epsilon (8*VC(H)*lg(13/\epsilon) + 4*lg(2/\delta)) (infinite case)
m >= 1/\epsilon (ln |H| + ln (1/\delta)) (finite case)

VC dimension of finite H
------------------------

Upper bound: if d = VC(H), then there must exist 2^d distinct concepts (each labeling of the shattered set gets a different h).
So 2^d <= |H|, i.e. d <= lg |H|.

**Fundamental Theorem of Machine Learning**
H is PAC-learnable if and only if its VC dimension is finite.

Summary
-------
* VC dimension. Shattering.
* VC relates to hypothesis space parameters (the "true" number of parameters).
* VC relates to finite hypothesis space size.
* Sample complexity relates to VC dimension.
* VC dimension captures PAC-learnability.
--------------------------------------------------------------------------------
/lessons/ml_p1_lesson_09_bayesian_learning.md:
--------------------------------------------------------------------------------
Lesson 9: Bayesian Learning
===========================

Bayesian learning
-----------------

* Learn the **best** hypothesis given data and some domain knowledge.
* Learn the **most probable** hypothesis given data and domain knowledge.
(**best** == **most probable**)

argmax_{h in H} Pr(h | D)

Bayes rule
----------

P(h|D) = P(D|h)P(h)/P(D)

P(D) -- prior on the data
P(h) -- prior on h (this encapsulates our domain knowledge)
P(D|h) -- probability of the data given the hypothesis (in a world where the hypothesis is true, the likelihood of seeing this data)

Quiz: Bayes' rule
-----------------

A test for a disease: 98% true positive rate (2% false negatives); 97% true negative rate (3% false positives).

Only 0.8% (0.008) of the population has the disease.

P(T|P) = P(P|T)*P(T)/P(P) = .98*.008/.0376 ~ 0.209

P(not T|P) = P(P|not T)*P(not T)/P(P) = .03*.992/.0376 ~ 0.791

If the prior 0.008 were higher, the test would be more useful.


Bayesian learning
-----------------

For each h in H:

h_MAP = argmax_h P(h|D) = argmax_h P(D|h)P(h) (maximum a posteriori)
h_ML = argmax_h P(D|h) (maximum likelihood -- assume the prior P(h) is uniform)

Direct computation is not practical for large hypothesis spaces.

Bayesian learning in action
---------------------------
Assume:
* We have labeled training data {<x_i, d_i>} given as noise-free examples of c.
* c is in H.
* A uniform ("uninformed") prior over the hypothesis space.

P(h|D) = P(D|h)P(h)/P(D)

P(h) = 1/|H|
P(D|h) = {1 if d_i = h(x_i) for all i; 0 otherwise} = {1 if h is in VS(D)}
P(D) = sum_i P(D|h_i)P(h_i) (total probability)
     = sum_{h in VS_H(D)} 1 * 1/|H| = |VS|/|H|

P(h|D) = (1 * 1/|H|)/(|VS|/|H|) = 1/|VS|

This means, given a bunch of data, the probability of a particular hypothesis being correct is uniform over the elements of the version space.

Any element of the version space is good.

Bayesian learning with noise
----------------------------

Assume:
* We have labeled training data {<x_i, d_i>}.
* d_i = f(x_i) + e_i (where e_i is error)
* e_i ~ N(0, s^2) (i.i.d.)

h_ML = argmax_h P(D|h) = argmax_h product_i P(d_i | h)

Each factor is a Gaussian density: stuff*exp(-(d_i - h(x_i))^2/(2*s^2))... we can take the log for the purpose of the argmax.

h_ML = argmax_h sum_i -(d_i - h(x_i))^2 = argmin_h sum_i (d_i - h(x_i))^2

This is the sum of squared errors... derived from the Gaussian noise model and the maximum likelihood assumption.

Bayesian learning
-----------------

h_MAP = argmax P(D|h)P(h) = argmax [lg(P(D|h)) + lg(P(h))] = argmin [-lg(P(D|h)) - lg(P(h))]

An event with probability p has optimal description length -lg p.

"Minimizing length(D|h) + length(h)."

If the hypothesis fits the data well, the data is superfluous information (cheap to describe); otherwise, we need a lot of bits to describe it. So the first term corresponds to misclassifications/errors. The second term penalizes the complexity (size) of the hypothesis.

**Minimum description length**

Bayesian classification
-----------------------

The consensus of the non-MAP or non-ML hypotheses may differ from the MAP or ML hypotheses. Vote?

In some sense we are trading out single hypotheses for "boosted" superhypotheses.

value_MAP = argmax_v sum_h P(v|h)P(h|D) (weighted vote of h in H) -- the **Bayes Optimal Classifier**

Summary
-------

* Bayes rule (swap "cause and effect")
    + P(h|D) ~ P(D|h)P(h)
* priors matter
* h_MAP, h_ML
* derived least squares from the Gaussian-noise h_ML
* best classification is a consensus of all classifiers: the **Bayes Optimal Classifier** (the best classification you can possibly do)

--------------------------------------------------------------------------------
/textbook/notes_ch01.md:
--------------------------------------------------------------------------------
Chapter 1: Introduction
=======================

1.1 Well-posed learning problems
--------------------------------

Given a machine, let $P(T, E)$ be the performance of that machine at
task $T$ given experience $E$. If $P(T, E') > P(T, E)$, then the machine
has **learned** more from experience $E'$ than from experience $E$.

An alternative interpretation:

If $E' > E$ implies $P(T, E') > P(T, E)$, then the machine has
**learned**.
--------------------------------------------------------------------------------
/textbook/notes_ch03.md:
--------------------------------------------------------------------------------
Chapter 3: Decision tree learning
=================================


3.1-3.2 Decision tree representation
------------------------------------

The usage of decision trees as classifiers is obvious. Given an
instance, the machine starts at the root node. At each node of the tree,
the machine tests the value of some attribute of the instance and
accordingly directs the machine to the next node until the machine
arrives at a leaf node. The value of the leaf node is emitted as the
final classification.

*Each path from the tree root to a leaf corresponds to a conjunction of
attribute tests, and the tree itself to a disjunction of these
conjunctions.* Since the conjunctions are mutually exclusive, this
disjunction can actually be an exclusive disjunction (XOR).

3.3 Appropriate problems for decision tree learning
---------------------------------------------------

- *Instances are represented by attribute-value pairs.*

- *The target function has discrete output values.*

- *Disjunctive descriptions may be required.* This makes decision
  trees highly interpretable.

- *The training data may contain errors.*

- *The training data may contain missing attribute values.*

The basic decision tree learning algorithm
------------------------------------------

ID3 (or *this* variation of it) is a recursive algorithm that passes
through all the instances and their attributes to find the attribute
which best classifies the data by itself. Then at each descendant node,
the process is repeated sans the previously considered attribute.
Eventually, all attributes will be considered and the algorithm is
guaranteed to halt.

The *best* attribute at each iteration is the one that maximizes
*information gain*. Information gain depends on *entropy*. Let's define
the entropy of a set $S$ containing values $i$ with frequencies $p_i$
(idealized probabilities):

$$Entropy(S) = - \sum p_i \log p_i = -\log \prod_i p_i^{p_i}$$

Entropy can be interpreted as the expected number of bits required to
optimally encode a stream of values $i$. (Confirm.)

The following is an adaptation of the ID3 algorithm.

$ID3(EXAMPLES, TARGET, ATTRIBUTES)$

- Add $ROOT$ node to $TREE$.

- If all elements are positive, label $ROOT$ node $+$ and return
  $TREE$.

- Else if all elements are negative, label $ROOT$ node $-$ and return
  $TREE$.

- Else if $ATTRIBUTES$ is empty, label $ROOT$ node with the most common
  $TARGET$ attribute value and return $TREE$.

- Else

  - $A$ <- the attribute from $ATTRIBUTES$ that best classifies
    $EXAMPLES$.

  - The decision attribute for $ROOT$ <- $A$.

  - For each possible value $v$ of $A$, add a branch below $ROOT$ for
    the test $A = v$, and let $EXAMPLES_v$ be the subset of $EXAMPLES$
    with value $v$ for $A$.

  - If $EXAMPLES_v$ is empty, add below the branch a leaf labeled with
    the most common $TARGET$ value in $EXAMPLES$; otherwise, add the
    subtree $ID3(EXAMPLES_v, TARGET, ATTRIBUTES - \{A\})$.

  - Return $TREE$.

--------------------------------------------------------------------------------
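To connect these chapter notes back to the ID3 discussion in Lesson 1, here is a minimal Python sketch of the entropy and information-gain computations that pick the *best* attribute (the four-example dataset is hypothetical, chosen purely for illustration):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_i p_i lg p_i over the label frequencies in S."""
    m = len(labels)
    return -sum((c / m) * math.log2(c / m) for c in Counter(labels).values())

def information_gain(examples, attribute, labels):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    m = len(examples)
    gain = entropy(labels)
    for v in set(x[attribute] for x in examples):
        subset = [y for x, y in zip(examples, labels) if x[attribute] == v]
        gain -= len(subset) / m * entropy(subset)
    return gain

# Hypothetical toy data: attribute 0 predicts the label perfectly,
# attribute 1 is pure noise.
X = [(1, 0), (1, 1), (0, 0), (0, 1)]
y = ["+", "+", "-", "-"]
print(information_gain(X, 0, y))  # 1.0 bit  (the best attribute)
print(information_gain(X, 1, y))  # 0.0 bits (a useless attribute)
```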