├── .gitignore ├── README.md ├── cifar_notes.pdf ├── cifar_notes.tex ├── working_notes.pdf └── working_notes.tex /.gitignore: -------------------------------------------------------------------------------- 1 | *~ 2 | *.aux 3 | *.dvi 4 | *.log 5 | *.out 6 | *.txt -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | notes-on-neural-networks 2 | ======================== 3 | 4 | Rough working notes on neural networks. 5 | 6 | As of December 11, 2013 I've migrated the notes to another repository (not yet public, it's still 7 | getting constructed as I merge various things together, I hope to make it public). 8 | -------------------------------------------------------------------------------- /cifar_notes.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mnielsen/notes-on-neural-networks/4104b3175516550335282ea0a4aeb936bd4fe6c1/cifar_notes.pdf -------------------------------------------------------------------------------- /cifar_notes.tex: -------------------------------------------------------------------------------- 1 | \documentclass[12pt]{report} 2 | 3 | \usepackage{hyperref} 4 | 5 | \newcommand{\link}[2]{\href{#1}{#2}} 6 | 7 | 8 | \begin{document} 9 | 10 | \title{Notes on neural networks --- CIFAR material} 11 | \author{Michael Nielsen\thanks{Email: mn@michaelnielsen.org}$^{,}$\thanks{Web: http://michaelnielsen.org/ddi}} 12 | 13 | \maketitle 14 | 15 | \chapter{Introduction} 16 | 17 | \textbf{Working notes, by Michael Nielsen:} These are rough working 18 | notes, written as part of my study of neural networks, especially work 19 | on CIFAR. Note that they really are \emph{rough}, and I've made no 20 | attempt to clean them up, nor do I plan to. They contain 21 | misunderstandings, misinterpretations, omissions, and outright errors. 22 | As such, I don't advise others to read the notes, and certainly not to 23 | rely on them! 24 | 25 | \chapter{Papers} 26 | 27 | \section{Dahl, Sainath, Hinton 2013} 28 | 29 | This is about acoustic models, studied using rectified linear units 30 | and dropout. Punchline: they get a 4.2 percent improvement over 31 | sigmoid units by using ReLUs and dropout. 32 | 33 | Speech recognition used to be done by hidden Markov models. Now 34 | replaced by deep nets. TIMIT (small-scale phone recognition), LVSR 35 | (large-scale task). Dropout as similar to denoising auto-encoders. 36 | ``ALthough dropout is trivial to incorporate into minibatched SGD the 37 | best way of adding it to 2nd order optimization methods is an open 38 | research question.'' They find an undesirable interaction between HF 39 | optimizer and SGD. They used the Bayesian optimizer to find the right 40 | hyper-parameters. 41 | 42 | 43 | \section{KSH configuration} 44 | 45 | 46 | Note that the bias initialization parameter initB was not set anywhere 47 | in the KSH configuration. That means it defaults to 0. 48 | 49 | \textbf{Layer 1} 50 | \begin{itemize} 51 | \item Convolutional 52 | \item 3 channels. 53 | \item 32 filters 54 | \item Padding of 2. Pads the images on the outside with a 2-pixel border. 55 | \item Stride length of 1. 56 | \item Filter size is 5 by 5. 57 | \item initW=0.0001. The initial standard deviation. I'm surprised by how 58 | low this is --- much lower than I would have guessed. I wonder if 59 | there's any benefit to increasing it? 60 | \item partialSum=4. No idea what this means. 
The docs don't really say. 61 | \item sharedBiases=1. According to the docs, ``indicates that the biases 62 | of every filter in this layer should be shared amongst all 63 | applications of that filter.'' This is a little unclear. Does it 64 | mean that all filters have the same bias? 65 | \item Fully linear layer. 66 | \end{itemize} 67 | 68 | \textbf{Layer 2} 69 | 70 | + Pooling layer 71 | + Uses maxpooling 72 | + start=0. Where to start pooling. This is just the default, which 73 | is to start pooling where you'd expect (the top left). 74 | + sizeX=3. Pool 3 x 3 regions. 75 | + stride=2. The stride length. 76 | + outputsX=0. This is an unimportant default; if not equal to 0 the 77 | output would only cover part of the image. 78 | + channels=32. Presumably to correspond to the filters in the last 79 | layer. 80 | + neuron=relu 81 | 82 | \textbf{Layer 3} 83 | 84 | + Convolutional layer 85 | + 32 filters output, 32 channels input. 86 | + 5 by 5 filters. 87 | + Stride length of 1 88 | + Initial weight SD = 0.01 89 | + Rectified linear units 90 | + sharedBiases=1 91 | + partialSum=4 92 | 93 | \textbf{Layer 4} 94 | + Pooling layer 95 | + Average pooling 96 | + 3 x 3 pooling windows 97 | + Stride length 2 98 | 99 | \textbf{Layer 5} 100 | + Convolutional layer 101 | + 32 input channels, 64 output filters 102 | + 5 x 5 filters 103 | + Padding by 2 pixel border 104 | + Stride length of 1 105 | + Initial weight SD = 0.01 106 | + Rectified linear units 107 | + sharedBiases=1 108 | + partialSum=4 109 | 110 | \textbf{Layer 6} 111 | + Pooling layer, 64 input channels, 64 outputs 112 | + Average pooling 113 | + 3 x 3 pooling windows. 114 | + Stride length 2 115 | 116 | \textbf{Layer 7} 117 | + Fully connected layer 118 | + 64 outputs 119 | + Initial weight SD = 0.1 120 | + Rectified linear units 121 | 122 | \textbf{Layer 8} 123 | + Fully connected layer 124 | + 10 outputs 125 | + Initial weight SD = 0.1 126 | + Linear neurons 127 | 128 | \textbf{Layer 9} 129 | + Softmax layer, producing 10 outputs 130 | 131 | Cost function: logistic regression on the Softmax outputs. 132 | 133 | 134 | \textbf{Learning parameters} 135 | 136 | \textbf{Layer 1 (first convolutional layer):} 137 | + Weight learning rate: 0.001 138 | + Bias learning rate: 0.002 139 | + Weight and bias momentum: 0.9 140 | + Weight decay 0.004. Note there is no bias decay. 141 | 142 | Note that in the docs Krizhevsky explicitly gives the update rule. As I read it, the momentum term multiplies the previous weight \emph{increment}, not the weight itself: 143 | 144 | (weight increment)' = (weight momentum) * (weight increment) - (weight decay) * (weight learning rate) * w 145 | + (weight learning rate) * gradient, and then w' = w + (weight increment)'. 146 | 147 | The bias rule is the same, but there is no bias weight decay. 148 | 149 | 150 | \textbf{Layer 3 (second convolutional layer):} 151 | 152 | Same as layer 1. 153 | 154 | \textbf{Layer 5 (third convolutional layer):} Same as layer 1. 155 | 156 | 157 | \textbf{Layer 7 (first fully connected layer):} Learning rates as for 158 | convolutional layers, and weight decay of 0.03. 159 | 160 | \textbf{Layer 8 (final layer):} Same as first fully connected layer. 161 | 162 | Krizhevsky notes that rescaling the overall cost function has the 163 | effect of changing the effective overall learning rate. 164 | 165 | \section{Snoek, Larochelle, and Adams (2012)} 166 | 167 | ``In this work we consider the automatic tuning problem within the 168 | framework of Bayesian optimization... The tractable posterior 169 | distribution... leads to efficient use of the information gathered by 170 | previous experiments....
we show how the effects of the Gaussian 171 | process prior and the associated inference procedure can have a large 172 | impact on the success or failure of Bayesian 173 | optimization... thoughtful choices can lead to results that exceed 174 | expert-level performance in tuning machine learning algorithms.'' 175 | 176 | They do it not just for neural nets but for a whole bundle of 177 | algorithms. Of course, it's especially important for neural nets, 178 | since they have so many hyper-parameters. 179 | 180 | ``... these high-level parameters are often considered a nuisance, 181 | making it desirable to develop algorithms with as few of these `knobs' 182 | as possible. Another, more flexible take on this issue is to view the 183 | optimization of high-level parameters as a procedure to be 184 | automated.'' 185 | 186 | ``For continuous functions [like the cost function, one presumes], 187 | Bayesian optimization typically works by assuming the unknown function 188 | [which?] was sampled from a Gaussian process (GP) and maintains a 189 | posterior distribution for this function as observations are made.'' 190 | 191 | What I think this means is: set up Gaussians on our hyper-parameters. 192 | Then sample, and look to see the cost on the validation data. 193 | 194 | We have a function $f(x)$ on a bounded subset of $R^D$. We're going 195 | to construct a probabilistic model of $f(x)$. The idea is to use the 196 | information we get from evaluations of $f(x)$ to improve our model --- 197 | and to choose where to evaluate next. ``This results in a procedure 198 | that can find the minimum of difficult non-convex functions with 199 | relatively few evaluations, at the cost of performing more computation 200 | to determine the next point to try.'' 201 | 202 | Two choices: a prior over functions. They choose the Gaussian process 203 | prior. I'm not quite sure what this means in this context. Second, 204 | they choose an acquisition function, to construct a utility function 205 | from the model posterior. Not sure what this means. 206 | 207 | Gaussian process. Suppose we have a set of points $x_n$ in our 208 | domain. T 209 | 210 | \section{Wan et al (2013) -- ``Regularization of Neural Networks using 211 | DropConnect''} 212 | 213 | \subsection{Summary of the main points} 214 | 215 | \begin{itemize} 216 | \item Dropout means randomly deleting half the neurons 217 | when training. 218 | 219 | \item DropConnect means randomly deleting half the connections when 220 | training. 221 | 222 | \item Note that the output is defined as the \emph{average} output 223 | over the sampled networks, not the full network. 224 | 225 | \item There is a nice linear algebraic way of representing DropConnect 226 | and Dropout, using Hadamard products, which no doubt helps in 227 | implementations. 228 | 229 | \item In actual fact, they don't literally implement DropConnect. 230 | Rather, they analyse what the distribution of weighted sums would 231 | be, and approximate by a Gaussian, before sampling. I don't see why 232 | they do this (it may be faster), but in some sense we can use this 233 | as a definition. I'd probably prefer just to sample. No idea why 234 | they don't. 235 | 236 | \item They claim that the regularization is greatly helped by using 237 | small mini-batches, ideally mini-batch size $1$ (online learning). 238 | 239 | \item The code is available. They used cuda-convnet for convolutional 240 | and softmax steps. 
The DropConnect implementation is a bit 241 | convoluted --- worth reading about the problems they had, though. 242 | It certainly seems worth storing the masks as bits or ints, not 243 | floats. 244 | 245 | \item Used mini-batch SGD with momentum on batches of 128 images, and 246 | momentum fixed at 0.9. Not clear how this relates to the above 247 | comments about online learning. They augment the dataset (cropping, 248 | flipping, scaling and rotation); train 5 independent networks with 249 | random permutations; manually decrease the learning rate using a 250 | validation set; train using Dropout, DropConnect, or neither. Use 251 | 1,000 samples. Use a bias learning rate twice the weight learning 252 | rate. Weights are N(0, 0.1) for fully connected layers, and N(0, 253 | 0.01) for convolutional layers. 254 | 255 | \item The learning schedule is fascinating. ``We report three numbers 256 | of epochs, such as 600-400-200 to define our schedule. We multiply 257 | the initial rate by 1 for the first such number of epochs. Then we 258 | use a multiplier of 0.5 for the second number of epochs followed by 259 | 0.1 again for this second number of epochs. The third number of 260 | epochs is used for multipliers of 0.05, 0.01, 0.005, and 0.001 in 261 | that order, after which point we report our results. We determine 262 | the epochs to use for our schedule using a validation set to look 263 | for plateaus in the loss function, at which point we move to the 264 | next multiplier.'' 265 | 266 | \item CIFAR-10: Subtract the per-pixel mean computed over the training 267 | set. Then use KSH's 3-layer convolutional net. Follow with a 64-unit 268 | fully connected layer to which DropConnect etc.\ may be applied. No 269 | data augmentation. 150-0-0 epochs, a single model, with an initial 270 | learning rate of 0.0001, and KSH's weight decay (0.995, I believe). 271 | DropConnect prevents overfitting a little better than Dropout. 272 | 273 | \item CIFAR-10: More advanced results. Using 2 conv layers, 2 locally 274 | connected layers, per KSH. 128-neuron fully connected layer with 275 | ReLU activations between softmax and feature extractor. Images are 276 | cropped to 24 by 24 to get more data. Initial learning rate: 0.001, 277 | and train for 700-300-50 epochs with KSH's weight decay. Model 278 | voting helps a \emph{lot}, getting error rate 9.41 percent. This 279 | can be improved to 9.32 percent by using 12 networks. 280 | 281 | \end{itemize} 282 | 283 | Add a note: data augmentation works nearly as well. We should push on that. 284 | 285 | \subsection{Other notes} 286 | 287 | ``When training with Dropout, a randomly selected subset of 288 | activations are set to zero within each layer. DropConnect instead 289 | sets a randomly selected subset of weights within the network to 290 | zero.'' 291 | 292 | As with Dropout, DropConnect is essentially a method of 293 | regularization, to prevent the network from overtraining. ``In 294 | practice, using these [regularization] techniques when training big 295 | networks gives superior test performance to smaller networks trained 296 | without regularization.'' 297 | 298 | On Dropout: ``Although a full understanding of its mechanism is 299 | elusive, the intuition is that it prevents the network weights from 300 | collaborating with one another to memorize the training examples.'' 301 | 302 | ``Like Dropout, [DropConnect] is suitable for fully connected layers only.'' 303 | 304 | I don't really see why. Does something go wrong if we apply it to a 305 | convolutional net? I don't see why something analogous couldn't be done. (A toy sketch of the two masking schemes follows.)
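Here is a minimal numpy sketch of a single fully connected layer under each kind of masking, following the quoted definitions (mask the activations for Dropout, mask the individual weights for DropConnect). The function names and the drop probability of 0.5 are mine; this is only an illustration, not the authors' implementation.

\begin{verbatim}
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def dropout_layer(w, b, a, p=0.5, rng=np.random):
    # Dropout: zero a random subset of the layer's *activations*.
    m = rng.binomial(1, 1 - p, size=b.shape)
    return m * relu(np.dot(w, a) + b)

def dropconnect_layer(w, b, a, p=0.5, rng=np.random):
    # DropConnect: zero a random subset of the *weights* (connections).
    M = rng.binomial(1, 1 - p, size=w.shape)
    return relu(np.dot(M * w, a) + b)

rng = np.random.RandomState(0)
w, b, a = rng.randn(3, 5), rng.randn(3), rng.randn(5)
print(dropout_layer(w, b, a, rng=rng))
print(dropconnect_layer(w, b, a, rng=rng))
\end{verbatim}

In training a fresh mask would be drawn for every example, which is the point made in the quote about mask selection further below.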
306 | 307 | 308 | We can rewrite Dropout as $a \rightarrow \sigma(m \odot (wa+b))$, 309 | where $\odot$ is the Hadamard product, and $m$ is a binary mask 310 | vector, chosen according to an appropriate Bernoulli distribution. A 311 | similarly nice expression can be obtained for DropConnect. (This 312 | seems likely to help in implementations.) 313 | 314 | \textbf{Architecture:} A CNN, followed by a DropConnect layer, 315 | followed by a SoftMax, and a cross-entropy loss. 316 | 317 | Note that the output value can be viewed as the result of sampling a 318 | large number of different (though overlapping) neural networks. 319 | 320 | ``A key component to successfully training with DropConnect is the 321 | selection of a different mask for each training example. Selecting a 322 | single mask for a subset of training examples, such as a mini-batch of 323 | 128 examples, does not regularize the model enough in practice.'' 324 | 325 | They define the output as the result of averaging over all 326 | DropConnected networks. Note that this seems likely to be superior to 327 | using the entire network (i.e., with no weights deleted). 328 | 329 | They do some odd things involving Gaussian moment matching to sample. 330 | I don't see \emph{why} they need to do this, I must admit. But it 331 | does give a reasonably nice way of approximating the network. 332 | Alternatively, one could view it as the definition of DropConnect. 333 | 334 | 335 | \textbf{Q: How do Dropout and DropConnect fare in a sparse network?} 336 | My guess is that they'll show very interesting behaviour. 337 | 338 | \chapter{Short reviews: what do we know about nonlinearities?} 339 | 340 | In this chapter I take a very quick and not in-depth look at what is 341 | known about various nonlinearities. 342 | 343 | DasGupta and Schnitger (1994): They want to compare activation 344 | functions as a function of size and number of layers. And they want to 345 | figure out when two activation functions have essentially the same 346 | approximating power. ``Our results show that good approximation 347 | performance... hinges on two properties, namely efficient approximation 348 | of polynomials and efficient approximation of the binary threshold.'' 349 | I have a lot of trouble believing the former; I wonder if it is an 350 | artifact of their analysis. The latter seems interesting. 351 | 352 | Jarrett et al (2009): ``We show that using non-linearities that 353 | include rectification and local contrast normalization is the single 354 | most important ingredient for good accuracy on object recognition 355 | benchmarks.'' Also, ``[H]ow do the non-linearities that follow the 356 | filter banks influence the recognition accuracy. The surprising answer 357 | is that using a rectifying non-linearity is the single most important 358 | factor in improving the performance of a recognition system. This 359 | might be due to several reasons: a) the polarity of features is often 360 | irrelevant to recognize objects, b) the rectification eliminates 361 | cancellations between neighboring filter outputs when combined with 362 | average pooling. Without a rectification what is propagated by the 363 | average down-sampling is just the noise in the input. Also introducing 364 | a local normalization layer improves the performance.
It appears to 365 | make supervised learning considerably faster, perhaps because all 366 | variables have similar variances (akin to the advantages introduced by 367 | whitening and other decorrelation methods)'' 368 | 369 | Karlik and Olgac (2009): Investigated a few special cases. 370 | 371 | Nair and Hinton (2010): Done in the context of Boltzmann machines. 372 | They consider noisy rectified linear units (NReLUs), which have output 373 | $\max(0, x+ N(0, \sigma(x))$, where $N$ denotes a Gaussian random 374 | variable, as per usual. Not so clear that it's relevant to us. 375 | 376 | Tan, Teo, and Anthony (2011): 377 | \link{http://link.springer.com/article/10.1007\%2Fs10462-011-9294-y}{link} 378 | Investigated a few special cases. 379 | 380 | Question: What should we bound? What class of nonlinearities should 381 | we allow? 382 | 383 | \chapter{Queue} 384 | 385 | LeCun 2013. 386 | 387 | Snoek, Larochelle, Adams. 388 | 389 | Model voting. 390 | 391 | Hinton Dropout paper. 392 | 393 | Bengio's dropout paper. 394 | 395 | ReLU. 396 | 397 | \end{document} -------------------------------------------------------------------------------- /working_notes.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mnielsen/notes-on-neural-networks/4104b3175516550335282ea0a4aeb936bd4fe6c1/working_notes.pdf -------------------------------------------------------------------------------- /working_notes.tex: -------------------------------------------------------------------------------- 1 | \documentclass[12pt]{report} 2 | 3 | \usepackage{hyperref} 4 | 5 | \newcommand{\link}[2]{\href{#1}{#2}} 6 | 7 | 8 | \begin{document} 9 | 10 | \title{Notes on neural networks} 11 | \author{Michael Nielsen\thanks{Email: mn@michaelnielsen.org}$^{,}$\thanks{Web: http://michaelnielsen.org/ddi}} 12 | 13 | \maketitle 14 | 15 | \chapter{Introduction} 16 | 17 | \textbf{Working notes, by Michael Nielsen:} These are rough working 18 | notes, written as part of my study of neural networks. Note that they 19 | really are \emph{rough}, and I've made no attempt to clean them up, 20 | nor do I plan to. They contain misunderstandings, misinterpretations, 21 | omissions, and outright errors. As such, I don't advise others to 22 | read the notes, and certainly not to rely on them! 23 | 24 | \textbf{Core questions:} There is a practical, narrow question: what 25 | are the most significant results about deep learning and neural 26 | networks? And then there is the broader question: how to build an 27 | artificial intelligence? My reading will address both questions. 28 | 29 | \chapter{Papers} 30 | 31 | \section{Hopfield (1982)} 32 | 33 | What I like about this paper is the condensed matter physicist's point 34 | of view. Hopfield asks ``whether the ability of large collections of 35 | neurons to perform computational tasks may in part be a spontaneous 36 | collective consequence of having a large number of interacting simple 37 | neurons''. He goes on to give an explanation of how a type of memory 38 | can be constructed in pretty much this way. It's an inspiring point 39 | of view. 40 | 41 | \section{Bourland and Kamp (1988)} 42 | 43 | \link{http://scholar.google.com/scholar?cluster=17784424506773259343\&hl=en\&as\_sdt=0,5}{link} 44 | 45 | Suggests removing nonlinearity in output. Motivation: since we're 46 | trying to recover the original input, claims that it's obviously not a 47 | good idea to have the nonlinearity. 
I don't see that this is true: if 48 | the inputs are normalized to be between 0 and 1 then there shouldn't 49 | be any problem. 50 | 51 | With this constraint, the problem then is to find $w, b$ and $w', b'$ 52 | minimizing: 53 | \begin{eqnarray} 54 | \sum_x \|w' \sigma(wx+b)+b'-x\|^2, 55 | \end{eqnarray} 56 | where the sum is over all input vectors $x$. Let $X$ be the matrix 57 | whose columns are the training vectors. Abusing notation, let $b$ and 58 | $b'$ be matrices whose columns are $b$ and $b'$, respectively. Then 59 | matrix whose columns are the outputs is given by $Y = 60 | w'\sigma(wX+b)+b'$, where we apply $\sigma$ elementwise to the 61 | matrix $wX+b$. The quadratic loss function can then be written: 62 | \begin{eqnarray} 63 | \| w'\sigma(wX+b)+b'-X\|^2, 64 | \end{eqnarray} 65 | where $\|\cdot\|$ is here the usual Frobenius matrix norm. 66 | 67 | \section{Blum and Rivest (1989)} 68 | 69 | Show that it's NP-complete to train a three node neural network. 70 | Apparently built on earlier work by Judd, who showed this for a 71 | general neural network; indeed, Judd showed that even approximating a 72 | function is NP-complete. Blum and Rivest use a very particular 73 | architecture: $n$ inputs, a 2-neuron hidden layer, and a single output 74 | neuron. They use a perceptron model, although I doubt that is 75 | essential. The idea is simply to take a (supervised) training 76 | problem, and to ask whether there exist weights so that the output 77 | from the network are consistent with the training data. They show 78 | that this problem is NP-complete. They contrast this with a 79 | single-layer perceptron, which can be trained in polynomial time, 80 | using linear programming. They comment that their technique does not 81 | apply to sigmoidal neurons, but that Judd's does. 82 | 83 | \section{Williams and Zipser (1989)} 84 | 85 | \link{http://scholar.google.ca/scholar?cluster=1352799553544912946\&hl=en\&as\_sdt=0,5}{(link)} A gradient-based learning method for recurrent neural networks. 86 | 87 | Claims that feedforward networks don't have the "ability to store 88 | information for later use". It'd be nice to understand what that 89 | means. Obviously there's a trivial sense in which feedforward 90 | networks can store information based on training data. 91 | 92 | Claims that backprop requires lots of memory when used with large 93 | amounts of training data. I don't believe this, except in the trivial 94 | sense that it may take a lot of memory to store all the training data. 95 | Otherwise, we can compute gradients 96 | training-instance-by-training-instance, and sum the results, which is 97 | not especially memory intensive. (Of course, one may have a huge 98 | network which requires a lot of memory to story. But that's a 99 | separate issue.) 100 | 101 | Their model of recurrent neural networks is interesting. Basically, 102 | we have a set of neurons, each with an output. And we have a set of 103 | inputs to the network. There is a weight between every pair of 104 | neurons, and from each input to each neuron. To compute a neuron's 105 | output at time $t+1$ we compute the weighted sum of the inputs and the 106 | outputs at time $t$, and apply the appropriate nonlinear function 107 | (sigmoid, or whatever). Note that in order for this description to 108 | make sense we must specify the behaviour of the external inputs over 109 | time. We can incorporate a bias by having an external input which is 110 | always $1$. 
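A minimal sketch of that forward dynamics may be useful. The variable names are mine, and this is only the update just described (with the bias folded in as an always-on input), not Williams and Zipser's learning procedure.

\begin{verbatim}
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def step(W_rec, W_in, y, x):
    # y: all neuron outputs at time t; x: external inputs at time t.
    # A constant 1 is appended to x, so one column of W_in acts as a bias.
    x = np.append(x, 1.0)
    return sigmoid(np.dot(W_rec, y) + np.dot(W_in, x))

rng = np.random.RandomState(0)
n_neurons, n_inputs = 4, 2
W_rec = rng.randn(n_neurons, n_neurons)    # weight between every pair of neurons
W_in = rng.randn(n_neurons, n_inputs + 1)  # weight from each input (+ bias column)
y = np.zeros(n_neurons)
for x in [np.array([0.0, 1.0]), np.array([1.0, 1.0])]:
    y = step(W_rec, W_in, y, x)            # outputs at t+1 from inputs/outputs at t
print(y)
\end{verbatim}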
111 | 112 | So a recurrent neural network is just like a feedforward network, with 113 | a weight constraint: the weights in each layer are the same, over and 114 | over again. Also, the inputs must be input to every layer in the 115 | network. 116 | 117 | Williams and Zipser take as their supervised training task the goal of 118 | getting neuron outputs to match certain desired training values at 119 | certain times. For instance, you could define a two-neuron network 120 | that will \emph{eventually} produce the XOR of the inputs. 121 | 122 | They define the total error to be the sum over squares of the errors 123 | in individual neuron outputs. And we can then do ordinary gradient 124 | descent with that error function. They derive a simple dynamical 125 | system to describe how to improve the weights in the network using 126 | gradient descent. 127 | 128 | The above algorithm assumes that the weights in the network remain 129 | constant for all time. Williams and Zipser then modify the learning 130 | algorithm, allowing it to change the weights at each step. The idea 131 | is simply to compute the total error at any \emph{given} time, and 132 | then to use gradient descent with that error function to update the 133 | weights. (Similar to online learning in feedforward networks.) 134 | 135 | Williams and Zipser describe a method of \emph{teacher-forcing}, 136 | modifying the neural network by replacing the output of certain 137 | neurons by the \emph{desired} output, for training purposes in later 138 | steps. 139 | 140 | Unfortunately, it is still unclear to me \emph{why} one would wish to 141 | use recurrent neural networks. Williams and Zipser describe a number 142 | of examples, but they don't seem compelling. 143 | 144 | The algorithm in which the weights can change seems non-physiological 145 | --- it verges on being an unmotivated statistical model. (I doubt 146 | that the weights in the brain swing around wildly, but I'll bet that 147 | the weights found by this algorithm can swing around wildly.) The 148 | algorithm in which the weights are fixed seems more biological. 149 | 150 | Note that Williams and Zipser \emph{do not} offer any analysis of 151 | running time for their algorithms, or an understanding of when it is 152 | likely to work well, and when it is not. It's very much in the 153 | empirical let's-see-how-this-works style adopted through much of the 154 | neural networks literature. 155 | 156 | Summing up: the recurrent neural network works by, at each step, 157 | computing the sigmoid function of the weighted sum of the inputs and 158 | the previous step's outputs. Training means specifying a set of 159 | desired outputs at particular times, and adapting the weights at each 160 | time-step. Training works by specifying an error function at any 161 | given time step, computing the gradient, and updating the weights 162 | appropriately. 163 | 164 | 165 | \section{Baldi and Hornik (1989)} 166 | 167 | \link{http://scholar.google.ca/scholar?cluster=11637720331851320383&hl=en&as_sdt=0,5}{(link)} 168 | This characterizes linear autoencoders. We have a three-layer 169 | network, and the output is related to the input by $x \rightarrow 170 | ABx$, where $B$ describes the first layer of weights, and $A$ the 171 | second layer. The goal is to find weight matrices $A$ and $B$ to 172 | minimize: 173 | \begin{eqnarray} 174 | \sum_x \|x-ABx\|^2. 
175 | \end{eqnarray} 176 | The challenge is that the hidden layer has a \emph{smaller} number $h$ 177 | of neurons than the input layer (which is, of course, of the same size 178 | as the output layer)\footnote{It's not quite clear to me what $h$ 179 | should parameterize. I'll use it to parameterize the number of 180 | dimensions in the vector space representing outputs from the hidden 181 | units. It seems likely that it'd be better to write $2^h$, but I'll 182 | ignore that.}. Let me try an attack on this without reading the 183 | paper. That sum above is just: 184 | \begin{eqnarray} 185 | \mbox{tr}((I-AB)^2 \Sigma), 186 | \end{eqnarray} 187 | where $\Sigma \equiv \sum_x x x^T$. To minimize this what we want to 188 | do is obvious (and easily proven): we'll choose $A$ and $B$ so that 189 | $AB$ is a $h$-dimensional projector onto the span of the eignenvectors 190 | of $\Sigma$ with the $h$ largest eigenvalues. Let $P(\Sigma, h)$ 191 | denote such a projector, so: 192 | \begin{eqnarray} 193 | AB = P(\Sigma, h). 194 | \end{eqnarray} 195 | We can easily characterize such $A$ and $B$. $A$ should take the 196 | space $P(\Sigma, h)$ into the space spanned by the outputs from the 197 | hidden units, and $B$ should then undo that transformation. There is 198 | an orthogonal freedom inbetween time, and a possible freedom in 199 | $P(\Sigma, h)$. This completely characterizes $A$ and $B$. 200 | 201 | Summing up, in a linear neural network, \emph{a linear autoencoder is 202 | just doing principal components analysis}. So \emph{a non-linear 203 | autoencoder can be thought of as a non-linear generalization of 204 | PCA}. That's a useful fact to remember. Examination of the 205 | remainder of the paper suggests that these are the key facts. 206 | 207 | \section{Olshausen and Field (1996)} 208 | 209 | Presents a method for finding low-complexity representations of 210 | natural images, in terms of atomic images --- which they call ``sparse 211 | codes'' --- which are localized, oriented, and scale-sensitive. These 212 | are found using an unsupervised learning algorithm with a bias toward 213 | good quality, low-complexity representations. The codes seem to be 214 | quite similar to the receptive fields found in the human visual 215 | system. 216 | 217 | The \emph{receptive field} for a cell in the retina is the volume of 218 | space (roughly, a cone) which can stimulate that cell to fire. Nearby 219 | cells can have overlapping (or nearby) receptive fields. Other cells 220 | in the visual cortex also have receptive fields, but they may be more 221 | complex, since the light has already been filtered through one or more 222 | levels of processing. 223 | 224 | The paper claims that the receptive fields in the primary visual 225 | cortex are: (a) spatially localized; (b) oriented; and (c) can 226 | distinguish structure at different scales. 227 | 228 | There is then a question: so what are those receptive fields? In a 229 | way, we can view this as being the question: to what type of images do 230 | different cells in our primary visual cortex respond? Answering that 231 | question seems like a good start for understanding any higher-level 232 | image processing. It's the question: what are the atoms of image 233 | processing? Or perhaps a better way is to think of them as the 234 | molecules of image processing, since they're one level up from the 235 | pixel level. 
236 | 237 | They develop an unsupervised learning algorithm which, trained on 238 | natural data, can find receptive fields that are spatially localized, 239 | oriented, and can distinguish structure at different scales. 240 | 241 | Olshausen and Field want to decompose an image as: 242 | \begin{eqnarray} 243 | I(x,y) = \sum_j a_j \phi_j(x,y). 244 | \end{eqnarray} 245 | The idea is that the $\phi_j$ form a (possibly overcomplete) basis for 246 | the space of images. They want to choose the $\phi_j$ which ``results 247 | in the coefficient values being as statistically independent as 248 | possible over an ensemble of natural images''. In some sense, the 249 | different $a_j$ would be ``telling us different things'' about the 250 | image. They also want the coefficient values to be sparse, favouring 251 | simple representations over more complex. 252 | 253 | O \& F try to search for a suitable set of $\phi_j$s by introducing an 254 | error function: 255 | \begin{eqnarray} 256 | E = -\mbox{[preserve information]}-\lambda\mbox{[sparseness of } a_j {]}. 257 | \end{eqnarray} 258 | This error is \emph{for a single image}. The first term is just the 259 | $l_2$ error, i.e., (minus) the quadratic distance between the image 260 | and its representation. The sparseness term is just a nonlinear 261 | function of the $a_j$ coefficients, quantifying how sparse they are. 262 | 263 | The idea is to do online learning with this error function, presenting 264 | it with natural images, and gradually minimizing the error. (I see 265 | later in the article that it was actually batch learning using 266 | conjugate gradient descent. It appears that some kind of average 267 | error is being computed.) The result will be an overcomplete basis 268 | set that favours sparse decompositions of images. 269 | 270 | The ``sparsification'' idea is a very interesting one. Basically, 271 | it's a way of trying to force a kind of Occam's razor into the system. 272 | It's a bit like autoencoders, forcing a simple explanation of complex 273 | data. 274 | 275 | O \& F note that wavelets have been used to find sparse codes 276 | previously. 277 | 278 | 279 | \section{LeCun (1998)} 280 | 281 | \link{http://yann.lecun.com/exdb/publis/index.html\#lecun-98}{link} 282 | 283 | Reviews the classic two-part architecture: a feature extraction 284 | module, followed by a trainable classifier module. Points out that 285 | the real goal is to shunt as much as possible out of the feature 286 | extraction module and into the classifier module, since the first 287 | requires hand-engineering, while the second is (much more) automated. 288 | 289 | Makes the remarkable claim that the difference in error between test 290 | and training set scales as $(h/N)^\alpha$, where $h$ is a measure of 291 | how complex a classifier we're using, $N$ is the number of training 292 | examples, and $0.5 < \alpha < 1$. In other words, the error grows as 293 | the complexity of the machine grows. And it shrinks as the number of 294 | training samples grows. I wonder why this is the case? Could we come 295 | up with a model that more or less proves that this is the case? Maybe 296 | a renormalization argument? 297 | 298 | ``The fact that local minima do not seem to be a problem for 299 | multi-layer neural networks is somewhat of a theoretical mystery'': 300 | This is strange. Maybe it's the case that it's very hard to fall down 301 | into such local minima in high dimensions? 
I've personally had 302 | problems with very simple training data, but as soon as the training 303 | data and network become at all complex, those problems seem to vanish. 304 | This presumably means that ``most'' local minima are pretty darn good. 305 | 306 | The \emph{segmentation} problem: the problem of cutting up a string of 307 | characters. Notes that a nice heuristic is to try lots and lots of 308 | different cuts, and for each possible cut to score the cut by using 309 | the individual character classifier: if that classifier seems to be 310 | working well, then chances are that you have a good cut. 311 | 312 | The authors note that existing systems are based on hand-crafted 313 | feature extractors, but that they will not use hand-crafted features. 314 | 315 | MNIST: constructed by combining NIST Special Database 3 (SD-3) and 316 | Special Database 1 (SD-1). Apparently, NIST designated SD-3 as a 317 | training set, and SD-1 as a test set. But the two are actually very 318 | different from one another. SD-3 is a clean data set, taken from 319 | Census Bureau employees, while SD-1 is not so good, being taken from 320 | high-school students. They describe some details of how MNIST was 321 | constructed. I'll review a few particularly striking facts. First, 322 | each character is size normalized, while preserving aspect ratio, and 323 | centred. There was also anti-aliasing going on. So this can all be 324 | regarded as pre-processing of features. The database was prepared in 325 | three forms. One was the form I know. A second was a deslanted 326 | form. The third reduced the image resolution. 327 | 328 | Deslanting: The idea was to compute moments of inertia, and then to 329 | recenter things (vertically), while downsampling to 20 by 20. As 330 | we'll see below this significantly improves performance. 331 | 332 | Convolutional networks: They use local receptive fields, shared 333 | weights, and spatial sub-sampling. ``With local receptive fields, 334 | neurons can extract elementary visual features such as oriented edges, 335 | end-points, corners (or similar features in other signals such as 336 | speech spectrograms). These features are then combined by the 337 | subsequent layers in order to detect higher-order features.'' 338 | ``... elementary feature detectors that are useful on one part of the 339 | image are likely to be useful across the entire image. This knowledge 340 | can be applied by forcing a set of units, whose receptive fields are 341 | located at different places on the image, to have identical weight 342 | vectors.'' 343 | 344 | ``Units in a layer are organized in planes within which all the units 345 | share the same set of weights''. So the basic idea is to convolve a 346 | small window of weights over the original inputs. We call this a 347 | ``feature map''. I think Hinton later calls it a kernel(?) We will 348 | typically have several different feature maps. So what we have is a 349 | convolution stage. For example, we might have a 5 by 5 feature map. 350 | This is applied to a 5 by 5 receptive field in the input, i.e., a 5 by 351 | 5 area in the input. Each unit has 25 inputs, and so 25 weights and a 352 | bias. (A minimal sketch of this shared-weight convolution is given below.)
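As a concrete picture of those 25 shared weights, here is a toy numpy version of a single feature map: one 5 by 5 weight window and one bias, slid over every 5 by 5 patch of the image. It is my own sketch of the idea (stride 1, no padding, squashing function omitted), not code from the paper.

\begin{verbatim}
import numpy as np

def feature_map(image, w, b):
    # One feature map: the *same* 5x5 weights and single bias are applied
    # at every location, so only 26 parameters are trained in total.
    k = w.shape[0]
    H, W = image.shape
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(w * image[i:i+k, j:j+k]) + b
    return out

rng = np.random.RandomState(0)
image = rng.randn(28, 28)
w, b = rng.randn(5, 5), 0.0
print(feature_map(image, w, b).shape)   # (24, 24)
\end{verbatim}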
``all the units in a feature map share the same set of 25 353 | weights and the same bias so they detect the same feature at all 354 | possible locations on the input.'' ``The other feature maps in the 355 | layer use different sets of weights and biases, thereby extracting 356 | different types of local features.'' In LeNet-5 there are 6 feature 357 | maps. Note that a squashing function and bias apparently are used --- 358 | this wasn't apparent earlier, where the focus is on the convolution. 359 | Note that the feature map output will respect translations of the 360 | original image. 361 | 362 | Sub-sampling: The intuition is that exact location information is not 363 | necessary. ``Not only is the precise position of each of those 364 | features [identified by the feature maps] irrelevant for identifying 365 | the pattern, it is potentially harmful because the positions are 366 | likely to vary for different instances of the character.'' ``A simple 367 | way to reduce the precision with which the position of distinctive 368 | features are encoded in a feature map is to reduce the spatial 369 | resolution of the feature map. This can be achieved with a so-called 370 | sub-sampling layers [\emph{sic}] which performs a local averaging and 371 | a sub-sampling, reducing the resolution of the feature map, and 372 | reducing the sensitivity of the output to shifts and distortions.'' 373 | In LeNet-5 they use sub-sampling layers, which perform a kind of 374 | local averaging and sub-sampling. Basically, they use six 2 by 2 375 | feature maps, one for each of the previous six feature maps. ``Each 376 | unit computes the \emph{average} of its four inputs, multiplies it by 377 | a trainable coefficient, adds a trainable bias, and passes the result 378 | through a sigmoid function''. It's notable here that we don't have 379 | trainable weights in the ordinary fashion. It's also notable that 380 | things aren't overlapping in this case, unlike the local receptive 381 | fields. Possibilities for this layer: blurring, local max, local min. 382 | (Depends on parameter values.) 383 | 384 | Architecture: ``Successive layers of convolutions and sub-sampling are 385 | typically alternated...'' Traces the origins of the idea to Hubel and 386 | Wiesel and to Fukushima. It sounds as though the main new thing here 387 | is to try it out with backprop. The paper also describes some 388 | previous applications of convolutional neural networks to image and 389 | speech recognition. 390 | 391 | ``Since all the weights are learned with back-propagation, 392 | convolutional networks can be seen as synthesizing their own feature 393 | extractor.'' Big advantage of reducing the number of parameters: it 394 | reduces overfitting. 395 | 396 | LeNet-5: 7 layers, not counting the input. 32 by 32 inputs. Note 397 | that the characters are themselves 20 by 20 pixels centered in a 28 by 398 | 28 field. 399 | 400 | Layer C3 (third layer, convolutional): 16 feature maps. Each unit in 401 | each feature map is connected to several 5 by 5 neighbourhoods at 402 | identical locations in a subset of S2's feature maps. ``Why not 403 | connect every S2 feature map to every C3 feature map?'' (1) Reduce 404 | the number of connections; (2) Forces a break in symmetry in the 405 | network. My guess is that it would otherwise work, but might be 406 | slower. (An illustrative sketch of this kind of partial connectivity is given below.)
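The paper gives an explicit table of which S2 maps feed each C3 map; I haven't copied it here, but the mechanism is easy to sketch: each C3 map convolves over, and sums across, only its own subset of the S2 maps. The subsets and sizes below are made up purely for illustration.

\begin{verbatim}
import numpy as np

def valid_corr(img, k):
    # Plain 'valid' cross-correlation, as in the feature-map sketch above.
    n = k.shape[0]
    H, W = img.shape
    return np.array([[np.sum(k * img[i:i+n, j:j+n])
                      for j in range(W - n + 1)] for i in range(H - n + 1)])

# Made-up connectivity: each C3 map reads from a subset of the S2 maps.
subsets = [(0, 1, 2), (1, 2, 3), (2, 3, 4, 5), (0, 1, 4, 5)]

rng = np.random.RandomState(0)
s2_maps = [rng.randn(14, 14) for _ in range(6)]
c3_maps = []
for subset in subsets:
    kernels = [rng.randn(5, 5) for _ in subset]
    total = sum(valid_corr(s2_maps[s], k) for s, k in zip(subset, kernels))
    c3_maps.append(total)                  # bias and squashing omitted
print(len(c3_maps), c3_maps[0].shape)      # 4 maps, each 10 x 10
\end{verbatim}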
``Different feature maps are forced to extract different 407 | (hopefully complementary) features because they get different sets of 408 | input.'' 409 | 410 | Layer C5: 120 feature maps. Each unit is connected to a 5 by 5 411 | neighbourhood on all 16 of S4's feature maps. They state that this 412 | amounts to a full connection between S4 and C5 --- this is true 413 | because each feature unit is just a single unit. 414 | 415 | They use a scaled hyperbolic tangent as the squashing function. ``As 416 | seen before, the squashing function used in our Convolutional Networks 417 | is $f(a) = A \tanh(Sa)$. Symmetric functions are believed to yield 418 | faster convergence [i.e., learn at a faster rate], although the 419 | learning can become extremely slow if the weights are too small. The 420 | cause of this problem is that in weight space the origin is a fixed 421 | point of the learning dynamics, and, although it is a saddle point, it 422 | is attractive in almost all directions''. It seems likely to me that 423 | we will have a similar problem with the usual sigmoid function. They 424 | chose their parameters to ensure $f(\pm 1) = \pm 1$, i.e., for 425 | convenience. ``This particular choice of parameters is merely a 426 | convenience, and does not affect the result.'' 427 | 428 | They initialize weights with the inverse of the fan-in, omitting the 429 | square root that I am accustomed to use. ``The standard deviation of 430 | the weighted sum scales like the square root of the number of inputs 431 | when the inputs are independent, and it scales linearly with the 432 | number of inputs if the inputs are highly correlated. We choose to 433 | assume the second hypothesis since some units receive highly 434 | correlated signals.'' The second clause in the first sentence is 435 | simply false, since the weights are set independently of the inputs. 436 | It's interesting that their method apparently works okay anyway, i.e., 437 | it must be quite insensitive to this detail. 438 | 439 | Final layer in the network: Euclidean Radial Basis functions (RBF), 440 | one for each class (i.e., 10 in total), with 84 inputs. The output is 441 | the squared Euclidean distance between the inputs and the input 442 | weights. In other words, the RBF measures how close the input is to 443 | the weights. Fascinatingly, the initial values for these were set by 444 | hand, based on very simple versions of ASCII characters. 445 | 446 | ``[O]utput units... must be off most of the time. This is quite 447 | difficult to achieve with sigmoid units.'' Not sure why. 448 | 449 | Learning schedule: $\eta = 0.0005$ for the first two epochs, $0.0002$ 450 | for the next three, $0.0001$ for the next three, $0.00005$ for the 451 | next four, and $0.00001$ for the remaining epochs (up to 20, so it was 452 | eight). 453 | 454 | Distortions: ``When distorted data was used for training, the test 455 | error rate dropped to 0.8 percent (from 0.95 percent without 456 | deformation).'' It'd be nice to have a nice little library of 457 | transformations. 458 | 459 | Linear classifier: 12 percent error rate. When deslanted, gets 8.4 460 | percent error rate. ``Various combinations of sigmoid units, linear 461 | units, gradient descent learning, and learning by directly solving 462 | linear systems gave similar results''. ``A simple improvement of the 463 | basic linear classifier was tested. The idea is to train each unit of 464 | a single-layer network to separate each class from each other class. 
465 | In other words, there are ${10 \choose 2} = 45$ units. There is still 466 | a need to have a final decision procedure, and they simply chose the 467 | class which beat the largest number of other classes. ``The error 468 | rate on the regular test set was 7.6\%''. 469 | 470 | Baseline nearest neighbor classifier: Using Euclidean distance between 471 | input images. ``On the regular test set the error rate was 5.0\%. On 472 | the deslanted data, the error rate was 2.5\%, with $k = 3$.'' 473 | 474 | PCA: Computes the projection of the input pattern on the 40 principal 475 | components. ``The 40-dimensional feature vector was used as the input 476 | of a second degree polynomial classifer.'' ``The error on the regular 477 | test set was 3.3\%.'' 478 | 479 | Radial basis functions: Error rate of 3.6\%. 480 | 481 | One-hidden layer neural network: Error was 4.7\% for a network with 482 | 300 hidden units. Interesting: this is worse than my results, even 483 | when I'm using mean-square error (I get some improvement from using 484 | cross-entropy). I don't know why. My initialization is somewhat 485 | different. Otherwise, I can't think of any reason. They get a 486 | reduction to 4.5\% for a network with 1000 hidden units(!) They did 487 | even bettter with distortions: 3.6\% and 3.8\%, with 300 and 1000 488 | hidden units, respectively. When deslanted images were used, the test 489 | error dropped to 1.6\%, with 300 hidden units. Raises the question of 490 | why we don't get terrible overfitting, just on parameter counting 491 | grounds. 492 | 493 | Two-hidden layer neural network: ``The test error rate of a 494 | 784-300-100-10 network was 3.05\%, a much better result than the 495 | one-hidden layer network [4.7\%], obtained using marginally more 496 | weights and connections.'' This doesn't accord with my experience 497 | using basic backprop. Rather, it's like their results now match up 498 | with mine for both a single and two-hidden layer. (Admittedly, I do 499 | get an improvement --- albeit more modest --- if pretraining is used). 500 | However, I'm using both the cross-entropy and different weight 501 | initialization. So identical results wouldn't be expected. 502 | Increasing the network size to 784-1000-150-10 improved things only a 503 | tiny bit, to 2.95\%. Training with distorted patterns improved things 504 | to 2.5\% and 2.45\%, respectively. 505 | 506 | LeNet-1: A small convolutional net. It got 1.7\% test error rate. 507 | ``The fact that a network with such a small number [2,600] of 508 | parameters can attain such a good error rate is an indication that the 509 | architecture is appropriate for the task.'' 510 | 511 | Boosting: This is a technique which sounds like an idea I've been 512 | wondering about: concentrating more on training data which the network 513 | is misclassifying. 514 | 515 | Tangent distance classifier: This is an interesting idea. The idea is 516 | to consider the tangent plane near a digit image, where we're 517 | considering a (low-dimensional) submanifold generated by distortions 518 | and translations of the images. ``An excellent measure of `closeness' 519 | for character images is the distance between their tangent planes, 520 | where the set of distortions used to generate the planes includes 521 | translations, scaling, skewing, squeezing, rotation, and line 522 | thickness variations''. They use this measure of distance to run a 523 | nearest-neighbor method classifier. 
They get an error rate of 1.1\%, 524 | which is (obviously) excellent. 525 | 526 | Support vector machines: Depending on technique, results obtained 527 | varied between 1.4\% and 0.8\%. 528 | 529 | They report on the number of operations required to do a 530 | classification, and the convolutional networks do quite well. Much 531 | better than the SVMs, interestingly enough, perhaps because the SVMs 532 | are fitting high-order polynomials, and thus have a very large number 533 | of terms. 534 | 535 | ``When plenty of data is available, many methods can attain 536 | respectable accuracy. The neural-net methods run much faster and 537 | require much less space than memory-based techniques. The neural 538 | nets' advantage will become more striking as training databases 539 | continue to increase in size.'' 540 | 541 | Invariance and noise resistance: ``Convolutional networks are 542 | particularly well suited for recognizing or rejecting shapes with 543 | widely varying size, position, and orientation, such as the ones 544 | typically produced by heuristic segmenters in real-world string 545 | recognition systems. In an experiment like the one described above, 546 | the importance of noise resistance and distortion invariance is not 547 | obvious. The situation in most real applications is quite different. 548 | Characters must generally be segmented out of their context prior to 549 | recognition. Segmentation algorithms... often leave extraneous marks 550 | in character images... or sometimes cut characters too much and 551 | produce incomplete characters. Those images cannot be reliably 552 | size-normalized and centered. Normalizing incomplete characters can 553 | be very dangerous. For example, an enlarged stray mark can look like 554 | a genuine 1.'' 555 | 556 | Conclusions: ``Convolutional Neural Networks have been shown to 557 | eliminate the need for hand-crafted feature extractors. Graph 558 | Transformer Networks have been shown to reduce the need for 559 | hand-crafted heuristics, manual labeling, and manual parameter tuning 560 | in document recognition systems.'' ``It was shown that all the steps 561 | of a document analysis system can be formulated as graph transformers 562 | through which gradients can be back-propagated.'' ``It is worth 563 | pointing out that data generating models... and the Maximum Likelihood 564 | Principle were not called upon to justify most of the architectures 565 | and training criteria described in this paper.'' 566 | 567 | 568 | \section{Ferret rewiring (Nature, 2000)} 569 | 570 | The primary visual cortex has what are called orientation modules. 571 | These are groups of cells that share a preferred ``stimulus 572 | orientation''. It's not clear to me what a stimulus orientation is, 573 | exactly --- do they mean the direction the stimulus comes from? I'll 574 | get back to that. Anyway, there is apparently an orientation map. 575 | Well, when they rewire the ferrets' brains, apparently there are 576 | visually responsive cells in the auditory cortex that start to develop 577 | an orientation map! It's similar to the one in the visual cortex, 578 | although apparently less orderly. 579 | 580 | They use a nice piece of terminology: sensory pathways have an 581 | \emph{instructive} role in the development of cortical networks. The 582 | visual cortex apparently has a couple of different kinds of structure: 583 | ocular dominance columns, and orientation columns.
Actually, looking 584 | at Wikipedia, there's quite a bit more structure in there than that. 585 | Apparently, orientation columns were discovered simply by stimulating 586 | a cat with visual stimuli from different directions, and noticing 587 | where in the visual cortex excitement occurred. They're apparently 588 | little slabs of cells that respond to visual stimuli from a particular 589 | direction. Perhaps unsurprisingly, these columns are arranged into 590 | little pinwheels --- it's natural enough that they would reflect 591 | external geometry. 592 | 593 | They wanted to investigate ``whether afferent [i.e., sensory] activity 594 | or intrinsic features of the cortical target regulate the development 595 | of orientation columns.'' ``... within limits, input activity [from 596 | eyes to auditory cortex] has a significant instructive role in 597 | establishing the cortical circuits that underlie orientation 598 | selectivity and the orientation map''. 599 | 600 | They identify two separate things --- the degree of ``tuning'' in the 601 | cortex, as well as the orientation map. Apparently, these two things 602 | are found to be more or less independent. What's ``orientation 603 | tuning'' mean? Maybe it's a way of calibrating the respective meaning 604 | of activation of different orientation columns? ``... afferent 605 | activity is required for at least the maintenance of orientation 606 | selectivity in V1 neurons''. In other words, you destory the 607 | orientation structure if you don't get sensory input. This is a 608 | complementary result. 609 | 610 | 611 | \section{Tenenbaum, de Silva and Langford (2000)} 612 | 613 | \link{http://scholar.google.ca/scholar?cluster=14602426245887619907&hl=en&as_sdt=0,5}{(link)} 614 | They mention a technique called multidimensional scaling (MDS), which 615 | I hadn't heard of. The idea seems to be that we have a lot of items, 616 | and we know some ``dissimilarities'' between items. The goal is to 617 | find a metric space embedding of those items so that the distances are 618 | roughly equal to the dissimilarities. 619 | 620 | A sample problem: we have a 4096-dimensional space, corresponding to 621 | 64 by 64 pixel images. A (nonlinear) subspace of this corresponds to 622 | images we'd recognize as faces. How can we characterize this 623 | subspace? 624 | 625 | This is just one possible mathematical formalization of the problem. 626 | In practice, things are more complex. Our classification will be 627 | fuzzy. We'll have all kinds of extra contextual information: maybe 628 | we've got an external hint; maybe we can see a nose; maybe the colour 629 | is wrong, but we see enough to suspect it's false colour. All these 630 | kinds of things are clearly important in how we actually see. In 631 | other words, we don't just have an algorithm for face detection. We 632 | have a million related algorithms, and they all affect how well face 633 | detection works. In some sense you don't solve one problem perfectly. 634 | You solve a network of problems imperfectly --- and then use those 635 | results to improve your performance on the original problem. It's a 636 | kind of \emph{learning network}. In a sense this is what a deep 637 | neural network does: it builds up gradually more complicated features. 638 | 639 | The algorithm they describe is very simple. Very roughly (this 640 | certainly contains mistakes): the idea seems to be to take all your 641 | data points and to compute distances between them. 
We assume that 642 | when the distances are small, the points are neighbours. Construct a 643 | graph in which neighbouring points are connected. Then geodesic 644 | distance is found (approximated) by finding the shortest distance in 645 | the graph. We then embed the graph in a space of the chosen 646 | dimensionality. Nice! Simple, probably pretty easy to implement, and 647 | I expect it lets us find a lot of structure. 648 | 649 | It's worth thinking about what the input and output are. The input to 650 | Iso-map is just a data set --- maybe it's a set of images of a face, 651 | maybe it's a set of words, whatever. This data lives in a very 652 | high-dimensional space. What we do is we find an embedding in a much 653 | lower dimensional space --- say, 2-dimensional. In other words, we're 654 | constructing new features, based on the original features. 655 | 656 | \textbf{There are $10^6$ optic nerves and $30,000$ auditory nerves:} 657 | I'm not quite sure what to make of this. Presumably it means that we 658 | process something like $30$ times as much optical information as 659 | auditory. I wonder how pixellated the information is? 660 | 661 | \textbf{What happens when we augment the features, with PCA?} Let's 662 | suppose we start off with 3 features, $x, y, z$. Then we add $x^2$ 663 | and $y^2$ as new features. Certain subsets of the original space that 664 | weren't linearly approximable \emph{will be} in the new feature space. 665 | This seems like a potentially powerful technique. What can it be used 666 | to do? What are its limits? 667 | 668 | \section{Simard (2003)} 669 | 670 | ``Best Practices for Convolutional Neural Networks Applied to Visual 671 | Document Analysis'' 672 | 673 | ``The most important practice is getting a training set as large as 674 | possible: we expand the training set by adding a new form of distorted 675 | data''. They claim it's better even than being convolutional. ``The 676 | optimal performance on MNIST was achieved using two essential 677 | practices. First, we created a new, general set of elastic 678 | distortions that vastly expanded the size of the training set...'' 679 | 680 | ``We avoided using momentum, weight decay, structure-dependent 681 | learning rates, extra padding around the inputs, and averaging instead 682 | of subsampling. (We were motivated to avoid these complications by 683 | trying them on various architecture/distortions combinations and on a 684 | train/validation split of the data and finding they did not help.)'' 685 | 686 | They have lots of useful details about how they came up with their 687 | convolutional architecture. It's very similar to LeCun (1998), of 688 | course, but they have more detail on \emph{how} they chose the various 689 | parameters. Interestingly, they found that having 5 features in the 690 | first convolutional layer and 50 features in the second convolutional 691 | layer was more or less optimal. 692 | 693 | ``Convolutional neural networks have been proposed for visual tasks 694 | for many years [LeCun 1998], yet have not been popular in the 695 | engineering community. We believe that is due to the complexity of 696 | implementing the convolutional neural networks.'' 697 | 698 | They point out that implementation is complicated by the fact that not 699 | every unit has the same number of outgoing connections. 700 | 701 | The results suggest substantial improvements from both distortions, 702 | and the use of convolutional nets. 
They achieve a best-possible accuracy of 99.6\%, which was apparently a record at the time.

\section{Hinton, Osindero, and Teh (2006)}

\link{http://www.cs.toronto.edu/\~hinton/absps/ncfast.pdf}{A Fast Learning Algorithm for Deep Belief Nets}

``Learning is difficult in densely connected, directed belief nets that have many hidden layers because it is difficult to infer the conditional distribution of the hidden activities when given a data vector.''  I don't know why this is.  My impression is that it's easy to at least sample from the distribution of hidden activations.  Is that false?  Or maybe it's true and it's just the calculation of the distribution that is hard.  ``Variational methods use simple approximations to the true conditional distribution, but the approximations may be poor, especially at the deepest hidden layer, where the prior assumes independence.  Also, variational learning still requires all of the parameters to be learned together and this makes the learning time scale poorly as the number of parameters increase.''  I don't know what variational learning is.

``The network used to model the joint distribution of digit images and digit labels... work in progress has shown that the same learning algorithm can be used if the `labels' are replaced by a multilayer pathway whose inputs are spectrograms from multiple different speakers saying isolated digits.  The network then learns to generate pairs that consist of an image and a spectrogram of the same digit class.''  Fascinating: in other words, it will associate ``9'' both with different images of 9, and also with different people saying 9.

Discriminative model: a model which can be used to distinguish the MNIST digits.  Generative model: a model which, given a label, can be used to generate an image which in some sense samples from the MNIST distribution.

Generative models seem interesting in part because that's what we do (we can both read and write digits, for instance).  Of course, it's not entirely clear how these skills are associated.  One can learn to read without also learning to write; there are fine motor skills in the latter that are not all that closely associated with reading.

``There is a fine-tuning algorithm that learns an excellent generative model that outperforms discriminative methods on the MNIST database of hand-written digits.''  I haven't seen this kind of thing mentioned at all in later work --- it's all discriminative.

``The learning algorithm is local.  Adjustments to a synapse strength depend on only the states of the presynaptic and postsynaptic neuron.''  This seems very preferable to gradient descent!

\textbf{Explaining away:} Makes inference difficult in directed belief nets.  Basically, we can't figure out what the root causes must have been, given only partial evidence.

\section{Hinton and Salakhutdinov (2006)}

\link{http://scholar.google.ca/scholar?cluster=15344645275208957628}{(link)}

Their RBM uses ``symmetrically weighted connections''.  It is not clear to me what this means.  It seems to mean that the biases are the same on hidden and visible units.  I don't see how that can be --- aren't there different numbers of such units?
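Thinking about this a bit more: my best guess is that ``symmetrically weighted'' just means a single weight matrix is used in both directions (as $W$ going from visible to hidden, and as $W^T$ coming back down), while the two layers keep separate bias vectors of different lengths.  A minimal sketch of that reading, with binary units and my own variable names:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

n_visible, n_hidden = 784, 500
W = 0.01 * rng.standard_normal((n_visible, n_hidden))  # one shared weight matrix
b = np.zeros(n_hidden)    # hidden-unit biases
c = np.zeros(n_visible)   # visible-unit biases: a separate, different-length vector

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_step(v):
    """One step of block Gibbs sampling; W is used going up, W.T coming down."""
    p_h = sigmoid(v @ W + b)
    h = (rng.random(n_hidden) < p_h).astype(float)
    p_v = sigmoid(h @ W.T + c)
    v_new = (rng.random(n_visible) < p_v).astype(float)
    return h, v_new
\end{verbatim}

If that reading is right, there is no problem with the layers having different numbers of units: only the weights are shared between the two directions, not the biases.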
765 | 766 | So the idea is to take an RBM, and then use the training data to find 767 | a new set of features. We then use the features generated by the 768 | training data as a \emph{new} set of training data, for another RBM. 769 | We use that to find new features. And so on, through multiple levels 770 | of RBMs. We then use backpropagation to fine-tune the whole thing. 771 | It appears that the backpropagation is done with the weights treated 772 | as though in a deterministic neural network, not stochastic, as in an 773 | RBM. 774 | 775 | In a bit more detail, when working with real-valued data, the visible 776 | units in later RBMs were set to the activation probabilities of 777 | previous hidden units. I.e., probabilities became data. 778 | 779 | H and S used a deep network with 784-400-200-100-50-25-6 units. That 780 | is, they reduced 784-dimensional input data to just 6 parameters. 781 | And, visually at least, their reconstructions were very good, 782 | significantly better than 6-parameter PCA and similar techniques. 783 | 784 | What makes it difficult to train deep neural networks? I must admit, 785 | I don't really have a great answer to this question. Can we come up 786 | with a good \emph{a priori} reason for thinking it will be tough? 787 | It's not obvious that it should be tougher than a shallow network with 788 | the same number of neurons. 789 | 790 | H and S compare to the work of Tenenbaum \emph{et al} and Roweis and 791 | Saul, and comment: ``Unlike nonparametric methods (cites), 792 | autoencoders give mappings in both directions between the data and 793 | code spaces, and they can be applied to very large data sets because 794 | both the pretraining and the fine-tuning scale linearly in time and 795 | space with the number of training cases.'' I don't quite understand 796 | the comment about mappings in both directions --- I thought the 797 | earlier work provided such mappings. Perhaps I should look closer. 798 | 799 | \section{Bengio, Lamblin, Popovici, Larochelle (2007)} 800 | 801 | \link{http://www.iro.umontreal.ca/\~lisa/publications2/index.php/attachments/single/24}{Greedy 802 | Layer-Wise Training of Deep Networks} 803 | 804 | They have a complexity-theoretic point of view, a point of view that 805 | says depth (in circuits, or otherwise) helps compute functions. I 806 | guess this is more or less the point of view of computer scientists 807 | who believe that \textbf{NC} is a strict subset of \textbf{P}. 808 | 809 | In general, this is a point of view I haven't much engaged with. I've 810 | been thinking more in the detailed world of the practitioner, 811 | wondering just how well a given network functions, and not thinking 812 | about these structural questions. But I suppose there is a deep 813 | structural question here, which is whether there are deep networks 814 | that can compute functions using polynomially many elements, and said 815 | functions require exponentially many more elements in a shallow 816 | network? 817 | 818 | A skeptical way of looking at this is to say that this is a question 819 | about scaling, and that scaling isn't what matters for solving pattern 820 | recognition problems in the real world, since we have just one such 821 | world, of fixed size. But to be skeptical of the skeptic, we would 822 | still find it interesting if, in the real world, we were trying to 823 | learn functions which were much easier to compute by a deep network 824 | than a shallow. 825 | 826 | Why might deep networks be better? 
Two broad reasons: ease of 827 | computation; and ease of learning. I'd like to understand both these: 828 | Why might computation be easier? And why might learning be easier? 829 | 830 | Well, those notes get me to the end of the first sentence of the 831 | abstract! Let me skip ahead and see if I can sum up the first 832 | paragraph, since it seems very interesting. The basic problem is the 833 | ability of various machine-learning algorithms to learn highly-varying 834 | functions, ``e.g., they would require a large number of pieces to be 835 | well represented by a piecewise-linear approximation. Since the 836 | number of pieces can be made to grow exponentially... If the shapes of 837 | all these pieces are unrelated, one needs enough examples for each 838 | piece in order to generalize properly. However, if these shapes are 839 | related and can be predicted from each other, `non-local' learning 840 | algorithms have the potential to generalize to pieces not covered by 841 | the training set.'' I can sort of see this: basically, linear 842 | boundaries aren't going to give us very much, even with new features: 843 | they can't go a huge amount beyond what is already in the input data. 844 | But I don't quite see what non-linearities do to get beyond this. I 845 | guess it's that we're starting to learn from multiple pieces of 846 | training data at once, and making higher-order generalizations. 847 | (Basically, once you can do {\sc and} gates, you can do conditional 848 | logic, and that lets you build up hierarchical reasoning.) 849 | 850 | 851 | \section{Pinto, Cox and DiCarlo (2008)} 852 | 853 | \link{http://www.ploscompbiol.org/article/info\%3Adoi\%2F10.1371\%2Fjournal.pcbi.0040027}{Why is Real-World Visual Object Recognition Hard?} 854 | 855 | ``[W]e show that a simple V1-like [computational?] model --- a 856 | neuroscientist's `null' model, which should perform poorly at 857 | real-world visual object recognition tasks --- outperforms 858 | state-of-the-art object recognition systems (biologically inspired and 859 | otherwise) on a standard, ostensibly natural image recognition test.'' 860 | I'm not sure what moral to take away. That simple systems can do well 861 | recognizing natural images? But they also created another ``simple'' 862 | test which demonstrated the inadequacy of their system. ``Taken 863 | together, these results demonstrate that tests based on uncontrolled 864 | natural images can be seriously misleading...'' The ultimate 865 | conclusion is that they want more focus on real-world image variation, 866 | by which they mean that the same object can cast a potentially 867 | infinite number of variations on the eye. 868 | 869 | ``[I]t is not clear to what extent such `natural' image tests [like 870 | Caltech101] actually engage the core problem of object recognition. 871 | Specifically, while the Caltech101 set certainly contains a large 872 | number of images (9,144 images), variations in object view, position, 873 | size, etc., between and within object category are poorly defined and 874 | are not varied systematically [I can see that this might be a problem 875 | if the sampling is not reasonably fair]. Furthermore, image 876 | backgrounds strongly covary with object category [wow!].... 
The majority of images are also `composed' photographs, in that a human decided how the shot should be framed [!], and thus the placement of objects within the image is not random and the set may not properly reflect the variation found in the real world.  Furthermore, if the Caltech101 object recognition task is hard, it is not easy to know what makes it hard---different kinds of variation (view, lighting, exemplar, etc.) are all inextricably linked together.''

``We built a very basic representation inspired by known properties of V1 `simple' cells... The responses of these cells to visual stimuli are well-described by a spatial linear filter, resembling a Gabor wavelet... with a nonlinear output function... and some local normalization (roughly analogous to `contrast gain control').''

\section{Deng (2009)}

``ImageNet: A Large-Scale Hierarchical Image Database''

An ``ontology of images built upon the backbone of the WordNet structure''.  ``Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a `synonym set' or `synset'.''  There are apparently 80 thousand (noun) synsets in WordNet.  (I presume there are verb synsets, and perhaps other types as well?)  The idea of ImageNet is to provide 500-1,000 images per synset.  That's a grand total of some tens of millions of images.  ``Images of each concept are quality-controlled and human-annotated''.  To some extent this means that they don't match what we'll actually find ``in the wild''.  This paper reports early work --- 5,247 synsets and 3.2 million images.

``ImageNet aims to provide the most comprehensive and diverse coverage of the image world.  The current 12 subtrees consist of a total of 3.2 million cleanly annotated images spread over 5,247 categories... To our knowledge this is already the largest clean [what does this mean?] image dataset available to the vision research community, in terms of the total number of images, number of images per category as well as the number of categories.''  ``... to our knowledge no existing vision dataset offers images of 147 dog categories.''

Even at very low levels in the tree, ImageNet labels were found to be highly accurate by an independent group of subjects.

``ImageNet is constructed with the goal that objects in images should have variable appearances, positions, view points, poses as well as background clutter and occlusions.''  They do an interesting thing to measure diversity.  They compute an ``average image'' for each synset, and then measure the JPEG file size.  The idea is that very different images will blur out (and so have small file sizes), while more similar images will not (and so will have large file sizes).  They find that their images are much more diverse than Caltech101.

Images are collected by querying several image search engines with the appropriate noun or noun phrase.  ``To obtain as many images as possible, we expand the query set by appending the queries with the word [?] from parent synsets, if the same word appears in the gloss of the target synset [?]''.  ``To further enlarge and diversify the candidate pool, we translate the queries into other languages''.
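To make the diversity trick concrete for myself, a rough sketch (the function name and JPEG settings are mine; the paper doesn't spell these out):

\begin{verbatim}
import io
import numpy as np
from PIL import Image

def average_image_jpeg_size(images, quality=75):
    """Diversity proxy: JPEG-compress the average image of a synset.

    `images` is a list of equal-sized RGB uint8 arrays.  A diverse synset
    averages out to a blurry, featureless image, which compresses to a
    small file; a homogeneous synset keeps structure and compresses less.
    """
    mean = np.stack(images).astype(np.float64).mean(axis=0)
    buf = io.BytesIO()
    Image.fromarray(mean.astype(np.uint8)).save(buf, format="JPEG",
                                                quality=quality)
    return len(buf.getvalue())
\end{verbatim}

Smaller numbers mean more diversity, which is the sense in which they find ImageNet comes out ahead of Caltech101.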
934 | 935 | Using Mechanical Turk to label: ``In each of our labeling tasks, we 936 | present the users with a set of candidate images and the definition of 937 | the target synset... We then ask users to verify whether each image 938 | contains objects of the synset. We encourage users to select images 939 | regardless of occlusions, number of objects and clutter in the scene 940 | to ensure diversity.'' Of course, the problems are that people make 941 | mistakes, and they may not agree with one another. They get multiple 942 | people to label each image, and only classify something positively if 943 | an image gets a convincing majority of the votes. ``... different 944 | categories require different levels of consensus among users.'' 945 | Basically, the more contentious, the more votes we need to be sure. 946 | They have to do some initial setup to figure out the appropriate 947 | thresholds (or if a threshold fails to exist). 948 | 949 | Nice idea: classifying at each node in the WordNet net. This reduces 950 | the classification difficulty at each step. I wonder if there's a 951 | natural way this can be done in deep neural nets? Maybe by building a 952 | feature representation unsupervised, and then using those features to 953 | train a (tree-like) classifier? ``At nearly all levels, the 954 | performance of the tree-max classifier is consistently higher than the 955 | independent classifier.'' 956 | 957 | \section{Jarrett (2009)} 958 | 959 | ``What is the Best Multi-Stage Architecture for Object Recognition?'': 960 | \link{http://yann.lecun.com/exdb/publis/pdf/jarrett-iccv-09.pdf}{link} 961 | 962 | This is a much more conventional paper about object recognition than 963 | the material I've been reading. The basic idea is to build a pretty 964 | good feature extractor, and then to use a standard (supervised) 965 | classifier. 966 | 967 | ``We show that using non-linearities that include rectification and 968 | local contrast normalization is the single most important ingredient 969 | for good accuracy on object recognition benchmarks.'' ``[T]he SIFT 970 | operator applies oriented edge filters to a small patch and determines 971 | the dominant orientation through a winner-take-all operation.'' 972 | ``Several recognition architectures use a single stage of such 973 | features followed by a supervised classifier.'' 974 | 975 | ``At first glance, one may think that training a complete system in a 976 | purely supervised manner (using gradient descent) is bound to fail on 977 | dataset with small number of labeled samples such as Caltech-101, 978 | because the number of parameters greatly outstrips the number of 979 | samples. [Yes, one might think this] One may also think that the 980 | filters need to be carefully hand-picked (or trained) to produce good 981 | performance [Yes, at least to the training part], and that the details 982 | of the non-linearity play a somewhat secondary role [Again, agreed]. 983 | These intutitions, as it turns out, are wrong. [!]'' 984 | 985 | ``A common choice for the filter bank of the first stage is Gabor 986 | Wavelets. [A linear filter used for edge detection. Apparently there 987 | are similar things in the visual cortex!] Other proposals use simple 988 | oriented edge detection filters such as gradient operators, including 989 | SIFT, and HoG. Another set of methods learn the filters by adapting 990 | them to the statistics of the input data with unsupervised 991 | learning. [This is the deep neural nets approach] ... 
The advantage of 992 | learning methods is that they provide a way to learn the filters in 993 | subsequent stages of the feature hierarchy. While prior knowledge 994 | about image statistics point to the usefulness of oriented edge 995 | detectors at the first stage, there is no similar prior knowledge that 996 | would allow to design sensible filters for the second stage in the 997 | hierarchy. Hence the second stage \emph{must be learned}.'' This 998 | seems overly pessimistic to me: one can certainly imagine a theory 999 | that tells us what features there should be at the second level. 1000 | Still, it's obviously an attractive model. 1001 | 1002 | ``The second ingredient of a feature extraction system is the 1003 | non-linearity.'' I don't really understand deeply why non-linearity 1004 | is so necessary. It'd be good to do so. 1005 | 1006 | Notes that pooling can be applied over space, over scale and space 1007 | (rescaling?), and over similar feature types and space. ``This layer 1008 | [pooling] builds robustness by computing an average or a max of the 1009 | filter responses within the pool.'' 1010 | 1011 | Caltech 101: 101 categories. About 50 images per category, and the 1012 | size of each image is roughly 300 by 200 pixels. SIFT features plus a 1013 | linear classifier will give us 50 percent classification accuracy. 1014 | Using a better classifer will give us 65 percent. ``[T]he best 1015 | results on Caltech-101 have been obtained by combining a large number 1016 | of different feature families [29]''. Reference is to Varma and Ray. 1017 | 1018 | ``The hierarchy stacks one or several feature extraction stages, each 1019 | of which consists of filter bank layer, non-linear transformation 1020 | layers [\emph{sic}?], and a pooling layer that combines filter 1021 | responses over local neighborhoods using an average or max operation, 1022 | thereby achieving invariance to small distortions.'' 1023 | 1024 | 1025 | 1026 | Conclusions: ``[U]sing a rectifying non-linearity is the single most 1027 | important factor in improving the performance of a recognition 1028 | system[!]'' I don't understand the heuristic justifications they 1029 | give. ``Also introducing a local normalization layer improves the 1030 | performance. It appears to make supervised learning considerably 1031 | faster, perhaps because all variables have similar variances (akin to 1032 | the advantages introduced by whitening and other decorrelation 1033 | methods).'' 1034 | 1035 | \section{Lee (2009) - video} 1036 | 1037 | \link{http://videolectures.net/icml09\_lee\_cdb/}{link} A video 1038 | version of the paper below. ``We are interested in scaling up deep 1039 | belief networks to learn generative models and to perform inference on 1040 | challenging problems.'' RBMs. Visible nodes: input (training) data. 1041 | Hidden nodes: encode statistical relationships in the visible nodes. 1042 | ``Unsupervised training using Contrastive Divergence approximation to 1043 | maximum likelihood''. Deep belief network: ``Greedy layerwise 1044 | training using RBMs''. Want to scale DBNs to realistic image sizes: 1045 | 200 by 200 pixels. One way to deal with this is to use a 1046 | convolutional net. Alternate between ``detection'' and ``pooling'' 1047 | layers. ``Detection layers involve weights shared between all image 1048 | locations'': we have a window of features, sliding across the input 1049 | image. ``Each pooling unit computes the maximum of the activation of 1050 | several detection units''. 
It shrinks the representation in higher layers.  They define a convolutional RBM.  It's very similar to a standard RBM, but with a couple of differences.  One, the weights are shared across hidden units, as in a convolutional net.  Second, they impose a constraint on the hidden units --- basically, local sums can't be too large.  It's not quite clear to me why they're doing this, but they are.  They can still do block Gibbs sampling.

Convolutional DBNs: They do greedy, layerwise training, training one convolutional RBM at a time.  They can both infer forwards and backwards through the layers.

Results (MNIST): They trained a two-layer CDBN on \emph{unlabeled} MNIST data.  The first layer learns ``strokes'', while the second layer learns groupings of strokes.  Nice results: down to 0.82\% error rate.  I like the fact that they talk about how the error rate scales with the number of labeled examples.

Results (natural images): The first layer learns localized, oriented edges.  Second layer: contours, corners, arcs, surface boundaries.  Caltech 101: 65.4\% accuracy.  Final result is competitive.  Training images unrelated to Caltech 101.  Three-layer network from faces: first layer learns edges, second layer learns eyes, third layer learns faces.  They're computing some kind of precision-recall curve.  I don't quite get this --- it's an unfamiliar usage to me.  They do some training with multiple classes (cars, faces, motorbikes, aeroplanes).  The first layer gets general-purpose features.  The second layer gets object-class-specific features, as well as some shared features.  The third layer gets highly specific features.  Nice conditional entropy graph: uncertainty in the class, given the number of features which are active.  Wonderful ``filling in'' of faces.

\section{Lee (2009)}

RBMs.  Two layer.  Bipartite.  Undirected.  Binary hidden units, $h$.  Binary or real-valued visible units, $v$.  A weight matrix $W$ between the two layers.  If the visible units are binary, then we define the energy:
\begin{eqnarray}
E = -v^T W h - b^T h - c^T v,
\end{eqnarray}
where $b$ are the hidden unit biases, and $c$ are the visible unit biases.  For real-valued visible units, modify the energy by adding a $\frac{1}{2} v^T v$ term.  This model is simple enough.  How should we think about it?  The idea is to start with a given set of values for one layer, say the visible layer.  Then sample the hidden units.  Then sample the visible layer.  And so on, ping-ponging back and forth.

``In principle, the RBM parameters can be optimized by performing stochastic gradient ascent on the log-likelihood of the training data.''  The parameters to be optimized are presumably the weights and biases.  The likelihood is the probability of the observed outcomes (i.e., the training data), given the particular parameters.  I assume that the idea is that the visible units are supposed to represent the observed data.  So we want to choose the parameters of the model in order to maximize the probability of seeing the training data in the visible units.  Apparently contrastive divergence is a technique for approximating the gradient of the log-likelihood.

Convolutional RBM.
The weights between the hidden and visible layers are shared among all locations in an image.  What exactly does this mean?  Suppose we have an $N_V \times N_V$ image.  Then the input layer apparently consists of $N_V \times N_V$ binary units.  There are $K$ groups in the hidden layer, each an $N_H \times N_H$ array of binary units.  So there are $N_H^2 K$ total hidden units.

We index the hidden groups by $k$.  Each hidden group has a bias, $b_k$.  All visible units share a single bias, $c$.

For any given group, $k$, we have a single set of $N_W \times N_W$ weights (the ``filter'').  $N_W \equiv N_V - N_H + 1$.  The basic idea is to filter the inputs by translating the filter across the input image.

I will come back to the energy function a little later.  XXX.  We can do Gibbs sampling to generate the appropriate distributions.

\section{Scherer (2010)}

\link{http://www.ais.uni-bonn.de/papers/icann2010\_maxpool.pdf}{Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition}

Notes that many standard models are based on Hubel and Wiesel: the Neocognitron, convolutional nets, HoG, SIFT, Gist features, and HMAX.  ``These models can be broadly distinguished by the operation that summarizes over a spatial neighbourhood.  Most earlier models perform a subsampling operation, where the average over all input values is propagated to the next layer... A different approach is to compute the maximum value in a neighborhood... While entire models have been extensively compared, there has been no research evaluating the choice of the aggregation function so far.  The aim of our work is therefore to empirically determine which of the established aggregation functions is more suitable for vision tasks.  Additionally, we investigate if ideas from signal processing, such as overlapping receptive fields and window functions can improve recognition performance.''

They note that there are so many variants on complex cells / pooling operations that it's impossible to do a complete analysis.  Instead, they're going to choose a particular model and analyse that, based on convolutional neural networks.  ``Our choice of a CNN is largely motivated by the fact that the operation performed by pooling layers is easily interchangeable without modifications to the architecture.''

``The purpose of the pooling layers is to achieve spatial invariance by reducing the resolution of the feature maps.''  Is that really right?  We don't actually want spatial invariance --- relative positions matter.  But the fine positional details don't matter.  It's a way of saying small spatial shifts (relative to feature size) don't matter.  So a better sentence would be: the purpose of the pooling layers is to ensure that small spatial shifts (relative to feature size) don't matter.

``We evaluate two different pooling operations: max pooling and subsampling.''  Subsampling computes an average and multiplies by a trainable scalar.  Max pooling applies a window function and computes the maximum in the neighbourhood.
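To keep the two aggregation functions straight in my head, a minimal sketch (my own code; I'm ignoring the window functions and the trainable scalar they also consider):

\begin{verbatim}
import numpy as np

def pool2d(fmap, size=3, stride=2, mode="max"):
    """Pool a 2D feature map with a size-by-size window moved by `stride`.

    mode="max" keeps the largest response in each window; mode="avg" is
    the subsampling-style average.
    """
    H, W = fmap.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            win = fmap[i * stride:i * stride + size,
                       j * stride:j * stride + size]
            out[i, j] = win.max() if mode == "max" else win.mean()
    return out
\end{verbatim}

Overlapping pooling, in this picture, is just the case where the stride is smaller than the window size (as with the 3-by-3 windows moved by 2 in the defaults above).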
1168 | 1169 | They wanted to do the following: (1) figure out how max pooling and 1170 | subsampling compare; (2) determine whether overlapping pooling windows 1171 | improve performance; and (3) find suitable window functions. 1172 | 1173 | ``For both NORB and Caltech-101 our results indicate that 1174 | architectures with a max pooling operation converge considerably 1175 | faster than those employing a subsampling operation. Furthermore, 1176 | they seem to be superior in selecting invariant features and improve 1177 | generalization.'' Of course, this conclusion only applies in their 1178 | specific context. Maybe if we increased the data set then this would 1179 | no longer be true? Or if we changed the architecture in some other 1180 | way? Furthermore, no explanation of why it is true has been given. 1181 | 1182 | ``To evaluate how the step size of overlapping pooling windows affects 1183 | recognition rates, we essentially used the same architectures as in 1184 | the previous section. Adjusting the step size does, however, change 1185 | the size of the feature maps [I don't see why --- we can make them the 1186 | same size] and with it the total number of trainable parameters, as 1187 | well as the ratio between fully connected weights and shared 1188 | weights.'' I must admit I don't understand what's being done here, or 1189 | in the remainder of this section. I think it would be dangerous for 1190 | me to take much away from it. 1191 | 1192 | Comparison to Coates' (2011) paper: Coates found that shorter stride 1193 | length helped. However, the model used seems to have been quite a bit 1194 | different to this paper. So I'm not sure I'd read too much into 1195 | either result --- more study is, I think, needed, to understand this. 1196 | 1197 | \section{Coates (2011)} 1198 | 1199 | \link{http://www.stanford.edu/~acoates/papers/coatesleeng\_aistats\_2011.pdf}{An 1200 | Analysis of Single-Layer Networks in Unsupervised Feature Learning} 1201 | 1202 | ``In this paper... we show that several simple factors, such as the 1203 | number of hidden nodes in the model, may be more important to 1204 | achieving high performance than the learning algorithm or the depth of 1205 | the model... Our results show that large numbers of hidden nodes and 1206 | dense feature extraction are critical to achieving high performance.'' 1207 | They actually get state-of-the-art performance using only a single 1208 | layer of features. This is interesting: it's a case where deep 1209 | learning \emph{doesn't} help. But increasing the number of features 1210 | \emph{does} help --- a lot! 1211 | 1212 | Reviews the standard practice: use unsupervised learning to pre-train 1213 | multiple layers of features. 1214 | 1215 | ``Even with very simple algorithms and a single layer of features, it 1216 | is possible to achieve state-of-the-art performance by focusing effort 1217 | on these choices [number of features, dense feature extraction, 1218 | whitening] rather than on the learning system itself.'' 1219 | 1220 | ``[W]e employ very \emph{simple} learning algorithms and then more 1221 | carefully choose the network parameters in search of higher 1222 | performance. If (as is often the case) larger representations perform 1223 | better, then we can leverage the speed and simplicity of these 1224 | learning algorithms to use larger representations.'' 1225 | 1226 | CIFAR-10: 60,000 32 by 32 colour images in 10 classes, with 6,000 1227 | images per class. 
There are 50,000 training images and 10,000 test 1228 | images. CIFAR-10 is a subset of the ``80 million tiny images'' 1229 | dataset. 1230 | 1231 | CIFAR-100: Like CIFAR-10, but with 100 classes containing 600 images 1232 | each. I.e., CIFAR-100 is a more difficult problem. 1233 | 1234 | So the CIFAR data sets can be thought of as small but challenging 1235 | class recognition data sets. 1236 | 1237 | ``It will turn out that whitening, large numbers of features, and 1238 | small stride lead to uniformaly better performance regardless of the 1239 | choice of unsupervised learning algorithm... the main contribution of 1240 | our work is in demonstrating that these considerations may, in fact, 1241 | be \emph{critical} to the success of feature learning algorithms --- 1242 | potentially more important even than the choice of unsupervised 1243 | learning algorithm. Indeed, it will be shown that when we push these 1244 | parameters to their limits that we can achieve state-of-the-art 1245 | performance, outperforming many other more complex algorithms on the 1246 | same tasks.'' 1247 | 1248 | This really makes me wonder about the standard claims made about deep 1249 | learning. 1250 | 1251 | ``Since the introduction of unsupervised pre-training, many new 1252 | schemes for stacking layers of features to build `deep' 1253 | representations have been proposed. Most have focused on creating new 1254 | training algorithms to build single-layer models that are composed to 1255 | build deeper structures. Among the algorithms considered in the 1256 | literature are [long list]. Thus, amongst the many components of 1257 | feature learning architectures, the unsupervised learning module 1258 | appears to be the most heavily scrutinized.'' 1259 | 1260 | ``Some work, however, has considered the impact of other choices in 1261 | these feature learning systems, especially the choice of network 1262 | architecture. Jarret et al. [11], for instance, have considered the 1263 | impact of changes to the ``pooling'' strategies frequently employed 1264 | between layers of features, as well as different forms of 1265 | normalization and rectification between layers.'' One reason this is 1266 | interesting is that it suggests a direction in which to take work. 1267 | 1268 | ``While we confirm that some feature-learning schemes are better than 1269 | others, we also show that the differences can often be outweighted by 1270 | other factors, such as the number of features. Thus, even though more 1271 | complex learning schemes may improve performance slightly, these 1272 | advantages can be overcome by fast, simple learning algorithms that 1273 | are able to handle larger networks.'' [It'd be nice to know more 1274 | about the impact of changed data set size as well.] Summing up: more 1275 | sophisticated algorithms may not be as useful as increasing the basic 1276 | parameters in a simple algorithm. But given this, I'd like to know 1277 | why Ng used a deep RICA network in his later work? 1278 | 1279 | ``At a high-level [\emph{sic}], our system performs the following 1280 | steps to learn a feature representation: 1. Extract random patches 1281 | from unlabeled training images. 2. Apply a pre-processing stage to 1282 | the patches. 3. Learn a feature-mapping using an unsupervised learning 1283 | algorithm. [So this is how we learn the features to be used. Now we 1284 | move to classification.] 
Given the learned feature mapping and a set 1285 | of labeled training images we can then perform feature extraction and 1286 | classification: 1. Extract features from equally spaced sub-patches 1287 | [why equally spaced? why use sub-patches?] covering the input image. 1288 | 2. Pool features together over regions of the input image to reduce 1289 | the number of feature values. [I guess this makes sense if we're using 1290 | small local features, as does the use of sub-patches.] 3. Train a 1291 | linear classifier to predict the labels given the feature vectors.'' 1292 | 1293 | ``It is common practice to perform several simple normalization steps 1294 | before attempting to generate features from data. In this work, we 1295 | assume that every patch $x^{(i)}$ is normalized by subtracting the 1296 | mean and dividing by the standard deviation of its elements. For 1297 | visual data, this corresponds to local brightness and contrast 1298 | normalization.'' 1299 | 1300 | ``For our purposes, we will view an unsupervised learning algorithm as 1301 | a `black box' that takes the [training] dataset $X$ and outputs a 1302 | function $f : R^N \rightarrow R^K$ that maps an input vector $x^{(i)}$ 1303 | to a new feature vector of $K$ features, where $K$ is a parameter of 1304 | the algorithm.'' 1305 | 1306 | After learning features, they do a type of convolutional extraction: 1307 | basically, stepping across the images with a particular stride length, 1308 | and extracting $K$-dimensional features at each stage. 1309 | 1310 | They do a funny form of pooling. They split their features up into 1311 | four quadrants, and simply sum over each quadrant. That gives them a 1312 | total of $4K$ features to use for classification. I must admit, this 1313 | seems to me like a rather strange procedure to use. They don't appear 1314 | to discuss it at much length. 1315 | 1316 | After pooling they use a linear classifier --- an SVM, with the 1317 | regularization parameter determined by cross-validation. 1318 | 1319 | ``For sparse autoencoders and RBMs, the effect of whitening is 1320 | somewhat ambiguous. When using only 100 features, there is a 1321 | significant benefit of whitening for sparse RBMs, but this advantage 1322 | disappears with larger numbers of features. For the clustering 1323 | algorithms, however, we see that whitening is a crucial pre-process 1324 | since the clustering algorithms cannot handle the correlations in the 1325 | data.'' 1326 | 1327 | Whitening made a big difference for both k-means measures, and for 1328 | Gaussian mixture models. It made only a small difference for the 1329 | sparse autoencoder and for the RBM. 1330 | 1331 | The number of features made a big difference for all approaches. It's 1332 | not clear what the asymptotic performance will be, but even with 1600 1333 | features (where they stopped) things were still improving quite a bit. 1334 | 1335 | The stride length also had a huge impact on performance. I find this 1336 | really interesting! It'd be interesting to understand the performance 1337 | tradeoffs. 1338 | 1339 | Size of the local receptive field didn't have quite as much of an 1340 | impact. Indeed, increasing the size sometimes decreased performance, 1341 | when other factors (e.g., number of features) was held constant. 1342 | 1343 | They got the best known results on CIFAR 10 using k-means. (Note that 1344 | this has since been greatly improved.) 
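The quadrant pooling a few paragraphs back still strikes me as odd, so let me write down my reading of it (the function name is mine, and this is a sketch of my understanding, not their code):

\begin{verbatim}
import numpy as np

def quadrant_pool(feature_map):
    """Sum-pool a (rows, cols, K) grid of feature vectors over the four
    spatial quadrants of the image, giving 4K values for the classifier."""
    rows, cols, _ = feature_map.shape
    r, c = rows // 2, cols // 2
    quadrants = [feature_map[:r, :c], feature_map[:r, c:],
                 feature_map[r:, :c], feature_map[r:, c:]]
    return np.concatenate([q.sum(axis=(0, 1)) for q in quadrants])
\end{verbatim}

Seen this way it's just a very coarse pooling layer: it throws away nearly all spatial detail while keeping a little information about where in the image each feature fired.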
``Our results above may seem inexplicable considering the simplicity of the system --- it is not clear, on first inspection, exactly what in our experiments allows us to achieve such high performance compared to prior work....  Each of the network parameters (feature count, stride and receptive field size) we've tested potentially confers a significant benefit on performance.  For instance, large numbers of features (regardless of how they're trained) gives us many non-linear projections of the data... using extremely large numbers of non-linear projections can make data closer to linearly separable and thus easier to classify.  [E.g., the kernel trick]  Hence, larger numbers of features may be uniformly beneficial, regardless of the training algorithm''

``It appears that large receptive fields result in a space that is simply too large to cover effectively with a small number of nonlinear features.''

\textbf{Takeaways:} the notion of a pipeline: feature learning by unsupervised techniques, followed by a standard classifier (e.g., SVM); increasing the number of features learned can help \emph{a lot}; larger local receptive fields don't seem to help, and can actually hinder; a shorter stride length can help quite a bit; K-means (using the triangle technique) can help a lot.

\section{Le, Karpenko et al (2011)}

``ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning'': \link{http://ai.stanford.edu/~ang/papers/nips11-ICAReconstructionCost.pdf}{link}

ICA as a technique for unsupervised feature learning.  Points out that standard ICA learns orthonormal features, while they want overcomplete feature sets.  ``Using our method to learn highly overcomplete sparse features and tiled convolutional neural networks, we obtain competitive performances on a wide variety of object recognition tasks.  We achieve state-of-the-art test accuracies on the STL-10 and Hollywood2 datasets.''

``Sparsity has been shown to work well for learning feature representations that are robust for object recognition.''  What exactly is a sparse feature?  I guess in the case of sparse autoencoders we only allow a relatively small number of hidden neurons to be on.  Algorithms for learning sparse features: sparse auto-encoders, RBMs, sparse coding, and ICA.

``[Standard] ICA has two major drawbacks.  First, it is difficult to learn \emph{overcomplete feature representations}''.  Goes on to claim that classification performance works better when features are overcomplete.  This makes a certain amount of sense: certainly, there should be no problem having overlapping features.  Also claims that ICA is sensitive to whitening, and this makes it difficult to scale ICA to high dimensional data.

Regular ICA: Let $x^k$ be the training data.  Choose a penalty function $g(\cdot)$.  They suggest $g(z) = \log(\cosh(z))$.  Let $W_j$ be a row in a weight matrix.  Then $W_j x^k$ measures the overlap between the weight vector and the training data.  If it's one, then $x^k$ is very much like the weight vector.  And if it's less than one, then it's less so.  So we simply sum over features, $W_j$, and over training data, $x^k$.
The goal is to ``find the best features'', i.e., to minimize:
\begin{eqnarray}
\sum_{jk} g(W_j x^k).
\end{eqnarray}
This is done subject to the constraint that $WW^T = I$, i.e., the feature vectors (the rows of $W$) are orthonormal to one another.  ICA is done assuming zero mean for the training data, $\sum_k x^k = 0$, and unit covariance, $\sum_k x^k (x^k)^T = m I$.  This is achieved by whitening the data.

Reconstruction ICA (RICA): Minimize:
\begin{eqnarray}
\frac{\lambda}{m} \sum_k \| W^T W x^k - x^k\|^2 + \sum_{jk} g(W_j x^k).
\end{eqnarray}
In other words, find the features which minimize the cost, while preserving the training data pretty well.  ``We use the term `reconstruction cost' for this smooth penalty because it corresponds to the reconstruction cost of a linear autoencoder, where the encoding weights and decoding weights are tied''.  Note that tying is not used in the LRM paper.  This makes it more similar to a standard autoencoder, as I've described elsewhere in my book.

``ICA's main distinction compared to sparse coding and autoencoders is its use of the hard orthonormality constraint in lieu of reconstruction costs.''  The basic idea in proving some kind of equivalence is to let $\lambda$ be large.  ``If the data is whitened, RICA is equivalent to ICA for undercomplete representations and $\lambda$ approaching infinity.''

From my point of view, the main thing here is simply the basic problem formulation: the function to minimize.  I'd like to think of this in a slightly more connectionist fashion.  Let me think back to the cost function.  Minimizing the first part means that we have weights which allow us to approximately reconstruct the training data.  Minimizing the second part acts more like an L1 constraint; roughly speaking, it's telling us to have only a few features active at a time.  So we have features which let us reconstruct, and we are likely to have only a few features active at a time.

Local receptive field TICA: ``[L]ocal receptive field neural networks are faster to optimize than their fully connected counterparts [because they have fewer parameters].  A major drawback of this approach, however, is the difficulty in enforcing orthogonality across partially overlapping patches.  [This becomes a severe constraint if we only overlap at a few points.]  We show that swapping out locally enforced orthogonality constraints with a global reconstruction cost solves this issue.  [I.e., we can forget about local orthogonality, and just worry about optimizing the cost.]''  It seems that they do this by minimizing the following function:
\begin{eqnarray}
\sum_k \| W^T W x^k - x^k\|^2 + \sum_{jk} \sqrt{\epsilon + H_j (Wx^k)^2}.
\end{eqnarray}
A few things: (1) $\lambda$ should presumably appear out the front of the first term; (2) They never explain $\epsilon$; (3) The $H_j$ are pooling matrices; (4) It's not clear what $(Wx^k)^2$ means --- presumably the elementwise square; (5) I don't see how $H_j (Wx^k)^2$ can be a scalar.  Perhaps $H_j$ is meant to be the $j$th \emph{row} of a pooling matrix, in which case the expression is a weighted sum of the squared responses, and is a scalar after all.

\section{Tenenbaum, Kemp, Griffiths, and Goodman (2011)}

\link{http://scholar.google.ca/scholar?cluster=2667398573353002097&hl=en&as_sdt=0,5}{(link)} A review of a particular approach to inductive learning.
They want to combine Bayesian learning with complex ways of representing knowledge.

Claims that there is strong evidence that children can learn to generalize their use of words from just a few examples.  This suggests that there must be some pretty clever underlying patterns to how we generalize.  ``A massive mismatch looms between information coming in through our senses and the outputs of cognition''.

Claims that we humans do reason (implicitly) in Bayesian ways about a number of things.  Mostly omits the evidence that we \emph{don't} in some important ways.  This omission bugs me.  They \emph{do} mention the fact that our conscious assessments of probability tend to be terrible, which is pleasing.  With that said, I'm not certain about this --- I just have the strong impression that there are well-known instances where we certainly don't reason in a Bayesian way.  It'd be good to have references.

``The biggest remaining obstacle is to understand how structured symbolic knowledge can be represented in neural circuits.''  Interesting.  I've often wondered exactly this.  They make the follow-up comment: ``Connectionist models sidestep these challenges by denying that brains actually encode such rich knowledge''.  That seems too strong to me, but there is some truth to it: the connectionists seem less interested than one might suppose in this question, perhaps believing that its solution should be deferred.

How would one go about solving this problem?  Actually, what would a solution, or even a better statement of the problem, look like?  Maybe we could encode entity-relationship triples?  In particular, let us suppose we want to encode $X Y Z$, where $X$ and $Z$ are entities, and $Y$ is the relationship.  One way of encoding this would be to have a neural network with nodes for each entity and for each relationship.  We'd try to design the network so that the only relationships which are active would be those which are true, given the active entities.

\section{Bengio (2012)}

\link{http://arxiv.org/abs/1206.5533}{(link)}

Notes that many of the recommendations haven't been proved; they're heuristics that have emerged out of experimentation.  ``A good indication of the need for such validation is that different researchers and research groups do not always agree on the practice of training neural networks''.

Claims that the optimal learning rate is usually close to the largest learning rate that does not cause divergence of the cost function.  Heuristic: start with a large learning rate, and if the cost function increases, start again with a learning rate that is three times smaller.

This can be automated by keeping track of the cost from epoch to epoch.  If the cost got \emph{larger} during an epoch, then decrease the learning rate by a factor of two, say.  If the cost got \emph{smaller}, then increase the learning rate by a factor of 1.1, say.  How well will that work?  I worry that we'll end up with a situation where we're mostly going back and forth between the learning rate being too high and too low, with not enough time to really learn anything.
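As a toy version of the automated scheme I just described (my own sketch, not something from the paper), the loop would look like:

\begin{verbatim}
def sgd_with_adaptive_rate(run_epoch, eta=0.1, epochs=30,
                           down=0.5, up=1.1):
    """Shrink the learning rate when an epoch's cost rose, grow it gently
    when the cost fell.  `run_epoch(eta)` does one epoch of mini-batch SGD
    at rate eta and returns the training cost afterwards."""
    prev_cost = float("inf")
    for _ in range(epochs):
        cost = run_epoch(eta)
        eta *= down if cost > prev_cost else up
        prev_cost = cost
    return eta
\end{verbatim}

Whether factors like 0.5 and 1.1 avoid the oscillation I'm worried about is exactly the thing that would need testing.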
1525 | 1526 | Larger mini-batches allow a modest increase in learning rate. I don't 1527 | understand the details of this. It'd be nice to have some heuristics. 1528 | Large mini-batches will certainly reduce stochastic error from the 1529 | sampling. Is that what's going on? Or is there some other reason? 1530 | 1531 | ``Because the gradient direction is not quite the right direction of 1532 | descent, there is no point in spending a lot of computation to 1533 | estimate it precisely for gradient descent.'' In other words, do 1534 | frequent rapid estimates rather than slow accurate computations. 1535 | 1536 | It seems to me that it'd be helpful to keep track of training examples 1537 | with markedly different gradients. Those are ones which we could 1538 | learn a lot from. There's an idea here, which is to \emph{identify 1539 | outliers} using the gradient. We should oversample from the 1540 | outliers. I'll bet that improves performance, if the right 1541 | oversampling rate is chosen. I've explored this idea further below. 1542 | 1543 | Bengio confirms that for large data sets, mini-batch stochastic 1544 | gradient descent is pretty much non-optional. 1545 | 1546 | The use of validation data to train hyper-learners, which learn 1547 | hyper-parameters for a learning algorithm. 1548 | 1549 | Comments that the initial learning rate is often the single most 1550 | important hyper-parameter. ``If there is only time to optimize one 1551 | hyper-parameter and one uses stochastic gradient descent, then this is 1552 | the hyper-parameter that is worth tuning.'' Also comments that 1553 | there's often little benefit to doing anything other than keeping the 1554 | learning rate constant. When doing otherwise, Bengio suggests a 1555 | strategy of keeping the learning rate constant for the first $\tau$ 1556 | steps, and then decreasing it as $1/ t$, where $t$ is the number of 1557 | steps. Note that this strategy is not the same as the (exponential) 1558 | automated strategy I describe above. Suggests setting $\tau$ by 1559 | waiting until the cost goes up. Also suggests setting multiple values 1560 | for the schedule, and seeing how they compare. 1561 | 1562 | Mini-batch size: between 1 and a few hundreds. Typical value of 32. 1563 | Notes that this mostly affects computation time, not the final value 1564 | of the cost. 1565 | 1566 | Number of epochs: Watch the validation error, and stop once we're 1567 | beginning to overfit. 1568 | 1569 | Momentum: smooth out gradient by taking an average of recent 1570 | gradients. 1571 | 1572 | Comments that increasing the number of hidden neurons in all layers 1573 | results in a quadratic increase in time. It's not clear to me why 1574 | that should be the case --- obviously there is a quadratic increase in 1575 | the number of weights, and so a quadratic increase in time per epoch. 1576 | But maybe it'll take a larger number of epochs to converge? 1577 | 1578 | ``[W]e found that using the same size for all layers worked generally 1579 | better or the same as using a decreasing size (pyramid-like) or 1580 | increasing size (upside down pyramid), but of course this may be data 1581 | dependent.'' 1582 | 1583 | I am surprised by this. It seems to contradict our ideas about 1584 | feature learning. It'd be good to look at Larochelle et al's results. 1585 | Perhaps it reflects the fact that \emph{more} high level concepts can 1586 | be formed out of the ``atoms'' of input than there are atoms. 
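Going back to the learning-rate schedule a few paragraphs above, my reading of it as code (the values of eta0 and tau here are placeholders of my own):

\begin{verbatim}
def learning_rate(t, eta0=0.01, tau=10000):
    """Hold the rate at eta0 for the first tau steps, then decay it as 1/t,
    i.e. eta_t = eta0 * tau / t for t > tau."""
    return eta0 if t <= tau else eta0 * tau / t
\end{verbatim}

The sum of these rates diverges, albeit only logarithmically, which becomes relevant below when Ciresan et al use a geometrically shrinking rate instead.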
``For most tasks that we worked on, [we] find that an overcomplete first hidden layer works better than an undercomplete one.''

It's not really clear why this is the case.  Again, it may be that it's because there are more high-level concepts than low-level ones.  Still, that seems to be at odds with my intuition about autoencoders.

States that this is particularly true for unsupervised learning.  That \emph{is} consistent with the idea that it's because there are many different abstractions possible, far more than basic features.

Claims that there is a ``clean Bayesian justification'' for regularization as the negative log-prior.  The discussion that follows is extremely interesting and I'm still sorting it out.  The picture that emerges seems to be that what we're doing when learning is some kind of maximum a posteriori estimation.  In particular, we start with some sort of prior in parameter space --- a Gaussian --- and then try to find the weights maximizing the probability of the parameters (weights), given the training data.  I need to unpack this still further: the regularization term is then the negative log of that prior.  For now I'll proceed, and then return to this later.

Normalization: Claims that we should normalize the regularization parameter by $B / T$, where $B$ is the mini-batch size, and $T$ is the number of training examples.  This is consonant with what I've observed.

Early stopping and L2 regularization: comments that these two are essentially equivalent, and that one may as well drop L2 regularization when engaged in early stopping.  I don't believe this.  The solution spaces will be completely different in the two cases.  I'm happy to believe that \emph{sometimes} they'll give the same result, but see no reason to believe that they'll always give the same outcome.

L1 regularization and feature selection: Comments that this strongly suppresses irrelevant weights.  Also comments that you may wish to consider doing both L1 and L2 regularization, with different regularization parameters.  That seems sensible to me.

Q: An alternative approach to choosing $\lambda$ is to regard it as an extra parameter beyond the weights, and to apply gradient descent to it as well.  How well would this work?  My first instinct is to think that it won't work --- that $\lambda$ will be driven to zero.  But upon more reflection things are more complicated than that.  It'd be interesting to know.

Sparsity: Increased sparsity can be compensated for by a larger number of hidden units.  A sparsity-inducing penalty can be viewed as a way of regularizing.  Note that it's no longer so easy to view this in the Bayesian framework.  Notes that the L1 penalty seems most natural, but is not often used.  Try to push the (mini-batch) average activation to a particular constant.

Neuron nonlinearity: Bengio notes that he's most often used the sigmoid, the tanh, $\max(0, a)$, and the hard tanh.  Interesting remark about the sigmoid not working well as the top layer of a deep supervised net without unsupervised pretraining.  Apparently it's okay for auto-encoders.
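For my own reference, the four nonlinearities as I understand them (hard tanh being the piecewise-linear clip to $[-1, 1]$):

\begin{verbatim}
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):
    return np.tanh(a)

def rectifier(a):       # the max(0, a) unit
    return np.maximum(0.0, a)

def hard_tanh(a):       # linear on [-1, 1], clipped outside
    return np.clip(a, -1.0, 1.0)
\end{verbatim}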
Weight initialization: Sample uniformly on $[-r, r]$, with $r = 4\sqrt{6/(\mbox{fan-in}+\mbox{fan-out})}$.  This will give us a total length equal to roughly the number of layers.

Hyper-parameter selection as an optimization problem: points out the dangers of overfitting your validation data.

Q: When does it make sense to say that we're overfitting?

Approach to parameter search: doing it logarithmically.

Q: Does it make sense to do gradient descent on just a subset of weights at a time?  I do wonder if that wouldn't sometimes yield better results.  Deep learning has something of this flavour.

\section{Bengio 2012}

Bengio, Courville, and Vincent: ``Representation Learning: A Review and New Perspectives'': http://arxiv.org/pdf/1206.5538v2.pdf.

``This paper reviews recent work in the area of unsupervised feature learning and joint training of deep learning, covering advances in probabilistic models, auto-encoders, manifold learning, and deep architectures.''  ``... much of the actual effort in deploying machine learning algorithms goes into the design of preprocessing pipelines and data transformations that result in a representation of the data that can support effective machine learning.''  While I know this last is true, I haven't actually had to do a whole lot of data cleaning myself, yet.  ``What makes one representation better than another?  Given an example, how should we compute its representation, i.e. perform feature extraction?  Also, what are appropriate objectives for learning good representations?''

``Speech was one of the early applications of neural networks, in particular convolutional (or time-delay) neural networks... Microsoft has released in 2012 a new version of their MAVIS... speech system based on deep learning''.

``Transfer learning is the ability of a learning algorithm to exploit commonalities between different learning tasks in order to share statistical strength, and \emph{transfer knowledge} across tasks.''  There are apparently competitions for transfer learning.  I wonder what sorts of problems are being attacked?  ``Of course, the case of jointly predicting outputs for many tasks or classes, i.e., performing \emph{multi-task} learning also enhances the advantages of representation learning algorithms''.

``Unfortunately,... most of these algorithms [SVM etc] only exploit the principle of \emph{local generalization}... they rely on examples to \emph{explicitly map out the wrinkles of the target function}.  Generalization is mostly achieved by a form of local interpolation between neighboring training examples... We advocate learning algorithms that are flexible and non-parametric, but do not rely exclusively on the smoothness assumption.  Instead, we propose to incorporate generic priors such as those enumerated above into representation-learning algorithms.''  This starts to get at the point of view that says that neural network architecture is all about figuring out how we generalize.  If there is a hierarchical structure in how to generalize well, that's what your network will need.  If not, it won't.
``Kernel machines are useful, but they depend on a 1707 | prior definition of a suitable similarity metric, or a feature space 1708 | in which naive similarity metrics suffice. We would like to use the 1709 | data, along with very generic priors, to discovery these features, or 1710 | equivalently, a similarity function.'' 1711 | 1712 | They make a really nice point about expressiveness. ``[H]ow many 1713 | parameters does [a model[ require compared to the number of input 1714 | regions (or configurations) it can distinguish?'' They argue that a 1715 | deep net can distinguish exponentially more regions than more 1716 | conventional approaches. 1717 | 1718 | 1719 | \section{Bottou (2012)} 1720 | 1721 | \link{http://leon.bottou.org/papers/bottou-tricks-2012}{(link)} 1722 | 1723 | Notes that there are theorems about the convergence time for batch 1724 | gradient descent (time is logarithmic in the eventual error), and for 1725 | second-order gradient descent. It's really not clear how valuable 1726 | such results are; I guess it's comforting that they exist. 1727 | 1728 | Notes that there are some powerful results about the convergence of 1729 | stochastic gradient descent, under conditions like $\sum \eta^2 < 1730 | \infty, \sum \eta = \infty$. Apparently the ``Robbins-Siegmund 1731 | theorem'' helps with convergence. The relevant paper is 1732 | \link{http://scholar.google.ca/scholar?cluster=509989913518206088\&hl=en\&as\_sdt=0,5}{here}. 1733 | 1734 | Monitor both the training cost and the validation error: Suggests 1735 | periodically evaluating the validation error during training, and 1736 | stopping training when it hasn't improved after some time. 1737 | 1738 | \section{Ciresan (2012)} 1739 | 1740 | \link{http://arxiv.org/abs/1003.0358}{link} This uses just straight-up 1741 | backprop to train a neural net --- no convolutional nets, no 1742 | pretraining, just online learning with backprop. The main tricks are 1743 | to use numerous deformed training images, and graphics cards to speed 1744 | up learning. Apparently, Simard et al used a single hidden layer with 1745 | 800 neurons to get an accuracy of 99.3 percent on MNIST. (It'd be 1746 | interesting to know whether they deformed the images?) 1747 | 1748 | The paper asks whether it was really true that the pre-training is 1749 | necessary? Can't you just train for a long time? And the answer 1750 | seems to be yes! 1751 | 1752 | They train online, using slightly deformed images, and claim that this 1753 | means they can use the whole MNIST set for validation. This seems 1754 | suspect to me --- it relies on the deformations being more or less 1755 | independent of how the network generalizes. Let's run with it, 1756 | however. 1757 | 1758 | They trained 5 networks, with 2 to 9 hidden layers each. From 1.34 to 1759 | 12.11 million free parameters. They have a variable learning rate 1760 | that shrinks by a constant factor after each epoch, from 0.001 down to 1761 | 0.000001. This seems absolutely crucial to their success. I'm a 1762 | little surprised by the use of the constant factor decrease, since 1763 | that will bound the ``total'' (so to speak) learning distance 1764 | travered, simply because the geometric sum converges. It seems like 1765 | you'd get better performance if you chose a learning schedule where 1766 | terms decreased more slowly, so the sum of the learning rates 1767 | diverged. 
That's true of the hyperbolic function advocated by Bengio
1768 | in his 2012 paper, whose sum will diverge (albeit, only
1769 | logarithmically).  They initialized weights uniformly at random in the
1770 | range -0.05 to 0.05 --- that's close to, but not the same as, the
1771 | $1/\sqrt{\mbox{fan-in}}$ that I've preferred.  They use a tanh
1772 | activation function.
1773 | 
1774 | They used a GPU to do computations.  It apparently sped the
1775 | deformation routine up by a factor of 10, and forwardprop and backprop
1776 | by a factor of 40!  That's a big improvement.
1777 | 
1778 | Typical architecture: 784-1000-500-10 neurons.  They get 0.44 percent
1779 | test error.  That's pretty close to perfect.  The most complex
1780 | architectures were: 784-2500-2000-1500-1000-500-10 and 784-9 x
1781 | 10000-10.  These get test errors of 0.32 and 0.43 percent,
1782 | respectively.  Interestingly, there seem to be some advantages to having
1783 | non-homogeneous numbers in the layers.
1784 | 
1785 | Took 93 CPU seconds to deform the MNIST images.  87 of those seconds
1786 | were for the elastic distortions, so that's what they ported to the
1787 | GPU.  When doing the port they converted MNIST images to 29 x 29
1788 | to get a proper center, which simplifies distortion.
1789 | 
1790 | \section{Ciresan (2012)}
1791 | 
1792 | \link{http://arxiv.org/pdf/1202.2745.pdf}{link}
1793 | 
1794 | Claims that their deep nets can match human performance on recognizing
1795 | handwritten digits and traffic signs.  ``Small (often minimal [what does
1796 | this mean?]) receptive fields of convolutional winner-take-all neurons
1797 | [?] yield large network depth, resulting in roughly as many sparsely
1798 | connected neural layers as found in mammals between retina and visual
1799 | cortex''.  They achieve better-than-human performance on a traffic
1800 | sign benchmark.
1801 | 
1802 | They claim records on MNIST, Latin letters, Chinese characters,
1803 | traffic signs, NORB, and CIFAR10.  ``We will show that properly
1804 | trained big and deep DNNs can outperform all previous methods, and
1805 | demonstrate that unsupervised initialization/pretraining is not
1806 | necessary (although we don't deny that it might help sometimes,
1807 | especially for small datasets).''  Again, we're back to this
1808 | fundamental question: how much does pretraining help?  How necessary
1809 | is it?
1810 | 
1811 | They use winner-take-all neurons.  It occurs to me that this has some
1812 | similarity to sparsity constraints.  Same?  Again, they were inspired
1813 | by Hubel and Wiesel --- simple cells (orientation), and complex cells
1814 | (basically, pooling).
1815 | 
1816 | Very similar architecture to the KSH paper --- convolutional, max
1817 | pooling, convolutional, max pooling, fully connected, fully connected.
1818 | 
1819 | They use this multi-columnar architecture, along lines I've seen
1820 | before in their papers.  They make strong claims that this helps a
1821 | lot.  Worth understanding.  The general idea seems to be to train
1822 | several networks with slightly different training setups, and then to average.
1823 | 
1824 | Scaled tanh for conv and fully conn layers.  Linear activation for
1825 | max-pooling, and softmax for output.  They use an annealed learning
1826 | rate.  They use translations, scaling and rotation during training.
1827 | They use a very simple initial weight distribution: uniform on [-0.05,
1828 | 0.05]!  This really surprises me.
1829 | 
1830 | They train for 800 epochs.
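Just to check the point above about the total summed learning rate, a quick sketch (the 0.001 starting rate and the 800 epochs come from the notes above; the 0.993 per-epoch factor and the time constant in the hyperbolic schedule are illustrative choices, not theirs):

\begin{verbatim}
# Total "distance budget" sum(eta_t) for a geometric schedule versus a
# hyperbolic (1/t-type) schedule of the kind Bengio advocates.
eta0, decay, epochs = 0.001, 0.993, 800

geometric  = [eta0 * decay**t for t in range(epochs)]
hyperbolic = [eta0 / (1.0 + t / 100.0) for t in range(epochs)]  # tau = 100, arbitrary

print(sum(geometric))   # bounded above by eta0/(1-decay) ~ 0.14, however long we train
print(sum(hyperbolic))  # keeps growing (logarithmically) as epochs increases
\end{verbatim}

So with the geometric schedule the summed step sizes saturate, which is the sense in which the total learning distance travelled is bounded.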
1831 | 
1832 | Chinese character recognition: they have 3755 classes, and just 240
1833 | samples per class, and they achieve an error rate of 6.5 percent, which
1834 | is a big improvement over the old record of 10.01 percent.
1835 | 
1836 | The traffic sign results are fascinating.  They use the GTSRB traffic
1837 | sign dataset --- the German Traffic Sign Benchmark.  They do some
1838 | preprocessing, and then apply their deep network.  They get an error
1839 | rate of 0.54 percent on the test set, which is apparently about a
1840 | factor of two lower than humans.  I can see why it's tough --- a lot
1841 | of the images are difficult to see well.  If it rejects the 6.67
1842 | percent of images about which it is least confident, then the system
1843 | makes only a single misclassification (0.01 percent error rate).
1844 | 
1845 | Typical learning schedule (MNIST): 0.001 initialization, decays by
1846 | factor 0.993 after each epoch.
1847 | 
1848 | 
1849 | \section{Domingos (2012)}
1850 | 
1851 | \link{http://scholar.google.ca/scholar?cluster=4404716649035182981\&hl=en\&as\_sdt=0,5}{link}
1852 | 
1853 | He points out that we don't have access to the function we really want
1854 | to optimize, unlike in most optimization problems.  Instead we use
1855 | training error as a proxy for test error.  That's a very interesting
1856 | and strange situation.
1857 | 
1858 | ``Learners combine knowledge with data to grow programs.''
1859 | 
1860 | Overfitting has many faces: ``the bugbear of machine learning''; ``it
1861 | comes in many forms that are not immediately obvious''.
1862 | Generalization error can be decomposed into bias and variance.  Bias
1863 | is the tendency to keep learning the same wrong things.  Variance is the
1864 | tendency to learn random things.  E.g., an SVM (without kernel) may
1865 | have high bias if the data is nowhere close to linearly separable.
1866 | Cross-validation can itself start to overfit.
1867 | 
1868 | Intuition fails in high dimensions: I don't think this is quite right.
1869 | It would be better to say that it needs to be replaced in high
1870 | dimensions.
1871 | 
1872 | Theoretical guarantees are not what they seem: Points out that there
1873 | are effectively guarantees that can (with caveats) be put on
1874 | induction.  Very interesting.  It'd be good to understand this in
1875 | conjunction with the no-free-lunch theorems.
1876 | 
1877 | Feature engineering is the key: Points out that the ``machine
1878 | learning'' part of a machine learning project may be tiny.  More time
1879 | is spent gathering data, cleaning it, and figuring out good input
1880 | features.
1881 | 
1882 | More data beats a cleverer algorithm: ``As a rule, it pays to try the
1883 | simplest learners first''.  ``... the organizations that make the most
1884 | of machine learning are those that have in place an infrastructure
1885 | that makes experimenting with many different learners, data sources
1886 | and learning problems easy and efficient, and where there is a close
1887 | collaboration between machine learning experts and application domain
1888 | ones.''
1889 | 
1890 | Representable does not imply learnable: in other words, don't focus
1891 | all your attention on one representation (say, neural nets, or SVMs)
1892 | merely because there is some kind of universality theorem for them.
1893 | 
1894 | Correlation does not imply causation: Keep it in mind when
1895 | interpreting the results of machine learning algorithms.
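For the record, the standard squared-loss version of the bias-variance decomposition mentioned above: writing $\hat{f}$ for the learned predictor (random, because it depends on the random training set) and $y = f(x) + \epsilon$ with noise variance $\sigma^2$,
\begin{eqnarray}
  E[(y - \hat{f}(x))^2] & = & (f(x) - E[\hat{f}(x)])^2 + E[(\hat{f}(x) - E[\hat{f}(x)])^2] + \sigma^2,
\end{eqnarray}
i.e., squared bias plus variance plus irreducible noise.  (This is the squared-loss version; for other losses the decomposition is messier, so treat this as the cartoon.)  The linear-SVM example above is then a high-bias failure: no amount of extra data moves the average predictor onto a target that is far from linearly separable.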
1896 | 1897 | \section{Hinton 2012 --- Coursera} 1898 | 1899 | \textbf{Lecture 5 b: Object recognition:} If you want to solve 1900 | computer vision, it may help to find features that are invariant under 1901 | things like rotation, translation, and so on. Example: parallel lines 1902 | with a red dot between them. This is invariant under rotation and 1903 | translation, but may actually be quite a useful feature. I guess I 1904 | can imagine similar features being use to recognize an eye. 1905 | Relationship between features may themselves be captured by other 1906 | features. The idea of normalizing an image: once normalized, it may 1907 | be easier to extract features. Of course, that then requires us to 1908 | solve the problem: how to normalize? (Hinton claims, without 1909 | presenting anything so gauche as actual evidence, that we don't 1910 | mentally rotate images to recognize them.) One approach to 1911 | normalization: brute force approach, trying all possible boxes, in a 1912 | wide range of positions and scales. 1913 | 1914 | \textbf{Lecture 5c: Convolutional neural networks for handwriting 1915 | recognition:} Early example of deep neural nets, from the 1980s. 1916 | The idea is to \emph{replicate features}. So an edge is a good 1917 | feature --- and if it's a good feature at one point in the visual 1918 | field, then it's probably a good feature at other points in the visual 1919 | field. Put another way, a feature detector that's useful at some 1920 | point in the visual field is likely to be useful elsewhere, too. 1921 | Replication across position reduces the number of parameters to be 1922 | learned. It's easy to learn replicated features with backpropagation. 1923 | I guess we just constrain the weights to be the same. So we want 1924 | $\Delta w_1 = \Delta w_2$. We just average the gradients across 1925 | partial derivatives. An advantage is that if we can learn to detect a 1926 | feature in one place, then we learn how to detect it in other places. 1927 | Hinton advocates against rotational or scale invariance. I don't know 1928 | if that's a good idea, frankly --- it seems to me that with modern 1929 | computers that may be practical. The idea of pooling adjacent 1930 | replicated features. Hinton advocates either averaging or the max (he 1931 | says max is a little better). LeNet was used to read something like 1932 | 10 percent of all checks in North America, according to Hinton. 1933 | There's still a frontier associated to MNIST, and it may be worth 1934 | trying to push that frontier. The idea of generating synthetic data 1935 | (in part to reduce overfitting). McNemar test. 1936 | 1937 | \textbf{Lecture 5d: Convolutional neural networks for object 1938 | recognition:} Apparently most people doing vision with neural nets 1939 | have switched to using rectified linear activation function, not just 1940 | a sigma function. A good paper on this appears to be ``Deep Sparse 1941 | Rectifier Neural Networks'' (Bengio et al). Use left-right reflection 1942 | of images to get more training data. And use image subsets to get 1943 | more training data. Uses GPUs: 500 cores per GPU, very fast at 1944 | matrix-by-matrix arithmetic, very high bandwidth to memory. 1945 | 1946 | \textbf{Lecture 6a: stochastic mini-batch gradient descent:} Hinton 1947 | calls this the most frequently used algorithm for training neural 1948 | networks. He says it's often preferable even to techniques from the 1949 | optimization community. 
How to choose a learning rate: if the error 1950 | keeps getting worse or oscillates wildly, reduce the learning rate. 1951 | If the error is falling slowly, increase the learning rate. Do this 1952 | all automatically. 1953 | 1954 | \textbf{Lecture 15a:} PCA: Lots of data in a very high-dimensional 1955 | space. But maybe there's a low-dimensional manifold on which most of 1956 | the data lies. In some sense that manifold captures much of the 1957 | structure in the data. What we want is a projector onto a 1958 | lower-dimensional subspace. Suppose $x_1, x_2, \ldots, x_m$ are our 1959 | data points. Obvious idea is to stick . 1960 | 1961 | 1962 | \section{Hinton (2012) - videos} 1963 | 1964 | \link{https://www.ipam.ucla.edu/schedule.aspx?pc=gss2012}{IPAM Summer School videos} 1965 | 1966 | Attributes backprop to \link{http://www.werbos.com/}{Paul Werbos}. It 1967 | was done in his 1974 PhD thesis. Hinton also lists several others, 1968 | including Amari, Parker, and LeCun. Points out that deep learning 1969 | didn't work well, except in time delay and convolutional networks. 1970 | Says that part of the reason Werbos was ignored was because he was 1971 | applying backprop to econometrics, where it was hard to see the value. 1972 | 1973 | Why deep learning is feasible today: He starts with simple raw speed, 1974 | not pre-training, interestingly enough. He also says that there's 1975 | been a ``small improvement in the theory''. Says the biggest 1976 | disappointment with backprop was that it didn't work with recurrent 1977 | neural nets. I don't understand how this squares with Williams and 1978 | Zipser. 1979 | 1980 | ``On the whole backpropagation fell out of favour because it failed to 1981 | be able to learn multiple layers of features.'' Says that 1982 | convolutional nets were the only ones where deep learning worked. 1983 | 1984 | ``Almost everything I used to believe about backpropagation is 1985 | wrong.'' 1986 | 1987 | ``What is wrong with back-propagation? It requires labeled training 1988 | data. [Well, no, not if you use ideas like autoencoders.]'' 1989 | 1990 | Why is the learning time slow in deep nets? If you use the right 1991 | scales for the weights, you can do some of this learning much faster. 1992 | 1993 | He's strongly emphasizing the unsupervised learning / feature learning 1994 | point of view. Basically, pretraining to initialize, and then 1995 | fine-tuning with labelled data and backprop. 1996 | 1997 | ``You can get a lot of knowledge into the network by messing with the 1998 | training data.'' Analogizes to education. 1999 | 2000 | On the advantages of generative models: learn $p({\rm image})$, not 2001 | $p({\rm label | image})$. ``If you want to do computer vision, first 2002 | learn computer graphics.'' I think that overstates the case, but 2003 | there's something to it. Reminiscent of the idea that learning is 2004 | really memory. 2005 | 2006 | \textbf{Belief nets:} A directed acyclic graph of stochastic 2007 | variables. We learn the values of some variables. We'd like to infer 2008 | the states of the other variables. And we'd like to adjust the 2009 | interactions between variables to make the network more likely to 2010 | generate the observed data. (Obviously, this is all very close to 2011 | Pearl's causal models.) 2012 | 2013 | Neat point about stochastic models: it lowers the communication cost 2014 | in distributed models. Send 1 bit instead of 32 or 64 bit float. 
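Going back to the Lecture 6a rule for adjusting the learning rate automatically, a minimal sketch of how one might automate it (the 0.5 and 1.1 factors and the thresholds are my own guesses, purely for illustration):

\begin{verbatim}
def adjust_learning_rate(lr, errors, shrink=0.5, grow=1.1):
    """Crude version of the Lecture 6a rule: shrink the rate if the error
    got worse or is oscillating, grow it gently if it is only falling
    slowly.  All constants here are placeholders."""
    if len(errors) < 3:
        return lr
    last, prev, prev2 = errors[-1], errors[-2], errors[-3]
    oscillating = (last - prev) * (prev - prev2) < 0
    if last > prev or oscillating:
        return lr * shrink
    if prev - last < 0.01 * prev:      # improving, but slowly
        return lr * grow
    return lr
\end{verbatim}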
2015 | 2016 | Points out that while it's easy to generate examples at the leaf 2017 | nodes, it's hard to infer causes. Yet that's exactly what we want to 2018 | do. 2019 | 2020 | Suppose we observe some output data. Let's suppose we can sample the 2021 | hidden states in an unbiased fashion. Now update the weight by 2022 | $\Delta w_{ji} = \eta s_j (s_i-p_i)$. Note that here, $j$ is a 2023 | parent, and $i$ is a child node. This is more or less Hebb's rule. 2024 | ``Nice local learning rule''. 2025 | 2026 | Monte Carlo methods: painfully slow for large, deep models. Can be 2027 | used to sample from the posterior. Variational methods are much 2028 | faster. You get the wrong result, but it's bounded away from the 2029 | right result. ``Inferring the wrong posterior and then doing learning 2030 | anyway''. 2031 | 2032 | RBMs: The feature detectors are genuinely independent (given the 2033 | data). The posterior distribution is easy to sample from. The 2034 | partition function makes learning difficult. He derives a nice quick 2035 | way to learn an RBM: $\Delta w_{ij} = \eta *$ a difference of 2036 | averages of correlations between visible and hidden units. 2037 | 2038 | \section{Krizhevsky (2012)} 2039 | 2040 | \link{http://www.cs.toronto.edu/\~hinton/absps/imagenet.pdf}{link} 1.2 2041 | million images in ImageNet 2010. 1000 classes. 650,000 neurons. 2042 | Five convolutional layers. Max pooling layers. Three fully-connected 2043 | layers. 1000-way softmax. Used dropout to prevent overfitting. 2044 | 2045 | Past image data sets: NORB, Caltech-101/256. CIFAR-10/100. ``Simple 2046 | recognition tasks can be solved quite well with datasets of this size, 2047 | especially if they are augmented with label-preserving 2048 | transformations.'' ``But objects in realistic settings exhibit 2049 | considerable variability, so to learn to recognize them it is 2050 | necessary to use much larger training sets''. LabelMe: hundreds of 2051 | thousands of fully-segmented images. ImageNet: 15 million labeled 2052 | high-res images in over 22,000 categories. 2053 | 2054 | This paper: trained a very large convolutional neural net on subsets 2055 | of ImageNet used in two competitions. Got by far the best results 2056 | ever reported on those data sets. Removing any convolutional layer 2057 | significantly decreased performance. ``All of our experiments suggest 2058 | that our results can be improved simply by waiting for faster GPUs and 2059 | bigger datasets to become available.'' 2060 | 2061 | ImageNet: 15 million images, 22,000 categories. ILSVRC: 1000 images 2062 | in 1000 categories. 1.2 million training images, 50,000 validation 2063 | images, and 150,000 testing images. ILSVRC-2010: test set labels are 2064 | available. Top-5 error rate: the fraction of test images for which 2065 | the correct label is not among the five labels considered most 2066 | probable by the model. 2067 | 2068 | ImageNet has variable-resolution images. They down-sampled to 256 2069 | $\times$ 256. They did this by rescaling the image so the shorter 2070 | side was of length 256. Then cropped out the central 256 $\times$ 256 2071 | patch. They also subtracted the mean activity over the training set 2072 | from each pixel. This was the complete pre-processing. 2073 | 2074 | Architecture: 8 layers. 5 convolutional. 3 fully-connected. 2075 | 2076 | ReLU Nonlinearity: Instead of sigmoid function they used $f(z) = 2077 | \max(z, 0)$. They refer to this as a \emph{rectified linear} unit. 
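A toy comparison of the two activation functions and their gradients, just to make the saturating versus non-saturating point concrete (the input values are arbitrary):

\begin{verbatim}
import numpy as np

z = np.array([-10.0, -1.0, 0.5, 10.0])

sigmoid = 1.0 / (1.0 + np.exp(-z))
relu    = np.maximum(z, 0.0)

sigmoid_grad = sigmoid * (1.0 - sigmoid)   # ~4.5e-5 at |z| = 10: saturated
relu_grad    = (z > 0).astype(float)       # exactly 1 for any positive z
\end{verbatim}

The quote that follows about faster training is presumably, at least in part, about exactly this: the ReLU's gradient doesn't vanish for large positive inputs.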
2078 | ``Deep convolutional neural networks with ReLUs train several times 2079 | faster than their equivalents with tanh units''. I believe it is 2080 | standard wisdom that convolutional nets work better with tanh units 2081 | than sigmoid. ``The magnitude of the effect [faster learning...] 2082 | varies with network architecture, but networks with ReLUs consistently 2083 | learn several times faster than equivalents with saturating neurons''. 2084 | 2085 | Training on multiple GPUs: Done in part because the training set 2086 | wouldn't fit into a single GPU's memory. 2087 | 2088 | Local response normalization: They do a local normalization step, 2089 | essentially a kind of brightness normalization. It reduces error 2090 | rates by a little over 1 percent. 2091 | 2092 | Overlapping pooling: Again, a slight improvement. 2093 | 2094 | Architecture: The first convolutional layer filters the 224 by 224 by 2095 | 3 image with 96 kernels of size 11 by 11 by 3. There is a stride 2096 | distance of 4, i.e., the distance between the receptive field centers 2097 | of neighbouring neurons. I need to understand quite a bit more about 2098 | CNNs and pooling. 2099 | 2100 | Lots of overfitting: 1.2 million examples, 10 bits of info per example 2101 | (1 in 1000 classification). But 60 million parameters. So 2102 | overfitting is a real problem. 2103 | 2104 | Data augmentation: (1) image translations and horizontal reflections. 2105 | Extracting 224 by 224 patches. This gives them a factor 2048 more 2106 | training data. The network makes a prediction by extracting five 224 2107 | by 224 patches and their horizontal reflections, and averaging the 2108 | predictions made by the network's softmax layer. (2) Altering the 2109 | intensities of the RGB channels in the training images. Perform PCA 2110 | on ImageNet and use it to modify the images. ``This scheme 2111 | approximately captures an important property of natural images, 2112 | namely, that object identity is invariant to changes in the intensity 2113 | and color of the illumination.'' 2114 | 2115 | Dropout: ``a very efficient version of model combination that only 2116 | cost about a factor of two during training''. Set to zero the output 2117 | of each hidden neuron with probability 0.5. Don't contribute to 2118 | forwardprop nor to backprop. Every time the network is trained it has 2119 | a different architecture, but the architectures share weights. ``This 2120 | technique reduces complex co-adaptation so neurons, since a neuron 2121 | cannot rely on the presence of other neurons''. This is going to be 2122 | useful in very large networks with a relative paucity of data. 2123 | ``Without dropout, our network exhibits substantial overfitting. 2124 | Dropout roughly doubles the number of iterations required to 2125 | converge.'' 2126 | 2127 | Used SGD with momentum. ``We used an equal learning rate for all 2128 | layers, which we adjusted manually throughout training. The heuristic 2129 | which we followed was to divide the learning rate by 10 when the 2130 | validation error rate stopped improving with the current learning 2131 | rate. The learning rate was initialized at 0.01 and reduced three 2132 | times prior to termination.'' That seems like a useful heuristic. 2133 | 2134 | Results: top-1 test set error rate: 37.5 percent. top-5 test set 2135 | error rate: 17.0 percent. That seems incredibly good, although not 2136 | human comparable. They also report a bunch of other results: every 2137 | single one is very, very good. 
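Before moving on: the divide-by-ten heuristic quoted above seems worth keeping around.  A minimal sketch of it (the patience window and the floor on the rate are my own choices, not Krizhevsky et al's):

\begin{verbatim}
def step_learning_rate(lr, val_errors, patience=5, factor=10.0, min_lr=1e-5):
    """Divide the learning rate by `factor` when the validation error has
    not improved over the last `patience` evaluations, in the spirit of
    the heuristic described above."""
    if len(val_errors) <= patience:
        return lr
    recent_best  = min(val_errors[-patience:])
    earlier_best = min(val_errors[:-patience])
    if recent_best >= earlier_best:            # no recent improvement
        return max(lr / factor, min_lr)
    return lr
\end{verbatim}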
2138 | 2139 | \section{Le (2012)} 2140 | 2141 | \link{http://ai.stanford.edu/\~ang/papers/icml12-HighLevelFeaturesUsingUnsupervisedLearning.pdf}{Building 2142 | high-level features using large-scale unsupervised learning} 2143 | 2144 | I have described the architecture elsewhere. Let me describe some of 2145 | the results. They took 13,026 faces from Labeled Faces in The Wild, 2146 | and about 24,000 distractor objects from ImageNet. I'm not especially 2147 | keen on this procedure --- it seems like there might be very easy ways 2148 | to distinguish the two data sets that have little to do with whether 2149 | or not a face is present. 2150 | 2151 | ``After training, we used this test set to measure the performance of 2152 | each neuron in classifying faces against distractors. For each 2153 | neuron, we found its maximum and minimum activation thresholds, then 2154 | picked 20 equally spaced thresholds in between. The reported accuracy 2155 | is the best classification accuracy among 20 thresholds.'' There's a 2156 | lot that's not being said here. Which neurons are we considering? 2157 | Every neuron in the network? Or just in the last layer? And it's not 2158 | actually stated that higher activations are the right criterion (as 2159 | opposed to lower). However, I think I can reasonably infer that's the 2160 | case, because of the pooling. 2161 | 2162 | ``The best neuron in the network achieves 81.7 percent accuracy in 2163 | detecting faces''. That's compared to a guessing strategy, which 2164 | achieves 64.8 percent. They found that removing the local contrast 2165 | normalization reduced this number to 78.5 percent. 2166 | 2167 | The performed a numerical optimization to find the optimal stimulus. 2168 | This really is quite striking: it's definitely a face! 2169 | 2170 | They did a very interesting control experiment, removing the faces 2171 | from the unlabelled training data, using OpenCV. The recognition 2172 | accuracy of the best neuron dropped to 72.5 percent. So it's not just 2173 | that we're detecting ImageNet versus Labeled Faces in the Wild. 2174 | 2175 | Invariance properties: They used 10 face images and did some scaling, 2176 | rotation, x and y translations. The face feature detector still 2177 | worked pretty well for rotation, not so well for the other operations. 2178 | Still, it's interesting that this is possible at all, especially since 2179 | rotation and scaling are going to be hard to build into the network. 2180 | 2181 | \textbf{Other feature detectors:} They repeated the face detector 2182 | work, but with cats and human body parts. They got 74.8 \% and 76.7 2183 | \%, respectively. The data sets were constructed so that random 2184 | guessing would give 64.8 percent, as for the faces. They also tried 2185 | some deep autoencoder experiments, and found that while there were 2186 | selective neurons in those networks, they weren't nearly as good as 2187 | with the muli-RICA-layer architecture. 2188 | 2189 | \textbf{ImageNet:} On the 2011 data set --- 16 million images, 20,000 2190 | categories --- they achieved 15.8 percent accuracy, a huge jump over 2191 | the best (9.3 percent) results. 2192 | 2193 | \section{Le (2012), video} 2194 | 2195 | ``Tera-scale deep learning'': \link{http://vimeo.com/52332329}{link} 2196 | 2197 | Problems with standard hand-crafted features: the features may not 2198 | generalize to another domain; the features take a long time to 2199 | develop. (Recall Hinton: the time of hand-engineered features is 2200 | over.) 
``We're still stuck at SIFT and HOG''. 2201 | 2202 | RICA: Built on TICA (topographic independent component analysis). We 2203 | have some data, $x^i$. Take a 3 by 2204 | 2205 | RICA can learn from any (unlabelled) data. Can learn features from 2206 | videos to do action recognition. E.g. ``Get out of car''. ``Eat''. 2207 | And so on. Very interesting features! SIFT / HoG is also used for 2208 | video, apparently. 2209 | 2210 | Four most famous activity recognition data sets: KTH. 2211 | Hollywood2. UCF. YouTube. They outperform SIFT / HoG on four 2212 | best-known data sets. 2213 | 2214 | On cancer / MRI: our usual intuition breaks down. It really helps to 2215 | be able to automatically discover features. 2216 | 2217 | ``Scaling up deep RICA'': This is the way to think about what the 2218 | Google-Stanford paper does. 2219 | 2220 | ``Using a thousand machines alone is not enough.'' They needed to 2221 | change their algorithms. 2222 | 2223 | ``Higher layer [i.e., later features] are very difficult to 2224 | visualize.'' I don't understand why that is the case. 2225 | 2226 | Two main ideas in scaling up: local connectivity; asynchronous SGD. 2227 | 2228 | 1 billion parameters. I wonder how they avoid overfitting? 2229 | 2230 | They pick out a neuron in the top layer. They look to see: which 2231 | images in the test (?) set stimulate that neuron the most? And then 2232 | they do a numerical optimization to figure out what the optimal input 2233 | stimulus is. 2234 | 2235 | Classify ``sting ray'' versus ``manta ray'' in ImageNet. 2236 | 2237 | \section{Ng (2012) --- video} 2238 | 2239 | \link{https://www.ipam.ucla.edu/schedule.aspx?pc=gss2012}{Ng's 2240 | contribution to 2012 IPAM workshop} 2241 | 2242 | ``Instead of doing AI, we ended up spending our lives doing curve 2243 | fitting.'' 2244 | 2245 | Has a nice example of building a motorcycle recognizer by building a 2246 | feature classifier to determine whether there are wheels or 2247 | handlebars, and then adding an extra classifier layer. There's 2248 | actually a general story here: we can recursively do this. 2249 | 2250 | He makes stronger claims about the brain rewiring (e.g., in ferrets) 2251 | than I've heard before. Says that it's been done in four animal 2252 | species. (I wasn't immediately able to find this using Google 2253 | Scholar, it'd be interesting to see sources.) Also says that it's 2254 | vision in every sense that he understands what vision means. I must 2255 | admit I didn't get that out of Sur's original paper --- it seemed like 2256 | a much coarser sense of vision. It'd be interesting to know if 2257 | finer-grained tests have since been done. 2258 | 2259 | ``The complexity of the trained algorithm comes from the data, not the 2260 | algorithm.'' 2261 | 2262 | Distinguishes semi-supervised versus self-taught learning 2263 | (unsupervised feature learning). The problem with semi-supervised 2264 | learning is that it requires some broad constraints on classses --- 2265 | e.g., we need all images to be of cars or motorcycles. Self-taught 2266 | learning is much easier. 2267 | 2268 | Sparse coding: Learns a dictionary of basis functions so that each 2269 | training image can be decomposed sparsely in terms of basis functions. 2270 | (Use a ``sparsity penalty term''.) If you train sparse coding on 2271 | natural images you get edge detectors. 
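The objective being described is, as I understand it, the standard sparse coding one (this is the textbook form, not necessarily Ng's exact formulation):
\begin{eqnarray}
  \min_{D, a^{(1)}, \ldots, a^{(m)}} \sum_i \| x^{(i)} - D a^{(i)} \|^2
  + \lambda \sum_i \| a^{(i)} \|_1,
\end{eqnarray}
with the columns of the dictionary $D$ constrained to have bounded norm, so the penalty can't be dodged just by rescaling.  The $\|a\|_1$ term is the ``sparsity penalty term'' mentioned above, and the learned columns of $D$ are the edge-like basis functions.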
``It's more useful to know
2272 | where the edges are in an image than where the pixels are.''  ``It
2273 | gives us an alternate way of representing the image.''  ``ICA version
2274 | of sparse coding.''  Recursively do sparse coding.
2275 | 
2276 | Sparse deep belief network (Honglak Lee).  First layer: edges.  Second
2277 | layer: models of object parts.  Third layer: object models.
2278 | 
2279 | Ng believes it is mostly scalability that is the issue.  More
2280 | features, more data => better results.  What appears superficially to
2281 | be an algorithmic superiority is really about the availability of more
2282 | data, or more computational power that allows more features to be
2283 | learned.
2284 | 
2285 | Should we use higher-order algorithms?  Conjugate gradient?  L-BFGS.
2286 | Gradient descent with line search.  Black-box algorithms which will
2287 | take the gradient and cost and just work.  Ng's favourite: L-BFGS.
2288 | 
2289 | ``The most reliable indicator of whether [a new grad student] has got
2290 | gradient descent to work is whether they do gradient checking.''  The
2291 | problem: buggy implementations will learn.  Just not as well as a
2292 | correct implementation.
2293 | 
2294 | \section{Sutskever (2013)}
2295 | 
2296 | \link{http://www.cs.utoronto.ca/\~ilya/pubs/2013/1051_2.pdf}{On the
2297 | importance of initialization and momentum in deep learning}
2298 | 
2299 | ``Deep and recurrent neural networks... are powerful models that were
2300 | considered to be almost impossible to train using stochastic gradient
2301 | descent with momentum.  In this paper, we show that when stochastic
2302 | gradient descent with momentum uses a well-designed random
2303 | initialization and a particular type of slowly increasing schedule for
2304 | the momentum parameters, it can train both... to levels of performance
2305 | that were previously achievable only with Hessian-Free optimization.
2306 | We find that both the initialization and the momentum are crucial
2307 | since poorly initialized networks cannot be trained with momentum and
2308 | well-initialized networks perform markedly worse when the momentum is
2309 | absent or poorly tuned.  Our success training these models suggests
2310 | that previous attempts to train deep and recurrent neural networks
2311 | from random initializations have likely failed due to poor
2312 | initialization schemes.''  In other words, we can train deep neural
2313 | nets with (momentum-based) stochastic gradient descent,
2314 | \emph{provided} we're careful about how we initialize the weights, and
2315 | provided we do the appropriate things with the momentum.
2316 | 
2317 | ``Martens (2010) attracted considerable attention by showing
2318 | that... Hessian-free Optimization... is capable of training [deep
2319 | neural nets] from certain random initializations without the use of
2320 | pre-training, and can achieve lower errors for the various
2321 | auto-encoding tasks considered by Hinton and Salakhutdinov (2006).''
2322 | 
2323 | The picture that is starting to appear: Overall achievement = Quality
2324 | of algorithm + quantity of data + number of features + amount of
2325 | computing time.
2326 | 
2327 | ``The first contribution of this paper is a much more thorough
2328 | investigation of the difficulty of training deep and temporal networks
2329 | than has been previously done...
We show that while a definite
2330 | performance gap seems to exist between plain SGD and HF on certain
2331 | deep and temporal learning problems, this gap can be eliminated or
2332 | nearly eliminated... by careful use of classical momentum methods or
2333 | Nesterov's accelerated gradient.''
2334 | 
2335 | Apparently Polyak introduced the momentum technique, and obtained some
2336 | results on how much faster it can be than position-based techniques.
2337 | 
2338 | NAG: Nesterov's Accelerated Gradient.  ``[F]or general smooth
2339 | (non-strongly) convex functions and a deterministic gradient, NAG
2340 | achieves a global convergence rate of $O(1/T^2)$ (versus the $O(1/T)$
2341 | of gradient descent), with constant proportional to the Lipschitz
2342 | coefficient of the derivative and the squared Euclidean distance to
2343 | the solution.''  I don't know what $T$ is here (presumably the number of iterations).  NAG turns out to be a
2344 | variation on the momentum method, with the only difference being that
2345 | we compute the gradient at the updated position.  ``While the
2346 | classical convergence theories for both methods [NAG and momentum]
2347 | rely on noiseless gradient estimates (i.e., not stochastic), with some
2348 | care in practice they are both applicable to the stochastic setting.''
2349 | ``However, the theory predicts that any advantages in terms of
2350 | asymptotic local rate of convergence will be lost... a result also
2351 | confirmed in experiments... For these reasons, interest in momentum
2352 | methods diminished after they had received substantial attention in
2353 | the 90's.  And because of this apparent incompatibility with
2354 | stochastic optimization, some authors even discourage using momentum
2355 | or downplay its potential advantages''
2356 | 
2357 | The key point seems to be to separate out two timescales.  One is the
2358 | initial transient phase, when we're still hopping between regions of
2359 | different local minima, before the phase of fine local convergence.
2360 | ``[I]n practice, the `transient phase'... seems to matter a whole lot
2361 | more for optimizing deep neural networks.  In this transient phase of
2362 | learning, directions of reduction in the objective tend to persist
2363 | across many successive gradient estimates and are not completely
2364 | swamped by noise.''  ``Thus, for convex objectives, momentum-based
2365 | methods will outperform SGD in the early or transient stages of the
2366 | optimization where $L/T$ is the dominant term.''  Here, $L$ is the
2367 | Lipschitz coefficient of the gradient.
2368 | 
2369 | Why NAG works: ``This benign-looking difference seems to allow NAG to
2370 | change $v$ in a quicker and more responsive way, letting it behave
2371 | more stably than CM [classical momentum] in many situations,
2372 | especially for higher values of $\mu$.  Indeed, consider the situation
2373 | where the addition of $\mu v_t$ results in an immediate undesirable
2374 | increase in the objective $f$.  The gradient correction to the
2375 | velocity $v_t$ is computed at position $\theta_t + \mu v_t$ and if
2376 | $\mu v_t$ is indeed a poor update, then [the gradient at the new
2377 | position] will point back toward $\theta_t$ more strongly than [the
2378 | gradient at the old position], thus providing a larger and more timely
2379 | correction to $v_t$ than CM.''  I don't think this is quite right.
2380 | It's not a question of pointing back to the original position.
It's a 2381 | question of pointing in the right direction, which may be different 2382 | than the direction of the original position. Still, this line of 2383 | reasoning otherwise seems sound. (And it's probably true in two 2384 | dimensions.) 2385 | 2386 | ``While each iteration of NAG may only be slightly more effective than 2387 | CM at correcting a large and inappropriate velocity, this difference 2388 | in effectiveness may compound as the algorithms iterate.'' I don't 2389 | know that this is the case. It seems more likely to me that rather 2390 | than accumulating small improvements, it actually is preventing 2391 | occasional bad mistakes. 2392 | 2393 | There is a nice example in the appendix. Basically, a 2d example, 2394 | with very elongated ellipses as the contours. The momentum method 2395 | (with low friction) has the problem that it only very slowly builds up 2396 | momentum in the right direction. Basically it overshoots early on, 2397 | and then has to swing backwards and forwards, slowly. NAG avoids 2398 | this, even though it also has low values of momentum. 2399 | 2400 | They analyse CM and NAG for the objective function $C(y) = \sum_j 2401 | \lambda_j y_j^2+ c_j y_j$. In this particular case, they prove that 2402 | NAG acts like the classical momentum technique with learning rate 2403 | $\eta$, but with a modified momentum $\mu(1-\eta \lambda_j)$ in 2404 | component $j$. It should be possible to prove this through a 2405 | straightforward computation. 2406 | 2407 | This means that for small learning rates CM and NAG become very 2408 | similar. Note that locally every (smooth, convex) cost function may 2409 | be approximated by a quadratic cost functional (i.e., approximating by 2410 | the appropriate quadratic locally), and so this behaviour is likely 2411 | generic. We also see that NAG is going to have lower momentums, and 2412 | so more friction; it will tend to damp out oscillations. The decrease 2413 | in momentum will be particularly high when $\lambda_j$ is large. This 2414 | is good: it decreases the momentum a lot, and so increases the 2415 | friction a lot, which damps out the overoscillation that will cause a 2416 | problem as we go through the $y_j = 0$. 2417 | 2418 | Takeaway technique here: to understand an optimization technique it 2419 | can really help to look at quadratic trial functions like this, where 2420 | it may be possible to analyse behaviour analytically. In particular, 2421 | the big benefit of analysing quadratic cost functions is that they 2422 | really do carry most of the (local) information we'll ever need. 2423 | 2424 | ``The aim of our experiments is three-fold. First, to investigate the 2425 | attainable performance of stochastic momentum methods on deep 2426 | autoencoders starting from well-designed random initializations; 2427 | second, to explore the importance and effect of the schedule for the 2428 | momentum parameter $\mu$ assuming an optimal fixed choice of the 2429 | learning rate...; and third, to compare the performance of NAG versus 2430 | CM.'' 2431 | 2432 | They don't look at test errors --- i.e., they ignore regularization 2433 | and overfitting. Not sure how bulletproof their argument for this is 2434 | --- it seems a bit like special pleading. But I'm happy to run with 2435 | it: the point is that optimization can be treated separately from 2436 | generalization. (Of course, it may be that a better optimization 2437 | method is worse at generalization, and that ultimately needs to be 2438 | looked into.) 
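To keep the CM / NAG distinction straight, here is the pair of updates on a one-dimensional quadratic (a sketch of the textbook update rules with made-up constants, not the paper's experimental setup):

\begin{verbatim}
# CM vs NAG on f(x) = 0.5 * lam * x**2, whose gradient is lam * x.
lam, eta, mu = 10.0, 0.09, 0.9

def run(nesterov, steps=50):
    x, v = 1.0, 0.0
    for _ in range(steps):
        lookahead = x + mu * v if nesterov else x  # NAG: gradient at the partial update
        v = mu * v - eta * lam * lookahead
        x = x + v
    return x

print(run(nesterov=False), run(nesterov=True))
\end{verbatim}

With these constants $\eta\lambda = 0.9$, so the effective NAG momentum is $\mu(1-\eta\lambda) = 0.09$, much smaller than $\mu = 0.9$; this is exactly the $\mu(1-\eta\lambda_j)$ result quoted above, and it is why NAG damps the oscillations that CM suffers from here.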
2439 | 
2440 | The schedule for $\mu$ which they used was to take the smaller of
2441 | $\mu_{\rm max}$ and $1-1/(2(\lfloor t/250 \rfloor +1))$, where $t$ is
2442 | the epoch number.  In other words, over the first 250 epochs, $\mu
2443 | = 1/2$.  Then we switch to $3/4$.  Then to $5/6$.  And so on, until we
2444 | get to $\mu_{\rm max}$.  They have a nice explanation for this.
2445 | Basically, use the $1-1/t$ type schedule when the function isn't
2446 | convex --- this will help us explore and gradually find a good
2447 | locality.  But once that is found it's better to switch to a constant
2448 | rate, which will converge exponentially quickly.  So this is a nice
2449 | hybrid.
2450 | 
2451 | Actually, that's not quite the full schedule.  It turns out that they
2452 | do a final modification of $\mu$ for the tail end of training,
2453 | reducing it to another constant.  They have a nice heuristic
2454 | explanation for this, which I won't get into here, but should perhaps
2455 | come back to in future.
2456 | 
2457 | The results are impressive: much, much better than basic stochastic
2458 | gradient descent.
2459 | 
2460 | They also investigate recurrent neural networks.  I'm less familiar
2461 | with these, so I'll just quickly write out some very telegraphic and
2462 | incomplete notes.  Echo-state networks.  ``ESNs ... have achieved high
2463 | performance on tasks with long range dependencies (?)''  ``RNNs were
2464 | believed to be almost impossible to successfully train on such
2465 | datasets [with long-range temporal dependencies], due to various
2466 | difficulties such as vanishing/exploding gradients''  Interesting
2467 | comments on the effect of the spectral radius of the hidden-to-hidden
2468 | matrix on the dynamics of an RNN.  When are RNNs likely to be useful?  ``The main
2469 | achievement of these results is a demonstration of the ability of
2470 | momentum methods to cope with long-range temporal dependency training
2471 | tasks to a level which seems sufficient for most practical purposes.''
2472 | In practice, of course, many (most?) interesting human cognitive tasks
2473 | involve long-range temporal dependency: I do action X now, then must
2474 | do Y later, then Z (which depends on X) later still, and so on.  RNNs
2475 | seem like they might be especially useful for ``chains of thought'' as
2476 | opposed to pattern recognition.
2477 | 
2478 | Comparison to HF: They note that HF is a truncated Newton method.
2479 | Sounds like it's an improved linear conjugate gradient
2480 | method.  ``[Conjugate gradient] accumulates information as it iterates
2481 | which allows it to be optimal in a much stronger sense than any other
2482 | first-order method (like NAG)''
2483 | 
2484 | The idea behind all these first-order methods --- momentum-based and
2485 | NAG --- seems to be to find indirect ways of putting curvature into
2486 | the problem, by computing gradients at two separate points.  This
2487 | gives us information about the locally approximating quadratic, rather
2488 | than the locally approximating plane.  It seems as though you could do
2489 | better by making use of even more points --- three or four would give
2490 | you still higher-order approximations.  (They still wouldn't give you
2491 | global information, though.)
2492 | 
2493 | I need to get clear on the relationship of curvature to gradient
2494 | descent.  The basic point is that if the cost surface is highly curved
2495 | in some direction, gradient descent will tend to send us in that
2496 | direction.
That's not always what we want. Sometimes we want to move 2497 | off along low curvature directions as well. That's typically the case 2498 | for a general (positive-definite) quadratic. 2499 | 2500 | \section{Summary of CIFAR-10 results} 2501 | 2502 | As at July, 2013. I have drawn heavily on the compendium of results 2503 | by 2504 | \link{http://rodrigob.github.io/are\_we\_there\_yet/build/classification_datasets_results.html}{Rodrigo 2505 | Benenson}. Note that CIFAR-10 contains 10 classes, with 5,000 2506 | training images per class, and 1,000 test images per class. Images 2507 | are 32 by 32, and in RGB. Not centred or size-normalized. 2508 | 2509 | Note that the accounts below are not at all complete, they are 2510 | intended as a quick first cut. Several of these should be 2511 | investigated in much more depth. 2512 | 2513 | Karpathy on CIFAR: http://karpathy.ca/myblog/2011/04/27/lessons-learned-from-manually-classifying-cifar-10-with-code/ 2514 | 2515 | \textbf{Snoek, Larochelle, and Adams (2012)} 2516 | (\link{http://www.cs.toronto.edu/\~jasper/bayesopt.pdf}{link}): 2517 | ``Practical Bayesian Optimization of Machine Learning Algorithms''. 2518 | Appears to provide the best results at the time of writing. They used 2519 | a three-layer convolutional neural network. Achieved an error on the 2520 | test set of 14.98\%. This is over 3\% better than state of the art 2521 | (without augmenting the data). They then augmented the data using 2522 | horizontal reflections and translations, getting the error down to 9.5 2523 | \% on the test set. 2524 | 2525 | An interesting aspect of the project is that they learnt the 2526 | hyper-parameters automatically. In particular, they did a Bayesian 2527 | optimization to learn 9 separate hyper-parameters, including the 2528 | number of epochs, the learning rate, and the width, scale and power of 2529 | the response normalization in the pooling layers. The learned 2530 | hyper-parameters significantly outperform a human expert's 2531 | optimization of the hyper-parameters. Their expert achieved 18\% and 2532 | 11\% error (without and with data augmentation, respectively). 2533 | 2534 | Code from this project is available. Note that Jasper Snoek is at U 2535 | of T, but will be leaving for Harvard in September. They based their 2536 | convolutional net implementation on cuda-convnet. 2537 | 2538 | \textbf{Krizhevsky, Sutskever, and Hinton (2012):} 2539 | (\link{http://books.nips.cc/papers/files/nips25/NIPS2012\_0534.pdf}{link}) 2540 | ``ImageNet Classification with Deep Convolutional Neural Networks'' A 2541 | four-layer convolutional neural net achieved 13\% test error rate 2542 | without local response normalization, and 11\% with local response 2543 | normalization. Used cuda-convnet. 2544 | 2545 | \textbf{Ciresan, Meier, and Schmidhuber (2012):} 2546 | (\link{http://www.idsia.ch/\~ciresan/data/cvpr2012.pdf}{link}) 2547 | ``Multi-column deep neural networks for image classification'' 2548 | Achieves 11.21\% error for CIFAR-10. Achieves 0.23 \% error for 2549 | MNIST. Claims that humans get a 0.2\% error, with citation (would be 2550 | interesting to look up). Use a deep convolutional network. They do 2551 | basic backprop, with no pretraining. The architecture is to repeat a 2552 | convolutional layer followed by max pooling multiple times, followed 2553 | by some fully connected layers. They use 2 by 2 receptive fields and 2554 | max-pooling regions. It appears that the stride length is 2, as well. 
2555 | Somewhat similar to Krizhevsky et al's ImageNet paper. They use a 2556 | fully online training algorithm. They use a GPU. They use what they 2557 | call a multi-column deep neural network, which I don't quite 2558 | understand --- looks to be a technique for training multiple networks 2559 | and combining the results. They used a (scaled) tanh function for 2560 | convolutional and fully connected layers, a linear activation function 2561 | (does this mean rectified?) for max-pooling layers, and softmax at the 2562 | output. They used online gradient descent, with an annealed learning 2563 | rate (0.001, decaying by a factor of 0.993 after every epoch), and 2564 | continual translations, scaling and rotation of images. Initial 2565 | weights are drawn from a uniform random distribution in the range 2566 | [-0.05, 0.05]. 2567 | 2568 | MNIST architecture: 29 by 29 input; a 20-map convolutional layer, with 2569 | a receptive field of 4 by 4; max-pooling of 2 by 2 regions; a 40-map 2570 | convolutional layer with 5 by 5 receptive field; max-pooling of 3 by 3 2571 | regions; fully connected layer with 150 neurons; fully connected 2572 | (softmax) layer with 10 neurons. 2573 | 2574 | CIFAR architecture: 3 by 32 by 32 input; 300-map convolutional layer, 2575 | with 3 by 3 receptive fields; max-pooling of 2 by 2 regions; 300-map 2576 | convolutional layer, with 2 by 2 receptive fields; max-pooling of 2 by 2577 | 2 regions; 300 convolutional maps, 2 by 2 receptive fields; 2578 | max-pooling of 2 by 2 regions; then fully connected layers with 300, 2579 | 100 and 10 neurons. 2580 | 2581 | Augmenting the training set (by translating up to 5\%) helps a lot. 2582 | Scaling (up to 15 percent), rotation (up to 5 degrees) and additional 2583 | translations (up to 15 percent) helps a little extra. 2584 | 2585 | The contrast between the Krizhevsky and Ciresan results suggests that 2586 | ideas like dropout and rectified linear units make a big difference. 2587 | 2588 | Q: How much difference does the larger number of maps in the 2589 | convolutional layers make? 2590 | 2591 | \textbf{Goodfellow, Warde-Farley, Mirza, COurville, Bengio (2013):} 2592 | \link{http://arxiv.org/abs/1302.4389}{link} ``Maxout networks'': Test 2593 | set error of 12.93 \%. 2594 | 2595 | ``We define a simple new model called maxout... designed to both 2596 | facilitate optimization by droput and improve the accuracy of 2597 | dropout's fast approximate model averaging technique.'' 2598 | 2599 | Preprocessed the data using global contrast normalization and ZCA 2600 | whitening. Best model consists of three convolutional maxout layers 2601 | followed by a fully connected maxout layer, then finally a softmax 2602 | layer. 2603 | 2604 | \textbf{Tentative conclusions:} Use, in roughly this order: Martens' 2605 | initialization; rectified linear units; dropout; augmented training 2606 | data; annealed learning rate. It'd be interesting to look at the 2607 | local contrast normalization. Also try looking at Nesterov's momentum 2608 | method. The Ciresan results suggest some benefit from using lots of 2609 | maps in the convolutional layers. 2610 | 2611 | 2612 | \section{Grandmother cell (Wikipedia)} 2613 | 2614 | Apparently proposed in the late 1960s by Konorski and Lettvin. 
2615 | Lettvin ``originated the term grandmother cell to illustrate the 2616 | logical inconsistency of the concept.'' There is apparently quite a 2617 | bit of support for the concept at the broad category level: neurons 2618 | which are higly face-specific, and even to individual human faces. 2619 | However, ``[e]ven the most selective face cells usually also disharge, 2620 | if more weakly, to a variety of individual faces.'' A 2005 study 2621 | found a ``neuron for Halle Berry'', which fired not only for pictures 2622 | of the actress, but also to the words ``Halle Berry'', and which 2623 | didn't fire when pictures of several other actresses were presented. 2624 | Of course, this doesn't mean that was the only cell to respond. The 2625 | ``sparseness'' hypothesis versus the ``distributed representation'' 2626 | theory. It's really not clear to me that there is a dichotomy here. 2627 | A picture of Halle Berry will no doubt cause many neurons to fire, 2628 | some of which will fire for other reasons too. Maybe the hypothesis is 2629 | this: for each single object or concept there is a corresponding 2630 | grandmother neuron. 2631 | 2632 | \chapter{Miscellanea} 2633 | 2634 | \textbf{Compiling to neural networks:} Can we create compilers which 2635 | translate programs written in a conventional programming language into 2636 | a neural network? I'd be especially interested in seeing how this 2637 | works for AI workhorses such as Prolog. What could we learn from such 2638 | a procedure? (1) Perhaps we could figure out how to link up multiple 2639 | neural modules, with one or more of the modules coming from the 2640 | compiler? (2) Maybe we could use a learning technique to further 2641 | improve the performance of the compiled network. Googling doesn't 2642 | reveal a whole lot, although I did find a paper by 2643 | \link{http://scholar.google.ca/scholar?cluster=10518384657895134615\&hl=en\&as\_sdt=0,5}{Thrun} 2644 | where he discusses decompiling, i.e., extracting rules from a neural 2645 | network. Thrun uses a technique he calls validity-interval analysis, 2646 | basically propagating intervals for inputs and outputs forwards and 2647 | backwards through a network. 2648 | 2649 | \textbf{Deep learning requires nonlinear neurons:} Put another way, 2650 | deep learning with linear neurons doesn't help. Via linear embedding 2651 | it's equivalent to a single hidden layer whose size is just the 2652 | minimal size of any of the original hidden layers. So there is 2653 | absolutely no advantage to doing deep learning with linear neurons. 2654 | 2655 | \textbf{No theory of generalization:} We have all these techniques 2656 | based on parameter-fitting. But we have a paucity of strong 2657 | underlying theoretical ideas. 2658 | 2659 | \textbf{Principal Components Analysis (PCA):} It'll be useful to 2660 | review PCA here. Suppose we have a set of data points $x$ in some 2661 | high-dimensional (vector) space. Then we'd like to find a 2662 | $k$-dimensional projector $P$ such that the following error function 2663 | is minimized: 2664 | \begin{eqnarray} 2665 | \sum_x \| x-Px \|^2. 2666 | \end{eqnarray} 2667 | This error can be rewritten as $\mbox{tr}((I-P)\Sigma)$, where $\Sigma 2668 | \equiv \sum_x x x^T$. And so we simply choose $P$ to project onto the 2669 | eigenvectors of $\Sigma$ with the $k$ largest eigenvalues. The 2670 | \emph{principal components} are the eigenvectors of $\Sigma$, in order 2671 | of decreasing eigenvalue. 
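A minimal numerical version of this, using $\Sigma = \sum_x x x^T$ exactly as above (no centring step, to match the formula; ordinarily one would subtract the mean first):

\begin{verbatim}
import numpy as np

def pca_projector(X, k):
    """X has one data point per row.  Returns the rank-k projector P onto
    the top-k principal components, per the argument above."""
    Sigma = X.T @ X                                  # Sigma = sum_x x x^T
    eigvals, eigvecs = np.linalg.eigh(Sigma)         # ascending eigenvalues
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # k largest
    return top @ top.T                               # P = V_k V_k^T, with P @ P = P
\end{verbatim}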
(There may, of course, be some ambiguity
2672 | when $\Sigma$ is degenerate).
2673 | 
2674 | Practically speaking, suppose we have a billion images, each of which
2675 | can be regarded as a vector in a 100,000-dimensional space.  We can
2676 | reduce to (say) a 100-dimensional space.  This gets rid of much of the
2677 | irrelevant structure, and hopefully leaves a structure that is useful
2678 | for comparing images.
2679 | 
2680 | \textbf{PCA and autoencoders:} PCA is a way of simplifying our
2681 | understanding of data in high dimensions.  Think of the space of all
2682 | possible images.  There's a subset of that space which can plausibly
2683 | be taken to represent faces.  (Note that contextual clues can also
2684 | help).  How can we characterize that subspace?  Classic example of
2685 | PCA: IQ testing.  Take a large number of different tests.  Turns out
2686 | that there is a common factor.  Another nice example: a helix in 3
2687 | dimensions.  There's a major question: how to determine the number of
2688 | hidden units?
2689 | 
2690 | \textbf{Recurrent neural networks (RNN):} According to Wikipedia, RNNs
2691 | have achieved the best results to date on handwriting recognition.  An
2692 | obvious question is: what are the respective advantages of RNNs and
2693 | feedforward networks?  Are there important problems for which one or
2694 | the other is preferable?  Why?  What I've read about these questions
2695 | is opaque.
2696 | 
2697 | \textbf{Regularization:} I'd like to understand \emph{why} we
2698 | regularize.  Certainly, regularization results in solutions with a
2699 | small norm.  But why do we not want solutions with a larger norm?
2700 | Will something bad happen to us if we allow such solutions?
2701 | 
2702 | The standard argument: what's bad is that overfitting can occur.  And
2703 | thus regularization helps reduce overfitting.  It'd be nice to have an
2704 | example where overfitting actually occurs.  It's really not clear that
2705 | there \emph{should} be a problem with overfitting.  In fact, neural
2706 | networks eventually become virtually invariant under rescaling of
2707 | their weights and biases.  So it's really not clear that it should
2708 | help.
2709 | 
2710 | Returning to regularization, here's the standard story people tell to
2711 | explain why they regularize.  The story is that they want to avoid
2712 | high-complexity solutions, in order to avoid over-fitting.  Solutions
2713 | with smaller norms are in some sense lower complexity.  And therefore
2714 | it makes sense to look for solutions with smaller norm.  One way of
2715 | doing this is to penalize solutions with larger norms.  Thus, we
2716 | should add a term to the cost which penalizes such solutions.
2717 | 
2718 | Now, this is just a story.  It's not in any sense a sharp
2719 | justification.  In fact, the impact of regularization is still being
2720 | understood.  Researchers write papers where they try different
2721 | approaches to regularization, compare them to see which works better,
2722 | and try to understand why different approaches work the way they do.
2723 | 
2724 | When can overfitting occur?  Typically, when there are more parameters
2725 | in the model than there is training data.  What's odd about this is
2726 | that regularization doesn't really help all that much with this
2727 | problem.  It just restricts one degree of freedom.
2728 | 
2729 | Many different types of regularization possible.  I will just use the
2730 | most standard and obvious, which is quadratic.
Empirically: I find that regularization seems to help.  When we
regularize I get higher accuracies, by quite a bit.  I don't
understand why that is.

Maybe I'm already overfitting, and regularization is helping reduce
that problem.  It's possible: I have 20,000 or so parameters in my
model.  It'd be nice to see if this is the case.

An example of overfitting: I'll bet I can get it to overfit when we
use just 50 training examples.  And I can probably more or less prove
this using cross-validation.

Look at LeCun \emph{et al}'s results: do they regularize, or not?

\textbf{Restricted Boltzmann machines:} The idea is not to learn a
function, but rather to learn a probability distribution.  There are
two layers of neurons: a visible layer, and a hidden layer.  All
visible units are connected to all hidden units.  The energy of a
given configuration is just:
\begin{eqnarray}
E(v, h) & = & -\sum_i a_i v_i-\sum_j b_j h_j-\sum_{ij} w_{ij} v_i h_j \nonumber \\
& = & -a \cdot v-b\cdot h -v^T W h,
\end{eqnarray}
where $a$ are the biases for the visible units, $b$ are the biases for
the hidden units, and $W$ is the weight matrix.  The distribution is
just the standard Boltzmann distribution, at some fixed temperature.
Apparently it can be shown that:
\begin{eqnarray}
p(v_i = 1 | h) = \sigma( a_i + (Wh)_i ),
\end{eqnarray}
where $\sigma$ is the usual sigmoid function.  (I'll bet this is easy
to show, just by summing out all the other visible units.)
Furthermore, the $v_i$ are independent of one another, given $h$.
This too would be easy to show --- it's a straightforward consequence
of the bipartite nature of the graph.  So we can compute the
probability of $v$, given $h$, simply by multiplying sigmoids.

Let's suppose we wanted to train an RBM with a set of images.  The
images would correspond to the visible units, while the hidden units
would be feature detectors.  The idea is to adjust the weights and
biases so that training images have a high probability, i.e., a low
energy.

In a little more detail, suppose we input a training image.  Then we
can stochastically pick a corresponding value for the hidden units.
Now, feed that back, and stochastically choose a value for the image.
In an ideal world, we'd recover the original image.  We modify the
weights in such a way as to improve the fidelity of the recovered
image.

Well, the penny finally drops: an RBM can be viewed as a neural
network in which the transitions are probabilistic.  That's all!
Frankly, we don't even really need the stuff about ground states,
although it's a beautiful thing to keep in mind.
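
To make the ``feed it back'' step concrete, here's a small sketch of
one stochastic up-down pass (illustrative only: the shapes and
initialization are made up, and I'm assuming the symmetric conditional
$p(h_j = 1|v) = \sigma(b_j + (W^T v)_j)$, which follows from the same
energy function):
\begin{verbatim}
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_bernoulli(p, rng):
    return (rng.random(p.shape) < p).astype(float)

def reconstruct(v, a, b, W, rng):
    # One stochastic up-down pass.  W has shape (n_visible, n_hidden),
    # so p(h_j=1|v) = sigmoid(b_j + (W^T v)_j) and
    #    p(v_i=1|h) = sigmoid(a_i + (W h)_i).
    p_h = sigmoid(b + W.T @ v)      # hidden units independent given v
    h = sample_bernoulli(p_h, rng)  # stochastically pick hidden values
    p_v = sigmoid(a + W @ h)        # visible units independent given h
    v_recon = sample_bernoulli(p_v, rng)
    return h, v_recon

# Toy usage: ideally v_recon would match v; training (e.g. contrastive
# divergence) adjusts a, b, W to make the reconstruction more faithful.
rng = np.random.default_rng(0)
n_vis, n_hid = 6, 4
a, b = np.zeros(n_vis), np.zeros(n_hid)
W = 0.01 * rng.standard_normal((n_vis, n_hid))
v = sample_bernoulli(0.5 * np.ones(n_vis), rng)
h, v_recon = reconstruct(v, a, b, W, rng)
\end{verbatim}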
\textbf{Softmax function:} Suppose the $q_j$ are some set of values.
Then we define the softmax function by:
\begin{eqnarray}
p_j \equiv \exp(q_j)/\sum_k \exp(q_k).
\end{eqnarray}
This is a probability distribution, and it preserves the order of the
original values.  You can, for example, take the softmax in the final
layer of a neural network, using the weighted sums of inputs as the
$q_j$ values.  The output from the network can then be interpreted as
a probability distribution.
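
Here's a one-line sketch of the definition (my own illustration;
subtracting the maximum is just a standard trick to avoid overflow,
and doesn't change the result, since the softmax is invariant under
adding a constant to every $q_j$):
\begin{verbatim}
import numpy as np

def softmax(q):
    # p_j = exp(q_j) / sum_k exp(q_k), computed stably.
    q = np.asarray(q, dtype=float)
    e = np.exp(q - q.max())
    return e / e.sum()

# The output sums to 1 and preserves the order of the inputs.
print(softmax([2.0, 1.0, 0.1]))
\end{verbatim}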
\textbf{Thinking geometrically:} Suppose we're asked to tell the
difference between pictures of a human face, and pictures of a
giraffe.  We can represent the pictures as points $x$ in a very
high-dimensional space.  And so our task is to divide that space up
into two parts: one is classified as giraffe, the other as human face.
(Maybe it should be three parts: the third part would be neither face
nor giraffe.)  And so what we really want is algorithms for dividing
up that space.  In some sense we're interested in understanding the
space of all such algorithms.

It'd be interesting to lay out all the different curlicues of thinking
in this way: the opportunities, and the pitfalls.  There are at least
three broad approaches: (1) the \emph{pure geometric approach}, based
on finding mathematical structures to divide the space; (2) the
\emph{biological approach}, where we try to figure out how we do it;
and (3) the \emph{kludge approach}, where we simply try lots of ideas,
and pile them up on top of one another.  That's a pretty rough
division, but it seems like a good starting point for thought.  My bet
is that progress comes from playing these ideas off against one
another.

\textbf{Tricks:} Much of what seems to be going on is the discovery of
tricks (of varying generality) which can be used to improve pattern
recognition performance.  There are some general heuristics: \emph{use
symmetry} is obviously one.


\section{Future reading}

On the display of scientific papers: https://news.ycombinator.com/item?id=6042742

Connectomics, a recent approach: http://arxiv.org/abs/1306.5709

Ciresan 2012 on MNIST, and Rifai 2011 (``The manifold tangent
classifier'') on MNIST.

Kiros 2013: http://www.ualberta.ca/\~rkiros/kiros\_thesis\_jun5.pdf
Best reported results on MNIST when no distortions are used.

Interesting comments on image recognition: https://news.ycombinator.com/item?id=5994851

Saxe et al: ``On random weights and unsupervised feature learning''
(2011).  On hyper-parameter optimization.  One of Ng's collaborators.

LeCun on recent Ng results: https://plus.google.com/104362980539466846301/posts/5ab217HugeF

Goodfellow: https://plus.google.com/103174629363045094445/posts/dh7UT9xbMW4

\textbf{SIFT:}

Tips on what works: https://news.ycombinator.com/item?id=5994851

``Fast, accurate detection of 100,000 object classes on a single
machine'':
http://googleresearch.blogspot.ca/2013/06/fast-accurate-detection-of-100000.html

Hinton: ``Where do features come from?'': http://scholar.google.ca/citations?view\_op=view\_citation\&hl=en\&user=JicYPdAAAAAJ\&sortby=pubdate\&citation\_for\_view=JicYPdAAAAAJ:L\_l9e5I586QC

Bengio lecture notes

Seide 2011 on deep learning and Microsoft's MAVIS system.

Bengio and Courville: ``Deep learning of representations''
http://www.iro.umontreal.ca/~bengioy/papers/BengioCourvilleChapter.pdf

Andrew Ng, CS294A lecture notes

McCulloch and Pitts

Recent Bengio paper on a new approach to deep learning: http://arxiv.org/abs/1306.1091

Eliasmith

Levesque: http://www.cs.toronto.edu/~hector/Papers/ijcai-13-paper.pdf

On feedback in the brain: http://blogs.scientificamerican.com/mind-guest-blog/2013/08/08/this-brain-discovery-may-overturn-a-century-old-theory/

Martens 2010: Hessian-free optimization, and sparse initialization.

Bengio et al: ``Scaling learning algorithms towards AI'' (2007)

Boureau: ``A theoretical analysis of feature pooling in visual
recognition'' (2010).

Sermanet: ``Convolutional neural networks applied to house numbers
digit classification''

Elkan 2013: ``Learning meanings for sentences'': http://cseweb.ucsd.edu/\~elkan/250B/learningmeaning.pdf

Agre.

Hubel and Wiesel (1959).  Simple and complex cells.  The basic model of V1.

Frome 2009: ``Large-scale privacy protection in Google Street View''

Deep learning for the masses: http://gigaom.com/2013/08/16/were-on-the-cusp-of-deep-learning-for-the-masses-you-can-thank-google-later/

\textbf{Collobert:} ``Natural language processing (almost) from scratch''

\textbf{Bengio et al (1994):} The vanishing gradient problem.
``Learning long-term dependencies with gradient descent is
difficult.''

\textbf{Erhan:} ``Why does unsupervised pre-training help deep learning?''

\textbf{HoG:}

\textbf{Hinton et al (2006):}

\textbf{Itamar Arel et al:} For a different POV.

PAC learning.

Conference on learning representations: http://techtalks.tv/iclr2013/

IPAM: https://www.ipam.ucla.edu/schedule.aspx?pc=gss2012

\textbf{Lee and Mumford (2003):}
\link{http://dash.harvard.edu/bitstream/handle/1/3637109/Mumford\_HierarchBayesInfer.pdf?sequence=1}{link}
This looks like great background reading on the idea of doing
hierarchical inference in the visual cortex.

\textbf{Embrechts (2010):}

\textbf{Dropout:}

\textbf{Le (2012):} \link{https://plus.google.com/u/0/+ResearchatGoogle/posts/EMyhnBetd2F}{link}

\textbf{Seide (2011):}
\link{http://research.microsoft.com/apps/pubs/default.aspx?id=153169}{link}

\textbf{Bengio (2007):} \link{http://arxiv.org/pdf/1206.5533v2.pdf}{link}

\textbf{Ranzato (2007):}

\textbf{Lee (2008):}

\textbf{Larochelle (2009):}

\textbf{Wolpert (XXX):} No free lunch.

\textbf{The NIPS 2012 talks:}

\textbf{Elements of statistical learning:} \link{http://www.stanford.edu/\~hastie/local.ftp/Springer/OLD//ESLII\_print4.pdf}{link}

\textbf{No more pesky learning rates:} \link{http://arxiv.org/pdf/1206.1106.pdf}{link}

\textbf{Olshausen and Field:}

Tenenbaum 2011: ``How to grow a mind''

Rumelhart et al on backprop.
BigBrain Atlas: http://news.sciencemag.org/sciencenow/2013/06/bigbrain-atlas-unveiled.html

Hinton on DReDnets: http://techtalks.tv/talks/drednets/58115/

\textbf{Distributed deep learning:}
\link{http://research.google.com/archive/large\_deep\_networks_nips2012.html}{link}.

\textbf{Stanford tutorial:} http://ufldl.stanford.edu/wiki/index.php/UFLDL\_Tutorial

Eliot R. Smith: ``What do connectionism and social psychology offer
each other?''  Good for something of an exterior point of view.

\textbf{To do:} Contrastive divergence
(http://learning.cs.toronto.edu/~hinton/absps/cdmiguel.pdf and
http://www.cs.utoronto.ca/~hinton/absps/nccd.pdf).  LeCun 1998
``Efficient BackProp''.  Dropout.  Maxout.  Andrew Ng's 1997 paper
``Preventing overfitting of cross-validation data''.  Blumer \emph{et
al} with guarantees on induction
(http://scholar.google.ca/scholar?cluster=11895938102761137877\&hl=en\&as\_sdt=0,5).
Would be good to understand this in conjunction with no free lunch.
NIPS papers are online.

\textbf{Neural nets FAQ:} There is no single definition of a neural
network.  It's possible to do XOR with just a single hidden unit, if
direct connections from the input to the output are allowed.  Problems
which neural nets aren't so good at: predicting random or
pseudo-random numbers; factoring large integers; determining whether a
number is prime.  Research problem: find a net which will determine
whether a number is prime.  Distinction between recurrent and
feedforward neural networks.  The FAQ calls the set of cases we'd like
to generalize to the \emph{population}.  Constructive learning: start
with a small network, train, then gradually add extra neurons and do
more training.  A lot of work has been done on toy problems, and
various hacks are known for the different toy problems.


\textbf{Stephen Judd (1988):} Thesis on the complexity of learning in
neural networks: http://www.dtic.mil/dtic/tr/fulltext/u2/a450825.pdf

\textbf{Sima (1996):} Shows that finding weights is hard even for
sigmoidal neural networks with just 3 nodes.  This can be viewed as an
extension of Blum and Rivest (1989).
http://scholar.google.ca/scholar?cluster=18396613610240979409\&hl=en\&as\_sdt=0,5

\textbf{Egri and Schultz:} Found a neural network capable of
recognizing prime numbers.
http://www.cs.mcgill.ca/\~legri1/prime06.pdf



\end{document}
--------------------------------------------------------------------------------