├── .gitignore ├── README.md ├── cifar_notes.pdf ├── cifar_notes.tex ├── working_notes.pdf └── working_notes.tex /.gitignore: -------------------------------------------------------------------------------- 1 | *~ 2 | *.aux 3 | *.dvi 4 | *.log 5 | *.out 6 | *.txt -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | notes-on-neural-networks 2 | ======================== 3 | 4 | Rough working notes on neural networks. 5 | 6 | As of December 11, 2013 I've migrated the notes to another repository (not yet public, it's still 7 | getting constructed as I merge various things together, I hope to make it public). 8 | -------------------------------------------------------------------------------- /cifar_notes.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mnielsen/notes-on-neural-networks/4104b3175516550335282ea0a4aeb936bd4fe6c1/cifar_notes.pdf -------------------------------------------------------------------------------- /cifar_notes.tex: -------------------------------------------------------------------------------- 1 | \documentclass[12pt]{report} 2 | 3 | \usepackage{hyperref} 4 | 5 | \newcommand{\link}[2]{\href{#1}{#2}} 6 | 7 | 8 | \begin{document} 9 | 10 | \title{Notes on neural networks --- CIFAR material} 11 | \author{Michael Nielsen\thanks{Email: mn@michaelnielsen.org}$^{,}$\thanks{Web: http://michaelnielsen.org/ddi}} 12 | 13 | \maketitle 14 | 15 | \chapter{Introduction} 16 | 17 | \textbf{Working notes, by Michael Nielsen:} These are rough working 18 | notes, written as part of my study of neural networks, especially work 19 | on CIFAR. Note that they really are \emph{rough}, and I've made no 20 | attempt to clean them up, nor do I plan to. They contain 21 | misunderstandings, misinterpretations, omissions, and outright errors. 22 | As such, I don't advise others to read the notes, and certainly not to 23 | rely on them! 24 | 25 | \chapter{Papers} 26 | 27 | \section{Dahl, Sainath, Hinton 2013} 28 | 29 | This is about acoustic models, studied using rectified linear units 30 | and dropout. Punchline: they get a 4.2 percent improvement over 31 | sigmoid units by using ReLUs and dropout. 32 | 33 | Speech recognition used to be done by hidden Markov models. Now 34 | replaced by deep nets. TIMIT (small-scale phone recognition), LVSR 35 | (large-scale task). Dropout as similar to denoising auto-encoders. 36 | ``ALthough dropout is trivial to incorporate into minibatched SGD the 37 | best way of adding it to 2nd order optimization methods is an open 38 | research question.'' They find an undesirable interaction between HF 39 | optimizer and SGD. They used the Bayesian optimizer to find the right 40 | hyper-parameters. 41 | 42 | 43 | \section{KSH configuration} 44 | 45 | 46 | Note that the bias initialization parameter initB was not set anywhere 47 | in the KSH configuration. That means it defaults to 0. 48 | 49 | \textbf{Layer 1} 50 | \begin{itemize} 51 | \item Convolutional 52 | \item 3 channels. 53 | \item 32 filters 54 | \item Padding of 2. Pads the images on the outside with a 2-pixel border. 55 | \item Stride length of 1. 56 | \item Filter size is 5 by 5. 57 | \item initW=0.0001. The initial standard deviation. I'm surprised by how 58 | low this is --- much lower than I would have guessed. I wonder if 59 | there's any benefit to increasing it? 60 | \item partialSum=4. No idea what this means. 
The docs don't really say. 61 | \item sharedBiases=1. According to the docs, ``indicates that the biases 62 | of every filter in this layer should be shared amongst all 63 | applications of that filter.'' This is a little unclear. Does it 64 | mean that all filters have the same bias? 65 | \item Fully linear layer. 66 | \end{itemize} 67 | 68 | \textbf{Layer 2} 69 | 70 | + Pooling layer 71 | + Uses maxpooling 72 | + start=0. Where to start pooling. This is just the default, which 73 | is to start pooling where you'd expect (the top left). 74 | + sizeX=3. Pool 3 x 3 regions. 75 | + stride=2. The stride length. 76 | + outputsX=0. This is an unimportant default; if not equal to 0 the 77 | output would only cover part of the image. 78 | + channels=32. Presumably to correspond to the filters in the last 79 | layer. 80 | + neuron=relu 81 | 82 | \textbf{Layer 3} 83 | 84 | + Convolutional layer 85 | + 32 filters output, 32 channels input. 86 | + 5 by 5 filters. 87 | + Stride length of 1 88 | + Initial weight SD = 0.01 89 | + Rectified linear units 90 | + sharedBiases=1 91 | + partialSum=4 92 | 93 | \textbf{Layer 4} 94 | + Pooling layer 95 | + Average pooling 96 | + 3 x 3 pooling windows 97 | + Stride length 2 98 | 99 | \textbf{Layer 5} 100 | + Convolutional layer 101 | + 32 input channels, 64 output filters 102 | + 5 x 5 filters 103 | + Padding by 2 pixel border 104 | + Stride length of 1 105 | + Initial weight SD = 0.01 106 | + Rectified linear units 107 | + sharedBiases=1 108 | + partialSum=4 109 | 110 | \textbf{Layer 6} 111 | + Pooling layer, 64 input channels, 64 outputs 112 | + Average pooling 113 | + 3 x 3 pooling windows. 114 | + Stride length 2 115 | 116 | \textbf{Layer 7} 117 | + Fully connected layer 118 | + 64 outputs 119 | + Initial weight SD = 0.1 120 | + Rectified linear units 121 | 122 | \textbf{Layer 8} 123 | + Fully connected layer 124 | + 10 outputs 125 | + Initial weight SD = 0.1 126 | + Linear neurons 127 | 128 | \textbf{Layer 9} 129 | + Softmax layer, producing 10 outputs 130 | 131 | Cost function: logistic regression on the Softmax outputs. 132 | 133 | 134 | \textbf{Learning parameters} 135 | 136 | \textbf{Layer 1 (first convolutional layer):} 137 | + Weight learning rate: 0.001 138 | + Bias learning rate: 0.002 139 | + Weight and bias momentum: 0.9 140 | + Weight decay 0.004. Note there is no bias decay. 141 | 142 | Note that in the docs Krizhevsky explicitly gives the update rule. As I read it, the momentum term multiplies the previous weight \emph{increment}, not the weight itself: 143 | 144 | (weight increment)' = (weight momentum) * (weight increment) - (weight decay) * (weight learning rate) * w 145 | + (weight learning rate) * gradient, and then w' = w + (weight increment)'. 146 | 147 | The bias rule is the same, but there is no bias weight decay. 148 | 149 | 150 | \textbf{Layer 3 (second convolutional layer):} 151 | 152 | Same as layer 1. 153 | 154 | \textbf{Layer 5 (third convolutional layer):} Same as layer 1. 155 | 156 | 157 | \textbf{Layer 7 (first fully connected layer):} Learning rates as for 158 | convolutional layers, and weight decay of 0.03. 159 | 160 | \textbf{Layer 8 (final layer):} Same as first fully connected layer. 161 | 162 | Krizhevsky notes that rescaling the overall cost function has the 163 | effect of changing the effective overall learning rate. 164 | 165 | \section{Snoek, Larochelle, and Adams (2012)} 166 | 167 | ``In this work we consider the automatic tuning problem within the 168 | framework of Bayesian optimization... The tractable posterior 169 | distribution... leads to efficient use of the information gathered by 170 | previous experiments....
we show how the effects of the Gaussian 171 | process prior and the associated inference procedure can have a large 172 | impact on the success or failure of Bayesian 173 | optimization... thoughtful choices can lead to results that exceed 174 | expert-level performance in tuning machine learning algorithms.'' 175 | 176 | They do it not just for neural nets but for a whole bundle of 177 | algorithms. Of course, it's especially important for neural nets, 178 | since they have so many hyper-parameters. 179 | 180 | ``... these high-level parameters are often considered a nuisance, 181 | making it desirable to develop algorithms with as few of these `knobs' 182 | as possible. Another, more flexible take on this issue is to view the 183 | optimization of high-level parameters as a procedure to be 184 | automated.'' 185 | 186 | ``For continuous functions [like the cost function, one presumes], 187 | Bayesian optimization typically works by assuming the unknown function 188 | [which?] was sampled from a Gaussian process (GP) and maintains a 189 | posterior distribution for this function as observations are made.'' 190 | 191 | What I think this means is: set up Gaussians on our hyper-parameters. 192 | Then sample, and look to see the cost on the validation data. 193 | 194 | We have a function $f(x)$ on a bounded subset of $R^D$. We're going 195 | to construct a probabilistic model of $f(x)$. The idea is to use the 196 | information we get from evaluations of $f(x)$ to improve our model --- 197 | and to choose where to evaluate next. ``This results in a procedure 198 | that can find the minimum of difficult non-convex functions with 199 | relatively few evaluations, at the cost of performing more computation 200 | to determine the next point to try.'' 201 | 202 | Two choices: a prior over functions. They choose the Gaussian process 203 | prior. I'm not quite sure what this means in this context. Second, 204 | they choose an acquisition function, to construct a utility function 205 | from the model posterior. Not sure what this means. 206 | 207 | Gaussian process. Suppose we have a set of points $x_n$ in our 208 | domain. T 209 | 210 | \section{Wan et al (2013) -- ``Regularization of Neural Networks using 211 | DropConnect''} 212 | 213 | \subsection{Summary of the main points} 214 | 215 | \begin{itemize} 216 | \item Dropout means randomly deleting half the neurons 217 | when training. 218 | 219 | \item DropConnect means randomly deleting half the connections when 220 | training. 221 | 222 | \item Note that the output is defined as the \emph{average} output 223 | over the sampled networks, not the full network. 224 | 225 | \item There is a nice linear algebraic way of representing DropConnect 226 | and Dropout, using Hadamard products, which no doubt helps in 227 | implementations. 228 | 229 | \item In actual fact, they don't literally implement DropConnect. 230 | Rather, they analyse what the distribution of weighted sums would 231 | be, and approximate by a Gaussian, before sampling. I don't see why 232 | they do this (it may be faster), but in some sense we can use this 233 | as a definition. I'd probably prefer just to sample. No idea why 234 | they don't. 235 | 236 | \item They claim that the regularization is greatly helped by using 237 | small mini-batches, ideally mini-batch size $1$ (online learning). 238 | 239 | \item The code is available. They used cuda-convnet for convolutional 240 | and softmax steps. 
The DropConnect implementation is a bit 241 | convoluted --- worth reading about the problems they had, though. 242 | It certainly seems worth storing the masks as bits or ints, not 243 | floats. 244 | 245 | \item Used mini-batch SGD with momentum on batches of 128 images, and 246 | momentum fixed at 0.9. Not clear how this relates to the above 247 | comments about online learning. They augment the dataset (cropping, 248 | flipping, scaling and rotation); train 5 independent networks with 249 | random permutations; manually decrease the learning rate using a 250 | validation set; train using Dropout, DropConnect, or neither. Use 251 | 1,000 samples. Use a bias learning rate twice the weight learning 252 | rate. Weights are N(0, 0.1) for fully connected layers, and N(0, 253 | 0.01) for convolutional layers. 254 | 255 | \item The learning schedule is fascinating. ``We report three numbers 256 | of epochs, such as 600-400-200 to define our schedule. We multiply 257 | the initial rate by 1 for the first such number of epochs. Then we 258 | use a multiplier of 0.5 for the second number of epochs followed by 259 | 0.1 again for this second number of epochs. The third number of 260 | epochs is used for multipliers of 0.05, 0.01, 0.005, and 0.001 in 261 | that order, after which point we report our results. We determine 262 | the epochs to use for our schedule using a validation set to look 263 | for plateaus in the loss function, at which point we move to the 264 | next multiplier.'' 265 | 266 | \item CIFAR-10: Subtract the per-pixel mean computed over the training 267 | set. Then use KSH's 3-layer convolutional net. Follow with a 64-unit 268 | fully connected layer to which DropConnect etc.\ may be applied. No 269 | data augmentation. 150-0-0 epochs, a single model, with an initial 270 | learning rate of 0.0001, and KSH's weight decay (0.995, I believe). 271 | DropConnect prevents overfitting a little better than Dropout. 272 | 273 | \item CIFAR-10: More advanced results. Using 2 conv layers, 2 locally 274 | connected layers, per KSH. 128-neuron fully connected layer with 275 | ReLU activations between softmax and feature extractor. Images are 276 | cropped to 24 by 24 to get more data. Initial learning rate: 0.001, 277 | and train for 700-300-50 epochs with KSH's weight decay. Model 278 | voting helps a \emph{lot}, getting error rate 9.41 percent. This 279 | can be improved to 9.32 percent by using 12 networks. 280 | 281 | \end{itemize} 282 | 283 | Add a note: data augmentation works nearly as well. We should push on that. 284 | 285 | \subsection{Other notes} 286 | 287 | ``When training with Dropout, a randomly selected subset of 288 | activations are set to zero within each layer. DropConnect instead 289 | sets a randomly selected subset of weights within the network to 290 | zero.'' 291 | 292 | As with Dropout, DropConnect is essentially a method of 293 | regularization, to prevent the network from overtraining. ``In 294 | practice, using these [regularization] techniques when training big 295 | networks gives superior test performance to smaller networks trained 296 | without regularization.'' 297 | 298 | On Dropout: ``Although a full understanding of its mechanism is 299 | elusive, the intuition is that it prevents the network weights from 300 | collaborating with one another to memorize the training examples.'' 301 | 302 | ``Like Dropout, [DropConnect] is suitable for fully connected layers only.'' 303 | 304 | I don't really see why. Does something go wrong if we apply it to a 305 | convolutional net? I don't see why something analogous couldn't be done. (A toy sketch of the two masking schemes follows.)
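Here is a minimal numpy sketch of a single fully connected layer under each kind of masking, following the quoted definitions (mask the activations for Dropout, mask the individual weights for DropConnect). The function names and the drop probability of 0.5 are mine; this is only an illustration, not the authors' implementation.

\begin{verbatim}
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def dropout_layer(w, b, a, p=0.5, rng=np.random):
    # Dropout: zero a random subset of the layer's *activations*.
    m = rng.binomial(1, 1 - p, size=b.shape)
    return m * relu(np.dot(w, a) + b)

def dropconnect_layer(w, b, a, p=0.5, rng=np.random):
    # DropConnect: zero a random subset of the *weights* (connections).
    M = rng.binomial(1, 1 - p, size=w.shape)
    return relu(np.dot(M * w, a) + b)

rng = np.random.RandomState(0)
w, b, a = rng.randn(3, 5), rng.randn(3), rng.randn(5)
print(dropout_layer(w, b, a, rng=rng))
print(dropconnect_layer(w, b, a, rng=rng))
\end{verbatim}

In training a fresh mask would be drawn for every example, which is the point made in the quote about mask selection further below.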
306 | 307 | 308 | We can rewrite Dropout as $a \rightarrow \sigma(m \odot (wa+b))$, 309 | where $\odot$ is the Hadamard product, and $m$ is a binary mask 310 | vector, chosen according to an appropriate Bernoulli distribution. A 311 | similarly nice expression can be obtained for DropConnect. (This 312 | seems likely to help in implementations.) 313 | 314 | \textbf{Architecture:} A CNN, followed by a DropConnect layer, 315 | followed by a SoftMax, and a cross-entropy loss. 316 | 317 | Note that the output value can be viewed as the result of sampling a 318 | large number of different (though overlapping) neural networks. 319 | 320 | ``A key component to successfully training with DropConnect is the 321 | selection of a different mask for each training example. Selecting a 322 | single mask for a subset of training examples, such as a mini-batch of 323 | 128 examples, does not regularize the model enough in practice.'' 324 | 325 | They define the output as the result of averaging over all 326 | DropConnected networks. Note that this seems likely to be superior to 327 | using the entire network (i.e., with no weights deleted). 328 | 329 | They do some odd things involving Gaussian moment matching to sample. 330 | I don't see \emph{why} they need to do this, I must admit. But it 331 | does give a reasonably nice way of approximating the network. 332 | Alternatively, one could view it as the definition of DropConnect. 333 | 334 | 335 | \textbf{Q: How do Dropout and DropConnect fare in a sparse network?} 336 | My guess is that they'll show very interesting behaviour. 337 | 338 | \chapter{Short reviews: what do we know about nonlinearities?} 339 | 340 | In this chapter I take a very quick and not in-depth look at what is 341 | known about various nonlinearities. 342 | 343 | DasGupta and Schnitger (1994): They want to compare activation 344 | functions as a function of size and number of layers. And they want to 345 | figure out when two activation functions have essentially the same 346 | approximating power. ``Our results show that good approximation 347 | performance... hinges on two properties, namely efficient approximation 348 | of polynomials and efficient approximation of the binary threshold.'' 349 | I have a lot of trouble believing the former; I wonder if it is an 350 | artifact of their analysis. The latter seems interesting. 351 | 352 | Jarrett et al (2009): ``We show that using non-linearities that 353 | include rectification and local contrast normalization is the single 354 | most important ingredient for good accuracy on object recognition 355 | benchmarks.'' Also, ``[H]ow do the non-linearities that follow the 356 | filter banks influence the recognition accuracy. The surprising answer 357 | is that using a rectifying non-linearity is the single most important 358 | factor in improving the performance of a recognition system. This 359 | might be due to several reasons: a) the polarity of features is often 360 | irrelevant to recognize objects, b) the rectification eliminates 361 | cancellations between neighboring filter outputs when combined with 362 | average pooling. Without a rectification what is propagated by the 363 | average down-sampling is just the noise in the input. Also introducing 364 | a local normalization layer improves the performance.
It appears to 365 | make supervised learning considerably faster, perhaps because all 366 | variables have similar variances (akin to the advantages introduced by 367 | whitening and other decorrelation methods)'' 368 | 369 | Karlik and Olgac (2009): Investigated a few special cases. 370 | 371 | Nair and Hinton (2010): Done in the context of Boltzmann machines. 372 | They consider noisy rectified linear units (NReLUs), which have output 373 | $\max(0, x+ N(0, \sigma(x))$, where $N$ denotes a Gaussian random 374 | variable, as per usual. Not so clear that it's relevant to us. 375 | 376 | Tan, Teo, and Anthony (2011): 377 | \link{http://link.springer.com/article/10.1007\%2Fs10462-011-9294-y}{link} 378 | Investigated a few special cases. 379 | 380 | Question: What should we bound? What class of nonlinearities should 381 | we allow? 382 | 383 | \chapter{Queue} 384 | 385 | LeCun 2013. 386 | 387 | Snoek, Larochelle, Adams. 388 | 389 | Model voting. 390 | 391 | Hinton Dropout paper. 392 | 393 | Bengio's dropout paper. 394 | 395 | ReLU. 396 | 397 | \end{document} -------------------------------------------------------------------------------- /working_notes.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mnielsen/notes-on-neural-networks/4104b3175516550335282ea0a4aeb936bd4fe6c1/working_notes.pdf -------------------------------------------------------------------------------- /working_notes.tex: -------------------------------------------------------------------------------- 1 | \documentclass[12pt]{report} 2 | 3 | \usepackage{hyperref} 4 | 5 | \newcommand{\link}[2]{\href{#1}{#2}} 6 | 7 | 8 | \begin{document} 9 | 10 | \title{Notes on neural networks} 11 | \author{Michael Nielsen\thanks{Email: mn@michaelnielsen.org}$^{,}$\thanks{Web: http://michaelnielsen.org/ddi}} 12 | 13 | \maketitle 14 | 15 | \chapter{Introduction} 16 | 17 | \textbf{Working notes, by Michael Nielsen:} These are rough working 18 | notes, written as part of my study of neural networks. Note that they 19 | really are \emph{rough}, and I've made no attempt to clean them up, 20 | nor do I plan to. They contain misunderstandings, misinterpretations, 21 | omissions, and outright errors. As such, I don't advise others to 22 | read the notes, and certainly not to rely on them! 23 | 24 | \textbf{Core questions:} There is a practical, narrow question: what 25 | are the most significant results about deep learning and neural 26 | networks? And then there is the broader question: how to build an 27 | artificial intelligence? My reading will address both questions. 28 | 29 | \chapter{Papers} 30 | 31 | \section{Hopfield (1982)} 32 | 33 | What I like about this paper is the condensed matter physicist's point 34 | of view. Hopfield asks ``whether the ability of large collections of 35 | neurons to perform computational tasks may in part be a spontaneous 36 | collective consequence of having a large number of interacting simple 37 | neurons''. He goes on to give an explanation of how a type of memory 38 | can be constructed in pretty much this way. It's an inspiring point 39 | of view. 40 | 41 | \section{Bourland and Kamp (1988)} 42 | 43 | \link{http://scholar.google.com/scholar?cluster=17784424506773259343\&hl=en\&as\_sdt=0,5}{link} 44 | 45 | Suggests removing nonlinearity in output. Motivation: since we're 46 | trying to recover the original input, claims that it's obviously not a 47 | good idea to have the nonlinearity. 
I don't see that this is true: if 48 | the inputs are normalized to be between 0 and 1 then there shouldn't 49 | be any problem. 50 | 51 | With this constraint, the problem then is to find $w, b$ and $w', b'$ 52 | minimizing: 53 | \begin{eqnarray} 54 | \sum_x \|w' \sigma(wx+b)+b'-x\|^2, 55 | \end{eqnarray} 56 | where the sum is over all input vectors $x$. Let $X$ be the matrix 57 | whose columns are the training vectors. Abusing notation, let $b$ and 58 | $b'$ be matrices whose columns are $b$ and $b'$, respectively. Then 59 | matrix whose columns are the outputs is given by $Y = 60 | w'\sigma(wX+b)+b'$, where we apply $\sigma$ elementwise to the 61 | matrix $wX+b$. The quadratic loss function can then be written: 62 | \begin{eqnarray} 63 | \| w'\sigma(wX+b)+b'-X\|^2, 64 | \end{eqnarray} 65 | where $\|\cdot\|$ is here the usual Frobenius matrix norm. 66 | 67 | \section{Blum and Rivest (1989)} 68 | 69 | Show that it's NP-complete to train a three node neural network. 70 | Apparently built on earlier work by Judd, who showed this for a 71 | general neural network; indeed, Judd showed that even approximating a 72 | function is NP-complete. Blum and Rivest use a very particular 73 | architecture: $n$ inputs, a 2-neuron hidden layer, and a single output 74 | neuron. They use a perceptron model, although I doubt that is 75 | essential. The idea is simply to take a (supervised) training 76 | problem, and to ask whether there exist weights so that the output 77 | from the network are consistent with the training data. They show 78 | that this problem is NP-complete. They contrast this with a 79 | single-layer perceptron, which can be trained in polynomial time, 80 | using linear programming. They comment that their technique does not 81 | apply to sigmoidal neurons, but that Judd's does. 82 | 83 | \section{Williams and Zipser (1989)} 84 | 85 | \link{http://scholar.google.ca/scholar?cluster=1352799553544912946\&hl=en\&as\_sdt=0,5}{(link)} A gradient-based learning method for recurrent neural networks. 86 | 87 | Claims that feedforward networks don't have the "ability to store 88 | information for later use". It'd be nice to understand what that 89 | means. Obviously there's a trivial sense in which feedforward 90 | networks can store information based on training data. 91 | 92 | Claims that backprop requires lots of memory when used with large 93 | amounts of training data. I don't believe this, except in the trivial 94 | sense that it may take a lot of memory to store all the training data. 95 | Otherwise, we can compute gradients 96 | training-instance-by-training-instance, and sum the results, which is 97 | not especially memory intensive. (Of course, one may have a huge 98 | network which requires a lot of memory to story. But that's a 99 | separate issue.) 100 | 101 | Their model of recurrent neural networks is interesting. Basically, 102 | we have a set of neurons, each with an output. And we have a set of 103 | inputs to the network. There is a weight between every pair of 104 | neurons, and from each input to each neuron. To compute a neuron's 105 | output at time $t+1$ we compute the weighted sum of the inputs and the 106 | outputs at time $t$, and apply the appropriate nonlinear function 107 | (sigmoid, or whatever). Note that in order for this description to 108 | make sense we must specify the behaviour of the external inputs over 109 | time. We can incorporate a bias by having an external input which is 110 | always $1$. 
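A minimal sketch of that forward dynamics may be useful. The variable names are mine, and this is only the update just described (with the bias folded in as an always-on input), not Williams and Zipser's learning procedure.

\begin{verbatim}
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def step(W_rec, W_in, y, x):
    # y: all neuron outputs at time t; x: external inputs at time t.
    # A constant 1 is appended to x, so one column of W_in acts as a bias.
    x = np.append(x, 1.0)
    return sigmoid(np.dot(W_rec, y) + np.dot(W_in, x))

rng = np.random.RandomState(0)
n_neurons, n_inputs = 4, 2
W_rec = rng.randn(n_neurons, n_neurons)    # weight between every pair of neurons
W_in = rng.randn(n_neurons, n_inputs + 1)  # weight from each input (+ bias column)
y = np.zeros(n_neurons)
for x in [np.array([0.0, 1.0]), np.array([1.0, 1.0])]:
    y = step(W_rec, W_in, y, x)            # outputs at t+1 from inputs/outputs at t
print(y)
\end{verbatim}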
111 | 112 | So a recurrent neural network is just like a feedforward network, with 113 | a weight constraint: the weights in each layer are the same, over and 114 | over again. Also, the inputs must be input to every layer in the 115 | network. 116 | 117 | Williams and Zipser take as their supervised training task the goal of 118 | getting neuron outputs to match certain desired training values at 119 | certain times. For instance, you could define a two-neuron network 120 | that will \emph{eventually} produce the XOR of the inputs. 121 | 122 | They define the total error to be the sum over squares of the errors 123 | in individual neuron outputs. And we can then do ordinary gradient 124 | descent with that error function. They derive a simple dynamical 125 | system to describe how to improve the weights in the network using 126 | gradient descent. 127 | 128 | The above algorithm assumes that the weights in the network remain 129 | constant for all time. Williams and Zipser then modify the learning 130 | algorithm, allowing it to change the weights at each step. The idea 131 | is simply to compute the total error at any \emph{given} time, and 132 | then to use gradient descent with that error function to update the 133 | weights. (Similar to online learning in feedforward networks.) 134 | 135 | Williams and Zipser describe a method of \emph{teacher-forcing}, 136 | modifying the neural network by replacing the output of certain 137 | neurons by the \emph{desired} output, for training purposes in later 138 | steps. 139 | 140 | Unfortunately, it is still unclear to me \emph{why} one would wish to 141 | use recurrent neural networks. Williams and Zipser describe a number 142 | of examples, but they don't seem compelling. 143 | 144 | The algorithm in which the weights can change seems non-physiological 145 | --- it verges on being an unmotivated statistical model. (I doubt 146 | that the weights in the brain swing around wildly, but I'll bet that 147 | the weights found by this algorithm can swing around wildly.) The 148 | algorithm in which the weights are fixed seems more biological. 149 | 150 | Note that Williams and Zipser \emph{do not} offer any analysis of 151 | running time for their algorithms, or an understanding of when it is 152 | likely to work well, and when it is not. It's very much in the 153 | empirical let's-see-how-this-works style adopted through much of the 154 | neural networks literature. 155 | 156 | Summing up: the recurrent neural network works by, at each step, 157 | computing the sigmoid function of the weighted sum of the inputs and 158 | the previous step's outputs. Training means specifying a set of 159 | desired outputs at particular times, and adapting the weights at each 160 | time-step. Training works by specifying an error function at any 161 | given time step, computing the gradient, and updating the weights 162 | appropriately. 163 | 164 | 165 | \section{Baldi and Hornik (1989)} 166 | 167 | \link{http://scholar.google.ca/scholar?cluster=11637720331851320383&hl=en&as_sdt=0,5}{(link)} 168 | This characterizes linear autoencoders. We have a three-layer 169 | network, and the output is related to the input by $x \rightarrow 170 | ABx$, where $B$ describes the first layer of weights, and $A$ the 171 | second layer. The goal is to find weight matrices $A$ and $B$ to 172 | minimize: 173 | \begin{eqnarray} 174 | \sum_x \|x-ABx\|^2. 
175 | \end{eqnarray} 176 | The challenge is that the hidden layer has a \emph{smaller} number $h$ 177 | of neurons than the input layer (which is, of course, of the same size 178 | as the output layer)\footnote{It's not quite clear to me what $h$ 179 | should parameterize. I'll use it to parameterize the number of 180 | dimensions in the vector space representing outputs from the hidden 181 | units. It seems likely that it'd be better to write $2^h$, but I'll 182 | ignore that.}. Let me try an attack on this without reading the 183 | paper. That sum above is just: 184 | \begin{eqnarray} 185 | \mbox{tr}((I-AB)^2 \Sigma), 186 | \end{eqnarray} 187 | where $\Sigma \equiv \sum_x x x^T$. To minimize this what we want to 188 | do is obvious (and easily proven): we'll choose $A$ and $B$ so that 189 | $AB$ is a $h$-dimensional projector onto the span of the eignenvectors 190 | of $\Sigma$ with the $h$ largest eigenvalues. Let $P(\Sigma, h)$ 191 | denote such a projector, so: 192 | \begin{eqnarray} 193 | AB = P(\Sigma, h). 194 | \end{eqnarray} 195 | We can easily characterize such $A$ and $B$. $A$ should take the 196 | space $P(\Sigma, h)$ into the space spanned by the outputs from the 197 | hidden units, and $B$ should then undo that transformation. There is 198 | an orthogonal freedom inbetween time, and a possible freedom in 199 | $P(\Sigma, h)$. This completely characterizes $A$ and $B$. 200 | 201 | Summing up, in a linear neural network, \emph{a linear autoencoder is 202 | just doing principal components analysis}. So \emph{a non-linear 203 | autoencoder can be thought of as a non-linear generalization of 204 | PCA}. That's a useful fact to remember. Examination of the 205 | remainder of the paper suggests that these are the key facts. 206 | 207 | \section{Olshausen and Field (1996)} 208 | 209 | Presents a method for finding low-complexity representations of 210 | natural images, in terms of atomic images --- which they call ``sparse 211 | codes'' --- which are localized, oriented, and scale-sensitive. These 212 | are found using an unsupervised learning algorithm with a bias toward 213 | good quality, low-complexity representations. The codes seem to be 214 | quite similar to the receptive fields found in the human visual 215 | system. 216 | 217 | The \emph{receptive field} for a cell in the retina is the volume of 218 | space (roughly, a cone) which can stimulate that cell to fire. Nearby 219 | cells can have overlapping (or nearby) receptive fields. Other cells 220 | in the visual cortex also have receptive fields, but they may be more 221 | complex, since the light has already been filtered through one or more 222 | levels of processing. 223 | 224 | The paper claims that the receptive fields in the primary visual 225 | cortex are: (a) spatially localized; (b) oriented; and (c) can 226 | distinguish structure at different scales. 227 | 228 | There is then a question: so what are those receptive fields? In a 229 | way, we can view this as being the question: to what type of images do 230 | different cells in our primary visual cortex respond? Answering that 231 | question seems like a good start for understanding any higher-level 232 | image processing. It's the question: what are the atoms of image 233 | processing? Or perhaps a better way is to think of them as the 234 | molecules of image processing, since they're one level up from the 235 | pixel level. 
236 | 237 | They develop an unsupervised learning algorithm which, trained on 238 | natural data, can find receptive fields that are spatially localized, 239 | oriented, and can distinguish structure at different scales. 240 | 241 | Olshausen and Field want to decompose an image as: 242 | \begin{eqnarray} 243 | I(x,y) = \sum_j a_j \phi_j(x,y). 244 | \end{eqnarray} 245 | The idea is that the $\phi_j$ form a (possibly overcomplete) basis for 246 | the space of images. They want to choose the $\phi_j$ which ``results 247 | in the coefficient values being as statistically independent as 248 | possible over an ensemble of natural images''. In some sense, the 249 | different $a_j$ would be ``telling us different things'' about the 250 | image. They also want the coefficient values to be sparse, favouring 251 | simple representations over more complex. 252 | 253 | O \& F try to search for a suitable set of $\phi_j$s by introducing an 254 | error function: 255 | \begin{eqnarray} 256 | E = -\mbox{[preserve information]}-\lambda\mbox{[sparseness of } a_j {]}. 257 | \end{eqnarray} 258 | This error is \emph{for a single image}. The first term is just the 259 | $l_2$ error, i.e., (minus) the quadratic distance between the image 260 | and its representation. The sparseness term is just a nonlinear 261 | function of the $a_j$ coefficients, quantifying how sparse they are. 262 | 263 | The idea is to do online learning with this error function, presenting 264 | it with natural images, and gradually minimizing the error. (I see 265 | later in the article that it was actually batch learning using 266 | conjugate gradient descent. It appears that some kind of average 267 | error is being computed.) The result will be an overcomplete basis 268 | set that favours sparse decompositions of images. 269 | 270 | The ``sparsification'' idea is a very interesting one. Basically, 271 | it's a way of trying to force a kind of Occam's razor into the system. 272 | It's a bit like autoencoders, forcing a simple explanation of complex 273 | data. 274 | 275 | O \& F note that wavelets have been used to find sparse codes 276 | previously. 277 | 278 | 279 | \section{LeCun (1998)} 280 | 281 | \link{http://yann.lecun.com/exdb/publis/index.html\#lecun-98}{link} 282 | 283 | Reviews the classic two-part architecture: a feature extraction 284 | module, followed by a trainable classifier module. Points out that 285 | the real goal is to shunt as much as possible out of the feature 286 | extraction module and into the classifier module, since the first 287 | requires hand-engineering, while the second is (much more) automated. 288 | 289 | Makes the remarkable claim that the difference in error between test 290 | and training set scales as $(h/N)^\alpha$, where $h$ is a measure of 291 | how complex a classifier we're using, $N$ is the number of training 292 | examples, and $0.5 < \alpha < 1$. In other words, the error grows as 293 | the complexity of the machine grows. And it shrinks as the number of 294 | training samples grows. I wonder why this is the case? Could we come 295 | up with a model that more or less proves that this is the case? Maybe 296 | a renormalization argument? 297 | 298 | ``The fact that local minima do not seem to be a problem for 299 | multi-layer neural networks is somewhat of a theoretical mystery'': 300 | This is strange. Maybe it's the case that it's very hard to fall down 301 | into such local minima in high dimensions? 
I've personally had 302 | problems with very simple training data, but as soon as the training 303 | data and network become at all complex, those problems seem to vanish. 304 | This presumably means that ``most'' local minima are pretty darn good. 305 | 306 | The \emph{segmentation} problem: the problem of cutting up a string of 307 | characters. Notes that a nice heuristic is to try lots and lots of 308 | different cuts, and for each possible cut to score the cut by using 309 | the individual character classifier: if that classifier seems to be 310 | working well, then chances are that you have a good cut. 311 | 312 | The authors note that existing systems are based on hand-crafted 313 | feature extractors, but that they will not use hand-crafted features. 314 | 315 | MNIST: constructed by combining NIST Special Database 3 (SD-3) and 316 | Special Database 1 (SD-1). Apparently, NIST designated SD-3 as a 317 | training set, and SD-1 as a test set. But the two are actually very 318 | different from one another. SD-3 is a clean data set, taken from 319 | Census Bureau employees, while SD-1 is not so good, being taken from 320 | high-school students. They describe some details of how MNIST was 321 | constructed. I'll review a few particularly striking facts. First, 322 | each character is size normalized, while preserving aspect ratio, and 323 | centred. There was also anti-aliasing going on. So this can all be 324 | regarded as pre-processing of features. The database was prepared in 325 | three forms. One was the form I know. A second was a deslanted 326 | form. The third reduced the image resolution. 327 | 328 | Deslanting: The idea was to compute moments of inertia, and then to 329 | recenter things (vertically), while downsampling to 20 by 20. As 330 | we'll see below this significantly improves performance. 331 | 332 | Convolutional networks: They use local receptive fields, shared 333 | weights, and spatial sub-sampling. ``With local receptive fields, 334 | neurons can extract elementary visual features such as oriented edges, 335 | end-points, corners (or similar features in other signals such as 336 | speech spectrograms). These features are then combined by the 337 | subsequent layers in order to detect higher-order features.'' 338 | ``... elementary feature detectors that are useful on one part of the 339 | image are likely to be useful across the entire image. This knowledge 340 | can be applied by forcing a set of units, whose receptive fields are 341 | located at different places on the image, to have identical weight 342 | vectors.'' 343 | 344 | ``Units in a layer are organized in planes within which all the units 345 | share the same set of weights''. So the basic idea is to convolve a 346 | small window of weights over the original inputs. We call this a 347 | ``feature map''. I think Hinton later calls it a kernel(?) We will 348 | typically have several different feature maps. So what we have is a 349 | convolution stage. For example, we might have a 5 by 5 feature map. 350 | This is applied to a 5 by 5 receptive field in the input, i.e., a 5 by 351 | 5 area in the input. Each unit has 25 inputs, and so 25 weights and a 352 | bias. (A minimal sketch of this shared-weight convolution is given below.)
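As a concrete picture of those 25 shared weights, here is a toy numpy version of a single feature map: one 5 by 5 weight window and one bias, slid over every 5 by 5 patch of the image. It is my own sketch of the idea (stride 1, no padding, squashing function omitted), not code from the paper.

\begin{verbatim}
import numpy as np

def feature_map(image, w, b):
    # One feature map: the *same* 5x5 weights and single bias are applied
    # at every location, so only 26 parameters are trained in total.
    k = w.shape[0]
    H, W = image.shape
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(w * image[i:i+k, j:j+k]) + b
    return out

rng = np.random.RandomState(0)
image = rng.randn(28, 28)
w, b = rng.randn(5, 5), 0.0
print(feature_map(image, w, b).shape)   # (24, 24)
\end{verbatim}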
``all the units in a feature map share the same set of 25 353 | weights and the same bias so they detect the same feature at all 354 | possible locations on the input.'' ``The other feature maps in the 355 | layer use different sets of weights and biases, thereby extracting 356 | different types of local features.'' In LeNet-5 there are 6 feature 357 | maps. Note that a squashing function and bias apparently are used --- 358 | this wasn't apparent earlier, where the focus is on the convolution. 359 | Note that the feature map output will respect translations of the 360 | original image. 361 | 362 | Sub-sampling: The intuition is that exact location information is not 363 | necessary. ``Not only is the precise position of each of those 364 | features [identified by the feature maps] irrelevant for identifying 365 | the pattern, it is potentially harmful because the positions are 366 | likely to vary for different instances of the character.'' ``A simple 367 | way to reduce the precision with which the position of distinctive 368 | features are encoded in a feature map is to reduce the spatial 369 | resolution of the feature map. This can be achieved with a so-called 370 | sub-sampling layers [\emph{sic}] which performs a local averaging and 371 | a sub-sampling, reducing the resolution of the feature map, and 372 | reducing the sensitivity of the output to shifts and distortions.'' 373 | In LeNet-5 they use sub-sampling layers, which perform a kind of 374 | local averaging and sub-sampling. Basically, they use six 2 by 2 375 | feature maps, one for each of the previous six feature maps. ``Each 376 | unit computes the \emph{average} of its four inputs, multiplies it by 377 | a trainable coefficient, adds a trainable bias, and passes the result 378 | through a sigmoid function''. It's notable here that we don't have 379 | trainable weights in the ordinary fashion. It's also notable that 380 | things aren't overlapping in this case, unlike the local receptive 381 | fields. Possibilities for this layer: blurring, local max, local min. 382 | (Depends on parameter values.) 383 | 384 | Architecture: ``Successive layers of convolutions and sub-sampling are 385 | typically alternated...'' Traces the origins of the idea to Hubel and 386 | Wiesel and to Fukushima. It sounds as though the main new thing here 387 | is to try it out with backprop. The paper also describes some 388 | previous applications of convolutional neural networks to image and 389 | speech recognition. 390 | 391 | ``Since all the weights are learned with back-propagation, 392 | convolutional networks can be seen as synthesizing their own feature 393 | extractor.'' Big advantage of reducing the number of parameters: it 394 | reduces overfitting. 395 | 396 | LeNet-5: 7 layers, not counting the input. 32 by 32 inputs. Note 397 | that the characters are themselves 20 by 20 pixels centered in a 28 by 398 | 28 field. 399 | 400 | Layer C3 (third layer, convolutional): 16 feature maps. Each unit in 401 | each feature map is connected to several 5 by 5 neighbourhoods at 402 | identical locations in a subset of S2's feature maps. ``Why not 403 | connect every S2 feature map to every C3 feature map?'' (1) Reduce 404 | the number of connections; (2) Forces a break in symmetry in the 405 | network. My guess is that it would otherwise work, but might be 406 | slower. (An illustrative sketch of this kind of partial connectivity is given below.)
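The paper gives an explicit table of which S2 maps feed each C3 map; I haven't copied it here, but the mechanism is easy to sketch: each C3 map convolves over, and sums across, only its own subset of the S2 maps. The subsets and sizes below are made up purely for illustration.

\begin{verbatim}
import numpy as np

def valid_corr(img, k):
    # Plain 'valid' cross-correlation, as in the feature-map sketch above.
    n = k.shape[0]
    H, W = img.shape
    return np.array([[np.sum(k * img[i:i+n, j:j+n])
                      for j in range(W - n + 1)] for i in range(H - n + 1)])

# Made-up connectivity: each C3 map reads from a subset of the S2 maps.
subsets = [(0, 1, 2), (1, 2, 3), (2, 3, 4, 5), (0, 1, 4, 5)]

rng = np.random.RandomState(0)
s2_maps = [rng.randn(14, 14) for _ in range(6)]
c3_maps = []
for subset in subsets:
    kernels = [rng.randn(5, 5) for _ in subset]
    total = sum(valid_corr(s2_maps[s], k) for s, k in zip(subset, kernels))
    c3_maps.append(total)                  # bias and squashing omitted
print(len(c3_maps), c3_maps[0].shape)      # 4 maps, each 10 x 10
\end{verbatim}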
``Different feature maps are forced to extract different 407 | (hopefully complementary) features because they get different sets of 408 | input.'' 409 | 410 | Layer C5: 120 feature maps. Each unit is connected to a 5 by 5 411 | neighbourhood on all 16 of S4's feature maps. They state that this 412 | amounts to a full connection between S4 and C5 --- this is true 413 | because each feature unit is just a single unit. 414 | 415 | They use a scaled hyperbolic tangent as the squashing function. ``As 416 | seen before, the squashing function used in our Convolutional Networks 417 | is $f(a) = A \tanh(Sa)$. Symmetric functions are believed to yield 418 | faster convergence [i.e., learn at a faster rate], although the 419 | learning can become extremely slow if the weights are too small. The 420 | cause of this problem is that in weight space the origin is a fixed 421 | point of the learning dynamics, and, although it is a saddle point, it 422 | is attractive in almost all directions''. It seems likely to me that 423 | we will have a similar problem with the usual sigmoid function. They 424 | chose their parameters to ensure $f(\pm 1) = \pm 1$, i.e., for 425 | convenience. ``This particular choice of parameters is merely a 426 | convenience, and does not affect the result.'' 427 | 428 | They initialize weights with the inverse of the fan-in, omitting the 429 | square root that I am accustomed to use. ``The standard deviation of 430 | the weighted sum scales like the square root of the number of inputs 431 | when the inputs are independent, and it scales linearly with the 432 | number of inputs if the inputs are highly correlated. We choose to 433 | assume the second hypothesis since some units receive highly 434 | correlated signals.'' The second clause in the first sentence is 435 | simply false, since the weights are set independently of the inputs. 436 | It's interesting that their method apparently works okay anyway, i.e., 437 | it must be quite insensitive to this detail. 438 | 439 | Final layer in the network: Euclidean Radial Basis functions (RBF), 440 | one for each class (i.e., 10 in total), with 84 inputs. The output is 441 | the squared Euclidean distance between the inputs and the input 442 | weights. In other words, the RBF measures how close the input is to 443 | the weights. Fascinatingly, the initial values for these were set by 444 | hand, based on very simple versions of ASCII characters. 445 | 446 | ``[O]utput units... must be off most of the time. This is quite 447 | difficult to achieve with sigmoid units.'' Not sure why. 448 | 449 | Learning schedule: $\eta = 0.0005$ for the first two epochs, $0.0002$ 450 | for the next three, $0.0001$ for the next three, $0.00005$ for the 451 | next four, and $0.00001$ for the remaining epochs (up to 20, so it was 452 | eight). 453 | 454 | Distortions: ``When distorted data was used for training, the test 455 | error rate dropped to 0.8 percent (from 0.95 percent without 456 | deformation).'' It'd be nice to have a nice little library of 457 | transformations. 458 | 459 | Linear classifier: 12 percent error rate. When deslanted, gets 8.4 460 | percent error rate. ``Various combinations of sigmoid units, linear 461 | units, gradient descent learning, and learning by directly solving 462 | linear systems gave similar results''. ``A simple improvement of the 463 | basic linear classifier was tested. The idea is to train each unit of 464 | a single-layer network to separate each class from each other class. 
465 | In other words, there are ${10 \choose 2} = 45$ units. There is still 466 | a need to have a final decision procedure, and they simply chose the 467 | class which beat the largest number of other classes. ``The error 468 | rate on the regular test set was 7.6\%''. 469 | 470 | Baseline nearest neighbor classifier: Using Euclidean distance between 471 | input images. ``On the regular test set the error rate was 5.0\%. On 472 | the deslanted data, the error rate was 2.5\%, with $k = 3$.'' 473 | 474 | PCA: Computes the projection of the input pattern on the 40 principal 475 | components. ``The 40-dimensional feature vector was used as the input 476 | of a second degree polynomial classifer.'' ``The error on the regular 477 | test set was 3.3\%.'' 478 | 479 | Radial basis functions: Error rate of 3.6\%. 480 | 481 | One-hidden layer neural network: Error was 4.7\% for a network with 482 | 300 hidden units. Interesting: this is worse than my results, even 483 | when I'm using mean-square error (I get some improvement from using 484 | cross-entropy). I don't know why. My initialization is somewhat 485 | different. Otherwise, I can't think of any reason. They get a 486 | reduction to 4.5\% for a network with 1000 hidden units(!) They did 487 | even bettter with distortions: 3.6\% and 3.8\%, with 300 and 1000 488 | hidden units, respectively. When deslanted images were used, the test 489 | error dropped to 1.6\%, with 300 hidden units. Raises the question of 490 | why we don't get terrible overfitting, just on parameter counting 491 | grounds. 492 | 493 | Two-hidden layer neural network: ``The test error rate of a 494 | 784-300-100-10 network was 3.05\%, a much better result than the 495 | one-hidden layer network [4.7\%], obtained using marginally more 496 | weights and connections.'' This doesn't accord with my experience 497 | using basic backprop. Rather, it's like their results now match up 498 | with mine for both a single and two-hidden layer. (Admittedly, I do 499 | get an improvement --- albeit more modest --- if pretraining is used). 500 | However, I'm using both the cross-entropy and different weight 501 | initialization. So identical results wouldn't be expected. 502 | Increasing the network size to 784-1000-150-10 improved things only a 503 | tiny bit, to 2.95\%. Training with distorted patterns improved things 504 | to 2.5\% and 2.45\%, respectively. 505 | 506 | LeNet-1: A small convolutional net. It got 1.7\% test error rate. 507 | ``The fact that a network with such a small number [2,600] of 508 | parameters can attain such a good error rate is an indication that the 509 | architecture is appropriate for the task.'' 510 | 511 | Boosting: This is a technique which sounds like an idea I've been 512 | wondering about: concentrating more on training data which the network 513 | is misclassifying. 514 | 515 | Tangent distance classifier: This is an interesting idea. The idea is 516 | to consider the tangent plane near a digit image, where we're 517 | considering a (low-dimensional) submanifold generated by distortions 518 | and translations of the images. ``An excellent measure of `closeness' 519 | for character images is the distance between their tangent planes, 520 | where the set of distortions used to generate the planes includes 521 | translations, scaling, skewing, squeezing, rotation, and line 522 | thickness variations''. They use this measure of distance to run a 523 | nearest-neighbor method classifier. 
They get an error rate of 1.1\%, 524 | which is (obviously) excellent. 525 | 526 | Support vector machines: Depending on technique, results obtained 527 | varied between 1.4\% and 0.8\%. 528 | 529 | They report on the number of operations required to do a 530 | classification, and the convolutional networks do quite well. Much 531 | better than the SVMs, interestingly enough, perhaps because the SVMs 532 | are fitting high-order polynomials, and thus have a very large number 533 | of terms. 534 | 535 | ``When plenty of data is available, many methods can attain 536 | respectable accuracy. The neural-net methods run much faster and 537 | require much less space than memory-based techniques. The neural 538 | nets' advantage will become more striking as training databases 539 | continue to increase in size.'' 540 | 541 | Invariance and noise resistance: ``Convolutional networks are 542 | particularly well suited for recognizing or rejecting shapes with 543 | widely varying size, position, and orientation, such as the ones 544 | typically produced by heuristic segmenters in real-world string 545 | recognition systems. In an experiment like the one described above, 546 | the importance of noise resistance and distortion invariance is not 547 | obvious. The situation in most real applications is quite different. 548 | Characters must generally be segmented out of their context prior to 549 | recognition. Segmentation algorithms... often leave extraneous marks 550 | in character images... or sometimes cut characters too much and 551 | produce incomplete characters. Those images cannot be reliably 552 | size-normalized and centered. Normalizing incomplete characters can 553 | be very dangerous. For example, an enlarged stray mark can look like 554 | a genuine 1.'' 555 | 556 | Conclusions: ``Convolutional Neural Networks have been shown to 557 | eliminate the need for hand-crafted feature extractors. Graph 558 | Transformer Networks have been shown to reduce the need for 559 | hand-crafted heuristics, manual labeling, and manual parameter tuning 560 | in document recognition systems.'' ``It was shown that all the steps 561 | of a document analysis system can be formulated as graph transformers 562 | through which gradients can be back-propagated.'' ``It is worth 563 | pointing out that data generating models... and the Maximum Likelihood 564 | Principle were not called upon to justify most of the architectures 565 | and training criteria described in this paper.'' 566 | 567 | 568 | \section{Ferret rewiring (Nature, 2000)} 569 | 570 | The primary visual cortex has what are called orientation modules. 571 | These are groups of cells that share a preferred ``stimulus 572 | orientation''. It's not clear to me what a stimulus orientation is, 573 | exactly --- do they mean the direction the stimulus comes from? I'll 574 | get back to that. Anyway, there is apparently an orientation map. 575 | Well, when they rewire the ferrets' brains, apparently there are 576 | visually responsive cells in the auditory cortex that start to develop 577 | an orientation map! It's similar to the one in the visual cortex, 578 | although apparently less orderly. 579 | 580 | They use a nice piece of terminology: sensory pathways have an 581 | \emph{instructive} role in the development of cortical networks. The 582 | visual cortex apparently has a couple of different kinds of structure: 583 | ocular dominance columns, and orientation columns.
Actually, looking 584 | at Wikipedia, there's quite a bit more structure in there than that. 585 | Apparently, orientation columns were discovered simply by stimulating 586 | a cat with visual stimuli from different directions, and noticing 587 | where in the visual cortex excitement occurred. They're apparently 588 | little slabs of cells that respond to visual stimuli from a particular 589 | direction. Perhaps unsurprisingly, these columns are arranged into 590 | little pinwheels --- it's natural enough that they would reflect 591 | external geometry. 592 | 593 | They wanted to investigate ``whether afferent [i.e., sensory] activity 594 | or intrinsic features of the cortical target regulate the development 595 | of orientation columns.'' ``... within limits, input activity [from 596 | eyes to auditory cortex] has a significant instructive role in 597 | establishing the cortical circuits that underlie orientation 598 | selectivity and the orientation map''. 599 | 600 | They identify two separate things --- the degree of ``tuning'' in the 601 | cortex, as well as the orientation map. Apparently, these two things 602 | are found to be more or less independent. What's ``orientation 603 | tuning'' mean? Maybe it's a way of calibrating the respective meaning 604 | of activation of different orientation columns? ``... afferent 605 | activity is required for at least the maintenance of orientation 606 | selectivity in V1 neurons''. In other words, you destory the 607 | orientation structure if you don't get sensory input. This is a 608 | complementary result. 609 | 610 | 611 | \section{Tenenbaum, de Silva and Langford (2000)} 612 | 613 | \link{http://scholar.google.ca/scholar?cluster=14602426245887619907&hl=en&as_sdt=0,5}{(link)} 614 | They mention a technique called multidimensional scaling (MDS), which 615 | I hadn't heard of. The idea seems to be that we have a lot of items, 616 | and we know some ``dissimilarities'' between items. The goal is to 617 | find a metric space embedding of those items so that the distances are 618 | roughly equal to the dissimilarities. 619 | 620 | A sample problem: we have a 4096-dimensional space, corresponding to 621 | 64 by 64 pixel images. A (nonlinear) subspace of this corresponds to 622 | images we'd recognize as faces. How can we characterize this 623 | subspace? 624 | 625 | This is just one possible mathematical formalization of the problem. 626 | In practice, things are more complex. Our classification will be 627 | fuzzy. We'll have all kinds of extra contextual information: maybe 628 | we've got an external hint; maybe we can see a nose; maybe the colour 629 | is wrong, but we see enough to suspect it's false colour. All these 630 | kinds of things are clearly important in how we actually see. In 631 | other words, we don't just have an algorithm for face detection. We 632 | have a million related algorithms, and they all affect how well face 633 | detection works. In some sense you don't solve one problem perfectly. 634 | You solve a network of problems imperfectly --- and then use those 635 | results to improve your performance on the original problem. It's a 636 | kind of \emph{learning network}. In a sense this is what a deep 637 | neural network does: it builds up gradually more complicated features. 638 | 639 | The algorithm they describe is very simple. Very roughly (this 640 | certainly contains mistakes): the idea seems to be to take all your 641 | data points and to compute distances between them. 
We assume that 642 | when the distances are small, the points are neighbours. Construct a 643 | graph in which neighbouring points are connected. Then geodesic 644 | distance is found (approximated) by finding the shortest distance in 645 | the graph. We then embed the graph in a space of the chosen 646 | dimensionality. Nice! Simple, probably pretty easy to implement, and 647 | I expect it lets us find a lot of structure. 648 | 649 | It's worth thinking about what the input and output are. The input to 650 | Iso-map is just a data set --- maybe it's a set of images of a face, 651 | maybe it's a set of words, whatever. This data lives in a very 652 | high-dimensional space. What we do is we find an embedding in a much 653 | lower dimensional space --- say, 2-dimensional. In other words, we're 654 | constructing new features, based on the original features. 655 | 656 | \textbf{There are $10^6$ optic nerves and $30,000$ auditory nerves:} 657 | I'm not quite sure what to make of this. Presumably it means that we 658 | process something like $30$ times as much optical information as 659 | auditory. I wonder how pixellated the information is? 660 | 661 | \textbf{What happens when we augment the features, with PCA?} Let's 662 | suppose we start off with 3 features, $x, y, z$. Then we add $x^2$ 663 | and $y^2$ as new features. Certain subsets of the original space that 664 | weren't linearly approximable \emph{will be} in the new feature space. 665 | This seems like a potentially powerful technique. What can it be used 666 | to do? What are its limits? 667 | 668 | \section{Simard (2003)} 669 | 670 | ``Best Practices for Convolutional Neural Networks Applied to Visual 671 | Document Analysis'' 672 | 673 | ``The most important practice is getting a training set as large as 674 | possible: we expand the training set by adding a new form of distorted 675 | data''. They claim it's better even than being convolutional. ``The 676 | optimal performance on MNIST was achieved using two essential 677 | practices. First, we created a new, general set of elastic 678 | distortions that vastly expanded the size of the training set...'' 679 | 680 | ``We avoided using momentum, weight decay, structure-dependent 681 | learning rates, extra padding around the inputs, and averaging instead 682 | of subsampling. (We were motivated to avoid these complications by 683 | trying them on various architecture/distortions combinations and on a 684 | train/validation split of the data and finding they did not help.)'' 685 | 686 | They have lots of useful details about how they came up with their 687 | convolutional architecture. It's very similar to LeCun (1998), of 688 | course, but they have more detail on \emph{how} they chose the various 689 | parameters. Interestingly, they found that having 5 features in the 690 | first convolutional layer and 50 features in the second convolutional 691 | layer was more or less optimal. 692 | 693 | ``Convolutional neural networks have been proposed for visual tasks 694 | for many years [LeCun 1998], yet have not been popular in the 695 | engineering community. We believe that is due to the complexity of 696 | implementing the convolutional neural networks.'' 697 | 698 | They point out that implementation is complicated by the fact that not 699 | every unit has the same number of outgoing connections. 700 | 701 | The results suggest substantial improvements from both distortions, 702 | and the use of convolutional nets. 
They achieve a best-possible accuracy of 99.6\%, which was apparently a record at the time.

\section{Hinton, Osindero, and Teh (2006)}

\link{http://www.cs.toronto.edu/\~hinton/absps/ncfast.pdf}{A Fast Learning Algorithm for Deep Belief Nets}

``Learning is difficult in densely connected, directed belief nets that have many hidden layers because it is difficult to infer the conditional distribution of the hidden activities when given a data vector.''  I don't know why this is.  My impression is that it's easy to at least sample from the distribution of hidden activations.  Is that false?  Or maybe it's true and it's just the calculation of the distribution that is hard.  ``Variational methods use simple approximations to the true conditional distribution, but the approximations may be poor, especially at the deepest hidden layer, where the prior assumes independence.  Also, variational learning still requires all of the parameters to be learned together and this makes the learning time scale poorly as the number of parameters increase.''  I don't know what variational learning is.

``The network used to model the joint distribution of digit images and digit labels... work in progress has shown that the same learning algorithm can be used if the `labels' are replaced by a multilayer pathway whose inputs are spectrograms from multiple different speakers saying isolated digits.  The network then learns to generate pairs that consist of an image and a spectrogram of the same digit class.''  Fascinating: in other words, it will associate ``9'' both with different images of 9, and also with different people saying 9.

Discriminative model: a model which can be used to distinguish the MNIST digits.  Generative model: a model which, given a label, can be used to generate an image which in some sense samples from the MNIST distribution.

Generative models seem interesting in part because that's what we do (we can both read and write digits, for instance).  Of course, it's not entirely clear how these skills are associated.  One can learn to read without also learning to write; there are fine motor skills in the latter that are not all that closely associated with reading.

``There is a fine-tuning algorithm that learns an excellent generative model that outperforms discriminative methods on the MNIST database of hand-written digits.''  I haven't seen this kind of thing mentioned at all in later work --- it's all discriminative.

``The learning algorithm is local.  Adjustments to a synapse strength depend on only the states of the presynaptic and postsynaptic neuron.''  This seems very preferable to gradient descent!

\textbf{Explaining away:} Makes inference difficult in directed belief nets.  Basically, we can't figure out what the root causes must have been, given only partial evidence.

\section{Hinton and Salakhutdinov (2006)}

\link{http://scholar.google.ca/scholar?cluster=15344645275208957628}{(link)}

Their RBM uses ``symmetrically weighted connections''.  It is not clear to me what this means.  It seems to mean that the biases are the same on hidden and visible units.  I don't see how that can be --- aren't there different numbers of such units?
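Thinking about this a bit more: my best guess is that ``symmetrically weighted'' just means a single weight matrix is used in both directions (as $W$ going from visible to hidden, and as $W^T$ coming back down), while the two layers keep separate bias vectors of different lengths.  A minimal sketch of that reading, with binary units and my own variable names:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

n_visible, n_hidden = 784, 500
W = 0.01 * rng.standard_normal((n_visible, n_hidden))  # one shared weight matrix
b = np.zeros(n_hidden)    # hidden-unit biases
c = np.zeros(n_visible)   # visible-unit biases: a separate, different-length vector

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_step(v):
    """One step of block Gibbs sampling; W is used going up, W.T coming down."""
    p_h = sigmoid(v @ W + b)
    h = (rng.random(n_hidden) < p_h).astype(float)
    p_v = sigmoid(h @ W.T + c)
    v_new = (rng.random(n_visible) < p_v).astype(float)
    return h, v_new
\end{verbatim}

If that reading is right, there is no problem with the layers having different numbers of units: only the weights are shared between the two directions, not the biases.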
765 | 766 | So the idea is to take an RBM, and then use the training data to find 767 | a new set of features. We then use the features generated by the 768 | training data as a \emph{new} set of training data, for another RBM. 769 | We use that to find new features. And so on, through multiple levels 770 | of RBMs. We then use backpropagation to fine-tune the whole thing. 771 | It appears that the backpropagation is done with the weights treated 772 | as though in a deterministic neural network, not stochastic, as in an 773 | RBM. 774 | 775 | In a bit more detail, when working with real-valued data, the visible 776 | units in later RBMs were set to the activation probabilities of 777 | previous hidden units. I.e., probabilities became data. 778 | 779 | H and S used a deep network with 784-400-200-100-50-25-6 units. That 780 | is, they reduced 784-dimensional input data to just 6 parameters. 781 | And, visually at least, their reconstructions were very good, 782 | significantly better than 6-parameter PCA and similar techniques. 783 | 784 | What makes it difficult to train deep neural networks? I must admit, 785 | I don't really have a great answer to this question. Can we come up 786 | with a good \emph{a priori} reason for thinking it will be tough? 787 | It's not obvious that it should be tougher than a shallow network with 788 | the same number of neurons. 789 | 790 | H and S compare to the work of Tenenbaum \emph{et al} and Roweis and 791 | Saul, and comment: ``Unlike nonparametric methods (cites), 792 | autoencoders give mappings in both directions between the data and 793 | code spaces, and they can be applied to very large data sets because 794 | both the pretraining and the fine-tuning scale linearly in time and 795 | space with the number of training cases.'' I don't quite understand 796 | the comment about mappings in both directions --- I thought the 797 | earlier work provided such mappings. Perhaps I should look closer. 798 | 799 | \section{Bengio, Lamblin, Popovici, Larochelle (2007)} 800 | 801 | \link{http://www.iro.umontreal.ca/\~lisa/publications2/index.php/attachments/single/24}{Greedy 802 | Layer-Wise Training of Deep Networks} 803 | 804 | They have a complexity-theoretic point of view, a point of view that 805 | says depth (in circuits, or otherwise) helps compute functions. I 806 | guess this is more or less the point of view of computer scientists 807 | who believe that \textbf{NC} is a strict subset of \textbf{P}. 808 | 809 | In general, this is a point of view I haven't much engaged with. I've 810 | been thinking more in the detailed world of the practitioner, 811 | wondering just how well a given network functions, and not thinking 812 | about these structural questions. But I suppose there is a deep 813 | structural question here, which is whether there are deep networks 814 | that can compute functions using polynomially many elements, and said 815 | functions require exponentially many more elements in a shallow 816 | network? 817 | 818 | A skeptical way of looking at this is to say that this is a question 819 | about scaling, and that scaling isn't what matters for solving pattern 820 | recognition problems in the real world, since we have just one such 821 | world, of fixed size. But to be skeptical of the skeptic, we would 822 | still find it interesting if, in the real world, we were trying to 823 | learn functions which were much easier to compute by a deep network 824 | than a shallow. 825 | 826 | Why might deep networks be better? 
Two broad reasons: ease of 827 | computation; and ease of learning. I'd like to understand both these: 828 | Why might computation be easier? And why might learning be easier? 829 | 830 | Well, those notes get me to the end of the first sentence of the 831 | abstract! Let me skip ahead and see if I can sum up the first 832 | paragraph, since it seems very interesting. The basic problem is the 833 | ability of various machine-learning algorithms to learn highly-varying 834 | functions, ``e.g., they would require a large number of pieces to be 835 | well represented by a piecewise-linear approximation. Since the 836 | number of pieces can be made to grow exponentially... If the shapes of 837 | all these pieces are unrelated, one needs enough examples for each 838 | piece in order to generalize properly. However, if these shapes are 839 | related and can be predicted from each other, `non-local' learning 840 | algorithms have the potential to generalize to pieces not covered by 841 | the training set.'' I can sort of see this: basically, linear 842 | boundaries aren't going to give us very much, even with new features: 843 | they can't go a huge amount beyond what is already in the input data. 844 | But I don't quite see what non-linearities do to get beyond this. I 845 | guess it's that we're starting to learn from multiple pieces of 846 | training data at once, and making higher-order generalizations. 847 | (Basically, once you can do {\sc and} gates, you can do conditional 848 | logic, and that lets you build up hierarchical reasoning.) 849 | 850 | 851 | \section{Pinto, Cox and DiCarlo (2008)} 852 | 853 | \link{http://www.ploscompbiol.org/article/info\%3Adoi\%2F10.1371\%2Fjournal.pcbi.0040027}{Why is Real-World Visual Object Recognition Hard?} 854 | 855 | ``[W]e show that a simple V1-like [computational?] model --- a 856 | neuroscientist's `null' model, which should perform poorly at 857 | real-world visual object recognition tasks --- outperforms 858 | state-of-the-art object recognition systems (biologically inspired and 859 | otherwise) on a standard, ostensibly natural image recognition test.'' 860 | I'm not sure what moral to take away. That simple systems can do well 861 | recognizing natural images? But they also created another ``simple'' 862 | test which demonstrated the inadequacy of their system. ``Taken 863 | together, these results demonstrate that tests based on uncontrolled 864 | natural images can be seriously misleading...'' The ultimate 865 | conclusion is that they want more focus on real-world image variation, 866 | by which they mean that the same object can cast a potentially 867 | infinite number of variations on the eye. 868 | 869 | ``[I]t is not clear to what extent such `natural' image tests [like 870 | Caltech101] actually engage the core problem of object recognition. 871 | Specifically, while the Caltech101 set certainly contains a large 872 | number of images (9,144 images), variations in object view, position, 873 | size, etc., between and within object category are poorly defined and 874 | are not varied systematically [I can see that this might be a problem 875 | if the sampling is not reasonably fair]. Furthermore, image 876 | backgrounds strongly covary with object category [wow!].... 
The majority of images are also `composed' photographs, in that a human decided how the shot should be framed [!], and thus the placement of objects within the image is not random and the set may not properly reflect the variation found in the real world.  Furthermore, if the Caltech101 object recognition task is hard, it is not easy to know what makes it hard---different kinds of variation (view, lighting, exemplar, etc.) are all inextricably linked together.''

``We built a very basic representation inspired by known properties of V1 `simple' cells... The responses of these cells to visual stimuli are well-described by a spatial linear filter, resembling a Gabor wavelet... with a nonlinear output function... and some local normalization (roughly analogous to `contrast gain control').''

\section{Deng (2009)}

``ImageNet: A Large-Scale Hierarchical Image Database''

An ``ontology of images built upon the backbone of the WordNet structure''.  ``Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a `synonym set' or `synset'.''  There are apparently 80 thousand (noun) synsets in WordNet.  (I presume there are verb synsets, and perhaps other types as well?)  The idea of ImageNet is to provide 500-1,000 images per synset.  That's a grand total of some tens of millions of images.  ``Images of each concept are quality-controlled and human-annotated''.  To some extent this means that they don't match what we'll actually find ``in the wild''.  This paper reports early work --- 5,247 synsets and 3.2 million images.

``ImageNet aims to provide the most comprehensive and diverse coverage of the image world.  The current 12 subtrees consist of a total of 3.2 million cleanly annotated images spread over 5,247 categories... To our knowledge this is already the largest clean [what does this mean?] image dataset available to the vision research community, in terms of the total number of images, number of images per category as well as the number of categories.''  ``... to our knowledge no existing vision dataset offers images of 147 dog categories.''

Even at very low levels in the tree, ImageNet labels were found to be highly accurate by an independent group of subjects.

``ImageNet is constructed with the goal that objects in images should have variable appearances, positions, view points, poses as well as background clutter and occlusions.''  They do an interesting thing to measure diversity.  They compute an ``average image'' for each synset, and then measure the JPEG file size.  The idea is that very different images will blur out (and so have small file sizes), while more similar images will not (and so will have large file sizes).  They find that their images are much more diverse than Caltech101.

Images are collected by querying several image search engines with the appropriate noun or noun phrase.  ``To obtain as many images as possible, we expand the query set by appending the queries with the word [?] from parent synsets, if the same word appears in the gloss of the target synset [?]''.  ``To further enlarge and diversify the candidate pool, we translate the queries into other languages''.
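To make the diversity trick concrete for myself, a rough sketch (the function name and JPEG settings are mine; the paper doesn't spell these out):

\begin{verbatim}
import io
import numpy as np
from PIL import Image

def average_image_jpeg_size(images, quality=75):
    """Diversity proxy: JPEG-compress the average image of a synset.

    `images` is a list of equal-sized RGB uint8 arrays.  A diverse synset
    averages out to a blurry, featureless image, which compresses to a
    small file; a homogeneous synset keeps structure and compresses less.
    """
    mean = np.stack(images).astype(np.float64).mean(axis=0)
    buf = io.BytesIO()
    Image.fromarray(mean.astype(np.uint8)).save(buf, format="JPEG",
                                                quality=quality)
    return len(buf.getvalue())
\end{verbatim}

Smaller numbers mean more diversity, which is the sense in which they find ImageNet comes out ahead of Caltech101.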
934 | 935 | Using Mechanical Turk to label: ``In each of our labeling tasks, we 936 | present the users with a set of candidate images and the definition of 937 | the target synset... We then ask users to verify whether each image 938 | contains objects of the synset. We encourage users to select images 939 | regardless of occlusions, number of objects and clutter in the scene 940 | to ensure diversity.'' Of course, the problems are that people make 941 | mistakes, and they may not agree with one another. They get multiple 942 | people to label each image, and only classify something positively if 943 | an image gets a convincing majority of the votes. ``... different 944 | categories require different levels of consensus among users.'' 945 | Basically, the more contentious, the more votes we need to be sure. 946 | They have to do some initial setup to figure out the appropriate 947 | thresholds (or if a threshold fails to exist). 948 | 949 | Nice idea: classifying at each node in the WordNet net. This reduces 950 | the classification difficulty at each step. I wonder if there's a 951 | natural way this can be done in deep neural nets? Maybe by building a 952 | feature representation unsupervised, and then using those features to 953 | train a (tree-like) classifier? ``At nearly all levels, the 954 | performance of the tree-max classifier is consistently higher than the 955 | independent classifier.'' 956 | 957 | \section{Jarrett (2009)} 958 | 959 | ``What is the Best Multi-Stage Architecture for Object Recognition?'': 960 | \link{http://yann.lecun.com/exdb/publis/pdf/jarrett-iccv-09.pdf}{link} 961 | 962 | This is a much more conventional paper about object recognition than 963 | the material I've been reading. The basic idea is to build a pretty 964 | good feature extractor, and then to use a standard (supervised) 965 | classifier. 966 | 967 | ``We show that using non-linearities that include rectification and 968 | local contrast normalization is the single most important ingredient 969 | for good accuracy on object recognition benchmarks.'' ``[T]he SIFT 970 | operator applies oriented edge filters to a small patch and determines 971 | the dominant orientation through a winner-take-all operation.'' 972 | ``Several recognition architectures use a single stage of such 973 | features followed by a supervised classifier.'' 974 | 975 | ``At first glance, one may think that training a complete system in a 976 | purely supervised manner (using gradient descent) is bound to fail on 977 | dataset with small number of labeled samples such as Caltech-101, 978 | because the number of parameters greatly outstrips the number of 979 | samples. [Yes, one might think this] One may also think that the 980 | filters need to be carefully hand-picked (or trained) to produce good 981 | performance [Yes, at least to the training part], and that the details 982 | of the non-linearity play a somewhat secondary role [Again, agreed]. 983 | These intutitions, as it turns out, are wrong. [!]'' 984 | 985 | ``A common choice for the filter bank of the first stage is Gabor 986 | Wavelets. [A linear filter used for edge detection. Apparently there 987 | are similar things in the visual cortex!] Other proposals use simple 988 | oriented edge detection filters such as gradient operators, including 989 | SIFT, and HoG. Another set of methods learn the filters by adapting 990 | them to the statistics of the input data with unsupervised 991 | learning. [This is the deep neural nets approach] ... 
The advantage of 992 | learning methods is that they provide a way to learn the filters in 993 | subsequent stages of the feature hierarchy. While prior knowledge 994 | about image statistics point to the usefulness of oriented edge 995 | detectors at the first stage, there is no similar prior knowledge that 996 | would allow to design sensible filters for the second stage in the 997 | hierarchy. Hence the second stage \emph{must be learned}.'' This 998 | seems overly pessimistic to me: one can certainly imagine a theory 999 | that tells us what features there should be at the second level. 1000 | Still, it's obviously an attractive model. 1001 | 1002 | ``The second ingredient of a feature extraction system is the 1003 | non-linearity.'' I don't really understand deeply why non-linearity 1004 | is so necessary. It'd be good to do so. 1005 | 1006 | Notes that pooling can be applied over space, over scale and space 1007 | (rescaling?), and over similar feature types and space. ``This layer 1008 | [pooling] builds robustness by computing an average or a max of the 1009 | filter responses within the pool.'' 1010 | 1011 | Caltech 101: 101 categories. About 50 images per category, and the 1012 | size of each image is roughly 300 by 200 pixels. SIFT features plus a 1013 | linear classifier will give us 50 percent classification accuracy. 1014 | Using a better classifer will give us 65 percent. ``[T]he best 1015 | results on Caltech-101 have been obtained by combining a large number 1016 | of different feature families [29]''. Reference is to Varma and Ray. 1017 | 1018 | ``The hierarchy stacks one or several feature extraction stages, each 1019 | of which consists of filter bank layer, non-linear transformation 1020 | layers [\emph{sic}?], and a pooling layer that combines filter 1021 | responses over local neighborhoods using an average or max operation, 1022 | thereby achieving invariance to small distortions.'' 1023 | 1024 | 1025 | 1026 | Conclusions: ``[U]sing a rectifying non-linearity is the single most 1027 | important factor in improving the performance of a recognition 1028 | system[!]'' I don't understand the heuristic justifications they 1029 | give. ``Also introducing a local normalization layer improves the 1030 | performance. It appears to make supervised learning considerably 1031 | faster, perhaps because all variables have similar variances (akin to 1032 | the advantages introduced by whitening and other decorrelation 1033 | methods).'' 1034 | 1035 | \section{Lee (2009) - video} 1036 | 1037 | \link{http://videolectures.net/icml09\_lee\_cdb/}{link} A video 1038 | version of the paper below. ``We are interested in scaling up deep 1039 | belief networks to learn generative models and to perform inference on 1040 | challenging problems.'' RBMs. Visible nodes: input (training) data. 1041 | Hidden nodes: encode statistical relationships in the visible nodes. 1042 | ``Unsupervised training using Contrastive Divergence approximation to 1043 | maximum likelihood''. Deep belief network: ``Greedy layerwise 1044 | training using RBMs''. Want to scale DBNs to realistic image sizes: 1045 | 200 by 200 pixels. One way to deal with this is to use a 1046 | convolutional net. Alternate between ``detection'' and ``pooling'' 1047 | layers. ``Detection layers involve weights shared between all image 1048 | locations'': we have a window of features, sliding across the input 1049 | image. ``Each pooling unit computes the maximum of the activation of 1050 | several detection units''. 
It shrinks the representation in higher layers.  They define a convolutional RBM.  It's very similar to a standard RBM, but with a couple of differences.  One, the weights are shared across hidden units, as in a convolutional net.  Second, they impose a constraint on the hidden units --- basically, local sums can't be too large.  It's not quite clear to me why they're doing this, but they are.  They can still do block Gibbs sampling.

Convolutional DBNs: They do greedy, layerwise training, training one convolutional RBM at a time.  They can both infer forwards and backwards through the layers.

Results (MNIST): They trained a two-layer CDBN on \emph{unlabeled} MNIST data.  The first layer learns ``strokes'', while the second layer learns groupings of strokes.  Nice results: down to 0.82\% error rate.  I like the fact that they talk about how the error rate scales with the number of labeled examples.

Results (natural images): The first layer learns localized, oriented edges.  Second layer: contours, corners, arcs, surface boundaries.  Caltech 101: 65.4\% accuracy.  Final result is competitive.  Training images unrelated to Caltech 101.  Three-layer network from faces: first layer learns edges, second layer learns eyes, third layer learns faces.  They're computing some kind of precision-recall curve.  I don't quite get this --- it's an unfamiliar usage to me.  They do some training with multiple classes (cars, faces, motorbikes, aeroplanes).  The first layer gets general-purpose features.  The second layer gets object-class-specific features, as well as some shared features.  The third layer gets highly specific features.  Nice conditional entropy graph: uncertainty in the class, given the number of features which are active.  Wonderful ``filling in'' of faces.

\section{Lee (2009)}

RBMs.  Two layer.  Bipartite.  Undirected.  Binary hidden units, $h$.  Binary or real-valued visible units, $v$.  A weight matrix $W$ between the two layers.  If the visible units are binary, then we define the energy:
\begin{eqnarray}
E = -v^T W h - b^T h - c^T v,
\end{eqnarray}
where $b$ are the hidden unit biases, and $c$ are the visible unit biases.  For real-valued visible units, modify the energy by adding a $\frac{1}{2} v^T v$ term.  This model is simple enough.  How should we think about it?  The idea is to start with a given set of values for one layer, say the visible layer.  Then sample the hidden units.  Then sample the visible layer.  And so on, ping-ponging back and forth.

``In principle, the RBM parameters can be optimized by performing stochastic gradient ascent on the log-likelihood of the training data.''  The parameters to be optimized are presumably the weights and biases.  The likelihood is the probability of the observed outcomes (i.e., the training data), given the particular parameters.  I assume that the idea is that the visible units are supposed to represent the observed data.  So we want to choose the parameters of the model in order to maximize the probability of seeing the training data in the visible units.  Apparently contrastive divergence is a technique for approximating the gradient of the log-likelihood.

Convolutional RBM.
The weights between the hidden and visible layers are shared among all locations in an image.  What exactly does this mean?  Suppose we have an $N_V \times N_V$ image.  Then the input layer apparently consists of $N_V \times N_V$ binary units.  There are $K$ groups in the hidden layer, each an $N_H \times N_H$ array of binary units.  So there are $N_H^2 K$ total hidden units.

We index the hidden groups by $k$.  Each hidden group has a bias, $b_k$.  All visible units share a single bias, $c$.

For any given group, $k$, we have a single set of $N_W \times N_W$ weights (the ``filter'').  $N_W \equiv N_V - N_H + 1$.  The basic idea is to filter the inputs by translating the filter across the input image.

I will come back to the energy function a little later.  XXX.  We can do Gibbs sampling to generate the appropriate distributions.

\section{Scherer (2010)}

\link{http://www.ais.uni-bonn.de/papers/icann2010\_maxpool.pdf}{Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition}

Notes that many standard models are based on Hubel and Wiesel: the Neocognitron, convolutional nets, HoG, SIFT, Gist features, and HMAX.  ``These models can be broadly distinguished by the operation that summarizes over a spatial neighbourhood.  Most earlier models perform a subsampling operation, where the average over all input values is propagated to the next layer... A different approach is to compute the maximum value in a neighborhood... While entire models have been extensively compared, there has been no research evaluating the choice of the aggregation function so far.  The aim of our work is therefore to empirically determine which of the established aggregation functions is more suitable for vision tasks.  Additionally, we investigate if ideas from signal processing, such as overlapping receptive fields and window functions can improve recognition performance.''

They note that there are so many variants on complex cells / pooling operations that it's impossible to do a complete analysis.  Instead, they're going to choose a particular model and analyse that, based on convolutional neural networks.  ``Our choice of a CNN is largely motivated by the fact that the operation performed by pooling layers is easily interchangeable without modifications to the architecture.''

``The purpose of the pooling layers is to achieve spatial invariance by reducing the resolution of the feature maps.''  Is that really right?  We don't actually want spatial invariance --- relative positions matter.  But the fine positional details don't matter.  It's a way of saying small spatial shifts (relative to feature size) don't matter.  So a better sentence would be: the purpose of the pooling layers is to ensure that small spatial shifts (relative to feature size) don't matter.

``We evaluate two different pooling operations: max pooling and subsampling.''  Subsampling computes an average and multiplies by a trainable scalar.  Max pooling applies a window function and computes the maximum in the neighbourhood.
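To keep the two aggregation functions straight in my head, a minimal sketch (my own code; I'm ignoring the window functions and the trainable scalar they also consider):

\begin{verbatim}
import numpy as np

def pool2d(fmap, size=3, stride=2, mode="max"):
    """Pool a 2D feature map with a size-by-size window moved by `stride`.

    mode="max" keeps the largest response in each window; mode="avg" is
    the subsampling-style average.
    """
    H, W = fmap.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            win = fmap[i * stride:i * stride + size,
                       j * stride:j * stride + size]
            out[i, j] = win.max() if mode == "max" else win.mean()
    return out
\end{verbatim}

Overlapping pooling, in this picture, is just the case where the stride is smaller than the window size (as with the 3-by-3 windows moved by 2 in the defaults above).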
1168 | 1169 | They wanted to do the following: (1) figure out how max pooling and 1170 | subsampling compare; (2) determine whether overlapping pooling windows 1171 | improve performance; and (3) find suitable window functions. 1172 | 1173 | ``For both NORB and Caltech-101 our results indicate that 1174 | architectures with a max pooling operation converge considerably 1175 | faster than those employing a subsampling operation. Furthermore, 1176 | they seem to be superior in selecting invariant features and improve 1177 | generalization.'' Of course, this conclusion only applies in their 1178 | specific context. Maybe if we increased the data set then this would 1179 | no longer be true? Or if we changed the architecture in some other 1180 | way? Furthermore, no explanation of why it is true has been given. 1181 | 1182 | ``To evaluate how the step size of overlapping pooling windows affects 1183 | recognition rates, we essentially used the same architectures as in 1184 | the previous section. Adjusting the step size does, however, change 1185 | the size of the feature maps [I don't see why --- we can make them the 1186 | same size] and with it the total number of trainable parameters, as 1187 | well as the ratio between fully connected weights and shared 1188 | weights.'' I must admit I don't understand what's being done here, or 1189 | in the remainder of this section. I think it would be dangerous for 1190 | me to take much away from it. 1191 | 1192 | Comparison to Coates' (2011) paper: Coates found that shorter stride 1193 | length helped. However, the model used seems to have been quite a bit 1194 | different to this paper. So I'm not sure I'd read too much into 1195 | either result --- more study is, I think, needed, to understand this. 1196 | 1197 | \section{Coates (2011)} 1198 | 1199 | \link{http://www.stanford.edu/~acoates/papers/coatesleeng\_aistats\_2011.pdf}{An 1200 | Analysis of Single-Layer Networks in Unsupervised Feature Learning} 1201 | 1202 | ``In this paper... we show that several simple factors, such as the 1203 | number of hidden nodes in the model, may be more important to 1204 | achieving high performance than the learning algorithm or the depth of 1205 | the model... Our results show that large numbers of hidden nodes and 1206 | dense feature extraction are critical to achieving high performance.'' 1207 | They actually get state-of-the-art performance using only a single 1208 | layer of features. This is interesting: it's a case where deep 1209 | learning \emph{doesn't} help. But increasing the number of features 1210 | \emph{does} help --- a lot! 1211 | 1212 | Reviews the standard practice: use unsupervised learning to pre-train 1213 | multiple layers of features. 1214 | 1215 | ``Even with very simple algorithms and a single layer of features, it 1216 | is possible to achieve state-of-the-art performance by focusing effort 1217 | on these choices [number of features, dense feature extraction, 1218 | whitening] rather than on the learning system itself.'' 1219 | 1220 | ``[W]e employ very \emph{simple} learning algorithms and then more 1221 | carefully choose the network parameters in search of higher 1222 | performance. If (as is often the case) larger representations perform 1223 | better, then we can leverage the speed and simplicity of these 1224 | learning algorithms to use larger representations.'' 1225 | 1226 | CIFAR-10: 60,000 32 by 32 colour images in 10 classes, with 6,000 1227 | images per class. 
There are 50,000 training images and 10,000 test 1228 | images. CIFAR-10 is a subset of the ``80 million tiny images'' 1229 | dataset. 1230 | 1231 | CIFAR-100: Like CIFAR-10, but with 100 classes containing 600 images 1232 | each. I.e., CIFAR-100 is a more difficult problem. 1233 | 1234 | So the CIFAR data sets can be thought of as small but challenging 1235 | class recognition data sets. 1236 | 1237 | ``It will turn out that whitening, large numbers of features, and 1238 | small stride lead to uniformaly better performance regardless of the 1239 | choice of unsupervised learning algorithm... the main contribution of 1240 | our work is in demonstrating that these considerations may, in fact, 1241 | be \emph{critical} to the success of feature learning algorithms --- 1242 | potentially more important even than the choice of unsupervised 1243 | learning algorithm. Indeed, it will be shown that when we push these 1244 | parameters to their limits that we can achieve state-of-the-art 1245 | performance, outperforming many other more complex algorithms on the 1246 | same tasks.'' 1247 | 1248 | This really makes me wonder about the standard claims made about deep 1249 | learning. 1250 | 1251 | ``Since the introduction of unsupervised pre-training, many new 1252 | schemes for stacking layers of features to build `deep' 1253 | representations have been proposed. Most have focused on creating new 1254 | training algorithms to build single-layer models that are composed to 1255 | build deeper structures. Among the algorithms considered in the 1256 | literature are [long list]. Thus, amongst the many components of 1257 | feature learning architectures, the unsupervised learning module 1258 | appears to be the most heavily scrutinized.'' 1259 | 1260 | ``Some work, however, has considered the impact of other choices in 1261 | these feature learning systems, especially the choice of network 1262 | architecture. Jarret et al. [11], for instance, have considered the 1263 | impact of changes to the ``pooling'' strategies frequently employed 1264 | between layers of features, as well as different forms of 1265 | normalization and rectification between layers.'' One reason this is 1266 | interesting is that it suggests a direction in which to take work. 1267 | 1268 | ``While we confirm that some feature-learning schemes are better than 1269 | others, we also show that the differences can often be outweighted by 1270 | other factors, such as the number of features. Thus, even though more 1271 | complex learning schemes may improve performance slightly, these 1272 | advantages can be overcome by fast, simple learning algorithms that 1273 | are able to handle larger networks.'' [It'd be nice to know more 1274 | about the impact of changed data set size as well.] Summing up: more 1275 | sophisticated algorithms may not be as useful as increasing the basic 1276 | parameters in a simple algorithm. But given this, I'd like to know 1277 | why Ng used a deep RICA network in his later work? 1278 | 1279 | ``At a high-level [\emph{sic}], our system performs the following 1280 | steps to learn a feature representation: 1. Extract random patches 1281 | from unlabeled training images. 2. Apply a pre-processing stage to 1282 | the patches. 3. Learn a feature-mapping using an unsupervised learning 1283 | algorithm. [So this is how we learn the features to be used. Now we 1284 | move to classification.] 
Given the learned feature mapping and a set 1285 | of labeled training images we can then perform feature extraction and 1286 | classification: 1. Extract features from equally spaced sub-patches 1287 | [why equally spaced? why use sub-patches?] covering the input image. 1288 | 2. Pool features together over regions of the input image to reduce 1289 | the number of feature values. [I guess this makes sense if we're using 1290 | small local features, as does the use of sub-patches.] 3. Train a 1291 | linear classifier to predict the labels given the feature vectors.'' 1292 | 1293 | ``It is common practice to perform several simple normalization steps 1294 | before attempting to generate features from data. In this work, we 1295 | assume that every patch $x^{(i)}$ is normalized by subtracting the 1296 | mean and dividing by the standard deviation of its elements. For 1297 | visual data, this corresponds to local brightness and contrast 1298 | normalization.'' 1299 | 1300 | ``For our purposes, we will view an unsupervised learning algorithm as 1301 | a `black box' that takes the [training] dataset $X$ and outputs a 1302 | function $f : R^N \rightarrow R^K$ that maps an input vector $x^{(i)}$ 1303 | to a new feature vector of $K$ features, where $K$ is a parameter of 1304 | the algorithm.'' 1305 | 1306 | After learning features, they do a type of convolutional extraction: 1307 | basically, stepping across the images with a particular stride length, 1308 | and extracting $K$-dimensional features at each stage. 1309 | 1310 | They do a funny form of pooling. They split their features up into 1311 | four quadrants, and simply sum over each quadrant. That gives them a 1312 | total of $4K$ features to use for classification. I must admit, this 1313 | seems to me like a rather strange procedure to use. They don't appear 1314 | to discuss it at much length. 1315 | 1316 | After pooling they use a linear classifier --- an SVM, with the 1317 | regularization parameter determined by cross-validation. 1318 | 1319 | ``For sparse autoencoders and RBMs, the effect of whitening is 1320 | somewhat ambiguous. When using only 100 features, there is a 1321 | significant benefit of whitening for sparse RBMs, but this advantage 1322 | disappears with larger numbers of features. For the clustering 1323 | algorithms, however, we see that whitening is a crucial pre-process 1324 | since the clustering algorithms cannot handle the correlations in the 1325 | data.'' 1326 | 1327 | Whitening made a big difference for both k-means measures, and for 1328 | Gaussian mixture models. It made only a small difference for the 1329 | sparse autoencoder and for the RBM. 1330 | 1331 | The number of features made a big difference for all approaches. It's 1332 | not clear what the asymptotic performance will be, but even with 1600 1333 | features (where they stopped) things were still improving quite a bit. 1334 | 1335 | The stride length also had a huge impact on performance. I find this 1336 | really interesting! It'd be interesting to understand the performance 1337 | tradeoffs. 1338 | 1339 | Size of the local receptive field didn't have quite as much of an 1340 | impact. Indeed, increasing the size sometimes decreased performance, 1341 | when other factors (e.g., number of features) was held constant. 1342 | 1343 | They got the best known results on CIFAR 10 using k-means. (Note that 1344 | this has since been greatly improved.) 
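The quadrant pooling a few paragraphs back still strikes me as odd, so let me write down my reading of it (the function name is mine, and this is a sketch of my understanding, not their code):

\begin{verbatim}
import numpy as np

def quadrant_pool(feature_map):
    """Sum-pool a (rows, cols, K) grid of feature vectors over the four
    spatial quadrants of the image, giving 4K values for the classifier."""
    rows, cols, _ = feature_map.shape
    r, c = rows // 2, cols // 2
    quadrants = [feature_map[:r, :c], feature_map[:r, c:],
                 feature_map[r:, :c], feature_map[r:, c:]]
    return np.concatenate([q.sum(axis=(0, 1)) for q in quadrants])
\end{verbatim}

Seen this way it's just a very coarse pooling layer: it throws away nearly all spatial detail while keeping a little information about where in the image each feature fired.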
``Our results above may seem inexplicable considering the simplicity of the system --- it is not clear, on first inspection, exactly what in our experiments allows us to achieve such high performance compared to prior work....  Each of the network parameters (feature count, stride and receptive field size) we've tested potentially confers a significant benefit on performance.  For instance, large numbers of features (regardless of how they're trained) gives us many non-linear projections of the data... using extremely large numbers of non-linear projections can make data closer to linearly separable and thus easier to classify.  [E.g., the kernel trick]  Hence, larger numbers of features may be uniformly beneficial, regardless of the training algorithm''

``It appears that large receptive fields result in a space that is simply too large to cover effectively with a small number of nonlinear features.''

\textbf{Takeaways:} the notion of a pipeline: feature learning by unsupervised techniques, followed by a standard classifier (e.g., SVM); increasing the number of features learned can help \emph{a lot}; larger local receptive fields don't seem to help, and can actually hinder; a shorter stride length can help quite a bit; K-means (using the triangle technique) can help a lot.

\section{Le, Karpenko et al (2011)}

``ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning'': \link{http://ai.stanford.edu/~ang/papers/nips11-ICAReconstructionCost.pdf}{link}

ICA as a technique for unsupervised feature learning.  Points out that standard ICA learns orthonormal features, while they want overcomplete feature sets.  ``Using our method to learn highly overcomplete sparse features and tiled convolutional neural networks, we obtain competitive performances on a wide variety of object recognition tasks.  We achieve state-of-the-art test accuracies on the STL-10 and Hollywood2 datasets.''

``Sparsity has been shown to work well for learning feature representations that are robust for object recognition.''  What exactly is a sparse feature?  I guess in the case of sparse autoencoders we only allow a relatively small number of hidden neurons to be on.  Algorithms for learning sparse features: sparse auto-encoders, RBMs, sparse coding, and ICA.

``[Standard] ICA has two major drawbacks.  First, it is difficult to learn \emph{overcomplete feature representations}''.  Goes on to claim that classification performance works better when features are overcomplete.  This makes a certain amount of sense: certainly, there should be no problem having overlapping features.  Also claims that ICA is sensitive to whitening, and this makes it difficult to scale ICA to high dimensional data.

Regular ICA: Let $x^k$ be the training data.  Choose a penalty function $g(\cdot)$.  They suggest $g(z) = \log(\cosh(z))$.  Let $W_j$ be a row in a weight matrix.  Then $W_j x^k$ measures the overlap between the weight vector and the training data.  If it's one, then $x^k$ is very much like the weight vector.  And if it's less than one, then it's less so.  So we simply sum over features, $W_j$, and over training data, $x^k$.
The goal is to ``find the best features'', i.e., to minimize:
\begin{eqnarray}
\sum_{jk} g(W_j x^k).
\end{eqnarray}
This is done subject to the constraint that $WW^T = I$, i.e., the feature vectors (the rows of $W$) are orthonormal to one another.  ICA is done assuming zero mean for the training data, $\sum_k x^k = 0$, and unit covariance, $\sum_k x^k (x^k)^T = m I$.  This is achieved by whitening the data.

Reconstruction ICA (RICA): Minimize:
\begin{eqnarray}
\frac{\lambda}{m} \sum_k \| W^T W x^k - x^k\|^2 + \sum_{jk} g(W_j x^k).
\end{eqnarray}
In other words, find the features which minimize the cost, while preserving the training data pretty well.  ``We use the term `reconstruction cost' for this smooth penalty because it corresponds to the reconstruction cost of a linear autoencoder, where the encoding weights and decoding weights are tied''.  Note that tying is not used in the LRM paper.  This makes it more similar to a standard autoencoder, as I've described elsewhere in my book.

``ICA's main distinction compared to sparse coding and autoencoders is its use of the hard orthonormality constraint in lieu of reconstruction costs.''  The basic idea in proving some kind of equivalence is to let $\lambda$ be large.  ``If the data is whitened, RICA is equivalent to ICA for undercomplete representations and $\lambda$ approaching infinity.''

From my point of view, the main thing here is simply the basic problem formulation: the function to minimize.  I'd like to think of this in a slightly more connectionist fashion.  Let me think back to the cost function.  Minimizing the first part means that we have weights which allow us to approximately reconstruct the training data.  Minimizing the second part acts more like an L1 constraint; roughly speaking, it's telling us to have only a few features active at a time.  So we have features which let us reconstruct, and we are likely to have only a few features active at a time.

Local receptive field TICA: ``[L]ocal receptive field neural networks are faster to optimize than their fully connected counterparts [because they have fewer parameters].  A major drawback of this approach, however, is the difficulty in enforcing orthogonality across partially overlapping patches.  [This becomes a severe constraint if we only overlap at a few points.]  We show that swapping out locally enforced orthogonality constraints with a global reconstruction cost solves this issue.  [I.e., we can forget about local orthogonality, and just worry about optimizing the cost.]''  It seems that they do this by minimizing the following function:
\begin{eqnarray}
\sum_k \| W^T W x^k - x^k\|^2 + \sum_{jk} \sqrt{\epsilon + H_j (Wx^k)^2}.
\end{eqnarray}
A few things: (1) $\lambda$ should presumably appear out the front of the first term; (2) They never explain $\epsilon$; (3) The $H_j$ are pooling matrices; (4) It's not clear what $(Wx^k)^2$ means --- presumably the elementwise square; (5) I don't see how $H_j (Wx^k)^2$ can be a scalar.  Perhaps $H_j$ is meant to be the $j$th \emph{row} of a pooling matrix, in which case the expression is a weighted sum of the squared responses, and is a scalar after all.

\section{Tenenbaum, Kemp, Griffiths, and Goodman (2011)}

\link{http://scholar.google.ca/scholar?cluster=2667398573353002097&hl=en&as_sdt=0,5}{(link)} A review of a particular approach to inductive learning.
They want to combine Bayesian learning with complex ways of representing knowledge.

Claims that there is strong evidence that children can learn to generalize their use of words from just a few examples.  This suggests that there must be some pretty clever underlying patterns to how we generalize.  ``A massive mismatch looms between information coming in through our senses and the outputs of cognition''.

Claims that we humans do reason (implicitly) in Bayesian ways about a number of things.  Mostly omits the evidence that we \emph{don't} in some important ways.  This omission bugs me.  They \emph{do} mention the fact that our conscious assessments of probability tend to be terrible, which is pleasing.  With that said, I'm not certain about this --- I just have the strong impression that there are well-known instances where we certainly don't reason in a Bayesian way.  It'd be good to have references.

``The biggest remaining obstacle is to understand how structured symbolic knowledge can be represented in neural circuits.''  Interesting.  I've often wondered exactly this.  They make the follow-up comment: ``Connectionist models sidestep these challenges by denying that brains actually encode such rich knowledge''.  That seems too strong to me, but there is some truth to it: the connectionists seem less interested than one might suppose in this question, perhaps believing that its solution should be deferred.

How would one go about solving this problem?  Actually, what would a solution, or even a better statement of the problem, look like?  Maybe we could encode entity-relationship triples?  In particular, let us suppose we want to encode $X Y Z$, where $X$ and $Z$ are entities, and $Y$ is the relationship.  One way of encoding this would be to have a neural network with nodes for each entity and for each relationship.  We'd try to design the network so that the only relationships which are active would be those which are true, given the active entities.

\section{Bengio (2012)}

\link{http://arxiv.org/abs/1206.5533}{(link)}

Notes that many of the recommendations haven't been proved; they're heuristics that have emerged out of experimentation.  ``A good indication of the need for such validation is that different researchers and research groups do not always agree on the practice of training neural networks''.

Claims that the optimal learning rate is usually close to the largest learning rate that does not cause divergence of the cost function.  Heuristic: start with a large learning rate, and if the cost function increases, start again with a learning rate that is three times smaller.

This can be automated by keeping track of the cost from epoch to epoch.  If the cost got \emph{larger} during an epoch, then decrease the learning rate by a factor of two, say.  If the cost got \emph{smaller}, then increase the learning rate by a factor of 1.1, say.  How well will that work?  I worry that we'll end up with a situation where we're mostly going back and forth between the learning rate being too high and too low, with not enough time to really learn anything.
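As a toy version of the automated scheme I just described (my own sketch, not something from the paper), the loop would look like:

\begin{verbatim}
def sgd_with_adaptive_rate(run_epoch, eta=0.1, epochs=30,
                           down=0.5, up=1.1):
    """Shrink the learning rate when an epoch's cost rose, grow it gently
    when the cost fell.  `run_epoch(eta)` does one epoch of mini-batch SGD
    at rate eta and returns the training cost afterwards."""
    prev_cost = float("inf")
    for _ in range(epochs):
        cost = run_epoch(eta)
        eta *= down if cost > prev_cost else up
        prev_cost = cost
    return eta
\end{verbatim}

Whether factors like 0.5 and 1.1 avoid the oscillation I'm worried about is exactly the thing that would need testing.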
1525 | 1526 | Larger mini-batches allow a modest increase in learning rate. I don't 1527 | understand the details of this. It'd be nice to have some heuristics. 1528 | Large mini-batches will certainly reduce stochastic error from the 1529 | sampling. Is that what's going on? Or is there some other reason? 1530 | 1531 | ``Because the gradient direction is not quite the right direction of 1532 | descent, there is no point in spending a lot of computation to 1533 | estimate it precisely for gradient descent.'' In other words, do 1534 | frequent rapid estimates rather than slow accurate computations. 1535 | 1536 | It seems to me that it'd be helpful to keep track of training examples 1537 | with markedly different gradients. Those are ones which we could 1538 | learn a lot from. There's an idea here, which is to \emph{identify 1539 | outliers} using the gradient. We should oversample from the 1540 | outliers. I'll bet that improves performance, if the right 1541 | oversampling rate is chosen. I've explored this idea further below. 1542 | 1543 | Bengio confirms that for large data sets, mini-batch stochastic 1544 | gradient descent is pretty much non-optional. 1545 | 1546 | The use of validation data to train hyper-learners, which learn 1547 | hyper-parameters for a learning algorithm. 1548 | 1549 | Comments that the initial learning rate is often the single most 1550 | important hyper-parameter. ``If there is only time to optimize one 1551 | hyper-parameter and one uses stochastic gradient descent, then this is 1552 | the hyper-parameter that is worth tuning.'' Also comments that 1553 | there's often little benefit to doing anything other than keeping the 1554 | learning rate constant. When doing otherwise, Bengio suggests a 1555 | strategy of keeping the learning rate constant for the first $\tau$ 1556 | steps, and then decreasing it as $1/ t$, where $t$ is the number of 1557 | steps. Note that this strategy is not the same as the (exponential) 1558 | automated strategy I describe above. Suggests setting $\tau$ by 1559 | waiting until the cost goes up. Also suggests setting multiple values 1560 | for the schedule, and seeing how they compare. 1561 | 1562 | Mini-batch size: between 1 and a few hundreds. Typical value of 32. 1563 | Notes that this mostly affects computation time, not the final value 1564 | of the cost. 1565 | 1566 | Number of epochs: Watch the validation error, and stop once we're 1567 | beginning to overfit. 1568 | 1569 | Momentum: smooth out gradient by taking an average of recent 1570 | gradients. 1571 | 1572 | Comments that increasing the number of hidden neurons in all layers 1573 | results in a quadratic increase in time. It's not clear to me why 1574 | that should be the case --- obviously there is a quadratic increase in 1575 | the number of weights, and so a quadratic increase in time per epoch. 1576 | But maybe it'll take a larger number of epochs to converge? 1577 | 1578 | ``[W]e found that using the same size for all layers worked generally 1579 | better or the same as using a decreasing size (pyramid-like) or 1580 | increasing size (upside down pyramid), but of course this may be data 1581 | dependent.'' 1582 | 1583 | I am surprised by this. It seems to contradict our ideas about 1584 | feature learning. It'd be good to look at Larochelle et al's results. 1585 | Perhaps it reflects the fact that \emph{more} high level concepts can 1586 | be formed out of the ``atoms'' of input than there are atoms. 
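Going back to the learning-rate schedule a few paragraphs above, my reading of it as code (the values of eta0 and tau here are placeholders of my own):

\begin{verbatim}
def learning_rate(t, eta0=0.01, tau=10000):
    """Hold the rate at eta0 for the first tau steps, then decay it as 1/t,
    i.e. eta_t = eta0 * tau / t for t > tau."""
    return eta0 if t <= tau else eta0 * tau / t
\end{verbatim}

The sum of these rates diverges, albeit only logarithmically, which becomes relevant below when Ciresan et al use a geometrically shrinking rate instead.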
``For most tasks that we worked on, [we] find that an overcomplete first hidden layer works better than an undercomplete one.''

It's not really clear why this is the case.  Again, it may be that it's because there are more high-level concepts than low-level ones.  Still, that seems to be at odds with my intuition about autoencoders.

States that this is particularly true for unsupervised learning.  That \emph{is} consistent with the idea that it's because there are many different abstractions possible, far more than basic features.

Claims that there is a ``clean Bayesian justification'' for regularization as the negative log-prior.  The discussion that follows is extremely interesting and I'm still sorting it out.  The picture that emerges seems to be that what we're doing when learning is some kind of maximum a posteriori estimation.  In particular, we start with some sort of prior in parameter space --- a Gaussian --- and then try to find the weights maximizing the probability of the parameters (weights), given the training data.  I need to unpack this still further: the regularization term is then the negative log of that prior.  For now I'll proceed, and then return to this later.

Normalization: Claims that we should normalize the regularization parameter by $B / T$, where $B$ is the mini-batch size, and $T$ is the number of training examples.  This is consonant with what I've observed.

Early stopping and L2 regularization: comments that these two are essentially equivalent, and that one may as well drop L2 regularization when engaged in early stopping.  I don't believe this.  The solution spaces will be completely different in the two cases.  I'm happy to believe that \emph{sometimes} they'll give the same result, but see no reason to believe that they'll always give the same outcome.

L1 regularization and feature selection: Comments that this strongly suppresses irrelevant weights.  Also comments that you may wish to consider doing both L1 and L2 regularization, with different regularization parameters.  That seems sensible to me.

Q: An alternative approach to choosing $\lambda$ is to regard it as an extra parameter beyond the weights, and to apply gradient descent to it as well.  How well would this work?  My first instinct is to think that it won't work --- that $\lambda$ will be driven to zero.  But upon more reflection things are more complicated than that.  It'd be interesting to know.

Sparsity: Increased sparsity can be compensated for by a larger number of hidden units.  A sparsity-inducing penalty can be viewed as a way of regularizing.  Note that it's no longer so easy to view this in the Bayesian framework.  Notes that the L1 penalty seems most natural, but is not often used.  Try to push the (mini-batch) average activation to a particular constant.

Neuron nonlinearity: Bengio notes that he's most often used the sigmoid, the tanh, $\max(0, a)$, and the hard tanh.  Interesting remark about the sigmoid not working well as the top layer of a deep supervised net without unsupervised pretraining.  Apparently it's okay for auto-encoders.
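For my own reference, the four nonlinearities as I understand them (hard tanh being the piecewise-linear clip to $[-1, 1]$):

\begin{verbatim}
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):
    return np.tanh(a)

def rectifier(a):       # the max(0, a) unit
    return np.maximum(0.0, a)

def hard_tanh(a):       # linear on [-1, 1], clipped outside
    return np.clip(a, -1.0, 1.0)
\end{verbatim}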
Weight initialization: Sample uniformly on $[-r, r]$, with $r = 4\sqrt{6/(\mbox{fan-in}+\mbox{fan-out})}$.  This will give us a total length equal to roughly the number of layers.

Hyper-parameter selection as an optimization problem: points out the dangers of overfitting your validation data.

Q: When does it make sense to say that we're overfitting?

Approach to parameter search: doing it logarithmically.

Q: Does it make sense to do gradient descent on just a subset of weights at a time?  I do wonder if that wouldn't sometimes yield better results.  Deep learning has something of this flavour.

\section{Bengio 2012}

Bengio, Courville, and Vincent: ``Representation Learning: A Review and New Perspectives'': http://arxiv.org/pdf/1206.5538v2.pdf.

``This paper reviews recent work in the area of unsupervised feature learning and joint training of deep learning, covering advances in probabilistic models, auto-encoders, manifold learning, and deep architectures.''  ``... much of the actual effort in deploying machine learning algorithms goes into the design of preprocessing pipelines and data transformations that result in a representation of the data that can support effective machine learning.''  While I know this last is true, I haven't actually had to do a whole lot of data cleaning myself, yet.  ``What makes one representation better than another?  Given an example, how should we compute its representation, i.e. perform feature extraction?  Also, what are appropriate objectives for learning good representations?''

``Speech was one of the early applications of neural networks, in particular convolutional (or time-delay) neural networks... Microsoft has released in 2012 a new version of their MAVIS... speech system based on deep learning''.

``Transfer learning is the ability of a learning algorithm to exploit commonalities between different learning tasks in order to share statistical strength, and \emph{transfer knowledge} across tasks.''  There are apparently competitions for transfer learning.  I wonder what sorts of problems are being attacked?  ``Of course, the case of jointly predicting outputs for many tasks or classes, i.e., performing \emph{multi-task} learning also enhances the advantages of representation learning algorithms''.

``Unfortunately,... most of these algorithms [SVM etc] only exploit the principle of \emph{local generalization}... they rely on examples to \emph{explicitly map out the wrinkles of the target function}.  Generalization is mostly achieved by a form of local interpolation between neighboring training examples... We advocate learning algorithms that are flexible and non-parametric, but do not rely exclusively on the smoothness assumption.  Instead, we propose to incorporate generic priors such as those enumerated above into representation-learning algorithms.''  This starts to get at the point of view that says that neural network architecture is all about figuring out how we generalize.  If there is a hierarchical structure in how to generalize well, that's what your network will need.  If not, it won't.
``Kernel machines are useful, but they depend on a 1707 | prior definition of a suitable similarity metric, or a feature space 1708 | in which naive similarity metrics suffice. We would like to use the 1709 | data, along with very generic priors, to discovery these features, or 1710 | equivalently, a similarity function.'' 1711 | 1712 | They make a really nice point about expressiveness. ``[H]ow many 1713 | parameters does [a model[ require compared to the number of input 1714 | regions (or configurations) it can distinguish?'' They argue that a 1715 | deep net can distinguish exponentially more regions than more 1716 | conventional approaches. 1717 | 1718 | 1719 | \section{Bottou (2012)} 1720 | 1721 | \link{http://leon.bottou.org/papers/bottou-tricks-2012}{(link)} 1722 | 1723 | Notes that there are theorems about the convergence time for batch 1724 | gradient descent (time is logarithmic in the eventual error), and for 1725 | second-order gradient descent. It's really not clear how valuable 1726 | such results are; I guess it's comforting that they exist. 1727 | 1728 | Notes that there are some powerful results about the convergence of 1729 | stochastic gradient descent, under conditions like $\sum \eta^2 < 1730 | \infty, \sum \eta = \infty$. Apparently the ``Robbins-Siegmund 1731 | theorem'' helps with convergence. The relevant paper is 1732 | \link{http://scholar.google.ca/scholar?cluster=509989913518206088\&hl=en\&as\_sdt=0,5}{here}. 1733 | 1734 | Monitor both the training cost and the validation error: Suggests 1735 | periodically evaluating the validation error during training, and 1736 | stopping training when it hasn't improved after some time. 1737 | 1738 | \section{Ciresan (2012)} 1739 | 1740 | \link{http://arxiv.org/abs/1003.0358}{link} This uses just straight-up 1741 | backprop to train a neural net --- no convolutional nets, no 1742 | pretraining, just online learning with backprop. The main tricks are 1743 | to use numerous deformed training images, and graphics cards to speed 1744 | up learning. Apparently, Simard et al used a single hidden layer with 1745 | 800 neurons to get an accuracy of 99.3 percent on MNIST. (It'd be 1746 | interesting to know whether they deformed the images?) 1747 | 1748 | The paper asks whether it was really true that the pre-training is 1749 | necessary? Can't you just train for a long time? And the answer 1750 | seems to be yes! 1751 | 1752 | They train online, using slightly deformed images, and claim that this 1753 | means they can use the whole MNIST set for validation. This seems 1754 | suspect to me --- it relies on the deformations being more or less 1755 | independent of how the network generalizes. Let's run with it, 1756 | however. 1757 | 1758 | They trained 5 networks, with 2 to 9 hidden layers each. From 1.34 to 1759 | 12.11 million free parameters. They have a variable learning rate 1760 | that shrinks by a constant factor after each epoch, from 0.001 down to 1761 | 0.000001. This seems absolutely crucial to their success. I'm a 1762 | little surprised by the use of the constant factor decrease, since 1763 | that will bound the ``total'' (so to speak) learning distance 1764 | travered, simply because the geometric sum converges. It seems like 1765 | you'd get better performance if you chose a learning schedule where 1766 | terms decreased more slowly, so the sum of the learning rates 1767 | diverged. 
That's true of the hyperbolic function advocated by Bengio
1768 | in his 2012 paper, whose sum will diverge (albeit, only
1769 | logarithmically).  They initialized weights uniformly at random in the
1770 | range -0.05 to 0.05 --- that's close to, but not the same as, the
1771 | $1/\sqrt{\mbox{fan-in}}$ that I've preferred.  They use a tanh
1772 | activation function.
1773 | 
1774 | They used a GPU to do computations.  It apparently sped the
1775 | deformation routine up by a factor of 10, and forwardprop and backprop
1776 | by a factor of 40!  That's a big improvement.
1777 | 
1778 | Typical architecture: 784-1000-500-10 neurons.  They get 0.44 percent
1779 | test error.  That's pretty close to perfect.  The most complex
1780 | architectures were: 784-2500-2000-1500-1000-500-10 and 784-9 x
1781 | 10000-10.  These get test errors of 0.32 and 0.43 percent,
1782 | respectively.  Interestingly, there seem to be some advantages to having
1783 | non-homogeneous numbers in the layers.
1784 | 
1785 | Took 93 CPU seconds to deform the MNIST images.  87 of those seconds
1786 | were for the elastic distortions, so that's what they ported to the
1787 | GPU.  When doing the port they converted MNIST images to 29 x 29
1788 | to get a proper center, which simplifies distortion.
1789 | 
1790 | \section{Ciresan (2012)}
1791 | 
1792 | \link{http://arxiv.org/pdf/1202.2745.pdf}{link}
1793 | 
1794 | Claims that their deep nets can match human performance on recognizing
1795 | handwritten digits and traffic signs.  ``Small (often minimal [what does
1796 | this mean?]) receptive fields of convolutional winner-take-all neurons
1797 | [?] yield large network depth, resulting in roughly as many sparsely
1798 | connected neural layers as found in mammals between retina and visual
1799 | cortex''.  They achieve better-than-human performance on a traffic
1800 | sign benchmark.
1801 | 
1802 | They claim records on MNIST, Latin letters, Chinese characters,
1803 | traffic signs, NORB, and CIFAR10.  ``We will show that properly
1804 | trained big and deep DNNs can outperform all previous methods, and
1805 | demonstrate that unsupervised initialization/pretraining is not
1806 | necessary (although we don't deny that it might help sometimes,
1807 | especially for small datasets).''  Again, we're back to this
1808 | fundamental question: how much does pretraining help?  How necessary
1809 | is it?
1810 | 
1811 | They use winner-take-all neurons.  It occurs to me that this has some
1812 | similarity to sparsity constraints.  Same?  Again, they were inspired
1813 | by Hubel and Wiesel --- simple cells (orientation), and complex cells
1814 | (basically, pooling).
1815 | 
1816 | Very similar architecture to the KSH paper --- convolutional, max
1817 | pooling, convolutional, max pooling, fully connected, fully connected.
1818 | 
1819 | They use this multi-columnar architecture, along lines I've seen
1820 | before in their papers.  They make strong claims that this helps a
1821 | lot.  Worth understanding.  The general idea seems to be to train
1822 | several networks with slightly different training setups, and then to average.
1823 | 
1824 | Scaled tanh for conv and fully conn layers.  Linear activation for
1825 | max-pooling, and softmax for output.  They use an annealed learning
1826 | rate.  They use translations, scaling and rotation during training.
1827 | They use a very simple initial weight distribution: uniform on [-0.05,
1828 | 0.05]!  This really surprises me.
1829 | 
1830 | They train for 800 epochs.
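Just to check the point above about the total summed learning rate, a quick sketch (the 0.001 starting rate and the 800 epochs come from the notes above; the 0.993 per-epoch factor and the time constant in the hyperbolic schedule are illustrative choices, not theirs):

\begin{verbatim}
# Total "distance budget" sum(eta_t) for a geometric schedule versus a
# hyperbolic (1/t-type) schedule of the kind Bengio advocates.
eta0, decay, epochs = 0.001, 0.993, 800

geometric  = [eta0 * decay**t for t in range(epochs)]
hyperbolic = [eta0 / (1.0 + t / 100.0) for t in range(epochs)]  # tau = 100, arbitrary

print(sum(geometric))   # bounded above by eta0/(1-decay) ~ 0.14, however long we train
print(sum(hyperbolic))  # keeps growing (logarithmically) as epochs increases
\end{verbatim}

So with the geometric schedule the summed step sizes saturate, which is the sense in which the total learning distance travelled is bounded.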
1831 | 
1832 | Chinese character recognition: they have 3755 classes, and just 240
1833 | samples per class, and they achieve an error rate of 6.5 percent, which
1834 | is a big improvement over the old record of 10.01 percent.
1835 | 
1836 | The traffic sign results are fascinating.  They use the GTSRB traffic
1837 | sign dataset --- the German Traffic Sign Benchmark.  They do some
1838 | preprocessing, and then apply their deep network.  They get an error
1839 | rate of 0.54 percent on the test set, which is apparently about a
1840 | factor of two lower than humans.  I can see why it's tough --- a lot
1841 | of the images are difficult to see well.  If it rejects the 6.67
1842 | percent of images about which it is least confident, then the system
1843 | makes only a single misclassification (0.01 percent error rate).
1844 | 
1845 | Typical learning schedule (MNIST): 0.001 initialization, decays by
1846 | factor 0.993 after each epoch.
1847 | 
1848 | 
1849 | \section{Domingos (2012)}
1850 | 
1851 | \link{http://scholar.google.ca/scholar?cluster=4404716649035182981\&hl=en\&as\_sdt=0,5}{link}
1852 | 
1853 | He points out that we don't have access to the function we really want
1854 | to optimize, unlike in most optimization problems.  Instead we use
1855 | training error as a proxy for test error.  That's a very interesting
1856 | and strange situation.
1857 | 
1858 | ``Learners combine knowledge with data to grow programs.''
1859 | 
1860 | Overfitting has many faces: ``the bugbear of machine learning''; ``it
1861 | comes in many forms that are not immediately obvious''.
1862 | Generalization error can be decomposed into bias and variance.  Bias
1863 | is the tendency to keep learning the same wrong things.  Variance is the
1864 | tendency to learn random things.  E.g., an SVM (without kernel) may
1865 | have high bias if the data is nowhere close to linearly separable.
1866 | Cross-validation can itself start to overfit.
1867 | 
1868 | Intuition fails in high dimensions: I don't think this is quite right.
1869 | It would be better to say that it needs to be replaced in high
1870 | dimensions.
1871 | 
1872 | Theoretical guarantees are not what they seem: Points out that there
1873 | are effectively guarantees that can (with caveats) be put on
1874 | induction.  Very interesting.  It'd be good to understand this in
1875 | conjunction with the no-free-lunch theorems.
1876 | 
1877 | Feature engineering is the key: Points out that the ``machine
1878 | learning'' part of a machine learning project may be tiny.  More time
1879 | is spent gathering data, cleaning it, and figuring out good input
1880 | features.
1881 | 
1882 | More data beats a cleverer algorithm: ``As a rule, it pays to try the
1883 | simplest learners first''.  ``... the organizations that make the most
1884 | of machine learning are those that have in place an infrastructure
1885 | that makes experimenting with many different learners, data sources
1886 | and learning problems easy and efficient, and where there is a close
1887 | collaboration between machine learning experts and application domain
1888 | ones.''
1889 | 
1890 | Representable does not imply learnable: in other words, don't focus
1891 | all your attention on one representation (say, neural nets, or SVMs)
1892 | merely because there is some kind of universality theorem for them.
1893 | 
1894 | Correlation does not imply causation: Keep it in mind when
1895 | interpreting the results of machine learning algorithms.
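For the record, the standard squared-loss version of the bias-variance decomposition mentioned above: writing $\hat{f}$ for the learned predictor (random, because it depends on the random training set) and $y = f(x) + \epsilon$ with noise variance $\sigma^2$,
\begin{eqnarray}
  E[(y - \hat{f}(x))^2] & = & (f(x) - E[\hat{f}(x)])^2 + E[(\hat{f}(x) - E[\hat{f}(x)])^2] + \sigma^2,
\end{eqnarray}
i.e., squared bias plus variance plus irreducible noise.  (This is the squared-loss version; for other losses the decomposition is messier, so treat this as the cartoon.)  The linear-SVM example above is then a high-bias failure: no amount of extra data moves the average predictor onto a target that is far from linearly separable.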
1896 | 1897 | \section{Hinton 2012 --- Coursera} 1898 | 1899 | \textbf{Lecture 5 b: Object recognition:} If you want to solve 1900 | computer vision, it may help to find features that are invariant under 1901 | things like rotation, translation, and so on. Example: parallel lines 1902 | with a red dot between them. This is invariant under rotation and 1903 | translation, but may actually be quite a useful feature. I guess I 1904 | can imagine similar features being use to recognize an eye. 1905 | Relationship between features may themselves be captured by other 1906 | features. The idea of normalizing an image: once normalized, it may 1907 | be easier to extract features. Of course, that then requires us to 1908 | solve the problem: how to normalize? (Hinton claims, without 1909 | presenting anything so gauche as actual evidence, that we don't 1910 | mentally rotate images to recognize them.) One approach to 1911 | normalization: brute force approach, trying all possible boxes, in a 1912 | wide range of positions and scales. 1913 | 1914 | \textbf{Lecture 5c: Convolutional neural networks for handwriting 1915 | recognition:} Early example of deep neural nets, from the 1980s. 1916 | The idea is to \emph{replicate features}. So an edge is a good 1917 | feature --- and if it's a good feature at one point in the visual 1918 | field, then it's probably a good feature at other points in the visual 1919 | field. Put another way, a feature detector that's useful at some 1920 | point in the visual field is likely to be useful elsewhere, too. 1921 | Replication across position reduces the number of parameters to be 1922 | learned. It's easy to learn replicated features with backpropagation. 1923 | I guess we just constrain the weights to be the same. So we want 1924 | $\Delta w_1 = \Delta w_2$. We just average the gradients across 1925 | partial derivatives. An advantage is that if we can learn to detect a 1926 | feature in one place, then we learn how to detect it in other places. 1927 | Hinton advocates against rotational or scale invariance. I don't know 1928 | if that's a good idea, frankly --- it seems to me that with modern 1929 | computers that may be practical. The idea of pooling adjacent 1930 | replicated features. Hinton advocates either averaging or the max (he 1931 | says max is a little better). LeNet was used to read something like 1932 | 10 percent of all checks in North America, according to Hinton. 1933 | There's still a frontier associated to MNIST, and it may be worth 1934 | trying to push that frontier. The idea of generating synthetic data 1935 | (in part to reduce overfitting). McNemar test. 1936 | 1937 | \textbf{Lecture 5d: Convolutional neural networks for object 1938 | recognition:} Apparently most people doing vision with neural nets 1939 | have switched to using rectified linear activation function, not just 1940 | a sigma function. A good paper on this appears to be ``Deep Sparse 1941 | Rectifier Neural Networks'' (Bengio et al). Use left-right reflection 1942 | of images to get more training data. And use image subsets to get 1943 | more training data. Uses GPUs: 500 cores per GPU, very fast at 1944 | matrix-by-matrix arithmetic, very high bandwidth to memory. 1945 | 1946 | \textbf{Lecture 6a: stochastic mini-batch gradient descent:} Hinton 1947 | calls this the most frequently used algorithm for training neural 1948 | networks. He says it's often preferable even to techniques from the 1949 | optimization community. 
How to choose a learning rate: if the error 1950 | keeps getting worse or oscillates wildly, reduce the learning rate. 1951 | If the error is falling slowly, increase the learning rate. Do this 1952 | all automatically. 1953 | 1954 | \textbf{Lecture 15a:} PCA: Lots of data in a very high-dimensional 1955 | space. But maybe there's a low-dimensional manifold on which most of 1956 | the data lies. In some sense that manifold captures much of the 1957 | structure in the data. What we want is a projector onto a 1958 | lower-dimensional subspace. Suppose $x_1, x_2, \ldots, x_m$ are our 1959 | data points. Obvious idea is to stick . 1960 | 1961 | 1962 | \section{Hinton (2012) - videos} 1963 | 1964 | \link{https://www.ipam.ucla.edu/schedule.aspx?pc=gss2012}{IPAM Summer School videos} 1965 | 1966 | Attributes backprop to \link{http://www.werbos.com/}{Paul Werbos}. It 1967 | was done in his 1974 PhD thesis. Hinton also lists several others, 1968 | including Amari, Parker, and LeCun. Points out that deep learning 1969 | didn't work well, except in time delay and convolutional networks. 1970 | Says that part of the reason Werbos was ignored was because he was 1971 | applying backprop to econometrics, where it was hard to see the value. 1972 | 1973 | Why deep learning is feasible today: He starts with simple raw speed, 1974 | not pre-training, interestingly enough. He also says that there's 1975 | been a ``small improvement in the theory''. Says the biggest 1976 | disappointment with backprop was that it didn't work with recurrent 1977 | neural nets. I don't understand how this squares with Williams and 1978 | Zipser. 1979 | 1980 | ``On the whole backpropagation fell out of favour because it failed to 1981 | be able to learn multiple layers of features.'' Says that 1982 | convolutional nets were the only ones where deep learning worked. 1983 | 1984 | ``Almost everything I used to believe about backpropagation is 1985 | wrong.'' 1986 | 1987 | ``What is wrong with back-propagation? It requires labeled training 1988 | data. [Well, no, not if you use ideas like autoencoders.]'' 1989 | 1990 | Why is the learning time slow in deep nets? If you use the right 1991 | scales for the weights, you can do some of this learning much faster. 1992 | 1993 | He's strongly emphasizing the unsupervised learning / feature learning 1994 | point of view. Basically, pretraining to initialize, and then 1995 | fine-tuning with labelled data and backprop. 1996 | 1997 | ``You can get a lot of knowledge into the network by messing with the 1998 | training data.'' Analogizes to education. 1999 | 2000 | On the advantages of generative models: learn $p({\rm image})$, not 2001 | $p({\rm label | image})$. ``If you want to do computer vision, first 2002 | learn computer graphics.'' I think that overstates the case, but 2003 | there's something to it. Reminiscent of the idea that learning is 2004 | really memory. 2005 | 2006 | \textbf{Belief nets:} A directed acyclic graph of stochastic 2007 | variables. We learn the values of some variables. We'd like to infer 2008 | the states of the other variables. And we'd like to adjust the 2009 | interactions between variables to make the network more likely to 2010 | generate the observed data. (Obviously, this is all very close to 2011 | Pearl's causal models.) 2012 | 2013 | Neat point about stochastic models: it lowers the communication cost 2014 | in distributed models. Send 1 bit instead of 32 or 64 bit float. 
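Going back to the Lecture 6a rule for adjusting the learning rate automatically, a minimal sketch of how one might automate it (the 0.5 and 1.1 factors and the thresholds are my own guesses, purely for illustration):

\begin{verbatim}
def adjust_learning_rate(lr, errors, shrink=0.5, grow=1.1):
    """Crude version of the Lecture 6a rule: shrink the rate if the error
    got worse or is oscillating, grow it gently if it is only falling
    slowly.  All constants here are placeholders."""
    if len(errors) < 3:
        return lr
    last, prev, prev2 = errors[-1], errors[-2], errors[-3]
    oscillating = (last - prev) * (prev - prev2) < 0
    if last > prev or oscillating:
        return lr * shrink
    if prev - last < 0.01 * prev:      # improving, but slowly
        return lr * grow
    return lr
\end{verbatim}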
2015 | 2016 | Points out that while it's easy to generate examples at the leaf 2017 | nodes, it's hard to infer causes. Yet that's exactly what we want to 2018 | do. 2019 | 2020 | Suppose we observe some output data. Let's suppose we can sample the 2021 | hidden states in an unbiased fashion. Now update the weight by 2022 | $\Delta w_{ji} = \eta s_j (s_i-p_i)$. Note that here, $j$ is a 2023 | parent, and $i$ is a child node. This is more or less Hebb's rule. 2024 | ``Nice local learning rule''. 2025 | 2026 | Monte Carlo methods: painfully slow for large, deep models. Can be 2027 | used to sample from the posterior. Variational methods are much 2028 | faster. You get the wrong result, but it's bounded away from the 2029 | right result. ``Inferring the wrong posterior and then doing learning 2030 | anyway''. 2031 | 2032 | RBMs: The feature detectors are genuinely independent (given the 2033 | data). The posterior distribution is easy to sample from. The 2034 | partition function makes learning difficult. He derives a nice quick 2035 | way to learn an RBM: $\Delta w_{ij} = \eta *$ a difference of 2036 | averages of correlations between visible and hidden units. 2037 | 2038 | \section{Krizhevsky (2012)} 2039 | 2040 | \link{http://www.cs.toronto.edu/\~hinton/absps/imagenet.pdf}{link} 1.2 2041 | million images in ImageNet 2010. 1000 classes. 650,000 neurons. 2042 | Five convolutional layers. Max pooling layers. Three fully-connected 2043 | layers. 1000-way softmax. Used dropout to prevent overfitting. 2044 | 2045 | Past image data sets: NORB, Caltech-101/256. CIFAR-10/100. ``Simple 2046 | recognition tasks can be solved quite well with datasets of this size, 2047 | especially if they are augmented with label-preserving 2048 | transformations.'' ``But objects in realistic settings exhibit 2049 | considerable variability, so to learn to recognize them it is 2050 | necessary to use much larger training sets''. LabelMe: hundreds of 2051 | thousands of fully-segmented images. ImageNet: 15 million labeled 2052 | high-res images in over 22,000 categories. 2053 | 2054 | This paper: trained a very large convolutional neural net on subsets 2055 | of ImageNet used in two competitions. Got by far the best results 2056 | ever reported on those data sets. Removing any convolutional layer 2057 | significantly decreased performance. ``All of our experiments suggest 2058 | that our results can be improved simply by waiting for faster GPUs and 2059 | bigger datasets to become available.'' 2060 | 2061 | ImageNet: 15 million images, 22,000 categories. ILSVRC: 1000 images 2062 | in 1000 categories. 1.2 million training images, 50,000 validation 2063 | images, and 150,000 testing images. ILSVRC-2010: test set labels are 2064 | available. Top-5 error rate: the fraction of test images for which 2065 | the correct label is not among the five labels considered most 2066 | probable by the model. 2067 | 2068 | ImageNet has variable-resolution images. They down-sampled to 256 2069 | $\times$ 256. They did this by rescaling the image so the shorter 2070 | side was of length 256. Then cropped out the central 256 $\times$ 256 2071 | patch. They also subtracted the mean activity over the training set 2072 | from each pixel. This was the complete pre-processing. 2073 | 2074 | Architecture: 8 layers. 5 convolutional. 3 fully-connected. 2075 | 2076 | ReLU Nonlinearity: Instead of sigmoid function they used $f(z) = 2077 | \max(z, 0)$. They refer to this as a \emph{rectified linear} unit. 
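A toy comparison of the two activation functions and their gradients, just to make the saturating versus non-saturating point concrete (the input values are arbitrary):

\begin{verbatim}
import numpy as np

z = np.array([-10.0, -1.0, 0.5, 10.0])

sigmoid = 1.0 / (1.0 + np.exp(-z))
relu    = np.maximum(z, 0.0)

sigmoid_grad = sigmoid * (1.0 - sigmoid)   # ~4.5e-5 at |z| = 10: saturated
relu_grad    = (z > 0).astype(float)       # exactly 1 for any positive z
\end{verbatim}

The quote that follows about faster training is presumably, at least in part, about exactly this: the ReLU's gradient doesn't vanish for large positive inputs.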
2078 | ``Deep convolutional neural networks with ReLUs train several times 2079 | faster than their equivalents with tanh units''. I believe it is 2080 | standard wisdom that convolutional nets work better with tanh units 2081 | than sigmoid. ``The magnitude of the effect [faster learning...] 2082 | varies with network architecture, but networks with ReLUs consistently 2083 | learn several times faster than equivalents with saturating neurons''. 2084 | 2085 | Training on multiple GPUs: Done in part because the training set 2086 | wouldn't fit into a single GPU's memory. 2087 | 2088 | Local response normalization: They do a local normalization step, 2089 | essentially a kind of brightness normalization. It reduces error 2090 | rates by a little over 1 percent. 2091 | 2092 | Overlapping pooling: Again, a slight improvement. 2093 | 2094 | Architecture: The first convolutional layer filters the 224 by 224 by 2095 | 3 image with 96 kernels of size 11 by 11 by 3. There is a stride 2096 | distance of 4, i.e., the distance between the receptive field centers 2097 | of neighbouring neurons. I need to understand quite a bit more about 2098 | CNNs and pooling. 2099 | 2100 | Lots of overfitting: 1.2 million examples, 10 bits of info per example 2101 | (1 in 1000 classification). But 60 million parameters. So 2102 | overfitting is a real problem. 2103 | 2104 | Data augmentation: (1) image translations and horizontal reflections. 2105 | Extracting 224 by 224 patches. This gives them a factor 2048 more 2106 | training data. The network makes a prediction by extracting five 224 2107 | by 224 patches and their horizontal reflections, and averaging the 2108 | predictions made by the network's softmax layer. (2) Altering the 2109 | intensities of the RGB channels in the training images. Perform PCA 2110 | on ImageNet and use it to modify the images. ``This scheme 2111 | approximately captures an important property of natural images, 2112 | namely, that object identity is invariant to changes in the intensity 2113 | and color of the illumination.'' 2114 | 2115 | Dropout: ``a very efficient version of model combination that only 2116 | cost about a factor of two during training''. Set to zero the output 2117 | of each hidden neuron with probability 0.5. Don't contribute to 2118 | forwardprop nor to backprop. Every time the network is trained it has 2119 | a different architecture, but the architectures share weights. ``This 2120 | technique reduces complex co-adaptation so neurons, since a neuron 2121 | cannot rely on the presence of other neurons''. This is going to be 2122 | useful in very large networks with a relative paucity of data. 2123 | ``Without dropout, our network exhibits substantial overfitting. 2124 | Dropout roughly doubles the number of iterations required to 2125 | converge.'' 2126 | 2127 | Used SGD with momentum. ``We used an equal learning rate for all 2128 | layers, which we adjusted manually throughout training. The heuristic 2129 | which we followed was to divide the learning rate by 10 when the 2130 | validation error rate stopped improving with the current learning 2131 | rate. The learning rate was initialized at 0.01 and reduced three 2132 | times prior to termination.'' That seems like a useful heuristic. 2133 | 2134 | Results: top-1 test set error rate: 37.5 percent. top-5 test set 2135 | error rate: 17.0 percent. That seems incredibly good, although not 2136 | human comparable. They also report a bunch of other results: every 2137 | single one is very, very good. 
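Before moving on: the divide-by-ten heuristic quoted above seems worth keeping around.  A minimal sketch of it (the patience window and the floor on the rate are my own choices, not Krizhevsky et al's):

\begin{verbatim}
def step_learning_rate(lr, val_errors, patience=5, factor=10.0, min_lr=1e-5):
    """Divide the learning rate by `factor` when the validation error has
    not improved over the last `patience` evaluations, in the spirit of
    the heuristic described above."""
    if len(val_errors) <= patience:
        return lr
    recent_best  = min(val_errors[-patience:])
    earlier_best = min(val_errors[:-patience])
    if recent_best >= earlier_best:            # no recent improvement
        return max(lr / factor, min_lr)
    return lr
\end{verbatim}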
2138 | 2139 | \section{Le (2012)} 2140 | 2141 | \link{http://ai.stanford.edu/\~ang/papers/icml12-HighLevelFeaturesUsingUnsupervisedLearning.pdf}{Building 2142 | high-level features using large-scale unsupervised learning} 2143 | 2144 | I have described the architecture elsewhere. Let me describe some of 2145 | the results. They took 13,026 faces from Labeled Faces in The Wild, 2146 | and about 24,000 distractor objects from ImageNet. I'm not especially 2147 | keen on this procedure --- it seems like there might be very easy ways 2148 | to distinguish the two data sets that have little to do with whether 2149 | or not a face is present. 2150 | 2151 | ``After training, we used this test set to measure the performance of 2152 | each neuron in classifying faces against distractors. For each 2153 | neuron, we found its maximum and minimum activation thresholds, then 2154 | picked 20 equally spaced thresholds in between. The reported accuracy 2155 | is the best classification accuracy among 20 thresholds.'' There's a 2156 | lot that's not being said here. Which neurons are we considering? 2157 | Every neuron in the network? Or just in the last layer? And it's not 2158 | actually stated that higher activations are the right criterion (as 2159 | opposed to lower). However, I think I can reasonably infer that's the 2160 | case, because of the pooling. 2161 | 2162 | ``The best neuron in the network achieves 81.7 percent accuracy in 2163 | detecting faces''. That's compared to a guessing strategy, which 2164 | achieves 64.8 percent. They found that removing the local contrast 2165 | normalization reduced this number to 78.5 percent. 2166 | 2167 | The performed a numerical optimization to find the optimal stimulus. 2168 | This really is quite striking: it's definitely a face! 2169 | 2170 | They did a very interesting control experiment, removing the faces 2171 | from the unlabelled training data, using OpenCV. The recognition 2172 | accuracy of the best neuron dropped to 72.5 percent. So it's not just 2173 | that we're detecting ImageNet versus Labeled Faces in the Wild. 2174 | 2175 | Invariance properties: They used 10 face images and did some scaling, 2176 | rotation, x and y translations. The face feature detector still 2177 | worked pretty well for rotation, not so well for the other operations. 2178 | Still, it's interesting that this is possible at all, especially since 2179 | rotation and scaling are going to be hard to build into the network. 2180 | 2181 | \textbf{Other feature detectors:} They repeated the face detector 2182 | work, but with cats and human body parts. They got 74.8 \% and 76.7 2183 | \%, respectively. The data sets were constructed so that random 2184 | guessing would give 64.8 percent, as for the faces. They also tried 2185 | some deep autoencoder experiments, and found that while there were 2186 | selective neurons in those networks, they weren't nearly as good as 2187 | with the muli-RICA-layer architecture. 2188 | 2189 | \textbf{ImageNet:} On the 2011 data set --- 16 million images, 20,000 2190 | categories --- they achieved 15.8 percent accuracy, a huge jump over 2191 | the best (9.3 percent) results. 2192 | 2193 | \section{Le (2012), video} 2194 | 2195 | ``Tera-scale deep learning'': \link{http://vimeo.com/52332329}{link} 2196 | 2197 | Problems with standard hand-crafted features: the features may not 2198 | generalize to another domain; the features take a long time to 2199 | develop. (Recall Hinton: the time of hand-engineered features is 2200 | over.) 
``We're still stuck at SIFT and HOG''. 2201 | 2202 | RICA: Built on TICA (topographic independent component analysis). We 2203 | have some data, $x^i$. Take a 3 by 2204 | 2205 | RICA can learn from any (unlabelled) data. Can learn features from 2206 | videos to do action recognition. E.g. ``Get out of car''. ``Eat''. 2207 | And so on. Very interesting features! SIFT / HoG is also used for 2208 | video, apparently. 2209 | 2210 | Four most famous activity recognition data sets: KTH. 2211 | Hollywood2. UCF. YouTube. They outperform SIFT / HoG on four 2212 | best-known data sets. 2213 | 2214 | On cancer / MRI: our usual intuition breaks down. It really helps to 2215 | be able to automatically discover features. 2216 | 2217 | ``Scaling up deep RICA'': This is the way to think about what the 2218 | Google-Stanford paper does. 2219 | 2220 | ``Using a thousand machines alone is not enough.'' They needed to 2221 | change their algorithms. 2222 | 2223 | ``Higher layer [i.e., later features] are very difficult to 2224 | visualize.'' I don't understand why that is the case. 2225 | 2226 | Two main ideas in scaling up: local connectivity; asynchronous SGD. 2227 | 2228 | 1 billion parameters. I wonder how they avoid overfitting? 2229 | 2230 | They pick out a neuron in the top layer. They look to see: which 2231 | images in the test (?) set stimulate that neuron the most? And then 2232 | they do a numerical optimization to figure out what the optimal input 2233 | stimulus is. 2234 | 2235 | Classify ``sting ray'' versus ``manta ray'' in ImageNet. 2236 | 2237 | \section{Ng (2012) --- video} 2238 | 2239 | \link{https://www.ipam.ucla.edu/schedule.aspx?pc=gss2012}{Ng's 2240 | contribution to 2012 IPAM workshop} 2241 | 2242 | ``Instead of doing AI, we ended up spending our lives doing curve 2243 | fitting.'' 2244 | 2245 | Has a nice example of building a motorcycle recognizer by building a 2246 | feature classifier to determine whether there are wheels or 2247 | handlebars, and then adding an extra classifier layer. There's 2248 | actually a general story here: we can recursively do this. 2249 | 2250 | He makes stronger claims about the brain rewiring (e.g., in ferrets) 2251 | than I've heard before. Says that it's been done in four animal 2252 | species. (I wasn't immediately able to find this using Google 2253 | Scholar, it'd be interesting to see sources.) Also says that it's 2254 | vision in every sense that he understands what vision means. I must 2255 | admit I didn't get that out of Sur's original paper --- it seemed like 2256 | a much coarser sense of vision. It'd be interesting to know if 2257 | finer-grained tests have since been done. 2258 | 2259 | ``The complexity of the trained algorithm comes from the data, not the 2260 | algorithm.'' 2261 | 2262 | Distinguishes semi-supervised versus self-taught learning 2263 | (unsupervised feature learning). The problem with semi-supervised 2264 | learning is that it requires some broad constraints on classses --- 2265 | e.g., we need all images to be of cars or motorcycles. Self-taught 2266 | learning is much easier. 2267 | 2268 | Sparse coding: Learns a dictionary of basis functions so that each 2269 | training image can be decomposed sparsely in terms of basis functions. 2270 | (Use a ``sparsity penalty term''.) If you train sparse coding on 2271 | natural images you get edge detectors. 
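The objective being described is, as I understand it, the standard sparse coding one (this is the textbook form, not necessarily Ng's exact formulation):
\begin{eqnarray}
  \min_{D, a^{(1)}, \ldots, a^{(m)}} \sum_i \| x^{(i)} - D a^{(i)} \|^2
  + \lambda \sum_i \| a^{(i)} \|_1,
\end{eqnarray}
with the columns of the dictionary $D$ constrained to have bounded norm, so the penalty can't be dodged just by rescaling.  The $\|a\|_1$ term is the ``sparsity penalty term'' mentioned above, and the learned columns of $D$ are the edge-like basis functions.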
``It's more useful to know
2272 | where the edges are in an image than where the pixels are.''  ``It
2273 | gives us an alternate way of representing the image.''  ``ICA version
2274 | of sparse coding.''  Recursively do sparse coding.
2275 | 
2276 | Sparse deep belief network (Honglak Lee).  First layer: edges.  Second
2277 | layer: models of object parts.  Third layer: object models.
2278 | 
2279 | Ng believes it is mostly scalability that is the issue.  More
2280 | features, more data => better results.  What appears superficially to
2281 | be an algorithmic superiority is really about the availability of more
2282 | data, or more computational power that allows more features to be
2283 | learned.
2284 | 
2285 | Should we use higher-order algorithms?  Conjugate gradient?  L-BFGS.
2286 | Gradient descent with line search.  Black-box algorithms which will
2287 | take the gradient and cost and just work.  Ng's favourite: L-BFGS.
2288 | 
2289 | ``The most reliable indicator of whether [a new grad student] has got
2290 | gradient descent to work is whether they do gradient checking.''  The
2291 | problem: buggy implementations will learn.  Just not as well as a
2292 | correct implementation.
2293 | 
2294 | \section{Sutskever (2013)}
2295 | 
2296 | \link{http://www.cs.utoronto.ca/\~ilya/pubs/2013/1051_2.pdf}{On the
2297 | importance of initialization and momentum in deep learning}
2298 | 
2299 | ``Deep and recurrent neural networks... are powerful models that were
2300 | considered to be almost impossible to train using stochastic gradient
2301 | descent with momentum.  In this paper, we show that when stochastic
2302 | gradient descent with momentum uses a well-designed random
2303 | initialization and a particular type of slowly increasing schedule for
2304 | the momentum parameters, it can train both... to levels of performance
2305 | that were previously achievable only with Hessian-Free optimization.
2306 | We find that both the initialization and the momentum are crucial
2307 | since poorly initialized networks cannot be trained with momentum and
2308 | well-initialized networks perform markedly worse when the momentum is
2309 | absent or poorly tuned.  Our success training these models suggests
2310 | that previous attempts to train deep and recurrent neural networks
2311 | from random initializations have likely failed due to poor
2312 | initialization schemes.''  In other words, we can train deep neural
2313 | nets with (momentum-based) stochastic gradient descent,
2314 | \emph{provided} we're careful about how we initialize the weights, and
2315 | provided we do the appropriate things with the momentum.
2316 | 
2317 | ``Martens (2010) attracted considerable attention by showing
2318 | that... Hessian-free Optimization... is capable of training [deep
2319 | neural nets] from certain random initializations without the use of
2320 | pre-training, and can achieve lower errors for the various
2321 | auto-encoding tasks considered by Hinton and Salakhutdinov (2006).''
2322 | 
2323 | The picture that is starting to appear: Overall achievement = Quality
2324 | of algorithm + quantity of data + number of features + amount of
2325 | computing time.
2326 | 
2327 | ``The first contribution of this paper is a much more thorough
2328 | investigation of the difficulty of training deep and temporal networks
2329 | than has been previously done...
We show that while a definite
2330 | performance gap seems to exist between plain SGD and HF on certain
2331 | deep and temporal learning problems, this gap can be eliminated or
2332 | nearly eliminated... by careful use of classical momentum methods or
2333 | Nesterov's accelerated gradient.''
2334 | 
2335 | Apparently Polyak introduced the momentum technique, and obtained some
2336 | results on how much faster it can be than position-based techniques.
2337 | 
2338 | NAG: Nesterov's Accelerated Gradient.  ``[F]or general smooth
2339 | (non-strongly) convex functions and a deterministic gradient, NAG
2340 | achieves a global convergence rate of $O(1/T^2)$ (versus the $O(1/T)$
2341 | of gradient descent), with constant proportional to the Lipschitz
2342 | coefficient of the derivative and the squared Euclidean distance to
2343 | the solution.''  I don't know what $T$ is here (presumably the number of iterations).  NAG turns out to be a
2344 | variation on the momentum method, with the only difference being that
2345 | we compute the gradient at the updated position.  ``While the
2346 | classical convergence theories for both methods [NAG and momentum]
2347 | rely on noiseless gradient estimates (i.e., not stochastic), with some
2348 | care in practice they are both applicable to the stochastic setting.''
2349 | ``However, the theory predicts that any advantages in terms of
2350 | asymptotic local rate of convergence will be lost... a result also
2351 | confirmed in experiments... For these reasons, interest in momentum
2352 | methods diminished after they had received substantial attention in
2353 | the 90's.  And because of this apparent incompatibility with
2354 | stochastic optimization, some authors even discourage using momentum
2355 | or downplay its potential advantages''
2356 | 
2357 | The key point seems to be to separate out two timescales.  One is the
2358 | initial transient phase, when we're still hopping between regions of
2359 | different local minima, before the phase of fine local convergence.
2360 | ``[I]n practice, the `transient phase'... seems to matter a whole lot
2361 | more for optimizing deep neural networks.  In this transient phase of
2362 | learning, directions of reduction in the objective tend to persist
2363 | across many successive gradient estimates and are not completely
2364 | swamped by noise.''  ``Thus, for convex objectives, momentum-based
2365 | methods will outperform SGD in the early or transient stages of the
2366 | optimization where $L/T$ is the dominant term.''  Here, $L$ is the
2367 | Lipschitz coefficient of the gradient.
2368 | 
2369 | Why NAG works: ``This benign-looking difference seems to allow NAG to
2370 | change $v$ in a quicker and more responsive way, letting it behave
2371 | more stably than CM [classical momentum] in many situations,
2372 | especially for higher values of $\mu$.  Indeed, consider the situation
2373 | where the addition of $\mu v_t$ results in an immediate undesirable
2374 | increase in the objective $f$.  The gradient correction to the
2375 | velocity $v_t$ is computed at position $\theta_t + \mu v_t$ and if
2376 | $\mu v_t$ is indeed a poor update, then [the gradient at the new
2377 | position] will point back toward $\theta_t$ more strongly than [the
2378 | gradient at the old position], thus providing a larger and more timely
2379 | correction to $v_t$ than CM.''  I don't think this is quite right.
2380 | It's not a question of pointing back to the original position.
It's a 2381 | question of pointing in the right direction, which may be different 2382 | than the direction of the original position. Still, this line of 2383 | reasoning otherwise seems sound. (And it's probably true in two 2384 | dimensions.) 2385 | 2386 | ``While each iteration of NAG may only be slightly more effective than 2387 | CM at correcting a large and inappropriate velocity, this difference 2388 | in effectiveness may compound as the algorithms iterate.'' I don't 2389 | know that this is the case. It seems more likely to me that rather 2390 | than accumulating small improvements, it actually is preventing 2391 | occasional bad mistakes. 2392 | 2393 | There is a nice example in the appendix. Basically, a 2d example, 2394 | with very elongated ellipses as the contours. The momentum method 2395 | (with low friction) has the problem that it only very slowly builds up 2396 | momentum in the right direction. Basically it overshoots early on, 2397 | and then has to swing backwards and forwards, slowly. NAG avoids 2398 | this, even though it also has low values of momentum. 2399 | 2400 | They analyse CM and NAG for the objective function $C(y) = \sum_j 2401 | \lambda_j y_j^2+ c_j y_j$. In this particular case, they prove that 2402 | NAG acts like the classical momentum technique with learning rate 2403 | $\eta$, but with a modified momentum $\mu(1-\eta \lambda_j)$ in 2404 | component $j$. It should be possible to prove this through a 2405 | straightforward computation. 2406 | 2407 | This means that for small learning rates CM and NAG become very 2408 | similar. Note that locally every (smooth, convex) cost function may 2409 | be approximated by a quadratic cost functional (i.e., approximating by 2410 | the appropriate quadratic locally), and so this behaviour is likely 2411 | generic. We also see that NAG is going to have lower momentums, and 2412 | so more friction; it will tend to damp out oscillations. The decrease 2413 | in momentum will be particularly high when $\lambda_j$ is large. This 2414 | is good: it decreases the momentum a lot, and so increases the 2415 | friction a lot, which damps out the overoscillation that will cause a 2416 | problem as we go through the $y_j = 0$. 2417 | 2418 | Takeaway technique here: to understand an optimization technique it 2419 | can really help to look at quadratic trial functions like this, where 2420 | it may be possible to analyse behaviour analytically. In particular, 2421 | the big benefit of analysing quadratic cost functions is that they 2422 | really do carry most of the (local) information we'll ever need. 2423 | 2424 | ``The aim of our experiments is three-fold. First, to investigate the 2425 | attainable performance of stochastic momentum methods on deep 2426 | autoencoders starting from well-designed random initializations; 2427 | second, to explore the importance and effect of the schedule for the 2428 | momentum parameter $\mu$ assuming an optimal fixed choice of the 2429 | learning rate...; and third, to compare the performance of NAG versus 2430 | CM.'' 2431 | 2432 | They don't look at test errors --- i.e., they ignore regularization 2433 | and overfitting. Not sure how bulletproof their argument for this is 2434 | --- it seems a bit like special pleading. But I'm happy to run with 2435 | it: the point is that optimization can be treated separately from 2436 | generalization. (Of course, it may be that a better optimization 2437 | method is worse at generalization, and that ultimately needs to be 2438 | looked into.) 
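To keep the CM / NAG distinction straight, here is the pair of updates on a one-dimensional quadratic (a sketch of the textbook update rules with made-up constants, not the paper's experimental setup):

\begin{verbatim}
# CM vs NAG on f(x) = 0.5 * lam * x**2, whose gradient is lam * x.
lam, eta, mu = 10.0, 0.09, 0.9

def run(nesterov, steps=50):
    x, v = 1.0, 0.0
    for _ in range(steps):
        lookahead = x + mu * v if nesterov else x  # NAG: gradient at the partial update
        v = mu * v - eta * lam * lookahead
        x = x + v
    return x

print(run(nesterov=False), run(nesterov=True))
\end{verbatim}

With these constants $\eta\lambda = 0.9$, so the effective NAG momentum is $\mu(1-\eta\lambda) = 0.09$, much smaller than $\mu = 0.9$; this is exactly the $\mu(1-\eta\lambda_j)$ result quoted above, and it is why NAG damps the oscillations that CM suffers from here.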
2439 | 
2440 | The schedule for $\mu$ which they used was to take the smaller of
2441 | $\mu_{\rm max}$ and $1-1/(2(\lfloor t/250 \rfloor +1))$, where $t$ is
2442 | the epoch number.  In other words, over the first 250 epochs, $\mu
2443 | = 1/2$.  Then we switch to $3/4$.  Then to $5/6$.  And so on, until we
2444 | get to $\mu_{\rm max}$.  They have a nice explanation for this.
2445 | Basically, use the $1-1/t$ type schedule when the function isn't
2446 | convex --- this will help us explore and gradually find a good
2447 | locality.  But once that is found it's better to switch to a constant
2448 | rate, which will converge exponentially quickly.  So this is a nice
2449 | hybrid.
2450 | 
2451 | Actually, that's not quite the full schedule.  It turns out that they
2452 | do a final modification of $\mu$ for the tail end of training,
2453 | reducing it to another constant.  They have a nice heuristic
2454 | explanation for this, which I won't get into here, but should perhaps
2455 | come back to in future.
2456 | 
2457 | The results are impressive: much, much better than basic stochastic
2458 | gradient descent.
2459 | 
2460 | They also investigate recurrent neural networks.  I'm less familiar
2461 | with these, so I'll just quickly write out some very telegraphic and
2462 | incomplete notes.  Echo-state networks.  ``ESNs ... have achieved high
2463 | performance on tasks with long range dependencies (?)''  ``RNNs were
2464 | believed to be almost impossible to successfully train on such
2465 | datasets [with long-range temporal dependencies], due to various
2466 | difficulties such as vanishing/exploding gradients''  Interesting
2467 | comments on the effect of the spectral radius of the hidden-to-hidden
2468 | matrix on the dynamics of an RNN.  When are RNNs likely to be useful?  ``The main
2469 | achievement of these results is a demonstration of the ability of
2470 | momentum methods to cope with long-range temporal dependency training
2471 | tasks to a level which seems sufficient for most practical purposes.''
2472 | In practice, of course, many (most?) interesting human cognitive tasks
2473 | involve long-range temporal dependency: I do action X now, then must
2474 | do Y later, then Z (which depends on X) later still, and so on.  RNNs
2475 | seem like they might be especially useful for ``chains of thought'' as
2476 | opposed to pattern recognition.
2477 | 
2478 | Comparison to HF: They note that HF is a truncated Newton method.
2479 | Sounds like it's an improved linear conjugate gradient
2480 | method.  ``[Conjugate gradient] accumulates information as it iterates
2481 | which allows it to be optimal in a much stronger sense than any other
2482 | first-order method (like NAG)''
2483 | 
2484 | The idea behind all these first-order methods --- momentum-based and
2485 | NAG --- seems to be to find indirect ways of putting curvature into
2486 | the problem, by computing gradients at two separate points.  This
2487 | gives us information about the locally approximating quadratic, rather
2488 | than the locally approximating plane.  It seems as though you could do
2489 | better by making use of even more points --- three or four would give
2490 | you still higher-order approximations.  (They still wouldn't give you
2491 | global information, though.)
2492 | 
2493 | I need to get clear on the relationship of curvature to gradient
2494 | descent.  The basic point is that if the cost surface is highly curved
2495 | in some direction, gradient descent will tend to send us in that
2496 | direction.
That's not always what we want. Sometimes we want to move 2497 | off along low curvature directions as well. That's typically the case 2498 | for a general (positive-definite) quadratic. 2499 | 2500 | \section{Summary of CIFAR-10 results} 2501 | 2502 | As at July, 2013. I have drawn heavily on the compendium of results 2503 | by 2504 | \link{http://rodrigob.github.io/are\_we\_there\_yet/build/classification_datasets_results.html}{Rodrigo 2505 | Benenson}. Note that CIFAR-10 contains 10 classes, with 5,000 2506 | training images per class, and 1,000 test images per class. Images 2507 | are 32 by 32, and in RGB. Not centred or size-normalized. 2508 | 2509 | Note that the accounts below are not at all complete, they are 2510 | intended as a quick first cut. Several of these should be 2511 | investigated in much more depth. 2512 | 2513 | Karpathy on CIFAR: http://karpathy.ca/myblog/2011/04/27/lessons-learned-from-manually-classifying-cifar-10-with-code/ 2514 | 2515 | \textbf{Snoek, Larochelle, and Adams (2012)} 2516 | (\link{http://www.cs.toronto.edu/\~jasper/bayesopt.pdf}{link}): 2517 | ``Practical Bayesian Optimization of Machine Learning Algorithms''. 2518 | Appears to provide the best results at the time of writing. They used 2519 | a three-layer convolutional neural network. Achieved an error on the 2520 | test set of 14.98\%. This is over 3\% better than state of the art 2521 | (without augmenting the data). They then augmented the data using 2522 | horizontal reflections and translations, getting the error down to 9.5 2523 | \% on the test set. 2524 | 2525 | An interesting aspect of the project is that they learnt the 2526 | hyper-parameters automatically. In particular, they did a Bayesian 2527 | optimization to learn 9 separate hyper-parameters, including the 2528 | number of epochs, the learning rate, and the width, scale and power of 2529 | the response normalization in the pooling layers. The learned 2530 | hyper-parameters significantly outperform a human expert's 2531 | optimization of the hyper-parameters. Their expert achieved 18\% and 2532 | 11\% error (without and with data augmentation, respectively). 2533 | 2534 | Code from this project is available. Note that Jasper Snoek is at U 2535 | of T, but will be leaving for Harvard in September. They based their 2536 | convolutional net implementation on cuda-convnet. 2537 | 2538 | \textbf{Krizhevsky, Sutskever, and Hinton (2012):} 2539 | (\link{http://books.nips.cc/papers/files/nips25/NIPS2012\_0534.pdf}{link}) 2540 | ``ImageNet Classification with Deep Convolutional Neural Networks'' A 2541 | four-layer convolutional neural net achieved 13\% test error rate 2542 | without local response normalization, and 11\% with local response 2543 | normalization. Used cuda-convnet. 2544 | 2545 | \textbf{Ciresan, Meier, and Schmidhuber (2012):} 2546 | (\link{http://www.idsia.ch/\~ciresan/data/cvpr2012.pdf}{link}) 2547 | ``Multi-column deep neural networks for image classification'' 2548 | Achieves 11.21\% error for CIFAR-10. Achieves 0.23 \% error for 2549 | MNIST. Claims that humans get a 0.2\% error, with citation (would be 2550 | interesting to look up). Use a deep convolutional network. They do 2551 | basic backprop, with no pretraining. The architecture is to repeat a 2552 | convolutional layer followed by max pooling multiple times, followed 2553 | by some fully connected layers. They use 2 by 2 receptive fields and 2554 | max-pooling regions. It appears that the stride length is 2, as well. 
2555 | Somewhat similar to Krizhevsky et al's ImageNet paper. They use a 2556 | fully online training algorithm. They use a GPU. They use what they 2557 | call a multi-column deep neural network, which I don't quite 2558 | understand --- looks to be a technique for training multiple networks 2559 | and combining the results. They used a (scaled) tanh function for 2560 | convolutional and fully connected layers, a linear activation function 2561 | (does this mean rectified?) for max-pooling layers, and softmax at the 2562 | output. They used online gradient descent, with an annealed learning 2563 | rate (0.001, decaying by a factor of 0.993 after every epoch), and 2564 | continual translations, scaling and rotation of images. Initial 2565 | weights are drawn from a uniform random distribution in the range 2566 | [-0.05, 0.05]. 2567 | 2568 | MNIST architecture: 29 by 29 input; a 20-map convolutional layer, with 2569 | a receptive field of 4 by 4; max-pooling of 2 by 2 regions; a 40-map 2570 | convolutional layer with 5 by 5 receptive field; max-pooling of 3 by 3 2571 | regions; fully connected layer with 150 neurons; fully connected 2572 | (softmax) layer with 10 neurons. 2573 | 2574 | CIFAR architecture: 3 by 32 by 32 input; 300-map convolutional layer, 2575 | with 3 by 3 receptive fields; max-pooling of 2 by 2 regions; 300-map 2576 | convolutional layer, with 2 by 2 receptive fields; max-pooling of 2 by 2577 | 2 regions; 300 convolutional maps, 2 by 2 receptive fields; 2578 | max-pooling of 2 by 2 regions; then fully connected layers with 300, 2579 | 100 and 10 neurons. 2580 | 2581 | Augmenting the training set (by translating up to 5\%) helps a lot. 2582 | Scaling (up to 15 percent), rotation (up to 5 degrees) and additional 2583 | translations (up to 15 percent) helps a little extra. 2584 | 2585 | The contrast between the Krizhevsky and Ciresan results suggests that 2586 | ideas like dropout and rectified linear units make a big difference. 2587 | 2588 | Q: How much difference does the larger number of maps in the 2589 | convolutional layers make? 2590 | 2591 | \textbf{Goodfellow, Warde-Farley, Mirza, COurville, Bengio (2013):} 2592 | \link{http://arxiv.org/abs/1302.4389}{link} ``Maxout networks'': Test 2593 | set error of 12.93 \%. 2594 | 2595 | ``We define a simple new model called maxout... designed to both 2596 | facilitate optimization by droput and improve the accuracy of 2597 | dropout's fast approximate model averaging technique.'' 2598 | 2599 | Preprocessed the data using global contrast normalization and ZCA 2600 | whitening. Best model consists of three convolutional maxout layers 2601 | followed by a fully connected maxout layer, then finally a softmax 2602 | layer. 2603 | 2604 | \textbf{Tentative conclusions:} Use, in roughly this order: Martens' 2605 | initialization; rectified linear units; dropout; augmented training 2606 | data; annealed learning rate. It'd be interesting to look at the 2607 | local contrast normalization. Also try looking at Nesterov's momentum 2608 | method. The Ciresan results suggest some benefit from using lots of 2609 | maps in the convolutional layers. 2610 | 2611 | 2612 | \section{Grandmother cell (Wikipedia)} 2613 | 2614 | Apparently proposed in the late 1960s by Konorski and Lettvin. 
2615 | Lettvin ``originated the term grandmother cell to illustrate the 2616 | logical inconsistency of the concept.'' There is apparently quite a 2617 | bit of support for the concept at the broad category level: neurons 2618 | which are higly face-specific, and even to individual human faces. 2619 | However, ``[e]ven the most selective face cells usually also disharge, 2620 | if more weakly, to a variety of individual faces.'' A 2005 study 2621 | found a ``neuron for Halle Berry'', which fired not only for pictures 2622 | of the actress, but also to the words ``Halle Berry'', and which 2623 | didn't fire when pictures of several other actresses were presented. 2624 | Of course, this doesn't mean that was the only cell to respond. The 2625 | ``sparseness'' hypothesis versus the ``distributed representation'' 2626 | theory. It's really not clear to me that there is a dichotomy here. 2627 | A picture of Halle Berry will no doubt cause many neurons to fire, 2628 | some of which will fire for other reasons too. Maybe the hypothesis is 2629 | this: for each single object or concept there is a corresponding 2630 | grandmother neuron. 2631 | 2632 | \chapter{Miscellanea} 2633 | 2634 | \textbf{Compiling to neural networks:} Can we create compilers which 2635 | translate programs written in a conventional programming language into 2636 | a neural network? I'd be especially interested in seeing how this 2637 | works for AI workhorses such as Prolog. What could we learn from such 2638 | a procedure? (1) Perhaps we could figure out how to link up multiple 2639 | neural modules, with one or more of the modules coming from the 2640 | compiler? (2) Maybe we could use a learning technique to further 2641 | improve the performance of the compiled network. Googling doesn't 2642 | reveal a whole lot, although I did find a paper by 2643 | \link{http://scholar.google.ca/scholar?cluster=10518384657895134615\&hl=en\&as\_sdt=0,5}{Thrun} 2644 | where he discusses decompiling, i.e., extracting rules from a neural 2645 | network. Thrun uses a technique he calls validity-interval analysis, 2646 | basically propagating intervals for inputs and outputs forwards and 2647 | backwards through a network. 2648 | 2649 | \textbf{Deep learning requires nonlinear neurons:} Put another way, 2650 | deep learning with linear neurons doesn't help. Via linear embedding 2651 | it's equivalent to a single hidden layer whose size is just the 2652 | minimal size of any of the original hidden layers. So there is 2653 | absolutely no advantage to doing deep learning with linear neurons. 2654 | 2655 | \textbf{No theory of generalization:} We have all these techniques 2656 | based on parameter-fitting. But we have a paucity of strong 2657 | underlying theoretical ideas. 2658 | 2659 | \textbf{Principal Components Analysis (PCA):} It'll be useful to 2660 | review PCA here. Suppose we have a set of data points $x$ in some 2661 | high-dimensional (vector) space. Then we'd like to find a 2662 | $k$-dimensional projector $P$ such that the following error function 2663 | is minimized: 2664 | \begin{eqnarray} 2665 | \sum_x \| x-Px \|^2. 2666 | \end{eqnarray} 2667 | This error can be rewritten as $\mbox{tr}((I-P)\Sigma)$, where $\Sigma 2668 | \equiv \sum_x x x^T$. And so we simply choose $P$ to project onto the 2669 | eigenvectors of $\Sigma$ with the $k$ largest eigenvalues. The 2670 | \emph{principal components} are the eigenvectors of $\Sigma$, in order 2671 | of decreasing eigenvalue. 
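A minimal numerical version of this, using $\Sigma = \sum_x x x^T$ exactly as above (no centring step, to match the formula; ordinarily one would subtract the mean first):

\begin{verbatim}
import numpy as np

def pca_projector(X, k):
    """X has one data point per row.  Returns the rank-k projector P onto
    the top-k principal components, per the argument above."""
    Sigma = X.T @ X                                  # Sigma = sum_x x x^T
    eigvals, eigvecs = np.linalg.eigh(Sigma)         # ascending eigenvalues
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # k largest
    return top @ top.T                               # P = V_k V_k^T, with P @ P = P
\end{verbatim}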
(There may, of course, be some ambiguity
2672 | when $\Sigma$ is degenerate).
2673 | 
2674 | Practically speaking, suppose we have a billion images, each of which
2675 | can be regarded as a vector in a 100,000-dimensional space.  We can
2676 | reduce to (say) a 100-dimensional space.  This gets rid of much of the
2677 | irrelevant structure, and hopefully leaves a structure that is useful
2678 | for comparing images.
2679 | 
2680 | \textbf{PCA and autoencoders:} PCA is a way of simplifying our
2681 | understanding of data in high dimensions.  Think of the space of all
2682 | possible images.  There's a subset of that space which can plausibly
2683 | be taken to represent faces.  (Note that contextual clues can also
2684 | help).  How can we characterize that subspace?  Classic example of
2685 | PCA: IQ testing.  Take a large number of different tests.  Turns out
2686 | that there is a common factor.  Another nice example: a helix in 3
2687 | dimensions.  There's a major question: how to determine the number of
2688 | hidden units?
2689 | 
2690 | \textbf{Recurrent neural networks (RNN):} According to Wikipedia, RNNs
2691 | have achieved the best results to date on handwriting recognition.  An
2692 | obvious question is: what are the respective advantages of RNNs and
2693 | feedforward networks?  Are there important problems for which one or
2694 | the other is preferable?  Why?  What I've read about these questions
2695 | is opaque.
2696 | 
2697 | \textbf{Regularization:} I'd like to understand \emph{why} we
2698 | regularize.  Certainly, regularization results in solutions with a
2699 | small norm.  But why do we not want solutions with a larger norm?
2700 | Will something bad happen to us if we allow such solutions?
2701 | 
2702 | The standard argument: what's bad is that overfitting can occur.  And
2703 | thus regularization helps reduce overfitting.  It'd be nice to have an
2704 | example where overfitting actually occurs.  It's really not clear that
2705 | there \emph{should} be a problem with overfitting.  In fact, neural
2706 | networks eventually become virtually invariant under rescaling of
2707 | their weights and biases.  So it's really not clear that it should
2708 | help.
2709 | 
2710 | Returning to regularization, here's the standard story people tell to
2711 | explain why they regularize.  The story is that they want to avoid
2712 | high-complexity solutions, in order to avoid over-fitting.  Solutions
2713 | with smaller norms are in some sense lower complexity.  And therefore
2714 | it makes sense to look for solutions with smaller norm.  One way of
2715 | doing this is to penalize solutions with larger norms.  Thus, we
2716 | should add a term to the cost which penalizes such solutions.
2717 | 
2718 | Now, this is just a story.  It's not in any sense a sharp
2719 | justification.  In fact, the impact of regularization is still being
2720 | understood.  Researchers write papers where they try different
2721 | approaches to regularization, compare them to see which works better,
2722 | and try to understand why different approaches work the way they do.
2723 | 
2724 | When can overfitting occur?  Typically, when there are more parameters
2725 | in the model than there is training data.  What's odd about this is
2726 | that regularization doesn't really help all that much with this
2727 | problem.  It just restricts one degree of freedom.
2728 | 
2729 | Many different types of regularization possible.  I will just use the
2730 | most standard and obvious, which is quadratic.
Empirically: I find that regularization seems to help.  When we
regularize I get higher accuracies, by quite a bit.  I don't
understand why that is.

Maybe I'm already overfitting, and regularization is helping reduce
that problem.  It's possible: I have 20,000 or so parameters in my
model.  It'd be nice to see if this is the case.

An example of overfitting: I'll bet I can get it to overfit when we
use just 50 training examples.  And I can probably more or less prove
this using cross-validation.

Look at LeCun \emph{et al}'s results: do they regularize, or not?

\textbf{Restricted Boltzmann machines:} The idea is not to learn a
function, but rather to learn a probability distribution.  There are
two layers of neurons: a visible layer, and a hidden layer.  All
visible units are connected to all hidden units.  The energy of a
given configuration is just:
\begin{eqnarray}
E(v, h) & = & -\sum_i a_i v_i-\sum_j b_j h_j-\sum_{ij} w_{ij} v_i h_j \nonumber \\
& = & -a \cdot v-b\cdot h -v^T W h,
\end{eqnarray}
where $a$ are the biases for the visible units, $b$ are the biases for
the hidden units, and $W$ is the weight matrix.  The distribution is
just the standard Boltzmann distribution, at some fixed temperature.
Apparently it can be shown that:
\begin{eqnarray}
p(v_i = 1 | h) = \sigma( a_i + (Wh)_i ),
\end{eqnarray}
where $\sigma$ is the usual sigmoid function.  (I'll bet this is easy
to show, just by summing out all the other visible units.)
Furthermore, the $v_i$ are independent of one another, given $h$.
This too would be easy to show --- it's a straightforward consequence
of the bipartite nature of the graph.  So we can compute the
probability of $v$, given $h$, simply by multiplying sigmoids.

Let's suppose we wanted to train an RBM with a set of images.  The
images would correspond to the visible units, while the hidden units
would be feature detectors.  The idea is to adjust the weights and
biases so that training images have a high probability, i.e., a low
energy.

In a little more detail, suppose we input a training image.  Then we
can stochastically pick a corresponding value for the hidden units.
Now, feed that back, and stochastically choose a value for the image.
In an ideal world, we'd recover the original image.  We modify the
weights in such a way as to improve the fidelity of the recovered
image.

Well, the penny finally drops: an RBM can be viewed as a neural
network in which the transitions are probabilistic.  That's all!
Frankly, we don't even really need the stuff about ground states,
although it's a beautiful thing to keep in mind.
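
To make the ``feed it back'' step concrete, here's a small sketch of
one stochastic up-down pass (illustrative only: the shapes and
initialization are made up, and I'm assuming the symmetric conditional
$p(h_j = 1|v) = \sigma(b_j + (W^T v)_j)$, which follows from the same
energy function):
\begin{verbatim}
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_bernoulli(p, rng):
    return (rng.random(p.shape) < p).astype(float)

def reconstruct(v, a, b, W, rng):
    # One stochastic up-down pass.  W has shape (n_visible, n_hidden),
    # so p(h_j=1|v) = sigmoid(b_j + (W^T v)_j) and
    #    p(v_i=1|h) = sigmoid(a_i + (W h)_i).
    p_h = sigmoid(b + W.T @ v)      # hidden units independent given v
    h = sample_bernoulli(p_h, rng)  # stochastically pick hidden values
    p_v = sigmoid(a + W @ h)        # visible units independent given h
    v_recon = sample_bernoulli(p_v, rng)
    return h, v_recon

# Toy usage: ideally v_recon would match v; training (e.g. contrastive
# divergence) adjusts a, b, W to make the reconstruction more faithful.
rng = np.random.default_rng(0)
n_vis, n_hid = 6, 4
a, b = np.zeros(n_vis), np.zeros(n_hid)
W = 0.01 * rng.standard_normal((n_vis, n_hid))
v = sample_bernoulli(0.5 * np.ones(n_vis), rng)
h, v_recon = reconstruct(v, a, b, W, rng)
\end{verbatim}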
\textbf{Softmax function:} Suppose the $q_j$ are some set of values.
Then we define the softmax function by:
\begin{eqnarray}
p_j \equiv \exp(q_j)/\sum_k \exp(q_k).
\end{eqnarray}
This is a probability distribution, and it preserves the order of the
original values.  You can, for example, take the softmax in the final
layer of a neural network, using the weighted sums of inputs as the
$q_j$ values.  The output from the network can then be interpreted as
a probability distribution.
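
Here's a one-line sketch of the definition (my own illustration;
subtracting the maximum is just a standard trick to avoid overflow,
and doesn't change the result, since the softmax is invariant under
adding a constant to every $q_j$):
\begin{verbatim}
import numpy as np

def softmax(q):
    # p_j = exp(q_j) / sum_k exp(q_k), computed stably.
    q = np.asarray(q, dtype=float)
    e = np.exp(q - q.max())
    return e / e.sum()

# The output sums to 1 and preserves the order of the inputs.
print(softmax([2.0, 1.0, 0.1]))
\end{verbatim}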
\textbf{Thinking geometrically:} Suppose we're asked to tell the
difference between pictures of a human face, and pictures of a
giraffe.  We can represent the pictures as points $x$ in a very
high-dimensional space.  And so our task is to divide that space up
into two parts: one is classified as giraffe, the other as human face.
(Maybe it should be three parts: the third part would be neither face
nor giraffe.)  And so what we really want is algorithms for dividing
up that space.  In some sense we're interested in understanding the
space of all such algorithms.

It'd be interesting to lay out all the different curlicues of thinking
in this way: the opportunities, and the pitfalls.  There are at least
three broad approaches: (1) the \emph{pure geometric approach}, based
on finding mathematical structures to divide the space; (2) the
\emph{biological approach}, where we try to figure out how we do it;
and (3) the \emph{kludge approach}, where we simply try lots of ideas,
and pile them up on top of one another.  That's a pretty rough
division, but it seems like a good starting point for thought.  My bet
is that progress comes from playing these ideas off against one
another.

\textbf{Tricks:} Much of what seems to be going on is the discovery of
tricks (of varying generality) which can be used to improve pattern
recognition performance.  There are some general heuristics: \emph{use
symmetry} is obviously one.


\section{Future reading}

On the display of scientific papers: https://news.ycombinator.com/item?id=6042742

Connectomics, a recent approach: http://arxiv.org/abs/1306.5709

Ciresan 2012 on MNIST, and Rifai 2011 (``The manifold tangent
classifier'') on MNIST.

Kiros 2013: http://www.ualberta.ca/\~rkiros/kiros\_thesis\_jun5.pdf
Best reported results on MNIST when no distortions are used.

Interesting comments on image recognition: https://news.ycombinator.com/item?id=5994851

Saxe et al: ``On random weights and unsupervised feature learning''
(2011).  On hyper-parameter optimization.  One of Ng's collaborators.

LeCun on recent Ng results: https://plus.google.com/104362980539466846301/posts/5ab217HugeF

Goodfellow: https://plus.google.com/103174629363045094445/posts/dh7UT9xbMW4

\textbf{SIFT:}

Tips on what works: https://news.ycombinator.com/item?id=5994851

``Fast, accurate detection of 100,000 object classes on a single
machine'':
http://googleresearch.blogspot.ca/2013/06/fast-accurate-detection-of-100000.html

Hinton: ``Where do features come from?'': http://scholar.google.ca/citations?view\_op=view\_citation\&hl=en\&user=JicYPdAAAAAJ\&sortby=pubdate\&citation\_for\_view=JicYPdAAAAAJ:L\_l9e5I586QC

Bengio lecture notes

Seide 2011 on deep learning and Microsoft's MAVIS system.

Bengio and Courville: ``Deep learning of representations''
http://www.iro.umontreal.ca/~bengioy/papers/BengioCourvilleChapter.pdf

Andrew Ng, CS294A lecture notes

McCulloch and Pitts

Recent Bengio paper on a new approach to deep learning: http://arxiv.org/abs/1306.1091

Eliasmith

Levesque: http://www.cs.toronto.edu/~hector/Papers/ijcai-13-paper.pdf

On feedback in the brain: http://blogs.scientificamerican.com/mind-guest-blog/2013/08/08/this-brain-discovery-may-overturn-a-century-old-theory/

Martens 2010: Hessian-free optimization, and sparse initialization.

Bengio et al: ``Scaling learning algorithms towards AI'' (2007)

Boureau: ``A theoretical analysis of feature pooling in visual
recognition'' (2010).

Sermanet: ``Convolutional neural networks applied to house numbers
digit classification''

Elkan 2013: ``Learning meanings for sentences'': http://cseweb.ucsd.edu/\~elkan/250B/learningmeaning.pdf

Agre.

Hubel and Wiesel (1959).  Simple and complex cells.  The basic model of V1.

Frome 2009: ``Large-scale privacy protection in Google Street View''

Deep learning for the masses: http://gigaom.com/2013/08/16/were-on-the-cusp-of-deep-learning-for-the-masses-you-can-thank-google-later/

\textbf{Collobert:} ``Natural language processing (almost) from scratch''

\textbf{Bengio et al (1994):} The vanishing gradient problem.
``Learning long-term dependencies with gradient descent is
difficult.''

\textbf{Erhan:} ``Why does unsupervised pre-training help deep learning?''

\textbf{HoG:}

\textbf{Hinton et al (2006):}

\textbf{Itamar Arel et al:} For a different POV.

PAC learning.

Conference on learning representations: http://techtalks.tv/iclr2013/

IPAM: https://www.ipam.ucla.edu/schedule.aspx?pc=gss2012

\textbf{Lee and Mumford (2003):}
\link{http://dash.harvard.edu/bitstream/handle/1/3637109/Mumford\_HierarchBayesInfer.pdf?sequence=1}{link}
This looks like great background reading on the idea of doing
hierarchical inference in the visual cortex.

\textbf{Embrechts (2010):}

\textbf{Dropout:}

\textbf{Le (2012):} \link{https://plus.google.com/u/0/+ResearchatGoogle/posts/EMyhnBetd2F}{link}

\textbf{Seide (2011):}
\link{http://research.microsoft.com/apps/pubs/default.aspx?id=153169}{link}

\textbf{Bengio (2007):} \link{http://arxiv.org/pdf/1206.5533v2.pdf}{link}

\textbf{Ranzato (2007):}

\textbf{Lee (2008):}

\textbf{Larochelle (2009):}

\textbf{Wolpert (XXX):} No free lunch.

\textbf{The NIPS 2012 talks:}

\textbf{Elements of statistical learning:} \link{http://www.stanford.edu/\~hastie/local.ftp/Springer/OLD//ESLII\_print4.pdf}{link}

\textbf{No more pesky learning rates:} \link{http://arxiv.org/pdf/1206.1106.pdf}{link}

\textbf{Olshausen and Field:}

Tenenbaum 2011: ``How to grow a mind''

Rumelhart et al on backprop.
BigBrain Atlas: http://news.sciencemag.org/sciencenow/2013/06/bigbrain-atlas-unveiled.html

Hinton on DReDnets: http://techtalks.tv/talks/drednets/58115/

\textbf{Distributed deep learning:}
\link{http://research.google.com/archive/large\_deep\_networks_nips2012.html}{link}.

\textbf{Stanford tutorial:} http://ufldl.stanford.edu/wiki/index.php/UFLDL\_Tutorial

Eliot R. Smith: ``What do connectionism and social psychology offer
each other?''  Good for something of an exterior point of view.

\textbf{To do:} Contrastive divergence
(http://learning.cs.toronto.edu/~hinton/absps/cdmiguel.pdf and
http://www.cs.utoronto.ca/~hinton/absps/nccd.pdf).  LeCun 1998
``Efficient BackProp''.  Dropout.  Maxout.  Andrew Ng's 1997 paper
``Preventing overfitting of cross-validation data''.  Blumer \emph{et
al} with guarantees on induction
(http://scholar.google.ca/scholar?cluster=11895938102761137877\&hl=en\&as\_sdt=0,5).
Would be good to understand this in conjunction with no free lunch.
NIPS papers are online.

\textbf{Neural nets FAQ:} There is no single definition of a neural
network.  It's possible to do XOR with just a single hidden unit, if
direct connections from the input to the output are allowed.  Problems
which neural nets aren't so good at: predicting random or
pseudo-random numbers; factoring large integers; determining whether a
number is prime.  Research problem: find a net which will determine
whether a number is prime.  Distinction between recurrent and
feedforward neural networks.  The FAQ calls the set of cases we'd like
to generalize to the \emph{population}.  Constructive learning: start
with a small network, train, then gradually add extra neurons and do
more training.  A lot of work has been done on toy problems, and
various hacks are known for the different toy problems.


\textbf{Stephen Judd (1988):} Thesis on the complexity of learning in
neural networks: http://www.dtic.mil/dtic/tr/fulltext/u2/a450825.pdf

\textbf{Sima (1996):} Shows that finding weights is hard even for
sigmoidal neural networks with just 3 nodes.  This can be viewed as an
extension of Blum and Rivest (1989).
http://scholar.google.ca/scholar?cluster=18396613610240979409\&hl=en\&as\_sdt=0,5

\textbf{Egri and Schultz:} Found a neural network capable of
recognizing prime numbers.
http://www.cs.mcgill.ca/\~legri1/prime06.pdf



\end{document}
--------------------------------------------------------------------------------