├── .gitignore ├── Bayesian Optimization & Gaussian Processes ├── Garnelo_2018b.md └── Shahriari_2016.md ├── Biologically_Plausible_DL ├── Bartunov_2018.md ├── Lillicrap_2016.md ├── Sacramento_2018.md └── Whittington_2019.md ├── Deep Learning ├── Gal_2015.md ├── Gal_2016.md ├── Goodfellow_2016.md └── Spoerer_2017.md ├── Formal Grammars ├── 2001_Hale.md ├── 2016b_Siyari.md └── 2018_Schoenhense.md ├── Free-Energy Principle ├── Friston_2010.md ├── Limanowski_2013.md └── Ostwald_2015.md ├── Hierarchical Reinforcement Learning ├── 1997_McGovern.md ├── 1998_McGovern.md ├── 2001_Mcgovern.md ├── 2002_Menache.md ├── 2002_Stolle.md ├── 2004_Bakker.md ├── 2004_Mannor.md ├── 2014_Yao.md ├── 2017_Florensa.md ├── 2018_Frans.md └── 2018_Levy.md ├── Hyperparam-Opt ├── 2017_Jaderberg.md └── 2019_Li.md ├── Meta-Learning ├── 2016_Andrychowicz.md ├── 2017_Finn.md └── 2018_Finn.md ├── Multi-Agent RL ├── 2016_Foerster.md ├── 2018_Strouse.md └── 2019_Das.md ├── Optimization ├── 2018_Baydin.md └── Ruder_2016.md ├── ReadingGroupICL ├── 01_Daume_2004.md ├── 02_Zhang_2017.md ├── 03_Marco_2017.md └── 05_Lee_2017.md ├── ReadingGroupTU ├── 2018_Nayebi_pres.pdf ├── 2019_Bengio_pres.pdf ├── 2019_Flennerhag_pres.pdf ├── 2019_Geier.md └── 2019_Merel_pres.pdf ├── Readme.md ├── Reinforcement Learning ├── 1993_Dayan_b.md ├── 2015_Hausknecht.md ├── 2015_Schaul.md ├── 2016_Rusu.md ├── 2018_Andrychowicz.md └── 2018_Choshen.md ├── Theory-of-DL ├── 2017_Collins.md ├── 2018_Jacot.md ├── 2019_Frankle.md └── 2019b_Frankle.md ├── Variational Inference ├── Ostwald_2014.md └── Zhang_2018.md └── Vision └── 2018_Kuemmerer.md /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | *.pdf 3 | *.key 4 | plain.md 5 | !2018_Nayebi_pres.pdf 6 | !2019_Bengio_pres.pdf 7 | !2019_Flennerhag_pres.pdf 8 | !2019_Merel_pres.pdf 9 | -------------------------------------------------------------------------------- /Bayesian Optimization & Gaussian Processes/Garnelo_2018b.md: -------------------------------------------------------------------------------- 1 | # Title: Neural Processes 2 | 3 | # Author: Garnelo et al (2018b) 4 | 5 | #### General Content: Introduce a framework that balances advantages of both deep learning as well as Gaussian processes. A NP is a network-based approx of a stochastic process. It models a distribution over fcts, estimates uncertainty, shifts workload from training to test and is computationally efficient. In its essence it is an VAE structure with a global latent variable that captures epistemic uncertainty. They show powerful results in different problem setting (self-supervised image completion, Bayesian optimization as well as function approx.). 
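The encoder h, mean aggregator a and decoder g can be written down very compactly. Below is a minimal sketch of a single NP forward pass, assuming PyTorch; the layer sizes, the diagonal Gaussian over the global latent z and the deterministic decoder output are illustrative stand-ins, not the paper's exact architecture (which these notes do not record).

```python
# Minimal Neural Process forward pass (sketch, assumption: PyTorch).
import torch
import torch.nn as nn

class NeuralProcess(nn.Module):
    def __init__(self, x_dim=1, y_dim=1, r_dim=64, z_dim=32):
        super().__init__()
        # Encoder h: maps each context pair (x_i, y_i) to a representation r_i
        self.h = nn.Sequential(nn.Linear(x_dim + y_dim, r_dim), nn.ReLU(),
                               nn.Linear(r_dim, r_dim))
        # Aggregated representation r parameterizes the global latent z
        self.to_mu = nn.Linear(r_dim, z_dim)
        self.to_logvar = nn.Linear(r_dim, z_dim)
        # Decoder g: maps (x_target, z) to a prediction for y_target
        self.g = nn.Sequential(nn.Linear(x_dim + z_dim, r_dim), nn.ReLU(),
                               nn.Linear(r_dim, y_dim))

    def forward(self, x_context, y_context, x_target):
        r_i = self.h(torch.cat([x_context, y_context], dim=-1))  # (N_context, r_dim)
        r = r_i.mean(dim=0)                # aggregator a: permutation-invariant mean
        mu, logvar = self.to_mu(r), self.to_logvar(r)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterized sample of z
        z_rep = z.unsqueeze(0).expand(x_target.shape[0], -1)
        return self.g(torch.cat([x_target, z_rep], dim=-1))      # predictions at target inputs

# Different samples of z give different coherent function samples for the same
# context set, which is exactly what the conditional NP (no latent) cannot do.
```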
6 | 7 | 8 | #### Keypoints: 9 | 10 | * Sufficient conditions for definition of stochastic process - Kolmogorov Extension Theorem: Exchangeability (invariance of joint distribution to permutations of elements in sequence) and Consistency (marginalization does not change resulting class of distribution) - de Finetti's theorem 11 | 12 | * NP = Encoder (h - NN), Aggregator (a - mean operator), Decoder (g - NN) - check out good graphical model visualization 13 | 14 | * z - global uncertainty - distribution characterized by data and context specific prior 15 | 16 | * Relationship to other models: 17 | * Conditional NPs: Lack latent variable - unable to produce different samples for same context - no uncertainty estimate 18 | * Generalization of generative query network - similar training for prediction of new viewpoints in 3D given some context 19 | * GPs: Deep Kernel Learning, Kernel Approx 20 | * Meta-Learning - Workload shift to test time 21 | 22 | * Amortized inference - Variational inference procedure in which a powerful predictor (NN) is used to predict optimal value of variational parameters based on features - replace local var parameters by fct of data whose parameters are shared across data points. 23 | 24 | * Comparison to GPs: No handcrafted kernel, but learning of implicit measure directly from the data - efficient computation. 25 | 26 | 27 | #### Questions: 28 | 29 | * What does image completion with full set of context points do? samples a noisy version? NP seems to perform a lot better in low data/context-point regimes - as many Bayesian methods do. 30 | * No details about architectures - cant replicate anything - good github repo (https://github.com/geniki/neural-processes) 31 | -------------------------------------------------------------------------------- /Bayesian Optimization & Gaussian Processes/Shahriari_2016.md: -------------------------------------------------------------------------------- 1 | # Title: 2 | 3 | # Author: 4 | 5 | #### General Content: 6 | 7 | 8 | #### Keypoints: 9 | 10 | 11 | #### Summary: 12 | 13 | 14 | #### Questions: 15 | -------------------------------------------------------------------------------- /Biologically_Plausible_DL/Bartunov_2018.md: -------------------------------------------------------------------------------- 1 | # Title: Assessing the Scalability of Biologically-Motivated Deep Learning Algorithms and Architectures 2 | 3 | # Author: Bartunov et al (2018) 4 | 5 | #### General Content: Extent the target propagation algorithm to not require exact gradient at penultimate layer. Test alternative learning rules in more complicated settings (CIFAR/ImageNet) and differentiate between locally and fully connected architectures. Very good review but not much additional innovation. 
Behavioral + Physiological Realism 6 | 7 | 8 | #### Keypoints: 9 | 10 | * Problems with backpropagation 11 | * Feedback connections require exact copy of feedforward connections = Weight transport 12 | * Info propagation does not influence "neural activity" - does not conform to any known biological mechanism 13 | 14 | * Feedback alignment: Use random weights in backward pass to deliver info to earlier layers 15 | * Still requires delivery of signed error via distinct pathway 16 | * Direct/Broadcast FA - connect feedback from output layer directly to all previous ones 17 | 18 | * Contrastive Hebbian Learning/Generalized Recirculation: Use top-down feedback connections to influence neural activity and differences to locally approx gradients 19 | * Positive/negative phase - need settling process - likely too slow for brain to compute in real time 20 | 21 | * Target Propagation: Trains distinct set of feedback connections defining backward activity propagation 22 | * Connections trained to approximately invert feedforward connections to compute target activities for each layer by successive inversion - decoders 23 | * Reconstruction + Forward loss 24 | * Different target constructions 25 | * Vanilla TP: Target computation via propagation from higher layers' targets backwards through layer-wise inverses 26 | * Difference TP: Standard delta rule with additional stabilization from prev reconstruction error. Still needs explicit grad comp at final layer 27 | * Not tested on data more complex than MNIST 28 | 29 | * Simplified Difference Target Propagation: Computation also for penultimate layer with help of correct label distribution - removes implausible gradient communication 30 | * Need diversity in targets - problem of low entropy of classification targets 31 | * Need precision in targets - poor inverse learned 32 | * Combat both problems/weaknesses of targets with help of auxiliary output resembling random features from penultimate hidden layer 33 | * Parallel vs alternating inverse training - simultaneous more plausible 34 | 35 | * Weight-Sharing is not plausible - regularizes by reducing number of free parameters 36 | 37 | * Experiments - Mostly negative results: 38 | 1. None of the existing algos is able to scale up - Good performance on MNIST/Somewhat reasonable on CIFAR/Horrible on ImageNet - Seems like weight-sharing is not key to success 39 | 2. Need for behavioral realism - judged by performance on difficult tasks 40 | 3. Hyperparameter Sensitivity 41 | * First fix "good" architecture and then optimize 42 | * Use hyperbolic tanh instead of ReLU - works better 43 | 44 | 45 | #### Questions: 46 | 47 | * How could the brain do weight sharing - is an approximate version again satisfactory/a functional approx? 48 | * Think more about communication: MARL agents learning communication channels 49 | -------------------------------------------------------------------------------- /Biologically_Plausible_DL/Lillicrap_2016.md: -------------------------------------------------------------------------------- 1 | # Title: Random synaptic feedback weights support error backpropagation for deep learning 2 | 3 | # Author: Lillicrap et al (2016) 4 | 5 | #### General Content: Introduce a first feedback alignment approach to solve the weight transport problem of backpropagation. Forward and backward weights are modeled separately - backward weights align with the weight matrix transpose through the learning process. Argument follows from positive definiteness of the weight and random matrix product and a rotation line of thought.
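A toy version of the mechanism makes the claim concrete: the backward pass sends the error through a fixed random matrix B instead of the transpose of the forward weights, and only the forward weights are learned. Sketch below assuming NumPy; the single hidden layer, tanh nonlinearity and learning rate are arbitrary choices for illustration.

```python
# Feedback-alignment update for one hidden layer (sketch, assumption: NumPy).
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 30, 20, 10
W0 = rng.normal(scale=0.1, size=(n_hid, n_in))    # input -> hidden (learned)
W1 = rng.normal(scale=0.1, size=(n_out, n_hid))   # hidden -> output (learned)
B = rng.normal(scale=0.1, size=(n_hid, n_out))    # fixed random feedback weights (never learned)
lr = 0.01

def fa_step(x, y_target):
    """One forward/backward pass where the error travels back through B, not W1.T."""
    global W0, W1
    h = np.tanh(W0 @ x)
    y = W1 @ h
    e = y - y_target                       # signed output error
    delta_h = (B @ e) * (1.0 - h ** 2)     # feedback alignment: B replaces W1.T
    W1 -= lr * np.outer(e, h)
    W0 -= lr * np.outer(delta_h, x)
    return 0.5 * float(e @ e)
```

As long as the feedback-aligned update stays within 90 degrees of the backprop signal ($e^T W B e > 0$, see the keypoints below), the loss still decreases, and training itself drives the forward weights toward approximate alignment with $B^T$.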
6 | 7 | 8 | #### Keypoints: 9 | 10 | * Weight transport Problem: downstream errors are fed back to upstream neurons via exact symmetric copy of downstream synaptic weight matrix - neuron "deep" within network has to have precise knowledge of all downstream synapses! 11 | 12 | * Possible solutions: 13 | 1. Retrograde transmission of info along axons - problem of slow timescale 14 | 2. Feedback of errors via second network - problem of symmetry assumption of feedforward and feedback connections 15 | 3. Here: Show that even fixed random connections can allow for learning - symmetry not required! Instead implicit dynamics lead to soft alignment between forward and backward weights 16 | 17 | * Observations: 18 | * Feedback weights does not have to be exact: $B \approx W^T$ with $e^TWBe > 0$. rotation within 90 degrees of backprop signal. Learning speed depends on degree! 19 | * Alignment of $B$ and $W^T$ via adjustment of W (and B) possible 20 | 21 | * Feedback alignment: 22 | * Modulator signal (error-FA) does not impact forward pass post-synaptic activity bu acts to alter plasticity at the forward synapses. 23 | * FA may encourage W to align with Moore-Penrose pseudoinverse of B - approximate functional symmetry 24 | * Inference vs learning - towards bayesian approaches 25 | 26 | * Experiments: 27 | * Learns linear function with single hidden layer - learning not slower than backprop 28 | * Sigmoid nonlinearity and classification task - altered function of post-synaptic activity - learned also to communicate info when 50% of weights were randomly removed 29 | * More layers 3 hidden layers - as well as backprop and making use of depth - froze layers and trained alternatingly - positive/negative phase? 30 | * Neurons that integrate activity over time and spike stochastically - synchronous pathways 31 | 32 | * Possible Extensions: 33 | * Fixed spike thresholds/refractory period 34 | * Dropout/stochasticity 35 | 36 | #### Questions: 37 | 38 | * Still signed error signal has to be transferred which remains illusive - see target propagation. 39 | * Is result related to Johnson-Lindenstrauss concentration ineq ideas? 40 | * Usage of intricate/more complex architectures of communication of backward error - relation to multi-agent RL 41 | * Relationship to predictive coding 42 | -------------------------------------------------------------------------------- /Biologically_Plausible_DL/Sacramento_2018.md: -------------------------------------------------------------------------------- 1 | # Title: Dendritic cortical microcircuits approximate the backpropagation algorithm 2 | 3 | # Author: Sacramento et al (2018) 4 | 5 | #### General Content: MLP with simplified dendritic compartments learned in local PE plasticity fashion. No separate phases needed. Errors represent mismatch between pre input from lateral interneurons and top-down feedback. First cortical microcircuit approach. Analytically derive that such a setup/learning rule approximates backprop weight updates and proof basic performance on MNIST. 
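To make the "apical error = top-down input minus lateral interneuron input" idea concrete, here is a deliberately crude rate-based caricature, assuming NumPy. It ignores the paper's conductance-based somatic dynamics, nudging and noise, keeps the interneuron-related weights fixed (in the paper they are plastic too), and only shows how an apical mismatch can gate a local update of the bottom-up weights.

```python
# Rate-based caricature of the dendritic-error microcircuit (sketch, assumption: NumPy).
import numpy as np

rng = np.random.default_rng(1)
n_low, n_hid, n_top = 8, 6, 4
W_up = rng.normal(scale=0.1, size=(n_hid, n_low))   # bottom-up weights onto basal compartments (learned here)
W_td = rng.normal(scale=0.1, size=(n_hid, n_top))   # top-down weights onto apical compartments
W_pi = rng.normal(scale=0.1, size=(n_top, n_hid))   # pyramidal -> interneuron (lateral, one per upper-layer neuron)
W_ip = rng.normal(scale=0.1, size=(n_hid, n_top))   # interneuron -> apical compartment (cancellation pathway)
lr = 0.05

def apical_error_step(r_low, r_top):
    global W_up
    basal = W_up @ r_low                  # feedforward prediction (basal potential)
    r_hid = np.tanh(basal)                # somatic rate, driven bottom-up
    r_int = W_pi @ r_hid                  # interneurons driven laterally by same-layer pyramidals
    apical = W_td @ r_top - W_ip @ r_int  # mismatch left after lateral cancellation = error signal
    W_up += lr * np.outer(apical, r_low)  # local plasticity gated by the apical error
    return r_hid, apical
```

When the interneuron pathway perfectly predicts the top-down input (the self-predicting state), the apical term vanishes and plasticity stops, which is the intuition behind "no separate phases needed".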
6 | 7 | 8 | #### Keypoints: 9 | 10 | * Hypothesis: Pred errors are encoded at distal dendrites of pyramidal neurons - receive input from downstream neurons - in model: error arise from mismatch of lateral local interneuron inputs (SST - somatostatin) - Learning via local plasticity 11 | 12 | * 3 Compartment Neuron: 13 | * Soma + Integration zones: Basal/Apical - convergence of top-down/bottum-up synapses on different compartments - Larkum (2013): Preferred connectivity patterns of cortico-cortical projections 14 | 15 | * 2nd Population within hidden layer - Interneurons = lateral + cross-layer connectivity: cancel t-d input - only backprop errors remain as apical dendrite activity 16 | * Predominantly driven by same layer but cross-layer feedback provides weak nudge for interneurons = modeled as conduc-based somatic input current 17 | * Modeled as one-to-one between layer interneuron and corresponding upper-layer neuron 18 | * Empirically justified by monosynaptic input mapping experiments: weak interneuron teaching signal 19 | 20 | * Neuron/network Model: 21 | - Simplifications: 22 | 1. Membrane capacity to 1 and resting potential 0; Background activity is white noise 23 | 2. Modeling of layer dynamics - where vectors represent units 24 | 3. No apical compartment in pyramidal output neurons - 3 compartments seem to suffice as comparison mechanism 25 | - Qualitative dynamics: error = apical voltage deflection -> propagates down soma -> modulates somatic firing rate -> plasticity at bottom-up synapses 26 | - Somatic conductance acts as nudging conductance 27 | - Lateral dendritic projections: interneuron is nudged to follow corresponding next layer pyramidal neuron 28 | 29 | * Synaptic learning rules = Dendritic Predictive Plasticity Rules 30 | - Originally: reduction of somatic spiking error 31 | - conductance based normalization of lateral projections based on dendritic attenuation factors of different compartments 32 | - Implementation requires subdivision of apical compartment into two distal parts (t-d input and lateral input from interneurons) 33 | 34 | * Prev work: 35 | * Guergiev: View apical dendrites as integration zones - temp difference between activity of apical dendrite in presence/absence of teaching input = error inducing plasticity at forward synapses. Used directly for learning b-u synapses without influencing somatic activity. HERE: apical dendrite has explicit error represnentation by sim integration of t-d excitation and lateral inhibition - No need for separate temporal phases - continuous operation with plasticity always turned on 36 | * PC based work - Whittington and Bogacz: Only plastic synapses are those connecting prediction and error neurons. HERE: all connections plastic - errors are directly encoded in dendritic compartments 37 | 38 | * Main Results/Experiments: 39 | * Analytic derivation: Somatic MP at layer k integrate feedforward predictions (basal dendritic potentials) and backprop errors (apical dendritic potentials) 40 | * Analytic derivation: Plasticity rule converges to backprop weight change with weak feedback limit 41 | * Random/Fixed t-d weights = FA 42 | * Learned t-d weights minimizing inverse reconstruction loss = TP 43 | * Experiments: 44 | * Non-Linear regression task: Use soft rectifying nonlinearity as transfer fct - Tons of hyperparameters - injected noise current (dropout/regularization effect?) 
45 | * MNIST - Deeper architectures: Use convex combination of learning/nudging 46 | 47 | * General notes: 48 | * Kriegeskorte/DiCarlo/RSA - DNNs outperform alternative frameworks in accurately reproducing activity patterns in cortex - What does this mean? Is DL just extremely flexible/expressive? 49 | * bottom-up = feedforward, top-down = feedback 50 | 51 | 52 | #### Questions: 53 | 54 | * Neural transfer fct = Activation fct! 55 | * Again tons of hyperparameters to be chosen - How? 56 | * Think of learning (accurate gradient approx) vs architecture (depth, number of hyperparameters) complexity 57 | * Different interneuron types (PV = parvalbumin-positive) - different types of errors (generative) 58 | -------------------------------------------------------------------------------- /Biologically_Plausible_DL/Whittington_2019.md: -------------------------------------------------------------------------------- 1 | # Title: Theories of Error Propagation in the Brain 2 | 3 | # Author: Whittington & Bogacz (2019) 4 | 5 | #### General Content: Review article on bio-plausible alternatives to backprop. Goal is not to state that the brain exactly performs backprop but that other rules that achieve similar performance are plausible alternatives. Differentiate between two main classes of models: Temporal-Error (learning via plasticity and target constraining output activity) vs Explicit-Error (learning via Hebbian rule and explicit error computation/tracking) 6 | 7 | 8 | #### Keypoints: 9 | 10 | * Bio-plausible problems with backprop: 11 | * Updates depend on activity/comps of all downstream neurons - non-local 12 | * Symmetry idea of feedforward and back pathways - question of bidirectional connections - still need form of alignment 13 | * Neurons send continuous output (rate-based) - real neurons spike. Also linear summation as aggregation of inputs 14 | 15 | * Alternatives: 16 | * Neuromodulators may guide plasticity by carrying form of global error. Slow and bad scaling. 17 | * Pre-/Post-Synaptic activity: STDP/Pyramidal neurons/cortical microcircuits 18 | * Rely on similar framework: Forward + Back -> energy minimization? 19 | 20 | * Temporal-Error Models: Error is def as difference in activity across time. 21 | * Contrastive learning: Error decomposition into target indep activity part (anti-Hebbian - unlearn associations - only prediction) and target dep activity part (Hebbian - learn new assocs - output layer set to target activity) 22 | * Problem: Need form of phase signal that is global! Info local via oscillatory rhythms such as hippocampal that oscillations? Neurons in output layer are driven by feedforward inputs in one part of cycle and forced to take value of target pattern in other 23 | 24 | * Continuous Update Model: Local plasticity rule based on rate of change of activity. Still requires signal since no plasticity should take during prediction 25 | 26 | * Explicit-Error Models: Close approx of errors that would be computed in backprop setting 27 | * Predictive Coding Model: Error nodes 1-to-1 association with value nodes. During prediction activity is prop between value nodes via error nodes. Convergence to equilibrium in which the nodes decay to zero and all value nodes have same values as ANN. During training error nodes dont go to zero but to backprop error - adapt weights using Hebbian rule. 28 | * Problem: Inconsistent 1-to-1 connectivity 29 | 30 | * Dendritic Error Model: Error representation in apical dendrite. Local error computation via cortical microcircuits, i.e. 
comparison with lateral inhibitory interneuron activity. Plasticity guidance via plateau potentials. 31 | * PCs are excitatory <- negative (inhibitory) input is provided by interneurons - hard to train their weights - need higher-level projections 32 | * Show form of intuitive equivalence to backpropagation - like Sacramento 33 | 34 | * Computational comparison: 35 | * T-E: Mechanism for informing whether target pattern constrains output neurons (phase signaling) - not needed for E-E 36 | * Predictive Coding needs to propagate twice as many neurons (travel via error neuron)! - Evolutionary benefit for fewer neurons and faster propagation 37 | * Dendritic model: Learning all connections in parallel can lead to problems. Ideally first learn inhibitory neurons 38 | 39 | * Experimental data: No evidence for separate error neuron population 40 | 41 | * Vogels et al (2011) - learning rule in which the direction of modification depends on neurons being in equilibrium, as in the PC model, can arise from an alternate form of STDP 42 | 43 | * Martinotti cells - receive input from pyramidal neurons in same cortical area and project to their apical dendrite 44 | 45 | * Question of speed: Pred Coding - slow prediction; Dendritic error model - slow training (but can get faster with time) 46 | 47 | * Equilibrium propagation: Energy-based models and minimization of free energy - entropy/ELBO optimization. Markov blanket ideas - strong connectivity leads to similar activity 48 | 49 | #### Questions: 50 | * Robustness without symmetry in forward and backward weights? 51 | * Efficient one-shot learning support? 52 | * Anti-Hebbian <-> Hebbian: Relationship to target/forward phase in Guergiev et al (2017) 53 | -------------------------------------------------------------------------------- /Deep Learning/Gal_2015.md: -------------------------------------------------------------------------------- 1 | # Title: Dropout as a Bayesian Approximation: Insights and Applications (2015) 2 | 3 | # Author: Gal & Ghahramani 4 | 5 | #### General Content: 6 | 7 | Introduction of one of the first Bayesian frameworks for Deep Learning. The GP perspective allows one to formalize the uncertainty over parameters. One does not need to compute the exact GP but instead is able to approximate the covariance function via Monte Carlo integration. 8 | 9 | #### Keypoints: 10 | 11 | Authors show that an MLP with dropout at each layer is equivalent (in terms of objective function) to Approximate Variational Inference for Deep GPs. The analogy to Gaussian process regression is a possible explanation for why Deep NNs with dropout generalize well. Furthermore, we are able to reason about uncertainty over the complicated extracted features. Authors suggest that one has to put an approx variational distribution over the bias vector and not only the weights. 12 | 13 | #### Summary: 14 | 15 | * Dropout seems to be related to SGD on an L^2-penalized error function. 16 | * Show that the dropout objective minimizes the KL divergence between the approx. model and a deep Gaussian process. 17 | * Important to readjust after dropout layer by rescaling the weights by the inverse dropout probability 18 | * View MLP as one deeply composed function: In the GP setup non-stationary covariance functions/kernels correspond to hyperbolic tangent or ReLU MLPs. -> indirect assumption on the smoothness of the function 19 | * Don't perform the full GP, which would take an O(n^3) matrix inversion. Instead one approximates the full/true GP to obtain a manageable time complexity.
20 | * Minimising the KL div <-> maximising the log evidence lower bound. Maximisation leads to a variational distribution which explains the data well while still not deviating too much from the prior distribution. 21 | * Approximate the exact covariance function by MC integration. The paper obtains an approximate objective function for this "deep GP NN version" and shows equivalence of the objective functions. 22 | * One can't evaluate the KL div term between a mixture of Gaussians and a single Gaussian analytically! 23 | * Common knowledge: doubling the learning rate of the bias during MLP optimisation works very well in practice with dropout nets with p=0.5 24 | * MC dropout: Estimate the dropout model using T forward passes through the net and averaging the results - model averaging. 25 | 26 | #### Questions: 27 | 28 | * How does the rescaling work? - only at test time and not during training 29 | * How does one optimise the dropout proportion? 30 | * How does the model assumption (section 2.3) of one latent set of variables, which acts as sufficient statistics, affect the model flexibility/generative model? Common assumption for VI? 31 | 32 | #### Other: 33 | 34 | * Stationary Kernel: A function whose length scale does not change throughout the input space will be well modeled by a GP with a stationary kernel - the kernel can just be written as a function of x - x' 35 | 36 | -------------------------------------------------------------------------------- /Deep Learning/Gal_2016.md: -------------------------------------------------------------------------------- 1 | # Title: On Modern Deep Learning and Variational Inference 2 | 3 | # Author: Gal & Ghahramani (2016) 4 | 5 | 6 | #### General Content: 7 | Modern deep learning models can be formulated as performing approx. variational inference in a Bayesian setting. Paper extends Gal & Ghahramani (2015), which showed that an MLP with dropout at every layer corresponds to approx. VI in Deep GPs, to a more general Bayesian NN/deep learning models setting. Furthermore, they show that any form of stochastic regularization in arbitrary NN structures can be viewed as approx. variational inference in Bayesian Nets. Ultimate goal: Synthesize the worlds of stats/VI and deep learning in order to form a mathematical grounding. 8 | 9 | 10 | #### Keypoints: 11 | 12 | * Show that **any form of stochastic regularization in arbitrary NN structures can be viewed as approx. variational inference in Bayesian Nets**: Almost all deep learning models perform approx. Bayesian inference to capture the latent stochastic process underlying the data. 13 | * Allows us to reason about the full DL framework in a Bayesian manner: assess model uncertainty, model comparison and uncertainty over features/inference. 14 | 15 | #### Summary: 16 | 17 | * Want to be able to reason about parameter and *structural* uncertainty of neural nets. 18 | * **Multivariate Gaussian noise**: alternative to dropout. Does not sample Bernoullis but N(1,1) draws and multiplies this random sample with the weights. Slows training down more than Bernoullis because it is harder to sample from the Gaussian. 19 | * **Metaphorical interpretation of dropout**: Sexual reproduction - many sperm cells are "sent" and only one is needed to impregnate the woman = unnecessary redundancy in the data. Cannot be proven since Gaussian noise (which gives non-exactly-0 draws) works similarly well. 20 | * CNNs only work when dropout is only used for the non-convolutional (fully connected) layers.
Possible reason: less co-adaption in convolution layers, already sparse -> no dropout needed. 21 | 22 | * Here: Derivation focuses on multiclass-classification problem and they derive an intractable GP posterior which afterwards is approximated by introducing a variational distribution over the latent variables. 23 | * Minimization of KL div between variational distribution and parameter posterior is equivalent to maximising the log evidence lower bound. 24 | * Gal & Turner (2015): Show that one is able to approximate the exact GP by defining an approx. distr. over spectral frequencies and their coefficients in a Fourier decomposition if the function. 25 | 26 | * **Structural Formulations**: 27 | * Different nonlinearity-regularisation techniques correspond to different GP kernels, which have different impact on the predictive uncertainty! 28 | * Multiple variables: Apply kernel the concatenated matrix of variables 29 | * Two seperate activations for two different variables: Summing of two Kernels 30 | 31 | * **Bayesian Neural Nets**: 32 | * Plaxe a prior distr. over the weights - often matrix Gaussian prior 33 | * Interested in finding most likely weights given the data -> use VI to compute approx. to otherwise intractable integral. 34 | * Use Monte Carlo integration to approximate the approximation of the KL divergence 35 | * Suggests that stochastic regularisation techniques work well since they approximately integrate over the model parameters! 36 | * **MC dropout**: Make predictions using the predictive distr. approximation with VI-obtained MC integrated posterior! 37 | 38 | 39 | #### Questions: 40 | 41 | * What are spectral frequencies? 42 | * It should be easily possible to find/create new application specific stochastic regularisation variants. Enforce correlations between the weights. 43 | * Extract Bayesian inference knowledge about the GP - how can we learn something from NNs? 44 | -------------------------------------------------------------------------------- /Deep Learning/Goodfellow_2016.md: -------------------------------------------------------------------------------- 1 | # Title: Deep Learning 2 | 3 | # Author: Goodfellow et al. (2016) 4 | 5 | ## Chapter 6: Ch. 6: Deep Feedforward Networks 6 | 7 | 8 | #### General Content: 9 | 10 | 11 | #### Keypoints: 12 | 13 | 14 | #### Summary: 15 | 16 | 17 | #### Questions: 18 | -------------------------------------------------------------------------------- /Deep Learning/Spoerer_2017.md: -------------------------------------------------------------------------------- 1 | # Title: Recurrent Convolutional NNs: A Better Model of Biological Object Recognition 2 | 3 | # Author: Spoerer et al (2017) 4 | 5 | #### General Content: Introduce (?) CNN architecture with lateral as well as recurrent connections. Hypothesize that recurrence is not necessarily used to deal with Gaussian noise but with occlusions. Introduce two new dataset augmentations (debris and jitter). Debris = random crop overlays with task of single digit recognition, Jitter = Multiple digits with task of recognizing all of them. Use deconvolution to deal with recurrent dim inbalance. Train recurrent net over 4 timesteps and learn via BPTT. Performance comparison robustness check with permutation test. Lateral connections only help with strong debris. 6 | - Extension of original R-CNN work by Liang & Hu (2015) and Liao and Poggio (2016) 7 | 8 | #### Keypoints: 9 | 10 | * Motivation: Ventral visual pathway with lateral and feedback connections. 
Feedforward models vary successful - underexploration of recurrence. Visual processing unfolds over time. 11 | * Fast local recurrent processing =/= attention 12 | * Occlusion slows down behavioral responses - recurrent processing? = Competitive processing 13 | * Occlusion - question of border ownership (cells) => require info from outside their classical receptive fields and signals are delayed relative to initial feedforward input - again suggest recurrent processes 14 | * Hypothesis: Recurrence provides no clear benefit for solely processing noise since linear filter which is learnable in NN can already deal with it! Recurrence useful across a wider range! 15 | 16 | * Occlusion Datasets: 17 | * Debris: Random crops from randomly selected digts add mask and overlay - summing overall features is bad strategy. 18 | * Jitter: Multiple images sequentially placed in the image - overlap with relative depth order 19 | * Preprocessing: Pixel-wise normalization (mean/std) for each pixel from stats computed across the entire datasets 20 | 21 | * Architecture: 22 | * 4 different architectures: 23 | 1. B - Pure feedforward 24 | 2. BT - Feedforward + Top-down 25 | 3. BL - Feedforward + Lateral (recurrence) 26 | 4. BLT 27 | * Define pre-activations for all 4 for each layer, pixel, time point as well as feature map 28 | * Pass pre-activations through local response normalization and ReLU 29 | * Need to control for comparable complexity of the models also generalization to case without occlusion: Increase the number of parameters via either number of b-u connections (B-F = better match in terms of params) or large kernel sizes (B-K = more plausible) 30 | * General architecture: 2 hidden recurrent layers + Readout at every step - problem with top-down connections - dim mismatch need deconvolution/transposed convolution to size up again. Zeiler et al (2011) 31 | * Unrolled for 4 time steps on full image! - Why 4 timesteps? 32 | * At each time step feed in the image and make a readout 33 | * Train via BPTT - Error is propagated throught time for each time point - network trained to converge as soon as possible, rather than final step - As compared to spiking Joram case where network has settling time steps and is only propagated from final loss! 34 | * Accuracy measured only at final time step - Loss function then defined sum over time steps? Sum over timesteps and final backprop at the end vs backprop for each current loss at t backwards? 35 | * Deconvolution: Normal conv layer with stride where input and output sides of layer have been switched 36 | 37 | * Model comparison: 38 | * Pairwise McNemar's test (prediction dependence correction), False discovery rate, Benjamini-Hochberg correction 39 | * Robustness check with permutation test - form of linear regression of test errors on debris/cropping strength 40 | 41 | * Experimental results 42 | * Lateral connections only help with strong debris 43 | * Support for RNNs also being better in OCR tasks without noise/occlusion 44 | * Robustness of RNNs - do simple ffw networks overfit? Is this overfitting? More like learning a specific task under quality conditions 45 | 46 | #### Questions: 47 | 48 | * Extensions to neural data and reaction time distributions 49 | * How much recurrence do we need? 50 | * Why do people in ML dont do more multiple comparison checks to robustify their performance gains? 51 | * Read up on deconvolutions = transposed convolution! 52 | * Recurrence vs larger receptive fields: Similar to DRQN vs frame concatenation discussion. 
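For concreteness, a minimal sketch of the bottom-up + lateral (BL) variant, assuming PyTorch; the channel counts, single recurrent layer, global-average readout and summed per-step loss are illustrative simplifications of the architecture described above.

```python
# Recurrent convolutional (BL) layer unrolled over four time steps (sketch, assumption: PyTorch).
import torch
import torch.nn as nn

class BLLayer(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.bottom_up = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)  # feedforward drive
        self.lateral = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)   # recurrent lateral drive

    def forward(self, x, h_prev):
        pre = self.bottom_up(x)
        if h_prev is not None:
            pre = pre + self.lateral(h_prev)
        return torch.relu(pre)

class BLNet(nn.Module):
    def __init__(self, n_classes=10, timesteps=4):
        super().__init__()
        self.timesteps = timesteps
        self.layer = BLLayer(1, 32)
        self.readout = nn.Linear(32, n_classes)

    def forward(self, x):                    # x: (batch, 1, H, W), fed in at every step
        h, logits = None, []
        for _ in range(self.timesteps):
            h = self.layer(x, h)
            logits.append(self.readout(h.mean(dim=(2, 3))))  # readout at every time step
        return logits

# One plausible training setup: sum the cross-entropy losses over the per-step readouts
# and backpropagate through time; accuracy is then reported at the final time step.
```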
53 | -------------------------------------------------------------------------------- /Formal Grammars/2001_Hale.md: -------------------------------------------------------------------------------- 1 | # Title: A Probabilistic Earley Parser as a Psycholinguistic Model 2 | 3 | # Author: John Hale (2001) 4 | 5 | #### General Content: The cognitive load of language comprehension can be expressed as the total prob of structural options that have been disconfirmed at that point in the sentence - paths in the tree that have been traveled unsuccessfully. This can again be formulated as the surprisal of a word given its predecessors. The measure can be efficiently calculated using a probabilistic version of the Earley parser from a probabilistic phrase-structure grammar. Is able to account for Garden Path sentences. 6 | 7 | 8 | #### Keypoints: 9 | 10 | * Garden Path Sentence: Grammatically correct sentence that starts in such a way that the reader's most likely interpretation will be incorrect. Reader is deceived into a dead-end parse. Sentence creates temporary ambiguity/multiple interpretations. 11 | * Relationship between knowledge of grammar and its application in perceiving syntactic structure - 3 principles: 12 | 1. Parser - Grammar: Strong competence - parser uses rules of grammar 13 | 2. Frequency affects performance - stat theory of language performance 14 | 3. Eager sentence processing: unrushed experimental setup 15 | 16 | * Stolcke's prob Earley parser: Way to use hierarchical phrase structure in a language model such as an n-gram 17 | * Prob CFG: sentence prob - product of all rules used to generate 18 | * multiplication - assumption that rule choices are indep 19 | * consistency: does grammar assign non-0 prob to inf recursion? 20 | * problem: only assigns probs to complete sentences of language 21 | * Compute prefix prob at each word - sum of probs of all derivs whose yield is compatible with string seen so far 22 | * 1 - prefix prob = prob of all other derivs that have been disconfirmed (given consistency) 23 | 24 | * Earley parser: top-down - propagation of prediction back up through a set of states representing hypotheses the parser is using about the structure of the sentence. 25 | * collection of states - chart - tree set 26 | * state: 27 | * current input string position processed so far 28 | * grammar rule 29 | * dot-position in rule: how much of rule recognized 30 | * left-most edge of substring rule generates 31 | * functions: 32 | * predict, scan, complete 33 | * S $\to$ predict (add new states for prod. rule following S) $\to$ scan (check input string. If match move dot) $\to$ complete (propagate change through graph) 34 | * Earley paths <-> correspondence with derivations 35 | * Stolcke: add two infos to each state: $\alpha$ - prefix prob and $\gamma$ - inside probability 36 | * Earley path: sequence of Earley states linked by three ops 37 | * State completion: bottom-up confirmation - $\alpha$ unites with top-down prediction $\gamma$ 38 | 39 | * Total parallelism theory: Entire set of trees compatible with input is maintained somehow from word-to-word 40 | 41 | * Cognitive effort expended to parse prefix - prop.
- total prob of all structural analyses which cant be compatible with observed prefix 42 | * generate predictions about word-by-word reading times by comparing total effort expended before some word to the total effort after 43 | * assumption: probs of PCFG rules = statements of how difficult it is to disconfirm each rule 44 | * $log(\frac{\alpha_{n-1}}{\alpha_n})$ - surprisal: combined difficulty of disconfirming all disconfirmable structures at a given word 45 | 46 | * Garden paths are points where the parser can disconfirm alternatives that together comprise a great amount of probability 47 | * effect disappears when words intervene that cancel reduced relation interpretation early on 48 | * Evidence for total-parallelism parsing theory 49 | 50 | #### Questions: 51 | 52 | * How to incorporate this into option/substructure discovery problem? - options over options - specific attention to beginning 53 | * Info-theoretic based grammar extraction - IGGI -------------------------------------------------------------------------------- /Formal Grammars/2016b_Siyari.md: -------------------------------------------------------------------------------- 1 | # Title: Lexis: An Optimization Framework for Discovering the Hierarchical Structure of Sequential Data 2 | 3 | # Author: Siyari, Dilkina, Dovrolis (2016) 4 | 5 | #### General Content: 6 | Propose algorithm that constructs an optimized hierarchical representation of a given set of target strings, the Lexis-DAG. The DAG displays how to derive each target string through the concatenation of intermediate substrings by minimizing the total number of such concatenations or DAG edges. 7 | 8 | 9 | #### Keypoints: 10 | 11 | * Problem is related to the SGP which is NP hard. The authors also prove this for the Lexis optimization problem and propose a greedy algorithm which efficiently constructs the DAG. 12 | 13 | * Problem of construction is a synthetic design problem: construction of min cost DAG which shows how to produce a given set of targets from a given alphabet in hierarchical manner, through the construction of intermediate substrings that are re-used in at least two higher-level strings. 14 | * Cost of DAG <-> related to concatenation work that corresponding hierarchy would require. 15 | 16 | * Optimized DAG can be thought of as a plausible hypothesis for the unknown process that created given targets as long as we have reasons to believe that the process cares to minimize the same cost function that the DAG optimization considers. 17 | 18 | * 2 cost functions (turn out to be algorithmically identical): 19 | * Min total number of concatenations 20 | * Min number of DAG edges 21 | * Which one to use is application specific - Lexis is NP hard for both 22 | 23 | * Intermediate nodes/core: minimal set of DAG nodes that can cover a given fraction of s-to-t paths. Represents the most central substrings in corresponding 24 | 25 | * Lexis-DAG - $D(V,A)$ with 3 conditions: 26 | 1. $v \in V$ represents string $S(v)$, $V_S$ sources, $V_T$ targets, $V_M$ intermediate nodes. $V = V_S \cup V_M \cup V_T$ 27 | 2. Each node in $V_M \cup V_T$ represents a string that is the concatenation of substrings, $d_{in}(v)$: # incoming edges for v, $d_{out}(v)$: # outgoing edges for v 28 | 3. 
Lexis-DAG should only include intermediate nodes with 2 outgoing edges - re-used in at least two concatenation operations 29 | 30 | * Lexis Optimization Problem: Construct min-cost Lexis DAG for given alphabet S and targets T for given cost function: 31 | 32 | $$\min_{(E, V_M)} C(D) s.t. D=(V, E) \text{ is Lexis-DAG for S and T}$$ 33 | 34 | * No explicit min of nodes in $V_M$ - but implicit through cost functions 35 | * Edge costs: $\mathcal{E}(D) = \sum_{v \in V} d_{in}(v) = |E|$ 36 | * Concatenation costs: $C(D) = \sum_{v\in V\setminus V_s} (d_{in} -1) = |E| - V\setminus V_s$ 37 | * Both problems are NP-hard 38 | 39 | * G-Lexis idea: Search for substring $\xi$ that will lead to max cost reduction, when added as new intermediate node. Algo starts from trivial Lexis-DAG with no intermediate nodes and edges from the source nodes representing alphabet symbols to each occurance in target 40 | 41 | * $I(v)$: Sequence of nodes appearing in incoming edges of v 42 | * $\leftrightarrow$ sequence of nodes whose string concatenation results in string $S(v)$ represented by v 43 | * $\leftrightarrow$ strings of alphabet of Lexis-DAG 44 | * Look for repeated substring $\xi \in I_{T \cup M} = \{I(v | v) \in V_T \cup V_M\}$ that can be used to construct new intermediate node 45 | * Can construct new intermediate nodes for $\xi$, create incoming edges based in symbols in $\xi$ and replace incoming edges to each of the non-overlapping repeated occurances of $\xi$ with a single outgoing edge from the new node. 46 | 47 | Algorithm: 48 | 49 | 1. Init $V \leftarrow V_T \cup V_S$ and E, constructing each target in T from characters in S. $V_M \leftarrow \emptyset$. 50 | 2. Repeat: 51 | * $I_{T \cup M} \leftarrow \{I(v | v) \in V_T \cup V_M\}$ 52 | * Select $\xi$ with max $(R_{T \cup M, \xi} -1)(|\xi| - 1)=0$, where $R_{T \cup M, \xi}$ is number of repeats $\xi$ in $I_{T \cup M}$. 53 | * if $(R_{T \cup M, \xi}-1)(|\xi| - 1) = 0$, break. Terminate when there are no more substrings with length at least 2 and which are repeated at least twice. 54 | * $V \leftarrow V \cup \{\sigma_{\xi}\}$ where $\sigma_{\xi}$ is new intermediate node and update E accordingly. 55 | 56 | * Substring that max saved cost is a max repeat: substrings of length at least 2, whose extensionto right or left would reduce its occurences in the given set of strings 57 | * Suffix tree over set of input strings captures all right-max repeats which are superset of all mac repeats. To pick the one with max saved cost we need the count of non-overlapping occurences of these substrings - minimal augmented suffix tree 58 | * Here: implementation using regular suffix tree - Iterate over all occurances of selected substring, skipping overlapping occurances 59 | * $O(L) \text{ vs } O(L \log L)$ where L is total length of target strings 60 | * Overall runtime: $O(L^2)$ since max # iterations is $O(L)$ (each iteration reduces number of edges/concats which at start is $O(L)$) 61 | 62 | * After construction: rank constructed intermediate nodes in terms of significance or centrality 63 | * Dependency chain: the higher the number of s-to-t paths traversing an intermediate node v (path centrality), the more important v is in terms of number of dependency chains it participates in. 
64 | * Core of Lexis-DAG: Set of intermediate nodes that represent, as a whole the most important substrings in that Lexis-DAG 65 | * Should include nodes of high path centrality 66 | * Almost all s-to-t dependency chains of Lexis-DAG should traverse at least one of the core nodes 67 | 68 | * Data compression: Looks for regularities that can be used to compress the data - patterns often useful as such regularities 69 | 70 | * Minimum Description Principle: Compression scheme that results in smallest size for joint representation of both dictionary and encoding of data using that dictionary 71 | 72 | 73 | #### Questions: 74 | -------------------------------------------------------------------------------- /Formal Grammars/2018_Schoenhense.md: -------------------------------------------------------------------------------- 1 | # Title: Data-efficient inference of hierarchical structure in sequential data by information-greedy grammar inference 2 | 3 | # Author: Schoenhense and Faisal 4 | 5 | #### General Content: Combat "overfitting" of greedy grammatical inference algos such as sequitur by introducing decision rule based on Shannon entropy decrease. This way they introduce form of regularizer. Show that IGGI does not fit noise when sequence is generated without any underlying hierarchical rules. 6 | 7 | 8 | #### Keypoints: 9 | 10 | * Algos: Sequitur, MDLcompress, G-Lexis 11 | * Iterative repeat replacement (IRR): Longest-match, byte-pair coding, compressive 12 | * CFG + MDL = grammar-based universal source code 13 | * Info theory approach: More symbols used = more info needed to give meaning to symbol - average info/entropy needed per symbol increases 14 | * Idea: Greedily + iteratively replace bigrams by minimizing info content of coding string 15 | * New definition of size of grammar: Previously - based on concatenated string of all symbols needed to write out production rules and coding sequence - now: based on info content 16 | * Testing on random data shows: no overfitting of noise - compression ration around 1 -------------------------------------------------------------------------------- /Free-Energy Principle/Friston_2010.md: -------------------------------------------------------------------------------- 1 | # Title: The free-energy principle: A unified brain theory? 2 | 3 | # Author: Karl Friston (2010) 4 | 5 | #### General Content: 6 | 7 | THE interdisciplinary introduction to the FEP. Biological agents must minimimize there the long-term average of surprise (free energy as the upper bound!) to ensure their sensory entropy to remain low. Biological systems violate the fluctuation theorem (generalization of second law of thermodynamics) -> probability of entropy decrease becomes exponentially smaller with time. 8 | 9 | 10 | #### Keypoints: 11 | 12 | * FEP is a very flexible/diverse model that allows to model/account for many different brain structures/functions. Thereby it allows us to unify multiple different neuroscientific theories: Bayesian brain hypothesis, Efficient coding, Predictive coding, synaptic pruning, cell assembly theory, attention, neural Darwinism, RL and value learning, optimal control and DP 13 | * 3 PoVs: 14 | 1. Optimizing brain states makes the representation an approx conditional density on the causes of sensory input. This enables action to avoid suprising sensory encounters. 15 | 2. Optimization of sufficient statistics: Free energy = surprise + KL(recognition density || posterior) -> Minimization leads to a better approximation 16 | 3. 
Optimization of actions: Free energy = mixture of accuracy and complexity -> action can only affect accuracy -> active learning to minimize prediction error. Selective sampling of sensory inputs that the brain expects. 17 | * The long-term goal of homeostasis leads to a short-term avoidance of surprise. 18 | 19 | #### Summary: 20 | 21 | * FEP: Any self-organizing system that is at equilibrium with its environment must minimize its free energy (not equiv. to variational free energy!!). 22 | * Entropy: average self-info or surpirse -> agent dependent!! 23 | * Low entropy density = An outcome is relatively predictable = Uncertainty Measure 24 | * Free energy: Measure that bounds surprise on sampling events given a generative model 25 | * Lower bound on log model evidence[]() 26 | * Homeostatsis: Process of self-system-regulation <-> internal system remains in bounded states 27 | * Surprise: negative log-prob of outcome. Improbable outcome is surprising. 28 | * Attractor: Set (of points/states) to which a dynamical system evolves after a certain amount of time <-> Statitionary State? 29 | * Recognition density: Approx prob distribution of causes of the data -> product of inverting the generative model 30 | * Bayesian Brain: Brain as inference machine that actively predicts and explains sensations. Perception: Process of inverting the generative model to access the posterior of the causes given sensory input. 31 | * Hierarchical generative models: Min Free energy <=> optimization of empirical priors. Optimization makes every level in hierarchy accountable to others - leads to internally consistent representation of sensory causes at multiple levels of descriptions. 32 | * Laplace Assumption: Free energy = difference between model prediction and sensations/predicted representations - saddle-point approximation of integral of exp function which uses 2nd order taylor - Gaussian Approximation? 33 | * Predictive Coding: min free energy corresponds to explaining away prediction errors 34 | * Principle of efficient coding - Barlow's redundancy reduction principle - infomax principle: brain optimizes the mutual info between sensorium and its internal representation 35 | * Cell assembly theory by Hebb - Plasticity: Groups of interconnected neurons are formed through strengthening of synaptic connections that depends on correlated pre- and postsynaptic activity. 36 | * Correlation theory - Metaplasticity: Selective enabling of synaptic efficacy and its plasticity by fast synchronous activity induced by different perceptual attributes of the same object. 37 | * Hebb plasticity and predictive coding - connected by delta learning rule 38 | * GD on free energy <-> Hebbian plasticity 39 | * Biased competition and attention: neuromodulators - prediction errors with high precision have greater impact on units that encode conditional expectations 40 | * Optimization of expected precision in terms of synaptic gain links attention to synaptic gain and synchronisation 41 | * Neural Darwinism: 42 | 1. Epigenetic mechanisms create primary repertoire of neuronal connections, which are refined by experience-dependent plasticity to produce a secondary repertoire of neuronal groups. 43 | 2. These are selected and maintained through reentrant signalling among neuronal groups. Value modulates the plasticity. 44 | 3. Value is signalled by ascending neuromodulatory transmitter systems and controls which neuronal groups are selected and which not. 45 | 4. 
The capacity of value to do this is assured by natural selection, in the sense that neuronal value systems are subject to selective pressure. 46 | * Value is inversely prop to surprise 47 | * Prior expectations describe small set of states in which we expect high value. 48 | 49 | #### Questions: 50 | 51 | * How does entropy relate to tail-thickness and concentration bounds? Are we bounding the KL divergence? Or the distance between the two densities (true generative model vs predicted model)? -------------------------------------------------------------------------------- /Free-Energy Principle/Limanowski_2013.md: -------------------------------------------------------------------------------- 1 | # Title: 2 | 3 | # Author: 4 | 5 | #### General Content: 6 | 7 | 8 | #### Keypoints: 9 | 10 | 11 | #### Summary: 12 | 13 | 14 | #### Questions: 15 | -------------------------------------------------------------------------------- /Free-Energy Principle/Ostwald_2015.md: -------------------------------------------------------------------------------- 1 | # Title: The Free Energy Principle for Perception - An Introduction 2 | 3 | # Author: Ostwald (2015) 4 | 5 | #### General Content: 6 | 7 | General methodological introduction to the variational Bayes framework for the Free Energy Principle 8 | 9 | 1. Intro FEP 10 | 2. Math 11 | 3. Parametric Bayesian Inference, Information Theory, Variational Bayes 12 | 4. Free-form mean-field variational Bayes for univariate Gaussian models 13 | 5. Fixed-form mean-field variational Bayes for static nonlinear models 14 | 6. Fixed-form mean-field variational Bayes for dynamic nonlinear models 15 | 16 | #### Keypoints: 17 | * Chapter 1: 18 | * FEP: Neurobiological Interpretation of application of deterministic approx. Bayesian inference methods to nonlinear hierarchical random dynamical systems. 19 | * Biological agents minimize dispersion/entropy of their interoceptive and exteroceptive states. Minimizing state dispersion in long run results in mim of surprise in every point in time. 20 | * Min of variational energy under hierarchical models encoded by adjustment of agent's internal states is known as Bayesian filtering and leads to neurobiological predictive coding schemes. 21 | 22 | * Chapter 2: 23 | * Review Gaussian Trafo theorems and proofs (Bishop Ch. 2) 24 | * Completing the square theorem - allows identification of parameters of Gaussian density based on quadratic form 25 | * Gamma distribution and different parametrizations - exponential distribution 26 | * Jacobian: Matrix of all first-order partial derivative of a vector-valued function 27 | * Linear approx of vector fields - Use Jacobian for the first derivative analog of the Taylor series approximation 28 | 29 | * Chapter 3: 30 | * Bayesian Inference: Identification of Log model evidence and posterior distribution for a Generative Model 31 | * Bayesian Parameter Estimation: Not only interested in single point estimate but full probability distribution over the possible parameter values. The conditional distribution is inherent in the specification of the generative model and is thus not created once the data is observed! 32 | 33 | 34 | #### Questions: 35 | 36 | * Heterogenous agents - how can we model risk-aversion e.g.? In Macro models this is specified by concavity of the utility functions and the derived optimal actions. Or by the future discount factor. 
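The identity that both the FEP notes above and the variational Bayes chapters lean on can be stated in one line (standard decomposition, written here for a generic observation $y$ and latent causes/parameters $\vartheta$):

$$
F(q) = \mathbb{E}_{q(\vartheta)}\big[\ln q(\vartheta) - \ln p(y, \vartheta)\big] = -\ln p(y) + \mathrm{KL}\big(q(\vartheta)\,\|\,p(\vartheta \mid y)\big) \geq -\ln p(y)
$$

Minimising $F$ with respect to the recognition density $q$ therefore tightens the bound on surprise $-\ln p(y)$ and pulls $q$ toward the true posterior; $-F$ is the log evidence lower bound (ELBO) referred to throughout these notes.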
-------------------------------------------------------------------------------- /Hierarchical Reinforcement Learning/1997_McGovern.md: -------------------------------------------------------------------------------- 1 | # Title: Roles of Macro-Actions in Accelerating Reinforcement Learning 2 | 3 | # Author: McGovern et al (1997) 4 | 5 | #### General Content: 6 | 7 | Macro-actions are a form of commitment defined by closed-loop policies with termination condition. The appropriateness of construction determines whether or not they make learning faster or slow it down. The influence can be broken down into 2 parts: Effect on exploration and effect on value information propagation/learning. Authors derive basic experiments to test both effects and show that value backprop seems more important. 8 | 9 | #### Keypoints: 10 | 11 | * Macro-operators: common in robotics and to aid state-space search 12 | * Original literature: Korf (1985) and Iba (1989) - here: application to RL 13 | * Macros: Compress effective length of solution by chunking together primitive actions. 14 | * Macro Def.: Agent can choose macro or primitive action from any state unless agent is already executing a macro-action. Once agent has chosen a macro it must follow actions defined by macro's policy until the termination condition is satisfied/goal is reached. 15 | * The set of available macro-actions often depends on current state of agent. 16 | 17 | * **Effect on Exploration** 18 | * Bias behavior of agent so that he spends most of his time in specific regions of state space. 19 | * Authors run experiment where actions are uniformly chosen ("random walk") among primitives and macros 20 | * Agent spends most of time in area where macros lead him to (obvious somewhat) 21 | 22 | * **Effect on Propagation of Action Values** 23 | * Macro action values affect rate of action-value backup 24 | * Q-Learning: values propagate backwards one step at a time 25 | * Macro-Q: value info can propagate over several time steps 26 | * When macro takes agent to a good state, the corresponding value is updated immediately with useful info, despite fact that original state is several primitive actions away from good state 27 | * Again random walk experiment: value flows further back in state space - mix of macro/Q 28 | 29 | * Macros only make learning slower if they alone cant bring the agent to the goal location 30 | * This is always fulfilled when working with straight-line grammars! 31 | -------------------------------------------------------------------------------- /Hierarchical Reinforcement Learning/1998_McGovern.md: -------------------------------------------------------------------------------- 1 | # Title: Macro-Actions in Reinforcement Learning: An Empirical Analysis 2 | 3 | # Author: McGovern and Sutton (1998) 4 | 5 | #### General Content: 6 | 7 | Extend 1997 paper with comparison to eligibility traces. Authors show that Macro-Q converges faster to optimal policy while $Q(\lambda)$ converges faster to optimal action values (accounts for all Q-values and not only relative ordering). 8 | 9 | #### Keypoints: 10 | 11 | * Eligibility traces and macros both provide mechanism for speeding up value propagation. 12 | 13 | * Eligibility traces: Each state-action pair is marked as eligible for backup with a trace indicating how recently it has been experienced. 14 | * On each step the values of all state-action pairs are updated in porportion of their E traces at the time. 
15 | * Because many recent state-action pairs have non-zero traces, value info is propagated backwards many steps 16 | 17 | * $Q(\lambda)$ disseminates value info at an even more rapid rate at the cost of getting the policy right. 18 | * Authors propose to combine both approaches - when is Q more important than policy? 19 | 20 | * "Large scale" experiment with random walk robot - random walk exhibits higher variance in traces when only choosing among primitive actions 21 | 22 | #### Questions: 23 | 24 | * Read more on eligibility traces - try to merge! 25 | -------------------------------------------------------------------------------- /Hierarchical Reinforcement Learning/2001_Mcgovern.md: -------------------------------------------------------------------------------- 1 | # Title: Automatic Discovery of Subgoals in Reinforcement Learning using Diverse Density 2 | 3 | # Author: Amy McGivern and Andrew Barto (2001) 4 | 5 | #### General Content: Discover subgoals online based on commonalities across multiple paths to solution. View problem as multiple-instance learning and use diverse density approach to solve it. 6 | 7 | 8 | #### Keypoints: 9 | 10 | * Bottleneck = region in observation space that is visited often on successful paths but not on unsuccessful paths. Interested in "early" (in successful traces) bottlenecks that are persistent throughout learning. 11 | * Idea: Essentially fix exploration tree up to subgoal and continue exploration from there on! 12 | * Multiple-instance learning: Supervised learning problem - system attempts to identify target concept on basis of "bags" of instances (e.g. successful and unsuccessful traces) 13 | * Option = macro but without fixed sequence of actions but policy which allows to interact with env 14 | * 2 room problem - 2 strongly connected regions. Random exploration spends most time within component - unlikely to leave! But bottleneck connects! 15 | * Simple approach to options: randomly generate based on heuristic and add to set of actions - problem: too many actions - bad exploration - need more structure 16 | * Alternative - visitation/count-based approaches: Only use first visitation - agent spends little time in actual bottleneck and more in component 17 | * Problems: Noisy process, hard to generalize to continouos/large state spaces, no incorporation of negative feedback 18 | 19 | * Multiple-instance learning - Diet- terich et al. (1997) 20 | * Positive bag: Contains at least one positive instance from "target concept" - bottleneck 21 | * Negative bag: Contains only negative instances 22 | * Learn concept based on evidence collected from different bags 23 | * Each trajectory defines a bag - observation vectors correspond to instances within bag 24 | * Bottleneck = Target concept - agent experiences region somewhere on every successful trajectory and not on all unsuccessful ones 25 | 26 | * Diverse Density (DD) approach - Maron, 1998; Maron & Lozano-Pe ́rez, 1998 27 | * Most diversely dense region in feature space = region with instances from most positive bags and least negative bags 28 | * DD = posterior of state being concept given positive and negative bags 29 | * Probability of state being in target concept - gaussian based on distance from particular distance to the target concept 30 | * Concept with max DD value is output of DD search - for small state spaces use exhaustive search 31 | * Can use abstract notions of concepts. 
Most simple: Individual states 31 | 32 | * Option Construction: 33 | * Detect regions which appear early and persist as peaks - keep running average of how often each state appears as a peak - Init to 0 for each state 34 | * Convergence to series ratio 35 | * I (input set): if target concept c is reached at t, add all visited states from time t-n to t to the set. Do this for all added traces that lead to the same target concept - augmentation of the option set throughout time 36 | * $\beta$: Set to 1 when goal is reached or agent no longer in input set. 0 otherwise. 37 | * $\pi$: Create new value function for option. Give -1 reward on each step and 0 at termination. Learn policy using experience replay. 38 | 39 | #### Questions: 40 | 41 | * To obtain Gaussian do we have to know the target location? - Then not automatic identification! 42 | * How to choose n? - number of states we go for init set - somewhat like length of prod rules 43 | * Static filter (Iba, 1989) to filter option set and throw away unsuccessful ones. 44 | * Improvements seem very small. Why is that?! 45 | * Do we simply augment action set and add options or do we substitute? -------------------------------------------------------------------------------- /Hierarchical Reinforcement Learning/2002_Menache.md: -------------------------------------------------------------------------------- 1 | # Title: Q-Cut - Dynamic Discovery of Sub-Goals in Reinforcement Learning 2 | 3 | # Author: Ishai Menache, Shie Mannor, and Nahum Shimkin (2002) 4 | 5 | #### General Content: Graph-theoretic approach for automatic detection of subgoals in dynamic envs. Agent creates online map of process history and uses max-flow/min-cut algo to identify bottlenecks. Policies to reach those are learned separately. Segmented Q-Cut generalizes this by using previously identified bottlenecks for state space partitioning. This seems necessary to identify additional bottlenecks in complex environments. 6 | 7 | 8 | #### Keypoints: 9 | 10 | * Approaches to subgoal discovery: 11 | * landmark states 12 | * states with non-typical reinforcement - high reinforcement gradient 13 | * problem: hard to find meaningful subgoals when rewards are sparse 14 | * bottleneck = freq of appearance + success condition 15 | * problem: needs lots of exploration to distinguish bottlenecks and "regular" states 16 | 17 | * Here: Consider bottlenecks as "border states" of strongly connected areas 18 | * local criterion: choose bottlenecks based on qualities of state itself 19 | * global criterion: choose bottlenecks based on all state transitions 20 | * think of MDP as flow problem: 21 | * Nodes = States 22 | * Arcs = State transitions 23 | * Bottlenecks = accumulation nodes where many paths coincide - support between loosely connected areas 24 | 25 | * Cut procedure for recursive decomposition of state space: 26 | * Divide state space into segments to simplify overall learning task 27 | * Separate consideration of each segment 28 | 29 | * Algo: 30 | 1. Interact with env, learn using SMDP Q-learning 31 | 2. Save state transition history 32 | 3. If activating cut conditions are met, choose $s,t \in S$ and perform Cut(s,t) 33 | 34 | * Cut(s,t): 35 | 1. Translate state transition history into graph representation 36 | 2. Find MinCut partition $[N_s, N_t]$ between s and t 37 | 3.
If cut quality is good: create option, learn policy with ER 38 | 39 | * Choosing s,t: Task dependent 40 | * use distance metric between states 41 | * use env reset structure 42 | 43 | * Activating cut conditions: constant rate slower than actual experience frequency 44 | * depends on computational resources/goodness of found s,t pair 45 | 46 | * Graph construction - capacity: 47 | * frequency based: too much weight to frequently visited states that might not be bottlenecks 48 | * fixed: same significance to all visited states 49 | * relative frequency - seems to perform best 50 | 51 | * Cut quality: "significant" s-t cuts: small number of arcs <-> enough states in $N_s$ and $N_t$ 52 | * Look for small number of bottleneck states, separating significant balanced areas in state space 53 | * Use the ratio-cut bipartitioning metric, related to the size of both sets and the number of arcs between them. 54 | * Only consider cuts whose quality is above threshold 55 | 56 | * Q-Cut works well when one bottleneck sequentially leads to the other. 57 | * Segmented version: Use discovered bottlenecks as segmentation tool - divide and conquer: work on small segments of states in order to find additional bottlenecks and corresponding options -------------------------------------------------------------------------------- /Hierarchical Reinforcement Learning/2002_Stolle.md: -------------------------------------------------------------------------------- 1 | # Title: Learning Options in Reinforcement Learning 2 | # Author: Martin Stolle and Doina Precup (2002) 3 | 4 | 5 | #### General Content: State visitation approach to option discovery. Explicitly construct a set of options and not individual options! They pose a series of random tasks in a static env and let the agent solve them. The agent collects statistics of frequencies of occurrence of different states. Intuition: If states occur frequently on trajectories that represent solutions to random tasks, then these states may be important. 6 | 7 | 8 | #### Keypoints: 9 | 10 | * Differences to diverse density approach: 11 | 1. No assumption of what a good/bad trajectory is. Only assume agent to be confronted with different tasks in same env and allow for exploration. Makes sense since we cook in the same kitchen every day ;) 12 | 2. Explicit construction of set of options and not greedy construction of individual ones. 13 | * Get init and term sets first and then simply learn intra-option policy using pseudo-rewards 14 | 15 | * Algorithm (see the sketch after this list): 16 | 1. Select number of start and target states S, T according to some distribution (e.g. uniform) 17 | 2. For each pair (S,T) 18 | * Perform N_train Q-L episodes to learn policy from S to T 19 | * Perform N_test episodes using greedy policy eval 20 | * For all states s count number of total occurrences n(s) 21 | 3. Repeat until number of desired options is reached: 22 | * T_max = argmax_s n(s) as target state for option 23 | * Compute n(s, T_max): number of times each s occurs on path to T_max 24 | * Compute rho(T_max) = avg_s n(s, T_max) 25 | * Select all states s for which n(s, T_max) > rho(T_max) to be in init set 26 | * Complete init set by interpolating between states (domain specific) 27 | * Decrease visitation counts for all states by number of visits to states on trajectories going to T_max - prevent several options going to neighbouring subgoals (redundancy) 28 | 4. For each option learn internal policy by giving high reward for entering T_max and no rewards otherwise. Learn by Q.
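A minimal sketch of step 3 above (picking option targets from visitation counts and building initiation sets), assuming two hypothetical containers not taken from the paper: `n`, mapping each state to its total occurrence count over the greedy test episodes, and `test_paths`, the recorded test trajectories as lists of states.

```python
import numpy as np

def select_options(n, test_paths, num_options):
    """Sketch of the visitation-count target selection (step 3 of the algorithm above)."""
    n = dict(n)  # work on a copy so counts can be decreased for redundancy control
    options = []
    for _ in range(num_options):
        t_max = max(n, key=n.get)  # most frequently visited remaining state
        # n(s, T_max): occurrences of s on the part of each trajectory leading to t_max
        n_to_target = {}
        for path in test_paths:
            if t_max not in path:
                continue
            for s in path[: path.index(t_max) + 1]:
                n_to_target[s] = n_to_target.get(s, 0) + 1
        if not n_to_target:
            break
        rho = np.mean(list(n_to_target.values()))  # rho(T_max) = avg_s n(s, T_max)
        init_set = {s for s, c in n_to_target.items() if c > rho}
        options.append({"target": t_max, "init_set": init_set})
        # decrease visitation counts to avoid several options around neighbouring subgoals
        for s, c in n_to_target.items():
            n[s] = max(n.get(s, 0) - c, 0)
    return options
```

The domain-specific interpolation of the init set and the Q-learning of each intra-option policy (step 4) are left out of the sketch.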
29 | 30 | #### Questions: 31 | 32 | * Need to do a lot of pretraining!!! For each pair of S, T! -------------------------------------------------------------------------------- /Hierarchical Reinforcement Learning/2004_Bakker.md: -------------------------------------------------------------------------------- 1 | # Title: Hierarchical Reinforcement Learning Based on Subgoal Discovery and Subpolicy Specialization 2 | 3 | # Author: Bram Bakker and Jürgen Schmidhuber 4 | 5 | 6 | #### General Content: Idea - let high-level policy identify subgoals that precede overall goals. Simultaneously, let low-level policies learn to reach the subgoals set by the higher level. Also learn which subgoals the subpolicies are capable of reaching - specialization. 7 | 8 | 9 | #### Keypoints: 10 | 11 | * HASSL: Hierarchical Assignment of Subgoals to Subpolicies Learning 12 | * 2-layer hierarchy (can be generalized) 13 | * Specialization <-> Generalization 14 | * state-specific reaching of goals <-> single sub-policy to reach multiple goal states 15 | * focus on parts of obs space relevant to specialization <-> generalize within specialization 16 | 17 | * High-level observation and high-level goal state are included in input vector ("command" of high-level policy) 18 | * Time-out value: max number of low-level actions that a policy can execute before control returns to the higher level 19 | * Learning done with advantage function learning 20 | * decrease value of subgoal if it was not reached 21 | * gradient expressions for advantage function, weights of parametrized value function and C-values which measure capability of low-level policy to reach high-level goal state 22 | 23 | * Production of high-level obs - requirement: clustering of primitive low-level obs s.t. locally neighbouring states tend to be clustered together 24 | * Use unsupervised learning vector quantization technique ARAVQ: Adaptively allocates a new model vector if the latter's Euclidean distance to any existing model vector exceeds a threshold 25 | 26 | * Other algorithms: Associate subgoals with primitive, low-level observations rather than high-level obs. -------------------------------------------------------------------------------- /Hierarchical Reinforcement Learning/2004_Mannor.md: -------------------------------------------------------------------------------- 1 | # Title: Dynamic Abstraction in Reinforcement Learning via Clustering 2 | 3 | # Author: Mannor et al (2004) 4 | 5 | #### General Content: Online generation of an env map that represents the topological structure of state transitions. Afterwards, they use clustering to partition the state space into meaningful regions. Furthermore, they consider building a map with preliminary indication of the location of interesting (high reward density) regions of state space. A high value gradient indicates a significant cluster where additional exploration is potentially beneficial. 6 | 7 | 8 | #### Keypoints: 9 | 10 | * Subtask definition in terms of state space - here: consider clusters of states as intermediate stages in the learning process, rather than unique states - leads to more robust results.
11 | * Option is then defined as policy that allows agent to efficiently shift from one cluster of states to the other 12 | 13 | * Input to clustering algo: agent's recorded state transitions = topological representations of learning task dynamics (+ current value estimates) 14 | * Algo encourages creation of clusters with small deviation in value function 15 | * Encourage agent to travel between homogeneous clusters <-> increase prob to reach clusters with interesting values 16 | 17 | * Process of cluster creation <-> bootstrapping: Clusters are formed early in learning process and are based on rough estimate of env. Using rough est improves exploration. 18 | 19 | * Algo: 20 | 1. Interact with env and learn using SMDP Q-learning 21 | 2. Save state transitions 22 | 3. If clustering condition is met and clustering not evoked previously: 23 | * Translate state transition history to graph representation 24 | * Run clustering algo 25 | * Learn options for reaching neighbouring clusters 26 | 27 | * Activating clustering conditions: Trade-off 28 | * Want early clustering: Have impact on exploration when most significant 29 | * Not too early: Info may not suffice for finding meaningful clusters 30 | * Solution: Wait until no new states were encountered for T (task-dep param) - indicating stable state-transition model 31 | 32 | * Clustering objective: max sum of cluster qualities + sum of separation qualities between clusters 33 | * Agglomerative approach: start with more clusters than desired and merge clusters by selecting pair whose merging improves objective the most 34 | * Stop when pre-specified number of clusters is reached 35 | 36 | * Topological approach: 37 | 1. Size of clusters should be roughly the same 38 | 2. Clusters should be well separated 39 | 40 | * Value approach: 41 | * area with dense concentration of distinct rewards should not be contained in a large cluster - careful control for max exploitation; area with few rewards - regard as one cluster - agent only wants to exit and explore other areas 42 | -------------------------------------------------------------------------------- /Hierarchical Reinforcement Learning/2014_Yao.md: -------------------------------------------------------------------------------- 1 | # Title: Universal Option Models 2 | 3 | # Author: Yao et al (2014) 4 | 5 | #### General Content: Authors deal with the problem of learning models for options in real time and in the setting where reward functions can be specified at any time and expected returns must be efficiently computed. 6 | 7 | * UOM: option model independent of any reward fct - universal wrt rewards 8 | * Show how to extend notion to linear fct approx 9 | * Show that UOM gives TD solution of option returns - prove convergence 10 | 11 | 12 | #### Keypoints: 13 | 14 | * Classic option models can't deal with multiple-reward planning problems. When the reward fct is changed, abstract planning with the traditional option model has to start from scratch 15 | 16 | * UOM consists of 2 parts: 17 | 1. State prediction: predict state of option termination 18 | 2. Accumulation: predict occupancies of all states by option after execution starts - similar to Dayan's successor representation 19 | 20 | * Option model: For option $o$: $\langle R^o, p^o \rangle$ 21 | * $R^o$ - option return: total exp disc return until option terminates (stochastic waiting time T) 22 | 23 | $$R^o(s) = \mathbb{E}[r_1 + \gamma r_2 + ...
+ \gamma^{T-1} r_T]$$ 24 | 25 | * $p^o$ - discounted terminal distribution: disc prob of termination at $s'$ after option is initiated in $s$ 26 | 27 | $$p^o(s, s') = \mathbb{E}_{s,o}[\gamma^T \mathbb{1}_{\{s_T = s'\}}] = \sum_{k=1}^\infty \gamma^k \mathbb{P}_{s,o} \{s_T = s', T=k\}$$ 28 | 29 | * Universal option model: $\langle u^o, p^o \rangle$ where $u^o$ is the option's discounted state occupancy function. 30 | 31 | $$u^o(s, s') = \mathbb{E}_{s,o}[ \sum_{k=0}^{T-1} \gamma^k \mathbb{1}_{\{s_k = s'\}}]$$ 32 | 33 | and 34 | 35 | $$R^o(s) = \sum_{s' \in S} r^\pi(s') u^o(s,s')$$ 36 | 37 | * Theorem: Traditional option model can be constructed from the UOM and the reward vector of the option. The reward vector essentially conditions the universal model to a specific reward function. 38 | 39 | * Authors show that to compute the TD approx of the option return corresponding to R it suffices to find the LS approx of the expected one-step reward under the option and R, provided one is given the U matrix of the option. 40 | 41 | * TD(0)-based linear UOM - reward independence! 42 | * Construct multiple immediate reward models for different reward signals of interest 43 | * Compare to LOEM (linear option exp model) by Sorg and Singh (2010) - estimation from experience 44 | * Learning the LOEM model is faster (computing time) than learning the UOM for a single fixed reward function 45 | * UOM can produce accurate option return quickly for each new reward function 46 | 47 | 48 | #### Questions: 49 | 50 | * Read up on Dyna and more model-based RL! 51 | * Authors use computer time to compare approaches!!! very code dependent - very bad!!! -------------------------------------------------------------------------------- /Hierarchical Reinforcement Learning/2017_Florensa.md: -------------------------------------------------------------------------------- 1 | # Title: Stochastic NNs for HRL (2017) 2 | 3 | # Author: Florensa, Duan, Abbeel 4 | 5 | #### General Content: Intend to solve the exploration problem by combining HRL with intrinsic motivation. First, they learn skills in a pre-training env using proxy rewards which require minimal info (resembles intrinsic motivation - done with the help of an SNN with info-theoretic regularization, see the sketch below). Afterwards, they train a separate high-level policy for the task on top of the skills.
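Hedged sketch of the count-based mutual-information bonus described under the Regularization keypoint below: the centre-of-mass position is binned, and the per-step bonus is taken proportional to the log of the relative count of how often the current bin was visited while latent code z was active. Class name, parameters and the exact bonus form are illustrative assumptions, not the paper's implementation.

```python
from collections import defaultdict
import numpy as np

class MIBonusEstimator:
    """Count-based proxy for the mutual-information regularizer (sketch only)."""

    def __init__(self, bin_size=1.0, alpha=0.1):
        self.bin_size = bin_size  # width of the (x, y) centre-of-mass discretization
        self.alpha = alpha        # assumed bonus coefficient
        self.visits = defaultdict(lambda: defaultdict(int))  # z -> bin -> count
        self.totals = defaultdict(int)                        # z -> total visits

    def _bin(self, com_xy):
        return tuple((np.asarray(com_xy) // self.bin_size).astype(int))

    def update(self, com_xy, z):
        self.visits[z][self._bin(com_xy)] += 1
        self.totals[z] += 1

    def bonus(self, com_xy, z):
        # relative count of how often this bin was visited while code z was active
        if self.totals[z] == 0:
            return 0.0
        p = self.visits[z][self._bin(com_xy)] / self.totals[z]
        return self.alpha * np.log(p + 1e-8)
```

During pre-training the bonus would be added to the proxy reward at each step after calling `update`, pushing different latent codes towards visiting different regions.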
6 | 7 | 8 | #### Keypoints: 9 | 10 | * eps-greedy exploration/uniform Gaussian exploration noise fail in case of sparse rewards - long-term credit assignment problem - Two approaches: 11 | * HRL: composition of policies - reduce search space exponentially - require domain knowledge 12 | * Intrinsic motivation: guide exploration - hard to transfer knowledge 13 | 14 | * Stochastic NN: stochastic units in computation graph 15 | * feed latent variable with simple distribution as extra input to policy network 16 | * form joint embedding - concatenation (change bias term) or bilinear integration (change weights of layer) - fed to feedforward network 17 | * encourage diversity in policies by using mutual info as a regularizer 18 | 19 | * Lit: Learning skills in discrete domains: 20 | * Chentanez et al (2004) - Intrinsically motivated RL 21 | * Vigorito & Barto (2010) - Intrinsically motivated hierarchical skill learning in structured environments 22 | * Stolle & Precup (2002) - Learning options in reinforcement learning 23 | * Mannor et al (2004) - Dynamic abstraction via clustering 24 | * Simsek et al (2005) - Local graph partitioning 25 | 26 | * Partition MDPs into two components (one shared and one task specific), all MDPs have same action space - structural assumption: sharing of same agent space 27 | 28 | * Pre-training env: minimal setup required - design of proxy reward should encourage existence of locally optimal solutions - skills 29 | 30 | * Obtaining skills - sample latent code at beginning of every rollout of net - keep constant throughout entire rollout. After training, each of the latent codes corresponds to an interpretable skill - use for downstream tasks 31 | 32 | * Regularization: combat problem of different latent codes corresponding to similar skills 33 | * add additional reward bonus, prop. to mutual info between latent and current state - estimate by discretization - relative count of how often center of mass state was visited when code z was active 34 | 35 | * Freeze K skills after pre-training. For all MDPs train a new manager MM on top of frozen common skills 36 | * high-level policy gets full state as input and outputs a parametrization of the categorical distr from which a discrete action/skill z out of K is sampled 37 | 38 | * Optimization via trust region policy optimization - Schulman et al (2015) 39 | 40 | #### Questions: 41 | 42 | * What has to be manually defined? Learning params of TRPO, number of skills K, two network structures -------------------------------------------------------------------------------- /Hierarchical Reinforcement Learning/2018_Frans.md: -------------------------------------------------------------------------------- 1 | # Title: Meta Learning Shared Hierarchies 2 | 3 | # Author: Frans, Ho, Chen, Abbeel, Schulman 4 | 5 | #### General Content: Combine meta-learning (which uses info from past experiences to learn quickly) with HRL. They generalize the options framework (master policy over subpolicies) to the setting of task distributions (sharing of primitives within). 6 | 7 | 8 | #### Keypoints: 9 | 10 | * Write hierarchical problem in terms of optimization: set low-level motor primitives such that meta-policy learns quickly 11 | * Then approx problem: repeatedly reset master policy and adapt sub-policies for faster learning.
12 | * MLSH vs meta-learning: MLSH learns quickly over a large number of policy gradient updates 13 | * $<\pi_{\phi, \theta}>$ 14 | * $\phi$: params shared between tasks $\{\phi_1,...,\phi_k\}$ 15 | * $\theta$: params learned from scratch per task - state of learning process on that task $\to$ neural network that switches between/chooses subpolicies 16 | * Master policy that chooses specific $k$ - acts on slower time scale and fixed frequency $N$ 17 | * Sample MDP from $P_M$ - then agent is initialized with shared params $\phi$ and randomly initialized $\theta$ 18 | * Max over $\phi$ the expected reward sequence under prob distr of MDPs 19 | 20 | ALGO: 21 | 22 | 1. Sample M 23 | 24 | Repeat: 25 | 26 | 2. Initialize $agent(\theta, \phi^{t-1})$ 27 | 3. Warmup period $\to$ optimize master $\theta$ 28 | 4. Joint update period $\to$ optimize both $\theta, \phi$ 29 | 30 | * Warmup intuition: Only update $\phi$ if $\theta$ is close to optimal 31 | * Important aspect: No gradient passing between master and subpolicies 32 | * Also: Can easily replace policy gradient used in training with Q-Learning 33 | 34 | 35 | #### Questions: 36 | 37 | * Possibility of generalizing in one unified network where Master acts as input layer? 38 | * Look at Florensa paper - use info max objective for option discovery - similar to IGGI? 39 | * Problem of still having to specify different learning timescales as well as number of desired subpolicies! -------------------------------------------------------------------------------- /Hierarchical Reinforcement Learning/2018_Levy.md: -------------------------------------------------------------------------------- 1 | # Title: Hierarchical Reinforcement Learning with Hindsight 2 | 3 | # Author: Levy et al (2018) 4 | 5 | #### General Content: Authors propose to merge UVFA, HER and hierarchies of policies. Each hierarchical level proposes subgoals for the level below. Hindsight is then used to solve all MDPs within the hierarchy simultaneously. 6 | 7 | * 2 different ways to use hindsight: 8 | 1. Agent replays actions with different goals: learn to generalize to unseen goals 9 | 2. Agent replays higher-level decisions using the subgoal states achieved in hindsight as the subgoal actions. This provides a way for the agent to autonomously discover high-level subgoal actions belonging to a particular time scale. Furthermore, it leads to sample efficiency - allows agent to evaluate higher-level actions even when the lower-level layer has not fully learned to achieve higher-level subgoals. 10 | 11 | 12 | #### Keypoints: 13 | * Universal MDP: $U=(S, A, T, R, G)$ where $R: S \times A \times G \to \mathcal{R}$ or just $R(s,a,g)$ 14 | * Solve $U_{original}$ via learning a hierarchical policy that solves a hierarchy of k UMDPs $U_0, ..., U_{k-1}$ where each UMDP represents a different level of temporal abstraction 15 | * $U_0$: lowest level of hierarchy - $S_0 = S$, $A_0 = A$, $G_0 = S_0 =S$ 16 | * $U_i, 0