├── .gitignore ├── Bayesian Optimization & Gaussian Processes ├── Garnelo_2018b.md └── Shahriari_2016.md ├── Biologically_Plausible_DL ├── Bartunov_2018.md ├── Lillicrap_2016.md ├── Sacramento_2018.md └── Whittington_2019.md ├── Deep Learning ├── Gal_2015.md ├── Gal_2016.md ├── Goodfellow_2016.md └── Spoerer_2017.md ├── Formal Grammars ├── 2001_Hale.md ├── 2016b_Siyari.md └── 2018_Schoenhense.md ├── Free-Energy Principle ├── Friston_2010.md ├── Limanowski_2013.md └── Ostwald_2015.md ├── Hierarchical Reinforcement Learning ├── 1997_McGovern.md ├── 1998_McGovern.md ├── 2001_Mcgovern.md ├── 2002_Menache.md ├── 2002_Stolle.md ├── 2004_Bakker.md ├── 2004_Mannor.md ├── 2014_Yao.md ├── 2017_Florensa.md ├── 2018_Frans.md └── 2018_Levy.md ├── Hyperparam-Opt ├── 2017_Jaderberg.md └── 2019_Li.md ├── Meta-Learning ├── 2016_Andrychowicz.md ├── 2017_Finn.md └── 2018_Finn.md ├── Multi-Agent RL ├── 2016_Foerster.md ├── 2018_Strouse.md └── 2019_Das.md ├── Optimization ├── 2018_Baydin.md └── Ruder_2016.md ├── ReadingGroupICL ├── 01_Daume_2004.md ├── 02_Zhang_2017.md ├── 03_Marco_2017.md └── 05_Lee_2017.md ├── ReadingGroupTU ├── 2018_Nayebi_pres.pdf ├── 2019_Bengio_pres.pdf ├── 2019_Flennerhag_pres.pdf ├── 2019_Geier.md └── 2019_Merel_pres.pdf ├── Readme.md ├── Reinforcement Learning ├── 1993_Dayan_b.md ├── 2015_Hausknecht.md ├── 2015_Schaul.md ├── 2016_Rusu.md ├── 2018_Andrychowicz.md └── 2018_Choshen.md ├── Theory-of-DL ├── 2017_Collins.md ├── 2018_Jacot.md ├── 2019_Frankle.md └── 2019b_Frankle.md ├── Variational Inference ├── Ostwald_2014.md └── Zhang_2018.md └── Vision └── 2018_Kuemmerer.md /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | *.pdf 3 | *.key 4 | plain.md 5 | !2018_Nayebi_pres.pdf 6 | !2019_Bengio_pres.pdf 7 | !2019_Flennerhag_pres.pdf 8 | !2019_Merel_pres.pdf 9 | -------------------------------------------------------------------------------- /Bayesian Optimization & Gaussian Processes/Garnelo_2018b.md: -------------------------------------------------------------------------------- 1 | # Title: Neural Processes 2 | 3 | # Author: Garnelo et al (2018b) 4 | 5 | #### General Content: Introduce a framework that balances advantages of both deep learning as well as Gaussian processes. A NP is a network-based approx of a stochastic process. It models a distribution over fcts, estimates uncertainty, shifts workload from training to test and is computationally efficient. In its essence it is an VAE structure with a global latent variable that captures epistemic uncertainty. They show powerful results in different problem setting (self-supervised image completion, Bayesian optimization as well as function approx.). 
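The encoder h, mean aggregator a and decoder g can be written down very compactly. Below is a minimal sketch of a single NP forward pass, assuming PyTorch; the layer sizes, the diagonal Gaussian over the global latent z and the deterministic decoder output are illustrative stand-ins, not the paper's exact architecture (which these notes do not record).

```python
# Minimal Neural Process forward pass (sketch, assumption: PyTorch).
import torch
import torch.nn as nn

class NeuralProcess(nn.Module):
    def __init__(self, x_dim=1, y_dim=1, r_dim=64, z_dim=32):
        super().__init__()
        # Encoder h: maps each context pair (x_i, y_i) to a representation r_i
        self.h = nn.Sequential(nn.Linear(x_dim + y_dim, r_dim), nn.ReLU(),
                               nn.Linear(r_dim, r_dim))
        # Aggregated representation r parameterizes the global latent z
        self.to_mu = nn.Linear(r_dim, z_dim)
        self.to_logvar = nn.Linear(r_dim, z_dim)
        # Decoder g: maps (x_target, z) to a prediction for y_target
        self.g = nn.Sequential(nn.Linear(x_dim + z_dim, r_dim), nn.ReLU(),
                               nn.Linear(r_dim, y_dim))

    def forward(self, x_context, y_context, x_target):
        r_i = self.h(torch.cat([x_context, y_context], dim=-1))  # (N_context, r_dim)
        r = r_i.mean(dim=0)                # aggregator a: permutation-invariant mean
        mu, logvar = self.to_mu(r), self.to_logvar(r)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterized sample of z
        z_rep = z.unsqueeze(0).expand(x_target.shape[0], -1)
        return self.g(torch.cat([x_target, z_rep], dim=-1))      # predictions at target inputs

# Different samples of z give different coherent function samples for the same
# context set, which is exactly what the conditional NP (no latent) cannot do.
```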
6 | 7 | 8 | #### Keypoints: 9 | 10 | * Sufficient conditions for definition of stochastic process - Kolmogorov Extension Theorem: Exchangeability (invariance of joint distribution to permutations of elements in sequence) and Consistency (marginalization does not change resulting class of distribution) - de Finetti's theorem 11 | 12 | * NP = Encoder (h - NN), Aggregator (a - mean operator), Decoder (g - NN) - check out good graphical model visualization 13 | 14 | * z - global uncertainty - distribution characterized by data and context specific prior 15 | 16 | * Relationship to other models: 17 | * Conditional NPs: Lack latent variable - unable to produce different samples for same context - no uncertainty estimate 18 | * Generalization of generative query network - similar training for prediction of new viewpoints in 3D given some context 19 | * GPs: Deep Kernel Learning, Kernel Approx 20 | * Meta-Learning - Workload shift to test time 21 | 22 | * Amortized inference - Variational inference procedure in which a powerful predictor (NN) is used to predict optimal value of variational parameters based on features - replace local var parameters by fct of data whose parameters are shared across data points. 23 | 24 | * Comparison to GPs: No handcrafted kernel, but learning of implicit measure directly from the data - efficient computation. 25 | 26 | 27 | #### Questions: 28 | 29 | * What does image completion with full set of context points do? samples a noisy version? NP seems to perform a lot better in low data/context-point regimes - as many Bayesian methods do. 30 | * No details about architectures - cant replicate anything - good github repo (https://github.com/geniki/neural-processes) 31 | -------------------------------------------------------------------------------- /Bayesian Optimization & Gaussian Processes/Shahriari_2016.md: -------------------------------------------------------------------------------- 1 | # Title: 2 | 3 | # Author: 4 | 5 | #### General Content: 6 | 7 | 8 | #### Keypoints: 9 | 10 | 11 | #### Summary: 12 | 13 | 14 | #### Questions: 15 | -------------------------------------------------------------------------------- /Biologically_Plausible_DL/Bartunov_2018.md: -------------------------------------------------------------------------------- 1 | # Title: Assessing the Scalability of Biologically-Motivated Deep Learning Algorithms and Architectures 2 | 3 | # Author: Bartunov et al (2018) 4 | 5 | #### General Content: Extent the target propagation algorithm to not require exact gradient at penultimate layer. Test alternative learning rules in more complicated settings (CIFAR/ImageNet) and differentiate between locally and fully connected architectures. Very good review but not much additional innovation. 
Behavioral + Physiological Realism 6 | 7 | 8 | #### Keypoints: 9 | 10 | * Problems with backpropagation 11 | * Feedback connections require exact copy of feedforward connections = Weight transport 12 | * Info propagation does not influence "neural activity" - does not conform to any known biological mechanism 13 | 14 | * Feedback alignment: Use random weights in backward pass to deliver info to earlier layers 15 | * Still requires delivery of signed error via distinct pathway 16 | * Direct/Broadcast FA - connect feedback from output layer directly to all previous ones 17 | 18 | * Contrastive Hebbian Learning/Generalized Recirculation: Use top-down feedback connections to influence neural activity and differences to locally approx gradients 19 | * Positive/negative phase - need settling process - likely too slow for brain to compute in real time 20 | 21 | * Target Propagation: Trains distinct set of feedback connections defining backward activity propagation 22 | * Connections trained to approximately invert feedforward connections to compute target activities for each layer by successive inversion - decoders 23 | * Reconstruction + Forward loss 24 | * Different target constructions 25 | * Vanilla TP: Target computation via propagation from higher layers' targets backwards through layer-wise inverses 26 | * Difference TP: Standard delta rule with additional stabilization from prev reconstruction error. Still needs explicit grad comp at final layer 27 | * Not tested on data more complex than MNIST 28 | 29 | * Simplified Difference Target Propagation: Computation also for penultimate layer with help of correct label distribution - removes implausible gradient communication 30 | * Need diversity in targets - problem of low entropy of classification targets 31 | * Need precision in targets - poor inverse learned 32 | * Combat both problems/weaknesses of targets with help of auxiliary output resembling random features from penultimate hidden layer 33 | * Parallel vs alternating inverse training - simultaneous more plausible 34 | 35 | * Weight-Sharing is not plausible - regularizes by reducing number of free parameters 36 | 37 | * Experiments - Mostly negative results: 38 | 1. None of the existing algos is able to scale up - Good performance on MNIST/Somewhat reasonable on CIFAR/Horrible on ImageNet - Seems like weight-sharing is not key to success 39 | 2. Need for behavioral realism - judged by performance on difficult tasks 40 | 3. Hyperparameter Sensitivity 41 | * First fix "good" architecture and then optimize 42 | * Use hyperbolic tanh instead of ReLU - works better 43 | 44 | 45 | #### Questions: 46 | 47 | * How could the brain do weight sharing - is an approximate version again satisfactory/a functional approx? 48 | * Think more about communication: MARL agents learning communication channels 49 | -------------------------------------------------------------------------------- /Biologically_Plausible_DL/Lillicrap_2016.md: -------------------------------------------------------------------------------- 1 | # Title: Random synaptic feedback weights support error backpropagation for deep learning 2 | 3 | # Author: Lillicrap et al (2016) 4 | 5 | #### General Content: Introduce a first feedback alignment approach to solve the weight transport problem of backpropagation. Forward and backward weights are modeled separately - backward weights align with the weight matrix transpose through the learning process. Argument follows from positive definiteness of the weight and random matrix product and a rotation line of thought.
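A toy version of the mechanism makes the claim concrete: the backward pass sends the error through a fixed random matrix B instead of the transpose of the forward weights, and only the forward weights are learned. Sketch below assuming NumPy; the single hidden layer, tanh nonlinearity and learning rate are arbitrary choices for illustration.

```python
# Feedback-alignment update for one hidden layer (sketch, assumption: NumPy).
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 30, 20, 10
W0 = rng.normal(scale=0.1, size=(n_hid, n_in))    # input -> hidden (learned)
W1 = rng.normal(scale=0.1, size=(n_out, n_hid))   # hidden -> output (learned)
B = rng.normal(scale=0.1, size=(n_hid, n_out))    # fixed random feedback weights (never learned)
lr = 0.01

def fa_step(x, y_target):
    """One forward/backward pass where the error travels back through B, not W1.T."""
    global W0, W1
    h = np.tanh(W0 @ x)
    y = W1 @ h
    e = y - y_target                       # signed output error
    delta_h = (B @ e) * (1.0 - h ** 2)     # feedback alignment: B replaces W1.T
    W1 -= lr * np.outer(e, h)
    W0 -= lr * np.outer(delta_h, x)
    return 0.5 * float(e @ e)
```

As long as the feedback-aligned update stays within 90 degrees of the backprop signal ($e^T W B e > 0$, see the keypoints below), the loss still decreases, and training itself drives the forward weights toward approximate alignment with $B^T$.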
6 | 7 | 8 | #### Keypoints: 9 | 10 | * Weight transport Problem: downstream errors are fed back to upstream neurons via exact symmetric copy of downstream synaptic weight matrix - neuron "deep" within network has to have precise knowledge of all downstream synapses! 11 | 12 | * Possible solutions: 13 | 1. Retrograde transmission of info along axons - problem of slow timescale 14 | 2. Feedback of errors via second network - problem of symmetry assumption of feedforward and feedback connections 15 | 3. Here: Show that even fixed random connections can allow for learning - symmetry not required! Instead implicit dynamics lead to soft alignment between forward and backward weights 16 | 17 | * Observations: 18 | * Feedback weights does not have to be exact: $B \approx W^T$ with $e^TWBe > 0$. rotation within 90 degrees of backprop signal. Learning speed depends on degree! 19 | * Alignment of $B$ and $W^T$ via adjustment of W (and B) possible 20 | 21 | * Feedback alignment: 22 | * Modulator signal (error-FA) does not impact forward pass post-synaptic activity bu acts to alter plasticity at the forward synapses. 23 | * FA may encourage W to align with Moore-Penrose pseudoinverse of B - approximate functional symmetry 24 | * Inference vs learning - towards bayesian approaches 25 | 26 | * Experiments: 27 | * Learns linear function with single hidden layer - learning not slower than backprop 28 | * Sigmoid nonlinearity and classification task - altered function of post-synaptic activity - learned also to communicate info when 50% of weights were randomly removed 29 | * More layers 3 hidden layers - as well as backprop and making use of depth - froze layers and trained alternatingly - positive/negative phase? 30 | * Neurons that integrate activity over time and spike stochastically - synchronous pathways 31 | 32 | * Possible Extensions: 33 | * Fixed spike thresholds/refractory period 34 | * Dropout/stochasticity 35 | 36 | #### Questions: 37 | 38 | * Still signed error signal has to be transferred which remains illusive - see target propagation. 39 | * Is result related to Johnson-Lindenstrauss concentration ineq ideas? 40 | * Usage of intricate/more complex architectures of communication of backward error - relation to multi-agent RL 41 | * Relationship to predictive coding 42 | -------------------------------------------------------------------------------- /Biologically_Plausible_DL/Sacramento_2018.md: -------------------------------------------------------------------------------- 1 | # Title: Dendritic cortical microcircuits approximate the backpropagation algorithm 2 | 3 | # Author: Sacramento et al (2018) 4 | 5 | #### General Content: MLP with simplified dendritic compartments learned in local PE plasticity fashion. No separate phases needed. Errors represent mismatch between pre input from lateral interneurons and top-down feedback. First cortical microcircuit approach. Analytically derive that such a setup/learning rule approximates backprop weight updates and proof basic performance on MNIST. 
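To make the "apical error = top-down input minus lateral interneuron input" idea concrete, here is a deliberately crude rate-based caricature, assuming NumPy. It ignores the paper's conductance-based somatic dynamics, nudging and noise, keeps the interneuron-related weights fixed (in the paper they are plastic too), and only shows how an apical mismatch can gate a local update of the bottom-up weights.

```python
# Rate-based caricature of the dendritic-error microcircuit (sketch, assumption: NumPy).
import numpy as np

rng = np.random.default_rng(1)
n_low, n_hid, n_top = 8, 6, 4
W_up = rng.normal(scale=0.1, size=(n_hid, n_low))   # bottom-up weights onto basal compartments (learned here)
W_td = rng.normal(scale=0.1, size=(n_hid, n_top))   # top-down weights onto apical compartments
W_pi = rng.normal(scale=0.1, size=(n_top, n_hid))   # pyramidal -> interneuron (lateral, one per upper-layer neuron)
W_ip = rng.normal(scale=0.1, size=(n_hid, n_top))   # interneuron -> apical compartment (cancellation pathway)
lr = 0.05

def apical_error_step(r_low, r_top):
    global W_up
    basal = W_up @ r_low                  # feedforward prediction (basal potential)
    r_hid = np.tanh(basal)                # somatic rate, driven bottom-up
    r_int = W_pi @ r_hid                  # interneurons driven laterally by same-layer pyramidals
    apical = W_td @ r_top - W_ip @ r_int  # mismatch left after lateral cancellation = error signal
    W_up += lr * np.outer(apical, r_low)  # local plasticity gated by the apical error
    return r_hid, apical
```

When the interneuron pathway perfectly predicts the top-down input (the self-predicting state), the apical term vanishes and plasticity stops, which is the intuition behind "no separate phases needed".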
6 | 7 | 8 | #### Keypoints: 9 | 10 | * Hypothesis: Pred errors are encoded at distal dendrites of pyramidal neurons - receive input from downstream neurons - in model: error arise from mismatch of lateral local interneuron inputs (SST - somatostatin) - Learning via local plasticity 11 | 12 | * 3 Compartment Neuron: 13 | * Soma + Integration zones: Basal/Apical - convergence of top-down/bottum-up synapses on different compartments - Larkum (2013): Preferred connectivity patterns of cortico-cortical projections 14 | 15 | * 2nd Population within hidden layer - Interneurons = lateral + cross-layer connectivity: cancel t-d input - only backprop errors remain as apical dendrite activity 16 | * Predominantly driven by same layer but cross-layer feedback provides weak nudge for interneurons = modeled as conduc-based somatic input current 17 | * Modeled as one-to-one between layer interneuron and corresponding upper-layer neuron 18 | * Empirically justified by monosynaptic input mapping experiments: weak interneuron teaching signal 19 | 20 | * Neuron/network Model: 21 | - Simplifications: 22 | 1. Membrane capacity to 1 and resting potential 0; Background activity is white noise 23 | 2. Modeling of layer dynamics - where vectors represent units 24 | 3. No apical compartment in pyramidal output neurons - 3 compartments seem to suffice as comparison mechanism 25 | - Qualitative dynamics: error = apical voltage deflection -> propagates down soma -> modulates somatic firing rate -> plasticity at bottom-up synapses 26 | - Somatic conductance acts as nudging conductance 27 | - Lateral dendritic projections: interneuron is nudged to follow corresponding next layer pyramidal neuron 28 | 29 | * Synaptic learning rules = Dendritic Predictive Plasticity Rules 30 | - Originally: reduction of somatic spiking error 31 | - conductance based normalization of lateral projections based on dendritic attenuation factors of different compartments 32 | - Implementation requires subdivision of apical compartment into two distal parts (t-d input and lateral input from interneurons) 33 | 34 | * Prev work: 35 | * Guergiev: View apical dendrites as integration zones - temp difference between activity of apical dendrite in presence/absence of teaching input = error inducing plasticity at forward synapses. Used directly for learning b-u synapses without influencing somatic activity. HERE: apical dendrite has explicit error represnentation by sim integration of t-d excitation and lateral inhibition - No need for separate temporal phases - continuous operation with plasticity always turned on 36 | * PC based work - Whittington and Bogacz: Only plastic synapses are those connecting prediction and error neurons. HERE: all connections plastic - errors are directly encoded in dendritic compartments 37 | 38 | * Main Results/Experiments: 39 | * Analytic derivation: Somatic MP at layer k integrate feedforward predictions (basal dendritic potentials) and backprop errors (apical dendritic potentials) 40 | * Analytic derivation: Plasticity rule converges to backprop weight change with weak feedback limit 41 | * Random/Fixed t-d weights = FA 42 | * Learned t-d weights minimizing inverse reconstruction loss = TP 43 | * Experiments: 44 | * Non-Linear regression task: Use soft rectifying nonlinearity as transfer fct - Tons of hyperparameters - injected noise current (dropout/regularization effect?) 
45 | * MNIST - Deeper architectures: Use convex combination of learning/nudging 46 | 47 | * General notes: 48 | * Kriegeskorte/DiCarlo/RSA - DNNs outperform alternative frameworks in accurately reproducing activity patterns in cortex - What does this mean? Is DL just extremely flexible/expressive? 49 | * bottom-up = feedforward, top-down = feedback 50 | 51 | 52 | #### Questions: 53 | 54 | * Neural transfer fct = Activation fct! 55 | * Again tons of hyperparameters to be chosen - How? 56 | * Think of learning (accurate gradient approx) vs architecture (depth, number of hyperparameters) complexity 57 | * Different interneuron types (PV = parvalbumin-positive) - different types of errors (generative) 58 | -------------------------------------------------------------------------------- /Biologically_Plausible_DL/Whittington_2019.md: -------------------------------------------------------------------------------- 1 | # Title: Theories of Error Propagation in the Brain 2 | 3 | # Author: Whittington & Bogacz (2019) 4 | 5 | #### General Content: Review article on bio-plausible alternatives to backprop. Goal is not to state that the brain exactly performs backprop but that other rules that achieve similar performance are plausible alternatives. Differentiate between two main classes of models: Temporal-Error (learning via plasticity and target constraining output activity) vs Explicit-Error (learning via Hebbian rule and explicit error computation/tracking) 6 | 7 | 8 | #### Keypoints: 9 | 10 | * Bio-plausible problems with backprop: 11 | * Updates depend on activity/comps of all downstream neurons - non-local 12 | * Symmetry idea of feedforward and back pathways - question of bidirectional connections - still need form of alignment 13 | * Neurons send continuous output (rate-based) - real neurons spike. Also linear summation as aggregation of inputs 14 | 15 | * Alternatives: 16 | * Neuromodulators may guide plasticity by carrying form of global error. Slow and bad scaling. 17 | * Pre-/Post-Synaptic activity: STDP/Pyramidal neurons/cortical microcircuits 18 | * Rely on similar framework: Forward + Back -> energy minimization? 19 | 20 | * Temporal-Error Models: Error is def as difference in activity across time. 21 | * Contrastive learning: Error decomposition into target indep activity part (anti-Hebbian - unlearn associations - only prediction) and target dep activity part (Hebbian - learn new assocs - output layer set to target activity) 22 | * Problem: Need form of phase signal that is global! Info local via oscillatory rhythms such as hippocampal that oscillations? Neurons in output layer are driven by feedforward inputs in one part of cycle and forced to take value of target pattern in other 23 | 24 | * Continuous Update Model: Local plasticity rule based on rate of change of activity. Still requires signal since no plasticity should take during prediction 25 | 26 | * Explicit-Error Models: Close approx of errors that would be computed in backprop setting 27 | * Predictive Coding Model: Error nodes 1-to-1 association with value nodes. During prediction activity is prop between value nodes via error nodes. Convergence to equilibrium in which the nodes decay to zero and all value nodes have same values as ANN. During training error nodes dont go to zero but to backprop error - adapt weights using Hebbian rule. 28 | * Problem: Inconsistent 1-to-1 connectivity 29 | 30 | * Dendritic Error Model: Error representation in apical dendrite. Local error computation via cortical microcircuits, i.e. 
comparison with lateral inhibitory interneuron activity. Plasticity guidance via plateau potentials. 31 | * PCs are excitatory <- negative (inhibitory) input is provided by interneurons - hard to train their weights - need higher-level projections 32 | * Show form of intuitive equivalence to backpropagation - like Sacramento 33 | 34 | * Computational comparison: 35 | * T-E: Mechanism for informing whether target pattern constrains output neurons (phase signaling) - not needed for E-E 36 | * Predictive Coding needs to propagate twice as many neurons (travel via error neuron)! - Evolutionary benefit for fewer neurons and faster propagation 37 | * Dendritic model: Learning all connections in parallel can lead to problems. Ideally first learn inhibitory neurons 38 | 39 | * Experimental data: No evidence for separate error neuron population 40 | 41 | * Vogels et al (2011) - learning rule in which the direction of modification depends on neurons being in equilibrium, as in the PC model, can arise from an alternate form of STDP 42 | 43 | * Martinotti cells - receive input from pyramidal neurons in same cortical area and project to their apical dendrite 44 | 45 | * Question of speed: Pred Coding - slow prediction; Dendritic error model - slow training (but can get faster with time) 46 | 47 | * Equilibrium propagation: Energy-based models and minimization of free energy - entropy/ELBO optimization. Markov blanket ideas - strong connectivity leads to similar activity 48 | 49 | #### Questions: 50 | * Robustness without symmetry in forward and backward weights? 51 | * Efficient one-shot learning support? 52 | * Anti-Hebbian <-> Hebbian: Relationship to target/forward phase in Guergiev et al (2017) 53 | -------------------------------------------------------------------------------- /Deep Learning/Gal_2015.md: -------------------------------------------------------------------------------- 1 | # Title: Dropout as a Bayesian Approximation: Insights and Applications (2015) 2 | 3 | # Author: Gal & Ghahramani 4 | 5 | #### General Content: 6 | 7 | Introduction of one of the first Bayesian frameworks for Deep Learning. The GP perspective allows one to formalize the uncertainty over parameters. One does not need to compute the exact GP but instead is able to approximate the covariance function via Monte Carlo integration. 8 | 9 | #### Keypoints: 10 | 11 | Authors show that an MLP with dropout at each layer is equivalent (in terms of objective function) to Approximate Variational Inference for Deep GPs. The analogy to Gaussian process regression is a possible explanation for why Deep NNs with dropout generalize well. Furthermore, we are able to reason about uncertainty over the complicated extracted features. Authors suggest that one has to put an approx variational distribution over the bias vector and not only the weights. 12 | 13 | #### Summary: 14 | 15 | * Dropout seems to be related to SGD on an L^2-penalized error function. 16 | * Show that the dropout objective minimizes the KL divergence between the approx. model and a deep Gaussian process. 17 | * Important to readjust after dropout layer by rescaling the weights by the inverse dropout probability 18 | * View MLP as one deeply composed function: In the GP setup non-stationary covariance functions/kernels correspond to hyperbolic tangent or ReLU MLPs. -> indirect assumption on the smoothness of the function 19 | * Don't perform the full GP, which would take an O(n^3) matrix inversion. Instead one approximates the full/true GP to obtain a manageable time complexity.
20 | * Minimising the KL div <-> maximising the log evidence lower bound. Maximisation leads to a variational distribution which explains the data well while still not deviating too much from the prior distribution. 21 | * Approximate the exact covariance function by MC integration. The paper obtains an approximate objective function for this "deep GP NN version" and shows equivalence of the objective functions. 22 | * One can't evaluate the KL div term between a mixture of Gaussians and a single Gaussian analytically! 23 | * Common knowledge: doubling the learning rate of the bias during MLP optimisation works very well in practice with dropout nets with p=0.5 24 | * MC dropout: Estimate the dropout model using T forward passes through the net and averaging the results - model averaging. 25 | 26 | #### Questions: 27 | 28 | * How does the rescaling work? - only at test time and not during training 29 | * How does one optimise the dropout proportion? 30 | * How does the model assumption (section 2.3) of one latent set of variables, which acts as sufficient statistics, affect the model flexibility/generative model? Common assumption for VI? 31 | 32 | #### Other: 33 | 34 | * Stationary Kernel: A function whose length scale does not change throughout the input space will be well modeled by a GP with a stationary kernel - the kernel can just be written as a function of x - x' 35 | 36 | -------------------------------------------------------------------------------- /Deep Learning/Gal_2016.md: -------------------------------------------------------------------------------- 1 | # Title: On Modern Deep Learning and Variational Inference 2 | 3 | # Author: Gal & Ghahramani (2016) 4 | 5 | 6 | #### General Content: 7 | Modern deep learning models can be formulated as performing approx. variational inference in a Bayesian setting. Paper extends Gal & Ghahramani (2015), which showed that an MLP with dropout at every layer corresponds to approx. VI in Deep GPs, to a more general Bayesian NN/deep learning models setting. Furthermore, they show that any form of stochastic regularization in arbitrary NN structures can be viewed as approx. variational inference in Bayesian Nets. Ultimate goal: Synthesize the worlds of stats/VI and deep learning in order to form a mathematical grounding. 8 | 9 | 10 | #### Keypoints: 11 | 12 | * Show that **any form of stochastic regularization in arbitrary NN structures can be viewed as approx. variational inference in Bayesian Nets**: Almost all deep learning models perform approx. Bayesian inference to capture the latent stochastic process underlying the data. 13 | * Allows us to reason about the full DL framework in a Bayesian manner: assess model uncertainty, model comparison and uncertainty over features/inference. 14 | 15 | #### Summary: 16 | 17 | * Want to be able to reason about parameter and *structural* uncertainty of neural nets. 18 | * **Multivariate Gaussian noise**: alternative to dropout. Does not sample Bernoullis but N(1,1) draws and multiplies this random sample with the weights. Slows training down more than Bernoullis because it is harder to sample from the Gaussian. 19 | * **Metaphorical interpretation of dropout**: Sexual reproduction - many sperm cells are "sent" and only one is needed to impregnate the woman = unnecessary redundancy in the data. Cannot be proven since Gaussian noise (which gives non-exactly-0 draws) works similarly well. 20 | * CNNs only work when dropout is only used for the non-convolutional (fully connected) layers.
Possible reason: less co-adaption in convolution layers, already sparse -> no dropout needed. 21 | 22 | * Here: Derivation focuses on multiclass-classification problem and they derive an intractable GP posterior which afterwards is approximated by introducing a variational distribution over the latent variables. 23 | * Minimization of KL div between variational distribution and parameter posterior is equivalent to maximising the log evidence lower bound. 24 | * Gal & Turner (2015): Show that one is able to approximate the exact GP by defining an approx. distr. over spectral frequencies and their coefficients in a Fourier decomposition if the function. 25 | 26 | * **Structural Formulations**: 27 | * Different nonlinearity-regularisation techniques correspond to different GP kernels, which have different impact on the predictive uncertainty! 28 | * Multiple variables: Apply kernel the concatenated matrix of variables 29 | * Two seperate activations for two different variables: Summing of two Kernels 30 | 31 | * **Bayesian Neural Nets**: 32 | * Plaxe a prior distr. over the weights - often matrix Gaussian prior 33 | * Interested in finding most likely weights given the data -> use VI to compute approx. to otherwise intractable integral. 34 | * Use Monte Carlo integration to approximate the approximation of the KL divergence 35 | * Suggests that stochastic regularisation techniques work well since they approximately integrate over the model parameters! 36 | * **MC dropout**: Make predictions using the predictive distr. approximation with VI-obtained MC integrated posterior! 37 | 38 | 39 | #### Questions: 40 | 41 | * What are spectral frequencies? 42 | * It should be easily possible to find/create new application specific stochastic regularisation variants. Enforce correlations between the weights. 43 | * Extract Bayesian inference knowledge about the GP - how can we learn something from NNs? 44 | -------------------------------------------------------------------------------- /Deep Learning/Goodfellow_2016.md: -------------------------------------------------------------------------------- 1 | # Title: Deep Learning 2 | 3 | # Author: Goodfellow et al. (2016) 4 | 5 | ## Chapter 6: Ch. 6: Deep Feedforward Networks 6 | 7 | 8 | #### General Content: 9 | 10 | 11 | #### Keypoints: 12 | 13 | 14 | #### Summary: 15 | 16 | 17 | #### Questions: 18 | -------------------------------------------------------------------------------- /Deep Learning/Spoerer_2017.md: -------------------------------------------------------------------------------- 1 | # Title: Recurrent Convolutional NNs: A Better Model of Biological Object Recognition 2 | 3 | # Author: Spoerer et al (2017) 4 | 5 | #### General Content: Introduce (?) CNN architecture with lateral as well as recurrent connections. Hypothesize that recurrence is not necessarily used to deal with Gaussian noise but with occlusions. Introduce two new dataset augmentations (debris and jitter). Debris = random crop overlays with task of single digit recognition, Jitter = Multiple digits with task of recognizing all of them. Use deconvolution to deal with recurrent dim inbalance. Train recurrent net over 4 timesteps and learn via BPTT. Performance comparison robustness check with permutation test. Lateral connections only help with strong debris. 6 | - Extension of original R-CNN work by Liang & Hu (2015) and Liao and Poggio (2016) 7 | 8 | #### Keypoints: 9 | 10 | * Motivation: Ventral visual pathway with lateral and feedback connections. 
Feedforward models vary successful - underexploration of recurrence. Visual processing unfolds over time. 11 | * Fast local recurrent processing =/= attention 12 | * Occlusion slows down behavioral responses - recurrent processing? = Competitive processing 13 | * Occlusion - question of border ownership (cells) => require info from outside their classical receptive fields and signals are delayed relative to initial feedforward input - again suggest recurrent processes 14 | * Hypothesis: Recurrence provides no clear benefit for solely processing noise since linear filter which is learnable in NN can already deal with it! Recurrence useful across a wider range! 15 | 16 | * Occlusion Datasets: 17 | * Debris: Random crops from randomly selected digts add mask and overlay - summing overall features is bad strategy. 18 | * Jitter: Multiple images sequentially placed in the image - overlap with relative depth order 19 | * Preprocessing: Pixel-wise normalization (mean/std) for each pixel from stats computed across the entire datasets 20 | 21 | * Architecture: 22 | * 4 different architectures: 23 | 1. B - Pure feedforward 24 | 2. BT - Feedforward + Top-down 25 | 3. BL - Feedforward + Lateral (recurrence) 26 | 4. BLT 27 | * Define pre-activations for all 4 for each layer, pixel, time point as well as feature map 28 | * Pass pre-activations through local response normalization and ReLU 29 | * Need to control for comparable complexity of the models also generalization to case without occlusion: Increase the number of parameters via either number of b-u connections (B-F = better match in terms of params) or large kernel sizes (B-K = more plausible) 30 | * General architecture: 2 hidden recurrent layers + Readout at every step - problem with top-down connections - dim mismatch need deconvolution/transposed convolution to size up again. Zeiler et al (2011) 31 | * Unrolled for 4 time steps on full image! - Why 4 timesteps? 32 | * At each time step feed in the image and make a readout 33 | * Train via BPTT - Error is propagated throught time for each time point - network trained to converge as soon as possible, rather than final step - As compared to spiking Joram case where network has settling time steps and is only propagated from final loss! 34 | * Accuracy measured only at final time step - Loss function then defined sum over time steps? Sum over timesteps and final backprop at the end vs backprop for each current loss at t backwards? 35 | * Deconvolution: Normal conv layer with stride where input and output sides of layer have been switched 36 | 37 | * Model comparison: 38 | * Pairwise McNemar's test (prediction dependence correction), False discovery rate, Benjamini-Hochberg correction 39 | * Robustness check with permutation test - form of linear regression of test errors on debris/cropping strength 40 | 41 | * Experimental results 42 | * Lateral connections only help with strong debris 43 | * Support for RNNs also being better in OCR tasks without noise/occlusion 44 | * Robustness of RNNs - do simple ffw networks overfit? Is this overfitting? More like learning a specific task under quality conditions 45 | 46 | #### Questions: 47 | 48 | * Extensions to neural data and reaction time distributions 49 | * How much recurrence do we need? 50 | * Why do people in ML dont do more multiple comparison checks to robustify their performance gains? 51 | * Read up on deconvolutions = transposed convolution! 52 | * Recurrence vs larger receptive fields: Similar to DRQN vs frame concatenation discussion. 
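For concreteness, a minimal sketch of the bottom-up + lateral (BL) variant, assuming PyTorch; the channel counts, single recurrent layer, global-average readout and summed per-step loss are illustrative simplifications of the architecture described above.

```python
# Recurrent convolutional (BL) layer unrolled over four time steps (sketch, assumption: PyTorch).
import torch
import torch.nn as nn

class BLLayer(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.bottom_up = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)  # feedforward drive
        self.lateral = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)   # recurrent lateral drive

    def forward(self, x, h_prev):
        pre = self.bottom_up(x)
        if h_prev is not None:
            pre = pre + self.lateral(h_prev)
        return torch.relu(pre)

class BLNet(nn.Module):
    def __init__(self, n_classes=10, timesteps=4):
        super().__init__()
        self.timesteps = timesteps
        self.layer = BLLayer(1, 32)
        self.readout = nn.Linear(32, n_classes)

    def forward(self, x):                    # x: (batch, 1, H, W), fed in at every step
        h, logits = None, []
        for _ in range(self.timesteps):
            h = self.layer(x, h)
            logits.append(self.readout(h.mean(dim=(2, 3))))  # readout at every time step
        return logits

# One plausible training setup: sum the cross-entropy losses over the per-step readouts
# and backpropagate through time; accuracy is then reported at the final time step.
```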
53 | -------------------------------------------------------------------------------- /Formal Grammars/2001_Hale.md: -------------------------------------------------------------------------------- 1 | # Title: A Probabilistic Earley Parser as a Psycholinguistic Model 2 | 3 | # Author: John Hale (2001) 4 | 5 | #### General Content: The cognitive load of language comprehension can be expressed as the total prob of structural options that have been disconfirmed at that point in the sentence - paths in the tree that have been traveled unsuccessfully. This can again be formulated as the surprisal of a word given its predecessors. The measure can be efficiently calculated using a probabilistic version of the Earley parser from a probabilistic phrase-structure grammar. Is able to account for Garden Path sentences. 6 | 7 | 8 | #### Keypoints: 9 | 10 | * Garden Path Sentence: Grammatically correct sentence that starts in such a way that the reader's most likely interpretation will be incorrect. Reader is deceived into a dead-end parse. Sentence creates temporary ambiguity/multiple interpretations. 11 | * Relationship between knowledge of grammar and its application in perceiving syntactic structure - 3 principles: 12 | 1. Parser - Grammar: Strong competence - parser uses rules of grammar 13 | 2. Frequency affects performance - stat theory of language performance 14 | 3. Eager sentence processing: unrushed experimental setup 15 | 16 | * Stolcke's prob Earley parser: Way to use hierarchical phrase structure in a language model such as an n-gram 17 | * Prob CFG: sentence prob - product of all rules used to generate 18 | * multiplication - assumption that rule choices are indep 19 | * consistency: does grammar assign non-0 prob to inf recursion? 20 | * problem: only assigns probs to complete sentences of language 21 | * Compute prefix prob at each word - sum of probs of all derivs whose yield is compatible with string seen so far 22 | * 1 - prefix prob = prob of all other derivs that have been disconfirmed (given consistency) 23 | 24 | * Earley parser: top-down - propagation of prediction back up through a set of states representing hypotheses the parser is using about the structure of the sentence. 25 | * collection of states - chart - tree set 26 | * state: 27 | * current input string position processed so far 28 | * grammar rule 29 | * dot-position in rule: how much of rule recognized 30 | * left-most edge of substring rule generates 31 | * functions: 32 | * predict, scan, complete 33 | * S $\to$ predict (add new states for prod. rule following S) $\to$ scan (check input string. If match move dot) $\to$ complete (propagate change through graph) 34 | * Earley paths <-> correspondence with derivations 35 | * Stolcke: add two infos to each state: $\alpha$ - prefix prob and $\gamma$ - inside probability 36 | * Earley path: sequence of Earley states linked by three ops 37 | * State completion: bottom-up confirmation - $\alpha$ unites with top-down prediction $\gamma$ 38 | 39 | * Total parallelism theory: Entire set of trees compatible with input is maintained somehow from word-to-word 40 | 41 | * Cognitive effort expended to parse prefix - prop.
- total prob of all structural analyses which cant be compatible with observed prefix 42 | * generate predictions about word-by-word reading times by comparing total effort expended before some word to the total effort after 43 | * assumption: probs of PCFG rules = statements of how difficult it is to disconfirm each rule 44 | * $log(\frac{\alpha_{n-1}}{\alpha_n})$ - surprisal: combined difficulty of disconfirming all disconfirmable structures at a given word 45 | 46 | * Garden paths are points where the parser can disconfirm alternatives that together comprise a great amount of probability 47 | * effect disappears when words intervene that cancel reduced relation interpretation early on 48 | * Evidence for total-parallelism parsing theory 49 | 50 | #### Questions: 51 | 52 | * How to incorporate this into option/substructure discovery problem? - options over options - specific attention to beginning 53 | * Info-theoretic based grammar extraction - IGGI -------------------------------------------------------------------------------- /Formal Grammars/2016b_Siyari.md: -------------------------------------------------------------------------------- 1 | # Title: Lexis: An Optimization Framework for Discovering the Hierarchical Structure of Sequential Data 2 | 3 | # Author: Siyari, Dilkina, Dovrolis (2016) 4 | 5 | #### General Content: 6 | Propose algorithm that constructs an optimized hierarchical representation of a given set of target strings, the Lexis-DAG. The DAG displays how to derive each target string through the concatenation of intermediate substrings by minimizing the total number of such concatenations or DAG edges. 7 | 8 | 9 | #### Keypoints: 10 | 11 | * Problem is related to the SGP which is NP hard. The authors also prove this for the Lexis optimization problem and propose a greedy algorithm which efficiently constructs the DAG. 12 | 13 | * Problem of construction is a synthetic design problem: construction of min cost DAG which shows how to produce a given set of targets from a given alphabet in hierarchical manner, through the construction of intermediate substrings that are re-used in at least two higher-level strings. 14 | * Cost of DAG <-> related to concatenation work that corresponding hierarchy would require. 15 | 16 | * Optimized DAG can be thought of as a plausible hypothesis for the unknown process that created given targets as long as we have reasons to believe that the process cares to minimize the same cost function that the DAG optimization considers. 17 | 18 | * 2 cost functions (turn out to be algorithmically identical): 19 | * Min total number of concatenations 20 | * Min number of DAG edges 21 | * Which one to use is application specific - Lexis is NP hard for both 22 | 23 | * Intermediate nodes/core: minimal set of DAG nodes that can cover a given fraction of s-to-t paths. Represents the most central substrings in corresponding 24 | 25 | * Lexis-DAG - $D(V,A)$ with 3 conditions: 26 | 1. $v \in V$ represents string $S(v)$, $V_S$ sources, $V_T$ targets, $V_M$ intermediate nodes. $V = V_S \cup V_M \cup V_T$ 27 | 2. Each node in $V_M \cup V_T$ represents a string that is the concatenation of substrings, $d_{in}(v)$: # incoming edges for v, $d_{out}(v)$: # outgoing edges for v 28 | 3. 
Lexis-DAG should only include intermediate nodes with 2 outgoing edges - re-used in at least two concatenation operations 29 | 30 | * Lexis Optimization Problem: Construct min-cost Lexis DAG for given alphabet S and targets T for given cost function: 31 | 32 | $$\min_{(E, V_M)} C(D) s.t. D=(V, E) \text{ is Lexis-DAG for S and T}$$ 33 | 34 | * No explicit min of nodes in $V_M$ - but implicit through cost functions 35 | * Edge costs: $\mathcal{E}(D) = \sum_{v \in V} d_{in}(v) = |E|$ 36 | * Concatenation costs: $C(D) = \sum_{v\in V\setminus V_s} (d_{in} -1) = |E| - V\setminus V_s$ 37 | * Both problems are NP-hard 38 | 39 | * G-Lexis idea: Search for substring $\xi$ that will lead to max cost reduction, when added as new intermediate node. Algo starts from trivial Lexis-DAG with no intermediate nodes and edges from the source nodes representing alphabet symbols to each occurance in target 40 | 41 | * $I(v)$: Sequence of nodes appearing in incoming edges of v 42 | * $\leftrightarrow$ sequence of nodes whose string concatenation results in string $S(v)$ represented by v 43 | * $\leftrightarrow$ strings of alphabet of Lexis-DAG 44 | * Look for repeated substring $\xi \in I_{T \cup M} = \{I(v | v) \in V_T \cup V_M\}$ that can be used to construct new intermediate node 45 | * Can construct new intermediate nodes for $\xi$, create incoming edges based in symbols in $\xi$ and replace incoming edges to each of the non-overlapping repeated occurances of $\xi$ with a single outgoing edge from the new node. 46 | 47 | Algorithm: 48 | 49 | 1. Init $V \leftarrow V_T \cup V_S$ and E, constructing each target in T from characters in S. $V_M \leftarrow \emptyset$. 50 | 2. Repeat: 51 | * $I_{T \cup M} \leftarrow \{I(v | v) \in V_T \cup V_M\}$ 52 | * Select $\xi$ with max $(R_{T \cup M, \xi} -1)(|\xi| - 1)=0$, where $R_{T \cup M, \xi}$ is number of repeats $\xi$ in $I_{T \cup M}$. 53 | * if $(R_{T \cup M, \xi}-1)(|\xi| - 1) = 0$, break. Terminate when there are no more substrings with length at least 2 and which are repeated at least twice. 54 | * $V \leftarrow V \cup \{\sigma_{\xi}\}$ where $\sigma_{\xi}$ is new intermediate node and update E accordingly. 55 | 56 | * Substring that max saved cost is a max repeat: substrings of length at least 2, whose extensionto right or left would reduce its occurences in the given set of strings 57 | * Suffix tree over set of input strings captures all right-max repeats which are superset of all mac repeats. To pick the one with max saved cost we need the count of non-overlapping occurences of these substrings - minimal augmented suffix tree 58 | * Here: implementation using regular suffix tree - Iterate over all occurances of selected substring, skipping overlapping occurances 59 | * $O(L) \text{ vs } O(L \log L)$ where L is total length of target strings 60 | * Overall runtime: $O(L^2)$ since max # iterations is $O(L)$ (each iteration reduces number of edges/concats which at start is $O(L)$) 61 | 62 | * After construction: rank constructed intermediate nodes in terms of significance or centrality 63 | * Dependency chain: the higher the number of s-to-t paths traversing an intermediate node v (path centrality), the more important v is in terms of number of dependency chains it participates in. 
64 | * Core of Lexis-DAG: Set of intermediate nodes that represent, as a whole the most important substrings in that Lexis-DAG 65 | * Should include nodes of high path centrality 66 | * Almost all s-to-t dependency chains of Lexis-DAG should traverse at least one of the core nodes 67 | 68 | * Data compression: Looks for regularities that can be used to compress the data - patterns often useful as such regularities 69 | 70 | * Minimum Description Principle: Compression scheme that results in smallest size for joint representation of both dictionary and encoding of data using that dictionary 71 | 72 | 73 | #### Questions: 74 | -------------------------------------------------------------------------------- /Formal Grammars/2018_Schoenhense.md: -------------------------------------------------------------------------------- 1 | # Title: Data-efficient inference of hierarchical structure in sequential data by information-greedy grammar inference 2 | 3 | # Author: Schoenhense and Faisal 4 | 5 | #### General Content: Combat "overfitting" of greedy grammatical inference algos such as sequitur by introducing decision rule based on Shannon entropy decrease. This way they introduce form of regularizer. Show that IGGI does not fit noise when sequence is generated without any underlying hierarchical rules. 6 | 7 | 8 | #### Keypoints: 9 | 10 | * Algos: Sequitur, MDLcompress, G-Lexis 11 | * Iterative repeat replacement (IRR): Longest-match, byte-pair coding, compressive 12 | * CFG + MDL = grammar-based universal source code 13 | * Info theory approach: More symbols used = more info needed to give meaning to symbol - average info/entropy needed per symbol increases 14 | * Idea: Greedily + iteratively replace bigrams by minimizing info content of coding string 15 | * New definition of size of grammar: Previously - based on concatenated string of all symbols needed to write out production rules and coding sequence - now: based on info content 16 | * Testing on random data shows: no overfitting of noise - compression ration around 1 -------------------------------------------------------------------------------- /Free-Energy Principle/Friston_2010.md: -------------------------------------------------------------------------------- 1 | # Title: The free-energy principle: A unified brain theory? 2 | 3 | # Author: Karl Friston (2010) 4 | 5 | #### General Content: 6 | 7 | THE interdisciplinary introduction to the FEP. Biological agents must minimimize there the long-term average of surprise (free energy as the upper bound!) to ensure their sensory entropy to remain low. Biological systems violate the fluctuation theorem (generalization of second law of thermodynamics) -> probability of entropy decrease becomes exponentially smaller with time. 8 | 9 | 10 | #### Keypoints: 11 | 12 | * FEP is a very flexible/diverse model that allows to model/account for many different brain structures/functions. Thereby it allows us to unify multiple different neuroscientific theories: Bayesian brain hypothesis, Efficient coding, Predictive coding, synaptic pruning, cell assembly theory, attention, neural Darwinism, RL and value learning, optimal control and DP 13 | * 3 PoVs: 14 | 1. Optimizing brain states makes the representation an approx conditional density on the causes of sensory input. This enables action to avoid suprising sensory encounters. 15 | 2. Optimization of sufficient statistics: Free energy = surprise + KL(recognition density || posterior) -> Minimization leads to a better approximation 16 | 3. 
Optimization of actions: Free energy = mixture of accuracy and complexity -> action can only affect accuracy -> active learning to minimize prediction error. Selective sampling of sensory inputs that the brain expects. 17 | * The long-term goal of homeostasis leads to a short-term avoidance of surprise. 18 | 19 | #### Summary: 20 | 21 | * FEP: Any self-organizing system that is at equilibrium with its environment must minimize its free energy (not equiv. to variational free energy!!). 22 | * Entropy: average self-info or surpirse -> agent dependent!! 23 | * Low entropy density = An outcome is relatively predictable = Uncertainty Measure 24 | * Free energy: Measure that bounds surprise on sampling events given a generative model 25 | * Lower bound on log model evidence[]() 26 | * Homeostatsis: Process of self-system-regulation <-> internal system remains in bounded states 27 | * Surprise: negative log-prob of outcome. Improbable outcome is surprising. 28 | * Attractor: Set (of points/states) to which a dynamical system evolves after a certain amount of time <-> Statitionary State? 29 | * Recognition density: Approx prob distribution of causes of the data -> product of inverting the generative model 30 | * Bayesian Brain: Brain as inference machine that actively predicts and explains sensations. Perception: Process of inverting the generative model to access the posterior of the causes given sensory input. 31 | * Hierarchical generative models: Min Free energy <=> optimization of empirical priors. Optimization makes every level in hierarchy accountable to others - leads to internally consistent representation of sensory causes at multiple levels of descriptions. 32 | * Laplace Assumption: Free energy = difference between model prediction and sensations/predicted representations - saddle-point approximation of integral of exp function which uses 2nd order taylor - Gaussian Approximation? 33 | * Predictive Coding: min free energy corresponds to explaining away prediction errors 34 | * Principle of efficient coding - Barlow's redundancy reduction principle - infomax principle: brain optimizes the mutual info between sensorium and its internal representation 35 | * Cell assembly theory by Hebb - Plasticity: Groups of interconnected neurons are formed through strengthening of synaptic connections that depends on correlated pre- and postsynaptic activity. 36 | * Correlation theory - Metaplasticity: Selective enabling of synaptic efficacy and its plasticity by fast synchronous activity induced by different perceptual attributes of the same object. 37 | * Hebb plasticity and predictive coding - connected by delta learning rule 38 | * GD on free energy <-> Hebbian plasticity 39 | * Biased competition and attention: neuromodulators - prediction errors with high precision have greater impact on units that encode conditional expectations 40 | * Optimization of expected precision in terms of synaptic gain links attention to synaptic gain and synchronisation 41 | * Neural Darwinism: 42 | 1. Epigenetic mechanisms create primary repertoire of neuronal connections, which are refined by experience-dependent plasticity to produce a secondary repertoire of neuronal groups. 43 | 2. These are selected and maintained through reentrant signalling among neuronal groups. Value modulates the plasticity. 44 | 3. Value is signalled by ascending neuromodulatory transmitter systems and controls which neuronal groups are selected and which not. 45 | 4. 
The capacity of value to do this is assured by natural selection, in the sense that neuronal value systems are subject to selective pressure. 46 | * Value is inversely prop to surprise 47 | * Prior expectations describe small set of states in which we expect high value. 48 | 49 | #### Questions: 50 | 51 | * How does entropy relate to tail-thickness and concentration bounds? Are we bounding the KL divergence? Or the distance between the two densities (true generative model vs predicted model)? -------------------------------------------------------------------------------- /Free-Energy Principle/Limanowski_2013.md: -------------------------------------------------------------------------------- 1 | # Title: 2 | 3 | # Author: 4 | 5 | #### General Content: 6 | 7 | 8 | #### Keypoints: 9 | 10 | 11 | #### Summary: 12 | 13 | 14 | #### Questions: 15 | -------------------------------------------------------------------------------- /Free-Energy Principle/Ostwald_2015.md: -------------------------------------------------------------------------------- 1 | # Title: The Free Energy Principle for Perception - An Introduction 2 | 3 | # Author: Ostwald (2015) 4 | 5 | #### General Content: 6 | 7 | General methodological introduction to the variational Bayes framework for the Free Energy Principle 8 | 9 | 1. Intro FEP 10 | 2. Math 11 | 3. Parametric Bayesian Inference, Information Theory, Variational Bayes 12 | 4. Free-form mean-field variational Bayes for univariate Gaussian models 13 | 5. Fixed-form mean-field variational Bayes for static nonlinear models 14 | 6. Fixed-form mean-field variational Bayes for dynamic nonlinear models 15 | 16 | #### Keypoints: 17 | * Chapter 1: 18 | * FEP: Neurobiological Interpretation of application of deterministic approx. Bayesian inference methods to nonlinear hierarchical random dynamical systems. 19 | * Biological agents minimize dispersion/entropy of their interoceptive and exteroceptive states. Minimizing state dispersion in long run results in mim of surprise in every point in time. 20 | * Min of variational energy under hierarchical models encoded by adjustment of agent's internal states is known as Bayesian filtering and leads to neurobiological predictive coding schemes. 21 | 22 | * Chapter 2: 23 | * Review Gaussian Trafo theorems and proofs (Bishop Ch. 2) 24 | * Completing the square theorem - allows identification of parameters of Gaussian density based on quadratic form 25 | * Gamma distribution and different parametrizations - exponential distribution 26 | * Jacobian: Matrix of all first-order partial derivative of a vector-valued function 27 | * Linear approx of vector fields - Use Jacobian for the first derivative analog of the Taylor series approximation 28 | 29 | * Chapter 3: 30 | * Bayesian Inference: Identification of Log model evidence and posterior distribution for a Generative Model 31 | * Bayesian Parameter Estimation: Not only interested in single point estimate but full probability distribution over the possible parameter values. The conditional distribution is inherent in the specification of the generative model and is thus not created once the data is observed! 32 | 33 | 34 | #### Questions: 35 | 36 | * Heterogenous agents - how can we model risk-aversion e.g.? In Macro models this is specified by concavity of the utility functions and the derived optimal actions. Or by the future discount factor. 
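The identity that both the FEP notes above and the variational Bayes chapters lean on can be stated in one line (standard decomposition, written here for a generic observation $y$ and latent causes/parameters $\vartheta$):

$$
F(q) = \mathbb{E}_{q(\vartheta)}\big[\ln q(\vartheta) - \ln p(y, \vartheta)\big] = -\ln p(y) + \mathrm{KL}\big(q(\vartheta)\,\|\,p(\vartheta \mid y)\big) \geq -\ln p(y)
$$

Minimising $F$ with respect to the recognition density $q$ therefore tightens the bound on surprise $-\ln p(y)$ and pulls $q$ toward the true posterior; $-F$ is the log evidence lower bound (ELBO) referred to throughout these notes.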
-------------------------------------------------------------------------------- /Hierarchical Reinforcement Learning/1997_McGovern.md: -------------------------------------------------------------------------------- 1 | # Title: Roles of Macro-Actions in Accelerating Reinforcement Learning 2 | 3 | # Author: McGovern et al (1997) 4 | 5 | #### General Content: 6 | 7 | Macro-actions are a form of commitment defined by closed-loop policies with termination condition. The appropriateness of construction determines whether or not they make learning faster or slow it down. The influence can be broken down into 2 parts: Effect on exploration and effect on value information propagation/learning. Authors derive basic experiments to test both effects and show that value backprop seems more important. 8 | 9 | #### Keypoints: 10 | 11 | * Macro-operators: common in robotics and to aid state-space search 12 | * Original literature: Korf (1985) and Iba (1989) - here: application to RL 13 | * Macros: Compress effective length of solution by chunking together primitive actions. 14 | * Macro Def.: Agent can choose macro or primitive action from any state unless agent is already executing a macro-action. Once agent has chosen a macro it must follow actions defined by macro's policy until the termination condition is satisfied/goal is reached. 15 | * The set of available macro-actions often depends on current state of agent. 16 | 17 | * **Effect on Exploration** 18 | * Bias behavior of agent so that he spends most of his time in specific regions of state space. 19 | * Authors run experiment where actions are uniformly chosen ("random walk") among primitives and macros 20 | * Agent spends most of time in area where macros lead him to (obvious somewhat) 21 | 22 | * **Effect on Propagation of Action Values** 23 | * Macro action values affect rate of action-value backup 24 | * Q-Learning: values propagate backwards one step at a time 25 | * Macro-Q: value info can propagate over several time steps 26 | * When macro takes agent to a good state, the corresponding value is updated immediately with useful info, despite fact that original state is several primitive actions away from good state 27 | * Again random walk experiment: value flows further back in state space - mix of macro/Q 28 | 29 | * Macros only make learning slower if they alone cant bring the agent to the goal location 30 | * This is always fulfilled when working with straight-line grammars! 31 | -------------------------------------------------------------------------------- /Hierarchical Reinforcement Learning/1998_McGovern.md: -------------------------------------------------------------------------------- 1 | # Title: Macro-Actions in Reinforcement Learning: An Empirical Analysis 2 | 3 | # Author: McGovern and Sutton (1998) 4 | 5 | #### General Content: 6 | 7 | Extend 1997 paper with comparison to eligibility traces. Authors show that Macro-Q converges faster to optimal policy while $Q(\lambda)$ converges faster to optimal action values (accounts for all Q-values and not only relative ordering). 8 | 9 | #### Keypoints: 10 | 11 | * Eligibility traces and macros both provide mechanism for speeding up value propagation. 12 | 13 | * Eligibility traces: Each state-action pair is marked as eligible for backup with a trace indicating how recently it has been experienced. 14 | * On each step the values of all state-action pairs are updated in porportion of their E traces at the time. 
15 | * Because many recent state-action pairs have non-zero traces, value info is propagated backwards many steps 16 | 17 | * $Q(\lambda)$ disseminates value info at an even more rapid rate at the cost of getting the policy right. 18 | * Authors propose to combine both approaches - when is Q more important than policy? 19 | 20 | * "Large scale" experiment with random walk robot - random walk exhibits higher variance in traces when only choosing among primitive actions 21 | 22 | #### Questions: 23 | 24 | * Read more on eligibility traces - try to merge! 25 | -------------------------------------------------------------------------------- /Hierarchical Reinforcement Learning/2001_Mcgovern.md: -------------------------------------------------------------------------------- 1 | # Title: Automatic Discovery of Subgoals in Reinforcement Learning using Diverse Density 2 | 3 | # Author: Amy McGivern and Andrew Barto (2001) 4 | 5 | #### General Content: Discover subgoals online based on commonalities across multiple paths to solution. View problem as multiple-instance learning and use diverse density approach to solve it. 6 | 7 | 8 | #### Keypoints: 9 | 10 | * Bottleneck = region in observation space that is visited often on successful paths but not on unsuccessful paths. Interested in "early" (in successful traces) bottlenecks that are persistent throughout learning. 11 | * Idea: Essentially fix exploration tree up to subgoal and continue exploration from there on! 12 | * Multiple-instance learning: Supervised learning problem - system attempts to identify target concept on basis of "bags" of instances (e.g. successful and unsuccessful traces) 13 | * Option = macro but without fixed sequence of actions but policy which allows to interact with env 14 | * 2 room problem - 2 strongly connected regions. Random exploration spends most time within component - unlikely to leave! But bottleneck connects! 15 | * Simple approach to options: randomly generate based on heuristic and add to set of actions - problem: too many actions - bad exploration - need more structure 16 | * Alternative - visitation/count-based approaches: Only use first visitation - agent spends little time in actual bottleneck and more in component 17 | * Problems: Noisy process, hard to generalize to continouos/large state spaces, no incorporation of negative feedback 18 | 19 | * Multiple-instance learning - Diet- terich et al. (1997) 20 | * Positive bag: Contains at least one positive instance from "target concept" - bottleneck 21 | * Negative bag: Contains only negative instances 22 | * Learn concept based on evidence collected from different bags 23 | * Each trajectory defines a bag - observation vectors correspond to instances within bag 24 | * Bottleneck = Target concept - agent experiences region somewhere on every successful trajectory and not on all unsuccessful ones 25 | 26 | * Diverse Density (DD) approach - Maron, 1998; Maron & Lozano-Pe ́rez, 1998 27 | * Most diversely dense region in feature space = region with instances from most positive bags and least negative bags 28 | * DD = posterior of state being concept given positive and negative bags 29 | * Probability of state being in target concept - gaussian based on distance from particular distance to the target concept 30 | * Concept with max DD value is output of DD search - for small state spaces use exhaustive search 31 | * Can use abstract notions of concepts. 
Most simple: Individual states 31 | 32 | * Option Construction: 33 | * Detect regions which appear early and persist as peaks - keep running average of how often each state appears as a peak - Init to 0 for each state 34 | * Convergence to series ratio 35 | * I (input set): if target concept c is reached at t, add all visited states from time t-n to t to the set. Do this for all added traces that lead to the same target concept - augmentation of the option set throughout time 36 | * $\beta$: Set to 1 when goal is reached or agent no longer in input set. 0 otherwise. 37 | * $\pi$: Create new value function for option. Give -1 reward on each step and 0 at termination. Learn policy using experience replay. 38 | 39 | #### Questions: 40 | 41 | * To obtain Gaussian do we have to know the target location? - Then not automatic identification! 42 | * How to choose n? - number of states we go for init set - somewhat like length of prod rules 43 | * Static filter (Iba, 1989) to filter option set and throw away unsuccessful ones. 44 | * Improvements seem very small. Why is that?! 45 | * Do we simply augment action set and add options or do we substitute? -------------------------------------------------------------------------------- /Hierarchical Reinforcement Learning/2002_Menache.md: -------------------------------------------------------------------------------- 1 | # Title: Q-Cut - Dynamic Discovery of Sub-Goals in Reinforcement Learning 2 | 3 | # Author: Ishai Menache, Shie Mannor, and Nahum Shimkin (2002) 4 | 5 | #### General Content: Graph-theoretic approach for automatic detection of subgoals in dynamic envs. Agent creates online map of process history and uses max-flow/min-cut algo to identify bottlenecks. Policies to reach those are learned separately. Segmented Q-Cut generalizes this by using previously identified bottlenecks for state space partitioning. This seems necessary to identify additional bottlenecks in complex environments. 6 | 7 | 8 | #### Keypoints: 9 | 10 | * Approaches to subgoal discovery: 11 | * landmark states 12 | * states with non-typical reinforcement - high reinforcement gradient 13 | * problem: hard to find meaningful subgoals when rewards are sparse 14 | * bottleneck = freq of appearance + success condition 15 | * problem: needs lots of exploration to distinguish bottlenecks and "regular" states 16 | 17 | * Here: Consider bottlenecks as "border states" of strongly connected areas 18 | * local criterion: choose bottlenecks based on qualities of state itself 19 | * global criterion: choose bottlenecks based on all state transitions 20 | * think of MDP as flow problem: 21 | * Nodes = States 22 | * Arcs = State transitions 23 | * Bottlenecks = accumulation nodes where many paths coincide - support between loosely connected areas 24 | 25 | * Cut procedure for recursive decomposition of state space: 26 | * Divide state space into segments to simplify overall learning task 27 | * Separate consideration of each segment 28 | 29 | * Algo: 30 | 1. Interact with env, learn using SMDP Q-learning 31 | 2. Save state transition history 32 | 3. If activating cut conditions are met, choose $s,t \in S$ and perform Cut(s,t) 33 | 34 | * Cut(s,t): 35 | 1. Translate state transition history into graph representation 36 | 2. Find MinCut partition $[N_s, N_t]$ between s and t 37 | 3.
If cut quality is good: create option, learn policy with ER 38 | 39 | * Choosing s,t: Task dependent 40 | * use distance metric between states 41 | * use env reset structure 42 | 43 | * Activating cut conditions: constant rate slower than actual experience frequency 44 | * depends on computational resources/goodness of found s,t pair 45 | 46 | * Graph construction - capacity: 47 | * frequency based: too much weight to frequently visited states that might not be bottlenecks 48 | * fixed: same significance to all visited states 49 | * relative frequency - seems to perform best 50 | 51 | * Cut quality: "significant" s-t cuts: small number of arcs <-> enough states in $N_s$ and $N_t$ 52 | * Look for small number of bottleneck states, separating significant balanced areas in state space 53 | * Use the ratio-cut bipartitioning metric, related to the size of both sets and the number of arcs between them. 54 | * Only consider cuts whose quality is above threshold 55 | 56 | * Q-Cut works well when one bottleneck sequentially leads to the other. 57 | * Segmented version: Use discovered bottlenecks as segmentation tool - divide and conquer: work on small segments of states in order to find additional bottlenecks and corresponding options -------------------------------------------------------------------------------- /Hierarchical Reinforcement Learning/2002_Stolle.md: -------------------------------------------------------------------------------- 1 | # Title: Learning Options in Reinforcement Learning 2 | # Author: Martin Stolle and Doina Precup (2002) 3 | 4 | 5 | #### General Content: State visitation approach to option discovery. Explicitly construct a set of options and not individual options! They pose a series of random tasks in a static env and let the agent solve them. The agent collects statistics of frequencies of occurrence of different states. Intuition: If states occur frequently on trajectories that represent solutions to random tasks, then these states may be important. 6 | 7 | 8 | #### Keypoints: 9 | 10 | * Differences to diverse density approach: 11 | 1. No assumption of what a good/bad trajectory is. Only assume agent to be confronted with different tasks in same env and allow for exploration. Makes sense since we cook in the same kitchen every day ;) 12 | 2. Explicit construction of set of options and not greedy construction of individual ones. 13 | * Get init and term sets first and then simply learn intra-option policy using pseudo-rewards 14 | 15 | * Algorithm (see the sketch after this list): 16 | 1. Select number of start and target states S, T according to some distribution (e.g. uniform) 17 | 2. For each pair (S,T) 18 | * Perform N_train Q-L episodes to learn policy from S to T 19 | * Perform N_test episodes using greedy policy eval 20 | * For all states s count number of total occurrences n(s) 21 | 3. Repeat until number of desired options is reached: 22 | * T_max = argmax_s n(s) as target state for option 23 | * Compute n(s, T_max): number of times each s occurs on path to T_max 24 | * Compute rho(T_max) = avg_s n(s, T_max) 25 | * Select all states s for which n(s, T_max) > rho(T_max) to be in init set 26 | * Complete init set by interpolating between states (domain specific) 27 | * Decrease visitation counts for all states by number of visits to states on trajectories going to T_max - prevent several options going to neighbouring subgoals (redundancy) 28 | 4. For each option learn internal policy by giving high reward for entering T_max and no rewards otherwise. Learn by Q.
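A minimal sketch of step 3 above (picking option targets from visitation counts and building initiation sets), assuming two hypothetical containers not taken from the paper: `n`, mapping each state to its total occurrence count over the greedy test episodes, and `test_paths`, the recorded test trajectories as lists of states.

```python
import numpy as np

def select_options(n, test_paths, num_options):
    """Sketch of the visitation-count target selection (step 3 of the algorithm above)."""
    n = dict(n)  # work on a copy so counts can be decreased for redundancy control
    options = []
    for _ in range(num_options):
        t_max = max(n, key=n.get)  # most frequently visited remaining state
        # n(s, T_max): occurrences of s on the part of each trajectory leading to t_max
        n_to_target = {}
        for path in test_paths:
            if t_max not in path:
                continue
            for s in path[: path.index(t_max) + 1]:
                n_to_target[s] = n_to_target.get(s, 0) + 1
        if not n_to_target:
            break
        rho = np.mean(list(n_to_target.values()))  # rho(T_max) = avg_s n(s, T_max)
        init_set = {s for s, c in n_to_target.items() if c > rho}
        options.append({"target": t_max, "init_set": init_set})
        # decrease visitation counts to avoid several options around neighbouring subgoals
        for s, c in n_to_target.items():
            n[s] = max(n.get(s, 0) - c, 0)
    return options
```

The domain-specific interpolation of the init set and the Q-learning of each intra-option policy (step 4) are left out of the sketch.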
29 | 30 | #### Questions: 31 | 32 | * Need to do a lot of pretraining!!! For each pair of S, T! -------------------------------------------------------------------------------- /Hierarchical Reinforcement Learning/2004_Bakker.md: -------------------------------------------------------------------------------- 1 | # Title: Hierarchical Reinforcement Learning Based on Subgoal Discovery and Subpolicy Specialization 2 | 3 | # Author: Bram Bakker and Jürgen Schmidhuber 4 | 5 | 6 | #### General Content: Idea - let high-level policy identify subgoals that precede overall goals. Simultaneously, let low-level policies learn to reach the subgoals set by the higher level. Also learn which subgoals the subpolicies are capable of reaching - specialization. 7 | 8 | 9 | #### Keypoints: 10 | 11 | * HASSL: Hierarchical Assignment of Subgoals to Subpolicies Learning 12 | * 2-layer hierarchy (can be generalized) 13 | * Specialization <-> Generalization 14 | * state-specific reaching of goals <-> single sub-policy to reach multiple goal states 15 | * focus on parts of obs space relevant to specialization <-> generalize within specialization 16 | 17 | * High-level observation and high-level goal state are included in input vector ("command" of high-level policy) 18 | * Time-out value: max number of low-level actions that a policy can execute before control returns to the higher level 19 | * Learning done with advantage function learning 20 | * decrease value of subgoal if it was not reached 21 | * gradient expressions for advantage function, weights of parametrized value function and C-values which measure capability of low-level policy to reach high-level goal state 22 | 23 | * Production of high-level obs - requirement: clustering of primitive low-level obs s.t. locally neighbouring states tend to be clustered together 24 | * Use unsupervised learning vector quantization technique ARAVQ: Adaptively allocates a new model vector if the latter's Euclidean distance to any existing model vector exceeds a threshold 25 | 26 | * Other algorithms: Associate subgoals with primitive, low-level observations rather than high-level obs. -------------------------------------------------------------------------------- /Hierarchical Reinforcement Learning/2004_Mannor.md: -------------------------------------------------------------------------------- 1 | # Title: Dynamic Abstraction in Reinforcement Learning via Clustering 2 | 3 | # Author: Mannor et al (2004) 4 | 5 | #### General Content: Online generation of an env map that represents the topological structure of state transitions. Afterwards, they use clustering to partition the state space into meaningful regions. Furthermore, they consider building a map with preliminary indication of the location of interesting (high reward density) regions of state space. A high value gradient indicates a significant cluster where additional exploration is potentially beneficial. 6 | 7 | 8 | #### Keypoints: 9 | 10 | * Subtask definition in terms of state space - here: consider clusters of states as intermediate stages in the learning process, rather than unique states - leads to more robust results.
11 | * Option is then defined as policy that allows agent to efficiently shift from one cluster of states to the other 12 | 13 | * Input to clustering algo: agent's recorded state transitions = topological representations of learning task dynamics (+ current value estimates) 14 | * Algo encourages creation of clusters with small deviation in value function 15 | * Encourage agent to travel between homogeneous clusters <-> increase prob to reach clusters with interesting values 16 | 17 | * Process of cluster creation <-> bootstrapping: Clusters are formed early in learning process and are based on rough estimate of env. Using rough est improves exploration. 18 | 19 | * Algo: 20 | 1. Interact with env and learn using SMDP Q-learning 21 | 2. Save state transitions 22 | 3. If clustering condition is met and clustering not evoked previously: 23 | * Translate state transition history to graph representation 24 | * Run clustering algo 25 | * Learn options for reaching neighbouring clusters 26 | 27 | * Activating clustering conditions: Trade-off 28 | * Want early clustering: Have impact on exploration when most significant 29 | * Not too early: Info may not suffice for finding meaningful clusters 30 | * Solution: Wait until no new states were encountered for T (task-dep param) - indicating stable state-transition model 31 | 32 | * Clustering objective: max sum of cluster qualities + sum of separation qualities between clusters 33 | * Agglomerative approach: start with more clusters than desired and merge clusters by selecting pair whose merging improves objective the most 34 | * Stop when pre-specified number of clusters is reached 35 | 36 | * Topological approach: 37 | 1. Size of clusters should be roughly the same 38 | 2. Clusters should be well separated 39 | 40 | * Value approach: 41 | * area with dense concentration of distinct rewards should not be contained in a large cluster - careful control for max exploitation; area with few rewards - regard as one cluster - agent only wants to exit and explore other areas 42 | -------------------------------------------------------------------------------- /Hierarchical Reinforcement Learning/2014_Yao.md: -------------------------------------------------------------------------------- 1 | # Title: Universal Option Models 2 | 3 | # Author: Yao et al (2014) 4 | 5 | #### General Content: Authors deal with the problem of learning models for options in real time and in the setting where reward functions can be specified at any time and expected returns must be efficiently computed. 6 | 7 | * UOM: option model independent of any reward fct - universal wrt rewards 8 | * Show how to extend notion to linear fct approx 9 | * Show that UOM gives TD solution of option returns - prove convergence 10 | 11 | 12 | #### Keypoints: 13 | 14 | * Classic option models can't deal with multiple-reward planning problems. When the reward fct is changed, abstract planning with the traditional option model has to start from scratch 15 | 16 | * UOM consists of 2 parts: 17 | 1. State prediction: predict state of option termination 18 | 2. Accumulation: predict occupancies of all states by option after execution starts - similar to Dayan's successor representation 19 | 20 | * Option model: For option $o$: $\langle R^o, p^o \rangle$ 21 | * $R^o$ - option return: total exp disc return until option terminates (stochastic waiting time T) 22 | 23 | $$R^o(s) = \mathbb{E}[r_1 + \gamma r_2 + ...
+ \gamma^{T-1} r_T]$$ 24 | 25 | * $p^o$ - discounted terminal distribution: disc prob of termination at $s'$ after option is initiated in $s$ 26 | 27 | $$p^o(s, s') = \mathbb{E}_{s,o}[\gamma^T \mathbb{1}_{\{s_T = s'\}}] = \sum_{k=1}^\infty \gamma^k \mathbb{P}_{s,o} \{s_T = s', T=k\}$$ 28 | 29 | * Universal option model: $\langle u^o, p^o \rangle$ where $u^o$ is the option's discounted state occupancy function. 30 | 31 | $$u^o(s, s') = \mathbb{E}_{s,o}[ \sum_{k=0}^{T-1} \gamma^k \mathbb{1}_{\{s_k = s'\}}]$$ 32 | 33 | and 34 | 35 | $$R^o(s) = \sum_{s' \in S} r^\pi(s') u^o(s,s')$$ 36 | 37 | * Theorem: Traditional option model can be constructed from the UOM and the reward vector of the option. The reward vector essentially conditions the universal model to a specific reward function. 38 | 39 | * Authors show that to compute the TD approx of the option return corresponding to R it suffices to find the LS approx of the expected one-step reward under the option and R, provided one is given the U matrix of the option. 40 | 41 | * TD(0)-based linear UOM - reward independence! 42 | * Construct multiple immediate reward models for different reward signals of interest 43 | * Compare to LOEM (linear option exp model) by Sorg and Singh (2010) - estimation from experience 44 | * Learning the LOEM model is faster (computing time) than learning the UOM for a single fixed reward function 45 | * UOM can produce accurate option return quickly for each new reward function 46 | 47 | 48 | #### Questions: 49 | 50 | * Read up on Dyna and more model-based RL! 51 | * Authors use computer time to compare approaches!!! very code dependent - very bad!!! -------------------------------------------------------------------------------- /Hierarchical Reinforcement Learning/2017_Florensa.md: -------------------------------------------------------------------------------- 1 | # Title: Stochastic NNs for HRL (2017) 2 | 3 | # Author: Florensa, Duan, Abbeel 4 | 5 | #### General Content: Intend to solve the exploration problem by combining HRL with intrinsic motivation. First, they learn skills in a pre-training env using proxy rewards which require minimal info (resembles intrinsic motivation - done with the help of an SNN with info-theoretic regularization, see the sketch below). Afterwards, they train a separate high-level policy for the task on top of the skills.
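Hedged sketch of the count-based mutual-information bonus described under the Regularization keypoint below: the centre-of-mass position is binned, and the per-step bonus is taken proportional to the log of the relative count of how often the current bin was visited while latent code z was active. Class name, parameters and the exact bonus form are illustrative assumptions, not the paper's implementation.

```python
from collections import defaultdict
import numpy as np

class MIBonusEstimator:
    """Count-based proxy for the mutual-information regularizer (sketch only)."""

    def __init__(self, bin_size=1.0, alpha=0.1):
        self.bin_size = bin_size  # width of the (x, y) centre-of-mass discretization
        self.alpha = alpha        # assumed bonus coefficient
        self.visits = defaultdict(lambda: defaultdict(int))  # z -> bin -> count
        self.totals = defaultdict(int)                        # z -> total visits

    def _bin(self, com_xy):
        return tuple((np.asarray(com_xy) // self.bin_size).astype(int))

    def update(self, com_xy, z):
        self.visits[z][self._bin(com_xy)] += 1
        self.totals[z] += 1

    def bonus(self, com_xy, z):
        # relative count of how often this bin was visited while code z was active
        if self.totals[z] == 0:
            return 0.0
        p = self.visits[z][self._bin(com_xy)] / self.totals[z]
        return self.alpha * np.log(p + 1e-8)
```

During pre-training the bonus would be added to the proxy reward at each step after calling `update`, pushing different latent codes towards visiting different regions.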
6 | 7 | 8 | #### Keypoints: 9 | 10 | * eps-greedy exploration/uniform Gaussian exploration noise fail in case of sparse rewards - long-term credit assignment problem - Two approaches: 11 | * HRL: composition of policies - reduce search space exponentially - require domain knowledge 12 | * Intrinsic motivation: guide exploration - hard to transfer knowledge 13 | 14 | * Stochastic NN: stochastic units in computation graph 15 | * feed latent variable with simple distribution as extra input to policy network 16 | * form joint embedding - concatenation (change bias term) or bilinear integration (change weights of layer) - fed to feedforward network 17 | * encourage diversity in policies by using mutual info as a regularizer 18 | 19 | * Lit: Learning skills in discrete domains: 20 | * Chentanez et al (2004) - Intrinsically motivated RL 21 | * Vigorito & Barto (2010) - Intrinsically motivated hierarchical skill learning in structured environments 22 | * Stolle & Precup (2002) - Learning options in reinforcement learning 23 | * Mannor et al (2004) - Dynamic abstraction via clustering 24 | * Simsek et al (2005) - Local graph partitioning 25 | 26 | * Partition MDPs into two components (one shared and one task specific), all MDPs have same action space - structural assumption: sharing of same agent space 27 | 28 | * Pre-training env: minimal setup required - design of proxy reward should encourage existence of locally optimal solutions - skills 29 | 30 | * Obtaining skills - sample latent code at beginning of every rollout of net - keep constant throughout entire rollout. After training, each of the latent codes corresponds to an interpretable skill - use for downstream tasks 31 | 32 | * Regularization: combat problem of different latent codes corresponding to similar skills 33 | * add additional reward bonus, prop. to mutual info between latent and current state - estimate by discretization - relative count of how often center of mass state was visited when code z was active 34 | 35 | * Freeze K skills after pre-training. For all MDPs train a new manager MM on top of frozen common skills 36 | * high-level policy gets full state as input and outputs a parametrization of the categorical distr from which a discrete action/skill z out of K is sampled 37 | 38 | * Optimization via trust region policy optimization - Schulman et al (2015) 39 | 40 | #### Questions: 41 | 42 | * What has to be manually defined? Learning params of TRPO, number of skills K, two network structures -------------------------------------------------------------------------------- /Hierarchical Reinforcement Learning/2018_Frans.md: -------------------------------------------------------------------------------- 1 | # Title: Meta Learning Shared Hierarchies 2 | 3 | # Author: Frans, Ho, Chen, Abbeel, Schulman 4 | 5 | #### General Content: Combine meta-learning (which uses info from past experiences to learn quickly) with HRL. They generalize the options framework (master policy over subpolicies) to the setting of task distributions (sharing of primitives within). 6 | 7 | 8 | #### Keypoints: 9 | 10 | * Write hierarchical problem in terms of optimization: set low-level motor primitives such that meta-policy learns quickly 11 | * Then approx problem: repeatedly reset master policy and adapt sub-policies for faster learning.
12 | * MLSH vs meta-learning: MLSH learns quickly over a large number of policy gradient updates 13 | * $<\pi_{\phi, \theta}>$ 14 | * $\phi$: params shared between tasks $\{\phi_1,...,\phi_k\}$ 15 | * $\theta$: params learned from scratch per task - state of learning process on that task $\to$ neural network that switches between/chooses subpolicies 16 | * Master policy that chooses specific $k$ - acts on slower time scale and fixed frequency $N$ 17 | * Sample MDP from $P_M$ - then agent is initialized with shared params $\phi$ and randomly initialized $\theta$ 18 | * Max over $\phi$ the expected reward sequence under prob distr of MDPs 19 | 20 | ALGO: 21 | 22 | 1. Sample M 23 | 24 | Repeat: 25 | 26 | 2. Initialize $agent(\theta, \phi^{t-1})$ 27 | 3. Warmup period $\to$ optimize master $\theta$ 28 | 4. Joint update period $\to$ optimize both $\theta, \phi$ 29 | 30 | * Warmup intuition: Only update $\phi$ if $\theta$ is close to optimal 31 | * Important aspect: No gradient passing between master and subpolicies 32 | * Also: Can easily replace policy gradient used in training with Q-Learning 33 | 34 | 35 | #### Questions: 36 | 37 | * Possibility of generalizing in one unified network where Master acts as input layer? 38 | * Look at Florensa paper - use info max objective for option discovery - similar to IGGI? 39 | * Problem of still having to specify different learning timescales as well as number of desired subpolicies! -------------------------------------------------------------------------------- /Hierarchical Reinforcement Learning/2018_Levy.md: -------------------------------------------------------------------------------- 1 | # Title: Hierarchical Reinforcement Learning with Hindsight 2 | 3 | # Author: Levy et al (2018) 4 | 5 | #### General Content: Authors propose to merge UVFA, HER and hierarchies of policies. Each hierarchical level proposes subgoals for the level below. Hindsight is then used to solve all MDPs within the hierarchy simultaneously. 6 | 7 | * 2 different ways to use hindsight: 8 | 1. Agent replays actions with different goals: learn to generalize to unseen goals 9 | 2. Agent replays higher-level decisions using the subgoal states achieved in hindsight as the subgoal actions. This provides a way for the agent to autonomously discover high-level subgoal actions belonging to a particular time scale. Furthermore, it leads to sample efficiency - allows agent to evaluate higher-level actions even when the lower-level layer has not fully learned to achieve higher-level subgoals. 10 | 11 | 12 | #### Keypoints: 13 | * Universal MDP: $U=(S, A, T, R, G)$ where $R: S \times A \times G \to \mathcal{R}$ or just $R(s,a,g)$ 14 | * Solve $U_{original}$ via learning a hierarchical policy that solves a hierarchy of k UMDPs $U_0, ..., U_{k-1}$ where each UMDP represents a different level of temporal abstraction 15 | * $U_0$: lowest level of hierarchy - $S_0 = S$, $A_0 = A$, $G_0 = S_0 =S$ 16 | * $U_i, 0