├── Open-Source.md
├── README.md
├── Reinforcement-Learning-Papers.md
└── papers
    ├── Action-Conditional Video Prediction using Deep Networks in Atari Games.md
    ├── Continuous Deep Q-Learning with Model-based Acceleration.md
    ├── Deep Successor Reinforcement Learning.md
    ├── Generalizing Skills with Semi-Supervised Reinforcement Learning.md
    ├── High-Dimensional Continuous Control Using Generalized Advantage Estimation.md
    ├── Human-level control through deep reinforcement learning.md
    ├── Improving Stochastic Policy Gradients in Continuous Control with Deep Reinforcement Learning using the Beta Distribution.md
    ├── Learning Modular Neural Network Policies for Multi-Task and Multi-Robot Transfer.md
    ├── Learning Tetris Using the Noisy Cross-Entropy Method.md
    ├── Mastering the game of Go with deep neural networks and tree search.md
    ├── Noisy Networks for Exploration.md
    ├── One-Shot Imitation Learning.md
    ├── Policy Distillation.md
    ├── Stochastic Neural Network For Hierarchical Reinforcement Learning.md
    ├── Towards Deep Symbolic Reinforcement Learning.md
    ├── Unsupervised Perceptual Rewards for Imitation Learning.md
    └── Value Iteration Networks.md

/Open-Source.md:
--------------------------------------------------------------------------------
# Open Source
A collection of helpful reinforcement learning open-source resources, including source code, tools, textbooks, and some inspiring blog posts.
Hope you will like it!

## :snake: Python users [TensorFlow, Theano]
- [OpenAI gym](https://gym.openai.com/)
  - RL **benchmarking** toolkit
  - Provides environments and evaluation metrics (a minimal usage sketch follows this list)
- [keras-rl](https://github.com/matthiasplappert/keras-rl)
  - Fully compatible with OpenAI Gym
  - Several algorithms are already implemented (e.g. DQN, DDQN, DDPG, CDQN)
- [TensorLayer](https://github.com/zsdonghao/tensorlayer)
  - Built on top of Google TensorFlow
- [rllab](https://github.com/rllab/rllab)
  - Fully compatible with Gym
  - Continuous control tasks
  - Nice for implementing new algorithms
  - [Benchmarking Deep Reinforcement Learning for Continuous Control](https://arxiv.org/abs/1604.06778)
- [KeRLym](https://github.com/osh/kerlym)
  - Built on Keras
  - Fully compatible with OpenAI Gym
  - Hosts a handful of reinforcement learning agents
  - [Deep Reinforcement Learning Radio Control and Signal Detection with KeRLym, a Gym RL Agent](http://arxiv.org/abs/1605.09221)
- [Deep Reinforcement Learning in TensorFlow](https://github.com/carpedm20/deep-rl-tensorflow)
  - Implemented by @carpedm20
  - Includes several basic reinforcement learning algorithms
- [OpenAI baselines](https://github.com/openai/baselines)
  - High-quality RL code (highly recommended)
  - Provides pretrained policies
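A minimal sketch of the Gym interaction loop that the toolkits above build on. The environment id and the random-action "agent" are placeholders for illustration, not recommendations:

```python
import gym

# Any registered environment id works; CartPole is only an example.
env = gym.make("CartPole-v0")

for episode in range(5):
    observation = env.reset()
    done, episode_return = False, 0.0
    while not done:
        action = env.action_space.sample()                 # random placeholder policy
        observation, reward, done, info = env.step(action)  # classic (obs, reward, done, info) API
        episode_return += reward
    print("episode", episode, "return", episode_return)

env.close()
```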

## :flashlight: Lua users [Torch]
- [rltorch](https://github.com/ludc/rltorch), a basic reinforcement learning package
- [awesome-torch for reinforcement learning](https://github.com/carpedm20/awesome-torch#reinforcement-learning)
  - A list of open-source Torch projects for reinforcement learning
- [torch-twrl](https://github.com/twitter/torch-twrl), maintained by Twitter
  - [Reinforcement Learning for Torch: Introducing torch-twrl](https://blog.twitter.com/2016/reinforcement-learning-for-torch-introducing-torch-twrl)

## Course
- [CS 294: Deep Reinforcement Learning](http://rll.berkeley.edu/deeprlcourse/#related-materials)
  - Instructors: John Schulman, Pieter Abbeel
- [UC Berkeley CS188 Intro to AI](http://ai.berkeley.edu/home.html)
  - [2013 Spring video](https://www.youtube.com/user/CS188Spring2013) on YouTube
- [Advanced Topics: RL](http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html)
  - Instructor: David Silver
- [Deep learning videos at Oxford 2015](https://www.youtube.com/playlist?list=PLE6Wd9FR--EfW8dtjAuPoTuPcqmOV53Fu)
  - Instructor: Nando de Freitas
  - Lectures 15 and 16 are strongly related to reinforcement learning

## Textbook
- [Foundations of Machine Learning](http://www.cs.nyu.edu/~mohri/mlbook/)
  - Chapter 14: Reinforcement learning

## Misc
- [A collection of Deep Learning resources](http://www.jeremydjacksonphd.com/category/deep-learning/)
- [Deep Reinforcement Learning: Pong from Pixels](http://karpathy.github.io/2016/05/31/rl/), from Andrej Karpathy's blog
  - Policy gradient, explained very clearly (the core estimator is restated after this list)
  - Many useful links inside
- [Guest Post (Part I): Demystifying Deep Reinforcement Learning](https://www.nervanasys.com/demystifying-deep-reinforcement-learning/)
- [Reinforcement Learning and Control](http://cs229.stanford.edu/notes/cs229-notes12.pdf), lecture notes from Andrew Ng
  - Basic reinforcement learning
  - Continuous-state MDPs
- [DEEP REINFORCEMENT LEARNING](https://deepmind.com/blog), by David Silver, Google DeepMind
  - Briefly discusses some work done by DeepMind
- [What are the benefits of actor/critic framework in reinforcement learning?](https://www.quora.com/What-are-the-benefits-of-actor-critic-framework-in-reinforcement-learning)
  - Clearly explains the advantages of actor/critic methods
- [Deep Reinforcement Learning: A Tutorial](https://gym.openai.com/docs/rl), from OpenAI
  - A good kick-off for newcomers to reinforcement learning
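For the policy-gradient material above (e.g. Pong from Pixels), the score-function (REINFORCE) estimator those posts build on can be written, in standard notation, as:

```latex
% REINFORCE / score-function policy gradient, with R(\tau) the return of trajectory \tau
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\, R(\tau) \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]
```

Actor-critic methods (see the Quora link above) replace R(τ) with a learned baseline or advantage estimate to reduce the variance of this estimator.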

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Deep Reinforcement Learning survey
This paper list is a bit different from others: I put my own opinions and summaries on it. However, to fully understand a paper, you still have to read it yourself!
Of course, any pull request or discussion is welcome!

## Before Jumping into Deep Reinforcement Learning
If you are new to deep reinforcement learning, I suggest you read the blog posts and open courses first.

## Outline
- Reinforcement Learning Papers
  - [Human-level control through deep reinforcement learning](https://github.com/andrewliao11/Deep-Reinforcement-Learning-Survey/blob/master/papers/Human-level%20control%20through%20deep%20reinforcement%20learning.md)
  - [Mastering the game of Go with deep neural networks and tree search](https://github.com/andrewliao11/Deep-Reinforcement-Learning-Survey/blob/master/papers/Mastering%20the%20game%20of%20Go%20with%20deep%20neural%20networks%20and%20tree%20search.md)
  - [Deep Successor Reinforcement Learning](https://github.com/andrewliao11/Deep-Reinforcement-Learning-Survey/blob/master/papers/Deep%20Successor%20Reinforcement%20Learning.md)
  - [Action-Conditional Video Prediction using Deep Networks in Atari Games](https://github.com/andrewliao11/Deep-Reinforcement-Learning-Survey/blob/master/papers/Action-Conditional%20Video%20Prediction%20using%20Deep%20Networks%20in%20Atari%20Games.md)
  - [Policy Distillation](https://github.com/andrewliao11/Deep-Reinforcement-Learning-Survey/blob/master/papers/Policy%20Distillation.md)
  - [Learning Tetris Using the Noisy Cross-Entropy Method](https://github.com/andrewliao11/Deep-Reinforcement-Learning-Survey/blob/master/papers/Learning%20Tetris%20Using%20the%20Noisy%20Cross-Entropy%20Method.md), with **code**
  - [Continuous Deep Q-Learning with Model-based Acceleration](https://github.com/andrewliao11/Deep-Reinforcement-Learning-Survey/blob/master/papers/Continuous%20Deep%20Q-Learning%20with%20Model-based%20Acceleration.md)
  - [Value Iteration Networks](https://github.com/andrewliao11/Deep-Reinforcement-Learning-Survey/blob/master/papers/Value%20Iteration%20Networks.md)
  - [Learning Modular Neural Network Policies for Multi-Task and Multi-Robot Transfer](https://github.com/andrewliao11/Deep-Reinforcement-Learning-Survey/blob/master/papers/Learning%20Modular%20Neural%20Network%20Policies%20for%20Multi-Task%20and%20Multi-Robot%20Transfer.md)
  - [Stochastic Neural Network For Hierarchical Reinforcement Learning](https://github.com/andrewliao11/Deep-Reinforcement-Learning-Survey/blob/master/papers/Stochastic%20Neural%20Network%20For%20Hierarchical%20Reinforcement%20Learning.md)
  - [Noisy Networks for Exploration](https://github.com/andrewliao11/Deep-Reinforcement-Learning-Survey/blob/master/papers/Noisy%20Networks%20for%20Exploration.md)
  - [Improving Stochastic Policy Gradients in Continuous Control with Deep Reinforcement Learning using the Beta Distribution](https://github.com/andrewliao11/Deep-Reinforcement-Learning-Survey/blob/master/papers/Improving%20Stochastic%20Policy%20Gradients%20in%20Continuous%20Control%20with%20Deep%20Reinforcement%20Learning%20using%20the%20Beta%20Distribution.md)
  - [High-Dimensional Continuous Control Using Generalized Advantage Estimation](https://github.com/andrewliao11/Deep-Reinforcement-Learning-Survey/blob/master/papers/High-Dimensional%20Continuous%20Control%20Using%20Generalized%20Advantage%20Estimation.md)
  - [Generalizing Skills with Semi-Supervised Reinforcement Learning](https://github.com/andrewliao11/Deep-Reinforcement-Learning-Survey/blob/master/papers/Generalizing%20Skills%20with%20Semi-Supervised%20Reinforcement%20Learning.md)
  - [Unsupervised Perceptual Rewards for Imitation Learning](https://github.com/andrewliao11/Deep-Reinforcement-Learning-Survey/blob/master/papers/Unsupervised%20Perceptual%20Rewards%20for%20Imitation%20Learning.md)
  - [Towards Deep Symbolic Reinforcement Learning](https://github.com/andrewliao11/Deep-Reinforcement-Learning-Survey/blob/master/papers/Towards%20Deep%20Symbolic%20Reinforcement%20Learning.md)
  - [others](https://github.com/andrewliao11/Deep-Reinforcement-Learning-Survey/blob/master/Reinforcement-Learning-Papers.md)
- [Open Source](https://github.com/andrewliao11/Deep-Reinforcement-Learning-Survey/blob/master/Open-Source.md#open-source)
  - Python users
  - Lua users
- [Courses](https://github.com/andrewliao11/Deep-Reinforcement-Learning-Survey/blob/master/Open-Source.md#course)
- [Textbook](https://github.com/andrewliao11/Deep-Reinforcement-Learning-Survey/blob/master/Open-Source.md#textbook)
- [Misc](https://github.com/andrewliao11/Deep-Reinforcement-Learning-Survey/blob/master/Open-Source.md#misc)

## [**Useful**] Learn Reinforcement Learning
:point_right: [dennybritz/reinforcement-learning](https://github.com/dennybritz/reinforcement-learning)
:point_right: [David Silver's lecture on policy gradient](https://www.youtube.com/watch?v=KHZVXao4qXs)
:point_right: [Deep Reinforcement Learning (Berkeley CS 294)](http://rll.berkeley.edu/deeprlcourse/)
--------------------------------------------------------------------------------
/Reinforcement-Learning-Papers.md:
--------------------------------------------------------------------------------
# Reinforcement Learning Papers
***Mistakes teach us to clarify what we really want and how we want to live.*** That's the spirit of reinforcement learning: learning from mistakes. Let's be explorers in reinforcement learning!

- ***Gradient Estimation Using Stochastic Computation Graphs*** [[NIPS 2015]](https://arxiv.org/abs/1506.05254)
  - John Schulman, Nicolas Heess, Theophane Weber, Pieter Abbeel
- ***Maximum Entropy Inverse Reinforcement Learning*** [[AAAI 2008]](https://www.cs.uic.edu/pub/Ziebart/Publications/maxentirl-bziebart.pdf)
  - Brian D. Ziebart, Andrew Maas, J. Andrew Bagnell, and Anind K. Dey
  - Closer to the real case :point_right: handles the suboptimal setting, where optimal demonstrations cannot cover the whole state space, and alleviates reward-function ambiguity
  - Basic concept: plans with equivalent rewards have equal probabilities, and plans with higher rewards are exponentially more preferred (roughly, P(τ) ∝ exp(R(τ)))
- ***Reinforcement Learning with Unsupervised Auxiliary Tasks*** [[arXiv 2016]](https://128.84.21.199/abs/1611.05397)
  - Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, Koray Kavukcuoglu
  - Introduces an agent that also maximises many other **pseudo-reward** functions simultaneously by reinforcement learning
  - The advantage of auxiliary tasks: in many environments the extrinsic reward is very sparse, which makes the feature extractor hard to learn at the beginning. Giving some pseudo-reward lets the learner know how to interpret the image at the initial stage.
  - They propose two main auxiliary tasks: pixel changes and network features
    - Pixel changes: maximally changing the pixels in each cell of an n×n non-overlapping grid placed over the input image :point_right: encourages the learner to move faster or avoid standing still (I guess)
    - Network features: maximally activating each of the units in a specific hidden layer :point_right: to make full use of the hidden units
  - Section 4.1 "Unsupervised Reinforcement Learning" discusses why they do not use a pixel reconstruction loss
- ***Apprenticeship Learning via Inverse Reinforcement Learning*** [[ICML 2004]](http://dl.acm.org/citation.cfm?id=1015430)
  - Pieter Abbeel, Andrew Y. Ng
  - The paper in which apprenticeship learning was first proposed
  - Most earlier methods try to directly mimic the demonstrator by applying a **supervised learning** algorithm to learn a direct mapping from states to actions :point_right: only suitable when the task is to mimic the expert's trajectory
  - The reward function, rather than the policy or the value function, is the most succinct, robust, and transferable definition of the task
  - Basic concept: use inverse reinforcement learning to recover the reward function from the expert, then use that reward function to find the optimal policy
  - Actually, apprenticeship learning does not need to find the correct reward function. Instead, it uses the predicted reward function to find a policy that is similar to the expert's.
  - From the experiments, apprenticeship learning needs fewer sample trajectories than action mimicking
- ***Algorithms for Inverse Reinforcement Learning*** [[ICML 2000]](http://www.andrewng.org/portfolio/algorithms-for-inverse-reinforcement-learning/)
  - Andrew Y. Ng, Stuart Russell
  - In examining animal and human behavior, we must consider the reward function as an unknown to be ascertained through empirical investigation
  - Recovers the expert's reward function and uses it to generate the desired behavior
  - Uses eq. 4 (see the derivation in the paper, which is quite clear)
  - Avoids reward-function ambiguity by adding the margin λ|R|
  - Uses a feature representation for the reward function
  - Since the reward function is assumed to be a linear combination of features, this implies that **if two policies have similar accumulated feature expectations, their accumulated rewards are similar**
  - The experiment section shows IRL is solvable at least in **moderately sized** discrete and continuous spaces
  - Reference: [Inverse Reinforcement Learning](https://people.eecs.berkeley.edu/~pabbeel/cs287-fa12/slides/inverseRL.pdf), by Pieter Abbeel
- ***Deep Reinforcement Learning with a Natural Language Action Space*** [[ACL 2016]](https://arxiv.org/abs/1511.04636)
  - Ji He, Jianshu Chen, Xiaodong He, Jianfeng Gao, Lihong Li, Li Deng, Mari Ostendorf
  - Task: a string of text :point_right: state; several strings of text :point_right: potential actions
  - other
- ***Imitation Learning with Recurrent Neural Networks*** [[arXiv 2016]](https://arxiv.org/abs/1607.05241)
  - Khanh Nguyen
  - Aims to unify two families of sequence prediction models: learning to search (L2S) and recurrent neural networks
  - A supervised recurrent neural network makes an independent prediction at each time step, which suffers seriously from **compounding errors**, since the input observations are correlated and thus violate the independent and identically distributed (i.i.d.) assumption required by any supervised approach
  - L2S algorithms reduce a sequential prediction problem to learning a policy that traverses a search space with minimum cost, which guarantees that compounding errors grow only linearly with trajectory length
  - Mapping the RNN components to L2S (taking sequence-to-sequence as an example):
    - hidden representation :point_right: S_t
    - decoded word :point_right: A_t
    - encoded word :point_right: X_t (new information from the environment)
    - The non-linear gates in the RNN serve as the transition function
- ***Language Understanding for Text-based Games Using Deep Reinforcement Learning*** [[EMNLP 2015]](https://arxiv.org/abs/1506.08941)
  - Karthik Narasimhan, Tejas Kulkarni, Regina Barzilay
  - Uses natural language as the state representation, with a fixed action space (it does not output natural language in free form :point_right: major restriction)
  - :star: Challenging part: the environment is not directly observable
  - The task includes text interpretation and learning a strategy built on top of that interpretation
  - Uses an LSTM to interpret the state (in natural-language form), and a DQN to select the corresponding action
  - Basically follows the [DeepMind paper](http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html), with experience replay and mini-batch updates
  - Using t-SNE for the representation analysis is really cool (fig. 5)
- :star: ***High-Dimensional Continuous Control Using Generalized Advantage Estimation*** [[ICLR 2016]](https://arxiv.org/abs/1506.02438)
  - John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, Pieter Abbeel
  - In extremely high-dimensional tasks (like continuous control in 3D environments), stability is a key point
  - Proposes an effective variance-reduction scheme for policy gradients, called generalized advantage estimation (GAE); a compact statement of the estimator is given right after this paper list
  - Motivation of GAE: suppose we have a fixed number of steps; from eq. 15 we know that the bias of each advantage estimator is **k-dependent**. As k increases, the bias term becomes more negligible while the variance increases, and vice versa. (If this feels abstract, recall that MC is unbiased but has high variance, while TD is biased but has low variance.)
  - ***λ*** is the new concept introduced in this paper:
    - If λ = 0 (as in eq. 17), we have low variance but a biased estimator
    - If λ = 1 (as in eq. 18), we have high variance but an unbiased estimator
- :star: ***Recurrent Models of Visual Attention*** [[NIPS 2014]](https://arxiv.org/abs/1406.6247)
  - Volodymyr Mnih, Nicolas Heess, Alex Graves, Koray Kavukcuoglu
  - Motivation: processing large images is computationally expensive, and the cost of many attention methods is proportional to the image size
  - Uses action control to attend to part of the image (defines a Gaussian and treats the location, i.e. the mean of the Gaussian, as the action)
  - Can be viewed as a POMDP (partially observable Markov decision process)
  - The location network is a 2D Gaussian distribution
  - The action is stochastically drawn from the distribution of the location network
  - The reward can be task-dependent (in this paper it is used for classification)
  - Optimized with policy gradients
- ***Deterministic Policy Gradient Algorithms*** [[ICML 2014]](http://jmlr.org/proceedings/papers/v32/silver14.pdf)
  - D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, M. Riedmiller
  - The deterministic policy gradient is just a special case of the stochastic policy gradient
  - **Problem with the stochastic policy gradient**: as the policy becomes more and more deterministic, the variance of the policy gradient becomes larger and larger; in the limit the policy collapses to a delta function :point_right: we end up computing the gradient of Q
  - The stochastic policy gradient ends up calculating the gradient of Q(s,a), where a is the mean of the policy (assuming the policy is a normal distribution)
  - Intuition: update the policy in the direction that ***most improves Q***
  - The trade-off of a deterministic policy: exploration :point_right: off-policy deterministic actor-critic (exploration as in Q-learning)
  - In continuous action spaces, greedy policy improvement becomes problematic, requiring a global maximisation at every step. Instead, a simple and computationally attractive alternative is to move the policy in the direction of the gradient of Q, rather than globally maximising Q
  - David Silver's talk on [deterministic policy gradients](http://techtalks.tv/talks/deterministic-policy-gradient-algorithms/61098/) at ICML :point_right: very clear!
- ***Prioritized Experience Replay*** [[ICML 2016]](http://arxiv.org/abs/1511.05952)
  - Tom Schaul, John Quan, Ioannis Antonoglou, David Silver
  - Uses prioritized sampling rather than uniform sampling
  - Uses the transition's TD error δ, which indicates how **"surprising" or "unexpected" the transition is**
  - Alleviates the loss of diversity with stochastic prioritization, and corrects the introduced bias with importance sampling
  - Stochastic prioritization: a mixture of pure greedy prioritization and uniform random sampling
- ***Deep Reinforcement Learning with Double Q-learning*** [[AAAI 2016]](http://arxiv.org/abs/1509.06461)
  - Hado van Hasselt, Arthur Guez, David Silver
  - Deals with the overestimation of Q-values
  - Separates the Q-network used for action selection from the Q-network used for evaluation
- :star: ***Asynchronous Methods for Deep Reinforcement Learning*** [[ICML 2016]](https://arxiv.org/abs/1602.01783)
  - Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, Koray Kavukcuoglu
  - On-policy updates
  - Implementation from others: [async-rl](https://github.com/muupan/async-rl)
  - [Asynchronous SGD](https://cxwangyi.wordpress.com/2013/04/09/why-asynchronous-sgd-works-better-than-its-synchronous-counterpart/), which explains what "asynchronous" means
  - [Tuning Deep Learning Episode 1: DeepMind's A3C in Torch](http://www.allinea.com/blog/201607/tuning-deep-learning-episode-1-deepminds-a3c-torch)
- :star: ***Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection*** [[arXiv 2016]](http://arxiv.org/abs/1603.02199)
  - Sergey Levine, Peter Pastor, Alex Krizhevsky, Deirdre Quillen
  - [Deep Learning for Robots: Learning from Large-Scale Interaction](https://research.googleblog.com/2016/03/deep-learning-for-robots-learning-from.html)
- :star: ***Dueling Network Architectures for Deep Reinforcement Learning*** [[ICML 2016]](http://arxiv.org/abs/1511.06581)
  - Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, Nando de Freitas
  - Best Paper at ICML 2016
  - Poses the question: is a conventional CNN suitable for RL tasks?
  - Two-stream network (state-value and advantage functions)
  - Focuses on innovating a neural network architecture that is better suited for model-free RL
  - Torch blog - [Dueling Deep Q-Networks](http://torch.ch/blog/2016/04/30/dueling_dqn.html)
- ***Control of Memory, Active Perception, and Action in Minecraft*** [[arXiv 2016]](https://arxiv.org/abs/1605.09128)
  - Junhyuk Oh, Valliappa Chockalingam, Satinder Singh, Honglak Lee
  - Tackles problems concerning partial observability
  - Proposes a set of Minecraft tasks
  - Memory Q-Network (MQN), Recurrent Memory Q-Network (RMQN), and Feedback Recurrent Memory Q-Network (FRMQN)
- :star: ***Continuous Control With Deep Reinforcement Learning*** [[ICLR 2016]](http://arxiv.org/abs/1509.02971)
  - Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, Daan Wierstra
  - Solves continuous control tasks and avoids the curse of **dimensionality**
  - A **deep** version of DPG (deterministic policy gradient)
  - Going deep raises some issues: it is unstable to use a non-linear function approximator
  - The different components of the observation may have different physical units, and the ranges may vary across environments :point_right: solved by batch normalization
  - For exploration, noise is added to the actor policy: µ′(s_t) = µ(s_t | θ^µ_t) + N
- ***Active Object Localization with Deep Reinforcement Learning*** [[ICCV 2015]](http://arxiv.org/abs/1511.06015)
  - Juan C. Caicedo, Svetlana Lazebnik
  - The agent learns to deform a bounding box using simple transformation actions (maps the object detection task to RL)
  - Ideas similar to [G-CNN: an Iterative Grid Based Object Detector](http://arxiv.org/abs/1512.07729)
- ***Memory-based control with recurrent neural networks*** [[NIPS 2015 Deep Reinforcement Learning Workshop]](http://arxiv.org/abs/1512.04455)
  - Nicolas Heess, Jonathan J Hunt, Timothy P Lillicrap, David Silver
  - Uses an RNN to solve partially observed problems
- ***Playing Atari with Deep Reinforcement Learning*** [[NIPS 2013 Deep Learning Workshop]](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf)
  - Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra
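As referenced in the GAE entry above, the generalized advantage estimator can be stated compactly (following the paper's notation, with the TD residual δ):

```latex
% TD residual and the generalized advantage estimator (GAE)
\delta_t^{V} = r_t + \gamma V(s_{t+1}) - V(s_t)
\qquad
\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^{l}\, \delta_{t+l}^{V}
```

Setting λ = 0 recovers the one-step TD estimate (low variance, biased), while λ = 1 recovers the Monte Carlo advantage (unbiased, high variance), matching the λ discussion in that entry.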

# Suggested Papers
- ***Maximum Entropy Inverse Reinforcement Learning*** [[AAAI 2008]](https://www.cs.uic.edu/pub/Ziebart/Publications/maxentirl-bziebart.pdf)
--------------------------------------------------------------------------------
/papers/Action-Conditional Video Prediction using Deep Networks in Atari Games.md:
--------------------------------------------------------------------------------
# Action-Conditional Video Prediction using Deep Networks in Atari Games

- Long-term predictions on Atari games, conditioned on the action
- Uses the predicted frames (which are more informative) in place of random exploration, to improve the model-free controller
- Multiplicative action-conditional transformation: if the action is represented as a one-hot vector :point_right: each action corresponds to its own transformation matrix (see the rough sketch after this list)
- Learning with multi-step prediction (minimize the accumulated error over k steps)
- Section 4.2 is really promising!
  - replace the real frames with predicted frames
  - use the prediction model to help the agent explore the states visited least often
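A rough NumPy sketch of the one-hot case described above: with a one-hot action, a multiplicative interaction between the action and the encoded features reduces to applying an action-specific transformation matrix. Shapes and names are illustrative, not taken from the paper's code.

```python
import numpy as np

num_actions, feat_dim = 4, 8
# One transformation matrix per action (illustrative sizes).
W = np.random.randn(num_actions, feat_dim, feat_dim)

h = np.random.randn(feat_dim)      # encoded frame features
a = np.eye(num_actions)[2]         # one-hot action (index 2)

# Multiplicative action conditioning: sum_k a_k * W[k] @ h ...
h_next = np.einsum('k,kij,j->i', a, W, h)
# ... which, for a one-hot action, is exactly that action's own transformation applied to h.
assert np.allclose(h_next, W[2] @ h)
```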
--------------------------------------------------------------------------------
/papers/Continuous Deep Q-Learning with Model-based Acceleration.md:
--------------------------------------------------------------------------------
# Continuous Deep Q-Learning with Model-based Acceleration

Previous work on model-free **continuous** control mainly falls into two groups: policy-search-based methods and actor-critic algorithms that integrate a value function, e.g. DDPG, Dyna-Q.

The difficulty of using Q-learning in continuous control is computing ```argmax Q(s,a)```. This work derives **NAF**, a variant of the Q function, and proposes **imagination rollouts** to combine model-based and model-free methods.

The authors decompose the Q function into a value function and an advantage function, with the advantage function parameterized as a quadratic function (the reason why this works still surprises me, though); see the formula sketch at the end of this note.

Secondly, the authors propose to incorporate model-based techniques to increase the sample efficiency of Q-learning: iLQG is used to iteratively fit a local model, using samples from the recent episodes.

## keypoints
- Decompose Q(s,µ) as V(s) + A(s,µ), and assume that µ(s|θ) always gives the action that maximizes Q(s,µ)
- Use a simple linear model to iteratively fit the model (environment)
- It would be difficult to use a non-linear neural network to learn the model (environment), since neural networks are mainly useful with large amounts of data, and the data collected here is non-i.i.d.

## note/question
- Compared with DDPG:
  - pros: simpler, converges faster
  - cons: no theoretical proof
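For reference, the quadratic parameterization mentioned above takes roughly the following form (up to notation) in the NAF paper:

```latex
% NAF: quadratic advantage around the policy action \mu(s)
Q(s, a) = V(s) + A(s, a), \qquad
A(s, a) = -\tfrac{1}{2}\,\bigl(a - \mu(s)\bigr)^{\top} P(s)\, \bigl(a - \mu(s)\bigr)
```

Here P(s) is a state-dependent positive-definite matrix, so a = µ(s) always maximizes Q(s, a), which is exactly the assumption listed under the keypoints.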
--------------------------------------------------------------------------------
/papers/Deep Successor Reinforcement Learning.md:
--------------------------------------------------------------------------------
# Deep Successor Reinforcement Learning

- Successor representation (SR): decomposes the value into a successor map and a reward predictor (for the definition of the components of the successor representation, see section 3.2)
- Advantage 1: increases sensitivity to changes in the environment, since it records the immediate reward in each state. In DQN, we only record (or predict) the accumulated reward, so a sudden change in reward is diluted. The SR method, however, records the reward R(s) in each state, which makes it more sensitive to changes in the environment.
- Advantage 2: able to extract bottleneck states (subgoals). Since we predict the successor map (the predicted visit counts), states with higher visit counts are likely to be bottlenecks.
- Section 3.3 is a little tricky. m_sa represents the **features of future occupancy**, so m_sa · w is the future accumulated reward (the Q-value), while φ(s) · w is the **immediate reward**. Therefore, m_sa = φ(s) + γ E[m_{s_{t+1} a'}] :point_right: eq. 6
--------------------------------------------------------------------------------
/papers/Generalizing Skills with Semi-Supervised Reinforcement Learning.md:
--------------------------------------------------------------------------------
# Generalizing Skills with Semi-Supervised Reinforcement Learning

Chelsea Finn, Tianhe Yu, Justin Fu, Pieter Abbeel, Sergey Levine

## keypoints
- Trains an RL agent over both reward MDPs and no-reward MDPs
- For the no-reward MDPs: uses guided cost learning (Finn et al., 2016)
- IRL: re-optimizes the policy in the novel environment (interaction with the no-reward MDPs is allowed)
- For RL: uses guided policy search; for IRL: uses guided cost learning

## related to IRL
- IRL is an ill-defined problem with no exact solution: many reward functions can explain the optimal policy
- It therefore suffers from reward ambiguity
--------------------------------------------------------------------------------
/papers/High-Dimensional Continuous Control Using Generalized Advantage Estimation.md:
--------------------------------------------------------------------------------
# High-Dimensional Continuous Control Using Generalized Advantage Estimation

This paper introduces an extra hyperparameter λ, which interpolates between two policy-gradient-based reinforcement learning methods (REINFORCE and TD).
In REINFORCE, the advantage function is:
