├── Open-Source.md
├── README.md
├── Reinforcement-Learning-Papers.md
└── papers
    ├── Action-Conditional Video Prediction using Deep Networks in Atari Games.md
    ├── Continuous Deep Q-Learning with Model-based Acceleration.md
    ├── Deep Successor Reinforcement Learning.md
    ├── Generalizing Skills with Semi-Supervised Reinforcement Learning.md
    ├── High-Dimensional Continuous Control Using Generalized Advantage Estimation.md
    ├── Human-level control through deep reinforcement learning.md
    ├── Improving Stochastic Policy Gradients in Continuous Control with Deep Reinforcement Learning using the Beta Distribution.md
    ├── Learning Modular Neural Network Policies for Multi-Task and Multi-Robot Transfer.md
    ├── Learning Tetris Using the Noisy Cross-Entropy Method.md
    ├── Mastering the game of Go with deep neural networks and tree search.md
    ├── Noisy Networks for Exploration.md
    ├── One-Shot Imitation Learning.md
    ├── Policy Distillation.md
    ├── Stochastic Neural Network For Hierarchical Reinforcement Learning.md
    ├── Towards Deep Symbolic Reinforcement Learning.md
    ├── Unsupervised Perceptual Rewards for Imitation Learning.md
    └── Value Iteration Networks.md

/Open-Source.md:
--------------------------------------------------------------------------------
# Open Source
A collection of helpful reinforcement learning open-source resources, including source code, tools, textbooks, and some inspiring blog posts.
Hope you will like it!

## :snake: Python users [TensorFlow, Theano]
- [OpenAI gym](https://gym.openai.com/)
  - RL **benchmarking** toolkit
  - Provides environments and evaluation metrics
- [keras-rl](https://github.com/matthiasplappert/keras-rl)
  - Fully compatible with OpenAI Gym
  - Some algorithms have been implemented (e.g., DQN, DDQN, DDPG, CDQN)
- [TensorLayer](https://github.com/zsdonghao/tensorlayer)
  - Built on top of Google TensorFlow
- [rllab](https://github.com/rllab/rllab)
  - Fully compatible with Gym
  - Continuous control tasks
  - Nice for implementing new algorithms
  - [Benchmarking Deep Reinforcement Learning for Continuous Control](https://arxiv.org/abs/1604.06778)
- [KeRLym](https://github.com/osh/kerlym)
  - Built on Keras
  - Fully compatible with OpenAI Gym
  - Hosts a handful of reinforcement learning agents
  - [Deep Reinforcement Learning Radio Control and Signal Detection with KeRLym, a Gym RL Agent](http://arxiv.org/abs/1605.09221)
- [Deep Reinforcement Learning in TensorFlow](https://github.com/carpedm20/deep-rl-tensorflow)
  - Implemented by @carpedm20
  - Includes some basic reinforcement learning algorithms
- [OpenAI baselines](https://github.com/openai/baselines)
  - High-quality RL code (highly recommended)
  - Provides pretrained policies

## :flashlight: Lua users [Torch]
- [rltorch](https://github.com/ludc/rltorch), a basic reinforcement learning package
- [awesome-torch for reinforcement learning](https://github.com/carpedm20/awesome-torch#reinforcement-learning)
  - List of open-source projects for reinforcement learning
- [torch-twrl](https://github.com/twitter/torch-twrl), maintained by Twitter
  - [Reinforcement Learning for Torch: Introducing torch-twrl](https://blog.twitter.com/2016/reinforcement-learning-for-torch-introducing-torch-twrl)

## Course
- [CS 294: Deep Reinforcement Learning](http://rll.berkeley.edu/deeprlcourse/#related-materials)
  - Instructors: John Schulman, Pieter Abbeel
- [UC Berkeley CS188 Intro to AI](http://ai.berkeley.edu/home.html)
  - [2013 Spring video](https://www.youtube.com/user/CS188Spring2013) on YouTube
- [Advanced Topics: RL](http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html)
  - Instructor: David Silver
- [Deep learning videos at Oxford 2015](https://www.youtube.com/playlist?list=PLE6Wd9FR--EfW8dtjAuPoTuPcqmOV53Fu)
  - Instructor: Nando de Freitas
  - Lectures 15 and 16 are strongly related to reinforcement learning

## Textbook
- [Foundations of Machine Learning](http://www.cs.nyu.edu/~mohri/mlbook/)
  - Chapter 14: Reinforcement Learning

## Misc
- [A collection of Deep Learning resources](http://www.jeremydjacksonphd.com/category/deep-learning/)
- [Deep Reinforcement Learning: Pong from Pixels](http://karpathy.github.io/2016/05/31/rl/), from Andrej Karpathy's blog
  - Policy gradient (very clear!)
  - Many useful links inside
- [Guest Post (Part I): Demystifying Deep Reinforcement Learning](https://www.nervanasys.com/demystifying-deep-reinforcement-learning/)
- [Reinforcement Learning and Control](http://cs229.stanford.edu/notes/cs229-notes12.pdf), lecture notes from Andrew Ng
  - Basic reinforcement learning
  - Continuous-state MDPs
- [DEEP REINFORCEMENT LEARNING](https://deepmind.com/blog), from David Silver, Google DeepMind
  - Briefly discusses some work done by DeepMind
- [What are the benefits of actor/critic framework in reinforcement learning?](https://www.quora.com/What-are-the-benefits-of-actor-critic-framework-in-reinforcement-learning)
  - Clearly explains the advantages of actor/critic
- [Deep Reinforcement Learning: A Tutorial](https://gym.openai.com/docs/rl), from OpenAI
  - A good starting point for newcomers to reinforcement learning

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Deep Reinforcement Learning survey
This paper list is a bit different from others: I add my own opinions and summaries to each entry. However, to understand a paper fully, you still have to read it yourself!
Of course, any pull requests or discussions are welcome!

## Before Jumping into Deep Reinforcement Learning
If you're new to deep reinforcement learning, I suggest reading the blog posts and open courses first.

## Outline
- Reinforcement Learning Papers
  - [Human-level control through deep reinforcement learning](https://github.com/andrewliao11/Deep-Reinforcement-Learning-Survey/blob/master/papers/Human-level%20control%20through%20deep%20reinforcement%20learning.md)
  - [Mastering the game of Go with deep neural networks and tree search](https://github.com/andrewliao11/Deep-Reinforcement-Learning-Survey/blob/master/papers/Mastering%20the%20game%20of%20Go%20with%20deep%20neural%20networks%20and%20tree%20search.md)
  - [Deep Successor Reinforcement Learning](https://github.com/andrewliao11/Deep-Reinforcement-Learning-Survey/blob/master/papers/Deep%20Successor%20Reinforcement%20Learning.md)
  - [Action-Conditional Video Prediction using Deep Networks in Atari Games](https://github.com/andrewliao11/Deep-Reinforcement-Learning-Survey/blob/master/papers/Action-Conditional%20Video%20Prediction%20using%20Deep%20Networks%20in%20Atari%20Games.md)
  - [Policy Distillation](https://github.com/andrewliao11/Deep-Reinforcement-Learning-Survey/blob/master/papers/Policy%20Distillation.md)
  - [Learning Tetris Using the Noisy Cross-Entropy Method](https://github.com/andrewliao11/Deep-Reinforcement-Learning-Survey/blob/master/papers/Learning%20Tetris%20Using%20the%20Noisy%20Cross-Entropy%20Method.md), with **code**
  - [Continuous Deep Q-Learning with Model-based Acceleration](https://github.com/andrewliao11/Deep-Reinforcement-Learning-Survey/blob/master/papers/Continuous%20Deep%20Q-Learning%20with%20Model-based%20Acceleration.md)
  - [Value Iteration Networks](https://github.com/andrewliao11/Deep-Reinforcement-Learning-Survey/blob/master/papers/Value%20Iteration%20Networks.md)
  - [Learning Modular Neural Network Policies for Multi-Task and Multi-Robot Transfer](https://github.com/andrewliao11/Deep-Reinforcement-Learning-Survey/blob/master/papers/Learning%20Modular%20Neural%20Network%20Policies%20for%20Multi-Task%20and%20Multi-Robot%20Transfer.md)
  - [Stochastic Neural Network For Hierarchical Reinforcement Learning](https://github.com/andrewliao11/Deep-Reinforcement-Learning-Survey/blob/master/papers/Stochastic%20Neural%20Network%20For%20Hierarchical%20Reinforcement%20Learning.md)
  - [Noisy Networks for Exploration](https://github.com/andrewliao11/Deep-Reinforcement-Learning-Survey/blob/master/papers/Noisy%20Networks%20for%20Exploration.md)
  - [Improving Stochastic Policy Gradients in Continuous Control with Deep Reinforcement Learning using the Beta Distribution](https://github.com/andrewliao11/Deep-Reinforcement-Learning-Survey/blob/master/papers/Improving%20Stochastic%20Policy%20Gradients%20in%20Continuous%20Control%20with%20Deep%20Reinforcement%20Learning%20using%20the%20Beta%20Distribution.md)
  - [High-Dimensional Continuous Control Using Generalized Advantage Estimation](https://github.com/andrewliao11/Deep-Reinforcement-Learning-Survey/blob/master/papers/High-Dimensional%20Continuous%20Control%20Using%20Generalized%20Advantage%20Estimation.md)
  - [Generalizing Skills with Semi-Supervised Reinforcement Learning](https://github.com/andrewliao11/Deep-Reinforcement-Learning-Survey/blob/master/papers/Generalizing%20Skills%20with%20Semi-Supervised%20Reinforcement%20Learning.md)
  - [Unsupervised Perceptual Rewards for Imitation Learning](https://github.com/andrewliao11/Deep-Reinforcement-Learning-Survey/blob/master/papers/Unsupervised%20Perceptual%20Rewards%20for%20Imitation%20Learning.md)
  - [Towards Deep Symbolic Reinforcement Learning](https://github.com/andrewliao11/Deep-Reinforcement-Learning-Survey/blob/master/papers/Towards%20Deep%20Symbolic%20Reinforcement%20Learning.md)
  - [others](https://github.com/andrewliao11/Deep-Reinforcement-Learning-Survey/blob/master/Reinforcement-Learning-Papers.md)
- [Open Source](https://github.com/andrewliao11/Deep-Reinforcement-Learning-Survey/blob/master/Open-Source.md#open-source)
  - Python users
  - Lua users
- [Courses](https://github.com/andrewliao11/Deep-Reinforcement-Learning-Survey/blob/master/Open-Source.md#course)
- [Textbook](https://github.com/andrewliao11/Deep-Reinforcement-Learning-Survey/blob/master/Open-Source.md#textbook)
- [Misc](https://github.com/andrewliao11/Deep-Reinforcement-Learning-Survey/blob/master/Open-Source.md#misc)

## [**Useful**] Learn Reinforcement Learning
:point_right: [dennybritz/reinforcement-learning](https://github.com/dennybritz/reinforcement-learning)
:point_right: [David Silver's course about policy gradient](https://www.youtube.com/watch?v=KHZVXao4qXs)
:point_right: [Deep Reinforcement Learning](http://rll.berkeley.edu/deeprlcourse/)

--------------------------------------------------------------------------------
/Reinforcement-Learning-Papers.md:
--------------------------------------------------------------------------------
# Reinforcement Learning Papers
***Mistakes teach us to clarify what we really want and how we want to live.*** That's the spirit of reinforcement
learning: learning from mistakes. Let's be explorers in reinforcement learning!


- ***Gradient Estimation Using Stochastic Computation Graphs*** [[NIPS 2015]](https://arxiv.org/abs/1506.05254)
  - John Schulman, Nicolas Heess, Theophane Weber, Pieter Abbeel
- ***Maximum Entropy Inverse Reinforcement Learning*** [[AAAI 2008]](https://www.cs.uic.edu/pub/Ziebart/Publications/maxentirl-bziebart.pdf)
  - Brian D. Ziebart, Andrew Maas, J. Andrew Bagnell, and Anind K. Dey
  - Closer to the real case :point_right: handles the suboptimal case (the optimal case can't cover the whole state space) and alleviates reward-function ambiguity
  - Basic concept: plans with equivalent rewards have equal probabilities, and plans with higher rewards are exponentially more preferred.
- ***Reinforcement Learning with Unsupervised Auxiliary Tasks*** [[arXiv 2016]](https://128.84.21.199/abs/1611.05397)
  - Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, Koray Kavukcuoglu
  - Introduces an agent that also maximises many other **pseudo-reward** functions simultaneously by reinforcement learning
  - The advantage of auxiliary tasks: in many environments the extrinsic reward is very sparse, which makes the feature extractor hard to learn at the beginning. Giving some pseudo-reward helps the learner know how to interpret the image at the initial stage.
  - They propose two main auxiliary tasks: pixel changes and network features.
    - Pixel changes: maximally changing the pixels in each cell of an n*n non-overlapping grid placed over the input image :point_right: makes the learner learn to move faster or avoid stopping (I guess)
    - Network features: maximally activating each of the units in a specific hidden layer :point_right: to fully use the hidden units
  - Section 4.1 "Unsupervised Reinforcement Learning" discusses why they do not use a pixel reconstruction loss
- ***Apprenticeship Learning via Inverse Reinforcement Learning*** [[ICML 2004]](http://dl.acm.org/citation.cfm?id=1015430)
  - Pieter Abbeel, Andrew Y. Ng
  - The first time apprenticeship learning is proposed
  - Most earlier methods try to directly mimic the demonstrator by applying a **supervised learning** algorithm to learn a direct mapping from states to actions :point_right: only suitable for the case where the task is to mimic the expert's trajectory
  - The reward function, rather than the policy or the value function, is the most succinct, robust, and transferable definition of the task.
  - Basic concept: use inverse reinforcement learning to recover the reward function from the expert, and use that reward function to find the optimal policy.
  - Actually, apprenticeship learning doesn't need to find the correct reward function. Instead, it uses the predicted reward function to find a policy that is similar to the expert's.
  - From the experiments, apprenticeship learning needs fewer sample trajectories than action mimicking.
- ***Algorithms for Inverse Reinforcement Learning*** [[ICML 2000]](http://www.andrewng.org/portfolio/algorithms-for-inverse-reinforcement-learning/)
  - Andrew Y. Ng, Stuart Russell
  - In examining animal and human behavior, we must consider the reward function as an unknown to be ascertained through empirical investigation.
  - Recover the expert's reward function and use it to generate the desired behavior
  - Uses Eq. 4 (see the derivation in the paper, which is quite clear)
  - Avoids reward-function ambiguity by adding the margin λ|R|
  - Uses a feature representation for the reward function
  - Since the reward function is assumed to be a linear combination of features, this implies that **if two policies have similar accumulated feature expectations, their accumulated rewards are similar**
  - The experiment section shows that IRL is solvable at least in **moderately sized** discrete and continuous spaces
  - Reference: [Inverse Reinforcement Learning](https://people.eecs.berkeley.edu/~pabbeel/cs287-fa12/slides/inverseRL.pdf), by Pieter Abbeel.
- ***Deep Reinforcement Learning with a Natural Language Action Space*** [ACL 2016]
  - Ji He, Jianshu Chen, Xiaodong He, Jianfeng Gao, Lihong Li, Li Deng, Mari Ostendorf
  - Task: a string of text :point_right: state; several strings of text :point_right: potential actions
- ***Imitation Learning with Recurrent Neural Networks*** [[arXiv 2016]](https://arxiv.org/abs/1607.05241)
  - Khanh Nguyen
  - Aims to unify two sequence prediction approaches: learning to search (L2S) and recurrent neural networks
  - A supervised recurrent neural network makes an independent prediction at each time step, which suffers seriously from **compounding errors**, since the input observations are correlated and thus violate the independent and identically distributed (i.i.d.) assumption required by any supervised approach.
  - L2S algorithms reduce a sequential prediction problem to learning a policy to traverse a search space with minimum cost. This guarantees that compounding errors grow linearly with trajectory length.
  - Maps the RNN components to L2S (taking sequence-to-sequence as an example):
    - hidden representation :point_right: St
    - decoded word :point_right: At
    - encoded word :point_right: Xt (new information for the environment)
    - The non-linear gates in the RNN serve as the transition function
- ***Language Understanding for Text-based Games Using Deep Reinforcement Learning*** [[EMNLP 2015]](https://arxiv.org/abs/1506.08941)
  - Karthik Narasimhan, Tejas Kulkarni, Regina Barzilay
  - Uses natural language as the state representation, with a fixed action space (it does not output natural language in free form :point_right: a major restriction)
  - :star: Challenging part: the environment is not directly observable
  - The task includes text interpretation and learning a strategy built on that interpretation
  - Uses an LSTM to interpret the state (in natural language form) and a DQN to select the corresponding action
  - Basically follows the [DeepMind paper](http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html), with experience replay and mini-batch updates
  - Using t-SNE for the representation analysis is really cool (Fig. 5)
- :star: ***High-Dimensional Continuous Control Using Generalized Advantage Estimation*** [[ICLR 2016]](https://arxiv.org/abs/1506.02438)
  - John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, Pieter Abbeel
  - In extremely high-dimensional tasks (like continuous control in 3D environments), stability is a key point.
  - Proposes an effective variance-reduction scheme for policy gradients, called generalized advantage estimation (GAE)
  - Motivation of GAE: suppose we have a fixed number of steps; from Eq. 15, we know that the bias of each advantage estimator is **k-dependent**. So, as k increases, the bias term becomes more negligible while the variance increases, and vice versa. (If you find this concept abstract, recall that MC is unbiased but has high variance, while TD is biased but has low variance.)
  - ***λ*** is a new concept introduced in this paper.
    - If λ = 0 (as in Eq. 17), we have low variance but the estimator is biased
    - If λ = 1 (as in Eq. 18), we have high variance but the estimator is unbiased
- :star: ***Recurrent Models of Visual Attention*** [[NIPS 2014]](https://arxiv.org/abs/1406.6247)
  - Volodymyr Mnih, Nicolas Heess, Alex Graves, Koray Kavukcuoglu
  - Motivation: dealing with large images is computationally expensive; for many attention methods, the computation cost is proportional to the image size.
  - Uses action control to attend to part of the image (defines a Gaussian and treats the location, i.e., the mean of the Gaussian, as the action)
  - Can be viewed as a POMDP (partially observable Markov decision process)
  - The location network is a 2D Gaussian distribution
  - The action is stochastically drawn from the location network's distribution
  - The reward can be task-dependent (this paper uses classification)
  - Uses the policy gradient to optimize
- ***Deterministic Policy Gradient Algorithms*** [[ICML 2014]](http://jmlr.org/proceedings/papers/v32/silver14.pdf)
  - D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, M. Riedmiller
  - The deterministic policy gradient is just a special (limiting) case of the stochastic policy gradient
  - **Problem with the stochastic policy gradient**: as the policy becomes more and more deterministic, the variance of the policy gradient becomes larger and larger; in the limit the policy collapses to a delta function :point_right: we end up computing the gradient of Q
  - The stochastic policy gradient ends up calculating the gradient of Q(s,a), where a is the mean of the policy (assuming the policy is a normal distribution)
  - Intuition: update the policy in the direction that ***most improves Q***
  - The trade-off of a deterministic policy: exploration :point_right: off-policy deterministic actor-critic (exploration as in Q-learning)
  - In continuous action spaces, greedy policy improvement becomes problematic, requiring a global maximisation at every step. Instead, a simple and computationally attractive alternative is to move the policy in the direction of the gradient of Q, rather than globally maximising Q
  - David Silver's talk about [deterministic policy gradient](http://techtalks.tv/talks/deterministic-policy-gradient-algorithms/61098/) at ICML :point_right: very clear!
- ***Prioritized Experience Replay*** [[ICML 2016]](http://arxiv.org/abs/1511.05952)
  - Tom Schaul, John Quan, Ioannis Antonoglou, David Silver
  - Uses prioritized sampling rather than uniform sampling
  - Uses the transition's TD error δ, which indicates how **"surprising" or "unexpected" the transition is**
  - Alleviates the loss of diversity with stochastic prioritization, which introduces bias
  - Stochastic prioritization: a mixture of pure greedy prioritization and uniform random sampling
- ***Deep Reinforcement Learning with Double Q-learning*** [[AAAI 2016]](http://arxiv.org/abs/1509.06461)
  - Hado van Hasselt, Arthur Guez, David Silver
  - Deals with the overestimation of Q-values
  - Separates the action-selection Q network from the Q-value-prediction network
- :star: ***Asynchronous Methods for Deep Reinforcement Learning*** [[ICML 2016]](https://arxiv.org/abs/1602.01783)
  - Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, Koray Kavukcuoglu
  - On-policy updates
  - Implementation from others: [async-rl](https://github.com/muupan/async-rl)
  - [Asynchronous SGD](https://cxwangyi.wordpress.com/2013/04/09/why-asynchronous-sgd-works-better-than-its-synchronous-counterpart/) explains what "asynchronous" means.
  - [Tuning Deep Learning Episode 1: DeepMind's A3C in Torch](http://www.allinea.com/blog/201607/tuning-deep-learning-episode-1-deepminds-a3c-torch)
- :star: ***Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection*** [[arXiv 2016]](http://arxiv.org/abs/1603.02199)
  - Sergey Levine, Peter Pastor, Alex Krizhevsky, Deirdre Quillen
  - [Deep Learning for Robots: Learning from Large-Scale Interaction](https://research.googleblog.com/2016/03/deep-learning-for-robots-learning-from.html)
- :star: ***Dueling Network Architectures for Deep Reinforcement Learning*** [[ICML 2016]](http://arxiv.org/abs/1511.06581)
  - Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, Nando de Freitas
  - Best Paper at ICML 2016
  - Poses the question: is a conventional CNN suitable for RL tasks?
  - Two-stream network (state-value and advantage function)
  - Focuses on innovating a neural network architecture that is better suited for model-free RL
  - Torch blog - [Dueling Deep Q-Networks](http://torch.ch/blog/2016/04/30/dueling_dqn.html)
- ***Control of Memory, Active Perception, and Action in Minecraft*** [[arXiv 2016]](https://arxiv.org/abs/1605.09128)
  - Junhyuk Oh, Valliappa Chockalingam, Satinder Singh, Honglak Lee
  - Solves problems concerning partial observability
  - Proposes Minecraft tasks
  - Memory Q-Network (MQN), Recurrent Memory Q-Network (RMQN), and Feedback Recurrent Memory Q-Network (FRMQN)
- :star: ***Continuous Control With Deep Reinforcement Learning*** [[ICLR 2016]](http://arxiv.org/abs/1509.02971)
  - Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, Daan Wierstra
  - Solves continuous control tasks and avoids the curse of **dimensionality**
  - **Deep** version of DPG (deterministic policy gradient)
  - When going deep, some issues arise: it's unstable to use a non-linear function approximator
  - The different components of the observation may have different physical units and the ranges may vary across environments :point_right: solved by batch normalization
  - For exploration, add noise to the actor policy: µ'(s_t) = µ(s_t|θ^µ_t) + N
- ***Active Object Localization with Deep Reinforcement Learning*** [[ICCV 2015]](http://arxiv.org/abs/1511.06015)
  - Juan C. Caicedo, Svetlana Lazebnik
  - The agent learns to deform a bounding box using simple transformation actions (maps the object detection task to RL)
  - Ideas similar to [G-CNN: an Iterative Grid Based Object Detector](http://arxiv.org/abs/1512.07729)
- ***Memory-based control with recurrent neural networks*** [[NIPS 2015 Deep Reinforcement Learning Workshop]](http://arxiv.org/abs/1512.04455)
  - Nicolas Heess, Jonathan J Hunt, Timothy P Lillicrap, David Silver
  - Uses an RNN to solve partially observed problems
- ***Playing Atari with Deep Reinforcement Learning*** [[NIPS 2013 Deep Learning Workshop]](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf)
  - Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra

# Suggested Papers
- ***Maximum Entropy Inverse Reinforcement Learning*** [[AAAI 2008]](https://www.cs.uic.edu/pub/Ziebart/Publications/maxentirl-bziebart.pdf)

--------------------------------------------------------------------------------
/papers/Action-Conditional Video Prediction using Deep Networks in Atari Games.md:
--------------------------------------------------------------------------------
# Action-Conditional Video Prediction using Deep Networks in Atari Games

- Long-term predictions on Atari games, conditioned on the action
- Uses the predicted frames (more informative) to improve exploration for the model-free controller
- Multiplicative action-conditional transformation: if a one-hot encoding is used to represent the action :point_right: each action corresponds to its own transformation matrix
- Learning with multi-step prediction (minimize the k-step accumulated error)
- Section 4.2 is really promising!
  - replaces real frames with predicted frames
  - uses the prediction model to help the agent explore the states visited least

--------------------------------------------------------------------------------
/papers/Continuous Deep Q-Learning with Model-based Acceleration.md:
--------------------------------------------------------------------------------
# Continuous Deep Q-Learning with Model-based Acceleration

Previous work on model-free **continuous** control mainly falls into two groups: policy-search-based methods and actor-critic algorithms that integrate the value function, e.g., DDPG and Dyna-Q.

The difficulty of using Q-learning in continuous control is computing ```argmax Q(s,a)```. This work derives **NAF**, a variant of the Q function. Besides NAF, it also proposes **imagination rollouts** to combine model-based and model-free methods.

The authors decompose the Q function into a value function and an advantage function, and the advantage function is parameterized as a quadratic function (the reason why this works still surprises me, though); see the short sketch at the end of this note.

Secondly, the authors propose to incorporate a model-based technique to increase the sample efficiency of Q-learning: use iLQG to iteratively fit a local model (using samples from the recent episodes).


## keypoints
- Decomposes the Q(s,µ) function as V(s) + A(s,µ), assuming that µ(s|θ) always gives the action that maximizes Q(s,µ)
- Uses a simple linear model to iteratively fit the model (environment dynamics)
- It would be difficult to use a non-linear neural network to learn the model (environment dynamics), since neural networks are mostly useful with large amounts of data and the data here is non-i.i.d.

## note/question
- compared with DDPG:
  - pros: simpler, converges faster
  - cons: no theoretical proof
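
As a quick illustration of the quadratic advantage decomposition described above, here is a minimal NumPy sketch (my own, not the authors' code); `L` stands in for the lower-triangular matrix that the NAF network predicts:

```python
import numpy as np

def naf_q_value(a, mu, L, v):
    """Q(s, a) = V(s) + A(s, a) with a quadratic advantage term.

    a:  action vector, shape [d]
    mu: the network's action output (the argmax of Q), shape [d]
    L:  lower-triangular matrix predicted by the network, shape [d, d]
    v:  state-value estimate V(s), a scalar
    """
    P = L @ L.T                          # positive semi-definite by construction
    diff = a - mu
    advantage = -0.5 * diff @ P @ diff   # quadratic in (a - mu), always <= 0
    return v + advantage                 # maximized exactly at a = mu
```

Because the advantage is non-positive and zero at a = mu, the argmax over actions is available in closed form, which is exactly what makes Q-learning tractable in the continuous setting.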

--------------------------------------------------------------------------------
/papers/Deep Successor Reinforcement Learning.md:
--------------------------------------------------------------------------------
# Deep Successor Reinforcement Learning


- Successor representation (SR): decomposes the value into a successor map and a reward predictor (for the definition of the components of the successor representation, refer to Section 3.2)
- Advantage 1: increases the sensitivity to environment changes, since it records the immediate reward in each state. In DQN, we only record (or predict) the accumulated reward, so a sudden change of reward will be diluted. However, the SR method records the reward in each state, R(s), which enables it to be more sensitive to changes in the environment.
- Advantage 2: able to extract the bottleneck states (subgoals). Since we predict the successor map (the predicted visit counts), a state with a higher visit count is likely to be a bottleneck.
- Section 3.3 is a little tricky. The m_sa represents the **features of future occupancy**, so m_sa · w becomes the future accumulated reward (Q-value), and φ(s) · w is the **immediate reward**. Therefore, m_sa = φ(s) + γE[m_{s_{t+1}a'}] :point_right: eq. 6

--------------------------------------------------------------------------------
/papers/Generalizing Skills with Semi-Supervised Reinforcement Learning.md:
--------------------------------------------------------------------------------
# Generalizing Skills with Semi-Supervised Reinforcement Learning

Chelsea Finn, Tianhe Yu, Justin Fu, Pieter Abbeel, Sergey Levine

## keypoints
- Trains an RL agent under both reward MDPs and no-reward MDPs
- For no-reward MDPs: uses guided cost learning (Finn et al., 2016)
- IRL: re-optimizes the policy in the novel environment (allows interaction with the no-reward MDPs)
- For RL: uses guided policy search; for IRL: uses guided cost learning

## related to IRL
- Ill-defined problem: there is no exact solution; many reward functions can explain the optimal policy.
- Suffers from reward ambiguity

--------------------------------------------------------------------------------
/papers/High-Dimensional Continuous Control Using Generalized Advantage Estimation.md:
--------------------------------------------------------------------------------
# High-Dimensional Continuous Control Using Generalized Advantage Estimation

This paper introduces an extra hyperparameter λ, which interpolates between two policy-gradient-based reinforcement learning methods (REINFORCE and TD).
In REINFORCE, the advantage function is:

A_t = \sum_{l=0}^{\infty} \gamma^l r_{t+l} - V(s_t)

This estimator is known to be unbiased but to have high variance.

In TD, the advantage function is calculated with bootstrapping:

A_t = r_t + \gamma V(s_{t+1}) - V(s_t)

This estimator is known for low variance, but it introduces bias.

This paper combines these two methods into "Generalized Advantage Estimation".

## method

Define the TD residual and the k-step advantage estimators as

\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad \hat{A}_t^{(k)} = \sum_{l=0}^{k-1} \gamma^l \delta_{t+l}

where the \hat{A}_t^{(k)} above are the k-step estimators of the advantage function. The generalized advantage estimator GAE(γ, λ) is defined as the exponentially-weighted average of these k-step estimators:

\hat{A}_t^{GAE(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}
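
As a concrete illustration (a sketch of my own, not the paper's reference code), the estimator above can be computed with a simple backward recursion over a finite trajectory:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Compute GAE(gamma, lambda) advantages for one trajectory.

    rewards: array of shape [T]; values: array of shape [T + 1]
    (the extra entry is the value estimate of the state after the last step).
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # exponentially-weighted sum of the future TD residuals
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```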

## keypoints
- Highly compatible with many policy-gradient-based reinforcement learning algorithms


--------------------------------------------------------------------------------
/papers/Human-level control through deep reinforcement learning.md:
--------------------------------------------------------------------------------
# Human-level control through deep reinforcement learning

- Most optimization algorithms assume that the samples are independently and identically distributed, while in reinforcement learning the data is a sequence of actions, which breaks that assumption.
- Strong correlation between data points :point_right: breaks the assumption of stochastic gradient-based algorithms (hence the re-sampling)
- Experience replay (off-policy)
- Iteratively updates the Q-value
- [Source code [Torch]](https://sites.google.com/a/deepmind.com/dqn)

--------------------------------------------------------------------------------
/papers/Improving Stochastic Policy Gradients in Continuous Control with Deep Reinforcement Learning using the Beta Distribution.md:
--------------------------------------------------------------------------------
# Improving Stochastic Policy Gradients in Continuous Control with Deep Reinforcement Learning using the Beta Distribution

Accepted at ICML 2017

## keypoints
- In continuous control, we often resort to the [Gaussian distribution](https://en.wikipedia.org/wiki/Normal_distribution) to model continuous actions. The benefit of the Gaussian is that it has a simple form (mean and sigma) and allows reparameterization.
- However, the Gaussian distribution has some drawbacks:
  - It has nonzero probability for any action value (this introduces bias if we clip the action values due to controller limitations)
  - With a Gaussian policy, the variance of the policy gradient estimator is inversely proportional to \sigma^2 (as the Gaussian gets more deterministic, the variance becomes higher and higher)
- The [Beta distribution](https://en.wikipedia.org/wiki/Beta_distribution) (beta-dist) comes to the rescue:
  - It also has a simple form (\alpha and \beta)
  - It naturally has a **bounded** support
  - See Fig. 4 for the relation between the beta-dist and log(beta-dist)

## notes
- One of my favorite recent papers (I really like it)
- Well-written, and their motivation is easy to understand
- Tackles a fundamental question about a common convention (using a Gaussian distribution as the policy)
- Looks easy to implement (a short sketch follows)
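
A minimal PyTorch sketch of the idea (my own toy example, not the paper's code): sample a bounded action from a Beta policy head and rescale it to the actuator range.

```python
import torch
from torch.distributions import Beta

def sample_bounded_action(alpha, beta, low=-2.0, high=2.0):
    """Sample an action from Beta(alpha, beta) and rescale it to [low, high]."""
    dist = Beta(alpha, beta)          # support is (0, 1), so no clipping is needed
    x = dist.rsample()                # reparameterized sample
    action = low + (high - low) * x   # rescale to the bounded action range
    log_prob = dist.log_prob(x)       # log-prob of the unscaled sample,
                                      # sufficient for the policy-gradient update
    return action, log_prob

# alpha and beta would normally come from the policy network (e.g., 1 + softplus(.))
action, logp = sample_bounded_action(torch.tensor(1.5), torch.tensor(2.0))
```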

--------------------------------------------------------------------------------
/papers/Learning Modular Neural Network Policies for Multi-Task and Multi-Robot Transfer.md:
--------------------------------------------------------------------------------
# Learning Modular Neural Network Policies for Multi-Task and Multi-Robot Transfer

This paper aims to learn multi-robot, multi-task **modular** policies. Previously, modular networks were frequently used in question-answering tasks; see [@Jacob Andreas](https://people.eecs.berkeley.edu/~jda/). This paper exploits a nice characteristic of modular networks: composing networks in a plug-and-play fashion.

## Problem formulation
- Define a universe that contains robots R_1, R_2 and tasks T_1, T_2; there are 4 different combinations in the universe. The algorithm performs zero-shot transfer from the 3 combinations [(R_1, T_1), (R_1, T_2), (R_2, T_1)] to the unseen combination (R_2, T_2).

## Keypoints
- Needs the robot modules to be **task-invariant** and the task modules to be **robot-invariant**
- Applies a bottleneck and dropout at the interface of the modules to prevent one module from overfitting to specific partner modules
- Allows changes in the observation/action space
- Few training steps to transfer (few-shot learning)

## Some opinions
- Really cool idea and worth exploring more
- Lacks implementation details
- The authors choose good environments to showcase their method (from easy to difficult)
- In QA tasks, the program generator matters a lot! In [Modular Multitask Reinforcement Learning with Policy Sketches](https://arxiv.org/abs/1611.01796), they use human-defined policy sketches. In the near future (maybe NIPS 2017), we can expect someone to do modular RL with a jointly trained program generator.

--------------------------------------------------------------------------------
/papers/Learning Tetris Using the Noisy Cross-Entropy Method.md:
--------------------------------------------------------------------------------
# Learning Tetris Using the Noisy Cross-Entropy Method

The paper works on solving Tetris with a modified cross-entropy (CE) method; the original CE method in reinforcement learning usually results in premature convergence.

### Cross entropy method in reinforcement learning
- First, start with a random (e.g., uniform) distribution ```F_0```
- Draw N samples θ_0, θ_1, ... from ```F_0```
- Choose the top-K samples that get the highest scores and use these selected samples to update the distribution, obtaining ```F_1``` (then repeat)

## keypoints
- Adds noise to the cross-entropy method to prevent premature convergence
- If we decrease the noise over time (it depends only on the time step), the performance can be even better.
- Noise :point_right: prevents premature convergence

## note
- The noise applied to the std can be viewed as ensuring enough exploration

## implementation
I have a simple implementation on CartPole, [link](https://gist.github.com/andrewliao11/d52125b52f76a4af73433e1cf8405a8f); a generic sketch of the method is shown below.
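
A minimal NumPy sketch of the noisy cross-entropy method (my own simplification; `score_fn` is any black-box function mapping a parameter vector to an episode return, and the linearly decreasing noise schedule is only one of the variants discussed in the paper):

```python
import numpy as np

def noisy_cross_entropy(score_fn, dim, n_samples=100, top_k=10,
                        n_iters=50, extra_noise=4.0):
    """Search over a diagonal Gaussian; extra noise keeps the std from collapsing."""
    mean, std = np.zeros(dim), np.ones(dim) * 10.0
    for it in range(n_iters):
        thetas = mean + std * np.random.randn(n_samples, dim)    # sample candidates
        scores = np.array([score_fn(theta) for theta in thetas])
        elite = thetas[np.argsort(scores)[-top_k:]]              # keep the top-K
        mean = elite.mean(axis=0)
        # add time-decaying noise to the variance to prevent premature convergence
        noise = extra_noise * max(0.0, 1.0 - it / n_iters)
        std = np.sqrt(elite.var(axis=0) + noise)
    return mean

# usage (hypothetical): best_theta = noisy_cross_entropy(run_episode, dim=4),
# where run_episode(theta) rolls out a linear policy and returns its total reward.
```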

--------------------------------------------------------------------------------
/papers/Mastering the game of Go with deep neural networks and tree search.md:
--------------------------------------------------------------------------------
# Mastering the game of Go with deep neural networks and tree search

- David Silver, Aja Huang
- First stage: supervised learning of policy networks, including a rollout policy and an SL policy network (learning from human experts)
- The rollout policy is used for making **fast** but relatively inaccurate decisions
- The SL policy network is used to initialize the RL policy network (which is then improved by policy gradient)
- To prevent overfitting, samples are auto-generated from self-play (half) and mixed with the KGS dataset (half) for training
- Uses Monte Carlo tree search with the policy network and value network. To understand MCTS better, please refer to [here](https://en.wikipedia.org/wiki/Monte_Carlo_tree_search#Principle_of_operation)
  - Selection: select the most promising action depending on Q + u(P), down to depth L
  - Expansion: after L steps, create a new child node
  - Evaluation: evaluate the leaf with a mixture of the value network and a simulated rollout
  - Backup: calculate and store Q(s,a) and N(s,a), which are used in Selection

--------------------------------------------------------------------------------
/papers/Noisy Networks for Exploration.md:
--------------------------------------------------------------------------------
# Noisy Networks for Exploration

The authors propose to perturb the parameter space for exploration. Replacing the epsilon-greedy method, the perturbation in parameter space is learned by gradient descent along with the other weights. There are some heuristic methods for exploration, e.g., optimistic initialization and epsilon-greedy (perturbation in action space); however, these methods often work well only on small state spaces. The paper evaluates the method on value-based and policy-based algorithms, and both get a substantial improvement.

## keypoint
- Perturbs the parameter space for exploration
- A universal idea for most RL algorithms (empirically found)
- The scale of the perturbation is learned along with the original objective function (easy to apply)

## notes/questions
- The authors don't provide much theoretical proof or intuition for the proposed method, though the results are good.
- It would be good for the authors to visualize the (behavioral) difference between epsilon-greedy and the proposed method, or to provide some other analysis.
- Also, the authors mention that the proposed variant works well in Asterix and Freeway (in Sec. 5). However, I wonder what this might imply (better for games that need more exploration? It has to be confirmed by visualizing the agent).

# Non-official source code
- TensorFlow: [NoisyNet-DQN](https://github.com/andrewliao11/NoisyNet-DQN)
- PyTorch: [NoisyNet-A3C(LSTM)](https://github.com/Kaixhin/NoisyNet-A3C)

--------------------------------------------------------------------------------
/papers/One-Shot Imitation Learning.md:
--------------------------------------------------------------------------------
# One-Shot Imitation Learning
tl;dr

This work aims at one-shot (only one demonstration) imitation learning.

- Task formalization:
  The learned policy takes as input: (i) the current observation, and (ii) one demonstration that successfully solves a different instance of the same task (this demonstration is fixed for the duration of the episode).
- Training phase:
  Given a pair of demonstrations, a neural net is trained that takes as input **one demonstration** and the **current state**, and outputs an action with the **goal that the resulting sequence of states and actions matches the second demonstration as closely as possible** (a toy sketch of this setup follows).
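
To make the setup concrete, here is a toy, self-contained PyTorch sketch of that training loop; the architecture and all shapes are my own placeholders (the paper uses a much more elaborate attention-based model), and the random tensors stand in for real demonstration data:

```python
import torch
import torch.nn as nn

obs_dim, act_dim, demo_len, hidden = 8, 2, 50, 64

demo_encoder = nn.GRU(obs_dim + act_dim, hidden, batch_first=True)
policy = nn.Sequential(nn.Linear(hidden + obs_dim, hidden), nn.ReLU(),
                       nn.Linear(hidden, act_dim))
opt = torch.optim.Adam(list(demo_encoder.parameters()) + list(policy.parameters()))

for step in range(100):
    # two demonstrations of the same task (random stand-ins here)
    demo_a = torch.randn(1, demo_len, obs_dim + act_dim)
    demo_b_obs = torch.randn(demo_len, obs_dim)
    demo_b_act = torch.randn(demo_len, act_dim)

    _, h = demo_encoder(demo_a)                       # embed demonstration A
    ctx = h[-1].expand(demo_len, hidden)              # same context at every step
    pred = policy(torch.cat([ctx, demo_b_obs], dim=-1))
    loss = ((pred - demo_b_act) ** 2).mean()          # behavior cloning against demo B
    opt.zero_grad(); loss.backward(); opt.step()
```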

## keypoints

## question

## basic concept
- Imitation learning can be separated into behavior cloning and inverse reinforcement learning

--------------------------------------------------------------------------------
/papers/Policy Distillation.md:
--------------------------------------------------------------------------------
# Policy Distillation

The authors bring the concept of model compression into reinforcement learning. The concept is quite simple: use a well-trained agent to teach another randomly initialized agent. The teacher's transitions are viewed as a large i.i.d. dataset, and the student agent is trained on it in a supervised fashion. Instead of computing a maximum-likelihood loss between the teacher's action and the student's action, it minimizes the **Kullback-Leibler divergence (KL) with temperature τ**.
Multi-task policy distillation: use many experts as many i.i.d. datasets to train a general agent. From Table 2, the experiments show it can achieve a geometric mean of 89%.

## keypoints
- The distillation method transfers more information from teacher to student.
- The temperature of the softmax allows more of the secondary knowledge to be transferred to the student.

## compare
- Imitation learning: given a **few** teacher trajectories, learn the policy. Here, all the training trajectories are provided by the teacher. :sweat: a patient teacher

## notes
- A high-temperature softmax :point_right: becomes more uniformly distributed, and vice versa

--------------------------------------------------------------------------------
/papers/Stochastic Neural Network For Hierarchical Reinforcement Learning.md:
--------------------------------------------------------------------------------
# Stochastic Neural Network For Hierarchical Reinforcement Learning

Published at ***ICLR 2017***

## keypoints
- Two phases: pretrain on the swimmer with a proxy reward, then train on the sparse-reward environment
- A stochastic neural network of multi-modal policies, enabling weight sharing across different modes
- Information-theoretic regularizer: mutual information, used to **diversify** the skills; without this regularization term, the policy might perform very limited moves
- Similar to common HRL, they use a manager (somewhat like a meta-controller) to determine which skill to perform

## note
- Advantage of the stochastic neural network here: it learns the weight sharing itself, so there is no need to pre-define the number of skills
- Maybe the first RL paper to include this concept: use MI to diversify the policies
- Strong experimental results on the benchmark

--------------------------------------------------------------------------------
/papers/Towards Deep Symbolic Reinforcement Learning.md:
--------------------------------------------------------------------------------
# Towards Deep Symbolic Reinforcement Learning

**Disclaimer: this is a workshop paper from the Deep RL workshop at NIPS 2016**

tl;dr: symbolize the data and do RL on the symbols, for interpretability and potentially for generalizability.

Keywords: symbolic reasoning, reinforcement learning, interpretable NN

People plan and think through complex problems by first making their perception abstract and then tackling the abstraction. This paper explores how to first symbolize raw perceptual data and then map the symbolic representations into actions.

## Why does this matter, and what's the motivation?

First, deep RL systems inherit from deep learning the need for very large training sets, which entails that they learn very slowly.
Second, they are brittle in the sense that a trained network that performs well on one task often performs very poorly on a new task, even if the new task is very similar to the one it was originally trained on.
Third, they are strictly reactive, meaning that they do not use high-level processes such as planning, causal reasoning, or analogical reasoning to fully exploit the statistical regularities present in the training data.
Fourth, they are opaque: it is typically difficult to extract a humanly comprehensible chain of reasons for the action choice the system makes.

## What does DRL lack? Causal reasoning

To carry out analogical inference at a more abstract level, and thereby facilitate the transfer of expertise from one domain to another, the narrative structure of the ongoing situation needs to be mapped to the causal structure of a set of previously encountered situations.
As well as maximising the benefit of past experience, this enables high-level causal reasoning processes to be deployed in action selection, such as planning, lookahead, and off-line exploration (imagination).

## Method
We can view the whole architecture as model-based RL, where the proposed method discovers how to symbolize the data (learning the environment dynamics) and learns how to act.

Specifically, the proposed method can be separated into 3 parts: low-level symbol generation, representation building, and reinforcement learning.

Low-level symbol generation:
They train a convolutional autoencoder on 5000 randomly generated images. The information extracted at this stage consists of a symbolic representation of the positions of salient objects in the frame, along with their types (see Fig. 4).

Representation building:
They track the objects across frames in order to observe and learn from their dynamics, as a function of spatial proximity, type transitions, and neighborhood.

Reinforcement learning: tabular Q-learning

## Experiments
From Fig. 6, the proposed method works better when the environment is more complex and stochastic.

## Discussion
This paper proposes a simple method as a proof of concept of symbolic RL, which I think is really good for a workshop paper. The authors should show how their unsupervised representation improves as the amount of data becomes larger and larger. Also, the paper doesn't use the symbolic representation to do planning, which they keep emphasizing at the beginning.

This paper illustrates the idea in a clear and thought-provoking way, which is the reason I share it.

--------------------------------------------------------------------------------
/papers/Unsupervised Perceptual Rewards for Imitation Learning.md:
--------------------------------------------------------------------------------
# Unsupervised Perceptual Rewards for Imitation Learning
Pierre Sermanet, Kelvin Xu, Sergey Levine (RSS 2017)

## Keypoints
- Given a few **human demonstrations**, learn a reward function for **robot manipulation** (a different embodiment)
- Learns reward functions that are dense, and discovers subgoals in an unsupervised fashion
- Uses an **ImageNet**-pretrained model (without any fine-tuning)
- Advantage of perceptual rewards:
  - Just like the advantages in computer vision: the model learns semantic meaning, ignoring unrelated objects (background)
- To avoid overfitting to the small number of demonstrations, only a time-independent Gaussian is used to approximate the reward function (one quadratic reward function for each subgoal)


## Question
- Why is a model trained on ImageNet that useful for robot manipulation? Or does it depend on the task?

--------------------------------------------------------------------------------
/papers/Value Iteration Networks.md:
--------------------------------------------------------------------------------
# Value Iteration Networks
The authors aim to embed a planning algorithm (value iteration) into a differentiable policy network, making the whole model trainable end-to-end.
```
Value iteration: Q_n(s,a) = R(s,a) + γ Σ_{s'} P(s'|s,a) V_n(s')
                 V_{n+1}(s) = max_a Q_n(s,a)
```

If we treat ```R(s,a)``` as the input of a CNN, do a convolution on it producing ```Q(s,a)```, and then max-pool over the action channels, the whole process is similar to the right-hand side of value iteration. If we apply this K times, it simply looks like doing K iterations of value iteration. This method provides a differentiable way to embed planning (value iteration) into our NN.

## keypoints
- Embeds value iteration into a NN in a general way
- The advantage of planning: it's invariant to whether the observation is novel or not
- Can't **directly** generalize to continuous domains (it performs "high-level" planning on a discrete, coarse grid-world representation of the continuous domain)

## notes/question
- The authors pay attention to the architecture of the policy network, which makes me think of the ***dueling network*** (happy to see that there's someone interested in this :smile:)


## reference
- comments from [@karpathy](https://github.com/karpathy/paper-notes/blob/master/vin.md#misc)
--------------------------------------------------------------------------------