# How to Learn Reinforcement Learning: A Step-by-step Guide

This repository provides the RL learning roadmap mentioned in the blog post [How to Learn Reinforcement Learning: A Step-by-step Guide](https://www.fiercepotato.com/post/rl-roadmap).

For complementary [MATLAB](https://www.mathworks.com/products/matlab.html) coding exercises with solutions, see [RL Course MATLAB](https://github.com/anhOfTheStars/RL-Course-MATLAB).

## The RL Learning Roadmap

I highly recommend working through the roadmap in order. After the first 4 chapters, you should have enough foundation to mix up the order of the remaining chapters.
- Make sure you fully understand the required concepts through the learning materials.
- Implement the algorithm in your favorite framework. Learning happens when you implement and debug it yourself.
- Test it out on some RL problems. My favorites are cart-pole, inverted pendulum, walking robot, and Pong (a minimal environment loop is sketched after this list).
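
Before diving into chapter 1, it helps to see the bare agent-environment loop that every chapter builds on. Below is a minimal sketch assuming the Gymnasium package (`pip install gymnasium`); the random action is a placeholder for whichever algorithm you are currently implementing.

```python
import gymnasium as gym

# Create the cart-pole environment (one of the test problems above).
env = gym.make("CartPole-v1")

for episode in range(5):
    observation, info = env.reset()
    episode_return, done = 0.0, False
    while not done:
        # Placeholder policy: sample a random action.
        # Swap this for your agent's action selection.
        action = env.action_space.sample()
        observation, reward, terminated, truncated, info = env.step(action)
        episode_return += reward
        done = terminated or truncated
    print(f"Episode {episode}: return = {episode_return}")

env.close()
```
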
| Chapter | Algorithm | Required Concepts | Learning Materials |
| :-----------: | ------------- | ------------- | ------------- |
| 1 | **Dynamic Programming** (sketch below)<br>• Policy Evaluation<br>• Policy Improvement<br>• Value Iteration | • Markov Decision Process<br>• Expected return<br>• Discount factor<br>• State, Observation<br>• Action<br>• Reward<br>• State value function V(s)<br>• State-action value function Q(s,a) | • [MATLAB Tech Talk][1] Part 1: What is RL?<br>• [MATLAB Tech Talk][1] Part 2: Understanding the Environment and Rewards<br>• [RL Textbook][2] – Chapter 3+4: Finite MDP + Dynamic Programming<br>• [WildML][3] – Dynamic Programming exercises<br>• [David Silver’s Lecture][4] 1+2 |
| 2 | **Temporal-Difference (TD) Learning** (sketch below)<br>• Q-Learning<br>• SARSA | • TD error<br>• On-policy vs. off-policy<br>• Epsilon-greedy exploration | • [RL Textbook][2] – Chapter 6: Temporal-Difference Learning<br>• [WildML][3] – SARSA, Q-Learning exercises |
| 3 | **Function Approximation** (replace the table with a neural network; sketch below)<br>• Deep Q-Learning (DQN) | **RL**<br>• Why tables cannot scale<br>• Relationship with supervised learning<br>• Replay memory<br>• Target network<br>• Partially observable environments<br>• Frame stacking for Atari game environments<br>• Typical DQN network<br>• Double Q-Learning<br>**Deep Learning**<br>• Supervised learning<br>• Feedforward networks<br>• Convolutional neural networks | **RL**<br>• [David Silver’s Lecture][4] 6: Value Function Approximation<br>• [WildML][3] – Q-Learning with Linear Function Approximation<br>• [DeepMind DQN paper][5]<br>• [WildML][3] – Deep Q-Learning for Atari Games<br>• [Arthur Juliani’s series][7] Part 4 – Deep Q-Networks<br>• [PyTorch DQN Tutorial][6]<br>**Deep Learning**<br>• [Deep Learning Specialization][8] Course 1+2 |
| 4 | **Policy Gradient** (sketch below)<br>• REINFORCE (vanilla policy gradient)<br>• Actor-Critic | • Actor<br>• Critic<br>• Stochastic policy<br>• Statistics: distributions (focus on the normal/Gaussian distribution), sampling from a distribution, entropy, probability density function<br>• How to model a discrete vs. a continuous stochastic policy<br>• Importance sampling<br>• KL divergence | • [RL Textbook][2] – Chapter 13: Policy Gradient Methods<br>• [WildML][3] – Policy Gradient exercises<br>• [OpenAI Spinning Up][9] – Vanilla Policy Gradient<br>• [Deep RL Berkeley][10] – Policy Gradients + Actor-Critic Algorithms |
| 5 | **Advanced Policy Gradient** (PPO clip sketch below)<br>• Deep Deterministic Policy Gradient (DDPG)<br>• Twin-Delayed DDPG (TD3)<br>• Proximal Policy Optimization (PPO)<br>• Trust Region Policy Optimization (TRPO) | • Continuous action spaces<br>• Deterministic policy<br>• Deterministic policy gradient | • [Deep RL Berkeley][10] – Advanced Policy Gradients<br>• Original papers<br>• [OpenAI Spinning Up][9] – PPO, TRPO, DDPG and TD3 |
| 6 | **Partially Observable Environments**<br>• Modify existing algorithms to work with recurrent neural networks (RNNs) | • Recurrent neural network (RNN)<br>• Backpropagation through time<br>• Observation stacking<br>• How to sample data from replay memory for an RNN update | • [Arthur Juliani’s series][7] Part 6 – Partial Observability and DRQN<br>• [Deep Recurrent Q-Learning for Partially Observable MDPs][11]<br>• [Memory-based control with recurrent neural networks][12] |
| 7 | **Model-Based RL**<br>• Modify existing algorithms to use a model of the environment for simulation and planning | • Motivation: the environment can be actual hardware (high cost per step)<br>• Model: an approximation of the environment<br>• Environment step vs. model step<br>• Model-based planning<br>• Model-based learning | • [RL Textbook][2] – Chapter 8: Planning and Learning with Tabular Methods (8.1–8.4)<br>• [Deep RL Berkeley][10] – Model-Based Planning<br>• [Deep RL Berkeley][10] – Model-Based Reinforcement Learning |
| 8 | **Parallelization**<br>• A2C<br>• A3C<br>• IMPALA | • Parallelization for on-policy vs. off-policy algorithms<br>• Gradient parallelization<br>• Experience parallelization | • [Deep RL Berkeley][10] – Distributed RL |
| 9 | **Exploration** | • Exploration through sampling<br>• Intrinsic motivation<br>• Imitation learning | • [Deep RL Berkeley][10] – Exploration |
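
## Minimal Code Sketches

The sketches below illustrate the core update rule for chapters 1–5. They are illustrative Python skeletons, not the reference implementations from the learning materials; all hyperparameters, network sizes, and helper names are assumptions chosen for brevity.

### Chapter 1: Value Iteration

A self-contained sketch of value iteration on a made-up 3-state MDP (the transition table `P` is purely illustrative). It repeatedly applies the Bellman optimality backup until the state values V(s) converge, then extracts a greedy policy.

```python
import numpy as np

# Toy MDP: P[s][a] is a list of (probability, next_state, reward) transitions.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 2, 2.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},  # state 2 is absorbing
}
gamma = 0.9   # discount factor
theta = 1e-8  # convergence threshold

# Bellman optimality backup: V(s) <- max_a sum_s' p(s'|s,a) * (r + gamma * V(s'))
V = np.zeros(len(P))
while True:
    delta = 0.0
    for s in P:
        q_values = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in P[s]]
        best = max(q_values)
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < theta:
        break

# Greedy policy extraction from the converged value function.
policy = {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
          for s in P}
print("V:", V, "policy:", policy)
```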
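
### Chapter 2: Tabular Q-Learning

A sketch of the off-policy TD update on Gymnasium's FrozenLake-v1, a small discrete environment where a table fits; the hyperparameters are illustrative. The commented line shows where SARSA (on-policy) would differ.

```python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1", is_slippery=False)
n_states, n_actions = env.observation_space.n, env.action_space.n

Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.99, 0.1  # step size, discount, exploration rate

for episode in range(5000):
    s, _ = env.reset()
    done = False
    while not done:
        # Epsilon-greedy behavior policy.
        a = env.action_space.sample() if np.random.rand() < epsilon else int(np.argmax(Q[s]))
        s2, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        # Q-Learning bootstraps from the greedy next action (off-policy);
        # SARSA would bootstrap from the next action actually taken (on-policy).
        target = r + gamma * (0.0 if terminated else np.max(Q[s2]))
        Q[s, a] += alpha * (target - Q[s, a])  # step along the TD error
        s = s2
```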
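
### Chapter 3: Deep Q-Learning Update

A sketch of the two DQN ingredients from the concepts column, replay memory and a target network, assuming PyTorch. The layer sizes (4 observations, 2 actions, matching cart-pole) are assumptions; a full agent would fill `replay` from environment steps and periodically copy `q_net` into `target_net`.

```python
import random
from collections import deque
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99  # cart-pole sizes, assumed

# Online Q-network and a frozen copy used for bootstrap targets.
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())  # re-sync every N steps

replay = deque(maxlen=10_000)  # (obs, action, reward, next_obs, done as 0.0/1.0)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def train_step(batch_size=32):
    if len(replay) < batch_size:
        return
    # Uniform random sampling breaks the temporal correlation of transitions.
    batch = random.sample(replay, batch_size)
    s, a, r, s2, done = (torch.as_tensor(x, dtype=torch.float32) for x in zip(*batch))
    # Q(s, a) for the actions that were actually taken.
    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    # TD target from the target network; no gradient flows through it.
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_net(s2).max(dim=1).values
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```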
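
### Chapter 4: REINFORCE

A sketch of the vanilla policy gradient for a discrete stochastic policy, again assuming PyTorch; `update` consumes one completed episode. The loss is the negative of log pi(a|s) weighted by the return-to-go, so gradient descent on it performs gradient ascent on expected return.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

obs_dim, n_actions, gamma = 4, 2, 0.99  # cart-pole sizes, assumed
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def update(observations, actions, rewards):
    """One REINFORCE update from a single completed episode."""
    # Discounted return-to-go G_t for every timestep, computed backwards.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)

    obs = torch.as_tensor(observations, dtype=torch.float32)  # (T, obs_dim)
    acts = torch.as_tensor(actions)                           # (T,)
    # log pi(a_t | s_t) under the current stochastic policy.
    log_probs = Categorical(logits=policy(obs)).log_prob(acts)
    loss = -(log_probs * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```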
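
### Chapter 5: PPO Clipped Objective

A sketch of PPO's clipped surrogate loss only (the part that makes it "proximal"), assuming PyTorch tensors; the old log-probabilities come from the policy that collected the data and the advantages from a critic, both assumed to be precomputed.

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate objective, written as a loss to minimize."""
    # Importance-sampling ratio r = pi_new(a|s) / pi_old(a|s).
    ratio = torch.exp(log_probs_new - log_probs_old)
    # Clipping to [1 - eps, 1 + eps] removes the incentive to move the
    # policy far from the one that collected the data (a cheap trust region).
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Pessimistic minimum of the clipped and unclipped objectives.
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```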

## References

- [Reinforcement Learning Toolbox][13], The MathWorks
- [Reinforcement Learning: An Introduction][2] (textbook), Sutton and Barto
- [Deep Reinforcement Learning][10] (course), UC Berkeley
- [OpenAI Spinning Up][9] (textbook/blog), OpenAI
- [WildML Learning Reinforcement Learning][3] (Python course with exercises/solutions), Denny Britz
- [MATLAB RL Tech Talks][1] (videos), The MathWorks
- [David Silver’s RL Course][4] (course), UCL
- [Simple Reinforcement Learning][7] (blog), Arthur Juliani
- [Deep Learning Specialization][8] (Coursera course), Andrew Ng (you can audit it for free; I highly recommend Courses 1 and 2 to build Deep Learning foundations)

[1]: https://www.mathworks.com/videos/series/reinforcement-learning.html
[2]: http://incompleteideas.net/book/RLbook2018.pdf
[3]: https://github.com/dennybritz/reinforcement-learning
[4]: https://www.davidsilver.uk/teaching/
[5]: https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf
[6]: https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html
[7]: https://medium.com/@awjuliani
[8]: https://www.coursera.org/specializations/deep-learning
[9]: https://spinningup.openai.com/en/latest/spinningup/rl_intro.html
[10]: http://rail.eecs.berkeley.edu/deeprlcourse/
[11]: https://arxiv.org/abs/1507.06527
[12]: http://rll.berkeley.edu/deeprlworkshop/papers/rdpg.pdf
[13]: https://www.mathworks.com/products/reinforcement-learning.html