# How to Learn Reinforcement Learning: A Step-by-step Guide

This repository provides the RL learning roadmap mentioned in the blog post [How to Learn Reinforcement Learning: A Step-by-step Guide](https://www.fiercepotato.com/post/rl-roadmap).

For complementary [MATLAB](https://www.mathworks.com/products/matlab.html) coding exercises with solutions, see [RL Course MATLAB](https://github.com/anhOfTheStars/RL-Course-MATLAB).

## The RL Learning Roadmap

I highly recommend working through the roadmap in order. After the first four chapters, you should have enough foundation to take the remaining chapters in whatever order you like. For each chapter:
- Make sure you fully understand the required concepts through the learning materials.
- Implement the algorithm in your favorite framework. Learning happens when you implement and debug it yourself (a minimal Q-learning sketch follows this list to give you the flavor).
- Test it out on some RL problems. My favorites are cart-pole, inverted pendulum, walking robot, and Pong.
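
To give a taste of that implement-it-yourself loop, here is a minimal sketch of tabular Q-learning (chapter 2) in Python/NumPy. The `ChainEnv` toy environment is invented purely for illustration; swap in any environment with discrete states and actions.

```python
import numpy as np

class ChainEnv:
    """Toy 5-state chain: step left/right, reward 1 for reaching the right end."""
    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):               # action: 0 = left, 1 = right
        self.state = max(0, self.state - 1) if action == 0 else min(self.n_states - 1, self.state + 1)
        done = self.state == self.n_states - 1
        return self.state, (1.0 if done else 0.0), done

env = ChainEnv()
Q = np.zeros((env.n_states, 2))           # Q-table: one row per state, one column per action
alpha, gamma, epsilon = 0.1, 0.99, 0.1    # step size, discount factor, exploration rate

for episode in range(500):
    s, done = env.reset(), False
    while not done:
        # epsilon-greedy: mostly exploit the current Q-table, sometimes explore
        a = np.random.randint(2) if np.random.rand() < epsilon else int(np.argmax(Q[s]))
        s2, r, done = env.step(a)
        # TD update: nudge Q(s,a) toward the bootstrapped target r + gamma * max_a' Q(s',a')
        Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) * (not done) - Q[s, a])
        s = s2

print(np.argmax(Q, axis=1))               # greedy policy; should learn to always go right
```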

| Chapter | Algorithm | Required Concepts | Learning Materials |
| :-----------: | ------------- | ------------- | ------------- |
| 1 | **Dynamic Programming**<br>• Policy Evaluation<br>• Policy Improvement<br>• Value Iteration | • Markov Decision Process<br>• Expected return<br>• Discount factor<br>• State, Observation<br>• Action<br>• Reward<br>• State value function V(s)<br>• State-action value function Q(s,a) | • [MATLAB Tech Talk][1] Part 1: What is RL?<br>• [MATLAB Tech Talk][1] Part 2: Understanding the Environment and Rewards<br>• [RL Textbook][2] – Chapter 3+4: Finite MDP + Dynamic Programming<br>• [WildML][3] – Dynamic Programming exercises<br>• [David Silver’s Lecture][4] 1+2 |
| 2 | **Temporal-Difference (TD) Learning**<br>• Q-Learning<br>• SARSA | • TD error<br>• On-policy vs. off-policy<br>• Epsilon-greedy exploration | • [RL Textbook][2] – Chapter 6: Temporal-Difference Learning<br>• [WildML][3] – SARSA, Q-Learning exercises |
| 3 | **Function Approximation** (replace the Q-table with a neural network)<br>• Deep Q-Learning | RL<br>• Why tables cannot scale<br>• Relationship with supervised learning<br>• Replay memory<br>• Target network<br>• Partially observable environment<br>• Frame stacking for ATARI game environments<br>• Typical DQN network<br>• Double Q-Learning<br><br>Deep Learning<br>• Supervised learning<br>• Feedforward network<br>• Convolutional neural network | RL<br>• [David Silver’s Lecture][4] 6: Value function approximation<br>• [WildML][3] – Q-Learning with Linear Function Approximation<br>• [DeepMind DQN paper][5]<br>• [WildML][3] – Deep Q-Learning for Atari Games<br>• [Arthur Juliani’s series][7] Part 4 – Deep Q-Networks<br>• [PyTorch DQN Tutorial][6]<br><br>Deep Learning<br>• [Deep Learning Specialization][8] Course 1+2 |
| 4 | **Policy Gradient**<br>• REINFORCE (vanilla policy gradient)<br>• Actor-Critic | • Actor<br>• Critic<br>• Stochastic policy<br>• Statistics: distributions (focus on the normal/Gaussian distribution), sampling from a distribution, entropy, probability density functions<br>• How to model a discrete vs. a continuous stochastic policy<br>• Importance sampling<br>• KL divergence | • [RL Textbook][2] – Chapter 13: Policy Gradient Methods<br>• [WildML][3] – Policy Gradient exercises<br>• [OpenAI Spinning Up][9] – Vanilla Policy Gradient<br>• [Deep RL Berkeley][10] – Policy Gradients + Actor-Critic Algorithms |
| 5 | **Advanced Policy Gradient**<br>• Deep Deterministic Policy Gradient (DDPG)<br>• Twin Delayed DDPG (TD3)<br>• Proximal Policy Optimization (PPO)<br>• Trust Region Policy Optimization (TRPO) | • Continuous action space<br>• Deterministic policy<br>• Deterministic policy gradient | • [Deep RL Berkeley][10] – Advanced Policy Gradients<br>• Original papers<br>• [OpenAI Spinning Up][9] – PPO, TRPO, DDPG, and TD3 |
| 6 | **Partially Observable Environment**<br>• Modify existing algorithms to work with a recurrent neural network (RNN) | • Recurrent neural network (RNN)<br>• Backpropagation through time<br>• Observation stacking<br>• How to sample data from the replay memory for an RNN update | • [Arthur Juliani’s series][7] Part 6 – Partial Observability and DRQN<br>• [Deep Recurrent Q-Learning for Partially Observable MDPs][11]<br>• [Memory-based control with recurrent neural networks][12] |
| 7 | **Model-Based RL**<br>• Modify existing algorithms to use a model of the environment to simulate and plan | • Motivation: the environment can be actual hardware (high cost)<br>• Model: an approximation of the environment<br>• Environment step vs. model step<br>• Model-based planning<br>• Model-based learning | • [RL Textbook][2] – Chapter 8: Planning and Learning with Tabular Methods (8.1–8.4)<br>• [Deep RL Berkeley][10] – Model-Based Planning<br>• [Deep RL Berkeley][10] – Model-Based Reinforcement Learning |
| 8 | **Parallelization**<br>• A2C<br>• A3C<br>• IMPALA | • Parallelization for on-policy vs. off-policy algorithms<br>• Gradient parallelization<br>• Experience parallelization | • [Deep RL Berkeley][10] – Distributed RL |
| 9 | **Exploration** | • Exploration through sampling<br>• Intrinsic motivation<br>• Imitation learning | • [Deep RL Berkeley][10] – Exploration |
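
A few of the roadmap's core mechanisms are small enough to sketch in code. For chapter 1, value iteration is just the Bellman optimality backup applied until the values stop changing; the two-state MDP below (transition tensor `P`, reward matrix `R`) is made up for illustration.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (numbers invented for illustration).
# P[s, a, s2] = transition probability, R[s, a] = expected immediate reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
gamma = 0.9

V = np.zeros(2)
while True:
    # Bellman optimality backup: Q(s,a) = R(s,a) + gamma * sum_s2 P(s,a,s2) * V(s2)
    Q = R + gamma * (P @ V)          # shape: (states, actions)
    V_new = Q.max(axis=1)            # V(s) = max_a Q(s,a)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break                        # values have converged
    V = V_new

policy = Q.argmax(axis=1)            # extract the greedy policy
print(V, policy)
```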
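For chapter 3, the two ingredients that stabilize Deep Q-Learning, replay memory and a target network, look roughly like this in PyTorch. This is a hedged sketch rather than the exact DeepMind setup; the network sizes and hyperparameters are arbitrary.

```python
import random
from collections import deque
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99

q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())      # target starts as an exact copy
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

replay = deque(maxlen=10_000)                       # replay memory of (s, a, r, s2, done)

def train_step(batch_size=32):
    if len(replay) < batch_size:
        return
    # Sampling uniformly at random breaks the temporal correlation of the stream.
    s, a, r, s2, done = map(torch.tensor, zip(*random.sample(replay, batch_size)))
    s, s2, r, done = s.float(), s2.float(), r.float(), done.float()

    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)   # Q(s, a) for taken actions
    with torch.no_grad():
        # The bootstrap target uses the *frozen* target network, which keeps the
        # regression target from chasing its own updates.
        target = r + gamma * target_net(s2).max(dim=1).values * (1 - done)

    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# In the environment loop: replay.append((s, a, r, s2, done)), call train_step()
# every step, and copy q_net into target_net every few thousand steps:
# target_net.load_state_dict(q_net.state_dict())
```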
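For chapter 4, the heart of REINFORCE is weighting the log-probabilities of the taken actions by the return that followed them. A minimal sketch of one update, assuming a finished episode's `states`, `actions`, and `rewards` have already been collected.

```python
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def reinforce_update(states, actions, rewards):
    """One vanilla policy-gradient step from a single finished episode."""
    # Discounted return-to-go G_t at every timestep, computed backwards.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))
    # Normalizing returns acts as a crude baseline (assumes episodes longer than one step).
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    logits = policy(torch.as_tensor(states, dtype=torch.float32))
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(torch.as_tensor(actions))

    # Gradient ascent on E[log pi(a_t|s_t) * G_t], so minimize the negative.
    loss = -(log_probs * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Usage: run one episode with the current policy, record (state, action, reward)
# at every step, then call reinforce_update(states, actions, rewards).
```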
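For chapter 5, DDPG and TD3 replace the periodic hard target-network copy with a Polyak (soft) update. The snippet below shows just that idea; the tiny networks and the value of `tau` are placeholders.

```python
import torch
import torch.nn as nn

net = nn.Linear(4, 2)              # stand-ins for the online and target networks
target = nn.Linear(4, 2)
target.load_state_dict(net.state_dict())
tau = 0.005                        # small mixing factor: the target trails the online net

@torch.no_grad()
def soft_update(net, target, tau):
    # target <- (1 - tau) * target + tau * online, parameter by parameter
    for p, p_targ in zip(net.parameters(), target.parameters()):
        p_targ.mul_(1.0 - tau).add_(tau * p)

soft_update(net, target, tau)      # typically called after every gradient step
```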
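And for observation stacking (chapters 3 and 6), the trick is just a rolling window over recent observations so a feedforward network can infer velocity-like information. The `ObservationStacker` helper below is a made-up name for illustration.

```python
from collections import deque
import numpy as np

class ObservationStacker:
    """Keeps the last k observations and returns them stacked together."""
    def __init__(self, k):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_obs):
        # Fill the window with copies of the first observation.
        for _ in range(self.k):
            self.frames.append(first_obs)
        return self.stacked()

    def push(self, obs):
        self.frames.append(obs)
        return self.stacked()

    def stacked(self):
        return np.stack(self.frames, axis=0)   # shape: (k, *obs_shape)

stacker = ObservationStacker(k=4)
state = stacker.reset(np.zeros((84, 84)))      # shape (4, 84, 84), like the ATARI DQN input
```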

## References

- [Reinforcement Learning Toolbox][13], The MathWorks
- [Reinforcement Learning: An Introduction][2] (textbook), Sutton and Barto
- [Deep Reinforcement Learning][10] (course), UC Berkeley
- [OpenAI Spinning Up][9] (textbook/blog)
- [WildML Learning Reinforcement Learning][3] (Python course with exercises/solutions), Denny Britz
- [MATLAB RL Tech Talks][1] (videos), The MathWorks
- [David Silver’s RL course][4]
- [Simple Reinforcement Learning][7] (blog), Arthur Juliani
- [Deep Learning Specialization][8] (course, Coursera), Andrew Ng (you can audit for free; I highly recommend courses 1 and 2 to build deep learning foundations)

[1]: https://www.mathworks.com/videos/series/reinforcement-learning.html
[2]: http://incompleteideas.net/book/RLbook2018.pdf
[3]: https://github.com/dennybritz/reinforcement-learning
[4]: https://www.davidsilver.uk/teaching/
[5]: https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf
[6]: https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html
[7]: https://medium.com/@awjuliani
[8]: https://www.coursera.org/specializations/deep-learning
[9]: https://spinningup.openai.com/en/latest/spinningup/rl_intro.html
[10]: http://rail.eecs.berkeley.edu/deeprlcourse/
[11]: https://arxiv.org/abs/1507.06527
[12]: http://rll.berkeley.edu/deeprlworkshop/papers/rdpg.pdf
[13]: https://www.mathworks.com/products/reinforcement-learning.html