├── .github
│   ├── ISSUE_TEMPLATE
│   │   ├── i-have-a-bug-with-a-hands-on.md
│   │   ├── i-have-a-question.md
│   │   └── i-want-to-improve-the-course.md
│   └── workflows
│       ├── build_documentation.yml
│       ├── build_pr_documentation.yml
│       └── upload_pr_documentation.yml
├── LICENSE.md
├── README.md
├── notebooks
│   ├── bonus-unit1
│   │   ├── bonus-unit1.ipynb
│   │   └── bonus_unit1.ipynb
│   ├── unit1
│   │   ├── requirements-unit1.txt
│   │   └── unit1.ipynb
│   ├── unit2
│   │   ├── requirements-unit2.txt
│   │   └── unit2.ipynb
│   ├── unit3
│   │   └── unit3.ipynb
│   ├── unit4
│   │   ├── requirements-unit4.txt
│   │   └── unit4.ipynb
│   ├── unit5
│   │   └── unit5.ipynb
│   ├── unit6
│   │   ├── requirements-unit6.txt
│   │   └── unit6.ipynb
│   └── unit8
│       ├── unit8_part1.ipynb
│       └── unit8_part2.ipynb
└── units
    └── en
        ├── _toctree.yml
        ├── communication
        │   ├── certification.mdx
        │   └── conclusion.mdx
        ├── live1
        │   └── live1.mdx
        ├── unit0
        │   ├── discord101.mdx
        │   ├── introduction.mdx
        │   └── setup.mdx
        ├── unit1
        │   ├── additional-readings.mdx
        │   ├── conclusion.mdx
        │   ├── deep-rl.mdx
        │   ├── exp-exp-tradeoff.mdx
        │   ├── glossary.mdx
        │   ├── hands-on.mdx
        │   ├── introduction.mdx
        │   ├── quiz.mdx
        │   ├── rl-framework.mdx
        │   ├── summary.mdx
        │   ├── tasks.mdx
        │   ├── two-methods.mdx
        │   └── what-is-rl.mdx
        ├── unit2
        │   ├── additional-readings.mdx
        │   ├── bellman-equation.mdx
        │   ├── conclusion.mdx
        │   ├── glossary.mdx
        │   ├── hands-on.mdx
        │   ├── introduction.mdx
        │   ├── mc-vs-td.mdx
        │   ├── mid-way-quiz.mdx
        │   ├── mid-way-recap.mdx
        │   ├── q-learning-example.mdx
        │   ├── q-learning-recap.mdx
        │   ├── q-learning.mdx
        │   ├── quiz2.mdx
        │   ├── two-types-value-based-methods.mdx
        │   └── what-is-rl.mdx
        ├── unit3
        │   ├── additional-readings.mdx
        │   ├── conclusion.mdx
        │   ├── deep-q-algorithm.mdx
        │   ├── deep-q-network.mdx
        │   ├── from-q-to-dqn.mdx
        │   ├── glossary.mdx
        │   ├── hands-on.mdx
        │   ├── introduction.mdx
        │   └── quiz.mdx
        ├── unit4
        │   ├── additional-readings.mdx
        │   ├── advantages-disadvantages.mdx
        │   ├── conclusion.mdx
        │   ├── glossary.mdx
        │   ├── hands-on.mdx
        │   ├── introduction.mdx
        │   ├── pg-theorem.mdx
        │   ├── policy-gradient.mdx
        │   ├── quiz.mdx
        │   └── what-are-policy-based-methods.mdx
        ├── unit5
        │   ├── bonus.mdx
        │   ├── conclusion.mdx
        │   ├── curiosity.mdx
        │   ├── hands-on.mdx
        │   ├── how-mlagents-works.mdx
        │   ├── introduction.mdx
        │   ├── pyramids.mdx
        │   ├── quiz.mdx
        │   └── snowball-target.mdx
        ├── unit6
        │   ├── additional-readings.mdx
        │   ├── advantage-actor-critic.mdx
        │   ├── conclusion.mdx
        │   ├── hands-on.mdx
        │   ├── introduction.mdx
        │   ├── quiz.mdx
        │   └── variance-problem.mdx
        ├── unit7
        │   ├── additional-readings.mdx
        │   ├── conclusion.mdx
        │   ├── hands-on.mdx
        │   ├── introduction-to-marl.mdx
        │   ├── introduction.mdx
        │   ├── multi-agent-setting.mdx
        │   ├── quiz.mdx
        │   └── self-play.mdx
        ├── unit8
        │   ├── additional-readings.mdx
        │   ├── clipped-surrogate-objective.mdx
        │   ├── conclusion-sf.mdx
        │   ├── conclusion.mdx
        │   ├── hands-on-cleanrl.mdx
        │   ├── hands-on-sf.mdx
        │   ├── introduction-sf.mdx
        │   ├── introduction.mdx
        │   ├── intuition-behind-ppo.mdx
        │   └── visualize.mdx
        ├── unitbonus1
        │   ├── conclusion.mdx
        │   ├── how-huggy-works.mdx
        │   ├── introduction.mdx
        │   ├── play.mdx
        │   └── train.mdx
        ├── unitbonus2
        │   ├── hands-on.mdx
        │   ├── introduction.mdx
        │   └── optuna.mdx
        ├── unitbonus3
        │   ├── curriculum-learning.mdx
        │   ├── decision-transformers.mdx
        │   ├── envs-to-try.mdx
        │   ├── generalisation.mdx
        │   ├── godotrl.mdx
        │   ├── introduction.mdx
        │   ├── language-models.mdx
        │   ├── learning-agents.mdx
        │   ├── model-based.mdx
        │   ├── offline-online.mdx
        │   ├── rl-documentation.mdx
        │   ├── rlhf.mdx
        │   └── student-works.mdx
        └── unitbonus5
            ├── conclusion.mdx
            ├── customize-the-environment.mdx
            ├── getting-started.mdx
            ├── introduction.mdx
            ├── the-environment.mdx
            └── train-our-robot.mdx
/.github/ISSUE_TEMPLATE/i-have-a-bug-with-a-hands-on.md:
--------------------------------------------------------------------------------
1 | ---
2 | name: I have a bug with a hands-on
3 | about: You have encountered a bug during one of the hands-on exercises
4 | title: "[HANDS-ON BUG]"
5 | labels: ''
6 | assignees: ''
7 |
8 | ---
9 |
10 | # Describe the bug
11 |
12 | A clear and concise description of what the bug is.
13 | **Please share your notebook link so that we can reproduce the error**
14 |
15 | # Material
16 |
17 | - Did you use Google Colab?
18 |
19 | If not:
20 | - Your Operating system (OS)
21 | - Version of your OS
22 |
--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/i-have-a-question.md:
--------------------------------------------------------------------------------
1 | ---
2 | name: I have a question
3 | about: You have a question about a part of the course
4 | title: "[QUESTION]"
5 | labels: ''
6 | assignees: ''
7 |
8 | ---
9 |
10 | 1. First, the **best way to get a fast response is to ask the community** in the #rl-study-group channel on our Discord server: https://www.hf.co/join/discord
11 |
12 | 2. If you prefer to ask here, please **be specific**.
13 |
--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/i-want-to-improve-the-course.md:
--------------------------------------------------------------------------------
1 | ---
2 | name: I want to improve the course
3 | about: You found a typo, an error or you want to improve a part of the course
4 | title: "[UPDATE]"
5 | labels: ''
6 | assignees: ''
7 |
8 | ---
9 |
10 | # What do you want to improve?
11 |
12 | - Explain the typo/error or the part of the course you want to improve
13 |
14 | - **Also, don't hesitate to open a Pull Request with the update**.
15 |
--------------------------------------------------------------------------------
/.github/workflows/build_documentation.yml:
--------------------------------------------------------------------------------
1 | name: Build documentation
2 |
3 | on:
4 |   push:
5 |     branches:
6 |       - main
7 |
8 | jobs:
9 |   build:
10 |     uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@main
11 |     with:
12 |       commit_sha: ${{ github.sha }}
13 |       package: deep-rl-class
14 |       package_name: deep-rl-course
15 |       path_to_docs: deep-rl-class/units/
16 |       additional_args: --not_python_module
17 |       languages: en
18 |     secrets:
19 |       hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
20 |
--------------------------------------------------------------------------------
/.github/workflows/build_pr_documentation.yml:
--------------------------------------------------------------------------------
1 | name: Build PR Documentation
2 |
3 | on:
4 |   pull_request:
5 |
6 | concurrency:
7 |   group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
8 |   cancel-in-progress: true
9 |
10 | jobs:
11 |   build:
12 |     uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@main
13 |     with:
14 |       commit_sha: ${{ github.event.pull_request.head.sha }}
15 |       pr_number: ${{ github.event.number }}
16 |       package: deep-rl-class
17 |       package_name: deep-rl-course
18 |       path_to_docs: deep-rl-class/units/
19 |       additional_args: --not_python_module
20 |       languages: en
21 |
--------------------------------------------------------------------------------
/.github/workflows/upload_pr_documentation.yml:
--------------------------------------------------------------------------------
1 | name: Upload PR Documentation
2 |
3 | on:
4 |   workflow_run:
5 |     workflows: ["Build PR Documentation"]
6 |     types:
7 |       - completed
8 |
9 | jobs:
10 |   build:
11 |     uses: huggingface/doc-builder/.github/workflows/upload_pr_documentation.yml@main
12 |     with:
13 |       package_name: deep-rl-course
14 |       hub_base_path: https://moon-ci-docs.huggingface.co
15 |     secrets:
16 |       hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
17 |       comment_bot_token: ${{ secrets.COMMENT_BOT_TOKEN }}
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # [The Hugging Face Deep Reinforcement Learning Course 🤗 (v2.0)](https://huggingface.co/deep-rl-course/unit0/introduction)
2 |
3 |
4 |
5 | If you like the course, don't hesitate to **⭐ star this repository. This helps us 🤗**.
6 |
7 | This repository contains the Deep Reinforcement Learning Course mdx files and notebooks. **The website is here**: https://huggingface.co/deep-rl-course/unit0/introduction?fw=pt
8 |
9 | - The syllabus 📚: https://simoninithomas.github.io/deep-rl-course
10 |
11 | - The course 📚: https://huggingface.co/deep-rl-course/unit0/introduction?fw=pt
12 |
13 | - **Sign up here** ➡️➡️➡️ http://eepurl.com/ic5ZUD
14 |
15 | ## Course Maintenance Notice 🚧
16 |
17 | Please note that this **Deep Reinforcement Learning course is now in a low-maintenance state**. However, it **remains an excellent resource to learn both the theory and practical aspects of Deep Reinforcement Learning**.
18 |
19 | Keep in mind the following points:
20 |
21 | - *Unit 7 (AI vs AI)*: This feature is currently non-functional. However, you can still train your agent to play soccer and observe its performance.
22 |
23 | - *Leaderboard*: The leaderboard is no longer operational.
24 |
25 | Aside from these points, all theory content and practical exercises remain fully accessible and effective for learning.
26 |
27 | If you have any problem with one of the hands-on exercises, **please check the issues section, where the community shares solutions to bugs**.
28 |
29 | ## Citing the project
30 |
31 | To cite this repository in publications:
32 |
33 | ```bibtex
34 | @misc{deep-rl-course,
35 | author = {Simonini, Thomas and Sanseviero, Omar},
36 | title = {The Hugging Face Deep Reinforcement Learning Class},
37 | year = {2023},
38 | publisher = {GitHub},
39 | journal = {GitHub repository},
40 | howpublished = {\url{https://github.com/huggingface/deep-rl-class}},
41 | }
42 | ```
43 |
--------------------------------------------------------------------------------
/notebooks/unit1/requirements-unit1.txt:
--------------------------------------------------------------------------------
1 | stable-baselines3==2.0.0a5
2 | swig
3 | gymnasium[box2d]
4 | huggingface_sb3
5 |
--------------------------------------------------------------------------------
/notebooks/unit2/requirements-unit2.txt:
--------------------------------------------------------------------------------
1 | gymnasium
2 | pygame
3 | numpy
4 |
5 | huggingface_hub
6 | pickle5
7 | pyyaml==6.0
8 | imageio
9 | imageio_ffmpeg
10 | pyglet==1.5.1
11 | tqdm
--------------------------------------------------------------------------------
/notebooks/unit4/requirements-unit4.txt:
--------------------------------------------------------------------------------
1 | git+https://github.com/ntasfi/PyGame-Learning-Environment.git
2 | git+https://github.com/simoninithomas/gym-games
3 | huggingface_hub
4 | imageio-ffmpeg
5 | pyyaml==6.0
6 |
--------------------------------------------------------------------------------
/notebooks/unit6/requirements-unit6.txt:
--------------------------------------------------------------------------------
1 | stable-baselines3==2.0.0a4
2 | huggingface_sb3
3 | panda-gym
4 | huggingface_hub
--------------------------------------------------------------------------------
/units/en/communication/certification.mdx:
--------------------------------------------------------------------------------
1 | # The certification process
2 |
3 |
4 | The certification process is **completely free**:
5 |
6 | - To get a *certificate of completion*: you need **to pass 80% of the assignments**.
7 | - To get a *certificate of excellence*: you need **to pass 100% of the assignments**.
8 |
9 | There are **no deadlines, the course is self-paced**.
10 |
11 |
12 |
13 | When we say pass, **we mean that your model must be pushed to the Hub and get a result equal or above the minimal requirement**.
14 |
15 | To check your progression and which unit you passed/not passed: https://huggingface.co/spaces/ThomasSimonini/Check-my-progress-Deep-RL-Course
16 |
17 | Now that you're ready for the certification process, you need to:
18 |
19 | 1. Go here: https://huggingface.co/spaces/huggingface-projects/Deep-RL-Course-Certification/
20 | 2. Type your *Hugging Face username*, your *first name* and your *last name*
21 |
22 | 3. Click on "Generate my certificate".
23 | - If you passed 80% of the assignments, **congratulations**, you've just earned the certificate of completion.
24 | - If you passed 100% of the assignments, **congratulations**, you've just earned the certificate of excellence.
25 | - If you are below 80%, don't be discouraged! Check which units you need to do again to get your certificate.
26 |
27 | 4. You can download your certificate in PDF and PNG format.
28 |
29 | Don't hesitate to share your certificate on Twitter (tag me @ThomasSimonini and @huggingface) and on LinkedIn.
30 |
31 |
--------------------------------------------------------------------------------
/units/en/communication/conclusion.mdx:
--------------------------------------------------------------------------------
1 | # Congratulations
2 |
3 |
4 |
5 |
6 | **Congratulations on finishing this course!** With perseverance, hard work, and determination, **you've acquired a solid background in Deep Reinforcement Learning**.
7 |
8 | But finishing this course is **not the end of your journey**. It's just the beginning: don't hesitate to explore bonus unit 3, where we show you topics you may be interested in studying. And don't hesitate to **share what you're doing, and ask questions in the Discord server**.
9 |
10 | **Thank you** for being part of this course. **I hope you liked this course as much as I loved writing it**.
11 |
12 | Don't hesitate **to give us feedback on how we can improve the course** using [this form](https://forms.gle/BzKXWzLAGZESGNaE9)
13 |
14 | And don't forget **to check in the next section how you can get (if you pass) your certificate of completion 🎓.**
15 |
16 | One last thing, to keep in touch with the Reinforcement Learning Team and with me:
17 |
18 | - [Follow me on Twitter](https://twitter.com/thomassimonini)
19 | - [Follow Hugging Face Twitter account](https://twitter.com/huggingface)
20 | - [Join the Hugging Face Discord](https://www.hf.co/join/discord)
21 |
22 | ## Keep Learning, Stay Awesome 🤗
23 |
24 | Thomas Simonini,
25 |
--------------------------------------------------------------------------------
/units/en/live1/live1.mdx:
--------------------------------------------------------------------------------
1 | # Live 1: How the course works, Q&A, and playing with Huggy
2 |
3 | In this first live stream, we explained how the course works (scope, units, challenges, and more) and answered your questions.
4 |
5 | And finally, we looked at some LunarLander agents you've trained and played with your Huggies 🐶
6 |
7 |
8 |
9 | To know when the next live is scheduled, **check the Discord server**. We will also send **you an email**. If you can't participate, don't worry, we record the live sessions.
--------------------------------------------------------------------------------
/units/en/unit0/discord101.mdx:
--------------------------------------------------------------------------------
1 | # Discord 101 [[discord-101]]
2 |
3 | Hey there! My name is Huggy, the dog 🐕, and I'm looking forward to training with you during this RL Course!
4 | Although I don't know much about fetching sticks (yet), I know one or two things about Discord. So I wrote this guide to help you learn about it!
5 |
6 |
7 |
8 | Discord is a free chat platform. If you've used Slack, **it's quite similar**. There is a Hugging Face Community Discord server with 50,000 members that you can join with a single click here. So many humans to play with!
9 |
10 | Starting in Discord can be a bit intimidating, so let me take you through it.
11 |
12 | When you [sign-up to our Discord server](http://hf.co/join/discord), you'll choose your interests. Make sure to **click "Reinforcement Learning,"** and you'll get access to the Reinforcement Learning Category containing all the course-related channels. If you feel like joining even more channels, go for it! 🚀
13 |
14 | Then click next; you'll get to **introduce yourself in the `#introduce-yourself` channel**.
15 |
16 |
17 |
18 |
19 | The course-related channels are in the Reinforcement Learning category. **Don't forget to sign up to these channels** by clicking on 🤖 Reinforcement Learning in `role-assigment`:
20 | - `rl-announcements`: where we give the **latest information about the course**.
21 | - `rl-discussions`: where you can **exchange about RL and share information**.
22 | - `rl-study-group`: where you can **ask questions and exchange with your classmates**.
23 | - `rl-i-made-this`: where you can **share your projects and models**.
24 |
25 | The HF Community Server has a thriving community of human beings interested in many areas, so you can also learn from those. There are paper discussions, events, and many other things.
26 |
27 | Was this useful? There are a couple of tips I can share with you:
28 |
29 | - There are **voice channels** you can use as well, although most people prefer text chat.
30 | - You can **use markdown style** for text chats. So if you're writing code, you can use that style. Sadly this does not work as well for links.
31 | - You can open threads as well! It's a good idea when **it's a long conversation**.
32 |
33 | I hope this is useful! And if you have questions, just ask!
34 |
35 | See you later!
36 |
37 | Huggy 🐶
38 |
--------------------------------------------------------------------------------
/units/en/unit0/setup.mdx:
--------------------------------------------------------------------------------
1 | # Setup [[setup]]
2 |
3 | After all this information, it's time to get started. We're going to do two things:
4 |
5 | 1. **Create your Hugging Face account** if it's not already done
6 | 2. **Sign up to Discord and introduce yourself** (don't be shy 🤗)
7 |
8 | ### Let's create your Hugging Face account
9 |
10 | (If it's not already done) create an account on HF here.
11 |
12 | ### Let's join our Discord server
13 |
14 | You can now sign up for our Discord Server. This is the place where you **can chat with the community and with us, create and join study groups to grow with each other, and more**.
15 |
16 | 👉🏻 Join our Discord server here.
17 |
18 | When you join, remember to introduce yourself in #introduce-yourself and sign up for the reinforcement learning channels in #channels-and-roles.
19 |
20 | We have multiple RL-related channels:
21 | - `rl-announcements`: where we give the latest information about the course.
22 | - `rl-discussions`: where you can chat about RL and share information.
23 | - `rl-study-group`: where you can create and join study groups.
24 | - `rl-i-made-this`: where you can share your projects and models.
25 |
26 | If this is your first time using Discord, we wrote a Discord 101 guide covering best practices. Check the next section.
27 |
28 | Congratulations! **You've just finished the onboarding**. You're now ready to start learning Deep Reinforcement Learning. Have fun!
29 |
30 |
31 | ### Keep Learning, stay awesome 🤗
32 |
--------------------------------------------------------------------------------
/units/en/unit1/additional-readings.mdx:
--------------------------------------------------------------------------------
1 | # Additional Readings [[additional-readings]]
2 |
3 | These are **optional readings** if you want to go deeper.
4 |
5 | ## Deep Reinforcement Learning [[deep-rl]]
6 |
7 | - [Reinforcement Learning: An Introduction, Richard Sutton and Andrew G. Barto Chapter 1, 2 and 3](http://incompleteideas.net/book/RLbook2020.pdf)
8 | - [Foundations of Deep RL Series, L1 MDPs, Exact Solution Methods, Max-ent RL by Pieter Abbeel](https://youtu.be/2GwBez0D20A)
9 | - [Spinning Up RL by OpenAI Part 1: Key concepts of RL](https://spinningup.openai.com/en/latest/spinningup/rl_intro.html)
10 |
11 | ## Gym [[gym]]
12 |
13 | - [Getting Started With OpenAI Gym: The Basic Building Blocks](https://blog.paperspace.com/getting-started-with-openai-gym/)
14 | - [Make your own Gym custom environment](https://www.gymlibrary.dev/content/environment_creation/)
15 |
--------------------------------------------------------------------------------
/units/en/unit1/conclusion.mdx:
--------------------------------------------------------------------------------
1 | # Conclusion [[conclusion]]
2 |
3 | Congrats on finishing this unit! **That was the biggest one**, and there was a lot of information. And congrats on finishing the tutorial. You’ve just trained your first Deep RL agents and shared them with the community! 🥳
4 |
5 | It's **normal if you still feel confused by some of these elements**. This was the same for me and for all people who studied RL.
6 |
7 | **Take time to really grasp the material** before continuing. It’s important to master these elements and have a solid foundation before entering the fun part.
8 |
9 | Naturally, during the course, we’re going to use and explain these terms again, but it’s better to understand them before diving into the next units.
10 |
11 | In the next (bonus) unit, we’re going to reinforce what we just learned by **training Huggy the Dog to fetch a stick**.
12 |
13 | You will then be able to play with him 🤗.
14 |
15 |
16 |
17 | Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then, please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)
18 |
19 | ### Keep Learning, stay awesome 🤗
20 |
21 |
22 |
--------------------------------------------------------------------------------
/units/en/unit1/deep-rl.mdx:
--------------------------------------------------------------------------------
1 | # The “Deep” in Reinforcement Learning [[deep-rl]]
2 |
3 |
4 | What we've talked about so far is Reinforcement Learning. But where does the "Deep" come into play?
5 |
6 |
7 | Deep Reinforcement Learning introduces **deep neural networks to solve Reinforcement Learning problems** — hence the name “deep”.
8 |
9 | For instance, in the next unit, we’ll learn about two value-based algorithms: Q-Learning (classic Reinforcement Learning) and then Deep Q-Learning.
10 |
11 | You’ll see the difference is that, in the first approach, **we use a traditional algorithm** to create a Q table that helps us find what action to take for each state.
12 |
13 | In the second approach, **we will use a Neural Network** (to approximate the Q value).
14 |
15 |
16 |
17 | Schema inspired by the Q learning notebook by Udacity
18 |
19 |
20 |
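To make the contrast concrete, here is a minimal, hypothetical sketch of the two representations (the state/action counts and layer widths are made up for illustration): a Q-table indexed by (state, action), versus a small neural network that approximates the Q-values.

```python
import numpy as np
import torch
import torch.nn as nn

# Classic Q-Learning: the Q-values live in a table with one cell per (state, action) pair.
n_states, n_actions = 16, 4                  # hypothetical sizes for a small grid world
q_table = np.zeros((n_states, n_actions))
best_action = q_table[3].argmax()            # look up the best action for state 3

# Deep Q-Learning: a neural network approximates Q(state) -> one value per action,
# which scales to state spaces far too large for a table (e.g. raw pixels).
q_network = nn.Sequential(
    nn.Linear(n_states, 64),
    nn.ReLU(),
    nn.Linear(64, n_actions),
)
state = torch.zeros(n_states)
state[3] = 1.0                               # one-hot encoding of state 3, for illustration
best_action_nn = q_network(state).argmax().item()
```

The table works for small, discrete state spaces; the neural network is what lets Deep Q-Learning scale to much larger ones.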
21 | If you are not familiar with Deep Learning you should definitely watch [the FastAI Practical Deep Learning for Coders](https://course.fast.ai) (Free).
22 |
--------------------------------------------------------------------------------
/units/en/unit1/exp-exp-tradeoff.mdx:
--------------------------------------------------------------------------------
1 | # The Exploration/Exploitation trade-off [[exp-exp-tradeoff]]
2 |
3 | Finally, before looking at the different methods to solve Reinforcement Learning problems, we must cover one more very important topic: *the exploration/exploitation trade-off.*
4 |
5 | - *Exploration* is exploring the environment by trying random actions in order to **find more information about the environment.**
6 | - *Exploitation* is **exploiting known information to maximize the reward.**
7 |
8 | Remember, the goal of our RL agent is to maximize the expected cumulative reward. However, **we can fall into a common trap**.
9 |
10 | Let’s take an example:
11 |
12 |
13 |
14 | In this game, our mouse can have an **infinite amount of small cheese** (+1 each). But at the top of the maze, there is a gigantic sum of cheese (+1000).
15 |
16 | However, if we only focus on exploitation, our agent will never reach the gigantic sum of cheese. Instead, it will only exploit **the nearest source of rewards,** even if this source is small (exploitation).
17 |
18 | But if our agent does a little bit of exploration, it can **discover the big reward** (the pile of big cheese).
19 |
20 | This is what we call the exploration/exploitation trade-off. We need to balance how much we **explore the environment** and how much we **exploit what we know about the environment.**
21 |
22 | Therefore, we must **define a rule that helps to handle this trade-off**. We’ll see the different ways to handle it in the future units.
23 |
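As a small preview of such a rule, here is a minimal sketch of epsilon-greedy, which you'll study properly in the Q-Learning unit (the function name and the epsilon value are illustrative):

```python
import random

def choose_action(q_values, epsilon=0.1):
    """Epsilon-greedy rule: explore with probability epsilon, exploit otherwise."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                    # exploration: random action
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploitation: best known action
```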
24 | If it’s still confusing, **think of a real problem: the choice of picking a restaurant:**
25 |
26 |
27 |
28 |
29 | Source: Berkeley AI Course
30 |
31 |
32 |
33 | - *Exploitation*: You go to the same restaurant that you know is good every day and **take the risk of missing another, better restaurant.**
34 | - *Exploration*: Try restaurants you've never been to before, with the risk of having a bad experience **but the possible opportunity of a fantastic experience.**
35 |
36 | To recap:
37 |
38 |
--------------------------------------------------------------------------------
/units/en/unit1/glossary.mdx:
--------------------------------------------------------------------------------
1 | # Glossary [[glossary]]
2 |
3 | This is a community-created glossary. Contributions are welcome!
4 |
5 | ### Agent
6 |
7 | An agent learns to **make decisions by trial and error, with rewards and punishments from the surroundings**.
8 |
9 | ### Environment
10 |
11 | An environment is a simulated world **where an agent can learn by interacting with it**.
12 |
13 | ### Markov Property
14 |
15 | It implies that the action taken by our agent is **conditional solely on the present state and independent of the past states and actions**.
16 |
17 | ### Observations/State
18 |
19 | - **State**: Complete description of the state of the world.
20 | - **Observation**: Partial description of the state of the environment/world.
21 |
22 | ### Actions
23 |
24 | - **Discrete Actions**: Finite number of actions, such as left, right, up, and down.
25 | - **Continuous Actions**: An infinite range of possible actions; for example, in the case of a self-driving car, the driving scenario involves an infinite range of possible actions.
26 |
27 | ### Rewards and Discounting
28 |
29 | - **Rewards**: Fundamental factor in RL. Tells the agent whether the action taken is good/bad.
30 | - RL algorithms are focused on maximizing the **cumulative reward**.
31 | - **Reward Hypothesis**: RL problems can be formulated as a maximisation of (cumulative) return.
32 | - **Discounting** is performed because rewards obtained at the start are more likely to happen as they are more predictable than long-term rewards.
33 |
34 | ### Tasks
35 |
36 | - **Episodic**: Has a starting point and an ending point.
37 | - **Continuous**: Has a starting point but no ending point.
38 |
39 | ### Exploration v/s Exploitation Trade-Off
40 |
41 | - **Exploration**: It's all about exploring the environment by trying random actions and receiving feedback/returns/rewards from the environment.
42 | - **Exploitation**: It's about exploiting what we know about the environment to gain maximum rewards.
43 | - **Exploration-Exploitation Trade-Off**: It balances how much we want to **explore** the environment and how much we want to **exploit** what we know about the environment.
44 |
45 | ### Policy
46 |
47 | - **Policy**: It is called the agent's brain. It tells us what action to take, given the state.
48 | - **Optimal Policy**: Policy that **maximizes** the **expected return** when an agent acts according to it. It is learned through *training*.
49 |
50 | ### Policy-based Methods:
51 |
52 | - An approach to solving RL problems.
53 | - In this method, the Policy is learned directly.
54 | - Will map each state to the best corresponding action at that state. Or a probability distribution over the set of possible actions at that state.
55 |
56 | ### Value-based Methods:
57 |
58 | - Another approach to solving RL problems.
59 | - Here, instead of training a policy, we train a **value function** that maps each state to the expected value of being in that state.
60 |
61 | Contributions are welcome 🤗
62 |
63 | If you want to improve the course, you can [open a Pull Request.](https://github.com/huggingface/deep-rl-class/pulls)
64 |
65 | This glossary was made possible thanks to:
66 |
67 | - [@lucifermorningstar1305](https://github.com/lucifermorningstar1305)
68 | - [@daspartho](https://github.com/daspartho)
69 | - [@misza222](https://github.com/misza222)
70 |
71 |
--------------------------------------------------------------------------------
/units/en/unit1/introduction.mdx:
--------------------------------------------------------------------------------
1 | # Introduction to Deep Reinforcement Learning [[introduction-to-deep-reinforcement-learning]]
2 |
3 |
4 |
5 |
6 | Welcome to the most fascinating topic in Artificial Intelligence: **Deep Reinforcement Learning.**
7 |
8 | Deep RL is a type of Machine Learning where an agent learns **how to behave** in an environment **by performing actions** and **seeing the results.**
9 |
10 | In this first unit, **you'll learn the foundations of Deep Reinforcement Learning.**
11 |
12 |
13 | Then, you'll **train your Deep Reinforcement Learning agent, a lunar lander, to land correctly on the Moon** using Stable-Baselines3, a Deep Reinforcement Learning library.
14 |
15 |
16 |
17 |
18 | And finally, you'll **upload this trained agent to the Hugging Face Hub 🤗, a free, open platform where people can share ML models, datasets, and demos.**
19 |
20 | It's essential **to master these elements** before diving into implementing Deep Reinforcement Learning agents. The goal of this chapter is to give you solid foundations.
21 |
22 |
23 | After this unit, in a bonus unit, you'll be **able to train Huggy the Dog 🐶 to fetch the stick and play with him 🤗**.
24 |
25 |
26 |
27 | So let's get started! 🚀
28 |
--------------------------------------------------------------------------------
/units/en/unit1/summary.mdx:
--------------------------------------------------------------------------------
1 | # Summary [[summary]]
2 |
3 | That was a lot of information! Let's summarize:
4 |
5 | - Reinforcement Learning is a computational approach of learning from actions. We build an agent that learns from the environment **by interacting with it through trial and error** and receiving rewards (negative or positive) as feedback.
6 |
7 | - The goal of any RL agent is to maximize its expected cumulative reward (also called expected return) because RL is based on the **reward hypothesis**, which is that **all goals can be described as the maximization of the expected cumulative reward.**
8 |
9 | - The RL process is a loop that outputs a sequence of **state, action, reward and next state.**
10 |
11 | - To calculate the expected cumulative reward (expected return), we discount the rewards: the rewards that come sooner (at the beginning of the game) **are more likely to happen since they are more predictable than the long-term future rewards.**
12 |
13 | - To solve an RL problem, you want to **find an optimal policy**. The policy is the “brain” of your agent, which will tell us **what action to take given a state.** The optimal policy is the one which **gives you the actions that maximize the expected return.**
14 |
15 | - There are two ways to find your optimal policy:
16 | 1. By training your policy directly: **policy-based methods.**
17 | 2. By training a value function that tells us the expected return the agent will get at each state and use this function to define our policy: **value-based methods.**
18 |
19 | - Finally, we speak about Deep RL because we introduce **deep neural networks to estimate the action to take (policy-based) or to estimate the value of a state (value-based)** hence the name “deep”.
20 |
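To tie these points together, here is a minimal sketch of the RL loop and of the discounted return, using the Gymnasium API from the hands-on (the environment name, the random placeholder policy, and the gamma value are just examples):

```python
import gymnasium as gym

env = gym.make("CartPole-v1")   # any Gym environment works here; CartPole is just an example
gamma = 0.99                    # discount rate
state, info = env.reset()

rewards = []
done = False
while not done:
    action = env.action_space.sample()                 # placeholder for a real policy
    state, reward, terminated, truncated, info = env.step(action)
    rewards.append(reward)
    done = terminated or truncated

# Discounted return: rewards that come sooner weigh more than rewards that come later.
discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))
print(f"Discounted return of the episode: {discounted_return:.2f}")
env.close()
```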
--------------------------------------------------------------------------------
/units/en/unit1/tasks.mdx:
--------------------------------------------------------------------------------
1 | # Type of tasks [[tasks]]
2 |
3 | A task is an **instance** of a Reinforcement Learning problem. We can have two types of tasks: **episodic** and **continuing**.
4 |
5 | ## Episodic task [[episodic-task]]
6 |
7 | In this case, we have a starting point and an ending point **(a terminal state). This creates an episode**: a list of States, Actions, Rewards, and new States.
8 |
9 | For instance, think about Super Mario Bros: an episode begins at the launch of a new Mario level and ends **when you're killed or you reach the end of the level.**
10 |
11 |
12 |
13 | Beginning of a new episode.
14 |
15 |
16 |
17 |
18 | ## Continuing tasks [[continuing-tasks]]
19 |
20 | These are tasks that continue forever (**no terminal state**). In this case, the agent must **learn how to choose the best actions and simultaneously interact with the environment.**
21 |
22 | For instance, an agent that does automated stock trading. For this task, there is no terminal state. **The agent keeps running until we decide to stop it.**
23 |
24 |
25 |
26 | To recap:
27 |
28 |
--------------------------------------------------------------------------------
/units/en/unit1/two-methods.mdx:
--------------------------------------------------------------------------------
1 | # Two main approaches for solving RL problems [[two-methods]]
2 |
3 |
4 | Now that we learned the RL framework, how do we solve the RL problem?
5 |
6 |
7 | In other words, how do we build an RL agent that can **select the actions that maximize its expected cumulative reward?**
8 |
9 | ## The Policy π: the agent’s brain [[policy]]
10 |
11 | The Policy **π** is the **brain of our Agent**: it’s the function that tells us what **action to take given the state we are in.** So it **defines the agent’s behavior** at a given time.
12 |
13 |
14 |
15 | Think of policy as the brain of our agent, the function that will tell us the action to take given a state
16 |
17 |
18 | This Policy **is the function we want to learn**, our goal is to find the optimal policy π\*, the policy that **maximizes expected return** when the agent acts according to it. We find this π\* **through training.**
19 |
20 | There are two approaches to train our agent to find this optimal policy π\*:
21 |
22 | - **Directly,** by teaching the agent to learn which **action to take,** given the current state: **Policy-Based Methods.**
23 | - **Indirectly,** by teaching the agent to learn **which state is more valuable** and then take the action that **leads to the more valuable states**: **Value-Based Methods.**
24 |
25 | ## Policy-Based Methods [[policy-based]]
26 |
27 | In Policy-Based methods, **we learn a policy function directly.**
28 |
29 | This function will define a mapping from each state to the best corresponding action. Alternatively, it could define **a probability distribution over the set of possible actions at that state.**
30 |
31 |
32 |
33 | As we can see here, the policy (deterministic) directly indicates the action to take for each step.
34 |
35 |
36 |
37 | We have two types of policies:
38 |
39 |
40 | - *Deterministic*: a policy at a given state **will always return the same action.**
41 |
42 |
43 |
44 | action = policy(state)
45 |
46 |
47 |
48 |
49 | - *Stochastic*: outputs **a probability distribution over actions.**
50 |
51 |
52 |
53 | policy(actions | state) = probability distribution over the set of actions given the current state
54 |
55 |
56 |
57 |
58 | Given an initial state, our stochastic policy will output probability distributions over the possible actions at that state.
59 |
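Here is a minimal sketch of the two kinds of policies (the states, actions, and probabilities are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng()
actions = ["left", "right", "up", "down"]

def deterministic_policy(state):
    """action = policy(state): always the same action for a given state."""
    lookup = {"s0": "right", "s1": "down"}          # hypothetical state -> action mapping
    return lookup[state]

def stochastic_policy(state):
    """policy(. | state): sample an action from a probability distribution."""
    probs = np.array([0.1, 0.6, 0.2, 0.1])          # hypothetical distribution for this state
    return rng.choice(actions, p=probs)

print(deterministic_policy("s0"))   # always "right"
print(stochastic_policy("s0"))      # "right" most of the time, but not always
```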
60 |
61 |
62 | If we recap:
63 |
64 |
65 |
66 |
67 |
68 | ## Value-based methods [[value-based]]
69 |
70 | In value-based methods, instead of learning a policy function, we **learn a value function** that maps a state to the expected value **of being at that state.**
71 |
72 | The value of a state is the **expected discounted return** the agent can get if it **starts in that state, and then acts according to our policy.**
73 |
74 | “Act according to our policy” just means that our policy is **“going to the state with the highest value”.**
75 |
76 |
77 |
78 | Here we see that our value function **defined values for each possible state.**
79 |
80 |
81 |
82 | Thanks to our value function, at each step our policy will select the state with the biggest value defined by the value function: -7, then -6, then -5 (and so on) to attain the goal.
83 |
84 |
86 |
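Here is a minimal sketch of this idea, using hypothetical state values like the -7, -6, -5 example above: the hand-defined policy simply moves to the neighboring state with the highest value.

```python
# Hypothetical value function over a few states (as in the -7, -6, -5 example above).
values = {"A": -7, "B": -6, "C": -5, "goal": 0}
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B", "goal"]}   # made-up transitions

def policy(state):
    """Hand-defined policy: move to the neighboring state with the highest value."""
    return max(neighbors[state], key=lambda s: values[s])

state = "A"
while state != "goal":
    state = policy(state)
    print("moved to", state, "which has value", values[state])
```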
87 | If we recap:
88 |
89 |
90 |
91 |
--------------------------------------------------------------------------------
/units/en/unit1/what-is-rl.mdx:
--------------------------------------------------------------------------------
1 | # What is Reinforcement Learning? [[what-is-reinforcement-learning]]
2 |
3 | To understand Reinforcement Learning, let’s start with the big picture.
4 |
5 | ## The big picture [[the-big-picture]]
6 |
7 | The idea behind Reinforcement Learning is that an agent (an AI) will learn from the environment by **interacting with it** (through trial and error) and **receiving rewards** (negative or positive) as feedback for performing actions.
8 |
9 | Learning from interactions with the environment **comes from our natural experiences.**
10 |
11 | For instance, imagine putting your little brother in front of a video game he never played, giving him a controller, and leaving him alone.
12 |
13 |
14 |
15 |
16 | Your brother will interact with the environment (the video game) by pressing the right button (action). He got a coin, that’s a +1 reward. It’s positive, he just understood that in this game **he must get the coins.**
17 |
18 |
19 |
20 | But then, **he presses the right button again** and he touches an enemy. He just died, so that's a -1 reward.
21 |
22 |
23 |
24 |
25 | By interacting with his environment through trial and error, your little brother understands that **he needs to get coins in this environment but avoid the enemies.**
26 |
27 | **Without any supervision**, the child will get better and better at playing the game.
28 |
29 | That’s how humans and animals learn, **through interaction.** Reinforcement Learning is just a **computational approach of learning from actions.**
30 |
31 |
32 | ### A formal definition [[a-formal-definition]]
33 |
34 | We can now make a formal definition:
35 |
36 |
37 | Reinforcement learning is a framework for solving control tasks (also called decision problems) by building agents that learn from the environment by interacting with it through trial and error and receiving rewards (positive or negative) as unique feedback.
38 |
39 |
40 | But how does Reinforcement Learning work?
41 |
--------------------------------------------------------------------------------
/units/en/unit2/additional-readings.mdx:
--------------------------------------------------------------------------------
1 | # Additional Readings [[additional-readings]]
2 |
3 | These are **optional readings** if you want to go deeper.
4 |
5 | ## Monte Carlo and TD Learning [[mc-td]]
6 |
7 | To dive deeper into Monte Carlo and Temporal Difference Learning:
8 |
9 | - Why do temporal difference (TD) methods have lower variance than Monte Carlo methods?
10 | - When are Monte Carlo methods preferred over temporal difference ones?
11 |
12 | ## Q-Learning [[q-learning]]
13 |
14 | - Reinforcement Learning: An Introduction, Richard Sutton and Andrew G. Barto Chapter 5, 6 and 7
15 | - Foundations of Deep RL Series, L2 Deep Q-Learning by Pieter Abbeel
16 |
--------------------------------------------------------------------------------
/units/en/unit2/bellman-equation.mdx:
--------------------------------------------------------------------------------
1 | # The Bellman Equation: simplify our value estimation [[bellman-equation]]
2 |
3 | The Bellman equation **simplifies our state value or state-action value calculation.**
4 |
5 |
6 |
7 |
8 | With what we have learned so far, we know that if we calculate \\(V(S_t)\\) (the value of a state), we need to calculate the return starting at that state and then follow the policy forever after. **(The policy we defined in the following example is a Greedy Policy; for simplification, we don't discount the reward).**
9 |
10 | So to calculate \\(V(S_t)\\), we need to calculate the sum of the expected rewards. Hence:
11 |
12 |
13 |
14 | To calculate the value of State 1: the sum of rewards if the agent started in that state and then followed the greedy policy (taking actions that lead to the best state values) for all the time steps.
15 |
16 |
17 | Then, to calculate the \\(V(S_{t+1})\\), we need to calculate the return starting at that state \\(S_{t+1}\\).
18 |
19 |
20 |
21 | To calculate the value of State 2: the sum of rewards if the agent started in that state, and then followed the policy for all the time steps.
22 |
23 |
24 | So you may have noticed, we're repeating the computation of the value of different states, which can be tedious if you need to do it for each state value or state-action value.
25 |
26 | Instead of calculating the expected return for each state or each state-action pair, **we can use the Bellman equation.** (hint: if you know what Dynamic Programming is, this is very similar! if you don't know what it is, no worries!)
27 |
28 | The Bellman equation is a recursive equation that works like this: instead of starting for each state from the beginning and calculating the return, we can consider the value of any state as:
29 |
30 | **The immediate reward \\(R_{t+1}\\) + the discounted value of the state that follows ( \\(\gamma * V(S_{t+1}) \\) ) .**
31 |
32 |
33 |
34 |
35 |
36 |
37 | If we go back to our example, we can say that the value of State 1 is equal to the expected cumulative return if we start at that state.
38 |
39 |
40 |
41 |
42 | To calculate the value of State 1: the sum of rewards **if the agent started in that state 1** and then followed the **policy for all the time steps.**
43 |
44 | This is equivalent to \\(V(S_{t})\\) = Immediate reward \\(R_{t+1}\\) + Discounted value of the next state \\(\gamma * V(S_{t+1})\\)
45 |
46 |
47 |
48 | For simplification, here we don’t discount so gamma = 1.
49 |
50 |
51 | In the interest of simplicity, here we don't discount, so gamma = 1.
52 | But you'll study an example with gamma = 0.99 in the Q-Learning section of this unit.
53 |
54 | - The value of \\(V(S_{t+1}) \\) = Immediate reward \\(R_{t+2}\\) + Discounted value of the next state ( \\(\gamma * V(S_{t+2})\\) ).
55 | - And so on.
56 |
57 |
58 |
59 |
60 |
61 | To recap, the idea of the Bellman equation is that instead of calculating each value as the sum of the expected return, **which is a long process**, we calculate the value as **the sum of immediate reward + the discounted value of the state that follows.**
62 |
63 | Before going to the next section, think about the role of gamma in the Bellman equation. What happens if the value of gamma is very low (e.g. 0.1 or even 0)? What happens if the value is 1? What happens if the value is very high, such as a million?
64 |
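To build intuition about gamma, here is a minimal sketch (a made-up chain of states, not the maze above) that applies the Bellman idea \\(V(S_t) = R_{t+1} + \gamma * V(S_{t+1})\\) backwards from the terminal state, for a few values of gamma:

```python
# Rewards received when moving along a made-up chain of states S0 -> S1 -> S2 -> terminal.
rewards = [1, 1, 10]                             # R_1, R_2, R_3 (hypothetical values)

def state_values(rewards, gamma):
    """V(S_t) = R_{t+1} + gamma * V(S_{t+1}), computed backwards from the terminal state."""
    values = [0.0] * (len(rewards) + 1)          # the terminal state has value 0
    for t in reversed(range(len(rewards))):
        values[t] = rewards[t] + gamma * values[t + 1]
    return values

for gamma in (0.0, 0.5, 1.0):
    print(f"gamma={gamma}: {state_values(rewards, gamma)}")
# gamma=0 only sees the immediate reward; gamma=1 adds everything up with no discounting.
```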
--------------------------------------------------------------------------------
/units/en/unit2/conclusion.mdx:
--------------------------------------------------------------------------------
1 | # Conclusion [[conclusion]]
2 |
3 | Congrats on finishing this chapter! There was a lot of information. And congrats on finishing the tutorials. You’ve just implemented your first RL agent from scratch and shared it on the Hub 🥳.
4 |
5 | Implementing from scratch when you study a new architecture **is important to understand how it works.**
6 |
7 | It's **normal if you still feel confused** by all these elements. **This was the same for me and for everyone who studies RL.**
8 |
9 | Take time to really grasp the material before continuing.
10 |
11 |
12 | In the next chapter, we’re going to dive deeper by studying our first Deep Reinforcement Learning algorithm based on Q-Learning: Deep Q-Learning. And you'll train a **DQN agent with RL-Baselines3 Zoo to play Atari Games**.
13 |
14 |
15 |
16 |
17 |
18 | Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)
19 |
20 | ### Keep Learning, stay awesome 🤗
--------------------------------------------------------------------------------
/units/en/unit2/glossary.mdx:
--------------------------------------------------------------------------------
1 | # Glossary [[glossary]]
2 |
3 | This is a community-created glossary. Contributions are welcome!
4 |
5 |
6 | ### Strategies to find the optimal policy
7 |
8 | - **Policy-based methods.** The policy is usually trained with a neural network to select what action to take given a state. In this case it is the neural network which outputs the action that the agent should take instead of using a value function. Depending on the experience received from the environment, the neural network will be re-adjusted and will provide better actions.
9 | - **Value-based methods.** In this case, a value function is trained to output the value of a state or a state-action pair that will represent our policy. However, this value doesn't define what action the agent should take. In contrast, we need to specify the behavior of the agent given the output of the value function. For example, we could decide to adopt a policy to take the action that always leads to the biggest reward (Greedy Policy). In summary, the policy is a Greedy Policy (or whatever decision the user takes) that uses the values of the value-function to decide the actions to take.
10 |
11 | ### Among the value-based methods, we can find two main strategies
12 |
13 | - **The state-value function.** For each state, the state-value function is the expected return if the agent starts in that state and follows the policy until the end.
14 | - **The action-value function.** In contrast to the state-value function, the action-value calculates for each state and action pair the expected return if the agent starts in that state, takes that action, and then follows the policy forever after.
15 |
16 | ### Epsilon-greedy strategy:
17 |
18 | - Common strategy used in reinforcement learning that involves balancing exploration and exploitation.
19 | - Chooses the action with the highest expected reward with a probability of 1-epsilon.
20 | - Chooses a random action with a probability of epsilon.
21 | - Epsilon is typically decreased over time to shift focus towards exploitation.
22 |
23 | ### Greedy strategy:
24 |
25 | - Involves always choosing the action that is expected to lead to the highest reward, based on the current knowledge of the environment. (Only exploitation)
26 | - Always chooses the action with the highest expected reward.
27 | - Does not include any exploration.
28 | - Can be disadvantageous in environments with uncertainty or unknown optimal actions.
29 |
30 | ### Off-policy vs on-policy algorithms
31 |
32 | - **Off-policy algorithms:** A different policy is used at training time and inference time
33 | - **On-policy algorithms:** The same policy is used during training and inference
34 |
35 | ### Monte Carlo and Temporal Difference learning strategies
36 |
37 | - **Monte Carlo (MC):** Learning at the end of the episode. With Monte Carlo, we wait until the episode ends and then we update the value function (or policy function) from a complete episode.
38 |
39 | - **Temporal Difference (TD):** Learning at each step. With Temporal Difference Learning, we update the value function (or policy function) at each step without requiring a complete episode.
40 |
41 | If you want to improve the course, you can [open a Pull Request.](https://github.com/huggingface/deep-rl-class/pulls)
42 |
43 | This glossary was made possible thanks to:
44 |
45 | - [Ramón Rueda](https://github.com/ramon-rd)
46 | - [Hasarindu Perera](https://github.com/hasarinduperera/)
47 | - [Arkady Arkhangorodsky](https://github.com/arkadyark/)
48 |
--------------------------------------------------------------------------------
/units/en/unit2/introduction.mdx:
--------------------------------------------------------------------------------
1 | # Introduction to Q-Learning [[introduction-q-learning]]
2 |
3 |
4 |
5 |
6 | In the first unit of this class, we learned about Reinforcement Learning (RL), the RL process, and the different methods to solve an RL problem. We also **trained our first agents and uploaded them to the Hugging Face Hub.**
7 |
8 | In this unit, we're going to **dive deeper into one of the Reinforcement Learning methods: value-based methods** and study our first RL algorithm: **Q-Learning.**
9 |
10 | We'll also **implement our first RL agent from scratch**, a Q-Learning agent, and will train it in two environments:
11 |
12 | 1. Frozen-Lake-v1 (non-slippery version): where our agent will need to **go from the starting state (S) to the goal state (G)** by walking only on frozen tiles (F) and avoiding holes (H).
13 | 2. An autonomous taxi: where our agent will need **to learn to navigate** a city to **transport its passengers from point A to point B.**
14 |
15 |
16 |
17 |
18 | Concretely, we will:
19 |
20 | - Learn about **value-based methods**.
21 | - Learn about the **differences between Monte Carlo and Temporal Difference Learning**.
22 | - Study and implement **our first RL algorithm**: Q-Learning.
23 |
24 | This unit is **fundamental if you want to be able to work on Deep Q-Learning**: the first Deep RL algorithm that played Atari games and beat the human level on some of them (Breakout, Space Invaders, etc.).
25 |
26 | So let's get started! 🚀
27 |
--------------------------------------------------------------------------------
/units/en/unit2/mid-way-quiz.mdx:
--------------------------------------------------------------------------------
1 | # Mid-way Quiz [[mid-way-quiz]]
2 |
3 | The best way to learn and [to avoid the illusion of competence](https://www.coursera.org/lecture/learning-how-to-learn/illusions-of-competence-BuFzf) **is to test yourself.** This will help you to find **where you need to reinforce your knowledge**.
4 |
5 |
6 | ### Q1: What are the two main approaches to find optimal policy?
7 |
8 |
9 |
31 |
32 |
33 | ### Q2: What is the Bellman Equation?
34 |
35 |
36 | Solution
37 |
38 | **The Bellman equation is a recursive equation** that works like this: instead of starting for each state from the beginning and calculating the return, we can consider the value of any state as:
39 |
40 | \\(R_{t+1} + \gamma * V(S_{t+1})\\)
41 |
42 | The immediate reward + the discounted value of the state that follows
43 |
44 |
45 |
46 | ### Q3: Define each part of the Bellman Equation
47 |
48 |
49 |
50 |
51 |
52 | Solution
53 |
54 |
55 |
56 |
57 |
58 | ### Q4: What is the difference between Monte Carlo and Temporal Difference learning methods?
59 |
60 |
82 |
83 | ### Q5: Define each part of Temporal Difference learning formula
84 |
85 |
86 |
87 |
88 | Solution
89 |
90 |
91 |
92 |
93 |
94 |
95 | ### Q6: Define each part of Monte Carlo learning formula
96 |
97 |
98 |
99 |
100 | Solution
101 |
102 |
103 |
104 |
105 |
106 | Congrats on finishing this Quiz 🥳. If you missed some elements, take time to read the previous sections again to reinforce (😏) your knowledge.
107 |
--------------------------------------------------------------------------------
/units/en/unit2/mid-way-recap.mdx:
--------------------------------------------------------------------------------
1 | # Mid-way Recap [[mid-way-recap]]
2 |
3 | Before diving into Q-Learning, let's summarize what we've just learned.
4 |
5 | We have two types of value-based functions:
6 |
7 | - State-value function: outputs the expected return if **the agent starts at a given state and acts according to the policy forever after.**
8 | - Action-value function: outputs the expected return if **the agent starts in a given state, takes a given action at that state** and then acts according to the policy forever after.
9 | - In value-based methods, rather than learning the policy, **we define the policy by hand** and we learn a value function. If we have an optimal value function, we **will have an optimal policy.**
10 |
11 | There are two types of methods to update the value function:
12 |
13 | - With *the Monte Carlo method*, we update the value function from a complete episode, and so we **use the actual discounted return of this episode.**
14 | - With *the TD Learning method,* we update the value function from a step, replacing the unknown \\(G_t\\) with **an estimated return called the TD target.**
15 |
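Side by side, the two updates can be sketched like this (the variable and function names are illustrative):

```python
# Monte Carlo: wait until the episode ends, then use the actual discounted return G_t.
def mc_update(V, state, G_t, lr=0.1):
    V[state] = V[state] + lr * (G_t - V[state])

# TD Learning: update after a single step, using the TD target R_{t+1} + gamma * V(S_{t+1}).
def td_update(V, state, reward, next_state, lr=0.1, gamma=0.99):
    td_target = reward + gamma * V[next_state]
    V[state] = V[state] + lr * (td_target - V[state])
```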
16 |
17 |
18 |
--------------------------------------------------------------------------------
/units/en/unit2/q-learning-example.mdx:
--------------------------------------------------------------------------------
1 | # A Q-Learning example [[q-learning-example]]
2 |
3 | To better understand Q-Learning, let's take a simple example:
4 |
5 |
6 |
7 | - You're a mouse in this tiny maze. You always **start at the same starting point.**
8 | - The goal is **to eat the big pile of cheese at the bottom right-hand corner** and avoid the poison. After all, who doesn't like cheese?
9 | - The episode ends if we eat the poison, **eat the big pile of cheese**, or if we take more than five steps.
10 | - The learning rate is 0.1
11 | - The discount rate (gamma) is 0.99
12 |
13 |
14 |
15 |
16 | The reward function goes like this:
17 |
18 | - **+0:** Going to a state with no cheese in it.
19 | - **+1:** Going to a state with a small cheese in it.
20 | - **+10:** Going to the state with the big pile of cheese.
21 | - **-10:** Going to the state with the poison and thus dying.
22 | - **+0:** If we take more than five steps.
23 |
24 |
25 |
26 | To train our agent to have an optimal policy (so a policy that goes right, right, down), **we will use the Q-Learning algorithm**.
27 |
28 | ## Step 1: Initialize the Q-table [[step1]]
29 |
30 |
31 |
32 | So, for now, **our Q-table is useless**; we need **to train our Q-function using the Q-Learning algorithm.**
33 |
34 | Let's do it for 2 training timesteps:
35 |
36 | Training timestep 1:
37 |
38 | ## Step 2: Choose an action using the Epsilon Greedy Strategy [[step2]]
39 |
40 | Because epsilon is big (= 1.0), I take a random action. In this case, I go right.
41 |
42 |
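In code, this action-selection step could look like the following minimal sketch (the Q-table is assumed to be a NumPy array of shape [n_states, n_actions]; the names are illustrative):

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(q_table, state, epsilon):
    """With probability epsilon explore (random action), otherwise exploit the Q-table."""
    if rng.random() < epsilon:
        return int(rng.integers(q_table.shape[1]))   # exploration
    return int(np.argmax(q_table[state]))            # exploitation

# At training timestep 1, epsilon = 1.0, so the chosen action is always random.
```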
43 |
44 |
45 | ## Step 3: Perform action At, get Rt+1 and St+1 [[step3]]
46 |
47 | By going right, I get a small cheese, so \\(R_{t+1} = 1\\) and I'm in a new state.
48 |
49 |
50 |
51 |
52 |
53 | ## Step 4: Update Q(St, At) [[step4]]
54 |
55 | We can now update \\(Q(S_t, A_t)\\) using our formula.
56 |
57 |
58 |
59 |
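Using the learning rate of 0.1 and the discount rate of 0.99 defined above, the update in this step can be sketched like this (the function and argument names are illustrative, with the Q-table as a NumPy array):

```python
def q_update(q_table, state, action, reward, next_state, lr=0.1, gamma=0.99):
    """Q(St, At) <- Q(St, At) + lr * [R_{t+1} + gamma * max_a Q(S_{t+1}, a) - Q(St, At)]"""
    td_target = reward + gamma * q_table[next_state].max()
    q_table[state, action] += lr * (td_target - q_table[state, action])
```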
60 | Training timestep 2:
61 |
62 | ## Step 2: Choose an action using the Epsilon Greedy Strategy [[step2-2]]
63 |
64 | **I take a random action again, since epsilon=0.99 is big**. (Notice we decay epsilon a little bit because, as the training progresses, we want less and less exploration).
65 |
66 | I took the action 'down'. **This is not a good action since it leads me to the poison.**
67 |
68 |
69 |
70 |
71 | ## Step 3: Perform action At, get Rt+1 and St+1 [[step3-3]]
72 |
73 | Because I ate poison, **I get \\(R_{t+1} = -10\\), and I die.**
74 |
75 |
76 |
77 | ## Step 4: Update Q(St, At) [[step4-4]]
78 |
79 |
80 |
81 | Because we're dead, we start a new episode. But what we see here is that, **with two exploration steps, my agent became smarter.**
82 |
83 | As we continue exploring and exploiting the environment and updating Q-values using the TD target, the **Q-table will give us a better and better approximation. At the end of the training, we'll get an estimate of the optimal Q-function.**
84 |
--------------------------------------------------------------------------------
/units/en/unit2/q-learning-recap.mdx:
--------------------------------------------------------------------------------
1 | # Q-Learning Recap [[q-learning-recap]]
2 |
3 |
4 | *Q-Learning* **is the RL algorithm that** :
5 |
6 | - Trains a *Q-function*, an **action-value function** encoded, in internal memory, by a *Q-table* **containing all the state-action pair values.**
7 |
8 | - Given a state and action, our Q-function **will search its Q-table for the corresponding value.**
9 |
10 |
11 |
12 | - When the training is done, **we have an optimal Q-function, or, equivalently, an optimal Q-table.**
13 |
14 | - And if we **have an optimal Q-function**, we have an optimal policy, since we **know, for each state, the best action to take.**
15 |
16 |
17 |
18 |
19 | But, in the beginning, our **Q-table is useless since it gives arbitrary values for each state-action pair (most of the time we initialize the Q-table to 0 values)**. But, as we explore the environment and update our Q-table it will give us a better and better approximation.
20 |
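For instance, a freshly initialized Q-table is often just an array of zeros, with one row per state and one column per action (the sizes below are made up):

```python
import numpy as np

n_states, n_actions = 16, 4                   # hypothetical environment sizes
q_table = np.zeros((n_states, n_actions))     # useless at first: every state-action value is 0
```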
21 |
22 |
23 | This is the Q-Learning pseudocode:
24 |
25 |
26 |
--------------------------------------------------------------------------------
/units/en/unit2/quiz2.mdx:
--------------------------------------------------------------------------------
1 | # Second Quiz [[quiz2]]
2 |
3 | The best way to learn and [to avoid the illusion of competence](https://www.coursera.org/lecture/learning-how-to-learn/illusions-of-competence-BuFzf) **is to test yourself.** This will help you to find **where you need to reinforce your knowledge**.
4 |
5 |
6 | ### Q1: What is Q-Learning?
7 |
8 |
9 |
30 |
31 | ### Q2: What is a Q-table?
32 |
33 |
50 |
51 | ### Q3: Why do we have an optimal policy if we have an optimal Q-function Q*?
52 |
53 |
54 | Solution
55 |
56 | Because, if we have an optimal Q-function, we have an optimal policy, since we know, for each state, the best action to take.
57 |
58 |
59 |
60 |
61 |
62 | ### Q4: Can you explain what the Epsilon-Greedy Strategy is?
63 |
64 |
65 | Solution
66 | Epsilon Greedy Strategy is a policy that handles the exploration/exploitation trade-off.
67 |
68 | The idea is that we define epsilon ɛ = 1.0:
69 |
70 | - With *probability 1 - ɛ*: we do exploitation (aka our agent selects the action with the highest state-action pair value).
71 | - With *probability ɛ* : we do exploration (trying random action).
72 |
73 |
74 |
75 |
76 |
77 |
78 | ### Q5: How do we update the Q value of a state, action pair?
79 |
80 |
81 |
82 | Solution
83 |
84 |
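For reference, the Q-Learning update rule can be written as:

\\( Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right] \\)

where \\(\alpha\\) is the learning rate and \\(\gamma\\) the discount factor.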
85 |
86 |
87 |
88 |
89 | ### Q6: What's the difference between on-policy and off-policy?
90 |
91 |
92 | Solution
93 |
94 |
95 |
96 | Congrats on finishing this Quiz 🥳! If you missed some elements, take time to read the chapter again to reinforce (😏) your knowledge.
97 |
--------------------------------------------------------------------------------
/units/en/unit2/what-is-rl.mdx:
--------------------------------------------------------------------------------
1 | # What is RL? A short recap [[what-is-rl]]
2 |
3 | In RL, we build an agent that can **make smart decisions**. For instance, an agent that **learns to play a video game.** Or a trading agent that **learns to maximize its benefits** by deciding on **what stocks to buy and when to sell.**
4 |
5 |
6 |
7 |
8 | To make intelligent decisions, our agent will learn from the environment by **interacting with it through trial and error** and receiving rewards (positive or negative) **as unique feedback.**
9 |
10 | Its goal **is to maximize its expected cumulative reward** (because of the reward hypothesis).
11 |
12 | **The agent's decision-making process is called the policy π:** given a state, a policy will output an action or a probability distribution over actions. That is, given an observation of the environment, a policy will provide an action (or a probability for each action) that the agent should take.
13 |
14 |
15 |
16 | **Our goal is to find an optimal policy π\***, i.e., a policy that leads to the best expected cumulative reward.
17 |
18 | And to find this optimal policy (hence solving the RL problem), there **are two main types of RL methods**:
19 |
20 | - *Policy-based methods*: **Train the policy directly** to learn which action to take given a state.
21 | - *Value-based methods*: **Train a value function** to learn **which state is more valuable** and use this value function **to take the action that leads to it.**
22 |
23 |
24 |
25 | And in this unit, **we'll dive deeper into the value-based methods.**
26 |
--------------------------------------------------------------------------------
/units/en/unit3/additional-readings.mdx:
--------------------------------------------------------------------------------
1 | # Additional Readings [[additional-readings]]
2 |
3 | These are **optional readings** if you want to go deeper.
4 |
5 | - [Foundations of Deep RL Series, L2 Deep Q-Learning by Pieter Abbeel](https://youtu.be/Psrhxy88zww)
6 | - [Playing Atari with Deep Reinforcement Learning](https://arxiv.org/abs/1312.5602)
7 | - [Double Deep Q-Learning](https://papers.nips.cc/paper/2010/hash/091d584fced301b442654dd8c23b3fc9-Abstract.html)
8 | - [Prioritized Experience Replay](https://arxiv.org/abs/1511.05952)
9 | - [Dueling Deep Q-Learning](https://arxiv.org/abs/1511.06581)
10 |
--------------------------------------------------------------------------------
/units/en/unit3/conclusion.mdx:
--------------------------------------------------------------------------------
1 | # Conclusion [[conclusion]]
2 |
3 | Congrats on finishing this chapter! There was a lot of information. And congrats on finishing the tutorial. You’ve just trained your first Deep Q-Learning agent and shared it on the Hub 🥳.
4 |
5 | Take time to really grasp the material before continuing.
6 |
7 | Don't hesitate to train your agent in other environments (Pong, Seaquest, QBert, Ms Pac Man). The **best way to learn is to try things on your own!**
8 |
9 |
10 |
11 |
12 | In the next unit, **we're going to learn about Optuna**. One of the most critical tasks in Deep Reinforcement Learning is to find a good set of training hyperparameters. Optuna is a library that helps you to automate the search.
13 |
14 | Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)
15 |
16 | ### Keep Learning, stay awesome 🤗
17 |
18 |
--------------------------------------------------------------------------------
/units/en/unit3/deep-q-network.mdx:
--------------------------------------------------------------------------------
1 | # The Deep Q-Network (DQN) [[deep-q-network]]
2 | This is the architecture of our Deep Q-Learning network:
3 |
4 |
5 |
6 | As input, we take a **stack of 4 frames** passed through the network as a state, and the network outputs a **vector of Q-values, one for each possible action at that state**. Then, like with Q-Learning, we just need to use our epsilon-greedy policy to select which action to take.
7 |
8 | When the Neural Network is initialized, **the Q-value estimation is terrible**. But during training, our Deep Q-Network agent will associate a situation with the appropriate action and **learn to play the game well**.
9 |
10 | ## Preprocessing the input and temporal limitation [[preprocessing]]
11 |
12 | We need to **preprocess the input**. It’s an essential step since we want to **reduce the complexity of our state to reduce the computation time needed for training**.
13 |
14 | To achieve this, we **reduce the state space to 84x84 and grayscale it**. We can do this since the colors in Atari environments don't add important information.
15 | This is a big improvement since we **reduce our three color channels (RGB) to 1**.
16 |
17 | We can also **crop a part of the screen in some games** if it does not contain important information.
18 | Then we stack four frames together.
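If you want to see what this looks like in code, here is a minimal sketch of the preprocessing pipeline (grayscale, resize to 84x84, stack the last four frames). It uses Pillow and a `deque` for illustration; the class and function names are not from the course codebase:

```python
from collections import deque

import numpy as np
from PIL import Image  # assumption: Pillow is available for grayscaling/resizing

def preprocess_frame(frame: np.ndarray) -> np.ndarray:
    """Grayscale an RGB Atari frame and resize it to 84x84."""
    image = Image.fromarray(frame).convert("L").resize((84, 84))
    return np.asarray(image, dtype=np.uint8)

class FrameStacker:
    """Keeps the last `num_frames` preprocessed frames as the state."""

    def __init__(self, num_frames: int = 4):
        self.frames = deque(maxlen=num_frames)

    def reset(self, first_frame: np.ndarray) -> np.ndarray:
        processed = preprocess_frame(first_frame)
        for _ in range(self.frames.maxlen):
            self.frames.append(processed)      # fill the stack with the first frame
        return np.stack(list(self.frames))     # shape: (4, 84, 84)

    def step(self, new_frame: np.ndarray) -> np.ndarray:
        self.frames.append(preprocess_frame(new_frame))
        return np.stack(list(self.frames))
```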
19 |
20 |
21 |
22 | **Why do we stack four frames together?**
23 | We stack frames together because it helps us **handle the problem of temporal limitation**. Let’s take an example with the game of Pong. When you see this frame:
24 |
25 |
26 |
27 | Can you tell me where the ball is going?
28 | No, because one frame is not enough to have a sense of motion! But what if I add three more frames? **Here you can see that the ball is going to the right**.
29 |
30 |
31 | That’s why, to capture temporal information, we stack four frames together.
32 |
33 | Then the stacked frames are processed by three convolutional layers. These layers **allow us to capture and exploit spatial relationships in images**. But also, because the frames are stacked together, **we can exploit some temporal properties across those frames**.
34 |
35 | If you don't know what convolutional layers are, don't worry. You can check out [Lesson 4 of this free Deep Learning Course by Udacity](https://www.udacity.com/course/deep-learning-pytorch--ud188)
36 |
37 | Finally, we have a couple of fully connected layers that output a Q-value for each possible action at that state.
38 |
39 |
40 |
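Here is a minimal PyTorch sketch of such a network. The layer sizes follow the classic DQN architecture for 84x84 inputs with 4 stacked frames; treat them as an illustration rather than the exact network used in the hands-on:

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, n_actions: int):
        super().__init__()
        # Three convolutional layers over the stacked 84x84 grayscale frames
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        # Fully connected layers output one Q-value per possible action
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: stacked frames, shape (batch, 4, 84, 84), pixel values in [0, 255]
        return self.head(self.conv(x / 255.0))
```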
41 | So, we see that Deep Q-Learning uses a neural network to approximate, given a state, the different Q-values for each possible action at that state. Now let's study the Deep Q-Learning algorithm.
42 |
--------------------------------------------------------------------------------
/units/en/unit3/from-q-to-dqn.mdx:
--------------------------------------------------------------------------------
1 | # From Q-Learning to Deep Q-Learning [[from-q-to-dqn]]
2 |
3 | We learned that **Q-Learning is an algorithm we use to train our Q-Function**, an **action-value function** that determines the value of being at a particular state and taking a specific action at that state.
4 |
5 |
6 |
7 |
8 |
9 | The **Q comes from "the Quality" of that action at that state.**
10 |
11 | Internally, our Q-function is encoded by **a Q-table, a table where each cell corresponds to a state-action pair value.** Think of this Q-table as **the memory or cheat sheet of our Q-function.**
12 |
13 | The problem is that Q-Learning is a *tabular method*. This becomes a problem if the state and action spaces **are not small enough to be represented efficiently by arrays and tables**. In other words: it is **not scalable**.
14 | Q-Learning worked well with small state space environments like:
15 |
16 | - FrozenLake, where we had 16 states.
17 | - Taxi-v3, where we had 500 states.
18 |
19 | But think of what we're going to do today: we will train an agent to learn to play Space Invaders, a more complex game, using the frames as input.
20 |
21 | As **[Nikita Melkozerov mentioned](https://twitter.com/meln1k), Atari environments** have an observation space with a shape of (210, 160, 3)*, containing values ranging from 0 to 255 so that gives us \\(256^{210 \times 160 \times 3} = 256^{100800}\\) possible observations (for comparison, we have approximately \\(10^{80}\\) atoms in the observable universe).
22 |
23 | * A single frame in Atari is composed of an image of 210x160 pixels. Given that the images are in color (RGB), there are 3 channels. This is why the shape is (210, 160, 3). For each pixel, the value can go from 0 to 255.
24 |
25 |
26 |
27 | Therefore, the state space is gigantic; due to this, creating and updating a Q-table for that environment would not be efficient. In this case, the best idea is to approximate the Q-values using a parametrized Q-function \\(Q_{\theta}(s,a)\\) .
28 |
29 | This neural network will approximate, given a state, the different Q-values for each possible action at that state. And that's exactly what Deep Q-Learning does.
30 |
31 |
32 |
33 |
34 | Now that we understand Deep Q-Learning, let's dive deeper into the Deep Q-Network.
35 |
--------------------------------------------------------------------------------
/units/en/unit3/glossary.mdx:
--------------------------------------------------------------------------------
1 | # Glossary
2 |
3 | This is a community-created glossary. Contributions are welcomed!
4 |
5 | - **Tabular Method:** Type of problem in which the state and action spaces are small enough for the value functions to be represented as arrays and tables.
6 | **Q-learning** is an example of a tabular method, since a table is used to represent the values for the different state-action pairs.
7 |
8 | - **Deep Q-Learning:** Method that trains a neural network to approximate, given a state, the different **Q-values** for each possible action at that state.
9 | It is used to solve problems where the observation space is too big to apply a tabular Q-Learning approach.
10 |
11 | - **Temporal Limitation** is a difficulty presented when the environment state is represented by frames. A frame by itself does not provide temporal information.
12 | In order to obtain temporal information, we need to **stack** a number of frames together.
13 |
14 | - **Phases of Deep Q-Learning:**
15 | - **Sampling:** Actions are performed, and observed experience tuples are stored in a **replay memory**.
16 | - **Training:** Batches of tuples are selected randomly and the neural network updates its weights using gradient descent.
17 |
18 | - **Solutions to stabilize Deep Q-Learning:**
19 |     - **Experience Replay:** A replay memory is created to save experience samples that can be reused during training.
20 | This allows the agent to learn from the same experiences multiple times. Also, it helps the agent avoid forgetting previous experiences as it gets new ones.
21 |     - **Random sampling** from the replay buffer removes correlation in the observation sequences and prevents action values from oscillating or diverging catastrophically.
23 |
24 | - **Fixed Q-Target:** In order to calculate the **Q-Target** we need to estimate the discounted optimal **Q-value** of the next state by using the Bellman equation. The problem
25 | is that the same network weights are used to calculate the **Q-Target** and the **Q-value**. This means that every time we modify the **Q-value**, the **Q-Target** also moves with it.
26 | To avoid this issue, a separate network with fixed parameters is used for estimating the Temporal Difference Target. The target network is updated by copying parameters from
27 | our Deep Q-Network every **C steps**.
28 |
29 | - **Double DQN:** Method to handle **overestimation** of **Q-Values**. This solution uses two networks to decouple the action selection from the target **Value generation**:
30 | - **DQN Network** to select the best action to take for the next state (the action with the highest **Q-Value**)
31 | - **Target Network** to calculate the target **Q-Value** of taking that action at the next state.
32 | This approach reduces the overestimation of **Q-Values**, helps to train faster, and leads to more stable learning (a sketch of this target computation is shown below).
33 |
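A minimal sketch of the Double DQN target computation in PyTorch (assuming `online_net` and `target_net` map states to Q-values, and that `rewards` and `dones` are float tensors from a sampled replay-buffer batch):

```python
import torch

def double_dqn_target(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Compute the Double DQN TD target for a batch of transitions."""
    with torch.no_grad():
        # The online network *selects* the best next action...
        next_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        # ...and the target network *evaluates* that action.
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        # TD target: reward + discounted next Q-value (zeroed when the episode ends)
        return rewards + gamma * next_q * (1.0 - dones)
```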
34 | If you want to improve the course, you can [open a Pull Request.](https://github.com/huggingface/deep-rl-class/pulls)
35 |
36 | This glossary was made possible thanks to:
37 |
38 | - [Dario Paez](https://github.com/dario248)
39 |
--------------------------------------------------------------------------------
/units/en/unit3/introduction.mdx:
--------------------------------------------------------------------------------
1 | # Deep Q-Learning [[deep-q-learning]]
2 |
3 |
4 |
5 |
6 |
7 | In the last unit, we learned our first reinforcement learning algorithm: Q-Learning, **implemented it from scratch**, and trained it in two environments, FrozenLake-v1 ☃️ and Taxi-v3 🚕.
8 |
9 | We got excellent results with this simple algorithm, but these environments were relatively simple because the **state space was discrete and small** (16 different states for FrozenLake-v1 and 500 for Taxi-v3). For comparison, the state space in Atari games can **contain \\(10^{9}\\) to \\(10^{11}\\) states**.
10 |
11 | But as we'll see, producing and updating a **Q-table can become ineffective in large state space environments.**
12 |
13 | So in this unit, **we'll study our first Deep Reinforcement Learning agent**: Deep Q-Learning. Instead of using a Q-table, Deep Q-Learning uses a Neural Network that takes a state and approximates Q-values for each action based on that state.
14 |
15 | And **we'll train it to play Space Invaders and other Atari environments using [RL-Zoo](https://github.com/DLR-RM/rl-baselines3-zoo)**, a training framework for RL using Stable-Baselines3 that provides scripts for training, evaluating agents, tuning hyperparameters, plotting results, and recording videos.
16 |
17 |
18 |
19 | So let’s get started! 🚀
20 |
--------------------------------------------------------------------------------
/units/en/unit3/quiz.mdx:
--------------------------------------------------------------------------------
1 | # Quiz [[quiz]]
2 |
3 | The best way to learn and [to avoid the illusion of competence](https://www.coursera.org/lecture/learning-how-to-learn/illusions-of-competence-BuFzf) **is to test yourself.** This will help you to find **where you need to reinforce your knowledge**.
4 |
5 | ### Q1: We mentioned Q Learning is a tabular method. What are tabular methods?
6 |
7 |
8 | Solution
9 |
10 | *Tabular methods* are a type of problem in which the state and action spaces are small enough for the value functions to be **represented as arrays and tables**. For instance, **Q-Learning is a tabular method** since we use a table to represent the state-action value pairs.
11 |
12 |
13 |
14 |
15 | ### Q2: Why can't we use a classical Q-Learning to solve an Atari Game?
16 |
17 |
30 |
31 |
32 | ### Q3: Why do we stack four frames together when we use frames as input in Deep Q-Learning?
33 |
34 |
35 | Solution
36 |
37 | We stack frames together because it helps us **handle the problem of temporal limitation**: one frame is not enough to capture temporal information.
38 | For instance, in Pong, our agent **will be unable to know the ball's direction if it gets only one frame**.
39 |
40 |
41 |
42 |
43 |
44 |
45 |
46 |
47 | ### Q4: What are the two phases of Deep Q-Learning?
48 |
49 |
71 |
72 | ### Q5: Why do we create a replay memory in Deep Q-Learning?
73 |
74 |
75 | Solution
76 |
77 | **1. Make more efficient use of the experiences during the training**
78 |
79 | Usually, in online reinforcement learning, the agent interacts with the environment, gets experiences (state, action, reward, and next state), learns from them (updates the neural network), and discards them. This is not efficient.
80 | But, with experience replay, **we create a replay buffer that saves experience samples that we can reuse during the training**.
81 |
82 | **2. Avoid forgetting previous experiences and reduce the correlation between experiences**
83 |
84 | The problem we get if we give sequential samples of experiences to our neural network is that it **tends to forget the previous experiences as new experiences overwrite them**. For instance, if we are in the first level and then the second, which is different, our agent can forget how to behave and play in the first level.
85 |
86 |
87 |
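A minimal sketch of such a replay buffer (illustrative, not the implementation used in the hands-on):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (state, action, reward, next_state, done) tuples for reuse."""

    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)   # oldest experiences are dropped first

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Random sampling breaks the correlation between consecutive experiences
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```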
88 |
89 | ### Q6: How do we use Double Deep Q-Learning?
90 |
91 |
92 |
93 | Solution
94 |
95 | When we compute the Q target, we use two networks to decouple the action selection from the target Q value generation. We:
96 |
97 | - Use our *DQN network* to **select the best action to take for the next state** (the action with the highest Q value).
98 |
99 | - Use our *Target network* to calculate **the target Q value of taking that action at the next state**.
100 |
101 |
102 |
103 |
104 | Congrats on finishing this Quiz 🥳! If you missed some elements, take time to read the chapter again to reinforce (😏) your knowledge.
105 |
--------------------------------------------------------------------------------
/units/en/unit4/additional-readings.mdx:
--------------------------------------------------------------------------------
1 | # Additional Readings
2 |
3 | These are **optional readings** if you want to go deeper.
4 |
5 |
6 | ## Introduction to Policy Optimization
7 |
8 | - [Part 3: Intro to Policy Optimization - Spinning Up documentation](https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html)
9 |
10 |
11 | ## Policy Gradient
12 |
13 | - [https://johnwlambert.github.io/policy-gradients/](https://johnwlambert.github.io/policy-gradients/)
14 | - [RL - Policy Gradient Explained](https://jonathan-hui.medium.com/rl-policy-gradients-explained-9b13b688b146)
15 | - [Chapter 13, Policy Gradient Methods; Reinforcement Learning, an introduction by Richard Sutton and Andrew G. Barto](http://incompleteideas.net/book/RLbook2020.pdf)
16 |
17 | ## Implementation
18 |
19 | - [PyTorch Reinforce implementation](https://github.com/pytorch/examples/blob/main/reinforcement_learning/reinforce.py)
20 | - [Implementations from DDPG to PPO](https://github.com/MrSyee/pg-is-all-you-need)
21 |
--------------------------------------------------------------------------------
/units/en/unit4/advantages-disadvantages.mdx:
--------------------------------------------------------------------------------
1 | # The advantages and disadvantages of policy-gradient methods
2 |
3 | At this point, you might ask, "but Deep Q-Learning is excellent! Why use policy-gradient methods?". To answer this question, let's study the **advantages and disadvantages of policy-gradient methods**.
4 |
5 | ## Advantages
6 |
7 | There are multiple advantages over value-based methods. Let's see some of them:
8 |
9 | ### The simplicity of integration
10 |
11 | We can estimate the policy directly without storing additional data (action values).
12 |
13 | ### Policy-gradient methods can learn a stochastic policy
14 |
15 | Policy-gradient methods can **learn a stochastic policy while value functions can't**.
16 |
17 | This has two consequences:
18 |
19 | 1. We **don't need to implement an exploration/exploitation trade-off by hand**. Since we output a probability distribution over actions, the agent explores **the state space without always taking the same trajectory.**
20 |
21 | 2. We also get rid of the problem of **perceptual aliasing**. Perceptual aliasing is when two states seem (or are) the same but need different actions.
22 |
23 | Let's take an example: we have an intelligent vacuum cleaner whose goal is to suck the dust and avoid killing the hamsters.
24 |
25 |
26 |
27 |
28 |
29 | Our vacuum cleaner can only perceive where the walls are.
30 |
31 | The problem is that the **two red (colored) states are aliased states because the agent perceives an upper and lower wall for each**.
32 |
33 |
34 |
35 |
36 |
37 | Under a deterministic policy, the policy will either always move right when in a red state or always move left. **Either case will cause our agent to get stuck and never suck the dust**.
38 |
39 | Under a value-based Reinforcement Learning algorithm, we learn a **quasi-deterministic policy** (an epsilon-greedy strategy). Consequently, our agent can **spend a lot of time before finding the dust**.
40 |
41 | On the other hand, an optimal stochastic policy **will randomly move left or right in red (colored) states**. Consequently, **it will not be stuck and will reach the goal state with a high probability**.
42 |
43 |
44 |
45 |
46 |
47 | ### Policy-gradient methods are more effective in high-dimensional action spaces and continuous actions spaces
48 |
49 | The problem with Deep Q-Learning is that its **predictions assign a score (maximum expected future reward) to each possible action**, at each time step, given the current state.
50 |
51 | But what if we have an infinite number of possible actions?
52 |
53 | For instance, with a self-driving car, at each state, you can have a (near) infinite choice of actions (turning the wheel at 15°, 17.2°, 19.4°, honking, etc.). **We'll need to output a Q-value for each possible action**! And **taking the max action of a continuous output is an optimization problem itself**!
54 |
55 | Instead, with policy-gradient methods, we output a **probability distribution over actions.**
56 |
57 | ### Policy-gradient methods have better convergence properties
58 |
59 | In value-based methods, we use an aggressive operator to **change the value function: we take the maximum over Q-estimates**.
60 | Consequently, the action probabilities may change dramatically for an arbitrarily small change in the estimated action values if that change results in a different action having the maximal value.
61 |
62 | For instance, if during the training, the best action was left (with a Q-value of 0.22) and the training step after it's right (since the right Q-value becomes 0.23), we dramatically changed the policy since now the policy will take most of the time right instead of left.
63 |
64 | On the other hand, in policy-gradient methods, stochastic policy action preferences (probability of taking action) **change smoothly over time**.
65 |
66 | ## Disadvantages
67 |
68 | Naturally, policy-gradient methods also have some disadvantages:
69 |
70 | - **Frequently, policy-gradient methods converge to a local maximum instead of the global optimum.**
71 | - Policy-gradient methods go slower, **step by step: they can take longer to train (inefficient).**
72 | - Policy-gradient can have high variance. We'll see in the actor-critic unit why, and how we can solve this problem.
73 |
74 | 👉 If you want to go deeper into the advantages and disadvantages of policy-gradient methods, [you can check this video](https://youtu.be/y3oqOjHilio).
75 |
--------------------------------------------------------------------------------
/units/en/unit4/conclusion.mdx:
--------------------------------------------------------------------------------
1 | # Conclusion
2 |
3 |
4 | **Congrats on finishing this unit**! There was a lot of information.
5 | And congrats on finishing the tutorial. You've just coded your first Deep Reinforcement Learning agent from scratch using PyTorch and shared it on the Hub 🥳.
6 |
7 | Don't hesitate to iterate on this unit **by improving the implementation for more complex environments** (for instance, what about changing the network to a Convolutional Neural Network to handle
8 | frames as observation)?
9 |
10 | In the next unit, **we're going to learn more about Unity MLAgents**, by training agents in Unity environments. This way, you will be ready to participate in the **AI vs AI challenges where you'll train your agents
11 | to compete against other agents in a snowball fight and a soccer game.**
12 |
13 | Sound fun? See you next time!
14 |
15 | Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)
16 |
17 | ### Keep Learning, stay awesome 🤗
18 |
--------------------------------------------------------------------------------
/units/en/unit4/glossary.mdx:
--------------------------------------------------------------------------------
1 | # Glossary
2 |
3 | This is a community-created glossary. Contributions are welcome!
4 |
5 | - **Deep Q-Learning:** A value-based deep reinforcement learning algorithm that uses a deep neural network to approximate Q-values for actions in a given state. The goal of Deep Q-learning is to find the optimal policy that maximizes the expected cumulative reward by learning the action-values.
6 |
7 | - **Value-based methods:** Reinforcement Learning methods that estimate a value function as an intermediate step towards finding an optimal policy.
8 |
9 | - **Policy-based methods:** Reinforcement Learning methods that directly learn to approximate the optimal policy without learning a value function. In practice they output a probability distribution over actions.
10 |
11 | The benefits of using policy-gradient methods over value-based methods include:
12 | - simplicity of integration: no need to store action values;
13 | - ability to learn a stochastic policy: the agent explores the state space without always taking the same trajectory, and avoids the problem of perceptual aliasing;
14 | - effectiveness in high-dimensional and continuous action spaces; and
15 | - improved convergence properties.
16 |
17 | - **Policy Gradient:** A subset of policy-based methods where the objective is to maximize the performance of a parameterized policy using gradient ascent. The goal of a policy-gradient is to control the probability distribution of actions by tuning the policy such that good actions (that maximize the return) are sampled more frequently in the future.
18 |
19 | - **Monte Carlo Reinforce:** A policy-gradient algorithm that uses an estimated return from an entire episode to update the policy parameter.
20 |
21 | If you want to improve the course, you can [open a Pull Request.](https://github.com/huggingface/deep-rl-class/pulls)
22 |
23 | This glossary was made possible thanks to:
24 |
25 | - [Diego Carpintero](https://github.com/dcarpintero)
--------------------------------------------------------------------------------
/units/en/unit4/introduction.mdx:
--------------------------------------------------------------------------------
1 | # Introduction [[introduction]]
2 |
3 |
4 |
5 | In the last unit, we learned about Deep Q-Learning. In this value-based deep reinforcement learning algorithm, we **used a deep neural network to approximate the different Q-values for each possible action at a state.**
6 |
7 | Since the beginning of the course, we have only studied value-based methods, **where we estimate a value function as an intermediate step towards finding an optimal policy.**
8 |
9 |
10 |
11 | In value-based methods, the policy **\\(\pi\\) only exists because of the action value estimates, since the policy is just a function** (for instance, a greedy policy) that will select the action with the highest value given a state.
12 |
13 | With policy-based methods, we want to optimize the policy directly **without having an intermediate step of learning a value function.**
14 |
15 | So today, **we'll learn about policy-based methods and study a subset of these methods called policy gradient**. Then we'll implement our first policy gradient algorithm called Monte Carlo **Reinforce** from scratch using PyTorch.
16 | Then, we'll test its robustness using the CartPole-v1 and PixelCopter environments.
17 |
18 | You'll then be able to iterate and improve this implementation for more advanced environments.
19 |
20 |
21 |
22 |
23 |
24 | Let's get started!
25 |
--------------------------------------------------------------------------------
/units/en/unit4/pg-theorem.mdx:
--------------------------------------------------------------------------------
1 | # (Optional) the Policy Gradient Theorem
2 |
3 | In this optional section, we're **going to study how we differentiate the objective function that we will use to approximate the policy gradient**.
4 |
5 | Let's first recap our different formulas:
6 |
7 | 1. The Objective function
8 |
9 |
10 |
11 |
12 | 2. The probability of a trajectory (given that action comes from \\(\pi_\theta\\)):
13 |
14 |
15 |
16 |
17 | So we have:
18 |
19 | \\(\nabla_\theta J(\theta) = \nabla_\theta \sum_{\tau}P(\tau;\theta)R(\tau)\\)
20 |
21 |
22 | We can rewrite the gradient of the sum as the sum of the gradient:
23 |
24 | \\( = \sum_{\tau} \nabla_\theta (P(\tau;\theta)R(\tau)) = \sum_{\tau} \nabla_\theta P(\tau;\theta)R(\tau) \\) as \\(R(\tau)\\) is not dependent on \\(\theta\\)
25 |
26 | We then multiply every term in the sum by \\(\frac{P(\tau;\theta)}{P(\tau;\theta)}\\) (which is possible since it's equal to 1):
27 |
28 | \\( = \sum_{\tau} \frac{P(\tau;\theta)}{P(\tau;\theta)}\nabla_\theta P(\tau;\theta)R(\tau) \\)
29 |
30 |
31 | We can simplify this further since \\( \frac{P(\tau;\theta)}{P(\tau;\theta)}\nabla_\theta P(\tau;\theta) = P(\tau;\theta)\frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)} \\).
32 |
33 | Thus we can rewrite the sum as
34 |
35 | \\( \nabla_\theta J(\theta) = \sum_{\tau} P(\tau;\theta) \frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)}R(\tau) \\)
36 |
37 | We can then use the *derivative log trick* (also called *likelihood ratio trick* or *REINFORCE trick*), a simple rule in calculus that implies that \\( \nabla_x log f(x) = \frac{\nabla_x f(x)}{f(x)} \\)
38 |
39 | So given we have \\(\frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)} \\) we transform it as \\(\nabla_\theta log P(\tau|\theta) \\)
40 |
41 |
42 |
43 | So this is our likelihood policy gradient:
44 |
45 | \\( \nabla_\theta J(\theta) = \sum_{\tau} P(\tau;\theta) \nabla_\theta log P(\tau;\theta) R(\tau) \\)
46 |
47 |
48 |
49 |
50 |
51 | Thanks to this new formula, we can estimate the gradient using trajectory samples (we can approximate the likelihood ratio policy gradient with a sample-based estimate if you prefer).
52 |
53 | \\(\nabla_\theta J(\theta) = \frac{1}{m} \sum^{m}_{i=1} \nabla_\theta log P(\tau^{(i)};\theta)R(\tau^{(i)})\\) where each \\(\tau^{(i)}\\) is a sampled trajectory.
54 |
55 |
56 | But we still have some mathematical work to do: we need to simplify \\( \nabla_\theta log P(\tau|\theta) \\)
57 |
58 | We know that:
59 |
60 | \\(\nabla_\theta log P(\tau^{(i)};\theta)= \nabla_\theta log[ \mu(s_0) \prod_{t=0}^{H} P(s_{t+1}^{(i)}|s_{t}^{(i)}, a_{t}^{(i)}) \pi_\theta(a_{t}^{(i)}|s_{t}^{(i)})]\\)
61 |
62 | Where \\(\mu(s_0)\\) is the initial state distribution and \\( P(s_{t+1}^{(i)}|s_{t}^{(i)}, a_{t}^{(i)}) \\) is the state transition dynamics of the MDP.
63 |
64 | We know that the log of a product is equal to the sum of the logs:
65 |
66 | \\(\nabla_\theta log P(\tau^{(i)};\theta)= \nabla_\theta \left[log \mu(s_0) + \sum\limits_{t=0}^{H}log P(s_{t+1}^{(i)}|s_{t}^{(i)} a_{t}^{(i)}) + \sum\limits_{t=0}^{H}log \pi_\theta(a_{t}^{(i)}|s_{t}^{(i)})\right] \\)
67 |
68 | We also know that the gradient of the sum is equal to the sum of gradient:
69 |
70 | \\( \nabla_\theta log P(\tau^{(i)};\theta)=\nabla_\theta log\mu(s_0) + \nabla_\theta \sum\limits_{t=0}^{H} log P(s_{t+1}^{(i)}|s_{t}^{(i)} a_{t}^{(i)}) + \nabla_\theta \sum\limits_{t=0}^{H} log \pi_\theta(a_{t}^{(i)}|s_{t}^{(i)}) \\)
71 |
72 |
73 | Since neither the initial state distribution nor the state transition dynamics of the MDP depend on \\(\theta\\), the derivative of both terms is 0. So we can remove them:
74 |
75 | Since:
76 | \\(\nabla_\theta \sum_{t=0}^{H} log P(s_{t+1}^{(i)}|s_{t}^{(i)} a_{t}^{(i)}) = 0 \\) and \\( \nabla_\theta log \mu(s_0) = 0\\)
77 |
78 | \\(\nabla_\theta log P(\tau^{(i)};\theta) = \nabla_\theta \sum_{t=0}^{H} log \pi_\theta(a_{t}^{(i)}|s_{t}^{(i)})\\)
79 |
80 | We can rewrite the gradient of the sum as the sum of gradients:
81 |
82 | \\( \nabla_\theta log P(\tau^{(i)};\theta)= \sum_{t=0}^{H} \nabla_\theta log \pi_\theta(a_{t}^{(i)}|s_{t}^{(i)}) \\)
83 |
84 | So, the final formula for estimating the policy gradient is:
85 |
86 | \\( \nabla_{\theta} J(\theta) = \hat{g} = \frac{1}{m} \sum^{m}_{i=1} \sum^{H}_{t=0} \nabla_\theta \log \pi_\theta(a^{(i)}_{t} | s_{t}^{(i)})R(\tau^{(i)}) \\)
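In code, this sample-based estimate usually shows up as a loss whose gradient (with a minus sign, since optimizers minimize) matches \\(\hat{g}\\). A minimal PyTorch sketch, assuming `log_probs` holds \\(log \pi_\theta(a_t|s_t)\\) for one sampled trajectory and `returns` holds \\(R(\tau)\\) per step (function names are illustrative):

```python
import torch

def reinforce_loss(log_probs: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
    """Policy-gradient surrogate loss for one sampled trajectory.

    Minimizing this loss performs gradient *ascent* on J(theta).
    """
    return -(log_probs * returns).sum()
```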
87 |
--------------------------------------------------------------------------------
/units/en/unit4/quiz.mdx:
--------------------------------------------------------------------------------
1 | # Quiz
2 |
3 | The best way to learn and [to avoid the illusion of competence](https://www.coursera.org/lecture/learning-how-to-learn/illusions-of-competence-BuFzf) **is to test yourself.** This will help you to find **where you need to reinforce your knowledge**.
4 |
5 |
6 | ### Q1: What are the advantages of policy-gradient over value-based methods? (Check all that apply)
7 |
8 |
26 |
27 | ### Q2: What is the Policy Gradient Theorem?
28 |
29 |
30 | Solution
31 |
32 | *The Policy Gradient Theorem* is a formula that will help us to reformulate the objective function into a differentiable function that does not involve the differentiation of the state distribution.
33 |
34 |
35 |
36 |
37 |
38 |
39 | ### Q3: What's the difference between policy-based methods and policy-gradient methods? (Check all that apply)
40 |
41 |
64 |
65 |
66 | ### Q4: Why do we use gradient ascent instead of gradient descent to optimize J(θ)?
67 |
68 |
81 |
82 | Congrats on finishing this Quiz 🥳! If you missed some elements, take time to read the chapter again to reinforce (😏) your knowledge.
83 |
--------------------------------------------------------------------------------
/units/en/unit4/what-are-policy-based-methods.mdx:
--------------------------------------------------------------------------------
1 | # What are the policy-based methods?
2 |
3 | The main goal of Reinforcement learning is to **find the optimal policy \\(\pi^{*}\\) that will maximize the expected cumulative reward**.
4 | Because Reinforcement Learning is based on the *reward hypothesis*: **all goals can be described as the maximization of the expected cumulative reward.**
5 |
6 | For instance, in a soccer game (where you're going to train the agents in two units), the goal is to win the game. We can describe this goal in reinforcement learning as
7 | **maximizing the number of goals scored** in your opponent's goal (when the ball crosses the goal line), and **minimizing the number of goals scored in your own goal**.
8 |
9 |
10 |
11 | ## Value-based, Policy-based, and Actor-critic methods
12 |
13 | In the first unit, we saw two methods to find (or, most of the time, approximate) this optimal policy \\(\pi^{*}\\).
14 |
15 | - In *value-based methods*, we learn a value function.
16 | - The idea is that an optimal value function leads to an optimal policy \\(\pi^{*}\\).
17 | - Our objective is to **minimize the loss between the predicted and target value** to approximate the true action-value function.
18 | - We have a policy, but it's implicit since it **is generated directly from the value function**. For instance, in Q-Learning, we used an (epsilon-)greedy policy.
19 |
20 | - On the other hand, in *policy-based methods*, we directly learn to approximate \\(\pi^{*}\\) without having to learn a value function.
21 | - The idea is **to parameterize the policy**. For instance, using a neural network \\(\pi_\theta\\), this policy will output a probability distribution over actions (stochastic policy).
23 | - Our objective then is **to maximize the performance of the parameterized policy using gradient ascent**.
24 | - To do that, we control the parameter \\(\theta\\) that will affect the distribution of actions over a state.
25 |
26 |
27 |
28 | - Next time, we'll study the *actor-critic* method, which is a combination of value-based and policy-based methods.
29 |
30 | Consequently, thanks to policy-based methods, we can directly optimize our policy \\(\pi_\theta\\) to output a probability distribution over actions \\(\pi_\theta(a|s)\\) that leads to the best cumulative return.
31 | To do that, we define an objective function \\(J(\theta)\\), that is, the expected cumulative reward, and we **want to find the value \\(\theta\\) that maximizes this objective function**.
32 |
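Concretely, gradient ascent repeatedly nudges the parameters in the direction of the gradient of this objective, \\( \theta \leftarrow \theta + \alpha \nabla_\theta J(\theta) \\), where \\(\alpha\\) is the learning rate.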
33 | ## The difference between policy-based and policy-gradient methods
34 |
35 | Policy-gradient methods, which we're going to study in this unit, are a subclass of policy-based methods. In policy-based methods, the optimization is most of the time *on-policy* since, for each update, we only use data (trajectories) collected **by our most recent version of** \\(\pi_\theta\\).
36 |
37 | The difference between these two methods **lies in how we optimize the parameter** \\(\theta\\):
38 |
39 | - In *policy-based methods*, we search directly for the optimal policy. We can optimize the parameter \\(\theta\\) **indirectly** by maximizing the local approximation of the objective function with techniques like hill climbing, simulated annealing, or evolution strategies.
40 | - In *policy-gradient methods*, because they are a subclass of policy-based methods, we search directly for the optimal policy. But we optimize the parameter \\(\theta\\) **directly** by performing gradient ascent on the objective function \\(J(\theta)\\).
41 |
42 | Before diving more into how policy-gradient methods work (the objective function, policy gradient theorem, gradient ascent, etc.), let's study the advantages and disadvantages of policy-based methods.
43 |
--------------------------------------------------------------------------------
/units/en/unit5/bonus.mdx:
--------------------------------------------------------------------------------
1 | # Bonus: Learn to create your own environments with Unity and MLAgents
2 |
3 | **You can create your own reinforcement learning environments with Unity and MLAgents**. Using a game engine such as Unity can be intimidating at first, but here are the steps you can take to learn smoothly.
4 |
5 | ## Step 1: Know how to use Unity
6 |
7 | - The best way to learn Unity is to do ["Create with Code" course](https://learn.unity.com/course/create-with-code): it's a series of videos for beginners where **you will create 5 small games with Unity**.
8 |
9 | ## Step 2: Create the simplest environment with this tutorial
10 |
11 | - Then, when you know how to use Unity, you can create your [first basic RL environment using this tutorial](https://github.com/Unity-Technologies/ml-agents/blob/release_20_docs/docs/Learning-Environment-Create-New.md).
12 |
13 | ## Step 3: Iterate and create nice environments
14 |
15 | - Now that you've created your first simple environment you can iterate to more complex ones using the [MLAgents documentation (especially Designing Agents and Agent part)](https://github.com/Unity-Technologies/ml-agents/blob/release_20_docs/docs/)
16 | - In addition, you can take this free course ["Create a hummingbird environment"](https://learn.unity.com/course/ml-agents-hummingbirds) by [Adam Kelly](https://twitter.com/aktwelve)
17 |
18 |
19 | Have fun! And if you create custom environments, don't hesitate to share them in the `#rl-i-made-this` Discord channel.
20 |
--------------------------------------------------------------------------------
/units/en/unit5/conclusion.mdx:
--------------------------------------------------------------------------------
1 | # Conclusion
2 |
3 | Congrats on finishing this unit! You’ve just trained your first ML-Agents agents and shared them on the Hub 🥳.
4 |
5 | The best way to learn is to **practice and try stuff**. Why not try another environment? [ML-Agents has 18 different environments](https://github.com/Unity-Technologies/ml-agents/blob/develop/docs/Learning-Environment-Examples.md).
6 |
7 | For instance:
8 | - [Worm](https://singularite.itch.io/worm), where you teach a worm to crawl.
9 | - [Walker](https://singularite.itch.io/walker), where you teach an agent to walk towards a goal.
10 |
11 | Check the documentation to find out how to train them and to see the list of already integrated MLAgents environments on the Hub: https://github.com/huggingface/ml-agents#getting-started
12 |
13 |
14 |
15 |
16 | In the next unit, we're going to learn about multi-agent reinforcement learning. You're going to train your first multi-agent systems to compete in Soccer and Snowball Fight against your classmates' agents.
17 |
18 |
19 |
20 | Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)
21 |
22 | ### Keep Learning, stay awesome 🤗
23 |
--------------------------------------------------------------------------------
/units/en/unit5/curiosity.mdx:
--------------------------------------------------------------------------------
1 | # (Optional) What is Curiosity in Deep Reinforcement Learning?
2 |
3 | This is an (optional) introduction to Curiosity. If you want to learn more, you can read two additional articles where we dive into the mathematical details:
4 |
5 | - [Curiosity-Driven Learning through Next State Prediction](https://medium.com/data-from-the-trenches/curiosity-driven-learning-through-next-state-prediction-f7f4e2f592fa)
6 | - [Random Network Distillation: a new take on Curiosity-Driven Learning](https://medium.com/data-from-the-trenches/curiosity-driven-learning-through-random-network-distillation-488ffd8e5938)
7 |
8 | ## Two Major Problems in Modern RL
9 |
10 | To understand what Curiosity is, we first need to understand the two major problems with RL:
11 |
12 | First, the *sparse rewards problem:* that is, **most rewards do not contain information, and hence are set to zero**.
13 |
14 | Remember that RL is based on the *reward hypothesis*, which is the idea that each goal can be described as the maximization of the rewards. Therefore, rewards act as feedback for RL agents; **if they don’t receive any, their knowledge of which action is appropriate (or not) cannot change**.
15 |
16 |
17 |
18 |
19 | Source: Thanks to the reward, our agent knows that this action at that state was good
20 |
21 |
22 |
23 | For instance, in [Vizdoom](https://vizdoom.cs.put.edu.pl/), a set of environments based on the game Doom “DoomMyWayHome,” your agent is only rewarded **if it finds the vest**.
24 | However, the vest is far away from your starting point, so most of your rewards will be zero. Therefore, if our agent does not receive useful feedback (dense rewards), it will take much longer to learn an optimal policy, and **it can spend time turning around without finding the goal**.
25 |
26 |
27 |
28 | The second big problem is that **the extrinsic reward function is handmade; in each environment, a human has to implement a reward function**. But how can we scale that to big and complex environments?
29 |
30 | ## So what is Curiosity?
31 |
32 | A solution to these problems is **to develop a reward function intrinsic to the agent, i.e., generated by the agent itself**. The agent will act as a self-learner since it will be the student and its own feedback master.
33 |
34 | **This intrinsic reward mechanism is known as Curiosity** because this reward pushes the agent to explore states that are novel/unfamiliar. To achieve that, our agent will receive a high reward when exploring new trajectories.
35 |
36 | This reward is inspired by how humans act. **We naturally have an intrinsic desire to explore environments and discover new things**.
37 |
38 | There are different ways to calculate this intrinsic reward. The classical approach (Curiosity through next-state prediction) is to calculate Curiosity **as the error of our agent in predicting the next state, given the current state and action taken**.
39 |
40 |
41 |
42 | Because the idea of Curiosity is to **encourage our agent to perform actions that reduce the uncertainty in the agent’s ability to predict the consequences of its actions** (uncertainty will be higher in areas where the agent has spent less time or in areas with complex dynamics).
43 |
44 | If the agent spends a lot of time on these states, it will be good at predicting the next state (low Curiosity). On the other hand, if it’s in a new, unexplored state, it will be hard to predict the following state (high Curiosity).
45 |
46 |
47 |
48 | Using Curiosity will push our agent to favor transitions with high prediction error (which will be higher in areas where the agent has spent less time, or in areas with complex dynamics) and **consequently better explore our environment**.
49 |
50 | There are also **other methods to calculate Curiosity**. ML-Agents uses a more advanced one called Curiosity through random network distillation. This is out of the scope of this tutorial, but if you’re interested, [I wrote an article explaining it in detail](https://medium.com/data-from-the-trenches/curiosity-driven-learning-through-random-network-distillation-488ffd8e5938).
51 |
--------------------------------------------------------------------------------
/units/en/unit5/how-mlagents-works.mdx:
--------------------------------------------------------------------------------
1 | # How do Unity ML-Agents work? [[how-mlagents-works]]
2 |
3 | Before training our agent, we need to understand **what ML-Agents is and how it works**.
4 |
5 | ## What is Unity ML-Agents? [[what-is-mlagents]]
6 |
7 | [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents) is a toolkit for the game engine Unity that **allows us to create environments using Unity or use pre-made environments to train our agents**.
8 |
9 | It’s developed by [Unity Technologies](https://unity.com/), the developers of Unity, one of the most famous Game Engines used by the creators of Firewatch, Cuphead, and Cities: Skylines.
10 |
11 |
12 |
13 | Firewatch was made with Unity
14 |
15 |
16 | ## The six components [[six-components]]
17 |
18 | With Unity ML-Agents, you have six essential components:
19 |
20 |
21 |
22 | Source: Unity ML-Agents Documentation
23 |
24 |
25 | - The first is the *Learning Environment*, which contains **the Unity scene (the environment) and the environment elements** (game characters).
26 | - The second is the *Python Low-level API*, which contains **the low-level Python interface for interacting and manipulating the environment**. It’s the API we use to launch the training.
27 | - Then, we have the *External Communicator* that **connects the Learning Environment (made with C#) with the low level Python API (Python)**.
28 | - The *Python trainers*: the **Reinforcement Learning algorithms made with PyTorch (PPO, SAC…)**.
29 | - The *Gym wrapper*: to encapsulate the RL environment in a gym wrapper.
30 | - The *PettingZoo wrapper*: PettingZoo is the multi-agent version of the Gym wrapper.
31 |
32 | ## Inside the Learning Component [[inside-learning-component]]
33 |
34 | Inside the Learning Component, we have **two important elements**:
35 |
36 | - The first is the *agent component*, the actor of the scene. We’ll **train the agent by optimizing its policy** (which will tell us what action to take in each state). The policy is called the *Brain*.
37 | - Finally, there is the *Academy*. This component **orchestrates agents and their decision-making processes**. Think of this Academy as a teacher who handles Python API requests.
38 |
39 | To better understand its role, let’s remember the RL process. This can be modeled as a loop that works like this:
40 |
41 |
42 |
43 | The RL Process: a loop of state, action, reward and next state
44 | Source: Reinforcement Learning: An Introduction, Richard Sutton and Andrew G. Barto
45 |
46 |
47 | Now, let’s imagine an agent learning to play a platform game. The RL process looks like this:
48 |
49 |
50 |
51 | - Our Agent receives **state \\(S_0\\)** from the **Environment** — we receive the first frame of our game (Environment).
52 | - Based on that **state \\(S_0\\),** the Agent takes **action \\(A_0\\)** — our Agent will move to the right.
53 | - The environment goes to a **new** **state \\(S_1\\)** — new frame.
54 | - The environment gives some **reward \\(R_1\\)** to the Agent — we’re not dead *(Positive Reward +1)*.
55 |
56 | This RL loop outputs a sequence of **state, action, reward and next state.** The goal of the agent is to **maximize the expected cumulative reward**.
57 |
58 | The Academy will be the one that will **send the order to our Agents and ensure that agents are in sync**:
59 |
60 | - Collect Observations
61 | - Select your action using your policy
62 | - Take the Action
63 | - Reset if you reached the max step or if you’re done.
64 |
65 |
66 |
67 |
68 | Now that we understand how ML-Agents works, **we’re ready to train our agents.**
69 |
--------------------------------------------------------------------------------
/units/en/unit5/introduction.mdx:
--------------------------------------------------------------------------------
1 | # An Introduction to Unity ML-Agents [[introduction-to-ml-agents]]
2 |
3 |
4 |
5 | One of the challenges in Reinforcement Learning is **creating environments**. Fortunately for us, we can use game engines to do so.
6 | These engines, such as [Unity](https://unity.com/), [Godot](https://godotengine.org/) or [Unreal Engine](https://www.unrealengine.com/), are programs made to create video games. They are perfectly suited
7 | for creating environments: they provide physics systems, 2D/3D rendering, and more.
8 |
9 |
10 | One of them, [Unity](https://unity.com/), created the [Unity ML-Agents Toolkit](https://github.com/Unity-Technologies/ml-agents), a plugin based on the game engine Unity that allows us **to use the Unity Game Engine as an environment builder to train agents**. In the first bonus unit, this is what we used to train Huggy to catch a stick!
11 |
12 |
13 |
14 | Source: ML-Agents documentation
15 |
16 |
17 | The Unity ML-Agents Toolkit provides many exceptional pre-made environments, from playing football (soccer) to learning to walk and jumping over big walls.
18 |
19 | In this Unit, we'll learn to use ML-Agents, but **don't worry if you don't know how to use the Unity Game Engine**: you don't need to use it to train your agents.
20 |
21 | So, today, we're going to train two agents:
22 | - The first one will learn to **shoot snowballs onto a spawning target**.
23 | - The second needs to **press a button to spawn a pyramid, then navigate to the pyramid, knock it over, and move to the gold brick at the top**. To do that, it will need to explore its environment, which will be done using a technique called curiosity.
24 |
25 |
26 |
27 | Then, after training, **you'll push the trained agents to the Hugging Face Hub**, and you'll be able to **visualize them playing directly on your browser without having to use the Unity Editor**.
28 |
29 | Doing this Unit will **prepare you for the next challenge: AI vs. AI where you will train agents in multi-agents environments and compete against your classmates' agents**.
30 |
31 | Sound exciting? Let's get started!
32 |
--------------------------------------------------------------------------------
/units/en/unit5/pyramids.mdx:
--------------------------------------------------------------------------------
1 | # The Pyramid environment
2 |
3 | The goal in this environment is to train our agent to **get the gold brick on the top of the Pyramid. To do that, it needs to press a button to spawn a Pyramid, navigate to the Pyramid, knock it over, and move to the gold brick at the top**.
4 |
5 |
6 |
7 |
8 | ## The reward function
9 |
10 | The reward function is:
11 |
12 |
13 |
14 | In terms of code, it looks like this
15 |
16 |
17 | To train this new agent that seeks that button and then the Pyramid to destroy, we’ll use a combination of two types of rewards:
18 |
19 | - The *extrinsic one* given by the environment (illustration above).
20 | - But also an *intrinsic* one called **curiosity**. This second will **push our agent to be curious, or in other terms, to better explore its environment**.
21 |
22 | If you want to know more about curiosity, the next section (optional) will explain the basics.
23 |
24 | ## The observation space
25 |
26 | In terms of observation, we **use 148 raycasts that can each detect objects** (switch, bricks, golden brick, and walls).
27 |
28 |
29 |
30 | We also use a **boolean variable indicating the switch state** (did we turn on or off the switch to spawn the Pyramid) and a vector that **contains the agent’s speed**.
31 |
32 |
33 |
34 |
35 | ## The action space
36 |
37 | The action space is **discrete** with four possible actions:
38 |
39 |
40 |
--------------------------------------------------------------------------------
/units/en/unit5/quiz.mdx:
--------------------------------------------------------------------------------
1 | # Quiz
2 |
3 | The best way to learn and [to avoid the illusion of competence](https://www.coursera.org/lecture/learning-how-to-learn/illusions-of-competence-BuFzf) **is to test yourself.** This will help you to find **where you need to reinforce your knowledge**.
4 |
5 | ### Q1: Which of the following tools are specifically designed for video games development?
6 |
7 |
41 |
42 | ### Q2: Which of the following statements are true about Unity ML-Agents?
43 |
44 |
78 |
79 | ### Q3: Fill the missing letters
80 |
81 | - In Unity ML-Agents, the Policy of an Agent is called a b \_ \_ \_ n
82 | - The component in charge of orchestrating the agents is called the \_ c \_ \_ \_ m \_
83 |
84 |
85 | Solution
86 |
87 | - b r a i n
88 | - a c a d e m y
89 |
90 |
91 |
92 | ### Q4: Define with your own words what is a `raycast`
93 |
94 |
95 | Solution
96 | A raycast is (most of the time) a linear projection, like a `laser`, which aims to detect collisions with objects.
97 |
98 |
99 | ### Q5: Which are the differences between capturing the environment using `frames` or `raycasts`?
100 |
101 |
120 |
121 |
122 | ### Q6: Name several environment and agent input variables used to train the agent in the Snowball or Pyramid environments
123 |
124 |
125 | Solution
126 | - Collisions of the raycasts spawned from the agent detecting blocks, (invisible) walls, stones, our target, switches, etc.
127 | - Traditional inputs describing agent features, such as its speed
128 | - Boolean variables, such as the switch state (on/off) in Pyramids or the `can I shoot?` in SnowballTarget.
129 |
130 |
131 |
132 | Congrats on finishing this Quiz 🥳! If you missed some elements, take time to read the chapter again to reinforce (😏) your knowledge.
133 |
--------------------------------------------------------------------------------
/units/en/unit5/snowball-target.mdx:
--------------------------------------------------------------------------------
1 | # The SnowballTarget Environment
2 |
3 |
4 |
5 | SnowballTarget is an environment we created at Hugging Face using assets from [Kay Lousberg](https://kaylousberg.com/). We have an optional section at the end of this Unit **if you want to learn to use Unity and create your environments**.
6 |
7 | ## The agent's Goal
8 |
9 | The first agent you're going to train is called Julien the bear 🐻. Julien is trained **to hit targets with snowballs**.
10 |
11 | The Goal in this environment is that Julien **hits as many targets as possible in the limited time** (1000 timesteps). It will need **to place itself correctly in relation to the target and shoot** to do that.
12 |
13 | In addition, to avoid "snowball spamming" (aka shooting a snowball every timestep), **Julien has a "cool off" system** (it needs to wait 0.5 seconds after a shot to be able to shoot again).
14 |
15 |
16 |
17 | The agent needs to wait 0.5s before being able to shoot a snowball again
18 |
19 |
20 | ## The reward function and the reward engineering problem
21 |
22 | The reward function is simple. **The environment gives a +1 reward every time the agent's snowball hits a target**. Because the agent's Goal is to maximize the expected cumulative reward, **it will try to hit as many targets as possible**.
23 |
24 |
25 |
26 | We could have a more complex reward function (with a penalty to push the agent to go faster, for example). But when you design an environment, you need to avoid the *reward engineering problem*, which is designing a reward function that is too complex in order to force your agent to behave exactly as you want.
27 | Why? Because by doing that, **you might miss interesting strategies that the agent will find with a simpler reward function**.
28 |
29 | In terms of code, it looks like this:
30 |
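The real environment implements this in C# with Unity ML-Agents; as a minimal, language-agnostic sketch of the same idea (the function name is hypothetical):

```python
def snowball_target_reward(snowball_hit_target: bool) -> float:
    """+1 every time the agent's snowball hits a target, 0 otherwise."""
    return 1.0 if snowball_hit_target else 0.0
```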
31 |
32 |
33 |
34 | ## The observation space
35 |
36 | Regarding observations, we don't use normal vision (frame), but **we use raycasts**.
37 |
38 | Think of raycasts as lasers that will detect if they pass through an object.
39 |
40 |
41 |
42 | Source: ML-Agents documentation
43 |
44 |
45 |
46 | In this environment, our agent has multiple sets of raycasts:
47 |
48 |
49 | In addition to raycasts, the agent gets a "can I shoot" bool as observation.
50 |
51 |
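As a rough sketch of how such observations could be assembled (the encoding below is a simplifying assumption, not the exact ML-Agents ray sensor output):

```python
import numpy as np

def encode_raycasts(hit_flags, hit_distances):
    """For each ray, store whether it hit something and the normalized hit distance."""
    return np.stack([hit_flags, hit_distances], axis=1).flatten().astype(np.float32)

rays = encode_raycasts(hit_flags=[1, 0, 1], hit_distances=[0.4, 1.0, 0.7])
can_shoot = np.array([1.0], dtype=np.float32)   # the "can I shoot" bool
observation = np.concatenate([rays, can_shoot])
```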
52 |
53 | ## The action space
54 |
55 | The action space is discrete:
56 |
57 |
58 |
--------------------------------------------------------------------------------
/units/en/unit6/additional-readings.mdx:
--------------------------------------------------------------------------------
1 | # Additional Readings [[additional-readings]]
2 |
3 | ## Bias-variance tradeoff in Reinforcement Learning
4 |
5 | If you want to dive deeper into the question of variance and bias tradeoff in Deep Reinforcement Learning, you can check out these two articles:
6 |
7 | - [Making Sense of the Bias / Variance Trade-off in (Deep) Reinforcement Learning](https://blog.mlreview.com/making-sense-of-the-bias-variance-trade-off-in-deep-reinforcement-learning-79cf1e83d565)
8 | - [Bias-variance Tradeoff in Reinforcement Learning](https://www.endtoend.ai/blog/bias-variance-tradeoff-in-reinforcement-learning/)
9 |
10 | ## Advantage Functions
11 |
12 | - [Advantage Functions, SpinningUp RL](https://spinningup.openai.com/en/latest/spinningup/rl_intro.html?highlight=advantage%20functio#advantage-functions)
13 |
14 | ## Actor Critic
15 |
16 | - [Foundations of Deep RL Series, L3 Policy Gradients and Advantage Estimation by Pieter Abbeel](https://www.youtube.com/watch?v=AKbX1Zvo7r8)
17 | - [A2C Paper: Asynchronous Methods for Deep Reinforcement Learning](https://arxiv.org/abs/1602.01783v2)
18 |
--------------------------------------------------------------------------------
/units/en/unit6/advantage-actor-critic.mdx:
--------------------------------------------------------------------------------
1 | # Advantage Actor-Critic (A2C) [[advantage-actor-critic]]
2 |
3 | ## Reducing variance with Actor-Critic methods
4 |
5 | The solution to reducing the variance of the Reinforce algorithm and training our agent faster and better is to use a combination of Policy-Based and Value-Based methods: *the Actor-Critic method*.
6 |
7 | To understand the Actor-Critic, imagine you're playing a video game. You can play with a friend that will provide you with some feedback. You're the Actor and your friend is the Critic.
8 |
9 |
10 |
11 | You don't know how to play at the beginning, **so you try some actions randomly**. The Critic observes your action and **provides feedback**.
12 |
13 | Learning from this feedback, **you'll update your policy and be better at playing that game.**
14 |
15 | On the other hand, your friend (Critic) will also update the way they provide feedback so it can be better next time.
16 |
17 | This is the idea behind Actor-Critic. We learn two function approximations:
18 |
19 | - *A policy* that **controls how our agent acts**: \\( \pi_{\theta}(s) \\)
20 |
21 | - *A value function* to assist the policy update by measuring how good the action taken is: \\( \hat{q}_{w}(s,a) \\)
22 |
23 | ## The Actor-Critic Process
24 | Now that we have seen the Actor Critic's big picture, let's dive deeper to understand how the Actor and Critic improve together during the training.
25 |
26 | As we saw, with Actor-Critic methods, there are two function approximations (two neural networks):
27 | - *Actor*, a **policy function** parameterized by theta: \\( \pi_{\theta}(s) \\)
28 | - *Critic*, a **value function** parameterized by w: \\( \hat{q}_{w}(s,a) \\)
29 |
30 | Let's see the training process to understand how the Actor and Critic are optimized:
31 | - At each timestep, t, we get the current state \\( S_t\\) from the environment and **pass it as input through our Actor and Critic**.
32 |
33 | - Our Policy takes the state and **outputs an action** \\( A_t \\).
34 |
35 |
36 |
37 | - The Critic also takes that action as input and, using \\( S_t\\) and \\( A_t \\), **computes the value of taking that action at that state: the Q-value**.
38 |
39 |
40 |
41 | - The action \\( A_t\\) performed in the environment outputs a new state \\( S_{t+1}\\) and a reward \\( R_{t+1} \\) .
42 |
43 |
44 |
45 | - The Actor updates its policy parameters using the Q value.
46 |
47 |
48 |
49 | - Thanks to its updated parameters, the Actor produces the next action \\( A_{t+1} \\) to take given the new state \\( S_{t+1} \\).
50 |
51 | - The Critic then updates its value parameters.
52 |
53 |
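To make this loop concrete, here is a minimal PyTorch-style sketch of one update step. The `actor.log_prob` helper, the `critic(state, action)` signature, and the hyperparameters are illustrative assumptions, not a full A2C implementation.

```python
import torch
import torch.nn.functional as F

def actor_critic_step(actor, critic, actor_opt, critic_opt,
                      state, action, reward, next_state, next_action, gamma=0.99):
    """One Actor-Critic update: the Actor is pushed by the Critic's Q estimate,
    the Critic regresses towards the TD target."""
    q_value = critic(state, action)                              # Q_w(S_t, A_t)
    with torch.no_grad():
        td_target = reward + gamma * critic(next_state, next_action)

    # Actor: increase the log-probability of the action, scaled by the Q value
    actor_loss = -(actor.log_prob(state, action) * q_value.detach()).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Critic: move Q_w(S_t, A_t) towards the TD target
    critic_loss = F.mse_loss(q_value, td_target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()
```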
54 |
55 | ## Adding Advantage in Actor-Critic (A2C)
56 | We can stabilize learning further by **using the Advantage function as Critic instead of the Action value function**.
57 |
58 | The idea is that the Advantage function calculates the relative advantage of an action compared to the others possible at a state: **how taking that action at a state is better compared to the average value of the state**. It subtracts the mean value of the state from the state-action pair value:
59 |
60 |
61 |
62 | In other words, this function calculates **the extra reward we get if we take this action at that state compared to the mean reward we get at that state**.
63 |
64 | The extra reward is what's beyond the expected value of that state.
65 | - If A(s,a) > 0: our gradient is **pushed in that direction**.
66 | - If A(s,a) < 0 (our action does worse than the average value of that state), **our gradient is pushed in the opposite direction**.
67 |
68 | The problem with implementing this advantage function is that it requires two value functions — \\( Q(s,a)\\) and \\( V(s)\\). Fortunately, **we can use the TD error as a good estimator of the advantage function.**
69 |
70 |
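Written out with the notation used in this unit, the two quantities are:

\\( A(s,a) = Q(s,a) - V(s) \\)

\\( \delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \\)

The TD error \\( \delta_t \\) only requires the value function \\( V(s) \\), which is why it can serve as an estimator of the advantage without learning \\( Q(s,a) \\) separately.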
71 |
--------------------------------------------------------------------------------
/units/en/unit6/conclusion.mdx:
--------------------------------------------------------------------------------
1 | # Conclusion [[conclusion]]
2 |
3 | Congrats on finishing this unit and the tutorial. You've just trained your first virtual robots 🥳.
4 |
5 | **Take time to grasp the material before continuing**. You can also look at the additional reading materials we provided in the *additional reading* section.
6 |
7 | Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then please 👉 [fill out this form](https://forms.gle/BzKXWzLAGZESGNaE9)
8 |
9 | See you in the next unit!
10 |
11 | ### Keep learning, stay awesome 🤗
12 |
--------------------------------------------------------------------------------
/units/en/unit6/introduction.mdx:
--------------------------------------------------------------------------------
1 | # Introduction [[introduction]]
2 |
3 |
4 |
5 |
6 | In unit 4, we learned about our first Policy-Based algorithm called **Reinforce**.
7 |
8 | In Policy-Based methods, **we aim to optimize the policy directly without using a value function**. More precisely, Reinforce is part of a subclass of *Policy-Based Methods* called *Policy-Gradient methods*. This subclass optimizes the policy directly by **estimating the weights of the optimal policy using Gradient Ascent**.
9 |
10 | We saw that Reinforce worked well. However, because we use Monte-Carlo sampling to estimate return (we use an entire episode to calculate the return), **we have significant variance in policy gradient estimation**.
11 |
12 | Remember that the policy gradient estimation is **the direction of the steepest increase in return**. In other words, how to update our policy weights so that actions that lead to good returns have a higher probability of being taken. The Monte Carlo variance, which we will further study in this unit, **leads to slower training since we need a lot of samples to mitigate it**.
13 |
14 | So today we'll study **Actor-Critic methods**, a hybrid architecture combining value-based and Policy-Based methods that helps to stabilize the training by reducing the variance using:
15 | - *An Actor* that controls **how our agent behaves** (Policy-Based method)
16 | - *A Critic* that measures **how good the taken action is** (Value-Based method)
17 |
18 |
19 | We'll study one of these hybrid methods, Advantage Actor Critic (A2C), **and train our agent using Stable-Baselines3 in robotic environments**. We'll train:
20 | - A robotic arm 🦾 to move to the correct position.
21 |
22 | Sound exciting? Let's get started!
23 |
--------------------------------------------------------------------------------
/units/en/unit6/quiz.mdx:
--------------------------------------------------------------------------------
1 | # Quiz
2 |
3 | The best way to learn and [to avoid the illusion of competence](https://www.coursera.org/lecture/learning-how-to-learn/illusions-of-competence-BuFzf) **is to test yourself.** This will help you to find **where you need to reinforce your knowledge**.
4 |
5 |
6 | ### Q1: Which of the following interpretations of bias-variance tradeoff is the most accurate in the field of Reinforcement Learning?
7 |
8 |
22 |
23 | ### Q2: Which of the following statements are true, when talking about models with bias and/or variance in RL?
24 |
25 |
49 |
50 |
51 | ### Q3: Which of the following statements are true about Monte Carlo method?
52 |
53 |
72 |
73 | ### Q4: How would you describe, with your own words, the Actor-Critic Method (A2C)?
74 |
75 |
76 | Solution
77 |
78 | The idea behind Actor-Critic is that we learn two function approximations:
79 | 1. A `policy` that controls how our agent acts (π)
80 | 2. A `value` function to assist the policy update by measuring how good the action taken is (q)
81 |
82 |
83 |
84 |
85 |
86 | ### Q5: Which of the following statements are true about the Actor-Critic Method?
87 |
88 |
107 |
108 |
109 |
110 | ### Q6: What is `Advantage` in the A2C method?
111 |
112 |
113 | Solution
114 |
115 | Instead of directly using the Critic's Action-Value function as it is, we can use an `Advantage` function. The idea behind the `Advantage` function is that we calculate the relative advantage of an action compared to the other actions possible at that state, averaging over them.
116 |
117 | In other words: how taking that action at a state is better compared to the average value of the state
118 |
119 |
120 |
121 |
122 |
123 | Congrats on finishing this Quiz 🥳. If you missed some elements, take time to read the chapter again to reinforce (😏) your knowledge.
124 |
--------------------------------------------------------------------------------
/units/en/unit6/variance-problem.mdx:
--------------------------------------------------------------------------------
1 | # The Problem of Variance in Reinforce [[the-problem-of-variance-in-reinforce]]
2 |
3 | In Reinforce, we want to **increase the probability of actions in a trajectory proportionally to how high the return is**.
4 |
5 |
6 |
7 |
8 | - If the **return is high**, we will **push up** the probabilities of the (state, action) combinations.
9 | - Otherwise, if the **return is low**, it will **push down** the probabilities of the (state, action) combinations.
10 |
11 | This return \\(R(\tau)\\) is calculated using a *Monte-Carlo sampling*. We collect a trajectory and calculate the discounted return, **and use this score to increase or decrease the probability of every action taken in that trajectory**. If the return is good, all actions will be “reinforced” by increasing their likelihood of being taken.
12 |
13 | \\(R(\tau) = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ...\\)
14 |
15 | The advantage of this method is that **it’s unbiased. Since we’re not estimating the return**, we use only the true return we obtain.
16 |
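As a small sketch of this Monte-Carlo computation (assuming we have already collected the rewards of one complete episode):

```python
def discounted_return(rewards, gamma=0.99):
    """R(tau) = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ..."""
    ret = 0.0
    for step, reward in enumerate(rewards):
        ret += (gamma ** step) * reward
    return ret

# Two episodes starting from the same state can still yield very different returns
print(discounted_return([1, 0, 0, 1]))   # one trajectory
print(discounted_return([0, 0, 0, 0]))   # another trajectory with the same start state
```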
17 | Given the stochasticity of the environment (random events during an episode) and the stochasticity of the policy, **trajectories can lead to different returns, which can lead to high variance**.
18 | Because of this, **the return starting from the same state can vary significantly across episodes**.
19 |
20 |
21 |
22 | The solution is to mitigate the variance by **using a large number of trajectories, hoping that the variance introduced in any one trajectory will be reduced in aggregate and provide a "true" estimation of the return.**
23 |
24 | However, increasing the batch size significantly **reduces sample efficiency**. So we need to find additional mechanisms to reduce the variance.
25 |
26 | ---
27 | If you want to dive deeper into the question of variance and bias tradeoff in Deep Reinforcement Learning, you can check out these two articles:
28 | - [Making Sense of the Bias / Variance Trade-off in (Deep) Reinforcement Learning](https://blog.mlreview.com/making-sense-of-the-bias-variance-trade-off-in-deep-reinforcement-learning-79cf1e83d565)
29 | - [Bias-variance Tradeoff in Reinforcement Learning](https://www.endtoend.ai/blog/bias-variance-tradeoff-in-reinforcement-learning/)
30 | - [High Variance in Policy gradients](https://balajiai.github.io/high_variance_in_policy_gradients)
31 | ---
32 |
--------------------------------------------------------------------------------
/units/en/unit7/additional-readings.mdx:
--------------------------------------------------------------------------------
1 | # Additional Readings [[additional-readings]]
2 |
3 | ## An introduction to multi-agents
4 |
5 | - [Multi-agent reinforcement learning: An overview](https://www.dcsc.tudelft.nl/~bdeschutter/pub/rep/10_003.pdf)
6 | - [Multiagent Reinforcement Learning, Marc Lanctot](https://rlss.inria.fr/files/2019/07/RLSS_Multiagent.pdf)
7 | - [Example of a multi-agent environment](https://www.mathworks.com/help/reinforcement-learning/ug/train-3-agents-for-area-coverage.html?s_eid=PSM_15028)
8 | - [A list of different multi-agent environments](https://agents.inf.ed.ac.uk/blog/multiagent-learning-environments/)
9 | - [Multi-Agent Reinforcement Learning: Independent vs. Cooperative Agents](https://bit.ly/3nVK7My)
10 | - [Dealing with Non-Stationarity in Multi-Agent Deep Reinforcement Learning](https://bit.ly/3v7LxaT)
11 |
12 | ## Self-Play and MA-POCA
13 |
14 | - [Self Play Theory and with MLAgents](https://blog.unity.com/technology/training-intelligent-adversaries-using-self-play-with-ml-agents)
15 | - [Training complex behavior with MLAgents](https://blog.unity.com/technology/ml-agents-v20-release-now-supports-training-complex-cooperative-behaviors)
16 | - [MLAgents plays dodgeball](https://blog.unity.com/technology/ml-agents-plays-dodgeball)
17 | - [On the Use and Misuse of Absorbing States in Multi-agent Reinforcement Learning (MA-POCA)](https://arxiv.org/pdf/2111.05992.pdf)
18 |
--------------------------------------------------------------------------------
/units/en/unit7/conclusion.mdx:
--------------------------------------------------------------------------------
1 | # Conclusion
2 |
3 | That’s all for today. Congrats on finishing this unit and the tutorial!
4 |
5 | The best way to learn is to practice and try stuff. **Why not train another agent with a different configuration?**
6 |
7 | And don’t hesitate to check the [leaderboard](https://huggingface.co/spaces/huggingface-projects/AIvsAI-SoccerTwos) from time to time
8 |
9 | See you in Unit 8 🔥
10 |
11 | ## Keep Learning, Stay awesome 🤗
12 |
--------------------------------------------------------------------------------
/units/en/unit7/introduction-to-marl.mdx:
--------------------------------------------------------------------------------
1 | # An introduction to Multi-Agent Reinforcement Learning (MARL)
2 |
3 | ## From single agent to multiple agents
4 |
5 | In the first unit, we learned to train agents in a single-agent system, where our agent was alone in its environment: **it was not cooperating or collaborating with other agents**.
6 |
7 |
8 |
9 |
10 | A patchwork of all the environments you've trained your agents on since the beginning of the course
11 |
12 |
13 |
14 | When we do multi-agent reinforcement learning (MARL), we have multiple agents **that share and interact in a common environment**.
15 |
16 | For instance, you can think of a warehouse where **multiple robots need to navigate to load and unload packages**.
17 |
18 |
19 |
20 | [Image by upklyak](https://www.freepik.com/free-vector/robots-warehouse-interior-automated-machines_32117680.htm#query=warehouse robot&position=17&from_view=keyword) on Freepik
21 |
22 |
23 | Or a road with **several autonomous vehicles**.
24 |
25 |
26 |
27 |
28 | [Image by jcomp](https://www.freepik.com/free-vector/autonomous-smart-car-automatic-wireless-sensor-driving-road-around-car-autonomous-smart-car-goes-scans-roads-observe-distance-automatic-braking-system_26413332.htm#query=self driving cars highway&position=34&from_view=search&track=ais) on Freepik
29 |
30 |
31 |
32 | In these examples, we have **multiple agents interacting in the environment and with the other agents**. This implies defining a multi-agent system. But first, let's understand the different types of multi-agent environments.
33 |
34 | ## Different types of multi-agent environments
35 |
36 | Given that, in a multi-agent system, agents interact with other agents, we can have different types of environments:
37 |
38 | - *Cooperative environments*: where your agents need **to maximize the common benefits**.
39 |
40 | For instance, in a warehouse, **robots must collaborate to load and unload the packages efficiently (as fast as possible)**.
41 |
42 | - *Competitive/Adversarial environments*: in this case, your agent **wants to maximize its benefits by minimizing the opponent's**.
43 |
44 | For example, in a game of tennis, **each agent wants to beat the other agent**.
45 |
46 |
47 |
48 | - *Mixed adversarial and cooperative*: like in our SoccerTwos environment, two agents are part of a team (blue or purple): they need to cooperate with each other and beat the opponent team.
49 |
50 |
51 |
52 | This environment was made by the Unity MLAgents Team
53 |
54 |
55 | So now we might wonder: how can we design these multi-agent systems? Said differently, **how can we train agents in a multi-agent setting**?
56 |
--------------------------------------------------------------------------------
/units/en/unit7/introduction.mdx:
--------------------------------------------------------------------------------
1 | # Introduction [[introduction]]
2 |
3 |
4 |
5 | Since the beginning of this course, we learned to train agents in a *single-agent system* where our agent was alone in its environment: it was **not cooperating or collaborating with other agents**.
6 |
7 | This worked great, and the single-agent system is useful for many applications.
8 |
9 |
10 |
11 |
12 |
13 |
14 |
15 | A patchwork of all the environments you’ve trained your agents on since the beginning of the course
16 |
17 |
18 |
19 | But, as humans, **we live in a multi-agent world**. Our intelligence comes from interaction with other agents. And so, our **goal is to create agents that can interact with other humans and other agents**.
20 |
21 | Consequently, we must study how to train deep reinforcement learning agents in a *multi-agent system* to build robust agents that can adapt, collaborate, or compete.
22 |
23 | So today we’re going to **learn the basics of the fascinating topic of multi-agent reinforcement learning (MARL)**.
24 |
25 | And the most exciting part is that, during this unit, you’re going to train your first agents in a multi-agent system: **a 2vs2 soccer team that needs to beat the opponent team**.
26 |
27 | ## Course Maintenance Notice 🚧
28 |
29 | Please note that this **Deep Reinforcement Learning course is now in a low-maintenance state**. However, it **remains an excellent resource to learn both the theory and practical aspects of Deep Reinforcement Learning**.
30 |
31 | Keep in mind the following points:
32 |
33 | - *Unit 7 (AI vs AI)* : This feature is currently non-functional. However, you can still train your agent to play soccer and observe its performance. But the leaderboard for AI vs AI soccer was shut down.
34 |
35 |
36 |
37 |
38 | This environment was made by the Unity MLAgents Team
39 |
40 |
41 |
42 | So let’s get started!
43 |
--------------------------------------------------------------------------------
/units/en/unit7/multi-agent-setting.mdx:
--------------------------------------------------------------------------------
1 | # Designing Multi-Agents systems
2 |
3 | For this section, you're going to watch this excellent introduction to multi-agent systems made by Brian Douglas.
4 |
5 |
6 |
7 |
8 | In this video, Brian talked about how to design multi-agent systems. He specifically took a multi-agent system of vacuum cleaners and asked: **how can they cooperate with each other**?
9 |
10 | There are two approaches to designing this multi-agent reinforcement learning (MARL) system.
11 |
12 | ## Decentralized system
13 |
14 |
15 |
16 |
17 | Source: Introduction to Multi-Agent Reinforcement Learning
18 |
19 |
20 |
21 | In decentralized learning, **each agent is trained independently from the others**. In the example given, each vacuum learns to clean as many places as it can **without caring about what other vacuums (agents) are doing**.
22 |
23 | The benefit is that **since no information is shared between agents, these vacuums can be designed and trained like we train single agents**.
24 |
25 | The idea here is that **our training agent will consider other agents as part of the environment dynamics**. Not as agents.
26 |
27 | However, the big drawback of this technique is that it will **make the environment non-stationary** since the underlying Markov decision process changes over time as other agents are also interacting in the environment.
28 | And this is problematic for many Reinforcement Learning algorithms **that can't reach a global optimum with a non-stationary environment**.
29 |
30 | ## Centralized approach
31 |
32 |
33 |
34 |
35 | Source: Introduction to Multi-Agent Reinforcement Learning
36 |
37 |
38 |
39 | In this architecture, **we have a high-level process that collects agents' experiences**: the experience buffer. And we'll use these experiences **to learn a common policy**.
40 |
41 | For instance, in the vacuum cleaner example, the observation will be:
42 | - The coverage map of the vacuums.
43 | - The position of all the vacuums.
44 |
45 | We use that collective experience **to train a policy that will move all three robots in the most beneficial way as a whole**. So each robot is learning from their common experience.
46 | We now have a stationary environment since all the agents are treated as a larger entity, and they know how the other agents' policies change (since the policy is the same as their own).
47 |
48 | If we recap:
49 |
50 | - In a *decentralized approach*, we **treat all agents independently without considering the existence of the other agents.**
51 | - In this case, all agents **consider other agents as part of the environment**.
52 | - **The environment is non-stationary**, so there is no guarantee of convergence.
53 |
54 | - In a *centralized approach*:
55 | - A **single policy is learned from all the agents**.
56 | - The policy takes the current state of the environment as input and outputs joint actions.
57 | - The reward is global.
58 |
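As a rough sketch of the centralized idea (the data layout below is a simplifying assumption for illustration, not the exact algorithm):

```python
from dataclasses import dataclass, field

@dataclass
class CentralizedExperienceBuffer:
    """Collects the experiences of all agents so that a single policy can learn from them."""
    experiences: list = field(default_factory=list)

    def add(self, joint_observation, joint_actions, global_reward, next_joint_observation):
        # joint_observation could contain, e.g., the coverage map and all vacuum positions
        self.experiences.append(
            (joint_observation, joint_actions, global_reward, next_joint_observation)
        )

buffer = CentralizedExperienceBuffer()
buffer.add(
    joint_observation={"coverage_map": [[0, 1], [1, 0]], "positions": [(0, 0), (1, 1), (0, 1)]},
    joint_actions=["up", "left", "stay"],
    global_reward=0.3,
    next_joint_observation={"coverage_map": [[1, 1], [1, 0]], "positions": [(0, 1), (1, 0), (0, 1)]},
)
```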
--------------------------------------------------------------------------------
/units/en/unit7/quiz.mdx:
--------------------------------------------------------------------------------
1 | # Quiz
2 |
3 | The best way to learn and [to avoid the illusion of competence](https://www.coursera.org/lecture/learning-how-to-learn/illusions-of-competence-BuFzf) **is to test yourself.** This will help you to find **where you need to reinforce your knowledge**.
4 |
5 |
6 | ### Q1: Choose the option that fits best when comparing different types of multi-agent environments
7 |
8 | - Your agents aim to maximize common benefits in ____ environments
9 | - Your agents aim to maximize common benefits while minimizing the opponent's in ____ environments
10 |
11 |
25 |
26 | ### Q2: Which of the following statements are true about `decentralized` learning?
27 |
28 |
47 |
48 |
49 | ### Q3: Which of the following statements are true about `centralized` learning?
50 |
51 |
70 |
71 | ### Q4: Explain in your own words what is the `Self-Play` approach
72 |
73 |
74 | Solution
75 |
76 | `Self-play` is an approach where we instantiate copies of our agent, with the same policy, as opponents, so that our agent learns by playing against agents at the same training level.
77 |
78 |
79 |
80 | ### Q5: When configuring `Self-play`, several parameters are important. Could you identify, from its definition, which parameter we are talking about?
81 |
82 | - The probability of playing against the current self vs an opponent from a pool
83 | - Variety (dispersion) of training levels of the opponents you can face
84 | - The number of training steps before spawning a new opponent
85 | - Opponent change rate
86 |
87 |
111 |
112 | ### Q6: What are the main motivations to use an ELO rating score?
113 |
114 |
138 |
139 | Congrats on finishing this Quiz 🥳. If you missed some elements, take time to read the chapter again to reinforce (😏) your knowledge.
140 |
--------------------------------------------------------------------------------
/units/en/unit8/additional-readings.mdx:
--------------------------------------------------------------------------------
1 | # Additional Readings [[additional-readings]]
2 |
3 | These are **optional readings** if you want to go deeper.
4 |
5 | ## PPO Explained
6 |
7 | - [Towards Delivering a Coherent Self-Contained Explanation of Proximal Policy Optimization by Daniel Bick](https://fse.studenttheses.ub.rug.nl/25709/1/mAI_2021_BickD.pdf)
8 | - [What is the way to understand Proximal Policy Optimization Algorithm in RL?](https://stackoverflow.com/questions/46422845/what-is-the-way-to-understand-proximal-policy-optimization-algorithm-in-rl)
9 | - [Foundations of Deep RL Series, L4 TRPO and PPO by Pieter Abbeel](https://youtu.be/KjWF8VIMGiY)
10 | - [OpenAI PPO Blogpost](https://openai.com/blog/openai-baselines-ppo/)
11 | - [Spinning Up RL PPO](https://spinningup.openai.com/en/latest/algorithms/ppo.html)
12 | - [Paper Proximal Policy Optimization Algorithms](https://arxiv.org/abs/1707.06347)
13 |
14 | ## PPO Implementation details
15 |
16 | - [The 37 Implementation Details of Proximal Policy Optimization](https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/)
17 | - [Part 1 of 3 — Proximal Policy Optimization Implementation: 11 Core Implementation Details](https://www.youtube.com/watch?v=MEt6rrxH8W4)
18 |
19 | ## Importance Sampling
20 |
21 | - [Importance Sampling Explained](https://youtu.be/C3p2wI4RAi8)
22 |
--------------------------------------------------------------------------------
/units/en/unit8/clipped-surrogate-objective.mdx:
--------------------------------------------------------------------------------
1 | # Introducing the Clipped Surrogate Objective Function
2 | ## Recap: The Policy Objective Function
3 |
4 | Let’s remember what the objective is to optimize in Reinforce:
5 |
6 |
7 | The idea was that by taking a gradient ascent step on this function (equivalent to taking gradient descent of the negative of this function), we would **push our agent to take actions that lead to higher rewards and avoid harmful actions.**
8 |
9 | However, the problem comes from the step size:
10 | - If it's too small, **the training process is too slow**
11 | - If it's too large, **there is too much variability in the training**
12 |
13 | With PPO, the idea is to constrain our policy update with a new objective function called the *Clipped surrogate objective function* that **will constrain the policy change in a small range using a clip.**
14 |
15 | This new function **is designed to avoid destructively large weights updates** :
16 |
17 |
18 |
19 | Let’s study each part to understand how it works.
20 |
21 | ## The Ratio Function
22 |
23 |
24 | This ratio is calculated as follows:
25 |
26 |
27 |
28 | It’s the probability of taking action \\( a_t \\) at state \\( s_t \\) under the current policy, divided by the probability of taking the same action at the same state under the previous policy.
29 |
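Reconstructed from the PPO paper, the ratio is:

\\( r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} \\)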
30 | As we can see, \\( r_t(\theta) \\) denotes the probability ratio between the current and old policy:
31 |
32 | - If \\( r_t(\theta) > 1 \\), the **action \\( a_t \\) at state \\( s_t \\) is more likely in the current policy than the old policy.**
33 | - If \\( r_t(\theta) \\) is between 0 and 1, the **action is less likely for the current policy than for the old one**.
34 |
35 | So this probability ratio is an **easy way to estimate the divergence between old and current policy.**
36 |
37 | ## The unclipped part of the Clipped Surrogate Objective function
38 |
39 |
40 | This ratio **can replace the log probability we use in the policy objective function**. This gives us the left part of the new objective function: multiplying the ratio by the advantage.
41 |
42 |
43 | Proximal Policy Optimization Algorithms
44 |
45 |
46 | However, without a constraint, if the action taken is much more probable in our current policy than in our former, **this would lead to a significant policy gradient step** and, therefore, an **excessive policy update.**
47 |
48 | ## The clipped Part of the Clipped Surrogate Objective function
49 |
50 |
51 |
52 | Consequently, we need to constrain this objective function by penalizing changes that lead to a ratio far away from 1 (in the paper, the ratio can only vary from 0.8 to 1.2).
53 |
54 | **By clipping the ratio, we ensure that we do not have a too large policy update because the current policy can't be too different from the older one.**
55 |
56 | To do that, we have two solutions:
57 |
58 | - *TRPO (Trust Region Policy Optimization)* uses KL divergence constraints outside the objective function to constrain the policy update. But this method **is complicated to implement and takes more computation time.**
59 | - *PPO* clips the probability ratio directly in the objective function with its **Clipped surrogate objective function.**
60 |
61 |
62 |
63 | This clipped part is a version where \\( r_t(\theta) \\) is clipped between \\( [1 - \epsilon, 1 + \epsilon] \\).
64 |
65 | With the Clipped Surrogate Objective function, we have two probability ratios: one non-clipped and one clipped to the range \\( [1 - \epsilon, 1 + \epsilon] \\), where epsilon is a hyperparameter that defines this clip range (in the paper, \\( \epsilon = 0.2 \\)).
66 |
67 | Then, we take the minimum of the clipped and non-clipped objective, **so the final objective is a lower bound (pessimistic bound) of the unclipped objective.**
68 |
69 | Taking the minimum of the clipped and non-clipped objective means **we'll select either the clipped or the non-clipped objective based on the ratio and advantage situation**.
70 |
--------------------------------------------------------------------------------
/units/en/unit8/conclusion-sf.mdx:
--------------------------------------------------------------------------------
1 | # Conclusion
2 |
3 | That's all for today. Congrats on finishing this Unit and the tutorial! ⭐️
4 |
5 | Now that you've successfully trained your Doom agent, why not try deathmatch? Remember, that's a much more complex level than the one you've just trained on, **but it's a nice experiment and I advise you to try it.**
6 |
7 | If you do it, don't hesitate to share your model in the `#rl-i-made-this` channel in our [discord server](https://www.hf.co/join/discord).
8 |
9 | This concludes the last unit, but we are not finished yet! 🤗 The following **bonus unit includes some of the most interesting, advanced, and cutting edge work in Deep Reinforcement Learning**.
10 |
11 | See you next time 🔥
12 |
13 | ## Keep Learning, Stay awesome 🤗
14 |
--------------------------------------------------------------------------------
/units/en/unit8/conclusion.mdx:
--------------------------------------------------------------------------------
1 | # Conclusion [[Conclusion]]
2 |
3 | That’s all for today. Congrats on finishing this unit and the tutorial!
4 |
5 | The best way to learn is to practice and try stuff. **Why not improve the implementation to handle frames as input?**
6 |
7 | See you in the second part of this Unit 🔥
8 |
9 | ## Keep Learning, Stay awesome 🤗
10 |
--------------------------------------------------------------------------------
/units/en/unit8/introduction-sf.mdx:
--------------------------------------------------------------------------------
1 | # Introduction to PPO with Sample-Factory
2 |
3 |
4 |
5 | In this second part of Unit 8, we'll get deeper into PPO optimization by using [Sample-Factory](https://samplefactory.dev/), an **asynchronous implementation of the PPO algorithm**, to train our agent to play [vizdoom](https://vizdoom.cs.put.edu.pl/) (an open source version of Doom).
6 |
7 | In the notebook, **you'll train your agent to play the Health Gathering level**, where the agent must collect health packs to avoid dying. After that, you can **train your agent to play more complex levels, such as Deathmatch**.
8 |
9 |
10 |
11 | Sound exciting? Let's get started! 🚀
12 |
13 | The hands-on was made by [Edward Beeching](https://twitter.com/edwardbeeching), a Machine Learning Research Scientist at Hugging Face. He worked on Godot Reinforcement Learning Agents, an open-source interface for developing environments and agents in the Godot Game Engine.
14 |
--------------------------------------------------------------------------------
/units/en/unit8/introduction.mdx:
--------------------------------------------------------------------------------
1 | # Introduction [[introduction]]
2 |
3 |
4 |
5 | In Unit 6, we learned about Advantage Actor Critic (A2C), a hybrid architecture combining value-based and policy-based methods that helps to stabilize the training by reducing the variance with:
6 |
7 | - *An Actor* that controls **how our agent behaves** (policy-based method).
8 | - *A Critic* that measures **how good the action taken is** (value-based method).
9 |
10 | Today we'll learn about Proximal Policy Optimization (PPO), an architecture that **improves our agent's training stability by avoiding policy updates that are too large**. To do that, we use a ratio that indicates the difference between our current and old policy and clip this ratio to a specific range \\( [1 - \epsilon, 1 + \epsilon] \\) .
11 |
12 | Doing this will ensure **that our policy update will not be too large and that the training is more stable.**
13 |
14 | This Unit is in two parts:
15 | - In this first part, you'll learn the theory behind PPO and code your PPO agent from scratch using the [CleanRL](https://github.com/vwxyzjn/cleanrl) implementation. To test its robustness you'll use LunarLander-v2. LunarLander-v2 **is the first environment you used when you started this course**. At that time, you didn't know how PPO worked, and now, **you can code it from scratch and train it. How incredible is that 🤩**.
16 | - In the second part, we'll get deeper into PPO optimization by using [Sample-Factory](https://samplefactory.dev/) and train an agent playing vizdoom (an open source version of Doom).
17 |
18 |
19 |
20 | These are the environments you're going to use to train your agents: VizDoom environments
21 |
22 |
23 | Sound exciting? Let's get started! 🚀
24 |
--------------------------------------------------------------------------------
/units/en/unit8/intuition-behind-ppo.mdx:
--------------------------------------------------------------------------------
1 | # The intuition behind PPO [[the-intuition-behind-ppo]]
2 |
3 |
4 | The idea with Proximal Policy Optimization (PPO) is that we want to improve the training stability of the policy by limiting the change you make to the policy at each training epoch: **we want to avoid having too large of a policy update.**
5 |
6 | For two reasons:
7 | - We know empirically that smaller policy updates during training are **more likely to converge to an optimal solution.**
8 | - A too-big step in a policy update can result in falling “off the cliff” (getting a bad policy), **taking a long time to recover, or even never recovering at all.**
9 |
10 |
11 |
12 | Taking smaller policy updates to improve the training stability
13 | Modified version from RL — Proximal Policy Optimization (PPO) Explained by Jonathan Hui
14 |
15 |
16 | **So with PPO, we update the policy conservatively**. To do so, we need to measure how much the current policy changed compared to the former one using a ratio calculation between the current and former policy. And we clip this ratio in a range \\( [1 - \epsilon, 1 + \epsilon] \\), meaning that we **remove the incentive for the current policy to go too far from the old one (hence the proximal policy term).**
17 |
--------------------------------------------------------------------------------
/units/en/unitbonus1/conclusion.mdx:
--------------------------------------------------------------------------------
1 | # Conclusion [[conclusion]]
2 |
3 | Congrats on finishing this bonus unit!
4 |
5 | You can now sit and enjoy playing with your Huggy 🐶. And don't **forget to spread the love by sharing Huggy with your friends 🤗**. And if you share about it on social media, **please tag us @huggingface and me @simoninithomas**
6 |
7 |
8 |
9 | Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then please 👉 [fill out this form](https://forms.gle/BzKXWzLAGZESGNaE9)
10 |
11 | ### Keep Learning, stay awesome 🤗
12 |
13 |
--------------------------------------------------------------------------------
/units/en/unitbonus1/how-huggy-works.mdx:
--------------------------------------------------------------------------------
1 | # How Huggy works [[how-huggy-works]]
2 |
3 | Huggy is a Deep Reinforcement Learning environment made by Hugging Face and based on [Puppo the Corgi, a project by the Unity MLAgents team](https://blog.unity.com/technology/puppo-the-corgi-cuteness-overload-with-the-unity-ml-agents-toolkit).
4 | This environment was created using the [Unity game engine](https://unity.com/) and [MLAgents](https://github.com/Unity-Technologies/ml-agents). ML-Agents is a toolkit for the game engine from Unity that allows us to **create environments using Unity or use pre-made environments to train our agents**.
5 |
6 |
7 |
8 | In this environment we aim to train Huggy to **fetch the stick we throw. This means he needs to move correctly toward the stick**.
9 |
10 | ## The State Space, what Huggy perceives [[state-space]]
11 | Huggy doesn't "see" his environment. Instead, we provide him information about the environment:
12 |
13 | - The target (stick) position
14 | - The relative position between himself and the target
15 | - The orientation of his legs.
16 |
17 | Given all this information, Huggy can **use his policy to determine which action to take next to fulfill his goal**.
18 |
19 | ## The Action Space, what moves Huggy can perform [[action-space]]
20 |
21 |
22 | **Joint motors drive Huggy's legs**. This means that to get the target, Huggy needs to **learn to rotate the joint motors of each of his legs correctly so he can move**.
23 |
24 | ## The Reward Function [[reward-function]]
25 |
26 | The reward function is designed so that **Huggy will fulfill his goal**: fetch the stick.
27 |
28 | Remember that one of the foundations of Reinforcement Learning is the *reward hypothesis*: a goal can be described as the **maximization of the expected cumulative reward**.
29 |
30 | Here, our goal is that Huggy **goes towards the stick but without spinning too much**. Hence, our reward function must translate this goal.
31 |
32 | Our reward function:
33 |
34 |
35 |
36 | - *Orientation bonus*: we **reward him for getting close to the target**.
37 | - *Time penalty*: a fixed-time penalty given at every action to **force him to get to the stick as fast as possible**.
38 | - *Rotation penalty*: we penalize Huggy if **he spins too much and turns too quickly**.
39 | - *Getting to the target reward*: we reward Huggy for **reaching the target**.
40 |
41 | If you want to see what this reward function looks like mathematically, check [Puppo the Corgi presentation](https://blog.unity.com/technology/puppo-the-corgi-cuteness-overload-with-the-unity-ml-agents-toolkit).
42 |
43 | ## Train Huggy
44 |
45 | Huggy aims **to learn to run correctly and as fast as possible toward the goal**. To do that, at every step and given the environment observation, he needs to decide how to rotate each joint motor of his legs to move correctly (not spinning too much) and towards the goal.
46 |
47 | The training loop looks like this:
48 |
49 |
50 |
51 |
52 | The training environment looks like this:
53 |
54 |
55 |
56 |
57 | It's a place where a **stick is spawned randomly**. When Huggy reaches it, the stick gets spawned somewhere else.
58 | We built **multiple copies of the environment for the training**. This helps speed up the training by providing more diverse experiences.
59 |
60 |
61 |
62 | Now that you have the big picture of the environment, you're ready to train Huggy to fetch the stick.
63 |
64 | To do that, we're going to use [MLAgents](https://github.com/Unity-Technologies/ml-agents). Don't worry if you have never used it before. In this unit we'll use Google Colab to train Huggy, and then you'll be able to load your trained Huggy and play with him directly in the browser.
65 |
66 | In a future unit, we will study MLAgents more in-depth and see how it works. But for now, we keep things simple by just using the provided implementation.
67 |
--------------------------------------------------------------------------------
/units/en/unitbonus1/introduction.mdx:
--------------------------------------------------------------------------------
1 | # Introduction [[introduction]]
2 |
3 | In this bonus unit, we'll reinforce what we learned in the first unit by teaching Huggy the Dog to fetch the stick and then [play with him directly in your browser](https://huggingface.co/spaces/ThomasSimonini/Huggy) 🐶
4 |
5 |
6 |
7 | So let's get started 🚀
8 |
--------------------------------------------------------------------------------
/units/en/unitbonus1/play.mdx:
--------------------------------------------------------------------------------
1 | # Play with Huggy [[play]]
2 |
3 | Now that you've trained Huggy and pushed it to the Hub, **you can play with him ❤️**
4 |
5 | For this step it’s simple:
6 |
7 | - Open the Huggy game in your browser: https://huggingface.co/spaces/ThomasSimonini/Huggy
8 | - Click on Play with my Huggy model
9 |
10 |
11 |
12 | 1. In step 1, choose your model repository, which is the model id (in my case ThomasSimonini/ppo-Huggy).
13 |
14 | 2. In step 2, **choose which model you want to replay**:
15 | - I have multiple ones, since we saved a model every 500000 timesteps.
16 | - But if I want the most recent one I choose Huggy.onnx
17 |
18 | 👉 It's good to **try with different model checkpoints to see the improvement of the agent.**
19 |
--------------------------------------------------------------------------------
/units/en/unitbonus2/hands-on.mdx:
--------------------------------------------------------------------------------
1 | # Hands-on [[hands-on]]
2 |
3 | Now that you've learned to use Optuna, here are some ideas to apply what you've learned:
4 |
5 | 1️⃣ **Beat your LunarLander-v2 agent results**, by using Optuna to find a better set of hyperparameters. You can also try with another environment, such as MountainCar-v0 and CartPole-v1.
6 |
7 | 2️⃣ **Beat your SpaceInvaders agent results**.
8 |
9 | By doing this, you'll see how valuable and powerful Optuna can be in training better agents.
10 |
11 | Have fun!
12 |
13 | Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then please 👉 [fill out this form](https://forms.gle/BzKXWzLAGZESGNaE9)
14 |
15 | ### Keep Learning, stay awesome 🤗
16 |
17 |
--------------------------------------------------------------------------------
/units/en/unitbonus2/introduction.mdx:
--------------------------------------------------------------------------------
1 | # Introduction [[introduction]]
2 |
3 | One of the most critical tasks in Deep Reinforcement Learning is to **find a good set of training hyperparameters**.
4 |
5 |
6 |
7 | [Optuna](https://optuna.org/) is a library that helps you to automate the search. In this Unit, we'll study a **little bit of the theory behind automatic hyperparameter tuning**. We'll first try to optimize the parameters of the DQN studied in the last unit manually. We'll then **learn how to automate the search using Optuna**.
8 |
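To give a first taste of the API before the tutorial, here is a minimal Optuna example on a toy objective (not an RL agent yet; the search space is arbitrary):

```python
import optuna

def objective(trial):
    # Sample a candidate learning rate and return a toy score to minimize
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True)
    return (learning_rate - 3e-4) ** 2

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```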
--------------------------------------------------------------------------------
/units/en/unitbonus2/optuna.mdx:
--------------------------------------------------------------------------------
1 | # Optuna Tutorial [[optuna]]
2 |
3 | The content below comes from [Antonin Raffin's ICRA 2022 presentations](https://araffin.github.io/tools-for-robotic-rl-icra2022/). He is one of the founders of Stable-Baselines and RL-Baselines3-Zoo.
4 |
5 |
6 | ## The theory behind Hyperparameter tuning
7 |
8 |
9 |
10 |
11 | ## Optuna Tutorial
12 |
13 |
14 |
15 | The notebook 👉 [here](https://colab.research.google.com/github/araffin/tools-for-robotic-rl-icra2022/blob/main/notebooks/optuna_lab.ipynb)
16 |
--------------------------------------------------------------------------------
/units/en/unitbonus3/curriculum-learning.mdx:
--------------------------------------------------------------------------------
1 | # (Automatic) Curriculum Learning for RL
2 |
3 | While most of the RL methods seen in this course work well in practice, there are some cases where using them alone fails. This can happen, for instance, when:
4 |
5 | - the task to learn is hard and requires an **incremental acquisition of skills** (for instance when one wants to make a bipedal agent learn to go through hard obstacles, it must first learn to stand, then walk, then maybe jump…)
6 | - there are variations in the environment (that affect the difficulty) and one wants its agent to be **robust** to them
7 |
8 |
9 |
10 |
11 | TeachMyAgent
12 |
13 |
14 | In such cases, it seems needed to propose different tasks to our RL agent and organize them such that the agent progressively acquires skills. This approach is called **Curriculum Learning** and usually implies a hand-designed curriculum (or set of tasks organized in a specific order). In practice, one can, for instance, control the generation of the environment, the initial states, or use Self-Play and control the level of opponents proposed to the RL agent.
15 |
16 | As designing such a curriculum is not always trivial, the field of **Automatic Curriculum Learning (ACL) proposes to design approaches that learn to create such an organization of tasks in order to maximize the RL agent’s performance**. Portelas et al. proposed to define ACL as:
17 |
18 | > … a family of mechanisms that automatically adapt the distribution of training data by learning to adjust the selection of learning situations to the capabilities of RL agents.
19 | >
20 |
21 | As an example, OpenAI used **Domain Randomization** (they applied random variations on the environment) to make a robot hand solve Rubik’s Cubes.
22 |
23 |
24 |
25 |
26 | OpenAI - Solving Rubik’s Cube with a Robot Hand
27 |
28 |
29 | Finally, you can play with the robustness of agents trained in the TeachMyAgent benchmark by controlling environment variations or even drawing the terrain 👇
30 |
31 |
32 |
33 | https://huggingface.co/spaces/flowers-team/Interactive_DeepRL_Demo
34 |
35 |
36 |
37 | ## Further reading
38 |
39 | For more information, we recommend that you check out the following resources:
40 |
41 | ### Overview of the field
42 |
43 | - [Automatic Curriculum Learning For Deep RL: A Short Survey](https://arxiv.org/pdf/2003.04664.pdf)
44 | - [Curriculum for Reinforcement Learning](https://lilianweng.github.io/posts/2020-01-29-curriculum-rl/)
45 |
46 | ### Recent methods
47 |
48 | - [Evolving Curricula with Regret-Based Environment Design](https://arxiv.org/abs/2203.01302)
49 | - [Curriculum Reinforcement Learning via Constrained Optimal Transport](https://proceedings.mlr.press/v162/klink22a.html)
50 | - [Prioritized Level Replay](https://arxiv.org/abs/2010.03934)
51 |
52 | ## Author
53 |
54 | This section was written by Clément Romac
55 |
--------------------------------------------------------------------------------
/units/en/unitbonus3/decision-transformers.mdx:
--------------------------------------------------------------------------------
1 | # Decision Transformers
2 |
3 | The Decision Transformer model was introduced by ["Decision Transformer: Reinforcement Learning via Sequence Modeling” by Chen L. et al](https://arxiv.org/abs/2106.01345). It abstracts Reinforcement Learning as a conditional-sequence modeling problem.
4 |
5 | The main idea is that instead of training a policy using RL methods, such as fitting a value function, that will tell us what action to take to maximize the return (cumulative reward), **we use a sequence modeling algorithm (Transformer) that, given a desired return, past states, and actions, will generate future actions to achieve this desired return**.
6 | It’s an autoregressive model conditioned on the desired return, past states, and actions to generate future actions that achieve the desired return.
7 |
8 | This is a complete shift in the Reinforcement Learning paradigm since we use generative trajectory modeling (modeling the joint distribution of the sequence of states, actions, and rewards) to replace conventional RL algorithms. This means that in Decision Transformers, we don’t maximize the return but rather generate a series of future actions that achieve the desired return.
9 |
10 | The 🤗 Transformers team integrated the Decision Transformer, an Offline Reinforcement Learning method, into the library as well as the Hugging Face Hub.
11 |
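As a quick illustration of that integration, a pretrained Decision Transformer can be loaded from the Hub with a few lines (the checkpoint name below is the half-cheetah model used in the blogposts linked in this section; double-check it on the Hub):

```python
from transformers import DecisionTransformerModel

# Load a pretrained Decision Transformer checkpoint from the Hugging Face Hub
model = DecisionTransformerModel.from_pretrained(
    "edbeeching/decision-transformer-gym-halfcheetah-expert"
)
model.eval()

# At inference time, the model is conditioned on a desired return-to-go,
# past states and actions, and autoregressively predicts the next action.
```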
12 | ## Learn about Decision Transformers
13 |
14 | To learn more about Decision Transformers, you should read the blogpost we wrote about it [Introducing Decision Transformers on Hugging Face](https://huggingface.co/blog/decision-transformers)
15 |
16 | ## Train your first Decision Transformers
17 |
18 | Now that you understand how Decision Transformers work thanks to [Introducing Decision Transformers on Hugging Face](https://huggingface.co/blog/decision-transformers), you’re ready to learn to train your first Offline Decision Transformer model from scratch to make a half-cheetah run.
19 |
20 | Start the tutorial here 👉 https://huggingface.co/blog/train-decision-transformers
21 |
22 | ## Further reading
23 |
24 | For more information, we recommend that you check out the following resources:
25 |
26 | - [Decision Transformer: Reinforcement Learning via Sequence Modeling](https://arxiv.org/abs/2106.01345)
27 | - [Online Decision Transformer](https://arxiv.org/abs/2202.05607)
28 |
29 | ## Author
30 |
31 | This section was written by Edward Beeching
32 |
--------------------------------------------------------------------------------
/units/en/unitbonus3/generalisation.mdx:
--------------------------------------------------------------------------------
1 | # Generalization in Reinforcement Learning
2 |
3 | Generalization plays a pivotal role in the realm of Reinforcement Learning. While **RL algorithms demonstrate good performance in controlled environments**, the real world presents a **unique challenge due to its non-stationary and open-ended nature**.
4 |
5 | As a result, the development of RL algorithms that stay robust in the face of environmental variations, coupled with the capability to transfer and adapt to uncharted yet analogous tasks and settings, becomes fundamental for real world application of RL.
6 |
7 | If you're interested to dive deeper into this research subject, we recommend exploring the following resource:
8 |
9 | - [Generalization in Reinforcement Learning by Robert Kirk](https://robertkirk.github.io/2022/01/17/generalisation-in-reinforcement-learning-survey.html): this comprehensive survey provides an insightful **overview of the concept of generalization in RL**, making it an excellent starting point for your exploration.
10 |
11 | - [Improving Generalization in Reinforcement Learning using Policy Similarity Embeddings](https://blog.research.google/2021/09/improving-generalization-in.html?m=1)
12 |
13 |
--------------------------------------------------------------------------------
/units/en/unitbonus3/introduction.mdx:
--------------------------------------------------------------------------------
1 | # Introduction
2 |
3 |
4 |
5 |
6 | Congratulations on finishing this course! **You now have a solid background in Deep Reinforcement Learning**.
7 | But this course was just the beginning of your Deep Reinforcement Learning journey, there are so many subsections to discover. In this optional unit, we **give you resources to explore multiple concepts and research topics in Reinforcement Learning**.
8 |
9 | Contrary to other units, this unit is a collective work of multiple people from Hugging Face. We mention the author for each unit.
10 |
11 | Sound fun? Let's get started 🔥,
12 |
--------------------------------------------------------------------------------
/units/en/unitbonus3/language-models.mdx:
--------------------------------------------------------------------------------
1 | # Language models in RL
2 | ## LMs encode useful knowledge for agents
3 |
4 | **Language models** (LMs) can exhibit impressive abilities when manipulating text such as question-answering or even step-by-step reasoning. Additionally, their training on massive text corpora allowed them to **encode various types of knowledge including abstract ones about the physical rules of our world** (for instance what is possible to do with an object, what happens when one rotates an object…).
5 |
6 | A natural question recently studied was whether such knowledge could benefit agents such as robots when trying to solve everyday tasks. And while these works showed interesting results, the proposed agents lacked any learning method. **This limitation prevents these agents from adapting to the environment (e.g. fixing wrong knowledge) or learning new skills.**
7 |
8 |
9 |
10 | Source: Towards Helpful Robots: Grounding Language in Robotic Affordances
11 |
12 |
13 | ## LMs and RL
14 |
15 | There is therefore a potential synergy between LMs, which can bring knowledge about the world, and RL, which can align and correct this knowledge by interacting with an environment. It is especially interesting from an RL point of view, as the RL field mostly relies on the **tabula rasa** setup, where everything is learned from scratch by the agent, leading to:
16 |
17 | 1) Sample inefficiency
18 |
19 | 2) Behaviors that look unexpected to human eyes
20 |
21 | As a first attempt, the paper [“Grounding Large Language Models with Online Reinforcement Learning”](https://arxiv.org/abs/2302.02662v1) tackled the problem of **adapting or aligning an LM to a textual environment using PPO**. They showed that the knowledge encoded in the LM led to fast adaptation to the environment (opening avenues for sample-efficient RL agents), but also that such knowledge allowed the LM to better generalize to new tasks once aligned.
22 |
23 |
24 |
25 | Another direction studied in [“Guiding Pretraining in Reinforcement Learning with Large Language Models”](https://arxiv.org/abs/2302.06692) was to keep the LM frozen but leverage its knowledge to **guide an RL agent’s exploration**. Such a method allows the RL agent to be guided towards human-meaningful and plausibly useful behaviors without requiring a human in the loop during training.
26 |
27 |
28 |
29 | Source: Towards Helpful Robots: Grounding Language in Robotic Affordances
30 |
31 |
32 | Several limitations make these works still very preliminary, such as the need to convert the agent's observation to text before giving it to an LM, as well as the compute cost of interacting with very large LMs.
33 |
34 | ## Further reading
35 |
36 | For more information we recommend you check out the following resources:
37 |
38 | - [Google Research, 2022 & beyond: Robotics](https://ai.googleblog.com/2023/02/google-research-2022-beyond-robotics.html)
39 | - [Pre-Trained Language Models for Interactive Decision-Making](https://arxiv.org/abs/2202.01771)
40 | - [Grounding Large Language Models with Online Reinforcement Learning](https://arxiv.org/abs/2302.02662v1)
41 | - [Guiding Pretraining in Reinforcement Learning with Large Language Models](https://arxiv.org/abs/2302.06692)
42 |
43 | ## Author
44 |
45 | This section was written by Clément Romac
46 |
--------------------------------------------------------------------------------
/units/en/unitbonus3/learning-agents.mdx:
--------------------------------------------------------------------------------
1 | # An Introduction to Unreal Learning Agents
2 |
3 | [Learning Agents](https://dev.epicgames.com/community/learning/tutorials/8OWY/unreal-engine-learning-agents-introduction) is an Unreal Engine (UE) plugin that allows you **to train AI characters using machine learning (ML) in Unreal**.
4 |
5 | It's an exciting new plugin where you can create unique environments using Unreal Engine and train your agents.
6 |
7 | Let's see how you can **get started and train a car to drive in an Unreal Engine Environment**.
8 |
9 |
10 |
11 | Source: [Learning Agents Driving Car Tutorial](https://dev.epicgames.com/community/learning/tutorials/qj2O/unreal-engine-learning-to-drive)
12 |
13 |
14 | ## Case 1: I don't know anything about Unreal Engine
15 | If you're new to Unreal Engine, don't be scared! We've listed two courses you should follow to get ready to use Learning Agents:
16 |
17 | 1. Master the Basics: Begin by watching this course [your first hour in Unreal Engine 5](https://dev.epicgames.com/community/learning/courses/ZpX/your-first-hour-in-unreal-engine-5/E7L/introduction-to-your-first-hour-in-unreal-engine-5). This comprehensive course will **lay down the foundational knowledge you need to use Unreal**.
18 |
19 | 2. Dive into Blueprints: Explore the world of Blueprints, the visual scripting component of Unreal Engine. [This video course](https://youtu.be/W0brCeJNMqk?si=zy4t4t1l6FMIzbpz) will familiarize you with this essential tool.
20 |
21 | Armed with the basics, **you're now prepared to play with Learning Agents**:
22 |
23 | 3. Get the Big Picture of Learning Agents by [reading this informative overview](https://dev.epicgames.com/community/learning/tutorials/8OWY/unreal-engine-learning-agents-introduction).
24 |
25 | 4. [Teach a Car to Drive using Reinforcement Learning in Learning Agents](https://dev.epicgames.com/community/learning/tutorials/qj2O/unreal-engine-learning-to-drive).
26 |
27 | 5. [Check Imitation Learning with the Unreal Engine 5.3 Learning Agents Plugin](https://www.youtube.com/watch?v=NwYUNlFvajQ)
28 |
29 | ## Case 2: I'm familiar with Unreal
30 |
31 | For those already acquainted with Unreal Engine, you can jump straight into Learning Agents with these two tutorials:
32 |
33 | 1. Get the Big Picture of Learning Agents by [reading this informative overview](https://dev.epicgames.com/community/learning/tutorials/8OWY/unreal-engine-learning-agents-introduction).
34 |
35 | 2. [Teach a Car to Drive using Reinforcement Learning in Learning Agents](https://dev.epicgames.com/community/learning/tutorials/qj2O/unreal-engine-learning-to-drive).
36 |
37 | 3. [Check Imitation Learning with the Unreal Engine 5.3 Learning Agents Plugin](https://www.youtube.com/watch?v=NwYUNlFvajQ)
--------------------------------------------------------------------------------
/units/en/unitbonus3/model-based.mdx:
--------------------------------------------------------------------------------
1 | # Model Based Reinforcement Learning (MBRL)
2 |
3 | Model-based reinforcement learning differs from its model-free counterpart only in that it learns a *dynamics model*, but that difference has substantial downstream effects on how decisions are made.
4 |
5 | The dynamics model usually models the environment transition dynamics, \\( s_{t+1} = f_\theta (s_t, a_t) \\), but things like inverse dynamics models (mapping from states to actions) or reward models (predicting rewards) can be used in this framework.
6 |
7 |
8 | ## Simple definition
9 |
10 | - There is an agent that repeatedly tries to solve a problem, **accumulating state and action data**.
11 | - With that data, the agent creates a structured learning tool, *a dynamics model*, to reason about the world.
12 | - With the dynamics model, the agent **decides how to act by predicting the future**.
13 | - With those actions, **the agent collects more data, improves said model, and hopefully improves future actions**.
14 |
15 | ## Academic definition
16 |
17 | Model-based reinforcement learning (MBRL) follows the framework of an agent interacting in an environment, **learning a model of said environment**, and then **leveraging the model for control (making decisions)**.
18 |
19 | Specifically, the agent acts in a Markov Decision Process (MDP) governed by a transition function \\( s_{t+1} = f (s_t , a_t) \\) and receives a reward at each step \\( r(s_t, a_t) \\). With a collected dataset \\( D := \\{ s_i, a_i, s_{i+1}, r_i \\} \\), the agent learns a model, \\( s_{t+1} = f_\theta (s_t , a_t) \\), **to minimize the negative log-likelihood of the transitions**.
20 |
21 | We employ sample-based model-predictive control (MPC) using the learned dynamics model, which optimizes the expected reward over a finite, recursively predicted horizon, \\( \tau \\), from a set of actions sampled from a uniform distribution \\( U(a) \\), (see [paper](https://arxiv.org/pdf/2002.04523) or [paper](https://arxiv.org/pdf/2012.09156.pdf) or [paper](https://arxiv.org/pdf/2009.01221.pdf)).
22 |
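Below is a minimal sketch of the sample-based MPC loop described above. A hand-written toy dynamics model and reward stand in for the learned \\( f_\theta \\), and the horizon, sample count, and 1-D point-mass task are arbitrary choices for illustration.

```python
# Random-shooting MPC: sample action sequences from U(a), roll them out through the
# (here: toy) dynamics model, and execute the first action of the best sequence.
import numpy as np

def random_shooting_mpc(dynamics_model, reward_fn, state, horizon=15, n_samples=500,
                        action_low=-1.0, action_high=1.0, action_dim=1):
    """Return the first action of the highest-return sampled action sequence."""
    actions = np.random.uniform(action_low, action_high,
                                size=(n_samples, horizon, action_dim))
    returns = np.zeros(n_samples)
    states = np.repeat(state[None, :], n_samples, axis=0)
    for t in range(horizon):
        returns += reward_fn(states, actions[:, t])
        states = dynamics_model(states, actions[:, t])   # s_{t+1} = f(s_t, a_t)
    best = np.argmax(returns)
    return actions[best, 0]

# Toy task: push a 1-D point mass toward the origin.
def toy_dynamics(s, a):                 # stand-in for the learned model f_theta
    return s + 0.1 * a

def toy_reward(s, a):
    return -np.sum(s ** 2, axis=-1)

state = np.array([2.0])
for _ in range(50):
    action = random_shooting_mpc(toy_dynamics, toy_reward, state)
    state = toy_dynamics(state[None, :], action[None, :])[0]
print("final state:", state)            # should end up close to 0
```
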
23 | ## Further reading
24 |
25 | For more information on MBRL, we recommend you check out the following resources:
26 |
27 | - A [blog post on debugging MBRL](https://www.natolambert.com/writing/debugging-mbrl).
28 | - A [recent review paper on MBRL](https://arxiv.org/abs/2006.16712).
29 |
30 | ## Author
31 |
32 | This section was written by Nathan Lambert
33 |
--------------------------------------------------------------------------------
/units/en/unitbonus3/offline-online.mdx:
--------------------------------------------------------------------------------
1 | # Offline vs. Online Reinforcement Learning
2 |
3 | Deep Reinforcement Learning (RL) is a framework **to build decision-making agents**. These agents aim to learn optimal behavior (policy) by interacting with the environment through **trial and error and receiving rewards as unique feedback**.
4 |
5 | The agent’s goal **is to maximize its cumulative reward**, called the return. This is because RL is based on the *reward hypothesis*: all goals can be described as the **maximization of the expected cumulative reward**.
6 |
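As a reminder, the return from timestep \\( t \\) is the discounted sum of the rewards the agent collects afterwards: \\( R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + ... \\), where \\( \gamma \\) is the discount factor.
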
7 | Deep Reinforcement Learning agents **learn with batches of experience**. The question is: how do they collect it?
8 |
9 |
10 |
11 | A comparison between Reinforcement Learning in an Online and Offline setting, figure taken from this post
12 |
13 |
14 | - In *online reinforcement learning*, which is what we've learned during this course, the agent **gathers data directly**: it collects a batch of experience by **interacting with the environment**. Then, it uses this experience immediately (or via some replay buffer) to learn from it (update its policy).
15 |
16 | But this implies that you either **train your agent directly in the real world or have a simulator**. If you don’t have one, you need to build it, which can be very complex (how do you reflect the complex reality of the real world in an environment?), expensive, and risky (if the simulator has flaws, the agent will exploit them to gain an unintended advantage).
17 |
18 | - On the other hand, in *offline reinforcement learning*, the agent only **uses data collected from other agents or human demonstrations**. It does **not interact with the environment**.
19 |
20 | The process is as follows:
21 | - **Create a dataset** using one or more policies and/or human interactions.
22 | - Run **offline RL on this dataset** to learn a policy.
23 |
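As a rough illustration (using Gymnasium's CartPole and a random behavior policy as stand-ins for real demonstrations), the sketch below contrasts the two settings: in the online case the agent gathers data through its own interaction loop, while in the offline case the dataset is collected once, frozen, and handed to an offline RL algorithm that never calls the environment during training.

```python
# Data collection in the online vs. offline settings (illustrative stand-ins only;
# an actual offline RL algorithm such as CQL or IQL would then learn from `dataset`).
import gymnasium as gym

env = gym.make("CartPole-v1")

# --- Online setting: the agent interacts with the environment itself ---
def collect_transitions(policy, n_steps=1000):
    data, (obs, _) = [], env.reset()
    for _ in range(n_steps):
        action = policy(obs)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        data.append((obs, action, reward, next_obs))
        obs = next_obs if not (terminated or truncated) else env.reset()[0]
    return data

# --- Offline setting: the dataset comes from other policies or humans and is fixed;
# the learning agent never calls env.step() itself ---
behavior_policy = lambda obs: env.action_space.sample()       # stand-in for demonstrations
dataset = collect_transitions(behavior_policy, n_steps=5000)  # frozen once collected
print(f"Offline dataset: {len(dataset)} transitions")
```
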
24 | This method has one drawback: the *counterfactual queries problem*. What do we do if our agent **decides to do something for which we don’t have the data?** For instance, turning right at an intersection when we don’t have this trajectory in our data.
25 |
26 | Some solutions to this problem exist, but if you want to know more about offline reinforcement learning, you can [watch this video](https://www.youtube.com/watch?v=k08N5a0gG0A).
27 |
28 | ## Further reading
29 |
30 | For more information, we recommend you check out the following resources:
31 |
32 | - [Offline Reinforcement Learning, Talk by Sergey Levine](https://www.youtube.com/watch?v=qgZPZREor5I)
33 | - [Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems](https://arxiv.org/abs/2005.01643)
34 |
35 | ## Author
36 |
37 | This section was written by Thomas Simonini
38 |
--------------------------------------------------------------------------------
/units/en/unitbonus3/rl-documentation.mdx:
--------------------------------------------------------------------------------
1 | # Brief introduction to RL documentation
2 |
3 | In this advanced topic, we address the question: **how should we monitor and keep track of powerful reinforcement learning agents that we are training in the real world and
4 | interfacing with humans?**
5 |
6 | As machine learning systems have increasingly impacted modern life, the **call for the documentation of these systems has grown**.
7 |
8 | Such documentation can cover aspects such as the training data used — where it is stored, when it was collected, who was involved, etc.
9 | — or the model optimization framework — the architecture, evaluation metrics, relevant papers, etc. — and more.
10 |
11 | Today, model cards and datasheets are becoming increasingly available. For example, on the Hub
12 | (see documentation [here](https://huggingface.co/docs/hub/model-cards)).
13 |
14 | If you click on a [popular model on the Hub](https://huggingface.co/models), you can learn about its creation process.
15 |
16 | These model- and dataset-specific documents are designed to be completed when the model or dataset is created, which often leaves them un-updated when these models are later built into evolving systems.
17 |
18 | ## Motivating Reward Reports
19 |
20 | Reinforcement learning systems are fundamentally designed to optimize based on measurements of reward and time.
21 | While the notion of a reward function can be mapped nicely to many well-understood fields of supervised learning (via a loss function),
22 | understanding of how machine learning systems evolve over time is limited.
23 |
24 | To that end, the authors introduce [*Reward Reports for Reinforcement Learning*](https://arxiv.org/abs/2204.10817) (the pithy naming is designed to mirror the popular papers *Model Cards for Model Reporting* and *Datasheets for Datasets*).
25 | The goal is to propose a type of documentation focused on the **human factors of reward** and **time-varying feedback systems**.
26 |
27 | Building on the documentation frameworks for [model cards](https://arxiv.org/abs/1810.03993) and [datasheets](https://arxiv.org/abs/1803.09010) proposed by Mitchell et al. and Gebru et al., we argue the need for Reward Reports for AI systems.
28 |
29 | **Reward Reports** are living documents for proposed RL deployments that demarcate design choices.
30 |
31 | However, many questions remain about the applicability of this framework to different RL applications, roadblocks to system interpretability,
32 | and the resonances between deployed supervised machine learning systems and the sequential decision-making utilized in RL.
33 |
34 | At a minimum, Reward Reports are an opportunity for RL practitioners to deliberate on these questions and begin the work of deciding how to resolve them in practice.
35 |
36 | ## Capturing temporal behavior with documentation
37 |
38 | The core piece of documentation specific to RL and feedback-driven ML systems is a *change-log*. The change-log records updates
39 | from the designer (changed training parameters, data, etc.) along with changes noticed by users (harmful behavior, unexpected responses, etc.).
40 |
41 | The change log is accompanied by update triggers that encourage monitoring these effects.
42 |
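As a rough illustration, here is what a single change-log entry might capture in practice; the field names below are hypothetical and not taken from the Reward Reports specification.

```python
# Hypothetical change-log entry for a deployed RL system (illustrative fields only).
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ChangeLogEntry:
    entry_date: date
    author: str
    designer_changes: list[str] = field(default_factory=list)  # e.g. reward tweaks, new data
    observed_changes: list[str] = field(default_factory=list)  # e.g. harmful or unexpected behavior
    update_trigger: str = ""                                    # what prompted this entry

entry = ChangeLogEntry(
    entry_date=date(2023, 6, 1),
    author="ops-team",
    designer_changes=["Increased exploration bonus from 0.01 to 0.05"],
    observed_changes=["Recommendation diversity dropped for new users"],
    update_trigger="Weekly monitoring review",
)
print(entry)
```
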
43 | ## Contributing
44 |
45 | Some of the most impactful RL-driven systems are multi-stakeholder in nature and behind the closed doors of private corporations.
46 | These corporations are largely without regulation, so the burden of documentation falls on the public.
47 |
48 | If you are interested in contributing, we are building Reward Reports for popular machine learning systems on a public
49 | record on [GitHub](https://github.com/RewardReports/reward-reports).
50 |
51 | For further reading, you can visit the Reward Reports [paper](https://arxiv.org/abs/2204.10817)
52 | or look at [an example report](https://github.com/RewardReports/reward-reports/tree/main/examples).
53 |
54 | ## Author
55 |
56 | This section was written by Nathan Lambert
57 |
--------------------------------------------------------------------------------
/units/en/unitbonus3/student-works.mdx:
--------------------------------------------------------------------------------
1 | # Student Works
2 |
3 | Since the launch of the Deep Reinforcement Learning Course, **many students have created amazing projects that you should check out and consider participating in**.
4 |
5 | If you've created an interesting project, don't hesitate to [add it to this list by opening a pull request on the GitHub repository](https://github.com/huggingface/deep-rl-class).
6 |
7 | The projects on this page are **arranged by publication date**.
8 |
9 |
10 | ## Space Scavanger AI
11 |
12 | This project is a space game environment with a neural-network-controlled AI.
13 |
14 | The AI is trained with reinforcement learning, using the Unity ML-Agents and RLlib frameworks.
15 |
16 |
17 |
18 | Play the Game here 👉 https://swingshuffle.itch.io/spacescalvagerai
19 |
20 | Check the Unity project here 👉 https://github.com/HighExecutor/SpaceScalvagerAI
21 |
22 |
23 | ## Neural Nitro 🏎️
24 |
25 |
26 |
27 | In this project, Sookeyy created a low poly racing game and trained a car to drive.
28 |
29 | Check out the demo here 👉 https://sookeyy.itch.io/neuralnitro
30 |
31 |
32 | ## Space War 🚀
33 |
34 |
35 |
36 | In this project, Eric Dong recreates Bill Seiler's 1985 version of Space War in Pygame and uses reinforcement learning (RL) to train AI agents.
37 |
38 | This project is currently in development!
39 |
40 | ### Demo
41 |
42 | Dev/Edge version:
43 | * https://e-dong.itch.io/spacewar-dev
44 |
45 | Stable version:
46 | * https://e-dong.itch.io/spacewar
47 | * https://huggingface.co/spaces/EricofRL/SpaceWarRL
48 |
49 | ### Community blog posts
50 |
51 | TBA
52 |
53 | ### Other links
54 |
55 | Check out the source here 👉 https://github.com/e-dong/space-war-rl
56 | Check out his blog here 👉 https://dev.to/edong/space-war-rl-0-series-introduction-25dh
57 |
58 |
59 | ## Decision Transformers for Trading
60 |
61 | In this project, a student explored training a Decision Transformer for stock trading. In phase 1, offline training was implemented; online fine-tuning is planned for the next version.
62 |
63 |
64 |
65 |
66 |
67 |
68 | Source: Stanford CS25: V1 I Decision Transformer: Reinforcement Learning via Sequence Modeling
69 |
70 |
71 |
72 | Check out the source here 👉 https://github.com/ra9hur/Decision-Transformers-For-Trading
73 |
--------------------------------------------------------------------------------
/units/en/unitbonus5/conclusion.mdx:
--------------------------------------------------------------------------------
1 | # Conclusion:
2 |
3 | **Congratulations on finishing this bonus unit!** You have learned the process of recording expert demonstrations and training the agent using IL, which can be an alternative to training in-game agents with RL in some cases.
4 |
5 | This tutorial was written by [Ivan Dodic](https://github.com/Ivan-267). Thanks to [Edward Beeching](https://twitter.com/edwardbeeching) and [Thomas Simonini](https://twitter.com/thomassimonini) for their reviews and feedback.
--------------------------------------------------------------------------------
/units/en/unitbonus5/customize-the-environment.mdx:
--------------------------------------------------------------------------------
1 | # (Optional) How to customize the environment
2 |
3 | If you’d like to customize the game level, open the level scene `res://scenes/level.tscn`, then open the `res://scenes/modules/` folder in the Godot FileSystem:
4 |
5 |
6 |
7 | The level contains 3 rooms made using the modules, the robot, and some additional colliders that prevent completing the level by climbing a wall in the first room and reaching the key that way. By adding modules to the scene, you can add new rooms and items.
8 |
9 | If you click on the Key node (it’s in `Room3`, you can also search for it), then click on `Node > Signals`, you will see that the `collected` signal is connected to both the robot and the chest. We use this to track whether the robot has collected the key, and to unlock the chest. The same system is applied for using the lever to activate the stairs, and if you add more levers/stairs/keys, you can connect them using signals.
10 |
11 |
12 |
13 | If you switch to `Groups`, you will see that the key is a member of the `resetable` group. In the same group we have the raft, lever, chest, player, and can add any node that needs to be reset when the episode resets.
14 |
15 |
16 |
17 | For this to work, every object that is in the `resetable` group also needs to implement the `reset()` method, which takes care of resetting that object.
18 |
19 | Because we have multiple instances of the level scene for training, we don’t reset all `resetables`, but only those within the same scene. In `level_manager.gd`, we have a method `reset_all_resetables()` that takes care of this, and it is called by the robot script when resetting is needed.
20 |
21 | After changing the level size, updating the `level_size` variable in `robot_ai_controller.gd` is also needed. For this, just roughly measure the longest dimension of the level, and update the variable.
22 |
23 | If you change the number of objects that need to be tracked by the `AIController` (levers, rafts, etc.), you will need to update the relevant code in the script, include export properties for those objects, and then connect them in the inspector properties of the `AIController` in the level scene:
24 |
25 |
26 |
27 | After this, you may also need to update the same properties of the `AIController` in the demo record scene as well.
--------------------------------------------------------------------------------
/units/en/unitbonus5/introduction.mdx:
--------------------------------------------------------------------------------
1 | # Introduction:
2 |
3 |
4 |
5 | Welcome to this bonus unit, where you will **train a robot agent to complete a mini-game level using imitation learning.**
6 |
7 | At the end of the unit, **you will have a trained agent capable of solving the level as in the video**:
8 |
9 |
10 |
11 |
12 | ## Objectives:
13 |
14 | - Learn how to use imitation learning with Godot RL Agents by training an agent to complete a mini-game environment using human-recorded expert demonstrations.
15 |
16 | ## Prerequisites and requirements:
17 |
18 | - It is recommended that you complete the previous chapter ([Godot RL Agents](https://huggingface.co/learn/deep-rl-course/unitbonus3/godotrl)) before starting this tutorial,
19 | - Some familiarity with Godot is recommended, although completing the tutorial does not require any gdscript coding knowledge,
20 | - Godot with .NET support (tested to work with [4.3.dev5 .NET](https://godotengine.org/article/dev-snapshot-godot-4-3-dev-5/), may work with newer versions too),
21 | - Godot RL Agents (you can use `pip install godot-rl` in the venv/conda env),
22 | - [Imitation library](https://huggingface.co/learn/deep-rl-course/unitbonus5/train-our-robot),
23 | - Time: ~1-2 hours to complete the project and training. It can be outside of this range depending on the hardware used.
24 |
--------------------------------------------------------------------------------
/units/en/unitbonus5/the-environment.mdx:
--------------------------------------------------------------------------------
1 | # The environment
2 |
3 |
4 |
5 | The tutorial environment features a robot that needs to:
6 |
7 | - Pull a lever to raise the stairs leading to the second room,
8 | - Navigate to the key 🔑 and collect it while avoiding falling into traps, water, or off the map,
9 | - Navigate back to the treasure chest in the first room, and open it. Victory! 🏆
--------------------------------------------------------------------------------
/units/en/unitbonus5/train-our-robot.mdx:
--------------------------------------------------------------------------------
1 | # Train our robot
2 |
3 |
4 | To start training, we’ll first need to install the imitation library in the same venv / conda env where you installed Godot RL Agents: `pip install imitation`
5 |
6 |
7 | ### Download a copy of the [imitation learning](https://github.com/edbeeching/godot_rl_agents/blob/main/examples/sb3_imitation.py) script from the Godot RL Repository.
8 |
9 | ### Run training using the arguments below:
10 |
11 | ```bash
12 | python sb3_imitation.py --env_path="path_to_ILTutorial_executable" --bc_epochs=100 --gail_timesteps=1450000 --demo_files "path_to_expert_demos.json" --n_parallel=4 --speedup=20 --onnx_export_path=model.onnx --experiment_name=ILTutorial
13 | ```
14 |
15 | **Set the env path to the exported game, and demo files path to the recorded demos. If you have multiple demo files add them with a space in between, e.g. `--demo_files demos.json demos2.json`.**
16 |
17 | You can also set a large number of timesteps for `--gail_timesteps` and then manually stop training with `CTRL+C`. I used this method to stop training when the reward started to approach 3, which was at `total_timesteps | 1.38e+06`. That took ~41 minutes, and the BC pre-training took ~5.5 minutes on my PC, using the CPU for training.
18 |
19 | To observe the environment while training, add the `--viz` argument. For the duration of the BC training, the env will be frozen as this stage doesn’t use the env except to get some information about the observation and action spaces. During the GAIL training stage, the env rendering will update.
20 |
21 | Here are the `ep_rew_mean` and `ep_rew_wrapped_mean` stats from the logs displayed using [tensorboard](https://github.com/edbeeching/godot_rl_agents/blob/main/docs/TRAINING_STATISTICS.md); we can see that they closely match in this case:
22 |
23 |
25 |
26 |
27 |
28 | You can find the logs in `logs/ILTutorial` relative to the path you started training from. If making multiple runs, change the `--experiment_name` argument between each.
29 |
30 |
31 | Even though environment rewards are not necessary and are not used for training here, a simple sparse reward was implemented to track success. Falling off the map, into water, or into traps sets `reward += -1`, while activating the lever, collecting the key, and opening the chest each set `reward += 1`. If `ep_rew_mean` approaches 3, we are getting a good result. `ep_rew_wrapped_mean` is the reward from the GAIL discriminator, which does not directly tell us how successful the agent is at solving the environment.
32 |
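For reference, here is a small Python illustration of that sparse-reward logic (the project itself implements it in GDScript inside Godot; the function and flag names below are hypothetical):

```python
# Illustration of the sparse reward described above (hypothetical names, not project code).
def episode_reward(fell: bool, activated_lever: bool, collected_key: bool,
                   opened_chest: bool) -> float:
    reward = 0.0
    if fell:                      # off the map, into water, or into a trap
        reward += -1.0
    reward += 1.0 if activated_lever else 0.0
    reward += 1.0 if collected_key else 0.0
    reward += 1.0 if opened_chest else 0.0
    return reward                 # a fully successful episode approaches 3

print(episode_reward(False, True, True, True))   # -> 3.0
```
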
33 | ### Let’s test the trained agent
34 |
35 | After training, you’ll find a `model.onnx` file in the folder you started the training script from (you can also find the full path to the `.onnx` file in the training log in the console, near the end). **Copy it to the Godot game project folder.**
36 |
37 | ### Open the onnx inference scene
38 |
39 | This scene, like the demo record scene, uses only one copy of the level. It also has its `Sync` node mode set to `Onnx Inference`.
40 |
41 | **Click on the `Sync` node and set the `Onnx Model Path` property to `model.onnx`.**
42 |
43 |
45 |
46 | **Press F6 to start the scene and let’s see what the agent has learned!**
47 |
48 | Video of the trained agent:
49 |
50 |
51 | It seems the agent is capable of collecting the key from both positions (left platform or right platform) and replicates the recorded behavior well. **If you’re getting similar results, well done, you’ve successfully completed this tutorial!** 🏆👏
52 |
53 | If your results are significantly different, note that the amount and quality of recorded demos can affect the results, and adjusting the number of steps for BC/GAIL stages as well as modifying the hyper-parameters in the Python script can potentially help. There’s also some run-to-run variation, so sometimes the results can be slightly different even with the same settings.
54 |
--------------------------------------------------------------------------------