├── .gitignore ├── 01_taxi ├── README.md ├── notebooks │ ├── 00_environment.ipynb │ ├── 01_random_agent_baseline.ipynb │ ├── 02_q_agent.ipynb │ ├── 03_q_agent_hyperparameters_analysis.ipynb │ └── 04_homework.ipynb ├── pyproject.toml ├── requirements.txt └── src │ ├── __init__.py │ ├── loops.py │ ├── q_agent.py │ └── random_agent.py ├── 02_mountain_car ├── README.md ├── notebooks │ ├── 00_environment.ipynb │ ├── 01_random_agent_baseline.ipynb │ ├── 02_sarsa_agent.ipynb │ ├── 03_momentum_agent_baseline.ipynb │ └── 04_homework.ipynb ├── poetry.lock ├── pyproject.toml └── src │ ├── base_agent.py │ ├── config.py │ ├── loops.py │ ├── momentum_agent.py │ ├── random_agent.py │ ├── sarsa_agent.py │ └── viz.py ├── 03_cart_pole ├── .gitignore ├── README.md ├── images │ ├── deep_q_net.svg │ ├── hparams_search_diagram.svg │ ├── linear_model.jpg │ ├── linear_model_sml.jpg │ ├── neural_net.jpg │ ├── neural_net_homework.jpg │ ├── ngrok_example.png │ ├── nn_1_hidden_layer_sml.jpg │ ├── nn_2_hidden_layers_sml.jpg │ ├── nn_3_hidden_layers_sml.jpg │ └── optuna.png ├── mlflow_runs │ └── readme.md ├── notebooks │ ├── 00_environment.ipynb │ ├── 01_random_agent_baseline.ipynb │ ├── 02_linear_q_agent_bad_hyperparameters.ipynb │ ├── 03_linear_q_agent_good_hyperparameters.ipynb │ ├── 04_homework.ipynb │ ├── 05_crash_course_on_neural_nets.ipynb │ ├── 06_deep_q_agent_bad_hyperparameters.ipynb │ ├── 07_deep_q_agent_good_hyperparameters.ipynb │ ├── 08_homework.ipynb │ ├── 09_hyperparameter_search.ipynb │ ├── 10_homework.ipynb │ └── 11_hyperparameter_search_in_google_colab.ipynb ├── poetry.lock ├── pyproject.toml ├── requirements.txt ├── saved_agents │ ├── CartPole-v1 │ │ └── 0 │ │ │ ├── hparams.json │ │ │ └── model │ └── readme.md ├── src │ ├── __init__.py │ ├── agent_memory.py │ ├── config.py │ ├── loops.py │ ├── model_factory.py │ ├── optimize_hyperparameters.py │ ├── q_agent.py │ ├── random_agent.py │ ├── supervised_ml.py │ ├── utils.py │ └── viz.py └── tensorboard_logs │ ├── .gitignore │ └── readme.md ├── 04_lunar_lander ├── README.md ├── images │ └── policy_network.svg ├── notebooks │ ├── 01_random_agent_baseline.ipynb │ ├── 02_vanilla_policy_gradient_with_rewards_as_weights.ipynb │ ├── 03_vanilla_policy_gradient_with_rewards_to_go_as_weights.ipynb │ └── 04_homework.ipynb ├── pyproject.toml ├── requirements.txt ├── saved_agents │ └── readme.md ├── src │ ├── config.py │ ├── evaluation.py │ ├── model_factory.py │ ├── utils.py │ ├── viz.py │ └── vpg_agent.py └── tensorboard_logs │ └── readme.md ├── LICENSE └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | help/ 2 | logs/ 3 | snake/ 4 | **/.ipynb_checkpoints/ 5 | **/__pycache__/ 6 | **/tensorboard_logs/sml/ 7 | 02_mountain_car/saved_agents 8 | 03_cart_pole/saved_agents 9 | 04_cart_pole_tune_hparams_like_a_pro/ 10 | -------------------------------------------------------------------------------- /01_taxi/README.md: -------------------------------------------------------------------------------- 1 |
2 | # Q-learning to drive a taxi 🚕
3 | > You talkin' to me?
4 | > -- Robert de Niro (Taxi driver)
5 |
6 |
7 |
8 |
9 |
10 | *Venice’s taxis 👆 by Helena Jankovičová Kováčová from Pexels 🙏*
11 | 12 | ## Table of Contents 13 | * [Welcome 🤗](#welcome-) 14 | * [Quick setup](#quick-setup) 15 | * [Lecture transcripts](#lecture-transcripts) 16 | * [Notebooks](#notebooks) 17 | * [Let's connect](#lets-connect) 18 | 19 | ## Welcome 🤗 20 | This is part 1 of the Hands-on RL course. 21 | 22 | Let's use (tabular) Q-learning to teach an agent to solve the [Taxi-v3](https://gym.openai.com/envs/Taxi-v3/) environment 23 | from OpenAI gym. 24 | 25 | Fasten your seat belt and get ready. We are ready to depart! 26 | 27 | 28 | ## Quick setup 29 | 30 | Make sure you have Python >= 3.7. Otherwise, update it. 31 | 32 | 1. Pull the code from GitHub and cd into the `01_taxi` folder: 33 | ``` 34 | $ git clone https://github.com/Paulescu/hands-on-rl.git 35 | $ cd hands-on-rl/01_taxi 36 | ``` 37 | 38 | 2. Make sure you have the `virtualenv` tool in your Python installation 39 | ``` 40 | $ pip3 install virtualenv 41 | ``` 42 | 43 | 3. Create a virtual environment and activate it. 44 | ``` 45 | $ virtualenv -p python3 venv 46 | $ source venv/bin/activate 47 | ``` 48 | 49 | From this point onwards commands run inside the virtual environment. 50 | 51 | 52 | 3. Install dependencies and code from `src` folder in editable mode, so you can experiment with the code. 53 | ``` 54 | $ (venv) pip install -r requirements.txt 55 | $ (venv) export PYTHONPATH="." 56 | ``` 57 | 58 | 4. Open the notebooks, either with good old Jupyter or Jupyter lab 59 | ``` 60 | $ (venv) jupyter notebook 61 | ``` 62 | ``` 63 | $ (venv) jupyter lab 64 | ``` 65 | If both launch commands fail, try these: 66 | ``` 67 | $ (venv) jupyter notebook --NotebookApp.use_redirect_file=False 68 | ``` 69 | ``` 70 | $ (venv) jupyter lab --NotebookApp.use_redirect_file=False 71 | ``` 72 | 73 | 5. Play and learn. And do the homework 😉. 74 | 75 | 76 | ## Lecture transcripts 77 | 78 | [📝 Q learning](http://datamachines.xyz/2021/12/06/hands-on-reinforcement-learning-course-part-2-q-learning/) 79 | 80 | 81 | ## Notebooks 82 | 83 | - [Explore the environment](notebooks/00_environment.ipynb) 84 | - [Random agent baseline](notebooks/01_random_agent_baseline.ipynb) 85 | - [Q-agent](notebooks/02_q_agent.ipynb) 86 | - [Hyperparameter tuning](notebooks/03_q_agent_hyperparameters_analysis.ipynb) 87 | - [Homework](notebooks/04_homework.ipynb) 88 | 89 | ## Let's connect! 90 | 91 | Do you wanna become a PRO in Machine Learning? 92 | 93 | 👉🏽 Subscribe to the [datamachines newsletter](https://datamachines.xyz/subscribe/). 94 | 95 | 👉🏽 Follow me on [Medium](https://pau-labarta-bajo.medium.com/). 
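
## Bonus: a minimal end-to-end sketch

The snippet below is a minimal sketch of how the pieces in `src/` fit together. It mirrors the `__main__` block at the bottom of `src/loops.py`, so the names (`QAgent`, `train`, `evaluate`) and the hyper-parameter values come straight from this repo.

```python
import gym
import numpy as np

from src.q_agent import QAgent
from src.loops import train, evaluate

env = gym.make("Taxi-v3").env

# alpha = learning rate, gamma = discount factor
agent = QAgent(env, alpha=0.1, gamma=0.6)

# train with epsilon-greedy exploration
agent, _, _ = train(agent, env, n_episodes=10000, epsilon=0.10)

# evaluate the learned policy (a small epsilon keeps a bit of exploration)
steps, penalties, _ = evaluate(agent, env, n_episodes=100, epsilon=0.05)
print(f'Avg steps to complete ride: {np.array(steps).mean():.1f}')
print(f'Avg penalties per ride: {np.array(penalties).mean():.2f}')
```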
96 | 97 | 98 | 99 | 100 | 101 | -------------------------------------------------------------------------------- /01_taxi/notebooks/00_environment.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "4c2ff31f", 6 | "metadata": {}, 7 | "source": [ 8 | "# 00 Environment" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "id": "04a5c882", 14 | "metadata": {}, 15 | "source": [ 16 | "#### 👉Before you solve a Reinforcement Learning problem you need to define what are\n", 17 | "- the actions\n", 18 | "- the states of the world\n", 19 | "- the rewards\n", 20 | "\n", 21 | "#### 👉We are using the `Taxi-v3` environment from OpenAI's gym: https://gym.openai.com/envs/Taxi-v3/\n", 22 | "\n", 23 | "#### 👉`Taxi-v3` is an easy environment because the action space is small, and the state space is large but finite.\n", 24 | "\n", 25 | "#### 👉Environments with a finite number of actions and states are called tabular" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 3, 31 | "id": "e3629346", 32 | "metadata": {}, 33 | "outputs": [ 34 | { 35 | "name": "stdout", 36 | "output_type": "stream", 37 | "text": [ 38 | "The autoreload extension is already loaded. To reload it, use:\n", 39 | " %reload_ext autoreload\n", 40 | "Populating the interactive namespace from numpy and matplotlib\n" 41 | ] 42 | } 43 | ], 44 | "source": [ 45 | "%load_ext autoreload\n", 46 | "%autoreload 2\n", 47 | "%pylab inline\n", 48 | "%config InlineBackend.figure_format = 'svg'" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "id": "76e9a06d", 54 | "metadata": {}, 55 | "source": [ 56 | "## Load the environment 🌎" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": 4, 62 | "id": "ebfba291", 63 | "metadata": {}, 64 | "outputs": [], 65 | "source": [ 66 | "import gym\n", 67 | "env = gym.make(\"Taxi-v3\").env" 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "id": "1fcfc13a", 73 | "metadata": {}, 74 | "source": [ 75 | "## Action space" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": 5, 81 | "id": "98cfdb84", 82 | "metadata": {}, 83 | "outputs": [ 84 | { 85 | "name": "stdout", 86 | "output_type": "stream", 87 | "text": [ 88 | "Action Space Discrete(6)\n" 89 | ] 90 | } 91 | ], 92 | "source": [ 93 | "print(\"Action Space {}\".format(env.action_space))" 94 | ] 95 | }, 96 | { 97 | "cell_type": "markdown", 98 | "id": "4f53a38e", 99 | "metadata": {}, 100 | "source": [ 101 | "## State space" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": 6, 107 | "id": "e809514b", 108 | "metadata": {}, 109 | "outputs": [ 110 | { 111 | "name": "stdout", 112 | "output_type": "stream", 113 | "text": [ 114 | "State Space Discrete(500)\n" 115 | ] 116 | } 117 | ], 118 | "source": [ 119 | "print(\"State Space {}\".format(env.observation_space))" 120 | ] 121 | }, 122 | { 123 | "cell_type": "markdown", 124 | "id": "c8f6a690", 125 | "metadata": {}, 126 | "source": [ 127 | "## Rewards" 128 | ] 129 | }, 130 | { 131 | "cell_type": "code", 132 | "execution_count": 7, 133 | "id": "0faad2a7", 134 | "metadata": {}, 135 | "outputs": [ 136 | { 137 | "name": "stdout", 138 | "output_type": "stream", 139 | "text": [ 140 | "env.P[state][action][0]: (1.0, 223, -1, False)\n" 141 | ] 142 | } 143 | ], 144 | "source": [ 145 | "# env.P is double dictionary.\n", 146 | "# - The 1st key represents the state, from 0 to 499\n", 147 | "# - The 2nd key represens the action taken by the 
agent,\n", 148 | "# from 0 to 5\n", 149 | "\n", 150 | "# example\n", 151 | "state = 123\n", 152 | "action = 0 # move south\n", 153 | "\n", 154 | "# env.P[state][action][0] is a list with 4 elements\n", 155 | "# (probability, next_state, reward, done)\n", 156 | "# \n", 157 | "# - probability\n", 158 | "# It is always 1 in this environment, which means\n", 159 | "# there are no external/random factors that determine the\n", 160 | "# next_state\n", 161 | "# apart from the agent's action a.\n", 162 | "#\n", 163 | "# - next_state: 223 in this case\n", 164 | "# \n", 165 | "# - reward: -1 in this case\n", 166 | "#\n", 167 | "# - done: boolean (True/False) indicates whether the\n", 168 | "# episode has ended (i.e. the driver has dropped the\n", 169 | "# passenger at the correct destination)\n", 170 | "print('env.P[state][action][0]: ', env.P[state][action][0])" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": 8, 176 | "id": "552caf92", 177 | "metadata": {}, 178 | "outputs": [ 179 | { 180 | "name": "stdout", 181 | "output_type": "stream", 182 | "text": [ 183 | "+---------+\n", 184 | "|\u001b[34;1mR\u001b[0m: | : :G|\n", 185 | "| :\u001b[43m \u001b[0m| : : |\n", 186 | "| : : : : |\n", 187 | "| | : | : |\n", 188 | "|Y| : |\u001b[35mB\u001b[0m: |\n", 189 | "+---------+\n", 190 | "\n" 191 | ] 192 | } 193 | ], 194 | "source": [ 195 | "# Need to call reset() at least once before render() will work\n", 196 | "env.reset()\n", 197 | "\n", 198 | "env.s = 123\n", 199 | "env.render(mode='human')" 200 | ] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "execution_count": 9, 205 | "id": "2ded2ba5", 206 | "metadata": {}, 207 | "outputs": [ 208 | { 209 | "name": "stdout", 210 | "output_type": "stream", 211 | "text": [ 212 | "+---------+\n", 213 | "|\u001b[34;1mR\u001b[0m: | : :G|\n", 214 | "| : | : : |\n", 215 | "| :\u001b[43m \u001b[0m: : : |\n", 216 | "| | : | : |\n", 217 | "|Y| : |\u001b[35mB\u001b[0m: |\n", 218 | "+---------+\n", 219 | "\n" 220 | ] 221 | } 222 | ], 223 | "source": [ 224 | "env.s = 223\n", 225 | "env.render(mode='human')" 226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": null, 231 | "id": "2aacea45", 232 | "metadata": {}, 233 | "outputs": [], 234 | "source": [] 235 | } 236 | ], 237 | "metadata": { 238 | "kernelspec": { 239 | "display_name": "Python 3 (ipykernel)", 240 | "language": "python", 241 | "name": "python3" 242 | }, 243 | "language_info": { 244 | "codemirror_mode": { 245 | "name": "ipython", 246 | "version": 3 247 | }, 248 | "file_extension": ".py", 249 | "mimetype": "text/x-python", 250 | "name": "python", 251 | "nbconvert_exporter": "python", 252 | "pygments_lexer": "ipython3", 253 | "version": "3.7.5" 254 | } 255 | }, 256 | "nbformat": 4, 257 | "nbformat_minor": 5 258 | } 259 | -------------------------------------------------------------------------------- /01_taxi/notebooks/04_homework.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "f0fd6807", 6 | "metadata": {}, 7 | "source": [ 8 | "# 04 Homework 🏋️🏋️🏋️" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "id": "abcf6613", 14 | "metadata": {}, 15 | "source": [ 16 | "#### 👉A course without homework is not a course!\n", 17 | "\n", 18 | "#### 👉Spend some time thinking and trying to implement the challenges I propose here.\n", 19 | "\n", 20 | "#### 👉They are not so easy, so if you get stuck drop me an email at `plabartabajo@gmail.com`" 21 | ] 22 | }, 23 | { 24 | 
"cell_type": "markdown", 25 | "id": "86f82e45", 26 | "metadata": {}, 27 | "source": [ 28 | "-----" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "id": "67656662", 34 | "metadata": {}, 35 | "source": [ 36 | "## 1. Can you update the function `train` in a way that the input `epsilon` can also be a callable function?\n", 37 | "\n", 38 | "An `epsilon` value that decays after each episode works better than a fixed `epsilon` for most RL problems.\n", 39 | "\n", 40 | "This is hard exercise, but I want you to give it a try.\n", 41 | "\n", 42 | "If you do not manage it, do not worry. We are going to implement this in an upcoming lesson." 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "id": "7d1e016e", 48 | "metadata": {}, 49 | "source": [ 50 | "-----" 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "id": "c0a46bf7", 56 | "metadata": {}, 57 | "source": [ 58 | "## 2. Can you parallelize the function `train_many_runs` using Python's `multiprocessing` module?\n", 59 | "\n", 60 | "I do not like to wait and stare at each progress bar, while I think that each run in `train_many_runs` could execute\n", 61 | "in parallel.\n", 62 | "\n", 63 | "Create a new function called `train_many_runs_in_parallel` that outputs the same results as `train_many_runs` but that executes in a fraction of time." 64 | ] 65 | } 66 | ], 67 | "metadata": { 68 | "kernelspec": { 69 | "display_name": "Python 3 (ipykernel)", 70 | "language": "python", 71 | "name": "python3" 72 | }, 73 | "language_info": { 74 | "codemirror_mode": { 75 | "name": "ipython", 76 | "version": 3 77 | }, 78 | "file_extension": ".py", 79 | "mimetype": "text/x-python", 80 | "name": "python", 81 | "nbconvert_exporter": "python", 82 | "pygments_lexer": "ipython3", 83 | "version": "3.7.5" 84 | } 85 | }, 86 | "nbformat": 4, 87 | "nbformat_minor": 5 88 | } 89 | -------------------------------------------------------------------------------- /01_taxi/pyproject.toml: -------------------------------------------------------------------------------- 1 | [tool.poetry] 2 | name = "src" 3 | version = "0.1.0" 4 | description = "" 5 | authors = ["Pau "] 6 | 7 | [tool.poetry.dependencies] 8 | python = ">=3.7.1,<4.0" 9 | gym = "^0.21.0" 10 | tqdm = "^4.62.3" 11 | matplotlib = "^3.5.0" 12 | pandas = "^1.3.4" 13 | seaborn = "^0.11.2" 14 | jupyter = "^1.0.0" 15 | jupyterlab = "^3.3.0" 16 | 17 | [tool.poetry.dev-dependencies] 18 | pytest = "^5.2" 19 | 20 | [build-system] 21 | requires = ["poetry-core>=1.0.0"] 22 | build-backend = "poetry.core.masonry.api" 23 | -------------------------------------------------------------------------------- /01_taxi/requirements.txt: -------------------------------------------------------------------------------- 1 | anyio==3.5.0; python_full_version >= "3.6.2" and python_version >= "3.7" 2 | appnope==0.1.2; platform_system == "Darwin" and python_version >= "3.7" and sys_platform == "darwin" 3 | argon2-cffi-bindings==21.2.0; python_version >= "3.6" 4 | argon2-cffi==21.3.0; python_version >= "3.7" 5 | atomicwrites==1.4.0; python_version >= "3.5" and python_full_version < "3.0.0" and sys_platform == "win32" or sys_platform == "win32" and python_version >= "3.5" and python_full_version >= "3.4.0" 6 | attrs==21.4.0; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.7" 7 | babel==2.9.1; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.7" 8 | backcall==0.2.0; python_version >= "3.7" 9 | 
bleach==4.1.0; python_version >= "3.7" 10 | certifi==2021.10.8; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.7" 11 | cffi==1.15.0; implementation_name == "pypy" and python_version >= "3.6" 12 | charset-normalizer==2.0.12; python_full_version >= "3.6.0" and python_version >= "3.7" 13 | cloudpickle==2.0.0; python_version >= "3.6" 14 | colorama==0.4.4; python_version >= "3.7" and python_full_version < "3.0.0" and platform_system == "Windows" and sys_platform == "win32" or python_full_version >= "3.5.0" and platform_system == "Windows" and sys_platform == "win32" and python_version >= "3.7" 15 | cycler==0.11.0; python_version >= "3.7" 16 | debugpy==1.5.1; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.7" 17 | decorator==5.1.1; python_version >= "3.7" 18 | defusedxml==0.7.1; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.7" 19 | entrypoints==0.4; python_full_version >= "3.6.1" and python_version >= "3.7" 20 | fonttools==4.29.1; python_version >= "3.7" 21 | gym==0.21.0; python_version >= "3.6" 22 | idna==3.3; python_full_version >= "3.6.2" and python_version >= "3.7" and (python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.7") 23 | importlib-metadata==4.11.2; python_version < "3.8" and python_version >= "3.7" and (python_version >= "3.7" and python_full_version < "3.0.0" and python_version < "3.8" or python_full_version >= "3.4.0" and python_version >= "3.7" and python_version < "3.8") 24 | importlib-resources==5.4.0; python_version < "3.9" and python_version >= "3.7" 25 | ipykernel==6.9.1; python_version >= "3.7" 26 | ipython-genutils==0.2.0; python_version >= "3.7" 27 | ipython==7.32.0; python_version >= "3.7" 28 | ipywidgets==7.6.5 29 | jedi==0.18.1; python_version >= "3.7" 30 | jinja2==3.0.3; python_version >= "3.7" 31 | json5==0.9.6; python_version >= "3.7" 32 | jsonschema==4.4.0; python_version >= "3.7" 33 | jupyter-client==7.1.2; python_full_version >= "3.7.0" and python_version >= "3.7" 34 | jupyter-console==6.4.3; python_version >= "3.6" 35 | jupyter-core==4.9.2; python_full_version >= "3.6.1" and python_version >= "3.7" 36 | jupyter-server==1.13.5; python_version >= "3.7" 37 | jupyter==1.0.0 38 | jupyterlab-pygments==0.1.2; python_version >= "3.7" 39 | jupyterlab-server==2.10.3; python_version >= "3.7" 40 | jupyterlab-widgets==1.0.2; python_version >= "3.6" 41 | jupyterlab==3.3.0; python_version >= "3.7" 42 | kiwisolver==1.3.2; python_version >= "3.7" 43 | markupsafe==2.1.0; python_version >= "3.7" 44 | matplotlib-inline==0.1.3; python_version >= "3.7" 45 | matplotlib==3.5.1; python_version >= "3.7" 46 | mistune==0.8.4; python_version >= "3.7" 47 | more-itertools==8.12.0; python_version >= "3.5" 48 | nbclassic==0.3.6; python_version >= "3.7" 49 | nbclient==0.5.12; python_full_version >= "3.7.0" and python_version >= "3.7" 50 | nbconvert==6.4.2; python_version >= "3.7" 51 | nbformat==5.1.3; python_full_version >= "3.7.0" and python_version >= "3.7" 52 | nest-asyncio==1.5.4; python_full_version >= "3.7.0" and python_version >= "3.7" 53 | notebook-shim==0.1.0; python_version >= "3.7" 54 | notebook==6.4.8; python_version >= "3.7" 55 | numpy==1.21.1 56 | packaging==21.3; python_version >= "3.7" 57 | pandas==1.3.5; python_full_version >= "3.7.1" 58 | pandocfilters==1.5.0; python_version >= "3.7" and python_full_version 
< "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.7" 59 | parso==0.8.3; python_version >= "3.7" 60 | pexpect==4.8.0; sys_platform != "win32" and python_version >= "3.7" 61 | pickleshare==0.7.5; python_version >= "3.7" 62 | pillow==9.0.1; python_version >= "3.7" 63 | pluggy==0.13.1; python_version >= "3.5" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.5" 64 | prometheus-client==0.13.1; python_version >= "3.7" 65 | prompt-toolkit==3.0.28; python_full_version >= "3.6.2" and python_version >= "3.7" 66 | ptyprocess==0.7.0; os_name != "nt" and python_version >= "3.7" and sys_platform != "win32" 67 | py==1.11.0; python_version >= "3.6" and python_full_version < "3.0.0" and implementation_name == "pypy" or python_full_version >= "3.5.0" and python_version >= "3.6" and implementation_name == "pypy" 68 | pycparser==2.21; python_version >= "3.6" and python_full_version < "3.0.0" and implementation_name == "pypy" or implementation_name == "pypy" and python_version >= "3.6" and python_full_version >= "3.4.0" 69 | pygments==2.11.2; python_version >= "3.7" 70 | pyparsing==3.0.7; python_version >= "3.7" 71 | pyrsistent==0.18.1; python_version >= "3.7" 72 | pytest==5.4.3; python_version >= "3.5" 73 | python-dateutil==2.8.2; python_full_version >= "3.7.1" and python_version >= "3.7" and (python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.3.0" and python_version >= "3.7") 74 | pytz==2021.3; python_full_version >= "3.7.1" and python_version >= "3.6" and (python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.7") 75 | pywin32==303; sys_platform == "win32" and platform_python_implementation != "PyPy" and python_version >= "3.7" 76 | pywinpty==1.1.6; os_name == "nt" and python_version >= "3.7" 77 | pyzmq==22.3.0; python_full_version >= "3.6.1" and python_version >= "3.7" 78 | qtconsole==5.2.2; python_version >= "3.6" 79 | qtpy==2.0.1; python_version >= "3.6" 80 | requests==2.27.1; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.7" 81 | scipy==1.6.1; python_version >= "3.7" 82 | seaborn==0.11.2; python_version >= "3.6" 83 | send2trash==1.8.0; python_version >= "3.7" 84 | setuptools-scm==6.4.2; python_version >= "3.7" 85 | six==1.16.0; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.3.0" and python_version >= "3.7" 86 | sniffio==1.2.0; python_full_version >= "3.6.2" and python_version >= "3.7" 87 | terminado==0.13.3; python_version >= "3.7" 88 | testpath==0.6.0; python_version >= "3.7" 89 | tomli==2.0.1; python_version >= "3.7" 90 | tornado==6.1; python_full_version >= "3.6.1" and python_version >= "3.7" 91 | tqdm==4.63.0; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.4.0") 92 | traitlets==5.1.1; python_full_version >= "3.7.0" and python_version >= "3.7" 93 | typing-extensions==4.1.1; python_version < "3.8" and python_version >= "3.7" and python_full_version >= "3.6.2" 94 | urllib3==1.26.8; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version < "4" and python_version >= "3.7" 95 | wcwidth==0.2.5; python_full_version >= "3.6.2" and python_version >= "3.6" 96 | webencodings==0.5.1; python_version >= "3.7" 97 | websocket-client==1.3.1; python_version >= "3.7" 98 | widgetsnbextension==3.5.2 99 | zipp==3.7.0; python_version < "3.8" and python_version 
>= "3.7" 100 | -------------------------------------------------------------------------------- /01_taxi/src/__init__.py: -------------------------------------------------------------------------------- 1 | __version__ = '0.1.0' 2 | -------------------------------------------------------------------------------- /01_taxi/src/loops.py: -------------------------------------------------------------------------------- 1 | from typing import Tuple, List, Any 2 | import random 3 | from pdb import set_trace as stop 4 | 5 | import numpy as np 6 | from tqdm import tqdm 7 | 8 | 9 | def train( 10 | agent, 11 | env, 12 | n_episodes: int, 13 | epsilon: float 14 | ) -> Tuple[Any, List, List]: 15 | """ 16 | Trains and agent and returns 3 things: 17 | - agent object 18 | - timesteps_per_episode 19 | - penalties_per_episode 20 | """ 21 | # For plotting metrics 22 | timesteps_per_episode = [] 23 | penalties_per_episode = [] 24 | 25 | for i in tqdm(range(0, n_episodes)): 26 | 27 | state = env.reset() 28 | 29 | epochs, penalties, reward, = 0, 0, 0 30 | done = False 31 | 32 | while not done: 33 | 34 | if random.uniform(0, 1) < epsilon: 35 | # Explore action space 36 | action = env.action_space.sample() 37 | else: 38 | # Exploit learned values 39 | action = agent.get_action(state) 40 | 41 | next_state, reward, done, info = env.step(action) 42 | 43 | agent.update_parameters(state, action, reward, next_state) 44 | 45 | if reward == -10: 46 | penalties += 1 47 | 48 | state = next_state 49 | epochs += 1 50 | 51 | timesteps_per_episode.append(epochs) 52 | penalties_per_episode.append(penalties) 53 | 54 | return agent, timesteps_per_episode, penalties_per_episode 55 | 56 | 57 | def evaluate( 58 | agent, 59 | env, 60 | n_episodes: int, 61 | epsilon: float, 62 | initial_state: int = None 63 | ) -> Tuple[List, List]: 64 | """ 65 | Tests agent performance in random `n_episodes`. 66 | 67 | It returns: 68 | - timesteps_per_episode 69 | - penalties_per_episode 70 | """ 71 | # For plotting metrics 72 | timesteps_per_episode = [] 73 | penalties_per_episode = [] 74 | frames_per_episode = [] 75 | 76 | for i in tqdm(range(0, n_episodes)): 77 | 78 | if initial_state: 79 | # init the environment at 'initial_state' 80 | state = initial_state 81 | env.s = initial_state 82 | else: 83 | # random starting state 84 | state = env.reset() 85 | 86 | epochs, penalties, reward, = 0, 0, 0 87 | frames = [] 88 | done = False 89 | 90 | while not done: 91 | 92 | if random.uniform(0, 1) < epsilon: 93 | # Explore action space 94 | action = env.action_space.sample() 95 | else: 96 | # Exploit learned values 97 | action = agent.get_action(state) 98 | 99 | next_state, reward, done, info = env.step(action) 100 | 101 | frames.append({ 102 | 'frame': env.render(mode='ansi'), 103 | 'state': state, 104 | 'action': action, 105 | 'reward': reward 106 | }) 107 | 108 | if reward == -10: 109 | penalties += 1 110 | 111 | state = next_state 112 | epochs += 1 113 | 114 | timesteps_per_episode.append(epochs) 115 | penalties_per_episode.append(penalties) 116 | frames_per_episode.append(frames) 117 | 118 | return timesteps_per_episode, penalties_per_episode, frames_per_episode 119 | 120 | 121 | def train_many_runs( 122 | agent, 123 | env, 124 | n_episodes: int, 125 | epsilon: float, 126 | n_runs: int, 127 | ) -> Tuple[List, List]: 128 | """ 129 | Calls 'train' many times, stores results and averages them out. 
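    It returns two lists of length `n_episodes`:
    - timesteps per episode, averaged over the `n_runs` runs
    - penalties per episode, averaged over the `n_runs` runs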
130 | """ 131 | timesteps = np.zeros(shape=(n_runs, n_episodes)) 132 | penalties = np.zeros(shape=(n_runs, n_episodes)) 133 | 134 | for i in range(0, n_runs): 135 | 136 | agent.reset() 137 | 138 | _, timesteps[i, :], penalties[i, :] = train( 139 | agent, env, n_episodes, epsilon 140 | ) 141 | timesteps = np.mean(timesteps, axis=0).tolist() 142 | penalties = np.mean(penalties, axis=0).tolist() 143 | 144 | return timesteps, penalties 145 | 146 | if __name__ == '__main__': 147 | 148 | import gym 149 | from src.q_agent import QAgent 150 | 151 | env = gym.make("Taxi-v3").env 152 | alpha = 0.1 153 | gamma = 0.6 154 | agent = QAgent(env, alpha, gamma) 155 | 156 | agent, _, _ = train( 157 | agent, env, n_episodes=10000, epsilon=0.10) 158 | 159 | timesteps_per_episode, penalties_per_episode, _ = evaluate( 160 | agent, env, n_episodes=100, epsilon=0.05 161 | ) 162 | 163 | print(f'Avg steps to complete ride: {np.array(timesteps_per_episode).mean()}') 164 | print(f'Avg penalties to complete ride: {np.array(penalties_per_episode).mean()}') -------------------------------------------------------------------------------- /01_taxi/src/q_agent.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from pdb import set_trace as stop 3 | 4 | class QAgent: 5 | 6 | def __init__(self, env, alpha, gamma): 7 | self.env = env 8 | 9 | # table with q-values: n_states * n_actions 10 | self.q_table = np.zeros([env.observation_space.n, env.action_space.n]) 11 | 12 | # hyper-parameters 13 | self.alpha = alpha 14 | self.gamma = gamma 15 | 16 | def get_action(self, state): 17 | """""" 18 | # stop() 19 | return np.argmax(self.q_table[state]) 20 | 21 | def update_parameters(self, state, action, reward, next_state): 22 | """""" 23 | old_value = self.q_table[state, action] 24 | next_max = np.max(self.q_table[next_state]) 25 | 26 | new_value = (1 - self.alpha) * old_value + self.alpha * (reward + self.gamma * next_max) 27 | self.q_table[state, action] = new_value 28 | 29 | def reset(self): 30 | """ 31 | Sets q-values to zeros, which essentially means the agent does not know 32 | anything 33 | """ 34 | self.q_table = np.zeros([self.env.observation_space.n, self.env.action_space.n]) 35 | -------------------------------------------------------------------------------- /01_taxi/src/random_agent.py: -------------------------------------------------------------------------------- 1 | 2 | class RandomAgent: 3 | """ 4 | This taxi driver selects actions randomly. 5 | You better not get into this taxi! 6 | """ 7 | def __init__(self, env): 8 | self.env = env 9 | 10 | def get_action(self, state) -> int: 11 | """ 12 | No input arguments to this function. 13 | The agent does not consider the state of the environment when deciding 14 | what to do next. 15 | """ 16 | return self.env.action_space.sample() -------------------------------------------------------------------------------- /02_mountain_car/README.md: -------------------------------------------------------------------------------- 1 | # SARSA to beat gravity 🚃 2 | 👉 [Read in datamachines](http://datamachines.xyz/2021/12/17/hands-on-reinforcement-learning-course-part-3-sarsa/) 3 | 👉 [Read in Towards Data Science](https://towardsdatascience.com/hands-on-reinforcement-learning-course-part-3-5db40e7938d4) 4 | 5 | 6 | This is part 2 of my course Hands-on reinforcement learning. 7 | 8 | In this part we use SARSA to help a poor car win the battle against gravity! 
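
For reference, this is the update rule the SARSA agent applies at every step. It is only a sketch for a generic tabular setting; the actual agent in `src/sarsa_agent.py` implements the same rule on top of a discretized (position, velocity) state, and the helper function name here is just for illustration.

```python
import numpy as np

def sarsa_update(q_table: np.ndarray, s: int, a: int, reward: float,
                 s_next: int, a_next: int, alpha: float, gamma: float) -> None:
    """One SARSA step: pull Q(s, a) towards the bootstrapped target
    r + gamma * Q(s', a'), where a' is the action actually taken next."""
    target = reward + gamma * q_table[s_next, a_next]
    q_table[s, a] += alpha * (target - q_table[s, a])
```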
9 | 10 | > *Be like a train; go in the rain, go in the sun, go in the storm, go in the dark tunnels! Be like a train; concentrate on your road and go with no hesitation!* 11 | > 12 | > --_Mehmet Murat Ildan_ 13 | 14 | ### Quick setup 15 | 16 | The easiest way to get the code working in your machine is by using [Poetry](https://python-poetry.org/docs/#installation). 17 | 18 | 19 | 1. You can install Poetry with this one-liner: 20 | ```bash 21 | $ curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python - 22 | ``` 23 | 24 | 2. Git clone the code 25 | ```bash 26 | $ git clone https://github.com/Paulescu/hands-on-rl.git 27 | ``` 28 | 29 | 3. Navigate to this lesson code `02_mountain_car` 30 | ```bash 31 | $ cd hands-on-rl/02_mountain_car 32 | ``` 33 | 34 | 4. Install all dependencies from `pyproject.toml: 35 | ```bash 36 | $ poetry install 37 | ``` 38 | 39 | 5. Activate the virtual environment 40 | ```bash 41 | $ poetry shell 42 | ``` 43 | 44 | 6. Set PYTHONPATH and launch jupyter (jupyter-lab param may fix launch problems on some systems) 45 | ```bash 46 | $ export PYTHONPATH=".." 47 | $ jupyter-lab --NotebookApp.use_redirect_file=False 48 | ``` 49 | 50 | ### Notebooks 51 | 52 | 1. [Explore the environment](notebooks/00_environment.ipynb) 53 | 2. [Random agent baseline](notebooks/01_random_agent_baseline.ipynb) 54 | 3. [SARSA agent](notebooks/02_sarsa_agent.ipynb) 55 | 4. [Momentum agent](notebooks/03_momentum_agent_baseline.ipynb) 56 | 5. [Homework](notebooks/04_homework.ipynb) 57 | -------------------------------------------------------------------------------- /02_mountain_car/notebooks/04_homework.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "f0fd6807", 6 | "metadata": {}, 7 | "source": [ 8 | "# 04 Homework 🏋️🏋️🏋️" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "id": "abcf6613", 14 | "metadata": {}, 15 | "source": [ 16 | "#### 👉A course without homework is not a course!\n", 17 | "\n", 18 | "#### 👉Spend some time thinking and trying to implement the challenges I propose here.\n", 19 | "\n", 20 | "#### 👉Feel free to email me your solutions at:" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "id": "d1d983a3", 26 | "metadata": {}, 27 | "source": [ 28 | "# `plabartabajo@gmail.com`" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "id": "86f82e45", 34 | "metadata": {}, 35 | "source": [ 36 | "-----" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "id": "67656662", 42 | "metadata": {}, 43 | "source": [ 44 | "## 1. Can you adjust the hyper-parameters `alpha = 0.1` and `gamma = 0.9` to train a better SARSA agent than mine?\n", 45 | "\n", 46 | "Experiment with these 2 hyper-parameters to maximize the agent success rate." 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "id": "c0a46bf7", 52 | "metadata": {}, 53 | "source": [ 54 | "## 2. Can you increase the resolution of the discretization?\n", 55 | "\n", 56 | "Instead of using round marks of\n", 57 | "- `0.1` for position\n", 58 | "- `0.01` for velocity\n", 59 | "\n", 60 | "Use 10x:\n", 61 | "- `0.01` for position\n", 62 | "- `0.001` for velocity\n", 63 | "\n", 64 | "Let me know if you get a better agent than mine?" 
65 | ] 66 | } 67 | ], 68 | "metadata": { 69 | "kernelspec": { 70 | "display_name": "Python 3 (ipykernel)", 71 | "language": "python", 72 | "name": "python3" 73 | }, 74 | "language_info": { 75 | "codemirror_mode": { 76 | "name": "ipython", 77 | "version": 3 78 | }, 79 | "file_extension": ".py", 80 | "mimetype": "text/x-python", 81 | "name": "python", 82 | "nbconvert_exporter": "python", 83 | "pygments_lexer": "ipython3", 84 | "version": "3.7.5" 85 | } 86 | }, 87 | "nbformat": 4, 88 | "nbformat_minor": 5 89 | } 90 | -------------------------------------------------------------------------------- /02_mountain_car/pyproject.toml: -------------------------------------------------------------------------------- 1 | [tool.poetry] 2 | name = "src" 3 | version = "0.1.0" 4 | description = "" 5 | authors = ["Pau "] 6 | 7 | [tool.poetry.dependencies] 8 | python = ">=3.7.1,<4.0" 9 | gym = "^0.21.0" 10 | pyglet = "^1.5.21" 11 | matplotlib = "^3.5.0" 12 | tqdm = "^4.62.3" 13 | pandas = "^1.3.4" 14 | jupyter = "^1.0.0" 15 | PyVirtualDisplay = "^2.2" 16 | imageio = "^2.13.3" 17 | seaborn = "^0.11.2" 18 | 19 | [tool.poetry.dev-dependencies] 20 | pytest = "^5.2" 21 | 22 | [build-system] 23 | requires = ["poetry-core>=1.0.0"] 24 | build-backend = "poetry.core.masonry.api" 25 | -------------------------------------------------------------------------------- /02_mountain_car/src/base_agent.py: -------------------------------------------------------------------------------- 1 | import pickle 2 | from pathlib import Path 3 | from abc import ABC, abstractmethod 4 | 5 | 6 | class BaseAgent(ABC): 7 | 8 | @abstractmethod 9 | def get_action(self, state): 10 | pass 11 | 12 | @abstractmethod 13 | def update_parameters(self, state, action, reward, next_state): 14 | pass 15 | 16 | def save_to_disk(self, path: Path): 17 | """ 18 | Saves python object to disk using a binary format 19 | """ 20 | with open(path, "wb") as f: 21 | pickle.dump(self, f, pickle.HIGHEST_PROTOCOL) 22 | 23 | @classmethod 24 | def load_from_disk(cls, path: Path): 25 | """ 26 | Loads binary format into Python object. 
27 | """ 28 | with open(path, "rb") as f: 29 | dump = pickle.load(f) 30 | 31 | return dump -------------------------------------------------------------------------------- /02_mountain_car/src/config.py: -------------------------------------------------------------------------------- 1 | # Define SAVED_AGENTS_DIR and create dir if missing 2 | import os 3 | import pathlib 4 | root_dir = pathlib.Path(__file__).parent.resolve().parent 5 | SAVED_AGENTS_DIR = root_dir / 'saved_agents' 6 | os.makedirs(SAVED_AGENTS_DIR, exist_ok=True) 7 | -------------------------------------------------------------------------------- /02_mountain_car/src/loops.py: -------------------------------------------------------------------------------- 1 | from typing import Tuple, List, Callable, Union, Optional 2 | import random 3 | 4 | from tqdm import tqdm 5 | 6 | def train( 7 | agent, 8 | env, 9 | n_episodes: int, 10 | epsilon: Union[float, Callable] 11 | ) -> Tuple[List, List]: 12 | 13 | # For plotting metrics 14 | reward_per_episode = [] 15 | max_position_per_episode = [] 16 | 17 | pbar = tqdm(range(0, n_episodes)) 18 | for i in pbar: 19 | 20 | state = env.reset() 21 | 22 | rewards = 0 23 | max_position = -99 24 | 25 | # handle case when epsilon is either 26 | # - a float 27 | # - or a function that returns a float given the episode nubmer 28 | epsilon_ = epsilon if isinstance(epsilon, float) else epsilon(i) 29 | 30 | pbar.set_description(f'Epsilon: {epsilon_:.2f}') 31 | 32 | done = False 33 | while not done: 34 | 35 | action = agent.get_action(state, epsilon_) 36 | 37 | next_state, reward, done, info = env.step(action) 38 | 39 | agent.update_parameters(state, action, reward, next_state, epsilon_) 40 | 41 | rewards += reward 42 | if next_state[0] > max_position: 43 | max_position = next_state[0] 44 | 45 | state = next_state 46 | 47 | reward_per_episode.append(rewards) 48 | max_position_per_episode.append(max_position) 49 | 50 | return reward_per_episode, max_position_per_episode 51 | 52 | 53 | def evaluate( 54 | agent, 55 | env, 56 | n_episodes: int, 57 | epsilon: Optional[Union[float, Callable]] = None 58 | ) -> Tuple[List, List]: 59 | 60 | # For plotting metrics 61 | reward_per_episode = [] 62 | max_position_per_episode = [] 63 | 64 | for i in tqdm(range(0, n_episodes)): 65 | 66 | state = env.reset() 67 | 68 | rewards = 0 69 | max_position = -99 70 | 71 | done = False 72 | while not done: 73 | 74 | epsilon_ = None 75 | if epsilon is not None: 76 | epsilon_ = epsilon if isinstance(epsilon, float) else epsilon(i) 77 | action = agent.get_action(state, epsilon_) 78 | 79 | next_state, reward, done, info = env.step(action) 80 | 81 | agent.update_parameters(state, action, reward, next_state, epsilon_) 82 | 83 | rewards += reward 84 | if next_state[0] > max_position: 85 | max_position = next_state[0] 86 | 87 | state = next_state 88 | 89 | reward_per_episode.append(rewards) 90 | max_position_per_episode.append(max_position) 91 | 92 | return reward_per_episode, max_position_per_episode 93 | 94 | if __name__ == '__main__': 95 | 96 | # environment 97 | import gym 98 | env = gym.make('MountainCar-v0') 99 | env._max_episode_steps = 1000 100 | 101 | # agent 102 | from src.sarsa_agent import SarsaAgent 103 | alpha = 0.1 104 | gamma = 0.6 105 | agent = SarsaAgent(env, alpha, gamma) 106 | 107 | rewards, max_positions = train(agent, env, n_episodes=100, epsilon=0.1) -------------------------------------------------------------------------------- /02_mountain_car/src/momentum_agent.py: 
-------------------------------------------------------------------------------- 1 | from src.base_agent import BaseAgent 2 | 3 | class MomentumAgent(BaseAgent): 4 | 5 | def __init__(self, env): 6 | self.env = env 7 | 8 | self.valley_position = -0.5 9 | 10 | def get_action(self, state, epsilon=None) -> int: 11 | """ 12 | No input arguments to this function. 13 | The agent does not consider the state of the environment when deciding 14 | what to do next. 15 | """ 16 | velocity = state[1] 17 | 18 | if velocity > 0: 19 | # accelerate to the right 20 | action = 2 21 | else: 22 | # accelerate to the left 23 | action = 0 24 | 25 | return action 26 | 27 | def update_parameters(self, state, action, reward, next_state, epsilon): 28 | pass 29 | 30 | -------------------------------------------------------------------------------- /02_mountain_car/src/random_agent.py: -------------------------------------------------------------------------------- 1 | from src.base_agent import BaseAgent 2 | 3 | class RandomAgent(BaseAgent): 4 | """ 5 | This taxi driver selects actions randomly. 6 | You better not get into this taxi! 7 | """ 8 | def __init__(self, env): 9 | self.env = env 10 | 11 | def get_action(self, state, epsilon) -> int: 12 | """ 13 | No input arguments to this function. 14 | The agent does not consider the state of the environment when deciding 15 | what to do next. 16 | """ 17 | return self.env.action_space.sample() 18 | 19 | def update_parameters(self, state, action, reward, next_state, epsilon): 20 | pass 21 | 22 | -------------------------------------------------------------------------------- /02_mountain_car/src/sarsa_agent.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import random 3 | 4 | from src.base_agent import BaseAgent 5 | 6 | class SarsaAgent(BaseAgent): 7 | 8 | def __init__(self, env, alpha, gamma): 9 | 10 | self.env = env 11 | self.q_table = self._init_q_table() 12 | 13 | # hyper-parameters 14 | self.alpha = alpha 15 | self.gamma = gamma 16 | 17 | def _init_q_table(self) -> np.array: 18 | """ 19 | Return numpy array with 3 dimensions. 20 | The first 2 dimensions are the state components, i.e. position, speed. 21 | The third dimension is the action. 
22 | """ 23 | # discretize state space from a continuous to discrete 24 | high = self.env.observation_space.high 25 | low = self.env.observation_space.low 26 | n_states = (high - low) * np.array([10, 100]) 27 | n_states = np.round(n_states, 0).astype(int) + 1 28 | 29 | # table with q-values: n_states[0] * n_states[1] * n_actions 30 | return np.zeros([n_states[0], n_states[1], self.env.action_space.n]) 31 | 32 | def _discretize_state(self, state): 33 | min_states = self.env.observation_space.low 34 | state_discrete = (state - min_states) * np.array([10, 100]) 35 | return np.round(state_discrete, 0).astype(int) 36 | 37 | def get_action(self, state, epsilon=None): 38 | """""" 39 | if epsilon and random.uniform(0, 1) < epsilon: 40 | # Explore action space 41 | action = self.env.action_space.sample() 42 | else: 43 | # Exploit learned values 44 | state_discrete = self._discretize_state(state) 45 | action = np.argmax(self.q_table[state_discrete[0], state_discrete[1]]) 46 | 47 | return action 48 | 49 | def update_parameters(self, state, action, reward, next_state, epsilon): 50 | """""" 51 | s = self._discretize_state(state) 52 | ns = self._discretize_state(next_state) 53 | na = self.get_action(next_state, epsilon) 54 | 55 | delta = self.alpha * ( 56 | reward 57 | + self.gamma * self.q_table[ns[0], ns[1], na] 58 | - self.q_table[s[0], s[1], action] 59 | ) 60 | self.q_table[s[0], s[1], action] += delta -------------------------------------------------------------------------------- /02_mountain_car/src/viz.py: -------------------------------------------------------------------------------- 1 | from time import sleep 2 | from argparse import ArgumentParser 3 | from pdb import set_trace as stop 4 | 5 | import pandas as pd 6 | import gym 7 | 8 | from src.config import SAVED_AGENTS_DIR 9 | 10 | import numpy as np 11 | 12 | 13 | def plot_policy(agent, positions: np.arange, velocities: np.arange, figsize = None): 14 | """""" 15 | data = [] 16 | int2str = { 17 | 0: 'Accelerate Left', 18 | 1: 'Do nothing', 19 | 2: 'Accelerate Right' 20 | } 21 | for position in positions: 22 | for velocity in velocities: 23 | 24 | state = np.array([position, velocity]) 25 | action = int2str[agent.get_action(state)] 26 | 27 | data.append({ 28 | 'position': position, 29 | 'velocity': velocity, 30 | 'action': action, 31 | }) 32 | 33 | data = pd.DataFrame(data) 34 | 35 | import seaborn as sns 36 | import matplotlib.pyplot as plt 37 | 38 | if figsize: 39 | plt.figure(figsize=figsize) 40 | 41 | colors = { 42 | 'Accelerate Left': 'blue', 43 | 'Do nothing': 'grey', 44 | 'Accelerate Right': 'orange' 45 | } 46 | sns.scatterplot(x="position", y="velocity", hue="action", data=data, 47 | palette=colors) 48 | 49 | plt.show() 50 | return data 51 | 52 | def show_video(agent, env, sleep_sec: float = 0.1, mode: str = "rgb_array"): 53 | 54 | state = env.reset() 55 | done = False 56 | 57 | # LAPADULA 58 | if mode == "rgb_array": 59 | from matplotlib import pyplot as plt 60 | from IPython.display import display, clear_output 61 | steps = 0 62 | fig, ax = plt.subplots(figsize=(8, 6)) 63 | 64 | while not done: 65 | 66 | action = agent.get_action(state) 67 | state, reward, done, info = env.step(action) 68 | # LAPADULA 69 | if mode == "rgb_array": 70 | steps += 1 71 | frame = env.render(mode=mode) 72 | ax.cla() 73 | ax.axes.yaxis.set_visible(False) 74 | ax.imshow(frame, extent=[env.min_position, env.max_position, 0, 1]) 75 | ax.set_title(f'Steps: {steps}') 76 | display(fig) 77 | clear_output(wait=True) 78 | plt.pause(sleep_sec) 79 | else: 80 | 
env.render() 81 | 82 | 83 | if __name__ == '__main__': 84 | 85 | parser = ArgumentParser() 86 | parser.add_argument('--agent_file', type=str, required=True) 87 | parser.add_argument('--sleep_sec', type=float, required=False, default=0.1) 88 | args = parser.parse_args() 89 | 90 | from src.base_agent import BaseAgent 91 | agent_path = SAVED_AGENTS_DIR / args.agent_file 92 | agent = BaseAgent.load_from_disk(agent_path) 93 | 94 | env = gym.make('MountainCar-v0') 95 | env._max_episode_steps = 1000 96 | 97 | show_video(agent, env, sleep_sec=args.sleep_sec) 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 | -------------------------------------------------------------------------------- /03_cart_pole/.gitignore: -------------------------------------------------------------------------------- 1 | data_supervised_ml/* 2 | -------------------------------------------------------------------------------- /03_cart_pole/README.md: -------------------------------------------------------------------------------- 1 |
2 | # Parametric Q learning to solve the Cart Pole
3 | > There exists everywhere a medium in things, determined by equilibrium.
4 | > -- Dmitri Mendeleev
5 |
6 | 7 | ![](http://datamachines.xyz/wp-content/uploads/2022/01/pexels-yogendra-singh-1701202.jpg) 8 | 9 | ## Table of Contents 10 | * [Welcome 🤗](#welcome-) 11 | * [Lecture transcripts](#lecture-transcripts) 12 | * [Quick setup](#quick-setup) 13 | * [Notebooks](#notebooks) 14 | * [Let's connect](#lets-connect) 15 | 16 | ## Welcome 🤗 17 | 18 | In today's lecture we enter new territory... 19 | 20 | A territory where function approximation (aka supervised machine learning) 21 | meets good old Reinforcement Learning. 22 | 23 | And this is how Deep RL is born. 24 | 25 | We will solve the Cart Pole environment of OpenAI using **parametric Q-learning**. 26 | 27 | Today's lesson is split into 3 parts. 28 | 29 | ## Lecture transcripts 30 | 31 | [📝 1. Parametric Q learning](http://datamachines.xyz/2022/01/18/hands-on-reinforcement-learning-course-part-4-parametric-q-learning) 32 | [📝 2. Deep Q learning](http://datamachines.xyz/2022/02/11/hands-on-reinforcement-learning-course-part-5-deep-q-learning/) 33 | [📝 3. Hyperparameter search](http://datamachines.xyz/2022/03/03/hyperparameters-in-deep-rl-hands-on-course/) 34 | 35 | ## Quick setup 36 | 37 | Make sure you have Python >= 3.7. Otherwise, update it. 38 | 39 | 1. Pull the code from GitHub and cd into the `01_taxi` folder: 40 | ``` 41 | $ git clone https://github.com/Paulescu/hands-on-rl.git 42 | $ cd hands-on-rl/01_taxi 43 | ``` 44 | 45 | 2. Make sure you have the `virtualenv` tool in your Python installation 46 | ``` 47 | $ pip3 install virtualenv 48 | ``` 49 | 50 | 3. Create a virtual environment and activate it. 51 | ``` 52 | $ virtualenv -p python3 venv 53 | $ source venv/bin/activate 54 | ``` 55 | 56 | From this point onwards commands run inside the virtual environment. 57 | 58 | 59 | 3. Install dependencies and code from `src` folder in editable mode, so you can experiment with the code. 60 | ``` 61 | $ (venv) pip install -r requirements.txt 62 | $ (venv) export PYTHONPATH="." 63 | ``` 64 | 65 | 4. Open the notebooks, either with good old Jupyter or Jupyter lab 66 | ``` 67 | $ (venv) jupyter notebook 68 | ``` 69 | ``` 70 | $ (venv) jupyter lab 71 | ``` 72 | If both launch commands fail, try these: 73 | ``` 74 | $ (venv) jupyter notebook --NotebookApp.use_redirect_file=False 75 | ``` 76 | ``` 77 | $ (venv) jupyter lab --NotebookApp.use_redirect_file=False 78 | ``` 79 | 80 | 5. Play and learn. And do the homework 😉. 81 | 82 | ## Notebooks 83 | 84 | Parametric Q-learning 85 | - [Explore the environment](notebooks/00_environment.ipynb) 86 | - [Random agent baseline](notebooks/01_random_agent_baseline.ipynb) 87 | - [Linear Q agent with bad hyper-parameters](notebooks/02_linear_q_agent_bad_hyperparameters.ipynb) 88 | - [Linear Q agent with good hyper-parameters](notebooks/03_linear_q_agent_good_hyperparameters.ipynb) 89 | - [Homework](notebooks/04_homework.ipynb) 90 | 91 | Deep Q-learning 92 | - [Crash course on neural networks](notebooks/05_crash_course_on_neural_nets.ipynb) 93 | - [Deep Q agent with bad hyper-parameters](notebooks/06_deep_q_agent_bad_hyperparameters.ipynb) 94 | - [Deep Q agent with good hyper-parameters](notebooks/07_deep_q_agent_good_hyperparameters.ipynb) 95 | - [Homework](notebooks/08_homework.ipynb) 96 | 97 | Hyperparameter search 98 | - [Hyperparameter search](notebooks/09_hyperparameter_search.ipynb) 99 | - [Homework](notebooks/10_homework.ipynb) 100 | 101 | ## Let's connect! 102 | 103 | Do you wanna become a PRO in Machine Learning? 
104 | 105 | 👉🏽 Subscribe to the [datamachines newsletter](https://datamachines.xyz/subscribe/). 106 | 107 | 👉🏽 Follow me on [Medium](https://pau-labarta-bajo.medium.com/). -------------------------------------------------------------------------------- /03_cart_pole/images/deep_q_net.svg: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 |
[SVG content not reproduced: deep Q-network diagram. Inputs x, v, θ, ω; Hidden layer 1 (256 units); Hidden layer 2 (256 units); outputs Q(s, a=0) and Q(s, a=1).]
-------------------------------------------------------------------------------- /03_cart_pole/images/hparams_search_diagram.svg: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 |
[SVG content not reproduced: hyperparameter search loop diagram. 🔎 Select hyper-parameters → 🏋️ Train the agent → 🧪 Test the agent → happy with the results? If No, select again; if Yes, done!]
-------------------------------------------------------------------------------- /03_cart_pole/images/linear_model.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Paulescu/hands-on-rl/21c11b01fd6b950cfa16800b4dd9234d55b4a1ac/03_cart_pole/images/linear_model.jpg -------------------------------------------------------------------------------- /03_cart_pole/images/linear_model_sml.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Paulescu/hands-on-rl/21c11b01fd6b950cfa16800b4dd9234d55b4a1ac/03_cart_pole/images/linear_model_sml.jpg -------------------------------------------------------------------------------- /03_cart_pole/images/neural_net.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Paulescu/hands-on-rl/21c11b01fd6b950cfa16800b4dd9234d55b4a1ac/03_cart_pole/images/neural_net.jpg -------------------------------------------------------------------------------- /03_cart_pole/images/neural_net_homework.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Paulescu/hands-on-rl/21c11b01fd6b950cfa16800b4dd9234d55b4a1ac/03_cart_pole/images/neural_net_homework.jpg -------------------------------------------------------------------------------- /03_cart_pole/images/ngrok_example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Paulescu/hands-on-rl/21c11b01fd6b950cfa16800b4dd9234d55b4a1ac/03_cart_pole/images/ngrok_example.png -------------------------------------------------------------------------------- /03_cart_pole/images/nn_1_hidden_layer_sml.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Paulescu/hands-on-rl/21c11b01fd6b950cfa16800b4dd9234d55b4a1ac/03_cart_pole/images/nn_1_hidden_layer_sml.jpg -------------------------------------------------------------------------------- /03_cart_pole/images/nn_2_hidden_layers_sml.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Paulescu/hands-on-rl/21c11b01fd6b950cfa16800b4dd9234d55b4a1ac/03_cart_pole/images/nn_2_hidden_layers_sml.jpg -------------------------------------------------------------------------------- /03_cart_pole/images/nn_3_hidden_layers_sml.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Paulescu/hands-on-rl/21c11b01fd6b950cfa16800b4dd9234d55b4a1ac/03_cart_pole/images/nn_3_hidden_layers_sml.jpg -------------------------------------------------------------------------------- /03_cart_pole/images/optuna.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Paulescu/hands-on-rl/21c11b01fd6b950cfa16800b4dd9234d55b4a1ac/03_cart_pole/images/optuna.png -------------------------------------------------------------------------------- /03_cart_pole/mlflow_runs/readme.md: -------------------------------------------------------------------------------- 1 | MLflow logs are saved in this folder. 
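If you want to inspect these runs locally, pointing the MLflow UI at this folder should work (assuming `mlflow` is installed and the runs use the default file-based store), e.g. `mlflow ui --backend-store-uri 03_cart_pole/mlflow_runs` from the repo root.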
-------------------------------------------------------------------------------- /03_cart_pole/notebooks/00_environment.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "4c2ff31f", 6 | "metadata": {}, 7 | "source": [ 8 | "# 00 Environment" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "id": "04a5c882", 14 | "metadata": {}, 15 | "source": [ 16 | "#### 👉Before you solve a Reinforcement Learning problem you need to define what are\n", 17 | "- the actions\n", 18 | "- the states of the world\n", 19 | "- the rewards\n", 20 | "\n", 21 | "#### 👉We are using the `CartPole-v0` environment from [OpenAI's gym](https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py)\n", 22 | "\n", 23 | "#### 👉`CartPole-v0` is not an extremely difficult environment. However, it is complex enough to force us level up our game. The tools we will use to solve it are really powerful.\n", 24 | "\n", 25 | "#### 👉Let's explore it!" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 44, 31 | "id": "e3629346", 32 | "metadata": {}, 33 | "outputs": [ 34 | { 35 | "name": "stdout", 36 | "output_type": "stream", 37 | "text": [ 38 | "The autoreload extension is already loaded. To reload it, use:\n", 39 | " %reload_ext autoreload\n", 40 | "Populating the interactive namespace from numpy and matplotlib\n" 41 | ] 42 | } 43 | ], 44 | "source": [ 45 | "%load_ext autoreload\n", 46 | "%autoreload 2\n", 47 | "%pylab inline\n", 48 | "%config InlineBackend.figure_format = 'svg'\n", 49 | "\n", 50 | "from matplotlib import pyplot as plt\n", 51 | "%matplotlib inline" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "id": "76e9a06d", 57 | "metadata": {}, 58 | "source": [ 59 | "## Load the environment 🌎" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": 45, 65 | "id": "ebfba291", 66 | "metadata": {}, 67 | "outputs": [], 68 | "source": [ 69 | "import gym\n", 70 | "env = gym.make('CartPole-v1')" 71 | ] 72 | }, 73 | { 74 | "cell_type": "markdown", 75 | "id": "c6e2bc37", 76 | "metadata": {}, 77 | "source": [ 78 | "## The goal\n", 79 | "### is to keep the pole in an upright position as long as you can by moving the cart a the bottom, left and right." 
80 | ] 81 | }, 82 | { 83 | "cell_type": "markdown", 84 | "id": "7babf939", 85 | "metadata": {}, 86 | "source": [ 87 | "![title](../images/cart_pole.jpg)" 88 | ] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "id": "9cb921cf", 93 | "metadata": {}, 94 | "source": [ 95 | "## Let's see how a good agent solves this problem" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": null, 101 | "id": "92dcbf74", 102 | "metadata": {}, 103 | "outputs": [], 104 | "source": [] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": 30, 109 | "id": "2ded2ba5", 110 | "metadata": {}, 111 | "outputs": [ 112 | { 113 | "data": { 114 | "text/plain": [ 115 | "" 116 | ] 117 | }, 118 | "execution_count": 30, 119 | "metadata": {}, 120 | "output_type": "execute_result" 121 | }, 122 | { 123 | "data": { 124 | "image/svg+xml": [ 125 | "\n", 126 | "\n", 128 | "\n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " 2022-01-10T09:36:37.476916\n", 134 | " image/svg+xml\n", 135 | " \n", 136 | " \n", 137 | " Matplotlib v3.5.1, https://matplotlib.org/\n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 192 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | "\n" 348 | ], 349 | "text/plain": [ 350 | "
" 351 | ] 352 | }, 353 | "metadata": { 354 | "needs_background": "light" 355 | }, 356 | "output_type": "display_data" 357 | } 358 | ], 359 | "source": [ 360 | "# env.reset()\n", 361 | "# frame = env.render(mode='rgb_array')\n", 362 | "\n", 363 | "# fig, ax = plt.subplots(figsize=(8, 6))\n", 364 | "# ax.axes.yaxis.set_visible(False)\n", 365 | "# min_x = env.observation_space.low[0]\n", 366 | "# max_x = env.observation_space.high[0]\n", 367 | "# ax.imshow(frame, extent=[min_x, max_x, 0, 8])\n", 368 | "\n" 369 | ] 370 | }, 371 | { 372 | "cell_type": "markdown", 373 | "id": "4f53a38e", 374 | "metadata": {}, 375 | "source": [ 376 | "## State space" 377 | ] 378 | }, 379 | { 380 | "cell_type": "code", 381 | "execution_count": 51, 382 | "id": "e809514b", 383 | "metadata": {}, 384 | "outputs": [ 385 | { 386 | "name": "stdout", 387 | "output_type": "stream", 388 | "text": [ 389 | "Cart position from -4.80 to 4.80\n", 390 | "Cart velocity from -3.40E+38 to 3.40E+38\n", 391 | "Angle from -0.42 to 0.42\n", 392 | "Angular velocity from -3.40E+38 to 3.40E+38\n" 393 | ] 394 | } 395 | ], 396 | "source": [ 397 | "# The state consists of 4 numbers:\n", 398 | "x_min, v_min, angle_min, angular_v_min = env.observation_space.low\n", 399 | "x_max, v_max, angle_max, angular_v_max = env.observation_space.high\n", 400 | "\n", 401 | "print(f'Cart position from {x_min:.2f} to {x_max:.2f}')\n", 402 | "print(f'Cart velocity from {v_min:.2E} to {v_max:.2E}')\n", 403 | "print(f'Angle from {angle_min:.2f} to {angle_max:.2f}')\n", 404 | "print(f'Angular velocity from {angular_v_min:.2E} to {angular_v_max:.2E}')" 405 | ] 406 | }, 407 | { 408 | "cell_type": "markdown", 409 | "id": "f413604e", 410 | "metadata": {}, 411 | "source": [ 412 | "[IMAGE]" 413 | ] 414 | }, 415 | { 416 | "cell_type": "markdown", 417 | "id": "5e0c527b", 418 | "metadata": {}, 419 | "source": [ 420 | "### The ranges for the cart velocity and pole angular velocity are a bit too large, aren't they?\n", 421 | "\n", 422 | "👉 As a general principle, the high/low state values you can read from `env.observation_space`\n", 423 | "are set very conservatively, to guarantee that the state value alwayas lies between the max and the min.\n", 424 | "\n", 425 | "👉In practice, you need to simulate a few interactions with the environment to really see the actual intervals where the state components lie.\n", 426 | "\n", 427 | "👉 Knowing the max and min values for each state component is going to be useful later when we normalize the inputs to our Parametric models." 
428 | ] 429 | }, 430 | { 431 | "cell_type": "markdown", 432 | "id": "1fcfc13a", 433 | "metadata": {}, 434 | "source": [ 435 | "## Action space\n", 436 | "\n", 437 | "- `0` Push cart to the left\n", 438 | "- `1` Push cart to the right" 439 | ] 440 | }, 441 | { 442 | "cell_type": "code", 443 | "execution_count": 43, 444 | "id": "98cfdb84", 445 | "metadata": {}, 446 | "outputs": [ 447 | { 448 | "name": "stdout", 449 | "output_type": "stream", 450 | "text": [ 451 | "Action Space Discrete(2)\n" 452 | ] 453 | } 454 | ], 455 | "source": [ 456 | "print(\"Action Space {}\".format(env.action_space))" 457 | ] 458 | }, 459 | { 460 | "cell_type": "markdown", 461 | "id": "c8f6a690", 462 | "metadata": {}, 463 | "source": [ 464 | "## Rewards\n", 465 | "\n", 466 | "- A reward of +1 is given for every step the pole stays upright.\n", 467 | "- The episode ends once the pole tilts too far from vertical (more than 12 degrees), the cart leaves the track, or the max number of steps has been reached: `n_steps >= env._max_episode_steps`\n", 468 | "\n", 469 | "In other words, the longer the agent keeps the pole balanced, the more reward it collects." 470 | ] 471 | }, 472 | { 473 | "cell_type": "code", 474 | "execution_count": null, 475 | "id": "578d1ba3", 476 | "metadata": {}, 477 | "outputs": [], 478 | "source": [] 479 | } 480 | ], 481 | "metadata": { 482 | "kernelspec": { 483 | "display_name": "Python 3 (ipykernel)", 484 | "language": "python", 485 | "name": "python3" 486 | }, 487 | "language_info": { 488 | "codemirror_mode": { 489 | "name": "ipython", 490 | "version": 3 491 | }, 492 | "file_extension": ".py", 493 | "mimetype": "text/x-python", 494 | "name": "python", 495 | "nbconvert_exporter": "python", 496 | "pygments_lexer": "ipython3", 497 | "version": "3.7.5" 498 | } 499 | }, 500 | "nbformat": 4, 501 | "nbformat_minor": 5 502 | } 503 | -------------------------------------------------------------------------------- /03_cart_pole/notebooks/04_homework.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "f0fd6807", 6 | "metadata": {}, 7 | "source": [ 8 | "# 04 Homework 🏋️🏋️🏋️" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "id": "abcf6613", 14 | "metadata": {}, 15 | "source": [ 16 | "#### 👉A course without homework is not a course!\n", 17 | "\n", 18 | "#### 👉Spend some time thinking and trying to implement the challenges I propose here.\n", 19 | "\n", 20 | "#### 👉Feel free to email me your solutions at:" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "id": "d1d983a3", 26 | "metadata": {}, 27 | "source": [ 28 | "# `plabartabajo@gmail.com`" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "id": "86f82e45", 34 | "metadata": {}, 35 | "source": [ 36 | "-----" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "id": "67656662", 42 | "metadata": {}, 43 | "source": [ 44 | "## 1. Can you use 3 different `SEED` values and re-train the agent with good hyper-parameters?\n", 45 | "\n", 46 | "Do you still train a good agent? Or does the seed really affect the training outcome?" 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "id": "c0a46bf7", 52 | "metadata": {}, 53 | "source": [ 54 | "## 2. Can you solve the `MountainCar-v0` environment using today's code?\n", 55 | "\n", 56 | "Are you able to score 99% on the evaluation set?"
57 | ] 58 | } 59 | ], 60 | "metadata": { 61 | "kernelspec": { 62 | "display_name": "Python 3 (ipykernel)", 63 | "language": "python", 64 | "name": "python3" 65 | }, 66 | "language_info": { 67 | "codemirror_mode": { 68 | "name": "ipython", 69 | "version": 3 70 | }, 71 | "file_extension": ".py", 72 | "mimetype": "text/x-python", 73 | "name": "python", 74 | "nbconvert_exporter": "python", 75 | "pygments_lexer": "ipython3", 76 | "version": "3.7.5" 77 | } 78 | }, 79 | "nbformat": 4, 80 | "nbformat_minor": 5 81 | } 82 | -------------------------------------------------------------------------------- /03_cart_pole/notebooks/08_homework.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "f0fd6807", 6 | "metadata": {}, 7 | "source": [ 8 | "# 08 Homework 🏋️🏋️🏋️" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "id": "abcf6613", 14 | "metadata": {}, 15 | "source": [ 16 | "#### 👉A course without homework is not a course!\n", 17 | "\n", 18 | "#### 👉Spend some time thinking and trying to implement the challenges I propose here.\n", 19 | "\n", 20 | "#### 👉Feel free to email me your solutions at:" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "id": "d1d983a3", 26 | "metadata": {}, 27 | "source": [ 28 | "# `plabartabajo@gmail.com`" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "id": "86f82e45", 34 | "metadata": {}, 35 | "source": [ 36 | "-----" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "id": "67656662", 42 | "metadata": {}, 43 | "source": [ 44 | "### 1. Re-train the neural networks from `05_crash_course_on_neural_nets.ipynb` with a larger training set, e.g. `10,000 samples`?\n", 45 | "\n", 46 | "👉Do the validation metrics improve?\n", 47 | "\n", 48 | "👉Did you manage to get to 95% validation accuracy?" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "id": "c0a46bf7", 54 | "metadata": {}, 55 | "source": [ 56 | "## 2. 
Can you perfectly solve the `Cart Pole` environment using a neural network with only 1 hidden layer?\n", 57 | "\n", 58 | "\n", 59 | "![](https://github.com/Paulescu/hands-on-rl/blob/main/03_cart_pole/images/neural_net_homework.jpg?raw=true)" 60 | ] 61 | } 62 | ], 63 | "metadata": { 64 | "kernelspec": { 65 | "display_name": "Python 3 (ipykernel)", 66 | "language": "python", 67 | "name": "python3" 68 | }, 69 | "language_info": { 70 | "codemirror_mode": { 71 | "name": "ipython", 72 | "version": 3 73 | }, 74 | "file_extension": ".py", 75 | "mimetype": "text/x-python", 76 | "name": "python", 77 | "nbconvert_exporter": "python", 78 | "pygments_lexer": "ipython3", 79 | "version": "3.7.5" 80 | } 81 | }, 82 | "nbformat": 4, 83 | "nbformat_minor": 5 84 | } 85 | -------------------------------------------------------------------------------- /03_cart_pole/notebooks/10_homework.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "f0fd6807", 6 | "metadata": {}, 7 | "source": [ 8 | "# 10 Homework 🏋️🏋️🏋️" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "id": "3f1582da", 14 | "metadata": {}, 15 | "source": [ 16 | "## Challenge" 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "id": "67656662", 22 | "metadata": {}, 23 | "source": [ 24 | "If you carefully look at `sample_hyper_parameters()` in `src/optimize_hyperparameters.py` you will see I did not let Optuna test different neural network architectures.\n", 25 | "\n", 26 | "I set `nn_hidden_layers = [256, 256]` and that was it.\n", 27 | "\n", 28 | "I dare you find the smallest neural network architecture that solves the `CartPole` perfectly." 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "id": "4ea9756e", 34 | "metadata": {}, 35 | "source": [ 36 | "## Send your solution through\n", 37 | "\n", 38 | "- a pull request or\n", 39 | "- direcly by email at `plabartabajo@gmail.com`" 40 | ] 41 | } 42 | ], 43 | "metadata": { 44 | "kernelspec": { 45 | "display_name": "Python 3 (ipykernel)", 46 | "language": "python", 47 | "name": "python3" 48 | }, 49 | "language_info": { 50 | "codemirror_mode": { 51 | "name": "ipython", 52 | "version": 3 53 | }, 54 | "file_extension": ".py", 55 | "mimetype": "text/x-python", 56 | "name": "python", 57 | "nbconvert_exporter": "python", 58 | "pygments_lexer": "ipython3", 59 | "version": "3.7.5" 60 | } 61 | }, 62 | "nbformat": 4, 63 | "nbformat_minor": 5 64 | } 65 | -------------------------------------------------------------------------------- /03_cart_pole/pyproject.toml: -------------------------------------------------------------------------------- 1 | [tool.poetry] 2 | name = "src" 3 | version = "0.1.0" 4 | description = "" 5 | authors = ["Pau "] 6 | 7 | [tool.poetry.dependencies] 8 | python = ">=3.7.1,<3.8" 9 | gym = "^0.21.0" 10 | sklearn = "^0.0" 11 | numpy = "^1.21.4" 12 | matplotlib = "^3.5.0" 13 | jupyter = "^1.0.0" 14 | tqdm = "^4.62.3" 15 | torch = "^1.10.1" 16 | tensorboard = "^2.7.0" 17 | pandas = "^1.3.5" 18 | PyYAML = "^6.0" 19 | pyglet = "^1.5.21" 20 | mlflow = "^1.22.0" 21 | gdown = "^4.2.0" 22 | optuna = "^2.10.0" 23 | pyngrok = "^5.1.0" 24 | 25 | [tool.poetry.dev-dependencies] 26 | pytest = "^5.2" 27 | certifi = "^2021.10.8" 28 | 29 | [build-system] 30 | requires = ["poetry-core>=1.0.0"] 31 | build-backend = "poetry.core.masonry.api" 32 | -------------------------------------------------------------------------------- /03_cart_pole/requirements.txt: 
-------------------------------------------------------------------------------- 1 | absl-py==1.0.0; python_version >= "3.6" 2 | alembic==1.4.1; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6" 3 | appnope==0.1.2; platform_system == "Darwin" and python_version >= "3.7" and sys_platform == "darwin" 4 | argcomplete==2.0.0; python_version < "3.8.0" and python_version >= "3.7" 5 | argon2-cffi-bindings==21.2.0; python_version >= "3.6" 6 | argon2-cffi==21.3.0; python_version >= "3.6" 7 | atomicwrites==1.4.0; python_version >= "3.5" and python_full_version < "3.0.0" and sys_platform == "win32" or sys_platform == "win32" and python_version >= "3.5" and python_full_version >= "3.4.0" 8 | attrs==21.4.0; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.7" 9 | autopage==0.5.0; python_version >= "3.6" 10 | backcall==0.2.0; python_version >= "3.7" 11 | beautifulsoup4==4.10.0; python_full_version > "3.0.0" 12 | bleach==4.1.0; python_version >= "3.7" 13 | cachetools==4.2.4; python_version >= "3.5" and python_version < "4.0" and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6") 14 | certifi==2021.10.8 15 | cffi==1.15.0; implementation_name == "pypy" and python_version >= "3.6" 16 | charset-normalizer==2.0.9; python_full_version >= "3.6.0" and python_version >= "3.6" 17 | click==8.0.3; python_version >= "3.6" 18 | cliff==3.10.1; python_version >= "3.6" 19 | cloudpickle==2.0.0; python_version >= "3.6" 20 | cmaes==0.8.2; python_version >= "3.6" 21 | cmd2==2.4.0; python_version >= "3.6" 22 | colorama==0.4.4; python_version >= "3.7" and python_full_version < "3.0.0" and platform_system == "Windows" and sys_platform == "win32" or python_full_version >= "3.5.0" and platform_system == "Windows" and sys_platform == "win32" and python_version >= "3.7" 23 | colorlog==6.6.0; python_version >= "3.6" 24 | cycler==0.11.0; python_version >= "3.7" 25 | databricks-cli==0.16.2; python_version >= "3.6" 26 | debugpy==1.5.1; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.7" 27 | decorator==5.1.0; python_version >= "3.7" 28 | defusedxml==0.7.1; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.7" 29 | docker==5.0.3; python_version >= "3.6" 30 | entrypoints==0.3; python_full_version >= "3.6.1" and python_version >= "3.7" 31 | filelock==3.4.2; python_version >= "3.7" 32 | flask==2.0.2; python_version >= "3.6" 33 | fonttools==4.28.5; python_version >= "3.7" 34 | gdown==4.2.0 35 | gitdb==4.0.9; python_version >= "3.7" 36 | gitpython==3.1.26; python_version >= "3.7" 37 | google-auth-oauthlib==0.4.6; python_version >= "3.6" 38 | google-auth==2.3.3; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6" 39 | greenlet==1.1.2; python_version >= "3" and python_full_version < "3.0.0" and (platform_machine == "aarch64" or platform_machine == "ppc64le" or platform_machine == "x86_64" or platform_machine == "amd64" or platform_machine == "AMD64" or platform_machine == "win32" or platform_machine == "WIN32") and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6") or python_version >= "3" and (platform_machine == "aarch64" or platform_machine == "ppc64le" or 
platform_machine == "x86_64" or platform_machine == "amd64" or platform_machine == "AMD64" or platform_machine == "win32" or platform_machine == "WIN32") and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6") and python_full_version >= "3.5.0" 40 | grpcio==1.43.0; python_version >= "3.6" 41 | gunicorn==20.1.0; platform_system != "Windows" and python_version >= "3.6" 42 | gym==0.21.0; python_version >= "3.6" 43 | idna==3.3; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6" 44 | importlib-metadata==4.10.0; python_version == "3.7" and (python_version >= "3.7" and python_full_version < "3.0.0" and python_version < "3.8" or python_full_version >= "3.6.0" and python_version >= "3.7" and python_version < "3.8") and (python_version >= "3.7" and python_full_version < "3.0.0" and python_version < "3.8" or python_full_version >= "3.4.0" and python_version >= "3.7" and python_version < "3.8") 45 | importlib-resources==5.4.0; python_version < "3.9" and python_version >= "3.7" 46 | ipykernel==6.6.1; python_version >= "3.7" 47 | ipython-genutils==0.2.0; python_version >= "3.6" 48 | ipython==7.30.1; python_version >= "3.7" 49 | ipywidgets==7.6.5 50 | itsdangerous==2.0.1; python_version >= "3.6" 51 | jedi==0.18.1; python_version >= "3.7" 52 | jinja2==3.0.3; python_version >= "3.7" 53 | joblib==1.1.0; python_version >= "3.7" 54 | jsonschema==4.3.3; python_version >= "3.7" 55 | jupyter-client==7.1.0; python_full_version >= "3.6.1" and python_version >= "3.7" 56 | jupyter-console==6.4.0; python_version >= "3.6" 57 | jupyter-core==4.9.1; python_full_version >= "3.6.1" and python_version >= "3.7" 58 | jupyter==1.0.0 59 | jupyterlab-pygments==0.1.2; python_version >= "3.7" 60 | jupyterlab-widgets==1.0.2; python_version >= "3.6" 61 | kiwisolver==1.3.2; python_version >= "3.7" 62 | mako==1.1.6; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6" 63 | markdown==3.3.6; python_version >= "3.6" 64 | markupsafe==2.0.1; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.7" 65 | matplotlib-inline==0.1.3; python_version >= "3.7" 66 | matplotlib==3.5.1; python_version >= "3.7" 67 | mistune==0.8.4; python_version >= "3.7" 68 | mlflow==1.22.0; python_version >= "3.6" 69 | more-itertools==8.12.0; python_version >= "3.5" 70 | nbclient==0.5.9; python_full_version >= "3.6.1" and python_version >= "3.7" 71 | nbconvert==6.4.0; python_version >= "3.7" 72 | nbformat==5.1.3; python_full_version >= "3.6.1" and python_version >= "3.7" 73 | nest-asyncio==1.5.4; python_full_version >= "3.6.1" and python_version >= "3.7" 74 | notebook==6.4.6; python_version >= "3.6" 75 | numpy==1.21.5; python_version >= "3.7" and python_version < "3.11" 76 | oauthlib==3.1.1; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6" 77 | optuna==2.10.0; python_version >= "3.6" 78 | packaging==21.3; python_version >= "3.7" 79 | pandas==1.3.5; python_full_version >= "3.7.1" 80 | pandocfilters==1.5.0; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.7" 81 | parso==0.8.3; python_version >= "3.7" 82 | pbr==5.8.1; python_version >= "3.6" 83 | pexpect==4.8.0; sys_platform != "win32" and python_version >= "3.7" 84 | pickleshare==0.7.5; python_version >= "3.7" 85 | 
pillow==9.0.0; python_version >= "3.7" 86 | pluggy==0.13.1; python_version >= "3.5" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.5" 87 | prettytable==3.1.1; python_version >= "3.7" 88 | prometheus-client==0.12.0; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6" 89 | prometheus-flask-exporter==0.18.7; python_version >= "3.6" 90 | prompt-toolkit==3.0.24; python_full_version >= "3.6.2" and python_version >= "3.7" 91 | protobuf==3.19.1; python_version >= "3.6" 92 | ptyprocess==0.7.0; os_name != "nt" and python_version >= "3.7" and sys_platform != "win32" 93 | py==1.11.0; python_version >= "3.6" and python_full_version < "3.0.0" and implementation_name == "pypy" or python_full_version >= "3.5.0" and python_version >= "3.6" and implementation_name == "pypy" 94 | pyasn1-modules==0.2.8; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6" 95 | pyasn1==0.4.8; python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "4" and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6") or python_full_version >= "3.6.0" and python_version >= "3.6" and python_version < "4" and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6") 96 | pycparser==2.21; python_version >= "3.6" and python_full_version < "3.0.0" and implementation_name == "pypy" or implementation_name == "pypy" and python_version >= "3.6" and python_full_version >= "3.4.0" 97 | pyglet==1.5.21 98 | pygments==2.11.1; python_version >= "3.7" 99 | pyngrok==5.1.0; python_version >= "3.5" 100 | pyparsing==3.0.6; python_version >= "3.7" 101 | pyperclip==1.8.2; python_version >= "3.6" 102 | pyreadline3==3.4.1; sys_platform == "win32" and python_version >= "3.6" 103 | pyrsistent==0.18.0; python_version >= "3.7" 104 | pysocks==1.7.1; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6" 105 | pytest==5.4.3; python_version >= "3.5" 106 | python-dateutil==2.8.2; python_full_version >= "3.7.1" and python_version >= "3.7" and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6") 107 | python-editor==1.0.4; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6" 108 | pytz==2021.3; python_full_version >= "3.7.1" and python_version >= "3.6" 109 | pywin32==227; sys_platform == "win32" and python_version >= "3.7" and platform_python_implementation != "PyPy" 110 | pywinpty==1.1.6; os_name == "nt" and python_version >= "3.6" 111 | pyyaml==6.0; python_version >= "3.6" 112 | pyzmq==22.3.0; python_full_version >= "3.6.1" and python_version >= "3.7" 113 | qtconsole==5.2.2; python_version >= "3.6" 114 | qtpy==2.0.0; python_version >= "3.6" 115 | querystring-parser==1.2.4; python_version >= "3.6" 116 | requests-oauthlib==1.3.0; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6" 117 | requests==2.27.0; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6" 118 | rsa==4.8; python_version >= "3.6" and python_version < "4" and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= 
"3.6.0" and python_version >= "3.6") 119 | scikit-learn==1.0.2; python_version >= "3.7" 120 | scipy==1.7.3; python_version >= "3.7" and python_version < "3.11" 121 | send2trash==1.8.0; python_version >= "3.6" 122 | setuptools-scm==6.3.2; python_version >= "3.7" 123 | six==1.16.0; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.7" 124 | sklearn==0.0 125 | smmap==5.0.0; python_version >= "3.7" 126 | soupsieve==2.3.1; python_version >= "3.6" and python_full_version > "3.0.0" 127 | sqlalchemy==1.4.29; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6" 128 | sqlparse==0.4.2; python_version >= "3.6" 129 | stevedore==3.5.0; python_version >= "3.6" 130 | tabulate==0.8.9; python_version >= "3.6" 131 | tensorboard-data-server==0.6.1; python_version >= "3.6" 132 | tensorboard-plugin-wit==1.8.0; python_version >= "3.6" 133 | tensorboard==2.7.0; python_version >= "3.6" 134 | terminado==0.12.1; python_version >= "3.6" 135 | testpath==0.5.0; python_version >= "3.7" 136 | threadpoolctl==3.0.0; python_version >= "3.7" 137 | tomli==2.0.0; python_version >= "3.7" 138 | torch==1.10.1; python_full_version >= "3.6.2" 139 | tornado==6.1; python_full_version >= "3.6.1" and python_version >= "3.7" 140 | tqdm==4.62.3; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.4.0") 141 | traitlets==5.1.1; python_full_version >= "3.6.1" and python_version >= "3.7" 142 | typing-extensions==4.0.1; python_version >= "3.7" and python_full_version >= "3.6.2" and python_version < "3.8" 143 | urllib3==1.26.7; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version < "4" and python_version >= "3.6" 144 | waitress==2.0.0; platform_system == "Windows" and python_version >= "3.6" and python_full_version >= "3.6.0" 145 | wcwidth==0.2.5; python_full_version >= "3.6.2" and python_version >= "3.7" 146 | webencodings==0.5.1; python_version >= "3.7" 147 | websocket-client==1.2.3; python_version >= "3.6" 148 | werkzeug==2.0.2; python_version >= "3.6" 149 | widgetsnbextension==3.5.2 150 | zipp==3.7.0; python_version < "3.8" and python_version >= "3.7" 151 | -------------------------------------------------------------------------------- /03_cart_pole/saved_agents/CartPole-v1/0/hparams.json: -------------------------------------------------------------------------------- 1 | {"learning_rate": 0.119449136260578, "discount_factor": 0.99, "batch_size": 128, "memory_size": 100000, "freq_steps_update_target": 1000, "n_steps_warm_up_memory": 1000, "freq_steps_train": 16, "n_gradient_steps": 16, "nn_hidden_layers": null, "max_grad_norm": 1, "normalize_state": false, "epsilon_start": 0.9, "epsilon_end": 0.1421425009699689, "steps_epsilon_decay": 100000} -------------------------------------------------------------------------------- /03_cart_pole/saved_agents/CartPole-v1/0/model: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Paulescu/hands-on-rl/21c11b01fd6b950cfa16800b4dd9234d55b4a1ac/03_cart_pole/saved_agents/CartPole-v1/0/model -------------------------------------------------------------------------------- /03_cart_pole/saved_agents/readme.md: -------------------------------------------------------------------------------- 1 | ### Trained agents are saved in this folder -------------------------------------------------------------------------------- 
/03_cart_pole/src/__init__.py: -------------------------------------------------------------------------------- 1 | __version__ = '0.1.0' 2 | -------------------------------------------------------------------------------- /03_cart_pole/src/agent_memory.py: -------------------------------------------------------------------------------- 1 | from collections import namedtuple, deque 2 | import random 3 | 4 | Transition = namedtuple('Transition', ('state', 'action', 'reward', 'next_state', 'done')) 5 | 6 | 7 | class AgentMemory: 8 | 9 | def __init__(self, memory_size): 10 | self.memory = deque([], maxlen=memory_size) 11 | 12 | def push(self, *args): 13 | """Save a transition""" 14 | self.memory.append(Transition(*args)) 15 | 16 | def sample(self, batch_size): 17 | transitions = random.sample(self.memory, batch_size) 18 | 19 | # stop() 20 | 21 | return Transition(*zip(*transitions)) 22 | 23 | def __len__(self): 24 | return len(self.memory) -------------------------------------------------------------------------------- /03_cart_pole/src/config.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pathlib 3 | root_dir = pathlib.Path(__file__).parent.resolve().parent 4 | 5 | SAVED_AGENTS_DIR = root_dir / 'saved_agents' 6 | TENSORBOARD_LOG_DIR = root_dir / 'tensorboard_logs' 7 | OPTUNA_DB = root_dir / 'optuna.db' 8 | DATA_SUPERVISED_ML = root_dir / 'data_supervised_ml' 9 | MLFLOW_RUNS_DIR = root_dir / 'mlflow_runs' 10 | 11 | if not SAVED_AGENTS_DIR.exists(): 12 | os.makedirs(SAVED_AGENTS_DIR) 13 | 14 | if not TENSORBOARD_LOG_DIR.exists(): 15 | os.makedirs(TENSORBOARD_LOG_DIR) 16 | 17 | if not DATA_SUPERVISED_ML.exists(): 18 | os.makedirs(DATA_SUPERVISED_ML) 19 | 20 | if not MLFLOW_RUNS_DIR.exists(): 21 | os.makedirs(MLFLOW_RUNS_DIR) -------------------------------------------------------------------------------- /03_cart_pole/src/loops.py: -------------------------------------------------------------------------------- 1 | from typing import Tuple, List, Callable, Union, Optional 2 | import random 3 | from pathlib import Path 4 | from collections import deque 5 | from pdb import set_trace as stop 6 | 7 | import numpy as np 8 | from tqdm import tqdm 9 | import torch 10 | from torch.utils.tensorboard import SummaryWriter 11 | 12 | 13 | 14 | def train( 15 | agent, 16 | env, 17 | n_episodes: int, 18 | log_dir: Optional[Path] = None, 19 | max_steps: Optional[int] = float("inf"), 20 | n_episodes_evaluate_agent: Optional[int] = 100, 21 | freq_episodes_evaluate_agent: int = 200, 22 | ) -> None: 23 | 24 | # Tensorborad log writer 25 | logging = False 26 | if log_dir is not None: 27 | writer = SummaryWriter(log_dir) 28 | logging = True 29 | 30 | reward_per_episode = [] 31 | steps_per_episode = [] 32 | global_step_counter = 0 33 | 34 | for i in tqdm(range(0, n_episodes)): 35 | 36 | state = env.reset() 37 | 38 | rewards = 0 39 | steps = 0 40 | done = False 41 | while not done: 42 | 43 | action = agent.act(state) 44 | 45 | # agents takes a step and the environment throws out a new state and 46 | # a reward 47 | next_state, reward, done, info = env.step(action) 48 | 49 | # agent observes transition and stores it for later use 50 | agent.observe(state, action, reward, next_state, done) 51 | 52 | # learning happens here, through experience replay 53 | agent.replay() 54 | 55 | global_step_counter += 1 56 | steps += 1 57 | rewards += reward 58 | state = next_state 59 | 60 | # log to Tensorboard 61 | if logging: 62 | writer.add_scalar('train/rewards', 
rewards, i) 63 | writer.add_scalar('train/steps', steps, i) 64 | writer.add_scalar('train/epsilon', agent.epsilon, i) 65 | writer.add_scalar('train/replay_memory_size', len(agent.memory), i) 66 | 67 | reward_per_episode.append(rewards) 68 | steps_per_episode.append(steps) 69 | 70 | # if (i > 0) and (i % freq_episodes_evaluate_agent) == 0: 71 | if (i + 1) % freq_episodes_evaluate_agent == 0: 72 | # evaluate agent 73 | eval_rewards, eval_steps = evaluate(agent, env, 74 | n_episodes=n_episodes_evaluate_agent, 75 | epsilon=0.01) 76 | 77 | # from src.utils import get_success_rate_from_n_steps 78 | # success_rate = get_success_rate_from_n_steps(env, eval_steps) 79 | print(f'Reward mean: {np.mean(eval_rewards):.2f}, std: {np.std(eval_rewards):.2f}') 80 | print(f'Num steps mean: {np.mean(eval_steps):.2f}, std: {np.std(eval_steps):.2f}') 81 | # print(f'Success rate: {success_rate:.2%}') 82 | if logging: 83 | writer.add_scalar('eval/avg_reward', np.mean(eval_rewards), i) 84 | writer.add_scalar('eval/avg_steps', np.mean(eval_steps), i) 85 | # writer.add_scalar('eval/success_rate', success_rate, i) 86 | 87 | if global_step_counter > max_steps: 88 | break 89 | 90 | 91 | def evaluate( 92 | agent, 93 | env, 94 | n_episodes: int, 95 | epsilon: Optional[float] = None, 96 | seed: Optional[int] = 0, 97 | ) -> Tuple[List, List]: 98 | 99 | from src.utils import set_seed 100 | set_seed(env, seed) 101 | 102 | # output metrics 103 | reward_per_episode = [] 104 | steps_per_episode = [] 105 | 106 | for i in tqdm(range(0, n_episodes)): 107 | 108 | state = env.reset() 109 | rewards = 0 110 | steps = 0 111 | done = False 112 | while not done: 113 | 114 | action = agent.act(state, epsilon=epsilon) 115 | next_state, reward, done, info = env.step(action) 116 | 117 | rewards += reward 118 | steps += 1 119 | state = next_state 120 | 121 | reward_per_episode.append(rewards) 122 | steps_per_episode.append(steps) 123 | 124 | return reward_per_episode, steps_per_episode -------------------------------------------------------------------------------- /03_cart_pole/src/model_factory.py: -------------------------------------------------------------------------------- 1 | from typing import Optional, List 2 | from pdb import set_trace as stop 3 | 4 | import torch.nn as nn 5 | 6 | 7 | def get_model( 8 | input_dim: int, 9 | output_dim: int, 10 | hidden_layers: Optional[List[int]] = None, 11 | ): 12 | """ 13 | Feed-forward network, made of linear layers with ReLU activation functions 14 | The number of layers, and their size is given by `hidden_layers`. 15 | """ 16 | # assert init_method in {'default', 'xavier'} 17 | 18 | if hidden_layers is None: 19 | # linear model 20 | model = nn.Sequential(nn.Linear(input_dim, output_dim)) 21 | 22 | else: 23 | # neural network 24 | # there are hidden layers in this case. 
25 | dims = [input_dim] + hidden_layers + [output_dim] 26 | modules = [] 27 | for i, dim in enumerate(dims[:-2]): 28 | modules.append(nn.Linear(dims[i], dims[i + 1])) 29 | modules.append(nn.ReLU()) 30 | 31 | modules.append(nn.Linear(dims[-2], dims[-1])) 32 | model = nn.Sequential(*modules) 33 | # stop() 34 | 35 | # n_parameters = sum(p.numel() for p in model.parameters() if p.requires_grad) 36 | # print(f'{n_parameters:,} parameters') 37 | 38 | return model 39 | 40 | def count_parameters(model: nn.Module) -> int: 41 | """""" 42 | return sum(p.numel() for p in model.parameters() if p.requires_grad) -------------------------------------------------------------------------------- /03_cart_pole/src/optimize_hyperparameters.py: -------------------------------------------------------------------------------- 1 | from typing import Dict 2 | from argparse import ArgumentParser 3 | from pdb import set_trace as stop 4 | 5 | import optuna 6 | import gym 7 | import numpy as np 8 | import mlflow 9 | 10 | from src.q_agent import QAgent 11 | from src.utils import get_agent_id 12 | from src.config import TENSORBOARD_LOG_DIR, SAVED_AGENTS_DIR, OPTUNA_DB 13 | from src.utils import set_seed 14 | from src.loops import train, evaluate 15 | 16 | 17 | def sample_hyper_parameters( 18 | trial: optuna.trial.Trial, 19 | force_linear_model: bool = False, 20 | ) -> Dict: 21 | 22 | learning_rate = trial.suggest_loguniform("learning_rate", 1e-5, 1e-2) 23 | discount_factor = trial.suggest_categorical("discount_factor", [0.9, 0.95, 0.99]) 24 | batch_size = trial.suggest_categorical("batch_size", [16, 32, 64, 128]) 25 | memory_size = trial.suggest_categorical("memory_size", [int(1e4), int(5e4), int(1e5)]) 26 | 27 | # we update the main model parameters every 'freq_steps_train' steps 28 | freq_steps_train = trial.suggest_categorical('freq_steps_train', [8, 16, 128, 256]) 29 | 30 | # we update the target model parameters every 'freq_steps_update_target' steps 31 | freq_steps_update_target = trial.suggest_categorical('freq_steps_update_target', [10, 100, 1000]) 32 | 33 | # minimum memory size we want before we start training 34 | # e.g. 0 --> start training right away. 35 | # e.g 1,000 --> start training when there are at least 1,000 sample trajectories in the agent's memory 36 | n_steps_warm_up_memory = trial.suggest_categorical("n_steps_warm_up_memory", [1000, 5000]) 37 | 38 | # how many consecutive gradient descent steps to perform when we update the main model parameters 39 | n_gradient_steps = trial.suggest_categorical("n_gradient_steps", [1, 4, 16]) 40 | 41 | # model architecture to approximate q values 42 | if force_linear_model: 43 | # linear model 44 | nn_hidden_layers = None 45 | else: 46 | # neural network hidden layers 47 | # nn_hidden_layers = trial.suggest_categorical("nn_hidden_layers", [None, [64, 64], [256, 256]]) 48 | nn_hidden_layers = trial.suggest_categorical("nn_hidden_layers", [[256, 256]]) # ;-) 49 | 50 | # how large do we let the gradients grow before capping them? 51 | # Explosive gradients can be an issue and this hyper-parameters helps mitigate it. 52 | max_grad_norm = trial.suggest_categorical("max_grad_norm", [1, 10]) 53 | 54 | # should we scale the inputs before feeding them to the model? 
55 | normalize_state = trial.suggest_categorical('normalize_state', [True, False]) 56 | 57 | # start value for the exploration rate 58 | epsilon_start = trial.suggest_categorical("epsilon_start", [0.9]) 59 | 60 | # final value for the exploration rate 61 | epsilon_end = trial.suggest_uniform("epsilon_end", 0, 0.2) 62 | 63 | # for how many steps do we decrease epsilon from its starting value to 64 | # its final value `epsilon_end` 65 | steps_epsilon_decay = trial.suggest_categorical("steps_epsilon_decay", [int(1e3), int(1e4), int(1e5)]) 66 | 67 | seed = trial.suggest_int('seed', 0, 2 ** 30 - 1) 68 | 69 | return { 70 | 'learning_rate': learning_rate, 71 | 'discount_factor': discount_factor, 72 | 'batch_size': batch_size, 73 | 'memory_size': memory_size, 74 | 'freq_steps_train': freq_steps_train, 75 | 'freq_steps_update_target': freq_steps_update_target, 76 | 'n_steps_warm_up_memory': n_steps_warm_up_memory, 77 | 'n_gradient_steps': n_gradient_steps, 78 | 'nn_hidden_layers': nn_hidden_layers, 79 | 'max_grad_norm': max_grad_norm, 80 | 'normalize_state': normalize_state, 81 | 'epsilon_start': epsilon_start, 82 | 'epsilon_end': epsilon_end, 83 | 'steps_epsilon_decay': steps_epsilon_decay, 84 | 'seed': seed, 85 | } 86 | 87 | 88 | def objective( 89 | trial: optuna.trial.Trial, 90 | force_linear_model: bool = False, 91 | n_episodes_to_train: int = 200, 92 | ): 93 | env_name = 'CartPole-v1' 94 | env = gym.make('CartPole-v1') 95 | 96 | with mlflow.start_run(): 97 | 98 | # generate unique agent_id 99 | agent_id = get_agent_id(env_name) 100 | mlflow.log_param('agent_id', agent_id) 101 | 102 | # hyper-parameters 103 | args = sample_hyper_parameters(trial, 104 | force_linear_model=force_linear_model) 105 | mlflow.log_params(trial.params) 106 | 107 | # fix seeds to ensure reproducible runs 108 | set_seed(env, args['seed']) 109 | 110 | # create agent object 111 | agent = QAgent( 112 | env, 113 | learning_rate=args['learning_rate'], 114 | discount_factor=args['discount_factor'], 115 | batch_size=args['batch_size'], 116 | memory_size=args['memory_size'], 117 | freq_steps_train=args['freq_steps_train'], 118 | freq_steps_update_target=args['freq_steps_update_target'], 119 | n_steps_warm_up_memory=args['n_steps_warm_up_memory'], 120 | n_gradient_steps=args['n_gradient_steps'], 121 | nn_hidden_layers=args['nn_hidden_layers'], 122 | max_grad_norm=args['max_grad_norm'], 123 | normalize_state=args['normalize_state'], 124 | epsilon_start=args['epsilon_start'], 125 | epsilon_end=args['epsilon_end'], 126 | steps_epsilon_decay=args['steps_epsilon_decay'], 127 | log_dir=TENSORBOARD_LOG_DIR / env_name / agent_id 128 | ) 129 | 130 | # train loop 131 | train(agent, 132 | env, 133 | n_episodes=n_episodes_to_train, 134 | log_dir=TENSORBOARD_LOG_DIR / env_name / agent_id) 135 | 136 | agent.save_to_disk(SAVED_AGENTS_DIR / env_name / agent_id) 137 | 138 | # evaluate its performance 139 | rewards, steps = evaluate(agent, env, n_episodes=1000, epsilon=0.00) 140 | mean_reward = np.mean(rewards) 141 | std_reward = np.std(rewards) 142 | mlflow.log_metric('mean_reward', mean_reward) 143 | mlflow.log_metric('std_reward', std_reward) 144 | 145 | return mean_reward 146 | 147 | 148 | if __name__ == '__main__': 149 | 150 | parser = ArgumentParser() 151 | parser.add_argument('--trials', type=int, required=True) 152 | parser.add_argument('--episodes', type=int, required=True) 153 | parser.add_argument('--force_linear_model', dest='force_linear_model', action='store_true') 154 | parser.set_defaults(force_linear_model=False) 155 | 
parser.add_argument('--experiment_name', type=str, required=True) 156 | args = parser.parse_args() 157 | 158 | # set Mlflow experiment name 159 | mlflow.set_experiment(args.experiment_name) 160 | 161 | # set Optuna study 162 | study = optuna.create_study(study_name=args.experiment_name, 163 | direction='maximize', 164 | load_if_exists=True, 165 | storage=f'sqlite:///{OPTUNA_DB}') 166 | 167 | # Wrap the objective inside a lambda and call objective inside it 168 | # Nice trick taken from https://www.kaggle.com/general/261870 169 | func = lambda trial: objective(trial, force_linear_model=args.force_linear_model, n_episodes_to_train=args.episodes) 170 | 171 | # run Optuna 172 | study.optimize(func, n_trials=args.trials) -------------------------------------------------------------------------------- /03_cart_pole/src/q_agent.py: -------------------------------------------------------------------------------- 1 | """ 2 | We use PyTorch for all agents: 3 | 4 | - Linear model trained one sample at a time -> Easy to train, slow and results are not great. 5 | - Linear model trained with batches of data. -> Faster to train, but results are still not good. 6 | - NN trained with batches -> Promising, but it looks like it does not train.. 7 | - NN with memory buffer -> Fix sample autocorrelation 8 | - NN with memory buffer and target network for stability. -> RL trick (called double Q-learning) 9 | 10 | """ 11 | import os 12 | from pathlib import Path 13 | from typing import Union, Callable, Tuple, List 14 | import random 15 | from argparse import ArgumentParser 16 | import json 17 | from pdb import set_trace as stop 18 | 19 | import gym 20 | import numpy as np 21 | import torch 22 | import torch.nn as nn 23 | import torch.optim as optim 24 | from torch.utils.tensorboard import SummaryWriter 25 | from torch.nn import functional as F 26 | 27 | from src.model_factory import get_model 28 | from src.agent_memory import AgentMemory 29 | from src.utils import ( 30 | get_agent_id, 31 | get_input_output_dims, 32 | get_epsilon_decay_fn, 33 | # load_default_hyperparameters, 34 | get_observation_samples, 35 | set_seed, 36 | get_num_model_parameters 37 | ) 38 | from src.loops import train 39 | from src.config import TENSORBOARD_LOG_DIR, SAVED_AGENTS_DIR 40 | 41 | 42 | class QAgent: 43 | 44 | def __init__( 45 | self, 46 | env: gym.Env, 47 | learning_rate: float = 1e-4, 48 | discount_factor: float = 0.99, 49 | batch_size: int = 64, 50 | memory_size: int = 10000, 51 | freq_steps_update_target: int = 1000, 52 | n_steps_warm_up_memory: int = 1000, 53 | freq_steps_train: int = 16, 54 | n_gradient_steps: int = 8, 55 | nn_hidden_layers: List[int] = None, 56 | max_grad_norm: int = 10, 57 | normalize_state: bool = False, 58 | epsilon_start: float = 1.0, 59 | epsilon_end: float = 0.05, 60 | steps_epsilon_decay: float = 50000, 61 | log_dir: str = None, 62 | ): 63 | """ 64 | :param env: 65 | :param learning_rate: size of the updates in the SGD/Adam formula 66 | :param discount_factor: discount factor for future rewards 67 | :param batch_size: number of (s,a,r,s') experiences we use in each SGD 68 | update 69 | :param memory_size: number of experiences the agent keeps in the replay 70 | memory 71 | :param freq_steps_update_target: frequency at which we copy the 72 | parameter 73 | from the main model to the target model. 74 | :param n_steps_warm_up_memory: number of experiences we require to have 75 | in memory before we start training the agent. 
76 | :param freq_steps_train: frequency at which we update the main model 77 | parameters 78 | :param n_gradient_steps: number of SGD/Adam updates we perform when we 79 | train the main model. 80 | :param nn_hidden_layers: architecture of the main and target models. 81 | :param max_grad_norm: used to clipped gradients if they become too 82 | large. 83 | :param normalize_state: True/False depending if you want to normalize 84 | the raw states before feeding them into the model. 85 | :param epsilon_start: starting exploration rate 86 | :param epsilon_end: final exploration rate 87 | :param steps_epsilon_decay: number of step in which epsilon decays from 88 | 'epsilon_start' to 'epsilon_end' 89 | :param log_dir: Tensorboard logging folder 90 | """ 91 | self.env = env 92 | 93 | # general hyper-parameters 94 | self.learning_rate = learning_rate 95 | self.discount_factor = discount_factor 96 | 97 | # replay memory we use to sample experiences and update parameters 98 | # `memory_size` defines the maximum number of past experiences we want the 99 | # agent remember. 100 | self.memory_size = memory_size 101 | self.memory = AgentMemory(memory_size) 102 | 103 | # number of experiences we take at once from `self.memory` to update parameters 104 | self.batch_size = batch_size 105 | 106 | # hyper-parameters to control exploration of the environment 107 | self.steps_epsilon_decay = steps_epsilon_decay 108 | self.epsilon_start = epsilon_start 109 | self.epsilon_end = epsilon_end 110 | self.epsilon_fn = get_epsilon_decay_fn(epsilon_start, epsilon_end, steps_epsilon_decay) 111 | self.epsilon = None 112 | 113 | # create q model(s). Plural because we use 2 models: main one, and the other for the target. 114 | self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 115 | self.q_net, self.target_q_net = None, None 116 | self._init_models(nn_hidden_layers) 117 | print(f'{get_num_model_parameters(self.q_net):,} parameters') 118 | self.optimizer = optim.Adam(self.q_net.parameters(), lr=learning_rate) # Adam optimizer is a safe and standard choice 119 | self.max_grad_norm = max_grad_norm 120 | 121 | # hyper-parameters to control how often or when we do certain things, like 122 | # - update the main net parameters 123 | self.freq_steps_train = freq_steps_train 124 | # - update the target net parameters 125 | self.freq_steps_update_target = freq_steps_update_target 126 | # - start training until the memory is big enough 127 | assert n_steps_warm_up_memory > batch_size, 'batch_size must be larger than n_steps_warm_up_memory' 128 | self.n_steps_warm_up_memory = n_steps_warm_up_memory 129 | # - number of gradient steps we perform every time we update the main net parameters 130 | self.n_gradient_steps = n_gradient_steps 131 | 132 | # state variable we use to keep track of the number of calls to `observe()` 133 | self._step_counter = 0 134 | 135 | # input normalizer 136 | self.normalize_state = normalize_state 137 | if normalize_state: 138 | state_samples = get_observation_samples(env, n_samples=10000) 139 | self.mean_states = state_samples.mean(axis=0) 140 | self.std_states = state_samples.std(axis=0) 141 | 142 | # create a tensorboard logger if `log_dir` was provided 143 | # logging becomes crucial to understand what is not working in our code. 
144 | self.log_dir = log_dir 145 | if log_dir: 146 | self.logger = SummaryWriter(log_dir) 147 | 148 | # save hyper-parameters 149 | self.hparams = { 150 | 'learning_rate': learning_rate, 151 | 'discount_factor': discount_factor, 152 | 'batch_size': batch_size, 153 | 'memory_size': memory_size, 154 | 'freq_steps_update_target': freq_steps_update_target, 155 | 'n_steps_warm_up_memory': n_steps_warm_up_memory, 156 | 'freq_steps_train': freq_steps_train, 157 | 'n_gradient_steps': n_gradient_steps, 158 | 'nn_hidden_layers': nn_hidden_layers, 159 | 'max_grad_norm': max_grad_norm, 160 | 'normalize_state': normalize_state, 161 | 'epsilon_start': epsilon_start, 162 | 'epsilon_end': epsilon_end, 163 | 'steps_epsilon_decay': steps_epsilon_decay, 164 | } 165 | 166 | def _init_models(self, nn_hidden_layers): 167 | 168 | # state is a vector of dimension 4, and 2 are the possible actions 169 | input_dim, output_dim = get_input_output_dims(str(self.env)) 170 | self.q_net = get_model( 171 | input_dim=input_dim, 172 | output_dim=output_dim, 173 | hidden_layers=nn_hidden_layers, 174 | ) 175 | self.q_net.to(self.device) 176 | 177 | # target q-net 178 | self.target_q_net = get_model( 179 | input_dim=input_dim, 180 | output_dim=output_dim, 181 | hidden_layers=nn_hidden_layers, 182 | ) 183 | self.target_q_net.to(self.device) 184 | 185 | # copy parameters from the `self.q_net` 186 | self._copy_params_to_target_q_net() 187 | 188 | def _copy_params_to_target_q_net(self): 189 | """ 190 | Copies parameters from q_net to target_q_net 191 | """ 192 | for target_param, param in zip(self.target_q_net.parameters(), self.q_net.parameters()): 193 | target_param.data.copy_(param.data) 194 | 195 | def _normalize_state(self, state: np.array) -> np.array: 196 | """""" 197 | # return (state - self.min_states) / (self.max_states - self.min_states) 198 | return (state - self.mean_states) / (self.std_states) 199 | 200 | def _preprocess_state(self, state: np.array) -> np.array: 201 | 202 | # state = np.copy(state_) 203 | 204 | if len(state.shape) == 1: 205 | # add extra dimension to make sure it is 2D 206 | s = state.reshape(1, -1) 207 | else: 208 | s = state 209 | 210 | if self.normalize_state: 211 | s = self._normalize_state(s) 212 | 213 | return s 214 | 215 | def act(self, state: np.array, epsilon: float = None) -> int: 216 | """ 217 | Behavioural policy 218 | """ 219 | if epsilon is None: 220 | # update epsilon 221 | self.epsilon = self.epsilon_fn(self._step_counter) 222 | epsilon = self.epsilon 223 | 224 | if random.uniform(0, 1) < epsilon: 225 | # Explore action space 226 | action = self.env.action_space.sample() 227 | return action 228 | 229 | # make sure s is a numpy array with 2 dimensions, 230 | # and normalize it if `self.normalize_state = True` 231 | s = self._preprocess_state(state) 232 | 233 | # forward pass through the net to compute q-values for the 3 actions 234 | s = torch.from_numpy(s).float().to(self.device) 235 | q_values = self.q_net(s) 236 | 237 | # extract index max q-value and reshape tensor to dimensions (1, 1) 238 | action = q_values.max(1)[1].view(1, 1) 239 | 240 | # tensor to float 241 | action = action.item() 242 | 243 | return action 244 | 245 | def observe(self, state, action, reward, next_state, done) -> None: 246 | 247 | # preprocess state 248 | s = self._preprocess_state(state) 249 | ns = self._preprocess_state(next_state) 250 | 251 | # store new experience in the agent's memory. 
252 | self.memory.push(s, action, reward, ns, done) 253 | 254 | self._step_counter += 1 255 | 256 | def replay(self) -> None: 257 | 258 | if self._step_counter % self.freq_steps_train != 0: 259 | # update parameters every `self.freq_steps_update_target` 260 | # this way we add inertia to the agent actions, as they are more sticky 261 | return 262 | 263 | if self._step_counter < self.n_steps_warm_up_memory: 264 | # memory needs to be larger, no training yet 265 | return 266 | 267 | if self._step_counter % self.freq_steps_update_target == 0: 268 | # we update the target network parameters 269 | # self.target_nn.load_state_dict(self.nn.state_dict()) 270 | self._copy_params_to_target_q_net() 271 | 272 | losses = [] 273 | for i in range(0, self.n_gradient_steps): 274 | 275 | # get batch of experiences from the agent's memory. 276 | batch = self.memory.sample(self.batch_size) 277 | 278 | # A bit of plumbing to transform numpy arrays to PyTorch tensors 279 | state_batch = torch.cat([torch.from_numpy(s).float().view(1, -1) for s in batch.state]).to(self.device) 280 | action_batch = torch.cat([torch.tensor([[a]]).long().view(1, -1) for a in batch.action]).to(self.device) 281 | reward_batch = torch.cat([torch.tensor([r]).float() for r in batch.reward]).to(self.device) 282 | next_state_batch = torch.cat([torch.from_numpy(s).float().view(1, -1) for s in batch.next_state]).to(self.device) 283 | done_batch = torch.tensor(batch.done).float().to(self.device) 284 | 285 | # q_values for all 3 actions 286 | q_values = self.q_net(state_batch) 287 | 288 | # keep only q_value for the chosen action in the trajectory, i.e. `action_batch` 289 | q_values = q_values.gather(1, action_batch) 290 | 291 | with torch.no_grad(): 292 | # q-values for each action in next_state 293 | next_q_values = self.target_q_net(next_state_batch) 294 | 295 | # extract max q-value for each next_state 296 | next_q_values, _ = next_q_values.max(dim=1) 297 | 298 | # TD target 299 | target_q_values = (1 - done_batch) * next_q_values * self.discount_factor + reward_batch 300 | 301 | # compute loss 302 | loss = F.mse_loss(q_values.squeeze(1), target_q_values) 303 | losses.append(loss.item()) 304 | 305 | # backward step to adjust network parameters 306 | self.optimizer.zero_grad() 307 | loss.backward() 308 | torch.nn.utils.clip_grad_norm_(self.q_net.parameters(), self.max_grad_norm) 309 | self.optimizer.step() 310 | 311 | if self.log_dir: 312 | self.logger.add_scalar('train/loss', np.mean(losses), self._step_counter) 313 | 314 | def save_to_disk(self, path: Path) -> None: 315 | """""" 316 | if not path.exists(): 317 | os.makedirs(path) 318 | 319 | # save hyper-parameters in a json file 320 | with open(path / 'hparams.json', 'w') as f: 321 | json.dump(self.hparams, f) 322 | 323 | if self.normalize_state: 324 | np.save(path / 'mean_states.npy', self.mean_states) 325 | np.save(path / 'std_states.npy', self.std_states) 326 | 327 | # save main model 328 | torch.save(self.q_net, path / 'model') 329 | 330 | @classmethod 331 | def load_from_disk(cls, env: gym.Env, path: Path): 332 | """ 333 | We recover all necessary variables to be able to evaluate the agent. 334 | 335 | NOTE: training state is not stored, so it is not possible to resume 336 | an interrupted training run as it was. 
337 | """ 338 | # load hyper-params 339 | with open(path / 'hparams.json', 'r') as f: 340 | hparams = json.load(f) 341 | 342 | # generate Python object 343 | agent = cls(env, **hparams) 344 | 345 | agent.normalize_state = hparams['normalize_state'] 346 | if hparams['normalize_state']: 347 | agent.mean_states = np.load(path / 'mean_states.npy') 348 | agent.std_states = np.load(path / 'std_states.npy') 349 | 350 | agent.q_net = torch.load(path / 'model') 351 | agent.q_net.eval() 352 | 353 | return agent 354 | 355 | 356 | 357 | def parse_arguments(): 358 | """ 359 | Hyper-parameters are set either from command line or from the `hyperparameters.yaml' file. 360 | Parameters set throught the command line have priority over the default ones 361 | set in the yaml file. 362 | """ 363 | 364 | parser = ArgumentParser() 365 | parser.add_argument('--env', type=str, required=True) 366 | parser.add_argument('--learning_rate', type=float) 367 | parser.add_argument('--discount_factor', type=float) 368 | parser.add_argument('--episodes', type=int) 369 | parser.add_argument('--max_steps', type=int) 370 | parser.add_argument('--epsilon_start', type=float) 371 | parser.add_argument('--epsilon_end', type=float) 372 | parser.add_argument('--steps_epsilon_decay', type=int) 373 | parser.add_argument('--batch_size', type=int) 374 | parser.add_argument('--memory_size', type=int) 375 | parser.add_argument('--n_steps_warm_up_memory', type=int) 376 | parser.add_argument('--freq_steps_update_target', type=int) 377 | parser.add_argument('--freq_steps_train', type=int) 378 | parser.add_argument('--normalize_state', dest='normalize_state', action='store_true') 379 | parser.set_defaults(normalize_state=False) 380 | parser.add_argument('--n_gradient_steps', type=int,) 381 | parser.add_argument("--nn_hidden_layers", type=int, nargs="+",) 382 | parser.add_argument('--nn_init_method', type=str, default='default') 383 | parser.add_argument('--loss', type=str) 384 | parser.add_argument("--max_grad_norm", type=float, default=10) 385 | parser.add_argument('--n_episodes_evaluate_agent', type=int, default=100) 386 | parser.add_argument('--freq_episodes_evaluate_agent', type=int, default=100) 387 | parser.add_argument('--seed', type=int, default=0) 388 | 389 | args = parser.parse_args() 390 | 391 | args_dict = {} 392 | for arg in vars(args): 393 | args_dict[arg] = getattr(args, arg) 394 | 395 | print('Hyper-parameters') 396 | for key, value in args_dict.items(): 397 | print(f'{key}: {value}') 398 | 399 | return args_dict 400 | 401 | 402 | if __name__ == '__main__': 403 | 404 | args = parse_arguments() 405 | 406 | # setup the environment 407 | env = gym.make(args['env']) 408 | 409 | # fix seeds to ensure reproducibility between runs 410 | set_seed(env, args['seed']) 411 | 412 | # generate a unique agent_id, that we later use to save results to disk, as 413 | # well as TensorBoard logs. 
414 | agent_id = get_agent_id(args['env']) 415 | print('agent_id: ', agent_id) 416 | 417 | agent = QAgent( 418 | env, 419 | learning_rate=args['learning_rate'], 420 | discount_factor=args['discount_factor'], 421 | batch_size=args['batch_size'], 422 | memory_size=args['memory_size'], 423 | freq_steps_train=args['freq_steps_train'], 424 | freq_steps_update_target=args['freq_steps_update_target'], 425 | n_steps_warm_up_memory=args['n_steps_warm_up_memory'], 426 | n_gradient_steps=args['n_gradient_steps'], 427 | nn_hidden_layers=args['nn_hidden_layers'], 428 | max_grad_norm=args['max_grad_norm'], 429 | normalize_state=args['normalize_state'], 430 | epsilon_start=args['epsilon_start'], 431 | epsilon_end=args['epsilon_end'], 432 | steps_epsilon_decay=args['steps_epsilon_decay'], 433 | log_dir=TENSORBOARD_LOG_DIR / args['env'] / agent_id 434 | ) 435 | agent.save_to_disk(SAVED_AGENTS_DIR / args['env'] / agent_id) 436 | 437 | try: 438 | train(agent, env, 439 | n_episodes=args['episodes'], 440 | log_dir=TENSORBOARD_LOG_DIR / args['env'] / agent_id, 441 | n_episodes_evaluate_agent=args['n_episodes_evaluate_agent'], 442 | freq_episodes_evaluate_agent=args['freq_episodes_evaluate_agent'], 443 | # max_steps=args['max_steps'] 444 | ) 445 | 446 | agent.save_to_disk(SAVED_AGENTS_DIR / args['env'] / agent_id) 447 | print(f'Agent {agent_id} was saved') 448 | 449 | except KeyboardInterrupt: 450 | # save the agent before quitting... 451 | agent.save_to_disk(SAVED_AGENTS_DIR / args['env'] / agent_id) 452 | print(f'Agent {agent_id} was saved') -------------------------------------------------------------------------------- /03_cart_pole/src/random_agent.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | 4 | class RandomAgent: 5 | """ 6 | This taxi driver selects actions randomly. 7 | You better not get into this taxi! 8 | """ 9 | def __init__(self, env): 10 | self.env = env 11 | 12 | def act(self, state: np.array, epsilon: float = None) -> int: 13 | """ 14 | No input arguments to this function. 15 | The agent does not consider the state of the environment when deciding 16 | what to do next. 17 | """ 18 | return self.env.action_space.sample() -------------------------------------------------------------------------------- /03_cart_pole/src/supervised_ml.py: -------------------------------------------------------------------------------- 1 | from argparse import ArgumentParser 2 | from pathlib import Path 3 | from typing import List, Optional, Tuple, Union, Dict 4 | from pdb import set_trace as stop 5 | 6 | import zipfile 7 | import gdown 8 | from tqdm import tqdm 9 | import pandas as pd 10 | import gym 11 | from torch.utils.data import Dataset, DataLoader 12 | 13 | import numpy as np 14 | #from sklearn.model_selection import train_test_split # Unused import 15 | import torch 16 | import torch.optim as optim 17 | import torch.nn as nn 18 | from torch.utils.tensorboard import SummaryWriter 19 | 20 | from src.model_factory import get_model 21 | from src.utils import set_seed 22 | from src.loops import evaluate 23 | from src.q_agent import QAgent 24 | from src.config import DATA_SUPERVISED_ML, SAVED_AGENTS_DIR, TENSORBOARD_LOG_DIR 25 | 26 | 27 | global_train_step = 0 28 | global_val_step = 0 29 | 30 | 31 | def download_agent_parameters() -> Path: 32 | """ 33 | Downloads the agent parameters and hyper-parameters that I trained on my machine 34 | Returns the path to the unzipped folder. 
35 | """ 36 | # download .zip file from public google drive 37 | # url = 'https://docs.google.com/uc?export=download&id=1KH4ANx84PMmCY6H4FoUnkBLVC1z1A6W6' 38 | url = 'https://docs.google.com/uc?export=download&id=1ZdyAuzY-0VYfyNrg0a7gHd5TOX-GadJJ' 39 | output = SAVED_AGENTS_DIR / 'CartPole-v1' / 'gdrive_agent.zip' 40 | gdown.download(url, str(output)) 41 | 42 | # unzip it 43 | with zipfile.ZipFile(str(output), "r") as zip_ref: 44 | zip_ref.extractall(str(SAVED_AGENTS_DIR / 'CartPole-v1')) 45 | 46 | return SAVED_AGENTS_DIR / 'CartPole-v1' / '298' 47 | 48 | 49 | def simulate_episode(env, agent) -> List[Dict]: 50 | """ 51 | We let the agent interact with the environment and return a list of collected 52 | states and actions 53 | """ 54 | done = False 55 | state = env.reset() 56 | samples = [] 57 | while not done: 58 | 59 | action = agent.act(state, epsilon=0.0) 60 | samples.append({ 61 | 's0': state[0], 62 | 's1': state[1], 63 | 's2': state[2], 64 | 's3': state[3], 65 | 'action': action 66 | }) 67 | state, reward, done, info = env.step(action) 68 | 69 | return samples 70 | 71 | 72 | def generate_state_action_data( 73 | env: gym.Env, 74 | agent: QAgent, 75 | n_samples: int, 76 | path: Path 77 | ) -> None: 78 | """ 79 | We let the agent interact the environment until we have collected 80 | n_samples of data. 81 | Then we save the data as a csv file with columns: s0, s1, s2, s3, a 82 | """ 83 | samples = [] 84 | with tqdm(total=n_samples) as pbar: 85 | while len(samples) < n_samples: 86 | new_samples = simulate_episode(env, agent) 87 | pbar.update(len(new_samples)) 88 | samples += new_samples 89 | 90 | # save dataframe to csv file 91 | pd.DataFrame(samples).to_csv(path, index=False) 92 | 93 | 94 | class OptimalPolicyDataset(Dataset): 95 | """ 96 | PyTorch custom dataset that wraps around the pandas dataframe and that 97 | will speak to the DataLoader later on, when we train the model. 
98 | """ 99 | def __init__(self, X: pd.DataFrame, y: pd.Series): 100 | self.X = X 101 | self.y = y 102 | 103 | def __len__(self): 104 | return len(self.X) 105 | 106 | def __getitem__(self, idx): 107 | return self.X.iloc[idx].values, self.y.iloc[idx] 108 | 109 | 110 | def get_tensorboard_writer(run_name: str): 111 | 112 | from torch.utils.tensorboard import SummaryWriter 113 | from src.config import TENSORBOARD_LOG_DIR 114 | tensorboard_writer = SummaryWriter(TENSORBOARD_LOG_DIR / 'sml' / run_name) 115 | return tensorboard_writer 116 | 117 | 118 | def get_train_val_loop( 119 | model: nn.Module, 120 | criterion, 121 | optimizer, 122 | tensorboard_writer, 123 | ): 124 | global global_train_step, global_val_step 125 | global_train_step = 0 126 | global_val_step = 0 127 | 128 | def train_val_loop( 129 | is_train: bool, 130 | dataloader: DataLoader, 131 | epoch: int, 132 | ): 133 | """""" 134 | global global_train_step, global_val_step 135 | 136 | n_batches = 0 137 | running_loss = 0 138 | n_samples = 0 139 | n_correct_predictions = 0 140 | 141 | pbar = tqdm(dataloader) 142 | for data in pbar: 143 | 144 | # extract batch of features and target values (aka labels) 145 | inputs, labels = data 146 | 147 | if is_train: 148 | # zero the parameter gradients 149 | optimizer.zero_grad() 150 | 151 | # forward 152 | outputs = model(inputs.float()) 153 | loss = criterion(outputs, labels) 154 | 155 | if is_train: 156 | # backward + optimize 157 | loss.backward() 158 | optimizer.step() 159 | 160 | predicted_labels = torch.argmax(outputs, 1) 161 | batch_accuracy = (predicted_labels == labels).numpy().mean() 162 | 163 | n_batches += 1 164 | running_loss += loss.item() 165 | avg_loss = running_loss / n_batches 166 | 167 | n_correct_predictions += (predicted_labels == labels).numpy().sum() 168 | n_samples += len(labels) 169 | avg_accuracy = n_correct_predictions / n_samples 170 | pbar.set_description(f'Epoch {epoch} - loss: {avg_loss:.4f} - accuracy: {avg_accuracy:.4f}') 171 | 172 | # log to tensorboard 173 | if is_train: 174 | global_train_step += 1 175 | tensorboard_writer.add_scalar('train/loss', loss.item(), global_train_step) 176 | tensorboard_writer.add_scalar('train/accuracy', batch_accuracy, global_train_step) 177 | # print('sent logs to TB') 178 | else: 179 | global_val_step += 1 180 | tensorboard_writer.add_scalar('val/loss', loss.item(), global_val_step) 181 | tensorboard_writer.add_scalar('val/accuracy', batch_accuracy, global_val_step) 182 | 183 | return train_val_loop 184 | 185 | 186 | 187 | def run( 188 | n_samples_train: int, 189 | n_samples_test: int, 190 | hidden_layers: Union[Tuple[int], None], 191 | n_epochs: int, 192 | ): 193 | env = gym.make('CartPole-v1') 194 | 195 | print('Downloading agent data from GDrive...') 196 | path_to_agent_data = download_agent_parameters() 197 | agent = QAgent.load_from_disk(env, path=path_to_agent_data) 198 | 199 | set_seed(env, 1234) 200 | print('Sanity checking that our agent is really that good...') 201 | rewards, steps = evaluate(agent, env, n_episodes=100, epsilon=0.0) 202 | print('Avg reward evaluation: ', np.mean(rewards)) 203 | 204 | print('Generating train data for our supervised ML problem...') 205 | path_to_train_data = DATA_SUPERVISED_ML / 'train.csv' 206 | env.seed(0) 207 | generate_state_action_data(env, agent, n_samples=n_samples_train, path=path_to_train_data) 208 | 209 | print('Generating test data for our supervised ML problem...') 210 | path_to_test_data = DATA_SUPERVISED_ML / 'test.csv' 211 | env.seed(1) 212 | generate_state_action_data(env, 
agent, n_samples=n_samples_test, path=path_to_test_data) 213 | 214 | # load data from disk 215 | print('Loading CSV files into dataframes...') 216 | train_data = pd.read_csv(path_to_train_data) 217 | test_data = pd.read_csv(path_to_test_data) 218 | 219 | # split features and labels 220 | X_train = train_data[['s0', 's1', 's2', 's3']] 221 | y_train = train_data['action'] 222 | X_test = test_data[['s0', 's1', 's2', 's3']] 223 | y_test = test_data['action'] 224 | 225 | # PyTorch datasets 226 | train_dataset = OptimalPolicyDataset(X_train, y_train) 227 | test_dataset = OptimalPolicyDataset(X_test, y_test) 228 | 229 | batch_size = 64 230 | 231 | # PyTorch dataloaders 232 | train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True) 233 | test_dataloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False) 234 | 235 | # model architecture 236 | model = get_model(input_dim=4, output_dim=2, hidden_layers=hidden_layers) 237 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 238 | model.to(device) 239 | 240 | # loss function 241 | criterion = nn.CrossEntropyLoss() 242 | 243 | # optimization method 244 | optimizer = optim.Adam(model.parameters()) #, lr=3e-4) 245 | 246 | import time 247 | ts = int(time.time()) 248 | tensorboard_writer = SummaryWriter(TENSORBOARD_LOG_DIR / 'sml' / str(ts)) 249 | train_val_loop = get_train_val_loop(model, criterion, optimizer, tensorboard_writer) 250 | 251 | # training loop, with evaluation at the end of each epoch 252 | # n_epochs = 20 253 | for epoch in range(n_epochs): 254 | # train 255 | train_val_loop(is_train=True, dataloader=train_dataloader, epoch=epoch) 256 | 257 | with torch.no_grad(): 258 | # validate 259 | train_val_loop(is_train=False, dataloader=test_dataloader, epoch=epoch) 260 | 261 | print('----------') 262 | 263 | 264 | if __name__ == '__main__': 265 | 266 | parser = ArgumentParser() 267 | parser.add_argument('--n_samples_train', type=int, default=1000) 268 | parser.add_argument('--n_samples_test', type=int, default=1000) 269 | parser.add_argument("--hidden_layers", type=int, nargs="+",) 270 | parser.add_argument('--n_epochs', type=int, default=20) 271 | args = parser.parse_args() 272 | 273 | run(n_samples_train=args.n_samples_train, 274 | n_samples_test=args.n_samples_test, 275 | hidden_layers=args.hidden_layers, 276 | n_epochs=args.n_epochs) -------------------------------------------------------------------------------- /03_cart_pole/src/utils.py: -------------------------------------------------------------------------------- 1 | import os 2 | from typing import Callable, Dict, Tuple, List 3 | import pathlib 4 | from pathlib import Path 5 | import json 6 | from pdb import set_trace as stop 7 | 8 | import numpy as np 9 | import gym 10 | import yaml 11 | import torch.nn as nn 12 | 13 | 14 | def snake_to_camel(word): 15 | import re 16 | return ''.join(x.capitalize() or '_' for x in word.split('_')) 17 | 18 | 19 | def get_agent_id(env_name: str) -> str: 20 | """""" 21 | from src.config import SAVED_AGENTS_DIR 22 | 23 | dir = Path(SAVED_AGENTS_DIR) / env_name 24 | if not dir.exists(): 25 | os.makedirs(dir) 26 | 27 | # try: 28 | # agent_id = max([int(id) for id in os.listdir(dir)]) + 1 29 | # except ValueError: 30 | # agent_id = 0 31 | 32 | ids = [] 33 | for id in os.listdir(dir): 34 | try: 35 | ids.append(int(id)) 36 | except: 37 | pass 38 | if len(ids) > 0: 39 | agent_id = max(ids) + 1 40 | else: 41 | agent_id = 0 42 | # stop() 43 | 44 | return str(agent_id) 45 | 46 | def 
get_input_output_dims(env_name: str) -> Tuple[int]: 47 | """""" 48 | if 'MountainCar' in env_name: 49 | input_dim = 2 50 | output_dim = 3 51 | elif 'CartPole' in env_name: 52 | input_dim = 4 53 | output_dim = 2 54 | else: 55 | raise Exception('Invalid environment') 56 | 57 | return input_dim, output_dim 58 | 59 | 60 | def get_epsilon_decay_fn( 61 | eps_start: float, 62 | eps_end: float, 63 | total_episodes: int 64 | ) -> Callable: 65 | """ 66 | Returns function epsilon_fn, which depends on 67 | a single input, step, which is the current episode 68 | """ 69 | def epsilon_fn(episode: int) -> float: 70 | r = max((total_episodes - episode) / total_episodes, 0) 71 | return (eps_start - eps_end)*r + eps_end 72 | 73 | return epsilon_fn 74 | 75 | 76 | def get_epsilon_exponential_decay_fn( 77 | eps_max: float, 78 | eps_min: float, 79 | decay: float, 80 | ) -> Callable: 81 | """ 82 | Returns function epsilon_fn, which depends on 83 | a single input, step, which is the current episode 84 | """ 85 | def epsilon_fn(episode: int) -> float: 86 | return max(eps_min, eps_max * (decay ** episode)) 87 | return epsilon_fn 88 | 89 | 90 | def get_success_rate_from_n_steps(env: gym.Env, steps: List[int]): 91 | 92 | import numpy as np 93 | if 'MountainCar' in str(env): 94 | success_rate = np.mean((np.array(steps) < env._max_episode_steps) * 1.0) 95 | elif 'CartPole' in str(env): 96 | success_rate = np.mean((np.array(steps) >= env._max_episode_steps) * 1.0) 97 | else: 98 | raise Exception('Invalid environment name') 99 | 100 | return success_rate 101 | 102 | def get_observation_samples(env: gym.Env, n_samples: int) -> np.array: 103 | """""" 104 | samples = [] 105 | state = env.reset() 106 | while len(samples) < n_samples: 107 | 108 | samples.append(np.copy(state)) 109 | action = env.action_space.sample() 110 | next_state, reward, done, info = env.step(action) 111 | 112 | if done: 113 | state = env.reset() 114 | else: 115 | state = next_state 116 | 117 | return np.array(samples) 118 | 119 | 120 | def set_seed( 121 | env, 122 | seed 123 | ): 124 | """To ensure reproducible runs we fix the seed for different libraries""" 125 | import random 126 | random.seed(seed) 127 | 128 | import numpy as np 129 | np.random.seed(seed) 130 | 131 | env.seed(seed) 132 | env.action_space.seed(seed) 133 | 134 | import torch 135 | torch.manual_seed(seed) 136 | 137 | # Deterministic operations for CuDNN, it may impact performances 138 | torch.backends.cudnn.deterministic = True 139 | torch.backends.cudnn.benchmark = False 140 | 141 | # env.seed(seed) 142 | # gym.spaces.prng.seed(seed) 143 | 144 | 145 | def get_num_model_parameters(model: nn.Module) -> int: 146 | return sum(p.numel() for p in model.parameters() if p.requires_grad) 147 | 148 | 149 | # from dotenv import dotenv_values 150 | # import uuid 151 | # from pdb import set_trace as stop 152 | 153 | # import pandas as pd 154 | # import git 155 | 156 | # from src.io import get_list_files 157 | 158 | 159 | # def get_project_root() -> Path: 160 | # return Path(__file__).parent.resolve().parent 161 | # 162 | # from typing import Dict 163 | # def load_env_config() -> Dict: 164 | # """ 165 | # """ 166 | # config = dotenv_values(get_project_root() / ".env") 167 | # return config 168 | # 169 | 170 | 171 | 172 | 173 | -------------------------------------------------------------------------------- /03_cart_pole/src/viz.py: -------------------------------------------------------------------------------- 1 | from time import sleep 2 | from argparse import ArgumentParser 3 | from pdb import 
set_trace as stop 4 | from typing import Optional 5 | 6 | import pandas as pd 7 | import gym 8 | 9 | from src.config import SAVED_AGENTS_DIR 10 | 11 | import numpy as np 12 | 13 | 14 | def show_video(agent, env, sleep_sec: float = 0.1, seed: Optional[int] = 0, mode: str = "rgb_array"): 15 | 16 | env.seed(seed) 17 | state = env.reset() 18 | 19 | # LAPADULA 20 | if mode == "rgb_array": 21 | from matplotlib import pyplot as plt 22 | from IPython.display import display, clear_output 23 | steps = 0 24 | fig, ax = plt.subplots(figsize=(8, 6)) 25 | 26 | done = False 27 | while not done: 28 | 29 | action = agent.act(state, epsilon=0.001) 30 | state, reward, done, info = env.step(action) 31 | 32 | # LAPADULA 33 | if mode == "rgb_array": 34 | steps += 1 35 | frame = env.render(mode=mode) 36 | ax.cla() 37 | ax.axes.yaxis.set_visible(False) 38 | ax.imshow(frame) 39 | ax.set_title(f'Steps: {steps}') 40 | display(fig) 41 | clear_output(wait=True) 42 | plt.pause(sleep_sec) 43 | else: 44 | env.render() 45 | sleep(sleep_sec) 46 | 47 | 48 | if __name__ == '__main__': 49 | 50 | parser = ArgumentParser() 51 | parser.add_argument('--agent_id', type=str, required=True) 52 | parser.add_argument('--sleep_sec', type=float, required=False, default=0.1) 53 | args = parser.parse_args() 54 | 55 | from src.q_agent import QAgent 56 | 57 | env = gym.make('CartPole-v1') 58 | # env._max_episode_steps = 1000 59 | 60 | # trained agents are saved under saved_agents/CartPole-v1/<agent_id> 61 | agent_path = SAVED_AGENTS_DIR / 'CartPole-v1' / args.agent_id 62 | agent = QAgent.load_from_disk(env, path=agent_path) 63 | 64 | 65 | show_video(agent, env, sleep_sec=args.sleep_sec) 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | -------------------------------------------------------------------------------- /03_cart_pole/tensorboard_logs/.gitignore: -------------------------------------------------------------------------------- 1 | CartPole-v1/ 2 | -------------------------------------------------------------------------------- /03_cart_pole/tensorboard_logs/readme.md: -------------------------------------------------------------------------------- 1 | ### Tensorboard logs for each train run are stored in this folder 2 | -------------------------------------------------------------------------------- /04_lunar_lander/README.md: -------------------------------------------------------------------------------- 1 |
2 | Policy Gradients to land on the Moon 3 | “That's one small step for your gradient ascent, one giant leap for your ML career.” 4 | -- Pau quoting Neil Armstrong 5 |
6 | 7 | ![](http://datamachines.xyz/wp-content/uploads/2022/05/jagoda_and_kai-2048x1536.jpg) 8 | 9 | ## Table of Contents 10 | * [Welcome 🤗](#welcome-) 11 | * [Lecture transcripts](#lecture-transcripts) 12 | * [Quick setup](#quick-setup) 13 | * [Notebooks](#notebooks) 14 | * [Let's connect](#lets-connect) 15 | 16 | ## Welcome 🤗 17 | 18 | Today we will learn about Policy Gradient methods, and use them to land on the Moon. 19 | 20 | Ready, set, go! 21 | 22 | ## Lecture transcripts 23 | 24 | [📝 1. Policy gradients](http://datamachines.xyz/2022/05/06/policy-gradients-in-reinforcement-learning-to-land-on-the-moon-hands-on-course/) 25 | 26 | ## Quick setup 27 | 28 | Make sure you have Python >= 3.7. Otherwise, update it. 29 | 30 | 1. Pull the code from GitHub and cd into the `04_lunar_lander` folder: 31 | ``` 32 | $ git clone https://github.com/Paulescu/hands-on-rl.git 33 | $ cd hands-on-rl/04_lunar_lander 34 | ``` 35 | 36 | 2. Make sure you have the `virtualenv` tool in your Python installation 37 | ``` 38 | $ pip3 install virtualenv 39 | ``` 40 | 41 | 3. Create a virtual environment and activate it. 42 | ``` 43 | $ virtualenv -p python3 venv 44 | $ source venv/bin/activate 45 | ``` 46 | 47 | From this point onwards commands run inside the virtual environment. 48 | 49 | 50 | 3. Install dependencies and code from `src` folder in editable mode, so you can experiment with the code. 51 | ``` 52 | $ (venv) pip install -r requirements.txt 53 | $ (venv) export PYTHONPATH="." 54 | ``` 55 | 56 | 4. Open the notebooks, either with good old Jupyter or Jupyter lab 57 | ``` 58 | $ (venv) jupyter notebook 59 | ``` 60 | ``` 61 | $ (venv) jupyter lab 62 | ``` 63 | If both launch commands fail, try these: 64 | ``` 65 | $ (venv) jupyter notebook --NotebookApp.use_redirect_file=False 66 | ``` 67 | ``` 68 | $ (venv) jupyter lab --NotebookApp.use_redirect_file=False 69 | ``` 70 | 71 | 5. Play and learn. And do the homework 😉. 72 | 73 | ## Notebooks 74 | 75 | - [Random agent baseline](notebooks/01_random_agent_baseline.ipynb) 76 | - [Policy gradients with rewards as weights](notebooks/02_vanilla_policy_gradient_with_rewards_as_weights.ipynb) 77 | - [Policy gradients with rewards-to-go as weights](notebooks/03_vanilla_policy_gradient_with_rewards_to_go_as_weights.ipynb) 78 | - [Homework](notebooks/04_homework.ipynb) 79 | 80 | ## Let's connect! 81 | 82 | Do you wanna become a PRO in Machine Learning? 
83 | 84 | 👉🏽 Subscribe to the [datamachines newsletter](https://datamachines.xyz/subscribe/) 🧠 85 | 86 | 👉🏽 Follow me on [Twitter](https://twitter.com/paulabartabajo_) and [LinkedIn](https://www.linkedin.com/in/pau-labarta-bajo-4432074b/) 💡 87 | -------------------------------------------------------------------------------- /04_lunar_lander/notebooks/04_homework.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "f0fd6807", 6 | "metadata": {}, 7 | "source": [ 8 | "# 04 Homework 🏋️🏋️🏋️" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "id": "abcf6613", 14 | "metadata": {}, 15 | "source": [ 16 | "#### 👉A course without homework is not a course!\n", 17 | "\n", 18 | "#### 👉Spend some time thinking and trying to implement the challenges I propose here.\n", 19 | "\n", 20 | "#### 👉Feel free to email me your solutions at:" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "id": "d1d983a3", 26 | "metadata": {}, 27 | "source": [ 28 | "# `plabartabajo@gmail.com`" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "id": "86f82e45", 34 | "metadata": {}, 35 | "source": [ 36 | "-----" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "id": "67656662", 42 | "metadata": {}, 43 | "source": [ 44 | "## 1. Can you find a smaller network that solves this environment?\n", 45 | "\n", 46 | "I used one hidden layer with 64 units, but I have the feeling this was an overkill." 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "id": "c0a46bf7", 52 | "metadata": {}, 53 | "source": [ 54 | "## 2. Can you speed up converge by properly tunning the `batch_size`?" 55 | ] 56 | } 57 | ], 58 | "metadata": { 59 | "kernelspec": { 60 | "display_name": "Python 3 (ipykernel)", 61 | "language": "python", 62 | "name": "python3" 63 | }, 64 | "language_info": { 65 | "codemirror_mode": { 66 | "name": "ipython", 67 | "version": 3 68 | }, 69 | "file_extension": ".py", 70 | "mimetype": "text/x-python", 71 | "name": "python", 72 | "nbconvert_exporter": "python", 73 | "pygments_lexer": "ipython3", 74 | "version": "3.8.10" 75 | } 76 | }, 77 | "nbformat": 4, 78 | "nbformat_minor": 5 79 | } 80 | -------------------------------------------------------------------------------- /04_lunar_lander/pyproject.toml: -------------------------------------------------------------------------------- 1 | [tool.poetry] 2 | name = "src" 3 | version = "0.1.0" 4 | description = "" 5 | authors = ["Pau "] 6 | 7 | [tool.poetry.dependencies] 8 | python = ">=3.8,<3.11" 9 | numpy = "^1.22.3" 10 | torch = "^1.11.0" 11 | scipy = "^1.8.0" 12 | Box2D = "^2.3.10" 13 | box2d-py = "^2.3.8" 14 | gym = "0.17.2" 15 | tensorboard = "^2.8.0" 16 | tqdm = "^4.64.0" 17 | jupyter = "^1.0.0" 18 | matplotlib = "^3.5.1" 19 | pandas = "^1.4.2" 20 | pyglet = "1.5.0" 21 | 22 | [tool.poetry.dev-dependencies] 23 | pytest = "^5.2" 24 | 25 | [build-system] 26 | requires = ["poetry-core>=1.0.0"] 27 | build-backend = "poetry.core.masonry.api" 28 | -------------------------------------------------------------------------------- /04_lunar_lander/requirements.txt: -------------------------------------------------------------------------------- 1 | absl-py==1.0.0; python_version >= "3.6" 2 | appnope==0.1.3; platform_system == "Darwin" and python_version >= "3.8" and sys_platform == "darwin" 3 | argon2-cffi-bindings==21.2.0; python_version >= "3.6" 4 | argon2-cffi==21.3.0; python_version >= "3.6" 5 | asttokens==2.0.5; python_version >= "3.8" 6 | atomicwrites==1.4.0; 
python_version >= "3.5" and python_full_version < "3.0.0" and sys_platform == "win32" or sys_platform == "win32" and python_version >= "3.5" and python_full_version >= "3.4.0" 7 | attrs==21.4.0; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.7" 8 | backcall==0.2.0; python_version >= "3.8" 9 | beautifulsoup4==4.11.1; python_full_version >= "3.6.0" and python_version >= "3.7" 10 | bleach==5.0.0; python_version >= "3.7" 11 | box2d-py==2.3.8 12 | box2d==2.3.10 13 | cachetools==5.0.0; python_version >= "3.7" and python_version < "4.0" and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6") 14 | certifi==2021.10.8; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6" 15 | cffi==1.15.0; implementation_name == "pypy" and python_version >= "3.6" 16 | charset-normalizer==2.0.12; python_full_version >= "3.6.0" and python_version >= "3.6" 17 | cloudpickle==1.2.2; python_version >= "3.5" 18 | colorama==0.4.4; python_version >= "3.8" and python_full_version < "3.0.0" and platform_system == "Windows" and sys_platform == "win32" or python_full_version >= "3.5.0" and platform_system == "Windows" and sys_platform == "win32" and python_version >= "3.8" 19 | cycler==0.11.0; python_version >= "3.7" 20 | debugpy==1.6.0; python_version >= "3.7" 21 | decorator==5.1.1; python_version >= "3.8" 22 | defusedxml==0.7.1; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.7" 23 | entrypoints==0.4; python_version >= "3.7" 24 | executing==0.8.3; python_version >= "3.8" 25 | fastjsonschema==2.15.3; python_version >= "3.7" 26 | fonttools==4.32.0; python_version >= "3.7" 27 | future==0.18.2; python_version >= "3.5" and python_full_version < "3.0.0" or python_full_version >= "3.3.0" and python_version >= "3.5" 28 | google-auth-oauthlib==0.4.6; python_version >= "3.6" 29 | google-auth==2.6.3; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6" 30 | grpcio==1.45.0; python_version >= "3.6" 31 | gym==0.17.2; python_version >= "3.5" 32 | idna==3.3; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6" 33 | importlib-metadata==4.11.3; python_version < "3.10" and python_version >= "3.7" 34 | importlib-resources==5.7.1; python_version < "3.9" and python_version >= "3.7" 35 | ipykernel==6.13.0; python_version >= "3.7" 36 | ipython-genutils==0.2.0; python_version >= "3.7" 37 | ipython==8.2.0; python_version >= "3.8" 38 | ipywidgets==7.7.0 39 | jedi==0.18.1; python_version >= "3.8" 40 | jinja2==3.1.1; python_version >= "3.7" 41 | jsonschema==4.4.0; python_version >= "3.7" 42 | jupyter-client==7.2.2; python_full_version >= "3.7.0" and python_version >= "3.7" 43 | jupyter-console==6.4.3; python_version >= "3.6" 44 | jupyter-core==4.10.0; python_version >= "3.7" 45 | jupyter==1.0.0 46 | jupyterlab-pygments==0.2.2; python_version >= "3.7" 47 | jupyterlab-widgets==1.1.0; python_version >= "3.6" 48 | kiwisolver==1.4.2; python_version >= "3.7" 49 | markdown==3.3.6; python_version >= "3.6" 50 | markupsafe==2.1.1; python_version >= "3.7" 51 | matplotlib-inline==0.1.3; python_version >= "3.8" 52 | matplotlib==3.5.1; python_version >= "3.7" 53 | mistune==0.8.4; python_version >= "3.7" 54 | more-itertools==8.12.0; python_version >= "3.5" 55 | 
nbclient==0.6.0; python_full_version >= "3.7.0" and python_version >= "3.7" 56 | nbconvert==6.5.0; python_version >= "3.7" 57 | nbformat==5.3.0; python_full_version >= "3.7.0" and python_version >= "3.7" 58 | nest-asyncio==1.5.5; python_full_version >= "3.7.0" and python_version >= "3.7" 59 | notebook==6.4.10; python_version >= "3.6" 60 | numpy==1.22.3; python_version >= "3.8" 61 | oauthlib==3.2.0; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6" 62 | packaging==21.3; python_version >= "3.7" 63 | pandas==1.4.2; python_version >= "3.8" 64 | pandocfilters==1.5.0; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.7" 65 | parso==0.8.3; python_version >= "3.8" 66 | pexpect==4.8.0; sys_platform != "win32" and python_version >= "3.8" 67 | pickleshare==0.7.5; python_version >= "3.8" 68 | pillow==9.0.1; python_version >= "3.7" 69 | pluggy==0.13.1; python_version >= "3.5" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.5" 70 | prometheus-client==0.14.1; python_version >= "3.6" 71 | prompt-toolkit==3.0.29; python_full_version >= "3.6.2" and python_version >= "3.8" 72 | protobuf==3.20.0; python_version >= "3.7" 73 | psutil==5.9.0; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.7" 74 | ptyprocess==0.7.0; os_name != "nt" and python_version >= "3.8" and sys_platform != "win32" 75 | pure-eval==0.2.2; python_version >= "3.8" 76 | py==1.11.0; python_version >= "3.6" and python_full_version < "3.0.0" and implementation_name == "pypy" or python_full_version >= "3.5.0" and python_version >= "3.6" and implementation_name == "pypy" 77 | pyasn1-modules==0.2.8; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6" 78 | pyasn1==0.4.8; python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "4" and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6") or python_full_version >= "3.6.0" and python_version >= "3.6" and python_version < "4" and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6") 79 | pycparser==2.21; python_version >= "3.6" and python_full_version < "3.0.0" and implementation_name == "pypy" or implementation_name == "pypy" and python_version >= "3.6" and python_full_version >= "3.4.0" 80 | pyglet==1.5.0 81 | pygments==2.11.2; python_version >= "3.8" 82 | pyparsing==3.0.7; python_version >= "3.7" 83 | pyrsistent==0.18.1; python_version >= "3.7" 84 | pytest==5.4.3; python_version >= "3.5" 85 | python-dateutil==2.8.2; python_version >= "3.8" and python_full_version < "3.0.0" or python_full_version >= "3.3.0" and python_version >= "3.8" 86 | pytz==2022.1; python_version >= "3.8" 87 | pywin32==303; sys_platform == "win32" and platform_python_implementation != "PyPy" and python_version >= "3.7" 88 | pywinpty==2.0.5; os_name == "nt" and python_version >= "3.7" 89 | pyzmq==22.3.0; python_version >= "3.7" 90 | qtconsole==5.3.0; python_version >= "3.7" 91 | qtpy==2.0.1; python_version >= "3.7" 92 | requests-oauthlib==1.3.1; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6" 93 | requests==2.27.1; python_version >= "3.6" and python_full_version < "3.0.0" or 
python_full_version >= "3.6.0" and python_version >= "3.6" 94 | rsa==4.8; python_version >= "3.6" and python_version < "4" and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6") 95 | scipy==1.8.0; python_version >= "3.8" and python_version < "3.11" 96 | send2trash==1.8.0; python_version >= "3.6" 97 | setuptools-scm==6.4.2; python_version >= "3.7" 98 | six==1.16.0; python_version >= "3.8" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.8" 99 | soupsieve==2.3.2.post1; python_full_version >= "3.6.0" and python_version >= "3.7" 100 | stack-data==0.2.0; python_version >= "3.8" 101 | tensorboard-data-server==0.6.1; python_version >= "3.6" 102 | tensorboard-plugin-wit==1.8.1; python_version >= "3.6" 103 | tensorboard==2.8.0; python_version >= "3.6" 104 | terminado==0.13.3; python_version >= "3.7" 105 | tinycss2==1.1.1; python_version >= "3.7" 106 | tomli==2.0.1; python_version >= "3.7" 107 | torch==1.11.0; python_full_version >= "3.7.0" 108 | tornado==6.1; python_version >= "3.7" 109 | tqdm==4.64.0; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.4.0") 110 | traitlets==5.1.1; python_full_version >= "3.7.0" and python_version >= "3.8" 111 | typing-extensions==4.1.1; python_version >= "3.6" and python_full_version >= "3.7.0" 112 | urllib3==1.26.9; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version < "4" and python_version >= "3.6" 113 | wcwidth==0.2.5; python_full_version >= "3.6.2" and python_version >= "3.6" 114 | webencodings==0.5.1; python_version >= "3.7" 115 | werkzeug==2.1.1; python_version >= "3.7" 116 | widgetsnbextension==3.6.0 117 | zipp==3.8.0; python_version < "3.9" and python_version >= "3.7" 118 | -------------------------------------------------------------------------------- /04_lunar_lander/saved_agents/readme.md: -------------------------------------------------------------------------------- 1 | ### Trained agents are saved in this folder -------------------------------------------------------------------------------- /04_lunar_lander/src/config.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pathlib 3 | root_dir = pathlib.Path(__file__).parent.resolve().parent 4 | 5 | SAVED_AGENTS_DIR = root_dir / 'saved_agents' 6 | TENSORBOARD_LOG_DIR = root_dir / 'tensorboard_logs' 7 | # OPTUNA_DB = root_dir / 'optuna.db' 8 | # DATA_SUPERVISED_ML = root_dir / 'data_supervised_ml' 9 | # MLFLOW_RUNS_DIR = root_dir / 'mlflow_runs' 10 | 11 | if not SAVED_AGENTS_DIR.exists(): 12 | os.makedirs(SAVED_AGENTS_DIR) 13 | 14 | if not TENSORBOARD_LOG_DIR.exists(): 15 | os.makedirs(TENSORBOARD_LOG_DIR) 16 | 17 | # if not DATA_SUPERVISED_ML.exists(): 18 | # os.makedirs(DATA_SUPERVISED_ML) 19 | # 20 | # if not MLFLOW_RUNS_DIR.exists(): 21 | # os.makedirs(MLFLOW_RUNS_DIR) -------------------------------------------------------------------------------- /04_lunar_lander/src/evaluation.py: -------------------------------------------------------------------------------- 1 | from typing import Optional, Tuple, List 2 | from tqdm import tqdm 3 | 4 | import torch 5 | 6 | 7 | def evaluate( 8 | agent, 9 | env, 10 | n_episodes: int, 11 | seed: Optional[int] = 0, 12 | ) -> Tuple[List, List]: 13 | 14 | # from src.utils import set_seed 15 | # set_seed(env, seed) 16 | 17 | # output metrics 18 | reward_per_episode = [] 19 | steps_per_episode = [] 20 
| 21 | for i in tqdm(range(0, n_episodes)): 22 | 23 | state = env.reset() 24 | rewards = 0 25 | steps = 0 26 | done = False 27 | while not done: 28 | 29 | action = agent.act(torch.as_tensor(state, dtype=torch.float32)) 30 | 31 | next_state, reward, done, info = env.step(action) 32 | 33 | rewards += reward 34 | steps += 1 35 | state = next_state 36 | 37 | reward_per_episode.append(rewards) 38 | steps_per_episode.append(steps) 39 | 40 | return reward_per_episode, steps_per_episode -------------------------------------------------------------------------------- /04_lunar_lander/src/model_factory.py: -------------------------------------------------------------------------------- 1 | from typing import Optional, List 2 | from pdb import set_trace as stop 3 | 4 | import torch.nn as nn 5 | 6 | 7 | def get_model( 8 | input_dim: int, 9 | output_dim: int, 10 | hidden_layers: Optional[List[int]] = None, 11 | ): 12 | """ 13 | Feed-forward network, made of linear layers with ReLU activation functions 14 | The number of layers, and their size is given by `hidden_layers`. 15 | """ 16 | # assert init_method in {'default', 'xavier'} 17 | 18 | if hidden_layers is None: 19 | # linear model 20 | model = nn.Sequential(nn.Linear(input_dim, output_dim)) 21 | 22 | else: 23 | # neural network 24 | # there are hidden layers in this case. 25 | dims = [input_dim] + hidden_layers + [output_dim] 26 | modules = [] 27 | for i, dim in enumerate(dims[:-2]): 28 | modules.append(nn.Linear(dims[i], dims[i + 1])) 29 | modules.append(nn.ReLU()) 30 | 31 | modules.append(nn.Linear(dims[-2], dims[-1])) 32 | model = nn.Sequential(*modules) 33 | # stop() 34 | 35 | # n_parameters = sum(p.numel() for p in model.parameters() if p.requires_grad) 36 | # print(f'{n_parameters:,} parameters') 37 | 38 | return model 39 | 40 | def count_parameters(model: nn.Module) -> int: 41 | """""" 42 | return sum(p.numel() for p in model.parameters() if p.requires_grad) -------------------------------------------------------------------------------- /04_lunar_lander/src/utils.py: -------------------------------------------------------------------------------- 1 | import os 2 | from typing import Callable, Dict, Tuple, List 3 | import pathlib 4 | from pathlib import Path 5 | import json 6 | from pdb import set_trace as stop 7 | 8 | import torch.nn as nn 9 | from torch.utils.tensorboard import SummaryWriter 10 | 11 | from src.config import TENSORBOARD_LOG_DIR, SAVED_AGENTS_DIR 12 | 13 | 14 | # def snake_to_camel(word): 15 | # import re 16 | # return ''.join(x.capitalize() or '_' for x in word.split('_')) 17 | 18 | def get_agent_id(env_name: str) -> str: 19 | """""" 20 | dir = Path(SAVED_AGENTS_DIR) / env_name 21 | if not dir.exists(): 22 | os.makedirs(dir) 23 | 24 | ids = [] 25 | for id in os.listdir(dir): 26 | try: 27 | ids.append(int(id)) 28 | except: 29 | pass 30 | if len(ids) > 0: 31 | agent_id = max(ids) + 1 32 | else: 33 | agent_id = 0 34 | # stop() 35 | 36 | return str(agent_id) 37 | 38 | 39 | def set_seed( 40 | env, 41 | seed 42 | ): 43 | """To ensure reproducible runs we fix the seed for different libraries""" 44 | import random 45 | random.seed(seed) 46 | 47 | import numpy as np 48 | np.random.seed(seed) 49 | 50 | env.seed(seed) 51 | env.action_space.seed(seed) 52 | 53 | import torch 54 | torch.manual_seed(seed) 55 | 56 | # Deterministic operations for CuDNN, it may impact performances 57 | torch.backends.cudnn.deterministic = True 58 | torch.backends.cudnn.benchmark = False 59 | 60 | def get_num_model_parameters(model: nn.Module) -> int: 61 | 
return sum(p.numel() for p in model.parameters() if p.requires_grad) 62 | 63 | def get_logger(env_name: str, agent_id: str) -> SummaryWriter: 64 | return SummaryWriter(TENSORBOARD_LOG_DIR / env_name / agent_id) 65 | 66 | def get_model_path(env_name: str, agent_id: str) -> Path: 67 | """ 68 | Returns path where we save train artifacts, including: 69 | -> the policy network weights 70 | -> json with hyperparameters 71 | """ 72 | return SAVED_AGENTS_DIR / env_name / agent_id 73 | 74 | 75 | -------------------------------------------------------------------------------- /04_lunar_lander/src/viz.py: -------------------------------------------------------------------------------- 1 | from time import sleep 2 | from argparse import ArgumentParser 3 | from pdb import set_trace as stop 4 | from typing import Optional 5 | 6 | import pandas as pd 7 | import gym 8 | 9 | from src.config import SAVED_AGENTS_DIR 10 | 11 | import numpy as np 12 | 13 | def make_video(agent): 14 | 15 | import gym 16 | from gym.wrappers import Monitor 17 | env = Monitor(gym.make('CartPole-v0'), './video', force=True) 18 | 19 | state = env.reset() 20 | done = False 21 | 22 | while not done: 23 | # action = env.action_space.sample() 24 | import torch 25 | action = agent.act(torch.as_tensor(state, dtype=torch.float32)) 26 | state_next, reward, done, info = env.step(action) 27 | env.close() 28 | 29 | 30 | def show_video( 31 | agent, 32 | env, 33 | sleep_sec: float = 0.1, 34 | seed: Optional[int] = 0, 35 | mode: str = "rgb_array" 36 | ): 37 | 38 | env.seed(seed) 39 | state = env.reset() 40 | 41 | # LAPADULA 42 | if mode == "rgb_array": 43 | from matplotlib import pyplot as plt 44 | from IPython.display import display, clear_output 45 | steps = 0 46 | fig, ax = plt.subplots(figsize=(8, 6)) 47 | 48 | done = False 49 | while not done: 50 | 51 | import torch 52 | action = agent.act(torch.as_tensor(state, dtype=torch.float32)) 53 | 54 | state, reward, done, info = env.step(action) 55 | 56 | # LAPADULA 57 | if mode == "rgb_array": 58 | steps += 1 59 | frame = env.render(mode=mode) 60 | ax.cla() 61 | ax.axes.yaxis.set_visible(False) 62 | ax.imshow(frame) 63 | ax.set_title(f'Steps: {steps}') 64 | display(fig) 65 | clear_output(wait=True) 66 | plt.pause(sleep_sec) 67 | else: 68 | env.render() 69 | sleep(sleep_sec) 70 | 71 | 72 | if __name__ == '__main__': 73 | 74 | parser = ArgumentParser() 75 | parser.add_argument('--agent_id', type=str, required=True) 76 | parser.add_argument('--sleep_sec', type=float, required=False, default=0.1) 77 | args = parser.parse_args() 78 | 79 | from src.base_agent import BaseAgent 80 | agent_path = SAVED_AGENTS_DIR / args.agent_file 81 | agent = BaseAgent.load_from_disk(agent_path) 82 | 83 | from src.q_agent import QAgent 84 | 85 | 86 | env = gym.make('CartPole-v1') 87 | # env._max_episode_steps = 1000 88 | 89 | show_video(agent, env, sleep_sec=args.sleep_sec) 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | -------------------------------------------------------------------------------- /04_lunar_lander/src/vpg_agent.py: -------------------------------------------------------------------------------- 1 | import os 2 | from typing import List, Optional, Tuple 3 | from pathlib import Path 4 | import json 5 | from pdb import set_trace as stop 6 | 7 | from tqdm import tqdm 8 | import numpy as np 9 | import torch 10 | from torch.optim import Adam 11 | from torch.utils.tensorboard import SummaryWriter 12 | from torch.distributions.categorical import Categorical 13 | import gym 14 | 15 | from src.model_factory import 
get_model 16 | from src.utils import ( 17 | get_agent_id, 18 | set_seed, 19 | get_num_model_parameters, 20 | get_logger, get_model_path 21 | ) 22 | from src.config import TENSORBOARD_LOG_DIR, SAVED_AGENTS_DIR 23 | 24 | def reward_to_go(rews): 25 | 26 | n = len(rews) 27 | rtgs = np.zeros_like(rews) 28 | for i in reversed(range(n)): 29 | rtgs[i] = rews[i] + (rtgs[i+1] if i+1 < n else 0) 30 | return rtgs 31 | 32 | 33 | class VPGAgent: 34 | 35 | def __init__( 36 | self, 37 | env_name: str = 'LunarLander-v2', 38 | learning_rate: float = 3e-4, 39 | hidden_layers: List[int] = [32], 40 | gradient_weights: str = 'rewards' 41 | ): 42 | assert gradient_weights in {'rewards', 'rewards-to-go'} 43 | 44 | self.env_name = env_name 45 | self.env = gym.make(env_name) 46 | self.obs_dim = self.env.observation_space.shape[0] 47 | self.act_dim = self.env.action_space.n 48 | 49 | # stochastic policy network 50 | # the outputs of this network are un-normalized probabilities for each 51 | # action (aka logits) 52 | self.policy_net = get_model(input_dim=self.obs_dim, 53 | output_dim=self.act_dim, 54 | hidden_layers=hidden_layers) 55 | print(f'Policy network with {get_num_model_parameters(self.policy_net):,} parameters') 56 | print(self.policy_net) 57 | 58 | 59 | self.optimizer = Adam(self.policy_net.parameters(), lr=learning_rate) 60 | 61 | self.gradient_weights = gradient_weights 62 | 63 | self.hparams = { 64 | 'learning_rate': learning_rate, 65 | 'hidden_layers': hidden_layers, 66 | 'gradient_weights': gradient_weights, 67 | 68 | } 69 | 70 | def act(self, obs: torch.Tensor): 71 | """ 72 | Action selection function (outputs int actions, sampled from policy) 73 | """ 74 | return self._get_policy(obs).sample().item() 75 | 76 | def train( 77 | self, 78 | n_policy_updates: int = 1000, 79 | batch_size: int = 4000, 80 | logger: Optional[SummaryWriter] = None, 81 | model_path: Optional[Path] = None, 82 | seed: Optional[int] = 0, 83 | freq_eval: Optional[int] = 10, 84 | ): 85 | """ 86 | """ 87 | total_steps = 0 88 | save_model = True if model_path is not None else False 89 | 90 | best_avg_reward = -np.inf 91 | 92 | # fix seeds to ensure reproducible training runs 93 | set_seed(self.env, seed) 94 | 95 | for i in range(n_policy_updates): 96 | 97 | # use current policy to collect trajectories 98 | states, actions, weights, rewards = self._collect_trajectories(n_samples=batch_size) 99 | 100 | # one step of gradient ascent to update policy parameters 101 | loss = self._update_parameters(states, actions, weights) 102 | 103 | # log epoch metrics 104 | print('epoch: %3d \t loss: %.3f \t reward: %.3f' % (i, loss, np.mean(rewards))) 105 | if logger is not None: 106 | # we use total_steps instead of epoch to render all plots in Tensorboard comparable 107 | # Agents wit different batch_size (aka steps_per_epoch) are fairly compared this way. 108 | total_steps += batch_size 109 | logger.add_scalar('train/loss', loss, total_steps) 110 | logger.add_scalar('train/episode_reward', np.mean(rewards), total_steps) 111 | 112 | # evaluate the agent on a fixed set of 100 episodes 113 | if (i + 1) % freq_eval == 0: 114 | rewards, success = self.evaluate(n_episodes=100) 115 | 116 | avg_reward = np.mean(rewards) 117 | avg_success_rate = np.mean(success) 118 | if save_model and (avg_reward > best_avg_reward): 119 | self.save_to_disk(model_path) 120 | print(f'Best model! 
Average reward = {avg_reward:.2f}, Success rate = {avg_success_rate:.2%}') 121 | 122 | best_avg_reward = avg_reward 123 | 124 | def evaluate(self, n_episodes: Optional[int] = 100, seed: Optional[int] = 1234) -> Tuple[List[float], List[float]]: 125 | """ 126 | """ 127 | # output metrics 128 | reward_per_episode = [] 129 | success_per_episode = [] 130 | 131 | # fix seed 132 | self.env.seed(seed) 133 | self.env.action_space.seed(seed) 134 | 135 | for i in tqdm(range(0, n_episodes)): 136 | 137 | state = self.env.reset() 138 | rewards = 0 139 | done = False 140 | reward = None 141 | while not done: 142 | 143 | action = self.act(torch.as_tensor(state, dtype=torch.float32)) 144 | 145 | next_state, reward, done, info = self.env.step(action) 146 | rewards += reward 147 | 148 | state = next_state 149 | 150 | reward_per_episode.append(rewards) 151 | success_per_episode.append(1 if reward > 0 else 0) 152 | 153 | return reward_per_episode, success_per_episode 154 | 155 | def _collect_trajectories(self, n_samples: int): 156 | 157 | # make some empty lists for logging. 158 | batch_obs = [] # for observations 159 | batch_acts = [] # for actions 160 | batch_weights = [] # for reward-to-go weighting in policy gradient 161 | batch_rets = [] # for measuring episode returns 162 | batch_lens = [] # for measuring episode lengths 163 | 164 | # reset episode-specific variables 165 | obs = self.env.reset() # first obs comes from starting distribution 166 | done = False # signal from environment that episode is over 167 | ep_rews = [] # list for rewards accrued throughout ep 168 | 169 | # collect experience by acting in the environment with current policy 170 | while True: 171 | 172 | # save obs 173 | batch_obs.append(obs.copy()) 174 | 175 | # act in the environment 176 | # act = get_action(torch.as_tensor(obs, dtype=torch.float32)) 177 | action = self.act(torch.as_tensor(obs, dtype=torch.float32)) 178 | obs, rew, done, _ = self.env.step(action) 179 | 180 | # save action, reward 181 | batch_acts.append(action) 182 | ep_rews.append(rew) 183 | 184 | if done: 185 | # if episode is over, record info about episode 186 | ep_ret, ep_len = sum(ep_rews), len(ep_rews) 187 | batch_rets.append(ep_ret) 188 | batch_lens.append(ep_len) 189 | 190 | # the weight for each logprob(a_t|s_t) is reward-to-go from t 191 | if self.gradient_weights == 'rewards': 192 | # the weight for each logprob(a|s) is the total reward for the episode 193 | batch_weights += [ep_ret] * ep_len 194 | elif self.gradient_weights == 'rewards-to-go': 195 | # the weight for each logprob(a|s) is the total reward AFTER the action is taken 196 | batch_weights += list(reward_to_go(ep_rews)) 197 | else: 198 | raise NotImplemented 199 | 200 | # reset episode-specific variables 201 | obs, done, ep_rews = self.env.reset(), False, [] 202 | 203 | # end experience loop if we have enough of it 204 | if len(batch_obs) > n_samples: 205 | break 206 | 207 | return batch_obs, batch_acts, batch_weights, batch_rets 208 | 209 | def _update_parameters(self, states, actions, weights) -> float: 210 | """ 211 | One step of policy gradient update 212 | """ 213 | self.optimizer.zero_grad() 214 | 215 | loss = self._compute_loss( 216 | obs=torch.as_tensor(states, dtype=torch.float32), 217 | act=torch.as_tensor(actions, dtype=torch.int32), 218 | weights=torch.as_tensor(weights, dtype=torch.float32) 219 | ) 220 | 221 | # compute gradients 222 | loss.backward() 223 | 224 | # update parameters with Adam 225 | self.optimizer.step() 226 | 227 | return loss.item() 228 | 229 | def 
_compute_loss(self, obs, act, weights): 230 | logp = self._get_policy(obs).log_prob(act) 231 | return -(logp * weights).mean() 232 | 233 | def _get_policy(self, obs): 234 | """ 235 | Get action distribution given the current policy 236 | """ 237 | logits = self.policy_net(obs) 238 | return Categorical(logits=logits) 239 | 240 | @classmethod 241 | def load_from_disk(cls, env_name: str, path: Path): 242 | """ 243 | We recover all necessary variables to be able to evaluate the agent. 244 | 245 | NOTE: training state is not stored, so it is not possible to resume 246 | an interrupted training run as it was. 247 | """ 248 | # load hyper-params 249 | with open(path / 'hparams.json', 'r') as f: 250 | hparams = json.load(f) 251 | 252 | # generate Python object 253 | agent = cls(env_name, **hparams) 254 | 255 | agent.policy_net = torch.load(path / 'model') 256 | agent.policy_net.eval() 257 | 258 | return agent 259 | 260 | def save_to_disk(self, path: Path) -> None: 261 | """""" 262 | if not path.exists(): 263 | os.makedirs(path) 264 | 265 | # save hyper-parameters in a json file 266 | with open(path / 'hparams.json', 'w') as f: 267 | json.dump(self.hparams, f) 268 | 269 | # save main model 270 | torch.save(self.policy_net, path / 'model') 271 | 272 | 273 | if __name__ == '__main__': 274 | 275 | import argparse 276 | parser = argparse.ArgumentParser() 277 | parser.add_argument('--env', type=str, default='CartPole-v0') 278 | parser.add_argument('--n_policy_updates', type=int, default=1000) 279 | parser.add_argument('--batch_size', type=int, default=128) 280 | parser.add_argument('--gradient_weights', type=str, default='rewards') 281 | parser.add_argument('--lr', type=float, default=3e-4) 282 | parser.add_argument("--hidden_layers", type=int, nargs="+",) 283 | parser.add_argument("--freq_eval", type=int) 284 | args = parser.parse_args() 285 | 286 | vpg_agent = VPGAgent( 287 | env_name=args.env, 288 | gradient_weights=args.gradient_weights, 289 | learning_rate=args.lr, 290 | hidden_layers=args.hidden_layers, 291 | ) 292 | 293 | # generate a unique agent_id, that we later use to save results to disk, as 294 | # well as TensorBoard logs. 
295 | agent_id = get_agent_id(args.env) 296 | print(f'agent_id = {agent_id}') 297 | 298 | # tensorboard logger to see training curves 299 | logger = get_logger(env_name=args.env, agent_id=agent_id) 300 | 301 | # path to save policy network weights 302 | model_path = get_model_path(env_name=args.env, agent_id=agent_id) 303 | 304 | # start training 305 | vpg_agent.train( 306 | n_policy_updates=args.n_policy_updates, 307 | batch_size=args.batch_size, 308 | logger=logger, 309 | model_path=model_path, 310 | freq_eval=args.freq_eval, 311 | ) -------------------------------------------------------------------------------- /04_lunar_lander/tensorboard_logs/readme.md: -------------------------------------------------------------------------------- 1 | ### Tensorboard logs for each train run are stored in this folder 2 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2022 Pau Labarta Bajo 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 |
2 | The Hands-on Reinforcement Learning course 🚀 3 | From zero to HERO 🦸🏻‍🦸🏽 4 | Out of intense complexities, intense simplicities emerge. 5 | -- Winston Churchill 6 |
7 | 8 | ![](http://datamachines.xyz/wp-content/uploads/2021/11/PHOTO-2021-11-05-13-54-11.jpg) 9 | 10 | [![Twitter Follow](https://img.shields.io/twitter/follow/paulabartabajo_?label=Follow&style=social)](https://twitter.com/paulabartabajo_) 11 | 12 | ## Contents 13 | 14 | * [Welcome to the course](#welcome-to-the-course-) 15 | * [Lectures](#lectures) 16 | * [Wanna contribute?](#wanna-contribute) 17 | * [Let's connect!](#lets-connect) 18 | 19 | ## Welcome to the course 🤗❤️ 20 | 21 | Welcome to my step by step hands-on-course that will take you from basic reinforcement learning to cutting-edge deep RL. 22 | 23 | We will start with a short intro of what RL is, what is it used for, and how does the landscape of current 24 | RL algorithms look like. 25 | 26 | Then, in each following chapter we will solve a different problem, with increasing difficulty: 27 | - 🏆 easy 28 | - 🏆🏆 medium 29 | - 🏆🏆🏆 hard 30 | 31 | Ultimately, the most complex RL problems involve a mixture of reinforcement learning algorithms, optimizations and Deep Learning techniques. 32 | 33 | You do not need to know deep learning (DL) to follow along this course. 34 | 35 | I will give you enough context to get you familiar with DL philosophy and understand 36 | how it becomes a crucial ingredient in modern reinforcement learning. 37 | 38 | ## Lectures 39 | 40 | 0. [Introduction to Reinforcement Learning](https://datamachines.xyz/2021/11/17/hands-on-reinforcement-learning-course-part-1/) 41 | 1. [Q-learning to drive a taxi 🏆](01_taxi/README.md) 42 | 2. [SARSA to beat gravity 🏆](02_mountain_car/README.md) 43 | 3. [Parametric Q learning to keep the balance 💃 🏆](03_cart_pole/README.md) 44 | 4. [Policy gradients to land on the Moon 🏆](04_lunar_lander/README.md) 45 | 46 | ## Wanna contribute? 47 | 48 | There are 2 things you can do to contribute to this course: 49 | 50 | 1. Spread the word and share it on [Twitter](https://ctt.ac/Aa7dt), [LinkedIn](https://www.linkedin.com/shareArticle?mini=true&url=http%3A//datamachines.xyz/the-hands-on-reinforcement-learning-course-page/&title=The%20hands-on%20Reinforcement%20Learning%20course&summary=Wanna%20learn%20Reinforcement%20Learning?%20%F0%9F%A4%94%0A%40paulabartabajo%20has%20a%20course%20on%20%23reinforcementlearning,%20that%20takes%20you%20from%20zero%20to%20PRO%20%F0%9F%A6%B8%F0%9F%8F%BB%E2%80%8D%F0%9F%A6%B8%F0%9F%8F%BD.%0A%0A%F0%9F%91%89%F0%9F%8F%BD%20With%20lots%20of%20Python%0A%F0%9F%91%89%F0%9F%8F%BD%20Intuitions,%20tips%20%26%20tricks%20explained.%0A%F0%9F%91%89%F0%9F%8F%BD%20And%20free,%20by%20the%20way.%0A%0AReady%20to%20start?%20Click%20%F0%9F%91%87%F0%9F%8F%BD%F0%9F%91%87%F0%9F%8F%BE%F0%9F%91%87%F0%9F%8F%BF%0A%0A%23MachineLearning&source=) 51 | 52 | 2. Open a [pull request](https://github.com/Paulescu/hands-on-rl/pulls) to fix a bug or improve the code readability. 53 | 54 | ### Thanks ❤️ 55 | Special thanks to all the students who contributed with valuable feedback 56 | and pull requests ❤ 57 | 58 | - [Neria Uzan](https://www.linkedin.com/in/neria-uzan-369803107/) 59 | - [Anthony Lapadula](https://www.linkedin.com/in/anthony-lapadula-9343a5b/) 60 | - [Petar Sekulić](https://www.linkedin.com/in/petar-sekulic-ml/) 61 | 62 | ## Let's connect! 63 | 64 | 👉🏽 Subscribe for **FREE** to the [Real-World ML newsletter](https://realworldml.net/subscribe/) 🧠 65 | 66 | 👉🏽 Follow me on [Twitter](https://twitter.com/paulabartabajo_) and [LinkedIn](https://www.linkedin.com/in/pau-labarta-bajo-4432074b/) 💡 67 | --------------------------------------------------------------------------------