6 |
7 |
8 |
9 | Venice’s taxis 👆 by Helena Jankovičová Kováčová from Pexels 🙏
10 |
11 |
12 | ## Table of Contents
13 | * [Welcome 🤗](#welcome-)
14 | * [Quick setup](#quick-setup)
15 | * [Lecture transcripts](#lecture-transcripts)
16 | * [Notebooks](#notebooks)
17 | * [Let's connect](#lets-connect)
18 |
19 | ## Welcome 🤗
20 | This is part 1 of the Hands-on RL course.
21 |
22 | Let's use (tabular) Q-learning to teach an agent to solve the [Taxi-v3](https://gym.openai.com/envs/Taxi-v3/) environment
23 | from OpenAI gym.
24 |
25 | Fasten your seat belt and get ready. We are about to depart!
26 |
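If you want a sneak peek before opening the notebooks, this is the update rule the `QAgent` in `src/q_agent.py` applies after every step. A minimal, self-contained sketch; the transition values below are illustrative:

```python
import numpy as np

n_states, n_actions = 500, 6   # Taxi-v3 sizes
alpha, gamma = 0.1, 0.6        # learning rate and discount factor
q_table = np.zeros((n_states, n_actions))

# one (state, action, reward, next_state) transition, values are illustrative
state, action, reward, next_state = 123, 0, -1, 223

# tabular Q-learning update (the same rule as QAgent.update_parameters)
old_value = q_table[state, action]
next_max = np.max(q_table[next_state])
q_table[state, action] = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
```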
27 |
28 | ## Quick setup
29 |
30 | Make sure you have Python >= 3.7. Otherwise, update it.
31 |
32 | 1. Pull the code from GitHub and cd into the `01_taxi` folder:
33 | ```
34 | $ git clone https://github.com/Paulescu/hands-on-rl.git
35 | $ cd hands-on-rl/01_taxi
36 | ```
37 |
38 | 2. Make sure you have the `virtualenv` tool in your Python installation
39 | ```
40 | $ pip3 install virtualenv
41 | ```
42 |
43 | 3. Create a virtual environment and activate it.
44 | ```
45 | $ virtualenv -p python3 venv
46 | $ source venv/bin/activate
47 | ```
48 |
49 | From this point onwards, commands run inside the virtual environment.
50 |
51 |
52 | 4. Install the dependencies and add the current folder to your `PYTHONPATH`, so the code in `src` is importable and you can experiment with it.
53 | ```
54 | $ (venv) pip install -r requirements.txt
55 | $ (venv) export PYTHONPATH="."
56 | ```
57 |
58 | 5. Open the notebooks, either with good old Jupyter or JupyterLab
59 | ```
60 | $ (venv) jupyter notebook
61 | ```
62 | ```
63 | $ (venv) jupyter lab
64 | ```
65 | If both launch commands fail, try these:
66 | ```
67 | $ (venv) jupyter notebook --NotebookApp.use_redirect_file=False
68 | ```
69 | ```
70 | $ (venv) jupyter lab --NotebookApp.use_redirect_file=False
71 | ```
72 |
73 | 6. Play and learn. And do the homework 😉.
74 |
75 |
76 | ## Lecture transcripts
77 |
78 | [📝 Q learning](http://datamachines.xyz/2021/12/06/hands-on-reinforcement-learning-course-part-2-q-learning/)
79 |
80 |
81 | ## Notebooks
82 |
83 | - [Explore the environment](notebooks/00_environment.ipynb)
84 | - [Random agent baseline](notebooks/01_random_agent_baseline.ipynb)
85 | - [Q-agent](notebooks/02_q_agent.ipynb)
86 | - [Hyperparameter tuning](notebooks/03_q_agent_hyperparameters_analysis.ipynb)
87 | - [Homework](notebooks/04_homework.ipynb)
88 |
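If you prefer a plain Python script to notebooks, the same training and evaluation loop lives in `src/loops.py`. Here is a condensed sketch of its `__main__` block (run it from the `01_taxi` folder with `PYTHONPATH` set as in the Quick setup):

```python
import gym
import numpy as np

from src.q_agent import QAgent
from src.loops import train, evaluate

env = gym.make("Taxi-v3").env
agent = QAgent(env, alpha=0.1, gamma=0.6)

# train with 10% exploration, then evaluate with 5%
agent, _, _ = train(agent, env, n_episodes=10000, epsilon=0.10)
timesteps, penalties, _ = evaluate(agent, env, n_episodes=100, epsilon=0.05)

print(f'Avg steps to complete ride: {np.array(timesteps).mean()}')
```
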
89 | ## Let's connect!
90 |
91 | Do you wanna become a PRO in Machine Learning?
92 |
93 | 👉🏽 Subscribe to the [datamachines newsletter](https://datamachines.xyz/subscribe/).
94 |
95 | 👉🏽 Follow me on [Medium](https://pau-labarta-bajo.medium.com/).
96 |
97 |
98 |
99 |
100 |
101 |
--------------------------------------------------------------------------------
/01_taxi/notebooks/00_environment.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "4c2ff31f",
6 | "metadata": {},
7 | "source": [
8 | "# 00 Environment"
9 | ]
10 | },
11 | {
12 | "cell_type": "markdown",
13 | "id": "04a5c882",
14 | "metadata": {},
15 | "source": [
16 | "#### 👉Before you solve a Reinforcement Learning problem you need to define what are\n",
17 | "- the actions\n",
18 | "- the states of the world\n",
19 | "- the rewards\n",
20 | "\n",
21 | "#### 👉We are using the `Taxi-v3` environment from OpenAI's gym: https://gym.openai.com/envs/Taxi-v3/\n",
22 | "\n",
23 | "#### 👉`Taxi-v3` is an easy environment because the action space is small, and the state space is large but finite.\n",
24 | "\n",
25 | "#### 👉Environments with a finite number of actions and states are called tabular"
26 | ]
27 | },
28 | {
29 | "cell_type": "code",
30 | "execution_count": 3,
31 | "id": "e3629346",
32 | "metadata": {},
33 | "outputs": [
34 | {
35 | "name": "stdout",
36 | "output_type": "stream",
37 | "text": [
38 | "The autoreload extension is already loaded. To reload it, use:\n",
39 | " %reload_ext autoreload\n",
40 | "Populating the interactive namespace from numpy and matplotlib\n"
41 | ]
42 | }
43 | ],
44 | "source": [
45 | "%load_ext autoreload\n",
46 | "%autoreload 2\n",
47 | "%pylab inline\n",
48 | "%config InlineBackend.figure_format = 'svg'"
49 | ]
50 | },
51 | {
52 | "cell_type": "markdown",
53 | "id": "76e9a06d",
54 | "metadata": {},
55 | "source": [
56 | "## Load the environment 🌎"
57 | ]
58 | },
59 | {
60 | "cell_type": "code",
61 | "execution_count": 4,
62 | "id": "ebfba291",
63 | "metadata": {},
64 | "outputs": [],
65 | "source": [
66 | "import gym\n",
67 | "env = gym.make(\"Taxi-v3\").env"
68 | ]
69 | },
70 | {
71 | "cell_type": "markdown",
72 | "id": "1fcfc13a",
73 | "metadata": {},
74 | "source": [
75 | "## Action space"
76 | ]
77 | },
78 | {
79 | "cell_type": "code",
80 | "execution_count": 5,
81 | "id": "98cfdb84",
82 | "metadata": {},
83 | "outputs": [
84 | {
85 | "name": "stdout",
86 | "output_type": "stream",
87 | "text": [
88 | "Action Space Discrete(6)\n"
89 | ]
90 | }
91 | ],
92 | "source": [
93 | "print(\"Action Space {}\".format(env.action_space))"
94 | ]
95 | },
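  {
   "cell_type": "markdown",
   "id": "7a1f9c20",
   "metadata": {},
   "source": [
    "The 6 discrete actions are:\n",
    "- `0` move south\n",
    "- `1` move north\n",
    "- `2` move east\n",
    "- `3` move west\n",
    "- `4` pick up the passenger\n",
    "- `5` drop off the passenger"
   ]
  },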
96 | {
97 | "cell_type": "markdown",
98 | "id": "4f53a38e",
99 | "metadata": {},
100 | "source": [
101 | "## State space"
102 | ]
103 | },
104 | {
105 | "cell_type": "code",
106 | "execution_count": 6,
107 | "id": "e809514b",
108 | "metadata": {},
109 | "outputs": [
110 | {
111 | "name": "stdout",
112 | "output_type": "stream",
113 | "text": [
114 | "State Space Discrete(500)\n"
115 | ]
116 | }
117 | ],
118 | "source": [
119 | "print(\"State Space {}\".format(env.observation_space))"
120 | ]
121 | },
122 | {
123 | "cell_type": "markdown",
124 | "id": "c8f6a690",
125 | "metadata": {},
126 | "source": [
127 | "## Rewards"
128 | ]
129 | },
130 | {
131 | "cell_type": "code",
132 | "execution_count": 7,
133 | "id": "0faad2a7",
134 | "metadata": {},
135 | "outputs": [
136 | {
137 | "name": "stdout",
138 | "output_type": "stream",
139 | "text": [
140 | "env.P[state][action][0]: (1.0, 223, -1, False)\n"
141 | ]
142 | }
143 | ],
144 | "source": [
145 |     "# env.P is a double dictionary.\n",
146 |     "# - The 1st key represents the state, from 0 to 499\n",
147 |     "# - The 2nd key represents the action taken by the agent,\n",
148 |     "#   from 0 to 5\n",
149 | "\n",
150 | "# example\n",
151 | "state = 123\n",
152 | "action = 0 # move south\n",
153 | "\n",
154 |     "# env.P[state][action][0] is a tuple with 4 elements\n",
155 | "# (probability, next_state, reward, done)\n",
156 | "# \n",
157 | "# - probability\n",
158 | "# It is always 1 in this environment, which means\n",
159 | "# there are no external/random factors that determine the\n",
160 | "# next_state\n",
161 | "# apart from the agent's action a.\n",
162 | "#\n",
163 | "# - next_state: 223 in this case\n",
164 | "# \n",
165 | "# - reward: -1 in this case\n",
166 | "#\n",
167 | "# - done: boolean (True/False) indicates whether the\n",
168 | "# episode has ended (i.e. the driver has dropped the\n",
169 | "# passenger at the correct destination)\n",
170 | "print('env.P[state][action][0]: ', env.P[state][action][0])"
171 | ]
172 | },
173 | {
174 | "cell_type": "code",
175 | "execution_count": 8,
176 | "id": "552caf92",
177 | "metadata": {},
178 | "outputs": [
179 | {
180 | "name": "stdout",
181 | "output_type": "stream",
182 | "text": [
183 | "+---------+\n",
184 | "|\u001b[34;1mR\u001b[0m: | : :G|\n",
185 | "| :\u001b[43m \u001b[0m| : : |\n",
186 | "| : : : : |\n",
187 | "| | : | : |\n",
188 | "|Y| : |\u001b[35mB\u001b[0m: |\n",
189 | "+---------+\n",
190 | "\n"
191 | ]
192 | }
193 | ],
194 | "source": [
195 | "# Need to call reset() at least once before render() will work\n",
196 | "env.reset()\n",
197 | "\n",
198 | "env.s = 123\n",
199 | "env.render(mode='human')"
200 | ]
201 | },
202 | {
203 | "cell_type": "code",
204 | "execution_count": 9,
205 | "id": "2ded2ba5",
206 | "metadata": {},
207 | "outputs": [
208 | {
209 | "name": "stdout",
210 | "output_type": "stream",
211 | "text": [
212 | "+---------+\n",
213 | "|\u001b[34;1mR\u001b[0m: | : :G|\n",
214 | "| : | : : |\n",
215 | "| :\u001b[43m \u001b[0m: : : |\n",
216 | "| | : | : |\n",
217 | "|Y| : |\u001b[35mB\u001b[0m: |\n",
218 | "+---------+\n",
219 | "\n"
220 | ]
221 | }
222 | ],
223 | "source": [
224 | "env.s = 223\n",
225 | "env.render(mode='human')"
226 | ]
227 | },
228 | {
229 | "cell_type": "code",
230 | "execution_count": null,
231 | "id": "2aacea45",
232 | "metadata": {},
233 | "outputs": [],
234 | "source": []
235 | }
236 | ],
237 | "metadata": {
238 | "kernelspec": {
239 | "display_name": "Python 3 (ipykernel)",
240 | "language": "python",
241 | "name": "python3"
242 | },
243 | "language_info": {
244 | "codemirror_mode": {
245 | "name": "ipython",
246 | "version": 3
247 | },
248 | "file_extension": ".py",
249 | "mimetype": "text/x-python",
250 | "name": "python",
251 | "nbconvert_exporter": "python",
252 | "pygments_lexer": "ipython3",
253 | "version": "3.7.5"
254 | }
255 | },
256 | "nbformat": 4,
257 | "nbformat_minor": 5
258 | }
259 |
--------------------------------------------------------------------------------
/01_taxi/notebooks/04_homework.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "f0fd6807",
6 | "metadata": {},
7 | "source": [
8 | "# 04 Homework 🏋️🏋️🏋️"
9 | ]
10 | },
11 | {
12 | "cell_type": "markdown",
13 | "id": "abcf6613",
14 | "metadata": {},
15 | "source": [
16 | "#### 👉A course without homework is not a course!\n",
17 | "\n",
18 | "#### 👉Spend some time thinking and trying to implement the challenges I propose here.\n",
19 | "\n",
20 | "#### 👉They are not so easy, so if you get stuck drop me an email at `plabartabajo@gmail.com`"
21 | ]
22 | },
23 | {
24 | "cell_type": "markdown",
25 | "id": "86f82e45",
26 | "metadata": {},
27 | "source": [
28 | "-----"
29 | ]
30 | },
31 | {
32 | "cell_type": "markdown",
33 | "id": "67656662",
34 | "metadata": {},
35 | "source": [
36 | "## 1. Can you update the function `train` in a way that the input `epsilon` can also be a callable function?\n",
37 | "\n",
38 | "An `epsilon` value that decays after each episode works better than a fixed `epsilon` for most RL problems.\n",
39 | "\n",
40 |     "This is a hard exercise, but I want you to give it a try.\n",
41 | "\n",
42 | "If you do not manage it, do not worry. We are going to implement this in an upcoming lesson."
43 | ]
44 | },
45 | {
46 | "cell_type": "markdown",
47 | "id": "7d1e016e",
48 | "metadata": {},
49 | "source": [
50 | "-----"
51 | ]
52 | },
53 | {
54 | "cell_type": "markdown",
55 | "id": "c0a46bf7",
56 | "metadata": {},
57 | "source": [
58 | "## 2. Can you parallelize the function `train_many_runs` using Python's `multiprocessing` module?\n",
59 | "\n",
60 |     "I do not like to wait and stare at each progress bar, when each run in `train_many_runs` could execute\n",
61 |     "in parallel.\n",
62 |     "\n",
63 |     "Create a new function called `train_many_runs_in_parallel` that outputs the same results as `train_many_runs`, but executes in a fraction of the time."
64 | ]
65 | }
66 | ],
67 | "metadata": {
68 | "kernelspec": {
69 | "display_name": "Python 3 (ipykernel)",
70 | "language": "python",
71 | "name": "python3"
72 | },
73 | "language_info": {
74 | "codemirror_mode": {
75 | "name": "ipython",
76 | "version": 3
77 | },
78 | "file_extension": ".py",
79 | "mimetype": "text/x-python",
80 | "name": "python",
81 | "nbconvert_exporter": "python",
82 | "pygments_lexer": "ipython3",
83 | "version": "3.7.5"
84 | }
85 | },
86 | "nbformat": 4,
87 | "nbformat_minor": 5
88 | }
89 |
--------------------------------------------------------------------------------
/01_taxi/pyproject.toml:
--------------------------------------------------------------------------------
1 | [tool.poetry]
2 | name = "src"
3 | version = "0.1.0"
4 | description = ""
5 | authors = ["Pau "]
6 |
7 | [tool.poetry.dependencies]
8 | python = ">=3.7.1,<4.0"
9 | gym = "^0.21.0"
10 | tqdm = "^4.62.3"
11 | matplotlib = "^3.5.0"
12 | pandas = "^1.3.4"
13 | seaborn = "^0.11.2"
14 | jupyter = "^1.0.0"
15 | jupyterlab = "^3.3.0"
16 |
17 | [tool.poetry.dev-dependencies]
18 | pytest = "^5.2"
19 |
20 | [build-system]
21 | requires = ["poetry-core>=1.0.0"]
22 | build-backend = "poetry.core.masonry.api"
23 |
--------------------------------------------------------------------------------
/01_taxi/requirements.txt:
--------------------------------------------------------------------------------
1 | anyio==3.5.0; python_full_version >= "3.6.2" and python_version >= "3.7"
2 | appnope==0.1.2; platform_system == "Darwin" and python_version >= "3.7" and sys_platform == "darwin"
3 | argon2-cffi-bindings==21.2.0; python_version >= "3.6"
4 | argon2-cffi==21.3.0; python_version >= "3.7"
5 | atomicwrites==1.4.0; python_version >= "3.5" and python_full_version < "3.0.0" and sys_platform == "win32" or sys_platform == "win32" and python_version >= "3.5" and python_full_version >= "3.4.0"
6 | attrs==21.4.0; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.7"
7 | babel==2.9.1; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.7"
8 | backcall==0.2.0; python_version >= "3.7"
9 | bleach==4.1.0; python_version >= "3.7"
10 | certifi==2021.10.8; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.7"
11 | cffi==1.15.0; implementation_name == "pypy" and python_version >= "3.6"
12 | charset-normalizer==2.0.12; python_full_version >= "3.6.0" and python_version >= "3.7"
13 | cloudpickle==2.0.0; python_version >= "3.6"
14 | colorama==0.4.4; python_version >= "3.7" and python_full_version < "3.0.0" and platform_system == "Windows" and sys_platform == "win32" or python_full_version >= "3.5.0" and platform_system == "Windows" and sys_platform == "win32" and python_version >= "3.7"
15 | cycler==0.11.0; python_version >= "3.7"
16 | debugpy==1.5.1; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.7"
17 | decorator==5.1.1; python_version >= "3.7"
18 | defusedxml==0.7.1; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.7"
19 | entrypoints==0.4; python_full_version >= "3.6.1" and python_version >= "3.7"
20 | fonttools==4.29.1; python_version >= "3.7"
21 | gym==0.21.0; python_version >= "3.6"
22 | idna==3.3; python_full_version >= "3.6.2" and python_version >= "3.7" and (python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.7")
23 | importlib-metadata==4.11.2; python_version < "3.8" and python_version >= "3.7" and (python_version >= "3.7" and python_full_version < "3.0.0" and python_version < "3.8" or python_full_version >= "3.4.0" and python_version >= "3.7" and python_version < "3.8")
24 | importlib-resources==5.4.0; python_version < "3.9" and python_version >= "3.7"
25 | ipykernel==6.9.1; python_version >= "3.7"
26 | ipython-genutils==0.2.0; python_version >= "3.7"
27 | ipython==7.32.0; python_version >= "3.7"
28 | ipywidgets==7.6.5
29 | jedi==0.18.1; python_version >= "3.7"
30 | jinja2==3.0.3; python_version >= "3.7"
31 | json5==0.9.6; python_version >= "3.7"
32 | jsonschema==4.4.0; python_version >= "3.7"
33 | jupyter-client==7.1.2; python_full_version >= "3.7.0" and python_version >= "3.7"
34 | jupyter-console==6.4.3; python_version >= "3.6"
35 | jupyter-core==4.9.2; python_full_version >= "3.6.1" and python_version >= "3.7"
36 | jupyter-server==1.13.5; python_version >= "3.7"
37 | jupyter==1.0.0
38 | jupyterlab-pygments==0.1.2; python_version >= "3.7"
39 | jupyterlab-server==2.10.3; python_version >= "3.7"
40 | jupyterlab-widgets==1.0.2; python_version >= "3.6"
41 | jupyterlab==3.3.0; python_version >= "3.7"
42 | kiwisolver==1.3.2; python_version >= "3.7"
43 | markupsafe==2.1.0; python_version >= "3.7"
44 | matplotlib-inline==0.1.3; python_version >= "3.7"
45 | matplotlib==3.5.1; python_version >= "3.7"
46 | mistune==0.8.4; python_version >= "3.7"
47 | more-itertools==8.12.0; python_version >= "3.5"
48 | nbclassic==0.3.6; python_version >= "3.7"
49 | nbclient==0.5.12; python_full_version >= "3.7.0" and python_version >= "3.7"
50 | nbconvert==6.4.2; python_version >= "3.7"
51 | nbformat==5.1.3; python_full_version >= "3.7.0" and python_version >= "3.7"
52 | nest-asyncio==1.5.4; python_full_version >= "3.7.0" and python_version >= "3.7"
53 | notebook-shim==0.1.0; python_version >= "3.7"
54 | notebook==6.4.8; python_version >= "3.7"
55 | numpy==1.21.1
56 | packaging==21.3; python_version >= "3.7"
57 | pandas==1.3.5; python_full_version >= "3.7.1"
58 | pandocfilters==1.5.0; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.7"
59 | parso==0.8.3; python_version >= "3.7"
60 | pexpect==4.8.0; sys_platform != "win32" and python_version >= "3.7"
61 | pickleshare==0.7.5; python_version >= "3.7"
62 | pillow==9.0.1; python_version >= "3.7"
63 | pluggy==0.13.1; python_version >= "3.5" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.5"
64 | prometheus-client==0.13.1; python_version >= "3.7"
65 | prompt-toolkit==3.0.28; python_full_version >= "3.6.2" and python_version >= "3.7"
66 | ptyprocess==0.7.0; os_name != "nt" and python_version >= "3.7" and sys_platform != "win32"
67 | py==1.11.0; python_version >= "3.6" and python_full_version < "3.0.0" and implementation_name == "pypy" or python_full_version >= "3.5.0" and python_version >= "3.6" and implementation_name == "pypy"
68 | pycparser==2.21; python_version >= "3.6" and python_full_version < "3.0.0" and implementation_name == "pypy" or implementation_name == "pypy" and python_version >= "3.6" and python_full_version >= "3.4.0"
69 | pygments==2.11.2; python_version >= "3.7"
70 | pyparsing==3.0.7; python_version >= "3.7"
71 | pyrsistent==0.18.1; python_version >= "3.7"
72 | pytest==5.4.3; python_version >= "3.5"
73 | python-dateutil==2.8.2; python_full_version >= "3.7.1" and python_version >= "3.7" and (python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.3.0" and python_version >= "3.7")
74 | pytz==2021.3; python_full_version >= "3.7.1" and python_version >= "3.6" and (python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.7")
75 | pywin32==303; sys_platform == "win32" and platform_python_implementation != "PyPy" and python_version >= "3.7"
76 | pywinpty==1.1.6; os_name == "nt" and python_version >= "3.7"
77 | pyzmq==22.3.0; python_full_version >= "3.6.1" and python_version >= "3.7"
78 | qtconsole==5.2.2; python_version >= "3.6"
79 | qtpy==2.0.1; python_version >= "3.6"
80 | requests==2.27.1; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.7"
81 | scipy==1.6.1; python_version >= "3.7"
82 | seaborn==0.11.2; python_version >= "3.6"
83 | send2trash==1.8.0; python_version >= "3.7"
84 | setuptools-scm==6.4.2; python_version >= "3.7"
85 | six==1.16.0; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.3.0" and python_version >= "3.7"
86 | sniffio==1.2.0; python_full_version >= "3.6.2" and python_version >= "3.7"
87 | terminado==0.13.3; python_version >= "3.7"
88 | testpath==0.6.0; python_version >= "3.7"
89 | tomli==2.0.1; python_version >= "3.7"
90 | tornado==6.1; python_full_version >= "3.6.1" and python_version >= "3.7"
91 | tqdm==4.63.0; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.4.0")
92 | traitlets==5.1.1; python_full_version >= "3.7.0" and python_version >= "3.7"
93 | typing-extensions==4.1.1; python_version < "3.8" and python_version >= "3.7" and python_full_version >= "3.6.2"
94 | urllib3==1.26.8; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version < "4" and python_version >= "3.7"
95 | wcwidth==0.2.5; python_full_version >= "3.6.2" and python_version >= "3.6"
96 | webencodings==0.5.1; python_version >= "3.7"
97 | websocket-client==1.3.1; python_version >= "3.7"
98 | widgetsnbextension==3.5.2
99 | zipp==3.7.0; python_version < "3.8" and python_version >= "3.7"
100 |
--------------------------------------------------------------------------------
/01_taxi/src/__init__.py:
--------------------------------------------------------------------------------
1 | __version__ = '0.1.0'
2 |
--------------------------------------------------------------------------------
/01_taxi/src/loops.py:
--------------------------------------------------------------------------------
1 | from typing import Tuple, List, Any
2 | import random
3 | from pdb import set_trace as stop
4 |
5 | import numpy as np
6 | from tqdm import tqdm
7 |
8 |
9 | def train(
10 | agent,
11 | env,
12 | n_episodes: int,
13 | epsilon: float
14 | ) -> Tuple[Any, List, List]:
15 | """
16 |     Trains an agent and returns 3 things:
17 | - agent object
18 | - timesteps_per_episode
19 | - penalties_per_episode
20 | """
21 | # For plotting metrics
22 | timesteps_per_episode = []
23 | penalties_per_episode = []
24 |
25 | for i in tqdm(range(0, n_episodes)):
26 |
27 | state = env.reset()
28 |
29 | epochs, penalties, reward, = 0, 0, 0
30 | done = False
31 |
32 | while not done:
33 |
34 | if random.uniform(0, 1) < epsilon:
35 | # Explore action space
36 | action = env.action_space.sample()
37 | else:
38 | # Exploit learned values
39 | action = agent.get_action(state)
40 |
41 | next_state, reward, done, info = env.step(action)
42 |
43 | agent.update_parameters(state, action, reward, next_state)
44 |
45 | if reward == -10:
46 | penalties += 1
47 |
48 | state = next_state
49 | epochs += 1
50 |
51 | timesteps_per_episode.append(epochs)
52 | penalties_per_episode.append(penalties)
53 |
54 | return agent, timesteps_per_episode, penalties_per_episode
55 |
56 |
57 | def evaluate(
58 | agent,
59 | env,
60 | n_episodes: int,
61 | epsilon: float,
62 | initial_state: int = None
63 | ) -> Tuple[List, List, List]:
64 |     """
65 |     Evaluates the agent's performance over `n_episodes` episodes.
66 |
67 |     It returns:
68 |     - timesteps_per_episode
69 |     - penalties_per_episode
70 |     - frames_per_episode
71 |     """
71 | # For plotting metrics
72 | timesteps_per_episode = []
73 | penalties_per_episode = []
74 | frames_per_episode = []
75 |
76 | for i in tqdm(range(0, n_episodes)):
77 |
78 | if initial_state:
79 | # init the environment at 'initial_state'
80 | state = initial_state
81 | env.s = initial_state
82 | else:
83 | # random starting state
84 | state = env.reset()
85 |
86 | epochs, penalties, reward, = 0, 0, 0
87 | frames = []
88 | done = False
89 |
90 | while not done:
91 |
92 | if random.uniform(0, 1) < epsilon:
93 | # Explore action space
94 | action = env.action_space.sample()
95 | else:
96 | # Exploit learned values
97 | action = agent.get_action(state)
98 |
99 | next_state, reward, done, info = env.step(action)
100 |
101 | frames.append({
102 | 'frame': env.render(mode='ansi'),
103 | 'state': state,
104 | 'action': action,
105 | 'reward': reward
106 | })
107 |
108 | if reward == -10:
109 | penalties += 1
110 |
111 | state = next_state
112 | epochs += 1
113 |
114 | timesteps_per_episode.append(epochs)
115 | penalties_per_episode.append(penalties)
116 | frames_per_episode.append(frames)
117 |
118 | return timesteps_per_episode, penalties_per_episode, frames_per_episode
119 |
120 |
121 | def train_many_runs(
122 | agent,
123 | env,
124 | n_episodes: int,
125 | epsilon: float,
126 | n_runs: int,
127 | ) -> Tuple[List, List]:
128 | """
129 | Calls 'train' many times, stores results and averages them out.
130 | """
131 | timesteps = np.zeros(shape=(n_runs, n_episodes))
132 | penalties = np.zeros(shape=(n_runs, n_episodes))
133 |
134 | for i in range(0, n_runs):
135 |
136 | agent.reset()
137 |
138 | _, timesteps[i, :], penalties[i, :] = train(
139 | agent, env, n_episodes, epsilon
140 | )
141 | timesteps = np.mean(timesteps, axis=0).tolist()
142 | penalties = np.mean(penalties, axis=0).tolist()
143 |
144 | return timesteps, penalties
145 |
146 | if __name__ == '__main__':
147 |
148 | import gym
149 | from src.q_agent import QAgent
150 |
151 | env = gym.make("Taxi-v3").env
152 | alpha = 0.1
153 | gamma = 0.6
154 | agent = QAgent(env, alpha, gamma)
155 |
156 | agent, _, _ = train(
157 | agent, env, n_episodes=10000, epsilon=0.10)
158 |
159 | timesteps_per_episode, penalties_per_episode, _ = evaluate(
160 | agent, env, n_episodes=100, epsilon=0.05
161 | )
162 |
163 | print(f'Avg steps to complete ride: {np.array(timesteps_per_episode).mean()}')
164 | print(f'Avg penalties to complete ride: {np.array(penalties_per_episode).mean()}')
--------------------------------------------------------------------------------
/01_taxi/src/q_agent.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | from pdb import set_trace as stop
3 |
4 | class QAgent:
5 |
6 | def __init__(self, env, alpha, gamma):
7 | self.env = env
8 |
9 | # table with q-values: n_states * n_actions
10 | self.q_table = np.zeros([env.observation_space.n, env.action_space.n])
11 |
12 | # hyper-parameters
13 | self.alpha = alpha
14 | self.gamma = gamma
15 |
16 | def get_action(self, state):
17 |         """Returns the action with the highest q-value for the given state."""
18 |         return np.argmax(self.q_table[state])
20 |
21 | def update_parameters(self, state, action, reward, next_state):
22 |         """Q-learning update: Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(reward + gamma*max_a' Q(s',a'))."""
23 | old_value = self.q_table[state, action]
24 | next_max = np.max(self.q_table[next_state])
25 |
26 | new_value = (1 - self.alpha) * old_value + self.alpha * (reward + self.gamma * next_max)
27 | self.q_table[state, action] = new_value
28 |
29 | def reset(self):
30 | """
31 | Sets q-values to zeros, which essentially means the agent does not know
32 | anything
33 | """
34 | self.q_table = np.zeros([self.env.observation_space.n, self.env.action_space.n])
35 |
--------------------------------------------------------------------------------
/01_taxi/src/random_agent.py:
--------------------------------------------------------------------------------
1 |
2 | class RandomAgent:
3 | """
4 | This taxi driver selects actions randomly.
5 | You better not get into this taxi!
6 | """
7 | def __init__(self, env):
8 | self.env = env
9 |
10 | def get_action(self, state) -> int:
11 | """
12 |         The `state` argument is ignored: the agent does not consider the
13 |         state of the environment when deciding what to do next.
15 | """
16 | return self.env.action_space.sample()
--------------------------------------------------------------------------------
/02_mountain_car/README.md:
--------------------------------------------------------------------------------
1 | # SARSA to beat gravity 🚃
2 | 👉 [Read in datamachines](http://datamachines.xyz/2021/12/17/hands-on-reinforcement-learning-course-part-3-sarsa/)
3 | 👉 [Read in Towards Data Science](https://towardsdatascience.com/hands-on-reinforcement-learning-course-part-3-5db40e7938d4)
4 |
5 |
6 | This is part 2 of my course Hands-on reinforcement learning.
7 |
8 | In this part we use SARSA to help a poor car win the battle against gravity!
9 |
10 | > *Be like a train; go in the rain, go in the sun, go in the storm, go in the dark tunnels! Be like a train; concentrate on your road and go with no hesitation!*
11 | >
12 | > --_Mehmet Murat Ildan_
13 |
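The heart of this lesson is the SARSA update implemented in `src/sarsa_agent.py`. Here is a minimal, self-contained sketch of that update (the real agent first discretizes position and velocity; the toy q-table and transition below are purely illustrative):

```python
import numpy as np

alpha, gamma = 0.1, 0.6              # learning rate and discount factor
q_table = np.zeros((2, 3))           # toy table: 2 discrete states, 3 actions

state, action, reward = 0, 2, -1.0   # illustrative transition
next_state, next_action = 1, 0       # SARSA bootstraps on the next action actually taken

# SARSA update: Q(s,a) <- Q(s,a) + alpha * (reward + gamma * Q(s',a') - Q(s,a))
td_error = reward + gamma * q_table[next_state, next_action] - q_table[state, action]
q_table[state, action] += alpha * td_error
```
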
14 | ### Quick setup
15 |
16 | The easiest way to get the code working in your machine is by using [Poetry](https://python-poetry.org/docs/#installation).
17 |
18 |
19 | 1. You can install Poetry with this one-liner:
20 | ```bash
21 | $ curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python -
22 | ```
23 |
24 | 2. Git clone the code
25 | ```bash
26 | $ git clone https://github.com/Paulescu/hands-on-rl.git
27 | ```
28 |
29 | 3. Navigate to this lesson's folder `02_mountain_car`:
30 | ```bash
31 | $ cd hands-on-rl/02_mountain_car
32 | ```
33 |
34 | 4. Install all dependencies from `pyproject.toml`:
35 | ```bash
36 | $ poetry install
37 | ```
38 |
39 | 5. Activate the virtual environment
40 | ```bash
41 | $ poetry shell
42 | ```
43 |
44 | 6. Set the `PYTHONPATH` and launch Jupyter (the `--NotebookApp.use_redirect_file=False` flag may fix launch problems on some systems)
45 | ```bash
46 | $ export PYTHONPATH=".."
47 | $ jupyter-lab --NotebookApp.use_redirect_file=False
48 | ```
49 |
50 | ### Notebooks
51 |
52 | 1. [Explore the environment](notebooks/00_environment.ipynb)
53 | 2. [Random agent baseline](notebooks/01_random_agent_baseline.ipynb)
54 | 3. [SARSA agent](notebooks/02_sarsa_agent.ipynb)
55 | 4. [Momentum agent](notebooks/03_momentum_agent_baseline.ipynb)
56 | 5. [Homework](notebooks/04_homework.ipynb)
57 |
--------------------------------------------------------------------------------
/02_mountain_car/notebooks/04_homework.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "f0fd6807",
6 | "metadata": {},
7 | "source": [
8 | "# 04 Homework 🏋️🏋️🏋️"
9 | ]
10 | },
11 | {
12 | "cell_type": "markdown",
13 | "id": "abcf6613",
14 | "metadata": {},
15 | "source": [
16 | "#### 👉A course without homework is not a course!\n",
17 | "\n",
18 | "#### 👉Spend some time thinking and trying to implement the challenges I propose here.\n",
19 | "\n",
20 | "#### 👉Feel free to email me your solutions at:"
21 | ]
22 | },
23 | {
24 | "cell_type": "markdown",
25 | "id": "d1d983a3",
26 | "metadata": {},
27 | "source": [
28 | "# `plabartabajo@gmail.com`"
29 | ]
30 | },
31 | {
32 | "cell_type": "markdown",
33 | "id": "86f82e45",
34 | "metadata": {},
35 | "source": [
36 | "-----"
37 | ]
38 | },
39 | {
40 | "cell_type": "markdown",
41 | "id": "67656662",
42 | "metadata": {},
43 | "source": [
44 | "## 1. Can you adjust the hyper-parameters `alpha = 0.1` and `gamma = 0.9` to train a better SARSA agent than mine?\n",
45 | "\n",
46 | "Experiment with these 2 hyper-parameters to maximize the agent success rate."
47 | ]
48 | },
49 | {
50 | "cell_type": "markdown",
51 | "id": "c0a46bf7",
52 | "metadata": {},
53 | "source": [
54 | "## 2. Can you increase the resolution of the discretization?\n",
55 | "\n",
56 |     "Instead of discretizing the state with a resolution of\n",
57 |     "- `0.1` for position\n",
58 |     "- `0.01` for velocity\n",
59 |     "\n",
60 |     "use a 10x finer resolution:\n",
61 |     "- `0.01` for position\n",
62 |     "- `0.001` for velocity\n",
63 |     "\n",
64 |     "Let me know if you get a better agent than mine!"
65 | ]
66 | }
67 | ],
68 | "metadata": {
69 | "kernelspec": {
70 | "display_name": "Python 3 (ipykernel)",
71 | "language": "python",
72 | "name": "python3"
73 | },
74 | "language_info": {
75 | "codemirror_mode": {
76 | "name": "ipython",
77 | "version": 3
78 | },
79 | "file_extension": ".py",
80 | "mimetype": "text/x-python",
81 | "name": "python",
82 | "nbconvert_exporter": "python",
83 | "pygments_lexer": "ipython3",
84 | "version": "3.7.5"
85 | }
86 | },
87 | "nbformat": 4,
88 | "nbformat_minor": 5
89 | }
90 |
--------------------------------------------------------------------------------
/02_mountain_car/pyproject.toml:
--------------------------------------------------------------------------------
1 | [tool.poetry]
2 | name = "src"
3 | version = "0.1.0"
4 | description = ""
5 | authors = ["Pau "]
6 |
7 | [tool.poetry.dependencies]
8 | python = ">=3.7.1,<4.0"
9 | gym = "^0.21.0"
10 | pyglet = "^1.5.21"
11 | matplotlib = "^3.5.0"
12 | tqdm = "^4.62.3"
13 | pandas = "^1.3.4"
14 | jupyter = "^1.0.0"
15 | PyVirtualDisplay = "^2.2"
16 | imageio = "^2.13.3"
17 | seaborn = "^0.11.2"
18 |
19 | [tool.poetry.dev-dependencies]
20 | pytest = "^5.2"
21 |
22 | [build-system]
23 | requires = ["poetry-core>=1.0.0"]
24 | build-backend = "poetry.core.masonry.api"
25 |
--------------------------------------------------------------------------------
/02_mountain_car/src/base_agent.py:
--------------------------------------------------------------------------------
1 | import pickle
2 | from pathlib import Path
3 | from abc import ABC, abstractmethod
4 |
5 |
6 | class BaseAgent(ABC):
7 |
8 | @abstractmethod
9 | def get_action(self, state):
10 | pass
11 |
12 | @abstractmethod
13 | def update_parameters(self, state, action, reward, next_state):
14 | pass
15 |
16 | def save_to_disk(self, path: Path):
17 | """
18 | Saves python object to disk using a binary format
19 | """
20 | with open(path, "wb") as f:
21 | pickle.dump(self, f, pickle.HIGHEST_PROTOCOL)
22 |
23 | @classmethod
24 | def load_from_disk(cls, path: Path):
25 | """
26 | Loads binary format into Python object.
27 | """
28 | with open(path, "rb") as f:
29 | dump = pickle.load(f)
30 |
31 | return dump
--------------------------------------------------------------------------------
/02_mountain_car/src/config.py:
--------------------------------------------------------------------------------
1 | # Define SAVED_AGENTS_DIR and create dir if missing
2 | import os
3 | import pathlib
4 | root_dir = pathlib.Path(__file__).parent.resolve().parent
5 | SAVED_AGENTS_DIR = root_dir / 'saved_agents'
6 | os.makedirs(SAVED_AGENTS_DIR, exist_ok=True)
7 |
--------------------------------------------------------------------------------
/02_mountain_car/src/loops.py:
--------------------------------------------------------------------------------
1 | from typing import Tuple, List, Callable, Union, Optional
2 | import random
3 |
4 | from tqdm import tqdm
5 |
6 | def train(
7 | agent,
8 | env,
9 | n_episodes: int,
10 | epsilon: Union[float, Callable]
11 | ) -> Tuple[List, List]:
12 |
13 | # For plotting metrics
14 | reward_per_episode = []
15 | max_position_per_episode = []
16 |
17 | pbar = tqdm(range(0, n_episodes))
18 | for i in pbar:
19 |
20 | state = env.reset()
21 |
22 | rewards = 0
23 | max_position = -99
24 |
25 | # handle case when epsilon is either
26 | # - a float
27 |     # - or a function that returns a float given the episode number
28 | epsilon_ = epsilon if isinstance(epsilon, float) else epsilon(i)
29 |
30 | pbar.set_description(f'Epsilon: {epsilon_:.2f}')
31 |
32 | done = False
33 | while not done:
34 |
35 | action = agent.get_action(state, epsilon_)
36 |
37 | next_state, reward, done, info = env.step(action)
38 |
39 | agent.update_parameters(state, action, reward, next_state, epsilon_)
40 |
41 | rewards += reward
42 | if next_state[0] > max_position:
43 | max_position = next_state[0]
44 |
45 | state = next_state
46 |
47 | reward_per_episode.append(rewards)
48 | max_position_per_episode.append(max_position)
49 |
50 | return reward_per_episode, max_position_per_episode
51 |
52 |
53 | def evaluate(
54 | agent,
55 | env,
56 | n_episodes: int,
57 | epsilon: Optional[Union[float, Callable]] = None
58 | ) -> Tuple[List, List]:
59 |
60 | # For plotting metrics
61 | reward_per_episode = []
62 | max_position_per_episode = []
63 |
64 | for i in tqdm(range(0, n_episodes)):
65 |
66 | state = env.reset()
67 |
68 | rewards = 0
69 | max_position = -99
70 |
71 | done = False
72 | while not done:
73 |
74 | epsilon_ = None
75 | if epsilon is not None:
76 | epsilon_ = epsilon if isinstance(epsilon, float) else epsilon(i)
77 | action = agent.get_action(state, epsilon_)
78 |
79 | next_state, reward, done, info = env.step(action)
80 |
81 | agent.update_parameters(state, action, reward, next_state, epsilon_)
82 |
83 | rewards += reward
84 | if next_state[0] > max_position:
85 | max_position = next_state[0]
86 |
87 | state = next_state
88 |
89 | reward_per_episode.append(rewards)
90 | max_position_per_episode.append(max_position)
91 |
92 | return reward_per_episode, max_position_per_episode
93 |
94 | if __name__ == '__main__':
95 |
96 | # environment
97 | import gym
98 | env = gym.make('MountainCar-v0')
99 | env._max_episode_steps = 1000
100 |
101 | # agent
102 | from src.sarsa_agent import SarsaAgent
103 | alpha = 0.1
104 | gamma = 0.6
105 | agent = SarsaAgent(env, alpha, gamma)
106 |
107 | rewards, max_positions = train(agent, env, n_episodes=100, epsilon=0.1)
--------------------------------------------------------------------------------
/02_mountain_car/src/momentum_agent.py:
--------------------------------------------------------------------------------
1 | from src.base_agent import BaseAgent
2 |
3 | class MomentumAgent(BaseAgent):
4 |
5 | def __init__(self, env):
6 | self.env = env
7 |
8 | self.valley_position = -0.5
9 |
10 | def get_action(self, state, epsilon=None) -> int:
11 | """
12 |         Accelerates in the direction of the car's current velocity,
13 |         so the car builds momentum on every swing.
15 | """
16 | velocity = state[1]
17 |
18 | if velocity > 0:
19 | # accelerate to the right
20 | action = 2
21 | else:
22 | # accelerate to the left
23 | action = 0
24 |
25 | return action
26 |
27 | def update_parameters(self, state, action, reward, next_state, epsilon):
28 | pass
29 |
30 |
--------------------------------------------------------------------------------
/02_mountain_car/src/random_agent.py:
--------------------------------------------------------------------------------
1 | from src.base_agent import BaseAgent
2 |
3 | class RandomAgent(BaseAgent):
4 | """
5 |     This driver selects actions randomly.
6 |     You better not get into this car!
7 | """
8 | def __init__(self, env):
9 | self.env = env
10 |
11 | def get_action(self, state, epsilon) -> int:
12 | """
13 |         The `state` and `epsilon` arguments are ignored: the agent does not
14 |         consider the state of the environment when deciding what to do next.
16 | """
17 | return self.env.action_space.sample()
18 |
19 | def update_parameters(self, state, action, reward, next_state, epsilon):
20 | pass
21 |
22 |
--------------------------------------------------------------------------------
/02_mountain_car/src/sarsa_agent.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import random
3 |
4 | from src.base_agent import BaseAgent
5 |
6 | class SarsaAgent(BaseAgent):
7 |
8 | def __init__(self, env, alpha, gamma):
9 |
10 | self.env = env
11 | self.q_table = self._init_q_table()
12 |
13 | # hyper-parameters
14 | self.alpha = alpha
15 | self.gamma = gamma
16 |
17 | def _init_q_table(self) -> np.array:
18 | """
19 | Return numpy array with 3 dimensions.
20 | The first 2 dimensions are the state components, i.e. position, speed.
21 | The third dimension is the action.
22 | """
23 | # discretize state space from a continuous to discrete
24 | high = self.env.observation_space.high
25 | low = self.env.observation_space.low
26 | n_states = (high - low) * np.array([10, 100])
27 | n_states = np.round(n_states, 0).astype(int) + 1
28 |
29 | # table with q-values: n_states[0] * n_states[1] * n_actions
30 | return np.zeros([n_states[0], n_states[1], self.env.action_space.n])
31 |
32 | def _discretize_state(self, state):
33 | min_states = self.env.observation_space.low
34 | state_discrete = (state - min_states) * np.array([10, 100])
35 | return np.round(state_discrete, 0).astype(int)
36 |
37 | def get_action(self, state, epsilon=None):
38 |         """Epsilon-greedy policy: explore with probability epsilon, otherwise act greedily w.r.t. the q-table."""
39 | if epsilon and random.uniform(0, 1) < epsilon:
40 | # Explore action space
41 | action = self.env.action_space.sample()
42 | else:
43 | # Exploit learned values
44 | state_discrete = self._discretize_state(state)
45 | action = np.argmax(self.q_table[state_discrete[0], state_discrete[1]])
46 |
47 | return action
48 |
49 | def update_parameters(self, state, action, reward, next_state, epsilon):
50 |         """SARSA update: Q(s,a) <- Q(s,a) + alpha * (reward + gamma * Q(s',a') - Q(s,a))."""
51 | s = self._discretize_state(state)
52 | ns = self._discretize_state(next_state)
53 | na = self.get_action(next_state, epsilon)
54 |
55 | delta = self.alpha * (
56 | reward
57 | + self.gamma * self.q_table[ns[0], ns[1], na]
58 | - self.q_table[s[0], s[1], action]
59 | )
60 | self.q_table[s[0], s[1], action] += delta
--------------------------------------------------------------------------------
/02_mountain_car/src/viz.py:
--------------------------------------------------------------------------------
1 | from time import sleep
2 | from argparse import ArgumentParser
3 | from pdb import set_trace as stop
4 |
5 | import pandas as pd
6 | import gym
7 |
8 | from src.config import SAVED_AGENTS_DIR
9 |
10 | import numpy as np
11 |
12 |
13 | def plot_policy(agent, positions: np.ndarray, velocities: np.ndarray, figsize=None):
14 | """"""
15 | data = []
16 | int2str = {
17 | 0: 'Accelerate Left',
18 | 1: 'Do nothing',
19 | 2: 'Accelerate Right'
20 | }
21 | for position in positions:
22 | for velocity in velocities:
23 |
24 | state = np.array([position, velocity])
25 | action = int2str[agent.get_action(state)]
26 |
27 | data.append({
28 | 'position': position,
29 | 'velocity': velocity,
30 | 'action': action,
31 | })
32 |
33 | data = pd.DataFrame(data)
34 |
35 | import seaborn as sns
36 | import matplotlib.pyplot as plt
37 |
38 | if figsize:
39 | plt.figure(figsize=figsize)
40 |
41 | colors = {
42 | 'Accelerate Left': 'blue',
43 | 'Do nothing': 'grey',
44 | 'Accelerate Right': 'orange'
45 | }
46 | sns.scatterplot(x="position", y="velocity", hue="action", data=data,
47 | palette=colors)
48 |
49 | plt.show()
50 | return data
51 |
52 | def show_video(agent, env, sleep_sec: float = 0.1, mode: str = "rgb_array"):
53 |
54 | state = env.reset()
55 | done = False
56 |
57 | # LAPADULA
58 | if mode == "rgb_array":
59 | from matplotlib import pyplot as plt
60 | from IPython.display import display, clear_output
61 | steps = 0
62 | fig, ax = plt.subplots(figsize=(8, 6))
63 |
64 | while not done:
65 |
66 | action = agent.get_action(state)
67 | state, reward, done, info = env.step(action)
68 | # LAPADULA
69 | if mode == "rgb_array":
70 | steps += 1
71 | frame = env.render(mode=mode)
72 | ax.cla()
73 | ax.axes.yaxis.set_visible(False)
74 | ax.imshow(frame, extent=[env.min_position, env.max_position, 0, 1])
75 | ax.set_title(f'Steps: {steps}')
76 | display(fig)
77 | clear_output(wait=True)
78 | plt.pause(sleep_sec)
79 | else:
80 | env.render()
81 |
82 |
83 | if __name__ == '__main__':
84 |
85 | parser = ArgumentParser()
86 | parser.add_argument('--agent_file', type=str, required=True)
87 | parser.add_argument('--sleep_sec', type=float, required=False, default=0.1)
88 | args = parser.parse_args()
89 |
90 | from src.base_agent import BaseAgent
91 | agent_path = SAVED_AGENTS_DIR / args.agent_file
92 | agent = BaseAgent.load_from_disk(agent_path)
93 |
94 | env = gym.make('MountainCar-v0')
95 | env._max_episode_steps = 1000
96 |
97 | show_video(agent, env, sleep_sec=args.sleep_sec)
98 |
99 |
100 |
101 |
102 |
103 |
104 |
105 |
106 |
--------------------------------------------------------------------------------
/03_cart_pole/.gitignore:
--------------------------------------------------------------------------------
1 | data_supervised_ml/*
2 |
--------------------------------------------------------------------------------
/03_cart_pole/README.md:
--------------------------------------------------------------------------------
1 |
2 | # Parametric Q learning to solve the Cart Pole
3 | > *There exists everywhere a medium in things, determined by equilibrium.*
4 | >
5 | > --_Dmitri Mendeleev_
5 |
6 |
7 | 
8 |
9 | ## Table of Contents
10 | * [Welcome 🤗](#welcome-)
11 | * [Lecture transcripts](#lecture-transcripts)
12 | * [Quick setup](#quick-setup)
13 | * [Notebooks](#notebooks)
14 | * [Let's connect](#lets-connect)
15 |
16 | ## Welcome 🤗
17 |
18 | In today's lecture we enter new territory...
19 |
20 | A territory where function approximation (aka supervised machine learning)
21 | meets good old Reinforcement Learning.
22 |
23 | And this is how Deep RL is born.
24 |
25 | We will solve the Cart Pole environment of OpenAI using **parametric Q-learning**.
26 |
27 | Today's lesson is split into 3 parts.
28 |
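"Parametric" means the q-function is no longer a table of values but a model with parameters that we fit, just like in supervised ML. Here is a minimal sketch of the idea with a linear q-model (the hyper-parameters and transition values are illustrative, not the exact ones used in the notebooks):

```python
import numpy as np

n_state_dims, n_actions = 4, 2    # CartPole: 4 state components, 2 actions
alpha, gamma = 0.001, 0.99        # illustrative learning rate and discount factor
weights = np.zeros((n_actions, n_state_dims))   # one row of weights per action

def q_values(state):
    # linear q-model: one q-value per action
    return weights @ state

# one illustrative transition
state = np.array([0.0, 0.1, 0.02, -0.1])
action, reward = 1, 1.0
next_state = np.array([0.002, 0.3, 0.018, -0.4])

# move Q(s, a) towards the TD target: reward + gamma * max_a' Q(s', a')
td_error = reward + gamma * np.max(q_values(next_state)) - q_values(state)[action]
weights[action] += alpha * td_error * state   # gradient step for a linear model
```
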
29 | ## Lecture transcripts
30 |
31 | [📝 1. Parametric Q learning](http://datamachines.xyz/2022/01/18/hands-on-reinforcement-learning-course-part-4-parametric-q-learning)
32 | [📝 2. Deep Q learning](http://datamachines.xyz/2022/02/11/hands-on-reinforcement-learning-course-part-5-deep-q-learning/)
33 | [📝 3. Hyperparameter search](http://datamachines.xyz/2022/03/03/hyperparameters-in-deep-rl-hands-on-course/)
34 |
35 | ## Quick setup
36 |
37 | Make sure you have Python >= 3.7. Otherwise, update it.
38 |
39 | 1. Pull the code from GitHub and cd into the `03_cart_pole` folder:
40 | ```
41 | $ git clone https://github.com/Paulescu/hands-on-rl.git
42 | $ cd hands-on-rl/03_cart_pole
43 | ```
44 |
45 | 2. Make sure you have the `virtualenv` tool in your Python installation
46 | ```
47 | $ pip3 install virtualenv
48 | ```
49 |
50 | 3. Create a virtual environment and activate it.
51 | ```
52 | $ virtualenv -p python3 venv
53 | $ source venv/bin/activate
54 | ```
55 |
56 | From this point onwards, commands run inside the virtual environment.
57 |
58 |
59 | 4. Install the dependencies and add the current folder to your `PYTHONPATH`, so the code in `src` is importable and you can experiment with it.
60 | ```
61 | $ (venv) pip install -r requirements.txt
62 | $ (venv) export PYTHONPATH="."
63 | ```
64 |
65 | 5. Open the notebooks, either with good old Jupyter or JupyterLab
66 | ```
67 | $ (venv) jupyter notebook
68 | ```
69 | ```
70 | $ (venv) jupyter lab
71 | ```
72 | If both launch commands fail, try these:
73 | ```
74 | $ (venv) jupyter notebook --NotebookApp.use_redirect_file=False
75 | ```
76 | ```
77 | $ (venv) jupyter lab --NotebookApp.use_redirect_file=False
78 | ```
79 |
80 | 6. Play and learn. And do the homework 😉.
81 |
82 | ## Notebooks
83 |
84 | Parametric Q-learning
85 | - [Explore the environment](notebooks/00_environment.ipynb)
86 | - [Random agent baseline](notebooks/01_random_agent_baseline.ipynb)
87 | - [Linear Q agent with bad hyper-parameters](notebooks/02_linear_q_agent_bad_hyperparameters.ipynb)
88 | - [Linear Q agent with good hyper-parameters](notebooks/03_linear_q_agent_good_hyperparameters.ipynb)
89 | - [Homework](notebooks/04_homework.ipynb)
90 |
91 | Deep Q-learning
92 | - [Crash course on neural networks](notebooks/05_crash_course_on_neural_nets.ipynb)
93 | - [Deep Q agent with bad hyper-parameters](notebooks/06_deep_q_agent_bad_hyperparameters.ipynb)
94 | - [Deep Q agent with good hyper-parameters](notebooks/07_deep_q_agent_good_hyperparameters.ipynb)
95 | - [Homework](notebooks/08_homework.ipynb)
96 |
97 | Hyperparameter search
98 | - [Hyperparameter search](notebooks/09_hyperparameter_search.ipynb)
99 | - [Homework](notebooks/10_homework.ipynb)
100 |
101 | ## Let's connect!
102 |
103 | Do you wanna become a PRO in Machine Learning?
104 |
105 | 👉🏽 Subscribe to the [datamachines newsletter](https://datamachines.xyz/subscribe/).
106 |
107 | 👉🏽 Follow me on [Medium](https://pau-labarta-bajo.medium.com/).
--------------------------------------------------------------------------------
/03_cart_pole/images/deep_q_net.svg:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
--------------------------------------------------------------------------------
/03_cart_pole/images/hparams_search_diagram.svg:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
--------------------------------------------------------------------------------
/03_cart_pole/images/linear_model.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Paulescu/hands-on-rl/21c11b01fd6b950cfa16800b4dd9234d55b4a1ac/03_cart_pole/images/linear_model.jpg
--------------------------------------------------------------------------------
/03_cart_pole/images/linear_model_sml.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Paulescu/hands-on-rl/21c11b01fd6b950cfa16800b4dd9234d55b4a1ac/03_cart_pole/images/linear_model_sml.jpg
--------------------------------------------------------------------------------
/03_cart_pole/images/neural_net.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Paulescu/hands-on-rl/21c11b01fd6b950cfa16800b4dd9234d55b4a1ac/03_cart_pole/images/neural_net.jpg
--------------------------------------------------------------------------------
/03_cart_pole/images/neural_net_homework.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Paulescu/hands-on-rl/21c11b01fd6b950cfa16800b4dd9234d55b4a1ac/03_cart_pole/images/neural_net_homework.jpg
--------------------------------------------------------------------------------
/03_cart_pole/images/ngrok_example.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Paulescu/hands-on-rl/21c11b01fd6b950cfa16800b4dd9234d55b4a1ac/03_cart_pole/images/ngrok_example.png
--------------------------------------------------------------------------------
/03_cart_pole/images/nn_1_hidden_layer_sml.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Paulescu/hands-on-rl/21c11b01fd6b950cfa16800b4dd9234d55b4a1ac/03_cart_pole/images/nn_1_hidden_layer_sml.jpg
--------------------------------------------------------------------------------
/03_cart_pole/images/nn_2_hidden_layers_sml.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Paulescu/hands-on-rl/21c11b01fd6b950cfa16800b4dd9234d55b4a1ac/03_cart_pole/images/nn_2_hidden_layers_sml.jpg
--------------------------------------------------------------------------------
/03_cart_pole/images/nn_3_hidden_layers_sml.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Paulescu/hands-on-rl/21c11b01fd6b950cfa16800b4dd9234d55b4a1ac/03_cart_pole/images/nn_3_hidden_layers_sml.jpg
--------------------------------------------------------------------------------
/03_cart_pole/images/optuna.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Paulescu/hands-on-rl/21c11b01fd6b950cfa16800b4dd9234d55b4a1ac/03_cart_pole/images/optuna.png
--------------------------------------------------------------------------------
/03_cart_pole/mlflow_runs/readme.md:
--------------------------------------------------------------------------------
1 | MLflow logs are saved in this folder.
--------------------------------------------------------------------------------
/03_cart_pole/notebooks/00_environment.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "4c2ff31f",
6 | "metadata": {},
7 | "source": [
8 | "# 00 Environment"
9 | ]
10 | },
11 | {
12 | "cell_type": "markdown",
13 | "id": "04a5c882",
14 | "metadata": {},
15 | "source": [
16 | "#### 👉Before you solve a Reinforcement Learning problem you need to define what are\n",
17 | "- the actions\n",
18 | "- the states of the world\n",
19 | "- the rewards\n",
20 | "\n",
21 |     "#### 👉We are using the `CartPole-v1` environment from [OpenAI's gym](https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py)\n",
22 |     "\n",
23 |     "#### 👉`CartPole-v1` is not an extremely difficult environment. However, it is complex enough to force us to level up our game. The tools we will use to solve it are really powerful.\n",
24 | "\n",
25 | "#### 👉Let's explore it!"
26 | ]
27 | },
28 | {
29 | "cell_type": "code",
30 | "execution_count": 44,
31 | "id": "e3629346",
32 | "metadata": {},
33 | "outputs": [
34 | {
35 | "name": "stdout",
36 | "output_type": "stream",
37 | "text": [
38 | "The autoreload extension is already loaded. To reload it, use:\n",
39 | " %reload_ext autoreload\n",
40 | "Populating the interactive namespace from numpy and matplotlib\n"
41 | ]
42 | }
43 | ],
44 | "source": [
45 | "%load_ext autoreload\n",
46 | "%autoreload 2\n",
47 | "%pylab inline\n",
48 | "%config InlineBackend.figure_format = 'svg'\n",
49 | "\n",
50 | "from matplotlib import pyplot as plt\n",
51 | "%matplotlib inline"
52 | ]
53 | },
54 | {
55 | "cell_type": "markdown",
56 | "id": "76e9a06d",
57 | "metadata": {},
58 | "source": [
59 | "## Load the environment 🌎"
60 | ]
61 | },
62 | {
63 | "cell_type": "code",
64 | "execution_count": 45,
65 | "id": "ebfba291",
66 | "metadata": {},
67 | "outputs": [],
68 | "source": [
69 | "import gym\n",
70 | "env = gym.make('CartPole-v1')"
71 | ]
72 | },
73 | {
74 | "cell_type": "markdown",
75 | "id": "c6e2bc37",
76 | "metadata": {},
77 | "source": [
78 | "## The goal\n",
79 |     "### is to keep the pole in an upright position as long as you can by moving the cart at the bottom, left and right."
80 | ]
81 | },
82 | {
83 | "cell_type": "markdown",
84 | "id": "7babf939",
85 | "metadata": {},
86 | "source": [
87 | ""
88 | ]
89 | },
90 | {
91 | "cell_type": "markdown",
92 | "id": "9cb921cf",
93 | "metadata": {},
94 | "source": [
95 | "## Let's see how a good agent solves this problem"
96 | ]
97 | },
98 | {
99 | "cell_type": "code",
100 | "execution_count": null,
101 | "id": "92dcbf74",
102 | "metadata": {},
103 | "outputs": [],
104 | "source": []
105 | },
106 | {
107 | "cell_type": "code",
108 | "execution_count": 30,
109 | "id": "2ded2ba5",
110 | "metadata": {},
111 | "outputs": [
112 | {
113 | "data": {
114 | "text/plain": [
115 | ""
116 | ]
117 | },
118 | "execution_count": 30,
119 | "metadata": {},
120 | "output_type": "execute_result"
121 | },
122 | {
123 | "data": {
124 | "image/svg+xml": [
125 | "\n",
126 | "\n",
128 | "\n"
348 | ],
349 | "text/plain": [
350 | ""
351 | ]
352 | },
353 | "metadata": {
354 | "needs_background": "light"
355 | },
356 | "output_type": "display_data"
357 | }
358 | ],
359 | "source": [
360 | "# env.reset()\n",
361 | "# frame = env.render(mode='rgb_array')\n",
362 | "\n",
363 | "# fig, ax = plt.subplots(figsize=(8, 6))\n",
364 | "# ax.axes.yaxis.set_visible(False)\n",
365 | "# min_x = env.observation_space.low[0]\n",
366 | "# max_x = env.observation_space.high[0]\n",
367 | "# ax.imshow(frame, extent=[min_x, max_x, 0, 8])\n",
368 | "\n"
369 | ]
370 | },
371 | {
372 | "cell_type": "markdown",
373 | "id": "4f53a38e",
374 | "metadata": {},
375 | "source": [
376 | "## State space"
377 | ]
378 | },
379 | {
380 | "cell_type": "code",
381 | "execution_count": 51,
382 | "id": "e809514b",
383 | "metadata": {},
384 | "outputs": [
385 | {
386 | "name": "stdout",
387 | "output_type": "stream",
388 | "text": [
389 | "Cart position from -4.80 to 4.80\n",
390 | "Cart velocity from -3.40E+38 to 3.40E+38\n",
391 | "Angle from -0.42 to 0.42\n",
392 | "Angular velocity from -3.40E+38 to 3.40E+38\n"
393 | ]
394 | }
395 | ],
396 | "source": [
397 | "# The state consists of 4 numbers:\n",
398 | "x_min, v_min, angle_min, angular_v_min = env.observation_space.low\n",
399 | "x_max, v_max, angle_max, angular_v_max = env.observation_space.high\n",
400 | "\n",
401 | "print(f'Cart position from {x_min:.2f} to {x_max:.2f}')\n",
402 | "print(f'Cart velocity from {v_min:.2E} to {v_max:.2E}')\n",
403 | "print(f'Angle from {angle_min:.2f} to {angle_max:.2f}')\n",
404 | "print(f'Angular velocity from {angular_v_min:.2E} to {angular_v_max:.2E}')"
405 | ]
406 | },
407 | {
408 | "cell_type": "markdown",
409 | "id": "f413604e",
410 | "metadata": {},
411 | "source": [
412 | "[IMAGE]"
413 | ]
414 | },
415 | {
416 | "cell_type": "markdown",
417 | "id": "5e0c527b",
418 | "metadata": {},
419 | "source": [
420 | "### The ranges for the cart velocity and pole angular velocity are a bit too large, aren't they?\n",
421 | "\n",
422 | "👉 As a general principle, the high/low state values you can read from `env.observation_space`\n",
423 |     "are set very conservatively, to guarantee that the state value always lies between the max and the min.\n",
424 |     "\n",
425 |     "👉In practice, you need to simulate a few interactions with the environment to really see the actual intervals where the state components lie (see the sketch in the code cell below).\n",
426 | "\n",
427 | "👉 Knowing the max and min values for each state component is going to be useful later when we normalize the inputs to our Parametric models."
428 | ]
429 | },
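  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3f2a9b11",
   "metadata": {},
   "outputs": [],
   "source": [
    "# A minimal sketch (not part of the original lesson): sample random\n",
    "# interactions to estimate the ranges the state components actually take.\n",
    "import numpy as np\n",
    "\n",
    "observed_states = []\n",
    "state = env.reset()\n",
    "for _ in range(10000):\n",
    "    state, reward, done, info = env.step(env.action_space.sample())\n",
    "    observed_states.append(state)\n",
    "    if done:\n",
    "        state = env.reset()\n",
    "\n",
    "observed_states = np.array(observed_states)\n",
    "print('observed min:', observed_states.min(axis=0))\n",
    "print('observed max:', observed_states.max(axis=0))"
   ]
  },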
430 | {
431 | "cell_type": "markdown",
432 | "id": "1fcfc13a",
433 | "metadata": {},
434 | "source": [
435 | "## Action space\n",
436 | "\n",
437 | "- `0` Push cart to the left\n",
438 | "- `1` Push cart to the right"
439 | ]
440 | },
441 | {
442 | "cell_type": "code",
443 | "execution_count": 43,
444 | "id": "98cfdb84",
445 | "metadata": {},
446 | "outputs": [
447 | {
448 | "name": "stdout",
449 | "output_type": "stream",
450 | "text": [
451 | "Action Space Discrete(2)\n"
452 | ]
453 | }
454 | ],
455 | "source": [
456 | "print(\"Action Space {}\".format(env.action_space))"
457 | ]
458 | },
459 | {
460 | "cell_type": "markdown",
461 | "id": "c8f6a690",
462 | "metadata": {},
463 | "source": [
464 | "## Rewards\n",
465 | "\n",
466 | "- A reward of -1 is awarded if the position of the car is less than 0.5.\n",
467 | "- The episode ends once the car's position is above 0.5, or the max number of steps has been reached: `n_steps >= env._max_episode_steps`\n",
468 | "\n",
469 | "A default negative reward of -1 encourages the car to escape the valley as fast as possible."
470 | ]
471 | },
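{
 "cell_type": "markdown",
 "id": "added-reward-check-note",
 "metadata": {},
 "source": [
  "👉 Quick sanity check (added for illustration, assuming `env` is the `CartPole-v1` environment created earlier): with a reward of +1 per step, the total reward of an episode equals its length."
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "id": "added-reward-check",
 "metadata": {},
 "outputs": [],
 "source": [
  "# Sketch (not part of the original notebook): run one random episode and\n",
  "# verify that the total reward equals the number of steps survived.\n",
  "state = env.reset()\n",
  "done = False\n",
  "total_reward, n_steps = 0, 0\n",
  "while not done:\n",
  "    state, reward, done, info = env.step(env.action_space.sample())\n",
  "    total_reward += reward\n",
  "    n_steps += 1\n",
  "\n",
  "print(f'{n_steps} steps, total reward = {total_reward}')"
 ]
},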
472 | {
473 | "cell_type": "code",
474 | "execution_count": null,
475 | "id": "578d1ba3",
476 | "metadata": {},
477 | "outputs": [],
478 | "source": []
479 | }
480 | ],
481 | "metadata": {
482 | "kernelspec": {
483 | "display_name": "Python 3 (ipykernel)",
484 | "language": "python",
485 | "name": "python3"
486 | },
487 | "language_info": {
488 | "codemirror_mode": {
489 | "name": "ipython",
490 | "version": 3
491 | },
492 | "file_extension": ".py",
493 | "mimetype": "text/x-python",
494 | "name": "python",
495 | "nbconvert_exporter": "python",
496 | "pygments_lexer": "ipython3",
497 | "version": "3.7.5"
498 | }
499 | },
500 | "nbformat": 4,
501 | "nbformat_minor": 5
502 | }
503 |
--------------------------------------------------------------------------------
/03_cart_pole/notebooks/04_homework.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "f0fd6807",
6 | "metadata": {},
7 | "source": [
8 | "# 04 Homework 🏋️🏋️🏋️"
9 | ]
10 | },
11 | {
12 | "cell_type": "markdown",
13 | "id": "abcf6613",
14 | "metadata": {},
15 | "source": [
16 | "#### 👉A course without homework is not a course!\n",
17 | "\n",
18 | "#### 👉Spend some time thinking and trying to implement the challenges I propose here.\n",
19 | "\n",
20 | "#### 👉Feel free to email me your solutions at:"
21 | ]
22 | },
23 | {
24 | "cell_type": "markdown",
25 | "id": "d1d983a3",
26 | "metadata": {},
27 | "source": [
28 | "# `plabartabajo@gmail.com`"
29 | ]
30 | },
31 | {
32 | "cell_type": "markdown",
33 | "id": "86f82e45",
34 | "metadata": {},
35 | "source": [
36 | "-----"
37 | ]
38 | },
39 | {
40 | "cell_type": "markdown",
41 | "id": "67656662",
42 | "metadata": {},
43 | "source": [
44 | "## 1. Can you use 3 different `SEED` values and re-train the agent with good hyper-parameters?\n",
45 | "\n",
46 | "Do you still train a good agent? Or does the seed really affect the training outcome?"
47 | ]
48 | },
49 | {
50 | "cell_type": "markdown",
51 | "id": "c0a46bf7",
52 | "metadata": {},
53 | "source": [
54 | "## 2. Can you solve the `MountainCar-v0` environment using today's code?\n",
55 | "\n",
56 | "Are you able to score 99% on the evaluation set?"
57 | ]
58 | }
59 | ],
60 | "metadata": {
61 | "kernelspec": {
62 | "display_name": "Python 3 (ipykernel)",
63 | "language": "python",
64 | "name": "python3"
65 | },
66 | "language_info": {
67 | "codemirror_mode": {
68 | "name": "ipython",
69 | "version": 3
70 | },
71 | "file_extension": ".py",
72 | "mimetype": "text/x-python",
73 | "name": "python",
74 | "nbconvert_exporter": "python",
75 | "pygments_lexer": "ipython3",
76 | "version": "3.7.5"
77 | }
78 | },
79 | "nbformat": 4,
80 | "nbformat_minor": 5
81 | }
82 |
--------------------------------------------------------------------------------
/03_cart_pole/notebooks/08_homework.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "f0fd6807",
6 | "metadata": {},
7 | "source": [
8 | "# 08 Homework 🏋️🏋️🏋️"
9 | ]
10 | },
11 | {
12 | "cell_type": "markdown",
13 | "id": "abcf6613",
14 | "metadata": {},
15 | "source": [
16 | "#### 👉A course without homework is not a course!\n",
17 | "\n",
18 | "#### 👉Spend some time thinking and trying to implement the challenges I propose here.\n",
19 | "\n",
20 | "#### 👉Feel free to email me your solutions at:"
21 | ]
22 | },
23 | {
24 | "cell_type": "markdown",
25 | "id": "d1d983a3",
26 | "metadata": {},
27 | "source": [
28 | "# `plabartabajo@gmail.com`"
29 | ]
30 | },
31 | {
32 | "cell_type": "markdown",
33 | "id": "86f82e45",
34 | "metadata": {},
35 | "source": [
36 | "-----"
37 | ]
38 | },
39 | {
40 | "cell_type": "markdown",
41 | "id": "67656662",
42 | "metadata": {},
43 | "source": [
44 | "### 1. Re-train the neural networks from `05_crash_course_on_neural_nets.ipynb` with a larger training set, e.g. `10,000 samples`?\n",
45 | "\n",
46 | "👉Do the validation metrics improve?\n",
47 | "\n",
48 | "👉Did you manage to get to 95% validation accuracy?"
49 | ]
50 | },
51 | {
52 | "cell_type": "markdown",
53 | "id": "c0a46bf7",
54 | "metadata": {},
55 | "source": [
56 | "## 2. Can you perfectly solve the `Cart Pole` environment using a neural network with only 1 hidden layer?\n",
57 | "\n",
58 | "\n",
59 | ""
60 | ]
61 | }
62 | ],
63 | "metadata": {
64 | "kernelspec": {
65 | "display_name": "Python 3 (ipykernel)",
66 | "language": "python",
67 | "name": "python3"
68 | },
69 | "language_info": {
70 | "codemirror_mode": {
71 | "name": "ipython",
72 | "version": 3
73 | },
74 | "file_extension": ".py",
75 | "mimetype": "text/x-python",
76 | "name": "python",
77 | "nbconvert_exporter": "python",
78 | "pygments_lexer": "ipython3",
79 | "version": "3.7.5"
80 | }
81 | },
82 | "nbformat": 4,
83 | "nbformat_minor": 5
84 | }
85 |
--------------------------------------------------------------------------------
/03_cart_pole/notebooks/10_homework.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "f0fd6807",
6 | "metadata": {},
7 | "source": [
8 | "# 10 Homework 🏋️🏋️🏋️"
9 | ]
10 | },
11 | {
12 | "cell_type": "markdown",
13 | "id": "3f1582da",
14 | "metadata": {},
15 | "source": [
16 | "## Challenge"
17 | ]
18 | },
19 | {
20 | "cell_type": "markdown",
21 | "id": "67656662",
22 | "metadata": {},
23 | "source": [
24 | "If you carefully look at `sample_hyper_parameters()` in `src/optimize_hyperparameters.py` you will see I did not let Optuna test different neural network architectures.\n",
25 | "\n",
26 | "I set `nn_hidden_layers = [256, 256]` and that was it.\n",
27 | "\n",
28 | "I dare you find the smallest neural network architecture that solves the `CartPole` perfectly."
29 | ]
30 | },
31 | {
32 | "cell_type": "markdown",
33 | "id": "4ea9756e",
34 | "metadata": {},
35 | "source": [
36 | "## Send your solution through\n",
37 | "\n",
38 | "- a pull request or\n",
39 | "- direcly by email at `plabartabajo@gmail.com`"
40 | ]
41 | }
42 | ],
43 | "metadata": {
44 | "kernelspec": {
45 | "display_name": "Python 3 (ipykernel)",
46 | "language": "python",
47 | "name": "python3"
48 | },
49 | "language_info": {
50 | "codemirror_mode": {
51 | "name": "ipython",
52 | "version": 3
53 | },
54 | "file_extension": ".py",
55 | "mimetype": "text/x-python",
56 | "name": "python",
57 | "nbconvert_exporter": "python",
58 | "pygments_lexer": "ipython3",
59 | "version": "3.7.5"
60 | }
61 | },
62 | "nbformat": 4,
63 | "nbformat_minor": 5
64 | }
65 |
--------------------------------------------------------------------------------
/03_cart_pole/pyproject.toml:
--------------------------------------------------------------------------------
1 | [tool.poetry]
2 | name = "src"
3 | version = "0.1.0"
4 | description = ""
5 | authors = ["Pau "]
6 |
7 | [tool.poetry.dependencies]
8 | python = ">=3.7.1,<3.8"
9 | gym = "^0.21.0"
10 | sklearn = "^0.0"
11 | numpy = "^1.21.4"
12 | matplotlib = "^3.5.0"
13 | jupyter = "^1.0.0"
14 | tqdm = "^4.62.3"
15 | torch = "^1.10.1"
16 | tensorboard = "^2.7.0"
17 | pandas = "^1.3.5"
18 | PyYAML = "^6.0"
19 | pyglet = "^1.5.21"
20 | mlflow = "^1.22.0"
21 | gdown = "^4.2.0"
22 | optuna = "^2.10.0"
23 | pyngrok = "^5.1.0"
24 |
25 | [tool.poetry.dev-dependencies]
26 | pytest = "^5.2"
27 | certifi = "^2021.10.8"
28 |
29 | [build-system]
30 | requires = ["poetry-core>=1.0.0"]
31 | build-backend = "poetry.core.masonry.api"
32 |
--------------------------------------------------------------------------------
/03_cart_pole/requirements.txt:
--------------------------------------------------------------------------------
1 | absl-py==1.0.0; python_version >= "3.6"
2 | alembic==1.4.1; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
3 | appnope==0.1.2; platform_system == "Darwin" and python_version >= "3.7" and sys_platform == "darwin"
4 | argcomplete==2.0.0; python_version < "3.8.0" and python_version >= "3.7"
5 | argon2-cffi-bindings==21.2.0; python_version >= "3.6"
6 | argon2-cffi==21.3.0; python_version >= "3.6"
7 | atomicwrites==1.4.0; python_version >= "3.5" and python_full_version < "3.0.0" and sys_platform == "win32" or sys_platform == "win32" and python_version >= "3.5" and python_full_version >= "3.4.0"
8 | attrs==21.4.0; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.7"
9 | autopage==0.5.0; python_version >= "3.6"
10 | backcall==0.2.0; python_version >= "3.7"
11 | beautifulsoup4==4.10.0; python_full_version > "3.0.0"
12 | bleach==4.1.0; python_version >= "3.7"
13 | cachetools==4.2.4; python_version >= "3.5" and python_version < "4.0" and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6")
14 | certifi==2021.10.8
15 | cffi==1.15.0; implementation_name == "pypy" and python_version >= "3.6"
16 | charset-normalizer==2.0.9; python_full_version >= "3.6.0" and python_version >= "3.6"
17 | click==8.0.3; python_version >= "3.6"
18 | cliff==3.10.1; python_version >= "3.6"
19 | cloudpickle==2.0.0; python_version >= "3.6"
20 | cmaes==0.8.2; python_version >= "3.6"
21 | cmd2==2.4.0; python_version >= "3.6"
22 | colorama==0.4.4; python_version >= "3.7" and python_full_version < "3.0.0" and platform_system == "Windows" and sys_platform == "win32" or python_full_version >= "3.5.0" and platform_system == "Windows" and sys_platform == "win32" and python_version >= "3.7"
23 | colorlog==6.6.0; python_version >= "3.6"
24 | cycler==0.11.0; python_version >= "3.7"
25 | databricks-cli==0.16.2; python_version >= "3.6"
26 | debugpy==1.5.1; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.7"
27 | decorator==5.1.0; python_version >= "3.7"
28 | defusedxml==0.7.1; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.7"
29 | docker==5.0.3; python_version >= "3.6"
30 | entrypoints==0.3; python_full_version >= "3.6.1" and python_version >= "3.7"
31 | filelock==3.4.2; python_version >= "3.7"
32 | flask==2.0.2; python_version >= "3.6"
33 | fonttools==4.28.5; python_version >= "3.7"
34 | gdown==4.2.0
35 | gitdb==4.0.9; python_version >= "3.7"
36 | gitpython==3.1.26; python_version >= "3.7"
37 | google-auth-oauthlib==0.4.6; python_version >= "3.6"
38 | google-auth==2.3.3; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6"
39 | greenlet==1.1.2; python_version >= "3" and python_full_version < "3.0.0" and (platform_machine == "aarch64" or platform_machine == "ppc64le" or platform_machine == "x86_64" or platform_machine == "amd64" or platform_machine == "AMD64" or platform_machine == "win32" or platform_machine == "WIN32") and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6") or python_version >= "3" and (platform_machine == "aarch64" or platform_machine == "ppc64le" or platform_machine == "x86_64" or platform_machine == "amd64" or platform_machine == "AMD64" or platform_machine == "win32" or platform_machine == "WIN32") and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6") and python_full_version >= "3.5.0"
40 | grpcio==1.43.0; python_version >= "3.6"
41 | gunicorn==20.1.0; platform_system != "Windows" and python_version >= "3.6"
42 | gym==0.21.0; python_version >= "3.6"
43 | idna==3.3; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6"
44 | importlib-metadata==4.10.0; python_version == "3.7" and (python_version >= "3.7" and python_full_version < "3.0.0" and python_version < "3.8" or python_full_version >= "3.6.0" and python_version >= "3.7" and python_version < "3.8") and (python_version >= "3.7" and python_full_version < "3.0.0" and python_version < "3.8" or python_full_version >= "3.4.0" and python_version >= "3.7" and python_version < "3.8")
45 | importlib-resources==5.4.0; python_version < "3.9" and python_version >= "3.7"
46 | ipykernel==6.6.1; python_version >= "3.7"
47 | ipython-genutils==0.2.0; python_version >= "3.6"
48 | ipython==7.30.1; python_version >= "3.7"
49 | ipywidgets==7.6.5
50 | itsdangerous==2.0.1; python_version >= "3.6"
51 | jedi==0.18.1; python_version >= "3.7"
52 | jinja2==3.0.3; python_version >= "3.7"
53 | joblib==1.1.0; python_version >= "3.7"
54 | jsonschema==4.3.3; python_version >= "3.7"
55 | jupyter-client==7.1.0; python_full_version >= "3.6.1" and python_version >= "3.7"
56 | jupyter-console==6.4.0; python_version >= "3.6"
57 | jupyter-core==4.9.1; python_full_version >= "3.6.1" and python_version >= "3.7"
58 | jupyter==1.0.0
59 | jupyterlab-pygments==0.1.2; python_version >= "3.7"
60 | jupyterlab-widgets==1.0.2; python_version >= "3.6"
61 | kiwisolver==1.3.2; python_version >= "3.7"
62 | mako==1.1.6; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
63 | markdown==3.3.6; python_version >= "3.6"
64 | markupsafe==2.0.1; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.7"
65 | matplotlib-inline==0.1.3; python_version >= "3.7"
66 | matplotlib==3.5.1; python_version >= "3.7"
67 | mistune==0.8.4; python_version >= "3.7"
68 | mlflow==1.22.0; python_version >= "3.6"
69 | more-itertools==8.12.0; python_version >= "3.5"
70 | nbclient==0.5.9; python_full_version >= "3.6.1" and python_version >= "3.7"
71 | nbconvert==6.4.0; python_version >= "3.7"
72 | nbformat==5.1.3; python_full_version >= "3.6.1" and python_version >= "3.7"
73 | nest-asyncio==1.5.4; python_full_version >= "3.6.1" and python_version >= "3.7"
74 | notebook==6.4.6; python_version >= "3.6"
75 | numpy==1.21.5; python_version >= "3.7" and python_version < "3.11"
76 | oauthlib==3.1.1; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
77 | optuna==2.10.0; python_version >= "3.6"
78 | packaging==21.3; python_version >= "3.7"
79 | pandas==1.3.5; python_full_version >= "3.7.1"
80 | pandocfilters==1.5.0; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.7"
81 | parso==0.8.3; python_version >= "3.7"
82 | pbr==5.8.1; python_version >= "3.6"
83 | pexpect==4.8.0; sys_platform != "win32" and python_version >= "3.7"
84 | pickleshare==0.7.5; python_version >= "3.7"
85 | pillow==9.0.0; python_version >= "3.7"
86 | pluggy==0.13.1; python_version >= "3.5" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.5"
87 | prettytable==3.1.1; python_version >= "3.7"
88 | prometheus-client==0.12.0; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
89 | prometheus-flask-exporter==0.18.7; python_version >= "3.6"
90 | prompt-toolkit==3.0.24; python_full_version >= "3.6.2" and python_version >= "3.7"
91 | protobuf==3.19.1; python_version >= "3.6"
92 | ptyprocess==0.7.0; os_name != "nt" and python_version >= "3.7" and sys_platform != "win32"
93 | py==1.11.0; python_version >= "3.6" and python_full_version < "3.0.0" and implementation_name == "pypy" or python_full_version >= "3.5.0" and python_version >= "3.6" and implementation_name == "pypy"
94 | pyasn1-modules==0.2.8; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6"
95 | pyasn1==0.4.8; python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "4" and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6") or python_full_version >= "3.6.0" and python_version >= "3.6" and python_version < "4" and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6")
96 | pycparser==2.21; python_version >= "3.6" and python_full_version < "3.0.0" and implementation_name == "pypy" or implementation_name == "pypy" and python_version >= "3.6" and python_full_version >= "3.4.0"
97 | pyglet==1.5.21
98 | pygments==2.11.1; python_version >= "3.7"
99 | pyngrok==5.1.0; python_version >= "3.5"
100 | pyparsing==3.0.6; python_version >= "3.7"
101 | pyperclip==1.8.2; python_version >= "3.6"
102 | pyreadline3==3.4.1; sys_platform == "win32" and python_version >= "3.6"
103 | pyrsistent==0.18.0; python_version >= "3.7"
104 | pysocks==1.7.1; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6"
105 | pytest==5.4.3; python_version >= "3.5"
106 | python-dateutil==2.8.2; python_full_version >= "3.7.1" and python_version >= "3.7" and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6")
107 | python-editor==1.0.4; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
108 | pytz==2021.3; python_full_version >= "3.7.1" and python_version >= "3.6"
109 | pywin32==227; sys_platform == "win32" and python_version >= "3.7" and platform_python_implementation != "PyPy"
110 | pywinpty==1.1.6; os_name == "nt" and python_version >= "3.6"
111 | pyyaml==6.0; python_version >= "3.6"
112 | pyzmq==22.3.0; python_full_version >= "3.6.1" and python_version >= "3.7"
113 | qtconsole==5.2.2; python_version >= "3.6"
114 | qtpy==2.0.0; python_version >= "3.6"
115 | querystring-parser==1.2.4; python_version >= "3.6"
116 | requests-oauthlib==1.3.0; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
117 | requests==2.27.0; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6"
118 | rsa==4.8; python_version >= "3.6" and python_version < "4" and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6")
119 | scikit-learn==1.0.2; python_version >= "3.7"
120 | scipy==1.7.3; python_version >= "3.7" and python_version < "3.11"
121 | send2trash==1.8.0; python_version >= "3.6"
122 | setuptools-scm==6.3.2; python_version >= "3.7"
123 | six==1.16.0; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.7"
124 | sklearn==0.0
125 | smmap==5.0.0; python_version >= "3.7"
126 | soupsieve==2.3.1; python_version >= "3.6" and python_full_version > "3.0.0"
127 | sqlalchemy==1.4.29; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6"
128 | sqlparse==0.4.2; python_version >= "3.6"
129 | stevedore==3.5.0; python_version >= "3.6"
130 | tabulate==0.8.9; python_version >= "3.6"
131 | tensorboard-data-server==0.6.1; python_version >= "3.6"
132 | tensorboard-plugin-wit==1.8.0; python_version >= "3.6"
133 | tensorboard==2.7.0; python_version >= "3.6"
134 | terminado==0.12.1; python_version >= "3.6"
135 | testpath==0.5.0; python_version >= "3.7"
136 | threadpoolctl==3.0.0; python_version >= "3.7"
137 | tomli==2.0.0; python_version >= "3.7"
138 | torch==1.10.1; python_full_version >= "3.6.2"
139 | tornado==6.1; python_full_version >= "3.6.1" and python_version >= "3.7"
140 | tqdm==4.62.3; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.4.0")
141 | traitlets==5.1.1; python_full_version >= "3.6.1" and python_version >= "3.7"
142 | typing-extensions==4.0.1; python_version >= "3.7" and python_full_version >= "3.6.2" and python_version < "3.8"
143 | urllib3==1.26.7; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version < "4" and python_version >= "3.6"
144 | waitress==2.0.0; platform_system == "Windows" and python_version >= "3.6" and python_full_version >= "3.6.0"
145 | wcwidth==0.2.5; python_full_version >= "3.6.2" and python_version >= "3.7"
146 | webencodings==0.5.1; python_version >= "3.7"
147 | websocket-client==1.2.3; python_version >= "3.6"
148 | werkzeug==2.0.2; python_version >= "3.6"
149 | widgetsnbextension==3.5.2
150 | zipp==3.7.0; python_version < "3.8" and python_version >= "3.7"
151 |
--------------------------------------------------------------------------------
/03_cart_pole/saved_agents/CartPole-v1/0/hparams.json:
--------------------------------------------------------------------------------
1 | {"learning_rate": 0.119449136260578, "discount_factor": 0.99, "batch_size": 128, "memory_size": 100000, "freq_steps_update_target": 1000, "n_steps_warm_up_memory": 1000, "freq_steps_train": 16, "n_gradient_steps": 16, "nn_hidden_layers": null, "max_grad_norm": 1, "normalize_state": false, "epsilon_start": 0.9, "epsilon_end": 0.1421425009699689, "steps_epsilon_decay": 100000}
--------------------------------------------------------------------------------
/03_cart_pole/saved_agents/CartPole-v1/0/model:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Paulescu/hands-on-rl/21c11b01fd6b950cfa16800b4dd9234d55b4a1ac/03_cart_pole/saved_agents/CartPole-v1/0/model
--------------------------------------------------------------------------------
/03_cart_pole/saved_agents/readme.md:
--------------------------------------------------------------------------------
1 | ### Trained agents are saved in this folder
--------------------------------------------------------------------------------
/03_cart_pole/src/__init__.py:
--------------------------------------------------------------------------------
1 | __version__ = '0.1.0'
2 |
--------------------------------------------------------------------------------
/03_cart_pole/src/agent_memory.py:
--------------------------------------------------------------------------------
1 | from collections import namedtuple, deque
2 | import random
3 |
4 | Transition = namedtuple('Transition', ('state', 'action', 'reward', 'next_state', 'done'))
5 |
6 |
7 | class AgentMemory:
8 |
9 | def __init__(self, memory_size):
10 | self.memory = deque([], maxlen=memory_size)
11 |
12 | def push(self, *args):
13 | """Save a transition"""
14 | self.memory.append(Transition(*args))
15 |
16 | def sample(self, batch_size):
17 | transitions = random.sample(self.memory, batch_size)
18 |
19 | # stop()
20 |
21 | return Transition(*zip(*transitions))
22 |
23 | def __len__(self):
24 | return len(self.memory)
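25 |
26 |
27 | # A quick usage sketch (added for illustration, not used by the course code):
28 | # push a few fake transitions and sample a batch. Note that `sample` returns a
29 | # single `Transition` whose fields are tuples, one entry per sampled experience.
30 | if __name__ == '__main__':
31 |     memory = AgentMemory(memory_size=100)
32 |     for i in range(10):
33 |         memory.push([float(i)] * 4, 0, 1.0, [float(i + 1)] * 4, False)
34 |
35 |     batch = memory.sample(batch_size=4)
36 |     print(batch.state)   # tuple of 4 states
37 |     print(batch.reward)  # tuple of 4 rewards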
--------------------------------------------------------------------------------
/03_cart_pole/src/config.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pathlib
3 | root_dir = pathlib.Path(__file__).parent.resolve().parent
4 |
5 | SAVED_AGENTS_DIR = root_dir / 'saved_agents'
6 | TENSORBOARD_LOG_DIR = root_dir / 'tensorboard_logs'
7 | OPTUNA_DB = root_dir / 'optuna.db'
8 | DATA_SUPERVISED_ML = root_dir / 'data_supervised_ml'
9 | MLFLOW_RUNS_DIR = root_dir / 'mlflow_runs'
10 |
11 | if not SAVED_AGENTS_DIR.exists():
12 | os.makedirs(SAVED_AGENTS_DIR)
13 |
14 | if not TENSORBOARD_LOG_DIR.exists():
15 | os.makedirs(TENSORBOARD_LOG_DIR)
16 |
17 | if not DATA_SUPERVISED_ML.exists():
18 | os.makedirs(DATA_SUPERVISED_ML)
19 |
20 | if not MLFLOW_RUNS_DIR.exists():
21 | os.makedirs(MLFLOW_RUNS_DIR)
--------------------------------------------------------------------------------
/03_cart_pole/src/loops.py:
--------------------------------------------------------------------------------
1 | from typing import Tuple, List, Callable, Union, Optional
2 | import random
3 | from pathlib import Path
4 | from collections import deque
5 | from pdb import set_trace as stop
6 |
7 | import numpy as np
8 | from tqdm import tqdm
9 | import torch
10 | from torch.utils.tensorboard import SummaryWriter
11 |
12 |
13 |
14 | def train(
15 | agent,
16 | env,
17 | n_episodes: int,
18 | log_dir: Optional[Path] = None,
19 | max_steps: Optional[int] = float("inf"),
20 | n_episodes_evaluate_agent: Optional[int] = 100,
21 | freq_episodes_evaluate_agent: int = 200,
22 | ) -> None:
23 |
24 | # Tensorboard log writer
25 | logging = False
26 | if log_dir is not None:
27 | writer = SummaryWriter(log_dir)
28 | logging = True
29 |
30 | reward_per_episode = []
31 | steps_per_episode = []
32 | global_step_counter = 0
33 |
34 | for i in tqdm(range(0, n_episodes)):
35 |
36 | state = env.reset()
37 |
38 | rewards = 0
39 | steps = 0
40 | done = False
41 | while not done:
42 |
43 | action = agent.act(state)
44 |
45 | # the agent takes a step and the environment returns a new state and
46 | # a reward
47 | next_state, reward, done, info = env.step(action)
48 |
49 | # agent observes transition and stores it for later use
50 | agent.observe(state, action, reward, next_state, done)
51 |
52 | # learning happens here, through experience replay
53 | agent.replay()
54 |
55 | global_step_counter += 1
56 | steps += 1
57 | rewards += reward
58 | state = next_state
59 |
60 | # log to Tensorboard
61 | if logging:
62 | writer.add_scalar('train/rewards', rewards, i)
63 | writer.add_scalar('train/steps', steps, i)
64 | writer.add_scalar('train/epsilon', agent.epsilon, i)
65 | writer.add_scalar('train/replay_memory_size', len(agent.memory), i)
66 |
67 | reward_per_episode.append(rewards)
68 | steps_per_episode.append(steps)
69 |
70 | # if (i > 0) and (i % freq_episodes_evaluate_agent) == 0:
71 | if (i + 1) % freq_episodes_evaluate_agent == 0:
72 | # evaluate agent
73 | eval_rewards, eval_steps = evaluate(agent, env,
74 | n_episodes=n_episodes_evaluate_agent,
75 | epsilon=0.01)
76 |
77 | # from src.utils import get_success_rate_from_n_steps
78 | # success_rate = get_success_rate_from_n_steps(env, eval_steps)
79 | print(f'Reward mean: {np.mean(eval_rewards):.2f}, std: {np.std(eval_rewards):.2f}')
80 | print(f'Num steps mean: {np.mean(eval_steps):.2f}, std: {np.std(eval_steps):.2f}')
81 | # print(f'Success rate: {success_rate:.2%}')
82 | if logging:
83 | writer.add_scalar('eval/avg_reward', np.mean(eval_rewards), i)
84 | writer.add_scalar('eval/avg_steps', np.mean(eval_steps), i)
85 | # writer.add_scalar('eval/success_rate', success_rate, i)
86 |
87 | if global_step_counter > max_steps:
88 | break
89 |
90 |
91 | def evaluate(
92 | agent,
93 | env,
94 | n_episodes: int,
95 | epsilon: Optional[float] = None,
96 | seed: Optional[int] = 0,
97 | ) -> Tuple[List, List]:
98 |
99 | from src.utils import set_seed
100 | set_seed(env, seed)
101 |
102 | # output metrics
103 | reward_per_episode = []
104 | steps_per_episode = []
105 |
106 | for i in tqdm(range(0, n_episodes)):
107 |
108 | state = env.reset()
109 | rewards = 0
110 | steps = 0
111 | done = False
112 | while not done:
113 |
114 | action = agent.act(state, epsilon=epsilon)
115 | next_state, reward, done, info = env.step(action)
116 |
117 | rewards += reward
118 | steps += 1
119 | state = next_state
120 |
121 | reward_per_episode.append(rewards)
122 | steps_per_episode.append(steps)
123 |
124 | return reward_per_episode, steps_per_episode
--------------------------------------------------------------------------------
/03_cart_pole/src/model_factory.py:
--------------------------------------------------------------------------------
1 | from typing import Optional, List
2 | from pdb import set_trace as stop
3 |
4 | import torch.nn as nn
5 |
6 |
7 | def get_model(
8 | input_dim: int,
9 | output_dim: int,
10 | hidden_layers: Optional[List[int]] = None,
11 | ):
12 | """
13 | Feed-forward network, made of linear layers with ReLU activation functions.
14 | The number of hidden layers and their sizes are given by `hidden_layers`.
15 | """
16 | # assert init_method in {'default', 'xavier'}
17 |
18 | if hidden_layers is None:
19 | # linear model
20 | model = nn.Sequential(nn.Linear(input_dim, output_dim))
21 |
22 | else:
23 | # neural network
24 | # there are hidden layers in this case.
25 | dims = [input_dim] + hidden_layers + [output_dim]
26 | modules = []
27 | for i, dim in enumerate(dims[:-2]):
28 | modules.append(nn.Linear(dims[i], dims[i + 1]))
29 | modules.append(nn.ReLU())
30 |
31 | modules.append(nn.Linear(dims[-2], dims[-1]))
32 | model = nn.Sequential(*modules)
33 | # stop()
34 |
35 | # n_parameters = sum(p.numel() for p in model.parameters() if p.requires_grad)
36 | # print(f'{n_parameters:,} parameters')
37 |
38 | return model
39 |
40 | def count_parameters(model: nn.Module) -> int:
41 | """"""
42 | return sum(p.numel() for p in model.parameters() if p.requires_grad)
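43 |
44 |
45 | # A quick usage sketch (added for illustration, not used by the course code):
46 | # build a small q-network for CartPole (4-dimensional state, 2 actions) with
47 | # two hidden layers, and count its trainable parameters.
48 | if __name__ == '__main__':
49 |     demo_model = get_model(input_dim=4, output_dim=2, hidden_layers=[64, 64])
50 |     print(demo_model)
51 |     print(f'{count_parameters(demo_model):,} trainable parameters')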
--------------------------------------------------------------------------------
/03_cart_pole/src/optimize_hyperparameters.py:
--------------------------------------------------------------------------------
1 | from typing import Dict
2 | from argparse import ArgumentParser
3 | from pdb import set_trace as stop
4 |
5 | import optuna
6 | import gym
7 | import numpy as np
8 | import mlflow
9 |
10 | from src.q_agent import QAgent
11 | from src.utils import get_agent_id
12 | from src.config import TENSORBOARD_LOG_DIR, SAVED_AGENTS_DIR, OPTUNA_DB
13 | from src.utils import set_seed
14 | from src.loops import train, evaluate
15 |
16 |
17 | def sample_hyper_parameters(
18 | trial: optuna.trial.Trial,
19 | force_linear_model: bool = False,
20 | ) -> Dict:
21 |
22 | learning_rate = trial.suggest_loguniform("learning_rate", 1e-5, 1e-2)
23 | discount_factor = trial.suggest_categorical("discount_factor", [0.9, 0.95, 0.99])
24 | batch_size = trial.suggest_categorical("batch_size", [16, 32, 64, 128])
25 | memory_size = trial.suggest_categorical("memory_size", [int(1e4), int(5e4), int(1e5)])
26 |
27 | # we update the main model parameters every 'freq_steps_train' steps
28 | freq_steps_train = trial.suggest_categorical('freq_steps_train', [8, 16, 128, 256])
29 |
30 | # we update the target model parameters every 'freq_steps_update_target' steps
31 | freq_steps_update_target = trial.suggest_categorical('freq_steps_update_target', [10, 100, 1000])
32 |
33 | # minimum memory size we want before we start training
34 | # e.g. 0 --> start training right away.
35 | # e.g. 1,000 --> start training when there are at least 1,000 experiences in the agent's memory
36 | n_steps_warm_up_memory = trial.suggest_categorical("n_steps_warm_up_memory", [1000, 5000])
37 |
38 | # how many consecutive gradient descent steps to perform when we update the main model parameters
39 | n_gradient_steps = trial.suggest_categorical("n_gradient_steps", [1, 4, 16])
40 |
41 | # model architecture to approximate q values
42 | if force_linear_model:
43 | # linear model
44 | nn_hidden_layers = None
45 | else:
46 | # neural network hidden layers
47 | # nn_hidden_layers = trial.suggest_categorical("nn_hidden_layers", [None, [64, 64], [256, 256]])
48 | nn_hidden_layers = trial.suggest_categorical("nn_hidden_layers", [[256, 256]]) # ;-)
49 |
50 | # how large do we let the gradients grow before capping them?
51 | # Explosive gradients can be an issue and this hyper-parameter helps mitigate it.
52 | max_grad_norm = trial.suggest_categorical("max_grad_norm", [1, 10])
53 |
54 | # should we scale the inputs before feeding them to the model?
55 | normalize_state = trial.suggest_categorical('normalize_state', [True, False])
56 |
57 | # start value for the exploration rate
58 | epsilon_start = trial.suggest_categorical("epsilon_start", [0.9])
59 |
60 | # final value for the exploration rate
61 | epsilon_end = trial.suggest_uniform("epsilon_end", 0, 0.2)
62 |
63 | # for how many steps do we decrease epsilon from its starting value to
64 | # its final value `epsilon_end`
65 | steps_epsilon_decay = trial.suggest_categorical("steps_epsilon_decay", [int(1e3), int(1e4), int(1e5)])
66 |
67 | seed = trial.suggest_int('seed', 0, 2 ** 30 - 1)
68 |
69 | return {
70 | 'learning_rate': learning_rate,
71 | 'discount_factor': discount_factor,
72 | 'batch_size': batch_size,
73 | 'memory_size': memory_size,
74 | 'freq_steps_train': freq_steps_train,
75 | 'freq_steps_update_target': freq_steps_update_target,
76 | 'n_steps_warm_up_memory': n_steps_warm_up_memory,
77 | 'n_gradient_steps': n_gradient_steps,
78 | 'nn_hidden_layers': nn_hidden_layers,
79 | 'max_grad_norm': max_grad_norm,
80 | 'normalize_state': normalize_state,
81 | 'epsilon_start': epsilon_start,
82 | 'epsilon_end': epsilon_end,
83 | 'steps_epsilon_decay': steps_epsilon_decay,
84 | 'seed': seed,
85 | }
86 |
87 |
88 | def objective(
89 | trial: optuna.trial.Trial,
90 | force_linear_model: bool = False,
91 | n_episodes_to_train: int = 200,
92 | ):
93 | env_name = 'CartPole-v1'
94 | env = gym.make(env_name)
95 |
96 | with mlflow.start_run():
97 |
98 | # generate unique agent_id
99 | agent_id = get_agent_id(env_name)
100 | mlflow.log_param('agent_id', agent_id)
101 |
102 | # hyper-parameters
103 | args = sample_hyper_parameters(trial,
104 | force_linear_model=force_linear_model)
105 | mlflow.log_params(trial.params)
106 |
107 | # fix seeds to ensure reproducible runs
108 | set_seed(env, args['seed'])
109 |
110 | # create agent object
111 | agent = QAgent(
112 | env,
113 | learning_rate=args['learning_rate'],
114 | discount_factor=args['discount_factor'],
115 | batch_size=args['batch_size'],
116 | memory_size=args['memory_size'],
117 | freq_steps_train=args['freq_steps_train'],
118 | freq_steps_update_target=args['freq_steps_update_target'],
119 | n_steps_warm_up_memory=args['n_steps_warm_up_memory'],
120 | n_gradient_steps=args['n_gradient_steps'],
121 | nn_hidden_layers=args['nn_hidden_layers'],
122 | max_grad_norm=args['max_grad_norm'],
123 | normalize_state=args['normalize_state'],
124 | epsilon_start=args['epsilon_start'],
125 | epsilon_end=args['epsilon_end'],
126 | steps_epsilon_decay=args['steps_epsilon_decay'],
127 | log_dir=TENSORBOARD_LOG_DIR / env_name / agent_id
128 | )
129 |
130 | # train loop
131 | train(agent,
132 | env,
133 | n_episodes=n_episodes_to_train,
134 | log_dir=TENSORBOARD_LOG_DIR / env_name / agent_id)
135 |
136 | agent.save_to_disk(SAVED_AGENTS_DIR / env_name / agent_id)
137 |
138 | # evaluate its performance
139 | rewards, steps = evaluate(agent, env, n_episodes=1000, epsilon=0.00)
140 | mean_reward = np.mean(rewards)
141 | std_reward = np.std(rewards)
142 | mlflow.log_metric('mean_reward', mean_reward)
143 | mlflow.log_metric('std_reward', std_reward)
144 |
145 | return mean_reward
146 |
147 |
148 | if __name__ == '__main__':
149 |
150 | parser = ArgumentParser()
151 | parser.add_argument('--trials', type=int, required=True)
152 | parser.add_argument('--episodes', type=int, required=True)
153 | parser.add_argument('--force_linear_model', dest='force_linear_model', action='store_true')
154 | parser.set_defaults(force_linear_model=False)
155 | parser.add_argument('--experiment_name', type=str, required=True)
156 | args = parser.parse_args()
157 |
158 | # set Mlflow experiment name
159 | mlflow.set_experiment(args.experiment_name)
160 |
161 | # set Optuna study
162 | study = optuna.create_study(study_name=args.experiment_name,
163 | direction='maximize',
164 | load_if_exists=True,
165 | storage=f'sqlite:///{OPTUNA_DB}')
166 |
167 | # Wrap the objective inside a lambda and call objective inside it
168 | # Nice trick taken from https://www.kaggle.com/general/261870
169 | func = lambda trial: objective(trial, force_linear_model=args.force_linear_model, n_episodes_to_train=args.episodes)
170 |
171 | # run Optuna
172 | study.optimize(func, n_trials=args.trials)
--------------------------------------------------------------------------------
/03_cart_pole/src/q_agent.py:
--------------------------------------------------------------------------------
1 | """
2 | We use PyTorch for all agents:
3 |
4 | - Linear model trained one sample at a time -> Easy to train, slow and results are not great.
5 | - Linear model trained with batches of data. -> Faster to train, but results are still not good.
6 | - NN trained with batches -> Promising, but it looks like it does not train.
7 | - NN with memory buffer -> Fix sample autocorrelation
8 | - NN with memory buffer and target network for stability -> the target-network trick from the original DQN paper.
9 |
10 | """
11 | import os
12 | from pathlib import Path
13 | from typing import Union, Callable, Tuple, List
14 | import random
15 | from argparse import ArgumentParser
16 | import json
17 | from pdb import set_trace as stop
18 |
19 | import gym
20 | import numpy as np
21 | import torch
22 | import torch.nn as nn
23 | import torch.optim as optim
24 | from torch.utils.tensorboard import SummaryWriter
25 | from torch.nn import functional as F
26 |
27 | from src.model_factory import get_model
28 | from src.agent_memory import AgentMemory
29 | from src.utils import (
30 | get_agent_id,
31 | get_input_output_dims,
32 | get_epsilon_decay_fn,
33 | # load_default_hyperparameters,
34 | get_observation_samples,
35 | set_seed,
36 | get_num_model_parameters
37 | )
38 | from src.loops import train
39 | from src.config import TENSORBOARD_LOG_DIR, SAVED_AGENTS_DIR
40 |
41 |
42 | class QAgent:
43 |
44 | def __init__(
45 | self,
46 | env: gym.Env,
47 | learning_rate: float = 1e-4,
48 | discount_factor: float = 0.99,
49 | batch_size: int = 64,
50 | memory_size: int = 10000,
51 | freq_steps_update_target: int = 1000,
52 | n_steps_warm_up_memory: int = 1000,
53 | freq_steps_train: int = 16,
54 | n_gradient_steps: int = 8,
55 | nn_hidden_layers: List[int] = None,
56 | max_grad_norm: int = 10,
57 | normalize_state: bool = False,
58 | epsilon_start: float = 1.0,
59 | epsilon_end: float = 0.05,
60 | steps_epsilon_decay: float = 50000,
61 | log_dir: str = None,
62 | ):
63 | """
64 | :param env:
65 | :param learning_rate: size of the updates in the SGD/Adam formula
66 | :param discount_factor: discount factor for future rewards
67 | :param batch_size: number of (s,a,r,s') experiences we use in each SGD
68 | update
69 | :param memory_size: number of experiences the agent keeps in the replay
70 | memory
71 | :param freq_steps_update_target: frequency at which we copy the
72 | parameters
73 | from the main model to the target model.
74 | :param n_steps_warm_up_memory: number of experiences we require to have
75 | in memory before we start training the agent.
76 | :param freq_steps_train: frequency at which we update the main model
77 | parameters
78 | :param n_gradient_steps: number of SGD/Adam updates we perform when we
79 | train the main model.
80 | :param nn_hidden_layers: architecture of the main and target models.
81 | :param max_grad_norm: used to clip gradients if they become too
82 | large.
83 | :param normalize_state: True/False depending on whether you want to normalize
84 | the raw states before feeding them into the model.
85 | :param epsilon_start: starting exploration rate
86 | :param epsilon_end: final exploration rate
87 | :param steps_epsilon_decay: number of steps over which epsilon decays from
88 | 'epsilon_start' to 'epsilon_end'
89 | :param log_dir: Tensorboard logging folder
90 | """
91 | self.env = env
92 |
93 | # general hyper-parameters
94 | self.learning_rate = learning_rate
95 | self.discount_factor = discount_factor
96 |
97 | # replay memory we use to sample experiences and update parameters
98 | # `memory_size` defines the maximum number of past experiences we want the
99 | # agent remember.
100 | self.memory_size = memory_size
101 | self.memory = AgentMemory(memory_size)
102 |
103 | # number of experiences we take at once from `self.memory` to update parameters
104 | self.batch_size = batch_size
105 |
106 | # hyper-parameters to control exploration of the environment
107 | self.steps_epsilon_decay = steps_epsilon_decay
108 | self.epsilon_start = epsilon_start
109 | self.epsilon_end = epsilon_end
110 | self.epsilon_fn = get_epsilon_decay_fn(epsilon_start, epsilon_end, steps_epsilon_decay)
111 | self.epsilon = None
112 |
113 | # create q model(s). Plural because we use 2 models: main one, and the other for the target.
114 | self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
115 | self.q_net, self.target_q_net = None, None
116 | self._init_models(nn_hidden_layers)
117 | print(f'{get_num_model_parameters(self.q_net):,} parameters')
118 | self.optimizer = optim.Adam(self.q_net.parameters(), lr=learning_rate) # Adam optimizer is a safe and standard choice
119 | self.max_grad_norm = max_grad_norm
120 |
121 | # hyper-parameters to control how often or when we do certain things, like
122 | # - update the main net parameters
123 | self.freq_steps_train = freq_steps_train
124 | # - update the target net parameters
125 | self.freq_steps_update_target = freq_steps_update_target
126 | # - start training until the memory is big enough
127 | assert n_steps_warm_up_memory > batch_size, 'n_steps_warm_up_memory must be larger than batch_size'
128 | self.n_steps_warm_up_memory = n_steps_warm_up_memory
129 | # - number of gradient steps we perform every time we update the main net parameters
130 | self.n_gradient_steps = n_gradient_steps
131 |
132 | # state variable we use to keep track of the number of calls to `observe()`
133 | self._step_counter = 0
134 |
135 | # input normalizer
136 | self.normalize_state = normalize_state
137 | if normalize_state:
138 | state_samples = get_observation_samples(env, n_samples=10000)
139 | self.mean_states = state_samples.mean(axis=0)
140 | self.std_states = state_samples.std(axis=0)
141 |
142 | # create a tensorboard logger if `log_dir` was provided
143 | # logging becomes crucial to understand what is not working in our code.
144 | self.log_dir = log_dir
145 | if log_dir:
146 | self.logger = SummaryWriter(log_dir)
147 |
148 | # save hyper-parameters
149 | self.hparams = {
150 | 'learning_rate': learning_rate,
151 | 'discount_factor': discount_factor,
152 | 'batch_size': batch_size,
153 | 'memory_size': memory_size,
154 | 'freq_steps_update_target': freq_steps_update_target,
155 | 'n_steps_warm_up_memory': n_steps_warm_up_memory,
156 | 'freq_steps_train': freq_steps_train,
157 | 'n_gradient_steps': n_gradient_steps,
158 | 'nn_hidden_layers': nn_hidden_layers,
159 | 'max_grad_norm': max_grad_norm,
160 | 'normalize_state': normalize_state,
161 | 'epsilon_start': epsilon_start,
162 | 'epsilon_end': epsilon_end,
163 | 'steps_epsilon_decay': steps_epsilon_decay,
164 | }
165 |
166 | def _init_models(self, nn_hidden_layers):
167 |
168 | # e.g. for CartPole the state is a vector of dimension 4, and there are 2 possible actions
169 | input_dim, output_dim = get_input_output_dims(str(self.env))
170 | self.q_net = get_model(
171 | input_dim=input_dim,
172 | output_dim=output_dim,
173 | hidden_layers=nn_hidden_layers,
174 | )
175 | self.q_net.to(self.device)
176 |
177 | # target q-net
178 | self.target_q_net = get_model(
179 | input_dim=input_dim,
180 | output_dim=output_dim,
181 | hidden_layers=nn_hidden_layers,
182 | )
183 | self.target_q_net.to(self.device)
184 |
185 | # copy parameters from the `self.q_net`
186 | self._copy_params_to_target_q_net()
187 |
188 | def _copy_params_to_target_q_net(self):
189 | """
190 | Copies parameters from q_net to target_q_net
191 | """
192 | for target_param, param in zip(self.target_q_net.parameters(), self.q_net.parameters()):
193 | target_param.data.copy_(param.data)
194 |
195 | def _normalize_state(self, state: np.array) -> np.array:
196 | """"""
197 | # return (state - self.min_states) / (self.max_states - self.min_states)
198 | return (state - self.mean_states) / (self.std_states)
199 |
200 | def _preprocess_state(self, state: np.array) -> np.array:
201 |
202 | # state = np.copy(state_)
203 |
204 | if len(state.shape) == 1:
205 | # add extra dimension to make sure it is 2D
206 | s = state.reshape(1, -1)
207 | else:
208 | s = state
209 |
210 | if self.normalize_state:
211 | s = self._normalize_state(s)
212 |
213 | return s
214 |
215 | def act(self, state: np.array, epsilon: float = None) -> int:
216 | """
217 | Behavioural policy
218 | """
219 | if epsilon is None:
220 | # update epsilon
221 | self.epsilon = self.epsilon_fn(self._step_counter)
222 | epsilon = self.epsilon
223 |
224 | if random.uniform(0, 1) < epsilon:
225 | # Explore action space
226 | action = self.env.action_space.sample()
227 | return action
228 |
229 | # make sure s is a numpy array with 2 dimensions,
230 | # and normalize it if `self.normalize_state = True`
231 | s = self._preprocess_state(state)
232 |
233 | # forward pass through the net to compute the q-values for each possible action
234 | s = torch.from_numpy(s).float().to(self.device)
235 | q_values = self.q_net(s)
236 |
237 | # extract index max q-value and reshape tensor to dimensions (1, 1)
238 | action = q_values.max(1)[1].view(1, 1)
239 |
240 | # tensor to float
241 | action = action.item()
242 |
243 | return action
244 |
245 | def observe(self, state, action, reward, next_state, done) -> None:
246 |
247 | # preprocess state
248 | s = self._preprocess_state(state)
249 | ns = self._preprocess_state(next_state)
250 |
251 | # store new experience in the agent's memory.
252 | self.memory.push(s, action, reward, ns, done)
253 |
254 | self._step_counter += 1
255 |
256 | def replay(self) -> None:
257 |
258 | if self._step_counter % self.freq_steps_train != 0:
259 | # we only update the main model parameters every `self.freq_steps_train` steps
260 | # this way we add inertia to the agent actions, as they are more sticky
261 | return
262 |
263 | if self._step_counter < self.n_steps_warm_up_memory:
264 | # memory needs to be larger, no training yet
265 | return
266 |
267 | if self._step_counter % self.freq_steps_update_target == 0:
268 | # we update the target network parameters
269 | # self.target_nn.load_state_dict(self.nn.state_dict())
270 | self._copy_params_to_target_q_net()
271 |
272 | losses = []
273 | for i in range(0, self.n_gradient_steps):
274 |
275 | # get batch of experiences from the agent's memory.
276 | batch = self.memory.sample(self.batch_size)
277 |
278 | # A bit of plumbing to transform numpy arrays to PyTorch tensors
279 | state_batch = torch.cat([torch.from_numpy(s).float().view(1, -1) for s in batch.state]).to(self.device)
280 | action_batch = torch.cat([torch.tensor([[a]]).long().view(1, -1) for a in batch.action]).to(self.device)
281 | reward_batch = torch.cat([torch.tensor([r]).float() for r in batch.reward]).to(self.device)
282 | next_state_batch = torch.cat([torch.from_numpy(s).float().view(1, -1) for s in batch.next_state]).to(self.device)
283 | done_batch = torch.tensor(batch.done).float().to(self.device)
284 |
285 | # q-values for all possible actions
286 | q_values = self.q_net(state_batch)
287 |
288 | # keep only q_value for the chosen action in the trajectory, i.e. `action_batch`
289 | q_values = q_values.gather(1, action_batch)
290 |
291 | with torch.no_grad():
292 | # q-values for each action in next_state
293 | next_q_values = self.target_q_net(next_state_batch)
294 |
295 | # extract max q-value for each next_state
296 | next_q_values, _ = next_q_values.max(dim=1)
297 |
298 | # TD target
299 | target_q_values = (1 - done_batch) * next_q_values * self.discount_factor + reward_batch
300 |
301 | # compute loss
302 | loss = F.mse_loss(q_values.squeeze(1), target_q_values)
303 | losses.append(loss.item())
304 |
305 | # backward step to adjust network parameters
306 | self.optimizer.zero_grad()
307 | loss.backward()
308 | torch.nn.utils.clip_grad_norm_(self.q_net.parameters(), self.max_grad_norm)
309 | self.optimizer.step()
310 |
311 | if self.log_dir:
312 | self.logger.add_scalar('train/loss', np.mean(losses), self._step_counter)
313 |
314 | def save_to_disk(self, path: Path) -> None:
315 | """"""
316 | if not path.exists():
317 | os.makedirs(path)
318 |
319 | # save hyper-parameters in a json file
320 | with open(path / 'hparams.json', 'w') as f:
321 | json.dump(self.hparams, f)
322 |
323 | if self.normalize_state:
324 | np.save(path / 'mean_states.npy', self.mean_states)
325 | np.save(path / 'std_states.npy', self.std_states)
326 |
327 | # save main model
328 | torch.save(self.q_net, path / 'model')
329 |
330 | @classmethod
331 | def load_from_disk(cls, env: gym.Env, path: Path):
332 | """
333 | We recover all necessary variables to be able to evaluate the agent.
334 |
335 | NOTE: training state is not stored, so it is not possible to resume
336 | an interrupted training run as it was.
337 | """
338 | # load hyper-params
339 | with open(path / 'hparams.json', 'r') as f:
340 | hparams = json.load(f)
341 |
342 | # generate Python object
343 | agent = cls(env, **hparams)
344 |
345 | agent.normalize_state = hparams['normalize_state']
346 | if hparams['normalize_state']:
347 | agent.mean_states = np.load(path / 'mean_states.npy')
348 | agent.std_states = np.load(path / 'std_states.npy')
349 |
350 | agent.q_net = torch.load(path / 'model')
351 | agent.q_net.eval()
352 |
353 | return agent
354 |
355 |
356 |
357 | def parse_arguments():
358 | """
359 | Hyper-parameters are set either from the command line or from the `hyperparameters.yaml` file.
360 | Parameters set through the command line have priority over the default ones
361 | set in the yaml file.
362 | """
363 |
364 | parser = ArgumentParser()
365 | parser.add_argument('--env', type=str, required=True)
366 | parser.add_argument('--learning_rate', type=float)
367 | parser.add_argument('--discount_factor', type=float)
368 | parser.add_argument('--episodes', type=int)
369 | parser.add_argument('--max_steps', type=int)
370 | parser.add_argument('--epsilon_start', type=float)
371 | parser.add_argument('--epsilon_end', type=float)
372 | parser.add_argument('--steps_epsilon_decay', type=int)
373 | parser.add_argument('--batch_size', type=int)
374 | parser.add_argument('--memory_size', type=int)
375 | parser.add_argument('--n_steps_warm_up_memory', type=int)
376 | parser.add_argument('--freq_steps_update_target', type=int)
377 | parser.add_argument('--freq_steps_train', type=int)
378 | parser.add_argument('--normalize_state', dest='normalize_state', action='store_true')
379 | parser.set_defaults(normalize_state=False)
380 | parser.add_argument('--n_gradient_steps', type=int,)
381 | parser.add_argument("--nn_hidden_layers", type=int, nargs="+",)
382 | parser.add_argument('--nn_init_method', type=str, default='default')
383 | parser.add_argument('--loss', type=str)
384 | parser.add_argument("--max_grad_norm", type=float, default=10)
385 | parser.add_argument('--n_episodes_evaluate_agent', type=int, default=100)
386 | parser.add_argument('--freq_episodes_evaluate_agent', type=int, default=100)
387 | parser.add_argument('--seed', type=int, default=0)
388 |
389 | args = parser.parse_args()
390 |
391 | args_dict = {}
392 | for arg in vars(args):
393 | args_dict[arg] = getattr(args, arg)
394 |
395 | print('Hyper-parameters')
396 | for key, value in args_dict.items():
397 | print(f'{key}: {value}')
398 |
399 | return args_dict
400 |
401 |
402 | if __name__ == '__main__':
403 |
404 | args = parse_arguments()
405 |
406 | # setup the environment
407 | env = gym.make(args['env'])
408 |
409 | # fix seeds to ensure reproducibility between runs
410 | set_seed(env, args['seed'])
411 |
412 | # generate a unique agent_id, that we later use to save results to disk, as
413 | # well as TensorBoard logs.
414 | agent_id = get_agent_id(args['env'])
415 | print('agent_id: ', agent_id)
416 |
417 | agent = QAgent(
418 | env,
419 | learning_rate=args['learning_rate'],
420 | discount_factor=args['discount_factor'],
421 | batch_size=args['batch_size'],
422 | memory_size=args['memory_size'],
423 | freq_steps_train=args['freq_steps_train'],
424 | freq_steps_update_target=args['freq_steps_update_target'],
425 | n_steps_warm_up_memory=args['n_steps_warm_up_memory'],
426 | n_gradient_steps=args['n_gradient_steps'],
427 | nn_hidden_layers=args['nn_hidden_layers'],
428 | max_grad_norm=args['max_grad_norm'],
429 | normalize_state=args['normalize_state'],
430 | epsilon_start=args['epsilon_start'],
431 | epsilon_end=args['epsilon_end'],
432 | steps_epsilon_decay=args['steps_epsilon_decay'],
433 | log_dir=TENSORBOARD_LOG_DIR / args['env'] / agent_id
434 | )
435 | agent.save_to_disk(SAVED_AGENTS_DIR / args['env'] / agent_id)
436 |
437 | try:
438 | train(agent, env,
439 | n_episodes=args['episodes'],
440 | log_dir=TENSORBOARD_LOG_DIR / args['env'] / agent_id,
441 | n_episodes_evaluate_agent=args['n_episodes_evaluate_agent'],
442 | freq_episodes_evaluate_agent=args['freq_episodes_evaluate_agent'],
443 | # max_steps=args['max_steps']
444 | )
445 |
446 | agent.save_to_disk(SAVED_AGENTS_DIR / args['env'] / agent_id)
447 | print(f'Agent {agent_id} was saved')
448 |
449 | except KeyboardInterrupt:
450 | # save the agent before quitting...
451 | agent.save_to_disk(SAVED_AGENTS_DIR / args['env'] / agent_id)
452 | print(f'Agent {agent_id} was saved')
--------------------------------------------------------------------------------
/03_cart_pole/src/random_agent.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 |
3 |
4 | class RandomAgent:
5 | """
6 | This agent selects actions completely at random.
7 | Do not expect it to keep the pole balanced for long!
8 | """
9 | def __init__(self, env):
10 | self.env = env
11 |
12 | def act(self, state: np.array, epsilon: float = None) -> int:
13 | """
14 | The input arguments are ignored.
15 | The agent does not consider the state of the environment when deciding
16 | what to do next.
17 | """
18 | return self.env.action_space.sample()
--------------------------------------------------------------------------------
/03_cart_pole/src/supervised_ml.py:
--------------------------------------------------------------------------------
1 | from argparse import ArgumentParser
2 | from pathlib import Path
3 | from typing import List, Optional, Tuple, Union, Dict
4 | from pdb import set_trace as stop
5 |
6 | import zipfile
7 | import gdown
8 | from tqdm import tqdm
9 | import pandas as pd
10 | import gym
11 | from torch.utils.data import Dataset, DataLoader
12 |
13 | import numpy as np
14 | #from sklearn.model_selection import train_test_split # Unused import
15 | import torch
16 | import torch.optim as optim
17 | import torch.nn as nn
18 | from torch.utils.tensorboard import SummaryWriter
19 |
20 | from src.model_factory import get_model
21 | from src.utils import set_seed
22 | from src.loops import evaluate
23 | from src.q_agent import QAgent
24 | from src.config import DATA_SUPERVISED_ML, SAVED_AGENTS_DIR, TENSORBOARD_LOG_DIR
25 |
26 |
27 | global_train_step = 0
28 | global_val_step = 0
29 |
30 |
31 | def download_agent_parameters() -> Path:
32 | """
33 | Downloads the parameters and hyper-parameters of an agent that I trained on my machine.
34 | Returns the path to the unzipped folder.
35 | """
36 | # download .zip file from public google drive
37 | # url = 'https://docs.google.com/uc?export=download&id=1KH4ANx84PMmCY6H4FoUnkBLVC1z1A6W6'
38 | url = 'https://docs.google.com/uc?export=download&id=1ZdyAuzY-0VYfyNrg0a7gHd5TOX-GadJJ'
39 | output = SAVED_AGENTS_DIR / 'CartPole-v1' / 'gdrive_agent.zip'
40 | gdown.download(url, str(output))
41 |
42 | # unzip it
43 | with zipfile.ZipFile(str(output), "r") as zip_ref:
44 | zip_ref.extractall(str(SAVED_AGENTS_DIR / 'CartPole-v1'))
45 |
46 | return SAVED_AGENTS_DIR / 'CartPole-v1' / '298'
47 |
48 |
49 | def simulate_episode(env, agent) -> List[Dict]:
50 | """
51 | We let the agent interact with the environment and return a list of collected
52 | states and actions
53 | """
54 | done = False
55 | state = env.reset()
56 | samples = []
57 | while not done:
58 |
59 | action = agent.act(state, epsilon=0.0)
60 | samples.append({
61 | 's0': state[0],
62 | 's1': state[1],
63 | 's2': state[2],
64 | 's3': state[3],
65 | 'action': action
66 | })
67 | state, reward, done, info = env.step(action)
68 |
69 | return samples
70 |
71 |
72 | def generate_state_action_data(
73 | env: gym.Env,
74 | agent: QAgent,
75 | n_samples: int,
76 | path: Path
77 | ) -> None:
78 | """
79 | We let the agent interact with the environment until we have collected
80 | n_samples of data.
81 | Then we save the data as a csv file with columns: s0, s1, s2, s3, action
82 | """
83 | samples = []
84 | with tqdm(total=n_samples) as pbar:
85 | while len(samples) < n_samples:
86 | new_samples = simulate_episode(env, agent)
87 | pbar.update(len(new_samples))
88 | samples += new_samples
89 |
90 | # save dataframe to csv file
91 | pd.DataFrame(samples).to_csv(path, index=False)
92 |
93 |
94 | class OptimalPolicyDataset(Dataset):
95 | """
96 |     PyTorch custom dataset that wraps the pandas dataframe and feeds the
97 |     DataLoader when we train the model.
98 | """
99 | def __init__(self, X: pd.DataFrame, y: pd.Series):
100 | self.X = X
101 | self.y = y
102 |
103 | def __len__(self):
104 | return len(self.X)
105 |
106 | def __getitem__(self, idx):
107 | return self.X.iloc[idx].values, self.y.iloc[idx]
108 |
109 |
110 | def get_tensorboard_writer(run_name: str):
111 |
112 | from torch.utils.tensorboard import SummaryWriter
113 | from src.config import TENSORBOARD_LOG_DIR
114 | tensorboard_writer = SummaryWriter(TENSORBOARD_LOG_DIR / 'sml' / run_name)
115 | return tensorboard_writer
116 |
117 |
118 | def get_train_val_loop(
119 | model: nn.Module,
120 | criterion,
121 | optimizer,
122 | tensorboard_writer,
123 | ):
124 | global global_train_step, global_val_step
125 | global_train_step = 0
126 | global_val_step = 0
127 |
128 | def train_val_loop(
129 | is_train: bool,
130 | dataloader: DataLoader,
131 | epoch: int,
132 | ):
133 |         """Runs one pass over the dataloader: training when is_train is True, validation otherwise."""
134 | global global_train_step, global_val_step
135 |
136 | n_batches = 0
137 | running_loss = 0
138 | n_samples = 0
139 | n_correct_predictions = 0
140 |
141 | pbar = tqdm(dataloader)
142 | for data in pbar:
143 |
144 | # extract batch of features and target values (aka labels)
145 | inputs, labels = data
146 |
147 | if is_train:
148 | # zero the parameter gradients
149 | optimizer.zero_grad()
150 |
151 | # forward
152 | outputs = model(inputs.float())
153 | loss = criterion(outputs, labels)
154 |
155 | if is_train:
156 | # backward + optimize
157 | loss.backward()
158 | optimizer.step()
159 |
160 | predicted_labels = torch.argmax(outputs, 1)
161 | batch_accuracy = (predicted_labels == labels).numpy().mean()
162 |
163 | n_batches += 1
164 | running_loss += loss.item()
165 | avg_loss = running_loss / n_batches
166 |
167 | n_correct_predictions += (predicted_labels == labels).numpy().sum()
168 | n_samples += len(labels)
169 | avg_accuracy = n_correct_predictions / n_samples
170 | pbar.set_description(f'Epoch {epoch} - loss: {avg_loss:.4f} - accuracy: {avg_accuracy:.4f}')
171 |
172 | # log to tensorboard
173 | if is_train:
174 | global_train_step += 1
175 | tensorboard_writer.add_scalar('train/loss', loss.item(), global_train_step)
176 | tensorboard_writer.add_scalar('train/accuracy', batch_accuracy, global_train_step)
177 | # print('sent logs to TB')
178 | else:
179 | global_val_step += 1
180 | tensorboard_writer.add_scalar('val/loss', loss.item(), global_val_step)
181 | tensorboard_writer.add_scalar('val/accuracy', batch_accuracy, global_val_step)
182 |
183 | return train_val_loop
184 |
185 |
186 |
187 | def run(
188 | n_samples_train: int,
189 | n_samples_test: int,
190 | hidden_layers: Union[Tuple[int], None],
191 | n_epochs: int,
192 | ):
193 | env = gym.make('CartPole-v1')
194 |
195 | print('Downloading agent data from GDrive...')
196 | path_to_agent_data = download_agent_parameters()
197 | agent = QAgent.load_from_disk(env, path=path_to_agent_data)
198 |
199 | set_seed(env, 1234)
200 | print('Sanity checking that our agent is really that good...')
201 | rewards, steps = evaluate(agent, env, n_episodes=100, epsilon=0.0)
202 | print('Avg reward evaluation: ', np.mean(rewards))
203 |
204 | print('Generating train data for our supervised ML problem...')
205 | path_to_train_data = DATA_SUPERVISED_ML / 'train.csv'
206 | env.seed(0)
207 | generate_state_action_data(env, agent, n_samples=n_samples_train, path=path_to_train_data)
208 |
209 | print('Generating test data for our supervised ML problem...')
210 | path_to_test_data = DATA_SUPERVISED_ML / 'test.csv'
211 | env.seed(1)
212 | generate_state_action_data(env, agent, n_samples=n_samples_test, path=path_to_test_data)
213 |
214 | # load data from disk
215 | print('Loading CSV files into dataframes...')
216 | train_data = pd.read_csv(path_to_train_data)
217 | test_data = pd.read_csv(path_to_test_data)
218 |
219 | # split features and labels
220 | X_train = train_data[['s0', 's1', 's2', 's3']]
221 | y_train = train_data['action']
222 | X_test = test_data[['s0', 's1', 's2', 's3']]
223 | y_test = test_data['action']
224 |
225 | # PyTorch datasets
226 | train_dataset = OptimalPolicyDataset(X_train, y_train)
227 | test_dataset = OptimalPolicyDataset(X_test, y_test)
228 |
229 | batch_size = 64
230 |
231 | # PyTorch dataloaders
232 | train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
233 | test_dataloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
234 |
235 | # model architecture
236 | model = get_model(input_dim=4, output_dim=2, hidden_layers=hidden_layers)
237 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
238 | model.to(device)
239 |
240 | # loss function
241 | criterion = nn.CrossEntropyLoss()
242 |
243 | # optimization method
244 |     optimizer = optim.Adam(model.parameters())  # optionally pass lr=3e-4
245 |
246 | import time
247 | ts = int(time.time())
248 | tensorboard_writer = SummaryWriter(TENSORBOARD_LOG_DIR / 'sml' / str(ts))
249 | train_val_loop = get_train_val_loop(model, criterion, optimizer, tensorboard_writer)
250 |
251 | # training loop, with evaluation at the end of each epoch
252 | # n_epochs = 20
253 | for epoch in range(n_epochs):
254 | # train
255 | train_val_loop(is_train=True, dataloader=train_dataloader, epoch=epoch)
256 |
257 | with torch.no_grad():
258 | # validate
259 | train_val_loop(is_train=False, dataloader=test_dataloader, epoch=epoch)
260 |
261 | print('----------')
262 |
263 |
264 | if __name__ == '__main__':
265 |
266 | parser = ArgumentParser()
267 | parser.add_argument('--n_samples_train', type=int, default=1000)
268 | parser.add_argument('--n_samples_test', type=int, default=1000)
269 | parser.add_argument("--hidden_layers", type=int, nargs="+",)
270 | parser.add_argument('--n_epochs', type=int, default=20)
271 | args = parser.parse_args()
272 |
273 | run(n_samples_train=args.n_samples_train,
274 | n_samples_test=args.n_samples_test,
275 | hidden_layers=args.hidden_layers,
276 | n_epochs=args.n_epochs)
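277 |
278 |
279 | # Usage sketch (illustrative, not part of the original script): from the
280 | # 03_cart_pole folder, with PYTHONPATH="." so the `src` imports resolve,
281 | # a run could look like this (the flag names come from the parser above;
282 | # the values are just an example):
283 | #
284 | #   python src/supervised_ml.py --n_samples_train 10000 --n_samples_test 1000 \
285 | #       --hidden_layers 64 --n_epochs 20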
--------------------------------------------------------------------------------
/03_cart_pole/src/utils.py:
--------------------------------------------------------------------------------
1 | import os
2 | from typing import Callable, Dict, Tuple, List
3 | import pathlib
4 | from pathlib import Path
5 | import json
6 | from pdb import set_trace as stop
7 |
8 | import numpy as np
9 | import gym
10 | import yaml
11 | import torch.nn as nn
12 |
13 |
14 | def snake_to_camel(word):
15 | import re
16 | return ''.join(x.capitalize() or '_' for x in word.split('_'))
17 |
18 |
19 | def get_agent_id(env_name: str) -> str:
20 |     """Returns the next available integer agent id (as a string) for the given environment."""
21 | from src.config import SAVED_AGENTS_DIR
22 |
23 | dir = Path(SAVED_AGENTS_DIR) / env_name
24 | if not dir.exists():
25 | os.makedirs(dir)
26 |
27 | # try:
28 | # agent_id = max([int(id) for id in os.listdir(dir)]) + 1
29 | # except ValueError:
30 | # agent_id = 0
31 |
32 | ids = []
33 | for id in os.listdir(dir):
34 | try:
35 | ids.append(int(id))
36 |         except ValueError:
37 | pass
38 | if len(ids) > 0:
39 | agent_id = max(ids) + 1
40 | else:
41 | agent_id = 0
42 | # stop()
43 |
44 | return str(agent_id)
45 |
46 | def get_input_output_dims(env_name: str) -> Tuple[int, int]:
47 |     """Returns (input_dim, output_dim) for the given environment name."""
48 | if 'MountainCar' in env_name:
49 | input_dim = 2
50 | output_dim = 3
51 | elif 'CartPole' in env_name:
52 | input_dim = 4
53 | output_dim = 2
54 | else:
55 | raise Exception('Invalid environment')
56 |
57 | return input_dim, output_dim
58 |
59 |
60 | def get_epsilon_decay_fn(
61 | eps_start: float,
62 | eps_end: float,
63 | total_episodes: int
64 | ) -> Callable:
65 | """
66 |     Returns a function epsilon_fn(episode) that linearly decays epsilon
67 |     from eps_start to eps_end over total_episodes.
68 | """
69 | def epsilon_fn(episode: int) -> float:
70 | r = max((total_episodes - episode) / total_episodes, 0)
71 | return (eps_start - eps_end)*r + eps_end
72 |
73 | return epsilon_fn
74 |
75 |
76 | def get_epsilon_exponential_decay_fn(
77 | eps_max: float,
78 | eps_min: float,
79 | decay: float,
80 | ) -> Callable:
81 | """
82 |     Returns a function epsilon_fn(episode) that exponentially decays epsilon
83 |     from eps_max towards eps_min with the given decay factor.
84 | """
85 | def epsilon_fn(episode: int) -> float:
86 | return max(eps_min, eps_max * (decay ** episode))
87 | return epsilon_fn
88 |
89 |
90 | def get_success_rate_from_n_steps(env: gym.Env, steps: List[int]):
91 |
92 | import numpy as np
93 | if 'MountainCar' in str(env):
94 | success_rate = np.mean((np.array(steps) < env._max_episode_steps) * 1.0)
95 | elif 'CartPole' in str(env):
96 | success_rate = np.mean((np.array(steps) >= env._max_episode_steps) * 1.0)
97 | else:
98 | raise Exception('Invalid environment name')
99 |
100 | return success_rate
101 |
102 | def get_observation_samples(env: gym.Env, n_samples: int) -> np.ndarray:
103 |     """Collects n_samples observations by taking random actions in the environment."""
104 | samples = []
105 | state = env.reset()
106 | while len(samples) < n_samples:
107 |
108 | samples.append(np.copy(state))
109 | action = env.action_space.sample()
110 | next_state, reward, done, info = env.step(action)
111 |
112 | if done:
113 | state = env.reset()
114 | else:
115 | state = next_state
116 |
117 | return np.array(samples)
118 |
119 |
120 | def set_seed(
121 | env,
122 | seed
123 | ):
124 | """To ensure reproducible runs we fix the seed for different libraries"""
125 | import random
126 | random.seed(seed)
127 |
128 | import numpy as np
129 | np.random.seed(seed)
130 |
131 | env.seed(seed)
132 | env.action_space.seed(seed)
133 |
134 | import torch
135 | torch.manual_seed(seed)
136 |
137 |     # Deterministic operations for CuDNN; this may impact performance
138 | torch.backends.cudnn.deterministic = True
139 | torch.backends.cudnn.benchmark = False
140 |
141 | # env.seed(seed)
142 | # gym.spaces.prng.seed(seed)
143 |
144 |
145 | def get_num_model_parameters(model: nn.Module) -> int:
146 | return sum(p.numel() for p in model.parameters() if p.requires_grad)
147 |
148 |
149 | # from dotenv import dotenv_values
150 | # import uuid
151 | # from pdb import set_trace as stop
152 |
153 | # import pandas as pd
154 | # import git
155 |
156 | # from src.io import get_list_files
157 |
158 |
159 | # def get_project_root() -> Path:
160 | # return Path(__file__).parent.resolve().parent
161 | #
162 | # from typing import Dict
163 | # def load_env_config() -> Dict:
164 | # """
165 | # """
166 | # config = dotenv_values(get_project_root() / ".env")
167 | # return config
168 | #
169 |
170 |
171 |
172 |
173 |
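174 |
175 | # Worked example (illustrative, not part of the original file):
176 | # a linear decay from 1.0 to 0.05 over 1,000 episodes.
177 | #
178 | #   epsilon_fn = get_epsilon_decay_fn(eps_start=1.0, eps_end=0.05, total_episodes=1000)
179 | #   epsilon_fn(0)      # -> 1.0
180 | #   epsilon_fn(500)    # -> 0.525
181 | #   epsilon_fn(1500)   # -> 0.05 (clipped once the episode count exceeds total_episodes)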
--------------------------------------------------------------------------------
/03_cart_pole/src/viz.py:
--------------------------------------------------------------------------------
1 | from time import sleep
2 | from argparse import ArgumentParser
3 | from pdb import set_trace as stop
4 | from typing import Optional
5 |
6 | import pandas as pd
7 | import gym
8 |
9 | from src.config import SAVED_AGENTS_DIR
10 |
11 | import numpy as np
12 |
13 |
14 | def show_video(agent, env, sleep_sec: float = 0.1, seed: Optional[int] = 0, mode: str = "rgb_array"):
15 |
16 | env.seed(seed)
17 | state = env.reset()
18 |
19 | # LAPADULA
20 | if mode == "rgb_array":
21 | from matplotlib import pyplot as plt
22 | from IPython.display import display, clear_output
23 | steps = 0
24 | fig, ax = plt.subplots(figsize=(8, 6))
25 |
26 | done = False
27 | while not done:
28 |
29 | action = agent.act(state, epsilon=0.001)
30 | state, reward, done, info = env.step(action)
31 |
32 | # LAPADULA
33 | if mode == "rgb_array":
34 | steps += 1
35 | frame = env.render(mode=mode)
36 | ax.cla()
37 | ax.axes.yaxis.set_visible(False)
38 | ax.imshow(frame)
39 | ax.set_title(f'Steps: {steps}')
40 | display(fig)
41 | clear_output(wait=True)
42 | plt.pause(sleep_sec)
43 | else:
44 | env.render()
45 | sleep(sleep_sec)
46 |
47 |
48 | if __name__ == '__main__':
49 |
50 | parser = ArgumentParser()
51 | parser.add_argument('--agent_id', type=str, required=True)
52 | parser.add_argument('--sleep_sec', type=float, required=False, default=0.1)
53 | args = parser.parse_args()
54 |
55 | from src.base_agent import BaseAgent
56 |     agent_path = SAVED_AGENTS_DIR / args.agent_id
57 | agent = BaseAgent.load_from_disk(agent_path)
58 |
59 | from src.q_agent import QAgent
60 |
61 |
62 | env = gym.make('CartPole-v1')
63 | # env._max_episode_steps = 1000
64 |
65 | show_video(agent, env, sleep_sec=args.sleep_sec)
66 |
67 |
68 |
69 |
70 |
71 |
72 |
73 |
74 |
--------------------------------------------------------------------------------
/03_cart_pole/tensorboard_logs/.gitignore:
--------------------------------------------------------------------------------
1 | CartPole-v1/
2 |
--------------------------------------------------------------------------------
/03_cart_pole/tensorboard_logs/readme.md:
--------------------------------------------------------------------------------
1 | ### Tensorboard logs for each train run are stored in this folder
2 |
--------------------------------------------------------------------------------
/04_lunar_lander/README.md:
--------------------------------------------------------------------------------
1 |
2 | # Policy Gradients to land on the Moon
3 | > “That's one small step for your gradient ascent, one giant leap for your ML career.”
4 | > -- Pau quoting Neil Armstrong
5 |
6 |
7 | 
8 |
9 | ## Table of Contents
10 | * [Welcome 🤗](#welcome-)
11 | * [Lecture transcripts](#lecture-transcripts)
12 | * [Quick setup](#quick-setup)
13 | * [Notebooks](#notebooks)
14 | * [Let's connect](#lets-connect)
15 |
16 | ## Welcome 🤗
17 |
18 | Today we will learn about Policy Gradient methods, and use them to land on the Moon.
19 |
20 | Ready, set, go!
21 |
22 | ## Lecture transcripts
23 |
24 | [📝 1. Policy gradients](http://datamachines.xyz/2022/05/06/policy-gradients-in-reinforcement-learning-to-land-on-the-moon-hands-on-course/)
25 |
26 | ## Quick setup
27 |
28 | Make sure you have Python >= 3.7. Otherwise, update it.
29 |
30 | 1. Pull the code from GitHub and cd into the `04_lunar_lander` folder:
31 | ```
32 | $ git clone https://github.com/Paulescu/hands-on-rl.git
33 | $ cd hands-on-rl/04_lunar_lander
34 | ```
35 |
36 | 2. Make sure you have the `virtualenv` tool in your Python installation
37 | ```
38 | $ pip3 install virtualenv
39 | ```
40 |
41 | 3. Create a virtual environment and activate it.
42 | ```
43 | $ virtualenv -p python3 venv
44 | $ source venv/bin/activate
45 | ```
46 |
47 | From this point onwards, all commands run inside the virtual environment.
48 |
49 |
50 | 4. Install dependencies and the code from the `src` folder in editable mode, so you can experiment with the code.
51 | ```
52 | $ (venv) pip install -r requirements.txt
53 | $ (venv) export PYTHONPATH="."
54 | ```
55 |
56 | 5. Open the notebooks, either with good old Jupyter or Jupyter Lab
57 | ```
58 | $ (venv) jupyter notebook
59 | ```
60 | ```
61 | $ (venv) jupyter lab
62 | ```
63 | If both launch commands fail, try these:
64 | ```
65 | $ (venv) jupyter notebook --NotebookApp.use_redirect_file=False
66 | ```
67 | ```
68 | $ (venv) jupyter lab --NotebookApp.use_redirect_file=False
69 | ```
70 |
71 | 6. Play and learn. And do the homework 😉.
72 |
73 | ## Notebooks
74 |
75 | - [Random agent baseline](notebooks/01_random_agent_baseline.ipynb)
76 | - [Policy gradients with rewards as weights](notebooks/02_vanilla_policy_gradient_with_rewards_as_weights.ipynb)
77 | - [Policy gradients with rewards-to-go as weights](notebooks/03_vanilla_policy_gradient_with_rewards_to_go_as_weights.ipynb)
78 | - [Homework](notebooks/04_homework.ipynb)
79 |
80 | ## Let's connect!
81 |
82 | Do you wanna become a PRO in Machine Learning?
83 |
84 | 👉🏽 Subscribe to the [datamachines newsletter](https://datamachines.xyz/subscribe/) 🧠
85 |
86 | 👉🏽 Follow me on [Twitter](https://twitter.com/paulabartabajo_) and [LinkedIn](https://www.linkedin.com/in/pau-labarta-bajo-4432074b/) 💡
87 |
--------------------------------------------------------------------------------
/04_lunar_lander/notebooks/04_homework.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "f0fd6807",
6 | "metadata": {},
7 | "source": [
8 | "# 04 Homework 🏋️🏋️🏋️"
9 | ]
10 | },
11 | {
12 | "cell_type": "markdown",
13 | "id": "abcf6613",
14 | "metadata": {},
15 | "source": [
16 | "#### 👉A course without homework is not a course!\n",
17 | "\n",
18 | "#### 👉Spend some time thinking and trying to implement the challenges I propose here.\n",
19 | "\n",
20 | "#### 👉Feel free to email me your solutions at:"
21 | ]
22 | },
23 | {
24 | "cell_type": "markdown",
25 | "id": "d1d983a3",
26 | "metadata": {},
27 | "source": [
28 | "# `plabartabajo@gmail.com`"
29 | ]
30 | },
31 | {
32 | "cell_type": "markdown",
33 | "id": "86f82e45",
34 | "metadata": {},
35 | "source": [
36 | "-----"
37 | ]
38 | },
39 | {
40 | "cell_type": "markdown",
41 | "id": "67656662",
42 | "metadata": {},
43 | "source": [
44 | "## 1. Can you find a smaller network that solves this environment?\n",
45 | "\n",
46 | "I used one hidden layer with 64 units, but I have the feeling this was an overkill."
47 | ]
48 | },
49 | {
50 | "cell_type": "markdown",
51 | "id": "c0a46bf7",
52 | "metadata": {},
53 | "source": [
54 | "## 2. Can you speed up converge by properly tunning the `batch_size`?"
55 | ]
56 | }
57 | ],
58 | "metadata": {
59 | "kernelspec": {
60 | "display_name": "Python 3 (ipykernel)",
61 | "language": "python",
62 | "name": "python3"
63 | },
64 | "language_info": {
65 | "codemirror_mode": {
66 | "name": "ipython",
67 | "version": 3
68 | },
69 | "file_extension": ".py",
70 | "mimetype": "text/x-python",
71 | "name": "python",
72 | "nbconvert_exporter": "python",
73 | "pygments_lexer": "ipython3",
74 | "version": "3.8.10"
75 | }
76 | },
77 | "nbformat": 4,
78 | "nbformat_minor": 5
79 | }
80 |
--------------------------------------------------------------------------------
/04_lunar_lander/pyproject.toml:
--------------------------------------------------------------------------------
1 | [tool.poetry]
2 | name = "src"
3 | version = "0.1.0"
4 | description = ""
5 | authors = ["Pau "]
6 |
7 | [tool.poetry.dependencies]
8 | python = ">=3.8,<3.11"
9 | numpy = "^1.22.3"
10 | torch = "^1.11.0"
11 | scipy = "^1.8.0"
12 | Box2D = "^2.3.10"
13 | box2d-py = "^2.3.8"
14 | gym = "0.17.2"
15 | tensorboard = "^2.8.0"
16 | tqdm = "^4.64.0"
17 | jupyter = "^1.0.0"
18 | matplotlib = "^3.5.1"
19 | pandas = "^1.4.2"
20 | pyglet = "1.5.0"
21 |
22 | [tool.poetry.dev-dependencies]
23 | pytest = "^5.2"
24 |
25 | [build-system]
26 | requires = ["poetry-core>=1.0.0"]
27 | build-backend = "poetry.core.masonry.api"
28 |
--------------------------------------------------------------------------------
/04_lunar_lander/requirements.txt:
--------------------------------------------------------------------------------
1 | absl-py==1.0.0; python_version >= "3.6"
2 | appnope==0.1.3; platform_system == "Darwin" and python_version >= "3.8" and sys_platform == "darwin"
3 | argon2-cffi-bindings==21.2.0; python_version >= "3.6"
4 | argon2-cffi==21.3.0; python_version >= "3.6"
5 | asttokens==2.0.5; python_version >= "3.8"
6 | atomicwrites==1.4.0; python_version >= "3.5" and python_full_version < "3.0.0" and sys_platform == "win32" or sys_platform == "win32" and python_version >= "3.5" and python_full_version >= "3.4.0"
7 | attrs==21.4.0; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.7"
8 | backcall==0.2.0; python_version >= "3.8"
9 | beautifulsoup4==4.11.1; python_full_version >= "3.6.0" and python_version >= "3.7"
10 | bleach==5.0.0; python_version >= "3.7"
11 | box2d-py==2.3.8
12 | box2d==2.3.10
13 | cachetools==5.0.0; python_version >= "3.7" and python_version < "4.0" and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6")
14 | certifi==2021.10.8; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6"
15 | cffi==1.15.0; implementation_name == "pypy" and python_version >= "3.6"
16 | charset-normalizer==2.0.12; python_full_version >= "3.6.0" and python_version >= "3.6"
17 | cloudpickle==1.2.2; python_version >= "3.5"
18 | colorama==0.4.4; python_version >= "3.8" and python_full_version < "3.0.0" and platform_system == "Windows" and sys_platform == "win32" or python_full_version >= "3.5.0" and platform_system == "Windows" and sys_platform == "win32" and python_version >= "3.8"
19 | cycler==0.11.0; python_version >= "3.7"
20 | debugpy==1.6.0; python_version >= "3.7"
21 | decorator==5.1.1; python_version >= "3.8"
22 | defusedxml==0.7.1; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.7"
23 | entrypoints==0.4; python_version >= "3.7"
24 | executing==0.8.3; python_version >= "3.8"
25 | fastjsonschema==2.15.3; python_version >= "3.7"
26 | fonttools==4.32.0; python_version >= "3.7"
27 | future==0.18.2; python_version >= "3.5" and python_full_version < "3.0.0" or python_full_version >= "3.3.0" and python_version >= "3.5"
28 | google-auth-oauthlib==0.4.6; python_version >= "3.6"
29 | google-auth==2.6.3; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6"
30 | grpcio==1.45.0; python_version >= "3.6"
31 | gym==0.17.2; python_version >= "3.5"
32 | idna==3.3; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6"
33 | importlib-metadata==4.11.3; python_version < "3.10" and python_version >= "3.7"
34 | importlib-resources==5.7.1; python_version < "3.9" and python_version >= "3.7"
35 | ipykernel==6.13.0; python_version >= "3.7"
36 | ipython-genutils==0.2.0; python_version >= "3.7"
37 | ipython==8.2.0; python_version >= "3.8"
38 | ipywidgets==7.7.0
39 | jedi==0.18.1; python_version >= "3.8"
40 | jinja2==3.1.1; python_version >= "3.7"
41 | jsonschema==4.4.0; python_version >= "3.7"
42 | jupyter-client==7.2.2; python_full_version >= "3.7.0" and python_version >= "3.7"
43 | jupyter-console==6.4.3; python_version >= "3.6"
44 | jupyter-core==4.10.0; python_version >= "3.7"
45 | jupyter==1.0.0
46 | jupyterlab-pygments==0.2.2; python_version >= "3.7"
47 | jupyterlab-widgets==1.1.0; python_version >= "3.6"
48 | kiwisolver==1.4.2; python_version >= "3.7"
49 | markdown==3.3.6; python_version >= "3.6"
50 | markupsafe==2.1.1; python_version >= "3.7"
51 | matplotlib-inline==0.1.3; python_version >= "3.8"
52 | matplotlib==3.5.1; python_version >= "3.7"
53 | mistune==0.8.4; python_version >= "3.7"
54 | more-itertools==8.12.0; python_version >= "3.5"
55 | nbclient==0.6.0; python_full_version >= "3.7.0" and python_version >= "3.7"
56 | nbconvert==6.5.0; python_version >= "3.7"
57 | nbformat==5.3.0; python_full_version >= "3.7.0" and python_version >= "3.7"
58 | nest-asyncio==1.5.5; python_full_version >= "3.7.0" and python_version >= "3.7"
59 | notebook==6.4.10; python_version >= "3.6"
60 | numpy==1.22.3; python_version >= "3.8"
61 | oauthlib==3.2.0; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
62 | packaging==21.3; python_version >= "3.7"
63 | pandas==1.4.2; python_version >= "3.8"
64 | pandocfilters==1.5.0; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.7"
65 | parso==0.8.3; python_version >= "3.8"
66 | pexpect==4.8.0; sys_platform != "win32" and python_version >= "3.8"
67 | pickleshare==0.7.5; python_version >= "3.8"
68 | pillow==9.0.1; python_version >= "3.7"
69 | pluggy==0.13.1; python_version >= "3.5" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.5"
70 | prometheus-client==0.14.1; python_version >= "3.6"
71 | prompt-toolkit==3.0.29; python_full_version >= "3.6.2" and python_version >= "3.8"
72 | protobuf==3.20.0; python_version >= "3.7"
73 | psutil==5.9.0; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.7"
74 | ptyprocess==0.7.0; os_name != "nt" and python_version >= "3.8" and sys_platform != "win32"
75 | pure-eval==0.2.2; python_version >= "3.8"
76 | py==1.11.0; python_version >= "3.6" and python_full_version < "3.0.0" and implementation_name == "pypy" or python_full_version >= "3.5.0" and python_version >= "3.6" and implementation_name == "pypy"
77 | pyasn1-modules==0.2.8; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6"
78 | pyasn1==0.4.8; python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "4" and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6") or python_full_version >= "3.6.0" and python_version >= "3.6" and python_version < "4" and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6")
79 | pycparser==2.21; python_version >= "3.6" and python_full_version < "3.0.0" and implementation_name == "pypy" or implementation_name == "pypy" and python_version >= "3.6" and python_full_version >= "3.4.0"
80 | pyglet==1.5.0
81 | pygments==2.11.2; python_version >= "3.8"
82 | pyparsing==3.0.7; python_version >= "3.7"
83 | pyrsistent==0.18.1; python_version >= "3.7"
84 | pytest==5.4.3; python_version >= "3.5"
85 | python-dateutil==2.8.2; python_version >= "3.8" and python_full_version < "3.0.0" or python_full_version >= "3.3.0" and python_version >= "3.8"
86 | pytz==2022.1; python_version >= "3.8"
87 | pywin32==303; sys_platform == "win32" and platform_python_implementation != "PyPy" and python_version >= "3.7"
88 | pywinpty==2.0.5; os_name == "nt" and python_version >= "3.7"
89 | pyzmq==22.3.0; python_version >= "3.7"
90 | qtconsole==5.3.0; python_version >= "3.7"
91 | qtpy==2.0.1; python_version >= "3.7"
92 | requests-oauthlib==1.3.1; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
93 | requests==2.27.1; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6"
94 | rsa==4.8; python_version >= "3.6" and python_version < "4" and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6")
95 | scipy==1.8.0; python_version >= "3.8" and python_version < "3.11"
96 | send2trash==1.8.0; python_version >= "3.6"
97 | setuptools-scm==6.4.2; python_version >= "3.7"
98 | six==1.16.0; python_version >= "3.8" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.8"
99 | soupsieve==2.3.2.post1; python_full_version >= "3.6.0" and python_version >= "3.7"
100 | stack-data==0.2.0; python_version >= "3.8"
101 | tensorboard-data-server==0.6.1; python_version >= "3.6"
102 | tensorboard-plugin-wit==1.8.1; python_version >= "3.6"
103 | tensorboard==2.8.0; python_version >= "3.6"
104 | terminado==0.13.3; python_version >= "3.7"
105 | tinycss2==1.1.1; python_version >= "3.7"
106 | tomli==2.0.1; python_version >= "3.7"
107 | torch==1.11.0; python_full_version >= "3.7.0"
108 | tornado==6.1; python_version >= "3.7"
109 | tqdm==4.64.0; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.4.0")
110 | traitlets==5.1.1; python_full_version >= "3.7.0" and python_version >= "3.8"
111 | typing-extensions==4.1.1; python_version >= "3.6" and python_full_version >= "3.7.0"
112 | urllib3==1.26.9; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version < "4" and python_version >= "3.6"
113 | wcwidth==0.2.5; python_full_version >= "3.6.2" and python_version >= "3.6"
114 | webencodings==0.5.1; python_version >= "3.7"
115 | werkzeug==2.1.1; python_version >= "3.7"
116 | widgetsnbextension==3.6.0
117 | zipp==3.8.0; python_version < "3.9" and python_version >= "3.7"
118 |
--------------------------------------------------------------------------------
/04_lunar_lander/saved_agents/readme.md:
--------------------------------------------------------------------------------
1 | ### Trained agents are saved in this folder
--------------------------------------------------------------------------------
/04_lunar_lander/src/config.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pathlib
3 | root_dir = pathlib.Path(__file__).parent.resolve().parent
4 |
5 | SAVED_AGENTS_DIR = root_dir / 'saved_agents'
6 | TENSORBOARD_LOG_DIR = root_dir / 'tensorboard_logs'
7 | # OPTUNA_DB = root_dir / 'optuna.db'
8 | # DATA_SUPERVISED_ML = root_dir / 'data_supervised_ml'
9 | # MLFLOW_RUNS_DIR = root_dir / 'mlflow_runs'
10 |
11 | if not SAVED_AGENTS_DIR.exists():
12 | os.makedirs(SAVED_AGENTS_DIR)
13 |
14 | if not TENSORBOARD_LOG_DIR.exists():
15 | os.makedirs(TENSORBOARD_LOG_DIR)
16 |
17 | # if not DATA_SUPERVISED_ML.exists():
18 | # os.makedirs(DATA_SUPERVISED_ML)
19 | #
20 | # if not MLFLOW_RUNS_DIR.exists():
21 | # os.makedirs(MLFLOW_RUNS_DIR)
--------------------------------------------------------------------------------
/04_lunar_lander/src/evaluation.py:
--------------------------------------------------------------------------------
1 | from typing import Optional, Tuple, List
2 | from tqdm import tqdm
3 |
4 | import torch
5 |
6 |
7 | def evaluate(
8 | agent,
9 | env,
10 | n_episodes: int,
11 | seed: Optional[int] = 0,
12 | ) -> Tuple[List, List]:
13 |
14 | # from src.utils import set_seed
15 | # set_seed(env, seed)
16 |
17 | # output metrics
18 | reward_per_episode = []
19 | steps_per_episode = []
20 |
21 | for i in tqdm(range(0, n_episodes)):
22 |
23 | state = env.reset()
24 | rewards = 0
25 | steps = 0
26 | done = False
27 | while not done:
28 |
29 | action = agent.act(torch.as_tensor(state, dtype=torch.float32))
30 |
31 | next_state, reward, done, info = env.step(action)
32 |
33 | rewards += reward
34 | steps += 1
35 | state = next_state
36 |
37 | reward_per_episode.append(rewards)
38 | steps_per_episode.append(steps)
39 |
40 | return reward_per_episode, steps_per_episode
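41 |
42 |
43 | # Usage sketch (illustrative, not part of the original file): evaluate a trained
44 | # agent (any object with an `act(state)` method, e.g. a VPGAgent) on LunarLander-v2.
45 | #
46 | #   import gym
47 | #   env = gym.make('LunarLander-v2')
48 | #   rewards, steps = evaluate(agent, env, n_episodes=100)
49 | #   print(sum(rewards) / len(rewards))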
--------------------------------------------------------------------------------
/04_lunar_lander/src/model_factory.py:
--------------------------------------------------------------------------------
1 | from typing import Optional, List
2 | from pdb import set_trace as stop
3 |
4 | import torch.nn as nn
5 |
6 |
7 | def get_model(
8 | input_dim: int,
9 | output_dim: int,
10 | hidden_layers: Optional[List[int]] = None,
11 | ):
12 | """
13 |     Feed-forward network made of linear layers with ReLU activation functions.
14 |     The number and size of the hidden layers are given by `hidden_layers`.
15 | """
16 | # assert init_method in {'default', 'xavier'}
17 |
18 | if hidden_layers is None:
19 | # linear model
20 | model = nn.Sequential(nn.Linear(input_dim, output_dim))
21 |
22 | else:
23 | # neural network
24 | # there are hidden layers in this case.
25 | dims = [input_dim] + hidden_layers + [output_dim]
26 | modules = []
27 | for i, dim in enumerate(dims[:-2]):
28 | modules.append(nn.Linear(dims[i], dims[i + 1]))
29 | modules.append(nn.ReLU())
30 |
31 | modules.append(nn.Linear(dims[-2], dims[-1]))
32 | model = nn.Sequential(*modules)
33 | # stop()
34 |
35 | # n_parameters = sum(p.numel() for p in model.parameters() if p.requires_grad)
36 | # print(f'{n_parameters:,} parameters')
37 |
38 | return model
39 |
40 | def count_parameters(model: nn.Module) -> int:
41 |     """Returns the number of trainable parameters in the model."""
42 | return sum(p.numel() for p in model.parameters() if p.requires_grad)
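43 |
44 |
45 | # Worked example (illustrative, not part of the original file):
46 | #   get_model(input_dim=8, output_dim=4, hidden_layers=[64])
47 | # builds Linear(8, 64) -> ReLU -> Linear(64, 4), i.e. one hidden layer of 64 units,
48 | # while hidden_layers=None builds a single Linear(8, 4) layer (a linear model).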
--------------------------------------------------------------------------------
/04_lunar_lander/src/utils.py:
--------------------------------------------------------------------------------
1 | import os
2 | from typing import Callable, Dict, Tuple, List
3 | import pathlib
4 | from pathlib import Path
5 | import json
6 | from pdb import set_trace as stop
7 |
8 | import torch.nn as nn
9 | from torch.utils.tensorboard import SummaryWriter
10 |
11 | from src.config import TENSORBOARD_LOG_DIR, SAVED_AGENTS_DIR
12 |
13 |
14 | # def snake_to_camel(word):
15 | # import re
16 | # return ''.join(x.capitalize() or '_' for x in word.split('_'))
17 |
18 | def get_agent_id(env_name: str) -> str:
19 |     """Returns the next available integer agent id (as a string) for the given environment."""
20 | dir = Path(SAVED_AGENTS_DIR) / env_name
21 | if not dir.exists():
22 | os.makedirs(dir)
23 |
24 | ids = []
25 | for id in os.listdir(dir):
26 | try:
27 | ids.append(int(id))
28 |         except ValueError:
29 | pass
30 | if len(ids) > 0:
31 | agent_id = max(ids) + 1
32 | else:
33 | agent_id = 0
34 | # stop()
35 |
36 | return str(agent_id)
37 |
38 |
39 | def set_seed(
40 | env,
41 | seed
42 | ):
43 | """To ensure reproducible runs we fix the seed for different libraries"""
44 | import random
45 | random.seed(seed)
46 |
47 | import numpy as np
48 | np.random.seed(seed)
49 |
50 | env.seed(seed)
51 | env.action_space.seed(seed)
52 |
53 | import torch
54 | torch.manual_seed(seed)
55 |
56 |     # Deterministic operations for CuDNN; this may impact performance
57 | torch.backends.cudnn.deterministic = True
58 | torch.backends.cudnn.benchmark = False
59 |
60 | def get_num_model_parameters(model: nn.Module) -> int:
61 | return sum(p.numel() for p in model.parameters() if p.requires_grad)
62 |
63 | def get_logger(env_name: str, agent_id: str) -> SummaryWriter:
64 | return SummaryWriter(TENSORBOARD_LOG_DIR / env_name / agent_id)
65 |
66 | def get_model_path(env_name: str, agent_id: str) -> Path:
67 | """
68 | Returns path where we save train artifacts, including:
69 | -> the policy network weights
70 | -> json with hyperparameters
71 | """
72 | return SAVED_AGENTS_DIR / env_name / agent_id
73 |
74 |
75 |
--------------------------------------------------------------------------------
/04_lunar_lander/src/viz.py:
--------------------------------------------------------------------------------
1 | from time import sleep
2 | from argparse import ArgumentParser
3 | from pdb import set_trace as stop
4 | from typing import Optional
5 |
6 | import pandas as pd
7 | import gym
8 |
9 | from src.config import SAVED_AGENTS_DIR
10 |
11 | import numpy as np
12 |
13 | def make_video(agent):
14 |
15 | import gym
16 | from gym.wrappers import Monitor
17 | env = Monitor(gym.make('CartPole-v0'), './video', force=True)
18 |
19 | state = env.reset()
20 | done = False
21 |
22 | while not done:
23 | # action = env.action_space.sample()
24 | import torch
25 | action = agent.act(torch.as_tensor(state, dtype=torch.float32))
26 | state_next, reward, done, info = env.step(action)
27 | env.close()
28 |
29 |
30 | def show_video(
31 | agent,
32 | env,
33 | sleep_sec: float = 0.1,
34 | seed: Optional[int] = 0,
35 | mode: str = "rgb_array"
36 | ):
37 |
38 | env.seed(seed)
39 | state = env.reset()
40 |
41 | # LAPADULA
42 | if mode == "rgb_array":
43 | from matplotlib import pyplot as plt
44 | from IPython.display import display, clear_output
45 | steps = 0
46 | fig, ax = plt.subplots(figsize=(8, 6))
47 |
48 | done = False
49 | while not done:
50 |
51 | import torch
52 | action = agent.act(torch.as_tensor(state, dtype=torch.float32))
53 |
54 | state, reward, done, info = env.step(action)
55 |
56 | # LAPADULA
57 | if mode == "rgb_array":
58 | steps += 1
59 | frame = env.render(mode=mode)
60 | ax.cla()
61 | ax.axes.yaxis.set_visible(False)
62 | ax.imshow(frame)
63 | ax.set_title(f'Steps: {steps}')
64 | display(fig)
65 | clear_output(wait=True)
66 | plt.pause(sleep_sec)
67 | else:
68 | env.render()
69 | sleep(sleep_sec)
70 |
71 |
72 | if __name__ == '__main__':
73 |
74 | parser = ArgumentParser()
75 | parser.add_argument('--agent_id', type=str, required=True)
76 | parser.add_argument('--sleep_sec', type=float, required=False, default=0.1)
77 | args = parser.parse_args()
78 |
79 | from src.base_agent import BaseAgent
80 |     agent_path = SAVED_AGENTS_DIR / args.agent_id
81 | agent = BaseAgent.load_from_disk(agent_path)
82 |
83 | from src.q_agent import QAgent
84 |
85 |
86 | env = gym.make('CartPole-v1')
87 | # env._max_episode_steps = 1000
88 |
89 | show_video(agent, env, sleep_sec=args.sleep_sec)
90 |
91 |
92 |
93 |
94 |
95 |
96 |
97 |
98 |
--------------------------------------------------------------------------------
/04_lunar_lander/src/vpg_agent.py:
--------------------------------------------------------------------------------
1 | import os
2 | from typing import List, Optional, Tuple
3 | from pathlib import Path
4 | import json
5 | from pdb import set_trace as stop
6 |
7 | from tqdm import tqdm
8 | import numpy as np
9 | import torch
10 | from torch.optim import Adam
11 | from torch.utils.tensorboard import SummaryWriter
12 | from torch.distributions.categorical import Categorical
13 | import gym
14 |
15 | from src.model_factory import get_model
16 | from src.utils import (
17 | get_agent_id,
18 | set_seed,
19 | get_num_model_parameters,
20 | get_logger, get_model_path
21 | )
22 | from src.config import TENSORBOARD_LOG_DIR, SAVED_AGENTS_DIR
23 |
24 | def reward_to_go(rews):
25 |
26 | n = len(rews)
27 | rtgs = np.zeros_like(rews)
28 | for i in reversed(range(n)):
29 | rtgs[i] = rews[i] + (rtgs[i+1] if i+1 < n else 0)
30 | return rtgs
31 |
32 |
33 | class VPGAgent:
34 |
35 | def __init__(
36 | self,
37 | env_name: str = 'LunarLander-v2',
38 | learning_rate: float = 3e-4,
39 | hidden_layers: List[int] = [32],
40 | gradient_weights: str = 'rewards'
41 | ):
42 | assert gradient_weights in {'rewards', 'rewards-to-go'}
43 |
44 | self.env_name = env_name
45 | self.env = gym.make(env_name)
46 | self.obs_dim = self.env.observation_space.shape[0]
47 | self.act_dim = self.env.action_space.n
48 |
49 | # stochastic policy network
50 | # the outputs of this network are un-normalized probabilities for each
51 | # action (aka logits)
52 | self.policy_net = get_model(input_dim=self.obs_dim,
53 | output_dim=self.act_dim,
54 | hidden_layers=hidden_layers)
55 | print(f'Policy network with {get_num_model_parameters(self.policy_net):,} parameters')
56 | print(self.policy_net)
57 |
58 |
59 | self.optimizer = Adam(self.policy_net.parameters(), lr=learning_rate)
60 |
61 | self.gradient_weights = gradient_weights
62 |
63 | self.hparams = {
64 | 'learning_rate': learning_rate,
65 | 'hidden_layers': hidden_layers,
66 | 'gradient_weights': gradient_weights,
67 |
68 | }
69 |
70 | def act(self, obs: torch.Tensor):
71 | """
72 | Action selection function (outputs int actions, sampled from policy)
73 | """
74 | return self._get_policy(obs).sample().item()
75 |
76 | def train(
77 | self,
78 | n_policy_updates: int = 1000,
79 | batch_size: int = 4000,
80 | logger: Optional[SummaryWriter] = None,
81 | model_path: Optional[Path] = None,
82 | seed: Optional[int] = 0,
83 | freq_eval: Optional[int] = 10,
84 | ):
85 |         """Runs n_policy_updates rounds of trajectory collection, each followed by one
86 |         policy-gradient update; periodically evaluates the agent and saves the best model."""
87 | total_steps = 0
88 | save_model = True if model_path is not None else False
89 |
90 | best_avg_reward = -np.inf
91 |
92 | # fix seeds to ensure reproducible training runs
93 | set_seed(self.env, seed)
94 |
95 | for i in range(n_policy_updates):
96 |
97 | # use current policy to collect trajectories
98 | states, actions, weights, rewards = self._collect_trajectories(n_samples=batch_size)
99 |
100 | # one step of gradient ascent to update policy parameters
101 | loss = self._update_parameters(states, actions, weights)
102 |
103 | # log epoch metrics
104 | print('epoch: %3d \t loss: %.3f \t reward: %.3f' % (i, loss, np.mean(rewards)))
105 | if logger is not None:
106 |                 # we use total_steps instead of epoch to make all plots in TensorBoard comparable.
107 |                 # Agents with different batch_size (aka steps_per_epoch) are fairly compared this way.
108 | total_steps += batch_size
109 | logger.add_scalar('train/loss', loss, total_steps)
110 | logger.add_scalar('train/episode_reward', np.mean(rewards), total_steps)
111 |
112 | # evaluate the agent on a fixed set of 100 episodes
113 | if (i + 1) % freq_eval == 0:
114 | rewards, success = self.evaluate(n_episodes=100)
115 |
116 | avg_reward = np.mean(rewards)
117 | avg_success_rate = np.mean(success)
118 | if save_model and (avg_reward > best_avg_reward):
119 | self.save_to_disk(model_path)
120 | print(f'Best model! Average reward = {avg_reward:.2f}, Success rate = {avg_success_rate:.2%}')
121 |
122 | best_avg_reward = avg_reward
123 |
124 | def evaluate(self, n_episodes: Optional[int] = 100, seed: Optional[int] = 1234) -> Tuple[List[float], List[float]]:
125 |         """Evaluates the current policy on n_episodes episodes and returns, per episode,
126 |         the total reward and a success flag (final reward > 0)."""
127 | # output metrics
128 | reward_per_episode = []
129 | success_per_episode = []
130 |
131 | # fix seed
132 | self.env.seed(seed)
133 | self.env.action_space.seed(seed)
134 |
135 | for i in tqdm(range(0, n_episodes)):
136 |
137 | state = self.env.reset()
138 | rewards = 0
139 | done = False
140 | reward = None
141 | while not done:
142 |
143 | action = self.act(torch.as_tensor(state, dtype=torch.float32))
144 |
145 | next_state, reward, done, info = self.env.step(action)
146 | rewards += reward
147 |
148 | state = next_state
149 |
150 | reward_per_episode.append(rewards)
151 | success_per_episode.append(1 if reward > 0 else 0)
152 |
153 | return reward_per_episode, success_per_episode
154 |
155 | def _collect_trajectories(self, n_samples: int):
156 |
157 | # make some empty lists for logging.
158 | batch_obs = [] # for observations
159 | batch_acts = [] # for actions
160 | batch_weights = [] # for reward-to-go weighting in policy gradient
161 | batch_rets = [] # for measuring episode returns
162 | batch_lens = [] # for measuring episode lengths
163 |
164 | # reset episode-specific variables
165 | obs = self.env.reset() # first obs comes from starting distribution
166 | done = False # signal from environment that episode is over
167 | ep_rews = [] # list for rewards accrued throughout ep
168 |
169 | # collect experience by acting in the environment with current policy
170 | while True:
171 |
172 | # save obs
173 | batch_obs.append(obs.copy())
174 |
175 | # act in the environment
176 | # act = get_action(torch.as_tensor(obs, dtype=torch.float32))
177 | action = self.act(torch.as_tensor(obs, dtype=torch.float32))
178 | obs, rew, done, _ = self.env.step(action)
179 |
180 | # save action, reward
181 | batch_acts.append(action)
182 | ep_rews.append(rew)
183 |
184 | if done:
185 | # if episode is over, record info about episode
186 | ep_ret, ep_len = sum(ep_rews), len(ep_rews)
187 | batch_rets.append(ep_ret)
188 | batch_lens.append(ep_len)
189 |
190 |                 # compute the weight for each logprob(a_t|s_t)
191 | if self.gradient_weights == 'rewards':
192 | # the weight for each logprob(a|s) is the total reward for the episode
193 | batch_weights += [ep_ret] * ep_len
194 | elif self.gradient_weights == 'rewards-to-go':
195 | # the weight for each logprob(a|s) is the total reward AFTER the action is taken
196 | batch_weights += list(reward_to_go(ep_rews))
197 | else:
198 |                     raise NotImplementedError
199 |
200 | # reset episode-specific variables
201 | obs, done, ep_rews = self.env.reset(), False, []
202 |
203 | # end experience loop if we have enough of it
204 | if len(batch_obs) > n_samples:
205 | break
206 |
207 | return batch_obs, batch_acts, batch_weights, batch_rets
208 |
209 | def _update_parameters(self, states, actions, weights) -> float:
210 | """
211 | One step of policy gradient update
212 | """
213 | self.optimizer.zero_grad()
214 |
215 | loss = self._compute_loss(
216 | obs=torch.as_tensor(states, dtype=torch.float32),
217 | act=torch.as_tensor(actions, dtype=torch.int32),
218 | weights=torch.as_tensor(weights, dtype=torch.float32)
219 | )
220 |
221 | # compute gradients
222 | loss.backward()
223 |
224 | # update parameters with Adam
225 | self.optimizer.step()
226 |
227 | return loss.item()
228 |
229 | def _compute_loss(self, obs, act, weights):
230 | logp = self._get_policy(obs).log_prob(act)
231 | return -(logp * weights).mean()
232 |
233 | def _get_policy(self, obs):
234 | """
235 | Get action distribution given the current policy
236 | """
237 | logits = self.policy_net(obs)
238 | return Categorical(logits=logits)
239 |
240 | @classmethod
241 | def load_from_disk(cls, env_name: str, path: Path):
242 | """
243 | We recover all necessary variables to be able to evaluate the agent.
244 |
245 | NOTE: training state is not stored, so it is not possible to resume
246 | an interrupted training run as it was.
247 | """
248 | # load hyper-params
249 | with open(path / 'hparams.json', 'r') as f:
250 | hparams = json.load(f)
251 |
252 | # generate Python object
253 | agent = cls(env_name, **hparams)
254 |
255 | agent.policy_net = torch.load(path / 'model')
256 | agent.policy_net.eval()
257 |
258 | return agent
259 |
260 | def save_to_disk(self, path: Path) -> None:
261 | """"""
262 | if not path.exists():
263 | os.makedirs(path)
264 |
265 | # save hyper-parameters in a json file
266 | with open(path / 'hparams.json', 'w') as f:
267 | json.dump(self.hparams, f)
268 |
269 | # save main model
270 | torch.save(self.policy_net, path / 'model')
271 |
272 |
273 | if __name__ == '__main__':
274 |
275 | import argparse
276 | parser = argparse.ArgumentParser()
277 | parser.add_argument('--env', type=str, default='CartPole-v0')
278 | parser.add_argument('--n_policy_updates', type=int, default=1000)
279 | parser.add_argument('--batch_size', type=int, default=128)
280 | parser.add_argument('--gradient_weights', type=str, default='rewards')
281 | parser.add_argument('--lr', type=float, default=3e-4)
282 | parser.add_argument("--hidden_layers", type=int, nargs="+",)
283 | parser.add_argument("--freq_eval", type=int)
284 | args = parser.parse_args()
285 |
286 | vpg_agent = VPGAgent(
287 | env_name=args.env,
288 | gradient_weights=args.gradient_weights,
289 | learning_rate=args.lr,
290 | hidden_layers=args.hidden_layers,
291 | )
292 |
293 | # generate a unique agent_id, that we later use to save results to disk, as
294 | # well as TensorBoard logs.
295 | agent_id = get_agent_id(args.env)
296 | print(f'agent_id = {agent_id}')
297 |
298 | # tensorboard logger to see training curves
299 | logger = get_logger(env_name=args.env, agent_id=agent_id)
300 |
301 | # path to save policy network weights
302 | model_path = get_model_path(env_name=args.env, agent_id=agent_id)
303 |
304 | # start training
305 | vpg_agent.train(
306 | n_policy_updates=args.n_policy_updates,
307 | batch_size=args.batch_size,
308 | logger=logger,
309 | model_path=model_path,
310 | freq_eval=args.freq_eval,
311 | )
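312 |
313 |
314 | # --- Illustrative notes (not part of the original file) -----------------------------
315 | #
316 | # reward_to_go turns per-step rewards into cumulative future rewards, e.g.
317 | #   reward_to_go([1, 2, 3])  ->  array([6, 5, 3])
318 | #
319 | # One way you might launch a training run from the 04_lunar_lander folder
320 | # (with PYTHONPATH="." as in the README); all flags below exist in the parser above,
321 | # and the particular values are just an example:
322 | #
323 | #   python src/vpg_agent.py --env LunarLander-v2 --n_policy_updates 1000 \
324 | #       --batch_size 4000 --gradient_weights rewards-to-go --hidden_layers 64 --freq_eval 10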
--------------------------------------------------------------------------------
/04_lunar_lander/tensorboard_logs/readme.md:
--------------------------------------------------------------------------------
1 | ### Tensorboard logs for each train run are stored in this folder
2 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2022 Pau Labarta Bajo
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 | # The Hands-on Reinforcement Learning course 🚀
3 | ### From zero to HERO 🦸🏻🦸🏽
4 | > Out of intense complexities, intense simplicities emerge.
5 | > -- Winston Churchill
6 |
7 |
8 | 
9 |
10 | [](https://twitter.com/paulabartabajo_)
11 |
12 | ## Contents
13 |
14 | * [Welcome to the course](#welcome-to-the-course-)
15 | * [Lectures](#lectures)
16 | * [Wanna contribute?](#wanna-contribute)
17 | * [Let's connect!](#lets-connect)
18 |
19 | ## Welcome to the course 🤗❤️
20 |
21 | Welcome to my step-by-step, hands-on course that will take you from basic reinforcement learning to cutting-edge deep RL.
22 |
23 | We will start with a short intro to what RL is, what it is used for, and what the landscape of current
24 | RL algorithms looks like.
25 |
26 | Then, in each following chapter we will solve a different problem, with increasing difficulty:
27 | - 🏆 easy
28 | - 🏆🏆 medium
29 | - 🏆🏆🏆 hard
30 |
31 | Ultimately, the most complex RL problems involve a mixture of reinforcement learning algorithms, optimizations and Deep Learning techniques.
32 |
33 | You do not need to know deep learning (DL) to follow along with this course.
34 |
35 | I will give you enough context to get you familiar with DL philosophy and understand
36 | how it becomes a crucial ingredient in modern reinforcement learning.
37 |
38 | ## Lectures
39 |
40 | 0. [Introduction to Reinforcement Learning](https://datamachines.xyz/2021/11/17/hands-on-reinforcement-learning-course-part-1/)
41 | 1. [Q-learning to drive a taxi 🏆](01_taxi/README.md)
42 | 2. [SARSA to beat gravity 🏆](02_mountain_car/README.md)
43 | 3. [Parametric Q learning to keep the balance 💃 🏆](03_cart_pole/README.md)
44 | 4. [Policy gradients to land on the Moon 🏆](04_lunar_lander/README.md)
45 |
46 | ## Wanna contribute?
47 |
48 | There are 2 things you can do to contribute to this course:
49 |
50 | 1. Spread the word and share it on [Twitter](https://ctt.ac/Aa7dt), [LinkedIn](https://www.linkedin.com/shareArticle?mini=true&url=http%3A//datamachines.xyz/the-hands-on-reinforcement-learning-course-page/&title=The%20hands-on%20Reinforcement%20Learning%20course&summary=Wanna%20learn%20Reinforcement%20Learning?%20%F0%9F%A4%94%0A%40paulabartabajo%20has%20a%20course%20on%20%23reinforcementlearning,%20that%20takes%20you%20from%20zero%20to%20PRO%20%F0%9F%A6%B8%F0%9F%8F%BB%E2%80%8D%F0%9F%A6%B8%F0%9F%8F%BD.%0A%0A%F0%9F%91%89%F0%9F%8F%BD%20With%20lots%20of%20Python%0A%F0%9F%91%89%F0%9F%8F%BD%20Intuitions,%20tips%20%26%20tricks%20explained.%0A%F0%9F%91%89%F0%9F%8F%BD%20And%20free,%20by%20the%20way.%0A%0AReady%20to%20start?%20Click%20%F0%9F%91%87%F0%9F%8F%BD%F0%9F%91%87%F0%9F%8F%BE%F0%9F%91%87%F0%9F%8F%BF%0A%0A%23MachineLearning&source=)
51 |
52 | 2. Open a [pull request](https://github.com/Paulescu/hands-on-rl/pulls) to fix a bug or improve the code readability.
53 |
54 | ### Thanks ❤️
55 | Special thanks to all the students who contributed valuable feedback
56 | and pull requests ❤
57 |
58 | - [Neria Uzan](https://www.linkedin.com/in/neria-uzan-369803107/)
59 | - [Anthony Lapadula](https://www.linkedin.com/in/anthony-lapadula-9343a5b/)
60 | - [Petar Sekulić](https://www.linkedin.com/in/petar-sekulic-ml/)
61 |
62 | ## Let's connect!
63 |
64 | 👉🏽 Subscribe for **FREE** to the [Real-World ML newsletter](https://realworldml.net/subscribe/) 🧠
65 |
66 | 👉🏽 Follow me on [Twitter](https://twitter.com/paulabartabajo_) and [LinkedIn](https://www.linkedin.com/in/pau-labarta-bajo-4432074b/) 💡
67 |
--------------------------------------------------------------------------------