├── .gitignore ├── 01_taxi ├── README.md ├── notebooks │ ├── 00_environment.ipynb │ ├── 01_random_agent_baseline.ipynb │ ├── 02_q_agent.ipynb │ ├── 03_q_agent_hyperparameters_analysis.ipynb │ └── 04_homework.ipynb ├── pyproject.toml ├── requirements.txt └── src │ ├── __init__.py │ ├── loops.py │ ├── q_agent.py │ └── random_agent.py ├── 02_mountain_car ├── README.md ├── notebooks │ ├── 00_environment.ipynb │ ├── 01_random_agent_baseline.ipynb │ ├── 02_sarsa_agent.ipynb │ ├── 03_momentum_agent_baseline.ipynb │ └── 04_homework.ipynb ├── poetry.lock ├── pyproject.toml └── src │ ├── base_agent.py │ ├── config.py │ ├── loops.py │ ├── momentum_agent.py │ ├── random_agent.py │ ├── sarsa_agent.py │ └── viz.py ├── 03_cart_pole ├── .gitignore ├── README.md ├── images │ ├── deep_q_net.svg │ ├── hparams_search_diagram.svg │ ├── linear_model.jpg │ ├── linear_model_sml.jpg │ ├── neural_net.jpg │ ├── neural_net_homework.jpg │ ├── ngrok_example.png │ ├── nn_1_hidden_layer_sml.jpg │ ├── nn_2_hidden_layers_sml.jpg │ ├── nn_3_hidden_layers_sml.jpg │ └── optuna.png ├── mlflow_runs │ └── readme.md ├── notebooks │ ├── 00_environment.ipynb │ ├── 01_random_agent_baseline.ipynb │ ├── 02_linear_q_agent_bad_hyperparameters.ipynb │ ├── 03_linear_q_agent_good_hyperparameters.ipynb │ ├── 04_homework.ipynb │ ├── 05_crash_course_on_neural_nets.ipynb │ ├── 06_deep_q_agent_bad_hyperparameters.ipynb │ ├── 07_deep_q_agent_good_hyperparameters.ipynb │ ├── 08_homework.ipynb │ ├── 09_hyperparameter_search.ipynb │ ├── 10_homework.ipynb │ └── 11_hyperparameter_search_in_google_colab.ipynb ├── poetry.lock ├── pyproject.toml ├── requirements.txt ├── saved_agents │ ├── CartPole-v1 │ │ └── 0 │ │ │ ├── hparams.json │ │ │ └── model │ └── readme.md ├── src │ ├── __init__.py │ ├── agent_memory.py │ ├── config.py │ ├── loops.py │ ├── model_factory.py │ ├── optimize_hyperparameters.py │ ├── q_agent.py │ ├── random_agent.py │ ├── supervised_ml.py │ ├── utils.py │ └── viz.py └── tensorboard_logs │ ├── .gitignore │ └── readme.md ├── 04_lunar_lander ├── README.md ├── images │ └── policy_network.svg ├── notebooks │ ├── 01_random_agent_baseline.ipynb │ ├── 02_vanilla_policy_gradient_with_rewards_as_weights.ipynb │ ├── 03_vanilla_policy_gradient_with_rewards_to_go_as_weights.ipynb │ └── 04_homework.ipynb ├── pyproject.toml ├── requirements.txt ├── saved_agents │ └── readme.md ├── src │ ├── config.py │ ├── evaluation.py │ ├── model_factory.py │ ├── utils.py │ ├── viz.py │ └── vpg_agent.py └── tensorboard_logs │ └── readme.md ├── LICENSE └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | help/ 2 | logs/ 3 | snake/ 4 | **/.ipynb_checkpoints/ 5 | **/__pycache__/ 6 | **/tensorboard_logs/sml/ 7 | 02_mountain_car/saved_agents 8 | 03_cart_pole/saved_agents 9 | 04_cart_pole_tune_hparams_like_a_pro/ 10 | -------------------------------------------------------------------------------- /01_taxi/README.md: -------------------------------------------------------------------------------- 1 |
2 | # Q-learning to drive a taxi 🚕
3 | > You talkin' to me?
4 | > -- Robert de Niro (Taxi driver)
5 |
6 |
7 |
8 |
9 |
10 | *Venice’s taxis 👆 by Helena Jankovičová Kováčová from Pexels 🙏*
11 | 12 | ## Table of Contents 13 | * [Welcome 🤗](#welcome-) 14 | * [Quick setup](#quick-setup) 15 | * [Lecture transcripts](#lecture-transcripts) 16 | * [Notebooks](#notebooks) 17 | * [Let's connect](#lets-connect) 18 | 19 | ## Welcome 🤗 20 | This is part 1 of the Hands-on RL course. 21 | 22 | Let's use (tabular) Q-learning to teach an agent to solve the [Taxi-v3](https://gym.openai.com/envs/Taxi-v3/) environment 23 | from OpenAI gym. 24 | 25 | Fasten your seat belt and get ready. We are ready to depart! 26 | 27 | 28 | ## Quick setup 29 | 30 | Make sure you have Python >= 3.7. Otherwise, update it. 31 | 32 | 1. Pull the code from GitHub and cd into the `01_taxi` folder: 33 | ``` 34 | $ git clone https://github.com/Paulescu/hands-on-rl.git 35 | $ cd hands-on-rl/01_taxi 36 | ``` 37 | 38 | 2. Make sure you have the `virtualenv` tool in your Python installation 39 | ``` 40 | $ pip3 install virtualenv 41 | ``` 42 | 43 | 3. Create a virtual environment and activate it. 44 | ``` 45 | $ virtualenv -p python3 venv 46 | $ source venv/bin/activate 47 | ``` 48 | 49 | From this point onwards commands run inside the virtual environment. 50 | 51 | 52 | 3. Install dependencies and code from `src` folder in editable mode, so you can experiment with the code. 53 | ``` 54 | $ (venv) pip install -r requirements.txt 55 | $ (venv) export PYTHONPATH="." 56 | ``` 57 | 58 | 4. Open the notebooks, either with good old Jupyter or Jupyter lab 59 | ``` 60 | $ (venv) jupyter notebook 61 | ``` 62 | ``` 63 | $ (venv) jupyter lab 64 | ``` 65 | If both launch commands fail, try these: 66 | ``` 67 | $ (venv) jupyter notebook --NotebookApp.use_redirect_file=False 68 | ``` 69 | ``` 70 | $ (venv) jupyter lab --NotebookApp.use_redirect_file=False 71 | ``` 72 | 73 | 5. Play and learn. And do the homework 😉. 74 | 75 | 76 | ## Lecture transcripts 77 | 78 | [📝 Q learning](http://datamachines.xyz/2021/12/06/hands-on-reinforcement-learning-course-part-2-q-learning/) 79 | 80 | 81 | ## Notebooks 82 | 83 | - [Explore the environment](notebooks/00_environment.ipynb) 84 | - [Random agent baseline](notebooks/01_random_agent_baseline.ipynb) 85 | - [Q-agent](notebooks/02_q_agent.ipynb) 86 | - [Hyperparameter tuning](notebooks/03_q_agent_hyperparameters_analysis.ipynb) 87 | - [Homework](notebooks/04_homework.ipynb) 88 | 89 | ## Let's connect! 90 | 91 | Do you wanna become a PRO in Machine Learning? 92 | 93 | 👉🏽 Subscribe to the [datamachines newsletter](https://datamachines.xyz/subscribe/). 94 | 95 | 👉🏽 Follow me on [Medium](https://pau-labarta-bajo.medium.com/). 
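
## Bonus: a minimal end-to-end sketch

The snippet below is a minimal sketch of how the pieces in `src/` fit together. It mirrors the `__main__` block at the bottom of `src/loops.py`, so the names (`QAgent`, `train`, `evaluate`) and the hyper-parameter values come straight from this repo.

```python
import gym
import numpy as np

from src.q_agent import QAgent
from src.loops import train, evaluate

env = gym.make("Taxi-v3").env

# alpha = learning rate, gamma = discount factor
agent = QAgent(env, alpha=0.1, gamma=0.6)

# train with epsilon-greedy exploration
agent, _, _ = train(agent, env, n_episodes=10000, epsilon=0.10)

# evaluate the learned policy (a small epsilon keeps a bit of exploration)
steps, penalties, _ = evaluate(agent, env, n_episodes=100, epsilon=0.05)
print(f'Avg steps to complete ride: {np.array(steps).mean():.1f}')
print(f'Avg penalties per ride: {np.array(penalties).mean():.2f}')
```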
96 | 97 | 98 | 99 | 100 | 101 | -------------------------------------------------------------------------------- /01_taxi/notebooks/00_environment.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "4c2ff31f", 6 | "metadata": {}, 7 | "source": [ 8 | "# 00 Environment" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "id": "04a5c882", 14 | "metadata": {}, 15 | "source": [ 16 | "#### 👉Before you solve a Reinforcement Learning problem you need to define what are\n", 17 | "- the actions\n", 18 | "- the states of the world\n", 19 | "- the rewards\n", 20 | "\n", 21 | "#### 👉We are using the `Taxi-v3` environment from OpenAI's gym: https://gym.openai.com/envs/Taxi-v3/\n", 22 | "\n", 23 | "#### 👉`Taxi-v3` is an easy environment because the action space is small, and the state space is large but finite.\n", 24 | "\n", 25 | "#### 👉Environments with a finite number of actions and states are called tabular" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 3, 31 | "id": "e3629346", 32 | "metadata": {}, 33 | "outputs": [ 34 | { 35 | "name": "stdout", 36 | "output_type": "stream", 37 | "text": [ 38 | "The autoreload extension is already loaded. To reload it, use:\n", 39 | " %reload_ext autoreload\n", 40 | "Populating the interactive namespace from numpy and matplotlib\n" 41 | ] 42 | } 43 | ], 44 | "source": [ 45 | "%load_ext autoreload\n", 46 | "%autoreload 2\n", 47 | "%pylab inline\n", 48 | "%config InlineBackend.figure_format = 'svg'" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "id": "76e9a06d", 54 | "metadata": {}, 55 | "source": [ 56 | "## Load the environment 🌎" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": 4, 62 | "id": "ebfba291", 63 | "metadata": {}, 64 | "outputs": [], 65 | "source": [ 66 | "import gym\n", 67 | "env = gym.make(\"Taxi-v3\").env" 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "id": "1fcfc13a", 73 | "metadata": {}, 74 | "source": [ 75 | "## Action space" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": 5, 81 | "id": "98cfdb84", 82 | "metadata": {}, 83 | "outputs": [ 84 | { 85 | "name": "stdout", 86 | "output_type": "stream", 87 | "text": [ 88 | "Action Space Discrete(6)\n" 89 | ] 90 | } 91 | ], 92 | "source": [ 93 | "print(\"Action Space {}\".format(env.action_space))" 94 | ] 95 | }, 96 | { 97 | "cell_type": "markdown", 98 | "id": "4f53a38e", 99 | "metadata": {}, 100 | "source": [ 101 | "## State space" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": 6, 107 | "id": "e809514b", 108 | "metadata": {}, 109 | "outputs": [ 110 | { 111 | "name": "stdout", 112 | "output_type": "stream", 113 | "text": [ 114 | "State Space Discrete(500)\n" 115 | ] 116 | } 117 | ], 118 | "source": [ 119 | "print(\"State Space {}\".format(env.observation_space))" 120 | ] 121 | }, 122 | { 123 | "cell_type": "markdown", 124 | "id": "c8f6a690", 125 | "metadata": {}, 126 | "source": [ 127 | "## Rewards" 128 | ] 129 | }, 130 | { 131 | "cell_type": "code", 132 | "execution_count": 7, 133 | "id": "0faad2a7", 134 | "metadata": {}, 135 | "outputs": [ 136 | { 137 | "name": "stdout", 138 | "output_type": "stream", 139 | "text": [ 140 | "env.P[state][action][0]: (1.0, 223, -1, False)\n" 141 | ] 142 | } 143 | ], 144 | "source": [ 145 | "# env.P is double dictionary.\n", 146 | "# - The 1st key represents the state, from 0 to 499\n", 147 | "# - The 2nd key represens the action taken by the 
agent,\n", 148 | "# from 0 to 5\n", 149 | "\n", 150 | "# example\n", 151 | "state = 123\n", 152 | "action = 0 # move south\n", 153 | "\n", 154 | "# env.P[state][action][0] is a list with 4 elements\n", 155 | "# (probability, next_state, reward, done)\n", 156 | "# \n", 157 | "# - probability\n", 158 | "# It is always 1 in this environment, which means\n", 159 | "# there are no external/random factors that determine the\n", 160 | "# next_state\n", 161 | "# apart from the agent's action a.\n", 162 | "#\n", 163 | "# - next_state: 223 in this case\n", 164 | "# \n", 165 | "# - reward: -1 in this case\n", 166 | "#\n", 167 | "# - done: boolean (True/False) indicates whether the\n", 168 | "# episode has ended (i.e. the driver has dropped the\n", 169 | "# passenger at the correct destination)\n", 170 | "print('env.P[state][action][0]: ', env.P[state][action][0])" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": 8, 176 | "id": "552caf92", 177 | "metadata": {}, 178 | "outputs": [ 179 | { 180 | "name": "stdout", 181 | "output_type": "stream", 182 | "text": [ 183 | "+---------+\n", 184 | "|\u001b[34;1mR\u001b[0m: | : :G|\n", 185 | "| :\u001b[43m \u001b[0m| : : |\n", 186 | "| : : : : |\n", 187 | "| | : | : |\n", 188 | "|Y| : |\u001b[35mB\u001b[0m: |\n", 189 | "+---------+\n", 190 | "\n" 191 | ] 192 | } 193 | ], 194 | "source": [ 195 | "# Need to call reset() at least once before render() will work\n", 196 | "env.reset()\n", 197 | "\n", 198 | "env.s = 123\n", 199 | "env.render(mode='human')" 200 | ] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "execution_count": 9, 205 | "id": "2ded2ba5", 206 | "metadata": {}, 207 | "outputs": [ 208 | { 209 | "name": "stdout", 210 | "output_type": "stream", 211 | "text": [ 212 | "+---------+\n", 213 | "|\u001b[34;1mR\u001b[0m: | : :G|\n", 214 | "| : | : : |\n", 215 | "| :\u001b[43m \u001b[0m: : : |\n", 216 | "| | : | : |\n", 217 | "|Y| : |\u001b[35mB\u001b[0m: |\n", 218 | "+---------+\n", 219 | "\n" 220 | ] 221 | } 222 | ], 223 | "source": [ 224 | "env.s = 223\n", 225 | "env.render(mode='human')" 226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": null, 231 | "id": "2aacea45", 232 | "metadata": {}, 233 | "outputs": [], 234 | "source": [] 235 | } 236 | ], 237 | "metadata": { 238 | "kernelspec": { 239 | "display_name": "Python 3 (ipykernel)", 240 | "language": "python", 241 | "name": "python3" 242 | }, 243 | "language_info": { 244 | "codemirror_mode": { 245 | "name": "ipython", 246 | "version": 3 247 | }, 248 | "file_extension": ".py", 249 | "mimetype": "text/x-python", 250 | "name": "python", 251 | "nbconvert_exporter": "python", 252 | "pygments_lexer": "ipython3", 253 | "version": "3.7.5" 254 | } 255 | }, 256 | "nbformat": 4, 257 | "nbformat_minor": 5 258 | } 259 | -------------------------------------------------------------------------------- /01_taxi/notebooks/04_homework.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "f0fd6807", 6 | "metadata": {}, 7 | "source": [ 8 | "# 04 Homework 🏋️🏋️🏋️" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "id": "abcf6613", 14 | "metadata": {}, 15 | "source": [ 16 | "#### 👉A course without homework is not a course!\n", 17 | "\n", 18 | "#### 👉Spend some time thinking and trying to implement the challenges I propose here.\n", 19 | "\n", 20 | "#### 👉They are not so easy, so if you get stuck drop me an email at `plabartabajo@gmail.com`" 21 | ] 22 | }, 23 | { 24 | 
"cell_type": "markdown", 25 | "id": "86f82e45", 26 | "metadata": {}, 27 | "source": [ 28 | "-----" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "id": "67656662", 34 | "metadata": {}, 35 | "source": [ 36 | "## 1. Can you update the function `train` in a way that the input `epsilon` can also be a callable function?\n", 37 | "\n", 38 | "An `epsilon` value that decays after each episode works better than a fixed `epsilon` for most RL problems.\n", 39 | "\n", 40 | "This is hard exercise, but I want you to give it a try.\n", 41 | "\n", 42 | "If you do not manage it, do not worry. We are going to implement this in an upcoming lesson." 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "id": "7d1e016e", 48 | "metadata": {}, 49 | "source": [ 50 | "-----" 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "id": "c0a46bf7", 56 | "metadata": {}, 57 | "source": [ 58 | "## 2. Can you parallelize the function `train_many_runs` using Python's `multiprocessing` module?\n", 59 | "\n", 60 | "I do not like to wait and stare at each progress bar, while I think that each run in `train_many_runs` could execute\n", 61 | "in parallel.\n", 62 | "\n", 63 | "Create a new function called `train_many_runs_in_parallel` that outputs the same results as `train_many_runs` but that executes in a fraction of time." 64 | ] 65 | } 66 | ], 67 | "metadata": { 68 | "kernelspec": { 69 | "display_name": "Python 3 (ipykernel)", 70 | "language": "python", 71 | "name": "python3" 72 | }, 73 | "language_info": { 74 | "codemirror_mode": { 75 | "name": "ipython", 76 | "version": 3 77 | }, 78 | "file_extension": ".py", 79 | "mimetype": "text/x-python", 80 | "name": "python", 81 | "nbconvert_exporter": "python", 82 | "pygments_lexer": "ipython3", 83 | "version": "3.7.5" 84 | } 85 | }, 86 | "nbformat": 4, 87 | "nbformat_minor": 5 88 | } 89 | -------------------------------------------------------------------------------- /01_taxi/pyproject.toml: -------------------------------------------------------------------------------- 1 | [tool.poetry] 2 | name = "src" 3 | version = "0.1.0" 4 | description = "" 5 | authors = ["Pau "] 6 | 7 | [tool.poetry.dependencies] 8 | python = ">=3.7.1,<4.0" 9 | gym = "^0.21.0" 10 | tqdm = "^4.62.3" 11 | matplotlib = "^3.5.0" 12 | pandas = "^1.3.4" 13 | seaborn = "^0.11.2" 14 | jupyter = "^1.0.0" 15 | jupyterlab = "^3.3.0" 16 | 17 | [tool.poetry.dev-dependencies] 18 | pytest = "^5.2" 19 | 20 | [build-system] 21 | requires = ["poetry-core>=1.0.0"] 22 | build-backend = "poetry.core.masonry.api" 23 | -------------------------------------------------------------------------------- /01_taxi/requirements.txt: -------------------------------------------------------------------------------- 1 | anyio==3.5.0; python_full_version >= "3.6.2" and python_version >= "3.7" 2 | appnope==0.1.2; platform_system == "Darwin" and python_version >= "3.7" and sys_platform == "darwin" 3 | argon2-cffi-bindings==21.2.0; python_version >= "3.6" 4 | argon2-cffi==21.3.0; python_version >= "3.7" 5 | atomicwrites==1.4.0; python_version >= "3.5" and python_full_version < "3.0.0" and sys_platform == "win32" or sys_platform == "win32" and python_version >= "3.5" and python_full_version >= "3.4.0" 6 | attrs==21.4.0; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.7" 7 | babel==2.9.1; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.7" 8 | backcall==0.2.0; python_version >= "3.7" 9 | 
bleach==4.1.0; python_version >= "3.7" 10 | certifi==2021.10.8; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.7" 11 | cffi==1.15.0; implementation_name == "pypy" and python_version >= "3.6" 12 | charset-normalizer==2.0.12; python_full_version >= "3.6.0" and python_version >= "3.7" 13 | cloudpickle==2.0.0; python_version >= "3.6" 14 | colorama==0.4.4; python_version >= "3.7" and python_full_version < "3.0.0" and platform_system == "Windows" and sys_platform == "win32" or python_full_version >= "3.5.0" and platform_system == "Windows" and sys_platform == "win32" and python_version >= "3.7" 15 | cycler==0.11.0; python_version >= "3.7" 16 | debugpy==1.5.1; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.7" 17 | decorator==5.1.1; python_version >= "3.7" 18 | defusedxml==0.7.1; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.7" 19 | entrypoints==0.4; python_full_version >= "3.6.1" and python_version >= "3.7" 20 | fonttools==4.29.1; python_version >= "3.7" 21 | gym==0.21.0; python_version >= "3.6" 22 | idna==3.3; python_full_version >= "3.6.2" and python_version >= "3.7" and (python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.7") 23 | importlib-metadata==4.11.2; python_version < "3.8" and python_version >= "3.7" and (python_version >= "3.7" and python_full_version < "3.0.0" and python_version < "3.8" or python_full_version >= "3.4.0" and python_version >= "3.7" and python_version < "3.8") 24 | importlib-resources==5.4.0; python_version < "3.9" and python_version >= "3.7" 25 | ipykernel==6.9.1; python_version >= "3.7" 26 | ipython-genutils==0.2.0; python_version >= "3.7" 27 | ipython==7.32.0; python_version >= "3.7" 28 | ipywidgets==7.6.5 29 | jedi==0.18.1; python_version >= "3.7" 30 | jinja2==3.0.3; python_version >= "3.7" 31 | json5==0.9.6; python_version >= "3.7" 32 | jsonschema==4.4.0; python_version >= "3.7" 33 | jupyter-client==7.1.2; python_full_version >= "3.7.0" and python_version >= "3.7" 34 | jupyter-console==6.4.3; python_version >= "3.6" 35 | jupyter-core==4.9.2; python_full_version >= "3.6.1" and python_version >= "3.7" 36 | jupyter-server==1.13.5; python_version >= "3.7" 37 | jupyter==1.0.0 38 | jupyterlab-pygments==0.1.2; python_version >= "3.7" 39 | jupyterlab-server==2.10.3; python_version >= "3.7" 40 | jupyterlab-widgets==1.0.2; python_version >= "3.6" 41 | jupyterlab==3.3.0; python_version >= "3.7" 42 | kiwisolver==1.3.2; python_version >= "3.7" 43 | markupsafe==2.1.0; python_version >= "3.7" 44 | matplotlib-inline==0.1.3; python_version >= "3.7" 45 | matplotlib==3.5.1; python_version >= "3.7" 46 | mistune==0.8.4; python_version >= "3.7" 47 | more-itertools==8.12.0; python_version >= "3.5" 48 | nbclassic==0.3.6; python_version >= "3.7" 49 | nbclient==0.5.12; python_full_version >= "3.7.0" and python_version >= "3.7" 50 | nbconvert==6.4.2; python_version >= "3.7" 51 | nbformat==5.1.3; python_full_version >= "3.7.0" and python_version >= "3.7" 52 | nest-asyncio==1.5.4; python_full_version >= "3.7.0" and python_version >= "3.7" 53 | notebook-shim==0.1.0; python_version >= "3.7" 54 | notebook==6.4.8; python_version >= "3.7" 55 | numpy==1.21.1 56 | packaging==21.3; python_version >= "3.7" 57 | pandas==1.3.5; python_full_version >= "3.7.1" 58 | pandocfilters==1.5.0; python_version >= "3.7" and python_full_version 
< "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.7" 59 | parso==0.8.3; python_version >= "3.7" 60 | pexpect==4.8.0; sys_platform != "win32" and python_version >= "3.7" 61 | pickleshare==0.7.5; python_version >= "3.7" 62 | pillow==9.0.1; python_version >= "3.7" 63 | pluggy==0.13.1; python_version >= "3.5" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.5" 64 | prometheus-client==0.13.1; python_version >= "3.7" 65 | prompt-toolkit==3.0.28; python_full_version >= "3.6.2" and python_version >= "3.7" 66 | ptyprocess==0.7.0; os_name != "nt" and python_version >= "3.7" and sys_platform != "win32" 67 | py==1.11.0; python_version >= "3.6" and python_full_version < "3.0.0" and implementation_name == "pypy" or python_full_version >= "3.5.0" and python_version >= "3.6" and implementation_name == "pypy" 68 | pycparser==2.21; python_version >= "3.6" and python_full_version < "3.0.0" and implementation_name == "pypy" or implementation_name == "pypy" and python_version >= "3.6" and python_full_version >= "3.4.0" 69 | pygments==2.11.2; python_version >= "3.7" 70 | pyparsing==3.0.7; python_version >= "3.7" 71 | pyrsistent==0.18.1; python_version >= "3.7" 72 | pytest==5.4.3; python_version >= "3.5" 73 | python-dateutil==2.8.2; python_full_version >= "3.7.1" and python_version >= "3.7" and (python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.3.0" and python_version >= "3.7") 74 | pytz==2021.3; python_full_version >= "3.7.1" and python_version >= "3.6" and (python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.7") 75 | pywin32==303; sys_platform == "win32" and platform_python_implementation != "PyPy" and python_version >= "3.7" 76 | pywinpty==1.1.6; os_name == "nt" and python_version >= "3.7" 77 | pyzmq==22.3.0; python_full_version >= "3.6.1" and python_version >= "3.7" 78 | qtconsole==5.2.2; python_version >= "3.6" 79 | qtpy==2.0.1; python_version >= "3.6" 80 | requests==2.27.1; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.7" 81 | scipy==1.6.1; python_version >= "3.7" 82 | seaborn==0.11.2; python_version >= "3.6" 83 | send2trash==1.8.0; python_version >= "3.7" 84 | setuptools-scm==6.4.2; python_version >= "3.7" 85 | six==1.16.0; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.3.0" and python_version >= "3.7" 86 | sniffio==1.2.0; python_full_version >= "3.6.2" and python_version >= "3.7" 87 | terminado==0.13.3; python_version >= "3.7" 88 | testpath==0.6.0; python_version >= "3.7" 89 | tomli==2.0.1; python_version >= "3.7" 90 | tornado==6.1; python_full_version >= "3.6.1" and python_version >= "3.7" 91 | tqdm==4.63.0; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.4.0") 92 | traitlets==5.1.1; python_full_version >= "3.7.0" and python_version >= "3.7" 93 | typing-extensions==4.1.1; python_version < "3.8" and python_version >= "3.7" and python_full_version >= "3.6.2" 94 | urllib3==1.26.8; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version < "4" and python_version >= "3.7" 95 | wcwidth==0.2.5; python_full_version >= "3.6.2" and python_version >= "3.6" 96 | webencodings==0.5.1; python_version >= "3.7" 97 | websocket-client==1.3.1; python_version >= "3.7" 98 | widgetsnbextension==3.5.2 99 | zipp==3.7.0; python_version < "3.8" and python_version 
>= "3.7" 100 | -------------------------------------------------------------------------------- /01_taxi/src/__init__.py: -------------------------------------------------------------------------------- 1 | __version__ = '0.1.0' 2 | -------------------------------------------------------------------------------- /01_taxi/src/loops.py: -------------------------------------------------------------------------------- 1 | from typing import Tuple, List, Any 2 | import random 3 | from pdb import set_trace as stop 4 | 5 | import numpy as np 6 | from tqdm import tqdm 7 | 8 | 9 | def train( 10 | agent, 11 | env, 12 | n_episodes: int, 13 | epsilon: float 14 | ) -> Tuple[Any, List, List]: 15 | """ 16 | Trains and agent and returns 3 things: 17 | - agent object 18 | - timesteps_per_episode 19 | - penalties_per_episode 20 | """ 21 | # For plotting metrics 22 | timesteps_per_episode = [] 23 | penalties_per_episode = [] 24 | 25 | for i in tqdm(range(0, n_episodes)): 26 | 27 | state = env.reset() 28 | 29 | epochs, penalties, reward, = 0, 0, 0 30 | done = False 31 | 32 | while not done: 33 | 34 | if random.uniform(0, 1) < epsilon: 35 | # Explore action space 36 | action = env.action_space.sample() 37 | else: 38 | # Exploit learned values 39 | action = agent.get_action(state) 40 | 41 | next_state, reward, done, info = env.step(action) 42 | 43 | agent.update_parameters(state, action, reward, next_state) 44 | 45 | if reward == -10: 46 | penalties += 1 47 | 48 | state = next_state 49 | epochs += 1 50 | 51 | timesteps_per_episode.append(epochs) 52 | penalties_per_episode.append(penalties) 53 | 54 | return agent, timesteps_per_episode, penalties_per_episode 55 | 56 | 57 | def evaluate( 58 | agent, 59 | env, 60 | n_episodes: int, 61 | epsilon: float, 62 | initial_state: int = None 63 | ) -> Tuple[List, List]: 64 | """ 65 | Tests agent performance in random `n_episodes`. 66 | 67 | It returns: 68 | - timesteps_per_episode 69 | - penalties_per_episode 70 | """ 71 | # For plotting metrics 72 | timesteps_per_episode = [] 73 | penalties_per_episode = [] 74 | frames_per_episode = [] 75 | 76 | for i in tqdm(range(0, n_episodes)): 77 | 78 | if initial_state: 79 | # init the environment at 'initial_state' 80 | state = initial_state 81 | env.s = initial_state 82 | else: 83 | # random starting state 84 | state = env.reset() 85 | 86 | epochs, penalties, reward, = 0, 0, 0 87 | frames = [] 88 | done = False 89 | 90 | while not done: 91 | 92 | if random.uniform(0, 1) < epsilon: 93 | # Explore action space 94 | action = env.action_space.sample() 95 | else: 96 | # Exploit learned values 97 | action = agent.get_action(state) 98 | 99 | next_state, reward, done, info = env.step(action) 100 | 101 | frames.append({ 102 | 'frame': env.render(mode='ansi'), 103 | 'state': state, 104 | 'action': action, 105 | 'reward': reward 106 | }) 107 | 108 | if reward == -10: 109 | penalties += 1 110 | 111 | state = next_state 112 | epochs += 1 113 | 114 | timesteps_per_episode.append(epochs) 115 | penalties_per_episode.append(penalties) 116 | frames_per_episode.append(frames) 117 | 118 | return timesteps_per_episode, penalties_per_episode, frames_per_episode 119 | 120 | 121 | def train_many_runs( 122 | agent, 123 | env, 124 | n_episodes: int, 125 | epsilon: float, 126 | n_runs: int, 127 | ) -> Tuple[List, List]: 128 | """ 129 | Calls 'train' many times, stores results and averages them out. 
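    It returns two lists of length `n_episodes`:
    - timesteps per episode, averaged over the `n_runs` runs
    - penalties per episode, averaged over the `n_runs` runs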
130 | """ 131 | timesteps = np.zeros(shape=(n_runs, n_episodes)) 132 | penalties = np.zeros(shape=(n_runs, n_episodes)) 133 | 134 | for i in range(0, n_runs): 135 | 136 | agent.reset() 137 | 138 | _, timesteps[i, :], penalties[i, :] = train( 139 | agent, env, n_episodes, epsilon 140 | ) 141 | timesteps = np.mean(timesteps, axis=0).tolist() 142 | penalties = np.mean(penalties, axis=0).tolist() 143 | 144 | return timesteps, penalties 145 | 146 | if __name__ == '__main__': 147 | 148 | import gym 149 | from src.q_agent import QAgent 150 | 151 | env = gym.make("Taxi-v3").env 152 | alpha = 0.1 153 | gamma = 0.6 154 | agent = QAgent(env, alpha, gamma) 155 | 156 | agent, _, _ = train( 157 | agent, env, n_episodes=10000, epsilon=0.10) 158 | 159 | timesteps_per_episode, penalties_per_episode, _ = evaluate( 160 | agent, env, n_episodes=100, epsilon=0.05 161 | ) 162 | 163 | print(f'Avg steps to complete ride: {np.array(timesteps_per_episode).mean()}') 164 | print(f'Avg penalties to complete ride: {np.array(penalties_per_episode).mean()}') -------------------------------------------------------------------------------- /01_taxi/src/q_agent.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from pdb import set_trace as stop 3 | 4 | class QAgent: 5 | 6 | def __init__(self, env, alpha, gamma): 7 | self.env = env 8 | 9 | # table with q-values: n_states * n_actions 10 | self.q_table = np.zeros([env.observation_space.n, env.action_space.n]) 11 | 12 | # hyper-parameters 13 | self.alpha = alpha 14 | self.gamma = gamma 15 | 16 | def get_action(self, state): 17 | """""" 18 | # stop() 19 | return np.argmax(self.q_table[state]) 20 | 21 | def update_parameters(self, state, action, reward, next_state): 22 | """""" 23 | old_value = self.q_table[state, action] 24 | next_max = np.max(self.q_table[next_state]) 25 | 26 | new_value = (1 - self.alpha) * old_value + self.alpha * (reward + self.gamma * next_max) 27 | self.q_table[state, action] = new_value 28 | 29 | def reset(self): 30 | """ 31 | Sets q-values to zeros, which essentially means the agent does not know 32 | anything 33 | """ 34 | self.q_table = np.zeros([self.env.observation_space.n, self.env.action_space.n]) 35 | -------------------------------------------------------------------------------- /01_taxi/src/random_agent.py: -------------------------------------------------------------------------------- 1 | 2 | class RandomAgent: 3 | """ 4 | This taxi driver selects actions randomly. 5 | You better not get into this taxi! 6 | """ 7 | def __init__(self, env): 8 | self.env = env 9 | 10 | def get_action(self, state) -> int: 11 | """ 12 | No input arguments to this function. 13 | The agent does not consider the state of the environment when deciding 14 | what to do next. 15 | """ 16 | return self.env.action_space.sample() -------------------------------------------------------------------------------- /02_mountain_car/README.md: -------------------------------------------------------------------------------- 1 | # SARSA to beat gravity 🚃 2 | 👉 [Read in datamachines](http://datamachines.xyz/2021/12/17/hands-on-reinforcement-learning-course-part-3-sarsa/) 3 | 👉 [Read in Towards Data Science](https://towardsdatascience.com/hands-on-reinforcement-learning-course-part-3-5db40e7938d4) 4 | 5 | 6 | This is part 2 of my course Hands-on reinforcement learning. 7 | 8 | In this part we use SARSA to help a poor car win the battle against gravity! 
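
For reference, this is the update rule the SARSA agent applies at every step. It is only a sketch for a generic tabular setting; the actual agent in `src/sarsa_agent.py` implements the same rule on top of a discretized (position, velocity) state, and the helper function name here is just for illustration.

```python
import numpy as np

def sarsa_update(q_table: np.ndarray, s: int, a: int, reward: float,
                 s_next: int, a_next: int, alpha: float, gamma: float) -> None:
    """One SARSA step: pull Q(s, a) towards the bootstrapped target
    r + gamma * Q(s', a'), where a' is the action actually taken next."""
    target = reward + gamma * q_table[s_next, a_next]
    q_table[s, a] += alpha * (target - q_table[s, a])
```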
9 | 10 | > *Be like a train; go in the rain, go in the sun, go in the storm, go in the dark tunnels! Be like a train; concentrate on your road and go with no hesitation!* 11 | > 12 | > --_Mehmet Murat Ildan_ 13 | 14 | ### Quick setup 15 | 16 | The easiest way to get the code working in your machine is by using [Poetry](https://python-poetry.org/docs/#installation). 17 | 18 | 19 | 1. You can install Poetry with this one-liner: 20 | ```bash 21 | $ curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python - 22 | ``` 23 | 24 | 2. Git clone the code 25 | ```bash 26 | $ git clone https://github.com/Paulescu/hands-on-rl.git 27 | ``` 28 | 29 | 3. Navigate to this lesson code `02_mountain_car` 30 | ```bash 31 | $ cd hands-on-rl/02_mountain_car 32 | ``` 33 | 34 | 4. Install all dependencies from `pyproject.toml: 35 | ```bash 36 | $ poetry install 37 | ``` 38 | 39 | 5. Activate the virtual environment 40 | ```bash 41 | $ poetry shell 42 | ``` 43 | 44 | 6. Set PYTHONPATH and launch jupyter (jupyter-lab param may fix launch problems on some systems) 45 | ```bash 46 | $ export PYTHONPATH=".." 47 | $ jupyter-lab --NotebookApp.use_redirect_file=False 48 | ``` 49 | 50 | ### Notebooks 51 | 52 | 1. [Explore the environment](notebooks/00_environment.ipynb) 53 | 2. [Random agent baseline](notebooks/01_random_agent_baseline.ipynb) 54 | 3. [SARSA agent](notebooks/02_sarsa_agent.ipynb) 55 | 4. [Momentum agent](notebooks/03_momentum_agent_baseline.ipynb) 56 | 5. [Homework](notebooks/04_homework.ipynb) 57 | -------------------------------------------------------------------------------- /02_mountain_car/notebooks/04_homework.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "f0fd6807", 6 | "metadata": {}, 7 | "source": [ 8 | "# 04 Homework 🏋️🏋️🏋️" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "id": "abcf6613", 14 | "metadata": {}, 15 | "source": [ 16 | "#### 👉A course without homework is not a course!\n", 17 | "\n", 18 | "#### 👉Spend some time thinking and trying to implement the challenges I propose here.\n", 19 | "\n", 20 | "#### 👉Feel free to email me your solutions at:" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "id": "d1d983a3", 26 | "metadata": {}, 27 | "source": [ 28 | "# `plabartabajo@gmail.com`" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "id": "86f82e45", 34 | "metadata": {}, 35 | "source": [ 36 | "-----" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "id": "67656662", 42 | "metadata": {}, 43 | "source": [ 44 | "## 1. Can you adjust the hyper-parameters `alpha = 0.1` and `gamma = 0.9` to train a better SARSA agent than mine?\n", 45 | "\n", 46 | "Experiment with these 2 hyper-parameters to maximize the agent success rate." 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "id": "c0a46bf7", 52 | "metadata": {}, 53 | "source": [ 54 | "## 2. Can you increase the resolution of the discretization?\n", 55 | "\n", 56 | "Instead of using round marks of\n", 57 | "- `0.1` for position\n", 58 | "- `0.01` for velocity\n", 59 | "\n", 60 | "Use 10x:\n", 61 | "- `0.01` for position\n", 62 | "- `0.001` for velocity\n", 63 | "\n", 64 | "Let me know if you get a better agent than mine?" 
65 | ] 66 | } 67 | ], 68 | "metadata": { 69 | "kernelspec": { 70 | "display_name": "Python 3 (ipykernel)", 71 | "language": "python", 72 | "name": "python3" 73 | }, 74 | "language_info": { 75 | "codemirror_mode": { 76 | "name": "ipython", 77 | "version": 3 78 | }, 79 | "file_extension": ".py", 80 | "mimetype": "text/x-python", 81 | "name": "python", 82 | "nbconvert_exporter": "python", 83 | "pygments_lexer": "ipython3", 84 | "version": "3.7.5" 85 | } 86 | }, 87 | "nbformat": 4, 88 | "nbformat_minor": 5 89 | } 90 | -------------------------------------------------------------------------------- /02_mountain_car/pyproject.toml: -------------------------------------------------------------------------------- 1 | [tool.poetry] 2 | name = "src" 3 | version = "0.1.0" 4 | description = "" 5 | authors = ["Pau "] 6 | 7 | [tool.poetry.dependencies] 8 | python = ">=3.7.1,<4.0" 9 | gym = "^0.21.0" 10 | pyglet = "^1.5.21" 11 | matplotlib = "^3.5.0" 12 | tqdm = "^4.62.3" 13 | pandas = "^1.3.4" 14 | jupyter = "^1.0.0" 15 | PyVirtualDisplay = "^2.2" 16 | imageio = "^2.13.3" 17 | seaborn = "^0.11.2" 18 | 19 | [tool.poetry.dev-dependencies] 20 | pytest = "^5.2" 21 | 22 | [build-system] 23 | requires = ["poetry-core>=1.0.0"] 24 | build-backend = "poetry.core.masonry.api" 25 | -------------------------------------------------------------------------------- /02_mountain_car/src/base_agent.py: -------------------------------------------------------------------------------- 1 | import pickle 2 | from pathlib import Path 3 | from abc import ABC, abstractmethod 4 | 5 | 6 | class BaseAgent(ABC): 7 | 8 | @abstractmethod 9 | def get_action(self, state): 10 | pass 11 | 12 | @abstractmethod 13 | def update_parameters(self, state, action, reward, next_state): 14 | pass 15 | 16 | def save_to_disk(self, path: Path): 17 | """ 18 | Saves python object to disk using a binary format 19 | """ 20 | with open(path, "wb") as f: 21 | pickle.dump(self, f, pickle.HIGHEST_PROTOCOL) 22 | 23 | @classmethod 24 | def load_from_disk(cls, path: Path): 25 | """ 26 | Loads binary format into Python object. 
27 | """ 28 | with open(path, "rb") as f: 29 | dump = pickle.load(f) 30 | 31 | return dump -------------------------------------------------------------------------------- /02_mountain_car/src/config.py: -------------------------------------------------------------------------------- 1 | # Define SAVED_AGENTS_DIR and create dir if missing 2 | import os 3 | import pathlib 4 | root_dir = pathlib.Path(__file__).parent.resolve().parent 5 | SAVED_AGENTS_DIR = root_dir / 'saved_agents' 6 | os.makedirs(SAVED_AGENTS_DIR, exist_ok=True) 7 | -------------------------------------------------------------------------------- /02_mountain_car/src/loops.py: -------------------------------------------------------------------------------- 1 | from typing import Tuple, List, Callable, Union, Optional 2 | import random 3 | 4 | from tqdm import tqdm 5 | 6 | def train( 7 | agent, 8 | env, 9 | n_episodes: int, 10 | epsilon: Union[float, Callable] 11 | ) -> Tuple[List, List]: 12 | 13 | # For plotting metrics 14 | reward_per_episode = [] 15 | max_position_per_episode = [] 16 | 17 | pbar = tqdm(range(0, n_episodes)) 18 | for i in pbar: 19 | 20 | state = env.reset() 21 | 22 | rewards = 0 23 | max_position = -99 24 | 25 | # handle case when epsilon is either 26 | # - a float 27 | # - or a function that returns a float given the episode nubmer 28 | epsilon_ = epsilon if isinstance(epsilon, float) else epsilon(i) 29 | 30 | pbar.set_description(f'Epsilon: {epsilon_:.2f}') 31 | 32 | done = False 33 | while not done: 34 | 35 | action = agent.get_action(state, epsilon_) 36 | 37 | next_state, reward, done, info = env.step(action) 38 | 39 | agent.update_parameters(state, action, reward, next_state, epsilon_) 40 | 41 | rewards += reward 42 | if next_state[0] > max_position: 43 | max_position = next_state[0] 44 | 45 | state = next_state 46 | 47 | reward_per_episode.append(rewards) 48 | max_position_per_episode.append(max_position) 49 | 50 | return reward_per_episode, max_position_per_episode 51 | 52 | 53 | def evaluate( 54 | agent, 55 | env, 56 | n_episodes: int, 57 | epsilon: Optional[Union[float, Callable]] = None 58 | ) -> Tuple[List, List]: 59 | 60 | # For plotting metrics 61 | reward_per_episode = [] 62 | max_position_per_episode = [] 63 | 64 | for i in tqdm(range(0, n_episodes)): 65 | 66 | state = env.reset() 67 | 68 | rewards = 0 69 | max_position = -99 70 | 71 | done = False 72 | while not done: 73 | 74 | epsilon_ = None 75 | if epsilon is not None: 76 | epsilon_ = epsilon if isinstance(epsilon, float) else epsilon(i) 77 | action = agent.get_action(state, epsilon_) 78 | 79 | next_state, reward, done, info = env.step(action) 80 | 81 | agent.update_parameters(state, action, reward, next_state, epsilon_) 82 | 83 | rewards += reward 84 | if next_state[0] > max_position: 85 | max_position = next_state[0] 86 | 87 | state = next_state 88 | 89 | reward_per_episode.append(rewards) 90 | max_position_per_episode.append(max_position) 91 | 92 | return reward_per_episode, max_position_per_episode 93 | 94 | if __name__ == '__main__': 95 | 96 | # environment 97 | import gym 98 | env = gym.make('MountainCar-v0') 99 | env._max_episode_steps = 1000 100 | 101 | # agent 102 | from src.sarsa_agent import SarsaAgent 103 | alpha = 0.1 104 | gamma = 0.6 105 | agent = SarsaAgent(env, alpha, gamma) 106 | 107 | rewards, max_positions = train(agent, env, n_episodes=100, epsilon=0.1) -------------------------------------------------------------------------------- /02_mountain_car/src/momentum_agent.py: 
-------------------------------------------------------------------------------- 1 | from src.base_agent import BaseAgent 2 | 3 | class MomentumAgent(BaseAgent): 4 | 5 | def __init__(self, env): 6 | self.env = env 7 | 8 | self.valley_position = -0.5 9 | 10 | def get_action(self, state, epsilon=None) -> int: 11 | """ 12 | No input arguments to this function. 13 | The agent does not consider the state of the environment when deciding 14 | what to do next. 15 | """ 16 | velocity = state[1] 17 | 18 | if velocity > 0: 19 | # accelerate to the right 20 | action = 2 21 | else: 22 | # accelerate to the left 23 | action = 0 24 | 25 | return action 26 | 27 | def update_parameters(self, state, action, reward, next_state, epsilon): 28 | pass 29 | 30 | -------------------------------------------------------------------------------- /02_mountain_car/src/random_agent.py: -------------------------------------------------------------------------------- 1 | from src.base_agent import BaseAgent 2 | 3 | class RandomAgent(BaseAgent): 4 | """ 5 | This taxi driver selects actions randomly. 6 | You better not get into this taxi! 7 | """ 8 | def __init__(self, env): 9 | self.env = env 10 | 11 | def get_action(self, state, epsilon) -> int: 12 | """ 13 | No input arguments to this function. 14 | The agent does not consider the state of the environment when deciding 15 | what to do next. 16 | """ 17 | return self.env.action_space.sample() 18 | 19 | def update_parameters(self, state, action, reward, next_state, epsilon): 20 | pass 21 | 22 | -------------------------------------------------------------------------------- /02_mountain_car/src/sarsa_agent.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import random 3 | 4 | from src.base_agent import BaseAgent 5 | 6 | class SarsaAgent(BaseAgent): 7 | 8 | def __init__(self, env, alpha, gamma): 9 | 10 | self.env = env 11 | self.q_table = self._init_q_table() 12 | 13 | # hyper-parameters 14 | self.alpha = alpha 15 | self.gamma = gamma 16 | 17 | def _init_q_table(self) -> np.array: 18 | """ 19 | Return numpy array with 3 dimensions. 20 | The first 2 dimensions are the state components, i.e. position, speed. 21 | The third dimension is the action. 
22 | """ 23 | # discretize state space from a continuous to discrete 24 | high = self.env.observation_space.high 25 | low = self.env.observation_space.low 26 | n_states = (high - low) * np.array([10, 100]) 27 | n_states = np.round(n_states, 0).astype(int) + 1 28 | 29 | # table with q-values: n_states[0] * n_states[1] * n_actions 30 | return np.zeros([n_states[0], n_states[1], self.env.action_space.n]) 31 | 32 | def _discretize_state(self, state): 33 | min_states = self.env.observation_space.low 34 | state_discrete = (state - min_states) * np.array([10, 100]) 35 | return np.round(state_discrete, 0).astype(int) 36 | 37 | def get_action(self, state, epsilon=None): 38 | """""" 39 | if epsilon and random.uniform(0, 1) < epsilon: 40 | # Explore action space 41 | action = self.env.action_space.sample() 42 | else: 43 | # Exploit learned values 44 | state_discrete = self._discretize_state(state) 45 | action = np.argmax(self.q_table[state_discrete[0], state_discrete[1]]) 46 | 47 | return action 48 | 49 | def update_parameters(self, state, action, reward, next_state, epsilon): 50 | """""" 51 | s = self._discretize_state(state) 52 | ns = self._discretize_state(next_state) 53 | na = self.get_action(next_state, epsilon) 54 | 55 | delta = self.alpha * ( 56 | reward 57 | + self.gamma * self.q_table[ns[0], ns[1], na] 58 | - self.q_table[s[0], s[1], action] 59 | ) 60 | self.q_table[s[0], s[1], action] += delta -------------------------------------------------------------------------------- /02_mountain_car/src/viz.py: -------------------------------------------------------------------------------- 1 | from time import sleep 2 | from argparse import ArgumentParser 3 | from pdb import set_trace as stop 4 | 5 | import pandas as pd 6 | import gym 7 | 8 | from src.config import SAVED_AGENTS_DIR 9 | 10 | import numpy as np 11 | 12 | 13 | def plot_policy(agent, positions: np.arange, velocities: np.arange, figsize = None): 14 | """""" 15 | data = [] 16 | int2str = { 17 | 0: 'Accelerate Left', 18 | 1: 'Do nothing', 19 | 2: 'Accelerate Right' 20 | } 21 | for position in positions: 22 | for velocity in velocities: 23 | 24 | state = np.array([position, velocity]) 25 | action = int2str[agent.get_action(state)] 26 | 27 | data.append({ 28 | 'position': position, 29 | 'velocity': velocity, 30 | 'action': action, 31 | }) 32 | 33 | data = pd.DataFrame(data) 34 | 35 | import seaborn as sns 36 | import matplotlib.pyplot as plt 37 | 38 | if figsize: 39 | plt.figure(figsize=figsize) 40 | 41 | colors = { 42 | 'Accelerate Left': 'blue', 43 | 'Do nothing': 'grey', 44 | 'Accelerate Right': 'orange' 45 | } 46 | sns.scatterplot(x="position", y="velocity", hue="action", data=data, 47 | palette=colors) 48 | 49 | plt.show() 50 | return data 51 | 52 | def show_video(agent, env, sleep_sec: float = 0.1, mode: str = "rgb_array"): 53 | 54 | state = env.reset() 55 | done = False 56 | 57 | # LAPADULA 58 | if mode == "rgb_array": 59 | from matplotlib import pyplot as plt 60 | from IPython.display import display, clear_output 61 | steps = 0 62 | fig, ax = plt.subplots(figsize=(8, 6)) 63 | 64 | while not done: 65 | 66 | action = agent.get_action(state) 67 | state, reward, done, info = env.step(action) 68 | # LAPADULA 69 | if mode == "rgb_array": 70 | steps += 1 71 | frame = env.render(mode=mode) 72 | ax.cla() 73 | ax.axes.yaxis.set_visible(False) 74 | ax.imshow(frame, extent=[env.min_position, env.max_position, 0, 1]) 75 | ax.set_title(f'Steps: {steps}') 76 | display(fig) 77 | clear_output(wait=True) 78 | plt.pause(sleep_sec) 79 | else: 80 | 
env.render() 81 | 82 | 83 | if __name__ == '__main__': 84 | 85 | parser = ArgumentParser() 86 | parser.add_argument('--agent_file', type=str, required=True) 87 | parser.add_argument('--sleep_sec', type=float, required=False, default=0.1) 88 | args = parser.parse_args() 89 | 90 | from src.base_agent import BaseAgent 91 | agent_path = SAVED_AGENTS_DIR / args.agent_file 92 | agent = BaseAgent.load_from_disk(agent_path) 93 | 94 | env = gym.make('MountainCar-v0') 95 | env._max_episode_steps = 1000 96 | 97 | show_video(agent, env, sleep_sec=args.sleep_sec) 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 | -------------------------------------------------------------------------------- /03_cart_pole/.gitignore: -------------------------------------------------------------------------------- 1 | data_supervised_ml/* 2 | -------------------------------------------------------------------------------- /03_cart_pole/README.md: -------------------------------------------------------------------------------- 1 |
2 | # Parametric Q learning to solve the Cart Pole
3 | > There exists everywhere a medium in things, determined by equilibrium.
4 | > -- Dmitri Mendeleev
5 |
6 | 7 | ![](http://datamachines.xyz/wp-content/uploads/2022/01/pexels-yogendra-singh-1701202.jpg) 8 | 9 | ## Table of Contents 10 | * [Welcome 🤗](#welcome-) 11 | * [Lecture transcripts](#lecture-transcripts) 12 | * [Quick setup](#quick-setup) 13 | * [Notebooks](#notebooks) 14 | * [Let's connect](#lets-connect) 15 | 16 | ## Welcome 🤗 17 | 18 | In today's lecture we enter new territory... 19 | 20 | A territory where function approximation (aka supervised machine learning) 21 | meets good old Reinforcement Learning. 22 | 23 | And this is how Deep RL is born. 24 | 25 | We will solve the Cart Pole environment of OpenAI using **parametric Q-learning**. 26 | 27 | Today's lesson is split into 3 parts. 28 | 29 | ## Lecture transcripts 30 | 31 | [📝 1. Parametric Q learning](http://datamachines.xyz/2022/01/18/hands-on-reinforcement-learning-course-part-4-parametric-q-learning) 32 | [📝 2. Deep Q learning](http://datamachines.xyz/2022/02/11/hands-on-reinforcement-learning-course-part-5-deep-q-learning/) 33 | [📝 3. Hyperparameter search](http://datamachines.xyz/2022/03/03/hyperparameters-in-deep-rl-hands-on-course/) 34 | 35 | ## Quick setup 36 | 37 | Make sure you have Python >= 3.7. Otherwise, update it. 38 | 39 | 1. Pull the code from GitHub and cd into the `01_taxi` folder: 40 | ``` 41 | $ git clone https://github.com/Paulescu/hands-on-rl.git 42 | $ cd hands-on-rl/01_taxi 43 | ``` 44 | 45 | 2. Make sure you have the `virtualenv` tool in your Python installation 46 | ``` 47 | $ pip3 install virtualenv 48 | ``` 49 | 50 | 3. Create a virtual environment and activate it. 51 | ``` 52 | $ virtualenv -p python3 venv 53 | $ source venv/bin/activate 54 | ``` 55 | 56 | From this point onwards commands run inside the virtual environment. 57 | 58 | 59 | 3. Install dependencies and code from `src` folder in editable mode, so you can experiment with the code. 60 | ``` 61 | $ (venv) pip install -r requirements.txt 62 | $ (venv) export PYTHONPATH="." 63 | ``` 64 | 65 | 4. Open the notebooks, either with good old Jupyter or Jupyter lab 66 | ``` 67 | $ (venv) jupyter notebook 68 | ``` 69 | ``` 70 | $ (venv) jupyter lab 71 | ``` 72 | If both launch commands fail, try these: 73 | ``` 74 | $ (venv) jupyter notebook --NotebookApp.use_redirect_file=False 75 | ``` 76 | ``` 77 | $ (venv) jupyter lab --NotebookApp.use_redirect_file=False 78 | ``` 79 | 80 | 5. Play and learn. And do the homework 😉. 81 | 82 | ## Notebooks 83 | 84 | Parametric Q-learning 85 | - [Explore the environment](notebooks/00_environment.ipynb) 86 | - [Random agent baseline](notebooks/01_random_agent_baseline.ipynb) 87 | - [Linear Q agent with bad hyper-parameters](notebooks/02_linear_q_agent_bad_hyperparameters.ipynb) 88 | - [Linear Q agent with good hyper-parameters](notebooks/03_linear_q_agent_good_hyperparameters.ipynb) 89 | - [Homework](notebooks/04_homework.ipynb) 90 | 91 | Deep Q-learning 92 | - [Crash course on neural networks](notebooks/05_crash_course_on_neural_nets.ipynb) 93 | - [Deep Q agent with bad hyper-parameters](notebooks/06_deep_q_agent_bad_hyperparameters.ipynb) 94 | - [Deep Q agent with good hyper-parameters](notebooks/07_deep_q_agent_good_hyperparameters.ipynb) 95 | - [Homework](notebooks/08_homework.ipynb) 96 | 97 | Hyperparameter search 98 | - [Hyperparameter search](notebooks/09_hyperparameter_search.ipynb) 99 | - [Homework](notebooks/10_homework.ipynb) 100 | 101 | ## Let's connect! 102 | 103 | Do you wanna become a PRO in Machine Learning? 
104 | 105 | 👉🏽 Subscribe to the [datamachines newsletter](https://datamachines.xyz/subscribe/). 106 | 107 | 👉🏽 Follow me on [Medium](https://pau-labarta-bajo.medium.com/). -------------------------------------------------------------------------------- /03_cart_pole/images/deep_q_net.svg: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 |
[SVG content not reproduced: deep Q-network diagram. Inputs x, v, θ, ω; Hidden layer 1 (256 units); Hidden layer 2 (256 units); outputs Q(s, a=0) and Q(s, a=1).]
-------------------------------------------------------------------------------- /03_cart_pole/images/hparams_search_diagram.svg: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 |
[SVG content not reproduced: hyperparameter search loop diagram. 🔎 Select hyper-parameters → 🏋️ Train the agent → 🧪 Test the agent → happy with the results? If No, select again; if Yes, done!]
-------------------------------------------------------------------------------- /03_cart_pole/images/linear_model.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Paulescu/hands-on-rl/21c11b01fd6b950cfa16800b4dd9234d55b4a1ac/03_cart_pole/images/linear_model.jpg -------------------------------------------------------------------------------- /03_cart_pole/images/linear_model_sml.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Paulescu/hands-on-rl/21c11b01fd6b950cfa16800b4dd9234d55b4a1ac/03_cart_pole/images/linear_model_sml.jpg -------------------------------------------------------------------------------- /03_cart_pole/images/neural_net.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Paulescu/hands-on-rl/21c11b01fd6b950cfa16800b4dd9234d55b4a1ac/03_cart_pole/images/neural_net.jpg -------------------------------------------------------------------------------- /03_cart_pole/images/neural_net_homework.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Paulescu/hands-on-rl/21c11b01fd6b950cfa16800b4dd9234d55b4a1ac/03_cart_pole/images/neural_net_homework.jpg -------------------------------------------------------------------------------- /03_cart_pole/images/ngrok_example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Paulescu/hands-on-rl/21c11b01fd6b950cfa16800b4dd9234d55b4a1ac/03_cart_pole/images/ngrok_example.png -------------------------------------------------------------------------------- /03_cart_pole/images/nn_1_hidden_layer_sml.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Paulescu/hands-on-rl/21c11b01fd6b950cfa16800b4dd9234d55b4a1ac/03_cart_pole/images/nn_1_hidden_layer_sml.jpg -------------------------------------------------------------------------------- /03_cart_pole/images/nn_2_hidden_layers_sml.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Paulescu/hands-on-rl/21c11b01fd6b950cfa16800b4dd9234d55b4a1ac/03_cart_pole/images/nn_2_hidden_layers_sml.jpg -------------------------------------------------------------------------------- /03_cart_pole/images/nn_3_hidden_layers_sml.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Paulescu/hands-on-rl/21c11b01fd6b950cfa16800b4dd9234d55b4a1ac/03_cart_pole/images/nn_3_hidden_layers_sml.jpg -------------------------------------------------------------------------------- /03_cart_pole/images/optuna.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Paulescu/hands-on-rl/21c11b01fd6b950cfa16800b4dd9234d55b4a1ac/03_cart_pole/images/optuna.png -------------------------------------------------------------------------------- /03_cart_pole/mlflow_runs/readme.md: -------------------------------------------------------------------------------- 1 | MLflow logs are saved in this folder. 
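If you want to inspect these runs locally, pointing the MLflow UI at this folder should work (assuming `mlflow` is installed and the runs use the default file-based store), e.g. `mlflow ui --backend-store-uri 03_cart_pole/mlflow_runs` from the repo root.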
-------------------------------------------------------------------------------- /03_cart_pole/notebooks/00_environment.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "4c2ff31f", 6 | "metadata": {}, 7 | "source": [ 8 | "# 00 Environment" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "id": "04a5c882", 14 | "metadata": {}, 15 | "source": [ 16 | "#### 👉Before you solve a Reinforcement Learning problem you need to define what are\n", 17 | "- the actions\n", 18 | "- the states of the world\n", 19 | "- the rewards\n", 20 | "\n", 21 | "#### 👉We are using the `CartPole-v0` environment from [OpenAI's gym](https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py)\n", 22 | "\n", 23 | "#### 👉`CartPole-v0` is not an extremely difficult environment. However, it is complex enough to force us level up our game. The tools we will use to solve it are really powerful.\n", 24 | "\n", 25 | "#### 👉Let's explore it!" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 44, 31 | "id": "e3629346", 32 | "metadata": {}, 33 | "outputs": [ 34 | { 35 | "name": "stdout", 36 | "output_type": "stream", 37 | "text": [ 38 | "The autoreload extension is already loaded. To reload it, use:\n", 39 | " %reload_ext autoreload\n", 40 | "Populating the interactive namespace from numpy and matplotlib\n" 41 | ] 42 | } 43 | ], 44 | "source": [ 45 | "%load_ext autoreload\n", 46 | "%autoreload 2\n", 47 | "%pylab inline\n", 48 | "%config InlineBackend.figure_format = 'svg'\n", 49 | "\n", 50 | "from matplotlib import pyplot as plt\n", 51 | "%matplotlib inline" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "id": "76e9a06d", 57 | "metadata": {}, 58 | "source": [ 59 | "## Load the environment 🌎" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": 45, 65 | "id": "ebfba291", 66 | "metadata": {}, 67 | "outputs": [], 68 | "source": [ 69 | "import gym\n", 70 | "env = gym.make('CartPole-v1')" 71 | ] 72 | }, 73 | { 74 | "cell_type": "markdown", 75 | "id": "c6e2bc37", 76 | "metadata": {}, 77 | "source": [ 78 | "## The goal\n", 79 | "### is to keep the pole in an upright position as long as you can by moving the cart a the bottom, left and right." 
80 | ] 81 | }, 82 | { 83 | "cell_type": "markdown", 84 | "id": "7babf939", 85 | "metadata": {}, 86 | "source": [ 87 | "![title](../images/cart_pole.jpg)" 88 | ] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "id": "9cb921cf", 93 | "metadata": {}, 94 | "source": [ 95 | "## Let's see how a good agent solves this problem" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": null, 101 | "id": "92dcbf74", 102 | "metadata": {}, 103 | "outputs": [], 104 | "source": [] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": 30, 109 | "id": "2ded2ba5", 110 | "metadata": {}, 111 | "outputs": [ 112 | { 113 | "data": { 114 | "text/plain": [ 115 | "" 116 | ] 117 | }, 118 | "execution_count": 30, 119 | "metadata": {}, 120 | "output_type": "execute_result" 121 | }, 122 | { 123 | "data": { 124 | "image/svg+xml": [ 125 | "\n", 126 | "\n", 128 | "\n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " 2022-01-10T09:36:37.476916\n", 134 | " image/svg+xml\n", 135 | " \n", 136 | " \n", 137 | " Matplotlib v3.5.1, https://matplotlib.org/\n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 192 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | "\n" 348 | ], 349 | "text/plain": [ 350 | "
" 351 | ] 352 | }, 353 | "metadata": { 354 | "needs_background": "light" 355 | }, 356 | "output_type": "display_data" 357 | } 358 | ], 359 | "source": [ 360 | "# env.reset()\n", 361 | "# frame = env.render(mode='rgb_array')\n", 362 | "\n", 363 | "# fig, ax = plt.subplots(figsize=(8, 6))\n", 364 | "# ax.axes.yaxis.set_visible(False)\n", 365 | "# min_x = env.observation_space.low[0]\n", 366 | "# max_x = env.observation_space.high[0]\n", 367 | "# ax.imshow(frame, extent=[min_x, max_x, 0, 8])\n", 368 | "\n" 369 | ] 370 | }, 371 | { 372 | "cell_type": "markdown", 373 | "id": "4f53a38e", 374 | "metadata": {}, 375 | "source": [ 376 | "## State space" 377 | ] 378 | }, 379 | { 380 | "cell_type": "code", 381 | "execution_count": 51, 382 | "id": "e809514b", 383 | "metadata": {}, 384 | "outputs": [ 385 | { 386 | "name": "stdout", 387 | "output_type": "stream", 388 | "text": [ 389 | "Cart position from -4.80 to 4.80\n", 390 | "Cart velocity from -3.40E+38 to 3.40E+38\n", 391 | "Angle from -0.42 to 0.42\n", 392 | "Angular velocity from -3.40E+38 to 3.40E+38\n" 393 | ] 394 | } 395 | ], 396 | "source": [ 397 | "# The state consists of 4 numbers:\n", 398 | "x_min, v_min, angle_min, angular_v_min = env.observation_space.low\n", 399 | "x_max, v_max, angle_max, angular_v_max = env.observation_space.high\n", 400 | "\n", 401 | "print(f'Cart position from {x_min:.2f} to {x_max:.2f}')\n", 402 | "print(f'Cart velocity from {v_min:.2E} to {v_max:.2E}')\n", 403 | "print(f'Angle from {angle_min:.2f} to {angle_max:.2f}')\n", 404 | "print(f'Angular velocity from {angular_v_min:.2E} to {angular_v_max:.2E}')" 405 | ] 406 | }, 407 | { 408 | "cell_type": "markdown", 409 | "id": "f413604e", 410 | "metadata": {}, 411 | "source": [ 412 | "[IMAGE]" 413 | ] 414 | }, 415 | { 416 | "cell_type": "markdown", 417 | "id": "5e0c527b", 418 | "metadata": {}, 419 | "source": [ 420 | "### The ranges for the cart velocity and pole angular velocity are a bit too large, aren't they?\n", 421 | "\n", 422 | "👉 As a general principle, the high/low state values you can read from `env.observation_space`\n", 423 | "are set very conservatively, to guarantee that the state value alwayas lies between the max and the min.\n", 424 | "\n", 425 | "👉In practice, you need to simulate a few interactions with the environment to really see the actual intervals where the state components lie.\n", 426 | "\n", 427 | "👉 Knowing the max and min values for each state component is going to be useful later when we normalize the inputs to our Parametric models." 
428 | ] 429 | }, 430 | { 431 | "cell_type": "markdown", 432 | "id": "1fcfc13a", 433 | "metadata": {}, 434 | "source": [ 435 | "## Action space\n", 436 | "\n", 437 | "- `0` Push cart to the left\n", 438 | "- `1` Push cart to the right" 439 | ] 440 | }, 441 | { 442 | "cell_type": "code", 443 | "execution_count": 43, 444 | "id": "98cfdb84", 445 | "metadata": {}, 446 | "outputs": [ 447 | { 448 | "name": "stdout", 449 | "output_type": "stream", 450 | "text": [ 451 | "Action Space Discrete(2)\n" 452 | ] 453 | } 454 | ], 455 | "source": [ 456 | "print(\"Action Space {}\".format(env.action_space))" 457 | ] 458 | }, 459 | { 460 | "cell_type": "markdown", 461 | "id": "c8f6a690", 462 | "metadata": {}, 463 | "source": [ 464 | "## Rewards\n", 465 | "\n", 466 | "- A reward of +1 is given for every step the pole stays upright.\n", 467 | "- The episode ends once the pole tilts too far from vertical (more than 12 degrees), the cart leaves the track, or the max number of steps has been reached: `n_steps >= env._max_episode_steps`\n", 468 | "\n", 469 | "In other words, the longer the agent keeps the pole balanced, the more reward it collects." 470 | ] 471 | }, 472 | { 473 | "cell_type": "code", 474 | "execution_count": null, 475 | "id": "578d1ba3", 476 | "metadata": {}, 477 | "outputs": [], 478 | "source": [] 479 | } 480 | ], 481 | "metadata": { 482 | "kernelspec": { 483 | "display_name": "Python 3 (ipykernel)", 484 | "language": "python", 485 | "name": "python3" 486 | }, 487 | "language_info": { 488 | "codemirror_mode": { 489 | "name": "ipython", 490 | "version": 3 491 | }, 492 | "file_extension": ".py", 493 | "mimetype": "text/x-python", 494 | "name": "python", 495 | "nbconvert_exporter": "python", 496 | "pygments_lexer": "ipython3", 497 | "version": "3.7.5" 498 | } 499 | }, 500 | "nbformat": 4, 501 | "nbformat_minor": 5 502 | } 503 | -------------------------------------------------------------------------------- /03_cart_pole/notebooks/04_homework.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "f0fd6807", 6 | "metadata": {}, 7 | "source": [ 8 | "# 04 Homework 🏋️🏋️🏋️" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "id": "abcf6613", 14 | "metadata": {}, 15 | "source": [ 16 | "#### 👉A course without homework is not a course!\n", 17 | "\n", 18 | "#### 👉Spend some time thinking and trying to implement the challenges I propose here.\n", 19 | "\n", 20 | "#### 👉Feel free to email me your solutions at:" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "id": "d1d983a3", 26 | "metadata": {}, 27 | "source": [ 28 | "# `plabartabajo@gmail.com`" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "id": "86f82e45", 34 | "metadata": {}, 35 | "source": [ 36 | "-----" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "id": "67656662", 42 | "metadata": {}, 43 | "source": [ 44 | "## 1. Can you use 3 different `SEED` values and re-train the agent with good hyper-parameters?\n", 45 | "\n", 46 | "Do you still train a good agent? Or does the seed really affect the training outcome?" 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "id": "c0a46bf7", 52 | "metadata": {}, 53 | "source": [ 54 | "## 2. Can you solve the `MountainCar-v0` environment using today's code?\n", 55 | "\n", 56 | "Are you able to score 99% on the evaluation set?"
57 | ] 58 | } 59 | ], 60 | "metadata": { 61 | "kernelspec": { 62 | "display_name": "Python 3 (ipykernel)", 63 | "language": "python", 64 | "name": "python3" 65 | }, 66 | "language_info": { 67 | "codemirror_mode": { 68 | "name": "ipython", 69 | "version": 3 70 | }, 71 | "file_extension": ".py", 72 | "mimetype": "text/x-python", 73 | "name": "python", 74 | "nbconvert_exporter": "python", 75 | "pygments_lexer": "ipython3", 76 | "version": "3.7.5" 77 | } 78 | }, 79 | "nbformat": 4, 80 | "nbformat_minor": 5 81 | } 82 | -------------------------------------------------------------------------------- /03_cart_pole/notebooks/08_homework.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "f0fd6807", 6 | "metadata": {}, 7 | "source": [ 8 | "# 08 Homework 🏋️🏋️🏋️" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "id": "abcf6613", 14 | "metadata": {}, 15 | "source": [ 16 | "#### 👉A course without homework is not a course!\n", 17 | "\n", 18 | "#### 👉Spend some time thinking and trying to implement the challenges I propose here.\n", 19 | "\n", 20 | "#### 👉Feel free to email me your solutions at:" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "id": "d1d983a3", 26 | "metadata": {}, 27 | "source": [ 28 | "# `plabartabajo@gmail.com`" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "id": "86f82e45", 34 | "metadata": {}, 35 | "source": [ 36 | "-----" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "id": "67656662", 42 | "metadata": {}, 43 | "source": [ 44 | "### 1. Re-train the neural networks from `05_crash_course_on_neural_nets.ipynb` with a larger training set, e.g. `10,000 samples`?\n", 45 | "\n", 46 | "👉Do the validation metrics improve?\n", 47 | "\n", 48 | "👉Did you manage to get to 95% validation accuracy?" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "id": "c0a46bf7", 54 | "metadata": {}, 55 | "source": [ 56 | "## 2. 
Can you perfectly solve the `Cart Pole` environment using a neural network with only 1 hidden layer?\n", 57 | "\n", 58 | "\n", 59 | "![](https://github.com/Paulescu/hands-on-rl/blob/main/03_cart_pole/images/neural_net_homework.jpg?raw=true)" 60 | ] 61 | } 62 | ], 63 | "metadata": { 64 | "kernelspec": { 65 | "display_name": "Python 3 (ipykernel)", 66 | "language": "python", 67 | "name": "python3" 68 | }, 69 | "language_info": { 70 | "codemirror_mode": { 71 | "name": "ipython", 72 | "version": 3 73 | }, 74 | "file_extension": ".py", 75 | "mimetype": "text/x-python", 76 | "name": "python", 77 | "nbconvert_exporter": "python", 78 | "pygments_lexer": "ipython3", 79 | "version": "3.7.5" 80 | } 81 | }, 82 | "nbformat": 4, 83 | "nbformat_minor": 5 84 | } 85 | -------------------------------------------------------------------------------- /03_cart_pole/notebooks/10_homework.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "f0fd6807", 6 | "metadata": {}, 7 | "source": [ 8 | "# 10 Homework 🏋️🏋️🏋️" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "id": "3f1582da", 14 | "metadata": {}, 15 | "source": [ 16 | "## Challenge" 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "id": "67656662", 22 | "metadata": {}, 23 | "source": [ 24 | "If you carefully look at `sample_hyper_parameters()` in `src/optimize_hyperparameters.py` you will see I did not let Optuna test different neural network architectures.\n", 25 | "\n", 26 | "I set `nn_hidden_layers = [256, 256]` and that was it.\n", 27 | "\n", 28 | "I dare you find the smallest neural network architecture that solves the `CartPole` perfectly." 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "id": "4ea9756e", 34 | "metadata": {}, 35 | "source": [ 36 | "## Send your solution through\n", 37 | "\n", 38 | "- a pull request or\n", 39 | "- direcly by email at `plabartabajo@gmail.com`" 40 | ] 41 | } 42 | ], 43 | "metadata": { 44 | "kernelspec": { 45 | "display_name": "Python 3 (ipykernel)", 46 | "language": "python", 47 | "name": "python3" 48 | }, 49 | "language_info": { 50 | "codemirror_mode": { 51 | "name": "ipython", 52 | "version": 3 53 | }, 54 | "file_extension": ".py", 55 | "mimetype": "text/x-python", 56 | "name": "python", 57 | "nbconvert_exporter": "python", 58 | "pygments_lexer": "ipython3", 59 | "version": "3.7.5" 60 | } 61 | }, 62 | "nbformat": 4, 63 | "nbformat_minor": 5 64 | } 65 | -------------------------------------------------------------------------------- /03_cart_pole/pyproject.toml: -------------------------------------------------------------------------------- 1 | [tool.poetry] 2 | name = "src" 3 | version = "0.1.0" 4 | description = "" 5 | authors = ["Pau "] 6 | 7 | [tool.poetry.dependencies] 8 | python = ">=3.7.1,<3.8" 9 | gym = "^0.21.0" 10 | sklearn = "^0.0" 11 | numpy = "^1.21.4" 12 | matplotlib = "^3.5.0" 13 | jupyter = "^1.0.0" 14 | tqdm = "^4.62.3" 15 | torch = "^1.10.1" 16 | tensorboard = "^2.7.0" 17 | pandas = "^1.3.5" 18 | PyYAML = "^6.0" 19 | pyglet = "^1.5.21" 20 | mlflow = "^1.22.0" 21 | gdown = "^4.2.0" 22 | optuna = "^2.10.0" 23 | pyngrok = "^5.1.0" 24 | 25 | [tool.poetry.dev-dependencies] 26 | pytest = "^5.2" 27 | certifi = "^2021.10.8" 28 | 29 | [build-system] 30 | requires = ["poetry-core>=1.0.0"] 31 | build-backend = "poetry.core.masonry.api" 32 | -------------------------------------------------------------------------------- /03_cart_pole/requirements.txt: 
-------------------------------------------------------------------------------- 1 | absl-py==1.0.0; python_version >= "3.6" 2 | alembic==1.4.1; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6" 3 | appnope==0.1.2; platform_system == "Darwin" and python_version >= "3.7" and sys_platform == "darwin" 4 | argcomplete==2.0.0; python_version < "3.8.0" and python_version >= "3.7" 5 | argon2-cffi-bindings==21.2.0; python_version >= "3.6" 6 | argon2-cffi==21.3.0; python_version >= "3.6" 7 | atomicwrites==1.4.0; python_version >= "3.5" and python_full_version < "3.0.0" and sys_platform == "win32" or sys_platform == "win32" and python_version >= "3.5" and python_full_version >= "3.4.0" 8 | attrs==21.4.0; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.7" 9 | autopage==0.5.0; python_version >= "3.6" 10 | backcall==0.2.0; python_version >= "3.7" 11 | beautifulsoup4==4.10.0; python_full_version > "3.0.0" 12 | bleach==4.1.0; python_version >= "3.7" 13 | cachetools==4.2.4; python_version >= "3.5" and python_version < "4.0" and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6") 14 | certifi==2021.10.8 15 | cffi==1.15.0; implementation_name == "pypy" and python_version >= "3.6" 16 | charset-normalizer==2.0.9; python_full_version >= "3.6.0" and python_version >= "3.6" 17 | click==8.0.3; python_version >= "3.6" 18 | cliff==3.10.1; python_version >= "3.6" 19 | cloudpickle==2.0.0; python_version >= "3.6" 20 | cmaes==0.8.2; python_version >= "3.6" 21 | cmd2==2.4.0; python_version >= "3.6" 22 | colorama==0.4.4; python_version >= "3.7" and python_full_version < "3.0.0" and platform_system == "Windows" and sys_platform == "win32" or python_full_version >= "3.5.0" and platform_system == "Windows" and sys_platform == "win32" and python_version >= "3.7" 23 | colorlog==6.6.0; python_version >= "3.6" 24 | cycler==0.11.0; python_version >= "3.7" 25 | databricks-cli==0.16.2; python_version >= "3.6" 26 | debugpy==1.5.1; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.7" 27 | decorator==5.1.0; python_version >= "3.7" 28 | defusedxml==0.7.1; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.7" 29 | docker==5.0.3; python_version >= "3.6" 30 | entrypoints==0.3; python_full_version >= "3.6.1" and python_version >= "3.7" 31 | filelock==3.4.2; python_version >= "3.7" 32 | flask==2.0.2; python_version >= "3.6" 33 | fonttools==4.28.5; python_version >= "3.7" 34 | gdown==4.2.0 35 | gitdb==4.0.9; python_version >= "3.7" 36 | gitpython==3.1.26; python_version >= "3.7" 37 | google-auth-oauthlib==0.4.6; python_version >= "3.6" 38 | google-auth==2.3.3; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6" 39 | greenlet==1.1.2; python_version >= "3" and python_full_version < "3.0.0" and (platform_machine == "aarch64" or platform_machine == "ppc64le" or platform_machine == "x86_64" or platform_machine == "amd64" or platform_machine == "AMD64" or platform_machine == "win32" or platform_machine == "WIN32") and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6") or python_version >= "3" and (platform_machine == "aarch64" or platform_machine == "ppc64le" or 
platform_machine == "x86_64" or platform_machine == "amd64" or platform_machine == "AMD64" or platform_machine == "win32" or platform_machine == "WIN32") and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6") and python_full_version >= "3.5.0" 40 | grpcio==1.43.0; python_version >= "3.6" 41 | gunicorn==20.1.0; platform_system != "Windows" and python_version >= "3.6" 42 | gym==0.21.0; python_version >= "3.6" 43 | idna==3.3; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6" 44 | importlib-metadata==4.10.0; python_version == "3.7" and (python_version >= "3.7" and python_full_version < "3.0.0" and python_version < "3.8" or python_full_version >= "3.6.0" and python_version >= "3.7" and python_version < "3.8") and (python_version >= "3.7" and python_full_version < "3.0.0" and python_version < "3.8" or python_full_version >= "3.4.0" and python_version >= "3.7" and python_version < "3.8") 45 | importlib-resources==5.4.0; python_version < "3.9" and python_version >= "3.7" 46 | ipykernel==6.6.1; python_version >= "3.7" 47 | ipython-genutils==0.2.0; python_version >= "3.6" 48 | ipython==7.30.1; python_version >= "3.7" 49 | ipywidgets==7.6.5 50 | itsdangerous==2.0.1; python_version >= "3.6" 51 | jedi==0.18.1; python_version >= "3.7" 52 | jinja2==3.0.3; python_version >= "3.7" 53 | joblib==1.1.0; python_version >= "3.7" 54 | jsonschema==4.3.3; python_version >= "3.7" 55 | jupyter-client==7.1.0; python_full_version >= "3.6.1" and python_version >= "3.7" 56 | jupyter-console==6.4.0; python_version >= "3.6" 57 | jupyter-core==4.9.1; python_full_version >= "3.6.1" and python_version >= "3.7" 58 | jupyter==1.0.0 59 | jupyterlab-pygments==0.1.2; python_version >= "3.7" 60 | jupyterlab-widgets==1.0.2; python_version >= "3.6" 61 | kiwisolver==1.3.2; python_version >= "3.7" 62 | mako==1.1.6; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6" 63 | markdown==3.3.6; python_version >= "3.6" 64 | markupsafe==2.0.1; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.7" 65 | matplotlib-inline==0.1.3; python_version >= "3.7" 66 | matplotlib==3.5.1; python_version >= "3.7" 67 | mistune==0.8.4; python_version >= "3.7" 68 | mlflow==1.22.0; python_version >= "3.6" 69 | more-itertools==8.12.0; python_version >= "3.5" 70 | nbclient==0.5.9; python_full_version >= "3.6.1" and python_version >= "3.7" 71 | nbconvert==6.4.0; python_version >= "3.7" 72 | nbformat==5.1.3; python_full_version >= "3.6.1" and python_version >= "3.7" 73 | nest-asyncio==1.5.4; python_full_version >= "3.6.1" and python_version >= "3.7" 74 | notebook==6.4.6; python_version >= "3.6" 75 | numpy==1.21.5; python_version >= "3.7" and python_version < "3.11" 76 | oauthlib==3.1.1; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6" 77 | optuna==2.10.0; python_version >= "3.6" 78 | packaging==21.3; python_version >= "3.7" 79 | pandas==1.3.5; python_full_version >= "3.7.1" 80 | pandocfilters==1.5.0; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.7" 81 | parso==0.8.3; python_version >= "3.7" 82 | pbr==5.8.1; python_version >= "3.6" 83 | pexpect==4.8.0; sys_platform != "win32" and python_version >= "3.7" 84 | pickleshare==0.7.5; python_version >= "3.7" 85 | 
pillow==9.0.0; python_version >= "3.7" 86 | pluggy==0.13.1; python_version >= "3.5" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.5" 87 | prettytable==3.1.1; python_version >= "3.7" 88 | prometheus-client==0.12.0; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6" 89 | prometheus-flask-exporter==0.18.7; python_version >= "3.6" 90 | prompt-toolkit==3.0.24; python_full_version >= "3.6.2" and python_version >= "3.7" 91 | protobuf==3.19.1; python_version >= "3.6" 92 | ptyprocess==0.7.0; os_name != "nt" and python_version >= "3.7" and sys_platform != "win32" 93 | py==1.11.0; python_version >= "3.6" and python_full_version < "3.0.0" and implementation_name == "pypy" or python_full_version >= "3.5.0" and python_version >= "3.6" and implementation_name == "pypy" 94 | pyasn1-modules==0.2.8; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6" 95 | pyasn1==0.4.8; python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "4" and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6") or python_full_version >= "3.6.0" and python_version >= "3.6" and python_version < "4" and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6") 96 | pycparser==2.21; python_version >= "3.6" and python_full_version < "3.0.0" and implementation_name == "pypy" or implementation_name == "pypy" and python_version >= "3.6" and python_full_version >= "3.4.0" 97 | pyglet==1.5.21 98 | pygments==2.11.1; python_version >= "3.7" 99 | pyngrok==5.1.0; python_version >= "3.5" 100 | pyparsing==3.0.6; python_version >= "3.7" 101 | pyperclip==1.8.2; python_version >= "3.6" 102 | pyreadline3==3.4.1; sys_platform == "win32" and python_version >= "3.6" 103 | pyrsistent==0.18.0; python_version >= "3.7" 104 | pysocks==1.7.1; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6" 105 | pytest==5.4.3; python_version >= "3.5" 106 | python-dateutil==2.8.2; python_full_version >= "3.7.1" and python_version >= "3.7" and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6") 107 | python-editor==1.0.4; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6" 108 | pytz==2021.3; python_full_version >= "3.7.1" and python_version >= "3.6" 109 | pywin32==227; sys_platform == "win32" and python_version >= "3.7" and platform_python_implementation != "PyPy" 110 | pywinpty==1.1.6; os_name == "nt" and python_version >= "3.6" 111 | pyyaml==6.0; python_version >= "3.6" 112 | pyzmq==22.3.0; python_full_version >= "3.6.1" and python_version >= "3.7" 113 | qtconsole==5.2.2; python_version >= "3.6" 114 | qtpy==2.0.0; python_version >= "3.6" 115 | querystring-parser==1.2.4; python_version >= "3.6" 116 | requests-oauthlib==1.3.0; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6" 117 | requests==2.27.0; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6" 118 | rsa==4.8; python_version >= "3.6" and python_version < "4" and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= 
"3.6.0" and python_version >= "3.6") 119 | scikit-learn==1.0.2; python_version >= "3.7" 120 | scipy==1.7.3; python_version >= "3.7" and python_version < "3.11" 121 | send2trash==1.8.0; python_version >= "3.6" 122 | setuptools-scm==6.3.2; python_version >= "3.7" 123 | six==1.16.0; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.7" 124 | sklearn==0.0 125 | smmap==5.0.0; python_version >= "3.7" 126 | soupsieve==2.3.1; python_version >= "3.6" and python_full_version > "3.0.0" 127 | sqlalchemy==1.4.29; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6" 128 | sqlparse==0.4.2; python_version >= "3.6" 129 | stevedore==3.5.0; python_version >= "3.6" 130 | tabulate==0.8.9; python_version >= "3.6" 131 | tensorboard-data-server==0.6.1; python_version >= "3.6" 132 | tensorboard-plugin-wit==1.8.0; python_version >= "3.6" 133 | tensorboard==2.7.0; python_version >= "3.6" 134 | terminado==0.12.1; python_version >= "3.6" 135 | testpath==0.5.0; python_version >= "3.7" 136 | threadpoolctl==3.0.0; python_version >= "3.7" 137 | tomli==2.0.0; python_version >= "3.7" 138 | torch==1.10.1; python_full_version >= "3.6.2" 139 | tornado==6.1; python_full_version >= "3.6.1" and python_version >= "3.7" 140 | tqdm==4.62.3; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.4.0") 141 | traitlets==5.1.1; python_full_version >= "3.6.1" and python_version >= "3.7" 142 | typing-extensions==4.0.1; python_version >= "3.7" and python_full_version >= "3.6.2" and python_version < "3.8" 143 | urllib3==1.26.7; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version < "4" and python_version >= "3.6" 144 | waitress==2.0.0; platform_system == "Windows" and python_version >= "3.6" and python_full_version >= "3.6.0" 145 | wcwidth==0.2.5; python_full_version >= "3.6.2" and python_version >= "3.7" 146 | webencodings==0.5.1; python_version >= "3.7" 147 | websocket-client==1.2.3; python_version >= "3.6" 148 | werkzeug==2.0.2; python_version >= "3.6" 149 | widgetsnbextension==3.5.2 150 | zipp==3.7.0; python_version < "3.8" and python_version >= "3.7" 151 | -------------------------------------------------------------------------------- /03_cart_pole/saved_agents/CartPole-v1/0/hparams.json: -------------------------------------------------------------------------------- 1 | {"learning_rate": 0.119449136260578, "discount_factor": 0.99, "batch_size": 128, "memory_size": 100000, "freq_steps_update_target": 1000, "n_steps_warm_up_memory": 1000, "freq_steps_train": 16, "n_gradient_steps": 16, "nn_hidden_layers": null, "max_grad_norm": 1, "normalize_state": false, "epsilon_start": 0.9, "epsilon_end": 0.1421425009699689, "steps_epsilon_decay": 100000} -------------------------------------------------------------------------------- /03_cart_pole/saved_agents/CartPole-v1/0/model: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Paulescu/hands-on-rl/21c11b01fd6b950cfa16800b4dd9234d55b4a1ac/03_cart_pole/saved_agents/CartPole-v1/0/model -------------------------------------------------------------------------------- /03_cart_pole/saved_agents/readme.md: -------------------------------------------------------------------------------- 1 | ### Trained agents are saved in this folder -------------------------------------------------------------------------------- 
/03_cart_pole/src/__init__.py: -------------------------------------------------------------------------------- 1 | __version__ = '0.1.0' 2 | -------------------------------------------------------------------------------- /03_cart_pole/src/agent_memory.py: -------------------------------------------------------------------------------- 1 | from collections import namedtuple, deque 2 | import random 3 | 4 | Transition = namedtuple('Transition', ('state', 'action', 'reward', 'next_state', 'done')) 5 | 6 | 7 | class AgentMemory: 8 | 9 | def __init__(self, memory_size): 10 | self.memory = deque([], maxlen=memory_size) 11 | 12 | def push(self, *args): 13 | """Save a transition""" 14 | self.memory.append(Transition(*args)) 15 | 16 | def sample(self, batch_size): 17 | transitions = random.sample(self.memory, batch_size) 18 | 19 | # stop() 20 | 21 | return Transition(*zip(*transitions)) 22 | 23 | def __len__(self): 24 | return len(self.memory) -------------------------------------------------------------------------------- /03_cart_pole/src/config.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pathlib 3 | root_dir = pathlib.Path(__file__).parent.resolve().parent 4 | 5 | SAVED_AGENTS_DIR = root_dir / 'saved_agents' 6 | TENSORBOARD_LOG_DIR = root_dir / 'tensorboard_logs' 7 | OPTUNA_DB = root_dir / 'optuna.db' 8 | DATA_SUPERVISED_ML = root_dir / 'data_supervised_ml' 9 | MLFLOW_RUNS_DIR = root_dir / 'mlflow_runs' 10 | 11 | if not SAVED_AGENTS_DIR.exists(): 12 | os.makedirs(SAVED_AGENTS_DIR) 13 | 14 | if not TENSORBOARD_LOG_DIR.exists(): 15 | os.makedirs(TENSORBOARD_LOG_DIR) 16 | 17 | if not DATA_SUPERVISED_ML.exists(): 18 | os.makedirs(DATA_SUPERVISED_ML) 19 | 20 | if not MLFLOW_RUNS_DIR.exists(): 21 | os.makedirs(MLFLOW_RUNS_DIR) -------------------------------------------------------------------------------- /03_cart_pole/src/loops.py: -------------------------------------------------------------------------------- 1 | from typing import Tuple, List, Callable, Union, Optional 2 | import random 3 | from pathlib import Path 4 | from collections import deque 5 | from pdb import set_trace as stop 6 | 7 | import numpy as np 8 | from tqdm import tqdm 9 | import torch 10 | from torch.utils.tensorboard import SummaryWriter 11 | 12 | 13 | 14 | def train( 15 | agent, 16 | env, 17 | n_episodes: int, 18 | log_dir: Optional[Path] = None, 19 | max_steps: Optional[int] = float("inf"), 20 | n_episodes_evaluate_agent: Optional[int] = 100, 21 | freq_episodes_evaluate_agent: int = 200, 22 | ) -> None: 23 | 24 | # Tensorborad log writer 25 | logging = False 26 | if log_dir is not None: 27 | writer = SummaryWriter(log_dir) 28 | logging = True 29 | 30 | reward_per_episode = [] 31 | steps_per_episode = [] 32 | global_step_counter = 0 33 | 34 | for i in tqdm(range(0, n_episodes)): 35 | 36 | state = env.reset() 37 | 38 | rewards = 0 39 | steps = 0 40 | done = False 41 | while not done: 42 | 43 | action = agent.act(state) 44 | 45 | # agents takes a step and the environment throws out a new state and 46 | # a reward 47 | next_state, reward, done, info = env.step(action) 48 | 49 | # agent observes transition and stores it for later use 50 | agent.observe(state, action, reward, next_state, done) 51 | 52 | # learning happens here, through experience replay 53 | agent.replay() 54 | 55 | global_step_counter += 1 56 | steps += 1 57 | rewards += reward 58 | state = next_state 59 | 60 | # log to Tensorboard 61 | if logging: 62 | writer.add_scalar('train/rewards', 
rewards, i) 63 | writer.add_scalar('train/steps', steps, i) 64 | writer.add_scalar('train/epsilon', agent.epsilon, i) 65 | writer.add_scalar('train/replay_memory_size', len(agent.memory), i) 66 | 67 | reward_per_episode.append(rewards) 68 | steps_per_episode.append(steps) 69 | 70 | # if (i > 0) and (i % freq_episodes_evaluate_agent) == 0: 71 | if (i + 1) % freq_episodes_evaluate_agent == 0: 72 | # evaluate agent 73 | eval_rewards, eval_steps = evaluate(agent, env, 74 | n_episodes=n_episodes_evaluate_agent, 75 | epsilon=0.01) 76 | 77 | # from src.utils import get_success_rate_from_n_steps 78 | # success_rate = get_success_rate_from_n_steps(env, eval_steps) 79 | print(f'Reward mean: {np.mean(eval_rewards):.2f}, std: {np.std(eval_rewards):.2f}') 80 | print(f'Num steps mean: {np.mean(eval_steps):.2f}, std: {np.std(eval_steps):.2f}') 81 | # print(f'Success rate: {success_rate:.2%}') 82 | if logging: 83 | writer.add_scalar('eval/avg_reward', np.mean(eval_rewards), i) 84 | writer.add_scalar('eval/avg_steps', np.mean(eval_steps), i) 85 | # writer.add_scalar('eval/success_rate', success_rate, i) 86 | 87 | if global_step_counter > max_steps: 88 | break 89 | 90 | 91 | def evaluate( 92 | agent, 93 | env, 94 | n_episodes: int, 95 | epsilon: Optional[float] = None, 96 | seed: Optional[int] = 0, 97 | ) -> Tuple[List, List]: 98 | 99 | from src.utils import set_seed 100 | set_seed(env, seed) 101 | 102 | # output metrics 103 | reward_per_episode = [] 104 | steps_per_episode = [] 105 | 106 | for i in tqdm(range(0, n_episodes)): 107 | 108 | state = env.reset() 109 | rewards = 0 110 | steps = 0 111 | done = False 112 | while not done: 113 | 114 | action = agent.act(state, epsilon=epsilon) 115 | next_state, reward, done, info = env.step(action) 116 | 117 | rewards += reward 118 | steps += 1 119 | state = next_state 120 | 121 | reward_per_episode.append(rewards) 122 | steps_per_episode.append(steps) 123 | 124 | return reward_per_episode, steps_per_episode -------------------------------------------------------------------------------- /03_cart_pole/src/model_factory.py: -------------------------------------------------------------------------------- 1 | from typing import Optional, List 2 | from pdb import set_trace as stop 3 | 4 | import torch.nn as nn 5 | 6 | 7 | def get_model( 8 | input_dim: int, 9 | output_dim: int, 10 | hidden_layers: Optional[List[int]] = None, 11 | ): 12 | """ 13 | Feed-forward network, made of linear layers with ReLU activation functions 14 | The number of layers, and their size is given by `hidden_layers`. 15 | """ 16 | # assert init_method in {'default', 'xavier'} 17 | 18 | if hidden_layers is None: 19 | # linear model 20 | model = nn.Sequential(nn.Linear(input_dim, output_dim)) 21 | 22 | else: 23 | # neural network 24 | # there are hidden layers in this case. 
25 | dims = [input_dim] + hidden_layers + [output_dim] 26 | modules = [] 27 | for i, dim in enumerate(dims[:-2]): 28 | modules.append(nn.Linear(dims[i], dims[i + 1])) 29 | modules.append(nn.ReLU()) 30 | 31 | modules.append(nn.Linear(dims[-2], dims[-1])) 32 | model = nn.Sequential(*modules) 33 | # stop() 34 | 35 | # n_parameters = sum(p.numel() for p in model.parameters() if p.requires_grad) 36 | # print(f'{n_parameters:,} parameters') 37 | 38 | return model 39 | 40 | def count_parameters(model: nn.Module) -> int: 41 | """""" 42 | return sum(p.numel() for p in model.parameters() if p.requires_grad) -------------------------------------------------------------------------------- /03_cart_pole/src/optimize_hyperparameters.py: -------------------------------------------------------------------------------- 1 | from typing import Dict 2 | from argparse import ArgumentParser 3 | from pdb import set_trace as stop 4 | 5 | import optuna 6 | import gym 7 | import numpy as np 8 | import mlflow 9 | 10 | from src.q_agent import QAgent 11 | from src.utils import get_agent_id 12 | from src.config import TENSORBOARD_LOG_DIR, SAVED_AGENTS_DIR, OPTUNA_DB 13 | from src.utils import set_seed 14 | from src.loops import train, evaluate 15 | 16 | 17 | def sample_hyper_parameters( 18 | trial: optuna.trial.Trial, 19 | force_linear_model: bool = False, 20 | ) -> Dict: 21 | 22 | learning_rate = trial.suggest_loguniform("learning_rate", 1e-5, 1e-2) 23 | discount_factor = trial.suggest_categorical("discount_factor", [0.9, 0.95, 0.99]) 24 | batch_size = trial.suggest_categorical("batch_size", [16, 32, 64, 128]) 25 | memory_size = trial.suggest_categorical("memory_size", [int(1e4), int(5e4), int(1e5)]) 26 | 27 | # we update the main model parameters every 'freq_steps_train' steps 28 | freq_steps_train = trial.suggest_categorical('freq_steps_train', [8, 16, 128, 256]) 29 | 30 | # we update the target model parameters every 'freq_steps_update_target' steps 31 | freq_steps_update_target = trial.suggest_categorical('freq_steps_update_target', [10, 100, 1000]) 32 | 33 | # minimum memory size we want before we start training 34 | # e.g. 0 --> start training right away. 35 | # e.g 1,000 --> start training when there are at least 1,000 sample trajectories in the agent's memory 36 | n_steps_warm_up_memory = trial.suggest_categorical("n_steps_warm_up_memory", [1000, 5000]) 37 | 38 | # how many consecutive gradient descent steps to perform when we update the main model parameters 39 | n_gradient_steps = trial.suggest_categorical("n_gradient_steps", [1, 4, 16]) 40 | 41 | # model architecture to approximate q values 42 | if force_linear_model: 43 | # linear model 44 | nn_hidden_layers = None 45 | else: 46 | # neural network hidden layers 47 | # nn_hidden_layers = trial.suggest_categorical("nn_hidden_layers", [None, [64, 64], [256, 256]]) 48 | nn_hidden_layers = trial.suggest_categorical("nn_hidden_layers", [[256, 256]]) # ;-) 49 | 50 | # how large do we let the gradients grow before capping them? 51 | # Explosive gradients can be an issue and this hyper-parameters helps mitigate it. 52 | max_grad_norm = trial.suggest_categorical("max_grad_norm", [1, 10]) 53 | 54 | # should we scale the inputs before feeding them to the model? 
55 | normalize_state = trial.suggest_categorical('normalize_state', [True, False]) 56 | 57 | # start value for the exploration rate 58 | epsilon_start = trial.suggest_categorical("epsilon_start", [0.9]) 59 | 60 | # final value for the exploration rate 61 | epsilon_end = trial.suggest_uniform("epsilon_end", 0, 0.2) 62 | 63 | # for how many steps do we decrease epsilon from its starting value to 64 | # its final value `epsilon_end` 65 | steps_epsilon_decay = trial.suggest_categorical("steps_epsilon_decay", [int(1e3), int(1e4), int(1e5)]) 66 | 67 | seed = trial.suggest_int('seed', 0, 2 ** 30 - 1) 68 | 69 | return { 70 | 'learning_rate': learning_rate, 71 | 'discount_factor': discount_factor, 72 | 'batch_size': batch_size, 73 | 'memory_size': memory_size, 74 | 'freq_steps_train': freq_steps_train, 75 | 'freq_steps_update_target': freq_steps_update_target, 76 | 'n_steps_warm_up_memory': n_steps_warm_up_memory, 77 | 'n_gradient_steps': n_gradient_steps, 78 | 'nn_hidden_layers': nn_hidden_layers, 79 | 'max_grad_norm': max_grad_norm, 80 | 'normalize_state': normalize_state, 81 | 'epsilon_start': epsilon_start, 82 | 'epsilon_end': epsilon_end, 83 | 'steps_epsilon_decay': steps_epsilon_decay, 84 | 'seed': seed, 85 | } 86 | 87 | 88 | def objective( 89 | trial: optuna.trial.Trial, 90 | force_linear_model: bool = False, 91 | n_episodes_to_train: int = 200, 92 | ): 93 | env_name = 'CartPole-v1' 94 | env = gym.make('CartPole-v1') 95 | 96 | with mlflow.start_run(): 97 | 98 | # generate unique agent_id 99 | agent_id = get_agent_id(env_name) 100 | mlflow.log_param('agent_id', agent_id) 101 | 102 | # hyper-parameters 103 | args = sample_hyper_parameters(trial, 104 | force_linear_model=force_linear_model) 105 | mlflow.log_params(trial.params) 106 | 107 | # fix seeds to ensure reproducible runs 108 | set_seed(env, args['seed']) 109 | 110 | # create agent object 111 | agent = QAgent( 112 | env, 113 | learning_rate=args['learning_rate'], 114 | discount_factor=args['discount_factor'], 115 | batch_size=args['batch_size'], 116 | memory_size=args['memory_size'], 117 | freq_steps_train=args['freq_steps_train'], 118 | freq_steps_update_target=args['freq_steps_update_target'], 119 | n_steps_warm_up_memory=args['n_steps_warm_up_memory'], 120 | n_gradient_steps=args['n_gradient_steps'], 121 | nn_hidden_layers=args['nn_hidden_layers'], 122 | max_grad_norm=args['max_grad_norm'], 123 | normalize_state=args['normalize_state'], 124 | epsilon_start=args['epsilon_start'], 125 | epsilon_end=args['epsilon_end'], 126 | steps_epsilon_decay=args['steps_epsilon_decay'], 127 | log_dir=TENSORBOARD_LOG_DIR / env_name / agent_id 128 | ) 129 | 130 | # train loop 131 | train(agent, 132 | env, 133 | n_episodes=n_episodes_to_train, 134 | log_dir=TENSORBOARD_LOG_DIR / env_name / agent_id) 135 | 136 | agent.save_to_disk(SAVED_AGENTS_DIR / env_name / agent_id) 137 | 138 | # evaluate its performance 139 | rewards, steps = evaluate(agent, env, n_episodes=1000, epsilon=0.00) 140 | mean_reward = np.mean(rewards) 141 | std_reward = np.std(rewards) 142 | mlflow.log_metric('mean_reward', mean_reward) 143 | mlflow.log_metric('std_reward', std_reward) 144 | 145 | return mean_reward 146 | 147 | 148 | if __name__ == '__main__': 149 | 150 | parser = ArgumentParser() 151 | parser.add_argument('--trials', type=int, required=True) 152 | parser.add_argument('--episodes', type=int, required=True) 153 | parser.add_argument('--force_linear_model', dest='force_linear_model', action='store_true') 154 | parser.set_defaults(force_linear_model=False) 155 | 
parser.add_argument('--experiment_name', type=str, required=True) 156 | args = parser.parse_args() 157 | 158 | # set Mlflow experiment name 159 | mlflow.set_experiment(args.experiment_name) 160 | 161 | # set Optuna study 162 | study = optuna.create_study(study_name=args.experiment_name, 163 | direction='maximize', 164 | load_if_exists=True, 165 | storage=f'sqlite:///{OPTUNA_DB}') 166 | 167 | # Wrap the objective inside a lambda and call objective inside it 168 | # Nice trick taken from https://www.kaggle.com/general/261870 169 | func = lambda trial: objective(trial, force_linear_model=args.force_linear_model, n_episodes_to_train=args.episodes) 170 | 171 | # run Optuna 172 | study.optimize(func, n_trials=args.trials) -------------------------------------------------------------------------------- /03_cart_pole/src/q_agent.py: -------------------------------------------------------------------------------- 1 | """ 2 | We use PyTorch for all agents: 3 | 4 | - Linear model trained one sample at a time -> Easy to train, slow and results are not great. 5 | - Linear model trained with batches of data. -> Faster to train, but results are still not good. 6 | - NN trained with batches -> Promising, but it looks like it does not train.. 7 | - NN with memory buffer -> Fix sample autocorrelation 8 | - NN with memory buffer and target network for stability. -> RL trick (called double Q-learning) 9 | 10 | """ 11 | import os 12 | from pathlib import Path 13 | from typing import Union, Callable, Tuple, List 14 | import random 15 | from argparse import ArgumentParser 16 | import json 17 | from pdb import set_trace as stop 18 | 19 | import gym 20 | import numpy as np 21 | import torch 22 | import torch.nn as nn 23 | import torch.optim as optim 24 | from torch.utils.tensorboard import SummaryWriter 25 | from torch.nn import functional as F 26 | 27 | from src.model_factory import get_model 28 | from src.agent_memory import AgentMemory 29 | from src.utils import ( 30 | get_agent_id, 31 | get_input_output_dims, 32 | get_epsilon_decay_fn, 33 | # load_default_hyperparameters, 34 | get_observation_samples, 35 | set_seed, 36 | get_num_model_parameters 37 | ) 38 | from src.loops import train 39 | from src.config import TENSORBOARD_LOG_DIR, SAVED_AGENTS_DIR 40 | 41 | 42 | class QAgent: 43 | 44 | def __init__( 45 | self, 46 | env: gym.Env, 47 | learning_rate: float = 1e-4, 48 | discount_factor: float = 0.99, 49 | batch_size: int = 64, 50 | memory_size: int = 10000, 51 | freq_steps_update_target: int = 1000, 52 | n_steps_warm_up_memory: int = 1000, 53 | freq_steps_train: int = 16, 54 | n_gradient_steps: int = 8, 55 | nn_hidden_layers: List[int] = None, 56 | max_grad_norm: int = 10, 57 | normalize_state: bool = False, 58 | epsilon_start: float = 1.0, 59 | epsilon_end: float = 0.05, 60 | steps_epsilon_decay: float = 50000, 61 | log_dir: str = None, 62 | ): 63 | """ 64 | :param env: 65 | :param learning_rate: size of the updates in the SGD/Adam formula 66 | :param discount_factor: discount factor for future rewards 67 | :param batch_size: number of (s,a,r,s') experiences we use in each SGD 68 | update 69 | :param memory_size: number of experiences the agent keeps in the replay 70 | memory 71 | :param freq_steps_update_target: frequency at which we copy the 72 | parameter 73 | from the main model to the target model. 74 | :param n_steps_warm_up_memory: number of experiences we require to have 75 | in memory before we start training the agent. 
76 | :param freq_steps_train: frequency at which we update the main model 77 | parameters 78 | :param n_gradient_steps: number of SGD/Adam updates we perform when we 79 | train the main model. 80 | :param nn_hidden_layers: architecture of the main and target models. 81 | :param max_grad_norm: used to clipped gradients if they become too 82 | large. 83 | :param normalize_state: True/False depending if you want to normalize 84 | the raw states before feeding them into the model. 85 | :param epsilon_start: starting exploration rate 86 | :param epsilon_end: final exploration rate 87 | :param steps_epsilon_decay: number of step in which epsilon decays from 88 | 'epsilon_start' to 'epsilon_end' 89 | :param log_dir: Tensorboard logging folder 90 | """ 91 | self.env = env 92 | 93 | # general hyper-parameters 94 | self.learning_rate = learning_rate 95 | self.discount_factor = discount_factor 96 | 97 | # replay memory we use to sample experiences and update parameters 98 | # `memory_size` defines the maximum number of past experiences we want the 99 | # agent remember. 100 | self.memory_size = memory_size 101 | self.memory = AgentMemory(memory_size) 102 | 103 | # number of experiences we take at once from `self.memory` to update parameters 104 | self.batch_size = batch_size 105 | 106 | # hyper-parameters to control exploration of the environment 107 | self.steps_epsilon_decay = steps_epsilon_decay 108 | self.epsilon_start = epsilon_start 109 | self.epsilon_end = epsilon_end 110 | self.epsilon_fn = get_epsilon_decay_fn(epsilon_start, epsilon_end, steps_epsilon_decay) 111 | self.epsilon = None 112 | 113 | # create q model(s). Plural because we use 2 models: main one, and the other for the target. 114 | self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 115 | self.q_net, self.target_q_net = None, None 116 | self._init_models(nn_hidden_layers) 117 | print(f'{get_num_model_parameters(self.q_net):,} parameters') 118 | self.optimizer = optim.Adam(self.q_net.parameters(), lr=learning_rate) # Adam optimizer is a safe and standard choice 119 | self.max_grad_norm = max_grad_norm 120 | 121 | # hyper-parameters to control how often or when we do certain things, like 122 | # - update the main net parameters 123 | self.freq_steps_train = freq_steps_train 124 | # - update the target net parameters 125 | self.freq_steps_update_target = freq_steps_update_target 126 | # - start training until the memory is big enough 127 | assert n_steps_warm_up_memory > batch_size, 'batch_size must be larger than n_steps_warm_up_memory' 128 | self.n_steps_warm_up_memory = n_steps_warm_up_memory 129 | # - number of gradient steps we perform every time we update the main net parameters 130 | self.n_gradient_steps = n_gradient_steps 131 | 132 | # state variable we use to keep track of the number of calls to `observe()` 133 | self._step_counter = 0 134 | 135 | # input normalizer 136 | self.normalize_state = normalize_state 137 | if normalize_state: 138 | state_samples = get_observation_samples(env, n_samples=10000) 139 | self.mean_states = state_samples.mean(axis=0) 140 | self.std_states = state_samples.std(axis=0) 141 | 142 | # create a tensorboard logger if `log_dir` was provided 143 | # logging becomes crucial to understand what is not working in our code. 
144 | self.log_dir = log_dir 145 | if log_dir: 146 | self.logger = SummaryWriter(log_dir) 147 | 148 | # save hyper-parameters 149 | self.hparams = { 150 | 'learning_rate': learning_rate, 151 | 'discount_factor': discount_factor, 152 | 'batch_size': batch_size, 153 | 'memory_size': memory_size, 154 | 'freq_steps_update_target': freq_steps_update_target, 155 | 'n_steps_warm_up_memory': n_steps_warm_up_memory, 156 | 'freq_steps_train': freq_steps_train, 157 | 'n_gradient_steps': n_gradient_steps, 158 | 'nn_hidden_layers': nn_hidden_layers, 159 | 'max_grad_norm': max_grad_norm, 160 | 'normalize_state': normalize_state, 161 | 'epsilon_start': epsilon_start, 162 | 'epsilon_end': epsilon_end, 163 | 'steps_epsilon_decay': steps_epsilon_decay, 164 | } 165 | 166 | def _init_models(self, nn_hidden_layers): 167 | 168 | # state is a vector of dimension 4, and 2 are the possible actions 169 | input_dim, output_dim = get_input_output_dims(str(self.env)) 170 | self.q_net = get_model( 171 | input_dim=input_dim, 172 | output_dim=output_dim, 173 | hidden_layers=nn_hidden_layers, 174 | ) 175 | self.q_net.to(self.device) 176 | 177 | # target q-net 178 | self.target_q_net = get_model( 179 | input_dim=input_dim, 180 | output_dim=output_dim, 181 | hidden_layers=nn_hidden_layers, 182 | ) 183 | self.target_q_net.to(self.device) 184 | 185 | # copy parameters from the `self.q_net` 186 | self._copy_params_to_target_q_net() 187 | 188 | def _copy_params_to_target_q_net(self): 189 | """ 190 | Copies parameters from q_net to target_q_net 191 | """ 192 | for target_param, param in zip(self.target_q_net.parameters(), self.q_net.parameters()): 193 | target_param.data.copy_(param.data) 194 | 195 | def _normalize_state(self, state: np.array) -> np.array: 196 | """""" 197 | # return (state - self.min_states) / (self.max_states - self.min_states) 198 | return (state - self.mean_states) / (self.std_states) 199 | 200 | def _preprocess_state(self, state: np.array) -> np.array: 201 | 202 | # state = np.copy(state_) 203 | 204 | if len(state.shape) == 1: 205 | # add extra dimension to make sure it is 2D 206 | s = state.reshape(1, -1) 207 | else: 208 | s = state 209 | 210 | if self.normalize_state: 211 | s = self._normalize_state(s) 212 | 213 | return s 214 | 215 | def act(self, state: np.array, epsilon: float = None) -> int: 216 | """ 217 | Behavioural policy 218 | """ 219 | if epsilon is None: 220 | # update epsilon 221 | self.epsilon = self.epsilon_fn(self._step_counter) 222 | epsilon = self.epsilon 223 | 224 | if random.uniform(0, 1) < epsilon: 225 | # Explore action space 226 | action = self.env.action_space.sample() 227 | return action 228 | 229 | # make sure s is a numpy array with 2 dimensions, 230 | # and normalize it if `self.normalize_state = True` 231 | s = self._preprocess_state(state) 232 | 233 | # forward pass through the net to compute q-values for the 3 actions 234 | s = torch.from_numpy(s).float().to(self.device) 235 | q_values = self.q_net(s) 236 | 237 | # extract index max q-value and reshape tensor to dimensions (1, 1) 238 | action = q_values.max(1)[1].view(1, 1) 239 | 240 | # tensor to float 241 | action = action.item() 242 | 243 | return action 244 | 245 | def observe(self, state, action, reward, next_state, done) -> None: 246 | 247 | # preprocess state 248 | s = self._preprocess_state(state) 249 | ns = self._preprocess_state(next_state) 250 | 251 | # store new experience in the agent's memory. 
252 | self.memory.push(s, action, reward, ns, done) 253 | 254 | self._step_counter += 1 255 | 256 | def replay(self) -> None: 257 | 258 | if self._step_counter % self.freq_steps_train != 0: 259 | # update parameters every `self.freq_steps_update_target` 260 | # this way we add inertia to the agent actions, as they are more sticky 261 | return 262 | 263 | if self._step_counter < self.n_steps_warm_up_memory: 264 | # memory needs to be larger, no training yet 265 | return 266 | 267 | if self._step_counter % self.freq_steps_update_target == 0: 268 | # we update the target network parameters 269 | # self.target_nn.load_state_dict(self.nn.state_dict()) 270 | self._copy_params_to_target_q_net() 271 | 272 | losses = [] 273 | for i in range(0, self.n_gradient_steps): 274 | 275 | # get batch of experiences from the agent's memory. 276 | batch = self.memory.sample(self.batch_size) 277 | 278 | # A bit of plumbing to transform numpy arrays to PyTorch tensors 279 | state_batch = torch.cat([torch.from_numpy(s).float().view(1, -1) for s in batch.state]).to(self.device) 280 | action_batch = torch.cat([torch.tensor([[a]]).long().view(1, -1) for a in batch.action]).to(self.device) 281 | reward_batch = torch.cat([torch.tensor([r]).float() for r in batch.reward]).to(self.device) 282 | next_state_batch = torch.cat([torch.from_numpy(s).float().view(1, -1) for s in batch.next_state]).to(self.device) 283 | done_batch = torch.tensor(batch.done).float().to(self.device) 284 | 285 | # q_values for all 3 actions 286 | q_values = self.q_net(state_batch) 287 | 288 | # keep only q_value for the chosen action in the trajectory, i.e. `action_batch` 289 | q_values = q_values.gather(1, action_batch) 290 | 291 | with torch.no_grad(): 292 | # q-values for each action in next_state 293 | next_q_values = self.target_q_net(next_state_batch) 294 | 295 | # extract max q-value for each next_state 296 | next_q_values, _ = next_q_values.max(dim=1) 297 | 298 | # TD target 299 | target_q_values = (1 - done_batch) * next_q_values * self.discount_factor + reward_batch 300 | 301 | # compute loss 302 | loss = F.mse_loss(q_values.squeeze(1), target_q_values) 303 | losses.append(loss.item()) 304 | 305 | # backward step to adjust network parameters 306 | self.optimizer.zero_grad() 307 | loss.backward() 308 | torch.nn.utils.clip_grad_norm_(self.q_net.parameters(), self.max_grad_norm) 309 | self.optimizer.step() 310 | 311 | if self.log_dir: 312 | self.logger.add_scalar('train/loss', np.mean(losses), self._step_counter) 313 | 314 | def save_to_disk(self, path: Path) -> None: 315 | """""" 316 | if not path.exists(): 317 | os.makedirs(path) 318 | 319 | # save hyper-parameters in a json file 320 | with open(path / 'hparams.json', 'w') as f: 321 | json.dump(self.hparams, f) 322 | 323 | if self.normalize_state: 324 | np.save(path / 'mean_states.npy', self.mean_states) 325 | np.save(path / 'std_states.npy', self.std_states) 326 | 327 | # save main model 328 | torch.save(self.q_net, path / 'model') 329 | 330 | @classmethod 331 | def load_from_disk(cls, env: gym.Env, path: Path): 332 | """ 333 | We recover all necessary variables to be able to evaluate the agent. 334 | 335 | NOTE: training state is not stored, so it is not possible to resume 336 | an interrupted training run as it was. 
337 | """ 338 | # load hyper-params 339 | with open(path / 'hparams.json', 'r') as f: 340 | hparams = json.load(f) 341 | 342 | # generate Python object 343 | agent = cls(env, **hparams) 344 | 345 | agent.normalize_state = hparams['normalize_state'] 346 | if hparams['normalize_state']: 347 | agent.mean_states = np.load(path / 'mean_states.npy') 348 | agent.std_states = np.load(path / 'std_states.npy') 349 | 350 | agent.q_net = torch.load(path / 'model') 351 | agent.q_net.eval() 352 | 353 | return agent 354 | 355 | 356 | 357 | def parse_arguments(): 358 | """ 359 | Hyper-parameters are set either from command line or from the `hyperparameters.yaml' file. 360 | Parameters set throught the command line have priority over the default ones 361 | set in the yaml file. 362 | """ 363 | 364 | parser = ArgumentParser() 365 | parser.add_argument('--env', type=str, required=True) 366 | parser.add_argument('--learning_rate', type=float) 367 | parser.add_argument('--discount_factor', type=float) 368 | parser.add_argument('--episodes', type=int) 369 | parser.add_argument('--max_steps', type=int) 370 | parser.add_argument('--epsilon_start', type=float) 371 | parser.add_argument('--epsilon_end', type=float) 372 | parser.add_argument('--steps_epsilon_decay', type=int) 373 | parser.add_argument('--batch_size', type=int) 374 | parser.add_argument('--memory_size', type=int) 375 | parser.add_argument('--n_steps_warm_up_memory', type=int) 376 | parser.add_argument('--freq_steps_update_target', type=int) 377 | parser.add_argument('--freq_steps_train', type=int) 378 | parser.add_argument('--normalize_state', dest='normalize_state', action='store_true') 379 | parser.set_defaults(normalize_state=False) 380 | parser.add_argument('--n_gradient_steps', type=int,) 381 | parser.add_argument("--nn_hidden_layers", type=int, nargs="+",) 382 | parser.add_argument('--nn_init_method', type=str, default='default') 383 | parser.add_argument('--loss', type=str) 384 | parser.add_argument("--max_grad_norm", type=float, default=10) 385 | parser.add_argument('--n_episodes_evaluate_agent', type=int, default=100) 386 | parser.add_argument('--freq_episodes_evaluate_agent', type=int, default=100) 387 | parser.add_argument('--seed', type=int, default=0) 388 | 389 | args = parser.parse_args() 390 | 391 | args_dict = {} 392 | for arg in vars(args): 393 | args_dict[arg] = getattr(args, arg) 394 | 395 | print('Hyper-parameters') 396 | for key, value in args_dict.items(): 397 | print(f'{key}: {value}') 398 | 399 | return args_dict 400 | 401 | 402 | if __name__ == '__main__': 403 | 404 | args = parse_arguments() 405 | 406 | # setup the environment 407 | env = gym.make(args['env']) 408 | 409 | # fix seeds to ensure reproducibility between runs 410 | set_seed(env, args['seed']) 411 | 412 | # generate a unique agent_id, that we later use to save results to disk, as 413 | # well as TensorBoard logs. 
414 | agent_id = get_agent_id(args['env']) 415 | print('agent_id: ', agent_id) 416 | 417 | agent = QAgent( 418 | env, 419 | learning_rate=args['learning_rate'], 420 | discount_factor=args['discount_factor'], 421 | batch_size=args['batch_size'], 422 | memory_size=args['memory_size'], 423 | freq_steps_train=args['freq_steps_train'], 424 | freq_steps_update_target=args['freq_steps_update_target'], 425 | n_steps_warm_up_memory=args['n_steps_warm_up_memory'], 426 | n_gradient_steps=args['n_gradient_steps'], 427 | nn_hidden_layers=args['nn_hidden_layers'], 428 | max_grad_norm=args['max_grad_norm'], 429 | normalize_state=args['normalize_state'], 430 | epsilon_start=args['epsilon_start'], 431 | epsilon_end=args['epsilon_end'], 432 | steps_epsilon_decay=args['steps_epsilon_decay'], 433 | log_dir=TENSORBOARD_LOG_DIR / args['env'] / agent_id 434 | ) 435 | agent.save_to_disk(SAVED_AGENTS_DIR / args['env'] / agent_id) 436 | 437 | try: 438 | train(agent, env, 439 | n_episodes=args['episodes'], 440 | log_dir=TENSORBOARD_LOG_DIR / args['env'] / agent_id, 441 | n_episodes_evaluate_agent=args['n_episodes_evaluate_agent'], 442 | freq_episodes_evaluate_agent=args['freq_episodes_evaluate_agent'], 443 | # max_steps=args['max_steps'] 444 | ) 445 | 446 | agent.save_to_disk(SAVED_AGENTS_DIR / args['env'] / agent_id) 447 | print(f'Agent {agent_id} was saved') 448 | 449 | except KeyboardInterrupt: 450 | # save the agent before quitting... 451 | agent.save_to_disk(SAVED_AGENTS_DIR / args['env'] / agent_id) 452 | print(f'Agent {agent_id} was saved') -------------------------------------------------------------------------------- /03_cart_pole/src/random_agent.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | 4 | class RandomAgent: 5 | """ 6 | This taxi driver selects actions randomly. 7 | You better not get into this taxi! 8 | """ 9 | def __init__(self, env): 10 | self.env = env 11 | 12 | def act(self, state: np.array, epsilon: float = None) -> int: 13 | """ 14 | No input arguments to this function. 15 | The agent does not consider the state of the environment when deciding 16 | what to do next. 17 | """ 18 | return self.env.action_space.sample() -------------------------------------------------------------------------------- /03_cart_pole/src/supervised_ml.py: -------------------------------------------------------------------------------- 1 | from argparse import ArgumentParser 2 | from pathlib import Path 3 | from typing import List, Optional, Tuple, Union, Dict 4 | from pdb import set_trace as stop 5 | 6 | import zipfile 7 | import gdown 8 | from tqdm import tqdm 9 | import pandas as pd 10 | import gym 11 | from torch.utils.data import Dataset, DataLoader 12 | 13 | import numpy as np 14 | #from sklearn.model_selection import train_test_split # Unused import 15 | import torch 16 | import torch.optim as optim 17 | import torch.nn as nn 18 | from torch.utils.tensorboard import SummaryWriter 19 | 20 | from src.model_factory import get_model 21 | from src.utils import set_seed 22 | from src.loops import evaluate 23 | from src.q_agent import QAgent 24 | from src.config import DATA_SUPERVISED_ML, SAVED_AGENTS_DIR, TENSORBOARD_LOG_DIR 25 | 26 | 27 | global_train_step = 0 28 | global_val_step = 0 29 | 30 | 31 | def download_agent_parameters() -> Path: 32 | """ 33 | Downloads the agent parameters and hyper-parameters that I trained on my machine 34 | Returns the path to the unzipped folder. 
35 | """ 36 | # download .zip file from public google drive 37 | # url = 'https://docs.google.com/uc?export=download&id=1KH4ANx84PMmCY6H4FoUnkBLVC1z1A6W6' 38 | url = 'https://docs.google.com/uc?export=download&id=1ZdyAuzY-0VYfyNrg0a7gHd5TOX-GadJJ' 39 | output = SAVED_AGENTS_DIR / 'CartPole-v1' / 'gdrive_agent.zip' 40 | gdown.download(url, str(output)) 41 | 42 | # unzip it 43 | with zipfile.ZipFile(str(output), "r") as zip_ref: 44 | zip_ref.extractall(str(SAVED_AGENTS_DIR / 'CartPole-v1')) 45 | 46 | return SAVED_AGENTS_DIR / 'CartPole-v1' / '298' 47 | 48 | 49 | def simulate_episode(env, agent) -> List[Dict]: 50 | """ 51 | We let the agent interact with the environment and return a list of collected 52 | states and actions 53 | """ 54 | done = False 55 | state = env.reset() 56 | samples = [] 57 | while not done: 58 | 59 | action = agent.act(state, epsilon=0.0) 60 | samples.append({ 61 | 's0': state[0], 62 | 's1': state[1], 63 | 's2': state[2], 64 | 's3': state[3], 65 | 'action': action 66 | }) 67 | state, reward, done, info = env.step(action) 68 | 69 | return samples 70 | 71 | 72 | def generate_state_action_data( 73 | env: gym.Env, 74 | agent: QAgent, 75 | n_samples: int, 76 | path: Path 77 | ) -> None: 78 | """ 79 | We let the agent interact the environment until we have collected 80 | n_samples of data. 81 | Then we save the data as a csv file with columns: s0, s1, s2, s3, a 82 | """ 83 | samples = [] 84 | with tqdm(total=n_samples) as pbar: 85 | while len(samples) < n_samples: 86 | new_samples = simulate_episode(env, agent) 87 | pbar.update(len(new_samples)) 88 | samples += new_samples 89 | 90 | # save dataframe to csv file 91 | pd.DataFrame(samples).to_csv(path, index=False) 92 | 93 | 94 | class OptimalPolicyDataset(Dataset): 95 | """ 96 | PyTorch custom dataset that wraps around the pandas dataframe and that 97 | will speak to the DataLoader later on, when we train the model. 
98 | """ 99 | def __init__(self, X: pd.DataFrame, y: pd.Series): 100 | self.X = X 101 | self.y = y 102 | 103 | def __len__(self): 104 | return len(self.X) 105 | 106 | def __getitem__(self, idx): 107 | return self.X.iloc[idx].values, self.y.iloc[idx] 108 | 109 | 110 | def get_tensorboard_writer(run_name: str): 111 | 112 | from torch.utils.tensorboard import SummaryWriter 113 | from src.config import TENSORBOARD_LOG_DIR 114 | tensorboard_writer = SummaryWriter(TENSORBOARD_LOG_DIR / 'sml' / run_name) 115 | return tensorboard_writer 116 | 117 | 118 | def get_train_val_loop( 119 | model: nn.Module, 120 | criterion, 121 | optimizer, 122 | tensorboard_writer, 123 | ): 124 | global global_train_step, global_val_step 125 | global_train_step = 0 126 | global_val_step = 0 127 | 128 | def train_val_loop( 129 | is_train: bool, 130 | dataloader: DataLoader, 131 | epoch: int, 132 | ): 133 | """""" 134 | global global_train_step, global_val_step 135 | 136 | n_batches = 0 137 | running_loss = 0 138 | n_samples = 0 139 | n_correct_predictions = 0 140 | 141 | pbar = tqdm(dataloader) 142 | for data in pbar: 143 | 144 | # extract batch of features and target values (aka labels) 145 | inputs, labels = data 146 | 147 | if is_train: 148 | # zero the parameter gradients 149 | optimizer.zero_grad() 150 | 151 | # forward 152 | outputs = model(inputs.float()) 153 | loss = criterion(outputs, labels) 154 | 155 | if is_train: 156 | # backward + optimize 157 | loss.backward() 158 | optimizer.step() 159 | 160 | predicted_labels = torch.argmax(outputs, 1) 161 | batch_accuracy = (predicted_labels == labels).numpy().mean() 162 | 163 | n_batches += 1 164 | running_loss += loss.item() 165 | avg_loss = running_loss / n_batches 166 | 167 | n_correct_predictions += (predicted_labels == labels).numpy().sum() 168 | n_samples += len(labels) 169 | avg_accuracy = n_correct_predictions / n_samples 170 | pbar.set_description(f'Epoch {epoch} - loss: {avg_loss:.4f} - accuracy: {avg_accuracy:.4f}') 171 | 172 | # log to tensorboard 173 | if is_train: 174 | global_train_step += 1 175 | tensorboard_writer.add_scalar('train/loss', loss.item(), global_train_step) 176 | tensorboard_writer.add_scalar('train/accuracy', batch_accuracy, global_train_step) 177 | # print('sent logs to TB') 178 | else: 179 | global_val_step += 1 180 | tensorboard_writer.add_scalar('val/loss', loss.item(), global_val_step) 181 | tensorboard_writer.add_scalar('val/accuracy', batch_accuracy, global_val_step) 182 | 183 | return train_val_loop 184 | 185 | 186 | 187 | def run( 188 | n_samples_train: int, 189 | n_samples_test: int, 190 | hidden_layers: Union[Tuple[int], None], 191 | n_epochs: int, 192 | ): 193 | env = gym.make('CartPole-v1') 194 | 195 | print('Downloading agent data from GDrive...') 196 | path_to_agent_data = download_agent_parameters() 197 | agent = QAgent.load_from_disk(env, path=path_to_agent_data) 198 | 199 | set_seed(env, 1234) 200 | print('Sanity checking that our agent is really that good...') 201 | rewards, steps = evaluate(agent, env, n_episodes=100, epsilon=0.0) 202 | print('Avg reward evaluation: ', np.mean(rewards)) 203 | 204 | print('Generating train data for our supervised ML problem...') 205 | path_to_train_data = DATA_SUPERVISED_ML / 'train.csv' 206 | env.seed(0) 207 | generate_state_action_data(env, agent, n_samples=n_samples_train, path=path_to_train_data) 208 | 209 | print('Generating test data for our supervised ML problem...') 210 | path_to_test_data = DATA_SUPERVISED_ML / 'test.csv' 211 | env.seed(1) 212 | generate_state_action_data(env, 
agent, n_samples=n_samples_test, path=path_to_test_data) 213 | 214 | # load data from disk 215 | print('Loading CSV files into dataframes...') 216 | train_data = pd.read_csv(path_to_train_data) 217 | test_data = pd.read_csv(path_to_test_data) 218 | 219 | # split features and labels 220 | X_train = train_data[['s0', 's1', 's2', 's3']] 221 | y_train = train_data['action'] 222 | X_test = test_data[['s0', 's1', 's2', 's3']] 223 | y_test = test_data['action'] 224 | 225 | # PyTorch datasets 226 | train_dataset = OptimalPolicyDataset(X_train, y_train) 227 | test_dataset = OptimalPolicyDataset(X_test, y_test) 228 | 229 | batch_size = 64 230 | 231 | # PyTorch dataloaders 232 | train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True) 233 | test_dataloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False) 234 | 235 | # model architecture 236 | model = get_model(input_dim=4, output_dim=2, hidden_layers=hidden_layers) 237 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 238 | model.to(device) 239 | 240 | # loss function 241 | criterion = nn.CrossEntropyLoss() 242 | 243 | # optimization method 244 | optimizer = optim.Adam(model.parameters()) #, lr=3e-4) 245 | 246 | import time 247 | ts = int(time.time()) 248 | tensorboard_writer = SummaryWriter(TENSORBOARD_LOG_DIR / 'sml' / str(ts)) 249 | train_val_loop = get_train_val_loop(model, criterion, optimizer, tensorboard_writer) 250 | 251 | # training loop, with evaluation at the end of each epoch 252 | # n_epochs = 20 253 | for epoch in range(n_epochs): 254 | # train 255 | train_val_loop(is_train=True, dataloader=train_dataloader, epoch=epoch) 256 | 257 | with torch.no_grad(): 258 | # validate 259 | train_val_loop(is_train=False, dataloader=test_dataloader, epoch=epoch) 260 | 261 | print('----------') 262 | 263 | 264 | if __name__ == '__main__': 265 | 266 | parser = ArgumentParser() 267 | parser.add_argument('--n_samples_train', type=int, default=1000) 268 | parser.add_argument('--n_samples_test', type=int, default=1000) 269 | parser.add_argument("--hidden_layers", type=int, nargs="+",) 270 | parser.add_argument('--n_epochs', type=int, default=20) 271 | args = parser.parse_args() 272 | 273 | run(n_samples_train=args.n_samples_train, 274 | n_samples_test=args.n_samples_test, 275 | hidden_layers=args.hidden_layers, 276 | n_epochs=args.n_epochs) -------------------------------------------------------------------------------- /03_cart_pole/src/utils.py: -------------------------------------------------------------------------------- 1 | import os 2 | from typing import Callable, Dict, Tuple, List 3 | import pathlib 4 | from pathlib import Path 5 | import json 6 | from pdb import set_trace as stop 7 | 8 | import numpy as np 9 | import gym 10 | import yaml 11 | import torch.nn as nn 12 | 13 | 14 | def snake_to_camel(word): 15 | import re 16 | return ''.join(x.capitalize() or '_' for x in word.split('_')) 17 | 18 | 19 | def get_agent_id(env_name: str) -> str: 20 | """""" 21 | from src.config import SAVED_AGENTS_DIR 22 | 23 | dir = Path(SAVED_AGENTS_DIR) / env_name 24 | if not dir.exists(): 25 | os.makedirs(dir) 26 | 27 | # try: 28 | # agent_id = max([int(id) for id in os.listdir(dir)]) + 1 29 | # except ValueError: 30 | # agent_id = 0 31 | 32 | ids = [] 33 | for id in os.listdir(dir): 34 | try: 35 | ids.append(int(id)) 36 | except: 37 | pass 38 | if len(ids) > 0: 39 | agent_id = max(ids) + 1 40 | else: 41 | agent_id = 0 42 | # stop() 43 | 44 | return str(agent_id) 45 | 46 | def 
get_input_output_dims(env_name: str) -> Tuple[int]: 47 | """""" 48 | if 'MountainCar' in env_name: 49 | input_dim = 2 50 | output_dim = 3 51 | elif 'CartPole' in env_name: 52 | input_dim = 4 53 | output_dim = 2 54 | else: 55 | raise Exception('Invalid environment') 56 | 57 | return input_dim, output_dim 58 | 59 | 60 | def get_epsilon_decay_fn( 61 | eps_start: float, 62 | eps_end: float, 63 | total_episodes: int 64 | ) -> Callable: 65 | """ 66 | Returns function epsilon_fn, which depends on 67 | a single input, step, which is the current episode 68 | """ 69 | def epsilon_fn(episode: int) -> float: 70 | r = max((total_episodes - episode) / total_episodes, 0) 71 | return (eps_start - eps_end)*r + eps_end 72 | 73 | return epsilon_fn 74 | 75 | 76 | def get_epsilon_exponential_decay_fn( 77 | eps_max: float, 78 | eps_min: float, 79 | decay: float, 80 | ) -> Callable: 81 | """ 82 | Returns function epsilon_fn, which depends on 83 | a single input, step, which is the current episode 84 | """ 85 | def epsilon_fn(episode: int) -> float: 86 | return max(eps_min, eps_max * (decay ** episode)) 87 | return epsilon_fn 88 | 89 | 90 | def get_success_rate_from_n_steps(env: gym.Env, steps: List[int]): 91 | 92 | import numpy as np 93 | if 'MountainCar' in str(env): 94 | success_rate = np.mean((np.array(steps) < env._max_episode_steps) * 1.0) 95 | elif 'CartPole' in str(env): 96 | success_rate = np.mean((np.array(steps) >= env._max_episode_steps) * 1.0) 97 | else: 98 | raise Exception('Invalid environment name') 99 | 100 | return success_rate 101 | 102 | def get_observation_samples(env: gym.Env, n_samples: int) -> np.array: 103 | """""" 104 | samples = [] 105 | state = env.reset() 106 | while len(samples) < n_samples: 107 | 108 | samples.append(np.copy(state)) 109 | action = env.action_space.sample() 110 | next_state, reward, done, info = env.step(action) 111 | 112 | if done: 113 | state = env.reset() 114 | else: 115 | state = next_state 116 | 117 | return np.array(samples) 118 | 119 | 120 | def set_seed( 121 | env, 122 | seed 123 | ): 124 | """To ensure reproducible runs we fix the seed for different libraries""" 125 | import random 126 | random.seed(seed) 127 | 128 | import numpy as np 129 | np.random.seed(seed) 130 | 131 | env.seed(seed) 132 | env.action_space.seed(seed) 133 | 134 | import torch 135 | torch.manual_seed(seed) 136 | 137 | # Deterministic operations for CuDNN, it may impact performances 138 | torch.backends.cudnn.deterministic = True 139 | torch.backends.cudnn.benchmark = False 140 | 141 | # env.seed(seed) 142 | # gym.spaces.prng.seed(seed) 143 | 144 | 145 | def get_num_model_parameters(model: nn.Module) -> int: 146 | return sum(p.numel() for p in model.parameters() if p.requires_grad) 147 | 148 | 149 | # from dotenv import dotenv_values 150 | # import uuid 151 | # from pdb import set_trace as stop 152 | 153 | # import pandas as pd 154 | # import git 155 | 156 | # from src.io import get_list_files 157 | 158 | 159 | # def get_project_root() -> Path: 160 | # return Path(__file__).parent.resolve().parent 161 | # 162 | # from typing import Dict 163 | # def load_env_config() -> Dict: 164 | # """ 165 | # """ 166 | # config = dotenv_values(get_project_root() / ".env") 167 | # return config 168 | # 169 | 170 | 171 | 172 | 173 | -------------------------------------------------------------------------------- /03_cart_pole/src/viz.py: -------------------------------------------------------------------------------- 1 | from time import sleep 2 | from argparse import ArgumentParser 3 | from pdb import 
set_trace as stop 4 | from typing import Optional 5 | 6 | import pandas as pd 7 | import gym 8 | 9 | from src.config import SAVED_AGENTS_DIR 10 | 11 | import numpy as np 12 | 13 | 14 | def show_video(agent, env, sleep_sec: float = 0.1, seed: Optional[int] = 0, mode: str = "rgb_array"): 15 | 16 | env.seed(seed) 17 | state = env.reset() 18 | 19 | # LAPADULA 20 | if mode == "rgb_array": 21 | from matplotlib import pyplot as plt 22 | from IPython.display import display, clear_output 23 | steps = 0 24 | fig, ax = plt.subplots(figsize=(8, 6)) 25 | 26 | done = False 27 | while not done: 28 | 29 | action = agent.act(state, epsilon=0.001) 30 | state, reward, done, info = env.step(action) 31 | 32 | # LAPADULA 33 | if mode == "rgb_array": 34 | steps += 1 35 | frame = env.render(mode=mode) 36 | ax.cla() 37 | ax.axes.yaxis.set_visible(False) 38 | ax.imshow(frame) 39 | ax.set_title(f'Steps: {steps}') 40 | display(fig) 41 | clear_output(wait=True) 42 | plt.pause(sleep_sec) 43 | else: 44 | env.render() 45 | sleep(sleep_sec) 46 | 47 | 48 | if __name__ == '__main__': 49 | 50 | parser = ArgumentParser() 51 | parser.add_argument('--agent_id', type=str, required=True) 52 | parser.add_argument('--sleep_sec', type=float, required=False, default=0.1) 53 | args = parser.parse_args() 54 | 55 | from src.q_agent import QAgent 56 | 57 | env = gym.make('CartPole-v1') 58 | # env._max_episode_steps = 1000 59 | 60 | # trained agents are saved under saved_agents/CartPole-v1/<agent_id> 61 | agent_path = SAVED_AGENTS_DIR / 'CartPole-v1' / args.agent_id 62 | agent = QAgent.load_from_disk(env, path=agent_path) 63 | 64 | 65 | show_video(agent, env, sleep_sec=args.sleep_sec) 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | -------------------------------------------------------------------------------- /03_cart_pole/tensorboard_logs/.gitignore: -------------------------------------------------------------------------------- 1 | CartPole-v1/ 2 | -------------------------------------------------------------------------------- /03_cart_pole/tensorboard_logs/readme.md: -------------------------------------------------------------------------------- 1 | ### Tensorboard logs for each train run are stored in this folder 2 | -------------------------------------------------------------------------------- /04_lunar_lander/README.md: -------------------------------------------------------------------------------- 1 |
2 | Policy Gradients to land on the Moon 3 | “That's one small step for your gradient ascent, one giant leap for your ML career.” 4 | -- Pau quoting Neil Armstrong 5 |
6 | 7 | ![](http://datamachines.xyz/wp-content/uploads/2022/05/jagoda_and_kai-2048x1536.jpg) 8 | 9 | ## Table of Contents 10 | * [Welcome 🤗](#welcome-) 11 | * [Lecture transcripts](#lecture-transcripts) 12 | * [Quick setup](#quick-setup) 13 | * [Notebooks](#notebooks) 14 | * [Let's connect](#lets-connect) 15 | 16 | ## Welcome 🤗 17 | 18 | Today we will learn about Policy Gradient methods, and use them to land on the Moon. 19 | 20 | Ready, set, go! 21 | 22 | ## Lecture transcripts 23 | 24 | [📝 1. Policy gradients](http://datamachines.xyz/2022/05/06/policy-gradients-in-reinforcement-learning-to-land-on-the-moon-hands-on-course/) 25 | 26 | ## Quick setup 27 | 28 | Make sure you have Python >= 3.7. Otherwise, update it. 29 | 30 | 1. Pull the code from GitHub and cd into the `04_lunar_lander` folder: 31 | ``` 32 | $ git clone https://github.com/Paulescu/hands-on-rl.git 33 | $ cd hands-on-rl/04_lunar_lander 34 | ``` 35 | 36 | 2. Make sure you have the `virtualenv` tool in your Python installation 37 | ``` 38 | $ pip3 install virtualenv 39 | ``` 40 | 41 | 3. Create a virtual environment and activate it. 42 | ``` 43 | $ virtualenv -p python3 venv 44 | $ source venv/bin/activate 45 | ``` 46 | 47 | From this point onwards commands run inside the virtual environment. 48 | 49 | 50 | 3. Install dependencies and code from `src` folder in editable mode, so you can experiment with the code. 51 | ``` 52 | $ (venv) pip install -r requirements.txt 53 | $ (venv) export PYTHONPATH="." 54 | ``` 55 | 56 | 4. Open the notebooks, either with good old Jupyter or Jupyter lab 57 | ``` 58 | $ (venv) jupyter notebook 59 | ``` 60 | ``` 61 | $ (venv) jupyter lab 62 | ``` 63 | If both launch commands fail, try these: 64 | ``` 65 | $ (venv) jupyter notebook --NotebookApp.use_redirect_file=False 66 | ``` 67 | ``` 68 | $ (venv) jupyter lab --NotebookApp.use_redirect_file=False 69 | ``` 70 | 71 | 5. Play and learn. And do the homework 😉. 72 | 73 | ## Notebooks 74 | 75 | - [Random agent baseline](notebooks/01_random_agent_baseline.ipynb) 76 | - [Policy gradients with rewards as weights](notebooks/02_vanilla_policy_gradient_with_rewards_as_weights.ipynb) 77 | - [Policy gradients with rewards-to-go as weights](notebooks/03_vanilla_policy_gradient_with_rewards_to_go_as_weights.ipynb) 78 | - [Homework](notebooks/04_homework.ipynb) 79 | 80 | ## Let's connect! 81 | 82 | Do you wanna become a PRO in Machine Learning? 
83 | 84 | 👉🏽 Subscribe to the [datamachines newsletter](https://datamachines.xyz/subscribe/) 🧠 85 | 86 | 👉🏽 Follow me on [Twitter](https://twitter.com/paulabartabajo_) and [LinkedIn](https://www.linkedin.com/in/pau-labarta-bajo-4432074b/) 💡 87 | -------------------------------------------------------------------------------- /04_lunar_lander/notebooks/04_homework.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "f0fd6807", 6 | "metadata": {}, 7 | "source": [ 8 | "# 04 Homework 🏋️🏋️🏋️" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "id": "abcf6613", 14 | "metadata": {}, 15 | "source": [ 16 | "#### 👉A course without homework is not a course!\n", 17 | "\n", 18 | "#### 👉Spend some time thinking and trying to implement the challenges I propose here.\n", 19 | "\n", 20 | "#### 👉Feel free to email me your solutions at:" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "id": "d1d983a3", 26 | "metadata": {}, 27 | "source": [ 28 | "# `plabartabajo@gmail.com`" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "id": "86f82e45", 34 | "metadata": {}, 35 | "source": [ 36 | "-----" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "id": "67656662", 42 | "metadata": {}, 43 | "source": [ 44 | "## 1. Can you find a smaller network that solves this environment?\n", 45 | "\n", 46 | "I used one hidden layer with 64 units, but I have the feeling this was an overkill." 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "id": "c0a46bf7", 52 | "metadata": {}, 53 | "source": [ 54 | "## 2. Can you speed up converge by properly tunning the `batch_size`?" 55 | ] 56 | } 57 | ], 58 | "metadata": { 59 | "kernelspec": { 60 | "display_name": "Python 3 (ipykernel)", 61 | "language": "python", 62 | "name": "python3" 63 | }, 64 | "language_info": { 65 | "codemirror_mode": { 66 | "name": "ipython", 67 | "version": 3 68 | }, 69 | "file_extension": ".py", 70 | "mimetype": "text/x-python", 71 | "name": "python", 72 | "nbconvert_exporter": "python", 73 | "pygments_lexer": "ipython3", 74 | "version": "3.8.10" 75 | } 76 | }, 77 | "nbformat": 4, 78 | "nbformat_minor": 5 79 | } 80 | -------------------------------------------------------------------------------- /04_lunar_lander/pyproject.toml: -------------------------------------------------------------------------------- 1 | [tool.poetry] 2 | name = "src" 3 | version = "0.1.0" 4 | description = "" 5 | authors = ["Pau "] 6 | 7 | [tool.poetry.dependencies] 8 | python = ">=3.8,<3.11" 9 | numpy = "^1.22.3" 10 | torch = "^1.11.0" 11 | scipy = "^1.8.0" 12 | Box2D = "^2.3.10" 13 | box2d-py = "^2.3.8" 14 | gym = "0.17.2" 15 | tensorboard = "^2.8.0" 16 | tqdm = "^4.64.0" 17 | jupyter = "^1.0.0" 18 | matplotlib = "^3.5.1" 19 | pandas = "^1.4.2" 20 | pyglet = "1.5.0" 21 | 22 | [tool.poetry.dev-dependencies] 23 | pytest = "^5.2" 24 | 25 | [build-system] 26 | requires = ["poetry-core>=1.0.0"] 27 | build-backend = "poetry.core.masonry.api" 28 | -------------------------------------------------------------------------------- /04_lunar_lander/requirements.txt: -------------------------------------------------------------------------------- 1 | absl-py==1.0.0; python_version >= "3.6" 2 | appnope==0.1.3; platform_system == "Darwin" and python_version >= "3.8" and sys_platform == "darwin" 3 | argon2-cffi-bindings==21.2.0; python_version >= "3.6" 4 | argon2-cffi==21.3.0; python_version >= "3.6" 5 | asttokens==2.0.5; python_version >= "3.8" 6 | atomicwrites==1.4.0; 
python_version >= "3.5" and python_full_version < "3.0.0" and sys_platform == "win32" or sys_platform == "win32" and python_version >= "3.5" and python_full_version >= "3.4.0" 7 | attrs==21.4.0; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.7" 8 | backcall==0.2.0; python_version >= "3.8" 9 | beautifulsoup4==4.11.1; python_full_version >= "3.6.0" and python_version >= "3.7" 10 | bleach==5.0.0; python_version >= "3.7" 11 | box2d-py==2.3.8 12 | box2d==2.3.10 13 | cachetools==5.0.0; python_version >= "3.7" and python_version < "4.0" and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6") 14 | certifi==2021.10.8; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6" 15 | cffi==1.15.0; implementation_name == "pypy" and python_version >= "3.6" 16 | charset-normalizer==2.0.12; python_full_version >= "3.6.0" and python_version >= "3.6" 17 | cloudpickle==1.2.2; python_version >= "3.5" 18 | colorama==0.4.4; python_version >= "3.8" and python_full_version < "3.0.0" and platform_system == "Windows" and sys_platform == "win32" or python_full_version >= "3.5.0" and platform_system == "Windows" and sys_platform == "win32" and python_version >= "3.8" 19 | cycler==0.11.0; python_version >= "3.7" 20 | debugpy==1.6.0; python_version >= "3.7" 21 | decorator==5.1.1; python_version >= "3.8" 22 | defusedxml==0.7.1; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.7" 23 | entrypoints==0.4; python_version >= "3.7" 24 | executing==0.8.3; python_version >= "3.8" 25 | fastjsonschema==2.15.3; python_version >= "3.7" 26 | fonttools==4.32.0; python_version >= "3.7" 27 | future==0.18.2; python_version >= "3.5" and python_full_version < "3.0.0" or python_full_version >= "3.3.0" and python_version >= "3.5" 28 | google-auth-oauthlib==0.4.6; python_version >= "3.6" 29 | google-auth==2.6.3; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6" 30 | grpcio==1.45.0; python_version >= "3.6" 31 | gym==0.17.2; python_version >= "3.5" 32 | idna==3.3; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6" 33 | importlib-metadata==4.11.3; python_version < "3.10" and python_version >= "3.7" 34 | importlib-resources==5.7.1; python_version < "3.9" and python_version >= "3.7" 35 | ipykernel==6.13.0; python_version >= "3.7" 36 | ipython-genutils==0.2.0; python_version >= "3.7" 37 | ipython==8.2.0; python_version >= "3.8" 38 | ipywidgets==7.7.0 39 | jedi==0.18.1; python_version >= "3.8" 40 | jinja2==3.1.1; python_version >= "3.7" 41 | jsonschema==4.4.0; python_version >= "3.7" 42 | jupyter-client==7.2.2; python_full_version >= "3.7.0" and python_version >= "3.7" 43 | jupyter-console==6.4.3; python_version >= "3.6" 44 | jupyter-core==4.10.0; python_version >= "3.7" 45 | jupyter==1.0.0 46 | jupyterlab-pygments==0.2.2; python_version >= "3.7" 47 | jupyterlab-widgets==1.1.0; python_version >= "3.6" 48 | kiwisolver==1.4.2; python_version >= "3.7" 49 | markdown==3.3.6; python_version >= "3.6" 50 | markupsafe==2.1.1; python_version >= "3.7" 51 | matplotlib-inline==0.1.3; python_version >= "3.8" 52 | matplotlib==3.5.1; python_version >= "3.7" 53 | mistune==0.8.4; python_version >= "3.7" 54 | more-itertools==8.12.0; python_version >= "3.5" 55 | 
nbclient==0.6.0; python_full_version >= "3.7.0" and python_version >= "3.7" 56 | nbconvert==6.5.0; python_version >= "3.7" 57 | nbformat==5.3.0; python_full_version >= "3.7.0" and python_version >= "3.7" 58 | nest-asyncio==1.5.5; python_full_version >= "3.7.0" and python_version >= "3.7" 59 | notebook==6.4.10; python_version >= "3.6" 60 | numpy==1.22.3; python_version >= "3.8" 61 | oauthlib==3.2.0; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6" 62 | packaging==21.3; python_version >= "3.7" 63 | pandas==1.4.2; python_version >= "3.8" 64 | pandocfilters==1.5.0; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.7" 65 | parso==0.8.3; python_version >= "3.8" 66 | pexpect==4.8.0; sys_platform != "win32" and python_version >= "3.8" 67 | pickleshare==0.7.5; python_version >= "3.8" 68 | pillow==9.0.1; python_version >= "3.7" 69 | pluggy==0.13.1; python_version >= "3.5" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.5" 70 | prometheus-client==0.14.1; python_version >= "3.6" 71 | prompt-toolkit==3.0.29; python_full_version >= "3.6.2" and python_version >= "3.8" 72 | protobuf==3.20.0; python_version >= "3.7" 73 | psutil==5.9.0; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.7" 74 | ptyprocess==0.7.0; os_name != "nt" and python_version >= "3.8" and sys_platform != "win32" 75 | pure-eval==0.2.2; python_version >= "3.8" 76 | py==1.11.0; python_version >= "3.6" and python_full_version < "3.0.0" and implementation_name == "pypy" or python_full_version >= "3.5.0" and python_version >= "3.6" and implementation_name == "pypy" 77 | pyasn1-modules==0.2.8; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6" 78 | pyasn1==0.4.8; python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "4" and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6") or python_full_version >= "3.6.0" and python_version >= "3.6" and python_version < "4" and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6") 79 | pycparser==2.21; python_version >= "3.6" and python_full_version < "3.0.0" and implementation_name == "pypy" or implementation_name == "pypy" and python_version >= "3.6" and python_full_version >= "3.4.0" 80 | pyglet==1.5.0 81 | pygments==2.11.2; python_version >= "3.8" 82 | pyparsing==3.0.7; python_version >= "3.7" 83 | pyrsistent==0.18.1; python_version >= "3.7" 84 | pytest==5.4.3; python_version >= "3.5" 85 | python-dateutil==2.8.2; python_version >= "3.8" and python_full_version < "3.0.0" or python_full_version >= "3.3.0" and python_version >= "3.8" 86 | pytz==2022.1; python_version >= "3.8" 87 | pywin32==303; sys_platform == "win32" and platform_python_implementation != "PyPy" and python_version >= "3.7" 88 | pywinpty==2.0.5; os_name == "nt" and python_version >= "3.7" 89 | pyzmq==22.3.0; python_version >= "3.7" 90 | qtconsole==5.3.0; python_version >= "3.7" 91 | qtpy==2.0.1; python_version >= "3.7" 92 | requests-oauthlib==1.3.1; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6" 93 | requests==2.27.1; python_version >= "3.6" and python_full_version < "3.0.0" or 
python_full_version >= "3.6.0" and python_version >= "3.6" 94 | rsa==4.8; python_version >= "3.6" and python_version < "4" and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6") 95 | scipy==1.8.0; python_version >= "3.8" and python_version < "3.11" 96 | send2trash==1.8.0; python_version >= "3.6" 97 | setuptools-scm==6.4.2; python_version >= "3.7" 98 | six==1.16.0; python_version >= "3.8" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.8" 99 | soupsieve==2.3.2.post1; python_full_version >= "3.6.0" and python_version >= "3.7" 100 | stack-data==0.2.0; python_version >= "3.8" 101 | tensorboard-data-server==0.6.1; python_version >= "3.6" 102 | tensorboard-plugin-wit==1.8.1; python_version >= "3.6" 103 | tensorboard==2.8.0; python_version >= "3.6" 104 | terminado==0.13.3; python_version >= "3.7" 105 | tinycss2==1.1.1; python_version >= "3.7" 106 | tomli==2.0.1; python_version >= "3.7" 107 | torch==1.11.0; python_full_version >= "3.7.0" 108 | tornado==6.1; python_version >= "3.7" 109 | tqdm==4.64.0; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.4.0") 110 | traitlets==5.1.1; python_full_version >= "3.7.0" and python_version >= "3.8" 111 | typing-extensions==4.1.1; python_version >= "3.6" and python_full_version >= "3.7.0" 112 | urllib3==1.26.9; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version < "4" and python_version >= "3.6" 113 | wcwidth==0.2.5; python_full_version >= "3.6.2" and python_version >= "3.6" 114 | webencodings==0.5.1; python_version >= "3.7" 115 | werkzeug==2.1.1; python_version >= "3.7" 116 | widgetsnbextension==3.6.0 117 | zipp==3.8.0; python_version < "3.9" and python_version >= "3.7" 118 | -------------------------------------------------------------------------------- /04_lunar_lander/saved_agents/readme.md: -------------------------------------------------------------------------------- 1 | ### Trained agents are saved in this folder -------------------------------------------------------------------------------- /04_lunar_lander/src/config.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pathlib 3 | root_dir = pathlib.Path(__file__).parent.resolve().parent 4 | 5 | SAVED_AGENTS_DIR = root_dir / 'saved_agents' 6 | TENSORBOARD_LOG_DIR = root_dir / 'tensorboard_logs' 7 | # OPTUNA_DB = root_dir / 'optuna.db' 8 | # DATA_SUPERVISED_ML = root_dir / 'data_supervised_ml' 9 | # MLFLOW_RUNS_DIR = root_dir / 'mlflow_runs' 10 | 11 | if not SAVED_AGENTS_DIR.exists(): 12 | os.makedirs(SAVED_AGENTS_DIR) 13 | 14 | if not TENSORBOARD_LOG_DIR.exists(): 15 | os.makedirs(TENSORBOARD_LOG_DIR) 16 | 17 | # if not DATA_SUPERVISED_ML.exists(): 18 | # os.makedirs(DATA_SUPERVISED_ML) 19 | # 20 | # if not MLFLOW_RUNS_DIR.exists(): 21 | # os.makedirs(MLFLOW_RUNS_DIR) -------------------------------------------------------------------------------- /04_lunar_lander/src/evaluation.py: -------------------------------------------------------------------------------- 1 | from typing import Optional, Tuple, List 2 | from tqdm import tqdm 3 | 4 | import torch 5 | 6 | 7 | def evaluate( 8 | agent, 9 | env, 10 | n_episodes: int, 11 | seed: Optional[int] = 0, 12 | ) -> Tuple[List, List]: 13 | 14 | # from src.utils import set_seed 15 | # set_seed(env, seed) 16 | 17 | # output metrics 18 | reward_per_episode = [] 19 | steps_per_episode = [] 20 
| 21 | for i in tqdm(range(0, n_episodes)): 22 | 23 | state = env.reset() 24 | rewards = 0 25 | steps = 0 26 | done = False 27 | while not done: 28 | 29 | action = agent.act(torch.as_tensor(state, dtype=torch.float32)) 30 | 31 | next_state, reward, done, info = env.step(action) 32 | 33 | rewards += reward 34 | steps += 1 35 | state = next_state 36 | 37 | reward_per_episode.append(rewards) 38 | steps_per_episode.append(steps) 39 | 40 | return reward_per_episode, steps_per_episode -------------------------------------------------------------------------------- /04_lunar_lander/src/model_factory.py: -------------------------------------------------------------------------------- 1 | from typing import Optional, List 2 | from pdb import set_trace as stop 3 | 4 | import torch.nn as nn 5 | 6 | 7 | def get_model( 8 | input_dim: int, 9 | output_dim: int, 10 | hidden_layers: Optional[List[int]] = None, 11 | ): 12 | """ 13 | Feed-forward network, made of linear layers with ReLU activation functions 14 | The number of layers, and their size is given by `hidden_layers`. 15 | """ 16 | # assert init_method in {'default', 'xavier'} 17 | 18 | if hidden_layers is None: 19 | # linear model 20 | model = nn.Sequential(nn.Linear(input_dim, output_dim)) 21 | 22 | else: 23 | # neural network 24 | # there are hidden layers in this case. 25 | dims = [input_dim] + hidden_layers + [output_dim] 26 | modules = [] 27 | for i, dim in enumerate(dims[:-2]): 28 | modules.append(nn.Linear(dims[i], dims[i + 1])) 29 | modules.append(nn.ReLU()) 30 | 31 | modules.append(nn.Linear(dims[-2], dims[-1])) 32 | model = nn.Sequential(*modules) 33 | # stop() 34 | 35 | # n_parameters = sum(p.numel() for p in model.parameters() if p.requires_grad) 36 | # print(f'{n_parameters:,} parameters') 37 | 38 | return model 39 | 40 | def count_parameters(model: nn.Module) -> int: 41 | """""" 42 | return sum(p.numel() for p in model.parameters() if p.requires_grad) -------------------------------------------------------------------------------- /04_lunar_lander/src/utils.py: -------------------------------------------------------------------------------- 1 | import os 2 | from typing import Callable, Dict, Tuple, List 3 | import pathlib 4 | from pathlib import Path 5 | import json 6 | from pdb import set_trace as stop 7 | 8 | import torch.nn as nn 9 | from torch.utils.tensorboard import SummaryWriter 10 | 11 | from src.config import TENSORBOARD_LOG_DIR, SAVED_AGENTS_DIR 12 | 13 | 14 | # def snake_to_camel(word): 15 | # import re 16 | # return ''.join(x.capitalize() or '_' for x in word.split('_')) 17 | 18 | def get_agent_id(env_name: str) -> str: 19 | """""" 20 | dir = Path(SAVED_AGENTS_DIR) / env_name 21 | if not dir.exists(): 22 | os.makedirs(dir) 23 | 24 | ids = [] 25 | for id in os.listdir(dir): 26 | try: 27 | ids.append(int(id)) 28 | except: 29 | pass 30 | if len(ids) > 0: 31 | agent_id = max(ids) + 1 32 | else: 33 | agent_id = 0 34 | # stop() 35 | 36 | return str(agent_id) 37 | 38 | 39 | def set_seed( 40 | env, 41 | seed 42 | ): 43 | """To ensure reproducible runs we fix the seed for different libraries""" 44 | import random 45 | random.seed(seed) 46 | 47 | import numpy as np 48 | np.random.seed(seed) 49 | 50 | env.seed(seed) 51 | env.action_space.seed(seed) 52 | 53 | import torch 54 | torch.manual_seed(seed) 55 | 56 | # Deterministic operations for CuDNN, it may impact performances 57 | torch.backends.cudnn.deterministic = True 58 | torch.backends.cudnn.benchmark = False 59 | 60 | def get_num_model_parameters(model: nn.Module) -> int: 61 | 
return sum(p.numel() for p in model.parameters() if p.requires_grad) 62 | 63 | def get_logger(env_name: str, agent_id: str) -> SummaryWriter: 64 | return SummaryWriter(TENSORBOARD_LOG_DIR / env_name / agent_id) 65 | 66 | def get_model_path(env_name: str, agent_id: str) -> Path: 67 | """ 68 | Returns path where we save train artifacts, including: 69 | -> the policy network weights 70 | -> json with hyperparameters 71 | """ 72 | return SAVED_AGENTS_DIR / env_name / agent_id 73 | 74 | 75 | -------------------------------------------------------------------------------- /04_lunar_lander/src/viz.py: -------------------------------------------------------------------------------- 1 | from time import sleep 2 | from argparse import ArgumentParser 3 | from pdb import set_trace as stop 4 | from typing import Optional 5 | 6 | import pandas as pd 7 | import gym 8 | 9 | from src.config import SAVED_AGENTS_DIR 10 | 11 | import numpy as np 12 | 13 | def make_video(agent): 14 | 15 | import gym 16 | from gym.wrappers import Monitor 17 | env = Monitor(gym.make('CartPole-v0'), './video', force=True) 18 | 19 | state = env.reset() 20 | done = False 21 | 22 | while not done: 23 | # action = env.action_space.sample() 24 | import torch 25 | action = agent.act(torch.as_tensor(state, dtype=torch.float32)) 26 | state_next, reward, done, info = env.step(action) 27 | env.close() 28 | 29 | 30 | def show_video( 31 | agent, 32 | env, 33 | sleep_sec: float = 0.1, 34 | seed: Optional[int] = 0, 35 | mode: str = "rgb_array" 36 | ): 37 | 38 | env.seed(seed) 39 | state = env.reset() 40 | 41 | # LAPADULA 42 | if mode == "rgb_array": 43 | from matplotlib import pyplot as plt 44 | from IPython.display import display, clear_output 45 | steps = 0 46 | fig, ax = plt.subplots(figsize=(8, 6)) 47 | 48 | done = False 49 | while not done: 50 | 51 | import torch 52 | action = agent.act(torch.as_tensor(state, dtype=torch.float32)) 53 | 54 | state, reward, done, info = env.step(action) 55 | 56 | # LAPADULA 57 | if mode == "rgb_array": 58 | steps += 1 59 | frame = env.render(mode=mode) 60 | ax.cla() 61 | ax.axes.yaxis.set_visible(False) 62 | ax.imshow(frame) 63 | ax.set_title(f'Steps: {steps}') 64 | display(fig) 65 | clear_output(wait=True) 66 | plt.pause(sleep_sec) 67 | else: 68 | env.render() 69 | sleep(sleep_sec) 70 | 71 | 72 | if __name__ == '__main__': 73 | 74 | parser = ArgumentParser() 75 | parser.add_argument('--agent_id', type=str, required=True) 76 | parser.add_argument('--sleep_sec', type=float, required=False, default=0.1) 77 | args = parser.parse_args() 78 | 79 | from src.base_agent import BaseAgent 80 | agent_path = SAVED_AGENTS_DIR / args.agent_file 81 | agent = BaseAgent.load_from_disk(agent_path) 82 | 83 | from src.q_agent import QAgent 84 | 85 | 86 | env = gym.make('CartPole-v1') 87 | # env._max_episode_steps = 1000 88 | 89 | show_video(agent, env, sleep_sec=args.sleep_sec) 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | -------------------------------------------------------------------------------- /04_lunar_lander/src/vpg_agent.py: -------------------------------------------------------------------------------- 1 | import os 2 | from typing import List, Optional, Tuple 3 | from pathlib import Path 4 | import json 5 | from pdb import set_trace as stop 6 | 7 | from tqdm import tqdm 8 | import numpy as np 9 | import torch 10 | from torch.optim import Adam 11 | from torch.utils.tensorboard import SummaryWriter 12 | from torch.distributions.categorical import Categorical 13 | import gym 14 | 15 | from src.model_factory import 
get_model 16 | from src.utils import ( 17 | get_agent_id, 18 | set_seed, 19 | get_num_model_parameters, 20 | get_logger, get_model_path 21 | ) 22 | from src.config import TENSORBOARD_LOG_DIR, SAVED_AGENTS_DIR 23 | 24 | def reward_to_go(rews): 25 | 26 | n = len(rews) 27 | rtgs = np.zeros_like(rews) 28 | for i in reversed(range(n)): 29 | rtgs[i] = rews[i] + (rtgs[i+1] if i+1 < n else 0) 30 | return rtgs 31 | 32 | 33 | class VPGAgent: 34 | 35 | def __init__( 36 | self, 37 | env_name: str = 'LunarLander-v2', 38 | learning_rate: float = 3e-4, 39 | hidden_layers: List[int] = [32], 40 | gradient_weights: str = 'rewards' 41 | ): 42 | assert gradient_weights in {'rewards', 'rewards-to-go'} 43 | 44 | self.env_name = env_name 45 | self.env = gym.make(env_name) 46 | self.obs_dim = self.env.observation_space.shape[0] 47 | self.act_dim = self.env.action_space.n 48 | 49 | # stochastic policy network 50 | # the outputs of this network are un-normalized probabilities for each 51 | # action (aka logits) 52 | self.policy_net = get_model(input_dim=self.obs_dim, 53 | output_dim=self.act_dim, 54 | hidden_layers=hidden_layers) 55 | print(f'Policy network with {get_num_model_parameters(self.policy_net):,} parameters') 56 | print(self.policy_net) 57 | 58 | 59 | self.optimizer = Adam(self.policy_net.parameters(), lr=learning_rate) 60 | 61 | self.gradient_weights = gradient_weights 62 | 63 | self.hparams = { 64 | 'learning_rate': learning_rate, 65 | 'hidden_layers': hidden_layers, 66 | 'gradient_weights': gradient_weights, 67 | 68 | } 69 | 70 | def act(self, obs: torch.Tensor): 71 | """ 72 | Action selection function (outputs int actions, sampled from policy) 73 | """ 74 | return self._get_policy(obs).sample().item() 75 | 76 | def train( 77 | self, 78 | n_policy_updates: int = 1000, 79 | batch_size: int = 4000, 80 | logger: Optional[SummaryWriter] = None, 81 | model_path: Optional[Path] = None, 82 | seed: Optional[int] = 0, 83 | freq_eval: Optional[int] = 10, 84 | ): 85 | """ 86 | """ 87 | total_steps = 0 88 | save_model = True if model_path is not None else False 89 | 90 | best_avg_reward = -np.inf 91 | 92 | # fix seeds to ensure reproducible training runs 93 | set_seed(self.env, seed) 94 | 95 | for i in range(n_policy_updates): 96 | 97 | # use current policy to collect trajectories 98 | states, actions, weights, rewards = self._collect_trajectories(n_samples=batch_size) 99 | 100 | # one step of gradient ascent to update policy parameters 101 | loss = self._update_parameters(states, actions, weights) 102 | 103 | # log epoch metrics 104 | print('epoch: %3d \t loss: %.3f \t reward: %.3f' % (i, loss, np.mean(rewards))) 105 | if logger is not None: 106 | # we use total_steps instead of epoch to render all plots in Tensorboard comparable 107 | # Agents wit different batch_size (aka steps_per_epoch) are fairly compared this way. 108 | total_steps += batch_size 109 | logger.add_scalar('train/loss', loss, total_steps) 110 | logger.add_scalar('train/episode_reward', np.mean(rewards), total_steps) 111 | 112 | # evaluate the agent on a fixed set of 100 episodes 113 | if (i + 1) % freq_eval == 0: 114 | rewards, success = self.evaluate(n_episodes=100) 115 | 116 | avg_reward = np.mean(rewards) 117 | avg_success_rate = np.mean(success) 118 | if save_model and (avg_reward > best_avg_reward): 119 | self.save_to_disk(model_path) 120 | print(f'Best model! 
Average reward = {avg_reward:.2f}, Success rate = {avg_success_rate:.2%}') 121 | 122 | best_avg_reward = avg_reward 123 | 124 | def evaluate(self, n_episodes: Optional[int] = 100, seed: Optional[int] = 1234) -> Tuple[List[float], List[float]]: 125 | """ 126 | """ 127 | # output metrics 128 | reward_per_episode = [] 129 | success_per_episode = [] 130 | 131 | # fix seed 132 | self.env.seed(seed) 133 | self.env.action_space.seed(seed) 134 | 135 | for i in tqdm(range(0, n_episodes)): 136 | 137 | state = self.env.reset() 138 | rewards = 0 139 | done = False 140 | reward = None 141 | while not done: 142 | 143 | action = self.act(torch.as_tensor(state, dtype=torch.float32)) 144 | 145 | next_state, reward, done, info = self.env.step(action) 146 | rewards += reward 147 | 148 | state = next_state 149 | 150 | reward_per_episode.append(rewards) 151 | success_per_episode.append(1 if reward > 0 else 0) 152 | 153 | return reward_per_episode, success_per_episode 154 | 155 | def _collect_trajectories(self, n_samples: int): 156 | 157 | # make some empty lists for logging. 158 | batch_obs = [] # for observations 159 | batch_acts = [] # for actions 160 | batch_weights = [] # for reward-to-go weighting in policy gradient 161 | batch_rets = [] # for measuring episode returns 162 | batch_lens = [] # for measuring episode lengths 163 | 164 | # reset episode-specific variables 165 | obs = self.env.reset() # first obs comes from starting distribution 166 | done = False # signal from environment that episode is over 167 | ep_rews = [] # list for rewards accrued throughout ep 168 | 169 | # collect experience by acting in the environment with current policy 170 | while True: 171 | 172 | # save obs 173 | batch_obs.append(obs.copy()) 174 | 175 | # act in the environment 176 | # act = get_action(torch.as_tensor(obs, dtype=torch.float32)) 177 | action = self.act(torch.as_tensor(obs, dtype=torch.float32)) 178 | obs, rew, done, _ = self.env.step(action) 179 | 180 | # save action, reward 181 | batch_acts.append(action) 182 | ep_rews.append(rew) 183 | 184 | if done: 185 | # if episode is over, record info about episode 186 | ep_ret, ep_len = sum(ep_rews), len(ep_rews) 187 | batch_rets.append(ep_ret) 188 | batch_lens.append(ep_len) 189 | 190 | # the weight for each logprob(a_t|s_t) is reward-to-go from t 191 | if self.gradient_weights == 'rewards': 192 | # the weight for each logprob(a|s) is the total reward for the episode 193 | batch_weights += [ep_ret] * ep_len 194 | elif self.gradient_weights == 'rewards-to-go': 195 | # the weight for each logprob(a|s) is the total reward AFTER the action is taken 196 | batch_weights += list(reward_to_go(ep_rews)) 197 | else: 198 | raise NotImplemented 199 | 200 | # reset episode-specific variables 201 | obs, done, ep_rews = self.env.reset(), False, [] 202 | 203 | # end experience loop if we have enough of it 204 | if len(batch_obs) > n_samples: 205 | break 206 | 207 | return batch_obs, batch_acts, batch_weights, batch_rets 208 | 209 | def _update_parameters(self, states, actions, weights) -> float: 210 | """ 211 | One step of policy gradient update 212 | """ 213 | self.optimizer.zero_grad() 214 | 215 | loss = self._compute_loss( 216 | obs=torch.as_tensor(states, dtype=torch.float32), 217 | act=torch.as_tensor(actions, dtype=torch.int32), 218 | weights=torch.as_tensor(weights, dtype=torch.float32) 219 | ) 220 | 221 | # compute gradients 222 | loss.backward() 223 | 224 | # update parameters with Adam 225 | self.optimizer.step() 226 | 227 | return loss.item() 228 | 229 | def 
_compute_loss(self, obs, act, weights): 230 | logp = self._get_policy(obs).log_prob(act) 231 | return -(logp * weights).mean() 232 | 233 | def _get_policy(self, obs): 234 | """ 235 | Get action distribution given the current policy 236 | """ 237 | logits = self.policy_net(obs) 238 | return Categorical(logits=logits) 239 | 240 | @classmethod 241 | def load_from_disk(cls, env_name: str, path: Path): 242 | """ 243 | We recover all necessary variables to be able to evaluate the agent. 244 | 245 | NOTE: training state is not stored, so it is not possible to resume 246 | an interrupted training run as it was. 247 | """ 248 | # load hyper-params 249 | with open(path / 'hparams.json', 'r') as f: 250 | hparams = json.load(f) 251 | 252 | # generate Python object 253 | agent = cls(env_name, **hparams) 254 | 255 | agent.policy_net = torch.load(path / 'model') 256 | agent.policy_net.eval() 257 | 258 | return agent 259 | 260 | def save_to_disk(self, path: Path) -> None: 261 | """""" 262 | if not path.exists(): 263 | os.makedirs(path) 264 | 265 | # save hyper-parameters in a json file 266 | with open(path / 'hparams.json', 'w') as f: 267 | json.dump(self.hparams, f) 268 | 269 | # save main model 270 | torch.save(self.policy_net, path / 'model') 271 | 272 | 273 | if __name__ == '__main__': 274 | 275 | import argparse 276 | parser = argparse.ArgumentParser() 277 | parser.add_argument('--env', type=str, default='CartPole-v0') 278 | parser.add_argument('--n_policy_updates', type=int, default=1000) 279 | parser.add_argument('--batch_size', type=int, default=128) 280 | parser.add_argument('--gradient_weights', type=str, default='rewards') 281 | parser.add_argument('--lr', type=float, default=3e-4) 282 | parser.add_argument("--hidden_layers", type=int, nargs="+",) 283 | parser.add_argument("--freq_eval", type=int) 284 | args = parser.parse_args() 285 | 286 | vpg_agent = VPGAgent( 287 | env_name=args.env, 288 | gradient_weights=args.gradient_weights, 289 | learning_rate=args.lr, 290 | hidden_layers=args.hidden_layers, 291 | ) 292 | 293 | # generate a unique agent_id, that we later use to save results to disk, as 294 | # well as TensorBoard logs. 
295 | agent_id = get_agent_id(args.env) 296 | print(f'agent_id = {agent_id}') 297 | 298 | # tensorboard logger to see training curves 299 | logger = get_logger(env_name=args.env, agent_id=agent_id) 300 | 301 | # path to save policy network weights 302 | model_path = get_model_path(env_name=args.env, agent_id=agent_id) 303 | 304 | # start training 305 | vpg_agent.train( 306 | n_policy_updates=args.n_policy_updates, 307 | batch_size=args.batch_size, 308 | logger=logger, 309 | model_path=model_path, 310 | freq_eval=args.freq_eval, 311 | ) -------------------------------------------------------------------------------- /04_lunar_lander/tensorboard_logs/readme.md: -------------------------------------------------------------------------------- 1 | ### Tensorboard logs for each train run are stored in this folder 2 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2022 Pau Labarta Bajo 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 |
2 | The Hands-on Reinforcement Learning course 🚀 3 | From zero to HERO 🦸🏻‍🦸🏽 4 | Out of intense complexities, intense simplicities emerge. 5 | -- Winston Churchill 6 |
7 | 8 | ![](http://datamachines.xyz/wp-content/uploads/2021/11/PHOTO-2021-11-05-13-54-11.jpg) 9 | 10 | [![Twitter Follow](https://img.shields.io/twitter/follow/paulabartabajo_?label=Follow&style=social)](https://twitter.com/paulabartabajo_) 11 | 12 | ## Contents 13 | 14 | * [Welcome to the course](#welcome-to-the-course-) 15 | * [Lectures](#lectures) 16 | * [Wanna contribute?](#wanna-contribute) 17 | * [Let's connect!](#lets-connect) 18 | 19 | ## Welcome to the course 🤗❤️ 20 | 21 | Welcome to my step by step hands-on-course that will take you from basic reinforcement learning to cutting-edge deep RL. 22 | 23 | We will start with a short intro of what RL is, what is it used for, and how does the landscape of current 24 | RL algorithms look like. 25 | 26 | Then, in each following chapter we will solve a different problem, with increasing difficulty: 27 | - 🏆 easy 28 | - 🏆🏆 medium 29 | - 🏆🏆🏆 hard 30 | 31 | Ultimately, the most complex RL problems involve a mixture of reinforcement learning algorithms, optimizations and Deep Learning techniques. 32 | 33 | You do not need to know deep learning (DL) to follow along this course. 34 | 35 | I will give you enough context to get you familiar with DL philosophy and understand 36 | how it becomes a crucial ingredient in modern reinforcement learning. 37 | 38 | ## Lectures 39 | 40 | 0. [Introduction to Reinforcement Learning](https://datamachines.xyz/2021/11/17/hands-on-reinforcement-learning-course-part-1/) 41 | 1. [Q-learning to drive a taxi 🏆](01_taxi/README.md) 42 | 2. [SARSA to beat gravity 🏆](02_mountain_car/README.md) 43 | 3. [Parametric Q learning to keep the balance 💃 🏆](03_cart_pole/README.md) 44 | 4. [Policy gradients to land on the Moon 🏆](04_lunar_lander/README.md) 45 | 46 | ## Wanna contribute? 47 | 48 | There are 2 things you can do to contribute to this course: 49 | 50 | 1. Spread the word and share it on [Twitter](https://ctt.ac/Aa7dt), [LinkedIn](https://www.linkedin.com/shareArticle?mini=true&url=http%3A//datamachines.xyz/the-hands-on-reinforcement-learning-course-page/&title=The%20hands-on%20Reinforcement%20Learning%20course&summary=Wanna%20learn%20Reinforcement%20Learning?%20%F0%9F%A4%94%0A%40paulabartabajo%20has%20a%20course%20on%20%23reinforcementlearning,%20that%20takes%20you%20from%20zero%20to%20PRO%20%F0%9F%A6%B8%F0%9F%8F%BB%E2%80%8D%F0%9F%A6%B8%F0%9F%8F%BD.%0A%0A%F0%9F%91%89%F0%9F%8F%BD%20With%20lots%20of%20Python%0A%F0%9F%91%89%F0%9F%8F%BD%20Intuitions,%20tips%20%26%20tricks%20explained.%0A%F0%9F%91%89%F0%9F%8F%BD%20And%20free,%20by%20the%20way.%0A%0AReady%20to%20start?%20Click%20%F0%9F%91%87%F0%9F%8F%BD%F0%9F%91%87%F0%9F%8F%BE%F0%9F%91%87%F0%9F%8F%BF%0A%0A%23MachineLearning&source=) 51 | 52 | 2. Open a [pull request](https://github.com/Paulescu/hands-on-rl/pulls) to fix a bug or improve the code readability. 53 | 54 | ### Thanks ❤️ 55 | Special thanks to all the students who contributed with valuable feedback 56 | and pull requests ❤ 57 | 58 | - [Neria Uzan](https://www.linkedin.com/in/neria-uzan-369803107/) 59 | - [Anthony Lapadula](https://www.linkedin.com/in/anthony-lapadula-9343a5b/) 60 | - [Petar Sekulić](https://www.linkedin.com/in/petar-sekulic-ml/) 61 | 62 | ## Let's connect! 63 | 64 | 👉🏽 Subscribe for **FREE** to the [Real-World ML newsletter](https://realworldml.net/subscribe/) 🧠 65 | 66 | 👉🏽 Follow me on [Twitter](https://twitter.com/paulabartabajo_) and [LinkedIn](https://www.linkedin.com/in/pau-labarta-bajo-4432074b/) 💡 67 | --------------------------------------------------------------------------------