├── 2019 ├── code │ ├── 00-gym.ipynb │ ├── 01-genetics.ipynb │ ├── 02-cem.ipynb │ ├── 03-tabular.ipynb │ ├── 04-dqn.ipynb │ ├── __init__.py │ ├── mdp.py │ ├── mdp_get_action_value.py │ └── qlearning.py ├── slides │ ├── 01-genetics.pdf │ ├── 02-cem.pdf │ ├── 03-tabular.pdf │ └── 04-dqn.pdf └── solutions │ ├── 00-gym.ipynb │ ├── 01-genetics.ipynb │ ├── 02-cem.ipynb │ ├── 03-tabular.ipynb │ ├── 04-dqn.ipynb │ ├── __init__.py │ ├── mdp.py │ ├── mdp_get_action_value.py │ └── qlearning.py ├── 2020 ├── code │ ├── DDPG.ipynb │ ├── DQN.ipynb │ ├── RecSimDemo.ipynb │ ├── RecSysDemo.ipynb │ └── recsim_exp │ │ ├── __init__.py │ │ ├── ddpg.py │ │ └── wolpertinger.py ├── presets │ └── wolpertinger_scheme.png └── requirements.txt ├── .gitignore ├── LICENSE └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | env/ 12 | build/ 13 | builds/ 14 | develop-eggs/ 15 | dist/ 16 | downloads/ 17 | eggs/ 18 | .eggs/ 19 | lib/ 20 | lib64/ 21 | parts/ 22 | sdist/ 23 | var/ 24 | wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | 29 | # PyInstaller 30 | # Usually these files are written by a python script from a template 31 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 32 | *.manifest 33 | *.spec 34 | 35 | # Installer logs 36 | pip-log.txt 37 | pip-delete-this-directory.txt 38 | 39 | # Unit test / coverage reports 40 | htmlcov/ 41 | .tox/ 42 | .coverage 43 | .coverage.* 44 | .cache 45 | nosetests.xml 46 | coverage.xml 47 | *.cover 48 | .hypothesis/ 49 | 50 | # Translations 51 | *.mo 52 | *.pot 53 | 54 | # Django stuff: 55 | *.log 56 | local_settings.py 57 | 58 | # Flask stuff: 59 | instance/ 60 | .webassets-cache 61 | 62 | # Scrapy stuff: 63 | .scrapy 64 | 65 | # Sphinx documentation 66 | docs/_build/ 67 | 68 | # PyBuilder 69 | target/ 70 | 71 | # Jupyter Notebook 72 | .ipynb_checkpoints 73 | 74 | # pyenv 75 | .python-version 76 | 77 | # celery beat schedule file 78 | celerybeat-schedule 79 | 80 | # SageMath parsed files 81 | *.sage.py 82 | 83 | # dotenv 84 | .env 85 | 86 | # virtualenv 87 | .venv 88 | venv/ 89 | ENV/ 90 | 91 | # Spyder project settings 92 | .spyderproject 93 | .spyproject 94 | 95 | # Rope project settings 96 | .ropeproject 97 | 98 | # mkdocs documentation 99 | /site 100 | 101 | # mypy 102 | .mypy_cache/ 103 | 104 | 105 | 106 | .DS_Store 107 | .idea 108 | .code 109 | 110 | *.bak 111 | *.csv 112 | *.tsv 113 | *.ipynb 114 | 115 | tmp/ 116 | logs/ 117 | # Examples - mock data 118 | !examples/distilbert_text_classification/input/*.csv 119 | !examples/_tests_distilbert_text_classification/input/*.csv 120 | examples/logs/ 121 | notebooks/ 122 | 123 | _nogit* 124 | 125 | ### VisualStudioCode ### 126 | .vscode/* 127 | .vscode/settings.json 128 | !.vscode/tasks.json 129 | !.vscode/launch.json 130 | !.vscode/extensions.json 131 | 132 | ### VisualStudioCode Patch ### 133 | # Ignore all local history of files 134 | .history 135 | 136 | # End of https://www.gitignore.io/api/visualstudiocode 137 | 138 | presets/ 139 | -------------------------------------------------------------------------------- /2019/code/00-gym.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import 
numpy as np\n", 10 | "import matplotlib.pyplot as plt\n", 11 | "%matplotlib inline\n", 12 | "# In google collab, uncomment this:\n", 13 | "# !wget https://bit.ly/2FMJP5K -O setup.py && bash setup.py\n", 14 | "\n", 15 | "# This code creates a virtual display to draw game images on.\n", 16 | "# If you are running locally, just ignore it\n", 17 | "# import os\n", 18 | "# if type(os.environ.get(\"DISPLAY\")) is not str or len(os.environ.get(\"DISPLAY\")) == 0:\n", 19 | "# !bash ../xvfb start\n", 20 | "# %env DISPLAY = : 1" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "### OpenAI Gym\n", 28 | "\n", 29 | "We're gonna spend several next weeks learning algorithms that solve decision processes. We are then in need of some interesting decision problems to test our algorithms.\n", 30 | "\n", 31 | "That's where OpenAI gym comes into play. It's a python library that wraps many classical decision problems including robot control, videogames and board games.\n", 32 | "\n", 33 | "So here's how it works:" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": null, 39 | "metadata": {}, 40 | "outputs": [], 41 | "source": [ 42 | "import gym\n", 43 | "env = gym.make(\"MountainCar-v0\")\n", 44 | "\n", 45 | "plt.imshow(env.render('rgb_array'))\n", 46 | "plt.show()\n", 47 | "print(\"Observation space:\", env.observation_space)\n", 48 | "print(\"Action space:\", env.action_space)" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "metadata": {}, 54 | "source": [ 55 | "Note: if you're running this on your local machine, you'll see a window pop up with the image above. Don't close it, just alt-tab away." 56 | ] 57 | }, 58 | { 59 | "cell_type": "markdown", 60 | "metadata": {}, 61 | "source": [ 62 | "### Gym interface\n", 63 | "\n", 64 | "The three main methods of an environment are\n", 65 | "* __reset()__ - reset environment to initial state, _return first observation_\n", 66 | "* __render()__ - show current environment state (a more colorful version :) )\n", 67 | "* __step(a)__ - commit action __a__ and return (new observation, reward, is done, info)\n", 68 | " * _new observation_ - an observation right after commiting the action __a__\n", 69 | " * _reward_ - a number representing your reward for commiting action __a__\n", 70 | " * _is done_ - True if the MDP has just finished, False if still in progress\n", 71 | " * _info_ - some auxilary stuff about what just happened. Ignore it ~~for now~~." 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": null, 77 | "metadata": { 78 | "scrolled": true 79 | }, 80 | "outputs": [], 81 | "source": [ 82 | "obs0 = env.reset()\n", 83 | "print(\"initial observation code:\", obs0)\n", 84 | "\n", 85 | "# Note: in MountainCar, observation is just two numbers: car position and velocity" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": null, 91 | "metadata": {}, 92 | "outputs": [], 93 | "source": [ 94 | "print(\"taking action 2 (right)\")\n", 95 | "new_obs, reward, is_done, _ = env.step(2)\n", 96 | "\n", 97 | "print(\"new observation code:\", new_obs)\n", 98 | "print(\"reward:\", reward)\n", 99 | "print(\"is game over?:\", is_done)\n", 100 | "\n", 101 | "# Note: as you can see, the car has moved to the right slightly (around 0.0005)" 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "metadata": {}, 107 | "source": [ 108 | "### Play with it\n", 109 | "\n", 110 | "Below is the code that drives the car to the right. 
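As a supplementary sketch (not part of the original notebook), here is an end-to-end rollout illustrating the reset/step/done interface described above. It assumes the pre-0.26 gym API used throughout these notebooks, where `reset()` returns an observation and `step()` returns a 4-tuple.

```python
import gym

# Minimal episode loop showing how reset(), step() and the done flag compose.
env = gym.make("MountainCar-v0")

obs = env.reset()
total_reward, done, t = 0.0, False, 0
while not done:
    action = env.action_space.sample()          # random action, just to show the loop
    obs, reward, done, info = env.step(action)
    total_reward += reward
    t += 1

print("episode finished after %i steps, total reward %.1f" % (t, total_reward))
```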
\n", 111 | "\n", 112 | "However, it doesn't reach the flag at the far right due to gravity. \n", 113 | "\n", 114 | "__Your task__ is to fix it. Find a strategy that reaches the flag. \n", 115 | "\n", 116 | "You're not required to build any sophisticated algorithms for now, feel free to hard-code :)\n", 117 | "\n", 118 | "_Hint: your action at each step should depend either on __t__ or on __s__._" 119 | ] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "execution_count": null, 124 | "metadata": {}, 125 | "outputs": [], 126 | "source": [ 127 | "\n", 128 | "# create env manually to set time limit. Please don't change this.\n", 129 | "TIME_LIMIT = 250\n", 130 | "env = gym.wrappers.TimeLimit(\n", 131 | " gym.envs.classic_control.MountainCarEnv(),\n", 132 | " max_episode_steps=TIME_LIMIT + 1)\n", 133 | "s = env.reset()\n", 134 | "actions = {'left': 0, 'stop': 1, 'right': 2}\n", 135 | "\n", 136 | "# prepare \"display\"\n", 137 | "%matplotlib inline\n", 138 | "from IPython.display import clear_output\n", 139 | "\n", 140 | "\n", 141 | "for t in range(TIME_LIMIT):\n", 142 | "\n", 143 | " # change the line below to reach the flag\n", 144 | " s, r, done, _ = env.step(actions['right'])\n", 145 | "\n", 146 | " # draw game image on display\n", 147 | " clear_output(True)\n", 148 | " plt.imshow(env.render('rgb_array'))\n", 149 | "\n", 150 | " if done:\n", 151 | " print(\"Well done!\")\n", 152 | " break\n", 153 | "else:\n", 154 | " print(\"Time limit exceeded. Try again.\");" 155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": null, 160 | "metadata": {}, 161 | "outputs": [], 162 | "source": [ 163 | "assert s[0] > 0.47\n", 164 | "print(\"You solved it!\")" 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": null, 170 | "metadata": {}, 171 | "outputs": [], 172 | "source": [] 173 | } 174 | ], 175 | "metadata": { 176 | "kernelspec": { 177 | "display_name": "Python 3", 178 | "language": "python", 179 | "name": "python3" 180 | }, 181 | "language_info": { 182 | "codemirror_mode": { 183 | "name": "ipython", 184 | "version": 3 185 | }, 186 | "file_extension": ".py", 187 | "mimetype": "text/x-python", 188 | "name": "python", 189 | "nbconvert_exporter": "python", 190 | "pygments_lexer": "ipython3", 191 | "version": "3.7.0" 192 | } 193 | }, 194 | "nbformat": 4, 195 | "nbformat_minor": 1 196 | } 197 | -------------------------------------------------------------------------------- /2019/code/01-genetics.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# FrozenLake\n", 8 | "Today you are going to learn how to survive walking over the (virtual) frozen lake through discrete optimization.\n", 9 | "\n", 10 | "\"a\n" 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": null, 16 | "metadata": {}, 17 | "outputs": [], 18 | "source": [ 19 | "# In google collab, uncomment this:\n", 20 | "# !wget https://bit.ly/2FMJP5K -O setup.py && bash setup.py\n", 21 | "\n", 22 | "# XVFB will be launched if you run on a server\n", 23 | "# import os\n", 24 | "# if type(os.environ.get(\"DISPLAY\")) is not str or len(os.environ.get(\"DISPLAY\")) == 0:\n", 25 | "# !bash ../xvfb start\n", 26 | "# %env DISPLAY = : 1" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": null, 32 | "metadata": {}, 33 | "outputs": [], 34 | "source": [ 35 | "import gym\n", 36 | "\n", 37 | "#create a single game instance\n", 38 | "env = 
gym.make(\"FrozenLake-v0\")\n", 39 | "\n", 40 | "#start new game\n", 41 | "env.reset();" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": null, 47 | "metadata": {}, 48 | "outputs": [], 49 | "source": [ 50 | "# display the game state\n", 51 | "env.render()" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "### legend\n", 59 | "\n", 60 | "![img](https://cdn-images-1.medium.com/max/800/1*MCjDzR-wfMMkS0rPqXSmKw.png)" 61 | ] 62 | }, 63 | { 64 | "cell_type": "markdown", 65 | "metadata": {}, 66 | "source": [ 67 | "### Gym interface\n", 68 | "\n", 69 | "The three main methods of an environment are\n", 70 | "* __reset()__ - reset environment to initial state, _return first observation_\n", 71 | "* __render()__ - show current environment state (a more colorful version :) )\n", 72 | "* __step(a)__ - commit action __a__ and return (new observation, reward, is done, info)\n", 73 | " * _new observation_ - an observation right after commiting the action __a__\n", 74 | " * _reward_ - a number representing your reward for commiting action __a__\n", 75 | " * _is done_ - True if the MDP has just finished, False if still in progress\n", 76 | " * _info_ - some auxilary stuff about what just happened. Ignore it for now" 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": null, 82 | "metadata": { 83 | "scrolled": true 84 | }, 85 | "outputs": [], 86 | "source": [ 87 | "print(\"initial observation code:\", env.reset())\n", 88 | "print('printing observation:')\n", 89 | "env.render()\n", 90 | "print(\"observations:\", env.observation_space, 'n=', env.observation_space.n)\n", 91 | "print(\"actions:\", env.action_space, 'n=', env.action_space.n)" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": null, 97 | "metadata": {}, 98 | "outputs": [], 99 | "source": [ 100 | "print(\"taking action 2 (right)\")\n", 101 | "new_obs, reward, is_done, _ = env.step(2)\n", 102 | "print(\"new observation code:\", new_obs)\n", 103 | "print(\"reward:\", reward)\n", 104 | "print(\"is game over?:\", is_done)\n", 105 | "print(\"printing new state:\")\n", 106 | "env.render()" 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": null, 112 | "metadata": {}, 113 | "outputs": [], 114 | "source": [ 115 | "action_to_i = {\n", 116 | " 'left':0,\n", 117 | " 'down':1,\n", 118 | " 'right':2,\n", 119 | " 'up':3\n", 120 | "}" 121 | ] 122 | }, 123 | { 124 | "cell_type": "markdown", 125 | "metadata": {}, 126 | "source": [ 127 | "### Play with it\n", 128 | "* Try walking 5 steps without falling to the (H)ole\n", 129 | " * Bonus quest - get to the (G)oal\n", 130 | "* Sometimes your actions will not be executed properly due to slipping over ice\n", 131 | "* If you fall, call __env.reset()__ to restart" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": null, 137 | "metadata": {}, 138 | "outputs": [], 139 | "source": [ 140 | "env.step(action_to_i['up'])\n", 141 | "env.render()" 142 | ] 143 | }, 144 | { 145 | "cell_type": "markdown", 146 | "metadata": {}, 147 | "source": [ 148 | "### Policy\n", 149 | "\n", 150 | "* The environment has a 4x4 grid of states (16 total), they are indexed from 0 to 15\n", 151 | "* From each states there are 4 actions (left,down,right,up), indexed from 0 to 3\n", 152 | "\n", 153 | "We need to define agent's policy of picking actions given states. 
Since we have only 16 disttinct states and 4 actions, we can just store the action for each state in an array.\n", 154 | "\n", 155 | "This basically means that any array of 16 integers from 0 to 3 makes a policy." 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": null, 161 | "metadata": {}, 162 | "outputs": [], 163 | "source": [ 164 | "import numpy as np\n", 165 | "n_states = env.observation_space.n\n", 166 | "n_actions = env.action_space.n" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": null, 172 | "metadata": {}, 173 | "outputs": [], 174 | "source": [ 175 | "def get_random_policy():\n", 176 | " \"\"\"\n", 177 | " Build a numpy array representing agent policy.\n", 178 | " This array must have one element per each of 16 environment states.\n", 179 | " Element must be an integer from 0 to 3, representing action\n", 180 | " to take from that state.\n", 181 | " \"\"\"\n", 182 | " # policy = ...\n", 183 | " return policy" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": null, 189 | "metadata": {}, 190 | "outputs": [], 191 | "source": [ 192 | "np.random.seed(1234)\n", 193 | "policies = [get_random_policy() for i in range(10**4)]\n", 194 | "assert all([len(p) == n_states for p in policies]), 'policy length should always be 16'\n", 195 | "assert np.min(policies) == 0, 'minimal action id should be 0'\n", 196 | "assert np.max(policies) == n_actions-1, 'maximal action id should match n_actions-1'\n", 197 | "action_probas = np.unique(policies, return_counts=True)[-1] /10**4. /n_states\n", 198 | "print(\"Action frequencies over 10^4 samples:\",action_probas)\n", 199 | "assert np.allclose(action_probas, [1. / n_actions] * n_actions, atol=0.05), \"The policies aren't uniformly random (maybe it's just an extremely bad luck)\"\n", 200 | "print(\"Seems fine!\")" 201 | ] 202 | }, 203 | { 204 | "cell_type": "markdown", 205 | "metadata": {}, 206 | "source": [ 207 | "### Let's evaluate!\n", 208 | "* Implement a simple function that runs one game and returns the total reward" 209 | ] 210 | }, 211 | { 212 | "cell_type": "code", 213 | "execution_count": null, 214 | "metadata": {}, 215 | "outputs": [], 216 | "source": [ 217 | "def sample_reward(env, policy, t_max=100):\n", 218 | " \"\"\"\n", 219 | " Interact with an environment, return sum of all rewards.\n", 220 | " If game doesn't end on t_max (e.g. agent walks into a wall), \n", 221 | " force end the game and return whatever reward you got so far.\n", 222 | " Tip: see signature of env.step(...) 
method above.\n", 223 | " \"\"\"\n", 224 | " s = env.reset()\n", 225 | " total_reward = 0\n", 226 | " \n", 227 | " for i in range(t_max):\n", 228 | " # action = ...\n", 229 | " s, reward, done, info = env.step(action)\n", 230 | " total_reward += reward\n", 231 | " if done:\n", 232 | " break\n", 233 | " return total_reward" 234 | ] 235 | }, 236 | { 237 | "cell_type": "code", 238 | "execution_count": null, 239 | "metadata": {}, 240 | "outputs": [], 241 | "source": [ 242 | "print(\"generating 10^3 sessions...\")\n", 243 | "rewards = [sample_reward(env,get_random_policy()) for _ in range(10**3)]\n", 244 | "assert all([type(r) in (int, float) for r in rewards]), 'sample_reward must return a single number'\n", 245 | "assert all([0 <= r <= 1 for r in rewards]), 'total rewards should be between 0 and 1 for frozenlake (if solving taxi, delete this line)'\n", 246 | "print(\"Looks good!\")" 247 | ] 248 | }, 249 | { 250 | "cell_type": "code", 251 | "execution_count": null, 252 | "metadata": {}, 253 | "outputs": [], 254 | "source": [ 255 | "def evaluate(policy, n_times=100):\n", 256 | " \"\"\"Run several evaluations and average the score the policy gets.\"\"\"\n", 257 | " # avg_reward = ...\n", 258 | " return avg_reward\n", 259 | " " 260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "execution_count": null, 265 | "metadata": {}, 266 | "outputs": [], 267 | "source": [ 268 | "def print_policy(policy):\n", 269 | " \"\"\"a function that displays a policy in a human-readable way.\"\"\"\n", 270 | " lake = \"SFFFFHFHFFFHHFFG\"\n", 271 | " assert env.spec.id == \"FrozenLake-v0\", \"this function only works with frozenlake 4x4\"\n", 272 | " \n", 273 | " # where to move from each tile\n", 274 | " arrows = ['^'[a] for a in policy]\n", 275 | " \n", 276 | " #draw arrows above S and F only\n", 277 | " signs = [arrow if tile in \"SF\" else tile for arrow, tile in zip(arrows, lake)]\n", 278 | " \n", 279 | " for i in range(0, 16, 4):\n", 280 | " print(' '.join(signs[i:i+4]))\n", 281 | "\n", 282 | "print(\"random policy:\")\n", 283 | "print_policy(get_random_policy())" 284 | ] 285 | }, 286 | { 287 | "cell_type": "markdown", 288 | "metadata": {}, 289 | "source": [ 290 | "### Random search" 291 | ] 292 | }, 293 | { 294 | "cell_type": "code", 295 | "execution_count": null, 296 | "metadata": { 297 | "scrolled": true 298 | }, 299 | "outputs": [], 300 | "source": [ 301 | "best_policy = None\n", 302 | "best_score = -float('inf')\n", 303 | "\n", 304 | "from tqdm import tqdm\n", 305 | "tr = tqdm(range(int(1e4)))\n", 306 | "for i in tr:\n", 307 | " policy = get_random_policy()\n", 308 | " score = evaluate(policy)\n", 309 | " if score > best_score:\n", 310 | " best_score = score\n", 311 | " best_policy = policy\n", 312 | " tr.set_postfix({\"best score:\": best_score})" 313 | ] 314 | }, 315 | { 316 | "cell_type": "markdown", 317 | "metadata": {}, 318 | "source": [ 319 | "# Part II Genetic algorithm\n", 320 | "\n", 321 | "The next task is to devise some more effecient way to perform policy search.\n", 322 | "We'll do that with a bare-bones evolutionary algorithm.\n", 323 | "[unless you're feeling masochistic and wish to do something entirely different which is bonus points if it works]" 324 | ] 325 | }, 326 | { 327 | "cell_type": "code", 328 | "execution_count": null, 329 | "metadata": {}, 330 | "outputs": [], 331 | "source": [ 332 | "def crossover(policy1, policy2, p=0.5, prioritize=False):\n", 333 | " \"\"\"\n", 334 | " for each state, with probability p take action from policy1, else policy2\n", 335 | " \"\"\"\n", 336 | " 
if prioritize:\n", 337 | " # wait for part II - moar\n", 338 | " pass\n", 339 | " # policy = ...\n", 340 | " return policy\n" 341 | ] 342 | }, 343 | { 344 | "cell_type": "code", 345 | "execution_count": null, 346 | "metadata": {}, 347 | "outputs": [], 348 | "source": [ 349 | "def mutation(policy, p=0.1):\n", 350 | " \"\"\"\n", 351 | " for each state, with probability p replace action with random action\n", 352 | " Tip: mutation can be written as crossover with random policy\n", 353 | " \"\"\"\n", 354 | " # new_policy = ...\n", 355 | " return new_policy" 356 | ] 357 | }, 358 | { 359 | "cell_type": "code", 360 | "execution_count": null, 361 | "metadata": {}, 362 | "outputs": [], 363 | "source": [ 364 | "np.random.seed(1234)\n", 365 | "policies = [\n", 366 | " crossover(get_random_policy(), get_random_policy()) \n", 367 | " for i in range(10**4)]\n", 368 | "\n", 369 | "assert all([len(p) == n_states for p in policies]), 'policy length should always be 16'\n", 370 | "assert np.min(policies) == 0, 'minimal action id should be 0'\n", 371 | "assert np.max(policies) == n_actions-1, 'maximal action id should be n_actions-1'\n", 372 | "\n", 373 | "assert any([\n", 374 | " np.mean(crossover(np.zeros(n_states), np.ones(n_states))) not in (0, 1)\n", 375 | " for _ in range(100)]), \"Make sure your crossover changes each action independently\"\n", 376 | "print(\"Seems fine!\")" 377 | ] 378 | }, 379 | { 380 | "cell_type": "code", 381 | "execution_count": null, 382 | "metadata": {}, 383 | "outputs": [], 384 | "source": [ 385 | "\n", 386 | "n_epochs = 20 #how many cycles to make\n", 387 | "pool_size = 100 #how many policies to maintain\n", 388 | "n_crossovers = 50 #how many crossovers to make on each step\n", 389 | "n_mutations = 50 #how many mutations to make on each tick\n" 390 | ] 391 | }, 392 | { 393 | "cell_type": "code", 394 | "execution_count": null, 395 | "metadata": {}, 396 | "outputs": [], 397 | "source": [ 398 | "print(\"initializing...\")\n", 399 | "pool = [get_random_policy() for _ in range(pool_size)]\n", 400 | "pool_scores = list(map(evaluate, pool))\n" 401 | ] 402 | }, 403 | { 404 | "cell_type": "code", 405 | "execution_count": null, 406 | "metadata": {}, 407 | "outputs": [], 408 | "source": [ 409 | "assert type(pool) == type(pool_scores) == list\n", 410 | "assert len(pool) == len(pool_scores) == pool_size\n", 411 | "assert all([type(score) in (float, int) for score in pool_scores])" 412 | ] 413 | }, 414 | { 415 | "cell_type": "markdown", 416 | "metadata": { 417 | "collapsed": true 418 | }, 419 | "source": [ 420 | "# Main loop" 421 | ] 422 | }, 423 | { 424 | "cell_type": "code", 425 | "execution_count": null, 426 | "metadata": {}, 427 | "outputs": [], 428 | "source": [ 429 | "import random\n", 430 | "from tqdm import tqdm\n", 431 | "\n", 432 | "tr = tqdm(range(n_epochs))\n", 433 | "for epoch in tr:\n", 434 | "# print(\"Epoch %s:\"%epoch)\n", 435 | " crossovered = [\n", 436 | " crossover(random.choice(pool), random.choice(pool)) \n", 437 | " for _ in range(n_crossovers)]\n", 438 | " mutated = [\n", 439 | " mutation(random.choice(pool)) \n", 440 | " for _ in range(n_mutations)]\n", 441 | " \n", 442 | " assert type(crossovered) == type(mutated) == list\n", 443 | " \n", 444 | " # add new policies to the pool\n", 445 | " # pool = ...\n", 446 | " # pool_scores = ...\n", 447 | " \n", 448 | " # select pool_size best policies\n", 449 | " # selected_indices = ...\n", 450 | " # pool = ...\n", 451 | " # pool_scores = ...\n", 452 | "\n", 453 | " # print the best policy so far (last in ascending score 
order)\n", 454 | " tr.set_postfix({\"best score:\": pool_scores[-1]})" 455 | ] 456 | }, 457 | { 458 | "cell_type": "markdown", 459 | "metadata": {}, 460 | "source": [ 461 | "## moar\n", 462 | "\n", 463 | "The parameters of the genetic algorithm aren't optimal, try to find something better. (size, crossovers and mutations)\n", 464 | "\n", 465 | "Try alternative crossover and mutation strategies\n", 466 | "* prioritize crossover for higher-scorers?\n", 467 | "* try to select a more diverse pool, not just best scorers?\n", 468 | "* Just tune the f*cking probabilities.\n", 469 | "\n", 470 | "See which combination works best!" 471 | ] 472 | }, 473 | { 474 | "cell_type": "code", 475 | "execution_count": null, 476 | "metadata": {}, 477 | "outputs": [], 478 | "source": [ 479 | "def crossover(policy1, policy2, p=0.5, prioritize=False):\n", 480 | " \"\"\"\n", 481 | " for each state, with probability p take action from policy1, else policy2\n", 482 | " \"\"\"\n", 483 | " if prioritize:\n", 484 | " # the time has come\n", 485 | " pass\n", 486 | " # policy = ...\n", 487 | " return policy\n" 488 | ] 489 | }, 490 | { 491 | "cell_type": "code", 492 | "execution_count": null, 493 | "metadata": { 494 | "scrolled": true 495 | }, 496 | "outputs": [], 497 | "source": [ 498 | "import random\n", 499 | "from tqdm import tqdm\n", 500 | "\n", 501 | "tr = tqdm(range(n_epochs))\n", 502 | "for epoch in tr:\n", 503 | "# print(\"Epoch %s:\"%epoch)\n", 504 | " crossovered = [\n", 505 | " crossover(\n", 506 | " random.choice(pool), \n", 507 | " random.choice(pool), \n", 508 | " prioritize=True) \n", 509 | " for _ in range(n_crossovers)]\n", 510 | " mutated = [\n", 511 | " mutation(random.choice(pool)) \n", 512 | " for _ in range(n_mutations)]\n", 513 | " \n", 514 | " assert type(crossovered) == type(mutated) == list\n", 515 | " \n", 516 | " # add new policies to the pool\n", 517 | " # pool = ...\n", 518 | " # pool_scores = ...\n", 519 | " \n", 520 | " # select pool_size best policies\n", 521 | " # selected_indices = ...\n", 522 | " # pool = ...\n", 523 | " # pool_scores = ...\n", 524 | "\n", 525 | " # print the best policy so far (last in ascending score order)\n", 526 | " tr.set_postfix({\"best score:\": pool_scores[-1]})" 527 | ] 528 | }, 529 | { 530 | "cell_type": "markdown", 531 | "metadata": {}, 532 | "source": [ 533 | "# *** Part III\n", 534 | "\n", 535 | "The frozenlake problem above is just too simple: you can beat it even with a random policy search. Go solve something more complicated. \n", 536 | "\n", 537 | "__FrozenLake8x8-v0__ - frozenlake big brother\n", 538 | "\n", 539 | "\n", 540 | "### Some tips:\n", 541 | "* Random policy search is worth trying as a sanity check, but in general you should expect the genetic algorithm (or anything you devised in it's place) to fare much better that random.\n", 542 | "* While _it's okay to adapt the tabs above to your chosen env_, make sure you didn't hard-code any constants there (e.g. 16 states or 4 actions).\n", 543 | "* `print_policy` function was built for the frozenlake-v0 env so it will break on any other env. You could simply ignore it OR write your own visualizer for bonus points.\n", 544 | "* in function `sample_reward`, __make sure t_max steps is enough to solve the environment__ even if agent is sometimes acting suboptimally. 
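Stepping back to the genetic-algorithm cells of Part II: below is one possible sketch of the blanks (crossover, mutation and the pool update). It is not the official solution; it assumes `n_states`, `n_actions`, `evaluate` and `pool_size` are defined as in the notebook.

```python
import numpy as np

def get_random_policy():
    # one random action id (0..n_actions-1) per state
    return np.random.randint(0, n_actions, size=n_states)

def crossover(policy1, policy2, p=0.5):
    # for each state, take the action from policy1 with probability p, else from policy2
    mask = np.random.uniform(size=len(policy1)) < p
    return np.where(mask, policy1, policy2).astype(int)

def mutation(policy, p=0.1):
    # mutation expressed as crossover with a fresh random policy
    return crossover(get_random_policy(), policy, p=p)

# pool update inside the main loop, once `crossovered` and `mutated` are built:
# pool = pool + crossovered + mutated
# pool_scores = list(map(evaluate, pool))
# selected_indices = np.argsort(pool_scores)[-pool_size:]   # ascending, best last
# pool = [pool[i] for i in selected_indices]
# pool_scores = [pool_scores[i] for i in selected_indices]
```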
To estimate that, run several sessions without time limit and measure their length.\n" 545 | ] 546 | }, 547 | { 548 | "cell_type": "markdown", 549 | "metadata": {}, 550 | "source": [ 551 | "# FrozenLake8x8" 552 | ] 553 | }, 554 | { 555 | "cell_type": "code", 556 | "execution_count": null, 557 | "metadata": { 558 | "scrolled": true 559 | }, 560 | "outputs": [], 561 | "source": [ 562 | "#create a single game instance\n", 563 | "env = gym.make(\"FrozenLake8x8-v0\")\n", 564 | "\n", 565 | "#start new game\n", 566 | "env.reset()\n", 567 | "\n", 568 | "# display the game state\n", 569 | "env.render()\n", 570 | "\n", 571 | "n_states = env.observation_space.n\n", 572 | "n_actions = env.action_space.n" 573 | ] 574 | }, 575 | { 576 | "cell_type": "code", 577 | "execution_count": null, 578 | "metadata": {}, 579 | "outputs": [], 580 | "source": [ 581 | "n_epochs = 20 #how many cycles to make\n", 582 | "pool_size = 100 #how many policies to maintain\n", 583 | "n_crossovers = 50 #how many crossovers to make on each step\n", 584 | "n_mutations = 50 #how many mutations to make on each tick" 585 | ] 586 | }, 587 | { 588 | "cell_type": "code", 589 | "execution_count": null, 590 | "metadata": {}, 591 | "outputs": [], 592 | "source": [ 593 | "print(\"initializing...\")\n", 594 | "pool = [get_random_policy() for _ in range(pool_size)]\n", 595 | "pool_scores = list(map(evaluate, pool))" 596 | ] 597 | }, 598 | { 599 | "cell_type": "code", 600 | "execution_count": null, 601 | "metadata": { 602 | "scrolled": true 603 | }, 604 | "outputs": [], 605 | "source": [ 606 | "import random\n", 607 | "from tqdm import tqdm\n", 608 | "\n", 609 | "tr = tqdm(range(n_epochs))\n", 610 | "for epoch in tr:\n", 611 | "# print(\"Epoch %s:\"%epoch)\n", 612 | " crossovered = [\n", 613 | " crossover(\n", 614 | " random.choice(pool), \n", 615 | " random.choice(pool), \n", 616 | " prioritize=True) \n", 617 | " for _ in range(n_crossovers)]\n", 618 | " mutated = [\n", 619 | " mutation(random.choice(pool)) \n", 620 | " for _ in range(n_mutations)]\n", 621 | " \n", 622 | " assert type(crossovered) == type(mutated) == list\n", 623 | " \n", 624 | " # add new policies to the pool\n", 625 | " # pool = ...\n", 626 | " # pool_scores = ...\n", 627 | " \n", 628 | " # select pool_size best policies\n", 629 | " # selected_indices = ...\n", 630 | " # pool = ...\n", 631 | " # pool_scores = ...\n", 632 | "\n", 633 | " # print the best policy so far (last in ascending score order)\n", 634 | " tr.set_postfix({\"best score:\": pool_scores[-1]})" 635 | ] 636 | }, 637 | { 638 | "cell_type": "markdown", 639 | "metadata": {}, 640 | "source": [ 641 | "# moar" 642 | ] 643 | }, 644 | { 645 | "cell_type": "code", 646 | "execution_count": null, 647 | "metadata": {}, 648 | "outputs": [], 649 | "source": [ 650 | "def sample_reward(env, policy, t_max=200):\n", 651 | " \"\"\"\n", 652 | " Interact with an environment, return sum of all rewards.\n", 653 | " If game doesn't end on t_max (e.g. agent walks into a wall), \n", 654 | " force end the game and return whatever reward you got so far.\n", 655 | " Tip: see signature of env.step(...) 
method above.\n", 656 | " \"\"\"\n", 657 | " s = env.reset()\n", 658 | " total_reward = 0\n", 659 | " \n", 660 | " for i in range(t_max):\n", 661 | " action = policy[s]\n", 662 | " s, reward, done, info = env.step(action)\n", 663 | " total_reward += reward\n", 664 | " if done:\n", 665 | " break\n", 666 | " return total_reward" 667 | ] 668 | }, 669 | { 670 | "cell_type": "code", 671 | "execution_count": null, 672 | "metadata": { 673 | "scrolled": true 674 | }, 675 | "outputs": [], 676 | "source": [ 677 | "# create a single game instance\n", 678 | "env = gym.make(\"FrozenLake8x8-v0\")\n", 679 | "\n", 680 | "# start new game\n", 681 | "env.reset()\n", 682 | "\n", 683 | "# display the game state\n", 684 | "env.render()\n", 685 | "\n", 686 | "n_states = env.observation_space.n\n", 687 | "n_actions = env.action_space.n" 688 | ] 689 | }, 690 | { 691 | "cell_type": "code", 692 | "execution_count": null, 693 | "metadata": {}, 694 | "outputs": [], 695 | "source": [ 696 | "n_epochs = 50 #how many cycles to make\n", 697 | "pool_size = 200 #how many policies to maintain\n", 698 | "n_crossovers = 100 #how many crossovers to make on each step\n", 699 | "n_mutations = 100 #how many mutations to make on each tick" 700 | ] 701 | }, 702 | { 703 | "cell_type": "code", 704 | "execution_count": null, 705 | "metadata": {}, 706 | "outputs": [], 707 | "source": [ 708 | "print(\"initializing...\")\n", 709 | "pool = [get_random_policy() for _ in range(pool_size)]\n", 710 | "pool_scores = list(map(evaluate, pool))" 711 | ] 712 | }, 713 | { 714 | "cell_type": "code", 715 | "execution_count": null, 716 | "metadata": { 717 | "scrolled": true 718 | }, 719 | "outputs": [], 720 | "source": [ 721 | "import random\n", 722 | "from tqdm import tqdm\n", 723 | "\n", 724 | "tr = tqdm(range(n_epochs))\n", 725 | "for epoch in tr:\n", 726 | "# print(\"Epoch %s:\"%epoch)\n", 727 | " crossovered = [\n", 728 | " crossover(\n", 729 | " random.choice(pool), \n", 730 | " random.choice(pool), \n", 731 | " prioritize=True) \n", 732 | " for _ in range(n_crossovers)]\n", 733 | " mutated = [\n", 734 | " mutation(random.choice(pool)) \n", 735 | " for _ in range(n_mutations)]\n", 736 | " \n", 737 | " assert type(crossovered) == type(mutated) == list\n", 738 | " \n", 739 | " # add new policies to the pool\n", 740 | " # pool = ...\n", 741 | " # pool_scores = ...\n", 742 | " \n", 743 | " # select pool_size best policies\n", 744 | " # selected_indices = ...\n", 745 | " # pool = ...\n", 746 | " # pool_scores = ...\n", 747 | "\n", 748 | " # print the best policy so far (last in ascending score order)\n", 749 | " tr.set_postfix({\"best score:\": pool_scores[-1]})" 750 | ] 751 | }, 752 | { 753 | "cell_type": "code", 754 | "execution_count": null, 755 | "metadata": {}, 756 | "outputs": [], 757 | "source": [] 758 | } 759 | ], 760 | "metadata": { 761 | "kernelspec": { 762 | "display_name": "Python 3", 763 | "language": "python", 764 | "name": "python3" 765 | }, 766 | "language_info": { 767 | "codemirror_mode": { 768 | "name": "ipython", 769 | "version": 3 770 | }, 771 | "file_extension": ".py", 772 | "mimetype": "text/x-python", 773 | "name": "python", 774 | "nbconvert_exporter": "python", 775 | "pygments_lexer": "ipython3", 776 | "version": "3.7.0" 777 | } 778 | }, 779 | "nbformat": 4, 780 | "nbformat_minor": 1 781 | } 782 | -------------------------------------------------------------------------------- /2019/code/02-cem.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": 
"markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Crossentropy method\n", 8 | "\n", 9 | "This notebook will teach you to solve reinforcement learning with crossentropy method." 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": null, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "# In google collab, uncomment this:\n", 19 | "# !wget https://bit.ly/2FMJP5K -O setup.py && bash setup.py\n", 20 | "\n", 21 | "# XVFB will be launched if you run on a server\n", 22 | "# import os\n", 23 | "# if type(os.environ.get(\"DISPLAY\")) is not str or len(os.environ.get(\"DISPLAY\")) == 0:\n", 24 | "# !bash ../xvfb start\n", 25 | "# %env DISPLAY = : 1" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": null, 31 | "metadata": {}, 32 | "outputs": [], 33 | "source": [ 34 | "import gym\n", 35 | "import numpy as np, pandas as pd\n", 36 | "\n", 37 | "env = gym.make(\"Taxi-v2\") # MountainCar-v0, Taxi-v2\n", 38 | "env.reset()\n", 39 | "env.render()" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": null, 45 | "metadata": {}, 46 | "outputs": [], 47 | "source": [ 48 | "n_states = env.observation_space.n\n", 49 | "n_actions = env.action_space.n\n", 50 | "\n", 51 | "print(\"n_states=%i, n_actions=%i\"%(n_states,n_actions))" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "# Create stochastic policy\n", 59 | "\n", 60 | "This time our policy should be a probability distribution.\n", 61 | "\n", 62 | "```policy[s,a] = P(take action a | in state s)```\n", 63 | "\n", 64 | "Since we still use integer state and action representations, you can use a 2-dimensional array to represent the policy.\n", 65 | "\n", 66 | "Please initialize policy __uniformly__, that is, probabililities of all actions should be equal.\n" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": null, 72 | "metadata": {}, 73 | "outputs": [], 74 | "source": [ 75 | "policy = " 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": null, 81 | "metadata": {}, 82 | "outputs": [], 83 | "source": [ 84 | "assert type(policy) in (np.ndarray,np.matrix)\n", 85 | "assert np.allclose(policy,1./n_actions)\n", 86 | "assert np.allclose(np.sum(policy,axis=1), 1)" 87 | ] 88 | }, 89 | { 90 | "cell_type": "markdown", 91 | "metadata": {}, 92 | "source": [ 93 | "# Play the game\n", 94 | "\n", 95 | "Just like before, but we also record all states and actions we took." 
96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": null, 101 | "metadata": {}, 102 | "outputs": [], 103 | "source": [ 104 | "def generate_session(policy, t_max=10**4):\n", 105 | " \"\"\"\n", 106 | " Play game until end or for t_max ticks.\n", 107 | " :param policy: an array of shape [n_states,n_actions] with action probabilities\n", 108 | " :returns: list of states, list of actions and sum of rewards\n", 109 | " \"\"\"\n", 110 | " states,actions = [],[]\n", 111 | " total_reward = 0.\n", 112 | " \n", 113 | " s = env.reset()\n", 114 | " \n", 115 | " for t in range(t_max):\n", 116 | " \n", 117 | " a = \n", 118 | "\n", 119 | " new_s, r, done, info = env.step(a)\n", 120 | " \n", 121 | " states.append(s)\n", 122 | " actions.append(a)\n", 123 | " total_reward += r\n", 124 | " \n", 125 | " s = new_s\n", 126 | " if done:\n", 127 | " break\n", 128 | " return states, actions, total_reward" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": null, 134 | "metadata": {}, 135 | "outputs": [], 136 | "source": [ 137 | "s, a, r = generate_session(policy)\n", 138 | "assert type(s) == type(a) == list\n", 139 | "assert len(s) == len(a)\n", 140 | "assert type(r) in [float, np.float]" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": null, 146 | "metadata": {}, 147 | "outputs": [], 148 | "source": [ 149 | "# let's see the initial reward distribution\n", 150 | "import matplotlib.pyplot as plt\n", 151 | "%matplotlib inline\n", 152 | "\n", 153 | "sample_rewards = [\n", 154 | " generate_session(policy, t_max=1000)[-1] \n", 155 | " for _ in range(200)]\n", 156 | "\n", 157 | "plt.hist(sample_rewards, bins=20)\n", 158 | "plt.vlines(\n", 159 | " [np.percentile(sample_rewards, 50)], \n", 160 | " [0], \n", 161 | " [100], \n", 162 | " label=\"50'th percentile\", \n", 163 | " color='green')\n", 164 | "plt.vlines(\n", 165 | " [np.percentile(sample_rewards, 90)], \n", 166 | " [0], [100], \n", 167 | " label=\"90'th percentile\", \n", 168 | " color='red')\n", 169 | "plt.legend()\n" 170 | ] 171 | }, 172 | { 173 | "cell_type": "markdown", 174 | "metadata": {}, 175 | "source": [ 176 | "### Crossentropy method steps" 177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": null, 182 | "metadata": {}, 183 | "outputs": [], 184 | "source": [ 185 | "def select_elites(\n", 186 | " states_batch, \n", 187 | " actions_batch, \n", 188 | " rewards_batch, \n", 189 | " percentile=50\n", 190 | "):\n", 191 | " \"\"\"\n", 192 | " Select states and actions from games that have rewards >= percentile\n", 193 | " :param states_batch: list of lists of states, states_batch[session_i][t]\n", 194 | " :param actions_batch: list of lists of actions, actions_batch[session_i][t]\n", 195 | " :param rewards_batch: list of rewards, rewards_batch[session_i]\n", 196 | "\n", 197 | " :returns: elite_states,elite_actions, both 1D lists of states and respective actions from elite sessions\n", 198 | "\n", 199 | " Please return elite states and actions in their original order \n", 200 | " [i.e. sorted by session number and timestep within session]\n", 201 | "\n", 202 | " If you're confused, see examples below. 
Please don't assume that states are integers (they'll get different later).\n", 203 | " \"\"\"\n", 204 | " \n", 205 | " reward_threshold = \n", 206 | "\n", 207 | " elite_states = \n", 208 | " elite_actions = \n", 209 | "\n", 210 | " return elite_states, elite_actions" 211 | ] 212 | }, 213 | { 214 | "cell_type": "code", 215 | "execution_count": null, 216 | "metadata": {}, 217 | "outputs": [], 218 | "source": [ 219 | "states_batch = [\n", 220 | " [1, 2, 3], # game1\n", 221 | " [4, 2, 0, 2], # game2\n", 222 | " [3, 1] # game3\n", 223 | "]\n", 224 | "\n", 225 | "actions_batch = [\n", 226 | " [0, 2, 4], # game1\n", 227 | " [3, 2, 0, 1], # game2\n", 228 | " [3, 3] # game3\n", 229 | "]\n", 230 | "rewards_batch = [\n", 231 | " 3, # game1\n", 232 | " 4, # game2\n", 233 | " 5, # game3\n", 234 | "]\n", 235 | "\n", 236 | "test_result_0 = select_elites(\n", 237 | " states_batch, actions_batch, rewards_batch, percentile=0)\n", 238 | "test_result_40 = select_elites(\n", 239 | " states_batch, actions_batch, rewards_batch, percentile=30)\n", 240 | "test_result_90 = select_elites(\n", 241 | " states_batch, actions_batch, rewards_batch, percentile=90)\n", 242 | "test_result_100 = select_elites(\n", 243 | " states_batch, actions_batch, rewards_batch, percentile=100)\n", 244 | "\n", 245 | "assert np.all(test_result_0[0] == [1, 2, 3, 4, 2, 0, 2, 3, 1]) \\\n", 246 | " and np.all(test_result_0[1] == [0, 2, 4, 3, 2, 0, 1, 3, 3]),\\\n", 247 | " \"For percentile 0 you should return all states and actions in chronological order\"\n", 248 | "assert np.all(test_result_40[0] == [4, 2, 0, 2, 3, 1]) and \\\n", 249 | " np.all(test_result_40[1] == [3, 2, 0, 1, 3, 3]),\\\n", 250 | " \"For percentile 30 you should only select states/actions from two first\"\n", 251 | "assert np.all(test_result_90[0] == [3, 1]) and \\\n", 252 | " np.all(test_result_90[1] == [3, 3]),\\\n", 253 | " \"For percentile 90 you should only select states/actions from one game\"\n", 254 | "assert np.all(test_result_100[0] == [3, 1]) and\\\n", 255 | " np.all(test_result_100[1] == [3, 3]),\\\n", 256 | " \"Please make sure you use >=, not >. 
Also double-check how you compute percentile.\"\n", 257 | "print(\"Ok!\")" 258 | ] 259 | }, 260 | { 261 | "cell_type": "code", 262 | "execution_count": null, 263 | "metadata": {}, 264 | "outputs": [], 265 | "source": [ 266 | "def update_policy(elite_states, elite_actions):\n", 267 | " \"\"\"\n", 268 | " Given old policy and a list of elite states/actions from select_elites,\n", 269 | " return new updated policy where each action probability is proportional to\n", 270 | "\n", 271 | " policy[s_i,a_i] ~ #[occurences of si and ai in elite states/actions]\n", 272 | "\n", 273 | " Don't forget to normalize policy to get valid probabilities and handle 0/0 case.\n", 274 | " In case you never visited a state, set probabilities for all actions to 1./n_actions\n", 275 | "\n", 276 | " :param elite_states: 1D list of states from elite sessions\n", 277 | " :param elite_actions: 1D list of actions from elite sessions\n", 278 | "\n", 279 | " \"\"\"\n", 280 | "\n", 281 | " new_policy = np.zeros([n_states, n_actions])\n", 282 | "\n", 283 | " \n", 284 | " # Don't forget to set 1/n_actions for all actions in unvisited states.\n", 285 | "\n", 286 | " return new_policy" 287 | ] 288 | }, 289 | { 290 | "cell_type": "code", 291 | "execution_count": null, 292 | "metadata": {}, 293 | "outputs": [], 294 | "source": [ 295 | "elite_states, elite_actions = (\n", 296 | " [1, 2, 3, 4, 2, 0, 2, 3, 1], \n", 297 | " [0, 2, 4, 3, 2, 0, 1, 3, 3]\n", 298 | ")\n", 299 | "\n", 300 | "\n", 301 | "new_policy = update_policy(elite_states, elite_actions)\n", 302 | "\n", 303 | "assert np.isfinite(new_policy).all(\n", 304 | "), \"Your new policy contains NaNs or +-inf. Make sure you don't divide by zero.\"\n", 305 | "assert np.all(\n", 306 | " new_policy >= 0), \"Your new policy can't have negative action probabilities\"\n", 307 | "assert np.allclose(new_policy.sum(\n", 308 | " axis=-1), 1), \"Your new policy should be a valid probability distribution over actions\"\n", 309 | "reference_answer = np.array([\n", 310 | " [1., 0., 0., 0., 0.],\n", 311 | " [0.5, 0., 0., 0.5, 0.],\n", 312 | " [0., 0.33333333, 0.66666667, 0., 0.],\n", 313 | " [0., 0., 0., 0.5, 0.5]])\n", 314 | "assert np.allclose(new_policy[:4, :5], reference_answer)\n", 315 | "print(\"Ok!\")" 316 | ] 317 | }, 318 | { 319 | "cell_type": "markdown", 320 | "metadata": {}, 321 | "source": [ 322 | "# Training loop\n", 323 | "Generate sessions, select N best and fit to those." 324 | ] 325 | }, 326 | { 327 | "cell_type": "code", 328 | "execution_count": null, 329 | "metadata": {}, 330 | "outputs": [], 331 | "source": [ 332 | "from IPython.display import clear_output\n", 333 | "\n", 334 | "\n", 335 | "def show_progress(rewards_batch, log, reward_range=[-990, +10]):\n", 336 | " \"\"\"\n", 337 | " A convenience function that displays training progress. 
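Looking back at the two blanks above, here is one possible sketch of `select_elites` and `update_policy` that follows the docstrings (threshold with `>=`, per-row normalization, uniform fallback for unvisited states). The Taxi-v2 sizes appear only as illustrative defaults.

```python
import numpy as np

def select_elites(states_batch, actions_batch, rewards_batch, percentile=50):
    # keep states/actions from sessions whose reward is >= the percentile threshold
    reward_threshold = np.percentile(rewards_batch, percentile)
    elite_states, elite_actions = [], []
    for states, actions, reward in zip(states_batch, actions_batch, rewards_batch):
        if reward >= reward_threshold:              # >= so border sessions are kept
            elite_states.extend(states)
            elite_actions.extend(actions)
    return elite_states, elite_actions

def update_policy(elite_states, elite_actions, n_states=500, n_actions=6):
    # count (state, action) occurrences, then normalise each row;
    # unvisited states fall back to the uniform distribution
    new_policy = np.zeros([n_states, n_actions])
    for s, a in zip(elite_states, elite_actions):
        new_policy[s, a] += 1
    for s in range(n_states):
        row_sum = new_policy[s].sum()
        if row_sum == 0:
            new_policy[s] = 1.0 / n_actions
        else:
            new_policy[s] /= row_sum
    return new_policy
```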
\n", 338 | " No cool math here, just charts.\n", 339 | " \"\"\"\n", 340 | "\n", 341 | " mean_reward = np.mean(rewards_batch)\n", 342 | " threshold = np.percentile(rewards_batch, percentile)\n", 343 | " log.append([mean_reward, threshold])\n", 344 | "\n", 345 | " clear_output(True)\n", 346 | " print(\"mean reward = %.3f, threshold=%.3f\" % (mean_reward, threshold))\n", 347 | " plt.figure(figsize=[8, 4])\n", 348 | " plt.subplot(1, 2, 1)\n", 349 | " plt.plot(list(zip(*log))[0], label='Mean rewards')\n", 350 | " plt.plot(list(zip(*log))[1], label='Reward thresholds')\n", 351 | " plt.legend()\n", 352 | " plt.grid()\n", 353 | "\n", 354 | " plt.subplot(1, 2, 2)\n", 355 | " plt.hist(rewards_batch, range=reward_range)\n", 356 | " plt.vlines(\n", 357 | " [np.percentile(rewards_batch, percentile)],\n", 358 | " [0], \n", 359 | " [100], \n", 360 | " label=\"percentile\", \n", 361 | " color='red')\n", 362 | " plt.legend()\n", 363 | " plt.grid()\n", 364 | "\n", 365 | " plt.show()" 366 | ] 367 | }, 368 | { 369 | "cell_type": "code", 370 | "execution_count": null, 371 | "metadata": {}, 372 | "outputs": [], 373 | "source": [ 374 | "# reset policy just in case\n", 375 | "policy = np.ones([n_states, n_actions])/n_actions" 376 | ] 377 | }, 378 | { 379 | "cell_type": "code", 380 | "execution_count": null, 381 | "metadata": {}, 382 | "outputs": [], 383 | "source": [ 384 | "n_sessions = 250 # sample this many sessions\n", 385 | "percentile = 50 # take this percent of session with highest rewards\n", 386 | "learning_rate = 0.5 # add this thing to all counts for stability\n", 387 | "\n", 388 | "log = []\n", 389 | "\n", 390 | "for i in range(100):\n", 391 | "\n", 392 | " %time sessions = [ < generate a list of n_sessions new sessions > ]\n", 393 | "\n", 394 | " states_batch, actions_batch, rewards_batch = zip(*sessions)\n", 395 | "\n", 396 | " elite_states, elite_actions = \n", 500 | "\n", 501 | " \n", 502 | "\n", 503 | " show_progress(rewards_batch, log, reward_range=[0, np.max(rewards_batch)])\n", 504 | "\n", 505 | " if np.mean(rewards_batch) > 190:\n", 506 | " print(\"You Win! You may stop training now via KeyboardInterrupt.\")" 507 | ] 508 | }, 509 | { 510 | "cell_type": "markdown", 511 | "metadata": {}, 512 | "source": [ 513 | "# Results" 514 | ] 515 | }, 516 | { 517 | "cell_type": "code", 518 | "execution_count": null, 519 | "metadata": {}, 520 | "outputs": [], 521 | "source": [ 522 | "# record sessions\n", 523 | "import gym.wrappers\n", 524 | "env = gym.wrappers.Monitor(\n", 525 | " gym.make(\"CartPole-v0\"),\n", 526 | " directory=\"videos\", \n", 527 | " force=True)\n", 528 | "sessions = [generate_session() for _ in range(100)]\n", 529 | "env.close()\n" 530 | ] 531 | }, 532 | { 533 | "cell_type": "code", 534 | "execution_count": null, 535 | "metadata": {}, 536 | "outputs": [], 537 | "source": [ 538 | "# show video\n", 539 | "from IPython.display import HTML\n", 540 | "import os\n", 541 | "\n", 542 | "video_names = list(\n", 543 | " filter(lambda s: s.endswith(\".mp4\"), os.listdir(\"./videos/\")))\n", 544 | "\n", 545 | "HTML(\"\"\"\n", 546 | "\n", 549 | "\"\"\".format(\"./videos/\"+video_names[-1])) # this may or may not be _last_ video. 
Try other indices" 550 | ] 551 | }, 552 | { 553 | "cell_type": "code", 554 | "execution_count": null, 555 | "metadata": {}, 556 | "outputs": [], 557 | "source": [] 558 | }, 559 | { 560 | "cell_type": "markdown", 561 | "metadata": {}, 562 | "source": [ 563 | "# Homework part I\n", 564 | "\n", 565 | "### Tabular crossentropy method\n", 566 | "\n", 567 | "You may have noticed that the taxi problem quickly converges from -100 to a near-optimal score and then descends back into -50/-100. This is in part because the environment has some innate randomness. Namely, the starting points of passenger/driver change from episode to episode.\n", 568 | "\n", 569 | "### Tasks\n", 570 | "- __1.1__ (1 pts) Find out how the algorithm performance changes if you change different percentile and different n_sessions.\n", 571 | "- __1.2__ (2 pts) Tune the algorithm to end up with positive average score.\n", 572 | "\n", 573 | "It's okay to modify the existing code.\n" 574 | ] 575 | }, 576 | { 577 | "cell_type": "markdown", 578 | "metadata": {}, 579 | "source": [ 580 | "# Homework part II\n", 581 | "\n", 582 | "### Deep crossentropy method\n", 583 | "\n", 584 | "By this moment you should have got enough score on [CartPole-v0](https://gym.openai.com/envs/CartPole-v0) to consider it solved (see the link). It's time to try something harder.\n", 585 | "\n", 586 | "* if you have any trouble with CartPole-v0 and feel stuck, feel free to ask us or your peers for help.\n", 587 | "\n", 588 | "### Tasks\n", 589 | "\n", 590 | "* __2.1__ (3 pts) Pick one of environments: MountainCar-v0 or LunarLander-v2.\n", 591 | " * For MountainCar, get average reward of __at least -150__\n", 592 | " * For LunarLander, get average reward of __at least +50__\n", 593 | "\n", 594 | "See the tips section below, it's kinda important.\n", 595 | "__Note:__ If your agent is below the target score, you'll still get most of the points depending on the result, so don't be afraid to submit it.\n", 596 | " \n", 597 | " \n", 598 | "* __2.2__ (bonus: 4++ pt) Devise a way to speed up training at least 2x against the default version\n", 599 | " * Obvious improvement: use [joblib](https://www.google.com/search?client=ubuntu&channel=fs&q=joblib&ie=utf-8&oe=utf-8)\n", 600 | " * Try re-using samples from 3-5 last iterations when computing threshold and training\n", 601 | " * Experiment with amount of training iterations and learning rate of the neural network (see params)\n", 602 | " * __Please list what you did in anytask submission form__\n", 603 | " \n", 604 | " \n", 605 | "### Tips\n", 606 | "* Gym page: [mountaincar](https://gym.openai.com/envs/MountainCar-v0), [lunarlander](https://gym.openai.com/envs/LunarLander-v2)\n", 607 | "* Sessions for MountainCar may last for 10k+ ticks. Make sure ```t_max``` param is at least 10k.\n", 608 | " * Also it may be a good idea to cut rewards via \">\" and not \">=\". If 90% of your sessions get reward of -10k and 20% are better, than if you use percentile 20% as threshold, R >= threshold __fails cut off bad sessions__ whule R > threshold works alright.\n", 609 | "* _issue with gym_: Some versions of gym limit game time by 200 ticks. This will prevent cem training in most cases. Make sure your agent is able to play for the specified __t_max__, and if it isn't, try `env = gym.make(\"MountainCar-v0\").env` or otherwise get rid of TimeLimit wrapper.\n", 610 | "* If you use old _swig_ lib for LunarLander-v2, you may get an error. 
See this [issue](https://github.com/openai/gym/issues/100) for solution.\n", 611 | "* If it won't train it's a good idea to plot reward distribution and record sessions: they may give you some clue. If they don't, call course staff :)\n", 612 | "* 20-neuron network is probably not enough, feel free to experiment.\n", 613 | "\n", 614 | "### Bonus tasks\n", 615 | "\n", 616 | "* __2.3 bonus__ Try to find a network architecture and training params that solve __both__ environments above (_Points depend on implementation. If you attempted this task, please mention it in anytask submission._)\n", 617 | "\n", 618 | "* __2.4 bonus__ Solve continuous action space task with `MLPRegressor` or similar.\n", 619 | " * Start with [\"Pendulum-v0\"](https://github.com/openai/gym/wiki/Pendulum-v0).\n", 620 | " * Since your agent only predicts the \"expected\" action, you will have to add noise to ensure exploration.\n", 621 | " * [MountainCarContinuous-v0](https://gym.openai.com/envs/MountainCarContinuous-v0), [LunarLanderContinuous-v2](https://gym.openai.com/envs/LunarLanderContinuous-v2) \n", 622 | " * 4 points for solving. Slightly less for getting some results below solution threshold. Note that discrete and continuous environments may have slightly different rules aside from action spaces.\n", 623 | "\n", 624 | "\n", 625 | "If you're still feeling unchallenged, consider the project (see other notebook in this folder)." 626 | ] 627 | }, 628 | { 629 | "cell_type": "code", 630 | "execution_count": null, 631 | "metadata": {}, 632 | "outputs": [], 633 | "source": [] 634 | } 635 | ], 636 | "metadata": { 637 | "kernelspec": { 638 | "display_name": "Python 3", 639 | "language": "python", 640 | "name": "python3" 641 | }, 642 | "language_info": { 643 | "codemirror_mode": { 644 | "name": "ipython", 645 | "version": 3 646 | }, 647 | "file_extension": ".py", 648 | "mimetype": "text/x-python", 649 | "name": "python", 650 | "nbconvert_exporter": "python", 651 | "pygments_lexer": "ipython3", 652 | "version": "3.7.0" 653 | } 654 | }, 655 | "nbformat": 4, 656 | "nbformat_minor": 1 657 | } 658 | -------------------------------------------------------------------------------- /2019/code/04-dqn.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Q-learning\n", 8 | "\n", 9 | "This notebook will guide you through implementation of vanilla Q-learning algorithm.\n", 10 | "\n", 11 | "You need to implement QLearningAgent (follow instructions for each method) and use it on a number of tests below." 
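Before the implementation below, a tiny numeric illustration of the update rule the agent uses; the values are made up purely to show the arithmetic.

```python
# Vanilla Q-learning update:
# Q(s,a) := (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s',a'))
alpha, gamma = 0.5, 0.99
q_sa = 1.0          # current Q(s, a)
r = 2.0             # observed reward
max_q_next = 3.0    # max over a' of Q(s', a')

q_sa = (1 - alpha) * q_sa + alpha * (r + gamma * max_q_next)
print(q_sa)         # 0.5 * 1.0 + 0.5 * (2.0 + 0.99 * 3.0) = 2.985
```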
12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": null, 17 | "metadata": {}, 18 | "outputs": [], 19 | "source": [ 20 | "# In google collab, uncomment this:\n", 21 | "# !wget https://bit.ly/2FMJP5K -q -O setup.py\n", 22 | "# !bash setup.py 2>&1 1>stdout.log | tee stderr.log\n", 23 | "\n", 24 | "# This code creates a virtual display to draw game images on.\n", 25 | "# If you are running locally, just ignore it\n", 26 | "# import os\n", 27 | "# if type(os.environ.get(\"DISPLAY\")) is not str or len(os.environ.get(\"DISPLAY\")) == 0:\n", 28 | "# !bash ../xvfb start\n", 29 | "# %env DISPLAY = : 1\n", 30 | "\n", 31 | "import numpy as np\n", 32 | "import matplotlib.pyplot as plt\n", 33 | "%matplotlib inline\n", 34 | "%load_ext autoreload\n", 35 | "%autoreload 2" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": null, 41 | "metadata": {}, 42 | "outputs": [], 43 | "source": [ 44 | "%%writefile qlearning.py\n", 45 | "from collections import defaultdict\n", 46 | "import random\n", 47 | "import math\n", 48 | "import numpy as np\n", 49 | "\n", 50 | "\n", 51 | "class QLearningAgent:\n", 52 | " def __init__(self, alpha, epsilon, discount, get_legal_actions):\n", 53 | " \"\"\"\n", 54 | " Q-Learning Agent\n", 55 | " based on https://inst.eecs.berkeley.edu/~cs188/sp19/projects.html\n", 56 | " Instance variables you have access to\n", 57 | " - self.epsilon (exploration prob)\n", 58 | " - self.alpha (learning rate)\n", 59 | " - self.discount (discount rate aka gamma)\n", 60 | "\n", 61 | " Functions you should use\n", 62 | " - self.get_legal_actions(state) {state, hashable -> list of actions, each is hashable}\n", 63 | " which returns legal actions for a state\n", 64 | " - self.get_qvalue(state,action)\n", 65 | " which returns Q(state,action)\n", 66 | " - self.set_qvalue(state,action,value)\n", 67 | " which sets Q(state,action) := value\n", 68 | " !!!Important!!!\n", 69 | " Note: please avoid using self._qValues directly. 
\n", 70 | " There's a special self.get_qvalue/set_qvalue for that.\n", 71 | " \"\"\"\n", 72 | "\n", 73 | " self.get_legal_actions = get_legal_actions\n", 74 | " self._qvalues = defaultdict(lambda: defaultdict(lambda: 0))\n", 75 | " self.alpha = alpha\n", 76 | " self.epsilon = epsilon\n", 77 | " self.discount = discount\n", 78 | "\n", 79 | " def get_qvalue(self, state, action):\n", 80 | " \"\"\" Returns Q(state,action) \"\"\"\n", 81 | " return self._qvalues[state][action]\n", 82 | "\n", 83 | " def set_qvalue(self, state, action, value):\n", 84 | " \"\"\" Sets the Qvalue for [state,action] to the given value \"\"\"\n", 85 | " self._qvalues[state][action] = value\n", 86 | "\n", 87 | " #---------------------START OF YOUR CODE---------------------#\n", 88 | "\n", 89 | " def get_value(self, state):\n", 90 | " \"\"\"\n", 91 | " Compute your agent's estimate of V(s) using current q-values\n", 92 | " V(s) = max_over_action Q(state,action) over possible actions.\n", 93 | " Note: please take into account that q-values can be negative.\n", 94 | " \"\"\"\n", 95 | " possible_actions = self.get_legal_actions(state)\n", 96 | "\n", 97 | " # If there are no legal actions, return 0.0\n", 98 | " if len(possible_actions) == 0:\n", 99 | " return 0.0\n", 100 | "\n", 101 | " \n", 102 | "\n", 103 | " return value\n", 104 | "\n", 105 | " def update(self, state, action, reward, next_state):\n", 106 | " \"\"\"\n", 107 | " You should do your Q-Value update here:\n", 108 | " Q(s,a) := (1 - alpha) * Q(s,a) + alpha * (r + gamma * V(s'))\n", 109 | " \"\"\"\n", 110 | "\n", 111 | " # agent parameters\n", 112 | " gamma = self.discount\n", 113 | " learning_rate = self.alpha\n", 114 | "\n", 115 | " \n", 116 | "\n", 117 | " self.set_qvalue(state, action, < YOUR_QVALUE > )\n", 118 | "\n", 119 | " def get_best_action(self, state):\n", 120 | " \"\"\"\n", 121 | " Compute the best action to take in a state (using current q-values). \n", 122 | " \"\"\"\n", 123 | " possible_actions = self.get_legal_actions(state)\n", 124 | "\n", 125 | " # If there are no legal actions, return None\n", 126 | " if len(possible_actions) == 0:\n", 127 | " return None\n", 128 | "\n", 129 | " \n", 130 | "\n", 131 | " return best_action\n", 132 | "\n", 133 | " def get_action(self, state):\n", 134 | " \"\"\"\n", 135 | " Compute the action to take in the current state, including exploration. \n", 136 | " With probability self.epsilon, we should take a random action.\n", 137 | " otherwise - the best policy action (self.get_best_action).\n", 138 | "\n", 139 | " Note: To pick randomly from a list, use random.choice(list). \n", 140 | " To pick True or False with a given probablity, generate uniform number in [0, 1]\n", 141 | " and compare it with your probability\n", 142 | " \"\"\"\n", 143 | "\n", 144 | " # Pick Action\n", 145 | " possible_actions = self.get_legal_actions(state)\n", 146 | " action = None\n", 147 | "\n", 148 | " # If there are no legal actions, return None\n", 149 | " if len(possible_actions) == 0:\n", 150 | " return None\n", 151 | "\n", 152 | " # agent parameters:\n", 153 | " epsilon = self.epsilon\n", 154 | "\n", 155 | " \n", 156 | "\n", 157 | " return chosen_action" 158 | ] 159 | }, 160 | { 161 | "cell_type": "markdown", 162 | "metadata": {}, 163 | "source": [ 164 | "### Try it on taxi\n", 165 | "\n", 166 | "Here we use the qlearning agent on taxi env from openai gym.\n", 167 | "You will need to insert a few agent functions here." 
168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": null, 173 | "metadata": {}, 174 | "outputs": [], 175 | "source": [ 176 | "import gym\n", 177 | "env = gym.make(\"Taxi-v2\")\n", 178 | "\n", 179 | "n_actions = env.action_space.n" 180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": null, 185 | "metadata": {}, 186 | "outputs": [], 187 | "source": [ 188 | "from qlearning import QLearningAgent\n", 189 | "\n", 190 | "agent = QLearningAgent(\n", 191 | " alpha=0.5, \n", 192 | " epsilon=0.25, \n", 193 | " discount=0.99,\n", 194 | " get_legal_actions=lambda s: range(n_actions))" 195 | ] 196 | }, 197 | { 198 | "cell_type": "code", 199 | "execution_count": null, 200 | "metadata": {}, 201 | "outputs": [], 202 | "source": [ 203 | "def play_and_train(env, agent, t_max=10**4):\n", 204 | " \"\"\"\n", 205 | " This function should \n", 206 | " - run a full game, actions given by agent's e-greedy policy\n", 207 | " - train agent using agent.update(...) whenever it is possible\n", 208 | " - return total reward\n", 209 | " \"\"\"\n", 210 | " total_reward = 0.0\n", 211 | " s = env.reset()\n", 212 | "\n", 213 | " for t in range(t_max):\n", 214 | " # get agent to pick action given state s.\n", 215 | " a = \n", 216 | "\n", 217 | " next_s, r, done, _ = env.step(a)\n", 218 | "\n", 219 | " # train (update) agent for state s\n", 220 | " \n", 221 | "\n", 222 | " s = next_s\n", 223 | " total_reward += r\n", 224 | " if done:\n", 225 | " break\n", 226 | "\n", 227 | " return total_reward" 228 | ] 229 | }, 230 | { 231 | "cell_type": "code", 232 | "execution_count": null, 233 | "metadata": {}, 234 | "outputs": [], 235 | "source": [ 236 | "from IPython.display import clear_output\n", 237 | "\n", 238 | "rewards = []\n", 239 | "for i in range(1000):\n", 240 | " rewards.append(play_and_train(env, agent))\n", 241 | " agent.epsilon *= 0.99\n", 242 | "\n", 243 | " if i % 100 == 0:\n", 244 | " clear_output(True)\n", 245 | " print('eps =', agent.epsilon, 'mean reward =', np.mean(rewards[-10:]))\n", 246 | " plt.plot(rewards)\n", 247 | " plt.show()" 248 | ] 249 | }, 250 | { 251 | "cell_type": "markdown", 252 | "metadata": { 253 | "collapsed": true 254 | }, 255 | "source": [ 256 | "# Binarized state spaces\n", 257 | "\n", 258 | "Use agent to train efficiently on CartPole-v0.\n", 259 | "This environment has a continuous set of possible states, so you will have to group them into bins somehow.\n", 260 | "\n", 261 | "The simplest way is to use `round(x,n_digits)` (or numpy round) to round real number to a given amount of digits.\n", 262 | "\n", 263 | "The tricky part is to get the n_digits right for each state to train effectively.\n", 264 | "\n", 265 | "Note that you don't need to convert state to integers, but to __tuples__ of any kind of values." 266 | ] 267 | }, 268 | { 269 | "cell_type": "code", 270 | "execution_count": null, 271 | "metadata": {}, 272 | "outputs": [], 273 | "source": [ 274 | "env = gym.make(\"CartPole-v0\")\n", 275 | "n_actions = env.action_space.n\n", 276 | "\n", 277 | "print(\"first state:%s\" % (env.reset()))\n", 278 | "plt.imshow(env.render('rgb_array'))" 279 | ] 280 | }, 281 | { 282 | "cell_type": "markdown", 283 | "metadata": {}, 284 | "source": [ 285 | "### Play a few games\n", 286 | "\n", 287 | "We need to estimate observation distributions. To do so, we'll play a few games and record all states." 
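For reference, the play_and_train skeleton a few cells above could be completed along these lines — a sketch that assumes the agent API from the template (get_action for the epsilon-greedy choice, update for the Q-learning step):

    def play_and_train(env, agent, t_max=10**4):
        """Run one full episode, update the agent on every transition, return the total reward."""
        total_reward = 0.0
        s = env.reset()
        for t in range(t_max):
            # let the agent pick an epsilon-greedy action for the current state
            a = agent.get_action(s)
            next_s, r, done, _ = env.step(a)
            # train the agent on the observed transition (s, a, r, s')
            agent.update(s, a, r, next_s)
            s = next_s
            total_reward += r
            if done:
                break
        return total_reward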
288 | ] 289 | }, 290 | { 291 | "cell_type": "code", 292 | "execution_count": null, 293 | "metadata": {}, 294 | "outputs": [], 295 | "source": [ 296 | "all_states = []\n", 297 | "for _ in range(1000):\n", 298 | " all_states.append(env.reset())\n", 299 | " done = False\n", 300 | " while not done:\n", 301 | " s, r, done, _ = env.step(env.action_space.sample())\n", 302 | " all_states.append(s)\n", 303 | " if done:\n", 304 | " break\n", 305 | "\n", 306 | "all_states = np.array(all_states)\n", 307 | "\n", 308 | "for obs_i in range(env.observation_space.shape[0]):\n", 309 | " plt.hist(all_states[:, obs_i], bins=20)\n", 310 | " plt.show()" 311 | ] 312 | }, 313 | { 314 | "cell_type": "markdown", 315 | "metadata": {}, 316 | "source": [ 317 | "## Binarize environment" 318 | ] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": null, 323 | "metadata": {}, 324 | "outputs": [], 325 | "source": [ 326 | "from gym.core import ObservationWrapper\n", 327 | "\n", 328 | "\n", 329 | "class Binarizer(ObservationWrapper):\n", 330 | "\n", 331 | " def observation(self, state):\n", 332 | "\n", 333 | " # state = \n", 334 | " # hint: you can do that with round(x,n_digits)\n", 335 | " # you will need to pick a different n_digits for each dimension\n", 336 | "\n", 337 | " return tuple(state)" 338 | ] 339 | }, 340 | { 341 | "cell_type": "code", 342 | "execution_count": null, 343 | "metadata": {}, 344 | "outputs": [], 345 | "source": [ 346 | "env = Binarizer(gym.make(\"CartPole-v0\"))" 347 | ] 348 | }, 349 | { 350 | "cell_type": "code", 351 | "execution_count": null, 352 | "metadata": {}, 353 | "outputs": [], 354 | "source": [ 355 | "all_states = []\n", 356 | "for _ in range(1000):\n", 357 | " all_states.append(env.reset())\n", 358 | " done = False\n", 359 | " while not done:\n", 360 | " s, r, done, _ = env.step(env.action_space.sample())\n", 361 | " all_states.append(s)\n", 362 | " if done:\n", 363 | " break\n", 364 | "\n", 365 | "all_states = np.array(all_states)\n", 366 | "\n", 367 | "for obs_i in range(env.observation_space.shape[0]):\n", 368 | "\n", 369 | " plt.hist(all_states[:, obs_i], bins=20)\n", 370 | " plt.show()" 371 | ] 372 | }, 373 | { 374 | "cell_type": "markdown", 375 | "metadata": {}, 376 | "source": [ 377 | "## Learn binarized policy\n", 378 | "\n", 379 | "Now let's train a policy that uses binarized state space.\n", 380 | "\n", 381 | "__Tips:__ \n", 382 | "* If your binarization is too coarse, your agent may fail to find optimal policy. In that case, change binarization. \n", 383 | "* If your binarization is too fine-grained, your agent will take much longer than 1000 steps to converge. You can either increase number of iterations and decrease epsilon decay or change binarization.\n", 384 | "* Having 10^3 ~ 10^4 distinct states is recommended (`len(QLearningAgent._qvalues)`), but not required.\n", 385 | "* A reasonable agent should get to an average reward of >=50." 
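A possible shape for the Binarizer defined above — a sketch only: the per-dimension precisions below (N_DIGITS) are illustrative guesses, and tuning them according to the tips above is the actual exercise:

    from gym.core import ObservationWrapper


    class Binarizer(ObservationWrapper):
        # rounding precision for CartPole's (position, velocity, angle, angular velocity);
        # these particular values are assumptions to be tuned, not a prescribed answer
        N_DIGITS = (0, 1, 2, 1)

        def observation(self, state):
            # round each observation dimension to its own number of digits,
            # producing a hashable tuple usable as a tabular state
            state = [round(float(x), n) for x, n in zip(state, self.N_DIGITS)]
            return tuple(state)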
386 | ] 387 | }, 388 | { 389 | "cell_type": "code", 390 | "execution_count": null, 391 | "metadata": {}, 392 | "outputs": [], 393 | "source": [ 394 | "agent = QLearningAgent(\n", 395 | " alpha=0.5, \n", 396 | " epsilon=0.25, \n", 397 | " discount=0.99,\n", 398 | " get_legal_actions=lambda s: range(n_actions))" 399 | ] 400 | }, 401 | { 402 | "cell_type": "code", 403 | "execution_count": null, 404 | "metadata": {}, 405 | "outputs": [], 406 | "source": [ 407 | "rewards = []\n", 408 | "for i in range(1000):\n", 409 | " rewards.append(play_and_train(env, agent))\n", 410 | "\n", 411 | " # OPTIONAL YOUR CODE: adjust epsilon\n", 412 | " if i % 100 == 0:\n", 413 | " clear_output(True)\n", 414 | " print('eps =', agent.epsilon, 'mean reward =', np.mean(rewards[-10:]))\n", 415 | " plt.plot(rewards)\n", 416 | " plt.show()" 417 | ] 418 | }, 419 | { 420 | "cell_type": "code", 421 | "execution_count": null, 422 | "metadata": {}, 423 | "outputs": [], 424 | "source": [] 425 | } 426 | ], 427 | "metadata": { 428 | "kernelspec": { 429 | "display_name": "Python 3", 430 | "language": "python", 431 | "name": "python3" 432 | }, 433 | "language_info": { 434 | "codemirror_mode": { 435 | "name": "ipython", 436 | "version": 3 437 | }, 438 | "file_extension": ".py", 439 | "mimetype": "text/x-python", 440 | "name": "python", 441 | "nbconvert_exporter": "python", 442 | "pygments_lexer": "ipython3", 443 | "version": "3.7.0" 444 | } 445 | }, 446 | "nbformat": 4, 447 | "nbformat_minor": 1 448 | } 449 | -------------------------------------------------------------------------------- /2019/code/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Scitator/RL-intro/0d95c6ca924cd7e2a3e87603c233b3dba34eaf83/2019/code/__init__.py -------------------------------------------------------------------------------- /2019/code/mdp.py: -------------------------------------------------------------------------------- 1 | # most of this code was politely stolen from https://github.com/berkeleydeeprlcourse/homework/ 2 | # all creadit goes to https://github.com/abhishekunique (if i got the author right) 3 | import sys 4 | import random 5 | import numpy as np 6 | 7 | try: 8 | from graphviz import Digraph 9 | import graphviz 10 | has_graphviz = True 11 | except: 12 | has_graphviz = False 13 | 14 | 15 | def weighted_choice(v, p): 16 | total = sum(p) 17 | r = random.uniform(0, total) 18 | upto = 0 19 | for c, w in zip(v, p): 20 | if upto + w >= r: 21 | return c 22 | upto += w 23 | assert False, "Shouldn't get here" 24 | 25 | 26 | class MDP: 27 | def __init__(self, transition_probs, rewards, initial_state=None): 28 | """ 29 | Defines an MDP. Compatible with gym Env. 30 | :param transition_probs: transition_probs[s][a][s_next] = P(s_next | s, a) 31 | A dict[state -> dict] of dicts[action -> dict] of dicts[next_state -> prob] 32 | For each state and action, probabilities of next states should sum to 1 33 | If a state has no actions available, it is considered terminal 34 | :param rewards: rewards[s][a][s_next] = r(s,a,s') 35 | A dict[state -> dict] of dicts[action -> dict] of dicts[next_state -> reward] 36 | The reward for anything not mentioned here is zero. 37 | :param get_initial_state: a state where agent starts or a callable() -> state 38 | By default, picks initial state at random. 
39 | 40 | States and actions can be anything you can use as dict keys, but we recommend that you use strings or integers 41 | 42 | Here's an example from MDP depicted on http://bit.ly/2jrNHNr 43 | transition_probs = { 44 | 's0':{ 45 | 'a0': {'s0': 0.5, 's2': 0.5}, 46 | 'a1': {'s2': 1} 47 | }, 48 | 's1':{ 49 | 'a0': {'s0': 0.7, 's1': 0.1, 's2': 0.2}, 50 | 'a1': {'s1': 0.95, 's2': 0.05} 51 | }, 52 | 's2':{ 53 | 'a0': {'s0': 0.4, 's1': 0.6}, 54 | 'a1': {'s0': 0.3, 's1': 0.3, 's2':0.4} 55 | } 56 | } 57 | rewards = { 58 | 's1': {'a0': {'s0': +5}}, 59 | 's2': {'a1': {'s0': -1}} 60 | } 61 | """ 62 | self._check_param_consistency(transition_probs, rewards) 63 | self._transition_probs = transition_probs 64 | self._rewards = rewards 65 | self._initial_state = initial_state 66 | self.n_states = len(transition_probs) 67 | self.reset() 68 | 69 | def get_all_states(self): 70 | """ return a tuple of all possiblestates """ 71 | return tuple(self._transition_probs.keys()) 72 | 73 | def get_possible_actions(self, state): 74 | """ return a tuple of possible actions in a given state """ 75 | return tuple(self._transition_probs.get(state, {}).keys()) 76 | 77 | def is_terminal(self, state): 78 | """ return True if state is terminal or False if it isn't """ 79 | return len(self.get_possible_actions(state)) == 0 80 | 81 | def get_next_states(self, state, action): 82 | """ return a dictionary of {next_state1 : P(next_state1 | state, action), next_state2: ...} """ 83 | assert action in self.get_possible_actions( 84 | state), "cannot do action %s from state %s" % (action, state) 85 | return self._transition_probs[state][action] 86 | 87 | def get_transition_prob(self, state, action, next_state): 88 | """ return P(next_state | state, action) """ 89 | return self.get_next_states(state, action).get(next_state, 0.0) 90 | 91 | def get_reward(self, state, action, next_state): 92 | """ return the reward you get for taking action in state and landing on next_state""" 93 | assert action in self.get_possible_actions( 94 | state), "cannot do action %s from state %s" % (action, state) 95 | return self._rewards.get(state, {}).get(action, {}).get(next_state, 96 | 0.0) 97 | 98 | def reset(self): 99 | """ reset the game, return the initial state""" 100 | if self._initial_state is None: 101 | self._current_state = random.choice( 102 | tuple(self._transition_probs.keys())) 103 | elif self._initial_state in self._transition_probs: 104 | self._current_state = self._initial_state 105 | elif callable(self._initial_state): 106 | self._current_state = self._initial_state() 107 | else: 108 | raise ValueError( 109 | "initial state %s should be either a state or a function() -> state" % self._initial_state) 110 | return self._current_state 111 | 112 | def step(self, action): 113 | """ take action, return next_state, reward, is_done, empty_info """ 114 | possible_states, probs = zip( 115 | *self.get_next_states(self._current_state, action).items()) 116 | next_state = weighted_choice(possible_states, p=probs) 117 | reward = self.get_reward(self._current_state, action, next_state) 118 | is_done = self.is_terminal(next_state) 119 | self._current_state = next_state 120 | return next_state, reward, is_done, {} 121 | 122 | def render(self): 123 | print("Currently at %s" % self._current_state) 124 | 125 | def _check_param_consistency(self, transition_probs, rewards): 126 | for state in transition_probs: 127 | assert isinstance(transition_probs[state], 128 | dict), "transition_probs for %s should be a dictionary " \ 129 | "but is instead %s" % ( 130 | 
state, type(transition_probs[state])) 131 | for action in transition_probs[state]: 132 | assert isinstance(transition_probs[state][action], 133 | dict), "transition_probs for %s, %s should be a " \ 134 | "a dictionary but is instead %s" % ( 135 | state, action, 136 | type(transition_probs[ 137 | state, action])) 138 | next_state_probs = transition_probs[state][action] 139 | assert len( 140 | next_state_probs) != 0, "from state %s action %s leads to no next states" % ( 141 | state, action) 142 | sum_probs = sum(next_state_probs.values()) 143 | assert abs( 144 | sum_probs - 1) <= 1e-10, "next state probabilities for state %s action %s " \ 145 | "add up to %f (should be 1)" % ( 146 | state, action, sum_probs) 147 | for state in rewards: 148 | assert isinstance(rewards[state], 149 | dict), "rewards for %s should be a dictionary " \ 150 | "but is instead %s" % ( 151 | state, type(transition_probs[state])) 152 | for action in rewards[state]: 153 | assert isinstance(rewards[state][action], 154 | dict), "rewards for %s, %s should be a " \ 155 | "a dictionary but is instead %s" % ( 156 | state, action, type( 157 | transition_probs[ 158 | state, action])) 159 | msg = "The Enrichment Center once again reminds you that Android Hell is a real place where" \ 160 | " you will be sent at the first sign of defiance. " 161 | assert None not in transition_probs, "please do not use None as a state identifier. " + msg 162 | assert None not in rewards, "please do not use None as an action identifier. " + msg 163 | 164 | 165 | class FrozenLakeEnv(MDP): 166 | """ 167 | Winter is here. You and your friends were tossing around a frisbee at the park 168 | when you made a wild throw that left the frisbee out in the middle of the lake. 169 | The water is mostly frozen, but there are a few holes where the ice has melted. 170 | If you step into one of those holes, you'll fall into the freezing water. 171 | At this time, there's an international frisbee shortage, so it's absolutely imperative that 172 | you navigate across the lake and retrieve the disc. 173 | However, the ice is slippery, so you won't always move in the direction you intend. 174 | The surface is described using a grid like the following 175 | 176 | SFFF 177 | FHFH 178 | FFFH 179 | HFFG 180 | 181 | S : starting point, safe 182 | F : frozen surface, safe 183 | H : hole, fall to your doom 184 | G : goal, where the frisbee is located 185 | 186 | The episode ends when you reach the goal or fall in a hole. 187 | You receive a reward of 1 if you reach the goal, and zero otherwise. 
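    Illustrative usage (a minimal sketch; the action names are the strings
    defined in __init__ below):

        env = FrozenLakeEnv(map_name="4x4", slip_chance=0.2)
        s = env.reset()                          # (row, col) of the start cell
        next_s, r, done, _ = env.step("right")   # may slip sideways with total probability slip_chance
        env.render()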
188 | 189 | """ 190 | 191 | MAPS = { 192 | "4x4": [ 193 | "SFFF", 194 | "FHFH", 195 | "FFFH", 196 | "HFFG" 197 | ], 198 | "8x8": [ 199 | "SFFFFFFF", 200 | "FFFFFFFF", 201 | "FFFHFFFF", 202 | "FFFFFHFF", 203 | "FFFHFFFF", 204 | "FHHFFFHF", 205 | "FHFFHFHF", 206 | "FFFHFFFG" 207 | ], 208 | } 209 | 210 | def __init__(self, desc=None, map_name="4x4", slip_chance=0.2): 211 | if desc is None and map_name is None: 212 | raise ValueError('Must provide either desc or map_name') 213 | elif desc is None: 214 | desc = self.MAPS[map_name] 215 | assert ''.join(desc).count( 216 | 'S') == 1, "this implementation supports having exactly one initial state" 217 | assert all(c in "SFHG" for c in 218 | ''.join(desc)), "all cells must be either of S, F, H or G" 219 | 220 | self.desc = desc = np.asarray(list(map(list, desc)), dtype='str') 221 | self.lastaction = None 222 | 223 | nrow, ncol = desc.shape 224 | states = [(i, j) for i in range(nrow) for j in range(ncol)] 225 | actions = ["left", "down", "right", "up"] 226 | 227 | initial_state = states[np.array(desc == b'S').ravel().argmax()] 228 | 229 | def move(row, col, movement): 230 | if movement == 'left': 231 | col = max(col - 1, 0) 232 | elif movement == 'down': 233 | row = min(row + 1, nrow - 1) 234 | elif movement == 'right': 235 | col = min(col + 1, ncol - 1) 236 | elif movement == 'up': 237 | row = max(row - 1, 0) 238 | else: 239 | raise ("invalid action") 240 | return (row, col) 241 | 242 | transition_probs = {s: {} for s in states} 243 | rewards = {s: {} for s in states} 244 | for (row, col) in states: 245 | if desc[row, col] in "GH": continue 246 | for action_i in range(len(actions)): 247 | action = actions[action_i] 248 | transition_probs[(row, col)][action] = {} 249 | rewards[(row, col)][action] = {} 250 | for movement_i in [(action_i - 1) % len(actions), action_i, 251 | (action_i + 1) % len(actions)]: 252 | movement = actions[movement_i] 253 | newrow, newcol = move(row, col, movement) 254 | prob = (1. - slip_chance) if movement == action else ( 255 | slip_chance / 2.) 256 | if prob == 0: continue 257 | if (newrow, newcol) not in transition_probs[row, col][ 258 | action]: 259 | transition_probs[row, col][action][ 260 | newrow, newcol] = prob 261 | else: 262 | transition_probs[row, col][action][ 263 | newrow, newcol] += prob 264 | if desc[newrow, newcol] == 'G': 265 | rewards[row, col][action][newrow, newcol] = 1.0 266 | 267 | MDP.__init__(self, transition_probs, rewards, initial_state) 268 | 269 | def render(self): 270 | desc_copy = np.copy(self.desc) 271 | desc_copy[self._current_state] = '*' 272 | print('\n'.join(map(''.join, desc_copy)), end='\n\n') 273 | 274 | 275 | def plot_graph(mdp, graph_size='10,10', s_node_size='1,5', 276 | a_node_size='0,5', rankdir='LR', ): 277 | """ 278 | Function for pretty drawing MDP graph with graphviz library. 
279 | Requirements: 280 | graphviz : https://www.graphviz.org/ 281 | for ubuntu users: sudo apt-get install graphviz 282 | python library for graphviz 283 | for pip users: pip install graphviz 284 | :param mdp: 285 | :param graph_size: size of graph plot 286 | :param s_node_size: size of state nodes 287 | :param a_node_size: size of action nodes 288 | :param rankdir: order for drawing 289 | :return: dot object 290 | """ 291 | s_node_attrs = {'shape': 'doublecircle', 292 | 'color': '#85ff75', 293 | 'style': 'filled', 294 | 'width': str(s_node_size), 295 | 'height': str(s_node_size), 296 | 'fontname': 'Arial', 297 | 'fontsize': '24'} 298 | 299 | a_node_attrs = {'shape': 'circle', 300 | 'color': 'lightpink', 301 | 'style': 'filled', 302 | 'width': str(a_node_size), 303 | 'height': str(a_node_size), 304 | 'fontname': 'Arial', 305 | 'fontsize': '20'} 306 | 307 | s_a_edge_attrs = {'style': 'bold', 308 | 'color': 'red', 309 | 'ratio': 'auto'} 310 | 311 | a_s_edge_attrs = {'style': 'dashed', 312 | 'color': 'blue', 313 | 'ratio': 'auto', 314 | 'fontname': 'Arial', 315 | 'fontsize': '16'} 316 | 317 | graph = Digraph(name='MDP') 318 | graph.attr(rankdir=rankdir, size=graph_size) 319 | for state_node in mdp._transition_probs: 320 | graph.node(state_node, **s_node_attrs) 321 | 322 | for posible_action in mdp.get_possible_actions(state_node): 323 | action_node = state_node + "-" + posible_action 324 | graph.node(action_node, 325 | label=str(posible_action), 326 | **a_node_attrs) 327 | graph.edge(state_node, state_node + "-" + 328 | posible_action, **s_a_edge_attrs) 329 | 330 | for posible_next_state in mdp.get_next_states(state_node, 331 | posible_action): 332 | probability = mdp.get_transition_prob( 333 | state_node, posible_action, posible_next_state) 334 | reward = mdp.get_reward( 335 | state_node, posible_action, posible_next_state) 336 | 337 | if reward != 0: 338 | label_a_s_edge = 'p = ' + str(probability) + \ 339 | ' ' + 'reward =' + str(reward) 340 | else: 341 | label_a_s_edge = 'p = ' + str(probability) 342 | 343 | graph.edge(action_node, posible_next_state, 344 | label=label_a_s_edge, **a_s_edge_attrs) 345 | return graph 346 | 347 | 348 | def plot_graph_with_state_values(mdp, state_values): 349 | """ Plot graph with state values""" 350 | graph = plot_graph(mdp) 351 | for state_node in mdp._transition_probs: 352 | value = state_values[state_node] 353 | graph.node(state_node, 354 | label=str(state_node) + '\n' + 'V =' + str(value)[:4]) 355 | return graph 356 | 357 | 358 | def get_optimal_action_for_plot(mdp, state_values, state, gamma=0.9): 359 | """ Finds optimal action using formula above. 
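    Concretely, "formula above" is the one-step lookahead

        Q(s, a) = sum over s' of P(s' | s, a) * (r(s, a, s') + gamma * V(s')),

    with V taken from state_values; this is what get_action_value(mdp, state_values,
    state, action, gamma) in mdp_get_action_value.py is expected to compute.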
""" 360 | if mdp.is_terminal(state): return None 361 | next_actions = mdp.get_possible_actions(state) 362 | try: 363 | from mdp_get_action_value import get_action_value 364 | except ImportError: 365 | raise ImportError("Implement get_action_value(mdp, state_values, state, action, gamma) in the file \"mdp_get_action_value.py\".") 366 | q_values = [get_action_value(mdp, state_values, state, action, gamma) for 367 | action in next_actions] 368 | optimal_action = next_actions[np.argmax(q_values)] 369 | return optimal_action 370 | 371 | 372 | def plot_graph_optimal_strategy_and_state_values(mdp, state_values, gamma=0.9): 373 | """ Plot graph with state values and """ 374 | graph = plot_graph(mdp) 375 | opt_s_a_edge_attrs = {'style': 'bold', 376 | 'color': 'green', 377 | 'ratio': 'auto', 378 | 'penwidth': '6'} 379 | 380 | for state_node in mdp._transition_probs: 381 | value = state_values[state_node] 382 | graph.node(state_node, 383 | label=str(state_node) + '\n' + 'V =' + str(value)[:4]) 384 | for action in mdp.get_possible_actions(state_node): 385 | if action == get_optimal_action_for_plot(mdp, 386 | state_values, 387 | state_node, 388 | gamma): 389 | graph.edge(state_node, state_node + "-" + action, 390 | **opt_s_a_edge_attrs) 391 | return graph 392 | -------------------------------------------------------------------------------- /2019/code/mdp_get_action_value.py: -------------------------------------------------------------------------------- 1 | 2 | def get_action_value(mdp, state_values, state, action, gamma): 3 | """ Computes Q(s,a) as in formula above """ 4 | 5 | Q = 0 6 | # YOUR CODE HERE 7 | return Q -------------------------------------------------------------------------------- /2019/code/qlearning.py: -------------------------------------------------------------------------------- 1 | from collections import defaultdict 2 | import random 3 | import math 4 | import numpy as np 5 | 6 | 7 | class QLearningAgent: 8 | def __init__(self, alpha, epsilon, discount, get_legal_actions): 9 | """ 10 | Q-Learning Agent 11 | based on https://inst.eecs.berkeley.edu/~cs188/sp19/projects.html 12 | Instance variables you have access to 13 | - self.epsilon (exploration prob) 14 | - self.alpha (learning rate) 15 | - self.discount (discount rate aka gamma) 16 | 17 | Functions you should use 18 | - self.get_legal_actions(state) {state, hashable -> list of actions, each is hashable} 19 | which returns legal actions for a state 20 | - self.get_qvalue(state,action) 21 | which returns Q(state,action) 22 | - self.set_qvalue(state,action,value) 23 | which sets Q(state,action) := value 24 | !!!Important!!! 25 | Note: please avoid using self._qValues directly. 26 | There's a special self.get_qvalue/set_qvalue for that. 27 | """ 28 | 29 | self.get_legal_actions = get_legal_actions 30 | self._qvalues = defaultdict(lambda: defaultdict(lambda: 0)) 31 | self.alpha = alpha 32 | self.epsilon = epsilon 33 | self.discount = discount 34 | 35 | def get_qvalue(self, state, action): 36 | """ Returns Q(state,action) """ 37 | return self._qvalues[state][action] 38 | 39 | def set_qvalue(self, state, action, value): 40 | """ Sets the Qvalue for [state,action] to the given value """ 41 | self._qvalues[state][action] = value 42 | 43 | #---------------------START OF YOUR CODE---------------------# 44 | 45 | def get_value(self, state): 46 | """ 47 | Compute your agent's estimate of V(s) using current q-values 48 | V(s) = max_over_action Q(state,action) over possible actions. 
49 | Note: please take into account that q-values can be negative. 50 | """ 51 | possible_actions = self.get_legal_actions(state) 52 | 53 | # If there are no legal actions, return 0.0 54 | if len(possible_actions) == 0: 55 | return 0.0 56 | 57 | 58 | 59 | return value 60 | 61 | def update(self, state, action, reward, next_state): 62 | """ 63 | You should do your Q-Value update here: 64 | Q(s,a) := (1 - alpha) * Q(s,a) + alpha * (r + gamma * V(s')) 65 | """ 66 | 67 | # agent parameters 68 | gamma = self.discount 69 | learning_rate = self.alpha 70 | 71 | 72 | 73 | self.set_qvalue(state, action, < YOUR_QVALUE > ) 74 | 75 | def get_best_action(self, state): 76 | """ 77 | Compute the best action to take in a state (using current q-values). 78 | """ 79 | possible_actions = self.get_legal_actions(state) 80 | 81 | # If there are no legal actions, return None 82 | if len(possible_actions) == 0: 83 | return None 84 | 85 | 86 | 87 | return best_action 88 | 89 | def get_action(self, state): 90 | """ 91 | Compute the action to take in the current state, including exploration. 92 | With probability self.epsilon, we should take a random action. 93 | otherwise - the best policy action (self.get_best_action). 94 | 95 | Note: To pick randomly from a list, use random.choice(list). 96 | To pick True or False with a given probablity, generate uniform number in [0, 1] 97 | and compare it with your probability 98 | """ 99 | 100 | # Pick Action 101 | possible_actions = self.get_legal_actions(state) 102 | action = None 103 | 104 | # If there are no legal actions, return None 105 | if len(possible_actions) == 0: 106 | return None 107 | 108 | # agent parameters: 109 | epsilon = self.epsilon 110 | 111 | 112 | 113 | return chosen_action -------------------------------------------------------------------------------- /2019/slides/01-genetics.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Scitator/RL-intro/0d95c6ca924cd7e2a3e87603c233b3dba34eaf83/2019/slides/01-genetics.pdf -------------------------------------------------------------------------------- /2019/slides/02-cem.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Scitator/RL-intro/0d95c6ca924cd7e2a3e87603c233b3dba34eaf83/2019/slides/02-cem.pdf -------------------------------------------------------------------------------- /2019/slides/03-tabular.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Scitator/RL-intro/0d95c6ca924cd7e2a3e87603c233b3dba34eaf83/2019/slides/03-tabular.pdf -------------------------------------------------------------------------------- /2019/slides/04-dqn.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Scitator/RL-intro/0d95c6ca924cd7e2a3e87603c233b3dba34eaf83/2019/slides/04-dqn.pdf -------------------------------------------------------------------------------- /2019/solutions/00-gym.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import numpy as np\n", 10 | "import matplotlib.pyplot as plt\n", 11 | "%matplotlib inline\n", 12 | "# In google collab, uncomment this:\n", 13 | "# !wget https://bit.ly/2FMJP5K -O setup.py && bash setup.py\n", 14 | "\n", 15 | "# This code creates a virtual 
display to draw game images on.\n", 16 | "# If you are running locally, just ignore it\n", 17 | "# import os\n", 18 | "# if type(os.environ.get(\"DISPLAY\")) is not str or len(os.environ.get(\"DISPLAY\")) == 0:\n", 19 | "# !bash ../xvfb start\n", 20 | "# %env DISPLAY = : 1" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "### OpenAI Gym\n", 28 | "\n", 29 | "We're gonna spend several next weeks learning algorithms that solve decision processes. We are then in need of some interesting decision problems to test our algorithms.\n", 30 | "\n", 31 | "That's where OpenAI gym comes into play. It's a python library that wraps many classical decision problems including robot control, videogames and board games.\n", 32 | "\n", 33 | "So here's how it works:" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": 2, 39 | "metadata": {}, 40 | "outputs": [ 41 | { 42 | "name": "stdout", 43 | "output_type": "stream", 44 | "text": [ 45 | "\u001b[33mWARN: gym.spaces.Box autodetected dtype as . Please provide explicit dtype.\u001b[0m\n" 46 | ] 47 | }, 48 | { 49 | "data": { 50 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXgAAAD8CAYAAAB9y7/cAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAFrhJREFUeJzt3X2MXNV9xvHvE5uXNKExhAW5tqlJ4jaQqhiYOo6oKgJ5MW5VEylUoCpYkaVNK0ciCmoDqRSMVKREakKLlKI6geBUaYhLkmIhmsQ1RFH+4GUhxtg4hE2w4o1dvJSXJI3q1s6vf8wZuIxnd+7OzJ25987zkUZz75kzs+fYd585e+49M4oIzMysfl436gaYmVkxHPBmZjXlgDczqykHvJlZTTngzcxqygFvZlZThQW8pHWSnpY0LemGon6OmZl1piKug5e0CPgR8F5gBngUuCYinhr4DzMzs46KGsGvAaYj4icR8b/A3cCGgn6WmZl1sLig110GHMzszwDvnKvymWeeGStXriyoKWZm1XPgwAGef/559fMaRQV8p0a9Zi5I0iQwCXDOOecwNTVVUFPMzKqn0Wj0/RpFTdHMACsy+8uBQ9kKEbE1IhoR0ZiYmCioGWZm46uogH8UWCXpXEknA1cDOwr6WWZm1kEhUzQRcUzSR4FvA4uAOyNiXxE/y8zMOitqDp6IuB+4v6jXNzOz+Xklq5lZTTngzcxqygFvZlZTDngzswGSxGOP9bU+aWAKO8lqZjbO5gr5iy8e3vdgO+DNzIaoU/AXFfqeojEzqymP4M3MhshTNGZmFTfMIJ+Lp2jMzAasDOEODngzs9pywJuZ1ZQD3sysphzwZmY15YA3M6spB7yZWU054M3MasoBb2ZWU32tZJV0APgFcBw4FhENSWcAXwNWAgeAP4uIF/trppmZLdQgRvDvjojVEdFI+zcAuyJiFbAr7ZuZ2ZAVMUWzAdiWtrcBVxbwM8zMrIt+Az6A70h6TNJkKjs7Ig4DpPuz+vwZZmbWg34/TfKSiDgk6Sxgp6Qf5n1iekOYBDjnnHP6bIaZmbXrawQfEYfS/RHgm8Aa4DlJSwHS/ZE5nrs1IhoR0ZiYmOinGWZm1kHPAS/pDZJOa20D7wP2AjuAjanaRuDefhtpZmYL188UzdnANyW1XudfIuJbkh4FtkvaBPwUuKr/ZpqZ2UL1HPAR8RPggg7l/wVc3k+jzMysf17JamZWUw54M7Oa8pdum5kNSDon+cp9NxHFfnerA97MrA95wzzPcwcd+A54M7MF6CfQh/3aDngzs3l0C91Bjrod8GZmQzBX2BY5b5597UajMU/NfBzwZmZJp1Av+kRokRzwZjb26hbsLQ54MxtrRV/JMkoOeDMbS3UO9hYHvJmNlXEI9hYHvJmNhXEK9hYHvJnVXjbcxyHYWxzwZlZb4xrsLf40STOrpSI/UqAqPII3s9oZ95F7iwPezGqlFe7jHOwtDngzqwWP2k/UdQ5e0p2Sjkjamyk7Q9JOSc+k+9NTuSTdJmla0h5JFxXZeDMzcLjPJc9J1ruAdW1lNwC7ImIVsCvtA1wBrEq3SeD2wTTTzOxEkl4zJeNwf62uAR8R3wNeaCveAGxL29uAKzPlX46mh4AlkpYOqrFmZi0etXfX62WSZ0fEYYB0f1YqXwYczNSbSWUnkDQpaUrS1OzsbI/NMLNx53Cf26Cvg+904WnHf/2I2BoRjYhoTExMDLgZZlZnvlImn14D/rnW1Eu6P5LKZ4AVmXrLgUO9N8/M7LUc7vn1GvA7gI1peyNwb6b82nQ1zVrg5dZUjplZP9pPqFp3Xa+Dl/RV4FLgTEkzwE3Ap4HtkjYBPwWuStXvB9YD08CvgA8X0GYzGzM+odqbrgEfEdfM8dDlHeoGsLnfRpmZtXjU3jt/2JiZlZ7DvTf+qAIzKyWP3PvnEbyZlY7DfTAc8GZWKg73wXHAm1lpONwHywFvZqXgcB88B7yZjZzDvRgOeDOzmnLAm9lIefReHAe8mY2Mw71YXuhkZkPnz5YZDo/gzWyoHO7D44A3s5FwuBfPAW9mQ+M59+FywJvZUDjch88Bb2aFc7iPhgPezArlcB8dB7yZFSZ7xYwNX9eAl3SnpCOS9mbKtkj6maTd6bY+89iNkqYlPS3p/UU13Myqw6P30cgzgr8LWNeh/NaIWJ1u9wNIOh+4GnhHes4/Slo0qMaaWXV4amb0ugZ8RHwPeCHn620A7o6IoxHxLDANrOmjfWZWQQ73cujnowo+KulaYAq4PiJeBJYBD2XqzKSyE0iaBCYz+z4YzGrA4V4evZ5kvR14K7AaOAx8NpV3OqPS8X85IrZGRCMiGhdffHHzyT4hY1ZpDvdy6SngI+K5iDgeEb8Gvs
Cr0zAzwIpM1eXAof6aaGZmvegp4CUtzex+AGhdYbMDuFrSKZLOBVYBj+R5zdY7vkfxZtXk0Xv5dJ2Dl/RV4FLgTEkzwE3ApZJW05x+OQB8BCAi9knaDjwFHAM2R8TxvI2JCCR5Pt6sYhzu5dQ14CPimg7Fd8xT/xbgln4aZWbV4b+6y6t0K1mzUzU+cMzKLTty9+i9fEoX8OA/88yqwNMy5VfKgAefdDUz61dpAx4c8mZl5dF7NZQ64M3MrHelD3iP4s3KI3vxg0fv5Vf6gAeHvFkZZH//HO7VUImAB4e8WVk43KujMgEPDnmzUfG0TDVVKuDNzCy/ygW8R/Fmw+XRe3VVLuDBIW82LA73aqtkwIND3qxoDvfqq2zAm1lxPHCqh0oHvEfxZoPn693ro9IBDw55s6I43Kuv8gGf5ZA364/n3eulFgGfPRgd8ma9cbjXT9eAl7RC0oOS9kvaJ+m6VH6GpJ2Snkn3p6dySbpN0rSkPZIuKroT4IPSzKxdnhH8MeD6iDgPWAtslnQ+cAOwKyJWAbvSPsAVwKp0mwRuH3ir5+D5eLPeePReT10DPiIOR8TjafsXwH5gGbAB2JaqbQOuTNsbgC9H00PAEklLB97yudsLOOTN8nK419eC5uAlrQQuBB4Gzo6Iw9B8EwDOStWWAQczT5tJZe2vNSlpStLU7OzswltuZn3zQKjecge8pDcCXwc+FhE/n69qh7IThgYRsTUiGhHRmJiYyNuMXDyKN1sYj97rKVfASzqJZrh/JSK+kYqfa029pPsjqXwGWJF5+nLg0GCam59D3mx+npqpvzxX0Qi4A9gfEZ/LPLQD2Ji2NwL3ZsqvTVfTrAVebk3ljIpD3uy1HO7jYXGOOpcAHwKelLQ7lX0S+DSwXdIm4KfAVemx+4H1wDTwK+DDA23xAkTEKweyJB/MZjjcx0nXgI+I79N5Xh3g8g71A9jcZ7sGJhvyZmbjpBYrWbvxfLxZk0fv42UsAh4c8mYO9/EzNgFvNs48sBlPYxXwHsXbOPLnu4+vsQp4cMjb+HK4j5+xC3hwyNv48Lz7eBvLgDczGwdjG/AexVvdefRuYxvw4JC3+nK4G4x5wIND3urH4W4tYx/wZnXigYplOeDxKN7qwde7WzsHvJlZTTngk+wo3iN5q5rsvLtH79bigM/wL4aZ1YkDvo3n461qfNWMzcUB34FD3qrC4W7zccDPwSFvZedwt27yfOn2CkkPStovaZ+k61L5Fkk/k7Q73dZnnnOjpGlJT0t6f5EdMBtHHnhYHnm+dPsYcH1EPC7pNOAxSTvTY7dGxN9lK0s6H7gaeAfwW8B/SPqdiDg+yIYPQ+v7XP2F3VZWPi5tPl1H8BFxOCIeT9u/APYDy+Z5ygbg7og4GhHPAtPAmkE0dhQ8VWNl46kZy2tBc/CSVgIXAg+noo9K2iPpTkmnp7JlwMHM02aY/w2hMhzyNmoOd1uI3AEv6Y3A14GPRcTPgduBtwKrgcPAZ1tVOzz9hKNR0qSkKUlTs7OzC274MGV/mRzyNioOd1uoXAEv6SSa4f6ViPgGQEQ8FxHHI+LXwBd4dRpmBliRefpy4FD7a0bE1ohoRERjYmKinz4MhX+pzKxq8lxFI+AOYH9EfC5TvjRT7QPA3rS9A7ha0imSzgVWAY8Mrsmj4/l4GxWP3q0Xea6iuQT4EPCkpN2p7JPANZJW05x+OQB8BCAi9knaDjxF8wqczVW8gmYuvrLGhs3hbr3qGvAR8X06z6vfP89zbgFu6aNdZob/WrT+eCVrDzxVY8Pgz3e3fjnge+SQt2FxuFuvHPB9cMhbUTzvboPggB8Qh7wNisPdBsUB3yf/EppZWTngB8BTNTYoHr3bIDngB8Qhb/1yuNugOeAHyCFvvXK4WxEc8APmkLeFcrhbURzwZmY15YAvgEfxlpdH71YkB3xBHPLWjcPdiuaAHwKHvLVzuNswOOALFBEeydsJHO42LA74IXDIW4vD3YbJAW82JH6Dt2FzwA+JR/HW4tG7DYsDfogc8uPLUzM2Cnm+dPtUSY9IekLSPkk3p/JzJT0s6RlJX5N0cio/Je1Pp8dXFtuFanHIjx+Hu41KnhH8UeCyiLgAWA2sk7QW+Axwa0SsAl4ENqX6m4AXI+JtwK2pnnXgkK8/h7uNUteAj6Zfpt2T0i2Ay4B7Uvk24Mq0vSHtkx6/XE6y1/Dlk+PB4W6jlmsOXtIiSbuBI8BO4MfASxFxLFWZAZal7WXAQYD0+MvAmwfZ6LpwyNeXw93KIFfAR8TxiFgNLAfWAOd1qpbuO6XVCUe5pElJU5KmZmdn87bXrPT8hm1lsaCraCLiJeC7wFpgiaTF6aHlwKG0PQOsAEiPvwl4ocNrbY2IRkQ0JiYmemt9DXgUXy/ZkbtH7zZqea6imZC0JG2/HngPsB94EPhgqrYRuDdt70j7pMcfCB/p83LIm1kRFnevwlJgm6RFNN8QtkfEfZKeAu6W9LfAD4A7Uv07gH+WNE1z5H51Ae2unYhAEpI88qsoz7tb2XQN+IjYA1zYofwnNOfj28v/B7hqIK0bMw756nK4Wxl5JWvJeLqmWlpvyOBwt/JxwJeQQ756HO5WRg74knLIl19rKs3hbmXlgC8xh3x5+f/EqsABX3IO+fLxnLtVhQO+Ahzy5eFwtypxwFeEQ360fLWMVZEDvkIc8qPncLcqccBXjEN++Dxyt6pywFdQNuQd9MXxtIxVnQO+orKB45AfvOy/qcPdqsoBX2HD+maocXsD8Uf+Wl3k+TRJK7lhfEjZXCFfpwD0qN3qxiP4mhn2aLv1xjJuo3yzKvAIviZao3ig0JH8fKo8yvfJVKsjB3yNdLq6pgyBVebg97SM1ZmnaGrIV9jk43C3uvMIvqbaR/NlC7BRtsfBbuMiz5dunyrpEUlPSNon6eZUfpekZyXtTrfVqVySbpM0LWmPpIuK7oTNrd9FUXX7C8DhbuMkzwj+KHBZRPxS0knA9yX9e3rsryLinrb6VwCr0u2dwO3p3kZkUCdgt2zZMu9+2Tncbdx0HcFH0y/T7knpNt9vxwbgy+l5DwFLJC3tv6nWj/Z5+YWOzDuFeVUCvv2ks8PdxkWuOXhJi4DHgLcBn4+IhyX9JXCLpE8Bu4AbIuIosAw4mHn6TCo7PNCW24K1r3rtNppv1ZsvyLds2ZJrZD+KN4P2NzEHu42bXFfRRMTxiFgNLAfWSPo94Ebg7cAfAGcAn0jVOw0NT/jNkjQpaUrS1OzsbE+Nt960f8TBXCP6Xka7cwX5MAO+vT8etdu4WtBlkhHxEvBdYF1EHE7TMEeBLwFrUrUZYEXmacuBQx1ea2tENCKiMTEx0VPjrT/todfvCdVuId56vMiwdbCbvSrPVTQTkpak7dcD7wF+2JpXV/M36kpgb3rKDuDadDXNWuDliPD0TEm1QjDPiL7ba4xKe5tH3R6zssgzB78U2Jbm4V8HbI+I+yQ9IGmC5pTMbuAvUv37gfXANPAr4MODb7YNQxmvn2/neXazuXUN+IjYA1zYofyyOeoHsLn/ptmwdfro4W4nWm+66
aaefkY/yvzRB2ZlojL8UjQajZiamhp1M6yD9jBtBf1cwX7zzTfP+VoLfTOYrx0tZTh+zYrQaDSYmprq68SYP4vG5tU+R9+6LDI7750N335CPGuu1+/UJjPrzJ9FY7nN9+1R852UnSuIF3oi18wWxgFvC9YpbOcL614uv3Sgm/XPAW8DMYiPKHaomw2WA94GzkFtVg4+yWpmVlMOeDOzmnLAm5nVlAPezKymHPBmZjXlgDczqykHvJlZTTngzcxqygFvZlZTDngzs5pywJuZ1ZQD3sysphzwZmY1lTvgJS2S9ANJ96X9cyU9LOkZSV+TdHIqPyXtT6fHVxbTdDMzm89CRvDXAfsz+58Bbo2IVcCLwKZUvgl4MSLeBtya6pmZ2ZDlCnhJy4E/Br6Y9gVcBtyTqmwDrkzbG9I+6fHL1es3QJiZWc/yfuHH3wN/DZyW9t8MvBQRx9L+DLAsbS8DDgJExDFJL6f6z2dfUNIkMJl2j0ra21MPyu9M2vpeE3XtF9S3b+5Xtfy2pMmI2NrrC3QNeEl/AhyJiMckXdoq7lA1cjz2akGz0VvTz5iKiEauFldMXftW135BffvmflWPpClSTvYizwj+EuBPJa0HTgV+k+aIfomkxWkUvxw4lOrPACuAGUmLgTcBL/TaQDMz603XOfiIuDEilkfESuBq4IGI+HPgQeCDqdpG4N60vSPtkx5/IPwlnWZmQ9fPdfCfAD4uaZrmHPsdqfwO4M2p/OPADTleq+c/QSqgrn2ra7+gvn1zv6qnr77Jg2szs3rySlYzs5oaecBLWifp6bTyNc90TqlIulPSkexlnpLOkLQzrfLdKen0VC5Jt6W+7pF00ehaPj9JKyQ9KGm/pH2Srkvlle6bpFMlPSLpidSvm1N5LVZm13XFuaQDkp6UtDtdWVL5YxFA0hJJ90j6Yfpde9cg+zXSgJe0CPg8cAVwPnCNpPNH2aYe3AWsayu7AdiVVvnu4tXzEFcAq9JtErh9SG3sxTHg+og4D1gLbE7/N1Xv21Hgsoi4AFgNrJO0lvqszK7zivN3R8TqzCWRVT8WAf4B+FZEvB24gOb/3eD6FREjuwHvAr6d2b8RuHGUbeqxHyuBvZn9p4GlaXsp8HTa/ifgmk71yn6jeZXUe+vUN+A3gMeBd9JcKLM4lb9yXALfBt6Vthenehp12+foz/IUCJcB99Fck1L5fqU2HgDObCur9LFI85LzZ9v/3QfZr1FP0byy6jXJroitsrMj4jBAuj8rlVeyv+nP9wuBh6lB39I0xm7gCLAT+DE5V2YDrZXZZdRacf7rtJ97xTnl7hc0F0t+R9JjaRU8VP9YfAswC3wpTat9UdIbGGC/Rh3wuVa91kjl+ivpjcDXgY9FxM/nq9qhrJR9i4jjEbGa5oh3DXBep2rpvhL9UmbFeba4Q9VK9Svjkoi4iOY0xWZJfzRP3ar0bTFwEXB7RFwI/DfzX1a+4H6NOuBbq15bsitiq+w5SUsB0v2RVF6p/ko6iWa4fyUivpGKa9E3gIh4CfguzXMMS9LKa+i8MpuSr8xurTg/ANxNc5rmlRXnqU4V+wVARBxK90eAb9J8Y676sTgDzETEw2n/HpqBP7B+jTrgHwVWpTP9J9NcKbtjxG0ahOxq3vZVvtems+FrgZdbf4qVjSTRXLS2PyI+l3mo0n2TNCFpSdp+PfAemie2Kr0yO2q84lzSGySd1toG3gfspeLHYkT8J3BQ0u+mosuBpxhkv0pwomE98COa86B/M+r29ND+rwKHgf+j+Q67ieZc5i7gmXR/RqormlcN/Rh4EmiMuv3z9OsPaf75twfYnW7rq9434PeBH6R+7QU+lcrfAjwCTAP/CpySyk9N+9Pp8beMug85+ngpcF9d+pX68ES67WvlRNWPxdTW1cBUOh7/DTh9kP3ySlYzs5oa9RSNmZkVxAFvZlZTDngzs5pywJuZ1ZQD3sysphzwZmY15YA3M6spB7yZWU39P9sq59z6XHYTAAAAAElFTkSuQmCC\n", 51 | "text/plain": [ 52 | "" 53 | ] 54 | }, 55 | "metadata": {}, 56 | "output_type": "display_data" 57 | }, 58 | { 59 | "name": "stdout", 60 | "output_type": "stream", 61 | "text": [ 62 | "Observation space: Box(2,)\n", 63 | "Action space: Discrete(3)\n" 64 | ] 65 | } 66 | ], 67 | "source": [ 68 | "import gym\n", 69 | "env = gym.make(\"MountainCar-v0\")\n", 70 | "\n", 71 | "plt.imshow(env.render('rgb_array'))\n", 72 | "plt.show()\n", 73 | "print(\"Observation space:\", env.observation_space)\n", 74 | "print(\"Action space:\", env.action_space)" 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": {}, 80 | "source": [ 81 | "Note: if you're running this on your local machine, you'll see a window pop up with the image above. Don't close it, just alt-tab away." 
82 | ] 83 | }, 84 | { 85 | "cell_type": "markdown", 86 | "metadata": {}, 87 | "source": [ 88 | "### Gym interface\n", 89 | "\n", 90 | "The three main methods of an environment are\n", 91 | "* __reset()__ - reset environment to initial state, _return first observation_\n", 92 | "* __render()__ - show current environment state (a more colorful version :) )\n", 93 | "* __step(a)__ - commit action __a__ and return (new observation, reward, is done, info)\n", 94 | " * _new observation_ - an observation right after commiting the action __a__\n", 95 | " * _reward_ - a number representing your reward for commiting action __a__\n", 96 | " * _is done_ - True if the MDP has just finished, False if still in progress\n", 97 | " * _info_ - some auxilary stuff about what just happened. Ignore it ~~for now~~." 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": 3, 103 | "metadata": { 104 | "scrolled": true 105 | }, 106 | "outputs": [ 107 | { 108 | "name": "stdout", 109 | "output_type": "stream", 110 | "text": [ 111 | "initial observation code: [-0.45297143 0. ]\n" 112 | ] 113 | } 114 | ], 115 | "source": [ 116 | "obs0 = env.reset()\n", 117 | "print(\"initial observation code:\", obs0)\n", 118 | "\n", 119 | "# Note: in MountainCar, observation is just two numbers: car position and velocity" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": 4, 125 | "metadata": {}, 126 | "outputs": [ 127 | { 128 | "name": "stdout", 129 | "output_type": "stream", 130 | "text": [ 131 | "taking action 2 (right)\n", 132 | "new observation code: [-0.45249718 0.00047425]\n", 133 | "reward: -1.0\n", 134 | "is game over?: False\n" 135 | ] 136 | } 137 | ], 138 | "source": [ 139 | "print(\"taking action 2 (right)\")\n", 140 | "new_obs, reward, is_done, _ = env.step(2)\n", 141 | "\n", 142 | "print(\"new observation code:\", new_obs)\n", 143 | "print(\"reward:\", reward)\n", 144 | "print(\"is game over?:\", is_done)\n", 145 | "\n", 146 | "# Note: as you can see, the car has moved to the right slightly (around 0.0005)" 147 | ] 148 | }, 149 | { 150 | "cell_type": "markdown", 151 | "metadata": {}, 152 | "source": [ 153 | "### Play with it\n", 154 | "\n", 155 | "Below is the code that drives the car to the right. \n", 156 | "\n", 157 | "However, it doesn't reach the flag at the far right due to gravity. \n", 158 | "\n", 159 | "__Your task__ is to fix it. Find a strategy that reaches the flag. \n", 160 | "\n", 161 | "You're not required to build any sophisticated algorithms for now, feel free to hard-code :)\n", 162 | "\n", 163 | "_Hint: your action at each step should depend either on __t__ or on __s__._" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": 5, 169 | "metadata": {}, 170 | "outputs": [ 171 | { 172 | "name": "stdout", 173 | "output_type": "stream", 174 | "text": [ 175 | "Time limit exceeded. 
Try again.\n" 176 | ] 177 | }, 178 | { 179 | "data": { 180 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXgAAAD8CAYAAAB9y7/cAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAFrRJREFUeJzt3X2MXNV9xvHvU5uXNKExhAW5tqlJ4jaQqhjYOo6oKgJ5MW5VEylURlWwIkubSkYiCmoDqVQ7UpESqQktUovqFIJTpSGUJMVCNIlriKL8wctCjLFxCJtgxRu7eCkvSRqVxs6vf8wZuF7P7tyd13vPPB9pNPeeOTN7jj37zNkz58woIjAzs/z82rAbYGZm/eGANzPLlAPezCxTDngzs0w54M3MMuWANzPLVN8CXtI6Sc9ImpJ0U79+jpmZtaZ+rIOXtAj4AfA+YBp4DLg2Ip7u+Q8zM7OW+jWCXwNMRcSPIuL/gLuBDX36WWZm1sLiPj3uMuBQ4XwaeNdclc8+++xYuXJln5piZlY/Bw8e5IUXXlA3j9GvgG/VqBPmgiRNABMA5513HpOTk31qiplZ/YyPj3f9GP2aopkGVhTOlwOHixUiYntEjEfE+NjYWJ+aYWY2uvoV8I8BqySdL+lUYCOws08/y8zMWujLFE1EHJN0PfBNYBFwZ0Ts78fPMjOz1vo1B09EPAA80K/HNzOz+Xknq5lZphzwZmaZcsCbmWXKAW9m1kOSePzxrvYn9Uzf3mQ1Mxtlc4X8pZcO7nuwHfBmZgPUKvj7FfqeojEzy5RH8GZmA+QpGjOzmhtkkM/FUzRmZj1WhXAHB7yZWbYc8GZmmXLAm5llygFvZpYpB7yZWaYc8GZmmXLAm5llygFvZpaprnaySjoI/Aw4DhyLiHFJZwFfAVYCB4E/jYiXumummZktVC9G8O+JiNURMZ7ObwJ2R8QqYHc6NzOzAevHFM0GYEc63gFc3YefYWZmbXQb8AF8S9LjkiZS2bkRcQQgXZ/T5c8wM7MOdPtpkpdFxGFJ5wC7JH2/7B3TC8IEwHnnnddlM8zMbLauRvARcThdHwW+DqwBnpe0FCBdH53jvtsjYjwixsfGxrpphpmZtdBxwEt6o6QzmsfA+4F9wE5gU6q2Cbiv20aamdnCdTNFcy7wdUnNx/nXiPiGpMeAeyRtBn4MXNN9M83MbKE6DviI+BFwUYvy/wau7KZRZmbWPe9kNTPLlAPezCxT/tJtM7MeSe9JvnbdTkR/v7vVAW9m1oWyYV7mvr0OfAe8mdkCdBPog35sB7yZ2TzahW4vR90OeDOzAZgrbPs5b1587PHx8XlqluOANzNLWoV6v98I7ScHvJmNvNyCvckBb2Yjrd8rWYbJAW9mIynnYG9ywJvZSBmFYG9ywJvZSBilYG9ywJtZ9orhPgrB3uSAN7NsjWqwN/nTJM0sS/38SIG68AjezLIz6iP3Jge8mWWlGe6jHOxNDngzy4JH7SdrOwcv6U5JRyXtK5SdJWmXpGfT9ZmpXJJukzQlaa+kS/rZeDMzcLjPpcybrHcB62aV3QTsjohVwO50DnAVsCpdJoDbe9NMM7OTSTphSsbhfqK2AR8R3wFenFW8AdiRjncAVxfKvxgNDwNLJC3tVWPNzJo8am+v02WS50bEEYB0fU4qXwYcKtSbTmUnkTQhaVLS5MzMTIfNMLNR53CfW6/XwbdaeNryXz8itkfEeESMj42N9bgZZpYzr5Qpp9OAf7459ZKuj6byaWBFod5y4HDnzTMzO5HDvbxOA34nsCkdbwLuK5Rfl1bTrAVeaU7lmJl1Y/YbqtZe23Xwkr4MXA6cLWka2Ap8GrhH0mbgx8A1qfoDwHpgCvgF8JE+tNnMRozfUO1M24CPiGvnuOnKFnUD2NJto8zMmjxq75w/bMzMKs/h3hl/VIGZVZJH7t3zCN7MKsfh3hsOeDOrFId77zjgzawyHO695YA3s0pwuPeeA97Mhs7h3h8OeDOzTDngzWyoPHrvHwe8mQ2Nw72/vNHJzAbOny0zGB7Bm9lAOdwHxwFvZkPhcO8/B7yZDYzn3AfLAW9mA+FwHzwHvJn1ncN9OBzwZtZXDvfhccCbWd8UV8zY4LUNeEl3SjoqaV+hbJukn0jaky7rC7fdLGlK0jOSPtCvhptZfXj0PhxlRvB3AetalN8aEavT5QEASRcCG4F3pvv8o6RFvWqsmdWHp2aGr23AR8R3gBdLPt4G4O6IeDUingOmgDVdtM/MasjhXg3dfFTB9ZKuAyaBGyPiJWAZ8HChznQqO4mkCWCicO4ng1kGHO7V0embrLcDbwNWA0eAz6byVu+otPxfjojtETEeEeOXXnpp485+Q8as1hzu1dJRwEfE8xFxPCJ+BXye16dhpoEVharLgcPdNdHMzDrRUcBLWlo4/SDQXGGzE9go6TRJ5wOrgEfLPGbzFd+jeLN68ui9etrOwUv6MnA5cLakaWArcLmk1TSmXw4CHwWIiP2S7gGeBo4BWyLieNnGRASSPB9vVjMO92pqG/ARcW2L4jvmqX8LcEs3jTKz+vBf3dVVuZ2sxakaP3HMqq04cvfovXoqF/DgP/PM6sDTMtVXyYAHv+lqZtatygY8OOTNqsqj93qodMCbmVnnKh/wHsWbVUdx8YNH79VX+YAHh7xZFRR//xzu9VCLgAeHvFlVONzrozYBDw55s2HxtEw91SrgzcysvNoFvEfxZoPl0Xt91S7gwSFvNigO93qrZcCDQ96s3xzu9VfbgDez/vHAKQ+1DniP4s16z+vd81HrgAeHvFm/ONzrr/YBX+SQN+uO593zkkXAF5+MDnmzzjjc89M24CWtkPSQpAOS9ku6IZWfJWmXpGfT9ZmpXJJukzQlaa+kS/rdCfCT0sxstjIj+GPAjRFxAbAW2CLpQuAmYHdErAJ2p3OAq4BV6TIB3N7zVs/B8/FmnfHoPU9tAz4ijkTEE+n4Z8ABYBmwAdiRqu0Ark7HG4AvRsPDwBJJS3ve8rnbCzjkzcpyuOdrQXPwklYCFwOPAOdGxBFovAgA56Rqy4BDhbtNp7LZjzUhaVLS5MzMzMJbbmZd80Aob6UDXtKbgK8CH4uIn85XtUXZSUODiNgeEeMRMT42Nla2GaV4FG+2MB6956lUwEs6hUa4fykivpaKn29OvaTro6l8GlhRuPty4HBvmlueQ95sfp6ayV+ZVTQC7gAORMTnCjftBDal403AfYXy69JqmrXAK82pnGFxyJudyOE+GhaXqHMZ8GHgKUl7UtkngU8D90jaDPwYuCbd9gCwHpgCfgF8pKctXoCIeO2JLMlPZjMc7qOkbcBHxHdpPa8OcGWL+gFs6bJdPVMMeTOzUZLFTtZ2PB9v1uDR+2gZiYAHh7yZw330jEzAm40yD2xG00gFvEfxNor8+e6ja6QCHhzyNroc7qNn5AIeHPI2OjzvPtpGMuDNzEbByAa8
R/GWO4/ebWQDHhzyli+Hu8GIBzw45C0/DndrGvmAN8uJBypW5IDHo3jLg9e722wOeDOzTDngk+Io3iN5q5vivLtH79bkgC/wL4aZ5cQBP4vn461uvGrG5uKAb8Ehb3XhcLf5OODn4JC3qnO4WztlvnR7haSHJB2QtF/SDal8m6SfSNqTLusL97lZ0pSkZyR9oJ8dMBtFHnhYGWW+dPsYcGNEPCHpDOBxSbvSbbdGxN8WK0u6ENgIvBP4TeA/Jf12RBzvZcMHofl9rv7CbqsqPy9tPm1H8BFxJCKeSMc/Aw4Ay+a5ywbg7oh4NSKeA6aANb1o7DB4qsaqxlMzVtaC5uAlrQQuBh5JRddL2ivpTklnprJlwKHC3aaZ/wWhNhzyNmwOd1uI0gEv6U3AV4GPRcRPgduBtwGrgSPAZ5tVW9z9pGejpAlJk5ImZ2ZmFtzwQSr+MjnkbVgc7rZQpQJe0ik0wv1LEfE1gIh4PiKOR8SvgM/z+jTMNLCicPflwOHZjxkR2yNiPCLGx8bGuunDQPiXyszqpswqGgF3AAci4nOF8qWFah8E9qXjncBGSadJOh9YBTzauyYPj+fjbVg8erdOlFlFcxnwYeApSXtS2SeBayWtpjH9chD4KEBE7Jd0D/A0jRU4W+q4gmYuXlljg+Zwt061DfiI+C6t59UfmOc+twC3dNEuM8N/LVp3vJO1A56qsUHw57tbtxzwHXLI26A43K1TDvguOOStXzzvbr3ggO8Rh7z1isPdesUB3yX/EppZVTnge8BTNdYrHr1bLznge8Qhb91yuFuvOeB7yCFvnXK4Wz844HvMIW8L5XC3fnHAm5llygHfBx7FW1kevVs/OeD7xCFv7Tjcrd8c8APgkLfZHO42CA74PooIj+TtJA53GxQH/AA45K3J4W6D5IA3GxC/wNugOeAHxKN4a/Lo3QbFAT9ADvnR5akZG4YyX7p9uqRHJT0pab+kT6Xy8yU9IulZSV+RdGoqPy2dT6XbV/a3C/XikB89DncbljIj+FeBKyLiImA1sE7SWuAzwK0RsQp4Cdic6m8GXoqItwO3pnrWgkM+fw53G6a2AR8NP0+np6RLAFcA96byHcDV6XhDOifdfqWcZCfw8snR4HC3YSs1By9pkaQ9wFFgF/BD4OWIOJaqTAPL0vEy4BBAuv0V4C29bHQuHPL5crhbFZQK+Ig4HhGrgeXAGuCCVtXSdau0OulZLmlC0qSkyZmZmbLtNas8v2BbVSxoFU1EvAx8G1gLLJG0ON20HDicjqeBFQDp9jcDL7Z4rO0RMR4R42NjY521PgMexeelOHL36N2GrcwqmjFJS9LxG4D3AgeAh4APpWqbgPvS8c50Trr9wfAzfV4OeTPrh8Xtq7AU2CFpEY0XhHsi4n5JTwN3S/ob4HvAHan+HcC/SJqiMXLf2Id2ZycikIQkj/xqyvPuVjVtAz4i9gIXtyj/EY35+Nnl/wtc05PWjRiHfH053K2KvJO1YjxdUy/NF2RwuFv1OOAryCFfPw53qyIHfEU55KuvOZXmcLeqcsBXmEO+uvx/YnXggK84h3znmvPjxXnyXj0ueFrGqs8BXwMO+YXpdaDPfmxwuFs9OOBrwiHfXrtg7+bfzqtlrI4c8DXikG+tnyP22RzuVicO+JpxyL9uUMHukbvVlQO+hoohP4pB302/F3I/T8tY3Tnga6oYOKMQ8v1YDdPu5zU53K2uynzYmFXU7JF8jkHUTaBv27atZVm7fyuP2i0XHsFnINd5+V6He7G81WN7SsZy44DPTG4hb2adc8BnYtTm5Luxbdu2k0bo/iYmy5EDPiPFcMphhU2nQTvX9EwrnpaxnPlN1gw1vzgEyPLN13bz69u2bSsV8l4pY7nzCD5Ts0fzdTU7eOcL7uZt7cK6uZKmWdfhbrkq86Xbp0t6VNKTkvZL+lQqv0vSc5L2pMvqVC5Jt0makrRX0iX97oTNLacpmzKa/d26dWvpuma5KjNF8ypwRUT8XNIpwHcl/Ue67S8i4t5Z9a8CVqXLu4Db07UNSS5TNguZW2/3GHX9NzBbiDJfuh3Az9PpKeky32/HBuCL6X4PS1oiaWlEHOm6tdax2SHfLKuDYtsXojmK91y7japSc/CSFknaAxwFdkXEI+mmW9I0zK2STktly4BDhbtPpzIbstnzzXWbslnoCH72tJTD3UZNqYCPiOMRsRpYDqyR9LvAzcA7gN8HzgI+kaq3So2TfrMkTUialDQ5MzPTUeOtM62WU1Y97JttLjO33irYHe42iha0iiYiXga+DayLiCPR8CrwBWBNqjYNrCjcbTlwuMVjbY+I8YgYHxsb66jx1p25NvtU3Vwhv3XrVge7WUHbOXhJY8AvI+JlSW8A3gt8pjmvrsZv1NXAvnSXncD1ku6m8ebqK55/r67ZUzZ1mdIohnyzzX4D1exEZVbRLAV2SFpEY8R/T0TcL+nBFP4C9gB/nuo/AKwHpoBfAB/pfbNtEOqw4mb2Xx1Vb6/ZIJVZRbMXuLhF+RVz1A9gS/dNs0FrtTGqiiP6uaaSqtI+s6rwRxXYSeZaaTPM5ZUOdbOFc8DbvFqFfauw7XXQzveGr0PdrBwHvJU232fbdBLIC1m141A3WzgHvC1Yq7CdL6w7WX7pQDfrngPeeqIXO2Qd6ma95YC3nnNQm1WDPw/ezCxTDngzs0w54M3MMuWANzPLlAPezCxTDngzs0w54M3MMuWANzPLlAPezCxTDngzs0w54M3MMuWANzPLlAPezCxTpQNe0iJJ35N0fzo/X9Ijkp6V9BVJp6by09L5VLp9ZX+abmZm81nICP4G4EDh/DPArRGxCngJ2JzKNwMvRcTbgVtTPTMzG7BSAS9pOfBHwD+ncwFXAPemKjuAq9PxhnROuv1KdfoNEGZm1rGyX/jxd8BfAmek87cAL0fEsXQ+DSxLx8uAQwARcUzSK6n+C8UHlDQBTKTTVyXt66gH1Xc2s/qeiVz7Bfn2zf2ql9+SNBER2zt9gLYBL+mPgaMR8biky5vFLapGidteL2g0env6GZMRMV6qxTWTa99y7Rfk2zf3q34kTZJyshNlRvCXAX8iaT1wOvAbNEb0SyQtTqP45cDhVH8aWAFMS1oMvBl4sdMGmplZZ9rOwUfEzRGxPCJWAhuBByPiz4CHgA+lapuA+9LxznROuv3B8Jd0mpkNXDfr4D8BfFzSFI059jtS+R3AW1L5x4GbSjxWx3+C1ECufcu1X5Bv39yv+umqb/Lg2swsT97JamaWqaEHvKR1kp5JO1/LTOdUiqQ7JR0tLvOUdJakXWmX7y5JZ6ZySbot9XWvpEuG1/L5SVoh6SFJByTtl3RDKq913ySdLulRSU+mfn0qlWexMzvXHeeSDkp6StKetLKk9s9FAElLJN0r6fvpd+3dvezXUANe0iLgH4CrgAuBayVdOMw2deAuYN2sspuA3WmX725efx/iKmBVukwAtw+ojZ04BtwYERcAa4Et6f+m7n17FbgiIi4CVgPrJK0ln53ZOe84f09ErC4siaz7cxHg74FvRMQ7gIto/N/1rl8
RMbQL8G7gm4Xzm4Gbh9mmDvuxEthXOH8GWJqOlwLPpON/Aq5tVa/qFxqrpN6XU9+AXweeAN5FY6PM4lT+2vMS+Cbw7nS8ONXTsNs+R3+Wp0C4Arifxp6U2vcrtfEgcPasslo/F2ksOX9u9r97L/s17Cma13a9JsUdsXV2bkQcAUjX56TyWvY3/fl+MfAIGfQtTWPsAY4Cu4AfUnJnNtDcmV1FzR3nv0rnpXecU+1+QWOz5LckPZ52wUP9n4tvBWaAL6RptX+W9EZ62K9hB3ypXa8ZqV1/Jb0J+CrwsYj46XxVW5RVsm8RcTwiVtMY8a4BLmhVLV3Xol8q7DgvFreoWqt+FVwWEZfQmKbYIukP56lbl74tBi4Bbo+Ii4H/Yf5l5Qvu17ADvrnrtam4I7bOnpe0FCBdH03lteqvpFNohPuXIuJrqTiLvgFExMvAt2m8x7Ak7byG1juzqfjO7OaO84PA3TSmaV7bcZ7q1LFfAETE4XR9FPg6jRfmuj8Xp4HpiHgknd9LI/B71q9hB/xjwKr0Tv+pNHbK7hxym3qhuJt39i7f69K74WuBV5p/ilWNJNHYtHYgIj5XuKnWfZM0JmlJOn4D8F4ab2zVemd2ZLzjXNIbJZ3RPAbeD+yj5s/FiPgv4JCk30lFVwJP08t+VeCNhvXAD2jMg/7VsNvTQfu/DBwBfknjFXYzjbnM3cCz6fqsVFc0Vg39EHgKGB92++fp1x/Q+PNvL7AnXdbXvW/A7wHfS/3aB/x1Kn8r8CgwBfwbcFoqPz2dT6Xb3zrsPpTo4+XA/bn0K/XhyXTZ38yJuj8XU1tXA5Pp+fjvwJm97Jd3spqZZWrYUzRmZtYnDngzs0w54M3MMuWANzPLlAPezCxTDngzs0w54M3MMuWANzPL1P8DeYrY4C21P9gAAAAASUVORK5CYII=\n", 181 | "text/plain": [ 182 | "" 183 | ] 184 | }, 185 | "metadata": {}, 186 | "output_type": "display_data" 187 | } 188 | ], 189 | "source": [ 190 | "\n", 191 | "# create env manually to set time limit. Please don't change this.\n", 192 | "TIME_LIMIT = 250\n", 193 | "env = gym.wrappers.TimeLimit(\n", 194 | " gym.envs.classic_control.MountainCarEnv(),\n", 195 | " max_episode_steps=TIME_LIMIT + 1)\n", 196 | "s = env.reset()\n", 197 | "actions = {'left': 0, 'stop': 1, 'right': 2}\n", 198 | "\n", 199 | "# prepare \"display\"\n", 200 | "%matplotlib inline\n", 201 | "from IPython.display import clear_output\n", 202 | "\n", 203 | "\n", 204 | "for t in range(TIME_LIMIT):\n", 205 | "\n", 206 | " # change the line below to reach the flag\n", 207 | " s, r, done, _ = env.step(actions['right'])\n", 208 | "\n", 209 | " # draw game image on display\n", 210 | " clear_output(True)\n", 211 | " plt.imshow(env.render('rgb_array'))\n", 212 | "\n", 213 | " if done:\n", 214 | " print(\"Well done!\")\n", 215 | " break\n", 216 | "else:\n", 217 | " print(\"Time limit exceeded. 
Try again.\");" 218 | ] 219 | }, 220 | { 221 | "cell_type": "code", 222 | "execution_count": 6, 223 | "metadata": {}, 224 | "outputs": [ 225 | { 226 | "ename": "AssertionError", 227 | "evalue": "", 228 | "output_type": "error", 229 | "traceback": [ 230 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 231 | "\u001b[0;31mAssertionError\u001b[0m Traceback (most recent call last)", 232 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0;32massert\u001b[0m \u001b[0ms\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m>\u001b[0m \u001b[0;36m0.47\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"You solved it!\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 233 | "\u001b[0;31mAssertionError\u001b[0m: " 234 | ] 235 | } 236 | ], 237 | "source": [ 238 | "assert s[0] > 0.47\n", 239 | "print(\"You solved it!\")" 240 | ] 241 | }, 242 | { 243 | "cell_type": "code", 244 | "execution_count": null, 245 | "metadata": {}, 246 | "outputs": [], 247 | "source": [] 248 | } 249 | ], 250 | "metadata": { 251 | "kernelspec": { 252 | "display_name": "Python 3", 253 | "language": "python", 254 | "name": "python3" 255 | }, 256 | "language_info": { 257 | "codemirror_mode": { 258 | "name": "ipython", 259 | "version": 3 260 | }, 261 | "file_extension": ".py", 262 | "mimetype": "text/x-python", 263 | "name": "python", 264 | "nbconvert_exporter": "python", 265 | "pygments_lexer": "ipython3", 266 | "version": "3.7.0" 267 | } 268 | }, 269 | "nbformat": 4, 270 | "nbformat_minor": 1 271 | } 272 | -------------------------------------------------------------------------------- /2019/solutions/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Scitator/RL-intro/0d95c6ca924cd7e2a3e87603c233b3dba34eaf83/2019/solutions/__init__.py -------------------------------------------------------------------------------- /2019/solutions/mdp.py: -------------------------------------------------------------------------------- 1 | # most of this code was politely stolen from https://github.com/berkeleydeeprlcourse/homework/ 2 | # all creadit goes to https://github.com/abhishekunique (if i got the author right) 3 | import sys 4 | import random 5 | import numpy as np 6 | 7 | try: 8 | from graphviz import Digraph 9 | import graphviz 10 | has_graphviz = True 11 | except: 12 | has_graphviz = False 13 | 14 | 15 | def weighted_choice(v, p): 16 | total = sum(p) 17 | r = random.uniform(0, total) 18 | upto = 0 19 | for c, w in zip(v, p): 20 | if upto + w >= r: 21 | return c 22 | upto += w 23 | assert False, "Shouldn't get here" 24 | 25 | 26 | class MDP: 27 | def __init__(self, transition_probs, rewards, initial_state=None): 28 | """ 29 | Defines an MDP. Compatible with gym Env. 30 | :param transition_probs: transition_probs[s][a][s_next] = P(s_next | s, a) 31 | A dict[state -> dict] of dicts[action -> dict] of dicts[next_state -> prob] 32 | For each state and action, probabilities of next states should sum to 1 33 | If a state has no actions available, it is considered terminal 34 | :param rewards: rewards[s][a][s_next] = r(s,a,s') 35 | A dict[state -> dict] of dicts[action -> dict] of dicts[next_state -> reward] 36 | The reward for anything not mentioned here is zero. 
37 | :param get_initial_state: a state where agent starts or a callable() -> state 38 | By default, picks initial state at random. 39 | 40 | States and actions can be anything you can use as dict keys, but we recommend that you use strings or integers 41 | 42 | Here's an example from MDP depicted on http://bit.ly/2jrNHNr 43 | transition_probs = { 44 | 's0':{ 45 | 'a0': {'s0': 0.5, 's2': 0.5}, 46 | 'a1': {'s2': 1} 47 | }, 48 | 's1':{ 49 | 'a0': {'s0': 0.7, 's1': 0.1, 's2': 0.2}, 50 | 'a1': {'s1': 0.95, 's2': 0.05} 51 | }, 52 | 's2':{ 53 | 'a0': {'s0': 0.4, 's1': 0.6}, 54 | 'a1': {'s0': 0.3, 's1': 0.3, 's2':0.4} 55 | } 56 | } 57 | rewards = { 58 | 's1': {'a0': {'s0': +5}}, 59 | 's2': {'a1': {'s0': -1}} 60 | } 61 | """ 62 | self._check_param_consistency(transition_probs, rewards) 63 | self._transition_probs = transition_probs 64 | self._rewards = rewards 65 | self._initial_state = initial_state 66 | self.n_states = len(transition_probs) 67 | self.reset() 68 | 69 | def get_all_states(self): 70 | """ return a tuple of all possiblestates """ 71 | return tuple(self._transition_probs.keys()) 72 | 73 | def get_possible_actions(self, state): 74 | """ return a tuple of possible actions in a given state """ 75 | return tuple(self._transition_probs.get(state, {}).keys()) 76 | 77 | def is_terminal(self, state): 78 | """ return True if state is terminal or False if it isn't """ 79 | return len(self.get_possible_actions(state)) == 0 80 | 81 | def get_next_states(self, state, action): 82 | """ return a dictionary of {next_state1 : P(next_state1 | state, action), next_state2: ...} """ 83 | assert action in self.get_possible_actions( 84 | state), "cannot do action %s from state %s" % (action, state) 85 | return self._transition_probs[state][action] 86 | 87 | def get_transition_prob(self, state, action, next_state): 88 | """ return P(next_state | state, action) """ 89 | return self.get_next_states(state, action).get(next_state, 0.0) 90 | 91 | def get_reward(self, state, action, next_state): 92 | """ return the reward you get for taking action in state and landing on next_state""" 93 | assert action in self.get_possible_actions( 94 | state), "cannot do action %s from state %s" % (action, state) 95 | return self._rewards.get(state, {}).get(action, {}).get(next_state, 96 | 0.0) 97 | 98 | def reset(self): 99 | """ reset the game, return the initial state""" 100 | if self._initial_state is None: 101 | self._current_state = random.choice( 102 | tuple(self._transition_probs.keys())) 103 | elif self._initial_state in self._transition_probs: 104 | self._current_state = self._initial_state 105 | elif callable(self._initial_state): 106 | self._current_state = self._initial_state() 107 | else: 108 | raise ValueError( 109 | "initial state %s should be either a state or a function() -> state" % self._initial_state) 110 | return self._current_state 111 | 112 | def step(self, action): 113 | """ take action, return next_state, reward, is_done, empty_info """ 114 | possible_states, probs = zip( 115 | *self.get_next_states(self._current_state, action).items()) 116 | next_state = weighted_choice(possible_states, p=probs) 117 | reward = self.get_reward(self._current_state, action, next_state) 118 | is_done = self.is_terminal(next_state) 119 | self._current_state = next_state 120 | return next_state, reward, is_done, {} 121 | 122 | def render(self): 123 | print("Currently at %s" % self._current_state) 124 | 125 | def _check_param_consistency(self, transition_probs, rewards): 126 | for state in transition_probs: 127 | assert 
isinstance(transition_probs[state], 128 | dict), "transition_probs for %s should be a dictionary " \ 129 | "but is instead %s" % ( 130 | state, type(transition_probs[state])) 131 | for action in transition_probs[state]: 132 | assert isinstance(transition_probs[state][action], 133 | dict), "transition_probs for %s, %s should be a " \ 134 | "a dictionary but is instead %s" % ( 135 | state, action, 136 | type(transition_probs[ 137 | state, action])) 138 | next_state_probs = transition_probs[state][action] 139 | assert len( 140 | next_state_probs) != 0, "from state %s action %s leads to no next states" % ( 141 | state, action) 142 | sum_probs = sum(next_state_probs.values()) 143 | assert abs( 144 | sum_probs - 1) <= 1e-10, "next state probabilities for state %s action %s " \ 145 | "add up to %f (should be 1)" % ( 146 | state, action, sum_probs) 147 | for state in rewards: 148 | assert isinstance(rewards[state], 149 | dict), "rewards for %s should be a dictionary " \ 150 | "but is instead %s" % ( 151 | state, type(transition_probs[state])) 152 | for action in rewards[state]: 153 | assert isinstance(rewards[state][action], 154 | dict), "rewards for %s, %s should be a " \ 155 | "a dictionary but is instead %s" % ( 156 | state, action, type( 157 | transition_probs[ 158 | state, action])) 159 | msg = "The Enrichment Center once again reminds you that Android Hell is a real place where" \ 160 | " you will be sent at the first sign of defiance. " 161 | assert None not in transition_probs, "please do not use None as a state identifier. " + msg 162 | assert None not in rewards, "please do not use None as an action identifier. " + msg 163 | 164 | 165 | class FrozenLakeEnv(MDP): 166 | """ 167 | Winter is here. You and your friends were tossing around a frisbee at the park 168 | when you made a wild throw that left the frisbee out in the middle of the lake. 169 | The water is mostly frozen, but there are a few holes where the ice has melted. 170 | If you step into one of those holes, you'll fall into the freezing water. 171 | At this time, there's an international frisbee shortage, so it's absolutely imperative that 172 | you navigate across the lake and retrieve the disc. 173 | However, the ice is slippery, so you won't always move in the direction you intend. 174 | The surface is described using a grid like the following 175 | 176 | SFFF 177 | FHFH 178 | FFFH 179 | HFFG 180 | 181 | S : starting point, safe 182 | F : frozen surface, safe 183 | H : hole, fall to your doom 184 | G : goal, where the frisbee is located 185 | 186 | The episode ends when you reach the goal or fall in a hole. 187 | You receive a reward of 1 if you reach the goal, and zero otherwise. 
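The ice is slippery: with probability slip_chance the agent slips and moves perpendicular to the intended direction (slip_chance / 2 for each of the two perpendicular moves), as implemented in __init__ below.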
188 | 189 | """ 190 | 191 | MAPS = { 192 | "4x4": [ 193 | "SFFF", 194 | "FHFH", 195 | "FFFH", 196 | "HFFG" 197 | ], 198 | "8x8": [ 199 | "SFFFFFFF", 200 | "FFFFFFFF", 201 | "FFFHFFFF", 202 | "FFFFFHFF", 203 | "FFFHFFFF", 204 | "FHHFFFHF", 205 | "FHFFHFHF", 206 | "FFFHFFFG" 207 | ], 208 | } 209 | 210 | def __init__(self, desc=None, map_name="4x4", slip_chance=0.2): 211 | if desc is None and map_name is None: 212 | raise ValueError('Must provide either desc or map_name') 213 | elif desc is None: 214 | desc = self.MAPS[map_name] 215 | assert ''.join(desc).count( 216 | 'S') == 1, "this implementation supports having exactly one initial state" 217 | assert all(c in "SFHG" for c in 218 | ''.join(desc)), "all cells must be either of S, F, H or G" 219 | 220 | self.desc = desc = np.asarray(list(map(list, desc)), dtype='str') 221 | self.lastaction = None 222 | 223 | nrow, ncol = desc.shape 224 | states = [(i, j) for i in range(nrow) for j in range(ncol)] 225 | actions = ["left", "down", "right", "up"] 226 | 227 | initial_state = states[np.array(desc == b'S').ravel().argmax()] 228 | 229 | def move(row, col, movement): 230 | if movement == 'left': 231 | col = max(col - 1, 0) 232 | elif movement == 'down': 233 | row = min(row + 1, nrow - 1) 234 | elif movement == 'right': 235 | col = min(col + 1, ncol - 1) 236 | elif movement == 'up': 237 | row = max(row - 1, 0) 238 | else: 239 | raise ("invalid action") 240 | return (row, col) 241 | 242 | transition_probs = {s: {} for s in states} 243 | rewards = {s: {} for s in states} 244 | for (row, col) in states: 245 | if desc[row, col] in "GH": continue 246 | for action_i in range(len(actions)): 247 | action = actions[action_i] 248 | transition_probs[(row, col)][action] = {} 249 | rewards[(row, col)][action] = {} 250 | for movement_i in [(action_i - 1) % len(actions), action_i, 251 | (action_i + 1) % len(actions)]: 252 | movement = actions[movement_i] 253 | newrow, newcol = move(row, col, movement) 254 | prob = (1. - slip_chance) if movement == action else ( 255 | slip_chance / 2.) 256 | if prob == 0: continue 257 | if (newrow, newcol) not in transition_probs[row, col][ 258 | action]: 259 | transition_probs[row, col][action][ 260 | newrow, newcol] = prob 261 | else: 262 | transition_probs[row, col][action][ 263 | newrow, newcol] += prob 264 | if desc[newrow, newcol] == 'G': 265 | rewards[row, col][action][newrow, newcol] = 1.0 266 | 267 | MDP.__init__(self, transition_probs, rewards, initial_state) 268 | 269 | def render(self): 270 | desc_copy = np.copy(self.desc) 271 | desc_copy[self._current_state] = '*' 272 | print('\n'.join(map(''.join, desc_copy)), end='\n\n') 273 | 274 | 275 | def plot_graph(mdp, graph_size='10,10', s_node_size='1,5', 276 | a_node_size='0,5', rankdir='LR', ): 277 | """ 278 | Function for pretty drawing MDP graph with graphviz library. 
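Usage: plot_graph(mdp) returns a graphviz.Digraph (the dot object), which renders inline in a Jupyter notebook.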
279 | Requirements: 280 | graphviz : https://www.graphviz.org/ 281 | for ubuntu users: sudo apt-get install graphviz 282 | python library for graphviz 283 | for pip users: pip install graphviz 284 | :param mdp: 285 | :param graph_size: size of graph plot 286 | :param s_node_size: size of state nodes 287 | :param a_node_size: size of action nodes 288 | :param rankdir: order for drawing 289 | :return: dot object 290 | """ 291 | s_node_attrs = {'shape': 'doublecircle', 292 | 'color': '#85ff75', 293 | 'style': 'filled', 294 | 'width': str(s_node_size), 295 | 'height': str(s_node_size), 296 | 'fontname': 'Arial', 297 | 'fontsize': '24'} 298 | 299 | a_node_attrs = {'shape': 'circle', 300 | 'color': 'lightpink', 301 | 'style': 'filled', 302 | 'width': str(a_node_size), 303 | 'height': str(a_node_size), 304 | 'fontname': 'Arial', 305 | 'fontsize': '20'} 306 | 307 | s_a_edge_attrs = {'style': 'bold', 308 | 'color': 'red', 309 | 'ratio': 'auto'} 310 | 311 | a_s_edge_attrs = {'style': 'dashed', 312 | 'color': 'blue', 313 | 'ratio': 'auto', 314 | 'fontname': 'Arial', 315 | 'fontsize': '16'} 316 | 317 | graph = Digraph(name='MDP') 318 | graph.attr(rankdir=rankdir, size=graph_size) 319 | for state_node in mdp._transition_probs: 320 | graph.node(state_node, **s_node_attrs) 321 | 322 | for posible_action in mdp.get_possible_actions(state_node): 323 | action_node = state_node + "-" + posible_action 324 | graph.node(action_node, 325 | label=str(posible_action), 326 | **a_node_attrs) 327 | graph.edge(state_node, state_node + "-" + 328 | posible_action, **s_a_edge_attrs) 329 | 330 | for posible_next_state in mdp.get_next_states(state_node, 331 | posible_action): 332 | probability = mdp.get_transition_prob( 333 | state_node, posible_action, posible_next_state) 334 | reward = mdp.get_reward( 335 | state_node, posible_action, posible_next_state) 336 | 337 | if reward != 0: 338 | label_a_s_edge = 'p = ' + str(probability) + \ 339 | ' ' + 'reward =' + str(reward) 340 | else: 341 | label_a_s_edge = 'p = ' + str(probability) 342 | 343 | graph.edge(action_node, posible_next_state, 344 | label=label_a_s_edge, **a_s_edge_attrs) 345 | return graph 346 | 347 | 348 | def plot_graph_with_state_values(mdp, state_values): 349 | """ Plot graph with state values""" 350 | graph = plot_graph(mdp) 351 | for state_node in mdp._transition_probs: 352 | value = state_values[state_node] 353 | graph.node(state_node, 354 | label=str(state_node) + '\n' + 'V =' + str(value)[:4]) 355 | return graph 356 | 357 | 358 | def get_optimal_action_for_plot(mdp, state_values, state, gamma=0.9): 359 | """ Finds optimal action using formula above. 
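The formula: Q(s, a) = sum over s' of P(s' | s, a) * (r(s, a, s') + gamma * V(s')); the optimal action is the argmax over a of Q(s, a), computed here via get_action_value from mdp_get_action_value.py.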
""" 360 | if mdp.is_terminal(state): return None 361 | next_actions = mdp.get_possible_actions(state) 362 | try: 363 | from mdp_get_action_value import get_action_value 364 | except ImportError: 365 | raise ImportError("Implement get_action_value(mdp, state_values, state, action, gamma) in the file \"mdp_get_action_value.py\".") 366 | q_values = [get_action_value(mdp, state_values, state, action, gamma) for 367 | action in next_actions] 368 | optimal_action = next_actions[np.argmax(q_values)] 369 | return optimal_action 370 | 371 | 372 | def plot_graph_optimal_strategy_and_state_values(mdp, state_values, gamma=0.9): 373 | """ Plot graph with state values and """ 374 | graph = plot_graph(mdp) 375 | opt_s_a_edge_attrs = {'style': 'bold', 376 | 'color': 'green', 377 | 'ratio': 'auto', 378 | 'penwidth': '6'} 379 | 380 | for state_node in mdp._transition_probs: 381 | value = state_values[state_node] 382 | graph.node(state_node, 383 | label=str(state_node) + '\n' + 'V =' + str(value)[:4]) 384 | for action in mdp.get_possible_actions(state_node): 385 | if action == get_optimal_action_for_plot(mdp, 386 | state_values, 387 | state_node, 388 | gamma): 389 | graph.edge(state_node, state_node + "-" + action, 390 | **opt_s_a_edge_attrs) 391 | return graph 392 | -------------------------------------------------------------------------------- /2019/solutions/mdp_get_action_value.py: -------------------------------------------------------------------------------- 1 | 2 | def get_action_value(mdp, state_values, state, action, gamma): 3 | """ Computes Q(s,a) as in formula above """ 4 | 5 | # YOUR CODE HERE 6 | Q = 0 7 | for next_state in mdp.get_next_states(state, action): 8 | p = mdp.get_transition_prob(state, action, next_state) 9 | r = mdp.get_reward(state, action, next_state) 10 | next_v = gamma * state_values[next_state] 11 | Q += p * (r + next_v) 12 | 13 | return Q -------------------------------------------------------------------------------- /2019/solutions/qlearning.py: -------------------------------------------------------------------------------- 1 | from collections import defaultdict 2 | import random 3 | import math 4 | import numpy as np 5 | 6 | 7 | class QLearningAgent: 8 | def __init__(self, alpha, epsilon, discount, get_legal_actions): 9 | """ 10 | Q-Learning Agent 11 | based on https://inst.eecs.berkeley.edu/~cs188/sp19/projects.html 12 | Instance variables you have access to 13 | - self.epsilon (exploration prob) 14 | - self.alpha (learning rate) 15 | - self.discount (discount rate aka gamma) 16 | 17 | Functions you should use 18 | - self.get_legal_actions(state) {state, hashable -> list of actions, each is hashable} 19 | which returns legal actions for a state 20 | - self.get_qvalue(state,action) 21 | which returns Q(state,action) 22 | - self.set_qvalue(state,action,value) 23 | which sets Q(state,action) := value 24 | !!!Important!!! 25 | Note: please avoid using self._qValues directly. 26 | There's a special self.get_qvalue/set_qvalue for that. 
27 | """ 28 | 29 | self.get_legal_actions = get_legal_actions 30 | self._qvalues = defaultdict(lambda: defaultdict(lambda: 0)) 31 | self.alpha = alpha 32 | self.epsilon = epsilon 33 | self.discount = discount 34 | 35 | def get_qvalue(self, state, action): 36 | """ Returns Q(state,action) """ 37 | return self._qvalues[state][action] 38 | 39 | def set_qvalue(self, state, action, value): 40 | """ Sets the Qvalue for [state,action] to the given value """ 41 | self._qvalues[state][action] = value 42 | 43 | #---------------------START OF YOUR CODE---------------------# 44 | 45 | def get_value(self, state): 46 | """ 47 | Compute your agent's estimate of V(s) using current q-values 48 | V(s) = max_over_action Q(state,action) over possible actions. 49 | Note: please take into account that q-values can be negative. 50 | """ 51 | possible_actions = self.get_legal_actions(state) 52 | 53 | # If there are no legal actions, return 0.0 54 | if len(possible_actions) == 0: 55 | return 0.0 56 | 57 | # 58 | value = max([ 59 | self.get_qvalue(state, a) 60 | for a in possible_actions]) 61 | 62 | return value 63 | 64 | def update(self, state, action, reward, next_state): 65 | """ 66 | You should do your Q-Value update here: 67 | Q(s,a) := (1 - alpha) * Q(s,a) + alpha * (r + gamma * V(s')) 68 | """ 69 | 70 | # agent parameters 71 | gamma = self.discount 72 | learning_rate = self.alpha 73 | 74 | # 75 | reference_qvalue = reward + gamma * self.get_value(next_state) 76 | updated_qvalue = (1 - learning_rate) * self.get_qvalue(state, action) \ 77 | + learning_rate * reference_qvalue 78 | 79 | self.set_qvalue(state, action, updated_qvalue) 80 | 81 | def get_best_action(self, state): 82 | """ 83 | Compute the best action to take in a state (using current q-values). 84 | """ 85 | possible_actions = self.get_legal_actions(state) 86 | 87 | # If there are no legal actions, return None 88 | if len(possible_actions) == 0: 89 | return None 90 | 91 | # 92 | best_action_i = np.argmax([ 93 | self.get_qvalue(state, a) 94 | for a in possible_actions]) 95 | best_action = possible_actions[best_action_i] 96 | 97 | return best_action 98 | 99 | def get_action(self, state): 100 | """ 101 | Compute the action to take in the current state, including exploration. 102 | With probability self.epsilon, we should take a random action. 103 | otherwise - the best policy action (self.get_best_action). 104 | 105 | Note: To pick randomly from a list, use random.choice(list). 106 | To pick True or False with a given probablity, generate uniform number in [0, 1] 107 | and compare it with your probability 108 | """ 109 | 110 | # Pick Action 111 | possible_actions = self.get_legal_actions(state) 112 | action = None 113 | 114 | # If there are no legal actions, return None 115 | if len(possible_actions) == 0: 116 | return None 117 | 118 | # agent parameters: 119 | epsilon = self.epsilon 120 | 121 | # 122 | if np.random.random() <= epsilon: 123 | chosen_action = random.choice(possible_actions) 124 | else: 125 | chosen_action = self.get_best_action(state) 126 | 127 | return chosen_action -------------------------------------------------------------------------------- /2020/code/DDPG.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Deep Deterministic Policy Gradient\n", 8 | "\n", 9 | "In this notebook you will teach a __pytorch__ neural network to do Deterministic Policy Gradient." 
10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": null, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "# !pip install -r ../requirements.txt" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": null, 24 | "metadata": {}, 25 | "outputs": [], 26 | "source": [ 27 | "import math\n", 28 | "import random\n", 29 | "\n", 30 | "import gym\n", 31 | "import numpy as np\n", 32 | "\n", 33 | "import torch\n", 34 | "import torch.nn as nn\n", 35 | "import torch.optim as optim\n", 36 | "import torch.nn.functional as F\n", 37 | "from torch.distributions import Normal" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": null, 43 | "metadata": {}, 44 | "outputs": [], 45 | "source": [ 46 | "# from IPython.display import clear_output\n", 47 | "import matplotlib.pyplot as plt\n", 48 | "%matplotlib inline" 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": null, 54 | "metadata": {}, 55 | "outputs": [], 56 | "source": [ 57 | "use_cuda = torch.cuda.is_available()\n", 58 | "device = torch.device(\"cuda\" if use_cuda else \"cpu\")" 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "metadata": {}, 64 | "source": [ 65 | "## Environment\n", 66 | "### Normalize action space" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": null, 72 | "metadata": {}, 73 | "outputs": [], 74 | "source": [ 75 | "class NormalizedActions(gym.ActionWrapper):\n", 76 | "\n", 77 | " def action(self, action):\n", 78 | " low_bound = self.action_space.low\n", 79 | " upper_bound = self.action_space.high\n", 80 | " \n", 81 | " action = low_bound + (action + 1.0) * 0.5 * (upper_bound - low_bound)\n", 82 | " action = np.clip(action, low_bound, upper_bound)\n", 83 | " \n", 84 | " return action\n", 85 | "\n", 86 | " def reverse_action(self, action):\n", 87 | " low_bound = self.action_space.low\n", 88 | " upper_bound = self.action_space.high\n", 89 | " \n", 90 | " action = 2 * (action - low_bound) / (upper_bound - low_bound) - 1\n", 91 | " action = np.clip(action, low_bound, upper_bound)\n", 92 | " \n", 93 | " return actions" 94 | ] 95 | }, 96 | { 97 | "cell_type": "markdown", 98 | "metadata": {}, 99 | "source": [ 100 | "### Exploration - GaussNoise\n", 101 | "Adding Normal noise to the actions taken by the deterministic policy
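: a_noisy = a + eps with eps ~ N(0, sigma^2), implemented below as np.random.normal(action, sigma).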
" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": null, 107 | "metadata": {}, 108 | "outputs": [], 109 | "source": [ 110 | "class GaussNoise:\n", 111 | " \"\"\"\n", 112 | " For continuous environments only.\n", 113 | " Adds spherical Gaussian noise to the action produced by actor.\n", 114 | " \"\"\"\n", 115 | "\n", 116 | " def __init__(self, sigma):\n", 117 | " super().__init__()\n", 118 | "\n", 119 | " self.sigma = sigma\n", 120 | "\n", 121 | " def get_action(self, action):\n", 122 | " noisy_action = np.random.normal(action, self.sigma)\n", 123 | " return noisy_action" 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": {}, 129 | "source": [ 130 | "

## Continuous control with deep reinforcement learning\n", 131 | "[Arxiv](https://arxiv.org/abs/1509.02971)

" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": null, 137 | "metadata": {}, 138 | "outputs": [], 139 | "source": [ 140 | "class ValueNetwork(nn.Module):\n", 141 | " def __init__(\n", 142 | " self, \n", 143 | " num_inputs, \n", 144 | " num_actions, \n", 145 | " hidden_size, \n", 146 | " init_w=3e-3\n", 147 | " ):\n", 148 | " super().__init__()\n", 149 | " self.net = nn.Sequential(\n", 150 | " nn.Linear(num_inputs + num_actions, hidden_size),\n", 151 | " nn.ReLU(),\n", 152 | " nn.Linear(hidden_size, hidden_size),\n", 153 | " nn.ReLU(),\n", 154 | " )\n", 155 | " self.head = nn.Linear(hidden_size, 1)\n", 156 | " \n", 157 | " self.head.weight.data.uniform_(-init_w, init_w)\n", 158 | " self.head.bias.data.uniform_(-init_w, init_w)\n", 159 | " \n", 160 | " def forward(self, state, action):\n", 161 | " x = torch.cat([state, action], 1)\n", 162 | " x = self.net(x)\n", 163 | " x = self.head(x)\n", 164 | " return x\n", 165 | "\n", 166 | "\n", 167 | "class PolicyNetwork(nn.Module):\n", 168 | " def __init__(\n", 169 | " self, \n", 170 | " num_inputs, \n", 171 | " num_actions, \n", 172 | " hidden_size, \n", 173 | " init_w=3e-3\n", 174 | " ):\n", 175 | " super().__init__()\n", 176 | " self.net = nn.Sequential(\n", 177 | " nn.Linear(num_inputs, hidden_size),\n", 178 | " nn.ReLU(),\n", 179 | " nn.Linear(hidden_size, hidden_size),\n", 180 | " nn.ReLU(),\n", 181 | " )\n", 182 | " self.head = nn.Linear(hidden_size, num_actions)\n", 183 | " \n", 184 | " self.head.weight.data.uniform_(-init_w, init_w)\n", 185 | " self.head.bias.data.uniform_(-init_w, init_w)\n", 186 | " \n", 187 | " def forward(self, state):\n", 188 | " x = state\n", 189 | " x = self.net(x)\n", 190 | " x = self.head(x)\n", 191 | " return x\n", 192 | " \n", 193 | " def get_action(self, state):\n", 194 | " state = torch.tensor(state, dtype=torch.float32)\\\n", 195 | " .unsqueeze(0).to(device)\n", 196 | " action = self.forward(state)\n", 197 | " action = action.detach().cpu().numpy()[0]\n", 198 | " return action" 199 | ] 200 | }, 201 | { 202 | "cell_type": "markdown", 203 | "metadata": {}, 204 | "source": [ 205 | "

## DDPG Update
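A short sketch of the update implemented in `ddpg_update` below:
$$ y = r + \gamma \cdot (1 - done) \cdot Q_{target}(s', \mu_{target}(s')) $$
$$ L_{critic} = { 1 \over N} \sum_i (Q_{\theta}(s, a) - y)^2 \qquad L_{actor} = - { 1 \over N} \sum_i Q_{\theta}(s, \mu(s)) $$
followed by a soft update of both target networks: $\theta_{target} \leftarrow (1 - \tau) \cdot \theta_{target} + \tau \cdot \theta$.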

" 206 | ] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "execution_count": null, 211 | "metadata": {}, 212 | "outputs": [], 213 | "source": [ 214 | "def ddpg_update(\n", 215 | " state, \n", 216 | " action, \n", 217 | " reward, \n", 218 | " next_state, \n", 219 | " done, \n", 220 | " gamma = 0.99,\n", 221 | " min_value=-np.inf,\n", 222 | " max_value=np.inf,\n", 223 | " soft_tau=1e-2,\n", 224 | "): \n", 225 | " state = torch.tensor(state, dtype=torch.float32).to(device)\n", 226 | " next_state = torch.tensor(next_state, dtype=torch.float32).to(device)\n", 227 | " action = torch.tensor(action, dtype=torch.float32).to(device)\n", 228 | " reward = torch.tensor(reward, dtype=torch.float32).unsqueeze(1).to(device)\n", 229 | " done = torch.tensor(np.float32(done)).unsqueeze(1).to(device)\n", 230 | "\n", 231 | " policy_loss = value_net(state, policy_net(state))\n", 232 | " policy_loss = -policy_loss.mean()\n", 233 | "\n", 234 | " next_action = target_policy_net(next_state)\n", 235 | " target_value = target_value_net(next_state, next_action.detach())\n", 236 | " expected_value = reward + (1.0 - done) * gamma * target_value\n", 237 | " expected_value = torch.clamp(expected_value, min_value, max_value)\n", 238 | "\n", 239 | " value = value_net(state, action)\n", 240 | " value_loss = value_criterion(value, expected_value.detach())\n", 241 | "\n", 242 | "\n", 243 | " policy_optimizer.zero_grad()\n", 244 | " policy_loss.backward()\n", 245 | " policy_optimizer.step()\n", 246 | "\n", 247 | " value_optimizer.zero_grad()\n", 248 | " value_loss.backward()\n", 249 | " value_optimizer.step()\n", 250 | "\n", 251 | " for target_param, param in zip(target_value_net.parameters(), value_net.parameters()):\n", 252 | " target_param.data.copy_(\n", 253 | " target_param.data * (1.0 - soft_tau) + param.data * soft_tau\n", 254 | " )\n", 255 | "\n", 256 | " for target_param, param in zip(target_policy_net.parameters(), policy_net.parameters()):\n", 257 | " target_param.data.copy_(\n", 258 | " target_param.data * (1.0 - soft_tau) + param.data * soft_tau\n", 259 | " )" 260 | ] 261 | }, 262 | { 263 | "cell_type": "markdown", 264 | "metadata": {}, 265 | "source": [ 266 | "### Experience replay buffer\n", 267 | "\n", 268 | "![img](https://github.com/yandexdataschool/Practical_RL/raw/master/yet_another_week/_resource/exp_replay.png)" 269 | ] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "execution_count": null, 274 | "metadata": {}, 275 | "outputs": [], 276 | "source": [ 277 | "class ReplayBuffer:\n", 278 | " def __init__(self, capacity):\n", 279 | " self.capacity = capacity\n", 280 | " self.buffer = []\n", 281 | " self.position = 0\n", 282 | " \n", 283 | " def push(self, state, action, reward, next_state, done):\n", 284 | " if len(self.buffer) < self.capacity:\n", 285 | " self.buffer.append(None)\n", 286 | " self.buffer[self.position] = (state, action, reward, next_state, done)\n", 287 | " self.position = (self.position + 1) % self.capacity\n", 288 | " \n", 289 | " def sample(self, batch_size):\n", 290 | " batch = random.sample(self.buffer, batch_size)\n", 291 | " state, action, reward, next_state, done = map(np.stack, zip(*batch))\n", 292 | " return state, action, reward, next_state, done\n", 293 | " \n", 294 | " def __len__(self):\n", 295 | " return len(self.buffer)" 296 | ] 297 | }, 298 | { 299 | "cell_type": "markdown", 300 | "metadata": {}, 301 | "source": [ 302 | "---" 303 | ] 304 | }, 305 | { 306 | "cell_type": "code", 307 | "execution_count": null, 308 | "metadata": {}, 309 | "outputs": [], 310 | "source": [ 311 
| "batch_size = 128\n", 312 | "\n", 313 | "def generate_session(t_max=1000, train=False):\n", 314 | " \"\"\"play env with ddpg agent and train it at the same time\"\"\"\n", 315 | " total_reward = 0\n", 316 | " state = env.reset()\n", 317 | "\n", 318 | " for t in range(t_max):\n", 319 | " action = policy_net.get_action(state)\n", 320 | " if train:\n", 321 | " action = noise.get_action(action)\n", 322 | " next_state, reward, done, _ = env.step(action)\n", 323 | "\n", 324 | " if train:\n", 325 | " replay_buffer.push(state, action, reward, next_state, done)\n", 326 | " if len(replay_buffer) > batch_size:\n", 327 | " states, actions, rewards, next_states, dones = \\\n", 328 | " replay_buffer.sample(batch_size)\n", 329 | " ddpg_update(states, actions, rewards, next_states, dones)\n", 330 | "\n", 331 | " total_reward += reward\n", 332 | " state = next_state\n", 333 | " if done:\n", 334 | " break\n", 335 | "\n", 336 | " return total_reward" 337 | ] 338 | }, 339 | { 340 | "cell_type": "code", 341 | "execution_count": null, 342 | "metadata": {}, 343 | "outputs": [], 344 | "source": [ 345 | "env = NormalizedActions(gym.make(\"Pendulum-v0\"))\n", 346 | "noise = GaussNoise(sigma=0.3)\n", 347 | "\n", 348 | "state_dim = env.observation_space.shape[0]\n", 349 | "action_dim = env.action_space.shape[0]\n", 350 | "hidden_dim = 256\n", 351 | "\n", 352 | "value_net = ValueNetwork(state_dim, action_dim, hidden_dim).to(device)\n", 353 | "policy_net = PolicyNetwork(state_dim, action_dim, hidden_dim).to(device)\n", 354 | "\n", 355 | "target_value_net = ValueNetwork(state_dim, action_dim, hidden_dim).to(device)\n", 356 | "target_policy_net = PolicyNetwork(state_dim, action_dim, hidden_dim).to(device)\n", 357 | "\n", 358 | "for target_param, param in zip(target_value_net.parameters(), value_net.parameters()):\n", 359 | " target_param.data.copy_(param.data)\n", 360 | "\n", 361 | "for target_param, param in zip(target_policy_net.parameters(), policy_net.parameters()):\n", 362 | " target_param.data.copy_(param.data)\n", 363 | " \n", 364 | " \n", 365 | "value_lr = 1e-3\n", 366 | "policy_lr = 1e-4\n", 367 | "\n", 368 | "value_optimizer = optim.Adam(value_net.parameters(), lr=value_lr)\n", 369 | "policy_optimizer = optim.Adam(policy_net.parameters(), lr=policy_lr)\n", 370 | "\n", 371 | "value_criterion = nn.MSELoss()\n", 372 | "\n", 373 | "replay_buffer_size = 10000\n", 374 | "replay_buffer = ReplayBuffer(replay_buffer_size)" 375 | ] 376 | }, 377 | { 378 | "cell_type": "code", 379 | "execution_count": null, 380 | "metadata": { 381 | "scrolled": false 382 | }, 383 | "outputs": [], 384 | "source": [ 385 | "max_steps = 500\n", 386 | "\n", 387 | "valid_mean_rewards = []\n", 388 | "for i in range(100): \n", 389 | " session_rewards_train = [\n", 390 | " generate_session(t_max=max_steps, train=True) \n", 391 | " for _ in range(10)\n", 392 | " ]\n", 393 | " session_rewards_valid = [\n", 394 | " generate_session(t_max=max_steps, train=False) \n", 395 | " for _ in range(10)\n", 396 | " ]\n", 397 | " print(\n", 398 | " \"epoch #{:02d}\\tmean reward (train) = {:.3f}\\tmean reward (valid) = {:.3f}\".format(\n", 399 | " i, np.mean(session_rewards_train), np.mean(session_rewards_valid))\n", 400 | " )\n", 401 | "\n", 402 | " valid_mean_rewards.append(np.mean(session_rewards_valid))\n", 403 | " if len(valid_mean_rewards) > 5 and np.mean(valid_mean_rewards[-5:]) > -200:\n", 404 | " print(\"You Win!\")\n", 405 | " break" 406 | ] 407 | }, 408 | { 409 | "cell_type": "markdown", 410 | "metadata": {}, 411 | "source": [ 412 | "---" 413 | ] 414 | 
}, 415 | { 416 | "cell_type": "code", 417 | "execution_count": null, 418 | "metadata": {}, 419 | "outputs": [], 420 | "source": [ 421 | "# record sessions\n", 422 | "import gym.wrappers\n", 423 | "env = gym.wrappers.Monitor(\n", 424 | " NormalizedActions(gym.make(\"Pendulum-v0\")),\n", 425 | " directory=\"videos_ddpg\", \n", 426 | " force=True)\n", 427 | "sessions = [generate_session(t_max=max_steps, train=False) for _ in range(10)]\n", 428 | "env.close()" 429 | ] 430 | }, 431 | { 432 | "cell_type": "code", 433 | "execution_count": null, 434 | "metadata": {}, 435 | "outputs": [], 436 | "source": [ 437 | "# show video\n", 438 | "from IPython.display import HTML\n", 439 | "import os\n", 440 | "\n", 441 | "video_names = list(\n", 442 | " filter(lambda s: s.endswith(\".mp4\"), os.listdir(\"./videos_ddpg/\")))\n", 443 | "\n", 444 | "HTML(\"\"\"\n", 445 | "\n", 448 | "\"\"\".format(\"./videos/\"+video_names[-1])) # this may or may not be _last_ video. Try other indices" 449 | ] 450 | } 451 | ], 452 | "metadata": { 453 | "kernelspec": { 454 | "display_name": "Python 3", 455 | "language": "python", 456 | "name": "python3" 457 | }, 458 | "language_info": { 459 | "codemirror_mode": { 460 | "name": "ipython", 461 | "version": 3 462 | }, 463 | "file_extension": ".py", 464 | "mimetype": "text/x-python", 465 | "name": "python", 466 | "nbconvert_exporter": "python", 467 | "pygments_lexer": "ipython3", 468 | "version": "3.7.3" 469 | }, 470 | "pycharm": { 471 | "stem_cell": { 472 | "cell_type": "raw", 473 | "source": [], 474 | "metadata": { 475 | "collapsed": false 476 | } 477 | } 478 | } 479 | }, 480 | "nbformat": 4, 481 | "nbformat_minor": 2 482 | } 483 | -------------------------------------------------------------------------------- /2020/code/DQN.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Deep Q-Learning\n", 8 | "\n", 9 | "In this notebook you will teach a __pytorch__ neural network to do Q-learning." 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": null, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "# !pip install -r ./requirement" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": null, 24 | "metadata": {}, 25 | "outputs": [], 26 | "source": [ 27 | "from collections import deque\n", 28 | "import random\n", 29 | "import numpy as np\n", 30 | "import gym\n", 31 | "\n", 32 | "import torch\n", 33 | "import torch.nn as nn\n", 34 | "import torch.nn.functional as F" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": null, 40 | "metadata": {}, 41 | "outputs": [], 42 | "source": [ 43 | "import matplotlib.pyplot as plt\n", 44 | "%matplotlib inline" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": null, 50 | "metadata": {}, 51 | "outputs": [], 52 | "source": [ 53 | "use_cuda = torch.cuda.is_available()\n", 54 | "device = torch.device(\"cuda\" if use_cuda else \"cpu\")" 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": {}, 60 | "source": [ 61 | "### Let's play some old videogames\n", 62 | "![img](https://github.com/yandexdataschool/Practical_RL/raw/master/yet_another_week/_resource/nerd.png)\n", 63 | "\n", 64 | "This time we're gonna apply approximate q-learning to an OpenAI game called CartPole. 
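The goal is to keep a pole balanced upright by pushing the cart left or right.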
It's not the hardest thing out there, but it's definitely way more complex than anything we tried before.\n" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": {}, 70 | "source": [ 71 | "## Environment" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": null, 77 | "metadata": {}, 78 | "outputs": [], 79 | "source": [ 80 | "env = gym.make(\"CartPole-v0\").env\n", 81 | "env.reset()\n", 82 | "n_actions = env.action_space.n\n", 83 | "state_dim = env.observation_space.shape\n", 84 | "\n", 85 | "# plt.imshow(env.render(\"rgb_array\"))\n", 86 | "# env.close()" 87 | ] 88 | }, 89 | { 90 | "cell_type": "markdown", 91 | "metadata": {}, 92 | "source": [ 93 | "# Approximate Q-learning: building the network\n", 94 | "\n", 95 | "To train a neural network policy one must have a neural network policy. Let's build it.\n", 96 | "\n", 97 | "\n", 98 | "Since we're working with a pre-extracted features (cart positions, angles and velocities), we don't need a complicated network yet. In fact, let's build something like this for starters:\n", 99 | "\n", 100 | "![img](https://raw.githubusercontent.com/yandexdataschool/Practical_RL/master/yet_another_week/_resource/qlearning_scheme.png)\n", 101 | "\n", 102 | "For your first run, please only use linear layers (nn.Linear) and activations. Stuff like batch normalization or dropout may ruin everything if used haphazardly. \n", 103 | "\n", 104 | "Also please avoid using nonlinearities like sigmoid & tanh: agent's observations are not normalized so sigmoids may become saturated from init.\n", 105 | "\n", 106 | "Ideally you should start small with maybe 1-2 hidden layers with < 200 neurons and then increase network size if agent doesn't beat the target score." 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": null, 112 | "metadata": {}, 113 | "outputs": [], 114 | "source": [ 115 | "network = nn.Sequential(\n", 116 | " nn.Linear(env.observation_space.shape[0], 128),\n", 117 | " nn.ReLU(),\n", 118 | " nn.Linear(128, 128),\n", 119 | " nn.ReLU(),\n", 120 | " nn.Linear(128, env.action_space.n)\n", 121 | ").to(device)\n", 122 | "\n", 123 | "# network.add_module('layer1', < ... 
>)\n", 124 | "\n", 125 | "# \n", 126 | "\n", 127 | "# hint: use state_dim[0] as input size" 128 | ] 129 | }, 130 | { 131 | "cell_type": "code", 132 | "execution_count": null, 133 | "metadata": {}, 134 | "outputs": [], 135 | "source": [ 136 | "def get_action(state, epsilon=0):\n", 137 | " \"\"\"\n", 138 | " sample actions with epsilon-greedy policy\n", 139 | " recap: with p = epsilon pick random action, else pick action with highest Q(s,a)\n", 140 | " \"\"\"\n", 141 | " state = torch.tensor(state[None], dtype=torch.float32)\n", 142 | " q_values = network(state).detach().numpy()[0]\n", 143 | "\n", 144 | " # YOUR CODE\n", 145 | " if np.random.random() < epsilon:\n", 146 | " action = np.random.randint(len(q_values))\n", 147 | " else:\n", 148 | " action = np.argmax(q_values)\n", 149 | "\n", 150 | " return int(action) # int( < epsilon-greedily selected action > )" 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": null, 156 | "metadata": {}, 157 | "outputs": [], 158 | "source": [ 159 | "s = env.reset()\n", 160 | "assert tuple(network(torch.tensor([s]*3, dtype=torch.float32)).size()) == (\n", 161 | " 3, n_actions), \"please make sure your model maps state s -> [Q(s,a0), ..., Q(s, a_last)]\"\n", 162 | "assert isinstance(list(network.modules(\n", 163 | "))[-1], nn.Linear), \"please make sure you predict q-values without nonlinearity (ignore if you know what you're doing)\"\n", 164 | "assert isinstance(get_action(\n", 165 | " s), int), \"get_action(s) must return int, not %s. try int(action)\" % (type(get_action(s)))\n", 166 | "\n", 167 | "# test epsilon-greedy exploration\n", 168 | "for eps in [0., 0.1, 0.5, 1.0]:\n", 169 | " state_frequencies = np.bincount(\n", 170 | " [get_action(s, epsilon=eps) for i in range(10000)], minlength=n_actions)\n", 171 | " best_action = state_frequencies.argmax()\n", 172 | " assert abs(state_frequencies[best_action] -\n", 173 | " 10000 * (1 - eps + eps / n_actions)) < 200\n", 174 | " for other_action in range(n_actions):\n", 175 | " if other_action != best_action:\n", 176 | " assert abs(state_frequencies[other_action] -\n", 177 | " 10000 * (eps / n_actions)) < 200\n", 178 | " print('e=%.1f tests passed' % eps)" 179 | ] 180 | }, 181 | { 182 | "cell_type": "markdown", 183 | "metadata": {}, 184 | "source": [ 185 | "### Q-learning via gradient descent\n", 186 | "\n", 187 | "We shall now train our agent's Q-function by minimizing the TD loss:\n", 188 | "$$ L = { 1 \\over N} \\sum_i (Q_{\\theta}(s,a) - [r(s,a) + \\gamma \\cdot max_{a'} Q_{-}(s', a')]) ^2 $$\n", 189 | "\n", 190 | "\n", 191 | "Where\n", 192 | "* $s, a, r, s'$ are current state, action, reward and next state respectively\n", 193 | "* $\\gamma$ is a discount factor defined two cells above.\n", 194 | "\n", 195 | "The tricky part is with $Q_{-}(s',a')$. From an engineering standpoint, it's the same as $Q_{\\theta}$ - the output of your neural network policy. However, when doing gradient descent, __we won't propagate gradients through it__ to make training more stable (see lectures).\n", 196 | "\n", 197 | "To do so, we shall use `x.detach()` function which basically says \"consider this thing constant when doingbackprop\"." 198 | ] 199 | }, 200 | { 201 | "cell_type": "code", 202 | "execution_count": null, 203 | "metadata": {}, 204 | "outputs": [], 205 | "source": [ 206 | "def to_one_hot(y_tensor, n_dims=None):\n", 207 | " \"\"\" helper: take an integer vector and convert it to 1-hot matrix. 
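e.g. [1, 0, 2] with n_dims=3 -> [[0, 1, 0], [1, 0, 0], [0, 0, 1]].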
\"\"\"\n", 208 | " y_tensor = y_tensor.type(torch.LongTensor).view(-1, 1)\n", 209 | " n_dims = n_dims if n_dims is not None else int(torch.max(y_tensor)) + 1\n", 210 | " y_one_hot = torch.zeros(\n", 211 | " y_tensor.size()[0], n_dims).scatter_(1, y_tensor, 1)\n", 212 | " return y_one_hot\n", 213 | "\n", 214 | "\n", 215 | "def where(cond, x_1, x_2):\n", 216 | " \"\"\" helper: like np.where but in pytorch. \"\"\"\n", 217 | " return (cond * x_1) + ((1-cond) * x_2)" 218 | ] 219 | }, 220 | { 221 | "cell_type": "code", 222 | "execution_count": null, 223 | "metadata": {}, 224 | "outputs": [], 225 | "source": [ 226 | "def compute_td_loss(\n", 227 | " states, \n", 228 | " actions, \n", 229 | " rewards, \n", 230 | " next_states, \n", 231 | " is_done, \n", 232 | " gamma=0.99, \n", 233 | " check_shapes=False\n", 234 | "):\n", 235 | " \"\"\" Compute td loss using torch operations only. Use the formula above. \"\"\"\n", 236 | " # shape: [batch_size, state_size]\n", 237 | " states = torch.tensor(states, dtype=torch.float32).to(device)\n", 238 | " # shape: [batch_size]\n", 239 | " actions = torch.tensor(actions, dtype=torch.int32).to(device)\n", 240 | " # shape: [batch_size]\n", 241 | " rewards = torch.tensor(rewards, dtype=torch.float32).to(device)\n", 242 | " # shape: [batch_size, state_size]\n", 243 | " next_states = torch.tensor(next_states, dtype=torch.float32).to(device)\n", 244 | " # shape: [batch_size]\n", 245 | " is_done = torch.tensor(is_done, dtype=torch.float32).to(device)\n", 246 | "\n", 247 | " # get q-values for all actions in current states\n", 248 | " predicted_qvalues = network(states)\n", 249 | "\n", 250 | " # select q-values for chosen actions\n", 251 | " predicted_qvalues_for_actions = torch.sum(\n", 252 | " predicted_qvalues * to_one_hot(actions, n_actions), \n", 253 | " dim=1\n", 254 | " )\n", 255 | "\n", 256 | " # compute q-values for all actions in next states\n", 257 | " predicted_next_qvalues = network(next_states) # YOUR CODE\n", 258 | "\n", 259 | " # compute V*(next_states) using predicted next q-values\n", 260 | " next_state_values = predicted_next_qvalues.max(1)[0] # YOUR CODE\n", 261 | " assert next_state_values.dtype == torch.float32\n", 262 | "\n", 263 | " # compute \"target q-values\" for loss - it's what's inside square parentheses in the above formula.\n", 264 | " target_qvalues_for_actions = rewards + gamma * next_state_values # YOUR CODE\n", 265 | "\n", 266 | " # at the last state we shall use simplified formula: Q(s,a) = r(s,a) since s' doesn't exist\n", 267 | " target_qvalues_for_actions = where(\n", 268 | " is_done, rewards, target_qvalues_for_actions)\n", 269 | "\n", 270 | " # mean squared error loss to minimize\n", 271 | " loss = torch.mean(\n", 272 | " (predicted_qvalues_for_actions - target_qvalues_for_actions.detach()) ** 2)\n", 273 | "\n", 274 | " if check_shapes:\n", 275 | " assert predicted_next_qvalues.data.dim(\n", 276 | " ) == 2, \"make sure you predicted q-values for all actions in next state\"\n", 277 | " assert next_state_values.data.dim(\n", 278 | " ) == 1, \"make sure you computed V(s') as maximum over just the actions axis and not all axes\"\n", 279 | " assert target_qvalues_for_actions.data.dim(\n", 280 | " ) == 1, \"there's something wrong with target q-values, they must be a vector\"\n", 281 | "\n", 282 | " return loss" 283 | ] 284 | }, 285 | { 286 | "cell_type": "code", 287 | "execution_count": null, 288 | "metadata": {}, 289 | "outputs": [], 290 | "source": [ 291 | "# sanity checks\n", 292 | "s = env.reset()\n", 293 | "a = 
env.action_space.sample()\n", 294 | "next_s, r, done, _ = env.step(a)\n", 295 | "loss = compute_td_loss([s], [a], [r], [next_s], [done], check_shapes=True)\n", 296 | "loss.backward()\n", 297 | "\n", 298 | "assert len(loss.size()) == 0, \"you must return scalar loss - mean over batch\"\n", 299 | "assert np.any(next(network.parameters()).grad.detach().numpy() !=\n", 300 | " 0), \"loss must be differentiable w.r.t. network weights\"" 301 | ] 302 | }, 303 | { 304 | "cell_type": "markdown", 305 | "metadata": {}, 306 | "source": [ 307 | "### Experience replay buffer\n", 308 | "\n", 309 | "![img](https://github.com/yandexdataschool/Practical_RL/raw/master/yet_another_week/_resource/exp_replay.png)" 310 | ] 311 | }, 312 | { 313 | "cell_type": "code", 314 | "execution_count": null, 315 | "metadata": {}, 316 | "outputs": [], 317 | "source": [ 318 | "class ReplayBuffer(object):\n", 319 | " def __init__(self, capacity):\n", 320 | " self.buffer = deque(maxlen=capacity)\n", 321 | " \n", 322 | " def push(self, state, action, reward, next_state, done):\n", 323 | " state = np.expand_dims(state, 0)\n", 324 | " next_state = np.expand_dims(next_state, 0)\n", 325 | " \n", 326 | " self.buffer.append((state, action, reward, next_state, done))\n", 327 | " \n", 328 | " def sample(self, batch_size):\n", 329 | " state, action, reward, next_state, done = zip(*random.sample(self.buffer, batch_size))\n", 330 | " return np.concatenate(state), action, reward, np.concatenate(next_state), done\n", 331 | " \n", 332 | " def __len__(self):\n", 333 | " return len(self.buffer)" 334 | ] 335 | }, 336 | { 337 | "cell_type": "markdown", 338 | "metadata": {}, 339 | "source": [ 340 | "---" 341 | ] 342 | }, 343 | { 344 | "cell_type": "code", 345 | "execution_count": null, 346 | "metadata": {}, 347 | "outputs": [], 348 | "source": [ 349 | "batch_size = 32\n", 350 | "\n", 351 | "def generate_session(t_max=1000, epsilon=0, train=False):\n", 352 | " \"\"\"play env with approximate q-learning agent and train it at the same time\"\"\"\n", 353 | " total_reward = 0\n", 354 | " s = env.reset()\n", 355 | "\n", 356 | " for t in range(t_max):\n", 357 | " a = get_action(s, epsilon=epsilon if train else -1)\n", 358 | " next_s, r, done, _ = env.step(a)\n", 359 | "\n", 360 | " if train:\n", 361 | " opt.zero_grad()\n", 362 | " replay_buffer.push(s, a, r, next_s, done)\n", 363 | " if len(replay_buffer) > batch_size:\n", 364 | " s_, a_, r_, next_s_, done_ = replay_buffer.sample(batch_size)\n", 365 | " compute_td_loss(s_, a_, r_, next_s_, done_).backward()\n", 366 | "\n", 367 | " opt.step()\n", 368 | "\n", 369 | " total_reward += r\n", 370 | " s = next_s\n", 371 | " if done:\n", 372 | " break\n", 373 | "\n", 374 | " return total_reward" 375 | ] 376 | }, 377 | { 378 | "cell_type": "code", 379 | "execution_count": null, 380 | "metadata": {}, 381 | "outputs": [], 382 | "source": [ 383 | "replay_buffer = ReplayBuffer(1000)\n", 384 | "opt = torch.optim.Adam(network.parameters(), lr=1e-4)\n", 385 | "epsilon = 0.5" 386 | ] 387 | }, 388 | { 389 | "cell_type": "code", 390 | "execution_count": null, 391 | "metadata": {}, 392 | "outputs": [], 393 | "source": [ 394 | "valid_mean_rewards = []\n", 395 | "for i in range(100):\n", 396 | " session_rewards_train = [\n", 397 | " generate_session(epsilon=epsilon, train=True) \n", 398 | " for _ in range(100)\n", 399 | " ]\n", 400 | " session_rewards_valid = [\n", 401 | " generate_session(epsilon=epsilon, train=False) \n", 402 | " for _ in range(100)\n", 403 | " ]\n", 404 | " print(\n", 405 | " \"epoch #{:02d}\\tmean reward 
(train) = {:.3f}\\tepsilon = {:.3f}\\tmean reward (valid) = {:.3f}\".format(\n", 406 | " i, np.mean(session_rewards_train), epsilon, np.mean(session_rewards_valid))\n", 407 | " )\n", 408 | "\n", 409 | " epsilon *= 0.95 # 0.99\n", 410 | " assert epsilon >= 1e-4, \"Make sure epsilon is always nonzero during training\"\n", 411 | "\n", 412 | " valid_mean_rewards.append(np.mean(session_rewards_valid))\n", 413 | " if len(valid_mean_rewards) > 5 and np.mean(valid_mean_rewards[-5:]) > 300:\n", 414 | " print(\"You Win!\")\n", 415 | " break" 416 | ] 417 | }, 418 | { 419 | "cell_type": "markdown", 420 | "metadata": {}, 421 | "source": [ 422 | "### How to interpret results\n", 423 | "\n", 424 | "\n", 425 | "Welcome to the f.. world of deep f...n reinforcement learning. Don't expect agent's reward to smoothly go up. Hope for it to go increase eventually. If it deems you worthy.\n", 426 | "\n", 427 | "Seriously though,\n", 428 | "* __mean reward__ is the average reward per game. For a correct implementation it may stay low for some 10 epochs, then start growing while oscilating insanely and converges by ~50-100 steps depending on the network architecture. \n", 429 | "* If it never reaches target score by the end of for loop, try increasing the number of hidden neurons or look at the epsilon.\n", 430 | "* __epsilon__ - agent's willingness to explore. If you see that agent's already at < 0.01 epsilon before it's is at least 200, just reset it back to 0.1 - 0.5." 431 | ] 432 | }, 433 | { 434 | "cell_type": "markdown", 435 | "metadata": {}, 436 | "source": [ 437 | "---" 438 | ] 439 | }, 440 | { 441 | "cell_type": "code", 442 | "execution_count": null, 443 | "metadata": {}, 444 | "outputs": [], 445 | "source": [ 446 | "# record sessions\n", 447 | "import gym.wrappers\n", 448 | "env = gym.wrappers.Monitor(\n", 449 | " gym.make(\"CartPole-v0\"),\n", 450 | " directory=\"videos_dqn\", \n", 451 | " force=True)\n", 452 | "sessions = [generate_session(epsilon=0, train=False) for _ in range(100)]\n", 453 | "env.close()" 454 | ] 455 | }, 456 | { 457 | "cell_type": "code", 458 | "execution_count": null, 459 | "metadata": {}, 460 | "outputs": [], 461 | "source": [ 462 | "# show video\n", 463 | "from IPython.display import HTML\n", 464 | "import os\n", 465 | "\n", 466 | "video_names = list(\n", 467 | " filter(lambda s: s.endswith(\".mp4\"), os.listdir(\"./videos_dqn/\")))\n", 468 | "\n", 469 | "HTML(\"\"\"\n", 470 | "\n", 473 | "\"\"\".format(\"./videos/\"+video_names[-1])) # this may or may not be _last_ video. 
Try other indices" 474 | ] 475 | }, 476 | { 477 | "cell_type": "code", 478 | "execution_count": null, 479 | "metadata": {}, 480 | "outputs": [], 481 | "source": [] 482 | } 483 | ], 484 | "metadata": { 485 | "kernelspec": { 486 | "display_name": "Python 3", 487 | "language": "python", 488 | "name": "python3" 489 | }, 490 | "language_info": { 491 | "codemirror_mode": { 492 | "name": "ipython", 493 | "version": 3 494 | }, 495 | "file_extension": ".py", 496 | "mimetype": "text/x-python", 497 | "name": "python", 498 | "nbconvert_exporter": "python", 499 | "pygments_lexer": "ipython3", 500 | "version": "3.7.3" 501 | }, 502 | "pycharm": { 503 | "stem_cell": { 504 | "cell_type": "raw", 505 | "source": [], 506 | "metadata": { 507 | "collapsed": false 508 | } 509 | } 510 | } 511 | }, 512 | "nbformat": 4, 513 | "nbformat_minor": 2 514 | } 515 | -------------------------------------------------------------------------------- /2020/code/RecSimDemo.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import numpy as np\n", 10 | "import pandas as pd\n", 11 | "import matplotlib.pyplot as plt\n", 12 | "import seaborn as sns\n", 13 | "\n", 14 | "# plt.rcParams['axes.grid'] = True\n", 15 | "\n", 16 | "%matplotlib inline" 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": null, 22 | "metadata": {}, 23 | "outputs": [], 24 | "source": [ 25 | "from gym import spaces\n", 26 | "from recsim import document, user\n", 27 | "from recsim.choice_model import AbstractChoiceModel\n", 28 | "from recsim.simulator import recsim_gym, environment" 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": null, 34 | "metadata": {}, 35 | "outputs": [], 36 | "source": [ 37 | "from recsim_exp import GaussNoise, WolpertingerRecommender" 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "# RecSim environment\n", 45 | "\n", 46 | "In this tutorial we will break a RecSim environment down into its basic components. \n", 47 | "![Detailed view of RecSim](https://github.com/google-research/recsim/blob/master/recsim/colab/figures/simulator.png?raw=true)\n", 48 | "\n", 49 | "The green and blue blocks in the above diagram constitute the classes that need to be implemented within a RecSim environment. The goal of this tutorial is to explain the purpose of these blocks and how they come together in a simulation. In the process, we will go over an example end-to-end implementation.\n", 50 | "\n" 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "metadata": {}, 56 | "source": [ 57 | "# Overview\n", 58 | "\n", 59 | "A single step of a RecSim simulation can be summarized roughly as follows:\n", 60 | "\n", 61 | "\n", 62 | "1. the document database provides a corpus of *D* documents to the recommender. This could be a different set at each step (e.g., sampled, or produced by some \"candidate generation\" process), or fixed throughout the simulation. Each document is represented by a list of features. In a fully observable situation, the recommender observes all features of each document that impact the user's state and choice of document (and other aspects of the user's response), but this need not be the case in general. (In fact, most interesting scenarios involve latent features.)\n", 63 | "2. 
The recommender observes the *D* documents (and their features) together with the user's response to the last recommendation. It then makes a selection (possibly ordered) of *k* documents and presents them to the user. The ordering may or may not impact the user choice or user state, depending on our simulation goals.\n", 64 | "3. The user examines the list of documents and makes a choice of one document. Note that not consuming any of the documents is also a valid choice. This leads to a transition in the user's state. Finally the user emits an observation, which the recommender observes at the next iteration. The observation generally includes (noisy) information about the user's reaction to the content and potentially clues about the user's latent state. Typically, the user's state is not fully revealed. \n", 65 | "\n", 66 | "If we examine at the diagram above carefully, we notice that the flow of information along arcs is acyclic---a RecSim environment is a dynamic Bayesian network (DBN), where the various boxes represent conditional probability distributions. We will now define a simple simulation problem and implement it. " 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": null, 72 | "metadata": {}, 73 | "outputs": [], 74 | "source": [ 75 | "SEED = 42\n", 76 | "DOC_NUM = 10\n", 77 | "P_EXIT_ACCEPTED = 0.1\n", 78 | "P_EXIT_NOT_ACCEPTED = 0.2\n", 79 | "\n", 80 | "# let's define a matrix W for simulation of users' respose\n", 81 | "# (based on the section 7.3 of the paper https://arxiv.org/pdf/1512.07679.pdf)\n", 82 | "# W_ij defines the probability that a user will accept recommendation j\n", 83 | "# given that he is consuming item i at the moment\n", 84 | "\n", 85 | "np.random.seed(SEED)\n", 86 | "W = (np.ones((DOC_NUM, DOC_NUM)) - np.eye(DOC_NUM)) * \\\n", 87 | " np.random.uniform(0.0, P_EXIT_NOT_ACCEPTED, (DOC_NUM, DOC_NUM)) + \\\n", 88 | " np.diag(np.random.uniform(1.0 - P_EXIT_ACCEPTED, 1.0, DOC_NUM))\n", 89 | "W = W[:, np.random.permutation(DOC_NUM)]" 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "metadata": {}, 95 | "source": [ 96 | "### Document" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": null, 102 | "metadata": {}, 103 | "outputs": [], 104 | "source": [ 105 | "class Document(document.AbstractDocument):\n", 106 | "\n", 107 | " def __init__(self, doc_id):\n", 108 | " super(Document, self).__init__(doc_id)\n", 109 | "\n", 110 | " def create_observation(self):\n", 111 | " return (self._doc_id,)\n", 112 | "\n", 113 | " @staticmethod\n", 114 | " def observation_space():\n", 115 | " return spaces.Discrete(DOC_NUM)\n", 116 | "\n", 117 | " def __str__(self):\n", 118 | " return \"Document #{}\".format(self._doc_id)\n", 119 | "\n", 120 | "\n", 121 | "class DocumentSampler(document.AbstractDocumentSampler):\n", 122 | "\n", 123 | " def __init__(self, doc_ctor=Document):\n", 124 | " super(DocumentSampler, self).__init__(doc_ctor)\n", 125 | " self._doc_count = 0\n", 126 | "\n", 127 | " def sample_document(self):\n", 128 | " doc = self._doc_ctor(self._doc_count % DOC_NUM)\n", 129 | " self._doc_count += 1\n", 130 | " return doc" 131 | ] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "metadata": {}, 136 | "source": [ 137 | "### User" 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": null, 143 | "metadata": {}, 144 | "outputs": [], 145 | "source": [ 146 | "class UserState(user.AbstractUserState):\n", 147 | "\n", 148 | " def __init__(self, user_id, current, active_session=True):\n", 149 | " 
self.user_id = user_id\n", 150 | " self.current = current\n", 151 | " self.active_session = active_session\n", 152 | "\n", 153 | " def create_observation(self):\n", 154 | " return (self.current,)\n", 155 | "\n", 156 | " def __str__(self):\n", 157 | " return \"User #{}\".format(self.user_id)\n", 158 | "\n", 159 | " @staticmethod\n", 160 | " def observation_space():\n", 161 | " return spaces.Discrete(DOC_NUM)\n", 162 | "\n", 163 | " def score_document(self, doc_obs):\n", 164 | " return W[self.current, doc_obs[0]]\n", 165 | "\n", 166 | "\n", 167 | "class StaticUserSampler(user.AbstractUserSampler):\n", 168 | "\n", 169 | " def __init__(self, user_ctor=UserState):\n", 170 | " super(StaticUserSampler, self).__init__(user_ctor)\n", 171 | " self.user_count = 0\n", 172 | "\n", 173 | " def sample_user(self):\n", 174 | " self.user_count += 1\n", 175 | " sampled_user = self._user_ctor(\n", 176 | " self.user_count, np.random.randint(DOC_NUM))\n", 177 | " return sampled_user\n", 178 | "\n", 179 | "\n", 180 | "class Response(user.AbstractResponse):\n", 181 | "\n", 182 | " def __init__(self, accept=False):\n", 183 | " self.accept = accept\n", 184 | "\n", 185 | " def create_observation(self):\n", 186 | " return (int(self.accept),)\n", 187 | "\n", 188 | " @classmethod\n", 189 | " def response_space(cls):\n", 190 | " return spaces.Discrete(2)\n", 191 | "\n", 192 | "\n", 193 | "class UserChoiceModel(AbstractChoiceModel):\n", 194 | " def __init__(self):\n", 195 | " super(UserChoiceModel, self).__init__()\n", 196 | " self._score_no_click = P_EXIT_ACCEPTED\n", 197 | "\n", 198 | " def score_documents(self, user_state, doc_obs):\n", 199 | " if len(doc_obs) != 1:\n", 200 | " raise ValueError(\n", 201 | " \"Expecting single document, but got: {}\".format(doc_obs))\n", 202 | " self._scores = np.array(\n", 203 | " [user_state.score_document(doc) for doc in doc_obs])\n", 204 | "\n", 205 | " def choose_item(self):\n", 206 | " if np.random.random() < self.scores[0]:\n", 207 | " return 0\n", 208 | "\n", 209 | "\n", 210 | "class UserModel(user.AbstractUserModel):\n", 211 | " def __init__(self):\n", 212 | " super(UserModel, self).__init__(Response, StaticUserSampler(), 1)\n", 213 | " self.choice_model = UserChoiceModel()\n", 214 | "\n", 215 | " def simulate_response(self, slate_documents):\n", 216 | " if len(slate_documents) != 1:\n", 217 | " raise ValueError(\"Expecting single document, but got: {}\".format(\n", 218 | " slate_documents))\n", 219 | "\n", 220 | " responses = [self._response_model_ctor() for _ in slate_documents]\n", 221 | "\n", 222 | " self.choice_model.score_documents(\n", 223 | " self._user_state,\n", 224 | " [doc.create_observation() for doc in slate_documents]\n", 225 | " )\n", 226 | " selected_index = self.choice_model.choose_item()\n", 227 | "\n", 228 | " if selected_index is not None:\n", 229 | " responses[selected_index].accept = True\n", 230 | "\n", 231 | " return responses\n", 232 | "\n", 233 | " def update_state(self, slate_documents, responses):\n", 234 | " if len(slate_documents) != 1:\n", 235 | " raise ValueError(\n", 236 | " f\"Expecting single document, but got: {slate_documents}\"\n", 237 | " )\n", 238 | "\n", 239 | " response = responses[0]\n", 240 | " doc = slate_documents[0]\n", 241 | " if response.accept:\n", 242 | " self._user_state.current = doc.doc_id()\n", 243 | " self._user_state.active_session = bool(\n", 244 | " np.random.binomial(1, 1 - P_EXIT_ACCEPTED))\n", 245 | " else:\n", 246 | " self._user_state.current = np.random.choice(DOC_NUM)\n", 247 | " self._user_state.active_session 
= bool(\n", 248 | " np.random.binomial(1, 1 - P_EXIT_NOT_ACCEPTED))\n", 249 | "\n", 250 | " def is_terminal(self):\n", 251 | " \"\"\"Returns a boolean indicating if the session is over.\"\"\"\n", 252 | " return not self._user_state.active_session\n", 253 | "\n", 254 | "\n", 255 | "def clicked_reward(responses):\n", 256 | " reward = 0.0\n", 257 | " for response in responses:\n", 258 | " if response.accept:\n", 259 | " reward += 1\n", 260 | " return reward" 261 | ] 262 | }, 263 | { 264 | "cell_type": "markdown", 265 | "metadata": {}, 266 | "source": [ 267 | "### Environment" 268 | ] 269 | }, 270 | { 271 | "cell_type": "code", 272 | "execution_count": null, 273 | "metadata": {}, 274 | "outputs": [], 275 | "source": [ 276 | "def make_env():\n", 277 | " env = recsim_gym.RecSimGymEnv(\n", 278 | " environment.Environment(\n", 279 | " UserModel(), \n", 280 | " DocumentSampler(), \n", 281 | " DOC_NUM, \n", 282 | " 1, \n", 283 | " resample_documents=False\n", 284 | " ),\n", 285 | " clicked_reward\n", 286 | " )\n", 287 | " return env" 288 | ] 289 | }, 290 | { 291 | "cell_type": "markdown", 292 | "metadata": {}, 293 | "source": [ 294 | "# RecSim Agent" 295 | ] 296 | }, 297 | { 298 | "cell_type": "markdown", 299 | "metadata": {}, 300 | "source": [ 301 | "For solving of this toy environment we'll try using a variant of DDPG algorithm for discrete actions.\n", 302 | "We need to embed our discrete actions into continuous space to use DDPG (it outputs \"proto action\").\n", 303 | "Then we choose k nearest embedded actions and take the action with maximum Q value.\n", 304 | "Thus, we can avoid taking maximum over all the action space as in DQN, which can be too large in case of RecSys.\n", 305 | "In our example embeddings are just one hot vectors. Therefore the nearest neighbour is argmax of proto action.\n", 306 | "\n", 307 | "" 308 | ] 309 | }, 310 | { 311 | "cell_type": "code", 312 | "execution_count": null, 313 | "metadata": {}, 314 | "outputs": [], 315 | "source": [ 316 | "def run_agent(\n", 317 | " env, \n", 318 | " agent, \n", 319 | " num_steps: int = int(3e3), \n", 320 | " log_every: int = int(1e3)\n", 321 | "):\n", 322 | " reward_history = []\n", 323 | " step, episode = 1, 1\n", 324 | "\n", 325 | " observation = env.reset()\n", 326 | " while step < num_steps:\n", 327 | " action = agent.begin_episode(observation)\n", 328 | " episode_reward = 0\n", 329 | " while True:\n", 330 | " observation, reward, done, info = env.step(action)\n", 331 | " episode_reward += reward\n", 332 | "\n", 333 | " if step % log_every == 0:\n", 334 | " print(step, np.mean(reward_history[-50:]))\n", 335 | " step += 1\n", 336 | " if done:\n", 337 | " break\n", 338 | " else:\n", 339 | " action = agent.step(reward, observation)\n", 340 | "\n", 341 | " agent.end_episode(reward, observation)\n", 342 | " reward_history.append(episode_reward)\n", 343 | "\n", 344 | " return reward_history" 345 | ] 346 | }, 347 | { 348 | "cell_type": "code", 349 | "execution_count": null, 350 | "metadata": {}, 351 | "outputs": [], 352 | "source": [ 353 | "parameters = {\n", 354 | " \"action_dim\": DOC_NUM,\n", 355 | " \"state_dim\": DOC_NUM,\n", 356 | " \"noise\": GaussNoise(sigma=0.05),\n", 357 | " \"critic_lr\": 1e-3,\n", 358 | " \"actor_lr\": 1e-3,\n", 359 | " \"tau\": 1e-3,\n", 360 | " \"hidden_dim\": 256,\n", 361 | " \"batch_size\": 128,\n", 362 | " \"buffer_size\": int(1e4),\n", 363 | " \"gamma\": 0.8,\n", 364 | " \"actor_weight_decay\": 0.0001,\n", 365 | " \"critic_weight_decay\": 0.001,\n", 366 | " \"eps\": 1e-2\n", 367 | "}" 368 | ] 369 | }, 
370 | { 371 | "cell_type": "code", 372 | "execution_count": null, 373 | "metadata": {}, 374 | "outputs": [], 375 | "source": [ 376 | "env = make_env()\n", 377 | "agent = WolpertingerRecommender(\n", 378 | " env=env, \n", 379 | " k_ratio=0.33, \n", 380 | " **parameters\n", 381 | ")\n", 382 | "reward_history = run_agent(env, agent)\n", 383 | "plt.plot(pd.Series(reward_history).rolling(50).mean())" 384 | ] 385 | }, 386 | { 387 | "cell_type": "markdown", 388 | "metadata": {}, 389 | "source": [ 390 | "---" 391 | ] 392 | }, 393 | { 394 | "cell_type": "markdown", 395 | "metadata": {}, 396 | "source": [ 397 | "### Extra - 1" 398 | ] 399 | }, 400 | { 401 | "cell_type": "code", 402 | "execution_count": null, 403 | "metadata": {}, 404 | "outputs": [], 405 | "source": [ 406 | "predicted_qvalues = np.hstack([\n", 407 | " agent.agent.predict_qvalues(i) for i in range(DOC_NUM)\n", 408 | "]).T\n", 409 | "predicted_actions = np.vstack([\n", 410 | " agent.agent.predict_action(np.eye(DOC_NUM)[i], with_noise=False)\n", 411 | " for i in range(DOC_NUM)\n", 412 | "])" 413 | ] 414 | }, 415 | { 416 | "cell_type": "code", 417 | "execution_count": null, 418 | "metadata": {}, 419 | "outputs": [], 420 | "source": [ 421 | "# learned Qvalues \n", 422 | "plt.subplots(figsize=predicted_qvalues.shape)\n", 423 | "sns.heatmap(predicted_qvalues.round(3), annot=True);" 424 | ] 425 | }, 426 | { 427 | "cell_type": "code", 428 | "execution_count": null, 429 | "metadata": {}, 430 | "outputs": [], 431 | "source": [ 432 | "# learned actions (aka policy)\n", 433 | "plt.subplots(figsize=predicted_qvalues.shape)\n", 434 | "sns.heatmap(predicted_actions.round(3), annot=True);" 435 | ] 436 | }, 437 | { 438 | "cell_type": "code", 439 | "execution_count": null, 440 | "metadata": {}, 441 | "outputs": [], 442 | "source": [ 443 | "# true actions (aka policy)\n", 444 | "plt.subplots(figsize=predicted_qvalues.shape)\n", 445 | "sns.heatmap(W, annot=True);" 446 | ] 447 | }, 448 | { 449 | "cell_type": "markdown", 450 | "metadata": {}, 451 | "source": [ 452 | "### Extra - 2" 453 | ] 454 | }, 455 | { 456 | "cell_type": "code", 457 | "execution_count": null, 458 | "metadata": {}, 459 | "outputs": [], 460 | "source": [ 461 | "from recsim.agent import AbstractEpisodicRecommenderAgent\n", 462 | "\n", 463 | "class OptimalRecommender(AbstractEpisodicRecommenderAgent):\n", 464 | "\n", 465 | " def __init__(self, environment, W):\n", 466 | " super().__init__(environment.action_space)\n", 467 | " self._observation_space = environment.observation_space\n", 468 | " self._W = W\n", 469 | "\n", 470 | " def _extract_state(self, observation):\n", 471 | " user_space = self._observation_space.spaces[\"user\"]\n", 472 | " return spaces.flatten(user_space, observation[\"user\"])\n", 473 | "\n", 474 | " def step(self, reward, observation):\n", 475 | " state = self._extract_state(observation)\n", 476 | " return [self._W[state.argmax(), :].argmax()]" 477 | ] 478 | }, 479 | { 480 | "cell_type": "code", 481 | "execution_count": null, 482 | "metadata": {}, 483 | "outputs": [], 484 | "source": [ 485 | "env = make_env()\n", 486 | "agent = OptimalRecommender(env, W)\n", 487 | "\n", 488 | "reward_history = run_agent(env, agent)\n", 489 | "plt.plot(pd.Series(reward_history).rolling(50).mean())" 490 | ] 491 | }, 492 | { 493 | "cell_type": "markdown", 494 | "metadata": {}, 495 | "source": [ 496 | "---" 497 | ] 498 | }, 499 | { 500 | "cell_type": "code", 501 | "execution_count": null, 502 | "metadata": {}, 503 | "outputs": [], 504 | "source": [ 505 | "# from 
recsim.agents.tabular_q_agent import TabularQAgent\n", 506 | "\n", 507 | "# env = make_env()\n", 508 | "# q_agent = TabularQAgent(env.observation_space, env.action_space)\n", 509 | "\n", 510 | "# reward_history = run_agent(env, agent)\n", 511 | "# plt.plot(pd.Series(reward_history).rolling(50).mean())" 512 | ] 513 | }, 514 | { 515 | "cell_type": "code", 516 | "execution_count": null, 517 | "metadata": {}, 518 | "outputs": [], 519 | "source": [] 520 | } 521 | ], 522 | "metadata": { 523 | "kernelspec": { 524 | "display_name": "Python [conda env:py37] *", 525 | "language": "python", 526 | "name": "conda-env-py37-py" 527 | }, 528 | "language_info": { 529 | "codemirror_mode": { 530 | "name": "ipython", 531 | "version": 3 532 | }, 533 | "file_extension": ".py", 534 | "mimetype": "text/x-python", 535 | "name": "python", 536 | "nbconvert_exporter": "python", 537 | "pygments_lexer": "ipython3", 538 | "version": "3.7.6" 539 | } 540 | }, 541 | "nbformat": 4, 542 | "nbformat_minor": 2 543 | } 544 | -------------------------------------------------------------------------------- /2020/code/recsim_exp/__init__.py: -------------------------------------------------------------------------------- 1 | from .ddpg import * 2 | from .wolpertinger import * 3 | -------------------------------------------------------------------------------- /2020/code/recsim_exp/ddpg.py: -------------------------------------------------------------------------------- 1 | import random 2 | 3 | import numpy as np 4 | 5 | import torch 6 | import torch.nn as nn 7 | import torch.optim as optim 8 | import torch.nn.functional as F 9 | import copy 10 | 11 | 12 | DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu") 13 | 14 | 15 | def soft_update(target, source, tau): 16 | for target_param, param in zip(target.parameters(), source.parameters()): 17 | target_param.data.copy_( 18 | target_param.data * (1.0 - tau) + param.data * tau) 19 | 20 | 21 | class GaussNoise: 22 | def __init__(self, sigma): 23 | super().__init__() 24 | 25 | self.sigma = sigma 26 | 27 | def get_action(self, action): 28 | noisy_action = np.random.normal(action, self.sigma) 29 | return noisy_action 30 | 31 | 32 | class ReplayBuffer: 33 | def __init__(self, capacity): 34 | self.capacity = capacity 35 | self.buffer = [] 36 | self.position = 0 37 | 38 | def push(self, state, action, reward, next_state, done): 39 | if len(self.buffer) < self.capacity: 40 | self.buffer.append(None) 41 | self.buffer[self.position] = (state, action, reward, next_state, done) 42 | self.position = (self.position + 1) % self.capacity 43 | 44 | def sample(self, batch_size): 45 | batch = random.sample(self.buffer, batch_size) 46 | state, action, reward, next_state, done = map(np.stack, zip(*batch)) 47 | return state, action, reward, next_state, done 48 | 49 | def __len__(self): 50 | return len(self.buffer) 51 | 52 | 53 | class Actor(nn.Module): 54 | def __init__( 55 | self, 56 | num_inputs, 57 | num_actions, 58 | hidden_size, 59 | init_w=3e-3, 60 | ): 61 | super().__init__() 62 | self.net = nn.Sequential( 63 | nn.Linear(num_inputs, hidden_size), 64 | nn.ReLU(), 65 | nn.Linear(hidden_size, hidden_size), 66 | nn.ReLU(), 67 | ) 68 | self.head = nn.Linear(hidden_size, num_actions) 69 | nn.init.uniform_(self.head.weight, -init_w, init_w) 70 | nn.init.zeros_(self.head.bias) 71 | 72 | def forward(self, state): 73 | x = self.net(state) 74 | x = self.head(x) 75 | x = torch.sigmoid(x) 76 | return x 77 | 78 | def get_action(self, state): 79 | state = torch.tensor( 80 | state, 
dtype=torch.float32 81 | ).unsqueeze(0).to(DEVICE) 82 | action = self.forward(state) 83 | action = action.detach().cpu().numpy()[0] 84 | return action 85 | 86 | 87 | class Critic(nn.Module): 88 | def __init__( 89 | self, 90 | num_inputs, 91 | num_actions, 92 | hidden_size, 93 | init_w=3e-3, 94 | ): 95 | super().__init__() 96 | self.net = nn.Sequential( 97 | nn.Linear(num_inputs + num_actions, hidden_size), 98 | nn.ReLU(), 99 | nn.Linear(hidden_size, hidden_size), 100 | nn.ReLU(), 101 | ) 102 | self.head = nn.Linear(hidden_size, 1) 103 | nn.init.uniform_(self.head.weight, -init_w, init_w) 104 | nn.init.zeros_(self.head.bias) 105 | 106 | def forward(self, state, action): 107 | x = torch.cat([state, action], 1) 108 | x = self.net(x) 109 | x = self.head(x) 110 | return x 111 | 112 | def get_qvalue(self, state, action): 113 | state = torch.tensor(state, dtype=torch.float32).to(DEVICE) 114 | action = torch.tensor(action, dtype=torch.float32).to(DEVICE) 115 | q_value = self.forward(state, action) 116 | q_value = q_value.detach().cpu().numpy() 117 | return q_value 118 | 119 | 120 | class DDPG: 121 | def __init__( 122 | self, 123 | state_dim, 124 | action_dim, 125 | noise=None, 126 | hidden_dim=256, 127 | tau=1e-3, 128 | gamma=0.99, 129 | init_w_actor=3e-3, 130 | init_w_critic=3e-3, 131 | critic_lr=1e-3, 132 | actor_lr=1e-4, 133 | actor_weight_decay=0., 134 | critic_weight_decay=0., 135 | ): 136 | self.actor = Actor( 137 | state_dim, 138 | action_dim, 139 | hidden_dim, 140 | init_w=init_w_actor 141 | ).to(DEVICE) 142 | self.target_actor = copy.deepcopy(self.actor) 143 | self.actor_optimizer = optim.Adam( 144 | self.actor.parameters(), 145 | lr=actor_lr, 146 | weight_decay=actor_weight_decay 147 | ) 148 | 149 | self.critic = Critic( 150 | state_dim, 151 | action_dim, 152 | hidden_dim, 153 | init_w=init_w_critic 154 | ).to(DEVICE) 155 | self.target_critic = copy.deepcopy(self.critic) 156 | self.critic_optimizer = optim.Adam( 157 | self.critic.parameters(), 158 | lr=critic_lr, 159 | weight_decay=critic_weight_decay 160 | ) 161 | 162 | self.state_dim = state_dim 163 | self.action_dim = action_dim 164 | self.noise = noise 165 | 166 | self.tau = tau 167 | self.gamma = gamma 168 | 169 | def predict_action(self, state, with_noise=False): 170 | self.actor.eval() 171 | action = self.actor.get_action(state) 172 | if self.noise and with_noise: 173 | action = self.noise.get_action(action) 174 | self.actor.train() 175 | return action 176 | 177 | def update(self, state, action, reward, next_state, done): 178 | state = torch.tensor(state, dtype=torch.float32).to(DEVICE) 179 | next_state = torch.tensor(next_state, dtype=torch.float32).to(DEVICE) 180 | action = torch.tensor(action, dtype=torch.float32).to(DEVICE) 181 | reward = torch.tensor( 182 | reward, dtype=torch.float32 183 | ).unsqueeze(1).to(DEVICE) 184 | done = torch.tensor(np.float32(done)).unsqueeze(1).to(DEVICE) 185 | 186 | # actor loss 187 | actor_loss = -self.critic(state, self.actor(state)).mean() 188 | 189 | # critic loss 190 | predicted_value = self.critic(state, action) 191 | next_action = self.target_actor(next_state) 192 | target_value = self.target_critic(next_state, next_action.detach()) 193 | expected_value = reward + (1.0 - done) * self.gamma * target_value 194 | critic_loss = F.mse_loss(predicted_value, expected_value.detach()) 195 | 196 | # actor update 197 | self.actor_optimizer.zero_grad() 198 | actor_loss.backward() 199 | self.actor_optimizer.step() 200 | 201 | # critic update 202 | self.critic_optimizer.zero_grad() 203 | 
critic_loss.backward() 204 | self.critic_optimizer.step() 205 | 206 | soft_update(self.target_critic, self.critic, self.tau) 207 | soft_update(self.target_actor, self.actor, self.tau) 208 | -------------------------------------------------------------------------------- /2020/code/recsim_exp/wolpertinger.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from gym import spaces 3 | from recsim.agent import AbstractEpisodicRecommenderAgent 4 | 5 | from .ddpg import DDPG, ReplayBuffer 6 | 7 | 8 | class Wolpertinger(DDPG): 9 | def __init__(self, *, action_dim, k_ratio=0.1, **kwargs): 10 | super().__init__(action_dim=action_dim, **kwargs) 11 | self.k = max(1, int(action_dim * k_ratio)) 12 | 13 | def predict_action(self, state, with_noise=False): 14 | 15 | proto_action = super().predict_action(state, with_noise=with_noise) 16 | proto_action = proto_action.clip(0, 1) 17 | 18 | actions = np.eye(self.action_dim) 19 | # first sorting by action probability by `proto_action` 20 | # second by random :) 21 | actions_sorting = np.lexsort( 22 | (np.random.random(self.action_dim), proto_action) 23 | ) 24 | # take topK proposed actions 25 | actions = actions[actions_sorting[-self.k:]] 26 | # make all the state-action pairs for the critic 27 | states = np.tile(state, [len(actions), 1]) 28 | qvalues = self.critic.get_qvalue(states, actions) 29 | # find the index of the pair with the maximum value 30 | max_index = np.argmax(qvalues) 31 | action, qvalue = actions[max_index], qvalues[max_index] 32 | return action 33 | 34 | def predict_qvalues(self, state_num=0, action=None, dim=None): 35 | if dim is None: 36 | dim = self.action_dim 37 | if action is None: 38 | action = np.eye(dim, self.action_dim) 39 | s = np.zeros((dim, self.action_dim)) 40 | s[:, state_num] = 1 41 | qvalues = self.critic.get_qvalue(s, action) 42 | return qvalues 43 | 44 | 45 | class WolpertingerRecommender(AbstractEpisodicRecommenderAgent): 46 | 47 | def __init__( 48 | self, 49 | env, 50 | state_dim, 51 | action_dim, 52 | k_ratio=0.1, 53 | eps=1e-2, 54 | train: bool = True, 55 | batch_size: int = 256, 56 | buffer_size: int = 10000, 57 | training_starts: int = None, 58 | **kwargs, 59 | ): 60 | AbstractEpisodicRecommenderAgent.__init__(self, env.action_space) 61 | 62 | self._observation_space = env.observation_space 63 | self.agent = Wolpertinger( 64 | state_dim=state_dim, 65 | action_dim=action_dim, 66 | k_ratio=k_ratio, 67 | **kwargs 68 | ) 69 | self.t = 0 70 | self.current_episode = {} 71 | self.train = train 72 | self.num_actions = env.action_space.nvec[0] 73 | 74 | self.eps = eps 75 | self.batch_size = batch_size 76 | self.replay_buffer = ReplayBuffer(buffer_size) 77 | self.training_starts = training_starts or batch_size 78 | 79 | def _extract_state(self, observation): 80 | user_space = self._observation_space.spaces["user"] 81 | return spaces.flatten(user_space, observation["user"]) 82 | 83 | def _act(self, state): 84 | if np.random.rand() < self.eps: 85 | action = np.eye(self.num_actions)[np.random.randint(self.num_actions)] 86 | else: 87 | action = self.agent.predict_action(state) 88 | self.current_episode = { 89 | "state": state, 90 | "action": action, 91 | } 92 | return np.argmax(action)[np.newaxis] 93 | 94 | def _observe(self, next_state, reward, done): 95 | if not self.current_episode: 96 | raise ValueError("Current episode is expected to be non-empty") 97 | 98 | self.current_episode.update({ 99 | "next_state": next_state, 100 | "reward": reward, 101 | "done": done 102 | }) 
103 | 104 | self.agent.episode = self.current_episode 105 | if self.train: 106 | self.replay_buffer.push(**self.current_episode) 107 | if self.t >= self.training_starts \ 108 | and len(self.replay_buffer) >= self.batch_size: 109 | state, action, reward, next_state, done = \ 110 | self.replay_buffer.sample(self.batch_size) 111 | self.agent.update(state, action, reward, next_state, done) 112 | self.current_episode = {} 113 | 114 | def begin_episode(self, observation=None): 115 | state = self._extract_state(observation) 116 | return self._act(state) 117 | 118 | def step(self, reward, observation): 119 | state = self._extract_state(observation) 120 | self._observe(state, reward, 0) 121 | self.t += 1 122 | return self._act(state) 123 | 124 | def end_episode(self, reward, observation=None): 125 | state = self._extract_state(observation) 126 | self._observe(state, reward, 1) 127 | -------------------------------------------------------------------------------- /2020/presets/wolpertinger_scheme.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Scitator/RL-intro/0d95c6ca924cd7e2a3e87603c233b3dba34eaf83/2020/presets/wolpertinger_scheme.png -------------------------------------------------------------------------------- /2020/requirements.txt: -------------------------------------------------------------------------------- 1 | numpy 2 | pandas 3 | scipy 4 | seaborn 5 | matplotlib 6 | requests 7 | tqdm 8 | gym 9 | torch 10 | recsim -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright 2020 Sergey Kolesnikov. All rights reserved. 2 | 3 | Apache License 4 | Version 2.0, January 2004 5 | http://www.apache.org/licenses/ 6 | 7 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 8 | 9 | 1. Definitions. 10 | 11 | "License" shall mean the terms and conditions for use, reproduction, 12 | and distribution as defined by Sections 1 through 9 of this document. 13 | 14 | "Licensor" shall mean the copyright owner or entity authorized by 15 | the copyright owner that is granting the License. 16 | 17 | "Legal Entity" shall mean the union of the acting entity and all 18 | other entities that control, are controlled by, or are under common 19 | control with that entity. For the purposes of this definition, 20 | "control" means (i) the power, direct or indirect, to cause the 21 | direction or management of such entity, whether by contract or 22 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 23 | outstanding shares, or (iii) beneficial ownership of such entity. 24 | 25 | "You" (or "Your") shall mean an individual or Legal Entity 26 | exercising permissions granted by this License. 27 | 28 | "Source" form shall mean the preferred form for making modifications, 29 | including but not limited to software source code, documentation 30 | source, and configuration files. 31 | 32 | "Object" form shall mean any form resulting from mechanical 33 | transformation or translation of a Source form, including but 34 | not limited to compiled object code, generated documentation, 35 | and conversions to other media types. 36 | 37 | "Work" shall mean the work of authorship, whether in Source or 38 | Object form, made available under the License, as indicated by a 39 | copyright notice that is included in or attached to the work 40 | (an example is provided in the Appendix below). 
41 | 42 | "Derivative Works" shall mean any work, whether in Source or Object 43 | form, that is based on (or derived from) the Work and for which the 44 | editorial revisions, annotations, elaborations, or other modifications 45 | represent, as a whole, an original work of authorship. For the purposes 46 | of this License, Derivative Works shall not include works that remain 47 | separable from, or merely link (or bind by name) to the interfaces of, 48 | the Work and Derivative Works thereof. 49 | 50 | "Contribution" shall mean any work of authorship, including 51 | the original version of the Work and any modifications or additions 52 | to that Work or Derivative Works thereof, that is intentionally 53 | submitted to Licensor for inclusion in the Work by the copyright owner 54 | or by an individual or Legal Entity authorized to submit on behalf of 55 | the copyright owner. For the purposes of this definition, "submitted" 56 | means any form of electronic, verbal, or written communication sent 57 | to the Licensor or its representatives, including but not limited to 58 | communication on electronic mailing lists, source code control systems, 59 | and issue tracking systems that are managed by, or on behalf of, the 60 | Licensor for the purpose of discussing and improving the Work, but 61 | excluding communication that is conspicuously marked or otherwise 62 | designated in writing by the copyright owner as "Not a Contribution." 63 | 64 | "Contributor" shall mean Licensor and any individual or Legal Entity 65 | on behalf of whom a Contribution has been received by Licensor and 66 | subsequently incorporated within the Work. 67 | 68 | 2. Grant of Copyright License. Subject to the terms and conditions of 69 | this License, each Contributor hereby grants to You a perpetual, 70 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 71 | copyright license to reproduce, prepare Derivative Works of, 72 | publicly display, publicly perform, sublicense, and distribute the 73 | Work and such Derivative Works in Source or Object form. 74 | 75 | 3. Grant of Patent License. Subject to the terms and conditions of 76 | this License, each Contributor hereby grants to You a perpetual, 77 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 78 | (except as stated in this section) patent license to make, have made, 79 | use, offer to sell, sell, import, and otherwise transfer the Work, 80 | where such license applies only to those patent claims licensable 81 | by such Contributor that are necessarily infringed by their 82 | Contribution(s) alone or by combination of their Contribution(s) 83 | with the Work to which such Contribution(s) was submitted. If You 84 | institute patent litigation against any entity (including a 85 | cross-claim or counterclaim in a lawsuit) alleging that the Work 86 | or a Contribution incorporated within the Work constitutes direct 87 | or contributory patent infringement, then any patent licenses 88 | granted to You under this License for that Work shall terminate 89 | as of the date such litigation is filed. 90 | 91 | 4. Redistribution. 
You may reproduce and distribute copies of the 92 | Work or Derivative Works thereof in any medium, with or without 93 | modifications, and in Source or Object form, provided that You 94 | meet the following conditions: 95 | 96 | (a) You must give any other recipients of the Work or 97 | Derivative Works a copy of this License; and 98 | 99 | (b) You must cause any modified files to carry prominent notices 100 | stating that You changed the files; and 101 | 102 | (c) You must retain, in the Source form of any Derivative Works 103 | that You distribute, all copyright, patent, trademark, and 104 | attribution notices from the Source form of the Work, 105 | excluding those notices that do not pertain to any part of 106 | the Derivative Works; and 107 | 108 | (d) If the Work includes a "NOTICE" text file as part of its 109 | distribution, then any Derivative Works that You distribute must 110 | include a readable copy of the attribution notices contained 111 | within such NOTICE file, excluding those notices that do not 112 | pertain to any part of the Derivative Works, in at least one 113 | of the following places: within a NOTICE text file distributed 114 | as part of the Derivative Works; within the Source form or 115 | documentation, if provided along with the Derivative Works; or, 116 | within a display generated by the Derivative Works, if and 117 | wherever such third-party notices normally appear. The contents 118 | of the NOTICE file are for informational purposes only and 119 | do not modify the License. You may add Your own attribution 120 | notices within Derivative Works that You distribute, alongside 121 | or as an addendum to the NOTICE text from the Work, provided 122 | that such additional attribution notices cannot be construed 123 | as modifying the License. 124 | 125 | You may add Your own copyright statement to Your modifications and 126 | may provide additional or different license terms and conditions 127 | for use, reproduction, or distribution of Your modifications, or 128 | for any such Derivative Works as a whole, provided Your use, 129 | reproduction, and distribution of the Work otherwise complies with 130 | the conditions stated in this License. 131 | 132 | 5. Submission of Contributions. Unless You explicitly state otherwise, 133 | any Contribution intentionally submitted for inclusion in the Work 134 | by You to the Licensor shall be under the terms and conditions of 135 | this License, without any additional terms or conditions. 136 | Notwithstanding the above, nothing herein shall supersede or modify 137 | the terms of any separate license agreement you may have executed 138 | with Licensor regarding such Contributions. 139 | 140 | 6. Trademarks. This License does not grant permission to use the trade 141 | names, trademarks, service marks, or product names of the Licensor, 142 | except as required for reasonable and customary use in describing the 143 | origin of the Work and reproducing the content of the NOTICE file. 144 | 145 | 7. Disclaimer of Warranty. Unless required by applicable law or 146 | agreed to in writing, Licensor provides the Work (and each 147 | Contributor provides its Contributions) on an "AS IS" BASIS, 148 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 149 | implied, including, without limitation, any warranties or conditions 150 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 151 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 152 | appropriateness of using or redistributing the Work and assume any 153 | risks associated with Your exercise of permissions under this License. 154 | 155 | 8. Limitation of Liability. In no event and under no legal theory, 156 | whether in tort (including negligence), contract, or otherwise, 157 | unless required by applicable law (such as deliberate and grossly 158 | negligent acts) or agreed to in writing, shall any Contributor be 159 | liable to You for damages, including any direct, indirect, special, 160 | incidental, or consequential damages of any character arising as a 161 | result of this License or out of the use or inability to use the 162 | Work (including but not limited to damages for loss of goodwill, 163 | work stoppage, computer failure or malfunction, or any and all 164 | other commercial damages or losses), even if such Contributor 165 | has been advised of the possibility of such damages. 166 | 167 | 9. Accepting Warranty or Additional Liability. While redistributing 168 | the Work or Derivative Works thereof, You may choose to offer, 169 | and charge a fee for, acceptance of support, warranty, indemnity, 170 | or other liability obligations and/or rights consistent with this 171 | License. However, in accepting such obligations, You may act only 172 | on Your own behalf and on Your sole responsibility, not on behalf 173 | of any other Contributor, and only if You agree to indemnify, 174 | defend, and hold each Contributor harmless for any liability 175 | incurred by, or claims asserted against, such Contributor by reason 176 | of your accepting any such warranty or additional liability. 177 | 178 | END OF TERMS AND CONDITIONS 179 | 180 | APPENDIX: How to apply the Apache License to your work. 181 | 182 | To apply the Apache License to your work, attach the following 183 | boilerplate notice, with the fields enclosed by brackets "[]" 184 | replaced with your own identifying information. (Don't include 185 | the brackets!) The text should be enclosed in the appropriate 186 | comment syntax for the file format. We also recommend that a 187 | file or class name and description of purpose be included on the 188 | same "printed page" as the copyright notice for easier 189 | identification within third-party archives. 190 | 191 | Copyright [yyyy] [name of copyright owner] 192 | 193 | Licensed under the Apache License, Version 2.0 (the "License"); 194 | you may not use this file except in compliance with the License. 195 | You may obtain a copy of the License at 196 | 197 | http://www.apache.org/licenses/LICENSE-2.0 198 | 199 | Unless required by applicable law or agreed to in writing, software 200 | distributed under the License is distributed on an "AS IS" BASIS, 201 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 202 | See the License for the specific language governing permissions and 203 | limitations under the License. 204 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## RL Intro 2 | 3 |
4 | ### 2019 edition - Gym intro, Genetics, CEM, Tabular DQN 5 |

6 | 7 | #### 0. Gym interface 8 | - `00-gym.ipynb` [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Scitator/rl-teaser/blob/master/2019/code/00-gym.ipynb) 9 | 10 | 11 | #### 1. Genetic algorithm 12 | - [slides](./2019/slides/01-genetics.pdf) 13 | - `01-genetics.ipynb` [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Scitator/rl-teaser/blob/master/2019/code/01-genetics.ipynb) 14 | 15 | ##### Additional materials 16 | * __[recommended]__ - awesome openai post about evolution strategies - [blog post](https://blog.openai.com/evolution-strategies/), [article](https://arxiv.org/abs/1703.03864) 17 | * Video on genetic algorithms - https://www.youtube.com/watch?v=ejxfTy4lI6I 18 | * Another guide to genetic algorithm - https://www.youtube.com/watch?v=zwYV11a__HQ 19 | * PDF on Differential evolution - http://jvanderw.une.edu.au/DE_1.pdf 20 | * Video on Ant Colony Algorithm - https://www.youtube.com/watch?v=D58nLNLkb0I 21 | * Longer video on Ant Colony Algorithm - https://www.youtube.com/watch?v=xpyKmjJuqhk 22 | 23 | 24 | #### 2. Cross Entropy Method 25 | - [slides](./2019/slides/02-cem.pdf) 26 | - `02-cem.ipynb` [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Scitator/rl-teaser/blob/master/2019/code/02-cem.ipynb) 27 | 28 | ##### Additional materials 29 | * __[main]__ Video-intro by David Silver - https://www.youtube.com/watch?v=2pWv7GOvuf0 30 | * Optional lecture by David Silver - https://www.youtube.com/watch?v=lfHX2hHRMVQ 31 | * __[recommended]__ - formal explanation of crossentropy method in [general](https://people.smp.uq.edu.au/DirkKroese/ps/CEEncycl.pdf) and for [optimization](https://people.smp.uq.edu.au/DirkKroese/ps/CEopt.pdf) 32 | 33 | 34 | #### 3. Tabular 35 | - [slides](./2019/slides/03-tabular.pdf) 36 | - `03-tabular.ipynb` [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Scitator/rl-teaser/blob/master/2019/code/03-tabular.ipynb) 37 | 38 | ##### Additional materials 39 | * __[main]__ lecture by David Silver - [url](https://www.youtube.com/watch?v=Nd1-UUMVfz4) 40 | * Alternative lecture by Pieter Abbeel: [part 1](https://www.youtube.com/watch?v=i0o-ui1N35U), [part 2](https://www.youtube.com/watch?v=Csiiv6WGzKM) 41 | * Alternative lecture by John Schulmann: https://www.youtube.com/watch?v=IL3gVyJMmhg 42 | * Definitive guide in policy/value iteration from Sutton: start from page 81 [here](http://incompleteideas.net/sutton/book/bookdraft2017june19.pdf). 43 | 44 | 45 | #### 4. 
DQN 46 | - [slides](./2019/slides/04-dqn.pdf) 47 | - `04-dqn.ipynb` [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Scitator/rl-teaser/blob/master/2019/code/04-dqn.ipynb) 48 | 49 | ##### Additional materials 50 | * Lecture by David Silver - [video part I](https://www.youtube.com/watch?v=PnHCvfgC_ZA), [video part II](https://www.youtube.com/watch?v=0g4j2k_Ggc4&t=43s) 51 | * Alternative lecture by Pieter Abbeel - [video](https://www.youtube.com/watch?v=ifma8G7LegE) 52 | * Alternative lecture by John Schulmann - [video](https://www.youtube.com/watch?v=IL3gVyJMmhg) 53 | * Blog post on q-learning Vs SARSA - [url](https://studywolf.wordpress.com/2013/07/01/reinforcement-learning-sarsa-vs-q-learning/) 54 | * N-step temporal difference from Sutton's book - [suttonbook](http://incompleteideas.net/book/RLbook2018.pdf) __chapter 7__ 55 | * Eligibility traces from Sutton's book - [suttonbook](http://incompleteideas.net/book/RLbook2018.pdf) __chapter 12__ 56 | * Blog post on eligibility traces - [url](http://pierrelucbacon.com/traces/) 57 |

58 | 59 | 60 | 61 | 62 | ### 2020 edition - Deep RL, DQN, DDPG 63 | 64 | 65 | 66 | 67 | 68 |
69 | ### Credits 70 |
71 | 72 | * [Berkeley CS188x](http://ai.berkeley.edu/home.html) 73 | * [David Silver's Reinforcement Learning Course](http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html) 74 | * [dennybritz/reinforcement-learning](https://github.com/dennybritz/reinforcement-learning) 75 | * [yandexdataschool/Practical_RL](https://github.com/yandexdataschool/Practical_RL) 76 | * [yandexdataschool/AgentNet](https://github.com/yandexdataschool/AgentNet) 77 | * [rl-course-experiments](https://github.com/Scitator/rl-course-experiments) 78 | * [RL-Adventure](https://github.com/higgsfield/RL-Adventure) 79 | * [RL-Adventure-2](https://github.com/higgsfield/RL-Adventure-2) 80 | 81 |

82 |
--------------------------------------------------------------------------------