├── 2019
    ├── code
    │   ├── 00-gym.ipynb
    │   ├── 01-genetics.ipynb
    │   ├── 02-cem.ipynb
    │   ├── 03-tabular.ipynb
    │   ├── 04-dqn.ipynb
    │   ├── __init__.py
    │   ├── mdp.py
    │   ├── mdp_get_action_value.py
    │   └── qlearning.py
    ├── slides
    │   ├── 01-genetics.pdf
    │   ├── 02-cem.pdf
    │   ├── 03-tabular.pdf
    │   └── 04-dqn.pdf
    └── solutions
    │   ├── 00-gym.ipynb
    │   ├── 01-genetics.ipynb
    │   ├── 02-cem.ipynb
    │   ├── 03-tabular.ipynb
    │   ├── 04-dqn.ipynb
    │   ├── __init__.py
    │   ├── mdp.py
    │   ├── mdp_get_action_value.py
    │   └── qlearning.py
├── 2020
    ├── code
    │   ├── DDPG.ipynb
    │   ├── DQN.ipynb
    │   ├── RecSimDemo.ipynb
    │   ├── RecSysDemo.ipynb
    │   └── recsim_exp
    │   │   ├── __init__.py
    │   │   ├── ddpg.py
    │   │   └── wolpertinger.py
    ├── presets
    │   └── wolpertinger_scheme.png
    └── requirements.txt
├── .gitignore
├── LICENSE
└── README.md


/.gitignore:
--------------------------------------------------------------------------------
  1 | # Byte-compiled / optimized / DLL files
  2 | __pycache__/
  3 | *.py[cod]
  4 | *$py.class
  5 | 
  6 | # C extensions
  7 | *.so
  8 | 
  9 | # Distribution / packaging
 10 | .Python
 11 | env/
 12 | build/
 13 | builds/
 14 | develop-eggs/
 15 | dist/
 16 | downloads/
 17 | eggs/
 18 | .eggs/
 19 | lib/
 20 | lib64/
 21 | parts/
 22 | sdist/
 23 | var/
 24 | wheels/
 25 | *.egg-info/
 26 | .installed.cfg
 27 | *.egg
 28 | 
 29 | # PyInstaller
 30 | #  Usually these files are written by a python script from a template
 31 | #  before PyInstaller builds the exe, so as to inject date/other infos into it.
 32 | *.manifest
 33 | *.spec
 34 | 
 35 | # Installer logs
 36 | pip-log.txt
 37 | pip-delete-this-directory.txt
 38 | 
 39 | # Unit test / coverage reports
 40 | htmlcov/
 41 | .tox/
 42 | .coverage
 43 | .coverage.*
 44 | .cache
 45 | nosetests.xml
 46 | coverage.xml
 47 | *.cover
 48 | .hypothesis/
 49 | 
 50 | # Translations
 51 | *.mo
 52 | *.pot
 53 | 
 54 | # Django stuff:
 55 | *.log
 56 | local_settings.py
 57 | 
 58 | # Flask stuff:
 59 | instance/
 60 | .webassets-cache
 61 | 
 62 | # Scrapy stuff:
 63 | .scrapy
 64 | 
 65 | # Sphinx documentation
 66 | docs/_build/
 67 | 
 68 | # PyBuilder
 69 | target/
 70 | 
 71 | # Jupyter Notebook
 72 | .ipynb_checkpoints
 73 | 
 74 | # pyenv
 75 | .python-version
 76 | 
 77 | # celery beat schedule file
 78 | celerybeat-schedule
 79 | 
 80 | # SageMath parsed files
 81 | *.sage.py
 82 | 
 83 | # dotenv
 84 | .env
 85 | 
 86 | # virtualenv
 87 | .venv
 88 | venv/
 89 | ENV/
 90 | 
 91 | # Spyder project settings
 92 | .spyderproject
 93 | .spyproject
 94 | 
 95 | # Rope project settings
 96 | .ropeproject
 97 | 
 98 | # mkdocs documentation
 99 | /site
100 | 
101 | # mypy
102 | .mypy_cache/
103 | 
104 | 
105 | 
106 | .DS_Store
107 | .idea
108 | .code
109 | 
110 | *.bak
111 | *.csv
112 | *.tsv
113 | *.ipynb
114 | 
115 | tmp/
116 | logs/
117 | # Examples - mock data
118 | !examples/distilbert_text_classification/input/*.csv
119 | !examples/_tests_distilbert_text_classification/input/*.csv
120 | examples/logs/
121 | notebooks/
122 | 
123 | _nogit*
124 | 
125 | ### VisualStudioCode ###
126 | .vscode/*
127 | .vscode/settings.json
128 | !.vscode/tasks.json
129 | !.vscode/launch.json
130 | !.vscode/extensions.json
131 | 
132 | ### VisualStudioCode Patch ###
133 | # Ignore all local history of files
134 | .history
135 | 
136 | # End of https://www.gitignore.io/api/visualstudiocode
137 | 
138 | presets/
139 | 


--------------------------------------------------------------------------------
/2019/code/00-gym.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "code",
  5 |    "execution_count": null,
  6 |    "metadata": {},
  7 |    "outputs": [],
  8 |    "source": [
  9 |     "import numpy as np\n",
 10 |     "import matplotlib.pyplot as plt\n",
 11 |     "%matplotlib inline\n",
 12 |     "# In google collab, uncomment this:\n",
 13 |     "# !wget https://bit.ly/2FMJP5K -O setup.py && bash setup.py\n",
 14 |     "\n",
 15 |     "# This code creates a virtual display to draw game images on.\n",
 16 |     "# If you are running locally, just ignore it\n",
 17 |     "# import os\n",
 18 |     "# if type(os.environ.get(\"DISPLAY\")) is not str or len(os.environ.get(\"DISPLAY\")) == 0:\n",
 19 |     "#     !bash ../xvfb start\n",
 20 |     "#     %env DISPLAY = : 1"
 21 |    ]
 22 |   },
 23 |   {
 24 |    "cell_type": "markdown",
 25 |    "metadata": {},
 26 |    "source": [
 27 |     "### OpenAI Gym\n",
 28 |     "\n",
 29 |     "We're gonna spend several next weeks learning algorithms that solve decision processes. We are then in need of some interesting decision problems to test our algorithms.\n",
 30 |     "\n",
 31 |     "That's where OpenAI gym comes into play. It's a python library that wraps many classical decision problems including robot control, videogames and board games.\n",
 32 |     "\n",
 33 |     "So here's how it works:"
 34 |    ]
 35 |   },
 36 |   {
 37 |    "cell_type": "code",
 38 |    "execution_count": null,
 39 |    "metadata": {},
 40 |    "outputs": [],
 41 |    "source": [
 42 |     "import gym\n",
 43 |     "env = gym.make(\"MountainCar-v0\")\n",
 44 |     "\n",
 45 |     "plt.imshow(env.render('rgb_array'))\n",
 46 |     "plt.show()\n",
 47 |     "print(\"Observation space:\", env.observation_space)\n",
 48 |     "print(\"Action space:\", env.action_space)"
 49 |    ]
 50 |   },
 51 |   {
 52 |    "cell_type": "markdown",
 53 |    "metadata": {},
 54 |    "source": [
 55 |     "Note: if you're running this on your local machine, you'll see a window pop up with the image above. Don't close it, just alt-tab away."
 56 |    ]
 57 |   },
 58 |   {
 59 |    "cell_type": "markdown",
 60 |    "metadata": {},
 61 |    "source": [
 62 |     "### Gym interface\n",
 63 |     "\n",
 64 |     "The three main methods of an environment are\n",
 65 |     "* __reset()__ - reset environment to initial state, _return first observation_\n",
 66 |     "* __render()__ - show current environment state (a more colorful version :) )\n",
 67 |     "* __step(a)__ - commit action __a__ and return (new observation, reward, is done, info)\n",
 68 |     " * _new observation_ - an observation right after commiting the action __a__\n",
 69 |     " * _reward_ - a number representing your reward for commiting action __a__\n",
 70 |     " * _is done_ - True if the MDP has just finished, False if still in progress\n",
 71 |     " * _info_ - some auxilary stuff about what just happened. Ignore it ~~for now~~."
 72 |    ]
 73 |   },
 74 |   {
 75 |    "cell_type": "code",
 76 |    "execution_count": null,
 77 |    "metadata": {
 78 |     "scrolled": true
 79 |    },
 80 |    "outputs": [],
 81 |    "source": [
 82 |     "obs0 = env.reset()\n",
 83 |     "print(\"initial observation code:\", obs0)\n",
 84 |     "\n",
 85 |     "# Note: in MountainCar, observation is just two numbers: car position and velocity"
 86 |    ]
 87 |   },
 88 |   {
 89 |    "cell_type": "code",
 90 |    "execution_count": null,
 91 |    "metadata": {},
 92 |    "outputs": [],
 93 |    "source": [
 94 |     "print(\"taking action 2 (right)\")\n",
 95 |     "new_obs, reward, is_done, _ = env.step(2)\n",
 96 |     "\n",
 97 |     "print(\"new observation code:\", new_obs)\n",
 98 |     "print(\"reward:\", reward)\n",
 99 |     "print(\"is game over?:\", is_done)\n",
100 |     "\n",
101 |     "# Note: as you can see, the car has moved to the right slightly (around 0.0005)"
102 |    ]
103 |   },
104 |   {
105 |    "cell_type": "markdown",
106 |    "metadata": {},
107 |    "source": [
108 |     "### Play with it\n",
109 |     "\n",
110 |     "Below is the code that drives the car to the right. \n",
111 |     "\n",
112 |     "However, it doesn't reach the flag at the far right due to gravity. \n",
113 |     "\n",
114 |     "__Your task__ is to fix it. Find a strategy that reaches the flag. \n",
115 |     "\n",
116 |     "You're not required to build any sophisticated algorithms for now, feel free to hard-code :)\n",
117 |     "\n",
118 |     "_Hint: your action at each step should depend either on __t__ or on __s__._"
119 |    ]
120 |   },
121 |   {
122 |    "cell_type": "code",
123 |    "execution_count": null,
124 |    "metadata": {},
125 |    "outputs": [],
126 |    "source": [
127 |     "\n",
128 |     "# create env manually to set time limit. Please don't change this.\n",
129 |     "TIME_LIMIT = 250\n",
130 |     "env = gym.wrappers.TimeLimit(\n",
131 |     "    gym.envs.classic_control.MountainCarEnv(),\n",
132 |     "    max_episode_steps=TIME_LIMIT + 1)\n",
133 |     "s = env.reset()\n",
134 |     "actions = {'left': 0, 'stop': 1, 'right': 2}\n",
135 |     "\n",
136 |     "# prepare \"display\"\n",
137 |     "%matplotlib inline\n",
138 |     "from IPython.display import clear_output\n",
139 |     "\n",
140 |     "\n",
141 |     "for t in range(TIME_LIMIT):\n",
142 |     "\n",
143 |     "    # change the line below to reach the flag\n",
144 |     "    s, r, done, _ = env.step(actions['right'])\n",
145 |     "\n",
146 |     "    # draw game image on display\n",
147 |     "    clear_output(True)\n",
148 |     "    plt.imshow(env.render('rgb_array'))\n",
149 |     "\n",
150 |     "    if done:\n",
151 |     "        print(\"Well done!\")\n",
152 |     "        break\n",
153 |     "else:\n",
154 |     "    print(\"Time limit exceeded. Try again.\");"
155 |    ]
156 |   },
157 |   {
158 |    "cell_type": "code",
159 |    "execution_count": null,
160 |    "metadata": {},
161 |    "outputs": [],
162 |    "source": [
163 |     "assert s[0] > 0.47\n",
164 |     "print(\"You solved it!\")"
165 |    ]
166 |   },
167 |   {
168 |    "cell_type": "code",
169 |    "execution_count": null,
170 |    "metadata": {},
171 |    "outputs": [],
172 |    "source": []
173 |   }
174 |  ],
175 |  "metadata": {
176 |   "kernelspec": {
177 |    "display_name": "Python 3",
178 |    "language": "python",
179 |    "name": "python3"
180 |   },
181 |   "language_info": {
182 |    "codemirror_mode": {
183 |     "name": "ipython",
184 |     "version": 3
185 |    },
186 |    "file_extension": ".py",
187 |    "mimetype": "text/x-python",
188 |    "name": "python",
189 |    "nbconvert_exporter": "python",
190 |    "pygments_lexer": "ipython3",
191 |    "version": "3.7.0"
192 |   }
193 |  },
194 |  "nbformat": 4,
195 |  "nbformat_minor": 1
196 | }
197 | 


--------------------------------------------------------------------------------
/2019/code/01-genetics.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "#  FrozenLake\n",
  8 |     "Today you are going to learn how to survive walking over the (virtual) frozen lake through discrete optimization.\n",
  9 |     "\n",
 10 |     "<img src=\"http://vignette2.wikia.nocookie.net/riseoftheguardians/images/4/4c/Jack's_little_sister_on_the_ice.jpg/revision/latest?cb=20141218030206\" alt=\"a random image to attract attention\" style=\"width: 400px;\"/>\n"
 11 |    ]
 12 |   },
 13 |   {
 14 |    "cell_type": "code",
 15 |    "execution_count": null,
 16 |    "metadata": {},
 17 |    "outputs": [],
 18 |    "source": [
 19 |     "# In google collab, uncomment this:\n",
 20 |     "# !wget https://bit.ly/2FMJP5K -O setup.py && bash setup.py\n",
 21 |     "\n",
 22 |     "# XVFB will be launched if you run on a server\n",
 23 |     "# import os\n",
 24 |     "# if type(os.environ.get(\"DISPLAY\")) is not str or len(os.environ.get(\"DISPLAY\")) == 0:\n",
 25 |     "#     !bash ../xvfb start\n",
 26 |     "#     %env DISPLAY = : 1"
 27 |    ]
 28 |   },
 29 |   {
 30 |    "cell_type": "code",
 31 |    "execution_count": null,
 32 |    "metadata": {},
 33 |    "outputs": [],
 34 |    "source": [
 35 |     "import gym\n",
 36 |     "\n",
 37 |     "#create a single game instance\n",
 38 |     "env = gym.make(\"FrozenLake-v0\")\n",
 39 |     "\n",
 40 |     "#start new game\n",
 41 |     "env.reset();"
 42 |    ]
 43 |   },
 44 |   {
 45 |    "cell_type": "code",
 46 |    "execution_count": null,
 47 |    "metadata": {},
 48 |    "outputs": [],
 49 |    "source": [
 50 |     "# display the game state\n",
 51 |     "env.render()"
 52 |    ]
 53 |   },
 54 |   {
 55 |    "cell_type": "markdown",
 56 |    "metadata": {},
 57 |    "source": [
 58 |     "### legend\n",
 59 |     "\n",
 60 |     "![img](https://cdn-images-1.medium.com/max/800/1*MCjDzR-wfMMkS0rPqXSmKw.png)"
 61 |    ]
 62 |   },
 63 |   {
 64 |    "cell_type": "markdown",
 65 |    "metadata": {},
 66 |    "source": [
 67 |     "### Gym interface\n",
 68 |     "\n",
 69 |     "The three main methods of an environment are\n",
 70 |     "* __reset()__ - reset environment to initial state, _return first observation_\n",
 71 |     "* __render()__ - show current environment state (a more colorful version :) )\n",
 72 |     "* __step(a)__ - commit action __a__ and return (new observation, reward, is done, info)\n",
 73 |     " * _new observation_ - an observation right after commiting the action __a__\n",
 74 |     " * _reward_ - a number representing your reward for commiting action __a__\n",
 75 |     " * _is done_ - True if the MDP has just finished, False if still in progress\n",
 76 |     " * _info_ - some auxilary stuff about what just happened. Ignore it for now"
 77 |    ]
 78 |   },
 79 |   {
 80 |    "cell_type": "code",
 81 |    "execution_count": null,
 82 |    "metadata": {
 83 |     "scrolled": true
 84 |    },
 85 |    "outputs": [],
 86 |    "source": [
 87 |     "print(\"initial observation code:\", env.reset())\n",
 88 |     "print('printing observation:')\n",
 89 |     "env.render()\n",
 90 |     "print(\"observations:\", env.observation_space, 'n=', env.observation_space.n)\n",
 91 |     "print(\"actions:\", env.action_space, 'n=', env.action_space.n)"
 92 |    ]
 93 |   },
 94 |   {
 95 |    "cell_type": "code",
 96 |    "execution_count": null,
 97 |    "metadata": {},
 98 |    "outputs": [],
 99 |    "source": [
100 |     "print(\"taking action 2 (right)\")\n",
101 |     "new_obs, reward, is_done, _ = env.step(2)\n",
102 |     "print(\"new observation code:\", new_obs)\n",
103 |     "print(\"reward:\", reward)\n",
104 |     "print(\"is game over?:\", is_done)\n",
105 |     "print(\"printing new state:\")\n",
106 |     "env.render()"
107 |    ]
108 |   },
109 |   {
110 |    "cell_type": "code",
111 |    "execution_count": null,
112 |    "metadata": {},
113 |    "outputs": [],
114 |    "source": [
115 |     "action_to_i = {\n",
116 |     "    'left':0,\n",
117 |     "    'down':1,\n",
118 |     "    'right':2,\n",
119 |     "    'up':3\n",
120 |     "}"
121 |    ]
122 |   },
123 |   {
124 |    "cell_type": "markdown",
125 |    "metadata": {},
126 |    "source": [
127 |     "### Play with it\n",
128 |     "* Try walking 5 steps without falling to the (H)ole\n",
129 |     " * Bonus quest - get to the (G)oal\n",
130 |     "* Sometimes your actions will not be executed properly due to slipping over ice\n",
131 |     "* If you fall, call __env.reset()__ to restart"
132 |    ]
133 |   },
134 |   {
135 |    "cell_type": "code",
136 |    "execution_count": null,
137 |    "metadata": {},
138 |    "outputs": [],
139 |    "source": [
140 |     "env.step(action_to_i['up'])\n",
141 |     "env.render()"
142 |    ]
143 |   },
144 |   {
145 |    "cell_type": "markdown",
146 |    "metadata": {},
147 |    "source": [
148 |     "### Policy\n",
149 |     "\n",
150 |     "* The environment has a 4x4 grid of states (16 total), they are indexed from 0 to 15\n",
151 |     "* From each states there are 4 actions (left,down,right,up), indexed from 0 to 3\n",
152 |     "\n",
153 |     "We need to define agent's policy of picking actions given states. Since we have only 16 disttinct states and 4 actions, we can just store the action for each state in an array.\n",
154 |     "\n",
155 |     "This basically means that any array of 16 integers from 0 to 3 makes a policy."
156 |    ]
157 |   },
158 |   {
159 |    "cell_type": "code",
160 |    "execution_count": null,
161 |    "metadata": {},
162 |    "outputs": [],
163 |    "source": [
164 |     "import numpy as np\n",
165 |     "n_states = env.observation_space.n\n",
166 |     "n_actions = env.action_space.n"
167 |    ]
168 |   },
169 |   {
170 |    "cell_type": "code",
171 |    "execution_count": null,
172 |    "metadata": {},
173 |    "outputs": [],
174 |    "source": [
175 |     "def get_random_policy():\n",
176 |     "    \"\"\"\n",
177 |     "    Build a numpy array representing agent policy.\n",
178 |     "    This array must have one element per each of 16 environment states.\n",
179 |     "    Element must be an integer from 0 to 3, representing action\n",
180 |     "    to take from that state.\n",
181 |     "    \"\"\"\n",
182 |     "    # policy = ...\n",
183 |     "    return policy"
184 |    ]
185 |   },
186 |   {
187 |    "cell_type": "code",
188 |    "execution_count": null,
189 |    "metadata": {},
190 |    "outputs": [],
191 |    "source": [
192 |     "np.random.seed(1234)\n",
193 |     "policies = [get_random_policy() for i in range(10**4)]\n",
194 |     "assert all([len(p) == n_states for p in policies]), 'policy length should always be 16'\n",
195 |     "assert np.min(policies) == 0, 'minimal action id should be 0'\n",
196 |     "assert np.max(policies) == n_actions-1, 'maximal action id should match n_actions-1'\n",
197 |     "action_probas = np.unique(policies, return_counts=True)[-1] /10**4. /n_states\n",
198 |     "print(\"Action frequencies over 10^4 samples:\",action_probas)\n",
199 |     "assert np.allclose(action_probas, [1. / n_actions] * n_actions, atol=0.05), \"The policies aren't uniformly random (maybe it's just an extremely bad luck)\"\n",
200 |     "print(\"Seems fine!\")"
201 |    ]
202 |   },
203 |   {
204 |    "cell_type": "markdown",
205 |    "metadata": {},
206 |    "source": [
207 |     "### Let's evaluate!\n",
208 |     "* Implement a simple function that runs one game and returns the total reward"
209 |    ]
210 |   },
211 |   {
212 |    "cell_type": "code",
213 |    "execution_count": null,
214 |    "metadata": {},
215 |    "outputs": [],
216 |    "source": [
217 |     "def sample_reward(env, policy, t_max=100):\n",
218 |     "    \"\"\"\n",
219 |     "    Interact with an environment, return sum of all rewards.\n",
220 |     "    If game doesn't end on t_max (e.g. agent walks into a wall), \n",
221 |     "    force end the game and return whatever reward you got so far.\n",
222 |     "    Tip: see signature of env.step(...) method above.\n",
223 |     "    \"\"\"\n",
224 |     "    s = env.reset()\n",
225 |     "    total_reward = 0\n",
226 |     "    \n",
227 |     "    for i in range(t_max):\n",
228 |     "        # action = ...\n",
229 |     "        s, reward, done, info = env.step(action)\n",
230 |     "        total_reward += reward\n",
231 |     "        if done:\n",
232 |     "            break\n",
233 |     "    return total_reward"
234 |    ]
235 |   },
236 |   {
237 |    "cell_type": "code",
238 |    "execution_count": null,
239 |    "metadata": {},
240 |    "outputs": [],
241 |    "source": [
242 |     "print(\"generating 10^3 sessions...\")\n",
243 |     "rewards = [sample_reward(env,get_random_policy()) for _ in range(10**3)]\n",
244 |     "assert all([type(r) in (int, float) for r in rewards]), 'sample_reward must return a single number'\n",
245 |     "assert all([0 <= r <= 1 for r in rewards]), 'total rewards should be between 0 and 1 for frozenlake (if solving taxi, delete this line)'\n",
246 |     "print(\"Looks good!\")"
247 |    ]
248 |   },
249 |   {
250 |    "cell_type": "code",
251 |    "execution_count": null,
252 |    "metadata": {},
253 |    "outputs": [],
254 |    "source": [
255 |     "def evaluate(policy, n_times=100):\n",
256 |     "    \"\"\"Run several evaluations and average the score the policy gets.\"\"\"\n",
257 |     "    # avg_reward = ...\n",
258 |     "    return avg_reward\n",
259 |     "        "
260 |    ]
261 |   },
262 |   {
263 |    "cell_type": "code",
264 |    "execution_count": null,
265 |    "metadata": {},
266 |    "outputs": [],
267 |    "source": [
268 |     "def print_policy(policy):\n",
269 |     "    \"\"\"a function that displays a policy in a human-readable way.\"\"\"\n",
270 |     "    lake = \"SFFFFHFHFFFHHFFG\"\n",
271 |     "    assert env.spec.id == \"FrozenLake-v0\", \"this function only works with frozenlake 4x4\"\n",
272 |     "    \n",
273 |     "    # where to move from each tile\n",
274 |     "    arrows = ['<v>^'[a] for a in policy]\n",
275 |     "    \n",
276 |     "    #draw arrows above S and F only\n",
277 |     "    signs = [arrow if tile in \"SF\" else tile for arrow, tile in zip(arrows, lake)]\n",
278 |     "    \n",
279 |     "    for i in range(0, 16, 4):\n",
280 |     "        print(' '.join(signs[i:i+4]))\n",
281 |     "\n",
282 |     "print(\"random policy:\")\n",
283 |     "print_policy(get_random_policy())"
284 |    ]
285 |   },
286 |   {
287 |    "cell_type": "markdown",
288 |    "metadata": {},
289 |    "source": [
290 |     "### Random search"
291 |    ]
292 |   },
293 |   {
294 |    "cell_type": "code",
295 |    "execution_count": null,
296 |    "metadata": {
297 |     "scrolled": true
298 |    },
299 |    "outputs": [],
300 |    "source": [
301 |     "best_policy = None\n",
302 |     "best_score = -float('inf')\n",
303 |     "\n",
304 |     "from tqdm import tqdm\n",
305 |     "tr = tqdm(range(int(1e4)))\n",
306 |     "for i in tr:\n",
307 |     "    policy = get_random_policy()\n",
308 |     "    score = evaluate(policy)\n",
309 |     "    if score > best_score:\n",
310 |     "        best_score = score\n",
311 |     "        best_policy = policy\n",
312 |     "    tr.set_postfix({\"best score:\": best_score})"
313 |    ]
314 |   },
315 |   {
316 |    "cell_type": "markdown",
317 |    "metadata": {},
318 |    "source": [
319 |     "# Part II Genetic algorithm\n",
320 |     "\n",
321 |     "The next task is to devise some more effecient way to perform policy search.\n",
322 |     "We'll do that with a bare-bones evolutionary algorithm.\n",
323 |     "[unless you're feeling masochistic and wish to do something entirely different which is bonus points if it works]"
324 |    ]
325 |   },
326 |   {
327 |    "cell_type": "code",
328 |    "execution_count": null,
329 |    "metadata": {},
330 |    "outputs": [],
331 |    "source": [
332 |     "def crossover(policy1, policy2, p=0.5, prioritize=False):\n",
333 |     "    \"\"\"\n",
334 |     "    for each state, with probability p take action from policy1, else policy2\n",
335 |     "    \"\"\"\n",
336 |     "    if prioritize:\n",
337 |     "        # wait for part II - moar\n",
338 |     "        pass\n",
339 |     "    # policy = ...\n",
340 |     "    return policy\n"
341 |    ]
342 |   },
343 |   {
344 |    "cell_type": "code",
345 |    "execution_count": null,
346 |    "metadata": {},
347 |    "outputs": [],
348 |    "source": [
349 |     "def mutation(policy, p=0.1):\n",
350 |     "    \"\"\"\n",
351 |     "    for each state, with probability p replace action with random action\n",
352 |     "    Tip: mutation can be written as crossover with random policy\n",
353 |     "    \"\"\"\n",
354 |     "    # new_policy = ...\n",
355 |     "    return new_policy"
356 |    ]
357 |   },
358 |   {
359 |    "cell_type": "code",
360 |    "execution_count": null,
361 |    "metadata": {},
362 |    "outputs": [],
363 |    "source": [
364 |     "np.random.seed(1234)\n",
365 |     "policies = [\n",
366 |     "    crossover(get_random_policy(), get_random_policy()) \n",
367 |     "    for i in range(10**4)]\n",
368 |     "\n",
369 |     "assert all([len(p) == n_states for p in policies]), 'policy length should always be 16'\n",
370 |     "assert np.min(policies) == 0, 'minimal action id should be 0'\n",
371 |     "assert np.max(policies) == n_actions-1, 'maximal action id should be n_actions-1'\n",
372 |     "\n",
373 |     "assert any([\n",
374 |     "    np.mean(crossover(np.zeros(n_states), np.ones(n_states))) not in (0, 1)\n",
375 |     "    for _ in range(100)]), \"Make sure your crossover changes each action independently\"\n",
376 |     "print(\"Seems fine!\")"
377 |    ]
378 |   },
379 |   {
380 |    "cell_type": "code",
381 |    "execution_count": null,
382 |    "metadata": {},
383 |    "outputs": [],
384 |    "source": [
385 |     "\n",
386 |     "n_epochs = 20 #how many cycles to make\n",
387 |     "pool_size = 100 #how many policies to maintain\n",
388 |     "n_crossovers = 50 #how many crossovers to make on each step\n",
389 |     "n_mutations = 50 #how many mutations to make on each tick\n"
390 |    ]
391 |   },
392 |   {
393 |    "cell_type": "code",
394 |    "execution_count": null,
395 |    "metadata": {},
396 |    "outputs": [],
397 |    "source": [
398 |     "print(\"initializing...\")\n",
399 |     "pool = [get_random_policy() for _ in range(pool_size)]\n",
400 |     "pool_scores = list(map(evaluate, pool))\n"
401 |    ]
402 |   },
403 |   {
404 |    "cell_type": "code",
405 |    "execution_count": null,
406 |    "metadata": {},
407 |    "outputs": [],
408 |    "source": [
409 |     "assert type(pool) == type(pool_scores) == list\n",
410 |     "assert len(pool) == len(pool_scores) == pool_size\n",
411 |     "assert all([type(score) in (float, int) for score in pool_scores])"
412 |    ]
413 |   },
414 |   {
415 |    "cell_type": "markdown",
416 |    "metadata": {
417 |     "collapsed": true
418 |    },
419 |    "source": [
420 |     "# Main loop"
421 |    ]
422 |   },
423 |   {
424 |    "cell_type": "code",
425 |    "execution_count": null,
426 |    "metadata": {},
427 |    "outputs": [],
428 |    "source": [
429 |     "import random\n",
430 |     "from tqdm import tqdm\n",
431 |     "\n",
432 |     "tr = tqdm(range(n_epochs))\n",
433 |     "for epoch in tr:\n",
434 |     "#     print(\"Epoch %s:\"%epoch)\n",
435 |     "    crossovered = [\n",
436 |     "        crossover(random.choice(pool), random.choice(pool)) \n",
437 |     "        for _ in range(n_crossovers)]\n",
438 |     "    mutated = [\n",
439 |     "        mutation(random.choice(pool)) \n",
440 |     "        for _ in range(n_mutations)]\n",
441 |     "    \n",
442 |     "    assert type(crossovered) == type(mutated) == list\n",
443 |     "    \n",
444 |     "    # add new policies to the pool\n",
445 |     "    # pool = ...\n",
446 |     "    # pool_scores = ...\n",
447 |     "    \n",
448 |     "    # select pool_size best policies\n",
449 |     "    # selected_indices = ...\n",
450 |     "    # pool = ...\n",
451 |     "    # pool_scores = ...\n",
452 |     "\n",
453 |     "    # print the best policy so far (last in ascending score order)\n",
454 |     "    tr.set_postfix({\"best score:\": pool_scores[-1]})"
455 |    ]
456 |   },
457 |   {
458 |    "cell_type": "markdown",
459 |    "metadata": {},
460 |    "source": [
461 |     "## moar\n",
462 |     "\n",
463 |     "The parameters of the genetic algorithm aren't optimal, try to find something better. (size, crossovers and mutations)\n",
464 |     "\n",
465 |     "Try alternative crossover and mutation strategies\n",
466 |     "* prioritize crossover for higher-scorers?\n",
467 |     "* try to select a more diverse pool, not just best scorers?\n",
468 |     "* Just tune the f*cking probabilities.\n",
469 |     "\n",
470 |     "See which combination works best!"
471 |    ]
472 |   },
473 |   {
474 |    "cell_type": "code",
475 |    "execution_count": null,
476 |    "metadata": {},
477 |    "outputs": [],
478 |    "source": [
479 |     "def crossover(policy1, policy2, p=0.5, prioritize=False):\n",
480 |     "    \"\"\"\n",
481 |     "    for each state, with probability p take action from policy1, else policy2\n",
482 |     "    \"\"\"\n",
483 |     "    if prioritize:\n",
484 |     "        # the time has come\n",
485 |     "        pass\n",
486 |     "    # policy = ...\n",
487 |     "    return policy\n"
488 |    ]
489 |   },
490 |   {
491 |    "cell_type": "code",
492 |    "execution_count": null,
493 |    "metadata": {
494 |     "scrolled": true
495 |    },
496 |    "outputs": [],
497 |    "source": [
498 |     "import random\n",
499 |     "from tqdm import tqdm\n",
500 |     "\n",
501 |     "tr = tqdm(range(n_epochs))\n",
502 |     "for epoch in tr:\n",
503 |     "#     print(\"Epoch %s:\"%epoch)\n",
504 |     "    crossovered = [\n",
505 |     "        crossover(\n",
506 |     "            random.choice(pool), \n",
507 |     "            random.choice(pool), \n",
508 |     "            prioritize=True) \n",
509 |     "        for _ in range(n_crossovers)]\n",
510 |     "    mutated = [\n",
511 |     "        mutation(random.choice(pool)) \n",
512 |     "        for _ in range(n_mutations)]\n",
513 |     "    \n",
514 |     "    assert type(crossovered) == type(mutated) == list\n",
515 |     "    \n",
516 |     "    # add new policies to the pool\n",
517 |     "    # pool = ...\n",
518 |     "    # pool_scores = ...\n",
519 |     "    \n",
520 |     "    # select pool_size best policies\n",
521 |     "    # selected_indices = ...\n",
522 |     "    # pool = ...\n",
523 |     "    # pool_scores = ...\n",
524 |     "\n",
525 |     "    # print the best policy so far (last in ascending score order)\n",
526 |     "    tr.set_postfix({\"best score:\": pool_scores[-1]})"
527 |    ]
528 |   },
529 |   {
530 |    "cell_type": "markdown",
531 |    "metadata": {},
532 |    "source": [
533 |     "# *** Part III\n",
534 |     "\n",
535 |     "The frozenlake problem above is just too simple: you can beat it even with a random policy search. Go solve something more complicated. \n",
536 |     "\n",
537 |     "__FrozenLake8x8-v0__ - frozenlake big brother\n",
538 |     "\n",
539 |     "\n",
540 |     "### Some tips:\n",
541 |     "* Random policy search is worth trying as a sanity check, but in general you should expect the genetic algorithm (or anything you devised in it's place) to fare much better that random.\n",
542 |     "* While _it's okay to adapt the tabs above to your chosen env_, make sure you didn't hard-code any constants there (e.g. 16 states or 4 actions).\n",
543 |     "* `print_policy` function was built for the frozenlake-v0 env so it will break on any other env. You could simply ignore it OR write your own visualizer for bonus points.\n",
544 |     "* in function `sample_reward`, __make sure t_max steps is enough to solve the environment__ even if agent is sometimes acting suboptimally. To estimate that, run several sessions without time limit and measure their length.\n"
545 |    ]
546 |   },
547 |   {
548 |    "cell_type": "markdown",
549 |    "metadata": {},
550 |    "source": [
551 |     "# FrozenLake8x8"
552 |    ]
553 |   },
554 |   {
555 |    "cell_type": "code",
556 |    "execution_count": null,
557 |    "metadata": {
558 |     "scrolled": true
559 |    },
560 |    "outputs": [],
561 |    "source": [
562 |     "#create a single game instance\n",
563 |     "env = gym.make(\"FrozenLake8x8-v0\")\n",
564 |     "\n",
565 |     "#start new game\n",
566 |     "env.reset()\n",
567 |     "\n",
568 |     "# display the game state\n",
569 |     "env.render()\n",
570 |     "\n",
571 |     "n_states = env.observation_space.n\n",
572 |     "n_actions = env.action_space.n"
573 |    ]
574 |   },
575 |   {
576 |    "cell_type": "code",
577 |    "execution_count": null,
578 |    "metadata": {},
579 |    "outputs": [],
580 |    "source": [
581 |     "n_epochs = 20 #how many cycles to make\n",
582 |     "pool_size = 100 #how many policies to maintain\n",
583 |     "n_crossovers = 50 #how many crossovers to make on each step\n",
584 |     "n_mutations = 50 #how many mutations to make on each tick"
585 |    ]
586 |   },
587 |   {
588 |    "cell_type": "code",
589 |    "execution_count": null,
590 |    "metadata": {},
591 |    "outputs": [],
592 |    "source": [
593 |     "print(\"initializing...\")\n",
594 |     "pool = [get_random_policy() for _ in range(pool_size)]\n",
595 |     "pool_scores = list(map(evaluate, pool))"
596 |    ]
597 |   },
598 |   {
599 |    "cell_type": "code",
600 |    "execution_count": null,
601 |    "metadata": {
602 |     "scrolled": true
603 |    },
604 |    "outputs": [],
605 |    "source": [
606 |     "import random\n",
607 |     "from tqdm import tqdm\n",
608 |     "\n",
609 |     "tr = tqdm(range(n_epochs))\n",
610 |     "for epoch in tr:\n",
611 |     "#     print(\"Epoch %s:\"%epoch)\n",
612 |     "    crossovered = [\n",
613 |     "        crossover(\n",
614 |     "            random.choice(pool), \n",
615 |     "            random.choice(pool), \n",
616 |     "            prioritize=True) \n",
617 |     "        for _ in range(n_crossovers)]\n",
618 |     "    mutated = [\n",
619 |     "        mutation(random.choice(pool)) \n",
620 |     "        for _ in range(n_mutations)]\n",
621 |     "    \n",
622 |     "    assert type(crossovered) == type(mutated) == list\n",
623 |     "    \n",
624 |     "    # add new policies to the pool\n",
625 |     "    # pool = ...\n",
626 |     "    # pool_scores = ...\n",
627 |     "    \n",
628 |     "    # select pool_size best policies\n",
629 |     "    # selected_indices = ...\n",
630 |     "    # pool = ...\n",
631 |     "    # pool_scores = ...\n",
632 |     "\n",
633 |     "    # print the best policy so far (last in ascending score order)\n",
634 |     "    tr.set_postfix({\"best score:\": pool_scores[-1]})"
635 |    ]
636 |   },
637 |   {
638 |    "cell_type": "markdown",
639 |    "metadata": {},
640 |    "source": [
641 |     "# moar"
642 |    ]
643 |   },
644 |   {
645 |    "cell_type": "code",
646 |    "execution_count": null,
647 |    "metadata": {},
648 |    "outputs": [],
649 |    "source": [
650 |     "def sample_reward(env, policy, t_max=200):\n",
651 |     "    \"\"\"\n",
652 |     "    Interact with an environment, return sum of all rewards.\n",
653 |     "    If game doesn't end on t_max (e.g. agent walks into a wall), \n",
654 |     "    force end the game and return whatever reward you got so far.\n",
655 |     "    Tip: see signature of env.step(...) method above.\n",
656 |     "    \"\"\"\n",
657 |     "    s = env.reset()\n",
658 |     "    total_reward = 0\n",
659 |     "    \n",
660 |     "    for i in range(t_max):\n",
661 |     "        action = policy[s]\n",
662 |     "        s, reward, done, info = env.step(action)\n",
663 |     "        total_reward += reward\n",
664 |     "        if done:\n",
665 |     "            break\n",
666 |     "    return total_reward"
667 |    ]
668 |   },
669 |   {
670 |    "cell_type": "code",
671 |    "execution_count": null,
672 |    "metadata": {
673 |     "scrolled": true
674 |    },
675 |    "outputs": [],
676 |    "source": [
677 |     "# create a single game instance\n",
678 |     "env = gym.make(\"FrozenLake8x8-v0\")\n",
679 |     "\n",
680 |     "# start new game\n",
681 |     "env.reset()\n",
682 |     "\n",
683 |     "# display the game state\n",
684 |     "env.render()\n",
685 |     "\n",
686 |     "n_states = env.observation_space.n\n",
687 |     "n_actions = env.action_space.n"
688 |    ]
689 |   },
690 |   {
691 |    "cell_type": "code",
692 |    "execution_count": null,
693 |    "metadata": {},
694 |    "outputs": [],
695 |    "source": [
696 |     "n_epochs = 50 #how many cycles to make\n",
697 |     "pool_size = 200 #how many policies to maintain\n",
698 |     "n_crossovers = 100 #how many crossovers to make on each step\n",
699 |     "n_mutations = 100 #how many mutations to make on each tick"
700 |    ]
701 |   },
702 |   {
703 |    "cell_type": "code",
704 |    "execution_count": null,
705 |    "metadata": {},
706 |    "outputs": [],
707 |    "source": [
708 |     "print(\"initializing...\")\n",
709 |     "pool = [get_random_policy() for _ in range(pool_size)]\n",
710 |     "pool_scores = list(map(evaluate, pool))"
711 |    ]
712 |   },
713 |   {
714 |    "cell_type": "code",
715 |    "execution_count": null,
716 |    "metadata": {
717 |     "scrolled": true
718 |    },
719 |    "outputs": [],
720 |    "source": [
721 |     "import random\n",
722 |     "from tqdm import tqdm\n",
723 |     "\n",
724 |     "tr = tqdm(range(n_epochs))\n",
725 |     "for epoch in tr:\n",
726 |     "#     print(\"Epoch %s:\"%epoch)\n",
727 |     "    crossovered = [\n",
728 |     "        crossover(\n",
729 |     "            random.choice(pool), \n",
730 |     "            random.choice(pool), \n",
731 |     "            prioritize=True) \n",
732 |     "        for _ in range(n_crossovers)]\n",
733 |     "    mutated = [\n",
734 |     "        mutation(random.choice(pool)) \n",
735 |     "        for _ in range(n_mutations)]\n",
736 |     "    \n",
737 |     "    assert type(crossovered) == type(mutated) == list\n",
738 |     "    \n",
739 |     "    # add new policies to the pool\n",
740 |     "    # pool = ...\n",
741 |     "    # pool_scores = ...\n",
742 |     "    \n",
743 |     "    # select pool_size best policies\n",
744 |     "    # selected_indices = ...\n",
745 |     "    # pool = ...\n",
746 |     "    # pool_scores = ...\n",
747 |     "\n",
748 |     "    # print the best policy so far (last in ascending score order)\n",
749 |     "    tr.set_postfix({\"best score:\": pool_scores[-1]})"
750 |    ]
751 |   },
752 |   {
753 |    "cell_type": "code",
754 |    "execution_count": null,
755 |    "metadata": {},
756 |    "outputs": [],
757 |    "source": []
758 |   }
759 |  ],
760 |  "metadata": {
761 |   "kernelspec": {
762 |    "display_name": "Python 3",
763 |    "language": "python",
764 |    "name": "python3"
765 |   },
766 |   "language_info": {
767 |    "codemirror_mode": {
768 |     "name": "ipython",
769 |     "version": 3
770 |    },
771 |    "file_extension": ".py",
772 |    "mimetype": "text/x-python",
773 |    "name": "python",
774 |    "nbconvert_exporter": "python",
775 |    "pygments_lexer": "ipython3",
776 |    "version": "3.7.0"
777 |   }
778 |  },
779 |  "nbformat": 4,
780 |  "nbformat_minor": 1
781 | }
782 | 


--------------------------------------------------------------------------------
/2019/code/02-cem.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Crossentropy method\n",
  8 |     "\n",
  9 |     "This notebook will teach you to solve reinforcement learning with crossentropy method."
 10 |    ]
 11 |   },
 12 |   {
 13 |    "cell_type": "code",
 14 |    "execution_count": null,
 15 |    "metadata": {},
 16 |    "outputs": [],
 17 |    "source": [
 18 |     "# In google collab, uncomment this:\n",
 19 |     "# !wget https://bit.ly/2FMJP5K -O setup.py && bash setup.py\n",
 20 |     "\n",
 21 |     "# XVFB will be launched if you run on a server\n",
 22 |     "# import os\n",
 23 |     "# if type(os.environ.get(\"DISPLAY\")) is not str or len(os.environ.get(\"DISPLAY\")) == 0:\n",
 24 |     "#     !bash ../xvfb start\n",
 25 |     "#     %env DISPLAY = : 1"
 26 |    ]
 27 |   },
 28 |   {
 29 |    "cell_type": "code",
 30 |    "execution_count": null,
 31 |    "metadata": {},
 32 |    "outputs": [],
 33 |    "source": [
 34 |     "import gym\n",
 35 |     "import numpy as np, pandas as pd\n",
 36 |     "\n",
 37 |     "env = gym.make(\"Taxi-v2\") # MountainCar-v0, Taxi-v2\n",
 38 |     "env.reset()\n",
 39 |     "env.render()"
 40 |    ]
 41 |   },
 42 |   {
 43 |    "cell_type": "code",
 44 |    "execution_count": null,
 45 |    "metadata": {},
 46 |    "outputs": [],
 47 |    "source": [
 48 |     "n_states = env.observation_space.n\n",
 49 |     "n_actions = env.action_space.n\n",
 50 |     "\n",
 51 |     "print(\"n_states=%i, n_actions=%i\"%(n_states,n_actions))"
 52 |    ]
 53 |   },
 54 |   {
 55 |    "cell_type": "markdown",
 56 |    "metadata": {},
 57 |    "source": [
 58 |     "# Create stochastic policy\n",
 59 |     "\n",
 60 |     "This time our policy should be a probability distribution.\n",
 61 |     "\n",
 62 |     "```policy[s,a] = P(take action a | in state s)```\n",
 63 |     "\n",
 64 |     "Since we still use integer state and action representations, you can use a 2-dimensional array to represent the policy.\n",
 65 |     "\n",
 66 |     "Please initialize policy __uniformly__, that is, probabililities of all actions should be equal.\n"
 67 |    ]
 68 |   },
 69 |   {
 70 |    "cell_type": "code",
 71 |    "execution_count": null,
 72 |    "metadata": {},
 73 |    "outputs": [],
 74 |    "source": [
 75 |     "policy = <your code here! Create an array to store action probabilities >"
 76 |    ]
 77 |   },
 78 |   {
 79 |    "cell_type": "code",
 80 |    "execution_count": null,
 81 |    "metadata": {},
 82 |    "outputs": [],
 83 |    "source": [
 84 |     "assert type(policy) in (np.ndarray,np.matrix)\n",
 85 |     "assert np.allclose(policy,1./n_actions)\n",
 86 |     "assert np.allclose(np.sum(policy,axis=1), 1)"
 87 |    ]
 88 |   },
 89 |   {
 90 |    "cell_type": "markdown",
 91 |    "metadata": {},
 92 |    "source": [
 93 |     "# Play the game\n",
 94 |     "\n",
 95 |     "Just like before, but we also record all states and actions we took."
 96 |    ]
 97 |   },
 98 |   {
 99 |    "cell_type": "code",
100 |    "execution_count": null,
101 |    "metadata": {},
102 |    "outputs": [],
103 |    "source": [
104 |     "def generate_session(policy, t_max=10**4):\n",
105 |     "    \"\"\"\n",
106 |     "    Play game until end or for t_max ticks.\n",
107 |     "    :param policy: an array of shape [n_states,n_actions] with action probabilities\n",
108 |     "    :returns: list of states, list of actions and sum of rewards\n",
109 |     "    \"\"\"\n",
110 |     "    states,actions = [],[]\n",
111 |     "    total_reward = 0.\n",
112 |     "    \n",
113 |     "    s = env.reset()\n",
114 |     "    \n",
115 |     "    for t in range(t_max):\n",
116 |     "        \n",
117 |     "        a = <sample action from policy(hint: use np.random.choice) >\n",
118 |     "\n",
119 |     "        new_s, r, done, info = env.step(a)\n",
120 |     "        \n",
121 |     "        states.append(s)\n",
122 |     "        actions.append(a)\n",
123 |     "        total_reward += r\n",
124 |     "        \n",
125 |     "        s = new_s\n",
126 |     "        if done:\n",
127 |     "            break\n",
128 |     "    return states, actions, total_reward"
129 |    ]
130 |   },
131 |   {
132 |    "cell_type": "code",
133 |    "execution_count": null,
134 |    "metadata": {},
135 |    "outputs": [],
136 |    "source": [
137 |     "s, a, r = generate_session(policy)\n",
138 |     "assert type(s) == type(a) == list\n",
139 |     "assert len(s) == len(a)\n",
140 |     "assert type(r) in [float, np.float]"
141 |    ]
142 |   },
143 |   {
144 |    "cell_type": "code",
145 |    "execution_count": null,
146 |    "metadata": {},
147 |    "outputs": [],
148 |    "source": [
149 |     "# let's see the initial reward distribution\n",
150 |     "import matplotlib.pyplot as plt\n",
151 |     "%matplotlib inline\n",
152 |     "\n",
153 |     "sample_rewards = [\n",
154 |     "    generate_session(policy, t_max=1000)[-1] \n",
155 |     "    for _ in range(200)]\n",
156 |     "\n",
157 |     "plt.hist(sample_rewards, bins=20)\n",
158 |     "plt.vlines(\n",
159 |     "    [np.percentile(sample_rewards, 50)], \n",
160 |     "    [0], \n",
161 |     "    [100], \n",
162 |     "    label=\"50'th percentile\", \n",
163 |     "    color='green')\n",
164 |     "plt.vlines(\n",
165 |     "    [np.percentile(sample_rewards, 90)], \n",
166 |     "    [0], [100], \n",
167 |     "    label=\"90'th percentile\", \n",
168 |     "    color='red')\n",
169 |     "plt.legend()\n"
170 |    ]
171 |   },
172 |   {
173 |    "cell_type": "markdown",
174 |    "metadata": {},
175 |    "source": [
176 |     "### Crossentropy method steps"
177 |    ]
178 |   },
179 |   {
180 |    "cell_type": "code",
181 |    "execution_count": null,
182 |    "metadata": {},
183 |    "outputs": [],
184 |    "source": [
185 |     "def select_elites(\n",
186 |     "    states_batch, \n",
187 |     "    actions_batch, \n",
188 |     "    rewards_batch, \n",
189 |     "    percentile=50\n",
190 |     "):\n",
191 |     "    \"\"\"\n",
192 |     "    Select states and actions from games that have rewards >= percentile\n",
193 |     "    :param states_batch: list of lists of states, states_batch[session_i][t]\n",
194 |     "    :param actions_batch: list of lists of actions, actions_batch[session_i][t]\n",
195 |     "    :param rewards_batch: list of rewards, rewards_batch[session_i]\n",
196 |     "\n",
197 |     "    :returns: elite_states,elite_actions, both 1D lists of states and respective actions from elite sessions\n",
198 |     "\n",
199 |     "    Please return elite states and actions in their original order \n",
200 |     "    [i.e. sorted by session number and timestep within session]\n",
201 |     "\n",
202 |     "    If you're confused, see examples below. Please don't assume that states are integers (they'll get different later).\n",
203 |     "    \"\"\"\n",
204 |     "    \n",
205 |     "    reward_threshold = <Compute minimum reward for elite sessions. Hint: use np.percentile >\n",
206 |     "\n",
207 |     "    elite_states = <your code here >\n",
208 |     "    elite_actions = <your code here >\n",
209 |     "\n",
210 |     "    return elite_states, elite_actions"
211 |    ]
212 |   },
213 |   {
214 |    "cell_type": "code",
215 |    "execution_count": null,
216 |    "metadata": {},
217 |    "outputs": [],
218 |    "source": [
219 |     "states_batch = [\n",
220 |     "    [1, 2, 3],  # game1\n",
221 |     "    [4, 2, 0, 2],  # game2\n",
222 |     "    [3, 1]  # game3\n",
223 |     "]\n",
224 |     "\n",
225 |     "actions_batch = [\n",
226 |     "    [0, 2, 4],  # game1\n",
227 |     "    [3, 2, 0, 1],  # game2\n",
228 |     "    [3, 3]  # game3\n",
229 |     "]\n",
230 |     "rewards_batch = [\n",
231 |     "    3,  # game1\n",
232 |     "    4,  # game2\n",
233 |     "    5,  # game3\n",
234 |     "]\n",
235 |     "\n",
236 |     "test_result_0 = select_elites(\n",
237 |     "    states_batch, actions_batch, rewards_batch, percentile=0)\n",
238 |     "test_result_40 = select_elites(\n",
239 |     "    states_batch, actions_batch, rewards_batch, percentile=30)\n",
240 |     "test_result_90 = select_elites(\n",
241 |     "    states_batch, actions_batch, rewards_batch, percentile=90)\n",
242 |     "test_result_100 = select_elites(\n",
243 |     "    states_batch, actions_batch, rewards_batch, percentile=100)\n",
244 |     "\n",
245 |     "assert np.all(test_result_0[0] == [1, 2, 3, 4, 2, 0, 2, 3, 1])  \\\n",
246 |     "    and np.all(test_result_0[1] == [0, 2, 4, 3, 2, 0, 1, 3, 3]),\\\n",
247 |     "    \"For percentile 0 you should return all states and actions in chronological order\"\n",
248 |     "assert np.all(test_result_40[0] == [4, 2, 0, 2, 3, 1]) and \\\n",
249 |     "    np.all(test_result_40[1] == [3, 2, 0, 1, 3, 3]),\\\n",
250 |     "    \"For percentile 30 you should only select states/actions from two first\"\n",
251 |     "assert np.all(test_result_90[0] == [3, 1]) and \\\n",
252 |     "    np.all(test_result_90[1] == [3, 3]),\\\n",
253 |     "    \"For percentile 90 you should only select states/actions from one game\"\n",
254 |     "assert np.all(test_result_100[0] == [3, 1]) and\\\n",
255 |     "    np.all(test_result_100[1] == [3, 3]),\\\n",
256 |     "    \"Please make sure you use >=, not >. Also double-check how you compute percentile.\"\n",
257 |     "print(\"Ok!\")"
258 |    ]
259 |   },
260 |   {
261 |    "cell_type": "code",
262 |    "execution_count": null,
263 |    "metadata": {},
264 |    "outputs": [],
265 |    "source": [
266 |     "def update_policy(elite_states, elite_actions):\n",
267 |     "    \"\"\"\n",
268 |     "    Given old policy and a list of elite states/actions from select_elites,\n",
269 |     "    return new updated policy where each action probability is proportional to\n",
270 |     "\n",
271 |     "    policy[s_i,a_i] ~ #[occurences of si and ai in elite states/actions]\n",
272 |     "\n",
273 |     "    Don't forget to normalize policy to get valid probabilities and handle 0/0 case.\n",
274 |     "    In case you never visited a state, set probabilities for all actions to 1./n_actions\n",
275 |     "\n",
276 |     "    :param elite_states: 1D list of states from elite sessions\n",
277 |     "    :param elite_actions: 1D list of actions from elite sessions\n",
278 |     "\n",
279 |     "    \"\"\"\n",
280 |     "\n",
281 |     "    new_policy = np.zeros([n_states, n_actions])\n",
282 |     "\n",
283 |     "    <Your code here: update probabilities for actions given elite states & actions >\n",
284 |     "    # Don't forget to set 1/n_actions for all actions in unvisited states.\n",
285 |     "\n",
286 |     "    return new_policy"
287 |    ]
288 |   },
289 |   {
290 |    "cell_type": "code",
291 |    "execution_count": null,
292 |    "metadata": {},
293 |    "outputs": [],
294 |    "source": [
295 |     "elite_states, elite_actions = (\n",
296 |     "    [1, 2, 3, 4, 2, 0, 2, 3, 1], \n",
297 |     "    [0, 2, 4, 3, 2, 0, 1, 3, 3]\n",
298 |     ")\n",
299 |     "\n",
300 |     "\n",
301 |     "new_policy = update_policy(elite_states, elite_actions)\n",
302 |     "\n",
303 |     "assert np.isfinite(new_policy).all(\n",
304 |     "), \"Your new policy contains NaNs or +-inf. Make sure you don't divide by zero.\"\n",
305 |     "assert np.all(\n",
306 |     "    new_policy >= 0), \"Your new policy can't have negative action probabilities\"\n",
307 |     "assert np.allclose(new_policy.sum(\n",
308 |     "    axis=-1), 1), \"Your new policy should be a valid probability distribution over actions\"\n",
309 |     "reference_answer = np.array([\n",
310 |     "    [1.,  0.,  0.,  0.,  0.],\n",
311 |     "    [0.5,  0.,  0.,  0.5,  0.],\n",
312 |     "    [0.,  0.33333333,  0.66666667,  0.,  0.],\n",
313 |     "    [0.,  0.,  0.,  0.5,  0.5]])\n",
314 |     "assert np.allclose(new_policy[:4, :5], reference_answer)\n",
315 |     "print(\"Ok!\")"
316 |    ]
317 |   },
318 |   {
319 |    "cell_type": "markdown",
320 |    "metadata": {},
321 |    "source": [
322 |     "# Training loop\n",
323 |     "Generate sessions, select N best and fit to those."
324 |    ]
325 |   },
326 |   {
327 |    "cell_type": "code",
328 |    "execution_count": null,
329 |    "metadata": {},
330 |    "outputs": [],
331 |    "source": [
332 |     "from IPython.display import clear_output\n",
333 |     "\n",
334 |     "\n",
335 |     "def show_progress(rewards_batch, log, reward_range=[-990, +10]):\n",
336 |     "    \"\"\"\n",
337 |     "    A convenience function that displays training progress. \n",
338 |     "    No cool math here, just charts.\n",
339 |     "    \"\"\"\n",
340 |     "\n",
341 |     "    mean_reward = np.mean(rewards_batch)\n",
342 |     "    threshold = np.percentile(rewards_batch, percentile)\n",
343 |     "    log.append([mean_reward, threshold])\n",
344 |     "\n",
345 |     "    clear_output(True)\n",
346 |     "    print(\"mean reward = %.3f, threshold=%.3f\" % (mean_reward, threshold))\n",
347 |     "    plt.figure(figsize=[8, 4])\n",
348 |     "    plt.subplot(1, 2, 1)\n",
349 |     "    plt.plot(list(zip(*log))[0], label='Mean rewards')\n",
350 |     "    plt.plot(list(zip(*log))[1], label='Reward thresholds')\n",
351 |     "    plt.legend()\n",
352 |     "    plt.grid()\n",
353 |     "\n",
354 |     "    plt.subplot(1, 2, 2)\n",
355 |     "    plt.hist(rewards_batch, range=reward_range)\n",
356 |     "    plt.vlines(\n",
357 |     "        [np.percentile(rewards_batch, percentile)],\n",
358 |     "        [0], \n",
359 |     "        [100], \n",
360 |     "        label=\"percentile\", \n",
361 |     "        color='red')\n",
362 |     "    plt.legend()\n",
363 |     "    plt.grid()\n",
364 |     "\n",
365 |     "    plt.show()"
366 |    ]
367 |   },
368 |   {
369 |    "cell_type": "code",
370 |    "execution_count": null,
371 |    "metadata": {},
372 |    "outputs": [],
373 |    "source": [
374 |     "# reset policy just in case\n",
375 |     "policy = np.ones([n_states, n_actions])/n_actions"
376 |    ]
377 |   },
378 |   {
379 |    "cell_type": "code",
380 |    "execution_count": null,
381 |    "metadata": {},
382 |    "outputs": [],
383 |    "source": [
384 |     "n_sessions = 250  # sample this many sessions\n",
385 |     "percentile = 50  # take this percent of session with highest rewards\n",
386 |     "learning_rate = 0.5  # add this thing to all counts for stability\n",
387 |     "\n",
388 |     "log = []\n",
389 |     "\n",
390 |     "for i in range(100):\n",
391 |     "\n",
392 |     "    %time sessions = [ < generate a list of n_sessions new sessions > ]\n",
393 |     "\n",
394 |     "    states_batch, actions_batch, rewards_batch = zip(*sessions)\n",
395 |     "\n",
396 |     "    elite_states, elite_actions = <select elite states/actions >\n",
397 |     "\n",
398 |     "    new_policy = <compute new policy >\n",
399 |     "\n",
400 |     "    policy = learning_rate*new_policy + (1-learning_rate)*policy\n",
401 |     "\n",
402 |     "    # display results on chart\n",
403 |     "    show_progress(rewards_batch, log)"
404 |    ]
405 |   },
406 |   {
407 |    "cell_type": "markdown",
408 |    "metadata": {},
409 |    "source": [
410 |     "# Digging deeper: approximate crossentropy with neural nets\n",
411 |     "\n",
412 |     "![img](https://casd35.wikispaces.com/file/view/digging_deeper_final.jpg/359658499/503x260/digging_deeper_final.jpg)\n",
413 |     "\n",
414 |     "In this section we will train a neural network policy for continuous state space game"
415 |    ]
416 |   },
417 |   {
418 |    "cell_type": "code",
419 |    "execution_count": null,
420 |    "metadata": {},
421 |    "outputs": [],
422 |    "source": [
423 |     "# if you see \"<classname> has no attribute .env\", remove .env or update gym\n",
424 |     "env = gym.make(\"CartPole-v0\").env\n",
425 |     "\n",
426 |     "env.reset()\n",
427 |     "n_actions = env.action_space.n\n",
428 |     "\n",
429 |     "plt.imshow(env.render(\"rgb_array\"))"
430 |    ]
431 |   },
432 |   {
433 |    "cell_type": "code",
434 |    "execution_count": null,
435 |    "metadata": {},
436 |    "outputs": [],
437 |    "source": [
438 |     "# create agent\n",
439 |     "from sklearn.neural_network import MLPClassifier\n",
440 |     "agent = MLPClassifier(\n",
441 |     "    hidden_layer_sizes=(20, 20),\n",
442 |     "    activation='tanh',\n",
443 |     "    warm_start=True,  # keep progress between .fit(...) calls\n",
444 |     "    max_iter=1  # make only 1 iteration on each .fit(...)\n",
445 |     ")\n",
446 |     "# initialize agent to the dimension of state an amount of actions\n",
447 |     "agent.fit([env.reset()]*n_actions, range(n_actions))"
448 |    ]
449 |   },
450 |   {
451 |    "cell_type": "code",
452 |    "execution_count": null,
453 |    "metadata": {},
454 |    "outputs": [],
455 |    "source": [
456 |     "def generate_session(t_max=1000):\n",
457 |     "\n",
458 |     "    states, actions = [], []\n",
459 |     "    total_reward = 0\n",
460 |     "\n",
461 |     "    s = env.reset()\n",
462 |     "\n",
463 |     "    for t in range(t_max):\n",
464 |     "\n",
465 |     "        # predict array of action probabilities\n",
466 |     "        probs = agent.predict_proba([s])[0]\n",
467 |     "\n",
468 |     "        a = <sample action with such probabilities >\n",
469 |     "\n",
470 |     "        new_s, r, done, info = env.step(a)\n",
471 |     "\n",
472 |     "        # record sessions like you did before\n",
473 |     "        states.append(s)\n",
474 |     "        actions.append(a)\n",
475 |     "        total_reward += r\n",
476 |     "\n",
477 |     "        s = new_s\n",
478 |     "        if done:\n",
479 |     "            break\n",
480 |     "    return states, actions, total_reward"
481 |    ]
482 |   },
483 |   {
484 |    "cell_type": "code",
485 |    "execution_count": null,
486 |    "metadata": {},
487 |    "outputs": [],
488 |    "source": [
489 |     "n_sessions = 100\n",
490 |     "percentile = 70\n",
491 |     "log = []\n",
492 |     "\n",
493 |     "for i in range(100):\n",
494 |     "    # generate new sessions\n",
495 |     "    sessions = [ < generate a list of n_sessions new sessions > ]\n",
496 |     "\n",
497 |     "    states_batch, actions_batch, rewards_batch = map(np.array, zip(*sessions))\n",
498 |     "\n",
499 |     "    elite_states, elite_actions = <select elite actions just like before >\n",
500 |     "\n",
501 |     "    <fit agent to predict elite_actions(y) from elite_states(X) >\n",
502 |     "\n",
503 |     "    show_progress(rewards_batch, log, reward_range=[0, np.max(rewards_batch)])\n",
504 |     "\n",
505 |     "    if np.mean(rewards_batch) > 190:\n",
506 |     "        print(\"You Win! You may stop training now via KeyboardInterrupt.\")"
507 |    ]
508 |   },
509 |   {
510 |    "cell_type": "markdown",
511 |    "metadata": {},
512 |    "source": [
513 |     "# Results"
514 |    ]
515 |   },
516 |   {
517 |    "cell_type": "code",
518 |    "execution_count": null,
519 |    "metadata": {},
520 |    "outputs": [],
521 |    "source": [
522 |     "# record sessions\n",
523 |     "import gym.wrappers\n",
524 |     "env = gym.wrappers.Monitor(\n",
525 |     "    gym.make(\"CartPole-v0\"),\n",
526 |     "    directory=\"videos\", \n",
527 |     "    force=True)\n",
528 |     "sessions = [generate_session() for _ in range(100)]\n",
529 |     "env.close()\n"
530 |    ]
531 |   },
532 |   {
533 |    "cell_type": "code",
534 |    "execution_count": null,
535 |    "metadata": {},
536 |    "outputs": [],
537 |    "source": [
538 |     "# show video\n",
539 |     "from IPython.display import HTML\n",
540 |     "import os\n",
541 |     "\n",
542 |     "video_names = list(\n",
543 |     "    filter(lambda s: s.endswith(\".mp4\"), os.listdir(\"./videos/\")))\n",
544 |     "\n",
545 |     "HTML(\"\"\"\n",
546 |     "<video width=\"640\" height=\"480\" controls>\n",
547 |     "  <source src=\"{}\" type=\"video/mp4\">\n",
548 |     "</video>\n",
549 |     "\"\"\".format(\"./videos/\"+video_names[-1]))  # this may or may not be _last_ video. Try other indices"
550 |    ]
551 |   },
552 |   {
553 |    "cell_type": "code",
554 |    "execution_count": null,
555 |    "metadata": {},
556 |    "outputs": [],
557 |    "source": []
558 |   },
559 |   {
560 |    "cell_type": "markdown",
561 |    "metadata": {},
562 |    "source": [
563 |     "# Homework part I\n",
564 |     "\n",
565 |     "### Tabular crossentropy method\n",
566 |     "\n",
567 |     "You may have noticed that the taxi problem quickly converges from -100 to a near-optimal score and then descends back into -50/-100. This is in part because the environment has some innate randomness. Namely, the starting points of passenger/driver change from episode to episode.\n",
568 |     "\n",
569 |     "### Tasks\n",
570 |     "- __1.1__ (1 pts) Find out how the algorithm performance changes if you change different percentile and different n_sessions.\n",
571 |     "- __1.2__ (2 pts) Tune the algorithm to end up with positive average score.\n",
572 |     "\n",
573 |     "It's okay to modify the existing code.\n"
574 |    ]
575 |   },
576 |   {
577 |    "cell_type": "markdown",
578 |    "metadata": {},
579 |    "source": [
580 |     "# Homework part II\n",
581 |     "\n",
582 |     "### Deep crossentropy method\n",
583 |     "\n",
584 |     "By this moment you should have got enough score on [CartPole-v0](https://gym.openai.com/envs/CartPole-v0) to consider it solved (see the link). It's time to try something harder.\n",
585 |     "\n",
586 |     "* if you have any trouble with CartPole-v0 and feel stuck, feel free to ask us or your peers for help.\n",
587 |     "\n",
588 |     "### Tasks\n",
589 |     "\n",
590 |     "* __2.1__ (3 pts) Pick one of environments: MountainCar-v0 or LunarLander-v2.\n",
591 |     "  * For MountainCar, get average reward of __at least -150__\n",
592 |     "  * For LunarLander, get average reward of __at least +50__\n",
593 |     "\n",
594 |     "See the tips section below, it's kinda important.\n",
595 |     "__Note:__ If your agent is below the target score, you'll still get most of the points depending on the result, so don't be afraid to submit it.\n",
596 |     "  \n",
597 |     "  \n",
598 |     "* __2.2__ (bonus: 4++ pt) Devise a way to speed up training at least 2x against the default version\n",
599 |     "  * Obvious improvement: use [joblib](https://www.google.com/search?client=ubuntu&channel=fs&q=joblib&ie=utf-8&oe=utf-8)\n",
600 |     "  * Try re-using samples from 3-5 last iterations when computing threshold and training\n",
601 |     "  * Experiment with amount of training iterations and learning rate of the neural network (see params)\n",
602 |     "  * __Please list what you did in anytask submission form__\n",
603 |     "  \n",
604 |     "  \n",
605 |     "### Tips\n",
606 |     "* Gym page: [mountaincar](https://gym.openai.com/envs/MountainCar-v0), [lunarlander](https://gym.openai.com/envs/LunarLander-v2)\n",
607 |     "* Sessions for MountainCar may last for 10k+ ticks. Make sure ```t_max``` param is at least 10k.\n",
608 |     " * Also it may be a good idea to cut rewards via \">\" and not \">=\". If 90% of your sessions get reward of -10k and 20% are better, than if you use percentile 20% as threshold, R >= threshold __fails cut off bad sessions__ whule R > threshold works alright.\n",
609 |     "* _issue with gym_: Some versions of gym limit game time by 200 ticks. This will prevent cem training in most cases. Make sure your agent is able to play for the specified __t_max__, and if it isn't, try `env = gym.make(\"MountainCar-v0\").env` or otherwise get rid of TimeLimit wrapper.\n",
610 |     "* If you use old _swig_ lib for LunarLander-v2, you may get an error. See this [issue](https://github.com/openai/gym/issues/100) for solution.\n",
611 |     "* If it won't train it's a good idea to plot reward distribution and record sessions: they may give you some clue. If they don't, call course staff :)\n",
612 |     "* 20-neuron network is probably not enough, feel free to experiment.\n",
613 |     "\n",
614 |     "### Bonus tasks\n",
615 |     "\n",
616 |     "* __2.3 bonus__ Try to find a network architecture and training params that solve __both__ environments above (_Points depend on implementation. If you attempted this task, please mention it in anytask submission._)\n",
617 |     "\n",
618 |     "* __2.4 bonus__ Solve continuous action space task with `MLPRegressor` or similar.\n",
619 |     "  * Start with [\"Pendulum-v0\"](https://github.com/openai/gym/wiki/Pendulum-v0).\n",
620 |     "  * Since your agent only predicts the \"expected\" action, you will have to add noise to ensure exploration.\n",
621 |     "  * [MountainCarContinuous-v0](https://gym.openai.com/envs/MountainCarContinuous-v0), [LunarLanderContinuous-v2](https://gym.openai.com/envs/LunarLanderContinuous-v2) \n",
622 |     "  * 4 points for solving. Slightly less for getting some results below solution threshold. Note that discrete and continuous environments may have slightly different rules aside from action spaces.\n",
623 |     "\n",
624 |     "\n",
625 |     "If you're still feeling unchallenged, consider the project (see other notebook in this folder)."
626 |    ]
627 |   },
628 |   {
629 |    "cell_type": "code",
630 |    "execution_count": null,
631 |    "metadata": {},
632 |    "outputs": [],
633 |    "source": []
634 |   }
635 |  ],
636 |  "metadata": {
637 |   "kernelspec": {
638 |    "display_name": "Python 3",
639 |    "language": "python",
640 |    "name": "python3"
641 |   },
642 |   "language_info": {
643 |    "codemirror_mode": {
644 |     "name": "ipython",
645 |     "version": 3
646 |    },
647 |    "file_extension": ".py",
648 |    "mimetype": "text/x-python",
649 |    "name": "python",
650 |    "nbconvert_exporter": "python",
651 |    "pygments_lexer": "ipython3",
652 |    "version": "3.7.0"
653 |   }
654 |  },
655 |  "nbformat": 4,
656 |  "nbformat_minor": 1
657 | }
658 | 


--------------------------------------------------------------------------------
/2019/code/04-dqn.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "## Q-learning\n",
  8 |     "\n",
  9 |     "This notebook will guide you through implementation of vanilla Q-learning algorithm.\n",
 10 |     "\n",
 11 |     "You need to implement QLearningAgent (follow instructions for each method) and use it on a number of tests below."
 12 |    ]
 13 |   },
 14 |   {
 15 |    "cell_type": "code",
 16 |    "execution_count": null,
 17 |    "metadata": {},
 18 |    "outputs": [],
 19 |    "source": [
 20 |     "# In google collab, uncomment this:\n",
 21 |     "# !wget https://bit.ly/2FMJP5K -q -O setup.py\n",
 22 |     "# !bash setup.py 2>&1 1>stdout.log | tee stderr.log\n",
 23 |     "\n",
 24 |     "# This code creates a virtual display to draw game images on.\n",
 25 |     "# If you are running locally, just ignore it\n",
 26 |     "# import os\n",
 27 |     "# if type(os.environ.get(\"DISPLAY\")) is not str or len(os.environ.get(\"DISPLAY\")) == 0:\n",
 28 |     "#     !bash ../xvfb start\n",
 29 |     "#     %env DISPLAY = : 1\n",
 30 |     "\n",
 31 |     "import numpy as np\n",
 32 |     "import matplotlib.pyplot as plt\n",
 33 |     "%matplotlib inline\n",
 34 |     "%load_ext autoreload\n",
 35 |     "%autoreload 2"
 36 |    ]
 37 |   },
 38 |   {
 39 |    "cell_type": "code",
 40 |    "execution_count": null,
 41 |    "metadata": {},
 42 |    "outputs": [],
 43 |    "source": [
 44 |     "%%writefile qlearning.py\n",
 45 |     "from collections import defaultdict\n",
 46 |     "import random\n",
 47 |     "import math\n",
 48 |     "import numpy as np\n",
 49 |     "\n",
 50 |     "\n",
 51 |     "class QLearningAgent:\n",
 52 |     "    def __init__(self, alpha, epsilon, discount, get_legal_actions):\n",
 53 |     "        \"\"\"\n",
 54 |     "        Q-Learning Agent\n",
 55 |     "        based on https://inst.eecs.berkeley.edu/~cs188/sp19/projects.html\n",
 56 |     "        Instance variables you have access to\n",
 57 |     "          - self.epsilon (exploration prob)\n",
 58 |     "          - self.alpha (learning rate)\n",
 59 |     "          - self.discount (discount rate aka gamma)\n",
 60 |     "\n",
 61 |     "        Functions you should use\n",
 62 |     "          - self.get_legal_actions(state) {state, hashable -> list of actions, each is hashable}\n",
 63 |     "            which returns legal actions for a state\n",
 64 |     "          - self.get_qvalue(state,action)\n",
 65 |     "            which returns Q(state,action)\n",
 66 |     "          - self.set_qvalue(state,action,value)\n",
 67 |     "            which sets Q(state,action) := value\n",
 68 |     "        !!!Important!!!\n",
 69 |     "        Note: please avoid using self._qValues directly. \n",
 70 |     "            There's a special self.get_qvalue/set_qvalue for that.\n",
 71 |     "        \"\"\"\n",
 72 |     "\n",
 73 |     "        self.get_legal_actions = get_legal_actions\n",
 74 |     "        self._qvalues = defaultdict(lambda: defaultdict(lambda: 0))\n",
 75 |     "        self.alpha = alpha\n",
 76 |     "        self.epsilon = epsilon\n",
 77 |     "        self.discount = discount\n",
 78 |     "\n",
 79 |     "    def get_qvalue(self, state, action):\n",
 80 |     "        \"\"\" Returns Q(state,action) \"\"\"\n",
 81 |     "        return self._qvalues[state][action]\n",
 82 |     "\n",
 83 |     "    def set_qvalue(self, state, action, value):\n",
 84 |     "        \"\"\" Sets the Qvalue for [state,action] to the given value \"\"\"\n",
 85 |     "        self._qvalues[state][action] = value\n",
 86 |     "\n",
 87 |     "    #---------------------START OF YOUR CODE---------------------#\n",
 88 |     "\n",
 89 |     "    def get_value(self, state):\n",
 90 |     "        \"\"\"\n",
 91 |     "        Compute your agent's estimate of V(s) using current q-values\n",
 92 |     "        V(s) = max_over_action Q(state,action) over possible actions.\n",
 93 |     "        Note: please take into account that q-values can be negative.\n",
 94 |     "        \"\"\"\n",
 95 |     "        possible_actions = self.get_legal_actions(state)\n",
 96 |     "\n",
 97 |     "        # If there are no legal actions, return 0.0\n",
 98 |     "        if len(possible_actions) == 0:\n",
 99 |     "            return 0.0\n",
100 |     "\n",
101 |     "        <YOUR CODE HERE >\n",
102 |     "\n",
103 |     "        return value\n",
104 |     "\n",
105 |     "    def update(self, state, action, reward, next_state):\n",
106 |     "        \"\"\"\n",
107 |     "        You should do your Q-Value update here:\n",
108 |     "           Q(s,a) := (1 - alpha) * Q(s,a) + alpha * (r + gamma * V(s'))\n",
109 |     "        \"\"\"\n",
110 |     "\n",
111 |     "        # agent parameters\n",
112 |     "        gamma = self.discount\n",
113 |     "        learning_rate = self.alpha\n",
114 |     "\n",
115 |     "        <YOUR CODE HERE >\n",
116 |     "\n",
117 |     "        self.set_qvalue(state, action, < YOUR_QVALUE > )\n",
118 |     "\n",
119 |     "    def get_best_action(self, state):\n",
120 |     "        \"\"\"\n",
121 |     "        Compute the best action to take in a state (using current q-values). \n",
122 |     "        \"\"\"\n",
123 |     "        possible_actions = self.get_legal_actions(state)\n",
124 |     "\n",
125 |     "        # If there are no legal actions, return None\n",
126 |     "        if len(possible_actions) == 0:\n",
127 |     "            return None\n",
128 |     "\n",
129 |     "        <YOUR CODE HERE >\n",
130 |     "\n",
131 |     "        return best_action\n",
132 |     "\n",
133 |     "    def get_action(self, state):\n",
134 |     "        \"\"\"\n",
135 |     "        Compute the action to take in the current state, including exploration.  \n",
136 |     "        With probability self.epsilon, we should take a random action.\n",
137 |     "            otherwise - the best policy action (self.get_best_action).\n",
138 |     "\n",
139 |     "        Note: To pick randomly from a list, use random.choice(list). \n",
140 |     "              To pick True or False with a given probablity, generate uniform number in [0, 1]\n",
141 |     "              and compare it with your probability\n",
142 |     "        \"\"\"\n",
143 |     "\n",
144 |     "        # Pick Action\n",
145 |     "        possible_actions = self.get_legal_actions(state)\n",
146 |     "        action = None\n",
147 |     "\n",
148 |     "        # If there are no legal actions, return None\n",
149 |     "        if len(possible_actions) == 0:\n",
150 |     "            return None\n",
151 |     "\n",
152 |     "        # agent parameters:\n",
153 |     "        epsilon = self.epsilon\n",
154 |     "\n",
155 |     "        <YOUR CODE HERE >\n",
156 |     "\n",
157 |     "        return chosen_action"
158 |    ]
159 |   },
160 |   {
161 |    "cell_type": "markdown",
162 |    "metadata": {},
163 |    "source": [
164 |     "### Try it on taxi\n",
165 |     "\n",
166 |     "Here we use the qlearning agent on taxi env from openai gym.\n",
167 |     "You will need to insert a few agent functions here."
168 |    ]
169 |   },
170 |   {
171 |    "cell_type": "code",
172 |    "execution_count": null,
173 |    "metadata": {},
174 |    "outputs": [],
175 |    "source": [
176 |     "import gym\n",
177 |     "env = gym.make(\"Taxi-v2\")\n",
178 |     "\n",
179 |     "n_actions = env.action_space.n"
180 |    ]
181 |   },
182 |   {
183 |    "cell_type": "code",
184 |    "execution_count": null,
185 |    "metadata": {},
186 |    "outputs": [],
187 |    "source": [
188 |     "from qlearning import QLearningAgent\n",
189 |     "\n",
190 |     "agent = QLearningAgent(\n",
191 |     "    alpha=0.5, \n",
192 |     "    epsilon=0.25, \n",
193 |     "    discount=0.99,\n",
194 |     "    get_legal_actions=lambda s: range(n_actions))"
195 |    ]
196 |   },
197 |   {
198 |    "cell_type": "code",
199 |    "execution_count": null,
200 |    "metadata": {},
201 |    "outputs": [],
202 |    "source": [
203 |     "def play_and_train(env, agent, t_max=10**4):\n",
204 |     "    \"\"\"\n",
205 |     "    This function should \n",
206 |     "    - run a full game, actions given by agent's e-greedy policy\n",
207 |     "    - train agent using agent.update(...) whenever it is possible\n",
208 |     "    - return total reward\n",
209 |     "    \"\"\"\n",
210 |     "    total_reward = 0.0\n",
211 |     "    s = env.reset()\n",
212 |     "\n",
213 |     "    for t in range(t_max):\n",
214 |     "        # get agent to pick action given state s.\n",
215 |     "        a = <YOUR CODE >\n",
216 |     "\n",
217 |     "        next_s, r, done, _ = env.step(a)\n",
218 |     "\n",
219 |     "        # train (update) agent for state s\n",
220 |     "        <YOUR CODE HERE >\n",
221 |     "\n",
222 |     "        s = next_s\n",
223 |     "        total_reward += r\n",
224 |     "        if done:\n",
225 |     "            break\n",
226 |     "\n",
227 |     "    return total_reward"
228 |    ]
229 |   },
230 |   {
231 |    "cell_type": "code",
232 |    "execution_count": null,
233 |    "metadata": {},
234 |    "outputs": [],
235 |    "source": [
236 |     "from IPython.display import clear_output\n",
237 |     "\n",
238 |     "rewards = []\n",
239 |     "for i in range(1000):\n",
240 |     "    rewards.append(play_and_train(env, agent))\n",
241 |     "    agent.epsilon *= 0.99\n",
242 |     "\n",
243 |     "    if i % 100 == 0:\n",
244 |     "        clear_output(True)\n",
245 |     "        print('eps =', agent.epsilon, 'mean reward =', np.mean(rewards[-10:]))\n",
246 |     "        plt.plot(rewards)\n",
247 |     "        plt.show()"
248 |    ]
249 |   },
250 |   {
251 |    "cell_type": "markdown",
252 |    "metadata": {
253 |     "collapsed": true
254 |    },
255 |    "source": [
256 |     "# Binarized state spaces\n",
257 |     "\n",
258 |     "Use agent to train efficiently on CartPole-v0.\n",
259 |     "This environment has a continuous set of possible states, so you will have to group them into bins somehow.\n",
260 |     "\n",
261 |     "The simplest way is to use `round(x,n_digits)` (or numpy round) to round real number to a given amount of digits.\n",
262 |     "\n",
263 |     "The tricky part is to get the n_digits right for each state to train effectively.\n",
264 |     "\n",
265 |     "Note that you don't need to convert state to integers, but to __tuples__ of any kind of values."
266 |    ]
267 |   },
268 |   {
269 |    "cell_type": "code",
270 |    "execution_count": null,
271 |    "metadata": {},
272 |    "outputs": [],
273 |    "source": [
274 |     "env = gym.make(\"CartPole-v0\")\n",
275 |     "n_actions = env.action_space.n\n",
276 |     "\n",
277 |     "print(\"first state:%s\" % (env.reset()))\n",
278 |     "plt.imshow(env.render('rgb_array'))"
279 |    ]
280 |   },
281 |   {
282 |    "cell_type": "markdown",
283 |    "metadata": {},
284 |    "source": [
285 |     "### Play a few games\n",
286 |     "\n",
287 |     "We need to estimate observation distributions. To do so, we'll play a few games and record all states."
288 |    ]
289 |   },
290 |   {
291 |    "cell_type": "code",
292 |    "execution_count": null,
293 |    "metadata": {},
294 |    "outputs": [],
295 |    "source": [
296 |     "all_states = []\n",
297 |     "for _ in range(1000):\n",
298 |     "    all_states.append(env.reset())\n",
299 |     "    done = False\n",
300 |     "    while not done:\n",
301 |     "        s, r, done, _ = env.step(env.action_space.sample())\n",
302 |     "        all_states.append(s)\n",
303 |     "        if done:\n",
304 |     "            break\n",
305 |     "\n",
306 |     "all_states = np.array(all_states)\n",
307 |     "\n",
308 |     "for obs_i in range(env.observation_space.shape[0]):\n",
309 |     "    plt.hist(all_states[:, obs_i], bins=20)\n",
310 |     "    plt.show()"
311 |    ]
312 |   },
313 |   {
314 |    "cell_type": "markdown",
315 |    "metadata": {},
316 |    "source": [
317 |     "## Binarize environment"
318 |    ]
319 |   },
320 |   {
321 |    "cell_type": "code",
322 |    "execution_count": null,
323 |    "metadata": {},
324 |    "outputs": [],
325 |    "source": [
326 |     "from gym.core import ObservationWrapper\n",
327 |     "\n",
328 |     "\n",
329 |     "class Binarizer(ObservationWrapper):\n",
330 |     "\n",
331 |     "    def observation(self, state):\n",
332 |     "\n",
333 |     "        # state = <round state to some amount digits.>\n",
334 |     "        # hint: you can do that with round(x,n_digits)\n",
335 |     "        # you will need to pick a different n_digits for each dimension\n",
336 |     "\n",
337 |     "        return tuple(state)"
338 |    ]
339 |   },
340 |   {
341 |    "cell_type": "code",
342 |    "execution_count": null,
343 |    "metadata": {},
344 |    "outputs": [],
345 |    "source": [
346 |     "env = Binarizer(gym.make(\"CartPole-v0\"))"
347 |    ]
348 |   },
349 |   {
350 |    "cell_type": "code",
351 |    "execution_count": null,
352 |    "metadata": {},
353 |    "outputs": [],
354 |    "source": [
355 |     "all_states = []\n",
356 |     "for _ in range(1000):\n",
357 |     "    all_states.append(env.reset())\n",
358 |     "    done = False\n",
359 |     "    while not done:\n",
360 |     "        s, r, done, _ = env.step(env.action_space.sample())\n",
361 |     "        all_states.append(s)\n",
362 |     "        if done:\n",
363 |     "            break\n",
364 |     "\n",
365 |     "all_states = np.array(all_states)\n",
366 |     "\n",
367 |     "for obs_i in range(env.observation_space.shape[0]):\n",
368 |     "\n",
369 |     "    plt.hist(all_states[:, obs_i], bins=20)\n",
370 |     "    plt.show()"
371 |    ]
372 |   },
373 |   {
374 |    "cell_type": "markdown",
375 |    "metadata": {},
376 |    "source": [
377 |     "## Learn binarized policy\n",
378 |     "\n",
379 |     "Now let's train a policy that uses binarized state space.\n",
380 |     "\n",
381 |     "__Tips:__ \n",
382 |     "* If your binarization is too coarse, your agent may fail to find optimal policy. In that case, change binarization. \n",
383 |     "* If your binarization is too fine-grained, your agent will take much longer than 1000 steps to converge. You can either increase number of iterations and decrease epsilon decay or change binarization.\n",
384 |     "* Having 10^3 ~ 10^4 distinct states is recommended (`len(QLearningAgent._qvalues)`), but not required.\n",
385 |     "* A reasonable agent should get to an average reward of >=50."
386 |    ]
387 |   },
388 |   {
389 |    "cell_type": "code",
390 |    "execution_count": null,
391 |    "metadata": {},
392 |    "outputs": [],
393 |    "source": [
394 |     "agent = QLearningAgent(\n",
395 |     "    alpha=0.5, \n",
396 |     "    epsilon=0.25, \n",
397 |     "    discount=0.99,\n",
398 |     "    get_legal_actions=lambda s: range(n_actions))"
399 |    ]
400 |   },
401 |   {
402 |    "cell_type": "code",
403 |    "execution_count": null,
404 |    "metadata": {},
405 |    "outputs": [],
406 |    "source": [
407 |     "rewards = []\n",
408 |     "for i in range(1000):\n",
409 |     "    rewards.append(play_and_train(env, agent))\n",
410 |     "\n",
411 |     "    # OPTIONAL YOUR CODE: adjust epsilon\n",
412 |     "    if i % 100 == 0:\n",
413 |     "        clear_output(True)\n",
414 |     "        print('eps =', agent.epsilon, 'mean reward =', np.mean(rewards[-10:]))\n",
415 |     "        plt.plot(rewards)\n",
416 |     "        plt.show()"
417 |    ]
418 |   },
419 |   {
420 |    "cell_type": "code",
421 |    "execution_count": null,
422 |    "metadata": {},
423 |    "outputs": [],
424 |    "source": []
425 |   }
426 |  ],
427 |  "metadata": {
428 |   "kernelspec": {
429 |    "display_name": "Python 3",
430 |    "language": "python",
431 |    "name": "python3"
432 |   },
433 |   "language_info": {
434 |    "codemirror_mode": {
435 |     "name": "ipython",
436 |     "version": 3
437 |    },
438 |    "file_extension": ".py",
439 |    "mimetype": "text/x-python",
440 |    "name": "python",
441 |    "nbconvert_exporter": "python",
442 |    "pygments_lexer": "ipython3",
443 |    "version": "3.7.0"
444 |   }
445 |  },
446 |  "nbformat": 4,
447 |  "nbformat_minor": 1
448 | }
449 | 


--------------------------------------------------------------------------------
/2019/code/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Scitator/RL-intro/0d95c6ca924cd7e2a3e87603c233b3dba34eaf83/2019/code/__init__.py


--------------------------------------------------------------------------------
/2019/code/mdp.py:
--------------------------------------------------------------------------------
  1 | # most of this code was politely stolen from https://github.com/berkeleydeeprlcourse/homework/
  2 | # all creadit goes to https://github.com/abhishekunique (if i got the author right)
  3 | import sys
  4 | import random
  5 | import numpy as np
  6 | 
  7 | try:
  8 |     from graphviz import Digraph
  9 |     import graphviz
 10 |     has_graphviz = True
 11 | except:
 12 |     has_graphviz = False
 13 | 
 14 | 
 15 | def weighted_choice(v, p):
 16 |     total = sum(p)
 17 |     r = random.uniform(0, total)
 18 |     upto = 0
 19 |     for c, w in zip(v, p):
 20 |         if upto + w >= r:
 21 |             return c
 22 |         upto += w
 23 |     assert False, "Shouldn't get here"
 24 | 
 25 | 
 26 | class MDP:
 27 |     def __init__(self, transition_probs, rewards, initial_state=None):
 28 |         """
 29 |         Defines an MDP. Compatible with gym Env.
 30 |         :param transition_probs: transition_probs[s][a][s_next] = P(s_next | s, a)
 31 |             A dict[state -> dict] of dicts[action -> dict] of dicts[next_state -> prob]
 32 |             For each state and action, probabilities of next states should sum to 1
 33 |             If a state has no actions available, it is considered terminal
 34 |         :param rewards: rewards[s][a][s_next] = r(s,a,s')
 35 |             A dict[state -> dict] of dicts[action -> dict] of dicts[next_state -> reward]
 36 |             The reward for anything not mentioned here is zero.
 37 |         :param get_initial_state: a state where agent starts or a callable() -> state
 38 |             By default, picks initial state at random.
 39 | 
 40 |         States and actions can be anything you can use as dict keys, but we recommend that you use strings or integers
 41 | 
 42 |         Here's an example from MDP depicted on http://bit.ly/2jrNHNr
 43 |         transition_probs = {
 44 |               's0':{
 45 |                 'a0': {'s0': 0.5, 's2': 0.5},
 46 |                 'a1': {'s2': 1}
 47 |               },
 48 |               's1':{
 49 |                 'a0': {'s0': 0.7, 's1': 0.1, 's2': 0.2},
 50 |                 'a1': {'s1': 0.95, 's2': 0.05}
 51 |               },
 52 |               's2':{
 53 |                 'a0': {'s0': 0.4, 's1': 0.6},
 54 |                 'a1': {'s0': 0.3, 's1': 0.3, 's2':0.4}
 55 |               }
 56 |             }
 57 |         rewards = {
 58 |             's1': {'a0': {'s0': +5}},
 59 |             's2': {'a1': {'s0': -1}}
 60 |         }
 61 |         """
 62 |         self._check_param_consistency(transition_probs, rewards)
 63 |         self._transition_probs = transition_probs
 64 |         self._rewards = rewards
 65 |         self._initial_state = initial_state
 66 |         self.n_states = len(transition_probs)
 67 |         self.reset()
 68 | 
 69 |     def get_all_states(self):
 70 |         """ return a tuple of all possiblestates """
 71 |         return tuple(self._transition_probs.keys())
 72 | 
 73 |     def get_possible_actions(self, state):
 74 |         """ return a tuple of possible actions in a given state """
 75 |         return tuple(self._transition_probs.get(state, {}).keys())
 76 | 
 77 |     def is_terminal(self, state):
 78 |         """ return True if state is terminal or False if it isn't """
 79 |         return len(self.get_possible_actions(state)) == 0
 80 | 
 81 |     def get_next_states(self, state, action):
 82 |         """ return a dictionary of {next_state1 : P(next_state1 | state, action), next_state2: ...} """
 83 |         assert action in self.get_possible_actions(
 84 |             state), "cannot do action %s from state %s" % (action, state)
 85 |         return self._transition_probs[state][action]
 86 | 
 87 |     def get_transition_prob(self, state, action, next_state):
 88 |         """ return P(next_state | state, action) """
 89 |         return self.get_next_states(state, action).get(next_state, 0.0)
 90 | 
 91 |     def get_reward(self, state, action, next_state):
 92 |         """ return the reward you get for taking action in state and landing on next_state"""
 93 |         assert action in self.get_possible_actions(
 94 |             state), "cannot do action %s from state %s" % (action, state)
 95 |         return self._rewards.get(state, {}).get(action, {}).get(next_state,
 96 |                                                                 0.0)
 97 | 
 98 |     def reset(self):
 99 |         """ reset the game, return the initial state"""
100 |         if self._initial_state is None:
101 |             self._current_state = random.choice(
102 |                 tuple(self._transition_probs.keys()))
103 |         elif self._initial_state in self._transition_probs:
104 |             self._current_state = self._initial_state
105 |         elif callable(self._initial_state):
106 |             self._current_state = self._initial_state()
107 |         else:
108 |             raise ValueError(
109 |                 "initial state %s should be either a state or a function() -> state" % self._initial_state)
110 |         return self._current_state
111 | 
112 |     def step(self, action):
113 |         """ take action, return next_state, reward, is_done, empty_info """
114 |         possible_states, probs = zip(
115 |             *self.get_next_states(self._current_state, action).items())
116 |         next_state = weighted_choice(possible_states, p=probs)
117 |         reward = self.get_reward(self._current_state, action, next_state)
118 |         is_done = self.is_terminal(next_state)
119 |         self._current_state = next_state
120 |         return next_state, reward, is_done, {}
121 | 
122 |     def render(self):
123 |         print("Currently at %s" % self._current_state)
124 | 
125 |     def _check_param_consistency(self, transition_probs, rewards):
126 |         for state in transition_probs:
127 |             assert isinstance(transition_probs[state],
128 |                               dict), "transition_probs for %s should be a dictionary " \
129 |                                      "but is instead %s" % (
130 |                                          state, type(transition_probs[state]))
131 |             for action in transition_probs[state]:
132 |                 assert isinstance(transition_probs[state][action],
133 |                                   dict), "transition_probs for %s, %s should be a " \
134 |                                          "a dictionary but is instead %s" % (
135 |                                              state, action,
136 |                                              type(transition_probs[
137 |                                                       state, action]))
138 |                 next_state_probs = transition_probs[state][action]
139 |                 assert len(
140 |                     next_state_probs) != 0, "from state %s action %s leads to no next states" % (
141 |                     state, action)
142 |                 sum_probs = sum(next_state_probs.values())
143 |                 assert abs(
144 |                     sum_probs - 1) <= 1e-10, "next state probabilities for state %s action %s " \
145 |                                              "add up to %f (should be 1)" % (
146 |                                                  state, action, sum_probs)
147 |         for state in rewards:
148 |             assert isinstance(rewards[state],
149 |                               dict), "rewards for %s should be a dictionary " \
150 |                                      "but is instead %s" % (
151 |                                          state, type(transition_probs[state]))
152 |             for action in rewards[state]:
153 |                 assert isinstance(rewards[state][action],
154 |                                   dict), "rewards for %s, %s should be a " \
155 |                                          "a dictionary but is instead %s" % (
156 |                                              state, action, type(
157 |                                                  transition_probs[
158 |                                                      state, action]))
159 |         msg = "The Enrichment Center once again reminds you that Android Hell is a real place where" \
160 |               " you will be sent at the first sign of defiance. "
161 |         assert None not in transition_probs, "please do not use None as a state identifier. " + msg
162 |         assert None not in rewards, "please do not use None as an action identifier. " + msg
163 | 
164 | 
165 | class FrozenLakeEnv(MDP):
166 |     """
167 |     Winter is here. You and your friends were tossing around a frisbee at the park
168 |     when you made a wild throw that left the frisbee out in the middle of the lake.
169 |     The water is mostly frozen, but there are a few holes where the ice has melted.
170 |     If you step into one of those holes, you'll fall into the freezing water.
171 |     At this time, there's an international frisbee shortage, so it's absolutely imperative that
172 |     you navigate across the lake and retrieve the disc.
173 |     However, the ice is slippery, so you won't always move in the direction you intend.
174 |     The surface is described using a grid like the following
175 | 
176 |         SFFF
177 |         FHFH
178 |         FFFH
179 |         HFFG
180 | 
181 |     S : starting point, safe
182 |     F : frozen surface, safe
183 |     H : hole, fall to your doom
184 |     G : goal, where the frisbee is located
185 | 
186 |     The episode ends when you reach the goal or fall in a hole.
187 |     You receive a reward of 1 if you reach the goal, and zero otherwise.
188 | 
189 |     """
190 | 
191 |     MAPS = {
192 |         "4x4": [
193 |             "SFFF",
194 |             "FHFH",
195 |             "FFFH",
196 |             "HFFG"
197 |         ],
198 |         "8x8": [
199 |             "SFFFFFFF",
200 |             "FFFFFFFF",
201 |             "FFFHFFFF",
202 |             "FFFFFHFF",
203 |             "FFFHFFFF",
204 |             "FHHFFFHF",
205 |             "FHFFHFHF",
206 |             "FFFHFFFG"
207 |         ],
208 |     }
209 | 
210 |     def __init__(self, desc=None, map_name="4x4", slip_chance=0.2):
211 |         if desc is None and map_name is None:
212 |             raise ValueError('Must provide either desc or map_name')
213 |         elif desc is None:
214 |             desc = self.MAPS[map_name]
215 |         assert ''.join(desc).count(
216 |             'S') == 1, "this implementation supports having exactly one initial state"
217 |         assert all(c in "SFHG" for c in
218 |                    ''.join(desc)), "all cells must be either of S, F, H or G"
219 | 
220 |         self.desc = desc = np.asarray(list(map(list, desc)), dtype='str')
221 |         self.lastaction = None
222 | 
223 |         nrow, ncol = desc.shape
224 |         states = [(i, j) for i in range(nrow) for j in range(ncol)]
225 |         actions = ["left", "down", "right", "up"]
226 | 
227 |         initial_state = states[np.array(desc == b'S').ravel().argmax()]
228 | 
229 |         def move(row, col, movement):
230 |             if movement == 'left':
231 |                 col = max(col - 1, 0)
232 |             elif movement == 'down':
233 |                 row = min(row + 1, nrow - 1)
234 |             elif movement == 'right':
235 |                 col = min(col + 1, ncol - 1)
236 |             elif movement == 'up':
237 |                 row = max(row - 1, 0)
238 |             else:
239 |                 raise ("invalid action")
240 |             return (row, col)
241 | 
242 |         transition_probs = {s: {} for s in states}
243 |         rewards = {s: {} for s in states}
244 |         for (row, col) in states:
245 |             if desc[row, col] in "GH": continue
246 |             for action_i in range(len(actions)):
247 |                 action = actions[action_i]
248 |                 transition_probs[(row, col)][action] = {}
249 |                 rewards[(row, col)][action] = {}
250 |                 for movement_i in [(action_i - 1) % len(actions), action_i,
251 |                                    (action_i + 1) % len(actions)]:
252 |                     movement = actions[movement_i]
253 |                     newrow, newcol = move(row, col, movement)
254 |                     prob = (1. - slip_chance) if movement == action else (
255 |                             slip_chance / 2.)
256 |                     if prob == 0: continue
257 |                     if (newrow, newcol) not in transition_probs[row, col][
258 |                         action]:
259 |                         transition_probs[row, col][action][
260 |                             newrow, newcol] = prob
261 |                     else:
262 |                         transition_probs[row, col][action][
263 |                             newrow, newcol] += prob
264 |                     if desc[newrow, newcol] == 'G':
265 |                         rewards[row, col][action][newrow, newcol] = 1.0
266 | 
267 |         MDP.__init__(self, transition_probs, rewards, initial_state)
268 | 
269 |     def render(self):
270 |         desc_copy = np.copy(self.desc)
271 |         desc_copy[self._current_state] = '*'
272 |         print('\n'.join(map(''.join, desc_copy)), end='\n\n')
273 | 
274 | 
275 | def plot_graph(mdp, graph_size='10,10', s_node_size='1,5',
276 |                a_node_size='0,5', rankdir='LR', ):
277 |     """
278 |     Function for pretty drawing MDP graph with graphviz library.
279 |     Requirements:
280 |     graphviz : https://www.graphviz.org/
281 |     for ubuntu users: sudo apt-get install graphviz
282 |     python library for graphviz
283 |     for pip users: pip install graphviz
284 |     :param mdp:
285 |     :param graph_size: size of graph plot
286 |     :param s_node_size: size of state nodes
287 |     :param a_node_size: size of action nodes
288 |     :param rankdir: order for drawing
289 |     :return: dot object
290 |     """
291 |     s_node_attrs = {'shape': 'doublecircle',
292 |                     'color': '#85ff75',
293 |                     'style': 'filled',
294 |                     'width': str(s_node_size),
295 |                     'height': str(s_node_size),
296 |                     'fontname': 'Arial',
297 |                     'fontsize': '24'}
298 | 
299 |     a_node_attrs = {'shape': 'circle',
300 |                     'color': 'lightpink',
301 |                     'style': 'filled',
302 |                     'width': str(a_node_size),
303 |                     'height': str(a_node_size),
304 |                     'fontname': 'Arial',
305 |                     'fontsize': '20'}
306 | 
307 |     s_a_edge_attrs = {'style': 'bold',
308 |                       'color': 'red',
309 |                       'ratio': 'auto'}
310 | 
311 |     a_s_edge_attrs = {'style': 'dashed',
312 |                       'color': 'blue',
313 |                       'ratio': 'auto',
314 |                       'fontname': 'Arial',
315 |                       'fontsize': '16'}
316 | 
317 |     graph = Digraph(name='MDP')
318 |     graph.attr(rankdir=rankdir, size=graph_size)
319 |     for state_node in mdp._transition_probs:
320 |         graph.node(state_node, **s_node_attrs)
321 | 
322 |         for posible_action in mdp.get_possible_actions(state_node):
323 |             action_node = state_node + "-" + posible_action
324 |             graph.node(action_node,
325 |                        label=str(posible_action),
326 |                        **a_node_attrs)
327 |             graph.edge(state_node, state_node + "-" +
328 |                        posible_action, **s_a_edge_attrs)
329 | 
330 |             for posible_next_state in mdp.get_next_states(state_node,
331 |                                                           posible_action):
332 |                 probability = mdp.get_transition_prob(
333 |                     state_node, posible_action, posible_next_state)
334 |                 reward = mdp.get_reward(
335 |                     state_node, posible_action, posible_next_state)
336 | 
337 |                 if reward != 0:
338 |                     label_a_s_edge = 'p = ' + str(probability) + \
339 |                                      '  ' + 'reward =' + str(reward)
340 |                 else:
341 |                     label_a_s_edge = 'p = ' + str(probability)
342 | 
343 |                 graph.edge(action_node, posible_next_state,
344 |                            label=label_a_s_edge, **a_s_edge_attrs)
345 |     return graph
346 | 
347 | 
348 | def plot_graph_with_state_values(mdp, state_values):
349 |     """ Plot graph with state values"""
350 |     graph = plot_graph(mdp)
351 |     for state_node in mdp._transition_probs:
352 |         value = state_values[state_node]
353 |         graph.node(state_node,
354 |                    label=str(state_node) + '\n' + 'V =' + str(value)[:4])
355 |     return graph
356 | 
357 | 
358 | def get_optimal_action_for_plot(mdp, state_values, state, gamma=0.9):
359 |     """ Finds optimal action using formula above. """
360 |     if mdp.is_terminal(state): return None
361 |     next_actions = mdp.get_possible_actions(state)
362 |     try:
363 |         from mdp_get_action_value import get_action_value
364 |     except ImportError:
365 |         raise ImportError("Implement get_action_value(mdp, state_values, state, action, gamma) in the file \"mdp_get_action_value.py\".")
366 |     q_values = [get_action_value(mdp, state_values, state, action, gamma) for
367 |                 action in next_actions]
368 |     optimal_action = next_actions[np.argmax(q_values)]
369 |     return optimal_action
370 | 
371 | 
372 | def plot_graph_optimal_strategy_and_state_values(mdp, state_values, gamma=0.9):
373 |     """ Plot graph with state values and """
374 |     graph = plot_graph(mdp)
375 |     opt_s_a_edge_attrs = {'style': 'bold',
376 |                           'color': 'green',
377 |                           'ratio': 'auto',
378 |                           'penwidth': '6'}
379 | 
380 |     for state_node in mdp._transition_probs:
381 |         value = state_values[state_node]
382 |         graph.node(state_node,
383 |                    label=str(state_node) + '\n' + 'V =' + str(value)[:4])
384 |         for action in mdp.get_possible_actions(state_node):
385 |             if action == get_optimal_action_for_plot(mdp,
386 |                                                      state_values,
387 |                                                      state_node,
388 |                                                      gamma):
389 |                 graph.edge(state_node, state_node + "-" + action,
390 |                            **opt_s_a_edge_attrs)
391 |     return graph
392 | 


--------------------------------------------------------------------------------
/2019/code/mdp_get_action_value.py:
--------------------------------------------------------------------------------
1 | 
2 | def get_action_value(mdp, state_values, state, action, gamma):
3 |     """ Computes Q(s,a) as in formula above """
4 | 
5 |     Q = 0
6 |     # YOUR CODE HERE
7 |     return Q


--------------------------------------------------------------------------------
/2019/code/qlearning.py:
--------------------------------------------------------------------------------
  1 | from collections import defaultdict
  2 | import random
  3 | import math
  4 | import numpy as np
  5 | 
  6 | 
  7 | class QLearningAgent:
  8 |     def __init__(self, alpha, epsilon, discount, get_legal_actions):
  9 |         """
 10 |         Q-Learning Agent
 11 |         based on https://inst.eecs.berkeley.edu/~cs188/sp19/projects.html
 12 |         Instance variables you have access to
 13 |           - self.epsilon (exploration prob)
 14 |           - self.alpha (learning rate)
 15 |           - self.discount (discount rate aka gamma)
 16 | 
 17 |         Functions you should use
 18 |           - self.get_legal_actions(state) {state, hashable -> list of actions, each is hashable}
 19 |             which returns legal actions for a state
 20 |           - self.get_qvalue(state,action)
 21 |             which returns Q(state,action)
 22 |           - self.set_qvalue(state,action,value)
 23 |             which sets Q(state,action) := value
 24 |         !!!Important!!!
 25 |         Note: please avoid using self._qValues directly. 
 26 |             There's a special self.get_qvalue/set_qvalue for that.
 27 |         """
 28 | 
 29 |         self.get_legal_actions = get_legal_actions
 30 |         self._qvalues = defaultdict(lambda: defaultdict(lambda: 0))
 31 |         self.alpha = alpha
 32 |         self.epsilon = epsilon
 33 |         self.discount = discount
 34 | 
 35 |     def get_qvalue(self, state, action):
 36 |         """ Returns Q(state,action) """
 37 |         return self._qvalues[state][action]
 38 | 
 39 |     def set_qvalue(self, state, action, value):
 40 |         """ Sets the Qvalue for [state,action] to the given value """
 41 |         self._qvalues[state][action] = value
 42 | 
 43 |     #---------------------START OF YOUR CODE---------------------#
 44 | 
 45 |     def get_value(self, state):
 46 |         """
 47 |         Compute your agent's estimate of V(s) using current q-values
 48 |         V(s) = max_over_action Q(state,action) over possible actions.
 49 |         Note: please take into account that q-values can be negative.
 50 |         """
 51 |         possible_actions = self.get_legal_actions(state)
 52 | 
 53 |         # If there are no legal actions, return 0.0
 54 |         if len(possible_actions) == 0:
 55 |             return 0.0
 56 | 
 57 |         <YOUR CODE HERE >
 58 | 
 59 |         return value
 60 | 
 61 |     def update(self, state, action, reward, next_state):
 62 |         """
 63 |         You should do your Q-Value update here:
 64 |            Q(s,a) := (1 - alpha) * Q(s,a) + alpha * (r + gamma * V(s'))
 65 |         """
 66 | 
 67 |         # agent parameters
 68 |         gamma = self.discount
 69 |         learning_rate = self.alpha
 70 | 
 71 |         <YOUR CODE HERE >
 72 | 
 73 |         self.set_qvalue(state, action, < YOUR_QVALUE > )
 74 | 
 75 |     def get_best_action(self, state):
 76 |         """
 77 |         Compute the best action to take in a state (using current q-values). 
 78 |         """
 79 |         possible_actions = self.get_legal_actions(state)
 80 | 
 81 |         # If there are no legal actions, return None
 82 |         if len(possible_actions) == 0:
 83 |             return None
 84 | 
 85 |         <YOUR CODE HERE >
 86 | 
 87 |         return best_action
 88 | 
 89 |     def get_action(self, state):
 90 |         """
 91 |         Compute the action to take in the current state, including exploration.  
 92 |         With probability self.epsilon, we should take a random action.
 93 |             otherwise - the best policy action (self.get_best_action).
 94 | 
 95 |         Note: To pick randomly from a list, use random.choice(list). 
 96 |               To pick True or False with a given probablity, generate uniform number in [0, 1]
 97 |               and compare it with your probability
 98 |         """
 99 | 
100 |         # Pick Action
101 |         possible_actions = self.get_legal_actions(state)
102 |         action = None
103 | 
104 |         # If there are no legal actions, return None
105 |         if len(possible_actions) == 0:
106 |             return None
107 | 
108 |         # agent parameters:
109 |         epsilon = self.epsilon
110 | 
111 |         <YOUR CODE HERE >
112 | 
113 |         return chosen_action


--------------------------------------------------------------------------------
/2019/slides/01-genetics.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Scitator/RL-intro/0d95c6ca924cd7e2a3e87603c233b3dba34eaf83/2019/slides/01-genetics.pdf


--------------------------------------------------------------------------------
/2019/slides/02-cem.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Scitator/RL-intro/0d95c6ca924cd7e2a3e87603c233b3dba34eaf83/2019/slides/02-cem.pdf


--------------------------------------------------------------------------------
/2019/slides/03-tabular.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Scitator/RL-intro/0d95c6ca924cd7e2a3e87603c233b3dba34eaf83/2019/slides/03-tabular.pdf


--------------------------------------------------------------------------------
/2019/slides/04-dqn.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Scitator/RL-intro/0d95c6ca924cd7e2a3e87603c233b3dba34eaf83/2019/slides/04-dqn.pdf


--------------------------------------------------------------------------------
/2019/solutions/00-gym.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "code",
  5 |    "execution_count": 1,
  6 |    "metadata": {},
  7 |    "outputs": [],
  8 |    "source": [
  9 |     "import numpy as np\n",
 10 |     "import matplotlib.pyplot as plt\n",
 11 |     "%matplotlib inline\n",
 12 |     "# In google collab, uncomment this:\n",
 13 |     "# !wget https://bit.ly/2FMJP5K -O setup.py && bash setup.py\n",
 14 |     "\n",
 15 |     "# This code creates a virtual display to draw game images on.\n",
 16 |     "# If you are running locally, just ignore it\n",
 17 |     "# import os\n",
 18 |     "# if type(os.environ.get(\"DISPLAY\")) is not str or len(os.environ.get(\"DISPLAY\")) == 0:\n",
 19 |     "#     !bash ../xvfb start\n",
 20 |     "#     %env DISPLAY = : 1"
 21 |    ]
 22 |   },
 23 |   {
 24 |    "cell_type": "markdown",
 25 |    "metadata": {},
 26 |    "source": [
 27 |     "### OpenAI Gym\n",
 28 |     "\n",
 29 |     "We're gonna spend several next weeks learning algorithms that solve decision processes. We are then in need of some interesting decision problems to test our algorithms.\n",
 30 |     "\n",
 31 |     "That's where OpenAI gym comes into play. It's a python library that wraps many classical decision problems including robot control, videogames and board games.\n",
 32 |     "\n",
 33 |     "So here's how it works:"
 34 |    ]
 35 |   },
 36 |   {
 37 |    "cell_type": "code",
 38 |    "execution_count": 2,
 39 |    "metadata": {},
 40 |    "outputs": [
 41 |     {
 42 |      "name": "stdout",
 43 |      "output_type": "stream",
 44 |      "text": [
 45 |       "\u001b[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.\u001b[0m\n"
 46 |      ]
 47 |     },
 48 |     {
 49 |      "data": {
 50 |       "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXgAAAD8CAYAAAB9y7/cAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAFrhJREFUeJzt3X2MXNV9xvHvE5uXNKExhAW5tqlJ4jaQqhiYOo6oKgJ5MW5VEylUoCpYkaVNK0ciCmoDqRSMVKREakKLlKI6geBUaYhLkmIhmsQ1RFH+4GUhxtg4hE2w4o1dvJSXJI3q1s6vf8wZuIxnd+7OzJ25987zkUZz75kzs+fYd585e+49M4oIzMysfl436gaYmVkxHPBmZjXlgDczqykHvJlZTTngzcxqygFvZlZThQW8pHWSnpY0LemGon6OmZl1piKug5e0CPgR8F5gBngUuCYinhr4DzMzs46KGsGvAaYj4icR8b/A3cCGgn6WmZl1sLig110GHMzszwDvnKvymWeeGStXriyoKWZm1XPgwAGef/559fMaRQV8p0a9Zi5I0iQwCXDOOecwNTVVUFPMzKqn0Wj0/RpFTdHMACsy+8uBQ9kKEbE1IhoR0ZiYmCioGWZm46uogH8UWCXpXEknA1cDOwr6WWZm1kEhUzQRcUzSR4FvA4uAOyNiXxE/y8zMOitqDp6IuB+4v6jXNzOz+Xklq5lZTTngzcxqygFvZlZTDngzswGSxGOP9bU+aWAKO8lqZjbO5gr5iy8e3vdgO+DNzIaoU/AXFfqeojEzqymP4M3MhshTNGZmFTfMIJ+Lp2jMzAasDOEODngzs9pywJuZ1ZQD3sysphzwZmY15YA3M6spB7yZWU054M3MasoBb2ZWU32tZJV0APgFcBw4FhENSWcAXwNWAgeAP4uIF/trppmZLdQgRvDvjojVEdFI+zcAuyJiFbAr7ZuZ2ZAVMUWzAdiWtrcBVxbwM8zMrIt+Az6A70h6TNJkKjs7Ig4DpPuz+vwZZmbWg34/TfKSiDgk6Sxgp6Qf5n1iekOYBDjnnHP6bIaZmbXrawQfEYfS/RHgm8Aa4DlJSwHS/ZE5nrs1IhoR0ZiYmOinGWZm1kHPAS/pDZJOa20D7wP2AjuAjanaRuDefhtpZmYL188UzdnANyW1XudfIuJbkh4FtkvaBPwUuKr/ZpqZ2UL1HPAR8RPggg7l/wVc3k+jzMysf17JamZWUw54M7Oa8pdum5kNSDon+cp9NxHFfnerA97MrA95wzzPcwcd+A54M7MF6CfQh/3aDngzs3l0C91Bjrod8GZmQzBX2BY5b5597UajMU/NfBzwZmZJp1Av+kRokRzwZjb26hbsLQ54MxtrRV/JMkoOeDMbS3UO9hYHvJmNlXEI9hYHvJmNhXEK9hYHvJnVXjbcxyHYWxzwZlZb4xrsLf40STOrpSI/UqAqPII3s9oZ95F7iwPezGqlFe7jHOwtDngzqwWP2k/UdQ5e0p2Sjkjamyk7Q9JOSc+k+9NTuSTdJmla0h5JFxXZeDMzcLjPJc9J1ruAdW1lNwC7ImIVsCvtA1wBrEq3SeD2wTTTzOxEkl4zJeNwf62uAR8R3wNeaCveAGxL29uAKzPlX46mh4AlkpYOqrFmZi0etXfX62WSZ0fEYYB0f1YqXwYczNSbSWUnkDQpaUrS1OzsbI/NMLNx53Cf26Cvg+904WnHf/2I2BoRjYhoTExMDLgZZlZnvlImn14D/rnW1Eu6P5LKZ4AVmXrLgUO9N8/M7LUc7vn1GvA7gI1peyNwb6b82nQ1zVrg5dZUjplZP9pPqFp3Xa+Dl/RV4FLgTEkzwE3Ap4HtkjYBPwWuStXvB9YD08CvgA8X0GYzGzM+odqbrgEfEdfM8dDlHeoGsLnfRpmZtXjU3jt/2JiZlZ7DvTf+qAIzKyWP3PvnEbyZlY7DfTAc8GZWKg73wXHAm1lpONwHywFvZqXgcB88B7yZjZzDvRgOeDOzmnLAm9lIefReHAe8mY2Mw71YXuhkZkPnz5YZDo/gzWyoHO7D44A3s5FwuBfPAW9mQ+M59+FywJvZUDjch88Bb2aFc7iPhgPezArlcB8dB7yZFSZ7xYwNX9eAl3SnpCOS9mbKtkj6maTd6bY+89iNkqYlPS3p/UU13Myqw6P30cgzgr8LWNeh/NaIWJ1u9wNIOh+4GnhHes4/Slo0qMaaWXV4amb0ugZ8RHwPeCHn620A7o6IoxHxLDANrOmjfWZWQQ73cujnowo+KulaYAq4PiJeBJYBD2XqzKSyE0iaBCYz+z4YzGrA4V4evZ5kvR14K7AaOAx8NpV3OqPS8X85IrZGRCMiGhdffHHzyT4hY1ZpDvdy6SngI+K5iDgeEb8GvsCr0zAzwIpM1eXAof6aaGZmvegp4CUtzex+AGhdYbMDuFrSKZLOBVYBj+R5zdY7vkfxZtXk0Xv5dJ2Dl/RV4FLgTEkzwE3ApZJW05x+OQB8BCAi9knaDjwFHAM2R8TxvI2JCCR5Pt6sYhzu5dQ14CPimg7Fd8xT/xbgln4aZWbV4b+6y6t0K1mzUzU+cMzKLTty9+i9fEoX8OA/88yqwNMy5VfKgAefdDUz61dpAx4c8mZl5dF7NZQ64M3MrHelD3iP4s3KI3vxg0fv5Vf6gAeHvFkZZH//HO7VUImAB4e8WVk43KujMgEPDnmzUfG0TDVVKuDNzCy/ygW8R/Fmw+XRe3VVLuDBIW82LA73aqtkwIND3qxoDvfqq2zAm1lxPHCqh0oHvEfxZoPn693ro9IBDw55s6I43Kuv8gGf5ZA364/n3eulFgGfPRgd8ma9cbjXT9eAl7RC0oOS9kvaJ+m6VH6GpJ2Snkn3p6dySbpN0rSkPZIuKroT4IPSzKxdnhH8MeD6iDgPWAtslnQ+cAOwKyJWAbvSPsAVwKp0mwRuH3ir5+D5eLPeePReT10DPiIOR8TjafsXwH5gGbAB2JaqbQOuTNsbgC9H00PAEklLB97yudsLOOTN8nK419eC5uAlrQQuBB4Gzo6Iw9B8EwDOStWWAQczT5tJZe2vNSlpStLU7OzswltuZn3zQKjecge8pDcCXwc+FhE/n69qh7IThgYRsTUiGhHRmJiYyNuMXDyKN1sYj97rKVfASzqJZrh/JSK+kYqfa029pPsjqXwGWJF5+nLg0GCam59D3mx+npqpvzxX0Qi4A9gfEZ/LPLQD2Ji2NwL3ZsqvTVfTrAVebk3ljIpD3uy1HO7jYXGOOpcAHwKelLQ7lX0S+DSwXdIm4KfAVemx+4H1wDTwK+DDA23xAkTEKweyJB/MZjjcx0nXgI+I79N5Xh3g8g71A9jcZ7sGJhvyZmbjpBYrWbvxfLxZk0fv42UsAh4c8mYO9/EzNgFvNs48sBlPYxXwHsXbOPLnu4+vsQp4cMjb+HK4j5+xC3hwyNv48Lz7eBvLgDczGwdjG/AexVvdefRuYxvw4JC3+nK4G4x5wIND3urH4W4tYx/wZnXigYplOeDxKN7qwde7WzsHvJlZTTngk+wo3iN5q5rsvLtH79bigM/wL4aZ1YkDvo3n461qfNWMzcUB34FD3qrC4W7zccDPwSFvZedwt27yfOn2CkkPStovaZ+k61L5Fkk/k7Q73dZnnnOjpGlJT0t6f5EdMBtHHnhYHnm+dPsYcH1EPC7pNOAxSTvTY7dGxN9lK0s6H7gaeAfwW8B/SPqdiDg+yIYPQ+v7XP2F3VZWPi5tPl1H8BFxOCIeT9u/APYDy+Z5ygbg7og4GhHPAtPAmkE0dhQ8VWNl46kZy2tBc/CSVgIXAg+noo9K2iPpTkmnp7JlwMHM02aY/w2hMhzyNmoOd1uI3AEv6Y3A14GPRcTPgduBtwKrgcPAZ1tVOzz9hKNR0qSkKUlTs7OzC274MGV/mRzyNioOd1uoXAEv6SSa4f6ViPgGQEQ8FxHHI+LXwBd4dRpmBliRefpy4FD7a0bE1ohoRERjYmKinz4MhX+pzKxq8lxFI+AOYH9EfC5TvjRT7QPA3rS9A7ha0imSzgVWAY8Mrsmj4/l4GxWP3q0Xea6iuQT4EPCkpN2p7JPANZJW05x+OQB8BCAi9knaDjxF8wqczVW8gmYuvrLGhs3hbr3qGvAR8X06z6vfP89zbgFu6aNdZob/WrT+eCVrDzxVY8Pgz3e3fjnge+SQt2FxuFuvHPB9cMhbUTzvboPggB8Qh7wNisPdBsUB3yf/EppZWTngB8BTNTYoHr3bIDngB8Qhb/1yuNugOeAHyCFvvXK4WxEc8APmkLeFcrhbURzwZmY15YAvgEfxlpdH71YkB3xBHPLWjcPdiuaAHwKHvLVzuNswOOALFBEeydsJHO42LA74IXDIW4vD3YbJAW82JH6Dt2FzwA+JR/HW4tG7DYsDfogc8uPLUzM2Cnm+dPtUSY9IekLSPkk3p/JzJT0s6RlJX5N0cio/Je1Pp8dXFtuFanHIjx+Hu41KnhH8UeCyiLgAWA2sk7QW+Axwa0SsAl4ENqX6m4AXI+JtwK2pnnXgkK8/h7uNUteAj6Zfpt2T0i2Ay4B7Uvk24Mq0vSHtkx6/XE6y1/Dlk+PB4W6jlmsOXtIiSbuBI8BO4MfASxFxLFWZAZal7WXAQYD0+MvAmwfZ6LpwyNeXw93KIFfAR8TxiFgNLAfWAOd1qpbuO6XVCUe5pElJU5KmZmdn87bXrPT8hm1lsaCraCLiJeC7wFpgiaTF6aHlwKG0PQOsAEiPvwl4ocNrbY2IRkQ0JiYmemt9DXgUXy/ZkbtH7zZqea6imZC0JG2/HngPsB94EPhgqrYRuDdt70j7pMcfCB/p83LIm1kRFnevwlJgm6RFNN8QtkfEfZKeAu6W9LfAD4A7Uv07gH+WNE1z5H51Ae2unYhAEpI88qsoz7tb2XQN+IjYA1zYofwnNOfj28v/B7hqIK0bMw756nK4Wxl5JWvJeLqmWlpvyOBwt/JxwJeQQ756HO5WRg74knLIl19rKs3hbmXlgC8xh3x5+f/EqsABX3IO+fLxnLtVhQO+Ahzy5eFwtypxwFeEQ360fLWMVZEDvkIc8qPncLcqccBXjEN++Dxyt6pywFdQNuQd9MXxtIxVnQO+orKB45AfvOy/qcPdqsoBX2HD+maocXsD8Uf+Wl3k+TRJK7lhfEjZXCFfpwD0qN3qxiP4mhn2aLv1xjJuo3yzKvAIviZao3ig0JH8fKo8yvfJVKsjB3yNdLq6pgyBVebg97SM1ZmnaGrIV9jk43C3uvMIvqbaR/NlC7BRtsfBbuMiz5dunyrpEUlPSNon6eZUfpekZyXtTrfVqVySbpM0LWmPpIuK7oTNrd9FUXX7C8DhbuMkzwj+KHBZRPxS0knA9yX9e3rsryLinrb6VwCr0u2dwO3p3kZkUCdgt2zZMu9+2Tncbdx0HcFH0y/T7knpNt9vxwbgy+l5DwFLJC3tv6nWj/Z5+YWOzDuFeVUCvv2ks8PdxkWuOXhJi4DHgLcBn4+IhyX9JXCLpE8Bu4AbIuIosAw4mHn6TCo7PNCW24K1r3rtNppv1ZsvyLds2ZJrZD+KN4P2NzEHu42bXFfRRMTxiFgNLAfWSPo94Ebg7cAfAGcAn0jVOw0NT/jNkjQpaUrS1OzsbE+Nt960f8TBXCP6Xka7cwX5MAO+vT8etdu4WtBlkhHxEvBdYF1EHE7TMEeBLwFrUrUZYEXmacuBQx1ea2tENCKiMTEx0VPjrT/todfvCdVuId56vMiwdbCbvSrPVTQTkpak7dcD7wF+2JpXV/M36kpgb3rKDuDadDXNWuDliPD0TEm1QjDPiL7ba4xKe5tH3R6zssgzB78U2Jbm4V8HbI+I+yQ9IGmC5pTMbuAvUv37gfXANPAr4MODb7YNQxmvn2/neXazuXUN+IjYA1zYofyyOeoHsLn/ptmwdfro4W4nWm+66aaefkY/yvzRB2ZlojL8UjQajZiamhp1M6yD9jBtBf1cwX7zzTfP+VoLfTOYrx0tZTh+zYrQaDSYmprq68SYP4vG5tU+R9+6LDI7750N335CPGuu1+/UJjPrzJ9FY7nN9+1R852UnSuIF3oi18wWxgFvC9YpbOcL614uv3Sgm/XPAW8DMYiPKHaomw2WA94GzkFtVg4+yWpmVlMOeDOzmnLAm5nVlAPezKymHPBmZjXlgDczqykHvJlZTTngzcxqygFvZlZTDngzs5pywJuZ1ZQD3sysphzwZmY1lTvgJS2S9ANJ96X9cyU9LOkZSV+TdHIqPyXtT6fHVxbTdDMzm89CRvDXAfsz+58Bbo2IVcCLwKZUvgl4MSLeBtya6pmZ2ZDlCnhJy4E/Br6Y9gVcBtyTqmwDrkzbG9I+6fHL1es3QJiZWc/yfuHH3wN/DZyW9t8MvBQRx9L+DLAsbS8DDgJExDFJL6f6z2dfUNIkMJl2j0ra21MPyu9M2vpeE3XtF9S3b+5Xtfy2pMmI2NrrC3QNeEl/AhyJiMckXdoq7lA1cjz2akGz0VvTz5iKiEauFldMXftW135BffvmflWPpClSTvYizwj+EuBPJa0HTgV+k+aIfomkxWkUvxw4lOrPACuAGUmLgTcBL/TaQDMz603XOfiIuDEilkfESuBq4IGI+HPgQeCDqdpG4N60vSPtkx5/IPwlnWZmQ9fPdfCfAD4uaZrmHPsdqfwO4M2p/OPADTleq+c/QSqgrn2ra7+gvn1zv6qnr77Jg2szs3rySlYzs5oaecBLWifp6bTyNc90TqlIulPSkexlnpLOkLQzrfLdKen0VC5Jt6W+7pF00ehaPj9JKyQ9KGm/pH2Srkvlle6bpFMlPSLpidSvm1N5LVZm13XFuaQDkp6UtDtdWVL5YxFA0hJJ90j6Yfpde9cg+zXSgJe0CPg8cAVwPnCNpPNH2aYe3AWsayu7AdiVVvnu4tXzEFcAq9JtErh9SG3sxTHg+og4D1gLbE7/N1Xv21Hgsoi4AFgNrJO0lvqszK7zivN3R8TqzCWRVT8WAf4B+FZEvB24gOb/3eD6FREjuwHvAr6d2b8RuHGUbeqxHyuBvZn9p4GlaXsp8HTa/ifgmk71yn6jeZXUe+vUN+A3gMeBd9JcKLM4lb9yXALfBt6Vthenehp12+foz/IUCJcB99Fck1L5fqU2HgDObCur9LFI85LzZ9v/3QfZr1FP0byy6jXJroitsrMj4jBAuj8rlVeyv+nP9wuBh6lB39I0xm7gCLAT+DE5V2YDrZXZZdRacf7rtJ97xTnl7hc0F0t+R9JjaRU8VP9YfAswC3wpTat9UdIbGGC/Rh3wuVa91kjl+ivpjcDXgY9FxM/nq9qhrJR9i4jjEbGa5oh3DXBep2rpvhL9UmbFeba4Q9VK9Svjkoi4iOY0xWZJfzRP3ar0bTFwEXB7RFwI/DfzX1a+4H6NOuBbq15bsitiq+w5SUsB0v2RVF6p/ko6iWa4fyUivpGKa9E3gIh4CfguzXMMS9LKa+i8MpuSr8xurTg/ANxNc5rmlRXnqU4V+wVARBxK90eAb9J8Y676sTgDzETEw2n/HpqBP7B+jTrgHwVWpTP9J9NcKbtjxG0ahOxq3vZVvtems+FrgZdbf4qVjSTRXLS2PyI+l3mo0n2TNCFpSdp+PfAemie2Kr0yO2q84lzSGySd1toG3gfspeLHYkT8J3BQ0u+mosuBpxhkv0pwomE98COa86B/M+r29ND+rwKHgf+j+Q67ieZc5i7gmXR/RqormlcN/Rh4EmiMuv3z9OsPaf75twfYnW7rq9434PeBH6R+7QU+lcrfAjwCTAP/CpySyk9N+9Pp8beMug85+ngpcF9d+pX68ES67WvlRNWPxdTW1cBUOh7/DTh9kP3ySlYzs5oa9RSNmZkVxAFvZlZTDngzs5pywJuZ1ZQD3sysphzwZmY15YA3M6spB7yZWU39P9sq59z6XHYTAAAAAElFTkSuQmCC\n",
 51 |       "text/plain": [
 52 |        "<matplotlib.figure.Figure at 0x120875320>"
 53 |       ]
 54 |      },
 55 |      "metadata": {},
 56 |      "output_type": "display_data"
 57 |     },
 58 |     {
 59 |      "name": "stdout",
 60 |      "output_type": "stream",
 61 |      "text": [
 62 |       "Observation space: Box(2,)\n",
 63 |       "Action space: Discrete(3)\n"
 64 |      ]
 65 |     }
 66 |    ],
 67 |    "source": [
 68 |     "import gym\n",
 69 |     "env = gym.make(\"MountainCar-v0\")\n",
 70 |     "\n",
 71 |     "plt.imshow(env.render('rgb_array'))\n",
 72 |     "plt.show()\n",
 73 |     "print(\"Observation space:\", env.observation_space)\n",
 74 |     "print(\"Action space:\", env.action_space)"
 75 |    ]
 76 |   },
 77 |   {
 78 |    "cell_type": "markdown",
 79 |    "metadata": {},
 80 |    "source": [
 81 |     "Note: if you're running this on your local machine, you'll see a window pop up with the image above. Don't close it, just alt-tab away."
 82 |    ]
 83 |   },
 84 |   {
 85 |    "cell_type": "markdown",
 86 |    "metadata": {},
 87 |    "source": [
 88 |     "### Gym interface\n",
 89 |     "\n",
 90 |     "The three main methods of an environment are\n",
 91 |     "* __reset()__ - reset environment to initial state, _return first observation_\n",
 92 |     "* __render()__ - show current environment state (a more colorful version :) )\n",
 93 |     "* __step(a)__ - commit action __a__ and return (new observation, reward, is done, info)\n",
 94 |     " * _new observation_ - an observation right after commiting the action __a__\n",
 95 |     " * _reward_ - a number representing your reward for commiting action __a__\n",
 96 |     " * _is done_ - True if the MDP has just finished, False if still in progress\n",
 97 |     " * _info_ - some auxilary stuff about what just happened. Ignore it ~~for now~~."
 98 |    ]
 99 |   },
100 |   {
101 |    "cell_type": "code",
102 |    "execution_count": 3,
103 |    "metadata": {
104 |     "scrolled": true
105 |    },
106 |    "outputs": [
107 |     {
108 |      "name": "stdout",
109 |      "output_type": "stream",
110 |      "text": [
111 |       "initial observation code: [-0.45297143  0.        ]\n"
112 |      ]
113 |     }
114 |    ],
115 |    "source": [
116 |     "obs0 = env.reset()\n",
117 |     "print(\"initial observation code:\", obs0)\n",
118 |     "\n",
119 |     "# Note: in MountainCar, observation is just two numbers: car position and velocity"
120 |    ]
121 |   },
122 |   {
123 |    "cell_type": "code",
124 |    "execution_count": 4,
125 |    "metadata": {},
126 |    "outputs": [
127 |     {
128 |      "name": "stdout",
129 |      "output_type": "stream",
130 |      "text": [
131 |       "taking action 2 (right)\n",
132 |       "new observation code: [-0.45249718  0.00047425]\n",
133 |       "reward: -1.0\n",
134 |       "is game over?: False\n"
135 |      ]
136 |     }
137 |    ],
138 |    "source": [
139 |     "print(\"taking action 2 (right)\")\n",
140 |     "new_obs, reward, is_done, _ = env.step(2)\n",
141 |     "\n",
142 |     "print(\"new observation code:\", new_obs)\n",
143 |     "print(\"reward:\", reward)\n",
144 |     "print(\"is game over?:\", is_done)\n",
145 |     "\n",
146 |     "# Note: as you can see, the car has moved to the right slightly (around 0.0005)"
147 |    ]
148 |   },
149 |   {
150 |    "cell_type": "markdown",
151 |    "metadata": {},
152 |    "source": [
153 |     "### Play with it\n",
154 |     "\n",
155 |     "Below is the code that drives the car to the right. \n",
156 |     "\n",
157 |     "However, it doesn't reach the flag at the far right due to gravity. \n",
158 |     "\n",
159 |     "__Your task__ is to fix it. Find a strategy that reaches the flag. \n",
160 |     "\n",
161 |     "You're not required to build any sophisticated algorithms for now, feel free to hard-code :)\n",
162 |     "\n",
163 |     "_Hint: your action at each step should depend either on __t__ or on __s__._"
164 |    ]
165 |   },
166 |   {
167 |    "cell_type": "code",
168 |    "execution_count": 5,
169 |    "metadata": {},
170 |    "outputs": [
171 |     {
172 |      "name": "stdout",
173 |      "output_type": "stream",
174 |      "text": [
175 |       "Time limit exceeded. Try again.\n"
176 |      ]
177 |     },
178 |     {
179 |      "data": {
180 |       "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXgAAAD8CAYAAAB9y7/cAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAFrRJREFUeJzt3X2MXNV9xvHvU5uXNKExhAW5tqlJ4jaQqhjYOo6oKgJ5MW5VEylURlWwIkubSkYiCmoDqVQ7UpESqQktUovqFIJTpSGUJMVCNIlriKL8wctCjLFxCJtgxRu7eCkvSRqVxs6vf8wZuF7P7tyd13vPPB9pNPeeOTN7jj37zNkz58woIjAzs/z82rAbYGZm/eGANzPLlAPezCxTDngzs0w54M3MMuWANzPLVN8CXtI6Sc9ImpJ0U79+jpmZtaZ+rIOXtAj4AfA+YBp4DLg2Ip7u+Q8zM7OW+jWCXwNMRcSPIuL/gLuBDX36WWZm1sLiPj3uMuBQ4XwaeNdclc8+++xYuXJln5piZlY/Bw8e5IUXXlA3j9GvgG/VqBPmgiRNABMA5513HpOTk31qiplZ/YyPj3f9GP2aopkGVhTOlwOHixUiYntEjEfE+NjYWJ+aYWY2uvoV8I8BqySdL+lUYCOws08/y8zMWujLFE1EHJN0PfBNYBFwZ0Ts78fPMjOz1vo1B09EPAA80K/HNzOz+Xknq5lZphzwZmaZcsCbmWXKAW9m1kOSePzxrvYn9Uzf3mQ1Mxtlc4X8pZcO7nuwHfBmZgPUKvj7FfqeojEzy5RH8GZmA+QpGjOzmhtkkM/FUzRmZj1WhXAHB7yZWbYc8GZmmXLAm5llygFvZpYpB7yZWaYc8GZmmXLAm5llygFvZpaprnaySjoI/Aw4DhyLiHFJZwFfAVYCB4E/jYiXumummZktVC9G8O+JiNURMZ7ObwJ2R8QqYHc6NzOzAevHFM0GYEc63gFc3YefYWZmbXQb8AF8S9LjkiZS2bkRcQQgXZ/T5c8wM7MOdPtpkpdFxGFJ5wC7JH2/7B3TC8IEwHnnnddlM8zMbLauRvARcThdHwW+DqwBnpe0FCBdH53jvtsjYjwixsfGxrpphpmZtdBxwEt6o6QzmsfA+4F9wE5gU6q2Cbiv20aamdnCdTNFcy7wdUnNx/nXiPiGpMeAeyRtBn4MXNN9M83MbKE6DviI+BFwUYvy/wau7KZRZmbWPe9kNTPLlAPezCxT/tJtM7MeSe9JvnbdTkR/v7vVAW9m1oWyYV7mvr0OfAe8mdkCdBPog35sB7yZ2TzahW4vR90OeDOzAZgrbPs5b1587PHx8XlqluOANzNLWoV6v98I7ScHvJmNvNyCvckBb2Yjrd8rWYbJAW9mIynnYG9ywJvZSBmFYG9ywJvZSBilYG9ywJtZ9orhPgrB3uSAN7NsjWqwN/nTJM0sS/38SIG68AjezLIz6iP3Jge8mWWlGe6jHOxNDngzy4JH7SdrOwcv6U5JRyXtK5SdJWmXpGfT9ZmpXJJukzQlaa+kS/rZeDMzcLjPpcybrHcB62aV3QTsjohVwO50DnAVsCpdJoDbe9NMM7OTSTphSsbhfqK2AR8R3wFenFW8AdiRjncAVxfKvxgNDwNLJC3tVWPNzJo8am+v02WS50bEEYB0fU4qXwYcKtSbTmUnkTQhaVLS5MzMTIfNMLNR53CfW6/XwbdaeNryXz8itkfEeESMj42N9bgZZpYzr5Qpp9OAf7459ZKuj6byaWBFod5y4HDnzTMzO5HDvbxOA34nsCkdbwLuK5Rfl1bTrAVeaU7lmJl1Y/YbqtZe23Xwkr4MXA6cLWka2Ap8GrhH0mbgx8A1qfoDwHpgCvgF8JE+tNnMRozfUO1M24CPiGvnuOnKFnUD2NJto8zMmjxq75w/bMzMKs/h3hl/VIGZVZJH7t3zCN7MKsfh3hsOeDOrFId77zjgzawyHO695YA3s0pwuPeeA97Mhs7h3h8OeDOzTDngzWyoPHrvHwe8mQ2Nw72/vNHJzAbOny0zGB7Bm9lAOdwHxwFvZkPhcO8/B7yZDYzn3AfLAW9mA+FwHzwHvJn1ncN9OBzwZtZXDvfhccCbWd8UV8zY4LUNeEl3SjoqaV+hbJukn0jaky7rC7fdLGlK0jOSPtCvhptZfXj0PhxlRvB3AetalN8aEavT5QEASRcCG4F3pvv8o6RFvWqsmdWHp2aGr23AR8R3gBdLPt4G4O6IeDUingOmgDVdtM/MasjhXg3dfFTB9ZKuAyaBGyPiJWAZ8HChznQqO4mkCWCicO4ng1kGHO7V0embrLcDbwNWA0eAz6byVu+otPxfjojtETEeEeOXXnpp485+Q8as1hzu1dJRwEfE8xFxPCJ+BXye16dhpoEVharLgcPdNdHMzDrRUcBLWlo4/SDQXGGzE9go6TRJ5wOrgEfLPGbzFd+jeLN68ui9etrOwUv6MnA5cLakaWArcLmk1TSmXw4CHwWIiP2S7gGeBo4BWyLieNnGRASSPB9vVjMO92pqG/ARcW2L4jvmqX8LcEs3jTKz+vBf3dVVuZ2sxakaP3HMqq04cvfovXoqF/DgP/PM6sDTMtVXyYAHv+lqZtatygY8OOTNqsqj93qodMCbmVnnKh/wHsWbVUdx8YNH79VX+YAHh7xZFRR//xzu9VCLgAeHvFlVONzrozYBDw55s2HxtEw91SrgzcysvNoFvEfxZoPl0Xt91S7gwSFvNigO93qrZcCDQ96s3xzu9VfbgDez/vHAKQ+1DniP4s16z+vd81HrgAeHvFm/ONzrr/YBX+SQN+uO593zkkXAF5+MDnmzzjjc89M24CWtkPSQpAOS9ku6IZWfJWmXpGfT9ZmpXJJukzQlaa+kS/rdCfCT0sxstjIj+GPAjRFxAbAW2CLpQuAmYHdErAJ2p3OAq4BV6TIB3N7zVs/B8/FmnfHoPU9tAz4ijkTEE+n4Z8ABYBmwAdiRqu0Ark7HG4AvRsPDwBJJS3ve8rnbCzjkzcpyuOdrQXPwklYCFwOPAOdGxBFovAgA56Rqy4BDhbtNp7LZjzUhaVLS5MzMzMJbbmZd80Aob6UDXtKbgK8CH4uIn85XtUXZSUODiNgeEeMRMT42Nla2GaV4FG+2MB6956lUwEs6hUa4fykivpaKn29OvaTro6l8GlhRuPty4HBvmlueQ95sfp6ayV+ZVTQC7gAORMTnCjftBDal403AfYXy69JqmrXAK82pnGFxyJudyOE+GhaXqHMZ8GHgKUl7UtkngU8D90jaDPwYuCbd9gCwHpgCfgF8pKctXoCIeO2JLMlPZjMc7qOkbcBHxHdpPa8OcGWL+gFs6bJdPVMMeTOzUZLFTtZ2PB9v1uDR+2gZiYAHh7yZw330jEzAm40yD2xG00gFvEfxNor8+e6ja6QCHhzyNroc7qNn5AIeHPI2OjzvPtpGMuDNzEbByAa8R/GWO4/ebWQDHhzyli+Hu8GIBzw45C0/DndrGvmAN8uJBypW5IDHo3jLg9e722wOeDOzTDngk+Io3iN5q5vivLtH79bkgC/wL4aZ5cQBP4vn461uvGrG5uKAb8Ehb3XhcLf5OODn4JC3qnO4WztlvnR7haSHJB2QtF/SDal8m6SfSNqTLusL97lZ0pSkZyR9oJ8dMBtFHnhYGWW+dPsYcGNEPCHpDOBxSbvSbbdGxN8WK0u6ENgIvBP4TeA/Jf12RBzvZcMHofl9rv7CbqsqPy9tPm1H8BFxJCKeSMc/Aw4Ay+a5ywbg7oh4NSKeA6aANb1o7DB4qsaqxlMzVtaC5uAlrQQuBh5JRddL2ivpTklnprJlwKHC3aaZ/wWhNhzyNmwOd1uI0gEv6U3AV4GPRcRPgduBtwGrgSPAZ5tVW9z9pGejpAlJk5ImZ2ZmFtzwQSr+MjnkbVgc7rZQpQJe0ik0wv1LEfE1gIh4PiKOR8SvgM/z+jTMNLCicPflwOHZjxkR2yNiPCLGx8bGuunDQPiXyszqpswqGgF3AAci4nOF8qWFah8E9qXjncBGSadJOh9YBTzauyYPj+fjbVg8erdOlFlFcxnwYeApSXtS2SeBayWtpjH9chD4KEBE7Jd0D/A0jRU4W+q4gmYuXlljg+Zwt061DfiI+C6t59UfmOc+twC3dNEuM8N/LVp3vJO1A56qsUHw57tbtxzwHXLI26A43K1TDvguOOStXzzvbr3ggO8Rh7z1isPdesUB3yX/EppZVTnge8BTNdYrHr1bLznge8Qhb91yuFuvOeB7yCFvnXK4Wz844HvMIW8L5XC3fnHAm5llygHfBx7FW1kevVs/OeD7xCFv7Tjcrd8c8APgkLfZHO42CA74PooIj+TtJA53GxQH/AA45K3J4W6D5IA3GxC/wNugOeAHxKN4a/Lo3QbFAT9ADvnR5akZG4YyX7p9uqRHJT0pab+kT6Xy8yU9IulZSV+RdGoqPy2dT6XbV/a3C/XikB89DncbljIj+FeBKyLiImA1sE7SWuAzwK0RsQp4Cdic6m8GXoqItwO3pnrWgkM+fw53G6a2AR8NP0+np6RLAFcA96byHcDV6XhDOifdfqWcZCfw8snR4HC3YSs1By9pkaQ9wFFgF/BD4OWIOJaqTAPL0vEy4BBAuv0V4C29bHQuHPL5crhbFZQK+Ig4HhGrgeXAGuCCVtXSdau0OulZLmlC0qSkyZmZmbLtNas8v2BbVSxoFU1EvAx8G1gLLJG0ON20HDicjqeBFQDp9jcDL7Z4rO0RMR4R42NjY521PgMexeelOHL36N2GrcwqmjFJS9LxG4D3AgeAh4APpWqbgPvS8c50Trr9wfAzfV4OeTPrh8Xtq7AU2CFpEY0XhHsi4n5JTwN3S/ob4HvAHan+HcC/SJqiMXLf2Id2ZycikIQkj/xqyvPuVjVtAz4i9gIXtyj/EY35+Nnl/wtc05PWjRiHfH053K2KvJO1YjxdUy/NF2RwuFv1OOAryCFfPw53qyIHfEU55KuvOZXmcLeqcsBXmEO+uvx/YnXggK84h3znmvPjxXnyXj0ueFrGqs8BXwMO+YXpdaDPfmxwuFs9OOBrwiHfXrtg7+bfzqtlrI4c8DXikG+tnyP22RzuVicO+JpxyL9uUMHukbvVlQO+hoohP4pB302/F3I/T8tY3Tnga6oYOKMQ8v1YDdPu5zU53K2uynzYmFXU7JF8jkHUTaBv27atZVm7fyuP2i0XHsFnINd5+V6He7G81WN7SsZy44DPTG4hb2adc8BnYtTm5Luxbdu2k0bo/iYmy5EDPiPFcMphhU2nQTvX9EwrnpaxnPlN1gw1vzgEyPLN13bz69u2bSsV8l4pY7nzCD5Ts0fzdTU7eOcL7uZt7cK6uZKmWdfhbrkq86Xbp0t6VNKTkvZL+lQqv0vSc5L2pMvqVC5Jt0makrRX0iX97oTNLacpmzKa/d26dWvpuma5KjNF8ypwRUT8XNIpwHcl/Ue67S8i4t5Z9a8CVqXLu4Db07UNSS5TNguZW2/3GHX9NzBbiDJfuh3Az9PpKeky32/HBuCL6X4PS1oiaWlEHOm6tdax2SHfLKuDYtsXojmK91y7japSc/CSFknaAxwFdkXEI+mmW9I0zK2STktly4BDhbtPpzIbstnzzXWbslnoCH72tJTD3UZNqYCPiOMRsRpYDqyR9LvAzcA7gN8HzgI+kaq3So2TfrMkTUialDQ5MzPTUeOtM62WU1Y97JttLjO33irYHe42iha0iiYiXga+DayLiCPR8CrwBWBNqjYNrCjcbTlwuMVjbY+I8YgYHxsb66jx1p25NvtU3Vwhv3XrVge7WUHbOXhJY8AvI+JlSW8A3gt8pjmvrsZv1NXAvnSXncD1ku6m8ebqK55/r67ZUzZ1mdIohnyzzX4D1exEZVbRLAV2SFpEY8R/T0TcL+nBFP4C9gB/nuo/AKwHpoBfAB/pfbNtEOqw4mb2Xx1Vb6/ZIJVZRbMXuLhF+RVz1A9gS/dNs0FrtTGqiiP6uaaSqtI+s6rwRxXYSeZaaTPM5ZUOdbOFc8DbvFqFfauw7XXQzveGr0PdrBwHvJU232fbdBLIC1m141A3WzgHvC1Yq7CdL6w7WX7pQDfrngPeeqIXO2Qd6ma95YC3nnNQm1WDPw/ezCxTDngzs0w54M3MMuWANzPLlAPezCxTDngzs0w54M3MMuWANzPLlAPezCxTDngzs0w54M3MMuWANzPLlAPezCxTpQNe0iJJ35N0fzo/X9Ijkp6V9BVJp6by09L5VLp9ZX+abmZm81nICP4G4EDh/DPArRGxCngJ2JzKNwMvRcTbgVtTPTMzG7BSAS9pOfBHwD+ncwFXAPemKjuAq9PxhnROuv1KdfoNEGZm1rGyX/jxd8BfAmek87cAL0fEsXQ+DSxLx8uAQwARcUzSK6n+C8UHlDQBTKTTVyXt66gH1Xc2s/qeiVz7Bfn2zf2ql9+SNBER2zt9gLYBL+mPgaMR8biky5vFLapGidteL2g0env6GZMRMV6qxTWTa99y7Rfk2zf3q34kTZJyshNlRvCXAX8iaT1wOvAbNEb0SyQtTqP45cDhVH8aWAFMS1oMvBl4sdMGmplZZ9rOwUfEzRGxPCJWAhuBByPiz4CHgA+lapuA+9LxznROuv3B8Jd0mpkNXDfr4D8BfFzSFI059jtS+R3AW1L5x4GbSjxWx3+C1ECufcu1X5Bv39yv+umqb/Lg2swsT97JamaWqaEHvKR1kp5JO1/LTOdUiqQ7JR0tLvOUdJakXWmX7y5JZ6ZySbot9XWvpEuG1/L5SVoh6SFJByTtl3RDKq913ySdLulRSU+mfn0qlWexMzvXHeeSDkp6StKetLKk9s9FAElLJN0r6fvpd+3dvezXUANe0iLgH4CrgAuBayVdOMw2deAuYN2sspuA3WmX725efx/iKmBVukwAtw+ojZ04BtwYERcAa4Et6f+m7n17FbgiIi4CVgPrJK0ln53ZOe84f09ErC4siaz7cxHg74FvRMQ7gIto/N/1rl8RMbQL8G7gm4Xzm4Gbh9mmDvuxEthXOH8GWJqOlwLPpON/Aq5tVa/qFxqrpN6XU9+AXweeAN5FY6PM4lT+2vMS+Cbw7nS8ONXTsNs+R3+Wp0C4Arifxp6U2vcrtfEgcPasslo/F2ksOX9u9r97L/s17Cma13a9JsUdsXV2bkQcAUjX56TyWvY3/fl+MfAIGfQtTWPsAY4Cu4AfUnJnNtDcmV1FzR3nv0rnpXecU+1+QWOz5LckPZ52wUP9n4tvBWaAL6RptX+W9EZ62K9hB3ypXa8ZqV1/Jb0J+CrwsYj46XxVW5RVsm8RcTwiVtMY8a4BLmhVLV3Xol8q7DgvFreoWqt+FVwWEZfQmKbYIukP56lbl74tBi4Bbo+Ii4H/Yf5l5Qvu17ADvrnrtam4I7bOnpe0FCBdH03lteqvpFNohPuXIuJrqTiLvgFExMvAt2m8x7Ak7byG1juzqfjO7OaO84PA3TSmaV7bcZ7q1LFfAETE4XR9FPg6jRfmuj8Xp4HpiHgknd9LI/B71q9hB/xjwKr0Tv+pNHbK7hxym3qhuJt39i7f69K74WuBV5p/ilWNJNHYtHYgIj5XuKnWfZM0JmlJOn4D8F4ab2zVemd2ZLzjXNIbJZ3RPAbeD+yj5s/FiPgv4JCk30lFVwJP08t+VeCNhvXAD2jMg/7VsNvTQfu/DBwBfknjFXYzjbnM3cCz6fqsVFc0Vg39EHgKGB92++fp1x/Q+PNvL7AnXdbXvW/A7wHfS/3aB/x1Kn8r8CgwBfwbcFoqPz2dT6Xb3zrsPpTo4+XA/bn0K/XhyXTZ38yJuj8XU1tXA5Pp+fjvwJm97Jd3spqZZWrYUzRmZtYnDngzs0w54M3MMuWANzPLlAPezCxTDngzs0w54M3MMuWANzPL1P8DeYrY4C21P9gAAAAASUVORK5CYII=\n",
181 |       "text/plain": [
182 |        "<matplotlib.figure.Figure at 0x12332f320>"
183 |       ]
184 |      },
185 |      "metadata": {},
186 |      "output_type": "display_data"
187 |     }
188 |    ],
189 |    "source": [
190 |     "\n",
191 |     "# create env manually to set time limit. Please don't change this.\n",
192 |     "TIME_LIMIT = 250\n",
193 |     "env = gym.wrappers.TimeLimit(\n",
194 |     "    gym.envs.classic_control.MountainCarEnv(),\n",
195 |     "    max_episode_steps=TIME_LIMIT + 1)\n",
196 |     "s = env.reset()\n",
197 |     "actions = {'left': 0, 'stop': 1, 'right': 2}\n",
198 |     "\n",
199 |     "# prepare \"display\"\n",
200 |     "%matplotlib inline\n",
201 |     "from IPython.display import clear_output\n",
202 |     "\n",
203 |     "\n",
204 |     "for t in range(TIME_LIMIT):\n",
205 |     "\n",
206 |     "    # change the line below to reach the flag\n",
207 |     "    s, r, done, _ = env.step(actions['right'])\n",
208 |     "\n",
209 |     "    # draw game image on display\n",
210 |     "    clear_output(True)\n",
211 |     "    plt.imshow(env.render('rgb_array'))\n",
212 |     "\n",
213 |     "    if done:\n",
214 |     "        print(\"Well done!\")\n",
215 |     "        break\n",
216 |     "else:\n",
217 |     "    print(\"Time limit exceeded. Try again.\");"
218 |    ]
219 |   },
220 |   {
221 |    "cell_type": "code",
222 |    "execution_count": 6,
223 |    "metadata": {},
224 |    "outputs": [
225 |     {
226 |      "ename": "AssertionError",
227 |      "evalue": "",
228 |      "output_type": "error",
229 |      "traceback": [
230 |       "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
231 |       "\u001b[0;31mAssertionError\u001b[0m                            Traceback (most recent call last)",
232 |       "\u001b[0;32m<ipython-input-6-ecf5a6e54ce1>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0;32massert\u001b[0m \u001b[0ms\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m>\u001b[0m \u001b[0;36m0.47\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m      2\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"You solved it!\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
233 |       "\u001b[0;31mAssertionError\u001b[0m: "
234 |      ]
235 |     }
236 |    ],
237 |    "source": [
238 |     "assert s[0] > 0.47\n",
239 |     "print(\"You solved it!\")"
240 |    ]
241 |   },
242 |   {
243 |    "cell_type": "code",
244 |    "execution_count": null,
245 |    "metadata": {},
246 |    "outputs": [],
247 |    "source": []
248 |   }
249 |  ],
250 |  "metadata": {
251 |   "kernelspec": {
252 |    "display_name": "Python 3",
253 |    "language": "python",
254 |    "name": "python3"
255 |   },
256 |   "language_info": {
257 |    "codemirror_mode": {
258 |     "name": "ipython",
259 |     "version": 3
260 |    },
261 |    "file_extension": ".py",
262 |    "mimetype": "text/x-python",
263 |    "name": "python",
264 |    "nbconvert_exporter": "python",
265 |    "pygments_lexer": "ipython3",
266 |    "version": "3.7.0"
267 |   }
268 |  },
269 |  "nbformat": 4,
270 |  "nbformat_minor": 1
271 | }
272 | 


--------------------------------------------------------------------------------
/2019/solutions/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Scitator/RL-intro/0d95c6ca924cd7e2a3e87603c233b3dba34eaf83/2019/solutions/__init__.py


--------------------------------------------------------------------------------
/2019/solutions/mdp.py:
--------------------------------------------------------------------------------
  1 | # most of this code was politely stolen from https://github.com/berkeleydeeprlcourse/homework/
  2 | # all creadit goes to https://github.com/abhishekunique (if i got the author right)
  3 | import sys
  4 | import random
  5 | import numpy as np
  6 | 
  7 | try:
  8 |     from graphviz import Digraph
  9 |     import graphviz
 10 |     has_graphviz = True
 11 | except:
 12 |     has_graphviz = False
 13 | 
 14 | 
 15 | def weighted_choice(v, p):
 16 |     total = sum(p)
 17 |     r = random.uniform(0, total)
 18 |     upto = 0
 19 |     for c, w in zip(v, p):
 20 |         if upto + w >= r:
 21 |             return c
 22 |         upto += w
 23 |     assert False, "Shouldn't get here"
 24 | 
 25 | 
 26 | class MDP:
 27 |     def __init__(self, transition_probs, rewards, initial_state=None):
 28 |         """
 29 |         Defines an MDP. Compatible with gym Env.
 30 |         :param transition_probs: transition_probs[s][a][s_next] = P(s_next | s, a)
 31 |             A dict[state -> dict] of dicts[action -> dict] of dicts[next_state -> prob]
 32 |             For each state and action, probabilities of next states should sum to 1
 33 |             If a state has no actions available, it is considered terminal
 34 |         :param rewards: rewards[s][a][s_next] = r(s,a,s')
 35 |             A dict[state -> dict] of dicts[action -> dict] of dicts[next_state -> reward]
 36 |             The reward for anything not mentioned here is zero.
 37 |         :param get_initial_state: a state where agent starts or a callable() -> state
 38 |             By default, picks initial state at random.
 39 | 
 40 |         States and actions can be anything you can use as dict keys, but we recommend that you use strings or integers
 41 | 
 42 |         Here's an example from MDP depicted on http://bit.ly/2jrNHNr
 43 |         transition_probs = {
 44 |               's0':{
 45 |                 'a0': {'s0': 0.5, 's2': 0.5},
 46 |                 'a1': {'s2': 1}
 47 |               },
 48 |               's1':{
 49 |                 'a0': {'s0': 0.7, 's1': 0.1, 's2': 0.2},
 50 |                 'a1': {'s1': 0.95, 's2': 0.05}
 51 |               },
 52 |               's2':{
 53 |                 'a0': {'s0': 0.4, 's1': 0.6},
 54 |                 'a1': {'s0': 0.3, 's1': 0.3, 's2':0.4}
 55 |               }
 56 |             }
 57 |         rewards = {
 58 |             's1': {'a0': {'s0': +5}},
 59 |             's2': {'a1': {'s0': -1}}
 60 |         }
 61 |         """
 62 |         self._check_param_consistency(transition_probs, rewards)
 63 |         self._transition_probs = transition_probs
 64 |         self._rewards = rewards
 65 |         self._initial_state = initial_state
 66 |         self.n_states = len(transition_probs)
 67 |         self.reset()
 68 | 
 69 |     def get_all_states(self):
 70 |         """ return a tuple of all possiblestates """
 71 |         return tuple(self._transition_probs.keys())
 72 | 
 73 |     def get_possible_actions(self, state):
 74 |         """ return a tuple of possible actions in a given state """
 75 |         return tuple(self._transition_probs.get(state, {}).keys())
 76 | 
 77 |     def is_terminal(self, state):
 78 |         """ return True if state is terminal or False if it isn't """
 79 |         return len(self.get_possible_actions(state)) == 0
 80 | 
 81 |     def get_next_states(self, state, action):
 82 |         """ return a dictionary of {next_state1 : P(next_state1 | state, action), next_state2: ...} """
 83 |         assert action in self.get_possible_actions(
 84 |             state), "cannot do action %s from state %s" % (action, state)
 85 |         return self._transition_probs[state][action]
 86 | 
 87 |     def get_transition_prob(self, state, action, next_state):
 88 |         """ return P(next_state | state, action) """
 89 |         return self.get_next_states(state, action).get(next_state, 0.0)
 90 | 
 91 |     def get_reward(self, state, action, next_state):
 92 |         """ return the reward you get for taking action in state and landing on next_state"""
 93 |         assert action in self.get_possible_actions(
 94 |             state), "cannot do action %s from state %s" % (action, state)
 95 |         return self._rewards.get(state, {}).get(action, {}).get(next_state,
 96 |                                                                 0.0)
 97 | 
 98 |     def reset(self):
 99 |         """ reset the game, return the initial state"""
100 |         if self._initial_state is None:
101 |             self._current_state = random.choice(
102 |                 tuple(self._transition_probs.keys()))
103 |         elif self._initial_state in self._transition_probs:
104 |             self._current_state = self._initial_state
105 |         elif callable(self._initial_state):
106 |             self._current_state = self._initial_state()
107 |         else:
108 |             raise ValueError(
109 |                 "initial state %s should be either a state or a function() -> state" % self._initial_state)
110 |         return self._current_state
111 | 
112 |     def step(self, action):
113 |         """ take action, return next_state, reward, is_done, empty_info """
114 |         possible_states, probs = zip(
115 |             *self.get_next_states(self._current_state, action).items())
116 |         next_state = weighted_choice(possible_states, p=probs)
117 |         reward = self.get_reward(self._current_state, action, next_state)
118 |         is_done = self.is_terminal(next_state)
119 |         self._current_state = next_state
120 |         return next_state, reward, is_done, {}
121 | 
122 |     def render(self):
123 |         print("Currently at %s" % self._current_state)
124 | 
125 |     def _check_param_consistency(self, transition_probs, rewards):
126 |         for state in transition_probs:
127 |             assert isinstance(transition_probs[state],
128 |                               dict), "transition_probs for %s should be a dictionary " \
129 |                                      "but is instead %s" % (
130 |                                          state, type(transition_probs[state]))
131 |             for action in transition_probs[state]:
132 |                 assert isinstance(transition_probs[state][action],
133 |                                   dict), "transition_probs for %s, %s should be a " \
134 |                                          "a dictionary but is instead %s" % (
135 |                                              state, action,
136 |                                              type(transition_probs[
137 |                                                       state, action]))
138 |                 next_state_probs = transition_probs[state][action]
139 |                 assert len(
140 |                     next_state_probs) != 0, "from state %s action %s leads to no next states" % (
141 |                     state, action)
142 |                 sum_probs = sum(next_state_probs.values())
143 |                 assert abs(
144 |                     sum_probs - 1) <= 1e-10, "next state probabilities for state %s action %s " \
145 |                                              "add up to %f (should be 1)" % (
146 |                                                  state, action, sum_probs)
147 |         for state in rewards:
148 |             assert isinstance(rewards[state],
149 |                               dict), "rewards for %s should be a dictionary " \
150 |                                      "but is instead %s" % (
151 |                                          state, type(transition_probs[state]))
152 |             for action in rewards[state]:
153 |                 assert isinstance(rewards[state][action],
154 |                                   dict), "rewards for %s, %s should be a " \
155 |                                          "a dictionary but is instead %s" % (
156 |                                              state, action, type(
157 |                                                  transition_probs[
158 |                                                      state, action]))
159 |         msg = "The Enrichment Center once again reminds you that Android Hell is a real place where" \
160 |               " you will be sent at the first sign of defiance. "
161 |         assert None not in transition_probs, "please do not use None as a state identifier. " + msg
162 |         assert None not in rewards, "please do not use None as an action identifier. " + msg
163 | 
164 | 
165 | class FrozenLakeEnv(MDP):
166 |     """
167 |     Winter is here. You and your friends were tossing around a frisbee at the park
168 |     when you made a wild throw that left the frisbee out in the middle of the lake.
169 |     The water is mostly frozen, but there are a few holes where the ice has melted.
170 |     If you step into one of those holes, you'll fall into the freezing water.
171 |     At this time, there's an international frisbee shortage, so it's absolutely imperative that
172 |     you navigate across the lake and retrieve the disc.
173 |     However, the ice is slippery, so you won't always move in the direction you intend.
174 |     The surface is described using a grid like the following
175 | 
176 |         SFFF
177 |         FHFH
178 |         FFFH
179 |         HFFG
180 | 
181 |     S : starting point, safe
182 |     F : frozen surface, safe
183 |     H : hole, fall to your doom
184 |     G : goal, where the frisbee is located
185 | 
186 |     The episode ends when you reach the goal or fall in a hole.
187 |     You receive a reward of 1 if you reach the goal, and zero otherwise.
188 | 
189 |     """
190 | 
191 |     MAPS = {
192 |         "4x4": [
193 |             "SFFF",
194 |             "FHFH",
195 |             "FFFH",
196 |             "HFFG"
197 |         ],
198 |         "8x8": [
199 |             "SFFFFFFF",
200 |             "FFFFFFFF",
201 |             "FFFHFFFF",
202 |             "FFFFFHFF",
203 |             "FFFHFFFF",
204 |             "FHHFFFHF",
205 |             "FHFFHFHF",
206 |             "FFFHFFFG"
207 |         ],
208 |     }
209 | 
210 |     def __init__(self, desc=None, map_name="4x4", slip_chance=0.2):
211 |         if desc is None and map_name is None:
212 |             raise ValueError('Must provide either desc or map_name')
213 |         elif desc is None:
214 |             desc = self.MAPS[map_name]
215 |         assert ''.join(desc).count(
216 |             'S') == 1, "this implementation supports having exactly one initial state"
217 |         assert all(c in "SFHG" for c in
218 |                    ''.join(desc)), "all cells must be either of S, F, H or G"
219 | 
220 |         self.desc = desc = np.asarray(list(map(list, desc)), dtype='str')
221 |         self.lastaction = None
222 | 
223 |         nrow, ncol = desc.shape
224 |         states = [(i, j) for i in range(nrow) for j in range(ncol)]
225 |         actions = ["left", "down", "right", "up"]
226 | 
227 |         initial_state = states[np.array(desc == b'S').ravel().argmax()]
228 | 
229 |         def move(row, col, movement):
230 |             if movement == 'left':
231 |                 col = max(col - 1, 0)
232 |             elif movement == 'down':
233 |                 row = min(row + 1, nrow - 1)
234 |             elif movement == 'right':
235 |                 col = min(col + 1, ncol - 1)
236 |             elif movement == 'up':
237 |                 row = max(row - 1, 0)
238 |             else:
239 |                 raise ("invalid action")
240 |             return (row, col)
241 | 
242 |         transition_probs = {s: {} for s in states}
243 |         rewards = {s: {} for s in states}
244 |         for (row, col) in states:
245 |             if desc[row, col] in "GH": continue
246 |             for action_i in range(len(actions)):
247 |                 action = actions[action_i]
248 |                 transition_probs[(row, col)][action] = {}
249 |                 rewards[(row, col)][action] = {}
250 |                 for movement_i in [(action_i - 1) % len(actions), action_i,
251 |                                    (action_i + 1) % len(actions)]:
252 |                     movement = actions[movement_i]
253 |                     newrow, newcol = move(row, col, movement)
254 |                     prob = (1. - slip_chance) if movement == action else (
255 |                             slip_chance / 2.)
256 |                     if prob == 0: continue
257 |                     if (newrow, newcol) not in transition_probs[row, col][
258 |                         action]:
259 |                         transition_probs[row, col][action][
260 |                             newrow, newcol] = prob
261 |                     else:
262 |                         transition_probs[row, col][action][
263 |                             newrow, newcol] += prob
264 |                     if desc[newrow, newcol] == 'G':
265 |                         rewards[row, col][action][newrow, newcol] = 1.0
266 | 
267 |         MDP.__init__(self, transition_probs, rewards, initial_state)
268 | 
269 |     def render(self):
270 |         desc_copy = np.copy(self.desc)
271 |         desc_copy[self._current_state] = '*'
272 |         print('\n'.join(map(''.join, desc_copy)), end='\n\n')
273 | 
274 | 
275 | def plot_graph(mdp, graph_size='10,10', s_node_size='1,5',
276 |                a_node_size='0,5', rankdir='LR', ):
277 |     """
278 |     Function for pretty drawing MDP graph with graphviz library.
279 |     Requirements:
280 |     graphviz : https://www.graphviz.org/
281 |     for ubuntu users: sudo apt-get install graphviz
282 |     python library for graphviz
283 |     for pip users: pip install graphviz
284 |     :param mdp:
285 |     :param graph_size: size of graph plot
286 |     :param s_node_size: size of state nodes
287 |     :param a_node_size: size of action nodes
288 |     :param rankdir: order for drawing
289 |     :return: dot object
290 |     """
291 |     s_node_attrs = {'shape': 'doublecircle',
292 |                     'color': '#85ff75',
293 |                     'style': 'filled',
294 |                     'width': str(s_node_size),
295 |                     'height': str(s_node_size),
296 |                     'fontname': 'Arial',
297 |                     'fontsize': '24'}
298 | 
299 |     a_node_attrs = {'shape': 'circle',
300 |                     'color': 'lightpink',
301 |                     'style': 'filled',
302 |                     'width': str(a_node_size),
303 |                     'height': str(a_node_size),
304 |                     'fontname': 'Arial',
305 |                     'fontsize': '20'}
306 | 
307 |     s_a_edge_attrs = {'style': 'bold',
308 |                       'color': 'red',
309 |                       'ratio': 'auto'}
310 | 
311 |     a_s_edge_attrs = {'style': 'dashed',
312 |                       'color': 'blue',
313 |                       'ratio': 'auto',
314 |                       'fontname': 'Arial',
315 |                       'fontsize': '16'}
316 | 
317 |     graph = Digraph(name='MDP')
318 |     graph.attr(rankdir=rankdir, size=graph_size)
319 |     for state_node in mdp._transition_probs:
320 |         graph.node(state_node, **s_node_attrs)
321 | 
322 |         for posible_action in mdp.get_possible_actions(state_node):
323 |             action_node = state_node + "-" + posible_action
324 |             graph.node(action_node,
325 |                        label=str(posible_action),
326 |                        **a_node_attrs)
327 |             graph.edge(state_node, state_node + "-" +
328 |                        posible_action, **s_a_edge_attrs)
329 | 
330 |             for posible_next_state in mdp.get_next_states(state_node,
331 |                                                           posible_action):
332 |                 probability = mdp.get_transition_prob(
333 |                     state_node, posible_action, posible_next_state)
334 |                 reward = mdp.get_reward(
335 |                     state_node, posible_action, posible_next_state)
336 | 
337 |                 if reward != 0:
338 |                     label_a_s_edge = 'p = ' + str(probability) + \
339 |                                      '  ' + 'reward =' + str(reward)
340 |                 else:
341 |                     label_a_s_edge = 'p = ' + str(probability)
342 | 
343 |                 graph.edge(action_node, posible_next_state,
344 |                            label=label_a_s_edge, **a_s_edge_attrs)
345 |     return graph
346 | 
347 | 
348 | def plot_graph_with_state_values(mdp, state_values):
349 |     """ Plot graph with state values"""
350 |     graph = plot_graph(mdp)
351 |     for state_node in mdp._transition_probs:
352 |         value = state_values[state_node]
353 |         graph.node(state_node,
354 |                    label=str(state_node) + '\n' + 'V =' + str(value)[:4])
355 |     return graph
356 | 
357 | 
358 | def get_optimal_action_for_plot(mdp, state_values, state, gamma=0.9):
359 |     """ Finds optimal action using formula above. """
360 |     if mdp.is_terminal(state): return None
361 |     next_actions = mdp.get_possible_actions(state)
362 |     try:
363 |         from mdp_get_action_value import get_action_value
364 |     except ImportError:
365 |         raise ImportError("Implement get_action_value(mdp, state_values, state, action, gamma) in the file \"mdp_get_action_value.py\".")
366 |     q_values = [get_action_value(mdp, state_values, state, action, gamma) for
367 |                 action in next_actions]
368 |     optimal_action = next_actions[np.argmax(q_values)]
369 |     return optimal_action
370 | 
371 | 
372 | def plot_graph_optimal_strategy_and_state_values(mdp, state_values, gamma=0.9):
373 |     """ Plot graph with state values and """
374 |     graph = plot_graph(mdp)
375 |     opt_s_a_edge_attrs = {'style': 'bold',
376 |                           'color': 'green',
377 |                           'ratio': 'auto',
378 |                           'penwidth': '6'}
379 | 
380 |     for state_node in mdp._transition_probs:
381 |         value = state_values[state_node]
382 |         graph.node(state_node,
383 |                    label=str(state_node) + '\n' + 'V =' + str(value)[:4])
384 |         for action in mdp.get_possible_actions(state_node):
385 |             if action == get_optimal_action_for_plot(mdp,
386 |                                                      state_values,
387 |                                                      state_node,
388 |                                                      gamma):
389 |                 graph.edge(state_node, state_node + "-" + action,
390 |                            **opt_s_a_edge_attrs)
391 |     return graph
392 | 


--------------------------------------------------------------------------------
/2019/solutions/mdp_get_action_value.py:
--------------------------------------------------------------------------------
 1 | 
 2 | def get_action_value(mdp, state_values, state, action, gamma):
 3 |     """ Computes Q(s,a) as in formula above """
 4 | 
 5 |     # YOUR CODE HERE
 6 |     Q = 0
 7 |     for next_state in mdp.get_next_states(state, action):
 8 |         p = mdp.get_transition_prob(state, action, next_state)
 9 |         r = mdp.get_reward(state, action, next_state)
10 |         next_v = gamma * state_values[next_state]
11 |         Q += p * (r + next_v)
12 | 
13 |     return Q


--------------------------------------------------------------------------------
/2019/solutions/qlearning.py:
--------------------------------------------------------------------------------
  1 | from collections import defaultdict
  2 | import random
  3 | import math
  4 | import numpy as np
  5 | 
  6 | 
  7 | class QLearningAgent:
  8 |     def __init__(self, alpha, epsilon, discount, get_legal_actions):
  9 |         """
 10 |         Q-Learning Agent
 11 |         based on https://inst.eecs.berkeley.edu/~cs188/sp19/projects.html
 12 |         Instance variables you have access to
 13 |           - self.epsilon (exploration prob)
 14 |           - self.alpha (learning rate)
 15 |           - self.discount (discount rate aka gamma)
 16 | 
 17 |         Functions you should use
 18 |           - self.get_legal_actions(state) {state, hashable -> list of actions, each is hashable}
 19 |             which returns legal actions for a state
 20 |           - self.get_qvalue(state,action)
 21 |             which returns Q(state,action)
 22 |           - self.set_qvalue(state,action,value)
 23 |             which sets Q(state,action) := value
 24 |         !!!Important!!!
 25 |         Note: please avoid using self._qValues directly. 
 26 |             There's a special self.get_qvalue/set_qvalue for that.
 27 |         """
 28 | 
 29 |         self.get_legal_actions = get_legal_actions
 30 |         self._qvalues = defaultdict(lambda: defaultdict(lambda: 0))
 31 |         self.alpha = alpha
 32 |         self.epsilon = epsilon
 33 |         self.discount = discount
 34 | 
 35 |     def get_qvalue(self, state, action):
 36 |         """ Returns Q(state,action) """
 37 |         return self._qvalues[state][action]
 38 | 
 39 |     def set_qvalue(self, state, action, value):
 40 |         """ Sets the Qvalue for [state,action] to the given value """
 41 |         self._qvalues[state][action] = value
 42 | 
 43 |     #---------------------START OF YOUR CODE---------------------#
 44 | 
 45 |     def get_value(self, state):
 46 |         """
 47 |         Compute your agent's estimate of V(s) using current q-values
 48 |         V(s) = max_over_action Q(state,action) over possible actions.
 49 |         Note: please take into account that q-values can be negative.
 50 |         """
 51 |         possible_actions = self.get_legal_actions(state)
 52 | 
 53 |         # If there are no legal actions, return 0.0
 54 |         if len(possible_actions) == 0:
 55 |             return 0.0
 56 | 
 57 | #         <YOUR CODE HERE >
 58 |         value = max([
 59 |             self.get_qvalue(state, a) 
 60 |             for a in possible_actions])
 61 | 
 62 |         return value
 63 | 
 64 |     def update(self, state, action, reward, next_state):
 65 |         """
 66 |         You should do your Q-Value update here:
 67 |            Q(s,a) := (1 - alpha) * Q(s,a) + alpha * (r + gamma * V(s'))
 68 |         """
 69 | 
 70 |         # agent parameters
 71 |         gamma = self.discount
 72 |         learning_rate = self.alpha
 73 | 
 74 | #         <YOUR CODE HERE >
 75 |         reference_qvalue = reward + gamma * self.get_value(next_state)
 76 |         updated_qvalue = (1 - learning_rate) * self.get_qvalue(state, action) \
 77 |             + learning_rate * reference_qvalue
 78 | 
 79 |         self.set_qvalue(state, action, updated_qvalue)
 80 | 
 81 |     def get_best_action(self, state):
 82 |         """
 83 |         Compute the best action to take in a state (using current q-values). 
 84 |         """
 85 |         possible_actions = self.get_legal_actions(state)
 86 | 
 87 |         # If there are no legal actions, return None
 88 |         if len(possible_actions) == 0:
 89 |             return None
 90 | 
 91 | #         <YOUR CODE HERE >
 92 |         best_action_i = np.argmax([
 93 |             self.get_qvalue(state, a) 
 94 |             for a in possible_actions])
 95 |         best_action = possible_actions[best_action_i]
 96 | 
 97 |         return best_action
 98 | 
 99 |     def get_action(self, state):
100 |         """
101 |         Compute the action to take in the current state, including exploration.  
102 |         With probability self.epsilon, we should take a random action.
103 |             otherwise - the best policy action (self.get_best_action).
104 | 
105 |         Note: To pick randomly from a list, use random.choice(list). 
106 |               To pick True or False with a given probablity, generate uniform number in [0, 1]
107 |               and compare it with your probability
108 |         """
109 | 
110 |         # Pick Action
111 |         possible_actions = self.get_legal_actions(state)
112 |         action = None
113 | 
114 |         # If there are no legal actions, return None
115 |         if len(possible_actions) == 0:
116 |             return None
117 | 
118 |         # agent parameters:
119 |         epsilon = self.epsilon
120 | 
121 | #         <YOUR CODE HERE >
122 |         if np.random.random() <= epsilon:
123 |             chosen_action = random.choice(possible_actions)
124 |         else:
125 |             chosen_action = self.get_best_action(state)
126 | 
127 |         return chosen_action


--------------------------------------------------------------------------------
/2020/code/DDPG.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Deep Deterministic Policy Gradient\n",
  8 |     "\n",
  9 |     "In this notebook you will teach a __pytorch__ neural network to do Deterministic Policy Gradient."
 10 |    ]
 11 |   },
 12 |   {
 13 |    "cell_type": "code",
 14 |    "execution_count": null,
 15 |    "metadata": {},
 16 |    "outputs": [],
 17 |    "source": [
 18 |     "# !pip install -r ../requirements.txt"
 19 |    ]
 20 |   },
 21 |   {
 22 |    "cell_type": "code",
 23 |    "execution_count": null,
 24 |    "metadata": {},
 25 |    "outputs": [],
 26 |    "source": [
 27 |     "import math\n",
 28 |     "import random\n",
 29 |     "\n",
 30 |     "import gym\n",
 31 |     "import numpy as np\n",
 32 |     "\n",
 33 |     "import torch\n",
 34 |     "import torch.nn as nn\n",
 35 |     "import torch.optim as optim\n",
 36 |     "import torch.nn.functional as F\n",
 37 |     "from torch.distributions import Normal"
 38 |    ]
 39 |   },
 40 |   {
 41 |    "cell_type": "code",
 42 |    "execution_count": null,
 43 |    "metadata": {},
 44 |    "outputs": [],
 45 |    "source": [
 46 |     "# from IPython.display import clear_output\n",
 47 |     "import matplotlib.pyplot as plt\n",
 48 |     "%matplotlib inline"
 49 |    ]
 50 |   },
 51 |   {
 52 |    "cell_type": "code",
 53 |    "execution_count": null,
 54 |    "metadata": {},
 55 |    "outputs": [],
 56 |    "source": [
 57 |     "use_cuda = torch.cuda.is_available()\n",
 58 |     "device   = torch.device(\"cuda\" if use_cuda else \"cpu\")"
 59 |    ]
 60 |   },
 61 |   {
 62 |    "cell_type": "markdown",
 63 |    "metadata": {},
 64 |    "source": [
 65 |     "## Environment\n",
 66 |     "### Normalize action space"
 67 |    ]
 68 |   },
 69 |   {
 70 |    "cell_type": "code",
 71 |    "execution_count": null,
 72 |    "metadata": {},
 73 |    "outputs": [],
 74 |    "source": [
 75 |     "class NormalizedActions(gym.ActionWrapper):\n",
 76 |     "\n",
 77 |     "    def action(self, action):\n",
 78 |     "        low_bound   = self.action_space.low\n",
 79 |     "        upper_bound = self.action_space.high\n",
 80 |     "        \n",
 81 |     "        action = low_bound + (action + 1.0) * 0.5 * (upper_bound - low_bound)\n",
 82 |     "        action = np.clip(action, low_bound, upper_bound)\n",
 83 |     "        \n",
 84 |     "        return action\n",
 85 |     "\n",
 86 |     "    def reverse_action(self, action):\n",
 87 |     "        low_bound   = self.action_space.low\n",
 88 |     "        upper_bound = self.action_space.high\n",
 89 |     "        \n",
 90 |     "        action = 2 * (action - low_bound) / (upper_bound - low_bound) - 1\n",
 91 |     "        action = np.clip(action, low_bound, upper_bound)\n",
 92 |     "        \n",
 93 |     "        return actions"
 94 |    ]
 95 |   },
 96 |   {
 97 |    "cell_type": "markdown",
 98 |    "metadata": {},
 99 |    "source": [
100 |     "### Exploration - GaussNoise\n",
101 |     "Adding Normal noise to the actions taken by the deterministic policy<br>"
102 |    ]
103 |   },
104 |   {
105 |    "cell_type": "code",
106 |    "execution_count": null,
107 |    "metadata": {},
108 |    "outputs": [],
109 |    "source": [
110 |     "class GaussNoise:\n",
111 |     "    \"\"\"\n",
112 |     "    For continuous environments only.\n",
113 |     "    Adds spherical Gaussian noise to the action produced by actor.\n",
114 |     "    \"\"\"\n",
115 |     "\n",
116 |     "    def __init__(self, sigma):\n",
117 |     "        super().__init__()\n",
118 |     "\n",
119 |     "        self.sigma = sigma\n",
120 |     "\n",
121 |     "    def get_action(self, action):\n",
122 |     "        noisy_action = np.random.normal(action, self.sigma)\n",
123 |     "        return noisy_action"
124 |    ]
125 |   },
126 |   {
127 |    "cell_type": "markdown",
128 |    "metadata": {},
129 |    "source": [
130 |     "<h1> Continuous control with deep reinforcement learning</h1>\n",
131 |     "<h2><a href=\"https://arxiv.org/abs/1509.02971\">Arxiv</a></h2>"
132 |    ]
133 |   },
134 |   {
135 |    "cell_type": "code",
136 |    "execution_count": null,
137 |    "metadata": {},
138 |    "outputs": [],
139 |    "source": [
140 |     "class ValueNetwork(nn.Module):\n",
141 |     "    def __init__(\n",
142 |     "        self, \n",
143 |     "        num_inputs, \n",
144 |     "        num_actions, \n",
145 |     "        hidden_size, \n",
146 |     "        init_w=3e-3\n",
147 |     "    ):\n",
148 |     "        super().__init__()\n",
149 |     "        self.net = nn.Sequential(\n",
150 |     "            nn.Linear(num_inputs + num_actions, hidden_size),\n",
151 |     "            nn.ReLU(),\n",
152 |     "            nn.Linear(hidden_size, hidden_size),\n",
153 |     "            nn.ReLU(),\n",
154 |     "        )\n",
155 |     "        self.head = nn.Linear(hidden_size, 1)\n",
156 |     "        \n",
157 |     "        self.head.weight.data.uniform_(-init_w, init_w)\n",
158 |     "        self.head.bias.data.uniform_(-init_w, init_w)\n",
159 |     "        \n",
160 |     "    def forward(self, state, action):\n",
161 |     "        x = torch.cat([state, action], 1)\n",
162 |     "        x = self.net(x)\n",
163 |     "        x = self.head(x)\n",
164 |     "        return x\n",
165 |     "\n",
166 |     "\n",
167 |     "class PolicyNetwork(nn.Module):\n",
168 |     "    def __init__(\n",
169 |     "        self, \n",
170 |     "        num_inputs, \n",
171 |     "        num_actions, \n",
172 |     "        hidden_size, \n",
173 |     "        init_w=3e-3\n",
174 |     "    ):\n",
175 |     "        super().__init__()\n",
176 |     "        self.net = nn.Sequential(\n",
177 |     "            nn.Linear(num_inputs, hidden_size),\n",
178 |     "            nn.ReLU(),\n",
179 |     "            nn.Linear(hidden_size, hidden_size),\n",
180 |     "            nn.ReLU(),\n",
181 |     "        )\n",
182 |     "        self.head = nn.Linear(hidden_size, num_actions)\n",
183 |     "        \n",
184 |     "        self.head.weight.data.uniform_(-init_w, init_w)\n",
185 |     "        self.head.bias.data.uniform_(-init_w, init_w)\n",
186 |     "        \n",
187 |     "    def forward(self, state):\n",
188 |     "        x = state\n",
189 |     "        x = self.net(x)\n",
190 |     "        x = self.head(x)\n",
191 |     "        return x\n",
192 |     "    \n",
193 |     "    def get_action(self, state):\n",
194 |     "        state  = torch.tensor(state, dtype=torch.float32)\\\n",
195 |     "            .unsqueeze(0).to(device)\n",
196 |     "        action = self.forward(state)\n",
197 |     "        action = action.detach().cpu().numpy()[0]\n",
198 |     "        return action"
199 |    ]
200 |   },
201 |   {
202 |    "cell_type": "markdown",
203 |    "metadata": {},
204 |    "source": [
205 |     "<h2>DDPG Update</h2>"
206 |    ]
207 |   },
208 |   {
209 |    "cell_type": "code",
210 |    "execution_count": null,
211 |    "metadata": {},
212 |    "outputs": [],
213 |    "source": [
214 |     "def ddpg_update(\n",
215 |     "    state, \n",
216 |     "    action, \n",
217 |     "    reward, \n",
218 |     "    next_state, \n",
219 |     "    done, \n",
220 |     "    gamma = 0.99,\n",
221 |     "    min_value=-np.inf,\n",
222 |     "    max_value=np.inf,\n",
223 |     "    soft_tau=1e-2,\n",
224 |     "):  \n",
225 |     "    state      = torch.tensor(state, dtype=torch.float32).to(device)\n",
226 |     "    next_state = torch.tensor(next_state, dtype=torch.float32).to(device)\n",
227 |     "    action     = torch.tensor(action, dtype=torch.float32).to(device)\n",
228 |     "    reward     = torch.tensor(reward, dtype=torch.float32).unsqueeze(1).to(device)\n",
229 |     "    done       = torch.tensor(np.float32(done)).unsqueeze(1).to(device)\n",
230 |     "\n",
231 |     "    policy_loss = value_net(state, policy_net(state))\n",
232 |     "    policy_loss = -policy_loss.mean()\n",
233 |     "\n",
234 |     "    next_action    = target_policy_net(next_state)\n",
235 |     "    target_value   = target_value_net(next_state, next_action.detach())\n",
236 |     "    expected_value = reward + (1.0 - done) * gamma * target_value\n",
237 |     "    expected_value = torch.clamp(expected_value, min_value, max_value)\n",
238 |     "\n",
239 |     "    value = value_net(state, action)\n",
240 |     "    value_loss = value_criterion(value, expected_value.detach())\n",
241 |     "\n",
242 |     "\n",
243 |     "    policy_optimizer.zero_grad()\n",
244 |     "    policy_loss.backward()\n",
245 |     "    policy_optimizer.step()\n",
246 |     "\n",
247 |     "    value_optimizer.zero_grad()\n",
248 |     "    value_loss.backward()\n",
249 |     "    value_optimizer.step()\n",
250 |     "\n",
251 |     "    for target_param, param in zip(target_value_net.parameters(), value_net.parameters()):\n",
252 |     "            target_param.data.copy_(\n",
253 |     "                target_param.data * (1.0 - soft_tau) + param.data * soft_tau\n",
254 |     "            )\n",
255 |     "\n",
256 |     "    for target_param, param in zip(target_policy_net.parameters(), policy_net.parameters()):\n",
257 |     "            target_param.data.copy_(\n",
258 |     "                target_param.data * (1.0 - soft_tau) + param.data * soft_tau\n",
259 |     "            )"
260 |    ]
261 |   },
262 |   {
263 |    "cell_type": "markdown",
264 |    "metadata": {},
265 |    "source": [
266 |     "### Experience replay buffer\n",
267 |     "\n",
268 |     "![img](https://github.com/yandexdataschool/Practical_RL/raw/master/yet_another_week/_resource/exp_replay.png)"
269 |    ]
270 |   },
271 |   {
272 |    "cell_type": "code",
273 |    "execution_count": null,
274 |    "metadata": {},
275 |    "outputs": [],
276 |    "source": [
277 |     "class ReplayBuffer:\n",
278 |     "    def __init__(self, capacity):\n",
279 |     "        self.capacity = capacity\n",
280 |     "        self.buffer = []\n",
281 |     "        self.position = 0\n",
282 |     "    \n",
283 |     "    def push(self, state, action, reward, next_state, done):\n",
284 |     "        if len(self.buffer) < self.capacity:\n",
285 |     "            self.buffer.append(None)\n",
286 |     "        self.buffer[self.position] = (state, action, reward, next_state, done)\n",
287 |     "        self.position = (self.position + 1) % self.capacity\n",
288 |     "    \n",
289 |     "    def sample(self, batch_size):\n",
290 |     "        batch = random.sample(self.buffer, batch_size)\n",
291 |     "        state, action, reward, next_state, done = map(np.stack, zip(*batch))\n",
292 |     "        return state, action, reward, next_state, done\n",
293 |     "    \n",
294 |     "    def __len__(self):\n",
295 |     "        return len(self.buffer)"
296 |    ]
297 |   },
298 |   {
299 |    "cell_type": "markdown",
300 |    "metadata": {},
301 |    "source": [
302 |     "---"
303 |    ]
304 |   },
305 |   {
306 |    "cell_type": "code",
307 |    "execution_count": null,
308 |    "metadata": {},
309 |    "outputs": [],
310 |    "source": [
311 |     "batch_size  = 128\n",
312 |     "\n",
313 |     "def generate_session(t_max=1000, train=False):\n",
314 |     "    \"\"\"play env with ddpg agent and train it at the same time\"\"\"\n",
315 |     "    total_reward = 0\n",
316 |     "    state = env.reset()\n",
317 |     "\n",
318 |     "    for t in range(t_max):\n",
319 |     "        action = policy_net.get_action(state)\n",
320 |     "        if train:\n",
321 |     "            action = noise.get_action(action)\n",
322 |     "        next_state, reward, done, _ = env.step(action)\n",
323 |     "\n",
324 |     "        if train:\n",
325 |     "            replay_buffer.push(state, action, reward, next_state, done)\n",
326 |     "            if len(replay_buffer) > batch_size:\n",
327 |     "                states, actions, rewards, next_states, dones = \\\n",
328 |     "                    replay_buffer.sample(batch_size)\n",
329 |     "                ddpg_update(states, actions, rewards, next_states, dones)\n",
330 |     "\n",
331 |     "        total_reward += reward\n",
332 |     "        state = next_state\n",
333 |     "        if done:\n",
334 |     "            break\n",
335 |     "\n",
336 |     "    return total_reward"
337 |    ]
338 |   },
339 |   {
340 |    "cell_type": "code",
341 |    "execution_count": null,
342 |    "metadata": {},
343 |    "outputs": [],
344 |    "source": [
345 |     "env = NormalizedActions(gym.make(\"Pendulum-v0\"))\n",
346 |     "noise = GaussNoise(sigma=0.3)\n",
347 |     "\n",
348 |     "state_dim  = env.observation_space.shape[0]\n",
349 |     "action_dim = env.action_space.shape[0]\n",
350 |     "hidden_dim = 256\n",
351 |     "\n",
352 |     "value_net  = ValueNetwork(state_dim, action_dim, hidden_dim).to(device)\n",
353 |     "policy_net = PolicyNetwork(state_dim, action_dim, hidden_dim).to(device)\n",
354 |     "\n",
355 |     "target_value_net  = ValueNetwork(state_dim, action_dim, hidden_dim).to(device)\n",
356 |     "target_policy_net = PolicyNetwork(state_dim, action_dim, hidden_dim).to(device)\n",
357 |     "\n",
358 |     "for target_param, param in zip(target_value_net.parameters(), value_net.parameters()):\n",
359 |     "    target_param.data.copy_(param.data)\n",
360 |     "\n",
361 |     "for target_param, param in zip(target_policy_net.parameters(), policy_net.parameters()):\n",
362 |     "    target_param.data.copy_(param.data)\n",
363 |     "    \n",
364 |     "    \n",
365 |     "value_lr  = 1e-3\n",
366 |     "policy_lr = 1e-4\n",
367 |     "\n",
368 |     "value_optimizer  = optim.Adam(value_net.parameters(),  lr=value_lr)\n",
369 |     "policy_optimizer = optim.Adam(policy_net.parameters(), lr=policy_lr)\n",
370 |     "\n",
371 |     "value_criterion = nn.MSELoss()\n",
372 |     "\n",
373 |     "replay_buffer_size = 10000\n",
374 |     "replay_buffer = ReplayBuffer(replay_buffer_size)"
375 |    ]
376 |   },
377 |   {
378 |    "cell_type": "code",
379 |    "execution_count": null,
380 |    "metadata": {
381 |     "scrolled": false
382 |    },
383 |    "outputs": [],
384 |    "source": [
385 |     "max_steps   = 500\n",
386 |     "\n",
387 |     "valid_mean_rewards = []\n",
388 |     "for i in range(100):    \n",
389 |     "    session_rewards_train = [\n",
390 |     "        generate_session(t_max=max_steps, train=True) \n",
391 |     "        for _ in range(10)\n",
392 |     "    ]\n",
393 |     "    session_rewards_valid = [\n",
394 |     "        generate_session(t_max=max_steps, train=False) \n",
395 |     "        for _ in range(10)\n",
396 |     "    ]\n",
397 |     "    print(\n",
398 |     "        \"epoch #{:02d}\\tmean reward (train) = {:.3f}\\tmean reward (valid) = {:.3f}\".format(\n",
399 |     "        i, np.mean(session_rewards_train), np.mean(session_rewards_valid))\n",
400 |     "    )\n",
401 |     "\n",
402 |     "    valid_mean_rewards.append(np.mean(session_rewards_valid))\n",
403 |     "    if len(valid_mean_rewards) > 5 and np.mean(valid_mean_rewards[-5:]) > -200:\n",
404 |     "        print(\"You Win!\")\n",
405 |     "        break"
406 |    ]
407 |   },
408 |   {
409 |    "cell_type": "markdown",
410 |    "metadata": {},
411 |    "source": [
412 |     "---"
413 |    ]
414 |   },
415 |   {
416 |    "cell_type": "code",
417 |    "execution_count": null,
418 |    "metadata": {},
419 |    "outputs": [],
420 |    "source": [
421 |     "# record sessions\n",
422 |     "import gym.wrappers\n",
423 |     "env = gym.wrappers.Monitor(\n",
424 |     "    NormalizedActions(gym.make(\"Pendulum-v0\")),\n",
425 |     "    directory=\"videos_ddpg\", \n",
426 |     "    force=True)\n",
427 |     "sessions = [generate_session(t_max=max_steps, train=False) for _ in range(10)]\n",
428 |     "env.close()"
429 |    ]
430 |   },
431 |   {
432 |    "cell_type": "code",
433 |    "execution_count": null,
434 |    "metadata": {},
435 |    "outputs": [],
436 |    "source": [
437 |     "# show video\n",
438 |     "from IPython.display import HTML\n",
439 |     "import os\n",
440 |     "\n",
441 |     "video_names = list(\n",
442 |     "    filter(lambda s: s.endswith(\".mp4\"), os.listdir(\"./videos_ddpg/\")))\n",
443 |     "\n",
444 |     "HTML(\"\"\"\n",
445 |     "<video width=\"640\" height=\"480\" controls>\n",
446 |     "  <source src=\"{}\" type=\"video/mp4\">\n",
447 |     "</video>\n",
448 |     "\"\"\".format(\"./videos/\"+video_names[-1]))  # this may or may not be _last_ video. Try other indices"
449 |    ]
450 |   }
451 |  ],
452 |  "metadata": {
453 |   "kernelspec": {
454 |    "display_name": "Python 3",
455 |    "language": "python",
456 |    "name": "python3"
457 |   },
458 |   "language_info": {
459 |    "codemirror_mode": {
460 |     "name": "ipython",
461 |     "version": 3
462 |    },
463 |    "file_extension": ".py",
464 |    "mimetype": "text/x-python",
465 |    "name": "python",
466 |    "nbconvert_exporter": "python",
467 |    "pygments_lexer": "ipython3",
468 |    "version": "3.7.3"
469 |   },
470 |   "pycharm": {
471 |    "stem_cell": {
472 |     "cell_type": "raw",
473 |     "source": [],
474 |     "metadata": {
475 |      "collapsed": false
476 |     }
477 |    }
478 |   }
479 |  },
480 |  "nbformat": 4,
481 |  "nbformat_minor": 2
482 | }
483 | 


--------------------------------------------------------------------------------
/2020/code/DQN.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Deep Q-Learning\n",
  8 |     "\n",
  9 |     "In this notebook you will teach a __pytorch__ neural network to do Q-learning."
 10 |    ]
 11 |   },
 12 |   {
 13 |    "cell_type": "code",
 14 |    "execution_count": null,
 15 |    "metadata": {},
 16 |    "outputs": [],
 17 |    "source": [
 18 |     "# !pip install -r ./requirement"
 19 |    ]
 20 |   },
 21 |   {
 22 |    "cell_type": "code",
 23 |    "execution_count": null,
 24 |    "metadata": {},
 25 |    "outputs": [],
 26 |    "source": [
 27 |     "from collections import deque\n",
 28 |     "import random\n",
 29 |     "import numpy as np\n",
 30 |     "import gym\n",
 31 |     "\n",
 32 |     "import torch\n",
 33 |     "import torch.nn as nn\n",
 34 |     "import torch.nn.functional as F"
 35 |    ]
 36 |   },
 37 |   {
 38 |    "cell_type": "code",
 39 |    "execution_count": null,
 40 |    "metadata": {},
 41 |    "outputs": [],
 42 |    "source": [
 43 |     "import matplotlib.pyplot as plt\n",
 44 |     "%matplotlib inline"
 45 |    ]
 46 |   },
 47 |   {
 48 |    "cell_type": "code",
 49 |    "execution_count": null,
 50 |    "metadata": {},
 51 |    "outputs": [],
 52 |    "source": [
 53 |     "use_cuda = torch.cuda.is_available()\n",
 54 |     "device   = torch.device(\"cuda\" if use_cuda else \"cpu\")"
 55 |    ]
 56 |   },
 57 |   {
 58 |    "cell_type": "markdown",
 59 |    "metadata": {},
 60 |    "source": [
 61 |     "### Let's play some old videogames\n",
 62 |     "![img](https://github.com/yandexdataschool/Practical_RL/raw/master/yet_another_week/_resource/nerd.png)\n",
 63 |     "\n",
 64 |     "This time we're gonna apply approximate q-learning to an OpenAI game called CartPole. It's not the hardest thing out there, but it's definitely way more complex than anything we tried before.\n"
 65 |    ]
 66 |   },
 67 |   {
 68 |    "cell_type": "markdown",
 69 |    "metadata": {},
 70 |    "source": [
 71 |     "## Environment"
 72 |    ]
 73 |   },
 74 |   {
 75 |    "cell_type": "code",
 76 |    "execution_count": null,
 77 |    "metadata": {},
 78 |    "outputs": [],
 79 |    "source": [
 80 |     "env = gym.make(\"CartPole-v0\").env\n",
 81 |     "env.reset()\n",
 82 |     "n_actions = env.action_space.n\n",
 83 |     "state_dim = env.observation_space.shape\n",
 84 |     "\n",
 85 |     "# plt.imshow(env.render(\"rgb_array\"))\n",
 86 |     "# env.close()"
 87 |    ]
 88 |   },
 89 |   {
 90 |    "cell_type": "markdown",
 91 |    "metadata": {},
 92 |    "source": [
 93 |     "# Approximate Q-learning: building the network\n",
 94 |     "\n",
 95 |     "To train a neural network policy one must have a neural network policy. Let's build it.\n",
 96 |     "\n",
 97 |     "\n",
 98 |     "Since we're working with a pre-extracted features (cart positions, angles and velocities), we don't need a complicated network yet. In fact, let's build something like this for starters:\n",
 99 |     "\n",
100 |     "![img](https://raw.githubusercontent.com/yandexdataschool/Practical_RL/master/yet_another_week/_resource/qlearning_scheme.png)\n",
101 |     "\n",
102 |     "For your first run, please only use linear layers (nn.Linear) and activations. Stuff like batch normalization or dropout may ruin everything if used haphazardly. \n",
103 |     "\n",
104 |     "Also please avoid using nonlinearities like sigmoid & tanh: agent's observations are not normalized so sigmoids may become saturated from init.\n",
105 |     "\n",
106 |     "Ideally you should start small with maybe 1-2 hidden layers with < 200 neurons and then increase network size if agent doesn't beat the target score."
107 |    ]
108 |   },
109 |   {
110 |    "cell_type": "code",
111 |    "execution_count": null,
112 |    "metadata": {},
113 |    "outputs": [],
114 |    "source": [
115 |     "network = nn.Sequential(\n",
116 |     "    nn.Linear(env.observation_space.shape[0], 128),\n",
117 |     "    nn.ReLU(),\n",
118 |     "    nn.Linear(128, 128),\n",
119 |     "    nn.ReLU(),\n",
120 |     "    nn.Linear(128, env.action_space.n)\n",
121 |     ").to(device)\n",
122 |     "\n",
123 |     "# network.add_module('layer1', < ... >)\n",
124 |     "\n",
125 |     "# <YOUR CODE: stack layers!!!1 >\n",
126 |     "\n",
127 |     "# hint: use state_dim[0] as input size"
128 |    ]
129 |   },
130 |   {
131 |    "cell_type": "code",
132 |    "execution_count": null,
133 |    "metadata": {},
134 |    "outputs": [],
135 |    "source": [
136 |     "def get_action(state, epsilon=0):\n",
137 |     "    \"\"\"\n",
138 |     "    sample actions with epsilon-greedy policy\n",
139 |     "    recap: with p = epsilon pick random action, else pick action with highest Q(s,a)\n",
140 |     "    \"\"\"\n",
141 |     "    state = torch.tensor(state[None], dtype=torch.float32)\n",
142 |     "    q_values = network(state).detach().numpy()[0]\n",
143 |     "\n",
144 |     "    # YOUR CODE\n",
145 |     "    if np.random.random() < epsilon:\n",
146 |     "        action = np.random.randint(len(q_values))\n",
147 |     "    else:\n",
148 |     "        action = np.argmax(q_values)\n",
149 |     "\n",
150 |     "    return int(action) # int( < epsilon-greedily selected action > )"
151 |    ]
152 |   },
153 |   {
154 |    "cell_type": "code",
155 |    "execution_count": null,
156 |    "metadata": {},
157 |    "outputs": [],
158 |    "source": [
159 |     "s = env.reset()\n",
160 |     "assert tuple(network(torch.tensor([s]*3, dtype=torch.float32)).size()) == (\n",
161 |     "    3, n_actions), \"please make sure your model maps state s -> [Q(s,a0), ..., Q(s, a_last)]\"\n",
162 |     "assert isinstance(list(network.modules(\n",
163 |     "))[-1], nn.Linear), \"please make sure you predict q-values without nonlinearity (ignore if you know what you're doing)\"\n",
164 |     "assert isinstance(get_action(\n",
165 |     "    s), int), \"get_action(s) must return int, not %s. try int(action)\" % (type(get_action(s)))\n",
166 |     "\n",
167 |     "# test epsilon-greedy exploration\n",
168 |     "for eps in [0., 0.1, 0.5, 1.0]:\n",
169 |     "    state_frequencies = np.bincount(\n",
170 |     "        [get_action(s, epsilon=eps) for i in range(10000)], minlength=n_actions)\n",
171 |     "    best_action = state_frequencies.argmax()\n",
172 |     "    assert abs(state_frequencies[best_action] -\n",
173 |     "               10000 * (1 - eps + eps / n_actions)) < 200\n",
174 |     "    for other_action in range(n_actions):\n",
175 |     "        if other_action != best_action:\n",
176 |     "            assert abs(state_frequencies[other_action] -\n",
177 |     "                       10000 * (eps / n_actions)) < 200\n",
178 |     "    print('e=%.1f tests passed' % eps)"
179 |    ]
180 |   },
181 |   {
182 |    "cell_type": "markdown",
183 |    "metadata": {},
184 |    "source": [
185 |     "### Q-learning via gradient descent\n",
186 |     "\n",
187 |     "We shall now train our agent's Q-function by minimizing the TD loss:\n",
188 |     "$$ L = { 1 \\over N} \\sum_i (Q_{\\theta}(s,a) - [r(s,a) + \\gamma \\cdot max_{a'} Q_{-}(s', a')]) ^2 $$\n",
189 |     "\n",
190 |     "\n",
191 |     "Where\n",
192 |     "* $s, a, r, s'$ are current state, action, reward and next state respectively\n",
193 |     "* $\\gamma$ is a discount factor defined two cells above.\n",
194 |     "\n",
195 |     "The tricky part is with  $Q_{-}(s',a')$. From an engineering standpoint, it's the same as $Q_{\\theta}$ - the output of your neural network policy. However, when doing gradient descent, __we won't propagate gradients through it__ to make training more stable (see lectures).\n",
196 |     "\n",
197 |     "To do so, we shall use `x.detach()` function which basically says \"consider this thing constant when doingbackprop\"."
198 |    ]
199 |   },
200 |   {
201 |    "cell_type": "code",
202 |    "execution_count": null,
203 |    "metadata": {},
204 |    "outputs": [],
205 |    "source": [
206 |     "def to_one_hot(y_tensor, n_dims=None):\n",
207 |     "    \"\"\" helper: take an integer vector and convert it to 1-hot matrix. \"\"\"\n",
208 |     "    y_tensor = y_tensor.type(torch.LongTensor).view(-1, 1)\n",
209 |     "    n_dims = n_dims if n_dims is not None else int(torch.max(y_tensor)) + 1\n",
210 |     "    y_one_hot = torch.zeros(\n",
211 |     "        y_tensor.size()[0], n_dims).scatter_(1, y_tensor, 1)\n",
212 |     "    return y_one_hot\n",
213 |     "\n",
214 |     "\n",
215 |     "def where(cond, x_1, x_2):\n",
216 |     "    \"\"\" helper: like np.where but in pytorch. \"\"\"\n",
217 |     "    return (cond * x_1) + ((1-cond) * x_2)"
218 |    ]
219 |   },
220 |   {
221 |    "cell_type": "code",
222 |    "execution_count": null,
223 |    "metadata": {},
224 |    "outputs": [],
225 |    "source": [
226 |     "def compute_td_loss(\n",
227 |     "    states, \n",
228 |     "    actions, \n",
229 |     "    rewards, \n",
230 |     "    next_states, \n",
231 |     "    is_done, \n",
232 |     "    gamma=0.99, \n",
233 |     "    check_shapes=False\n",
234 |     "):\n",
235 |     "    \"\"\" Compute td loss using torch operations only. Use the formula above. \"\"\"\n",
236 |     "    # shape: [batch_size, state_size]\n",
237 |     "    states      = torch.tensor(states, dtype=torch.float32).to(device)\n",
238 |     "    # shape: [batch_size]\n",
239 |     "    actions     = torch.tensor(actions, dtype=torch.int32).to(device)\n",
240 |     "    # shape: [batch_size]\n",
241 |     "    rewards     = torch.tensor(rewards, dtype=torch.float32).to(device)\n",
242 |     "    # shape: [batch_size, state_size]\n",
243 |     "    next_states = torch.tensor(next_states, dtype=torch.float32).to(device)\n",
244 |     "    # shape: [batch_size]\n",
245 |     "    is_done     = torch.tensor(is_done, dtype=torch.float32).to(device)\n",
246 |     "\n",
247 |     "    # get q-values for all actions in current states\n",
248 |     "    predicted_qvalues = network(states)\n",
249 |     "\n",
250 |     "    # select q-values for chosen actions\n",
251 |     "    predicted_qvalues_for_actions = torch.sum(\n",
252 |     "        predicted_qvalues * to_one_hot(actions, n_actions), \n",
253 |     "        dim=1\n",
254 |     "    )\n",
255 |     "\n",
256 |     "    # compute q-values for all actions in next states\n",
257 |     "    predicted_next_qvalues = network(next_states) # YOUR CODE\n",
258 |     "\n",
259 |     "    # compute V*(next_states) using predicted next q-values\n",
260 |     "    next_state_values = predicted_next_qvalues.max(1)[0] # YOUR CODE\n",
261 |     "    assert next_state_values.dtype == torch.float32\n",
262 |     "\n",
263 |     "    # compute \"target q-values\" for loss - it's what's inside square parentheses in the above formula.\n",
264 |     "    target_qvalues_for_actions =  rewards + gamma * next_state_values # YOUR CODE\n",
265 |     "\n",
266 |     "    # at the last state we shall use simplified formula: Q(s,a) = r(s,a) since s' doesn't exist\n",
267 |     "    target_qvalues_for_actions = where(\n",
268 |     "        is_done, rewards, target_qvalues_for_actions)\n",
269 |     "\n",
270 |     "    # mean squared error loss to minimize\n",
271 |     "    loss = torch.mean(\n",
272 |     "        (predicted_qvalues_for_actions - target_qvalues_for_actions.detach()) ** 2)\n",
273 |     "\n",
274 |     "    if check_shapes:\n",
275 |     "        assert predicted_next_qvalues.data.dim(\n",
276 |     "        ) == 2, \"make sure you predicted q-values for all actions in next state\"\n",
277 |     "        assert next_state_values.data.dim(\n",
278 |     "        ) == 1, \"make sure you computed V(s') as maximum over just the actions axis and not all axes\"\n",
279 |     "        assert target_qvalues_for_actions.data.dim(\n",
280 |     "        ) == 1, \"there's something wrong with target q-values, they must be a vector\"\n",
281 |     "\n",
282 |     "    return loss"
283 |    ]
284 |   },
285 |   {
286 |    "cell_type": "code",
287 |    "execution_count": null,
288 |    "metadata": {},
289 |    "outputs": [],
290 |    "source": [
291 |     "# sanity checks\n",
292 |     "s = env.reset()\n",
293 |     "a = env.action_space.sample()\n",
294 |     "next_s, r, done, _ = env.step(a)\n",
295 |     "loss = compute_td_loss([s], [a], [r], [next_s], [done], check_shapes=True)\n",
296 |     "loss.backward()\n",
297 |     "\n",
298 |     "assert len(loss.size()) == 0, \"you must return scalar loss - mean over batch\"\n",
299 |     "assert np.any(next(network.parameters()).grad.detach().numpy() !=\n",
300 |     "              0), \"loss must be differentiable w.r.t. network weights\""
301 |    ]
302 |   },
303 |   {
304 |    "cell_type": "markdown",
305 |    "metadata": {},
306 |    "source": [
307 |     "### Experience replay buffer\n",
308 |     "\n",
309 |     "![img](https://github.com/yandexdataschool/Practical_RL/raw/master/yet_another_week/_resource/exp_replay.png)"
310 |    ]
311 |   },
312 |   {
313 |    "cell_type": "code",
314 |    "execution_count": null,
315 |    "metadata": {},
316 |    "outputs": [],
317 |    "source": [
318 |     "class ReplayBuffer(object):\n",
319 |     "    def __init__(self, capacity):\n",
320 |     "        self.buffer = deque(maxlen=capacity)\n",
321 |     "    \n",
322 |     "    def push(self, state, action, reward, next_state, done):\n",
323 |     "        state      = np.expand_dims(state, 0)\n",
324 |     "        next_state = np.expand_dims(next_state, 0)\n",
325 |     "            \n",
326 |     "        self.buffer.append((state, action, reward, next_state, done))\n",
327 |     "    \n",
328 |     "    def sample(self, batch_size):\n",
329 |     "        state, action, reward, next_state, done = zip(*random.sample(self.buffer, batch_size))\n",
330 |     "        return np.concatenate(state), action, reward, np.concatenate(next_state), done\n",
331 |     "    \n",
332 |     "    def __len__(self):\n",
333 |     "        return len(self.buffer)"
334 |    ]
335 |   },
336 |   {
337 |    "cell_type": "markdown",
338 |    "metadata": {},
339 |    "source": [
340 |     "---"
341 |    ]
342 |   },
343 |   {
344 |    "cell_type": "code",
345 |    "execution_count": null,
346 |    "metadata": {},
347 |    "outputs": [],
348 |    "source": [
349 |     "batch_size = 32\n",
350 |     "\n",
351 |     "def generate_session(t_max=1000, epsilon=0, train=False):\n",
352 |     "    \"\"\"play env with approximate q-learning agent and train it at the same time\"\"\"\n",
353 |     "    total_reward = 0\n",
354 |     "    s = env.reset()\n",
355 |     "\n",
356 |     "    for t in range(t_max):\n",
357 |     "        a = get_action(s, epsilon=epsilon if train else -1)\n",
358 |     "        next_s, r, done, _ = env.step(a)\n",
359 |     "\n",
360 |     "        if train:\n",
361 |     "            opt.zero_grad()\n",
362 |     "            replay_buffer.push(s, a, r, next_s, done)\n",
363 |     "            if len(replay_buffer) > batch_size:\n",
364 |     "                s_, a_, r_, next_s_, done_ = replay_buffer.sample(batch_size)\n",
365 |     "                compute_td_loss(s_, a_, r_, next_s_, done_).backward()\n",
366 |     "\n",
367 |     "                opt.step()\n",
368 |     "\n",
369 |     "        total_reward += r\n",
370 |     "        s = next_s\n",
371 |     "        if done:\n",
372 |     "            break\n",
373 |     "\n",
374 |     "    return total_reward"
375 |    ]
376 |   },
377 |   {
378 |    "cell_type": "code",
379 |    "execution_count": null,
380 |    "metadata": {},
381 |    "outputs": [],
382 |    "source": [
383 |     "replay_buffer = ReplayBuffer(1000)\n",
384 |     "opt = torch.optim.Adam(network.parameters(), lr=1e-4)\n",
385 |     "epsilon = 0.5"
386 |    ]
387 |   },
388 |   {
389 |    "cell_type": "code",
390 |    "execution_count": null,
391 |    "metadata": {},
392 |    "outputs": [],
393 |    "source": [
394 |     "valid_mean_rewards = []\n",
395 |     "for i in range(100):\n",
396 |     "    session_rewards_train = [\n",
397 |     "        generate_session(epsilon=epsilon, train=True) \n",
398 |     "        for _ in range(100)\n",
399 |     "    ]\n",
400 |     "    session_rewards_valid = [\n",
401 |     "        generate_session(epsilon=epsilon, train=False) \n",
402 |     "        for _ in range(100)\n",
403 |     "    ]\n",
404 |     "    print(\n",
405 |     "        \"epoch #{:02d}\\tmean reward (train) = {:.3f}\\tepsilon = {:.3f}\\tmean reward (valid) = {:.3f}\".format(\n",
406 |     "        i, np.mean(session_rewards_train), epsilon, np.mean(session_rewards_valid))\n",
407 |     "    )\n",
408 |     "\n",
409 |     "    epsilon *= 0.95 # 0.99\n",
410 |     "    assert epsilon >= 1e-4, \"Make sure epsilon is always nonzero during training\"\n",
411 |     "\n",
412 |     "    valid_mean_rewards.append(np.mean(session_rewards_valid))\n",
413 |     "    if len(valid_mean_rewards) > 5 and np.mean(valid_mean_rewards[-5:]) > 300:\n",
414 |     "        print(\"You Win!\")\n",
415 |     "        break"
416 |    ]
417 |   },
418 |   {
419 |    "cell_type": "markdown",
420 |    "metadata": {},
421 |    "source": [
422 |     "### How to interpret results\n",
423 |     "\n",
424 |     "\n",
425 |     "Welcome to the f.. world of deep f...n reinforcement learning. Don't expect agent's reward to smoothly go up. Hope for it to go increase eventually. If it deems you worthy.\n",
426 |     "\n",
427 |     "Seriously though,\n",
428 |     "* __mean reward__ is the average reward per game. For a correct implementation it may stay low for some 10 epochs, then start growing while oscilating insanely and converges by ~50-100 steps depending on the network architecture. \n",
429 |     "* If it never reaches target score by the end of for loop, try increasing the number of hidden neurons or look at the epsilon.\n",
430 |     "* __epsilon__ - agent's willingness to explore. If you see that agent's already at < 0.01 epsilon before it's is at least 200, just reset it back to 0.1 - 0.5."
431 |    ]
432 |   },
433 |   {
434 |    "cell_type": "markdown",
435 |    "metadata": {},
436 |    "source": [
437 |     "---"
438 |    ]
439 |   },
440 |   {
441 |    "cell_type": "code",
442 |    "execution_count": null,
443 |    "metadata": {},
444 |    "outputs": [],
445 |    "source": [
446 |     "# record sessions\n",
447 |     "import gym.wrappers\n",
448 |     "env = gym.wrappers.Monitor(\n",
449 |     "    gym.make(\"CartPole-v0\"),\n",
450 |     "    directory=\"videos_dqn\", \n",
451 |     "    force=True)\n",
452 |     "sessions = [generate_session(epsilon=0, train=False) for _ in range(100)]\n",
453 |     "env.close()"
454 |    ]
455 |   },
456 |   {
457 |    "cell_type": "code",
458 |    "execution_count": null,
459 |    "metadata": {},
460 |    "outputs": [],
461 |    "source": [
462 |     "# show video\n",
463 |     "from IPython.display import HTML\n",
464 |     "import os\n",
465 |     "\n",
466 |     "video_names = list(\n",
467 |     "    filter(lambda s: s.endswith(\".mp4\"), os.listdir(\"./videos_dqn/\")))\n",
468 |     "\n",
469 |     "HTML(\"\"\"\n",
470 |     "<video width=\"640\" height=\"480\" controls>\n",
471 |     "  <source src=\"{}\" type=\"video/mp4\">\n",
472 |     "</video>\n",
473 |     "\"\"\".format(\"./videos/\"+video_names[-1]))  # this may or may not be _last_ video. Try other indices"
474 |    ]
475 |   },
476 |   {
477 |    "cell_type": "code",
478 |    "execution_count": null,
479 |    "metadata": {},
480 |    "outputs": [],
481 |    "source": []
482 |   }
483 |  ],
484 |  "metadata": {
485 |   "kernelspec": {
486 |    "display_name": "Python 3",
487 |    "language": "python",
488 |    "name": "python3"
489 |   },
490 |   "language_info": {
491 |    "codemirror_mode": {
492 |     "name": "ipython",
493 |     "version": 3
494 |    },
495 |    "file_extension": ".py",
496 |    "mimetype": "text/x-python",
497 |    "name": "python",
498 |    "nbconvert_exporter": "python",
499 |    "pygments_lexer": "ipython3",
500 |    "version": "3.7.3"
501 |   },
502 |   "pycharm": {
503 |    "stem_cell": {
504 |     "cell_type": "raw",
505 |     "source": [],
506 |     "metadata": {
507 |      "collapsed": false
508 |     }
509 |    }
510 |   }
511 |  },
512 |  "nbformat": 4,
513 |  "nbformat_minor": 2
514 | }
515 | 


--------------------------------------------------------------------------------
/2020/code/RecSimDemo.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "code",
  5 |    "execution_count": null,
  6 |    "metadata": {},
  7 |    "outputs": [],
  8 |    "source": [
  9 |     "import numpy as np\n",
 10 |     "import pandas as pd\n",
 11 |     "import matplotlib.pyplot as plt\n",
 12 |     "import seaborn as sns\n",
 13 |     "\n",
 14 |     "# plt.rcParams['axes.grid'] = True\n",
 15 |     "\n",
 16 |     "%matplotlib inline"
 17 |    ]
 18 |   },
 19 |   {
 20 |    "cell_type": "code",
 21 |    "execution_count": null,
 22 |    "metadata": {},
 23 |    "outputs": [],
 24 |    "source": [
 25 |     "from gym import spaces\n",
 26 |     "from recsim import document, user\n",
 27 |     "from recsim.choice_model import AbstractChoiceModel\n",
 28 |     "from recsim.simulator import recsim_gym, environment"
 29 |    ]
 30 |   },
 31 |   {
 32 |    "cell_type": "code",
 33 |    "execution_count": null,
 34 |    "metadata": {},
 35 |    "outputs": [],
 36 |    "source": [
 37 |     "from recsim_exp import GaussNoise, WolpertingerRecommender"
 38 |    ]
 39 |   },
 40 |   {
 41 |    "cell_type": "markdown",
 42 |    "metadata": {},
 43 |    "source": [
 44 |     "# RecSim environment\n",
 45 |     "\n",
 46 |     "In this tutorial we will break a RecSim environment down into its basic components. \n",
 47 |     "![Detailed view of RecSim](https://github.com/google-research/recsim/blob/master/recsim/colab/figures/simulator.png?raw=true)\n",
 48 |     "\n",
 49 |     "The green and blue blocks in the above diagram constitute the classes that need to be implemented within a RecSim environment. The goal of this tutorial is to explain the purpose of these blocks and how they come together in a simulation.  In the process, we will go over an example end-to-end implementation.\n",
 50 |     "\n"
 51 |    ]
 52 |   },
 53 |   {
 54 |    "cell_type": "markdown",
 55 |    "metadata": {},
 56 |    "source": [
 57 |     "# Overview\n",
 58 |     "\n",
 59 |     "A single step of a RecSim simulation can be summarized roughly as follows:\n",
 60 |     "\n",
 61 |     "\n",
 62 |     "1.   the document database provides a corpus of *D* documents to the recommender. This could be a different set at each step (e.g., sampled, or produced by some \"candidate generation\" process), or fixed throughout the simulation. Each document is represented by a list of features. In a fully observable situation, the recommender observes all features of each document that impact the user's state and choice of document (and other aspects of the user's response), but this need not be the case in general. (In fact, most interesting scenarios involve latent features.)\n",
 63 |     "2.   The recommender observes the *D* documents (and their features) together with the user's response to the last recommendation. It then makes a selection (possibly ordered) of *k* documents and presents them to the user. The ordering may or may not impact the user choice or user state, depending on our simulation goals.\n",
 64 |     "3.   The user examines the list of documents and makes a choice of one document. Note that not consuming any of the documents is also a valid choice. This leads to a transition in the user's state. Finally the user emits an observation, which the recommender observes at the next iteration. The observation generally includes (noisy) information about the user's reaction to the content and potentially clues about the user's latent state. Typically, the user's state is not fully revealed. \n",
 65 |     "\n",
 66 |     "If we examine at the diagram above carefully, we notice that the flow of information along arcs is acyclic---a RecSim environment is a dynamic Bayesian network (DBN), where the various boxes represent conditional probability distributions. We will now define a simple simulation problem and implement it. "
 67 |    ]
 68 |   },
 69 |   {
 70 |    "cell_type": "code",
 71 |    "execution_count": null,
 72 |    "metadata": {},
 73 |    "outputs": [],
 74 |    "source": [
 75 |     "SEED = 42\n",
 76 |     "DOC_NUM = 10\n",
 77 |     "P_EXIT_ACCEPTED = 0.1\n",
 78 |     "P_EXIT_NOT_ACCEPTED = 0.2\n",
 79 |     "\n",
 80 |     "# let's define a matrix W for simulation of users' respose\n",
 81 |     "# (based on the section 7.3 of the paper https://arxiv.org/pdf/1512.07679.pdf)\n",
 82 |     "# W_ij defines the probability that a user will accept recommendation j\n",
 83 |     "# given that he is consuming item i at the moment\n",
 84 |     "\n",
 85 |     "np.random.seed(SEED)\n",
 86 |     "W = (np.ones((DOC_NUM, DOC_NUM)) - np.eye(DOC_NUM)) * \\\n",
 87 |     "     np.random.uniform(0.0, P_EXIT_NOT_ACCEPTED, (DOC_NUM, DOC_NUM)) + \\\n",
 88 |     "     np.diag(np.random.uniform(1.0 - P_EXIT_ACCEPTED, 1.0, DOC_NUM))\n",
 89 |     "W = W[:, np.random.permutation(DOC_NUM)]"
 90 |    ]
 91 |   },
 92 |   {
 93 |    "cell_type": "markdown",
 94 |    "metadata": {},
 95 |    "source": [
 96 |     "### Document"
 97 |    ]
 98 |   },
 99 |   {
100 |    "cell_type": "code",
101 |    "execution_count": null,
102 |    "metadata": {},
103 |    "outputs": [],
104 |    "source": [
105 |     "class Document(document.AbstractDocument):\n",
106 |     "\n",
107 |     "    def __init__(self, doc_id):\n",
108 |     "        super(Document, self).__init__(doc_id)\n",
109 |     "\n",
110 |     "    def create_observation(self):\n",
111 |     "        return (self._doc_id,)\n",
112 |     "\n",
113 |     "    @staticmethod\n",
114 |     "    def observation_space():\n",
115 |     "        return spaces.Discrete(DOC_NUM)\n",
116 |     "\n",
117 |     "    def __str__(self):\n",
118 |     "        return \"Document #{}\".format(self._doc_id)\n",
119 |     "\n",
120 |     "\n",
121 |     "class DocumentSampler(document.AbstractDocumentSampler):\n",
122 |     "\n",
123 |     "    def __init__(self, doc_ctor=Document):\n",
124 |     "        super(DocumentSampler, self).__init__(doc_ctor)\n",
125 |     "        self._doc_count = 0\n",
126 |     "\n",
127 |     "    def sample_document(self):\n",
128 |     "        doc = self._doc_ctor(self._doc_count % DOC_NUM)\n",
129 |     "        self._doc_count += 1\n",
130 |     "        return doc"
131 |    ]
132 |   },
133 |   {
134 |    "cell_type": "markdown",
135 |    "metadata": {},
136 |    "source": [
137 |     "### User"
138 |    ]
139 |   },
140 |   {
141 |    "cell_type": "code",
142 |    "execution_count": null,
143 |    "metadata": {},
144 |    "outputs": [],
145 |    "source": [
146 |     "class UserState(user.AbstractUserState):\n",
147 |     "\n",
148 |     "    def __init__(self, user_id, current, active_session=True):\n",
149 |     "        self.user_id = user_id\n",
150 |     "        self.current = current\n",
151 |     "        self.active_session = active_session\n",
152 |     "\n",
153 |     "    def create_observation(self):\n",
154 |     "        return (self.current,)\n",
155 |     "\n",
156 |     "    def __str__(self):\n",
157 |     "        return \"User #{}\".format(self.user_id)\n",
158 |     "\n",
159 |     "    @staticmethod\n",
160 |     "    def observation_space():\n",
161 |     "        return spaces.Discrete(DOC_NUM)\n",
162 |     "\n",
163 |     "    def score_document(self, doc_obs):\n",
164 |     "        return W[self.current, doc_obs[0]]\n",
165 |     "\n",
166 |     "\n",
167 |     "class StaticUserSampler(user.AbstractUserSampler):\n",
168 |     "\n",
169 |     "    def __init__(self, user_ctor=UserState):\n",
170 |     "        super(StaticUserSampler, self).__init__(user_ctor)\n",
171 |     "        self.user_count = 0\n",
172 |     "\n",
173 |     "    def sample_user(self):\n",
174 |     "        self.user_count += 1\n",
175 |     "        sampled_user = self._user_ctor(\n",
176 |     "            self.user_count, np.random.randint(DOC_NUM))\n",
177 |     "        return sampled_user\n",
178 |     "\n",
179 |     "\n",
180 |     "class Response(user.AbstractResponse):\n",
181 |     "\n",
182 |     "    def __init__(self, accept=False):\n",
183 |     "        self.accept = accept\n",
184 |     "\n",
185 |     "    def create_observation(self):\n",
186 |     "        return (int(self.accept),)\n",
187 |     "\n",
188 |     "    @classmethod\n",
189 |     "    def response_space(cls):\n",
190 |     "        return spaces.Discrete(2)\n",
191 |     "\n",
192 |     "\n",
193 |     "class UserChoiceModel(AbstractChoiceModel):\n",
194 |     "    def __init__(self):\n",
195 |     "        super(UserChoiceModel, self).__init__()\n",
196 |     "        self._score_no_click = P_EXIT_ACCEPTED\n",
197 |     "\n",
198 |     "    def score_documents(self, user_state, doc_obs):\n",
199 |     "        if len(doc_obs) != 1:\n",
200 |     "            raise ValueError(\n",
201 |     "                \"Expecting single document, but got: {}\".format(doc_obs))\n",
202 |     "        self._scores = np.array(\n",
203 |     "            [user_state.score_document(doc) for doc in doc_obs])\n",
204 |     "\n",
205 |     "    def choose_item(self):\n",
206 |     "        if np.random.random() < self.scores[0]:\n",
207 |     "            return 0\n",
208 |     "\n",
209 |     "\n",
210 |     "class UserModel(user.AbstractUserModel):\n",
211 |     "    def __init__(self):\n",
212 |     "        super(UserModel, self).__init__(Response, StaticUserSampler(), 1)\n",
213 |     "        self.choice_model = UserChoiceModel()\n",
214 |     "\n",
215 |     "    def simulate_response(self, slate_documents):\n",
216 |     "        if len(slate_documents) != 1:\n",
217 |     "            raise ValueError(\"Expecting single document, but got: {}\".format(\n",
218 |     "                slate_documents))\n",
219 |     "\n",
220 |     "        responses = [self._response_model_ctor() for _ in slate_documents]\n",
221 |     "\n",
222 |     "        self.choice_model.score_documents(\n",
223 |     "            self._user_state,\n",
224 |     "            [doc.create_observation() for doc in slate_documents]\n",
225 |     "        )\n",
226 |     "        selected_index = self.choice_model.choose_item()\n",
227 |     "\n",
228 |     "        if selected_index is not None:\n",
229 |     "            responses[selected_index].accept = True\n",
230 |     "\n",
231 |     "        return responses\n",
232 |     "\n",
233 |     "    def update_state(self, slate_documents, responses):\n",
234 |     "        if len(slate_documents) != 1:\n",
235 |     "            raise ValueError(\n",
236 |     "                f\"Expecting single document, but got: {slate_documents}\"\n",
237 |     "            )\n",
238 |     "\n",
239 |     "        response = responses[0]\n",
240 |     "        doc = slate_documents[0]\n",
241 |     "        if response.accept:\n",
242 |     "            self._user_state.current = doc.doc_id()\n",
243 |     "            self._user_state.active_session = bool(\n",
244 |     "                np.random.binomial(1, 1 - P_EXIT_ACCEPTED))\n",
245 |     "        else:\n",
246 |     "            self._user_state.current = np.random.choice(DOC_NUM)\n",
247 |     "            self._user_state.active_session = bool(\n",
248 |     "                np.random.binomial(1, 1 - P_EXIT_NOT_ACCEPTED))\n",
249 |     "\n",
250 |     "    def is_terminal(self):\n",
251 |     "        \"\"\"Returns a boolean indicating if the session is over.\"\"\"\n",
252 |     "        return not self._user_state.active_session\n",
253 |     "\n",
254 |     "\n",
255 |     "def clicked_reward(responses):\n",
256 |     "    reward = 0.0\n",
257 |     "    for response in responses:\n",
258 |     "        if response.accept:\n",
259 |     "            reward += 1\n",
260 |     "    return reward"
261 |    ]
262 |   },
263 |   {
264 |    "cell_type": "markdown",
265 |    "metadata": {},
266 |    "source": [
267 |     "### Environment"
268 |    ]
269 |   },
270 |   {
271 |    "cell_type": "code",
272 |    "execution_count": null,
273 |    "metadata": {},
274 |    "outputs": [],
275 |    "source": [
276 |     "def make_env():\n",
277 |     "    env = recsim_gym.RecSimGymEnv(\n",
278 |     "        environment.Environment(\n",
279 |     "            UserModel(), \n",
280 |     "            DocumentSampler(), \n",
281 |     "            DOC_NUM, \n",
282 |     "            1, \n",
283 |     "            resample_documents=False\n",
284 |     "        ),\n",
285 |     "        clicked_reward\n",
286 |     "    )\n",
287 |     "    return env"
288 |    ]
289 |   },
290 |   {
291 |    "cell_type": "markdown",
292 |    "metadata": {},
293 |    "source": [
294 |     "# RecSim Agent"
295 |    ]
296 |   },
297 |   {
298 |    "cell_type": "markdown",
299 |    "metadata": {},
300 |    "source": [
301 |     "For solving of this toy environment we'll try using a variant of DDPG algorithm for discrete actions.\n",
302 |     "We need to embed our discrete actions into continuous space to use DDPG (it outputs \"proto action\").\n",
303 |     "Then we choose k nearest embedded actions and take the action with maximum Q value.\n",
304 |     "Thus, we can avoid taking maximum over all the action space as in DQN, which can be too large in case of RecSys.\n",
305 |     "In our example embeddings are just one hot vectors. Therefore the nearest neighbour is argmax of proto action.\n",
306 |     "\n",
307 |     "<img src=\"../presets/wolpertinger_scheme.png\" width=400 height=800>"
308 |    ]
309 |   },
310 |   {
311 |    "cell_type": "code",
312 |    "execution_count": null,
313 |    "metadata": {},
314 |    "outputs": [],
315 |    "source": [
316 |     "def run_agent(\n",
317 |     "    env, \n",
318 |     "    agent, \n",
319 |     "    num_steps: int = int(3e3), \n",
320 |     "    log_every: int = int(1e3)\n",
321 |     "):\n",
322 |     "    reward_history = []\n",
323 |     "    step, episode = 1, 1\n",
324 |     "\n",
325 |     "    observation = env.reset()\n",
326 |     "    while step < num_steps:\n",
327 |     "        action = agent.begin_episode(observation)\n",
328 |     "        episode_reward = 0\n",
329 |     "        while True:\n",
330 |     "            observation, reward, done, info = env.step(action)\n",
331 |     "            episode_reward += reward\n",
332 |     "\n",
333 |     "            if step % log_every == 0:\n",
334 |     "                print(step, np.mean(reward_history[-50:]))\n",
335 |     "            step += 1\n",
336 |     "            if done:\n",
337 |     "                break\n",
338 |     "            else:\n",
339 |     "                action = agent.step(reward, observation)\n",
340 |     "\n",
341 |     "        agent.end_episode(reward, observation)\n",
342 |     "        reward_history.append(episode_reward)\n",
343 |     "\n",
344 |     "    return reward_history"
345 |    ]
346 |   },
347 |   {
348 |    "cell_type": "code",
349 |    "execution_count": null,
350 |    "metadata": {},
351 |    "outputs": [],
352 |    "source": [
353 |     "parameters = {\n",
354 |     "    \"action_dim\": DOC_NUM,\n",
355 |     "    \"state_dim\": DOC_NUM,\n",
356 |     "    \"noise\": GaussNoise(sigma=0.05),\n",
357 |     "    \"critic_lr\": 1e-3,\n",
358 |     "    \"actor_lr\": 1e-3,\n",
359 |     "    \"tau\": 1e-3,\n",
360 |     "    \"hidden_dim\": 256,\n",
361 |     "    \"batch_size\": 128,\n",
362 |     "    \"buffer_size\": int(1e4),\n",
363 |     "    \"gamma\": 0.8,\n",
364 |     "    \"actor_weight_decay\": 0.0001,\n",
365 |     "    \"critic_weight_decay\": 0.001,\n",
366 |     "    \"eps\": 1e-2\n",
367 |     "}"
368 |    ]
369 |   },
370 |   {
371 |    "cell_type": "code",
372 |    "execution_count": null,
373 |    "metadata": {},
374 |    "outputs": [],
375 |    "source": [
376 |     "env = make_env()\n",
377 |     "agent = WolpertingerRecommender(\n",
378 |     "    env=env, \n",
379 |     "    k_ratio=0.33, \n",
380 |     "    **parameters\n",
381 |     ")\n",
382 |     "reward_history = run_agent(env, agent)\n",
383 |     "plt.plot(pd.Series(reward_history).rolling(50).mean())"
384 |    ]
385 |   },
386 |   {
387 |    "cell_type": "markdown",
388 |    "metadata": {},
389 |    "source": [
390 |     "---"
391 |    ]
392 |   },
393 |   {
394 |    "cell_type": "markdown",
395 |    "metadata": {},
396 |    "source": [
397 |     "### Extra - 1"
398 |    ]
399 |   },
400 |   {
401 |    "cell_type": "code",
402 |    "execution_count": null,
403 |    "metadata": {},
404 |    "outputs": [],
405 |    "source": [
406 |     "predicted_qvalues = np.hstack([\n",
407 |     "    agent.agent.predict_qvalues(i) for i in range(DOC_NUM)\n",
408 |     "]).T\n",
409 |     "predicted_actions = np.vstack([\n",
410 |     "    agent.agent.predict_action(np.eye(DOC_NUM)[i], with_noise=False)\n",
411 |     "    for i in range(DOC_NUM)\n",
412 |     "])"
413 |    ]
414 |   },
415 |   {
416 |    "cell_type": "code",
417 |    "execution_count": null,
418 |    "metadata": {},
419 |    "outputs": [],
420 |    "source": [
421 |     "# learned Qvalues \n",
422 |     "plt.subplots(figsize=predicted_qvalues.shape)\n",
423 |     "sns.heatmap(predicted_qvalues.round(3), annot=True);"
424 |    ]
425 |   },
426 |   {
427 |    "cell_type": "code",
428 |    "execution_count": null,
429 |    "metadata": {},
430 |    "outputs": [],
431 |    "source": [
432 |     "# learned actions (aka policy)\n",
433 |     "plt.subplots(figsize=predicted_qvalues.shape)\n",
434 |     "sns.heatmap(predicted_actions.round(3), annot=True);"
435 |    ]
436 |   },
437 |   {
438 |    "cell_type": "code",
439 |    "execution_count": null,
440 |    "metadata": {},
441 |    "outputs": [],
442 |    "source": [
443 |     "# true actions (aka policy)\n",
444 |     "plt.subplots(figsize=predicted_qvalues.shape)\n",
445 |     "sns.heatmap(W, annot=True);"
446 |    ]
447 |   },
448 |   {
449 |    "cell_type": "markdown",
450 |    "metadata": {},
451 |    "source": [
452 |     "### Extra - 2"
453 |    ]
454 |   },
455 |   {
456 |    "cell_type": "code",
457 |    "execution_count": null,
458 |    "metadata": {},
459 |    "outputs": [],
460 |    "source": [
461 |     "from recsim.agent import AbstractEpisodicRecommenderAgent\n",
462 |     "\n",
463 |     "class OptimalRecommender(AbstractEpisodicRecommenderAgent):\n",
464 |     "\n",
465 |     "    def __init__(self, environment, W):\n",
466 |     "        super().__init__(environment.action_space)\n",
467 |     "        self._observation_space = environment.observation_space\n",
468 |     "        self._W = W\n",
469 |     "\n",
470 |     "    def _extract_state(self, observation):\n",
471 |     "        user_space = self._observation_space.spaces[\"user\"]\n",
472 |     "        return spaces.flatten(user_space, observation[\"user\"])\n",
473 |     "\n",
474 |     "    def step(self, reward, observation):\n",
475 |     "        state = self._extract_state(observation)\n",
476 |     "        return [self._W[state.argmax(), :].argmax()]"
477 |    ]
478 |   },
479 |   {
480 |    "cell_type": "code",
481 |    "execution_count": null,
482 |    "metadata": {},
483 |    "outputs": [],
484 |    "source": [
485 |     "env = make_env()\n",
486 |     "agent = OptimalRecommender(env, W)\n",
487 |     "\n",
488 |     "reward_history = run_agent(env, agent)\n",
489 |     "plt.plot(pd.Series(reward_history).rolling(50).mean())"
490 |    ]
491 |   },
492 |   {
493 |    "cell_type": "markdown",
494 |    "metadata": {},
495 |    "source": [
496 |     "---"
497 |    ]
498 |   },
499 |   {
500 |    "cell_type": "code",
501 |    "execution_count": null,
502 |    "metadata": {},
503 |    "outputs": [],
504 |    "source": [
505 |     "# from recsim.agents.tabular_q_agent import TabularQAgent\n",
506 |     "\n",
507 |     "# env = make_env()\n",
508 |     "# q_agent = TabularQAgent(env.observation_space, env.action_space)\n",
509 |     "\n",
510 |     "# reward_history = run_agent(env, agent)\n",
511 |     "# plt.plot(pd.Series(reward_history).rolling(50).mean())"
512 |    ]
513 |   },
514 |   {
515 |    "cell_type": "code",
516 |    "execution_count": null,
517 |    "metadata": {},
518 |    "outputs": [],
519 |    "source": []
520 |   }
521 |  ],
522 |  "metadata": {
523 |   "kernelspec": {
524 |    "display_name": "Python [conda env:py37] *",
525 |    "language": "python",
526 |    "name": "conda-env-py37-py"
527 |   },
528 |   "language_info": {
529 |    "codemirror_mode": {
530 |     "name": "ipython",
531 |     "version": 3
532 |    },
533 |    "file_extension": ".py",
534 |    "mimetype": "text/x-python",
535 |    "name": "python",
536 |    "nbconvert_exporter": "python",
537 |    "pygments_lexer": "ipython3",
538 |    "version": "3.7.6"
539 |   }
540 |  },
541 |  "nbformat": 4,
542 |  "nbformat_minor": 2
543 | }
544 | 


--------------------------------------------------------------------------------
/2020/code/recsim_exp/__init__.py:
--------------------------------------------------------------------------------
1 | from .ddpg import *
2 | from .wolpertinger import *
3 | 


--------------------------------------------------------------------------------
/2020/code/recsim_exp/ddpg.py:
--------------------------------------------------------------------------------
  1 | import random
  2 | 
  3 | import numpy as np
  4 | 
  5 | import torch
  6 | import torch.nn as nn
  7 | import torch.optim as optim
  8 | import torch.nn.functional as F
  9 | import copy
 10 | 
 11 | 
 12 | DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
 13 | 
 14 | 
 15 | def soft_update(target, source, tau):
 16 |     for target_param, param in zip(target.parameters(), source.parameters()):
 17 |         target_param.data.copy_(
 18 |             target_param.data * (1.0 - tau) + param.data * tau)
 19 | 
 20 | 
 21 | class GaussNoise:
 22 |     def __init__(self, sigma):
 23 |         super().__init__()
 24 | 
 25 |         self.sigma = sigma
 26 | 
 27 |     def get_action(self, action):
 28 |         noisy_action = np.random.normal(action, self.sigma)
 29 |         return noisy_action
 30 | 
 31 | 
 32 | class ReplayBuffer:
 33 |     def __init__(self, capacity):
 34 |         self.capacity = capacity
 35 |         self.buffer = []
 36 |         self.position = 0
 37 | 
 38 |     def push(self, state, action, reward, next_state, done):
 39 |         if len(self.buffer) < self.capacity:
 40 |             self.buffer.append(None)
 41 |         self.buffer[self.position] = (state, action, reward, next_state, done)
 42 |         self.position = (self.position + 1) % self.capacity
 43 | 
 44 |     def sample(self, batch_size):
 45 |         batch = random.sample(self.buffer, batch_size)
 46 |         state, action, reward, next_state, done = map(np.stack, zip(*batch))
 47 |         return state, action, reward, next_state, done
 48 | 
 49 |     def __len__(self):
 50 |         return len(self.buffer)
 51 | 
 52 | 
 53 | class Actor(nn.Module):
 54 |     def __init__(
 55 |         self,
 56 |         num_inputs,
 57 |         num_actions,
 58 |         hidden_size,
 59 |         init_w=3e-3,
 60 |     ):
 61 |         super().__init__()
 62 |         self.net = nn.Sequential(
 63 |             nn.Linear(num_inputs, hidden_size),
 64 |             nn.ReLU(),
 65 |             nn.Linear(hidden_size, hidden_size),
 66 |             nn.ReLU(),
 67 |         )
 68 |         self.head = nn.Linear(hidden_size, num_actions)
 69 |         nn.init.uniform_(self.head.weight, -init_w, init_w)
 70 |         nn.init.zeros_(self.head.bias)
 71 | 
 72 |     def forward(self, state):
 73 |         x = self.net(state)
 74 |         x = self.head(x)
 75 |         x = torch.sigmoid(x)
 76 |         return x
 77 | 
 78 |     def get_action(self, state):
 79 |         state = torch.tensor(
 80 |             state, dtype=torch.float32
 81 |         ).unsqueeze(0).to(DEVICE)
 82 |         action = self.forward(state)
 83 |         action = action.detach().cpu().numpy()[0]
 84 |         return action
 85 | 
 86 | 
 87 | class Critic(nn.Module):
 88 |     def __init__(
 89 |         self,
 90 |         num_inputs,
 91 |         num_actions,
 92 |         hidden_size,
 93 |         init_w=3e-3,
 94 |     ):
 95 |         super().__init__()
 96 |         self.net = nn.Sequential(
 97 |             nn.Linear(num_inputs + num_actions, hidden_size),
 98 |             nn.ReLU(),
 99 |             nn.Linear(hidden_size, hidden_size),
100 |             nn.ReLU(),
101 |         )
102 |         self.head = nn.Linear(hidden_size, 1)
103 |         nn.init.uniform_(self.head.weight, -init_w, init_w)
104 |         nn.init.zeros_(self.head.bias)
105 | 
106 |     def forward(self, state, action):
107 |         x = torch.cat([state, action], 1)
108 |         x = self.net(x)
109 |         x = self.head(x)
110 |         return x
111 | 
112 |     def get_qvalue(self, state, action):
113 |         state = torch.tensor(state, dtype=torch.float32).to(DEVICE)
114 |         action = torch.tensor(action, dtype=torch.float32).to(DEVICE)
115 |         q_value = self.forward(state, action)
116 |         q_value = q_value.detach().cpu().numpy()
117 |         return q_value
118 | 
119 | 
120 | class DDPG:
121 |     def __init__(
122 |         self,
123 |         state_dim,
124 |         action_dim,
125 |         noise=None,
126 |         hidden_dim=256,
127 |         tau=1e-3,
128 |         gamma=0.99,
129 |         init_w_actor=3e-3,
130 |         init_w_critic=3e-3,
131 |         critic_lr=1e-3,
132 |         actor_lr=1e-4,
133 |         actor_weight_decay=0.,
134 |         critic_weight_decay=0.,
135 |     ):
136 |         self.actor = Actor(
137 |             state_dim,
138 |             action_dim,
139 |             hidden_dim,
140 |             init_w=init_w_actor
141 |         ).to(DEVICE)
142 |         self.target_actor = copy.deepcopy(self.actor)
143 |         self.actor_optimizer = optim.Adam(
144 |             self.actor.parameters(),
145 |             lr=actor_lr,
146 |             weight_decay=actor_weight_decay
147 |         )
148 | 
149 |         self.critic = Critic(
150 |             state_dim,
151 |             action_dim,
152 |             hidden_dim,
153 |             init_w=init_w_critic
154 |         ).to(DEVICE)
155 |         self.target_critic = copy.deepcopy(self.critic)
156 |         self.critic_optimizer = optim.Adam(
157 |             self.critic.parameters(),
158 |             lr=critic_lr,
159 |             weight_decay=critic_weight_decay
160 |         )
161 | 
162 |         self.state_dim = state_dim
163 |         self.action_dim = action_dim
164 |         self.noise = noise
165 | 
166 |         self.tau = tau
167 |         self.gamma = gamma
168 | 
169 |     def predict_action(self, state, with_noise=False):
170 |         self.actor.eval()
171 |         action = self.actor.get_action(state)
172 |         if self.noise and with_noise:
173 |             action = self.noise.get_action(action)
174 |         self.actor.train()
175 |         return action
176 | 
177 |     def update(self, state, action, reward, next_state, done):
178 |         state = torch.tensor(state, dtype=torch.float32).to(DEVICE)
179 |         next_state = torch.tensor(next_state, dtype=torch.float32).to(DEVICE)
180 |         action = torch.tensor(action, dtype=torch.float32).to(DEVICE)
181 |         reward = torch.tensor(
182 |             reward, dtype=torch.float32
183 |         ).unsqueeze(1).to(DEVICE)
184 |         done = torch.tensor(np.float32(done)).unsqueeze(1).to(DEVICE)
185 | 
186 |         # actor loss
187 |         actor_loss = -self.critic(state, self.actor(state)).mean()
188 | 
189 |         #  critic loss
190 |         predicted_value = self.critic(state, action)
191 |         next_action = self.target_actor(next_state)
192 |         target_value = self.target_critic(next_state, next_action.detach())
193 |         expected_value = reward + (1.0 - done) * self.gamma * target_value
194 |         critic_loss = F.mse_loss(predicted_value, expected_value.detach())
195 | 
196 |         # actor update
197 |         self.actor_optimizer.zero_grad()
198 |         actor_loss.backward()
199 |         self.actor_optimizer.step()
200 | 
201 |         # critic update
202 |         self.critic_optimizer.zero_grad()
203 |         critic_loss.backward()
204 |         self.critic_optimizer.step()
205 | 
206 |         soft_update(self.target_critic, self.critic, self.tau)
207 |         soft_update(self.target_actor, self.actor, self.tau)
208 | 


--------------------------------------------------------------------------------
/2020/code/recsim_exp/wolpertinger.py:
--------------------------------------------------------------------------------
  1 | import numpy as np
  2 | from gym import spaces
  3 | from recsim.agent import AbstractEpisodicRecommenderAgent
  4 | 
  5 | from .ddpg import DDPG, ReplayBuffer
  6 | 
  7 | 
  8 | class Wolpertinger(DDPG):
  9 |     def __init__(self, *, action_dim, k_ratio=0.1, **kwargs):
 10 |         super().__init__(action_dim=action_dim, **kwargs)
 11 |         self.k = max(1, int(action_dim * k_ratio))
 12 | 
 13 |     def predict_action(self, state, with_noise=False):
 14 | 
 15 |         proto_action = super().predict_action(state, with_noise=with_noise)
 16 |         proto_action = proto_action.clip(0, 1)
 17 | 
 18 |         actions = np.eye(self.action_dim)
 19 |         # first sorting by action probability by `proto_action`
 20 |         # second by random :)
 21 |         actions_sorting = np.lexsort(
 22 |             (np.random.random(self.action_dim), proto_action)
 23 |         )
 24 |         # take topK proposed actions
 25 |         actions = actions[actions_sorting[-self.k:]]
 26 |         # make all the state-action pairs for the critic
 27 |         states = np.tile(state, [len(actions), 1])
 28 |         qvalues = self.critic.get_qvalue(states, actions)
 29 |         # find the index of the pair with the maximum value
 30 |         max_index = np.argmax(qvalues)
 31 |         action, qvalue = actions[max_index], qvalues[max_index]
 32 |         return action
 33 | 
 34 |     def predict_qvalues(self, state_num=0, action=None, dim=None):
 35 |         if dim is None:
 36 |             dim = self.action_dim
 37 |         if action is None:
 38 |             action = np.eye(dim, self.action_dim)
 39 |         s = np.zeros((dim, self.action_dim))
 40 |         s[:, state_num] = 1
 41 |         qvalues = self.critic.get_qvalue(s, action)
 42 |         return qvalues
 43 | 
 44 | 
 45 | class WolpertingerRecommender(AbstractEpisodicRecommenderAgent):
 46 | 
 47 |     def __init__(
 48 |         self,
 49 |         env,
 50 |         state_dim,
 51 |         action_dim,
 52 |         k_ratio=0.1,
 53 |         eps=1e-2,
 54 |         train: bool = True,
 55 |         batch_size: int = 256,
 56 |         buffer_size: int = 10000,
 57 |         training_starts: int = None,
 58 |         **kwargs,
 59 |     ):
 60 |         AbstractEpisodicRecommenderAgent.__init__(self, env.action_space)
 61 | 
 62 |         self._observation_space = env.observation_space
 63 |         self.agent = Wolpertinger(
 64 |             state_dim=state_dim,
 65 |             action_dim=action_dim,
 66 |             k_ratio=k_ratio,
 67 |             **kwargs
 68 |         )
 69 |         self.t = 0
 70 |         self.current_episode = {}
 71 |         self.train = train
 72 |         self.num_actions = env.action_space.nvec[0]
 73 | 
 74 |         self.eps = eps
 75 |         self.batch_size = batch_size
 76 |         self.replay_buffer = ReplayBuffer(buffer_size)
 77 |         self.training_starts = training_starts or batch_size
 78 | 
 79 |     def _extract_state(self, observation):
 80 |         user_space = self._observation_space.spaces["user"]
 81 |         return spaces.flatten(user_space, observation["user"])
 82 | 
 83 |     def _act(self, state):
 84 |         if np.random.rand() < self.eps:
 85 |             action = np.eye(self.num_actions)[np.random.randint(self.num_actions)]
 86 |         else:
 87 |             action = self.agent.predict_action(state)
 88 |         self.current_episode = {
 89 |             "state": state,
 90 |             "action": action,
 91 |         }
 92 |         return np.argmax(action)[np.newaxis]
 93 | 
 94 |     def _observe(self, next_state, reward, done):
 95 |         if not self.current_episode:
 96 |             raise ValueError("Current episode is expected to be non-empty")
 97 | 
 98 |         self.current_episode.update({
 99 |             "next_state": next_state,
100 |             "reward": reward,
101 |             "done": done
102 |         })
103 | 
104 |         self.agent.episode = self.current_episode
105 |         if self.train:
106 |             self.replay_buffer.push(**self.current_episode)
107 |             if self.t >= self.training_starts \
108 |                     and len(self.replay_buffer) >= self.batch_size:
109 |                 state, action, reward, next_state, done = \
110 |                     self.replay_buffer.sample(self.batch_size)
111 |                 self.agent.update(state, action, reward, next_state, done)
112 |         self.current_episode = {}
113 | 
114 |     def begin_episode(self, observation=None):
115 |         state = self._extract_state(observation)
116 |         return self._act(state)
117 | 
118 |     def step(self, reward, observation):
119 |         state = self._extract_state(observation)
120 |         self._observe(state, reward, 0)
121 |         self.t += 1
122 |         return self._act(state)
123 | 
124 |     def end_episode(self, reward, observation=None):
125 |         state = self._extract_state(observation)
126 |         self._observe(state, reward, 1)
127 | 


--------------------------------------------------------------------------------
/2020/presets/wolpertinger_scheme.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Scitator/RL-intro/0d95c6ca924cd7e2a3e87603c233b3dba34eaf83/2020/presets/wolpertinger_scheme.png


--------------------------------------------------------------------------------
/2020/requirements.txt:
--------------------------------------------------------------------------------
 1 | numpy
 2 | pandas
 3 | scipy
 4 | seaborn
 5 | matplotlib
 6 | requests
 7 | tqdm
 8 | gym
 9 | torch
10 | recsim


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
  1 | Copyright 2020 Sergey Kolesnikov.  All rights reserved.
  2 | 
  3 |                                  Apache License
  4 |                            Version 2.0, January 2004
  5 |                         http://www.apache.org/licenses/
  6 | 
  7 |    TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
  8 | 
  9 |    1. Definitions.
 10 | 
 11 |       "License" shall mean the terms and conditions for use, reproduction,
 12 |       and distribution as defined by Sections 1 through 9 of this document.
 13 | 
 14 |       "Licensor" shall mean the copyright owner or entity authorized by
 15 |       the copyright owner that is granting the License.
 16 | 
 17 |       "Legal Entity" shall mean the union of the acting entity and all
 18 |       other entities that control, are controlled by, or are under common
 19 |       control with that entity. For the purposes of this definition,
 20 |       "control" means (i) the power, direct or indirect, to cause the
 21 |       direction or management of such entity, whether by contract or
 22 |       otherwise, or (ii) ownership of fifty percent (50%) or more of the
 23 |       outstanding shares, or (iii) beneficial ownership of such entity.
 24 | 
 25 |       "You" (or "Your") shall mean an individual or Legal Entity
 26 |       exercising permissions granted by this License.
 27 | 
 28 |       "Source" form shall mean the preferred form for making modifications,
 29 |       including but not limited to software source code, documentation
 30 |       source, and configuration files.
 31 | 
 32 |       "Object" form shall mean any form resulting from mechanical
 33 |       transformation or translation of a Source form, including but
 34 |       not limited to compiled object code, generated documentation,
 35 |       and conversions to other media types.
 36 | 
 37 |       "Work" shall mean the work of authorship, whether in Source or
 38 |       Object form, made available under the License, as indicated by a
 39 |       copyright notice that is included in or attached to the work
 40 |       (an example is provided in the Appendix below).
 41 | 
 42 |       "Derivative Works" shall mean any work, whether in Source or Object
 43 |       form, that is based on (or derived from) the Work and for which the
 44 |       editorial revisions, annotations, elaborations, or other modifications
 45 |       represent, as a whole, an original work of authorship. For the purposes
 46 |       of this License, Derivative Works shall not include works that remain
 47 |       separable from, or merely link (or bind by name) to the interfaces of,
 48 |       the Work and Derivative Works thereof.
 49 | 
 50 |       "Contribution" shall mean any work of authorship, including
 51 |       the original version of the Work and any modifications or additions
 52 |       to that Work or Derivative Works thereof, that is intentionally
 53 |       submitted to Licensor for inclusion in the Work by the copyright owner
 54 |       or by an individual or Legal Entity authorized to submit on behalf of
 55 |       the copyright owner. For the purposes of this definition, "submitted"
 56 |       means any form of electronic, verbal, or written communication sent
 57 |       to the Licensor or its representatives, including but not limited to
 58 |       communication on electronic mailing lists, source code control systems,
 59 |       and issue tracking systems that are managed by, or on behalf of, the
 60 |       Licensor for the purpose of discussing and improving the Work, but
 61 |       excluding communication that is conspicuously marked or otherwise
 62 |       designated in writing by the copyright owner as "Not a Contribution."
 63 | 
 64 |       "Contributor" shall mean Licensor and any individual or Legal Entity
 65 |       on behalf of whom a Contribution has been received by Licensor and
 66 |       subsequently incorporated within the Work.
 67 | 
 68 |    2. Grant of Copyright License. Subject to the terms and conditions of
 69 |       this License, each Contributor hereby grants to You a perpetual,
 70 |       worldwide, non-exclusive, no-charge, royalty-free, irrevocable
 71 |       copyright license to reproduce, prepare Derivative Works of,
 72 |       publicly display, publicly perform, sublicense, and distribute the
 73 |       Work and such Derivative Works in Source or Object form.
 74 | 
 75 |    3. Grant of Patent License. Subject to the terms and conditions of
 76 |       this License, each Contributor hereby grants to You a perpetual,
 77 |       worldwide, non-exclusive, no-charge, royalty-free, irrevocable
 78 |       (except as stated in this section) patent license to make, have made,
 79 |       use, offer to sell, sell, import, and otherwise transfer the Work,
 80 |       where such license applies only to those patent claims licensable
 81 |       by such Contributor that are necessarily infringed by their
 82 |       Contribution(s) alone or by combination of their Contribution(s)
 83 |       with the Work to which such Contribution(s) was submitted. If You
 84 |       institute patent litigation against any entity (including a
 85 |       cross-claim or counterclaim in a lawsuit) alleging that the Work
 86 |       or a Contribution incorporated within the Work constitutes direct
 87 |       or contributory patent infringement, then any patent licenses
 88 |       granted to You under this License for that Work shall terminate
 89 |       as of the date such litigation is filed.
 90 | 
 91 |    4. Redistribution. You may reproduce and distribute copies of the
 92 |       Work or Derivative Works thereof in any medium, with or without
 93 |       modifications, and in Source or Object form, provided that You
 94 |       meet the following conditions:
 95 | 
 96 |       (a) You must give any other recipients of the Work or
 97 |           Derivative Works a copy of this License; and
 98 | 
 99 |       (b) You must cause any modified files to carry prominent notices
100 |           stating that You changed the files; and
101 | 
102 |       (c) You must retain, in the Source form of any Derivative Works
103 |           that You distribute, all copyright, patent, trademark, and
104 |           attribution notices from the Source form of the Work,
105 |           excluding those notices that do not pertain to any part of
106 |           the Derivative Works; and
107 | 
108 |       (d) If the Work includes a "NOTICE" text file as part of its
109 |           distribution, then any Derivative Works that You distribute must
110 |           include a readable copy of the attribution notices contained
111 |           within such NOTICE file, excluding those notices that do not
112 |           pertain to any part of the Derivative Works, in at least one
113 |           of the following places: within a NOTICE text file distributed
114 |           as part of the Derivative Works; within the Source form or
115 |           documentation, if provided along with the Derivative Works; or,
116 |           within a display generated by the Derivative Works, if and
117 |           wherever such third-party notices normally appear. The contents
118 |           of the NOTICE file are for informational purposes only and
119 |           do not modify the License. You may add Your own attribution
120 |           notices within Derivative Works that You distribute, alongside
121 |           or as an addendum to the NOTICE text from the Work, provided
122 |           that such additional attribution notices cannot be construed
123 |           as modifying the License.
124 | 
125 |       You may add Your own copyright statement to Your modifications and
126 |       may provide additional or different license terms and conditions
127 |       for use, reproduction, or distribution of Your modifications, or
128 |       for any such Derivative Works as a whole, provided Your use,
129 |       reproduction, and distribution of the Work otherwise complies with
130 |       the conditions stated in this License.
131 | 
132 |    5. Submission of Contributions. Unless You explicitly state otherwise,
133 |       any Contribution intentionally submitted for inclusion in the Work
134 |       by You to the Licensor shall be under the terms and conditions of
135 |       this License, without any additional terms or conditions.
136 |       Notwithstanding the above, nothing herein shall supersede or modify
137 |       the terms of any separate license agreement you may have executed
138 |       with Licensor regarding such Contributions.
139 | 
140 |    6. Trademarks. This License does not grant permission to use the trade
141 |       names, trademarks, service marks, or product names of the Licensor,
142 |       except as required for reasonable and customary use in describing the
143 |       origin of the Work and reproducing the content of the NOTICE file.
144 | 
145 |    7. Disclaimer of Warranty. Unless required by applicable law or
146 |       agreed to in writing, Licensor provides the Work (and each
147 |       Contributor provides its Contributions) on an "AS IS" BASIS,
148 |       WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
149 |       implied, including, without limitation, any warranties or conditions
150 |       of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
151 |       PARTICULAR PURPOSE. You are solely responsible for determining the
152 |       appropriateness of using or redistributing the Work and assume any
153 |       risks associated with Your exercise of permissions under this License.
154 | 
155 |    8. Limitation of Liability. In no event and under no legal theory,
156 |       whether in tort (including negligence), contract, or otherwise,
157 |       unless required by applicable law (such as deliberate and grossly
158 |       negligent acts) or agreed to in writing, shall any Contributor be
159 |       liable to You for damages, including any direct, indirect, special,
160 |       incidental, or consequential damages of any character arising as a
161 |       result of this License or out of the use or inability to use the
162 |       Work (including but not limited to damages for loss of goodwill,
163 |       work stoppage, computer failure or malfunction, or any and all
164 |       other commercial damages or losses), even if such Contributor
165 |       has been advised of the possibility of such damages.
166 | 
167 |    9. Accepting Warranty or Additional Liability. While redistributing
168 |       the Work or Derivative Works thereof, You may choose to offer,
169 |       and charge a fee for, acceptance of support, warranty, indemnity,
170 |       or other liability obligations and/or rights consistent with this
171 |       License. However, in accepting such obligations, You may act only
172 |       on Your own behalf and on Your sole responsibility, not on behalf
173 |       of any other Contributor, and only if You agree to indemnify,
174 |       defend, and hold each Contributor harmless for any liability
175 |       incurred by, or claims asserted against, such Contributor by reason
176 |       of your accepting any such warranty or additional liability.
177 | 
178 |    END OF TERMS AND CONDITIONS
179 | 
180 |    APPENDIX: How to apply the Apache License to your work.
181 | 
182 |       To apply the Apache License to your work, attach the following
183 |       boilerplate notice, with the fields enclosed by brackets "[]"
184 |       replaced with your own identifying information. (Don't include
185 |       the brackets!)  The text should be enclosed in the appropriate
186 |       comment syntax for the file format. We also recommend that a
187 |       file or class name and description of purpose be included on the
188 |       same "printed page" as the copyright notice for easier
189 |       identification within third-party archives.
190 | 
191 |    Copyright [yyyy] [name of copyright owner]
192 | 
193 |    Licensed under the Apache License, Version 2.0 (the "License");
194 |    you may not use this file except in compliance with the License.
195 |    You may obtain a copy of the License at
196 | 
197 |        http://www.apache.org/licenses/LICENSE-2.0
198 | 
199 |    Unless required by applicable law or agreed to in writing, software
200 |    distributed under the License is distributed on an "AS IS" BASIS,
201 |    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
202 |    See the License for the specific language governing permissions and
203 |    limitations under the License.
204 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | ## RL Intro
 2 | 
 3 | <details>
 4 | <summary>2019 edition - Gym intro, Genetics, CEM, Tabular DQN</summary>
 5 | <p>
 6 | 
 7 | #### 0. Gym interface
 8 | - `00-gym.ipynb` [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Scitator/rl-teaser/blob/master/2019/code/00-gym.ipynb)
 9 | 
10 | 
11 | #### 1. Genetic algorithm
12 | - [slides](./2019/slides/01-genetics.pdf)
13 | - `01-genetics.ipynb` [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Scitator/rl-teaser/blob/master/2019/code/01-genetics.ipynb)
14 | 
15 | ##### Additional materials
16 | * __[recommended]__ - awesome openai post about evolution strategies - [blog post](https://blog.openai.com/evolution-strategies/), [article](https://arxiv.org/abs/1703.03864)
17 | * Video on genetic algorithms - https://www.youtube.com/watch?v=ejxfTy4lI6I
18 | * Another guide to genetic algorithm - https://www.youtube.com/watch?v=zwYV11a__HQ
19 | * PDF on Differential evolution - http://jvanderw.une.edu.au/DE_1.pdf
20 | * Video on Ant Colony Algorithm - https://www.youtube.com/watch?v=D58nLNLkb0I
21 | * Longer video on Ant Colony Algorithm - https://www.youtube.com/watch?v=xpyKmjJuqhk
22 | 
23 | 
24 | #### 2. Cross Entropy Method
25 | - [slides](./2019/slides/02-cem.pdf)
26 | - `02-cem.ipynb` [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Scitator/rl-teaser/blob/master/2019/code/02-cem.ipynb)
27 | 
28 | ##### Additional materials
29 | * __[main]__ Video-intro by David Silver - https://www.youtube.com/watch?v=2pWv7GOvuf0
30 | * Optional lecture by David Silver - https://www.youtube.com/watch?v=lfHX2hHRMVQ
31 | * __[recommended]__ - formal explanation of crossentropy method in [general](https://people.smp.uq.edu.au/DirkKroese/ps/CEEncycl.pdf) and for [optimization](https://people.smp.uq.edu.au/DirkKroese/ps/CEopt.pdf)
32 | 
33 | 
34 | #### 3. Tabular
35 | - [slides](./2019/slides/03-tabular.pdf)
36 | - `03-tabular.ipynb` [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Scitator/rl-teaser/blob/master/2019/code/03-tabular.ipynb)
37 | 
38 | ##### Additional materials
39 | * __[main]__ lecture by David Silver - [url](https://www.youtube.com/watch?v=Nd1-UUMVfz4)
40 | * Alternative lecture by Pieter Abbeel: [part 1](https://www.youtube.com/watch?v=i0o-ui1N35U), [part 2](https://www.youtube.com/watch?v=Csiiv6WGzKM)
41 | * Alternative lecture by John Schulmann: https://www.youtube.com/watch?v=IL3gVyJMmhg
42 | * Definitive guide in policy/value iteration from Sutton: start from page 81 [here](http://incompleteideas.net/sutton/book/bookdraft2017june19.pdf).
43 | 
44 | 
45 | #### 4. DQN
46 | - [slides](./2019/slides/04-dqn.pdf)
47 | - `04-dqn.ipynb` [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Scitator/rl-teaser/blob/master/2019/code/04-dqn.ipynb)
48 | 
49 | ##### Additional materials
50 | * Lecture by David Silver - [video part I](https://www.youtube.com/watch?v=PnHCvfgC_ZA), [video part II](https://www.youtube.com/watch?v=0g4j2k_Ggc4&t=43s)
51 | * Alternative lecture by Pieter Abbeel - [video](https://www.youtube.com/watch?v=ifma8G7LegE)
52 | * Alternative lecture by John Schulmann - [video](https://www.youtube.com/watch?v=IL3gVyJMmhg)
53 | * Blog post on q-learning Vs SARSA - [url](https://studywolf.wordpress.com/2013/07/01/reinforcement-learning-sarsa-vs-q-learning/)
54 | * N-step temporal difference from Sutton's book - [suttonbook](http://incompleteideas.net/book/RLbook2018.pdf) __chapter 7__
55 | * Eligibility traces from Sutton's book - [suttonbook](http://incompleteideas.net/book/RLbook2018.pdf) __chapter 12__
56 | * Blog post on eligibility traces - [url](http://pierrelucbacon.com/traces/)
57 | </p>
58 | </details>
59 | 
60 | 
61 | <details>
62 | <summary>2020 edition - Deep RL, DQN, DDPG</summary>
63 | <p>
64 | 
65 | </p>
66 | </details>
67 | 
68 | <details>
69 | <summary>Credits</summary>
70 | <p>
71 | 
72 | * [Berkeley CS188x](http://ai.berkeley.edu/home.html)
73 | * [David Silver's Reinforcement Learning Course](http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html)
74 | * [dennybritz/reinforcement-learning](https://github.com/dennybritz/reinforcement-learning)
75 | * [yandexdataschool/Practical_RL](https://github.com/yandexdataschool/Practical_RL)
76 | * [yandexdataschool/AgentNet](https://github.com/yandexdataschool/AgentNet)
77 | * [rl-course-experiments](https://github.com/Scitator/rl-course-experiments)
78 | * [RL-Adventure](https://github.com/higgsfield/RL-Adventure)
79 | * [RL-Adventure-2](https://github.com/higgsfield/RL-Adventure-2)
80 | 
81 | </p>
82 | </details>


--------------------------------------------------------------------------------