├── .github └── workflows │ └── manual.yml ├── .gitignore ├── CODEOWNERS ├── LICENSE ├── README.md └── notebooks ├── Discretization.ipynb ├── Discretization_Solution.ipynb ├── Tile_Coding.ipynb └── Tile_Coding_Solution.ipynb /.github/workflows/manual.yml: -------------------------------------------------------------------------------- 1 | # Workflow to ensure whenever a Github PR is submitted, 2 | # a JIRA ticket gets created automatically. 3 | name: Manual Workflow 4 | 5 | # Controls when the action will run. 6 | on: 7 | # Triggers the workflow on pull request events but only for the master branch 8 | pull_request_target: 9 | types: [assigned, opened, reopened] 10 | 11 | # Allows you to run this workflow manually from the Actions tab 12 | workflow_dispatch: 13 | 14 | jobs: 15 | test-transition-issue: 16 | name: Convert Github Issue to Jira Issue 17 | runs-on: ubuntu-latest 18 | steps: 19 | - name: Checkout 20 | uses: actions/checkout@master 21 | 22 | - name: Login 23 | uses: atlassian/gajira-login@master 24 | env: 25 | JIRA_BASE_URL: ${{ secrets.JIRA_BASE_URL }} 26 | JIRA_USER_EMAIL: ${{ secrets.JIRA_USER_EMAIL }} 27 | JIRA_API_TOKEN: ${{ secrets.JIRA_API_TOKEN }} 28 | 29 | - name: Create NEW JIRA ticket 30 | id: create 31 | uses: atlassian/gajira-create@master 32 | with: 33 | project: CONUPDATE 34 | issuetype: Task 35 | summary: | 36 | Github PR - nd101 v5 Deep Learning | Repo: ${{ github.repository }} | PR# ${{github.event.number}} 37 | description: | 38 | Repo link: https://github.com/${{ github.repository }} 39 | PR no. ${{ github.event.pull_request.number }} 40 | PR title: ${{ github.event.pull_request.title }} 41 | PR description: ${{ github.event.pull_request.description }} 42 | In addition, please resolve other issues, if any. 43 | fields: '{"components": [{"name":"nd101 - Deep Learning ND"}], "customfield_16449":"https://classroom.udacity.com/nanodegrees/nd101/dashboard/overview", "customfield_16450":"Resolve the PR", "labels": ["github"], "priority":{"id": "4"}}' 44 | 45 | - name: Log created issue 46 | run: echo "Issue ${{ steps.create.outputs.issue }} was created" 47 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Mac OS X 2 | .DS_Store 3 | 4 | # Byte-compiled / optimized / DLL files 5 | __pycache__/ 6 | *.py[cod] 7 | *$py.class 8 | 9 | # C extensions 10 | *.so 11 | 12 | # Distribution / packaging 13 | .Python 14 | env/ 15 | build/ 16 | develop-eggs/ 17 | dist/ 18 | downloads/ 19 | eggs/ 20 | .eggs/ 21 | lib/ 22 | lib64/ 23 | parts/ 24 | sdist/ 25 | var/ 26 | wheels/ 27 | *.egg-info/ 28 | .installed.cfg 29 | *.egg 30 | 31 | # PyInstaller 32 | # Usually these files are written by a python script from a template 33 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
34 | *.manifest 35 | *.spec 36 | 37 | # Installer logs 38 | pip-log.txt 39 | pip-delete-this-directory.txt 40 | 41 | # Unit test / coverage reports 42 | htmlcov/ 43 | .tox/ 44 | .coverage 45 | .coverage.* 46 | .cache 47 | nosetests.xml 48 | coverage.xml 49 | *.cover 50 | .hypothesis/ 51 | 52 | # Translations 53 | *.mo 54 | *.pot 55 | 56 | # Django stuff: 57 | *.log 58 | local_settings.py 59 | 60 | # Flask stuff: 61 | instance/ 62 | .webassets-cache 63 | 64 | # Scrapy stuff: 65 | .scrapy 66 | 67 | # Sphinx documentation 68 | docs/_build/ 69 | 70 | # PyBuilder 71 | target/ 72 | 73 | # Jupyter Notebook 74 | .ipynb_checkpoints 75 | 76 | # pyenv 77 | .python-version 78 | 79 | # celery beat schedule file 80 | celerybeat-schedule 81 | 82 | # SageMath parsed files 83 | *.sage.py 84 | 85 | # dotenv 86 | .env 87 | 88 | # virtualenv 89 | .venv 90 | venv/ 91 | ENV/ 92 | 93 | # Spyder project settings 94 | .spyderproject 95 | .spyproject 96 | 97 | # Rope project settings 98 | .ropeproject 99 | 100 | # mkdocs documentation 101 | /site 102 | 103 | # mypy 104 | .mypy_cache/ 105 | 106 | .github/** 107 | -------------------------------------------------------------------------------- /CODEOWNERS: -------------------------------------------------------------------------------- 1 | * @udacity/active-public-content -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 Udacity 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Reinforcement Learning 2 | 3 | Reinforcement learning material, code and exercises for [Udacity](https://www.udacity.com/) Nanodegree programs. 4 | -------------------------------------------------------------------------------- /notebooks/Discretization.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Discretization\n", 8 | "\n", 9 | "---\n", 10 | "\n", 11 | "In this notebook, you will deal with continuous state and action spaces by discretizing them. 
This will enable you to apply reinforcement learning algorithms that are only designed to work with discrete spaces.\n", 12 | "\n", 13 | "### 1. Import the Necessary Packages" 14 | ] 15 | }, 16 | { 17 | "cell_type": "code", 18 | "execution_count": null, 19 | "metadata": {}, 20 | "outputs": [], 21 | "source": [ 22 | "import sys\n", 23 | "import gym\n", 24 | "import numpy as np\n", 25 | "\n", 26 | "import pandas as pd\n", 27 | "import matplotlib.pyplot as plt\n", 28 | "\n", 29 | "# Set plotting options\n", 30 | "%matplotlib inline\n", 31 | "plt.style.use('ggplot')\n", 32 | "np.set_printoptions(precision=3, linewidth=120)" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "metadata": {}, 38 | "source": [ 39 | "### 2. Specify the Environment, and Explore the State and Action Spaces\n", 40 | "\n", 41 | "We'll use [OpenAI Gym](https://gym.openai.com/) environments to test and develop our algorithms. These simulate a variety of classic as well as contemporary reinforcement learning tasks. Let's use an environment that has a continuous state space, but a discrete action space." 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": null, 47 | "metadata": {}, 48 | "outputs": [], 49 | "source": [ 50 | "# Create an environment and set random seed\n", 51 | "env = gym.make('MountainCar-v0')\n", 52 | "env.seed(505);" 53 | ] 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "metadata": {}, 58 | "source": [ 59 | "Run the next code cell to watch a random agent." 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": null, 65 | "metadata": {}, 66 | "outputs": [], 67 | "source": [ 68 | "state = env.reset()\n", 69 | "score = 0\n", 70 | "for t in range(200):\n", 71 | " action = env.action_space.sample()\n", 72 | " env.render()\n", 73 | " state, reward, done, _ = env.step(action)\n", 74 | " score += reward\n", 75 | " if done:\n", 76 | " break \n", 77 | "print('Final score:', score)\n", 78 | "env.close()" 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": {}, 84 | "source": [ 85 | "In this notebook, you will train an agent to perform much better! For now, we can explore the state and action spaces, as well as sample them." 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": null, 91 | "metadata": {}, 92 | "outputs": [], 93 | "source": [ 94 | "# Explore state (observation) space\n", 95 | "print(\"State space:\", env.observation_space)\n", 96 | "print(\"- low:\", env.observation_space.low)\n", 97 | "print(\"- high:\", env.observation_space.high)" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": null, 103 | "metadata": {}, 104 | "outputs": [], 105 | "source": [ 106 | "# Generate some samples from the state space \n", 107 | "print(\"State space samples:\")\n", 108 | "print(np.array([env.observation_space.sample() for i in range(10)]))" 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": null, 114 | "metadata": {}, 115 | "outputs": [], 116 | "source": [ 117 | "# Explore the action space\n", 118 | "print(\"Action space:\", env.action_space)\n", 119 | "\n", 120 | "# Generate some samples from the action space\n", 121 | "print(\"Action space samples:\")\n", 122 | "print(np.array([env.action_space.sample() for i in range(10)]))" 123 | ] 124 | }, 125 | { 126 | "cell_type": "markdown", 127 | "metadata": {}, 128 | "source": [ 129 | "### 3. Discretize the State Space with a Uniform Grid\n", 130 | "\n", 131 | "We will discretize the space using a uniformly-spaced grid. 
Implement the following function to create such a grid, given the lower bounds (`low`), upper bounds (`high`), and number of desired `bins` along each dimension. It should return the split points for each dimension, which will be 1 less than the number of bins.\n", 132 | "\n", 133 | "For instance, if `low = [-1.0, -5.0]`, `high = [1.0, 5.0]`, and `bins = (10, 10)`, then your function should return the following list of 2 NumPy arrays:\n", 134 | "\n", 135 | "```\n", 136 | "[array([-0.8, -0.6, -0.4, -0.2, 0.0, 0.2, 0.4, 0.6, 0.8]),\n", 137 | " array([-4.0, -3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0])]\n", 138 | "```\n", 139 | "\n", 140 | "Note that the ends of `low` and `high` are **not** included in these split points. It is assumed that any value below the lowest split point maps to index `0` and any value above the highest split point maps to index `n-1`, where `n` is the number of bins along that dimension." 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": null, 146 | "metadata": {}, 147 | "outputs": [], 148 | "source": [ 149 | "def create_uniform_grid(low, high, bins=(10, 10)):\n", 150 | " \"\"\"Define a uniformly-spaced grid that can be used to discretize a space.\n", 151 | " \n", 152 | " Parameters\n", 153 | " ----------\n", 154 | " low : array_like\n", 155 | " Lower bounds for each dimension of the continuous space.\n", 156 | " high : array_like\n", 157 | " Upper bounds for each dimension of the continuous space.\n", 158 | " bins : tuple\n", 159 | " Number of bins along each corresponding dimension.\n", 160 | " \n", 161 | " Returns\n", 162 | " -------\n", 163 | " grid : list of array_like\n", 164 | " A list of arrays containing split points for each dimension.\n", 165 | " \"\"\"\n", 166 | " # TODO: Implement this\n", 167 | " pass\n", 168 | "\n", 169 | "\n", 170 | "low = [-1.0, -5.0]\n", 171 | "high = [1.0, 5.0]\n", 172 | "create_uniform_grid(low, high) # [test]" 173 | ] 174 | }, 175 | { 176 | "cell_type": "markdown", 177 | "metadata": {}, 178 | "source": [ 179 | "Now write a function that can convert samples from a continuous space into its equivalent discretized representation, given a grid like the one you created above. You can use the [`numpy.digitize()`](https://docs.scipy.org/doc/numpy-1.9.3/reference/generated/numpy.digitize.html) function for this purpose.\n", 180 | "\n", 181 | "Assume the grid is a list of NumPy arrays containing the following split points:\n", 182 | "```\n", 183 | "[array([-0.8, -0.6, -0.4, -0.2, 0.0, 0.2, 0.4, 0.6, 0.8]),\n", 184 | " array([-4.0, -3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0])]\n", 185 | "```\n", 186 | "\n", 187 | "Here are some potential samples and their corresponding discretized representations:\n", 188 | "```\n", 189 | "[-1.0 , -5.0] => [0, 0]\n", 190 | "[-0.81, -4.1] => [0, 0]\n", 191 | "[-0.8 , -4.0] => [1, 1]\n", 192 | "[-0.5 , 0.0] => [2, 5]\n", 193 | "[ 0.2 , -1.9] => [6, 3]\n", 194 | "[ 0.8 , 4.0] => [9, 9]\n", 195 | "[ 0.81, 4.1] => [9, 9]\n", 196 | "[ 1.0 , 5.0] => [9, 9]\n", 197 | "```\n", 198 | "\n", 199 | "**Note**: There may be one-off differences in binning due to floating-point inaccuracies when samples are close to grid boundaries, but that is alright." 
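One way to fill in the `create_uniform_grid()` stub above and the `discretize()` function described here, sketched with `np.linspace` for the split points and the `np.digitize()` call suggested in the text. This is only an illustrative sketch (the repository's `Discretization_Solution.ipynb` holds the reference answers), but it reproduces the example values quoted in this section:

```python
import numpy as np

def create_uniform_grid(low, high, bins=(10, 10)):
    """np.linspace gives bins+1 evenly spaced points per dimension;
    dropping both ends leaves the bins-1 interior split points."""
    return [np.linspace(low[d], high[d], bins[d] + 1)[1:-1] for d in range(len(bins))]

def discretize(sample, grid):
    """Map each continuous component to the index of the bin it falls into."""
    return tuple(int(np.digitize(s, splits)) for s, splits in zip(sample, grid))

# Quick check against the examples above
grid = create_uniform_grid([-1.0, -5.0], [1.0, 5.0])
print(grid)                            # [-0.8 ... 0.8] and [-4.0 ... 4.0]
print(discretize([0.2, -1.9], grid))   # (6, 3)
print(discretize([-1.0, -5.0], grid))  # (0, 0)
```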
200 | ] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "execution_count": null, 205 | "metadata": {}, 206 | "outputs": [], 207 | "source": [ 208 | "def discretize(sample, grid):\n", 209 | " \"\"\"Discretize a sample as per given grid.\n", 210 | " \n", 211 | " Parameters\n", 212 | " ----------\n", 213 | " sample : array_like\n", 214 | " A single sample from the (original) continuous space.\n", 215 | " grid : list of array_like\n", 216 | " A list of arrays containing split points for each dimension.\n", 217 | " \n", 218 | " Returns\n", 219 | " -------\n", 220 | " discretized_sample : array_like\n", 221 | " A sequence of integers with the same number of dimensions as sample.\n", 222 | " \"\"\"\n", 223 | " # TODO: Implement this\n", 224 | " pass\n", 225 | "\n", 226 | "\n", 227 | "# Test with a simple grid and some samples\n", 228 | "grid = create_uniform_grid([-1.0, -5.0], [1.0, 5.0])\n", 229 | "samples = np.array(\n", 230 | " [[-1.0 , -5.0],\n", 231 | " [-0.81, -4.1],\n", 232 | " [-0.8 , -4.0],\n", 233 | " [-0.5 , 0.0],\n", 234 | " [ 0.2 , -1.9],\n", 235 | " [ 0.8 , 4.0],\n", 236 | " [ 0.81, 4.1],\n", 237 | " [ 1.0 , 5.0]])\n", 238 | "discretized_samples = np.array([discretize(sample, grid) for sample in samples])\n", 239 | "print(\"\\nSamples:\", repr(samples), sep=\"\\n\")\n", 240 | "print(\"\\nDiscretized samples:\", repr(discretized_samples), sep=\"\\n\")" 241 | ] 242 | }, 243 | { 244 | "cell_type": "markdown", 245 | "metadata": {}, 246 | "source": [ 247 | "### 4. Visualization\n", 248 | "\n", 249 | "It might be helpful to visualize the original and discretized samples to get a sense of how much error you are introducing." 250 | ] 251 | }, 252 | { 253 | "cell_type": "code", 254 | "execution_count": null, 255 | "metadata": {}, 256 | "outputs": [], 257 | "source": [ 258 | "import matplotlib.collections as mc\n", 259 | "\n", 260 | "def visualize_samples(samples, discretized_samples, grid, low=None, high=None):\n", 261 | " \"\"\"Visualize original and discretized samples on a given 2-dimensional grid.\"\"\"\n", 262 | "\n", 263 | " fig, ax = plt.subplots(figsize=(10, 10))\n", 264 | " \n", 265 | " # Show grid\n", 266 | " ax.xaxis.set_major_locator(plt.FixedLocator(grid[0]))\n", 267 | " ax.yaxis.set_major_locator(plt.FixedLocator(grid[1]))\n", 268 | " ax.grid(True)\n", 269 | " \n", 270 | " # If bounds (low, high) are specified, use them to set axis limits\n", 271 | " if low is not None and high is not None:\n", 272 | " ax.set_xlim(low[0], high[0])\n", 273 | " ax.set_ylim(low[1], high[1])\n", 274 | " else:\n", 275 | " # Otherwise use first, last grid locations as low, high (for further mapping discretized samples)\n", 276 | " low = [splits[0] for splits in grid]\n", 277 | " high = [splits[-1] for splits in grid]\n", 278 | "\n", 279 | " # Map each discretized sample (which is really an index) to the center of corresponding grid cell\n", 280 | " grid_extended = np.hstack((np.array([low]).T, grid, np.array([high]).T)) # add low and high ends\n", 281 | " grid_centers = (grid_extended[:, 1:] + grid_extended[:, :-1]) / 2 # compute center of each grid cell\n", 282 | " locs = np.stack(grid_centers[i, discretized_samples[:, i]] for i in range(len(grid))).T # map discretized samples\n", 283 | "\n", 284 | " ax.plot(samples[:, 0], samples[:, 1], 'o') # plot original samples\n", 285 | " ax.plot(locs[:, 0], locs[:, 1], 's') # plot discretized samples in mapped locations\n", 286 | " ax.add_collection(mc.LineCollection(list(zip(samples, locs)), colors='orange')) # add a line connecting each 
original-discretized sample\n", 287 | " ax.legend(['original', 'discretized'])\n", 288 | "\n", 289 | " \n", 290 | "visualize_samples(samples, discretized_samples, grid, low, high)" 291 | ] 292 | }, 293 | { 294 | "cell_type": "markdown", 295 | "metadata": {}, 296 | "source": [ 297 | "Now that we have a way to discretize a state space, let's apply it to our reinforcement learning environment." 298 | ] 299 | }, 300 | { 301 | "cell_type": "code", 302 | "execution_count": null, 303 | "metadata": {}, 304 | "outputs": [], 305 | "source": [ 306 | "# Create a grid to discretize the state space\n", 307 | "state_grid = create_uniform_grid(env.observation_space.low, env.observation_space.high, bins=(10, 10))\n", 308 | "state_grid" 309 | ] 310 | }, 311 | { 312 | "cell_type": "code", 313 | "execution_count": null, 314 | "metadata": {}, 315 | "outputs": [], 316 | "source": [ 317 | "# Obtain some samples from the space, discretize them, and then visualize them\n", 318 | "state_samples = np.array([env.observation_space.sample() for i in range(10)])\n", 319 | "discretized_state_samples = np.array([discretize(sample, state_grid) for sample in state_samples])\n", 320 | "visualize_samples(state_samples, discretized_state_samples, state_grid,\n", 321 | " env.observation_space.low, env.observation_space.high)\n", 322 | "plt.xlabel('position'); plt.ylabel('velocity'); # axis labels for MountainCar-v0 state space" 323 | ] 324 | }, 325 | { 326 | "cell_type": "markdown", 327 | "metadata": {}, 328 | "source": [ 329 | "You might notice that if you have enough bins, the discretization doesn't introduce too much error into your representation. So we may be able to now apply a reinforcement learning algorithm (like Q-Learning) that operates on discrete spaces. Give it a shot to see how well it works!\n", 330 | "\n", 331 | "### 5. Q-Learning\n", 332 | "\n", 333 | "Provided below is a simple Q-Learning agent. Implement the `preprocess_state()` method to convert each continuous state sample to its corresponding discretized representation." 
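With `discretize()` from Section 3 in scope, the `preprocess_state()` TODO in the class below can be a one-liner. A minimal sketch, assuming that helper is implemented; the return type matters here (a tuple, so it can index the multi-dimensional Q-table directly):

```python
def preprocess_state(self, state):
    """Map a continuous state to its discretized representation."""
    # A tuple (not a list/array) so q_table[state] selects one row of action values
    return tuple(discretize(state, self.state_grid))
```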
334 | ] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "execution_count": null, 339 | "metadata": {}, 340 | "outputs": [], 341 | "source": [ 342 | "class QLearningAgent:\n", 343 | " \"\"\"Q-Learning agent that can act on a continuous state space by discretizing it.\"\"\"\n", 344 | "\n", 345 | " def __init__(self, env, state_grid, alpha=0.02, gamma=0.99,\n", 346 | " epsilon=1.0, epsilon_decay_rate=0.9995, min_epsilon=.01, seed=505):\n", 347 | " \"\"\"Initialize variables, create grid for discretization.\"\"\"\n", 348 | " # Environment info\n", 349 | " self.env = env\n", 350 | " self.state_grid = state_grid\n", 351 | " self.state_size = tuple(len(splits) + 1 for splits in self.state_grid) # n-dimensional state space\n", 352 | " self.action_size = self.env.action_space.n # 1-dimensional discrete action space\n", 353 | " self.seed = np.random.seed(seed)\n", 354 | " print(\"Environment:\", self.env)\n", 355 | " print(\"State space size:\", self.state_size)\n", 356 | " print(\"Action space size:\", self.action_size)\n", 357 | " \n", 358 | " # Learning parameters\n", 359 | " self.alpha = alpha # learning rate\n", 360 | " self.gamma = gamma # discount factor\n", 361 | " self.epsilon = self.initial_epsilon = epsilon # initial exploration rate\n", 362 | " self.epsilon_decay_rate = epsilon_decay_rate # how quickly should we decrease epsilon\n", 363 | " self.min_epsilon = min_epsilon\n", 364 | " \n", 365 | " # Create Q-table\n", 366 | " self.q_table = np.zeros(shape=(self.state_size + (self.action_size,)))\n", 367 | " print(\"Q table size:\", self.q_table.shape)\n", 368 | "\n", 369 | " def preprocess_state(self, state):\n", 370 | " \"\"\"Map a continuous state to its discretized representation.\"\"\"\n", 371 | " # TODO: Implement this\n", 372 | " pass\n", 373 | "\n", 374 | " def reset_episode(self, state):\n", 375 | " \"\"\"Reset variables for a new episode.\"\"\"\n", 376 | " # Gradually decrease exploration rate\n", 377 | " self.epsilon *= self.epsilon_decay_rate\n", 378 | " self.epsilon = max(self.epsilon, self.min_epsilon)\n", 379 | "\n", 380 | " # Decide initial action\n", 381 | " self.last_state = self.preprocess_state(state)\n", 382 | " self.last_action = np.argmax(self.q_table[self.last_state])\n", 383 | " return self.last_action\n", 384 | " \n", 385 | " def reset_exploration(self, epsilon=None):\n", 386 | " \"\"\"Reset exploration rate used when training.\"\"\"\n", 387 | " self.epsilon = epsilon if epsilon is not None else self.initial_epsilon\n", 388 | "\n", 389 | " def act(self, state, reward=None, done=None, mode='train'):\n", 390 | " \"\"\"Pick next action and update internal Q table (when mode != 'test').\"\"\"\n", 391 | " state = self.preprocess_state(state)\n", 392 | " if mode == 'test':\n", 393 | " # Test mode: Simply produce an action\n", 394 | " action = np.argmax(self.q_table[state])\n", 395 | " else:\n", 396 | " # Train mode (default): Update Q table, pick next action\n", 397 | " # Note: We update the Q table entry for the *last* (state, action) pair with current state, reward\n", 398 | " self.q_table[self.last_state + (self.last_action,)] += self.alpha * \\\n", 399 | " (reward + self.gamma * max(self.q_table[state]) - self.q_table[self.last_state + (self.last_action,)])\n", 400 | "\n", 401 | " # Exploration vs. 
exploitation\n", 402 | " do_exploration = np.random.uniform(0, 1) < self.epsilon\n", 403 | " if do_exploration:\n", 404 | " # Pick a random action\n", 405 | " action = np.random.randint(0, self.action_size)\n", 406 | " else:\n", 407 | " # Pick the best action from Q table\n", 408 | " action = np.argmax(self.q_table[state])\n", 409 | "\n", 410 | " # Roll over current state, action for next step\n", 411 | " self.last_state = state\n", 412 | " self.last_action = action\n", 413 | " return action\n", 414 | "\n", 415 | " \n", 416 | "q_agent = QLearningAgent(env, state_grid)" 417 | ] 418 | }, 419 | { 420 | "cell_type": "markdown", 421 | "metadata": {}, 422 | "source": [ 423 | "Let's also define a convenience function to run an agent on a given environment. When calling this function, you can pass in `mode='test'` to tell the agent not to learn." 424 | ] 425 | }, 426 | { 427 | "cell_type": "code", 428 | "execution_count": null, 429 | "metadata": {}, 430 | "outputs": [], 431 | "source": [ 432 | "def run(agent, env, num_episodes=20000, mode='train'):\n", 433 | " \"\"\"Run agent in given reinforcement learning environment and return scores.\"\"\"\n", 434 | " scores = []\n", 435 | " max_avg_score = -np.inf\n", 436 | " for i_episode in range(1, num_episodes+1):\n", 437 | " # Initialize episode\n", 438 | " state = env.reset()\n", 439 | " action = agent.reset_episode(state)\n", 440 | " total_reward = 0\n", 441 | " done = False\n", 442 | "\n", 443 | " # Roll out steps until done\n", 444 | " while not done:\n", 445 | " state, reward, done, info = env.step(action)\n", 446 | " total_reward += reward\n", 447 | " action = agent.act(state, reward, done, mode)\n", 448 | "\n", 449 | " # Save final score\n", 450 | " scores.append(total_reward)\n", 451 | " \n", 452 | " # Print episode stats\n", 453 | " if mode == 'train':\n", 454 | " if len(scores) > 100:\n", 455 | " avg_score = np.mean(scores[-100:])\n", 456 | " if avg_score > max_avg_score:\n", 457 | " max_avg_score = avg_score\n", 458 | "\n", 459 | " if i_episode % 100 == 0:\n", 460 | " print(\"\\rEpisode {}/{} | Max Average Score: {}\".format(i_episode, num_episodes, max_avg_score), end=\"\")\n", 461 | " sys.stdout.flush()\n", 462 | "\n", 463 | " return scores\n", 464 | "\n", 465 | "scores = run(q_agent, env)" 466 | ] 467 | }, 468 | { 469 | "cell_type": "markdown", 470 | "metadata": {}, 471 | "source": [ 472 | "The best way to analyze if your agent was learning the task is to plot the scores. It should generally increase as the agent goes through more episodes." 473 | ] 474 | }, 475 | { 476 | "cell_type": "code", 477 | "execution_count": null, 478 | "metadata": {}, 479 | "outputs": [], 480 | "source": [ 481 | "# Plot scores obtained per episode\n", 482 | "plt.plot(scores); plt.title(\"Scores\");" 483 | ] 484 | }, 485 | { 486 | "cell_type": "markdown", 487 | "metadata": {}, 488 | "source": [ 489 | "If the scores are noisy, it might be difficult to tell whether your agent is actually learning. To find the underlying trend, you may want to plot a rolling mean of the scores. Let's write a convenience function to plot both raw scores as well as a rolling mean." 
490 | ] 491 | }, 492 | { 493 | "cell_type": "code", 494 | "execution_count": null, 495 | "metadata": {}, 496 | "outputs": [], 497 | "source": [ 498 | "def plot_scores(scores, rolling_window=100):\n", 499 | " \"\"\"Plot scores and optional rolling mean using specified window.\"\"\"\n", 500 | " plt.plot(scores); plt.title(\"Scores\");\n", 501 | " rolling_mean = pd.Series(scores).rolling(rolling_window).mean()\n", 502 | " plt.plot(rolling_mean);\n", 503 | " return rolling_mean\n", 504 | "\n", 505 | "rolling_mean = plot_scores(scores)" 506 | ] 507 | }, 508 | { 509 | "cell_type": "markdown", 510 | "metadata": {}, 511 | "source": [ 512 | "You should observe the mean episode scores go up over time. Next, you can freeze learning and run the agent in test mode to see how well it performs." 513 | ] 514 | }, 515 | { 516 | "cell_type": "code", 517 | "execution_count": null, 518 | "metadata": { 519 | "scrolled": true 520 | }, 521 | "outputs": [], 522 | "source": [ 523 | "# Run in test mode and analyze scores obtained\n", 524 | "test_scores = run(q_agent, env, num_episodes=100, mode='test')\n", 525 | "print(\"[TEST] Completed {} episodes with avg. score = {}\".format(len(test_scores), np.mean(test_scores)))\n", 526 | "_ = plot_scores(test_scores, rolling_window=10)" 527 | ] 528 | }, 529 | { 530 | "cell_type": "markdown", 531 | "metadata": {}, 532 | "source": [ 533 | "It's also interesting to look at the final Q-table that is learned by the agent. Note that the Q-table is of size MxNxA, where (M, N) is the size of the state space, and A is the size of the action space. We are interested in the maximum Q-value for each state, and the corresponding (best) action associated with that value." 534 | ] 535 | }, 536 | { 537 | "cell_type": "code", 538 | "execution_count": null, 539 | "metadata": {}, 540 | "outputs": [], 541 | "source": [ 542 | "def plot_q_table(q_table):\n", 543 | " \"\"\"Visualize max Q-value for each state and corresponding action.\"\"\"\n", 544 | " q_image = np.max(q_table, axis=2) # max Q-value for each state\n", 545 | " q_actions = np.argmax(q_table, axis=2) # best action for each state\n", 546 | "\n", 547 | " fig, ax = plt.subplots(figsize=(10, 10))\n", 548 | " cax = ax.imshow(q_image, cmap='jet');\n", 549 | " cbar = fig.colorbar(cax)\n", 550 | " for x in range(q_image.shape[0]):\n", 551 | " for y in range(q_image.shape[1]):\n", 552 | " ax.text(x, y, q_actions[x, y], color='white',\n", 553 | " horizontalalignment='center', verticalalignment='center')\n", 554 | " ax.grid(False)\n", 555 | " ax.set_title(\"Q-table, size: {}\".format(q_table.shape))\n", 556 | " ax.set_xlabel('position')\n", 557 | " ax.set_ylabel('velocity')\n", 558 | "\n", 559 | "\n", 560 | "plot_q_table(q_agent.q_table)" 561 | ] 562 | }, 563 | { 564 | "cell_type": "markdown", 565 | "metadata": { 566 | "collapsed": true 567 | }, 568 | "source": [ 569 | "### 6. Modify the Grid\n", 570 | "\n", 571 | "Now it's your turn to play with the grid definition and see what gives you optimal results. Your agent's final performance is likely to get better if you use a finer grid, with more bins per dimension, at the cost of higher model complexity (more parameters to learn)." 
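For the cell below, one reasonable (though by no means unique) starting point is to keep the environment's own bounds and double the resolution along each dimension, for example:

```python
# 20x20 bins instead of 10x10: a 20*20*3 = 1200-entry Q-table for MountainCar-v0
state_grid_new = create_uniform_grid(env.observation_space.low,
                                     env.observation_space.high,
                                     bins=(20, 20))
```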
572 | ] 573 | }, 574 | { 575 | "cell_type": "code", 576 | "execution_count": null, 577 | "metadata": {}, 578 | "outputs": [], 579 | "source": [ 580 | "# TODO: Create a new agent with a different state space grid\n", 581 | "state_grid_new = create_uniform_grid(?, ?, bins=(?, ?))\n", 582 | "q_agent_new = QLearningAgent(env, state_grid_new)\n", 583 | "q_agent_new.scores = [] # initialize a list to store scores for this agent" 584 | ] 585 | }, 586 | { 587 | "cell_type": "code", 588 | "execution_count": null, 589 | "metadata": {}, 590 | "outputs": [], 591 | "source": [ 592 | "# Train it over a desired number of episodes and analyze scores\n", 593 | "# Note: This cell can be run multiple times, and scores will get accumulated\n", 594 | "q_agent_new.scores += run(q_agent_new, env, num_episodes=50000) # accumulate scores\n", 595 | "rolling_mean_new = plot_scores(q_agent_new.scores)" 596 | ] 597 | }, 598 | { 599 | "cell_type": "code", 600 | "execution_count": null, 601 | "metadata": {}, 602 | "outputs": [], 603 | "source": [ 604 | "# Run in test mode and analyze scores obtained\n", 605 | "test_scores = run(q_agent_new, env, num_episodes=100, mode='test')\n", 606 | "print(\"[TEST] Completed {} episodes with avg. score = {}\".format(len(test_scores), np.mean(test_scores)))\n", 607 | "_ = plot_scores(test_scores)" 608 | ] 609 | }, 610 | { 611 | "cell_type": "code", 612 | "execution_count": null, 613 | "metadata": {}, 614 | "outputs": [], 615 | "source": [ 616 | "# Visualize the learned Q-table\n", 617 | "plot_q_table(q_agent_new.q_table)" 618 | ] 619 | }, 620 | { 621 | "cell_type": "markdown", 622 | "metadata": {}, 623 | "source": [ 624 | "### 7. Watch a Smart Agent" 625 | ] 626 | }, 627 | { 628 | "cell_type": "code", 629 | "execution_count": null, 630 | "metadata": {}, 631 | "outputs": [], 632 | "source": [ 633 | "state = env.reset()\n", 634 | "score = 0\n", 635 | "for t in range(200):\n", 636 | " action = q_agent_new.act(state, mode='test')\n", 637 | " env.render()\n", 638 | " state, reward, done, _ = env.step(action)\n", 639 | " score += reward\n", 640 | " if done:\n", 641 | " break \n", 642 | "print('Final score:', score)\n", 643 | "env.close()" 644 | ] 645 | } 646 | ], 647 | "metadata": { 648 | "kernelspec": { 649 | "display_name": "Python 3", 650 | "language": "python", 651 | "name": "python3" 652 | }, 653 | "language_info": { 654 | "codemirror_mode": { 655 | "name": "ipython", 656 | "version": 3 657 | }, 658 | "file_extension": ".py", 659 | "mimetype": "text/x-python", 660 | "name": "python", 661 | "nbconvert_exporter": "python", 662 | "pygments_lexer": "ipython3", 663 | "version": "3.6.4" 664 | } 665 | }, 666 | "nbformat": 4, 667 | "nbformat_minor": 2 668 | } 669 | -------------------------------------------------------------------------------- /notebooks/Tile_Coding.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Tile Coding\n", 8 | "---\n", 9 | "\n", 10 | "Tile coding is an innovative way of discretizing a continuous space that enables better generalization compared to a single grid-based approach. The fundamental idea is to create several overlapping grids or _tilings_; then for any given sample value, you need only check which tiles it lies in. You can then encode the original continuous value by a vector of integer indices or bits that identifies each activated tile.\n", 11 | "\n", 12 | "### 1. 
Import the Necessary Packages" 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": null, 18 | "metadata": {}, 19 | "outputs": [], 20 | "source": [ 21 | "# Import common libraries\n", 22 | "import sys\n", 23 | "import gym\n", 24 | "import numpy as np\n", 25 | "import matplotlib.pyplot as plt\n", 26 | "\n", 27 | "# Set plotting options\n", 28 | "%matplotlib inline\n", 29 | "plt.style.use('ggplot')\n", 30 | "np.set_printoptions(precision=3, linewidth=120)" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "### 2. Specify the Environment, and Explore the State and Action Spaces\n", 38 | "\n", 39 | "We'll use [OpenAI Gym](https://gym.openai.com/) environments to test and develop our algorithms. These simulate a variety of classic as well as contemporary reinforcement learning tasks. Let's begin with an environment that has a continuous state space, but a discrete action space." 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": null, 45 | "metadata": {}, 46 | "outputs": [], 47 | "source": [ 48 | "# Create an environment\n", 49 | "env = gym.make('Acrobot-v1')\n", 50 | "env.seed(505);\n", 51 | "\n", 52 | "# Explore state (observation) space\n", 53 | "print(\"State space:\", env.observation_space)\n", 54 | "print(\"- low:\", env.observation_space.low)\n", 55 | "print(\"- high:\", env.observation_space.high)\n", 56 | "\n", 57 | "# Explore action space\n", 58 | "print(\"Action space:\", env.action_space)" 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "metadata": {}, 64 | "source": [ 65 | "Note that the state space is multi-dimensional, with most dimensions ranging from -1 to 1 (positions of the two joints), while the final two dimensions have a larger range. How do we discretize such a space using tiles?\n", 66 | "\n", 67 | "### 3. Tiling\n", 68 | "\n", 69 | "Let's first design a way to create a single tiling for a given state space. This is very similar to a uniform grid! The only difference is that you should include an offset for each dimension that shifts the split points.\n", 70 | "\n", 71 | "For instance, if `low = [-1.0, -5.0]`, `high = [1.0, 5.0]`, `bins = (10, 10)`, and `offsets = (-0.1, 0.5)`, then return a list of 2 NumPy arrays (2 dimensions) each containing the following split points (9 split points per dimension):\n", 72 | "\n", 73 | "```\n", 74 | "[array([-0.9, -0.7, -0.5, -0.3, -0.1, 0.1, 0.3, 0.5, 0.7]),\n", 75 | " array([-3.5, -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.5, 4.5])]\n", 76 | "```\n", 77 | "\n", 78 | "Notice how the split points for the first dimension are offset by `-0.1`, and for the second dimension are offset by `+0.5`. This might mean that some of our tiles, especially along the perimeter, are partially outside the valid state space, but that is unavoidable and harmless." 
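A sketch of `create_tiling_grid()` for the stub below, using the same `np.linspace` approach as the uniform grid but shifting each dimension's split points by its offset; it reproduces the example split points quoted above:

```python
import numpy as np

def create_tiling_grid(low, high, bins=(10, 10), offsets=(0.0, 0.0)):
    """Uniformly spaced split points per dimension, shifted by the given offsets."""
    return [np.linspace(low[d], high[d], bins[d] + 1)[1:-1] + offsets[d]
            for d in range(len(bins))]

print(create_tiling_grid([-1.0, -5.0], [1.0, 5.0], bins=(10, 10), offsets=(-0.1, 0.5)))
# [-0.9, -0.7, ..., 0.7] and [-3.5, -2.5, ..., 4.5]
```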
79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": null, 84 | "metadata": {}, 85 | "outputs": [], 86 | "source": [ 87 | "def create_tiling_grid(low, high, bins=(10, 10), offsets=(0.0, 0.0)):\n", 88 | " \"\"\"Define a uniformly-spaced grid that can be used for tile-coding a space.\n", 89 | " \n", 90 | " Parameters\n", 91 | " ----------\n", 92 | " low : array_like\n", 93 | " Lower bounds for each dimension of the continuous space.\n", 94 | " high : array_like\n", 95 | " Upper bounds for each dimension of the continuous space.\n", 96 | " bins : tuple\n", 97 | " Number of bins or tiles along each corresponding dimension.\n", 98 | " offsets : tuple\n", 99 | " Split points for each dimension should be offset by these values.\n", 100 | " \n", 101 | " Returns\n", 102 | " -------\n", 103 | " grid : list of array_like\n", 104 | " A list of arrays containing split points for each dimension.\n", 105 | " \"\"\"\n", 106 | " # TODO: Implement this\n", 107 | " pass\n", 108 | "\n", 109 | "\n", 110 | "low = [-1.0, -5.0]\n", 111 | "high = [1.0, 5.0]\n", 112 | "create_tiling_grid(low, high, bins=(10, 10), offsets=(-0.1, 0.5)) # [test]" 113 | ] 114 | }, 115 | { 116 | "cell_type": "markdown", 117 | "metadata": {}, 118 | "source": [ 119 | "You can now use this function to define a set of tilings that are a little offset from each other." 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": null, 125 | "metadata": {}, 126 | "outputs": [], 127 | "source": [ 128 | "def create_tilings(low, high, tiling_specs):\n", 129 | " \"\"\"Define multiple tilings using the provided specifications.\n", 130 | "\n", 131 | " Parameters\n", 132 | " ----------\n", 133 | " low : array_like\n", 134 | " Lower bounds for each dimension of the continuous space.\n", 135 | " high : array_like\n", 136 | " Upper bounds for each dimension of the continuous space.\n", 137 | " tiling_specs : list of tuples\n", 138 | " A sequence of (bins, offsets) to be passed to create_tiling_grid().\n", 139 | "\n", 140 | " Returns\n", 141 | " -------\n", 142 | " tilings : list\n", 143 | " A list of tilings (grids), each produced by create_tiling_grid().\n", 144 | " \"\"\"\n", 145 | " # TODO: Implement this\n", 146 | " pass\n", 147 | "\n", 148 | "\n", 149 | "# Tiling specs: [(, ), ...]\n", 150 | "tiling_specs = [((10, 10), (-0.066, -0.33)),\n", 151 | " ((10, 10), (0.0, 0.0)),\n", 152 | " ((10, 10), (0.066, 0.33))]\n", 153 | "tilings = create_tilings(low, high, tiling_specs)" 154 | ] 155 | }, 156 | { 157 | "cell_type": "markdown", 158 | "metadata": {}, 159 | "source": [ 160 | "It may be hard to gauge whether you are getting desired results or not. So let's try to visualize these tilings." 
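Given a working `create_tiling_grid()`, the `create_tilings()` TODO above reduces to mapping it over the specs; a one-line sketch, assuming that helper is implemented:

```python
def create_tilings(low, high, tiling_specs):
    """Build one tiling grid per (bins, offsets) specification."""
    return [create_tiling_grid(low, high, bins, offsets) for bins, offsets in tiling_specs]
```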
161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": null, 166 | "metadata": {}, 167 | "outputs": [], 168 | "source": [ 169 | "from matplotlib.lines import Line2D\n", 170 | "\n", 171 | "def visualize_tilings(tilings):\n", 172 | " \"\"\"Plot each tiling as a grid.\"\"\"\n", 173 | " prop_cycle = plt.rcParams['axes.prop_cycle']\n", 174 | " colors = prop_cycle.by_key()['color']\n", 175 | " linestyles = ['-', '--', ':']\n", 176 | " legend_lines = []\n", 177 | "\n", 178 | " fig, ax = plt.subplots(figsize=(10, 10))\n", 179 | " for i, grid in enumerate(tilings):\n", 180 | " for x in grid[0]:\n", 181 | " l = ax.axvline(x=x, color=colors[i % len(colors)], linestyle=linestyles[i % len(linestyles)], label=i)\n", 182 | " for y in grid[1]:\n", 183 | " l = ax.axhline(y=y, color=colors[i % len(colors)], linestyle=linestyles[i % len(linestyles)])\n", 184 | " legend_lines.append(l)\n", 185 | " ax.grid('off')\n", 186 | " ax.legend(legend_lines, [\"Tiling #{}\".format(t) for t in range(len(legend_lines))], facecolor='white', framealpha=0.9)\n", 187 | " ax.set_title(\"Tilings\")\n", 188 | " return ax # return Axis object to draw on later, if needed\n", 189 | "\n", 190 | "\n", 191 | "visualize_tilings(tilings);" 192 | ] 193 | }, 194 | { 195 | "cell_type": "markdown", 196 | "metadata": {}, 197 | "source": [ 198 | "Great! Now that we have a way to generate these tilings, we can next write our encoding function that will convert any given continuous state value to a discrete vector.\n", 199 | "\n", 200 | "### 4. Tile Encoding\n", 201 | "\n", 202 | "Implement the following to produce a vector that contains the indices for each tile that the input state value belongs to. The shape of the vector can be the same as the arrangment of tiles you have, or it can be ultimately flattened for convenience.\n", 203 | "\n", 204 | "You can use the same `discretize()` function here from grid-based discretization, and simply call it for each tiling." 
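A sketch of the two TODOs that follow, reusing the same `np.digitize`-based `discretize()` once per tiling; how to flatten is an assumption here (the per-tiling index vectors are simply concatenated):

```python
import numpy as np

def discretize(sample, grid):
    """Map a continuous sample to per-dimension bin indices for one grid."""
    return tuple(int(np.digitize(s, splits)) for s, splits in zip(sample, grid))

def tile_encode(sample, tilings, flatten=False):
    """Encode a sample as one index vector per tiling (optionally flattened)."""
    encoded_sample = [discretize(sample, grid) for grid in tilings]
    return np.concatenate(encoded_sample) if flatten else encoded_sample
```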
205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": null, 210 | "metadata": {}, 211 | "outputs": [], 212 | "source": [ 213 | "def discretize(sample, grid):\n", 214 | " \"\"\"Discretize a sample as per given grid.\n", 215 | " \n", 216 | " Parameters\n", 217 | " ----------\n", 218 | " sample : array_like\n", 219 | " A single sample from the (original) continuous space.\n", 220 | " grid : list of array_like\n", 221 | " A list of arrays containing split points for each dimension.\n", 222 | " \n", 223 | " Returns\n", 224 | " -------\n", 225 | " discretized_sample : array_like\n", 226 | " A sequence of integers with the same number of dimensions as sample.\n", 227 | " \"\"\"\n", 228 | " # TODO: Implement this\n", 229 | " pass\n", 230 | "\n", 231 | "\n", 232 | "def tile_encode(sample, tilings, flatten=False):\n", 233 | " \"\"\"Encode given sample using tile-coding.\n", 234 | " \n", 235 | " Parameters\n", 236 | " ----------\n", 237 | " sample : array_like\n", 238 | " A single sample from the (original) continuous space.\n", 239 | " tilings : list\n", 240 | " A list of tilings (grids), each produced by create_tiling_grid().\n", 241 | " flatten : bool\n", 242 | " If true, flatten the resulting binary arrays into a single long vector.\n", 243 | "\n", 244 | " Returns\n", 245 | " -------\n", 246 | " encoded_sample : list or array_like\n", 247 | " A list of indice vectors, one for each tiling, or flattened into one.\n", 248 | " \"\"\"\n", 249 | " # TODO: Implement this\n", 250 | " pass\n", 251 | "\n", 252 | "\n", 253 | "# Test with some sample values\n", 254 | "samples = [(-1.2 , -5.1 ),\n", 255 | " (-0.75, 3.25),\n", 256 | " (-0.5 , 0.0 ),\n", 257 | " ( 0.25, -1.9 ),\n", 258 | " ( 0.15, -1.75),\n", 259 | " ( 0.75, 2.5 ),\n", 260 | " ( 0.7 , -3.7 ),\n", 261 | " ( 1.0 , 5.0 )]\n", 262 | "encoded_samples = [tile_encode(sample, tilings) for sample in samples]\n", 263 | "print(\"\\nSamples:\", repr(samples), sep=\"\\n\")\n", 264 | "print(\"\\nEncoded samples:\", repr(encoded_samples), sep=\"\\n\")" 265 | ] 266 | }, 267 | { 268 | "cell_type": "markdown", 269 | "metadata": {}, 270 | "source": [ 271 | "Note that we did not flatten the encoding above, which is why each sample's representation is a pair of indices for each tiling. This makes it easy to visualize it using the tilings." 
272 | ] 273 | }, 274 | { 275 | "cell_type": "code", 276 | "execution_count": null, 277 | "metadata": {}, 278 | "outputs": [], 279 | "source": [ 280 | "from matplotlib.patches import Rectangle\n", 281 | "\n", 282 | "def visualize_encoded_samples(samples, encoded_samples, tilings, low=None, high=None):\n", 283 | " \"\"\"Visualize samples by activating the respective tiles.\"\"\"\n", 284 | " samples = np.array(samples) # for ease of indexing\n", 285 | "\n", 286 | " # Show tiling grids\n", 287 | " ax = visualize_tilings(tilings)\n", 288 | " \n", 289 | " # If bounds (low, high) are specified, use them to set axis limits\n", 290 | " if low is not None and high is not None:\n", 291 | " ax.set_xlim(low[0], high[0])\n", 292 | " ax.set_ylim(low[1], high[1])\n", 293 | " else:\n", 294 | " # Pre-render (invisible) samples to automatically set reasonable axis limits, and use them as (low, high)\n", 295 | " ax.plot(samples[:, 0], samples[:, 1], 'o', alpha=0.0)\n", 296 | " low = [ax.get_xlim()[0], ax.get_ylim()[0]]\n", 297 | " high = [ax.get_xlim()[1], ax.get_ylim()[1]]\n", 298 | "\n", 299 | " # Map each encoded sample (which is really a list of indices) to the corresponding tiles it belongs to\n", 300 | " tilings_extended = [np.hstack((np.array([low]).T, grid, np.array([high]).T)) for grid in tilings] # add low and high ends\n", 301 | " tile_centers = [(grid_extended[:, 1:] + grid_extended[:, :-1]) / 2 for grid_extended in tilings_extended] # compute center of each tile\n", 302 | " tile_toplefts = [grid_extended[:, :-1] for grid_extended in tilings_extended] # compute topleft of each tile\n", 303 | " tile_bottomrights = [grid_extended[:, 1:] for grid_extended in tilings_extended] # compute bottomright of each tile\n", 304 | "\n", 305 | " prop_cycle = plt.rcParams['axes.prop_cycle']\n", 306 | " colors = prop_cycle.by_key()['color']\n", 307 | " for sample, encoded_sample in zip(samples, encoded_samples):\n", 308 | " for i, tile in enumerate(encoded_sample):\n", 309 | " # Shade the entire tile with a rectangle\n", 310 | " topleft = tile_toplefts[i][0][tile[0]], tile_toplefts[i][1][tile[1]]\n", 311 | " bottomright = tile_bottomrights[i][0][tile[0]], tile_bottomrights[i][1][tile[1]]\n", 312 | " ax.add_patch(Rectangle(topleft, bottomright[0] - topleft[0], bottomright[1] - topleft[1],\n", 313 | " color=colors[i], alpha=0.33))\n", 314 | "\n", 315 | " # In case sample is outside tile bounds, it may not have been highlighted properly\n", 316 | " if any(sample < topleft) or any(sample > bottomright):\n", 317 | " # So plot a point in the center of the tile and draw a connecting line\n", 318 | " cx, cy = tile_centers[i][0][tile[0]], tile_centers[i][1][tile[1]]\n", 319 | " ax.add_line(Line2D([sample[0], cx], [sample[1], cy], color=colors[i]))\n", 320 | " ax.plot(cx, cy, 's', color=colors[i])\n", 321 | " \n", 322 | " # Finally, plot original samples\n", 323 | " ax.plot(samples[:, 0], samples[:, 1], 'o', color='r')\n", 324 | "\n", 325 | " ax.margins(x=0, y=0) # remove unnecessary margins\n", 326 | " ax.set_title(\"Tile-encoded samples\")\n", 327 | " return ax\n", 328 | "\n", 329 | "visualize_encoded_samples(samples, encoded_samples, tilings);" 330 | ] 331 | }, 332 | { 333 | "cell_type": "markdown", 334 | "metadata": {}, 335 | "source": [ 336 | "Inspect the results and make sure you understand how the corresponding tiles are being chosen. Note that some samples may have one or more tiles in common.\n", 337 | "\n", 338 | "### 5. 
Q-Table with Tile Coding\n", 339 | "\n", 340 | "The next step is to design a special Q-table that is able to utilize this tile coding scheme. It should have the same kind of interface as a regular table, i.e. given a `` pair, it should return a ``. Similarly, it should also allow you to update the `` for a given `` pair (note that this should update all the tiles that `` belongs to).\n", 341 | "\n", 342 | "The `` supplied here is assumed to be from the original continuous state space, and `` is discrete (and integer index). The Q-table should internally convert the `` to its tile-coded representation when required." 343 | ] 344 | }, 345 | { 346 | "cell_type": "code", 347 | "execution_count": null, 348 | "metadata": {}, 349 | "outputs": [], 350 | "source": [ 351 | "class QTable:\n", 352 | " \"\"\"Simple Q-table.\"\"\"\n", 353 | "\n", 354 | " def __init__(self, state_size, action_size):\n", 355 | " \"\"\"Initialize Q-table.\n", 356 | " \n", 357 | " Parameters\n", 358 | " ----------\n", 359 | " state_size : tuple\n", 360 | " Number of discrete values along each dimension of state space.\n", 361 | " action_size : int\n", 362 | " Number of discrete actions in action space.\n", 363 | " \"\"\"\n", 364 | " self.state_size = state_size\n", 365 | " self.action_size = action_size\n", 366 | "\n", 367 | " # TODO: Create Q-table, initialize all Q-values to zero\n", 368 | " # Note: If state_size = (9, 9), action_size = 2, q_table.shape should be (9, 9, 2)\n", 369 | " \n", 370 | " print(\"QTable(): size =\", self.q_table.shape)\n", 371 | "\n", 372 | "\n", 373 | "class TiledQTable:\n", 374 | " \"\"\"Composite Q-table with an internal tile coding scheme.\"\"\"\n", 375 | " \n", 376 | " def __init__(self, low, high, tiling_specs, action_size):\n", 377 | " \"\"\"Create tilings and initialize internal Q-table(s).\n", 378 | " \n", 379 | " Parameters\n", 380 | " ----------\n", 381 | " low : array_like\n", 382 | " Lower bounds for each dimension of state space.\n", 383 | " high : array_like\n", 384 | " Upper bounds for each dimension of state space.\n", 385 | " tiling_specs : list of tuples\n", 386 | " A sequence of (bins, offsets) to be passed to create_tilings() along with low, high.\n", 387 | " action_size : int\n", 388 | " Number of discrete actions in action space.\n", 389 | " \"\"\"\n", 390 | " self.tilings = create_tilings(low, high, tiling_specs)\n", 391 | " self.state_sizes = [tuple(len(splits)+1 for splits in tiling_grid) for tiling_grid in self.tilings]\n", 392 | " self.action_size = action_size\n", 393 | " self.q_tables = [QTable(state_size, self.action_size) for state_size in self.state_sizes]\n", 394 | " print(\"TiledQTable(): no. 
of internal tables = \", len(self.q_tables))\n", 395 | " \n", 396 | " def get(self, state, action):\n", 397 | " \"\"\"Get Q-value for given pair.\n", 398 | " \n", 399 | " Parameters\n", 400 | " ----------\n", 401 | " state : array_like\n", 402 | " Vector representing the state in the original continuous space.\n", 403 | " action : int\n", 404 | " Index of desired action.\n", 405 | " \n", 406 | " Returns\n", 407 | " -------\n", 408 | " value : float\n", 409 | " Q-value of given pair, averaged from all internal Q-tables.\n", 410 | " \"\"\"\n", 411 | " # TODO: Encode state to get tile indices\n", 412 | " \n", 413 | " # TODO: Retrieve q-value for each tiling, and return their average\n", 414 | " pass\n", 415 | "\n", 416 | " def update(self, state, action, value, alpha=0.1):\n", 417 | " \"\"\"Soft-update Q-value for given pair to value.\n", 418 | " \n", 419 | " Instead of overwriting Q(state, action) with value, perform soft-update:\n", 420 | " Q(state, action) = alpha * value + (1.0 - alpha) * Q(state, action)\n", 421 | " \n", 422 | " Parameters\n", 423 | " ----------\n", 424 | " state : array_like\n", 425 | " Vector representing the state in the original continuous space.\n", 426 | " action : int\n", 427 | " Index of desired action.\n", 428 | " value : float\n", 429 | " Desired Q-value for pair.\n", 430 | " alpha : float\n", 431 | " Update factor to perform soft-update, in [0.0, 1.0] range.\n", 432 | " \"\"\"\n", 433 | " # TODO: Encode state to get tile indices\n", 434 | " \n", 435 | " # TODO: Update q-value for each tiling by update factor alpha\n", 436 | " pass\n", 437 | "\n", 438 | "\n", 439 | "# Test with a sample Q-table\n", 440 | "tq = TiledQTable(low, high, tiling_specs, 2)\n", 441 | "s1 = 3; s2 = 4; a = 0; q = 1.0\n", 442 | "print(\"[GET] Q({}, {}) = {}\".format(samples[s1], a, tq.get(samples[s1], a))) # check value at sample = s1, action = a\n", 443 | "print(\"[UPDATE] Q({}, {}) = {}\".format(samples[s2], a, q)); tq.update(samples[s2], a, q) # update value for sample with some common tile(s)\n", 444 | "print(\"[GET] Q({}, {}) = {}\".format(samples[s1], a, tq.get(samples[s1], a))) # check value again, should be slightly updated" 445 | ] 446 | }, 447 | { 448 | "cell_type": "markdown", 449 | "metadata": {}, 450 | "source": [ 451 | "If you update the q-value for a particular state (say, `(0.25, -1.91)`) and action (say, `0`), then you should notice the q-value of a nearby state (e.g. `(0.15, -1.75)` and same action) has changed as well! This is how tile-coding is able to generalize values across the state space better than a single uniform grid." 452 | ] 453 | }, 454 | { 455 | "cell_type": "markdown", 456 | "metadata": {}, 457 | "source": [ 458 | "### 6. Implement a Q-Learning Agent using Tile-Coding\n", 459 | "\n", 460 | "Now it's your turn to apply this discretization technique to design and test a complete learning agent! 
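As a starting point for this final exercise, here is one hedged sketch (not the reference solution; see `Tile_Coding_Solution.ipynb`) of how the remaining `QTable`/`TiledQTable` TODOs above could be completed, following their docstrings: `get()` averages the value across tilings, and `update()` soft-updates each tiling's entry. It assumes `create_tilings()` and `tile_encode()` are implemented as earlier in this notebook. An agent can then mirror the epsilon-greedy `QLearningAgent` from the Discretization notebook, reading action values with `[tq.get(state, a) for a in range(self.action_size)]` and writing back the TD target `reward + gamma * max_a Q(next_state, a)` via `tq.update(state, action, target)`.

```python
import numpy as np

class QTable:
    """Simple Q-table for one tiling, with all values initialized to zero."""
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        # e.g. state_size=(9, 9), action_size=2 -> q_table.shape == (9, 9, 2)
        self.q_table = np.zeros(shape=self.state_size + (self.action_size,))

class TiledQTable:
    """Composite Q-table: one QTable per tiling, addressed via tile coding."""
    def __init__(self, low, high, tiling_specs, action_size):
        self.tilings = create_tilings(low, high, tiling_specs)
        self.state_sizes = [tuple(len(splits) + 1 for splits in grid) for grid in self.tilings]
        self.action_size = action_size
        self.q_tables = [QTable(size, action_size) for size in self.state_sizes]

    def get(self, state, action):
        """Average Q-value for the given state-action pair over all tilings."""
        encoded_state = tile_encode(state, self.tilings)
        return np.mean([q.q_table[tuple(idx) + (action,)]
                        for idx, q in zip(encoded_state, self.q_tables)])

    def update(self, state, action, value, alpha=0.1):
        """Soft-update each tiling's entry for this state-action pair towards value."""
        encoded_state = tile_encode(state, self.tilings)
        for idx, q in zip(encoded_state, self.q_tables):
            old = q.q_table[tuple(idx) + (action,)]
            q.q_table[tuple(idx) + (action,)] = alpha * value + (1.0 - alpha) * old
```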
" 461 | ] 462 | }, 463 | { 464 | "cell_type": "code", 465 | "execution_count": null, 466 | "metadata": {}, 467 | "outputs": [], 468 | "source": [] 469 | } 470 | ], 471 | "metadata": { 472 | "kernelspec": { 473 | "display_name": "Python 3", 474 | "language": "python", 475 | "name": "python3" 476 | }, 477 | "language_info": { 478 | "codemirror_mode": { 479 | "name": "ipython", 480 | "version": 3 481 | }, 482 | "file_extension": ".py", 483 | "mimetype": "text/x-python", 484 | "name": "python", 485 | "nbconvert_exporter": "python", 486 | "pygments_lexer": "ipython3", 487 | "version": "3.6.4" 488 | } 489 | }, 490 | "nbformat": 4, 491 | "nbformat_minor": 2 492 | } --------------------------------------------------------------------------------