├── Dynamic-Programming ├── Dynamic_Programming.ipynb ├── algorithms.pdf ├── check_test.py ├── frozenlake.py └── plot_utils.py └── README.md /Dynamic-Programming/Dynamic_Programming.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Dynamic Programming\n", 8 | "\n", 9 | "In this notebook, you will write your own implementations of many classical dynamic programming algorithms. \n", 10 | "\n", 11 | "While we have provided some starter code, you are welcome to erase these hints and write your code from scratch.\n", 12 | "\n", 13 | "---\n", 14 | "\n", 15 | "### Part 0: Explore FrozenLakeEnv\n", 16 | "\n", 17 | "We begin by importing the necessary packages." 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": 1, 23 | "metadata": {}, 24 | "outputs": [], 25 | "source": [ 26 | "import numpy as np\n", 27 | "import copy\n", 28 | "\n", 29 | "import check_test\n", 30 | "from frozenlake import FrozenLakeEnv\n", 31 | "from plot_utils import plot_values" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | "Use the code cell below to create an instance of the [FrozenLake](https://github.com/openai/gym/blob/master/gym/envs/toy_text/frozen_lake.py) environment." 
39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": 2, 44 | "metadata": {}, 45 | "outputs": [], 46 | "source": [ 47 | "env = FrozenLakeEnv()" 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": {}, 53 | "source": [ 54 | "The agent moves through a $4 \\times 4$ gridworld, with states numbered as follows:\n", 55 | "```\n", 56 | "[[ 0 1 2 3]\n", 57 | " [ 4 5 6 7]\n", 58 | " [ 8 9 10 11]\n", 59 | " [12 13 14 15]]\n", 60 | "```\n", 61 | "and the agent has 4 potential actions:\n", 62 | "```\n", 63 | "LEFT = 0\n", 64 | "DOWN = 1\n", 65 | "RIGHT = 2\n", 66 | "UP = 3\n", 67 | "```\n", 68 | "\n", 69 | "Thus, $\\mathcal{S}^+ = \\{0, 1, \\ldots, 15\\}$, and $\\mathcal{A} = \\{0, 1, 2, 3\\}$. Verify this by running the code cell below." 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": 3, 75 | "metadata": {}, 76 | "outputs": [ 77 | { 78 | "name": "stdout", 79 | "output_type": "stream", 80 | "text": [ 81 | "Discrete(16)\n", 82 | "Discrete(4)\n", 83 | "16\n", 84 | "4\n" 85 | ] 86 | } 87 | ], 88 | "source": [ 89 | "# print the state space and action space\n", 90 | "print(env.observation_space)\n", 91 | "print(env.action_space)\n", 92 | "\n", 93 | "# print the total number of states and actions\n", 94 | "print(env.nS)\n", 95 | "print(env.nA)" 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": {}, 101 | "source": [ 102 | "Dynamic programming assumes that the agent has full knowledge of the MDP. We have already amended the `frozenlake.py` file to make the one-step dynamics accessible to the agent. \n", 103 | "\n", 104 | "Execute the code cell below to return the one-step dynamics corresponding to a particular state and action. In particular, `env.P[1][0]` returns the probability of each possible reward and next state, if the agent is in state 1 of the gridworld and decides to go left." 
105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": 4, 110 | "metadata": {}, 111 | "outputs": [ 112 | { 113 | "data": { 114 | "text/plain": [ 115 | "[(0.3333333333333333, 1, 0.0, False),\n", 116 | " (0.3333333333333333, 0, 0.0, False),\n", 117 | " (0.3333333333333333, 5, 0.0, True)]" 118 | ] 119 | }, 120 | "execution_count": 4, 121 | "metadata": {}, 122 | "output_type": "execute_result" 123 | } 124 | ], 125 | "source": [ 126 | "env.P[1][0]" 127 | ] 128 | }, 129 | { 130 | "cell_type": "markdown", 131 | "metadata": {}, 132 | "source": [ 133 | "Each entry takes the form \n", 134 | "```\n", 135 | "prob, next_state, reward, done\n", 136 | "```\n", 137 | "where: \n", 138 | "- `prob` details the conditional probability of the corresponding (`next_state`, `reward`) pair, and\n", 139 | "- `done` is `True` if the `next_state` is a terminal state, and otherwise `False`.\n", 140 | "\n", 141 | "Thus, we can interpret `env.P[1][0]` as follows:\n", 142 | "$$\n", 143 | "\\mathbb{P}(S_{t+1}=s',R_{t+1}=r|S_t=1,A_t=0) = \\begin{cases}\n", 144 | " \\frac{1}{3} \\text{ if } s'=1, r=0\\\\\n", 145 | " \\frac{1}{3} \\text{ if } s'=0, r=0\\\\\n", 146 | " \\frac{1}{3} \\text{ if } s'=5, r=0\\\\\n", 147 | " 0 \\text{ else}\n", 148 | " \\end{cases}\n", 149 | "$$\n", 150 | "\n", 151 | "To understand the value of `env.P[1][0]`, note that when you create a FrozenLake environment, it takes as an (optional) argument `is_slippery`, which defaults to `True`. \n", 152 | "\n", 153 | "To see this, change the first line in the notebook from `env = FrozenLakeEnv()` to `env = FrozenLakeEnv(is_slippery=False)`. Then, when you check `env.P[1][0]`, it should look like what you expect (i.e., `env.P[1][0] = [(1.0, 0, 0.0, False)]`).\n", 154 | "\n", 155 | "The default value for the `is_slippery` argument is `True`, and so `env = FrozenLakeEnv()` is equivalent to `env = FrozenLakeEnv(is_slippery=True)`. 
In the event that `is_slippery=True`, you see that this can result in the agent moving in a direction that it did not intend (where the idea is that the ground is *slippery*, and so the agent can slide to a location other than the one it wanted).\n", 156 | "\n", 157 | "Feel free to change the code cell above to explore how the environment behaves in response to other (state, action) pairs. \n", 158 | "\n", 159 | "Before proceeding to the next part, make sure that you set `is_slippery=True`, so that your implementations below will work with the slippery environment!" 160 | ] 161 | }, 162 | { 163 | "cell_type": "markdown", 164 | "metadata": {}, 165 | "source": [ 166 | "### Part 1: Iterative Policy Evaluation\n", 167 | "\n", 168 | "In this section, you will write your own implementation of iterative policy evaluation.\n", 169 | "\n", 170 | "Your algorithm should accept four arguments as **input**:\n", 171 | "- `env`: This is an instance of an OpenAI Gym environment, where `env.P` returns the one-step dynamics.\n", 172 | "- `policy`: This is a 2D numpy array with `policy.shape[0]` equal to the number of states (`env.nS`), and `policy.shape[1]` equal to the number of actions (`env.nA`). `policy[s][a]` returns the probability that the agent takes action `a` while in state `s` under the policy.\n", 173 | "- `gamma`: This is the discount rate. It must be a value between 0 and 1, inclusive (default value: `1`).\n", 174 | "- `theta`: This is a very small positive number that is used to decide if the estimate has sufficiently converged to the true value function (default value: `1e-8`).\n", 175 | "\n", 176 | "The algorithm returns as **output**:\n", 177 | "- `V`: This is a 1D numpy array with `V.shape[0]` equal to the number of states (`env.nS`). `V[s]` contains the estimated value of state `s` under the input policy.\n", 178 | "\n", 179 | "Please complete the function in the code cell below." 
180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": 5, 185 | "metadata": {}, 186 | "outputs": [], 187 | "source": [ 188 | "def policy_evaluation(env, policy, gamma=1, theta=1e-8):\n", 189 | " V = np.zeros(env.nS)\n", 190 | " while True:\n", 191 | " delta = 0\n", 192 | " for s in range(env.nS):\n", 193 | " Vs = 0\n", 194 | " for a, action_prob in enumerate(policy[s]):\n", 195 | " for prob, next_state, reward, done in env.P[s][a]:\n", 196 | " Vs += action_prob * prob * (reward + gamma * V[next_state])\n", 197 | " delta = max(delta, np.abs(V[s]-Vs))\n", 198 | " V[s] = Vs\n", 199 | " if delta < theta:\n", 200 | " break\n", 201 | " return V" 202 | ] 203 | }, 204 | { 205 | "cell_type": "markdown", 206 | "metadata": {}, 207 | "source": [ 208 | "We will evaluate the equiprobable random policy $\\pi$, where $\\pi(a|s) = \\frac{1}{|\\mathcal{A}(s)|}$ for all $s\\in\\mathcal{S}$ and $a\\in\\mathcal{A}(s)$. \n", 209 | "\n", 210 | "Use the code cell below to specify this policy in the variable `random_policy`." 211 | ] 212 | }, 213 | { 214 | "cell_type": "code", 215 | "execution_count": 6, 216 | "metadata": {}, 217 | "outputs": [], 218 | "source": [ 219 | "random_policy = np.ones([env.nS, env.nA]) / env.nA" 220 | ] 221 | }, 222 | { 223 | "cell_type": "markdown", 224 | "metadata": {}, 225 | "source": [ 226 | "Run the next code cell to evaluate the equiprobable random policy and visualize the output. The state-value function has been reshaped to match the shape of the gridworld." 227 | ] 228 | }, 229 | { 230 | "cell_type": "code", 231 | "execution_count": 7, 232 | "metadata": {}, 233 | "outputs": [ 234 | { 235 | "data": { 236 | "image/png": "\n", 237 | "text/plain": [ 238 | "
" 239 | ] 240 | }, 241 | "metadata": {}, 242 | "output_type": "display_data" 243 | } 244 | ], 245 | "source": [ 246 | "# evaluate the policy \n", 247 | "V = policy_evaluation(env, random_policy)\n", 248 | "\n", 249 | "plot_values(V)" 250 | ] 251 | }, 252 | { 253 | "cell_type": "markdown", 254 | "metadata": {}, 255 | "source": [ 256 | "Run the code cell below to test your function. If the code cell returns **PASSED**, then you have implemented the function correctly! \n", 257 | "\n", 258 | "**Note:** In order to ensure accurate results, make sure that your `policy_evaluation` function satisfies the requirements outlined above (with four inputs, a single output, and with the default values of the input arguments unchanged)." 259 | ] 260 | }, 261 | { 262 | "cell_type": "code", 263 | "execution_count": 8, 264 | "metadata": {}, 265 | "outputs": [ 266 | { 267 | "data": { 268 | "text/markdown": [ 269 | "**PASSED**" 270 | ], 271 | "text/plain": [ 272 | "" 273 | ] 274 | }, 275 | "metadata": {}, 276 | "output_type": "display_data" 277 | } 278 | ], 279 | "source": [ 280 | "check_test.run_check('policy_evaluation_check', policy_evaluation)" 281 | ] 282 | }, 283 | { 284 | "cell_type": "markdown", 285 | "metadata": {}, 286 | "source": [ 287 | "### Part 2: Obtain $q_\\pi$ from $v_\\pi$\n", 288 | "\n", 289 | "In this section, you will write a function that takes the state-value function estimate as input, along with some state $s\\in\\mathcal{S}$. It returns the **row in the action-value function** corresponding to the input state $s\\in\\mathcal{S}$. That is, your function should accept as input both $v_\\pi$ and $s$, and return $q_\\pi(s,a)$ for all $a\\in\\mathcal{A}(s)$.\n", 290 | "\n", 291 | "Your algorithm should accept four arguments as **input**:\n", 292 | "- `env`: This is an instance of an OpenAI Gym environment, where `env.P` returns the one-step dynamics.\n", 293 | "- `V`: This is a 1D numpy array with `V.shape[0]` equal to the number of states (`env.nS`). 
`V[s]` contains the estimated value of state `s`.\n", 294 | "- `s`: This is an integer corresponding to a state in the environment. It should be a value between `0` and `(env.nS)-1`, inclusive.\n", 295 | "- `gamma`: This is the discount rate. It must be a value between 0 and 1, inclusive (default value: `1`).\n", 296 | "\n", 297 | "The algorithm returns as **output**:\n", 298 | "- `q`: This is a 1D numpy array with `q.shape[0]` equal to the number of actions (`env.nA`). `q[a]` contains the (estimated) value of state `s` and action `a`.\n", 299 | "\n", 300 | "Please complete the function in the code cell below." 301 | ] 302 | }, 303 | { 304 | "cell_type": "code", 305 | "execution_count": 9, 306 | "metadata": {}, 307 | "outputs": [], 308 | "source": [ 309 | "def q_from_v(env, V, s, gamma=1):\n", 310 | " q = np.zeros(env.nA)\n", 311 | " for a in range(env.nA):\n", 312 | " for prob, next_state, reward, done in env.P[s][a]:\n", 313 | " q[a] += prob * (reward + gamma * V[next_state])\n", 314 | " return q" 315 | ] 316 | }, 317 | { 318 | "cell_type": "markdown", 319 | "metadata": {}, 320 | "source": [ 321 | "Run the code cell below to print the action-value function corresponding to the above state-value function." 322 | ] 323 | }, 324 | { 325 | "cell_type": "code", 326 | "execution_count": 10, 327 | "metadata": {}, 328 | "outputs": [ 329 | { 330 | "name": "stdout", 331 | "output_type": "stream", 332 | "text": [ 333 | "Action-Value Function:\n", 334 | "[[0.0147094 0.01393978 0.01393978 0.01317015]\n", 335 | " [0.00852356 0.01163091 0.0108613 0.01550788]\n", 336 | " [0.02444514 0.02095298 0.02406033 0.01435346]\n", 337 | " [0.01047649 0.01047649 0.00698432 0.01396865]\n", 338 | " [0.02166487 0.01701828 0.01624865 0.01006281]\n", 339 | " [0. 0. 0. 0. ]\n", 340 | " [0.05433538 0.04735105 0.05433538 0.00698432]\n", 341 | " [0. 0. 0. 0. 
]\n", 342 | " [0.01701828 0.04099204 0.03480619 0.04640826]\n", 343 | " [0.07020885 0.11755991 0.10595784 0.05895312]\n", 344 | " [0.18940421 0.17582037 0.16001424 0.04297382]\n", 345 | " [0. 0. 0. 0. ]\n", 346 | " [0. 0. 0. 0. ]\n", 347 | " [0.08799677 0.20503718 0.23442716 0.17582037]\n", 348 | " [0.25238823 0.53837051 0.52711478 0.43929118]\n", 349 | " [0. 0. 0. 0. ]]\n" 350 | ] 351 | } 352 | ], 353 | "source": [ 354 | "Q = np.zeros([env.nS, env.nA])\n", 355 | "for s in range(env.nS):\n", 356 | " Q[s] = q_from_v(env, V, s)\n", 357 | "print(\"Action-Value Function:\")\n", 358 | "print(Q)" 359 | ] 360 | }, 361 | { 362 | "cell_type": "markdown", 363 | "metadata": {}, 364 | "source": [ 365 | "Run the code cell below to test your function. If the code cell returns **PASSED**, then you have implemented the function correctly! \n", 366 | "\n", 367 | "**Note:** In order to ensure accurate results, make sure that the `q_from_v` function satisfies the requirements outlined above (with four inputs, a single output, and with the default values of the input arguments unchanged)." 368 | ] 369 | }, 370 | { 371 | "cell_type": "code", 372 | "execution_count": 11, 373 | "metadata": {}, 374 | "outputs": [ 375 | { 376 | "data": { 377 | "text/markdown": [ 378 | "**PASSED**" 379 | ], 380 | "text/plain": [ 381 | "" 382 | ] 383 | }, 384 | "metadata": {}, 385 | "output_type": "display_data" 386 | } 387 | ], 388 | "source": [ 389 | "check_test.run_check('q_from_v_check', q_from_v)" 390 | ] 391 | }, 392 | { 393 | "cell_type": "markdown", 394 | "metadata": {}, 395 | "source": [ 396 | "### Part 3: Policy Improvement\n", 397 | "\n", 398 | "In this section, you will write your own implementation of policy improvement. 
\n", 399 | "\n", 400 | "Your algorithm should accept three arguments as **input**:\n", 401 | "- `env`: This is an instance of an OpenAI Gym environment, where `env.P` returns the one-step dynamics.\n", 402 | "- `V`: This is a 1D numpy array with `V.shape[0]` equal to the number of states (`env.nS`). `V[s]` contains the estimated value of state `s`.\n", 403 | "- `gamma`: This is the discount rate. It must be a value between 0 and 1, inclusive (default value: `1`).\n", 404 | "\n", 405 | "The algorithm returns as **output**:\n", 406 | "- `policy`: This is a 2D numpy array with `policy.shape[0]` equal to the number of states (`env.nS`), and `policy.shape[1]` equal to the number of actions (`env.nA`). `policy[s][a]` returns the probability that the agent takes action `a` while in state `s` under the policy.\n", 407 | "\n", 408 | "Please complete the function in the code cell below. You are encouraged to use the `q_from_v` function you implemented above." 409 | ] 410 | }, 411 | { 412 | "cell_type": "code", 413 | "execution_count": 12, 414 | "metadata": {}, 415 | "outputs": [], 416 | "source": [ 417 | "def policy_improvement(env, V, gamma=1):\n", 418 | " policy = np.zeros([env.nS, env.nA]) / env.nA\n", 419 | " for s in range(env.nS):\n", 420 | " q = q_from_v(env, V, s, gamma)\n", 421 | " \n", 422 | " # OPTION 1: construct a deterministic policy \n", 423 | " # policy[s][np.argmax(q)] = 1\n", 424 | " \n", 425 | " # OPTION 2: construct a stochastic policy that puts equal probability on maximizing actions\n", 426 | " best_a = np.argwhere(q==np.max(q)).flatten()\n", 427 | " policy[s] = np.sum([np.eye(env.nA)[i] for i in best_a], axis=0)/len(best_a)\n", 428 | " \n", 429 | " return policy" 430 | ] 431 | }, 432 | { 433 | "cell_type": "markdown", 434 | "metadata": {}, 435 | "source": [ 436 | "Run the code cell below to test your function. If the code cell returns **PASSED**, then you have implemented the function correctly! 
\n", 437 | "\n", 438 | "**Note:** In order to ensure accurate results, make sure that the `policy_improvement` function satisfies the requirements outlined above (with three inputs, a single output, and with the default values of the input arguments unchanged).\n", 439 | "\n", 440 | "Before moving on to the next part of the notebook, you are strongly encouraged to check out the solution in **Dynamic_Programming_Solution.ipynb**. There are many correct ways to approach this function!" 441 | ] 442 | }, 443 | { 444 | "cell_type": "code", 445 | "execution_count": 13, 446 | "metadata": {}, 447 | "outputs": [ 448 | { 449 | "data": { 450 | "text/markdown": [ 451 | "**PASSED**" 452 | ], 453 | "text/plain": [ 454 | "" 455 | ] 456 | }, 457 | "metadata": {}, 458 | "output_type": "display_data" 459 | } 460 | ], 461 | "source": [ 462 | "check_test.run_check('policy_improvement_check', policy_improvement)" 463 | ] 464 | }, 465 | { 466 | "cell_type": "markdown", 467 | "metadata": {}, 468 | "source": [ 469 | "### Part 4: Policy Iteration\n", 470 | "\n", 471 | "In this section, you will write your own implementation of policy iteration. The algorithm returns the optimal policy, along with its corresponding state-value function.\n", 472 | "\n", 473 | "Your algorithm should accept three arguments as **input**:\n", 474 | "- `env`: This is an instance of an OpenAI Gym environment, where `env.P` returns the one-step dynamics.\n", 475 | "- `gamma`: This is the discount rate. It must be a value between 0 and 1, inclusive (default value: `1`).\n", 476 | "- `theta`: This is a very small positive number that is used to decide if the policy evaluation step has sufficiently converged to the true value function (default value: `1e-8`).\n", 477 | "\n", 478 | "The algorithm returns as **output**:\n", 479 | "- `policy`: This is a 2D numpy array with `policy.shape[0]` equal to the number of states (`env.nS`), and `policy.shape[1]` equal to the number of actions (`env.nA`). 
`policy[s][a]` returns the probability that the agent takes action `a` while in state `s` under the policy.\n", 480 | "- `V`: This is a 1D numpy array with `V.shape[0]` equal to the number of states (`env.nS`). `V[s]` contains the estimated value of state `s`.\n", 481 | "\n", 482 | "Please complete the function in the code cell below. You are strongly encouraged to use the `policy_evaluation` and `policy_improvement` functions you implemented above." 483 | ] 484 | }, 485 | { 486 | "cell_type": "code", 487 | "execution_count": 14, 488 | "metadata": {}, 489 | "outputs": [], 490 | "source": [ 491 | "def policy_iteration(env, gamma=1, theta=1e-8):\n", 492 | " policy = np.ones([env.nS, env.nA]) / env.nA\n", 493 | " while True:\n", 494 | " V = policy_evaluation(env, policy, gamma, theta)\n", 495 | " new_policy = policy_improvement(env, V, gamma)\n", 496 | " \n", 497 | " # OPTION 1: stop if the policy is unchanged after an improvement step\n", 498 | " if (new_policy == policy).all():\n", 499 | " break\n", 500 | " \n", 501 | " # OPTION 2: stop if the value function estimates for successive policies have converged\n", 502 | " # if np.max(abs(policy_evaluation(env, policy, gamma) - policy_evaluation(env, new_policy, gamma))) < theta*1e2:\n", 503 | " # break\n", 504 | " \n", 505 | " policy = copy.copy(new_policy)\n", 506 | " return policy, V" 507 | ] 508 | }, 509 | { 510 | "cell_type": "markdown", 511 | "metadata": {}, 512 | "source": [ 513 | "Run the next code cell to solve the MDP and visualize the output. The optimal state-value function has been reshaped to match the shape of the gridworld.\n", 514 | "\n", 515 | "**Compare the optimal state-value function to the state-value function from Part 1 of this notebook**. 
_Is the optimal state-value function consistently greater than or equal to the state-value function for the equiprobable random policy?_" 516 | ] 517 | }, 518 | { 519 | "cell_type": "code", 520 | "execution_count": 15, 521 | "metadata": {}, 522 | "outputs": [ 523 | { 524 | "name": "stdout", 525 | "output_type": "stream", 526 | "text": [ 527 | "\n", 528 | "Optimal Policy (LEFT = 0, DOWN = 1, RIGHT = 2, UP = 3):\n", 529 | "[[1. 0. 0. 0. ]\n", 530 | " [0. 0. 0. 1. ]\n", 531 | " [0. 0. 0. 1. ]\n", 532 | " [0. 0. 0. 1. ]\n", 533 | " [1. 0. 0. 0. ]\n", 534 | " [0.25 0.25 0.25 0.25]\n", 535 | " [0.5 0. 0.5 0. ]\n", 536 | " [0.25 0.25 0.25 0.25]\n", 537 | " [0. 0. 0. 1. ]\n", 538 | " [0. 1. 0. 0. ]\n", 539 | " [1. 0. 0. 0. ]\n", 540 | " [0.25 0.25 0.25 0.25]\n", 541 | " [0.25 0.25 0.25 0.25]\n", 542 | " [0. 0. 1. 0. ]\n", 543 | " [0. 1. 0. 0. ]\n", 544 | " [0.25 0.25 0.25 0.25]] \n", 545 | "\n" 546 | ] 547 | }, 548 | { 549 | "data": { 550 | "image/png": "\n", 551 | "text/plain": [ 552 | "
" 553 | ] 554 | }, 555 | "metadata": {}, 556 | "output_type": "display_data" 557 | } 558 | ], 559 | "source": [ 560 | "# obtain the optimal policy and optimal state-value function\n", 561 | "policy_pi, V_pi = policy_iteration(env)\n", 562 | "\n", 563 | "# print the optimal policy\n", 564 | "print(\"\\nOptimal Policy (LEFT = 0, DOWN = 1, RIGHT = 2, UP = 3):\")\n", 565 | "print(policy_pi,\"\\n\")\n", 566 | "\n", 567 | "plot_values(V_pi)" 568 | ] 569 | }, 570 | { 571 | "cell_type": "markdown", 572 | "metadata": {}, 573 | "source": [ 574 | "Run the code cell below to test your function. If the code cell returns **PASSED**, then you have implemented the function correctly! \n", 575 | "\n", 576 | "**Note:** In order to ensure accurate results, make sure that the `policy_iteration` function satisfies the requirements outlined above (with three inputs, two outputs, and with the default values of the input arguments unchanged)." 577 | ] 578 | }, 579 | { 580 | "cell_type": "code", 581 | "execution_count": 16, 582 | "metadata": {}, 583 | "outputs": [ 584 | { 585 | "data": { 586 | "text/markdown": [ 587 | "**PASSED**" 588 | ], 589 | "text/plain": [ 590 | "" 591 | ] 592 | }, 593 | "metadata": {}, 594 | "output_type": "display_data" 595 | } 596 | ], 597 | "source": [ 598 | "check_test.run_check('policy_iteration_check', policy_iteration)" 599 | ] 600 | }, 601 | { 602 | "cell_type": "markdown", 603 | "metadata": {}, 604 | "source": [ 605 | "### Part 5: Truncated Policy Iteration\n", 606 | "\n", 607 | "In this section, you will write your own implementation of truncated policy iteration. \n", 608 | "\n", 609 | "You will begin by implementing truncated policy evaluation. 
Your algorithm should accept five arguments as **input**:\n", 610 | "- `env`: This is an instance of an OpenAI Gym environment, where `env.P` returns the one-step dynamics.\n", 611 | "- `policy`: This is a 2D numpy array with `policy.shape[0]` equal to the number of states (`env.nS`), and `policy.shape[1]` equal to the number of actions (`env.nA`). `policy[s][a]` returns the probability that the agent takes action `a` while in state `s` under the policy.\n", 612 | "- `V`: This is a 1D numpy array with `V.shape[0]` equal to the number of states (`env.nS`). `V[s]` contains the estimated value of state `s`.\n", 613 | "- `max_it`: This is a positive integer that corresponds to the number of sweeps through the state space (default value: `1`).\n", 614 | "- `gamma`: This is the discount rate. It must be a value between 0 and 1, inclusive (default value: `1`).\n", 615 | "\n", 616 | "The algorithm returns as **output**:\n", 617 | "- `V`: This is a 1D numpy array with `V.shape[0]` equal to the number of states (`env.nS`). `V[s]` contains the estimated value of state `s`.\n", 618 | "\n", 619 | "Please complete the function in the code cell below." 620 | ] 621 | }, 622 | { 623 | "cell_type": "code", 624 | "execution_count": 17, 625 | "metadata": {}, 626 | "outputs": [], 627 | "source": [ 628 | "def truncated_policy_evaluation(env, policy, V, max_it=1, gamma=1):\n", 629 | " num_it=0\n", 630 | " while num_it < max_it:\n", 631 | " for s in range(env.nS):\n", 632 | " v = 0\n", 633 | " q = q_from_v(env, V, s, gamma)\n", 634 | " for a, action_prob in enumerate(policy[s]):\n", 635 | " v += action_prob * q[a]\n", 636 | " V[s] = v\n", 637 | " num_it += 1\n", 638 | " return V" 639 | ] 640 | }, 641 | { 642 | "cell_type": "markdown", 643 | "metadata": {}, 644 | "source": [ 645 | "Next, you will implement truncated policy iteration. 
Your algorithm should accept four arguments as **input**:\n", 646 | "- `env`: This is an instance of an OpenAI Gym environment, where `env.P` returns the one-step dynamics.\n", 647 | "- `max_it`: This is a positive integer that corresponds to the number of sweeps through the state space (default value: `1`).\n", 648 | "- `gamma`: This is the discount rate. It must be a value between 0 and 1, inclusive (default value: `1`).\n", 649 | "- `theta`: This is a very small positive number that is used for the stopping criterion (default value: `1e-8`).\n", 650 | "\n", 651 | "The algorithm returns as **output**:\n", 652 | "- `policy`: This is a 2D numpy array with `policy.shape[0]` equal to the number of states (`env.nS`), and `policy.shape[1]` equal to the number of actions (`env.nA`). `policy[s][a]` returns the probability that the agent takes action `a` while in state `s` under the policy.\n", 653 | "- `V`: This is a 1D numpy array with `V.shape[0]` equal to the number of states (`env.nS`). `V[s]` contains the estimated value of state `s`.\n", 654 | "\n", 655 | "Please complete the function in the code cell below." 656 | ] 657 | }, 658 | { 659 | "cell_type": "code", 660 | "execution_count": 18, 661 | "metadata": {}, 662 | "outputs": [], 663 | "source": [ 664 | "def truncated_policy_iteration(env, max_it=1, gamma=1, theta=1e-8):\n", 665 | " V = np.zeros(env.nS)\n", 666 | " policy = np.zeros([env.nS, env.nA]) / env.nA\n", 667 | " while True:\n", 668 | " policy = policy_improvement(env, V, gamma)\n", 669 | " old_V = copy.copy(V)\n", 670 | " V = truncated_policy_evaluation(env, policy, V, max_it, gamma)\n", 671 | " if max(abs(V-old_V)) < theta:\n", 672 | " break\n", 673 | " return policy, V" 674 | ] 675 | }, 676 | { 677 | "cell_type": "markdown", 678 | "metadata": {}, 679 | "source": [ 680 | "Run the next code cell to solve the MDP and visualize the output. 
The state-value function has been reshaped to match the shape of the gridworld.\n", 681 | "\n", 682 | "Play with the value of the `max_it` argument. Do you always end with the optimal state-value function?" 683 | ] 684 | }, 685 | { 686 | "cell_type": "code", 687 | "execution_count": 19, 688 | "metadata": {}, 689 | "outputs": [ 690 | { 691 | "name": "stdout", 692 | "output_type": "stream", 693 | "text": [ 694 | "\n", 695 | "Optimal Policy (LEFT = 0, DOWN = 1, RIGHT = 2, UP = 3):\n", 696 | "[[1. 0. 0. 0. ]\n", 697 | " [0. 0. 0. 1. ]\n", 698 | " [0. 0. 0. 1. ]\n", 699 | " [0. 0. 0. 1. ]\n", 700 | " [1. 0. 0. 0. ]\n", 701 | " [0.25 0.25 0.25 0.25]\n", 702 | " [0.5 0. 0.5 0. ]\n", 703 | " [0.25 0.25 0.25 0.25]\n", 704 | " [0. 0. 0. 1. ]\n", 705 | " [0. 1. 0. 0. ]\n", 706 | " [1. 0. 0. 0. ]\n", 707 | " [0.25 0.25 0.25 0.25]\n", 708 | " [0.25 0.25 0.25 0.25]\n", 709 | " [0. 0. 1. 0. ]\n", 710 | " [0. 1. 0. 0. ]\n", 711 | " [0.25 0.25 0.25 0.25]] \n", 712 | "\n" 713 | ] 714 | }, 715 | { 716 | "data": { 717 | "image/png": "\n", 718 | "text/plain": [ 719 | "
" 720 | ] 721 | }, 722 | "metadata": {}, 723 | "output_type": "display_data" 724 | } 725 | ], 726 | "source": [ 727 | "policy_tpi, V_tpi = truncated_policy_iteration(env, max_it=2)\n", 728 | "\n", 729 | "# print the optimal policy\n", 730 | "print(\"\\nOptimal Policy (LEFT = 0, DOWN = 1, RIGHT = 2, UP = 3):\")\n", 731 | "print(policy_tpi,\"\\n\")\n", 732 | "\n", 733 | "# plot the optimal state-value function\n", 734 | "plot_values(V_tpi)" 735 | ] 736 | }, 737 | { 738 | "cell_type": "markdown", 739 | "metadata": {}, 740 | "source": [ 741 | "Run the code cell below to test your function. If the code cell returns **PASSED**, then you have implemented the function correctly! \n", 742 | "\n", 743 | "**Note:** In order to ensure accurate results, make sure that the `truncated_policy_iteration` function satisfies the requirements outlined above (with four inputs, two outputs, and with the default values of the input arguments unchanged)." 744 | ] 745 | }, 746 | { 747 | "cell_type": "code", 748 | "execution_count": 20, 749 | "metadata": {}, 750 | "outputs": [ 751 | { 752 | "data": { 753 | "text/markdown": [ 754 | "**PASSED**" 755 | ], 756 | "text/plain": [ 757 | "" 758 | ] 759 | }, 760 | "metadata": {}, 761 | "output_type": "display_data" 762 | } 763 | ], 764 | "source": [ 765 | "check_test.run_check('truncated_policy_iteration_check', truncated_policy_iteration)" 766 | ] 767 | }, 768 | { 769 | "cell_type": "markdown", 770 | "metadata": {}, 771 | "source": [ 772 | "### Part 6: Value Iteration\n", 773 | "\n", 774 | "In this section, you will write your own implementation of value iteration.\n", 775 | "\n", 776 | "Your algorithm should accept three arguments as input:\n", 777 | "- `env`: This is an instance of an OpenAI Gym environment, where `env.P` returns the one-step dynamics.\n", 778 | "- `gamma`: This is the discount rate. 
It must be a value between 0 and 1, inclusive (default value: `1`).\n", 779 | "- `theta`: This is a very small positive number that is used for the stopping criterion (default value: `1e-8`).\n", 780 | "\n", 781 | "The algorithm returns as **output**:\n", 782 | "- `policy`: This is a 2D numpy array with `policy.shape[0]` equal to the number of states (`env.nS`), and `policy.shape[1]` equal to the number of actions (`env.nA`). `policy[s][a]` returns the probability that the agent takes action `a` while in state `s` under the policy.\n", 783 | "- `V`: This is a 1D numpy array with `V.shape[0]` equal to the number of states (`env.nS`). `V[s]` contains the estimated value of state `s`." 784 | ] 785 | }, 786 | { 787 | "cell_type": "code", 788 | "execution_count": 21, 789 | "metadata": {}, 790 | "outputs": [], 791 | "source": [ 792 | "def value_iteration(env, gamma=1, theta=1e-8):\n", 793 | " V = np.zeros(env.nS)\n", 794 | " while True:\n", 795 | " delta = 0\n", 796 | " for s in range(env.nS):\n", 797 | " v = V[s]\n", 798 | " V[s] = max(q_from_v(env, V, s, gamma))\n", 799 | " delta = max(delta,abs(V[s]-v))\n", 800 | " if delta < theta:\n", 801 | " break\n", 802 | " policy = policy_improvement(env, V, gamma)\n", 803 | " return policy, V" 804 | ] 805 | }, 806 | { 807 | "cell_type": "markdown", 808 | "metadata": {}, 809 | "source": [ 810 | "Use the next code cell to solve the MDP and visualize the output. The state-value function has been reshaped to match the shape of the gridworld." 811 | ] 812 | }, 813 | { 814 | "cell_type": "code", 815 | "execution_count": 22, 816 | "metadata": {}, 817 | "outputs": [ 818 | { 819 | "name": "stdout", 820 | "output_type": "stream", 821 | "text": [ 822 | "\n", 823 | "Optimal Policy (LEFT = 0, DOWN = 1, RIGHT = 2, UP = 3):\n", 824 | "[[1. 0. 0. 0. ]\n", 825 | " [0. 0. 0. 1. ]\n", 826 | " [0. 0. 0. 1. ]\n", 827 | " [0. 0. 0. 1. ]\n", 828 | " [1. 0. 0. 0. ]\n", 829 | " [0.25 0.25 0.25 0.25]\n", 830 | " [0.5 0. 0.5 0. 
]\n", 831 | " [0.25 0.25 0.25 0.25]\n", 832 | " [0. 0. 0. 1. ]\n", 833 | " [0. 1. 0. 0. ]\n", 834 | " [1. 0. 0. 0. ]\n", 835 | " [0.25 0.25 0.25 0.25]\n", 836 | " [0.25 0.25 0.25 0.25]\n", 837 | " [0. 0. 1. 0. ]\n", 838 | " [0. 1. 0. 0. ]\n", 839 | " [0.25 0.25 0.25 0.25]] \n", 840 | "\n" 841 | ] 842 | }, 843 | { 844 | "data": { 845 | "image/png": "\n", 846 | "text/plain": [ 847 | "
" 848 | ] 849 | }, 850 | "metadata": {}, 851 | "output_type": "display_data" 852 | } 853 | ], 854 | "source": [ 855 | "policy_vi, V_vi = value_iteration(env)\n", 856 | "\n", 857 | "# print the optimal policy\n", 858 | "print(\"\\nOptimal Policy (LEFT = 0, DOWN = 1, RIGHT = 2, UP = 3):\")\n", 859 | "print(policy_vi,\"\\n\")\n", 860 | "\n", 861 | "# plot the optimal state-value function\n", 862 | "plot_values(V_vi)" 863 | ] 864 | }, 865 | { 866 | "cell_type": "markdown", 867 | "metadata": {}, 868 | "source": [ 869 | "Run the code cell below to test your function. If the code cell returns **PASSED**, then you have implemented the function correctly! \n", 870 | "\n", 871 | "**Note:** In order to ensure accurate results, make sure that the `value_iteration` function satisfies the requirements outlined above (with three inputs, two outputs, and with the default values of the input arguments unchanged)." 872 | ] 873 | }, 874 | { 875 | "cell_type": "code", 876 | "execution_count": 23, 877 | "metadata": {}, 878 | "outputs": [ 879 | { 880 | "data": { 881 | "text/markdown": [ 882 | "**PASSED**" 883 | ], 884 | "text/plain": [ 885 | "" 886 | ] 887 | }, 888 | "metadata": {}, 889 | "output_type": "display_data" 890 | } 891 | ], 892 | "source": [ 893 | "check_test.run_check('value_iteration_check', value_iteration)" 894 | ] 895 | } 896 | ], 897 | "metadata": { 898 | "anaconda-cloud": {}, 899 | "kernelspec": { 900 | "display_name": "Python 3", 901 | "language": "python", 902 | "name": "python3" 903 | }, 904 | "language_info": { 905 | "codemirror_mode": { 906 | "name": "ipython", 907 | "version": 3 908 | }, 909 | "file_extension": ".py", 910 | "mimetype": "text/x-python", 911 | "name": "python", 912 | "nbconvert_exporter": "python", 913 | "pygments_lexer": "ipython3", 914 | "version": "3.6.8" 915 | } 916 | }, 917 | "nbformat": 4, 918 | "nbformat_minor": 2 919 | } 920 | -------------------------------------------------------------------------------- 
/Dynamic-Programming/algorithms.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/antonio-f/Dynamic-Programming/ca116016953dfff2cd4a8f1a7e9dd31ed660435e/Dynamic-Programming/algorithms.pdf -------------------------------------------------------------------------------- /Dynamic-Programming/check_test.py: -------------------------------------------------------------------------------- 1 | import unittest 2 | import copy 3 | from IPython.display import Markdown, display 4 | import numpy as np 5 | from frozenlake import FrozenLakeEnv 6 | 7 | def printmd(string): 8 | display(Markdown(string)) 9 | 10 | def policy_evaluation_soln(env, policy, gamma=1, theta=1e-8): 11 | V = np.zeros(env.nS) 12 | while True: 13 | delta = 0 14 | for s in range(env.nS): 15 | Vs = 0 16 | for a, action_prob in enumerate(policy[s]): 17 | for prob, next_state, reward, done in env.P[s][a]: 18 | Vs += action_prob * prob * (reward + gamma * V[next_state]) 19 | delta = max(delta, np.abs(V[s]-Vs)) 20 | V[s] = Vs 21 | if delta < theta: 22 | break 23 | return V 24 | 25 | def q_from_v_soln(env, V, s, gamma=1): 26 | q = np.zeros(env.nA) 27 | for a in range(env.nA): 28 | for prob, next_state, reward, done in env.P[s][a]: 29 | q[a] += prob * (reward + gamma * V[next_state]) 30 | return q 31 | 32 | def policy_improvement_soln(env, V, gamma=1): 33 | policy = np.zeros([env.nS, env.nA]) 34 | for s in range(env.nS): 35 | q = q_from_v_soln(env, V, s, gamma) 36 | best_a = np.argwhere(q==np.max(q)).flatten() 37 | policy[s] = np.sum([np.eye(env.nA)[i] for i in best_a], axis=0)/len(best_a) 38 | return policy 39 | 40 | def policy_iteration_soln(env, gamma=1, theta=1e-8): 41 | policy = np.ones([env.nS, env.nA]) / env.nA 42 | while True: 43 | V = policy_evaluation_soln(env, policy, gamma, theta) 44 | new_policy = policy_improvement_soln(env, V, gamma) 45 | if (new_policy == policy).all(): 46 | break 47 | policy = copy.copy(new_policy) 48 | return 
policy, V 49 | 50 | env = FrozenLakeEnv() 51 | random_policy = np.ones([env.nS, env.nA]) / env.nA 52 | 53 | class Tests(unittest.TestCase): 54 | 55 | def policy_evaluation_check(self, policy_evaluation): 56 | soln = policy_evaluation_soln(env, random_policy) 57 | to_check = policy_evaluation(env, random_policy) 58 | np.testing.assert_array_almost_equal(soln, to_check) 59 | 60 | def q_from_v_check(self, q_from_v): 61 | V = policy_evaluation_soln(env, random_policy) 62 | soln = np.zeros([env.nS, env.nA]) 63 | to_check = np.zeros([env.nS, env.nA]) 64 | for s in range(env.nS): 65 | soln[s] = q_from_v_soln(env, V, s) 66 | to_check[s] = q_from_v(env, V, s) 67 | np.testing.assert_array_almost_equal(soln, to_check) 68 | 69 | def policy_improvement_check(self, policy_improvement): 70 | V = policy_evaluation_soln(env, random_policy) 71 | new_policy = policy_improvement(env, V) 72 | new_V = policy_evaluation_soln(env, new_policy) 73 | self.assertTrue(np.all(new_V >= V)) 74 | 75 | def policy_iteration_check(self, policy_iteration): 76 | policy_soln, _ = policy_iteration_soln(env) 77 | policy_to_check, _ = policy_iteration(env) 78 | soln = policy_evaluation_soln(env, policy_soln) 79 | to_check = policy_evaluation_soln(env, policy_to_check) 80 | np.testing.assert_array_almost_equal(soln, to_check) 81 | 82 | def truncated_policy_iteration_check(self, truncated_policy_iteration): 83 | self.policy_iteration_check(truncated_policy_iteration) 84 | 85 | def value_iteration_check(self, value_iteration): 86 | self.policy_iteration_check(value_iteration) 87 | 88 | check = Tests() 89 | 90 | def run_check(check_name, func): 91 | try: 92 | getattr(check, check_name)(func) 93 | except check.failureException as e: 94 | printmd('**PLEASE TRY AGAIN**') 95 | return 96 | printmd('**PASSED**') -------------------------------------------------------------------------------- /Dynamic-Programming/frozenlake.py: -------------------------------------------------------------------------------- 1 | 
import numpy as np 2 | import sys 3 | from six import StringIO, b 4 | 5 | from gym import utils 6 | from gym.envs.toy_text import discrete 7 | 8 | LEFT = 0 9 | DOWN = 1 10 | RIGHT = 2 11 | UP = 3 12 | 13 | MAPS = { 14 | "4x4": [ 15 | "SFFF", 16 | "FHFH", 17 | "FFFH", 18 | "HFFG" 19 | ], 20 | "8x8": [ 21 | "SFFFFFFF", 22 | "FFFFFFFF", 23 | "FFFHFFFF", 24 | "FFFFFHFF", 25 | "FFFHFFFF", 26 | "FHHFFFHF", 27 | "FHFFHFHF", 28 | "FFFHFFFG" 29 | ], 30 | } 31 | 32 | class FrozenLakeEnv(discrete.DiscreteEnv): 33 | """ 34 | Winter is here. You and your friends were tossing around a frisbee at the park 35 | when you made a wild throw that left the frisbee out in the middle of the lake. 36 | The water is mostly frozen, but there are a few holes where the ice has melted. 37 | If you step into one of those holes, you'll fall into the freezing water. 38 | At this time, there's an international frisbee shortage, so it's absolutely imperative that 39 | you navigate across the lake and retrieve the disc. 40 | However, the ice is slippery, so you won't always move in the direction you intend. 41 | The surface is described using a grid like the following 42 | 43 | SFFF 44 | FHFH 45 | FFFH 46 | HFFG 47 | 48 | S : starting point, safe 49 | F : frozen surface, safe 50 | H : hole, fall to your doom 51 | G : goal, where the frisbee is located 52 | 53 | The episode ends when you reach the goal or fall in a hole. 54 | You receive a reward of 1 if you reach the goal, and zero otherwise. 
55 | 56 | """ 57 | 58 | metadata = {'render.modes': ['human', 'ansi']} 59 | 60 | def __init__(self, desc=None, map_name="4x4",is_slippery=True): 61 | if desc is None and map_name is None: 62 | raise ValueError('Must provide either desc or map_name') 63 | elif desc is None: 64 | desc = MAPS[map_name] 65 | self.desc = desc = np.asarray(desc,dtype='c') 66 | self.nrow, self.ncol = nrow, ncol = desc.shape 67 | 68 | nA = 4 69 | nS = nrow * ncol 70 | 71 | isd = np.array(desc == b'S').astype('float64').ravel() 72 | isd /= isd.sum() 73 | 74 | P = {s : {a : [] for a in range(nA)} for s in range(nS)} 75 | 76 | def to_s(row, col): 77 | return row*ncol + col 78 | def inc(row, col, a): 79 | if a==0: # left 80 | col = max(col-1,0) 81 | elif a==1: # down 82 | row = min(row+1,nrow-1) 83 | elif a==2: # right 84 | col = min(col+1,ncol-1) 85 | elif a==3: # up 86 | row = max(row-1,0) 87 | return (row, col) 88 | 89 | for row in range(nrow): 90 | for col in range(ncol): 91 | s = to_s(row, col) 92 | for a in range(4): 93 | li = P[s][a] 94 | letter = desc[row, col] 95 | if letter in b'GH': 96 | li.append((1.0, s, 0, True)) 97 | else: 98 | if is_slippery: 99 | for b in [(a-1)%4, a, (a+1)%4]: 100 | newrow, newcol = inc(row, col, b) 101 | newstate = to_s(newrow, newcol) 102 | newletter = desc[newrow, newcol] 103 | done = bytes(newletter) in b'GH' 104 | rew = float(newletter == b'G') 105 | li.append((1.0/3.0, newstate, rew, done)) 106 | else: 107 | newrow, newcol = inc(row, col, a) 108 | newstate = to_s(newrow, newcol) 109 | newletter = desc[newrow, newcol] 110 | done = bytes(newletter) in b'GH' 111 | rew = float(newletter == b'G') 112 | li.append((1.0, newstate, rew, done)) 113 | 114 | # obtain one-step dynamics for dynamic programming setting 115 | self.P = P 116 | 117 | super(FrozenLakeEnv, self).__init__(nS, nA, P, isd) 118 | 119 | def _render(self, mode='human', close=False): 120 | if close: 121 | return 122 | outfile = StringIO() if mode == 'ansi' else sys.stdout 123 | 124 | row, col = 
self.s // self.ncol, self.s % self.ncol 125 | desc = self.desc.tolist() 126 | desc = [[c.decode('utf-8') for c in line] for line in desc] 127 | desc[row][col] = utils.colorize(desc[row][col], "red", highlight=True) 128 | if self.lastaction is not None: 129 | outfile.write(" ({})\n".format(["Left","Down","Right","Up"][self.lastaction])) 130 | else: 131 | outfile.write("\n") 132 | outfile.write("\n".join(''.join(line) for line in desc)+"\n") 133 | 134 | if mode != 'human': 135 | return outfile 136 | -------------------------------------------------------------------------------- /Dynamic-Programming/plot_utils.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | 4 | def plot_values(V): 5 | # reshape value function 6 | V_sq = np.reshape(V, (4,4)) 7 | 8 | # plot the state-value function 9 | fig = plt.figure(figsize=(6, 6)) 10 | ax = fig.add_subplot(111) 11 | im = ax.imshow(V_sq, cmap='cool') 12 | for (j,i),label in np.ndenumerate(V_sq): 13 | ax.text(i, j, np.round(label, 5), ha='center', va='center', fontsize=14) 14 | plt.tick_params(bottom=False, left=False, labelbottom=False, labelleft=False) 15 | plt.title('State-Value Function') 16 | plt.show() 17 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Dynamic-Programming 2 | Algorithms for Policy Evaluation, Estimation of Action Values, Policy Improvement, Policy Iteration, Truncated Policy Evaluation, Truncated Policy Iteration, Value Iteration. From Udacity's Deep Reinforcement Learning Nanodegree program. 3 | --------------------------------------------------------------------------------
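As a rough illustration of the last algorithm the README lists, the sketch below runs value iteration on a tiny hand-written MDP. It is not the notebook's solution: the function signature `value_iteration(P, n_states, n_actions, ...)`, the helper `q_from_v`, and the three-state `toy_P` chain are assumptions made up for this example, so that it runs without `gym` or `frozenlake.py`. Only the transition format, a list of `(prob, next_state, reward, done)` tuples per state and action, follows `env.P` from the repository. The notebook's own `value_iteration` takes `env`, `gamma` and `theta` instead, and `check_test.py` expects that form.

```python
import numpy as np

def q_from_v(P, V, s, n_actions, gamma=1.0):
    """One-step lookahead: action values for state s under the estimate V."""
    q = np.zeros(n_actions)
    for a in range(n_actions):
        for prob, next_state, reward, done in P[s][a]:
            q[a] += prob * (reward + gamma * V[next_state])
    return q

def value_iteration(P, n_states, n_actions, gamma=1.0, theta=1e-8):
    """Sweep the Bellman optimality update until the largest change is below theta."""
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v_new = np.max(q_from_v(P, V, s, n_actions, gamma))
            delta = max(delta, abs(V[s] - v_new))
            V[s] = v_new
        if delta < theta:
            break
    # Extract a deterministic greedy policy from the converged values.
    policy = np.zeros((n_states, n_actions))
    for s in range(n_states):
        policy[s, np.argmax(q_from_v(P, V, s, n_actions, gamma))] = 1.0
    return policy, V

# Hypothetical 3-state chain (not FrozenLake): action 0 = stay, action 1 = move right.
# State 2 is terminal and absorbing; entering it from state 1 pays a reward of 1.
toy_P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 0.0, False)]},
    1: {0: [(1.0, 1, 0.0, False)], 1: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}

toy_policy, toy_V = value_iteration(toy_P, n_states=3, n_actions=2, gamma=0.9)
print(toy_V)       # state values for the toy chain
print(toy_policy)  # one-hot greedy action per state
```

The discount `gamma=0.9` is chosen here only so the "stay" action is strictly worse than moving right; the repository's functions default to `gamma=1`.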