├── README.md ├── Completing the Parameter Study.ipynb ├── MoonShot Technologies.ipynb ├── Function Approximation and Control.ipynb └── Implement your agent.ipynb /README.md: -------------------------------------------------------------------------------- 1 | # Reinforcement-Learning-Specialization 2 | This repository is dedicated to the assignments of the Reinforcement Learning Specialization course by the University of Alberta. 3 | The course is based on the book "Reinforcement Learning: An Introduction" by Richard S. Sutton and Andrew G. Barto. 4 | It should be noted that most of the content was created by the course authors; I only wrote parts of it. 5 | I hope you enjoy it. Please give feedback and help me improve this repository. 6 | 7 | Note: parts of some notebooks were adapted from other repositories. 8 | -------------------------------------------------------------------------------- /Completing the Parameter Study.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "deletable": false, 7 | "editable": false, 8 | "nbgrader": { 9 | "cell_type": "markdown", 10 | "checksum": "b64717e9bcdc4ab796bc1d0fdbb12c4b", 11 | "grade": false, 12 | "grade_id": "cell-f64cdc46435b25eb", 13 | "locked": true, 14 | "schema_version": 3, 15 | "solution": false, 16 | "task": false 17 | } 18 | }, 19 | "source": [ 20 | "# Assignment 3 - Completing the Parameter Study\n", 21 | "\n", 22 | "Welcome to Course 4 Programming Assignment 3. In the previous assignments, you completed the implementation of the Lunar Lander environment and implemented an agent with neural networks and the Adam optimizer. As you may remember, we discussed a number of key meta-parameters that affect the performance of the agent (e.g. the step-size, the temperature parameter for the softmax policy, the capacity of the replay buffer). We can use rules of thumb for picking reasonable values for these meta-parameters. However, we can also study the impact of these meta-parameters on the performance of the agent to gain insight.\n", 23 | "\n", 24 | "In this assignment, you will conduct a careful experiment analyzing the performance of an agent under different values of the step-size parameter.\n", 25 | "\n", 26 | "**In this assignment, you will:**\n", 27 | "\n", 28 | "- write a script to run your agent and environment on a set of parameters, to determine performance across these parameters.\n", 29 | "- gain insight into the impact of the step-size parameter on agent performance by examining its parameter sensitivity curve."
30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": { 35 | "deletable": false, 36 | "editable": false, 37 | "nbgrader": { 38 | "cell_type": "markdown", 39 | "checksum": "a5e3743388aba0f744da57d8af1c5270", 40 | "grade": false, 41 | "grade_id": "cell-4359dc74745ffb31", 42 | "locked": true, 43 | "schema_version": 3, 44 | "solution": false, 45 | "task": false 46 | } 47 | }, 48 | "source": [ 49 | "## Packages\n", 50 | "\n", 51 | "- [numpy](www.numpy.org) : Fundamental package for scientific computing with Python.\n", 52 | "- [matplotlib](http://matplotlib.org) : Library for plotting graphs in Python.\n", 53 | "- [RL-Glue](http://www.jmlr.org/papers/v10/tanner09a.html) : Library for reinforcement learning experiments.\n", 54 | "- [tqdm](https://tqdm.github.io/) : A package to display progress bars when running experiments." 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": 2, 60 | "metadata": { 61 | "deletable": false, 62 | "editable": false, 63 | "nbgrader": { 64 | "cell_type": "code", 65 | "checksum": "501d238ebcf7d6e116e6849af15dbb07", 66 | "grade": false, 67 | "grade_id": "cell-e55836e566b8c01d", 68 | "locked": true, 69 | "schema_version": 3, 70 | "solution": false, 71 | "task": false 72 | } 73 | }, 74 | "outputs": [], 75 | "source": [ 76 | "# Import necessary libraries\n", 77 | "# DO NOT IMPORT OTHER LIBRARIES - This will break the autograder.\n", 78 | "import numpy as np\n", 79 | "import matplotlib.pyplot as plt\n", 80 | "%matplotlib inline\n", 81 | "\n", 82 | "import os\n", 83 | "from tqdm import tqdm\n", 84 | "\n", 85 | "from rl_glue import RLGlue\n", 86 | "from environment import BaseEnvironment\n", 87 | "from agent import BaseAgent\n", 88 | "from dummy_environment import DummyEnvironment\n", 89 | "from dummy_agent import DummyAgent" 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "metadata": { 95 | "deletable": false, 96 | "editable": false, 97 | "nbgrader": { 98 | "cell_type": "markdown", 99 | "checksum": "3eaa22ac9a78ff78d9599150ff83877d", 100 | "grade": false, 101 | "grade_id": "cell-c3d5d347d1726775", 102 | "locked": true, 103 | "schema_version": 3, 104 | "solution": false, 105 | "task": false 106 | } 107 | }, 108 | "source": [ 109 | "## Section 1: Write Parameter Study Script\n", 110 | "\n", 111 | "In this section, you will write a script for performing parameter studies. You will implement the `run_experiment()` function. This function takes an environment and an agent and performs a parameter study on the step-size and temperature parameters."
112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": 3, 117 | "metadata": { 118 | "deletable": false, 119 | "nbgrader": { 120 | "cell_type": "code", 121 | "checksum": "74fc83f73918b97072a2c3dffa909bdd", 122 | "grade": false, 123 | "grade_id": "cell-e53c85e6098a975b", 124 | "locked": false, 125 | "schema_version": 3, 126 | "solution": true, 127 | "task": false 128 | } 129 | }, 130 | "outputs": [], 131 | "source": [ 132 | "# -----------\n", 133 | "# Graded Cell\n", 134 | "# -----------\n", 135 | "\n", 136 | "def run_experiment(environment, agent, environment_parameters, agent_parameters, experiment_parameters):\n", 137 | " \n", 138 | " \"\"\"\n", 139 | " Assume environment_parameters dict contains:\n", 140 | " {\n", 141 | " input_dim: integer,\n", 142 | " num_actions: integer,\n", 143 | " discount_factor: float\n", 144 | " }\n", 145 | " \n", 146 | " Assume agent_parameters dict contains:\n", 147 | " {\n", 148 | " step_size: 1D numpy array of floats,\n", 149 | " tau: 1D numpy array of floats\n", 150 | " }\n", 151 | " \n", 152 | " Assume experiment_parameters dict contains:\n", 153 | " {\n", 154 | " num_runs: integer,\n", 155 | " num_episodes: integer\n", 156 | " } \n", 157 | " \"\"\"\n", 158 | " \n", 159 | " ### Instantiate rl_glue from RLGlue \n", 160 | " rl_glue = RLGlue(environment, agent)\n", 161 | "\n", 162 | " os.system('sleep 1') # to prevent tqdm printing out-of-order\n", 163 | " \n", 164 | " ### START CODE HERE\n", 165 | " \n", 166 | " ### Initialize agent_sum_reward to zero in the form of a numpy array \n", 167 | " # with shape (number of values for tau, number of step-sizes, number of runs, number of episodes)\n", 168 | " agent_sum_reward = np.zeros((len(agent_parameters['tau']),\n", 169 | " len(agent_parameters['step_size']),\n", 170 | " experiment_parameters['num_runs'],\n", 171 | " experiment_parameters['num_episodes']))\n", 172 | " print(agent_sum_reward.shape)\n", 173 | " \n", 174 | " ### Replace the Nones with the correct values in the rest of the code\n", 175 | "\n", 176 | " # for loop over different values of tau\n", 177 | " # tqdm is used to show a progress bar for completing the parameter study\n", 178 | " for i in tqdm(range(len(agent_parameters['tau']))):\n", 179 | " \n", 180 | " # for loop over different values of the step-size\n", 181 | " for j in range(len(agent_parameters['step_size'])): \n", 182 | "\n", 183 | " ### Specify env_info \n", 184 | " env_info = {} \n", 185 | " \n", 186 | "\n", 187 | " ### Specify agent_info\n", 188 | " agent_info = {\"num_actions\": environment_parameters[\"num_actions\"],\n", 189 | " \"input_dim\": environment_parameters[\"input_dim\"],\n", 190 | " \"discount_factor\": environment_parameters[\"discount_factor\"],\n", 191 | " \"tau\": agent_parameters['tau'][i],\n", 192 | " \"step_size\": agent_parameters['step_size'][j]}\n", 193 | "\n", 194 | " # for loop over runs\n", 195 | " for run in range(experiment_parameters['num_runs']): \n", 196 | " \n", 197 | " # Set the seed\n", 198 | " agent_info[\"seed\"] = agent_parameters[\"seed\"] * experiment_parameters[\"num_runs\"] + run\n", 199 | " \n", 200 | " # Beginning of the run \n", 201 | " rl_glue.rl_init(agent_info, env_info)\n", 202 | "\n", 203 | " for episode in range(experiment_parameters['num_episodes']): \n", 204 | " \n", 205 | " # Run episode\n", 206 | " rl_glue.rl_episode(0) # no step limit\n", 207 | "\n", 208 | " ### Store sum of reward\n", 209 | " agent_sum_reward[i, j, run, episode] = rl_glue.rl_agent_message(\"get_sum_reward\")\n", 210 | "\n", 211 | " if not 
os.path.exists('results'):\n", 212 | " os.makedirs('results')\n", 213 | "\n", 214 | " save_name = \"{}\".format(rl_glue.agent.name).replace('.','')\n", 215 | "\n", 216 | " # save sum reward\n", 217 | " np.save(\"results/sum_reward_{}\".format(save_name), agent_sum_reward) " 218 | ] 219 | }, 220 | { 221 | "cell_type": "markdown", 222 | "metadata": { 223 | "deletable": false, 224 | "editable": false, 225 | "nbgrader": { 226 | "cell_type": "markdown", 227 | "checksum": "3e49955e441ad30ea64396c178bcef79", 228 | "grade": false, 229 | "grade_id": "cell-b5e0bf5f2c8ed098", 230 | "locked": true, 231 | "schema_version": 3, 232 | "solution": false, 233 | "task": false 234 | } 235 | }, 236 | "source": [ 237 | "Run the following code to test your implementation of `run_experiment()` given a dummy agent and a dummy environment for 100 runs, 100 episodes, 12 values of the step-size, and 4 values of $\\tau$:" 238 | ] 239 | }, 240 | { 241 | "cell_type": "code", 242 | "execution_count": 4, 243 | "metadata": {}, 244 | "outputs": [ 245 | { 246 | "name": "stderr", 247 | "output_type": "stream", 248 | "text": [ 249 | " 0%| | 0/4 [00:00\n", 433 | "\n", 434 | "Observe that the best performance is achieved for step-sizes in range $[10^{-4}, 10^{-3}]$. This includes the step-size that we used in Assignment 2! The performance degrades for higher and lower step-size values. Since the range of step-sizes for which the agent performs well is not broad, choosing a good step-size is challenging for this problem.\n", 435 | "\n", 436 | "As we mentioned above, we used the average of returns over episodes, averaged over 30 runs as our performance metric. This metric gives an overall estimation of the agent's performance over the episodes. If we want to study the effect of the step-size parameter on the agent's early performance or final performance, we should use different metrics. For example, to study the effect of the step-size parameter on early performance, we could use the average of returns over the first 100 episodes, averaged over 30 runs. When conducting a parameter study, you may consider these for defining your performance metric!" 437 | ] 438 | }, 439 | { 440 | "cell_type": "markdown", 441 | "metadata": { 442 | "deletable": false, 443 | "editable": false, 444 | "nbgrader": { 445 | "cell_type": "markdown", 446 | "checksum": "16533ce7f50f0d22dd5774ccb53f50de", 447 | "grade": false, 448 | "grade_id": "cell-a682d7f91cd82cc3", 449 | "locked": true, 450 | "schema_version": 3, 451 | "solution": false, 452 | "task": false 453 | } 454 | }, 455 | "source": [ 456 | "### **Wrapping up!** \n", 457 | "\n", 458 | "Congratulations, you have completed the Capstone project! In Assignment 1 (Module 1), you designed the reward function for the Lunar Lander environment. In Assignment 2 (Module 4), you implemented your Expected Sarsa agent with a neural network and Adam optimizer. In Assignment 3 (Module 5), you conducted a careful parameter study and examined the effect of changing the step size parameter on the performance of the agent. Thanks for sticking with us throughout the specialization! At this point, you should have a solid foundation for formulating your own reinforcement learning problems, understanding advanced topics in reinforcement learning, and even pursuing graduate studies." 
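For reference, the performance metrics discussed above (average return over all episodes, or over only the first episodes, averaged over runs) can be computed directly from the array saved by `run_experiment()`. The sketch below is illustrative only: the saved file name, the step-size grid, and the plotting details are assumptions, not values fixed by the assignment.

```python
import numpy as np
import matplotlib.pyplot as plt

# Load the sum-of-rewards array saved by run_experiment().
# Shape: (num tau values, num step-sizes, num runs, num episodes).
# The file name below is hypothetical; use the name printed by your own run.
agent_sum_reward = np.load("results/sum_reward_expected_sarsa_agent.npy")

# Overall metric: average return over all episodes, then average over runs.
overall = agent_sum_reward.mean(axis=3).mean(axis=2)            # (num tau, num step-sizes)

# Early-performance variant: average return over the first 100 episodes only.
early = agent_sum_reward[:, :, :, :100].mean(axis=3).mean(axis=2)

# Assumed step-size grid for the x-axis of the sensitivity curve.
step_sizes = np.power(2.0, -np.arange(overall.shape[1]))

for i in range(overall.shape[0]):
    plt.plot(step_sizes, overall[i], marker="o", label="tau index {}".format(i))
plt.xscale("log")
plt.xlabel("step-size")
plt.ylabel("average return (all episodes, averaged over runs)")
plt.legend()
plt.show()
```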
459 | ] 460 | } 461 | ], 462 | "metadata": { 463 | "coursera": { 464 | "course_slug": "complete-reinforcement-learning-system", 465 | "graded_item_id": "h4ZLq", 466 | "launcher_item_id": "rbt6a" 467 | }, 468 | "kernelspec": { 469 | "display_name": "Python 3", 470 | "language": "python", 471 | "name": "python3" 472 | }, 473 | "language_info": { 474 | "codemirror_mode": { 475 | "name": "ipython", 476 | "version": 3 477 | }, 478 | "file_extension": ".py", 479 | "mimetype": "text/x-python", 480 | "name": "python", 481 | "nbconvert_exporter": "python", 482 | "pygments_lexer": "ipython3", 483 | "version": "3.7.6" 484 | } 485 | }, 486 | "nbformat": 4, 487 | "nbformat_minor": 2 488 | } 489 | -------------------------------------------------------------------------------- /MoonShot Technologies.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "deletable": false, 7 | "editable": false, 8 | "nbgrader": { 9 | "cell_type": "markdown", 10 | "checksum": "91bab923ff51b8c4cc2db62df96dceba", 11 | "grade": false, 12 | "grade_id": "cell-728287ea719cc025", 13 | "locked": true, 14 | "schema_version": 3, 15 | "solution": false 16 | } 17 | }, 18 | "source": [ 19 | "# MoonShot Technologies" 20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "metadata": { 25 | "deletable": false, 26 | "editable": false, 27 | "nbgrader": { 28 | "cell_type": "markdown", 29 | "checksum": "75370e1d01e0e2216d0982ef3c22eb0a", 30 | "grade": false, 31 | "grade_id": "cell-e86bb7c59ff0a4a5", 32 | "locked": true, 33 | "schema_version": 3, 34 | "solution": false 35 | } 36 | }, 37 | "source": [ 38 | "Congratulations! Due to your strong performance in the first three courses, you landed a job as a reinforcement learning engineer at the hottest new non-revenue generating unicorn, MoonShot Technologies (MST). Times are busy at MST, which is preparing for its initial public offering (IPO) at the end of the fiscal year, and your labor is much needed.\n", 39 | "\n", 40 | "Like many successful startups, MST is exceedingly concerned with the valuation that it will receive at its IPO (as this valuation determines the price at which its existing venture capitalist shareholders will be able to sell their shares). Accordingly, to whet the appetites of potential investors, MST has set its sights on accomplishing a technological tour de force — a lunar landing — before the year is out. But it is not just any mundane lunar landing that MST aspires toward. Rather than the more sensible approach of employing techniques from aerospace engineering to pilot its spacecraft, MST endeavors to wow investors by training an agent to do the job via reinforcement learning.\n", 41 | "\n", 42 | "However, it is clearly not practical for a reinforcement learning agent to be trained tabula rasa with real rockets — even the pockets of venture capitalists have their limits. Instead, MST aims to build a simulator that is realistic enough to train an agent that can be deployed in the real world. This will be a difficult project, and will require building a realistic simulator, choosing the right reinforcement learning algorithm, implementing this algorithm, and optimizing the hyperparameters for this algorithm.\n", 43 | "\n", 44 | "Naturally, as the newly hired reinforcement learning engineer, you have been staffed to lead the project. In this notebook, you will take the first steps by building a lunar lander environment." 
45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": { 50 | "deletable": false, 51 | "editable": false, 52 | "nbgrader": { 53 | "cell_type": "markdown", 54 | "checksum": "c4a738c811626904fa17a90cbfddd80e", 55 | "grade": false, 56 | "grade_id": "cell-62c5c402edcd8ae5", 57 | "locked": true, 58 | "schema_version": 3, 59 | "solution": false 60 | } 61 | }, 62 | "source": [ 63 | "## Creating an Environment\n", 64 | "The software engineering team at MST has already set up some infrastructure for your convenience. Specifically, they have provided you with the following functions:\n", 65 | "\n", 66 | "* **get_velocity** - returns an array representing the x, y velocity of the lander. Both the x and y velocity are in the range $[0, 60]$.\n", 67 | "\n", 68 | "\n", 69 | "* **get_angle** - returns a scalar representing the angle of the lander. The angle is in the range $[0, 359]$.\n", 70 | "\n", 71 | "\n", 72 | "* **get_position** - returns an array representing the x, y position of the lander. Both the x and y position of the agent are in the range $[0, 100]$.\n", 73 | "\n", 74 | "\n", 75 | "* **get_landing_zone** - returns an array representing the x, y position of the landing zone. Both the x, y coordinates are in the range $[1, 100]$. \n", 76 | "\n", 77 | "\n", 78 | "* **get_fuel** - returns a scalar representing the remaining amount of fuel. Fuel starts at $100$ and is in range $[0, 100]$.\n", 79 | "\n", 80 | "Note that these are dummy functions just for this assignment.\n", 81 | "\n", 82 | "![Lunar Landar](lunar_landar.png)\n", 83 | "\n", 84 | "In this notebook, you will be applying these functions to __structure the reward signal__ based on the following criteria:\n", 85 | "\n", 86 | "1. **The lander will crash if** it touches the ground when ``y_velocity < -3`` (the downward velocity is greater than three).\n", 87 | "\n", 88 | "\n", 89 | "2. **The lander will crash if** it touches the ground when ``x_velocity < -10 or 10 < x_velocity`` (horizontal speed is greater than $10$).\n", 90 | "\n", 91 | "\n", 92 | "3. The lander's angle takes values in $[0, 359]$. It is completely vertical at $0$ degrees. **The lander will crash if** it touches the ground when ``5 < angle < 355`` (angle differs from vertical by more than $5$ degrees).\n", 93 | "\n", 94 | "\n", 95 | "4. **The lander will crash if** it has yet to land and ``fuel <= 0`` (it runs out of fuel).\n", 96 | "\n", 97 | "\n", 98 | "5. MST would like to save money on fuel when it is possible **(using less fuel is preferred)**.\n", 99 | "\n", 100 | "\n", 101 | "6. The lander can only land in the landing zone. **The lander will crash if** it touches the ground when ``x_position`` $\not\in$ ``landing_zone`` (it lands outside the landing zone).\n", 102 | "\n", 103 | "\n", 104 | "Fill in the methods below to create an environment for the lunar lander."
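Before filling in the environment, it can help to see the six criteria above condensed into a single crash check. The helper below is only an illustrative sketch, assuming the zone is identified by a single x coordinate; the function name and argument list are not part of the provided utilities, and the graded cell that follows builds a full reward function from the same conditions.

```python
def is_crash(vel_x, vel_y, angle, pos_x, land_x, fuel, touched_ground):
    """Illustrative restatement of the crash criteria above (not a provided utility)."""
    if touched_ground:
        too_fast_down = vel_y < -3                        # criterion 1
        too_fast_sideways = vel_x < -10 or 10 < vel_x     # criterion 2
        too_tilted = 5 < angle < 355                      # criterion 3
        off_target = pos_x != land_x                      # criterion 6 (zone treated as one x value here)
        return too_fast_down or too_fast_sideways or too_tilted or off_target
    # Still airborne: a crash is inevitable once the fuel runs out (criterion 4).
    return fuel <= 0
```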
105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": 7, 110 | "metadata": { 111 | "deletable": false, 112 | "nbgrader": { 113 | "cell_type": "code", 114 | "checksum": "193356706350cbad3ee2e968e1f33fce", 115 | "grade": false, 116 | "grade_id": "cell-b5475cc072c387ff", 117 | "locked": false, 118 | "schema_version": 3, 119 | "solution": true 120 | } 121 | }, 122 | "outputs": [], 123 | "source": [ 124 | "import environment\n", 125 | "from utils import get_landing_zone, get_angle, get_velocity, get_position, get_fuel, tests\n", 126 | "get_landing_zone()\n", 127 | "# Lunar Lander Environment\n", 128 | "class LunarLanderEnvironment(environment.BaseEnvironment):\n", 129 | " def __init__(self):\n", 130 | " self.current_state = None\n", 131 | " self.count = 0\n", 132 | " \n", 133 | " def env_init(self, env_info):\n", 134 | " # users set this up\n", 135 | " self.state = np.zeros(6) # velocity x, y, angle, distance to ground, landing zone x, y\n", 136 | " \n", 137 | " def env_start(self):\n", 138 | " land_x, land_y = get_landing_zone() # gets the x, y coordinate of the landing zone\n", 139 | " # At the start we initialize the agent to the top left hand corner (100, 20) with 0 velocity \n", 140 | " # in either any direction. The agent's angle is set to 0 and the landing zone is retrieved and set.\n", 141 | " # The lander starts with fuel of 100.\n", 142 | " # (vel_x, vel_y, angle, pos_x, pos_y, land_x, land_y, fuel)\n", 143 | " self.current_state = (0, 0, 0, 100, 20, land_x, land_y, 100)\n", 144 | " return self.current_state\n", 145 | " \n", 146 | " def env_step(self, action):\n", 147 | " \n", 148 | " land_x, land_y = get_landing_zone() # gets the x, y coordinate of the landing zone\n", 149 | " vel_x, vel_y = get_velocity(action) # gets the x, y velocity of the lander\n", 150 | " angle = get_angle(action) # gets the angle the lander is positioned in\n", 151 | " pos_x, pos_y = get_position(action) # gets the x, y position of the lander\n", 152 | " fuel = get_fuel(action) # get the amount of fuel remaining for the lander\n", 153 | " \n", 154 | " terminal = False\n", 155 | " reward = 0.0\n", 156 | " observation = (vel_x, vel_y, angle, pos_x, pos_y, land_x, land_y, fuel)\n", 157 | " \n", 158 | " # use the above observations to decide what the reward will be, and if the\n", 159 | " # agent is in a terminal state.\n", 160 | " # Recall - if the agent crashes or lands terminal needs to be set to True\n", 161 | "\n", 162 | " # YOUR CODE HERE\n", 163 | " # touch the ground\n", 164 | " if pos_y <= land_y:\n", 165 | " terminal = True\n", 166 | " # crash\n", 167 | " if vel_y < -3 or abs(vel_x) > 10 or 5 < angle < 355 or pos_x != land_x:\n", 168 | " reward -= 10000\n", 169 | " # not crash\n", 170 | " else:\n", 171 | " # save fuel\n", 172 | " reward += fuel\n", 173 | " # landing\n", 174 | " else:\n", 175 | " # run out of fuel\n", 176 | " if fuel <= 0:\n", 177 | " terminal = True\n", 178 | " reward -= 10000\n", 179 | " \n", 180 | " self.reward_obs_term = (reward, observation, terminal)\n", 181 | " return self.reward_obs_term\n", 182 | " \n", 183 | " def env_cleanup(self):\n", 184 | " return None\n", 185 | " \n", 186 | " def env_message(self):\n", 187 | " return None" 188 | ] 189 | }, 190 | { 191 | "cell_type": "markdown", 192 | "metadata": { 193 | "deletable": false, 194 | "editable": false, 195 | "nbgrader": { 196 | "cell_type": "markdown", 197 | "checksum": "6f960ff843630e06b23aa258113d6e6b", 198 | "grade": false, 199 | "grade_id": "cell-9c57ff0d2b96ac51", 200 | "locked": true, 201 | 
"schema_version": 3, 202 | "solution": false 203 | } 204 | }, 205 | "source": [ 206 | "## Evaluating your reward function\n", 207 | "\n", 208 | "Designing the best reward function for an objective is a challenging task - it is not clear what the term “best reward function” even means, let alone how to find it. Consequently, rather than evaluating your reward function by quantitative metrics, we merely ask that you check that its behavior is qualitatively reasonable. For this purpose, we provide a series of test cases below. In each case we show a transition and explain how a reward function that we implemented behaves. As you read, check how your own reward behaves in each scenario and judge for yourself whether it acts appropriately. (For the latter parts of the capstone you will use our implementation of the lunar lander environment, so don’t worry if your reward function isn’t exactly the same as ours. The purpose of this of this notebook is to gain experience implementing environments and reward functions.)" 209 | ] 210 | }, 211 | { 212 | "cell_type": "markdown", 213 | "metadata": { 214 | "deletable": false, 215 | "editable": false, 216 | "nbgrader": { 217 | "cell_type": "markdown", 218 | "checksum": "8ea30a25d9a5f478e892745463ddacbc", 219 | "grade": false, 220 | "grade_id": "cell-2d30dbf6446d5afc", 221 | "locked": true, 222 | "schema_version": 3, 223 | "solution": false 224 | } 225 | }, 226 | "source": [ 227 | "### Case 1: Uncertain Future\n", 228 | "The lander is in the top left corner of the screen moving at a velocity of (12, 15) with 10 units of fuel — whether this landing will be successful remains to be seen.\n", 229 | "\n", 230 | "![Lunar Landar](lunar_landar_1.png)" 231 | ] 232 | }, 233 | { 234 | "cell_type": "code", 235 | "execution_count": 8, 236 | "metadata": { 237 | "deletable": false, 238 | "editable": false, 239 | "nbgrader": { 240 | "cell_type": "code", 241 | "checksum": "09b03ba3c243b29134f5bc04dcdc10f4", 242 | "grade": true, 243 | "grade_id": "cell-99abd81376335339", 244 | "locked": true, 245 | "points": 1, 246 | "schema_version": 3, 247 | "solution": false, 248 | "task": false 249 | } 250 | }, 251 | "outputs": [ 252 | { 253 | "name": "stdout", 254 | "output_type": "stream", 255 | "text": [ 256 | "Reward: 0.0, Terminal: False\n" 257 | ] 258 | } 259 | ], 260 | "source": [ 261 | "tests(LunarLanderEnvironment, 1)" 262 | ] 263 | }, 264 | { 265 | "cell_type": "markdown", 266 | "metadata": { 267 | "deletable": false, 268 | "editable": false, 269 | "nbgrader": { 270 | "cell_type": "markdown", 271 | "checksum": "ca2d70140549a4b7eb3c52787154953b", 272 | "grade": false, 273 | "grade_id": "cell-89911113f764c447", 274 | "locked": true, 275 | "schema_version": 3, 276 | "solution": false 277 | } 278 | }, 279 | "source": [ 280 | "In this case we gave the agent no reward, as it neither achieved the objective nor crashed. One alternative is giving the agent a positive reward for moving closer to the goal. Another is to give a negative reward for fuel consumption. What did your reward function do?\n", 281 | "\n", 282 | "Also check to make sure that ``Terminal`` is set to ``False``. Your agent has not landed, crashed, or ran out of fuel. The episode is not over." 
283 | ] 284 | }, 285 | { 286 | "cell_type": "markdown", 287 | "metadata": { 288 | "deletable": false, 289 | "editable": false, 290 | "nbgrader": { 291 | "cell_type": "markdown", 292 | "checksum": "f6c45e927b165cdb95744caaf3309aed", 293 | "grade": false, 294 | "grade_id": "cell-b19c487e9da05800", 295 | "locked": true, 296 | "schema_version": 3, 297 | "solution": false 298 | } 299 | }, 300 | "source": [ 301 | "### Case 2: Imminent Crash!\n", 302 | "\n", 303 | "The lander is positioned in the target landing zone at a 45 degree angle, but its landing gear can only handle an angular offset of five degrees — it is about to crash!\n", 304 | "\n", 305 | "![Lunar Landar](lunar_landar_2.png)" 306 | ] 307 | }, 308 | { 309 | "cell_type": "code", 310 | "execution_count": 9, 311 | "metadata": { 312 | "deletable": false, 313 | "editable": false, 314 | "nbgrader": { 315 | "cell_type": "code", 316 | "checksum": "5f07c252a2f60904323ad2fb02675221", 317 | "grade": true, 318 | "grade_id": "cell-9b3900153803f78e", 319 | "locked": true, 320 | "points": 1, 321 | "schema_version": 3, 322 | "solution": false, 323 | "task": false 324 | } 325 | }, 326 | "outputs": [ 327 | { 328 | "name": "stdout", 329 | "output_type": "stream", 330 | "text": [ 331 | "Reward: -10000.0, Terminal: True\n" 332 | ] 333 | } 334 | ], 335 | "source": [ 336 | "tests(LunarLanderEnvironment, 2)" 337 | ] 338 | }, 339 | { 340 | "cell_type": "markdown", 341 | "metadata": { 342 | "deletable": false, 343 | "editable": false, 344 | "nbgrader": { 345 | "cell_type": "markdown", 346 | "checksum": "48870c2ad1832288abe07ea16cc88ba2", 347 | "grade": false, 348 | "grade_id": "cell-4731b02a8f54214b", 349 | "locked": true, 350 | "schema_version": 3, 351 | "solution": false 352 | } 353 | }, 354 | "source": [ 355 | "We gave the agent a reward of -10000 to punish it for crashing. How did your reward function handle the crash?\n", 356 | "\n", 357 | "Also check to make sure that ``Terminal`` is set to ``True``. Your agent has crashed and the episode is over." 358 | ] 359 | }, 360 | { 361 | "cell_type": "markdown", 362 | "metadata": { 363 | "deletable": false, 364 | "editable": false, 365 | "nbgrader": { 366 | "cell_type": "markdown", 367 | "checksum": "a06bb6be22b15cb43cad3378583b188c", 368 | "grade": false, 369 | "grade_id": "cell-af000f1895c6bd69", 370 | "locked": true, 371 | "schema_version": 3, 372 | "solution": false 373 | } 374 | }, 375 | "source": [ 376 | "### Case 3: Nice Landing!\n", 377 | "The lander is vertically oriented and positioned in the target landing zone with five units of remaining fuel. 
The landing is being completed successfully!\n", 378 | "\n", 379 | "![Lunar Landar](lunar_landar_3.png)" 380 | ] 381 | }, 382 | { 383 | "cell_type": "code", 384 | "execution_count": 10, 385 | "metadata": { 386 | "deletable": false, 387 | "editable": false, 388 | "nbgrader": { 389 | "cell_type": "code", 390 | "checksum": "b2013e0485824e18311b02cfda0c857b", 391 | "grade": true, 392 | "grade_id": "cell-6a53769313d85b0b", 393 | "locked": true, 394 | "points": 1, 395 | "schema_version": 3, 396 | "solution": false, 397 | "task": false 398 | } 399 | }, 400 | "outputs": [ 401 | { 402 | "name": "stdout", 403 | "output_type": "stream", 404 | "text": [ 405 | "Reward: 5.0, Terminal: True\n" 406 | ] 407 | } 408 | ], 409 | "source": [ 410 | "tests(LunarLanderEnvironment, 3)" 411 | ] 412 | }, 413 | { 414 | "cell_type": "markdown", 415 | "metadata": { 416 | "deletable": false, 417 | "editable": false, 418 | "nbgrader": { 419 | "cell_type": "markdown", 420 | "checksum": "7fcd0832a9a99dd4d6ae3cd1f2a99666", 421 | "grade": false, 422 | "grade_id": "cell-e23d284c1e7c6865", 423 | "locked": true, 424 | "schema_version": 3, 425 | "solution": false 426 | } 427 | }, 428 | "source": [ 429 | "To encourage the agent to conserve as much fuel as possible, we reward successful landings proportionally to the amount of fuel remaining. Here, we gave the agent a reward of five since it landed with five units of fuel remaining. How did you incentivize the agent to be fuel efficient?\n", 430 | "\n", 431 | "Also check to make sure that ``Terminal`` is set to ``True``. Your agent has landed and the episode is over." 432 | ] 433 | }, 434 | { 435 | "cell_type": "markdown", 436 | "metadata": { 437 | "deletable": false, 438 | "editable": false, 439 | "nbgrader": { 440 | "cell_type": "markdown", 441 | "checksum": "144e1370f8a1146808c367d077c8a6db", 442 | "grade": false, 443 | "grade_id": "cell-21cc28d788b4455d", 444 | "locked": true, 445 | "schema_version": 3, 446 | "solution": false 447 | } 448 | }, 449 | "source": [ 450 | "### Case 4: Dark Times Ahead!\n", 451 | "The lander is directly above the target landing zone but has no fuel left. The future does not look good for the agent — without fuel there is no way for it to avoid crashing!\n", 452 | "\n", 453 | "![Lunar Landar](lunar_landar_4.png)" 454 | ] 455 | }, 456 | { 457 | "cell_type": "code", 458 | "execution_count": 11, 459 | "metadata": { 460 | "deletable": false, 461 | "editable": false, 462 | "nbgrader": { 463 | "cell_type": "code", 464 | "checksum": "4ca9f4cf5b7d498584d814a0ec14fbbb", 465 | "grade": true, 466 | "grade_id": "cell-86ece2998b73491a", 467 | "locked": true, 468 | "points": 1, 469 | "schema_version": 3, 470 | "solution": false, 471 | "task": false 472 | } 473 | }, 474 | "outputs": [ 475 | { 476 | "name": "stdout", 477 | "output_type": "stream", 478 | "text": [ 479 | "Reward: -10000.0, Terminal: True\n" 480 | ] 481 | } 482 | ], 483 | "source": [ 484 | "tests(LunarLanderEnvironment, 4)" 485 | ] 486 | }, 487 | { 488 | "cell_type": "markdown", 489 | "metadata": { 490 | "deletable": false, 491 | "editable": false, 492 | "nbgrader": { 493 | "cell_type": "markdown", 494 | "checksum": "14711bd7c29c70cfc7f2779a4a4548ff", 495 | "grade": false, 496 | "grade_id": "cell-e1cc048da1ecc920", 497 | "locked": true, 498 | "schema_version": 3, 499 | "solution": false 500 | } 501 | }, 502 | "source": [ 503 | "We gave the agent a reward of -10000 to punish it for crashing. Did your reward function treat all crashes equally, as ours did? 
Or did you penalize some crashes more than others? What reasoning did you use to make this decision?\n", 504 | "\n", 505 | "Also check to make sure that ``Terminal`` is set to ``True``. Your agent has crashed and the episode is over." 506 | ] 507 | }, 508 | { 509 | "cell_type": "markdown", 510 | "metadata": { 511 | "deletable": false, 512 | "editable": false, 513 | "nbgrader": { 514 | "cell_type": "markdown", 515 | "checksum": "abf49fa9dd8069cd57474df9bafe03e8", 516 | "grade": false, 517 | "grade_id": "cell-29610cb5c650ab8b", 518 | "locked": true, 519 | "schema_version": 3, 520 | "solution": false, 521 | "task": false 522 | } 523 | }, 524 | "source": [ 525 | "### Case 5: Where's The Landing Zone?!\n", 526 | "\n", 527 | "The lander is touching down at a vertical angle with fuel to spare. But it is not in the landing zone and the surface is uneven — it is going to crash!\n", 528 | "\n", 529 | "![Lunar Landar](lunar_landar_5.png)" 530 | ] 531 | }, 532 | { 533 | "cell_type": "code", 534 | "execution_count": 12, 535 | "metadata": { 536 | "deletable": false, 537 | "editable": false, 538 | "nbgrader": { 539 | "cell_type": "code", 540 | "checksum": "4198a382067136602a638dd1690df307", 541 | "grade": true, 542 | "grade_id": "cell-b7cee5ebb3ab91bf", 543 | "locked": true, 544 | "points": 1, 545 | "schema_version": 3, 546 | "solution": false, 547 | "task": false 548 | } 549 | }, 550 | "outputs": [ 551 | { 552 | "name": "stdout", 553 | "output_type": "stream", 554 | "text": [ 555 | "Reward: -10000.0, Terminal: True\n" 556 | ] 557 | } 558 | ], 559 | "source": [ 560 | "tests(LunarLanderEnvironment, 5)" 561 | ] 562 | }, 563 | { 564 | "cell_type": "markdown", 565 | "metadata": { 566 | "deletable": false, 567 | "editable": false, 568 | "nbgrader": { 569 | "cell_type": "markdown", 570 | "checksum": "63a597e957db543772b510123497c5f8", 571 | "grade": false, 572 | "grade_id": "cell-aead9765209963cd", 573 | "locked": true, 574 | "schema_version": 3, 575 | "solution": false, 576 | "task": false 577 | } 578 | }, 579 | "source": [ 580 | "We gave the agent a reward of -10000 to punish it for landing in the wrong spot. An alternative is to scale the negative reward by distance from the landing zone. What approach did you take?\n", 581 | "\n", 582 | "Also check to make sure that ``Terminal`` is set to ``True``. Your agent has crashed and the episode is over." 583 | ] 584 | }, 585 | { 586 | "cell_type": "markdown", 587 | "metadata": { 588 | "deletable": false, 589 | "editable": false, 590 | "nbgrader": { 591 | "cell_type": "markdown", 592 | "checksum": "847d71c3274e0abf69bb00349ad82069", 593 | "grade": false, 594 | "grade_id": "cell-228ffe4d4fd5d55f", 595 | "locked": true, 596 | "schema_version": 3, 597 | "solution": false 598 | } 599 | }, 600 | "source": [ 601 | "## Wrapping Up\n", 602 | "Excellent! The lunar lander simulator is complete and the project can commence. In the next module, you will build upon your work here by implementing an agent to train in the environment. Don’t dally! The team at MST is eagerly awaiting your solution." 
603 | ] 604 | } 605 | ], 606 | "metadata": { 607 | "@webio": { 608 | "lastCommId": null, 609 | "lastKernelId": null 610 | }, 611 | "kernelspec": { 612 | "display_name": "Python 3", 613 | "language": "python", 614 | "name": "python3" 615 | }, 616 | "language_info": { 617 | "codemirror_mode": { 618 | "name": "ipython", 619 | "version": 3 620 | }, 621 | "file_extension": ".py", 622 | "mimetype": "text/x-python", 623 | "name": "python", 624 | "nbconvert_exporter": "python", 625 | "pygments_lexer": "ipython3", 626 | "version": "3.7.6" 627 | } 628 | }, 629 | "nbformat": 4, 630 | "nbformat_minor": 2 631 | } 632 | -------------------------------------------------------------------------------- /Function Approximation and Control.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "deletable": false, 7 | "editable": false, 8 | "nbgrader": { 9 | "cell_type": "markdown", 10 | "checksum": "f91befe560efc058705938cd516c0cbf", 11 | "grade": false, 12 | "grade_id": "cell-d7aa7f0ccfbe6764", 13 | "locked": true, 14 | "schema_version": 3, 15 | "solution": false, 16 | "task": false 17 | } 18 | }, 19 | "source": [ 20 | "# Assignment 3: Function Approximation and Control" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": { 26 | "deletable": false, 27 | "editable": false, 28 | "nbgrader": { 29 | "cell_type": "markdown", 30 | "checksum": "8367cb0523270ba9ea53fc9a4e237294", 31 | "grade": false, 32 | "grade_id": "cell-4aea2284d1d0ee5b", 33 | "locked": true, 34 | "schema_version": 3, 35 | "solution": false, 36 | "task": false 37 | } 38 | }, 39 | "source": [ 40 | "Welcome to Assignment 3. In this notebook you will learn how to:\n", 41 | "- Use function approximation in the control setting\n", 42 | "- Implement the Sarsa algorithm using tile coding\n", 43 | "- Compare three settings for tile coding to see their effect on our agent\n", 44 | "\n", 45 | "As with the rest of the notebooks do not import additional libraries or adjust grading cells as this will break the grader.\n", 46 | "\n", 47 | "MAKE SURE TO RUN ALL OF THE CELLS SO THE GRADER GETS THE OUTPUT IT NEEDS\n" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 1, 53 | "metadata": { 54 | "deletable": false, 55 | "editable": false, 56 | "nbgrader": { 57 | "cell_type": "code", 58 | "checksum": "c3bf50501352096f22c673e3f781ca93", 59 | "grade": false, 60 | "grade_id": "cell-68be8d91fe7fd3dd", 61 | "locked": true, 62 | "schema_version": 3, 63 | "solution": false, 64 | "task": false 65 | } 66 | }, 67 | "outputs": [], 68 | "source": [ 69 | "# Import Necessary Libraries\n", 70 | "import numpy as np\n", 71 | "import itertools\n", 72 | "import matplotlib.pyplot as plt\n", 73 | "import tiles3 as tc\n", 74 | "from rl_glue import RLGlue\n", 75 | "from agent import BaseAgent\n", 76 | "from utils import argmax\n", 77 | "import mountaincar_env\n", 78 | "import time" 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": { 84 | "deletable": false, 85 | "editable": false, 86 | "nbgrader": { 87 | "cell_type": "markdown", 88 | "checksum": "2197641d6402383eb671432ec4de8822", 89 | "grade": false, 90 | "grade_id": "cell-631c7b26d3b5c04b", 91 | "locked": true, 92 | "schema_version": 3, 93 | "solution": false, 94 | "task": false 95 | } 96 | }, 97 | "source": [ 98 | "In the above cell, we import the libraries we need for this assignment. You may have noticed that we import mountaincar_env. 
This is the __Mountain Car Task__ introduced in [Section 10.1 of the textbook](http://www.incompleteideas.net/book/RLbook2018.pdf#page=267). The task is for an underpowered car to make it to the top of a hill:\n", 99 | "![Mountain Car](mountaincar.png \"Mountain Car\")\n", 100 | "The car is under-powered, so the agent needs to learn to rock back and forth to get enough momentum to reach the goal. At each time step the agent receives from the environment its current velocity (a float between -0.07 and 0.07), and its current position (a float between -1.2 and 0.5). Because our state is continuous there are a potentially infinite number of states that our agent could be in. We need a function approximation method to help the agent deal with this. In this notebook we will use tile coding. We provide a tile coding implementation for you to use, imported above with tiles3." 101 | ] 102 | }, 103 | { 104 | "cell_type": "markdown", 105 | "metadata": { 106 | "deletable": false, 107 | "editable": false, 108 | "nbgrader": { 109 | "cell_type": "markdown", 110 | "checksum": "2e2e288fe7f1840098d7b6ada17f505e", 111 | "grade": false, 112 | "grade_id": "cell-bbe24ca993c1b297", 113 | "locked": true, 114 | "schema_version": 3, 115 | "solution": false, 116 | "task": false 117 | } 118 | }, 119 | "source": [ 120 | "## Section 0: Tile Coding Helper Function" 121 | ] 122 | }, 123 | { 124 | "cell_type": "markdown", 125 | "metadata": { 126 | "deletable": false, 127 | "editable": false, 128 | "nbgrader": { 129 | "cell_type": "markdown", 130 | "checksum": "104fa1489f8107d5d841e786c2fb27b0", 131 | "grade": false, 132 | "grade_id": "cell-00c3ac7f568f6166", 133 | "locked": true, 134 | "schema_version": 3, 135 | "solution": false, 136 | "task": false 137 | } 138 | }, 139 | "source": [ 140 | "To begin we are going to build a tile coding class for our Sarsa agent that will make it easier to make calls to our tile coder." 141 | ] 142 | }, 143 | { 144 | "cell_type": "markdown", 145 | "metadata": { 146 | "deletable": false, 147 | "editable": false, 148 | "nbgrader": { 149 | "cell_type": "markdown", 150 | "checksum": "6cd7184cfa3356edd1720b8fd51ed2c9", 151 | "grade": false, 152 | "grade_id": "cell-97d06568c071d9cc", 153 | "locked": true, 154 | "schema_version": 3, 155 | "solution": false, 156 | "task": false 157 | } 158 | }, 159 | "source": [ 160 | "### Tile Coding Function" 161 | ] 162 | }, 163 | { 164 | "cell_type": "markdown", 165 | "metadata": { 166 | "deletable": false, 167 | "editable": false, 168 | "nbgrader": { 169 | "cell_type": "markdown", 170 | "checksum": "a88c06a2dc95f3c24dafc6896575fec7", 171 | "grade": false, 172 | "grade_id": "cell-cdfcd9285845ad67", 173 | "locked": true, 174 | "schema_version": 3, 175 | "solution": false, 176 | "task": false 177 | } 178 | }, 179 | "source": [ 180 | "Tile coding is introduced in [Section 9.5.4 of the textbook](http://www.incompleteideas.net/book/RLbook2018.pdf#page=239) as a way to create features that can both provide good generalization and discrimination. 
It consists of multiple overlapping tilings, where each tiling is a partitioning of the space into tiles.\n", 181 | "![Tile Coding](tilecoding.png \"Tile Coding\")" 182 | ] 183 | }, 184 | { 185 | "cell_type": "markdown", 186 | "metadata": { 187 | "deletable": false, 188 | "editable": false, 189 | "nbgrader": { 190 | "cell_type": "markdown", 191 | "checksum": "fbbce233f230a1d8be56f3da3f9b74ea", 192 | "grade": false, 193 | "grade_id": "cell-a4b22741e31308d4", 194 | "locked": true, 195 | "schema_version": 3, 196 | "solution": false, 197 | "task": false 198 | } 199 | }, 200 | "source": [ 201 | "To help keep our agent code clean we are going to make a function specific for tile coding for our Mountain Car environment. To help we are going to use the Tiles3 library. This is a Python 3 implementation of the tile coder. To start take a look at the documentation: [Tiles3 documentation](http://incompleteideas.net/tiles/tiles3.html)\n", 202 | "To get the tile coder working we need to implement a few pieces:\n", 203 | "- First: create an index hash table - this is done for you in the init function using tc.IHT.\n", 204 | "- Second is to scale the inputs for the tile coder based on the number of tiles and the range of values each input could take. The tile coder needs to take in a number in range [0, 1], or scaled to be [0, 1] * num_tiles. For more on this refer to the [Tiles3 documentation](http://incompleteideas.net/tiles/tiles3.html).\n", 205 | "- Finally we call tc.tiles to get the active tiles back." 206 | ] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "execution_count": 37, 211 | "metadata": { 212 | "deletable": false, 213 | "nbgrader": { 214 | "cell_type": "code", 215 | "checksum": "1f9f621377db1b80790b2203fdf266a5", 216 | "grade": false, 217 | "grade_id": "cell-5d4b035fb7a71186", 218 | "locked": false, 219 | "schema_version": 3, 220 | "solution": true, 221 | "task": false 222 | } 223 | }, 224 | "outputs": [], 225 | "source": [ 226 | "# -----------\n", 227 | "# Graded Cell\n", 228 | "# -----------\n", 229 | "class MountainCarTileCoder:\n", 230 | " def __init__(self, iht_size=4096, num_tilings=8, num_tiles=8):\n", 231 | " \"\"\"\n", 232 | " Initializes the MountainCar Tile Coder\n", 233 | " Initializers:\n", 234 | " iht_size -- int, the size of the index hash table, typically a power of 2\n", 235 | " num_tilings -- int, the number of tilings\n", 236 | " num_tiles -- int, the number of tiles. 
Here both the width and height of the\n", 237 | " tile coder are the same\n", 238 | " Class Variables:\n", 239 | " self.iht -- tc.IHT, the index hash table that the tile coder will use\n", 240 | " self.num_tilings -- int, the number of tilings the tile coder will use\n", 241 | " self.num_tiles -- int, the number of tiles the tile coder will use\n", 242 | " \"\"\"\n", 243 | " self.iht = tc.IHT(iht_size)\n", 244 | " self.num_tilings = num_tilings\n", 245 | " self.num_tiles = num_tiles\n", 246 | " \n", 247 | " def get_tiles(self, position, velocity):\n", 248 | " \"\"\"\n", 249 | " Takes in a position and velocity from the mountaincar environment\n", 250 | " and returns a numpy array of active tiles.\n", 251 | " \n", 252 | " Arguments:\n", 253 | " position -- float, the position of the agent between -1.2 and 0.5\n", 254 | " velocity -- float, the velocity of the agent between -0.07 and 0.07\n", 255 | " returns:\n", 256 | " tiles - np.array, active tiles\n", 257 | " \"\"\"\n", 258 | " # Use the ranges above and self.num_tiles to scale position and velocity to the range [0, 1]\n", 259 | " # then multiply that range with self.num_tiles so it scales from [0, num_tiles]\n", 260 | " \n", 261 | " position_scaled = 0\n", 262 | " velocity_scaled = 0\n", 263 | " \n", 264 | " # ----------------\n", 265 | " # your code here\n", 266 | " position_scaled = self.num_tiles*((position+1.2)/(0.5+1.2))\n", 267 | " velocity_scaled = self.num_tiles*((velocity+0.07)/(0.07+0.07))\n", 268 | " \n", 269 | " # ----------------\n", 270 | " \n", 271 | " # get the tiles using tc.tiles, with self.iht, self.num_tilings and [scaled position, scaled velocity]\n", 272 | " # nothing to implment here\n", 273 | " tiles = tc.tiles(self.iht, self.num_tilings, [position_scaled, velocity_scaled])\n", 274 | " \n", 275 | " return np.array(tiles)" 276 | ] 277 | }, 278 | { 279 | "cell_type": "code", 280 | "execution_count": 38, 281 | "metadata": { 282 | "deletable": false, 283 | "editable": false, 284 | "nbgrader": { 285 | "cell_type": "code", 286 | "checksum": "734698474359c7766846c7d26ffbaf67", 287 | "grade": true, 288 | "grade_id": "cell-beac2fa8ff1ef94e", 289 | "locked": true, 290 | "points": 10, 291 | "schema_version": 3, 292 | "solution": false, 293 | "task": false 294 | } 295 | }, 296 | "outputs": [], 297 | "source": [ 298 | "# -----------\n", 299 | "# Tested Cell\n", 300 | "# -----------\n", 301 | "# The contents of the cell will be tested by the autograder.\n", 302 | "# If they do not pass here, they will not pass there.\n", 303 | "\n", 304 | "# create a range of positions and velocities to test\n", 305 | "# then test every element in the cross-product between these lists\n", 306 | "pos_tests = np.linspace(-1.2, 0.5, num=5)\n", 307 | "vel_tests = np.linspace(-0.07, 0.07, num=5)\n", 308 | "tests = list(itertools.product(pos_tests, vel_tests))\n", 309 | "\n", 310 | "mctc = MountainCarTileCoder(iht_size=1024, num_tilings=8, num_tiles=2)\n", 311 | "\n", 312 | "t = []\n", 313 | "for test in tests:\n", 314 | " position, velocity = test\n", 315 | " tiles = mctc.get_tiles(position=position, velocity=velocity)\n", 316 | " t.append(tiles)\n", 317 | "\n", 318 | "expected = [\n", 319 | " [0, 1, 2, 3, 4, 5, 6, 7],\n", 320 | " [0, 1, 8, 3, 9, 10, 6, 11],\n", 321 | " [12, 13, 8, 14, 9, 10, 15, 11],\n", 322 | " [12, 13, 16, 14, 17, 18, 15, 19],\n", 323 | " [20, 21, 16, 22, 17, 18, 23, 19],\n", 324 | " [0, 1, 2, 3, 24, 25, 26, 27],\n", 325 | " [0, 1, 8, 3, 28, 29, 26, 30],\n", 326 | " [12, 13, 8, 14, 28, 29, 31, 30],\n", 327 | " [12, 13, 16, 14, 
32, 33, 31, 34],\n", 328 | " [20, 21, 16, 22, 32, 33, 35, 34],\n", 329 | " [36, 37, 38, 39, 24, 25, 26, 27],\n", 330 | " [36, 37, 40, 39, 28, 29, 26, 30],\n", 331 | " [41, 42, 40, 43, 28, 29, 31, 30],\n", 332 | " [41, 42, 44, 43, 32, 33, 31, 34],\n", 333 | " [45, 46, 44, 47, 32, 33, 35, 34],\n", 334 | " [36, 37, 38, 39, 48, 49, 50, 51],\n", 335 | " [36, 37, 40, 39, 52, 53, 50, 54],\n", 336 | " [41, 42, 40, 43, 52, 53, 55, 54],\n", 337 | " [41, 42, 44, 43, 56, 57, 55, 58],\n", 338 | " [45, 46, 44, 47, 56, 57, 59, 58],\n", 339 | " [60, 61, 62, 63, 48, 49, 50, 51],\n", 340 | " [60, 61, 64, 63, 52, 53, 50, 54],\n", 341 | " [65, 66, 64, 67, 52, 53, 55, 54],\n", 342 | " [65, 66, 68, 67, 56, 57, 55, 58],\n", 343 | " [69, 70, 68, 71, 56, 57, 59, 58],\n", 344 | "]\n", 345 | "assert np.all(expected == np.array(t))" 346 | ] 347 | }, 348 | { 349 | "cell_type": "markdown", 350 | "metadata": { 351 | "deletable": false, 352 | "editable": false, 353 | "nbgrader": { 354 | "cell_type": "markdown", 355 | "checksum": "089f4dcbbff9c5efec3476ee8e8ba1c9", 356 | "grade": false, 357 | "grade_id": "cell-5191224461a0f3b5", 358 | "locked": true, 359 | "schema_version": 3, 360 | "solution": false, 361 | "task": false 362 | } 363 | }, 364 | "source": [ 365 | "## Section 1: Sarsa Agent" 366 | ] 367 | }, 368 | { 369 | "cell_type": "markdown", 370 | "metadata": { 371 | "deletable": false, 372 | "editable": false, 373 | "nbgrader": { 374 | "cell_type": "markdown", 375 | "checksum": "f0c661efa74205b51539471c4285118a", 376 | "grade": false, 377 | "grade_id": "cell-5db2a3c6722ea91c", 378 | "locked": true, 379 | "schema_version": 3, 380 | "solution": false, 381 | "task": false 382 | } 383 | }, 384 | "source": [ 385 | "We are now going to use the functions that we just created to implement the Sarsa algorithm. Recall from class that Sarsa stands for State, Action, Reward, State, Action.\n", 386 | "\n", 387 | "For this case we have given you an argmax function similar to what you wrote back in Course 1 Assignment 1. Recall, this is different than the argmax function that is used by numpy, which returns the first index of a maximum value. We want our argmax function to arbitrarily break ties, which is what the imported argmax function does. The given argmax function takes in an array of values and returns an int of the chosen action: \n", 388 | "argmax(action values)\n", 389 | "\n", 390 | "There are multiple ways that we can deal with actions for the tile coder. Here we are going to use one simple method - make the size of the weight vector equal to (iht_size, num_actions). This will give us one weight vector for each action and one weight for each tile.\n", 391 | "\n", 392 | "Use the above function to help fill in select_action, agent_start, agent_step, and agent_end.\n", 393 | "\n", 394 | "Hints:\n", 395 | "\n", 396 | "1) The tile coder returns a list of active indexes (e.g. [1, 12, 22]). You can index a numpy array using an array of values - this will return an array of the values at each of those indices. So in order to get the value of a state we can index our weight vector using the action and the array of tiles that the tile coder returns:\n", 397 | "\n", 398 | "```self.w[action][active_tiles]```\n", 399 | "\n", 400 | "This will give us an array of values, one for each active tile, and we sum the result to get the value of that state-action pair.\n", 401 | "\n", 402 | "2) In the case of a binary feature vector (such as the tile coder), the derivative is 1 at each of the active tiles, and zero otherwise." 
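The two hints above boil down to a couple of lines of numpy. The standalone function below is a minimal sketch of one Sarsa update for binary tile features; the function and argument names are placeholders for illustration, and this is not the graded agent, which follows in the next cell.

```python
import numpy as np

def sarsa_tile_update(w, alpha, gamma, reward, last_action, prev_tiles, next_value):
    """One illustrative Sarsa update for binary tile features (standalone sketch).

    w          : array of shape (num_actions, iht_size), one weight vector per action
    prev_tiles : array of active tile indices for the previous state
    next_value : value of the next state-action pair (use 0.0 at the end of an episode)
    """
    # Hint 1: the value of a state-action pair is the sum of the weights at the active tiles.
    prev_value = w[last_action][prev_tiles].sum()
    # Hint 2: with binary features the gradient is 1 at each active tile and 0 elsewhere,
    # so only the active tiles' weights move, each by alpha times the TD error.
    td_error = reward + gamma * next_value - prev_value
    w[last_action][prev_tiles] += alpha * td_error
    return w
```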
403 | ] 404 | }, 405 | { 406 | "cell_type": "code", 407 | "execution_count": 42, 408 | "metadata": { 409 | "deletable": false, 410 | "nbgrader": { 411 | "cell_type": "code", 412 | "checksum": "c1dd6c5e729fc638934b67090e2c92a0", 413 | "grade": false, 414 | "grade_id": "cell-50303440b2e9be74", 415 | "locked": false, 416 | "schema_version": 3, 417 | "solution": true, 418 | "task": false 419 | } 420 | }, 421 | "outputs": [], 422 | "source": [ 423 | "# -----------\n", 424 | "# Graded Cell\n", 425 | "# -----------\n", 426 | "class SarsaAgent(BaseAgent):\n", 427 | " \"\"\"\n", 428 | " Initialization of Sarsa Agent. All values are set to None so they can\n", 429 | " be initialized in the agent_init method.\n", 430 | " \"\"\"\n", 431 | " def __init__(self):\n", 432 | " self.last_action = None\n", 433 | " self.last_state = None\n", 434 | " self.epsilon = None\n", 435 | " self.gamma = None\n", 436 | " self.iht_size = None\n", 437 | " self.w = None\n", 438 | " self.alpha = None\n", 439 | " self.num_tilings = None\n", 440 | " self.num_tiles = None\n", 441 | " self.mctc = None\n", 442 | " self.initial_weights = None\n", 443 | " self.num_actions = None\n", 444 | " self.previous_tiles = None\n", 445 | "\n", 446 | " def agent_init(self, agent_info={}):\n", 447 | " \"\"\"Setup for the agent called when the experiment first starts.\"\"\"\n", 448 | " self.num_tilings = agent_info.get(\"num_tilings\", 8)\n", 449 | " self.num_tiles = agent_info.get(\"num_tiles\", 8)\n", 450 | " self.iht_size = agent_info.get(\"iht_size\", 4096)\n", 451 | " self.epsilon = agent_info.get(\"epsilon\", 0.0)\n", 452 | " self.gamma = agent_info.get(\"gamma\", 1.0)\n", 453 | " self.alpha = agent_info.get(\"alpha\", 0.5) / self.num_tilings\n", 454 | " self.initial_weights = agent_info.get(\"initial_weights\", 0.0)\n", 455 | " self.num_actions = agent_info.get(\"num_actions\", 3)\n", 456 | " \n", 457 | " # We initialize self.w to three times the iht_size. 
Recall this is because\n", 458 | " # we need to have one set of weights for each action.\n", 459 | " self.w = np.ones((self.num_actions, self.iht_size)) * self.initial_weights\n", 460 | " \n", 461 | " # We initialize self.mctc to the mountaincar verions of the \n", 462 | " # tile coder that we created\n", 463 | " self.tc = MountainCarTileCoder(iht_size=self.iht_size, \n", 464 | " num_tilings=self.num_tilings, \n", 465 | " num_tiles=self.num_tiles)\n", 466 | "\n", 467 | " def select_action(self, tiles):\n", 468 | " \"\"\"\n", 469 | " Selects an action using epsilon greedy\n", 470 | " Args:\n", 471 | " tiles - np.array, an array of active tiles\n", 472 | " Returns:\n", 473 | " (chosen_action, action_value) - (int, float), tuple of the chosen action\n", 474 | " and it's value\n", 475 | " \"\"\"\n", 476 | " action_values = []\n", 477 | " chosen_action = None\n", 478 | " \n", 479 | " # First loop through the weights of each action and populate action_values\n", 480 | " # with the action value for each action and tiles instance\n", 481 | " \n", 482 | " # Use np.random.random to decide if an exploritory action should be taken\n", 483 | " # and set chosen_action to a random action if it is\n", 484 | " # Otherwise choose the greedy action using the given argmax \n", 485 | " # function and the action values (don't use numpy's armax)\n", 486 | " \n", 487 | " # ----------------\n", 488 | " # your code here\n", 489 | " action_values = [sum(self.w[action][tiles]) for action in range(self.num_actions)]\n", 490 | " \n", 491 | " if(np.random.random()>self.epsilon):\n", 492 | " chosen_action= argmax(action_values)\n", 493 | " else:\n", 494 | " chosen_action=np.random.choice(len(action_values))\n", 495 | " # ----------------\n", 496 | "\n", 497 | " return chosen_action, action_values[chosen_action]\n", 498 | " \n", 499 | " def agent_start(self, state):\n", 500 | " \"\"\"The first method called when the experiment starts, called after\n", 501 | " the environment starts.\n", 502 | " Args:\n", 503 | " state (Numpy array): the state observation from the\n", 504 | " environment's evn_start function.\n", 505 | " Returns:\n", 506 | " The first action the agent takes.\n", 507 | " \"\"\"\n", 508 | " position, velocity = state\n", 509 | " \n", 510 | " # Use self.tc to set active_tiles using position and velocity\n", 511 | " # set current_action to the epsilon greedy chosen action using\n", 512 | " # the select_action function above with the active tiles\n", 513 | " \n", 514 | " # ----------------\n", 515 | " # your code here\n", 516 | " active_tiles = self.tc.get_tiles(position=position, velocity=velocity)\n", 517 | " current_action = self.select_action(active_tiles)\n", 518 | " \n", 519 | " # ----------------\n", 520 | " \n", 521 | " self.last_action = current_action[0]\n", 522 | " self.previous_tiles = np.copy(active_tiles)\n", 523 | " return self.last_action\n", 524 | "\n", 525 | " def agent_step(self, reward, state):\n", 526 | " \"\"\"A step taken by the agent.\n", 527 | " Args:\n", 528 | " reward (float): the reward received for taking the last action taken\n", 529 | " state (Numpy array): the state observation from the\n", 530 | " environment's step based, where the agent ended up after the\n", 531 | " last step\n", 532 | " Returns:\n", 533 | " The action the agent is taking.\n", 534 | " \"\"\"\n", 535 | " # choose the action here\n", 536 | " position, velocity = state\n", 537 | " \n", 538 | " # Use self.tc to set active_tiles using position and velocity\n", 539 | " # set current_action and action_value to the 
epsilon greedy chosen action using\n", 540 | " # the select_action function above with the active tiles\n", 541 | " \n", 542 | " # Update self.w at self.previous_tiles and self.previous action\n", 543 | " # using the reward, action_value, self.gamma, self.w,\n", 544 | " # self.alpha, and the Sarsa update from the textbook\n", 545 | " \n", 546 | " # ----------------\n", 547 | " # your code here\n", 548 | " active_tiles = self.tc.get_tiles(position=position, velocity=velocity)\n", 549 | " current_action, action_value = self.select_action(active_tiles)\n", 550 | "\n", 551 | " self.w[self.last_action][self.previous_tiles] += self.alpha*(reward + self.gamma*action_value\n", 552 | " -(sum(self.w[self.last_action][self.previous_tiles])))\n", 553 | " \n", 554 | " # ----------------\n", 555 | " \n", 556 | " self.last_action = current_action\n", 557 | " self.previous_tiles = np.copy(active_tiles)\n", 558 | " #print(self.w)\n", 559 | " return self.last_action\n", 560 | "\n", 561 | " def agent_end(self, reward):\n", 562 | " \"\"\"Run when the agent terminates.\n", 563 | " Args:\n", 564 | " reward (float): the reward the agent received for entering the\n", 565 | " terminal state.\n", 566 | " \"\"\"\n", 567 | " # Update self.w at self.previous_tiles and self.previous action\n", 568 | " # using the reward, self.gamma, self.w,\n", 569 | " # self.alpha, and the Sarsa update from the textbook\n", 570 | " # Hint - there is no action_value used here because this is the end\n", 571 | " # of the episode.\n", 572 | " \n", 573 | " # ----------------\n", 574 | " # your code here\n", 575 | " \n", 576 | " self.w[self.last_action][self.previous_tiles] += self.alpha*(reward-sum(self.w[self.last_action][self.previous_tiles]))\n", 577 | " \n", 578 | " \n", 579 | " # ----------------\n", 580 | " \n", 581 | " def agent_cleanup(self):\n", 582 | " \"\"\"Cleanup done after the agent ends.\"\"\"\n", 583 | " pass\n", 584 | "\n", 585 | " def agent_message(self, message):\n", 586 | " \"\"\"A function used to pass information from the agent to the experiment.\n", 587 | " Args:\n", 588 | " message: The message passed to the agent.\n", 589 | " Returns:\n", 590 | " The response (or answer) to the message.\n", 591 | " \"\"\"\n", 592 | " pass" 593 | ] 594 | }, 595 | { 596 | "cell_type": "code", 597 | "execution_count": 43, 598 | "metadata": { 599 | "deletable": false, 600 | "editable": false, 601 | "nbgrader": { 602 | "cell_type": "code", 603 | "checksum": "692ac428d5e59bae3f74450657877a50", 604 | "grade": true, 605 | "grade_id": "cell-0cf3e9c19ac6be06", 606 | "locked": true, 607 | "points": 5, 608 | "schema_version": 3, 609 | "solution": false, 610 | "task": false 611 | } 612 | }, 613 | "outputs": [ 614 | { 615 | "name": "stdout", 616 | "output_type": "stream", 617 | "text": [ 618 | "action distribution: [ 29. 35. 
936.]\n" 619 | ] 620 | } 621 | ], 622 | "source": [ 623 | "# -----------\n", 624 | "# Tested Cell\n", 625 | "# -----------\n", 626 | "# The contents of the cell will be tested by the autograder.\n", 627 | "# If they do not pass here, they will not pass there.\n", 628 | "\n", 629 | "np.random.seed(0)\n", 630 | "\n", 631 | "agent = SarsaAgent()\n", 632 | "agent.agent_init({\"epsilon\": 0.1})\n", 633 | "agent.w = np.array([np.array([1, 2, 3]), np.array([4, 5, 6]), np.array([7, 8, 9])])\n", 634 | "\n", 635 | "action_distribution = np.zeros(3)\n", 636 | "for i in range(1000):\n", 637 | " chosen_action, action_value = agent.select_action(np.array([0,1]))\n", 638 | " action_distribution[chosen_action] += 1\n", 639 | " \n", 640 | "print(\"action distribution:\", action_distribution)\n", 641 | "# notice that the two non-greedy actions are roughly uniformly distributed\n", 642 | "assert np.all(action_distribution == [29, 35, 936])\n", 643 | "\n", 644 | "agent = SarsaAgent()\n", 645 | "agent.agent_init({\"epsilon\": 0.0})\n", 646 | "agent.w = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])\n", 647 | "\n", 648 | "chosen_action, action_value = agent.select_action([0, 1])\n", 649 | "assert chosen_action == 2\n", 650 | "assert action_value == 15\n", 651 | "\n", 652 | "# -----------\n", 653 | "# test update\n", 654 | "# -----------\n", 655 | "agent = SarsaAgent()\n", 656 | "agent.agent_init({\"epsilon\": 0.1})\n", 657 | "\n", 658 | "agent.agent_start((0.1, 0.3))\n", 659 | "agent.agent_step(1, (0.02, 0.1))\n", 660 | "\n", 661 | "assert np.all(agent.w[0,0:8] == 0.0625)\n", 662 | "assert np.all(agent.w[1:] == 0)" 663 | ] 664 | }, 665 | { 666 | "cell_type": "code", 667 | "execution_count": 44, 668 | "metadata": { 669 | "deletable": false, 670 | "editable": false, 671 | "nbgrader": { 672 | "cell_type": "code", 673 | "checksum": "31da193410fe9153637b4e5043c81176", 674 | "grade": true, 675 | "grade_id": "cell-5e2a49e089992132", 676 | "locked": true, 677 | "points": 25, 678 | "schema_version": 3, 679 | "solution": false, 680 | "task": false 681 | } 682 | }, 683 | "outputs": [ 684 | { 685 | "name": "stdout", 686 | "output_type": "stream", 687 | "text": [ 688 | "RUN: 0\n", 689 | "RUN: 5\n", 690 | "Run time: 9.558087348937988\n" 691 | ] 692 | }, 693 | { 694 | "data": { 695 | "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAX0AAAD4CAYAAAAAczaOAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+j8jraAAAgAElEQVR4nO3deXRc5Znn8e+jXbL21bIk4wV5kc0WCwKGEMJqlgC9EJw5PSHLOZzJ0NNJp7NAZ7LNjNNMZzrbdJPTDEkgnQ5LEzq4u0OIcYIhYBaZzfu+yZYl2bIWy9ZW9cwfdW3KtrypJJdc9/c5R6eq3rq36nnx4adX7733vebuiIhIOKQluwARETl7FPoiIiGi0BcRCRGFvohIiCj0RURCJCPZBZxKeXm5T5kyJdlliIicU1asWLHX3SuObR/3oT9lyhSampqSXYaIyDnFzLYP167pHRGREFHoi4iEiEJfRCREFPoiIiGi0BcRCRGFvohIiCj0RURCJGVD/7FXt7H43d3JLkNEZFxJ2dB//I0d/JtCX0TkKCkb+kW5mXQdGkx2GSIi48opQ9/MfmJmbWa2apj3vmhmbmblcW0PmNkmM1tvZjfFtc8zs5XBez80Mxu9bhyvKDeTroMKfRGReKcz0n8UWHBso5nVATcAO+LaGoCFwJxgn4fMLD14+0fAvUB98HPcZ44mjfRFRI53ytB395eAjmHe+h7wZSD+Jrt3AE+4e7+7bwU2AZeZWTVQ6O7LPXZT3p8BdyZc/UkU5yn0RUSONaI5fTO7Hdjl7u8e81YNsDPudXPQVhM8P7b9RJ9/r5k1mVlTe3v7SEqkKDeTQ4MR+ociI9pfRCQVnXHom1ke8FXg68O9PUybn6R9WO7+sLs3untjRcVxy0GflqLcTACN9kVE4oxkpD8dmAq8a2bbgFrgLTObSGwEXxe3bS2wO2ivHaZ9zBTlZQHQrdAXETnijEPf3Ve6e6W7T3H3KcQC/QPuvgdYDCw0s2wzm0rsgO0b7t4C9JjZ5cFZO58Anh29bhxPI30RkeOdzimbjwPLgZlm1mxmnznRtu6+GngKWAP8BrjP3Q9Pqn8WeITYwd3NwHMJ1n5Sh0O/U6dtiogcccrbJbr7x0/x/pRjXi8CFg2zXRMw9wzrG7FijfRFRI6T0lfkgkJfRCReyoZ+oUJfROQ4KRv66WlGQXaG5vRFROKkbOgDFOVl6pRNEZE4qR36Wn9HROQoKR/6nQp9EZEjUjr0teiaiMjRUjr0Nb0jInK0lA79wiD0Y6s5i4hISod+UW4mA0NR+gajyS5FRGRcSOnQL86NrbSpKR4RkZiUDn0txSAicrRQhH7nwYEkVyIiMj6kdOgX52mkLyISL6VDX9M7IiJHS+nQ10qbIiJHS+nQL8jOwEyhLyJyWEqHflqa6apcEZE4KR36oKUYRETihSL0dSMVEZGYUIS+RvoiIjGhCH3dPUtEJOaUoW9mPzGzNjNbFdf2HTNbZ2bvmdm/mllx3HsPmNkmM1tvZjfFtc8zs5XBez80Mxv97hxPI30Rkfedzkj/UWDBMW1LgLnufiGwAXgAwMwagIXAnGCfh8wsPdjnR8C9QH3wc+xnjonivNjds7S8sojIaYS+u78EdBzT9lt3HwpevgbUBs/vAJ5w93533wpsAi4zs2qg0N2Xeyx9fwbcOVqdOJmi3EwiUad3IHI2vk5EZFwbjTn9TwPPBc9rgJ1x7zUHbTXB82Pbh2Vm95pZk5k1tbe3J1SclmIQEXlfQqFvZl8FhoB/Ptw0zGZ+kvZhufvD7t7o7o0VFRWJlKiVNkVE4mSMdEczuwe4DbjO358wbwbq4jarBXYH7bXDtI+5It1IRUTkiBGN9M1sAfAV4HZ3Pxj31mJgoZllm9lUYgds33D3FqDHzC4Pztr5BPBsgrWflsMjfZ22KSJyGiN9M3scuAYoN7Nm4BvEztbJBpYEZ16+5u7/xd1Xm9lTwBpi0z73ufvhI6ifJXYmUC6xYwDPcRYUaU19EZEjThn67v7xYZp/fJLtFwGLhmlvAuaeUXWjoPjInL5CX0Qk5a/IzctKJyPNNNIXESEEoW+m5ZVFRA5L+dCHYKVNhb6ISEhCP0+LromIQFhCX9M7IiKAQl9EJFRCEfrFunuWiAgQktAvys2ku2+QaFTLK4tIuIUi9AtzM3GHnv6hU28sIpLCQhH6R5ZX1hSPiIRcKEK/OE8rbYqIQEhCXzdSERGJUeiLiIRIKEK/OFheufOQ7p4lIuEWitDXSF9EJCYUoZ+TmU5WRppCX0RCLxShD7GrcnXKpoiEXWhCX+vviIgo9EVEQkWhLyISIuEJ/TyttCkiEp7Qz9Xds0REThn6ZvYTM2szs1VxbaVmtsTMNgaPJXHvPWBmm8xsvZndFNc+z8xWBu/90Mxs9LtzYkW5mfT0DzEUiZ7NrxURGVdOZ6T/KLDgmLb7gaXuXg8sDV5jZg3AQmBOsM9DZpYe7PMj4F6gPvg59jPHVHFwgVZ3n5ZXFpHwOmXou/tLQMcxzXcAjwXPHwPujGt/wt373X0rsAm4zMyqgUJ3X+7uDvwsbp+zoihPV+WKiIx0Tr/K3VsAgsfKoL0G2Bm3XXPQVhM8P7Z9WGZ2r5k1mVlTe3v7CEs8mpZiEBEZ/QO5w83T+0nah+XuD7t7o7s3VlRUjEphCn0RkZGHfmswZUPw2Ba0NwN1cdvVAruD9tph2s+aotzYjVQ6D2qlTREJr5GG/mLgnuD5PcCzce0LzSzbzKYSO2D7RjAF1GNmlwdn7Xwibp+z4vBIX6dtikiYZZxqAzN7HLgGKDezZuAbwIPAU2b2GWAHcBeAu682s6eANcAQcJ+7R4KP+iyxM4FygeeCn7NG0zsiIqcR+u7+8RO8dd0Jtl8ELBqmvQmYe0bVjaKsjDTystIV+iISaqG5Ihdio30txSAiYRa60NdIX0TCLFShX6jQF5GQC1XoFyv0RSTkQhX6mt4RkbBT6IuIhEioQr84L5ODAxEGhrS8soiEU6hCXxdoiUjYhSr0CxX6IhJyoQp9jfRFJOxCFfrFebGVNrsOaaVNEQmnUIW+RvoiEnbhDH2tvyMiIRWq0C/MiS0q2nVIN0cXkXAKVehnpKdRkJ1Bp+b0RSSkQhX6oEXXRCTcQhf6xXmZumWiiIRW6EJfN1IRkTALZehrekdEwip0oV9dlMvO/QfpG4ycemMRkRQTutD/UH05fYNR3tzWkexSRETOutCF/genlZKVnsZLG9qTXYqIyFmXUOib2V+a2WozW2Vmj5tZjpmVmtkSM9sYPJbEbf+AmW0ys/VmdlPi5Z+5vKwMLptayjKFvoiE0IhD38xqgL8AGt19LpAOLATuB5a6ez2wNHiNmTUE788BFgAPmVl6YuWPzIdnVLCh9QC7Ow8l4+tFRJIm0emdDCDXzDKAPGA3cAfwWPD+Y8CdwfM7gCfcvd/dtwKbgMsS/P4RuXpGBQAvb9RoX0TCZcSh7+67gP8D7ABagC53/y1Q5e4twTYtQGWwSw2wM+4jmoO245jZvWbWZG
ZN7e2jH8wzqvKZWJijKR4RCZ1EpndKiI3epwKTgAlm9mcn22WYNh9uQ3d/2N0b3b2xoqJipCWeuBAzPjyjgj9s3MtQRPfLFZHwSGR653pgq7u3u/sg8AwwH2g1s2qA4LEt2L4ZqIvbv5bYdFBSXD2jgu6+Id5t7kxWCSIiZ10iob8DuNzM8szMgOuAtcBi4J5gm3uAZ4Pni4GFZpZtZlOBeuCNBL4/IVedX06awbL1muIRkfBIZE7/deBp4C1gZfBZDwMPAjeY2UbghuA17r4aeApYA/wGuM/dk3ZZbFFeJpdMLmHZxr3JKkFE5KzLSGRnd/8G8I1jmvuJjfqH234RsCiR7xxNV9dX8P2lG+joHaB0QlayyxERGXOhuyI33odnVuCuUzdFJDxCHfoX1BRRkpfJSxs0xSMi4RDq0E9PMz5UX8GyDe1Eo8OePSoiklJCHfoQO3Vz74F+1u7pTnYpIiJjTqFfXw6gKR4RCYXQh35lYQ4N1YUs29B26o1FRM5xoQ99iE3xNG3bz4H+oWSXIiIyphT6xJZaHoo6yzfvS3YpIiJjSqEPzDuvhAlZ6ZriEZGUp9AHsjLSuGJ6Ocs2tOOuUzdFJHUp9APXzKxgZ8chNrUdSHYpIiJjRqEfuKGhCjN4btWeZJciIjJmFPqBqsIcGs8rUeiLSEpT6MdZMLeatS3dbN3bm+xSRETGhEI/zoK5EwF4blVLkisRERkbCv04NcW5XFRXzG80xSMiKUqhf4xb5k7kveYudnYcTHYpIiKjTqF/jJvnVgPw/GqN9kUk9Sj0jzG5LI85kwr59UrN64tI6lHoD+OWC6p5a0cnLV2Hkl2KiMioUugP4/BZPM/rgK6IpBiF/jCmV+Qzs6qAXyv0RSTFJBT6ZlZsZk+b2TozW2tmV5hZqZktMbONwWNJ3PYPmNkmM1tvZjclXv7YWTB3Im9u66C9pz/ZpYiIjJpER/o/AH7j7rOAi4C1wP3AUnevB5YGrzGzBmAhMAdYADxkZukJfv+YueWCatx1Fo+IpJYRh76ZFQJXAz8GcPcBd+8E7gAeCzZ7DLgzeH4H8IS797v7VmATcNlIv3+szajKZ1r5BF2dKyIpJZGR/jSgHfipmb1tZo+Y2QSgyt1bAILHymD7GmBn3P7NQdtxzOxeM2sys6b29vYEShw5M+PmCyby2pYOOnoHklKDiMhoSyT0M4APAD9y90uAXoKpnBOwYdqGvWOJuz/s7o3u3lhRUZFAiYm5eW41kaizZI2meEQkNSQS+s1As7u/Hrx+mtgvgVYzqwYIHtvitq+L278W2J3A94+5OZMKqSvN1XLLIpIyRhz67r4H2GlmM4Om64A1wGLgnqDtHuDZ4PliYKGZZZvZVKAeeGOk3382mBm3zK3mlU176To4mOxyREQSlujZO/8N+Gczew+4GPg28CBwg5ltBG4IXuPuq4GniP1i+A1wn7tHEvz+MXfrhdUMRpwfLduc7FJERBKWkcjO7v4O0DjMW9edYPtFwKJEvvNsu7C2mIWX1vGPL23m6hnlzJ9enuySRERGTFfknoavf7SBqWUT+MKT79J5UGfyiMi5S6F/GvKyMvjBwkvY19vPA8+sxH3Yk45ERMY9hf5puqC2iL+6cSbPrdrDU007T72DiMg4pNA/A/d+aBrzp5fxzcVr2NJ+INnliIicMYX+GUhLM777sYvJzkzjc0+8w8BQNNkliYicEYX+GZpYlMODf3whK3d18d0lG5JdjojIGVHoj8CCuRP5+GWx0zhf27Iv2eWIiJw2hf4Ife22BsomZPNPy7cnuxQRkdOm0B+hvKwMrp9dyUsb2jW3LyLnDIV+Aq6dVUlP/xBN2zqSXYqIyGlR6CfgqvpysjLSeGFt26k3FhEZBxT6CcjLymD+9DKWrmvVVboick5Q6CfoulmVbN93kC17e5NdiojIKSn0E3Tt7CoAlq5tTXIlIiKnptBPUE1xLrMmFrBU8/oicg5Q6I+C62ZX0rR9v+6uJSLjnkJ/FFw3u4pI1Hlxg0b7IjK+KfRHwUW1xZRNyOJ36xT6IjK+KfRHQXqa8ZFZlby4vp2hiK7OFZHxS6E/Sq6bVUnXoUFWbN+f7FJERE5IoT9KrqovJzPdNMUjIuOaQn+UFORkcvm0Ml7Q+foiMo4lHPpmlm5mb5vZvwevS81siZltDB5L4rZ9wMw2mdl6M7sp0e8eb66dVcnm9l626epcERmnRmOk/zlgbdzr+4Gl7l4PLA1eY2YNwEJgDrAAeMjM0kfh+8eN62bFrs7VFI+IjFcJhb6Z1QK3Ao/ENd8BPBY8fwy4M679CXfvd/etwCbgskS+f7yZXJZHfWU+S9dpikdExqdER/rfB74MxJ+nWOXuLQDBY2XQXgPsjNuuOWg7jpnda2ZNZtbU3t6eYIln17WzK3l9Swc9fbo6V0TGnxGHvpndBrS5+4rT3WWYtmHXI3b3h9290d0bKyoqRlpiUlw/u4qhqPPyxr3JLkVE5DgZCex7JXC7md0C5ACFZvZzoNXMqt29xcyqgcMT3M1AXdz+tcDuBL5/XLqkrpjivEy+sXg1j7y8hayMNLIz0oPHNMrzs2mcUsJlU0upLMhJdrkiEjI2Gjf/MLNrgC+6+21m9h1gn7s/aGb3A6Xu/mUzmwP8gtg8/iRiB3nr3T1yss9ubGz0pqamhGs8m558cwdL1rTSPxR9/2cwwkAkyp6uPg4OxLo8rXwCl00t5YPTSvng1DImFecmuXIRSRVmtsLdG49tT2SkfyIPAk+Z2WeAHcBdAO6+2syeAtYAQ8B9pwr8c9Xdl07m7ksnD/veUCTK6t3dvL51H29s7eDXK1t44s3YoY7Z1YXc0FDFjQ1VzJlUiNlwM2IiIiM3KiP9sXQujvTPRCTqrN/Twyub9rJkTStN2zuIOkwqyuH6hipubJjIleeX6ReAiJyRE430FfrjzL4D/fxuXRtL1rTy0sZ2+gaj3N1Yx9/88QWkpSn4ReT0nM3pHUlAWX42dzXWcVdjHX2DEX64dCMPvbiZtDRj0Z1zFfwikhCF/jiWk5nOl26aCcBDL24mPQ3+5x1zNdUjIiOm0B/nzIwv3TSTiDv/uGwL6WZ88/Y5Cn4RGRGF/jnAzLh/wSwiEeeRP2wlLc34+m0NCn4ROWMK/XOEmfHVW2cTceenr2wjPXit4BeRM6HQP4eYxUb47vDIH7bS0TvA1z/aQHFeVrJLE5FzhG6ico4xM77x0Qb+4trzefbd3Vz3d8t49p1dnOrU2/6hCB29A2epShEZrxT65yAz4ws3zuTf/vwqakvz+NwT7/DJn77Jzo6DR20XjTrLN+/j/l++x6X/6wWu+t+/Y1NbT5KqFpHxQBdnneMiUeeflm/jO8+vJ+rwlzfUc8W0cv7tvd0sfmc3e7r7yMtK58aGKl7auJfKgmx+dd+V5GSm1P1rROQYuiI3xe3uPMTXn1195B69GWnGNTMruP3iGm6YXUVuVjq/X9fGpx59k0/On8I3b5+T5IpFZCzpitwUN6k4l//3iXm8uL6dtp4+bmyYSMmEow/wfmRWJZ++c
io/eWUrV51fzvUNVUmqVkSSRXP6KcTM+MisSu6+dPJxgX/YV26eyZxJhXzp6XfZ09V3lisUkWRT6IdMdkY6//fjl9A/FOXzT75NJDq+p/dEZHQp9ENoWkU+37p9Dq9t6eBHL2467v21Ld0s+o813Pi9ZfzghY36xSCSQjSnH1J/Oq+Wlzfu5XsvbOSK6WVMLp3As+/s4pm3drGmpZvMdGPmxAK+98IGXt28lx8svISJRbq9o8i5TmfvhFhP3yC3/PBlOnsHOTgYIRJ1Lqwt4k8+UMtHL5pE6YQsfrmima89u4rsjDT+7mMXce0sHfwVORfolE0Z1nvNnfz3X61i/vRy/uQDNdRXFRy3zeb2A/z5L95mbUs3n7lqKl9ZMIusjPdnBiNRZ0fHQTa09tDSeYgD/UP09A3R3TdET98gB/qHONgfYSgaJRJ1hqJ+5DEzPY1PzZ/Cn86r1b0CREaRQl8S0jcY4cHn1vHoq9u4oKaIm+ZUsbHtABtaD7C5/QADQ9Gjts/OSKMgJ5OCnAwKcjLIy0onMz2N9DQjI81IMyMj3djZcYiVu7q4qK6Yb90+h4vrik+7poGhKKt2d7Fi236atneQnZHO/TfP0g3mRVDoyyj57eo9fOnp9+g6NEhNcS71VfnMqCrg/MrYY11JLgU5mUf9JXAy7s6v3tnFt3+9jvaefu5urOPLC2ZSlp993HZ7uvtYtaubt3fsp2nbft5t7qQ/+GUzuTSPvQf6STfjax9t4K55tVqBVEJNoS+jpm8wwlDUyc8evfMAevoG+eHSjfz0lW3kZaXz+etnUFWYw6rdXaza1cWa3d3sCxaMy0gz5tQU0XheCY3nlTBvSgmVBTns2HeQLz79Lm9s7eC6WZV8+48voKpQB58lnBT6ck7Y1NbDNxev4Q+b9gKxgJ9RVcDcmkLmTCpibk0hDdVF5GYNv3ZQNOo8+uo2/vb5dWRnpPOt2+dwx8WTzplRf9ehQbLS007YP5HTNeqhb2Z1wM+AiUAUeNjdf2BmpcCTwBRgG/Axd98f7PMA8BkgAvyFuz9/qu9R6IePu7Ni+35yMtOpr8onO+PMA3BL+wG++C/v8taOTq6dVcntF03i8mllpzztNBp1Wnv6KM/PJjN97C9j6e0f4o1tHby6aS+vbNrHmpZuzKC2JJf6ygLqK/M5vzKf+qoCZlYV6JeBnLaxCP1qoNrd3zKzAmAFcCfwSaDD3R80s/uBEnf/ipk1AI8DlwGTgBeAGe4eOdn3KPRlpCJR55GXt/DQi5vpOjQIwJSyPK6YXsbl08q4uK6Y1u5+1u3pZt2eHta1dLN+Tw+9AxGmlOXx1VsbuH525Un/Sli3p5u//c161u/p4dYLq/lYYx3nV+afcHt3Z/Xubl5Y28orm/by9o5OhqJOVnoaHzivmPnTywHY2HaAja09bGnvZSASO26RlZHG/OllXDerkmtnV1EzzAHrnR0HeWXTXl7dvI91e7q5uK6Ya2ZWclV9OYU5mYn855RzzJhP75jZs8DfBz/XuHtL8IvhRXefGYzycfe/CbZ/Hvimuy8/2ecq9CVRkaiztqWb17bs47Ut+3h9awc9fUNHbVOUm8msiQXMri6ktiSXx9/Yweb2Xq46v5yv3dbAzIlHn8q6q/MQ3/3tBp55u5mC7AwumVzCK5v2MhR15p1Xwt2Nddx6YTUTsjMYGIry+tZ9LFnTygtrWtnd1YcZXFBTxPzp5Vx5fhmN55UOO4ofikTZuf8QG1p7eH1LB0vXtbJ9X+y+CbMmFnD97CqmVUzgja0dvLJ5Lzs7DgFQUZDNrIkFvLOzk56+IdLTjHnnlXDNzAqumVHJ9MoJp/UXlLvTHew/msdwZOyNaeib2RTgJWAusMPdi+Pe2+/uJWb298Br7v7zoP3HwHPu/vQwn3cvcC/A5MmT523fvj3hGkUOi0SdNbu7eW9XJ5OKcplVXcDEwpyjRvSDkSg/f207339hIz19g/ynD07mCzfMJN2Mh17cxE9f3QbAJ+dP4b9eM53ivCzae/p55q1mnmzayZb2XiZkpdM4pZS3duynp2+InMw0rq6v4IaGKq6dVXncGUqnw93Z3N7L79a18sLaNlZs308k6hTkZHD5tDKunF7GleeXc35lPmbGUCTK2zs7+f26Nl5c386alu4jn1WUm0lFQTYV+dlUFmZTnp/NUCRKW09/8NNHe08/fYOxvzSmV0zgotpiLqqL/cyuLhjR1JucHWMW+maWDywDFrn7M2bWeYLQ/wdg+TGh/2t3/+XJPl8jfUmm/b0DfP+FDfz89R3kZaWTZkZ33yB/dEkNX7hhBrUlecft4+68tWM/T765kze37efSKSXc0DCRq84vH/U5+c6DA+zu7GNGVT4Zp3EMoq27j1c272XX/kO0B+He3tNP+4F+2rr7yUg3KguyqSzIobIw+8jzvsEI7zZ38c7OTvYe6AcgM924oKaIP7qkhtsvrqEoV9NH48mYhL6ZZQL/Djzv7t8N2taj6R1JMRtae/i7367HHT5//QwaJhUmu6SkcHdauvp4r7mTd5u7eHF9O2tbusnOSOOWC6q5+9I6Pji19LjjIHsP9LNhTw+b9/ZSnJvJjKoCppZPOKPrOXoHIuzvHaCjd4D9Bwfo7Y+Qn5NBUW4mhYcfczPH9AD8gf4h/uO93byxdT8fOK+Ya2dVUl00Pi8GHIsDuQY8Ruyg7efj2r8D7Is7kFvq7l82sznAL3j/QO5SoF4HckXOXe7Oql3dPNm0g2ff3k1P/xBTyvL46EWT6Do0yPo9PWxsO0BHcI1FvPQ0Y0pZHjOqYmcp5WVnsP/gAF0HB9l/cIDOg4N0HYo93987eOSA9qkUZGdw8wUT+dSVU5ldfeJfzkORKEvWtPKz5dvp7hvkg1PLuGJ6GZdNLT3qr5Zo1Hl9awf/smInz63cw6HBCAU5GUeOC82uLgwOrldyUW0x6eNkOZGxCP2rgJeBlcRO2QT4a+B14ClgMrADuMvdO4J9vgp8GhgCPu/uz53qexT6IueGQwMRnlvVwpNv7uT1rR3kZ2dQX5XPzKoC6qsKmFGVz/SKfDoPDrKxrYcNrT1sbD3AxrYDbN/XS9RjU0bFeVmU5GVSnJtFcV4mxXmZlE7IpnRCJsV5WZTmZVEyIYsJ2en09g/RdSj2y6H7UOz5jo6D/Md7LRwajDB/ehmfvnIq186qPLK2U+fBAZ54cyf/tHw7uzoPUVuSy+TSPFZs30//UJQ0gzmTirhiehm5mek883YzOzsOUZCdwW0XTeKuxlouqStmc/sBlq5tY+m694+tlE3I4uoZFXx4RgUfqi8f0XGb0aKLs0TkrDnQP8SErPTTviiuL1jlNe8M9jmZw8H+2KvbaOnqY0pZHn92+Xls2dvLM2810zcY5fJppXzqyqlcP7uK9DSjfyjC2zs6Wb55H8u37OOdHZ0MRqPMn17GXfPquGnOxBMek+k8OMCyDe38fl0bL23cS0fvAGZwYU0RH55RwdUzKqgoyKanbyj4GTyyMKEZVBbkUFWYTWVhDhX52ac97XUyCn0RCZ3BSJTnV+/h
x3/Yyts7OsnKSOOPLq7hk1dOOenUD8T+cukdGKL8DEfr0aizancXy9a3s2xDO2/t2M+Z3oeodEIWlQXZPP3Z+SM+VVahLyKhtrG1h7L8bEpPcP/osdJ1cJDlW/bS2x87FpCfk0FhTib52bEVaCNRP3KKbGt37Cyq1p4+9vb084//ed6I//I5UejragsRCYXh7hVxNhTlZbJgbvVJt6kszAGKzko9ukeuiEiIKPRFREJEoS8iEiIKfRGREFHoi4iEiEJfRCREFPoiIiGi0BcRCZFxf0WumbUDI72LSjmwdxTLOVeo3+GifofL6fb7PHevOLZx3Id+IsysabjLkFOd+h0u6ne4JNpvTe+IiISIQl9EJERSPfQfTnYBSaJ+h4v6HS4J9Tul5/RFRORoqT7SFxGROAp9EZEQScnQN7MFZrbezDaZ2f3JrmcsmdlPzKzNzFbFtd2pZyAAAALbSURBVJWa2RIz2xg8liSzxrFgZnVm9nszW2tmq83sc0F7SvfdzHLM7A0zezfo97eC9pTuN4CZpZvZ22b278HrlO8zgJltM7OVZvaOmTUFbSPue8qFvpmlA/8A3Aw0AB83s4bkVjWmHgUWHNN2P7DU3euBpcHrVDME/JW7zwYuB+4L/p1Tve/9wLXufhFwMbDAzC4n9fsN8DlgbdzrMPT5sI+4+8Vx5+ePuO8pF/rAZcAmd9/i7gPAE8AdSa5pzLj7S0DHMc13AI8Fzx8D7jyrRZ0F7t7i7m8Fz3uIhUENKd53jzkQvMwMfpwU77eZ1QK3Ao/ENad0n09hxH1PxdCvAXbGvW4O2sKkyt1bIBaOQGWS6xlTZjYFuAR4nRD0PZjmeAdoA5a4exj6/X3gy0A0ri3V+3yYA781sxVmdm/QNuK+p+KN0Ye7dbzOS01RZpYP/BL4vLt3mw33z59a3D0CXGxmxcC/mtncZNc0lszsNqDN3VeY2TXJricJrnT33WZWCSwxs3WJfFgqjvSbgbq417XA7iTVkiytZlYNEDy2JbmeMWFmmcQC/5/d/ZmgORR9B3D3TuBFYsd0UrnfVwK3m9k2YtO115rZz0ntPh/h7ruDxzbgX4lNYY+476kY+m8C9WY21cyygIXA4iTXdLYtBu4Jnt8DPJvEWsaExYb0PwbWuvt3495K6b6bWUUwwsfMcoHrgXWkcL/d/QF3r3X3KcT+f/6du/8ZKdznw8xsgpkVHH4O3AisIoG+p+QVuWZ2C7E5wHTgJ+6+KMkljRkzexy4hthyq63AN4BfAU8Bk4EdwF3ufuzB3nOamV0FvAys5P153r8mNq+fsn03swuJHbhLJzZoe8rd/4eZlZHC/T4smN75orvfFoY+m9k0YqN7iE3H/8LdFyXS95QMfRERGV4qTu+IiMgJKPRFREJEoS8iEiIKfRGREFHoi4iEiEJfRCREFPoiIiHy/wH54FHjdAj2sgAAAABJRU5ErkJggg==\n", 696 | "text/plain": [ 697 | "
" 698 | ] 699 | }, 700 | "metadata": { 701 | "needs_background": "light" 702 | }, 703 | "output_type": "display_data" 704 | } 705 | ], 706 | "source": [ 707 | "# -----------\n", 708 | "# Tested Cell\n", 709 | "# -----------\n", 710 | "# The contents of the cell will be tested by the autograder.\n", 711 | "# If they do not pass here, they will not pass there.\n", 712 | "\n", 713 | "np.random.seed(0)\n", 714 | "\n", 715 | "num_runs = 10\n", 716 | "num_episodes = 50\n", 717 | "env_info = {\"num_tiles\": 8, \"num_tilings\": 8}\n", 718 | "agent_info = {}\n", 719 | "all_steps = []\n", 720 | "\n", 721 | "agent = SarsaAgent\n", 722 | "env = mountaincar_env.Environment\n", 723 | "start = time.time()\n", 724 | "\n", 725 | "for run in range(num_runs):\n", 726 | " if run % 5 == 0:\n", 727 | " print(\"RUN: {}\".format(run))\n", 728 | "\n", 729 | " rl_glue = RLGlue(env, agent)\n", 730 | " rl_glue.rl_init(agent_info, env_info)\n", 731 | " steps_per_episode = []\n", 732 | "\n", 733 | " for episode in range(num_episodes):\n", 734 | " rl_glue.rl_episode(15000)\n", 735 | " steps_per_episode.append(rl_glue.num_steps)\n", 736 | "\n", 737 | " all_steps.append(np.array(steps_per_episode))\n", 738 | "\n", 739 | "print(\"Run time: {}\".format(time.time() - start))\n", 740 | "\n", 741 | "mean = np.mean(all_steps, axis=0)\n", 742 | "plt.plot(mean)\n", 743 | "\n", 744 | "# because we set the random seed, these values should be *exactly* the same\n", 745 | "assert np.allclose(mean, [1432.5, 837.9, 694.4, 571.4, 515.2, 380.6, 379.4, 369.6, 357.2, 316.5, 291.1, 305.3, 250.1, 264.9, 235.4, 242.1, 244.4, 245., 221.2, 229., 238.3, 211.2, 201.1, 208.3, 185.3, 207.1, 191.6, 204., 214.5, 207.9, 195.9, 206.4, 194.9, 191.1, 195., 186.6, 171., 177.8, 171.1, 174., 177.1, 174.5, 156.9, 174.3, 164.1, 179.3, 167.4, 156.1, 158.4, 154.4])" 746 | ] 747 | }, 748 | { 749 | "cell_type": "markdown", 750 | "metadata": { 751 | "deletable": false, 752 | "editable": false, 753 | "nbgrader": { 754 | "cell_type": "markdown", 755 | "checksum": "1decba9ad1d71bdd4a835f9ae46378dd", 756 | "grade": false, 757 | "grade_id": "cell-8178ac2e12418ca5", 758 | "locked": true, 759 | "schema_version": 3, 760 | "solution": false, 761 | "task": false 762 | } 763 | }, 764 | "source": [ 765 | "The learning rate of your agent should look similar to ours, though it will not look exactly the same.If there are some spikey points that is okay. Due to stochasticity, a few episodes may have taken much longer, causing some spikes in the plot. The trend of the line should be similar, though, generally decreasing to about 200 steps per run.\n", 766 | "![alt text](sarsa_agent_initial.png \"Logo Title Text 1\")" 767 | ] 768 | }, 769 | { 770 | "cell_type": "markdown", 771 | "metadata": { 772 | "deletable": false, 773 | "editable": false, 774 | "nbgrader": { 775 | "cell_type": "markdown", 776 | "checksum": "6fbb20239a03855fc6a21a05fbf1ddf5", 777 | "grade": false, 778 | "grade_id": "cell-f395294510618c9b", 779 | "locked": true, 780 | "schema_version": 3, 781 | "solution": false, 782 | "task": false 783 | } 784 | }, 785 | "source": [ 786 | "This result was using 8 tilings with 8x8 tiles on each. Let's see if we can do better, and what different tilings look like. We will also text 2 tilings of 16x16 and 4 tilings of 32x32. These three choices produce the same number of features (512), but distributed quite differently. 
" 787 | ] 788 | }, 789 | { 790 | "cell_type": "code", 791 | "execution_count": null, 792 | "metadata": { 793 | "deletable": false, 794 | "editable": false, 795 | "nbgrader": { 796 | "cell_type": "code", 797 | "checksum": "80a5fffe28f2e72b4745e265bbd276a7", 798 | "grade": false, 799 | "grade_id": "cell-f608c2e9a0d94727", 800 | "locked": true, 801 | "schema_version": 3, 802 | "solution": false, 803 | "task": false 804 | } 805 | }, 806 | "outputs": [ 807 | { 808 | "name": "stdout", 809 | "output_type": "stream", 810 | "text": [ 811 | "RUN: 0\n", 812 | "RUN: 5\n", 813 | "RUN: 10\n", 814 | "RUN: 15\n", 815 | "stepsize: 0.25\n", 816 | "Run Time: 44.90065026283264\n", 817 | "RUN: 0\n", 818 | "RUN: 5\n", 819 | "RUN: 10\n", 820 | "RUN: 15\n" 821 | ] 822 | } 823 | ], 824 | "source": [ 825 | "# ---------------\n", 826 | "# Discussion Cell\n", 827 | "# ---------------\n", 828 | "\n", 829 | "np.random.seed(0)\n", 830 | "\n", 831 | "# Compare the three\n", 832 | "num_runs = 20\n", 833 | "num_episodes = 100\n", 834 | "env_info = {}\n", 835 | "\n", 836 | "agent_runs = []\n", 837 | "# alphas = [0.2, 0.4, 0.5, 1.0]\n", 838 | "alphas = [0.5]\n", 839 | "agent_info_options = [{\"num_tiles\": 16, \"num_tilings\": 2, \"alpha\": 0.5},\n", 840 | " {\"num_tiles\": 4, \"num_tilings\": 32, \"alpha\": 0.5},\n", 841 | " {\"num_tiles\": 8, \"num_tilings\": 8, \"alpha\": 0.5}]\n", 842 | "agent_info_options = [{\"num_tiles\" : agent[\"num_tiles\"], \n", 843 | " \"num_tilings\": agent[\"num_tilings\"],\n", 844 | " \"alpha\" : alpha} for agent in agent_info_options for alpha in alphas]\n", 845 | "\n", 846 | "agent = SarsaAgent\n", 847 | "env = mountaincar_env.Environment\n", 848 | "for agent_info in agent_info_options:\n", 849 | " all_steps = []\n", 850 | " start = time.time()\n", 851 | " for run in range(num_runs):\n", 852 | " if run % 5 == 0:\n", 853 | " print(\"RUN: {}\".format(run))\n", 854 | " env = mountaincar_env.Environment\n", 855 | " \n", 856 | " rl_glue = RLGlue(env, agent)\n", 857 | " rl_glue.rl_init(agent_info, env_info)\n", 858 | " steps_per_episode = []\n", 859 | "\n", 860 | " for episode in range(num_episodes):\n", 861 | " rl_glue.rl_episode(15000)\n", 862 | " steps_per_episode.append(rl_glue.num_steps)\n", 863 | " all_steps.append(np.array(steps_per_episode))\n", 864 | " \n", 865 | " agent_runs.append(np.mean(np.array(all_steps), axis=0))\n", 866 | " print(\"stepsize:\", rl_glue.agent.alpha)\n", 867 | " print(\"Run Time: {}\".format(time.time() - start))\n", 868 | "\n", 869 | "plt.figure(figsize=(15, 10), dpi= 80, facecolor='w', edgecolor='k')\n", 870 | "plt.plot(np.array(agent_runs).T)\n", 871 | "plt.xlabel(\"Episode\")\n", 872 | "plt.ylabel(\"Steps Per Episode\")\n", 873 | "plt.yscale(\"linear\")\n", 874 | "plt.ylim(0, 1000)\n", 875 | "plt.legend([\"num_tiles: {}, num_tilings: {}, alpha: {}\".format(agent_info[\"num_tiles\"], \n", 876 | " agent_info[\"num_tilings\"],\n", 877 | " agent_info[\"alpha\"])\n", 878 | " for agent_info in agent_info_options])" 879 | ] 880 | }, 881 | { 882 | "cell_type": "markdown", 883 | "metadata": { 884 | "deletable": false, 885 | "editable": false, 886 | "nbgrader": { 887 | "cell_type": "markdown", 888 | "checksum": "bd4932e9dfc12e055b297632bd3e55a5", 889 | "grade": false, 890 | "grade_id": "cell-a9d6014459310d14", 891 | "locked": true, 892 | "schema_version": 3, 893 | "solution": false, 894 | "task": false 895 | } 896 | }, 897 | "source": [ 898 | "Here we can see that using 32 tilings and 4 x 4 tiles does a little better than 8 tilings with 8x8 tiles. 
Both seem to do much better than using 2 tilings, with 16 x 16 tiles." 899 | ] 900 | }, 901 | { 902 | "cell_type": "markdown", 903 | "metadata": { 904 | "deletable": false, 905 | "editable": false, 906 | "nbgrader": { 907 | "cell_type": "markdown", 908 | "checksum": "8bfe024cdf651a451f5be7e3e8df8325", 909 | "grade": false, 910 | "grade_id": "cell-b583918603d6925b", 911 | "locked": true, 912 | "schema_version": 3, 913 | "solution": false, 914 | "task": false 915 | } 916 | }, 917 | "source": [ 918 | "## Section 3: Conclusion" 919 | ] 920 | }, 921 | { 922 | "cell_type": "markdown", 923 | "metadata": { 924 | "deletable": false, 925 | "editable": false, 926 | "nbgrader": { 927 | "cell_type": "markdown", 928 | "checksum": "5b7683cc73139dd200ecf0dd1c2b6f4e", 929 | "grade": false, 930 | "grade_id": "cell-d15725ba24684800", 931 | "locked": true, 932 | "schema_version": 3, 933 | "solution": false, 934 | "task": false 935 | } 936 | }, 937 | "source": [ 938 | "Congratulations! You have learned how to implement a control agent using function approximation. In this notebook you learned how to:\n", 939 | "\n", 940 | "- Use function approximation in the control setting\n", 941 | "- Implement the Sarsa algorithm using tile coding\n", 942 | "- Compare three settings for tile coding to see their effect on our agent" 943 | ] 944 | } 945 | ], 946 | "metadata": { 947 | "kernelspec": { 948 | "display_name": "Python 3", 949 | "language": "python", 950 | "name": "python3" 951 | }, 952 | "language_info": { 953 | "codemirror_mode": { 954 | "name": "ipython", 955 | "version": 3 956 | }, 957 | "file_extension": ".py", 958 | "mimetype": "text/x-python", 959 | "name": "python", 960 | "nbconvert_exporter": "python", 961 | "pygments_lexer": "ipython3", 962 | "version": "3.7.6" 963 | } 964 | }, 965 | "nbformat": 4, 966 | "nbformat_minor": 2 967 | } 968 | -------------------------------------------------------------------------------- /Implement your agent.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "deletable": false, 7 | "editable": false, 8 | "nbgrader": { 9 | "cell_type": "markdown", 10 | "checksum": "355c6ee07b77d9504f86561bbc59d831", 11 | "grade": false, 12 | "grade_id": "cell-f96c128874bfc5b3", 13 | "locked": true, 14 | "schema_version": 3, 15 | "solution": false, 16 | "task": false 17 | } 18 | }, 19 | "source": [ 20 | "# Assignment 2 - Implement your agent\n", 21 | "\n", 22 | "Welcome to Course 4, Programming Assignment 2! We have learned about reinforcement learning algorithms for prediction and control in previous courses and extended those algorithms to large state spaces using function approximation. One example of this was in assignment 2 of course 3 where we implemented semi-gradient TD for prediction and used a neural network as the function approximator. In this notebook, we will build a reinforcement learning agent for control, again using a neural network for function approximation. This combination of neural network function approximators and reinforcement learning algorithms, often referred to as Deep RL, is an active area of research and has led to many impressive results (e. g., AlphaGo: https://deepmind.com/research/case-studies/alphago-the-story-so-far).\n", 23 | "\n", 24 | "**In this assignment, you will:**\n", 25 | " 1. Extend the neural network code from assignment 2 of course 3 to output action-values instead of state-values.\n", 26 | " 2. 
Implement the Adam algorithm for neural network optimization.\n", 27 | " 3. Understand experience replay buffers.\n", 28 | " 4. Implement Softmax action-selection.\n", 29 | " 5. Build an Expected Sarsa agent by putting all the pieces together.\n", 30 | " 6. Solve Lunar Lander with your agent." 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": { 36 | "deletable": false, 37 | "editable": false, 38 | "nbgrader": { 39 | "cell_type": "markdown", 40 | "checksum": "a942b1d80536a6e097a5b7879f8b28d3", 41 | "grade": false, 42 | "grade_id": "cell-9524f4b3df469bab", 43 | "locked": true, 44 | "schema_version": 3, 45 | "solution": false, 46 | "task": false 47 | } 48 | }, 49 | "source": [ 50 | "## Packages\n", 51 | "- [numpy](www.numpy.org) : Fundamental package for scientific computing with Python.\n", 52 | "- [matplotlib](http://matplotlib.org) : Library for plotting graphs in Python.\n", 53 | "- [RL-Glue](http://www.jmlr.org/papers/v10/tanner09a.html), BaseEnvironment, BaseAgent : Library and abstract classes to inherit from for reinforcement learning experiments.\n", 54 | "- [LunarLanderEnvironment](https://gym.openai.com/envs/LunarLander-v2/) : An RLGlue environment that wraps a LunarLander environment implementation from OpenAI Gym.\n", 55 | "- [collections.deque](https://docs.python.org/3/library/collections.html#collections.deque): A double-ended queue implementation. We use deque to implement the experience replay buffer.\n", 56 | "- [copy.deepcopy](https://docs.python.org/3/library/copy.html#copy.deepcopy): As objects are not passed by value in Python, we often need to make copies of mutable objects. copy.deepcopy allows us to make a new object with the same contents as another object. (Take a look at this link if you are interested to learn more: https://robertheaton.com/2014/02/09/pythons-pass-by-object-reference-as-explained-by-philip-k-dick/)\n", 57 | "- [tqdm](https://github.com/tqdm/tqdm) : A package to display a progress bar when running experiments.\n", 58 | "- [os](https://docs.python.org/3/library/os.html): Package used to interface with the operating system. Here we use it for creating a results folder when it does not exist.\n", 59 | "- [shutil](https://docs.python.org/3/library/shutil.html): Package used to operate on files and folders. Here we use it for creating a zip file of the results folder.\n", 60 | "- plot_script: Used for plotting learning curves using matplotlib."
61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": 1, 66 | "metadata": { 67 | "deletable": false, 68 | "editable": false, 69 | "nbgrader": { 70 | "cell_type": "code", 71 | "checksum": "1a16d3a2b4b78bd0c8a054524d667d1c", 72 | "grade": false, 73 | "grade_id": "cell-3a093c227c1a8513", 74 | "locked": true, 75 | "schema_version": 3, 76 | "solution": false, 77 | "task": false 78 | } 79 | }, 80 | "outputs": [], 81 | "source": [ 82 | "# Do not modify this cell!\n", 83 | "\n", 84 | "# Import necessary libraries\n", 85 | "# DO NOT IMPORT OTHER LIBRARIES - This will break the autograder.\n", 86 | "import numpy as np\n", 87 | "import matplotlib.pyplot as plt\n", 88 | "%matplotlib inline\n", 89 | "\n", 90 | "from rl_glue import RLGlue\n", 91 | "from environment import BaseEnvironment\n", 92 | "from lunar_lander import LunarLanderEnvironment\n", 93 | "from agent import BaseAgent\n", 94 | "from collections import deque\n", 95 | "from copy import deepcopy\n", 96 | "from tqdm import tqdm\n", 97 | "import os \n", 98 | "import shutil\n", 99 | "from plot_script import plot_result" 100 | ] 101 | }, 102 | { 103 | "cell_type": "markdown", 104 | "metadata": { 105 | "deletable": false, 106 | "editable": false, 107 | "nbgrader": { 108 | "cell_type": "markdown", 109 | "checksum": "833da49e040139194c4c5e7c68b23bee", 110 | "grade": false, 111 | "grade_id": "cell-c1f6c6471017fd99", 112 | "locked": true, 113 | "schema_version": 3, 114 | "solution": false, 115 | "task": false 116 | } 117 | }, 118 | "source": [ 119 | "## Section 1: Action-Value Network\n", 120 | "This section includes the function approximator that we use in our agent, a neural network. In Course 3 Assignment 2, we used a neural network as the function approximator for a policy evaluation problem. In this assignment, we will use a neural network for approximating the action-value function in a control problem. The main difference between approximating a state-value function and an action-value function using a neural network is that in the former the output layer only includes one unit whereas in the latter the output layer includes as many units as the number of actions. \n", 121 | "\n", 122 | "In the cell below, you will specify the architecture of the action-value neural network. More specifically, you will specify `self.layer_sizes` in the `__init__()` function. \n", 123 | "\n", 124 | "We have already provided `get_action_values()` and `get_TD_update()` methods. The former computes the action-value function by doing a forward pass and the latter computes the gradient of the action-value function with respect to the weights times the TD error. These `get_action_values()` and `get_TD_update()` methods are similar to the `get_value()` and `get_gradient()` methods that you implemented in Course 3 Assignment 2. The main difference is that in this notebook, they are designed to be applied to batches of states instead of one state. You will later use these functions for implementing the agent." 
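The main practical difference from the prediction setting is the shape convention: `get_action_values()` returns an array of shape (batch_size, num_actions), and the `delta_mat` argument of `get_TD_update()` has the same shape with a single non-zero entry per row, namely the TD error in the column of the action that was taken. A minimal sketch with made-up numbers (illustrative only, not part of the graded cells):

```python
import numpy as np

batch_size, num_actions = 2, 3

# Hypothetical action-values for a batch of two states: shape (batch_size, num_actions)
q_vals = np.array([[1.0, 2.0, 0.5],
                   [0.3, 0.1, 0.9]])

actions = np.array([1, 2])         # action taken in each sampled transition
td_errors = np.array([0.7, -0.2])  # one scalar TD error per transition

# delta_mat is zero everywhere except the taken action's column in each row
delta_mat = np.zeros((batch_size, num_actions))
delta_mat[np.arange(batch_size), actions] = td_errors
# delta_mat -> [[0.0, 0.7,  0.0],
#               [0.0, 0.0, -0.2]]
```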
125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "execution_count": 6, 130 | "metadata": { 131 | "deletable": false, 132 | "nbgrader": { 133 | "cell_type": "code", 134 | "checksum": "d10feeabf000214a0f53c5dfc5812437", 135 | "grade": false, 136 | "grade_id": "cell-e6d82e74c686dbf5", 137 | "locked": false, 138 | "schema_version": 3, 139 | "solution": true, 140 | "task": false 141 | } 142 | }, 143 | "outputs": [], 144 | "source": [ 145 | "# -----------\n", 146 | "# Graded Cell\n", 147 | "# -----------\n", 148 | "\n", 149 | "# Work Required: Yes. Fill in the code for layer_sizes in __init__ (~1 Line). \n", 150 | "# Also go through the rest of the code to ensure your understanding is correct.\n", 151 | "class ActionValueNetwork:\n", 152 | " # Work Required: Yes. Fill in the layer_sizes member variable (~1 Line).\n", 153 | " def __init__(self, network_config):\n", 154 | " self.state_dim = network_config.get(\"state_dim\")\n", 155 | " self.num_hidden_units = network_config.get(\"num_hidden_units\")\n", 156 | " self.num_actions = network_config.get(\"num_actions\")\n", 157 | " \n", 158 | " self.rand_generator = np.random.RandomState(network_config.get(\"seed\"))\n", 159 | " \n", 160 | " # Specify self.layer_size which shows the number of nodes in each layer\n", 161 | " ### START CODE HERE (~1 Line)\n", 162 | " self.layer_sizes = [self.state_dim,self.num_hidden_units,self.num_actions]\n", 163 | " ### END CODE HERE\n", 164 | " \n", 165 | " # Initialize the weights of the neural network\n", 166 | " # self.weights is an array of dictionaries with each dictionary corresponding to \n", 167 | " # the weights from one layer to the next. Each dictionary includes W and b\n", 168 | " self.weights = [dict() for i in range(0, len(self.layer_sizes) - 1)]\n", 169 | " for i in range(0, len(self.layer_sizes) - 1):\n", 170 | " self.weights[i]['W'] = self.init_saxe(self.layer_sizes[i], self.layer_sizes[i + 1])\n", 171 | " self.weights[i]['b'] = np.zeros((1, self.layer_sizes[i + 1]))\n", 172 | " \n", 173 | " # Work Required: No.\n", 174 | " def get_action_values(self, s):\n", 175 | " \"\"\"\n", 176 | " Args:\n", 177 | " s (Numpy array): The state.\n", 178 | " Returns:\n", 179 | " The action-values (Numpy array) calculated using the network's weights.\n", 180 | " \"\"\"\n", 181 | " \n", 182 | " W0, b0 = self.weights[0]['W'], self.weights[0]['b']\n", 183 | " psi = np.dot(s, W0) + b0\n", 184 | " x = np.maximum(psi, 0)\n", 185 | " \n", 186 | " W1, b1 = self.weights[1]['W'], self.weights[1]['b']\n", 187 | " q_vals = np.dot(x, W1) + b1\n", 188 | "\n", 189 | " return q_vals\n", 190 | " \n", 191 | " # Work Required: No.\n", 192 | " def get_TD_update(self, s, delta_mat):\n", 193 | " \"\"\"\n", 194 | " Args:\n", 195 | " s (Numpy array): The state.\n", 196 | " delta_mat (Numpy array): A 2D array of shape (batch_size, num_actions). Each row of delta_mat \n", 197 | " correspond to one state in the batch. 
Each row has only one non-zero element \n", 198 | " which is the TD-error corresponding to the action taken.\n", 199 | " Returns:\n", 200 | " The TD update (Array of dictionaries with gradient times TD errors) for the network's weights\n", 201 | " \"\"\"\n", 202 | "\n", 203 | " W0, b0 = self.weights[0]['W'], self.weights[0]['b']\n", 204 | " W1, b1 = self.weights[1]['W'], self.weights[1]['b']\n", 205 | " \n", 206 | " psi = np.dot(s, W0) + b0\n", 207 | " x = np.maximum(psi, 0)\n", 208 | " dx = (psi > 0).astype(float)\n", 209 | "\n", 210 | " # td_update has the same structure as self.weights, that is an array of dictionaries.\n", 211 | " # td_update[0][\"W\"], td_update[0][\"b\"], td_update[1][\"W\"], and td_update[1][\"b\"] have the same shape as \n", 212 | " # self.weights[0][\"W\"], self.weights[0][\"b\"], self.weights[1][\"W\"], and self.weights[1][\"b\"] respectively\n", 213 | " td_update = [dict() for i in range(len(self.weights))]\n", 214 | " \n", 215 | " v = delta_mat\n", 216 | " td_update[1]['W'] = np.dot(x.T, v) * 1. / s.shape[0]\n", 217 | " td_update[1]['b'] = np.sum(v, axis=0, keepdims=True) * 1. / s.shape[0]\n", 218 | " \n", 219 | " v = np.dot(v, W1.T) * dx\n", 220 | " td_update[0]['W'] = np.dot(s.T, v) * 1. / s.shape[0]\n", 221 | " td_update[0]['b'] = np.sum(v, axis=0, keepdims=True) * 1. / s.shape[0]\n", 222 | " \n", 223 | " return td_update\n", 224 | " \n", 225 | " # Work Required: No. You may wish to read the relevant paper for more information on this weight initialization\n", 226 | " # (Exact solutions to the nonlinear dynamics of learning in deep linear neural networks by Saxe, A et al., 2013)\n", 227 | " def init_saxe(self, rows, cols):\n", 228 | " \"\"\"\n", 229 | " Args:\n", 230 | " rows (int): number of input units for layer.\n", 231 | " cols (int): number of output units for layer.\n", 232 | " Returns:\n", 233 | " NumPy Array consisting of weights for the layer based on the initialization in Saxe et al.\n", 234 | " \"\"\"\n", 235 | " tensor = self.rand_generator.normal(0, 1, (rows, cols))\n", 236 | " if rows < cols:\n", 237 | " tensor = tensor.T\n", 238 | " tensor, r = np.linalg.qr(tensor)\n", 239 | " d = np.diag(r, 0)\n", 240 | " ph = np.sign(d)\n", 241 | " tensor *= ph\n", 242 | "\n", 243 | " if rows < cols:\n", 244 | " tensor = tensor.T\n", 245 | " return tensor\n", 246 | " \n", 247 | " # Work Required: No.\n", 248 | " def get_weights(self):\n", 249 | " \"\"\"\n", 250 | " Returns: \n", 251 | " A copy of the current weights of this network.\n", 252 | " \"\"\"\n", 253 | " return deepcopy(self.weights)\n", 254 | " \n", 255 | " # Work Required: No.\n", 256 | " def set_weights(self, weights):\n", 257 | " \"\"\"\n", 258 | " Args: \n", 259 | " weights (list of dictionaries): Consists of weights that this network will set as its own weights.\n", 260 | " \"\"\"\n", 261 | " self.weights = deepcopy(weights)" 262 | ] 263 | }, 264 | { 265 | "cell_type": "markdown", 266 | "metadata": { 267 | "deletable": false, 268 | "editable": false, 269 | "nbgrader": { 270 | "cell_type": "markdown", 271 | "checksum": "e56b7c14735e136a6aa4bfb968f48013", 272 | "grade": false, 273 | "grade_id": "cell-09cdc118d2f5951c", 274 | "locked": true, 275 | "schema_version": 3, 276 | "solution": false, 277 | "task": false 278 | } 279 | }, 280 | "source": [ 281 | "Run the cell below to test your implementation of the `__init__()` function for ActionValueNetwork:" 282 | ] 283 | }, 284 | { 285 | "cell_type": "code", 286 | "execution_count": 7, 287 | "metadata": {}, 288 | "outputs": [ 289 | { 290 | "name": 
"stdout", 291 | "output_type": "stream", 292 | "text": [ 293 | "layer_sizes: [5, 20, 3]\n" 294 | ] 295 | } 296 | ], 297 | "source": [ 298 | "# --------------\n", 299 | "# Debugging Cell\n", 300 | "# --------------\n", 301 | "# Feel free to make any changes to this cell to debug your code\n", 302 | "\n", 303 | "network_config = {\n", 304 | " \"state_dim\": 5,\n", 305 | " \"num_hidden_units\": 20,\n", 306 | " \"num_actions\": 3\n", 307 | "}\n", 308 | "\n", 309 | "test_network = ActionValueNetwork(network_config)\n", 310 | "print(\"layer_sizes:\", test_network.layer_sizes)\n", 311 | "assert(np.allclose(test_network.layer_sizes, np.array([5, 20, 3])))" 312 | ] 313 | }, 314 | { 315 | "cell_type": "code", 316 | "execution_count": 8, 317 | "metadata": { 318 | "deletable": false, 319 | "editable": false, 320 | "nbgrader": { 321 | "cell_type": "code", 322 | "checksum": "60e5d798e80eba541ad2862a00d1571f", 323 | "grade": true, 324 | "grade_id": "cell-49a0cb79ea0e45ea", 325 | "locked": true, 326 | "points": 5, 327 | "schema_version": 3, 328 | "solution": false, 329 | "task": false 330 | } 331 | }, 332 | "outputs": [], 333 | "source": [ 334 | "# -----------\n", 335 | "# Tested Cell\n", 336 | "# -----------\n", 337 | "# The contents of the cell will be tested by the autograder.\n", 338 | "# If they do not pass here, they will not pass there.\n", 339 | "\n", 340 | "rand_generator = np.random.RandomState(0)\n", 341 | "for _ in range(1000):\n", 342 | " network_config = {\n", 343 | " \"state_dim\": rand_generator.randint(2, 10),\n", 344 | " \"num_hidden_units\": rand_generator.randint(2, 1024),\n", 345 | " \"num_actions\": rand_generator.randint(2, 10)\n", 346 | " }\n", 347 | "\n", 348 | " test_network = ActionValueNetwork(network_config)\n", 349 | "\n", 350 | " assert(np.allclose(test_network.layer_sizes, np.array([network_config[\"state_dim\"], \n", 351 | " network_config[\"num_hidden_units\"], \n", 352 | " network_config[\"num_actions\"]])))" 353 | ] 354 | }, 355 | { 356 | "cell_type": "markdown", 357 | "metadata": { 358 | "deletable": false, 359 | "editable": false, 360 | "nbgrader": { 361 | "cell_type": "markdown", 362 | "checksum": "43dd3d66aa92d0ba9b560bac155dbe14", 363 | "grade": false, 364 | "grade_id": "cell-169bc96641d22305", 365 | "locked": true, 366 | "schema_version": 3, 367 | "solution": false, 368 | "task": false 369 | } 370 | }, 371 | "source": [ 372 | "**Expected output:** (assuming no changes to the debugging cell)\n", 373 | "\n", 374 | " layer_sizes: [ 5 20 3]" 375 | ] 376 | }, 377 | { 378 | "cell_type": "markdown", 379 | "metadata": { 380 | "deletable": false, 381 | "editable": false, 382 | "nbgrader": { 383 | "cell_type": "markdown", 384 | "checksum": "567638b2dd6adb9971de931d9992b4e1", 385 | "grade": false, 386 | "grade_id": "cell-9020651e057104f0", 387 | "locked": true, 388 | "schema_version": 3, 389 | "solution": false, 390 | "task": false 391 | } 392 | }, 393 | "source": [ 394 | "## Section 2: Adam Optimizer\n", 395 | "\n", 396 | "In this assignment, you will use the Adam algorithm for updating the weights of your action-value network. As you may remember from Course 3 Assignment 2, the Adam algorithm is a more advanced variant of stochastic gradient descent (SGD). The Adam algorithm improves the SGD update with two concepts: adaptive vector stepsizes and momentum. 
It keeps running estimates of the mean and second moment of the updates, denoted by $\\mathbf{m}$ and $\\mathbf{v}$ respectively:\n", 397 | "$$\\mathbf{m_t} = \\beta_m \\mathbf{m_{t-1}} + (1 - \\beta_m)g_t \\\\\n", 398 | "\\mathbf{v_t} = \\beta_v \\mathbf{v_{t-1}} + (1 - \\beta_v)g^2_t\n", 399 | "$$\n", 400 | "\n", 401 | "Here, $\\beta_m$ and $\\beta_v$ are fixed parameters controlling the linear combinations above and $g_t$ is the update at time $t$ (generally the gradients, but here the TD error times the gradients).\n", 402 | "\n", 403 | "Given that $\\mathbf{m}$ and $\\mathbf{v}$ are initialized to zero, they are biased toward zero. To get unbiased estimates of the mean and second moment, Adam defines $\\mathbf{\\hat{m}}$ and $\\mathbf{\\hat{v}}$ as:\n", 404 | "$$ \\mathbf{\\hat{m}_t} = \\frac{\\mathbf{m_t}}{1 - \\beta_m^t} \\\\\n", 405 | "\\mathbf{\\hat{v}_t} = \\frac{\\mathbf{v_t}}{1 - \\beta_v^t}\n", 406 | "$$\n", 407 | "\n", 408 | "The weights are then updated as follows:\n", 409 | "$$ \\mathbf{w_t} = \\mathbf{w_{t-1}} + \\frac{\\alpha}{\\sqrt{\\mathbf{\\hat{v}_t}}+\\epsilon} \\mathbf{\\hat{m}_t}\n", 410 | "$$\n", 411 | "\n", 412 | "Here, $\\alpha$ is the step size parameter and $\\epsilon$ is another small parameter to keep the denominator from being zero.\n", 413 | "\n", 414 | "In the cell below, you will implement the `__init__()` and `update_weights()` methods for the Adam algorithm. In `__init__()`, you will initialize `self.m` and `self.v`. In `update_weights()`, you will compute new weights given the input weights and an update $g$ (here `td_errors_times_gradients`) according to the equations above." 415 | ] 416 | }, 417 | { 418 | "cell_type": "code", 419 | "execution_count": 9, 420 | "metadata": { 421 | "deletable": false, 422 | "nbgrader": { 423 | "cell_type": "code", 424 | "checksum": "798d4618ba32342f63eb237947151a4a", 425 | "grade": false, 426 | "grade_id": "cell-585fd403a17cf660", 427 | "locked": false, 428 | "schema_version": 3, 429 | "solution": true, 430 | "task": false 431 | } 432 | }, 433 | "outputs": [], 434 | "source": [ 435 | "### Work Required: Yes. Fill in code in __init__ and update_weights (~9-11 Lines).\n", 436 | "class Adam():\n", 437 | " # Work Required: Yes. 
Fill in the initialization for self.m and self.v (~4 Lines).\n", 438 | " def __init__(self, layer_sizes, \n", 439 | " optimizer_info):\n", 440 | " self.layer_sizes = layer_sizes\n", 441 | "\n", 442 | " # Specify Adam algorithm's hyper parameters\n", 443 | " self.step_size = optimizer_info.get(\"step_size\")\n", 444 | " self.beta_m = optimizer_info.get(\"beta_m\")\n", 445 | " self.beta_v = optimizer_info.get(\"beta_v\")\n", 446 | " self.epsilon = optimizer_info.get(\"epsilon\")\n", 447 | " \n", 448 | " # Initialize Adam algorithm's m and v\n", 449 | " self.m = [dict() for i in range(1, len(self.layer_sizes))]\n", 450 | " self.v = [dict() for i in range(1, len(self.layer_sizes))]\n", 451 | " \n", 452 | " for i in range(0, len(self.layer_sizes) - 1):\n", 453 | " ### START CODE HERE (~4 Lines)\n", 454 | " # Hint: The initialization for m and v should look very much like the initializations of the weights\n", 455 | " # except for the fact that initialization here is to zeroes (see description above.)\n", 456 | " self.m[i][\"W\"] = np.zeros((self.layer_sizes[i], self.layer_sizes[i + 1]))\n", 457 | " self.m[i][\"b\"] = np.zeros((1, self.layer_sizes[i + 1]))\n", 458 | " self.v[i][\"W\"] = np.zeros((self.layer_sizes[i], self.layer_sizes[i + 1]))\n", 459 | " self.v[i][\"b\"] = np.zeros((1, self.layer_sizes[i + 1]))\n", 460 | " \n", 461 | " ### END CODE HERE\n", 462 | " \n", 463 | " # Notice that to calculate m_hat and v_hat, we use powers of beta_m and beta_v to \n", 464 | " # the time step t. We can calculate these powers using an incremental product. At initialization then, \n", 465 | " # beta_m_product and beta_v_product should be ...? (Note that timesteps start at 1 and if we were to \n", 466 | " # start from 0, the denominator would be 0.)\n", 467 | " self.beta_m_product = self.beta_m\n", 468 | " self.beta_v_product = self.beta_v\n", 469 | " \n", 470 | " # Work Required: Yes. Fill in the weight updates (~5-7 lines).\n", 471 | " def update_weights(self, weights, td_errors_times_gradients):\n", 472 | " \"\"\"\n", 473 | " Args:\n", 474 | " weights (Array of dictionaries): The weights of the neural network.\n", 475 | " td_errors_times_gradients (Array of dictionaries): The gradient of the \n", 476 | " action-values with respect to the network's weights times the TD-error\n", 477 | " Returns:\n", 478 | " The updated weights (Array of dictionaries).\n", 479 | " \"\"\"\n", 480 | " g=td_errors_times_gradients\n", 481 | " for i in range(len(weights)):\n", 482 | " for param in weights[i].keys():\n", 483 | " ### START CODE HERE (~5-7 Lines)\n", 484 | " # Hint: Follow the equations above. First, you should update m and v and then compute \n", 485 | " # m_hat and v_hat. 
Finally, compute how much the weights should be incremented by.\n", 486 | " # self.m[i][param] = None\n", 487 | " # self.v[i][param] = None\n", 488 | " # m_hat = None\n", 489 | " # v_hat = None\n", 490 | " ### update self.m and self.v\n", 491 | " self.m[i][param] = self.beta_m * self.m[i][param] + (1 - self.beta_m) * g[i][param]\n", 492 | " self.v[i][param] = self.beta_v * self.v[i][param] + (1 - self.beta_v) * (g[i][param] * g[i][param])\n", 493 | "\n", 494 | " ### compute m_hat and v_hat\n", 495 | " m_hat = self.m[i][param] / (1 - self.beta_m_product)\n", 496 | " v_hat = self.v[i][param] / (1 - self.beta_v_product)\n", 497 | "\n", 498 | " ### update weights\n", 499 | " weight_update = self.step_size * m_hat / (np.sqrt(v_hat) + self.epsilon)\n", 500 | " ### END CODE HERE\n", 501 | " \n", 502 | " weights[i][param] = weights[i][param] + weight_update\n", 503 | " # Notice that to calculate m_hat and v_hat, we use powers of beta_m and beta_v to \n", 504 | " ### update self.beta_m_product and self.beta_v_product\n", 505 | " self.beta_m_product *= self.beta_m\n", 506 | " self.beta_v_product *= self.beta_v\n", 507 | " \n", 508 | " return weights" 509 | ] 510 | }, 511 | { 512 | "cell_type": "markdown", 513 | "metadata": { 514 | "deletable": false, 515 | "editable": false, 516 | "nbgrader": { 517 | "cell_type": "markdown", 518 | "checksum": "e64f7418759c1b30b0ff0544f21d4ad0", 519 | "grade": false, 520 | "grade_id": "cell-779e4e90ee7ae5b8", 521 | "locked": true, 522 | "schema_version": 3, 523 | "solution": false, 524 | "task": false 525 | } 526 | }, 527 | "source": [ 528 | "Run the following code to test your implementation of the `__init__()` function:" 529 | ] 530 | }, 531 | { 532 | "cell_type": "code", 533 | "execution_count": 10, 534 | "metadata": {}, 535 | "outputs": [ 536 | { 537 | "name": "stdout", 538 | "output_type": "stream", 539 | "text": [ 540 | "m[0][\"W\"] shape: (5, 2)\n", 541 | "m[0][\"b\"] shape: (1, 2)\n", 542 | "m[1][\"W\"] shape: (2, 3)\n", 543 | "m[1][\"b\"] shape: (1, 3) \n", 544 | "\n", 545 | "v[0][\"W\"] shape: (5, 2)\n", 546 | "v[0][\"b\"] shape: (1, 2)\n", 547 | "v[1][\"W\"] shape: (2, 3)\n", 548 | "v[1][\"b\"] shape: (1, 3) \n", 549 | "\n" 550 | ] 551 | } 552 | ], 553 | "source": [ 554 | "# --------------\n", 555 | "# Debugging Cell\n", 556 | "# --------------\n", 557 | "# Feel free to make any changes to this cell to debug your code\n", 558 | "\n", 559 | "network_config = {\"state_dim\": 5,\n", 560 | " \"num_hidden_units\": 2,\n", 561 | " \"num_actions\": 3\n", 562 | " }\n", 563 | "\n", 564 | "optimizer_info = {\"step_size\": 0.1,\n", 565 | " \"beta_m\": 0.99,\n", 566 | " \"beta_v\": 0.999,\n", 567 | " \"epsilon\": 0.0001\n", 568 | " }\n", 569 | "\n", 570 | "network = ActionValueNetwork(network_config)\n", 571 | "test_adam = Adam(network.layer_sizes, optimizer_info)\n", 572 | "\n", 573 | "print(\"m[0][\\\"W\\\"] shape: {}\".format(test_adam.m[0][\"W\"].shape))\n", 574 | "print(\"m[0][\\\"b\\\"] shape: {}\".format(test_adam.m[0][\"b\"].shape))\n", 575 | "print(\"m[1][\\\"W\\\"] shape: {}\".format(test_adam.m[1][\"W\"].shape))\n", 576 | "print(\"m[1][\\\"b\\\"] shape: {}\".format(test_adam.m[1][\"b\"].shape), \"\\n\")\n", 577 | "\n", 578 | "assert(np.allclose(test_adam.m[0][\"W\"].shape, np.array([5, 2])))\n", 579 | "assert(np.allclose(test_adam.m[0][\"b\"].shape, np.array([1, 2])))\n", 580 | "assert(np.allclose(test_adam.m[1][\"W\"].shape, np.array([2, 3])))\n", 581 | "assert(np.allclose(test_adam.m[1][\"b\"].shape, np.array([1, 3])))\n", 582 | "\n", 583 | 
"print(\"v[0][\\\"W\\\"] shape: {}\".format(test_adam.v[0][\"W\"].shape))\n", 584 | "print(\"v[0][\\\"b\\\"] shape: {}\".format(test_adam.v[0][\"b\"].shape))\n", 585 | "print(\"v[1][\\\"W\\\"] shape: {}\".format(test_adam.v[1][\"W\"].shape))\n", 586 | "print(\"v[1][\\\"b\\\"] shape: {}\".format(test_adam.v[1][\"b\"].shape), \"\\n\")\n", 587 | "\n", 588 | "assert(np.allclose(test_adam.v[0][\"W\"].shape, np.array([5, 2])))\n", 589 | "assert(np.allclose(test_adam.v[0][\"b\"].shape, np.array([1, 2])))\n", 590 | "assert(np.allclose(test_adam.v[1][\"W\"].shape, np.array([2, 3])))\n", 591 | "assert(np.allclose(test_adam.v[1][\"b\"].shape, np.array([1, 3])))\n", 592 | "\n", 593 | "assert(np.all(test_adam.m[0][\"W\"]==0))\n", 594 | "assert(np.all(test_adam.m[0][\"b\"]==0))\n", 595 | "assert(np.all(test_adam.m[1][\"W\"]==0))\n", 596 | "assert(np.all(test_adam.m[1][\"b\"]==0))\n", 597 | "\n", 598 | "assert(np.all(test_adam.v[0][\"W\"]==0))\n", 599 | "assert(np.all(test_adam.v[0][\"b\"]==0))\n", 600 | "assert(np.all(test_adam.v[1][\"W\"]==0))\n", 601 | "assert(np.all(test_adam.v[1][\"b\"]==0))" 602 | ] 603 | }, 604 | { 605 | "cell_type": "code", 606 | "execution_count": 11, 607 | "metadata": { 608 | "deletable": false, 609 | "editable": false, 610 | "nbgrader": { 611 | "cell_type": "code", 612 | "checksum": "78b04676af63a17cd9ec3b207a18e1f1", 613 | "grade": true, 614 | "grade_id": "cell-32c93afdee106ad5", 615 | "locked": true, 616 | "points": 20, 617 | "schema_version": 3, 618 | "solution": false, 619 | "task": false 620 | } 621 | }, 622 | "outputs": [], 623 | "source": [ 624 | "# -----------\n", 625 | "# Tested Cell\n", 626 | "# -----------\n", 627 | "# The contents of the cell will be tested by the autograder.\n", 628 | "# If they do not pass here, they will not pass there.\n", 629 | "\n", 630 | "\n", 631 | "\n", 632 | "# import our implementation of Adam\n", 633 | "# while you can go look at this for the answer, try to solve the programming challenge yourself first\n", 634 | "from tests import TrueAdam\n", 635 | "\n", 636 | "rand_generator = np.random.RandomState(0)\n", 637 | "for _ in range(1000):\n", 638 | " network_config = {\n", 639 | " \"state_dim\": rand_generator.randint(2, 10),\n", 640 | " \"num_hidden_units\": rand_generator.randint(2, 1024),\n", 641 | " \"num_actions\": rand_generator.randint(2, 10)\n", 642 | " }\n", 643 | " \n", 644 | " optimizer_info = {\"step_size\": rand_generator.choice(np.geomspace(0.1, 1e-5, num=5)),\n", 645 | " \"beta_m\": rand_generator.choice([0.9, 0.99, 0.999, 0.9999, 0.99999]),\n", 646 | " \"beta_v\": rand_generator.choice([0.9, 0.99, 0.999, 0.9999, 0.99999]),\n", 647 | " \"epsilon\": rand_generator.choice(np.geomspace(0.1, 1e-5, num=5))\n", 648 | " }\n", 649 | "\n", 650 | " test_network = ActionValueNetwork(network_config)\n", 651 | " test_adam = Adam(test_network.layer_sizes, optimizer_info)\n", 652 | " true_adam = TrueAdam(test_network.layer_sizes, optimizer_info)\n", 653 | " \n", 654 | " assert(np.allclose(test_adam.m[0][\"W\"].shape, true_adam.m[0][\"W\"].shape))\n", 655 | " assert(np.allclose(test_adam.m[0][\"b\"].shape, true_adam.m[0][\"b\"].shape))\n", 656 | " assert(np.allclose(test_adam.m[1][\"W\"].shape, true_adam.m[1][\"W\"].shape))\n", 657 | " assert(np.allclose(test_adam.m[1][\"b\"].shape, true_adam.m[1][\"b\"].shape))\n", 658 | "\n", 659 | " assert(np.allclose(test_adam.v[0][\"W\"].shape, true_adam.v[0][\"W\"].shape))\n", 660 | " assert(np.allclose(test_adam.v[0][\"b\"].shape, true_adam.v[0][\"b\"].shape))\n", 661 | " 
assert(np.allclose(test_adam.v[1][\"W\"].shape, true_adam.v[1][\"W\"].shape))\n", 662 | " assert(np.allclose(test_adam.v[1][\"b\"].shape, true_adam.v[1][\"b\"].shape))\n", 663 | "\n", 664 | " assert(np.all(test_adam.m[0][\"W\"]==0))\n", 665 | " assert(np.all(test_adam.m[0][\"b\"]==0))\n", 666 | " assert(np.all(test_adam.m[1][\"W\"]==0))\n", 667 | " assert(np.all(test_adam.m[1][\"b\"]==0))\n", 668 | "\n", 669 | " assert(np.all(test_adam.v[0][\"W\"]==0))\n", 670 | " assert(np.all(test_adam.v[0][\"b\"]==0))\n", 671 | " assert(np.all(test_adam.v[1][\"W\"]==0))\n", 672 | " assert(np.all(test_adam.v[1][\"b\"]==0))\n", 673 | " \n", 674 | " assert(test_adam.beta_m_product == optimizer_info[\"beta_m\"])\n", 675 | " assert(test_adam.beta_v_product == optimizer_info[\"beta_v\"])" 676 | ] 677 | }, 678 | { 679 | "cell_type": "markdown", 680 | "metadata": {}, 681 | "source": [ 682 | "**Expected output:**\n", 683 | "\n", 684 | " m[0][\"W\"] shape: (5, 2)\n", 685 | " m[0][\"b\"] shape: (1, 2)\n", 686 | " m[1][\"W\"] shape: (2, 3)\n", 687 | " m[1][\"b\"] shape: (1, 3) \n", 688 | "\n", 689 | " v[0][\"W\"] shape: (5, 2)\n", 690 | " v[0][\"b\"] shape: (1, 2)\n", 691 | " v[1][\"W\"] shape: (2, 3)\n", 692 | " v[1][\"b\"] shape: (1, 3) " 693 | ] 694 | }, 695 | { 696 | "cell_type": "markdown", 697 | "metadata": { 698 | "deletable": false, 699 | "editable": false, 700 | "nbgrader": { 701 | "cell_type": "markdown", 702 | "checksum": "1e2c3416db5fbd5cc08a6b34a59f3b57", 703 | "grade": false, 704 | "grade_id": "cell-070867fcd800c19d", 705 | "locked": true, 706 | "schema_version": 3, 707 | "solution": false, 708 | "task": false 709 | } 710 | }, 711 | "source": [ 712 | "## Section 3: Experience Replay Buffers\n", 713 | "\n", 714 | "In Course 3, you implemented agents that update value functions once for each sample. We can use a more efficient approach for updating value functions. You have seen an example of an efficient approach in Course 2 when implementing Dyna. The idea behind Dyna is to learn a model using sampled experience, obtain simulated experience from the model, and improve the value function using the simulated experience.\n", 715 | "\n", 716 | "Experience replay is a simple method that can get some of the advantages of Dyna by saving a buffer of experience and using the data stored in the buffer as a model. This view of prior data as a model works because the data represents actual transitions from the underlying MDP. Furthermore, as a side note, this kind of model that is not learned and simply a collection of experience can be called non-parametric as it can be ever-growing as opposed to a parametric model where the transitions are learned to be represented with a fixed set of parameters or weights.\n", 717 | "\n", 718 | "We have provided the implementation of the experience replay buffer in the cell below. ReplayBuffer includes two main functions: `append()` and `sample()`. `append()` adds an experience transition to the buffer as an array that includes the state, action, reward, terminal flag (indicating termination of the episode), and next_state. `sample()` gets a batch of experiences from the buffer with size `minibatch_size`.\n", 719 | "\n", 720 | "You will use the `append()` and `sample()` functions when implementing the agent." 
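Before the implementation below, a short usage sketch (illustrative only, with made-up states and sizes) of how `append()` and `sample()` are typically called:

```python
import numpy as np

# Assumes the ReplayBuffer class defined in the next cell
buffer = ReplayBuffer(size=100, minibatch_size=2, seed=0)

state, next_state = np.zeros(4), np.ones(4)
for t in range(5):
    # append(state, action, reward, terminal, next_state)
    buffer.append(state, t % 3, 1.0, 0, next_state)

# Only sample once enough transitions have been stored
if buffer.size() >= buffer.minibatch_size:
    experiences = buffer.sample()   # list of [s, a, r, terminal, s'] entries
    states, actions, rewards, terminals, next_states = map(list, zip(*experiences))
```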
721 | ] 722 | }, 723 | { 724 | "cell_type": "code", 725 | "execution_count": 12, 726 | "metadata": { 727 | "deletable": false, 728 | "editable": false, 729 | "nbgrader": { 730 | "cell_type": "code", 731 | "checksum": "dd216bf2169746f6331d6a5fbd79d605", 732 | "grade": false, 733 | "grade_id": "cell-1e1aaa0d442eb015", 734 | "locked": true, 735 | "schema_version": 3, 736 | "solution": false, 737 | "task": false 738 | } 739 | }, 740 | "outputs": [], 741 | "source": [ 742 | "# ---------------\n", 743 | "# Discussion Cell\n", 744 | "# ---------------\n", 745 | "\n", 746 | "class ReplayBuffer:\n", 747 | " def __init__(self, size, minibatch_size, seed):\n", 748 | " \"\"\"\n", 749 | " Args:\n", 750 | " size (integer): The size of the replay buffer. \n", 751 | " minibatch_size (integer): The sample size.\n", 752 | " seed (integer): The seed for the random number generator. \n", 753 | " \"\"\"\n", 754 | " self.buffer = []\n", 755 | " self.minibatch_size = minibatch_size\n", 756 | " self.rand_generator = np.random.RandomState(seed)\n", 757 | " self.max_size = size\n", 758 | "\n", 759 | " def append(self, state, action, reward, terminal, next_state):\n", 760 | " \"\"\"\n", 761 | " Args:\n", 762 | " state (Numpy array): The state. \n", 763 | " action (integer): The action.\n", 764 | " reward (float): The reward.\n", 765 | " terminal (integer): 1 if the next state is a terminal state and 0 otherwise.\n", 766 | " next_state (Numpy array): The next state. \n", 767 | " \"\"\"\n", 768 | " if len(self.buffer) == self.max_size:\n", 769 | " del self.buffer[0]\n", 770 | " self.buffer.append([state, action, reward, terminal, next_state])\n", 771 | "\n", 772 | " def sample(self):\n", 773 | " \"\"\"\n", 774 | " Returns:\n", 775 | " A list of transition tuples including state, action, reward, terinal, and next_state\n", 776 | " \"\"\"\n", 777 | " idxs = self.rand_generator.choice(np.arange(len(self.buffer)), size=self.minibatch_size)\n", 778 | " return [self.buffer[idx] for idx in idxs]\n", 779 | "\n", 780 | " def size(self):\n", 781 | " return len(self.buffer)" 782 | ] 783 | }, 784 | { 785 | "cell_type": "markdown", 786 | "metadata": { 787 | "deletable": false, 788 | "editable": false, 789 | "nbgrader": { 790 | "cell_type": "markdown", 791 | "checksum": "c28fd69bca4622a73612f68ac95d7826", 792 | "grade": false, 793 | "grade_id": "cell-21cc33a3ea0ac94b", 794 | "locked": true, 795 | "schema_version": 3, 796 | "solution": false, 797 | "task": false 798 | } 799 | }, 800 | "source": [ 801 | "## Section 4: Softmax Policy\n", 802 | "\n", 803 | "In this assignment, you will use a softmax policy. One advantage of a softmax policy is that it explores according to the action-values, meaning that an action with a moderate value has a higher chance of getting selected compared to an action with a lower value. Contrast this with an $\\epsilon$-greedy policy which does not consider the individual action values when choosing an exploratory action in a state and instead chooses randomly when doing so.\n", 804 | "\n", 805 | "The probability of selecting each action according to the softmax policy is shown below:\n", 806 | "$$Pr{(A_t=a | S_t=s)} \\hspace{0.1cm} \\dot{=} \\hspace{0.1cm} \\frac{e^{Q(s, a)/\\tau}}{\\sum_{b \\in A}e^{Q(s, b)/\\tau}}$$\n", 807 | "where $\\tau$ is the temperature parameter which controls how much the agent focuses on the highest valued actions. The smaller the temperature, the more the agent selects the greedy action. 
Conversely, when the temperature is high, the agent selects among actions more uniformly random.\n", 808 | "\n", 809 | "Given that a softmax policy exponentiates action values, if those values are large, exponentiating them could get very large. To implement the softmax policy in a numerically stable way, we often subtract the maximum action-value from the action-values. If we do so, the probability of selecting each action looks as follows:\n", 810 | "\n", 811 | "$$Pr{(A_t=a | S_t=s)} \\hspace{0.1cm} \\dot{=} \\hspace{0.1cm} \\frac{e^{Q(s, a)/\\tau - max_{c}Q(s, c)/\\tau}}{\\sum_{b \\in A}e^{Q(s, b)/\\tau - max_{c}Q(s, c)/\\tau}}$$\n", 812 | "\n", 813 | "In the cell below, you will implement the `softmax()` function. In order to do so, you could break the above computation into smaller steps:\n", 814 | "- compute the preference, $H(a)$, for taking each action by dividing the action-values by the temperature parameter $\\tau$,\n", 815 | "- subtract the maximum preference across the actions from the preferences to avoid overflow, and,\n", 816 | "- compute the probability of taking each action." 817 | ] 818 | }, 819 | { 820 | "cell_type": "code", 821 | "execution_count": 13, 822 | "metadata": { 823 | "deletable": false, 824 | "nbgrader": { 825 | "cell_type": "code", 826 | "checksum": "0bc082ff0d5b933fb88fa1936f2057d3", 827 | "grade": false, 828 | "grade_id": "cell-b32ebbeb60c5b9f7", 829 | "locked": false, 830 | "schema_version": 3, 831 | "solution": true, 832 | "task": false 833 | } 834 | }, 835 | "outputs": [], 836 | "source": [ 837 | "# -----------\n", 838 | "# Graded Cell\n", 839 | "# -----------\n", 840 | "\n", 841 | "def softmax(action_values, tau=1.0):\n", 842 | " \"\"\"\n", 843 | " Args:\n", 844 | " action_values (Numpy array): A 2D array of shape (batch_size, num_actions). \n", 845 | " The action-values computed by an action-value network. \n", 846 | " tau (float): The temperature parameter scalar.\n", 847 | " Returns:\n", 848 | " A 2D array of shape (batch_size, num_actions). Where each column is a probability distribution over\n", 849 | " the actions representing the policy.\n", 850 | " \"\"\"\n", 851 | " ### START CODE HERE (~2 Lines)\n", 852 | " # Compute the preferences by dividing the action-values by the temperature parameter tau\n", 853 | " preferences = action_values/tau\n", 854 | " # Compute the maximum preference across the actions\n", 855 | " max_preference = np.max(preferences,axis=1)\n", 856 | " ### END CODE HERE\n", 857 | " \n", 858 | " \n", 859 | " # Reshape max_preference array which has shape [Batch,] to [Batch, 1]. 
This allows NumPy broadcasting \n", 860 | " # when subtracting the maximum preference from the preference of each action.\n", 861 | " reshaped_max_preference = max_preference.reshape((-1, 1))\n", 862 | " \n", 863 | " ### START CODE HERE (~2 Lines)\n", 864 | " # Compute the numerator, i.e., the exponential of the preference - the max preference.\n", 865 | " exp_preferences = np.exp(preferences-reshaped_max_preference)\n", 866 | " # Compute the denominator, i.e., the sum over the numerator along the actions axis.\n", 867 | " sum_of_exp_preferences = np.sum(exp_preferences,axis=1)\n", 868 | " ### END CODE HERE\n", 869 | " \n", 870 | " \n", 871 | " # Reshape sum_of_exp_preferences array which has shape [Batch,] to [Batch, 1] to allow for NumPy broadcasting \n", 872 | " # when dividing the numerator by the denominator.\n", 873 | " reshaped_sum_of_exp_preferences = sum_of_exp_preferences.reshape((-1, 1))\n", 874 | " \n", 875 | " ### START CODE HERE (~1 Lines)\n", 876 | " # Compute the action probabilities according to the equation in the previous cell.\n", 877 | " \n", 878 | " action_probs = exp_preferences/reshaped_sum_of_exp_preferences\n", 879 | " ### END CODE HERE\n", 880 | " \n", 881 | " \n", 882 | " # squeeze() removes any singleton dimensions. It is used here because this function is used in the \n", 883 | " # agent policy when selecting an action (for which the batch dimension is 1.) As np.random.choice is used in \n", 884 | " # the agent policy and it expects 1D arrays, we need to remove this singleton batch dimension.\n", 885 | " action_probs = action_probs.squeeze()\n", 886 | " return action_probs" 887 | ] 888 | }, 889 | { 890 | "cell_type": "markdown", 891 | "metadata": { 892 | "deletable": false, 893 | "editable": false, 894 | "nbgrader": { 895 | "cell_type": "markdown", 896 | "checksum": "c88e2bfeb8b6915745bc8a136d818a2c", 897 | "grade": false, 898 | "grade_id": "cell-df0ee871ce60dea2", 899 | "locked": true, 900 | "schema_version": 3, 901 | "solution": false, 902 | "task": false 903 | } 904 | }, 905 | "source": [ 906 | "Run the cell below to test your implementation of the `softmax()` function:" 907 | ] 908 | }, 909 | { 910 | "cell_type": "code", 911 | "execution_count": 14, 912 | "metadata": {}, 913 | "outputs": [ 914 | { 915 | "name": "stdout", 916 | "output_type": "stream", 917 | "text": [ 918 | "action_probs [[0.25849645 0.01689625 0.05374514 0.67086216]\n", 919 | " [0.84699852 0.00286345 0.13520063 0.01493741]]\n", 920 | "Passed the asserts! (Note: These are however limited in scope, additional testing is encouraged.)\n" 921 | ] 922 | } 923 | ], 924 | "source": [ 925 | "# --------------\n", 926 | "# Debugging Cell\n", 927 | "# --------------\n", 928 | "# Feel free to make any changes to this cell to debug your code\n", 929 | "\n", 930 | "rand_generator = np.random.RandomState(0)\n", 931 | "action_values = rand_generator.normal(0, 1, (2, 4))\n", 932 | "tau = 0.5\n", 933 | "\n", 934 | "action_probs = softmax(action_values, tau)\n", 935 | "print(\"action_probs\", action_probs)\n", 936 | "\n", 937 | "assert(np.allclose(action_probs, np.array([\n", 938 | " [0.25849645, 0.01689625, 0.05374514, 0.67086216],\n", 939 | " [0.84699852, 0.00286345, 0.13520063, 0.01493741]\n", 940 | "])))\n", 941 | "\n", 942 | "print(\"Passed the asserts! 
(Note: These are however limited in scope, additional testing is encouraged.)\")" 943 | ] 944 | }, 945 | { 946 | "cell_type": "code", 947 | "execution_count": 15, 948 | "metadata": { 949 | "deletable": false, 950 | "editable": false, 951 | "nbgrader": { 952 | "cell_type": "code", 953 | "checksum": "72670a38cf2dcb8dd9c83293392bc3cd", 954 | "grade": true, 955 | "grade_id": "cell-ce689c1bd91bc11f", 956 | "locked": true, 957 | "points": 10, 958 | "schema_version": 3, 959 | "solution": false, 960 | "task": false 961 | } 962 | }, 963 | "outputs": [], 964 | "source": [ 965 | "# -----------\n", 966 | "# Tested Cell\n", 967 | "# -----------\n", 968 | "# The contents of the cell will be tested by the autograder.\n", 969 | "# If they do not pass here, they will not pass there.\n", 970 | "\n", 971 | "from tests import __true__softmax\n", 972 | "\n", 973 | "rand_generator = np.random.RandomState(0)\n", 974 | "for _ in range(1000):\n", 975 | " action_values = rand_generator.normal(0, 1, (rand_generator.randint(1, 5), 4))\n", 976 | " tau = rand_generator.rand()\n", 977 | " assert(np.allclose(softmax(action_values, tau), __true__softmax(action_values, tau)))" 978 | ] 979 | }, 980 | { 981 | "cell_type": "markdown", 982 | "metadata": { 983 | "deletable": false, 984 | "editable": false, 985 | "nbgrader": { 986 | "cell_type": "markdown", 987 | "checksum": "cbae924e257425842433cf6b49aa0128", 988 | "grade": false, 989 | "grade_id": "cell-af71b793e63f3db0", 990 | "locked": true, 991 | "schema_version": 3, 992 | "solution": false, 993 | "task": false 994 | } 995 | }, 996 | "source": [ 997 | "**Expected output:**\n", 998 | "\n", 999 | " action_probs [[0.25849645 0.01689625 0.05374514 0.67086216]\n", 1000 | " [0.84699852 0.00286345 0.13520063 0.01493741]]" 1001 | ] 1002 | }, 1003 | { 1004 | "cell_type": "markdown", 1005 | "metadata": { 1006 | "deletable": false, 1007 | "editable": false, 1008 | "nbgrader": { 1009 | "cell_type": "markdown", 1010 | "checksum": "9b74fe561e81cd10e366c2ad76673248", 1011 | "grade": false, 1012 | "grade_id": "cell-9d3660107222383d", 1013 | "locked": true, 1014 | "schema_version": 3, 1015 | "solution": false, 1016 | "task": false 1017 | } 1018 | }, 1019 | "source": [ 1020 | "## Section 5: Putting the pieces together\n", 1021 | "\n", 1022 | "In this section, you will combine components from the previous sections to write up an RL-Glue Agent. The main component that you will implement is the action-value network updates with experience sampled from the experience replay buffer.\n", 1023 | "\n", 1024 | "At time $t$, we have an action-value function represented as a neural network, say $Q_t$. We want to update our action-value function and get a new one we can use at the next timestep. We will get this $Q_{t+1}$ using multiple replay steps that each result in an intermediate action-value function $Q_{t+1}^{i}$ where $i$ indexes which replay step we are at.\n", 1025 | "\n", 1026 | "In each replay step, we sample a batch of experiences from the replay buffer and compute a minibatch Expected-SARSA update. Across these N replay steps, we will use the current \"un-updated\" action-value network at time $t$, $Q_t$, for computing the action-values of the next-states. This contrasts using the most recent action-values from the last replay step $Q_{t+1}^{i}$. We make this choice to have targets that are stable across replay steps. 
Here is the pseudocode for performing the updates:\n", 1027 | "\n", 1028 | "$$\n", 1029 | "\\begin{align}\n", 1030 | "& Q_t \\leftarrow \\text{action-value network at timestep t (current action-value network)}\\\\\n", 1031 | "& \\text{Initialize } Q_{t+1}^1 \\leftarrow Q_t\\\\\n", 1032 | "& \\text{For } i \\text{ in } [1, ..., N] \\text{ (i.e. N} \\text{ replay steps)}:\\\\\n", 1033 | "& \\hspace{1cm} s, a, r, t, s'\n", 1034 | "\\leftarrow \\text{Sample batch of experiences from experience replay buffer} \\\\\n", 1035 | "& \\hspace{1cm} \\text{Do Expected Sarsa update with } Q_t: Q_{t+1}^{i+1}(s, a) \\leftarrow Q_{t+1}^{i}(s, a) + \\alpha \\cdot \\left[r + \\gamma \\left(\\sum_{b} \\pi(b | s') Q_t(s', b)\\right) - Q_{t+1}^{i}(s, a)\\right]\\\\\n", 1036 | "& \\hspace{1.5cm} \\text{ making sure to add the } \\gamma \\left(\\sum_{b} \\pi(b | s') Q_t(s', b)\\right) \\text{ for non-terminal transitions only.} \\\\\n", 1037 | "& \\text{After N replay steps, we set } Q_{t+1}^{N+1} \\text{ as } Q_{t+1} \\text{ and have a new } Q_{t+1} \\text{ for time step } t + 1 \\text{ that we will fix in the next set of updates. }\n", 1038 | "\\end{align}\n", 1039 | "$$\n", 1040 | "\n", 1041 | "As you can see in the pseudocode, after sampling a batch of experiences, we do many computations. The basic idea, however, is that we are looking to compute a form of TD error. In order to do so, we can take the following steps:\n", 1042 | "- compute the action-values for the next states using the action-value network $Q_{t}$,\n", 1043 | "- compute the policy $\\pi(b | s')$ induced by the action-values $Q_{t}$ (using the softmax function you implemented before),\n", 1044 | "- compute the Expected Sarsa targets $r + \\gamma \\left(\\sum_{b} \\pi(b | s') Q_t(s', b)\\right)$,\n", 1045 | "- compute the action-values for the current states using the latest $Q_{t + 1}$, and,\n", 1046 | "- compute the TD-errors with the Expected Sarsa targets.\n", 1047 | " \n", 1048 | "For the third step above, you can start by computing $\\pi(b | s') Q_t(s', b)$ followed by summation to get $\\hat{v}_\\pi(s') = \\left(\\sum_{b} \\pi(b | s') Q_t(s', b)\\right)$. $\\hat{v}_\\pi(s')$ is an estimate of the value of the next state. Note that for terminal next states, $\\hat{v}_\\pi(s') = 0$. Finally, we add the rewards to the discount times $\\hat{v}_\\pi(s')$.\n", 1049 | "\n", 1050 | "You will implement these steps in the `get_td_error()` function below, which, given a batch of experiences (including states, next_states, actions, rewards, terminals), a fixed action-value network (current_q), and an action-value network (network), computes the TD error in the form of a 1D array of size batch_size." 1051 | ] 1052 | }, 1053 | { 1054 | "cell_type": "code", 1055 | "execution_count": 16, 1056 | "metadata": { 1057 | "deletable": false, 1058 | "nbgrader": { 1059 | "cell_type": "code", 1060 | "checksum": "7e8ff7f160ca26a7639acc062ae6b29a", 1061 | "grade": false, 1062 | "grade_id": "cell-f370691c828efad9", 1063 | "locked": false, 1064 | "schema_version": 3, 1065 | "solution": true, 1066 | "task": false 1067 | } 1068 | }, 1069 | "outputs": [], 1070 | "source": [ 1071 | "### Work Required: Yes. 
Fill in code in get_td_error (~9 Lines).\n", 1072 | "def get_td_error(states, next_states, actions, rewards, discount, terminals, network, current_q, tau):\n", 1073 | " \"\"\"\n", 1074 | " Args:\n", 1075 | " states (Numpy array): The batch of states with the shape (batch_size, state_dim).\n", 1076 | " next_states (Numpy array): The batch of next states with the shape (batch_size, state_dim).\n", 1077 | " actions (Numpy array): The batch of actions with the shape (batch_size,).\n", 1078 | " rewards (Numpy array): The batch of rewards with the shape (batch_size,).\n", 1079 | " discount (float): The discount factor.\n", 1080 | " terminals (Numpy array): The batch of terminals with the shape (batch_size,).\n", 1081 | " network (ActionValueNetwork): The latest state of the network that is getting replay updates.\n", 1082 | " current_q (ActionValueNetwork): The fixed network used for computing the targets, \n", 1083 | " and particularly, the action-values at the next-states.\n", 1084 | " Returns:\n", 1085 | " The TD errors (Numpy array) for actions taken, of shape (batch_size,)\n", 1086 | " \"\"\"\n", 1087 | " \n", 1088 | " # Note: Here network is the latest state of the network that is getting replay updates. In other words, \n", 1089 | " # the network represents Q_{t+1}^{i} whereas current_q represents Q_t, the fixed network used for computing the \n", 1090 | " # targets, and particularly, the action-values at the next-states.\n", 1091 | " \n", 1092 | " # Compute action values at next states using current_q network\n", 1093 | " # Note that q_next_mat is a 2D array of shape (batch_size, num_actions)\n", 1094 | " \n", 1095 | " ### START CODE HERE (~1 Line)\n", 1096 | "\n", 1097 | " q_next_mat = current_q.get_action_values(next_states)\n", 1098 | "\n", 1099 | " ### END CODE HERE\n", 1100 | " \n", 1101 | " # Compute policy at next state by passing the action-values in q_next_mat to softmax()\n", 1102 | " # Note that probs_mat is a 2D array of shape (batch_size, num_actions)\n", 1103 | " \n", 1104 | " ### START CODE HERE (~1 Line)\n", 1105 | " probs_mat = softmax(q_next_mat,tau=tau)\n", 1106 | " ### END CODE HERE\n", 1107 | " \n", 1108 | " # Compute the estimate of the next state value, v_next_vec.\n", 1109 | " # Hint: sum the action-values for the next_states weighted by the policy, probs_mat. Then, multiply by\n", 1110 | " # (1 - terminals) to make sure v_next_vec is zero for terminal next states.\n", 1111 | " # Note that v_next_vec is a 1D array of shape (batch_size,)\n", 1112 | " \n", 1113 | " ### START CODE HERE (~3 Lines)\n", 1114 | " v_next_vec = np.sum(q_next_mat*probs_mat,axis=1)*(1-terminals)\n", 1115 | " ### END CODE HERE\n", 1116 | " \n", 1117 | " # Compute Expected Sarsa target\n", 1118 | " # Note that target_vec is a 1D array of shape (batch_size,)\n", 1119 | " \n", 1120 | " ### START CODE HERE (~1 Line)\n", 1121 | " target_vec = rewards + discount * v_next_vec\n", 1122 | " ### END CODE HERE\n", 1123 | " \n", 1124 | " # Compute action values at the current states for all actions using network\n", 1125 | " # Note that q_mat is a 2D array of shape (batch_size, num_actions)\n", 1126 | " \n", 1127 | " ### START CODE HERE (~1 Line)\n", 1128 | " q_mat = network.get_action_values(states)\n", 1129 | " ### END CODE HERE\n", 1130 | " \n", 1131 | " # Batch Indices is an array from 0 to the batch size - 1. 
\n", 1132 | " batch_indices = np.arange(q_mat.shape[0])\n", 1133 | "\n", 1134 | " # Compute q_vec by selecting q(s, a) from q_mat for taken actions\n", 1135 | " # Use batch_indices as the index for the first dimension of q_mat\n", 1136 | " # Note that q_vec is a 1D array of shape (batch_size)\n", 1137 | " \n", 1138 | " ### START CODE HERE (~1 Line)\n", 1139 | " q_vec = [q_mat[i][actions[i]] for i in range(len(actions))]\n", 1140 | "\n", 1141 | " ### END CODE HERE\n", 1142 | " \n", 1143 | " # Compute TD errors for actions taken\n", 1144 | " # Note that delta_vec is a 1D array of shape (batch_size)\n", 1145 | " \n", 1146 | " ### START CODE HERE (~1 Line)\n", 1147 | " delta_vec = target_vec - q_vec\n", 1148 | " ### END CODE HERE\n", 1149 | "\n", 1150 | " return delta_vec" 1151 | ] 1152 | }, 1153 | { 1154 | "cell_type": "markdown", 1155 | "metadata": { 1156 | "deletable": false, 1157 | "editable": false, 1158 | "nbgrader": { 1159 | "cell_type": "markdown", 1160 | "checksum": "324b3f707db53cbe0e0182d494691fa4", 1161 | "grade": false, 1162 | "grade_id": "cell-07d3abd266be6559", 1163 | "locked": true, 1164 | "schema_version": 3, 1165 | "solution": false, 1166 | "task": false 1167 | } 1168 | }, 1169 | "source": [ 1170 | "Run the following code to test your implementation of the `get_td_error()` function:" 1171 | ] 1172 | }, 1173 | { 1174 | "cell_type": "code", 1175 | "execution_count": 17, 1176 | "metadata": {}, 1177 | "outputs": [ 1178 | { 1179 | "name": "stdout", 1180 | "output_type": "stream", 1181 | "text": [ 1182 | "Passed the asserts! (Note: These are however limited in scope, additional testing is encouraged.)\n" 1183 | ] 1184 | } 1185 | ], 1186 | "source": [ 1187 | "# --------------\n", 1188 | "# Debugging Cell\n", 1189 | "# --------------\n", 1190 | "# Feel free to make any changes to this cell to debug your code\n", 1191 | "\n", 1192 | "data = np.load(\"asserts/get_td_error_1.npz\", allow_pickle=True)\n", 1193 | "\n", 1194 | "states = data[\"states\"]\n", 1195 | "next_states = data[\"next_states\"]\n", 1196 | "actions = data[\"actions\"]\n", 1197 | "rewards = data[\"rewards\"]\n", 1198 | "discount = data[\"discount\"]\n", 1199 | "terminals = data[\"terminals\"]\n", 1200 | "tau = 0.001\n", 1201 | "\n", 1202 | "network_config = {\"state_dim\": 8,\n", 1203 | " \"num_hidden_units\": 512,\n", 1204 | " \"num_actions\": 4\n", 1205 | " }\n", 1206 | "\n", 1207 | "network = ActionValueNetwork(network_config)\n", 1208 | "network.set_weights(data[\"network_weights\"])\n", 1209 | "\n", 1210 | "current_q = ActionValueNetwork(network_config)\n", 1211 | "current_q.set_weights(data[\"current_q_weights\"])\n", 1212 | "\n", 1213 | "delta_vec = get_td_error(states, next_states, actions, rewards, discount, terminals, network, current_q, tau)\n", 1214 | "answer_delta_vec = data[\"delta_vec\"]\n", 1215 | "\n", 1216 | "assert(np.allclose(delta_vec, answer_delta_vec))\n", 1217 | "print(\"Passed the asserts! 
(Note: These are however limited in scope, additional testing is encouraged.)\")" 1218 | ] 1219 | }, 1220 | { 1221 | "cell_type": "code", 1222 | "execution_count": 18, 1223 | "metadata": { 1224 | "deletable": false, 1225 | "editable": false, 1226 | "nbgrader": { 1227 | "cell_type": "code", 1228 | "checksum": "de39bdc2c7e843db180faf962fac9572", 1229 | "grade": true, 1230 | "grade_id": "cell-6b4dccc113c7b5c9", 1231 | "locked": true, 1232 | "points": 20, 1233 | "schema_version": 3, 1234 | "solution": false, 1235 | "task": false 1236 | } 1237 | }, 1238 | "outputs": [], 1239 | "source": [ 1240 | "# -----------\n", 1241 | "# Tested Cell\n", 1242 | "# -----------\n", 1243 | "# The contents of the cell will be tested by the autograder.\n", 1244 | "# If they do not pass here, they will not pass there.\n", 1245 | "\n", 1246 | "data = np.load(\"asserts/get_td_error_1.npz\", allow_pickle=True)\n", 1247 | "\n", 1248 | "states = data[\"states\"]\n", 1249 | "next_states = data[\"next_states\"]\n", 1250 | "actions = data[\"actions\"]\n", 1251 | "rewards = data[\"rewards\"]\n", 1252 | "discount = data[\"discount\"]\n", 1253 | "terminals = data[\"terminals\"]\n", 1254 | "tau = 0.001\n", 1255 | "\n", 1256 | "network_config = {\"state_dim\": 8,\n", 1257 | " \"num_hidden_units\": 512,\n", 1258 | " \"num_actions\": 4\n", 1259 | " }\n", 1260 | "\n", 1261 | "network = ActionValueNetwork(network_config)\n", 1262 | "network.set_weights(data[\"network_weights\"])\n", 1263 | "\n", 1264 | "current_q = ActionValueNetwork(network_config)\n", 1265 | "current_q.set_weights(data[\"current_q_weights\"])\n", 1266 | "\n", 1267 | "delta_vec = get_td_error(states, next_states, actions, rewards, discount, terminals, network, current_q, tau)\n", 1268 | "answer_delta_vec = data[\"delta_vec\"]\n", 1269 | "\n", 1270 | "assert(np.allclose(delta_vec, answer_delta_vec))" 1271 | ] 1272 | }, 1273 | { 1274 | "cell_type": "markdown", 1275 | "metadata": { 1276 | "deletable": false, 1277 | "editable": false, 1278 | "nbgrader": { 1279 | "cell_type": "markdown", 1280 | "checksum": "be42a5cbc9a9a6b71d6bd29e4ff78984", 1281 | "grade": false, 1282 | "grade_id": "cell-68c8eca2519dd27d", 1283 | "locked": true, 1284 | "schema_version": 3, 1285 | "solution": false, 1286 | "task": false 1287 | } 1288 | }, 1289 | "source": [ 1290 | "Now that you implemented the `get_td_error()` function, you can use it to implement the `optimize_network()` function. In this function, you will:\n", 1291 | "- get the TD-errors vector from `get_td_error()`,\n", 1292 | "- make the TD-errors into a matrix using zeroes for actions not taken in the transitions,\n", 1293 | "- pass the TD-errors matrix to the `get_TD_update()` function of network to calculate the gradients times TD errors, and,\n", 1294 | "- perform an ADAM optimizer step." 1295 | ] 1296 | }, 1297 | { 1298 | "cell_type": "code", 1299 | "execution_count": 19, 1300 | "metadata": { 1301 | "deletable": false, 1302 | "nbgrader": { 1303 | "cell_type": "code", 1304 | "checksum": "5772dc09af7d47867f70baa8580057cf", 1305 | "grade": false, 1306 | "grade_id": "cell-2b9714cb6ee933de", 1307 | "locked": false, 1308 | "schema_version": 3, 1309 | "solution": true, 1310 | "task": false 1311 | } 1312 | }, 1313 | "outputs": [], 1314 | "source": [ 1315 | "# -----------\n", 1316 | "# Graded Cell\n", 1317 | "# -----------\n", 1318 | "\n", 1319 | "### Work Required: Yes. 
Fill in code in optimize_network (~2 Lines).\n", 1320 | "def optimize_network(experiences, discount, optimizer, network, current_q, tau):\n", 1321 | " \"\"\"\n", 1322 | " Args:\n", 1323 | " experiences (Numpy array): The batch of experiences including the states, actions, \n", 1324 | " rewards, terminals, and next_states.\n", 1325 | " discount (float): The discount factor.\n", 1326 | " network (ActionValueNetwork): The latest state of the network that is getting replay updates.\n", 1327 | " current_q (ActionValueNetwork): The fixed network used for computing the targets, \n", 1328 | " and particularly, the action-values at the next-states.\n", 1329 | " \"\"\"\n", 1330 | " \n", 1331 | " # Get states, action, rewards, terminals, and next_states from experiences\n", 1332 | " states, actions, rewards, terminals, next_states = map(list, zip(*experiences))\n", 1333 | " states = np.concatenate(states)\n", 1334 | " next_states = np.concatenate(next_states)\n", 1335 | " rewards = np.array(rewards)\n", 1336 | " terminals = np.array(terminals)\n", 1337 | " batch_size = states.shape[0]\n", 1338 | "\n", 1339 | " # Compute TD error using the get_td_error function\n", 1340 | " # Note that q_vec is a 1D array of shape (batch_size)\n", 1341 | " delta_vec = get_td_error(states, next_states, actions, rewards, discount, terminals, network, current_q, tau)\n", 1342 | "\n", 1343 | " # Batch Indices is an array from 0 to the batch_size - 1. \n", 1344 | " batch_indices = np.arange(batch_size)\n", 1345 | "\n", 1346 | " # Make a td error matrix of shape (batch_size, num_actions)\n", 1347 | " # delta_mat has non-zero value only for actions taken\n", 1348 | " delta_mat = np.zeros((batch_size, network.num_actions))\n", 1349 | " delta_mat[batch_indices, actions] = delta_vec\n", 1350 | "\n", 1351 | " # Pass delta_mat to compute the TD errors times the gradients of the network's weights from back-propagation\n", 1352 | " \n", 1353 | " ### START CODE HERE\n", 1354 | " td_update = network.get_TD_update(states, delta_mat)\n", 1355 | "\n", 1356 | " ### END CODE HERE\n", 1357 | " \n", 1358 | " # Pass network.get_weights and the td_update to the optimizer to get updated weights\n", 1359 | " ### START CODE HERE\n", 1360 | " weights = optimizer.update_weights(network.get_weights(),td_update)\n", 1361 | " ### END CODE HERE\n", 1362 | " \n", 1363 | " network.set_weights(weights)" 1364 | ] 1365 | }, 1366 | { 1367 | "cell_type": "markdown", 1368 | "metadata": { 1369 | "deletable": false, 1370 | "editable": false, 1371 | "nbgrader": { 1372 | "cell_type": "markdown", 1373 | "checksum": "f1510172c78a65fed30c285a935f29f4", 1374 | "grade": false, 1375 | "grade_id": "cell-dd47bfc5b0850596", 1376 | "locked": true, 1377 | "schema_version": 3, 1378 | "solution": false, 1379 | "task": false 1380 | } 1381 | }, 1382 | "source": [ 1383 | "Run the following code to test your implementation of the `optimize_network()` function:" 1384 | ] 1385 | }, 1386 | { 1387 | "cell_type": "code", 1388 | "execution_count": 20, 1389 | "metadata": { 1390 | "deletable": false, 1391 | "editable": false, 1392 | "nbgrader": { 1393 | "cell_type": "code", 1394 | "checksum": "288c5aae724dcb5ac4edc8c5ebec827d", 1395 | "grade": true, 1396 | "grade_id": "cell-2fcf8d08f7cbc3f2", 1397 | "locked": true, 1398 | "points": 10, 1399 | "schema_version": 3, 1400 | "solution": false, 1401 | "task": false 1402 | } 1403 | }, 1404 | "outputs": [], 1405 | "source": [ 1406 | "# -----------\n", 1407 | "# Tested Cell\n", 1408 | "# -----------\n", 1409 | "# The contents of the cell will be 
tested by the autograder.\n", 1410 | "# If they do not pass here, they will not pass there.\n", 1411 | "\n", 1412 | "input_data = np.load(\"asserts/optimize_network_input_1.npz\", allow_pickle=True)\n", 1413 | "\n", 1414 | "experiences = list(input_data[\"experiences\"])\n", 1415 | "discount = input_data[\"discount\"]\n", 1416 | "tau = 0.001\n", 1417 | "\n", 1418 | "network_config = {\"state_dim\": 8,\n", 1419 | " \"num_hidden_units\": 512,\n", 1420 | " \"num_actions\": 4\n", 1421 | " }\n", 1422 | "\n", 1423 | "network = ActionValueNetwork(network_config)\n", 1424 | "network.set_weights(input_data[\"network_weights\"])\n", 1425 | "\n", 1426 | "current_q = ActionValueNetwork(network_config)\n", 1427 | "current_q.set_weights(input_data[\"current_q_weights\"])\n", 1428 | "\n", 1429 | "optimizer_config = {'step_size': 3e-5, \n", 1430 | " 'beta_m': 0.9, \n", 1431 | " 'beta_v': 0.999,\n", 1432 | " 'epsilon': 1e-8\n", 1433 | " }\n", 1434 | "optimizer = Adam(network.layer_sizes, optimizer_config)\n", 1435 | "optimizer.m = input_data[\"optimizer_m\"]\n", 1436 | "optimizer.v = input_data[\"optimizer_v\"]\n", 1437 | "optimizer.beta_m_product = input_data[\"optimizer_beta_m_product\"]\n", 1438 | "optimizer.beta_v_product = input_data[\"optimizer_beta_v_product\"]\n", 1439 | "\n", 1440 | "optimize_network(experiences, discount, optimizer, network, current_q, tau)\n", 1441 | "updated_weights = network.get_weights()\n", 1442 | "\n", 1443 | "output_data = np.load(\"asserts/optimize_network_output_1.npz\", allow_pickle=True)\n", 1444 | "answer_updated_weights = output_data[\"updated_weights\"]\n", 1445 | "\n", 1446 | "assert(np.allclose(updated_weights[0][\"W\"], answer_updated_weights[0][\"W\"]))\n", 1447 | "assert(np.allclose(updated_weights[0][\"b\"], answer_updated_weights[0][\"b\"]))\n", 1448 | "assert(np.allclose(updated_weights[1][\"W\"], answer_updated_weights[1][\"W\"]))\n", 1449 | "assert(np.allclose(updated_weights[1][\"b\"], answer_updated_weights[1][\"b\"]))" 1450 | ] 1451 | }, 1452 | { 1453 | "cell_type": "markdown", 1454 | "metadata": { 1455 | "deletable": false, 1456 | "editable": false, 1457 | "nbgrader": { 1458 | "cell_type": "markdown", 1459 | "checksum": "298777333c00b402d67ac947c6cad456", 1460 | "grade": false, 1461 | "grade_id": "cell-66cff2d5725d31c3", 1462 | "locked": true, 1463 | "schema_version": 3, 1464 | "solution": false, 1465 | "task": false 1466 | } 1467 | }, 1468 | "source": [ 1469 | "Now that you implemented the `optimize_network()` function, you can implement the agent. In the cell below, you will fill the `agent_step()` and `agent_end()` functions. You should:\n", 1470 | "- select an action (only in `agent_step()`),\n", 1471 | "- add transitions (consisting of the state, action, reward, terminal, and next state) to the replay buffer, and,\n", 1472 | "- update the weights of the neural network by doing multiple replay steps and calling the `optimize_network()` function that you implemented above." 1473 | ] 1474 | }, 1475 | { 1476 | "cell_type": "code", 1477 | "execution_count": 21, 1478 | "metadata": { 1479 | "deletable": false, 1480 | "nbgrader": { 1481 | "cell_type": "code", 1482 | "checksum": "1d9134ac89ad8c86157599044f5dbc8e", 1483 | "grade": false, 1484 | "grade_id": "cell-54b5db480295424c", 1485 | "locked": false, 1486 | "schema_version": 3, 1487 | "solution": true, 1488 | "task": false 1489 | } 1490 | }, 1491 | "outputs": [], 1492 | "source": [ 1493 | "# -----------\n", 1494 | "# Graded Cell\n", 1495 | "# -----------\n", 1496 | "\n", 1497 | "### Work Required: Yes. 
Fill in code in agent_step and agent_end (~7 Lines).\n", 1498 | "class Agent(BaseAgent):\n", 1499 | " def __init__(self):\n", 1500 | " self.name = \"expected_sarsa_agent\"\n", 1501 | " \n", 1502 | " # Work Required: No.\n", 1503 | " def agent_init(self, agent_config):\n", 1504 | " \"\"\"Setup for the agent called when the experiment first starts.\n", 1505 | "\n", 1506 | " Set parameters needed to setup the agent.\n", 1507 | "\n", 1508 | " Assume agent_config dict contains:\n", 1509 | " {\n", 1510 | " network_config: dictionary,\n", 1511 | " optimizer_config: dictionary,\n", 1512 | " replay_buffer_size: integer,\n", 1513 | " minibatch_sz: integer, \n", 1514 | " num_replay_updates_per_step: float\n", 1515 | " discount_factor: float,\n", 1516 | " }\n", 1517 | " \"\"\"\n", 1518 | " self.replay_buffer = ReplayBuffer(agent_config['replay_buffer_size'], \n", 1519 | " agent_config['minibatch_sz'], agent_config.get(\"seed\"))\n", 1520 | " self.network = ActionValueNetwork(agent_config['network_config'])\n", 1521 | " self.optimizer = Adam(self.network.layer_sizes, agent_config[\"optimizer_config\"])\n", 1522 | " self.num_actions = agent_config['network_config']['num_actions']\n", 1523 | " self.num_replay = agent_config['num_replay_updates_per_step']\n", 1524 | " self.discount = agent_config['gamma']\n", 1525 | " self.tau = agent_config['tau']\n", 1526 | " \n", 1527 | " self.rand_generator = np.random.RandomState(agent_config.get(\"seed\"))\n", 1528 | " \n", 1529 | " self.last_state = None\n", 1530 | " self.last_action = None\n", 1531 | " \n", 1532 | " self.sum_rewards = 0\n", 1533 | " self.episode_steps = 0\n", 1534 | "\n", 1535 | " # Work Required: No.\n", 1536 | " def policy(self, state):\n", 1537 | " \"\"\"\n", 1538 | " Args:\n", 1539 | " state (Numpy array): the state.\n", 1540 | " Returns:\n", 1541 | " the action. \n", 1542 | " \"\"\"\n", 1543 | " action_values = self.network.get_action_values(state)\n", 1544 | " probs_batch = softmax(action_values, self.tau)\n", 1545 | " action = self.rand_generator.choice(self.num_actions, p=probs_batch.squeeze())\n", 1546 | " return action\n", 1547 | "\n", 1548 | " # Work Required: No.\n", 1549 | " def agent_start(self, state):\n", 1550 | " \"\"\"The first method called when the experiment starts, called after\n", 1551 | " the environment starts.\n", 1552 | " Args:\n", 1553 | " state (Numpy array): the state from the\n", 1554 | " environment's evn_start function.\n", 1555 | " Returns:\n", 1556 | " The first action the agent takes.\n", 1557 | " \"\"\"\n", 1558 | " self.sum_rewards = 0\n", 1559 | " self.episode_steps = 0\n", 1560 | " self.last_state = np.array([state])\n", 1561 | " self.last_action = self.policy(self.last_state)\n", 1562 | " return self.last_action\n", 1563 | "\n", 1564 | " # Work Required: Yes. 
Fill in the action selection, replay-buffer update, \n", 1565 | " # weights update using optimize_network, and updating last_state and last_action (~5 lines).\n", 1566 | " def agent_step(self, reward, state):\n", 1567 | " \"\"\"A step taken by the agent.\n", 1568 | " Args:\n", 1569 | " reward (float): the reward received for taking the last action taken\n", 1570 | " state (Numpy array): the state from the\n", 1571 | " environment's step based, where the agent ended up after the\n", 1572 | " last step\n", 1573 | " Returns:\n", 1574 | " The action the agent is taking.\n", 1575 | " \"\"\"\n", 1576 | " \n", 1577 | " self.sum_rewards += reward\n", 1578 | " self.episode_steps += 1\n", 1579 | "\n", 1580 | " # Make state an array of shape (1, state_dim) to add a batch dimension and\n", 1581 | " # to later match the get_action_values() and get_TD_update() functions\n", 1582 | " state = np.array([state])\n", 1583 | "\n", 1584 | " # Select action\n", 1585 | " ### START CODE HERE (~1 Line)\n", 1586 | " action = self.policy(state)\n", 1587 | " ### END CODE HERE\n", 1588 | " \n", 1589 | " # Append new experience to replay buffer\n", 1590 | " # Note: look at the replay_buffer append function for the order of arguments\n", 1591 | "\n", 1592 | " ### START CODE HERE (~1 Line)\n", 1593 | " self.replay_buffer.append(self.last_state, self.last_action, reward, 0, state)\n", 1594 | " ### END CODE HERE\n", 1595 | " \n", 1596 | " # Perform replay steps:\n", 1597 | " if self.replay_buffer.size() > self.replay_buffer.minibatch_size:\n", 1598 | " current_q = deepcopy(self.network)\n", 1599 | " for _ in range(self.num_replay):\n", 1600 | " \n", 1601 | " # Get sample experiences from the replay buffer\n", 1602 | " experiences = self.replay_buffer.sample()\n", 1603 | " \n", 1604 | " # Call optimize_network to update the weights of the network (~1 Line)\n", 1605 | " ### START CODE HERE\n", 1606 | " optimize_network(experiences, self.discount, self.optimizer, self.network, current_q, self.tau)\n", 1607 | " ### END CODE HERE\n", 1608 | " \n", 1609 | " # Update the last state and last action.\n", 1610 | " ### START CODE HERE (~2 Lines)\n", 1611 | " self.last_state = state\n", 1612 | " self.last_action = action\n", 1613 | " ### END CODE HERE\n", 1614 | " \n", 1615 | " return action\n", 1616 | "\n", 1617 | " # Work Required: Yes. 
Fill in the replay-buffer update and\n", 1618 | " # update of the weights using optimize_network (~2 lines).\n", 1619 | " def agent_end(self, reward):\n", 1620 | " \"\"\"Run when the agent terminates.\n", 1621 | " Args:\n", 1622 | " reward (float): the reward the agent received for entering the\n", 1623 | " terminal state.\n", 1624 | " \"\"\"\n", 1625 | " self.sum_rewards += reward\n", 1626 | " self.episode_steps += 1\n", 1627 | " \n", 1628 | " # Set terminal state to an array of zeros\n", 1629 | " state = np.zeros_like(self.last_state)\n", 1630 | "\n", 1631 | " # Append new experience to replay buffer\n", 1632 | " # Note: look at the replay_buffer append function for the order of arguments\n", 1633 | " \n", 1634 | " ### START CODE HERE (~1 Line)\n", 1635 | " self.replay_buffer.append(self.last_state, self.last_action, reward, 1, state)\n", 1636 | " ### END CODE HERE\n", 1637 | " \n", 1638 | " # Perform replay steps:\n", 1639 | " if self.replay_buffer.size() > self.replay_buffer.minibatch_size:\n", 1640 | " current_q = deepcopy(self.network)\n", 1641 | " for _ in range(self.num_replay):\n", 1642 | " \n", 1643 | " # Get sample experiences from the replay buffer\n", 1644 | " experiences = self.replay_buffer.sample()\n", 1645 | " \n", 1646 | " # Call optimize_network to update the weights of the network\n", 1647 | " ### START CODE HERE (~1 Line)\n", 1648 | " optimize_network(experiences, self.discount, self.optimizer, self.network, current_q, self.tau)\n", 1649 | " ### END CODE HERE\n", 1650 | " \n", 1651 | " \n", 1652 | " def agent_message(self, message):\n", 1653 | " if message == \"get_sum_reward\":\n", 1654 | " return self.sum_rewards\n", 1655 | " else:\n", 1656 | " raise Exception(\"Unrecognized Message!\")" 1657 | ] 1658 | }, 1659 | { 1660 | "cell_type": "markdown", 1661 | "metadata": { 1662 | "deletable": false, 1663 | "editable": false, 1664 | "nbgrader": { 1665 | "cell_type": "markdown", 1666 | "checksum": "3eb47bb8d21e025d56d1d89d4e4c746c", 1667 | "grade": false, 1668 | "grade_id": "cell-d5e188547a2be9d9", 1669 | "locked": true, 1670 | "schema_version": 3, 1671 | "solution": false, 1672 | "task": false 1673 | } 1674 | }, 1675 | "source": [ 1676 | "Run the following code to test your implementation of the `agent_step()` function:" 1677 | ] 1678 | }, 1679 | { 1680 | "cell_type": "code", 1681 | "execution_count": 22, 1682 | "metadata": { 1683 | "deletable": false, 1684 | "editable": false, 1685 | "nbgrader": { 1686 | "cell_type": "code", 1687 | "checksum": "d351a79afa5ae8b5c72510a871a7328a", 1688 | "grade": true, 1689 | "grade_id": "cell-154adaa2d37b3d45", 1690 | "locked": true, 1691 | "points": 20, 1692 | "schema_version": 3, 1693 | "solution": false, 1694 | "task": false 1695 | } 1696 | }, 1697 | "outputs": [], 1698 | "source": [ 1699 | "# -----------\n", 1700 | "# Tested Cell\n", 1701 | "# -----------\n", 1702 | "# The contents of the cell will be tested by the autograder.\n", 1703 | "# If they do not pass here, they will not pass there.\n", 1704 | "\n", 1705 | "agent_info = {\n", 1706 | " 'network_config': {\n", 1707 | " 'state_dim': 8,\n", 1708 | " 'num_hidden_units': 256,\n", 1709 | " 'num_hidden_layers': 1,\n", 1710 | " 'num_actions': 4\n", 1711 | " },\n", 1712 | " 'optimizer_config': {\n", 1713 | " 'step_size': 3e-5, \n", 1714 | " 'beta_m': 0.9, \n", 1715 | " 'beta_v': 0.999,\n", 1716 | " 'epsilon': 1e-8\n", 1717 | " },\n", 1718 | " 'replay_buffer_size': 32,\n", 1719 | " 'minibatch_sz': 32,\n", 1720 | " 'num_replay_updates_per_step': 4,\n", 1721 | " 'gamma': 0.99,\n", 1722 | " 
'tau': 1000.0,\n", 1723 | " 'seed': 0}\n", 1724 | "\n", 1725 | "# Initialize agent\n", 1726 | "agent = Agent()\n", 1727 | "agent.agent_init(agent_info)\n", 1728 | "\n", 1729 | "# load agent network, optimizer, replay_buffer from the agent_input_1.npz file\n", 1730 | "input_data = np.load(\"asserts/agent_input_1.npz\", allow_pickle=True)\n", 1731 | "agent.network.set_weights(input_data[\"network_weights\"])\n", 1732 | "agent.optimizer.m = input_data[\"optimizer_m\"]\n", 1733 | "agent.optimizer.v = input_data[\"optimizer_v\"]\n", 1734 | "agent.optimizer.beta_m_product = input_data[\"optimizer_beta_m_product\"]\n", 1735 | "agent.optimizer.beta_v_product = input_data[\"optimizer_beta_v_product\"]\n", 1736 | "agent.replay_buffer.rand_generator.seed(int(input_data[\"replay_buffer_seed\"]))\n", 1737 | "for experience in input_data[\"replay_buffer\"]:\n", 1738 | " agent.replay_buffer.buffer.append(experience)\n", 1739 | "\n", 1740 | "# Perform agent_step multiple times\n", 1741 | "last_state_array = input_data[\"last_state_array\"]\n", 1742 | "last_action_array = input_data[\"last_action_array\"]\n", 1743 | "state_array = input_data[\"state_array\"]\n", 1744 | "reward_array = input_data[\"reward_array\"]\n", 1745 | "\n", 1746 | "for i in range(5):\n", 1747 | " agent.last_state = last_state_array[i]\n", 1748 | " agent.last_action = last_action_array[i]\n", 1749 | " state = state_array[i]\n", 1750 | " reward = reward_array[i]\n", 1751 | " \n", 1752 | " agent.agent_step(reward, state)\n", 1753 | " \n", 1754 | " # Load expected values for last_state, last_action, weights, and replay_buffer \n", 1755 | " output_data = np.load(\"asserts/agent_step_output_{}.npz\".format(i), allow_pickle=True)\n", 1756 | " answer_last_state = output_data[\"last_state\"]\n", 1757 | " answer_last_action = output_data[\"last_action\"]\n", 1758 | " answer_updated_weights = output_data[\"updated_weights\"]\n", 1759 | " answer_replay_buffer = output_data[\"replay_buffer\"]\n", 1760 | "\n", 1761 | " # Asserts for last_state and last_action\n", 1762 | " assert(np.allclose(answer_last_state, agent.last_state))\n", 1763 | " assert(np.allclose(answer_last_action, agent.last_action))\n", 1764 | "\n", 1765 | " # Asserts for replay_buffer \n", 1766 | " for i in range(answer_replay_buffer.shape[0]):\n", 1767 | " for j in range(answer_replay_buffer.shape[1]):\n", 1768 | " assert(np.allclose(np.asarray(agent.replay_buffer.buffer)[i, j], answer_replay_buffer[i, j]))\n", 1769 | "\n", 1770 | " # Asserts for network.weights\n", 1771 | " assert(np.allclose(agent.network.weights[0][\"W\"], answer_updated_weights[0][\"W\"]))\n", 1772 | " assert(np.allclose(agent.network.weights[0][\"b\"], answer_updated_weights[0][\"b\"]))\n", 1773 | " assert(np.allclose(agent.network.weights[1][\"W\"], answer_updated_weights[1][\"W\"]))\n", 1774 | " assert(np.allclose(agent.network.weights[1][\"b\"], answer_updated_weights[1][\"b\"]))\n" 1775 | ] 1776 | }, 1777 | { 1778 | "cell_type": "markdown", 1779 | "metadata": { 1780 | "deletable": false, 1781 | "editable": false, 1782 | "nbgrader": { 1783 | "cell_type": "markdown", 1784 | "checksum": "ab42a7c0695fa34f1aedc1f59d943d24", 1785 | "grade": false, 1786 | "grade_id": "cell-e28c640fdbff0646", 1787 | "locked": true, 1788 | "schema_version": 3, 1789 | "solution": false, 1790 | "task": false 1791 | } 1792 | }, 1793 | "source": [ 1794 | "Run the following code to test your implementation of the `agent_end()` function:" 1795 | ] 1796 | }, 1797 | { 1798 | "cell_type": "code", 1799 | "execution_count": 23, 1800 | 
"metadata": { 1801 | "deletable": false, 1802 | "editable": false, 1803 | "nbgrader": { 1804 | "cell_type": "code", 1805 | "checksum": "9e669f8872d9ffb8231570fa2c8b177a", 1806 | "grade": true, 1807 | "grade_id": "cell-1c52a8c6d80f28c0", 1808 | "locked": true, 1809 | "points": 20, 1810 | "schema_version": 3, 1811 | "solution": false, 1812 | "task": false 1813 | } 1814 | }, 1815 | "outputs": [], 1816 | "source": [ 1817 | "# -----------\n", 1818 | "# Tested Cell\n", 1819 | "# -----------\n", 1820 | "# The contents of the cell will be tested by the autograder.\n", 1821 | "# If they do not pass here, they will not pass there.\n", 1822 | "\n", 1823 | "agent_info = {\n", 1824 | " 'network_config': {\n", 1825 | " 'state_dim': 8,\n", 1826 | " 'num_hidden_units': 256,\n", 1827 | " 'num_hidden_layers': 1,\n", 1828 | " 'num_actions': 4\n", 1829 | " },\n", 1830 | " 'optimizer_config': {\n", 1831 | " 'step_size': 3e-5, \n", 1832 | " 'beta_m': 0.9, \n", 1833 | " 'beta_v': 0.999,\n", 1834 | " 'epsilon': 1e-8\n", 1835 | " },\n", 1836 | " 'replay_buffer_size': 32,\n", 1837 | " 'minibatch_sz': 32,\n", 1838 | " 'num_replay_updates_per_step': 4,\n", 1839 | " 'gamma': 0.99,\n", 1840 | " 'tau': 1000,\n", 1841 | " 'seed': 0\n", 1842 | " }\n", 1843 | "\n", 1844 | "# Initialize agent\n", 1845 | "agent = Agent()\n", 1846 | "agent.agent_init(agent_info)\n", 1847 | "\n", 1848 | "# load agent network, optimizer, replay_buffer from the agent_input_1.npz file\n", 1849 | "input_data = np.load(\"asserts/agent_input_1.npz\", allow_pickle=True)\n", 1850 | "agent.network.set_weights(input_data[\"network_weights\"])\n", 1851 | "agent.optimizer.m = input_data[\"optimizer_m\"]\n", 1852 | "agent.optimizer.v = input_data[\"optimizer_v\"]\n", 1853 | "agent.optimizer.beta_m_product = input_data[\"optimizer_beta_m_product\"]\n", 1854 | "agent.optimizer.beta_v_product = input_data[\"optimizer_beta_v_product\"]\n", 1855 | "agent.replay_buffer.rand_generator.seed(int(input_data[\"replay_buffer_seed\"]))\n", 1856 | "for experience in input_data[\"replay_buffer\"]:\n", 1857 | " agent.replay_buffer.buffer.append(experience)\n", 1858 | "\n", 1859 | "# Perform agent_step multiple times\n", 1860 | "last_state_array = input_data[\"last_state_array\"]\n", 1861 | "last_action_array = input_data[\"last_action_array\"]\n", 1862 | "state_array = input_data[\"state_array\"]\n", 1863 | "reward_array = input_data[\"reward_array\"]\n", 1864 | "\n", 1865 | "for i in range(5):\n", 1866 | " agent.last_state = last_state_array[i]\n", 1867 | " agent.last_action = last_action_array[i]\n", 1868 | " reward = reward_array[i]\n", 1869 | " \n", 1870 | " agent.agent_end(reward)\n", 1871 | "\n", 1872 | " # Load expected values for last_state, last_action, weights, and replay_buffer \n", 1873 | " output_data = np.load(\"asserts/agent_end_output_{}.npz\".format(i), allow_pickle=True)\n", 1874 | " answer_updated_weights = output_data[\"updated_weights\"]\n", 1875 | " answer_replay_buffer = output_data[\"replay_buffer\"]\n", 1876 | "\n", 1877 | " # Asserts for replay_buffer \n", 1878 | " for i in range(answer_replay_buffer.shape[0]):\n", 1879 | " for j in range(answer_replay_buffer.shape[1]):\n", 1880 | " assert(np.allclose(np.asarray(agent.replay_buffer.buffer)[i, j], answer_replay_buffer[i, j]))\n", 1881 | "\n", 1882 | " # Asserts for network.weights\n", 1883 | " assert(np.allclose(agent.network.weights[0][\"W\"], answer_updated_weights[0][\"W\"]))\n", 1884 | " assert(np.allclose(agent.network.weights[0][\"b\"], answer_updated_weights[0][\"b\"]))\n", 1885 | " 
assert(np.allclose(agent.network.weights[1][\"W\"], answer_updated_weights[1][\"W\"]))\n", 1886 | " assert(np.allclose(agent.network.weights[1][\"b\"], answer_updated_weights[1][\"b\"]))" 1887 | ] 1888 | }, 1889 | { 1890 | "cell_type": "markdown", 1891 | "metadata": { 1892 | "deletable": false, 1893 | "editable": false, 1894 | "nbgrader": { 1895 | "cell_type": "markdown", 1896 | "checksum": "1ed9fca0352062809443beae983d9ea2", 1897 | "grade": false, 1898 | "grade_id": "cell-83130c3c2426b0c4", 1899 | "locked": true, 1900 | "schema_version": 3, 1901 | "solution": false, 1902 | "task": false 1903 | } 1904 | }, 1905 | "source": [ 1906 | "## Section 6: Run Experiment\n", 1907 | "\n", 1908 | "Now that you implemented the agent, we can use it to run an experiment on the Lunar Lander problem. We will plot the learning curve of the agent to visualize learning progress. To plot the learning curve, we use the sum of rewards in an episode as the performance measure. We have provided for you the experiment/plot code in the cell below which you can go ahead and run. Note that running the cell below has taken approximately 10 minutes in prior testing." 1909 | ] 1910 | }, 1911 | { 1912 | "cell_type": "code", 1913 | "execution_count": 24, 1914 | "metadata": { 1915 | "deletable": false, 1916 | "editable": false, 1917 | "nbgrader": { 1918 | "cell_type": "code", 1919 | "checksum": "e192cd7f474ff57861f6f8a3e3ab188c", 1920 | "grade": false, 1921 | "grade_id": "cell-0defecc3f69370dc", 1922 | "locked": true, 1923 | "schema_version": 3, 1924 | "solution": false, 1925 | "task": false 1926 | } 1927 | }, 1928 | "outputs": [ 1929 | { 1930 | "name": "stderr", 1931 | "output_type": "stream", 1932 | "text": [ 1933 | "100%|██████████| 300/300 [08:09<00:00, 1.63s/it]\n" 1934 | ] 1935 | } 1936 | ], 1937 | "source": [ 1938 | "# ---------------\n", 1939 | "# Discussion Cell\n", 1940 | "# ---------------\n", 1941 | "\n", 1942 | "def run_experiment(environment, agent, environment_parameters, agent_parameters, experiment_parameters):\n", 1943 | " \n", 1944 | " rl_glue = RLGlue(environment, agent)\n", 1945 | " \n", 1946 | " # save sum of reward at the end of each episode\n", 1947 | " agent_sum_reward = np.zeros((experiment_parameters[\"num_runs\"], \n", 1948 | " experiment_parameters[\"num_episodes\"]))\n", 1949 | "\n", 1950 | " env_info = {}\n", 1951 | "\n", 1952 | " agent_info = agent_parameters\n", 1953 | "\n", 1954 | " # one agent setting\n", 1955 | " for run in range(1, experiment_parameters[\"num_runs\"]+1):\n", 1956 | " agent_info[\"seed\"] = run\n", 1957 | " agent_info[\"network_config\"][\"seed\"] = run\n", 1958 | " env_info[\"seed\"] = run\n", 1959 | "\n", 1960 | " rl_glue.rl_init(agent_info, env_info)\n", 1961 | " \n", 1962 | " for episode in tqdm(range(1, experiment_parameters[\"num_episodes\"]+1)):\n", 1963 | " # run episode\n", 1964 | " rl_glue.rl_episode(experiment_parameters[\"timeout\"])\n", 1965 | " \n", 1966 | " episode_reward = rl_glue.rl_agent_message(\"get_sum_reward\")\n", 1967 | " agent_sum_reward[run - 1, episode - 1] = episode_reward\n", 1968 | " save_name = \"{}\".format(rl_glue.agent.name)\n", 1969 | " if not os.path.exists('results'):\n", 1970 | " os.makedirs('results')\n", 1971 | " np.save(\"results/sum_reward_{}\".format(save_name), agent_sum_reward)\n", 1972 | " shutil.make_archive('results', 'zip', 'results')\n", 1973 | "\n", 1974 | "# Run Experiment\n", 1975 | "\n", 1976 | "# Experiment parameters\n", 1977 | "experiment_parameters = {\n", 1978 | " \"num_runs\" : 1,\n", 1979 | " \"num_episodes\" : 
300,\n", 1980 | " # OpenAI Gym environments allow for a timestep limit timeout, causing episodes to end after \n", 1981 | " # some number of timesteps. Here we use the default of 500.\n", 1982 | " \"timeout\" : 500\n", 1983 | "}\n", 1984 | "\n", 1985 | "# Environment parameters\n", 1986 | "environment_parameters = {}\n", 1987 | "\n", 1988 | "current_env = LunarLanderEnvironment\n", 1989 | "\n", 1990 | "# Agent parameters\n", 1991 | "agent_parameters = {\n", 1992 | " 'network_config': {\n", 1993 | " 'state_dim': 8,\n", 1994 | " 'num_hidden_units': 256,\n", 1995 | " 'num_actions': 4\n", 1996 | " },\n", 1997 | " 'optimizer_config': {\n", 1998 | " 'step_size': 1e-3,\n", 1999 | " 'beta_m': 0.9, \n", 2000 | " 'beta_v': 0.999,\n", 2001 | " 'epsilon': 1e-8\n", 2002 | " },\n", 2003 | " 'replay_buffer_size': 50000,\n", 2004 | " 'minibatch_sz': 8,\n", 2005 | " 'num_replay_updates_per_step': 4,\n", 2006 | " 'gamma': 0.99,\n", 2007 | " 'tau': 0.001\n", 2008 | "}\n", 2009 | "current_agent = Agent\n", 2010 | "\n", 2011 | "# run experiment\n", 2012 | "run_experiment(current_env, current_agent, environment_parameters, agent_parameters, experiment_parameters)" 2013 | ] 2014 | }, 2015 | { 2016 | "cell_type": "markdown", 2017 | "metadata": { 2018 | "deletable": false, 2019 | "editable": false, 2020 | "nbgrader": { 2021 | "cell_type": "markdown", 2022 | "checksum": "92ba982f59ab0ecd45333f5b73f0be60", 2023 | "grade": false, 2024 | "grade_id": "cell-b6321a32b126637e", 2025 | "locked": true, 2026 | "schema_version": 3, 2027 | "solution": false, 2028 | "task": false 2029 | } 2030 | }, 2031 | "source": [ 2032 | "Run the cell below to see the comparison between the agent that you implemented and a random agent for the one run and 300 episodes. Note that the `plot_result()` function smoothes the learning curve by applying a sliding window on the performance measure. 
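For reference, a rough sketch of the kind of sliding-window smoothing described above is shown below. The provided `plot_result()` helper may differ in its details; in particular, the 100-episode window length and the plotting calls here are assumptions, not taken from the course code.

```python
import numpy as np
import matplotlib.pyplot as plt

def smooth_sum_rewards(sum_rewards, window=100):
    # sum_rewards: array of shape (num_runs, num_episodes), as saved by run_experiment above.
    # Returns the running mean over the last `window` episodes (using fewer episodes near the start).
    smoothed = np.zeros_like(sum_rewards, dtype=float)
    for ep in range(sum_rewards.shape[1]):
        lo = max(0, ep - window + 1)
        smoothed[:, ep] = sum_rewards[:, lo:ep + 1].mean(axis=1)
    return smoothed

# Hypothetical usage with the file saved by run_experiment:
# rewards = np.load("results/sum_reward_expected_sarsa_agent.npy")
# plt.plot(smooth_sum_rewards(rewards).mean(axis=0))
# plt.xlabel("Episode")
# plt.ylabel("Sum of rewards (smoothed)")
```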
" 2033 | ] 2034 | }, 2035 | { 2036 | "cell_type": "code", 2037 | "execution_count": 25, 2038 | "metadata": { 2039 | "deletable": false, 2040 | "editable": false, 2041 | "nbgrader": { 2042 | "cell_type": "code", 2043 | "checksum": "3132510fde7c06020276a6c6f272eccd", 2044 | "grade": false, 2045 | "grade_id": "cell-337be142123eb81f", 2046 | "locked": true, 2047 | "schema_version": 3, 2048 | "solution": false, 2049 | "task": false 2050 | } 2051 | }, 2052 | "outputs": [ 2053 | { 2054 | "data": { 2055 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAjgAAAGoCAYAAABL+58oAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+j8jraAAAgAElEQVR4nOzdd3xW5f3/8dcnkywSAkmAsJcsASWAVlQsWLUVBz8HbpRqa+1XnBVcdddWW1GrVqU466LWVSdDBBRBUJClrLAhhBlCyP78/rjvpEkIEDAhcPN+Ph555L7Puc45n5Ng7rfXuc65zN0RERERCSVh9V2AiIiISG1TwBEREZGQo4AjIiIiIUcBR0REREKOAo6IiIiEHAUcERERCTkKOCJSJ8zsHjPbVN917IuZDTMzN7P4g3zcNDMbbWbLzKzAzLaa2cdmdtrBrEMkVEXUdwEiIvXsQ+B4IO9gHdDMjgI+B3YCjwILgYbAL4H3zayvu889WPWIhCIFHBEJOWYW4+67atLW3bOB7Douqap/AVuAn7l7ToXlH5jZM8C2n7Lz/Tl/kVClS1QiUm/MrLuZfWhmO4Jf48ysaYX1cWb2dzP70czyzCzTzJ4ys4ZV9uNmdlPwkk82MK/C8hFm9pCZZZvZxuD20RW2rXSJyszaBN9fYGbPmtl2M1tjZveaWViV455vZkvMbJeZfW5mxwS3HbaXcz4J6A2MqhJuAHD37919VbDtZDP7d5XtBwSP0b1KvZeY2ctmto1AUHrJzGZWc/zfB+stO98wMxtpZkuDl8oWm9kVe6pf5HChgCMi9cLMOgBfAg2Ay4BhQDcCH84WbBYLhAN3AGcAdwE/B8ZVs8tbgWbBfV1fYfnNQHPgUuAR4DfAiBqU+BcgFzgPeBW4O/i6rP4M4A3gW+Bc4H3gzRrs92SgBJhQg7b741FgB3A+8FCwtj5m1q5KuwuAD909N/j+SeBO4DngV8A7wFgzO7OW6xM5qHSJSkTqyx+BDcAZ7l4IYGbfAz8QGIvyYfDy0bVlG5hZBJAJTDOzVmU9HUEb3P3Cao6zwt2HBV9/amYnAEMIBJi9meLuNwdfjzez04PbvRVcdhuwCBjqgUn9PjGzSODP+9hvOpBdB5eQvnb368reBH9WmwkEmoeDy9KB/sFlZSHzWuBKd38puOkEM2tG4Pfz31quUeSgUQ+OiNSXQQR6C0rNLKJCeFkBZJQ1MrPLzOw7M8sFioBpwVWdquzvwz0c57Mq7xcCLWpQ37626wN84JVnLH6/BvsFqItZjiudv7sXA/8BKoa+8wkMbC5rOxAoBd4p+x0Efw8TgV5mFl4HdYocFAo4IlJfmhDoBSmq8tUOaAlgZucCLwPTCXw4H0fgchAELm1VlLWH41QdsFtYzbYHsl1Tdh+cXJPBymuBFDOrSQ37o7rzf4NAUCkLgxcC71foPWpC4BLgdir/Dl4k0MPfrJZrFDlodIlKROrLFgI9OGOqWVf2/JzzgRnu/ruyFWZ28h72Vxe9InuzAUipsqzq++pMBu4j0Huyp16nMvlAVJVlyXtoW935TyZQ54Vm9jLQD/hThfVbgGLgBAI9OVVt3Ed9IocsBRwRqS8Tge7A7CqXeSqKAQqqLLukTququW+AwWZ2e4X6z9rXRu4+1cxmAw+Z2RR331FxvZkdDWxz99XAGuCkKrs4taYFuntp8C6sCwmEpRzgkwpNJhHowUl09/E13a/I4UABR0TqUpSZnVfN8i+Ae4CZwIdmNpZAr006gQ/wF919MjAeeMrM7gBmEBh8PPAg1F0TfyZQ0xtm9gLQBbg6uK663pCKLiHwoL9ZZvYY/3vQ32nBffQDVhPo4RoebPMhcEqwzf54E/g9cCPwTtmAbgB3/9HM/hE8h78AswhchusGdHL3X+/nsUQOGQo4IlKXEqj+lu5T3H2ymR0HPEDgFuUYAuNTJgJLg+2eJTAmZwSBD97xwMXA13Vc9z65+ywzu4jALdlnEwgH1xKocbfn21TZ9kczOxYYBfyBQLDLIxD4Li57irG7f2hmtwO/A34NvAfcEPxeU18SCEstCYzJqeo6YDGBYHVfsPaFwD/34xgihxzbc8+wiIjsDzO7FHgFaOfumfVdj8iRTD04IiIHKDitwnhgK3AsgQfmfahwI1L/dJt4LTOzBmY208zmmtkCM7s3uDzZzMYHH+s+3swaVdhmVPAx6T+aZhIWOZw0Bp4m8MycWwmMd7m4XisSEUCXqGpd8BHzce6eG3yq6TQC4weGAFvc/WEzGwk0cvfbzKwr8DrQl8Dj5CcQGNxXUk+nICIicthTD04t84CyOV4ig19OYBBi2aPQXwLOCb4+G3jD3QuC3dpLCYQdEREROUAag1MHgo83nw10AJ5y9xlmlubu6wHcfb2ZpQabp1P5jpA1wWVV93kNcA1AXFxc786dO9flKYiIiBwWZs+evcndd3vIpgJOHQheXuplZkkE5njpvpfmVs2y3a4buvtzBG6lJSMjw2fNmlUrtYqIiBzOzGxldct1iaoOufs2Ao9KPx3ICs7QS/B72SPQ1xCcdyeoBbDuIJYpIiISchRwapmZpQR7bjCzGAIzJv9AYJbhK4LNruB/D+p6HxhqZtFm1hboSOBhXyIiInKAdImq9jUDXgqOwwkD3nL3/5rZdOAtMxsOrCIwiSDuvsDM3iLw5NBi4DrdQSUiIvLT6Dbxw5DG4IiIiASY2Wx3z6i6XJeoREREJOToEpWI7FVOTg4bN26kqKiovksRkSNMZGQkqampNGzYcL+3VcARkT3KyckhKyuL9PR0YmJiCDyoW0Sk7rk7u3btYu3atQD7HXJ0iUpE9mjjxo2kp6cTGxurcCMiB5WZERsbS3p6Ohs3btz3BlUo4IjIHhUVFRETE1PfZYjIESwmJuaALpEr4IjIXqnnRkTq04H+DVLAERERkZCjgCMicph49dVXadOmTX2XcUjo1q0bb7755l7bmBnTpk07SBXVnWHDhvHrX/+6vsuoVZMnTyYiom7vc1LAEZ
[... remainder of base64-encoded PNG output omitted: learning-curve plot (sum of reward per episode) produced by plot_result for "expected_sarsa_agent" and "random_agent" ...]\n",
2056 | "text/plain": [
2057 | "
" 2058 | ] 2059 | }, 2060 | "metadata": { 2061 | "needs_background": "light" 2062 | }, 2063 | "output_type": "display_data" 2064 | } 2065 | ], 2066 | "source": [ 2067 | "plot_result([\"expected_sarsa_agent\", \"random_agent\"])" 2068 | ] 2069 | }, 2070 | { 2071 | "cell_type": "markdown", 2072 | "metadata": { 2073 | "deletable": false, 2074 | "editable": false, 2075 | "nbgrader": { 2076 | "cell_type": "markdown", 2077 | "checksum": "db793e2c314fae05c3878657eab18363", 2078 | "grade": false, 2079 | "grade_id": "cell-978255cacf80e540", 2080 | "locked": true, 2081 | "schema_version": 3, 2082 | "solution": false, 2083 | "task": false 2084 | } 2085 | }, 2086 | "source": [ 2087 | "In the following cell you can visualize the performance of the agent with a correct implementation. As you can see, the agent initially crashes quite quickly (Episode 0). Then, the agent learns to avoid crashing by expending fuel and staying far above the ground. Finally however, it learns to land smoothly within the landing zone demarcated by the two flags (Episode 275)." 2088 | ] 2089 | }, 2090 | { 2091 | "cell_type": "code", 2092 | "execution_count": 26, 2093 | "metadata": { 2094 | "deletable": false, 2095 | "editable": false, 2096 | "nbgrader": { 2097 | "cell_type": "code", 2098 | "checksum": "a9cc1faf04b1dd665484f8c1982470ac", 2099 | "grade": false, 2100 | "grade_id": "cell-9fa82bbbfd32220b", 2101 | "locked": true, 2102 | "schema_version": 3, 2103 | "solution": false, 2104 | "task": false 2105 | } 2106 | }, 2107 | "outputs": [ 2108 | { 2109 | "data": { 2110 | "text/html": [ 2111 | "
[HTML output stripped during extraction: embedded video of the trained agent's behaviour]\n",
2112 | "
\n" 2115 | ], 2116 | "text/plain": [ 2117 | "" 2118 | ] 2119 | }, 2120 | "metadata": {}, 2121 | "output_type": "display_data" 2122 | } 2123 | ], 2124 | "source": [ 2125 | "%%HTML\n", 2126 | "
[HTML markup (video embed) stripped during extraction]\n",
2127 | "
" 2130 | ] 2131 | }, 2132 | { 2133 | "cell_type": "markdown", 2134 | "metadata": { 2135 | "deletable": false, 2136 | "editable": false, 2137 | "nbgrader": { 2138 | "cell_type": "markdown", 2139 | "checksum": "34eb37bcd53120dc16b74cf95fe283d4", 2140 | "grade": false, 2141 | "grade_id": "cell-e5423f3a7fee6813", 2142 | "locked": true, 2143 | "schema_version": 3, 2144 | "solution": false, 2145 | "task": false 2146 | } 2147 | }, 2148 | "source": [ 2149 | "In the learning curve above, you can see that sum of reward over episode has quite a high-variance at the beginning. However, the performance seems to be improving. The experiment that you ran was for 300 episodes and 1 run. To understand how the agent performs in the long run, we provide below the learning curve for the agent trained for 3000 episodes with performance averaged over 30 runs.\n", 2150 | "\"Drawing\"\n", 2151 | "You can see that the agent learns a reasonably good policy within 3000 episodes, gaining sum of reward bigger than 200. Note that because of the high-variance in the agent performance, we also smoothed the learning curve. " 2152 | ] 2153 | }, 2154 | { 2155 | "cell_type": "markdown", 2156 | "metadata": { 2157 | "deletable": false, 2158 | "editable": false, 2159 | "nbgrader": { 2160 | "cell_type": "markdown", 2161 | "checksum": "2949ffe4c0d604eacac2fa59caa1b1ae", 2162 | "grade": false, 2163 | "grade_id": "cell-d9aa9305a578583d", 2164 | "locked": true, 2165 | "schema_version": 3, 2166 | "solution": false, 2167 | "task": false 2168 | } 2169 | }, 2170 | "source": [ 2171 | "### Wrapping up! \n", 2172 | "\n", 2173 | "You have successfully implemented Course 4 Programming Assignment 2.\n", 2174 | "\n", 2175 | "You have implemented an **Expected Sarsa agent with a neural network and the Adam optimizer** and used it for solving the Lunar Lander problem! You implemented different components of the agent including:\n", 2176 | "\n", 2177 | "- a neural network for function approximation,\n", 2178 | "- the Adam algorithm for optimizing the weights of the neural network,\n", 2179 | "- a Softmax policy,\n", 2180 | "- the replay steps for updating the action-value function using the experiences sampled from a replay buffer\n", 2181 | "\n", 2182 | "You tested the agent for a single parameter setting. In the next assignment, you will perform a parameter study on the step-size parameter to gain insight about the effect of step-size on the performance of your agent." 2183 | ] 2184 | } 2185 | ], 2186 | "metadata": { 2187 | "coursera": { 2188 | "course_slug": "complete-reinforcement-learning-system", 2189 | "graded_item_id": "8dMlx", 2190 | "launcher_item_id": "4O5gG" 2191 | }, 2192 | "kernelspec": { 2193 | "display_name": "Python 3", 2194 | "language": "python", 2195 | "name": "python3" 2196 | }, 2197 | "language_info": { 2198 | "codemirror_mode": { 2199 | "name": "ipython", 2200 | "version": 3 2201 | }, 2202 | "file_extension": ".py", 2203 | "mimetype": "text/x-python", 2204 | "name": "python", 2205 | "nbconvert_exporter": "python", 2206 | "pygments_lexer": "ipython3", 2207 | "version": "3.7.6" 2208 | } 2209 | }, 2210 | "nbformat": 4, 2211 | "nbformat_minor": 2 2212 | } 2213 | --------------------------------------------------------------------------------