├── .gitignore ├── LICENSE ├── Lecture 1 - Introduction.ipynb ├── Lecture 10 - Policy Gradient III.ipynb ├── Lecture 11 - Fast RL I.ipynb ├── Lecture 12 - Fast RL II.ipynb ├── Lecture 13 - Fast RL III.ipynb ├── Lecture 14 - Batch RL.ipynb ├── Lecture 15 - Monte Carlo Tree Search.ipynb ├── Lecture 2 - Given a Model of the World.ipynb ├── Lecture 3 - Model-Free Policy Evaluation.ipynb ├── Lecture 4 - Model Free Control.ipynb ├── Lecture 5 - Value Function Approximation.ipynb ├── Lecture 6 - CNNs and Deep Q Learning.ipynb ├── Lecture 7 - Imitation Learning.ipynb ├── Lecture 8 - Policy Gradient I.ipynb ├── Lecture 9 - Policy Gradient II.ipynb ├── README.md └── img ├── CUT.PNG ├── LVFA.PNG ├── OPE.PNG ├── SARSA_theorem.PNG ├── SPI.PNG ├── VFA.PNG ├── bias_variance.PNG ├── convergence_VFA.PNG ├── diagram.PNG ├── dp_mc_td.PNG ├── dp_mdp.PNG ├── dp_mrp.PNG ├── dp_tree.PNG ├── dueling_dqn.PNG ├── experience_replay.PNG ├── forward_search.PNG ├── mc_td.PNG ├── monotonic_improvement.PNG ├── policy_improvement.PNG ├── policy_iteration.PNG ├── prove_monotonic.PNG ├── rl_agent_types.PNG ├── search_tree_path.PNG └── value_iteration.PNG /.gitignore: -------------------------------------------------------------------------------- 1 | .ipynb_checkpoints/* -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2022 Vincent Tu 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this 
permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /Lecture 10 - Policy Gradient III.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "16df9564", 6 | "metadata": {}, 7 | "source": [ 8 | "# Lecture 10 - Policy Gradient III\n", 9 | "\n", 10 | "provided by [Stanford CS234](https://www.youtube.com/watch?v=FgzM3zpZ55o)\n", 11 | "\n", 12 | "---" 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "id": "618c06cf", 18 | "metadata": {}, 19 | "source": [ 20 | "
\n", 21 | "Table of Contents:
\n", 22 | " \n", 23 | "\n", 34 | "
" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "id": "a8837110", 40 | "metadata": {}, 41 | "source": [ 42 | "# 1. Introduction" 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "id": "e3a1a301", 48 | "metadata": {}, 49 | "source": [ 50 | "Today's lecture will cover the 2 other methods for automatic step-size tuning: trust regions and the TRPO algorithm." 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "id": "ac9f36cf", 56 | "metadata": {}, 57 | "source": [ 58 | "# 2. Need for Automatic Step Size Tuning" 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "id": "83a14379", 64 | "metadata": {}, 65 | "source": [ 66 | "Recall the objective function we defined in Lecture 9.\n", 67 | "\n", 68 | "$$\n", 69 | "\\begin{equation}\n", 70 | " \\begin{split}\n", 71 | " L_{\\pi}(\\tilde{\\pi}) = V(\\tilde{\\theta}) & = V(\\theta) + \\mathbb{E}_{\\pi_{\\tilde{\\theta}}}[\\sum_{t = 0}^{\\infty} \\gamma^{t} A_{\\pi}(s_{t}, a_{t})]\\\\\n", 72 | " & = V(\\theta) + \\sum_{s} \\mu_{\\tilde{\\pi}}(s) \\sum_{a} \\tilde{\\pi}(a~|~s) A_{\\pi}(s, a)\\\\\n", 73 | " \\mu_{\\tilde{\\pi}}(s) & = \\mathbb{E}_{\\tilde{\\pi}}[\\sum_{t = 0}^{\\infty} \\gamma^{t} I(s_{t} = s)]\n", 74 | " \\end{split}\n", 75 | "\\end{equation} \\hspace{1em} (Eq.~1)\\\\\n", 76 | "$$" 77 | ] 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "id": "d454d7b9", 82 | "metadata": {}, 83 | "source": [ 84 | "## 2.1. Local Approximation" 85 | ] 86 | }, 87 | { 88 | "cell_type": "markdown", 89 | "id": "5fc1f912", 90 | "metadata": {}, 91 | "source": [ 92 | "I copied over the text from the previous lecture for completeness." 93 | ] 94 | }, 95 | { 96 | "cell_type": "markdown", 97 | "id": "c975292d", 98 | "metadata": {}, 99 | "source": [ 100 | "We can slightly rewrite Eq. 
1 so that we have a substitute for $\\mu_{\\tilde{\\pi}}$:\n", 101 | "\n", 102 | "$$\n", 103 | "L_{\\pi}(\\tilde{\\pi}) = V(\\theta) + \\sum_{s} \\mu_{\\pi}(s) \\sum_{a} \\tilde{\\pi}(a~|~s) A_{\\pi}(s, a) \\hspace{1em} (Eq.~2)\\\\\n", 104 | "$$\n", 105 | "\n", 106 | "Eq. 2, instead of using $\\mu_{\\tilde{\\pi}}$, the discounted weighted frequency of state $s$ under the new policy, uses $\\mu_{\\pi}$, the current policy's discounted weighted frequency of state $s$." 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "id": "8d10d428", 112 | "metadata": {}, 113 | "source": [ 114 | "This raises the question: how do Eq. 1 and Eq. 2 fit into our current understanding of policy gradients? Over Lecture 8 and Lecture 9, we have seen a lot of formulas involving value functions.\n", 115 | "\n", 116 | "For now, I'm still not too sure. Let's give it some time." 117 | ] 118 | }, 119 | { 120 | "cell_type": "markdown", 121 | "id": "a8cad738", 122 | "metadata": {}, 123 | "source": [ 124 | "## 2.2. Trust Region" 125 | ] 126 | }, 127 | { 128 | "cell_type": "markdown", 129 | "id": "a30a8182", 130 | "metadata": {}, 131 | "source": [ 132 | "Disclaimer: I won't cover too much of the theory!\n", 133 | "\n", 134 | "With our new formulation in Eq. 2, we want to ask: is there a bound on the new policy's performance when we optimize the surrogate objective (the local approximation)?\n", 135 | "\n", 136 | "$$\n", 137 | "\\pi_{new}(a~|~s) = (1 - \\alpha) \\pi_{old}(a~|~s) + \\alpha \\pi'(a~|~s)\n", 138 | "$$\n", 139 | "\n", 140 | "Consider a __mixture policy__ (a blend of 2 policies): it takes a fraction $1 - \\alpha$ of the old policy and a fraction $\\alpha$ of a new policy $\\pi'$. For general stochastic policies we have the theorem (Eq. 
3):\n", 141 | "\n", 142 | "$$\n", 143 | "D_{TV}^{max}(\\pi_{1}, \\pi_{2}) = \\underset{s}{max} D_{TV}(\\pi_{1}(\\cdot~|~s), \\pi_{2}(\\cdot~|~s))\\\\\n", 144 | "\\epsilon = \\underset{s}{max}[\\mathbb{E}_{a \\sim \\pi'(a~|~s)}[A_{\\pi}(s, a)]]\\\\\n", 145 | "D_{TV}(p, q)^{2} \\le D_{KL}(p, q)\\\\\n", 146 | "C = \\frac{4 \\epsilon \\gamma}{(1 - \\gamma)^{2}}\\\\\n", 147 | "\\begin{equation}\n", 148 | " \\begin{split}\n", 149 | "V^{\\pi_{new}} & \\ge L_{\\pi_{old}}(\\pi_{new}) - \\frac{4 \\epsilon \\gamma}{(1 - \\gamma)^{2}}(D_{TV}^{max}(\\pi_{old}, \\pi_{new}))^{2}\\\\\n", 150 | " & \\ge L_{\\pi_{old}}(\\pi_{new}) - \\frac{4 \\epsilon \\gamma}{(1 - \\gamma)^{2}}D_{KL}^{max}(\\pi_{old}, \\pi_{new}) = M_{i}(\\pi)\n", 151 | " \\end{split}\n", 152 | "\\end{equation} \\hspace{1em} (Eq.~3)\\\\\n", 153 | "V^{\\pi_{i + 1}} - V^{\\pi_{i}} \\ge M_{i}(\\pi_{i + 1}) - M_{i}(\\pi_{i}) \\hspace{1em} (Eq.~4)\\\\\n", 154 | "$$\n", 155 | "\n", 156 | "From the theorem (Eq. 3), we can derive Eq. 4. Eq. 4 simply says that we can have a monotonically improving general stochastic policy. For $C$, we tend to make this a hyperparameter." 157 | ] 158 | }, 159 | { 160 | "cell_type": "markdown", 161 | "id": "7e15c253", 162 | "metadata": {}, 163 | "source": [ 164 | "With the theorem established, we can put it into practice. Let's formulate our objective function: \n", 165 | "\n", 166 | "$$\n", 167 | "\\underset{\\theta}{max} L_{\\pi_{old}}(\\pi_{new}) - CD_{KL}^{max}(\\pi_{old}, \\pi_{new}) \\hspace{1em} (Eq.~4)\\\\\n", 168 | "$$\n", 169 | "\n", 170 | "We can rewrite this to:\n", 171 | "\n", 172 | "$$\n", 173 | "\\underset{\\theta}{max} L_{\\pi_{old}}(\\pi_{new}) \\hspace{1em} (Eq.~5)\\\\\n", 174 | "subject ~ to ~ D_{KL}^{s \\sim \\mu_{\\theta_{old}}}(\\pi_{old}, \\pi_{new}) \\le \\delta\n", 175 | "$$\n", 176 | "\n", 177 | "This formulation leverages not only the previous objective function in local approximation, but also the new lower bound theorem. 
The constraint serves as a __trust region__: it limits the KL divergence between the old and new policies.\n", 178 | "\n", 179 | "We make 3 substitutions (note that $\\tilde{\\pi}$, $\\pi_{new}$ and $\\pi_{\\theta_{new}}$ all denote the new policy, and the same goes for the old):\n", 180 | "\n", 181 | "1. Substituting $\\sum_{s} \\mu_{\\theta_{old}}(s)$\n", 182 | "\n", 183 | "$$\n", 184 | "L_{\\pi_{old}} = V(\\theta) + \\sum_{s} \\mu_{\\pi}(s) \\sum_{a} \\tilde{\\pi}(a~|~s) A_{\\pi}(s, a)\\\\\n", 185 | "becomes\\\\\n", 186 | "L_{\\pi_{old}} = V(\\theta) + \\frac{1}{1 - \\gamma} \\mathbb{E}_{s \\sim \\mu_{\\theta_{old}}} [\\sum_{a} \\tilde{\\pi}(a~|~s) A_{\\pi}(s, a)] \\hspace{1em} (Eq.~5)\\\\\n", 187 | "$$\n", 188 | "\n", 189 | "This substitution is made because our state space can be continuous and infinite, so summing over it would be impossible in practice. Instead, we estimate the expectation from states sampled under the old policy.\n", 190 | "\n", 191 | "2. Substituting $\\sum_{a} \\tilde{\\pi}(a~|~s) A_{\\pi}(s, a)$\n", 192 | "\n", 193 | "$$\n", 194 | "(Eq.~5)\\\\\n", 195 | "becomes\\\\\n", 196 | "L_{\\pi_{old}} = V(\\theta) + \\frac{1}{1 - \\gamma} \\mathbb{E}_{s \\sim \\mu_{\\theta_{old}}} [\\mathbb{E}_{a \\sim q}[\\frac{\\pi_{\\theta}(a~|~s_{n})}{q(a~|~s_{n})}A_{\\theta_{old}}(s_{n}, a)]] \\hspace{1em} (Eq.~6)\\\\\n", 197 | "$$\n", 198 | "\n", 199 | "Again, the action space can be continuous, so we cannot sum over it. Instead, we use importance sampling with a sampling distribution $q$.\n", 200 | "\n", 201 | "3. Substituting $A_{\\theta_{old}}$\n", 202 | "\n", 203 | "$$\n", 204 | "(Eq.~6)\\\\\n", 205 | "becomes\\\\\n", 206 | "\\underset{\\theta}{max} \\mathbb{E}_{s \\sim \\mu_{\\theta_{old}}, a \\sim q} [\\frac{\\pi_{\\theta}(a~|~s)}{q(a~|~s)}Q_{\\theta_{old}}(s, a)] \\hspace{1em} (Eq.~7)\\\\\n", 207 | "subject ~ to ~ \\mathbb{E}_{s \\sim \\mu_{\\theta_{old}}} D_{KL}(\\pi_{old}(\\cdot~|~s), \\pi_{new}(\\cdot~|~s)) \\le \\delta\n", 208 | "$$" 209 | ] 210 | }, 211 | { 212 | "cell_type": "markdown", 213 | "id": "24662685", 214 | "metadata": {}, 215 | "source": [ 216 | "## 2.3. 
TRPO" 217 | ] 218 | }, 219 | { 220 | "cell_type": "markdown", 221 | "id": "020e6565", 222 | "metadata": {}, 223 | "source": [ 224 | "for iteration = 1,2, ... do
\n", 225 | "$\\quad$ Run policy for $T$ timesteps or $N$ trajectories
\n", 226 | "$\\quad$ Estimate advantage function at all time steps
\n", 227 | "$\\quad$ Compute policy gradient $g$
\n", 228 | "$\\quad$ Use conjugate gradient (CG) to compute $F^{-1}g$ where $F$ is the Fisher information matrix
\n", 229 | "$\\quad$ Do line search on surrogate loss and KL constraint
\n", 230 | "\n", 231 | "_Algorithm 1. Trust Region Policy Optimization (TRPO)._" 232 | ] 233 | }, 234 | { 235 | "cell_type": "markdown", 236 | "id": "61137305", 237 | "metadata": {}, 238 | "source": [ 239 | "Algorithm 1 is just a brief look into TRPO! " 240 | ] 241 | }, 242 | { 243 | "cell_type": "markdown", 244 | "id": "0f657f6b", 245 | "metadata": {}, 246 | "source": [ 247 | "# 3. Resource" 248 | ] 249 | }, 250 | { 251 | "cell_type": "markdown", 252 | "id": "32e04a27", 253 | "metadata": {}, 254 | "source": [ 255 | "If you missed the link right below the title, I'm providing the resource here again along with the course website.\n", 256 | "\n", 257 | "- [Stanford CS234](https://www.youtube.com/watch?v=FgzM3zpZ55o)\n", 258 | "- [Course Website](http://web.stanford.edu/class/cs234/index.html)\n", 259 | "\n", 260 | "This is a series of 15 lectures provided by Stanford.\n" 261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "execution_count": null, 266 | "id": "f5c15c5c", 267 | "metadata": {}, 268 | "outputs": [], 269 | "source": [] 270 | } 271 | ], 272 | "metadata": { 273 | "kernelspec": { 274 | "display_name": "Python 3", 275 | "language": "python", 276 | "name": "python3" 277 | }, 278 | "language_info": { 279 | "codemirror_mode": { 280 | "name": "ipython", 281 | "version": 3 282 | }, 283 | "file_extension": ".py", 284 | "mimetype": "text/x-python", 285 | "name": "python", 286 | "nbconvert_exporter": "python", 287 | "pygments_lexer": "ipython3", 288 | "version": "3.8.8" 289 | } 290 | }, 291 | "nbformat": 4, 292 | "nbformat_minor": 5 293 | } 294 | -------------------------------------------------------------------------------- /Lecture 11 - Fast RL I.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "16df9564", 6 | "metadata": {}, 7 | "source": [ 8 | "# Lecture 11 - Fast RL I\n", 9 | "\n", 10 | "provided by [Stanford 
CS234](https://www.youtube.com/watch?v=FgzM3zpZ55o)\n", 11 | "\n", 12 | "---" 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "id": "618c06cf", 18 | "metadata": {}, 19 | "source": [ 20 | "
\n", 21 | "Table of Contents:
\n", 22 | " \n", 23 | "\n", 33 | "
" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "id": "a8837110", 39 | "metadata": {}, 40 | "source": [ 41 | "# 1. Introduction" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "id": "0c0b90ce", 47 | "metadata": {}, 48 | "source": [ 49 | "2 Categories:\n", 50 | "* _Computational efficiency_\n", 51 | " * some computations take a long time (an autonomous vehicle needs to compute its decisions quickly while it is moving)\n", 52 | "* _Sample efficiency_\n", 53 | " * sometimes experience/data is hard to gather\n", 54 | " \n", 55 | "How do we evaluate our algorithm?\n", 56 | "* how good is it?\n", 57 | "* does it converge?\n", 58 | "* how quickly does it converge?\n", 59 | "\n", 60 | "We usually evaluate an algorithm based on its performance, but today we will evaluate it based on the amount of data it needs to make good decisions.\n", 61 | "\n", 62 | "The next 3 lectures (including this one) will cover the following:\n", 63 | "* __settings__: (bandits, MDPs, etc)\n", 64 | "* __frameworks__: criteria for evaluating RL algorithms\n", 65 | "* __approaches__: classes of algorithms for achieving particular evaluation criteria\n", 66 | "\n", 67 | "Specifically, for today's lecture we will cover:\n", 68 | "* setting: multi-armed bandits\n", 69 | "* framework: regret\n", 70 | "* approach: optimism under uncertainty\n", 71 | "* framework: Bayesian regret\n", 72 | "* approach: probability matching/Thompson sampling" 73 | ] 74 | }, 75 | { 76 | "cell_type": "markdown", 77 | "id": "3892c2d3", 78 | "metadata": {}, 79 | "source": [ 80 | "# 2. 
Multi-Armed Bandits" 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "id": "33d94b9a", 86 | "metadata": {}, 87 | "source": [ 88 | "* A multi-armed bandit is a tuple $(\\mathcal{A}, \\mathcal{R})$\n", 89 | "* $\\mathcal{A}$ is a known set of $m$ actions (arms)\n", 90 | "* $\\mathcal{R}^{a}(r) = \\mathbb{P}[r~|~a]$ is an unknown probability distribution over rewards\n", 91 | "* at each step $t$, the agent selects an action $a_{t} \\in \\mathcal{A}$ (pulling an arm)\n", 92 | "* the environment produces a reward $r_{t} \\sim \\mathcal{R}^{a_{t}}$\n", 93 | "* Goal: maximize cumulative reward $\\sum_{\\tau = 1}^{t}r_{\\tau}$\n", 94 | "\n", 95 | "$$\n", 96 | "Q(a) = \\mathbb{E}[r~|~a]\\hspace{1em} (Eq.~1)\\\\\n", 97 | "V^{*} = Q(a^{*}) = \\underset{a \\in \\mathcal{A}}{max}Q(a) \\hspace{1em} (Eq.~2)\\\\\n", 98 | "l_{t} = \\mathbb{E}[V^{*} - Q(a_{t})] \\hspace{1em} (Eq.~3)\\\\\n", 99 | "L_{t} = \\mathbb{E}[\\sum_{\\tau = 1}^{t} V^{*} - Q(a_{\\tau})] \\hspace{1em} (Eq.~4)\n", 100 | "$$\n", 101 | "\n", 102 | "Eq. 1 is the expected reward (mean reward) given an action (we ignore states in the multi-armed bandit setting). Eq. 2 is the optimal value: the action-value of the best action $a^{*}$. Eq. 3 is the opportunity loss (regret) at time step $t$: the expected gap between the optimal value and the action-value of the action you took. Eq. 
4 is the total opportunity loss (the sum of per-step regrets up to time step $t$).\n", 103 | "\n", 104 | "$$\n", 105 | "\\Delta_{a} = V^{*} - Q(a)\\\\\n", 106 | "\\begin{equation}\n", 107 | " \\begin{split}\n", 108 | " L_{t} & = \\mathbb{E}[\\sum_{\\tau = 1}^{t}V^{*} - Q(a_{\\tau})]\\\\\n", 109 | " & = \\sum_{a \\in \\mathcal{A}} \\mathbb{E}[N_{t}(a)](V^{*} - Q(a))\\\\\n", 110 | " & = \\sum_{a \\in \\mathcal{A}} \\mathbb{E}[N_{t}(a)]\\Delta_{a}\\\\\n", 111 | " \\end{split}\n", 112 | "\\end{equation} \\hspace{1em} (Eq.~5)\n", 113 | "$$\n", 114 | "\n", 115 | "$\\Delta_{a}$ is the gap between the optimal value and the value of action $a$, and $N_{t}(a)$ is the number of times action $a$ has been picked up to time step $t$.\n", 116 | "\n", 117 | "Maximizing cumulative reward is equivalent to minimizing total regret." 118 | ] 119 | }, 120 | { 121 | "cell_type": "markdown", 122 | "id": "64dfd2d6", 123 | "metadata": {}, 124 | "source": [ 125 | "## 2.1. Greedy Algorithm" 126 | ] 127 | }, 128 | { 129 | "cell_type": "markdown", 130 | "id": "cf977655", 131 | "metadata": {}, 132 | "source": [ 133 | "The simplest approach is the greedy algorithm.\n", 134 | "\n", 135 | "$$\n", 136 | "\\hat{Q}_{t}(a) = \\frac{1}{N_{t}(a)} \\sum_{\\tau = 1}^{t} r_{\\tau} \\mathbf{1}(a_{\\tau} = a) \\hspace{1em} (Eq.~6)\\\\\n", 137 | "a^{*}_{t} = \\underset{a \\in \\mathcal{A}}{argmax} \\hat{Q}_{t}(a) \\hspace{1em} (Eq.~7)\\\\\n", 138 | "$$\n", 139 | "\n", 140 | "A slightly more nuanced version of this is the $\\epsilon$-greedy algorithm. It will select $a_{t} = \\underset{a \\in \\mathcal{A}}{argmax} \\hat{Q}_{t}(a)$ with probability $1 - \\epsilon$ and a random action with probability $\\epsilon$.\n", 141 | "\n", 142 | "The problem is that both incur linear total regret: greedy can lock onto a suboptimal action forever, and $\\epsilon$-greedy with a constant $\\epsilon$ keeps picking random (possibly suboptimal) actions forever." 143 | ] 144 | }, 145 | { 146 | "cell_type": "markdown", 147 | "id": "466a4566", 148 | "metadata": {}, 149 | "source": [ 150 | "## 2.2. 
Upper Confidence Bounds" 151 | ] 152 | }, 153 | { 154 | "cell_type": "markdown", 155 | "id": "b97a2764", 156 | "metadata": {}, 157 | "source": [ 158 | "Types of regret bounds:\n", 159 | "* __problem independent__: the bound on regret is a function of $T$, the total number of time steps\n", 160 | "* __problem dependent__: the bound on regret is a function of the number of times we pull each arm and the gap between that arm's value and the optimal arm's value\n", 161 | "\n", 162 | "From past work, we know a lower bound on regret: asymptotic total regret is at least logarithmic in the number of steps.\n", 163 | "\n", 164 | "$$\n", 165 | "\\underset{t \\rightarrow \\infty}{lim} L_{t} \\ge log(t) \\sum_{a~|~\\Delta_{a} > 0} \\frac{\\Delta_{a}}{D_{KL}(\\mathcal{R}^{a}||\\mathcal{R}^{a^{*}})} \\hspace{1em} (Eq.~8)\\\\\n", 166 | "$$" 167 | ] 168 | }, 169 | { 170 | "cell_type": "markdown", 171 | "id": "25a2ab8c", 172 | "metadata": {}, 173 | "source": [ 174 | "In Upper Confidence Bounds (UCB),\n", 175 | "* we estimate an upper confidence bound $U_{t}(a)$ for each action value such that $Q(a) \\le U_{t}(a)$ with high probability\n", 176 | "* the bound depends on $N_{t}(a)$, the number of times action $a$ has been selected\n", 177 | "* select action to maximize UCB: $a_{t} = \\underset{a \\in \\mathcal{A}}{argmax}[U_{t}(a)]$\n", 178 | "\n", 179 | "We leverage __Hoeffding's Inequality__:\n", 180 | "\n", 181 | "Let $X_{1}, \\dots, X_{n}$ be i.i.d. random variables in $[0, 1]$, and let $\\bar{X}_{n}$ be their sample mean.\n", 182 | "$$\n", 183 | "\\mathbb{P}[\\mathbb{E}[X] > \\bar{X}_{n} + u] \\le exp(-2nu^{2}) \\hspace{1em} (Eq.~9)\\\\\n", 184 | "$$\n", 185 | "\n", 186 | "We then derive the UCB equation for selecting an action at timestep $t$. 
\n", 187 | "\n", 188 | "$$\n", 189 | "\\begin{equation}\n", 190 | " \\begin{split}\n", 191 | " U_{t}(a) = \\hat{Q}(a) + \\sqrt{\\frac{2 log(t)}{N_{t}(a)}}\\\\\n", 192 | " a_{t} = \\underset{a \\in \\mathcal{A}}{argmax}[U_{t}(a)]\n", 193 | " \\end{split}\n", 194 | "\\end{equation} \\hspace{1em} (Eq.~10)\\\\\n", 195 | "$$\n", 196 | "\n", 197 | "UCB achieves logarithmic asymptotic total regret:\n", 198 | "\n", 199 | "$$\n", 200 | "\\underset{t \\rightarrow \\infty}{lim} L_{t} \\le 8 log(t) \\sum_{a~|~\\Delta_{a} > 0} \\Delta_{a} \\hspace{1em} (Eq.~11)\\\\\n", 201 | "$$\n", 202 | "\n", 203 | "For an example of how UCB works refer to this: https://www.youtube.com/watch?v=FgmMK6RPU1c&t=507s&ab_channel=ritvikmath." 204 | ] 205 | }, 206 | { 207 | "cell_type": "markdown", 208 | "id": "0f657f6b", 209 | "metadata": {}, 210 | "source": [ 211 | "# 3. Resource" 212 | ] 213 | }, 214 | { 215 | "cell_type": "markdown", 216 | "id": "32e04a27", 217 | "metadata": {}, 218 | "source": [ 219 | "If you missed the link right below the title, I'm providing the resource here again along with the course website.\n", 220 | "\n", 221 | "- [Stanford CS234](https://www.youtube.com/watch?v=FgzM3zpZ55o)\n", 222 | "- [Course Website](http://web.stanford.edu/class/cs234/index.html)\n", 223 | "\n", 224 | "This is a series of 15 lectures provided by Stanford.\n" 225 | ] 226 | }, 227 | { 228 | "cell_type": "code", 229 | "execution_count": null, 230 | "id": "f5c15c5c", 231 | "metadata": {}, 232 | "outputs": [], 233 | "source": [] 234 | } 235 | ], 236 | "metadata": { 237 | "kernelspec": { 238 | "display_name": "Python 3", 239 | "language": "python", 240 | "name": "python3" 241 | }, 242 | "language_info": { 243 | "codemirror_mode": { 244 | "name": "ipython", 245 | "version": 3 246 | }, 247 | "file_extension": ".py", 248 | "mimetype": "text/x-python", 249 | "name": "python", 250 | "nbconvert_exporter": "python", 251 | "pygments_lexer": "ipython3", 252 | "version": "3.8.8" 253 | } 254 | }, 255 | "nbformat": 
4, 256 | "nbformat_minor": 5 257 | } 258 | -------------------------------------------------------------------------------- /Lecture 12 - Fast RL II.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "16df9564", 6 | "metadata": {}, 7 | "source": [ 8 | "# Lecture 12 - Fast RL II\n", 9 | "\n", 10 | "provided by [Stanford CS234](https://www.youtube.com/watch?v=FgzM3zpZ55o)\n", 11 | "\n", 12 | "---" 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "id": "618c06cf", 18 | "metadata": {}, 19 | "source": [ 20 | "
\n", 21 | "Table of Contents:
\n", 22 | " \n", 23 | "\n", 31 | "
" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "id": "a8837110", 37 | "metadata": {}, 38 | "source": [ 39 | "# 1. Introduction" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "id": "679241e8", 45 | "metadata": {}, 46 | "source": [ 47 | "We have been covering algorithms that fall under the concept of __optimism under uncertainty__. \n", 48 | "\n", 49 | "We have looked at the following approaches:\n", 50 | "* __Greedy__: linear total regret\n", 51 | "* __Constant $\\epsilon$-greedy__: linear total regret\n", 52 | "* __Decaying $\\epsilon$-greedy__: sublinear regret\n", 53 | "* __UCB__: sublinear regret" 54 | ] 55 | }, 56 | { 57 | "cell_type": "markdown", 58 | "id": "80d9035e", 59 | "metadata": {}, 60 | "source": [ 61 | "# 2. Bayesian Bandits" 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "id": "42491d95", 67 | "metadata": {}, 68 | "source": [ 69 | "Previously, in UCB, we made no assumptions about the unknown reward distribution $R$ except for the bounds on the rewards.\n", 70 | "\n", 71 | "Another approach is called __Bayesian bandits__ and exploits prior knowledge of rewards $p[R]$. It computes a posterior distribution of rewards $p[R~|~h_{t}]$ based on a history of action-reward pairs.\n", 72 | "\n", 73 | "It leverages Bayes' rule:\n", 74 | "\n", 75 | "$$\n", 76 | "p(\\phi_{i}~|~r_{i1}) = \\frac{p(r_{i1}~|~\\phi_{i})p(\\phi_{i})}{p(r_{i1})} \\hspace{1em} (Eq.~1)\\\\\n", 77 | "$$\n", 78 | "\n", 79 | "If the posterior $p(\\phi_{i}~|~r_{i1})$ and the prior $p(\\phi_{i})$ belong to the same parametric family, then we call the prior $p(\\phi_{i})$ and the model $p(r_{i1}~|~\\phi_{i})$ __conjugate__. Why is this useful? 
It means we can do our posterior updating analytically.\n", 80 | "\n", 81 | "Framework:\n", 82 | "* __frequentist regret__ : (the framework we have been using before) assumes a true unknown set of parameters\n", 83 | "\n", 84 | "$$\n", 85 | "Regret(\\mathcal{A}, T; \\theta) = \\sum_{t = 1}^{T} \\mathbb{E}[Q(a^{*}) - Q(a_{t})] \\hspace{1em} (Eq.~2)\\\\\n", 86 | "$$\n", 87 | "\n", 88 | "* __Bayesian regret__ : assumes there's a prior over parameters\n", 89 | "\n", 90 | "$$\n", 91 | "BayesRegret(\\mathcal{A}, T; \\theta) = \\mathbb{E}_{\\theta \\sim p_{\\theta}} [\\sum_{t = 1}^{T} \\mathbb{E}[Q(a^{*}) - Q(a_{t})~|~\\theta]] \\hspace{1em} (Eq.~3)\\\\\n", 92 | "$$\n", 93 | "\n", 94 | "We tackle this framework with __probability matching__." 95 | ] 96 | }, 97 | { 98 | "cell_type": "markdown", 99 | "id": "26489369", 100 | "metadata": {}, 101 | "source": [ 102 | "# 3. Probability Matching" 103 | ] 104 | }, 105 | { 106 | "cell_type": "markdown", 107 | "id": "d9c73893", 108 | "metadata": {}, 109 | "source": [ 110 | "We assume we have a parametric distribution over rewards for each arm. \n", 111 | "\n", 112 | "__Probability Matching__ selects each action according to the probability that it is the optimal action, given the history.\n", 113 | "\n", 114 | "$$\n", 115 | "\\pi(a~|~h_{t}) = \\mathbb{P}[Q(a) > Q(a'), \\forall a' \\ne a ~|~ h_{t}] \\hspace{1em} (Eq.~4)\\\\\n", 116 | "$$\n", 117 | "\n", 118 | "Uncertain actions have a higher probability of being the max, so probability matching is optimistic under uncertainty." 119 | ] 120 | }, 121 | { 122 | "cell_type": "markdown", 123 | "id": "a45caa0c", 124 | "metadata": {}, 125 | "source": [ 126 | "Initialize prior over each arm $a$, $p(R_{a})$
\n", 127 | "loop
\n", 128 | "$\\quad$ For each arm $a$ _sample_ a reward distribution $R_{a}$ from posterior
\n", 129 | "$\\quad$ Compute action-value function $Q(a) = \\mathbb{E}[R_{a}]$
\n", 130 | "$\\quad$ $a_{t} = \\underset{a \\in \\mathcal{A}}{argmax}Q(a)$
\n", 131 | "$\\quad$ Observe reward $r$
\n", 132 | "$\\quad$ Update posterior $p(R_{a}~|~r)$ using Bayes' rule
\n", 133 | "\n", 134 | "_Algorithm 1. Thompson Sampling._" 135 | ] 136 | }, 137 | { 138 | "cell_type": "markdown", 139 | "id": "4a16e1bb", 140 | "metadata": {}, 141 | "source": [ 142 | "I found this resource to be really helpful for understanding Thompson sampling: https://www.youtube.com/watch?v=Zgwfw3bzSmQ.\n", 143 | "\n", 144 | "_Thompson sampling has the same regret bounds as UCB._" 145 | ] 146 | }, 147 | { 148 | "cell_type": "markdown", 149 | "id": "02c7c09a", 150 | "metadata": {}, 151 | "source": [ 152 | "# 4. Framework: Probably Approximately Correct" 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "id": "545ffd8d", 158 | "metadata": {}, 159 | "source": [ 160 | " Because we evaluate based on total regret, we don't know if regret is caused by a lot of little mistakes or a few large ones.\n", 161 | " \n", 162 | " We can tackle this problem with the __Probably Approximately Correct (PAC)__ framework.\n", 163 | " \n", 164 | " $$\n", 165 | " Q(a) \\ge Q(a^{*}) - \\epsilon \\hspace{1em} (Eq.~5)\\\\\n", 166 | " $$\n", 167 | " \n", 168 | " A PAC algorithm chooses an action that is $\\epsilon$-optimal (Eq. 5) with probability at least $1 - \\delta$ on all but a polynomial number of time steps. The algorithms themselves operate much like before (optimism or Thompson sampling); what changes is the criterion we use to evaluate them.\n", 169 | " \n", 170 | "From what I'm understanding, this framework can be applied to optimism under uncertainty and probability matching/Thompson sampling." 171 | ] 172 | }, 173 | { 174 | "cell_type": "markdown", 175 | "id": "e9ae85ac", 176 | "metadata": {}, 177 | "source": [ 178 | "# 5. Fast RLs in MDPs" 179 | ] 180 | }, 181 | { 182 | "cell_type": "markdown", 183 | "id": "8bad3cc8", 184 | "metadata": {}, 185 | "source": [ 186 | "For the MDP setting (we've been covering the multi-armed bandit setting), we can use the same frameworks. This section focuses on the PAC framework.\n", 187 | "\n", 188 | "Not too sure, but from what I understand, I would think UCB and Thompson sampling are only applicable to the multi-armed bandit setting. 
In the (tabular) MDP setting, they carry the same ideas but aren't exactly the same.\n", 189 | "\n", 190 | "The lecture begins with __optimistic initialization__. In the MDP setting, we can use any of the model-free algorithms (e.g. SARSA, MC, Q-learning) we've learned to estimate $Q(s, a)$. \n", 191 | "\n", 192 | "We can initialize our q-values optimistically, for example by setting them to $\\frac{r_{max}}{1 - \\gamma}$ or by initializing $V(s) = \\frac{r_{max}}{(1 - \\gamma) \\prod_{i=1}^{T} \\alpha_{i}}$. Here $r_{max}$ is the maximum reward over all state-action pairs. $\\gamma$ is the discount factor. $\\alpha_{i}$ is the learning rate at the $i$-th timestep, which goes up to $T$, the number of samples needed to learn near-optimal q-values.\n", 193 | "\n", 194 | "Optimistic initialization is one way to make RL faster in the MDP setting.\n", 195 | "Other approaches include:\n", 196 | "* be very optimistic until confident that the empirical estimates are close to the true parameters\n", 197 | "* be optimistic given the information you have\n", 198 | " * compute confidence sets on dynamics/reward models\n", 199 | " * add reward bonuses" 200 | ] 201 | }, 202 | { 203 | "cell_type": "markdown", 204 | "id": "7eda8aa3", 205 | "metadata": {}, 206 | "source": [ 207 | "Given $\\epsilon, \\delta, m$
\n", 208 | "$\\beta = \\frac{1}{1 - \\gamma} \\sqrt{0.5 ln(2|S||A|\\frac{m}{\\delta})}$
\n", 209 | "$n_{sas}(s, a, s') = 0; s \\in S, a \\in A, s' \\in S$
\n", 210 | "$rc(s, a) = 0, n_{sa}(s, a) = 0, \\tilde{Q}(s, a) = \\frac{1}{1 - \\gamma} \\forall s \\in S, a \\in A$
\n", 211 | "$t = 0; s_{t} = s_{init}$
\n", 212 | "loop
\n", 213 | "$\\quad$ $a_{t} = \\underset{a \\in A}{argmax} Q(s_{t}, a)$
\n", 214 | "$\\quad$ Observe reward $r_{t}$ and state $s_{t + 1}$
\n", 215 | "$\\quad$ $n_{sa}(s_{t}, a_{t}) += 1$
\n", 216 | "$\\quad$ $n_{sas}(s_{t}, a_{t}, s_{t + 1}) += 1$
\n", 217 | "$\\quad$ $rc(s_{t}, a_{t}) = \\frac{rc(s_{t}, a_{t})n_{sa}(s_{t}, a_{t}) + r_{t}}{n_{sa}(s_{t}, a_{t}) + 1}$
\n", 218 | "$\\quad$ $\\hat{R}(s, a) = \\frac{rc(s_{t}, a_{t})}{n_{sa}(s_{t}, a_{t})}$
\n", 219 | "$\\quad$ $\\hat{T}(s'~|~s, a) = \\frac{n_{sas}(s_{t}, a_{t}, s')}{n_{sa}(s_{t}, a_{t})} \\forall s' \\in S$
\n", 220 | "$\\quad$ while not converged do
\n", 221 | "$\\quad\\quad$ $\\hat{Q}(s, a) = \\hat{R}(s, a) + \\gamma \\sum_{s'}\\hat{T}(s'~|~s, a)\\underset{a'}{max}\\tilde{Q}(s', a') + \\underbrace{\\frac{\\beta}{\\sqrt{n_{sa}(s, a)}}}_{reward ~ bonus} \\forall s \\in S, a \\in A$\n", 222 | "\n", 223 | "_Algorithm 2. Model-Based Interval Estimation with Exploration Bonus (MBIE-EB)._" 224 | ] 225 | }, 226 | { 227 | "cell_type": "markdown", 228 | "id": "95799536", 229 | "metadata": {}, 230 | "source": [ 231 | "Algorithm 2 (MBIE-EB) uses value iteration for model-based policy control (but it estimates the reward and dynamics models). It also implements an exploration bonus (or reward bonus)." 232 | ] 233 | }, 234 | { 235 | "cell_type": "markdown", 236 | "id": "0f657f6b", 237 | "metadata": {}, 238 | "source": [ 239 | "# 6. Resource" 240 | ] 241 | }, 242 | { 243 | "cell_type": "markdown", 244 | "id": "32e04a27", 245 | "metadata": {}, 246 | "source": [ 247 | "If you missed the link right below the title, I'm providing the resource here again along with the course website.\n", 248 | "\n", 249 | "- [Stanford CS234](https://www.youtube.com/watch?v=FgzM3zpZ55o)\n", 250 | "- [Course Website](http://web.stanford.edu/class/cs234/index.html)\n", 251 | "\n", 252 | "This is a series of 15 lectures provided by Stanford.\n" 253 | ] 254 | }, 255 | { 256 | "cell_type": "code", 257 | "execution_count": null, 258 | "id": "f5c15c5c", 259 | "metadata": {}, 260 | "outputs": [], 261 | "source": [] 262 | } 263 | ], 264 | "metadata": { 265 | "kernelspec": { 266 | "display_name": "Python 3", 267 | "language": "python", 268 | "name": "python3" 269 | }, 270 | "language_info": { 271 | "codemirror_mode": { 272 | "name": "ipython", 273 | "version": 3 274 | }, 275 | "file_extension": ".py", 276 | "mimetype": "text/x-python", 277 | "name": "python", 278 | "nbconvert_exporter": "python", 279 | "pygments_lexer": "ipython3", 280 | "version": "3.8.8" 281 | } 282 | }, 283 | "nbformat": 4, 284 | "nbformat_minor": 5 285 | } 286 | 
-------------------------------------------------------------------------------- /Lecture 13 - Fast RL III.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "16df9564", 6 | "metadata": {}, 7 | "source": [ 8 | "# Lecture 13 - Fast RL III\n", 9 | "\n", 10 | "provided by [Stanford CS234](https://www.youtube.com/watch?v=FgzM3zpZ55o)\n", 11 | "\n", 12 | "---" 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "id": "618c06cf", 18 | "metadata": {}, 19 | "source": [ 20 | "
\n", 21 | "Table of Contents:
\n", 22 | " \n", 23 | "\n", 30 | "
" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "id": "a8837110", 36 | "metadata": {}, 37 | "source": [ 38 | "# 1. Introduction" 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "id": "24456ebc", 44 | "metadata": {}, 45 | "source": [ 46 | "We defined (PAC) as a framework last lecture. But, we can also classify algorithms as PAC." 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "id": "5a88cd62", 52 | "metadata": {}, 53 | "source": [ 54 | "# 2. PAC Criteria" 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "id": "3f1ff285", 60 | "metadata": {}, 61 | "source": [ 62 | "For an algorithm to be PAC, it must match these 3 criteria:\n", 63 | "* optimism\n", 64 | " * must have optimism (assume unexplored states yield good reward)\n", 65 | "* accuracy\n", 66 | " * must balance between being optimal and being optimistic\n", 67 | " * $V^{\\pi_{t}}(s_{t}) - V^{\\pi_{t}}_{\\mu}(s_{t}) \\le \\epsilon$ where $V^{\\pi_{t}}_{\\mu}(s_{t})$ is a hybrid value function between the optimal and optimistic policy\n", 68 | "* bounded learning complexity (bounded by $\\epsilon, \\delta$)\n", 69 | " * total \\# of Q updates is updated\n", 70 | " * \\# of times visited unknown state-action pair\n", 71 | " \n", 72 | "I won't cover the following section in the lecture: a proof of how MBIE-EB is PAC." 73 | ] 74 | }, 75 | { 76 | "cell_type": "markdown", 77 | "id": "ec8c1afb", 78 | "metadata": {}, 79 | "source": [ 80 | "# 3. Bayesian Model-Based RL" 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "id": "5eab8947", 86 | "metadata": {}, 87 | "source": [ 88 | "We know model-based RL is where the agent has a _model_ of the real world (transition model and reward model). For Bayesian Model-Based RL, we maintain a posterior distribution over MDP models and estimate transition and rewards. Our posterior guides our exploration." 
89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "id": "45b52b59", 94 | "metadata": {}, 95 | "source": [ 96 | "Initialize prior over the dynamics and reward models for each $(s, a)$, $p(R_{sa}), p(T(s'~|~s, a))$
\n", 97 | "Initialize state $s_{0}$.
\n", 98 | "loop
\n", 99 | "$\\quad$ Sample an MDP $M$: for each $(s, a)$ pair, sample a dynamics model $T(s'~|~s, a)$ and a reward model $R(s, a)$ from the current posterior
\n", 100 | "$\\quad$ Compute $Q^{*}_{M}$, optimal value for MDP $M$
\n", 101 | "$\\quad$ $a_{t} = \\underset{a \\in A}{argmax} Q^{*}_{M}(s_{t}, a)$
\n", 102 | "$\\quad$ Observe reward $r_{t}$ and next state $s_{t + 1}$
\n", 103 | "$\\quad$ Update posterior $p(R_{s_{t} a_{t}}~|~r_{t})$, $p(T(s'~|~s_{t}, a_{t})~|~s_{t + 1})$
\n", 104 | "$\\quad$ t += 1
\n", 105 | "
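The loop above can be sketched in Python for a small tabular MDP. This is only a minimal sketch under assumptions of my own, none of which come from the lecture: rewards are Bernoulli (0 or 1) with a Beta(1, 1) prior, transitions have a Dirichlet(1, ..., 1) prior, `step` is a hypothetical environment callback returning `(reward, next_state)`, and the optimal Q of the sampled MDP is approximated with a fixed number of value-iteration sweeps.

```python
import random

def sample_dirichlet(counts, rng):
    # Gamma draws normalised to sum to one give a Dirichlet sample.
    draws = [rng.gammavariate(c, 1.0) for c in counts]
    total = sum(draws)
    return [d / total for d in draws]

def value_iteration(T, R, gamma, sweeps=100):
    # Approximate Q* of a sampled MDP with a fixed number of sweeps.
    n_s, n_a = len(R), len(R[0])
    Q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(sweeps):
        V = [max(row) for row in Q]
        Q = [[R[s][a] + gamma * sum(p * v for p, v in zip(T[s][a], V))
              for a in range(n_a)] for s in range(n_s)]
    return Q

def thompson_sampling_mdp(step, n_states, n_actions, n_steps, gamma=0.9, seed=0):
    rng = random.Random(seed)
    # Assumed priors: Dirichlet(1, ..., 1) over next states, Beta(1, 1) over rewards.
    trans_counts = [[[1.0] * n_states for _ in range(n_actions)]
                    for _ in range(n_states)]
    reward_ab = [[[1.0, 1.0] for _ in range(n_actions)] for _ in range(n_states)]
    s = 0
    for _ in range(n_steps):
        # Sample one MDP from the posterior.
        T = [[sample_dirichlet(trans_counts[x][a], rng) for a in range(n_actions)]
             for x in range(n_states)]
        R = [[rng.betavariate(*reward_ab[x][a]) for a in range(n_actions)]
             for x in range(n_states)]
        # Act greedily with respect to the sampled MDP.
        Q = value_iteration(T, R, gamma)
        a = max(range(n_actions), key=lambda b: Q[s][b])
        r, s_next = step(s, a, rng)
        # Posterior update is just a count increment for these conjugate priors.
        trans_counts[s][a][s_next] += 1.0
        reward_ab[s][a][0 if r == 1 else 1] += 1.0
        s = s_next
    return trans_counts, reward_ab
```

Resampling a full MDP at every step mirrors the pseudocode; in practice one might resample less often to save computation.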
\n", 106 | "\n", 107 | "_Algorithm 1. Thompson Sampling for MDPs._" 108 | ] 109 | }, 110 | { 111 | "cell_type": "markdown", 112 | "id": "62bc8c9e", 113 | "metadata": {}, 114 | "source": [ 115 | "# 4. Generalization and Exploration" 116 | ] 117 | }, 118 | { 119 | "cell_type": "markdown", 120 | "id": "11913c8a", 121 | "metadata": {}, 122 | "source": [ 123 | "How do we do everything we just did for large state and action spaces? In large, how do we generalize or scale up?\n", 124 | "\n", 125 | "With VFA (value function approximation) for model-free settings, we can add a reward bonus term for updating our weights. \n", 126 | "\n", 127 | "Some people in class suggested embeddings or some type of way to model how similar states or actions are (maybe clustering them or reducing this high dimensionality?)." 128 | ] 129 | }, 130 | { 131 | "cell_type": "markdown", 132 | "id": "0f657f6b", 133 | "metadata": {}, 134 | "source": [ 135 | "# 5. Resource" 136 | ] 137 | }, 138 | { 139 | "cell_type": "markdown", 140 | "id": "32e04a27", 141 | "metadata": {}, 142 | "source": [ 143 | "If you missed the link right below the title, I'm providing the resource here again along with the course website.\n", 144 | "\n", 145 | "- [Stanford CS234](https://www.youtube.com/watch?v=FgzM3zpZ55o)\n", 146 | "- [Course Website](http://web.stanford.edu/class/cs234/index.html)\n", 147 | "\n", 148 | "This is a series of 15 lectures provided by Stanford.\n" 149 | ] 150 | }, 151 | { 152 | "cell_type": "code", 153 | "execution_count": null, 154 | "id": "f5c15c5c", 155 | "metadata": {}, 156 | "outputs": [], 157 | "source": [] 158 | } 159 | ], 160 | "metadata": { 161 | "kernelspec": { 162 | "display_name": "Python 3", 163 | "language": "python", 164 | "name": "python3" 165 | }, 166 | "language_info": { 167 | "codemirror_mode": { 168 | "name": "ipython", 169 | "version": 3 170 | }, 171 | "file_extension": ".py", 172 | "mimetype": "text/x-python", 173 | "name": "python", 174 | "nbconvert_exporter": "python", 
175 | "pygments_lexer": "ipython3", 176 | "version": "3.8.8" 177 | } 178 | }, 179 | "nbformat": 4, 180 | "nbformat_minor": 5 181 | } 182 | -------------------------------------------------------------------------------- /Lecture 3 - Model-Free Policy Evaluation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Model-Free Policy Evaluation\n", 8 | "\n", 9 | "provided by [Stanford CS234](https://www.youtube.com/watch?v=FgzM3zpZ55o)\n", 10 | "\n", 11 | "---" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "
\n", 19 | "Table of Contents:
\n", 20 | " \n", 21 | "\n", 28 | "
" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "# 1. Introduction" 36 | ] 37 | }, 38 | { 39 | "cell_type": "markdown", 40 | "metadata": {}, 41 | "source": [ 42 | "Last time we covered policy evaluation & control with a true model of the world (environment). This lecture will cover policy evaluation (no control!) without known dynamics & reward models. Next time, we will cover control for this case." 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "metadata": {}, 48 | "source": [ 49 | "Here are our definitions:\n", 50 | "\n", 51 | "$$\n", 52 | "G_{t} = r_{t} + \\gamma r_{t + 1} + ... \\\\\n", 53 | "V^{\\pi}(s) = \\mathbb{E}_{\\pi}[G_{t}~|~s_{t} = s] = \\mathbb{E}_{\\pi}[r_{t} + \\gamma r_{t + 1} + ...~|~s_{t} = s] \\\\\n", 54 | "Q^{\\pi}(s, a) = \\mathbb{E}_{\\pi}[G_{t}~|~s_{t} = s, a_{t} = a] = \\mathbb{E}_{\\pi}[r_{t} + \\gamma r_{t + 1} + ...~|~s_{t} = s, a_{t} = a]\n", 55 | "$$\n", 56 | "\n", 57 | "Remember the dynamic programming approach to evaluating a policy for a finite MDP." 58 | ] 59 | }, 60 | { 61 | "attachments": { 62 | "dp_mdp.PNG": { 63 | "image/png": "" 64 | } 65 | }, 66 | "cell_type": "markdown", 67 | "metadata": {}, 68 | "source": [ 69 | "![dp_mdp.PNG](attachment:dp_mdp.PNG)
\n", 70 | "_Figure 1. Dynamic Programming for evaluation._" 71 | ] 72 | }, 73 | { 74 | "cell_type": "markdown", 75 | "metadata": {}, 76 | "source": [ 77 | "We can think of $V^{\\pi}_{k}(s)$ as an exact value of the k-horizon value of state $s$ under policy $\\pi$. And we can say it's an estimate of the infinite horizon for state $s$ under policy $\\pi$.\n", 78 | "\n", 79 | "$$\n", 80 | "V^{\\pi}(s) = \\mathbb{E}_{\\pi}[G_{t}~|~s_{t} = s] \\approx \\mathbb{E}_{\\pi}[r_{t} + \\gamma V_{k - 1}~|~s_{t} = s]\n", 81 | "$$\n", 82 | "\n", 83 | "This is formalized in the above equation." 84 | ] 85 | }, 86 | { 87 | "cell_type": "markdown", 88 | "metadata": {}, 89 | "source": [ 90 | "
\"drawing\"/
\n", 91 | "_Figure 2. DP tree for evaluation._" 92 | ] 93 | }, 94 | { 95 | "cell_type": "markdown", 96 | "metadata": {}, 97 | "source": [ 98 | "We can think of the DP approach for evaluation as a tree where a state is followed by an action which can lead to a variable number of other states (which also have their corresponding actions). The point here is that we are __bootstrapping__. This means we are replacing the expected return by its estimate. " 99 | ] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "metadata": {}, 104 | "source": [ 105 | "# 2. Monte Carlo (MC) Policy Evaluation" 106 | ] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "metadata": {}, 111 | "source": [ 112 | "Okay, notation is about to get a little nuanced. But this isn't the end of the road! Most notations can be omitted. They are there to highlight something. With that, let's take a look at the value function under a certain policy $\\pi$. \n", 113 | "\n", 114 | "$$\n", 115 | "V^{\\pi}(s) = \\mathbb{E}_{T \\sim \\pi}[G_{t}~|~s_{t} = s] \\hspace{1em} (Eq.~1)\\\\\n", 116 | "$$\n", 117 | "\n", 118 | "It is the same as before, but now we specify that we sample a trajectory following policy $\\pi$.\n", 119 | "\n", 120 | "> __Monte Carlo Policy Evaluation__ : a model-free (model here means the ground truth dynamics model of the environment) policy evaluation method\n", 121 | "\n", 122 | "Requirements for MC Policy Evaluation:\n", 123 | "\n", 124 | "* trajectories/episodes need to be finite (need to be episodic)\n", 125 | "* no bootstrapping (just sampling)\n", 126 | "* does not assume state is markov\n", 127 | "\n", 128 | "Let's take a look at 2 different monte carlo methods for model-free policy evaluation.\n", 129 | "\n", 130 | "> __First-Visit MC__ : a version of monte carlo policy evaluation that updates the value function with the first time $t$ that state $s$ is visited in episode $i$\n", 131 | "\n", 132 | "> __Every-Visit MC__ : a version of monte carlo policy evaluation that 
updates the value function with every time $t$ that state $s$ is visited in episode $i$\n", 133 | "\n", 134 | "Below is the algorithm for First-Visit:\n", 135 | "\n", 136 | "Initialize $N(s) = 0, G(s) = 0 ~~ \\forall s \\in S$
\n", 137 | "Loop
\n", 138 | "$\\quad$ Sample episode $i = s_{i, 1}, a_{i, 1}, r_{i, 1}, s_{i, 2}, a_{i, 2}, r_{i, 2}, ..., s_{i, T_{i}}$\n", 139 | "
\n", 140 | "$\\quad$ Define $G_{i, t} = r_{i, t} + \\gamma r_{i, t + 1} + \\gamma^{2} r_{i, t + 2} + ... + \\gamma^{T_{i} - t} r_{i, T_{i}}$ for the $i$th episode at time step $t$
\n", 141 | "$\\quad$ For each state $s$ visited in episode $i$
\n", 142 | "$\\quad\\quad$ For __first__ time $t$ that state $s$ is visited in episode $i$
\n", 143 | "$\\quad\\quad\\quad$ Increment counter of total first visits: $N(s) = N(s) + 1$
\n", 144 | "$\\quad\\quad\\quad$ Increment total return $G(s) = G(s) + G_{i, t}$
\n", 145 | "$\\quad\\quad\\quad$ Update estimate $V^{\\pi}(s) = G(s)/N(s)$
\n", 146 | "
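As a rough illustration of the loop above, here is a minimal Python sketch. The function name `mc_policy_evaluation`, the episode format (a list of `(state, reward)` pairs in time order) and the `first_visit` flag are my own assumptions, not from the lecture. Setting `first_visit=False` counts every visit instead of only the first one.

```python
from collections import defaultdict

def mc_policy_evaluation(episodes, gamma=1.0, first_visit=True):
    # episodes: list of episodes, each a list of (state, reward) pairs.
    n_visits = defaultdict(int)        # N(s)
    total_return = defaultdict(float)  # G(s)
    for episode in episodes:
        # Compute the return from every time step, working backwards:
        # G_t = r_t + gamma * G_{t+1}.
        returns = [0.0] * len(episode)
        g = 0.0
        for t in range(len(episode) - 1, -1, -1):
            g = episode[t][1] + gamma * g
            returns[t] = g
        seen = set()
        for t, (state, _) in enumerate(episode):
            if first_visit and state in seen:
                continue  # skip repeat visits in first-visit mode
            seen.add(state)
            n_visits[state] += 1
            total_return[state] += returns[t]
    # V(s) = G(s) / N(s) for every state that was visited.
    return {s: total_return[s] / n_visits[s] for s in n_visits}
```

The estimate is only computed once at the end here, which gives the same final values as updating it inside the loop.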
\n", 147 | "_Algorithm 1. First-Visit Monte Carlo_" 148 | ] 149 | }, 150 | { 151 | "attachments": {}, 152 | "cell_type": "markdown", 153 | "metadata": {}, 154 | "source": [ 155 | "
\"drawing\"/
\n", 156 | "\n", 157 | "_Figure 1. Bias and variance and MSE._" 158 | ] 159 | }, 160 | { 161 | "cell_type": "markdown", 162 | "metadata": {}, 163 | "source": [ 164 | "Simply put, the bias is the difference between the expected value of the statistic $\\hat{\\theta}$ and the true statistic $\\theta$. The variance is the expected squared difference between the $\\hat{\\theta}$ and the expected $\\hat{\\theta}$. The MSE is simply a combination of these 2.\n", 165 | "\n", 166 | "* $V^{\\pi}$ is an _unbiased_ estimator of true $\\mathbb{E}_{\\pi}[G_{t}~|~s_{t}=s]$.
\n", 167 | "* By the law of large numbers, as the count for First-Visit MC approaches infinity, the value estimates would also approach the expected value estimates under that same policy: $N(s) \\rightarrow \\infty, V^{\\pi}(s) \\rightarrow \\mathbb{E}_{\\pi}[G_{t}~|~s_{t} = s]$." 168 | ] 169 | }, 170 | { 171 | "cell_type": "markdown", 172 | "metadata": {}, 173 | "source": [ 174 | "Initialize $N(s) = 0, G(s) = 0 ~~ \\forall s \\in S$
\n", 175 | "Loop
\n", 176 | "$\\quad$ Sample episode $i = s_{i, 1}, a_{i, 1}, r_{i, 1}, s_{i, 2}, a_{i, 2}, r_{i, 2}, ..., s_{i, T_{i}}$\n", 177 | "
\n", 178 | "$\\quad$ Define $G_{i, t} = r_{i, t} + \\gamma r_{i, t + 1} + \\gamma^{2} r_{i, t + 2} + ... + \\gamma^{T_{i} - t} r_{i, T_{i}}$ for the $i$th episode at time step $t$
\n", 179 | "$\\quad$ For each state $s$ visited in episode $i$
\n", 180 | "$\\quad\\quad$ For __every__ time $t$ that state $s$ is visited in episode $i$
\n", 181 | "$\\quad\\quad\\quad$ Increment counter of total visits: $N(s) = N(s) + 1$
\n", 182 | "$\\quad\\quad\\quad$ Increment total return $G(s) = G(s) + G_{i, t}$
\n", 183 | "$\\quad\\quad\\quad$ Update estimate $V^{\\pi}(s) = G(s)/N(s)$
\n", 184 | "
\n", 185 | "_Algorithm 2. Every-Visit Monte Carlo_" 186 | ] 187 | }, 188 | { 189 | "cell_type": "markdown", 190 | "metadata": {}, 191 | "source": [ 192 | "This is exactly the same as First-Visit MC, except that the update is applied at every visit to a state rather than only the first. In this case:\n", 193 | "\n", 194 | "* $V^{\\pi}$ for every-visit MC is _biased_ because the returns from repeated visits to a state within one episode are correlated, so a state that recurs often in a single episode is over-weighted.\n", 195 | "* It often has better MSE than first-visit and is a consistent estimator." 196 | ] 197 | }, 198 | { 199 | "cell_type": "markdown", 200 | "metadata": {}, 201 | "source": [ 202 | "> __Incremental MC__ : an approach that can be layered on top of the previous MC methods to incrementally update the value estimate function\n", 203 | "\n", 204 | "After each episode $i$
\n", 205 | "$\\quad$ Define $G_{i, t}$
\n", 206 | "$\\quad$ For each state $s$ visited in episode $i$
\n", 207 | "$\\quad\\quad$ For a time $t$ that state $s$ is visited in episode $i$
\n", 208 | "$\\quad\\quad\\quad$ Increment counter of total visits: $N(s) = N(s) + 1$
\n", 209 | "$\\quad\\quad\\quad$ Update estimate $V^{\\pi}(s) = V^{\\pi}(s) + \\alpha(G_{i, t} - V^{\\pi}(s))$
\n", 210 | "
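The incremental update above might look like this in Python. It is a sketch under my own assumptions: the name `incremental_mc_update`, the dict-based tables, and the convention that `alpha=None` means the running-mean step size $1/N(s)$ are all illustrative choices, not part of the lecture.

```python
def incremental_mc_update(value, counts, episode, gamma=1.0, alpha=None):
    # value: dict state -> current estimate V(s); counts: dict state -> N(s).
    # episode: list of (state, reward) pairs in time order.
    # alpha=None uses the running-mean step size 1 / N(s).
    g = 0.0
    returns = []
    for _, reward in reversed(episode):
        g = reward + gamma * g
        returns.append(g)
    returns.reverse()
    for (state, _), g_t in zip(episode, returns):
        counts[state] = counts.get(state, 0) + 1
        step = 1.0 / counts[state] if alpha is None else alpha
        v = value.get(state, 0.0)
        # Move the estimate a fraction of the way towards the sampled return.
        value[state] = v + step * (g_t - v)
    return value
```

Passing a constant `alpha` weights recent episodes more heavily than old ones, since older returns are discounted geometrically in the running estimate.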
\n", 211 | "_Algorithm 3. Incremental Monte Carlo._" 212 | ] 213 | }, 214 | { 215 | "cell_type": "markdown", 216 | "metadata": {}, 217 | "source": [ 218 | "* $\\alpha = \\frac{1}{N(s)}$: identical to every-visit MC\n", 219 | "* $\\alpha > \\frac{1}{N(s)}$: forgets older data, good for non-stationary domains (when the MDP is constantly changing)" 220 | ] 221 | }, 222 | { 223 | "cell_type": "markdown", 224 | "metadata": {}, 225 | "source": [ 226 | "The general limitations of MC methods:\n", 227 | "* they require episodes to terminate at some point\n", 228 | "* they generally have high variance\n", 229 | "\n", 230 | "MC methods:\n", 231 | "* don't bootstrap; instead, they sample\n", 232 | " * converge to the true value under some assumptions" 233 | ] 234 | }, 235 | { 236 | "cell_type": "markdown", 237 | "metadata": {}, 238 | "source": [ 239 | "# 3. Temporal Difference (TD) Learning" 240 | ] 241 | }, 242 | { 243 | "cell_type": "markdown", 244 | "metadata": {}, 245 | "source": [ 246 | "> __Temporal Difference (TD) Learning__ : combines MC methods (sampling) and dynamic programming methods (bootstrapping)\n", 247 | "\n", 248 | "* the TD family of methods is model-free (much like the MC methods)\n", 249 | "* they bootstrap and sample (so they do what dynamic programming and MC do)\n", 250 | "* can be used in both episodic and infinite-horizon settings (unlike MC which can be used only in episodic settings)\n", 251 | "* update after every (state, action, reward, next_state) tuple" 252 | ] 253 | }, 254 | { 255 | "cell_type": "markdown", 256 | "metadata": {}, 257 | "source": [ 258 | "Set $\\alpha$
\n", 259 | "Initialize $V^{\\pi}(s) = 0, \\forall s \\in S$
\n", 260 | "Loop
\n", 261 | "$\\quad$ Sample __tuple__ $(s_{t}, a_{t}, r_{t}, s_{t + 1})$
\n", 262 | "$\\quad$ $V^{\\pi}(s_{t}) = V^{\\pi}(s_{t}) + \\alpha([r_{t} + \\gamma V^{\\pi}(s_{t + 1})] - V^{\\pi}(s_{t}))$
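A minimal Python sketch of this update, under assumptions of my own: transitions arrive as `(state, reward, next_state, done)` tuples, and the `done` flag (not in the pseudocode) marks terminal transitions, whose next-state value is taken to be zero.

```python
def td0_evaluation(transitions, alpha=0.1, gamma=1.0):
    # transitions: iterable of (state, reward, next_state, done) tuples
    # collected while following the policy being evaluated.
    value = {}
    for state, reward, next_state, done in transitions:
        v_s = value.get(state, 0.0)
        # Bootstrap from the current estimate of the next state.
        v_next = 0.0 if done else value.get(next_state, 0.0)
        td_error = reward + gamma * v_next - v_s
        value[state] = v_s + alpha * td_error
    return value
```

Unlike the Monte Carlo sketches, nothing here waits for the episode to end: each tuple updates the table as soon as it is observed.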
\n", 263 | "\n", 264 | "_Algorithm 4. TD Learning TD(0)._" 265 | ] 266 | }, 267 | { 268 | "cell_type": "markdown", 269 | "metadata": {}, 270 | "source": [ 271 | "We call it TD(0) because it looks only one step ahead: we take the immediate sampled reward and then bootstrap from the current estimate of the next state's value. Notice how similar it is to the dynamic programming approach (bootstrapping) and the MC approach (sampling, with the same incremental update form). \n", 272 | "\n", 273 | "The __TD error__ is:\n", 274 | "\n", 275 | "$$\n", 276 | "\\delta_{t} = [r_{t} + \\gamma V^{\\pi}(s_{t + 1})] - V^{\\pi}(s_{t})\n", 277 | "$$" 278 | ] 279 | }, 280 | { 281 | "cell_type": "markdown", 282 | "metadata": {}, 283 | "source": [ 284 | "# 4. Comparing Approaches" 285 | ] 286 | }, 287 | { 288 | "attachments": {}, 289 | "cell_type": "markdown", 290 | "metadata": {}, 291 | "source": [ 292 | "
\"drawing\"/
\n", 293 | "\n", 294 | "_Figure 2. Comparing different approaches._" 295 | ] 296 | }, 297 | { 298 | "cell_type": "markdown", 299 | "metadata": {}, 300 | "source": [ 301 | "We pick different model-free policy evaluation algorithms based on:\n", 302 | "\n", 303 | "* bias/variance trade-offs\n", 304 | "* data efficiency\n", 305 | "* computational efficiency\n", 306 | "\n", 307 | "MC is:\n", 308 | "* unbiased\n", 309 | "* high variance\n", 310 | "* converges to true even with function approximation\n", 311 | "\n", 312 | "TD is:\n", 313 | "* moderate bias\n", 314 | "* lower variance\n", 315 | "* converges if tabular representation" 316 | ] 317 | }, 318 | { 319 | "attachments": {}, 320 | "cell_type": "markdown", 321 | "metadata": {}, 322 | "source": [ 323 | "
\"drawing\"/
\n", 324 | "\n", 325 | "_Figure 3. Data & Computational efficiency of model-free policy evaluation algorithms._" 326 | ] 327 | }, 328 | { 329 | "cell_type": "markdown", 330 | "metadata": {}, 331 | "source": [ 332 | "# 5. Resource" 333 | ] 334 | }, 335 | { 336 | "cell_type": "markdown", 337 | "metadata": {}, 338 | "source": [ 339 | "If you missed the link right below the title, I'm providing the resource here again along with the course website.\n", 340 | "\n", 341 | "- [Stanford CS234](https://www.youtube.com/watch?v=FgzM3zpZ55o)\n", 342 | "- [Course Website](http://web.stanford.edu/class/cs234/index.html)\n", 343 | "\n", 344 | "This is a series of 15 lectures provided by Stanford.\n" 345 | ] 346 | }, 347 | { 348 | "cell_type": "code", 349 | "execution_count": null, 350 | "metadata": {}, 351 | "outputs": [], 352 | "source": [] 353 | } 354 | ], 355 | "metadata": { 356 | "kernelspec": { 357 | "display_name": "Python 3", 358 | "language": "python", 359 | "name": "python3" 360 | }, 361 | "language_info": { 362 | "codemirror_mode": { 363 | "name": "ipython", 364 | "version": 3 365 | }, 366 | "file_extension": ".py", 367 | "mimetype": "text/x-python", 368 | "name": "python", 369 | "nbconvert_exporter": "python", 370 | "pygments_lexer": "ipython3", 371 | "version": "3.8.8" 372 | } 373 | }, 374 | "nbformat": 4, 375 | "nbformat_minor": 4 376 | } 377 | -------------------------------------------------------------------------------- /Lecture 7 - Imitation Learning.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "16df9564", 6 | "metadata": {}, 7 | "source": [ 8 | "# Lecture 7 - Imitation Learning\n", 9 | "\n", 10 | "provided by [Stanford CS234](https://www.youtube.com/watch?v=FgzM3zpZ55o)\n", 11 | "\n", 12 | "---" 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "id": "618c06cf", 18 | "metadata": {}, 19 | "source": [ 20 | "
\n", 21 | "Table of Contents:
\n", 22 | " \n", 23 | "\n", 31 | "
" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "id": "a8837110", 37 | "metadata": {}, 38 | "source": [ 39 | "# 1. Introduction" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "id": "0677a008", 45 | "metadata": {}, 46 | "source": [ 47 | "Some environments may have sparse rewards and even a DQN wouldn't be able to succeed in the environment. An example is Montezuma's Revenge which is a game where a character navigates a 2D world in order find a key and open a door.\n", 48 | "\n", 49 | "RL is good for simple and cheap data and parallelization is easy. However, it wouldn't be practical for cases where executing actions is slow, expensive to fail, or safety is priority. \n", 50 | "\n", 51 | "Problems with RL:\n", 52 | "* needs lots of data\n", 53 | "* needs lot of time\n", 54 | "* sparse rewards\n", 55 | "* hard to learn \n", 56 | "* execution of actions is slow\n", 57 | "* very expensive to fail\n", 58 | "* not safe \n", 59 | "\n", 60 | "__Imitation Learning__:\n", 61 | "* learn from imitating behavior\n", 62 | "* rewards are dense in time to closely guide the agent" 63 | ] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "id": "3296d6f9", 68 | "metadata": {}, 69 | "source": [ 70 | "# 2. Problem Setup" 71 | ] 72 | }, 73 | { 74 | "cell_type": "markdown", 75 | "id": "796fc797", 76 | "metadata": {}, 77 | "source": [ 78 | "Input:\n", 79 | "* state space, action space\n", 80 | "* Transition model $P(s' ~|~ s, a)$\n", 81 | "* No reward function $R$\n", 82 | "* Set of one or more teacher's demonstrations $(s_{0}, a_{0}, s_{1}, a_{1}, s_{2})$\n", 83 | "\n", 84 | "__Behavioral Cloning__:\n", 85 | "* Can we directly learn the teacher's policy using supervised learning?\n", 86 | "\n", 87 | "__Inverse RL__:\n", 88 | "* Can we recover $R$?\n", 89 | "\n", 90 | "__Apprenticeship Learning via Inverse RL__:\n", 91 | "* Can we use the R we find in Inverse RL to generate a good policy?" 
92 | ] 93 | }, 94 | { 95 | "cell_type": "markdown", 96 | "id": "2a1b5fd9", 97 | "metadata": {}, 98 | "source": [ 99 | "# 3. Behavioral Cloning" 100 | ] 101 | }, 102 | { 103 | "cell_type": "markdown", 104 | "id": "96d0cab4", 105 | "metadata": {}, 106 | "source": [ 107 | "Behavioral Cloning:\n", 108 | "* the second your model deviates from the teacher behavior, it will have no idea what to do\n", 109 | "* fine so long as your data covers all possible states encountered" 110 | ] 111 | }, 112 | { 113 | "cell_type": "markdown", 114 | "id": "04d185e6", 115 | "metadata": {}, 116 | "source": [ 117 | "Initialize $D \\leftarrow \\emptyset$
\n", 118 | "Initialize $\\hat{\\pi}_{1}$ to any policy in $\\Pi$
\n", 119 | "for $i = 1$ to $N$ do
\n", 120 | "$\\quad$ Let $\\pi_{i} = \\beta_{i}\\pi^{*} + (1 - \\beta_{i})\\hat{\\pi}_{i}$
\n", 121 | "$\\quad$ Sample $T$-step trajectories using $\\pi_{i}$
\n", 122 | "$\\quad$ Get dataset $D_{i} = \\{(s, \\pi^{*}(s))\\}$ of visited states by $\\pi_{i}$ and actions given by expert
\n", 123 | "$\\quad$ Aggregate datasets: $D \\leftarrow D \\cup D_{i}$
\n", 124 | "$\\quad$ Train classifier $\\hat{\\pi}_{i + 1}$ on $D$.
\n", 125 | "\n", 126 | "Return best $\\hat{\\pi}_{i}$ on validation \n", 127 | "

\n", 128 | "\n", 129 | "_Algorithm 1. DAGGER: Dataset Aggregation._" 130 | ] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "id": "c0b57242", 135 | "metadata": {}, 136 | "source": [ 137 | "The basic principle behind __DAGGER__ for behavior cloning is that you continually build up the dataset, train, and repeat. " 138 | ] 139 | }, 140 | { 141 | "cell_type": "markdown", 142 | "id": "2dc5c8ee", 143 | "metadata": {}, 144 | "source": [ 145 | "# 4. Inverse RL" 146 | ] 147 | }, 148 | { 149 | "cell_type": "markdown", 150 | "id": "4ef0adf3", 151 | "metadata": {}, 152 | "source": [ 153 | "We have to estimate the $R$ reward function. There is no unique $R$ for a given set of data. \n", 154 | "\n", 155 | "$R(s) = \\textbf{w}^{T}x(s)$ where $w \\in \\mathbb{R}^{n}$, $x : S \\rightarrow \\mathbb{R}^{n} \\hspace{1em} (Eq.~1)$\n", 156 | "\n", 157 | "The value function for a policy $\\pi$ is now:\n", 158 | "\n", 159 | "$$\n", 160 | "\\begin{equation}\n", 161 | "\t\\begin{split}\n", 162 | "V^{\\pi} & \\underset{s \\thicksim \\pi}{=} \\mathbb{E}[\\sum_{t = 0}^{\\infty}\\gamma^{t}R(s_{t}) ~|~ \\pi]\\\\\n", 163 | " & = \\mathbb{E}[\\sum_{t = 0}^{\\infty}\\gamma^{t}\\textbf{w}^{T}x(s_{t}) ~|~ \\pi]\\\\\n", 164 | " & = \\textbf{w}^{T} \\mathbb{E}[\\sum_{t = 0}^{\\infty}\\gamma^{t}x(s_{t}) ~|~ \\pi]\\\\\n", 165 | " & = \\textbf{w}^{T} \\mu(\\pi)\\\\\n", 166 | " \\end{split}\n", 167 | "\\end{equation}\n", 168 | "\\hspace{1em} (Eq.~2)\\\\\n", 169 | "$$\n", 170 | "\n", 171 | "$\\mu(\\pi)(s)$ is the discounted weighted frequency of state features under policy $\\pi$." 172 | ] 173 | }, 174 | { 175 | "cell_type": "markdown", 176 | "id": "e9997dd8", 177 | "metadata": {}, 178 | "source": [ 179 | "# 5. 
Apprenticeship Learning" 180 | ] 181 | }, 182 | { 183 | "cell_type": "markdown", 184 | "id": "ad8e70e7", 185 | "metadata": {}, 186 | "source": [ 187 | "$$\n", 188 | "V^{\\pi} = \\textbf{w}^{T} \\mu(\\pi)\n", 189 | "$$\n", 190 | "\n", 191 | "$$\n", 192 | "\\mathbb{E}[\\sum_{t = 0}^{\\infty}\\gamma^{t}R^{*}(s_{t}) ~|~ \\pi^{*}] = V^{*} \\ge V^{\\pi} = \\mathbb{E}[\\sum_{t = 0}^{\\infty}\\gamma^{t}R^{*}(s_{t}) ~|~ \\pi] \\hspace{1em} (Eq.~3)\\\\\n", 193 | "w^{*^{T}} \\mu(\\pi^{*}) \\ge w^{*^{T}} \\mu(\\pi), \\forall ~ \\pi \\ne \\pi^{*} \\hspace{1em} (Eq.~4)\\\\\n", 194 | "$$\n", 195 | "\n", 196 | "If:\n", 197 | "\n", 198 | "$$\n", 199 | "||\\mu(\\pi) - \\mu(\\pi^{*})||_{1} \\le \\epsilon \\hspace{1em} (Eq.~5)\\\\\n", 200 | "$$\n", 201 | "\n", 202 | "then for all $w$ with $||w||_{\\infty} \\le 1$:

\n", 203 | "$$\n", 204 | "|w^{T}\\mu(\\pi) - w^{T}\\mu(\\pi^{*})| \\le \\epsilon \\hspace{1em} (Eq.~6)\n", 205 | "$$" 206 | ] 207 | }, 208 | { 209 | "cell_type": "markdown", 210 | "id": "abba7e7e", 211 | "metadata": {}, 212 | "source": [ 213 | "Assumption: $R(s) = w^{T}x(s)$
\n", 214 | "Initialize policy $\\pi_{0}$\n", 215 | "for $i = 1, 2, ...$
\n", 216 | "$\\quad$ Find a reward function ($\\textbf{w}$) such that the teacher maximally outperforms all previous controllers:\n", 217 | "\n", 218 | "$$\n", 219 | "\\underset{\\textbf{w}}{argmax} ~ \\underset{\\gamma}{max} ~ s.t. ~~ w^{T} \\mu(\\pi^{*}) \\ge w^{T}\\mu(\\pi) + \\gamma ~~~ \\forall \\pi \\in \\{\\pi_{0}, \\pi_{1}, ..., \\pi_{i - 1}\\} ~~ s.t. ~~ ||w||_{2} \\le 1 \\hspace{1em} (Eq.~7)\\\\\n", 220 | "$$\n", 221 | "\n", 222 | "$\\quad$ Find optimal control policy $\\pi_{i}$ for the current $\\textbf{w}$
\n", 223 | "$\\quad$ Exit if $\\gamma \\le \\frac{\\epsilon}{2}$
\n", 224 | "\n", 225 | "_Algorithm 2. Apprenticeship Learning._" 226 | ] 227 | }, 228 | { 229 | "cell_type": "markdown", 230 | "id": "0f657f6b", 231 | "metadata": {}, 232 | "source": [ 233 | "# 6. Resource" 234 | ] 235 | }, 236 | { 237 | "cell_type": "markdown", 238 | "id": "32e04a27", 239 | "metadata": {}, 240 | "source": [ 241 | "If you missed the link right below the title, I'm providing the resource here again along with the course website.\n", 242 | "\n", 243 | "- [Stanford CS234](https://www.youtube.com/watch?v=FgzM3zpZ55o)\n", 244 | "- [Course Website](http://web.stanford.edu/class/cs234/index.html)\n", 245 | "\n", 246 | "This is a series of 15 lectures provided by Stanford.\n" 247 | ] 248 | }, 249 | { 250 | "cell_type": "code", 251 | "execution_count": null, 252 | "id": "f5c15c5c", 253 | "metadata": {}, 254 | "outputs": [], 255 | "source": [] 256 | } 257 | ], 258 | "metadata": { 259 | "kernelspec": { 260 | "display_name": "Python 3", 261 | "language": "python", 262 | "name": "python3" 263 | }, 264 | "language_info": { 265 | "codemirror_mode": { 266 | "name": "ipython", 267 | "version": 3 268 | }, 269 | "file_extension": ".py", 270 | "mimetype": "text/x-python", 271 | "name": "python", 272 | "nbconvert_exporter": "python", 273 | "pygments_lexer": "ipython3", 274 | "version": "3.8.8" 275 | } 276 | }, 277 | "nbformat": 4, 278 | "nbformat_minor": 5 279 | } 280 | -------------------------------------------------------------------------------- /Lecture 8 - Policy Gradient I.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "16df9564", 6 | "metadata": {}, 7 | "source": [ 8 | "# Lecture 8 - Policy Gradient I\n", 9 | "\n", 10 | "provided by [Stanford CS234](https://www.youtube.com/watch?v=FgzM3zpZ55o)\n", 11 | "\n", 12 | "---" 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "id": "618c06cf", 18 | "metadata": {}, 19 | "source": [ 20 | "
\n", 21 | "Table of Contents:
\n", 22 | " \n", 23 | "\n", 28 | "
" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "id": "a8837110", 34 | "metadata": {}, 35 | "source": [ 36 | "# 1. Introduction" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "id": "803854f9", 42 | "metadata": {}, 43 | "source": [ 44 | "In the previous lecture, we tried to approximate the value or action-value functions using parameters $\\theta$:\n", 45 | "\n", 46 | "$$\n", 47 | "V_{\\theta}(s) \\approx V^{\\pi}(s)\\\\\n", 48 | "Q_{\\theta}(s, a) \\approx Q^{\\pi}(s, a)\n", 49 | "$$\n", 50 | "\n", 51 | "The policy will usually be $\\epsilon$-greedy applied on top of these value functions. Today we will directly parameterize the policy:\n", 52 | "\n", 53 | "$$\n", 54 | "\\pi_{\\theta}(s, a) = \\mathbb{P}[a~|~s; \\theta] \\hspace{1em} (Eq.~1)\\\\\n", 55 | "$$" 56 | ] 57 | }, 58 | { 59 | "cell_type": "markdown", 60 | "id": "9abfa2c7", 61 | "metadata": {}, 62 | "source": [ 63 | "* Value Based\n", 64 | " * Learnt value function\n", 65 | " * implicit policy (e.g. $\\epsilon$-greedy)\n", 66 | "* Policy Based\n", 67 | " * no value function\n", 68 | " * learnt policy\n", 69 | "* Actor-Critic\n", 70 | " * learnt value function\n", 71 | " * learnt policy" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "id": "f1190870", 77 | "metadata": {}, 78 | "source": [ 79 | "Advantages of Policy-Based RL:\n", 80 | "* better convergence properties\n", 81 | "* effective in high-dimensional/continuous action spaces\n", 82 | "Disadvantages:\n", 83 | "* usually converge to local rather than global optimum\n", 84 | "* evaluating policy is inefficient" 85 | ] 86 | }, 87 | { 88 | "cell_type": "markdown", 89 | "id": "13807f37", 90 | "metadata": {}, 91 | "source": [ 92 | "* Goal: given a policy $\\pi_{\\theta}(s, a)$ with parameters $\\theta$, find best $\\theta$\n", 93 | "* we measure the quality of the policy (policy evaluation)\n", 94 | "* in __episodic environments__, we can use the start value of the policy:\n", 95 | "\n", 96 | "$$\n", 97 | "J_{1}(\\theta) = 
V^{\\pi_{\\theta}}(s_{1}) \\hspace{1em} (Eq.~2)\\\\\n", 98 | "J_{avV}(\\theta) = \\sum_{s} d^{\\pi_{\\theta}}(s) V^{\\pi_{\\theta}}(s) \\hspace{1em} (Eq.~3)\\\\\n", 99 | "J_{avR}(\\theta) = \\sum_{s} d^{\\pi_{\\theta}}(s) \\sum_{a} \\pi_{\\theta}(s, a) R(s, a) \\hspace{1em} (Eq.~4)\\\\\n", 100 | "$$\n", 101 | "\n", 102 | "$d^{\\pi_{\\theta}}(s)$ is the stationary distribution of states under $\\pi_{\\theta}$.\n", 103 | "\n", 104 | "Eq. 2: in episodic environments we can use the value of the start state $s_{1}$ under the policy. Eq. 3: in continuing environments we can use the average value. Eq. 4: in continuing environments we can also use the average reward per time-step." 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "id": "12637991", 110 | "metadata": {}, 111 | "source": [ 112 | "# 2. Policy Optimization" 113 | ] 114 | }, 115 | { 116 | "cell_type": "markdown", 117 | "id": "2f70e469", 118 | "metadata": {}, 119 | "source": [ 120 | "Policy-based RL (we've been doing model-based/model-free value-based RL) is an optimization problem. There are gradient-free methods for optimization:\n", 121 | "* hill climbing\n", 122 | "* genetic algorithms\n", 123 | "\n", 124 | "Gradient-free optimization methods are good baselines, but they are sample inefficient." 125 | ] 126 | }, 127 | { 128 | "cell_type": "markdown", 129 | "id": "d1d07646", 130 | "metadata": {}, 131 | "source": [ 132 | "Policy gradient algorithms search for a _local_ maximum in $V(\\theta)$. There are many different PG algorithms.\n", 133 | "\n", 134 | "$$\n", 135 | "V(\\theta) = V^{\\pi_{\\theta}}\\\\\n", 136 | "\\Delta \\theta = \\alpha \\nabla_{\\theta}V(\\theta)\\\\\n", 137 | "\\nabla_{\\theta}V(\\theta) = \\begin{pmatrix}\n", 138 | " \\frac{\\partial V(\\theta)}{\\partial \\theta_{1}} \\\\\n", 139 | " \\vdots \\\\\n", 140 | "\t\\frac{\\partial V(\\theta)}{\\partial \\theta_{n}}\n", 141 | "\\end{pmatrix}\n", 142 | "$$\n", 143 | "\n", 144 | "$\\alpha$ is a step-size parameter." 
145 | ] 146 | }, 147 | { 148 | "cell_type": "markdown", 149 | "id": "ca2ef4d8", 150 | "metadata": {}, 151 | "source": [ 152 | "__PG by Finite Differences__ is simple but noisy and inefficient: it needs $n$ extra policy evaluations to estimate the gradient in $n$ dimensions. It is still sometimes effective, and it works even when the policy is not differentiable." 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "id": "2694da25", 158 | "metadata": {}, 159 | "source": [ 160 | "To evaluate the policy gradient of $\\pi_{\\theta}(s, a)$
\n", 161 | "For each dimension $k \\in [1, n]$
\n", 162 | "$\\quad$ Estimate $k$-th partial derivative of objective function w.r.t. $\\theta$
\n", 163 | "$\\quad$ Perturb $\\theta$ by small amount $\\epsilon$ in $k$-th dimension\n", 164 | "\n", 165 | "$$\n", 166 | "\\frac{\\partial V(\\theta)}{\\partial \\theta_{k}} \\approx \\frac{V(\\theta + \\epsilon u_{k}) - V(\\theta)}{\\epsilon}\n", 167 | "$$\n", 168 | "\n", 169 | "$u_{k}$ is a unit vector with 1 in $k$-th component, 0 elsewhere.
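
The loop above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the lecture: `value_fn` is a hypothetical stand-in for whatever (possibly noisy) policy evaluation returns V(theta), and `toy_value` is a made-up objective with a known maximum at theta = 1, used only so the sketch runs on its own.

```python
import numpy as np


def finite_difference_gradient(value_fn, theta, eps=1e-5):
    # value_fn is an assumed callable that returns a scalar estimate of V(theta).
    theta = np.asarray(theta, dtype=float)
    base = value_fn(theta)
    grad = np.zeros_like(theta)
    for k in range(theta.size):
        # u_k: unit vector with 1 in the k-th component, 0 elsewhere
        u_k = np.zeros_like(theta)
        u_k[k] = 1.0
        # Forward difference: one extra evaluation per dimension
        grad[k] = (value_fn(theta + eps * u_k) - base) / eps
    return grad


def toy_value(theta):
    # Hypothetical objective (not an RL environment); true gradient is -2 * (theta - 1)
    return -np.sum((theta - 1.0) ** 2)


theta = np.array([0.0, 3.0])
grad = finite_difference_gradient(toy_value, theta)

# Plain gradient ascent with a fixed step size alpha
alpha = 0.1
for _ in range(100):
    theta = theta + alpha * finite_difference_gradient(toy_value, theta)
```

With a real policy, every call to `value_fn` would mean running rollouts, which is why this approach needs many samples when theta has many dimensions.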
\n", 170 | "\n", 171 | "_Algorithm 1. PG by Finite Differences._" 172 | ] 173 | }, 174 | { 175 | "cell_type": "markdown", 176 | "id": "4a7ad3a5", 177 | "metadata": {}, 178 | "source": [ 179 | "__Likelihood Ratio Policies__\n", 180 | "\n", 181 | "Define a state-action trajectory: $\\tau = (s_{0}, a_{0}, r_{0}, ..., s_{T - 1}, a_{T - 1}, r_{T - 1}, s_{T})$
\n", 182 | "Let $R(\\tau) = \\sum_{t=0}^{T - 1}R(s_{t}, a_{t})$ be the sum of rewards for a trajectory $\\tau$.
\n", 183 | "The policy value is:\n", 184 | "\n", 185 | "$$\n", 186 | "V(\\theta) = \\sum_{\\tau} P(\\tau; \\theta) R(\\tau) \\hspace{1em} (Eq.~5)\\\\\n", 187 | "\\underset{\\theta}{argmax} V(\\theta) = \\underset{\\theta}{argmax} \\sum_{\\tau} P(\\tau; \\theta)R(\\tau) \\hspace{1em} (Eq.~6)\\\\\n", 188 | "\\begin{equation}\n", 189 | " \\begin{split}\n", 190 | " \\nabla_{\\theta}V(\\theta) & = \\sum_{\\tau} P(\\tau; \\theta) R(\\tau) \\nabla_{\\theta} log ~ P(\\tau;\\theta)\\\\\n", 191 | " & = \\mathbb{E}_{\\tau}[\\sum_{t = 0}^{T - 1} \\nabla_{\\theta} log \\pi_{\\theta}(a_{t}~|~s_{t}) G_{t}]\\\\\n", 192 | " & \\approx \\hat{g} = \\frac{1}{m} \\sum_{i = 1}^{m} R(\\tau^{(i)}) \\nabla_{\\theta} log ~ P(\\tau^{(i)};\\theta)\\\\\n", 193 | " & = \\frac{1}{m} \\sum_{i = 1}^{m} R(\\tau^{(i)}) \\sum_{t = 0}^{T - 1} \\nabla_{\\theta} log \\pi_{\\theta}(a_{t}^{(i)}~|~s_{t}^{(i)})\\\\\n", 194 | " & \\approx \\frac{1}{m} \\sum_{i = 1}^{m} \\sum_{t = 0}^{T - 1} \\nabla_{\\theta} log \\pi_{\\theta}(a_{t}^{(i)}~|~s_{t}^{(i)}) G_{t}^{(i)}\\\\\n", 195 | " \\end{split}\n", 196 | "\\end{equation} \\hspace{1em} (Eq.~7)\\\\\n", 197 | "$$\n", 198 | "\n", 199 | "$P(\\tau; \\theta)$ is the probability of trajectory $\\tau$ when executing policy $\\pi_{\\theta}$, $\\tau^{(i)}$ is the $i$-th of $m$ sampled trajectories, and $G_{t}^{(i)} = \\sum_{t' = t}^{T - 1} r_{t'}^{(i)}$ is the return from time step $t$ onward in that trajectory.\n", 200 | "\n", 201 | "In likelihood ratio methods, the policy value is written as in Eq. 5. Eq. 6 states the optimization problem: find the parameters $\\theta$ that maximize that value. Eq. 7 is the gradient of the value function w.r.t. the policy parameters $\\theta$. The dynamics do not depend on $\\theta$, so $\\nabla_{\\theta} log ~ P(\\tau;\\theta) = \\sum_{t} \\nabla_{\\theta} log \\pi_{\\theta}(a_{t}~|~s_{t})$. Replacing $R(\\tau)$ with the reward-to-go $G_{t}$ leaves the expectation unchanged, because an action cannot affect rewards received before it. Notice how policy gradient algorithms optimize the policy directly (in this case, by following gradients). The last two lines of Eq. 7 are the most important, as they are the crux of REINFORCE, a classic policy gradient algorithm!\n", 202 | "\n", 203 | "Note: $log \\pi(a_{t}~|~s_{t}; \\theta)$ is the same as $log \\pi_{\\theta}(a_{t}~|~s_{t})$." 
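
The reward-to-go form of the estimator can be sketched as follows. This is a minimal illustration under stated assumptions, not the lecture's implementation: it assumes a tabular softmax policy with one row of logits per state, and the helper names (`softmax`, `reward_to_go`, `grad_log_pi`, `reinforce_gradient`) and the tiny hand-written trajectory are all hypothetical.

```python
import numpy as np


def softmax(x):
    # Numerically stable softmax over a 1-D array of logits
    z = np.exp(x - np.max(x))
    return z / z.sum()


def reward_to_go(rewards):
    # G_t = r_t + r_{t+1} + ... + r_{T-1} (undiscounted)
    rewards = np.asarray(rewards, dtype=float)
    return np.cumsum(rewards[::-1])[::-1]


def grad_log_pi(theta, s, a):
    # For a tabular softmax policy (an assumption of this sketch),
    # d/d theta[s] of log pi(a | s) is onehot(a) - pi(. | s); other rows are zero.
    grad = np.zeros_like(theta)
    grad[s] = -softmax(theta[s])
    grad[s, a] += 1.0
    return grad


def reinforce_gradient(theta, trajectories):
    # trajectories: list of (states, actions, rewards) tuples of equal length
    g_hat = np.zeros_like(theta)
    for states, actions, rewards in trajectories:
        returns = reward_to_go(rewards)
        for s, a, g in zip(states, actions, returns):
            g_hat += grad_log_pi(theta, s, a) * g
    return g_hat / len(trajectories)


# One state, two actions, uniform initial policy, one hand-written trajectory
theta = np.zeros((1, 2))
trajectories = [([0, 0], [0, 1], [1.0, 2.0])]
g_hat = reinforce_gradient(theta, trajectories)
```

In practice the trajectories would come from rolling out the current policy in an environment, and `g_hat` would be passed to a gradient ascent update on `theta`.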
204 | ] 205 | }, 206 | { 207 | "cell_type": "markdown", 208 | "id": "e3658639", 209 | "metadata": {}, 210 | "source": [ 211 | "The rest of the lecture highlights the REINFORCE algorithm, one common policy gradient method.\n", 212 | "\n", 213 | "For an implementation of the algorithm I'd recommend this: https://github.com/ageron/handson-ml2/blob/master/18_reinforcement_learning.ipynb.\n", 214 | "\n", 215 | "For understanding it, I recommend this: https://medium.com/intro-to-artificial-intelligence/reinforce-a-policy-gradient-based-reinforcement-learning-algorithm-84bde440c816." 216 | ] 217 | }, 218 | { 219 | "cell_type": "markdown", 220 | "id": "0f657f6b", 221 | "metadata": {}, 222 | "source": [ 223 | "# 3. Resource" 224 | ] 225 | }, 226 | { 227 | "cell_type": "markdown", 228 | "id": "32e04a27", 229 | "metadata": {}, 230 | "source": [ 231 | "If you missed the link right below the title, I'm providing the resource here again along with the course website.\n", 232 | "\n", 233 | "- [Stanford CS234](https://www.youtube.com/watch?v=FgzM3zpZ55o)\n", 234 | "- [Course Website](http://web.stanford.edu/class/cs234/index.html)\n", 235 | "\n", 236 | "This is a series of 15 lectures provided by Stanford.\n" 237 | ] 238 | }, 239 | { 240 | "cell_type": "code", 241 | "execution_count": null, 242 | "id": "f5c15c5c", 243 | "metadata": {}, 244 | "outputs": [], 245 | "source": [] 246 | } 247 | ], 248 | "metadata": { 249 | "kernelspec": { 250 | "display_name": "Python 3", 251 | "language": "python", 252 | "name": "python3" 253 | }, 254 | "language_info": { 255 | "codemirror_mode": { 256 | "name": "ipython", 257 | "version": 3 258 | }, 259 | "file_extension": ".py", 260 | "mimetype": "text/x-python", 261 | "name": "python", 262 | "nbconvert_exporter": "python", 263 | "pygments_lexer": "ipython3", 264 | "version": "3.8.8" 265 | } 266 | }, 267 | "nbformat": 4, 268 | "nbformat_minor": 5 269 | } 270 | -------------------------------------------------------------------------------- /Lecture 9 
- Policy Gradient II.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "16df9564", 6 | "metadata": {}, 7 | "source": [ 8 | "# Lecture 9 - Policy Gradient II\n", 9 | "\n", 10 | "provided by [Stanford CS234](https://www.youtube.com/watch?v=FgzM3zpZ55o)\n", 11 | "\n", 12 | "---" 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "id": "618c06cf", 18 | "metadata": {}, 19 | "source": [ 20 | "
" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "id": "a8837110", 39 | "metadata": {}, 40 | "source": [ 41 | "# 1. Introduction" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "id": "3cdb5196", 47 | "metadata": {}, 48 | "source": [ 49 | "For Policy Gradient algorithms, we want to converge as fast as possible to the local optima as well as have monotonic improvement.\n", 50 | "\n", 51 | "Last lecture, we focused on policy-based methods. This lecture we focus on policy and value-based methods which are commonly referred to as __actor-critic__ methods." 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "id": "672dcc72", 57 | "metadata": {}, 58 | "source": [ 59 | "# 2. \"Vanilla\" Policy Gradient Algorithm" 60 | ] 61 | }, 62 | { 63 | "cell_type": "markdown", 64 | "id": "d8f1ce68", 65 | "metadata": {}, 66 | "source": [ 67 | "Initialize policy parameter $\\theta$, baseline $b$
\n", 68 | "for iteration=1, 2, ..., do
\n", 69 | "$\\quad$ Collect a set of trajectories by executing the current policy
\n", 70 | "$\\quad$ At each timestep $t$ in each trajectory $\\tau^{i}$
\n", 71 | "$\\quad\\quad$ Compute Return $G_{t}^{i} = \\sum_{t' = t}^{T - 1}r_{t'}^{i}$, and
\n", 72 | "$\\quad\\quad$ Advantage estimate $\\hat{A}_{t}^{i} = G_{t}^{i} - b(s_{t})$.
\n", 73 | "$\\quad$ Re-fit the baseline, by minimizing $\\sum_{i}\\sum_{t}||b(s_{t}) - G_{t}^{i}||^{2}$,
\n", 74 | "$\\quad$ Update the policy, using a policy gradient estimate $\\hat{g}$,
\n", 75 | "$\\quad\\quad$ Which is a sum of terms $\\nabla_{\\theta} log \\pi(a_{t}~|~s_{t}, \\theta) \\hat{A}_{t}$.
\n", 76 | "$\\quad\\quad$ Plug $\\hat{g}$ into SGD or ADAM
\n", 77 | "
\n", 78 | "\n", 79 | "_Algorithm 1. \"Vanilla\" Policy Gradient Algorithm._" 80 | ] 81 | }, 82 | { 83 | "cell_type": "markdown", 84 | "id": "11e7af78", 85 | "metadata": {}, 86 | "source": [ 87 | "The __\"Vanilla\" Policy Gradient__ algorithm is a general skeleton or framework for many different PG methods. REINFORCE is a prime example of this template. Notice that this algorithm uses the equations we defined for likelihood ratio policies in the previous lecture. The only new idea introduced here is the baseline.\n", 88 | "\n", 89 | "$b(s_{t})$ is simply a function (e.g. a deep or shallow neural network) that takes in a state and outputs an expected return. As this \"vanilla\" PG algorithm iterates, the baseline is continually re-fit to track the expected (undiscounted) return from each state.\n", 90 | "\n", 91 | "We introduce a baseline into our standard PG algorithm template because it reduces variance. Subtracting a baseline that depends only on the state does not bias the gradient, because $\\mathbb{E}_{a \\sim \\pi_{\\theta}}[\\nabla_{\\theta} log \\pi_{\\theta}(a~|~s)] = 0$.\n", 92 | "\n", 93 | "$$\n", 94 | "\\begin{equation}\n", 95 | " \\begin{split}\n", 96 | "\\nabla_{\\theta} V(\\theta) & = \\mathbb{E}_{\\tau}[\\sum_{t = 0}^{T - 1} \\nabla_{\\theta} log \\pi_{\\theta}(a_{t}~|~s_{t}) G_{t}]\\\\\n", 97 | " & \\approx \\frac{1}{m} \\sum_{i = 1}^{m} \\sum_{t = 0}^{T - 1} \\nabla_{\\theta} log \\pi_{\\theta}(a_{t}^{(i)}~|~s_{t}^{(i)}) G_{t}^{(i)} \\hspace{1em} (Eq.~1)\\\\\n", 98 | " \\end{split}\n", 99 | "\\end{equation}\n", 100 | "$$\n", 101 | "\n", 102 | "Eq. 1 restates the standard gradient formula and its sample estimate from the previous lecture. \n", 103 | "\n", 104 | "$$\n", 105 | "\\hat{A}_{t} = G_{t} - b(s_{t})\\\\\n", 106 | "\\nabla_{\\theta} V(\\theta) = \\mathbb{E}_{\\tau}[\\sum_{t = 0}^{T - 1} \\nabla_{\\theta} log \\pi_{\\theta}(a_{t}~|~s_{t}) \\hat{A}_{t}] \\hspace{1em} (Eq.~2)\\\\\n", 107 | "$$\n", 108 | "\n", 109 | "Eq. 2 is the same equation written with the baseline included.\n", 110 | "\n", 111 | "Now notice we can substitute $V$ or $Q$ into the advantage function. Additionally, we can have a method that learns this value function. We call this a __critic__. 
Instead of the sampled sum of future rewards $G_{t}^{i}$, we can use an estimate of the $Q$ function with parameters $w$, learned with TD or MC methods.\n", 112 | "\n", 113 | "$$\n", 114 | "\\hat{A}_{t}^{i} = Q(s_{t}, a_{t}; w) - b(s_{t})\\\\\n", 115 | "$$" 116 | ] 117 | }, 118 | { 119 | "cell_type": "markdown", 120 | "id": "0952fdff", 121 | "metadata": {}, 122 | "source": [ 123 | "But wait! Keep in mind this algorithm so far is policy-based. To make it both policy and value-based, we can replace the sampled return $R(\\tau^{(i)})$ with a learned, parameterized value estimate." 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "id": "ac9f36cf", 129 | "metadata": {}, 130 | "source": [ 131 | "# 3. Need for Automatic Step Size Tuning" 132 | ] 133 | }, 134 | { 135 | "cell_type": "markdown", 136 | "id": "cf0862ec", 137 | "metadata": {}, 138 | "source": [ 139 | "At each iteration of our \"vanilla\" PG algorithm, we want the value of the newly updated policy to be at least as good as the value of the previous iteration's policy: $V^{\\pi'} \\ge V^{\\pi}$.\n", 140 | "\n", 141 | "Why is the step size important in this scenario? The step size affects whether we converge and how fast. It matters more in RL than in supervised learning because the step determines the next policy, and the next policy determines the data we collect. A step that is too large can produce a poor policy, which then collects poor data, and it may be hard to recover from that." 142 | ] 143 | }, 144 | { 145 | "cell_type": "markdown", 146 | "id": "14f67ca1", 147 | "metadata": {}, 148 | "source": [ 149 | "So what are some ways to account for this issue?\n", 150 | "* simple step size with line search\n", 151 | " * simple but expensive \n", 152 | " * naive\n", 153 | "* auto-step-size selection\n", 154 | " * can we ensure the current policy's value function is greater than or equal to the previous iteration's policy's value function?" 
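
The line-search option in the list above can be sketched as a simple backtracking loop. This is an illustration under stated assumptions rather than anything from the lecture: `value_fn` is a hypothetical policy evaluation routine, `toy_value` is a made-up objective, and `backtracking_step` with its `shrink` and `max_tries` parameters is a name chosen here.

```python
import numpy as np


def backtracking_step(value_fn, theta, grad, alpha=1.0, shrink=0.5, max_tries=20):
    # Try a step along the gradient and shrink alpha until the value does not decrease.
    # Every candidate costs one more call to value_fn, which is why this is expensive.
    base = value_fn(theta)
    for _ in range(max_tries):
        candidate = theta + alpha * grad
        if value_fn(candidate) >= base:
            return candidate, alpha
        alpha *= shrink
    # No acceptable step found: keep the old parameters
    return theta, 0.0


def toy_value(theta):
    # Hypothetical objective with its maximum at theta = 1
    return -np.sum((theta - 1.0) ** 2)


theta = np.array([0.0])
grad = -2.0 * (theta - 1.0)  # exact gradient of toy_value at theta
new_theta, used_alpha = backtracking_step(toy_value, theta, grad, alpha=1.5)
```

Here the first candidate overshoots the maximum and lowers the value, so the step size is halved once before a step is accepted. With a real policy each check would require fresh rollouts, and the value estimates would be noisy, so the acceptance test is only approximate.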
155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "id": "74e89e1d", 160 | "metadata": {}, 161 | "source": [ 162 | "$$\n", 163 | "V(\\theta) = \\mathbb{E}_{\\pi_{\\theta}}[\\sum_{t = 0}^{\\infty}\\gamma^{t} R(s_{t}, a_{t}); \\pi_{\\theta}] \\hspace{1em} (Eq.~3)\\\\\n", 164 | "$$\n", 165 | "\n", 166 | "Eq. 3 defines the objective we want to maximize: the expected discounted return of the policy $\\pi_{\\theta}$ in the infinite horizon setting.\n", 167 | "\n", 168 | "We can express the value of a new policy in terms of the old policy's value and advantage: \n", 169 | "\n", 170 | "$$\n", 171 | "\\begin{equation}\n", 172 | " \\begin{split}\n", 173 | " V(\\tilde{\\theta}) & = V(\\theta) + \\mathbb{E}_{\\pi_{\\tilde{\\theta}}}[\\sum_{t = 0}^{\\infty} \\gamma^{t} A_{\\pi}(s_{t}, a_{t})]\\\\\n", 174 | " & = V(\\theta) + \\sum_{s} \\mu_{\\tilde{\\pi}}(s) \\sum_{a} \\tilde{\\pi}(a~|~s) A_{\\pi}(s, a)\\\\\n", 175 | " \\mu_{\\tilde{\\pi}}(s) & = \\mathbb{E}_{\\tilde{\\pi}}[\\sum_{t = 0}^{\\infty} \\gamma^{t} I(s_{t} = s)]\n", 176 | " \\end{split}\n", 177 | "\\end{equation} \\hspace{1em} (Eq.~4)\\\\\n", 178 | "$$\n", 179 | "\n", 180 | "Notice how the first and second lines of Eq. 4 are exactly the same quantity, just written in different ways: the second groups the time steps by state. \n", 181 | "\n", 182 | "The only new idea in these equations is the tilde (~). $\\tilde{\\pi}$ is the new policy (at iteration $i + 1$) and the same goes for $\\tilde{\\theta}$. \n", 183 | "\n", 184 | "So we understand this is for auto step size tuning, but we don't know $\\mu_{\\tilde{\\pi}}$. To be more specific, we can't calculate it just yet (it requires the new policy at iteration $i + 1$). How do we fix this?\n", 185 | "\n", 186 | "There are a few approaches to fixing this issue:\n", 187 | "* __local approximation__\n", 188 | "* __trust regions__\n", 189 | "* __TRPO algorithm__" 190 | ] 191 | }, 192 | { 193 | "cell_type": "markdown", 194 | "id": "d454d7b9", 195 | "metadata": {}, 196 | "source": [ 197 | "## 3.1. 
Local Approximation" 198 | ] 199 | }, 200 | { 201 | "cell_type": "markdown", 202 | "id": "c975292d", 203 | "metadata": {}, 204 | "source": [ 205 | "We can slightly rewrite Eq. 4 so that we have a substitute for $\\mu_{\\tilde{\\pi}}$:\n", 206 | "\n", 207 | "$$\n", 208 | "L_{\\pi}(\\tilde{\\pi}) = V(\\theta) + \\sum_{s} \\mu_{\\pi}(s) \\sum_{a} \\tilde{\\pi}(a~|~s) A_{\\pi}(s, a) \\hspace{1em} (Eq.~5)\\\\\n", 209 | "$$\n", 210 | "\n", 211 | "Instead of $\\mu_{\\tilde{\\pi}}$, the discounted weighted frequency of state $s$ under the new policy, Eq. 5 uses $\\mu_{\\pi}$, the discounted weighted frequency of state $s$ under the current policy, which we can estimate from data we already have." 212 | ] 213 | }, 214 | { 215 | "cell_type": "markdown", 216 | "id": "8d10d428", 217 | "metadata": {}, 218 | "source": [ 219 | "This raises the question: how do Eq. 3 and Eq. 4 fit into our current understanding of policy gradients? Over Lecture 8 and Lecture 9, we have seen a lot of formulas involving value functions.\n", 220 | "\n", 221 | "For now, I'm still not too sure. Let's give it some time.\n", 222 | "\n", 223 | "My conclusion is that we formulate our objective function like this (there are many other ways to do it) because we want a reliable way to get monotonic improvement in gradient-based policy search." 224 | ] 225 | }, 226 | { 227 | "cell_type": "markdown", 228 | "id": "0f657f6b", 229 | "metadata": {}, 230 | "source": [ 231 | "# 4. 
Resource" 232 | ] 233 | }, 234 | { 235 | "cell_type": "markdown", 236 | "id": "32e04a27", 237 | "metadata": {}, 238 | "source": [ 239 | "If you missed the link right below the title, I'm providing the resource here again along with the course website.\n", 240 | "\n", 241 | "- [Stanford CS234](https://www.youtube.com/watch?v=FgzM3zpZ55o)\n", 242 | "- [Course Website](http://web.stanford.edu/class/cs234/index.html)\n", 243 | "\n", 244 | "This is a series of 15 lectures provided by Stanford.\n" 245 | ] 246 | }, 247 | { 248 | "cell_type": "code", 249 | "execution_count": null, 250 | "id": "f5c15c5c", 251 | "metadata": {}, 252 | "outputs": [], 253 | "source": [] 254 | } 255 | ], 256 | "metadata": { 257 | "kernelspec": { 258 | "display_name": "Python 3", 259 | "language": "python", 260 | "name": "python3" 261 | }, 262 | "language_info": { 263 | "codemirror_mode": { 264 | "name": "ipython", 265 | "version": 3 266 | }, 267 | "file_extension": ".py", 268 | "mimetype": "text/x-python", 269 | "name": "python", 270 | "nbconvert_exporter": "python", 271 | "pygments_lexer": "ipython3", 272 | "version": "3.8.8" 273 | } 274 | }, 275 | "nbformat": 4, 276 | "nbformat_minor": 5 277 | } 278 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Stanford-CS234-RL---Lecture-Notes 2 | My lecture notes on the RL series provided by Stanford. 3 | 4 | ## Table of Contents: 5 | - [1. Description](https://github.com/alckasoc/Stanford-CS234-RL---Lecture-Notes/blob/main/README.md#1-description) 6 | - [2. Difficulties](https://github.com/alckasoc/Stanford-CS234-RL---Lecture-Notes/blob/main/README.md#2-difficulties) 7 | - [3. Author Info](https://github.com/alckasoc/Stanford-CS234-RL---Lecture-Notes/blob/main/README.md#3-author-info) 8 | - [4. Thank You](https://github.com/alckasoc/Stanford-CS234-RL---Lecture-Notes/blob/main/README.md#4-thank-you) 9 | 10 | ## 1. 
Description 11 | 12 | All lectures (and the images I took) are located in this repo. Below are the corresponding notebooks on Kaggle: 13 | * [Lecture 1](https://www.kaggle.com/code/vincenttu/stanford-cs234-rl-lecture-1) 14 | * [Lecture 2](https://www.kaggle.com/code/vincenttu/stanford-cs234-rl-lecture-2) 15 | * [Lecture 3](https://www.kaggle.com/code/vincenttu/stanford-cs234-rl-lecture-3) 16 | * [Lecture 4](https://www.kaggle.com/code/vincenttu/stanford-cs234-rl-lecture-4) 17 | * [Lecture 5](https://www.kaggle.com/code/vincenttu/stanford-cs234-rl-lecture-5) 18 | * [Lecture 6](https://www.kaggle.com/code/vincenttu/stanford-cs234-rl-lecture-6) 19 | * [Lecture 7](https://www.kaggle.com/code/vincenttu/stanford-cs234-rl-lecture-7) 20 | * [Lecture 8](https://www.kaggle.com/code/vincenttu/stanford-cs234-rl-lecture-8) 21 | * [Lecture 9](https://www.kaggle.com/code/vincenttu/stanford-cs234-rl-lecture-9) 22 | * [Lecture 10](https://www.kaggle.com/code/vincenttu/stanford-cs234-rl-lecture-10) 23 | * [Lecture 11](https://www.kaggle.com/code/vincenttu/stanford-cs234-rl-lecture-11) 24 | * [Lecture 12](https://www.kaggle.com/code/vincenttu/stanford-cs234-rl-lecture-12) 25 | * [Lecture 13](https://www.kaggle.com/code/vincenttu/stanford-cs234-rl-lecture-13) 26 | * [Lecture 14](https://www.kaggle.com/code/vincenttu/stanford-cs234-rl-lecture-14) 27 | * [Lecture 15](https://www.kaggle.com/code/vincenttu/stanford-cs234-rl-lecture-15) 28 | 29 | ## 2. Difficulties 30 | 31 | A few difficulties! Besides school, these lectures were very dense! Information was packed into the last few minutes of class too. I thought this was great (the more material the better). I'm still a newbie in RL, but I think this lecture series, more than anything, showed me the landscape of modern RL and gave me insights into a number of nooks and crannies. This lecture series definitely helped solidify my fundamental understanding of RL. 
I remember before I'd always mix up the different Bellman equation flavors. Why does one equation have states and actions while the other has only states? These small things would always stump me! These lectures certainly clarified all of that for me. I also think a lot of RL is theoretical! The math is definitely heavy. This made some parts of the lecture a bit more dense than usual. I ended up spending a lot of time going through these lectures, but it was all worth it! 32 | 33 | ## 3. Author Info 34 | 35 | - Vincent Tu: [LinkedIn](https://www.linkedin.com/in/vincent-tu-422b18208/) | [Kaggle](https://www.kaggle.com/vincenttu) 36 | 37 | ## 4. Thank You 38 | 39 | This lecture series of notes took a while to compile (maybe a few months). These notes have helped me a ton in reviewing and understanding RL. I hope they also help you (do still watch the lectures!). I knew I was in for a very insightful look into RL when I noticed how much the first lecture covered. From model-based to model-free policy evaluation and control to value function approximation, deep learning, imitation learning, policy gradients, and fast and batch RL, I found the lectures to be informative and clear. I'd like to thank Professor Brunskill for making this series possible! I'd also like to thank everyone involved in helping with this lecture series and also you the viewer! Thank you. 
-------------------------------------------------------------------------------- /img/CUT.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alckasoc/Stanford-CS234-RL---Lecture-Notes/3ea1b85cb7bff6b659c512bbf57f519e2365ce7a/img/CUT.PNG -------------------------------------------------------------------------------- /img/LVFA.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alckasoc/Stanford-CS234-RL---Lecture-Notes/3ea1b85cb7bff6b659c512bbf57f519e2365ce7a/img/LVFA.PNG -------------------------------------------------------------------------------- /img/OPE.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alckasoc/Stanford-CS234-RL---Lecture-Notes/3ea1b85cb7bff6b659c512bbf57f519e2365ce7a/img/OPE.PNG -------------------------------------------------------------------------------- /img/SARSA_theorem.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alckasoc/Stanford-CS234-RL---Lecture-Notes/3ea1b85cb7bff6b659c512bbf57f519e2365ce7a/img/SARSA_theorem.PNG -------------------------------------------------------------------------------- /img/SPI.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alckasoc/Stanford-CS234-RL---Lecture-Notes/3ea1b85cb7bff6b659c512bbf57f519e2365ce7a/img/SPI.PNG -------------------------------------------------------------------------------- /img/VFA.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alckasoc/Stanford-CS234-RL---Lecture-Notes/3ea1b85cb7bff6b659c512bbf57f519e2365ce7a/img/VFA.PNG -------------------------------------------------------------------------------- /img/bias_variance.PNG: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/alckasoc/Stanford-CS234-RL---Lecture-Notes/3ea1b85cb7bff6b659c512bbf57f519e2365ce7a/img/bias_variance.PNG -------------------------------------------------------------------------------- /img/convergence_VFA.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alckasoc/Stanford-CS234-RL---Lecture-Notes/3ea1b85cb7bff6b659c512bbf57f519e2365ce7a/img/convergence_VFA.PNG -------------------------------------------------------------------------------- /img/diagram.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alckasoc/Stanford-CS234-RL---Lecture-Notes/3ea1b85cb7bff6b659c512bbf57f519e2365ce7a/img/diagram.PNG -------------------------------------------------------------------------------- /img/dp_mc_td.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alckasoc/Stanford-CS234-RL---Lecture-Notes/3ea1b85cb7bff6b659c512bbf57f519e2365ce7a/img/dp_mc_td.PNG -------------------------------------------------------------------------------- /img/dp_mdp.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alckasoc/Stanford-CS234-RL---Lecture-Notes/3ea1b85cb7bff6b659c512bbf57f519e2365ce7a/img/dp_mdp.PNG -------------------------------------------------------------------------------- /img/dp_mrp.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alckasoc/Stanford-CS234-RL---Lecture-Notes/3ea1b85cb7bff6b659c512bbf57f519e2365ce7a/img/dp_mrp.PNG -------------------------------------------------------------------------------- /img/dp_tree.PNG: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/alckasoc/Stanford-CS234-RL---Lecture-Notes/3ea1b85cb7bff6b659c512bbf57f519e2365ce7a/img/dp_tree.PNG -------------------------------------------------------------------------------- /img/dueling_dqn.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alckasoc/Stanford-CS234-RL---Lecture-Notes/3ea1b85cb7bff6b659c512bbf57f519e2365ce7a/img/dueling_dqn.PNG -------------------------------------------------------------------------------- /img/experience_replay.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alckasoc/Stanford-CS234-RL---Lecture-Notes/3ea1b85cb7bff6b659c512bbf57f519e2365ce7a/img/experience_replay.PNG -------------------------------------------------------------------------------- /img/forward_search.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alckasoc/Stanford-CS234-RL---Lecture-Notes/3ea1b85cb7bff6b659c512bbf57f519e2365ce7a/img/forward_search.PNG -------------------------------------------------------------------------------- /img/mc_td.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alckasoc/Stanford-CS234-RL---Lecture-Notes/3ea1b85cb7bff6b659c512bbf57f519e2365ce7a/img/mc_td.PNG -------------------------------------------------------------------------------- /img/monotonic_improvement.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alckasoc/Stanford-CS234-RL---Lecture-Notes/3ea1b85cb7bff6b659c512bbf57f519e2365ce7a/img/monotonic_improvement.PNG -------------------------------------------------------------------------------- /img/policy_improvement.PNG: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/alckasoc/Stanford-CS234-RL---Lecture-Notes/3ea1b85cb7bff6b659c512bbf57f519e2365ce7a/img/policy_improvement.PNG -------------------------------------------------------------------------------- /img/policy_iteration.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alckasoc/Stanford-CS234-RL---Lecture-Notes/3ea1b85cb7bff6b659c512bbf57f519e2365ce7a/img/policy_iteration.PNG -------------------------------------------------------------------------------- /img/prove_monotonic.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alckasoc/Stanford-CS234-RL---Lecture-Notes/3ea1b85cb7bff6b659c512bbf57f519e2365ce7a/img/prove_monotonic.PNG -------------------------------------------------------------------------------- /img/rl_agent_types.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alckasoc/Stanford-CS234-RL---Lecture-Notes/3ea1b85cb7bff6b659c512bbf57f519e2365ce7a/img/rl_agent_types.PNG -------------------------------------------------------------------------------- /img/search_tree_path.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alckasoc/Stanford-CS234-RL---Lecture-Notes/3ea1b85cb7bff6b659c512bbf57f519e2365ce7a/img/search_tree_path.PNG -------------------------------------------------------------------------------- /img/value_iteration.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alckasoc/Stanford-CS234-RL---Lecture-Notes/3ea1b85cb7bff6b659c512bbf57f519e2365ce7a/img/value_iteration.PNG --------------------------------------------------------------------------------