├── .gitignore ├── README.md ├── notebooks ├── 1-VPG.ipynb ├── 2-DQN.ipynb ├── 3-PPO.ipynb └── utils.py └── requirements.txt /.gitignore: -------------------------------------------------------------------------------- 1 | 2 | env/ 3 | 4 | notebooks/\.ipynb_checkpoints/ 5 | 6 | notebooks/__pycache__/ 7 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # RL Implementations 2 | 3 | ## What is it? 4 | 5 | This repo contains a set of notebooks to reproduce reinforcement learning algorithms. 6 | 7 | 8 | ## Overview 9 | This repo mostly serves (self-)educational purposes. Therefore, the notebooks are mostly self-contained and only general helper functions are factored out, so that all relevant code is in one place. The models are built using Keras and TensorFlow. The repo is built with TensorFlow 2; however, since I was experimenting with TF2 and Keras for the first time, the earlier notebooks might contain some TF1-looking code. 10 | 11 | In the process from (paper) --> (implementation) --> (production-grade code) these notebooks lie in the middle. They don't explain the theory in detail (there are plenty of blogs, papers and books one can consult on that subject) but are also not optimized for efficiency and scalability, so that the step from theory to implementation is easy to follow. Heavily optimized code can sometimes obscure the underlying principles of an algorithm, which makes it harder to understand. 12 | 13 | The following algorithms are implemented at the moment: 14 | 15 | - Vanilla Policy Gradient 16 | - Deep Q-Learning 17 | - Proximal Policy Optimisation 18 | 19 | ## Installation 20 | The code in the notebooks relies on TensorFlow 2. 
To install all dependencies run: 21 | 22 | ```bash 23 | pip install -r requirements.txt 24 | ``` -------------------------------------------------------------------------------- /notebooks/1-VPG.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Vanilla Policy Gradient\n", 8 | "\n", 9 | "In this notebook the Vanilla Policy Gradient algorithm is implemented in TensorFlow." 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "### Import dependencies" 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": 20, 22 | "metadata": {}, 23 | "outputs": [ 24 | { 25 | "name": "stderr", 26 | "output_type": "stream", 27 | "text": [ 28 | "WARNING: Logging before flag parsing goes to stderr.\n", 29 | "W0822 20:45:52.838207 4453098944 deprecation.py:323] From /Users/leandro/git/reproduce-rl/env/lib/python3.7/site-packages/tensorflow/python/compat/v2_compat.py:65: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.\n", 30 | "Instructions for updating:\n", 31 | "non-resource variables are not supported in the long term\n" 32 | ] 33 | } 34 | ], 35 | "source": [ 36 | "import tensorflow.compat.v1 as tf\n", 37 | "import gym\n", 38 | "import numpy as np\n", 39 | "import matplotlib.pyplot as plt\n", 40 | "import pandas as pd\n", 41 | "from tqdm import tqdm\n", 42 | "from tensorflow.python.framework import ops\n", 43 | "tf.disable_v2_behavior()" 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "### Setup policy network" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": 21, 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "def mlp(input_dim, output_dim, shape=[128]):\n", 60 | " \"\"\"\n", 61 | " Setup multilayer perceptron policy network.\n", 62 
| " \n", 63 | " args:\n", 64 | " input_dim: dimension of input vector\n", 65 | " output_dim: dimension of output vector\n", 66 | " shape: shape of hidden layers\n", 67 | " \n", 68 | " returns:\n", 69 | " x_ph: placeholder input\n", 70 | " r_ph: placeholder reward\n", 71 | " a_ph: placeholder action\n", 72 | " probs: probabilities output\n", 73 | " log_probs: log probabilities output\n", 74 | " train_step: training operation\n", 75 | " \"\"\"\n", 76 | " \n", 77 | " # setup tensorflow placeholders\n", 78 | " x_ph = tf.placeholder(tf.float64, shape=[None]+list(input_dim), name='x_ph')\n", 79 | " r_ph = tf.placeholder(tf.float64, shape=[None, 1], name='r_ph')\n", 80 | " a_ph = tf.placeholder(tf.float64, shape=[None, output_dim], name='a_ph')\n", 81 | " \n", 82 | " # setup feedforward network\n", 83 | " tmp_x = x_ph\n", 84 | " for dim in shape:\n", 85 | " tmp_x = tf.layers.dense(tmp_x, dim, activation=tf.nn.relu)\n", 86 | " probs = tf.layers.dense(tmp_x, output_dim, activation=tf.nn.softmax, name='p_actions')\n", 87 | " \n", 88 | " # setup losses and optimizer\n", 89 | " log_probs = tf.log(tf.reduce_sum(tf.multiply(probs, a_ph), axis=-1), name='log_prob')\n", 90 | " loss = tf.multiply(log_probs, r_ph)\n", 91 | " loss = -tf.reshape(loss, [-1])\n", 92 | " train_step = tf.train.AdamOptimizer().minimize(loss)\n", 93 | " \n", 94 | " return x_ph, r_ph, a_ph, probs, log_probs, train_step" 95 | ] 96 | }, 97 | { 98 | "cell_type": "markdown", 99 | "metadata": {}, 100 | "source": [ 101 | "### Setup environment" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": 22, 107 | "metadata": {}, 108 | "outputs": [], 109 | "source": [ 110 | "env = gym.make('CartPole-v0')" 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": 23, 116 | "metadata": {}, 117 | "outputs": [ 118 | { 119 | "data": { 120 | "text/plain": [ 121 | "(Discrete(2), 2)" 122 | ] 123 | }, 124 | "execution_count": 23, 125 | "metadata": {}, 126 | "output_type": "execute_result" 
127 | } 128 | ], 129 | "source": [ 130 | "env.action_space, env.action_space.n" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": 24, 136 | "metadata": {}, 137 | "outputs": [ 138 | { 139 | "data": { 140 | "text/plain": [ 141 | "(4,)" 142 | ] 143 | }, 144 | "execution_count": 24, 145 | "metadata": {}, 146 | "output_type": "execute_result" 147 | } 148 | ], 149 | "source": [ 150 | "env.observation_space.shape" 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": 25, 156 | "metadata": {}, 157 | "outputs": [ 158 | { 159 | "data": { 160 | "text/plain": [ 161 | "array([-0.04322151, -0.0316813 , 0.04285916, 0.03108787])" 162 | ] 163 | }, 164 | "execution_count": 25, 165 | "metadata": {}, 166 | "output_type": "execute_result" 167 | } 168 | ], 169 | "source": [ 170 | "env.reset()" 171 | ] 172 | }, 173 | { 174 | "cell_type": "markdown", 175 | "metadata": {}, 176 | "source": [ 177 | "### Helper function" 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": 1, 183 | "metadata": {}, 184 | "outputs": [], 185 | "source": [ 186 | "def get_action(action_probs, epsilon, env, stochastic=True):\n", 187 | " \"\"\"\n", 188 | " Get action from the action space. With probability epsilon,\n", 189 | " a random action is sampled, otherwise the action_probs are\n", 190 | " used to get an action. 
If stochastic, the actions are sampled\n", 191 | " according to the probabilities of each action, otherwise the\n", 192 | " action with the highest probability is returned.\n", 193 | " \"\"\"\n", 194 | " \n", 195 | " if np.random.rand()>epsilon:\n", 196 | " if stochastic:\n", 197 | " action = np.random.choice(list(range(len(action_probs))), p=action_probs)\n", 198 | " else:\n", 199 | " action = np.argmax(action_probs)\n", 200 | " else:\n", 201 | " action = env.action_space.sample()\n", 202 | " return action\n", 203 | "\n", 204 | "def calc_discounted_rewards(r,gamma=0.9):\n", 205 | " \"\"\"\n", 206 | " Calculate the discounted future rewards with \n", 207 | " a gamma factor.\n", 208 | " \"\"\"\n", 209 | " discounted_rewards = []\n", 210 | " \n", 211 | " for i in range(len(r)):\n", 212 | " tmp_rewards = []\n", 213 | " for j in range(len(r)-i):\n", 214 | " tmp_rewards.append(r[i+j]*(gamma**j))\n", 215 | " discounted_rewards.append(np.sum(tmp_rewards))\n", 216 | " \n", 217 | " return np.array(discounted_rewards) " 218 | ] 219 | }, 220 | { 221 | "cell_type": "markdown", 222 | "metadata": {}, 223 | "source": [ 224 | "### Show discounted reward example" 225 | ] 226 | }, 227 | { 228 | "cell_type": "code", 229 | "execution_count": 28, 230 | "metadata": {}, 231 | "outputs": [ 232 | { 233 | "data": { 234 | "image/png": "\n", 235 | "text/plain": [ 236 | "
" 237 | ] 238 | }, 239 | "metadata": { 240 | "needs_background": "light" 241 | }, 242 | "output_type": "display_data" 243 | } 244 | ], 245 | "source": [ 246 | "test_array= [0]*30+[1,0,0,0,0,0,0,0,0,0,1,0,0,1,0]\n", 247 | "\n", 248 | "plt.plot(test_array)\n", 249 | "plt.plot(calc_discounted_rewards(test_array))\n", 250 | "plt.show()" 251 | ] 252 | }, 253 | { 254 | "cell_type": "markdown", 255 | "metadata": {}, 256 | "source": [ 257 | "### Training scheme for VPG" 258 | ] 259 | }, 260 | { 261 | "cell_type": "code", 262 | "execution_count": 29, 263 | "metadata": {}, 264 | "outputs": [], 265 | "source": [ 266 | "def VPG(env, n_steps=10**4, epsilon_range=[0.99, 0.1], render=False):\n", 267 | " \"\"\"\n", 268 | " Vanilla Policy Gradient implementation in TensorFlow.\n", 269 | " \n", 270 | " args:\n", 271 | " env: OpenAI gym environment\n", 272 | " n_steps=10**4: number of training steps\n", 273 | " epsilon_range=[0.99, 0.1]: epsilon decay range\n", 274 | " render=False: option to render environment\n", 275 | " \n", 276 | " returns:\n", 277 | " observations: list of observations of the last episode\n", 278 | " actions: list of one-hot actions of the last episode\n", 279 | " rewards: list of rewards of the last episode\n", 280 | " total_rewards: list of total rewards per episode\n", 281 | " total_discounted_rewards: list of discounted rewards at each step\n", 282 | " \"\"\"\n", 283 | " \n", 284 | " # get env information\n", 285 | " obs_shape = env.observation_space.shape\n", 286 | " action_space = env.action_space.n\n", 287 | " print('obs shape:',obs_shape,'| action space:', action_space)\n", 288 | " \n", 289 | " # setup tensorflow model\n", 290 | " ops.reset_default_graph()\n", 291 | " sess = tf.InteractiveSession()\n", 292 | " x_ph, r_ph, a_ph, probs, log_probs, train_step = mlp(obs_shape, action_space)\n", 293 | " init = tf.global_variables_initializer()\n", 294 | " sess = tf.Session()\n", 295 | " sess.run(init)\n", 296 | " \n", 297 | " # get epsilon decay\n", 298 | " epsilons 
= get_epsilons(epsilon_range, n_steps)\n", 299 | " \n", 300 | " # setup list for storage\n", 301 | " total_rewards = []\n", 302 | " total_discounted_rewards = []\n", 303 | " observations = []\n", 304 | " rewards = []\n", 305 | " actions = []\n", 306 | " \n", 307 | " # initialize baseline\n", 308 | " baseline=0\n", 309 | " \n", 310 | " # this is to reset env and get first obs\n", 311 | " game_done = True\n", 312 | " \n", 313 | " # rollout training\n", 314 | " for i in tqdm(range(n_steps)):\n", 315 | " if render:\n", 316 | " env.render()\n", 317 | " \n", 318 | " # when game is done, optimize network\n", 319 | " if game_done:\n", 320 | " if len(rewards)>0:\n", 321 | " \n", 322 | " # calculate discounted rewards and baseline\n", 323 | " discounted_rewards = calc_discounted_rewards(rewards, gamma=1)\n", 324 | " advantage = discounted_rewards-baseline\n", 325 | " \n", 326 | " # train policy network\n", 327 | " sess.run(train_step, feed_dict={x_ph: np.array(observations),\n", 328 | " r_ph: np.expand_dims(advantage,axis=1),\n", 329 | " a_ph: np.array(actions)})\n", 330 | " \n", 331 | " # update baseline\n", 332 | " baseline = np.mean(discounted_rewards)\n", 333 | " total_discounted_rewards += list(discounted_rewards)\n", 334 | " total_rewards.append(sum(rewards))\n", 335 | " \n", 336 | " obs = env.reset()\n", 337 | " observations = []\n", 338 | " rewards = []\n", 339 | " actions = []\n", 340 | " \n", 341 | " observations.append(obs)\n", 342 | " \n", 343 | " # get action probabilities from policy network\n", 344 | " action_probs = sess.run(probs, feed_dict={x_ph: np.expand_dims(obs, axis=0)})\n", 345 | " \n", 346 | " # get epsilon-greedy action\n", 347 | " action = get_action(np.squeeze(action_probs), epsilons[i], env)\n", 348 | " \n", 349 | " # update environment with chosen action\n", 350 | " obs, reward, game_done, info = env.step(action)\n", 351 | " \n", 352 | " # update lists\n", 353 | " rewards.append(reward)\n", 354 | " action_vec = np.zeros(action_space)\n", 355 | " 
action_vec[action]=1\n", 356 | " actions.append(action_vec)\n", 357 | " \n", 358 | " env.close()\n", 359 | " return observations, actions, rewards, total_rewards, total_discounted_rewards" 360 | ] 361 | }, 362 | { 363 | "cell_type": "markdown", 364 | "metadata": {}, 365 | "source": [ 366 | "### Run training" 367 | ] 368 | }, 369 | { 370 | "cell_type": "code", 371 | "execution_count": 30, 372 | "metadata": {}, 373 | "outputs": [ 374 | { 375 | "name": "stderr", 376 | "output_type": "stream", 377 | "text": [ 378 | "/Users/leandro/git/reproduce-rl/env/lib/python3.7/site-packages/tensorflow/python/client/session.py:1735: UserWarning: An interactive session is already active. This can cause out-of-memory errors in some cases. You must explicitly call `InteractiveSession.close()` to release resources held by the other session(s).\n", 379 | " warnings.warn('An interactive session is already active. This can '\n", 380 | "W0822 20:46:04.030754 4453098944 deprecation.py:323] From :27: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.\n", 381 | "Instructions for updating:\n", 382 | "Use keras.layers.dense instead.\n" 383 | ] 384 | }, 385 | { 386 | "name": "stdout", 387 | "output_type": "stream", 388 | "text": [ 389 | "obs shape: (4,) | action space: 2\n" 390 | ] 391 | }, 392 | { 393 | "name": "stderr", 394 | "output_type": "stream", 395 | "text": [ 396 | "100%|██████████| 100000/100000 [00:36<00:00, 2738.38it/s]\n" 397 | ] 398 | } 399 | ], 400 | "source": [ 401 | "obs, actions, rewards,total_rewards, discounted_rewards = VPG(env, n_steps=100000, render=False, epsilon_range=[0,0])" 402 | ] 403 | }, 404 | { 405 | "cell_type": "markdown", 406 | "metadata": {}, 407 | "source": [ 408 | "### Show results" 409 | ] 410 | }, 411 | { 412 | "cell_type": "code", 413 | "execution_count": 31, 414 | "metadata": {}, 415 | "outputs": [ 416 | { 417 | "name": "stderr", 418 | "output_type": "stream", 419 | "text": [ 420 | 
"/Users/leandro/git/reproduce-rl/env/lib/python3.7/site-packages/pandas/core/window.py:1833: FutureWarning: using a dict with renaming is deprecated and will be removed\n", 421 | "in a future version.\n", 422 | "\n", 423 | "For column-specific groupby renaming, use named aggregation\n", 424 | "\n", 425 | " >>> df.groupby(...).agg(name=('column', aggfunc))\n", 426 | "\n", 427 | " return super().aggregate(arg, *args, **kwargs)\n" 428 | ] 429 | }, 430 | { 431 | "data": { 432 | "image/png": "\n", 433 | "text/plain": [ 434 | "
" 435 | ] 436 | }, 437 | "metadata": { 438 | "needs_background": "light" 439 | }, 440 | "output_type": "display_data" 441 | }, 442 | { 443 | "data": { 444 | "image/png": "\n", 445 | "text/plain": [ 446 | "
" 447 | ] 448 | }, 449 | "metadata": { 450 | "needs_background": "light" 451 | }, 452 | "output_type": "display_data" 453 | } 454 | ], 455 | "source": [ 456 | "plot_reward(discounted_rewards, window=1000, y_label='discounted reward')\n", 457 | "plot_reward(total_rewards, window=100)" 458 | ] 459 | }, 460 | { 461 | "cell_type": "code", 462 | "execution_count": 32, 463 | "metadata": {}, 464 | "outputs": [], 465 | "source": [ 466 | "import antigravity" 467 | ] 468 | }, 469 | { 470 | "cell_type": "code", 471 | "execution_count": null, 472 | "metadata": {}, 473 | "outputs": [], 474 | "source": [] 475 | } 476 | ], 477 | "metadata": { 478 | "kernelspec": { 479 | "display_name": "Python 3", 480 | "language": "python", 481 | "name": "python3" 482 | }, 483 | "language_info": { 484 | "codemirror_mode": { 485 | "name": "ipython", 486 | "version": 3 487 | }, 488 | "file_extension": ".py", 489 | "mimetype": "text/x-python", 490 | "name": "python", 491 | "nbconvert_exporter": "python", 492 | "pygments_lexer": "ipython3", 493 | "version": "3.7.3" 494 | } 495 | }, 496 | "nbformat": 4, 497 | "nbformat_minor": 4 498 | } 499 | -------------------------------------------------------------------------------- /notebooks/2-DQN.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Deep Q-Learning\n", 8 | "In this notebook the Deep Q-Network (DQN) approach is implemented in TensorFlow." 
9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "### Import dependencies" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 20, 21 | "metadata": {}, 22 | "outputs": [], 23 | "source": [ 24 | "import tensorflow as tf\n", 25 | "import gym\n", 26 | "import numpy as np\n", 27 | "import matplotlib.pyplot as plt\n", 28 | "import pandas as pd\n", 29 | "\n", 30 | "from tqdm import tqdm_notebook\n", 31 | "\n", 32 | "from collections import deque\n", 33 | "from tensorflow.keras.models import Sequential\n", 34 | "from tensorflow.keras.layers import Dense\n", 35 | "from tensorflow.keras.optimizers import Adam\n", 36 | "import random\n", 37 | "\n", 38 | "from utils import plot_reward" 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "metadata": {}, 44 | "source": [ 45 | "### Setup environment" 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": 21, 51 | "metadata": {}, 52 | "outputs": [], 53 | "source": [ 54 | "env = gym.make('CartPole-v0')\n" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": 22, 60 | "metadata": {}, 61 | "outputs": [ 62 | { 63 | "name": "stdout", 64 | "output_type": "stream", 65 | "text": [ 66 | "Discrete(2) 2\n" 67 | ] 68 | } 69 | ], 70 | "source": [ 71 | "print(env.action_space, env.action_space.n)" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 23, 77 | "metadata": {}, 78 | "outputs": [ 79 | { 80 | "name": "stdout", 81 | "output_type": "stream", 82 | "text": [ 83 | "(4,)\n" 84 | ] 85 | } 86 | ], 87 | "source": [ 88 | "print(env.observation_space.shape)" 89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "metadata": {}, 94 | "source": [ 95 | "### Helper functions" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": 24, 101 | "metadata": {}, 102 | "outputs": [], 103 | "source": [ 104 | "def get_action(action_probs, epsilon, env, stochastic=True):\n", 105 | " \"\"\"\n", 106 | " Get action from 
the action space. With probability epsilon,\n", 107 | " a random action is sampled, otherwise the action_probs are\n", 108 | " used to get an action. If stochastic, the actions are sampled\n", 109 | " according to the probabilities of each action, otherwise the\n", 110 | " action with the highest probability is returned.\n", 111 | " \"\"\"\n", 112 | " \n", 113 | " if np.random.rand()>epsilon:\n", 114 | " if stochastic:\n", 115 | " action = np.random.choice(list(range(len(action_probs))), p=action_probs)\n", 116 | " else:\n", 117 | " action = np.argmax(action_probs)\n", 118 | " else:\n", 119 | " action = env.action_space.sample()\n", 120 | " return action\n", 121 | "\n", 122 | "def calc_discounted_rewards(r,gamma=0.9):\n", 123 | " \"\"\"\n", 124 | " Calculate the discounted future rewards with \n", 125 | " a gamma factor.\n", 126 | " \"\"\"\n", 127 | " discounted_rewards = []\n", 128 | " \n", 129 | " for i in range(len(r)):\n", 130 | " tmp_rewards = []\n", 131 | " for j in range(len(r)-i):\n", 132 | " tmp_rewards.append(r[i+j]*(gamma**j))\n", 133 | " discounted_rewards.append(np.sum(tmp_rewards))\n", 134 | " \n", 135 | " return np.array(discounted_rewards)" 136 | ] 137 | }, 138 | { 139 | "cell_type": "markdown", 140 | "metadata": {}, 141 | "source": [ 142 | "### Memory for replay buffer" 143 | ] 144 | }, 145 | { 146 | "cell_type": "code", 147 | "execution_count": 25, 148 | "metadata": {}, 149 | "outputs": [], 150 | "source": [ 151 | "class Memory:\n", 152 | " \n", 153 | " def __init__(self, memory_size=None):\n", 154 | " self._memory = deque(maxlen=memory_size)\n", 155 | " \n", 156 | " def replay(self, n):\n", 157 | " return random.sample(self._memory, n)\n", 158 | " \n", 159 | " def memorize(self, elements):\n", 160 | " self._memory.append(elements)\n", 161 | " \n", 162 | " def __len__(self):\n", 163 | " return len(self._memory)" 164 | ] 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "metadata": {}, 169 | "source": [ 170 | "### Setup Q-network" 171 | ] 172 | 
}, 173 | { 174 | "cell_type": "code", 175 | "execution_count": 26, 176 | "metadata": {}, 177 | "outputs": [], 178 | "source": [ 179 | "def qforward(observation_space, action_space, shape=[24,24], lr=0.001):\n", 180 | " \n", 181 | " model = Sequential()\n", 182 | " \n", 183 | " model.add(Dense(shape[0], input_shape=observation_space, activation=\"relu\"))\n", 184 | " for dim in shape[1:]:\n", 185 | " model.add(Dense(dim, activation=\"relu\"))\n", 186 | " model.add(Dense(action_space, activation=\"linear\"))\n", 187 | " \n", 188 | " model.compile(loss=\"mse\", optimizer=Adam(lr=lr))\n", 189 | "\n", 190 | " return model" 191 | ] 192 | }, 193 | { 194 | "cell_type": "markdown", 195 | "metadata": {}, 196 | "source": [ 197 | "### Setup replay training" 198 | ] 199 | }, 200 | { 201 | "cell_type": "code", 202 | "execution_count": 27, 203 | "metadata": {}, 204 | "outputs": [], 205 | "source": [ 206 | "def model_replay(model, memory, batch_size, gamma):\n", 207 | " \n", 208 | " if len(memory)>> df.groupby(...).agg(name=('column', aggfunc))\n", 383 | "\n", 384 | " return super().aggregate(arg, *args, **kwargs)\n" 385 | ] 386 | }, 387 | { 388 | "data": { 389 | "image/png": "\n", 390 | "text/plain": [ 391 | "
" 392 | ] 393 | }, 394 | "metadata": { 395 | "needs_background": "light" 396 | }, 397 | "output_type": "display_data" 398 | } 399 | ], 400 | "source": [ 401 | "plot_reward(total_rewards, window=10)" 402 | ] 403 | } 404 | ], 405 | "metadata": { 406 | "kernelspec": { 407 | "display_name": "Python 3", 408 | "language": "python", 409 | "name": "python3" 410 | }, 411 | "language_info": { 412 | "codemirror_mode": { 413 | "name": "ipython", 414 | "version": 3 415 | }, 416 | "file_extension": ".py", 417 | "mimetype": "text/x-python", 418 | "name": "python", 419 | "nbconvert_exporter": "python", 420 | "pygments_lexer": "ipython3", 421 | "version": "3.7.3" 422 | } 423 | }, 424 | "nbformat": 4, 425 | "nbformat_minor": 4 426 | } 427 | -------------------------------------------------------------------------------- /notebooks/3-PPO.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 16, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "name": "stdout", 10 | "output_type": "stream", 11 | "text": [ 12 | "2.0.0-beta1\n" 13 | ] 14 | } 15 | ], 16 | "source": [ 17 | "import tensorflow.compat.v1 as tf\n", 18 | "\n", 19 | "#import tensorflow as tf\n", 20 | "import gym\n", 21 | "import numpy as np\n", 22 | "import matplotlib.pyplot as plt\n", 23 | "import pandas as pd\n", 24 | "\n", 25 | "from tqdm import tqdm_notebook\n", 26 | "\n", 27 | "from tensorflow.keras.models import Sequential, Model, clone_model\n", 28 | "from tensorflow.keras.layers import Dense, Input\n", 29 | "from tensorflow.keras.optimizers import Adam\n", 30 | "\n", 31 | "from utils import get_epsilons, plot_reward\n", 32 | "\n", 33 | "print(tf.__version__)\n", 34 | "tf.disable_v2_behavior()" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": {}, 40 | "source": [ 41 | "## Create environment" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": 5, 47 | "metadata": {}, 48 | 
"outputs": [], 49 | "source": [ 50 | "env = gym.make('CartPole-v0')" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": 6, 56 | "metadata": {}, 57 | "outputs": [ 58 | { 59 | "data": { 60 | "text/plain": [ 61 | "2" 62 | ] 63 | }, 64 | "execution_count": 6, 65 | "metadata": {}, 66 | "output_type": "execute_result" 67 | } 68 | ], 69 | "source": [ 70 | "env.action_space.n" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": 7, 76 | "metadata": {}, 77 | "outputs": [ 78 | { 79 | "data": { 80 | "text/plain": [ 81 | "4" 82 | ] 83 | }, 84 | "execution_count": 7, 85 | "metadata": {}, 86 | "output_type": "execute_result" 87 | } 88 | ], 89 | "source": [ 90 | "env.observation_space.shape[0]" 91 | ] 92 | }, 93 | { 94 | "cell_type": "markdown", 95 | "metadata": {}, 96 | "source": [ 97 | "## Define policy network" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": 8, 103 | "metadata": {}, 104 | "outputs": [], 105 | "source": [ 106 | "def mlp(observation_space, action_space, shape=[64,64], lr=0.001):\n", 107 | " \n", 108 | " model = Sequential()\n", 109 | " \n", 110 | " model.add(Dense(shape[0], input_shape=observation_space, activation=\"tanh\"))\n", 111 | " for dim in shape[1:]:\n", 112 | " model.add(Dense(dim, activation=\"tanh\"))\n", 113 | " model.add(Dense(action_space, activation=\"softmax\"))\n", 114 | " \n", 115 | " return model" 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": 9, 121 | "metadata": {}, 122 | "outputs": [], 123 | "source": [ 124 | "def ppo_loss_function(r, advantage, eps):\n", 125 | " def loss(y_true, y_pred):\n", 126 | " \n", 127 | " ppo_loss = - tf.reduce_mean(tf.math.minimum(tf.multiply(r, advantage),\n", 128 | " tf.multiply(tf.clip_by_value(r,\n", 129 | " tf.subtract(1.,eps),\n", 130 | " tf.add(1.,eps)),\n", 131 | " advantage)))\n", 132 | " return ppo_loss\n", 133 | " return loss" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": 10, 
139 | "metadata": {}, 140 | "outputs": [], 141 | "source": [ 142 | "def get_ppo_model(policy_model, observation_space, action_space):\n", 143 | " \n", 144 | " # placeholders:\n", 145 | " obs = Input(shape=(observation_space,))\n", 146 | " action = Input(shape=(action_space,))\n", 147 | " old_prob = Input(shape=(1,))\n", 148 | " advantage = Input(shape=(1,))\n", 149 | " eps = Input(shape=(1,))\n", 150 | "\n", 151 | " out_action = policy_model(obs)\n", 152 | " \n", 153 | " # ppo loss\n", 154 | " prob = tf.reduce_sum(tf.multiply(out_action, action), axis=1, keepdims=True)\n", 155 | " r = tf.divide(prob, old_prob)\n", 156 | " \n", 157 | " # ppo model\n", 158 | " ppo_model = Model(inputs=[obs, action, old_prob, advantage, eps], outputs=[out_action])\n", 159 | "\n", 160 | " # compile model\n", 161 | " optimizer = Adam(lr=0.001)\n", 162 | " ppo_model.compile(optimizer=optimizer, loss=ppo_loss_function(r, advantage, eps))\n", 163 | " \n", 164 | " return ppo_model" 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": 11, 170 | "metadata": {}, 171 | "outputs": [], 172 | "source": [ 173 | "def get_action(action_probs, epsilon, env, stochastic=True):\n", 174 | " \"\"\"\n", 175 | " Get action from the action space. With probability epsilon,\n", 176 | " a random action is sampled, otherwise the action_probs are\n", 177 | " used to get an action. 
If stochastic, the actions are sampled\n", 178 | " according to the probabilities of each action, otherwise the\n", 179 | " action with the highest probability is returned.\n", 180 | " \"\"\"\n", 181 | " \n", 182 | " if np.random.rand()>epsilon:\n", 183 | " if stochastic:\n", 184 | " action = np.random.choice(list(range(len(action_probs))), p=action_probs)\n", 185 | " else:\n", 186 | " action = np.argmax(action_probs)\n", 187 | " else:\n", 188 | " action = env.action_space.sample()\n", 189 | " return action\n", 190 | "\n", 191 | "def calc_discounted_rewards(r,gamma=0.9):\n", 192 | " \"\"\"\n", 193 | " Calculate the discounted future rewards with \n", 194 | " a gamma factor.\n", 195 | " \"\"\"\n", 196 | " discounted_rewards = []\n", 197 | " \n", 198 | " for i in range(len(r)):\n", 199 | " tmp_rewards = []\n", 200 | " for j in range(len(r)-i):\n", 201 | " tmp_rewards.append(r[i+j]*(gamma**j))\n", 202 | " discounted_rewards.append(np.sum(tmp_rewards))\n", 203 | " \n", 204 | " return np.array(discounted_rewards)\n", 205 | " " 206 | ] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "execution_count": 12, 211 | "metadata": {}, 212 | "outputs": [], 213 | "source": [ 214 | "def PPO(env, model, n_steps=10**4, epsilon_range=[0.99, 0.1], render=False, eps=0.2, max_steps=200):\n", 215 | " \"\"\"\n", 216 | " Proximal Policy Optimisation implementation in TensorFlow 2.\n", 217 | " \n", 218 | " args:\n", 219 | " env: OpenAI gym environment\n", 220 | " n_steps=10**4: number of training steps\n", 221 | " epsilon_range=[0.99, 0.1]: epsilon decay range\n", 222 | " \"\"\"\n", 223 | " \n", 224 | " \n", 225 | " obs_shape = env.observation_space\n", 226 | " action_space = env.action_space.n\n", 227 | " print('obs shape:',obs_shape,'| action space:', action_space)\n", 228 | " \n", 229 | " \n", 230 | " epsilons = get_epsilons(epsilon_range, n_steps)\n", 231 | " \n", 232 | " dummy_scalar = np.zeros((1,1))\n", 233 | " dummy_action = np.zeros((1, action_space))\n", 234 | " dummy_inputs = 
[dummy_action] + 3*[dummy_scalar]\n", 235 | " \n", 236 | " total_rewards = []\n", 237 | " total_discounted_rewards = []\n", 238 | " observations = []\n", 239 | " rewards = []\n", 240 | " actions = []\n", 241 | " old_prob = []\n", 242 | " losses = []\n", 243 | " gradients = []\n", 244 | " old_model = clone_model(model)\n", 245 | "\n", 246 | " baseline=0\n", 247 | " current_steps = 0\n", 248 | " game_done = True\n", 249 | " pbar = tqdm_notebook(range(n_steps))\n", 250 | " \n", 251 | " for i in pbar:\n", 252 | " if render:\n", 253 | " env.render()\n", 254 | "\n", 255 | " if game_done:\n", 256 | " old_model.set_weights(model.get_weights())\n", 257 | "\n", 258 | " if len(rewards)>0:\n", 259 | " discounted_rewards = calc_discounted_rewards(rewards, gamma=0.9)\n", 260 | " advantage = (discounted_rewards-baseline)\n", 261 | " \n", 262 | " samples = len(observations)\n", 263 | " model.fit(x=[np.array(observations),\n", 264 | " np.array(actions),\n", 265 | " np.array(old_prob),\n", 266 | " np.array(advantage),\n", 267 | " np.tile(eps, samples)],\n", 268 | " y=np.array(old_prob), verbose=0)\n", 269 | " \n", 270 | " baseline = np.mean(discounted_rewards)\n", 271 | " total_discounted_rewards += list(discounted_rewards)\n", 272 | " total_rewards.append(sum(rewards))\n", 273 | " pbar.set_description('reward: ' +str(total_rewards[-1]))\n", 274 | " \n", 275 | " obs = env.reset()\n", 276 | " observations = []\n", 277 | " rewards = []\n", 278 | " actions = []\n", 279 | " old_prob = []\n", 280 | " current_steps=0\n", 281 | " \n", 282 | " current_steps += 1\n", 283 | " observations.append(obs)\n", 284 | " action_probs = model.predict([np.expand_dims(obs, axis=0)] + dummy_inputs)\n", 285 | " \n", 286 | " action = get_action(np.squeeze(action_probs), epsilons[i], env, stochastic=False)\n", 287 | " old_prob.append(old_model.predict([np.expand_dims(obs, axis=0)] + dummy_inputs)[:, action])\n", 288 | " \n", 289 | " obs, reward, game_done,_ = env.step(action)\n", 290 | " 
rewards.append(reward)\n", 291 | " action_vec = np.zeros(action_space)\n", 292 | " action_vec[action]=1\n", 293 | " actions.append(action_vec)\n", 294 | " \n", 295 | " if current_steps>=max_steps:\n", 296 | " game_done=True\n", 297 | " \n", 298 | " #env.close()\n", 299 | "\n", 300 | " return model, total_rewards" 301 | ] 302 | }, 303 | { 304 | "cell_type": "code", 305 | "execution_count": 13, 306 | "metadata": {}, 307 | "outputs": [], 308 | "source": [ 309 | "policy_model = mlp([env.observation_space.shape[0]], env.action_space.n)\n", 310 | "ppo_model = get_ppo_model(policy_model, env.observation_space.shape[0], env.action_space.n)" 311 | ] 312 | }, 313 | { 314 | "cell_type": "code", 315 | "execution_count": 14, 316 | "metadata": {}, 317 | "outputs": [ 318 | { 319 | "name": "stderr", 320 | "output_type": "stream", 321 | "text": [ 322 | "W0824 21:51:53.352210 4704441792 deprecation.py:506] From /Users/leandro/git/reproduce-rl/env/lib/python3.7/site-packages/tensorflow/python/ops/init_ops.py:97: calling GlorotUniform.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.\n", 323 | "Instructions for updating:\n", 324 | "Call initializer instance with the dtype argument instead of passing it to the constructor\n", 325 | "W0824 21:51:53.353631 4704441792 deprecation.py:506] From /Users/leandro/git/reproduce-rl/env/lib/python3.7/site-packages/tensorflow/python/ops/init_ops.py:97: calling Zeros.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.\n", 326 | "Instructions for updating:\n", 327 | "Call initializer instance with the dtype argument instead of passing it to the constructor\n" 328 | ] 329 | }, 330 | { 331 | "name": "stdout", 332 | "output_type": "stream", 333 | "text": [ 334 | "obs shape: Box(4,) | action space: 2\n" 335 | ] 336 | }, 337 | { 338 | "data": { 339 | "application/vnd.jupyter.widget-view+json": { 340 | "model_id": 
"4d395e1ae41d41b09fd197037c86b917", 341 | "version_major": 2, 342 | "version_minor": 0 343 | }, 344 | "text/plain": [ 345 | "HBox(children=(IntProgress(value=0, max=100000), HTML(value='')))" 346 | ] 347 | }, 348 | "metadata": {}, 349 | "output_type": "display_data" 350 | }, 351 | { 352 | "name": "stderr", 353 | "output_type": "stream", 354 | "text": [ 355 | "W0824 21:51:53.735882 4704441792 deprecation.py:323] From /Users/leandro/git/reproduce-rl/env/lib/python3.7/site-packages/tensorflow/python/ops/math_grad.py:1250: add_dispatch_support..wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.\n", 356 | "Instructions for updating:\n", 357 | "Use tf.where in 2.0, which has the same broadcast rule as np.where\n" 358 | ] 359 | }, 360 | { 361 | "name": "stdout", 362 | "output_type": "stream", 363 | "text": [ 364 | "\n" 365 | ] 366 | } 367 | ], 368 | "source": [ 369 | "model, total_rewards = PPO(env, ppo_model, n_steps=100000, render=False, epsilon_range=[1,0.1], eps=0.2)" 370 | ] 371 | }, 372 | { 373 | "cell_type": "code", 374 | "execution_count": 17, 375 | "metadata": {}, 376 | "outputs": [ 377 | { 378 | "name": "stderr", 379 | "output_type": "stream", 380 | "text": [ 381 | "/Users/leandro/git/reproduce-rl/env/lib/python3.7/site-packages/pandas/core/window.py:1833: FutureWarning: using a dict with renaming is deprecated and will be removed\n", 382 | "in a future version.\n", 383 | "\n", 384 | "For column-specific groupby renaming, use named aggregation\n", 385 | "\n", 386 | " >>> df.groupby(...).agg(name=('column', aggfunc))\n", 387 | "\n", 388 | " return super().aggregate(arg, *args, **kwargs)\n" 389 | ] 390 | }, 391 | { 392 | "data": { 393 | "image/png": "\n", 394 | "text/plain": [ 395 | "
" 396 | ] 397 | }, 398 | "metadata": { 399 | "needs_background": "light" 400 | }, 401 | "output_type": "display_data" 402 | } 403 | ], 404 | "source": [ 405 | "plot_reward(total_rewards, window=100)" 406 | ] 407 | }, 408 | { 409 | "cell_type": "code", 410 | "execution_count": null, 411 | "metadata": {}, 412 | "outputs": [], 413 | "source": [] 414 | } 415 | ], 416 | "metadata": { 417 | "kernelspec": { 418 | "display_name": "Python 3", 419 | "language": "python", 420 | "name": "python3" 421 | }, 422 | "language_info": { 423 | "codemirror_mode": { 424 | "name": "ipython", 425 | "version": 3 426 | }, 427 | "file_extension": ".py", 428 | "mimetype": "text/x-python", 429 | "name": "python", 430 | "nbconvert_exporter": "python", 431 | "pygments_lexer": "ipython3", 432 | "version": "3.7.3" 433 | } 434 | }, 435 | "nbformat": 4, 436 | "nbformat_minor": 4 437 | } 438 | -------------------------------------------------------------------------------- /notebooks/utils.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | import matplotlib.pyplot as plt 4 | 5 | def plot_reward(rewards, window, x_label='episodes', y_label='reward'): 6 | """ 7 | Function to plot rewards with a rolling mean and standard deviation. 8 | """ 9 | steps = window 10 | 11 | df = pd.DataFrame({'rewards': rewards}) 12 | m = df.rolling(steps, center=True).agg(['mean', 'std']) 13 | m.columns = m.columns.droplevel(0) 14 | ax = m['mean'].plot() 15 | ax.fill_between(m.index, m['mean'] - m['std'], m['mean'] + m['std'], 16 | alpha=.25) 17 | plt.tight_layout() 18 | plt.ylabel(y_label) 19 | plt.xlabel(x_label) 20 | plt.show() 21 | 22 | def get_epsilons(epsilon_range, n_steps): 23 | """ 24 | Linear decay of epsilon in n_steps.
25 | """ 26 | epsilon = np.linspace(epsilon_range[0], epsilon_range[1], n_steps) 27 | return epsilon 28 | 29 | def get_exp_epsilons(epsilon_range, epsilon_decay, n_steps): 30 | """ 31 | Exponential decay of epsilon in n_steps. 32 | """ 33 | tmp_epsilon = epsilon_range[0] 34 | min_epsilon = epsilon_range[1] 35 | epsilons = [] 36 | for i in range(n_steps): 37 | epsilons.append(max([tmp_epsilon, min_epsilon])) 38 | tmp_epsilon *= epsilon_decay 39 | 40 | return epsilons -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | gym==0.14.0 2 | jupyterlab==1.0.9 3 | matplotlib==3.1.0 4 | numpy==1.17.0 5 | pandas==0.25.0 6 | tensorflow==2.0.0b1 7 | tqdm==4.34.0 --------------------------------------------------------------------------------
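The two epsilon schedules in `notebooks/utils.py` can be compared side by side with a minimal sketch like the one below. The function bodies restate the ones from `utils.py` so the snippet runs on its own; the step count and the `epsilon_decay=0.5` value are illustrative assumptions, not settings taken from the notebooks.

```python
import numpy as np


def get_epsilons(epsilon_range, n_steps):
    """Linear decay from epsilon_range[0] to epsilon_range[1] over n_steps."""
    return np.linspace(epsilon_range[0], epsilon_range[1], n_steps)


def get_exp_epsilons(epsilon_range, epsilon_decay, n_steps):
    """Exponential decay, clipped from below at epsilon_range[1]."""
    tmp_epsilon, min_epsilon = epsilon_range
    epsilons = []
    for _ in range(n_steps):
        epsilons.append(max(tmp_epsilon, min_epsilon))
        tmp_epsilon *= epsilon_decay
    return epsilons


# Illustrative values only: 10 steps, decaying from 1.0 down to 0.1.
linear = get_epsilons([1.0, 0.1], n_steps=10)
exponential = get_exp_epsilons([1.0, 0.1], epsilon_decay=0.5, n_steps=10)

for step, (lin, exp) in enumerate(zip(linear, exponential)):
    print(f"step {step}: linear={lin:.3f} exponential={exp:.3f}")
```

The linear schedule reaches the minimum only on the last step, while the exponential one hits the floor as soon as the decayed value drops below it and then stays there.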