├── Chapter01 ├── .ipynb_checkpoints │ └── 1.1 What is Reinforcement Learning-checkpoint.ipynb ├── 1.1 What is Reinforcement Learning.ipynb └── images │ ├── B09792_01_01.png │ ├── B09792_01_02.png │ └── B09792_01_03.png ├── Chapter02 ├── .ipynb_checkpoints │ ├── 2.09 Building a Video Game Bot -checkpoint.ipynb │ ├── 2.1 Basic Simulations-checkpoint.ipynb │ ├── 2.10 TensorFlow Fundamentals-checkpoint.ipynb │ ├── 2.11 TensorBoard-checkpoint.ipynb │ ├── 2.2 Training an agent to Walk-checkpoint.ipynb │ ├── 2.7 Basic Simulations-checkpoint.ipynb │ ├── 2.8 Training an Robot to Walk-checkpoint.ipynb │ ├── 2.9 Building a Video Game Bot -checkpoint.ipynb │ ├── TensorBoard-checkpoint.ipynb │ ├── TensorFlow Basics-checkpoint.ipynb │ └── Video Game Bot using OpenAI Universe-checkpoint.ipynb ├── 2.07 Basic Simulations.ipynb ├── 2.08 Training an Robot to Walk.ipynb ├── 2.09 Building a Video Game Bot .ipynb ├── 2.10 TensorFlow Fundamentals.ipynb ├── 2.11 TensorBoard.ipynb └── logs │ └── events.out.tfevents.1527762800.sudharsan ├── Chapter03 ├── .ipynb_checkpoints │ ├── 3.1 Value Iteration - Frozen Lake Problem-checkpoint.ipynb │ ├── 3.12 Value Iteration - Frozen Lake Problem-checkpoint.ipynb │ ├── 3.13 Policy Iteration - Frozen Lake Problem-checkpoint.ipynb │ └── 3.2 Policy Iteration - Frozen Lake Problem-checkpoint.ipynb ├── 3.12 Value Iteration - Frozen Lake Problem.ipynb ├── 3.13 Policy Iteration - Frozen Lake Problem.ipynb └── images │ └── B09792_03_50.png ├── Chapter04 ├── .ipynb_checkpoints │ ├── 4.1 Estimating Value of Pi using Monte Carlo-checkpoint.ipynb │ ├── 4.2 BlackJack with First visit MC-checkpoint.ipynb │ ├── 4.2 Estimating Value of Pi using Monte Carlo-checkpoint.ipynb │ └── 4.6 BlackJack with First visit MC-checkpoint.ipynb ├── 4.2 Estimating Value of Pi using Monte Carlo.ipynb └── 4.6 BlackJack with First visit MC.ipynb ├── Chapter05 ├── .ipynb_checkpoints │ ├── 5.5 Taxi Problem - Q Learning-checkpoint.ipynb │ └── 5.7 Taxi Problem - SARSA-checkpoint.ipynb ├── 5.5 Taxi Problem - Q Learning.ipynb └── 5.7 Taxi Problem - SARSA.ipynb ├── Chapter06 ├── .ipynb_checkpoints │ ├── 6.1 MAB - Various Exploration Strategies-checkpoint.ipynb │ └── 6.7 Identifying Right AD Banner Using MAB-checkpoint.ipynb ├── 6.1 MAB - Various Exploration Strategies.ipynb ├── 6.7 Identifying Right AD Banner Using MAB.ipynb └── images │ └── B09792_06_01.png ├── Chapter07 ├── .ipynb_checkpoints │ ├── 7.10 Generating Song Lyrics Using LSTM RNN-checkpoint.ipynb │ ├── 7.13 Classifying Fashion Products Using CNN-checkpoint.ipynb │ └── 7.6 Neural Network Using Tensorflow-checkpoint.ipynb ├── 7.10 Generating Song Lyrics Using LSTM RNN.ipynb ├── 7.13 Classifying Fashion Products Using CNN.ipynb ├── 7.6 Neural Network Using Tensorflow.ipynb └── data │ ├── ZaynLyrics.txt │ ├── fashion │ ├── t10k-images-idx3-ubyte.gz │ ├── t10k-labels-idx1-ubyte.gz │ ├── train-images-idx3-ubyte.gz │ └── train-labels-idx1-ubyte.gz │ └── mnist │ ├── t10k-images-idx3-ubyte.gz │ ├── t10k-labels-idx1-ubyte.gz │ ├── train-images-idx3-ubyte.gz │ └── train-labels-idx1-ubyte.gz ├── Chapter08 ├── .ipynb_checkpoints │ └── 8.8 Building an Agent to Play Atari Games-checkpoint.ipynb ├── 8.8 Building an Agent to Play Atari Games.ipynb └── logs │ ├── events.out.tfevents.1526989751.sudharsan │ ├── events.out.tfevents.1526990072.sudharsan │ └── events.out.tfevents.1528714237.sudharsan ├── Chapter09 ├── .ipynb_checkpoints │ ├── 9.4 Basic Doom Game-checkpoint.ipynb │ └── 9.5 Doom Game Using DRQN-checkpoint.ipynb ├── 9.4 Basic Doom Game.ipynb ├── 9.5 Doom Game 
Using DRQN.ipynb ├── basic.cfg ├── basic.wad ├── deathmatch.cfg └── deathmatch.wad ├── Chapter10 ├── .ipynb_checkpoints │ └── 10.5 Drive up the Mountain Using A3C-checkpoint.ipynb ├── 10.5 Drive up the Mountain Using A3C.ipynb └── logs │ └── events.out.tfevents.1528713441.sudharsan ├── Chapter11 ├── .ipynb_checkpoints │ ├── 11.2 Lunar Lander Using Policy Gradients-checkpoint.ipynb │ └── 11.3 Swinging Up the Pendulum Using DDPG-checkpoint.ipynb ├── 11.2 Lunar Lander Using Policy Gradients.ipynb ├── 11.3 Swinging Up the Pendulum Using DDPG.ipynb └── logs │ └── events.out.tfevents.1528712442.sudharsan ├── Chapter12 ├── .ipynb_checkpoints │ ├── 12.1 Environment Wrapper Functions-checkpoint.ipynb │ ├── 12.2 Dueling network-checkpoint.ipynb │ ├── 12.3 Replay Memory-checkpoint.ipynb │ ├── 12.4 Training the network-checkpoint.ipynb │ └── 12.5 Car Racing-checkpoint.ipynb ├── 12.1 Environment Wrapper Functions.ipynb ├── 12.2 Dueling network.ipynb ├── 12.3 Replay Memory.ipynb ├── 12.4 Training the network.ipynb └── 12.5 Car Racing.ipynb ├── Chapter13 ├── .ipynb_checkpoints │ ├── 13.3 Deep Q Learning From Demonstrations-checkpoint.ipynb │ └── 13.4 Hindsight Experience Replay-checkpoint.ipynb ├── 13.3 Deep Q Learning From Demonstrations.ipynb ├── 13.4 Hindsight Experience Replay.ipynb └── images │ ├── B09792_13_01.png │ └── B09792_13_02.png ├── LICENSE └── README.md /Chapter01/.ipynb_checkpoints/1.1 What is Reinforcement Learning-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## What is Reinforcement Learning?" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Consider you are teaching the dog to catch a ball, but you cannot teach the dog explicitly to\n", 15 | "catch a ball, instead, you will just throw a ball, every time the dog catches a ball, you will\n", 16 | "give a cookie. If it fails to catch a dog, you will not give a cookie. So the dog will figure out\n", 17 | "what actions it does that made it receive a cookie and repeat that action." 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": {}, 23 | "source": [ 24 | "Similarly in an RL environment, you will not teach the agent what to do or how to do,\n", 25 | "instead, you will give feedback to the agent for each action it does. The feedback may be\n", 26 | "positive (reward) or negative (punishment). The learning system which receives the\n", 27 | "punishment will improve itself. Thus it is a trial and error process. The reinforcement\n", 28 | "learning algorithm retains outputs that maximize the received reward over time. In the\n", 29 | "above analogy, the dog represents the agent, giving a cookie to the dog on catching a ball is\n", 30 | "a reward and not giving a cookie is punishment.\n", 31 | "\n", 32 | "There might be delayed rewards. You may not get a reward at each step. A reward may be\n", 33 | "given only after the completion of the whole task. In some cases, you get a reward at each\n", 34 | "step to find out that whether you are making any mistake.\n", 35 | "\n", 36 | "An RL agent can explore for different actions which might give a good reward or it can\n", 37 | "(exploit) use the previous action which resulted in a good reward. If the RL agent explores\n", 38 | "different actions, there is a great possibility to get a poor reward. 
If the RL agent exploits\n", 39 | "past action, there is also a great possibility of missing out the best action which might give a\n", 40 | "good reward. There is always a trade-off between exploration and exploitation. We cannot\n", 41 | "perform both exploration and exploitation at the same time. We will discuss exploration exploitation\n", 42 | "dilemma detail in the upcoming chapters.\n", 43 | "\n", 44 | "Say, If you want to teach a robot to walk, without getting stuck by hitting at the mountain,\n", 45 | "you will not explicitly teach the robot not to go in the direction of mountain," 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "![title](images/B09792_01_01.png)" 53 | ] 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "metadata": {}, 58 | "source": [ 59 | "Instead, if the robot hits and get stuck on the mountain you will reduce 10 points so that\n", 60 | "robot will understand that hitting mountain will give it a negative reward so it will not go\n", 61 | "in that direction again." 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "![title](images/B09792_01_02.png)" 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "metadata": {}, 74 | "source": [ 75 | "And you will give 20 points to the robot when it walks in the right direction without getting\n", 76 | "stuck. So robot will understand which is the right path to rewards and try to maximize the\n", 77 | "rewards by going in a right direction." 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": { 83 | "collapsed": true 84 | }, 85 | "source": [ 86 | "![title](images/B09792_01_03.png)" 87 | ] 88 | } 89 | ], 90 | "metadata": { 91 | "kernelspec": { 92 | "display_name": "Python [conda env:anaconda]", 93 | "language": "python", 94 | "name": "conda-env-anaconda-py" 95 | }, 96 | "language_info": { 97 | "codemirror_mode": { 98 | "name": "ipython", 99 | "version": 2 100 | }, 101 | "file_extension": ".py", 102 | "mimetype": "text/x-python", 103 | "name": "python", 104 | "nbconvert_exporter": "python", 105 | "pygments_lexer": "ipython2", 106 | "version": "2.7.11" 107 | } 108 | }, 109 | "nbformat": 4, 110 | "nbformat_minor": 2 111 | } 112 | -------------------------------------------------------------------------------- /Chapter01/1.1 What is Reinforcement Learning.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## What is Reinforcement Learning?" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Consider you are teaching the dog to catch a ball, but you cannot teach the dog explicitly to\n", 15 | "catch a ball, instead, you will just throw a ball, every time the dog catches a ball, you will\n", 16 | "give a cookie. If it fails to catch a dog, you will not give a cookie. So the dog will figure out\n", 17 | "what actions it does that made it receive a cookie and repeat that action." 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": {}, 23 | "source": [ 24 | "Similarly in an RL environment, you will not teach the agent what to do or how to do,\n", 25 | "instead, you will give feedback to the agent for each action it does. The feedback may be\n", 26 | "positive (reward) or negative (punishment). The learning system which receives the\n", 27 | "punishment will improve itself. Thus it is a trial and error process. 
The reinforcement\n", 28 | "learning algorithm retains outputs that maximize the received reward over time. In the\n", 29 | "above analogy, the dog represents the agent, giving a cookie to the dog on catching a ball is\n", 30 | "a reward and not giving a cookie is punishment.\n", 31 | "\n", 32 | "There might be delayed rewards. You may not get a reward at each step. A reward may be\n", 33 | "given only after the completion of the whole task. In some cases, you get a reward at each\n", 34 | "step to find out that whether you are making any mistake.\n", 35 | "\n", 36 | "An RL agent can explore for different actions which might give a good reward or it can\n", 37 | "(exploit) use the previous action which resulted in a good reward. If the RL agent explores\n", 38 | "different actions, there is a great possibility to get a poor reward. If the RL agent exploits\n", 39 | "past action, there is also a great possibility of missing out the best action which might give a\n", 40 | "good reward. There is always a trade-off between exploration and exploitation. We cannot\n", 41 | "perform both exploration and exploitation at the same time. We will discuss exploration exploitation\n", 42 | "dilemma detail in the upcoming chapters.\n", 43 | "\n", 44 | "Say, If you want to teach a robot to walk, without getting stuck by hitting at the mountain,\n", 45 | "you will not explicitly teach the robot not to go in the direction of mountain," 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "![title](images/B09792_01_01.png)" 53 | ] 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "metadata": {}, 58 | "source": [ 59 | "Instead, if the robot hits and get stuck on the mountain you will reduce 10 points so that\n", 60 | "robot will understand that hitting mountain will give it a negative reward so it will not go\n", 61 | "in that direction again." 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "![title](images/B09792_01_02.png)" 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "metadata": {}, 74 | "source": [ 75 | "And you will give 20 points to the robot when it walks in the right direction without getting\n", 76 | "stuck. So robot will understand which is the right path to rewards and try to maximize the\n", 77 | "rewards by going in a right direction." 
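The scoring scheme described above can be summarised as a tiny reward function. This is only an illustrative sketch rather than code from the notebook, and the `hit_mountain` and `moved_forward` flags are hypothetical placeholders for whatever the environment would report after each step:

```python
def robot_reward(hit_mountain, moved_forward):
    """Illustrative reward signal for the walking-robot example above.

    hit_mountain and moved_forward are hypothetical flags that the
    environment would report after each step.
    """
    if hit_mountain:
        return -10   # penalty for getting stuck on the mountain
    if moved_forward:
        return 20    # reward for walking in the right direction
    return 0         # otherwise, no reward
```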
78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": { 83 | "collapsed": true 84 | }, 85 | "source": [ 86 | "![title](images/B09792_01_03.png)" 87 | ] 88 | } 89 | ], 90 | "metadata": { 91 | "kernelspec": { 92 | "display_name": "Python [conda env:anaconda]", 93 | "language": "python", 94 | "name": "conda-env-anaconda-py" 95 | }, 96 | "language_info": { 97 | "codemirror_mode": { 98 | "name": "ipython", 99 | "version": 2 100 | }, 101 | "file_extension": ".py", 102 | "mimetype": "text/x-python", 103 | "name": "python", 104 | "nbconvert_exporter": "python", 105 | "pygments_lexer": "ipython2", 106 | "version": "2.7.11" 107 | } 108 | }, 109 | "nbformat": 4, 110 | "nbformat_minor": 2 111 | } 112 | -------------------------------------------------------------------------------- /Chapter01/images/B09792_01_01.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Hands-On-Reinforcement-Learning-with-Python/21c815b2608255694b72401919c4c08268bc48ec/Chapter01/images/B09792_01_01.png -------------------------------------------------------------------------------- /Chapter01/images/B09792_01_02.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Hands-On-Reinforcement-Learning-with-Python/21c815b2608255694b72401919c4c08268bc48ec/Chapter01/images/B09792_01_02.png -------------------------------------------------------------------------------- /Chapter01/images/B09792_01_03.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Hands-On-Reinforcement-Learning-with-Python/21c815b2608255694b72401919c4c08268bc48ec/Chapter01/images/B09792_01_03.png -------------------------------------------------------------------------------- /Chapter02/.ipynb_checkpoints/2.09 Building a Video Game Bot -checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Building a Video Game Bot using OpenAI Universe" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": { 13 | "collapsed": true 14 | }, 15 | "source": [ 16 | "Let us learn how to build a video game bot which plays car racing game. Our objective is\n", 17 | "that car has to move forward without getting stuck by any obstacles and hitting other cars." 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": {}, 23 | "source": [ 24 | "First, we import necessary libraries," 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": null, 30 | "metadata": { 31 | "collapsed": true 32 | }, 33 | "outputs": [], 34 | "source": [ 35 | "import gym\n", 36 | "import universe \n", 37 | "import random" 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "Then we simulate our car racing environment by make function." 
45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": null, 50 | "metadata": { 51 | "collapsed": true 52 | }, 53 | "outputs": [], 54 | "source": [ 55 | "env = gym.make('flashgames.NeonRace-v0')" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": null, 61 | "metadata": { 62 | "collapsed": true 63 | }, 64 | "outputs": [], 65 | "source": [ 66 | "env.configure(remotes=1) " 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "And let us create variables for moving the car," 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": null, 79 | "metadata": { 80 | "collapsed": true 81 | }, 82 | "outputs": [], 83 | "source": [ 84 | "# Move left\n", 85 | "left = [('KeyEvent', 'ArrowUp', True), ('KeyEvent', 'ArrowLeft', True),\n", 86 | " ('KeyEvent', 'ArrowRight', False)]\n", 87 | "\n", 88 | "# Move right\n", 89 | "right = [('KeyEvent', 'ArrowUp', True), ('KeyEvent', 'ArrowLeft', False),\n", 90 | " ('KeyEvent', 'ArrowRight', True)]\n", 91 | "\n", 92 | "# Move forward\n", 93 | "\n", 94 | "forward = [('KeyEvent', 'ArrowUp', True), ('KeyEvent', 'ArrowRight', False),\n", 95 | " ('KeyEvent', 'ArrowLeft', False), ('KeyEvent', 'n', True)]" 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": {}, 101 | "source": [ 102 | "Followed by, we will initialize some other variables" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": null, 108 | "metadata": { 109 | "collapsed": true 110 | }, 111 | "outputs": [], 112 | "source": [ 113 | "# We use turn variable for deciding whether to turn or not\n", 114 | "turn = 0\n", 115 | "\n", 116 | "# We store all the rewards in rewards list\n", 117 | "rewards = []\n", 118 | "\n", 119 | "# we will use buffer as some kind of threshold\n", 120 | "buffer_size = 100\n", 121 | "\n", 122 | "# We set our initial action has forward i.e our car moves just forward without making any turns\n", 123 | "action = forward" 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": { 129 | "collapsed": true 130 | }, 131 | "source": [ 132 | "Now, let us begin our game agent to play in an infinite loop which continuously performs an action based on interaction with the environment." 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": null, 138 | "metadata": { 139 | "collapsed": true 140 | }, 141 | "outputs": [], 142 | "source": [ 143 | "while True:\n", 144 | " turn -= 1\n", 145 | " \n", 146 | " # Let us say initially we take no turn and move forward.\n", 147 | " # First, We will check the value of turn, if it is less than 0\n", 148 | " # then there is no necessity for turning and we just move forward\n", 149 | " \n", 150 | " if turn <= 0:\n", 151 | " action = forward\n", 152 | " turn = 0\n", 153 | " \n", 154 | " action_n = [action for ob in observation_n]\n", 155 | " \n", 156 | " # Then we use env.step() to perform an action (moving forward for now) one-time step\n", 157 | " \n", 158 | " observation_n, reward_n, done_n, info = env.step(action_n)\n", 159 | " \n", 160 | " # store the rewards in the rewards list\n", 161 | " rewards += [reward_n[0]]\n", 162 | " \n", 163 | " # We will generate some random number and if it is less than 0.5 then we will take right, else\n", 164 | " # we will take left and we will store all the rewards obtained by performing each action and\n", 165 | " # based on our rewards we will learn which direction is the best over several timesteps. 
\n", 166 | " \n", 167 | " if len(rewards) >= buffer_size:\n", 168 | " mean = sum(rewards)/len(rewards)\n", 169 | " \n", 170 | " if mean == 0:\n", 171 | " turn = 20\n", 172 | " if random.random() < 0.5:\n", 173 | " action = right\n", 174 | " else:\n", 175 | " action = left\n", 176 | " rewards = []\n", 177 | " \n", 178 | " env.render()" 179 | ] 180 | } 181 | ], 182 | "metadata": { 183 | "kernelspec": { 184 | "display_name": "Python [conda env:universe]", 185 | "language": "python", 186 | "name": "conda-env-universe-py" 187 | }, 188 | "language_info": { 189 | "codemirror_mode": { 190 | "name": "ipython", 191 | "version": 3 192 | }, 193 | "file_extension": ".py", 194 | "mimetype": "text/x-python", 195 | "name": "python", 196 | "nbconvert_exporter": "python", 197 | "pygments_lexer": "ipython3", 198 | "version": "3.5.4" 199 | } 200 | }, 201 | "nbformat": 4, 202 | "nbformat_minor": 2 203 | } 204 | -------------------------------------------------------------------------------- /Chapter02/.ipynb_checkpoints/2.1 Basic Simulations-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "collapsed": true 7 | }, 8 | "source": [ 9 | "# Basic Simulations in OpenAI's Gym" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": { 15 | "collapsed": true 16 | }, 17 | "source": [ 18 | " \n", 19 | "OpenAI Gym is a toolkit for building, evaluating and comparing RL algorithms. It is\n", 20 | "compatible with algorithms written in any frameworks like TensoFlow, Theano, Keras etc... It\n", 21 | "is simple and easy to comprehend. It makes no assumption about the structure of our agent\n", 22 | "and provides an interface to all RL tasks.\n", 23 | "\n", 24 | "Now, we will see, how to simulate environments in gym." 
25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": {}, 30 | "source": [ 31 | "## CartPole Environment" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | " First let us import the OpenAI's Gym library" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": 1, 44 | "metadata": { 45 | "collapsed": true 46 | }, 47 | "outputs": [], 48 | "source": [ 49 | "import gym" 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": { 55 | "collapsed": true 56 | }, 57 | "source": [ 58 | "\n", 59 | " We use the make function for simulating the environment" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": null, 65 | "metadata": { 66 | "collapsed": true 67 | }, 68 | "outputs": [], 69 | "source": [ 70 | "env = gym.make('CartPole-v0')" 71 | ] 72 | }, 73 | { 74 | "cell_type": "markdown", 75 | "metadata": { 76 | "collapsed": true 77 | }, 78 | "source": [ 79 | "\n", 80 | " Then, we initialize the environment using reset method" 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": null, 86 | "metadata": { 87 | "collapsed": true 88 | }, 89 | "outputs": [], 90 | "source": [ 91 | "env.reset()" 92 | ] 93 | }, 94 | { 95 | "cell_type": "markdown", 96 | "metadata": { 97 | "collapsed": true 98 | }, 99 | "source": [ 100 | "\n", 101 | " Now,we can loop for some time steps and render the environment at each step" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": null, 107 | "metadata": { 108 | "collapsed": true 109 | }, 110 | "outputs": [], 111 | "source": [ 112 | "for _ in range(1000):\n", 113 | " env.render()\n", 114 | " env.step(env.action_space.sample())" 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "metadata": {}, 120 | "source": [ 121 | "## Different Types of Environments" 122 | ] 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "metadata": { 127 | "collapsed": true 128 | }, 129 | "source": [ 130 | "\n", 131 | " OpenAI gym provides a lot of simulation environments for training, evaluating and\n", 132 | "building our agents. We can check the available environments by either checking their\n", 133 | "website or simply typing the following commands which will list the available environments." 
134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": null, 139 | "metadata": { 140 | "collapsed": true 141 | }, 142 | "outputs": [], 143 | "source": [ 144 | "from gym import envs\n", 145 | "print(envs.registry.all())" 146 | ] 147 | }, 148 | { 149 | "cell_type": "markdown", 150 | "metadata": {}, 151 | "source": [ 152 | "## CarRacing Environment" 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "metadata": { 158 | "collapsed": true 159 | }, 160 | "source": [ 161 | "\n", 162 | " Since Gym provides different interesting environments, let us simulate a car racing\n", 163 | "environment as shown below," 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": null, 169 | "metadata": { 170 | "collapsed": true 171 | }, 172 | "outputs": [], 173 | "source": [ 174 | "import gym\n", 175 | "env = gym.make('CarRacing-v0')\n", 176 | "env.reset()\n", 177 | "for _ in range(1000):\n", 178 | " env.render()\n", 179 | " env.step(env.action_space.sample())" 180 | ] 181 | } 182 | ], 183 | "metadata": { 184 | "kernelspec": { 185 | "display_name": "Python [conda env:universe]", 186 | "language": "python", 187 | "name": "conda-env-universe-py" 188 | }, 189 | "language_info": { 190 | "codemirror_mode": { 191 | "name": "ipython", 192 | "version": 3 193 | }, 194 | "file_extension": ".py", 195 | "mimetype": "text/x-python", 196 | "name": "python", 197 | "nbconvert_exporter": "python", 198 | "pygments_lexer": "ipython3", 199 | "version": "3.5.4" 200 | } 201 | }, 202 | "nbformat": 4, 203 | "nbformat_minor": 2 204 | } 205 | -------------------------------------------------------------------------------- /Chapter02/.ipynb_checkpoints/2.11 TensorBoard-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# TensorBoard" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "TensorBoard is the tensorflow's visualization tool which can be used to visualize the\n", 15 | "computation graph. It can also be used to plot various quantitative metrics and results of\n", 16 | "several intermediate calculations. Using tensorboard, we can easily visualize complex\n", 17 | "models which would be useful for debugging and also sharing.\n", 18 | "Now let us build a basic computation graph and visualize that in tensorboard." 
19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "First, let us import the library" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 1, 31 | "metadata": { 32 | "collapsed": true 33 | }, 34 | "outputs": [], 35 | "source": [ 36 | "import tensorflow as tf" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | " Next, we initialize the variables" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": 2, 49 | "metadata": { 50 | "collapsed": true 51 | }, 52 | "outputs": [], 53 | "source": [ 54 | "a = tf.constant(5)\n", 55 | "b = tf.constant(4)\n", 56 | "c = tf.multiply(a,b)\n", 57 | "d = tf.constant(2)\n", 58 | "e = tf.constant(3)\n", 59 | "f = tf.multiply(d,e)\n", 60 | "g = tf.add(c,f)" 61 | ] 62 | }, 63 | { 64 | "cell_type": "markdown", 65 | "metadata": {}, 66 | "source": [ 67 | "Now, we will create a tensorflow session, we will write the results of our graph to file\n", 68 | "called event file using tf.summary.FileWriter()" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": 3, 74 | "metadata": {}, 75 | "outputs": [ 76 | { 77 | "name": "stdout", 78 | "output_type": "stream", 79 | "text": [ 80 | "26\n" 81 | ] 82 | } 83 | ], 84 | "source": [ 85 | "with tf.Session() as sess:\n", 86 | " writer = tf.summary.FileWriter(\"logs\", sess.graph)\n", 87 | " print(sess.run(g))\n", 88 | " writer.close()" 89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "metadata": {}, 94 | "source": [ 95 | "In order to run the tensorboard, go to your terminal, locate the working directory and\n", 96 | "type \n", 97 | "\n", 98 | "tensorboard --logdir=logs --port=6003" 99 | ] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "metadata": {}, 104 | "source": [ 105 | "# Adding Scope" 106 | ] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "metadata": {}, 111 | "source": [ 112 | " Scoping is used to reduce complexity and helps to better understand the model by\n", 113 | "grouping the related nodes together, For instance, in the above example, we can break\n", 114 | "down our graph into two different groups called computation and result. If you look at the\n", 115 | "previous example we can see that nodes, a to e perform the computation and node g\n", 116 | "calculate the result. So we can group them separately using the scope for easy\n", 117 | "understanding. Scoping can be created using tf.name_scope() function." 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": 4, 123 | "metadata": { 124 | "collapsed": true 125 | }, 126 | "outputs": [], 127 | "source": [ 128 | "with tf.name_scope(\"Computation\"):\n", 129 | " a = tf.constant(5)\n", 130 | " b = tf.constant(4)\n", 131 | " c = tf.multiply(a,b)\n", 132 | " d = tf.constant(2)\n", 133 | " e = tf.constant(3)\n", 134 | " f = tf.multiply(d,e)" 135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": 5, 140 | "metadata": { 141 | "collapsed": true 142 | }, 143 | "outputs": [], 144 | "source": [ 145 | "\n", 146 | "with tf.name_scope(\"Result\"):\n", 147 | " g = tf.add(c,f)" 148 | ] 149 | }, 150 | { 151 | "cell_type": "markdown", 152 | "metadata": {}, 153 | "source": [ 154 | "\n", 155 | "If you see the computation scope, we can further break down in to separate parts for even\n", 156 | "more good understanding. Say we can create scope as part 1 which has nodes a to c and\n", 157 | "scope as part 2 which has nodes d to e since part 1 and 2 are independent of each other." 
158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": 6, 163 | "metadata": { 164 | "collapsed": true 165 | }, 166 | "outputs": [], 167 | "source": [ 168 | "with tf.name_scope(\"Computation\"):\n", 169 | " with tf.name_scope(\"Part1\"):\n", 170 | " a = tf.constant(5)\n", 171 | " b = tf.constant(4)\n", 172 | " c = tf.multiply(a,b)\n", 173 | " with tf.name_scope(\"Part2\"):\n", 174 | " d = tf.constant(2)\n", 175 | " e = tf.constant(3)\n", 176 | " f = tf.multiply(d,e)" 177 | ] 178 | }, 179 | { 180 | "cell_type": "markdown", 181 | "metadata": {}, 182 | "source": [ 183 | "\n", 184 | "Scoping can be better understood by visualizing them in the tensorboard. The complete\n", 185 | "code looks like as follows," 186 | ] 187 | }, 188 | { 189 | "cell_type": "code", 190 | "execution_count": 7, 191 | "metadata": {}, 192 | "outputs": [ 193 | { 194 | "name": "stdout", 195 | "output_type": "stream", 196 | "text": [ 197 | "26\n" 198 | ] 199 | } 200 | ], 201 | "source": [ 202 | "with tf.name_scope(\"Computation\"):\n", 203 | " with tf.name_scope(\"Part1\"):\n", 204 | " a = tf.constant(5)\n", 205 | " b = tf.constant(4)\n", 206 | " c = tf.multiply(a,b)\n", 207 | " with tf.name_scope(\"Part2\"):\n", 208 | " d = tf.constant(2)\n", 209 | " e = tf.constant(3)\n", 210 | " f = tf.multiply(d,e)\n", 211 | "with tf.name_scope(\"Result\"):\n", 212 | " g = tf.add(c,f)\n", 213 | "with tf.Session() as sess:\n", 214 | " writer = tf.summary.FileWriter(\"logs\", sess.graph)\n", 215 | " print(sess.run(g))\n", 216 | " writer.close()" 217 | ] 218 | }, 219 | { 220 | "cell_type": "markdown", 221 | "metadata": {}, 222 | "source": [ 223 | "In order to run the tensorboard, go to your terminal, locate the working directory and\n", 224 | "type \n", 225 | "\n", 226 | "tensorboard --logdir=logs --port=6003" 227 | ] 228 | }, 229 | { 230 | "cell_type": "markdown", 231 | "metadata": {}, 232 | "source": [ 233 | "If you look at the TensorBoard you can easily understand how scoping helps us to reduce\n", 234 | "complexity in understanding by grouping the similar nodes together. Scoping is widely\n", 235 | "used while working on a complex projects to better understand the functionality and\n", 236 | "dependencies of nodes." 237 | ] 238 | } 239 | ], 240 | "metadata": { 241 | "kernelspec": { 242 | "display_name": "Python [conda env:universe]", 243 | "language": "python", 244 | "name": "conda-env-universe-py" 245 | }, 246 | "language_info": { 247 | "codemirror_mode": { 248 | "name": "ipython", 249 | "version": 3 250 | }, 251 | "file_extension": ".py", 252 | "mimetype": "text/x-python", 253 | "name": "python", 254 | "nbconvert_exporter": "python", 255 | "pygments_lexer": "ipython3", 256 | "version": "3.5.4" 257 | } 258 | }, 259 | "nbformat": 4, 260 | "nbformat_minor": 2 261 | } 262 | -------------------------------------------------------------------------------- /Chapter02/.ipynb_checkpoints/2.2 Training an agent to Walk-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Training an agent to Walk" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "\n", 15 | "Now let us learn how to train a robot to walk using Gym along with some fundamentals.\n", 16 | "The strategy is that reward X points will be given when the robot moves forward and if the\n", 17 | "robot fails to move then Y points will be reduced. 
So the robot will learn to walk by trying to\n", 18 | "maximize the reward.\n", 19 | "\n", 20 | "First, we will import the library, then we will create a simulation instance with the make\n", 21 | "function. \n", 22 | "
\n", 23 | "
\n", 24 | "Open AI Gym provides an environment called BipedalWalker-v2 for training\n", 25 | "robotic agents in simple terrain. " 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": null, 31 | "metadata": { 32 | "collapsed": true 33 | }, 34 | "outputs": [], 35 | "source": [ 36 | "import gym\n", 37 | "env = gym.make('BipedalWalker-v2')" 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "\n", 45 | " Then for each episode (Agent-Environment interaction between initial and final state), we\n", 46 | "will initialize the environment using reset method." 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": null, 52 | "metadata": { 53 | "collapsed": true 54 | }, 55 | "outputs": [], 56 | "source": [ 57 | "for episode in range(100):\n", 58 | " observation = env.reset()\n", 59 | " \n", 60 | " # Render the environment on each step \n", 61 | " for i in range(10000):\n", 62 | " env.render()\n", 63 | " \n", 64 | " # we choose action by sampling random action from environment's action space. Every environment has\n", 65 | " # some action space which contains the all possible valid actions and observations,\n", 66 | " \n", 67 | " action = env.action_space.sample()\n", 68 | " \n", 69 | " # Then for each step, we will record the observation, reward, done, info\n", 70 | " observation, reward, done, info = env.step(action)\n", 71 | " \n", 72 | " # When done is true, we print the time steps taken for the episode and break the current episode.\n", 73 | " if done:\n", 74 | " print(\"{} timesteps taken for the Episode\".format(i+1))\n", 75 | " break" 76 | ] 77 | }, 78 | { 79 | "cell_type": "markdown", 80 | "metadata": {}, 81 | "source": [ 82 | " \n", 83 | "\n", 84 | "The agent will learn by trail and error and over a period of time it starts selecting actions which gives the\n", 85 | "maximum rewards." 86 | ] 87 | } 88 | ], 89 | "metadata": { 90 | "kernelspec": { 91 | "display_name": "Python [conda env:universe]", 92 | "language": "python", 93 | "name": "conda-env-universe-py" 94 | }, 95 | "language_info": { 96 | "codemirror_mode": { 97 | "name": "ipython", 98 | "version": 3 99 | }, 100 | "file_extension": ".py", 101 | "mimetype": "text/x-python", 102 | "name": "python", 103 | "nbconvert_exporter": "python", 104 | "pygments_lexer": "ipython3", 105 | "version": "3.5.4" 106 | } 107 | }, 108 | "nbformat": 4, 109 | "nbformat_minor": 2 110 | } 111 | -------------------------------------------------------------------------------- /Chapter02/.ipynb_checkpoints/2.7 Basic Simulations-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "collapsed": true 7 | }, 8 | "source": [ 9 | "# Basic Simulations in OpenAI's Gym" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": { 15 | "collapsed": true 16 | }, 17 | "source": [ 18 | "OpenAI Gym is a toolkit for building, evaluating and comparing RL algorithms. It is\n", 19 | "compatible with algorithms written in any frameworks like TensoFlow, Theano, Keras etc... It\n", 20 | "is simple and easy to comprehend. It makes no assumption about the structure of our agent\n", 21 | "and provides an interface to all RL tasks.\n", 22 | "\n", 23 | "Now, we will see, how to simulate environments in gym." 
24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "## CartPole Environment" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "First let us import the OpenAI's Gym library" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 1, 43 | "metadata": { 44 | "collapsed": true 45 | }, 46 | "outputs": [], 47 | "source": [ 48 | "import gym" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "metadata": { 54 | "collapsed": true 55 | }, 56 | "source": [ 57 | "We use the make function for simulating the environment" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": null, 63 | "metadata": { 64 | "collapsed": true 65 | }, 66 | "outputs": [], 67 | "source": [ 68 | "env = gym.make('CartPole-v0')" 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "metadata": { 74 | "collapsed": true 75 | }, 76 | "source": [ 77 | "Then, we initialize the environment using reset method" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": null, 83 | "metadata": { 84 | "collapsed": true 85 | }, 86 | "outputs": [], 87 | "source": [ 88 | "env.reset()" 89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "metadata": { 94 | "collapsed": true 95 | }, 96 | "source": [ 97 | "Now,we can loop for some time steps and render the environment at each step" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": null, 103 | "metadata": { 104 | "collapsed": true 105 | }, 106 | "outputs": [], 107 | "source": [ 108 | "for _ in range(1000):\n", 109 | " env.render()\n", 110 | " env.step(env.action_space.sample())" 111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "metadata": {}, 116 | "source": [ 117 | "## Different Types of Environments" 118 | ] 119 | }, 120 | { 121 | "cell_type": "markdown", 122 | "metadata": { 123 | "collapsed": true 124 | }, 125 | "source": [ 126 | "OpenAI gym provides a lot of simulation environments for training, evaluating and\n", 127 | "building our agents. We can check the available environments by either checking their\n", 128 | "website or simply typing the following commands which will list the available environments." 
129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": null, 134 | "metadata": { 135 | "collapsed": true 136 | }, 137 | "outputs": [], 138 | "source": [ 139 | "from gym import envs\n", 140 | "print(envs.registry.all())" 141 | ] 142 | }, 143 | { 144 | "cell_type": "markdown", 145 | "metadata": {}, 146 | "source": [ 147 | "## CarRacing Environment" 148 | ] 149 | }, 150 | { 151 | "cell_type": "markdown", 152 | "metadata": { 153 | "collapsed": true 154 | }, 155 | "source": [ 156 | " Since Gym provides different interesting environments, let us simulate a car racing\n", 157 | "environment as shown below," 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": null, 163 | "metadata": { 164 | "collapsed": true 165 | }, 166 | "outputs": [], 167 | "source": [ 168 | "import gym\n", 169 | "env = gym.make('CarRacing-v0')\n", 170 | "env.reset()\n", 171 | "for _ in range(1000):\n", 172 | " env.render()\n", 173 | " env.step(env.action_space.sample())" 174 | ] 175 | } 176 | ], 177 | "metadata": { 178 | "kernelspec": { 179 | "display_name": "Python [conda env:universe]", 180 | "language": "python", 181 | "name": "conda-env-universe-py" 182 | }, 183 | "language_info": { 184 | "codemirror_mode": { 185 | "name": "ipython", 186 | "version": 3 187 | }, 188 | "file_extension": ".py", 189 | "mimetype": "text/x-python", 190 | "name": "python", 191 | "nbconvert_exporter": "python", 192 | "pygments_lexer": "ipython3", 193 | "version": "3.5.4" 194 | } 195 | }, 196 | "nbformat": 4, 197 | "nbformat_minor": 2 198 | } 199 | -------------------------------------------------------------------------------- /Chapter02/.ipynb_checkpoints/2.8 Training an Robot to Walk-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Training an agent to Walk" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Now let us learn how to train a robot to walk using Gym along with some fundamentals.\n", 15 | "The strategy is that reward X points will be given when the robot moves forward and if the\n", 16 | "robot fails to move then Y points will be reduced. So the robot will learn to walk in the\n", 17 | "event of maximizing the reward.\n", 18 | "\n", 19 | "First, we will import the library, then we will create a simulation instance by make\n", 20 | "function. \n", 21 | "\n", 22 | "\n", 23 | "Open AI Gym provides an environment called BipedalWalker-v2 for training\n", 24 | "robotic agents in simple terrain. " 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": null, 30 | "metadata": { 31 | "collapsed": true 32 | }, 33 | "outputs": [], 34 | "source": [ 35 | "import gym\n", 36 | "env = gym.make('BipedalWalker-v2')" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "Then for each episode (Agent-Environment interaction between initial and final state), we\n", 44 | "will initialize the environment using reset method." 
45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": null, 50 | "metadata": { 51 | "collapsed": true 52 | }, 53 | "outputs": [], 54 | "source": [ 55 | "for episode in range(100):\n", 56 | " observation = env.reset()\n", 57 | " \n", 58 | " # Render the environment on each step \n", 59 | " for i in range(10000):\n", 60 | " env.render()\n", 61 | " \n", 62 | " # we choose action by sampling random action from environment's action space. Every environment has\n", 63 | " # some action space which contains the all possible valid actions and observations,\n", 64 | " \n", 65 | " action = env.action_space.sample()\n", 66 | " \n", 67 | " # Then for each step, we will record the observation, reward, done, info\n", 68 | " observation, reward, done, info = env.step(action)\n", 69 | " \n", 70 | " # When done is true, we print the time steps taken for the episode and break the current episode.\n", 71 | " if done:\n", 72 | " print(\"{} timesteps taken for the Episode\".format(i+1))\n", 73 | " break" 74 | ] 75 | }, 76 | { 77 | "cell_type": "markdown", 78 | "metadata": {}, 79 | "source": [ 80 | "The agent will learn by trail and error and over a period of time it starts selecting actions which gives the\n", 81 | "maximum rewards." 82 | ] 83 | } 84 | ], 85 | "metadata": { 86 | "kernelspec": { 87 | "display_name": "Python [conda env:universe]", 88 | "language": "python", 89 | "name": "conda-env-universe-py" 90 | }, 91 | "language_info": { 92 | "codemirror_mode": { 93 | "name": "ipython", 94 | "version": 3 95 | }, 96 | "file_extension": ".py", 97 | "mimetype": "text/x-python", 98 | "name": "python", 99 | "nbconvert_exporter": "python", 100 | "pygments_lexer": "ipython3", 101 | "version": "3.5.4" 102 | } 103 | }, 104 | "nbformat": 4, 105 | "nbformat_minor": 2 106 | } 107 | -------------------------------------------------------------------------------- /Chapter02/.ipynb_checkpoints/2.9 Building a Video Game Bot -checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Building a Video Game Bot using OpenAI Universe" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": { 13 | "collapsed": true 14 | }, 15 | "source": [ 16 | "Let us learn how to build a video game bot which plays car racing game. Our objective is\n", 17 | "that car has to move forward without getting stuck by any obstacles and hitting other cars." 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": {}, 23 | "source": [ 24 | "First, we import necessary libraries," 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": null, 30 | "metadata": { 31 | "collapsed": true 32 | }, 33 | "outputs": [], 34 | "source": [ 35 | "import gym\n", 36 | "import universe \n", 37 | "import random" 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "Then we simulate our car racing environment by make function." 
45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": null, 50 | "metadata": { 51 | "collapsed": true 52 | }, 53 | "outputs": [], 54 | "source": [ 55 | "env = gym.make('flashgames.NeonRace-v0')" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": null, 61 | "metadata": { 62 | "collapsed": true 63 | }, 64 | "outputs": [], 65 | "source": [ 66 | "env.configure(remotes=1) " 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "And let us create variables for moving the car," 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": null, 79 | "metadata": { 80 | "collapsed": true 81 | }, 82 | "outputs": [], 83 | "source": [ 84 | "# Move left\n", 85 | "left = [('KeyEvent', 'ArrowUp', True), ('KeyEvent', 'ArrowLeft', True),\n", 86 | " ('KeyEvent', 'ArrowRight', False)]\n", 87 | "\n", 88 | "# Move right\n", 89 | "right = [('KeyEvent', 'ArrowUp', True), ('KeyEvent', 'ArrowLeft', False),\n", 90 | " ('KeyEvent', 'ArrowRight', True)]\n", 91 | "\n", 92 | "# Move forward\n", 93 | "\n", 94 | "forward = [('KeyEvent', 'ArrowUp', True), ('KeyEvent', 'ArrowRight', False),\n", 95 | " ('KeyEvent', 'ArrowLeft', False), ('KeyEvent', 'n', True)]" 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": {}, 101 | "source": [ 102 | "Followed by, we will initialize some other variables" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": null, 108 | "metadata": { 109 | "collapsed": true 110 | }, 111 | "outputs": [], 112 | "source": [ 113 | "# We use turn variable for deciding whether to turn or not\n", 114 | "turn = 0\n", 115 | "\n", 116 | "# We store all the rewards in rewards list\n", 117 | "rewards = []\n", 118 | "\n", 119 | "# we will use buffer as some kind of threshold\n", 120 | "buffer_size = 100\n", 121 | "\n", 122 | "# We set our initial action has forward i.e our car moves just forward without making any turns\n", 123 | "action = forward" 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": { 129 | "collapsed": true 130 | }, 131 | "source": [ 132 | "Now, let us begin our game agent to play in an infinite loop which continuously performs an action based on interaction with the environment." 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": null, 138 | "metadata": { 139 | "collapsed": true 140 | }, 141 | "outputs": [], 142 | "source": [ 143 | "while True:\n", 144 | " turn -= 1\n", 145 | " \n", 146 | " # Let us say initially we take no turn and move forward.\n", 147 | " # First, We will check the value of turn, if it is less than 0\n", 148 | " # then there is no necessity for turning and we just move forward\n", 149 | " \n", 150 | " if turn <= 0:\n", 151 | " action = forward\n", 152 | " turn = 0\n", 153 | " \n", 154 | " action_n = [action for ob in observation_n]\n", 155 | " \n", 156 | " # Then we use env.step() to perform an action (moving forward for now) one-time step\n", 157 | " \n", 158 | " observation_n, reward_n, done_n, info = env.step(action_n)\n", 159 | " \n", 160 | " # store the rewards in the rewards list\n", 161 | " rewards += [reward_n[0]]\n", 162 | " \n", 163 | " # We will generate some random number and if it is less than 0.5 then we will take right, else\n", 164 | " # we will take left and we will store all the rewards obtained by performing each action and\n", 165 | " # based on our rewards we will learn which direction is the best over several timesteps. 
\n", 166 | " \n", 167 | " if len(rewards) >= buffer_size:\n", 168 | " mean = sum(rewards)/len(rewards)\n", 169 | " if mean == 0:\n", 170 | " turn = 20\n", 171 | " if random.random() < 0.5:\n", 172 | " action = right\n", 173 | " else:\n", 174 | " action = left\n", 175 | " rewards = []\n", 176 | " \n", 177 | " env.render()" 178 | ] 179 | } 180 | ], 181 | "metadata": { 182 | "kernelspec": { 183 | "display_name": "Python [conda env:universe]", 184 | "language": "python", 185 | "name": "conda-env-universe-py" 186 | }, 187 | "language_info": { 188 | "codemirror_mode": { 189 | "name": "ipython", 190 | "version": 3 191 | }, 192 | "file_extension": ".py", 193 | "mimetype": "text/x-python", 194 | "name": "python", 195 | "nbconvert_exporter": "python", 196 | "pygments_lexer": "ipython3", 197 | "version": "3.5.4" 198 | } 199 | }, 200 | "nbformat": 4, 201 | "nbformat_minor": 2 202 | } 203 | -------------------------------------------------------------------------------- /Chapter02/.ipynb_checkpoints/TensorBoard-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# TensorBoard" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "\n", 15 | " TensorBoard is the tensorflow's visualization tool which can be used to visualize the\n", 16 | "computation graph. It can also be used to plot various quantitative metrics and results of\n", 17 | "several intermediate calculations. Using tensorboard, we can easily visualize complex\n", 18 | "models which would be useful for debugging and also sharing.\n", 19 | "Now let us build a basic computation graph and visualize that in tensorboard." 20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "metadata": {}, 25 | "source": [ 26 | " First, let us import the library" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": 1, 32 | "metadata": { 33 | "collapsed": true 34 | }, 35 | "outputs": [], 36 | "source": [ 37 | "import tensorflow as tf" 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "\n", 45 | " Next, we initialize the variables" 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": 2, 51 | "metadata": { 52 | "collapsed": true 53 | }, 54 | "outputs": [], 55 | "source": [ 56 | "a = tf.constant(5)\n", 57 | "b = tf.constant(4)\n", 58 | "c = tf.multiply(a,b)\n", 59 | "d = tf.constant(2)\n", 60 | "e = tf.constant(3)\n", 61 | "f = tf.multiply(d,e)\n", 62 | "g = tf.add(c,f)" 63 | ] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "metadata": {}, 68 | "source": [ 69 | "\n", 70 | " Now, we will create a tensorflow session, we will write the results of our graph to file\n", 71 | "called event file using tf.summary.FileWriter()" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 3, 77 | "metadata": {}, 78 | "outputs": [ 79 | { 80 | "name": "stdout", 81 | "output_type": "stream", 82 | "text": [ 83 | "26\n" 84 | ] 85 | } 86 | ], 87 | "source": [ 88 | "with tf.Session() as sess:\n", 89 | " writer = tf.summary.FileWriter(\"logs\", sess.graph)\n", 90 | " print(sess.run(g))\n", 91 | " writer.close()" 92 | ] 93 | }, 94 | { 95 | "cell_type": "markdown", 96 | "metadata": {}, 97 | "source": [ 98 | " In order to run the tensorboard, go to your terminal, locate the working directory and\n", 99 | "type \n", 100 | "\n", 101 | "tensorboard --logdir=logs --port=6003" 102 | ] 103 | }, 104 | { 105 | 
"cell_type": "markdown", 106 | "metadata": {}, 107 | "source": [ 108 | "# Adding Scope" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "\n", 116 | " Scoping is used to reduce complexity and helps to better understand the model by\n", 117 | "grouping the related nodes together, For instance, in the above example, we can break\n", 118 | "down our graph into two different groups called computation and result. If you look at the\n", 119 | "previous example we can see that nodes, a to e perform the computation and node g\n", 120 | "calculate the result. So we can group them separately using the scope for easy\n", 121 | "understanding. Scoping can be created using tf.name_scope() function." 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": 4, 127 | "metadata": { 128 | "collapsed": true 129 | }, 130 | "outputs": [], 131 | "source": [ 132 | "with tf.name_scope(\"Computation\"):\n", 133 | " a = tf.constant(5)\n", 134 | " b = tf.constant(4)\n", 135 | " c = tf.multiply(a,b)\n", 136 | " d = tf.constant(2)\n", 137 | " e = tf.constant(3)\n", 138 | " f = tf.multiply(d,e)" 139 | ] 140 | }, 141 | { 142 | "cell_type": "code", 143 | "execution_count": 5, 144 | "metadata": { 145 | "collapsed": true 146 | }, 147 | "outputs": [], 148 | "source": [ 149 | "\n", 150 | "with tf.name_scope(\"Result\"):\n", 151 | " g = tf.add(c,f)" 152 | ] 153 | }, 154 | { 155 | "cell_type": "markdown", 156 | "metadata": {}, 157 | "source": [ 158 | "\n", 159 | "If you see the computation scope, we can further break down in to separate parts for even\n", 160 | "more good understanding. Say we can create scope as part 1 which has nodes a to c and\n", 161 | "scope as part 2 which has nodes d to e since part 1 and 2 are independent of each other." 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": 6, 167 | "metadata": { 168 | "collapsed": true 169 | }, 170 | "outputs": [], 171 | "source": [ 172 | "with tf.name_scope(\"Computation\"):\n", 173 | " with tf.name_scope(\"Part1\"):\n", 174 | " a = tf.constant(5)\n", 175 | " b = tf.constant(4)\n", 176 | " c = tf.multiply(a,b)\n", 177 | " with tf.name_scope(\"Part2\"):\n", 178 | " d = tf.constant(2)\n", 179 | " e = tf.constant(3)\n", 180 | " f = tf.multiply(d,e)" 181 | ] 182 | }, 183 | { 184 | "cell_type": "markdown", 185 | "metadata": {}, 186 | "source": [ 187 | "\n", 188 | " \n", 189 | "Scoping can be better understood by visualizing them in the tensorboard. 
The complete\n", 190 | "code looks like as follows," 191 | ] 192 | }, 193 | { 194 | "cell_type": "code", 195 | "execution_count": 7, 196 | "metadata": {}, 197 | "outputs": [ 198 | { 199 | "name": "stdout", 200 | "output_type": "stream", 201 | "text": [ 202 | "26\n" 203 | ] 204 | } 205 | ], 206 | "source": [ 207 | "with tf.name_scope(\"Computation\"):\n", 208 | " with tf.name_scope(\"Part1\"):\n", 209 | " a = tf.constant(5)\n", 210 | " b = tf.constant(4)\n", 211 | " c = tf.multiply(a,b)\n", 212 | " with tf.name_scope(\"Part2\"):\n", 213 | " d = tf.constant(2)\n", 214 | " e = tf.constant(3)\n", 215 | " f = tf.multiply(d,e)\n", 216 | "with tf.name_scope(\"Result\"):\n", 217 | " g = tf.add(c,f)\n", 218 | "with tf.Session() as sess:\n", 219 | " writer = tf.summary.FileWriter(\"logs\", sess.graph)\n", 220 | " print(sess.run(g))\n", 221 | " writer.close()" 222 | ] 223 | }, 224 | { 225 | "cell_type": "markdown", 226 | "metadata": {}, 227 | "source": [ 228 | " In order to run the tensorboard, go to your terminal, locate the working directory and\n", 229 | "type \n", 230 | "\n", 231 | "tensorboard --logdir=logs --port=6003" 232 | ] 233 | }, 234 | { 235 | "cell_type": "markdown", 236 | "metadata": {}, 237 | "source": [ 238 | " If you look at the TensorBoard you can easily understand how scoping helps us to reduce\n", 239 | "complexity in understanding by grouping the similar nodes together. Scoping is widely\n", 240 | "used while working on a complex projects to better understand the functionality and\n", 241 | "dependencies of nodes." 242 | ] 243 | } 244 | ], 245 | "metadata": { 246 | "kernelspec": { 247 | "display_name": "Python [conda env:universe]", 248 | "language": "python", 249 | "name": "conda-env-universe-py" 250 | }, 251 | "language_info": { 252 | "codemirror_mode": { 253 | "name": "ipython", 254 | "version": 3 255 | }, 256 | "file_extension": ".py", 257 | "mimetype": "text/x-python", 258 | "name": "python", 259 | "nbconvert_exporter": "python", 260 | "pygments_lexer": "ipython3", 261 | "version": "3.5.4" 262 | } 263 | }, 264 | "nbformat": 4, 265 | "nbformat_minor": 2 266 | } 267 | -------------------------------------------------------------------------------- /Chapter02/.ipynb_checkpoints/Video Game Bot using OpenAI Universe-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Building a Video Game Bot using OpenAI Universe" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": { 13 | "collapsed": true 14 | }, 15 | "source": [ 16 | " Let us learn how to build a video game bot which plays car racing game. Our objective is\n", 17 | "that car has to move forward without getting stuck by any obstacles and hitting other cars." 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": {}, 23 | "source": [ 24 | " First, we import necessary libraries," 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": null, 30 | "metadata": { 31 | "collapsed": true 32 | }, 33 | "outputs": [], 34 | "source": [ 35 | "import gym\n", 36 | "import universe \n", 37 | "import random" 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "\n", 45 | " Then we simulate our car racing environment by make function." 
46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": null, 51 | "metadata": { 52 | "collapsed": true 53 | }, 54 | "outputs": [], 55 | "source": [ 56 | "env = gym.make('flashgames.NeonRace-v0')" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": null, 62 | "metadata": { 63 | "collapsed": true 64 | }, 65 | "outputs": [], 66 | "source": [ 67 | "env.configure(remotes=1) " 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": {}, 73 | "source": [ 74 | " And let us create variables for moving the car," 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": null, 80 | "metadata": { 81 | "collapsed": true 82 | }, 83 | "outputs": [], 84 | "source": [ 85 | "# Move left\n", 86 | "left = [('KeyEvent', 'ArrowUp', True), ('KeyEvent', 'ArrowLeft', True),\n", 87 | " ('KeyEvent', 'ArrowRight', False)]\n", 88 | "\n", 89 | "# Move right\n", 90 | "right = [('KeyEvent', 'ArrowUp', True), ('KeyEvent', 'ArrowLeft', False),\n", 91 | " ('KeyEvent', 'ArrowRight', True)]\n", 92 | "\n", 93 | "# Move forward\n", 94 | "\n", 95 | "forward = [('KeyEvent', 'ArrowUp', True), ('KeyEvent', 'ArrowRight', False),\n", 96 | " ('KeyEvent', 'ArrowLeft', False), ('KeyEvent', 'n', True)]" 97 | ] 98 | }, 99 | { 100 | "cell_type": "markdown", 101 | "metadata": {}, 102 | "source": [ 103 | " Followed by, we will initialize some other variables" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": null, 109 | "metadata": { 110 | "collapsed": true 111 | }, 112 | "outputs": [], 113 | "source": [ 114 | "# We use turn variable for deciding whether to turn or not\n", 115 | "turn = 0\n", 116 | "\n", 117 | "# We store all the rewards in rewards list\n", 118 | "rewards = []\n", 119 | "\n", 120 | "# we will use buffer as some kind of threshold\n", 121 | "buffer_size = 100\n", 122 | "\n", 123 | "# We set our initial action has forward i.e our car moves just forward without making any turns\n", 124 | "action = forward" 125 | ] 126 | }, 127 | { 128 | "cell_type": "markdown", 129 | "metadata": { 130 | "collapsed": true 131 | }, 132 | "source": [ 133 | " Now, let us begin our game agent to play in an infinite loop which continuously performs an action based on interaction with the environment." 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": null, 139 | "metadata": { 140 | "collapsed": true 141 | }, 142 | "outputs": [], 143 | "source": [ 144 | "while True:\n", 145 | " turn -= 1\n", 146 | " \n", 147 | " # Let us say initially we take no turn and move forward.\n", 148 | " # First, We will check the value of turn, if it is less than 0\n", 149 | " # then there is no necessity for turning and we just move forward\n", 150 | " \n", 151 | " if turn <= 0:\n", 152 | " action = forward\n", 153 | " turn = 0\n", 154 | " \n", 155 | " action_n = [action for ob in observation_n]\n", 156 | " \n", 157 | " # Then we use env.step() to perform an action (moving forward for now) one-time step\n", 158 | " \n", 159 | " observation_n, reward_n, done_n, info = env.step(action_n)\n", 160 | " \n", 161 | " # store the rewards in the rewards list\n", 162 | " rewards += [reward_n[0]]\n", 163 | " \n", 164 | " # We will generate some random number and if it is less than 0.5 then we will take right, else\n", 165 | " # we will take left and we will store all the rewards obtained by performing each action and\n", 166 | " # based on our rewards we will learn which direction is the best over several timesteps. 
\n", 167 | " \n", 168 | " if len(rewards) >= buffer_size:\n", 169 | " mean = sum(rewards)/len(rewards)\n", 170 | " if mean == 0:\n", 171 | " turn = 20\n", 172 | " if random.random() < 0.5:\n", 173 | " action = right\n", 174 | " else:\n", 175 | " action = left\n", 176 | " rewards = []\n", 177 | " \n", 178 | " env.render()" 179 | ] 180 | } 181 | ], 182 | "metadata": { 183 | "kernelspec": { 184 | "display_name": "Python [conda env:universe]", 185 | "language": "python", 186 | "name": "conda-env-universe-py" 187 | }, 188 | "language_info": { 189 | "codemirror_mode": { 190 | "name": "ipython", 191 | "version": 3 192 | }, 193 | "file_extension": ".py", 194 | "mimetype": "text/x-python", 195 | "name": "python", 196 | "nbconvert_exporter": "python", 197 | "pygments_lexer": "ipython3", 198 | "version": "3.5.4" 199 | } 200 | }, 201 | "nbformat": 4, 202 | "nbformat_minor": 2 203 | } 204 | -------------------------------------------------------------------------------- /Chapter02/2.07 Basic Simulations.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "collapsed": true 7 | }, 8 | "source": [ 9 | "# Basic Simulations in OpenAI's Gym" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": { 15 | "collapsed": true 16 | }, 17 | "source": [ 18 | "OpenAI Gym is a toolkit for building, evaluating and comparing RL algorithms. It is\n", 19 | "compatible with algorithms written in any frameworks like TensoFlow, Theano, Keras etc... It\n", 20 | "is simple and easy to comprehend. It makes no assumption about the structure of our agent\n", 21 | "and provides an interface to all RL tasks.\n", 22 | "\n", 23 | "Now, we will see, how to simulate environments in gym." 
24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "## CartPole Environment" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "First let us import the OpenAI's Gym library" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 1, 43 | "metadata": { 44 | "collapsed": true 45 | }, 46 | "outputs": [], 47 | "source": [ 48 | "import gym" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "metadata": { 54 | "collapsed": true 55 | }, 56 | "source": [ 57 | "We use the make function for simulating the environment" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": null, 63 | "metadata": { 64 | "collapsed": true 65 | }, 66 | "outputs": [], 67 | "source": [ 68 | "env = gym.make('CartPole-v0')" 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "metadata": { 74 | "collapsed": true 75 | }, 76 | "source": [ 77 | "Then, we initialize the environment using reset method" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": null, 83 | "metadata": { 84 | "collapsed": true 85 | }, 86 | "outputs": [], 87 | "source": [ 88 | "env.reset()" 89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "metadata": { 94 | "collapsed": true 95 | }, 96 | "source": [ 97 | "Now,we can loop for some time steps and render the environment at each step" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": null, 103 | "metadata": { 104 | "collapsed": true 105 | }, 106 | "outputs": [], 107 | "source": [ 108 | "for _ in range(1000):\n", 109 | " env.render()\n", 110 | " env.step(env.action_space.sample())" 111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "metadata": {}, 116 | "source": [ 117 | "## Different Types of Environments" 118 | ] 119 | }, 120 | { 121 | "cell_type": "markdown", 122 | "metadata": { 123 | "collapsed": true 124 | }, 125 | "source": [ 126 | "OpenAI gym provides a lot of simulation environments for training, evaluating and\n", 127 | "building our agents. We can check the available environments by either checking their\n", 128 | "website or simply typing the following commands which will list the available environments." 
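The registry print in the next cell is quite verbose. If only the environment names are needed, the ids can be pulled out of the registry entries instead, a small variation assuming the same `envs.registry.all()` interface used below:

```python
from gym import envs

# each entry returned by envs.registry.all() is an EnvSpec with an .id attribute
env_ids = [spec.id for spec in envs.registry.all()]

print(len(env_ids))    # how many environments are registered
print(env_ids[:10])    # a first look at the registered names
```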
129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": null, 134 | "metadata": { 135 | "collapsed": true 136 | }, 137 | "outputs": [], 138 | "source": [ 139 | "from gym import envs\n", 140 | "print(envs.registry.all())" 141 | ] 142 | }, 143 | { 144 | "cell_type": "markdown", 145 | "metadata": {}, 146 | "source": [ 147 | "## CarRacing Environment" 148 | ] 149 | }, 150 | { 151 | "cell_type": "markdown", 152 | "metadata": { 153 | "collapsed": true 154 | }, 155 | "source": [ 156 | " Since Gym provides different interesting environments, let us simulate a car racing\n", 157 | "environment as shown below," 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": null, 163 | "metadata": { 164 | "collapsed": true 165 | }, 166 | "outputs": [], 167 | "source": [ 168 | "import gym\n", 169 | "env = gym.make('CarRacing-v0')\n", 170 | "env.reset()\n", 171 | "for _ in range(1000):\n", 172 | " env.render()\n", 173 | " env.step(env.action_space.sample())" 174 | ] 175 | } 176 | ], 177 | "metadata": { 178 | "kernelspec": { 179 | "display_name": "Python [conda env:universe]", 180 | "language": "python", 181 | "name": "conda-env-universe-py" 182 | }, 183 | "language_info": { 184 | "codemirror_mode": { 185 | "name": "ipython", 186 | "version": 3 187 | }, 188 | "file_extension": ".py", 189 | "mimetype": "text/x-python", 190 | "name": "python", 191 | "nbconvert_exporter": "python", 192 | "pygments_lexer": "ipython3", 193 | "version": "3.5.4" 194 | } 195 | }, 196 | "nbformat": 4, 197 | "nbformat_minor": 2 198 | } 199 | -------------------------------------------------------------------------------- /Chapter02/2.08 Training an Robot to Walk.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Training an agent to Walk" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Now let us learn how to train a robot to walk using Gym along with some fundamentals.\n", 15 | "The strategy is that reward X points will be given when the robot moves forward and if the\n", 16 | "robot fails to move then Y points will be reduced. So the robot will learn to walk in the\n", 17 | "event of maximizing the reward.\n", 18 | "\n", 19 | "First, we will import the library, then we will create a simulation instance by make\n", 20 | "function. \n", 21 | "\n", 22 | "\n", 23 | "Open AI Gym provides an environment called BipedalWalker-v2 for training\n", 24 | "robotic agents in simple terrain. " 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": null, 30 | "metadata": { 31 | "collapsed": true 32 | }, 33 | "outputs": [], 34 | "source": [ 35 | "import gym\n", 36 | "env = gym.make('BipedalWalker-v2')" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "Then for each episode (Agent-Environment interaction between initial and final state), we\n", 44 | "will initialize the environment using reset method." 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": null, 50 | "metadata": { 51 | "collapsed": true 52 | }, 53 | "outputs": [], 54 | "source": [ 55 | "for episode in range(100):\n", 56 | " observation = env.reset()\n", 57 | " \n", 58 | " # Render the environment on each step \n", 59 | " for i in range(10000):\n", 60 | " env.render()\n", 61 | " \n", 62 | " # we choose action by sampling random action from environment's action space. 
Every environment has\n", 63 | " # some action space which contains the all possible valid actions and observations,\n", 64 | " \n", 65 | " action = env.action_space.sample()\n", 66 | " \n", 67 | " # Then for each step, we will record the observation, reward, done, info\n", 68 | " observation, reward, done, info = env.step(action)\n", 69 | " \n", 70 | " # When done is true, we print the time steps taken for the episode and break the current episode.\n", 71 | " if done:\n", 72 | " print(\"{} timesteps taken for the Episode\".format(i+1))\n", 73 | " break" 74 | ] 75 | }, 76 | { 77 | "cell_type": "markdown", 78 | "metadata": {}, 79 | "source": [ 80 | "The agent will learn by trail and error and over a period of time it starts selecting actions which gives the\n", 81 | "maximum rewards." 82 | ] 83 | } 84 | ], 85 | "metadata": { 86 | "kernelspec": { 87 | "display_name": "Python [conda env:universe]", 88 | "language": "python", 89 | "name": "conda-env-universe-py" 90 | }, 91 | "language_info": { 92 | "codemirror_mode": { 93 | "name": "ipython", 94 | "version": 3 95 | }, 96 | "file_extension": ".py", 97 | "mimetype": "text/x-python", 98 | "name": "python", 99 | "nbconvert_exporter": "python", 100 | "pygments_lexer": "ipython3", 101 | "version": "3.5.4" 102 | } 103 | }, 104 | "nbformat": 4, 105 | "nbformat_minor": 2 106 | } 107 | -------------------------------------------------------------------------------- /Chapter02/2.09 Building a Video Game Bot .ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Building a Video Game Bot using OpenAI Universe" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": { 13 | "collapsed": true 14 | }, 15 | "source": [ 16 | "Let us learn how to build a video game bot which plays car racing game. Our objective is\n", 17 | "that car has to move forward without getting stuck by any obstacles and hitting other cars." 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": {}, 23 | "source": [ 24 | "First, we import necessary libraries," 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": null, 30 | "metadata": { 31 | "collapsed": true 32 | }, 33 | "outputs": [], 34 | "source": [ 35 | "import gym\n", 36 | "import universe \n", 37 | "import random" 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "Then we simulate our car racing environment by make function." 
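Before moving on to the racing bot, one small extension to the BipedalWalker loop shown above: accumulating the per-episode return makes it easy to see how poorly purely random actions do, which is a useful baseline before moving on to learned policies. A minimal sketch using only the Gym calls already introduced in this chapter:

```python
import gym

env = gym.make('BipedalWalker-v2')

for episode in range(10):
    observation = env.reset()
    total_reward = 0.0
    for t in range(10000):
        action = env.action_space.sample()                  # random action
        observation, reward, done, info = env.step(action)
        total_reward += reward                              # accumulate the episode return
        if done:
            print("Episode {}: {} timesteps, return {:.2f}".format(episode, t + 1, total_reward))
            break
```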
45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": null, 50 | "metadata": { 51 | "collapsed": true 52 | }, 53 | "outputs": [], 54 | "source": [ 55 | "env = gym.make('flashgames.NeonRace-v0')" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": null, 61 | "metadata": { 62 | "collapsed": true 63 | }, 64 | "outputs": [], 65 | "source": [ 66 | "env.configure(remotes=1) " 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "And let us create variables for moving the car," 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": null, 79 | "metadata": { 80 | "collapsed": true 81 | }, 82 | "outputs": [], 83 | "source": [ 84 | "# Move left\n", 85 | "left = [('KeyEvent', 'ArrowUp', True), ('KeyEvent', 'ArrowLeft', True),\n", 86 | " ('KeyEvent', 'ArrowRight', False)]\n", 87 | "\n", 88 | "# Move right\n", 89 | "right = [('KeyEvent', 'ArrowUp', True), ('KeyEvent', 'ArrowLeft', False),\n", 90 | " ('KeyEvent', 'ArrowRight', True)]\n", 91 | "\n", 92 | "# Move forward\n", 93 | "\n", 94 | "forward = [('KeyEvent', 'ArrowUp', True), ('KeyEvent', 'ArrowRight', False),\n", 95 | " ('KeyEvent', 'ArrowLeft', False), ('KeyEvent', 'n', True)]" 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": {}, 101 | "source": [ 102 | "Followed by, we will initialize some other variables" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": null, 108 | "metadata": { 109 | "collapsed": true 110 | }, 111 | "outputs": [], 112 | "source": [ 113 | "# We use turn variable for deciding whether to turn or not\n", 114 | "turn = 0\n", 115 | "\n", 116 | "# We store all the rewards in rewards list\n", 117 | "rewards = []\n", 118 | "\n", 119 | "# we will use buffer as some kind of threshold\n", 120 | "buffer_size = 100\n", 121 | "\n", 122 | "# We set our initial action has forward i.e our car moves just forward without making any turns\n", 123 | "action = forward" 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": { 129 | "collapsed": true 130 | }, 131 | "source": [ 132 | "Now, let us begin our game agent to play in an infinite loop which continuously performs an action based on interaction with the environment." 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": null, 138 | "metadata": { 139 | "collapsed": true 140 | }, 141 | "outputs": [], 142 | "source": [ 143 | "while True:\n", 144 | " turn -= 1\n", 145 | " \n", 146 | " # Let us say initially we take no turn and move forward.\n", 147 | " # First, We will check the value of turn, if it is less than 0\n", 148 | " # then there is no necessity for turning and we just move forward\n", 149 | " \n", 150 | " if turn <= 0:\n", 151 | " action = forward\n", 152 | " turn = 0\n", 153 | " \n", 154 | " action_n = [action for ob in observation_n]\n", 155 | " \n", 156 | " # Then we use env.step() to perform an action (moving forward for now) one-time step\n", 157 | " \n", 158 | " observation_n, reward_n, done_n, info = env.step(action_n)\n", 159 | " \n", 160 | " # store the rewards in the rewards list\n", 161 | " rewards += [reward_n[0]]\n", 162 | " \n", 163 | " # We will generate some random number and if it is less than 0.5 then we will take right, else\n", 164 | " # we will take left and we will store all the rewards obtained by performing each action and\n", 165 | " # based on our rewards we will learn which direction is the best over several timesteps. 
\n", 166 | " \n", 167 | " if len(rewards) >= buffer_size:\n", 168 | " mean = sum(rewards)/len(rewards)\n", 169 | " \n", 170 | " if mean == 0:\n", 171 | " turn = 20\n", 172 | " if random.random() < 0.5:\n", 173 | " action = right\n", 174 | " else:\n", 175 | " action = left\n", 176 | " rewards = []\n", 177 | " \n", 178 | " env.render()" 179 | ] 180 | } 181 | ], 182 | "metadata": { 183 | "kernelspec": { 184 | "display_name": "Python [conda env:universe]", 185 | "language": "python", 186 | "name": "conda-env-universe-py" 187 | }, 188 | "language_info": { 189 | "codemirror_mode": { 190 | "name": "ipython", 191 | "version": 3 192 | }, 193 | "file_extension": ".py", 194 | "mimetype": "text/x-python", 195 | "name": "python", 196 | "nbconvert_exporter": "python", 197 | "pygments_lexer": "ipython3", 198 | "version": "3.5.4" 199 | } 200 | }, 201 | "nbformat": 4, 202 | "nbformat_minor": 2 203 | } 204 | -------------------------------------------------------------------------------- /Chapter02/2.11 TensorBoard.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# TensorBoard" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "TensorBoard is the tensorflow's visualization tool which can be used to visualize the\n", 15 | "computation graph. It can also be used to plot various quantitative metrics and results of\n", 16 | "several intermediate calculations. Using tensorboard, we can easily visualize complex\n", 17 | "models which would be useful for debugging and also sharing.\n", 18 | "Now let us build a basic computation graph and visualize that in tensorboard." 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "First, let us import the library" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 1, 31 | "metadata": { 32 | "collapsed": true 33 | }, 34 | "outputs": [], 35 | "source": [ 36 | "import tensorflow as tf" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | " Next, we initialize the variables" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": 2, 49 | "metadata": { 50 | "collapsed": true 51 | }, 52 | "outputs": [], 53 | "source": [ 54 | "a = tf.constant(5)\n", 55 | "b = tf.constant(4)\n", 56 | "c = tf.multiply(a,b)\n", 57 | "d = tf.constant(2)\n", 58 | "e = tf.constant(3)\n", 59 | "f = tf.multiply(d,e)\n", 60 | "g = tf.add(c,f)" 61 | ] 62 | }, 63 | { 64 | "cell_type": "markdown", 65 | "metadata": {}, 66 | "source": [ 67 | "Now, we will create a tensorflow session, we will write the results of our graph to file\n", 68 | "called event file using tf.summary.FileWriter()" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": 3, 74 | "metadata": {}, 75 | "outputs": [ 76 | { 77 | "name": "stdout", 78 | "output_type": "stream", 79 | "text": [ 80 | "26\n" 81 | ] 82 | } 83 | ], 84 | "source": [ 85 | "with tf.Session() as sess:\n", 86 | " writer = tf.summary.FileWriter(\"logs\", sess.graph)\n", 87 | " print(sess.run(g))\n", 88 | " writer.close()" 89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "metadata": {}, 94 | "source": [ 95 | "In order to run the tensorboard, go to your terminal, locate the working directory and\n", 96 | "type \n", 97 | "\n", 98 | "tensorboard --logdir=logs --port=6003" 99 | ] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "metadata": {}, 104 | 
"source": [ 105 | "# Adding Scope" 106 | ] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "metadata": {}, 111 | "source": [ 112 | " Scoping is used to reduce complexity and helps to better understand the model by\n", 113 | "grouping the related nodes together, For instance, in the above example, we can break\n", 114 | "down our graph into two different groups called computation and result. If you look at the\n", 115 | "previous example we can see that nodes, a to e perform the computation and node g\n", 116 | "calculate the result. So we can group them separately using the scope for easy\n", 117 | "understanding. Scoping can be created using tf.name_scope() function." 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": 4, 123 | "metadata": { 124 | "collapsed": true 125 | }, 126 | "outputs": [], 127 | "source": [ 128 | "with tf.name_scope(\"Computation\"):\n", 129 | " a = tf.constant(5)\n", 130 | " b = tf.constant(4)\n", 131 | " c = tf.multiply(a,b)\n", 132 | " d = tf.constant(2)\n", 133 | " e = tf.constant(3)\n", 134 | " f = tf.multiply(d,e)" 135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": 5, 140 | "metadata": { 141 | "collapsed": true 142 | }, 143 | "outputs": [], 144 | "source": [ 145 | "\n", 146 | "with tf.name_scope(\"Result\"):\n", 147 | " g = tf.add(c,f)" 148 | ] 149 | }, 150 | { 151 | "cell_type": "markdown", 152 | "metadata": {}, 153 | "source": [ 154 | "\n", 155 | "If you see the computation scope, we can further break down in to separate parts for even\n", 156 | "more good understanding. Say we can create scope as part 1 which has nodes a to c and\n", 157 | "scope as part 2 which has nodes d to e since part 1 and 2 are independent of each other." 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": 6, 163 | "metadata": { 164 | "collapsed": true 165 | }, 166 | "outputs": [], 167 | "source": [ 168 | "with tf.name_scope(\"Computation\"):\n", 169 | " with tf.name_scope(\"Part1\"):\n", 170 | " a = tf.constant(5)\n", 171 | " b = tf.constant(4)\n", 172 | " c = tf.multiply(a,b)\n", 173 | " with tf.name_scope(\"Part2\"):\n", 174 | " d = tf.constant(2)\n", 175 | " e = tf.constant(3)\n", 176 | " f = tf.multiply(d,e)" 177 | ] 178 | }, 179 | { 180 | "cell_type": "markdown", 181 | "metadata": {}, 182 | "source": [ 183 | "\n", 184 | "Scoping can be better understood by visualizing them in the tensorboard. 
The complete\n", 185 | "code looks like as follows," 186 | ] 187 | }, 188 | { 189 | "cell_type": "code", 190 | "execution_count": 7, 191 | "metadata": {}, 192 | "outputs": [ 193 | { 194 | "name": "stdout", 195 | "output_type": "stream", 196 | "text": [ 197 | "26\n" 198 | ] 199 | } 200 | ], 201 | "source": [ 202 | "with tf.name_scope(\"Computation\"):\n", 203 | " with tf.name_scope(\"Part1\"):\n", 204 | " a = tf.constant(5)\n", 205 | " b = tf.constant(4)\n", 206 | " c = tf.multiply(a,b)\n", 207 | " with tf.name_scope(\"Part2\"):\n", 208 | " d = tf.constant(2)\n", 209 | " e = tf.constant(3)\n", 210 | " f = tf.multiply(d,e)\n", 211 | "with tf.name_scope(\"Result\"):\n", 212 | " g = tf.add(c,f)\n", 213 | "with tf.Session() as sess:\n", 214 | " writer = tf.summary.FileWriter(\"logs\", sess.graph)\n", 215 | " print(sess.run(g))\n", 216 | " writer.close()" 217 | ] 218 | }, 219 | { 220 | "cell_type": "markdown", 221 | "metadata": {}, 222 | "source": [ 223 | "In order to run the tensorboard, go to your terminal, locate the working directory and\n", 224 | "type \n", 225 | "\n", 226 | "tensorboard --logdir=logs --port=6003" 227 | ] 228 | }, 229 | { 230 | "cell_type": "markdown", 231 | "metadata": {}, 232 | "source": [ 233 | "If you look at the TensorBoard you can easily understand how scoping helps us to reduce\n", 234 | "complexity in understanding by grouping the similar nodes together. Scoping is widely\n", 235 | "used while working on a complex projects to better understand the functionality and\n", 236 | "dependencies of nodes." 237 | ] 238 | } 239 | ], 240 | "metadata": { 241 | "kernelspec": { 242 | "display_name": "Python [conda env:universe]", 243 | "language": "python", 244 | "name": "conda-env-universe-py" 245 | }, 246 | "language_info": { 247 | "codemirror_mode": { 248 | "name": "ipython", 249 | "version": 3 250 | }, 251 | "file_extension": ".py", 252 | "mimetype": "text/x-python", 253 | "name": "python", 254 | "nbconvert_exporter": "python", 255 | "pygments_lexer": "ipython3", 256 | "version": "3.5.4" 257 | } 258 | }, 259 | "nbformat": 4, 260 | "nbformat_minor": 2 261 | } 262 | -------------------------------------------------------------------------------- /Chapter02/logs/events.out.tfevents.1527762800.sudharsan: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Hands-On-Reinforcement-Learning-with-Python/21c815b2608255694b72401919c4c08268bc48ec/Chapter02/logs/events.out.tfevents.1527762800.sudharsan -------------------------------------------------------------------------------- /Chapter03/.ipynb_checkpoints/3.13 Policy Iteration - Frozen Lake Problem-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "collapsed": true 7 | }, 8 | "source": [ 9 | "# Solving Frozen Lake Problem Using Policy Iteration" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "Imagine, there is a frozen lake from your home to office, you should walk on the frozen lake\n", 17 | "to reach your office. But oops! there will be a hole in the frozen lake in between, so you have\n", 18 | "to be careful while walking in the frozen lake to avoid getting trapped at holes.\n", 19 | "Look at the below figure where, \n", 20 | "\n", 21 | "1. S is the starting position (Home)\n", 22 | "2. F is the Frozen lake where you can walk\n", 23 | "3. 
H is the Hole which you have to be so careful about\n", 24 | "4. G is the Goal (office)\n" 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": {}, 30 | "source": [ 31 | "![title](images/B09792_03_50.png)" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | "Okay, now let us use our agent instead of you to find the correct way to reach the office.\n", 39 | "The agent goal is to find the optimal path to reach from S to G without getting trapped at H.\n", 40 | "How an agent can achieve this? We give +1 point as a reward to the agent if it correctly\n", 41 | "walks on the frozen lake and 0 points if it falls into the hole. So that agent could determine\n", 42 | "which is the right action. An agent will now try to find the optimal policy. Optimal policy\n", 43 | "implies taking the correct path which maximizes the agent reward. If the agent is\n", 44 | "maximizing the reward, apparently agent is learning to skip the hole and reach the\n", 45 | "destination." 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | " First, we import necessary libraries" 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": 1, 58 | "metadata": { 59 | "collapsed": true 60 | }, 61 | "outputs": [], 62 | "source": [ 63 | "import gym\n", 64 | "import numpy as np" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": {}, 70 | "source": [ 71 | "Initialize our gym environment" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 2, 77 | "metadata": {}, 78 | "outputs": [ 79 | { 80 | "name": "stderr", 81 | "output_type": "stream", 82 | "text": [ 83 | "[2018-06-01 12:00:11,336] Making new env: FrozenLake-v0\n" 84 | ] 85 | } 86 | ], 87 | "source": [ 88 | "env = gym.make('FrozenLake-v0')" 89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "metadata": {}, 94 | "source": [ 95 | "Let us see how the environment looks like" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": 3, 101 | "metadata": {}, 102 | "outputs": [ 103 | { 104 | "name": "stdout", 105 | "output_type": "stream", 106 | "text": [ 107 | "\u001b[41mS\u001b[0mFFF\n", 108 | "FHFH\n", 109 | "FFFH\n", 110 | "HFFG\n", 111 | "\n" 112 | ] 113 | }, 114 | { 115 | "data": { 116 | "text/plain": [ 117 | "" 118 | ] 119 | }, 120 | "execution_count": 3, 121 | "metadata": {}, 122 | "output_type": "execute_result" 123 | } 124 | ], 125 | "source": [ 126 | "env.render()" 127 | ] 128 | }, 129 | { 130 | "cell_type": "markdown", 131 | "metadata": {}, 132 | "source": [ 133 | "Now, we will compute the value function" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": 4, 139 | "metadata": { 140 | "collapsed": true 141 | }, 142 | "outputs": [], 143 | "source": [ 144 | "def compute_value_function(policy, gamma=1.0):\n", 145 | " \n", 146 | " # initialize value table with zeros\n", 147 | " value_table = np.zeros(env.nS)\n", 148 | " \n", 149 | " # set the threshold\n", 150 | " threshold = 1e-10\n", 151 | " \n", 152 | " while True:\n", 153 | " \n", 154 | " # copy the value table to the updated_value_table\n", 155 | " updated_value_table = np.copy(value_table)\n", 156 | "\n", 157 | " # for each state in the environment, select the action according to the policy and compute the value table\n", 158 | " for state in range(env.nS):\n", 159 | " action = policy[state]\n", 160 | " \n", 161 | " # build the value table with the selected action\n", 162 | " value_table[state] = sum([trans_prob * (reward_prob 
+ gamma * updated_value_table[next_state]) \n", 163 | " for trans_prob, next_state, reward_prob, _ in env.P[state][action]])\n", 164 | " \n", 165 | " if (np.sum((np.fabs(updated_value_table - value_table))) <= threshold):\n", 166 | " break\n", 167 | " \n", 168 | " return value_table\n" 169 | ] 170 | }, 171 | { 172 | "cell_type": "markdown", 173 | "metadata": {}, 174 | "source": [ 175 | " Now, we define a function called extract policy for extracting optimal policy from the optimal value function. \n", 176 | "i.e We calculate Q value using our optimal value function and pick up\n", 177 | "the actions which has the highest Q value for each state as the optimal policy." 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": 5, 183 | "metadata": { 184 | "collapsed": true 185 | }, 186 | "outputs": [], 187 | "source": [ 188 | "\n", 189 | "def extract_policy(value_table, gamma = 1.0):\n", 190 | " \n", 191 | " # Initialize the policy with zeros\n", 192 | " policy = np.zeros(env.observation_space.n) \n", 193 | " \n", 194 | " \n", 195 | " for state in range(env.observation_space.n):\n", 196 | " \n", 197 | " # initialize the Q table for a state\n", 198 | " Q_table = np.zeros(env.action_space.n)\n", 199 | " \n", 200 | " # compute Q value for all ations in the state\n", 201 | " for action in range(env.action_space.n):\n", 202 | " for next_sr in env.P[state][action]: \n", 203 | " trans_prob, next_state, reward_prob, _ = next_sr \n", 204 | " Q_table[action] += (trans_prob * (reward_prob + gamma * value_table[next_state]))\n", 205 | " \n", 206 | " # Select the action which has maximum Q value as an optimal action of the state\n", 207 | " policy[state] = np.argmax(Q_table)\n", 208 | " \n", 209 | " return policy" 210 | ] 211 | }, 212 | { 213 | "cell_type": "markdown", 214 | "metadata": {}, 215 | "source": [ 216 | "Now, we define the function for performing policy iteration" 217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": 7, 222 | "metadata": { 223 | "collapsed": true 224 | }, 225 | "outputs": [], 226 | "source": [ 227 | "def policy_iteration(env,gamma = 1.0):\n", 228 | " \n", 229 | " # Initialize policy with zeros\n", 230 | " old_policy = np.zeros(env.observation_space.n) \n", 231 | " no_of_iterations = 200000\n", 232 | " \n", 233 | " for i in range(no_of_iterations):\n", 234 | " \n", 235 | " # compute the value function\n", 236 | " new_value_function = compute_value_function(old_policy, gamma)\n", 237 | " \n", 238 | " # Extract new policy from the computed value function\n", 239 | " new_policy = extract_policy(new_value_function, gamma)\n", 240 | " \n", 241 | " # Then we check whether we have reached convergence i.e whether we found the optimal\n", 242 | " # policy by comparing old_policy and new policy if it same we will break the iteration\n", 243 | " # else we update old_policy with new_policy\n", 244 | "\n", 245 | " if (np.all(old_policy == new_policy)):\n", 246 | " print ('Policy-Iteration converged at step %d.' %(i+1))\n", 247 | " break\n", 248 | " old_policy = new_policy\n", 249 | " \n", 250 | " return new_policy\n", 251 | "\n" 252 | ] 253 | }, 254 | { 255 | "cell_type": "code", 256 | "execution_count": 8, 257 | "metadata": {}, 258 | "outputs": [ 259 | { 260 | "name": "stdout", 261 | "output_type": "stream", 262 | "text": [ 263 | "Policy-Iteration converged at step 7.\n", 264 | "[0. 3. 3. 3. 0. 0. 0. 0. 3. 1. 0. 0. 0. 2. 1. 
0.]\n" 265 | ] 266 | } 267 | ], 268 | "source": [ 269 | "print (policy_iteration(env))" 270 | ] 271 | } 272 | ], 273 | "metadata": { 274 | "kernelspec": { 275 | "display_name": "Python [conda env:universe]", 276 | "language": "python", 277 | "name": "conda-env-universe-py" 278 | }, 279 | "language_info": { 280 | "codemirror_mode": { 281 | "name": "ipython", 282 | "version": 3 283 | }, 284 | "file_extension": ".py", 285 | "mimetype": "text/x-python", 286 | "name": "python", 287 | "nbconvert_exporter": "python", 288 | "pygments_lexer": "ipython3", 289 | "version": "3.5.4" 290 | } 291 | }, 292 | "nbformat": 4, 293 | "nbformat_minor": 2 294 | } 295 | -------------------------------------------------------------------------------- /Chapter03/3.13 Policy Iteration - Frozen Lake Problem.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "collapsed": true 7 | }, 8 | "source": [ 9 | "# Solving Frozen Lake Problem Using Policy Iteration" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "Imagine, there is a frozen lake from your home to office, you should walk on the frozen lake\n", 17 | "to reach your office. But oops! there will be a hole in the frozen lake in between, so you have\n", 18 | "to be careful while walking in the frozen lake to avoid getting trapped at holes.\n", 19 | "Look at the below figure where, \n", 20 | "\n", 21 | "1. S is the starting position (Home)\n", 22 | "2. F is the Frozen lake where you can walk\n", 23 | "3. H is the Hole which you have to be so careful about\n", 24 | "4. G is the Goal (office)\n" 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": {}, 30 | "source": [ 31 | "![title](images/B09792_03_50.png)" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | "Okay, now let us use our agent instead of you to find the correct way to reach the office.\n", 39 | "The agent goal is to find the optimal path to reach from S to G without getting trapped at H.\n", 40 | "How an agent can achieve this? We give +1 point as a reward to the agent if it correctly\n", 41 | "walks on the frozen lake and 0 points if it falls into the hole. So that agent could determine\n", 42 | "which is the right action. An agent will now try to find the optimal policy. Optimal policy\n", 43 | "implies taking the correct path which maximizes the agent reward. If the agent is\n", 44 | "maximizing the reward, apparently agent is learning to skip the hole and reach the\n", 45 | "destination." 
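For reference, the policy evaluation step implemented later in this notebook (inside `compute_value_function`) is the Bellman expectation backup for the current policy $\pi$, applied repeatedly until the value table stops changing:

$$V(s) \leftarrow \sum_{s'} P\big(s' \mid s, \pi(s)\big)\,\Big[R\big(s, \pi(s), s'\big) + \gamma\, V(s')\Big]$$

Here the transition probability $P$ and reward $R$ come from `env.P[state][action]`, and $\gamma$ is the discount factor `gamma`.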
46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | " First, we import necessary libraries" 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": 1, 58 | "metadata": { 59 | "collapsed": true 60 | }, 61 | "outputs": [], 62 | "source": [ 63 | "import gym\n", 64 | "import numpy as np" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": {}, 70 | "source": [ 71 | "Initialize our gym environment" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 2, 77 | "metadata": {}, 78 | "outputs": [ 79 | { 80 | "name": "stderr", 81 | "output_type": "stream", 82 | "text": [ 83 | "[2018-06-01 12:00:11,336] Making new env: FrozenLake-v0\n" 84 | ] 85 | } 86 | ], 87 | "source": [ 88 | "env = gym.make('FrozenLake-v0')" 89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "metadata": {}, 94 | "source": [ 95 | "Let us see how the environment looks like" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": 3, 101 | "metadata": {}, 102 | "outputs": [ 103 | { 104 | "name": "stdout", 105 | "output_type": "stream", 106 | "text": [ 107 | "\u001b[41mS\u001b[0mFFF\n", 108 | "FHFH\n", 109 | "FFFH\n", 110 | "HFFG\n", 111 | "\n" 112 | ] 113 | }, 114 | { 115 | "data": { 116 | "text/plain": [ 117 | "" 118 | ] 119 | }, 120 | "execution_count": 3, 121 | "metadata": {}, 122 | "output_type": "execute_result" 123 | } 124 | ], 125 | "source": [ 126 | "env.render()" 127 | ] 128 | }, 129 | { 130 | "cell_type": "markdown", 131 | "metadata": {}, 132 | "source": [ 133 | "Now, we will compute the value function" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": 4, 139 | "metadata": { 140 | "collapsed": true 141 | }, 142 | "outputs": [], 143 | "source": [ 144 | "def compute_value_function(policy, gamma=1.0):\n", 145 | " \n", 146 | " # initialize value table with zeros\n", 147 | " value_table = np.zeros(env.nS)\n", 148 | " \n", 149 | " # set the threshold\n", 150 | " threshold = 1e-10\n", 151 | " \n", 152 | " while True:\n", 153 | " \n", 154 | " # copy the value table to the updated_value_table\n", 155 | " updated_value_table = np.copy(value_table)\n", 156 | "\n", 157 | " # for each state in the environment, select the action according to the policy and compute the value table\n", 158 | " for state in range(env.nS):\n", 159 | " action = policy[state]\n", 160 | " \n", 161 | " # build the value table with the selected action\n", 162 | " value_table[state] = sum([trans_prob * (reward_prob + gamma * updated_value_table[next_state]) \n", 163 | " for trans_prob, next_state, reward_prob, _ in env.P[state][action]])\n", 164 | " \n", 165 | " if (np.sum((np.fabs(updated_value_table - value_table))) <= threshold):\n", 166 | " break\n", 167 | " \n", 168 | " return value_table\n" 169 | ] 170 | }, 171 | { 172 | "cell_type": "markdown", 173 | "metadata": {}, 174 | "source": [ 175 | " Now, we define a function called extract policy for extracting optimal policy from the optimal value function. \n", 176 | "i.e We calculate Q value using our optimal value function and pick up\n", 177 | "the actions which has the highest Q value for each state as the optimal policy." 
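In equation form, the improvement step computed in the next cell first forms the action values from the current value estimate and then acts greedily:

$$Q(s, a) = \sum_{s'} P(s' \mid s, a)\,\big[R(s, a, s') + \gamma\, V(s')\big], \qquad \pi'(s) = \arg\max_{a} Q(s, a)$$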
178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": 5, 183 | "metadata": { 184 | "collapsed": true 185 | }, 186 | "outputs": [], 187 | "source": [ 188 | "\n", 189 | "def extract_policy(value_table, gamma = 1.0):\n", 190 | " \n", 191 | " # Initialize the policy with zeros\n", 192 | " policy = np.zeros(env.observation_space.n) \n", 193 | " \n", 194 | " \n", 195 | " for state in range(env.observation_space.n):\n", 196 | " \n", 197 | " # initialize the Q table for a state\n", 198 | " Q_table = np.zeros(env.action_space.n)\n", 199 | " \n", 200 | " # compute Q value for all ations in the state\n", 201 | " for action in range(env.action_space.n):\n", 202 | " for next_sr in env.P[state][action]: \n", 203 | " trans_prob, next_state, reward_prob, _ = next_sr \n", 204 | " Q_table[action] += (trans_prob * (reward_prob + gamma * value_table[next_state]))\n", 205 | " \n", 206 | " # Select the action which has maximum Q value as an optimal action of the state\n", 207 | " policy[state] = np.argmax(Q_table)\n", 208 | " \n", 209 | " return policy" 210 | ] 211 | }, 212 | { 213 | "cell_type": "markdown", 214 | "metadata": {}, 215 | "source": [ 216 | "Now, we define the function for performing policy iteration" 217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": 7, 222 | "metadata": { 223 | "collapsed": true 224 | }, 225 | "outputs": [], 226 | "source": [ 227 | "def policy_iteration(env,gamma = 1.0):\n", 228 | " \n", 229 | " # Initialize policy with zeros\n", 230 | " old_policy = np.zeros(env.observation_space.n) \n", 231 | " no_of_iterations = 200000\n", 232 | " \n", 233 | " for i in range(no_of_iterations):\n", 234 | " \n", 235 | " # compute the value function\n", 236 | " new_value_function = compute_value_function(old_policy, gamma)\n", 237 | " \n", 238 | " # Extract new policy from the computed value function\n", 239 | " new_policy = extract_policy(new_value_function, gamma)\n", 240 | " \n", 241 | " # Then we check whether we have reached convergence i.e whether we found the optimal\n", 242 | " # policy by comparing old_policy and new policy if it same we will break the iteration\n", 243 | " # else we update old_policy with new_policy\n", 244 | "\n", 245 | " if (np.all(old_policy == new_policy)):\n", 246 | " print ('Policy-Iteration converged at step %d.' %(i+1))\n", 247 | " break\n", 248 | " old_policy = new_policy\n", 249 | " \n", 250 | " return new_policy\n", 251 | "\n" 252 | ] 253 | }, 254 | { 255 | "cell_type": "code", 256 | "execution_count": 8, 257 | "metadata": {}, 258 | "outputs": [ 259 | { 260 | "name": "stdout", 261 | "output_type": "stream", 262 | "text": [ 263 | "Policy-Iteration converged at step 7.\n", 264 | "[0. 3. 3. 3. 0. 0. 0. 0. 3. 1. 0. 0. 0. 2. 1. 
0.]\n" 265 | ] 266 | } 267 | ], 268 | "source": [ 269 | "print (policy_iteration(env))" 270 | ] 271 | } 272 | ], 273 | "metadata": { 274 | "kernelspec": { 275 | "display_name": "Python [conda env:universe]", 276 | "language": "python", 277 | "name": "conda-env-universe-py" 278 | }, 279 | "language_info": { 280 | "codemirror_mode": { 281 | "name": "ipython", 282 | "version": 3 283 | }, 284 | "file_extension": ".py", 285 | "mimetype": "text/x-python", 286 | "name": "python", 287 | "nbconvert_exporter": "python", 288 | "pygments_lexer": "ipython3", 289 | "version": "3.5.4" 290 | } 291 | }, 292 | "nbformat": 4, 293 | "nbformat_minor": 2 294 | } 295 | -------------------------------------------------------------------------------- /Chapter03/images/B09792_03_50.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Hands-On-Reinforcement-Learning-with-Python/21c815b2608255694b72401919c4c08268bc48ec/Chapter03/images/B09792_03_50.png -------------------------------------------------------------------------------- /Chapter05/.ipynb_checkpoints/5.5 Taxi Problem - Q Learning-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Solving the Taxi Problem using Q Learning" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### Goal:\n", 15 | "\n", 16 | "Say our agent is the driving the taxi. There are totally four locations and the agent has to\n", 17 | "pick up a passenger at one location and drop at the another. The agent will receive +20\n", 18 | "points as a reward for successful drop off and -1 point for every time step it takes. The agent\n", 19 | "will also lose -10 points for illegal pickups and drops. So the goal of our agent is to learn to\n", 20 | "pick up and drop passengers at the correct location in a short time without boarding any illegal\n", 21 | "passengers." 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": { 27 | "collapsed": true 28 | }, 29 | "source": [ 30 | " First, we import all necessary libraries and simulate the environment" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 1, 36 | "metadata": {}, 37 | "outputs": [ 38 | { 39 | "name": "stderr", 40 | "output_type": "stream", 41 | "text": [ 42 | "[2018-05-23 12:23:58,368] Making new env: Taxi-v1\n" 43 | ] 44 | } 45 | ], 46 | "source": [ 47 | "import random\n", 48 | "import gym\n", 49 | "env = gym.make('Taxi-v1')" 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": {}, 55 | "source": [ 56 | " The environment is shown below, where the letters (R, G, Y, B) represents the different\n", 57 | "locations and a tiny yellow colored rectangle is the taxi driving by our agent." 
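The rendering in the next cell shows the 5 x 5 grid. Programmatically, Taxi exposes a small discrete state and action space, which is exactly what makes the tabular Q dictionary built below feasible. Assuming the standard Taxi layout, the sizes can be checked directly:

```python
print(env.observation_space.n)   # 500 states: 25 taxi positions x 5 passenger locations x 4 destinations
print(env.action_space.n)        # 6 actions: move south/north/east/west, pickup, dropoff
```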
58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": 2, 63 | "metadata": {}, 64 | "outputs": [ 65 | { 66 | "name": "stdout", 67 | "output_type": "stream", 68 | "text": [ 69 | "+---------+\n", 70 | "|\u001b[34;1mR\u001b[0m: | : :G|\n", 71 | "| : : : : |\n", 72 | "| : : : : |\n", 73 | "| |\u001b[43m \u001b[0m: | : |\n", 74 | "|Y| : |\u001b[35mB\u001b[0m: |\n", 75 | "+---------+\n", 76 | "\n" 77 | ] 78 | } 79 | ], 80 | "source": [ 81 | "env.render()" 82 | ] 83 | }, 84 | { 85 | "cell_type": "markdown", 86 | "metadata": {}, 87 | "source": [ 88 | "\n", 89 | "\n", 90 | "Now, we initialize, Q table as a dictionary which stores state-action pair specifying value of performing an action a in\n", 91 | " state s." 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": 3, 97 | "metadata": { 98 | "collapsed": true 99 | }, 100 | "outputs": [], 101 | "source": [ 102 | "q = {}\n", 103 | "for s in range(env.observation_space.n):\n", 104 | " for a in range(env.action_space.n):\n", 105 | " q[(s,a)] = 0.0" 106 | ] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "metadata": {}, 111 | "source": [ 112 | "\n", 113 | "We define a function called update_q_table which will update the Q values according to our Q learning update rule. \n", 114 | "\n", 115 | "If you look at the below function, we take the value which has maximum value for a state-action pair and store it in a variable called qa, then we update the Q value of the preivous state by our update rule." 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": 4, 121 | "metadata": { 122 | "collapsed": true 123 | }, 124 | "outputs": [], 125 | "source": [ 126 | "def update_q_table(prev_state, action, reward, nextstate, alpha, gamma):\n", 127 | " \n", 128 | " qa = max([q[(nextstate, a)] for a in range(env.action_space.n)])\n", 129 | " q[(prev_state,action)] += alpha * (reward + gamma * qa - q[(prev_state,action)])" 130 | ] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "metadata": {}, 135 | "source": [ 136 | "\n", 137 | " \n", 138 | "Then, we define a function for performing epsilon-greedy policy. In epsilon-greedy policy, either we select best action with probability 1-epsilon or we explore new action with probability epsilon. " 139 | ] 140 | }, 141 | { 142 | "cell_type": "code", 143 | "execution_count": 5, 144 | "metadata": { 145 | "collapsed": true 146 | }, 147 | "outputs": [], 148 | "source": [ 149 | "def epsilon_greedy_policy(state, epsilon):\n", 150 | " if random.uniform(0,1) < epsilon:\n", 151 | " return env.action_space.sample()\n", 152 | " else:\n", 153 | " return max(list(range(env.action_space.n)), key = lambda x: q[(state,x)])" 154 | ] 155 | }, 156 | { 157 | "cell_type": "markdown", 158 | "metadata": {}, 159 | "source": [ 160 | "\n", 161 | "Now we initialize necessary variables\n", 162 | "\n", 163 | "alpha - TD learning rate\n", 164 | "\n", 165 | "gamma - discount factor
\n", 166 | "epsilon - epsilon value in epsilon greedy policy" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": 6, 172 | "metadata": { 173 | "collapsed": true 174 | }, 175 | "outputs": [], 176 | "source": [ 177 | "alpha = 0.4\n", 178 | "gamma = 0.999\n", 179 | "epsilon = 0.017" 180 | ] 181 | }, 182 | { 183 | "cell_type": "markdown", 184 | "metadata": {}, 185 | "source": [ 186 | "Now, Let us perform Q Learning!!!!" 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": null, 192 | "metadata": { 193 | "collapsed": true, 194 | "scrolled": true 195 | }, 196 | "outputs": [], 197 | "source": [ 198 | "for i in range(8000):\n", 199 | " r = 0\n", 200 | " \n", 201 | " prev_state = env.reset()\n", 202 | " \n", 203 | " while True:\n", 204 | " \n", 205 | " \n", 206 | " env.render()\n", 207 | " \n", 208 | " # In each state, we select the action by epsilon-greedy policy\n", 209 | " action = epsilon_greedy_policy(prev_state, epsilon)\n", 210 | " \n", 211 | " # then we perform the action and move to the next state, and receive the reward\n", 212 | " nextstate, reward, done, _ = env.step(action)\n", 213 | " \n", 214 | " # Next we update the Q value using our update_q_table function\n", 215 | " # which updates the Q value by Q learning update rule\n", 216 | " \n", 217 | " update_q_table(prev_state, action, reward, nextstate, alpha, gamma)\n", 218 | " \n", 219 | " # Finally we update the previous state as next state\n", 220 | " prev_state = nextstate\n", 221 | "\n", 222 | " # Store all the rewards obtained\n", 223 | " r += reward\n", 224 | "\n", 225 | " #we will break the loop, if we are at the terminal state of the episode\n", 226 | " if done:\n", 227 | " break\n", 228 | "\n", 229 | " print(\"total reward: \", r)\n", 230 | "\n", 231 | "env.close()" 232 | ] 233 | } 234 | ], 235 | "metadata": { 236 | "kernelspec": { 237 | "display_name": "Python [conda env:universe]", 238 | "language": "python", 239 | "name": "conda-env-universe-py" 240 | }, 241 | "language_info": { 242 | "codemirror_mode": { 243 | "name": "ipython", 244 | "version": 3 245 | }, 246 | "file_extension": ".py", 247 | "mimetype": "text/x-python", 248 | "name": "python", 249 | "nbconvert_exporter": "python", 250 | "pygments_lexer": "ipython3", 251 | "version": "3.5.4" 252 | } 253 | }, 254 | "nbformat": 4, 255 | "nbformat_minor": 2 256 | } 257 | -------------------------------------------------------------------------------- /Chapter05/.ipynb_checkpoints/5.7 Taxi Problem - SARSA-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Solving the Taxi Problem Using SARSA" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### Goal:\n", 15 | "\n", 16 | "Say our agent is the driving the taxi. There are totally four locations and the agent has to\n", 17 | "pick up a passenger at one location and drop at the another. The agent will receive +20\n", 18 | "points as a reward for successful drop off and -1 point for every time step it takes. The agent\n", 19 | "will also lose -10 points for illegal pickups and drops. So the goal of our agent is to learn to\n", 20 | "pick up and drop passengers at the correct location in a short time without boarding any illegal\n", 21 | "passengers." 
22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "First, we import all necessary libraries and initialize the environment" 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": 1, 34 | "metadata": {}, 35 | "outputs": [ 36 | { 37 | "name": "stderr", 38 | "output_type": "stream", 39 | "text": [ 40 | "[2018-06-01 12:23:17,082] Making new env: Taxi-v1\n" 41 | ] 42 | } 43 | ], 44 | "source": [ 45 | "import random\n", 46 | "import gym\n", 47 | "env = gym.make('Taxi-v1')" 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": {}, 53 | "source": [ 54 | "The environment is shown below, where the letters (R, G, Y, B) represents the different\n", 55 | "locations and a tiny yellow colored rectangle is the taxi driving by our agent." 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": 2, 61 | "metadata": {}, 62 | "outputs": [ 63 | { 64 | "name": "stdout", 65 | "output_type": "stream", 66 | "text": [ 67 | "+---------+\n", 68 | "|\u001b[35m\u001b[34;1mR\u001b[0m\u001b[0m: | : :G|\n", 69 | "| : : : : |\n", 70 | "| : : : : |\n", 71 | "| | : | : |\n", 72 | "|Y| : |B:\u001b[43m \u001b[0m|\n", 73 | "+---------+\n", 74 | "\n" 75 | ] 76 | } 77 | ], 78 | "source": [ 79 | "env.render()" 80 | ] 81 | }, 82 | { 83 | "cell_type": "markdown", 84 | "metadata": {}, 85 | "source": [ 86 | "\n", 87 | "\n", 88 | "Now, we initialize, Q table has a dictionary which stores state-action pair specifying value of performing an action in\n", 89 | "a state s." 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": 3, 95 | "metadata": { 96 | "collapsed": true 97 | }, 98 | "outputs": [], 99 | "source": [ 100 | "Q = {}\n", 101 | "for s in range(env.observation_space.n):\n", 102 | " for a in range(env.action_space.n):\n", 103 | " Q[(s,a)] = 0.0" 104 | ] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "metadata": {}, 109 | "source": [ 110 | "\n", 111 | "Then, we define a function for performing epsilon-greedy policy. In epsilon-greedy policy, either we select best action with probability 1-epsilon or we explore new action with probability epsilon" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": 3, 117 | "metadata": { 118 | "collapsed": true 119 | }, 120 | "outputs": [], 121 | "source": [ 122 | "def epsilon_greedy(state, epsilon):\n", 123 | " if random.uniform(0,1) < epsilon:\n", 124 | " return env.action_space.sample()\n", 125 | " else:\n", 126 | " return max(list(range(env.action_space.n)), key = lambda x: Q[(state,x)])\n" 127 | ] 128 | }, 129 | { 130 | "cell_type": "markdown", 131 | "metadata": {}, 132 | "source": [ 133 | "\n", 134 | "Now we initialize necessary variables\n", 135 | "\n", 136 | "alpha - TD learning rate\n", 137 | "\n", 138 | "gamma - discount factor
\n", 139 | "epsilon - epsilon value in epsilon greedy policy" 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": 4, 145 | "metadata": { 146 | "collapsed": true 147 | }, 148 | "outputs": [], 149 | "source": [ 150 | "alpha = 0.85\n", 151 | "gamma = 0.90\n", 152 | "epsilon = 0.8" 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "metadata": {}, 158 | "source": [ 159 | "Now, we perform SARSA!!" 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": null, 165 | "metadata": {}, 166 | "outputs": [], 167 | "source": [ 168 | "for i in range(4000):\n", 169 | " \n", 170 | " # we store cumulative reward of each episodes in r\n", 171 | " r = 0\n", 172 | " \n", 173 | " # initialize the state,\n", 174 | " state = env.reset()\n", 175 | " \n", 176 | " # select the action using epsilon-greedy policy\n", 177 | " action = epsilon_greedy(state,epsilon)\n", 178 | " \n", 179 | " while True:\n", 180 | " \n", 181 | " # env.render()\n", 182 | " \n", 183 | " # then we perform the action and move to the next state, and receive the reward\n", 184 | " nextstate, reward, done, _ = env.step(action)\n", 185 | " \n", 186 | " # again, we select the next action using epsilon greedy policy\n", 187 | " nextaction = epsilon_greedy(nextstate,epsilon) \n", 188 | " \n", 189 | " # we calculate the Q value of previous state using our update rule\n", 190 | " Q[(state,action)] += alpha * (reward + gamma * Q[(nextstate,nextaction)]-Q[(state,action)])\n", 191 | "\n", 192 | " # finally we update our state and action with next action and next state\n", 193 | " action = nextaction\n", 194 | " state = nextstate\n", 195 | " \n", 196 | " # store the rewards\n", 197 | " r += reward\n", 198 | " \n", 199 | " # we will break the loop, if we are at the terminal state of the episode\n", 200 | " if done:\n", 201 | " break\n", 202 | " \n", 203 | " print(\"total reward: \", r)\n", 204 | "\n", 205 | "env.close()\n" 206 | ] 207 | } 208 | ], 209 | "metadata": { 210 | "kernelspec": { 211 | "display_name": "Python [conda env:universe]", 212 | "language": "python", 213 | "name": "conda-env-universe-py" 214 | }, 215 | "language_info": { 216 | "codemirror_mode": { 217 | "name": "ipython", 218 | "version": 3 219 | }, 220 | "file_extension": ".py", 221 | "mimetype": "text/x-python", 222 | "name": "python", 223 | "nbconvert_exporter": "python", 224 | "pygments_lexer": "ipython3", 225 | "version": "3.5.4" 226 | } 227 | }, 228 | "nbformat": 4, 229 | "nbformat_minor": 2 230 | } 231 | -------------------------------------------------------------------------------- /Chapter05/5.5 Taxi Problem - Q Learning.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Solving the Taxi Problem using Q Learning" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### Goal:\n", 15 | "\n", 16 | "Say our agent is the driving the taxi. There are totally four locations and the agent has to\n", 17 | "pick up a passenger at one location and drop at the another. The agent will receive +20\n", 18 | "points as a reward for successful drop off and -1 point for every time step it takes. The agent\n", 19 | "will also lose -10 points for illegal pickups and drops. So the goal of our agent is to learn to\n", 20 | "pick up and drop passengers at the correct location in a short time without boarding any illegal\n", 21 | "passengers." 
22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": { 27 | "collapsed": true 28 | }, 29 | "source": [ 30 | " First, we import all necessary libraries and simulate the environment" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 1, 36 | "metadata": {}, 37 | "outputs": [ 38 | { 39 | "name": "stderr", 40 | "output_type": "stream", 41 | "text": [ 42 | "[2018-05-23 12:23:58,368] Making new env: Taxi-v1\n" 43 | ] 44 | } 45 | ], 46 | "source": [ 47 | "import random\n", 48 | "import gym\n", 49 | "env = gym.make('Taxi-v1')" 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": {}, 55 | "source": [ 56 | " The environment is shown below, where the letters (R, G, Y, B) represents the different\n", 57 | "locations and a tiny yellow colored rectangle is the taxi driving by our agent." 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": 2, 63 | "metadata": {}, 64 | "outputs": [ 65 | { 66 | "name": "stdout", 67 | "output_type": "stream", 68 | "text": [ 69 | "+---------+\n", 70 | "|\u001b[34;1mR\u001b[0m: | : :G|\n", 71 | "| : : : : |\n", 72 | "| : : : : |\n", 73 | "| |\u001b[43m \u001b[0m: | : |\n", 74 | "|Y| : |\u001b[35mB\u001b[0m: |\n", 75 | "+---------+\n", 76 | "\n" 77 | ] 78 | } 79 | ], 80 | "source": [ 81 | "env.render()" 82 | ] 83 | }, 84 | { 85 | "cell_type": "markdown", 86 | "metadata": {}, 87 | "source": [ 88 | "\n", 89 | "\n", 90 | "Now, we initialize, Q table as a dictionary which stores state-action pair specifying value of performing an action a in\n", 91 | " state s." 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": 3, 97 | "metadata": { 98 | "collapsed": true 99 | }, 100 | "outputs": [], 101 | "source": [ 102 | "q = {}\n", 103 | "for s in range(env.observation_space.n):\n", 104 | " for a in range(env.action_space.n):\n", 105 | " q[(s,a)] = 0.0" 106 | ] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "metadata": {}, 111 | "source": [ 112 | "\n", 113 | "We define a function called update_q_table which will update the Q values according to our Q learning update rule. \n", 114 | "\n", 115 | "If you look at the below function, we take the value which has maximum value for a state-action pair and store it in a variable called qa, then we update the Q value of the preivous state by our update rule." 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": 4, 121 | "metadata": { 122 | "collapsed": true 123 | }, 124 | "outputs": [], 125 | "source": [ 126 | "def update_q_table(prev_state, action, reward, nextstate, alpha, gamma):\n", 127 | " \n", 128 | " qa = max([q[(nextstate, a)] for a in range(env.action_space.n)])\n", 129 | " q[(prev_state,action)] += alpha * (reward + gamma * qa - q[(prev_state,action)])" 130 | ] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "metadata": {}, 135 | "source": [ 136 | "\n", 137 | " \n", 138 | "Then, we define a function for performing epsilon-greedy policy. In epsilon-greedy policy, either we select best action with probability 1-epsilon or we explore new action with probability epsilon. 
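In other words, update_q_table implements the standard off-policy Q-learning (TD) update, and the epsilon-greedy rule just described can be written compactly as:

$$Q(s,a) \leftarrow Q(s,a) + \alpha\left[\, r + \gamma \max_{a'} Q(s',a') - Q(s,a) \,\right]$$

$$\pi(s) = \begin{cases}\text{a random action} & \text{with probability } \epsilon \\ \arg\max_{a} Q(s,a) & \text{with probability } 1-\epsilon\end{cases}$$

Here the variable `qa` in the code corresponds to $\max_{a'} Q(s',a')$, and `alpha` and `gamma` are the learning rate and discount factor initialized in a later cell.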
" 139 | ] 140 | }, 141 | { 142 | "cell_type": "code", 143 | "execution_count": 5, 144 | "metadata": { 145 | "collapsed": true 146 | }, 147 | "outputs": [], 148 | "source": [ 149 | "def epsilon_greedy_policy(state, epsilon):\n", 150 | " if random.uniform(0,1) < epsilon:\n", 151 | " return env.action_space.sample()\n", 152 | " else:\n", 153 | " return max(list(range(env.action_space.n)), key = lambda x: q[(state,x)])" 154 | ] 155 | }, 156 | { 157 | "cell_type": "markdown", 158 | "metadata": {}, 159 | "source": [ 160 | "\n", 161 | "Now we initialize necessary variables\n", 162 | "\n", 163 | "alpha - TD learning rate\n", 164 | "\n", 165 | "gamma - discount factor
\n", 166 | "epsilon - epsilon value in epsilon greedy policy" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": 6, 172 | "metadata": { 173 | "collapsed": true 174 | }, 175 | "outputs": [], 176 | "source": [ 177 | "alpha = 0.4\n", 178 | "gamma = 0.999\n", 179 | "epsilon = 0.017" 180 | ] 181 | }, 182 | { 183 | "cell_type": "markdown", 184 | "metadata": {}, 185 | "source": [ 186 | "Now, Let us perform Q Learning!!!!" 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": null, 192 | "metadata": { 193 | "collapsed": true, 194 | "scrolled": true 195 | }, 196 | "outputs": [], 197 | "source": [ 198 | "for i in range(8000):\n", 199 | " r = 0\n", 200 | " \n", 201 | " prev_state = env.reset()\n", 202 | " \n", 203 | " while True:\n", 204 | " \n", 205 | " \n", 206 | " env.render()\n", 207 | " \n", 208 | " # In each state, we select the action by epsilon-greedy policy\n", 209 | " action = epsilon_greedy_policy(prev_state, epsilon)\n", 210 | " \n", 211 | " # then we perform the action and move to the next state, and receive the reward\n", 212 | " nextstate, reward, done, _ = env.step(action)\n", 213 | " \n", 214 | " # Next we update the Q value using our update_q_table function\n", 215 | " # which updates the Q value by Q learning update rule\n", 216 | " \n", 217 | " update_q_table(prev_state, action, reward, nextstate, alpha, gamma)\n", 218 | " \n", 219 | " # Finally we update the previous state as next state\n", 220 | " prev_state = nextstate\n", 221 | "\n", 222 | " # Store all the rewards obtained\n", 223 | " r += reward\n", 224 | "\n", 225 | " #we will break the loop, if we are at the terminal state of the episode\n", 226 | " if done:\n", 227 | " break\n", 228 | "\n", 229 | " print(\"total reward: \", r)\n", 230 | "\n", 231 | "env.close()" 232 | ] 233 | } 234 | ], 235 | "metadata": { 236 | "kernelspec": { 237 | "display_name": "Python [conda env:universe]", 238 | "language": "python", 239 | "name": "conda-env-universe-py" 240 | }, 241 | "language_info": { 242 | "codemirror_mode": { 243 | "name": "ipython", 244 | "version": 3 245 | }, 246 | "file_extension": ".py", 247 | "mimetype": "text/x-python", 248 | "name": "python", 249 | "nbconvert_exporter": "python", 250 | "pygments_lexer": "ipython3", 251 | "version": "3.5.4" 252 | } 253 | }, 254 | "nbformat": 4, 255 | "nbformat_minor": 2 256 | } 257 | -------------------------------------------------------------------------------- /Chapter05/5.7 Taxi Problem - SARSA.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Solving the Taxi Problem Using SARSA" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### Goal:\n", 15 | "\n", 16 | "Say our agent is the driving the taxi. There are totally four locations and the agent has to\n", 17 | "pick up a passenger at one location and drop at the another. The agent will receive +20\n", 18 | "points as a reward for successful drop off and -1 point for every time step it takes. The agent\n", 19 | "will also lose -10 points for illegal pickups and drops. So the goal of our agent is to learn to\n", 20 | "pick up and drop passengers at the correct location in a short time without boarding any illegal\n", 21 | "passengers." 
22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "First, we import all necessary libraries and initialize the environment" 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": 1, 34 | "metadata": {}, 35 | "outputs": [ 36 | { 37 | "name": "stderr", 38 | "output_type": "stream", 39 | "text": [ 40 | "[2018-06-01 12:23:17,082] Making new env: Taxi-v1\n" 41 | ] 42 | } 43 | ], 44 | "source": [ 45 | "import random\n", 46 | "import gym\n", 47 | "env = gym.make('Taxi-v1')" 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": {}, 53 | "source": [ 54 | "The environment is shown below, where the letters (R, G, Y, B) represents the different\n", 55 | "locations and a tiny yellow colored rectangle is the taxi driving by our agent." 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": 2, 61 | "metadata": {}, 62 | "outputs": [ 63 | { 64 | "name": "stdout", 65 | "output_type": "stream", 66 | "text": [ 67 | "+---------+\n", 68 | "|\u001b[35m\u001b[34;1mR\u001b[0m\u001b[0m: | : :G|\n", 69 | "| : : : : |\n", 70 | "| : : : : |\n", 71 | "| | : | : |\n", 72 | "|Y| : |B:\u001b[43m \u001b[0m|\n", 73 | "+---------+\n", 74 | "\n" 75 | ] 76 | } 77 | ], 78 | "source": [ 79 | "env.render()" 80 | ] 81 | }, 82 | { 83 | "cell_type": "markdown", 84 | "metadata": {}, 85 | "source": [ 86 | "\n", 87 | "\n", 88 | "Now, we initialize, Q table has a dictionary which stores state-action pair specifying value of performing an action in\n", 89 | "a state s." 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": 3, 95 | "metadata": { 96 | "collapsed": true 97 | }, 98 | "outputs": [], 99 | "source": [ 100 | "Q = {}\n", 101 | "for s in range(env.observation_space.n):\n", 102 | " for a in range(env.action_space.n):\n", 103 | " Q[(s,a)] = 0.0" 104 | ] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "metadata": {}, 109 | "source": [ 110 | "\n", 111 | "Then, we define a function for performing epsilon-greedy policy. In epsilon-greedy policy, either we select best action with probability 1-epsilon or we explore new action with probability epsilon" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": 3, 117 | "metadata": { 118 | "collapsed": true 119 | }, 120 | "outputs": [], 121 | "source": [ 122 | "def epsilon_greedy(state, epsilon):\n", 123 | " if random.uniform(0,1) < epsilon:\n", 124 | " return env.action_space.sample()\n", 125 | " else:\n", 126 | " return max(list(range(env.action_space.n)), key = lambda x: Q[(state,x)])\n" 127 | ] 128 | }, 129 | { 130 | "cell_type": "markdown", 131 | "metadata": {}, 132 | "source": [ 133 | "\n", 134 | "Now we initialize necessary variables\n", 135 | "\n", 136 | "alpha - TD learning rate\n", 137 | "\n", 138 | "gamma - discount factor
\n", 139 | "epsilon - epsilon value in epsilon greedy policy" 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": 4, 145 | "metadata": { 146 | "collapsed": true 147 | }, 148 | "outputs": [], 149 | "source": [ 150 | "alpha = 0.85\n", 151 | "gamma = 0.90\n", 152 | "epsilon = 0.8" 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "metadata": {}, 158 | "source": [ 159 | "Now, we perform SARSA!!" 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": null, 165 | "metadata": {}, 166 | "outputs": [], 167 | "source": [ 168 | "for i in range(4000):\n", 169 | " \n", 170 | " # we store cumulative reward of each episodes in r\n", 171 | " r = 0\n", 172 | " \n", 173 | " # initialize the state,\n", 174 | " state = env.reset()\n", 175 | " \n", 176 | " # select the action using epsilon-greedy policy\n", 177 | " action = epsilon_greedy(state,epsilon)\n", 178 | " \n", 179 | " while True:\n", 180 | " \n", 181 | " # env.render()\n", 182 | " \n", 183 | " # then we perform the action and move to the next state, and receive the reward\n", 184 | " nextstate, reward, done, _ = env.step(action)\n", 185 | " \n", 186 | " # again, we select the next action using epsilon greedy policy\n", 187 | " nextaction = epsilon_greedy(nextstate,epsilon) \n", 188 | " \n", 189 | " # we calculate the Q value of previous state using our update rule\n", 190 | " Q[(state,action)] += alpha * (reward + gamma * Q[(nextstate,nextaction)]-Q[(state,action)])\n", 191 | "\n", 192 | " # finally we update our state and action with next action and next state\n", 193 | " action = nextaction\n", 194 | " state = nextstate\n", 195 | " \n", 196 | " # store the rewards\n", 197 | " r += reward\n", 198 | " \n", 199 | " # we will break the loop, if we are at the terminal state of the episode\n", 200 | " if done:\n", 201 | " break\n", 202 | " \n", 203 | " print(\"total reward: \", r)\n", 204 | "\n", 205 | "env.close()\n" 206 | ] 207 | } 208 | ], 209 | "metadata": { 210 | "kernelspec": { 211 | "display_name": "Python [conda env:universe]", 212 | "language": "python", 213 | "name": "conda-env-universe-py" 214 | }, 215 | "language_info": { 216 | "codemirror_mode": { 217 | "name": "ipython", 218 | "version": 3 219 | }, 220 | "file_extension": ".py", 221 | "mimetype": "text/x-python", 222 | "name": "python", 223 | "nbconvert_exporter": "python", 224 | "pygments_lexer": "ipython3", 225 | "version": "3.5.4" 226 | } 227 | }, 228 | "nbformat": 4, 229 | "nbformat_minor": 2 230 | } 231 | -------------------------------------------------------------------------------- /Chapter06/images/B09792_06_01.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Hands-On-Reinforcement-Learning-with-Python/21c815b2608255694b72401919c4c08268bc48ec/Chapter06/images/B09792_06_01.png -------------------------------------------------------------------------------- /Chapter07/data/fashion/t10k-images-idx3-ubyte.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Hands-On-Reinforcement-Learning-with-Python/21c815b2608255694b72401919c4c08268bc48ec/Chapter07/data/fashion/t10k-images-idx3-ubyte.gz -------------------------------------------------------------------------------- /Chapter07/data/fashion/t10k-labels-idx1-ubyte.gz: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/PacktPublishing/Hands-On-Reinforcement-Learning-with-Python/21c815b2608255694b72401919c4c08268bc48ec/Chapter07/data/fashion/t10k-labels-idx1-ubyte.gz -------------------------------------------------------------------------------- /Chapter07/data/fashion/train-images-idx3-ubyte.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Hands-On-Reinforcement-Learning-with-Python/21c815b2608255694b72401919c4c08268bc48ec/Chapter07/data/fashion/train-images-idx3-ubyte.gz -------------------------------------------------------------------------------- /Chapter07/data/fashion/train-labels-idx1-ubyte.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Hands-On-Reinforcement-Learning-with-Python/21c815b2608255694b72401919c4c08268bc48ec/Chapter07/data/fashion/train-labels-idx1-ubyte.gz -------------------------------------------------------------------------------- /Chapter07/data/mnist/t10k-images-idx3-ubyte.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Hands-On-Reinforcement-Learning-with-Python/21c815b2608255694b72401919c4c08268bc48ec/Chapter07/data/mnist/t10k-images-idx3-ubyte.gz -------------------------------------------------------------------------------- /Chapter07/data/mnist/t10k-labels-idx1-ubyte.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Hands-On-Reinforcement-Learning-with-Python/21c815b2608255694b72401919c4c08268bc48ec/Chapter07/data/mnist/t10k-labels-idx1-ubyte.gz -------------------------------------------------------------------------------- /Chapter07/data/mnist/train-images-idx3-ubyte.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Hands-On-Reinforcement-Learning-with-Python/21c815b2608255694b72401919c4c08268bc48ec/Chapter07/data/mnist/train-images-idx3-ubyte.gz -------------------------------------------------------------------------------- /Chapter07/data/mnist/train-labels-idx1-ubyte.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Hands-On-Reinforcement-Learning-with-Python/21c815b2608255694b72401919c4c08268bc48ec/Chapter07/data/mnist/train-labels-idx1-ubyte.gz -------------------------------------------------------------------------------- /Chapter08/logs/events.out.tfevents.1526989751.sudharsan: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Hands-On-Reinforcement-Learning-with-Python/21c815b2608255694b72401919c4c08268bc48ec/Chapter08/logs/events.out.tfevents.1526989751.sudharsan -------------------------------------------------------------------------------- /Chapter08/logs/events.out.tfevents.1526990072.sudharsan: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Hands-On-Reinforcement-Learning-with-Python/21c815b2608255694b72401919c4c08268bc48ec/Chapter08/logs/events.out.tfevents.1526990072.sudharsan -------------------------------------------------------------------------------- /Chapter08/logs/events.out.tfevents.1528714237.sudharsan: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Hands-On-Reinforcement-Learning-with-Python/21c815b2608255694b72401919c4c08268bc48ec/Chapter08/logs/events.out.tfevents.1528714237.sudharsan -------------------------------------------------------------------------------- /Chapter09/.ipynb_checkpoints/9.4 Basic Doom Game-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Basic Doom Game" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "\n", 15 | "Before diving in, Let us familiarize ourselves with a vizdoom environment by seeing the\n", 16 | "basic example.\n", 17 | "\n", 18 | "Let us load the necessary libraries" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 1, 24 | "metadata": { 25 | "collapsed": true 26 | }, 27 | "outputs": [], 28 | "source": [ 29 | "from vizdoom import *\n", 30 | "import random\n", 31 | "import time" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | "Create an instance to the DoomGame" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": 2, 44 | "metadata": { 45 | "collapsed": true 46 | }, 47 | "outputs": [], 48 | "source": [ 49 | "game = DoomGame()" 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": {}, 55 | "source": [ 56 | " As we know vizdoom provides a lot of doom scenarios, for now, let us load the basic\n", 57 | "scenario." 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": 6, 63 | "metadata": {}, 64 | "outputs": [ 65 | { 66 | "data": { 67 | "text/plain": [ 68 | "True" 69 | ] 70 | }, 71 | "execution_count": 6, 72 | "metadata": {}, 73 | "output_type": "execute_result" 74 | } 75 | ], 76 | "source": [ 77 | "game.load_config(\"basic.cfg\")" 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": {}, 83 | "source": [ 84 | " init() method initializes the game with the scenario" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": 7, 90 | "metadata": { 91 | "collapsed": true 92 | }, 93 | "outputs": [], 94 | "source": [ 95 | "game.init()" 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": {}, 101 | "source": [ 102 | " Now let us define the one hot encoded actions." 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": 8, 108 | "metadata": { 109 | "collapsed": true 110 | }, 111 | "outputs": [], 112 | "source": [ 113 | "shoot = [0, 0, 1]\n", 114 | "left = [1, 0, 0]\n", 115 | "right = [0, 1, 0]\n", 116 | "actions = [shoot, left, right]" 117 | ] 118 | }, 119 | { 120 | "cell_type": "markdown", 121 | "metadata": {}, 122 | "source": [ 123 | "Then We set the number of episodes we want" 124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": 9, 129 | "metadata": { 130 | "collapsed": true 131 | }, 132 | "outputs": [], 133 | "source": [ 134 | "no_of_episodes = 10" 135 | ] 136 | }, 137 | { 138 | "cell_type": "markdown", 139 | "metadata": {}, 140 | "source": [ 141 | "Start playing the game!!!" 
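The cell below steps through each episode with random actions and prints the per-step reward returned by make_action. If you also want each episode's return, the sketch below reuses the `game`, `actions` and `no_of_episodes` objects defined in this notebook and assumes the standard ViZDoom method `get_total_reward()` is available in your version:

```python
# A minimal sketch: same random agent, but report the episode return as well.
# Assumes DoomGame.get_total_reward() from the standard ViZDoom Python API.
for episode in range(no_of_episodes):
    game.new_episode()
    while not game.is_episode_finished():
        # act randomly; make_action returns the reward for this single step
        game.make_action(random.choice(actions))
    print("episode %d finished, total reward: %f" % (episode, game.get_total_reward()))
    time.sleep(2)
```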
142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": null, 147 | "metadata": { 148 | "collapsed": true 149 | }, 150 | "outputs": [], 151 | "source": [ 152 | "for i in range(no_of_episodes): \n", 153 | " \n", 154 | " # for each episode start the game\n", 155 | " game.new_episode()\n", 156 | " \n", 157 | " # loop until the episode is over\n", 158 | " while not game.is_episode_finished():\n", 159 | " \n", 160 | " # get the game state\n", 161 | " state = game.get_state()\n", 162 | " img = state.screen_buffer\n", 163 | " \n", 164 | " # get the game variables\n", 165 | " misc = state.game_variables\n", 166 | " \n", 167 | " # perform some action randomly and receuve reward\n", 168 | " reward = game.make_action(random.choice(actions))\n", 169 | " \n", 170 | " print(reward)\n", 171 | " \n", 172 | " # we will set some time before starting the next epiosde\n", 173 | " time.sleep(2)" 174 | ] 175 | } 176 | ], 177 | "metadata": { 178 | "kernelspec": { 179 | "display_name": "Python [conda env:anaconda]", 180 | "language": "python", 181 | "name": "conda-env-anaconda-py" 182 | }, 183 | "language_info": { 184 | "codemirror_mode": { 185 | "name": "ipython", 186 | "version": 2 187 | }, 188 | "file_extension": ".py", 189 | "mimetype": "text/x-python", 190 | "name": "python", 191 | "nbconvert_exporter": "python", 192 | "pygments_lexer": "ipython2", 193 | "version": "2.7.11" 194 | } 195 | }, 196 | "nbformat": 4, 197 | "nbformat_minor": 2 198 | } 199 | -------------------------------------------------------------------------------- /Chapter09/9.4 Basic Doom Game.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Basic Doom Game" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "\n", 15 | "Before diving in, Let us familiarize ourselves with a vizdoom environment by seeing the\n", 16 | "basic example.\n", 17 | "\n", 18 | "Let us load the necessary libraries" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 1, 24 | "metadata": { 25 | "collapsed": true 26 | }, 27 | "outputs": [], 28 | "source": [ 29 | "from vizdoom import *\n", 30 | "import random\n", 31 | "import time" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | "Create an instance to the DoomGame" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": 2, 44 | "metadata": { 45 | "collapsed": true 46 | }, 47 | "outputs": [], 48 | "source": [ 49 | "game = DoomGame()" 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": {}, 55 | "source": [ 56 | " As we know vizdoom provides a lot of doom scenarios, for now, let us load the basic\n", 57 | "scenario." 
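The basic.cfg file shipped with this chapter (reproduced later in this repository) is what defines the scenario we load in the next cell: it points at basic.wad / map01, applies a living reward of -1 per tic, exposes MOVE_LEFT, MOVE_RIGHT and ATTACK as the available buttons, reports AMMO2 as a game variable, and ends each episode after 300 tics. The same setup could also be written in code instead of a config file; the sketch below is only an illustrative alternative to `load_config`, and assumes the standard ViZDoom setter methods and enums (`set_doom_scenario_path`, `add_available_button`, and so on) exist in your ViZDoom version:

```python
# A sketch of configuring the basic scenario programmatically instead of via basic.cfg.
# Shown as an alternative only; this notebook uses game.load_config("basic.cfg") below.
from vizdoom import DoomGame, Button, GameVariable, Mode, ScreenResolution

game = DoomGame()
game.set_doom_scenario_path("basic.wad")
game.set_doom_map("map01")
game.set_living_reward(-1)                                 # -1 reward per time step
game.set_screen_resolution(ScreenResolution.RES_320X240)
game.add_available_button(Button.MOVE_LEFT)
game.add_available_button(Button.MOVE_RIGHT)
game.add_available_button(Button.ATTACK)
game.add_available_game_variable(GameVariable.AMMO2)
game.set_episode_start_time(14)                            # skip the gun-unholstering frames
game.set_episode_timeout(300)                              # end episodes after 300 tics
game.set_mode(Mode.PLAYER)
# game.init() can then be called exactly as in the cells that follow.
```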
58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": 6, 63 | "metadata": {}, 64 | "outputs": [ 65 | { 66 | "data": { 67 | "text/plain": [ 68 | "True" 69 | ] 70 | }, 71 | "execution_count": 6, 72 | "metadata": {}, 73 | "output_type": "execute_result" 74 | } 75 | ], 76 | "source": [ 77 | "game.load_config(\"basic.cfg\")" 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": {}, 83 | "source": [ 84 | " init() method initializes the game with the scenario" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": 7, 90 | "metadata": { 91 | "collapsed": true 92 | }, 93 | "outputs": [], 94 | "source": [ 95 | "game.init()" 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": {}, 101 | "source": [ 102 | " Now let us define the one hot encoded actions." 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": 8, 108 | "metadata": { 109 | "collapsed": true 110 | }, 111 | "outputs": [], 112 | "source": [ 113 | "shoot = [0, 0, 1]\n", 114 | "left = [1, 0, 0]\n", 115 | "right = [0, 1, 0]\n", 116 | "actions = [shoot, left, right]" 117 | ] 118 | }, 119 | { 120 | "cell_type": "markdown", 121 | "metadata": {}, 122 | "source": [ 123 | "Then We set the number of episodes we want" 124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": 9, 129 | "metadata": { 130 | "collapsed": true 131 | }, 132 | "outputs": [], 133 | "source": [ 134 | "no_of_episodes = 10" 135 | ] 136 | }, 137 | { 138 | "cell_type": "markdown", 139 | "metadata": {}, 140 | "source": [ 141 | "Start playing the game!!!" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": null, 147 | "metadata": { 148 | "collapsed": true 149 | }, 150 | "outputs": [], 151 | "source": [ 152 | "for i in range(no_of_episodes): \n", 153 | " \n", 154 | " # for each episode start the game\n", 155 | " game.new_episode()\n", 156 | " \n", 157 | " # loop until the episode is over\n", 158 | " while not game.is_episode_finished():\n", 159 | " \n", 160 | " # get the game state\n", 161 | " state = game.get_state()\n", 162 | " img = state.screen_buffer\n", 163 | " \n", 164 | " # get the game variables\n", 165 | " misc = state.game_variables\n", 166 | " \n", 167 | " # perform some action randomly and receuve reward\n", 168 | " reward = game.make_action(random.choice(actions))\n", 169 | " \n", 170 | " print(reward)\n", 171 | " \n", 172 | " # we will set some time before starting the next epiosde\n", 173 | " time.sleep(2)" 174 | ] 175 | } 176 | ], 177 | "metadata": { 178 | "kernelspec": { 179 | "display_name": "Python [conda env:anaconda]", 180 | "language": "python", 181 | "name": "conda-env-anaconda-py" 182 | }, 183 | "language_info": { 184 | "codemirror_mode": { 185 | "name": "ipython", 186 | "version": 2 187 | }, 188 | "file_extension": ".py", 189 | "mimetype": "text/x-python", 190 | "name": "python", 191 | "nbconvert_exporter": "python", 192 | "pygments_lexer": "ipython2", 193 | "version": "2.7.11" 194 | } 195 | }, 196 | "nbformat": 4, 197 | "nbformat_minor": 2 198 | } 199 | -------------------------------------------------------------------------------- /Chapter09/basic.cfg: -------------------------------------------------------------------------------- 1 | # Lines starting with # are treated as comments (or with whitespaces+#). 2 | # It doesn't matter if you use capital letters or not. 3 | # It doesn't matter if you use underscore or camel notation for keys, e.g. episode_timeout is the same as episodeTimeout. 
4 | 5 | doom_scenario_path = basic.wad 6 | doom_map = map01 7 | 8 | # Rewards 9 | living_reward = -1 10 | 11 | # Rendering options 12 | screen_resolution = RES_320X240 13 | screen_format = CRCGCB 14 | render_hud = True 15 | render_crosshair = false 16 | render_weapon = true 17 | render_decals = false 18 | render_particles = false 19 | window_visible = true 20 | 21 | # make episodes start after 20 tics (after unholstering the gun) 22 | episode_start_time = 14 23 | 24 | # make episodes finish after 300 actions (tics) 25 | episode_timeout = 300 26 | 27 | # Available buttons 28 | available_buttons = 29 | { 30 | MOVE_LEFT 31 | MOVE_RIGHT 32 | ATTACK 33 | } 34 | 35 | # Game variables that will be in the state 36 | available_game_variables = { AMMO2} 37 | 38 | mode = PLAYER 39 | doom_skill = 5 40 | -------------------------------------------------------------------------------- /Chapter09/basic.wad: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Hands-On-Reinforcement-Learning-with-Python/21c815b2608255694b72401919c4c08268bc48ec/Chapter09/basic.wad -------------------------------------------------------------------------------- /Chapter09/deathmatch.cfg: -------------------------------------------------------------------------------- 1 | # Lines starting with # are treated as comments (or with whitespaces+#). 2 | # It doesn't matter if you use capital letters or not. 3 | # It doesn't matter if you use underscore or camel notation for keys, e.g. episode_timeout is the same as episodeTimeout. 4 | 5 | doom_scenario_path = deathmatch.wad 6 | 7 | # Rendering options 8 | screen_resolution = RES_320X240 9 | screen_format = CRCGCB 10 | render_hud = true 11 | render_crosshair = false 12 | render_weapon = true 13 | render_decals = false 14 | render_particles = false 15 | window_visible = true 16 | 17 | # make episodes finish after 4200 actions (tics) 18 | episode_timeout = 4200 19 | 20 | # Available buttons 21 | available_buttons = 22 | { 23 | ATTACK 24 | SPEED 25 | STRAFE 26 | 27 | MOVE_RIGHT 28 | MOVE_LEFT 29 | MOVE_BACKWARD 30 | MOVE_FORWARD 31 | TURN_RIGHT 32 | TURN_LEFT 33 | 34 | SELECT_WEAPON1 35 | SELECT_WEAPON2 36 | SELECT_WEAPON3 37 | SELECT_WEAPON4 38 | SELECT_WEAPON5 39 | SELECT_WEAPON6 40 | 41 | SELECT_NEXT_WEAPON 42 | SELECT_PREV_WEAPON 43 | 44 | LOOK_UP_DOWN_DELTA 45 | TURN_LEFT_RIGHT_DELTA 46 | MOVE_LEFT_RIGHT_DELTA 47 | 48 | } 49 | 50 | # Game variables that will be in the state 51 | available_game_variables = 52 | { 53 | KILLCOUNT 54 | HEALTH 55 | ARMOR 56 | SELECTED_WEAPON 57 | SELECTED_WEAPON_AMMO 58 | } 59 | mode = PLAYER 60 | -------------------------------------------------------------------------------- /Chapter09/deathmatch.wad: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Hands-On-Reinforcement-Learning-with-Python/21c815b2608255694b72401919c4c08268bc48ec/Chapter09/deathmatch.wad -------------------------------------------------------------------------------- /Chapter10/logs/events.out.tfevents.1528713441.sudharsan: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Hands-On-Reinforcement-Learning-with-Python/21c815b2608255694b72401919c4c08268bc48ec/Chapter10/logs/events.out.tfevents.1528713441.sudharsan -------------------------------------------------------------------------------- 
/Chapter11/logs/events.out.tfevents.1528712442.sudharsan: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Hands-On-Reinforcement-Learning-with-Python/21c815b2608255694b72401919c4c08268bc48ec/Chapter11/logs/events.out.tfevents.1528712442.sudharsan -------------------------------------------------------------------------------- /Chapter12/.ipynb_checkpoints/12.1 Environment Wrapper Functions-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Environment Wrapper Functions" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": { 13 | "collapsed": true 14 | }, 15 | "source": [ 16 | "The credits for the code used in this chapter goes to Giacomo Spigler's github repo Throughout this chapter, code is explained each and every line. For a complete structured code check \n", 17 | "this github repo. " 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": {}, 23 | "source": [ 24 | "First we will import all the necessary libaries," 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": 1, 30 | "metadata": { 31 | "collapsed": true 32 | }, 33 | "outputs": [], 34 | "source": [ 35 | "import warnings\n", 36 | "warnings.filterwarnings('ignore')\n", 37 | "import numpy as np\n", 38 | "import tensorflow as tf\n", 39 | "import gym\n", 40 | "from gym.spaces import Box\n", 41 | "from scipy.misc import imresize\n", 42 | "import random\n", 43 | "import cv2\n", 44 | "import time\n", 45 | "import logging\n", 46 | "import os\n", 47 | "import sys" 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": {}, 53 | "source": [ 54 | " We define the Class EnvWrapper and define some of the environment wrapper functions" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": 2, 60 | "metadata": { 61 | "collapsed": true 62 | }, 63 | "outputs": [], 64 | "source": [ 65 | "class EnvWrapper:\n", 66 | "\n", 67 | "\n", 68 | " # First we define the __init__ method and initialize variables\n", 69 | "\n", 70 | " def __init__(self, env_name, debug=False):\n", 71 | " \n", 72 | "\n", 73 | " # environment name\n", 74 | " self.env_name = env_name\n", 75 | " \n", 76 | " # initialize the gym environment\n", 77 | " self.env = gym.make(env_name)\n", 78 | "\n", 79 | " # get the action space\n", 80 | " self.action_space = self.env.action_space\n", 81 | "\n", 82 | " # get the observation space\n", 83 | " self.observation_space = Box(low=0, high=255, shape=(84, 84, 4)) \n", 84 | "\n", 85 | " # initialize frame_num for storing the frame count\n", 86 | " self.frame_num = 0\n", 87 | "\n", 88 | " # For recording the game screen\n", 89 | " self.monitor = self.env.monitor\n", 90 | "\n", 91 | " # initialize frames\n", 92 | " self.frames = np.zeros((84, 84, 4), dtype=np.uint8)\n", 93 | "\n", 94 | " # initialize a boolean called debug when set true last few frames will be displayed\n", 95 | " self.debug = debug\n", 96 | "\n", 97 | " if self.debug:\n", 98 | " cv2.startWindowThread()\n", 99 | " cv2.namedWindow(\"Game\")\n", 100 | "\n", 101 | "\n", 102 | " # we define the function called step where we perform some action in the \n", 103 | " # environment, receive reward and move to the next state \n", 104 | " # step function will take the current state as input and returns the preprocessed frame as next state\n", 105 | "\n", 106 | " def step(self, a):\n", 107 | " 
ob, reward, done, xx = self.env.step(a)\n", 108 | " return self.process_frame(ob), reward, done, xx\n", 109 | "\n", 110 | "\n", 111 | " # We define the helper function called reset for resetting the environment\n", 112 | " # after resetting it will return the preprocessed game screen\n", 113 | " \n", 114 | " def reset(self):\n", 115 | " self.frame_num = 0\n", 116 | " return self.process_frame(self.env.reset())\n", 117 | "\n", 118 | "\n", 119 | " # next we define another helper function for rendering the environment\n", 120 | " def render(self):\n", 121 | " return self.env.render()\n", 122 | "\n", 123 | "\n", 124 | " # now we define the function called process_frame for preprocessing the frame\n", 125 | " \n", 126 | " def process_frame(self, frame):\n", 127 | "\n", 128 | " # convert the image to gray\n", 129 | " state_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)\n", 130 | " \n", 131 | " # change the size\n", 132 | " state_resized = cv2.resize(state_gray,(84,110))\n", 133 | " \n", 134 | " # resize\n", 135 | " gray_final = state_resized[16:100,:]\n", 136 | "\n", 137 | "\n", 138 | " if self.frame_num == 0:\n", 139 | " self.frames[:, :, 0] = gray_final\n", 140 | " self.frames[:, :, 1] = gray_final\n", 141 | " self.frames[:, :, 2] = gray_final\n", 142 | " self.frames[:, :, 3] = gray_final\n", 143 | "\n", 144 | " else:\n", 145 | " self.frames[:, :, 3] = self.frames[:, :, 2]\n", 146 | " self.frames[:, :, 2] = self.frames[:, :, 1]\n", 147 | " self.frames[:, :, 1] = self.frames[:, :, 0]\n", 148 | " self.frames[:, :, 0] = gray_final\n", 149 | "\n", 150 | " \n", 151 | " # increment the frame_num counter\n", 152 | "\n", 153 | " self.frame_num += 1\n", 154 | "\n", 155 | " if self.debug:\n", 156 | " cv2.imshow('Game', gray_final)\n", 157 | "\n", 158 | " return self.frames.copy()\n" 159 | ] 160 | } 161 | ], 162 | "metadata": { 163 | "kernelspec": { 164 | "display_name": "Python [conda env:universe]", 165 | "language": "python", 166 | "name": "conda-env-universe-py" 167 | }, 168 | "language_info": { 169 | "codemirror_mode": { 170 | "name": "ipython", 171 | "version": 3 172 | }, 173 | "file_extension": ".py", 174 | "mimetype": "text/x-python", 175 | "name": "python", 176 | "nbconvert_exporter": "python", 177 | "pygments_lexer": "ipython3", 178 | "version": "3.5.4" 179 | } 180 | }, 181 | "nbformat": 4, 182 | "nbformat_minor": 2 183 | } 184 | -------------------------------------------------------------------------------- /Chapter12/.ipynb_checkpoints/12.2 Dueling network-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "collapsed": true 7 | }, 8 | "source": [ 9 | "# Dueling network" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": { 15 | "collapsed": true 16 | }, 17 | "source": [ 18 | "First we will import all the necessary libaries," 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 1, 24 | "metadata": { 25 | "collapsed": true 26 | }, 27 | "outputs": [], 28 | "source": [ 29 | "import warnings\n", 30 | "warnings.filterwarnings('ignore')\n", 31 | "import numpy as np\n", 32 | "import tensorflow as tf\n", 33 | "import gym\n", 34 | "from gym.spaces import Box\n", 35 | "from scipy.misc import imresize\n", 36 | "import random\n", 37 | "import cv2\n", 38 | "import time\n", 39 | "import logging\n", 40 | "import os\n", 41 | "import sys" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": { 47 | "collapsed": true 48 | }, 49 | 
"source": [ 50 | "\n", 51 | "Now we build our dueling deep q network,\n", 52 | "we build three convolutional layers followed by two fully connected layers \n", 53 | "and the final fully connected layer will be split into two separate layers for\n", 54 | "value stream and advantage stream and we use aggregate layer which combines both value stream\n", 55 | "and advantage stream to compute the q value. The dimensions of these layers are given as follows,\n", 56 | "\n", 57 | "\n", 58 | "Layer 1: 32 8x8 filters with stride 4 + RELU
\n", 59 | "Layer 2: 64 4x4 filters with stride 2 + RELU
\n", 60 | "Layer 3: 64 3x3 filters with stride 1 + RELU
\n", 61 | "\n", 62 | "Layer 4a: 512 unit Fully-Connected layer + RELU
\n", 63 | "Layer 4b: 512 unit Fully-Connected layer + RELU
\n", 64 | "\n", 65 | "Layer 5a: 1 unit FC + RELU (State Value)
\n", 66 | "Layer 5b: actions FC + RELU (Advantage Value)
\n", 67 | "\n", 68 | "\n", 69 | "Layer6: Aggregate V(s)+A(s,a)" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": 2, 75 | "metadata": { 76 | "collapsed": true 77 | }, 78 | "outputs": [], 79 | "source": [ 80 | "class QNetworkDueling():\n", 81 | " \n", 82 | "\n", 83 | " # we define the init method for initializing all layers,\n", 84 | "\n", 85 | " def __init__(self, input_size, output_size, name):\n", 86 | " self.name = name\n", 87 | " self.input_size = input_size\n", 88 | " self.output_size = output_size\n", 89 | "\n", 90 | "\n", 91 | " with tf.variable_scope(self.name):\n", 92 | "\n", 93 | "\n", 94 | " # Three convolutional layers\n", 95 | " self.W_conv1 = self.weight_variable([8, 8, 4, 32]) \n", 96 | " self.B_conv1 = self.bias_variable([32])\n", 97 | " self.stride1 = 4\n", 98 | "\n", 99 | " self.W_conv2 = self.weight_variable([4, 4, 32, 64])\n", 100 | " self.B_conv2 = self.bias_variable([64])\n", 101 | " self.stride2 = 2\n", 102 | "\n", 103 | " self.W_conv3 = self.weight_variable([3, 3, 64, 64])\n", 104 | " self.B_conv3 = self.bias_variable([64])\n", 105 | " self.stride3 = 1\n", 106 | "\n", 107 | " # fully connected layer 1\n", 108 | " self.W_fc4a = self.weight_variable([7*7*64, 512])\n", 109 | " self.B_fc4a = self.bias_variable([512])\n", 110 | "\n", 111 | " # fully connected layer 2\n", 112 | " self.W_fc4b = self.weight_variable([7*7*64, 512])\n", 113 | " self.B_fc4b = self.bias_variable([512])\n", 114 | "\n", 115 | " # value stream\n", 116 | " self.W_fc5a = self.weight_variable([512, 1])\n", 117 | " self.B_fc5a = self.bias_variable([1])\n", 118 | "\n", 119 | " # advantage stream\n", 120 | " self.W_fc5b = self.weight_variable([512, self.output_size])\n", 121 | " self.B_fc5b = self.bias_variable([self.output_size])\n", 122 | "\n", 123 | "\n", 124 | " # print number of parameters in the network\n", 125 | " self.print_num_of_parameters()\n", 126 | "\n", 127 | "\n", 128 | "\n", 129 | " # Now we define the method called __call_ to perform the convolutional operation\n", 130 | "\n", 131 | " def __call__(self, input_tensor):\n", 132 | " if type(input_tensor) == list:\n", 133 | " input_tensor = tf.concat(1, input_tensor)\n", 134 | "\n", 135 | " with tf.variable_scope(self.name):\n", 136 | "\n", 137 | " # Perform convolutional operation on three layers\n", 138 | " self.h_conv1 = tf.nn.relu( tf.nn.conv2d(input_tensor, self.W_conv1, strides=[1, self.stride1, self.stride1, 1], padding='VALID') + self.B_conv1 )\n", 139 | " self.h_conv2 = tf.nn.relu( tf.nn.conv2d(self.h_conv1, self.W_conv2, strides=[1, self.stride2, self.stride2, 1], padding='VALID') + self.B_conv2 )\n", 140 | " self.h_conv3 = tf.nn.relu( tf.nn.conv2d(self.h_conv2, self.W_conv3, strides=[1, self.stride3, self.stride3, 1], padding='VALID') + self.B_conv3 )\n", 141 | "\n", 142 | " # Flatten the convolutional output\n", 143 | " self.h_conv3_flat = tf.reshape(self.h_conv3, [-1, 7*7*64])\n", 144 | "\n", 145 | "\n", 146 | " # Input the flattened convolutional layer output to the fully connected layer\n", 147 | " self.h_fc4a = tf.nn.relu(tf.matmul(self.h_conv3_flat, self.W_fc4a) + self.B_fc4a)\n", 148 | " self.h_fc4b = tf.nn.relu(tf.matmul(self.h_conv3_flat, self.W_fc4b) + self.B_fc4b)\n", 149 | "\n", 150 | "\n", 151 | " # Compute value stream and advantage stream\n", 152 | " self.h_fc5a_value = tf.identity(tf.matmul(self.h_fc4a, self.W_fc5a) + self.B_fc5a)\n", 153 | " self.h_fc5b_advantage = tf.identity(tf.matmul(self.h_fc4b, self.W_fc5b) + self.B_fc5b)\n", 154 | "\n", 155 | " # Combine the both value and advantage 
stream to get the Q value\n", 156 | " self.h_fc6 = self.h_fc5a_value + ( self.h_fc5b_advantage - tf.reduce_mean(self.h_fc5b_advantage, reduction_indices=[1,], keep_dims=True) )\n", 157 | "\n", 158 | "\n", 159 | " return self.h_fc6\n" 160 | ] 161 | } 162 | ], 163 | "metadata": { 164 | "kernelspec": { 165 | "display_name": "Python [conda env:universe]", 166 | "language": "python", 167 | "name": "conda-env-universe-py" 168 | }, 169 | "language_info": { 170 | "codemirror_mode": { 171 | "name": "ipython", 172 | "version": 3 173 | }, 174 | "file_extension": ".py", 175 | "mimetype": "text/x-python", 176 | "name": "python", 177 | "nbconvert_exporter": "python", 178 | "pygments_lexer": "ipython3", 179 | "version": "3.5.4" 180 | } 181 | }, 182 | "nbformat": 4, 183 | "nbformat_minor": 2 184 | } 185 | -------------------------------------------------------------------------------- /Chapter12/.ipynb_checkpoints/12.3 Replay Memory-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "collapsed": true 7 | }, 8 | "source": [ 9 | "# Replay Memory" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": { 15 | "collapsed": true 16 | }, 17 | "source": [ 18 | "\n", 19 | "\n", 20 | "Now we build the experience replay buffer which is used for storing all the agent's\n", 21 | "experience. We sample minibatch of experience from the replay buffer for training the\n", 22 | "network." 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": 1, 28 | "metadata": { 29 | "collapsed": true 30 | }, 31 | "outputs": [], 32 | "source": [ 33 | "class ReplayMemoryFast:\n", 34 | "\n", 35 | "\n", 36 | " # first we define init method and initialize buffer size\n", 37 | " def __init__(self, memory_size, minibatch_size):\n", 38 | "\n", 39 | " # max number of samples to store\n", 40 | " self.memory_size = memory_size\n", 41 | "\n", 42 | " # mini batch size\n", 43 | " self.minibatch_size = minibatch_size\n", 44 | "\n", 45 | " self.experience = [None]*self.memory_size \n", 46 | " self.current_index = 0\n", 47 | " self.size = 0\n", 48 | "\n", 49 | "\n", 50 | " # next we define the function called store for storing the experiences\n", 51 | " def store(self, observation, action, reward, newobservation, is_terminal):\n", 52 | "\n", 53 | " # store the experience as a tuple (current state, action, reward, next state, is it a terminal state)\n", 54 | " self.experience[self.current_index] = (observation, action, reward, newobservation, is_terminal)\n", 55 | " self.current_index += 1\n", 56 | "\n", 57 | " self.size = min(self.size+1, self.memory_size)\n", 58 | " \n", 59 | " # if the index is greater than memory then we flush the index by subtrating it with memory size\n", 60 | "\n", 61 | " if self.current_index >= self.memory_size:\n", 62 | " self.current_index -= self.memory_size\n", 63 | "\n", 64 | "\n", 65 | "\n", 66 | " # we define a function called sample for sampling the minibatch of experience\n", 67 | "\n", 68 | " def sample(self):\n", 69 | " if self.size < self.minibatch_size:\n", 70 | " return []\n", 71 | "\n", 72 | " # first we randomly sample some indices\n", 73 | " samples_index = np.floor(np.random.random((self.minibatch_size,))*self.size)\n", 74 | "\n", 75 | " # select the experience from the sampled index\n", 76 | " samples = [self.experience[int(i)] for i in samples_index]\n", 77 | "\n", 78 | " return samples" 79 | ] 80 | } 81 | ], 82 | "metadata": { 83 | "kernelspec": { 84 | 
"display_name": "Python [conda env:anaconda]", 85 | "language": "python", 86 | "name": "conda-env-anaconda-py" 87 | }, 88 | "language_info": { 89 | "codemirror_mode": { 90 | "name": "ipython", 91 | "version": 2 92 | }, 93 | "file_extension": ".py", 94 | "mimetype": "text/x-python", 95 | "name": "python", 96 | "nbconvert_exporter": "python", 97 | "pygments_lexer": "ipython2", 98 | "version": "2.7.11" 99 | } 100 | }, 101 | "nbformat": 4, 102 | "nbformat_minor": 2 103 | } 104 | -------------------------------------------------------------------------------- /Chapter12/.ipynb_checkpoints/12.5 Car Racing-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "collapsed": true 7 | }, 8 | "source": [ 9 | "# Car Racing" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": { 15 | "collapsed": true 16 | }, 17 | "source": [ 18 | "\n", 19 | "\n", 20 | "So far we have seen how to build dueling deep q network. Now we will see how to make use of dueling DQN for playing the car racing game.\n", 21 | "\n", 22 | "First, let us import our necessary libraries" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": 1, 28 | "metadata": { 29 | "collapsed": true 30 | }, 31 | "outputs": [], 32 | "source": [ 33 | "import gym\n", 34 | "import time\n", 35 | "import logging\n", 36 | "import os\n", 37 | "import sys\n", 38 | "import tensorflow as tf" 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "metadata": { 44 | "collapsed": true 45 | }, 46 | "source": [ 47 | "Initialize all necessary variables" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 2, 53 | "metadata": { 54 | "collapsed": true 55 | }, 56 | "outputs": [], 57 | "source": [ 58 | "ENV_NAME = 'Seaquest-v0'\n", 59 | "TOTAL_FRAMES = 20000000\n", 60 | "MAX_TRAINING_STEPS = 20*60*60/3 \n", 61 | "TESTING_GAMES = 30 \n", 62 | "MAX_TESTING_STEPS = 5*60*60/3 \n", 63 | "TRAIN_AFTER_FRAMES = 50000\n", 64 | "epoch_size = 50000 \n", 65 | "MAX_NOOP_START = 30\n", 66 | "LOG_DIR = 'logs'" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": null, 72 | "metadata": { 73 | "collapsed": true 74 | }, 75 | "outputs": [], 76 | "source": [ 77 | "logger = tf.train.SummaryWriter(LOG_DIR)\n", 78 | "\n", 79 | "# Intilaize tensorflow session\n", 80 | "session = tf.InteractiveSession()\n", 81 | "\n", 82 | "outdir = 'results'" 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "metadata": { 88 | "collapsed": true 89 | }, 90 | "source": [ 91 | " Build the agent" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": null, 97 | "metadata": { 98 | "collapsed": true 99 | }, 100 | "outputs": [], 101 | "source": [ 102 | "agent = DQN(state_size=env.observation_space.shape,\n", 103 | " action_size=env.action_space.n,\n", 104 | " session=session,\n", 105 | " summary_writer = logger,\n", 106 | " exploration_period = 1000000,\n", 107 | " minibatch_size = 32,\n", 108 | " discount_factor = 0.99,\n", 109 | " experience_replay_buffer = 1000000,\n", 110 | " target_qnet_update_frequency = 20000, \n", 111 | " initial_exploration_epsilon = 1.0,\n", 112 | " final_exploration_epsilon = 0.1,\n", 113 | " reward_clipping = 1.0,\n", 114 | " DoubleDQN = UseDoubleDQN)\n" 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "metadata": { 120 | "collapsed": true 121 | }, 122 | "source": [ 123 | "Store the recording" 124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | 
"execution_count": null, 129 | "metadata": { 130 | "collapsed": true 131 | }, 132 | "outputs": [], 133 | "source": [ 134 | "session.run(tf.initialize_all_variables())\n", 135 | "logger.add_graph(session.graph)\n", 136 | "saver = tf.train.Saver(tf.all_variables())" 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": null, 142 | "metadata": { 143 | "collapsed": true 144 | }, 145 | "outputs": [], 146 | "source": [ 147 | "env.monitor.start(outdir+'/'+ENV_NAME,force = True, video_callable=multiples_video_schedule)\n", 148 | "\n", 149 | "num_frames = 0\n", 150 | "num_games = 0\n", 151 | "current_game_frames = 0\n", 152 | "init_no_ops = np.random.randint(MAX_NOOP_START+1)\n", 153 | "last_time = time.time()\n", 154 | "last_frame_count = 0.0\n", 155 | "state = env.reset()\n" 156 | ] 157 | }, 158 | { 159 | "cell_type": "markdown", 160 | "metadata": {}, 161 | "source": [ 162 | " Now let us training" 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": null, 168 | "metadata": { 169 | "collapsed": true 170 | }, 171 | "outputs": [], 172 | "source": [ 173 | "while num_frames <= TOTAL_FRAMES+1:\n", 174 | " if test_mode:\n", 175 | " env.render()\n", 176 | "\n", 177 | " num_frames += 1\n", 178 | " current_game_frames += 1\n", 179 | "\n", 180 | " # Select the action given the curent state \n", 181 | " action = agent.action(state, training = True)\n", 182 | "\n", 183 | " # Perform the action on the environment, receiver reward and move to the next state \n", 184 | " next_state,reward,done,_ = env.step(action)\n", 185 | "\n", 186 | " # store this transistion information in the experience replay buffer\n", 187 | " if current_game_frames >= init_no_ops:\n", 188 | " agent.store(state,action,reward,next_state,done)\n", 189 | " state = next_state\n", 190 | "\n", 191 | " # Train the agent\n", 192 | " if num_frames>=TRAIN_AFTER_FRAMES:\n", 193 | " agent.train()\n", 194 | "\n", 195 | " if done or current_game_frames > MAX_TRAINING_STEPS:\n", 196 | " state = env.reset()\n", 197 | " current_game_frames = 0\n", 198 | " num_games += 1\n", 199 | " init_no_ops = np.random.randint(MAX_NOOP_START+1)\n", 200 | "\n", 201 | "\n", 202 | " # Save the network's parameters after every epoch\n", 203 | " if num_frames % epoch_size == 0 and num_frames > TRAIN_AFTER_FRAMES:\n", 204 | " saver.save(session, outdir+\"/\"+ENV_NAME+\"/model_\"+str(num_frames/1000)+\"k.ckpt\")\n", 205 | " print \"epoch: frames=\",num_frames,\" games=\",num_games\n", 206 | "\n", 207 | "\n", 208 | " # We test the performance for every two epochs\n", 209 | " if num_frames % (2*epoch_size) == 0 and num_frames > TRAIN_AFTER_FRAMES:\n", 210 | " total_reward = 0\n", 211 | " avg_steps = 0\n", 212 | " for i in xrange(TESTING_GAMES):\n", 213 | " state = env.reset()\n", 214 | " init_no_ops = np.random.randint(MAX_NOOP_START+1)\n", 215 | " frm = 0\n", 216 | " while frm < MAX_TESTING_STEPS:\n", 217 | " frm += 1\n", 218 | " env.render()\n", 219 | " action = agent.action(state, training = False) \n", 220 | "\n", 221 | " if current_game_frames < init_no_ops:\n", 222 | " action = 0\n", 223 | "\n", 224 | " state,reward,done,_ = env.step(action)\n", 225 | "\n", 226 | " total_reward += reward\n", 227 | " if done:\n", 228 | " break\n", 229 | "\n", 230 | " avg_steps += frm\n", 231 | " avg_reward = float(total_reward)/TESTING_GAMES\n", 232 | "\n", 233 | " str_ = session.run( tf.scalar_summary('test reward ('+str(epoch_size/1000)+'k)', avg_reward) )\n", 234 | " logger.add_summary(str_, num_frames) \n", 235 | " print ' --> Evaluation 
Average Reward: ',avg_reward,' avg steps: ',(avg_steps/TESTING_GAMES)\n", 236 | "\n", 237 | " state = env.reset()\n", 238 | "\n", 239 | "env.monitor.close()\n", 240 | "logger.close()\n" 241 | ] 242 | } 243 | ], 244 | "metadata": { 245 | "kernelspec": { 246 | "display_name": "Python [conda env:anaconda]", 247 | "language": "python", 248 | "name": "conda-env-anaconda-py" 249 | }, 250 | "language_info": { 251 | "codemirror_mode": { 252 | "name": "ipython", 253 | "version": 2 254 | }, 255 | "file_extension": ".py", 256 | "mimetype": "text/x-python", 257 | "name": "python", 258 | "nbconvert_exporter": "python", 259 | "pygments_lexer": "ipython2", 260 | "version": "2.7.11" 261 | } 262 | }, 263 | "nbformat": 4, 264 | "nbformat_minor": 2 265 | } 266 | -------------------------------------------------------------------------------- /Chapter12/12.1 Environment Wrapper Functions.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Environment Wrapper Functions" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": { 13 | "collapsed": true 14 | }, 15 | "source": [ 16 | "The credits for the code used in this chapter goes to Giacomo Spigler's github repo Throughout this chapter, code is explained each and every line. For a complete structured code check \n", 17 | "this github repo. " 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": {}, 23 | "source": [ 24 | "First we will import all the necessary libaries," 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": 1, 30 | "metadata": { 31 | "collapsed": true 32 | }, 33 | "outputs": [], 34 | "source": [ 35 | "import warnings\n", 36 | "warnings.filterwarnings('ignore')\n", 37 | "import numpy as np\n", 38 | "import tensorflow as tf\n", 39 | "import gym\n", 40 | "from gym.spaces import Box\n", 41 | "from scipy.misc import imresize\n", 42 | "import random\n", 43 | "import cv2\n", 44 | "import time\n", 45 | "import logging\n", 46 | "import os\n", 47 | "import sys" 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": {}, 53 | "source": [ 54 | " We define the Class EnvWrapper and define some of the environment wrapper functions" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": 2, 60 | "metadata": { 61 | "collapsed": true 62 | }, 63 | "outputs": [], 64 | "source": [ 65 | "class EnvWrapper:\n", 66 | "\n", 67 | "\n", 68 | " # First we define the __init__ method and initialize variables\n", 69 | "\n", 70 | " def __init__(self, env_name, debug=False):\n", 71 | " \n", 72 | "\n", 73 | " # environment name\n", 74 | " self.env_name = env_name\n", 75 | " \n", 76 | " # initialize the gym environment\n", 77 | " self.env = gym.make(env_name)\n", 78 | "\n", 79 | " # get the action space\n", 80 | " self.action_space = self.env.action_space\n", 81 | "\n", 82 | " # get the observation space\n", 83 | " self.observation_space = Box(low=0, high=255, shape=(84, 84, 4)) \n", 84 | "\n", 85 | " # initialize frame_num for storing the frame count\n", 86 | " self.frame_num = 0\n", 87 | "\n", 88 | " # For recording the game screen\n", 89 | " self.monitor = self.env.monitor\n", 90 | "\n", 91 | " # initialize frames\n", 92 | " self.frames = np.zeros((84, 84, 4), dtype=np.uint8)\n", 93 | "\n", 94 | " # initialize a boolean called debug when set true last few frames will be displayed\n", 95 | " self.debug = debug\n", 96 | "\n", 97 | " if self.debug:\n", 98 | " 
cv2.startWindowThread()\n", 99 | " cv2.namedWindow(\"Game\")\n", 100 | "\n", 101 | "\n", 102 | " # we define the function called step where we perform some action in the \n", 103 | " # environment, receive reward and move to the next state \n", 104 | " # step function will take the current state as input and returns the preprocessed frame as next state\n", 105 | "\n", 106 | " def step(self, a):\n", 107 | " ob, reward, done, xx = self.env.step(a)\n", 108 | " return self.process_frame(ob), reward, done, xx\n", 109 | "\n", 110 | "\n", 111 | " # We define the helper function called reset for resetting the environment\n", 112 | " # after resetting it will return the preprocessed game screen\n", 113 | " \n", 114 | " def reset(self):\n", 115 | " self.frame_num = 0\n", 116 | " return self.process_frame(self.env.reset())\n", 117 | "\n", 118 | "\n", 119 | " # next we define another helper function for rendering the environment\n", 120 | " def render(self):\n", 121 | " return self.env.render()\n", 122 | "\n", 123 | "\n", 124 | " # now we define the function called process_frame for preprocessing the frame\n", 125 | " \n", 126 | " def process_frame(self, frame):\n", 127 | "\n", 128 | " # convert the image to gray\n", 129 | " state_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)\n", 130 | " \n", 131 | " # change the size\n", 132 | " state_resized = cv2.resize(state_gray,(84,110))\n", 133 | " \n", 134 | " # resize\n", 135 | " gray_final = state_resized[16:100,:]\n", 136 | "\n", 137 | "\n", 138 | " if self.frame_num == 0:\n", 139 | " self.frames[:, :, 0] = gray_final\n", 140 | " self.frames[:, :, 1] = gray_final\n", 141 | " self.frames[:, :, 2] = gray_final\n", 142 | " self.frames[:, :, 3] = gray_final\n", 143 | "\n", 144 | " else:\n", 145 | " self.frames[:, :, 3] = self.frames[:, :, 2]\n", 146 | " self.frames[:, :, 2] = self.frames[:, :, 1]\n", 147 | " self.frames[:, :, 1] = self.frames[:, :, 0]\n", 148 | " self.frames[:, :, 0] = gray_final\n", 149 | "\n", 150 | " \n", 151 | " # increment the frame_num counter\n", 152 | "\n", 153 | " self.frame_num += 1\n", 154 | "\n", 155 | " if self.debug:\n", 156 | " cv2.imshow('Game', gray_final)\n", 157 | "\n", 158 | " return self.frames.copy()\n" 159 | ] 160 | } 161 | ], 162 | "metadata": { 163 | "kernelspec": { 164 | "display_name": "Python [conda env:universe]", 165 | "language": "python", 166 | "name": "conda-env-universe-py" 167 | }, 168 | "language_info": { 169 | "codemirror_mode": { 170 | "name": "ipython", 171 | "version": 3 172 | }, 173 | "file_extension": ".py", 174 | "mimetype": "text/x-python", 175 | "name": "python", 176 | "nbconvert_exporter": "python", 177 | "pygments_lexer": "ipython3", 178 | "version": "3.5.4" 179 | } 180 | }, 181 | "nbformat": 4, 182 | "nbformat_minor": 2 183 | } 184 | -------------------------------------------------------------------------------- /Chapter12/12.2 Dueling network.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "collapsed": true 7 | }, 8 | "source": [ 9 | "# Dueling network" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": { 15 | "collapsed": true 16 | }, 17 | "source": [ 18 | "First we will import all the necessary libaries," 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 1, 24 | "metadata": { 25 | "collapsed": true 26 | }, 27 | "outputs": [], 28 | "source": [ 29 | "import warnings\n", 30 | "warnings.filterwarnings('ignore')\n", 31 | "import numpy as 
np\n", 32 | "import tensorflow as tf\n", 33 | "import gym\n", 34 | "from gym.spaces import Box\n", 35 | "from scipy.misc import imresize\n", 36 | "import random\n", 37 | "import cv2\n", 38 | "import time\n", 39 | "import logging\n", 40 | "import os\n", 41 | "import sys" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": { 47 | "collapsed": true 48 | }, 49 | "source": [ 50 | "\n", 51 | "Now we build our dueling deep q network,\n", 52 | "we build three convolutional layers followed by two fully connected layers \n", 53 | "and the final fully connected layer will be split into two separate layers for\n", 54 | "value stream and advantage stream and we use aggregate layer which combines both value stream\n", 55 | "and advantage stream to compute the q value. The dimensions of these layers are given as follows,\n", 56 | "\n", 57 | "\n", 58 | "Layer 1: 32 8x8 filters with stride 4 + RELU
\n", 59 | "Layer 2: 64 4x4 filters with stride 2 + RELU
\n", 60 | "Layer 3: 64 3x3 filters with stride 1 + RELU
\n", 61 | "\n", 62 | "Layer 4a: 512 unit Fully-Connected layer + RELU
\n", 63 | "Layer 4b: 512 unit Fully-Connected layer + RELU
\n", 64 | "\n", 65 | "Layer 5a: 1 unit FC + RELU (State Value)
\n", 66 | "Layer 5b: actions FC + RELU (Advantage Value)
\n", 67 | "\n", 68 | "\n", 69 | "Layer6: Aggregate V(s)+A(s,a)" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": 2, 75 | "metadata": { 76 | "collapsed": true 77 | }, 78 | "outputs": [], 79 | "source": [ 80 | "class QNetworkDueling():\n", 81 | " \n", 82 | "\n", 83 | " # we define the init method for initializing all layers,\n", 84 | "\n", 85 | " def __init__(self, input_size, output_size, name):\n", 86 | " self.name = name\n", 87 | " self.input_size = input_size\n", 88 | " self.output_size = output_size\n", 89 | "\n", 90 | "\n", 91 | " with tf.variable_scope(self.name):\n", 92 | "\n", 93 | "\n", 94 | " # Three convolutional layers\n", 95 | " self.W_conv1 = self.weight_variable([8, 8, 4, 32]) \n", 96 | " self.B_conv1 = self.bias_variable([32])\n", 97 | " self.stride1 = 4\n", 98 | "\n", 99 | " self.W_conv2 = self.weight_variable([4, 4, 32, 64])\n", 100 | " self.B_conv2 = self.bias_variable([64])\n", 101 | " self.stride2 = 2\n", 102 | "\n", 103 | " self.W_conv3 = self.weight_variable([3, 3, 64, 64])\n", 104 | " self.B_conv3 = self.bias_variable([64])\n", 105 | " self.stride3 = 1\n", 106 | "\n", 107 | " # fully connected layer 1\n", 108 | " self.W_fc4a = self.weight_variable([7*7*64, 512])\n", 109 | " self.B_fc4a = self.bias_variable([512])\n", 110 | "\n", 111 | " # fully connected layer 2\n", 112 | " self.W_fc4b = self.weight_variable([7*7*64, 512])\n", 113 | " self.B_fc4b = self.bias_variable([512])\n", 114 | "\n", 115 | " # value stream\n", 116 | " self.W_fc5a = self.weight_variable([512, 1])\n", 117 | " self.B_fc5a = self.bias_variable([1])\n", 118 | "\n", 119 | " # advantage stream\n", 120 | " self.W_fc5b = self.weight_variable([512, self.output_size])\n", 121 | " self.B_fc5b = self.bias_variable([self.output_size])\n", 122 | "\n", 123 | "\n", 124 | " # print number of parameters in the network\n", 125 | " self.print_num_of_parameters()\n", 126 | "\n", 127 | "\n", 128 | "\n", 129 | " # Now we define the method called __call_ to perform the convolutional operation\n", 130 | "\n", 131 | " def __call__(self, input_tensor):\n", 132 | " if type(input_tensor) == list:\n", 133 | " input_tensor = tf.concat(1, input_tensor)\n", 134 | "\n", 135 | " with tf.variable_scope(self.name):\n", 136 | "\n", 137 | " # Perform convolutional operation on three layers\n", 138 | " self.h_conv1 = tf.nn.relu( tf.nn.conv2d(input_tensor, self.W_conv1, strides=[1, self.stride1, self.stride1, 1], padding='VALID') + self.B_conv1 )\n", 139 | " self.h_conv2 = tf.nn.relu( tf.nn.conv2d(self.h_conv1, self.W_conv2, strides=[1, self.stride2, self.stride2, 1], padding='VALID') + self.B_conv2 )\n", 140 | " self.h_conv3 = tf.nn.relu( tf.nn.conv2d(self.h_conv2, self.W_conv3, strides=[1, self.stride3, self.stride3, 1], padding='VALID') + self.B_conv3 )\n", 141 | "\n", 142 | " # Flatten the convolutional output\n", 143 | " self.h_conv3_flat = tf.reshape(self.h_conv3, [-1, 7*7*64])\n", 144 | "\n", 145 | "\n", 146 | " # Input the flattened convolutional layer output to the fully connected layer\n", 147 | " self.h_fc4a = tf.nn.relu(tf.matmul(self.h_conv3_flat, self.W_fc4a) + self.B_fc4a)\n", 148 | " self.h_fc4b = tf.nn.relu(tf.matmul(self.h_conv3_flat, self.W_fc4b) + self.B_fc4b)\n", 149 | "\n", 150 | "\n", 151 | " # Compute value stream and advantage stream\n", 152 | " self.h_fc5a_value = tf.identity(tf.matmul(self.h_fc4a, self.W_fc5a) + self.B_fc5a)\n", 153 | " self.h_fc5b_advantage = tf.identity(tf.matmul(self.h_fc4b, self.W_fc5b) + self.B_fc5b)\n", 154 | "\n", 155 | " # Combine the both value and advantage 
stream to get the Q value\n", 156 | " self.h_fc6 = self.h_fc5a_value + ( self.h_fc5b_advantage - tf.reduce_mean(self.h_fc5b_advantage, reduction_indices=[1,], keep_dims=True) )\n", 157 | "\n", 158 | "\n", 159 | " return self.h_fc6\n" 160 | ] 161 | } 162 | ], 163 | "metadata": { 164 | "kernelspec": { 165 | "display_name": "Python [conda env:universe]", 166 | "language": "python", 167 | "name": "conda-env-universe-py" 168 | }, 169 | "language_info": { 170 | "codemirror_mode": { 171 | "name": "ipython", 172 | "version": 3 173 | }, 174 | "file_extension": ".py", 175 | "mimetype": "text/x-python", 176 | "name": "python", 177 | "nbconvert_exporter": "python", 178 | "pygments_lexer": "ipython3", 179 | "version": "3.5.4" 180 | } 181 | }, 182 | "nbformat": 4, 183 | "nbformat_minor": 2 184 | } 185 | -------------------------------------------------------------------------------- /Chapter12/12.3 Replay Memory.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "collapsed": true 7 | }, 8 | "source": [ 9 | "# Replay Memory" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": { 15 | "collapsed": true 16 | }, 17 | "source": [ 18 | "\n", 19 | "\n", 20 | "Now we build the experience replay buffer which is used for storing all the agent's\n", 21 | "experience. We sample minibatch of experience from the replay buffer for training the\n", 22 | "network." 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": 1, 28 | "metadata": { 29 | "collapsed": true 30 | }, 31 | "outputs": [], 32 | "source": [ 33 | "class ReplayMemoryFast:\n", 34 | "\n", 35 | "\n", 36 | " # first we define init method and initialize buffer size\n", 37 | " def __init__(self, memory_size, minibatch_size):\n", 38 | "\n", 39 | " # max number of samples to store\n", 40 | " self.memory_size = memory_size\n", 41 | "\n", 42 | " # mini batch size\n", 43 | " self.minibatch_size = minibatch_size\n", 44 | "\n", 45 | " self.experience = [None]*self.memory_size \n", 46 | " self.current_index = 0\n", 47 | " self.size = 0\n", 48 | "\n", 49 | "\n", 50 | " # next we define the function called store for storing the experiences\n", 51 | " def store(self, observation, action, reward, newobservation, is_terminal):\n", 52 | "\n", 53 | " # store the experience as a tuple (current state, action, reward, next state, is it a terminal state)\n", 54 | " self.experience[self.current_index] = (observation, action, reward, newobservation, is_terminal)\n", 55 | " self.current_index += 1\n", 56 | "\n", 57 | " self.size = min(self.size+1, self.memory_size)\n", 58 | " \n", 59 | " # if the index is greater than memory then we flush the index by subtrating it with memory size\n", 60 | "\n", 61 | " if self.current_index >= self.memory_size:\n", 62 | " self.current_index -= self.memory_size\n", 63 | "\n", 64 | "\n", 65 | "\n", 66 | " # we define a function called sample for sampling the minibatch of experience\n", 67 | "\n", 68 | " def sample(self):\n", 69 | " if self.size < self.minibatch_size:\n", 70 | " return []\n", 71 | "\n", 72 | " # first we randomly sample some indices\n", 73 | " samples_index = np.floor(np.random.random((self.minibatch_size,))*self.size)\n", 74 | "\n", 75 | " # select the experience from the sampled index\n", 76 | " samples = [self.experience[int(i)] for i in samples_index]\n", 77 | "\n", 78 | " return samples" 79 | ] 80 | } 81 | ], 82 | "metadata": { 83 | "kernelspec": { 84 | "display_name": "Python [conda 
env:anaconda]", 85 | "language": "python", 86 | "name": "conda-env-anaconda-py" 87 | }, 88 | "language_info": { 89 | "codemirror_mode": { 90 | "name": "ipython", 91 | "version": 2 92 | }, 93 | "file_extension": ".py", 94 | "mimetype": "text/x-python", 95 | "name": "python", 96 | "nbconvert_exporter": "python", 97 | "pygments_lexer": "ipython2", 98 | "version": "2.7.11" 99 | } 100 | }, 101 | "nbformat": 4, 102 | "nbformat_minor": 2 103 | } 104 | -------------------------------------------------------------------------------- /Chapter12/12.5 Car Racing.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "collapsed": true 7 | }, 8 | "source": [ 9 | "# Car Racing" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": { 15 | "collapsed": true 16 | }, 17 | "source": [ 18 | "\n", 19 | "\n", 20 | "So far we have seen how to build dueling deep q network. Now we will see how to make use of dueling DQN for playing the car racing game.\n", 21 | "\n", 22 | "First, let us import our necessary libraries" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": 1, 28 | "metadata": { 29 | "collapsed": true 30 | }, 31 | "outputs": [], 32 | "source": [ 33 | "import gym\n", 34 | "import time\n", 35 | "import logging\n", 36 | "import os\n", 37 | "import sys\n", 38 | "import tensorflow as tf" 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "metadata": { 44 | "collapsed": true 45 | }, 46 | "source": [ 47 | "Initialize all necessary variables" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 2, 53 | "metadata": { 54 | "collapsed": true 55 | }, 56 | "outputs": [], 57 | "source": [ 58 | "ENV_NAME = 'Seaquest-v0'\n", 59 | "TOTAL_FRAMES = 20000000\n", 60 | "MAX_TRAINING_STEPS = 20*60*60/3 \n", 61 | "TESTING_GAMES = 30 \n", 62 | "MAX_TESTING_STEPS = 5*60*60/3 \n", 63 | "TRAIN_AFTER_FRAMES = 50000\n", 64 | "epoch_size = 50000 \n", 65 | "MAX_NOOP_START = 30\n", 66 | "LOG_DIR = 'logs'" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": null, 72 | "metadata": { 73 | "collapsed": true 74 | }, 75 | "outputs": [], 76 | "source": [ 77 | "logger = tf.train.SummaryWriter(LOG_DIR)\n", 78 | "\n", 79 | "# Intilaize tensorflow session\n", 80 | "session = tf.InteractiveSession()\n", 81 | "\n", 82 | "outdir = 'results'" 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "metadata": { 88 | "collapsed": true 89 | }, 90 | "source": [ 91 | " Build the agent" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": null, 97 | "metadata": { 98 | "collapsed": true 99 | }, 100 | "outputs": [], 101 | "source": [ 102 | "agent = DQN(state_size=env.observation_space.shape,\n", 103 | " action_size=env.action_space.n,\n", 104 | " session=session,\n", 105 | " summary_writer = logger,\n", 106 | " exploration_period = 1000000,\n", 107 | " minibatch_size = 32,\n", 108 | " discount_factor = 0.99,\n", 109 | " experience_replay_buffer = 1000000,\n", 110 | " target_qnet_update_frequency = 20000, \n", 111 | " initial_exploration_epsilon = 1.0,\n", 112 | " final_exploration_epsilon = 0.1,\n", 113 | " reward_clipping = 1.0,\n", 114 | " DoubleDQN = UseDoubleDQN)\n" 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "metadata": { 120 | "collapsed": true 121 | }, 122 | "source": [ 123 | "Store the recording" 124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": null, 129 | "metadata": { 130 | "collapsed": true 131 
| }, 132 | "outputs": [], 133 | "source": [ 134 | "session.run(tf.initialize_all_variables())\n", 135 | "logger.add_graph(session.graph)\n", 136 | "saver = tf.train.Saver(tf.all_variables())" 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": null, 142 | "metadata": { 143 | "collapsed": true 144 | }, 145 | "outputs": [], 146 | "source": [ 147 | "env.monitor.start(outdir+'/'+ENV_NAME,force = True, video_callable=multiples_video_schedule)\n", 148 | "\n", 149 | "num_frames = 0\n", 150 | "num_games = 0\n", 151 | "current_game_frames = 0\n", 152 | "init_no_ops = np.random.randint(MAX_NOOP_START+1)\n", 153 | "last_time = time.time()\n", 154 | "last_frame_count = 0.0\n", 155 | "state = env.reset()\n" 156 | ] 157 | }, 158 | { 159 | "cell_type": "markdown", 160 | "metadata": {}, 161 | "source": [ 162 | " Now let us training" 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": null, 168 | "metadata": { 169 | "collapsed": true 170 | }, 171 | "outputs": [], 172 | "source": [ 173 | "while num_frames <= TOTAL_FRAMES+1:\n", 174 | " if test_mode:\n", 175 | " env.render()\n", 176 | "\n", 177 | " num_frames += 1\n", 178 | " current_game_frames += 1\n", 179 | "\n", 180 | " # Select the action given the curent state \n", 181 | " action = agent.action(state, training = True)\n", 182 | "\n", 183 | " # Perform the action on the environment, receiver reward and move to the next state \n", 184 | " next_state,reward,done,_ = env.step(action)\n", 185 | "\n", 186 | " # store this transistion information in the experience replay buffer\n", 187 | " if current_game_frames >= init_no_ops:\n", 188 | " agent.store(state,action,reward,next_state,done)\n", 189 | " state = next_state\n", 190 | "\n", 191 | " # Train the agent\n", 192 | " if num_frames>=TRAIN_AFTER_FRAMES:\n", 193 | " agent.train()\n", 194 | "\n", 195 | " if done or current_game_frames > MAX_TRAINING_STEPS:\n", 196 | " state = env.reset()\n", 197 | " current_game_frames = 0\n", 198 | " num_games += 1\n", 199 | " init_no_ops = np.random.randint(MAX_NOOP_START+1)\n", 200 | "\n", 201 | "\n", 202 | " # Save the network's parameters after every epoch\n", 203 | " if num_frames % epoch_size == 0 and num_frames > TRAIN_AFTER_FRAMES:\n", 204 | " saver.save(session, outdir+\"/\"+ENV_NAME+\"/model_\"+str(num_frames/1000)+\"k.ckpt\")\n", 205 | " print \"epoch: frames=\",num_frames,\" games=\",num_games\n", 206 | "\n", 207 | "\n", 208 | " # We test the performance for every two epochs\n", 209 | " if num_frames % (2*epoch_size) == 0 and num_frames > TRAIN_AFTER_FRAMES:\n", 210 | " total_reward = 0\n", 211 | " avg_steps = 0\n", 212 | " for i in xrange(TESTING_GAMES):\n", 213 | " state = env.reset()\n", 214 | " init_no_ops = np.random.randint(MAX_NOOP_START+1)\n", 215 | " frm = 0\n", 216 | " while frm < MAX_TESTING_STEPS:\n", 217 | " frm += 1\n", 218 | " env.render()\n", 219 | " action = agent.action(state, training = False) \n", 220 | "\n", 221 | " if current_game_frames < init_no_ops:\n", 222 | " action = 0\n", 223 | "\n", 224 | " state,reward,done,_ = env.step(action)\n", 225 | "\n", 226 | " total_reward += reward\n", 227 | " if done:\n", 228 | " break\n", 229 | "\n", 230 | " avg_steps += frm\n", 231 | " avg_reward = float(total_reward)/TESTING_GAMES\n", 232 | "\n", 233 | " str_ = session.run( tf.scalar_summary('test reward ('+str(epoch_size/1000)+'k)', avg_reward) )\n", 234 | " logger.add_summary(str_, num_frames) \n", 235 | " print ' --> Evaluation Average Reward: ',avg_reward,' avg steps: ',(avg_steps/TESTING_GAMES)\n", 
236 | "\n", 237 | " state = env.reset()\n", 238 | "\n", 239 | "env.monitor.close()\n", 240 | "logger.close()\n" 241 | ] 242 | } 243 | ], 244 | "metadata": { 245 | "kernelspec": { 246 | "display_name": "Python [conda env:anaconda]", 247 | "language": "python", 248 | "name": "conda-env-anaconda-py" 249 | }, 250 | "language_info": { 251 | "codemirror_mode": { 252 | "name": "ipython", 253 | "version": 2 254 | }, 255 | "file_extension": ".py", 256 | "mimetype": "text/x-python", 257 | "name": "python", 258 | "nbconvert_exporter": "python", 259 | "pygments_lexer": "ipython2", 260 | "version": "2.7.11" 261 | } 262 | }, 263 | "nbformat": 4, 264 | "nbformat_minor": 2 265 | } 266 | -------------------------------------------------------------------------------- /Chapter13/.ipynb_checkpoints/13.3 Deep Q Learning From Demonstrations-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Deep Q Learning From Demonstrations (DQfD)" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "We have seen a lot about DQN. We started off with vanilla DQN and then we saw various\n", 15 | "improvements such as double DQN, dueling network architecture, prioritized experience\n", 16 | "replay. We have also learned to build DQN to play atari games. We stored the agent's\n", 17 | "interactions with the environment in the experience buffer and made the agent to learn\n", 18 | "from those experiences. But the problem we encountered is that it took us lot of time for training. For learning in simulated environments it is fine but when we make our agent to\n", 19 | "learn in the real world environment it will cause a lot of problems. So, to overcome this\n", 20 | "researchers from Google's DeepMind introduced an improvement over DQN called Deep Q\n", 21 | "learning from demonstrations (DQfD).\n", 22 | "\n", 23 | "If we already have some demonstrations data then we can directly add those\n", 24 | "demonstrations to the experience replay buffer. For an example, consider an agent learning\n", 25 | "to play atari games, if we have already some demonstration data which tells our agent\n", 26 | "which state is better, which action gives good reward in a state then the agent can directly\n", 27 | "makes use of this data for learning. Even a small amount of demonstrations will increase\n", 28 | "the agent's performance and also minimizes the training time. Since these demonstrations\n", 29 | "data will be added directly to the prioritized experience replay buffer, the amount of data\n", 30 | "the agent can use from demonstration data and amount of data the agent can use from its\n", 31 | "own interaction for learning will be controlled by prioritized experience replay buffer as the\n", 32 | "experience will be prioritized.\n", 33 | "\n", 34 | "
\n", 35 | "\n", 36 | "Loss functions in DQfD will be the sum of various losses. In order to prevent our agent\n", 37 | "from overfitting to the demonstration data, we compute l2 regularization loss over the\n", 38 | "network weights. We compute TD loss as usual and also supervised loss to see how our\n", 39 | "agent is learning from the demonstration data. Authors of this paper experimented DQfD" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | " with various environments and the performance of DQfD is better and faster than\n", 47 | "prioritized dueling double deep q networks.\n", 48 | "You can check this video to see how DQfD learned to play private eye game https://\n", 50 | "youtu.be/4IFZvqBHsFY" 51 | ] 52 | } 53 | ], 54 | "metadata": { 55 | "kernelspec": { 56 | "display_name": "Python [conda env:anaconda]", 57 | "language": "python", 58 | "name": "conda-env-anaconda-py" 59 | }, 60 | "language_info": { 61 | "codemirror_mode": { 62 | "name": "ipython", 63 | "version": 2 64 | }, 65 | "file_extension": ".py", 66 | "mimetype": "text/x-python", 67 | "name": "python", 68 | "nbconvert_exporter": "python", 69 | "pygments_lexer": "ipython2", 70 | "version": "2.7.11" 71 | } 72 | }, 73 | "nbformat": 4, 74 | "nbformat_minor": 2 75 | } 76 | -------------------------------------------------------------------------------- /Chapter13/.ipynb_checkpoints/13.4 Hindsight Experience Replay-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Hindsight Experience Replay" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "We have seen how experience replay is used in DQN to avoid the correlated experience.\n", 15 | "Also, we learned about prioritized experience replay as an improvement to the vanilla\n", 16 | "experience replay by prioritizing each experience with TD error. Now we will see a new\n", 17 | "technique called hindsight experience replay (HER) proposed by OpenAI researchers for\n", 18 | "dealing with sparse rewards.\n", 19 | "\n", 20 | "Do you remember how you learned to ride a bike? At your\n", 21 | "first try, you wouldn't have balanced the bike properly. You would have failed several\n", 22 | "times to balance correctly. But all the failures doesn't mean you haven't learned\n", 23 | "anything. The failures would have taught you how not to balance a bike. Even though you have\n", 24 | "not learned to ride a bike (goal), you have learned a different goal i.e you have learned how\n", 25 | "not to balance a bike. This is how we humans learn right? we learn from failures and this is\n", 26 | "the idea behind hindsight experience replay.\n", 27 | "\n", 28 | "\n", 29 | "Let us consider the same example given in the paper. Look at the FetchSlide environment\n", 30 | "as shown in the below figure, the goal in this environment is to move the robotic arm and\n", 31 | "slide a puck across the table hit the target (small red circle).\n", 32 | "\n", 33 | "\n", 34 | "Image source: https://blog.openai.com/ingredients-for-robotics-research/" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": {}, 40 | "source": [ 41 | "![title](images/B09792_13_01.png)" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": {}, 47 | "source": [ 48 | "In few first trails, the agent could not definitely achieve the goal. 
So the agent will only\n", 49 | "receive -1 as the reward, which tells the agent that it was doing something wrong and had not attained the goal. " 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": {}, 55 | "source": [ 56 | "![title](images/B09792_13_02.png)" 57 | ] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "metadata": {}, 62 | "source": [ 63 | " But this doesn't mean that the agent has not learned anything. The agent has learned a different\n", 64 | "goal, i.e. it has learned to move closer to our actual goal. So instead of considering it as a\n", 65 | "failure, we consider it as a different goal.\n", 66 | "\n", 67 | "So if we repeat this process over several\n", 68 | "iterations, the agent will learn to achieve our actual goal. HER can be applied to any off-policy\n", 69 | "algorithm. The performance of HER was compared for DDPG without HER and\n", 70 | "DDPG with HER, and the results show that DDPG with HER converges more quickly than DDPG without HER.\n", 71 | "\n", 72 | "
\n", 73 | "You can see the performance of HER in this video https://youtu.be/Dz_HuzgMxzo." 74 | ] 75 | } 76 | ], 77 | "metadata": { 78 | "kernelspec": { 79 | "display_name": "Python [conda env:anaconda]", 80 | "language": "python", 81 | "name": "conda-env-anaconda-py" 82 | }, 83 | "language_info": { 84 | "codemirror_mode": { 85 | "name": "ipython", 86 | "version": 2 87 | }, 88 | "file_extension": ".py", 89 | "mimetype": "text/x-python", 90 | "name": "python", 91 | "nbconvert_exporter": "python", 92 | "pygments_lexer": "ipython2", 93 | "version": "2.7.11" 94 | } 95 | }, 96 | "nbformat": 4, 97 | "nbformat_minor": 2 98 | } 99 | -------------------------------------------------------------------------------- /Chapter13/13.3 Deep Q Learning From Demonstrations.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Deep Q Learning From Demonstrations (DQfD)" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "We have seen a lot about DQN. We started off with vanilla DQN and then we saw various\n", 15 | "improvements such as double DQN, dueling network architecture, prioritized experience\n", 16 | "replay. We have also learned to build DQN to play atari games. We stored the agent's\n", 17 | "interactions with the environment in the experience buffer and made the agent to learn\n", 18 | "from those experiences. But the problem we encountered is that it took us lot of time for training. For learning in simulated environments it is fine but when we make our agent to\n", 19 | "learn in the real world environment it will cause a lot of problems. So, to overcome this\n", 20 | "researchers from Google's DeepMind introduced an improvement over DQN called Deep Q\n", 21 | "learning from demonstrations (DQfD).\n", 22 | "\n", 23 | "If we already have some demonstrations data then we can directly add those\n", 24 | "demonstrations to the experience replay buffer. For an example, consider an agent learning\n", 25 | "to play atari games, if we have already some demonstration data which tells our agent\n", 26 | "which state is better, which action gives good reward in a state then the agent can directly\n", 27 | "makes use of this data for learning. Even a small amount of demonstrations will increase\n", 28 | "the agent's performance and also minimizes the training time. Since these demonstrations\n", 29 | "data will be added directly to the prioritized experience replay buffer, the amount of data\n", 30 | "the agent can use from demonstration data and amount of data the agent can use from its\n", 31 | "own interaction for learning will be controlled by prioritized experience replay buffer as the\n", 32 | "experience will be prioritized.\n", 33 | "\n", 34 | "
\n", 35 | "\n", 36 | "Loss functions in DQfD will be the sum of various losses. In order to prevent our agent\n", 37 | "from overfitting to the demonstration data, we compute l2 regularization loss over the\n", 38 | "network weights. We compute TD loss as usual and also supervised loss to see how our\n", 39 | "agent is learning from the demonstration data. Authors of this paper experimented DQfD" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | " with various environments and the performance of DQfD is better and faster than\n", 47 | "prioritized dueling double deep q networks.\n", 48 | "You can check this video to see how DQfD learned to play private eye game https://\n", 50 | "youtu.be/4IFZvqBHsFY" 51 | ] 52 | } 53 | ], 54 | "metadata": { 55 | "kernelspec": { 56 | "display_name": "Python [conda env:anaconda]", 57 | "language": "python", 58 | "name": "conda-env-anaconda-py" 59 | }, 60 | "language_info": { 61 | "codemirror_mode": { 62 | "name": "ipython", 63 | "version": 2 64 | }, 65 | "file_extension": ".py", 66 | "mimetype": "text/x-python", 67 | "name": "python", 68 | "nbconvert_exporter": "python", 69 | "pygments_lexer": "ipython2", 70 | "version": "2.7.11" 71 | } 72 | }, 73 | "nbformat": 4, 74 | "nbformat_minor": 2 75 | } 76 | -------------------------------------------------------------------------------- /Chapter13/13.4 Hindsight Experience Replay.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Hindsight Experience Replay" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "We have seen how experience replay is used in DQN to avoid the correlated experience.\n", 15 | "Also, we learned about prioritized experience replay as an improvement to the vanilla\n", 16 | "experience replay by prioritizing each experience with TD error. Now we will see a new\n", 17 | "technique called hindsight experience replay (HER) proposed by OpenAI researchers for\n", 18 | "dealing with sparse rewards.\n", 19 | "\n", 20 | "Do you remember how you learned to ride a bike? At your\n", 21 | "first try, you wouldn't have balanced the bike properly. You would have failed several\n", 22 | "times to balance correctly. But all the failures doesn't mean you haven't learned\n", 23 | "anything. The failures would have taught you how not to balance a bike. Even though you have\n", 24 | "not learned to ride a bike (goal), you have learned a different goal i.e you have learned how\n", 25 | "not to balance a bike. This is how we humans learn right? we learn from failures and this is\n", 26 | "the idea behind hindsight experience replay.\n", 27 | "\n", 28 | "\n", 29 | "Let us consider the same example given in the paper. Look at the FetchSlide environment\n", 30 | "as shown in the below figure, the goal in this environment is to move the robotic arm and\n", 31 | "slide a puck across the table hit the target (small red circle).\n", 32 | "\n", 33 | "\n", 34 | "Image source: https://blog.openai.com/ingredients-for-robotics-research/" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": {}, 40 | "source": [ 41 | "![title](images/B09792_13_01.png)" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": {}, 47 | "source": [ 48 | "In few first trails, the agent could not definitely achieve the goal. 
So the agent will only\n", 49 | "receive -1 as the reward, which tells the agent that it was doing something wrong and had not attained the goal. " 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": {}, 55 | "source": [ 56 | "![title](images/B09792_13_02.png)" 57 | ] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "metadata": {}, 62 | "source": [ 63 | " But this doesn't mean that the agent has not learned anything. The agent has learned a different\n", 64 | "goal, i.e. it has learned to move closer to our actual goal. So instead of considering it as a\n", 65 | "failure, we consider it as a different goal.\n", 66 | "\n", 67 | "So if we repeat this process over several\n", 68 | "iterations, the agent will learn to achieve our actual goal. HER can be applied to any off-policy\n", 69 | "algorithm. The performance of HER was compared for DDPG without HER and\n", 70 | "DDPG with HER, and the results show that DDPG with HER converges more quickly than DDPG without HER.\n", 71 | "\n", 72 | "
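As a rough sketch of this relabelling idea (this is not the code used by OpenAI or in this repository; the transition format, the `reward_fn` helper, and the final-state relabelling strategy below are assumptions made only for illustration), hindsight relabelling of one episode could look like this:

```python
import numpy as np

def her_relabel(episode, reward_fn):
    """Relabel an episode with the goal that was actually achieved.

    episode   : list of (state, action, next_state, goal) tuples for one episode
    reward_fn : reward_fn(next_state, goal) -> reward, e.g. a sparse reward that
                is 0 when the goal is reached and -1 otherwise
    returns   : list of (state, action, reward, next_state, goal) transitions
                containing both the original and the hindsight experience
    """
    transitions = []
    achieved_goal = episode[-1][2]  # the state actually reached at the end of the episode

    for state, action, next_state, goal in episode:
        # original experience, evaluated against the intended goal
        transitions.append((state, action, reward_fn(next_state, goal), next_state, goal))
        # hindsight experience: pretend the achieved state was the goal all along
        transitions.append((state, action, reward_fn(next_state, achieved_goal), next_state, achieved_goal))

    return transitions

def sparse_reward(state, goal, tol=0.05):
    # toy sparse reward: 0 when the achieved state is close enough to the goal, -1 otherwise
    return 0.0 if np.linalg.norm(np.asarray(state) - np.asarray(goal)) < tol else -1.0
```

Because the relabelled transitions always contain some successful outcomes, an off-policy learner such as DDPG can receive a useful learning signal even when the original sparse reward is almost always -1.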
\n", 73 | "You can see the performance of HER in this video https://youtu.be/Dz_HuzgMxzo." 74 | ] 75 | } 76 | ], 77 | "metadata": { 78 | "kernelspec": { 79 | "display_name": "Python [conda env:anaconda]", 80 | "language": "python", 81 | "name": "conda-env-anaconda-py" 82 | }, 83 | "language_info": { 84 | "codemirror_mode": { 85 | "name": "ipython", 86 | "version": 2 87 | }, 88 | "file_extension": ".py", 89 | "mimetype": "text/x-python", 90 | "name": "python", 91 | "nbconvert_exporter": "python", 92 | "pygments_lexer": "ipython2", 93 | "version": "2.7.11" 94 | } 95 | }, 96 | "nbformat": 4, 97 | "nbformat_minor": 2 98 | } 99 | -------------------------------------------------------------------------------- /Chapter13/images/B09792_13_01.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Hands-On-Reinforcement-Learning-with-Python/21c815b2608255694b72401919c4c08268bc48ec/Chapter13/images/B09792_13_01.png -------------------------------------------------------------------------------- /Chapter13/images/B09792_13_02.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Hands-On-Reinforcement-Learning-with-Python/21c815b2608255694b72401919c4c08268bc48ec/Chapter13/images/B09792_13_02.png -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 Packt 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | # Hands-On-Reinforcement-Learning-with-Python 5 | 6 | Hands-On Reinforcement Learning with Python 7 | 8 | This is the code repository for [Hands-On-Reinforcement-Learning-with-Python](https://www.packtpub.com/big-data-and-business-intelligence/hands-reinforcement-learning-python?utm_source=github&utm_medium=repository&utm_campaign=9781788836524), published by Packt. 9 | 10 | **Master reinforcement and deep reinforcement learning using OpenAI Gym and TensorFlow** 11 | 12 | ## What is this book about? 13 | Reinforcement Learning (RL) is the trending and most promising branch of artificial intelligence. 
Hands-On Reinforcement Learning with Python will help you master not only the basic reinforcement learning algorithms but also the advanced deep reinforcement learning algorithms. 14 | 15 | This book covers the following exciting features: 16 | * Understand the basics of reinforcement learning methods, algorithms, and elements 17 | * Train an agent to walk using OpenAI Gym and TensorFlow 18 | * Understand the Markov Decision Process, Bellman’s optimality, and TD learning 19 | * Solve multi-armed-bandit problems using various algorithms 20 | * Master deep learning algorithms, such as RNN, LSTM, and CNN with applications 21 | 22 | If you feel this book is for you, get your [copy](https://www.amazon.com/dp/1788836529) today! 23 | 24 | https://www.packtpub.com/ 26 | 27 | 28 | ## Instructions and Navigations 29 | All of the code is organized into folders. For example, Chapter02. 30 | 31 | The code will look like the following: 32 | ``` 33 | policy_iteration(): 34 | Initialize random policy 35 | for i in no_of_iterations: 36 | Q_value = value_function(random_policy) 37 | new_policy = Maximum state action pair from Q value 38 | ``` 39 | 40 | **Following is what you need for this book:** 41 | If you’re a machine learning developer or deep learning enthusiast interested in artificial intelligence and want to learn about reinforcement learning from scratch, this book is for you. Some knowledge of linear algebra, calculus, and the Python programming language will help you understand the concepts covered in this book. 42 | 43 | With the following software and hardware list, you can run all code files present in the book (Chapters 1-15). 44 | 45 | ### Software and Hardware List 46 | 47 | | Chapter | Software required | OS required | 48 | | -------- | ------------------------------------| -----------------------------------| 49 | | 1-12 |anaconda |Ubuntu or mac | 50 | | | chrome | Ubuntu or mac | 51 | 52 | 53 | We also provide a PDF file that has color images of the screenshots/diagrams used in this book. [Click here to download it](http://www.packtpub.com/sites/default/files/downloads/HandsOnReinforcementLearningwithPython_ColorImages.pdf). 54 | 55 | ### Related products 56 | * Artificial Intelligence with Python [[Packt]](https://www.packtpub.com/big-data-and-business-intelligence/artificial-intelligence-python?utm_source=github&utm_medium=repository&utm_campaign=9781788293778) [[Amazon]](https://www.amazon.com/dp/178646439X) 57 | 58 | * Statistics for Machine Learning [[Packt]](https://www.packtpub.com/big-data-and-business-intelligence/statistics-machine-learning?utm_source=github&utm_medium=repository&utm_campaign=9781785280009) [[Amazon]](https://www.amazon.com/dp/1788295757) 59 | 60 | ## Get to Know the Author 61 | **Sudharsan Ravichandiran** 62 | is a data scientist, researcher, artificial intelligence enthusiast, 63 | and YouTuber (search for Sudharsan reinforcement learning). He completed his bachelor's 64 | in information technology at Anna University. His area of research focuses on practical 65 | implementations of deep learning and reinforcement learning, which includes natural 66 | language processing and computer vision. He used to be a freelance web developer and 67 | designer and has designed award-winning websites. He is an open source contributor and 68 | loves answering questions on Stack Overflow. 
69 | 70 | 71 | 72 | ### Suggestions and Feedback 73 | [Click here](https://docs.google.com/forms/d/e/1FAIpQLSdy7dATC6QmEL81FIUuymZ0Wy9vH1jHkvpY57OiMeKGqib_Ow/viewform) if you have any feedback or suggestions. 74 | --------------------------------------------------------------------------------