├── .gitignore ├── LICENSE ├── README.md ├── algorithms ├── A3C │ ├── atari │ │ ├── README.md │ │ ├── atari_env.py │ │ ├── atari_env_deprecated.py │ │ ├── evaluate.py │ │ ├── net.py │ │ ├── train_A3C.py │ │ ├── utils.py │ │ └── worker.py │ └── doom │ │ ├── README.md │ │ ├── basic.wad │ │ ├── env_doom.py │ │ ├── net.py │ │ ├── train_A3C.py │ │ ├── utils.py │ │ └── worker.py ├── Actor-Critic │ ├── README.md │ ├── agent.py │ ├── evaluate.py │ ├── train_actor_critic.py │ └── utils.py ├── CEM │ ├── CEM.py │ └── README.md ├── DDPG │ ├── README.md │ ├── agent.py │ ├── evaluate.py │ ├── ou_noise.py │ └── train_ddpg.py ├── DQN │ ├── README.md │ ├── agent.py │ ├── evaluation.py │ └── train_DQN.py ├── PG │ ├── agent.py │ ├── run.py │ └── sync.sh ├── PPO │ ├── README.md │ ├── agent.py │ ├── config.py │ ├── distributions.py │ ├── env_wrapper.py │ ├── logger.py │ ├── train_PPO.py │ └── utils.py ├── REINFORCE │ ├── README.md │ ├── agent.py │ ├── evaluation.py │ └── train_REINFORCE.py └── TD │ ├── README.md │ ├── agents.py │ ├── envs.py │ ├── train_TD.py │ └── utils.py └── images ├── cartpole.png ├── ddpg.png ├── doom.png ├── dqn.png ├── gridworld.png ├── pong.png ├── ppo_losses.png ├── ppo_score.png ├── walker2d.gif └── walker2d.png /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | env/ 12 | build/ 13 | develop-eggs/ 14 | dist/ 15 | downloads/ 16 | eggs/ 17 | .eggs/ 18 | lib/ 19 | lib64/ 20 | parts/ 21 | sdist/ 22 | var/ 23 | wheels/ 24 | *.egg-info/ 25 | .installed.cfg 26 | *.egg 27 | 28 | # Installer logs 29 | pip-log.txt 30 | pip-delete-this-directory.txt 31 | 32 | # Unit test / coverage reports 33 | htmlcov/ 34 | .tox/ 35 | .coverage 36 | .coverage.* 37 | .cache 38 | nosetests.xml 39 | coverage.xml 40 | *,cover 41 | .hypothesis/ 42 | 43 | # Translations 44 | *.mo 45 | *.pot 46 | 47 | # Scrapy stuff: 48 | .scrapy 49 | 50 | # Sphinx documentation 51 | docs/_build/ 52 | 53 | # PyBuilder 54 | target/ 55 | 56 | # Jupyter Notebook 57 | .ipynb_checkpoints 58 | 59 | # pyenv 60 | .python-version 61 | 62 | # celery beat schedule file 63 | celerybeat-schedule 64 | 65 | # SageMath parsed files 66 | *.sage.py 67 | 68 | # dotenv 69 | .env 70 | 71 | # virtualenv 72 | .venv 73 | venv/ 74 | ENV/ 75 | 76 | # Spyder project settings 77 | .spyderproject 78 | 79 | # Rope project settings 80 | .ropeproject 81 | 82 | # mkdocs documentation 83 | /site 84 | 85 | # ignore saved models 86 | models/ 87 | model/ 88 | 89 | *.swp 90 | .vscode 91 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 borgwang 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 
14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Reinforcement learning in Python 2 | 3 | Implementations of popular Reinforcement Learning algorithms using Python and TensorFlow. 4 | 5 | * Value-based Methods 6 | * [Tabular TD-Learning](https://github.com/borgwang/reinforce_py/tree/master/algorithms/TD) 7 | * [DQN&DDQN](https://github.com/borgwang/reinforce_py/tree/master/algorithms/DQN) 8 | * Policy-based Methods 9 | * [REINFORCE](https://github.com/borgwang/reinforce_py/tree/master/algorithms/REINFORCE) 10 | * [DDPG](https://github.com/borgwang/reinforce_py/tree/master/algorithms/DDPG) 11 | * Combined Policy-based and Value-based Methods 12 | * [Actor-Critic](https://github.com/borgwang/reinforce_py/tree/master/algorithms/Actor-Critic) 13 | * [A3C](https://github.com/borgwang/reinforce_py/tree/master/algorithms/A3C/doom) 14 | * [PPO](https://github.com/borgwang/reinforce_py/tree/master/algorithms/PPO) 15 | * Derivative-free Methods 16 | * [CEM](https://github.com/borgwang/reinforce_py/tree/master/algorithms/CEM) 17 | * [Evolution Strategies](https://github.com/borgwang/evolution-strategy) (linked to a stand-alone repository) 18 | -------------------------------------------------------------------------------- /algorithms/A3C/atari/README.md: -------------------------------------------------------------------------------- 1 | ## Asynchronous Advantage Actor-Critic (A3C) 2 | Implementation of the A3C method proposed by Google DeepMind. 3 | 4 | Related papers: 5 | * [Asynchronous Methods for Deep Reinforcement Learning](http://diyhpl.us/~bryan/papers2/ai/machine-learning/Asynchronous%20methods%20for%20deep%20reinforcement%20learning%20-%202016.pdf) 6 | 7 | 8 | ## Requirements 9 | * [Numpy](http://www.numpy.org/) 10 | * [Tensorflow](http://www.tensorflow.org) 11 | * [gym](https://gym.openai.com) 12 | 13 | ## Run 14 | python train_A3C.py 15 | python train_A3C.py -h # show all optional arguments 16 | 17 | ## Components 18 | `train_A3C.py` creates a master (global) network and multiple worker (local) networks. 19 | `worker.py` is the worker class implementation (see the return and advantage sketch below). 20 | `net.py` constructs the Actor-Critic network. 21 | `atari_env.py` is a wrapper of the gym environment. 22 | 23 | ## Note 24 | Still buggy; work in progress. 
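To make the worker update concrete, here is a minimal NumPy sketch of the n-step return and advantage computation that `worker._train` performs via `reward_discount` in `utils.py`. The helper name `discounted_returns` and the toy numbers are illustrative only, not part of the repo:

    import numpy as np

    def discounted_returns(rewards, gamma, bootstrap_value):
        # R_t = r_t + gamma * R_{t+1}, seeded with the critic's bootstrap value
        returns = np.zeros(len(rewards))
        running = bootstrap_value
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            returns[t] = running
        return returns

    # toy 5-step rollout (illustrative numbers only)
    rewards = np.array([0.0, 0.0, 1.0, 0.0, 0.0])
    values = np.array([0.9, 1.0, 1.1, 0.8, 0.7])   # critic estimates V(s_t)
    bootstrap = 0.6                                # V of the state after the rollout

    targets = discounted_returns(rewards, gamma=0.99, bootstrap_value=bootstrap)
    advantages = targets - values                  # fed to net.target_v / net.advantages
    print(targets, advantages)

This mirrors `reward_discount(rewards + [bootstrap_value], gamma)[:-1]` followed by `discounted_rewards - values` in `worker._train`, where the bootstrap value is the critic's estimate of the last state (or 0 at episode end).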
25 | -------------------------------------------------------------------------------- /algorithms/A3C/atari/atari_env.py: -------------------------------------------------------------------------------- 1 | # Code borrowed from OpenAI/baseliens (https://github.com/openai/baselines) 2 | # Copyright (c) 2017 OpenAI (http://openai.com) 3 | 4 | import numpy as np 5 | import gym 6 | import os 7 | 8 | from collections import deque 9 | from PIL import Image 10 | from gym import spaces 11 | 12 | 13 | DEFAULT_ENV = 'BreakoutNoFrameskip-v4' 14 | RESOLUTION = 84 15 | S_DIM = [RESOLUTION, RESOLUTION, 1] 16 | A_DIM = gym.make(DEFAULT_ENV).action_space.n 17 | 18 | 19 | class NoopResetEnv(gym.Wrapper): 20 | def __init__(self, env, noop_max=30): 21 | """Sample initial states by taking random number of no-ops on reset. 22 | No-op is assumed to be action 0. 23 | """ 24 | gym.Wrapper.__init__(self, env) 25 | self.noop_max = noop_max 26 | self.override_num_noops = None 27 | assert env.unwrapped.get_action_meanings()[0] == 'NOOP' 28 | 29 | def _reset(self): 30 | """ Do no-op action for a number of steps in [1, noop_max].""" 31 | self.env.reset() 32 | if self.override_num_noops is not None: 33 | noops = self.override_num_noops 34 | else: 35 | noops = self.unwrapped.np_random.randint(1, self.noop_max + 1) 36 | assert noops > 0 37 | obs = None 38 | for _ in range(noops): 39 | obs, _, done, _ = self.env.step(0) 40 | if done: 41 | obs = self.env.reset() 42 | return obs 43 | 44 | 45 | class FireResetEnv(gym.Wrapper): 46 | def __init__(self, env): 47 | """Take action on reset for environments that are fixed until firing.""" 48 | gym.Wrapper.__init__(self, env) 49 | assert env.unwrapped.get_action_meanings()[1] == 'FIRE' 50 | assert len(env.unwrapped.get_action_meanings()) >= 3 51 | 52 | def _reset(self): 53 | self.env.reset() 54 | obs, _, done, _ = self.env.step(1) 55 | if done: 56 | self.env.reset() 57 | obs, _, done, _ = self.env.step(2) 58 | if done: 59 | self.env.reset() 60 | return obs 61 | 62 | 63 | class EpisodicLifeEnv(gym.Wrapper): 64 | def __init__(self, env): 65 | """Make end-of-life == end-of-episode, but only reset on true game over. 66 | Done by DeepMind for the DQN and co. since it helps value estimation. 67 | """ 68 | gym.Wrapper.__init__(self, env) 69 | self.lives = 0 70 | self.was_real_done = True 71 | 72 | def _step(self, action): 73 | obs, reward, done, info = self.env.step(action) 74 | self.was_real_done = done 75 | # check current lives, make loss of life terminal, 76 | # then update lives to handle bonus lives 77 | lives = self.env.unwrapped.ale.lives() 78 | if lives < self.lives and lives > 0: 79 | # for Qbert somtimes we stay in lives == 0 condtion for a few frames 80 | # so its important to keep lives > 0, so that we only reset once 81 | # the environment advertises done. 82 | done = True 83 | self.lives = lives 84 | return obs, reward, done, info 85 | 86 | def _reset(self): 87 | """Reset only when lives are exhausted. 88 | This way all states are still reachable even though lives are episodic, 89 | and the learner need not know about any of this behind-the-scenes. 
90 | """ 91 | if self.was_real_done: 92 | obs = self.env.reset() 93 | else: 94 | # no-op step to advance from terminal/lost life state 95 | obs, _, _, _ = self.env.step(0) 96 | self.lives = self.env.unwrapped.ale.lives() 97 | return obs 98 | 99 | 100 | class MaxAndSkipEnv(gym.Wrapper): 101 | def __init__(self, env, skip=4): 102 | """Return only every `skip`-th frame""" 103 | gym.Wrapper.__init__(self, env) 104 | # most recent raw observations (for max pooling across time steps) 105 | self._obs_buffer = deque(maxlen=2) 106 | self._skip = skip 107 | 108 | def _step(self, action): 109 | """Repeat action, sum reward, and max over last observations.""" 110 | total_reward = 0.0 111 | done = None 112 | for _ in range(self._skip): 113 | obs, reward, done, info = self.env.step(action) 114 | self._obs_buffer.append(obs) 115 | total_reward += reward 116 | if done: 117 | break 118 | max_frame = np.max(np.stack(self._obs_buffer), axis=0) 119 | 120 | return max_frame, total_reward, done, info 121 | 122 | def _reset(self): 123 | """Clear past frame buffer and init. to first obs. from inner env.""" 124 | self._obs_buffer.clear() 125 | obs = self.env.reset() 126 | self._obs_buffer.append(obs) 127 | return obs 128 | 129 | 130 | class ClipRewardEnv(gym.RewardWrapper): 131 | def _reward(self, reward): 132 | """Bin reward to {+1, 0, -1} by its sign.""" 133 | return np.sign(reward) 134 | 135 | 136 | class WarpFrame(gym.ObservationWrapper): 137 | def __init__(self, env): 138 | """Warp frames to 84x84 as done in the Nature paper and later work.""" 139 | gym.ObservationWrapper.__init__(self, env) 140 | self.res = RESOLUTION 141 | self.observation_space = spaces.Box(low=0, high=255, shape=(self.res, self.res, 1)) 142 | 143 | def _observation(self, obs): 144 | frame = np.dot(obs.astype('float32'), np.array([0.299, 0.587, 0.114], 'float32')) 145 | frame = np.array(Image.fromarray(frame).resize((self.res, self.res), 146 | resample=Image.BILINEAR), dtype=np.uint8) 147 | return frame.reshape((self.res, self.res, 1)) 148 | 149 | 150 | class FrameStack(gym.Wrapper): 151 | def __init__(self, env, k): 152 | """Buffer observations and stack across channels (last axis).""" 153 | gym.Wrapper.__init__(self, env) 154 | self.k = k 155 | self.frames = deque([], maxlen=k) 156 | shp = env.observation_space.shape 157 | assert shp[2] == 1 # can only stack 1-channel frames 158 | self.observation_space = spaces.Box(low=0, high=255, shape=(shp[0], shp[1], k)) 159 | 160 | def _reset(self): 161 | """Clear buffer and re-fill by duplicating the first observation.""" 162 | ob = self.env.reset() 163 | for _ in range(self.k): 164 | self.frames.append(ob) 165 | return self._observation() 166 | 167 | def _step(self, action): 168 | ob, reward, done, info = self.env.step(action) 169 | self.frames.append(ob) 170 | return self._observation(), reward, done, info 171 | 172 | def _observation(self): 173 | assert len(self.frames) == self.k 174 | return np.concatenate(self.frames, axis=2) 175 | 176 | 177 | def wrap_deepmind(env, episode_life=True, clip_rewards=True): 178 | """Configure environment for DeepMind-style Atari. 
179 | 180 | Note: this does not include frame stacking!""" 181 | assert 'NoFrameskip' in env.spec.id # required for DeepMind-style skip 182 | if episode_life: 183 | env = EpisodicLifeEnv(env) 184 | env = NoopResetEnv(env, noop_max=30) 185 | env = MaxAndSkipEnv(env, skip=4) 186 | if 'FIRE' in env.unwrapped.get_action_meanings(): 187 | env = FireResetEnv(env) 188 | env = WarpFrame(env) 189 | if clip_rewards: 190 | env = ClipRewardEnv(env) 191 | return env 192 | 193 | 194 | def make_env(args, record_video=False): 195 | env = gym.make(DEFAULT_ENV) 196 | if record_video: 197 | video_dir = os.path.join(args.save_path, 'videos') 198 | if not os.path.exists(video_dir): 199 | os.makedirs(video_dir) 200 | env = gym.wrappers.Monitor( 201 | env, video_dir, video_callable=lambda x: True, resume=True) 202 | 203 | return wrap_deepmind(env) 204 | -------------------------------------------------------------------------------- /algorithms/A3C/atari/atari_env_deprecated.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | import gym 4 | import numpy as np 5 | 6 | from skimage.color import rgb2gray 7 | from skimage.transform import resize 8 | 9 | 10 | class Atari(object): 11 | s_dim = [84, 84, 1] 12 | a_dim = 3 13 | 14 | def __init__(self, args, record_video=False): 15 | self.env = gym.make('BreakoutNoFrameskip-v4') 16 | self.ale = self.env.env.ale # ale interface 17 | if record_video: 18 | video_dir = os.path.join(args.save_path, 'videos') 19 | if not os.path.exists(video_dir): 20 | os.makedirs(video_dir) 21 | self.env = gym.wrappers.Monitor( 22 | self.env, video_dir, video_callable=lambda x: True, resume=True) 23 | self.ale = self.env.env.env.ale 24 | 25 | self.screen_size = Atari.s_dim[:2] # 84x84 26 | self.noop_max = 30 27 | self.frame_skip = 4 28 | self.frame_feq = 4 29 | self.s_dim = Atari.s_dim 30 | self.a_dim = Atari.a_dim 31 | 32 | self.action_space = [1, 2, 3] # Breakout specify 33 | self.done = True 34 | 35 | def new_round(self): 36 | if not self.done: # dead but not done 37 | # no-op step to advance from terminal/lost life state 38 | obs, _, _, _ = self.env.step(0) 39 | obs = self.preprocess(obs) 40 | else: # terminal 41 | self.env.reset() 42 | # No-op 43 | for _ in range(np.random.randint(1, self.noop_max + 1)): 44 | obs, _, done, _ = self.env.step(0) 45 | obs = self.preprocess(obs) 46 | return obs 47 | 48 | def preprocess(self, observ): 49 | return resize(rgb2gray(observ), self.screen_size) 50 | 51 | def step(self, action): 52 | observ, reward, dead = None, 0, False 53 | for _ in range(self.frame_skip): 54 | lives_before = self.ale.lives() 55 | o, r, self.done, _ = self.env.step(self.action_space[action]) 56 | lives_after = self.ale.lives() 57 | reward += r 58 | if lives_before > lives_after: 59 | dead = True 60 | break 61 | observ = self.preprocess(o) 62 | observ = np.reshape(observ, newshape=self.screen_size + [1]) 63 | self.state = np.append(self.state[:, :, 1:], observ, axis=2) 64 | 65 | return self.state, reward, dead, self.done 66 | -------------------------------------------------------------------------------- /algorithms/A3C/atari/evaluate.py: -------------------------------------------------------------------------------- 1 | import os 2 | import time 3 | 4 | import numpy as np 5 | import tensorflow as tf 6 | 7 | from atari_env import A_DIM 8 | from atari_env import make_env 9 | 10 | 11 | class Evaluate(object): 12 | ''' 13 | Evaluate a policy by running n episodes in an environment. 
14 | Save a video and plot summaries to Tensorboard 15 | 16 | Args: 17 | global_net: The global network 18 | summary_writer: used to write Tensorboard summaries 19 | args: Some global parameters 20 | ''' 21 | 22 | def __init__(self, global_net, summary_writer, global_steps_counter, args): 23 | self.env = make_env(args, record_video=args.record_video) 24 | self.global_net = global_net 25 | self.summary_writer = summary_writer 26 | self.global_steps_counter = global_steps_counter 27 | self.eval_every = args.eval_every 28 | self.eval_times = 0 29 | self.eval_episodes = args.eval_episodes 30 | 31 | self.saver = tf.train.Saver(max_to_keep=5) 32 | self.model_dir = os.path.join(args.save_path, 'models/') 33 | if not os.path.exists(self.model_dir): 34 | os.makedirs(self.model_dir) 35 | 36 | def run(self, sess, coord): 37 | while not coord.should_stop(): 38 | global_steps = next(self.global_steps_counter) 39 | eval_start = time.time() 40 | avg_reward, avg_ep_length = self._eval(sess) 41 | self.eval_times += 1 42 | print('Eval at step %d: avg_reward %.4f, avg_ep_length %.4f' % 43 | (global_steps, avg_reward, avg_ep_length)) 44 | print('Time cost: %.4fs' % (time.time() - eval_start)) 45 | # add summaries 46 | ep_summary = tf.Summary() 47 | ep_summary.value.add( 48 | simple_value=avg_reward, tag='eval/avg_reward') 49 | ep_summary.value.add( 50 | simple_value=avg_ep_length, tag='eval/avg_ep_length') 51 | self.summary_writer.add_summary(ep_summary, global_steps) 52 | self.summary_writer.flush() 53 | # save models 54 | if self.eval_times % 10 == 1: 55 | save_start = time.time() 56 | self.saver.save(sess, self.model_dir + str(global_steps)) 57 | print('Model saved. Time cost: %.4fs ' % 58 | (time.time() - save_start)) 59 | 60 | time.sleep(self.eval_every) 61 | 62 | def _eval(self, sess): 63 | total_reward = 0.0 64 | episode_length = 0.0 65 | for _ in range(self.eval_episodes * 5): 66 | s = self.env.reset() 67 | while True: 68 | p = sess.run(self.global_net.policy, 69 | {self.global_net.inputs: [s]}) 70 | a = np.random.choice(range(A_DIM), p=p[0]) 71 | s, r, done, _ = self.env.step(a) 72 | total_reward += r 73 | episode_length += 1.0 74 | if done: 75 | break 76 | return total_reward / self.eval_episodes, \ 77 | episode_length / self.eval_episodes 78 | -------------------------------------------------------------------------------- /algorithms/A3C/atari/net.py: -------------------------------------------------------------------------------- 1 | import tensorflow.contrib.slim as slim 2 | 3 | from utils import * 4 | 5 | 6 | class Net(object): 7 | ''' 8 | An Actor-Critic Network class. The shallow layers are shared by the Actor 9 | and the Critic. 10 | 11 | Args: 12 | s_dim: dimensions of the state space 13 | a_dim: dimensions of the action space 14 | scope: Scope the net belongs to 15 | trainer: optimizer used by this net 16 | ''' 17 | 18 | def __init__(self, s_dim, a_dim, scope, args, trainer=None): 19 | self.s_dim = s_dim 20 | self.a_dim = a_dim 21 | self.scope = scope 22 | self.smooth = args.smooth 23 | self.clip_grads = args.clip_grads 24 | self.entropy_ratio = args.entropy_ratio 25 | 26 | with tf.variable_scope(self.scope): 27 | self.inputs = tf.placeholder(tf.float32, shape=[None] + self.s_dim) 28 | 29 | self._contruct_network(self.inputs) 30 | 31 | if self.scope != 'global': 32 | self._update_network(trainer) 33 | 34 | def _contruct_network(self, inputs): 35 | ''' 36 | Biuld the computational graph. 
37 | ''' 38 | conv1 = slim.conv2d(inputs=inputs, 39 | num_outputs=32, 40 | kernel_size=[8, 8], 41 | stride=[4, 4], 42 | padding='VALID', 43 | weights_initializer=ortho_init(), 44 | scope='share_conv1') 45 | conv2 = slim.conv2d(inputs=conv1, 46 | num_outputs=64, 47 | kernel_size=[4, 4], 48 | stride=[2, 2], 49 | padding='VALID', 50 | weights_initializer=ortho_init(), 51 | scope='share_conv2') 52 | conv3 = slim.conv2d(inputs=conv2, 53 | num_outputs=64, 54 | kernel_size=[3, 3], 55 | stride=[1, 1], 56 | padding='VALID', 57 | weights_initializer=ortho_init(), 58 | scope='share_conv3') 59 | fc = slim.fully_connected(inputs=slim.flatten(conv3), 60 | num_outputs=512, 61 | weights_initializer=ortho_init(np.sqrt(2)), 62 | scope='share_fc1') 63 | self.policy = slim.fully_connected(inputs=fc, 64 | num_outputs=self.a_dim, 65 | activation_fn=tf.nn.softmax, 66 | scope='policy_out') 67 | self.value = slim.fully_connected(inputs=fc, num_outputs=1, 68 | activation_fn=None, 69 | scope='value_out') 70 | 71 | def _update_network(self, trainer): 72 | ''' 73 | Build losses, compute gradients and apply gradients to the global net 74 | ''' 75 | 76 | self.actions = tf.placeholder(shape=[None], dtype=tf.int32) 77 | actions_onehot = tf.one_hot(self.actions, self.a_dim, dtype=tf.float32) 78 | self.target_v = tf.placeholder(shape=[None], dtype=tf.float32) 79 | self.advantages = tf.placeholder(shape=[None], dtype=tf.float32) 80 | 81 | action_prob = tf.reduce_sum(self.policy * actions_onehot, [1]) 82 | 83 | # MSE critic loss 84 | self.critic_loss = 0.5 * tf.reduce_sum( 85 | tf.squared_difference( 86 | self.target_v, tf.reshape(self.value, [-1]))) 87 | 88 | # high entropy -> low loss -> encourage exploration 89 | self.entropy = -tf.reduce_sum(self.policy * tf.log(self.policy + 1e-30), 1) 90 | self.entropy_loss = -self.entropy_ratio * tf.reduce_sum(self.entropy) 91 | 92 | # policy gradients = d_[-log(p) * advantages] / d_theta 93 | self.actor_loss = -tf.reduce_sum( 94 | tf.log(action_prob + 1e-30) * self.advantages) 95 | self.actor_loss += self.entropy_loss 96 | 97 | self.loss = self.actor_loss + self.critic_loss 98 | local_vars = tf.get_collection( 99 | tf.GraphKeys.TRAINABLE_VARIABLES, self.scope) 100 | self.grads = tf.gradients(self.loss, local_vars) 101 | 102 | # global norm gradients clipping 103 | self.grads, self.grad_norms = \ 104 | tf.clip_by_global_norm(self.grads, self.clip_grads) 105 | self.var_norms = tf.global_norm(local_vars) 106 | global_vars = tf.get_collection( 107 | tf.GraphKeys.TRAINABLE_VARIABLES, 'global') 108 | self.apply_grads_to_global = \ 109 | trainer.apply_gradients(zip(self.grads, global_vars)) 110 | 111 | # summaries 112 | if self.scope == 'worker_1': 113 | tf.summary.scalar('loss/entropy', tf.reduce_sum(self.entropy)) 114 | tf.summary.scalar('loss/actor_loss', self.actor_loss) 115 | tf.summary.scalar('loss/critic_loss', self.critic_loss) 116 | tf.summary.scalar('advantages', tf.reduce_mean(self.advantages)) 117 | tf.summary.scalar('norms/grad_norms', self.grad_norms) 118 | tf.summary.scalar('norms/var_norms', self.var_norms) 119 | summaries = tf.get_collection(tf.GraphKeys.SUMMARIES) 120 | self.summaries = tf.summary.merge(summaries) 121 | else: 122 | self.summaries = tf.no_op() 123 | -------------------------------------------------------------------------------- /algorithms/A3C/atari/train_A3C.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import itertools 3 | import os 4 | import threading 5 | import time 6 | 7 | import 
tensorflow as tf 8 | 9 | from atari_env import A_DIM 10 | from atari_env import S_DIM 11 | from atari_env import make_env 12 | from evaluate import Evaluate 13 | from net import Net 14 | from utils import print_params_nums 15 | from worker import Worker 16 | 17 | 18 | def main(args): 19 | if args.save_path is not None and not os.path.exists(args.save_path): 20 | os.makedirs(args.save_path) 21 | 22 | summary_writer = tf.summary.FileWriter(os.path.join(args.save_path, 'log')) 23 | global_steps_counter = itertools.count() # thread-safe 24 | 25 | global_net = Net(S_DIM, A_DIM, 'global', args) 26 | num_workers = args.threads 27 | workers = [] 28 | 29 | # create workers 30 | for i in range(1, num_workers + 1): 31 | worker_summary_writer = summary_writer if i == 0 else None 32 | worker = Worker(i, make_env(args), global_steps_counter, 33 | worker_summary_writer, args) 34 | workers.append(worker) 35 | 36 | saver = tf.train.Saver(max_to_keep=5) 37 | 38 | with tf.Session() as sess: 39 | coord = tf.train.Coordinator() 40 | if args.model_path is not None: 41 | print('Loading model...\n') 42 | ckpt = tf.train.get_checkpoint_state(args.model_path) 43 | saver.restore(sess, ckpt.model_checkpoint_path) 44 | else: 45 | print('Initializing a new model...\n') 46 | sess.run(tf.global_variables_initializer()) 47 | print_params_nums() 48 | # Start work process for each worker in a separated thread 49 | worker_threads = [] 50 | for worker in workers: 51 | t = threading.Thread(target=lambda: worker.run(sess, coord, saver)) 52 | t.start() 53 | time.sleep(0.5) 54 | worker_threads.append(t) 55 | 56 | if args.eval_every > 0: 57 | evaluator = Evaluate( 58 | global_net, summary_writer, global_steps_counter, args) 59 | evaluate_thread = threading.Thread( 60 | target=lambda: evaluator.run(sess, coord)) 61 | evaluate_thread.start() 62 | 63 | coord.join(worker_threads) 64 | 65 | 66 | def args_parse(): 67 | parser = argparse.ArgumentParser() 68 | 69 | parser.add_argument( 70 | '--model_path', default=None, type=str, 71 | help='Whether to use a saved model. (*None|model path)') 72 | parser.add_argument( 73 | '--save_path', default='/tmp/a3c', type=str, 74 | help='Path to save a model during training.') 75 | parser.add_argument( 76 | '--max_steps', default=int(1e8), type=int, help='Max training steps') 77 | parser.add_argument( 78 | '--start_time', default=None, type=str, help='Time to start training') 79 | parser.add_argument( 80 | '--threads', default=16, type=int, 81 | help='Numbers of parallel threads. 
[num_cpu_cores] by default') 82 | # evaluate 83 | parser.add_argument( 84 | '--eval_every', default=500, type=int, 85 | help='Evaluate the global policy every N seconds') 86 | parser.add_argument( 87 | '--record_video', default=True, type=bool, 88 | help='Whether to save videos when evaluating') 89 | parser.add_argument( 90 | '--eval_episodes', default=5, type=int, 91 | help='Numbers of episodes per evaluation') 92 | # hyperparameters 93 | parser.add_argument( 94 | '--init_learning_rate', default=7e-4, type=float, 95 | help='Learning rate of the optimizer') 96 | parser.add_argument( 97 | '--decay', default=0.99, type=float, 98 | help='decay factor of the RMSProp optimizer') 99 | parser.add_argument( 100 | '--smooth', default=1e-7, type=float, 101 | help='epsilon of the RMSProp optimizer') 102 | parser.add_argument( 103 | '--gamma', default=0.99, type=float, 104 | help='Discout factor of reward and advantages') 105 | parser.add_argument('--tmax', default=5, type=int, help='Rollout size') 106 | parser.add_argument( 107 | '--entropy_ratio', default=0.01, type=float, 108 | help='Initial weight of entropy loss') 109 | parser.add_argument( 110 | '--clip_grads', default=40, type=float, 111 | help='global norm gradients clipping') 112 | parser.add_argument( 113 | '--epsilon', default=1e-5, type=float, 114 | help='epsilon of rmsprop optimizer') 115 | 116 | return parser.parse_args() 117 | 118 | 119 | if __name__ == '__main__': 120 | # ignore warnings by tensorflow 121 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2' 122 | # make GPU invisible 123 | os.environ['CUDA_VISIBLE_DEVICES'] = '' 124 | 125 | args = args_parse() 126 | main(args) 127 | -------------------------------------------------------------------------------- /algorithms/A3C/atari/utils.py: -------------------------------------------------------------------------------- 1 | import time 2 | 3 | import numpy as np 4 | import scipy.signal 5 | import tensorflow as tf 6 | 7 | 8 | def reward_discount(x, gamma): 9 | return scipy.signal.lfilter([1], [1, -gamma], x[::-1], axis=0)[::-1] 10 | 11 | 12 | def ortho_init(scale=1.0): 13 | 14 | def _ortho_init(shape, dtype, partition_info=None): 15 | # lasagne ortho init for tf 16 | shape = tuple(shape) 17 | if len(shape) == 2: 18 | flat_shape = shape 19 | elif len(shape) == 4: # assumes NHWC 20 | flat_shape = (np.prod(shape[:-1]), shape[-1]) 21 | else: 22 | raise NotImplementedError 23 | a = np.random.normal(0.0, 1.0, flat_shape) 24 | u, _, v = np.linalg.svd(a, full_matrices=False) 25 | q = u if u.shape == flat_shape else v # pick the one with the correct shape 26 | q = q.reshape(shape) 27 | return (scale * q[:shape[0], :shape[1]]).astype(np.float32) 28 | 29 | return _ortho_init 30 | 31 | 32 | def print_params_nums(): 33 | total_parameters = 0 34 | for v in tf.trainable_variables(): 35 | shape = v.get_shape() 36 | param_num = 1 37 | for d in shape: 38 | param_num *= d.value 39 | print(v.name, ' ', shape, ' param nums: ', param_num) 40 | total_parameters += param_num 41 | print('\nTotal nums of parameters: %d\n' % total_parameters) 42 | 43 | 44 | def print_time_cost(start_time): 45 | t_c = time.gmtime(time.time() - start_time) 46 | print('Time cost ------ %dh %dm %ds' % 47 | (t_c.tm_hour, t_c.tm_min, t_c.tm_sec)) 48 | -------------------------------------------------------------------------------- /algorithms/A3C/atari/worker.py: -------------------------------------------------------------------------------- 1 | from atari_env import A_DIM 2 | from atari_env import S_DIM 3 | from net import Net 4 | from 
utils import * 5 | 6 | 7 | class Worker(object): 8 | ''' 9 | An A3C worker thread. Run a game locally, gather gradients and apply 10 | to the global networks. 11 | 12 | Args: 13 | worker_id: A unique id for this thread 14 | env: Game environment used by this worker 15 | global_steps: Iterator that holds the global steps 16 | args: Global parameters and hyperparameters 17 | ''' 18 | 19 | def __init__( 20 | self, worker_id, env, global_steps_counter, summary_writer, args): 21 | self.name = 'worker_' + str(worker_id) 22 | self.env = env 23 | self.args = args 24 | self.local_steps = 0 25 | self.global_steps_counter = global_steps_counter 26 | # each worker has its own optimizer and learning_rate 27 | self.learning_rate = tf.Variable(args.init_learning_rate, 28 | dtype=tf.float32, 29 | trainable=False, 30 | name=self.name + '_lr') 31 | self.delta_lr = \ 32 | args.init_learning_rate / (args.max_steps / args.threads) 33 | self.trainer = tf.train.RMSPropOptimizer(self.learning_rate, 34 | decay=args.decay, 35 | epsilon=args.epsilon) 36 | self.summary_writer = summary_writer 37 | 38 | self.local_net = Net(S_DIM, 39 | A_DIM, 40 | scope=self.name, 41 | args=self.args, 42 | trainer=self.trainer) 43 | 44 | self.update_local_op = self._update_local_vars() 45 | self.anneal_learning_rate = self._anneal_learning_rate() 46 | 47 | def run(self, sess, coord, saver): 48 | print('Starting %s...\n' % self.name) 49 | with sess.as_default(), sess.graph.as_default(): 50 | while not coord.should_stop(): 51 | sess.run(self.update_local_op) 52 | rollout = [] 53 | s = self.env.reset() 54 | while True: 55 | p, v = sess.run( 56 | [self.local_net.policy, self.local_net.value], 57 | feed_dict={self.local_net.inputs: [s]}) 58 | a = np.random.choice(range(A_DIM), p=p[0]) 59 | s1, r, dead, done = self.env.step(a) 60 | rollout.append([s, a, r, s1, dead, v[0][0]]) 61 | s = s1 62 | 63 | global_steps = next(self.global_steps_counter) 64 | self.local_steps += 1 65 | sess.run(self.anneal_learning_rate) 66 | 67 | if not dead and len(rollout) == self.args.tmax: 68 | # calculate value of next state, uses for bootstraping 69 | v1 = sess.run(self.local_net.value, 70 | feed_dict={self.local_net.inputs: [s]}) 71 | self._train(rollout, sess, v1[0][0], global_steps) 72 | rollout = [] 73 | sess.run(self.update_local_op) 74 | 75 | if dead: 76 | break 77 | 78 | if len(rollout) != 0: 79 | self._train(rollout, sess, 0.0, global_steps) 80 | # end condition 81 | if global_steps >= self.args.max_steps: 82 | coord.request_stop() 83 | print_time_cost(self.args.start_time) 84 | 85 | def _train(self, rollout, sess, bootstrap_value, global_steps): 86 | ''' 87 | Update global networks based on the rollout experiences 88 | 89 | Args: 90 | rollout: A list of transitions experiences 91 | sess: Tensorflow session 92 | bootstrap_value: if the episode was not done, we bootstrap the value 93 | from the last state. 
94 | global_steps: used for summaries 95 | ''' 96 | 97 | rollout = np.array(rollout) 98 | observs, actions, rewards, next_observs, dones, values = rollout.T 99 | # compute advantages and discounted rewards 100 | rewards_plus = np.asarray(rewards.tolist() + [bootstrap_value]) 101 | discounted_rewards = reward_discount(rewards_plus, self.args.gamma)[:-1] 102 | 103 | advantages = discounted_rewards - values 104 | 105 | summaries, _ = sess.run([ 106 | self.local_net.summaries, 107 | self.local_net.apply_grads_to_global 108 | ], feed_dict={ 109 | self.local_net.inputs: np.stack(observs), 110 | self.local_net.actions: actions, 111 | self.local_net.target_v: discounted_rewards, # for value loss 112 | self.local_net.advantages: advantages # for policy net 113 | }) 114 | # write summaries 115 | if self.summary_writer and summaries: 116 | self.summary_writer.add_summary(summaries, global_steps) 117 | self.summary_writer.flush() 118 | 119 | def _update_local_vars(self): 120 | ''' 121 | Assign global network parameters to the local network 122 | ''' 123 | global_vars = tf.get_collection( 124 | tf.GraphKeys.TRAINABLE_VARIABLES, 'global') 125 | local_vars = tf.get_collection( 126 | tf.GraphKeys.TRAINABLE_VARIABLES, self.name) 127 | update_op = [] 128 | for g_v, l_v in zip(global_vars, local_vars): 129 | update_op.append(l_v.assign(g_v)) 130 | 131 | return update_op 132 | 133 | def _anneal_learning_rate(self): 134 | return tf.cond( 135 | self.learning_rate > 0.0, 136 | lambda: tf.assign_sub(self.learning_rate, self.delta_lr), 137 | lambda: tf.assign(self.learning_rate, 0.0)) 138 | -------------------------------------------------------------------------------- /algorithms/A3C/doom/README.md: -------------------------------------------------------------------------------- 1 | ## Asynchronous Advantage Actor-Critic (A3C) 2 | Implementation of the A3C method proposed by Google DeepMind. 3 | 4 | Related papers: 5 | * [Asynchronous Methods for Deep Reinforcement Learning](http://diyhpl.us/~bryan/papers2/ai/machine-learning/Asynchronous%20methods%20for%20deep%20reinforcement%20learning%20-%202016.pdf) 6 | 7 | ## ViZDoom 8 | [ViZDoom](http://vizdoom.cs.put.edu.pl/) is a Doom-based AI research platform for reinforcement learning from raw visual information. The agent receives raw visual input and takes actions (moving, picking up items, attacking monsters) to maximize its score. 9 | 10 | doom 11 | 12 | In this repository, we implement A3C to solve the basic ViZDoom task. 13 | 14 | ## Requirements 15 | * [Numpy](http://www.numpy.org/) 16 | * [Tensorflow](http://www.tensorflow.org) 17 | * [gym](https://gym.openai.com) 18 | * [scipy](https://www.scipy.org/) 19 | * [ViZDoom](https://github.com/mwydmuch/ViZDoom/blob/master/doc/Building.md) 20 | 21 | ## Run 22 | python train_A3C.py 23 | python train_A3C.py -h # show all optional arguments 24 | 25 | ## Components 26 | `train_A3C.py` creates a master (global) network and multiple worker (local) networks. 27 | `worker.py` is the worker class implementation. 28 | `net.py` constructs the Actor-Critic network. 29 | `env_doom.py` is a wrapper of the ViZDoom environment (see the usage sketch below). 
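For reference, here is a minimal random-agent loop against the `Doom` wrapper, showing the gym-style interface described above. It is only a sketch: it assumes ViZDoom and an older scipy (providing `scipy.misc.imresize`, used by `utils.preprocess`) are installed, that `basic.wad` is in the working directory, and it substitutes a random policy for the trained A3C worker:

    import numpy as np

    from env_doom import Doom
    from utils import preprocess  # crops and resizes a frame to a flat 84*84 vector

    env = Doom(visiable=False)    # note: the constructor argument is spelled 'visiable' in env_doom.py
    state = preprocess(env.reset())   # `state` is what the A3C policy network would consume
    done, total_reward = False, 0.0
    while not done:
        action = np.random.randint(env.action_dim)   # random action instead of the A3C policy
        next_frame, reward, done = env.step(action)  # gym-style (observation, reward, done) triple
        total_reward += reward
        state = preprocess(next_frame)
    print('episode reward:', total_reward)

The same `reset`/`step` cycle, with the sampled action coming from `local_net.policy`, is what `worker.run` executes in each thread.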
30 | -------------------------------------------------------------------------------- /algorithms/A3C/doom/basic.wad: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/borgwang/reinforce_py/41f67327ae7e1bf87d4648e3ea5f406466c532c9/algorithms/A3C/doom/basic.wad -------------------------------------------------------------------------------- /algorithms/A3C/doom/env_doom.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | from vizdoom import * 4 | 5 | 6 | class Doom(object): 7 | '''Wrapper for Doom environment. Gym-style interface''' 8 | def __init__(self, visiable=False): 9 | self.env = self._setup(visiable) 10 | self.state_dim = 84 * 84 * 1 11 | self.action_dim = 3 12 | # Identity bool matrix, transfer action to bool one-hot 13 | self.bool_onehot = np.identity(self.action_dim, dtype=bool).tolist() 14 | 15 | def _setup(self, visiable): 16 | # setting up Doom environment 17 | env = DoomGame() 18 | env.set_doom_scenario_path('basic.wad') 19 | env.set_doom_map('map01') 20 | env.set_screen_resolution(ScreenResolution.RES_160X120) 21 | env.set_screen_format(ScreenFormat.GRAY8) 22 | env.set_render_hud(False) 23 | env.set_render_crosshair(False) 24 | env.set_render_weapon(True) 25 | env.set_render_decals(False) 26 | env.set_render_particles(False) 27 | env.add_available_button(Button.MOVE_LEFT) 28 | env.add_available_button(Button.MOVE_RIGHT) 29 | env.add_available_button(Button.ATTACK) 30 | env.add_available_game_variable(GameVariable.AMMO2) 31 | env.add_available_game_variable(GameVariable.POSITION_X) 32 | env.add_available_game_variable(GameVariable.POSITION_Y) 33 | env.set_episode_timeout(300) 34 | env.set_episode_start_time(10) 35 | env.set_sound_enabled(False) 36 | env.set_living_reward(-1) 37 | env.set_mode(Mode.PLAYER) 38 | env.set_window_visible(visiable) 39 | env.init() 40 | 41 | return env 42 | 43 | def _get_state(self): 44 | return self.env.get_state().screen_buffer 45 | 46 | def reset(self): 47 | self.env.new_episode() 48 | return self._get_state() 49 | 50 | def step(self, action): 51 | action = self.bool_onehot[action] # e.g. 
[False, True, False] 52 | 53 | curr_observ = self._get_state() 54 | reward = self.env.make_action(action) 55 | done = self.env.is_episode_finished() 56 | if done: 57 | next_observ = curr_observ 58 | else: 59 | next_observ = self._get_state() 60 | 61 | return next_observ, reward, done 62 | -------------------------------------------------------------------------------- /algorithms/A3C/doom/net.py: -------------------------------------------------------------------------------- 1 | import tensorflow.contrib.slim as slim 2 | 3 | from utils import * 4 | 5 | 6 | class Net: 7 | 8 | def __init__(self, s_dim, a_dim, scope, trainer): 9 | self.s_dim = s_dim 10 | self.a_dim = a_dim 11 | self.scope = scope 12 | 13 | with tf.variable_scope(self.scope): 14 | self.inputs = tf.placeholder( 15 | shape=[None, self.s_dim], dtype=tf.float32) 16 | inputs = tf.reshape(self.inputs, [-1, 84, 84, 1]) 17 | 18 | self._construct_network(inputs) 19 | 20 | if self.scope != 'global': 21 | # gradients update only for workers 22 | self._update_network(trainer) 23 | 24 | def _construct_network(self, inputs): 25 | # Actor network and critic network share all shallow layers 26 | conv1 = slim.conv2d(inputs=inputs, 27 | num_outputs=16, 28 | activation_fn=tf.nn.relu, 29 | kernel_size=[8, 8], 30 | stride=[4, 4], 31 | padding='VALID') 32 | conv2 = slim.conv2d(inputs=conv1, 33 | num_outputs=32, 34 | activation_fn=tf.nn.relu, 35 | kernel_size=[4, 4], 36 | stride=[2, 2], 37 | padding='VALID') 38 | hidden = slim.fully_connected(inputs=slim.flatten(conv2), 39 | num_outputs=256, 40 | activation_fn=tf.nn.relu) 41 | 42 | # Recurrent network for temporal dependencies 43 | lstm_cell = tf.contrib.rnn.BasicLSTMCell(num_units=256) 44 | c_init = np.zeros((1, lstm_cell.state_size.c), np.float32) 45 | h_init = np.zeros((1, lstm_cell.state_size.h), np.float32) 46 | self.state_init = [c_init, h_init] 47 | 48 | c_in = tf.placeholder(tf.float32, [1, lstm_cell.state_size.c]) 49 | h_in = tf.placeholder(tf.float32, [1, lstm_cell.state_size.h]) 50 | self.state_in = (c_in, h_in) 51 | 52 | rnn_in = tf.expand_dims(hidden, [0]) 53 | step_size = tf.shape(inputs)[:1] 54 | state_in = tf.contrib.rnn.LSTMStateTuple(c_in, h_in) 55 | 56 | lstm_out, lstm_state = tf.nn.dynamic_rnn(cell=lstm_cell, 57 | inputs=rnn_in, 58 | initial_state=state_in, 59 | sequence_length=step_size, 60 | time_major=False) 61 | lstm_c, lstm_h = lstm_state 62 | self.state_out = (lstm_c[:1, :], lstm_h[:1, :]) 63 | rnn_out = tf.reshape(lstm_out, [-1, 256]) 64 | 65 | # output for policy and value estimations 66 | self.policy = slim.fully_connected( 67 | inputs=rnn_out, 68 | num_outputs=self.a_dim, 69 | activation_fn=tf.nn.softmax, 70 | weights_initializer=normalized_columns_initializer(0.01), 71 | biases_initializer=None) 72 | self.value = slim.fully_connected( 73 | inputs=rnn_out, 74 | num_outputs=1, 75 | activation_fn=None, 76 | weights_initializer=normalized_columns_initializer(1.0), 77 | biases_initializer=None) 78 | 79 | def _update_network(self, trainer): 80 | self.actions = tf.placeholder(shape=[None], dtype=tf.int32) 81 | self.actions_onehot = tf.one_hot( 82 | self.actions, self.a_dim, dtype=tf.float32) 83 | self.target_v = tf.placeholder(shape=[None], dtype=tf.float32) 84 | self.advantages = tf.placeholder(shape=[None], dtype=tf.float32) 85 | 86 | self.outputs = tf.reduce_sum( 87 | self.policy * self.actions_onehot, [1]) 88 | 89 | # loss 90 | self.value_loss = 0.5 * tf.reduce_sum(tf.square( 91 | self.target_v - tf.reshape(self.value, [-1]))) 92 | # higher entropy -> lower loss -> 
encourage exploration 93 | self.entropy = -tf.reduce_sum(self.policy * tf.log(self.policy)) 94 | 95 | self.policy_loss = -tf.reduce_sum( 96 | tf.log(self.outputs) * self.advantages) 97 | 98 | self.loss = 0.5 * self.value_loss \ 99 | + self.policy_loss - 0.01 * self.entropy 100 | 101 | # local gradients 102 | local_vars = tf.get_collection( 103 | tf.GraphKeys.TRAINABLE_VARIABLES, self.scope) 104 | self.gradients = tf.gradients(self.loss, local_vars) 105 | self.var_norms = tf.global_norm(local_vars) 106 | 107 | # grads[i] * clip_norm / max(global_norm, clip_norm) 108 | grads, self.grad_norms = tf.clip_by_global_norm(self.gradients, 40.0) 109 | 110 | # apply gradients to global network 111 | global_vars = tf.get_collection( 112 | tf.GraphKeys.TRAINABLE_VARIABLES, 'global') 113 | self.apply_grads = trainer.apply_gradients(zip(grads, global_vars)) 114 | -------------------------------------------------------------------------------- /algorithms/A3C/doom/train_A3C.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import multiprocessing 3 | import os 4 | import threading 5 | import time 6 | 7 | import tensorflow as tf 8 | 9 | from env_doom import Doom 10 | from net import Net 11 | from utils import print_net_params_number 12 | from worker import Worker 13 | 14 | 15 | def main(args): 16 | if args.save_path is not None and not os.path.exists(args.save_path): 17 | os.makedirs(args.save_path) 18 | 19 | tf.reset_default_graph() 20 | 21 | global_ep = tf.Variable( 22 | 0, dtype=tf.int32, name='global_ep', trainable=False) 23 | 24 | env = Doom(visiable=False) 25 | Net(env.state_dim, env.action_dim, 'global', None) 26 | num_workers = args.parallel 27 | workers = [] 28 | 29 | # create workers 30 | for i in range(num_workers): 31 | w = Worker(i, Doom(), global_ep, args) 32 | workers.append(w) 33 | 34 | print('%d workers in total.\n' % num_workers) 35 | saver = tf.train.Saver(max_to_keep=3) 36 | 37 | with tf.Session() as sess: 38 | coord = tf.train.Coordinator() 39 | if args.model_path is not None: 40 | print('Loading model...') 41 | ckpt = tf.train.get_checkpoint_state(args.model_path) 42 | saver.restore(sess, ckpt.model_checkpoint_path) 43 | else: 44 | print('Initializing a new model...') 45 | sess.run(tf.global_variables_initializer()) 46 | print_net_params_number() 47 | 48 | # Start work process for each worker in a separated thread 49 | worker_threads = [] 50 | for w in workers: 51 | run_fn = lambda: w.run(sess, coord, saver) 52 | t = threading.Thread(target=run_fn) 53 | t.start() 54 | time.sleep(0.5) 55 | worker_threads.append(t) 56 | coord.join(worker_threads) 57 | 58 | 59 | if __name__ == '__main__': 60 | # ignore warnings by tensorflow 61 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2' 62 | 63 | parser = argparse.ArgumentParser() 64 | parser.add_argument( 65 | '--model_path', default=None, 66 | help='Whether to use a saved model. 
(*None|model path)') 67 | parser.add_argument( 68 | '--save_path', default='/tmp/a3c_doom/model/', 69 | help='Path to save a model during training.') 70 | parser.add_argument( 71 | '--save_every', default=50, help='Interval of saving model') 72 | parser.add_argument( 73 | '--max_ep_len', default=300, help='Max episode steps') 74 | parser.add_argument( 75 | '--max_ep', default=3000, help='Max training episode') 76 | parser.add_argument( 77 | '--parallel', default=multiprocessing.cpu_count(), 78 | help='Number of parallel threads') 79 | main(parser.parse_args()) 80 | -------------------------------------------------------------------------------- /algorithms/A3C/doom/utils.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import scipy.signal 3 | import tensorflow as tf 4 | 5 | 6 | def preprocess(frame): 7 | s = frame[10: -10, 30: -30] 8 | s = scipy.misc.imresize(s, [84, 84]) 9 | s = np.reshape(s, [np.prod(s.shape)]) / 255.0 10 | return s 11 | 12 | 13 | def discount(x, gamma): 14 | return scipy.signal.lfilter([1], [1, -gamma], x[::-1], axis=0)[::-1] 15 | 16 | 17 | def normalized_columns_initializer(std=1.0): 18 | 19 | def _initializer(shape, dtype=None, partition_info=None): 20 | out = np.random.randn(*shape).astype(np.float32) 21 | out *= std / np.sqrt(np.square(out).sum(axis=0, keepdims=True)) 22 | return tf.constant(out) 23 | 24 | return _initializer 25 | 26 | 27 | def print_net_params_number(): 28 | total_parameters = 0 29 | for v in tf.trainable_variables(): 30 | shape = v.get_shape() 31 | param_num = 1 32 | for d in shape: 33 | param_num *= d.value 34 | print(v.name, ' ', shape, ' param nums: ', param_num) 35 | total_parameters += param_num 36 | print('\nTotal nums of parameters: %d\n' % total_parameters) 37 | -------------------------------------------------------------------------------- /algorithms/A3C/doom/worker.py: -------------------------------------------------------------------------------- 1 | from net import Net 2 | from utils import * 3 | from vizdoom import * 4 | 5 | 6 | class Worker(object): 7 | 8 | def __init__(self, worker_id, env, global_ep, args): 9 | self.name = 'worker_' + str(worker_id) 10 | self.env = env 11 | self.global_ep = global_ep 12 | self.args = args 13 | self.learning_rate = 1e-4 14 | self.gamma = 0.99 15 | self.trainer = tf.train.AdamOptimizer(self.learning_rate) 16 | 17 | # create local copy of AC network 18 | self.local_net = Net(self.env.state_dim, 19 | self.env.action_dim, 20 | scope=self.name, 21 | trainer=self.trainer) 22 | 23 | self.update_local_op = self._update_local_params() 24 | 25 | def run(self, sess, coord, saver): 26 | running_reward = None 27 | ep_count = sess.run(self.global_ep) 28 | print('Starting ' + self.name) 29 | 30 | with sess.as_default(), sess.graph.as_default(): 31 | 32 | while not coord.should_stop(): 33 | sess.run(self.update_local_op) 34 | rollout = [] 35 | ep_reward = 0 36 | ep_step_count = 0 37 | 38 | s = self.env.reset() 39 | self.ep_frames = [] 40 | self.ep_frames.append(s) 41 | s = preprocess(s) 42 | rnn_state = self.local_net.state_init 43 | 44 | while True: 45 | p, v, rnn_state = sess.run([ 46 | self.local_net.policy, 47 | self.local_net.value, 48 | self.local_net.state_out 49 | ], { 50 | self.local_net.inputs: [s], 51 | self.local_net.state_in[0]: rnn_state[0], 52 | self.local_net.state_in[1]: rnn_state[1] 53 | }) 54 | # sample action from the policy distribution p 55 | a = np.random.choice(np.arange(self.env.action_dim), p=p[0]) 56 | 57 | s1, r, d = 
self.env.step(a) 58 | self.ep_frames.append(s1) 59 | r /= 100.0 # scale rewards 60 | s1 = preprocess(s1) 61 | 62 | rollout.append([s, a, r, s1, d, v[0][0]]) 63 | ep_reward += r 64 | s = s1 65 | ep_step_count += 1 66 | 67 | # Update if the buffer is full (size=30) 68 | if not d and len(rollout) == 30 \ 69 | and ep_step_count != self.args.max_ep_len - 1: 70 | v1 = sess.run(self.local_net.value, { 71 | self.local_net.inputs: [s], 72 | self.local_net.state_in[0]: rnn_state[0], 73 | self.local_net.state_in[1]: rnn_state[1] 74 | })[0][0] 75 | v_l, p_l, e_l, g_n, v_n = self._train(rollout, sess, v1) 76 | rollout = [] 77 | 78 | sess.run(self.update_local_op) 79 | if d: 80 | break 81 | 82 | # update network at the end of the episode 83 | if len(rollout) != 0: 84 | v_l, p_l, e_l, g_n, v_n = self._train(rollout, sess, 0.0) 85 | 86 | # episode end 87 | if running_reward: 88 | running_reward = running_reward * 0.99 + ep_reward * 0.01 89 | else: 90 | running_reward = ep_reward 91 | 92 | if ep_count % 10 == 0: 93 | print('%s ep:%d step:%d reward:%.3f' % 94 | (self.name, ep_count, ep_step_count, running_reward)) 95 | 96 | if self.name == 'worker_0': 97 | # update global ep 98 | _, global_ep = sess.run([ 99 | self.global_ep.assign_add(1), 100 | self.global_ep 101 | ]) 102 | # end condition 103 | if global_ep == self.args.max_ep: 104 | # this op will stop all threads 105 | coord.request_stop() 106 | # save model and make gif 107 | if global_ep != 0 and global_ep % self.args.save_every == 0: 108 | saver.save( 109 | sess, self.args.save_path+str(global_ep)+'.cptk') 110 | ep_count += 1 # update local ep 111 | 112 | def _train(self, rollout, sess, bootstrap_value): 113 | rollout = np.array(rollout) 114 | observs, actions, rewards, next_observs, dones, values = rollout.T 115 | 116 | # compute advantages and discounted reward using rewards and value 117 | self.rewards_plus = np.asarray(rewards.tolist() + [bootstrap_value]) 118 | discounted_rewards = discount(self.rewards_plus, self.gamma)[:-1] 119 | 120 | self.value_plus = np.asarray(values.tolist() + [bootstrap_value]) 121 | advantages = rewards + self.gamma * self.value_plus[1:] - \ 122 | self.value_plus[:-1] 123 | advantages = discount(advantages, self.gamma) 124 | 125 | # update glocal network using gradients from loss 126 | rnn_state = self.local_net.state_init 127 | v_l, p_l, e_l, g_n, v_n, _ = sess.run([ 128 | self.local_net.value_loss, 129 | self.local_net.policy_loss, 130 | self.local_net.entropy, 131 | self.local_net.grad_norms, 132 | self.local_net.var_norms, 133 | self.local_net.apply_grads 134 | ], { 135 | self.local_net.target_v: discounted_rewards, # for value net 136 | self.local_net.inputs: np.vstack(observs), 137 | self.local_net.actions: actions, 138 | self.local_net.advantages: advantages, # for policy net 139 | self.local_net.state_in[0]: rnn_state[0], 140 | self.local_net.state_in[1]: rnn_state[1] 141 | }) 142 | return v_l/len(rollout), p_l/len(rollout), e_l/len(rollout), g_n, v_n 143 | 144 | def _update_local_params(self): 145 | global_vars = tf.get_collection( 146 | tf.GraphKeys.TRAINABLE_VARIABLES, 'global') 147 | local_vars = tf.get_collection( 148 | tf.GraphKeys.TRAINABLE_VARIABLES, self.name) 149 | update_op = [] 150 | for global_var, local_var in zip(global_vars, local_vars): 151 | update_op.append(local_var.assign(global_var)) 152 | 153 | return update_op 154 | -------------------------------------------------------------------------------- /algorithms/Actor-Critic/README.md: 
-------------------------------------------------------------------------------- 1 | ## Actor-Critic 2 | Actor-Critic belongs to the family of Policy Gradient methods, which directly parameterize the policy rather than a state-value function. 3 | For more details about Actor-Critic and other policy gradient algorithms, refer to Chapter 13 of [Reinforcement Learning: An Introduction 2nd Edition](http://webdocs.cs.ualberta.ca/~sutton/book/the-book.html) 4 | 5 | Here we use Actor-Critic methods to solve the game of Pong. 6 | 7 | ## Pong 8 | Pong is an Atari game in which the player controls one of the paddles (the other is controlled by a built-in AI) and must bounce the ball past the opponent. In the reinforcement learning setting, the state is the raw pixels and the action is moving the paddle UP or DOWN. 9 | 10 | pong 11 | 12 | 13 | ## Requirements 14 | * [Numpy](http://www.numpy.org/) 15 | * [Tensorflow](http://www.tensorflow.org) 16 | * [gym](https://gym.openai.com) 17 | 18 | ## Run 19 | python train_actor_critic.py 20 | -------------------------------------------------------------------------------- /algorithms/Actor-Critic/agent.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import tensorflow as tf 3 | 4 | 5 | class ActorCritic: 6 | 7 | def __init__(self, input_dim, hidden_units, action_dim): 8 | self.input_dim = input_dim 9 | self.hidden_units = hidden_units 10 | self.action_dim = action_dim 11 | self.gamma = 0.99 12 | self.discount_factor = 0.99 13 | self.max_gradient = 5 14 | # counter 15 | self.ep_count = 0 16 | # buffer init 17 | self.buffer_reset() 18 | 19 | self.batch_size = 32 20 | 21 | @staticmethod 22 | def get_session(device): 23 | if device == -1: # use CPU 24 | device = '/cpu:0' 25 | sess_config = tf.ConfigProto() 26 | else: # use GPU 27 | device = '/gpu:' + str(device) 28 | sess_config = tf.ConfigProto( 29 | log_device_placement=True, 30 | allow_soft_placement=True) 31 | sess_config.gpu_options.allow_growth = True 32 | sess = tf.Session(config=sess_config) 33 | return sess, device 34 | 35 | def construct_model(self, gpu): 36 | self.sess, device = self.get_session(gpu) 37 | 38 | with tf.device(device): 39 | with tf.name_scope('model_inputs'): 40 | self.input_state = tf.placeholder( 41 | tf.float32, [None, self.input_dim], name='input_state') 42 | with tf.variable_scope('actor_network'): 43 | self.logp = self.actor_network(self.input_state) 44 | with tf.variable_scope('critic_network'): 45 | self.state_value = self.critic_network(self.input_state) 46 | 47 | # get network parameters 48 | actor_params = tf.get_collection( 49 | tf.GraphKeys.TRAINABLE_VARIABLES, scope='actor_network') 50 | critic_params = tf.get_collection( 51 | tf.GraphKeys.TRAINABLE_VARIABLES, scope='critic_network') 52 | 53 | self.taken_action = tf.placeholder(tf.int32, [None, ]) 54 | self.discounted_rewards = tf.placeholder(tf.float32, [None, 1]) 55 | 56 | # optimizer 57 | self.optimizer = tf.train.RMSPropOptimizer(learning_rate=1e-4) 58 | # actor loss 59 | self.actor_loss = tf.nn.sparse_softmax_cross_entropy_with_logits( 60 | logits=self.logp, labels=self.taken_action) 61 | # advantage 62 | self.advantage = (self.discounted_rewards - self.state_value)[:, 0] 63 | # actor gradient 64 | actor_gradients = tf.gradients( 65 | self.actor_loss, actor_params, self.advantage) 66 | self.actor_gradients = list(zip(actor_gradients, actor_params)) 67 | 68 | # critic loss 69 | self.critic_loss = tf.reduce_mean( 70 | tf.square(self.discounted_rewards - self.state_value)) 71 | # 
critic gradient 72 | self.critic_gradients = self.optimizer.compute_gradients( 73 | self.critic_loss, critic_params) 74 | self.gradients = self.actor_gradients + self.critic_gradients 75 | 76 | # clip gradient 77 | for i, (grad, var) in enumerate(self.gradients): 78 | if grad is not None: 79 | self.gradients[i] = (tf.clip_by_value( 80 | grad, -self.max_gradient, self.max_gradient), var) 81 | 82 | with tf.name_scope('train_actor_critic'): 83 | # train operation 84 | self.train_op = self.optimizer.apply_gradients(self.gradients) 85 | 86 | def sample_action(self, state): 87 | 88 | def softmax(x): 89 | max_x = np.amax(x) 90 | e = np.exp(x - max_x) 91 | return e / np.sum(e) 92 | 93 | logp = self.sess.run(self.logp, {self.input_state: state})[0] 94 | prob = softmax(logp) - 1e-5 95 | return np.argmax(np.random.multinomial(1, prob)) 96 | 97 | def update_model(self): 98 | state_buffer = np.array(self.state_buffer) 99 | action_buffer = np.array(self.action_buffer) 100 | discounted_rewards_buffer = np.vstack(self.reward_discount()) 101 | 102 | ep_steps = len(action_buffer) 103 | shuffle_index = np.arange(ep_steps) 104 | np.random.shuffle(shuffle_index) 105 | 106 | for i in range(0, ep_steps, self.batch_size): 107 | if self.batch_size <= ep_steps: 108 | end_index = i + self.batch_size 109 | else: 110 | end_index = ep_steps 111 | batch_index = shuffle_index[i:end_index] 112 | 113 | # get batch 114 | input_state = state_buffer[batch_index] 115 | taken_action = action_buffer[batch_index] 116 | discounted_rewards = discounted_rewards_buffer[batch_index] 117 | 118 | # train! 119 | self.sess.run(self.train_op, feed_dict={ 120 | self.input_state: input_state, 121 | self.taken_action: taken_action, 122 | self.discounted_rewards: discounted_rewards}) 123 | 124 | # clean up job 125 | self.buffer_reset() 126 | 127 | self.ep_count += 1 128 | 129 | def store_rollout(self, state, action, reward, next_state, done): 130 | self.action_buffer.append(action) 131 | self.reward_buffer.append(reward) 132 | self.state_buffer.append(state) 133 | self.next_state_buffer.append(next_state) 134 | self.done_buffer.append(done) 135 | 136 | def buffer_reset(self): 137 | self.state_buffer = [] 138 | self.reward_buffer = [] 139 | self.action_buffer = [] 140 | self.next_state_buffer = [] 141 | self.done_buffer = [] 142 | 143 | def reward_discount(self): 144 | r = self.reward_buffer 145 | d_r = np.zeros_like(r) 146 | running_add = 0 147 | for t in range(len(r))[::-1]: 148 | if r[t] != 0: 149 | running_add = 0 # game boundary. 
reset the running add 150 | running_add = r[t] + running_add * self.discount_factor 151 | d_r[t] += running_add 152 | # standardize the rewards 153 | d_r -= np.mean(d_r) 154 | d_r /= np.std(d_r) 155 | return d_r 156 | 157 | def actor_network(self, input_state): 158 | w1 = tf.Variable(tf.div(tf.random_normal( 159 | [self.input_dim, self.hidden_units]), 160 | np.sqrt(self.input_dim)), name='w1') 161 | b1 = tf.Variable( 162 | tf.constant(0.0, shape=[self.hidden_units]), name='b1') 163 | h1 = tf.nn.relu(tf.matmul(input_state, w1) + b1) 164 | w2 = tf.Variable(tf.div(tf.random_normal( 165 | [self.hidden_units, self.action_dim]), 166 | np.sqrt(self.hidden_units)), name='w2') 167 | b2 = tf.Variable(tf.constant(0.0, shape=[self.action_dim]), name='b2') 168 | logp = tf.matmul(h1, w2) + b2 169 | return logp 170 | 171 | def critic_network(self, input_state): 172 | w1 = tf.Variable(tf.div(tf.random_normal( 173 | [self.input_dim, self.hidden_units]), 174 | np.sqrt(self.input_dim)), name='w1') 175 | b1 = tf.Variable( 176 | tf.constant(0.0, shape=[self.hidden_units]), name='b1') 177 | h1 = tf.nn.relu(tf.matmul(input_state, w1) + b1) 178 | w2 = tf.Variable(tf.div(tf.random_normal( 179 | [self.hidden_units, 1]), np.sqrt(self.hidden_units)), name='w2') 180 | b2 = tf.Variable(tf.constant(0.0, shape=[1]), name='b2') 181 | state_value = tf.matmul(h1, w2) + b2 182 | return state_value 183 | -------------------------------------------------------------------------------- /algorithms/Actor-Critic/evaluate.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | 3 | import gym 4 | import numpy as np 5 | import tensorflow as tf 6 | 7 | from .agent import ActorCritic 8 | from .utils import preprocess 9 | 10 | 11 | def main(args): 12 | INPUT_DIM = 80 * 80 13 | HIDDEN_UNITS = 200 14 | ACTION_DIM = 6 15 | 16 | # load agent 17 | agent = ActorCritic(INPUT_DIM, HIDDEN_UNITS, ACTION_DIM) 18 | agent.construct_model(args.gpu) 19 | 20 | # load model or init a new 21 | saver = tf.train.Saver(max_to_keep=1) 22 | if args.model_path is not None: 23 | # reuse saved model 24 | saver.restore(agent.sess, args.model_path) 25 | else: 26 | # build a new model 27 | agent.sess.run(tf.global_variables_initializer()) 28 | 29 | # load env 30 | env = gym.make('Pong-v0') 31 | 32 | # training loop 33 | for ep in range(args.ep): 34 | # reset env 35 | total_rewards = 0 36 | state = env.reset() 37 | 38 | while True: 39 | env.render() 40 | # preprocess 41 | state = preprocess(state) 42 | # sample actions 43 | action = agent.sample_action(state[np.newaxis, :]) 44 | # act! 45 | next_state, reward, done, _ = env.step(action) 46 | total_rewards += reward 47 | # state shift 48 | state = next_state 49 | if done: 50 | break 51 | 52 | print('Ep%s Reward: %s ' % (ep+1, total_rewards)) 53 | 54 | 55 | def args_parse(): 56 | parser = argparse.ArgumentParser() 57 | parser.add_argument( 58 | '--model_path', default=None, 59 | help='Whether to use a saved model. 
(*None|model path)') 60 | parser.add_argument( 61 | '--gpu', default=-1, 62 | help='running on a specify gpu, -1 indicates using cpu') 63 | parser.add_argument('--ep', default=1, help='Test episodes') 64 | return parser.parse_args() 65 | 66 | 67 | if __name__ == '__main__': 68 | main(args_parse()) 69 | -------------------------------------------------------------------------------- /algorithms/Actor-Critic/train_actor_critic.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import os 3 | 4 | import gym 5 | import numpy as np 6 | import tensorflow as tf 7 | 8 | from agent import ActorCritic 9 | from utils import preprocess 10 | 11 | 12 | def main(args): 13 | INPUT_DIM = 80 * 80 14 | HIDDEN_UNITS = 200 15 | ACTION_DIM = 6 16 | MAX_EPISODES = 10000 17 | 18 | # load agent 19 | agent = ActorCritic(INPUT_DIM, HIDDEN_UNITS, ACTION_DIM) 20 | agent.construct_model(args.gpu) 21 | 22 | # load model or init a new 23 | saver = tf.train.Saver(max_to_keep=1) 24 | if args.model_path is not None: 25 | # reuse saved model 26 | saver.restore(agent.sess, args.model_path) 27 | ep_base = int(args.model_path.split('_')[-1]) 28 | mean_rewards = float(args.model_path.split('/')[-1].split('_')[0]) 29 | else: 30 | # build a new model 31 | agent.sess.run(tf.global_variables_initializer()) 32 | ep_base = 0 33 | mean_rewards = 0.0 34 | 35 | # load env 36 | env = gym.make('Pong-v0') 37 | 38 | # training loop 39 | for ep in range(MAX_EPISODES): 40 | step = 0 41 | total_rewards = 0 42 | state = preprocess(env.reset()) 43 | 44 | while True: 45 | # sample actions 46 | action = agent.sample_action(state[np.newaxis, :]) 47 | # act! 48 | next_state, reward, done, _ = env.step(action) 49 | 50 | next_state = preprocess(next_state) 51 | 52 | step += 1 53 | total_rewards += reward 54 | 55 | agent.store_rollout(state, action, reward, next_state, done) 56 | # state shift 57 | state = next_state 58 | 59 | if done: 60 | break 61 | 62 | mean_rewards = 0.99 * mean_rewards + 0.01 * total_rewards 63 | rounds = (21 - np.abs(total_rewards)) + 21 64 | average_steps = (step + 1) / rounds 65 | print('Ep%s: %d rounds' % (ep_base + ep + 1, rounds)) 66 | print('Average_steps: %.2f Reward: %s Average_reward: %.4f' % 67 | (average_steps, total_rewards, mean_rewards)) 68 | 69 | # update model per episode 70 | agent.update_model() 71 | 72 | # model saving 73 | if ep > 0 and ep % args.save_every == 0: 74 | if not os.path.isdir(args.save_path): 75 | os.makedirs(args.save_path) 76 | save_name = str(round(mean_rewards, 2)) + '_' + str(ep_base + ep+1) 77 | saver.save(agent.sess, args.save_path + save_name) 78 | 79 | 80 | if __name__ == '__main__': 81 | parser = argparse.ArgumentParser() 82 | parser.add_argument( 83 | '--model_path', default=None, 84 | help='Whether to use a saved model. 
(*None|model path)') 85 | parser.add_argument( 86 | '--save_path', default='./model/', 87 | help='Path to save a model during training.') 88 | parser.add_argument( 89 | '--save_every', default=100, help='Save model every x episodes') 90 | parser.add_argument( 91 | '--gpu', default=-1, 92 | help='running on a specify gpu, -1 indicates using cpu') 93 | main(parser.parse_args()) 94 | -------------------------------------------------------------------------------- /algorithms/Actor-Critic/utils.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | 4 | def preprocess(obs): 5 | obs = obs[35:195] # 160x160x3 6 | obs = obs[::2, ::2, 0] # down sample (80x80) 7 | obs[obs == 144] = 0 8 | obs[obs == 109] = 0 9 | obs[obs != 0] = 1 10 | return obs.astype(np.float).ravel() 11 | -------------------------------------------------------------------------------- /algorithms/CEM/CEM.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import gym 3 | from gym.spaces import Discrete, Box 4 | 5 | 6 | # policies 7 | class DiscreteAction(object): 8 | 9 | def __init__(self, theta, ob_space, ac_space): 10 | ob_dim = ob_space.shape[0] 11 | ac_dim = ac_space.n 12 | self.W = theta[0: ob_dim * ac_dim].reshape(ob_dim, ac_dim) 13 | self.b = theta[ob_dim * ac_dim:].reshape(1, ac_dim) 14 | 15 | def act(self, ob): 16 | y = np.dot(ob, self.W) + self.b 17 | a = np.argmax(y) 18 | return a 19 | 20 | 21 | class ContinuousAction(object): 22 | 23 | def __init__(self, theta, ob_space, ac_space): 24 | self.ac_space = ac_space 25 | ob_dim = ob_space.shape[0] 26 | ac_dim = ac_space.shape[0] 27 | self.W = theta[0: ob_dim * ac_dim].reshape(ob_dim, ac_dim) 28 | self.b = theta[ob_dim * ac_dim:] 29 | 30 | def act(self, ob): 31 | y = np.dot(ob, self.W) + self.b 32 | a = np.clip(y, self.ac_space.low, self.ac_space.high) 33 | return a 34 | 35 | 36 | def run_episode(policy, env, render=False): 37 | max_steps = 1000 38 | total_rew = 0 39 | ob = env.reset() 40 | for t in range(max_steps): 41 | a = policy.act(ob) 42 | ob, reward, done, _info = env.step(a) 43 | total_rew += reward 44 | if render and t % 3 == 0: 45 | env.render() 46 | if done: 47 | break 48 | return total_rew 49 | 50 | 51 | def make_policy(params): 52 | if isinstance(env.action_space, Discrete): 53 | return DiscreteAction(params, env.observation_space, env.action_space) 54 | elif isinstance(env.action_space, Box): 55 | return ContinuousAction( 56 | params, env.observation_space, env.action_space) 57 | else: 58 | raise NotImplementedError 59 | 60 | 61 | def eval_policy(params): 62 | policy = make_policy(params) 63 | reward = run_episode(policy, env) 64 | return reward 65 | 66 | 67 | env = gym.make('CartPole-v0') 68 | num_iter = 100 69 | batch_size = 25 70 | elite_frac = 0.2 71 | num_elite = int(batch_size * elite_frac) 72 | 73 | if isinstance(env.action_space, Discrete): 74 | dim_params = (env.observation_space.shape[0] + 1) * env.action_space.n 75 | elif isinstance(env.action_space, Box): 76 | dim_params = (env.observation_space.shape[0] + 1) \ 77 | * env.action_space.shape[0] 78 | else: 79 | raise NotImplementedError 80 | 81 | params_mean = np.zeros(dim_params) 82 | params_std = np.ones(dim_params) 83 | 84 | for i in range(num_iter): 85 | # sample parameter vectors (multi-variable gaussian distribution) 86 | sample_params = np.random.multivariate_normal( 87 | params_mean, np.diag(params_std), size=batch_size) 88 | # evaluate sample policies 89 | rewards = 
[eval_policy(params) for params in sample_params] 90 | 91 | # 'elite' policies 92 | elite_idxs = np.argsort(rewards)[batch_size - num_elite: batch_size] 93 | elite_params = [sample_params[i] for i in elite_idxs] 94 | 95 | # move current policy towards elite policies 96 | params_mean = np.mean(np.asarray(elite_params), axis=0) 97 | params_std = np.std(np.asarray(elite_params), axis=0) 98 | 99 | # logging 100 | print('Ep %d: mean score: %8.3f. max score: %4.3f' % 101 | (i, np.mean(rewards), np.max(rewards))) 102 | print('Eval reward: %.4f' % 103 | run_episode(make_policy(params_mean), env, render=True)) 104 | -------------------------------------------------------------------------------- /algorithms/CEM/README.md: -------------------------------------------------------------------------------- 1 | ## Cross-Entropy Method 2 | 3 | The cross-entropy method is a derivative-free policy optimization approach. It simply samples a batch of policies, picks the best ones (the elite policies) and moves the current policy towards those elite policies, ignoring all information other than the **rewards** collected during each episode. 4 | CEM works quite well on tasks with simple policies (small parameter spaces). 5 | 6 | ## Requirements 7 | * [Numpy](http://www.numpy.org/) 8 | * [gym](https://gym.openai.com) 9 | 10 | ## Test environments 11 | * Discrete: CartPole-v0, Acrobot-v1, MountainCar-v0 12 | * Continuous: Pendulum-v0, BipedalWalker-v2 13 | 14 | 15 | ## Reference 16 | [John Schulman MLSS 2016](http://rl-gym-doc.s3-website-us-west-2.amazonaws.com/mlss/lab1.html) 17 | -------------------------------------------------------------------------------- /algorithms/DDPG/README.md: -------------------------------------------------------------------------------- 1 | ### Deep Deterministic Policy Gradients (DDPG) 2 | 3 | Deep Deterministic Policy Gradient is a model-free off-policy actor-critic algorithm which combines DPG ([Deterministic Policy Gradient](http://www.jmlr.org/proceedings/papers/v32/silver14.pdf)) and DQN. 4 | 5 | Related papers: 6 | - [Lillicrap, Timothy P., et al., 2015](https://arxiv.org/pdf/1509.02971.pdf) 7 | 8 | #### Walker2d 9 | 10 | Here we use DDPG to solve Walker2d, a continuous control task based on the [Mujoco](http://www.mujoco.org/) engine. The goal is to make a two-dimensional bipedal robot walk forward as fast as possible.
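Like DQN, DDPG keeps a replay buffer and target networks, but the targets are updated *softly* every step rather than copied periodically, and the critic bootstraps from the target actor's action. Below is a minimal numpy sketch of those two ideas with made-up values (tau and gamma are illustrative); it is only an illustration, not the TensorFlow implementation in agent.py.

```python
import numpy as np

def soft_update(source, target, tau):
    # theta_target <- tau * theta_source + (1 - tau) * theta_target
    return [tau * s + (1.0 - tau) * t for s, t in zip(source, target)]

# toy "parameter tensors" for an actor and its target network
actor_params = [np.ones((3, 2)), np.ones(2)]
target_actor_params = [np.zeros((3, 2)), np.zeros(2)]
target_actor_params = soft_update(actor_params, target_actor_params, tau=0.001)

# critic target for a toy batch: y = r + gamma * (1 - done) * Q'(s', mu'(s'))
rewards = np.array([1.0, 0.5])
done = np.array([0.0, 1.0])            # episode-termination mask
q_target_next = np.array([10.0, 3.0])  # target critic evaluated at the target actor's action
y = rewards + 0.99 * (1.0 - done) * q_target_next
```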
11 | 12 | walker2d 13 | 14 | #### Requirements 15 | 16 | * [Numpy](http://www.numpy.org/) 17 | * [Tensorflow](http://www.tensorflow.org) 18 | * [gym](https://gym.openai.com) 19 | * [Mujoco](https://www.roboti.us/index.html) 20 | 21 | #### Run 22 | 23 | ```bash 24 | python train_ddpg.py 25 | python train_ddpg.py -h # training options and hyper parameters settings 26 | ``` 27 | 28 | #### Results 29 | 30 | walker2d 31 | Results after training for 1.5M steps 32 | 33 | ddpg 34 | Training rewards (smoothed) 35 | -------------------------------------------------------------------------------- /algorithms/DDPG/agent.py: -------------------------------------------------------------------------------- 1 | import random 2 | from collections import deque 3 | 4 | import numpy as np 5 | import tensorflow as tf 6 | import tensorflow.contrib.layers as tcl 7 | 8 | from ou_noise import OUNoise 9 | 10 | 11 | class DDPG: 12 | 13 | def __init__(self, env, args): 14 | self.action_dim = env.action_space.shape[0] 15 | self.state_dim = env.observation_space.shape[0] 16 | 17 | self.actor_lr = args.a_lr 18 | self.critic_lr = args.c_lr 19 | 20 | self.gamma = args.gamma 21 | 22 | # Ornstein-Uhlenbeck noise parameters 23 | self.ou = OUNoise( 24 | self.action_dim, theta=args.noise_theta, sigma=args.noise_sigma) 25 | 26 | self.replay_buffer = deque(maxlen=args.buffer_size) 27 | self.replay_start_size = args.replay_start_size 28 | 29 | self.batch_size = args.batch_size 30 | 31 | self.target_update_rate = args.target_update_rate 32 | self.total_parameters = 0 33 | self.global_steps = 0 34 | self.reg_param = args.reg_param 35 | 36 | def construct_model(self, gpu): 37 | if gpu == -1: # use CPU 38 | device = '/cpu:0' 39 | sess_config = tf.ConfigProto() 40 | else: # use GPU 41 | device = '/gpu:' + str(gpu) 42 | sess_config = tf.ConfigProto( 43 | log_device_placement=True, allow_soft_placement=True) 44 | sess_config.gpu_options.allow_growth = True 45 | 46 | self.sess = tf.Session(config=sess_config) 47 | 48 | with tf.device(device): 49 | # output action, q_value and gradients of q_val w.r.t. 
action 50 | with tf.name_scope('predict_actions'): 51 | self.states = tf.placeholder( 52 | tf.float32, [None, self.state_dim], name='states') 53 | self.action = tf.placeholder( 54 | tf.float32, [None, self.action_dim], name='action') 55 | self.is_training = tf.placeholder(tf.bool, name='is_training') 56 | 57 | self.action_outputs, actor_params = self._build_actor( 58 | self.states, scope='actor_net', bn=True) 59 | value_outputs, critic_params = self._build_critic( 60 | self.states, self.action, scope='critic_net', bn=False) 61 | self.action_gradients = tf.gradients( 62 | value_outputs, self.action)[0] 63 | 64 | # estimate target_q for update critic 65 | with tf.name_scope('estimate_target_q'): 66 | self.next_states = tf.placeholder( 67 | tf.float32, [None, self.state_dim], name='next_states') 68 | self.mask = tf.placeholder(tf.float32, [None], name='mask') 69 | self.rewards = tf.placeholder(tf.float32, [None], name='rewards') 70 | 71 | # target actor network 72 | t_action_outputs, t_actor_params = self._build_actor( 73 | self.next_states, scope='t_actor_net', bn=True, 74 | trainable=False) 75 | # target critic network 76 | t_value_outputs, t_critic_params = self._build_critic( 77 | self.next_states, t_action_outputs, bn=False, 78 | scope='t_critic_net', trainable=False) 79 | 80 | target_q = self.rewards + self.gamma * \ 81 | (t_value_outputs[:, 0] * self.mask) 82 | 83 | with tf.name_scope('compute_gradients'): 84 | actor_opt = tf.train.AdamOptimizer(self.actor_lr) 85 | critic_opt = tf.train.AdamOptimizer(self.critic_lr) 86 | 87 | # critic gradients 88 | td_error = target_q - value_outputs[:, 0] 89 | critic_mse = tf.reduce_mean(tf.square(td_error)) 90 | critic_reg = tf.reduce_sum( 91 | [tf.nn.l2_loss(v) for v in critic_params]) 92 | critic_loss = critic_mse + self.reg_param * critic_reg 93 | critic_gradients = critic_opt.compute_gradients( 94 | critic_loss, critic_params) 95 | # actor gradients 96 | self.q_action_grads = tf.placeholder( 97 | tf.float32, [None, self.action_dim], name='q_action_grads') 98 | actor_gradients = tf.gradients( 99 | self.action_outputs, actor_params, -self.q_action_grads) 100 | actor_gradients = zip(actor_gradients, actor_params) 101 | # apply gradient to update model 102 | self.train_actor = actor_opt.apply_gradients(actor_gradients) 103 | self.train_critic = critic_opt.apply_gradients( 104 | critic_gradients) 105 | 106 | with tf.name_scope('update_target_networks'): 107 | # batch norm parameters should not be included when updating! 108 | target_networks_update = [] 109 | 110 | for v_source, v_target in zip(actor_params, t_actor_params): 111 | update_op = v_target.assign_sub( 112 | 0.001 * (v_target - v_source)) 113 | target_networks_update.append(update_op) 114 | 115 | for v_source, v_target in zip(critic_params, t_critic_params): 116 | update_op = v_target.assign_sub( 117 | 0.01 * (v_target - v_source)) 118 | target_networks_update.append(update_op) 119 | 120 | self.target_networks_update = tf.group(*target_networks_update) 121 | 122 | with tf.name_scope('total_numbers_of_parameters'): 123 | for v in tf.trainable_variables(): 124 | shape = v.get_shape() 125 | param_num = 1 126 | for d in shape: 127 | param_num *= d.value 128 | print(v.name, ' ', shape, ' param nums: ', param_num) 129 | self.total_parameters += param_num 130 | print('Total nums of parameters: ', self.total_parameters) 131 | 132 | def sample_action(self, states, noise): 133 | # is_training suppose to be False when sampling action. 
134 | action = self.sess.run( 135 | self.action_outputs, 136 | feed_dict={self.states: states, self.is_training: False}) 137 | ou_noise = self.ou.noise() if noise else 0 138 | 139 | return action + ou_noise 140 | 141 | def store_experience(self, s, a, r, next_s, done): 142 | self.replay_buffer.append([s, a[0], r, next_s, done]) 143 | self.global_steps += 1 144 | 145 | def update_model(self): 146 | 147 | if len(self.replay_buffer) < self.replay_start_size: 148 | return 149 | 150 | # get batch 151 | batch = random.sample(self.replay_buffer, self.batch_size) 152 | s, _a, r, next_s, done = np.vstack(batch).T.tolist() 153 | mask = ~np.array(done) 154 | 155 | # compute a = u(s) 156 | a = self.sess.run(self.action_outputs, { 157 | self.states: s, 158 | self.is_training: True 159 | }) 160 | # gradients of q_value w.r.t action a 161 | dq_da = self.sess.run(self.action_gradients, { 162 | self.states: s, 163 | self.action: a, 164 | self.is_training: True 165 | }) 166 | # train 167 | self.sess.run([self.train_actor, self.train_critic], { 168 | # train_actor feed 169 | self.states: s, 170 | self.is_training: True, 171 | self.q_action_grads: dq_da, 172 | # train_critic feed 173 | self.next_states: next_s, 174 | self.action: _a, 175 | self.mask: mask, 176 | self.rewards: r 177 | }) 178 | # update target network 179 | self.sess.run(self.target_networks_update) 180 | 181 | def _build_actor(self, states, scope, bn=False, trainable=True): 182 | h1_dim = 400 183 | h2_dim = 300 184 | init = tf.contrib.layers.variance_scaling_initializer( 185 | factor=1.0, mode='FAN_IN', uniform=True) 186 | 187 | with tf.variable_scope(scope): 188 | if bn: 189 | states = self.batch_norm( 190 | states, self.is_training, tf.identity, 191 | scope='actor_bn_states', trainable=trainable) 192 | h1 = tcl.fully_connected( 193 | states, h1_dim, activation_fn=None, weights_initializer=init, 194 | biases_initializer=init, trainable=trainable, scope='actor_h1') 195 | 196 | if bn: 197 | h1 = self.batch_norm( 198 | h1, self.is_training, tf.nn.relu, scope='actor_bn_h1', 199 | trainable=trainable) 200 | else: 201 | h1 = tf.nn.relu(h1) 202 | 203 | h2 = tcl.fully_connected( 204 | h1, h2_dim, activation_fn=None, weights_initializer=init, 205 | biases_initializer=init, trainable=trainable, scope='actor_h2') 206 | if bn: 207 | h2 = self.batch_norm( 208 | h2, self.is_training, tf.nn.relu, scope='actor_bn_h2', 209 | trainable=trainable) 210 | else: 211 | h2 = tf.nn.relu(h2) 212 | 213 | # use tanh to bound the action 214 | a = tcl.fully_connected( 215 | h2, self.action_dim, activation_fn=tf.nn.tanh, 216 | weights_initializer=tf.random_uniform_initializer(-3e-3, 3e-3), 217 | biases_initializer=tf.random_uniform_initializer(-3e-4, 3e-4), 218 | trainable=trainable, scope='actor_out') 219 | 220 | params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope) 221 | 222 | return a, params 223 | 224 | def _build_critic(self, states, action, scope, bn=False, trainable=True): 225 | h1_dim = 400 226 | h2_dim = 300 227 | init = tf.contrib.layers.variance_scaling_initializer( 228 | factor=1.0, mode='FAN_IN', uniform=True) 229 | with tf.variable_scope(scope): 230 | if bn: 231 | states = self.batch_norm( 232 | states, self.is_training, tf.identity, 233 | scope='critic_bn_state', trainable=trainable) 234 | h1 = tcl.fully_connected( 235 | states, h1_dim, activation_fn=None, weights_initializer=init, 236 | biases_initializer=init, trainable=trainable, scope='critic_h1') 237 | if bn: 238 | h1 = self.batch_norm( 239 | h1, self.is_training, tf.nn.relu, 
scope='critic_bn_h1', 240 | trainable=trainable) 241 | else: 242 | h1 = tf.nn.relu(h1) 243 | 244 | # skip action from the first layer 245 | h1 = tf.concat([h1, action], 1) 246 | 247 | h2 = tcl.fully_connected( 248 | h1, h2_dim, activation_fn=None, weights_initializer=init, 249 | biases_initializer=init, trainable=trainable, 250 | scope='critic_h2') 251 | 252 | if bn: 253 | h2 = self.batch_norm( 254 | h2, self.is_training, tf.nn.relu, scope='critic_bn_h2', 255 | trainable=trainable) 256 | else: 257 | h2 = tf.nn.relu(h2) 258 | 259 | q = tcl.fully_connected( 260 | h2, 1, activation_fn=None, 261 | weights_initializer=tf.random_uniform_initializer(-3e-3, 3e-3), 262 | biases_initializer=tf.random_uniform_initializer(-3e-4, 3e-4), 263 | trainable=trainable, scope='critic_out') 264 | 265 | params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope) 266 | return q, params 267 | 268 | @staticmethod 269 | def batch_norm(x, is_training, activation_fn, scope, trainable=True): 270 | # switch the 'is_training' flag and 'reuse' flag 271 | return tf.cond( 272 | is_training, 273 | lambda: tf.contrib.layers.batch_norm( 274 | x, 275 | activation_fn=activation_fn, 276 | center=True, 277 | scale=True, 278 | updates_collections=None, 279 | is_training=True, 280 | reuse=None, 281 | scope=scope, 282 | decay=0.9, 283 | epsilon=1e-5, 284 | trainable=trainable), 285 | lambda: tf.contrib.layers.batch_norm( 286 | x, 287 | activation_fn=activation_fn, 288 | center=True, 289 | scale=True, 290 | updates_collections=None, 291 | is_training=False, 292 | reuse=True, # to be able to reuse scope must be given 293 | scope=scope, 294 | decay=0.9, 295 | epsilon=1e-5, 296 | trainable=trainable)) 297 | -------------------------------------------------------------------------------- /algorithms/DDPG/evaluate.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | 3 | import gym 4 | import numpy as np 5 | import tensorflow as tf 6 | from gym import wrappers 7 | 8 | from agent import DDPG 9 | 10 | 11 | def main(args): 12 | env = gym.make('Walker2d-v1') 13 | env = wrappers.Monitor(env, './videos/', force=True) 14 | reward_history = [] 15 | 16 | agent = DDPG(env, args) 17 | agent.construct_model(args.gpu) 18 | 19 | saver = tf.train.Saver() 20 | if args.model_path is not None: 21 | # reuse saved model 22 | saver.restore(agent.sess, args.model_path) 23 | ep_base = int(args.model_path.split('_')[-1]) 24 | best_avg_rewards = float(args.model_path.split('/')[-1].split('_')[0]) 25 | else: 26 | raise ValueError('model_path required!') 27 | 28 | for ep in range(args.ep): 29 | # env init 30 | state = env.reset() 31 | ep_rewards = 0 32 | for step in range(env.spec.timestep_limit): 33 | env.render() 34 | action = agent.sample_action(state[np.newaxis, :], noise=False) 35 | # act 36 | next_state, reward, done, _ = env.step(action[0]) 37 | 38 | ep_rewards += reward 39 | agent.store_experience(state, action, reward, next_state, done) 40 | 41 | # shift 42 | state = next_state 43 | if done: 44 | break 45 | reward_history.append(ep_rewards) 46 | print('Ep%d reward:%d' % (ep + 1, ep_rewards)) 47 | 48 | print('Average rewards: ', np.mean(reward_history)) 49 | 50 | 51 | def args_parse(): 52 | parser = argparse.ArgumentParser() 53 | parser.add_argument( 54 | '--model_path', default='./models/', 55 | help='Whether to use a saved model. 
(*None|model path)') 56 | parser.add_argument( 57 | '--save_path', default='./models/', 58 | help='Path to save a model during training.') 59 | parser.add_argument( 60 | '--gpu', type=int, default=-1, 61 | help='running on a specify gpu, -1 indicates using cpu') 62 | parser.add_argument( 63 | '--seed', default=31, type=int, help='random seed') 64 | 65 | parser.add_argument( 66 | '--a_lr', type=float, default=1e-4, help='Actor learning rate') 67 | parser.add_argument( 68 | '--c_lr', type=float, default=1e-3, help='Critic learning rate') 69 | parser.add_argument( 70 | '--batch_size', type=int, default=64, help='Size of training batch') 71 | parser.add_argument( 72 | '--gamma', type=float, default=0.99, help='Discounted factor') 73 | parser.add_argument( 74 | '--target_update_rate', type=float, default=0.001, 75 | help='parameter of soft target update') 76 | parser.add_argument( 77 | '--reg_param', type=float, default=0.01, help='l2 regularization') 78 | parser.add_argument( 79 | '--buffer_size', type=int, default=1000000, help='Size of memory buffer') 80 | parser.add_argument( 81 | '--replay_start_size', type=int, default=1000, 82 | help='Number of steps before learning from replay memory') 83 | parser.add_argument( 84 | '--noise_theta', type=float, default=0.15, 85 | help='Ornstein-Uhlenbeck noise parameters') 86 | parser.add_argument( 87 | '--noise_sigma', type=float, default=0.20, 88 | help='Ornstein-Uhlenbeck noise parameters') 89 | parser.add_argument('--ep', default=10, help='Test episodes') 90 | return parser.parse_args() 91 | 92 | 93 | if __name__ == '__main__': 94 | main(args_parse()) 95 | -------------------------------------------------------------------------------- /algorithms/DDPG/ou_noise.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | 4 | class OUNoise: 5 | '''docstring for OUNoise''' 6 | 7 | def __init__(self, action_dimension, mu=0, theta=0.15, sigma=0.2): 8 | self.action_dimension = action_dimension 9 | self.mu = mu 10 | self.theta = theta 11 | self.sigma = sigma 12 | self.state = np.ones(self.action_dimension) * self.mu 13 | self.reset() 14 | 15 | def reset(self): 16 | self.state = np.ones(self.action_dimension) * self.mu 17 | 18 | def noise(self): 19 | x = self.state 20 | dx = self.theta * (self.mu - x) + self.sigma * np.random.randn(len(x)) 21 | self.state = x + dx 22 | return self.state 23 | -------------------------------------------------------------------------------- /algorithms/DDPG/train_ddpg.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import os 3 | 4 | import gym 5 | import matplotlib.pyplot as plt 6 | import numpy as np 7 | import tensorflow as tf 8 | 9 | from agent import DDPG 10 | 11 | 12 | def main(args): 13 | set_random_seed(args.seed) 14 | env = gym.make('Walker2d-v1') 15 | agent = DDPG(env, args) 16 | agent.construct_model(args.gpu) 17 | 18 | saver = tf.train.Saver(max_to_keep=1) 19 | if args.model_path is not None: 20 | # reuse saved model 21 | saver.restore(agent.sess, args.model_path) 22 | ep_base = int(args.model_path.split('_')[-1]) 23 | best_avg_rewards = float(args.model_path.split('/')[-1].split('_')[0]) 24 | else: 25 | # build a new model 26 | agent.sess.run(tf.global_variables_initializer()) 27 | ep_base = 0 28 | best_avg_rewards = None 29 | 30 | reward_history, step_history = [], [] 31 | train_steps = 0 32 | 33 | for ep in range(args.max_ep): 34 | # env init 35 | state = env.reset() 36 | ep_rewards = 0 37 | for 
step in range(env.spec.timestep_limit): 38 | action = agent.sample_action(state[np.newaxis, :], noise=True) 39 | # act 40 | next_state, reward, done, _ = env.step(action[0]) 41 | train_steps += 1 42 | ep_rewards += reward 43 | 44 | agent.store_experience(state, action, reward, next_state, done) 45 | agent.update_model() 46 | # shift 47 | state = next_state 48 | if done: 49 | print('Ep %d global_steps: %d Reward: %.2f' % 50 | (ep + 1, agent.global_steps, ep_rewards)) 51 | # reset ou noise 52 | agent.ou.reset() 53 | break 54 | step_history.append(train_steps) 55 | if not reward_history: 56 | reward_history.append(ep_rewards) 57 | else: 58 | reward_history.append(reward_history[-1] * 0.99 + ep_rewards + 0.01) 59 | 60 | # Evaluate during training 61 | if ep % args.log_every == 0 and ep > 0: 62 | ep_rewards = 0 63 | for ep_eval in range(args.test_ep): 64 | state = env.reset() 65 | for step_eval in range(env.spec.timestep_limit): 66 | action = agent.sample_action( 67 | state[np.newaxis, :], noise=False) 68 | next_state, reward, done, _ = env.step(action[0]) 69 | ep_rewards += reward 70 | state = next_state 71 | if done: 72 | break 73 | 74 | curr_avg_rewards = ep_rewards / args.test_ep 75 | 76 | # logging 77 | print('\n') 78 | print('Episode: %d' % (ep + 1)) 79 | print('Global steps: %d' % agent.global_steps) 80 | print('Mean reward: %.2f' % curr_avg_rewards) 81 | print('\n') 82 | if not best_avg_rewards or (curr_avg_rewards >= best_avg_rewards): 83 | best_avg_rewards = curr_avg_rewards 84 | if not os.path.isdir(args.save_path): 85 | os.makedirs(args.save_path) 86 | save_name = args.save_path + str(round(best_avg_rewards, 2)) \ 87 | + '_' + str(ep_base + ep + 1) 88 | saver.save(agent.sess, save_name) 89 | print('Model save %s' % save_name) 90 | 91 | plt.plot(step_history, reward_history) 92 | plt.xlabel('steps') 93 | plt.ylabel('running reward') 94 | plt.show() 95 | 96 | 97 | def args_parse(): 98 | parser = argparse.ArgumentParser() 99 | parser.add_argument( 100 | '--model_path', default=None, 101 | help='Whether to use a saved model. 
(*None|model path)') 102 | parser.add_argument( 103 | '--save_path', default='./models/', 104 | help='Path to save a model during training.') 105 | parser.add_argument( 106 | '--log_every', default=100, 107 | help='Interval of logging and possibly model saving') 108 | parser.add_argument( 109 | '--gpu', type=int, default=-1, 110 | help='running on a specified gpu, -1 indicates using cpu') 111 | parser.add_argument( 112 | '--seed', default=31, type=int, help='random seed') 113 | 114 | parser.add_argument( 115 | '--max_ep', type=int, default=10000, help='Number of training episodes') 116 | parser.add_argument( 117 | '--test_ep', type=int, default=10, help='Number of test episodes') 118 | parser.add_argument( 119 | '--a_lr', type=float, default=1e-4, help='Actor learning rate') 120 | parser.add_argument( 121 | '--c_lr', type=float, default=1e-3, help='Critic learning rate') 122 | parser.add_argument( 123 | '--batch_size', type=int, default=64, help='Size of training batch') 124 | parser.add_argument( 125 | '--gamma', type=float, default=0.99, help='Discount factor') 126 | parser.add_argument( 127 | '--target_update_rate', type=float, default=0.001, 128 | help='soft target update rate') 129 | parser.add_argument( 130 | '--reg_param', type=float, default=0.01, help='l2 regularization') 131 | parser.add_argument( 132 | '--buffer_size', type=int, default=1000000, help='Size of memory buffer') 133 | parser.add_argument( 134 | '--replay_start_size', type=int, default=1000, 135 | help='Number of steps before learning from replay memory') 136 | parser.add_argument( 137 | '--noise_theta', type=float, default=0.15, 138 | help='Ornstein-Uhlenbeck noise parameters') 139 | parser.add_argument( 140 | '--noise_sigma', type=float, default=0.20, 141 | help='Ornstein-Uhlenbeck noise parameters') 142 | return parser.parse_args() 143 | 144 | 145 | def set_random_seed(seed): 146 | np.random.seed(seed) 147 | tf.set_random_seed(seed) 148 | 149 | 150 | if __name__ == '__main__': 151 | main(args_parse()) 152 | -------------------------------------------------------------------------------- /algorithms/DQN/README.md: -------------------------------------------------------------------------------- 1 | ## DQN 2 | DQN is a deep reinforcement learning architecture proposed by DeepMind in 2013. They used this architecture to play Atari games and achieved human-level performance. 3 | Here we implement a simple version of DQN to solve the game of CartPole. 4 | 5 | Related papers: 6 | * [Mnih et al., 2013](https://arxiv.org/pdf/1312.5602.pdf) 7 | * [Mnih et al., 2015](http://www.nature.com/nature/journal/v518/n7540/pdf/nature14236.pdf) 8 | 9 | Double DQN is also implemented. Pass the --double_q=True argument to enable Double DQN. Since the CartPole task is easy to solve, Double DQN actually makes very little difference here. 10 | 11 | ## CartPole 12 | In CartPole, a pole is attached by an un-actuated joint to a cart, which moves along a track. The agent controls the cart by moving left or right in order to balance the pole. For more about CartPole, see the [OpenAI wiki](https://github.com/openai/gym/wiki/CartPole-v0) 13 | 14 | CartPole 15 | 16 | We will solve CartPole using the DQN algorithm.
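The only difference between the two variants is how the bootstrap target is built. The numpy sketch below contrasts them on a toy batch of made-up Q-values (gamma and the batch size are illustrative); it is a standalone illustration, not the TensorFlow graph in agent.py.

```python
import numpy as np

np.random.seed(0)
gamma = 0.99
rewards = np.array([0.1, 0.1, -1.0])
done = np.array([0.0, 0.0, 1.0])
q_next_online = np.random.randn(3, 2)   # Q(s', .) from the online network
q_next_target = np.random.randn(3, 2)   # Q(s', .) from the target network

# DQN: the target network both selects and evaluates the next action
dqn_target = rewards + gamma * (1.0 - done) * q_next_target.max(axis=1)

# Double DQN: the online network selects the action, the target network evaluates it
best_a = q_next_online.argmax(axis=1)
ddqn_target = rewards + gamma * (1.0 - done) * q_next_target[np.arange(3), best_a]
```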
17 | 18 | ## Requirements 19 | * [Numpy](http://www.numpy.org/) 20 | * [Tensorflow](http://www.tensorflow.org) 21 | * [gym](https://gym.openai.com) 22 | 23 | 24 | ## Run 25 | python train_DQN.py 26 | python train_DQN.py -h # training options and hyper parameters settings 27 | 28 | 29 | ## Training plot 30 | 31 | DQN 32 | -------------------------------------------------------------------------------- /algorithms/DQN/agent.py: -------------------------------------------------------------------------------- 1 | import random 2 | from collections import deque 3 | 4 | import numpy as np 5 | import tensorflow as tf 6 | 7 | 8 | class DQN: 9 | 10 | def __init__(self, env, args): 11 | # init parameters 12 | self.global_step = 0 13 | self.epsilon = args.init_epsilon 14 | self.state_dim = env.observation_space.shape[0] 15 | self.action_dim = env.action_space.n 16 | 17 | self.gamma = args.gamma 18 | self.learning_rate = args.lr 19 | self.batch_size = args.batch_size 20 | 21 | self.double_q = args.double_q 22 | self.target_network_update_interval = args.target_network_update 23 | 24 | # init replay buffer 25 | self.replay_buffer = deque(maxlen=args.buffer_size) 26 | 27 | def network(self, input_state): 28 | hidden_unit = 100 29 | w1 = tf.Variable(tf.math.divide(tf.random_normal( 30 | [self.state_dim, hidden_unit]), np.sqrt(self.state_dim))) 31 | b1 = tf.Variable(tf.constant(0.0, shape=[hidden_unit])) 32 | hidden = tf.nn.relu(tf.matmul(input_state, w1) + b1) 33 | 34 | w2 = tf.Variable(tf.math.divide(tf.random_normal( 35 | [hidden_unit, self.action_dim]), np.sqrt(hidden_unit))) 36 | b2 = tf.Variable(tf.constant(0.0, shape=[self.action_dim])) 37 | output_Q = tf.matmul(hidden, w2) + b2 38 | return output_Q 39 | 40 | @staticmethod 41 | def get_session(device): 42 | if device == -1: # use CPU 43 | device = '/cpu:0' 44 | sess_config = tf.ConfigProto() 45 | else: # use GPU 46 | device = '/gpu:' + str(gpu) 47 | sess_config = tf.ConfigProto( 48 | log_device_placement=True, 49 | allow_soft_placement=True) 50 | sess_config.gpu_options.allow_growth = True 51 | sess = tf.Session(config=sess_config) 52 | return sess, device 53 | 54 | def construct_model(self, gpu): 55 | self.sess, device = self.get_session(gpu) 56 | 57 | with tf.device(device): 58 | with tf.name_scope('input_state'): 59 | self.input_state = tf.placeholder( 60 | tf.float32, [None, self.state_dim]) 61 | 62 | with tf.name_scope('q_network'): 63 | self.output_Q = self.network(self.input_state) 64 | 65 | with tf.name_scope('optimize'): 66 | self.input_action = tf.placeholder( 67 | tf.float32, [None, self.action_dim]) 68 | self.target_Q = tf.placeholder(tf.float32, [None]) 69 | # Q value of the selceted action 70 | action_Q = tf.reduce_sum(tf.multiply( 71 | self.output_Q, self.input_action), reduction_indices=1) 72 | 73 | self.loss = tf.reduce_mean(tf.square(self.target_Q - action_Q)) 74 | optimizer = tf.train.RMSPropOptimizer(self.learning_rate) 75 | self.train_op = optimizer.minimize(self.loss) 76 | 77 | # target network 78 | with tf.name_scope('target_network'): 79 | self.target_output_Q = self.network(self.input_state) 80 | 81 | q_parameters = tf.get_collection( 82 | tf.GraphKeys.TRAINABLE_VARIABLES, scope='q_network') 83 | target_q_parameters = tf.get_collection( 84 | tf.GraphKeys.TRAINABLE_VARIABLES, scope='target_network') 85 | 86 | with tf.name_scope('update_target_network'): 87 | self.update_target_network = [] 88 | for v_source, v_target in zip( 89 | q_parameters, target_q_parameters): 90 | # update_op = v_target.assign(v_source) 91 | # soft 
target update to stabilize training 92 | update_op = v_target.assign_sub(0.1 * (v_target - v_source)) 93 | self.update_target_network.append(update_op) 94 | # group all update together 95 | self.update_target_network = tf.group( 96 | *self.update_target_network) 97 | 98 | def sample_action(self, state, policy="greedy"): 99 | self.global_step += 1 100 | # Q_value of all actions 101 | output_Q = self.sess.run( 102 | self.output_Q, feed_dict={self.input_state: [state]})[0] 103 | if policy == 'egreedy': 104 | if random.random() <= self.epsilon: # random action 105 | return random.randint(0, self.action_dim - 1) 106 | else: # greedy action 107 | return np.argmax(output_Q) 108 | elif policy == 'greedy': 109 | return np.argmax(output_Q) 110 | elif policy == 'random': 111 | return random.randint(0, self.action_dim - 1) 112 | 113 | def learn(self, state, action, reward, next_state, done): 114 | onehot_action = np.zeros(self.action_dim) 115 | onehot_action[action] = 1 116 | 117 | # store experience in deque 118 | self.replay_buffer.append( 119 | np.array([state, onehot_action, reward, next_state, done])) 120 | 121 | if len(self.replay_buffer) > self.batch_size: 122 | # update target network if needed 123 | if self.global_step % self.target_network_update_interval == 0: 124 | self.sess.run(self.update_target_network) 125 | 126 | # sample experience 127 | minibatch = random.sample(self.replay_buffer, self.batch_size) 128 | 129 | # transpose mini-batch 130 | s_batch, a_batch, r_batch, next_s_batch, done_batch = np.array(minibatch).T 131 | s_batch, a_batch = np.stack(s_batch), np.stack(a_batch) 132 | next_s_batch = np.stack(next_s_batch) 133 | 134 | # use target q network to get Q of all 135 | next_s_all_action_Q = self.sess.run( 136 | self.target_output_Q, {self.input_state: next_s_batch}) 137 | next_s_Q_batch = np.max(next_s_all_action_Q, 1) 138 | 139 | if self.double_q: 140 | # use source network to select best action a* 141 | next_s_action_batch = np.argmax(self.sess.run( 142 | self.output_Q, {self.input_state: next_s_batch}), 1) 143 | # then use target network to compute Q(s', a*) 144 | next_s_Q_batch = next_s_all_action_Q[np.arange(self.batch_size), 145 | next_s_action_batch] 146 | 147 | # calculate target_Q_batch 148 | mask = ~done_batch.astype(np.bool) 149 | target_Q_batch = r_batch + self.gamma * mask * next_s_Q_batch 150 | 151 | # run actual training 152 | self.sess.run(self.train_op, feed_dict={ 153 | self.target_Q: target_Q_batch, 154 | self.input_action: a_batch, 155 | self.input_state: s_batch}) 156 | -------------------------------------------------------------------------------- /algorithms/DQN/evaluation.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | 3 | import gym 4 | import tensorflow as tf 5 | 6 | from agent import DQN 7 | 8 | 9 | def main(args): 10 | # load env 11 | env = gym.make('CartPole-v0') 12 | # load agent 13 | agent = DQN(env) 14 | agent.construct_model(args.gpu) 15 | 16 | # load model or init a new 17 | saver = tf.train.Saver() 18 | if args.model_path is not None: 19 | # reuse saved model 20 | saver.restore(agent.sess, args.model_path) 21 | else: 22 | # build a new model 23 | agent.init_var() 24 | 25 | # training loop 26 | for ep in range(args.ep): 27 | # reset env 28 | total_rewards = 0 29 | state = env.reset() 30 | 31 | while True: 32 | env.render() 33 | # sample actions 34 | action = agent.sample_action(state, policy='greedy') 35 | # act! 
36 | next_state, reward, done, _ = env.step(action) 37 | total_rewards += reward 38 | # state shift 39 | state = next_state 40 | if done: 41 | break 42 | 43 | print('Ep%s Reward: %s ' % (ep+1, total_rewards)) 44 | 45 | 46 | def args_parse(): 47 | parser = argparse.ArgumentParser() 48 | parser.add_argument( 49 | '--model_path', default=None, 50 | help='Whether to use a saved model. (*None|model path)') 51 | parser.add_argument( 52 | '--gpu', default=-1, 53 | help='running on a specify gpu, -1 indicates using cpu') 54 | parser.add_argument( 55 | '--ep', type=int, default=1, help='Test episodes') 56 | return parser.parse_args() 57 | 58 | 59 | if __name__ == '__main__': 60 | main(args_parse()) 61 | -------------------------------------------------------------------------------- /algorithms/DQN/train_DQN.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import os 3 | 4 | import gym 5 | import matplotlib.pyplot as plt 6 | import numpy as np 7 | import tensorflow as tf 8 | 9 | from agent import DQN 10 | 11 | 12 | def main(args): 13 | set_random_seed(args.seed) 14 | 15 | env = gym.make("CartPole-v0") 16 | agent = DQN(env, args) 17 | agent.construct_model(args.gpu) 18 | 19 | # load pre-trained models or init new a model. 20 | saver = tf.train.Saver(max_to_keep=1) 21 | if args.model_path is not None: 22 | saver.restore(agent.sess, args.model_path) 23 | ep_base = int(args.model_path.split('_')[-1]) 24 | best_mean_rewards = float(args.model_path.split('/')[-1].split('_')[0]) 25 | else: 26 | agent.sess.run(tf.global_variables_initializer()) 27 | ep_base = 0 28 | best_mean_rewards = None 29 | 30 | rewards_history, steps_history = [], [] 31 | train_steps = 0 32 | # Training 33 | for ep in range(args.max_ep): 34 | state = env.reset() 35 | ep_rewards = 0 36 | for step in range(env.spec.max_episode_steps): 37 | # pick action 38 | action = agent.sample_action(state, policy='egreedy') 39 | # execution action. 
40 | next_state, reward, done, debug = env.step(action) 41 | train_steps += 1 42 | ep_rewards += reward 43 | # modified reward to speed up learning 44 | reward = 0.1 if not done else -1 45 | # learn and Update net parameters 46 | agent.learn(state, action, reward, next_state, done) 47 | 48 | state = next_state 49 | if done: 50 | break 51 | steps_history.append(train_steps) 52 | if not rewards_history: 53 | rewards_history.append(ep_rewards) 54 | else: 55 | rewards_history.append( 56 | rewards_history[-1] * 0.9 + ep_rewards * 0.1) 57 | 58 | # decay epsilon 59 | if agent.epsilon > args.final_epsilon: 60 | agent.epsilon -= (args.init_epsilon - args.final_epsilon) / args.max_ep 61 | 62 | # evaluate during training 63 | if ep % args.log_every == args.log_every-1: 64 | total_reward = 0 65 | for i in range(args.test_ep): 66 | state = env.reset() 67 | for j in range(env.spec.max_episode_steps): 68 | action = agent.sample_action(state, policy='greedy') 69 | state, reward, done, _ = env.step(action) 70 | total_reward += reward 71 | if done: 72 | break 73 | current_mean_rewards = total_reward / args.test_ep 74 | print('Episode: %d Average Reward: %.2f' % 75 | (ep + 1, current_mean_rewards)) 76 | # save model if current model outperform the old one 77 | if best_mean_rewards is None or (current_mean_rewards >= best_mean_rewards): 78 | best_mean_rewards = current_mean_rewards 79 | if not os.path.isdir(args.save_path): 80 | os.makedirs(args.save_path) 81 | save_name = args.save_path + str(round(best_mean_rewards, 2)) \ 82 | + '_' + str(ep_base + ep + 1) 83 | saver.save(agent.sess, save_name) 84 | print('Model saved %s' % save_name) 85 | 86 | plt.plot(steps_history, rewards_history) 87 | plt.xlabel('steps') 88 | plt.ylabel('running avg rewards') 89 | plt.show() 90 | 91 | 92 | def set_random_seed(seed): 93 | np.random.seed(seed) 94 | tf.set_random_seed(seed) 95 | 96 | 97 | if __name__ == '__main__': 98 | parser = argparse.ArgumentParser() 99 | parser.add_argument( 100 | '--model_path', default=None, 101 | help='Whether to use a saved model. 
(*None|model path)') 102 | parser.add_argument( 103 | '--save_path', default='./models/', 104 | help='Path to save a model during training.') 105 | parser.add_argument( 106 | '--double_q', default=True, help='enable or disable double dqn') 107 | parser.add_argument( 108 | '--log_every', default=500, help='Log and save model every x episodes') 109 | parser.add_argument( 110 | '--gpu', default=-1, 111 | help='running on a specify gpu, -1 indicates using cpu') 112 | parser.add_argument( 113 | '--seed', default=31, help='random seed') 114 | 115 | parser.add_argument( 116 | '--max_ep', type=int, default=2000, help='Number of training episodes') 117 | parser.add_argument( 118 | '--test_ep', type=int, default=50, help='Number of test episodes') 119 | parser.add_argument( 120 | '--init_epsilon', type=float, default=0.75, help='initial epsilon') 121 | parser.add_argument( 122 | '--final_epsilon', type=float, default=0.2, help='final epsilon') 123 | parser.add_argument( 124 | '--buffer_size', type=int, default=50000, help='Size of memory buffer') 125 | parser.add_argument( 126 | '--lr', type=float, default=1e-4, help='Learning rate') 127 | parser.add_argument( 128 | '--batch_size', type=int, default=128, help='Size of training batch') 129 | parser.add_argument( 130 | '--gamma', type=float, default=0.99, help='Discounted factor') 131 | parser.add_argument( 132 | '--target_network_update', type=int, default=1000, 133 | help='update frequency of target network.') 134 | main(parser.parse_args()) 135 | -------------------------------------------------------------------------------- /algorithms/PG/agent.py: -------------------------------------------------------------------------------- 1 | from collections import deque 2 | import random 3 | 4 | import numpy as np 5 | import torch 6 | import torch.distributions as tdist 7 | import torch.nn as nn 8 | import torch.nn.functional as F 9 | import torch.optim as optim 10 | 11 | 12 | class Backbone(nn.Module): 13 | 14 | def __init__(self, in_dim): 15 | super().__init__() 16 | self.fc1 = nn.Linear(in_dim, 400) 17 | self.fc2 = nn.Linear(400, 200) 18 | 19 | self.bn1 = nn.BatchNorm1d(400) 20 | self.bn2 = nn.BatchNorm1d(200) 21 | 22 | 23 | class PolicyNet(Backbone): 24 | 25 | def __init__(self, in_dim, out_dim): 26 | super().__init__(in_dim) 27 | <<<<<<< Updated upstream 28 | self.out_dim = out_dim 29 | 30 | self.fc3 = nn.Linear(200, 100) 31 | self.bn3 = nn.BatchNorm1d(100) 32 | self.fc4 = nn.Linear(100, self.out_dim) 33 | 34 | self.logstd = torch.tensor(np.zeros((1, out_dim), dtype=np.float32), 35 | ======= 36 | self.fc3 = nn.Linear(200, 100) 37 | self.fc4 = nn.Linear(100, out_dim) 38 | 39 | self.logstd = torch.tensor(np.ones(out_dim, dtype=np.float32), 40 | >>>>>>> Stashed changes 41 | requires_grad=True).to(args.device) 42 | 43 | def get_dist(self, x): 44 | x = F.relu(self.fc1(x)) 45 | x = self.bn1(x) 46 | x = F.relu(self.fc2(x)) 47 | x = self.bn2(x) 48 | x = F.relu(self.fc3(x)) 49 | <<<<<<< Updated upstream 50 | x = self.bn3(x) 51 | mean = self.fc4(x) 52 | std = torch.exp(self.logstd) 53 | randn = torch.randn(mean.size()).to(args.device) 54 | out = randn * std + mean 55 | logp = -torch.log(std * (2 * np.pi) ** 0.5) + (out - mean) ** 2 / (2 * std ** 2) 56 | print(next(self.fc4.parameters())[0][:5]) 57 | ======= 58 | mean = self.fc4(x) 59 | std = torch.exp(self.logstd) 60 | dist = tdist.Normal(mean, std) 61 | return dist 62 | 63 | def forward(self, x): 64 | dist = self.get_dist(x) 65 | out = dist.sample() 66 | logp = dist.log_prob(out) 67 | >>>>>>> Stashed changes 
68 | return logp, out 69 | 70 | def get_logp(self, s, a): 71 | dist = self.get_dist(s) 72 | logp = dist.log_prob(a) 73 | return logp 74 | 75 | class ValueNet(Backbone): 76 | 77 | def __init__(self, in_dim): 78 | super().__init__(in_dim) 79 | self.fc3 = nn.Linear(200, 10) 80 | <<<<<<< Updated upstream 81 | self.bn3 = nn.BatchNorm1d(10) 82 | ======= 83 | >>>>>>> Stashed changes 84 | self.fc4 = nn.Linear(10, 1) 85 | 86 | def forward(self, x): 87 | x = F.relu(self.fc1(x)) 88 | x = self.bn1(x) 89 | x = F.relu(self.fc2(x)) 90 | x = self.bn2(x) 91 | x = F.relu(self.fc3(x)) 92 | <<<<<<< Updated upstream 93 | x = self.bn3(x) 94 | ======= 95 | >>>>>>> Stashed changes 96 | x = self.fc4(x) 97 | return x 98 | 99 | 100 | class VanillaPG: 101 | 102 | def __init__(self, env, args_): 103 | global args 104 | args = args_ 105 | self.obs_space = env.observation_space 106 | self.act_space = env.action_space 107 | 108 | self._build_nets() 109 | 110 | def step(self, obs): 111 | if obs.ndim == 1: 112 | obs = obs.reshape((1, -1)) 113 | obs = torch.tensor(obs).to(args.device) 114 | self.policy.eval() 115 | logp, out = self.policy(obs) 116 | out = out[0].detach().cpu().numpy() 117 | return logp, out 118 | 119 | def train(self, traj): 120 | obs, acts, rewards, logp, next_obs = np.array(traj).T 121 | 122 | obs, next_obs = np.stack(obs), np.stack(next_obs) 123 | obs_combine = np.concatenate([obs, next_obs], axis=0) 124 | obs_combine = torch.tensor(obs_combine).to(args.device) 125 | 126 | self.value_func.eval() 127 | vs_combine = self.value_func(obs_combine) 128 | vs_combine_np = vs_combine.cpu().detach().numpy().flatten() 129 | 130 | vs_np = vs_combine_np[:len(vs_combine)//2] 131 | next_vs_np = vs_combine_np[len(vs_combine)//2:] 132 | vs = vs_combine[:len(vs_combine)//2] 133 | 134 | logp = torch.cat(logp.tolist()) 135 | 136 | # calculate return estimations 137 | phi = self._calculate_phi(rewards, vs_np, next_vs_np) 138 | 139 | # update policy parameters 140 | self.policy.train() 141 | self.policy_optim.zero_grad() 142 | logp.backward(phi) 143 | self.policy_optim.step() 144 | 145 | # update value functions 146 | self.value_func.train() 147 | rwd_to_go = [np.sum(rewards[i:]) for i in range(len(rewards))] 148 | target = torch.tensor(rwd_to_go).view((-1, 1)).to(args.device) 149 | vf_loss = F.mse_loss(vs, target) 150 | print("vf mse: %.4f" % vf_loss) 151 | self.vf_optim.zero_grad() 152 | vf_loss.backward() 153 | self.vf_optim.step() 154 | 155 | def _build_nets(self): 156 | policy = PolicyNet(in_dim=self.obs_space.shape[0], 157 | out_dim=self.act_space.shape[0]) 158 | value_func = ValueNet(in_dim=self.obs_space.shape[0]) 159 | self.policy = policy.to(args.device) 160 | self.value_func = value_func.to(args.device) 161 | 162 | self.policy_optim = optim.Adam(self.policy.parameters(), lr=args.lr) 163 | self.vf_optim = optim.Adam(self.value_func.parameters(), lr=args.lr) 164 | 165 | def _discount(self, arr, alpha=0.99): 166 | discount_arr = [] 167 | for a in arr: 168 | discount_arr.append(alpha * a) 169 | alpha *= alpha 170 | return discount_arr 171 | 172 | def _calculate_phi(self, rewards, vs, next_vs): 173 | # option1: raw returns 174 | raw_returns = [np.sum(rewards) for i in range(len(rewards))] 175 | # option2: rewards to go 176 | rwd_to_go = [np.sum(rewards[i:]) for i in range(len(rewards))] 177 | # option3: discounted rewards to go 178 | disc_rwd_to_go = [np.sum(self._discount(rewards[i:])) 179 | for i in range(len(rewards))] 180 | # option4: td0 estimation 181 | td0 = rewards + next_vs 182 | # subtract baseline 183 | 
baseline = vs 184 | 185 | phi = disc_rwd_to_go - vs 186 | phi = np.array(phi).reshape((-1, 1)) 187 | phi = np.tile(phi, (1, 4)) 188 | phi = torch.tensor(phi).to(args.device) 189 | return phi 190 | 191 | 192 | class OffPolicyPG(VanillaPG): 193 | 194 | def __init__(self, env, args_): 195 | super().__init__(env, args_) 196 | self.memory = deque(maxlen=10000) 197 | 198 | def train(self, traj): 199 | obs, acts, rewards, logp, next_obs = np.array(traj).T 200 | disc_rwd_to_go = np.array([np.sum(self._discount(rewards[i:])) 201 | for i in range(len(rewards))]) 202 | for i in range(len(obs)): 203 | data_tuple = [obs[i], acts[i], rewards[i], disc_rwd_to_go[i], 204 | logp[i], next_obs[i]] 205 | self.memory.append(data_tuple) 206 | 207 | if len(self.memory) < 10000: 208 | print("reply memory not warm up") 209 | return 210 | traj = random.sample(self.memory, 1024) 211 | obs, acts, rewards, disc_rwd, logp, next_obs = np.array(traj).T 212 | 213 | disc_rwd = disc_rwd.astype(np.float32) 214 | acts = np.stack(acts) 215 | obs, next_obs = np.stack(obs), np.stack(next_obs) 216 | obs_combine = np.concatenate([obs, next_obs], axis=0) 217 | obs_combine = torch.tensor(obs_combine).to(args.device) 218 | 219 | self.value_func.eval() 220 | vs_combine = self.value_func(obs_combine) 221 | vs_combine_np = vs_combine.cpu().detach().numpy().flatten() 222 | 223 | vs_np = vs_combine_np[:len(vs_combine)//2] 224 | next_vs_np = vs_combine_np[len(vs_combine)//2:] 225 | vs = vs_combine[:len(vs_combine)//2] 226 | 227 | old_logp = torch.cat(logp.tolist()) 228 | new_logp = self.policy.get_logp(torch.tensor(obs).to(args.device), 229 | torch.tensor(acts).to(args.device)) 230 | ratio = torch.exp(new_logp - old_logp) 231 | ratio = torch.clamp(ratio, 0.9, 1.1) 232 | 233 | # calculate return estimations 234 | phi = disc_rwd - vs_np 235 | phi = np.array(phi).reshape((-1, 1)) 236 | phi = np.tile(phi, (1, 4)) 237 | phi = torch.tensor(phi).to(args.device) 238 | 239 | # update policy parameters 240 | self.policy.train() 241 | self.policy_optim.zero_grad() 242 | new_logp.backward(phi * ratio) 243 | self.policy_optim.step() 244 | 245 | # update value functions 246 | self.value_func.train() 247 | target = torch.tensor(disc_rwd).view((-1, 1)).to(args.device) 248 | vf_loss = F.mse_loss(vs, target) 249 | print("vf mse: %.4f" % vf_loss) 250 | self.vf_optim.zero_grad() 251 | vf_loss.backward() 252 | self.vf_optim.step() 253 | -------------------------------------------------------------------------------- /algorithms/PG/run.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | 3 | import numpy as np 4 | import gym 5 | import torch 6 | 7 | from agent import VanillaPG 8 | from agent import OffPolicyPG 9 | 10 | 11 | def off_policy_run(env, args): 12 | agent = OffPolicyPG(env, args) 13 | global_steps = 0 14 | for ep in range(args.num_ep): 15 | rollouts, ep_steps, ep_rewards = run_episode(env, agent) 16 | global_steps += ep_steps 17 | agent.train(rollouts) 18 | ep_avg_rewards = np.mean(ep_rewards) 19 | print("Ep %d reward: %.4f ep_steps: %d global_steps: %d" % 20 | (ep, ep_avg_rewards, ep_steps, global_steps)) 21 | 22 | 23 | def on_policy_run(env, args): 24 | agent = VanillaPG(env, args) 25 | global_steps = 0 26 | for ep in range(args.num_ep): 27 | rollouts, ep_steps, ep_rewards = run_episode(env, agent) 28 | global_steps += ep_steps 29 | agent.train(rollouts) 30 | ep_avg_rewards = np.mean(ep_rewards) 31 | print("Ep %d reward: %.4f ep_steps: %d global_steps: %d" % 32 | (ep, ep_avg_rewards, ep_steps, 
global_steps)) 33 | 34 | 47 | def main(args): 48 | # device 49 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 50 | args.device = device 51 | # seed 52 | np.random.seed(args.seed) 53 | torch.manual_seed(args.seed) 54 | # env and agent 55 | task_name = "BipedalWalker-v2" 56 | env = gym.make(task_name) 57 | env.seed(args.seed) 58 | # run 59 | # on_policy_run(env, args) 60 | off_policy_run(env, args) 61 | 62 | 63 | def run_episode(env, agent): 64 | obs = env.reset() 65 | obs = preprocess(obs) 66 | ep_rewards, rollouts = [], [] 67 | ep_steps = 0 68 | while True: 69 | logp, action = agent.step(obs) 70 | next_obs, reward, done, _ = env.step(action) 71 | ep_rewards.append(reward) 72 | next_obs = preprocess(next_obs) 73 | rollouts.append([obs, action, reward, logp, next_obs]) 74 | obs = next_obs 75 | ep_steps += 1 76 | if done: 77 | break 78 | return rollouts, ep_steps, ep_rewards 79 | 80 | 81 | def preprocess(obs): 82 | return obs.astype(np.float32) 83 | 84 | 85 | if __name__ == '__main__': 86 | parser = argparse.ArgumentParser() 87 | parser.add_argument("--num_ep", type=int, default=5000) 88 | parser.add_argument("--lr", type=float, default=1e-2) 89 | parser.add_argument("--seed", type=int, default=31) 90 | parser.add_argument("--gpu", action="store_true") 91 | main(parser.parse_args()) 92 | -------------------------------------------------------------------------------- /algorithms/PG/sync.sh: -------------------------------------------------------------------------------- 1 | rsync -avzP --delete ~/workspace/self/reinforce_py/ workpc:~/workspace/reinforce_py 2 | -------------------------------------------------------------------------------- /algorithms/PPO/README.md: -------------------------------------------------------------------------------- 1 | ### Proximal Policy Optimization (PPO) 2 | 3 | Implementation of the PPO method proposed by OpenAI. 4 | 5 | PPO is a trust-region-style policy optimization method. Instead of TRPO's hard KL constraint, it optimizes a simpler surrogate objective: either a KL penalty or, as in this implementation, a clipped probability ratio. 6 | 7 | Related papers: 8 | - [PPO - J Schulman et al.](https://arxiv.org/abs/1707.06347) 9 | - [TRPO - J Schulman et al.](https://arxiv.org/abs/1502.05477) 10 | 11 | ### Requirements 12 | - Python 3.x 13 | - Tensorflow 1.3.0 14 | - gym 0.9.4 15 | 16 | ### Run 17 | python3 train_PPO.py # -h to show available arguments 18 | 19 | ### Results 20 | ppo-score 21 | 22 | Smoothed rewards over 1M steps 23 | 24 | ppo-losses 25 | 26 | Policy loss and value function loss during training.
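At the core of the update is the clipped surrogate objective: the probability ratio between the new and old policy is clipped to [1 - clip_range, 1 + clip_range] and the pessimistic (minimum) surrogate is maximized. A minimal numpy sketch on a toy batch (all numbers are made up; the real TensorFlow version lives in agent.py):

```python
import numpy as np

clip_range = 0.2
ratio = np.array([0.7, 1.0, 1.3])   # pi_new(a|s) / pi_old(a|s)
adv = np.array([1.0, -0.5, 2.0])    # advantage estimates

unclipped = ratio * adv
clipped = np.clip(ratio, 1.0 - clip_range, 1.0 + clip_range) * adv
# PPO maximizes the elementwise minimum; written here as a loss to minimize
policy_loss = -np.mean(np.minimum(unclipped, clipped))
```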
27 | -------------------------------------------------------------------------------- /algorithms/PPO/agent.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import tensorflow as tf 3 | 4 | from config import args, tf_config 5 | from distributions import make_pd_type 6 | import utils as U 7 | 8 | 9 | class Policy: 10 | 11 | def __init__(self, ob_space, ac_space, batch, n_steps, reuse): 12 | ob_dim = (batch,) + ob_space.shape 13 | act_dim = ac_space.shape[0] 14 | self.ph_obs = tf.placeholder(tf.float32, ob_dim, name='ph_obs') 15 | 16 | with tf.variable_scope('policy', reuse=reuse): 17 | h1 = U.fc(self.ph_obs, out_dim=64, activation_fn=tf.nn.tanh, 18 | init_scale=np.sqrt(2), scope='pi_fc1') 19 | h2 = U.fc(h1, out_dim=64, activation_fn=tf.nn.tanh, 20 | init_scale=np.sqrt(2), scope='pi_fc2') 21 | pi = U.fc(h2, out_dim=act_dim, activation_fn=None, init_scale=0.01, 22 | scope='pi') 23 | h1 = U.fc(self.ph_obs, out_dim=64, activation_fn=tf.nn.tanh, 24 | init_scale=np.sqrt(2), scope='vf_fc1') 25 | h2 = U.fc(h1, out_dim=64, activation_fn=tf.nn.tanh, 26 | init_scale=np.sqrt(2), scope='vf_fc2') 27 | vf = U.fc(h2, out_dim=1, activation_fn=None, scope='vf')[:, 0] 28 | logstd = tf.get_variable(name='logstd', shape=[1, act_dim], 29 | initializer=tf.zeros_initializer()) 30 | # concatenate probabilities and logstds 31 | pd_params = tf.concat([pi, pi * 0.0 + logstd], axis=1) 32 | self.pd_type = make_pd_type(ac_space) 33 | self.pd = self.pd_type.pdfromflat(pd_params) 34 | 35 | self.a_out = self.pd.sample() 36 | self.neglogp = self.pd.get_neglogp(self.a_out) 37 | 38 | self.v_out = vf 39 | self.pi = pi 40 | 41 | 42 | class PPO: 43 | 44 | def __init__(self, env): 45 | self.sess = tf.Session(config=tf_config) 46 | ob_space = env.observation_space 47 | ac_space = env.action_space 48 | self.act_policy = Policy(ob_space, ac_space, env.num_envs, 49 | n_steps=1, reuse=False) 50 | self.train_policy = Policy(ob_space, ac_space, args.minibatch, 51 | n_steps=args.batch_steps, reuse=True) 52 | self._build_train() 53 | self.sess.run(tf.global_variables_initializer()) 54 | 55 | def _build_train(self): 56 | # build placeholders 57 | self.ph_obs_train = self.train_policy.ph_obs 58 | self.ph_a = self.train_policy.pd_type.get_action_placeholder([None]) 59 | self.ph_adv = tf.placeholder(tf.float32, [None]) 60 | self.ph_r = tf.placeholder(tf.float32, [None]) 61 | self.ph_old_neglogp = tf.placeholder(tf.float32, [None]) 62 | self.ph_old_v = tf.placeholder(tf.float32, [None]) 63 | self.ph_lr = tf.placeholder(tf.float32, []) 64 | self.ph_clip_range = tf.placeholder(tf.float32, []) 65 | 66 | # build losses 67 | self.neglogp = self.train_policy.pd.get_neglogp(self.ph_a) 68 | self.entropy = tf.reduce_mean(self.train_policy.pd.get_entropy()) 69 | v = self.train_policy.v_out 70 | v_clipped = self.ph_old_v + tf.clip_by_value( 71 | v - self.ph_old_v, -self.ph_clip_range, self.ph_clip_range) 72 | v_loss1 = tf.square(v - self.ph_r) 73 | v_loss2 = tf.square(v_clipped - self.ph_r) 74 | self.v_loss = 0.5 * tf.reduce_mean(tf.maximum(v_loss1, v_loss2)) 75 | 76 | # ratio = tf.exp(self.ph_old_neglogp - self.neglogp) 77 | old_p = tf.exp(-self.ph_old_neglogp) 78 | new_p = tf.exp(-self.neglogp) 79 | ratio = new_p / old_p 80 | pg_loss1 = -self.ph_adv * ratio 81 | pg_loss2 = -self.ph_adv * tf.clip_by_value( 82 | ratio, 1.0 - self.ph_clip_range, 1.0 + self.ph_clip_range) 83 | self.pg_loss = tf.reduce_mean(tf.maximum(pg_loss1, pg_loss2)) 84 | loss = self.pg_loss + args.v_coef * self.v_loss - \ 
85 | args.entropy_coef * self.entropy 86 | 87 | # build train operation 88 | params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'policy') 89 | grads = tf.gradients(loss, params) 90 | if args.max_grad_norm is not None: 91 | grads, _grad_norm = tf.clip_by_global_norm( 92 | grads, args.max_grad_norm) 93 | grads = list(zip(grads, params)) 94 | self.train_op = tf.train.AdamOptimizer( 95 | learning_rate=self.ph_lr, epsilon=1e-5).apply_gradients(grads) 96 | 97 | # train info 98 | self.approxkl = 0.5 * tf.reduce_mean( 99 | tf.square(self.neglogp - self.ph_old_neglogp)) 100 | self.clip_frac = tf.reduce_mean( 101 | tf.to_float(tf.greater(tf.abs(ratio - 1.0), self.ph_clip_range))) 102 | self.avg_ratio = tf.reduce_mean(ratio) 103 | 104 | def step(self, obs, *_args, **_kwargs): 105 | feed_dict = {self.act_policy.ph_obs: obs} 106 | a, v, neglogp = self.sess.run( 107 | [self.act_policy.a_out, 108 | self.act_policy.v_out, 109 | self.act_policy.neglogp], 110 | feed_dict=feed_dict) 111 | return a, v, neglogp 112 | 113 | def get_value(self, obs, *_args, **_kwargs): 114 | feed_dict = {self.act_policy.ph_obs: obs} 115 | return self.sess.run(self.act_policy.v_out, feed_dict=feed_dict) 116 | 117 | def train(self, lr, clip_range, obs, returns, masks, actions, values, neglogps, 118 | advs): 119 | # advs = returns - values 120 | advs = (advs - advs.mean()) / (advs.std() + 1e-8) 121 | feed_dict = {self.ph_obs_train: obs, self.ph_a: actions, 122 | self.ph_adv: advs, self.ph_r: returns, 123 | self.ph_old_neglogp: neglogps, self.ph_old_v: values, 124 | self.ph_lr: lr, 125 | self.ph_clip_range: clip_range} 126 | self.loss_names = ['loss_policy', 'loss_value', 'avg_ratio', 'policy_entropy', 127 | 'approxkl', 'clipfrac'] 128 | return self.sess.run( 129 | [self.pg_loss, self.v_loss, self.avg_ratio, self.entropy, 130 | self.approxkl, self.clip_frac, self.train_op], 131 | feed_dict=feed_dict)[:-1] 132 | 133 | -------------------------------------------------------------------------------- /algorithms/PPO/config.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import tensorflow as tf 3 | 4 | 5 | parser = argparse.ArgumentParser() 6 | # Global arguments 7 | parser.add_argument('--env', help='environment ID', default='Walker2d-v1') 8 | parser.add_argument('--seed', help='RNG seed', type=int, default=931022) 9 | parser.add_argument('--save_interval', type=int, default=0) 10 | parser.add_argument('--log_interval', type=int, default=1) 11 | parser.add_argument('--n_envs', type=int, default=1) 12 | parser.add_argument('--n_steps', type=int, default=int(1e6)) 13 | 14 | # Hyperparameters 15 | parser.add_argument('--batch_steps', type=int, default=2048) 16 | parser.add_argument('--minibatch', type=int, default=64) 17 | parser.add_argument('--n_epochs', type=int, default=10) 18 | parser.add_argument('--entropy_coef', type=float, default=0.0) 19 | parser.add_argument('--v_coef', type=float, default=0.5) 20 | parser.add_argument('--max_grad_norm', type=float, default=0.5) 21 | parser.add_argument('--lam', type=float, default=0.95) 22 | parser.add_argument('--gamma', type=float, default=0.99) 23 | parser.add_argument('--lr', type=float, default=3e-4) 24 | parser.add_argument('--clip_range', type=float, default=0.2) 25 | 26 | args = parser.parse_args() 27 | 28 | # Tensroflow Session Configuration 29 | tf_config = tf.ConfigProto(allow_soft_placement=True) 30 | tf_config.gpu_options.allow_growth = True 31 | 
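# ---------------------------------------------------------------------------
# Aside (not part of config.py): the --gamma and --lam arguments above are the
# discount factor and the GAE(lambda) parameter consumed by the Runner in
# train_PPO.py. Below is a minimal, self-contained sketch of the recursion
# they drive; the function name and the array values are made up for
# illustration, and the per-step done masks that the Runner applies are
# omitted here.
import numpy as np

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    # Generalized Advantage Estimation for a single rollout with no terminals.
    values = np.append(values, last_value)
    advs = np.zeros(len(rewards))
    lastgaelam = 0.0
    for t in reversed(range(len(rewards))):
        # one-step TD error, then an exponentially weighted sum of future errors
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        advs[t] = lastgaelam = delta + gamma * lam * lastgaelam
    return advs, advs + values[:-1]  # advantages and value-function targets

advs, returns = gae_advantages(rewards=np.array([1.0, 0.0, 1.0]),
                               values=np.array([0.5, 0.4, 0.6]),
                               last_value=0.7)
# ---------------------------------------------------------------------------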
-------------------------------------------------------------------------------- /algorithms/PPO/distributions.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import numpy as np 3 | import utils as U 4 | from tensorflow.python.ops import math_ops 5 | 6 | 7 | class Pd(object): 8 | """ 9 | A particular probability distribution 10 | """ 11 | 12 | def get_flatparam(self): 13 | raise NotImplementedError 14 | 15 | def get_mode(self): 16 | raise NotImplementedError 17 | 18 | def get_neglogp(self, x): 19 | # Usually it's easier to define the negative logprob 20 | raise NotImplementedError 21 | 22 | def get_kl(self, other): 23 | raise NotImplementedError 24 | 25 | def get_entropy(self): 26 | raise NotImplementedError 27 | 28 | def sample(self): 29 | raise NotImplementedError 30 | 31 | def logp(self, x): 32 | return - self.get_neglogp(x) 33 | 34 | 35 | class PdType(object): 36 | """ 37 | Parametrized family of probability distributions 38 | """ 39 | 40 | def pdclass(self): 41 | raise NotImplementedError 42 | 43 | def pdfromflat(self, flat): 44 | return self.pdclass()(flat) 45 | 46 | def param_shape(self): 47 | raise NotImplementedError 48 | 49 | def action_shape(self): 50 | raise NotImplementedError 51 | 52 | def action_dtype(self): 53 | raise NotImplementedError 54 | 55 | def param_placeholder(self, prepend_shape, name=None): 56 | return tf.placeholder(dtype=tf.float32, shape=prepend_shape + self.param_shape(), name=name) 57 | 58 | def get_action_placeholder(self, prepend_shape, name=None): 59 | return tf.placeholder(dtype=self.action_dtype(), shape=prepend_shape + self.action_shape(), name=name) 60 | 61 | 62 | class CategoricalPdType(PdType): 63 | def __init__(self, ncat): 64 | self.ncat = ncat 65 | 66 | def pdclass(self): 67 | return CategoricalPd 68 | 69 | def param_shape(self): 70 | return [self.ncat] 71 | 72 | def action_shape(self): 73 | return [] 74 | 75 | def action_dtype(self): 76 | return tf.int32 77 | 78 | 79 | class MultiCategoricalPdType(PdType): 80 | def __init__(self, low, high): 81 | self.low = low 82 | self.high = high 83 | self.ncats = high - low + 1 84 | 85 | def pdclass(self): 86 | return MultiCategoricalPd 87 | 88 | def pdfromflat(self, flat): 89 | return MultiCategoricalPd(self.low, self.high, flat) 90 | 91 | def param_shape(self): 92 | return [sum(self.ncats)] 93 | 94 | def action_shape(self): 95 | return [len(self.ncats)] 96 | 97 | def action_dtype(self): 98 | return tf.int32 99 | 100 | 101 | class DiagGaussianPdType(PdType): 102 | def __init__(self, size): 103 | self.size = size 104 | 105 | def pdclass(self): 106 | return DiagGaussianPd 107 | 108 | def param_shape(self): 109 | return [2 * self.size] 110 | 111 | def action_shape(self): 112 | return [self.size] 113 | 114 | def action_dtype(self): 115 | return tf.float32 116 | 117 | 118 | class BernoulliPdType(PdType): 119 | def __init__(self, size): 120 | self.size = size 121 | 122 | def pdclass(self): 123 | return BernoulliPd 124 | 125 | def param_shape(self): 126 | return [self.size] 127 | 128 | def action_shape(self): 129 | return [self.size] 130 | 131 | def action_dtype(self): 132 | return tf.int32 133 | 134 | 135 | class CategoricalPd(Pd): 136 | def __init__(self, logits): 137 | self.logits = logits 138 | 139 | def get_flatparam(self): 140 | return self.logits 141 | 142 | def get_mode(self): 143 | return U.argmax(self.logits, axis=-1) 144 | 145 | def get_neglogp(self, x): 146 | # return tf.nn.sparse_softmax_cross_entropy_with_logits(logits=self.logits, 
labels=x) 147 | # Note: we can't use sparse_softmax_cross_entropy_with_logits because 148 | # the implementation does not allow second-order derivatives... 149 | one_hot_actions = tf.one_hot(x, self.logits.get_shape().as_list()[-1]) 150 | return tf.nn.softmax_cross_entropy_with_logits( 151 | logits=self.logits, 152 | labels=one_hot_actions) 153 | 154 | def get_kl(self, other): 155 | a0 = self.logits - U.max(self.logits, axis=-1, keepdims=True) 156 | a1 = other.logits - U.max(other.logits, axis=-1, keepdims=True) 157 | ea0 = tf.exp(a0) 158 | ea1 = tf.exp(a1) 159 | z0 = U.sum(ea0, axis=-1, keepdims=True) 160 | z1 = U.sum(ea1, axis=-1, keepdims=True) 161 | p0 = ea0 / z0 162 | return U.sum(p0 * (a0 - tf.log(z0) - a1 + tf.log(z1)), axis=-1) 163 | 164 | def get_entropy(self): 165 | a0 = self.logits - U.max(self.logits, axis=-1, keepdims=True) 166 | ea0 = tf.exp(a0) 167 | z0 = U.sum(ea0, axis=-1, keepdims=True) 168 | p0 = ea0 / z0 169 | return U.sum(p0 * (tf.log(z0) - a0), axis=-1) 170 | 171 | def sample(self): 172 | u = tf.random_uniform(tf.shape(self.logits)) 173 | return tf.argmax(self.logits - tf.log(-tf.log(u)), axis=-1) 174 | 175 | @classmethod 176 | def fromflat(cls, flat): 177 | return cls(flat) 178 | 179 | 180 | class MultiCategoricalPd(Pd): 181 | def __init__(self, low, high, flat): 182 | self.flat = flat 183 | self.low = tf.constant(low, dtype=tf.int32) 184 | self.categoricals = list(map(CategoricalPd, tf.split( 185 | flat, high - low + 1, axis=len(flat.get_shape()) - 1))) 186 | 187 | def get_flatparam(self): 188 | return self.flat 189 | 190 | def get_mode(self): 191 | return self.low + tf.cast(tf.stack([p.get_mode() for p in self.categoricals], axis=-1), tf.int32) 192 | 193 | def get_neglogp(self, x): 194 | return tf.add_n([p.get_neglogp(px) for p, px in zip(self.categoricals, tf.unstack(x - self.low, axis=len(x.get_shape()) - 1))]) 195 | 196 | def get_kl(self, other): 197 | return tf.add_n([ 198 | p.get_kl(q) for p, q in zip(self.categoricals, other.categoricals) 199 | ]) 200 | 201 | def get_entropy(self): 202 | return tf.add_n([p.get_entropy() for p in self.categoricals]) 203 | 204 | def sample(self): 205 | return self.low + tf.cast(tf.stack([p.sample() for p in self.categoricals], axis=-1), tf.int32) 206 | 207 | @classmethod 208 | def fromflat(cls, flat): 209 | raise NotImplementedError 210 | 211 | 212 | class DiagGaussianPd(Pd): 213 | def __init__(self, flat): 214 | self.flat = flat 215 | mean, logstd = tf.split(axis=len(flat.shape) - 1, num_or_size_splits=2, value=flat) 216 | self.mean = mean 217 | self.logstd = logstd 218 | self.std = tf.exp(logstd) 219 | 220 | def get_flatparam(self): 221 | return self.flat 222 | 223 | def get_mode(self): 224 | return self.mean 225 | 226 | def get_neglogp(self, x): 227 | return 0.5 * U.sum(tf.square((x - self.mean) / self.std), axis=-1) \ 228 | + 0.5 * np.log(2.0 * np.pi) * tf.to_float(tf.shape(x)[-1]) \ 229 | + U.sum(self.logstd, axis=-1) 230 | 231 | def get_kl(self, other): 232 | assert isinstance(other, DiagGaussianPd) 233 | return U.sum(other.logstd - self.logstd + (tf.square(self.std) + tf.square(self.mean - other.mean)) / (2.0 * tf.square(other.std)) - 0.5, axis=-1) 234 | 235 | def get_entropy(self): 236 | return U.sum(self.logstd + .5 * np.log(2.0 * np.pi * np.e), axis=-1) 237 | 238 | def sample(self): 239 | return self.mean + self.std * tf.random_normal(tf.shape(self.mean)) 240 | 241 | @classmethod 242 | def fromflat(cls, flat): 243 | return cls(flat) 244 | 245 | 246 | class BernoulliPd(Pd): 247 | def __init__(self, logits): 248 | 
self.logits = logits 249 | self.ps = tf.sigmoid(logits) 250 | 251 | def get_flatparam(self): 252 | return self.logits 253 | 254 | def get_mode(self): 255 | return tf.round(self.ps) 256 | 257 | def get_neglogp(self, x): 258 | return U.sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=self.logits, labels=tf.to_float(x)), axis=-1) 259 | 260 | def get_kl(self, other): 261 | return U.sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=other.logits, labels=self.ps), axis=-1) - U.sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=self.logits, labels=self.ps), axis=-1) 262 | 263 | def get_entropy(self): 264 | return U.sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=self.logits, labels=self.ps), axis=-1) 265 | 266 | def sample(self): 267 | u = tf.random_uniform(tf.shape(self.ps)) 268 | return tf.to_float(math_ops.less(u, self.ps)) 269 | 270 | @classmethod 271 | def fromflat(cls, flat): 272 | return cls(flat) 273 | 274 | 275 | def make_pd_type(ac_space): 276 | from gym import spaces 277 | if isinstance(ac_space, spaces.Box): 278 | assert len(ac_space.shape) == 1 279 | return DiagGaussianPdType(ac_space.shape[0]) 280 | elif isinstance(ac_space, spaces.Discrete): 281 | return CategoricalPdType(ac_space.n) 282 | elif isinstance(ac_space, spaces.MultiDiscrete): 283 | return MultiCategoricalPdType(ac_space.low, ac_space.high) 284 | elif isinstance(ac_space, spaces.MultiBinary): 285 | return BernoulliPdType(ac_space.n) 286 | else: 287 | raise NotImplementedError 288 | -------------------------------------------------------------------------------- /algorithms/PPO/env_wrapper.py: -------------------------------------------------------------------------------- 1 | import time 2 | import csv 3 | import json 4 | import gym 5 | from gym.core import Wrapper 6 | import os.path as osp 7 | import numpy as np 8 | 9 | from utils import RunningMeanStd 10 | 11 | 12 | class BaseVecEnv(object): 13 | """ 14 | Vectorized environment base class 15 | """ 16 | 17 | def step(self, vac): 18 | """ 19 | Apply sequence of actions to sequence of environments 20 | actions -> (observations, rewards, dones) 21 | """ 22 | raise NotImplementedError 23 | 24 | def reset(self): 25 | """ 26 | Reset all environments 27 | """ 28 | raise NotImplementedError 29 | 30 | def close(self): 31 | pass 32 | 33 | def set_random_seed(self, seed): 34 | raise NotImplementedError 35 | 36 | 37 | class VecEnv(BaseVecEnv): 38 | def __init__(self, env_fns): 39 | self.envs = [fn() for fn in env_fns] 40 | env = self.envs[0] 41 | self.action_space = env.action_space 42 | self.observation_space = env.observation_space 43 | self.ts = np.zeros(len(self.envs), dtype='int') 44 | 45 | def step(self, action_n): 46 | results = [env.step(a) for (a, env) in zip(action_n, self.envs)] 47 | obs, rews, dones, infos = map(np.array, zip(*results)) 48 | self.ts += 1 49 | for (i, done) in enumerate(dones): 50 | if done: 51 | obs[i] = self.envs[i].reset() 52 | self.ts[i] = 0 53 | return np.array(obs), np.array(rews), np.array(dones), infos 54 | 55 | def reset(self): 56 | results = [env.reset() for env in self.envs] 57 | return np.array(results) 58 | 59 | def render(self): 60 | self.envs[0].render() 61 | 62 | @property 63 | def num_envs(self): 64 | return len(self.envs) 65 | 66 | 67 | class VecEnvNorm(BaseVecEnv): 68 | 69 | def __init__(self, venv, ob=True, ret=True, 70 | clipob=10., cliprew=10., gamma=0.99, epsilon=1e-8): 71 | self.venv = venv 72 | self._ob_space = venv.observation_space 73 | self._ac_space = venv.action_space 74 | self.ob_rms = 
RunningMeanStd(shape=self._ob_space.shape) if ob else None 75 | self.ret_rms = RunningMeanStd(shape=()) if ret else None 76 | self.clipob = clipob 77 | self.cliprew = cliprew 78 | self.ret = np.zeros(self.num_envs) 79 | self.gamma = gamma 80 | self.epsilon = epsilon 81 | 82 | def step(self, vac): 83 | obs, rews, news, infos = self.venv.step(vac) 84 | self.ret = self.ret * self.gamma + rews 85 | # normalize observations 86 | obs = self._norm_ob(obs) 87 | # normalize rewards 88 | if self.ret_rms: 89 | self.ret_rms.update(self.ret) 90 | rews = np.clip(rews / np.sqrt(self.ret_rms.var + self.epsilon), 91 | -self.cliprew, self.cliprew) 92 | return obs, rews, news, infos 93 | 94 | def _norm_ob(self, obs): 95 | if self.ob_rms: 96 | self.ob_rms.update(obs) 97 | obs = np.clip( 98 | (obs - self.ob_rms.mean) / np.sqrt(self.ob_rms.var + self.epsilon), 99 | -self.clipob, self.clipob) 100 | return obs 101 | else: 102 | return obs 103 | 104 | def reset(self): 105 | obs = self.venv.reset() 106 | return self._norm_ob(obs) 107 | 108 | def set_random_seed(self, seeds): 109 | for env, seed in zip(self.venv.envs, seeds): 110 | env.seed(int(seed)) 111 | 112 | @property 113 | def action_space(self): 114 | return self._ac_space 115 | 116 | @property 117 | def observation_space(self): 118 | return self._ob_space 119 | 120 | def close(self): 121 | self.venv.close() 122 | 123 | def render(self): 124 | self.venv.render() 125 | 126 | @property 127 | def num_envs(self): 128 | return self.venv.num_envs 129 | 130 | 131 | class Monitor(Wrapper): 132 | EXT = "monitor.csv" 133 | f = None 134 | 135 | def __init__(self, env, filename, allow_early_resets=False, reset_keywords=()): 136 | Wrapper.__init__(self, env=env) 137 | self.tstart = time.time() 138 | if filename is None: 139 | self.f = None 140 | self.logger = None 141 | else: 142 | if not filename.endswith(Monitor.EXT): 143 | if osp.isdir(filename): 144 | filename = osp.join(filename, Monitor.EXT) 145 | else: 146 | filename = filename + "." + Monitor.EXT 147 | self.f = open(filename, "wt") 148 | self.f.write('#%s\n'%json.dumps({"t_start": self.tstart, "gym_version": gym.__version__, 149 | "env_id": env.spec.id if env.spec else 'Unknown'})) 150 | self.logger = csv.DictWriter(self.f, fieldnames=('r', 'l', 't')+reset_keywords) 151 | self.logger.writeheader() 152 | 153 | self.reset_keywords = reset_keywords 154 | self.allow_early_resets = allow_early_resets 155 | self.rewards = None 156 | self.needs_reset = True 157 | self.episode_rewards = [] 158 | self.episode_lengths = [] 159 | self.total_steps = 0 160 | self.current_reset_info = {} # extra info about the current episode, that was passed in during reset() 161 | 162 | def _reset(self, **kwargs): 163 | if not self.allow_early_resets and not self.needs_reset: 164 | raise RuntimeError("Tried to reset an environment before done. 
If you want to allow early resets, wrap your env with Monitor(env, path, allow_early_resets=True)") 165 | self.rewards = [] 166 | self.needs_reset = False 167 | for k in self.reset_keywords: 168 | v = kwargs.get(k) 169 | if v is None: 170 | raise ValueError('Expected you to pass kwarg %s into reset'%k) 171 | self.current_reset_info[k] = v 172 | return self.env.reset(**kwargs) 173 | 174 | def _step(self, action): 175 | if self.needs_reset: 176 | raise RuntimeError("Tried to step environment that needs reset") 177 | ob, rew, done, info = self.env.step(action) 178 | self.rewards.append(rew) 179 | if done: 180 | self.needs_reset = True 181 | eprew = sum(self.rewards) 182 | eplen = len(self.rewards) 183 | epinfo = {"r": round(eprew, 6), "l": eplen, "t": round(time.time() - self.tstart, 6)} 184 | epinfo.update(self.current_reset_info) 185 | if self.logger: 186 | self.logger.writerow(epinfo) 187 | self.f.flush() 188 | self.episode_rewards.append(eprew) 189 | self.episode_lengths.append(eplen) 190 | info['episode'] = epinfo 191 | self.total_steps += 1 192 | return (ob, rew, done, info) 193 | 194 | def close(self): 195 | if self.f is not None: 196 | self.f.close() 197 | 198 | def get_total_steps(self): 199 | return self.total_steps 200 | 201 | def get_episode_rewards(self): 202 | return self.episode_rewards 203 | 204 | def get_episode_lengths(self): 205 | return self.episode_lengths 206 | 207 | 208 | def make_env(): 209 | def env_fn(): 210 | env = gym.make(args.env) 211 | env = Monitor(env, logger.get_dir()) 212 | return env 213 | env = VecEnv([env_fn] * args.n_envs) 214 | env = VecEnvNorm(env) 215 | return env 216 | -------------------------------------------------------------------------------- /algorithms/PPO/logger.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | import shutil 4 | import json 5 | import time 6 | import datetime 7 | import tempfile 8 | from mpi4py import MPI 9 | 10 | LOG_OUTPUT_FORMATS = ['stdout', 'log', 'csv'] 11 | # Also valid: json, tensorboard 12 | 13 | DEBUG = 10 14 | INFO = 20 15 | WARN = 30 16 | ERROR = 40 17 | 18 | DISABLED = 50 19 | 20 | 21 | class KVWriter(object): 22 | def writekvs(self, kvs): 23 | raise NotImplementedError 24 | 25 | 26 | class SeqWriter(object): 27 | def writeseq(self, seq): 28 | raise NotImplementedError 29 | 30 | 31 | class HumanOutputFormat(KVWriter, SeqWriter): 32 | def __init__(self, filename_or_file): 33 | if isinstance(filename_or_file, str): 34 | self.file = open(filename_or_file, 'wt') 35 | self.own_file = True 36 | else: 37 | assert hasattr(filename_or_file, 38 | 'read'), 'expected file or str, got %s' % filename_or_file 39 | self.file = filename_or_file 40 | self.own_file = False 41 | 42 | def writekvs(self, kvs): 43 | # Create strings for printing 44 | key2str = {} 45 | for (key, val) in sorted(kvs.items()): 46 | if isinstance(val, float): 47 | valstr = '%-8.3g' % (val,) 48 | else: 49 | valstr = str(val) 50 | key2str[self._truncate(key)] = self._truncate(valstr) 51 | 52 | # Find max widths 53 | if len(key2str) == 0: 54 | print('WARNING: tried to write empty key-value dict') 55 | return 56 | else: 57 | keywidth = max(map(len, key2str.keys())) 58 | valwidth = max(map(len, key2str.values())) 59 | 60 | # Write out the data 61 | dashes = '-' * (keywidth + valwidth + 7) 62 | lines = [dashes] 63 | for (key, val) in sorted(key2str.items()): 64 | lines.append('| %s%s | %s%s |' % ( 65 | key, 66 | ' ' * (keywidth - len(key)), 67 | val, 68 | ' ' * (valwidth - len(val)), 69 | )) 70 
| lines.append(dashes) 71 | self.file.write('\n'.join(lines) + '\n') 72 | 73 | # Flush the output to the file 74 | self.file.flush() 75 | 76 | def _truncate(self, s): 77 | return s[:20] + '...' if len(s) > 23 else s 78 | 79 | def writeseq(self, seq): 80 | for arg in seq: 81 | self.file.write(arg) 82 | self.file.write('\n') 83 | self.file.flush() 84 | 85 | def close(self): 86 | if self.own_file: 87 | self.file.close() 88 | 89 | 90 | class JSONOutputFormat(KVWriter): 91 | def __init__(self, filename): 92 | self.file = open(filename, 'wt') 93 | 94 | def writekvs(self, kvs): 95 | for k, v in sorted(kvs.items()): 96 | if hasattr(v, 'dtype'): 97 | v = v.tolist() 98 | kvs[k] = float(v) 99 | self.file.write(json.dumps(kvs) + '\n') 100 | self.file.flush() 101 | 102 | def close(self): 103 | self.file.close() 104 | 105 | 106 | class CSVOutputFormat(KVWriter): 107 | def __init__(self, filename): 108 | self.file = open(filename, 'w+t') 109 | self.keys = [] 110 | self.sep = ',' 111 | 112 | def writekvs(self, kvs): 113 | # Add our current row to the history 114 | extra_keys = kvs.keys() - self.keys 115 | if extra_keys: 116 | self.keys.extend(extra_keys) 117 | self.file.seek(0) 118 | lines = self.file.readlines() 119 | self.file.seek(0) 120 | for (i, k) in enumerate(self.keys): 121 | if i > 0: 122 | self.file.write(',') 123 | self.file.write(k) 124 | self.file.write('\n') 125 | for line in lines[1:]: 126 | self.file.write(line[:-1]) 127 | self.file.write(self.sep * len(extra_keys)) 128 | self.file.write('\n') 129 | for (i, k) in enumerate(self.keys): 130 | if i > 0: 131 | self.file.write(',') 132 | v = kvs.get(k) 133 | if v: 134 | self.file.write(str(v)) 135 | self.file.write('\n') 136 | self.file.flush() 137 | 138 | def close(self): 139 | self.file.close() 140 | 141 | 142 | class TensorBoardOutputFormat(KVWriter): 143 | """ 144 | Dumps key/value pairs into TensorBoard's numeric format. 145 | """ 146 | 147 | def __init__(self, dir): 148 | os.makedirs(dir, exist_ok=True) 149 | self.dir = dir 150 | self.step = 1 151 | prefix = 'events' 152 | path = os.path.join(os.path.abspath(dir), prefix) 153 | import tensorflow as tf 154 | from tensorflow.python import pywrap_tensorflow 155 | from tensorflow.core.util import event_pb2 156 | from tensorflow.python.util import compat 157 | self.tf = tf 158 | self.event_pb2 = event_pb2 159 | self.pywrap_tensorflow = pywrap_tensorflow 160 | self.writer = pywrap_tensorflow.EventsWriter(compat.as_bytes(path)) 161 | 162 | def writekvs(self, kvs): 163 | def summary_val(k, v): 164 | kwargs = {'tag': k, 'simple_value': float(v)} 165 | return self.tf.Summary.Value(**kwargs) 166 | summary = self.tf.Summary(value=[summary_val(k, v) for k, v in kvs.items()]) 167 | event = self.event_pb2.Event(wall_time=time.time(), summary=summary) 168 | event.step = self.step # is there any reason why you'd want to specify the step? 
169 | self.writer.WriteEvent(event) 170 | self.writer.Flush() 171 | self.step += 1 172 | 173 | def close(self): 174 | if self.writer: 175 | self.writer.Close() 176 | self.writer = None 177 | 178 | 179 | def make_output_format(format, ev_dir): 180 | os.makedirs(ev_dir, exist_ok=True) 181 | rank = MPI.COMM_WORLD.Get_rank() 182 | if format == 'stdout': 183 | return HumanOutputFormat(sys.stdout) 184 | elif format == 'log': 185 | suffix = "" if rank == 0 else ("-mpi%03i" % rank) 186 | return HumanOutputFormat(os.path.join(ev_dir, 'log%s.txt' % suffix)) 187 | elif format == 'json': 188 | assert rank == 0 189 | return JSONOutputFormat(os.path.join(ev_dir, 'progress.json')) 190 | elif format == 'csv': 191 | assert rank == 0 192 | return CSVOutputFormat(os.path.join(ev_dir, 'progress.csv')) 193 | elif format == 'tensorboard': 194 | assert rank == 0 195 | return TensorBoardOutputFormat(os.path.join(ev_dir, 'tb')) 196 | else: 197 | raise ValueError('Unknown format specified: %s' % (format,)) 198 | 199 | 200 | # ================================================================ 201 | # API 202 | # ================================================================ 203 | def logkv(key, val): 204 | """ 205 | Log a value of some diagnostic 206 | Call this once for each diagnostic quantity, each iteration 207 | """ 208 | Logger.CURRENT.logkv(key, val) 209 | 210 | 211 | def logkvs(d): 212 | """ 213 | Log a dictionary of key-value pairs 214 | """ 215 | for (k, v) in d.items(): 216 | logkv(k, v) 217 | 218 | 219 | def dumpkvs(): 220 | """ 221 | Write all of the diagnostics from the current iteration 222 | 223 | level: int. (see logger.py docs) If the global logger level is higher than 224 | the level argument here, don't print to stdout. 225 | """ 226 | Logger.CURRENT.dumpkvs() 227 | 228 | 229 | def getkvs(): 230 | return Logger.CURRENT.name2val 231 | 232 | 233 | def log(*args, level=INFO): 234 | """ 235 | Write the sequence of args, with no separators, to the console and output files (if you've configured an output file). 236 | """ 237 | Logger.CURRENT.log(*args, level=level) 238 | 239 | 240 | def debug(*args): 241 | log(*args, level=DEBUG) 242 | 243 | 244 | def info(*args): 245 | log(*args, level=INFO) 246 | 247 | 248 | def warn(*args): 249 | log(*args, level=WARN) 250 | 251 | 252 | def error(*args): 253 | log(*args, level=ERROR) 254 | 255 | 256 | def set_level(level): 257 | """ 258 | Set logging threshold on current logger. 259 | """ 260 | Logger.CURRENT.set_level(level) 261 | 262 | 263 | def get_dir(): 264 | """ 265 | Get directory that log files are being written to. 266 | will be None if there is no output directory (i.e., if you didn't call start) 267 | """ 268 | return Logger.CURRENT.get_dir() 269 | 270 | 271 | record_tabular = logkv 272 | dump_tabular = dumpkvs 273 | 274 | 275 | # ================================================================ 276 | # Backend 277 | # ================================================================ 278 | class Logger(object): 279 | DEFAULT = None # A logger with no output files. 
(See right below class definition) 280 | # So that you can still log to the terminal without setting up any output files 281 | CURRENT = None # Current logger being used by the free functions above 282 | 283 | def __init__(self, dir, output_formats): 284 | self.name2val = {} # values this iteration 285 | self.level = INFO 286 | self.dir = dir 287 | self.output_formats = output_formats 288 | 289 | # Logging API, forwarded 290 | # ---------------------------------------- 291 | def logkv(self, key, val): 292 | self.name2val[key] = val 293 | 294 | def dumpkvs(self): 295 | if self.level == DISABLED: 296 | return 297 | for fmt in self.output_formats: 298 | if isinstance(fmt, KVWriter): 299 | fmt.writekvs(self.name2val) 300 | self.name2val.clear() 301 | 302 | def log(self, *args, level=INFO): 303 | if self.level <= level: 304 | self._do_log(args) 305 | 306 | # Configuration 307 | # ---------------------------------------- 308 | def set_level(self, level): 309 | self.level = level 310 | 311 | def get_dir(self): 312 | return self.dir 313 | 314 | def close(self): 315 | for fmt in self.output_formats: 316 | fmt.close() 317 | 318 | # Misc 319 | # ---------------------------------------- 320 | def _do_log(self, args): 321 | for fmt in self.output_formats: 322 | if isinstance(fmt, SeqWriter): 323 | fmt.writeseq(map(str, args)) 324 | 325 | 326 | Logger.DEFAULT = Logger.CURRENT = Logger(dir=None, output_formats=[HumanOutputFormat(sys.stdout)]) 327 | 328 | 329 | def configure(dir=None, format_strs=None): 330 | if dir is None: 331 | dir = os.getenv('OPENAI_LOGDIR') 332 | if dir is None: 333 | dir = os.path.join( 334 | tempfile.gettempdir(), 335 | datetime.datetime.now().strftime("openai-%Y-%m-%d-%H-%M-%S-%f")) 336 | assert isinstance(dir, str) 337 | os.makedirs(dir, exist_ok=True) 338 | 339 | if format_strs is None: 340 | strs = os.getenv('OPENAI_LOG_FORMAT') 341 | format_strs = strs.split(',') if strs else LOG_OUTPUT_FORMATS 342 | output_formats = [make_output_format(f, dir) for f in format_strs] 343 | 344 | Logger.CURRENT = Logger(dir=dir, output_formats=output_formats) 345 | log('Logging to %s' % dir) 346 | 347 | 348 | def reset(): 349 | if Logger.CURRENT is not Logger.DEFAULT: 350 | Logger.CURRENT.close() 351 | Logger.CURRENT = Logger.DEFAULT 352 | log('Reset logger') 353 | 354 | 355 | class scoped_configure(object): 356 | def __init__(self, dir=None, format_strs=None): 357 | self.dir = dir 358 | self.format_strs = format_strs 359 | self.prevlogger = None 360 | 361 | def __enter__(self): 362 | self.prevlogger = Logger.CURRENT 363 | configure(dir=self.dir, format_strs=self.format_strs) 364 | 365 | def __exit__(self, *args): 366 | Logger.CURRENT.close() 367 | Logger.CURRENT = self.prevlogger 368 | 369 | 370 | def _demo(): 371 | info("hi") 372 | debug("shouldn't appear") 373 | set_level(DEBUG) 374 | debug("should appear") 375 | dir = "/tmp/testlogging" 376 | if os.path.exists(dir): 377 | shutil.rmtree(dir) 378 | configure(dir=dir) 379 | logkv("a", 3) 380 | logkv("b", 2.5) 381 | dumpkvs() 382 | logkv("b", -2.5) 383 | logkv("a", 5.5) 384 | dumpkvs() 385 | info("^^^ should see a = 5.5") 386 | 387 | logkv("b", -2.5) 388 | dumpkvs() 389 | 390 | logkv("a", "longasslongasslongasslongasslongasslongassvalue") 391 | dumpkvs() 392 | 393 | 394 | # ================================================================ 395 | # Readers 396 | # ================================================================ 397 | def read_json(fname): 398 | import pandas 399 | ds = [] 400 | with open(fname, 'rt') as fh: 401 | for line in fh: 
402 | ds.append(json.loads(line)) 403 | return pandas.DataFrame(ds) 404 | 405 | 406 | def read_csv(fname): 407 | import pandas 408 | return pandas.read_csv(fname, index_col=None, comment='#') 409 | 410 | 411 | def read_tb(path): 412 | """ 413 | path : a tensorboard file OR a directory, where we will find all TB files 414 | of the form events.* 415 | """ 416 | import pandas 417 | import numpy as np 418 | from glob import glob 419 | from collections import defaultdict 420 | import tensorflow as tf 421 | if os.path.isdir(path): 422 | fnames = glob(os.path.join(path, "events.*")) 423 | elif os.path.basename(path).startswith("events."): 424 | fnames = [path] 425 | else: 426 | raise NotImplementedError( 427 | "Expected tensorboard file or directory containing them. Got %s" % path) 428 | tag2pairs = defaultdict(list) 429 | maxstep = 0 430 | for fname in fnames: 431 | for summary in tf.train.summary_iterator(fname): 432 | if summary.step > 0: 433 | for v in summary.summary.value: 434 | pair = (summary.step, v.simple_value) 435 | tag2pairs[v.tag].append(pair) 436 | maxstep = max(summary.step, maxstep) 437 | data = np.empty((maxstep, len(tag2pairs))) 438 | data[:] = np.nan 439 | tags = sorted(tag2pairs.keys()) 440 | for (colidx, tag) in enumerate(tags): 441 | pairs = tag2pairs[tag] 442 | for (step, value) in pairs: 443 | data[step - 1, colidx] = value 444 | return pandas.DataFrame(data, columns=tags) 445 | 446 | 447 | if __name__ == "__main__": 448 | _demo() 449 | -------------------------------------------------------------------------------- /algorithms/PPO/train_PPO.py: -------------------------------------------------------------------------------- 1 | import os 2 | import time 3 | import logger 4 | import random 5 | import tensorflow as tf 6 | import gym 7 | import numpy as np 8 | from collections import deque 9 | 10 | from config import args 11 | from utils import set_global_seeds, sf01, explained_variance 12 | from agent import PPO 13 | from env_wrapper import make_env 14 | 15 | 16 | def main(): 17 | env = make_env() 18 | set_global_seeds(env, args.seed) 19 | 20 | agent = PPO(env=env) 21 | 22 | batch_steps = args.n_envs * args.batch_steps # number of steps per update 23 | 24 | if args.save_interval and logger.get_dir(): 25 | # some saving jobs 26 | pass 27 | 28 | ep_info_buffer = deque(maxlen=100) 29 | t_train_start = time.time() 30 | n_updates = args.n_steps // batch_steps 31 | runner = Runner(env, agent) 32 | 33 | for update in range(1, n_updates + 1): 34 | t_start = time.time() 35 | frac = 1.0 - (update - 1.0) / n_updates 36 | lr_now = args.lr # maybe dynamic change 37 | clip_range_now = args.clip_range # maybe dynamic change 38 | obs, returns, masks, acts, vals, neglogps, advs, rewards, ep_infos = \ 39 | runner.run(args.batch_steps, frac) 40 | ep_info_buffer.extend(ep_infos) 41 | loss_infos = [] 42 | 43 | idxs = np.arange(batch_steps) 44 | for _ in range(args.n_epochs): 45 | np.random.shuffle(idxs) 46 | for start in range(0, batch_steps, args.minibatch): 47 | end = start + args.minibatch 48 | mb_idxs = idxs[start: end] 49 | minibatch = [arr[mb_idxs] for arr in [obs, returns, masks, acts, vals, neglogps, advs]] 50 | loss_infos.append(agent.train(lr_now, clip_range_now, *minibatch)) 51 | 52 | t_now = time.time() 53 | time_this_batch = t_now - t_start 54 | if update % args.log_interval == 0: 55 | ev = float(explained_variance(vals, returns)) 56 | logger.logkv('updates', str(update) + '/' + str(n_updates)) 57 | logger.logkv('serial_steps', update * args.batch_steps) 58 | 
logger.logkv('total_steps', update * batch_steps) 59 | logger.logkv('time', time_this_batch) 60 | logger.logkv('fps', int(batch_steps / (t_now - t_start))) 61 | logger.logkv('total_time', t_now - t_train_start) 62 | logger.logkv("explained_variance", ev) 63 | logger.logkv('avg_reward', np.mean([e['r'] for e in ep_info_buffer])) 64 | logger.logkv('avg_ep_len', np.mean([e['l'] for e in ep_info_buffer])) 65 | logger.logkv('adv_mean', np.mean(returns - vals)) 66 | logger.logkv('adv_variance', np.std(returns - vals)**2) 67 | loss_infos = np.mean(loss_infos, axis=0) 68 | for loss_name, loss_info in zip(agent.loss_names, loss_infos): 69 | logger.logkv(loss_name, loss_info) 70 | logger.dumpkvs() 71 | 72 | if args.save_interval and update % args.save_interval == 0 and logger.get_dir(): 73 | pass 74 | env.close() 75 | 76 | 77 | class Runner: 78 | 79 | def __init__(self, env, agent): 80 | self.env = env 81 | self.agent = agent 82 | self.obs = np.zeros((args.n_envs,) + env.observation_space.shape, dtype=np.float32) 83 | self.obs[:] = env.reset() 84 | self.dones = [False for _ in range(args.n_envs)] 85 | 86 | def run(self, batch_steps, frac): 87 | b_obs, b_rewards, b_actions, b_values, b_dones, b_neglogps = [], [], [], [], [], [] 88 | ep_infos = [] 89 | 90 | for s in range(batch_steps): 91 | actions, values, neglogps = self.agent.step(self.obs, self.dones) 92 | b_obs.append(self.obs.copy()) 93 | b_actions.append(actions) 94 | b_values.append(values) 95 | b_neglogps.append(neglogps) 96 | b_dones.append(self.dones) 97 | self.obs[:], rewards, self.dones, infos = self.env.step(actions) 98 | for info in infos: 99 | maybeinfo = info.get('episode') 100 | if maybeinfo: 101 | ep_infos.append(maybeinfo) 102 | b_rewards.append(rewards) 103 | # batch of steps to batch of rollouts 104 | b_obs = np.asarray(b_obs, dtype=self.obs.dtype) 105 | b_rewards = np.asarray(b_rewards, dtype=np.float32) 106 | b_actions = np.asarray(b_actions) 107 | b_values = np.asarray(b_values, dtype=np.float32) 108 | b_neglogps = np.asarray(b_neglogps, dtype=np.float32) 109 | b_dones = np.asarray(b_dones, dtype=np.bool) 110 | last_values = self.agent.get_value(self.obs, self.dones) 111 | 112 | b_returns = np.zeros_like(b_rewards) 113 | b_advs = np.zeros_like(b_rewards) 114 | lastgaelam = 0 115 | for t in reversed(range(batch_steps)): 116 | if t == batch_steps - 1: 117 | mask = 1.0 - self.dones 118 | nextvalues = last_values 119 | else: 120 | mask = 1.0 - b_dones[t + 1] 121 | nextvalues = b_values[t + 1] 122 | delta = b_rewards[t] + args.gamma * nextvalues * mask - b_values[t] 123 | b_advs[t] = lastgaelam = delta + args.gamma * args.lam * mask * lastgaelam 124 | b_returns = b_advs + b_values 125 | 126 | return (*map(sf01, (b_obs, b_returns, b_dones, b_actions, b_values, b_neglogps, b_advs, b_rewards)), ep_infos) 127 | 128 | 129 | if __name__ == '__main__': 130 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 131 | logger.configure() 132 | main() 133 | -------------------------------------------------------------------------------- /algorithms/PPO/utils.py: -------------------------------------------------------------------------------- 1 | import scipy.signal 2 | import numpy as np 3 | import random 4 | import tensorflow as tf 5 | 6 | 7 | def set_global_seeds(env, seed): 8 | tf.set_random_seed(seed) 9 | np.random.seed(seed) 10 | random.seed(seed) 11 | env_seeds = np.random.randint(low=0, high=1e6, size=env.num_envs) 12 | env.set_random_seed(env_seeds) 13 | 14 | 15 | class RunningMeanStd(object): 16 | 17 | def __init__(self, epsilon=1e-4, shape=()): 
18 | self.mean = np.zeros(shape, 'float64') 19 | self.var = np.ones(shape, 'float64') 20 | self.count = epsilon 21 | 22 | def update(self, x): 23 | batch_mean = np.mean(x, axis=0) 24 | batch_var = np.var(x, axis=0) 25 | batch_count = x.shape[0] 26 | 27 | delta = batch_mean - self.mean 28 | tot_count = self.count + batch_count 29 | 30 | new_mean = self.mean + delta * batch_count / tot_count 31 | m_a = self.var * (self.count) 32 | m_b = batch_var * (batch_count) 33 | M2 = m_a + m_b + np.square(delta) * self.count * batch_count / (self.count + batch_count) 34 | new_var = M2 / (self.count + batch_count) 35 | 36 | new_count = batch_count + self.count 37 | 38 | self.mean = new_mean 39 | self.var = new_var 40 | self.count = new_count 41 | 42 | 43 | def sf01(arr): 44 | """ 45 | swap and then flatten axes 0 and 1 46 | """ 47 | s = arr.shape 48 | return arr.swapaxes(0, 1).reshape(s[0] * s[1], *s[2:]) 49 | 50 | 51 | def discount(x, gamma): 52 | return scipy.signal.lfilter([1.0], [1.0, -gamma], x[::-1], axis=0)[::-1] 53 | 54 | 55 | # ================================================================ 56 | # Network components 57 | # ================================================================ 58 | def ortho_init(scale=1.0): 59 | def _ortho_init(shape, dtype, partition_info=None): 60 | shape = tuple(shape) 61 | if len(shape) == 2: 62 | flat_shape = shape 63 | elif len(shape) == 4: 64 | flat_shape = (np.prod(shape[:-1]), shape[-1]) 65 | else: 66 | raise NotImplementedError 67 | 68 | a = np.random.normal(0.0, 1.0, flat_shape) 69 | u, _, v = np.linalg.svd(a, full_matrices=False) 70 | q = u if u.shape == flat_shape else v 71 | q = q.reshape(shape) 72 | return (scale * q[:shape[0], :shape[1]]).astype(np.float32) 73 | return _ortho_init 74 | 75 | 76 | def fc(x, out_dim, activation_fn=tf.nn.relu, init_scale=1.0, scope=''): 77 | with tf.variable_scope(scope): 78 | in_dim = x.get_shape()[1].value 79 | w = tf.get_variable('w', [in_dim, out_dim], initializer=ortho_init(init_scale)) 80 | b = tf.get_variable('b', [out_dim], initializer=tf.constant_initializer(0.0)) 81 | z = tf.matmul(x, w) + b 82 | h = activation_fn(z) if activation_fn else z 83 | return h 84 | 85 | # ================================================================ 86 | # Tensorflow math utils 87 | # ================================================================ 88 | clip = tf.clip_by_value 89 | 90 | def sum(x, axis=None, keepdims=False): 91 | axis = None if axis is None else [axis] 92 | return tf.reduce_sum(x, axis=axis, keep_dims=keepdims) 93 | 94 | def mean(x, axis=None, keepdims=False): 95 | axis = None if axis is None else [axis] 96 | return tf.reduce_mean(x, axis=axis, keep_dims=keepdims) 97 | 98 | def var(x, axis=None, keepdims=False): 99 | meanx = mean(x, axis=axis, keepdims=keepdims) 100 | return mean(tf.square(x - meanx), axis=axis, keepdims=keepdims) 101 | 102 | def std(x, axis=None, keepdims=False): 103 | return tf.sqrt(var(x, axis=axis, keepdims=keepdims)) 104 | 105 | def max(x, axis=None, keepdims=False): 106 | axis = None if axis is None else [axis] 107 | return tf.reduce_max(x, axis=axis, keep_dims=keepdims) 108 | 109 | def min(x, axis=None, keepdims=False): 110 | axis = None if axis is None else [axis] 111 | return tf.reduce_min(x, axis=axis, keep_dims=keepdims) 112 | 113 | def concatenate(arrs, axis=0): 114 | return tf.concat(axis=axis, values=arrs) 115 | 116 | def argmax(x, axis=None): 117 | return tf.argmax(x, axis=axis) 118 | 119 | def switch(condition, then_expression, else_expression): 120 | """Switches between two 
operations depending on a scalar value (int or bool). 121 | Note that both `then_expression` and `else_expression` 122 | should be symbolic tensors of the *same shape*. 123 | 124 | # Arguments 125 | condition: scalar tensor. 126 | then_expression: TensorFlow operation. 127 | else_expression: TensorFlow operation. 128 | """ 129 | x_shape = then_expression.get_shape() # TensorShape need not be copied (the copy module is not imported here) 130 | x = tf.cond(tf.cast(condition, 'bool'), 131 | lambda: then_expression, 132 | lambda: else_expression) 133 | x.set_shape(x_shape) 134 | return x 135 | 136 | # ================================================================ 137 | # Math utils 138 | # ================================================================ 139 | def explained_variance(pred_y, y): 140 | """ 141 | Computes fraction of variance that pred_y explains about y. 142 | Returns 1 - Var[y-pred_y] / Var[y] 143 | 144 | Interpretation: 145 | ev=0 => might as well have predicted zero 146 | ev=1 => perfect prediction 147 | ev<0 => worse than just predicting zero 148 | 149 | """ 150 | assert y.ndim == 1 and pred_y.ndim == 1 151 | var_y = np.var(y) 152 | return np.nan if var_y == 0 else 1 - np.var(y - pred_y) / var_y 153 | -------------------------------------------------------------------------------- /algorithms/REINFORCE/README.md: -------------------------------------------------------------------------------- 1 | 2 | ## REINFORCE 3 | REINFORCE belongs to the family of Policy Gradient methods, which directly parameterize the policy rather than a state-value function. 4 | For more details about REINFORCE and other policy gradient algorithms, refer to Chap. 13 of [Reinforcement Learning: An Introduction 2nd Edition](http://webdocs.cs.ualberta.ca/~sutton/book/the-book.html) 5 | 6 | Here we use the REINFORCE method to solve the game of Pong. 7 | 8 | ## Pong 9 | Pong is an Atari game in which the player controls one of the paddles (the other is controlled by a built-in AI) and tries to bounce the ball past the opponent. In the reinforcement learning setting, the state is the raw pixel frame and the action moves the paddle UP or DOWN.
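To make the policy-gradient idea above concrete, here is a minimal sketch of a single REINFORCE update for a softmax policy over discrete actions. It is illustrative only and not code from this repository; the function name and the tiny example are made up. The repository's `agent.py` does the equivalent in Tensorflow by scaling the cross-entropy gradients with the (standardized) discounted rewards.

```python
import numpy as np

def reinforce_logit_gradient(logits, action, return_t):
    """Gradient of return_t * log softmax(logits)[action] w.r.t. the logits."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    grad_logp = -probs
    grad_logp[action] += 1.0  # d log pi(a|s) / d logits = onehot(a) - probs
    return return_t * grad_logp

logits = np.zeros(6)  # 6 discrete actions, as in Pong
logits += 0.01 * reinforce_logit_gradient(logits, action=2, return_t=1.0)
print(logits)  # the logit of the action followed by a positive return increases
```

Actions followed by positive discounted returns become more probable and actions followed by negative returns become less probable, which is exactly the REINFORCE update rule.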
10 | 11 | pong 12 | 13 | 14 | ## Requirements 15 | * [Numpy](http://www.numpy.org/) 16 | * [Tensorflow](http://www.tensorflow.org) 17 | * [gym](https://gym.openai.com) 18 | 19 | ## Run 20 | python train_REINFORCE.py 21 | -------------------------------------------------------------------------------- /algorithms/REINFORCE/agent.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import tensorflow as tf 3 | 4 | 5 | class REINFORCE: 6 | 7 | def __init__(self, input_dim, hidden_units, action_dim): 8 | self.input_dim = input_dim 9 | self.hidden_units = hidden_units 10 | self.action_dim = action_dim 11 | self.gamma = 0.99 12 | self.max_gradient = 5 13 | 14 | self.state_buffer = [] 15 | self.reward_buffer = [] 16 | self.action_buffer = [] 17 | 18 | @staticmethod 19 | def get_session(device): 20 | if device == -1: # use CPU 21 | device = '/cpu:0' 22 | sess_config = tf.ConfigProto() 23 | else: # use GPU 24 | device = '/gpu:' + str(device) 25 | sess_config = tf.ConfigProto( 26 | log_device_placement=True, 27 | allow_soft_placement=True) 28 | sess_config.gpu_options.allow_growth = True 29 | sess = tf.Session(config=sess_config) 30 | return sess, device 31 | 32 | def construct_model(self, gpu): 33 | self.sess, device = self.get_session(gpu) 34 | 35 | with tf.device(device): 36 | # construct network 37 | self.input_state = tf.placeholder( 38 | tf.float32, [None, self.input_dim]) 39 | w1 = tf.Variable(tf.div(tf.random_normal( 40 | [self.input_dim, self.hidden_units]), 41 | np.sqrt(self.input_dim))) 42 | b1 = tf.Variable(tf.constant(0.0, shape=[self.hidden_units])) 43 | h1 = tf.nn.relu(tf.matmul(self.input_state, w1) + b1) 44 | w2 = tf.Variable(tf.div( 45 | tf.random_normal([self.hidden_units, self.action_dim]), 46 | np.sqrt(self.hidden_units))) 47 | b2 = tf.Variable(tf.constant(0.0, shape=[self.action_dim])) 48 | 49 | self.logp = tf.matmul(h1, w2) + b2 50 | 51 | self.discounted_rewards = tf.placeholder(tf.float32, [None, ]) 52 | self.taken_actions = tf.placeholder(tf.int32, [None, ]) 53 | 54 | # optimizer 55 | self.optimizer = tf.train.RMSPropOptimizer( 56 | learning_rate=1e-4, decay=0.99) 57 | # loss 58 | self.loss = tf.reduce_mean( 59 | tf.nn.sparse_softmax_cross_entropy_with_logits( 60 | logits=self.logp, labels=self.taken_actions)) 61 | # gradient 62 | self.gradient = self.optimizer.compute_gradients(self.loss) 63 | # policy gradient 64 | for i, (grad, var) in enumerate(self.gradient): 65 | if grad is not None: 66 | pg_grad = grad * self.discounted_rewards 67 | # gradient clipping 68 | pg_grad = tf.clip_by_value( 69 | pg_grad, -self.max_gradient, self.max_gradient) 70 | self.gradient[i] = (pg_grad, var) 71 | # train operation (apply gradient) 72 | self.train_op = self.optimizer.apply_gradients(self.gradient) 73 | 74 | def sample_action(self, state): 75 | 76 | def softmax(x): 77 | max_x = np.amax(x) 78 | e = np.exp(x - max_x) 79 | return e / np.sum(e) 80 | 81 | logp = self.sess.run(self.logp, {self.input_state: state})[0] 82 | prob = softmax(logp) - 1e-5 83 | action = np.argmax(np.random.multinomial(1, prob)) 84 | return action 85 | 86 | def update_model(self): 87 | discounted_rewards = self.reward_discount() 88 | episode_steps = len(discounted_rewards) 89 | 90 | for s in reversed(range(episode_steps)): 91 | state = self.state_buffer[s][np.newaxis, :] 92 | action = np.array([self.action_buffer[s]]) 93 | reward = np.array([discounted_rewards[s]]) 94 | self.sess.run(self.train_op, { 95 | self.input_state: state, 96 | self.taken_actions: 
action, 97 | self.discounted_rewards: reward 98 | }) 99 | 100 | # cleanup job 101 | self.state_buffer = [] 102 | self.reward_buffer = [] 103 | self.action_buffer = [] 104 | 105 | def store_rollout(self, state, action, reward): 106 | self.action_buffer.append(action) 107 | self.reward_buffer.append(reward) 108 | self.state_buffer.append(state) 109 | 110 | def reward_discount(self): 111 | r = self.reward_buffer 112 | d_r = np.zeros_like(r) 113 | running_add = 0 114 | for t in range(len(r))[::-1]: 115 | if r[t] != 0: 116 | running_add = 0 # game boundary. reset the running add 117 | running_add = r[t] + running_add * self.gamma 118 | d_r[t] += running_add 119 | # standardize the rewards 120 | d_r -= np.mean(d_r) 121 | d_r /= np.std(d_r) 122 | return d_r 123 | -------------------------------------------------------------------------------- /algorithms/REINFORCE/evaluation.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | 3 | import gym 4 | import numpy as np 5 | import tensorflow as tf 6 | from agent import REINFORCE 7 | 8 | def main(args): 9 | 10 | def preprocess(obs): 11 | obs = obs[35:195] 12 | obs = obs[::2, ::2, 0] 13 | obs[obs == 144] = 0 14 | obs[obs == 109] = 0 15 | obs[obs != 0] = 1 16 | 17 | return obs.astype(np.float).ravel() 18 | 19 | INPUT_DIM = 80 * 80 20 | HIDDEN_UNITS = 200 21 | ACTION_DIM = 6 22 | 23 | # load agent 24 | agent = REINFORCE(INPUT_DIM, HIDDEN_UNITS, ACTION_DIM) 25 | agent.construct_model(args.gpu) 26 | 27 | # load model or init a new 28 | saver = tf.train.Saver(max_to_keep=1) 29 | if args.model_path is not None: 30 | # reuse saved model 31 | saver.restore(agent.sess, args.model_path) 32 | else: 33 | # build a new model 34 | agent.sess.run(tf.global_variables_initializer()) # the REINFORCE class has no init_var(); initialize variables directly 35 | 36 | # load env 37 | env = gym.make('Pong-v0') 38 | 39 | # evaluation 40 | for ep in range(args.ep): 41 | # reset env 42 | total_rewards = 0 43 | state = env.reset() 44 | 45 | while True: 46 | env.render() 47 | # preprocess 48 | state = preprocess(state) 49 | # sample actions 50 | action = agent.sample_action(state[np.newaxis, :]) 51 | # act! 52 | next_state, reward, done, _ = env.step(action) 53 | total_rewards += reward 54 | # state shift 55 | state = next_state 56 | if done: 57 | break 58 | 59 | print('Ep%s Reward: %s ' % (ep+1, total_rewards)) 60 | 61 | 62 | def args_parse(): 63 | parser = argparse.ArgumentParser() 64 | parser.add_argument( 65 | '--model_path', default=None, 66 | help='Whether to use a saved model.
(*None|model path)') 67 | parser.add_argument( 68 | '--gpu', default=-1, 69 | help='running on a specify gpu, -1 indicates using cpu') 70 | parser.add_argument( 71 | '--ep', default=1, help='Test episodes') 72 | return parser.parse_args() 73 | 74 | 75 | if __name__ == '__main__': 76 | main(args_parse()) 77 | -------------------------------------------------------------------------------- /algorithms/REINFORCE/train_REINFORCE.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import os 3 | 4 | import gym 5 | import numpy as np 6 | import tensorflow as tf 7 | 8 | from agent import REINFORCE 9 | 10 | 11 | def main(args): 12 | 13 | def preprocess(obs): 14 | obs = obs[35:195] 15 | obs = obs[::2, ::2, 0] 16 | obs[obs == 144] = 0 17 | obs[obs == 109] = 0 18 | obs[obs != 0] = 1 19 | 20 | return obs.astype(np.float).ravel() 21 | 22 | MODEL_PATH = args.model_path 23 | INPUT_DIM = 80 * 80 24 | HIDDEN_UNITS = 200 25 | ACTION_DIM = 6 26 | MAX_EPISODES = 20000 27 | MAX_STEPS = 5000 28 | 29 | # load agent 30 | agent = REINFORCE(INPUT_DIM, HIDDEN_UNITS, ACTION_DIM) 31 | agent.construct_model(args.gpu) 32 | 33 | # model saver 34 | saver = tf.train.Saver(max_to_keep=1) 35 | if MODEL_PATH is not None: 36 | saver.restore(agent.sess, args.model_path) 37 | ep_base = int(args.model_path.split('_')[-1]) 38 | mean_rewards = float(args.model_path.split('/')[-1].split('_')[0]) 39 | else: 40 | agent.sess.run(tf.global_variables_initializer()) 41 | ep_base = 0 42 | mean_rewards = None 43 | 44 | # load env 45 | env = gym.make('Pong-v0') 46 | # main loop 47 | for ep in range(MAX_EPISODES): 48 | # reset env 49 | total_rewards = 0 50 | state = env.reset() 51 | 52 | for step in range(MAX_STEPS): 53 | # preprocess 54 | state = preprocess(state) 55 | # sample actions 56 | action = agent.sample_action(state[np.newaxis, :]) 57 | # act! 58 | next_state, reward, done, _ = env.step(action) 59 | 60 | total_rewards += reward 61 | agent.store_rollout(state, action, reward) 62 | # state shift 63 | state = next_state 64 | 65 | if done: 66 | break 67 | 68 | # update model per episode 69 | agent.update_model() 70 | 71 | # logging 72 | if mean_rewards is None: 73 | mean_rewards = total_rewards 74 | else: 75 | mean_rewards = 0.99 * mean_rewards + 0.01 * total_rewards 76 | rounds = (21 - np.abs(total_rewards)) + 21 77 | average_steps = (step + 1) / rounds 78 | print('Ep%s: %d rounds \nAvg_steps: %.2f Reward: %s Avg_reward: %.4f' % 79 | (ep+1, rounds, average_steps, total_rewards, mean_rewards)) 80 | if ep > 0 and ep % 100 == 0: 81 | if not os.path.isdir(args.save_path): 82 | os.makedirs(args.save_path) 83 | save_name = str(round(mean_rewards, 2)) + '_' + str(ep_base+ep+1) 84 | saver.save(agent.sess, args.save_path + save_name) 85 | 86 | 87 | def args_parse(): 88 | parser = argparse.ArgumentParser() 89 | parser.add_argument( 90 | '--model_path', default=None, 91 | help='Whether to use a saved model. 
(*None|model path)') 92 | parser.add_argument( 93 | '--save_path', default='./model/', 94 | help='Path to save a model during training.') 95 | parser.add_argument( 96 | '--gpu', default=-1, 97 | help='running on a specific gpu, -1 indicates using cpu') 98 | return parser.parse_args() 99 | 100 | 101 | if __name__ == '__main__': 102 | main(args_parse()) 103 | -------------------------------------------------------------------------------- /algorithms/TD/README.md: -------------------------------------------------------------------------------- 1 | ### Temporal Difference 2 | 3 | Temporal Difference (TD) learning is a model-free reinforcement learning method that learns value estimates by bootstrapping from its own predictions. It can be seen as a combination of Monte Carlo (MC) and Dynamic Programming (DP) ideas. 4 | 5 | For more details about TD algorithms, please refer to Chap. 6 of [Reinforcement Learning: An Introduction 2nd Edition](http://incompleteideas.net/sutton/book/the-book-2nd.html) 6 | 7 | ### GridWorld 8 | 9 | GridWorld is a typical environment for tabular reinforcement learning. It has a 10x10 state space and an action space of {up, down, left, right} for moving the agent around. The environment contains a target point with a positive reward and several bomb points with -1 reward. We also add a small cost (-0.01) for every step the agent takes, so that it tends to find the optimal path faster. 10 | 11 | A typical GridWorld may look like this. 12 | 13 | grid 14 | 15 | In the GridWorld environment, the TD agent tries to figure out the optimal (i.e. the shortest) path to the target. 16 | 17 | 18 | ### Run 19 | 20 | ``` 21 | cd reinforce_py/algorithms/TD 22 | python train_TD.py --algorithm=qlearn/sarsa # Q-learning or SARSA 23 | ``` 24 | -------------------------------------------------------------------------------- /algorithms/TD/agents.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import random 3 | 4 | from utils import draw_episode_steps 5 | from utils import draw_grid 6 | 7 | 8 | class TDAgent(object): 9 | 10 | def __init__(self, env, epsilon, gamma, alpha=0.1): 11 | self.env = env 12 | self.gamma = gamma 13 | self.alpha = alpha 14 | self.epsilon = epsilon # explore & exploit 15 | self.init_epsilon = epsilon 16 | 17 | self.P = np.zeros((self.env.num_s, self.env.num_a)) 18 | 19 | self.V = np.zeros(self.env.num_s) 20 | self.Q = np.zeros((self.env.num_s, self.env.num_a)) 21 | 22 | self.step_set = [] # store steps of each episode 23 | self.avg_step_set = [] # store average steps of each 100 episodes 24 | self.episode = 1 25 | self.step = 0 26 | self.max_episodes = 5000 27 | 28 | # initialize random policy 29 | for s in range(self.env.num_s): 30 | poss = self.env.allow_actions(s) 31 | for a in poss: 32 | self.P[s][a] = 1.0 / len(poss) 33 | 34 | self.curr_s = None 35 | self.curr_a = None 36 | 37 | def predict(self, episode=1000): 38 | for e in range(episode): 39 | curr_s = self.env.reset() # new episode 40 | while not self.env.is_terminal(curr_s): # for every time step 41 | a = self.select_action(curr_s, policy='greedy') 42 | r = self.env.rewards(curr_s, a) 43 | next_s = self.env.next_state(curr_s, a) 44 | self.V[curr_s] += self.alpha \ 45 | * (r+self.gamma*self.V[next_s] - self.V[curr_s]) 46 | curr_s = next_s 47 | # result display 48 | draw_grid(self.env, self, p=True, v=True, r=True) 49 | 50 | def control(self, method): 51 | assert method in ("qlearn", "sarsa") 52 | 53 | if method == "qlearn": 54 | agent = Qlearn(self.env, self.epsilon, self.gamma) 55 | else: 56 | agent = SARSA(self.env, self.epsilon,
self.gamma) 57 | 58 | while agent.episode < self.max_episodes: 59 | agent.learn(agent.act()) 60 | 61 | # result display 62 | draw_grid(self.env, agent, p=True, v=True, r=True) 63 | # draw episode steps 64 | draw_episode_steps(agent.avg_step_set) 65 | 66 | def update_policy(self): 67 | # update according to Q value 68 | poss = self.env.allow_actions(self.curr_s) 69 | # Q values of all allowed actions 70 | qs = self.Q[self.curr_s][poss] 71 | q_maxs = [q for q in qs if q == max(qs)] 72 | # update probabilities 73 | for i, a in enumerate(poss): 74 | self.P[self.curr_s][a] = \ 75 | 1.0 / len(q_maxs) if qs[i] in q_maxs else 0.0 76 | 77 | def select_action(self, state, policy='egreedy'): 78 | poss = self.env.allow_actions(state) # possible actions 79 | if policy == 'egreedy' and random.random() < self.epsilon: 80 | a = random.choice(poss) 81 | else: # greedy action 82 | pros = self.P[state][poss] # probabilities for possible actions 83 | best_a_idx = [i for i, p in enumerate(pros) if p == max(pros)] 84 | a = poss[random.choice(best_a_idx)] 85 | return a 86 | 87 | 88 | class SARSA(TDAgent): 89 | 90 | def __init__(self, env, epsilon, gamma): 91 | super(SARSA, self).__init__(env, epsilon, gamma) 92 | self.reset_episode() 93 | 94 | def act(self): 95 | s = self.env.next_state(self.curr_s, self.curr_a) 96 | a = self.select_action(s, policy='egreedy') 97 | r = self.env.rewards(self.curr_s, self.curr_a) 98 | r -= 0.01 # a bit negative reward for every step 99 | return [self.curr_s, self.curr_a, r, s, a] 100 | 101 | def learn(self, exp): 102 | s, a, r, n_s, n_a = exp 103 | 104 | if self.env.is_terminal(s): 105 | target = r 106 | else: 107 | target = r + self.gamma * self.Q[n_s][n_a] 108 | self.Q[s][a] += self.alpha * (target - self.Q[s][a]) 109 | 110 | # update policy 111 | self.update_policy() 112 | 113 | if self.env.is_terminal(s): 114 | self.V = np.sum(self.Q, axis=1) 115 | print('episode %d step: %d epsilon: %f' % 116 | (self.episode, self.step, self.epsilon)) 117 | self.reset_episode() 118 | self.epsilon -= self.init_epsilon / 10000 119 | # record per 100 episode 120 | if self.episode % 100 == 0: 121 | self.avg_step_set.append( 122 | np.sum(self.step_set[self.episode-100: self.episode])/100) 123 | else: # shift state-action pair 124 | self.curr_s = n_s 125 | self.curr_a = n_a 126 | self.step += 1 127 | 128 | def reset_episode(self): 129 | # start a new episode 130 | self.curr_s = self.env.reset() 131 | self.curr_a = self.select_action(self.curr_s, policy='egreedy') 132 | self.episode += 1 133 | self.step_set.append(self.step) 134 | self.step = 0 135 | 136 | 137 | class Qlearn(TDAgent): 138 | 139 | def __init__(self, env, epsilon, gamma): 140 | super(Qlearn, self).__init__(env, epsilon, gamma) 141 | self.reset_episode() 142 | 143 | def act(self): 144 | a = self.select_action(self.curr_s, policy='egreedy') 145 | s = self.env.next_state(self.curr_s, a) 146 | r = self.env.rewards(self.curr_s, a) 147 | r -= 0.01 148 | return [self.curr_s, a, r, s] 149 | 150 | def learn(self, exp): 151 | s, a, r, n_s = exp 152 | 153 | # Q-learning magic 154 | if self.env.is_terminal(s): 155 | target = r 156 | else: 157 | target = r + self.gamma * max(self.Q[n_s]) 158 | self.Q[s][a] += self.alpha * (target - self.Q[s][a]) 159 | 160 | self.update_policy() 161 | # shift to next state 162 | if self.env.is_terminal(s): 163 | self.V = np.sum(self.Q, axis=1) 164 | print('episode %d step: %d' % (self.episode, self.step)) 165 | self.reset_episode() 166 | self.epsilon -= self.init_epsilon / self.max_episodes 167 | # record per 100 
episode 168 | if self.episode % 100 == 0: 169 | self.avg_step_set.append( 170 | np.sum(self.step_set[self.episode-100: self.episode])/100) 171 | else: 172 | self.curr_s = n_s 173 | self.step += 1 174 | 175 | def reset_episode(self): 176 | self.curr_s = self.env.reset() 177 | self.episode += 1 178 | self.step_set.append(self.step) 179 | self.step = 0 180 | -------------------------------------------------------------------------------- /algorithms/TD/envs.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | 4 | class GridWorld: 5 | 6 | def __init__(self): 7 | self.env_w = 10 8 | self.env_h = 10 9 | self.num_s = self.env_w * self.env_h 10 | self.num_a = 4 11 | 12 | r = np.zeros(self.num_s) 13 | w = np.zeros(self.num_s) 14 | 15 | self.target = np.array([27]) 16 | self.bomb = np.array([16, 25, 26, 28, 36, 40, 41, 48, 49, 64]) 17 | # make some walls 18 | self.wall = np.array([22, 32, 42, 52, 43, 45, 46, 47, 37]) 19 | r[self.target] = 10 20 | r[self.bomb] = -1 21 | r[self.wall] = 0 22 | w[self.wall] = 1 23 | 24 | self.W = w 25 | self.R = r # reward 26 | self.terminal = np.array(self.target) 27 | 28 | def rewards(self, s, a): 29 | return self.R[s] 30 | 31 | def allow_actions(self, s): 32 | # return allow actions in state s 33 | x = self.get_pos(s)[0] 34 | y = self.get_pos(s)[1] 35 | allow_a = np.array([], dtype='int') 36 | if y > 0 and self.W[s-self.env_w] != 1: 37 | allow_a = np.append(allow_a, 0) 38 | if y < self.env_h-1 and self.W[s+self.env_w] != 1: 39 | allow_a = np.append(allow_a, 1) 40 | if x > 0 and self.W[s-1] != 1: 41 | allow_a = np.append(allow_a, 2) 42 | if x < self.env_w-1 and self.W[s+1] != 1: 43 | allow_a = np.append(allow_a, 3) 44 | return allow_a 45 | 46 | def get_pos(self, s): 47 | # transform to coordinate (x, y) 48 | x = s % self.env_h 49 | y = s / self.env_w 50 | return x, y 51 | 52 | def next_state(self, s, a): 53 | # return next state in state s taking action a 54 | # in this deterministic environment it returns a certain state ns 55 | ns = 0 56 | if a == 0: 57 | ns = s - self.env_w 58 | if a == 1: 59 | ns = s + self.env_w 60 | if a == 2: 61 | ns = s - 1 62 | if a == 3: 63 | ns = s + 1 64 | return ns 65 | 66 | def is_terminal(self, s): 67 | return True if s in self.terminal else False 68 | 69 | def reset(self): 70 | return 0 # init state 71 | -------------------------------------------------------------------------------- /algorithms/TD/train_TD.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | 3 | from agents import TDAgent 4 | from envs import GridWorld 5 | 6 | 7 | def main(args): 8 | env = GridWorld() 9 | 10 | agent = TDAgent(env, epsilon=args.epsilon, gamma=args.discount, alpha=args.lr) 11 | agent.control(method=args.algorithm) 12 | 13 | 14 | if __name__ == '__main__': 15 | parser = argparse.ArgumentParser() 16 | parser.add_argument( 17 | '--algorithm', default='qlearn', help='(*qlearn | sarsa)') 18 | parser.add_argument( 19 | '--discount', type=float, default=0.9, help='discount factor') 20 | parser.add_argument( 21 | '--epsilon', type=float, default=0.3, 22 | help='parameter of epsilon greedy policy') 23 | parser.add_argument('--lr', type=float, default=0.05) 24 | main(parser.parse_args()) 25 | -------------------------------------------------------------------------------- /algorithms/TD/utils.py: -------------------------------------------------------------------------------- 1 | import matplotlib.pyplot as plt 2 | import numpy as np 3 | 4 | 5 | 
def draw_grid(env, agent, p=True, v=False, r=False): 6 | ''' 7 | Draw the policy(|value|reward setting) at the command prompt. 8 | ''' 9 | arrows = [u'\u2191', u'\u2193', u'\u2190', u'\u2192'] 10 | cliff = u'\u25C6' 11 | sign = {0: '-', 10: u'\u2713', -1: u'\u2717'} 12 | 13 | tp = [] # transform policy 14 | for s in range(env.num_s): 15 | for a in range(env.num_a): 16 | tp.append(agent.P[s][a]) 17 | best = [] # best action for each state at the moment 18 | for i in range(0, len(tp), env.num_a): 19 | a = tp[i:i+env.num_a] 20 | ba = np.argsort(-np.array(a))[0] 21 | best.append(ba) 22 | if r: 23 | print('\n') 24 | print('Environment setting:', end=' ') 25 | for i, r in enumerate(env.R): 26 | if i % env.env_w == 0: 27 | print('\n') 28 | if env.W[i] > 0: 29 | print('%1s' % cliff, end=' ') 30 | else: 31 | print('%1s' % sign[r], end=' ') 32 | print('\n') 33 | if p: 34 | print('Trained policy:', end=' ') 35 | for i, a in enumerate(best): 36 | if i % env.env_w == 0: 37 | print('\n') 38 | if env.W[i] == 1: 39 | print('%s' % cliff, end=' ') 40 | elif env.R[i] == 1: 41 | print('%s' % u'\u272A', end=' ') 42 | else: 43 | print('%s' % arrows[a], end=' ') 44 | print('\n') 45 | if v: 46 | print('Value function for each state:', end=' ') 47 | for i, v in enumerate(agent.V): 48 | if i % env.env_w == 0: 49 | print('\n') 50 | if env.W[i] == 1: 51 | print(' %-2s ' % cliff, end=' ') 52 | elif env.R[i] == 1: 53 | print('[%.1f]' % v, end=' ') 54 | else: 55 | print('%4.1f' % v, end=' ') 56 | print('\n') 57 | 58 | 59 | def draw_episode_steps(avg_step_set): 60 | plt.plot(np.arange(len(avg_step_set)), avg_step_set) 61 | plt.title('steps per episode') 62 | plt.xlabel('episode') 63 | plt.ylabel('steps') 64 | plt.axis([0, 80, 0, 200]) 65 | plt.show() 66 | -------------------------------------------------------------------------------- /images/cartpole.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/borgwang/reinforce_py/41f67327ae7e1bf87d4648e3ea5f406466c532c9/images/cartpole.png -------------------------------------------------------------------------------- /images/ddpg.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/borgwang/reinforce_py/41f67327ae7e1bf87d4648e3ea5f406466c532c9/images/ddpg.png -------------------------------------------------------------------------------- /images/doom.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/borgwang/reinforce_py/41f67327ae7e1bf87d4648e3ea5f406466c532c9/images/doom.png -------------------------------------------------------------------------------- /images/dqn.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/borgwang/reinforce_py/41f67327ae7e1bf87d4648e3ea5f406466c532c9/images/dqn.png -------------------------------------------------------------------------------- /images/gridworld.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/borgwang/reinforce_py/41f67327ae7e1bf87d4648e3ea5f406466c532c9/images/gridworld.png -------------------------------------------------------------------------------- /images/pong.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/borgwang/reinforce_py/41f67327ae7e1bf87d4648e3ea5f406466c532c9/images/pong.png 
-------------------------------------------------------------------------------- /images/ppo_losses.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/borgwang/reinforce_py/41f67327ae7e1bf87d4648e3ea5f406466c532c9/images/ppo_losses.png -------------------------------------------------------------------------------- /images/ppo_score.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/borgwang/reinforce_py/41f67327ae7e1bf87d4648e3ea5f406466c532c9/images/ppo_score.png -------------------------------------------------------------------------------- /images/walker2d.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/borgwang/reinforce_py/41f67327ae7e1bf87d4648e3ea5f406466c532c9/images/walker2d.gif -------------------------------------------------------------------------------- /images/walker2d.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/borgwang/reinforce_py/41f67327ae7e1bf87d4648e3ea5f406466c532c9/images/walker2d.png --------------------------------------------------------------------------------
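The TD README above describes the Q-learning and SARSA control methods only in prose. As a companion, here is a minimal, self-contained sketch of the two tabular update rules, written to mirror the structure of `algorithms/TD/agents.py` without depending on it. This is an illustration only, not the repository's implementation: the `ToyChain` environment, the `epsilon_greedy` and `td_control` helpers, and the hyperparameter values are invented for the example and do not exist in the repo.

```
import random

import numpy as np


class ToyChain:
    # Hypothetical 1-D environment: states 0..4 on a line, actions 0 (left) and 1 (right).
    # Reaching the right end gives +1 and ends the episode; every move costs -0.01,
    # mirroring the small step cost used by the GridWorld agents above.
    num_s, num_a = 5, 2

    def reset(self):
        return 0

    def step(self, s, a):
        ns = max(0, s - 1) if a == 0 else min(self.num_s - 1, s + 1)
        done = ns == self.num_s - 1
        reward = (1.0 if done else 0.0) - 0.01
        return ns, reward, done


def epsilon_greedy(Q, s, epsilon, num_a):
    # Behaviour policy shared by both methods: explore with probability epsilon.
    if random.random() < epsilon:
        return random.randrange(num_a)
    return int(np.argmax(Q[s]))


def td_control(method="qlearn", episodes=500, alpha=0.1, gamma=0.9, epsilon=0.3):
    env = ToyChain()
    Q = np.zeros((env.num_s, env.num_a))
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, epsilon, env.num_a)
        done = False
        while not done:
            ns, r, done = env.step(s, a)
            na = epsilon_greedy(Q, ns, epsilon, env.num_a)
            if done:
                target = r
            elif method == "qlearn":
                # Q-learning (off-policy): bootstrap from the greedy next action.
                target = r + gamma * np.max(Q[ns])
            else:
                # SARSA (on-policy): bootstrap from the action actually taken next.
                target = r + gamma * Q[ns][na]
            Q[s][a] += alpha * (target - Q[s][a])
            s, a = ns, na  # shift the state-action pair
    return Q


if __name__ == "__main__":
    print(td_control(method="qlearn"))
    print(td_control(method="sarsa"))
```

The only difference between the two methods is the bootstrap target: Q-learning uses the greedy value `max(Q[ns])` (off-policy), while SARSA uses the value of the action that will actually be executed next, `Q[ns][na]` (on-policy). This is the same branch that distinguishes `Qlearn.learn` from `SARSA.learn` in `algorithms/TD/agents.py`.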