├── .gitignore
├── LICENSE
├── README.md
├── algorithms
│   ├── A3C
│   │   ├── atari
│   │   │   ├── README.md
│   │   │   ├── atari_env.py
│   │   │   ├── atari_env_deprecated.py
│   │   │   ├── evaluate.py
│   │   │   ├── net.py
│   │   │   ├── train_A3C.py
│   │   │   ├── utils.py
│   │   │   └── worker.py
│   │   └── doom
│   │       ├── README.md
│   │       ├── basic.wad
│   │       ├── env_doom.py
│   │       ├── net.py
│   │       ├── train_A3C.py
│   │       ├── utils.py
│   │       └── worker.py
│   ├── Actor-Critic
│   │   ├── README.md
│   │   ├── agent.py
│   │   ├── evaluate.py
│   │   ├── train_actor_critic.py
│   │   └── utils.py
│   ├── CEM
│   │   ├── CEM.py
│   │   └── README.md
│   ├── DDPG
│   │   ├── README.md
│   │   ├── agent.py
│   │   ├── evaluate.py
│   │   ├── ou_noise.py
│   │   └── train_ddpg.py
│   ├── DQN
│   │   ├── README.md
│   │   ├── agent.py
│   │   ├── evaluation.py
│   │   └── train_DQN.py
│   ├── PG
│   │   ├── agent.py
│   │   ├── run.py
│   │   └── sync.sh
│   ├── PPO
│   │   ├── README.md
│   │   ├── agent.py
│   │   ├── config.py
│   │   ├── distributions.py
│   │   ├── env_wrapper.py
│   │   ├── logger.py
│   │   ├── train_PPO.py
│   │   └── utils.py
│   ├── REINFORCE
│   │   ├── README.md
│   │   ├── agent.py
│   │   ├── evaluation.py
│   │   └── train_REINFORCE.py
│   └── TD
│       ├── README.md
│       ├── agents.py
│       ├── envs.py
│       ├── train_TD.py
│       └── utils.py
└── images
    ├── cartpole.png
    ├── ddpg.png
    ├── doom.png
    ├── dqn.png
    ├── gridworld.png
    ├── pong.png
    ├── ppo_losses.png
    ├── ppo_score.png
    ├── walker2d.gif
    └── walker2d.png
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 |
6 | # C extensions
7 | *.so
8 |
9 | # Distribution / packaging
10 | .Python
11 | env/
12 | build/
13 | develop-eggs/
14 | dist/
15 | downloads/
16 | eggs/
17 | .eggs/
18 | lib/
19 | lib64/
20 | parts/
21 | sdist/
22 | var/
23 | wheels/
24 | *.egg-info/
25 | .installed.cfg
26 | *.egg
27 |
28 | # Installer logs
29 | pip-log.txt
30 | pip-delete-this-directory.txt
31 |
32 | # Unit test / coverage reports
33 | htmlcov/
34 | .tox/
35 | .coverage
36 | .coverage.*
37 | .cache
38 | nosetests.xml
39 | coverage.xml
40 | *,cover
41 | .hypothesis/
42 |
43 | # Translations
44 | *.mo
45 | *.pot
46 |
47 | # Scrapy stuff:
48 | .scrapy
49 |
50 | # Sphinx documentation
51 | docs/_build/
52 |
53 | # PyBuilder
54 | target/
55 |
56 | # Jupyter Notebook
57 | .ipynb_checkpoints
58 |
59 | # pyenv
60 | .python-version
61 |
62 | # celery beat schedule file
63 | celerybeat-schedule
64 |
65 | # SageMath parsed files
66 | *.sage.py
67 |
68 | # dotenv
69 | .env
70 |
71 | # virtualenv
72 | .venv
73 | venv/
74 | ENV/
75 |
76 | # Spyder project settings
77 | .spyderproject
78 |
79 | # Rope project settings
80 | .ropeproject
81 |
82 | # mkdocs documentation
83 | /site
84 |
85 | # ignore saved models
86 | models/
87 | model/
88 |
89 | *.swp
90 | .vscode
91 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2018 borgwang
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Reinforcement learning in Python
2 |
3 | Implementations of popular Reinforcement Learning algorithms in Python and TensorFlow.
4 |
5 | * Value-based Methods
6 | * [Tabular TD-Learning](https://github.com/borgwang/reinforce_py/tree/master/algorithms/TD)
7 | * [DQN&DDQN](https://github.com/borgwang/reinforce_py/tree/master/algorithms/DQN)
8 | * Policy-based Methods
9 | * [REINFORCE](https://github.com/borgwang/reinforce_py/tree/master/algorithms/REINFORCE)
10 | * [DDPG](https://github.com/borgwang/reinforce_py/tree/master/algorithms/DDPG)
11 | * Combine Policy-based and Value-based
12 | * [Actor-Critic](https://github.com/borgwang/reinforce_py/tree/master/algorithms/Actor-Critic)
13 | * [A3C](https://github.com/borgwang/reinforce_py/tree/master/algorithms/A3C/doom)
14 | * [PPO](https://github.com/borgwang/reinforce_py/tree/master/algorithms/PPO)
15 | * Derivative-free Methods
16 | * [CEM](https://github.com/borgwang/reinforce_py/tree/master/algorithms/CEM)
17 | * [Evolution Strategies](https://github.com/borgwang/evolution-strategy) (linked to a stand-alone repository)
18 |
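Each algorithm lives in its own directory under `algorithms/`, typically with a README and a `train_*.py` entry script that is meant to be run from inside that directory (the scripts use local imports). For example, assuming the default command-line arguments are acceptable:

    cd algorithms/A3C/atari
    python train_A3C.py        # add -h to list the available options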
--------------------------------------------------------------------------------
/algorithms/A3C/atari/README.md:
--------------------------------------------------------------------------------
1 | ## Asynchronous Advantage Actor-Critic (A3C)
2 | Implementation of the A3C method proposed by Google DeepMind.
3 |
4 | Related papers:
5 | * [Asynchronous Methods for Deep Reinforcement Learning](http://diyhpl.us/~bryan/papers2/ai/machine-learning/Asynchronous%20methods%20for%20deep%20reinforcement%20learning%20-%202016.pdf)
6 |
7 |
8 | ## Requirements
9 | * [Numpy](http://www.numpy.org/)
10 | * [Tensorflow](http://www.tensorflow.org)
11 | * [gym](https://gym.openai.com)
12 |
13 | ## Run
14 | python train_A3C.py
15 | python train_A3C.py -h # show all optional arguments
16 |
17 | ## Components
18 | `train_A3C.py` creates a master (global) network and multiple worker (local) networks.
19 | `worker.py` implements the worker class.
20 | `net.py` constructs the Actor-Critic network.
21 | `atari_env.py` is a wrapper around the gym Atari environment.
22 |
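As a rough sketch of how a worker turns a rollout into training targets (mirroring `reward_discount` in `utils.py` and `Worker._train` in `worker.py`; the reward/value numbers below are only illustrative):

```python
import numpy as np
import scipy.signal


def reward_discount(x, gamma):
    # discounted cumulative sum, as in utils.reward_discount
    return scipy.signal.lfilter([1], [1, -gamma], x[::-1], axis=0)[::-1]


rewards = np.array([0.0, 0.0, 1.0])   # rewards collected in a t_max-step rollout
values = np.array([0.1, 0.2, 0.6])    # V(s) predicted by the local network
bootstrap_value = 0.5                 # V(s_T) if the rollout did not end the episode, else 0.0
gamma = 0.99

rewards_plus = np.append(rewards, bootstrap_value)
discounted_rewards = reward_discount(rewards_plus, gamma)[:-1]  # critic targets (target_v)
advantages = discounted_rewards - values                        # weights for the policy loss
```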
23 | ## Note
24 | Still buggy currently. WIP.
25 |
--------------------------------------------------------------------------------
/algorithms/A3C/atari/atari_env.py:
--------------------------------------------------------------------------------
1 | # Code borrowed from OpenAI/baselines (https://github.com/openai/baselines)
2 | # Copyright (c) 2017 OpenAI (http://openai.com)
3 |
4 | import numpy as np
5 | import gym
6 | import os
7 |
8 | from collections import deque
9 | from PIL import Image
10 | from gym import spaces
11 |
12 |
13 | DEFAULT_ENV = 'BreakoutNoFrameskip-v4'
14 | RESOLUTION = 84
15 | S_DIM = [RESOLUTION, RESOLUTION, 1]
16 | A_DIM = gym.make(DEFAULT_ENV).action_space.n
17 |
18 |
19 | class NoopResetEnv(gym.Wrapper):
20 | def __init__(self, env, noop_max=30):
21 | """Sample initial states by taking random number of no-ops on reset.
22 | No-op is assumed to be action 0.
23 | """
24 | gym.Wrapper.__init__(self, env)
25 | self.noop_max = noop_max
26 | self.override_num_noops = None
27 | assert env.unwrapped.get_action_meanings()[0] == 'NOOP'
28 |
29 | def _reset(self):
30 | """ Do no-op action for a number of steps in [1, noop_max]."""
31 | self.env.reset()
32 | if self.override_num_noops is not None:
33 | noops = self.override_num_noops
34 | else:
35 | noops = self.unwrapped.np_random.randint(1, self.noop_max + 1)
36 | assert noops > 0
37 | obs = None
38 | for _ in range(noops):
39 | obs, _, done, _ = self.env.step(0)
40 | if done:
41 | obs = self.env.reset()
42 | return obs
43 |
44 |
45 | class FireResetEnv(gym.Wrapper):
46 | def __init__(self, env):
47 | """Take action on reset for environments that are fixed until firing."""
48 | gym.Wrapper.__init__(self, env)
49 | assert env.unwrapped.get_action_meanings()[1] == 'FIRE'
50 | assert len(env.unwrapped.get_action_meanings()) >= 3
51 |
52 | def _reset(self):
53 | self.env.reset()
54 | obs, _, done, _ = self.env.step(1)
55 | if done:
56 | self.env.reset()
57 | obs, _, done, _ = self.env.step(2)
58 | if done:
59 | self.env.reset()
60 | return obs
61 |
62 |
63 | class EpisodicLifeEnv(gym.Wrapper):
64 | def __init__(self, env):
65 | """Make end-of-life == end-of-episode, but only reset on true game over.
66 | Done by DeepMind for the DQN and co. since it helps value estimation.
67 | """
68 | gym.Wrapper.__init__(self, env)
69 | self.lives = 0
70 | self.was_real_done = True
71 |
72 | def _step(self, action):
73 | obs, reward, done, info = self.env.step(action)
74 | self.was_real_done = done
75 | # check current lives, make loss of life terminal,
76 | # then update lives to handle bonus lives
77 | lives = self.env.unwrapped.ale.lives()
78 | if lives < self.lives and lives > 0:
79 | # for Qbert, we sometimes stay in the lives == 0 condition for a few frames,
80 | # so it's important to keep lives > 0, so that we only reset once
81 | # the environment advertises done.
82 | done = True
83 | self.lives = lives
84 | return obs, reward, done, info
85 |
86 | def _reset(self):
87 | """Reset only when lives are exhausted.
88 | This way all states are still reachable even though lives are episodic,
89 | and the learner need not know about any of this behind-the-scenes.
90 | """
91 | if self.was_real_done:
92 | obs = self.env.reset()
93 | else:
94 | # no-op step to advance from terminal/lost life state
95 | obs, _, _, _ = self.env.step(0)
96 | self.lives = self.env.unwrapped.ale.lives()
97 | return obs
98 |
99 |
100 | class MaxAndSkipEnv(gym.Wrapper):
101 | def __init__(self, env, skip=4):
102 | """Return only every `skip`-th frame"""
103 | gym.Wrapper.__init__(self, env)
104 | # most recent raw observations (for max pooling across time steps)
105 | self._obs_buffer = deque(maxlen=2)
106 | self._skip = skip
107 |
108 | def _step(self, action):
109 | """Repeat action, sum reward, and max over last observations."""
110 | total_reward = 0.0
111 | done = None
112 | for _ in range(self._skip):
113 | obs, reward, done, info = self.env.step(action)
114 | self._obs_buffer.append(obs)
115 | total_reward += reward
116 | if done:
117 | break
118 | max_frame = np.max(np.stack(self._obs_buffer), axis=0)
119 |
120 | return max_frame, total_reward, done, info
121 |
122 | def _reset(self):
123 | """Clear past frame buffer and init. to first obs. from inner env."""
124 | self._obs_buffer.clear()
125 | obs = self.env.reset()
126 | self._obs_buffer.append(obs)
127 | return obs
128 |
129 |
130 | class ClipRewardEnv(gym.RewardWrapper):
131 | def _reward(self, reward):
132 | """Bin reward to {+1, 0, -1} by its sign."""
133 | return np.sign(reward)
134 |
135 |
136 | class WarpFrame(gym.ObservationWrapper):
137 | def __init__(self, env):
138 | """Warp frames to 84x84 as done in the Nature paper and later work."""
139 | gym.ObservationWrapper.__init__(self, env)
140 | self.res = RESOLUTION
141 | self.observation_space = spaces.Box(low=0, high=255, shape=(self.res, self.res, 1))
142 |
143 | def _observation(self, obs):
144 | frame = np.dot(obs.astype('float32'), np.array([0.299, 0.587, 0.114], 'float32'))
145 | frame = np.array(Image.fromarray(frame).resize((self.res, self.res),
146 | resample=Image.BILINEAR), dtype=np.uint8)
147 | return frame.reshape((self.res, self.res, 1))
148 |
149 |
150 | class FrameStack(gym.Wrapper):
151 | def __init__(self, env, k):
152 | """Buffer observations and stack across channels (last axis)."""
153 | gym.Wrapper.__init__(self, env)
154 | self.k = k
155 | self.frames = deque([], maxlen=k)
156 | shp = env.observation_space.shape
157 | assert shp[2] == 1 # can only stack 1-channel frames
158 | self.observation_space = spaces.Box(low=0, high=255, shape=(shp[0], shp[1], k))
159 |
160 | def _reset(self):
161 | """Clear buffer and re-fill by duplicating the first observation."""
162 | ob = self.env.reset()
163 | for _ in range(self.k):
164 | self.frames.append(ob)
165 | return self._observation()
166 |
167 | def _step(self, action):
168 | ob, reward, done, info = self.env.step(action)
169 | self.frames.append(ob)
170 | return self._observation(), reward, done, info
171 |
172 | def _observation(self):
173 | assert len(self.frames) == self.k
174 | return np.concatenate(self.frames, axis=2)
175 |
176 |
177 | def wrap_deepmind(env, episode_life=True, clip_rewards=True):
178 | """Configure environment for DeepMind-style Atari.
179 |
180 | Note: this does not include frame stacking!"""
181 | assert 'NoFrameskip' in env.spec.id # required for DeepMind-style skip
182 | if episode_life:
183 | env = EpisodicLifeEnv(env)
184 | env = NoopResetEnv(env, noop_max=30)
185 | env = MaxAndSkipEnv(env, skip=4)
186 | if 'FIRE' in env.unwrapped.get_action_meanings():
187 | env = FireResetEnv(env)
188 | env = WarpFrame(env)
189 | if clip_rewards:
190 | env = ClipRewardEnv(env)
191 | return env
192 |
193 |
194 | def make_env(args, record_video=False):
195 | env = gym.make(DEFAULT_ENV)
196 | if record_video:
197 | video_dir = os.path.join(args.save_path, 'videos')
198 | if not os.path.exists(video_dir):
199 | os.makedirs(video_dir)
200 | env = gym.wrappers.Monitor(
201 | env, video_dir, video_callable=lambda x: True, resume=True)
202 |
203 | return wrap_deepmind(env)
204 |
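A minimal usage sketch of these wrappers (they implement the old-style `_reset`/`_step` gym wrapper API, so this assumes a gym version that still dispatches to those methods):

```python
import gym

from atari_env import FrameStack, wrap_deepmind

# DeepMind-style preprocessing; wrap_deepmind deliberately leaves out frame
# stacking, so FrameStack is applied on top when stacked observations are wanted.
env = gym.make('BreakoutNoFrameskip-v4')
env = wrap_deepmind(env, episode_life=True, clip_rewards=True)
env = FrameStack(env, k=4)   # observations become (84, 84, 4)

obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())
```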
--------------------------------------------------------------------------------
/algorithms/A3C/atari/atari_env_deprecated.py:
--------------------------------------------------------------------------------
1 | import os
2 |
3 | import gym
4 | import numpy as np
5 |
6 | from skimage.color import rgb2gray
7 | from skimage.transform import resize
8 |
9 |
10 | class Atari(object):
11 | s_dim = [84, 84, 1]
12 | a_dim = 3
13 |
14 | def __init__(self, args, record_video=False):
15 | self.env = gym.make('BreakoutNoFrameskip-v4')
16 | self.ale = self.env.env.ale # ale interface
17 | if record_video:
18 | video_dir = os.path.join(args.save_path, 'videos')
19 | if not os.path.exists(video_dir):
20 | os.makedirs(video_dir)
21 | self.env = gym.wrappers.Monitor(
22 | self.env, video_dir, video_callable=lambda x: True, resume=True)
23 | self.ale = self.env.env.env.ale
24 |
25 | self.screen_size = Atari.s_dim[:2] # 84x84
26 | self.noop_max = 30
27 | self.frame_skip = 4
28 | self.frame_feq = 4
29 | self.s_dim = Atari.s_dim
30 | self.a_dim = Atari.a_dim
31 |
32 | self.action_space = [1, 2, 3] # Breakout-specific action subset
33 | self.done = True
34 |
35 | def new_round(self):
36 | if not self.done: # dead but not done
37 | # no-op step to advance from terminal/lost life state
38 | obs, _, _, _ = self.env.step(0)
39 | obs = self.preprocess(obs)
40 | else: # terminal
41 | self.env.reset()
42 | # No-op
43 | for _ in range(np.random.randint(1, self.noop_max + 1)):
44 | obs, _, done, _ = self.env.step(0)
45 | obs = self.preprocess(obs)
46 | return obs
47 |
48 | def preprocess(self, observ):
49 | return resize(rgb2gray(observ), self.screen_size)
50 |
51 | def step(self, action):
52 | observ, reward, dead = None, 0, False
53 | for _ in range(self.frame_skip):
54 | lives_before = self.ale.lives()
55 | o, r, self.done, _ = self.env.step(self.action_space[action])
56 | lives_after = self.ale.lives()
57 | reward += r
58 | if lives_before > lives_after:
59 | dead = True
60 | break
61 | observ = self.preprocess(o)
62 | observ = np.reshape(observ, newshape=self.screen_size + [1])
63 | self.state = np.append(self.state[:, :, 1:], observ, axis=2)
64 |
65 | return self.state, reward, dead, self.done
66 |
--------------------------------------------------------------------------------
/algorithms/A3C/atari/evaluate.py:
--------------------------------------------------------------------------------
1 | import os
2 | import time
3 |
4 | import numpy as np
5 | import tensorflow as tf
6 |
7 | from atari_env import A_DIM
8 | from atari_env import make_env
9 |
10 |
11 | class Evaluate(object):
12 | '''
13 | Evaluate a policy by running n episodes in an environment.
14 | Save a video and plot summaries to Tensorboard
15 |
16 | Args:
17 | global_net: The global network
18 | summary_writer: used to write Tensorboard summaries
19 | args: Some global parameters
20 | '''
21 |
22 | def __init__(self, global_net, summary_writer, global_steps_counter, args):
23 | self.env = make_env(args, record_video=args.record_video)
24 | self.global_net = global_net
25 | self.summary_writer = summary_writer
26 | self.global_steps_counter = global_steps_counter
27 | self.eval_every = args.eval_every
28 | self.eval_times = 0
29 | self.eval_episodes = args.eval_episodes
30 |
31 | self.saver = tf.train.Saver(max_to_keep=5)
32 | self.model_dir = os.path.join(args.save_path, 'models/')
33 | if not os.path.exists(self.model_dir):
34 | os.makedirs(self.model_dir)
35 |
36 | def run(self, sess, coord):
37 | while not coord.should_stop():
38 | global_steps = next(self.global_steps_counter)
39 | eval_start = time.time()
40 | avg_reward, avg_ep_length = self._eval(sess)
41 | self.eval_times += 1
42 | print('Eval at step %d: avg_reward %.4f, avg_ep_length %.4f' %
43 | (global_steps, avg_reward, avg_ep_length))
44 | print('Time cost: %.4fs' % (time.time() - eval_start))
45 | # add summaries
46 | ep_summary = tf.Summary()
47 | ep_summary.value.add(
48 | simple_value=avg_reward, tag='eval/avg_reward')
49 | ep_summary.value.add(
50 | simple_value=avg_ep_length, tag='eval/avg_ep_length')
51 | self.summary_writer.add_summary(ep_summary, global_steps)
52 | self.summary_writer.flush()
53 | # save models
54 | if self.eval_times % 10 == 1:
55 | save_start = time.time()
56 | self.saver.save(sess, self.model_dir + str(global_steps))
57 | print('Model saved. Time cost: %.4fs ' %
58 | (time.time() - save_start))
59 |
60 | time.sleep(self.eval_every)
61 |
62 | def _eval(self, sess):
63 | total_reward = 0.0
64 | episode_length = 0.0
65 | for _ in range(self.eval_episodes):  # averages below divide by eval_episodes
66 | s = self.env.reset()
67 | while True:
68 | p = sess.run(self.global_net.policy,
69 | {self.global_net.inputs: [s]})
70 | a = np.random.choice(range(A_DIM), p=p[0])
71 | s, r, done, _ = self.env.step(a)
72 | total_reward += r
73 | episode_length += 1.0
74 | if done:
75 | break
76 | return total_reward / self.eval_episodes, \
77 | episode_length / self.eval_episodes
78 |
--------------------------------------------------------------------------------
/algorithms/A3C/atari/net.py:
--------------------------------------------------------------------------------
1 | import tensorflow.contrib.slim as slim
2 |
3 | from utils import *
4 |
5 |
6 | class Net(object):
7 | '''
8 | An Actor-Critic Network class. The shallow layers are shared by the Actor
9 | and the Critic.
10 |
11 | Args:
12 | s_dim: dimensions of the state space
13 | a_dim: dimensions of the action space
14 | scope: Scope the net belongs to
15 | trainer: optimizer used by this net
16 | '''
17 |
18 | def __init__(self, s_dim, a_dim, scope, args, trainer=None):
19 | self.s_dim = s_dim
20 | self.a_dim = a_dim
21 | self.scope = scope
22 | self.smooth = args.smooth
23 | self.clip_grads = args.clip_grads
24 | self.entropy_ratio = args.entropy_ratio
25 |
26 | with tf.variable_scope(self.scope):
27 | self.inputs = tf.placeholder(tf.float32, shape=[None] + self.s_dim)
28 |
29 | self._construct_network(self.inputs)
30 |
31 | if self.scope != 'global':
32 | self._update_network(trainer)
33 |
34 | def _construct_network(self, inputs):
35 | '''
36 | Build the computational graph.
37 | '''
38 | conv1 = slim.conv2d(inputs=inputs,
39 | num_outputs=32,
40 | kernel_size=[8, 8],
41 | stride=[4, 4],
42 | padding='VALID',
43 | weights_initializer=ortho_init(),
44 | scope='share_conv1')
45 | conv2 = slim.conv2d(inputs=conv1,
46 | num_outputs=64,
47 | kernel_size=[4, 4],
48 | stride=[2, 2],
49 | padding='VALID',
50 | weights_initializer=ortho_init(),
51 | scope='share_conv2')
52 | conv3 = slim.conv2d(inputs=conv2,
53 | num_outputs=64,
54 | kernel_size=[3, 3],
55 | stride=[1, 1],
56 | padding='VALID',
57 | weights_initializer=ortho_init(),
58 | scope='share_conv3')
59 | fc = slim.fully_connected(inputs=slim.flatten(conv3),
60 | num_outputs=512,
61 | weights_initializer=ortho_init(np.sqrt(2)),
62 | scope='share_fc1')
63 | self.policy = slim.fully_connected(inputs=fc,
64 | num_outputs=self.a_dim,
65 | activation_fn=tf.nn.softmax,
66 | scope='policy_out')
67 | self.value = slim.fully_connected(inputs=fc, num_outputs=1,
68 | activation_fn=None,
69 | scope='value_out')
70 |
71 | def _update_network(self, trainer):
72 | '''
73 | Build losses, compute gradients and apply gradients to the global net
74 | '''
75 |
76 | self.actions = tf.placeholder(shape=[None], dtype=tf.int32)
77 | actions_onehot = tf.one_hot(self.actions, self.a_dim, dtype=tf.float32)
78 | self.target_v = tf.placeholder(shape=[None], dtype=tf.float32)
79 | self.advantages = tf.placeholder(shape=[None], dtype=tf.float32)
80 |
81 | action_prob = tf.reduce_sum(self.policy * actions_onehot, [1])
82 |
83 | # MSE critic loss
84 | self.critic_loss = 0.5 * tf.reduce_sum(
85 | tf.squared_difference(
86 | self.target_v, tf.reshape(self.value, [-1])))
87 |
88 | # high entropy -> low loss -> encourage exploration
89 | self.entropy = -tf.reduce_sum(self.policy * tf.log(self.policy + 1e-30), 1)
90 | self.entropy_loss = -self.entropy_ratio * tf.reduce_sum(self.entropy)
91 |
92 | # policy gradients = d_[-log(p) * advantages] / d_theta
93 | self.actor_loss = -tf.reduce_sum(
94 | tf.log(action_prob + 1e-30) * self.advantages)
95 | self.actor_loss += self.entropy_loss
96 |
97 | self.loss = self.actor_loss + self.critic_loss
98 | local_vars = tf.get_collection(
99 | tf.GraphKeys.TRAINABLE_VARIABLES, self.scope)
100 | self.grads = tf.gradients(self.loss, local_vars)
101 |
102 | # global norm gradients clipping
103 | self.grads, self.grad_norms = \
104 | tf.clip_by_global_norm(self.grads, self.clip_grads)
105 | self.var_norms = tf.global_norm(local_vars)
106 | global_vars = tf.get_collection(
107 | tf.GraphKeys.TRAINABLE_VARIABLES, 'global')
108 | self.apply_grads_to_global = \
109 | trainer.apply_gradients(zip(self.grads, global_vars))
110 |
111 | # summaries
112 | if self.scope == 'worker_1':
113 | tf.summary.scalar('loss/entropy', tf.reduce_sum(self.entropy))
114 | tf.summary.scalar('loss/actor_loss', self.actor_loss)
115 | tf.summary.scalar('loss/critic_loss', self.critic_loss)
116 | tf.summary.scalar('advantages', tf.reduce_mean(self.advantages))
117 | tf.summary.scalar('norms/grad_norms', self.grad_norms)
118 | tf.summary.scalar('norms/var_norms', self.var_norms)
119 | summaries = tf.get_collection(tf.GraphKeys.SUMMARIES)
120 | self.summaries = tf.summary.merge(summaries)
121 | else:
122 | self.summaries = tf.no_op()
123 |
--------------------------------------------------------------------------------
/algorithms/A3C/atari/train_A3C.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | import itertools
3 | import os
4 | import threading
5 | import time
6 |
7 | import tensorflow as tf
8 |
9 | from atari_env import A_DIM
10 | from atari_env import S_DIM
11 | from atari_env import make_env
12 | from evaluate import Evaluate
13 | from net import Net
14 | from utils import print_params_nums
15 | from worker import Worker
16 |
17 |
18 | def main(args):
19 | if args.save_path is not None and not os.path.exists(args.save_path):
20 | os.makedirs(args.save_path)
21 |
22 | summary_writer = tf.summary.FileWriter(os.path.join(args.save_path, 'log'))
23 | global_steps_counter = itertools.count() # thread-safe
24 |
25 | global_net = Net(S_DIM, A_DIM, 'global', args)
26 | num_workers = args.threads
27 | workers = []
28 |
29 | # create workers
30 | for i in range(1, num_workers + 1):
31 | worker_summary_writer = summary_writer if i == 1 else None  # worker ids start at 1
32 | worker = Worker(i, make_env(args), global_steps_counter,
33 | worker_summary_writer, args)
34 | workers.append(worker)
35 |
36 | saver = tf.train.Saver(max_to_keep=5)
37 |
38 | with tf.Session() as sess:
39 | coord = tf.train.Coordinator()
40 | if args.model_path is not None:
41 | print('Loading model...\n')
42 | ckpt = tf.train.get_checkpoint_state(args.model_path)
43 | saver.restore(sess, ckpt.model_checkpoint_path)
44 | else:
45 | print('Initializing a new model...\n')
46 | sess.run(tf.global_variables_initializer())
47 | print_params_nums()
48 | # Start a work process for each worker in a separate thread
49 | worker_threads = []
50 | for worker in workers:
51 | t = threading.Thread(target=worker.run, args=(sess, coord, saver))
52 | t.start()
53 | time.sleep(0.5)
54 | worker_threads.append(t)
55 |
56 | if args.eval_every > 0:
57 | evaluator = Evaluate(
58 | global_net, summary_writer, global_steps_counter, args)
59 | evaluate_thread = threading.Thread(
60 | target=lambda: evaluator.run(sess, coord))
61 | evaluate_thread.start()
62 |
63 | coord.join(worker_threads)
64 |
65 |
66 | def args_parse():
67 | parser = argparse.ArgumentParser()
68 |
69 | parser.add_argument(
70 | '--model_path', default=None, type=str,
71 | help='Whether to use a saved model. (*None|model path)')
72 | parser.add_argument(
73 | '--save_path', default='/tmp/a3c', type=str,
74 | help='Path to save a model during training.')
75 | parser.add_argument(
76 | '--max_steps', default=int(1e8), type=int, help='Max training steps')
77 | parser.add_argument(
78 | '--start_time', default=None, type=str, help='Time to start training')
79 | parser.add_argument(
80 | '--threads', default=16, type=int,
81 | help='Number of parallel worker threads')
82 | # evaluate
83 | parser.add_argument(
84 | '--eval_every', default=500, type=int,
85 | help='Evaluate the global policy every N seconds')
86 | parser.add_argument(
87 | '--record_video', default=True, type=bool,
88 | help='Whether to save videos when evaluating')
89 | parser.add_argument(
90 | '--eval_episodes', default=5, type=int,
91 | help='Number of episodes per evaluation')
92 | # hyperparameters
93 | parser.add_argument(
94 | '--init_learning_rate', default=7e-4, type=float,
95 | help='Learning rate of the optimizer')
96 | parser.add_argument(
97 | '--decay', default=0.99, type=float,
98 | help='decay factor of the RMSProp optimizer')
99 | parser.add_argument(
100 | '--smooth', default=1e-7, type=float,
101 | help='epsilon of the RMSProp optimizer')
102 | parser.add_argument(
103 | '--gamma', default=0.99, type=float,
104 | help='Discount factor for rewards and advantages')
105 | parser.add_argument('--tmax', default=5, type=int, help='Rollout size')
106 | parser.add_argument(
107 | '--entropy_ratio', default=0.01, type=float,
108 | help='Initial weight of entropy loss')
109 | parser.add_argument(
110 | '--clip_grads', default=40, type=float,
111 | help='global norm gradients clipping')
112 | parser.add_argument(
113 | '--epsilon', default=1e-5, type=float,
114 | help='epsilon of rmsprop optimizer')
115 |
116 | return parser.parse_args()
117 |
118 |
119 | if __name__ == '__main__':
120 | # ignore warnings by tensorflow
121 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
122 | # make GPU invisible
123 | os.environ['CUDA_VISIBLE_DEVICES'] = ''
124 |
125 | args = args_parse()
126 | main(args)
127 |
--------------------------------------------------------------------------------
/algorithms/A3C/atari/utils.py:
--------------------------------------------------------------------------------
1 | import time
2 |
3 | import numpy as np
4 | import scipy.signal
5 | import tensorflow as tf
6 |
7 |
8 | def reward_discount(x, gamma):
9 | return scipy.signal.lfilter([1], [1, -gamma], x[::-1], axis=0)[::-1]
10 |
11 |
12 | def ortho_init(scale=1.0):
13 |
14 | def _ortho_init(shape, dtype, partition_info=None):
15 | # lasagne ortho init for tf
16 | shape = tuple(shape)
17 | if len(shape) == 2:
18 | flat_shape = shape
19 | elif len(shape) == 4: # assumes NHWC
20 | flat_shape = (np.prod(shape[:-1]), shape[-1])
21 | else:
22 | raise NotImplementedError
23 | a = np.random.normal(0.0, 1.0, flat_shape)
24 | u, _, v = np.linalg.svd(a, full_matrices=False)
25 | q = u if u.shape == flat_shape else v # pick the one with the correct shape
26 | q = q.reshape(shape)
27 | return (scale * q[:shape[0], :shape[1]]).astype(np.float32)
28 |
29 | return _ortho_init
30 |
31 |
32 | def print_params_nums():
33 | total_parameters = 0
34 | for v in tf.trainable_variables():
35 | shape = v.get_shape()
36 | param_num = 1
37 | for d in shape:
38 | param_num *= d.value
39 | print(v.name, ' ', shape, ' param nums: ', param_num)
40 | total_parameters += param_num
41 | print('\nTotal nums of parameters: %d\n' % total_parameters)
42 |
43 |
44 | def print_time_cost(start_time):
45 | t_c = time.gmtime(time.time() - start_time)
46 | print('Time cost ------ %dh %dm %ds' %
47 | (t_c.tm_hour, t_c.tm_min, t_c.tm_sec))
48 |
--------------------------------------------------------------------------------
/algorithms/A3C/atari/worker.py:
--------------------------------------------------------------------------------
1 | from atari_env import A_DIM
2 | from atari_env import S_DIM
3 | from net import Net
4 | from utils import *
5 |
6 |
7 | class Worker(object):
8 | '''
9 | An A3C worker thread. Run a game locally, gather gradients and apply
10 | to the global networks.
11 |
12 | Args:
13 | worker_id: A unique id for this thread
14 | env: Game environment used by this worker
15 | global_steps: Iterator that holds the global steps
16 | args: Global parameters and hyperparameters
17 | '''
18 |
19 | def __init__(
20 | self, worker_id, env, global_steps_counter, summary_writer, args):
21 | self.name = 'worker_' + str(worker_id)
22 | self.env = env
23 | self.args = args
24 | self.local_steps = 0
25 | self.global_steps_counter = global_steps_counter
26 | # each worker has its own optimizer and learning_rate
27 | self.learning_rate = tf.Variable(args.init_learning_rate,
28 | dtype=tf.float32,
29 | trainable=False,
30 | name=self.name + '_lr')
31 | self.delta_lr = \
32 | args.init_learning_rate / (args.max_steps / args.threads)
33 | self.trainer = tf.train.RMSPropOptimizer(self.learning_rate,
34 | decay=args.decay,
35 | epsilon=args.epsilon)
36 | self.summary_writer = summary_writer
37 |
38 | self.local_net = Net(S_DIM,
39 | A_DIM,
40 | scope=self.name,
41 | args=self.args,
42 | trainer=self.trainer)
43 |
44 | self.update_local_op = self._update_local_vars()
45 | self.anneal_learning_rate = self._anneal_learning_rate()
46 |
47 | def run(self, sess, coord, saver):
48 | print('Starting %s...\n' % self.name)
49 | with sess.as_default(), sess.graph.as_default():
50 | while not coord.should_stop():
51 | sess.run(self.update_local_op)
52 | rollout = []
53 | s = self.env.reset()
54 | while True:
55 | p, v = sess.run(
56 | [self.local_net.policy, self.local_net.value],
57 | feed_dict={self.local_net.inputs: [s]})
58 | a = np.random.choice(range(A_DIM), p=p[0])
59 | s1, r, dead, done = self.env.step(a)
60 | rollout.append([s, a, r, s1, dead, v[0][0]])
61 | s = s1
62 |
63 | global_steps = next(self.global_steps_counter)
64 | self.local_steps += 1
65 | sess.run(self.anneal_learning_rate)
66 |
67 | if not dead and len(rollout) == self.args.tmax:
68 | # calculate the value of the next state, used for bootstrapping
69 | v1 = sess.run(self.local_net.value,
70 | feed_dict={self.local_net.inputs: [s]})
71 | self._train(rollout, sess, v1[0][0], global_steps)
72 | rollout = []
73 | sess.run(self.update_local_op)
74 |
75 | if dead:
76 | break
77 |
78 | if len(rollout) != 0:
79 | self._train(rollout, sess, 0.0, global_steps)
80 | # end condition
81 | if global_steps >= self.args.max_steps:
82 | coord.request_stop()
83 | print_time_cost(self.args.start_time)
84 |
85 | def _train(self, rollout, sess, bootstrap_value, global_steps):
86 | '''
87 | Update global networks based on the rollout experiences
88 |
89 | Args:
90 | rollout: A list of transitions experiences
91 | sess: Tensorflow session
92 | bootstrap_value: if the episode was not done, we bootstrap the value
93 | from the last state.
94 | global_steps: used for summaries
95 | '''
96 |
97 | rollout = np.array(rollout)
98 | observs, actions, rewards, next_observs, dones, values = rollout.T
99 | # compute advantages and discounted rewards
100 | rewards_plus = np.asarray(rewards.tolist() + [bootstrap_value])
101 | discounted_rewards = reward_discount(rewards_plus, self.args.gamma)[:-1]
102 |
103 | advantages = discounted_rewards - values
104 |
105 | summaries, _ = sess.run([
106 | self.local_net.summaries,
107 | self.local_net.apply_grads_to_global
108 | ], feed_dict={
109 | self.local_net.inputs: np.stack(observs),
110 | self.local_net.actions: actions,
111 | self.local_net.target_v: discounted_rewards, # for value loss
112 | self.local_net.advantages: advantages # for policy net
113 | })
114 | # write summaries
115 | if self.summary_writer and summaries:
116 | self.summary_writer.add_summary(summaries, global_steps)
117 | self.summary_writer.flush()
118 |
119 | def _update_local_vars(self):
120 | '''
121 | Assign global networks parameters to local networks
122 | '''
123 | global_vars = tf.get_collection(
124 | tf.GraphKeys.TRAINABLE_VARIABLES, 'global')
125 | local_vars = tf.get_collection(
126 | tf.GraphKeys.TRAINABLE_VARIABLES, self.name)
127 | update_op = []
128 | for g_v, l_v in zip(global_vars, local_vars):
129 | update_op.append(l_v.assign(g_v))
130 |
131 | return update_op
132 |
133 | def _anneal_learning_rate(self):
134 | return tf.cond(
135 | self.learning_rate > 0.0,
136 | lambda: tf.assign_sub(self.learning_rate, self.delta_lr),
137 | lambda: tf.assign(self.learning_rate, 0.0))
138 |
--------------------------------------------------------------------------------
/algorithms/A3C/doom/README.md:
--------------------------------------------------------------------------------
1 | ## Asynchronous Advantage Actor-Critic (A3C)
2 | Implementation of the A3C method proposed by Google DeepMind.
3 |
4 | Related papers:
5 | * [Asynchronous Methods for Deep Reinforcement Learning](http://diyhpl.us/~bryan/papers2/ai/machine-learning/Asynchronous%20methods%20for%20deep%20reinforcement%20learning%20-%202016.pdf)
6 |
7 | ## ViZDoom
8 | [ViZDoom](http://vizdoom.cs.put.edu.pl/) is a Doom-based AI research platform for reinforcement learning from raw visual information. The agent receives raw visual input and takes actions (moving, picking up items, attacking monsters) to maximize its score.
9 |
10 |
11 |
12 | In this repository, we implement A3C to solve the basic ViZDoom task.
13 |
14 | ## Requirements
15 | * [Numpy](http://www.numpy.org/)
16 | * [Tensorflow](http://www.tensorflow.org)
17 | * [gym](https://gym.openai.com)
18 | * [scipy](https://www.scipy.org/)
19 | * [ViZDoom](https://github.com/mwydmuch/ViZDoom/blob/master/doc/Building.md)
20 |
21 | ## Run
22 | python train_A3C.py
23 | python train_A3C.py -h # show all optional arguments
24 |
25 | ## Components
26 | `train_A3C.py` creates a master (global) network and multiple worker (local) networks.
27 | `worker.py` implements the worker class.
28 | `net.py` constructs the Actor-Critic network.
29 | `env_doom.py` is a wrapper around the ViZDoom environment.
30 |
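For reference, `env_doom.py` exposes a small gym-style interface; a usage sketch (the random action choice here is just for illustration):

```python
import numpy as np

from env_doom import Doom
from utils import preprocess   # crops the frame and resizes it to a flat 84*84 vector

env = Doom()                   # the game window stays hidden by default
s = preprocess(env.reset())
done = False
while not done:
    a = np.random.randint(env.action_dim)   # 0: MOVE_LEFT, 1: MOVE_RIGHT, 2: ATTACK
    s1, r, done = env.step(a)               # gym-style (next_state, reward, done)
    if not done:
        s = preprocess(s1)
```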
--------------------------------------------------------------------------------
/algorithms/A3C/doom/basic.wad:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/borgwang/reinforce_py/41f67327ae7e1bf87d4648e3ea5f406466c532c9/algorithms/A3C/doom/basic.wad
--------------------------------------------------------------------------------
/algorithms/A3C/doom/env_doom.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 |
3 | from vizdoom import *
4 |
5 |
6 | class Doom(object):
7 | '''Wrapper for Doom environment. Gym-style interface'''
8 | def __init__(self, visible=False):
9 | self.env = self._setup(visible)
10 | self.state_dim = 84 * 84 * 1
11 | self.action_dim = 3
12 | # Identity bool matrix, transfer action to bool one-hot
13 | self.bool_onehot = np.identity(self.action_dim, dtype=bool).tolist()
14 |
15 | def _setup(self, visible):
16 | # setting up Doom environment
17 | env = DoomGame()
18 | env.set_doom_scenario_path('basic.wad')
19 | env.set_doom_map('map01')
20 | env.set_screen_resolution(ScreenResolution.RES_160X120)
21 | env.set_screen_format(ScreenFormat.GRAY8)
22 | env.set_render_hud(False)
23 | env.set_render_crosshair(False)
24 | env.set_render_weapon(True)
25 | env.set_render_decals(False)
26 | env.set_render_particles(False)
27 | env.add_available_button(Button.MOVE_LEFT)
28 | env.add_available_button(Button.MOVE_RIGHT)
29 | env.add_available_button(Button.ATTACK)
30 | env.add_available_game_variable(GameVariable.AMMO2)
31 | env.add_available_game_variable(GameVariable.POSITION_X)
32 | env.add_available_game_variable(GameVariable.POSITION_Y)
33 | env.set_episode_timeout(300)
34 | env.set_episode_start_time(10)
35 | env.set_sound_enabled(False)
36 | env.set_living_reward(-1)
37 | env.set_mode(Mode.PLAYER)
38 | env.set_window_visible(visible)
39 | env.init()
40 |
41 | return env
42 |
43 | def _get_state(self):
44 | return self.env.get_state().screen_buffer
45 |
46 | def reset(self):
47 | self.env.new_episode()
48 | return self._get_state()
49 |
50 | def step(self, action):
51 | action = self.bool_onehot[action] # e.g. [False, True, False]
52 |
53 | curr_observ = self._get_state()
54 | reward = self.env.make_action(action)
55 | done = self.env.is_episode_finished()
56 | if done:
57 | next_observ = curr_observ
58 | else:
59 | next_observ = self._get_state()
60 |
61 | return next_observ, reward, done
62 |
--------------------------------------------------------------------------------
/algorithms/A3C/doom/net.py:
--------------------------------------------------------------------------------
1 | import tensorflow.contrib.slim as slim
2 |
3 | from utils import *
4 |
5 |
6 | class Net:
7 |
8 | def __init__(self, s_dim, a_dim, scope, trainer):
9 | self.s_dim = s_dim
10 | self.a_dim = a_dim
11 | self.scope = scope
12 |
13 | with tf.variable_scope(self.scope):
14 | self.inputs = tf.placeholder(
15 | shape=[None, self.s_dim], dtype=tf.float32)
16 | inputs = tf.reshape(self.inputs, [-1, 84, 84, 1])
17 |
18 | self._construct_network(inputs)
19 |
20 | if self.scope != 'global':
21 | # gradients update only for workers
22 | self._update_network(trainer)
23 |
24 | def _construct_network(self, inputs):
25 | # Actor network and critic network share all shallow layers
26 | conv1 = slim.conv2d(inputs=inputs,
27 | num_outputs=16,
28 | activation_fn=tf.nn.relu,
29 | kernel_size=[8, 8],
30 | stride=[4, 4],
31 | padding='VALID')
32 | conv2 = slim.conv2d(inputs=conv1,
33 | num_outputs=32,
34 | activation_fn=tf.nn.relu,
35 | kernel_size=[4, 4],
36 | stride=[2, 2],
37 | padding='VALID')
38 | hidden = slim.fully_connected(inputs=slim.flatten(conv2),
39 | num_outputs=256,
40 | activation_fn=tf.nn.relu)
41 |
42 | # Recurrent network for temporal dependencies
43 | lstm_cell = tf.contrib.rnn.BasicLSTMCell(num_units=256)
44 | c_init = np.zeros((1, lstm_cell.state_size.c), np.float32)
45 | h_init = np.zeros((1, lstm_cell.state_size.h), np.float32)
46 | self.state_init = [c_init, h_init]
47 |
48 | c_in = tf.placeholder(tf.float32, [1, lstm_cell.state_size.c])
49 | h_in = tf.placeholder(tf.float32, [1, lstm_cell.state_size.h])
50 | self.state_in = (c_in, h_in)
51 |
52 | rnn_in = tf.expand_dims(hidden, [0])
53 | step_size = tf.shape(inputs)[:1]
54 | state_in = tf.contrib.rnn.LSTMStateTuple(c_in, h_in)
55 |
56 | lstm_out, lstm_state = tf.nn.dynamic_rnn(cell=lstm_cell,
57 | inputs=rnn_in,
58 | initial_state=state_in,
59 | sequence_length=step_size,
60 | time_major=False)
61 | lstm_c, lstm_h = lstm_state
62 | self.state_out = (lstm_c[:1, :], lstm_h[:1, :])
63 | rnn_out = tf.reshape(lstm_out, [-1, 256])
64 |
65 | # output for policy and value estimations
66 | self.policy = slim.fully_connected(
67 | inputs=rnn_out,
68 | num_outputs=self.a_dim,
69 | activation_fn=tf.nn.softmax,
70 | weights_initializer=normalized_columns_initializer(0.01),
71 | biases_initializer=None)
72 | self.value = slim.fully_connected(
73 | inputs=rnn_out,
74 | num_outputs=1,
75 | activation_fn=None,
76 | weights_initializer=normalized_columns_initializer(1.0),
77 | biases_initializer=None)
78 |
79 | def _update_network(self, trainer):
80 | self.actions = tf.placeholder(shape=[None], dtype=tf.int32)
81 | self.actions_onehot = tf.one_hot(
82 | self.actions, self.a_dim, dtype=tf.float32)
83 | self.target_v = tf.placeholder(shape=[None], dtype=tf.float32)
84 | self.advantages = tf.placeholder(shape=[None], dtype=tf.float32)
85 |
86 | self.outputs = tf.reduce_sum(
87 | self.policy * self.actions_onehot, [1])
88 |
89 | # loss
90 | self.value_loss = 0.5 * tf.reduce_sum(tf.square(
91 | self.target_v - tf.reshape(self.value, [-1])))
92 | # higher entropy -> lower loss -> encourage exploration
93 | self.entropy = -tf.reduce_sum(self.policy * tf.log(self.policy + 1e-10))  # epsilon avoids log(0)
94 |
95 | self.policy_loss = -tf.reduce_sum(
96 | tf.log(self.outputs + 1e-10) * self.advantages)
97 |
98 | self.loss = 0.5 * self.value_loss \
99 | + self.policy_loss - 0.01 * self.entropy
100 |
101 | # local gradients
102 | local_vars = tf.get_collection(
103 | tf.GraphKeys.TRAINABLE_VARIABLES, self.scope)
104 | self.gradients = tf.gradients(self.loss, local_vars)
105 | self.var_norms = tf.global_norm(local_vars)
106 |
107 | # grads[i] * clip_norm / max(global_norm, clip_norm)
108 | grads, self.grad_norms = tf.clip_by_global_norm(self.gradients, 40.0)
109 |
110 | # apply gradients to global network
111 | global_vars = tf.get_collection(
112 | tf.GraphKeys.TRAINABLE_VARIABLES, 'global')
113 | self.apply_grads = trainer.apply_gradients(zip(grads, global_vars))
114 |
--------------------------------------------------------------------------------
/algorithms/A3C/doom/train_A3C.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | import multiprocessing
3 | import os
4 | import threading
5 | import time
6 |
7 | import tensorflow as tf
8 |
9 | from env_doom import Doom
10 | from net import Net
11 | from utils import print_net_params_number
12 | from worker import Worker
13 |
14 |
15 | def main(args):
16 | if args.save_path is not None and not os.path.exists(args.save_path):
17 | os.makedirs(args.save_path)
18 |
19 | tf.reset_default_graph()
20 |
21 | global_ep = tf.Variable(
22 | 0, dtype=tf.int32, name='global_ep', trainable=False)
23 |
24 | env = Doom(visible=False)
25 | Net(env.state_dim, env.action_dim, 'global', None)
26 | num_workers = args.parallel
27 | workers = []
28 |
29 | # create workers
30 | for i in range(num_workers):
31 | w = Worker(i, Doom(), global_ep, args)
32 | workers.append(w)
33 |
34 | print('%d workers in total.\n' % num_workers)
35 | saver = tf.train.Saver(max_to_keep=3)
36 |
37 | with tf.Session() as sess:
38 | coord = tf.train.Coordinator()
39 | if args.model_path is not None:
40 | print('Loading model...')
41 | ckpt = tf.train.get_checkpoint_state(args.model_path)
42 | saver.restore(sess, ckpt.model_checkpoint_path)
43 | else:
44 | print('Initializing a new model...')
45 | sess.run(tf.global_variables_initializer())
46 | print_net_params_number()
47 |
48 | # Start a work process for each worker in a separate thread
49 | worker_threads = []
50 | for w in workers:
51 | # pass w explicitly so each thread runs its own worker (avoids late binding)
52 | t = threading.Thread(target=w.run, args=(sess, coord, saver))
53 | t.start()
54 | time.sleep(0.5)
55 | worker_threads.append(t)
56 | coord.join(worker_threads)
57 |
58 |
59 | if __name__ == '__main__':
60 | # ignore warnings by tensorflow
61 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
62 |
63 | parser = argparse.ArgumentParser()
64 | parser.add_argument(
65 | '--model_path', default=None,
66 | help='Whether to use a saved model. (*None|model path)')
67 | parser.add_argument(
68 | '--save_path', default='/tmp/a3c_doom/model/',
69 | help='Path to save a model during training.')
70 | parser.add_argument(
71 | '--save_every', default=50, type=int, help='Interval (in episodes) of saving the model')
72 | parser.add_argument(
73 | '--max_ep_len', default=300, type=int, help='Max episode steps')
74 | parser.add_argument(
75 | '--max_ep', default=3000, type=int, help='Max training episodes')
76 | parser.add_argument(
77 | '--parallel', default=multiprocessing.cpu_count(), type=int,
78 | help='Number of parallel threads')
79 | main(parser.parse_args())
80 |
--------------------------------------------------------------------------------
/algorithms/A3C/doom/utils.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import scipy.misc, scipy.signal  # both submodules are needed (preprocess and discount below)
3 | import tensorflow as tf
4 |
5 |
6 | def preprocess(frame):
7 | s = frame[10: -10, 30: -30]
8 | s = scipy.misc.imresize(s, [84, 84])
9 | s = np.reshape(s, [np.prod(s.shape)]) / 255.0
10 | return s
11 |
12 |
13 | def discount(x, gamma):
14 | return scipy.signal.lfilter([1], [1, -gamma], x[::-1], axis=0)[::-1]
15 |
16 |
17 | def normalized_columns_initializer(std=1.0):
18 |
19 | def _initializer(shape, dtype=None, partition_info=None):
20 | out = np.random.randn(*shape).astype(np.float32)
21 | out *= std / np.sqrt(np.square(out).sum(axis=0, keepdims=True))
22 | return tf.constant(out)
23 |
24 | return _initializer
25 |
26 |
27 | def print_net_params_number():
28 | total_parameters = 0
29 | for v in tf.trainable_variables():
30 | shape = v.get_shape()
31 | param_num = 1
32 | for d in shape:
33 | param_num *= d.value
34 | print(v.name, ' ', shape, ' param nums: ', param_num)
35 | total_parameters += param_num
36 | print('\nTotal nums of parameters: %d\n' % total_parameters)
37 |
--------------------------------------------------------------------------------
/algorithms/A3C/doom/worker.py:
--------------------------------------------------------------------------------
1 | from net import Net
2 | from utils import *
3 | from vizdoom import *
4 |
5 |
6 | class Worker(object):
7 |
8 | def __init__(self, worker_id, env, global_ep, args):
9 | self.name = 'worker_' + str(worker_id)
10 | self.env = env
11 | self.global_ep = global_ep
12 | self.args = args
13 | self.learning_rate = 1e-4
14 | self.gamma = 0.99
15 | self.trainer = tf.train.AdamOptimizer(self.learning_rate)
16 |
17 | # create local copy of AC network
18 | self.local_net = Net(self.env.state_dim,
19 | self.env.action_dim,
20 | scope=self.name,
21 | trainer=self.trainer)
22 |
23 | self.update_local_op = self._update_local_params()
24 |
25 | def run(self, sess, coord, saver):
26 | running_reward = None
27 | ep_count = sess.run(self.global_ep)
28 | print('Starting ' + self.name)
29 |
30 | with sess.as_default(), sess.graph.as_default():
31 |
32 | while not coord.should_stop():
33 | sess.run(self.update_local_op)
34 | rollout = []
35 | ep_reward = 0
36 | ep_step_count = 0
37 |
38 | s = self.env.reset()
39 | self.ep_frames = []
40 | self.ep_frames.append(s)
41 | s = preprocess(s)
42 | rnn_state = self.local_net.state_init
43 |
44 | while True:
45 | p, v, rnn_state = sess.run([
46 | self.local_net.policy,
47 | self.local_net.value,
48 | self.local_net.state_out
49 | ], {
50 | self.local_net.inputs: [s],
51 | self.local_net.state_in[0]: rnn_state[0],
52 | self.local_net.state_in[1]: rnn_state[1]
53 | })
54 | # sample action from the policy distribution p
55 | a = np.random.choice(np.arange(self.env.action_dim), p=p[0])
56 |
57 | s1, r, d = self.env.step(a)
58 | self.ep_frames.append(s1)
59 | r /= 100.0 # scale rewards
60 | s1 = preprocess(s1)
61 |
62 | rollout.append([s, a, r, s1, d, v[0][0]])
63 | ep_reward += r
64 | s = s1
65 | ep_step_count += 1
66 |
67 | # Update if the buffer is full (size=30)
68 | if not d and len(rollout) == 30 \
69 | and ep_step_count != self.args.max_ep_len - 1:
70 | v1 = sess.run(self.local_net.value, {
71 | self.local_net.inputs: [s],
72 | self.local_net.state_in[0]: rnn_state[0],
73 | self.local_net.state_in[1]: rnn_state[1]
74 | })[0][0]
75 | v_l, p_l, e_l, g_n, v_n = self._train(rollout, sess, v1)
76 | rollout = []
77 |
78 | sess.run(self.update_local_op)
79 | if d:
80 | break
81 |
82 | # update network at the end of the episode
83 | if len(rollout) != 0:
84 | v_l, p_l, e_l, g_n, v_n = self._train(rollout, sess, 0.0)
85 |
86 | # episode end
87 | if running_reward:
88 | running_reward = running_reward * 0.99 + ep_reward * 0.01
89 | else:
90 | running_reward = ep_reward
91 |
92 | if ep_count % 10 == 0:
93 | print('%s ep:%d step:%d reward:%.3f' %
94 | (self.name, ep_count, ep_step_count, running_reward))
95 |
96 | if self.name == 'worker_0':
97 | # update global ep
98 | _, global_ep = sess.run([
99 | self.global_ep.assign_add(1),
100 | self.global_ep
101 | ])
102 | # end condition
103 | if global_ep == self.args.max_ep:
104 | # this op will stop all threads
105 | coord.request_stop()
106 | # save model and make gif
107 | if global_ep != 0 and global_ep % self.args.save_every == 0:
108 | saver.save(
109 | sess, self.args.save_path+str(global_ep)+'.cptk')
110 | ep_count += 1 # update local ep
111 |
112 | def _train(self, rollout, sess, bootstrap_value):
113 | rollout = np.array(rollout)
114 | observs, actions, rewards, next_observs, dones, values = rollout.T
115 |
116 | # compute advantages and discounted reward using rewards and value
117 | self.rewards_plus = np.asarray(rewards.tolist() + [bootstrap_value])
118 | discounted_rewards = discount(self.rewards_plus, self.gamma)[:-1]
119 |
120 | self.value_plus = np.asarray(values.tolist() + [bootstrap_value])
121 | advantages = rewards + self.gamma * self.value_plus[1:] - \
122 | self.value_plus[:-1]
123 | advantages = discount(advantages, self.gamma)
124 |
125 | # update the global network using gradients from the local loss
126 | rnn_state = self.local_net.state_init
127 | v_l, p_l, e_l, g_n, v_n, _ = sess.run([
128 | self.local_net.value_loss,
129 | self.local_net.policy_loss,
130 | self.local_net.entropy,
131 | self.local_net.grad_norms,
132 | self.local_net.var_norms,
133 | self.local_net.apply_grads
134 | ], {
135 | self.local_net.target_v: discounted_rewards, # for value net
136 | self.local_net.inputs: np.vstack(observs),
137 | self.local_net.actions: actions,
138 | self.local_net.advantages: advantages, # for policy net
139 | self.local_net.state_in[0]: rnn_state[0],
140 | self.local_net.state_in[1]: rnn_state[1]
141 | })
142 | return v_l/len(rollout), p_l/len(rollout), e_l/len(rollout), g_n, v_n
143 |
144 | def _update_local_params(self):
145 | global_vars = tf.get_collection(
146 | tf.GraphKeys.TRAINABLE_VARIABLES, 'global')
147 | local_vars = tf.get_collection(
148 | tf.GraphKeys.TRAINABLE_VARIABLES, self.name)
149 | update_op = []
150 | for global_var, local_var in zip(global_vars, local_vars):
151 | update_op.append(local_var.assign(global_var))
152 |
153 | return update_op
154 |
--------------------------------------------------------------------------------
/algorithms/Actor-Critic/README.md:
--------------------------------------------------------------------------------
1 | ## Actor-Critic
2 | Actor-Critic methods belong to the family of Policy Gradient methods, which parameterize the policy directly rather than deriving it from a state-value function.
3 | For more details about Actor-Critic and other policy gradient algorithms, refer to Chapter 13 of [Reinforcement Learning: An Introduction, 2nd Edition](http://webdocs.cs.ualberta.ca/~sutton/book/the-book.html).
4 |
5 | Here we use an Actor-Critic method to solve the game of Pong.
6 |
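The update implemented in `agent.py` weights the policy-gradient term by an advantage computed against a learned state-value baseline, and regresses the critic toward the discounted return. A minimal NumPy sketch of that idea (the numbers are placeholders, not outputs of the actual model):

```python
import numpy as np

discounted_rewards = np.array([1.2, 0.8, -0.5])  # targets from the reward buffer
state_values = np.array([1.0, 1.0, 0.0])         # critic predictions V(s)
logp_taken = np.array([-0.7, -1.1, -0.9])        # log-probabilities of the taken actions

# Advantage: how much better the return was than the critic expected
advantages = discounted_rewards - state_values

# Actor: increase the log-probability of actions with positive advantage
actor_loss = -np.sum(logp_taken * advantages)

# Critic: mean squared error between V(s) and the discounted return
critic_loss = np.mean((discounted_rewards - state_values) ** 2)
```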
7 | ## Pong
8 | Pong is an Atari game in which the player controls one of the paddles (the other is controlled by a decent AI) and tries to bounce the ball past the other side. In the reinforcement learning setting, the state is the raw pixels and the action is moving the paddle UP or DOWN.
9 |
10 |
11 |
12 |
13 | ## Requirements
14 | * [Numpy](http://www.numpy.org/)
15 | * [Tensorflow](http://www.tensorflow.org)
16 | * [gym](https://gym.openai.com)
17 |
18 | ## Run
19 | python train_actor_critic.py
20 |
--------------------------------------------------------------------------------
/algorithms/Actor-Critic/agent.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import tensorflow as tf
3 |
4 |
5 | class ActorCritic:
6 |
7 | def __init__(self, input_dim, hidden_units, action_dim):
8 | self.input_dim = input_dim
9 | self.hidden_units = hidden_units
10 | self.action_dim = action_dim
11 | self.gamma = 0.99
12 | self.discount_factor = 0.99
13 | self.max_gradient = 5
14 | # counter
15 | self.ep_count = 0
16 | # buffer init
17 | self.buffer_reset()
18 |
19 | self.batch_size = 32
20 |
21 | @staticmethod
22 | def get_session(device):
23 | if device == -1: # use CPU
24 | device = '/cpu:0'
25 | sess_config = tf.ConfigProto()
26 | else: # use GPU
27 | device = '/gpu:' + str(device)
28 | sess_config = tf.ConfigProto(
29 | log_device_placement=True,
30 | allow_soft_placement=True)
31 | sess_config.gpu_options.allow_growth = True
32 | sess = tf.Session(config=sess_config)
33 | return sess, device
34 |
35 | def construct_model(self, gpu):
36 | self.sess, device = self.get_session(gpu)
37 |
38 | with tf.device(device):
39 | with tf.name_scope('model_inputs'):
40 | self.input_state = tf.placeholder(
41 | tf.float32, [None, self.input_dim], name='input_state')
42 | with tf.variable_scope('actor_network'):
43 | self.logp = self.actor_network(self.input_state)
44 | with tf.variable_scope('critic_network'):
45 | self.state_value = self.critic_network(self.input_state)
46 |
47 | # get network parameters
48 | actor_params = tf.get_collection(
49 | tf.GraphKeys.TRAINABLE_VARIABLES, scope='actor_network')
50 | critic_params = tf.get_collection(
51 | tf.GraphKeys.TRAINABLE_VARIABLES, scope='critic_network')
52 |
53 | self.taken_action = tf.placeholder(tf.int32, [None, ])
54 | self.discounted_rewards = tf.placeholder(tf.float32, [None, 1])
55 |
56 | # optimizer
57 | self.optimizer = tf.train.RMSPropOptimizer(learning_rate=1e-4)
58 | # actor loss
59 | self.actor_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
60 | logits=self.logp, labels=self.taken_action)
61 | # advantage
62 | self.advantage = (self.discounted_rewards - self.state_value)[:, 0]
63 | # actor gradient
64 | actor_gradients = tf.gradients(
65 | self.actor_loss, actor_params, self.advantage)
66 | self.actor_gradients = list(zip(actor_gradients, actor_params))
67 |
68 | # critic loss
69 | self.critic_loss = tf.reduce_mean(
70 | tf.square(self.discounted_rewards - self.state_value))
71 | # critic gradient
72 | self.critic_gradients = self.optimizer.compute_gradients(
73 | self.critic_loss, critic_params)
74 | self.gradients = self.actor_gradients + self.critic_gradients
75 |
76 | # clip gradient
77 | for i, (grad, var) in enumerate(self.gradients):
78 | if grad is not None:
79 | self.gradients[i] = (tf.clip_by_value(
80 | grad, -self.max_gradient, self.max_gradient), var)
81 |
82 | with tf.name_scope('train_actor_critic'):
83 | # train operation
84 | self.train_op = self.optimizer.apply_gradients(self.gradients)
85 |
86 | def sample_action(self, state):
87 |
88 | def softmax(x):
89 | max_x = np.amax(x)
90 | e = np.exp(x - max_x)
91 | return e / np.sum(e)
92 |
93 | logp = self.sess.run(self.logp, {self.input_state: state})[0]
94 | prob = softmax(logp) - 1e-5
95 | return np.argmax(np.random.multinomial(1, prob))
96 |
97 | def update_model(self):
98 | state_buffer = np.array(self.state_buffer)
99 | action_buffer = np.array(self.action_buffer)
100 | discounted_rewards_buffer = np.vstack(self.reward_discount())
101 |
102 | ep_steps = len(action_buffer)
103 | shuffle_index = np.arange(ep_steps)
104 | np.random.shuffle(shuffle_index)
105 |
106 | for i in range(0, ep_steps, self.batch_size):
107 | if i + self.batch_size <= ep_steps:
108 | end_index = i + self.batch_size
109 | else:
110 | end_index = ep_steps
111 | batch_index = shuffle_index[i:end_index]
112 |
113 | # get batch
114 | input_state = state_buffer[batch_index]
115 | taken_action = action_buffer[batch_index]
116 | discounted_rewards = discounted_rewards_buffer[batch_index]
117 |
118 | # train!
119 | self.sess.run(self.train_op, feed_dict={
120 | self.input_state: input_state,
121 | self.taken_action: taken_action,
122 | self.discounted_rewards: discounted_rewards})
123 |
124 | # clean up job
125 | self.buffer_reset()
126 |
127 | self.ep_count += 1
128 |
129 | def store_rollout(self, state, action, reward, next_state, done):
130 | self.action_buffer.append(action)
131 | self.reward_buffer.append(reward)
132 | self.state_buffer.append(state)
133 | self.next_state_buffer.append(next_state)
134 | self.done_buffer.append(done)
135 |
136 | def buffer_reset(self):
137 | self.state_buffer = []
138 | self.reward_buffer = []
139 | self.action_buffer = []
140 | self.next_state_buffer = []
141 | self.done_buffer = []
142 |
143 | def reward_discount(self):
144 | r = self.reward_buffer
145 | d_r = np.zeros_like(r)
146 | running_add = 0
147 | for t in range(len(r))[::-1]:
148 | if r[t] != 0:
149 | running_add = 0 # game boundary. reset the running add
150 | running_add = r[t] + running_add * self.discount_factor
151 | d_r[t] += running_add
152 | # standardize the rewards
153 | d_r -= np.mean(d_r)
154 | d_r /= np.std(d_r)
155 | return d_r
156 |
157 | def actor_network(self, input_state):
158 | w1 = tf.Variable(tf.div(tf.random_normal(
159 | [self.input_dim, self.hidden_units]),
160 | np.sqrt(self.input_dim)), name='w1')
161 | b1 = tf.Variable(
162 | tf.constant(0.0, shape=[self.hidden_units]), name='b1')
163 | h1 = tf.nn.relu(tf.matmul(input_state, w1) + b1)
164 | w2 = tf.Variable(tf.div(tf.random_normal(
165 | [self.hidden_units, self.action_dim]),
166 | np.sqrt(self.hidden_units)), name='w2')
167 | b2 = tf.Variable(tf.constant(0.0, shape=[self.action_dim]), name='b2')
168 | logp = tf.matmul(h1, w2) + b2
169 | return logp
170 |
171 | def critic_network(self, input_state):
172 | w1 = tf.Variable(tf.div(tf.random_normal(
173 | [self.input_dim, self.hidden_units]),
174 | np.sqrt(self.input_dim)), name='w1')
175 | b1 = tf.Variable(
176 | tf.constant(0.0, shape=[self.hidden_units]), name='b1')
177 | h1 = tf.nn.relu(tf.matmul(input_state, w1) + b1)
178 | w2 = tf.Variable(tf.div(tf.random_normal(
179 | [self.hidden_units, 1]), np.sqrt(self.hidden_units)), name='w2')
180 | b2 = tf.Variable(tf.constant(0.0, shape=[1]), name='b2')
181 | state_value = tf.matmul(h1, w2) + b2
182 | return state_value
183 |
--------------------------------------------------------------------------------
/algorithms/Actor-Critic/evaluate.py:
--------------------------------------------------------------------------------
1 | import argparse
2 |
3 | import gym
4 | import numpy as np
5 | import tensorflow as tf
6 |
7 | from agent import ActorCritic
8 | from utils import preprocess
9 |
10 |
11 | def main(args):
12 | INPUT_DIM = 80 * 80
13 | HIDDEN_UNITS = 200
14 | ACTION_DIM = 6
15 |
16 | # load agent
17 | agent = ActorCritic(INPUT_DIM, HIDDEN_UNITS, ACTION_DIM)
18 | agent.construct_model(args.gpu)
19 |
20 | # load model or init a new
21 | saver = tf.train.Saver(max_to_keep=1)
22 | if args.model_path is not None:
23 | # reuse saved model
24 | saver.restore(agent.sess, args.model_path)
25 | else:
26 | # build a new model
27 | agent.sess.run(tf.global_variables_initializer())
28 |
29 | # load env
30 | env = gym.make('Pong-v0')
31 |
32 | # evaluation loop
33 | for ep in range(args.ep):
34 | # reset env
35 | total_rewards = 0
36 | state = env.reset()
37 |
38 | while True:
39 | env.render()
40 | # preprocess
41 | state = preprocess(state)
42 | # sample actions
43 | action = agent.sample_action(state[np.newaxis, :])
44 | # act!
45 | next_state, reward, done, _ = env.step(action)
46 | total_rewards += reward
47 | # state shift
48 | state = next_state
49 | if done:
50 | break
51 |
52 | print('Ep%s Reward: %s ' % (ep+1, total_rewards))
53 |
54 |
55 | def args_parse():
56 | parser = argparse.ArgumentParser()
57 | parser.add_argument(
58 | '--model_path', default=None,
59 | help='Whether to use a saved model. (*None|model path)')
60 | parser.add_argument(
61 | '--gpu', default=-1,
62 | help='running on a specify gpu, -1 indicates using cpu')
63 | parser.add_argument('--ep', default=1, help='Test episodes')
64 | return parser.parse_args()
65 |
66 |
67 | if __name__ == '__main__':
68 | main(args_parse())
69 |
--------------------------------------------------------------------------------
/algorithms/Actor-Critic/train_actor_critic.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | import os
3 |
4 | import gym
5 | import numpy as np
6 | import tensorflow as tf
7 |
8 | from agent import ActorCritic
9 | from utils import preprocess
10 |
11 |
12 | def main(args):
13 | INPUT_DIM = 80 * 80
14 | HIDDEN_UNITS = 200
15 | ACTION_DIM = 6
16 | MAX_EPISODES = 10000
17 |
18 | # load agent
19 | agent = ActorCritic(INPUT_DIM, HIDDEN_UNITS, ACTION_DIM)
20 | agent.construct_model(args.gpu)
21 |
22 | # load model or init a new
23 | saver = tf.train.Saver(max_to_keep=1)
24 | if args.model_path is not None:
25 | # reuse saved model
26 | saver.restore(agent.sess, args.model_path)
27 | ep_base = int(args.model_path.split('_')[-1])
28 | mean_rewards = float(args.model_path.split('/')[-1].split('_')[0])
29 | else:
30 | # build a new model
31 | agent.sess.run(tf.global_variables_initializer())
32 | ep_base = 0
33 | mean_rewards = 0.0
34 |
35 | # load env
36 | env = gym.make('Pong-v0')
37 |
38 | # training loop
39 | for ep in range(MAX_EPISODES):
40 | step = 0
41 | total_rewards = 0
42 | state = preprocess(env.reset())
43 |
44 | while True:
45 | # sample actions
46 | action = agent.sample_action(state[np.newaxis, :])
47 | # act!
48 | next_state, reward, done, _ = env.step(action)
49 |
50 | next_state = preprocess(next_state)
51 |
52 | step += 1
53 | total_rewards += reward
54 |
55 | agent.store_rollout(state, action, reward, next_state, done)
56 | # state shift
57 | state = next_state
58 |
59 | if done:
60 | break
61 |
62 | mean_rewards = 0.99 * mean_rewards + 0.01 * total_rewards
63 | rounds = (21 - np.abs(total_rewards)) + 21
64 | average_steps = (step + 1) / rounds
65 | print('Ep%s: %d rounds' % (ep_base + ep + 1, rounds))
66 | print('Average_steps: %.2f Reward: %s Average_reward: %.4f' %
67 | (average_steps, total_rewards, mean_rewards))
68 |
69 | # update model per episode
70 | agent.update_model()
71 |
72 | # model saving
73 | if ep > 0 and ep % args.save_every == 0:
74 | if not os.path.isdir(args.save_path):
75 | os.makedirs(args.save_path)
76 | save_name = str(round(mean_rewards, 2)) + '_' + str(ep_base + ep+1)
77 | saver.save(agent.sess, args.save_path + save_name)
78 |
79 |
80 | if __name__ == '__main__':
81 | parser = argparse.ArgumentParser()
82 | parser.add_argument(
83 | '--model_path', default=None,
84 | help='Whether to use a saved model. (*None|model path)')
85 | parser.add_argument(
86 | '--save_path', default='./model/',
87 | help='Path to save a model during training.')
88 | parser.add_argument(
89 | '--save_every', default=100, help='Save model every x episodes')
90 | parser.add_argument(
91 | '--gpu', default=-1,
92 | help='running on a specify gpu, -1 indicates using cpu')
93 | main(parser.parse_args())
94 |
--------------------------------------------------------------------------------
/algorithms/Actor-Critic/utils.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 |
3 |
4 | def preprocess(obs):
5 | obs = obs[35:195] # 160x160x3
6 | obs = obs[::2, ::2, 0] # down sample (80x80)
7 | obs[obs == 144] = 0
8 | obs[obs == 109] = 0
9 | obs[obs != 0] = 1
10 | return obs.astype(np.float).ravel()
11 |
--------------------------------------------------------------------------------
/algorithms/CEM/CEM.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import gym
3 | from gym.spaces import Discrete, Box
4 |
5 |
6 | # policies
7 | class DiscreteAction(object):
8 |
9 | def __init__(self, theta, ob_space, ac_space):
10 | ob_dim = ob_space.shape[0]
11 | ac_dim = ac_space.n
12 | self.W = theta[0: ob_dim * ac_dim].reshape(ob_dim, ac_dim)
13 | self.b = theta[ob_dim * ac_dim:].reshape(1, ac_dim)
14 |
15 | def act(self, ob):
16 | y = np.dot(ob, self.W) + self.b
17 | a = np.argmax(y)
18 | return a
19 |
20 |
21 | class ContinuousAction(object):
22 |
23 | def __init__(self, theta, ob_space, ac_space):
24 | self.ac_space = ac_space
25 | ob_dim = ob_space.shape[0]
26 | ac_dim = ac_space.shape[0]
27 | self.W = theta[0: ob_dim * ac_dim].reshape(ob_dim, ac_dim)
28 | self.b = theta[ob_dim * ac_dim:]
29 |
30 | def act(self, ob):
31 | y = np.dot(ob, self.W) + self.b
32 | a = np.clip(y, self.ac_space.low, self.ac_space.high)
33 | return a
34 |
35 |
36 | def run_episode(policy, env, render=False):
37 | max_steps = 1000
38 | total_rew = 0
39 | ob = env.reset()
40 | for t in range(max_steps):
41 | a = policy.act(ob)
42 | ob, reward, done, _info = env.step(a)
43 | total_rew += reward
44 | if render and t % 3 == 0:
45 | env.render()
46 | if done:
47 | break
48 | return total_rew
49 |
50 |
51 | def make_policy(params):
52 | if isinstance(env.action_space, Discrete):
53 | return DiscreteAction(params, env.observation_space, env.action_space)
54 | elif isinstance(env.action_space, Box):
55 | return ContinuousAction(
56 | params, env.observation_space, env.action_space)
57 | else:
58 | raise NotImplementedError
59 |
60 |
61 | def eval_policy(params):
62 | policy = make_policy(params)
63 | reward = run_episode(policy, env)
64 | return reward
65 |
66 |
67 | env = gym.make('CartPole-v0')
68 | num_iter = 100
69 | batch_size = 25
70 | elite_frac = 0.2
71 | num_elite = int(batch_size * elite_frac)
72 |
73 | if isinstance(env.action_space, Discrete):
74 | dim_params = (env.observation_space.shape[0] + 1) * env.action_space.n
75 | elif isinstance(env.action_space, Box):
76 | dim_params = (env.observation_space.shape[0] + 1) \
77 | * env.action_space.shape[0]
78 | else:
79 | raise NotImplementedError
80 |
81 | params_mean = np.zeros(dim_params)
82 | params_std = np.ones(dim_params)
83 |
84 | for i in range(num_iter):
85 | # sample parameter vectors (multi-variable gaussian distribution)
86 | sample_params = np.random.multivariate_normal(
87 | params_mean, np.diag(params_std), size=batch_size)
88 | # evaluate sample policies
89 | rewards = [eval_policy(params) for params in sample_params]
90 |
91 | # 'elite' policies
92 | elite_idxs = np.argsort(rewards)[batch_size - num_elite: batch_size]
93 | elite_params = [sample_params[i] for i in elite_idxs]
94 |
95 | # move current policy towards elite policies
96 | params_mean = np.mean(np.asarray(elite_params), axis=0)
97 | params_std = np.std(np.asarray(elite_params), axis=0)
98 |
99 | # logging
100 | print('Ep %d: mean score: %8.3f. max score: %4.3f' %
101 | (i, np.mean(rewards), np.max(rewards)))
102 | print('Eval reward: %.4f' %
103 | run_episode(make_policy(params_mean), env, render=True))
104 |
--------------------------------------------------------------------------------
/algorithms/CEM/README.md:
--------------------------------------------------------------------------------
1 | ## Cross-Entropy Method
2 |
3 | The cross-entropy method is a derivative-free policy optimization approach. It simply samples some policies, picks the good ones (elite policies) and moves the current policy towards these elite policies, ignoring all information other than the **rewards** collected during each episode.
4 | CEM works quite well on tasks with simple policies (small parameter spaces). A short sketch of the update is given at the end of this README.
5 |
6 | ## Requirements
7 | * [Numpy](http://www.numpy.org/)
8 | * [gym](https://gym.openai.com)
9 |
10 | ## Test environments
11 | * Discrete: CartPole-v0, Acrobot-v1, MountainCar-v0
12 | * Continuous: Pendulum-v0, BipedalWalker-v2
13 |
14 |
15 | ## Reference
16 | [John Schulman MLSS 2016](http://rl-gym-doc.s3-website-us-west-2.amazonaws.com/mlss/lab1.html)
17 |
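18 | ## Update sketch
19 |
20 | A minimal, self-contained sketch of one CEM iteration; the dummy quadratic score below stands in for the episode return that `CEM.py` collects from the environment:
21 |
22 | ```python
23 | import numpy as np
24 |
25 | # toy sizes: 25 sampled policies, 5 elites, 6 policy parameters
26 | batch_size, num_elite, dim_params = 25, 5, 6
27 | params_mean, params_std = np.zeros(dim_params), np.ones(dim_params)
28 |
29 | # 1. sample candidate parameter vectors around the current distribution
30 | samples = params_mean + params_std * np.random.randn(batch_size, dim_params)
31 | # 2. score each candidate (a dummy score here; normally an episode rollout)
32 | rewards = -np.sum(samples ** 2, axis=1)
33 | # 3. keep the elites and refit the sampling distribution to them
34 | elite = samples[np.argsort(rewards)[-num_elite:]]
35 | params_mean, params_std = elite.mean(axis=0), elite.std(axis=0)
36 | ```
37 |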
--------------------------------------------------------------------------------
/algorithms/DDPG/README.md:
--------------------------------------------------------------------------------
1 | ### Deep Deterministic Policy Gradients (DDPG)
2 |
3 | Deep Deterministic Policy Gradient is a model-free, off-policy actor-critic algorithm that combines DPG ([Deterministic Policy Gradient](http://www.jmlr.org/proceedings/papers/v32/silver14.pdf)) and DQN. A short sketch of the soft target update it borrows from DQN is given at the end of this README.
4 |
5 | Related papers:
6 | - [Lillicrap, Timothy P., et al., 2015](https://arxiv.org/pdf/1509.02971.pdf)
7 |
8 | #### Walker2d
9 |
10 | Here we use DDPG to solve Walker2d, a continuous control task based on the [Mujoco](http://www.mujoco.org/) engine. The goal is to make a two-dimensional bipedal robot walk forward as fast as possible.
11 |
12 |
13 |
14 | #### Requirements
15 |
16 | * [Numpy](http://www.numpy.org/)
17 | * [Tensorflow](http://www.tensorflow.org)
18 | * [gym](https://gym.openai.com)
19 | * [Mujoco](https://www.roboti.us/index.html)
20 |
21 | #### Run
22 |
23 | ```bash
24 | python train_ddpg.py
25 | python train_ddpg.py -h # training options and hyper parameters settings
26 | ```
27 |
28 | #### Results
29 |
30 |
31 | Results after training for 1.5M steps
32 |
33 |
34 | Training rewards (smoothed)
35 |
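36 | #### Soft target update (sketch)
37 |
38 | Along with the replay buffer, the piece DDPG borrows from DQN is the target network idea, with the hard copy replaced by a soft update. A minimal sketch, with plain NumPy arrays standing in for network parameters and `tau` as the update rate:
39 |
40 | ```python
41 | import numpy as np
42 |
43 | def soft_update(target_params, source_params, tau=0.001):
44 |     # move every target parameter a small step towards its online counterpart
45 |     return [(1.0 - tau) * t + tau * s for t, s in zip(target_params, source_params)]
46 |
47 | # toy example: one weight matrix and one bias vector per network
48 | online = [np.ones((4, 4)), np.ones(4)]
49 | target = [np.zeros((4, 4)), np.zeros(4)]
50 | target = soft_update(target, online)
51 | ```
52 |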
--------------------------------------------------------------------------------
/algorithms/DDPG/agent.py:
--------------------------------------------------------------------------------
1 | import random
2 | from collections import deque
3 |
4 | import numpy as np
5 | import tensorflow as tf
6 | import tensorflow.contrib.layers as tcl
7 |
8 | from ou_noise import OUNoise
9 |
10 |
11 | class DDPG:
12 |
13 | def __init__(self, env, args):
14 | self.action_dim = env.action_space.shape[0]
15 | self.state_dim = env.observation_space.shape[0]
16 |
17 | self.actor_lr = args.a_lr
18 | self.critic_lr = args.c_lr
19 |
20 | self.gamma = args.gamma
21 |
22 | # Ornstein-Uhlenbeck noise parameters
23 | self.ou = OUNoise(
24 | self.action_dim, theta=args.noise_theta, sigma=args.noise_sigma)
25 |
26 | self.replay_buffer = deque(maxlen=args.buffer_size)
27 | self.replay_start_size = args.replay_start_size
28 |
29 | self.batch_size = args.batch_size
30 |
31 | self.target_update_rate = args.target_update_rate
32 | self.total_parameters = 0
33 | self.global_steps = 0
34 | self.reg_param = args.reg_param
35 |
36 | def construct_model(self, gpu):
37 | if gpu == -1: # use CPU
38 | device = '/cpu:0'
39 | sess_config = tf.ConfigProto()
40 | else: # use GPU
41 | device = '/gpu:' + str(gpu)
42 | sess_config = tf.ConfigProto(
43 | log_device_placement=True, allow_soft_placement=True)
44 | sess_config.gpu_options.allow_growth = True
45 |
46 | self.sess = tf.Session(config=sess_config)
47 |
48 | with tf.device(device):
49 | # output action, q_value and gradients of q_val w.r.t. action
50 | with tf.name_scope('predict_actions'):
51 | self.states = tf.placeholder(
52 | tf.float32, [None, self.state_dim], name='states')
53 | self.action = tf.placeholder(
54 | tf.float32, [None, self.action_dim], name='action')
55 | self.is_training = tf.placeholder(tf.bool, name='is_training')
56 |
57 | self.action_outputs, actor_params = self._build_actor(
58 | self.states, scope='actor_net', bn=True)
59 | value_outputs, critic_params = self._build_critic(
60 | self.states, self.action, scope='critic_net', bn=False)
61 | self.action_gradients = tf.gradients(
62 | value_outputs, self.action)[0]
63 |
64 | # estimate target_q for update critic
65 | with tf.name_scope('estimate_target_q'):
66 | self.next_states = tf.placeholder(
67 | tf.float32, [None, self.state_dim], name='next_states')
68 | self.mask = tf.placeholder(tf.float32, [None], name='mask')
69 | self.rewards = tf.placeholder(tf.float32, [None], name='rewards')
70 |
71 | # target actor network
72 | t_action_outputs, t_actor_params = self._build_actor(
73 | self.next_states, scope='t_actor_net', bn=True,
74 | trainable=False)
75 | # target critic network
76 | t_value_outputs, t_critic_params = self._build_critic(
77 | self.next_states, t_action_outputs, bn=False,
78 | scope='t_critic_net', trainable=False)
79 |
80 | target_q = self.rewards + self.gamma * \
81 | (t_value_outputs[:, 0] * self.mask)
82 |
83 | with tf.name_scope('compute_gradients'):
84 | actor_opt = tf.train.AdamOptimizer(self.actor_lr)
85 | critic_opt = tf.train.AdamOptimizer(self.critic_lr)
86 |
87 | # critic gradients
88 | td_error = target_q - value_outputs[:, 0]
89 | critic_mse = tf.reduce_mean(tf.square(td_error))
90 | critic_reg = tf.reduce_sum(
91 | [tf.nn.l2_loss(v) for v in critic_params])
92 | critic_loss = critic_mse + self.reg_param * critic_reg
93 | critic_gradients = critic_opt.compute_gradients(
94 | critic_loss, critic_params)
95 | # actor gradients
96 | self.q_action_grads = tf.placeholder(
97 | tf.float32, [None, self.action_dim], name='q_action_grads')
98 | actor_gradients = tf.gradients(
99 | self.action_outputs, actor_params, -self.q_action_grads)
100 | actor_gradients = zip(actor_gradients, actor_params)
101 | # apply gradient to update model
102 | self.train_actor = actor_opt.apply_gradients(actor_gradients)
103 | self.train_critic = critic_opt.apply_gradients(
104 | critic_gradients)
105 |
106 | with tf.name_scope('update_target_networks'):
107 | # batch norm parameters should not be included when updating!
108 | target_networks_update = []
109 |
110 | for v_source, v_target in zip(actor_params, t_actor_params):
111 | update_op = v_target.assign_sub(
112 | 0.001 * (v_target - v_source))
113 | target_networks_update.append(update_op)
114 |
115 | for v_source, v_target in zip(critic_params, t_critic_params):
116 | update_op = v_target.assign_sub(
117 | 0.01 * (v_target - v_source))
118 | target_networks_update.append(update_op)
119 |
120 | self.target_networks_update = tf.group(*target_networks_update)
121 |
122 | with tf.name_scope('total_numbers_of_parameters'):
123 | for v in tf.trainable_variables():
124 | shape = v.get_shape()
125 | param_num = 1
126 | for d in shape:
127 | param_num *= d.value
128 | print(v.name, ' ', shape, ' param nums: ', param_num)
129 | self.total_parameters += param_num
130 | print('Total nums of parameters: ', self.total_parameters)
131 |
132 | def sample_action(self, states, noise):
133 |         # is_training is supposed to be False when sampling actions.
134 | action = self.sess.run(
135 | self.action_outputs,
136 | feed_dict={self.states: states, self.is_training: False})
137 | ou_noise = self.ou.noise() if noise else 0
138 |
139 | return action + ou_noise
140 |
141 | def store_experience(self, s, a, r, next_s, done):
142 | self.replay_buffer.append([s, a[0], r, next_s, done])
143 | self.global_steps += 1
144 |
145 | def update_model(self):
146 |
147 | if len(self.replay_buffer) < self.replay_start_size:
148 | return
149 |
150 | # get batch
151 | batch = random.sample(self.replay_buffer, self.batch_size)
152 | s, _a, r, next_s, done = np.vstack(batch).T.tolist()
153 | mask = ~np.array(done)
154 |
155 | # compute a = u(s)
156 | a = self.sess.run(self.action_outputs, {
157 | self.states: s,
158 | self.is_training: True
159 | })
160 | # gradients of q_value w.r.t action a
161 | dq_da = self.sess.run(self.action_gradients, {
162 | self.states: s,
163 | self.action: a,
164 | self.is_training: True
165 | })
166 | # train
167 | self.sess.run([self.train_actor, self.train_critic], {
168 | # train_actor feed
169 | self.states: s,
170 | self.is_training: True,
171 | self.q_action_grads: dq_da,
172 | # train_critic feed
173 | self.next_states: next_s,
174 | self.action: _a,
175 | self.mask: mask,
176 | self.rewards: r
177 | })
178 | # update target network
179 | self.sess.run(self.target_networks_update)
180 |
181 | def _build_actor(self, states, scope, bn=False, trainable=True):
182 | h1_dim = 400
183 | h2_dim = 300
184 | init = tf.contrib.layers.variance_scaling_initializer(
185 | factor=1.0, mode='FAN_IN', uniform=True)
186 |
187 | with tf.variable_scope(scope):
188 | if bn:
189 | states = self.batch_norm(
190 | states, self.is_training, tf.identity,
191 | scope='actor_bn_states', trainable=trainable)
192 | h1 = tcl.fully_connected(
193 | states, h1_dim, activation_fn=None, weights_initializer=init,
194 | biases_initializer=init, trainable=trainable, scope='actor_h1')
195 |
196 | if bn:
197 | h1 = self.batch_norm(
198 | h1, self.is_training, tf.nn.relu, scope='actor_bn_h1',
199 | trainable=trainable)
200 | else:
201 | h1 = tf.nn.relu(h1)
202 |
203 | h2 = tcl.fully_connected(
204 | h1, h2_dim, activation_fn=None, weights_initializer=init,
205 | biases_initializer=init, trainable=trainable, scope='actor_h2')
206 | if bn:
207 | h2 = self.batch_norm(
208 | h2, self.is_training, tf.nn.relu, scope='actor_bn_h2',
209 | trainable=trainable)
210 | else:
211 | h2 = tf.nn.relu(h2)
212 |
213 | # use tanh to bound the action
214 | a = tcl.fully_connected(
215 | h2, self.action_dim, activation_fn=tf.nn.tanh,
216 | weights_initializer=tf.random_uniform_initializer(-3e-3, 3e-3),
217 | biases_initializer=tf.random_uniform_initializer(-3e-4, 3e-4),
218 | trainable=trainable, scope='actor_out')
219 |
220 | params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope)
221 |
222 | return a, params
223 |
224 | def _build_critic(self, states, action, scope, bn=False, trainable=True):
225 | h1_dim = 400
226 | h2_dim = 300
227 | init = tf.contrib.layers.variance_scaling_initializer(
228 | factor=1.0, mode='FAN_IN', uniform=True)
229 | with tf.variable_scope(scope):
230 | if bn:
231 | states = self.batch_norm(
232 | states, self.is_training, tf.identity,
233 | scope='critic_bn_state', trainable=trainable)
234 | h1 = tcl.fully_connected(
235 | states, h1_dim, activation_fn=None, weights_initializer=init,
236 | biases_initializer=init, trainable=trainable, scope='critic_h1')
237 | if bn:
238 | h1 = self.batch_norm(
239 | h1, self.is_training, tf.nn.relu, scope='critic_bn_h1',
240 | trainable=trainable)
241 | else:
242 | h1 = tf.nn.relu(h1)
243 |
244 | # skip action from the first layer
245 | h1 = tf.concat([h1, action], 1)
246 |
247 | h2 = tcl.fully_connected(
248 | h1, h2_dim, activation_fn=None, weights_initializer=init,
249 | biases_initializer=init, trainable=trainable,
250 | scope='critic_h2')
251 |
252 | if bn:
253 | h2 = self.batch_norm(
254 | h2, self.is_training, tf.nn.relu, scope='critic_bn_h2',
255 | trainable=trainable)
256 | else:
257 | h2 = tf.nn.relu(h2)
258 |
259 | q = tcl.fully_connected(
260 | h2, 1, activation_fn=None,
261 | weights_initializer=tf.random_uniform_initializer(-3e-3, 3e-3),
262 | biases_initializer=tf.random_uniform_initializer(-3e-4, 3e-4),
263 | trainable=trainable, scope='critic_out')
264 |
265 | params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope)
266 | return q, params
267 |
268 | @staticmethod
269 | def batch_norm(x, is_training, activation_fn, scope, trainable=True):
270 | # switch the 'is_training' flag and 'reuse' flag
271 | return tf.cond(
272 | is_training,
273 | lambda: tf.contrib.layers.batch_norm(
274 | x,
275 | activation_fn=activation_fn,
276 | center=True,
277 | scale=True,
278 | updates_collections=None,
279 | is_training=True,
280 | reuse=None,
281 | scope=scope,
282 | decay=0.9,
283 | epsilon=1e-5,
284 | trainable=trainable),
285 | lambda: tf.contrib.layers.batch_norm(
286 | x,
287 | activation_fn=activation_fn,
288 | center=True,
289 | scale=True,
290 | updates_collections=None,
291 | is_training=False,
292 | reuse=True, # to be able to reuse scope must be given
293 | scope=scope,
294 | decay=0.9,
295 | epsilon=1e-5,
296 | trainable=trainable))
297 |
--------------------------------------------------------------------------------
/algorithms/DDPG/evaluate.py:
--------------------------------------------------------------------------------
1 | import argparse
2 |
3 | import gym
4 | import numpy as np
5 | import tensorflow as tf
6 | from gym import wrappers
7 |
8 | from agent import DDPG
9 |
10 |
11 | def main(args):
12 | env = gym.make('Walker2d-v1')
13 | env = wrappers.Monitor(env, './videos/', force=True)
14 | reward_history = []
15 |
16 | agent = DDPG(env, args)
17 | agent.construct_model(args.gpu)
18 |
19 | saver = tf.train.Saver()
20 | if args.model_path is not None:
21 | # reuse saved model
22 | saver.restore(agent.sess, args.model_path)
23 | ep_base = int(args.model_path.split('_')[-1])
24 | best_avg_rewards = float(args.model_path.split('/')[-1].split('_')[0])
25 | else:
26 | raise ValueError('model_path required!')
27 |
28 | for ep in range(args.ep):
29 | # env init
30 | state = env.reset()
31 | ep_rewards = 0
32 | for step in range(env.spec.timestep_limit):
33 | env.render()
34 | action = agent.sample_action(state[np.newaxis, :], noise=False)
35 | # act
36 | next_state, reward, done, _ = env.step(action[0])
37 |
38 | ep_rewards += reward
39 | agent.store_experience(state, action, reward, next_state, done)
40 |
41 | # shift
42 | state = next_state
43 | if done:
44 | break
45 | reward_history.append(ep_rewards)
46 | print('Ep%d reward:%d' % (ep + 1, ep_rewards))
47 |
48 | print('Average rewards: ', np.mean(reward_history))
49 |
50 |
51 | def args_parse():
52 | parser = argparse.ArgumentParser()
53 | parser.add_argument(
54 | '--model_path', default='./models/',
55 | help='Whether to use a saved model. (*None|model path)')
56 | parser.add_argument(
57 | '--save_path', default='./models/',
58 | help='Path to save a model during training.')
59 | parser.add_argument(
60 | '--gpu', type=int, default=-1,
61 | help='running on a specify gpu, -1 indicates using cpu')
62 | parser.add_argument(
63 | '--seed', default=31, type=int, help='random seed')
64 |
65 | parser.add_argument(
66 | '--a_lr', type=float, default=1e-4, help='Actor learning rate')
67 | parser.add_argument(
68 | '--c_lr', type=float, default=1e-3, help='Critic learning rate')
69 | parser.add_argument(
70 | '--batch_size', type=int, default=64, help='Size of training batch')
71 | parser.add_argument(
72 | '--gamma', type=float, default=0.99, help='Discounted factor')
73 | parser.add_argument(
74 | '--target_update_rate', type=float, default=0.001,
75 | help='parameter of soft target update')
76 | parser.add_argument(
77 | '--reg_param', type=float, default=0.01, help='l2 regularization')
78 | parser.add_argument(
79 | '--buffer_size', type=int, default=1000000, help='Size of memory buffer')
80 | parser.add_argument(
81 | '--replay_start_size', type=int, default=1000,
82 | help='Number of steps before learning from replay memory')
83 | parser.add_argument(
84 | '--noise_theta', type=float, default=0.15,
85 | help='Ornstein-Uhlenbeck noise parameters')
86 | parser.add_argument(
87 | '--noise_sigma', type=float, default=0.20,
88 | help='Ornstein-Uhlenbeck noise parameters')
89 | parser.add_argument('--ep', default=10, help='Test episodes')
90 | return parser.parse_args()
91 |
92 |
93 | if __name__ == '__main__':
94 | main(args_parse())
95 |
--------------------------------------------------------------------------------
/algorithms/DDPG/ou_noise.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 |
3 |
4 | class OUNoise:
5 | '''Ornstein-Uhlenbeck process noise, used for exploration in continuous action spaces.'''
6 |
7 | def __init__(self, action_dimension, mu=0, theta=0.15, sigma=0.2):
8 | self.action_dimension = action_dimension
9 | self.mu = mu
10 | self.theta = theta
11 | self.sigma = sigma
12 | self.state = np.ones(self.action_dimension) * self.mu
13 | self.reset()
14 |
15 | def reset(self):
16 | self.state = np.ones(self.action_dimension) * self.mu
17 |
18 | def noise(self):
19 | x = self.state
20 | dx = self.theta * (self.mu - x) + self.sigma * np.random.randn(len(x))
21 | self.state = x + dx
22 | return self.state
23 |
--------------------------------------------------------------------------------
/algorithms/DDPG/train_ddpg.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | import os
3 |
4 | import gym
5 | import matplotlib.pyplot as plt
6 | import numpy as np
7 | import tensorflow as tf
8 |
9 | from agent import DDPG
10 |
11 |
12 | def main(args):
13 | set_random_seed(args.seed)
14 | env = gym.make('Walker2d-v1')
15 | agent = DDPG(env, args)
16 | agent.construct_model(args.gpu)
17 |
18 | saver = tf.train.Saver(max_to_keep=1)
19 | if args.model_path is not None:
20 | # reuse saved model
21 | saver.restore(agent.sess, args.model_path)
22 | ep_base = int(args.model_path.split('_')[-1])
23 | best_avg_rewards = float(args.model_path.split('/')[-1].split('_')[0])
24 | else:
25 | # build a new model
26 | agent.sess.run(tf.global_variables_initializer())
27 | ep_base = 0
28 | best_avg_rewards = None
29 |
30 | reward_history, step_history = [], []
31 | train_steps = 0
32 |
33 | for ep in range(args.max_ep):
34 | # env init
35 | state = env.reset()
36 | ep_rewards = 0
37 | for step in range(env.spec.timestep_limit):
38 | action = agent.sample_action(state[np.newaxis, :], noise=True)
39 | # act
40 | next_state, reward, done, _ = env.step(action[0])
41 | train_steps += 1
42 | ep_rewards += reward
43 |
44 | agent.store_experience(state, action, reward, next_state, done)
45 | agent.update_model()
46 | # shift
47 | state = next_state
48 | if done:
49 | print('Ep %d global_steps: %d Reward: %.2f' %
50 | (ep + 1, agent.global_steps, ep_rewards))
51 | # reset ou noise
52 | agent.ou.reset()
53 | break
54 | step_history.append(train_steps)
55 | if not reward_history:
56 | reward_history.append(ep_rewards)
57 | else:
58 | reward_history.append(reward_history[-1] * 0.99 + ep_rewards * 0.01)
59 |
60 | # Evaluate during training
61 | if ep % args.log_every == 0 and ep > 0:
62 | ep_rewards = 0
63 | for ep_eval in range(args.test_ep):
64 | state = env.reset()
65 | for step_eval in range(env.spec.timestep_limit):
66 | action = agent.sample_action(
67 | state[np.newaxis, :], noise=False)
68 | next_state, reward, done, _ = env.step(action[0])
69 | ep_rewards += reward
70 | state = next_state
71 | if done:
72 | break
73 |
74 | curr_avg_rewards = ep_rewards / args.test_ep
75 |
76 | # logging
77 | print('\n')
78 | print('Episode: %d' % (ep + 1))
79 | print('Global steps: %d' % agent.global_steps)
80 | print('Mean reward: %.2f' % curr_avg_rewards)
81 | print('\n')
82 | if not best_avg_rewards or (curr_avg_rewards >= best_avg_rewards):
83 | best_avg_rewards = curr_avg_rewards
84 | if not os.path.isdir(args.save_path):
85 | os.makedirs(args.save_path)
86 | save_name = args.save_path + str(round(best_avg_rewards, 2)) \
87 | + '_' + str(ep_base + ep + 1)
88 | saver.save(agent.sess, save_name)
89 | print('Model saved %s' % save_name)
90 |
91 | plt.plot(step_history, reward_history)
92 | plt.xlabel('steps')
93 | plt.ylabel('running reward')
94 | plt.show()
95 |
96 |
97 | def args_parse():
98 | parser = argparse.ArgumentParser()
99 | parser.add_argument(
100 | '--model_path', default=None,
101 | help='Whether to use a saved model. (*None|model path)')
102 | parser.add_argument(
103 | '--save_path', default='./models/',
104 | help='Path to save a model during training.')
105 | parser.add_argument(
106 | '--log_every', default=100,
107 | help='Interval of logging and may be model saving')
108 | parser.add_argument(
109 | '--gpu', type=int, default=-1,
110 | help='running on a specify gpu, -1 indicates using cpu')
111 | parser.add_argument(
112 | '--seed', default=31, type=int, help='random seed')
113 |
114 | parser.add_argument(
115 | '--max_ep', type=int, default=10000, help='Number of training episodes')
116 | parser.add_argument(
117 | '--test_ep', type=int, default=10, help='Number of test episodes')
118 | parser.add_argument(
119 | '--a_lr', type=float, default=1e-4, help='Actor learning rate')
120 | parser.add_argument(
121 | '--c_lr', type=float, default=1e-3, help='Critic learning rate')
122 | parser.add_argument(
123 | '--batch_size', type=int, default=64, help='Size of training batch')
124 | parser.add_argument(
125 | '--gamma', type=float, default=0.99, help='Discounted factor')
126 | parser.add_argument(
127 | '--target_update_rate', type=float, default=0.001,
128 | help='soft target update rate')
129 | parser.add_argument(
130 | '--reg_param', type=float, default=0.01, help='l2 regularization')
131 | parser.add_argument(
132 | '--buffer_size', type=int, default=1000000, help='Size of memory buffer')
133 | parser.add_argument(
134 | '--replay_start_size', type=int, default=1000,
135 | help='Number of steps before learning from replay memory')
136 | parser.add_argument(
137 | '--noise_theta', type=float, default=0.15,
138 | help='Ornstein-Uhlenbeck noise parameters')
139 | parser.add_argument(
140 | '--noise_sigma', type=float, default=0.20,
141 | help='Ornstein-Uhlenbeck noise parameters')
142 | return parser.parse_args()
143 |
144 |
145 | def set_random_seed(seed):
146 | np.random.seed(seed)
147 | tf.set_random_seed(seed)
148 |
149 |
150 | if __name__ == '__main__':
151 | main(args_parse())
152 |
--------------------------------------------------------------------------------
/algorithms/DQN/README.md:
--------------------------------------------------------------------------------
1 | ## DQN
2 | DQN is a deep reinforcement learning architecture proposed by DeepMind in 2013. They used this architecture to play Atari games and obtained human-level performance.
3 | Here we implement a simple version of DQN to solve the CartPole task.
4 |
5 | Related papers:
6 | * [Mnih et al., 2013](https://arxiv.org/pdf/1312.5602.pdf)
7 | * [Mnih et al., 2015](http://www.nature.com/nature/journal/v518/n7540/pdf/nature14236.pdf)
8 |
9 | Double DQN is also implemented. Pass the `--double_q=True` argument to enable it. Since the CartPole task is easy to solve, Double DQN makes very little difference here. A short sketch of the Double DQN target is given at the end of this README.
10 |
11 | ## CartPole
12 | In CartPole, a pole is attached by an un-actuated joint to a cart, which moves along a track. The agent controls the cart by moving it left or right in order to balance the pole. See the [OpenAI wiki](https://github.com/openai/gym/wiki/CartPole-v0) for more about CartPole.
13 |
14 |
15 |
16 | We will solve CartPole using the DQN algorithm.
17 |
18 | ## Requirements
19 | * [Numpy](http://www.numpy.org/)
20 | * [Tensorflow](http://www.tensorflow.org)
21 | * [gym](https://gym.openai.com)
22 |
23 |
24 | ## Run
25 | python train_DQN.py
26 | python train_DQN.py -h # training options and hyper parameters settings
27 |
28 |
29 | ## Training plot
30 |
31 |
32 |
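33 | ## Double DQN target (sketch)
34 |
35 | The only change Double DQN makes is in how the bootstrap target is built: the online network chooses the next action and the target network evaluates it. A minimal NumPy sketch on toy Q-values:
36 |
37 | ```python
38 | import numpy as np
39 |
40 | # toy Q-values for a batch of 3 next-states and 2 actions
41 | q_online = np.array([[1.0, 2.0], [0.5, 0.1], [3.0, 2.5]])  # online network Q(s', .)
42 | q_target = np.array([[1.1, 1.9], [0.4, 0.2], [2.8, 2.6]])  # target network Q(s', .)
43 | rewards = np.array([0.1, 0.1, -1.0])
44 | not_done = np.array([1.0, 1.0, 0.0])
45 | gamma = 0.99
46 |
47 | # vanilla DQN target: max over the target network's own estimates
48 | dqn_target = rewards + gamma * not_done * q_target.max(axis=1)
49 |
50 | # Double DQN target: online network selects, target network evaluates
51 | best_a = q_online.argmax(axis=1)
52 | ddqn_target = rewards + gamma * not_done * q_target[np.arange(3), best_a]
53 | ```
54 |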
--------------------------------------------------------------------------------
/algorithms/DQN/agent.py:
--------------------------------------------------------------------------------
1 | import random
2 | from collections import deque
3 |
4 | import numpy as np
5 | import tensorflow as tf
6 |
7 |
8 | class DQN:
9 |
10 | def __init__(self, env, args):
11 | # init parameters
12 | self.global_step = 0
13 | self.epsilon = args.init_epsilon
14 | self.state_dim = env.observation_space.shape[0]
15 | self.action_dim = env.action_space.n
16 |
17 | self.gamma = args.gamma
18 | self.learning_rate = args.lr
19 | self.batch_size = args.batch_size
20 |
21 | self.double_q = args.double_q
22 | self.target_network_update_interval = args.target_network_update
23 |
24 | # init replay buffer
25 | self.replay_buffer = deque(maxlen=args.buffer_size)
26 |
27 | def network(self, input_state):
28 | hidden_unit = 100
29 | w1 = tf.Variable(tf.math.divide(tf.random_normal(
30 | [self.state_dim, hidden_unit]), np.sqrt(self.state_dim)))
31 | b1 = tf.Variable(tf.constant(0.0, shape=[hidden_unit]))
32 | hidden = tf.nn.relu(tf.matmul(input_state, w1) + b1)
33 |
34 | w2 = tf.Variable(tf.math.divide(tf.random_normal(
35 | [hidden_unit, self.action_dim]), np.sqrt(hidden_unit)))
36 | b2 = tf.Variable(tf.constant(0.0, shape=[self.action_dim]))
37 | output_Q = tf.matmul(hidden, w2) + b2
38 | return output_Q
39 |
40 | @staticmethod
41 | def get_session(device):
42 | if device == -1: # use CPU
43 | device = '/cpu:0'
44 | sess_config = tf.ConfigProto()
45 | else: # use GPU
46 | device = '/gpu:' + str(device)
47 | sess_config = tf.ConfigProto(
48 | log_device_placement=True,
49 | allow_soft_placement=True)
50 | sess_config.gpu_options.allow_growth = True
51 | sess = tf.Session(config=sess_config)
52 | return sess, device
53 |
54 | def construct_model(self, gpu):
55 | self.sess, device = self.get_session(gpu)
56 |
57 | with tf.device(device):
58 | with tf.name_scope('input_state'):
59 | self.input_state = tf.placeholder(
60 | tf.float32, [None, self.state_dim])
61 |
62 | with tf.name_scope('q_network'):
63 | self.output_Q = self.network(self.input_state)
64 |
65 | with tf.name_scope('optimize'):
66 | self.input_action = tf.placeholder(
67 | tf.float32, [None, self.action_dim])
68 | self.target_Q = tf.placeholder(tf.float32, [None])
69 | # Q value of the selected action
70 | action_Q = tf.reduce_sum(tf.multiply(
71 | self.output_Q, self.input_action), reduction_indices=1)
72 |
73 | self.loss = tf.reduce_mean(tf.square(self.target_Q - action_Q))
74 | optimizer = tf.train.RMSPropOptimizer(self.learning_rate)
75 | self.train_op = optimizer.minimize(self.loss)
76 |
77 | # target network
78 | with tf.name_scope('target_network'):
79 | self.target_output_Q = self.network(self.input_state)
80 |
81 | q_parameters = tf.get_collection(
82 | tf.GraphKeys.TRAINABLE_VARIABLES, scope='q_network')
83 | target_q_parameters = tf.get_collection(
84 | tf.GraphKeys.TRAINABLE_VARIABLES, scope='target_network')
85 |
86 | with tf.name_scope('update_target_network'):
87 | self.update_target_network = []
88 | for v_source, v_target in zip(
89 | q_parameters, target_q_parameters):
90 | # update_op = v_target.assign(v_source)
91 | # soft target update to stabilize training
92 | update_op = v_target.assign_sub(0.1 * (v_target - v_source))
93 | self.update_target_network.append(update_op)
94 | # group all update together
95 | self.update_target_network = tf.group(
96 | *self.update_target_network)
97 |
98 | def sample_action(self, state, policy="greedy"):
99 | self.global_step += 1
100 | # Q_value of all actions
101 | output_Q = self.sess.run(
102 | self.output_Q, feed_dict={self.input_state: [state]})[0]
103 | if policy == 'egreedy':
104 | if random.random() <= self.epsilon: # random action
105 | return random.randint(0, self.action_dim - 1)
106 | else: # greedy action
107 | return np.argmax(output_Q)
108 | elif policy == 'greedy':
109 | return np.argmax(output_Q)
110 | elif policy == 'random':
111 | return random.randint(0, self.action_dim - 1)
112 |
113 | def learn(self, state, action, reward, next_state, done):
114 | onehot_action = np.zeros(self.action_dim)
115 | onehot_action[action] = 1
116 |
117 | # store experience in deque
118 | self.replay_buffer.append(
119 | np.array([state, onehot_action, reward, next_state, done]))
120 |
121 | if len(self.replay_buffer) > self.batch_size:
122 | # update target network if needed
123 | if self.global_step % self.target_network_update_interval == 0:
124 | self.sess.run(self.update_target_network)
125 |
126 | # sample experience
127 | minibatch = random.sample(self.replay_buffer, self.batch_size)
128 |
129 | # transpose mini-batch
130 | s_batch, a_batch, r_batch, next_s_batch, done_batch = np.array(minibatch).T
131 | s_batch, a_batch = np.stack(s_batch), np.stack(a_batch)
132 | next_s_batch = np.stack(next_s_batch)
133 |
134 | # use target q network to get Q of all
135 | next_s_all_action_Q = self.sess.run(
136 | self.target_output_Q, {self.input_state: next_s_batch})
137 | next_s_Q_batch = np.max(next_s_all_action_Q, 1)
138 |
139 | if self.double_q:
140 | # use source network to select best action a*
141 | next_s_action_batch = np.argmax(self.sess.run(
142 | self.output_Q, {self.input_state: next_s_batch}), 1)
143 | # then use target network to compute Q(s', a*)
144 | next_s_Q_batch = next_s_all_action_Q[np.arange(self.batch_size),
145 | next_s_action_batch]
146 |
147 | # calculate target_Q_batch
148 | mask = ~done_batch.astype(np.bool)
149 | target_Q_batch = r_batch + self.gamma * mask * next_s_Q_batch
150 |
151 | # run actual training
152 | self.sess.run(self.train_op, feed_dict={
153 | self.target_Q: target_Q_batch,
154 | self.input_action: a_batch,
155 | self.input_state: s_batch})
156 |
--------------------------------------------------------------------------------
/algorithms/DQN/evaluation.py:
--------------------------------------------------------------------------------
1 | import argparse
2 |
3 | import gym
4 | import tensorflow as tf
5 |
6 | from agent import DQN
7 |
8 |
9 | def main(args):
10 | # load env
11 | env = gym.make('CartPole-v0')
12 | # load agent
13 | agent = DQN(env, args)
14 | agent.construct_model(args.gpu)
15 |
16 | # load model or init a new
17 | saver = tf.train.Saver()
18 | if args.model_path is not None:
19 | # reuse saved model
20 | saver.restore(agent.sess, args.model_path)
21 | else:
22 | # build a new model
23 | agent.sess.run(tf.global_variables_initializer())
24 |
25 | # evaluation loop
26 | for ep in range(args.ep):
27 | # reset env
28 | total_rewards = 0
29 | state = env.reset()
30 |
31 | while True:
32 | env.render()
33 | # sample actions
34 | action = agent.sample_action(state, policy='greedy')
35 | # act!
36 | next_state, reward, done, _ = env.step(action)
37 | total_rewards += reward
38 | # state shift
39 | state = next_state
40 | if done:
41 | break
42 |
43 | print('Ep%s Reward: %s ' % (ep+1, total_rewards))
44 |
45 |
46 | def args_parse():
47 | parser = argparse.ArgumentParser()
48 | parser.add_argument(
49 | '--model_path', default=None,
50 | help='Whether to use a saved model. (*None|model path)')
51 | parser.add_argument(
52 | '--gpu', default=-1,
53 | help='running on a specify gpu, -1 indicates using cpu')
54 | parser.add_argument(
55 | '--ep', type=int, default=1, help='Test episodes')
56 | # hyper parameters required by the DQN constructor (defaults mirror train_DQN.py)
57 | parser.add_argument('--init_epsilon', type=float, default=0.75)
58 | parser.add_argument('--lr', type=float, default=1e-4)
59 | parser.add_argument('--batch_size', type=int, default=128)
60 | parser.add_argument('--gamma', type=float, default=0.99)
61 | parser.add_argument('--buffer_size', type=int, default=50000)
62 | parser.add_argument('--double_q', default=True)
63 | parser.add_argument('--target_network_update', type=int, default=1000)
64 | return parser.parse_args()
65 |
66 |
67 | if __name__ == '__main__':
68 | main(args_parse())
69 |
--------------------------------------------------------------------------------
/algorithms/DQN/train_DQN.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | import os
3 |
4 | import gym
5 | import matplotlib.pyplot as plt
6 | import numpy as np
7 | import tensorflow as tf
8 |
9 | from agent import DQN
10 |
11 |
12 | def main(args):
13 | set_random_seed(args.seed)
14 |
15 | env = gym.make("CartPole-v0")
16 | agent = DQN(env, args)
17 | agent.construct_model(args.gpu)
18 |
19 | # load pre-trained models or init new a model.
20 | saver = tf.train.Saver(max_to_keep=1)
21 | if args.model_path is not None:
22 | saver.restore(agent.sess, args.model_path)
23 | ep_base = int(args.model_path.split('_')[-1])
24 | best_mean_rewards = float(args.model_path.split('/')[-1].split('_')[0])
25 | else:
26 | agent.sess.run(tf.global_variables_initializer())
27 | ep_base = 0
28 | best_mean_rewards = None
29 |
30 | rewards_history, steps_history = [], []
31 | train_steps = 0
32 | # Training
33 | for ep in range(args.max_ep):
34 | state = env.reset()
35 | ep_rewards = 0
36 | for step in range(env.spec.max_episode_steps):
37 | # pick action
38 | action = agent.sample_action(state, policy='egreedy')
39 | # execute the action.
40 | next_state, reward, done, debug = env.step(action)
41 | train_steps += 1
42 | ep_rewards += reward
43 | # modified reward to speed up learning
44 | reward = 0.1 if not done else -1
45 | # learn and Update net parameters
46 | agent.learn(state, action, reward, next_state, done)
47 |
48 | state = next_state
49 | if done:
50 | break
51 | steps_history.append(train_steps)
52 | if not rewards_history:
53 | rewards_history.append(ep_rewards)
54 | else:
55 | rewards_history.append(
56 | rewards_history[-1] * 0.9 + ep_rewards * 0.1)
57 |
58 | # decay epsilon
59 | if agent.epsilon > args.final_epsilon:
60 | agent.epsilon -= (args.init_epsilon - args.final_epsilon) / args.max_ep
61 |
62 | # evaluate during training
63 | if ep % args.log_every == args.log_every-1:
64 | total_reward = 0
65 | for i in range(args.test_ep):
66 | state = env.reset()
67 | for j in range(env.spec.max_episode_steps):
68 | action = agent.sample_action(state, policy='greedy')
69 | state, reward, done, _ = env.step(action)
70 | total_reward += reward
71 | if done:
72 | break
73 | current_mean_rewards = total_reward / args.test_ep
74 | print('Episode: %d Average Reward: %.2f' %
75 | (ep + 1, current_mean_rewards))
76 | # save model if current model outperform the old one
77 | if best_mean_rewards is None or (current_mean_rewards >= best_mean_rewards):
78 | best_mean_rewards = current_mean_rewards
79 | if not os.path.isdir(args.save_path):
80 | os.makedirs(args.save_path)
81 | save_name = args.save_path + str(round(best_mean_rewards, 2)) \
82 | + '_' + str(ep_base + ep + 1)
83 | saver.save(agent.sess, save_name)
84 | print('Model saved %s' % save_name)
85 |
86 | plt.plot(steps_history, rewards_history)
87 | plt.xlabel('steps')
88 | plt.ylabel('running avg rewards')
89 | plt.show()
90 |
91 |
92 | def set_random_seed(seed):
93 | np.random.seed(seed)
94 | tf.set_random_seed(seed)
95 |
96 |
97 | if __name__ == '__main__':
98 | parser = argparse.ArgumentParser()
99 | parser.add_argument(
100 | '--model_path', default=None,
101 | help='Whether to use a saved model. (*None|model path)')
102 | parser.add_argument(
103 | '--save_path', default='./models/',
104 | help='Path to save a model during training.')
105 | parser.add_argument(
106 | '--double_q', default=True, help='enable or disable double dqn')
107 | parser.add_argument(
108 | '--log_every', default=500, help='Log and save model every x episodes')
109 | parser.add_argument(
110 | '--gpu', default=-1,
111 | help='running on a specify gpu, -1 indicates using cpu')
112 | parser.add_argument(
113 | '--seed', default=31, help='random seed')
114 |
115 | parser.add_argument(
116 | '--max_ep', type=int, default=2000, help='Number of training episodes')
117 | parser.add_argument(
118 | '--test_ep', type=int, default=50, help='Number of test episodes')
119 | parser.add_argument(
120 | '--init_epsilon', type=float, default=0.75, help='initial epsilon')
121 | parser.add_argument(
122 | '--final_epsilon', type=float, default=0.2, help='final epsilon')
123 | parser.add_argument(
124 | '--buffer_size', type=int, default=50000, help='Size of memory buffer')
125 | parser.add_argument(
126 | '--lr', type=float, default=1e-4, help='Learning rate')
127 | parser.add_argument(
128 | '--batch_size', type=int, default=128, help='Size of training batch')
129 | parser.add_argument(
130 | '--gamma', type=float, default=0.99, help='Discounted factor')
131 | parser.add_argument(
132 | '--target_network_update', type=int, default=1000,
133 | help='update frequency of target network.')
134 | main(parser.parse_args())
135 |
--------------------------------------------------------------------------------
/algorithms/PG/agent.py:
--------------------------------------------------------------------------------
1 | from collections import deque
2 | import random
3 |
4 | import numpy as np
5 | import torch
6 | import torch.distributions as tdist
7 | import torch.nn as nn
8 | import torch.nn.functional as F
9 | import torch.optim as optim
10 |
11 |
12 | class Backbone(nn.Module):
13 |
14 | def __init__(self, in_dim):
15 | super().__init__()
16 | self.fc1 = nn.Linear(in_dim, 400)
17 | self.fc2 = nn.Linear(400, 200)
18 |
19 | self.bn1 = nn.BatchNorm1d(400)
20 | self.bn2 = nn.BatchNorm1d(200)
21 |
22 |
23 | class PolicyNet(Backbone):
24 |
25 | def __init__(self, in_dim, out_dim):
26 | super().__init__(in_dim)
27 | self.fc3 = nn.Linear(200, 100)
28 | self.fc4 = nn.Linear(100, out_dim)
29 |
30 | self.logstd = torch.tensor(np.ones(out_dim, dtype=np.float32),
31 |                            requires_grad=True).to(args.device)
42 |
43 | def get_dist(self, x):
44 | x = F.relu(self.fc1(x))
45 | x = self.bn1(x)
46 | x = F.relu(self.fc2(x))
47 | x = self.bn2(x)
48 | x = F.relu(self.fc3(x))
49 | mean = self.fc4(x)
50 | std = torch.exp(self.logstd)
51 | dist = tdist.Normal(mean, std)
52 | return dist
53 |
54 | def forward(self, x):
55 | dist = self.get_dist(x)
56 | out = dist.sample()
57 | logp = dist.log_prob(out)
58 | return logp, out
69 |
70 | def get_logp(self, s, a):
71 | dist = self.get_dist(s)
72 | logp = dist.log_prob(a)
73 | return logp
74 |
75 | class ValueNet(Backbone):
76 |
77 | def __init__(self, in_dim):
78 | super().__init__(in_dim)
79 | self.fc3 = nn.Linear(200, 10)
84 | self.fc4 = nn.Linear(10, 1)
85 |
86 | def forward(self, x):
87 | x = F.relu(self.fc1(x))
88 | x = self.bn1(x)
89 | x = F.relu(self.fc2(x))
90 | x = self.bn2(x)
91 | x = F.relu(self.fc3(x))
96 | x = self.fc4(x)
97 | return x
98 |
99 |
100 | class VanillaPG:
101 |
102 | def __init__(self, env, args_):
103 | global args
104 | args = args_
105 | self.obs_space = env.observation_space
106 | self.act_space = env.action_space
107 |
108 | self._build_nets()
109 |
110 | def step(self, obs):
111 | if obs.ndim == 1:
112 | obs = obs.reshape((1, -1))
113 | obs = torch.tensor(obs).to(args.device)
114 | self.policy.eval()
115 | logp, out = self.policy(obs)
116 | out = out[0].detach().cpu().numpy()
117 | return logp, out
118 |
119 | def train(self, traj):
120 | obs, acts, rewards, logp, next_obs = np.array(traj).T
121 |
122 | obs, next_obs = np.stack(obs), np.stack(next_obs)
123 | obs_combine = np.concatenate([obs, next_obs], axis=0)
124 | obs_combine = torch.tensor(obs_combine).to(args.device)
125 |
126 | self.value_func.eval()
127 | vs_combine = self.value_func(obs_combine)
128 | vs_combine_np = vs_combine.cpu().detach().numpy().flatten()
129 |
130 | vs_np = vs_combine_np[:len(vs_combine)//2]
131 | next_vs_np = vs_combine_np[len(vs_combine)//2:]
132 | vs = vs_combine[:len(vs_combine)//2]
133 |
134 | logp = torch.cat(logp.tolist())
135 |
136 | # calculate return estimations
137 | phi = self._calculate_phi(rewards, vs_np, next_vs_np)
138 |
139 | # update policy parameters
140 | self.policy.train()
141 | self.policy_optim.zero_grad()
142 | logp.backward(phi)
143 | self.policy_optim.step()
144 |
145 | # update value functions
146 | self.value_func.train()
147 | rwd_to_go = [np.sum(rewards[i:]) for i in range(len(rewards))]
148 | target = torch.tensor(rwd_to_go).view((-1, 1)).to(args.device)
149 | vf_loss = F.mse_loss(vs, target)
150 | print("vf mse: %.4f" % vf_loss)
151 | self.vf_optim.zero_grad()
152 | vf_loss.backward()
153 | self.vf_optim.step()
154 |
155 | def _build_nets(self):
156 | policy = PolicyNet(in_dim=self.obs_space.shape[0],
157 | out_dim=self.act_space.shape[0])
158 | value_func = ValueNet(in_dim=self.obs_space.shape[0])
159 | self.policy = policy.to(args.device)
160 | self.value_func = value_func.to(args.device)
161 |
162 | self.policy_optim = optim.Adam(self.policy.parameters(), lr=args.lr)
163 | self.vf_optim = optim.Adam(self.value_func.parameters(), lr=args.lr)
164 |
165 | def _discount(self, arr, alpha=0.99):
166 | discount_arr, factor = [], 1.0
167 | for a in arr:
168 | discount_arr.append(factor * a)  # element i gets weight alpha ** i
169 | factor *= alpha
170 | return discount_arr
171 |
172 | def _calculate_phi(self, rewards, vs, next_vs):
173 | # option1: raw returns
174 | raw_returns = [np.sum(rewards) for i in range(len(rewards))]
175 | # option2: rewards to go
176 | rwd_to_go = [np.sum(rewards[i:]) for i in range(len(rewards))]
177 | # option3: discounted rewards to go
178 | disc_rwd_to_go = [np.sum(self._discount(rewards[i:]))
179 | for i in range(len(rewards))]
180 | # option4: td0 estimation
181 | td0 = rewards + next_vs
182 | # subtract baseline
183 | baseline = vs
184 |
185 | phi = disc_rwd_to_go - vs
186 | phi = np.array(phi).reshape((-1, 1))
187 | phi = np.tile(phi, (1, 4))
188 | phi = torch.tensor(phi).to(args.device)
189 | return phi
190 |
191 |
192 | class OffPolicyPG(VanillaPG):
193 |
194 | def __init__(self, env, args_):
195 | super().__init__(env, args_)
196 | self.memory = deque(maxlen=10000)
197 |
198 | def train(self, traj):
199 | obs, acts, rewards, logp, next_obs = np.array(traj).T
200 | disc_rwd_to_go = np.array([np.sum(self._discount(rewards[i:]))
201 | for i in range(len(rewards))])
202 | for i in range(len(obs)):
203 | data_tuple = [obs[i], acts[i], rewards[i], disc_rwd_to_go[i],
204 | logp[i], next_obs[i]]
205 | self.memory.append(data_tuple)
206 |
207 | if len(self.memory) < 10000:
208 | print("reply memory not warm up")
209 | return
210 | traj = random.sample(self.memory, 1024)
211 | obs, acts, rewards, disc_rwd, logp, next_obs = np.array(traj).T
212 |
213 | disc_rwd = disc_rwd.astype(np.float32)
214 | acts = np.stack(acts)
215 | obs, next_obs = np.stack(obs), np.stack(next_obs)
216 | obs_combine = np.concatenate([obs, next_obs], axis=0)
217 | obs_combine = torch.tensor(obs_combine).to(args.device)
218 |
219 | self.value_func.eval()
220 | vs_combine = self.value_func(obs_combine)
221 | vs_combine_np = vs_combine.cpu().detach().numpy().flatten()
222 |
223 | vs_np = vs_combine_np[:len(vs_combine)//2]
224 | next_vs_np = vs_combine_np[len(vs_combine)//2:]
225 | vs = vs_combine[:len(vs_combine)//2]
226 |
227 | old_logp = torch.cat(logp.tolist())
228 | new_logp = self.policy.get_logp(torch.tensor(obs).to(args.device),
229 | torch.tensor(acts).to(args.device))
230 | ratio = torch.exp(new_logp - old_logp)
231 | ratio = torch.clamp(ratio, 0.9, 1.1)
232 |
233 | # calculate return estimations
234 | phi = disc_rwd - vs_np
235 | phi = np.array(phi).reshape((-1, 1))
236 | phi = np.tile(phi, (1, 4))
237 | phi = torch.tensor(phi).to(args.device)
238 |
239 | # update policy parameters
240 | self.policy.train()
241 | self.policy_optim.zero_grad()
242 | new_logp.backward(phi * ratio)
243 | self.policy_optim.step()
244 |
245 | # update value functions
246 | self.value_func.train()
247 | target = torch.tensor(disc_rwd).view((-1, 1)).to(args.device)
248 | vf_loss = F.mse_loss(vs, target)
249 | print("vf mse: %.4f" % vf_loss)
250 | self.vf_optim.zero_grad()
251 | vf_loss.backward()
252 | self.vf_optim.step()
253 |
--------------------------------------------------------------------------------
/algorithms/PG/run.py:
--------------------------------------------------------------------------------
1 | import argparse
2 |
3 | import numpy as np
4 | import gym
5 | import torch
6 |
7 | from agent import VanillaPG
8 | from agent import OffPolicyPG
9 |
10 |
11 | def off_policy_run(env, args):
12 | agent = OffPolicyPG(env, args)
13 | global_steps = 0
14 | for ep in range(args.num_ep):
15 | rollouts, ep_steps, ep_rewards = run_episode(env, agent)
16 | global_steps += ep_steps
17 | agent.train(rollouts)
18 | ep_avg_rewards = np.mean(ep_rewards)
19 | print("Ep %d reward: %.4f ep_steps: %d global_steps: %d" %
20 | (ep, ep_avg_rewards, ep_steps, global_steps))
21 |
22 |
23 | def on_policy_run(env, args):
24 | agent = VanillaPG(env, args)
25 | global_steps = 0
26 | for ep in range(args.num_ep):
27 | rollouts, ep_steps, ep_rewards = run_episode(env, agent)
28 | global_steps += ep_steps
29 | agent.train(rollouts)
30 | ep_avg_rewards = np.mean(ep_rewards)
31 | print("Ep %d reward: %.4f ep_steps: %d global_steps: %d" %
32 | (ep, ep_avg_rewards, ep_steps, global_steps))
33 |
46 |
47 | def main(args):
48 | # device
49 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
50 | args.device = device
51 | # seed
52 | np.random.seed(args.seed)
53 | torch.manual_seed(args.seed)
54 | # env and agent
55 | task_name = "BipedalWalker-v2"
56 | env = gym.make(task_name)
57 | env.seed(args.seed)
58 | # run
59 | # on_policy_run(env, args)
60 | off_policy_run(env, args)
61 |
62 |
63 | def run_episode(env, agent):
64 | obs = env.reset()
65 | obs = preprocess(obs)
66 | ep_rewards, rollouts = [], []
67 | ep_steps = 0
68 | while True:
69 | logp, action = agent.step(obs)
70 | next_obs, reward, done, _ = env.step(action)
71 | ep_rewards.append(reward)
72 | next_obs = preprocess(next_obs)
73 | rollouts.append([obs, action, reward, logp, next_obs])
74 | obs = next_obs
75 | ep_steps += 1
76 | if done:
77 | break
78 | return rollouts, ep_steps, ep_rewards
79 |
80 |
81 | def preprocess(obs):
82 | return obs.astype(np.float32)
83 |
84 |
85 | if __name__ == '__main__':
86 | parser = argparse.ArgumentParser()
87 | parser.add_argument("--num_ep", type=int, default=5000)
88 | parser.add_argument("--lr", type=float, default=1e-2)
89 | parser.add_argument("--seed", type=int, default=31)
90 | parser.add_argument("--gpu", action="store_true")
91 | main(parser.parse_args())
92 |
--------------------------------------------------------------------------------
/algorithms/PG/sync.sh:
--------------------------------------------------------------------------------
1 | rsync -avzP --delete ~/workspace/self/reinforce_py/ workpc:~/workspace/reinforce_py
2 |
--------------------------------------------------------------------------------
/algorithms/PPO/README.md:
--------------------------------------------------------------------------------
1 | ### Proximal Policy Optimization (PPO)
2 |
3 | Implementation of the PPO method proposed by OpenAI.
4 |
5 | PPO is a trust-region-style policy optimization method: instead of enforcing TRPO's hard KL constraint, it optimizes a clipped surrogate objective (the variant implemented here) or adds a KL penalty to the objective. A NumPy sketch of the clipped objective is given at the end of this README.
6 |
7 | Related papers:
8 | - [PPO - J Schulman et al.](https://arxiv.org/abs/1707.06347)
9 | - [TRPO - J Schulman et al.](https://arxiv.org/abs/1502.05477)
10 |
11 | ### Requirements
12 | - Python 3.x
13 | - Tensorflow 1.3.0
14 | - gym 0.9.4
15 |
16 | ### Run
17 |     python3 train_PPO.py  # -h to show available arguments
18 |
19 | ### Results
20 |
21 | Smoothed rewards over 1M steps: see `images/ppo_score.png`.
22 |
23 | Policy loss and value function loss during training: see `images/ppo_losses.png`.
24 |
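25 | ### Clipped objective (sketch)
26 |
27 | A minimal NumPy sketch of the clipped surrogate objective that `agent.py` builds with TensorFlow (the function and variable names below are illustrative, not taken from the code):
28 |
29 | ```
30 | import numpy as np
31 |
32 | def clipped_surrogate_loss(new_logp, old_logp, adv, clip_range=0.2):
33 |     # probability ratio pi_new(a|s) / pi_old(a|s), recovered from log-probs
34 |     ratio = np.exp(new_logp - old_logp)
35 |     unclipped = ratio * adv
36 |     clipped = np.clip(ratio, 1.0 - clip_range, 1.0 + clip_range) * adv
37 |     # PPO maximizes the element-wise minimum of the two terms;
38 |     # negating the mean turns it into a loss to minimize
39 |     return -np.mean(np.minimum(unclipped, clipped))
40 | ```
41 |
42 | `agent.py` expresses the same objective with negative log-probabilities, taking the maximum of the two negated terms.
43 |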
--------------------------------------------------------------------------------
/algorithms/PPO/agent.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import tensorflow as tf
3 |
4 | from config import args, tf_config
5 | from distributions import make_pd_type
6 | import utils as U
7 |
8 |
9 | class Policy:
10 |
11 | def __init__(self, ob_space, ac_space, batch, n_steps, reuse):
12 | ob_dim = (batch,) + ob_space.shape
13 | act_dim = ac_space.shape[0]
14 | self.ph_obs = tf.placeholder(tf.float32, ob_dim, name='ph_obs')
15 |
16 | with tf.variable_scope('policy', reuse=reuse):
17 | h1 = U.fc(self.ph_obs, out_dim=64, activation_fn=tf.nn.tanh,
18 | init_scale=np.sqrt(2), scope='pi_fc1')
19 | h2 = U.fc(h1, out_dim=64, activation_fn=tf.nn.tanh,
20 | init_scale=np.sqrt(2), scope='pi_fc2')
21 | pi = U.fc(h2, out_dim=act_dim, activation_fn=None, init_scale=0.01,
22 | scope='pi')
23 | h1 = U.fc(self.ph_obs, out_dim=64, activation_fn=tf.nn.tanh,
24 | init_scale=np.sqrt(2), scope='vf_fc1')
25 | h2 = U.fc(h1, out_dim=64, activation_fn=tf.nn.tanh,
26 | init_scale=np.sqrt(2), scope='vf_fc2')
27 | vf = U.fc(h2, out_dim=1, activation_fn=None, scope='vf')[:, 0]
28 | logstd = tf.get_variable(name='logstd', shape=[1, act_dim],
29 | initializer=tf.zeros_initializer())
30 |             # concatenate action means and log-stds
31 | pd_params = tf.concat([pi, pi * 0.0 + logstd], axis=1)
32 | self.pd_type = make_pd_type(ac_space)
33 | self.pd = self.pd_type.pdfromflat(pd_params)
34 |
35 | self.a_out = self.pd.sample()
36 | self.neglogp = self.pd.get_neglogp(self.a_out)
37 |
38 | self.v_out = vf
39 | self.pi = pi
40 |
41 |
42 | class PPO:
43 |
44 | def __init__(self, env):
45 | self.sess = tf.Session(config=tf_config)
46 | ob_space = env.observation_space
47 | ac_space = env.action_space
48 | self.act_policy = Policy(ob_space, ac_space, env.num_envs,
49 | n_steps=1, reuse=False)
50 | self.train_policy = Policy(ob_space, ac_space, args.minibatch,
51 | n_steps=args.batch_steps, reuse=True)
52 | self._build_train()
53 | self.sess.run(tf.global_variables_initializer())
54 |
55 | def _build_train(self):
56 | # build placeholders
57 | self.ph_obs_train = self.train_policy.ph_obs
58 | self.ph_a = self.train_policy.pd_type.get_action_placeholder([None])
59 | self.ph_adv = tf.placeholder(tf.float32, [None])
60 | self.ph_r = tf.placeholder(tf.float32, [None])
61 | self.ph_old_neglogp = tf.placeholder(tf.float32, [None])
62 | self.ph_old_v = tf.placeholder(tf.float32, [None])
63 | self.ph_lr = tf.placeholder(tf.float32, [])
64 | self.ph_clip_range = tf.placeholder(tf.float32, [])
65 |
66 | # build losses
67 | self.neglogp = self.train_policy.pd.get_neglogp(self.ph_a)
68 | self.entropy = tf.reduce_mean(self.train_policy.pd.get_entropy())
69 | v = self.train_policy.v_out
70 | v_clipped = self.ph_old_v + tf.clip_by_value(
71 | v - self.ph_old_v, -self.ph_clip_range, self.ph_clip_range)
72 | v_loss1 = tf.square(v - self.ph_r)
73 | v_loss2 = tf.square(v_clipped - self.ph_r)
74 | self.v_loss = 0.5 * tf.reduce_mean(tf.maximum(v_loss1, v_loss2))
75 |
76 | # ratio = tf.exp(self.ph_old_neglogp - self.neglogp)
77 | old_p = tf.exp(-self.ph_old_neglogp)
78 | new_p = tf.exp(-self.neglogp)
79 | ratio = new_p / old_p
80 | pg_loss1 = -self.ph_adv * ratio
81 | pg_loss2 = -self.ph_adv * tf.clip_by_value(
82 | ratio, 1.0 - self.ph_clip_range, 1.0 + self.ph_clip_range)
83 | self.pg_loss = tf.reduce_mean(tf.maximum(pg_loss1, pg_loss2))
84 | loss = self.pg_loss + args.v_coef * self.v_loss - \
85 | args.entropy_coef * self.entropy
86 |
87 | # build train operation
88 | params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'policy')
89 | grads = tf.gradients(loss, params)
90 | if args.max_grad_norm is not None:
91 | grads, _grad_norm = tf.clip_by_global_norm(
92 | grads, args.max_grad_norm)
93 | grads = list(zip(grads, params))
94 | self.train_op = tf.train.AdamOptimizer(
95 | learning_rate=self.ph_lr, epsilon=1e-5).apply_gradients(grads)
96 |
97 | # train info
98 | self.approxkl = 0.5 * tf.reduce_mean(
99 | tf.square(self.neglogp - self.ph_old_neglogp))
100 | self.clip_frac = tf.reduce_mean(
101 | tf.to_float(tf.greater(tf.abs(ratio - 1.0), self.ph_clip_range)))
102 | self.avg_ratio = tf.reduce_mean(ratio)
103 |
104 | def step(self, obs, *_args, **_kwargs):
105 | feed_dict = {self.act_policy.ph_obs: obs}
106 | a, v, neglogp = self.sess.run(
107 | [self.act_policy.a_out,
108 | self.act_policy.v_out,
109 | self.act_policy.neglogp],
110 | feed_dict=feed_dict)
111 | return a, v, neglogp
112 |
113 | def get_value(self, obs, *_args, **_kwargs):
114 | feed_dict = {self.act_policy.ph_obs: obs}
115 | return self.sess.run(self.act_policy.v_out, feed_dict=feed_dict)
116 |
117 | def train(self, lr, clip_range, obs, returns, masks, actions, values, neglogps,
118 | advs):
119 | # advs = returns - values
120 | advs = (advs - advs.mean()) / (advs.std() + 1e-8)
121 | feed_dict = {self.ph_obs_train: obs, self.ph_a: actions,
122 | self.ph_adv: advs, self.ph_r: returns,
123 | self.ph_old_neglogp: neglogps, self.ph_old_v: values,
124 | self.ph_lr: lr,
125 | self.ph_clip_range: clip_range}
126 | self.loss_names = ['loss_policy', 'loss_value', 'avg_ratio', 'policy_entropy',
127 | 'approxkl', 'clipfrac']
128 | return self.sess.run(
129 | [self.pg_loss, self.v_loss, self.avg_ratio, self.entropy,
130 | self.approxkl, self.clip_frac, self.train_op],
131 | feed_dict=feed_dict)[:-1]
132 |
133 |
--------------------------------------------------------------------------------
/algorithms/PPO/config.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | import tensorflow as tf
3 |
4 |
5 | parser = argparse.ArgumentParser()
6 | # Global arguments
7 | parser.add_argument('--env', help='environment ID', default='Walker2d-v1')
8 | parser.add_argument('--seed', help='RNG seed', type=int, default=931022)
9 | parser.add_argument('--save_interval', type=int, default=0)
10 | parser.add_argument('--log_interval', type=int, default=1)
11 | parser.add_argument('--n_envs', type=int, default=1)
12 | parser.add_argument('--n_steps', type=int, default=int(1e6))
13 |
14 | # Hyperparameters
15 | parser.add_argument('--batch_steps', type=int, default=2048)
16 | parser.add_argument('--minibatch', type=int, default=64)
17 | parser.add_argument('--n_epochs', type=int, default=10)
18 | parser.add_argument('--entropy_coef', type=float, default=0.0)
19 | parser.add_argument('--v_coef', type=float, default=0.5)
20 | parser.add_argument('--max_grad_norm', type=float, default=0.5)
21 | parser.add_argument('--lam', type=float, default=0.95)
22 | parser.add_argument('--gamma', type=float, default=0.99)
23 | parser.add_argument('--lr', type=float, default=3e-4)
24 | parser.add_argument('--clip_range', type=float, default=0.2)
25 |
26 | args = parser.parse_args()
27 |
28 | # TensorFlow session configuration
29 | tf_config = tf.ConfigProto(allow_soft_placement=True)
30 | tf_config.gpu_options.allow_growth = True
31 |
--------------------------------------------------------------------------------
/algorithms/PPO/distributions.py:
--------------------------------------------------------------------------------
1 | import tensorflow as tf
2 | import numpy as np
3 | import utils as U
4 | from tensorflow.python.ops import math_ops
5 |
6 |
7 | class Pd(object):
8 | """
9 | A particular probability distribution
10 | """
11 |
12 | def get_flatparam(self):
13 | raise NotImplementedError
14 |
15 | def get_mode(self):
16 | raise NotImplementedError
17 |
18 | def get_neglogp(self, x):
19 | # Usually it's easier to define the negative logprob
20 | raise NotImplementedError
21 |
22 | def get_kl(self, other):
23 | raise NotImplementedError
24 |
25 | def get_entropy(self):
26 | raise NotImplementedError
27 |
28 | def sample(self):
29 | raise NotImplementedError
30 |
31 | def logp(self, x):
32 | return - self.get_neglogp(x)
33 |
34 |
35 | class PdType(object):
36 | """
37 | Parametrized family of probability distributions
38 | """
39 |
40 | def pdclass(self):
41 | raise NotImplementedError
42 |
43 | def pdfromflat(self, flat):
44 | return self.pdclass()(flat)
45 |
46 | def param_shape(self):
47 | raise NotImplementedError
48 |
49 | def action_shape(self):
50 | raise NotImplementedError
51 |
52 | def action_dtype(self):
53 | raise NotImplementedError
54 |
55 | def param_placeholder(self, prepend_shape, name=None):
56 | return tf.placeholder(dtype=tf.float32, shape=prepend_shape + self.param_shape(), name=name)
57 |
58 | def get_action_placeholder(self, prepend_shape, name=None):
59 | return tf.placeholder(dtype=self.action_dtype(), shape=prepend_shape + self.action_shape(), name=name)
60 |
61 |
62 | class CategoricalPdType(PdType):
63 | def __init__(self, ncat):
64 | self.ncat = ncat
65 |
66 | def pdclass(self):
67 | return CategoricalPd
68 |
69 | def param_shape(self):
70 | return [self.ncat]
71 |
72 | def action_shape(self):
73 | return []
74 |
75 | def action_dtype(self):
76 | return tf.int32
77 |
78 |
79 | class MultiCategoricalPdType(PdType):
80 | def __init__(self, low, high):
81 | self.low = low
82 | self.high = high
83 | self.ncats = high - low + 1
84 |
85 | def pdclass(self):
86 | return MultiCategoricalPd
87 |
88 | def pdfromflat(self, flat):
89 | return MultiCategoricalPd(self.low, self.high, flat)
90 |
91 | def param_shape(self):
92 | return [sum(self.ncats)]
93 |
94 | def action_shape(self):
95 | return [len(self.ncats)]
96 |
97 | def action_dtype(self):
98 | return tf.int32
99 |
100 |
101 | class DiagGaussianPdType(PdType):
102 | def __init__(self, size):
103 | self.size = size
104 |
105 | def pdclass(self):
106 | return DiagGaussianPd
107 |
108 | def param_shape(self):
109 | return [2 * self.size]
110 |
111 | def action_shape(self):
112 | return [self.size]
113 |
114 | def action_dtype(self):
115 | return tf.float32
116 |
117 |
118 | class BernoulliPdType(PdType):
119 | def __init__(self, size):
120 | self.size = size
121 |
122 | def pdclass(self):
123 | return BernoulliPd
124 |
125 | def param_shape(self):
126 | return [self.size]
127 |
128 | def action_shape(self):
129 | return [self.size]
130 |
131 | def action_dtype(self):
132 | return tf.int32
133 |
134 |
135 | class CategoricalPd(Pd):
136 | def __init__(self, logits):
137 | self.logits = logits
138 |
139 | def get_flatparam(self):
140 | return self.logits
141 |
142 | def get_mode(self):
143 | return U.argmax(self.logits, axis=-1)
144 |
145 | def get_neglogp(self, x):
146 | # return tf.nn.sparse_softmax_cross_entropy_with_logits(logits=self.logits, labels=x)
147 | # Note: we can't use sparse_softmax_cross_entropy_with_logits because
148 | # the implementation does not allow second-order derivatives...
149 | one_hot_actions = tf.one_hot(x, self.logits.get_shape().as_list()[-1])
150 | return tf.nn.softmax_cross_entropy_with_logits(
151 | logits=self.logits,
152 | labels=one_hot_actions)
153 |
154 | def get_kl(self, other):
155 | a0 = self.logits - U.max(self.logits, axis=-1, keepdims=True)
156 | a1 = other.logits - U.max(other.logits, axis=-1, keepdims=True)
157 | ea0 = tf.exp(a0)
158 | ea1 = tf.exp(a1)
159 | z0 = U.sum(ea0, axis=-1, keepdims=True)
160 | z1 = U.sum(ea1, axis=-1, keepdims=True)
161 | p0 = ea0 / z0
162 | return U.sum(p0 * (a0 - tf.log(z0) - a1 + tf.log(z1)), axis=-1)
163 |
164 | def get_entropy(self):
165 | a0 = self.logits - U.max(self.logits, axis=-1, keepdims=True)
166 | ea0 = tf.exp(a0)
167 | z0 = U.sum(ea0, axis=-1, keepdims=True)
168 | p0 = ea0 / z0
169 | return U.sum(p0 * (tf.log(z0) - a0), axis=-1)
170 |
171 | def sample(self):
172 | u = tf.random_uniform(tf.shape(self.logits))
173 | return tf.argmax(self.logits - tf.log(-tf.log(u)), axis=-1)
174 |
175 | @classmethod
176 | def fromflat(cls, flat):
177 | return cls(flat)
178 |
179 |
180 | class MultiCategoricalPd(Pd):
181 | def __init__(self, low, high, flat):
182 | self.flat = flat
183 | self.low = tf.constant(low, dtype=tf.int32)
184 | self.categoricals = list(map(CategoricalPd, tf.split(
185 | flat, high - low + 1, axis=len(flat.get_shape()) - 1)))
186 |
187 | def get_flatparam(self):
188 | return self.flat
189 |
190 | def get_mode(self):
191 | return self.low + tf.cast(tf.stack([p.get_mode() for p in self.categoricals], axis=-1), tf.int32)
192 |
193 | def get_neglogp(self, x):
194 | return tf.add_n([p.get_neglogp(px) for p, px in zip(self.categoricals, tf.unstack(x - self.low, axis=len(x.get_shape()) - 1))])
195 |
196 | def get_kl(self, other):
197 | return tf.add_n([
198 | p.get_kl(q) for p, q in zip(self.categoricals, other.categoricals)
199 | ])
200 |
201 | def get_entropy(self):
202 | return tf.add_n([p.get_entropy() for p in self.categoricals])
203 |
204 | def sample(self):
205 | return self.low + tf.cast(tf.stack([p.sample() for p in self.categoricals], axis=-1), tf.int32)
206 |
207 | @classmethod
208 | def fromflat(cls, flat):
209 | raise NotImplementedError
210 |
211 |
212 | class DiagGaussianPd(Pd):
213 | def __init__(self, flat):
214 | self.flat = flat
215 | mean, logstd = tf.split(axis=len(flat.shape) - 1, num_or_size_splits=2, value=flat)
216 | self.mean = mean
217 | self.logstd = logstd
218 | self.std = tf.exp(logstd)
219 |
220 | def get_flatparam(self):
221 | return self.flat
222 |
223 | def get_mode(self):
224 | return self.mean
225 |
226 | def get_neglogp(self, x):
227 | return 0.5 * U.sum(tf.square((x - self.mean) / self.std), axis=-1) \
228 | + 0.5 * np.log(2.0 * np.pi) * tf.to_float(tf.shape(x)[-1]) \
229 | + U.sum(self.logstd, axis=-1)
230 |
231 | def get_kl(self, other):
232 | assert isinstance(other, DiagGaussianPd)
233 | return U.sum(other.logstd - self.logstd + (tf.square(self.std) + tf.square(self.mean - other.mean)) / (2.0 * tf.square(other.std)) - 0.5, axis=-1)
234 |
235 | def get_entropy(self):
236 | return U.sum(self.logstd + .5 * np.log(2.0 * np.pi * np.e), axis=-1)
237 |
238 | def sample(self):
239 | return self.mean + self.std * tf.random_normal(tf.shape(self.mean))
240 |
241 | @classmethod
242 | def fromflat(cls, flat):
243 | return cls(flat)
244 |
245 |
246 | class BernoulliPd(Pd):
247 | def __init__(self, logits):
248 | self.logits = logits
249 | self.ps = tf.sigmoid(logits)
250 |
251 | def get_flatparam(self):
252 | return self.logits
253 |
254 | def get_mode(self):
255 | return tf.round(self.ps)
256 |
257 | def get_neglogp(self, x):
258 | return U.sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=self.logits, labels=tf.to_float(x)), axis=-1)
259 |
260 | def get_kl(self, other):
261 | return U.sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=other.logits, labels=self.ps), axis=-1) - U.sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=self.logits, labels=self.ps), axis=-1)
262 |
263 | def get_entropy(self):
264 | return U.sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=self.logits, labels=self.ps), axis=-1)
265 |
266 | def sample(self):
267 | u = tf.random_uniform(tf.shape(self.ps))
268 | return tf.to_float(math_ops.less(u, self.ps))
269 |
270 | @classmethod
271 | def fromflat(cls, flat):
272 | return cls(flat)
273 |
274 |
275 | def make_pd_type(ac_space):
276 | from gym import spaces
277 | if isinstance(ac_space, spaces.Box):
278 | assert len(ac_space.shape) == 1
279 | return DiagGaussianPdType(ac_space.shape[0])
280 | elif isinstance(ac_space, spaces.Discrete):
281 | return CategoricalPdType(ac_space.n)
282 | elif isinstance(ac_space, spaces.MultiDiscrete):
283 | return MultiCategoricalPdType(ac_space.low, ac_space.high)
284 | elif isinstance(ac_space, spaces.MultiBinary):
285 | return BernoulliPdType(ac_space.n)
286 | else:
287 | raise NotImplementedError
288 |
--------------------------------------------------------------------------------
/algorithms/PPO/env_wrapper.py:
--------------------------------------------------------------------------------
1 | import time
2 | import csv
3 | import json
4 | import gym
5 | from gym.core import Wrapper
6 | import os.path as osp
7 | import numpy as np
8 |
9 | from utils import RunningMeanStd
10 |
11 |
12 | class BaseVecEnv(object):
13 | """
14 | Vectorized environment base class
15 | """
16 |
17 | def step(self, vac):
18 | """
19 | Apply sequence of actions to sequence of environments
20 | actions -> (observations, rewards, dones)
21 | """
22 | raise NotImplementedError
23 |
24 | def reset(self):
25 | """
26 | Reset all environments
27 | """
28 | raise NotImplementedError
29 |
30 | def close(self):
31 | pass
32 |
33 | def set_random_seed(self, seed):
34 | raise NotImplementedError
35 |
36 |
37 | class VecEnv(BaseVecEnv):
38 | def __init__(self, env_fns):
39 | self.envs = [fn() for fn in env_fns]
40 | env = self.envs[0]
41 | self.action_space = env.action_space
42 | self.observation_space = env.observation_space
43 | self.ts = np.zeros(len(self.envs), dtype='int')
44 |
45 | def step(self, action_n):
46 | results = [env.step(a) for (a, env) in zip(action_n, self.envs)]
47 | obs, rews, dones, infos = map(np.array, zip(*results))
48 | self.ts += 1
49 | for (i, done) in enumerate(dones):
50 | if done:
51 | obs[i] = self.envs[i].reset()
52 | self.ts[i] = 0
53 | return np.array(obs), np.array(rews), np.array(dones), infos
54 |
55 | def reset(self):
56 | results = [env.reset() for env in self.envs]
57 | return np.array(results)
58 |
59 | def render(self):
60 | self.envs[0].render()
61 |
62 | @property
63 | def num_envs(self):
64 | return len(self.envs)
65 |
66 |
67 | class VecEnvNorm(BaseVecEnv):
68 |
69 | def __init__(self, venv, ob=True, ret=True,
70 | clipob=10., cliprew=10., gamma=0.99, epsilon=1e-8):
71 | self.venv = venv
72 | self._ob_space = venv.observation_space
73 | self._ac_space = venv.action_space
74 | self.ob_rms = RunningMeanStd(shape=self._ob_space.shape) if ob else None
75 | self.ret_rms = RunningMeanStd(shape=()) if ret else None
76 | self.clipob = clipob
77 | self.cliprew = cliprew
78 | self.ret = np.zeros(self.num_envs)
79 | self.gamma = gamma
80 | self.epsilon = epsilon
81 |
82 | def step(self, vac):
83 | obs, rews, news, infos = self.venv.step(vac)
84 | self.ret = self.ret * self.gamma + rews
85 | # normalize observations
86 | obs = self._norm_ob(obs)
87 | # normalize rewards
88 | if self.ret_rms:
89 | self.ret_rms.update(self.ret)
90 | rews = np.clip(rews / np.sqrt(self.ret_rms.var + self.epsilon),
91 | -self.cliprew, self.cliprew)
92 | return obs, rews, news, infos
93 |
94 | def _norm_ob(self, obs):
95 | if self.ob_rms:
96 | self.ob_rms.update(obs)
97 | obs = np.clip(
98 | (obs - self.ob_rms.mean) / np.sqrt(self.ob_rms.var + self.epsilon),
99 | -self.clipob, self.clipob)
100 | return obs
101 | else:
102 | return obs
103 |
104 | def reset(self):
105 | obs = self.venv.reset()
106 | return self._norm_ob(obs)
107 |
108 | def set_random_seed(self, seeds):
109 | for env, seed in zip(self.venv.envs, seeds):
110 | env.seed(int(seed))
111 |
112 | @property
113 | def action_space(self):
114 | return self._ac_space
115 |
116 | @property
117 | def observation_space(self):
118 | return self._ob_space
119 |
120 | def close(self):
121 | self.venv.close()
122 |
123 | def render(self):
124 | self.venv.render()
125 |
126 | @property
127 | def num_envs(self):
128 | return self.venv.num_envs
129 |
130 |
131 | class Monitor(Wrapper):
132 | EXT = "monitor.csv"
133 | f = None
134 |
135 | def __init__(self, env, filename, allow_early_resets=False, reset_keywords=()):
136 | Wrapper.__init__(self, env=env)
137 | self.tstart = time.time()
138 | if filename is None:
139 | self.f = None
140 | self.logger = None
141 | else:
142 | if not filename.endswith(Monitor.EXT):
143 | if osp.isdir(filename):
144 | filename = osp.join(filename, Monitor.EXT)
145 | else:
146 | filename = filename + "." + Monitor.EXT
147 | self.f = open(filename, "wt")
148 | self.f.write('#%s\n'%json.dumps({"t_start": self.tstart, "gym_version": gym.__version__,
149 | "env_id": env.spec.id if env.spec else 'Unknown'}))
150 | self.logger = csv.DictWriter(self.f, fieldnames=('r', 'l', 't')+reset_keywords)
151 | self.logger.writeheader()
152 |
153 | self.reset_keywords = reset_keywords
154 | self.allow_early_resets = allow_early_resets
155 | self.rewards = None
156 | self.needs_reset = True
157 | self.episode_rewards = []
158 | self.episode_lengths = []
159 | self.total_steps = 0
160 | self.current_reset_info = {} # extra info about the current episode, that was passed in during reset()
161 |
162 | def _reset(self, **kwargs):
163 | if not self.allow_early_resets and not self.needs_reset:
164 | raise RuntimeError("Tried to reset an environment before done. If you want to allow early resets, wrap your env with Monitor(env, path, allow_early_resets=True)")
165 | self.rewards = []
166 | self.needs_reset = False
167 | for k in self.reset_keywords:
168 | v = kwargs.get(k)
169 | if v is None:
170 | raise ValueError('Expected you to pass kwarg %s into reset'%k)
171 | self.current_reset_info[k] = v
172 | return self.env.reset(**kwargs)
173 |
174 | def _step(self, action):
175 | if self.needs_reset:
176 | raise RuntimeError("Tried to step environment that needs reset")
177 | ob, rew, done, info = self.env.step(action)
178 | self.rewards.append(rew)
179 | if done:
180 | self.needs_reset = True
181 | eprew = sum(self.rewards)
182 | eplen = len(self.rewards)
183 | epinfo = {"r": round(eprew, 6), "l": eplen, "t": round(time.time() - self.tstart, 6)}
184 | epinfo.update(self.current_reset_info)
185 | if self.logger:
186 | self.logger.writerow(epinfo)
187 | self.f.flush()
188 | self.episode_rewards.append(eprew)
189 | self.episode_lengths.append(eplen)
190 | info['episode'] = epinfo
191 | self.total_steps += 1
192 | return (ob, rew, done, info)
193 |
194 | def close(self):
195 | if self.f is not None:
196 | self.f.close()
197 |
198 | def get_total_steps(self):
199 | return self.total_steps
200 |
201 | def get_episode_rewards(self):
202 | return self.episode_rewards
203 |
204 | def get_episode_lengths(self):
205 | return self.episode_lengths
206 |
207 |
208 | def make_env():
209 |     # config.args holds the parsed CLI flags and logger the output directory;
210 |     # both modules live alongside this file
211 |     from config import args
212 |     import logger
213 |
214 |     def env_fn():
215 |         env = gym.make(args.env)
216 |         env = Monitor(env, logger.get_dir())
217 |         return env
218 |     env = VecEnv([env_fn] * args.n_envs)
219 |     env = VecEnvNorm(env)
220 |     return env
221 |
--------------------------------------------------------------------------------
/algorithms/PPO/logger.py:
--------------------------------------------------------------------------------
1 | import os
2 | import sys
3 | import shutil
4 | import json
5 | import time
6 | import datetime
7 | import tempfile
8 | from mpi4py import MPI
9 |
10 | LOG_OUTPUT_FORMATS = ['stdout', 'log', 'csv']
11 | # Also valid: json, tensorboard
12 |
13 | DEBUG = 10
14 | INFO = 20
15 | WARN = 30
16 | ERROR = 40
17 |
18 | DISABLED = 50
19 |
20 |
21 | class KVWriter(object):
22 | def writekvs(self, kvs):
23 | raise NotImplementedError
24 |
25 |
26 | class SeqWriter(object):
27 | def writeseq(self, seq):
28 | raise NotImplementedError
29 |
30 |
31 | class HumanOutputFormat(KVWriter, SeqWriter):
32 | def __init__(self, filename_or_file):
33 | if isinstance(filename_or_file, str):
34 | self.file = open(filename_or_file, 'wt')
35 | self.own_file = True
36 | else:
37 | assert hasattr(filename_or_file,
38 | 'read'), 'expected file or str, got %s' % filename_or_file
39 | self.file = filename_or_file
40 | self.own_file = False
41 |
42 | def writekvs(self, kvs):
43 | # Create strings for printing
44 | key2str = {}
45 | for (key, val) in sorted(kvs.items()):
46 | if isinstance(val, float):
47 | valstr = '%-8.3g' % (val,)
48 | else:
49 | valstr = str(val)
50 | key2str[self._truncate(key)] = self._truncate(valstr)
51 |
52 | # Find max widths
53 | if len(key2str) == 0:
54 | print('WARNING: tried to write empty key-value dict')
55 | return
56 | else:
57 | keywidth = max(map(len, key2str.keys()))
58 | valwidth = max(map(len, key2str.values()))
59 |
60 | # Write out the data
61 | dashes = '-' * (keywidth + valwidth + 7)
62 | lines = [dashes]
63 | for (key, val) in sorted(key2str.items()):
64 | lines.append('| %s%s | %s%s |' % (
65 | key,
66 | ' ' * (keywidth - len(key)),
67 | val,
68 | ' ' * (valwidth - len(val)),
69 | ))
70 | lines.append(dashes)
71 | self.file.write('\n'.join(lines) + '\n')
72 |
73 | # Flush the output to the file
74 | self.file.flush()
75 |
76 | def _truncate(self, s):
77 | return s[:20] + '...' if len(s) > 23 else s
78 |
79 | def writeseq(self, seq):
80 | for arg in seq:
81 | self.file.write(arg)
82 | self.file.write('\n')
83 | self.file.flush()
84 |
85 | def close(self):
86 | if self.own_file:
87 | self.file.close()
88 |
89 |
90 | class JSONOutputFormat(KVWriter):
91 | def __init__(self, filename):
92 | self.file = open(filename, 'wt')
93 |
94 | def writekvs(self, kvs):
95 | for k, v in sorted(kvs.items()):
96 | if hasattr(v, 'dtype'):
97 | v = v.tolist()
98 | kvs[k] = float(v)
99 | self.file.write(json.dumps(kvs) + '\n')
100 | self.file.flush()
101 |
102 | def close(self):
103 | self.file.close()
104 |
105 |
106 | class CSVOutputFormat(KVWriter):
107 | def __init__(self, filename):
108 | self.file = open(filename, 'w+t')
109 | self.keys = []
110 | self.sep = ','
111 |
112 | def writekvs(self, kvs):
113 | # Add our current row to the history
114 | extra_keys = kvs.keys() - self.keys
115 | if extra_keys:
116 | self.keys.extend(extra_keys)
117 | self.file.seek(0)
118 | lines = self.file.readlines()
119 | self.file.seek(0)
120 | for (i, k) in enumerate(self.keys):
121 | if i > 0:
122 | self.file.write(',')
123 | self.file.write(k)
124 | self.file.write('\n')
125 | for line in lines[1:]:
126 | self.file.write(line[:-1])
127 | self.file.write(self.sep * len(extra_keys))
128 | self.file.write('\n')
129 | for (i, k) in enumerate(self.keys):
130 | if i > 0:
131 | self.file.write(',')
132 | v = kvs.get(k)
133 |             if v is not None:  # write zeros too; only skip missing keys
134 | self.file.write(str(v))
135 | self.file.write('\n')
136 | self.file.flush()
137 |
138 | def close(self):
139 | self.file.close()
140 |
141 |
142 | class TensorBoardOutputFormat(KVWriter):
143 | """
144 | Dumps key/value pairs into TensorBoard's numeric format.
145 | """
146 |
147 | def __init__(self, dir):
148 | os.makedirs(dir, exist_ok=True)
149 | self.dir = dir
150 | self.step = 1
151 | prefix = 'events'
152 | path = os.path.join(os.path.abspath(dir), prefix)
153 | import tensorflow as tf
154 | from tensorflow.python import pywrap_tensorflow
155 | from tensorflow.core.util import event_pb2
156 | from tensorflow.python.util import compat
157 | self.tf = tf
158 | self.event_pb2 = event_pb2
159 | self.pywrap_tensorflow = pywrap_tensorflow
160 | self.writer = pywrap_tensorflow.EventsWriter(compat.as_bytes(path))
161 |
162 | def writekvs(self, kvs):
163 | def summary_val(k, v):
164 | kwargs = {'tag': k, 'simple_value': float(v)}
165 | return self.tf.Summary.Value(**kwargs)
166 | summary = self.tf.Summary(value=[summary_val(k, v) for k, v in kvs.items()])
167 | event = self.event_pb2.Event(wall_time=time.time(), summary=summary)
168 | event.step = self.step # is there any reason why you'd want to specify the step?
169 | self.writer.WriteEvent(event)
170 | self.writer.Flush()
171 | self.step += 1
172 |
173 | def close(self):
174 | if self.writer:
175 | self.writer.Close()
176 | self.writer = None
177 |
178 |
179 | def make_output_format(format, ev_dir):
180 | os.makedirs(ev_dir, exist_ok=True)
181 | rank = MPI.COMM_WORLD.Get_rank()
182 | if format == 'stdout':
183 | return HumanOutputFormat(sys.stdout)
184 | elif format == 'log':
185 | suffix = "" if rank == 0 else ("-mpi%03i" % rank)
186 | return HumanOutputFormat(os.path.join(ev_dir, 'log%s.txt' % suffix))
187 | elif format == 'json':
188 | assert rank == 0
189 | return JSONOutputFormat(os.path.join(ev_dir, 'progress.json'))
190 | elif format == 'csv':
191 | assert rank == 0
192 | return CSVOutputFormat(os.path.join(ev_dir, 'progress.csv'))
193 | elif format == 'tensorboard':
194 | assert rank == 0
195 | return TensorBoardOutputFormat(os.path.join(ev_dir, 'tb'))
196 | else:
197 | raise ValueError('Unknown format specified: %s' % (format,))
198 |
199 |
200 | # ================================================================
201 | # API
202 | # ================================================================
203 | def logkv(key, val):
204 | """
205 | Log a value of some diagnostic
206 | Call this once for each diagnostic quantity, each iteration
207 | """
208 | Logger.CURRENT.logkv(key, val)
209 |
210 |
211 | def logkvs(d):
212 | """
213 | Log a dictionary of key-value pairs
214 | """
215 | for (k, v) in d.items():
216 | logkv(k, v)
217 |
218 |
219 | def dumpkvs():
220 | """
221 | Write all of the diagnostics from the current iteration
222 |
223 | level: int. (see logger.py docs) If the global logger level is higher than
224 | the level argument here, don't print to stdout.
225 | """
226 | Logger.CURRENT.dumpkvs()
227 |
228 |
229 | def getkvs():
230 | return Logger.CURRENT.name2val
231 |
232 |
233 | def log(*args, level=INFO):
234 | """
235 | Write the sequence of args, with no separators, to the console and output files (if you've configured an output file).
236 | """
237 | Logger.CURRENT.log(*args, level=level)
238 |
239 |
240 | def debug(*args):
241 | log(*args, level=DEBUG)
242 |
243 |
244 | def info(*args):
245 | log(*args, level=INFO)
246 |
247 |
248 | def warn(*args):
249 | log(*args, level=WARN)
250 |
251 |
252 | def error(*args):
253 | log(*args, level=ERROR)
254 |
255 |
256 | def set_level(level):
257 | """
258 | Set logging threshold on current logger.
259 | """
260 | Logger.CURRENT.set_level(level)
261 |
262 |
263 | def get_dir():
264 | """
265 | Get directory that log files are being written to.
266 | will be None if there is no output directory (i.e., if you didn't call start)
267 | """
268 | return Logger.CURRENT.get_dir()
269 |
270 |
271 | record_tabular = logkv
272 | dump_tabular = dumpkvs
273 |
274 |
275 | # ================================================================
276 | # Backend
277 | # ================================================================
278 | class Logger(object):
279 | DEFAULT = None # A logger with no output files. (See right below class definition)
280 | # So that you can still log to the terminal without setting up any output files
281 | CURRENT = None # Current logger being used by the free functions above
282 |
283 | def __init__(self, dir, output_formats):
284 | self.name2val = {} # values this iteration
285 | self.level = INFO
286 | self.dir = dir
287 | self.output_formats = output_formats
288 |
289 | # Logging API, forwarded
290 | # ----------------------------------------
291 | def logkv(self, key, val):
292 | self.name2val[key] = val
293 |
294 | def dumpkvs(self):
295 | if self.level == DISABLED:
296 | return
297 | for fmt in self.output_formats:
298 | if isinstance(fmt, KVWriter):
299 | fmt.writekvs(self.name2val)
300 | self.name2val.clear()
301 |
302 | def log(self, *args, level=INFO):
303 | if self.level <= level:
304 | self._do_log(args)
305 |
306 | # Configuration
307 | # ----------------------------------------
308 | def set_level(self, level):
309 | self.level = level
310 |
311 | def get_dir(self):
312 | return self.dir
313 |
314 | def close(self):
315 | for fmt in self.output_formats:
316 | fmt.close()
317 |
318 | # Misc
319 | # ----------------------------------------
320 | def _do_log(self, args):
321 | for fmt in self.output_formats:
322 | if isinstance(fmt, SeqWriter):
323 | fmt.writeseq(map(str, args))
324 |
325 |
326 | Logger.DEFAULT = Logger.CURRENT = Logger(dir=None, output_formats=[HumanOutputFormat(sys.stdout)])
327 |
328 |
329 | def configure(dir=None, format_strs=None):
330 | if dir is None:
331 | dir = os.getenv('OPENAI_LOGDIR')
332 | if dir is None:
333 | dir = os.path.join(
334 | tempfile.gettempdir(),
335 | datetime.datetime.now().strftime("openai-%Y-%m-%d-%H-%M-%S-%f"))
336 | assert isinstance(dir, str)
337 | os.makedirs(dir, exist_ok=True)
338 |
339 | if format_strs is None:
340 | strs = os.getenv('OPENAI_LOG_FORMAT')
341 | format_strs = strs.split(',') if strs else LOG_OUTPUT_FORMATS
342 | output_formats = [make_output_format(f, dir) for f in format_strs]
343 |
344 | Logger.CURRENT = Logger(dir=dir, output_formats=output_formats)
345 | log('Logging to %s' % dir)
346 |
347 |
348 | def reset():
349 | if Logger.CURRENT is not Logger.DEFAULT:
350 | Logger.CURRENT.close()
351 | Logger.CURRENT = Logger.DEFAULT
352 | log('Reset logger')
353 |
354 |
355 | class scoped_configure(object):
356 | def __init__(self, dir=None, format_strs=None):
357 | self.dir = dir
358 | self.format_strs = format_strs
359 | self.prevlogger = None
360 |
361 | def __enter__(self):
362 | self.prevlogger = Logger.CURRENT
363 | configure(dir=self.dir, format_strs=self.format_strs)
364 |
365 | def __exit__(self, *args):
366 | Logger.CURRENT.close()
367 | Logger.CURRENT = self.prevlogger
368 |
369 |
370 | def _demo():
371 | info("hi")
372 | debug("shouldn't appear")
373 | set_level(DEBUG)
374 | debug("should appear")
375 | dir = "/tmp/testlogging"
376 | if os.path.exists(dir):
377 | shutil.rmtree(dir)
378 | configure(dir=dir)
379 | logkv("a", 3)
380 | logkv("b", 2.5)
381 | dumpkvs()
382 | logkv("b", -2.5)
383 | logkv("a", 5.5)
384 | dumpkvs()
385 | info("^^^ should see a = 5.5")
386 |
387 | logkv("b", -2.5)
388 | dumpkvs()
389 |
390 |     logkv("a", "a-very-long-value-that-should-be-truncated-in-the-output")
391 | dumpkvs()
392 |
393 |
394 | # ================================================================
395 | # Readers
396 | # ================================================================
397 | def read_json(fname):
398 | import pandas
399 | ds = []
400 | with open(fname, 'rt') as fh:
401 | for line in fh:
402 | ds.append(json.loads(line))
403 | return pandas.DataFrame(ds)
404 |
405 |
406 | def read_csv(fname):
407 | import pandas
408 | return pandas.read_csv(fname, index_col=None, comment='#')
409 |
410 |
411 | def read_tb(path):
412 | """
413 | path : a tensorboard file OR a directory, where we will find all TB files
414 | of the form events.*
415 | """
416 | import pandas
417 | import numpy as np
418 | from glob import glob
419 | from collections import defaultdict
420 | import tensorflow as tf
421 | if os.path.isdir(path):
422 | fnames = glob(os.path.join(path, "events.*"))
423 | elif os.path.basename(path).startswith("events."):
424 | fnames = [path]
425 | else:
426 | raise NotImplementedError(
427 | "Expected tensorboard file or directory containing them. Got %s" % path)
428 | tag2pairs = defaultdict(list)
429 | maxstep = 0
430 | for fname in fnames:
431 | for summary in tf.train.summary_iterator(fname):
432 | if summary.step > 0:
433 | for v in summary.summary.value:
434 | pair = (summary.step, v.simple_value)
435 | tag2pairs[v.tag].append(pair)
436 | maxstep = max(summary.step, maxstep)
437 | data = np.empty((maxstep, len(tag2pairs)))
438 | data[:] = np.nan
439 | tags = sorted(tag2pairs.keys())
440 | for (colidx, tag) in enumerate(tags):
441 | pairs = tag2pairs[tag]
442 | for (step, value) in pairs:
443 | data[step - 1, colidx] = value
444 | return pandas.DataFrame(data, columns=tags)
445 |
446 |
447 | if __name__ == "__main__":
448 | _demo()
449 |
--------------------------------------------------------------------------------
/algorithms/PPO/train_PPO.py:
--------------------------------------------------------------------------------
1 | import os
2 | import time
3 | import logger
4 | import random
5 | import tensorflow as tf
6 | import gym
7 | import numpy as np
8 | from collections import deque
9 |
10 | from config import args
11 | from utils import set_global_seeds, sf01, explained_variance
12 | from agent import PPO
13 | from env_wrapper import make_env
14 |
15 |
16 | def main():
17 | env = make_env()
18 | set_global_seeds(env, args.seed)
19 |
20 | agent = PPO(env=env)
21 |
22 | batch_steps = args.n_envs * args.batch_steps # number of steps per update
23 |
24 | if args.save_interval and logger.get_dir():
25 | # some saving jobs
26 | pass
27 |
28 | ep_info_buffer = deque(maxlen=100)
29 | t_train_start = time.time()
30 | n_updates = args.n_steps // batch_steps
31 | runner = Runner(env, agent)
32 |
33 | for update in range(1, n_updates + 1):
34 | t_start = time.time()
35 | frac = 1.0 - (update - 1.0) / n_updates
36 | lr_now = args.lr # maybe dynamic change
37 | clip_range_now = args.clip_range # maybe dynamic change
38 | obs, returns, masks, acts, vals, neglogps, advs, rewards, ep_infos = \
39 | runner.run(args.batch_steps, frac)
40 | ep_info_buffer.extend(ep_infos)
41 | loss_infos = []
42 |
43 | idxs = np.arange(batch_steps)
44 | for _ in range(args.n_epochs):
45 | np.random.shuffle(idxs)
46 | for start in range(0, batch_steps, args.minibatch):
47 | end = start + args.minibatch
48 | mb_idxs = idxs[start: end]
49 | minibatch = [arr[mb_idxs] for arr in [obs, returns, masks, acts, vals, neglogps, advs]]
50 | loss_infos.append(agent.train(lr_now, clip_range_now, *minibatch))
51 |
52 | t_now = time.time()
53 | time_this_batch = t_now - t_start
54 | if update % args.log_interval == 0:
55 | ev = float(explained_variance(vals, returns))
56 | logger.logkv('updates', str(update) + '/' + str(n_updates))
57 | logger.logkv('serial_steps', update * args.batch_steps)
58 | logger.logkv('total_steps', update * batch_steps)
59 | logger.logkv('time', time_this_batch)
60 | logger.logkv('fps', int(batch_steps / (t_now - t_start)))
61 | logger.logkv('total_time', t_now - t_train_start)
62 | logger.logkv("explained_variance", ev)
63 | logger.logkv('avg_reward', np.mean([e['r'] for e in ep_info_buffer]))
64 | logger.logkv('avg_ep_len', np.mean([e['l'] for e in ep_info_buffer]))
65 | logger.logkv('adv_mean', np.mean(returns - vals))
66 | logger.logkv('adv_variance', np.std(returns - vals)**2)
67 | loss_infos = np.mean(loss_infos, axis=0)
68 | for loss_name, loss_info in zip(agent.loss_names, loss_infos):
69 | logger.logkv(loss_name, loss_info)
70 | logger.dumpkvs()
71 |
72 | if args.save_interval and update % args.save_interval == 0 and logger.get_dir():
73 | pass
74 | env.close()
75 |
76 |
77 | class Runner:
78 |
79 | def __init__(self, env, agent):
80 | self.env = env
81 | self.agent = agent
82 | self.obs = np.zeros((args.n_envs,) + env.observation_space.shape, dtype=np.float32)
83 | self.obs[:] = env.reset()
84 | self.dones = [False for _ in range(args.n_envs)]
85 |
86 | def run(self, batch_steps, frac):
87 | b_obs, b_rewards, b_actions, b_values, b_dones, b_neglogps = [], [], [], [], [], []
88 | ep_infos = []
89 |
90 | for s in range(batch_steps):
91 | actions, values, neglogps = self.agent.step(self.obs, self.dones)
92 | b_obs.append(self.obs.copy())
93 | b_actions.append(actions)
94 | b_values.append(values)
95 | b_neglogps.append(neglogps)
96 | b_dones.append(self.dones)
97 | self.obs[:], rewards, self.dones, infos = self.env.step(actions)
98 | for info in infos:
99 | maybeinfo = info.get('episode')
100 | if maybeinfo:
101 | ep_infos.append(maybeinfo)
102 | b_rewards.append(rewards)
103 | # batch of steps to batch of rollouts
104 | b_obs = np.asarray(b_obs, dtype=self.obs.dtype)
105 | b_rewards = np.asarray(b_rewards, dtype=np.float32)
106 | b_actions = np.asarray(b_actions)
107 | b_values = np.asarray(b_values, dtype=np.float32)
108 | b_neglogps = np.asarray(b_neglogps, dtype=np.float32)
109 | b_dones = np.asarray(b_dones, dtype=np.bool)
110 | last_values = self.agent.get_value(self.obs, self.dones)
111 |
112 | b_returns = np.zeros_like(b_rewards)
113 | b_advs = np.zeros_like(b_rewards)
114 |         lastgaelam = 0  # GAE(lambda): A_t = delta_t + gamma * lam * A_{t+1}, masked at episode ends
115 | for t in reversed(range(batch_steps)):
116 | if t == batch_steps - 1:
117 | mask = 1.0 - self.dones
118 | nextvalues = last_values
119 | else:
120 | mask = 1.0 - b_dones[t + 1]
121 | nextvalues = b_values[t + 1]
122 | delta = b_rewards[t] + args.gamma * nextvalues * mask - b_values[t]
123 | b_advs[t] = lastgaelam = delta + args.gamma * args.lam * mask * lastgaelam
124 | b_returns = b_advs + b_values
125 |
126 | return (*map(sf01, (b_obs, b_returns, b_dones, b_actions, b_values, b_neglogps, b_advs, b_rewards)), ep_infos)
127 |
128 |
129 | if __name__ == '__main__':
130 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
131 | logger.configure()
132 | main()
133 |
--------------------------------------------------------------------------------
/algorithms/PPO/utils.py:
--------------------------------------------------------------------------------
1 | import scipy.signal
2 | import numpy as np
3 | import random
4 | import tensorflow as tf
5 |
6 |
7 | def set_global_seeds(env, seed):
8 | tf.set_random_seed(seed)
9 | np.random.seed(seed)
10 | random.seed(seed)
11 | env_seeds = np.random.randint(low=0, high=1e6, size=env.num_envs)
12 | env.set_random_seed(env_seeds)
13 |
14 |
15 | class RunningMeanStd(object):
16 |
17 | def __init__(self, epsilon=1e-4, shape=()):
18 | self.mean = np.zeros(shape, 'float64')
19 | self.var = np.ones(shape, 'float64')
20 | self.count = epsilon
21 |
22 |     def update(self, x):  # fold batch statistics into the running mean/var (parallel variance update)
23 | batch_mean = np.mean(x, axis=0)
24 | batch_var = np.var(x, axis=0)
25 | batch_count = x.shape[0]
26 |
27 | delta = batch_mean - self.mean
28 | tot_count = self.count + batch_count
29 |
30 | new_mean = self.mean + delta * batch_count / tot_count
31 | m_a = self.var * (self.count)
32 | m_b = batch_var * (batch_count)
33 | M2 = m_a + m_b + np.square(delta) * self.count * batch_count / (self.count + batch_count)
34 | new_var = M2 / (self.count + batch_count)
35 |
36 | new_count = batch_count + self.count
37 |
38 | self.mean = new_mean
39 | self.var = new_var
40 | self.count = new_count
41 |
42 |
43 | def sf01(arr):
44 | """
45 | swap and then flatten axes 0 and 1
46 | """
47 | s = arr.shape
48 | return arr.swapaxes(0, 1).reshape(s[0] * s[1], *s[2:])
49 |
50 |
51 | def discount(x, gamma):
52 | return scipy.signal.lfilter([1.0], [1.0, -gamma], x[::-1], axis=0)[::-1]
53 |
54 |
55 | # ================================================================
56 | # Network components
57 | # ================================================================
58 | def ortho_init(scale=1.0):
59 | def _ortho_init(shape, dtype, partition_info=None):
60 | shape = tuple(shape)
61 | if len(shape) == 2:
62 | flat_shape = shape
63 | elif len(shape) == 4:
64 | flat_shape = (np.prod(shape[:-1]), shape[-1])
65 | else:
66 | raise NotImplementedError
67 |
68 | a = np.random.normal(0.0, 1.0, flat_shape)
69 | u, _, v = np.linalg.svd(a, full_matrices=False)
70 | q = u if u.shape == flat_shape else v
71 | q = q.reshape(shape)
72 | return (scale * q[:shape[0], :shape[1]]).astype(np.float32)
73 | return _ortho_init
74 |
75 |
76 | def fc(x, out_dim, activation_fn=tf.nn.relu, init_scale=1.0, scope=''):
77 | with tf.variable_scope(scope):
78 | in_dim = x.get_shape()[1].value
79 | w = tf.get_variable('w', [in_dim, out_dim], initializer=ortho_init(init_scale))
80 | b = tf.get_variable('b', [out_dim], initializer=tf.constant_initializer(0.0))
81 | z = tf.matmul(x, w) + b
82 | h = activation_fn(z) if activation_fn else z
83 | return h
84 |
85 | # ================================================================
86 | # Tensorflow math utils
87 | # ================================================================
88 | clip = tf.clip_by_value
89 |
90 | def sum(x, axis=None, keepdims=False):
91 | axis = None if axis is None else [axis]
92 | return tf.reduce_sum(x, axis=axis, keep_dims=keepdims)
93 |
94 | def mean(x, axis=None, keepdims=False):
95 | axis = None if axis is None else [axis]
96 | return tf.reduce_mean(x, axis=axis, keep_dims=keepdims)
97 |
98 | def var(x, axis=None, keepdims=False):
99 | meanx = mean(x, axis=axis, keepdims=keepdims)
100 | return mean(tf.square(x - meanx), axis=axis, keepdims=keepdims)
101 |
102 | def std(x, axis=None, keepdims=False):
103 | return tf.sqrt(var(x, axis=axis, keepdims=keepdims))
104 |
105 | def max(x, axis=None, keepdims=False):
106 | axis = None if axis is None else [axis]
107 | return tf.reduce_max(x, axis=axis, keep_dims=keepdims)
108 |
109 | def min(x, axis=None, keepdims=False):
110 | axis = None if axis is None else [axis]
111 | return tf.reduce_min(x, axis=axis, keep_dims=keepdims)
112 |
113 | def concatenate(arrs, axis=0):
114 | return tf.concat(axis=axis, values=arrs)
115 |
116 | def argmax(x, axis=None):
117 | return tf.argmax(x, axis=axis)
118 |
119 | def switch(condition, then_expression, else_expression):
120 | """Switches between two operations depending on a scalar value (int or bool).
121 | Note that both `then_expression` and `else_expression`
122 | should be symbolic tensors of the *same shape*.
123 |
124 | # Arguments
125 | condition: scalar tensor.
126 | then_expression: TensorFlow operation.
127 | else_expression: TensorFlow operation.
128 | """
129 |     x_shape = then_expression.get_shape()  # static shape of both branches (they must match)
130 | x = tf.cond(tf.cast(condition, 'bool'),
131 | lambda: then_expression,
132 | lambda: else_expression)
133 | x.set_shape(x_shape)
134 | return x
135 |
136 | # ================================================================
137 | # Math utils
138 | # ================================================================
139 | def explained_variance(pred_y, y):
140 | """
141 | Computes fraction of variance that pred_y explains about y.
142 | Returns 1 - Var[y-pred_y] / Var[y]
143 |
144 | Interpretation:
145 | ev=0 => might as well have predicted zero
146 | ev=1 => perfect prediction
147 | ev<0 => worse than just predicting zero
148 |
149 | """
150 | assert y.ndim == 1 and pred_y.ndim == 1
151 | var_y = np.var(y)
152 | return np.nan if var_y == 0 else 1 - np.var(y - pred_y) / var_y
153 |
--------------------------------------------------------------------------------
/algorithms/REINFORCE/README.md:
--------------------------------------------------------------------------------
1 |
2 | ## REINFORCE
3 | REINFORCE belongs to the family of Policy Gradient methods, which directly parameterize the policy rather than a state-value function.
4 | For more details about REINFORCE and other policy gradient algorithms, refer to Chapter 13 of [Reinforcement Learning: An Introduction 2nd Edition](http://webdocs.cs.ualberta.ca/~sutton/book/the-book.html).
5 |
6 | Here we use REINFORCE to solve the Atari game Pong. A sketch of the return computation behind the update is given at the end of this README.
7 |
8 | ## Pong
9 | Pong is an Atari game in which the player controls one paddle (the other is controlled by a decent built-in AI) and tries to bounce the ball past the opponent. In the reinforcement learning setting, the state is the raw screen pixels and the actions move the paddle UP or DOWN.
10 |
11 | A game screen is shown in `images/pong.png`.
12 |
13 |
14 | ## Requirements
15 | * [Numpy](http://www.numpy.org/)
16 | * [Tensorflow](http://www.tensorflow.org)
17 | * [gym](https://gym.openai.com)
18 |
19 | ## Run
20 | python train_REINFORCE.py
21 |
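22 | ## Return computation (sketch)
23 |
24 | The REINFORCE update scales the log-likelihood gradient of each taken action by the discounted return that followed it. Below is a NumPy sketch of the return computation, loosely mirroring `reward_discount` in `agent.py` (names are illustrative, and the `1e-8` guard is an extra numerical safeguard not present in the original):
25 |
26 | ```
27 | import numpy as np
28 |
29 | def discounted_returns(rewards, gamma=0.99):
30 |     returns = np.zeros(len(rewards))
31 |     running = 0.0
32 |     for t in reversed(range(len(rewards))):
33 |         if rewards[t] != 0:
34 |             running = 0.0  # in Pong, a non-zero reward marks a game boundary
35 |         running = rewards[t] + gamma * running
36 |         returns[t] = running
37 |     # standardize so the gradient scale stays comparable across episodes
38 |     return (returns - returns.mean()) / (returns.std() + 1e-8)
39 | ```
40 |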
--------------------------------------------------------------------------------
/algorithms/REINFORCE/agent.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import tensorflow as tf
3 |
4 |
5 | class REINFORCE:
6 |
7 | def __init__(self, input_dim, hidden_units, action_dim):
8 | self.input_dim = input_dim
9 | self.hidden_units = hidden_units
10 | self.action_dim = action_dim
11 | self.gamma = 0.99
12 | self.max_gradient = 5
13 |
14 | self.state_buffer = []
15 | self.reward_buffer = []
16 | self.action_buffer = []
17 |
18 | @staticmethod
19 | def get_session(device):
20 | if device == -1: # use CPU
21 | device = '/cpu:0'
22 | sess_config = tf.ConfigProto()
23 | else: # use GPU
24 | device = '/gpu:' + str(device)
25 | sess_config = tf.ConfigProto(
26 | log_device_placement=True,
27 | allow_soft_placement=True)
28 | sess_config.gpu_options.allow_growth = True
29 | sess = tf.Session(config=sess_config)
30 | return sess, device
31 |
32 | def construct_model(self, gpu):
33 | self.sess, device = self.get_session(gpu)
34 |
35 | with tf.device(device):
36 | # construct network
37 | self.input_state = tf.placeholder(
38 | tf.float32, [None, self.input_dim])
39 | w1 = tf.Variable(tf.div(tf.random_normal(
40 | [self.input_dim, self.hidden_units]),
41 | np.sqrt(self.input_dim)))
42 | b1 = tf.Variable(tf.constant(0.0, shape=[self.hidden_units]))
43 | h1 = tf.nn.relu(tf.matmul(self.input_state, w1) + b1)
44 | w2 = tf.Variable(tf.div(
45 | tf.random_normal([self.hidden_units, self.action_dim]),
46 | np.sqrt(self.hidden_units)))
47 | b2 = tf.Variable(tf.constant(0.0, shape=[self.action_dim]))
48 |
49 | self.logp = tf.matmul(h1, w2) + b2
50 |
51 | self.discounted_rewards = tf.placeholder(tf.float32, [None, ])
52 | self.taken_actions = tf.placeholder(tf.int32, [None, ])
53 |
54 | # optimizer
55 | self.optimizer = tf.train.RMSPropOptimizer(
56 | learning_rate=1e-4, decay=0.99)
57 | # loss
58 | self.loss = tf.reduce_mean(
59 | tf.nn.sparse_softmax_cross_entropy_with_logits(
60 | logits=self.logp, labels=self.taken_actions))
61 | # gradient
62 | self.gradient = self.optimizer.compute_gradients(self.loss)
63 | # policy gradient
64 | for i, (grad, var) in enumerate(self.gradient):
65 | if grad is not None:
66 | pg_grad = grad * self.discounted_rewards
67 | # gradient clipping
68 | pg_grad = tf.clip_by_value(
69 | pg_grad, -self.max_gradient, self.max_gradient)
70 | self.gradient[i] = (pg_grad, var)
71 | # train operation (apply gradient)
72 | self.train_op = self.optimizer.apply_gradients(self.gradient)
73 |
74 | def sample_action(self, state):
75 |
76 | def softmax(x):
77 | max_x = np.amax(x)
78 | e = np.exp(x - max_x)
79 | return e / np.sum(e)
80 |
81 | logp = self.sess.run(self.logp, {self.input_state: state})[0]
82 |         prob = softmax(logp) - 1e-5  # small offset keeps the probs summing below 1 for np.random.multinomial
83 | action = np.argmax(np.random.multinomial(1, prob))
84 | return action
85 |
86 | def update_model(self):
87 | discounted_rewards = self.reward_discount()
88 | episode_steps = len(discounted_rewards)
89 |
90 | for s in reversed(range(episode_steps)):
91 | state = self.state_buffer[s][np.newaxis, :]
92 | action = np.array([self.action_buffer[s]])
93 | reward = np.array([discounted_rewards[s]])
94 | self.sess.run(self.train_op, {
95 | self.input_state: state,
96 | self.taken_actions: action,
97 | self.discounted_rewards: reward
98 | })
99 |
100 | # cleanup job
101 | self.state_buffer = []
102 | self.reward_buffer = []
103 | self.action_buffer = []
104 |
105 | def store_rollout(self, state, action, reward):
106 | self.action_buffer.append(action)
107 | self.reward_buffer.append(reward)
108 | self.state_buffer.append(state)
109 |
110 | def reward_discount(self):
111 | r = self.reward_buffer
112 | d_r = np.zeros_like(r)
113 | running_add = 0
114 | for t in range(len(r))[::-1]:
115 | if r[t] != 0:
116 | running_add = 0 # game boundary. reset the running add
117 | running_add = r[t] + running_add * self.gamma
118 | d_r[t] += running_add
119 | # standardize the rewards
120 | d_r -= np.mean(d_r)
121 | d_r /= np.std(d_r)
122 | return d_r
123 |
--------------------------------------------------------------------------------
/algorithms/REINFORCE/evaluation.py:
--------------------------------------------------------------------------------
1 | import argparse
2 |
3 | import gym
4 | import numpy as np
5 | import tensorflow as tf
6 | from agent import REINFORCE
7 |
8 | def main(args):
9 |
10 | def preprocess(obs):
11 | obs = obs[35:195]
12 | obs = obs[::2, ::2, 0]
13 | obs[obs == 144] = 0
14 | obs[obs == 109] = 0
15 | obs[obs != 0] = 1
16 |
17 |         return obs.astype(np.float32).ravel()
18 |
19 | INPUT_DIM = 80 * 80
20 | HIDDEN_UNITS = 200
21 | ACTION_DIM = 6
22 |
23 | # load agent
24 | agent = REINFORCE(INPUT_DIM, HIDDEN_UNITS, ACTION_DIM)
25 | agent.construct_model(args.gpu)
26 |
27 | # load model or init a new
28 | saver = tf.train.Saver(max_to_keep=1)
29 | if args.model_path is not None:
30 | # reuse saved model
31 | saver.restore(agent.sess, args.model_path)
32 | else:
33 | # build a new model
34 |         agent.sess.run(tf.global_variables_initializer())
35 |
36 | # load env
37 | env = gym.make('Pong-v0')
38 |
39 | # evaluation
40 | for ep in range(args.ep):
41 | # reset env
42 | total_rewards = 0
43 | state = env.reset()
44 |
45 | while True:
46 | env.render()
47 | # preprocess
48 | state = preprocess(state)
49 | # sample actions
50 | action = agent.sample_action(state[np.newaxis, :])
51 | # act!
52 | next_state, reward, done, _ = env.step(action)
53 | total_rewards += reward
54 | # state shift
55 | state = next_state
56 | if done:
57 | break
58 |
59 | print('Ep%s Reward: %s ' % (ep+1, total_rewards))
60 |
61 |
62 | def args_parse():
63 | parser = argparse.ArgumentParser()
64 | parser.add_argument(
65 | '--model_path', default=None,
66 | help='Whether to use a saved model. (*None|model path)')
67 |     parser.add_argument(
68 |         '--gpu', default=-1, type=int,
69 |         help='run on the specified GPU; -1 means CPU')
70 |     parser.add_argument(
71 |         '--ep', default=1, type=int, help='number of test episodes')
72 | return parser.parse_args()
73 |
74 |
75 | if __name__ == '__main__':
76 | main(args_parse())
77 |
--------------------------------------------------------------------------------
/algorithms/REINFORCE/train_REINFORCE.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | import os
3 |
4 | import gym
5 | import numpy as np
6 | import tensorflow as tf
7 |
8 | from agent import REINFORCE
9 |
10 |
11 | def main(args):
12 |
13 | def preprocess(obs):
14 | obs = obs[35:195]
15 | obs = obs[::2, ::2, 0]
16 | obs[obs == 144] = 0
17 | obs[obs == 109] = 0
18 | obs[obs != 0] = 1
19 |
20 |         return obs.astype(np.float32).ravel()
21 |
22 | MODEL_PATH = args.model_path
23 | INPUT_DIM = 80 * 80
24 | HIDDEN_UNITS = 200
25 | ACTION_DIM = 6
26 | MAX_EPISODES = 20000
27 | MAX_STEPS = 5000
28 |
29 | # load agent
30 | agent = REINFORCE(INPUT_DIM, HIDDEN_UNITS, ACTION_DIM)
31 | agent.construct_model(args.gpu)
32 |
33 | # model saver
34 | saver = tf.train.Saver(max_to_keep=1)
35 | if MODEL_PATH is not None:
36 | saver.restore(agent.sess, args.model_path)
37 | ep_base = int(args.model_path.split('_')[-1])
38 | mean_rewards = float(args.model_path.split('/')[-1].split('_')[0])
39 | else:
40 | agent.sess.run(tf.global_variables_initializer())
41 | ep_base = 0
42 | mean_rewards = None
43 |
44 | # load env
45 | env = gym.make('Pong-v0')
46 | # main loop
47 | for ep in range(MAX_EPISODES):
48 | # reset env
49 | total_rewards = 0
50 | state = env.reset()
51 |
52 | for step in range(MAX_STEPS):
53 | # preprocess
54 | state = preprocess(state)
55 | # sample actions
56 | action = agent.sample_action(state[np.newaxis, :])
57 | # act!
58 | next_state, reward, done, _ = env.step(action)
59 |
60 | total_rewards += reward
61 | agent.store_rollout(state, action, reward)
62 | # state shift
63 | state = next_state
64 |
65 | if done:
66 | break
67 |
68 | # update model per episode
69 | agent.update_model()
70 |
71 | # logging
72 | if mean_rewards is None:
73 | mean_rewards = total_rewards
74 | else:
75 | mean_rewards = 0.99 * mean_rewards + 0.01 * total_rewards
76 |         rounds = (21 - np.abs(total_rewards)) + 21  # total points played (the winner always reaches 21)
77 | average_steps = (step + 1) / rounds
78 | print('Ep%s: %d rounds \nAvg_steps: %.2f Reward: %s Avg_reward: %.4f' %
79 | (ep+1, rounds, average_steps, total_rewards, mean_rewards))
80 | if ep > 0 and ep % 100 == 0:
81 | if not os.path.isdir(args.save_path):
82 | os.makedirs(args.save_path)
83 | save_name = str(round(mean_rewards, 2)) + '_' + str(ep_base+ep+1)
84 | saver.save(agent.sess, args.save_path + save_name)
85 |
86 |
87 | def args_parse():
88 | parser = argparse.ArgumentParser()
89 | parser.add_argument(
90 | '--model_path', default=None,
91 | help='Whether to use a saved model. (*None|model path)')
92 | parser.add_argument(
93 | '--save_path', default='./model/',
94 | help='Path to save a model during training.')
95 |     parser.add_argument(
96 |         '--gpu', default=-1, type=int,
97 |         help='run on the specified GPU; -1 means CPU')
98 | return parser.parse_args()
99 |
100 |
101 | if __name__ == '__main__':
102 | main(args_parse())
103 |
--------------------------------------------------------------------------------
/algorithms/TD/README.md:
--------------------------------------------------------------------------------
1 | ### Temporal Difference
2 |
3 | Temporal Difference (TD) learning is a prediction-based reinforcement learning algorithm. It combines ideas from Monte Carlo (MC) methods and Dynamic Programming (DP): like MC it learns directly from raw experience, and like DP it bootstraps from current value estimates.
4 |
5 | For more details about TD algorithms, please refer to Chapter 6 of [Reinforcement Learning: An Introduction 2nd Edition](http://incompleteideas.net/sutton/book/the-book-2nd.html)
6 |
7 | ### GridWorld
8 |
9 | GridWorld is a classic environment for tabular reinforcement learning. It has a 10x10 state space and an action space of {up, down, left, right} for moving the agent around. The grid contains a target cell with a positive reward and several bomb cells with a -1 reward. We also charge a small cost (-0.01) for every step the agent takes, so it tends to find the optimal path faster.
10 |
11 | A typical GridWorld may look like this.
12 |
13 | ![GridWorld](../../images/gridworld.png)
14 |
15 | In the GridWorld environment, the TD agent tries to figure out the optimal (i.e., shortest) path to the target.
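
Under the hood (see `envs.py`), states are plain integer indices laid out row by row, and the agent works directly on these indices. A minimal sketch of the index-to-coordinate mapping, mirroring `get_pos` in `envs.py` (`GRID_W` below is just an illustrative constant for the number of columns):

```python
# Illustrative sketch: convert a row-major state index into (x, y) coordinates.
GRID_W = 10  # number of columns in the 10x10 grid

def state_to_xy(s):
    x = s % GRID_W    # column
    y = s // GRID_W   # row (integer division keeps it an int)
    return x, y
```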
16 |
17 |
18 | ### Run
19 |
20 | ```
21 | cd reinforce_py/algorithms/TD
22 | python train_TD.py --algorithm=qlearn/sarsa # Q-learning or SARSA
23 | ```
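
Besides `--algorithm`, `train_TD.py` also exposes `--discount` (default 0.9), `--epsilon` (default 0.3) and `--lr` (default 0.05), e.g.

```
python train_TD.py --algorithm=sarsa --epsilon=0.2 --lr=0.1
```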
24 |
--------------------------------------------------------------------------------
/algorithms/TD/agents.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import random
3 |
4 | from utils import draw_episode_steps
5 | from utils import draw_grid
6 |
7 |
8 | class TDAgent(object):
9 |
10 | def __init__(self, env, epsilon, gamma, alpha=0.1):
11 | self.env = env
12 | self.gamma = gamma
13 | self.alpha = alpha
14 | self.epsilon = epsilon # explore & exploit
15 | self.init_epsilon = epsilon
16 |
17 | self.P = np.zeros((self.env.num_s, self.env.num_a))
18 |
19 | self.V = np.zeros(self.env.num_s)
20 | self.Q = np.zeros((self.env.num_s, self.env.num_a))
21 |
22 | self.step_set = [] # store steps of each episode
23 | self.avg_step_set = [] # store average steps of each 100 episodes
24 | self.episode = 1
25 | self.step = 0
26 | self.max_episodes = 5000
27 |
28 | # initialize random policy
29 | for s in range(self.env.num_s):
30 | poss = self.env.allow_actions(s)
31 | for a in poss:
32 | self.P[s][a] = 1.0 / len(poss)
33 |
34 | self.curr_s = None
35 | self.curr_a = None
36 |
37 | def predict(self, episode=1000):
38 | for e in range(episode):
39 | curr_s = self.env.reset() # new episode
40 | while not self.env.is_terminal(curr_s): # for every time step
41 | a = self.select_action(curr_s, policy='greedy')
42 | r = self.env.rewards(curr_s, a)
43 | next_s = self.env.next_state(curr_s, a)
44 | self.V[curr_s] += self.alpha \
45 | * (r+self.gamma*self.V[next_s] - self.V[curr_s])
46 | curr_s = next_s
47 | # result display
48 | draw_grid(self.env, self, p=True, v=True, r=True)
49 |
50 | def control(self, method):
51 | assert method in ("qlearn", "sarsa")
52 |
53 | if method == "qlearn":
54 | agent = Qlearn(self.env, self.epsilon, self.gamma)
55 | else:
56 | agent = SARSA(self.env, self.epsilon, self.gamma)
57 |
58 | while agent.episode < self.max_episodes:
59 | agent.learn(agent.act())
60 |
61 | # result display
62 | draw_grid(self.env, agent, p=True, v=True, r=True)
63 | # draw episode steps
64 | draw_episode_steps(agent.avg_step_set)
65 |
66 | def update_policy(self):
67 |         # make the policy greedy w.r.t. the current Q values (ties share probability evenly)
68 | poss = self.env.allow_actions(self.curr_s)
69 | # Q values of all allowed actions
70 | qs = self.Q[self.curr_s][poss]
71 | q_maxs = [q for q in qs if q == max(qs)]
72 | # update probabilities
73 | for i, a in enumerate(poss):
74 | self.P[self.curr_s][a] = \
75 | 1.0 / len(q_maxs) if qs[i] in q_maxs else 0.0
76 |
77 | def select_action(self, state, policy='egreedy'):
78 | poss = self.env.allow_actions(state) # possible actions
79 | if policy == 'egreedy' and random.random() < self.epsilon:
80 | a = random.choice(poss)
81 | else: # greedy action
82 | pros = self.P[state][poss] # probabilities for possible actions
83 | best_a_idx = [i for i, p in enumerate(pros) if p == max(pros)]
84 | a = poss[random.choice(best_a_idx)]
85 | return a
86 |
87 |
88 | class SARSA(TDAgent):
89 |
90 | def __init__(self, env, epsilon, gamma):
91 | super(SARSA, self).__init__(env, epsilon, gamma)
92 | self.reset_episode()
93 |
94 | def act(self):
95 | s = self.env.next_state(self.curr_s, self.curr_a)
96 | a = self.select_action(s, policy='egreedy')
97 | r = self.env.rewards(self.curr_s, self.curr_a)
98 | r -= 0.01 # a bit negative reward for every step
99 | return [self.curr_s, self.curr_a, r, s, a]
100 |
101 | def learn(self, exp):
102 | s, a, r, n_s, n_a = exp
103 |
104 | if self.env.is_terminal(s):
105 | target = r
106 | else:
107 | target = r + self.gamma * self.Q[n_s][n_a]
108 | self.Q[s][a] += self.alpha * (target - self.Q[s][a])
109 |
110 | # update policy
111 | self.update_policy()
112 |
113 | if self.env.is_terminal(s):
114 | self.V = np.sum(self.Q, axis=1)
115 | print('episode %d step: %d epsilon: %f' %
116 | (self.episode, self.step, self.epsilon))
117 | self.reset_episode()
118 | self.epsilon -= self.init_epsilon / 10000
119 | # record per 100 episode
120 | if self.episode % 100 == 0:
121 | self.avg_step_set.append(
122 | np.sum(self.step_set[self.episode-100: self.episode])/100)
123 | else: # shift state-action pair
124 | self.curr_s = n_s
125 | self.curr_a = n_a
126 | self.step += 1
127 |
128 | def reset_episode(self):
129 | # start a new episode
130 | self.curr_s = self.env.reset()
131 | self.curr_a = self.select_action(self.curr_s, policy='egreedy')
132 | self.episode += 1
133 | self.step_set.append(self.step)
134 | self.step = 0
135 |
136 |
137 | class Qlearn(TDAgent):
138 |
139 | def __init__(self, env, epsilon, gamma):
140 | super(Qlearn, self).__init__(env, epsilon, gamma)
141 | self.reset_episode()
142 |
143 | def act(self):
144 | a = self.select_action(self.curr_s, policy='egreedy')
145 | s = self.env.next_state(self.curr_s, a)
146 | r = self.env.rewards(self.curr_s, a)
147 | r -= 0.01
148 | return [self.curr_s, a, r, s]
149 |
150 | def learn(self, exp):
151 | s, a, r, n_s = exp
152 |
153 |         # Q-learning update: bootstrap from the best (max-Q) action in the next state
154 | if self.env.is_terminal(s):
155 | target = r
156 | else:
157 | target = r + self.gamma * max(self.Q[n_s])
158 | self.Q[s][a] += self.alpha * (target - self.Q[s][a])
159 |
160 | self.update_policy()
161 | # shift to next state
162 | if self.env.is_terminal(s):
163 | self.V = np.sum(self.Q, axis=1)
164 | print('episode %d step: %d' % (self.episode, self.step))
165 | self.reset_episode()
166 | self.epsilon -= self.init_epsilon / self.max_episodes
167 | # record per 100 episode
168 | if self.episode % 100 == 0:
169 | self.avg_step_set.append(
170 | np.sum(self.step_set[self.episode-100: self.episode])/100)
171 | else:
172 | self.curr_s = n_s
173 | self.step += 1
174 |
175 | def reset_episode(self):
176 | self.curr_s = self.env.reset()
177 | self.episode += 1
178 | self.step_set.append(self.step)
179 | self.step = 0
180 |
--------------------------------------------------------------------------------
/algorithms/TD/envs.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 |
3 |
4 | class GridWorld:
5 |
6 | def __init__(self):
7 | self.env_w = 10
8 | self.env_h = 10
9 | self.num_s = self.env_w * self.env_h
10 | self.num_a = 4
11 |
12 | r = np.zeros(self.num_s)
13 | w = np.zeros(self.num_s)
14 |
15 | self.target = np.array([27])
16 | self.bomb = np.array([16, 25, 26, 28, 36, 40, 41, 48, 49, 64])
17 | # make some walls
18 | self.wall = np.array([22, 32, 42, 52, 43, 45, 46, 47, 37])
19 | r[self.target] = 10
20 | r[self.bomb] = -1
21 | r[self.wall] = 0
22 | w[self.wall] = 1
23 |
24 | self.W = w
25 | self.R = r # reward
26 | self.terminal = np.array(self.target)
27 |
28 | def rewards(self, s, a):
29 | return self.R[s]
30 |
31 | def allow_actions(self, s):
32 | # return allow actions in state s
33 | x = self.get_pos(s)[0]
34 | y = self.get_pos(s)[1]
35 | allow_a = np.array([], dtype='int')
36 | if y > 0 and self.W[s-self.env_w] != 1:
37 | allow_a = np.append(allow_a, 0)
38 | if y < self.env_h-1 and self.W[s+self.env_w] != 1:
39 | allow_a = np.append(allow_a, 1)
40 | if x > 0 and self.W[s-1] != 1:
41 | allow_a = np.append(allow_a, 2)
42 | if x < self.env_w-1 and self.W[s+1] != 1:
43 | allow_a = np.append(allow_a, 3)
44 | return allow_a
45 |
46 | def get_pos(self, s):
47 | # transform to coordinate (x, y)
48 |         x = s % self.env_w   # column index
49 |         y = s // self.env_w  # row index (integer division so boundary checks stay exact)
50 | return x, y
51 |
52 | def next_state(self, s, a):
53 | # return next state in state s taking action a
54 | # in this deterministic environment it returns a certain state ns
55 | ns = 0
56 | if a == 0:
57 | ns = s - self.env_w
58 | if a == 1:
59 | ns = s + self.env_w
60 | if a == 2:
61 | ns = s - 1
62 | if a == 3:
63 | ns = s + 1
64 | return ns
65 |
66 | def is_terminal(self, s):
67 |         return s in self.terminal
68 |
69 | def reset(self):
70 | return 0 # init state
71 |
--------------------------------------------------------------------------------
/algorithms/TD/train_TD.py:
--------------------------------------------------------------------------------
1 | import argparse
2 |
3 | from agents import TDAgent
4 | from envs import GridWorld
5 |
6 |
7 | def main(args):
8 | env = GridWorld()
9 |
10 | agent = TDAgent(env, epsilon=args.epsilon, gamma=args.discount, alpha=args.lr)
11 | agent.control(method=args.algorithm)
12 |
13 |
14 | if __name__ == '__main__':
15 | parser = argparse.ArgumentParser()
16 | parser.add_argument(
17 | '--algorithm', default='qlearn', help='(*qlearn | sarsa)')
18 | parser.add_argument(
19 | '--discount', type=float, default=0.9, help='discount factor')
20 | parser.add_argument(
21 | '--epsilon', type=float, default=0.3,
22 | help='parameter of epsilon greedy policy')
23 | parser.add_argument('--lr', type=float, default=0.05)
24 | main(parser.parse_args())
25 |
--------------------------------------------------------------------------------
/algorithms/TD/utils.py:
--------------------------------------------------------------------------------
1 | import matplotlib.pyplot as plt
2 | import numpy as np
3 |
4 |
5 | def draw_grid(env, agent, p=True, v=False, r=False):
6 | '''
7 | Draw the policy(|value|reward setting) at the command prompt.
8 | '''
9 | arrows = [u'\u2191', u'\u2193', u'\u2190', u'\u2192']
10 | cliff = u'\u25C6'
11 | sign = {0: '-', 10: u'\u2713', -1: u'\u2717'}
12 |
13 | tp = [] # transform policy
14 | for s in range(env.num_s):
15 | for a in range(env.num_a):
16 | tp.append(agent.P[s][a])
17 | best = [] # best action for each state at the moment
18 | for i in range(0, len(tp), env.num_a):
19 | a = tp[i:i+env.num_a]
20 | ba = np.argsort(-np.array(a))[0]
21 | best.append(ba)
22 | if r:
23 | print('\n')
24 | print('Environment setting:', end=' ')
25 |         for i, rew in enumerate(env.R):  # `rew` avoids shadowing the `r` flag argument
26 |             if i % env.env_w == 0:
27 |                 print('\n')
28 |             if env.W[i] > 0:
29 |                 print('%1s' % cliff, end=' ')
30 |             else:
31 |                 print('%1s' % sign[rew], end=' ')
32 | print('\n')
33 | if p:
34 | print('Trained policy:', end=' ')
35 | for i, a in enumerate(best):
36 | if i % env.env_w == 0:
37 | print('\n')
38 | if env.W[i] == 1:
39 | print('%s' % cliff, end=' ')
40 |             elif env.R[i] > 0:  # target cell (its reward is 10 in envs.py)
41 | print('%s' % u'\u272A', end=' ')
42 | else:
43 | print('%s' % arrows[a], end=' ')
44 | print('\n')
45 | if v:
46 | print('Value function for each state:', end=' ')
47 |         for i, val in enumerate(agent.V):  # `val` avoids shadowing the `v` flag argument
48 |             if i % env.env_w == 0:
49 |                 print('\n')
50 |             if env.W[i] == 1:
51 |                 print(' %-2s ' % cliff, end=' ')
52 |             elif env.R[i] > 0:  # target cell (its reward is 10 in envs.py)
53 |                 print('[%.1f]' % val, end=' ')
54 |             else:
55 |                 print('%4.1f' % val, end=' ')
56 | print('\n')
57 |
58 |
59 | def draw_episode_steps(avg_step_set):
60 | plt.plot(np.arange(len(avg_step_set)), avg_step_set)
61 | plt.title('steps per episode')
62 | plt.xlabel('episode')
63 | plt.ylabel('steps')
64 | plt.axis([0, 80, 0, 200])
65 | plt.show()
66 |
--------------------------------------------------------------------------------
/images/cartpole.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/borgwang/reinforce_py/41f67327ae7e1bf87d4648e3ea5f406466c532c9/images/cartpole.png
--------------------------------------------------------------------------------
/images/ddpg.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/borgwang/reinforce_py/41f67327ae7e1bf87d4648e3ea5f406466c532c9/images/ddpg.png
--------------------------------------------------------------------------------
/images/doom.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/borgwang/reinforce_py/41f67327ae7e1bf87d4648e3ea5f406466c532c9/images/doom.png
--------------------------------------------------------------------------------
/images/dqn.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/borgwang/reinforce_py/41f67327ae7e1bf87d4648e3ea5f406466c532c9/images/dqn.png
--------------------------------------------------------------------------------
/images/gridworld.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/borgwang/reinforce_py/41f67327ae7e1bf87d4648e3ea5f406466c532c9/images/gridworld.png
--------------------------------------------------------------------------------
/images/pong.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/borgwang/reinforce_py/41f67327ae7e1bf87d4648e3ea5f406466c532c9/images/pong.png
--------------------------------------------------------------------------------
/images/ppo_losses.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/borgwang/reinforce_py/41f67327ae7e1bf87d4648e3ea5f406466c532c9/images/ppo_losses.png
--------------------------------------------------------------------------------
/images/ppo_score.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/borgwang/reinforce_py/41f67327ae7e1bf87d4648e3ea5f406466c532c9/images/ppo_score.png
--------------------------------------------------------------------------------
/images/walker2d.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/borgwang/reinforce_py/41f67327ae7e1bf87d4648e3ea5f406466c532c9/images/walker2d.gif
--------------------------------------------------------------------------------
/images/walker2d.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/borgwang/reinforce_py/41f67327ae7e1bf87d4648e3ea5f406466c532c9/images/walker2d.png
--------------------------------------------------------------------------------