├── LICENSE ├── README.md ├── a3c ├── README.md ├── play.py ├── resources │ ├── average-scores.png │ └── sample-game.gif ├── sample-weights │ └── model-Breakout-v0-91750000.h5 └── train.py ├── q-learning-1-step ├── README.md ├── play.py ├── resources │ ├── after-12h-training.gif │ ├── after-18h-training.gif │ └── after-6h-training.gif ├── sample-weights │ ├── model-12h.h5 │ ├── model-18h.h5 │ └── model-6h.h5 └── train.py └── q-learning-n-step ├── README.md ├── play.py └── train.py /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 Grzegorz Opoka 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ### Variation of Asynchronous RL in Keras (Theano backend) + OpenAI gym [1-step Q-learning, n-step Q-learning, A3C] 2 | This is a simple variation of [asynchronous reinforcement learning](http://arxiv.org/pdf/1602.01783v1.pdf) written in Python with Keras (Theano backend). Instead of many threads training at the same time, there are many processes generating experience for a single agent to learn from. 3 | 4 | ### Explanation 5 | Several processes (tested with 4; more should work even better for the Q-learning methods) generate experience and send it to a shared queue. The queue is bounded in length (tested with 256) to stop individual processes from generating too much experience with outdated weights. The learning process draws samples from the queue in batches and trains on them. In A3C the network weights are swapped relatively often to keep them up to date.
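In code, this boils down to a bounded `multiprocessing` queue shared between a few actor processes and a single learner process. Below is a minimal, self-contained sketch of just that pattern -- the agent, environment and training step are replaced by dummy stand-ins and the constants are only illustrative; the real implementations live in each method's `train.py`:

```python
import random
import time
from multiprocessing import Manager, Pool

QUEUE_SIZE = 256   # bounded, so actors cannot run far ahead of the learner
BATCH_SIZE = 20
ACTORS = 4


def actor_proc(mem_queue, weight_dict):
    # Each actor plays with the most recently published weights and pushes
    # experience tuples into the shared queue (put() blocks when it is full).
    while True:
        weights = weight_dict.get('weights', 0)
        dummy_experience = (random.random(), random.randrange(4), weights)
        mem_queue.put(dummy_experience)


def learner_proc(mem_queue, weight_dict):
    # The single learner pulls samples in batches, "trains", and publishes
    # updated weights for the actors to pick up.
    step = 0
    while True:
        batch = [mem_queue.get() for _ in range(BATCH_SIZE)]
        time.sleep(0.01)               # stand-in for one gradient step
        step += 1
        weight_dict['weights'] = step  # stand-in for updated network weights
        print('trained on batch %d (%d samples)' % (step, len(batch)))


if __name__ == '__main__':
    manager = Manager()
    weight_dict = manager.dict()
    mem_queue = manager.Queue(QUEUE_SIZE)

    pool = Pool(ACTORS + 1)
    try:
        for _ in range(ACTORS):
            pool.apply_async(actor_proc, (mem_queue, weight_dict))
        pool.apply_async(learner_proc, (mem_queue, weight_dict))
        pool.close()
        pool.join()
    except KeyboardInterrupt:
        pool.terminate()
        pool.join()
```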
6 | 7 | ### Currently implemented and working methods 8 | * [1-step Q-learning](https://github.com/Grzego/async-rl/tree/master/q-learning-1-step) 9 | * [n-step Q-learning](https://github.com/Grzego/async-rl/tree/master/q-learning-n-step) 10 | * [A3C](https://github.com/Grzego/async-rl/tree/master/a3c) 11 | 12 | ### Requirements 13 | * [Python 3.4/Python 3.5](https://www.python.org/downloads/) 14 | * [Keras](http://keras.io/) 15 | * [Theano](http://deeplearning.net/software/theano/) ([Tensorflow](https://www.tensorflow.org/) would probably work too) 16 | * [OpenAI Gym (atari-py)](https://gym.openai.com/) 17 | * `pip3 install scikit-image h5py scipy` 18 | 19 | ### Sample game (A3C) 20 | ![](https://github.com/Grzego/async-rl/blob/master/a3c/resources/sample-game.gif?raw=true) 21 | 22 | #### Feedback 23 | Because I'm a newbie in Reinforcement Learning and Deep Learning, feedback is very welcome :) 24 | 25 | ### Note 26 | * Weights were trained with the Theano backend, so loading them in Tensorflow may be a little problematic due to the different convolutional kernel conventions. 27 | * If training halts after a few seconds, don't worry; it's probably because Keras lazily compiles the Theano functions. It should resume quickly. 28 | * Each process sets its own Theano compilation directory, so compilation can take a very long time at the beginning (this can be disabled with `--th_comp_fix=False`). 29 | 30 | ### Useful resources 31 | * [Asynchronous RL in Tensorflow + Keras + OpenAI's Gym](https://github.com/coreylynch/async-rl) 32 | * [Replicating "Asynchronous Methods for Deep Reinforcement Learning"](https://github.com/muupan/async-rl) 33 | * [David Silver's "Deep Reinforcement Learning" lecture](http://videolectures.net/rldm2015_silver_reinforcement_learning/) 34 | * [Nervana's Demystifying Deep Reinforcement Learning blog post](http://www.nervanasys.com/demystifying-deep-reinforcement-learning/) 35 | * [Asynchronous Methods for Deep Reinforcement Learning](http://arxiv.org/pdf/1602.01783v1.pdf) 36 | * [Playing Atari with Deep Reinforcement Learning](http://arxiv.org/pdf/1312.5602v1.pdf) 37 | 38 | -------------------------------------------------------------------------------- /a3c/README.md: -------------------------------------------------------------------------------- 1 | #### Usage 2 | 3 | To start training simply type: 4 | ``` 5 | python train.py --game=Breakout-v0 --processes=16 6 | ``` 7 | 8 | To resume training from saved model (ex. `model-Breakout-v0-1250000.h5`): 9 | ``` 10 | python train.py --game=Breakout-v0 --processes=16 --checkpoint=1250000 11 | ``` 12 | 13 | To see how it plays: 14 | ``` 15 | python play.py --model=model-file.h5 --game=Breakout-v0 16 | ``` 17 | 18 | ### Results 19 | 20 | This method works really well. The graph below shows the average score over 10 games, evaluated every 1 million frames. Training took about 24 hours; I was able to process ~57k frames per minute. Final weights can be found in the `sample-weights` folder.
21 | 22 | ![](https://github.com/Grzego/async-rl/blob/master/a3c/resources/average-scores.png?raw=true) 23 | 24 | ### Sample game 25 | 26 | ![](https://github.com/Grzego/async-rl/blob/master/a3c/resources/sample-game.gif?raw=true) -------------------------------------------------------------------------------- /a3c/play.py: -------------------------------------------------------------------------------- 1 | from keras.models import * 2 | from keras.layers import * 3 | from keras.optimizers import RMSprop 4 | import gym 5 | from scipy.misc import imresize 6 | from skimage.color import rgb2gray 7 | import numpy as np 8 | import argparse 9 | 10 | 11 | def build_network(input_shape, output_shape): 12 | state = Input(shape=input_shape) 13 | h = Conv2D(16, kernel_size=(8, 8), strides=(4, 4), activation='relu', data_format='channels_first')(state) 14 | h = Conv2D(32, kernel_size=(4, 4), strides=(2, 2), activation='relu', data_format='channels_first')(h) 15 | h = Flatten()(h) 16 | h = Dense(256, activation='relu')(h) 17 | 18 | value = Dense(1, activation='linear')(h) 19 | policy = Dense(output_shape, activation='softmax')(h) 20 | 21 | value_network = Model(inputs=state, outputs=value) 22 | policy_network = Model(inputs=state, outputs=policy) 23 | 24 | adventage = Input(shape=(1,)) 25 | train_network = Model(inputs=state, outputs=[value, policy]) 26 | 27 | return value_network, policy_network, train_network, adventage 28 | 29 | 30 | class ActingAgent(object): 31 | def __init__(self, action_space, screen=(84, 84)): 32 | self.screen = screen 33 | self.input_depth = 1 34 | self.past_range = 3 35 | self.replay_size = 32 36 | self.observation_shape = (self.input_depth * self.past_range,) + self.screen 37 | 38 | _, self.policy, self.load_net, _ = build_network(self.observation_shape, action_space.n) 39 | 40 | self.load_net.compile(optimizer=RMSprop(clipnorm=1.), loss='mse') # clipnorm=1. 41 | 42 | self.action_space = action_space 43 | self.observations = np.zeros((self.input_depth * self.past_range,) + screen) 44 | 45 | def init_episode(self, observation): 46 | for _ in range(self.past_range): 47 | self.save_observation(observation) 48 | 49 | def choose_action(self, observation): 50 | self.save_observation(observation) 51 | policy = self.policy.predict(self.observations[None, ...])[0] 52 | policy /= np.sum(policy) # renormalize; float32 softmax output may not sum exactly to 1 53 | return np.random.choice(np.arange(self.action_space.n), p=policy) 54 | 55 | def save_observation(self, observation): 56 | self.observations = np.roll(self.observations, -self.input_depth, axis=0) 57 | self.observations[-self.input_depth:, ...] = self.transform_screen(observation) 58 | 59 | def transform_screen(self, data): 60 | return rgb2gray(imresize(data, self.screen))[None, ...]
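# A note on the ActingAgent above: the network input is a rolling stack of the
# last `past_range` (3) frames, each resized to 84x84 and converted to
# grayscale in `transform_screen`, and `choose_action` samples from the softmax
# policy output (np.random.choice with p=policy) rather than taking the argmax,
# so evaluation play stays stochastic.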
61 | 62 | 63 | parser = argparse.ArgumentParser(description='Evaluation of model') 64 | parser.add_argument('--game', default='Breakout-v0', help='Name of openai gym environment', dest='game') 65 | parser.add_argument('--evaldir', default=None, help='Directory to save evaluation', dest='evaldir') 66 | parser.add_argument('--model', help='File with weights for model', dest='model') 67 | 68 | 69 | def main(): 70 | args = parser.parse_args() 71 | # ----- 72 | env = gym.make(args.game) 73 | if args.evaldir: 74 | env.monitor.start(args.evaldir) 75 | # ----- 76 | agent = ActingAgent(env.action_space) 77 | 78 | model_file = args.model 79 | 80 | agent.load_net.load_weights(model_file) 81 | 82 | game = 1 83 | for _ in range(10): 84 | done = False 85 | episode_reward = 0 86 | noops = 0 87 | 88 | # init game 89 | observation = env.reset() 90 | agent.init_episode(observation) 91 | # play one game 92 | print('Game #%8d; ' % (game,), end='') 93 | while not done: 94 | env.render() 95 | action = agent.choose_action(observation) 96 | observation, reward, done, _ = env.step(action) 97 | episode_reward += reward 98 | # ---- 99 | if action == 0: 100 | noops += 1 101 | else: 102 | noops = 0 103 | if noops > 100: 104 | break 105 | print('Reward %4d; ' % (episode_reward,)) 106 | game += 1 107 | # ----- 108 | if args.evaldir: 109 | env.monitor.close() 110 | 111 | 112 | if __name__ == "__main__": 113 | main() 114 | -------------------------------------------------------------------------------- /a3c/resources/average-scores.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Grzego/async-rl/b2b31f4c2d170531fabf7ddc6bc6f6c8e4e4ae31/a3c/resources/average-scores.png -------------------------------------------------------------------------------- /a3c/resources/sample-game.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Grzego/async-rl/b2b31f4c2d170531fabf7ddc6bc6f6c8e4e4ae31/a3c/resources/sample-game.gif -------------------------------------------------------------------------------- /a3c/sample-weights/model-Breakout-v0-91750000.h5: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Grzego/async-rl/b2b31f4c2d170531fabf7ddc6bc6f6c8e4e4ae31/a3c/sample-weights/model-Breakout-v0-91750000.h5 -------------------------------------------------------------------------------- /a3c/train.py: -------------------------------------------------------------------------------- 1 | from scipy.misc import imresize 2 | from skimage.color import rgb2gray 3 | from multiprocessing import * 4 | from collections import deque 5 | import gym 6 | import numpy as np 7 | import h5py 8 | import argparse 9 | 10 | # ----- 11 | parser = argparse.ArgumentParser(description='Training model') 12 | parser.add_argument('--game', default='Breakout-v0', help='OpenAI gym environment name', dest='game', type=str) 13 | parser.add_argument('--processes', default=4, help='Number of processes that generate experience for agent', 14 | dest='processes', type=int) 15 | parser.add_argument('--lr', default=0.001, help='Learning rate', dest='learning_rate', type=float) 16 | parser.add_argument('--steps', default=80000000, help='Number of frames to decay learning rate', dest='steps', type=int) 17 | parser.add_argument('--batch_size', default=20, help='Batch size to use during training', dest='batch_size', type=int) 18 | parser.add_argument('--swap_freq', default=100, 
help='Number of frames before swapping network weights', 19 | dest='swap_freq', type=int) 20 | parser.add_argument('--checkpoint', default=0, help='Frame to resume training', dest='checkpoint', type=int) 21 | parser.add_argument('--save_freq', default=250000, help='Number of frames before saving weights', dest='save_freq', 22 | type=int) 23 | parser.add_argument('--queue_size', default=256, help='Size of queue holding agent experience', dest='queue_size', 24 | type=int) 25 | parser.add_argument('--n_step', default=5, help='Number of steps', dest='n_step', type=int) 26 | parser.add_argument('--reward_scale', default=1., dest='reward_scale', type=float) 27 | parser.add_argument('--beta', default=0.01, dest='beta', type=float) 28 | # ----- 29 | args = parser.parse_args() 30 | 31 | 32 | # ----- 33 | 34 | 35 | def build_network(input_shape, output_shape): 36 | from keras.models import Model 37 | from keras.layers import Input, Conv2D, Flatten, Dense 38 | # ----- 39 | state = Input(shape=input_shape) 40 | h = Conv2D(16, kernel_size=(8, 8), strides=(4, 4), activation='relu', data_format='channels_first')(state) 41 | h = Conv2D(32, kernel_size=(4, 4), strides=(2, 2), activation='relu', data_format='channels_first')(h) 42 | h = Flatten()(h) 43 | h = Dense(256, activation='relu')(h) 44 | 45 | value = Dense(1, activation='linear', name='value')(h) 46 | policy = Dense(output_shape, activation='softmax', name='policy')(h) 47 | 48 | value_network = Model(inputs=state, outputs=value) 49 | policy_network = Model(inputs=state, outputs=policy) 50 | 51 | adventage = Input(shape=(1,)) 52 | train_network = Model(inputs=[state, adventage], outputs=[value, policy]) 53 | 54 | return value_network, policy_network, train_network, adventage 55 | 56 | 57 | def policy_loss(adventage=0., beta=0.01): 58 | from keras import backend as K 59 | 60 | def loss(y_true, y_pred): 61 | return -K.sum(K.log(K.sum(y_true * y_pred, axis=-1) + K.epsilon()) * K.flatten(adventage)) + \ 62 | beta * K.sum(y_pred * K.log(y_pred + K.epsilon())) 63 | 64 | return loss 65 | 66 | 67 | def value_loss(): 68 | from keras import backend as K 69 | 70 | def loss(y_true, y_pred): 71 | return 0.5 * K.sum(K.square(y_true - y_pred)) 72 | 73 | return loss 74 | 75 | 76 | # ----- 77 | 78 | class LearningAgent(object): 79 | def __init__(self, action_space, batch_size=32, screen=(84, 84), swap_freq=200): 80 | from keras.optimizers import RMSprop 81 | # ----- 82 | self.screen = screen 83 | self.input_depth = 1 84 | self.past_range = 3 85 | self.observation_shape = (self.input_depth * self.past_range,) + self.screen 86 | self.batch_size = batch_size 87 | 88 | _, _, self.train_net, adventage = build_network(self.observation_shape, action_space.n) 89 | 90 | self.train_net.compile(optimizer=RMSprop(epsilon=0.1, rho=0.99), 91 | loss=[value_loss(), policy_loss(adventage, args.beta)]) 92 | 93 | self.pol_loss = deque(maxlen=25) 94 | self.val_loss = deque(maxlen=25) 95 | self.values = deque(maxlen=25) 96 | self.entropy = deque(maxlen=25) 97 | self.swap_freq = swap_freq 98 | self.swap_counter = self.swap_freq 99 | self.unroll = np.arange(self.batch_size) 100 | self.targets = np.zeros((self.batch_size, action_space.n)) 101 | self.counter = 0 102 | 103 | def learn(self, last_observations, actions, rewards, learning_rate=0.001): 104 | import keras.backend as K 105 | K.set_value(self.train_net.optimizer.lr, learning_rate) 106 | frames = len(last_observations) 107 | self.counter += frames 108 | # ----- 109 | values, policy = self.train_net.predict([last_observations, 
self.unroll]) 110 | # ----- 111 | self.targets.fill(0.) 112 | adventage = rewards - values.flatten() 113 | self.targets[self.unroll, actions] = 1. 114 | # ----- 115 | loss = self.train_net.train_on_batch([last_observations, adventage], [rewards, self.targets]) 116 | entropy = np.mean(-policy * np.log(policy + 0.00000001)) 117 | self.pol_loss.append(loss[2]) 118 | self.val_loss.append(loss[1]) 119 | self.entropy.append(entropy) 120 | self.values.append(np.mean(values)) 121 | min_val, max_val, avg_val = min(self.values), max(self.values), np.mean(self.values) 122 | print('\rFrames: %8d; Policy-Loss: %10.6f; Avg: %10.6f ' 123 | '--- Value-Loss: %10.6f; Avg: %10.6f ' 124 | '--- Entropy: %7.6f; Avg: %7.6f ' 125 | '--- V-value; Min: %6.3f; Max: %6.3f; Avg: %6.3f' % ( 126 | self.counter, 127 | loss[2], np.mean(self.pol_loss), 128 | loss[1], np.mean(self.val_loss), 129 | entropy, np.mean(self.entropy), 130 | min_val, max_val, avg_val), end='') 131 | # ----- 132 | self.swap_counter -= frames 133 | if self.swap_counter < 0: 134 | self.swap_counter += self.swap_freq 135 | return True 136 | return False 137 | 138 | 139 | def learn_proc(mem_queue, weight_dict): 140 | import os 141 | pid = os.getpid() 142 | os.environ['THEANO_FLAGS'] = 'floatX=float32,device=gpu,nvcc.fastmath=False,lib.cnmem=0.3,' + \ 143 | 'compiledir=th_comp_learn' 144 | # ----- 145 | print(' %5d> Learning process' % (pid,)) 146 | # ----- 147 | save_freq = args.save_freq 148 | learning_rate = args.learning_rate 149 | batch_size = args.batch_size 150 | checkpoint = args.checkpoint 151 | steps = args.steps 152 | # ----- 153 | env = gym.make(args.game) 154 | agent = LearningAgent(env.action_space, batch_size=args.batch_size, swap_freq=args.swap_freq) 155 | # ----- 156 | if checkpoint > 0: 157 | print(' %5d> Loading weights from file' % (pid,)) 158 | agent.train_net.load_weights('model-%s-%d.h5' % (args.game, checkpoint,)) 159 | # ----- 160 | print(' %5d> Setting weights in dict' % (pid,)) 161 | weight_dict['update'] = 0 162 | weight_dict['weights'] = agent.train_net.get_weights() 163 | # ----- 164 | last_obs = np.zeros((batch_size,) + agent.observation_shape) 165 | actions = np.zeros(batch_size, dtype=np.int32) 166 | rewards = np.zeros(batch_size) 167 | # ----- 168 | idx = 0 169 | agent.counter = checkpoint 170 | save_counter = checkpoint % save_freq + save_freq 171 | while True: 172 | # ----- 173 | last_obs[idx, ...], actions[idx], rewards[idx] = mem_queue.get() 174 | idx = (idx + 1) % batch_size 175 | if idx == 0: 176 | lr = max(0.00000001, (steps - agent.counter) / steps * learning_rate) 177 | updated = agent.learn(last_obs, actions, rewards, learning_rate=lr) 178 | if updated: 179 | # print(' %5d> Updating weights in dict' % (pid,)) 180 | weight_dict['weights'] = agent.train_net.get_weights() 181 | weight_dict['update'] += 1 182 | # ----- 183 | save_counter -= 1 184 | if save_counter < 0: 185 | save_counter += save_freq 186 | agent.train_net.save_weights('model-%s-%d.h5' % (args.game, agent.counter,), overwrite=True) 187 | 188 | 189 | class ActingAgent(object): 190 | def __init__(self, action_space, screen=(84, 84), n_step=8, discount=0.99): 191 | self.screen = screen 192 | self.input_depth = 1 193 | self.past_range = 3 194 | self.observation_shape = (self.input_depth * self.past_range,) + self.screen 195 | 196 | self.value_net, self.policy_net, self.load_net, adv = build_network(self.observation_shape, action_space.n) 197 | 198 | self.value_net.compile(optimizer='rmsprop', loss='mse') 199 | 
self.policy_net.compile(optimizer='rmsprop', loss='categorical_crossentropy') 200 | self.load_net.compile(optimizer='rmsprop', loss='mse', loss_weights=[0.5, 1.]) # dummy loss 201 | 202 | self.action_space = action_space 203 | self.observations = np.zeros(self.observation_shape) 204 | self.last_observations = np.zeros_like(self.observations) 205 | # ----- 206 | self.n_step_observations = deque(maxlen=n_step) 207 | self.n_step_actions = deque(maxlen=n_step) 208 | self.n_step_rewards = deque(maxlen=n_step) 209 | self.n_step = n_step 210 | self.discount = discount 211 | self.counter = 0 212 | 213 | def init_episode(self, observation): 214 | for _ in range(self.past_range): 215 | self.save_observation(observation) 216 | 217 | def reset(self): 218 | self.counter = 0 219 | self.n_step_observations.clear() 220 | self.n_step_actions.clear() 221 | self.n_step_rewards.clear() 222 | 223 | def sars_data(self, action, reward, observation, terminal, mem_queue): 224 | self.save_observation(observation) 225 | reward = np.clip(reward, -1., 1.) 226 | # reward /= args.reward_scale 227 | # ----- 228 | self.n_step_observations.appendleft(self.last_observations) 229 | self.n_step_actions.appendleft(action) 230 | self.n_step_rewards.appendleft(reward) 231 | # ----- 232 | self.counter += 1 233 | if terminal or self.counter >= self.n_step: 234 | r = 0. 235 | if not terminal: 236 | r = self.value_net.predict(self.observations[None, ...])[0] 237 | for i in range(self.counter): 238 | r = self.n_step_rewards[i] + self.discount * r 239 | mem_queue.put((self.n_step_observations[i], self.n_step_actions[i], r)) 240 | self.reset() 241 | 242 | def choose_action(self): 243 | policy = self.policy_net.predict(self.observations[None, ...])[0] 244 | return np.random.choice(np.arange(self.action_space.n), p=policy) 245 | 246 | def save_observation(self, observation): 247 | self.last_observations = self.observations[...] 248 | self.observations = np.roll(self.observations, -self.input_depth, axis=0) 249 | self.observations[-self.input_depth:, ...] = self.transform_screen(observation) 250 | 251 | def transform_screen(self, data): 252 | return rgb2gray(imresize(data, self.screen))[None, ...] 
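# How ActingAgent.sars_data (above) fills the experience queue: once `n_step`
# transitions have been collected (or the episode ends), discounted returns are
# computed backwards through the stored transitions as
#     R = reward + discount * R,
# starting from R = value_net(latest observation) when the segment did not end
# in a terminal state, and from R = 0 when it did. For example, with
# n_step = 3, discount = 0.99, rewards [0, 0, 1] (oldest first) and a bootstrap
# value of 0.5, the newest transition is stored with return 1 + 0.99 * 0.5 =
# 1.495, the middle one with 0.99 * 1.495 ~= 1.480, and the oldest with
# 0.99 * 1.480 ~= 1.465.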
253 | 254 | 255 | def generate_experience_proc(mem_queue, weight_dict, no): 256 | import os 257 | pid = os.getpid() 258 | os.environ['THEANO_FLAGS'] = 'floatX=float32,device=gpu,nvcc.fastmath=True,lib.cnmem=0,' + \ 259 | 'compiledir=th_comp_act_' + str(no) 260 | # ----- 261 | print(' %5d> Process started' % (pid,)) 262 | # ----- 263 | frames = 0 264 | batch_size = args.batch_size 265 | # ----- 266 | env = gym.make(args.game) 267 | agent = ActingAgent(env.action_space, n_step=args.n_step) 268 | 269 | if frames > 0: 270 | print(' %5d> Loaded weights from file' % (pid,)) 271 | agent.load_net.load_weights('model-%s-%d.h5' % (args.game, frames)) 272 | else: 273 | import time 274 | while 'weights' not in weight_dict: 275 | time.sleep(0.1) 276 | agent.load_net.set_weights(weight_dict['weights']) 277 | print(' %5d> Loaded weights from dict' % (pid,)) 278 | 279 | best_score = 0 280 | avg_score = deque([0], maxlen=25) 281 | 282 | last_update = 0 283 | while True: 284 | done = False 285 | episode_reward = 0 286 | op_last, op_count = 0, 0 287 | observation = env.reset() 288 | agent.init_episode(observation) 289 | 290 | # ----- 291 | while not done: 292 | frames += 1 293 | action = agent.choose_action() 294 | observation, reward, done, _ = env.step(action) 295 | episode_reward += reward 296 | best_score = max(best_score, episode_reward) 297 | # ----- 298 | agent.sars_data(action, reward, observation, done, mem_queue) 299 | # ----- 300 | op_count = 0 if op_last != action else op_count + 1 301 | done = done or op_count >= 100 302 | op_last = action 303 | # ----- 304 | if frames % 2000 == 0: 305 | print(' %5d> Best: %4d; Avg: %6.2f; Max: %4d' % ( 306 | pid, best_score, np.mean(avg_score), np.max(avg_score))) 307 | if frames % batch_size == 0: 308 | update = weight_dict.get('update', 0) 309 | if update > last_update: 310 | last_update = update 311 | # print(' %5d> Getting weights from dict' % (pid,)) 312 | agent.load_net.set_weights(weight_dict['weights']) 313 | # ----- 314 | avg_score.append(episode_reward) 315 | 316 | 317 | def init_worker(): 318 | import signal 319 | signal.signal(signal.SIGINT, signal.SIG_IGN) 320 | 321 | 322 | def main(): 323 | manager = Manager() 324 | weight_dict = manager.dict() 325 | mem_queue = manager.Queue(args.queue_size) 326 | 327 | pool = Pool(args.processes + 1, init_worker) 328 | 329 | try: 330 | for i in range(args.processes): 331 | pool.apply_async(generate_experience_proc, (mem_queue, weight_dict, i)) 332 | 333 | pool.apply_async(learn_proc, (mem_queue, weight_dict)) 334 | 335 | pool.close() 336 | pool.join() 337 | 338 | except KeyboardInterrupt: 339 | pool.terminate() 340 | pool.join() 341 | 342 | 343 | if __name__ == "__main__": 344 | main() 345 | -------------------------------------------------------------------------------- /q-learning-1-step/README.md: -------------------------------------------------------------------------------- 1 | ### Usage 2 | 3 | To start training simply type (I recommend running in terminal with maximum width, due to lots of output data): 4 | ``` 5 | python train.py --game=Breakout-v0 --processes=16 6 | ``` 7 | 8 | To resume training from saved model (ex. `model-1250000.h5`): 9 | ``` 10 | python train.py --game=Breakout-v0 --processes=16 --checkpoint=1250000 11 | ``` 12 | 13 | To see how it plays: 14 | ``` 15 | python play.py --model=sample-weights/model-18h.h5 --game=Breakout-v0 16 | ``` 17 | 18 | ### Samples (old version) 19 | I tested it once and it worked quite well. 
(Intel i7-4700MQ and NVidia GTX 765M) 20 | 21 | Sample games after 6h, 12h and 18h of training. 22 | 23 | ![](https://raw.githubusercontent.com/Grzego/async-rl/master/q-learning-1-step/resources/after-6h-training.gif?token=AFhQOQQq2JlswCS_p1XjU6WrKn3pQ4dvks5XbsV9wA%3D%3D) 24 | ![](https://raw.githubusercontent.com/Grzego/async-rl/master/q-learning-1-step/resources/after-12h-training.gif?token=AFhQOXkCZbPO9SrOXXu5_3_-P0ftrfSsks5XbsWiwA%3D%3D) 25 | ![](https://raw.githubusercontent.com/Grzego/async-rl/master/q-learning-1-step/resources/after-18h-training.gif?token=AFhQOR-kTbupToKnNRenZCWiBEtZBmvhks5XbsWjwA%3D%3D) -------------------------------------------------------------------------------- /q-learning-1-step/play.py: -------------------------------------------------------------------------------- 1 | import gym 2 | from scipy.misc import imresize 3 | from skimage.color import rgb2gray 4 | import numpy as np 5 | import argparse 6 | 7 | 8 | def build_network(input_shape, output_shape): 9 | from keras.models import Model 10 | from keras.layers import Input, Conv2D, Flatten, Dense 11 | 12 | x = Input(shape=input_shape) 13 | h = Conv2D(16, kernel_size=(8, 8), strides=(4, 4), activation='relu', data_format='channels_first')(x) 14 | h = Conv2D(32, kernel_size=(4, 4), strides=(2, 2), activation='relu', data_format='channels_first')(h) 15 | h = Flatten()(h) 16 | h = Dense(256, activation='relu')(h) 17 | v = Dense(output_shape, activation='linear')(h) 18 | return Model(inputs=x, outputs=v) 19 | 20 | 21 | class ActingAgent(object): 22 | def __init__(self, action_space, screen=(84, 84)): 23 | self.screen = screen 24 | self.input_depth = 1 25 | self.past_range = 3 26 | self.replay_size = 32 27 | self.observation_shape = (self.input_depth * self.past_range,) + self.screen 28 | 29 | self.action_value = build_network(self.observation_shape, action_space.n) 30 | self.action_value.compile(optimizer='rmsprop', loss='mse') 31 | 32 | self.action_space = action_space 33 | self.observations = np.zeros((self.input_depth * self.past_range,) + screen) 34 | 35 | def init_episode(self, observation): 36 | for _ in range(self.past_range): 37 | self.save_observation(observation) 38 | 39 | def choose_action(self, observation, epsilon=0.0): 40 | self.save_observation(observation) 41 | if np.random.random() < epsilon: 42 | return self.action_space.sample() 43 | else: 44 | return np.argmax(self.action_value.predict(self.observations[None, ...])) 45 | 46 | def save_observation(self, observation): 47 | self.observations = np.roll(self.observations, -self.input_depth, axis=0) 48 | self.observations[-self.input_depth:, ...] = self.transform_screen(observation) 49 | 50 | def transform_screen(self, data): 51 | return rgb2gray(imresize(data, self.screen))[None, ...] 
52 | 53 | 54 | parser = argparse.ArgumentParser(description='Evaluation of model') 55 | parser.add_argument('--game', default='Breakout-v0', help='Name of openai gym environment', dest='game') 56 | parser.add_argument('--evaldir', default=None, help='Directory to save evaluation', dest='evaldir') 57 | parser.add_argument('--model', help='File with weights for model', dest='model') 58 | parser.add_argument('--eps', default=0., help='Epsilon value', dest='eps', type=float) 59 | 60 | 61 | def main(): 62 | args = parser.parse_args() 63 | # ----- 64 | env = gym.make(args.game) 65 | if args.evaldir: 66 | env.monitor.start(args.evaldir) 67 | # ----- 68 | agent = ActingAgent(env.action_space) 69 | 70 | model_file = args.model 71 | epsilon = args.eps 72 | 73 | agent.action_value.load_weights(model_file) 74 | 75 | game = 1 76 | for _ in range(10): 77 | done = False 78 | episode_reward = 0 79 | noops = 0 80 | 81 | # init game 82 | observation = env.reset() 83 | agent.init_episode(observation) 84 | # play one game 85 | print('Game #%8d; ' % (game,), end='') 86 | while not done: 87 | env.render() 88 | action = agent.choose_action(observation, epsilon=epsilon) 89 | observation, reward, done, _ = env.step(action) 90 | episode_reward += reward 91 | # ---- 92 | if action == 0: 93 | noops += 1 94 | else: 95 | noops = 0 96 | if noops > 100: 97 | break 98 | print('Reward %4d; ' % (episode_reward,)) 99 | game += 1 100 | # ----- 101 | if args.evaldir: 102 | env.monitor.close() 103 | 104 | 105 | if __name__ == "__main__": 106 | main() 107 | -------------------------------------------------------------------------------- /q-learning-1-step/resources/after-12h-training.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Grzego/async-rl/b2b31f4c2d170531fabf7ddc6bc6f6c8e4e4ae31/q-learning-1-step/resources/after-12h-training.gif -------------------------------------------------------------------------------- /q-learning-1-step/resources/after-18h-training.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Grzego/async-rl/b2b31f4c2d170531fabf7ddc6bc6f6c8e4e4ae31/q-learning-1-step/resources/after-18h-training.gif -------------------------------------------------------------------------------- /q-learning-1-step/resources/after-6h-training.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Grzego/async-rl/b2b31f4c2d170531fabf7ddc6bc6f6c8e4e4ae31/q-learning-1-step/resources/after-6h-training.gif -------------------------------------------------------------------------------- /q-learning-1-step/sample-weights/model-12h.h5: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Grzego/async-rl/b2b31f4c2d170531fabf7ddc6bc6f6c8e4e4ae31/q-learning-1-step/sample-weights/model-12h.h5 -------------------------------------------------------------------------------- /q-learning-1-step/sample-weights/model-18h.h5: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Grzego/async-rl/b2b31f4c2d170531fabf7ddc6bc6f6c8e4e4ae31/q-learning-1-step/sample-weights/model-18h.h5 -------------------------------------------------------------------------------- /q-learning-1-step/sample-weights/model-6h.h5: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Grzego/async-rl/b2b31f4c2d170531fabf7ddc6bc6f6c8e4e4ae31/q-learning-1-step/sample-weights/model-6h.h5 -------------------------------------------------------------------------------- /q-learning-1-step/train.py: -------------------------------------------------------------------------------- 1 | from scipy.misc import imresize 2 | from skimage.color import rgb2gray 3 | from multiprocessing import * 4 | from collections import deque 5 | import queue 6 | import gym 7 | import numpy as np 8 | import argparse 9 | 10 | # ----- 11 | parser = argparse.ArgumentParser(description='Training model') 12 | parser.add_argument('--game', default='Breakout-v0', help='OpenAI gym environment name', dest='game', type=str) 13 | parser.add_argument('--processes', default=4, help='Number of processes that generate experience for agent', 14 | dest='processes', type=int) 15 | parser.add_argument('--lr', default=0.0001, help='Learning rate', dest='learning_rate', type=float) 16 | parser.add_argument('--batch_size', default=20, help='Batch size to use during training', dest='batch_size', type=int) 17 | parser.add_argument('--swap_freq', default=10000, help='Number of frames before swapping network weights', 18 | dest='swap_freq', type=int) 19 | parser.add_argument('--checkpoint', default=0, help='Iteration to resume training', dest='checkpoint', type=int) 20 | parser.add_argument('--save_freq', default=250000, help='Number of frame before saving weights', dest='save_freq', 21 | type=int) 22 | parser.add_argument('--eps_decay', default=4000000, 23 | help='Number of frames needed to decay epsilon to the lowest value', dest='eps_decay', type=int) 24 | parser.add_argument('--lr_decay', default=80000000, 25 | help='Number of frames needed to decay lr to the lowest value', dest='lr_decay', type=int) 26 | parser.add_argument('--queue_size', default=256, help='Size of queue holding agent experience', dest='queue_size', 27 | type=int) 28 | parser.add_argument('--th_comp_fix', default=True, 29 | help='Sets different Theano compiledir for each process', dest='th_fix', type=bool) 30 | # ----- 31 | args = parser.parse_args() 32 | 33 | 34 | # ----- 35 | 36 | 37 | def build_network(input_shape, output_shape): 38 | from keras.models import Model 39 | from keras.layers import Input, Conv2D, Flatten, Dense 40 | 41 | x = Input(shape=input_shape) 42 | h = Conv2D(16, kernel_size=(8, 8), strides=(4, 4), activation='relu', data_format='channels_first')(x) 43 | h = Conv2D(32, kernel_size=(4, 4), strides=(2, 2), activation='relu', data_format='channels_first')(h) 44 | h = Flatten()(h) 45 | h = Dense(256, activation='relu')(h) 46 | v = Dense(output_shape, activation='linear')(h) 47 | return Model(inputs=x, outputs=v) 48 | 49 | 50 | class LearningAgent(object): 51 | def __init__(self, action_space, batch_size=32, screen=(84, 84), swap_freq=200): 52 | from keras.optimizers import RMSprop 53 | # ----- 54 | self.screen = screen 55 | self.input_depth = 1 56 | self.past_range = 3 57 | self.observation_shape = (self.input_depth * self.past_range,) + self.screen 58 | self.batch_size = batch_size 59 | 60 | self.action_value = build_network(self.observation_shape, action_space.n) 61 | self.action_value_freeze = build_network(self.observation_shape, action_space.n) 62 | 63 | self.action_value.compile(optimizer='rmsprop', loss='mse') 64 | self.action_value_freeze.compile(optimizer='rmsprop', loss='mse') 65 | 66 | self.losses = deque(maxlen=25) 67 | self.q_values = deque(maxlen=25) 68 | self.swap_freq = swap_freq 69 | 
self.swap_counter = self.swap_freq 70 | self.unroll = np.arange(self.batch_size) 71 | self.frames = 0 72 | 73 | def learn(self, last_observations, actions, rewards, observations, not_terminals, discount=0.99, 74 | learning_rate=0.001): 75 | self.action_value.optimizer.lr.set_value(learning_rate) 76 | frames = len(last_observations) 77 | self.frames += frames 78 | # ----- 79 | targets = self.action_value.predict_on_batch(last_observations) 80 | q_values = self.action_value_freeze.predict_on_batch(observations) 81 | # ----- 82 | # equation = rewards + not_terminals * discount * np.argmax(q_values) 83 | rewards = np.clip(rewards, -1., 1.) 84 | equation = not_terminals 85 | equation *= np.max(q_values, axis=1) 86 | equation *= discount 87 | targets[self.unroll, actions] = rewards + equation 88 | # ----- 89 | loss = self.action_value.train_on_batch(last_observations, targets) 90 | self.losses.append(loss) 91 | self.q_values.append(np.mean(targets)) 92 | print( 93 | '\rFrames: %8d; Lr: %8.7f; Loss: %7.4f; Min: %7.4f; Max: %7.4f; Avg: %7.4f --- Q-value; Min: %7.4f; Max: %7.4f; Avg: %7.4f' % ( 94 | self.frames, learning_rate, loss, min(self.losses), max(self.losses), np.mean(self.losses), 95 | np.min(self.q_values), np.max(self.q_values), np.mean(self.q_values)), end='') 96 | self.swap_counter -= frames 97 | if self.swap_counter < 0: 98 | self.swap_counter += self.swap_freq 99 | self.action_value_freeze.set_weights(self.action_value.get_weights()) 100 | return True 101 | return False 102 | 103 | 104 | def learn_proc(global_frame, mem_queue, weight_dict): 105 | import os 106 | pid = os.getpid() 107 | if args.th_fix: 108 | os.environ['THEANO_FLAGS'] = 'floatX=float32,device=gpu,nvcc.fastmath=False,lib.cnmem=0,' + \ 109 | 'compiledir=th_comp_learn' 110 | # ----- 111 | save_freq = args.save_freq 112 | learning_rate = args.learning_rate 113 | batch_size = args.batch_size 114 | checkpoint = args.checkpoint 115 | lr_decay = args.lr_decay 116 | # ----- 117 | env = gym.make(args.game) 118 | agent = LearningAgent(env.action_space, batch_size=args.batch_size, swap_freq=args.swap_freq) 119 | # ----- 120 | if checkpoint > 0: 121 | agent.action_value.load_weights('model-%d.h5' % (checkpoint,)) 122 | agent.action_value_freeze.set_weights(agent.action_value.get_weights()) 123 | print(' %5d> Setting weights in dict' % (pid,)) 124 | # ----- 125 | weight_dict['update'] = 0 126 | weight_dict['weights'] = agent.action_value.get_weights() 127 | # ----- 128 | last_obs = np.zeros((batch_size,) + agent.observation_shape) 129 | actions = np.zeros(batch_size, dtype=np.int32) 130 | rewards = np.zeros(batch_size) 131 | obs = np.zeros((batch_size,) + agent.observation_shape) 132 | not_term = np.zeros(batch_size) 133 | # ----- 134 | index = 0 135 | agent.frames = checkpoint 136 | save_counter = checkpoint % save_freq + save_freq 137 | while True: 138 | last_obs[index, ...], actions[index], rewards[index], obs[index, ...], not_term[index] = mem_queue.get() 139 | # ----- 140 | index = (index + 1) % batch_size 141 | if index == 0: 142 | lr = max(0.00000001, learning_rate * (1. 
- agent.frames * batch_size / lr_decay)) 143 | updated = agent.learn(last_obs, actions, rewards, obs, not_term, learning_rate=lr) 144 | global_frame.value = agent.frames 145 | if updated: 146 | # print(' %5d> Updating weights in dict' % (pid,)) 147 | weight_dict['weights'] = agent.action_value_freeze.get_weights() 148 | weight_dict['update'] += 1 149 | # ----- 150 | save_counter -= 1 151 | if save_counter < 0: 152 | save_counter += save_freq 153 | agent.action_value_freeze.save_weights('model-%d.h5' % (agent.frames,), overwrite=True) 154 | 155 | 156 | class ActingAgent(object): 157 | def __init__(self, action_space, screen=(84, 84)): 158 | from keras.optimizers import RMSprop 159 | # ----- 160 | self.screen = screen 161 | self.input_depth = 1 162 | self.past_range = 3 163 | self.observation_shape = (self.input_depth * self.past_range,) + self.screen 164 | 165 | self.action_value = build_network(self.observation_shape, action_space.n) 166 | self.action_value.compile(optimizer='rmsprop', loss='mse') 167 | 168 | self.action_space = action_space 169 | self.observations = np.zeros(self.observation_shape) 170 | self.last_observations = np.zeros_like(self.observations) 171 | 172 | def init_episode(self, observation): 173 | for _ in range(self.past_range): 174 | self.save_observation(observation) 175 | 176 | def sars_data(self, action, reward, observation, not_terminal): 177 | self.save_observation(observation) 178 | return self.last_observations, action, reward, self.observations, not_terminal 179 | 180 | def choose_action(self, epsilon=0.0): 181 | if np.random.random() < epsilon: 182 | return self.action_space.sample() 183 | else: 184 | return np.argmax(self.action_value.predict(self.observations[None, ...])) 185 | 186 | def save_observation(self, observation): 187 | self.last_observations = self.observations[...] 188 | self.observations = np.roll(self.observations, -self.input_depth, axis=0) 189 | self.observations[-self.input_depth:, ...] = self.transform_screen(observation) 190 | 191 | def transform_screen(self, data): 192 | return rgb2gray(imresize(data, self.screen))[None, ...] 193 | 194 | 195 | def generate_experience_proc(global_frame, mem_queue, weight_dict, no, epsilon): 196 | import os 197 | pid = os.getpid() 198 | if args.th_fix: 199 | os.environ['THEANO_FLAGS'] = 'floatX=float32,device=gpu,nvcc.fastmath=True,lib.cnmem=0,' + \ 200 | 'compiledir=th_comp_act_' + str(no) 201 | # ----- 202 | print(' %5d> Process started with %6.3f' % (pid, epsilon)) 203 | # ----- 204 | env = gym.make(args.game) 205 | agent = ActingAgent(env.action_space) 206 | 207 | if args.checkpoint > 0: 208 | print(' %5d> Loaded weights from file' % (pid,)) 209 | agent.action_value.load_weights('model-%d.h5' % (args.checkpoint,)) 210 | else: 211 | import time 212 | while 'weights' not in weight_dict: 213 | time.sleep(0.1) 214 | agent.action_value.set_weights(weight_dict['weights']) 215 | print(' %5d> Loaded weights from dict' % (pid,)) 216 | 217 | best_score, last_update, frames = 0, 0, 0 218 | avg_score = deque(maxlen=20) 219 | stop_decay = global_frame.value > args.eps_decay 220 | 221 | while True: 222 | done = False 223 | episode_reward, noops, last_op = 0, 0, 0 224 | observation = env.reset() 225 | agent.init_episode(observation) 226 | 227 | # ----- 228 | while not done: 229 | frames += 1 230 | if not stop_decay: 231 | frame_tmp = global_frame.value 232 | decayed_epsilon = max(epsilon, epsilon + (1. 
- epsilon) * ( 233 | args.eps_decay - frame_tmp) / args.eps_decay) 234 | stop_decay = frame_tmp > args.eps_decay 235 | # ----- 236 | action = agent.choose_action(decayed_epsilon) 237 | observation, reward, done, _ = env.step(action) 238 | episode_reward += reward 239 | best_score = max(best_score, episode_reward) 240 | # ----- 241 | if action == last_op: 242 | noops += 1 243 | else: 244 | last_op, noops = action, 0 245 | # ----- 246 | if noops > 100: 247 | break 248 | # ----- 249 | mem_queue.put(agent.sars_data(action, reward, observation, not done)) 250 | # ----- 251 | if frames % 2000 == 0: 252 | print(' %5d> Epsilon: %9.6f; Best: %4d; Avg: %6.2f' % ( 253 | pid, decayed_epsilon, best_score, np.mean(avg_score))) 254 | if frames % args.batch_size == 0: 255 | update = weight_dict.get('update', 0) 256 | if update > last_update: 257 | last_update = update 258 | # print(' %5d> Getting weights from dict' % (pid,)) 259 | agent.action_value.set_weights(weight_dict['weights']) 260 | # ----- 261 | avg_score.append(episode_reward) 262 | 263 | 264 | def init_worker(): 265 | import signal 266 | signal.signal(signal.SIGINT, signal.SIG_IGN) 267 | 268 | 269 | def main(): 270 | manager = Manager() 271 | weight_dict = manager.dict() 272 | global_frame = manager.Value('i', args.checkpoint) 273 | mem_queue = manager.Queue(args.queue_size) 274 | 275 | eps = [0.1, 0.01, 0.5] 276 | pool = Pool(args.processes + 1, init_worker) 277 | 278 | try: 279 | for i in range(args.processes): 280 | pool.apply_async(generate_experience_proc, 281 | args=(global_frame, mem_queue, weight_dict, i, eps[i % len(eps)])) 282 | 283 | pool.apply_async(learn_proc, args=(global_frame, mem_queue, weight_dict)) 284 | 285 | pool.close() 286 | pool.join() 287 | 288 | except KeyboardInterrupt: 289 | pool.terminate() 290 | pool.join() 291 | 292 | 293 | if __name__ == "__main__": 294 | main() 295 | -------------------------------------------------------------------------------- /q-learning-n-step/README.md: -------------------------------------------------------------------------------- 1 | ### Usage 2 | 3 | To start training simply type: 4 | ``` 5 | python train.py --game=Breakout-v0 --processes=16 --n_step=5 6 | ``` 7 | 8 | To resume training from saved model (ex. 
`model-1250000.h5`): 9 | ``` 10 | python train.py --game=Breakout-v0 --processes=16 --checkpoint=1250000 11 | ``` 12 | 13 | To see how it plays: 14 | ``` 15 | python play.py --model=model-file.h5 --game=Breakout-v0 16 | ``` -------------------------------------------------------------------------------- /q-learning-n-step/play.py: -------------------------------------------------------------------------------- 1 | import gym 2 | from scipy.misc import imresize 3 | from skimage.color import rgb2gray 4 | import numpy as np 5 | import argparse 6 | 7 | 8 | def build_network(input_shape, output_shape): 9 | from keras.models import Model 10 | from keras.layers import Input, Conv2D, Flatten, Dense 11 | 12 | x = Input(shape=input_shape) 13 | h = Conv2D(16, kernel_size=(8, 8), strides=(4, 4), activation='relu', data_format='channels_first')(x) 14 | h = Conv2D(32, kernel_size=(4, 4), strides=(2, 2), activation='relu', data_format='channels_first')(h) 15 | h = Flatten()(h) 16 | h = Dense(256, activation='relu')(h) 17 | v = Dense(output_shape, activation='linear')(h) 18 | return Model(inputs=x, outputs=v) 19 | 20 | 21 | class ActingAgent(object): 22 | def __init__(self, action_space, screen=(84, 84)): 23 | from keras.optimizers import RMSprop 24 | 25 | self.screen = screen 26 | self.input_depth = 1 27 | self.past_range = 3 28 | self.replay_size = 32 29 | self.observation_shape = (self.input_depth * self.past_range,) + self.screen 30 | 31 | self.action_value = build_network(self.observation_shape, action_space.n) 32 | self.action_value.compile(optimizer=RMSprop(clipnorm=1.), loss='mse') # clipnorm=1. 33 | 34 | self.action_space = action_space 35 | self.observations = np.zeros((self.input_depth * self.past_range,) + screen) 36 | 37 | def init_episode(self, observation): 38 | for _ in range(self.past_range): 39 | self.save_observation(observation) 40 | 41 | def choose_action(self, observation, epsilon=0.0): 42 | self.save_observation(observation) 43 | if np.random.random() < epsilon: 44 | return self.action_space.sample() 45 | else: 46 | return np.argmax(self.action_value.predict(self.observations[None, ...])) 47 | 48 | def save_observation(self, observation): 49 | self.observations = np.roll(self.observations, -self.input_depth, axis=0) 50 | self.observations[-self.input_depth:, ...] = self.transform_screen(observation) 51 | 52 | def transform_screen(self, data): 53 | return rgb2gray(imresize(data, self.screen))[None, ...] 
54 | 55 | 56 | parser = argparse.ArgumentParser(description='Evaluation of model') 57 | parser.add_argument('--game', default='Breakout-v0', help='Name of openai gym environment', dest='game') 58 | parser.add_argument('--evaldir', default=None, help='Directory to save evaluation', dest='evaldir') 59 | parser.add_argument('--model', help='File with weights for model', dest='model') 60 | parser.add_argument('--eps', default=0., help='Epsilon value', dest='eps', type=float) 61 | 62 | 63 | def main(): 64 | args = parser.parse_args() 65 | # ----- 66 | env = gym.make(args.game) 67 | if args.evaldir: 68 | env.monitor.start(args.evaldir) 69 | # ----- 70 | agent = ActingAgent(env.action_space) 71 | 72 | model_file = args.model 73 | epsilon = args.eps 74 | 75 | agent.action_value.load_weights(model_file) 76 | 77 | game = 1 78 | for _ in range(10): 79 | done = False 80 | episode_reward = 0 81 | noops = 0 82 | 83 | # init game 84 | observation = env.reset() 85 | agent.init_episode(observation) 86 | # play one game 87 | print('Game #%8d; ' % (game,), end='') 88 | while not done: 89 | env.render() 90 | action = agent.choose_action(observation, epsilon=epsilon) 91 | observation, reward, done, _ = env.step(action) 92 | episode_reward += reward 93 | # ---- 94 | if action == 0: 95 | noops += 1 96 | else: 97 | noops = 0 98 | if noops > 100: 99 | break 100 | print('Reward %4d; ' % (episode_reward,)) 101 | game += 1 102 | # ----- 103 | if args.evaldir: 104 | env.monitor.close() 105 | 106 | 107 | if __name__ == "__main__": 108 | main() 109 | -------------------------------------------------------------------------------- /q-learning-n-step/train.py: -------------------------------------------------------------------------------- 1 | from scipy.misc import imresize 2 | from skimage.color import rgb2gray 3 | from multiprocessing import * 4 | from collections import deque 5 | import gym 6 | import numpy as np 7 | import argparse 8 | 9 | # ----- 10 | parser = argparse.ArgumentParser(description='Training model') 11 | parser.add_argument('--game', default='Breakout-v0', help='OpenAI gym environment name', dest='game', type=str) 12 | parser.add_argument('--processes', default=4, help='Number of processes that generate experience for agent', 13 | dest='processes', type=int) 14 | parser.add_argument('--lr', default=0.0001, help='Learning rate', dest='learning_rate', type=float) 15 | parser.add_argument('--batch_size', default=20, help='Batch size to use during training', dest='batch_size', type=int) 16 | parser.add_argument('--swap_freq', default=10000, help='Number of frames before swapping network weights', 17 | dest='swap_freq', type=int) 18 | parser.add_argument('--checkpoint', default=0, help='Iteration to resume training', dest='checkpoint', type=int) 19 | parser.add_argument('--save_freq', default=250000, help='Number of frame before saving weights', dest='save_freq', 20 | type=int) 21 | parser.add_argument('--eps_decay', default=4000000, 22 | help='Number of frames needed to decay epsilon to the lowest value', dest='eps_decay', type=int) 23 | parser.add_argument('--lr_decay', default=80000000, 24 | help='Number of frames needed to decay lr to the lowest value', dest='lr_decay', type=int) 25 | parser.add_argument('--queue_size', default=256, help='Size of queue holding agent experience', dest='queue_size', 26 | type=int) 27 | parser.add_argument('--n_step', default=5, help='Number of steps in Q-learning', dest='n_step', type=int) 28 | parser.add_argument('--th_comp_fix', default=True, 29 | help='Sets different 
Theano compiledir for each process', dest='th_fix', type=bool) 30 | # ----- 31 | args = parser.parse_args() 32 | 33 | 34 | # ----- 35 | 36 | 37 | def build_network(input_shape, output_shape): 38 | from keras.models import Model 39 | from keras.layers import Input, Conv2D, Flatten, Dense 40 | 41 | x = Input(shape=input_shape) 42 | h = Conv2D(16, kernel_size=(8, 8), strides=(4, 4), activation='relu', data_format='channels_first')(x) 43 | h = Conv2D(32, kernel_size=(4, 4), strides=(2, 2), activation='relu', data_format='channels_first')(h) 44 | h = Flatten()(h) 45 | h = Dense(256, activation='relu')(h) 46 | v = Dense(output_shape, activation='linear')(h) 47 | return Model(inputs=x, outputs=v) 48 | 49 | 50 | # ----- 51 | 52 | class LearningAgent(object): 53 | def __init__(self, action_space, batch_size=32, screen=(84, 84), swap_freq=200): 54 | from keras.optimizers import RMSprop 55 | # ----- 56 | self.screen = screen 57 | self.input_depth = 1 58 | self.past_range = 3 59 | self.observation_shape = (self.input_depth * self.past_range,) + self.screen 60 | self.batch_size = batch_size 61 | 62 | self.action_value = build_network(self.observation_shape, action_space.n) 63 | self.action_value.compile(optimizer=RMSprop(clipnorm=1.), loss='mse') 64 | 65 | self.losses = deque(maxlen=25) 66 | self.q_values = deque(maxlen=25) 67 | self.swap_freq = swap_freq 68 | self.swap_counter = self.swap_freq 69 | self.unroll = np.arange(self.batch_size) 70 | self.frames = 0 71 | 72 | def learn(self, last_observations, actions, rewards, learning_rate=0.001): 73 | self.action_value.optimizer.lr.set_value(learning_rate) 74 | frames = len(last_observations) 75 | self.frames += frames 76 | # ----- 77 | targets = self.action_value.predict_on_batch(last_observations) 78 | # ----- 79 | targets[self.unroll, actions] = rewards 80 | # ----- 81 | loss = self.action_value.train_on_batch(last_observations, targets) 82 | self.losses.append(loss) 83 | self.q_values.append(np.mean(targets)) 84 | print('\rIter: %8d; Lr: %8.7f; Loss: %7.4f; Min: %7.4f; Max: %7.4f; Avg: %7.4f --- Q-value; Min: %7.4f; Max: %7.4f; Avg: %7.4f' % ( 85 | self.frames, learning_rate, loss, min(self.losses), max(self.losses), np.mean(self.losses), 86 | np.min(self.q_values), np.max(self.q_values), np.mean(self.q_values)), end='') 87 | self.swap_counter -= frames 88 | if self.swap_counter < 0: 89 | self.swap_counter += self.swap_freq 90 | return True 91 | return False 92 | 93 | 94 | def learn_proc(global_frame, mem_queue, weight_dict): 95 | import os 96 | pid = os.getpid() 97 | if args.th_fix: 98 | os.environ['THEANO_FLAGS'] = 'floatX=float32,device=gpu,nvcc.fastmath=False,lib.cnmem=0,' + \ 99 | 'compiledir=th_comp_learn' 100 | # ----- 101 | save_freq = args.save_freq 102 | learning_rate = args.learning_rate 103 | batch_size = args.batch_size 104 | checkpoint = args.checkpoint 105 | lr_decay = args.lr_decay 106 | # ----- 107 | env = gym.make(args.game) 108 | agent = LearningAgent(env.action_space, batch_size=args.batch_size, swap_freq=args.swap_freq) 109 | # ----- 110 | if checkpoint > 0: 111 | print(' %5d> Loading weights from file' % (pid,)) 112 | agent.action_value.load_weights('model-%d.h5' % (checkpoint,)) 113 | # ----- 114 | weight_dict['update'] = 0 115 | weight_dict['weights'] = agent.action_value.get_weights() 116 | print(' %5d> Setting weights in dict' % (pid,)) 117 | # ----- 118 | last_obs = np.zeros((batch_size,) + agent.observation_shape) 119 | actions = np.zeros(batch_size, dtype=np.int32) 120 | rewards = np.zeros(batch_size) 121 | # ----- 122 | 
idx = 0 123 | agent.frames = checkpoint 124 | save_counter = checkpoint % save_freq + save_freq 125 | while True: 126 | # ----- 127 | last_obs[idx, ...], actions[idx], rewards[idx] = mem_queue.get() 128 | idx = (idx + 1) % batch_size 129 | if idx == 0: 130 | lr = max(0.000000001, learning_rate * (1. - agent.frames / lr_decay)) 131 | updated = agent.learn(last_obs, actions, rewards, learning_rate=lr) 132 | global_frame.value = agent.frames 133 | if updated: 134 | # print(' %5d> Updating weights in dict' % (pid,)) 135 | weight_dict['weights'] = agent.action_value.get_weights() 136 | weight_dict['update'] += 1 137 | # ----- 138 | save_counter -= 1 139 | if save_counter % save_freq == 0: 140 | agent.action_value.save_weights('model-%d.h5' % (agent.frames,), overwrite=True) 141 | 142 | 143 | class ActingAgent(object): 144 | def __init__(self, action_space, screen=(84, 84), n_step=8, discount=0.99): 145 | from keras.optimizers import RMSprop 146 | # ----- 147 | self.screen = screen 148 | self.input_depth = 1 149 | self.past_range = 3 150 | self.observation_shape = (self.input_depth * self.past_range,) + self.screen 151 | 152 | self.action_value = build_network(self.observation_shape, action_space.n) 153 | self.action_value.compile(optimizer=RMSprop(clipnorm=1.), loss='mse') # clipnorm=1. 154 | 155 | self.action_space = action_space 156 | self.observations = np.zeros(self.observation_shape) 157 | self.last_observations = np.zeros_like(self.observations) 158 | # ----- 159 | self.n_step_observations = deque(maxlen=n_step) 160 | self.n_step_actions = deque(maxlen=n_step) 161 | self.n_step_rewards = deque(maxlen=n_step) 162 | self.n_step = n_step 163 | self.discount = discount 164 | self.counter = 0 165 | 166 | def init_episode(self, observation): 167 | for _ in range(self.past_range): 168 | self.save_observation(observation) 169 | 170 | def reset(self): 171 | self.counter = 0 172 | self.n_step_observations.clear() 173 | self.n_step_actions.clear() 174 | self.n_step_rewards.clear() 175 | 176 | def sars_data(self, action, reward, observation, terminal, mem_queue): 177 | self.save_observation(observation) 178 | reward = np.clip(reward, -1., 1.) 179 | # ----- 180 | self.n_step_observations.appendleft(self.last_observations) 181 | self.n_step_actions.appendleft(action) 182 | self.n_step_rewards.appendleft(reward) 183 | # ----- 184 | self.counter += 1 185 | if terminal or self.counter >= self.n_step: 186 | r = 0. 187 | if not terminal: 188 | r = np.max(self.action_value.predict(self.observations[None, ...])) 189 | for i in range(self.counter): 190 | r = self.n_step_rewards[i] + self.discount * r 191 | mem_queue.put((self.n_step_observations[i], self.n_step_actions[i], r)) 192 | self.reset() 193 | 194 | def choose_action(self, epsilon=0.0): 195 | if np.random.random() < epsilon: 196 | return self.action_space.sample() 197 | else: 198 | return np.argmax(self.action_value.predict(self.observations[None, ...])) 199 | 200 | def save_observation(self, observation): 201 | self.last_observations = self.observations[...] 202 | self.observations = np.roll(self.observations, -self.input_depth, axis=0) 203 | self.observations[-self.input_depth:, ...] = self.transform_screen(observation) 204 | 205 | def transform_screen(self, data): 206 | return rgb2gray(imresize(data, self.screen))[None, ...] 
207 | 208 | 209 | def generate_experience_proc(global_frame, mem_queue, weight_dict, no, epsilon): 210 | import os 211 | pid = os.getpid() 212 | if args.th_fix: 213 | os.environ['THEANO_FLAGS'] = 'floatX=float32,device=gpu,nvcc.fastmath=True,lib.cnmem=0,' + \ 214 | 'compiledir=th_comp_act_' + str(no) 215 | # ----- 216 | batch_size = args.batch_size 217 | # ----- 218 | print(' %5d> Process started with %6.3f' % (pid, epsilon)) 219 | # ----- 220 | env = gym.make(args.game) 221 | agent = ActingAgent(env.action_space, n_step=args.n_step) 222 | 223 | if args.checkpoint > 0: 224 | print(' %5d> Loaded weights from file' % (pid,)) 225 | agent.action_value.load_weights('model-%d.h5' % (args.checkpoint,)) 226 | else: 227 | import time 228 | while 'weights' not in weight_dict: 229 | time.sleep(0.1) 230 | agent.action_value.set_weights(weight_dict['weights']) 231 | print(' %5d> Loaded weights from dict' % (pid,)) 232 | 233 | best_score, last_update, frames = 0, 0, 0 234 | avg_score = deque(maxlen=20) 235 | stop_decay = global_frame.value > args.eps_decay 236 | 237 | while True: 238 | done = False 239 | episode_reward = 0 240 | last_op, op_count = 0, 0 241 | observation = env.reset() 242 | agent.init_episode(observation) 243 | 244 | # ----- 245 | while not done: 246 | frames += 1 247 | if not stop_decay: 248 | frame_tmp = global_frame.value 249 | decayed_epsilon = max(epsilon, epsilon + (1. - epsilon) * ( 250 | args.eps_decay - frame_tmp) / args.eps_decay) 251 | stop_decay = frame_tmp > args.eps_decay 252 | # ----- 253 | action = agent.choose_action(decayed_epsilon) 254 | observation, reward, done, _ = env.step(action) 255 | episode_reward += reward 256 | best_score = max(best_score, episode_reward) 257 | # ----- 258 | agent.sars_data(action, reward, observation, done, mem_queue) 259 | # ----- 260 | if action == last_op: 261 | op_count += 1 262 | else: 263 | op_count, last_op = 0, action 264 | # ----- 265 | if op_count > 100: 266 | agent.reset() # reset agent memory 267 | break 268 | # ----- 269 | if frames % 2000 == 0: 270 | print(' %5d> Epsilon: %9.6f; Best score: %4d; Avg: %9.3f' % ( 271 | pid, decayed_epsilon, best_score, np.mean(avg_score))) 272 | if frames % batch_size == 0: 273 | update = weight_dict.get('update', 0) 274 | if update > last_update: 275 | last_update = update 276 | # print(' %5d> Getting weights from dict' % (pid,)) 277 | agent.action_value.set_weights(weight_dict['weights']) 278 | # ----- 279 | avg_score.append(episode_reward) 280 | 281 | 282 | def init_worker(): 283 | import signal 284 | signal.signal(signal.SIGINT, signal.SIG_IGN) 285 | 286 | 287 | def main(): 288 | manager = Manager() 289 | weight_dict = manager.dict() 290 | global_frame = manager.Value('i', args.checkpoint) 291 | mem_queue = manager.Queue(args.queue_size) 292 | 293 | eps = [0.1, 0.01, 0.5] 294 | pool = Pool(args.processes + 1, init_worker) 295 | 296 | try: 297 | for i in range(args.processes): 298 | pool.apply_async(generate_experience_proc, 299 | args=(global_frame, mem_queue, weight_dict, i, eps[i % len(eps)])) 300 | 301 | pool.apply_async(learn_proc, args=(global_frame, mem_queue, weight_dict)) 302 | 303 | pool.close() 304 | pool.join() 305 | 306 | except KeyboardInterrupt: 307 | pool.terminate() 308 | pool.join() 309 | 310 | 311 | if __name__ == "__main__": 312 | main() 313 | --------------------------------------------------------------------------------