├── LICENSE
├── README.md
├── a3c.py
├── a3c_model.py
├── async_dqn.py
├── atari_environment.py
├── breakout.gif
├── model.py
└── resources
    ├── episode_reward.png
    └── max_q_value.png
/LICENSE:
--------------------------------------------------------------------------------
 1 | The MIT License (MIT)
 2 | 
 3 | Copyright (c) 2016 Corey Lynch
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Asynchronous RL in TensorFlow + Keras + OpenAI's Gym
 2 | 
 3 | ![](http://g.recordit.co/BeiqC9l70B.gif)
 4 | 
 5 | This is a TensorFlow + Keras implementation of asynchronous 1-step Q-learning as described in ["Asynchronous Methods for Deep Reinforcement Learning"](http://arxiv.org/pdf/1602.01783v1.pdf).
 6 | 
 7 | Since we're using multiple actor-learner threads to stabilize learning in place of experience replay (which is very memory intensive), this runs comfortably on a MacBook with 4 GB of RAM.
 8 | 
 9 | It uses Keras to define the deep Q-network (see model.py), OpenAI's gym library to interact with the Arcade Learning Environment (see atari_environment.py), and TensorFlow for optimization/execution (see async_dqn.py).
10 | 
11 | ## Requirements
12 | * [tensorflow](https://www.tensorflow.org/versions/r0.9/get_started/os_setup.html)
13 | * [gym](https://github.com/openai/gym#installation)
14 | * [gym's Atari environment](https://github.com/openai/gym#atari)
15 | * skimage
16 | * Keras
17 | 
18 | ## Usage
19 | ### Training
20 | To kick off training, run:
21 | ```
22 | python async_dqn.py --experiment breakout --game "Breakout-v0" --num_concurrent 8
23 | ```
24 | Here we're organizing the outputs for the current experiment under a folder called 'breakout', choosing "Breakout-v0" as our gym environment, and running 8 actor-learner threads concurrently. See [this page](https://gym.openai.com/envs#atari) for the full list of game names you can hand to --game.
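If you'd rather query that list programmatically, here's a quick sketch (assuming a classic gym release that exposes `gym.envs.registry.all()`; newer gym/gymnasium versions organize the registry differently):
```python
import gym

# Print every registered environment id; any of the Atari ones
# (e.g. "Pong-v0", "Breakout-v0") can be handed to --game.
for spec in gym.envs.registry.all():
    print(spec.id)
```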
25 | 
26 | ### Visualizing training with tensorboard
27 | We collect episode reward stats and max Q values that can be visualized with tensorboard by running the following:
28 | ```
29 | tensorboard --logdir /tmp/summaries/breakout
30 | ```
31 | This is what my per-episode reward and average max Q value curves looked like over the training period:
32 | ![](https://github.com/coreylynch/async-rl/blob/master/resources/episode_reward.png)
33 | ![](https://github.com/coreylynch/async-rl/blob/master/resources/max_q_value.png)
34 | 
35 | ### Evaluation
36 | To run a gym evaluation, set the testing flag to True and hand in a current checkpoint file:
37 | ```
38 | python async_dqn.py --experiment breakout --testing True --checkpoint_path /tmp/breakout.ckpt-2690000 --num_eval_episodes 100
39 | ```
40 | After the eval completes, we can upload our eval file to OpenAI's site as follows:
41 | ```python
42 | import gym
43 | gym.upload('/tmp/breakout/eval', api_key='YOUR_API_KEY')
44 | ```
45 | Now we can find the eval at https://gym.openai.com/evaluations/eval_uwwAN0U3SKSkocC0PJEwQ
46 | 
47 | ### Next Steps
48 | See a3c.py for a work-in-progress asynchronous advantage actor-critic (A3C) implementation.
49 | 
50 | ## Resources
51 | I found these super helpful as general background materials for deep RL:
52 | 
53 | * [David Silver's "Deep Reinforcement Learning" lecture](http://videolectures.net/rldm2015_silver_reinforcement_learning/)
54 | * [Nervana's Demystifying Deep Reinforcement Learning blog post](http://www.nervanasys.com/demystifying-deep-reinforcement-learning/)
55 | 
56 | ## Important notes
57 | * In the paper the authors mention "for asynchronous methods we average over the best 5 models from **50 experiments**". I overlooked this point when I was writing this, but I think it's important. These async methods seem to vary in performance a lot from run to run (at least in my implementation of them!). It's a good idea to run multiple seeded versions at the same time and average over their performance to get a clear picture of whether an architectural change actually helps. Likewise, don't get discouraged if you don't see good performance on your task right away; try rerunning the same code a few more times with different seeds (see the sketch below).
58 | * This repo has no affiliation with DeepMind or the authors; it was just a simple project I was using to learn TensorFlow. Feedback is highly appreciated.
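If you want to automate those repeated runs, here is a minimal sketch of a driver script (hypothetical, not part of this repo) that reruns async_dqn.py a few times under different experiment names so the resulting tensorboard curves can be compared or averaged:
```python
# run_repeats.py -- hypothetical helper, not part of this repo.
# Launches the same training command several times under different experiment
# names; compare or average the curves under /tmp/summaries/breakout_run*.
import subprocess

for run in range(5):
    subprocess.call([
        "python", "async_dqn.py",
        "--experiment", "breakout_run%d" % run,
        "--game", "Breakout-v0",
        "--num_concurrent", "8",
    ])
```
Each run writes to its own summary directory; stop a run manually (or lower --tmax) once it has trained long enough, since training otherwise runs for the full 80M timesteps.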
59 | -------------------------------------------------------------------------------- /a3c.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | from skimage.transform import resize 3 | from skimage.color import rgb2gray 4 | import threading 5 | import tensorflow as tf 6 | import sys 7 | import random 8 | import numpy as np 9 | import time 10 | import gym 11 | from keras import backend as K 12 | from keras.layers import Convolution2D, Flatten, Dense 13 | from collections import deque 14 | from a3c_model import build_policy_and_value_networks 15 | from keras import backend as K 16 | from atari_environment import AtariEnvironment 17 | 18 | # Path params 19 | EXPERIMENT_NAME = "breakout_a3c" 20 | SUMMARY_SAVE_PATH = "/Users/coreylynch/dev/async-rl/summaries/"+EXPERIMENT_NAME 21 | CHECKPOINT_SAVE_PATH = "/tmp/"+EXPERIMENT_NAME+".ckpt" 22 | CHECKPOINT_NAME = "/tmp/breakout_a3c.ckpt-5" 23 | CHECKPOINT_INTERVAL=5000 24 | SUMMARY_INTERVAL=5 25 | # TRAINING = False 26 | TRAINING = True 27 | 28 | SHOW_TRAINING = True 29 | # SHOW_TRAINING = False 30 | 31 | # Experiment params 32 | GAME = "Breakout-v0" 33 | ACTIONS = 3 34 | NUM_CONCURRENT = 8 35 | NUM_EPISODES = 20000 36 | 37 | AGENT_HISTORY_LENGTH = 4 38 | RESIZED_WIDTH = 84 39 | RESIZED_HEIGHT = 84 40 | 41 | # DQN Params 42 | GAMMA = 0.99 43 | 44 | # Optimization Params 45 | LEARNING_RATE = 0.00001 46 | 47 | #Shared global parameters 48 | T = 0 49 | TMAX = 80000000 50 | t_max = 32 51 | 52 | def sample_policy_action(num_actions, probs): 53 | """ 54 | Sample an action from an action probability distribution output by 55 | the policy network. 56 | """ 57 | # Subtract a tiny value from probabilities in order to avoid 58 | # "ValueError: sum(pvals[:-1]) > 1.0" in numpy.multinomial 59 | probs = probs - np.finfo(np.float32).epsneg 60 | 61 | histogram = np.random.multinomial(1, probs) 62 | action_index = int(np.nonzero(histogram)[0]) 63 | return action_index 64 | 65 | def actor_learner_thread(num, env, session, graph_ops, summary_ops, saver): 66 | # We use global shared counter T, and TMAX constant 67 | global TMAX, T 68 | 69 | # Unpack graph ops 70 | s, a, R, minimize, p_network, v_network = graph_ops 71 | 72 | # Unpack tensorboard summary stuff 73 | r_summary_placeholder, update_ep_reward, val_summary_placeholder, update_ep_val, summary_op = summary_ops 74 | 75 | # Wrap env with AtariEnvironment helper class 76 | env = AtariEnvironment(gym_env=env, resized_width=RESIZED_WIDTH, resized_height=RESIZED_HEIGHT, agent_history_length=AGENT_HISTORY_LENGTH) 77 | 78 | time.sleep(5*num) 79 | 80 | # Set up per-episode counters 81 | ep_reward = 0 82 | ep_avg_v = 0 83 | v_steps = 0 84 | ep_t = 0 85 | 86 | probs_summary_t = 0 87 | 88 | s_t = env.get_initial_state() 89 | terminal = False 90 | 91 | while T < TMAX: 92 | s_batch = [] 93 | past_rewards = [] 94 | a_batch = [] 95 | 96 | t = 0 97 | t_start = t 98 | 99 | while not (terminal or ((t - t_start) == t_max)): 100 | # Perform action a_t according to policy pi(a_t | s_t) 101 | probs = session.run(p_network, feed_dict={s: [s_t]})[0] 102 | action_index = sample_policy_action(ACTIONS, probs) 103 | a_t = np.zeros([ACTIONS]) 104 | a_t[action_index] = 1 105 | 106 | if probs_summary_t % 100 == 0: 107 | print "P, ", np.max(probs), "V ", session.run(v_network, feed_dict={s: [s_t]})[0][0] 108 | 109 | s_batch.append(s_t) 110 | a_batch.append(a_t) 111 | 112 | s_t1, r_t, terminal, info = env.step(action_index) 113 | ep_reward += r_t 114 | 115 | r_t = np.clip(r_t, -1, 1) 
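            # Reward clipping to [-1, 1] (as in Mnih et al.) keeps gradient
            # magnitudes comparable across games; note that ep_reward above
            # accumulates the raw, unclipped reward for logging.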
116 | past_rewards.append(r_t) 117 | 118 | t += 1 119 | T += 1 120 | ep_t += 1 121 | probs_summary_t += 1 122 | 123 | s_t = s_t1 124 | 125 | if terminal: 126 | R_t = 0 127 | else: 128 | R_t = session.run(v_network, feed_dict={s: [s_t]})[0][0] # Bootstrap from last state 129 | 130 | R_batch = np.zeros(t) 131 | for i in reversed(range(t_start, t)): 132 | R_t = past_rewards[i] + GAMMA * R_t 133 | R_batch[i] = R_t 134 | 135 | session.run(minimize, feed_dict={R : R_batch, 136 | a : a_batch, 137 | s : s_batch}) 138 | 139 | # Save progress every 5000 iterations 140 | if T % CHECKPOINT_INTERVAL == 0: 141 | saver.save(session, CHECKPOINT_SAVE_PATH, global_step = T) 142 | 143 | if terminal: 144 | # Episode ended, collect stats and reset game 145 | session.run(update_ep_reward, feed_dict={r_summary_placeholder: ep_reward}) 146 | print "THREAD:", num, "/ TIME", T, "/ REWARD", ep_reward 147 | s_t = env.get_initial_state() 148 | terminal = False 149 | # Reset per-episode counters 150 | ep_reward = 0 151 | ep_t = 0 152 | 153 | def build_graph(): 154 | # Create shared global policy and value networks 155 | s, p_network, v_network, p_params, v_params = build_policy_and_value_networks(num_actions=ACTIONS, agent_history_length=AGENT_HISTORY_LENGTH, resized_width=RESIZED_WIDTH, resized_height=RESIZED_HEIGHT) 156 | 157 | # Shared global optimizer 158 | optimizer = tf.train.AdamOptimizer(LEARNING_RATE) 159 | 160 | # Op for applying remote gradients 161 | R_t = tf.placeholder("float", [None]) 162 | a_t = tf.placeholder("float", [None, ACTIONS]) 163 | log_prob = tf.log(tf.reduce_sum(p_network * a_t, reduction_indices=1)) 164 | p_loss = -log_prob * (R_t - v_network) 165 | v_loss = tf.reduce_mean(tf.square(R_t - v_network)) 166 | 167 | total_loss = p_loss + (0.5 * v_loss) 168 | 169 | minimize = optimizer.minimize(total_loss) 170 | return s, a_t, R_t, minimize, p_network, v_network 171 | 172 | # Set up some episode summary ops to visualize on tensorboard. 173 | def setup_summaries(): 174 | episode_reward = tf.Variable(0.) 175 | tf.summary.scalar("Episode Reward", episode_reward) 176 | r_summary_placeholder = tf.placeholder("float") 177 | update_ep_reward = episode_reward.assign(r_summary_placeholder) 178 | ep_avg_v = tf.Variable(0.) 
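    # Note: update_ep_val (returned below) is never run by actor_learner_thread,
    # so the "Episode Value" scalar stays at its initial value unless you feed
    # it episode value stats yourself.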
179 | tf.summary.scalar("Episode Value", ep_avg_v) 180 | val_summary_placeholder = tf.placeholder("float") 181 | update_ep_val = ep_avg_v.assign(val_summary_placeholder) 182 | summary_op = tf.summary.merge_all() 183 | return r_summary_placeholder, update_ep_reward, val_summary_placeholder, update_ep_val, summary_op 184 | 185 | def train(session, graph_ops, saver): 186 | # Set up game environments (one per thread) 187 | envs = [gym.make(GAME) for i in range(NUM_CONCURRENT)] 188 | 189 | summary_ops = setup_summaries() 190 | summary_op = summary_ops[-1] 191 | 192 | # Initialize variables 193 | session.run(tf.global_variables_initializer()) 194 | writer = tf.summary.FileWriter(SUMMARY_SAVE_PATH, session.graph) 195 | 196 | # Start NUM_CONCURRENT training threads 197 | actor_learner_threads = [threading.Thread(target=actor_learner_thread, args=(thread_id, envs[thread_id], session, graph_ops, summary_ops, saver)) for thread_id in range(NUM_CONCURRENT)] 198 | for t in actor_learner_threads: 199 | t.start() 200 | 201 | # Show the agents training and write summary statistics 202 | last_summary_time = 0 203 | while True: 204 | if SHOW_TRAINING: 205 | for env in envs: 206 | env.render() 207 | now = time.time() 208 | if now - last_summary_time > SUMMARY_INTERVAL: 209 | summary_str = session.run(summary_op) 210 | writer.add_summary(summary_str, float(T)) 211 | last_summary_time = now 212 | for t in actor_learner_threads: 213 | t.join() 214 | 215 | def evaluation(session, graph_ops, saver): 216 | saver.restore(session, CHECKPOINT_NAME) 217 | print "Restored model weights from ", CHECKPOINT_NAME 218 | monitor_env = gym.make(GAME) 219 | monitor_env.monitor.start('/tmp/'+EXPERIMENT_NAME+"/eval") 220 | 221 | # Unpack graph ops 222 | s, a_t, R_t, minimize, p_network, v_network = graph_ops 223 | 224 | # Wrap env with AtariEnvironment helper class 225 | env = AtariEnvironment(gym_env=monitor_env, resized_width=RESIZED_WIDTH, resized_height=RESIZED_HEIGHT, agent_history_length=AGENT_HISTORY_LENGTH) 226 | 227 | for i_episode in xrange(100): 228 | s_t = env.get_initial_state() 229 | ep_reward = 0 230 | terminal = False 231 | while not terminal: 232 | monitor_env.render() 233 | # Forward the deep q network, get Q(s,a) values 234 | probs = p_network.eval(session = session, feed_dict = {s : [s_t]})[0] 235 | action_index = sample_policy_action(ACTIONS, probs) 236 | s_t1, r_t, terminal, info = env.step(action_index) 237 | s_t = s_t1 238 | ep_reward += r_t 239 | print ep_reward 240 | monitor_env.monitor.close() 241 | 242 | def main(_): 243 | g = tf.Graph() 244 | with g.as_default(), tf.Session() as session: 245 | K.set_session(session) 246 | graph_ops = build_graph() 247 | saver = tf.train.Saver() 248 | 249 | if TRAINING: 250 | train(session, graph_ops, saver) 251 | else: 252 | evaluation(session, graph_ops, saver) 253 | 254 | if __name__ == "__main__": 255 | tf.app.run() 256 | -------------------------------------------------------------------------------- /a3c_model.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | from keras import backend as K 3 | from keras.layers import Convolution2D, Flatten, Dense, Input 4 | from keras.models import Model 5 | 6 | def build_policy_and_value_networks(num_actions, agent_history_length, resized_width, resized_height): 7 | with tf.device("/cpu:0"): 8 | state = tf.placeholder("float", [None, agent_history_length, resized_width, resized_height]) 9 | 10 | inputs = Input(shape=(agent_history_length, resized_width, 
resized_height,)) 11 | shared = Convolution2D(name="conv1", nb_filter=16, nb_row=8, nb_col=8, subsample=(4,4), activation='relu', border_mode='same')(inputs) 12 | shared = Convolution2D(name="conv2", nb_filter=32, nb_row=4, nb_col=4, subsample=(2,2), activation='relu', border_mode='same')(shared) 13 | shared = Flatten()(shared) 14 | shared = Dense(name="h1", output_dim=256, activation='relu')(shared) 15 | 16 | action_probs = Dense(name="p", output_dim=num_actions, activation='softmax')(shared) 17 | 18 | state_value = Dense(name="v", output_dim=1, activation='linear')(shared) 19 | 20 | policy_network = Model(input=inputs, output=action_probs) 21 | value_network = Model(input=inputs, output=state_value) 22 | 23 | p_params = policy_network.trainable_weights 24 | v_params = value_network.trainable_weights 25 | 26 | p_out = policy_network(state) 27 | v_out = value_network(state) 28 | 29 | return state, p_out, v_out, p_params, v_params -------------------------------------------------------------------------------- /async_dqn.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import os 3 | os.environ["KERAS_BACKEND"] = "tensorflow" 4 | 5 | from skimage.transform import resize 6 | from skimage.color import rgb2gray 7 | from atari_environment import AtariEnvironment 8 | import threading 9 | import tensorflow as tf 10 | import sys 11 | import random 12 | import numpy as np 13 | import time 14 | import gym 15 | from keras import backend as K 16 | from model import build_network 17 | from keras import backend as K 18 | 19 | flags = tf.app.flags 20 | 21 | flags.DEFINE_string('experiment', 'dqn_breakout', 'Name of the current experiment') 22 | flags.DEFINE_string('game', 'Breakout-v0', 'Name of the atari game to play. Full list here: https://gym.openai.com/envs#atari') 23 | flags.DEFINE_integer('num_concurrent', 8, 'Number of concurrent actor-learner threads to use during training.') 24 | flags.DEFINE_integer('tmax', 80000000, 'Number of training timesteps.') 25 | flags.DEFINE_integer('resized_width', 84, 'Scale screen to this width.') 26 | flags.DEFINE_integer('resized_height', 84, 'Scale screen to this height.') 27 | flags.DEFINE_integer('agent_history_length', 4, 'Use this number of recent screens as the environment state.') 28 | flags.DEFINE_integer('network_update_frequency', 32, 'Frequency with which each actor learner thread does an async gradient update') 29 | flags.DEFINE_integer('target_network_update_frequency', 10000, 'Reset the target network every n timesteps') 30 | flags.DEFINE_float('learning_rate', 0.0001, 'Initial learning rate.') 31 | flags.DEFINE_float('gamma', 0.99, 'Reward discount rate.') 32 | flags.DEFINE_integer('anneal_epsilon_timesteps', 1000000, 'Number of timesteps to anneal epsilon.') 33 | flags.DEFINE_string('summary_dir', '/tmp/summaries', 'Directory for storing tensorboard summaries') 34 | flags.DEFINE_string('checkpoint_dir', '/tmp/checkpoints', 'Directory for storing model checkpoints') 35 | flags.DEFINE_integer('summary_interval', 5, 36 | 'Save training summary to file every n seconds (rounded ' 37 | 'up to statistics interval.') 38 | flags.DEFINE_integer('checkpoint_interval', 600, 39 | 'Checkpoint the model (i.e. 
save the parameters) every n ' 40 | 'seconds (rounded up to statistics interval.') 41 | flags.DEFINE_boolean('show_training', True, 'If true, have gym render evironments during training') 42 | flags.DEFINE_boolean('testing', False, 'If true, run gym evaluation') 43 | flags.DEFINE_string('checkpoint_path', 'path/to/recent.ckpt', 'Path to recent checkpoint to use for evaluation') 44 | flags.DEFINE_string('eval_dir', '/tmp/', 'Directory to store gym evaluation') 45 | flags.DEFINE_integer('num_eval_episodes', 100, 'Number of episodes to run gym evaluation.') 46 | FLAGS = flags.FLAGS 47 | T = 0 48 | TMAX = FLAGS.tmax 49 | 50 | def sample_final_epsilon(): 51 | """ 52 | Sample a final epsilon value to anneal towards from a distribution. 53 | These values are specified in section 5.1 of http://arxiv.org/pdf/1602.01783v1.pdf 54 | """ 55 | final_epsilons = np.array([.1,.01,.5]) 56 | probabilities = np.array([0.4,0.3,0.3]) 57 | return np.random.choice(final_epsilons, 1, p=list(probabilities))[0] 58 | 59 | def actor_learner_thread(thread_id, env, session, graph_ops, num_actions, summary_ops, saver): 60 | """ 61 | Actor-learner thread implementing asynchronous one-step Q-learning, as specified 62 | in algorithm 1 here: http://arxiv.org/pdf/1602.01783v1.pdf. 63 | """ 64 | global TMAX, T 65 | 66 | # Unpack graph ops 67 | s = graph_ops["s"] 68 | q_values = graph_ops["q_values"] 69 | st = graph_ops["st"] 70 | target_q_values = graph_ops["target_q_values"] 71 | reset_target_network_params = graph_ops["reset_target_network_params"] 72 | a = graph_ops["a"] 73 | y = graph_ops["y"] 74 | grad_update = graph_ops["grad_update"] 75 | 76 | summary_placeholders, update_ops, summary_op = summary_ops 77 | 78 | # Wrap env with AtariEnvironment helper class 79 | env = AtariEnvironment(gym_env=env, resized_width=FLAGS.resized_width, resized_height=FLAGS.resized_height, agent_history_length=FLAGS.agent_history_length) 80 | 81 | # Initialize network gradients 82 | s_batch = [] 83 | a_batch = [] 84 | y_batch = [] 85 | 86 | final_epsilon = sample_final_epsilon() 87 | initial_epsilon = 1.0 88 | epsilon = 1.0 89 | 90 | print "Starting thread ", thread_id, "with final epsilon ", final_epsilon 91 | 92 | time.sleep(3*thread_id) 93 | t = 0 94 | while T < TMAX: 95 | # Get initial game observation 96 | s_t = env.get_initial_state() 97 | terminal = False 98 | 99 | # Set up per-episode counters 100 | ep_reward = 0 101 | episode_ave_max_q = 0 102 | ep_t = 0 103 | 104 | while True: 105 | # Forward the deep q network, get Q(s,a) values 106 | readout_t = q_values.eval(session = session, feed_dict = {s : [s_t]}) 107 | 108 | # Choose next action based on e-greedy policy 109 | a_t = np.zeros([num_actions]) 110 | action_index = 0 111 | if random.random() <= epsilon: 112 | action_index = random.randrange(num_actions) 113 | else: 114 | action_index = np.argmax(readout_t) 115 | a_t[action_index] = 1 116 | 117 | # Scale down epsilon 118 | if epsilon > final_epsilon: 119 | epsilon -= (initial_epsilon - final_epsilon) / FLAGS.anneal_epsilon_timesteps 120 | 121 | # Gym excecutes action in game environment on behalf of actor-learner 122 | s_t1, r_t, terminal, info = env.step(action_index) 123 | 124 | # Accumulate gradients 125 | readout_j1 = target_q_values.eval(session = session, feed_dict = {st : [s_t1]}) 126 | clipped_r_t = np.clip(r_t, -1, 1) 127 | if terminal: 128 | y_batch.append(clipped_r_t) 129 | else: 130 | y_batch.append(clipped_r_t + FLAGS.gamma * np.max(readout_j1)) 131 | 132 | a_batch.append(a_t) 133 | s_batch.append(s_t) 134 | 135 | # 
Update the state and counters 136 | s_t = s_t1 137 | T += 1 138 | t += 1 139 | 140 | ep_t += 1 141 | ep_reward += r_t 142 | episode_ave_max_q += np.max(readout_t) 143 | 144 | # Optionally update target network 145 | if T % FLAGS.target_network_update_frequency == 0: 146 | session.run(reset_target_network_params) 147 | 148 | # Optionally update online network 149 | if t % FLAGS.network_update_frequency == 0 or terminal: 150 | if s_batch: 151 | session.run(grad_update, feed_dict = {y : y_batch, 152 | a : a_batch, 153 | s : s_batch}) 154 | # Clear gradients 155 | s_batch = [] 156 | a_batch = [] 157 | y_batch = [] 158 | 159 | # Save model progress 160 | if t % FLAGS.checkpoint_interval == 0: 161 | saver.save(session, FLAGS.checkpoint_dir+"/"+FLAGS.experiment+".ckpt", global_step = t) 162 | 163 | # Print end of episode stats 164 | if terminal: 165 | stats = [ep_reward, episode_ave_max_q/float(ep_t), epsilon] 166 | for i in range(len(stats)): 167 | session.run(update_ops[i], feed_dict={summary_placeholders[i]:float(stats[i])}) 168 | print "THREAD:", thread_id, "/ TIME", T, "/ TIMESTEP", t, "/ EPSILON", epsilon, "/ REWARD", ep_reward, "/ Q_MAX %.4f" % (episode_ave_max_q/float(ep_t)), "/ EPSILON PROGRESS", t/float(FLAGS.anneal_epsilon_timesteps) 169 | break 170 | 171 | def build_graph(num_actions): 172 | # Create shared deep q network 173 | s, q_network = build_network(num_actions=num_actions, agent_history_length=FLAGS.agent_history_length, 174 | resized_width=FLAGS.resized_width, resized_height=FLAGS.resized_height, name_scope="q-network") 175 | network_params = q_network.trainable_weights 176 | q_values = q_network(s) 177 | 178 | # Create shared target network 179 | st, target_q_network = build_network(num_actions=num_actions, agent_history_length=FLAGS.agent_history_length, 180 | resized_width=FLAGS.resized_width, resized_height=FLAGS.resized_height, name_scope="target-network") 181 | target_network_params = target_q_network.trainable_weights 182 | target_q_values = target_q_network(st) 183 | 184 | # Op for periodically updating target network with online network weights 185 | reset_target_network_params = [target_network_params[i].assign(network_params[i]) for i in range(len(target_network_params))] 186 | 187 | # Define cost and gradient update op 188 | a = tf.placeholder("float", [None, num_actions]) 189 | y = tf.placeholder("float", [None]) 190 | action_q_values = tf.reduce_sum(tf.multiply(q_values, a), reduction_indices=1) 191 | cost = tf.reduce_mean(tf.square(y - action_q_values)) 192 | optimizer = tf.train.AdamOptimizer(FLAGS.learning_rate) 193 | grad_update = optimizer.minimize(cost, var_list=network_params) 194 | 195 | graph_ops = {"s" : s, 196 | "q_values" : q_values, 197 | "st" : st, 198 | "target_q_values" : target_q_values, 199 | "reset_target_network_params" : reset_target_network_params, 200 | "a" : a, 201 | "y" : y, 202 | "grad_update" : grad_update} 203 | 204 | return graph_ops 205 | 206 | # Set up some episode summary ops to visualize on tensorboard. 207 | def setup_summaries(): 208 | episode_reward = tf.Variable(0.) 209 | tf.summary.scalar("Episode_Reward", episode_reward) 210 | episode_ave_max_q = tf.Variable(0.) 211 | tf.summary.scalar("Max_Q_Value", episode_ave_max_q) 212 | logged_epsilon = tf.Variable(0.) 213 | tf.summary.scalar("Epsilon", logged_epsilon) 214 | logged_T = tf.Variable(0.) 
215 | summary_vars = [episode_reward, episode_ave_max_q, logged_epsilon] 216 | summary_placeholders = [tf.placeholder("float") for i in range(len(summary_vars))] 217 | update_ops = [summary_vars[i].assign(summary_placeholders[i]) for i in range(len(summary_vars))] 218 | summary_op = tf.summary.merge_all() 219 | return summary_placeholders, update_ops, summary_op 220 | 221 | def get_num_actions(): 222 | """ 223 | Returns the number of possible actions for the given atari game 224 | """ 225 | # Figure out number of actions from gym env 226 | env = gym.make(FLAGS.game) 227 | num_actions = env.action_space.n 228 | if (FLAGS.game == "Pong-v0" or FLAGS.game == "Breakout-v0"): 229 | # Gym currently specifies 6 actions for pong 230 | # and breakout when only 3 are needed. This 231 | # is a lame workaround. 232 | num_actions = 3 233 | return num_actions 234 | 235 | def train(session, graph_ops, num_actions, saver): 236 | # Set up game environments (one per thread) 237 | envs = [gym.make(FLAGS.game) for i in range(FLAGS.num_concurrent)] 238 | 239 | summary_ops = setup_summaries() 240 | summary_op = summary_ops[-1] 241 | 242 | # Initialize variables 243 | session.run(tf.global_variables_initializer()) 244 | # Initialize target network weights 245 | session.run(graph_ops["reset_target_network_params"]) 246 | summary_save_path = FLAGS.summary_dir + "/" + FLAGS.experiment 247 | writer = tf.summary.FileWriter(summary_save_path, session.graph) 248 | if not os.path.exists(FLAGS.checkpoint_dir): 249 | os.makedirs(FLAGS.checkpoint_dir) 250 | 251 | # Start num_concurrent actor-learner training threads 252 | 253 | if(FLAGS.num_concurrent==1): # for debug 254 | actor_learner_thread(0, envs[0], session, graph_ops, num_actions, summary_ops, saver) 255 | else: 256 | actor_learner_threads = [threading.Thread(target=actor_learner_thread, args=(thread_id, envs[thread_id], session, graph_ops, num_actions, summary_ops, saver)) for thread_id in range(FLAGS.num_concurrent)] 257 | for t in actor_learner_threads: 258 | t.start() 259 | 260 | # Show the agents training and write summary statistics 261 | last_summary_time = 0 262 | while True: 263 | if FLAGS.show_training: 264 | for env in envs: 265 | env.render() 266 | now = time.time() 267 | if now - last_summary_time > FLAGS.summary_interval: 268 | summary_str = session.run(summary_op) 269 | writer.add_summary(summary_str, float(T)) 270 | last_summary_time = now 271 | for t in actor_learner_threads: 272 | t.join() 273 | 274 | def evaluation(session, graph_ops, saver): 275 | saver.restore(session, FLAGS.checkpoint_path) 276 | print "Restored model weights from ", FLAGS.checkpoint_path 277 | monitor_env = gym.make(FLAGS.game) 278 | gym.wrappers.Monitor(monitor_env, FLAGS.eval_dir+"/"+FLAGS.experiment+"/eval") 279 | 280 | # Unpack graph ops 281 | s = graph_ops["s"] 282 | q_values = graph_ops["q_values"] 283 | 284 | # Wrap env with AtariEnvironment helper class 285 | env = AtariEnvironment(gym_env=monitor_env, resized_width=FLAGS.resized_width, resized_height=FLAGS.resized_height, agent_history_length=FLAGS.agent_history_length) 286 | 287 | for i_episode in xrange(FLAGS.num_eval_episodes): 288 | s_t = env.get_initial_state() 289 | ep_reward = 0 290 | terminal = False 291 | while not terminal: 292 | monitor_env.render() 293 | readout_t = q_values.eval(session = session, feed_dict = {s : [s_t]}) 294 | action_index = np.argmax(readout_t) 295 | print "action",action_index 296 | s_t1, r_t, terminal, info = env.step(action_index) 297 | s_t = s_t1 298 | ep_reward += r_t 299 | print 
ep_reward 300 | monitor_env.monitor.close() 301 | 302 | def main(_): 303 | g = tf.Graph() 304 | session = tf.Session(graph=g) 305 | with g.as_default(), session.as_default(): 306 | K.set_session(session) 307 | num_actions = get_num_actions() 308 | graph_ops = build_graph(num_actions) 309 | saver = tf.train.Saver() 310 | 311 | if FLAGS.testing: 312 | evaluation(session, graph_ops, saver) 313 | else: 314 | train(session, graph_ops, num_actions, saver) 315 | 316 | if __name__ == "__main__": 317 | tf.app.run() 318 | -------------------------------------------------------------------------------- /atari_environment.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | from skimage.transform import resize 3 | from skimage.color import rgb2gray 4 | import numpy as np 5 | from collections import deque 6 | 7 | class AtariEnvironment(object): 8 | """ 9 | Small wrapper for gym atari environments. 10 | Responsible for preprocessing screens and holding on to a screen buffer 11 | of size agent_history_length from which environment state 12 | is constructed. 13 | """ 14 | def __init__(self, gym_env, resized_width, resized_height, agent_history_length): 15 | self.env = gym_env 16 | self.resized_width = resized_width 17 | self.resized_height = resized_height 18 | self.agent_history_length = agent_history_length 19 | 20 | self.gym_actions = range(gym_env.action_space.n) 21 | if (gym_env.spec.id == "Pong-v0" or gym_env.spec.id == "Breakout-v0"): 22 | print "Doing workaround for pong or breakout" 23 | # Gym returns 6 possible actions for breakout and pong. 24 | # Only three are used, the rest are no-ops. This just lets us 25 | # pick from a simplified "LEFT", "RIGHT", "NOOP" action space. 26 | self.gym_actions = [1,2,3] 27 | 28 | # Screen buffer of size AGENT_HISTORY_LENGTH to be able 29 | # to build state arrays of size [1, AGENT_HISTORY_LENGTH, width, height] 30 | self.state_buffer = deque() 31 | 32 | def get_initial_state(self): 33 | """ 34 | Resets the atari game, clears the state buffer 35 | """ 36 | # Clear the state buffer 37 | self.state_buffer = deque() 38 | 39 | x_t = self.env.reset() 40 | x_t = self.get_preprocessed_frame(x_t) 41 | s_t = np.stack((x_t, x_t, x_t, x_t), axis = 0) 42 | 43 | for i in range(self.agent_history_length-1): 44 | self.state_buffer.append(x_t) 45 | return s_t 46 | 47 | def get_preprocessed_frame(self, observation): 48 | """ 49 | See Methods->Preprocessing in Mnih et al. 50 | 1) Get image grayscale 51 | 2) Rescale image 52 | """ 53 | return resize(rgb2gray(observation), (self.resized_width, self.resized_height)) 54 | 55 | def step(self, action_index): 56 | """ 57 | Excecutes an action in the gym environment. 58 | Builds current state (concatenation of agent_history_length-1 previous frames and current one). 59 | Pops oldest frame, adds current frame to the state buffer. 60 | Returns current state. 61 | """ 62 | 63 | x_t1, r_t, terminal, info = self.env.step(self.gym_actions[action_index]) 64 | x_t1 = self.get_preprocessed_frame(x_t1) 65 | 66 | previous_frames = np.array(self.state_buffer) 67 | s_t1 = np.empty((self.agent_history_length, self.resized_height, self.resized_width)) 68 | s_t1[:self.agent_history_length-1, ...] 
= previous_frames 69 | s_t1[self.agent_history_length-1] = x_t1 70 | 71 | # Pop the oldest frame, add the current frame to the queue 72 | self.state_buffer.popleft() 73 | self.state_buffer.append(x_t1) 74 | 75 | return s_t1, r_t, terminal, info 76 | -------------------------------------------------------------------------------- /breakout.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/coreylynch/async-rl/1741d52ce1cbab066c6af2c645865473c422db55/breakout.gif -------------------------------------------------------------------------------- /model.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | from keras import backend as K 3 | from keras.layers import Conv2D, Flatten, Dense, Input 4 | from keras.models import Model 5 | 6 | def build_network(num_actions, agent_history_length, resized_width, resized_height, name_scope): 7 | with tf.device("/cpu:0"): 8 | with tf.name_scope(name_scope): 9 | state = tf.placeholder(tf.float32, [None, agent_history_length, resized_width, resized_height], name="state") 10 | inputs = Input(shape=(agent_history_length, resized_width, resized_height,)) 11 | model = Conv2D(filters=16, kernel_size=(8,8), strides=(4,4), activation='relu', padding='same', data_format='channels_first')(inputs) 12 | model = Conv2D(filters=32, kernel_size=(4,4), strides=(2,2), activation='relu', padding='same', data_format='channels_first')(model) 13 | #model = Conv2D(filter=64, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same')(model) 14 | model = Flatten()(model) 15 | model = Dense(256, activation='relu')(model) 16 | print model 17 | q_values = Dense(num_actions)(model) 18 | 19 | #UserWarning: Update your `Model` call to the Keras 2 API: 20 | # `Model(outputs=Tensor("de..., inputs=Tensor("in.. 21 | m = Model(inputs=inputs, outputs=q_values) 22 | 23 | return state, m -------------------------------------------------------------------------------- /resources/episode_reward.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/coreylynch/async-rl/1741d52ce1cbab066c6af2c645865473c422db55/resources/episode_reward.png -------------------------------------------------------------------------------- /resources/max_q_value.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/coreylynch/async-rl/1741d52ce1cbab066c6af2c645865473c422db55/resources/max_q_value.png --------------------------------------------------------------------------------
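For reference, here is a minimal, hypothetical smoke test for `model.build_network` (not part of the repo; assumes the TF 1.x + Keras 2 setup the code above targets):
```python
import numpy as np
import tensorflow as tf
from keras import backend as K
from model import build_network

with tf.Session() as sess:
    K.set_session(sess)
    # Same arguments async_dqn.py uses for Breakout: 3 actions, 4-frame history, 84x84 screens.
    state, q_net = build_network(num_actions=3, agent_history_length=4,
                                 resized_width=84, resized_height=84,
                                 name_scope="q-network")
    q_values = q_net(state)
    sess.run(tf.global_variables_initializer())
    dummy = np.zeros((1, 4, 84, 84), dtype=np.float32)
    print(q_values.eval(feed_dict={state: dummy}).shape)  # expect (1, 3)
```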