├── LICENSE
├── README.md
├── a3c.py
├── a3c_model.py
├── async_dqn.py
├── atari_environment.py
├── breakout.gif
├── model.py
└── resources
    ├── episode_reward.png
    └── max_q_value.png
/LICENSE:
--------------------------------------------------------------------------------
 1 | The MIT License (MIT)
 2 | 
 3 | Copyright (c) 2016 Corey Lynch
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Asynchronous RL in TensorFlow + Keras + OpenAI's Gym
 2 | 
 3 | ![](http://g.recordit.co/BeiqC9l70B.gif)
 4 | 
 5 | This is a TensorFlow + Keras implementation of asynchronous 1-step Q-learning as described in ["Asynchronous Methods for Deep Reinforcement Learning"](http://arxiv.org/pdf/1602.01783v1.pdf).
 6 | 
 7 | Since we're using multiple actor-learner threads to stabilize learning in place of experience replay (which is very memory intensive), this runs comfortably on a MacBook with 4 GB of RAM.
 8 | 
 9 | It uses Keras to define the deep Q-network (see model.py), OpenAI's gym library to interact with the Arcade Learning Environment (see atari_environment.py), and TensorFlow for optimization/execution (see async_dqn.py).
10 | 
11 | ## Requirements
12 | * [tensorflow](https://www.tensorflow.org/versions/r0.9/get_started/os_setup.html)
13 | * [gym](https://github.com/openai/gym#installation)
14 | * [gym's Atari environment](https://github.com/openai/gym#atari)
15 | * skimage
16 | * Keras
17 | 
18 | ## Usage
19 | ### Training
20 | To kick off training, run:
21 | ```
22 | python async_dqn.py --experiment breakout --game "Breakout-v0" --num_concurrent 8
23 | ```
24 | Here we're organizing the outputs for the current experiment under a folder called 'breakout', choosing "Breakout-v0" as our gym environment, and running 8 actor-learner threads concurrently. See [this page](https://gym.openai.com/envs#atari) for the full list of game names you can hand to --game.
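If you'd rather query that list programmatically, here's a quick sketch (assuming a classic gym release that exposes `gym.envs.registry.all()`; newer gym/gymnasium versions organize the registry differently):
```python
import gym

# Print every registered environment id; any of the Atari ones
# (e.g. "Pong-v0", "Breakout-v0") can be handed to --game.
for spec in gym.envs.registry.all():
    print(spec.id)
```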
25 | 
26 | ### Visualizing training with tensorboard
27 | We collect episode reward stats and max Q values that can be visualized with tensorboard by running the following:
28 | ```
29 | tensorboard --logdir /tmp/summaries/breakout
30 | ```
31 | This is what my per-episode reward and average max Q value curves looked like over the training period:
32 | ![](https://github.com/coreylynch/async-rl/blob/master/resources/episode_reward.png)
33 | ![](https://github.com/coreylynch/async-rl/blob/master/resources/max_q_value.png)
34 | 
35 | ### Evaluation
36 | To run a gym evaluation, set the testing flag to True and hand in a current checkpoint file:
37 | ```
38 | python async_dqn.py --experiment breakout --testing True --checkpoint_path /tmp/breakout.ckpt-2690000 --num_eval_episodes 100
39 | ```
40 | After the eval completes, we can upload our eval file to OpenAI's site as follows:
41 | ```python
42 | import gym
43 | gym.upload('/tmp/breakout/eval', api_key='YOUR_API_KEY')
44 | ```
45 | Now we can find the eval at https://gym.openai.com/evaluations/eval_uwwAN0U3SKSkocC0PJEwQ
46 | 
47 | ### Next Steps
48 | See a3c.py for a work-in-progress asynchronous advantage actor-critic (A3C) implementation.
49 | 
50 | ## Resources
51 | I found these super helpful as general background materials for deep RL:
52 | 
53 | * [David Silver's "Deep Reinforcement Learning" lecture](http://videolectures.net/rldm2015_silver_reinforcement_learning/)
54 | * [Nervana's Demystifying Deep Reinforcement Learning blog post](http://www.nervanasys.com/demystifying-deep-reinforcement-learning/)
55 | 
56 | ## Important notes
57 | * In the paper the authors mention "for asynchronous methods we average over the best 5 models from **50 experiments**". I overlooked this point when I was writing this, but I think it's important. These async methods seem to vary in performance a lot from run to run (at least in my implementation of them!). It's a good idea to run multiple seeded versions at the same time and average over their performance to get a clear picture of whether an architectural change actually helps. Likewise, don't get discouraged if you don't see good performance on your task right away; try rerunning the same code a few more times with different seeds (see the sketch below).
58 | * This repo has no affiliation with DeepMind or the authors; it was just a simple project I was using to learn TensorFlow. Feedback is highly appreciated.
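If you want to automate those repeated runs, here is a minimal sketch of a driver script (hypothetical, not part of this repo) that reruns async_dqn.py a few times under different experiment names so the resulting tensorboard curves can be compared or averaged:
```python
# run_repeats.py -- hypothetical helper, not part of this repo.
# Launches the same training command several times under different experiment
# names; compare or average the curves under /tmp/summaries/breakout_run*.
import subprocess

for run in range(5):
    subprocess.call([
        "python", "async_dqn.py",
        "--experiment", "breakout_run%d" % run,
        "--game", "Breakout-v0",
        "--num_concurrent", "8",
    ])
```
Each run writes to its own summary directory; stop a run manually (or lower --tmax) once it has trained long enough, since training otherwise runs for the full 80M timesteps.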
59 | -------------------------------------------------------------------------------- /a3c.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | from skimage.transform import resize 3 | from skimage.color import rgb2gray 4 | import threading 5 | import tensorflow as tf 6 | import sys 7 | import random 8 | import numpy as np 9 | import time 10 | import gym 11 | from keras import backend as K 12 | from keras.layers import Convolution2D, Flatten, Dense 13 | from collections import deque 14 | from a3c_model import build_policy_and_value_networks 15 | from keras import backend as K 16 | from atari_environment import AtariEnvironment 17 | 18 | # Path params 19 | EXPERIMENT_NAME = "breakout_a3c" 20 | SUMMARY_SAVE_PATH = "/Users/coreylynch/dev/async-rl/summaries/"+EXPERIMENT_NAME 21 | CHECKPOINT_SAVE_PATH = "/tmp/"+EXPERIMENT_NAME+".ckpt" 22 | CHECKPOINT_NAME = "/tmp/breakout_a3c.ckpt-5" 23 | CHECKPOINT_INTERVAL=5000 24 | SUMMARY_INTERVAL=5 25 | # TRAINING = False 26 | TRAINING = True 27 | 28 | SHOW_TRAINING = True 29 | # SHOW_TRAINING = False 30 | 31 | # Experiment params 32 | GAME = "Breakout-v0" 33 | ACTIONS = 3 34 | NUM_CONCURRENT = 8 35 | NUM_EPISODES = 20000 36 | 37 | AGENT_HISTORY_LENGTH = 4 38 | RESIZED_WIDTH = 84 39 | RESIZED_HEIGHT = 84 40 | 41 | # DQN Params 42 | GAMMA = 0.99 43 | 44 | # Optimization Params 45 | LEARNING_RATE = 0.00001 46 | 47 | #Shared global parameters 48 | T = 0 49 | TMAX = 80000000 50 | t_max = 32 51 | 52 | def sample_policy_action(num_actions, probs): 53 | """ 54 | Sample an action from an action probability distribution output by 55 | the policy network. 56 | """ 57 | # Subtract a tiny value from probabilities in order to avoid 58 | # "ValueError: sum(pvals[:-1]) > 1.0" in numpy.multinomial 59 | probs = probs - np.finfo(np.float32).epsneg 60 | 61 | histogram = np.random.multinomial(1, probs) 62 | action_index = int(np.nonzero(histogram)[0]) 63 | return action_index 64 | 65 | def actor_learner_thread(num, env, session, graph_ops, summary_ops, saver): 66 | # We use global shared counter T, and TMAX constant 67 | global TMAX, T 68 | 69 | # Unpack graph ops 70 | s, a, R, minimize, p_network, v_network = graph_ops 71 | 72 | # Unpack tensorboard summary stuff 73 | r_summary_placeholder, update_ep_reward, val_summary_placeholder, update_ep_val, summary_op = summary_ops 74 | 75 | # Wrap env with AtariEnvironment helper class 76 | env = AtariEnvironment(gym_env=env, resized_width=RESIZED_WIDTH, resized_height=RESIZED_HEIGHT, agent_history_length=AGENT_HISTORY_LENGTH) 77 | 78 | time.sleep(5*num) 79 | 80 | # Set up per-episode counters 81 | ep_reward = 0 82 | ep_avg_v = 0 83 | v_steps = 0 84 | ep_t = 0 85 | 86 | probs_summary_t = 0 87 | 88 | s_t = env.get_initial_state() 89 | terminal = False 90 | 91 | while T < TMAX: 92 | s_batch = [] 93 | past_rewards = [] 94 | a_batch = [] 95 | 96 | t = 0 97 | t_start = t 98 | 99 | while not (terminal or ((t - t_start) == t_max)): 100 | # Perform action a_t according to policy pi(a_t | s_t) 101 | probs = session.run(p_network, feed_dict={s: [s_t]})[0] 102 | action_index = sample_policy_action(ACTIONS, probs) 103 | a_t = np.zeros([ACTIONS]) 104 | a_t[action_index] = 1 105 | 106 | if probs_summary_t % 100 == 0: 107 | print "P, ", np.max(probs), "V ", session.run(v_network, feed_dict={s: [s_t]})[0][0] 108 | 109 | s_batch.append(s_t) 110 | a_batch.append(a_t) 111 | 112 | s_t1, r_t, terminal, info = env.step(action_index) 113 | ep_reward += r_t 114 | 115 | r_t = np.clip(r_t, -1, 1) 
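            # Reward clipping to [-1, 1] (as in Mnih et al.) keeps gradient
            # magnitudes comparable across games; note that ep_reward above
            # accumulates the raw, unclipped reward for logging.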
116 | past_rewards.append(r_t) 117 | 118 | t += 1 119 | T += 1 120 | ep_t += 1 121 | probs_summary_t += 1 122 | 123 | s_t = s_t1 124 | 125 | if terminal: 126 | R_t = 0 127 | else: 128 | R_t = session.run(v_network, feed_dict={s: [s_t]})[0][0] # Bootstrap from last state 129 | 130 | R_batch = np.zeros(t) 131 | for i in reversed(range(t_start, t)): 132 | R_t = past_rewards[i] + GAMMA * R_t 133 | R_batch[i] = R_t 134 | 135 | session.run(minimize, feed_dict={R : R_batch, 136 | a : a_batch, 137 | s : s_batch}) 138 | 139 | # Save progress every 5000 iterations 140 | if T % CHECKPOINT_INTERVAL == 0: 141 | saver.save(session, CHECKPOINT_SAVE_PATH, global_step = T) 142 | 143 | if terminal: 144 | # Episode ended, collect stats and reset game 145 | session.run(update_ep_reward, feed_dict={r_summary_placeholder: ep_reward}) 146 | print "THREAD:", num, "/ TIME", T, "/ REWARD", ep_reward 147 | s_t = env.get_initial_state() 148 | terminal = False 149 | # Reset per-episode counters 150 | ep_reward = 0 151 | ep_t = 0 152 | 153 | def build_graph(): 154 | # Create shared global policy and value networks 155 | s, p_network, v_network, p_params, v_params = build_policy_and_value_networks(num_actions=ACTIONS, agent_history_length=AGENT_HISTORY_LENGTH, resized_width=RESIZED_WIDTH, resized_height=RESIZED_HEIGHT) 156 | 157 | # Shared global optimizer 158 | optimizer = tf.train.AdamOptimizer(LEARNING_RATE) 159 | 160 | # Op for applying remote gradients 161 | R_t = tf.placeholder("float", [None]) 162 | a_t = tf.placeholder("float", [None, ACTIONS]) 163 | log_prob = tf.log(tf.reduce_sum(p_network * a_t, reduction_indices=1)) 164 | p_loss = -log_prob * (R_t - v_network) 165 | v_loss = tf.reduce_mean(tf.square(R_t - v_network)) 166 | 167 | total_loss = p_loss + (0.5 * v_loss) 168 | 169 | minimize = optimizer.minimize(total_loss) 170 | return s, a_t, R_t, minimize, p_network, v_network 171 | 172 | # Set up some episode summary ops to visualize on tensorboard. 173 | def setup_summaries(): 174 | episode_reward = tf.Variable(0.) 175 | tf.summary.scalar("Episode Reward", episode_reward) 176 | r_summary_placeholder = tf.placeholder("float") 177 | update_ep_reward = episode_reward.assign(r_summary_placeholder) 178 | ep_avg_v = tf.Variable(0.) 
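    # Note: update_ep_val (returned below) is never run by actor_learner_thread,
    # so the "Episode Value" scalar stays at its initial value unless you feed
    # it episode value stats yourself.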
179 | tf.summary.scalar("Episode Value", ep_avg_v) 180 | val_summary_placeholder = tf.placeholder("float") 181 | update_ep_val = ep_avg_v.assign(val_summary_placeholder) 182 | summary_op = tf.summary.merge_all() 183 | return r_summary_placeholder, update_ep_reward, val_summary_placeholder, update_ep_val, summary_op 184 | 185 | def train(session, graph_ops, saver): 186 | # Set up game environments (one per thread) 187 | envs = [gym.make(GAME) for i in range(NUM_CONCURRENT)] 188 | 189 | summary_ops = setup_summaries() 190 | summary_op = summary_ops[-1] 191 | 192 | # Initialize variables 193 | session.run(tf.global_variables_initializer()) 194 | writer = tf.summary.FileWriter(SUMMARY_SAVE_PATH, session.graph) 195 | 196 | # Start NUM_CONCURRENT training threads 197 | actor_learner_threads = [threading.Thread(target=actor_learner_thread, args=(thread_id, envs[thread_id], session, graph_ops, summary_ops, saver)) for thread_id in range(NUM_CONCURRENT)] 198 | for t in actor_learner_threads: 199 | t.start() 200 | 201 | # Show the agents training and write summary statistics 202 | last_summary_time = 0 203 | while True: 204 | if SHOW_TRAINING: 205 | for env in envs: 206 | env.render() 207 | now = time.time() 208 | if now - last_summary_time > SUMMARY_INTERVAL: 209 | summary_str = session.run(summary_op) 210 | writer.add_summary(summary_str, float(T)) 211 | last_summary_time = now 212 | for t in actor_learner_threads: 213 | t.join() 214 | 215 | def evaluation(session, graph_ops, saver): 216 | saver.restore(session, CHECKPOINT_NAME) 217 | print "Restored model weights from ", CHECKPOINT_NAME 218 | monitor_env = gym.make(GAME) 219 | monitor_env.monitor.start('/tmp/'+EXPERIMENT_NAME+"/eval") 220 | 221 | # Unpack graph ops 222 | s, a_t, R_t, minimize, p_network, v_network = graph_ops 223 | 224 | # Wrap env with AtariEnvironment helper class 225 | env = AtariEnvironment(gym_env=monitor_env, resized_width=RESIZED_WIDTH, resized_height=RESIZED_HEIGHT, agent_history_length=AGENT_HISTORY_LENGTH) 226 | 227 | for i_episode in xrange(100): 228 | s_t = env.get_initial_state() 229 | ep_reward = 0 230 | terminal = False 231 | while not terminal: 232 | monitor_env.render() 233 | # Forward the deep q network, get Q(s,a) values 234 | probs = p_network.eval(session = session, feed_dict = {s : [s_t]})[0] 235 | action_index = sample_policy_action(ACTIONS, probs) 236 | s_t1, r_t, terminal, info = env.step(action_index) 237 | s_t = s_t1 238 | ep_reward += r_t 239 | print ep_reward 240 | monitor_env.monitor.close() 241 | 242 | def main(_): 243 | g = tf.Graph() 244 | with g.as_default(), tf.Session() as session: 245 | K.set_session(session) 246 | graph_ops = build_graph() 247 | saver = tf.train.Saver() 248 | 249 | if TRAINING: 250 | train(session, graph_ops, saver) 251 | else: 252 | evaluation(session, graph_ops, saver) 253 | 254 | if __name__ == "__main__": 255 | tf.app.run() 256 | -------------------------------------------------------------------------------- /a3c_model.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | from keras import backend as K 3 | from keras.layers import Convolution2D, Flatten, Dense, Input 4 | from keras.models import Model 5 | 6 | def build_policy_and_value_networks(num_actions, agent_history_length, resized_width, resized_height): 7 | with tf.device("/cpu:0"): 8 | state = tf.placeholder("float", [None, agent_history_length, resized_width, resized_height]) 9 | 10 | inputs = Input(shape=(agent_history_length, resized_width, 
resized_height,)) 11 | shared = Convolution2D(name="conv1", nb_filter=16, nb_row=8, nb_col=8, subsample=(4,4), activation='relu', border_mode='same')(inputs) 12 | shared = Convolution2D(name="conv2", nb_filter=32, nb_row=4, nb_col=4, subsample=(2,2), activation='relu', border_mode='same')(shared) 13 | shared = Flatten()(shared) 14 | shared = Dense(name="h1", output_dim=256, activation='relu')(shared) 15 | 16 | action_probs = Dense(name="p", output_dim=num_actions, activation='softmax')(shared) 17 | 18 | state_value = Dense(name="v", output_dim=1, activation='linear')(shared) 19 | 20 | policy_network = Model(input=inputs, output=action_probs) 21 | value_network = Model(input=inputs, output=state_value) 22 | 23 | p_params = policy_network.trainable_weights 24 | v_params = value_network.trainable_weights 25 | 26 | p_out = policy_network(state) 27 | v_out = value_network(state) 28 | 29 | return state, p_out, v_out, p_params, v_params -------------------------------------------------------------------------------- /async_dqn.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import os 3 | os.environ["KERAS_BACKEND"] = "tensorflow" 4 | 5 | from skimage.transform import resize 6 | from skimage.color import rgb2gray 7 | from atari_environment import AtariEnvironment 8 | import threading 9 | import tensorflow as tf 10 | import sys 11 | import random 12 | import numpy as np 13 | import time 14 | import gym 15 | from keras import backend as K 16 | from model import build_network 17 | from keras import backend as K 18 | 19 | flags = tf.app.flags 20 | 21 | flags.DEFINE_string('experiment', 'dqn_breakout', 'Name of the current experiment') 22 | flags.DEFINE_string('game', 'Breakout-v0', 'Name of the atari game to play. Full list here: https://gym.openai.com/envs#atari') 23 | flags.DEFINE_integer('num_concurrent', 8, 'Number of concurrent actor-learner threads to use during training.') 24 | flags.DEFINE_integer('tmax', 80000000, 'Number of training timesteps.') 25 | flags.DEFINE_integer('resized_width', 84, 'Scale screen to this width.') 26 | flags.DEFINE_integer('resized_height', 84, 'Scale screen to this height.') 27 | flags.DEFINE_integer('agent_history_length', 4, 'Use this number of recent screens as the environment state.') 28 | flags.DEFINE_integer('network_update_frequency', 32, 'Frequency with which each actor learner thread does an async gradient update') 29 | flags.DEFINE_integer('target_network_update_frequency', 10000, 'Reset the target network every n timesteps') 30 | flags.DEFINE_float('learning_rate', 0.0001, 'Initial learning rate.') 31 | flags.DEFINE_float('gamma', 0.99, 'Reward discount rate.') 32 | flags.DEFINE_integer('anneal_epsilon_timesteps', 1000000, 'Number of timesteps to anneal epsilon.') 33 | flags.DEFINE_string('summary_dir', '/tmp/summaries', 'Directory for storing tensorboard summaries') 34 | flags.DEFINE_string('checkpoint_dir', '/tmp/checkpoints', 'Directory for storing model checkpoints') 35 | flags.DEFINE_integer('summary_interval', 5, 36 | 'Save training summary to file every n seconds (rounded ' 37 | 'up to statistics interval.') 38 | flags.DEFINE_integer('checkpoint_interval', 600, 39 | 'Checkpoint the model (i.e. 
save the parameters) every n ' 40 | 'seconds (rounded up to statistics interval.') 41 | flags.DEFINE_boolean('show_training', True, 'If true, have gym render evironments during training') 42 | flags.DEFINE_boolean('testing', False, 'If true, run gym evaluation') 43 | flags.DEFINE_string('checkpoint_path', 'path/to/recent.ckpt', 'Path to recent checkpoint to use for evaluation') 44 | flags.DEFINE_string('eval_dir', '/tmp/', 'Directory to store gym evaluation') 45 | flags.DEFINE_integer('num_eval_episodes', 100, 'Number of episodes to run gym evaluation.') 46 | FLAGS = flags.FLAGS 47 | T = 0 48 | TMAX = FLAGS.tmax 49 | 50 | def sample_final_epsilon(): 51 | """ 52 | Sample a final epsilon value to anneal towards from a distribution. 53 | These values are specified in section 5.1 of http://arxiv.org/pdf/1602.01783v1.pdf 54 | """ 55 | final_epsilons = np.array([.1,.01,.5]) 56 | probabilities = np.array([0.4,0.3,0.3]) 57 | return np.random.choice(final_epsilons, 1, p=list(probabilities))[0] 58 | 59 | def actor_learner_thread(thread_id, env, session, graph_ops, num_actions, summary_ops, saver): 60 | """ 61 | Actor-learner thread implementing asynchronous one-step Q-learning, as specified 62 | in algorithm 1 here: http://arxiv.org/pdf/1602.01783v1.pdf. 63 | """ 64 | global TMAX, T 65 | 66 | # Unpack graph ops 67 | s = graph_ops["s"] 68 | q_values = graph_ops["q_values"] 69 | st = graph_ops["st"] 70 | target_q_values = graph_ops["target_q_values"] 71 | reset_target_network_params = graph_ops["reset_target_network_params"] 72 | a = graph_ops["a"] 73 | y = graph_ops["y"] 74 | grad_update = graph_ops["grad_update"] 75 | 76 | summary_placeholders, update_ops, summary_op = summary_ops 77 | 78 | # Wrap env with AtariEnvironment helper class 79 | env = AtariEnvironment(gym_env=env, resized_width=FLAGS.resized_width, resized_height=FLAGS.resized_height, agent_history_length=FLAGS.agent_history_length) 80 | 81 | # Initialize network gradients 82 | s_batch = [] 83 | a_batch = [] 84 | y_batch = [] 85 | 86 | final_epsilon = sample_final_epsilon() 87 | initial_epsilon = 1.0 88 | epsilon = 1.0 89 | 90 | print "Starting thread ", thread_id, "with final epsilon ", final_epsilon 91 | 92 | time.sleep(3*thread_id) 93 | t = 0 94 | while T < TMAX: 95 | # Get initial game observation 96 | s_t = env.get_initial_state() 97 | terminal = False 98 | 99 | # Set up per-episode counters 100 | ep_reward = 0 101 | episode_ave_max_q = 0 102 | ep_t = 0 103 | 104 | while True: 105 | # Forward the deep q network, get Q(s,a) values 106 | readout_t = q_values.eval(session = session, feed_dict = {s : [s_t]}) 107 | 108 | # Choose next action based on e-greedy policy 109 | a_t = np.zeros([num_actions]) 110 | action_index = 0 111 | if random.random() <= epsilon: 112 | action_index = random.randrange(num_actions) 113 | else: 114 | action_index = np.argmax(readout_t) 115 | a_t[action_index] = 1 116 | 117 | # Scale down epsilon 118 | if epsilon > final_epsilon: 119 | epsilon -= (initial_epsilon - final_epsilon) / FLAGS.anneal_epsilon_timesteps 120 | 121 | # Gym excecutes action in game environment on behalf of actor-learner 122 | s_t1, r_t, terminal, info = env.step(action_index) 123 | 124 | # Accumulate gradients 125 | readout_j1 = target_q_values.eval(session = session, feed_dict = {st : [s_t1]}) 126 | clipped_r_t = np.clip(r_t, -1, 1) 127 | if terminal: 128 | y_batch.append(clipped_r_t) 129 | else: 130 | y_batch.append(clipped_r_t + FLAGS.gamma * np.max(readout_j1)) 131 | 132 | a_batch.append(a_t) 133 | s_batch.append(s_t) 134 | 135 | # 
Update the state and counters 136 | s_t = s_t1 137 | T += 1 138 | t += 1 139 | 140 | ep_t += 1 141 | ep_reward += r_t 142 | episode_ave_max_q += np.max(readout_t) 143 | 144 | # Optionally update target network 145 | if T % FLAGS.target_network_update_frequency == 0: 146 | session.run(reset_target_network_params) 147 | 148 | # Optionally update online network 149 | if t % FLAGS.network_update_frequency == 0 or terminal: 150 | if s_batch: 151 | session.run(grad_update, feed_dict = {y : y_batch, 152 | a : a_batch, 153 | s : s_batch}) 154 | # Clear gradients 155 | s_batch = [] 156 | a_batch = [] 157 | y_batch = [] 158 | 159 | # Save model progress 160 | if t % FLAGS.checkpoint_interval == 0: 161 | saver.save(session, FLAGS.checkpoint_dir+"/"+FLAGS.experiment+".ckpt", global_step = t) 162 | 163 | # Print end of episode stats 164 | if terminal: 165 | stats = [ep_reward, episode_ave_max_q/float(ep_t), epsilon] 166 | for i in range(len(stats)): 167 | session.run(update_ops[i], feed_dict={summary_placeholders[i]:float(stats[i])}) 168 | print "THREAD:", thread_id, "/ TIME", T, "/ TIMESTEP", t, "/ EPSILON", epsilon, "/ REWARD", ep_reward, "/ Q_MAX %.4f" % (episode_ave_max_q/float(ep_t)), "/ EPSILON PROGRESS", t/float(FLAGS.anneal_epsilon_timesteps) 169 | break 170 | 171 | def build_graph(num_actions): 172 | # Create shared deep q network 173 | s, q_network = build_network(num_actions=num_actions, agent_history_length=FLAGS.agent_history_length, 174 | resized_width=FLAGS.resized_width, resized_height=FLAGS.resized_height, name_scope="q-network") 175 | network_params = q_network.trainable_weights 176 | q_values = q_network(s) 177 | 178 | # Create shared target network 179 | st, target_q_network = build_network(num_actions=num_actions, agent_history_length=FLAGS.agent_history_length, 180 | resized_width=FLAGS.resized_width, resized_height=FLAGS.resized_height, name_scope="target-network") 181 | target_network_params = target_q_network.trainable_weights 182 | target_q_values = target_q_network(st) 183 | 184 | # Op for periodically updating target network with online network weights 185 | reset_target_network_params = [target_network_params[i].assign(network_params[i]) for i in range(len(target_network_params))] 186 | 187 | # Define cost and gradient update op 188 | a = tf.placeholder("float", [None, num_actions]) 189 | y = tf.placeholder("float", [None]) 190 | action_q_values = tf.reduce_sum(tf.multiply(q_values, a), reduction_indices=1) 191 | cost = tf.reduce_mean(tf.square(y - action_q_values)) 192 | optimizer = tf.train.AdamOptimizer(FLAGS.learning_rate) 193 | grad_update = optimizer.minimize(cost, var_list=network_params) 194 | 195 | graph_ops = {"s" : s, 196 | "q_values" : q_values, 197 | "st" : st, 198 | "target_q_values" : target_q_values, 199 | "reset_target_network_params" : reset_target_network_params, 200 | "a" : a, 201 | "y" : y, 202 | "grad_update" : grad_update} 203 | 204 | return graph_ops 205 | 206 | # Set up some episode summary ops to visualize on tensorboard. 207 | def setup_summaries(): 208 | episode_reward = tf.Variable(0.) 209 | tf.summary.scalar("Episode_Reward", episode_reward) 210 | episode_ave_max_q = tf.Variable(0.) 211 | tf.summary.scalar("Max_Q_Value", episode_ave_max_q) 212 | logged_epsilon = tf.Variable(0.) 213 | tf.summary.scalar("Epsilon", logged_epsilon) 214 | logged_T = tf.Variable(0.) 
215 | summary_vars = [episode_reward, episode_ave_max_q, logged_epsilon] 216 | summary_placeholders = [tf.placeholder("float") for i in range(len(summary_vars))] 217 | update_ops = [summary_vars[i].assign(summary_placeholders[i]) for i in range(len(summary_vars))] 218 | summary_op = tf.summary.merge_all() 219 | return summary_placeholders, update_ops, summary_op 220 | 221 | def get_num_actions(): 222 | """ 223 | Returns the number of possible actions for the given atari game 224 | """ 225 | # Figure out number of actions from gym env 226 | env = gym.make(FLAGS.game) 227 | num_actions = env.action_space.n 228 | if (FLAGS.game == "Pong-v0" or FLAGS.game == "Breakout-v0"): 229 | # Gym currently specifies 6 actions for pong 230 | # and breakout when only 3 are needed. This 231 | # is a lame workaround. 232 | num_actions = 3 233 | return num_actions 234 | 235 | def train(session, graph_ops, num_actions, saver): 236 | # Set up game environments (one per thread) 237 | envs = [gym.make(FLAGS.game) for i in range(FLAGS.num_concurrent)] 238 | 239 | summary_ops = setup_summaries() 240 | summary_op = summary_ops[-1] 241 | 242 | # Initialize variables 243 | session.run(tf.global_variables_initializer()) 244 | # Initialize target network weights 245 | session.run(graph_ops["reset_target_network_params"]) 246 | summary_save_path = FLAGS.summary_dir + "/" + FLAGS.experiment 247 | writer = tf.summary.FileWriter(summary_save_path, session.graph) 248 | if not os.path.exists(FLAGS.checkpoint_dir): 249 | os.makedirs(FLAGS.checkpoint_dir) 250 | 251 | # Start num_concurrent actor-learner training threads 252 | 253 | if(FLAGS.num_concurrent==1): # for debug 254 | actor_learner_thread(0, envs[0], session, graph_ops, num_actions, summary_ops, saver) 255 | else: 256 | actor_learner_threads = [threading.Thread(target=actor_learner_thread, args=(thread_id, envs[thread_id], session, graph_ops, num_actions, summary_ops, saver)) for thread_id in range(FLAGS.num_concurrent)] 257 | for t in actor_learner_threads: 258 | t.start() 259 | 260 | # Show the agents training and write summary statistics 261 | last_summary_time = 0 262 | while True: 263 | if FLAGS.show_training: 264 | for env in envs: 265 | env.render() 266 | now = time.time() 267 | if now - last_summary_time > FLAGS.summary_interval: 268 | summary_str = session.run(summary_op) 269 | writer.add_summary(summary_str, float(T)) 270 | last_summary_time = now 271 | for t in actor_learner_threads: 272 | t.join() 273 | 274 | def evaluation(session, graph_ops, saver): 275 | saver.restore(session, FLAGS.checkpoint_path) 276 | print "Restored model weights from ", FLAGS.checkpoint_path 277 | monitor_env = gym.make(FLAGS.game) 278 | gym.wrappers.Monitor(monitor_env, FLAGS.eval_dir+"/"+FLAGS.experiment+"/eval") 279 | 280 | # Unpack graph ops 281 | s = graph_ops["s"] 282 | q_values = graph_ops["q_values"] 283 | 284 | # Wrap env with AtariEnvironment helper class 285 | env = AtariEnvironment(gym_env=monitor_env, resized_width=FLAGS.resized_width, resized_height=FLAGS.resized_height, agent_history_length=FLAGS.agent_history_length) 286 | 287 | for i_episode in xrange(FLAGS.num_eval_episodes): 288 | s_t = env.get_initial_state() 289 | ep_reward = 0 290 | terminal = False 291 | while not terminal: 292 | monitor_env.render() 293 | readout_t = q_values.eval(session = session, feed_dict = {s : [s_t]}) 294 | action_index = np.argmax(readout_t) 295 | print "action",action_index 296 | s_t1, r_t, terminal, info = env.step(action_index) 297 | s_t = s_t1 298 | ep_reward += r_t 299 | print 
ep_reward 300 | monitor_env.monitor.close() 301 | 302 | def main(_): 303 | g = tf.Graph() 304 | session = tf.Session(graph=g) 305 | with g.as_default(), session.as_default(): 306 | K.set_session(session) 307 | num_actions = get_num_actions() 308 | graph_ops = build_graph(num_actions) 309 | saver = tf.train.Saver() 310 | 311 | if FLAGS.testing: 312 | evaluation(session, graph_ops, saver) 313 | else: 314 | train(session, graph_ops, num_actions, saver) 315 | 316 | if __name__ == "__main__": 317 | tf.app.run() 318 | -------------------------------------------------------------------------------- /atari_environment.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | from skimage.transform import resize 3 | from skimage.color import rgb2gray 4 | import numpy as np 5 | from collections import deque 6 | 7 | class AtariEnvironment(object): 8 | """ 9 | Small wrapper for gym atari environments. 10 | Responsible for preprocessing screens and holding on to a screen buffer 11 | of size agent_history_length from which environment state 12 | is constructed. 13 | """ 14 | def __init__(self, gym_env, resized_width, resized_height, agent_history_length): 15 | self.env = gym_env 16 | self.resized_width = resized_width 17 | self.resized_height = resized_height 18 | self.agent_history_length = agent_history_length 19 | 20 | self.gym_actions = range(gym_env.action_space.n) 21 | if (gym_env.spec.id == "Pong-v0" or gym_env.spec.id == "Breakout-v0"): 22 | print "Doing workaround for pong or breakout" 23 | # Gym returns 6 possible actions for breakout and pong. 24 | # Only three are used, the rest are no-ops. This just lets us 25 | # pick from a simplified "LEFT", "RIGHT", "NOOP" action space. 26 | self.gym_actions = [1,2,3] 27 | 28 | # Screen buffer of size AGENT_HISTORY_LENGTH to be able 29 | # to build state arrays of size [1, AGENT_HISTORY_LENGTH, width, height] 30 | self.state_buffer = deque() 31 | 32 | def get_initial_state(self): 33 | """ 34 | Resets the atari game, clears the state buffer 35 | """ 36 | # Clear the state buffer 37 | self.state_buffer = deque() 38 | 39 | x_t = self.env.reset() 40 | x_t = self.get_preprocessed_frame(x_t) 41 | s_t = np.stack((x_t, x_t, x_t, x_t), axis = 0) 42 | 43 | for i in range(self.agent_history_length-1): 44 | self.state_buffer.append(x_t) 45 | return s_t 46 | 47 | def get_preprocessed_frame(self, observation): 48 | """ 49 | See Methods->Preprocessing in Mnih et al. 50 | 1) Get image grayscale 51 | 2) Rescale image 52 | """ 53 | return resize(rgb2gray(observation), (self.resized_width, self.resized_height)) 54 | 55 | def step(self, action_index): 56 | """ 57 | Excecutes an action in the gym environment. 58 | Builds current state (concatenation of agent_history_length-1 previous frames and current one). 59 | Pops oldest frame, adds current frame to the state buffer. 60 | Returns current state. 61 | """ 62 | 63 | x_t1, r_t, terminal, info = self.env.step(self.gym_actions[action_index]) 64 | x_t1 = self.get_preprocessed_frame(x_t1) 65 | 66 | previous_frames = np.array(self.state_buffer) 67 | s_t1 = np.empty((self.agent_history_length, self.resized_height, self.resized_width)) 68 | s_t1[:self.agent_history_length-1, ...] 
= previous_frames 69 | s_t1[self.agent_history_length-1] = x_t1 70 | 71 | # Pop the oldest frame, add the current frame to the queue 72 | self.state_buffer.popleft() 73 | self.state_buffer.append(x_t1) 74 | 75 | return s_t1, r_t, terminal, info 76 | -------------------------------------------------------------------------------- /breakout.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/coreylynch/async-rl/1741d52ce1cbab066c6af2c645865473c422db55/breakout.gif -------------------------------------------------------------------------------- /model.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | from keras import backend as K 3 | from keras.layers import Conv2D, Flatten, Dense, Input 4 | from keras.models import Model 5 | 6 | def build_network(num_actions, agent_history_length, resized_width, resized_height, name_scope): 7 | with tf.device("/cpu:0"): 8 | with tf.name_scope(name_scope): 9 | state = tf.placeholder(tf.float32, [None, agent_history_length, resized_width, resized_height], name="state") 10 | inputs = Input(shape=(agent_history_length, resized_width, resized_height,)) 11 | model = Conv2D(filters=16, kernel_size=(8,8), strides=(4,4), activation='relu', padding='same', data_format='channels_first')(inputs) 12 | model = Conv2D(filters=32, kernel_size=(4,4), strides=(2,2), activation='relu', padding='same', data_format='channels_first')(model) 13 | #model = Conv2D(filter=64, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same')(model) 14 | model = Flatten()(model) 15 | model = Dense(256, activation='relu')(model) 16 | print model 17 | q_values = Dense(num_actions)(model) 18 | 19 | #UserWarning: Update your `Model` call to the Keras 2 API: 20 | # `Model(outputs=Tensor("de..., inputs=Tensor("in.. 21 | m = Model(inputs=inputs, outputs=q_values) 22 | 23 | return state, m -------------------------------------------------------------------------------- /resources/episode_reward.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/coreylynch/async-rl/1741d52ce1cbab066c6af2c645865473c422db55/resources/episode_reward.png -------------------------------------------------------------------------------- /resources/max_q_value.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/coreylynch/async-rl/1741d52ce1cbab066c6af2c645865473c422db55/resources/max_q_value.png --------------------------------------------------------------------------------
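For reference, here is a minimal, hypothetical smoke test for `model.build_network` (not part of the repo; assumes the TF 1.x + Keras 2 setup the code above targets):
```python
import numpy as np
import tensorflow as tf
from keras import backend as K
from model import build_network

with tf.Session() as sess:
    K.set_session(sess)
    # Same arguments async_dqn.py uses for Breakout: 3 actions, 4-frame history, 84x84 screens.
    state, q_net = build_network(num_actions=3, agent_history_length=4,
                                 resized_width=84, resized_height=84,
                                 name_scope="q-network")
    q_values = q_net(state)
    sess.run(tf.global_variables_initializer())
    dummy = np.zeros((1, 4, 84, 84), dtype=np.float32)
    print(q_values.eval(feed_dict={state: dummy}).shape)  # expect (1, 3)
```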