├── .gitignore
├── LICENSE.txt
├── README.md
├── code
│   ├── ale_agents.py
│   ├── ale_data_set.py
│   ├── ale_experiment.py
│   ├── image_preprocessing.py
│   ├── launcher.py
│   ├── q_network.py
│   ├── run_OT.py
│   └── updates.py
├── figures
│   ├── a3c_fig.png
│   ├── beam_rider_time.png
│   ├── breakout_time.png
│   ├── frame.png
│   ├── frostbite_cl2_1.png
│   ├── frostbite_cl2_2.png
│   ├── frostbite_r15_1.png
│   ├── frostbite_r15_2.png
│   ├── gopher.png
│   ├── gopher_running.png
│   ├── hero.png
│   ├── pong_time.png
│   ├── qbert_time.png
│   ├── rescale.png
│   ├── space_invaders_time.png
│   ├── star_gunner.png
│   ├── star_gunner_running.png
│   └── zaxxon.png
└── roms
    ├── air_raid.bin
    ├── alien.bin
    ├── amidar.bin
    ├── assault.bin
    ├── asterix.bin
    ├── asteroids.bin
    ├── atlantis.bin
    ├── bank_heist.bin
    ├── battle_zone.bin
    ├── beam_rider.bin
    ├── berzerk.bin
    ├── bowling.bin
    ├── boxing.bin
    ├── breakout.bin
    ├── carnival.bin
    ├── centipede.bin
    ├── chopper_command.bin
    ├── crazy_climber.bin
    ├── defender.bin
    ├── demon_attack.bin
    ├── double_dunk.bin
    ├── elevator_action.bin
    ├── enduro.bin
    ├── fishing_derby.bin
    ├── freeway.bin
    ├── frostbite.bin
    ├── gopher.bin
    ├── gravitar.bin
    ├── hero.bin
    ├── ice_hockey.bin
    ├── jamesbond.bin
    ├── journey_escape.bin
    ├── kangaroo.bin
    ├── krull.bin
    ├── kung_fu_master.bin
    ├── montezuma_revenge.bin
    ├── ms_pacman.bin
    ├── name_this_game.bin
    ├── phoenix.bin
    ├── pitfall.bin
    ├── pong.bin
    ├── pooyan.bin
    ├── private_eye.bin
    ├── qbert.bin
    ├── riverraid.bin
    ├── road_runner.bin
    ├── robotank.bin
    ├── seaquest.bin
    ├── skiing.bin
    ├── solaris.bin
    ├── space_invaders.bin
    ├── star_gunner.bin
    ├── tennis.bin
    ├── time_pilot.bin
    ├── tutankham.bin
    ├── up_n_down.bin
    ├── venture.bin
    ├── video_pinball.bin
    ├── wizard_of_wor.bin
    ├── yars_revenge.bin
    └── zaxxon.bin

/.gitignore:
--------------------------------------------------------------------------------
1 | 
2 | .idea/.name
3 | 
4 | *.iml
5 | 
6 | *.xml
7 | 
8 | *.pyc
9 | 
--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2017 Frank He
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Q-Optimality-Tightening
2 | This is my implementation of the paper [Learning to Play in a Day: Faster Deep Reinforcement Learning by Optimality Tightening](https://openreview.net/pdf?id=rJ8Je4clg).
3 | 
4 | # Dependencies
5 | * Numpy
6 | * Scipy
7 | * Pillow
8 | * Matplotlib
9 | * Theano
10 | * Lasagne
11 | * ALE or gym
12 | 
13 | Readers may refer to each package's documentation for installation instructions. However, I suggest installing all of the packages inside a virtual environment. Please make sure that your Theano version is compatible with your Lasagne version.
14 | 
15 | # Running
16 | ```
17 | THEANO_FLAGS='device=gpu0, allow_gc=False' python run_OT.py -r frostbite --close2
18 | ```
19 | This runs frostbite with close bounds.
20 | 
21 | ```
22 | THEANO_FLAGS='device=gpu1, allow_gc=False' python run_OT.py -r gopher
23 | ```
24 | 
25 | This runs gopher with randomly sampled bounds. By default, 4 out of 10 upper bounds are selected as U\_{j,k} and 4 out of 10 lower bounds are selected as L\_{j,l}.
26 | 
27 | I have already provided 62 game ROMs in the `roms` directory.
28 | 
29 | If everything is configured correctly, the run should look like this:
30 | 
31 | 
32 | 
33 | 
34 | Steps per second is usually between 105 and 140 on one Titan X. GPU utilization is about 30 percent, which means our code still has substantial room for improvement.
35 | 
36 | 
37 | # Experiments
38 | First, two figures from runs on frostbite with ```--close2```:
39 | ![frostbite_cl2_1]
40 | ![frostbite_cl2_2]
41 | Two other figures, from runs sampling 4 bounds out of 15, are below:
42 | ![frostbite_r15_1]
43 | ![frostbite_r15_2]
44 | > frostbite's 200M baseline is 328.3
45 | 
46 | Some other games are shown here:
47 | ![gopher]
48 | > gopher's 200M baseline is 8520
49 | 
50 | ![hero]
51 | > hero's 200M baseline is 19950
52 | 
53 | ![star_gunner]
54 | > star_gunner's 200M baseline is 57997
55 | 
56 | ![zaxxon]
57 | > zaxxon's 200M baseline is 4977
58 | 
59 | Finally, we can roughly compare our method with the state-of-the-art method [A3C](https://arxiv.org/pdf/1602.01783). Our method uses 1 CPU thread and 1 GPU (at about 30% utilization), while A3C uses multiple CPU threads.
60 | 
61 | Figure 4 in the [A3C](https://arxiv.org/pdf/1602.01783) paper:
62 | ![A3C]
63 | 
64 | Our results:
65 | 
66 | ![beam_rider]
67 | ![breakout]
68 | ![pong]
69 | ![qbert]
70 | ![space_invaders]
71 | 
72 | From these plots, our method almost always outperforms A3C with 1, 2, and 4 threads and achieves results similar to A3C with 8 threads. Note that the five games chosen in the A3C paper are not our method's specialties; we expect an even larger advantage on games that our method is good at.
73 | 
74 | # Explanation
75 | ### Gradients are also rescaled so that their magnitudes are comparable with or without the penalty
76 | ![rescale]
77 | 
78 | ### About frames
79 | ![frame]
80 | 
81 | # Comments
82 | Since we never did a grid search over hyperparameters, we expect that better settings or initializations would further improve the results. More informed strategies for choosing the constraints are also possible: early in training we may expect lower bounds from the more distant future to have a larger impact, whereas once the algorithm has almost converged we may expect lower bounds close to the considered time step to have a bigger impact.
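For concreteness, here is a minimal NumPy sketch (not the training code itself) of how `code/ale_agents.py` turns these bounds into a training target inside `_do_training`: the ordinary one-step DQN target is mixed with the most violated lower bound L\_{j,l} or upper bound U\_{j,k}, and the mixing weight is annealed from `--weight-max` to `--weight-min` over `--anneal-len` steps. The sketch omits the `--late2` gating and the double-DQN branch; `penalized_target` and the toy numbers below are illustrative only.

```python
import numpy as np

def penalized_target(q_sa, dqn_target, lower_bounds, upper_bounds,
                     weight, margin=0.1):
    """Mix the one-step DQN target with the most violated bound."""
    v1 = dqn_target
    v_max = np.max(lower_bounds)   # tightest lower bound, max_l L_{j,l}
    v_min = np.min(upper_bounds)   # tightest upper bound, min_k U_{j,k}
    if v_max - margin > q_sa > v_min + margin:   # both kinds of bound violated
        v1 = 0.5 * (v_max + v_min)
    elif v_max - margin > q_sa:                  # Q(s,a) below a lower bound
        v1 = v_max
    elif q_sa > v_min + margin:                  # Q(s,a) above an upper bound
        v1 = v_min
    return weight * dqn_target + (1.0 - weight) * v1

# Toy example: Q(s,a) violates one of its lower bounds, so the target is
# pulled up towards it.  `weight` is annealed from --weight-max down to
# --weight-min during training.
print(penalized_target(q_sa=1.0, dqn_target=1.2,
                       lower_bounds=np.array([1.5, 0.9, 1.1]),
                       upper_bounds=np.array([2.0, 2.4]),
                       weight=0.8))
```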
More complex penalty functions and sophisticated optimization approaches may yield even better results than the ones we reported yet. 83 | 84 | # Please cite our paper at 85 | ``` 86 | @inproceedings{HeICLR2017, 87 | author = {F.~S. He, Y. Liu, A.~G. Schwing and J. Peng}, 88 | title = {{Learning to Play in a Day: Faster Deep Reinforcement Learning by Optimality Tightening}}, 89 | booktitle = {Proc. ICLR}, 90 | year = {2017}, 91 | } 92 | ``` 93 | 94 | [frostbite_cl2_1]: figures/frostbite_cl2_1.png 95 | [frostbite_cl2_2]: figures/frostbite_cl2_2.png 96 | [frostbite_r15_1]: figures/frostbite_r15_1.png 97 | [frostbite_r15_2]: figures/frostbite_r15_2.png 98 | [gopher]: figures/gopher.png 99 | [hero]: figures/hero.png 100 | [star_gunner]: figures/star_gunner.png 101 | [zaxxon]: figures/zaxxon.png 102 | [A3C]: figures/a3c_fig.png 103 | [beam_rider]: figures/beam_rider_time.png 104 | [breakout]: figures/breakout_time.png 105 | [pong]: figures/pong_time.png 106 | [qbert]: figures/qbert_time.png 107 | [space_invaders]: figures/space_invaders_time.png 108 | [rescale]: figures/rescale.png 109 | [frame]: figures/frame.png 110 | -------------------------------------------------------------------------------- /code/ale_agents.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | __author__ = 'frankhe' 3 | 4 | """ 5 | DQN agents 6 | """ 7 | import time 8 | import os 9 | import logging 10 | import numpy as np 11 | import cPickle 12 | 13 | import ale_data_set 14 | import sys 15 | sys.setrecursionlimit(10000) 16 | 17 | recording_size = 0 18 | 19 | 20 | class OptimalityTightening(object): 21 | def __init__(self, q_network, epsilon_start, epsilon_min, 22 | epsilon_decay, replay_memory_size, exp_pref, update_frequency, 23 | replay_start_size, rng, transitions_sequence_length, transition_range, penalty_method, 24 | weight_min, weight_max, weight_decay_length, two_train=False, late2=True, close2=True, verbose=False, 25 | double=False, save_pkl=True): 26 | self.double_dqn = double 27 | self.network = q_network 28 | self.num_actions = q_network.num_actions 29 | self.epsilon_start = epsilon_start 30 | self.update_frequency = update_frequency 31 | 32 | self.epsilon_min = epsilon_min 33 | self.epsilon_decay = epsilon_decay 34 | self.replay_memory_size = replay_memory_size 35 | self.exp_dir = exp_pref + '_' + str(weight_max) + '_' + str(weight_min) 36 | if late2: 37 | self.exp_dir += '_l2' 38 | if close2: 39 | self.exp_dir += '_close2' 40 | else: 41 | self.exp_dir += '_len' + str(transitions_sequence_length) + '_r' + str(transition_range) 42 | if two_train: 43 | self.exp_dir += '_TTR' 44 | 45 | self.replay_start_size = replay_start_size 46 | self.rng = rng 47 | self.transition_len = transitions_sequence_length 48 | self.two_train = two_train 49 | self.verbose = verbose 50 | if verbose > 0: 51 | print "Using verbose", verbose 52 | self.exp_dir += '_vb' + str(verbose) 53 | 54 | self.phi_length = self.network.num_frames 55 | self.image_width = self.network.input_width 56 | self.image_height = self.network.input_height 57 | self.penalty_method = penalty_method 58 | self.batch_size = self.network.batch_size 59 | self.discount = self.network.discount 60 | self.transition_range = transition_range 61 | self.late2 = late2 62 | self.close2 = close2 63 | self.same_update = False 64 | self.save_pkl = save_pkl 65 | 66 | self.start_index = 0 67 | self.terminal_index = None 68 | 69 | self.weight_max = weight_max 70 | self.weight_min = weight_min 71 | self.weight = 
self.weight_max 72 | self.weight_decay_length = weight_decay_length 73 | self.weight_decay = (self.weight_max - self.weight_min) / self.weight_decay_length 74 | 75 | try: 76 | os.stat(self.exp_dir) 77 | except OSError: 78 | os.makedirs(self.exp_dir) 79 | 80 | self.data_set = ale_data_set.DataSet(width=self.image_width, 81 | height=self.image_height, 82 | rng=rng, 83 | max_steps=self.replay_memory_size, 84 | phi_length=self.phi_length, 85 | discount=self.discount, 86 | batch_size=self.batch_size, 87 | transitions_len=self.transition_len) 88 | 89 | # just needs to be big enough to create phi's 90 | self.test_data_set = ale_data_set.DataSet(width=self.image_width, 91 | height=self.image_height, 92 | rng=rng, 93 | max_steps=self.phi_length * 2, 94 | phi_length=self.phi_length) 95 | self.epsilon = self.epsilon_start 96 | if self.epsilon_decay != 0: 97 | self.epsilon_rate = ((self.epsilon_start - self.epsilon_min) / 98 | self.epsilon_decay) 99 | else: 100 | self.epsilon_rate = 0 101 | 102 | self.testing = False 103 | 104 | self._open_results_file() 105 | self._open_learning_file() 106 | self._open_recording_file() 107 | 108 | self.step_counter = None 109 | self.episode_reward = None 110 | self.start_time = None 111 | self.loss_averages = None 112 | self.total_reward = None 113 | 114 | self.episode_counter = 0 115 | self.batch_counter = 0 116 | 117 | self.holdout_data = None 118 | 119 | # In order to add an element to the data set we need the 120 | # previous state and action and the current reward. These 121 | # will be used to store states and actions. 122 | self.last_img = None 123 | self.last_action = None 124 | 125 | # Exponential moving average of runtime performance. 126 | self.steps_sec_ema = 0. 127 | self.program_start_time = None 128 | self.last_count_time = None 129 | self.epoch_time = None 130 | self.total_time = None 131 | 132 | def time_count_start(self): 133 | self.last_count_time = self.program_start_time = time.time() 134 | 135 | def _open_results_file(self): 136 | logging.info("OPENING " + self.exp_dir + '/results.csv') 137 | self.results_file = open(self.exp_dir + '/results.csv', 'w', 0) 138 | self.results_file.write(\ 139 | 'epoch,num_episodes,total_reward,reward_per_epoch,mean_q, epoch time, total time\n') 140 | self.results_file.flush() 141 | 142 | def _open_learning_file(self): 143 | self.learning_file = open(self.exp_dir + '/learning.csv', 'w', 0) 144 | self.learning_file.write('mean_loss,epsilon\n') 145 | self.learning_file.flush() 146 | 147 | def _update_results_file(self, epoch, num_episodes, holdout_sum): 148 | out = "{},{},{},{},{},{},{}\n".format(epoch, num_episodes, 149 | self.total_reward, self.total_reward / float(num_episodes), 150 | holdout_sum, self.epoch_time, self.total_time) 151 | self.last_count_time = time.time() 152 | self.results_file.write(out) 153 | self.results_file.flush() 154 | 155 | def _update_learning_file(self): 156 | out = "{},{}\n".format(np.mean(self.loss_averages), 157 | self.epsilon) 158 | self.learning_file.write(out) 159 | self.learning_file.flush() 160 | 161 | def _open_recording_file(self): 162 | self.recording_tot = 0 163 | self.recording_file = open(self.exp_dir + '/recording.csv', 'w', 0) 164 | self.recording_file.write('nn_output, q_return, history_return, loss') 165 | self.recording_file.write('\n') 166 | self.recording_file.flush() 167 | 168 | def _update_recording_file(self, nn_output, q_return, history_return, loss): 169 | if self.recording_tot > recording_size: 170 | return 171 | self.recording_tot += 1 172 | out = 
"{},{},{},{}".format(nn_output, q_return, history_return, loss) 173 | self.recording_file.write(out) 174 | self.recording_file.write('\n') 175 | self.recording_file.flush() 176 | 177 | def start_episode(self, observation): 178 | """ 179 | This method is called once at the beginning of each episode. 180 | No reward is provided, because reward is only available after 181 | an action has been taken. 182 | 183 | Arguments: 184 | observation - height x width numpy array 185 | 186 | Returns: 187 | An integer action 188 | """ 189 | 190 | self.step_counter = 0 191 | self.batch_counter = 0 192 | self.episode_reward = 0 193 | 194 | # We report the mean loss for every epoch. 195 | self.loss_averages = [] 196 | 197 | self.start_time = time.time() 198 | return_action = self.rng.randint(0, self.num_actions) 199 | 200 | self.last_action = return_action 201 | 202 | self.last_img = observation 203 | 204 | return return_action 205 | 206 | def _choose_action(self, data_set, epsilon, cur_img, reward): 207 | """ 208 | Add the most recent data to the data set and choose 209 | an action based on the current policy. 210 | """ 211 | 212 | data_set.add_sample(self.last_img, self.last_action, reward, False, start_index=self.start_index) 213 | if self.step_counter >= self.phi_length: 214 | phi = data_set.phi(cur_img) 215 | action = self.network.choose_action(phi, epsilon) 216 | else: 217 | action = self.rng.randint(0, self.num_actions) 218 | 219 | return action 220 | 221 | def _do_training(self): 222 | """ 223 | Returns the average loss for the current batch. 224 | May be overridden if a subclass needs to train the network 225 | differently. 226 | """ 227 | if self.close2: 228 | self.data_set.random_close_transitions_batch(self.batch_size, self.transition_len) 229 | else: 230 | self.data_set.random_transitions_batch(self.batch_size, self.transition_len, self.transition_range) 231 | 232 | target_q_imgs = np.append(self.data_set.forward_imgs, self.data_set.backward_imgs, axis=1) 233 | target_q_table = self.network.q_target_vals(target_q_imgs) 234 | target_double_q_table = None 235 | if self.double_dqn: 236 | target_double_q_table = self.network.q_target(target_q_imgs) 237 | q_values = self.network.q_s_a_batch_vals(self.data_set.center_imgs, self.data_set.center_actions) 238 | 239 | states1 = np.zeros((self.batch_size, self.phi_length, self.image_height, self.image_width), dtype='uint8') 240 | actions1 = np.zeros((self.batch_size, 1), dtype='int32') 241 | targets1 = np.zeros((self.batch_size, 1), dtype='float32') 242 | states2 = np.zeros((self.batch_size, self.phi_length, self.image_height, self.image_width), dtype='uint8') 243 | actions2 = np.zeros((self.batch_size, 1), dtype='int32') 244 | targets2 = np.zeros((self.batch_size, 1), dtype='float32') 245 | """ 246 | 0 1 2 3* 4 5 6 7 8 V_R 247 | 0 1 2 4 5 6 7 8 V_R 248 | V0 = r3 + y*Q4; V1 = r3 +y*r4 + y^2*Q5 249 | Q2 -r2 = Q3*y; Q1 - r1 - y*r2 = y^2*Q3 250 | V-1 = (Q2 - r2) / y; V-2 = (Q1 - r1 - y*r2)/y^2; V-3 = (Q0 -r0 -y*r1 - y^2*r2)/y^3 251 | r1 + y*r2 = R1 - y^2*R3 252 | Q1 = r1+y*r2 + y^2*Q3 253 | """ 254 | for i in xrange(self.batch_size): 255 | q_value = q_values[i] 256 | if self.two_train: 257 | # does nothing first 258 | states2[i] = self.data_set.center_imgs[i] 259 | actions2[i] = self.data_set.center_actions[i] 260 | targets2[i] = q_value 261 | center_position = int(self.data_set.center_positions[i]) 262 | if self.data_set.terminal.take(center_position, mode='wrap'): 263 | states1[i] = self.data_set.center_imgs[i] 264 | actions1[i] = 
self.data_set.center_actions[i] 265 | targets1[i] = self.data_set.center_return_values[i] 266 | continue 267 | forward_targets = np.zeros(self.transition_len, dtype=np.float32) 268 | backward_targets = np.zeros(self.transition_len, dtype=np.float32) 269 | for j in xrange(self.transition_len): 270 | if j > 0 and self.data_set.forward_positions[i, j] == center_position + 1: 271 | forward_targets[j] = q_value 272 | else: 273 | if not self.double_dqn: 274 | forward_targets[j] = self.data_set.center_return_values[i] - \ 275 | self.data_set.forward_return_values[i, j] * self.data_set.forward_discounts[i, j] + \ 276 | self.data_set.forward_discounts[i, j] * \ 277 | np.max(target_q_table[i, j]) 278 | else: 279 | forward_targets[j] = self.data_set.center_return_values[i] - \ 280 | self.data_set.forward_return_values[i, j] * self.data_set.forward_discounts[i, j] + \ 281 | self.data_set.forward_discounts[i, j] * target_double_q_table[i, j] 282 | """ for integrity""" 283 | if self.verbose == 1: 284 | end = self. data_set.forward_positions[i, j] 285 | discount = 1.0 286 | cumulative_reward = 0.0 287 | for k in range(center_position, end): 288 | cumulative_reward += discount * self.data_set.rewards.take(k, mode='wrap') 289 | discount *= self.discount 290 | cumulative_reward += discount * np.max(target_q_table[i, j]) 291 | if not np.isclose(cumulative_reward, forward_targets[j], atol=0.000001): 292 | print self.data_set.backward_positions[i], self.data_set.center_positions[i], self.data_set.forward_positions[i] 293 | print self.data_set.start_index.take(k, mode='wrap'), self.data_set.terminal_index.take(k, mode='wrap') 294 | print 'center return=', self.data_set.center_return_values[i], 'forward return=', \ 295 | self.data_set.forward_return_values[i,j], 'forward discount=', self.data_set.forward_discounts[i, j] 296 | end = self.data_set.forward_positions[i, j] 297 | discount = 1.0 298 | cumulative_reward = 0.0 299 | for k in range(center_position, end): 300 | cumulative_reward += discount * self.data_set.rewards.take(k, mode='wrap') 301 | print k, 'cumulative=', cumulative_reward, 'discount=', discount, 'reward=', self.data_set.rewards.take(k, mode='wrap'), \ 302 | 'return=', self.data_set.return_value.take(k, mode='wrap') 303 | print '\t start=', self.data_set.start_index.take(k, mode='wrap'), 'terminal=', self.data_set.terminal_index.take(k, mode='wrap') 304 | discount *= self.discount 305 | cumulative_reward += discount * np.max(target_q_table[i, j]) 306 | print 'final cumulative=', cumulative_reward, 'target=', forward_targets[j], \ 307 | 'maxQ=', np.max(target_q_table[i, j]) 308 | raw_input() 309 | 310 | if self.data_set.backward_positions[i, j] == center_position + 1: 311 | backward_targets[j] = q_value 312 | else: 313 | backward_targets[j] = (-self.data_set.backward_return_values[i, j] + 314 | self.data_set.backward_discounts[i, j] * self.data_set.center_return_values[i] + 315 | target_q_table[i, self.transition_len + j, self.data_set.backward_actions[i, j]]) /\ 316 | self.data_set.backward_discounts[i, j] 317 | """ for integrity""" 318 | if self.verbose == 1: 319 | end = self.data_set.backward_positions[i, j] 320 | discount = 1.0 321 | cumulative_reward = 0.0 322 | for k in range(end, center_position): 323 | cumulative_reward += self.data_set.rewards.take(k, mode='wrap') * discount 324 | discount *= self.discount 325 | cumulative_reward = (-cumulative_reward + target_q_table[i, self.transition_len + j, self.data_set.actions.take(end, mode='wrap')])/discount 326 | if not 
np.isclose(cumulative_reward, backward_targets[j], atol=0.000001): 327 | print self.data_set.backward_positions[i], self.data_set.center_positions[i], self.data_set.forward_positions[i] 328 | print self.data_set.start_index.take(k, mode='wrap'), self.data_set.terminal_index.take(k, mode='wrap') 329 | print 'center return=', self.data_set.center_return_values[i], 'backward return=', \ 330 | self.data_set.backward_return_values[i,j], 'backward discount=', self.data_set.backward_discounts[i, j] 331 | end = self.data_set.backward_positions[i, j] 332 | discount = 1.0 333 | cumulative_reward = 0.0 334 | for k in range(end, center_position): 335 | cumulative_reward += self.data_set.rewards.take(k, mode='wrap') * discount 336 | print k, 'cumulative=', cumulative_reward, 'discount=', discount, 'reward=', self.data_set.rewards.take(k, mode='wrap'),\ 337 | 'return=', self.data_set.return_value.take(k, mode='wrap') 338 | print '\t start=', self.data_set.start_index.take(k, mode='wrap'), 'terminal=', self.data_set.terminal_index.take(k, mode='wrap') 339 | discount *= self.discount 340 | cumulative_reward = (-cumulative_reward + target_q_table[i, self.transition_len + j, self.data_set.actions.take(end, mode='wrap')])/discount 341 | print 'final cumulative=', cumulative_reward, 'target=', backward_targets[j], \ 342 | 'Q=', target_q_table[i, self.transition_len + j, self.data_set.backward_actions[i, j]] 343 | raw_input() 344 | 345 | forward_targets = np.append(forward_targets, self.data_set.center_return_values[i]) 346 | v0 = v1 = forward_targets[0] 347 | if self.penalty_method == 'max': 348 | v_max = np.max(forward_targets[1:]) 349 | v_min = np.min(backward_targets) 350 | if self.two_train and v_min < q_value: 351 | v_min_index = np.argmin(backward_targets) 352 | states2[i] = self.data_set.backward_imgs[i, v_min_index] 353 | actions2[i] = self.data_set.backward_actions[i, v_min_index] 354 | targets2[i] = self.data_set.backward_return_values[i, v_min_index] - \ 355 | self.data_set.backward_discounts[i, v_min_index] * self.data_set.center_return_values[i] + \ 356 | self.data_set.backward_discounts[i, v_min_index] * q_value 357 | if ((self.late2 and self.weight == self.weight_min) or (not self.late2)) \ 358 | and (v_max - 0.1 > q_value > v_min + 0.1): 359 | v1 = v_max * 0.5 + v_min * 0.5 360 | elif v_max - 0.1 > q_value: 361 | v1 = v_max 362 | elif ((self.late2 and self.weight == self.weight_min) or (not self.late2)) and v_min + 0.1 < q_value: 363 | v1 = v_min 364 | 365 | states1[i] = self.data_set.center_imgs[i] 366 | actions1[i] = self.data_set.center_actions[i] 367 | targets1[i] = v0 * self.weight + (1-self.weight) * v1 368 | 369 | if self.two_train: 370 | if self.same_update: 371 | self.network.train(states2, actions2, targets2) 372 | else: 373 | self.network.train2(states2, actions2, targets2) 374 | loss = self.network.train(states1, actions1, targets1) 375 | # if self.recording_tot < recording_size: 376 | # pass 377 | # for i in range(self.network.batch_size): 378 | # self._update_recording_file(output[i], target[i], return_value[i], loss) 379 | return loss 380 | 381 | def step(self, reward, observation): 382 | """ 383 | This method is called each time step. 384 | 385 | Arguments: 386 | reward - Real valued reward. 387 | observation - A height x width numpy array 388 | 389 | Returns: 390 | An integer action. 
391 | 392 | """ 393 | 394 | self.step_counter += 1 395 | self.episode_reward += reward 396 | 397 | # TESTING--------------------------- 398 | if self.testing: 399 | action = self._choose_action(self.test_data_set, 0.05, 400 | observation, np.clip(reward, -1, 1)) 401 | 402 | # NOT TESTING--------------------------- 403 | else: 404 | if len(self.data_set) > self.replay_start_size: 405 | self.epsilon = max(self.epsilon_min, 406 | self.epsilon - self.epsilon_rate) 407 | self.weight = max(self.weight_min, 408 | self.weight - self.weight_decay) 409 | 410 | action = self._choose_action(self.data_set, self.epsilon, 411 | observation, 412 | np.clip(reward, -1, 1)) 413 | 414 | if self.step_counter % self.update_frequency == 0: 415 | loss = self._do_training() 416 | self.batch_counter += 1 417 | self.loss_averages.append(loss) 418 | 419 | else: # Still gathering initial random data... 420 | action = self._choose_action(self.data_set, self.epsilon, 421 | observation, 422 | np.clip(reward, -1, 1)) 423 | 424 | self.last_action = action 425 | self.last_img = observation 426 | 427 | return action 428 | 429 | def end_episode(self, reward, terminal=True): 430 | """ 431 | This function is called once at the end of an episode. 432 | 433 | Arguments: 434 | reward - Real valued reward. 435 | terminal - Whether the episode ended intrinsically 436 | (ie we didn't run out of steps) 437 | Returns: 438 | None 439 | """ 440 | 441 | self.episode_reward += reward 442 | self.step_counter += 1 443 | total_time = time.time() - self.start_time 444 | 445 | if self.testing: 446 | # If we run out of time, only count the last episode if 447 | # it was the only episode. 448 | if terminal or self.episode_counter == 0: 449 | self.episode_counter += 1 450 | self.total_reward += self.episode_reward 451 | else: 452 | # Store the latest sample. 453 | self.data_set.add_sample(self.last_img, 454 | self.last_action, 455 | np.clip(reward, -1, 1), 456 | True, start_index=self.start_index) 457 | """update""" 458 | if terminal: 459 | q_return = 0. 460 | else: 461 | phi = self.data_set.phi(self.last_img) 462 | q_return = np.mean(self.network.q_vals(phi)) 463 | # last_q_return = -1.0 464 | self.start_index = self.data_set.top 465 | self.terminal_index = index = (self.start_index-1) % self.data_set.max_steps 466 | while True: 467 | q_return = q_return * self.network.discount + self.data_set.rewards[index] 468 | self.data_set.return_value[index] = q_return 469 | self.data_set.terminal_index[index] = self.terminal_index 470 | index = (index-1) % self.data_set.max_steps 471 | if self.data_set.terminal[index] or index == self.data_set.bottom: 472 | break 473 | 474 | rho = 0.98 475 | self.steps_sec_ema *= rho 476 | self.steps_sec_ema += (1. 
- rho) * (self.step_counter/total_time) 477 | 478 | logging.info("steps/second: {:.2f}, avg: {:.2f}".format( 479 | self.step_counter/total_time, self.steps_sec_ema)) 480 | 481 | if self.batch_counter > 0: 482 | self._update_learning_file() 483 | logging.info("average loss: {:.4f}".format(\ 484 | np.mean(self.loss_averages))) 485 | 486 | def finish_epoch(self, epoch): 487 | if self.save_pkl: 488 | net_file = open(self.exp_dir + '/network_file_' + str(epoch) + '.pkl', 'w') 489 | cPickle.dump(self.network, net_file, -1) 490 | net_file.close() 491 | this_time = time.time() 492 | self.total_time = this_time-self.program_start_time 493 | self.epoch_time = this_time-self.last_count_time 494 | 495 | def start_testing(self): 496 | self.testing = True 497 | self.total_reward = 0 498 | self.episode_counter = 0 499 | 500 | def finish_testing(self, epoch): 501 | self.testing = False 502 | holdout_size = 3200 503 | 504 | if self.holdout_data is None and len(self.data_set) > holdout_size: 505 | imgs = self.data_set.random_imgs(holdout_size) 506 | self.holdout_data = imgs[:, :self.phi_length] 507 | 508 | holdout_sum = 0 509 | if self.holdout_data is not None: 510 | for i in range(holdout_size): 511 | holdout_sum += np.max( 512 | self.network.q_vals(self.holdout_data[i])) 513 | 514 | self._update_results_file(epoch, self.episode_counter, 515 | holdout_sum / holdout_size) 516 | 517 | -------------------------------------------------------------------------------- /code/ale_data_set.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | __author__ = 'frankhe' 3 | 4 | import numpy as np 5 | import time 6 | import theano 7 | 8 | floatX = theano.config.floatX 9 | 10 | 11 | class DataSet(object): 12 | def __init__(self, width, height, rng, max_steps=1000000, phi_length=4, discount=0.99, batch_size=32, 13 | transitions_len=4): 14 | self.width = width 15 | self.height = height 16 | self.max_steps = max_steps 17 | self.phi_length = phi_length 18 | self.rng = rng 19 | self.discount = discount 20 | self.discount_table = np.power(self.discount, np.arange(30)) 21 | 22 | self.imgs = np.zeros((max_steps, height, width), dtype='uint8') 23 | self.actions = np.zeros(max_steps, dtype='int32') 24 | self.rewards = np.zeros(max_steps, dtype=floatX) 25 | self.return_value = np.zeros(max_steps, dtype=floatX) 26 | self.terminal = np.zeros(max_steps, dtype='bool') 27 | self.terminal_index = np.zeros(max_steps, dtype='int32') 28 | self.start_index = np.zeros(max_steps, dtype='int32') 29 | 30 | self.bottom = 0 31 | self.top = 0 32 | self.size = 0 33 | 34 | self.center_imgs = np.zeros((batch_size, 35 | self.phi_length, 36 | self.height, 37 | self.width), 38 | dtype='uint8') 39 | self.forward_imgs = np.zeros((batch_size, 40 | transitions_len, 41 | self.phi_length, 42 | self.height, 43 | self.width), 44 | dtype='uint8') 45 | self.backward_imgs = np.zeros((batch_size, 46 | transitions_len, 47 | self.phi_length, 48 | self.height, 49 | self.width), 50 | dtype='uint8') 51 | self.center_positions = np.zeros((batch_size, 1), dtype='int32') 52 | self.forward_positions = np.zeros((batch_size, transitions_len), dtype='int32') 53 | self.backward_positions = np.zeros((batch_size, transitions_len), dtype='int32') 54 | 55 | self.center_actions = np.zeros((batch_size, 1), dtype='int32') 56 | self.backward_actions = np.zeros((batch_size, transitions_len), dtype='int32') 57 | 58 | self.center_terminals = np.zeros((batch_size, 1), dtype='bool') 59 | self.center_rewards = np.zeros((batch_size, 
1), dtype=floatX) 60 | 61 | self.center_return_values = np.zeros((batch_size, 1), dtype=floatX) 62 | self.forward_return_values = np.zeros((batch_size, transitions_len), dtype=floatX) 63 | self.backward_return_values = np.zeros((batch_size, transitions_len), dtype=floatX) 64 | 65 | self.forward_discounts = np.zeros((batch_size, transitions_len), dtype=floatX) 66 | self.backward_discounts = np.zeros((batch_size, transitions_len), dtype=floatX) 67 | 68 | def add_sample(self, img, action, reward, terminal, return_value=0.0, start_index=-1): 69 | 70 | self.imgs[self.top] = img 71 | self.actions[self.top] = action 72 | self.rewards[self.top] = reward 73 | self.terminal[self.top] = terminal 74 | self.return_value[self.top] = return_value 75 | self.start_index[self.top] = start_index 76 | self.terminal_index[self.top] = -1 77 | 78 | if self.size == self.max_steps: 79 | self.bottom = (self.bottom + 1) % self.max_steps 80 | else: 81 | self.size += 1 82 | self.top = (self.top + 1) % self.max_steps 83 | 84 | def __len__(self): 85 | return self.size 86 | 87 | def last_phi(self): 88 | """Return the most recent phi (sequence of image frames).""" 89 | indexes = np.arange(self.top - self.phi_length, self.top) 90 | return self.imgs.take(indexes, axis=0, mode='wrap') 91 | 92 | def phi(self, img): 93 | """Return a phi (sequence of image frames), using the last phi_length - 94 | 1, plus img. 95 | 96 | """ 97 | indexes = np.arange(self.top - self.phi_length + 1, self.top) 98 | 99 | phi = np.empty((self.phi_length, self.height, self.width), dtype='uint8') 100 | phi[0:self.phi_length - 1] = self.imgs.take(indexes, 101 | axis=0, 102 | mode='wrap') 103 | phi[-1] = img 104 | return phi 105 | 106 | def random_close_transitions_batch(self, batch_size, transitions_len): 107 | transition_range = transitions_len 108 | count = 0 109 | while count < batch_size: 110 | index = self.rng.randint(self.bottom, 111 | self.bottom + self.size - self.phi_length) 112 | 113 | all_indices = np.arange(index, index + self.phi_length) 114 | center_index = index + self.phi_length - 1 115 | """ 116 | frame0 frame1 frame2 frame3 117 | index center_index = index+phi-1 118 | """ 119 | if np.any(self.terminal.take(all_indices[0:-1], mode='wrap')): 120 | continue 121 | if np.any(self.terminal_index.take(all_indices, mode='wrap') == -1): 122 | continue 123 | terminal_index = self.terminal_index.take(center_index, mode='wrap') 124 | start_index = self.start_index.take(center_index, mode='wrap') 125 | self.center_positions[count] = center_index 126 | self.center_terminals[count] = self.terminal.take(center_index, mode='wrap') 127 | self.center_rewards[count] = self.rewards.take(center_index, mode='wrap') 128 | 129 | """ get forward transitions """ 130 | if terminal_index < center_index: 131 | terminal_index += self.size 132 | max_forward_index = max(min(center_index + transition_range, terminal_index), center_index+1) + 1 133 | self.forward_positions[count] = center_index + 1 134 | for i, j in zip(range(transitions_len), range(center_index + 1, max_forward_index)): 135 | self.forward_positions[count, i] = j 136 | """ get backward transitions """ 137 | if start_index + self.size < center_index: 138 | start_index += self.size 139 | min_backward_index = max(center_index - transition_range, start_index+self.phi_length-1) 140 | self.backward_positions[count] = center_index + 1 141 | for i, j in zip(range(transitions_len), range(center_index - 1, min_backward_index - 1, -1)): 142 | self.backward_positions[count, i] = j 143 | if 
self.terminal_index.take(j, mode='wrap') == -1: 144 | self.backward_positions[count, i] = center_index + 1 145 | 146 | self.center_imgs[count] = self.imgs.take(all_indices, axis=0, mode='wrap') 147 | for j in xrange(transitions_len): 148 | forward_index = self.forward_positions[count, j] 149 | backward_index = self.backward_positions[count, j] 150 | self.forward_imgs[count, j] = self.imgs.take( 151 | np.arange(forward_index - self.phi_length + 1, forward_index + 1), axis=0, mode='wrap') 152 | self.backward_imgs[count, j] = self.imgs.take( 153 | np.arange(backward_index - self.phi_length + 1, backward_index + 1), axis=0, mode='wrap') 154 | self.center_actions[count] = self.actions.take(center_index, mode='wrap') 155 | self.backward_actions[count] = self.actions.take(self.backward_positions[count], mode='wrap') 156 | self.center_return_values[count] = self.return_value.take(center_index, mode='wrap') 157 | self.forward_return_values[count] = self.return_value.take(self.forward_positions[count], mode='wrap') 158 | self.backward_return_values[count] = self.return_value.take(self.backward_positions[count], mode='wrap') 159 | distance = np.absolute(self.forward_positions[count] - center_index) 160 | self.forward_discounts[count] = self.discount_table[distance] 161 | distance = np.absolute(self.backward_positions[count] - center_index) 162 | self.backward_discounts[count] = self.discount_table[distance] 163 | # print self.backward_positions[count][::-1], self.center_positions[count], self.forward_positions[count] 164 | # print 'start=', start_index, 'center=', self.center_positions[count], 'end=', terminal_index 165 | # raw_input() 166 | count += 1 167 | 168 | def random_transitions_batch(self, batch_size, transitions_len, transition_range=10): 169 | count = 0 170 | while count < batch_size: 171 | index = self.rng.randint(self.bottom, 172 | self.bottom + self.size - self.phi_length) 173 | 174 | all_indices = np.arange(index, index + self.phi_length) 175 | center_index = index + self.phi_length - 1 176 | """ 177 | frame0 frame1 frame2 frame3 178 | index center_index = index+phi-1 179 | """ 180 | if np.any(self.terminal.take(all_indices[0:-1], mode='wrap')): 181 | continue 182 | if np.any(self.terminal_index.take(all_indices, mode='wrap') == -1): 183 | continue 184 | terminal_index = self.terminal_index.take(center_index, mode='wrap') 185 | start_index = self.start_index.take(center_index, mode='wrap') 186 | self.center_positions[count] = center_index 187 | self.center_terminals[count] = self.terminal.take(center_index, mode='wrap') 188 | self.center_rewards[count] = self.rewards.take(center_index, mode='wrap') 189 | 190 | """ get forward transitions """ 191 | if terminal_index < center_index: 192 | terminal_index += self.size 193 | max_forward_index = max(min(center_index + transition_range, terminal_index), center_index+1) + 1 194 | self.forward_positions[count, 0] = center_index+1 195 | if center_index + 2 >= max_forward_index: 196 | self.forward_positions[count, 1:] = center_index + 1 197 | else: 198 | self.forward_positions[count, 1:] = self.rng.randint(center_index+2, max_forward_index, transitions_len-1) 199 | """ get backward transitions """ 200 | 201 | if start_index + self.size < center_index: 202 | start_index += self.size 203 | min_backward_index = max(center_index - transition_range, start_index+self.phi_length-1) 204 | if min_backward_index >= center_index: 205 | self.backward_positions[count] = [center_index + 1] * transitions_len 206 | else: 207 | if center_index > self.top > 
min_backward_index: 208 | min_backward_index = self.top 209 | self.backward_positions[count] = self.rng.randint(min_backward_index, center_index, transitions_len) 210 | 211 | self.center_imgs[count] = self.imgs.take(all_indices, axis=0, mode='wrap') 212 | for j in xrange(transitions_len): 213 | forward_index = self.forward_positions[count, j] 214 | backward_index = self.backward_positions[count, j] 215 | self.forward_imgs[count, j] = self.imgs.take( 216 | np.arange(forward_index - self.phi_length + 1, forward_index + 1), axis=0, mode='wrap') 217 | self.backward_imgs[count, j] = self.imgs.take( 218 | np.arange(backward_index - self.phi_length + 1, backward_index + 1), axis=0, mode='wrap') 219 | self.center_actions[count] = self.actions.take(center_index, mode='wrap') 220 | self.backward_actions[count] = self.actions.take(self.backward_positions[count], mode='wrap') 221 | self.center_return_values[count] = self.return_value.take(center_index, mode='wrap') 222 | self.forward_return_values[count] = self.return_value.take(self.forward_positions[count], mode='wrap') 223 | self.backward_return_values[count] = self.return_value.take(self.backward_positions[count], mode='wrap') 224 | distance = np.absolute(self.forward_positions[count] - center_index) 225 | self.forward_discounts[count] = self.discount_table[distance] 226 | distance = np.absolute(self.backward_positions[count] - center_index) 227 | self.backward_discounts[count] = self.discount_table[distance] 228 | # print self.backward_positions[count][::-1], self.center_positions[count], self.forward_positions[count] 229 | # print 'start=', start_index, 'center=', self.center_positions[count], 'end=', terminal_index 230 | # raw_input() 231 | count += 1 232 | 233 | def random_imgs(self, size): 234 | imgs = np.zeros((size, 235 | self.phi_length + 1, 236 | self.height, 237 | self.width), 238 | dtype='uint8') 239 | 240 | count = 0 241 | while count < size: 242 | index = self.rng.randint(self.bottom, 243 | self.bottom + self.size - self.phi_length) 244 | all_indices = np.arange(index, index + self.phi_length + 1) 245 | end_index = index + self.phi_length - 1 246 | if np.any(self.terminal.take(all_indices[0:-2], mode='wrap')): 247 | continue 248 | imgs[count] = self.imgs.take(all_indices, axis=0, mode='wrap') 249 | count += 1 250 | return imgs 251 | 252 | -------------------------------------------------------------------------------- /code/ale_experiment.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | __author__ = 'frankhe' 3 | 4 | import logging 5 | import numpy as np 6 | import image_preprocessing 7 | 8 | # Number of rows to crop off the bottom of the (downsampled) screen. 9 | # This is appropriate for breakout, but it may need to be modified 10 | # for other games. 
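# (Note: assuming the usual 84x84 resize defaults, resize_image() below first
# rescales the 210x160 ALE frame to roughly 110x84 and then keeps an 84-row
# window ending CROP_OFFSET rows above the bottom of that rescaled image.)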
11 | CROP_OFFSET = 8 12 | 13 | 14 | class ALEExperiment(object): 15 | def __init__(self, ale, agent, resized_width, resized_height, 16 | resize_method, num_epochs, epoch_length, test_length, 17 | frame_skip, death_ends_episode, max_start_nullops, rng, flickering_buffer_size): 18 | self.ale = ale 19 | self.agent = agent 20 | self.num_epochs = num_epochs 21 | self.epoch_length = epoch_length 22 | self.test_length = test_length 23 | self.frame_skip = frame_skip 24 | self.death_ends_episode = death_ends_episode 25 | self.min_action_set = ale.getMinimalActionSet() 26 | self.resized_width = resized_width 27 | self.resized_height = resized_height 28 | self.resize_method = resize_method 29 | self.width, self.height = ale.getScreenDims() 30 | 31 | self.buffer_length = flickering_buffer_size 32 | self.buffer_count = 0 33 | self.screen_buffer = np.empty((self.buffer_length, 34 | self.height, self.width), 35 | dtype=np.uint8) 36 | 37 | self.terminal_lol = False # Most recent episode ended on a loss of life 38 | self.max_start_nullops = max_start_nullops 39 | self.rng = rng 40 | 41 | def run(self): 42 | """ 43 | Run the desired number of training epochs, a testing epoch 44 | is conducted after each training epoch. 45 | """ 46 | self.agent.time_count_start() 47 | for epoch in range(1, self.num_epochs + 1): 48 | self.run_epoch(epoch, self.epoch_length) 49 | self.agent.finish_epoch(epoch) 50 | 51 | if self.test_length > 0: 52 | self.agent.start_testing() 53 | self.run_epoch(epoch, self.test_length, True) 54 | self.agent.finish_testing(epoch) 55 | 56 | def run_epoch(self, epoch, num_steps, testing=False): 57 | """ Run one 'epoch' of training or testing, where an epoch is defined 58 | by the number of steps executed. Prints a progress report after 59 | every trial 60 | 61 | Arguments: 62 | epoch - the current epoch number 63 | num_steps - steps per epoch 64 | testing - True if this Epoch is used for testing and not training 65 | 66 | """ 67 | self.terminal_lol = False # Make sure each epoch starts with a reset. 68 | steps_left = num_steps 69 | while steps_left > 0: 70 | prefix = "testing" if testing else "training" 71 | logging.info(prefix + " epoch: " + str(epoch) + " steps_left: " + 72 | str(steps_left)) 73 | _, num_steps = self.run_episode(steps_left, testing) 74 | 75 | steps_left -= num_steps 76 | 77 | def _init_episode(self): 78 | """ This method resets the game if needed, performs enough null 79 | actions to ensure that the screen buffer is ready and optionally 80 | performs a randomly determined number of null action to randomize 81 | the initial game state.""" 82 | 83 | if not self.terminal_lol or self.ale.game_over(): 84 | self.ale.reset_game() 85 | 86 | if self.max_start_nullops > 0: 87 | random_actions = self.rng.randint(self.buffer_length-2, self.max_start_nullops+1) 88 | for _ in range(random_actions): 89 | self._act(0) # Null action 90 | 91 | # Make sure the screen buffer is filled at the beginning of 92 | # each episode... 93 | self._act(0) 94 | self._act(0) 95 | 96 | def _act(self, action): 97 | """Perform the indicated action for a single frame, return the 98 | resulting reward and store the resulting screen image in the 99 | buffer 100 | 101 | """ 102 | reward = self.ale.act(action) 103 | index = self.buffer_count % self.buffer_length 104 | 105 | self.ale.getScreenGrayscale(self.screen_buffer[index, ...]) 106 | 107 | self.buffer_count += 1 108 | return reward 109 | 110 | def _step(self, action): 111 | """ Repeat one action the appopriate number of times and return 112 | the summed reward. 
""" 113 | reward = 0 114 | for _ in range(self.frame_skip): 115 | reward += self._act(action) 116 | 117 | return reward 118 | 119 | def run_episode(self, max_steps, testing): 120 | """Run a single training episode. 121 | 122 | The boolean terminal value returned indicates whether the 123 | episode ended because the game ended or the agent died (True) 124 | or because the maximum number of steps was reached (False). 125 | Currently this value will be ignored. 126 | 127 | Return: (terminal, num_steps) 128 | 129 | """ 130 | 131 | self._init_episode() 132 | 133 | start_lives = self.ale.lives() 134 | 135 | action = self.agent.start_episode(self.get_observation()) 136 | num_steps = 0 137 | while True: 138 | reward = self._step(self.min_action_set[action]) 139 | self.terminal_lol = (self.death_ends_episode and not testing and 140 | self.ale.lives() < start_lives) 141 | terminal = self.ale.game_over() or self.terminal_lol 142 | num_steps += 1 143 | 144 | if terminal or num_steps >= max_steps: 145 | self.agent.end_episode(reward, terminal) 146 | break 147 | 148 | action = self.agent.step(reward, self.get_observation()) 149 | return terminal, num_steps 150 | 151 | def get_observation(self): 152 | """ Resize and merge the previous two screen images """ 153 | 154 | assert self.buffer_count >= self.buffer_length 155 | index = self.buffer_count % self.buffer_length - 1 156 | # max_image = np.maximum(self.screen_buffer[index, ...], 157 | # self.screen_buffer[index - 1, ...]) 158 | max_image = self.screen_buffer[index] 159 | for i in range(self.buffer_length): 160 | max_image = np.maximum(max_image, self.screen_buffer[index-i, ...]) 161 | return self.resize_image(max_image) 162 | 163 | def resize_image(self, image): 164 | """ Appropriately resize a single image """ 165 | 166 | if self.resize_method == 'crop': 167 | # resize keeping aspect ratio 168 | resize_height = int(round( 169 | float(self.height) * self.resized_width / self.width)) 170 | 171 | resized = image_preprocessing.resize(image, (self.resized_width, resize_height)) 172 | 173 | # Crop the part we want 174 | crop_y_cutoff = resize_height - CROP_OFFSET - self.resized_height 175 | cropped = resized[crop_y_cutoff: 176 | crop_y_cutoff + self.resized_height, :] 177 | 178 | return cropped 179 | elif self.resize_method == 'scale': 180 | return image_preprocessing.resize(image, (self.resized_width, self.resized_height)) 181 | else: 182 | raise ValueError('Unrecognized image resize method.') 183 | 184 | -------------------------------------------------------------------------------- /code/image_preprocessing.py: -------------------------------------------------------------------------------- 1 | #! 
/usr/bin/env python 2 | __author__ = 'frankhe' 3 | 4 | import numpy as np 5 | import matplotlib.pyplot as plt 6 | import matplotlib.image as mpimg 7 | import scipy.misc 8 | import cPickle 9 | 10 | 11 | def rgb2gray(rgb): 12 | return np.dot(rgb[..., :3], [0.299, 0.587, 0.114]) 13 | 14 | 15 | def resize(image, size): 16 | return scipy.misc.imresize(image, size=size) 17 | 18 | 19 | def imshow(photo, gray=False): 20 | if gray: 21 | plt.imshow(photo, cmap = plt.get_cmap('gray')) 22 | else: 23 | plt.imshow(photo) 24 | plt.show() 25 | 26 | 27 | def show_wall_paper(): 28 | img = mpimg.imread('wallpaper.jpg') 29 | gray = rgb2gray(img) 30 | gray = resize(gray, (1000, 1000)) 31 | imshow(gray, True) 32 | 33 | if __name__ == '__main__': 34 | f1 = open('game_images', mode='rb') 35 | images = cPickle.load(f1) 36 | print images[0].size, images[0].shape 37 | for i in range(1, len(images)): 38 | # imshow(images[i]) 39 | # print np.sum(images[i]-images[i-1]) 40 | raw_input() 41 | 42 | -------------------------------------------------------------------------------- /code/launcher.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | __author__ = 'frankhe' 3 | 4 | import os 5 | import argparse 6 | import logging 7 | try: 8 | import ale_python_interface 9 | except ImportError: 10 | import atari_py.ale_python_interface as ale_python_interface 11 | import cPickle 12 | import numpy as np 13 | import theano 14 | import time 15 | import ale_experiment 16 | import q_network 17 | import ale_agents 18 | 19 | 20 | def process_args(args, defaults, description): 21 | """ 22 | Handle the command line. 23 | 24 | args - list of command line arguments (not including executable name) 25 | defaults - a name space with variables corresponding to each of 26 | the required default command line values. 27 | description - a string to display at the top of the help message. 
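    Returns the parsed parameters as an argparse namespace, with the
    experiment prefix, death_ends_episode and freeze_interval fields
    post-processed below.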
28 | """ 29 | parser = argparse.ArgumentParser(description=description) 30 | parser.add_argument('-r', '--rom', dest="rom", default=defaults.ROM, 31 | help='ROM to run (default: %(default)s)') 32 | parser.add_argument('-e', '--epochs', dest="epochs", type=int, 33 | default=defaults.EPOCHS, 34 | help='Number of training epochs (default: %(default)s)') 35 | parser.add_argument('-s', '--steps-per-epoch', dest="steps_per_epoch", 36 | type=int, default=defaults.STEPS_PER_EPOCH, 37 | help='Number of steps per epoch (default: %(default)s)') 38 | parser.add_argument('-t', '--test-length', dest="steps_per_test", 39 | type=int, default=defaults.STEPS_PER_TEST, 40 | help='Number of steps per test (default: %(default)s)') 41 | parser.add_argument('--display-screen', dest="display_screen", 42 | action='store_true', default=False, 43 | help='Show the game screen.') 44 | parser.add_argument('--double-dqn', dest="double_dqn", 45 | action='store_true', default=False, 46 | help='enable double DQN') 47 | parser.add_argument('--experiment-prefix', dest="experiment_prefix", 48 | default=None, 49 | help='Experiment name prefix ' 50 | '(default is the name of the game)') 51 | parser.add_argument('--frame-skip', dest="frame_skip", 52 | default=defaults.FRAME_SKIP, type=int, 53 | help='Every how many frames to process ' 54 | '(default: %(default)s)') 55 | parser.add_argument('--repeat-action-probability', 56 | dest="repeat_action_probability", 57 | default=defaults.REPEAT_ACTION_PROBABILITY, type=float, 58 | help=('Probability that action choice will be ' + 59 | 'ignored (default: %(default)s)')) 60 | parser.add_argument('--update-rule', dest="update_rule", 61 | type=str, default=defaults.UPDATE_RULE, 62 | help=('deepmind_rmsprop|rmsprop|sgd ' + 63 | '(default: %(default)s)')) 64 | parser.add_argument('--batch-accumulator', dest="batch_accumulator", 65 | type=str, default=defaults.BATCH_ACCUMULATOR, 66 | help=('sum|mean (default: %(default)s)')) 67 | parser.add_argument('--learning-rate', dest="learning_rate", 68 | type=float, default=defaults.LEARNING_RATE, 69 | help='Learning rate (default: %(default)s)') 70 | parser.add_argument('--rms-decay', dest="rms_decay", 71 | type=float, default=defaults.RMS_DECAY, 72 | help='Decay rate for rms_prop (default: %(default)s)') 73 | parser.add_argument('--rms-epsilon', dest="rms_epsilon", 74 | type=float, default=defaults.RMS_EPSILON, 75 | help='Denominator epsilson for rms_prop ' + 76 | '(default: %(default)s)') 77 | parser.add_argument('--momentum', type=float, default=defaults.MOMENTUM, 78 | help=('Momentum term for Nesterov momentum. '+ 79 | '(default: %(default)s)')) 80 | parser.add_argument('--clip-delta', dest="clip_delta", type=float, 81 | default=defaults.CLIP_DELTA, 82 | help=('Max absolute value for Q-update delta value. ' + 83 | '(default: %(default)s)')) 84 | parser.add_argument('--discount', type=float, default=defaults.DISCOUNT, 85 | help='Discount rate') 86 | parser.add_argument('--epsilon-start', dest="epsilon_start", 87 | type=float, default=defaults.EPSILON_START, 88 | help=('Starting value for epsilon. ' + 89 | '(default: %(default)s)')) 90 | parser.add_argument('--epsilon-min', dest="epsilon_min", 91 | type=float, default=defaults.EPSILON_MIN, 92 | help='Minimum epsilon. (default: %(default)s)') 93 | parser.add_argument('--epsilon-decay', dest="epsilon_decay", 94 | type=float, default=defaults.EPSILON_DECAY, 95 | help=('Number of steps to minimum epsilon. 
' + 96 | '(default: %(default)s)')) 97 | parser.add_argument('--phi-length', dest="phi_length", 98 | type=int, default=defaults.PHI_LENGTH, 99 | help=('Number of recent frames used to represent ' + 100 | 'state. (default: %(default)s)')) 101 | parser.add_argument('--max-history', dest="replay_memory_size", 102 | type=int, default=defaults.REPLAY_MEMORY_SIZE, 103 | help=('Maximum number of steps stored in replay ' + 104 | 'memory. (default: %(default)s)')) 105 | parser.add_argument('--batch-size', dest="batch_size", 106 | type=int, default=defaults.BATCH_SIZE, 107 | help='Batch size. (default: %(default)s)') 108 | parser.add_argument('--network-type', dest="network_type", 109 | type=str, default=defaults.NETWORK_TYPE, 110 | help=('nips_cuda|nips_dnn|nature_cuda|nature_dnn' + 111 | '|linear (default: %(default)s)')) 112 | parser.add_argument('--freeze-interval', dest="freeze_interval", 113 | type=int, default=defaults.FREEZE_INTERVAL, 114 | help=('Interval between target freezes. ' + 115 | '(default: %(default)s)')) 116 | parser.add_argument('--update-frequency', dest="update_frequency", 117 | type=int, default=defaults.UPDATE_FREQUENCY, 118 | help=('Number of actions before each SGD update. '+ 119 | '(default: %(default)s)')) 120 | parser.add_argument('--replay-start-size', dest="replay_start_size", 121 | type=int, default=defaults.REPLAY_START_SIZE, 122 | help=('Number of random steps before training. ' + 123 | '(default: %(default)s)')) 124 | parser.add_argument('--resize-method', dest="resize_method", 125 | type=str, default=defaults.RESIZE_METHOD, 126 | help=('crop|scale (default: %(default)s)')) 127 | parser.add_argument('--nn-file', dest="nn_file", type=str, default=None, 128 | help='Pickle file containing trained net.') 129 | parser.add_argument('--death-ends-episode', dest="death_ends_episode", 130 | type=str, default=defaults.DEATH_ENDS_EPISODE, 131 | help=('true|false (default: %(default)s)')) 132 | parser.add_argument('--max-start-nullops', dest="max_start_nullops", 133 | type=int, default=defaults.MAX_START_NULLOPS, 134 | help=('Maximum number of null-ops at the start ' + 135 | 'of games. (default: %(default)s)')) 136 | parser.add_argument('--deterministic', dest="deterministic", 137 | action='store_false', default=defaults.DETERMINISTIC, 138 | help=('Whether to use deterministic parameters ' + 139 | 'for learning. (default: %(default)s)')) 140 | parser.add_argument('--cudnn_deterministic', dest="cudnn_deterministic", 141 | type=bool, default=defaults.CUDNN_DETERMINISTIC, 142 | help=('Whether to use deterministic backprop. 
' + 143 | '(default: %(default)s)')) 144 | parser.add_argument('--flickering-buffer', dest="flickering_buffer_size", 145 | type=int, default=defaults.FLICKERING_BUFFER_SIZE, 146 | help='anti flickering buffer size') 147 | parser.add_argument('--method', dest="method", 148 | type=str, default=defaults.METHOD, 149 | help='choose different learning algorithms') 150 | parser.add_argument('--transition-len', dest='transition_length', 151 | type=int, default=4, 152 | help='transition length in Optimality Tightening') 153 | parser.add_argument('--transition-range', dest='transition_range', 154 | type=int, default=10, 155 | help='transition range in Optimality Tightening Sampling') 156 | parser.add_argument('--penalty-method', dest='penalty_method', 157 | type=str, default='max', 158 | help='penalty method') 159 | parser.add_argument('--two-train', dest="two_train", 160 | action='store_true', default=False, 161 | help='doing two gradient descents per update') 162 | parser.add_argument('--save-pkl', dest="save_pkl", 163 | action='store_true', default=False, 164 | help='saving network parameters') 165 | parser.add_argument('--weight-min', dest='weight_min', 166 | type=float, default=0.8, 167 | help='weight min for penalty method') 168 | parser.add_argument('--weight-max', dest='weight_max', 169 | type=float, default=0.8, 170 | help='weight max for penalty method') 171 | parser.add_argument('--anneal-len', dest='annealing_len', 172 | type=float, default=5000000, 173 | help='annealing length of penalty method') 174 | parser.add_argument('--close2', dest="close2", 175 | action='store_true', default=False, 176 | help='choose close bounds') 177 | parser.add_argument('--late2', dest="late2", 178 | action='store_true', default=False, 179 | help='delay the penalty') 180 | parser.add_argument('--verbose', dest="verbose", 181 | type=int, default=0, 182 | help='1: check correctness,') 183 | 184 | parameters = parser.parse_args(args) 185 | if parameters.experiment_prefix is None: 186 | name = os.path.splitext(os.path.basename(parameters.rom))[0] 187 | parameters.experiment_prefix = name + time.strftime("_%m-%d-%H-%M-%S_", time.gmtime()) + parameters.method 188 | 189 | if parameters.double_dqn: 190 | parameters.experiment_prefix += '_(double)' 191 | 192 | parameters.experiment_prefix += '_' + str(parameters.learning_rate) + '_(ep' + str(parameters.epochs) + ')' 193 | 194 | if parameters.death_ends_episode == 'true': 195 | parameters.death_ends_episode = True 196 | elif parameters.death_ends_episode == 'false': 197 | parameters.death_ends_episode = False 198 | else: 199 | raise ValueError("--death-ends-episode must be true or false") 200 | 201 | if parameters.freeze_interval > 0: 202 | # This addresses an inconsistency between the Nature paper and 203 | # the Deepmind code. The paper states that the target network 204 | # update frequency is "measured in the number of parameter 205 | # updates". In the code it is actually measured in the number 206 | # of action choices. 207 | parameters.freeze_interval = (parameters.freeze_interval // 208 | parameters.update_frequency) 209 | 210 | return parameters 211 | 212 | 213 | def launch(args, defaults, description): 214 | """ 215 | Execute a complete training run. 
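    Parses the command line, builds the ALE environment, the Q-network and
    the OptimalityTightening agent, and then runs an ALEExperiment.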
216 | """ 217 | 218 | logging.basicConfig(level=logging.INFO) 219 | parameters = process_args(args, defaults, description) 220 | 221 | if parameters.rom.endswith('.bin'): 222 | rom = parameters.rom 223 | else: 224 | rom = "%s.bin" % parameters.rom 225 | full_rom_path = os.path.join(defaults.BASE_ROM_PATH, rom) 226 | 227 | if parameters.deterministic: 228 | rng = np.random.RandomState(123456) 229 | else: 230 | rng = np.random.RandomState() 231 | 232 | if parameters.cudnn_deterministic: 233 | theano.config.dnn.conv.algo_bwd = 'deterministic' 234 | 235 | ale = ale_python_interface.ALEInterface() 236 | ale.setInt('random_seed', rng.randint(1000)) 237 | 238 | if parameters.display_screen: 239 | import sys 240 | if sys.platform == 'darwin': 241 | import pygame 242 | pygame.init() 243 | ale.setBool('sound', False) # Sound doesn't work on OSX 244 | 245 | ale.setBool('display_screen', parameters.display_screen) 246 | ale.setFloat('repeat_action_probability', 247 | parameters.repeat_action_probability) 248 | 249 | ale.loadROM(full_rom_path) 250 | 251 | num_actions = len(ale.getMinimalActionSet()) 252 | 253 | agent = None 254 | 255 | if not parameters.close2: 256 | print 'transition length is ', parameters.transition_length, 'transition range is', parameters.transition_range 257 | if parameters.method == 'ot': 258 | if parameters.nn_file is None: 259 | network = q_network.DeepQLearner(defaults.RESIZED_WIDTH, 260 | defaults.RESIZED_HEIGHT, 261 | num_actions, 262 | parameters.phi_length, 263 | parameters.discount, 264 | parameters.learning_rate, 265 | parameters.rms_decay, 266 | parameters.rms_epsilon, 267 | parameters.momentum, 268 | parameters.clip_delta, 269 | parameters.freeze_interval, 270 | parameters.batch_size, 271 | parameters.network_type, 272 | parameters.update_rule, 273 | parameters.batch_accumulator, 274 | rng, double=parameters.double_dqn, 275 | transition_length=parameters.transition_length) 276 | else: 277 | handle = open(parameters.nn_file, 'r') 278 | network = cPickle.load(handle) 279 | 280 | agent = ale_agents.OptimalityTightening(network, 281 | parameters.epsilon_start, 282 | parameters.epsilon_min, 283 | parameters.epsilon_decay, 284 | parameters.replay_memory_size, 285 | parameters.experiment_prefix, 286 | parameters.update_frequency, 287 | parameters.replay_start_size, 288 | rng, 289 | parameters.transition_length, 290 | parameters.transition_range, 291 | parameters.penalty_method, 292 | parameters.weight_min, 293 | parameters.weight_max, 294 | parameters.annealing_len, 295 | parameters.two_train, 296 | parameters.late2, 297 | parameters.close2, 298 | parameters.verbose, 299 | parameters.double_dqn, 300 | parameters.save_pkl) 301 | 302 | experiment = ale_experiment.ALEExperiment(ale, agent, 303 | defaults.RESIZED_WIDTH, 304 | defaults.RESIZED_HEIGHT, 305 | parameters.resize_method, 306 | parameters.epochs, 307 | parameters.steps_per_epoch, 308 | parameters.steps_per_test, 309 | parameters.frame_skip, 310 | parameters.death_ends_episode, 311 | parameters.max_start_nullops, 312 | rng, 313 | parameters.flickering_buffer_size) 314 | 315 | experiment.run() 316 | 317 | if __name__ == '__main__': 318 | pass 319 | -------------------------------------------------------------------------------- /code/q_network.py: -------------------------------------------------------------------------------- 1 | #! 
/usr/bin/env python 2 | __author__ = 'frankhe' 3 | 4 | import lasagne 5 | import numpy as np 6 | import theano 7 | import theano.tensor as T 8 | from updates import deepmind_rmsprop 9 | 10 | 11 | class DeepQLearner: 12 | def __init__(self, input_width, input_height, num_actions, 13 | num_frames, discount, learning_rate, rho, 14 | rms_epsilon, momentum, clip_delta, freeze_interval, 15 | batch_size, network_type, update_rule, 16 | batch_accumulator, rng, input_scale=255.0, 17 | double=False, transition_length=4): 18 | 19 | if double: 20 | print 'USING DOUBLE DQN' 21 | self.input_width = input_width 22 | self.input_height = input_height 23 | self.num_actions = num_actions 24 | self.num_frames = num_frames 25 | self.batch_size = batch_size 26 | self.discount = discount 27 | self.rho = rho 28 | self.lr = learning_rate 29 | self.rms_epsilon = rms_epsilon 30 | self.momentum = momentum 31 | self.clip_delta = clip_delta 32 | self.freeze_interval = freeze_interval 33 | self.rng = rng 34 | 35 | lasagne.random.set_rng(self.rng) 36 | 37 | self.update_counter = 0 38 | 39 | self.l_out = self.build_network(network_type, input_width, input_height, 40 | num_actions, num_frames, batch_size) 41 | if self.freeze_interval > 0: 42 | self.next_l_out = self.build_network(network_type, input_width, 43 | input_height, num_actions, 44 | num_frames, batch_size) 45 | self.reset_q_hat() 46 | 47 | states = T.tensor4('states_t') 48 | actions = T.icol('actions_t') 49 | target = T.col('evaluation_t') 50 | 51 | self.states_shared = theano.shared( 52 | np.zeros((batch_size, num_frames, input_height, input_width), 53 | dtype=theano.config.floatX)) 54 | self.actions_shared = theano.shared( 55 | np.zeros((batch_size, 1), dtype='int32'), 56 | broadcastable=(False, True)) 57 | self.target_shared = theano.shared( 58 | np.zeros((batch_size, 1), dtype=theano.config.floatX), 59 | broadcastable=(False, True)) 60 | 61 | self.states_transition_shared = theano.shared( 62 | np.zeros((batch_size, transition_length * 2, num_frames, input_height, input_width), 63 | dtype=theano.config.floatX)) 64 | self.states_one_shared = theano.shared( 65 | np.zeros((num_frames, input_height, input_width), 66 | dtype=theano.config.floatX)) 67 | 68 | q_vals = lasagne.layers.get_output(self.l_out, states / input_scale) 69 | 70 | """get Q(s) batch_size = 1 """ 71 | q1_givens = { 72 | states: self.states_one_shared.reshape((1, 73 | self.num_frames, 74 | self.input_height, 75 | self.input_width)) 76 | } 77 | self._q1_vals = theano.function([], q_vals[0], givens=q1_givens) 78 | """get Q(s) batch_size = batch size """ 79 | q_batch_givens = { 80 | states: self.states_shared.reshape((self.batch_size, 81 | self.num_frames, 82 | self.input_height, 83 | self.input_width)) 84 | } 85 | self._q_batch_vals = theano.function([], q_vals, givens=q_batch_givens) 86 | 87 | action_mask = T.eq(T.arange(num_actions).reshape((1, -1)), 88 | actions.reshape((-1, 1))).astype(theano.config.floatX) 89 | 90 | q_s_a = (q_vals * action_mask).sum(axis=1).reshape((-1, 1)) 91 | """ get Q(s,a) batch_size = batch size """ 92 | q_s_a_givens = { 93 | states: self.states_shared.reshape((self.batch_size, 94 | self.num_frames, 95 | self.input_height, 96 | self.input_width)), 97 | actions: self.actions_shared 98 | } 99 | self._q_s_a_vals = theano.function([], q_s_a, givens=q_s_a_givens) 100 | 101 | if self.freeze_interval > 0: 102 | q_target_vals = lasagne.layers.get_output(self.next_l_out, 103 | states / input_scale) 104 | else: 105 | q_target_vals = lasagne.layers.get_output(self.l_out, 106 | 
states / input_scale) 107 | q_target_vals = theano.gradient.disconnected_grad(q_target_vals) 108 | 109 | if not double: 110 | q_target = T.max(q_target_vals, axis=1) 111 | else: 112 | greedy_actions = T.argmax(q_vals, axis=1) 113 | q_target_mask = T.eq(T.arange(num_actions).reshape((1, -1)), 114 | greedy_actions.reshape((-1, 1)).astype(theano.config.floatX)) 115 | q_target = (q_target_vals * q_target_mask).sum(axis=1).reshape((-1, 1)) 116 | """get Q target Q'(s,a') for a batch of transitions batch size = batch_size * transition length""" 117 | q_target_transition_givens = { 118 | states: self.states_transition_shared.reshape( 119 | (batch_size * transition_length * 2, self.num_frames, self.input_height, self.input_width)) 120 | } 121 | self._q_target = theano.function([], q_target.reshape((batch_size, transition_length * 2)), 122 | givens=q_target_transition_givens) 123 | """get Q target_vals Q'(s) for a batch of transitions batch size = batch_size * transition length""" 124 | self._q_target_vals = theano.function([], q_target_vals.reshape( 125 | (batch_size, transition_length * 2, num_actions)), givens=q_target_transition_givens) 126 | 127 | diff = q_s_a - target 128 | 129 | if self.clip_delta > 0: 130 | # If we simply take the squared clipped diff as our loss, 131 | # then the gradient will be zero whenever the diff exceeds 132 | # the clip bounds. To avoid this, we extend the loss 133 | # linearly past the clip point to keep the gradient constant 134 | # in that regime. 135 | # 136 | # This is equivalent to declaring d loss/d q_vals to be 137 | # equal to the clipped diff, then backpropagating from 138 | # there, which is what the DeepMind implementation does. 139 | quadratic_part = T.minimum(abs(diff), self.clip_delta) 140 | linear_part = abs(diff) - quadratic_part 141 | loss = 0.5 * quadratic_part ** 2 + self.clip_delta * linear_part 142 | else: 143 | loss = 0.5 * diff ** 2 144 | 145 | if batch_accumulator == 'sum': 146 | loss = T.sum(loss) 147 | elif batch_accumulator == 'mean': 148 | loss = T.mean(loss) 149 | else: 150 | raise ValueError("Bad accumulator: {}".format(batch_accumulator)) 151 | 152 | params = lasagne.layers.helper.get_all_params(self.l_out) 153 | 154 | if update_rule == 'deepmind_rmsprop': 155 | updates = deepmind_rmsprop(loss, params, self.lr, self.rho, 156 | self.rms_epsilon) 157 | elif update_rule == 'rmsprop': 158 | updates = lasagne.updates.rmsprop(loss, params, self.lr, self.rho, 159 | self.rms_epsilon) 160 | elif update_rule == 'sgd': 161 | updates = lasagne.updates.sgd(loss, params, self.lr) 162 | else: 163 | raise ValueError("Unrecognized update: {}".format(update_rule)) 164 | 165 | if self.momentum > 0: 166 | updates = lasagne.updates.apply_momentum(updates, None, 167 | self.momentum) 168 | """Q(s,a) target train()""" 169 | train_givens = { 170 | states: self.states_shared, 171 | actions: self.actions_shared, 172 | target: self.target_shared 173 | } 174 | self._train = theano.function([], [loss], updates=updates, givens=train_givens, on_unused_input='warn') 175 | 176 | self._train2 = theano.function([], [loss], updates=updates, givens=train_givens, on_unused_input='warn') 177 | 178 | def q_vals(self, single_state): 179 | self.states_one_shared.set_value(single_state) 180 | return self._q1_vals() 181 | 182 | def q_batch_vals(self, states): 183 | self.states_shared.set_value(states) 184 | return self._q_batch_vals() 185 | 186 | def q_s_a_batch_vals(self, states, actions): 187 | self.states_shared.set_value(states) 188 | self.actions_shared.set_value(actions) 
189 | return self._q_s_a_vals() 190 | 191 | def q_target(self, batch_transition_states): 192 | self.states_transition_shared.set_value(batch_transition_states) 193 | return self._q_target() 194 | 195 | def q_target_vals(self, batch_transition_states): 196 | self.states_transition_shared.set_value(batch_transition_states) 197 | return self._q_target_vals() 198 | 199 | def train(self, states, actions, target): 200 | self.states_shared.set_value(states) 201 | self.actions_shared.set_value(actions) 202 | self.target_shared.set_value(target) 203 | if self.freeze_interval > 0 and self.update_counter % self.freeze_interval == 0: 204 | self.reset_q_hat() 205 | loss = self._train() 206 | self.update_counter += 1 207 | return np.sqrt(loss) 208 | 209 | def train2(self, states, actions, target): 210 | self.states_shared.set_value(states) 211 | self.actions_shared.set_value(actions) 212 | self.target_shared.set_value(target) 213 | if self.freeze_interval > 0 and self.update_counter % self.freeze_interval == 0: 214 | self.reset_q_hat() 215 | loss = self._train2() 216 | return np.sqrt(loss) 217 | 218 | def build_network(self, network_type, input_width, input_height, 219 | output_dim, num_frames, batch_size): 220 | if network_type == "nature_cuda": 221 | return self.build_nature_network(input_width, input_height, 222 | output_dim, num_frames, batch_size) 223 | if network_type == "nature_dnn": 224 | return self.build_nature_network_dnn(input_width, input_height, 225 | output_dim, num_frames, 226 | batch_size) 227 | elif network_type == "linear": 228 | return self.build_linear_network(input_width, input_height, 229 | output_dim, num_frames, batch_size) 230 | else: 231 | raise ValueError("Unrecognized network: {}".format(network_type)) 232 | 233 | def choose_action(self, state, epsilon): 234 | if self.rng.rand() < epsilon: 235 | return self.rng.randint(0, self.num_actions) 236 | q_vals = self.q_vals(state) 237 | return np.argmax(q_vals) 238 | 239 | def reset_q_hat(self): 240 | all_params = lasagne.layers.helper.get_all_param_values(self.l_out) 241 | lasagne.layers.helper.set_all_param_values(self.next_l_out, all_params) 242 | 243 | def build_nature_network(self, input_width, input_height, output_dim, 244 | num_frames, batch_size): 245 | """ 246 | Build a large network consistent with the DeepMind Nature paper. 
247 | """ 248 | from lasagne.layers import cuda_convnet 249 | 250 | l_in = lasagne.layers.InputLayer( 251 | shape=(None, num_frames, input_width, input_height) 252 | ) 253 | 254 | l_conv1 = cuda_convnet.Conv2DCCLayer( 255 | l_in, 256 | num_filters=32, 257 | filter_size=(8, 8), 258 | stride=(4, 4), 259 | nonlinearity=lasagne.nonlinearities.rectify, 260 | W=lasagne.init.HeUniform(), # Defaults to Glorot 261 | b=lasagne.init.Constant(.1), 262 | dimshuffle=True 263 | ) 264 | 265 | l_conv2 = cuda_convnet.Conv2DCCLayer( 266 | l_conv1, 267 | num_filters=64, 268 | filter_size=(4, 4), 269 | stride=(2, 2), 270 | nonlinearity=lasagne.nonlinearities.rectify, 271 | W=lasagne.init.HeUniform(), 272 | b=lasagne.init.Constant(.1), 273 | dimshuffle=True 274 | ) 275 | 276 | l_conv3 = cuda_convnet.Conv2DCCLayer( 277 | l_conv2, 278 | num_filters=64, 279 | filter_size=(3, 3), 280 | stride=(1, 1), 281 | nonlinearity=lasagne.nonlinearities.rectify, 282 | W=lasagne.init.HeUniform(), 283 | b=lasagne.init.Constant(.1), 284 | dimshuffle=True 285 | ) 286 | 287 | l_hidden1 = lasagne.layers.DenseLayer( 288 | l_conv3, 289 | num_units=512, 290 | nonlinearity=lasagne.nonlinearities.rectify, 291 | W=lasagne.init.HeUniform(), 292 | b=lasagne.init.Constant(.1) 293 | ) 294 | 295 | l_out = lasagne.layers.DenseLayer( 296 | l_hidden1, 297 | num_units=output_dim, 298 | nonlinearity=None, 299 | W=lasagne.init.HeUniform(), 300 | b=lasagne.init.Constant(.1) 301 | ) 302 | 303 | return l_out 304 | 305 | def build_nature_network_dnn(self, input_width, input_height, output_dim, 306 | num_frames, batch_size): 307 | """ 308 | Build a large network consistent with the DeepMind Nature paper. 309 | """ 310 | from lasagne.layers import dnn 311 | 312 | l_in = lasagne.layers.InputLayer( 313 | shape=(None, num_frames, input_width, input_height) 314 | ) 315 | 316 | l_conv1 = dnn.Conv2DDNNLayer( 317 | l_in, 318 | num_filters=32, 319 | filter_size=(8, 8), 320 | stride=(4, 4), 321 | nonlinearity=lasagne.nonlinearities.rectify, 322 | W=lasagne.init.HeUniform(), 323 | b=lasagne.init.Constant(.1) 324 | ) 325 | 326 | l_conv2 = dnn.Conv2DDNNLayer( 327 | l_conv1, 328 | num_filters=64, 329 | filter_size=(4, 4), 330 | stride=(2, 2), 331 | nonlinearity=lasagne.nonlinearities.rectify, 332 | W=lasagne.init.HeUniform(), 333 | b=lasagne.init.Constant(.1) 334 | ) 335 | 336 | l_conv3 = dnn.Conv2DDNNLayer( 337 | l_conv2, 338 | num_filters=64, 339 | filter_size=(3, 3), 340 | stride=(1, 1), 341 | nonlinearity=lasagne.nonlinearities.rectify, 342 | W=lasagne.init.HeUniform(), 343 | b=lasagne.init.Constant(.1) 344 | ) 345 | 346 | l_hidden1 = lasagne.layers.DenseLayer( 347 | l_conv3, 348 | num_units=512, 349 | nonlinearity=lasagne.nonlinearities.rectify, 350 | W=lasagne.init.HeUniform(), 351 | b=lasagne.init.Constant(.1) 352 | ) 353 | 354 | l_out = lasagne.layers.DenseLayer( 355 | l_hidden1, 356 | num_units=output_dim, 357 | nonlinearity=None, 358 | W=lasagne.init.HeUniform(), 359 | b=lasagne.init.Constant(.1) 360 | ) 361 | 362 | return l_out 363 | 364 | def build_linear_network(self, input_width, input_height, output_dim, 365 | num_frames, batch_size): 366 | """ 367 | Build a simple linear learner. Useful for creating 368 | tests that sanity-check the weight update code. 
369 | """ 370 | 371 | l_in = lasagne.layers.InputLayer( 372 | shape=(None, num_frames, input_width, input_height) 373 | ) 374 | 375 | l_out = lasagne.layers.DenseLayer( 376 | l_in, 377 | num_units=output_dim, 378 | nonlinearity=None, 379 | W=lasagne.init.Constant(0.0), 380 | b=None 381 | ) 382 | 383 | return l_out 384 | -------------------------------------------------------------------------------- /code/run_OT.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | __author__ = 'frankhe' 3 | 4 | import launcher 5 | import sys 6 | 7 | 8 | class Defaults: 9 | # ---------------------- 10 | # Experiment Parameters 11 | # ---------------------- 12 | STEPS_PER_EPOCH = 250000 13 | EPOCHS = 50 14 | # runtime evaluation, not 30 no-op evaluation 15 | STEPS_PER_TEST = 125000 16 | 17 | # ---------------------- 18 | # ALE Parameters 19 | # ---------------------- 20 | BASE_ROM_PATH = "../roms/" 21 | ROM = 'gopher.bin' 22 | FRAME_SKIP = 4 23 | REPEAT_ACTION_PROBABILITY = 0 24 | 25 | # ---------------------- 26 | # Agent/Network parameters: 27 | # ---------------------- 28 | UPDATE_RULE = 'deepmind_rmsprop' 29 | BATCH_ACCUMULATOR = 'sum' 30 | LEARNING_RATE = .00025 31 | DISCOUNT = .99 32 | RMS_DECAY = .95 # (Rho) 33 | RMS_EPSILON = .01 34 | MOMENTUM = 0 # Note that the "momentum" value mentioned in the Nature 35 | # paper is not used in the same way as a traditional momentum 36 | # term. It is used to track gradient for the purpose of 37 | # estimating the standard deviation. This package uses 38 | # rho/RMS_DECAY to track both the history of the gradient 39 | # and the squared gradient. 40 | CLIP_DELTA = 1.0 41 | EPSILON_START = 1.0 42 | EPSILON_MIN = .1 43 | EPSILON_DECAY = 1000000 44 | PHI_LENGTH = 4 45 | UPDATE_FREQUENCY = 4 46 | REPLAY_MEMORY_SIZE = 1000000 47 | BATCH_SIZE = 32 48 | NETWORK_TYPE = "nature_dnn" 49 | FREEZE_INTERVAL = 10000 50 | REPLAY_START_SIZE = 50000 51 | RESIZE_METHOD = 'scale' 52 | RESIZED_WIDTH = 84 53 | RESIZED_HEIGHT = 84 54 | DEATH_ENDS_EPISODE = 'true' 55 | MAX_START_NULLOPS = 30 56 | DETERMINISTIC = True 57 | CUDNN_DETERMINISTIC = False 58 | FLICKERING_BUFFER_SIZE = 2 59 | METHOD = 'ot' 60 | 61 | if __name__ == "__main__": 62 | launcher.launch(sys.argv[1:], Defaults, __doc__) 63 | -------------------------------------------------------------------------------- /code/updates.py: -------------------------------------------------------------------------------- 1 | """ 2 | Gradient update rules for the deep_q_rl package. 3 | 4 | Some code here is modified from the Lasagne package: 5 | 6 | https://github.com/Lasagne/Lasagne/blob/master/LICENSE 7 | 8 | """ 9 | 10 | import theano 11 | import theano.tensor as T 12 | from lasagne.updates import get_or_compute_grads 13 | from collections import OrderedDict 14 | import numpy as np 15 | 16 | # The MIT License (MIT) 17 | 18 | # Copyright (c) 2014 Sander Dieleman 19 | 20 | # Permission is hereby granted, free of charge, to any person obtaining a copy 21 | # of this software and associated documentation files (the "Software"), to deal 22 | # in the Software without restriction, including without limitation the rights 23 | # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 24 | # copies of the Software, and to permit persons to whom the Software is 25 | # furnished to do so, subject to the following conditions: 26 | 27 | # The above copyright notice and this permission notice shall be included in all 28 | # copies or substantial portions of the Software. 
29 | 30 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 31 | # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 32 | # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 33 | # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 34 | # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 35 | # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 36 | # SOFTWARE. 37 | # The MIT License (MIT) 38 | 39 | # Copyright (c) 2014 Sander Dieleman 40 | 41 | # Permission is hereby granted, free of charge, to any person obtaining a copy 42 | # of this software and associated documentation files (the "Software"), to deal 43 | # in the Software without restriction, including without limitation the rights 44 | # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 45 | # copies of the Software, and to permit persons to whom the Software is 46 | # furnished to do so, subject to the following conditions: 47 | 48 | # The above copyright notice and this permission notice shall be included in all 49 | # copies or substantial portions of the Software. 50 | 51 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 52 | # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 53 | # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 54 | # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 55 | # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 56 | # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 57 | # SOFTWARE. 58 | 59 | def deepmind_rmsprop(loss_or_grads, params, learning_rate, 60 | rho, epsilon): 61 | """RMSProp updates [1]_. 62 | 63 | Scale learning rates by dividing with a running estimate of the standard 64 | deviation of the gradient (the centered variant used in the DeepMind DQN code). 65 | 66 | Parameters 67 | ---------- 68 | loss_or_grads : symbolic expression or list of expressions 69 | A scalar loss expression, or a list of gradient expressions 70 | params : list of shared variables 71 | The variables to generate update expressions for 72 | learning_rate : float or symbolic scalar 73 | The learning rate controlling the size of update steps 74 | rho : float or symbolic scalar 75 | Gradient moving average decay factor 76 | epsilon : float or symbolic scalar 77 | Small value added for numerical stability 78 | 79 | Returns 80 | ------- 81 | OrderedDict 82 | A dictionary mapping each parameter to its update expression 83 | 84 | Notes 85 | ----- 86 | `rho` should be between 0 and 1. A value of `rho` close to 1 will decay the 87 | moving average slowly and a value close to 0 will decay the moving average 88 | fast. 89 | 90 | Using the step size :math:`\\eta` and a decay factor :math:`\\rho` the 91 | learning rate :math:`\\eta_t` is calculated as: 92 | 93 | .. math:: 94 | m_t &= \\rho m_{t-1} + (1-\\rho)*g, \\qquad r_t = \\rho r_{t-1} + (1-\\rho)*g^2\\\\ 95 | \\eta_t &= \\frac{\\eta}{\\sqrt{r_t - m_t^2 + \\epsilon}} 96 | 97 | References 98 | ---------- 99 | .. [1] Tieleman, T. and Hinton, G. (2012): 100 | Neural Networks for Machine Learning, Lecture 6.5 - rmsprop. 101 | Coursera. 
http://www.youtube.com/watch?v=O3sxAc4hxZU (formula @5:20) 102 | """ 103 | 104 | grads = get_or_compute_grads(loss_or_grads, params) 105 | updates = OrderedDict() 106 | 107 | for param, grad in zip(params, grads): 108 | value = param.get_value(borrow=True) 109 | 110 | acc_grad = theano.shared(np.zeros(value.shape, dtype=value.dtype), 111 | broadcastable=param.broadcastable) 112 | acc_grad_new = rho * acc_grad + (1 - rho) * grad 113 | 114 | acc_rms = theano.shared(np.zeros(value.shape, dtype=value.dtype), 115 | broadcastable=param.broadcastable) 116 | acc_rms_new = rho * acc_rms + (1 - rho) * grad ** 2 117 | 118 | 119 | updates[acc_grad] = acc_grad_new 120 | updates[acc_rms] = acc_rms_new 121 | 122 | updates[param] = (param - learning_rate * 123 | (grad / 124 | T.sqrt(acc_rms_new - acc_grad_new **2 + epsilon))) 125 | 126 | return updates 127 | -------------------------------------------------------------------------------- /figures/a3c_fig.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/figures/a3c_fig.png -------------------------------------------------------------------------------- /figures/beam_rider_time.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/figures/beam_rider_time.png -------------------------------------------------------------------------------- /figures/breakout_time.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/figures/breakout_time.png -------------------------------------------------------------------------------- /figures/frame.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/figures/frame.png -------------------------------------------------------------------------------- /figures/frostbite_cl2_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/figures/frostbite_cl2_1.png -------------------------------------------------------------------------------- /figures/frostbite_cl2_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/figures/frostbite_cl2_2.png -------------------------------------------------------------------------------- /figures/frostbite_r15_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/figures/frostbite_r15_1.png -------------------------------------------------------------------------------- /figures/frostbite_r15_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/figures/frostbite_r15_2.png -------------------------------------------------------------------------------- /figures/gopher.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/figures/gopher.png -------------------------------------------------------------------------------- /figures/gopher_running.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/figures/gopher_running.png -------------------------------------------------------------------------------- /figures/hero.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/figures/hero.png -------------------------------------------------------------------------------- /figures/pong_time.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/figures/pong_time.png -------------------------------------------------------------------------------- /figures/qbert_time.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/figures/qbert_time.png -------------------------------------------------------------------------------- /figures/rescale.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/figures/rescale.png -------------------------------------------------------------------------------- /figures/space_invaders_time.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/figures/space_invaders_time.png -------------------------------------------------------------------------------- /figures/star_gunner.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/figures/star_gunner.png -------------------------------------------------------------------------------- /figures/star_gunner_running.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/figures/star_gunner_running.png -------------------------------------------------------------------------------- /figures/zaxxon.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/figures/zaxxon.png -------------------------------------------------------------------------------- /roms/air_raid.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/air_raid.bin -------------------------------------------------------------------------------- /roms/alien.bin: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/alien.bin -------------------------------------------------------------------------------- /roms/amidar.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/amidar.bin -------------------------------------------------------------------------------- /roms/assault.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/assault.bin -------------------------------------------------------------------------------- /roms/asterix.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/asterix.bin -------------------------------------------------------------------------------- /roms/asteroids.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/asteroids.bin -------------------------------------------------------------------------------- /roms/atlantis.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/atlantis.bin -------------------------------------------------------------------------------- /roms/bank_heist.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/bank_heist.bin -------------------------------------------------------------------------------- /roms/battle_zone.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/battle_zone.bin -------------------------------------------------------------------------------- /roms/beam_rider.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/beam_rider.bin -------------------------------------------------------------------------------- /roms/berzerk.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/berzerk.bin -------------------------------------------------------------------------------- /roms/bowling.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/bowling.bin -------------------------------------------------------------------------------- /roms/boxing.bin: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/boxing.bin -------------------------------------------------------------------------------- /roms/breakout.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/breakout.bin -------------------------------------------------------------------------------- /roms/carnival.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/carnival.bin -------------------------------------------------------------------------------- /roms/centipede.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/centipede.bin -------------------------------------------------------------------------------- /roms/chopper_command.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/chopper_command.bin -------------------------------------------------------------------------------- /roms/crazy_climber.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/crazy_climber.bin -------------------------------------------------------------------------------- /roms/defender.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/defender.bin -------------------------------------------------------------------------------- /roms/demon_attack.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/demon_attack.bin -------------------------------------------------------------------------------- /roms/double_dunk.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/double_dunk.bin -------------------------------------------------------------------------------- /roms/elevator_action.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/elevator_action.bin -------------------------------------------------------------------------------- /roms/enduro.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/enduro.bin -------------------------------------------------------------------------------- /roms/fishing_derby.bin: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/fishing_derby.bin -------------------------------------------------------------------------------- /roms/freeway.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/freeway.bin -------------------------------------------------------------------------------- /roms/frostbite.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/frostbite.bin -------------------------------------------------------------------------------- /roms/gopher.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/gopher.bin -------------------------------------------------------------------------------- /roms/gravitar.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/gravitar.bin -------------------------------------------------------------------------------- /roms/hero.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/hero.bin -------------------------------------------------------------------------------- /roms/ice_hockey.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/ice_hockey.bin -------------------------------------------------------------------------------- /roms/jamesbond.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/jamesbond.bin -------------------------------------------------------------------------------- /roms/journey_escape.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/journey_escape.bin -------------------------------------------------------------------------------- /roms/kangaroo.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/kangaroo.bin -------------------------------------------------------------------------------- /roms/krull.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/krull.bin -------------------------------------------------------------------------------- /roms/kung_fu_master.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/kung_fu_master.bin 
-------------------------------------------------------------------------------- /roms/montezuma_revenge.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/montezuma_revenge.bin -------------------------------------------------------------------------------- /roms/ms_pacman.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/ms_pacman.bin -------------------------------------------------------------------------------- /roms/name_this_game.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/name_this_game.bin -------------------------------------------------------------------------------- /roms/phoenix.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/phoenix.bin -------------------------------------------------------------------------------- /roms/pitfall.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/pitfall.bin -------------------------------------------------------------------------------- /roms/pong.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/pong.bin -------------------------------------------------------------------------------- /roms/pooyan.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/pooyan.bin -------------------------------------------------------------------------------- /roms/private_eye.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/private_eye.bin -------------------------------------------------------------------------------- /roms/qbert.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/qbert.bin -------------------------------------------------------------------------------- /roms/riverraid.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/riverraid.bin -------------------------------------------------------------------------------- /roms/road_runner.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/road_runner.bin -------------------------------------------------------------------------------- /roms/robotank.bin: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/robotank.bin -------------------------------------------------------------------------------- /roms/seaquest.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/seaquest.bin -------------------------------------------------------------------------------- /roms/skiing.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/skiing.bin -------------------------------------------------------------------------------- /roms/solaris.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/solaris.bin -------------------------------------------------------------------------------- /roms/space_invaders.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/space_invaders.bin -------------------------------------------------------------------------------- /roms/star_gunner.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/star_gunner.bin -------------------------------------------------------------------------------- /roms/tennis.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/tennis.bin -------------------------------------------------------------------------------- /roms/time_pilot.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/time_pilot.bin -------------------------------------------------------------------------------- /roms/tutankham.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/tutankham.bin -------------------------------------------------------------------------------- /roms/up_n_down.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/up_n_down.bin -------------------------------------------------------------------------------- /roms/venture.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/venture.bin -------------------------------------------------------------------------------- /roms/video_pinball.bin: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/video_pinball.bin -------------------------------------------------------------------------------- /roms/wizard_of_wor.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/wizard_of_wor.bin -------------------------------------------------------------------------------- /roms/yars_revenge.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/yars_revenge.bin -------------------------------------------------------------------------------- /roms/zaxxon.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/zaxxon.bin --------------------------------------------------------------------------------