├── .gitignore
├── LICENSE.txt
├── README.md
├── code
│   ├── ale_agents.py
│   ├── ale_data_set.py
│   ├── ale_experiment.py
│   ├── image_preprocessing.py
│   ├── launcher.py
│   ├── q_network.py
│   ├── run_OT.py
│   └── updates.py
├── figures
│   ├── a3c_fig.png
│   ├── beam_rider_time.png
│   ├── breakout_time.png
│   ├── frame.png
│   ├── frostbite_cl2_1.png
│   ├── frostbite_cl2_2.png
│   ├── frostbite_r15_1.png
│   ├── frostbite_r15_2.png
│   ├── gopher.png
│   ├── gopher_running.png
│   ├── hero.png
│   ├── pong_time.png
│   ├── qbert_time.png
│   ├── rescale.png
│   ├── space_invaders_time.png
│   ├── star_gunner.png
│   ├── star_gunner_running.png
│   └── zaxxon.png
└── roms
    ├── air_raid.bin
    ├── alien.bin
    ├── amidar.bin
    ├── assault.bin
    ├── asterix.bin
    ├── asteroids.bin
    ├── atlantis.bin
    ├── bank_heist.bin
    ├── battle_zone.bin
    ├── beam_rider.bin
    ├── berzerk.bin
    ├── bowling.bin
    ├── boxing.bin
    ├── breakout.bin
    ├── carnival.bin
    ├── centipede.bin
    ├── chopper_command.bin
    ├── crazy_climber.bin
    ├── defender.bin
    ├── demon_attack.bin
    ├── double_dunk.bin
    ├── elevator_action.bin
    ├── enduro.bin
    ├── fishing_derby.bin
    ├── freeway.bin
    ├── frostbite.bin
    ├── gopher.bin
    ├── gravitar.bin
    ├── hero.bin
    ├── ice_hockey.bin
    ├── jamesbond.bin
    ├── journey_escape.bin
    ├── kangaroo.bin
    ├── krull.bin
    ├── kung_fu_master.bin
    ├── montezuma_revenge.bin
    ├── ms_pacman.bin
    ├── name_this_game.bin
    ├── phoenix.bin
    ├── pitfall.bin
    ├── pong.bin
    ├── pooyan.bin
    ├── private_eye.bin
    ├── qbert.bin
    ├── riverraid.bin
    ├── road_runner.bin
    ├── robotank.bin
    ├── seaquest.bin
    ├── skiing.bin
    ├── solaris.bin
    ├── space_invaders.bin
    ├── star_gunner.bin
    ├── tennis.bin
    ├── time_pilot.bin
    ├── tutankham.bin
    ├── up_n_down.bin
    ├── venture.bin
    ├── video_pinball.bin
    ├── wizard_of_wor.bin
    ├── yars_revenge.bin
    └── zaxxon.bin

/.gitignore:
--------------------------------------------------------------------------------
1 | 
2 | .idea/.name
3 | 
4 | *.iml
5 | 
6 | *.xml
7 | 
8 | *.pyc
9 | 
--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2017 Frank He
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Q-Optimality-Tightening
2 | This is my implementation of the paper [Learning to Play in a Day: Faster Deep Reinforcement Learning by Optimality Tightening](https://openreview.net/pdf?id=rJ8Je4clg).
3 | 
4 | # Dependencies
5 | * Numpy
6 | * Scipy
7 | * Pillow
8 | * Matplotlib
9 | * Theano
10 | * Lasagne
11 | * ALE or gym
12 | 
13 | Readers may refer to each package's documentation for installation instructions. However, I suggest installing all of the packages inside a virtual environment. Please make sure that your Theano version is compatible with your Lasagne version.
14 | 
15 | # Running
16 | ```
17 | THEANO_FLAGS='device=gpu0, allow_gc=False' python run_OT.py -r frostbite --close2
18 | ```
19 | This runs frostbite with close bounds.
20 | 
21 | ```
22 | THEANO_FLAGS='device=gpu1, allow_gc=False' python run_OT.py -r gopher
23 | ```
24 | 
25 | This runs gopher with randomly sampled bounds. By default, 4 out of 10 upper bounds are selected as U\_{j,k} and 4 out of 10 lower bounds are selected as L\_{j,l}.
26 | 
27 | I have already provided 62 game ROMs in the `roms` directory.
28 | 
29 | If everything is configured correctly, the run should look like this:
30 | 
31 | 
32 | 
33 | 
34 | Steps per second is usually between 105 and 140 on one Titan X. GPU utilization is about 30 percent, which means our code still has substantial room for improvement.
35 | 
36 | 
37 | # Experiments
38 | First, two figures from runs on frostbite with ```--close2```:
39 | ![frostbite_cl2_1]
40 | ![frostbite_cl2_2]
41 | Two other figures, from runs sampling 4 bounds out of 15, are below:
42 | ![frostbite_r15_1]
43 | ![frostbite_r15_2]
44 | > frostbite's 200M baseline is 328.3
45 | 
46 | Some other games are shown here:
47 | ![gopher]
48 | > gopher's 200M baseline is 8520
49 | 
50 | ![hero]
51 | > hero's 200M baseline is 19950
52 | 
53 | ![star_gunner]
54 | > star_gunner's 200M baseline is 57997
55 | 
56 | ![zaxxon]
57 | > zaxxon's 200M baseline is 4977
58 | 
59 | Finally, we can roughly compare our method with the state-of-the-art method [A3C](https://arxiv.org/pdf/1602.01783). Our method uses 1 CPU thread and 1 GPU (at about 30% utilization), while A3C uses multiple CPU threads.
60 | 
61 | Figure 4 in the [A3C](https://arxiv.org/pdf/1602.01783) paper:
62 | ![A3C]
63 | 
64 | Our results:
65 | 
66 | ![beam_rider]
67 | ![breakout]
68 | ![pong]
69 | ![qbert]
70 | ![space_invaders]
71 | 
72 | From these plots, our method almost always outperforms A3C with 1, 2, and 4 threads and achieves results similar to A3C with 8 threads. Note that the five games chosen in the A3C paper are not our method's specialties; we expect an even larger advantage on games that our method is good at.
73 | 
74 | # Explanation
75 | ### Gradients are also rescaled so that their magnitudes are comparable with or without the penalty
76 | ![rescale]
77 | 
78 | ### About frames
79 | ![frame]
80 | 
81 | # Comments
82 | Since we never did a grid search over hyperparameters, we expect that better settings or initializations would further improve the results. More informed strategies for choosing the constraints are also possible: early in training we may expect lower bounds from the more distant future to have a larger impact, whereas once the algorithm has almost converged we may expect lower bounds close to the considered time step to have a bigger impact.
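For concreteness, here is a minimal NumPy sketch (not the training code itself) of how `code/ale_agents.py` turns these bounds into a training target inside `_do_training`: the ordinary one-step DQN target is mixed with the most violated lower bound L\_{j,l} or upper bound U\_{j,k}, and the mixing weight is annealed from `--weight-max` to `--weight-min` over `--anneal-len` steps. The sketch omits the `--late2` gating and the double-DQN branch; `penalized_target` and the toy numbers below are illustrative only.

```python
import numpy as np

def penalized_target(q_sa, dqn_target, lower_bounds, upper_bounds,
                     weight, margin=0.1):
    """Mix the one-step DQN target with the most violated bound."""
    v1 = dqn_target
    v_max = np.max(lower_bounds)   # tightest lower bound, max_l L_{j,l}
    v_min = np.min(upper_bounds)   # tightest upper bound, min_k U_{j,k}
    if v_max - margin > q_sa > v_min + margin:   # both kinds of bound violated
        v1 = 0.5 * (v_max + v_min)
    elif v_max - margin > q_sa:                  # Q(s,a) below a lower bound
        v1 = v_max
    elif q_sa > v_min + margin:                  # Q(s,a) above an upper bound
        v1 = v_min
    return weight * dqn_target + (1.0 - weight) * v1

# Toy example: Q(s,a) violates one of its lower bounds, so the target is
# pulled up towards it.  `weight` is annealed from --weight-max down to
# --weight-min during training.
print(penalized_target(q_sa=1.0, dqn_target=1.2,
                       lower_bounds=np.array([1.5, 0.9, 1.1]),
                       upper_bounds=np.array([2.0, 2.4]),
                       weight=0.8))
```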
More complex penalty functions and sophisticated optimization approaches may yield even better results than the ones we reported yet. 83 | 84 | # Please cite our paper at 85 | ``` 86 | @inproceedings{HeICLR2017, 87 | author = {F.~S. He, Y. Liu, A.~G. Schwing and J. Peng}, 88 | title = {{Learning to Play in a Day: Faster Deep Reinforcement Learning by Optimality Tightening}}, 89 | booktitle = {Proc. ICLR}, 90 | year = {2017}, 91 | } 92 | ``` 93 | 94 | [frostbite_cl2_1]: figures/frostbite_cl2_1.png 95 | [frostbite_cl2_2]: figures/frostbite_cl2_2.png 96 | [frostbite_r15_1]: figures/frostbite_r15_1.png 97 | [frostbite_r15_2]: figures/frostbite_r15_2.png 98 | [gopher]: figures/gopher.png 99 | [hero]: figures/hero.png 100 | [star_gunner]: figures/star_gunner.png 101 | [zaxxon]: figures/zaxxon.png 102 | [A3C]: figures/a3c_fig.png 103 | [beam_rider]: figures/beam_rider_time.png 104 | [breakout]: figures/breakout_time.png 105 | [pong]: figures/pong_time.png 106 | [qbert]: figures/qbert_time.png 107 | [space_invaders]: figures/space_invaders_time.png 108 | [rescale]: figures/rescale.png 109 | [frame]: figures/frame.png 110 | -------------------------------------------------------------------------------- /code/ale_agents.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | __author__ = 'frankhe' 3 | 4 | """ 5 | DQN agents 6 | """ 7 | import time 8 | import os 9 | import logging 10 | import numpy as np 11 | import cPickle 12 | 13 | import ale_data_set 14 | import sys 15 | sys.setrecursionlimit(10000) 16 | 17 | recording_size = 0 18 | 19 | 20 | class OptimalityTightening(object): 21 | def __init__(self, q_network, epsilon_start, epsilon_min, 22 | epsilon_decay, replay_memory_size, exp_pref, update_frequency, 23 | replay_start_size, rng, transitions_sequence_length, transition_range, penalty_method, 24 | weight_min, weight_max, weight_decay_length, two_train=False, late2=True, close2=True, verbose=False, 25 | double=False, save_pkl=True): 26 | self.double_dqn = double 27 | self.network = q_network 28 | self.num_actions = q_network.num_actions 29 | self.epsilon_start = epsilon_start 30 | self.update_frequency = update_frequency 31 | 32 | self.epsilon_min = epsilon_min 33 | self.epsilon_decay = epsilon_decay 34 | self.replay_memory_size = replay_memory_size 35 | self.exp_dir = exp_pref + '_' + str(weight_max) + '_' + str(weight_min) 36 | if late2: 37 | self.exp_dir += '_l2' 38 | if close2: 39 | self.exp_dir += '_close2' 40 | else: 41 | self.exp_dir += '_len' + str(transitions_sequence_length) + '_r' + str(transition_range) 42 | if two_train: 43 | self.exp_dir += '_TTR' 44 | 45 | self.replay_start_size = replay_start_size 46 | self.rng = rng 47 | self.transition_len = transitions_sequence_length 48 | self.two_train = two_train 49 | self.verbose = verbose 50 | if verbose > 0: 51 | print "Using verbose", verbose 52 | self.exp_dir += '_vb' + str(verbose) 53 | 54 | self.phi_length = self.network.num_frames 55 | self.image_width = self.network.input_width 56 | self.image_height = self.network.input_height 57 | self.penalty_method = penalty_method 58 | self.batch_size = self.network.batch_size 59 | self.discount = self.network.discount 60 | self.transition_range = transition_range 61 | self.late2 = late2 62 | self.close2 = close2 63 | self.same_update = False 64 | self.save_pkl = save_pkl 65 | 66 | self.start_index = 0 67 | self.terminal_index = None 68 | 69 | self.weight_max = weight_max 70 | self.weight_min = weight_min 71 | self.weight = 
self.weight_max 72 | self.weight_decay_length = weight_decay_length 73 | self.weight_decay = (self.weight_max - self.weight_min) / self.weight_decay_length 74 | 75 | try: 76 | os.stat(self.exp_dir) 77 | except OSError: 78 | os.makedirs(self.exp_dir) 79 | 80 | self.data_set = ale_data_set.DataSet(width=self.image_width, 81 | height=self.image_height, 82 | rng=rng, 83 | max_steps=self.replay_memory_size, 84 | phi_length=self.phi_length, 85 | discount=self.discount, 86 | batch_size=self.batch_size, 87 | transitions_len=self.transition_len) 88 | 89 | # just needs to be big enough to create phi's 90 | self.test_data_set = ale_data_set.DataSet(width=self.image_width, 91 | height=self.image_height, 92 | rng=rng, 93 | max_steps=self.phi_length * 2, 94 | phi_length=self.phi_length) 95 | self.epsilon = self.epsilon_start 96 | if self.epsilon_decay != 0: 97 | self.epsilon_rate = ((self.epsilon_start - self.epsilon_min) / 98 | self.epsilon_decay) 99 | else: 100 | self.epsilon_rate = 0 101 | 102 | self.testing = False 103 | 104 | self._open_results_file() 105 | self._open_learning_file() 106 | self._open_recording_file() 107 | 108 | self.step_counter = None 109 | self.episode_reward = None 110 | self.start_time = None 111 | self.loss_averages = None 112 | self.total_reward = None 113 | 114 | self.episode_counter = 0 115 | self.batch_counter = 0 116 | 117 | self.holdout_data = None 118 | 119 | # In order to add an element to the data set we need the 120 | # previous state and action and the current reward. These 121 | # will be used to store states and actions. 122 | self.last_img = None 123 | self.last_action = None 124 | 125 | # Exponential moving average of runtime performance. 126 | self.steps_sec_ema = 0. 127 | self.program_start_time = None 128 | self.last_count_time = None 129 | self.epoch_time = None 130 | self.total_time = None 131 | 132 | def time_count_start(self): 133 | self.last_count_time = self.program_start_time = time.time() 134 | 135 | def _open_results_file(self): 136 | logging.info("OPENING " + self.exp_dir + '/results.csv') 137 | self.results_file = open(self.exp_dir + '/results.csv', 'w', 0) 138 | self.results_file.write(\ 139 | 'epoch,num_episodes,total_reward,reward_per_epoch,mean_q, epoch time, total time\n') 140 | self.results_file.flush() 141 | 142 | def _open_learning_file(self): 143 | self.learning_file = open(self.exp_dir + '/learning.csv', 'w', 0) 144 | self.learning_file.write('mean_loss,epsilon\n') 145 | self.learning_file.flush() 146 | 147 | def _update_results_file(self, epoch, num_episodes, holdout_sum): 148 | out = "{},{},{},{},{},{},{}\n".format(epoch, num_episodes, 149 | self.total_reward, self.total_reward / float(num_episodes), 150 | holdout_sum, self.epoch_time, self.total_time) 151 | self.last_count_time = time.time() 152 | self.results_file.write(out) 153 | self.results_file.flush() 154 | 155 | def _update_learning_file(self): 156 | out = "{},{}\n".format(np.mean(self.loss_averages), 157 | self.epsilon) 158 | self.learning_file.write(out) 159 | self.learning_file.flush() 160 | 161 | def _open_recording_file(self): 162 | self.recording_tot = 0 163 | self.recording_file = open(self.exp_dir + '/recording.csv', 'w', 0) 164 | self.recording_file.write('nn_output, q_return, history_return, loss') 165 | self.recording_file.write('\n') 166 | self.recording_file.flush() 167 | 168 | def _update_recording_file(self, nn_output, q_return, history_return, loss): 169 | if self.recording_tot > recording_size: 170 | return 171 | self.recording_tot += 1 172 | out = 
"{},{},{},{}".format(nn_output, q_return, history_return, loss) 173 | self.recording_file.write(out) 174 | self.recording_file.write('\n') 175 | self.recording_file.flush() 176 | 177 | def start_episode(self, observation): 178 | """ 179 | This method is called once at the beginning of each episode. 180 | No reward is provided, because reward is only available after 181 | an action has been taken. 182 | 183 | Arguments: 184 | observation - height x width numpy array 185 | 186 | Returns: 187 | An integer action 188 | """ 189 | 190 | self.step_counter = 0 191 | self.batch_counter = 0 192 | self.episode_reward = 0 193 | 194 | # We report the mean loss for every epoch. 195 | self.loss_averages = [] 196 | 197 | self.start_time = time.time() 198 | return_action = self.rng.randint(0, self.num_actions) 199 | 200 | self.last_action = return_action 201 | 202 | self.last_img = observation 203 | 204 | return return_action 205 | 206 | def _choose_action(self, data_set, epsilon, cur_img, reward): 207 | """ 208 | Add the most recent data to the data set and choose 209 | an action based on the current policy. 210 | """ 211 | 212 | data_set.add_sample(self.last_img, self.last_action, reward, False, start_index=self.start_index) 213 | if self.step_counter >= self.phi_length: 214 | phi = data_set.phi(cur_img) 215 | action = self.network.choose_action(phi, epsilon) 216 | else: 217 | action = self.rng.randint(0, self.num_actions) 218 | 219 | return action 220 | 221 | def _do_training(self): 222 | """ 223 | Returns the average loss for the current batch. 224 | May be overridden if a subclass needs to train the network 225 | differently. 226 | """ 227 | if self.close2: 228 | self.data_set.random_close_transitions_batch(self.batch_size, self.transition_len) 229 | else: 230 | self.data_set.random_transitions_batch(self.batch_size, self.transition_len, self.transition_range) 231 | 232 | target_q_imgs = np.append(self.data_set.forward_imgs, self.data_set.backward_imgs, axis=1) 233 | target_q_table = self.network.q_target_vals(target_q_imgs) 234 | target_double_q_table = None 235 | if self.double_dqn: 236 | target_double_q_table = self.network.q_target(target_q_imgs) 237 | q_values = self.network.q_s_a_batch_vals(self.data_set.center_imgs, self.data_set.center_actions) 238 | 239 | states1 = np.zeros((self.batch_size, self.phi_length, self.image_height, self.image_width), dtype='uint8') 240 | actions1 = np.zeros((self.batch_size, 1), dtype='int32') 241 | targets1 = np.zeros((self.batch_size, 1), dtype='float32') 242 | states2 = np.zeros((self.batch_size, self.phi_length, self.image_height, self.image_width), dtype='uint8') 243 | actions2 = np.zeros((self.batch_size, 1), dtype='int32') 244 | targets2 = np.zeros((self.batch_size, 1), dtype='float32') 245 | """ 246 | 0 1 2 3* 4 5 6 7 8 V_R 247 | 0 1 2 4 5 6 7 8 V_R 248 | V0 = r3 + y*Q4; V1 = r3 +y*r4 + y^2*Q5 249 | Q2 -r2 = Q3*y; Q1 - r1 - y*r2 = y^2*Q3 250 | V-1 = (Q2 - r2) / y; V-2 = (Q1 - r1 - y*r2)/y^2; V-3 = (Q0 -r0 -y*r1 - y^2*r2)/y^3 251 | r1 + y*r2 = R1 - y^2*R3 252 | Q1 = r1+y*r2 + y^2*Q3 253 | """ 254 | for i in xrange(self.batch_size): 255 | q_value = q_values[i] 256 | if self.two_train: 257 | # does nothing first 258 | states2[i] = self.data_set.center_imgs[i] 259 | actions2[i] = self.data_set.center_actions[i] 260 | targets2[i] = q_value 261 | center_position = int(self.data_set.center_positions[i]) 262 | if self.data_set.terminal.take(center_position, mode='wrap'): 263 | states1[i] = self.data_set.center_imgs[i] 264 | actions1[i] = 
self.data_set.center_actions[i] 265 | targets1[i] = self.data_set.center_return_values[i] 266 | continue 267 | forward_targets = np.zeros(self.transition_len, dtype=np.float32) 268 | backward_targets = np.zeros(self.transition_len, dtype=np.float32) 269 | for j in xrange(self.transition_len): 270 | if j > 0 and self.data_set.forward_positions[i, j] == center_position + 1: 271 | forward_targets[j] = q_value 272 | else: 273 | if not self.double_dqn: 274 | forward_targets[j] = self.data_set.center_return_values[i] - \ 275 | self.data_set.forward_return_values[i, j] * self.data_set.forward_discounts[i, j] + \ 276 | self.data_set.forward_discounts[i, j] * \ 277 | np.max(target_q_table[i, j]) 278 | else: 279 | forward_targets[j] = self.data_set.center_return_values[i] - \ 280 | self.data_set.forward_return_values[i, j] * self.data_set.forward_discounts[i, j] + \ 281 | self.data_set.forward_discounts[i, j] * target_double_q_table[i, j] 282 | """ for integrity""" 283 | if self.verbose == 1: 284 | end = self. data_set.forward_positions[i, j] 285 | discount = 1.0 286 | cumulative_reward = 0.0 287 | for k in range(center_position, end): 288 | cumulative_reward += discount * self.data_set.rewards.take(k, mode='wrap') 289 | discount *= self.discount 290 | cumulative_reward += discount * np.max(target_q_table[i, j]) 291 | if not np.isclose(cumulative_reward, forward_targets[j], atol=0.000001): 292 | print self.data_set.backward_positions[i], self.data_set.center_positions[i], self.data_set.forward_positions[i] 293 | print self.data_set.start_index.take(k, mode='wrap'), self.data_set.terminal_index.take(k, mode='wrap') 294 | print 'center return=', self.data_set.center_return_values[i], 'forward return=', \ 295 | self.data_set.forward_return_values[i,j], 'forward discount=', self.data_set.forward_discounts[i, j] 296 | end = self.data_set.forward_positions[i, j] 297 | discount = 1.0 298 | cumulative_reward = 0.0 299 | for k in range(center_position, end): 300 | cumulative_reward += discount * self.data_set.rewards.take(k, mode='wrap') 301 | print k, 'cumulative=', cumulative_reward, 'discount=', discount, 'reward=', self.data_set.rewards.take(k, mode='wrap'), \ 302 | 'return=', self.data_set.return_value.take(k, mode='wrap') 303 | print '\t start=', self.data_set.start_index.take(k, mode='wrap'), 'terminal=', self.data_set.terminal_index.take(k, mode='wrap') 304 | discount *= self.discount 305 | cumulative_reward += discount * np.max(target_q_table[i, j]) 306 | print 'final cumulative=', cumulative_reward, 'target=', forward_targets[j], \ 307 | 'maxQ=', np.max(target_q_table[i, j]) 308 | raw_input() 309 | 310 | if self.data_set.backward_positions[i, j] == center_position + 1: 311 | backward_targets[j] = q_value 312 | else: 313 | backward_targets[j] = (-self.data_set.backward_return_values[i, j] + 314 | self.data_set.backward_discounts[i, j] * self.data_set.center_return_values[i] + 315 | target_q_table[i, self.transition_len + j, self.data_set.backward_actions[i, j]]) /\ 316 | self.data_set.backward_discounts[i, j] 317 | """ for integrity""" 318 | if self.verbose == 1: 319 | end = self.data_set.backward_positions[i, j] 320 | discount = 1.0 321 | cumulative_reward = 0.0 322 | for k in range(end, center_position): 323 | cumulative_reward += self.data_set.rewards.take(k, mode='wrap') * discount 324 | discount *= self.discount 325 | cumulative_reward = (-cumulative_reward + target_q_table[i, self.transition_len + j, self.data_set.actions.take(end, mode='wrap')])/discount 326 | if not 
np.isclose(cumulative_reward, backward_targets[j], atol=0.000001): 327 | print self.data_set.backward_positions[i], self.data_set.center_positions[i], self.data_set.forward_positions[i] 328 | print self.data_set.start_index.take(k, mode='wrap'), self.data_set.terminal_index.take(k, mode='wrap') 329 | print 'center return=', self.data_set.center_return_values[i], 'backward return=', \ 330 | self.data_set.backward_return_values[i,j], 'backward discount=', self.data_set.backward_discounts[i, j] 331 | end = self.data_set.backward_positions[i, j] 332 | discount = 1.0 333 | cumulative_reward = 0.0 334 | for k in range(end, center_position): 335 | cumulative_reward += self.data_set.rewards.take(k, mode='wrap') * discount 336 | print k, 'cumulative=', cumulative_reward, 'discount=', discount, 'reward=', self.data_set.rewards.take(k, mode='wrap'),\ 337 | 'return=', self.data_set.return_value.take(k, mode='wrap') 338 | print '\t start=', self.data_set.start_index.take(k, mode='wrap'), 'terminal=', self.data_set.terminal_index.take(k, mode='wrap') 339 | discount *= self.discount 340 | cumulative_reward = (-cumulative_reward + target_q_table[i, self.transition_len + j, self.data_set.actions.take(end, mode='wrap')])/discount 341 | print 'final cumulative=', cumulative_reward, 'target=', backward_targets[j], \ 342 | 'Q=', target_q_table[i, self.transition_len + j, self.data_set.backward_actions[i, j]] 343 | raw_input() 344 | 345 | forward_targets = np.append(forward_targets, self.data_set.center_return_values[i]) 346 | v0 = v1 = forward_targets[0] 347 | if self.penalty_method == 'max': 348 | v_max = np.max(forward_targets[1:]) 349 | v_min = np.min(backward_targets) 350 | if self.two_train and v_min < q_value: 351 | v_min_index = np.argmin(backward_targets) 352 | states2[i] = self.data_set.backward_imgs[i, v_min_index] 353 | actions2[i] = self.data_set.backward_actions[i, v_min_index] 354 | targets2[i] = self.data_set.backward_return_values[i, v_min_index] - \ 355 | self.data_set.backward_discounts[i, v_min_index] * self.data_set.center_return_values[i] + \ 356 | self.data_set.backward_discounts[i, v_min_index] * q_value 357 | if ((self.late2 and self.weight == self.weight_min) or (not self.late2)) \ 358 | and (v_max - 0.1 > q_value > v_min + 0.1): 359 | v1 = v_max * 0.5 + v_min * 0.5 360 | elif v_max - 0.1 > q_value: 361 | v1 = v_max 362 | elif ((self.late2 and self.weight == self.weight_min) or (not self.late2)) and v_min + 0.1 < q_value: 363 | v1 = v_min 364 | 365 | states1[i] = self.data_set.center_imgs[i] 366 | actions1[i] = self.data_set.center_actions[i] 367 | targets1[i] = v0 * self.weight + (1-self.weight) * v1 368 | 369 | if self.two_train: 370 | if self.same_update: 371 | self.network.train(states2, actions2, targets2) 372 | else: 373 | self.network.train2(states2, actions2, targets2) 374 | loss = self.network.train(states1, actions1, targets1) 375 | # if self.recording_tot < recording_size: 376 | # pass 377 | # for i in range(self.network.batch_size): 378 | # self._update_recording_file(output[i], target[i], return_value[i], loss) 379 | return loss 380 | 381 | def step(self, reward, observation): 382 | """ 383 | This method is called each time step. 384 | 385 | Arguments: 386 | reward - Real valued reward. 387 | observation - A height x width numpy array 388 | 389 | Returns: 390 | An integer action. 
391 | 392 | """ 393 | 394 | self.step_counter += 1 395 | self.episode_reward += reward 396 | 397 | # TESTING--------------------------- 398 | if self.testing: 399 | action = self._choose_action(self.test_data_set, 0.05, 400 | observation, np.clip(reward, -1, 1)) 401 | 402 | # NOT TESTING--------------------------- 403 | else: 404 | if len(self.data_set) > self.replay_start_size: 405 | self.epsilon = max(self.epsilon_min, 406 | self.epsilon - self.epsilon_rate) 407 | self.weight = max(self.weight_min, 408 | self.weight - self.weight_decay) 409 | 410 | action = self._choose_action(self.data_set, self.epsilon, 411 | observation, 412 | np.clip(reward, -1, 1)) 413 | 414 | if self.step_counter % self.update_frequency == 0: 415 | loss = self._do_training() 416 | self.batch_counter += 1 417 | self.loss_averages.append(loss) 418 | 419 | else: # Still gathering initial random data... 420 | action = self._choose_action(self.data_set, self.epsilon, 421 | observation, 422 | np.clip(reward, -1, 1)) 423 | 424 | self.last_action = action 425 | self.last_img = observation 426 | 427 | return action 428 | 429 | def end_episode(self, reward, terminal=True): 430 | """ 431 | This function is called once at the end of an episode. 432 | 433 | Arguments: 434 | reward - Real valued reward. 435 | terminal - Whether the episode ended intrinsically 436 | (ie we didn't run out of steps) 437 | Returns: 438 | None 439 | """ 440 | 441 | self.episode_reward += reward 442 | self.step_counter += 1 443 | total_time = time.time() - self.start_time 444 | 445 | if self.testing: 446 | # If we run out of time, only count the last episode if 447 | # it was the only episode. 448 | if terminal or self.episode_counter == 0: 449 | self.episode_counter += 1 450 | self.total_reward += self.episode_reward 451 | else: 452 | # Store the latest sample. 453 | self.data_set.add_sample(self.last_img, 454 | self.last_action, 455 | np.clip(reward, -1, 1), 456 | True, start_index=self.start_index) 457 | """update""" 458 | if terminal: 459 | q_return = 0. 460 | else: 461 | phi = self.data_set.phi(self.last_img) 462 | q_return = np.mean(self.network.q_vals(phi)) 463 | # last_q_return = -1.0 464 | self.start_index = self.data_set.top 465 | self.terminal_index = index = (self.start_index-1) % self.data_set.max_steps 466 | while True: 467 | q_return = q_return * self.network.discount + self.data_set.rewards[index] 468 | self.data_set.return_value[index] = q_return 469 | self.data_set.terminal_index[index] = self.terminal_index 470 | index = (index-1) % self.data_set.max_steps 471 | if self.data_set.terminal[index] or index == self.data_set.bottom: 472 | break 473 | 474 | rho = 0.98 475 | self.steps_sec_ema *= rho 476 | self.steps_sec_ema += (1. 
- rho) * (self.step_counter/total_time) 477 | 478 | logging.info("steps/second: {:.2f}, avg: {:.2f}".format( 479 | self.step_counter/total_time, self.steps_sec_ema)) 480 | 481 | if self.batch_counter > 0: 482 | self._update_learning_file() 483 | logging.info("average loss: {:.4f}".format(\ 484 | np.mean(self.loss_averages))) 485 | 486 | def finish_epoch(self, epoch): 487 | if self.save_pkl: 488 | net_file = open(self.exp_dir + '/network_file_' + str(epoch) + '.pkl', 'w') 489 | cPickle.dump(self.network, net_file, -1) 490 | net_file.close() 491 | this_time = time.time() 492 | self.total_time = this_time-self.program_start_time 493 | self.epoch_time = this_time-self.last_count_time 494 | 495 | def start_testing(self): 496 | self.testing = True 497 | self.total_reward = 0 498 | self.episode_counter = 0 499 | 500 | def finish_testing(self, epoch): 501 | self.testing = False 502 | holdout_size = 3200 503 | 504 | if self.holdout_data is None and len(self.data_set) > holdout_size: 505 | imgs = self.data_set.random_imgs(holdout_size) 506 | self.holdout_data = imgs[:, :self.phi_length] 507 | 508 | holdout_sum = 0 509 | if self.holdout_data is not None: 510 | for i in range(holdout_size): 511 | holdout_sum += np.max( 512 | self.network.q_vals(self.holdout_data[i])) 513 | 514 | self._update_results_file(epoch, self.episode_counter, 515 | holdout_sum / holdout_size) 516 | 517 | -------------------------------------------------------------------------------- /code/ale_data_set.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | __author__ = 'frankhe' 3 | 4 | import numpy as np 5 | import time 6 | import theano 7 | 8 | floatX = theano.config.floatX 9 | 10 | 11 | class DataSet(object): 12 | def __init__(self, width, height, rng, max_steps=1000000, phi_length=4, discount=0.99, batch_size=32, 13 | transitions_len=4): 14 | self.width = width 15 | self.height = height 16 | self.max_steps = max_steps 17 | self.phi_length = phi_length 18 | self.rng = rng 19 | self.discount = discount 20 | self.discount_table = np.power(self.discount, np.arange(30)) 21 | 22 | self.imgs = np.zeros((max_steps, height, width), dtype='uint8') 23 | self.actions = np.zeros(max_steps, dtype='int32') 24 | self.rewards = np.zeros(max_steps, dtype=floatX) 25 | self.return_value = np.zeros(max_steps, dtype=floatX) 26 | self.terminal = np.zeros(max_steps, dtype='bool') 27 | self.terminal_index = np.zeros(max_steps, dtype='int32') 28 | self.start_index = np.zeros(max_steps, dtype='int32') 29 | 30 | self.bottom = 0 31 | self.top = 0 32 | self.size = 0 33 | 34 | self.center_imgs = np.zeros((batch_size, 35 | self.phi_length, 36 | self.height, 37 | self.width), 38 | dtype='uint8') 39 | self.forward_imgs = np.zeros((batch_size, 40 | transitions_len, 41 | self.phi_length, 42 | self.height, 43 | self.width), 44 | dtype='uint8') 45 | self.backward_imgs = np.zeros((batch_size, 46 | transitions_len, 47 | self.phi_length, 48 | self.height, 49 | self.width), 50 | dtype='uint8') 51 | self.center_positions = np.zeros((batch_size, 1), dtype='int32') 52 | self.forward_positions = np.zeros((batch_size, transitions_len), dtype='int32') 53 | self.backward_positions = np.zeros((batch_size, transitions_len), dtype='int32') 54 | 55 | self.center_actions = np.zeros((batch_size, 1), dtype='int32') 56 | self.backward_actions = np.zeros((batch_size, transitions_len), dtype='int32') 57 | 58 | self.center_terminals = np.zeros((batch_size, 1), dtype='bool') 59 | self.center_rewards = np.zeros((batch_size, 
1), dtype=floatX) 60 | 61 | self.center_return_values = np.zeros((batch_size, 1), dtype=floatX) 62 | self.forward_return_values = np.zeros((batch_size, transitions_len), dtype=floatX) 63 | self.backward_return_values = np.zeros((batch_size, transitions_len), dtype=floatX) 64 | 65 | self.forward_discounts = np.zeros((batch_size, transitions_len), dtype=floatX) 66 | self.backward_discounts = np.zeros((batch_size, transitions_len), dtype=floatX) 67 | 68 | def add_sample(self, img, action, reward, terminal, return_value=0.0, start_index=-1): 69 | 70 | self.imgs[self.top] = img 71 | self.actions[self.top] = action 72 | self.rewards[self.top] = reward 73 | self.terminal[self.top] = terminal 74 | self.return_value[self.top] = return_value 75 | self.start_index[self.top] = start_index 76 | self.terminal_index[self.top] = -1 77 | 78 | if self.size == self.max_steps: 79 | self.bottom = (self.bottom + 1) % self.max_steps 80 | else: 81 | self.size += 1 82 | self.top = (self.top + 1) % self.max_steps 83 | 84 | def __len__(self): 85 | return self.size 86 | 87 | def last_phi(self): 88 | """Return the most recent phi (sequence of image frames).""" 89 | indexes = np.arange(self.top - self.phi_length, self.top) 90 | return self.imgs.take(indexes, axis=0, mode='wrap') 91 | 92 | def phi(self, img): 93 | """Return a phi (sequence of image frames), using the last phi_length - 94 | 1, plus img. 95 | 96 | """ 97 | indexes = np.arange(self.top - self.phi_length + 1, self.top) 98 | 99 | phi = np.empty((self.phi_length, self.height, self.width), dtype='uint8') 100 | phi[0:self.phi_length - 1] = self.imgs.take(indexes, 101 | axis=0, 102 | mode='wrap') 103 | phi[-1] = img 104 | return phi 105 | 106 | def random_close_transitions_batch(self, batch_size, transitions_len): 107 | transition_range = transitions_len 108 | count = 0 109 | while count < batch_size: 110 | index = self.rng.randint(self.bottom, 111 | self.bottom + self.size - self.phi_length) 112 | 113 | all_indices = np.arange(index, index + self.phi_length) 114 | center_index = index + self.phi_length - 1 115 | """ 116 | frame0 frame1 frame2 frame3 117 | index center_index = index+phi-1 118 | """ 119 | if np.any(self.terminal.take(all_indices[0:-1], mode='wrap')): 120 | continue 121 | if np.any(self.terminal_index.take(all_indices, mode='wrap') == -1): 122 | continue 123 | terminal_index = self.terminal_index.take(center_index, mode='wrap') 124 | start_index = self.start_index.take(center_index, mode='wrap') 125 | self.center_positions[count] = center_index 126 | self.center_terminals[count] = self.terminal.take(center_index, mode='wrap') 127 | self.center_rewards[count] = self.rewards.take(center_index, mode='wrap') 128 | 129 | """ get forward transitions """ 130 | if terminal_index < center_index: 131 | terminal_index += self.size 132 | max_forward_index = max(min(center_index + transition_range, terminal_index), center_index+1) + 1 133 | self.forward_positions[count] = center_index + 1 134 | for i, j in zip(range(transitions_len), range(center_index + 1, max_forward_index)): 135 | self.forward_positions[count, i] = j 136 | """ get backward transitions """ 137 | if start_index + self.size < center_index: 138 | start_index += self.size 139 | min_backward_index = max(center_index - transition_range, start_index+self.phi_length-1) 140 | self.backward_positions[count] = center_index + 1 141 | for i, j in zip(range(transitions_len), range(center_index - 1, min_backward_index - 1, -1)): 142 | self.backward_positions[count, i] = j 143 | if 
self.terminal_index.take(j, mode='wrap') == -1: 144 | self.backward_positions[count, i] = center_index + 1 145 | 146 | self.center_imgs[count] = self.imgs.take(all_indices, axis=0, mode='wrap') 147 | for j in xrange(transitions_len): 148 | forward_index = self.forward_positions[count, j] 149 | backward_index = self.backward_positions[count, j] 150 | self.forward_imgs[count, j] = self.imgs.take( 151 | np.arange(forward_index - self.phi_length + 1, forward_index + 1), axis=0, mode='wrap') 152 | self.backward_imgs[count, j] = self.imgs.take( 153 | np.arange(backward_index - self.phi_length + 1, backward_index + 1), axis=0, mode='wrap') 154 | self.center_actions[count] = self.actions.take(center_index, mode='wrap') 155 | self.backward_actions[count] = self.actions.take(self.backward_positions[count], mode='wrap') 156 | self.center_return_values[count] = self.return_value.take(center_index, mode='wrap') 157 | self.forward_return_values[count] = self.return_value.take(self.forward_positions[count], mode='wrap') 158 | self.backward_return_values[count] = self.return_value.take(self.backward_positions[count], mode='wrap') 159 | distance = np.absolute(self.forward_positions[count] - center_index) 160 | self.forward_discounts[count] = self.discount_table[distance] 161 | distance = np.absolute(self.backward_positions[count] - center_index) 162 | self.backward_discounts[count] = self.discount_table[distance] 163 | # print self.backward_positions[count][::-1], self.center_positions[count], self.forward_positions[count] 164 | # print 'start=', start_index, 'center=', self.center_positions[count], 'end=', terminal_index 165 | # raw_input() 166 | count += 1 167 | 168 | def random_transitions_batch(self, batch_size, transitions_len, transition_range=10): 169 | count = 0 170 | while count < batch_size: 171 | index = self.rng.randint(self.bottom, 172 | self.bottom + self.size - self.phi_length) 173 | 174 | all_indices = np.arange(index, index + self.phi_length) 175 | center_index = index + self.phi_length - 1 176 | """ 177 | frame0 frame1 frame2 frame3 178 | index center_index = index+phi-1 179 | """ 180 | if np.any(self.terminal.take(all_indices[0:-1], mode='wrap')): 181 | continue 182 | if np.any(self.terminal_index.take(all_indices, mode='wrap') == -1): 183 | continue 184 | terminal_index = self.terminal_index.take(center_index, mode='wrap') 185 | start_index = self.start_index.take(center_index, mode='wrap') 186 | self.center_positions[count] = center_index 187 | self.center_terminals[count] = self.terminal.take(center_index, mode='wrap') 188 | self.center_rewards[count] = self.rewards.take(center_index, mode='wrap') 189 | 190 | """ get forward transitions """ 191 | if terminal_index < center_index: 192 | terminal_index += self.size 193 | max_forward_index = max(min(center_index + transition_range, terminal_index), center_index+1) + 1 194 | self.forward_positions[count, 0] = center_index+1 195 | if center_index + 2 >= max_forward_index: 196 | self.forward_positions[count, 1:] = center_index + 1 197 | else: 198 | self.forward_positions[count, 1:] = self.rng.randint(center_index+2, max_forward_index, transitions_len-1) 199 | """ get backward transitions """ 200 | 201 | if start_index + self.size < center_index: 202 | start_index += self.size 203 | min_backward_index = max(center_index - transition_range, start_index+self.phi_length-1) 204 | if min_backward_index >= center_index: 205 | self.backward_positions[count] = [center_index + 1] * transitions_len 206 | else: 207 | if center_index > self.top > 
min_backward_index: 208 | min_backward_index = self.top 209 | self.backward_positions[count] = self.rng.randint(min_backward_index, center_index, transitions_len) 210 | 211 | self.center_imgs[count] = self.imgs.take(all_indices, axis=0, mode='wrap') 212 | for j in xrange(transitions_len): 213 | forward_index = self.forward_positions[count, j] 214 | backward_index = self.backward_positions[count, j] 215 | self.forward_imgs[count, j] = self.imgs.take( 216 | np.arange(forward_index - self.phi_length + 1, forward_index + 1), axis=0, mode='wrap') 217 | self.backward_imgs[count, j] = self.imgs.take( 218 | np.arange(backward_index - self.phi_length + 1, backward_index + 1), axis=0, mode='wrap') 219 | self.center_actions[count] = self.actions.take(center_index, mode='wrap') 220 | self.backward_actions[count] = self.actions.take(self.backward_positions[count], mode='wrap') 221 | self.center_return_values[count] = self.return_value.take(center_index, mode='wrap') 222 | self.forward_return_values[count] = self.return_value.take(self.forward_positions[count], mode='wrap') 223 | self.backward_return_values[count] = self.return_value.take(self.backward_positions[count], mode='wrap') 224 | distance = np.absolute(self.forward_positions[count] - center_index) 225 | self.forward_discounts[count] = self.discount_table[distance] 226 | distance = np.absolute(self.backward_positions[count] - center_index) 227 | self.backward_discounts[count] = self.discount_table[distance] 228 | # print self.backward_positions[count][::-1], self.center_positions[count], self.forward_positions[count] 229 | # print 'start=', start_index, 'center=', self.center_positions[count], 'end=', terminal_index 230 | # raw_input() 231 | count += 1 232 | 233 | def random_imgs(self, size): 234 | imgs = np.zeros((size, 235 | self.phi_length + 1, 236 | self.height, 237 | self.width), 238 | dtype='uint8') 239 | 240 | count = 0 241 | while count < size: 242 | index = self.rng.randint(self.bottom, 243 | self.bottom + self.size - self.phi_length) 244 | all_indices = np.arange(index, index + self.phi_length + 1) 245 | end_index = index + self.phi_length - 1 246 | if np.any(self.terminal.take(all_indices[0:-2], mode='wrap')): 247 | continue 248 | imgs[count] = self.imgs.take(all_indices, axis=0, mode='wrap') 249 | count += 1 250 | return imgs 251 | 252 | -------------------------------------------------------------------------------- /code/ale_experiment.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | __author__ = 'frankhe' 3 | 4 | import logging 5 | import numpy as np 6 | import image_preprocessing 7 | 8 | # Number of rows to crop off the bottom of the (downsampled) screen. 9 | # This is appropriate for breakout, but it may need to be modified 10 | # for other games. 
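# (Note: assuming the usual 84x84 resize defaults, resize_image() below first
# rescales the 210x160 ALE frame to roughly 110x84 and then keeps an 84-row
# window ending CROP_OFFSET rows above the bottom of that rescaled image.)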
11 | CROP_OFFSET = 8 12 | 13 | 14 | class ALEExperiment(object): 15 | def __init__(self, ale, agent, resized_width, resized_height, 16 | resize_method, num_epochs, epoch_length, test_length, 17 | frame_skip, death_ends_episode, max_start_nullops, rng, flickering_buffer_size): 18 | self.ale = ale 19 | self.agent = agent 20 | self.num_epochs = num_epochs 21 | self.epoch_length = epoch_length 22 | self.test_length = test_length 23 | self.frame_skip = frame_skip 24 | self.death_ends_episode = death_ends_episode 25 | self.min_action_set = ale.getMinimalActionSet() 26 | self.resized_width = resized_width 27 | self.resized_height = resized_height 28 | self.resize_method = resize_method 29 | self.width, self.height = ale.getScreenDims() 30 | 31 | self.buffer_length = flickering_buffer_size 32 | self.buffer_count = 0 33 | self.screen_buffer = np.empty((self.buffer_length, 34 | self.height, self.width), 35 | dtype=np.uint8) 36 | 37 | self.terminal_lol = False # Most recent episode ended on a loss of life 38 | self.max_start_nullops = max_start_nullops 39 | self.rng = rng 40 | 41 | def run(self): 42 | """ 43 | Run the desired number of training epochs, a testing epoch 44 | is conducted after each training epoch. 45 | """ 46 | self.agent.time_count_start() 47 | for epoch in range(1, self.num_epochs + 1): 48 | self.run_epoch(epoch, self.epoch_length) 49 | self.agent.finish_epoch(epoch) 50 | 51 | if self.test_length > 0: 52 | self.agent.start_testing() 53 | self.run_epoch(epoch, self.test_length, True) 54 | self.agent.finish_testing(epoch) 55 | 56 | def run_epoch(self, epoch, num_steps, testing=False): 57 | """ Run one 'epoch' of training or testing, where an epoch is defined 58 | by the number of steps executed. Prints a progress report after 59 | every trial 60 | 61 | Arguments: 62 | epoch - the current epoch number 63 | num_steps - steps per epoch 64 | testing - True if this Epoch is used for testing and not training 65 | 66 | """ 67 | self.terminal_lol = False # Make sure each epoch starts with a reset. 68 | steps_left = num_steps 69 | while steps_left > 0: 70 | prefix = "testing" if testing else "training" 71 | logging.info(prefix + " epoch: " + str(epoch) + " steps_left: " + 72 | str(steps_left)) 73 | _, num_steps = self.run_episode(steps_left, testing) 74 | 75 | steps_left -= num_steps 76 | 77 | def _init_episode(self): 78 | """ This method resets the game if needed, performs enough null 79 | actions to ensure that the screen buffer is ready and optionally 80 | performs a randomly determined number of null action to randomize 81 | the initial game state.""" 82 | 83 | if not self.terminal_lol or self.ale.game_over(): 84 | self.ale.reset_game() 85 | 86 | if self.max_start_nullops > 0: 87 | random_actions = self.rng.randint(self.buffer_length-2, self.max_start_nullops+1) 88 | for _ in range(random_actions): 89 | self._act(0) # Null action 90 | 91 | # Make sure the screen buffer is filled at the beginning of 92 | # each episode... 93 | self._act(0) 94 | self._act(0) 95 | 96 | def _act(self, action): 97 | """Perform the indicated action for a single frame, return the 98 | resulting reward and store the resulting screen image in the 99 | buffer 100 | 101 | """ 102 | reward = self.ale.act(action) 103 | index = self.buffer_count % self.buffer_length 104 | 105 | self.ale.getScreenGrayscale(self.screen_buffer[index, ...]) 106 | 107 | self.buffer_count += 1 108 | return reward 109 | 110 | def _step(self, action): 111 | """ Repeat one action the appopriate number of times and return 112 | the summed reward. 
""" 113 | reward = 0 114 | for _ in range(self.frame_skip): 115 | reward += self._act(action) 116 | 117 | return reward 118 | 119 | def run_episode(self, max_steps, testing): 120 | """Run a single training episode. 121 | 122 | The boolean terminal value returned indicates whether the 123 | episode ended because the game ended or the agent died (True) 124 | or because the maximum number of steps was reached (False). 125 | Currently this value will be ignored. 126 | 127 | Return: (terminal, num_steps) 128 | 129 | """ 130 | 131 | self._init_episode() 132 | 133 | start_lives = self.ale.lives() 134 | 135 | action = self.agent.start_episode(self.get_observation()) 136 | num_steps = 0 137 | while True: 138 | reward = self._step(self.min_action_set[action]) 139 | self.terminal_lol = (self.death_ends_episode and not testing and 140 | self.ale.lives() < start_lives) 141 | terminal = self.ale.game_over() or self.terminal_lol 142 | num_steps += 1 143 | 144 | if terminal or num_steps >= max_steps: 145 | self.agent.end_episode(reward, terminal) 146 | break 147 | 148 | action = self.agent.step(reward, self.get_observation()) 149 | return terminal, num_steps 150 | 151 | def get_observation(self): 152 | """ Resize and merge the previous two screen images """ 153 | 154 | assert self.buffer_count >= self.buffer_length 155 | index = self.buffer_count % self.buffer_length - 1 156 | # max_image = np.maximum(self.screen_buffer[index, ...], 157 | # self.screen_buffer[index - 1, ...]) 158 | max_image = self.screen_buffer[index] 159 | for i in range(self.buffer_length): 160 | max_image = np.maximum(max_image, self.screen_buffer[index-i, ...]) 161 | return self.resize_image(max_image) 162 | 163 | def resize_image(self, image): 164 | """ Appropriately resize a single image """ 165 | 166 | if self.resize_method == 'crop': 167 | # resize keeping aspect ratio 168 | resize_height = int(round( 169 | float(self.height) * self.resized_width / self.width)) 170 | 171 | resized = image_preprocessing.resize(image, (self.resized_width, resize_height)) 172 | 173 | # Crop the part we want 174 | crop_y_cutoff = resize_height - CROP_OFFSET - self.resized_height 175 | cropped = resized[crop_y_cutoff: 176 | crop_y_cutoff + self.resized_height, :] 177 | 178 | return cropped 179 | elif self.resize_method == 'scale': 180 | return image_preprocessing.resize(image, (self.resized_width, self.resized_height)) 181 | else: 182 | raise ValueError('Unrecognized image resize method.') 183 | 184 | -------------------------------------------------------------------------------- /code/image_preprocessing.py: -------------------------------------------------------------------------------- 1 | #! 
/usr/bin/env python 2 | __author__ = 'frankhe' 3 | 4 | import numpy as np 5 | import matplotlib.pyplot as plt 6 | import matplotlib.image as mpimg 7 | import scipy.misc 8 | import cPickle 9 | 10 | 11 | def rgb2gray(rgb): 12 | return np.dot(rgb[..., :3], [0.299, 0.587, 0.114]) 13 | 14 | 15 | def resize(image, size): 16 | return scipy.misc.imresize(image, size=size) 17 | 18 | 19 | def imshow(photo, gray=False): 20 | if gray: 21 | plt.imshow(photo, cmap = plt.get_cmap('gray')) 22 | else: 23 | plt.imshow(photo) 24 | plt.show() 25 | 26 | 27 | def show_wall_paper(): 28 | img = mpimg.imread('wallpaper.jpg') 29 | gray = rgb2gray(img) 30 | gray = resize(gray, (1000, 1000)) 31 | imshow(gray, True) 32 | 33 | if __name__ == '__main__': 34 | f1 = open('game_images', mode='rb') 35 | images = cPickle.load(f1) 36 | print images[0].size, images[0].shape 37 | for i in range(1, len(images)): 38 | # imshow(images[i]) 39 | # print np.sum(images[i]-images[i-1]) 40 | raw_input() 41 | 42 | -------------------------------------------------------------------------------- /code/launcher.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | __author__ = 'frankhe' 3 | 4 | import os 5 | import argparse 6 | import logging 7 | try: 8 | import ale_python_interface 9 | except ImportError: 10 | import atari_py.ale_python_interface as ale_python_interface 11 | import cPickle 12 | import numpy as np 13 | import theano 14 | import time 15 | import ale_experiment 16 | import q_network 17 | import ale_agents 18 | 19 | 20 | def process_args(args, defaults, description): 21 | """ 22 | Handle the command line. 23 | 24 | args - list of command line arguments (not including executable name) 25 | defaults - a name space with variables corresponding to each of 26 | the required default command line values. 27 | description - a string to display at the top of the help message. 
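    Returns the parsed parameters as an argparse namespace, with the
    experiment prefix, death_ends_episode and freeze_interval fields
    post-processed below.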
28 | """ 29 | parser = argparse.ArgumentParser(description=description) 30 | parser.add_argument('-r', '--rom', dest="rom", default=defaults.ROM, 31 | help='ROM to run (default: %(default)s)') 32 | parser.add_argument('-e', '--epochs', dest="epochs", type=int, 33 | default=defaults.EPOCHS, 34 | help='Number of training epochs (default: %(default)s)') 35 | parser.add_argument('-s', '--steps-per-epoch', dest="steps_per_epoch", 36 | type=int, default=defaults.STEPS_PER_EPOCH, 37 | help='Number of steps per epoch (default: %(default)s)') 38 | parser.add_argument('-t', '--test-length', dest="steps_per_test", 39 | type=int, default=defaults.STEPS_PER_TEST, 40 | help='Number of steps per test (default: %(default)s)') 41 | parser.add_argument('--display-screen', dest="display_screen", 42 | action='store_true', default=False, 43 | help='Show the game screen.') 44 | parser.add_argument('--double-dqn', dest="double_dqn", 45 | action='store_true', default=False, 46 | help='enable double DQN') 47 | parser.add_argument('--experiment-prefix', dest="experiment_prefix", 48 | default=None, 49 | help='Experiment name prefix ' 50 | '(default is the name of the game)') 51 | parser.add_argument('--frame-skip', dest="frame_skip", 52 | default=defaults.FRAME_SKIP, type=int, 53 | help='Every how many frames to process ' 54 | '(default: %(default)s)') 55 | parser.add_argument('--repeat-action-probability', 56 | dest="repeat_action_probability", 57 | default=defaults.REPEAT_ACTION_PROBABILITY, type=float, 58 | help=('Probability that action choice will be ' + 59 | 'ignored (default: %(default)s)')) 60 | parser.add_argument('--update-rule', dest="update_rule", 61 | type=str, default=defaults.UPDATE_RULE, 62 | help=('deepmind_rmsprop|rmsprop|sgd ' + 63 | '(default: %(default)s)')) 64 | parser.add_argument('--batch-accumulator', dest="batch_accumulator", 65 | type=str, default=defaults.BATCH_ACCUMULATOR, 66 | help=('sum|mean (default: %(default)s)')) 67 | parser.add_argument('--learning-rate', dest="learning_rate", 68 | type=float, default=defaults.LEARNING_RATE, 69 | help='Learning rate (default: %(default)s)') 70 | parser.add_argument('--rms-decay', dest="rms_decay", 71 | type=float, default=defaults.RMS_DECAY, 72 | help='Decay rate for rms_prop (default: %(default)s)') 73 | parser.add_argument('--rms-epsilon', dest="rms_epsilon", 74 | type=float, default=defaults.RMS_EPSILON, 75 | help='Denominator epsilson for rms_prop ' + 76 | '(default: %(default)s)') 77 | parser.add_argument('--momentum', type=float, default=defaults.MOMENTUM, 78 | help=('Momentum term for Nesterov momentum. '+ 79 | '(default: %(default)s)')) 80 | parser.add_argument('--clip-delta', dest="clip_delta", type=float, 81 | default=defaults.CLIP_DELTA, 82 | help=('Max absolute value for Q-update delta value. ' + 83 | '(default: %(default)s)')) 84 | parser.add_argument('--discount', type=float, default=defaults.DISCOUNT, 85 | help='Discount rate') 86 | parser.add_argument('--epsilon-start', dest="epsilon_start", 87 | type=float, default=defaults.EPSILON_START, 88 | help=('Starting value for epsilon. ' + 89 | '(default: %(default)s)')) 90 | parser.add_argument('--epsilon-min', dest="epsilon_min", 91 | type=float, default=defaults.EPSILON_MIN, 92 | help='Minimum epsilon. (default: %(default)s)') 93 | parser.add_argument('--epsilon-decay', dest="epsilon_decay", 94 | type=float, default=defaults.EPSILON_DECAY, 95 | help=('Number of steps to minimum epsilon. 
' + 96 | '(default: %(default)s)')) 97 | parser.add_argument('--phi-length', dest="phi_length", 98 | type=int, default=defaults.PHI_LENGTH, 99 | help=('Number of recent frames used to represent ' + 100 | 'state. (default: %(default)s)')) 101 | parser.add_argument('--max-history', dest="replay_memory_size", 102 | type=int, default=defaults.REPLAY_MEMORY_SIZE, 103 | help=('Maximum number of steps stored in replay ' + 104 | 'memory. (default: %(default)s)')) 105 | parser.add_argument('--batch-size', dest="batch_size", 106 | type=int, default=defaults.BATCH_SIZE, 107 | help='Batch size. (default: %(default)s)') 108 | parser.add_argument('--network-type', dest="network_type", 109 | type=str, default=defaults.NETWORK_TYPE, 110 | help=('nips_cuda|nips_dnn|nature_cuda|nature_dnn' + 111 | '|linear (default: %(default)s)')) 112 | parser.add_argument('--freeze-interval', dest="freeze_interval", 113 | type=int, default=defaults.FREEZE_INTERVAL, 114 | help=('Interval between target freezes. ' + 115 | '(default: %(default)s)')) 116 | parser.add_argument('--update-frequency', dest="update_frequency", 117 | type=int, default=defaults.UPDATE_FREQUENCY, 118 | help=('Number of actions before each SGD update. '+ 119 | '(default: %(default)s)')) 120 | parser.add_argument('--replay-start-size', dest="replay_start_size", 121 | type=int, default=defaults.REPLAY_START_SIZE, 122 | help=('Number of random steps before training. ' + 123 | '(default: %(default)s)')) 124 | parser.add_argument('--resize-method', dest="resize_method", 125 | type=str, default=defaults.RESIZE_METHOD, 126 | help=('crop|scale (default: %(default)s)')) 127 | parser.add_argument('--nn-file', dest="nn_file", type=str, default=None, 128 | help='Pickle file containing trained net.') 129 | parser.add_argument('--death-ends-episode', dest="death_ends_episode", 130 | type=str, default=defaults.DEATH_ENDS_EPISODE, 131 | help=('true|false (default: %(default)s)')) 132 | parser.add_argument('--max-start-nullops', dest="max_start_nullops", 133 | type=int, default=defaults.MAX_START_NULLOPS, 134 | help=('Maximum number of null-ops at the start ' + 135 | 'of games. (default: %(default)s)')) 136 | parser.add_argument('--deterministic', dest="deterministic", 137 | action='store_false', default=defaults.DETERMINISTIC, 138 | help=('Whether to use deterministic parameters ' + 139 | 'for learning. (default: %(default)s)')) 140 | parser.add_argument('--cudnn_deterministic', dest="cudnn_deterministic", 141 | type=bool, default=defaults.CUDNN_DETERMINISTIC, 142 | help=('Whether to use deterministic backprop. 
' + 143 | '(default: %(default)s)')) 144 | parser.add_argument('--flickering-buffer', dest="flickering_buffer_size", 145 | type=int, default=defaults.FLICKERING_BUFFER_SIZE, 146 | help='anti flickering buffer size') 147 | parser.add_argument('--method', dest="method", 148 | type=str, default=defaults.METHOD, 149 | help='choose different learning algorithms') 150 | parser.add_argument('--transition-len', dest='transition_length', 151 | type=int, default=4, 152 | help='transition length in Optimality Tightening') 153 | parser.add_argument('--transition-range', dest='transition_range', 154 | type=int, default=10, 155 | help='transition range in Optimality Tightening Sampling') 156 | parser.add_argument('--penalty-method', dest='penalty_method', 157 | type=str, default='max', 158 | help='penalty method') 159 | parser.add_argument('--two-train', dest="two_train", 160 | action='store_true', default=False, 161 | help='doing two gradient descents per update') 162 | parser.add_argument('--save-pkl', dest="save_pkl", 163 | action='store_true', default=False, 164 | help='saving network parameters') 165 | parser.add_argument('--weight-min', dest='weight_min', 166 | type=float, default=0.8, 167 | help='weight min for penalty method') 168 | parser.add_argument('--weight-max', dest='weight_max', 169 | type=float, default=0.8, 170 | help='weight max for penalty method') 171 | parser.add_argument('--anneal-len', dest='annealing_len', 172 | type=float, default=5000000, 173 | help='annealing length of penalty method') 174 | parser.add_argument('--close2', dest="close2", 175 | action='store_true', default=False, 176 | help='choose close bounds') 177 | parser.add_argument('--late2', dest="late2", 178 | action='store_true', default=False, 179 | help='delay the penalty') 180 | parser.add_argument('--verbose', dest="verbose", 181 | type=int, default=0, 182 | help='1: check correctness,') 183 | 184 | parameters = parser.parse_args(args) 185 | if parameters.experiment_prefix is None: 186 | name = os.path.splitext(os.path.basename(parameters.rom))[0] 187 | parameters.experiment_prefix = name + time.strftime("_%m-%d-%H-%M-%S_", time.gmtime()) + parameters.method 188 | 189 | if parameters.double_dqn: 190 | parameters.experiment_prefix += '_(double)' 191 | 192 | parameters.experiment_prefix += '_' + str(parameters.learning_rate) + '_(ep' + str(parameters.epochs) + ')' 193 | 194 | if parameters.death_ends_episode == 'true': 195 | parameters.death_ends_episode = True 196 | elif parameters.death_ends_episode == 'false': 197 | parameters.death_ends_episode = False 198 | else: 199 | raise ValueError("--death-ends-episode must be true or false") 200 | 201 | if parameters.freeze_interval > 0: 202 | # This addresses an inconsistency between the Nature paper and 203 | # the Deepmind code. The paper states that the target network 204 | # update frequency is "measured in the number of parameter 205 | # updates". In the code it is actually measured in the number 206 | # of action choices. 207 | parameters.freeze_interval = (parameters.freeze_interval // 208 | parameters.update_frequency) 209 | 210 | return parameters 211 | 212 | 213 | def launch(args, defaults, description): 214 | """ 215 | Execute a complete training run. 
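    Parses the command line, builds the ALE environment, the Q-network and
    the OptimalityTightening agent, and then runs an ALEExperiment.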
216 | """ 217 | 218 | logging.basicConfig(level=logging.INFO) 219 | parameters = process_args(args, defaults, description) 220 | 221 | if parameters.rom.endswith('.bin'): 222 | rom = parameters.rom 223 | else: 224 | rom = "%s.bin" % parameters.rom 225 | full_rom_path = os.path.join(defaults.BASE_ROM_PATH, rom) 226 | 227 | if parameters.deterministic: 228 | rng = np.random.RandomState(123456) 229 | else: 230 | rng = np.random.RandomState() 231 | 232 | if parameters.cudnn_deterministic: 233 | theano.config.dnn.conv.algo_bwd = 'deterministic' 234 | 235 | ale = ale_python_interface.ALEInterface() 236 | ale.setInt('random_seed', rng.randint(1000)) 237 | 238 | if parameters.display_screen: 239 | import sys 240 | if sys.platform == 'darwin': 241 | import pygame 242 | pygame.init() 243 | ale.setBool('sound', False) # Sound doesn't work on OSX 244 | 245 | ale.setBool('display_screen', parameters.display_screen) 246 | ale.setFloat('repeat_action_probability', 247 | parameters.repeat_action_probability) 248 | 249 | ale.loadROM(full_rom_path) 250 | 251 | num_actions = len(ale.getMinimalActionSet()) 252 | 253 | agent = None 254 | 255 | if not parameters.close2: 256 | print 'transition length is ', parameters.transition_length, 'transition range is', parameters.transition_range 257 | if parameters.method == 'ot': 258 | if parameters.nn_file is None: 259 | network = q_network.DeepQLearner(defaults.RESIZED_WIDTH, 260 | defaults.RESIZED_HEIGHT, 261 | num_actions, 262 | parameters.phi_length, 263 | parameters.discount, 264 | parameters.learning_rate, 265 | parameters.rms_decay, 266 | parameters.rms_epsilon, 267 | parameters.momentum, 268 | parameters.clip_delta, 269 | parameters.freeze_interval, 270 | parameters.batch_size, 271 | parameters.network_type, 272 | parameters.update_rule, 273 | parameters.batch_accumulator, 274 | rng, double=parameters.double_dqn, 275 | transition_length=parameters.transition_length) 276 | else: 277 | handle = open(parameters.nn_file, 'r') 278 | network = cPickle.load(handle) 279 | 280 | agent = ale_agents.OptimalityTightening(network, 281 | parameters.epsilon_start, 282 | parameters.epsilon_min, 283 | parameters.epsilon_decay, 284 | parameters.replay_memory_size, 285 | parameters.experiment_prefix, 286 | parameters.update_frequency, 287 | parameters.replay_start_size, 288 | rng, 289 | parameters.transition_length, 290 | parameters.transition_range, 291 | parameters.penalty_method, 292 | parameters.weight_min, 293 | parameters.weight_max, 294 | parameters.annealing_len, 295 | parameters.two_train, 296 | parameters.late2, 297 | parameters.close2, 298 | parameters.verbose, 299 | parameters.double_dqn, 300 | parameters.save_pkl) 301 | 302 | experiment = ale_experiment.ALEExperiment(ale, agent, 303 | defaults.RESIZED_WIDTH, 304 | defaults.RESIZED_HEIGHT, 305 | parameters.resize_method, 306 | parameters.epochs, 307 | parameters.steps_per_epoch, 308 | parameters.steps_per_test, 309 | parameters.frame_skip, 310 | parameters.death_ends_episode, 311 | parameters.max_start_nullops, 312 | rng, 313 | parameters.flickering_buffer_size) 314 | 315 | experiment.run() 316 | 317 | if __name__ == '__main__': 318 | pass 319 | -------------------------------------------------------------------------------- /code/q_network.py: -------------------------------------------------------------------------------- 1 | #! 
/usr/bin/env python 2 | __author__ = 'frankhe' 3 | 4 | import lasagne 5 | import numpy as np 6 | import theano 7 | import theano.tensor as T 8 | from updates import deepmind_rmsprop 9 | 10 | 11 | class DeepQLearner: 12 | def __init__(self, input_width, input_height, num_actions, 13 | num_frames, discount, learning_rate, rho, 14 | rms_epsilon, momentum, clip_delta, freeze_interval, 15 | batch_size, network_type, update_rule, 16 | batch_accumulator, rng, input_scale=255.0, 17 | double=False, transition_length=4): 18 | 19 | if double: 20 | print 'USING DOUBLE DQN' 21 | self.input_width = input_width 22 | self.input_height = input_height 23 | self.num_actions = num_actions 24 | self.num_frames = num_frames 25 | self.batch_size = batch_size 26 | self.discount = discount 27 | self.rho = rho 28 | self.lr = learning_rate 29 | self.rms_epsilon = rms_epsilon 30 | self.momentum = momentum 31 | self.clip_delta = clip_delta 32 | self.freeze_interval = freeze_interval 33 | self.rng = rng 34 | 35 | lasagne.random.set_rng(self.rng) 36 | 37 | self.update_counter = 0 38 | 39 | self.l_out = self.build_network(network_type, input_width, input_height, 40 | num_actions, num_frames, batch_size) 41 | if self.freeze_interval > 0: 42 | self.next_l_out = self.build_network(network_type, input_width, 43 | input_height, num_actions, 44 | num_frames, batch_size) 45 | self.reset_q_hat() 46 | 47 | states = T.tensor4('states_t') 48 | actions = T.icol('actions_t') 49 | target = T.col('evaluation_t') 50 | 51 | self.states_shared = theano.shared( 52 | np.zeros((batch_size, num_frames, input_height, input_width), 53 | dtype=theano.config.floatX)) 54 | self.actions_shared = theano.shared( 55 | np.zeros((batch_size, 1), dtype='int32'), 56 | broadcastable=(False, True)) 57 | self.target_shared = theano.shared( 58 | np.zeros((batch_size, 1), dtype=theano.config.floatX), 59 | broadcastable=(False, True)) 60 | 61 | self.states_transition_shared = theano.shared( 62 | np.zeros((batch_size, transition_length * 2, num_frames, input_height, input_width), 63 | dtype=theano.config.floatX)) 64 | self.states_one_shared = theano.shared( 65 | np.zeros((num_frames, input_height, input_width), 66 | dtype=theano.config.floatX)) 67 | 68 | q_vals = lasagne.layers.get_output(self.l_out, states / input_scale) 69 | 70 | """get Q(s) batch_size = 1 """ 71 | q1_givens = { 72 | states: self.states_one_shared.reshape((1, 73 | self.num_frames, 74 | self.input_height, 75 | self.input_width)) 76 | } 77 | self._q1_vals = theano.function([], q_vals[0], givens=q1_givens) 78 | """get Q(s) batch_size = batch size """ 79 | q_batch_givens = { 80 | states: self.states_shared.reshape((self.batch_size, 81 | self.num_frames, 82 | self.input_height, 83 | self.input_width)) 84 | } 85 | self._q_batch_vals = theano.function([], q_vals, givens=q_batch_givens) 86 | 87 | action_mask = T.eq(T.arange(num_actions).reshape((1, -1)), 88 | actions.reshape((-1, 1))).astype(theano.config.floatX) 89 | 90 | q_s_a = (q_vals * action_mask).sum(axis=1).reshape((-1, 1)) 91 | """ get Q(s,a) batch_size = batch size """ 92 | q_s_a_givens = { 93 | states: self.states_shared.reshape((self.batch_size, 94 | self.num_frames, 95 | self.input_height, 96 | self.input_width)), 97 | actions: self.actions_shared 98 | } 99 | self._q_s_a_vals = theano.function([], q_s_a, givens=q_s_a_givens) 100 | 101 | if self.freeze_interval > 0: 102 | q_target_vals = lasagne.layers.get_output(self.next_l_out, 103 | states / input_scale) 104 | else: 105 | q_target_vals = lasagne.layers.get_output(self.l_out, 106 | 
states / input_scale) 107 | q_target_vals = theano.gradient.disconnected_grad(q_target_vals) 108 | 109 | if not double: 110 | q_target = T.max(q_target_vals, axis=1) 111 | else: 112 | greedy_actions = T.argmax(q_vals, axis=1) 113 | q_target_mask = T.eq(T.arange(num_actions).reshape((1, -1)), 114 | greedy_actions.reshape((-1, 1)).astype(theano.config.floatX)) 115 | q_target = (q_target_vals * q_target_mask).sum(axis=1).reshape((-1, 1)) 116 | """get Q target Q'(s,a') for a batch of transitions batch size = batch_size * transition length""" 117 | q_target_transition_givens = { 118 | states: self.states_transition_shared.reshape( 119 | (batch_size * transition_length * 2, self.num_frames, self.input_height, self.input_width)) 120 | } 121 | self._q_target = theano.function([], q_target.reshape((batch_size, transition_length * 2)), 122 | givens=q_target_transition_givens) 123 | """get Q target_vals Q'(s) for a batch of transitions batch size = batch_size * transition length""" 124 | self._q_target_vals = theano.function([], q_target_vals.reshape( 125 | (batch_size, transition_length * 2, num_actions)), givens=q_target_transition_givens) 126 | 127 | diff = q_s_a - target 128 | 129 | if self.clip_delta > 0: 130 | # If we simply take the squared clipped diff as our loss, 131 | # then the gradient will be zero whenever the diff exceeds 132 | # the clip bounds. To avoid this, we extend the loss 133 | # linearly past the clip point to keep the gradient constant 134 | # in that regime. 135 | # 136 | # This is equivalent to declaring d loss/d q_vals to be 137 | # equal to the clipped diff, then backpropagating from 138 | # there, which is what the DeepMind implementation does. 139 | quadratic_part = T.minimum(abs(diff), self.clip_delta) 140 | linear_part = abs(diff) - quadratic_part 141 | loss = 0.5 * quadratic_part ** 2 + self.clip_delta * linear_part 142 | else: 143 | loss = 0.5 * diff ** 2 144 | 145 | if batch_accumulator == 'sum': 146 | loss = T.sum(loss) 147 | elif batch_accumulator == 'mean': 148 | loss = T.mean(loss) 149 | else: 150 | raise ValueError("Bad accumulator: {}".format(batch_accumulator)) 151 | 152 | params = lasagne.layers.helper.get_all_params(self.l_out) 153 | 154 | if update_rule == 'deepmind_rmsprop': 155 | updates = deepmind_rmsprop(loss, params, self.lr, self.rho, 156 | self.rms_epsilon) 157 | elif update_rule == 'rmsprop': 158 | updates = lasagne.updates.rmsprop(loss, params, self.lr, self.rho, 159 | self.rms_epsilon) 160 | elif update_rule == 'sgd': 161 | updates = lasagne.updates.sgd(loss, params, self.lr) 162 | else: 163 | raise ValueError("Unrecognized update: {}".format(update_rule)) 164 | 165 | if self.momentum > 0: 166 | updates = lasagne.updates.apply_momentum(updates, None, 167 | self.momentum) 168 | """Q(s,a) target train()""" 169 | train_givens = { 170 | states: self.states_shared, 171 | actions: self.actions_shared, 172 | target: self.target_shared 173 | } 174 | self._train = theano.function([], [loss], updates=updates, givens=train_givens, on_unused_input='warn') 175 | 176 | self._train2 = theano.function([], [loss], updates=updates, givens=train_givens, on_unused_input='warn') 177 | 178 | def q_vals(self, single_state): 179 | self.states_one_shared.set_value(single_state) 180 | return self._q1_vals() 181 | 182 | def q_batch_vals(self, states): 183 | self.states_shared.set_value(states) 184 | return self._q_batch_vals() 185 | 186 | def q_s_a_batch_vals(self, states, actions): 187 | self.states_shared.set_value(states) 188 | self.actions_shared.set_value(actions) 
189 | return self._q_s_a_vals() 190 | 191 | def q_target(self, batch_transition_states): 192 | self.states_transition_shared.set_value(batch_transition_states) 193 | return self._q_target() 194 | 195 | def q_target_vals(self, batch_transition_states): 196 | self.states_transition_shared.set_value(batch_transition_states) 197 | return self._q_target_vals() 198 | 199 | def train(self, states, actions, target): 200 | self.states_shared.set_value(states) 201 | self.actions_shared.set_value(actions) 202 | self.target_shared.set_value(target) 203 | if self.freeze_interval > 0 and self.update_counter % self.freeze_interval == 0: 204 | self.reset_q_hat() 205 | loss = self._train() 206 | self.update_counter += 1 207 | return np.sqrt(loss) 208 | 209 | def train2(self, states, actions, target): 210 | self.states_shared.set_value(states) 211 | self.actions_shared.set_value(actions) 212 | self.target_shared.set_value(target) 213 | if self.freeze_interval > 0 and self.update_counter % self.freeze_interval == 0: 214 | self.reset_q_hat() 215 | loss = self._train2() 216 | return np.sqrt(loss) 217 | 218 | def build_network(self, network_type, input_width, input_height, 219 | output_dim, num_frames, batch_size): 220 | if network_type == "nature_cuda": 221 | return self.build_nature_network(input_width, input_height, 222 | output_dim, num_frames, batch_size) 223 | if network_type == "nature_dnn": 224 | return self.build_nature_network_dnn(input_width, input_height, 225 | output_dim, num_frames, 226 | batch_size) 227 | elif network_type == "linear": 228 | return self.build_linear_network(input_width, input_height, 229 | output_dim, num_frames, batch_size) 230 | else: 231 | raise ValueError("Unrecognized network: {}".format(network_type)) 232 | 233 | def choose_action(self, state, epsilon): 234 | if self.rng.rand() < epsilon: 235 | return self.rng.randint(0, self.num_actions) 236 | q_vals = self.q_vals(state) 237 | return np.argmax(q_vals) 238 | 239 | def reset_q_hat(self): 240 | all_params = lasagne.layers.helper.get_all_param_values(self.l_out) 241 | lasagne.layers.helper.set_all_param_values(self.next_l_out, all_params) 242 | 243 | def build_nature_network(self, input_width, input_height, output_dim, 244 | num_frames, batch_size): 245 | """ 246 | Build a large network consistent with the DeepMind Nature paper. 
247 | """ 248 | from lasagne.layers import cuda_convnet 249 | 250 | l_in = lasagne.layers.InputLayer( 251 | shape=(None, num_frames, input_width, input_height) 252 | ) 253 | 254 | l_conv1 = cuda_convnet.Conv2DCCLayer( 255 | l_in, 256 | num_filters=32, 257 | filter_size=(8, 8), 258 | stride=(4, 4), 259 | nonlinearity=lasagne.nonlinearities.rectify, 260 | W=lasagne.init.HeUniform(), # Defaults to Glorot 261 | b=lasagne.init.Constant(.1), 262 | dimshuffle=True 263 | ) 264 | 265 | l_conv2 = cuda_convnet.Conv2DCCLayer( 266 | l_conv1, 267 | num_filters=64, 268 | filter_size=(4, 4), 269 | stride=(2, 2), 270 | nonlinearity=lasagne.nonlinearities.rectify, 271 | W=lasagne.init.HeUniform(), 272 | b=lasagne.init.Constant(.1), 273 | dimshuffle=True 274 | ) 275 | 276 | l_conv3 = cuda_convnet.Conv2DCCLayer( 277 | l_conv2, 278 | num_filters=64, 279 | filter_size=(3, 3), 280 | stride=(1, 1), 281 | nonlinearity=lasagne.nonlinearities.rectify, 282 | W=lasagne.init.HeUniform(), 283 | b=lasagne.init.Constant(.1), 284 | dimshuffle=True 285 | ) 286 | 287 | l_hidden1 = lasagne.layers.DenseLayer( 288 | l_conv3, 289 | num_units=512, 290 | nonlinearity=lasagne.nonlinearities.rectify, 291 | W=lasagne.init.HeUniform(), 292 | b=lasagne.init.Constant(.1) 293 | ) 294 | 295 | l_out = lasagne.layers.DenseLayer( 296 | l_hidden1, 297 | num_units=output_dim, 298 | nonlinearity=None, 299 | W=lasagne.init.HeUniform(), 300 | b=lasagne.init.Constant(.1) 301 | ) 302 | 303 | return l_out 304 | 305 | def build_nature_network_dnn(self, input_width, input_height, output_dim, 306 | num_frames, batch_size): 307 | """ 308 | Build a large network consistent with the DeepMind Nature paper. 309 | """ 310 | from lasagne.layers import dnn 311 | 312 | l_in = lasagne.layers.InputLayer( 313 | shape=(None, num_frames, input_width, input_height) 314 | ) 315 | 316 | l_conv1 = dnn.Conv2DDNNLayer( 317 | l_in, 318 | num_filters=32, 319 | filter_size=(8, 8), 320 | stride=(4, 4), 321 | nonlinearity=lasagne.nonlinearities.rectify, 322 | W=lasagne.init.HeUniform(), 323 | b=lasagne.init.Constant(.1) 324 | ) 325 | 326 | l_conv2 = dnn.Conv2DDNNLayer( 327 | l_conv1, 328 | num_filters=64, 329 | filter_size=(4, 4), 330 | stride=(2, 2), 331 | nonlinearity=lasagne.nonlinearities.rectify, 332 | W=lasagne.init.HeUniform(), 333 | b=lasagne.init.Constant(.1) 334 | ) 335 | 336 | l_conv3 = dnn.Conv2DDNNLayer( 337 | l_conv2, 338 | num_filters=64, 339 | filter_size=(3, 3), 340 | stride=(1, 1), 341 | nonlinearity=lasagne.nonlinearities.rectify, 342 | W=lasagne.init.HeUniform(), 343 | b=lasagne.init.Constant(.1) 344 | ) 345 | 346 | l_hidden1 = lasagne.layers.DenseLayer( 347 | l_conv3, 348 | num_units=512, 349 | nonlinearity=lasagne.nonlinearities.rectify, 350 | W=lasagne.init.HeUniform(), 351 | b=lasagne.init.Constant(.1) 352 | ) 353 | 354 | l_out = lasagne.layers.DenseLayer( 355 | l_hidden1, 356 | num_units=output_dim, 357 | nonlinearity=None, 358 | W=lasagne.init.HeUniform(), 359 | b=lasagne.init.Constant(.1) 360 | ) 361 | 362 | return l_out 363 | 364 | def build_linear_network(self, input_width, input_height, output_dim, 365 | num_frames, batch_size): 366 | """ 367 | Build a simple linear learner. Useful for creating 368 | tests that sanity-check the weight update code. 
369 | """ 370 | 371 | l_in = lasagne.layers.InputLayer( 372 | shape=(None, num_frames, input_width, input_height) 373 | ) 374 | 375 | l_out = lasagne.layers.DenseLayer( 376 | l_in, 377 | num_units=output_dim, 378 | nonlinearity=None, 379 | W=lasagne.init.Constant(0.0), 380 | b=None 381 | ) 382 | 383 | return l_out 384 | -------------------------------------------------------------------------------- /code/run_OT.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | __author__ = 'frankhe' 3 | 4 | import launcher 5 | import sys 6 | 7 | 8 | class Defaults: 9 | # ---------------------- 10 | # Experiment Parameters 11 | # ---------------------- 12 | STEPS_PER_EPOCH = 250000 13 | EPOCHS = 50 14 | # runtime evaluation, not 30 no-op evaluation 15 | STEPS_PER_TEST = 125000 16 | 17 | # ---------------------- 18 | # ALE Parameters 19 | # ---------------------- 20 | BASE_ROM_PATH = "../roms/" 21 | ROM = 'gopher.bin' 22 | FRAME_SKIP = 4 23 | REPEAT_ACTION_PROBABILITY = 0 24 | 25 | # ---------------------- 26 | # Agent/Network parameters: 27 | # ---------------------- 28 | UPDATE_RULE = 'deepmind_rmsprop' 29 | BATCH_ACCUMULATOR = 'sum' 30 | LEARNING_RATE = .00025 31 | DISCOUNT = .99 32 | RMS_DECAY = .95 # (Rho) 33 | RMS_EPSILON = .01 34 | MOMENTUM = 0 # Note that the "momentum" value mentioned in the Nature 35 | # paper is not used in the same way as a traditional momentum 36 | # term. It is used to track gradient for the purpose of 37 | # estimating the standard deviation. This package uses 38 | # rho/RMS_DECAY to track both the history of the gradient 39 | # and the squared gradient. 40 | CLIP_DELTA = 1.0 41 | EPSILON_START = 1.0 42 | EPSILON_MIN = .1 43 | EPSILON_DECAY = 1000000 44 | PHI_LENGTH = 4 45 | UPDATE_FREQUENCY = 4 46 | REPLAY_MEMORY_SIZE = 1000000 47 | BATCH_SIZE = 32 48 | NETWORK_TYPE = "nature_dnn" 49 | FREEZE_INTERVAL = 10000 50 | REPLAY_START_SIZE = 50000 51 | RESIZE_METHOD = 'scale' 52 | RESIZED_WIDTH = 84 53 | RESIZED_HEIGHT = 84 54 | DEATH_ENDS_EPISODE = 'true' 55 | MAX_START_NULLOPS = 30 56 | DETERMINISTIC = True 57 | CUDNN_DETERMINISTIC = False 58 | FLICKERING_BUFFER_SIZE = 2 59 | METHOD = 'ot' 60 | 61 | if __name__ == "__main__": 62 | launcher.launch(sys.argv[1:], Defaults, __doc__) 63 | -------------------------------------------------------------------------------- /code/updates.py: -------------------------------------------------------------------------------- 1 | """ 2 | Gradient update rules for the deep_q_rl package. 3 | 4 | Some code here is modified from the Lasagne package: 5 | 6 | https://github.com/Lasagne/Lasagne/blob/master/LICENSE 7 | 8 | """ 9 | 10 | import theano 11 | import theano.tensor as T 12 | from lasagne.updates import get_or_compute_grads 13 | from collections import OrderedDict 14 | import numpy as np 15 | 16 | # The MIT License (MIT) 17 | 18 | # Copyright (c) 2014 Sander Dieleman 19 | 20 | # Permission is hereby granted, free of charge, to any person obtaining a copy 21 | # of this software and associated documentation files (the "Software"), to deal 22 | # in the Software without restriction, including without limitation the rights 23 | # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 24 | # copies of the Software, and to permit persons to whom the Software is 25 | # furnished to do so, subject to the following conditions: 26 | 27 | # The above copyright notice and this permission notice shall be included in all 28 | # copies or substantial portions of the Software. 
29 | 30 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 31 | # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 32 | # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 33 | # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 34 | # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 35 | # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 36 | # SOFTWARE. 37 | # The MIT License (MIT) 38 | 39 | # Copyright (c) 2014 Sander Dieleman 40 | 41 | # Permission is hereby granted, free of charge, to any person obtaining a copy 42 | # of this software and associated documentation files (the "Software"), to deal 43 | # in the Software without restriction, including without limitation the rights 44 | # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 45 | # copies of the Software, and to permit persons to whom the Software is 46 | # furnished to do so, subject to the following conditions: 47 | 48 | # The above copyright notice and this permission notice shall be included in all 49 | # copies or substantial portions of the Software. 50 | 51 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 52 | # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 53 | # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 54 | # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 55 | # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 56 | # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 57 | # SOFTWARE. 58 | 59 | def deepmind_rmsprop(loss_or_grads, params, learning_rate, 60 | rho, epsilon): 61 | """RMSProp updates [1]_. 62 | 63 | Scale learning rates by dividing with a running estimate of the standard 64 | deviation of the gradient (the centered variant used in the DeepMind DQN code). 65 | 66 | Parameters 67 | ---------- 68 | loss_or_grads : symbolic expression or list of expressions 69 | A scalar loss expression, or a list of gradient expressions 70 | params : list of shared variables 71 | The variables to generate update expressions for 72 | learning_rate : float or symbolic scalar 73 | The learning rate controlling the size of update steps 74 | rho : float or symbolic scalar 75 | Gradient moving average decay factor 76 | epsilon : float or symbolic scalar 77 | Small value added for numerical stability 78 | 79 | Returns 80 | ------- 81 | OrderedDict 82 | A dictionary mapping each parameter to its update expression 83 | 84 | Notes 85 | ----- 86 | `rho` should be between 0 and 1. A value of `rho` close to 1 will decay the 87 | moving average slowly and a value close to 0 will decay the moving average 88 | fast. 89 | 90 | Using the step size :math:`\\eta` and a decay factor :math:`\\rho` the 91 | learning rate :math:`\\eta_t` is calculated as: 92 | 93 | .. math:: 94 | m_t &= \\rho m_{t-1} + (1-\\rho)*g, \\qquad r_t = \\rho r_{t-1} + (1-\\rho)*g^2\\\\ 95 | \\eta_t &= \\frac{\\eta}{\\sqrt{r_t - m_t^2 + \\epsilon}} 96 | 97 | References 98 | ---------- 99 | .. [1] Tieleman, T. and Hinton, G. (2012): 100 | Neural Networks for Machine Learning, Lecture 6.5 - rmsprop. 101 | Coursera. 
http://www.youtube.com/watch?v=O3sxAc4hxZU (formula @5:20) 102 | """ 103 | 104 | grads = get_or_compute_grads(loss_or_grads, params) 105 | updates = OrderedDict() 106 | 107 | for param, grad in zip(params, grads): 108 | value = param.get_value(borrow=True) 109 | 110 | acc_grad = theano.shared(np.zeros(value.shape, dtype=value.dtype), 111 | broadcastable=param.broadcastable) 112 | acc_grad_new = rho * acc_grad + (1 - rho) * grad 113 | 114 | acc_rms = theano.shared(np.zeros(value.shape, dtype=value.dtype), 115 | broadcastable=param.broadcastable) 116 | acc_rms_new = rho * acc_rms + (1 - rho) * grad ** 2 117 | 118 | 119 | updates[acc_grad] = acc_grad_new 120 | updates[acc_rms] = acc_rms_new 121 | 122 | updates[param] = (param - learning_rate * 123 | (grad / 124 | T.sqrt(acc_rms_new - acc_grad_new **2 + epsilon))) 125 | 126 | return updates 127 | -------------------------------------------------------------------------------- /figures/a3c_fig.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/figures/a3c_fig.png -------------------------------------------------------------------------------- /figures/beam_rider_time.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/figures/beam_rider_time.png -------------------------------------------------------------------------------- /figures/breakout_time.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/figures/breakout_time.png -------------------------------------------------------------------------------- /figures/frame.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/figures/frame.png -------------------------------------------------------------------------------- /figures/frostbite_cl2_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/figures/frostbite_cl2_1.png -------------------------------------------------------------------------------- /figures/frostbite_cl2_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/figures/frostbite_cl2_2.png -------------------------------------------------------------------------------- /figures/frostbite_r15_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/figures/frostbite_r15_1.png -------------------------------------------------------------------------------- /figures/frostbite_r15_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/figures/frostbite_r15_2.png -------------------------------------------------------------------------------- /figures/gopher.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/figures/gopher.png -------------------------------------------------------------------------------- /figures/gopher_running.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/figures/gopher_running.png -------------------------------------------------------------------------------- /figures/hero.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/figures/hero.png -------------------------------------------------------------------------------- /figures/pong_time.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/figures/pong_time.png -------------------------------------------------------------------------------- /figures/qbert_time.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/figures/qbert_time.png -------------------------------------------------------------------------------- /figures/rescale.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/figures/rescale.png -------------------------------------------------------------------------------- /figures/space_invaders_time.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/figures/space_invaders_time.png -------------------------------------------------------------------------------- /figures/star_gunner.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/figures/star_gunner.png -------------------------------------------------------------------------------- /figures/star_gunner_running.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/figures/star_gunner_running.png -------------------------------------------------------------------------------- /figures/zaxxon.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/figures/zaxxon.png -------------------------------------------------------------------------------- /roms/air_raid.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/air_raid.bin -------------------------------------------------------------------------------- /roms/alien.bin: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/alien.bin -------------------------------------------------------------------------------- /roms/amidar.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/amidar.bin -------------------------------------------------------------------------------- /roms/assault.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/assault.bin -------------------------------------------------------------------------------- /roms/asterix.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/asterix.bin -------------------------------------------------------------------------------- /roms/asteroids.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/asteroids.bin -------------------------------------------------------------------------------- /roms/atlantis.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/atlantis.bin -------------------------------------------------------------------------------- /roms/bank_heist.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/bank_heist.bin -------------------------------------------------------------------------------- /roms/battle_zone.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/battle_zone.bin -------------------------------------------------------------------------------- /roms/beam_rider.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/beam_rider.bin -------------------------------------------------------------------------------- /roms/berzerk.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/berzerk.bin -------------------------------------------------------------------------------- /roms/bowling.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/bowling.bin -------------------------------------------------------------------------------- /roms/boxing.bin: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/boxing.bin -------------------------------------------------------------------------------- /roms/breakout.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/breakout.bin -------------------------------------------------------------------------------- /roms/carnival.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/carnival.bin -------------------------------------------------------------------------------- /roms/centipede.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/centipede.bin -------------------------------------------------------------------------------- /roms/chopper_command.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/chopper_command.bin -------------------------------------------------------------------------------- /roms/crazy_climber.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/crazy_climber.bin -------------------------------------------------------------------------------- /roms/defender.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/defender.bin -------------------------------------------------------------------------------- /roms/demon_attack.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/demon_attack.bin -------------------------------------------------------------------------------- /roms/double_dunk.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/double_dunk.bin -------------------------------------------------------------------------------- /roms/elevator_action.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/elevator_action.bin -------------------------------------------------------------------------------- /roms/enduro.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/enduro.bin -------------------------------------------------------------------------------- /roms/fishing_derby.bin: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/fishing_derby.bin -------------------------------------------------------------------------------- /roms/freeway.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/freeway.bin -------------------------------------------------------------------------------- /roms/frostbite.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/frostbite.bin -------------------------------------------------------------------------------- /roms/gopher.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/gopher.bin -------------------------------------------------------------------------------- /roms/gravitar.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/gravitar.bin -------------------------------------------------------------------------------- /roms/hero.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/hero.bin -------------------------------------------------------------------------------- /roms/ice_hockey.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/ice_hockey.bin -------------------------------------------------------------------------------- /roms/jamesbond.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/jamesbond.bin -------------------------------------------------------------------------------- /roms/journey_escape.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/journey_escape.bin -------------------------------------------------------------------------------- /roms/kangaroo.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/kangaroo.bin -------------------------------------------------------------------------------- /roms/krull.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/krull.bin -------------------------------------------------------------------------------- /roms/kung_fu_master.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/kung_fu_master.bin 
-------------------------------------------------------------------------------- /roms/montezuma_revenge.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/montezuma_revenge.bin -------------------------------------------------------------------------------- /roms/ms_pacman.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/ms_pacman.bin -------------------------------------------------------------------------------- /roms/name_this_game.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/name_this_game.bin -------------------------------------------------------------------------------- /roms/phoenix.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/phoenix.bin -------------------------------------------------------------------------------- /roms/pitfall.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/pitfall.bin -------------------------------------------------------------------------------- /roms/pong.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/pong.bin -------------------------------------------------------------------------------- /roms/pooyan.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/pooyan.bin -------------------------------------------------------------------------------- /roms/private_eye.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/private_eye.bin -------------------------------------------------------------------------------- /roms/qbert.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/qbert.bin -------------------------------------------------------------------------------- /roms/riverraid.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/riverraid.bin -------------------------------------------------------------------------------- /roms/road_runner.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/road_runner.bin -------------------------------------------------------------------------------- /roms/robotank.bin: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/robotank.bin -------------------------------------------------------------------------------- /roms/seaquest.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/seaquest.bin -------------------------------------------------------------------------------- /roms/skiing.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/skiing.bin -------------------------------------------------------------------------------- /roms/solaris.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/solaris.bin -------------------------------------------------------------------------------- /roms/space_invaders.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/space_invaders.bin -------------------------------------------------------------------------------- /roms/star_gunner.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/star_gunner.bin -------------------------------------------------------------------------------- /roms/tennis.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/tennis.bin -------------------------------------------------------------------------------- /roms/time_pilot.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/time_pilot.bin -------------------------------------------------------------------------------- /roms/tutankham.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/tutankham.bin -------------------------------------------------------------------------------- /roms/up_n_down.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/up_n_down.bin -------------------------------------------------------------------------------- /roms/venture.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/venture.bin -------------------------------------------------------------------------------- /roms/video_pinball.bin: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/video_pinball.bin -------------------------------------------------------------------------------- /roms/wizard_of_wor.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/wizard_of_wor.bin -------------------------------------------------------------------------------- /roms/yars_revenge.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/yars_revenge.bin -------------------------------------------------------------------------------- /roms/zaxxon.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShibiHe/Q-Optimality-Tightening/1cce8cf94e3fe026a8d18f7e2f2ed8e709392f08/roms/zaxxon.bin --------------------------------------------------------------------------------