├── README.md
├── __pycache__
│   ├── atari_emulator.cpython-34.pyc
│   ├── atari_environment.cpython-34.pyc
│   ├── dqn_agent.cpython-34.pyc
│   ├── experience_memory.cpython-34.pyc
│   ├── experiment.cpython-34.pyc
│   ├── parallel_dqn_agent.cpython-34.pyc
│   ├── parallel_q_network.cpython-34.pyc
│   ├── q_network.cpython-34.pyc
│   ├── record_stats.cpython-34.pyc
│   └── visuals.cpython-34.pyc
├── atari_emulator.py
├── dqn_agent.py
├── experience_memory.py
├── experiment.py
├── parallel_dqn_agent.py
├── parallel_q_network.py
├── q_network.py
├── record_stats.py
├── run_dqn.py
└── visuals.py

/README.md:
--------------------------------------------------------------------------------

# deep_rl_ale

This repo contains a TensorFlow implementation of [this paper](http://home.uchicago.edu/~arij/journalclub/papers/2015_Mnih_et_al.pdf). It also includes the option to use the [double DQN](http://arxiv.org/pdf/1509.06461v3.pdf) loss function, as well as a parallel version that acts and learns simultaneously to speed up training.

[Watch it play Pong, Breakout, Space Invaders, and Seaquest here](https://youtu.be/gQ9FsAGb148)

The code is still a little messy in some places and will be cleaned up in the future, but there will probably not be any significant updates or changes until mid-May.

## Dependencies/Requirements

1. An NVIDIA GPU with GDDR5 memory, to train in a reasonable amount of time
2. [Python 3](https://www.python.org/)
3. [The Arcade Learning Environment](https://github.com/mgbellemare/Arcade-Learning-Environment) for the emulator framework.
4. [TensorFlow](https://www.tensorflow.org/) for GPU numerical computations and symbolic differentiation.
5. Linux/OSX, because TensorFlow doesn't support Windows.
6. [Matplotlib](http://matplotlib.org/) and [Seaborn](https://stanford.edu/~mwaskom/software/seaborn/) for visualizations.
7. [OpenCV](http://opencv.org/) for image scaling. Might switch to SciPy, since OpenCV was a pain for me to install.
8. Any dependencies of the above software, of course, like NumPy.

## How to run

From the top directory of the repo (the directory with the Python files):

### Training

`$ python3 ./run_dqn.py <game> <agent_type> <agent_name>`

For example:

`$ python3 ./run_dqn.py breakout dqn brick_hunter`

### Watching

`$ python3 ./run_dqn.py <game> <agent_type> <agent_name> --watch`

Where `<agent_name>` is the `<agent_name>` used during training. If you used any non-default settings, make sure to use the same ones when watching as well.

## Running Notes

You can change many hyperparameters/settings by passing optional arguments.
To get a list of arguments:

`$ python3 ./run_dqn.py -h`

By default, ROM files are expected to be in a folder titled 'roms' in the parent directory of the repo. You can pass a different directory as an argument or change the default in run_dqn.py.

Statistics and saved models are also saved in the parent directory of the repo.

The default settings are very similar to those used in the DeepMind Nature paper. There are only a few small differences of which I am aware.

A full training run takes between 3 and 4 days on my NVIDIA GTX 970, depending on whether or not the parallel option is used. Parallel training speeds up training by ~30%, but I'm still testing how different things impact speed.
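As a concrete example of the notes above (the flag names come from the argument parser in `run_dqn.py`; the ROM directory used here is only illustrative):

`$ python3 ./run_dqn.py breakout dqn brick_hunter --rom_path ../my_roms`

`$ python3 ./run_dqn.py breakout dqn brick_hunter --rom_path ../my_roms --watch`

The first command trains the `brick_hunter` agent with ROMs loaded from `../my_roms` instead of the default `../roms`; the second reloads the same agent for watching, repeating the non-default `--rom_path` so the settings match training.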
-------------------------------------------------------------------------------- /atari_emulator.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Class for ale instances to generate experiences and test agents. 3 | Uses DeepMind's preproessing/initialization methods 4 | ''' 5 | 6 | from ale_python_interface import ALEInterface 7 | import cv2 8 | import random 9 | import numpy as np 10 | import sys 11 | 12 | class AtariEmulator: 13 | 14 | def __init__(self, args): 15 | ''' Initialize Atari environment ''' 16 | 17 | # Parameters 18 | self.buffer_length = args.buffer_length 19 | self.screen_dims = args.screen_dims 20 | self.frame_skip = args.frame_skip 21 | self.blend_method = args.blend_method 22 | self.reward_processing = args.reward_processing 23 | self.max_start_wait = args.max_start_wait 24 | self.history_length = args.history_length 25 | self.start_frames_needed = self.buffer_length - 1 + ((args.history_length - 1) * self.frame_skip) 26 | 27 | #Initialize ALE instance 28 | self.ale = ALEInterface() 29 | self.ale.setFloat(b'repeat_action_probability', 0.0) 30 | if args.watch: 31 | self.ale.setBool(b'sound', True) 32 | self.ale.setBool(b'display_screen', True) 33 | self.ale.loadROM(str.encode(args.rom_path + '/' + args.game + '.bin')) 34 | 35 | self.buffer = np.empty((self.buffer_length, 210, 160)) 36 | self.current = 0 37 | self.action_set = self.ale.getMinimalActionSet() 38 | self.lives = self.ale.lives() 39 | 40 | self.reset() 41 | 42 | 43 | def get_possible_actions(self): 44 | ''' Return list of possible actions for game ''' 45 | return self.action_set 46 | 47 | def get_screen(self): 48 | ''' Add screen to frame buffer ''' 49 | self.buffer[self.current] = np.squeeze(self.ale.getScreenGrayscale()) 50 | self.current = (self.current + 1) % self.buffer_length 51 | 52 | 53 | def reset(self): 54 | self.ale.reset_game() 55 | self.lives = self.ale.lives() 56 | 57 | if self.max_start_wait < 0: 58 | print("ERROR: max start wait decreased beyond 0") 59 | sys.exit() 60 | elif self.max_start_wait <= self.start_frames_needed: 61 | wait = 0 62 | else: 63 | wait = random.randint(0, self.max_start_wait - self.start_frames_needed) 64 | for _ in range(wait): 65 | self.ale.act(self.action_set[0]) 66 | 67 | # Fill frame buffer 68 | self.get_screen() 69 | for _ in range(self.buffer_length - 1): 70 | self.ale.act(self.action_set[0]) 71 | self.get_screen() 72 | # get initial_states 73 | state = [(self.preprocess(), 0, 0, False)] 74 | for step in range(self.history_length - 1): 75 | state.append(self.run_step(0)) 76 | 77 | # make sure agent hasn't died yet 78 | if self.isTerminal(): 79 | print("Agent lost during start wait. 
Decreasing max_start_wait by 1") 80 | self.max_start_wait -= 1 81 | return self.reset() 82 | 83 | return state 84 | 85 | 86 | def run_step(self, action): 87 | ''' Apply action to game and return next screen and reward ''' 88 | 89 | raw_reward = 0 90 | for step in range(self.frame_skip): 91 | raw_reward += self.ale.act(self.action_set[action]) 92 | self.get_screen() 93 | 94 | reward = None 95 | if self.reward_processing == 'clip': 96 | reward = np.clip(raw_reward, -1, 1) 97 | else: 98 | reward = raw_reward 99 | 100 | terminal = self.isTerminal() 101 | self.lives = self.ale.lives() 102 | 103 | return (self.preprocess(), action, reward, terminal, raw_reward) 104 | 105 | 106 | 107 | def preprocess(self): 108 | ''' Preprocess frame for agent ''' 109 | 110 | img = None 111 | 112 | if self.blend_method == "max": 113 | img = np.amax(self.buffer, axis=0) 114 | 115 | return cv2.resize(img, self.screen_dims, interpolation=cv2.INTER_LINEAR) 116 | 117 | def isTerminal(self): 118 | return (self.isGameOver() or (self.lives > self.ale.lives())) 119 | 120 | 121 | def isGameOver(self): 122 | return self.ale.game_over() 123 | -------------------------------------------------------------------------------- /dqn_agent.py: -------------------------------------------------------------------------------- 1 | 2 | import random 3 | import numpy as np 4 | 5 | class DQNAgent(): 6 | 7 | def __init__(self, args, q_network, emulator, experience_memory, num_actions, train_stats): 8 | 9 | self.network = q_network 10 | self.emulator = emulator 11 | self.memory = experience_memory 12 | self.train_stats = train_stats 13 | 14 | self.num_actions = num_actions 15 | self.history_length = args.history_length 16 | self.training_frequency = args.training_frequency 17 | self.random_exploration_length = args.random_exploration_length 18 | self.initial_exploration_rate = args.initial_exploration_rate 19 | self.final_exploration_rate = args.final_exploration_rate 20 | self.final_exploration_frame = args.final_exploration_frame 21 | self.test_exploration_rate = args.test_exploration_rate 22 | self.recording_frequency = args.recording_frequency 23 | 24 | self.exploration_rate = self.initial_exploration_rate 25 | self.total_steps = 0 26 | 27 | self.test_state = [] 28 | 29 | 30 | def choose_action(self): 31 | 32 | if random.random() >= self.exploration_rate: 33 | state = self.memory.get_current_state() 34 | q_values = self.network.inference(state) 35 | self.train_stats.add_q_values(q_values) 36 | return np.argmax(q_values) 37 | else: 38 | return random.randrange(self.num_actions) 39 | 40 | 41 | def checkGameOver(self): 42 | if self.emulator.isGameOver(): 43 | initial_state = self.emulator.reset() 44 | for experience in initial_state: 45 | self.memory.add(experience[0], experience[1], experience[2], experience[3]) 46 | self.train_stats.add_game() 47 | 48 | 49 | def run_random_exploration(self): 50 | 51 | for step in range(self.random_exploration_length): 52 | 53 | state, action, reward, terminal, raw_reward = self.emulator.run_step(random.randrange(self.num_actions)) 54 | self.train_stats.add_reward(raw_reward) 55 | self.memory.add(state, action, reward, terminal) 56 | self.checkGameOver() 57 | self.total_steps += 1 58 | if (self.total_steps % self.recording_frequency == 0): 59 | self.train_stats.record(self.total_steps) 60 | 61 | 62 | def run_epoch(self, steps, epoch): 63 | 64 | for step in range(steps): 65 | 66 | state, action, reward, terminal, raw_reward = self.emulator.run_step(self.choose_action()) 67 | self.memory.add(state, action, 
reward, terminal) 68 | self.train_stats.add_reward(raw_reward) 69 | self.checkGameOver() 70 | 71 | # training 72 | if self.total_steps % self.training_frequency == 0: 73 | states, actions, rewards, next_states, terminals = self.memory.get_batch() 74 | loss = self.network.train(states, actions, rewards, next_states, terminals) 75 | self.train_stats.add_loss(loss) 76 | 77 | self.total_steps += 1 78 | 79 | if self.total_steps < self.final_exploration_frame: 80 | self.exploration_rate -= (self.exploration_rate - self.final_exploration_rate) / (self.final_exploration_frame - self.total_steps) 81 | 82 | if self.total_steps % self.recording_frequency == 0: 83 | self.train_stats.record(self.total_steps) 84 | self.network.record_params(self.total_steps) 85 | 86 | def test_step(self, observation): 87 | 88 | if len(self.test_state) < self.history_length: 89 | self.test_state.append(observation) 90 | 91 | # choose action 92 | q_values = None 93 | action = None 94 | if random.random() >= self.test_exploration_rate: 95 | state = np.expand_dims(np.transpose(self.test_state, [1,2,0]), axis=0) 96 | q_values = self.network.inference(state) 97 | action = np.argmax(q_values) 98 | else: 99 | action = random.randrange(self.num_actions) 100 | 101 | self.test_state.pop(0) 102 | return [action, q_values] 103 | 104 | 105 | def save_model(self, epoch): 106 | self.network.save_model(epoch) -------------------------------------------------------------------------------- /experience_memory.py: -------------------------------------------------------------------------------- 1 | ''' 2 | ExperienceMemory is a class for experience replay. 3 | It stores experience samples and samples minibatches for training. 4 | ''' 5 | 6 | import numpy as np 7 | import random 8 | 9 | 10 | class ExperienceMemory: 11 | 12 | def __init__(self, args, num_actions): 13 | ''' Initialize emtpy experience dataset. ''' 14 | 15 | # params 16 | self.capacity = args.memory_capacity 17 | self.history_length = args.history_length 18 | self.batch_size = args.batch_size 19 | self.num_actions = num_actions 20 | self.screen_dims = args.screen_dims 21 | 22 | # initialize dataset 23 | self.observations = np.empty((self.capacity, self.screen_dims[0], self.screen_dims[1]), dtype=np.uint8) 24 | self.actions = np.empty(self.capacity, dtype=np.uint8) 25 | self.rewards = np.empty(self.capacity, dtype=np.integer) 26 | self.terminals = np.empty(self.capacity, dtype=np.bool) 27 | 28 | self.size = 0 29 | self.current = 0 30 | 31 | 32 | def add(self, obs, act, reward, terminal): 33 | ''' Add experience to dataset. 34 | 35 | Args: 36 | obs: single observation frame 37 | act: action taken 38 | reward: reward 39 | terminal: is this a terminal state? 
40 | ''' 41 | 42 | self.observations[self.current] = obs 43 | self.actions[self.current] = act 44 | self.rewards[self.current] = reward 45 | self.terminals[self.current] = terminal 46 | 47 | self.current = (self.current + 1) % self.capacity 48 | if self.size == self.capacity - 1: 49 | self.size = self.capacity 50 | else: 51 | self.size = max(self.size, self.current) 52 | 53 | 54 | def get_state(self, indices): 55 | ''' Return the observation sequence that ends at index 56 | 57 | Args: 58 | indices: list of last observations in sequences 59 | ''' 60 | state = np.empty((len(indices), self.screen_dims[0], self.screen_dims[1], self.history_length)) 61 | count = 0 62 | 63 | for index in indices: 64 | frame_slice = np.arange(index - self.history_length + 1, (index + 1)) 65 | state[count] = np.transpose(np.take(self.observations, frame_slice, axis=0), [1,2,0]) 66 | count += 1 67 | return state 68 | 69 | 70 | def get_current_state(self): 71 | ''' Return most recent observation sequence ''' 72 | 73 | return self.get_state([(self.current-1)%self.capacity]) 74 | 75 | 76 | def get_batch(self): 77 | ''' Sample minibatch of experiences for training ''' 78 | 79 | samples = [] # indices of the end of each sample 80 | 81 | while len(samples) < self.batch_size: 82 | 83 | if self.size < self.capacity: # make this better 84 | index = random.randrange(self.history_length, self.current) 85 | else: 86 | # make sure state from index doesn't overlap with current's gap 87 | index = (self.current + random.randrange(self.history_length, self.size-1)) % self.capacity 88 | # make sure no terminal observations are in the first state 89 | if self.terminals[(index - self.history_length):index].any(): 90 | continue 91 | else: 92 | samples.append(index) 93 | # endwhile 94 | samples = np.asarray(samples) 95 | 96 | # create batch 97 | o1 = self.get_state((samples - 1) % self.capacity) 98 | a = np.eye(self.num_actions)[self.actions[samples]] # convert actions to one-hot matrix 99 | r = self.rewards[samples] 100 | o2 = self.get_state(samples) 101 | t = self.terminals[samples].astype(int) 102 | return [o1, a, r, o2, t] -------------------------------------------------------------------------------- /experiment.py: -------------------------------------------------------------------------------- 1 | from visuals import Visuals 2 | 3 | def evaluate_agent(args, agent, test_emulator, test_stats): 4 | step = 0 5 | games = 0 6 | reward = 0.0 7 | reset = test_emulator.reset() 8 | agent.test_state = list(next(zip(*reset))) 9 | screen = test_emulator.preprocess() 10 | visuals = None 11 | if args.watch: 12 | visuals = Visuals(test_emulator.get_possible_actions()) 13 | 14 | while (step < args.test_steps) and (games < args.test_episodes): 15 | while not test_emulator.isGameOver() and step < args.test_steps_hardcap: 16 | action, q_values = agent.test_step(screen) 17 | results = test_emulator.run_step(action) 18 | screen = results[0] 19 | reward += results[4] 20 | 21 | # record stats 22 | if not (test_stats is None): 23 | test_stats.add_reward(results[4]) 24 | if not (q_values is None): 25 | test_stats.add_q_values(q_values) 26 | # endif 27 | #endif 28 | 29 | # update visuals 30 | if args.watch and (not (q_values is None)): 31 | visuals.update(q_values) 32 | 33 | step +=1 34 | # endwhile 35 | games += 1 36 | if not (test_stats is None): 37 | test_stats.add_game() 38 | reset = test_emulator.reset() 39 | agent.test_state = list(next(zip(*reset))) 40 | 41 | return reward / games 42 | 43 | 44 | 45 | def run_experiment(args, agent, test_emulator, 
test_stats): 46 | 47 | agent.run_random_exploration() 48 | 49 | for epoch in range(1, args.epochs + 1): 50 | 51 | if epoch == 1: 52 | agent.run_epoch(args.epoch_length - agent.random_exploration_length, epoch) 53 | else: 54 | agent.run_epoch(args.epoch_length, epoch) 55 | 56 | results = evaluate_agent(args, agent, test_emulator, test_stats) 57 | print("Score for epoch {0}: {1}".format(epoch, results)) 58 | steps = 0 59 | if args.parallel: 60 | steps = agent.random_exploration_length + (agent.train_steps * args.training_frequency) 61 | else: 62 | steps = agent.total_steps 63 | 64 | test_stats.record(steps) 65 | if results >= args.saving_threshold: 66 | agent.save_model(epoch) 67 | -------------------------------------------------------------------------------- /parallel_dqn_agent.py: -------------------------------------------------------------------------------- 1 | import random 2 | import numpy as np 3 | import threading 4 | 5 | class ParallelDQNAgent(): 6 | 7 | def __init__(self, args, q_network, emulator, experience_memory, num_actions, train_stats): 8 | 9 | self.network = q_network 10 | self.emulator = emulator 11 | self.memory = experience_memory 12 | self.train_stats = train_stats 13 | 14 | self.num_actions = num_actions 15 | self.history_length = args.history_length 16 | self.training_frequency = args.training_frequency 17 | self.random_exploration_length = args.random_exploration_length 18 | self.initial_exploration_rate = args.initial_exploration_rate 19 | self.final_exploration_rate = args.final_exploration_rate 20 | self.final_exploration_frame = args.final_exploration_frame 21 | self.test_exploration_rate = args.test_exploration_rate 22 | self.recording_frequency = args.recording_frequency 23 | 24 | self.exploration_rate = self.initial_exploration_rate 25 | self.total_steps = 0 26 | self.train_steps = 0 27 | self.current_act_steps = 0 28 | self.current_train_steps = 0 29 | 30 | self.test_state = [] 31 | self.epoch_over = False 32 | 33 | 34 | def choose_action(self): 35 | 36 | if random.random() >= self.exploration_rate: 37 | state = self.memory.get_current_state() 38 | q_values = self.network.inference(state) 39 | self.train_stats.add_q_values(q_values) 40 | return np.argmax(q_values) 41 | else: 42 | return random.randrange(self.num_actions) 43 | 44 | 45 | def checkGameOver(self): 46 | if self.emulator.isGameOver(): 47 | initial_state = self.emulator.reset() 48 | for experience in initial_state: 49 | self.memory.add(experience[0], experience[1], experience[2], experience[3]) 50 | self.train_stats.add_game() 51 | 52 | 53 | def run_random_exploration(self): 54 | 55 | for step in range(self.random_exploration_length): 56 | 57 | state, action, reward, terminal, raw_reward = self.emulator.run_step(random.randrange(self.num_actions)) 58 | self.train_stats.add_reward(raw_reward) 59 | self.memory.add(state, action, reward, terminal) 60 | self.checkGameOver() 61 | self.total_steps += 1 62 | self.current_act_steps += 1 63 | if (self.total_steps % self.recording_frequency == 0): 64 | self.train_stats.record(self.total_steps) 65 | 66 | 67 | def train(self, steps): 68 | 69 | for step in range(steps): 70 | states, actions, rewards, next_states, terminals = self.memory.get_batch() 71 | loss = self.network.train(states, actions, rewards, next_states, terminals) 72 | self.train_stats.add_loss(loss) 73 | self.train_steps += 1 74 | self.current_train_steps += 1 75 | 76 | if self.train_steps < (self.final_exploration_frame / self.training_frequency): 77 | self.exploration_rate -= 
(self.exploration_rate - self.final_exploration_rate) / ((self.final_exploration_frame / self.training_frequency) - self.train_steps) 78 | 79 | if ((self.train_steps * self.training_frequency) % self.recording_frequency == 0) and not (step == steps - 1): 80 | self.train_stats.record(self.random_exploration_length + (self.train_steps * self.training_frequency)) 81 | self.network.record_params(self.random_exploration_length + (self.train_steps * self.training_frequency)) 82 | 83 | self.epoch_over = True 84 | 85 | 86 | def run_epoch(self, steps, epoch): 87 | 88 | self.epoch_over = False 89 | threading.Thread(target=self.train, args=(int(steps/self.training_frequency),)).start() 90 | 91 | while not self.epoch_over: 92 | state, action, reward, terminal, raw_reward = self.emulator.run_step(self.choose_action()) 93 | self.memory.add(state, action, reward, terminal) 94 | self.train_stats.add_reward(raw_reward) 95 | self.checkGameOver() 96 | 97 | self.total_steps += 1 98 | self.current_act_steps += 1 99 | 100 | print("act_steps: {0}".format(self.current_act_steps)) 101 | print("learn_steps: {0}".format(self.current_train_steps)) 102 | self.train_stats.record(self.random_exploration_length + (self.train_steps * self.training_frequency)) 103 | self.network.record_params(self.random_exploration_length + (self.train_steps * self.training_frequency)) 104 | self.network.save_model(epoch) 105 | self.current_act_steps = 0 106 | self.current_train_steps = 0 107 | 108 | 109 | def test_step(self, observation): 110 | 111 | if len(self.test_state) < self.history_length: 112 | self.test_state.append(observation) 113 | 114 | # choose action 115 | q_values = None 116 | action = None 117 | if random.random() >= self.test_exploration_rate: 118 | state = np.expand_dims(np.transpose(self.test_state, [1,2,0]), axis=0) 119 | q_values = self.network.gpu_inference(state) 120 | action = np.argmax(q_values) 121 | else: 122 | action = random.randrange(self.num_actions) 123 | 124 | self.test_state.pop(0) 125 | return [action, q_values] 126 | 127 | 128 | def save_model(self, epoch): 129 | self.network.save_model(epoch) -------------------------------------------------------------------------------- /parallel_q_network.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import os 3 | import numpy as np 4 | import math 5 | 6 | 7 | class ParallelQNetwork(): 8 | 9 | def __init__(self, args, num_actions): 10 | ''' Build tensorflow graph for deep q network ''' 11 | 12 | self.discount_factor = args.discount_factor 13 | self.target_update_frequency = args.target_update_frequency 14 | self.total_updates = 0 15 | self.path = '../saved_models/' + args.game + '/' + args.agent_type + '/' + args.agent_name 16 | if not os.path.exists(self.path): 17 | os.makedirs(self.path) 18 | self.name = args.agent_name 19 | 20 | # input placeholders 21 | with tf.device('/cpu:0'): 22 | self.observation = tf.placeholder(tf.float32, shape=[None, args.screen_dims[0], args.screen_dims[1], args.history_length], name="observation") 23 | self.actions = tf.placeholder(tf.float32, shape=[None, num_actions], name="actions") # one-hot matrix because tf.gather() doesn't support multidimensional indexing yet 24 | self.rewards = tf.placeholder(tf.float32, shape=[None], name="rewards") 25 | self.next_observation = tf.placeholder(tf.float32, shape=[None, args.screen_dims[0], args.screen_dims[1], args.history_length], name="next_observation") 26 | self.terminals = tf.placeholder(tf.float32, shape=[None], 
name="terminals") 27 | 28 | num_conv_layers = len(args.conv_kernel_shapes) 29 | assert(num_conv_layers == len(args.conv_strides)) 30 | num_dense_layers = len(args.dense_layer_shapes) 31 | 32 | last_cpu_layer = None 33 | last_gpu_layer = None 34 | last_target_layer = None 35 | self.update_target = [] 36 | self.policy_network_params = [] 37 | self.param_names = [] 38 | 39 | # initialize convolutional layers 40 | for layer in range(num_conv_layers): 41 | cpu_input = None 42 | gpu_input = None 43 | target_input = None 44 | if layer == 0: 45 | with tf.device('/cpu:0'): 46 | cpu_input = self.observation / 255.0 47 | gpu_input = self.observation / 255.0 48 | target_input = self.next_observation / 255.0 49 | else: 50 | cpu_input = last_cpu_layer 51 | gpu_input = last_gpu_layer 52 | target_input = last_target_layer 53 | 54 | last_layers = self.conv_relu(cpu_input, gpu_input, target_input, 55 | args.conv_kernel_shapes[layer], args.conv_strides[layer], layer) 56 | last_cpu_layer = last_layers[0] 57 | last_gpu_layer = last_layers[1] 58 | last_target_layer = last_layers[2] 59 | 60 | # initialize fully-connected layers 61 | for layer in range(num_dense_layers): 62 | cpu_input = None 63 | gpu_input = None 64 | target_input = None 65 | if layer == 0: 66 | input_size = args.dense_layer_shapes[0][0] 67 | cpu_input = tf.reshape(last_cpu_layer, shape=[-1, input_size]) 68 | gpu_input = tf.reshape(last_gpu_layer, shape=[-1, input_size]) 69 | target_input = tf.reshape(last_target_layer, shape=[-1, input_size]) 70 | else: 71 | cpu_input = last_cpu_layer 72 | gpu_input = last_gpu_layer 73 | target_input = last_target_layer 74 | 75 | last_layers = self.dense_relu(cpu_input, gpu_input, target_input, args.dense_layer_shapes[layer], layer) 76 | last_cpu_layer = last_layers[0] 77 | last_gpu_layer = last_layers[1] 78 | last_target_layer = last_layers[2] 79 | 80 | 81 | # initialize q_layer 82 | last_layers = self.dense_linear(last_cpu_layer, last_gpu_layer, last_target_layer, [args.dense_layer_shapes[-1][-1], num_actions]) 83 | self.cpu_q_layer = last_layers[0] 84 | self.gpu_q_layer = last_layers[1] 85 | self.target_q_layer = last_layers[2] 86 | 87 | self.loss = self.build_loss(args.error_clipping, num_actions, args.double_dqn) 88 | 89 | if (args.optimizer == 'rmsprop') and (args.gradient_clip <= 0): 90 | self.train_op = tf.train.RMSPropOptimizer( 91 | args.learning_rate, decay=args.rmsprop_decay, momentum=0.0, epsilon=args.rmsprop_epsilon).minimize(self.loss) 92 | elif (args.optimizer == 'graves_rmsprop') or (args.optimizer == 'rmsprop' and args.gradient_clip > 0): 93 | self.train_op = self.build_rmsprop_optimizer(args.learning_rate, args.rmsprop_decay, args.rmsprop_epsilon, args.gradient_clip, args.optimizer) 94 | 95 | self.saver = tf.train.Saver(self.policy_network_params) 96 | 97 | if not args.watch: 98 | param_hists = [tf.histogram_summary(name, param) for name, param in zip(self.param_names, self.policy_network_params)] 99 | self.param_summaries = tf.merge_summary(param_hists) 100 | 101 | # start tf session 102 | gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.6) # avoid using all vram for GTX 970 103 | self.sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)) 104 | 105 | if args.watch: 106 | print("Loading Saved Network...") 107 | load_path = tf.train.latest_checkpoint(self.path) 108 | self.saver.restore(self.sess, load_path) 109 | print("Network Loaded") 110 | else: 111 | self.sess.run(tf.initialize_all_variables()) 112 | print("Network Initialized") 113 | self.summary_writer = 
tf.train.SummaryWriter('../records/' + args.game + '/' + args.agent_type + '/' + args.agent_name + '/params', self.sess.graph) 114 | 115 | 116 | def conv_relu(self, cpu_input, gpu_input, target_input, kernel_shape, stride, layer_num): 117 | ''' Build a convolutional layer 118 | 119 | Args: 120 | input_layer: input to convolutional layer - must be 3d 121 | target_input: input to layer of target network - must also be 3d 122 | kernel_shape: tuple for filter shape: (filter_height, filter_width, in_channels, out_channels) 123 | stride: tuple for stride: (1, vert_stride. horiz_stride, 1) 124 | ''' 125 | name = 'conv' + str(layer_num + 1) 126 | with tf.variable_scope(name): 127 | weights = None 128 | biases = None 129 | cpu_activation = None 130 | gpu_activation = None 131 | target_activation = None 132 | 133 | # weights = tf.Variable(tf.truncated_normal(kernel_shape, stddev=0.01), name=(name + "_weights")) 134 | weights = self.get_weights(kernel_shape, name) 135 | # biases = tf.Variable(tf.fill([kernel_shape[-1]], 0.1), name=(name + "_biases")) 136 | biases = self.get_biases(kernel_shape, name) 137 | with tf.device('/cpu:0'): 138 | 139 | cpu_activation = tf.nn.relu(tf.nn.conv2d(cpu_input, weights, stride, 'VALID') + biases) 140 | 141 | with tf.device('/gpu:0'): 142 | gpu_activation = tf.nn.relu(tf.nn.conv2d(gpu_input, weights, stride, 'VALID') + biases) 143 | 144 | target_weights = tf.Variable(weights.initialized_value(), trainable=False, name=("target_" + name + "_weights")) 145 | target_biases = tf.Variable(biases.initialized_value(), trainable=False, name=("target_" + name + "_biases")) 146 | 147 | target_activation = tf.nn.relu(tf.nn.conv2d(target_input, target_weights, stride, 'VALID') + target_biases) 148 | 149 | self.update_target.append(target_weights.assign(weights)) 150 | self.update_target.append(target_biases.assign(biases)) 151 | 152 | self.policy_network_params.append(weights) 153 | self.policy_network_params.append(biases) 154 | self.param_names.append(name + "_weights") 155 | self.param_names.append(name + "_biases") 156 | 157 | return [cpu_activation, gpu_activation, target_activation] 158 | 159 | 160 | def dense_relu(self, cpu_input, gpu_input, target_input, shape, layer_num): 161 | ''' Build a fully-connected relu layer 162 | 163 | Args: 164 | input_layer: input to dense layer 165 | target_input: input to layer of target network 166 | shape: tuple for weight shape (num_input_nodes, num_layer_nodes) 167 | ''' 168 | name = 'dense' + str(layer_num + 1) 169 | with tf.variable_scope(name): 170 | cpu_activation = None 171 | gpu_activation = None 172 | target_activation = None 173 | 174 | # weights = tf.Variable(tf.truncated_normal(shape, stddev=0.01), name=(name + "_weights")) 175 | weights = self.get_weights(shape, name) 176 | # biases = tf.Variable(tf.fill([shape[-1]], 0.1), name=(name + "_biases")) 177 | biases = self.get_biases(shape, name) 178 | with tf.device('/cpu:0'): 179 | 180 | cpu_activation = tf.nn.relu(tf.matmul(cpu_input, weights) + biases) 181 | 182 | with tf.device('/gpu:0'): 183 | gpu_activation = tf.nn.relu(tf.matmul(gpu_input, weights) + biases) 184 | 185 | target_weights = tf.Variable(weights.initialized_value(), trainable=False, name=("target_" + name + "_weights")) 186 | target_biases = tf.Variable(biases.initialized_value(), trainable=False, name=("target_" + name + "_biases")) 187 | 188 | target_activation = tf.nn.relu(tf.matmul(target_input, target_weights) + target_biases) 189 | 190 | self.update_target.append(target_weights.assign(weights)) 191 | 
self.update_target.append(target_biases.assign(biases)) 192 | 193 | self.policy_network_params.append(weights) 194 | self.policy_network_params.append(biases) 195 | self.param_names.append(name + "_weights") 196 | self.param_names.append(name + "_biases") 197 | 198 | return [cpu_activation, gpu_activation, target_activation] 199 | 200 | 201 | def dense_linear(self, cpu_input, gpu_input, target_input, shape): 202 | ''' Build the fully-connected linear output layer 203 | 204 | Args: 205 | input_layer: last hidden layer 206 | target_input: last hidden layer of target network 207 | shape: tuple for weight shape (num_input_nodes, num_actions) 208 | ''' 209 | name = 'q_layer' 210 | with tf.variable_scope(name): 211 | cpu_activation = None 212 | gpu_activation = None 213 | target_activation = None 214 | 215 | # weights = tf.Variable(tf.truncated_normal(shape, stddev=0.01), name=(name + "_weights")) 216 | weights = self.get_weights(shape, name) 217 | # biases = tf.Variable(tf.fill([shape[-1]], 0.1), name=(name + "_biases")) 218 | biases = self.get_biases(shape, name) 219 | with tf.device('/cpu:0'): 220 | 221 | cpu_activation = tf.matmul(cpu_input, weights) + biases 222 | 223 | with tf.device('/gpu:0'): 224 | gpu_activation = tf.matmul(gpu_input, weights) + biases 225 | 226 | target_weights = tf.Variable(weights.initialized_value(), trainable=False, name=("target_" + name + "_weights")) 227 | target_biases = tf.Variable(biases.initialized_value(), trainable=False, name=("target_" + name + "_biases")) 228 | 229 | target_activation = tf.matmul(target_input, target_weights) + target_biases 230 | 231 | self.update_target.append(target_weights.assign(weights)) 232 | self.update_target.append(target_biases.assign(biases)) 233 | 234 | self.policy_network_params.append(weights) 235 | self.policy_network_params.append(biases) 236 | self.param_names.append(name + "_weights") 237 | self.param_names.append(name + "_biases") 238 | 239 | return [cpu_activation, gpu_activation, target_activation] 240 | 241 | 242 | 243 | def inference(self, obs): 244 | ''' Get state-action value predictions for an observation 245 | 246 | Args: 247 | observation: the observation 248 | ''' 249 | 250 | return self.sess.run(self.cpu_q_layer, feed_dict={self.observation:obs}) 251 | 252 | def gpu_inference(self, obs): 253 | 254 | return self.sess.run(self.gpu_q_layer, feed_dict={self.observation:obs}) 255 | 256 | 257 | def build_loss(self, error_clip, num_actions, double_dqn): 258 | ''' build loss graph ''' 259 | with tf.name_scope("loss"): 260 | 261 | predictions = tf.reduce_sum(tf.mul(self.gpu_q_layer, self.actions), 1) 262 | 263 | max_action_values = None 264 | if double_dqn: # Double Q-Learning: 265 | max_actions = tf.to_int32(tf.argmax(self.gpu_q_layer, 1)) 266 | # tf.gather doesn't support multidimensional indexing yet, so we flatten output activations for indexing 267 | indices = tf.range(0, tf.size(max_actions) * num_actions, num_actions) + max_actions 268 | max_action_values = tf.gather(tf.reshape(self.target_q_layer, shape=[-1]), indices) 269 | else: 270 | max_action_values = tf.reduce_max(self.target_q_layer, 1) 271 | 272 | targets = tf.stop_gradient(self.rewards + (self.discount_factor * max_action_values * (1 - self.terminals))) 273 | 274 | difference = tf.abs(predictions - targets) 275 | 276 | if error_clip >= 0: 277 | quadratic_part = tf.clip_by_value(difference, 0.0, error_clip) 278 | linear_part = difference - quadratic_part 279 | errors = (0.5 * tf.square(quadratic_part)) + (error_clip * linear_part) 280 | else: 281 | 
errors = (0.5 * tf.square(difference)) 282 | 283 | return tf.reduce_sum(errors) 284 | 285 | 286 | def train(self, o1, a, r, o2, t): 287 | ''' train network on batch of experiences 288 | 289 | Args: 290 | o1: first observations 291 | a: actions taken 292 | r: rewards received 293 | o2: succeeding observations 294 | ''' 295 | 296 | loss = self.sess.run([self.train_op, self.loss], 297 | feed_dict={self.observation:o1, self.actions:a, self.rewards:r, self.next_observation:o2, self.terminals:t})[1] 298 | 299 | self.total_updates += 1 300 | if self.total_updates % self.target_update_frequency == 0: 301 | self.sess.run(self.update_target) 302 | 303 | return loss 304 | 305 | 306 | def save_model(self, epoch): 307 | 308 | self.saver.save(self.sess, self.path + '/' + self.name + '.ckpt', global_step=epoch) 309 | 310 | 311 | def build_rmsprop_optimizer(self, learning_rate, rmsprop_decay, rmsprop_constant, gradient_clip, version): 312 | 313 | with tf.name_scope('rmsprop'): 314 | optimizer = None 315 | if version == 'rmsprop': 316 | optimizer = tf.train.RMSPropOptimizer(learning_rate, decay=rmsprop_decay, momentum=0.0, epsilon=rmsprop_constant) 317 | elif version == 'graves_rmsprop': 318 | optimizer = tf.train.GradientDescentOptimizer(learning_rate) 319 | 320 | grads_and_vars = optimizer.compute_gradients(self.loss) 321 | grads = [gv[0] for gv in grads_and_vars] 322 | params = [gv[1] for gv in grads_and_vars] 323 | 324 | if gradient_clip > 0: 325 | grads = tf.clip_by_global_norm(grads, gradient_clip)[0] 326 | 327 | if version == 'rmsprop': 328 | return optimizer.apply_gradients(zip(grads, params)) 329 | elif version == 'graves_rmsprop': 330 | square_grads = [tf.square(grad) for grad in grads] 331 | 332 | avg_grads = [tf.Variable(tf.zeros(var.get_shape())) for var in params] 333 | avg_square_grads = [tf.Variable(tf.zeros(var.get_shape())) for var in params] 334 | 335 | update_avg_grads = [grad_pair[0].assign((rmsprop_decay * grad_pair[0]) + ((1 - rmsprop_decay) * grad_pair[1])) 336 | for grad_pair in zip(avg_grads, grads)] 337 | update_avg_square_grads = [grad_pair[0].assign((rmsprop_decay * grad_pair[0]) + ((1 - rmsprop_decay) * tf.square(grad_pair[1]))) 338 | for grad_pair in zip(avg_square_grads, grads)] 339 | avg_grad_updates = update_avg_grads + update_avg_square_grads 340 | 341 | rms = [tf.sqrt(avg_grad_pair[1] - tf.square(avg_grad_pair[0]) + rmsprop_constant) 342 | for avg_grad_pair in zip(avg_grads, avg_square_grads)] 343 | 344 | 345 | rms_updates = [grad_rms_pair[0] / grad_rms_pair[1] for grad_rms_pair in zip(grads, rms)] 346 | train = optimizer.apply_gradients(zip(rms_updates, params)) 347 | 348 | return tf.group(train, tf.group(*avg_grad_updates)) 349 | 350 | 351 | def get_weights(self, shape, name): 352 | fan_in = np.prod(shape[0:-1]) 353 | std = 1 / math.sqrt(fan_in) 354 | return tf.Variable(tf.random_uniform(shape, minval=(-std), maxval=std), name=(name + "_weights")) 355 | 356 | def get_biases(self, shape, name): 357 | fan_in = np.prod(shape[0:-1]) 358 | std = 1 / math.sqrt(fan_in) 359 | return tf.Variable(tf.random_uniform([shape[-1]], minval=(-std), maxval=std), name=(name + "_biases")) 360 | 361 | def record_params(self, step): 362 | with tf.device('/cpu:0'): 363 | summary_string = self.sess.run(self.param_summaries) 364 | self.summary_writer.add_summary(summary_string, step) -------------------------------------------------------------------------------- /q_network.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | 
import os 3 | import numpy as np 4 | import math 5 | 6 | 7 | class QNetwork(): 8 | 9 | def __init__(self, args, num_actions): 10 | ''' Build tensorflow graph for deep q network ''' 11 | 12 | print("Initializing Q-Network") 13 | 14 | self.discount_factor = args.discount_factor 15 | self.target_update_frequency = args.target_update_frequency 16 | self.total_updates = 0 17 | self.path = '../saved_models/' + args.game + '/' + args.agent_type + '/' + args.agent_name 18 | if not os.path.exists(self.path): 19 | os.makedirs(self.path) 20 | self.name = args.agent_name 21 | 22 | # input placeholders 23 | self.observation = tf.placeholder(tf.float32, shape=[None, args.screen_dims[0], args.screen_dims[1], args.history_length], name="observation") 24 | self.actions = tf.placeholder(tf.float32, shape=[None, num_actions], name="actions") # one-hot matrix because tf.gather() doesn't support multidimensional indexing yet 25 | self.rewards = tf.placeholder(tf.float32, shape=[None], name="rewards") 26 | self.next_observation = tf.placeholder(tf.float32, shape=[None, args.screen_dims[0], args.screen_dims[1], args.history_length], name="next_observation") 27 | self.terminals = tf.placeholder(tf.float32, shape=[None], name="terminals") 28 | self.normalized_observation = self.observation / 255.0 29 | self.normalized_next_observation = self.next_observation / 255.0 30 | 31 | num_conv_layers = len(args.conv_kernel_shapes) 32 | assert(num_conv_layers == len(args.conv_strides)) 33 | num_dense_layers = len(args.dense_layer_shapes) 34 | 35 | last_policy_layer = None 36 | last_target_layer = None 37 | self.update_target = [] 38 | self.policy_network_params = [] 39 | self.param_names = [] 40 | 41 | # initialize convolutional layers 42 | for layer in range(num_conv_layers): 43 | policy_input = None 44 | target_input = None 45 | if layer == 0: 46 | policy_input = self.normalized_observation 47 | target_input = self.normalized_next_observation 48 | else: 49 | policy_input = last_policy_layer 50 | target_input = last_target_layer 51 | 52 | last_layers = self.conv_relu(policy_input, target_input, 53 | args.conv_kernel_shapes[layer], args.conv_strides[layer], layer) 54 | last_policy_layer = last_layers[0] 55 | last_target_layer = last_layers[1] 56 | 57 | # initialize fully-connected layers 58 | for layer in range(num_dense_layers): 59 | policy_input = None 60 | target_input = None 61 | if layer == 0: 62 | input_size = args.dense_layer_shapes[0][0] 63 | policy_input = tf.reshape(last_policy_layer, shape=[-1, input_size]) 64 | target_input = tf.reshape(last_target_layer, shape=[-1, input_size]) 65 | else: 66 | policy_input = last_policy_layer 67 | target_input = last_target_layer 68 | 69 | last_layers = self.dense_relu(policy_input, target_input, args.dense_layer_shapes[layer], layer) 70 | last_policy_layer = last_layers[0] 71 | last_target_layer = last_layers[1] 72 | 73 | 74 | # initialize q_layer 75 | last_layers = self.dense_linear( 76 | last_policy_layer, last_target_layer, [args.dense_layer_shapes[-1][-1], num_actions]) 77 | self.policy_q_layer = last_layers[0] 78 | self.target_q_layer = last_layers[1] 79 | 80 | self.loss = self.build_loss(args.error_clipping, num_actions, args.double_dqn) 81 | 82 | if (args.optimizer == 'rmsprop') and (args.gradient_clip <= 0): 83 | self.train_op = tf.train.RMSPropOptimizer( 84 | args.learning_rate, decay=args.rmsprop_decay, momentum=0.0, epsilon=args.rmsprop_epsilon).minimize(self.loss) 85 | elif (args.optimizer == 'graves_rmsprop') or (args.optimizer == 'rmsprop' and args.gradient_clip 
> 0): 86 | self.train_op = self.build_rmsprop_optimizer(args.learning_rate, args.rmsprop_decay, args.rmsprop_epsilon, args.gradient_clip, args.optimizer) 87 | 88 | self.saver = tf.train.Saver(self.policy_network_params) 89 | 90 | if not args.watch: 91 | param_hists = [tf.histogram_summary(name, param) for name, param in zip(self.param_names, self.policy_network_params)] 92 | self.param_summaries = tf.merge_summary(param_hists) 93 | 94 | # start tf session 95 | gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.33333) # avoid using all vram for GTX 970 96 | self.sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)) 97 | 98 | if args.watch: 99 | print("Loading Saved Network...") 100 | load_path = tf.train.latest_checkpoint(self.path) 101 | self.saver.restore(self.sess, load_path) 102 | print("Network Loaded") 103 | else: 104 | self.sess.run(tf.initialize_all_variables()) 105 | print("Network Initialized") 106 | self.summary_writer = tf.train.SummaryWriter('../records/' + args.game + '/' + args.agent_type + '/' + args.agent_name + '/params', self.sess.graph) 107 | 108 | 109 | def conv_relu(self, policy_input, target_input, kernel_shape, stride, layer_num): 110 | ''' Build a convolutional layer 111 | 112 | Args: 113 | input_layer: input to convolutional layer - must be 4d 114 | target_input: input to layer of target network - must also be 4d 115 | kernel_shape: tuple for filter shape: (filter_height, filter_width, in_channels, out_channels) 116 | stride: tuple for stride: (1, vert_stride. horiz_stride, 1) 117 | ''' 118 | name = 'conv' + str(layer_num + 1) 119 | with tf.variable_scope(name): 120 | 121 | # weights = tf.Variable(tf.truncated_normal(kernel_shape, stddev=0.01), name=(name + "_weights")) 122 | weights = self.get_weights(kernel_shape, name) 123 | # biases = tf.Variable(tf.fill([kernel_shape[-1]], 0.1), name=(name + "_biases")) 124 | biases = self.get_biases(kernel_shape, name) 125 | 126 | activation = tf.nn.relu(tf.nn.conv2d(policy_input, weights, stride, 'VALID') + biases) 127 | 128 | target_weights = tf.Variable(weights.initialized_value(), trainable=False, name=("target_" + name + "_weights")) 129 | target_biases = tf.Variable(biases.initialized_value(), trainable=False, name=("target_" + name + "_biases")) 130 | 131 | target_activation = tf.nn.relu(tf.nn.conv2d(target_input, target_weights, stride, 'VALID') + target_biases) 132 | 133 | self.update_target.append(target_weights.assign(weights)) 134 | self.update_target.append(target_biases.assign(biases)) 135 | 136 | self.policy_network_params.append(weights) 137 | self.policy_network_params.append(biases) 138 | self.param_names.append(name + "_weights") 139 | self.param_names.append(name + "_biases") 140 | 141 | return [activation, target_activation] 142 | 143 | 144 | def dense_relu(self, policy_input, target_input, shape, layer_num): 145 | ''' Build a fully-connected relu layer 146 | 147 | Args: 148 | input_layer: input to dense layer 149 | target_input: input to layer of target network 150 | shape: tuple for weight shape (num_input_nodes, num_layer_nodes) 151 | ''' 152 | name = 'dense' + str(layer_num + 1) 153 | with tf.variable_scope(name): 154 | 155 | # weights = tf.Variable(tf.truncated_normal(shape, stddev=0.01), name=(name + "_weights")) 156 | weights = self.get_weights(shape, name) 157 | # biases = tf.Variable(tf.fill([shape[-1]], 0.1), name=(name + "_biases")) 158 | biases = self.get_biases(shape, name) 159 | 160 | activation = tf.nn.relu(tf.matmul(policy_input, weights) + biases) 161 | 162 | 
target_weights = tf.Variable(weights.initialized_value(), trainable=False, name=("target_" + name + "_weights")) 163 | target_biases = tf.Variable(biases.initialized_value(), trainable=False, name=("target_" + name + "_biases")) 164 | 165 | target_activation = tf.nn.relu(tf.matmul(target_input, target_weights) + target_biases) 166 | 167 | self.update_target.append(target_weights.assign(weights)) 168 | self.update_target.append(target_biases.assign(biases)) 169 | 170 | self.policy_network_params.append(weights) 171 | self.policy_network_params.append(biases) 172 | self.param_names.append(name + "_weights") 173 | self.param_names.append(name + "_biases") 174 | 175 | return [activation, target_activation] 176 | 177 | 178 | def dense_linear(self, policy_input, target_input, shape): 179 | ''' Build the fully-connected linear output layer 180 | 181 | Args: 182 | input_layer: last hidden layer 183 | target_input: last hidden layer of target network 184 | shape: tuple for weight shape (num_input_nodes, num_actions) 185 | ''' 186 | name = 'q_layer' 187 | with tf.variable_scope(name): 188 | 189 | # weights = tf.Variable(tf.truncated_normal(shape, stddev=0.01), name=(name + "_weights")) 190 | weights = self.get_weights(shape, name) 191 | # biases = tf.Variable(tf.fill([shape[-1]], 0.1), name=(name + "_biases")) 192 | biases = self.get_biases(shape, name) 193 | 194 | 195 | activation = tf.matmul(policy_input, weights) + biases 196 | 197 | target_weights = tf.Variable(weights.initialized_value(), trainable=False, name=("target_" + name + "_weights")) 198 | target_biases = tf.Variable(biases.initialized_value(), trainable=False, name=("target_" + name + "_biases")) 199 | 200 | target_activation = tf.matmul(target_input, target_weights) + target_biases 201 | 202 | self.update_target.append(target_weights.assign(weights)) 203 | self.update_target.append(target_biases.assign(biases)) 204 | 205 | self.policy_network_params.append(weights) 206 | self.policy_network_params.append(biases) 207 | self.param_names.append(name + "_weights") 208 | self.param_names.append(name + "_biases") 209 | 210 | return [activation, target_activation] 211 | 212 | 213 | 214 | def inference(self, obs): 215 | ''' Get state-action value predictions for an observation 216 | 217 | Args: 218 | observation: the observation 219 | ''' 220 | 221 | return np.squeeze(self.sess.run(self.policy_q_layer, feed_dict={self.observation:obs})) 222 | 223 | 224 | def build_loss(self, error_clip, num_actions, double_dqn): 225 | ''' build loss graph ''' 226 | with tf.name_scope("loss"): 227 | 228 | predictions = tf.reduce_sum(tf.mul(self.policy_q_layer, self.actions), 1) 229 | 230 | max_action_values = None 231 | if double_dqn: # Double Q-Learning: 232 | max_actions = tf.to_int32(tf.argmax(self.policy_q_layer, 1)) 233 | # tf.gather doesn't support multidimensional indexing yet, so we flatten output activations for indexing 234 | indices = tf.range(0, tf.size(max_actions) * num_actions, num_actions) + max_actions 235 | max_action_values = tf.gather(tf.reshape(self.target_q_layer, shape=[-1]), indices) 236 | else: 237 | max_action_values = tf.reduce_max(self.target_q_layer, 1) 238 | 239 | targets = tf.stop_gradient(self.rewards + (self.discount_factor * max_action_values * (1 - self.terminals))) 240 | 241 | difference = tf.abs(predictions - targets) 242 | 243 | if error_clip >= 0: 244 | quadratic_part = tf.clip_by_value(difference, 0.0, error_clip) 245 | linear_part = difference - quadratic_part 246 | errors = (0.5 * tf.square(quadratic_part)) + 
(error_clip * linear_part) 247 | else: 248 | errors = (0.5 * tf.square(difference)) 249 | 250 | return tf.reduce_sum(errors) 251 | 252 | 253 | def train(self, o1, a, r, o2, t): 254 | ''' train network on batch of experiences 255 | 256 | Args: 257 | o1: first observations 258 | a: actions taken 259 | r: rewards received 260 | o2: succeeding observations 261 | ''' 262 | 263 | loss = self.sess.run([self.train_op, self.loss], 264 | feed_dict={self.observation:o1, self.actions:a, self.rewards:r, self.next_observation:o2, self.terminals:t})[1] 265 | 266 | self.total_updates += 1 267 | if self.total_updates % self.target_update_frequency == 0: 268 | self.sess.run(self.update_target) 269 | 270 | return loss 271 | 272 | 273 | def save_model(self, epoch): 274 | 275 | self.saver.save(self.sess, self.path + '/' + self.name + '.ckpt', global_step=epoch) 276 | 277 | 278 | def build_rmsprop_optimizer(self, learning_rate, rmsprop_decay, rmsprop_constant, gradient_clip, version): 279 | 280 | with tf.name_scope('rmsprop'): 281 | optimizer = None 282 | if version == 'rmsprop': 283 | optimizer = tf.train.RMSPropOptimizer(learning_rate, decay=rmsprop_decay, momentum=0.0, epsilon=rmsprop_constant) 284 | elif version == 'graves_rmsprop': 285 | optimizer = tf.train.GradientDescentOptimizer(learning_rate) 286 | 287 | grads_and_vars = optimizer.compute_gradients(self.loss) 288 | grads = [gv[0] for gv in grads_and_vars] 289 | params = [gv[1] for gv in grads_and_vars] 290 | 291 | if gradient_clip > 0: 292 | grads = tf.clip_by_global_norm(grads, gradient_clip)[0] 293 | 294 | if version == 'rmsprop': 295 | return optimizer.apply_gradients(zip(grads, params)) 296 | elif version == 'graves_rmsprop': 297 | square_grads = [tf.square(grad) for grad in grads] 298 | 299 | avg_grads = [tf.Variable(tf.zeros(var.get_shape())) for var in params] 300 | avg_square_grads = [tf.Variable(tf.zeros(var.get_shape())) for var in params] 301 | 302 | update_avg_grads = [grad_pair[0].assign((rmsprop_decay * grad_pair[0]) + ((1 - rmsprop_decay) * grad_pair[1])) 303 | for grad_pair in zip(avg_grads, grads)] 304 | update_avg_square_grads = [grad_pair[0].assign((rmsprop_decay * grad_pair[0]) + ((1 - rmsprop_decay) * tf.square(grad_pair[1]))) 305 | for grad_pair in zip(avg_square_grads, grads)] 306 | avg_grad_updates = update_avg_grads + update_avg_square_grads 307 | 308 | rms = [tf.sqrt(avg_grad_pair[1] - tf.square(avg_grad_pair[0]) + rmsprop_constant) 309 | for avg_grad_pair in zip(avg_grads, avg_square_grads)] 310 | 311 | 312 | rms_updates = [grad_rms_pair[0] / grad_rms_pair[1] for grad_rms_pair in zip(grads, rms)] 313 | train = optimizer.apply_gradients(zip(rms_updates, params)) 314 | 315 | return tf.group(train, tf.group(*avg_grad_updates)) 316 | 317 | 318 | def get_weights(self, shape, name): 319 | fan_in = np.prod(shape[0:-1]) 320 | std = 1 / math.sqrt(fan_in) 321 | return tf.Variable(tf.random_uniform(shape, minval=(-std), maxval=std), name=(name + "_weights")) 322 | 323 | def get_biases(self, shape, name): 324 | fan_in = np.prod(shape[0:-1]) 325 | std = 1 / math.sqrt(fan_in) 326 | return tf.Variable(tf.random_uniform([shape[-1]], minval=(-std), maxval=std), name=(name + "_biases")) 327 | 328 | def record_params(self, step): 329 | summary_string = self.sess.run(self.param_summaries) 330 | self.summary_writer.add_summary(summary_string, step) -------------------------------------------------------------------------------- /record_stats.py: -------------------------------------------------------------------------------- 1 | import 
tensorflow as tf 2 | import numpy as np 3 | import time 4 | 5 | class RecordStats: 6 | 7 | def __init__(self, args, test): 8 | 9 | self.test = test 10 | self.reward = 0 11 | self.step_count = 0 12 | self.loss = 0.0 13 | self.loss_count = 0 14 | self.games = 0 15 | self.q_values = 0.0 16 | self.q_count = 0 17 | self.current_score = 0 18 | self.max_score = -1000000000 19 | self.min_score = 1000000000 20 | self.recording_frequency = args.recording_frequency 21 | 22 | 23 | with tf.device('/cpu:0'): 24 | self.spg = tf.placeholder(tf.float32, shape=[], name="score_per_game") 25 | self.mean_q = tf.placeholder(tf.float32, shape=[]) 26 | self.total_gp = tf.placeholder(tf.float32, shape=[]) 27 | self.max_r = tf.placeholder(tf.float32, shape=[]) 28 | self.min_r = tf.placeholder(tf.float32, shape=[]) 29 | self.time = tf.placeholder(tf.float32, shape=[]) 30 | 31 | self.spg_summ = tf.scalar_summary('score_per_game', self.spg) 32 | self.q_summ = tf.scalar_summary('q_values', self.mean_q) 33 | self.gp_summ = tf.scalar_summary('steps_per_game', self.total_gp) 34 | self.max_summ = tf.scalar_summary('maximum_score', self.max_r) 35 | self.min_summ = tf.scalar_summary('minimum_score', self.min_r) 36 | self.time_summ = tf.scalar_summary('steps_per_second', self.time) 37 | 38 | 39 | if not test: 40 | self.mean_l = tf.placeholder(tf.float32, shape=[], name='loss') 41 | self.l_summ = tf.scalar_summary('loss', self.mean_l) 42 | self.summary_op = tf.merge_summary([self.spg_summ, self.q_summ, self.gp_summ, self.l_summ, self.max_summ, self.min_summ, self.time_summ]) 43 | self.path = ('../records/' + args.game + '/' + args.agent_type + '/' + args.agent_name + '/train') 44 | else: 45 | self.summary_op = tf.merge_summary([self.spg_summ, self.q_summ, self.gp_summ, self.max_summ, self.min_summ, self.time_summ]) 46 | self.path = ('../records/' + args.game + '/' + args.agent_type + '/' + args.agent_name + '/test') 47 | 48 | # self.summary_op = tf.merge_all_summaries() 49 | self.sess = tf.Session() 50 | self.summary_writer = tf.train.SummaryWriter(self.path) 51 | self.start_time = time.time() 52 | 53 | def record(self, epoch): 54 | 55 | seconds = time.time() - self.start_time 56 | 57 | avg_loss = 0 58 | if self.loss_count != 0: 59 | avg_loss = self.loss / self.loss_count 60 | # print("average loss: {0}".format(avg_loss)) 61 | 62 | mean_q_values = 0 63 | if self.q_count > 0: 64 | mean_q_values = self.q_values / self.q_count 65 | # print("average q_values: {0}".format(mean_q_values)) 66 | 67 | score_per_game = 0.0 68 | steps_per_game = 0 69 | 70 | if self.games == 0: 71 | score_per_game = self.reward 72 | steps_per_game = self.step_count 73 | else: 74 | score_per_game = self.reward / self.games 75 | steps_per_game = self.step_count / self.games 76 | 77 | score_per_game = float(score_per_game) 78 | 79 | if not self.test: 80 | step_per_sec = self.recording_frequency / seconds 81 | summary_str = self.sess.run(self.summary_op, 82 | feed_dict={self.spg:score_per_game, self.mean_l:avg_loss, self.mean_q:mean_q_values, 83 | self.total_gp:steps_per_game, self.max_r:self.max_score, self.min_r:self.min_score, self.time:step_per_sec}) 84 | self.summary_writer.add_summary(summary_str, global_step=epoch) 85 | else: 86 | step_per_sec = self.step_count / seconds 87 | summary_str = self.sess.run(self.summary_op, 88 | feed_dict={self.spg:score_per_game, self.mean_q:mean_q_values, self.total_gp:steps_per_game, 89 | self.max_r:self.max_score, self.min_r:self.min_score, self.time:step_per_sec}) 90 | self.summary_writer.add_summary(summary_str, 
global_step=epoch) 91 | current_score = 0 92 | 93 | self.reward = 0 94 | self.step_count = 0 95 | self.loss = 0 96 | self.loss_count = 0 97 | self.games = 0 98 | self.q_values = 0 99 | self.q_count = 0 100 | self.max_score = -1000000000 101 | self.min_score = 1000000000 102 | 103 | 104 | def add_reward(self, r): 105 | self.reward += r 106 | self.current_score += r 107 | 108 | if self.step_count == 0: 109 | self.start_time = time.time() 110 | 111 | self.step_count += 1 112 | 113 | def add_loss(self, l): 114 | self.loss += l 115 | self.loss_count += 1 116 | 117 | def add_game(self): 118 | self.games += 1 119 | 120 | if self.current_score > self.max_score: 121 | self.max_score = self.current_score 122 | if self.current_score < self.min_score: 123 | self.min_score = self.current_score 124 | 125 | self.current_score = 0 126 | 127 | def add_q_values(self, q_vals): 128 | mean_q = np.mean(q_vals) 129 | self.q_values += mean_q 130 | self.q_count += 1 131 | -------------------------------------------------------------------------------- /run_dqn.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import argparse 3 | from atari_emulator import AtariEmulator 4 | from experience_memory import ExperienceMemory 5 | from q_network import QNetwork 6 | from dqn_agent import DQNAgent 7 | from record_stats import RecordStats 8 | from parallel_q_network import ParallelQNetwork 9 | from parallel_dqn_agent import ParallelDQNAgent 10 | import experiment 11 | 12 | def main(): 13 | 14 | parser = argparse.ArgumentParser('a program to train or run a deep q-learning agent') 15 | parser.add_argument("game", type=str, help="name of game to play") 16 | parser.add_argument("agent_type", type=str, help="name of learning/acting technique used") 17 | parser.add_argument("agent_name", type=str, help="unique name of this agent instance") 18 | parser.add_argument("--rom_path", type=str, help="path to directory containing atari game roms", default='../roms') 19 | parser.add_argument("--watch", 20 | help="if true, a pretrained model with the specified name is loaded and tested with the game screen displayed", 21 | action='store_true') 22 | 23 | parser.add_argument("--epochs", type=int, help="number of epochs", default=200) 24 | parser.add_argument("--epoch_length", type=int, help="number of steps in an epoch", default=250000) 25 | parser.add_argument("--test_steps", type=int, help="max number of steps per test", default=125000) 26 | parser.add_argument("--test_steps_hardcap", type=int, help="absolute max number of steps per test", default=135000) 27 | parser.add_argument("--test_episodes", type=int, help="max number of episodes per test", default=30) 28 | parser.add_argument("--history_length", type=int, help="number of frames in a state", default=4) 29 | parser.add_argument("--training_frequency", type=int, help="number of steps run before training", default=4) 30 | parser.add_argument("--random_exploration_length", type=int, 31 | help="number of randomly-generated experiences to initially fill experience memory", default=50000) 32 | parser.add_argument("--initial_exploration_rate", type=float, help="initial exploration rate", default=1.0) 33 | parser.add_argument("--final_exploration_rate", type=float, help="final exploration rate from linear annealing", default=0.1) 34 | parser.add_argument("--final_exploration_frame", type=int, 35 | help="frame at which the final exploration rate is reached", default=1000000) 36 | parser.add_argument("--test_exploration_rate", type=float, 
help="exploration rate while testing", default=0.05) 37 | parser.add_argument("--frame_skip", type=int, help="number of frames to repeat chosen action", default=4) 38 | parser.add_argument("--screen_dims", type=tuple, help="dimensions to resize frames", default=(84,84)) 39 | # used for stochasticity and to help prevent overfitting. 40 | # Must be greater than frame_skip * (observation_length -1) + buffer_length - 1 41 | parser.add_argument("--max_start_wait", type=int, help="max number of frames to wait for initial state", default=30) 42 | # buffer_length = 1 prevents blending 43 | parser.add_argument("--buffer_length", type=int, help="length of buffer to blend frames", default=2) 44 | parser.add_argument("--blend_method", type=str, help="method used to blend frames", choices=('max'), default='max') 45 | parser.add_argument("--reward_processing", type=str, help="method to process rewards", choices=('clip', 'none'), default='clip') 46 | # must set network_architecture to custom in order use custom architecture 47 | parser.add_argument("--conv_kernel_shapes", type=tuple, 48 | help="shapes of convnet kernels: ((height, width, in_channels, out_channels), (next layer))") 49 | # must have same length as conv_kernel_shapes 50 | parser.add_argument("--conv_strides", type=tuple, help="connvet strides: ((1, height, width, 1), (next layer))") 51 | # currently, you must have at least one dense layer 52 | parser.add_argument("--dense_layer_shapes", type=tuple, help="shapes of dense layers: ((in_size, out_size), (next layer))") 53 | parser.add_argument("--discount_factor", type=float, help="constant to discount future rewards", default=0.99) 54 | parser.add_argument("--learning_rate", type=float, help="constant to scale parameter updates", default=0.00025) 55 | parser.add_argument("--optimizer", type=str, help="optimization method for network", 56 | choices=('rmsprop', 'graves_rmsprop'), default='graves_rmsprop') 57 | parser.add_argument("--rmsprop_decay", type=float, help="decay constant for moving average in rmsprop", default=0.95) 58 | parser.add_argument("--rmsprop_epsilon", type=int, help="constant to stabilize rmsprop", default=0.01) 59 | # set error_clipping to less than 0 to disable 60 | parser.add_argument("--error_clipping", type=float, help="constant at which td-error becomes linear instead of quadratic", default=1.0) 61 | # set gradient clipping to 0 or less to disable. Currently only works with graves_rmsprop. 
62 | parser.add_argument("--gradient_clip", type=float, help="clip gradients to have the provided L2-norm", default=0) 63 | parser.add_argument("--target_update_frequency", type=int, help="number of policy network updates between target network updates", default=10000) 64 | parser.add_argument("--memory_capacity", type=int, help="max number of experiences to store in experience memory", default=1000000) 65 | parser.add_argument("--batch_size", type=int, help="number of transitions sampled from memory during learning", default=32) 66 | # must set to custom in order to specify custom architecture 67 | parser.add_argument("--network_architecture", type=str, help="name of prespecified network architecture", 68 | choices=("deepmind_nips", "deepmind_nature, custom"), default="deepmind_nature") 69 | parser.add_argument("--recording_frequency", type=int, help="number of steps before tensorboard recording", default=50000) 70 | 71 | parser.add_argument("--saving_threshold", type=int, help="min score threshold for saving model.", default=0) 72 | 73 | parser.add_argument("--parallel", help="parallelize acting and learning", action='store_true') 74 | parser.add_argument("--double_dqn", help="use double q-learning algorithm in error target calculation", action='store_true') 75 | args = parser.parse_args() 76 | 77 | 78 | if args.network_architecture == 'deepmind_nature': 79 | args.conv_kernel_shapes = [ 80 | [8,8,4,32], 81 | [4,4,32,64], 82 | [3,3,64,64]] 83 | args.conv_strides = [ 84 | [1,4,4,1], 85 | [1,2,2,1], 86 | [1,1,1,1]] 87 | args.dense_layer_shapes = [[3136, 512]] 88 | elif args.network_architecture == 'deepmind_nips': 89 | args.conv_kernel_shapes = [ 90 | [8,8,4,16], 91 | [4,4,16,32]] 92 | args.conv_strides = [ 93 | [1,4,4,1], 94 | [1,2,2,1]] 95 | args.dense_layer_shapes = [[2592, 256]] 96 | 97 | if not args.watch: 98 | train_stats = RecordStats(args, False) 99 | test_stats = RecordStats(args, True) 100 | training_emulator = AtariEmulator(args) 101 | testing_emulator = AtariEmulator(args) 102 | num_actions = len(training_emulator.get_possible_actions()) 103 | experience_memory = ExperienceMemory(args, num_actions) 104 | 105 | q_network= None 106 | agent = None 107 | if args.parallel: 108 | q_network = ParallelQNetwork(args, num_actions) 109 | agent = ParallelDQNAgent(args, q_network, training_emulator, experience_memory, num_actions, train_stats) 110 | else: 111 | q_network = QNetwork(args, num_actions) 112 | agent = DQNAgent(args, q_network, training_emulator, experience_memory, num_actions, train_stats) 113 | 114 | experiment.run_experiment(args, agent, testing_emulator, test_stats) 115 | 116 | else: 117 | testing_emulator = AtariEmulator(args) 118 | num_actions = len(testing_emulator.get_possible_actions()) 119 | q_network = QNetwork(args, num_actions) 120 | agent = DQNAgent(args, q_network, None, None, num_actions, None) 121 | experiment.evaluate_agent(args, agent, testing_emulator, None) 122 | 123 | if __name__ == "__main__": 124 | main() -------------------------------------------------------------------------------- /visuals.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib 3 | matplotlib.use('TKAgg') 4 | from matplotlib import pyplot as plt 5 | import seaborn as sns 6 | 7 | class Visuals: 8 | 9 | def __init__(self, actions): 10 | 11 | all_action_names = ['no-op', 'fire', 'up', 'right', 'left', 'down', 'up_right', 'up_left', 'down-right', 'down-left', 12 | 'up-fire', 'right-fire', 'left-fire', 'down-fire', 
'up-right-fire', 'up-left-fire', 'down-right-fire', 'down-left-fire'] 13 | 14 | action_names = [all_action_names[i] for i in actions] 15 | self.num_actions = len(actions) 16 | self.max_q = 1 17 | self.min_q = 0 18 | # self.max_avg_q = 1 19 | 20 | xlocations = np.linspace(0.5, self.num_actions - 0.5, num=self.num_actions) 21 | xlocations = np.append(xlocations, self.num_actions + 0.05) 22 | if self.num_actions > 7: 23 | self.fig = plt.figure(figsize=(self.num_actions * 1.1, 6.0)) 24 | else: 25 | self.fig = plt.figure() 26 | self.bars = plt.bar(np.arange(self.num_actions), np.zeros(self.num_actions), 0.9) 27 | plt.xticks(xlocations, action_names + ['']) 28 | plt.ylabel('Expected Future Reward') 29 | plt.xlabel('Action') 30 | plt.title("State-Action Values") 31 | color_palette = sns.color_palette(n_colors=self.num_actions) 32 | for bar, color in zip(self.bars, color_palette): 33 | bar.set_color(color) 34 | self.fig.show() 35 | 36 | 37 | def update(self, q_values): 38 | 39 | for bar, q_value in zip(self.bars, q_values): 40 | bar.set_height(q_value) 41 | step_max = np.amax(q_values) 42 | step_min = np.amin(q_values) 43 | if step_max > self.max_q: 44 | self.max_q = step_max 45 | plt.gca().set_ylim([self.min_q, self.max_q]) 46 | if step_min < self.min_q: 47 | self.min_q = step_min 48 | plt.gca().set_ylim([self.min_q, self.max_q]) 49 | 50 | self.fig.canvas.draw() --------------------------------------------------------------------------------
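For reference, the Visuals class above can be exercised on its own. The following minimal sketch is not part of the repository: it is a hypothetical standalone driver that feeds random numbers to update() in place of the Q-values the agent would normally supply each step, assuming it is run from the repo root so that visuals.py is importable.

import time
import numpy as np
from visuals import Visuals

# Hypothetical driver, not part of the repo. Action indices 0, 1, 3, 4
# correspond to 'no-op', 'fire', 'right', 'left' in all_action_names above.
visuals = Visuals(actions=[0, 1, 3, 4])
for _ in range(200):
    # Random values stand in for the network's Q-value output for one state.
    q_values = np.random.uniform(-1.0, 2.0, size=4)
    visuals.update(q_values)
    time.sleep(0.05)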