├── .gitignore ├── README.md ├── docs ├── critic_net.png └── policy_net.png ├── local └── template.sh ├── model.py ├── networks.py ├── requirements.txt ├── tasks.py └── train.py /.gitignore: -------------------------------------------------------------------------------- 1 | # =========== # 2 | # Local Files # 3 | # =========== # 4 | local/logs/* 5 | local/models/* 6 | local/* 7 | !local/template.sh 8 | 9 | # =========== # 10 | # Notebooks # 11 | # =========== # 12 | .idea* 13 | __pycache__* 14 | .ipynb_checkpoints* 15 | 16 | # ============ # 17 | # OS generated # 18 | # ============ # 19 | .DS_Store* 20 | ._* 21 | .Spotlight-V100 22 | .Trashes 23 | Icon? 24 | ehthumbs.db 25 | [Tt]humbs.db 26 | [Dd]esktop.ini 27 | Corridor/Library/ShaderCache/ 28 | Corridor/Library/metadata/ -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ** NO LONGER MAINTAINED, USE AT YOUR OWN RISK ** 2 | 3 | # PySACX 4 | 5 | This repo contains a PyTorch implementation of the SAC-X RL algorithm [1]. It uses the Lunar Lander v2 6 | environment from OpenAI Gym. The SAC-X algorithm enables learning of complex behaviors from scratch 7 | in the presence of multiple sparse reward signals. 8 | 9 | ## Theory 10 | 11 | In addition to a main task reward, we define a series of auxiliary rewards. An important assumption is that 12 | each auxiliary reward can be evaluated at any state-action pair. The rewards are defined as follows: 13 | 14 | *Auxiliary Tasks/Rewards* 15 | - Touch. Maximize the number of legs touching the ground 16 | - Hover Planar. Minimize the planar movement of the lander craft 17 | - Hover Angular. Minimize the rotational movement of the lander craft 18 | - Upright. Minimize the angle of the lander craft 19 | - Goal Distance. Minimize distance between lander craft and pad 20 | 21 | *Main Task/Reward* 22 | - Did the lander land successfully (sparse reward based on landing success) 23 | 24 | Each of these tasks (intentions in the paper) has a specific model head within the neural nets used 25 | to estimate the actor and critic functions. When executing a trajectory during training, the task (and 26 | subsequently the model head within the actor) is switched between the different available options. 27 | This switching can either be done randomly (SAC-U) or it can be learned (SAC-Q). 28 | 29 | The pictures below show the network architectures for the actor and critic functions. Note the _N_ 30 | possible heads for _N_ possible tasks (intentions in the paper) [1]. 31 | 32 | ![alt text](docs/critic_net.png) ![alt text](docs/policy_net.png) 33 | 34 | *Learning* 35 | 36 | Learning the actor (policy function) is done off-policy using a gradient-based approach. Gradients are 37 | backpropagated through task-specific versions of the actor by using the task-specific versions of the 38 | critic (Q function). Importantly though, the trajectory (collection of state-action pairs) need not 39 | have been collected using the same task-specific actor, allowing learning from data generated by all other actors. 40 | The actor policy gradient is computed using the reparameterization trick (code in `model.py`). 41 | 42 | Learning the critic (Q function) is similarly done off-policy. We sample trajectories from a buffer 43 | collected with target actors (actor policies frozen at a particular learning iteration).
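For reference, the per-task retrace target that each task-specific Q function is regressed toward can be sketched as follows. This is a schematic form following the retrace formulation referenced in [1]; the implementation in `model.py` computes a simplified variant of the importance weights and temporal-difference term:

$$Q^{ret}(s_t, a_t; \mathcal{T}) = Q(s_t, a_t; \mathcal{T}) + \sum_{j \ge t} \gamma^{\,j-t} \Big( \prod_{k=t+1}^{j} c_k \Big) \Big[ r_{\mathcal{T}}(s_j, a_j) + \gamma \, \mathbb{E}_{a' \sim \pi_{\mathcal{T}}} Q(s_{j+1}, a'; \mathcal{T}) - Q(s_j, a_j; \mathcal{T}) \Big], \qquad c_k = \min\!\Big(1, \frac{\pi_{\mathcal{T}}(a_k \mid s_k)}{b(a_k \mid s_k)}\Big)$$

where $\mathcal{T}$ indexes the task (intention), $r_{\mathcal{T}}$ is that task's reward, and $b$ is the behaviour policy that actually generated the trajectory.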
The critic 44 | policy gradient is computed using the retrace method (code in `model.py`) 45 | 46 | ## Instructions 47 | 48 | - Use the `local/template.sh` script to train lots of model variations. Or use `train.py` to train an agent directly. 49 | 50 | ## Requirements 51 | 52 | - Python 3.6 53 | - [PyTorch](http://pytorch.org/) 0.3.0.post4 54 | - [OpenAI Gym](https://gym.openai.com/) 55 | - [tensorboardX](https://github.com/lanpa/tensorboard-pytorch/tree/master/tensorboardX) 56 | 57 | ## Sources 58 | 59 | [1] [Learning by Playing – Solving Sparse Reward Tasks from Scratch](https://arxiv.org/abs/1802.10567). 60 | -------------------------------------------------------------------------------- /docs/critic_net.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hu-po/pySACQ/43ee4f457ddd8e444a01f19fc9658986bda73de0/docs/critic_net.png -------------------------------------------------------------------------------- /docs/policy_net.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hu-po/pySACQ/43ee4f457ddd8e444a01f19fc9658986bda73de0/docs/policy_net.png -------------------------------------------------------------------------------- /local/template.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | 3 | # Navigate to proper directory and activate the conda environment 4 | DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )" 5 | cd $DIR/.. 6 | source activate gym 7 | 8 | # Log directory is based on run date 9 | DATE=$(date +%Y-%m-%d) 10 | 11 | # Run several different training runs, storing in different log locations 12 | NAME="experiment_name" 13 | python train.py --log=$DATE$NAME \ 14 | --num_train_cycles=1 \ 15 | --buffer_size=1 \ 16 | --num_trajectories=1 \ 17 | --num_learning_iterations=1 \ 18 | --episode_batch_size=1 \ 19 | --batch_norm \ 20 | --loss=discounted_rewards \ 21 | --non_linear=relu -------------------------------------------------------------------------------- /model.py: -------------------------------------------------------------------------------- 1 | from collections import namedtuple 2 | import random 3 | import torch 4 | import numpy as np 5 | 6 | # Named tuple for a single step within a trajectory 7 | Step = namedtuple('Step', ['state', 'action', 'log_prob', 'reward']) 8 | 9 | # Global step counters 10 | ACT_STEP = 0 11 | LEARN_STEP = 0 12 | 13 | 14 | def act(actor, env, task, B, num_trajectories=10, task_period=30, writer=None): 15 | """ 16 | Performs actions in the environment collecting reward/experience. 
17 | This follows Algorithm 3 in [1] 18 | :param actor: (Actor) actor network object 19 | :param env: (Environment) OpenAI Gym Environment object 20 | :param task: (Task) task object 21 | :param B: (list) replay buffer containing trajectories 22 | :param num_trajectories: (int) number of trajectories to collect at a time 23 | :param task_period: (int) number of steps in a single task period 24 | :param writer: (SummaryWriter) writer object for logging 25 | :return: None 26 | """ 27 | global ACT_STEP 28 | for trajectory_idx in range(num_trajectories): 29 | print('Acting: trajectory %s of %s' % (trajectory_idx + 1, num_trajectories)) 30 | # Reset environment and trajectory specific parameters 31 | trajectory = [] # collection of state, action, task pairs 32 | task.reset() # h in paper 33 | obs = env.reset() 34 | done = False 35 | num_steps = 0 36 | # Roll out 37 | while not done: 38 | # Sample a new task using the scheduler 39 | if num_steps % task_period == 0: 40 | task.sample() 41 | # Get the action from current actor policy 42 | actor.eval() 43 | action, log_prob = actor.predict(np.expand_dims(obs, axis=0), task.current_task, log_prob=True) 44 | # Execute action and collect rewards for each task 45 | obs, gym_reward, done, _ = env.step(action[0]) 46 | # # Modify the main task reward (the huge -100 and 100 values cause instability) 47 | # gym_reward /= 100.0 48 | # Reward is a vector of the reward for each task 49 | reward = task.reward(obs, gym_reward) 50 | if writer: 51 | for i, r in enumerate(reward): 52 | writer.add_scalar('train/reward/%s' % i, r, ACT_STEP) 53 | # group information into a step and add to current trajectory 54 | trajectory.append(Step(obs, action[0], log_prob[0], reward)) 55 | num_steps += 1 # increment step counter 56 | ACT_STEP += 1 57 | # Add trajectory to replay buffer 58 | B.append(trajectory) 59 | 60 | 61 | def _loss_policy_gradient(trajectory, task, actor, gamma=0.95): 62 | """ 63 | Calculates actor and critic losses for a given trajectory. 
Uses a simple REINFORCE-style policy gradient (log-probabilities weighted by discounted returns), so no critic is needed and only an actor loss is returned. 64 | :param trajectory: (list of Steps) trajectory or episode 65 | :param task: (Task) task object 66 | :param actor: (Actor) actor network object 67 | :param gamma: (float) discount factor 68 | :return: (float) actor loss 69 | """ 70 | # Extract information out of trajectory 71 | num_steps = len(trajectory) 72 | states = torch.FloatTensor([step.state for step in trajectory]) 73 | rewards = torch.FloatTensor([step.reward for step in trajectory]) 74 | # Create an intention (task) mask for all possible intentions 75 | task_mask = np.repeat(np.arange(0, task.num_tasks), num_steps) 76 | imask_task = torch.LongTensor(task_mask) 77 | states = states.repeat(task.num_tasks, 1) 78 | # actions (for each task) for every state-action pair in trajectory 79 | task_actions, task_log_prob = actor.forward(states, imask_task, log_prob=True) 80 | # Calculate discounted cumulative rewards 81 | discounted_cumulative_rewards = torch.zeros_like(rewards) 82 | dcr = torch.zeros((1, task.num_tasks)) 83 | for j in reversed(range(num_steps)): 84 | dcr = gamma * dcr + rewards[j, :].unsqueeze(0) 85 | discounted_cumulative_rewards[j, :] = dcr 86 | discounted_cumulative_rewards = discounted_cumulative_rewards.repeat(task.num_tasks, 1) 87 | # Create intention mask 88 | one_hot_mask = np.zeros((states.shape[0], task.num_tasks)) 89 | one_hot_mask[np.arange(states.shape[0]), imask_task.numpy()] = 1 90 | mask_tensor = torch.FloatTensor(one_hot_mask) 91 | # Multiply by the intention mask and sum in the final dimension to get the right output shape 92 | discounted_cumulative_rewards = torch.autograd.Variable((discounted_cumulative_rewards * mask_tensor).sum(dim=1), 93 | requires_grad=False) 94 | 95 | # Actor loss is log-prob weighted sum of discounted returns (for each task) given states from trajectory 96 | actor_loss = - torch.sum(discounted_cumulative_rewards * task_log_prob) 97 | actor_loss /= len(trajectory) # Divide by the trajectory length so longer trajectories are not weighted more heavily 98 | return actor_loss 99 | 100 |
101 | def _loss_discounted_rewards(trajectory, task, actor, critic, gamma=0.95): 102 | """ 103 | Calculates actor and critic losses for a given trajectory. Uses a simpler discounted cumulative reward target for the critic in place of the retrace target. 104 | :param trajectory: (list of Steps) trajectory or episode 105 | :param task: (Task) task object 106 | :param actor: (Actor) actor network object 107 | :param critic: (Critic) critic network object 108 | :param gamma: (float) discount factor 109 | :return: (float, float) actor and critic loss 110 | """ 111 | # Extract information out of trajectory 112 | num_steps = len(trajectory) 113 | states = torch.FloatTensor([step.state for step in trajectory]) 114 | rewards = torch.FloatTensor([step.reward for step in trajectory]) 115 | # Create an intention (task) mask for all possible intentions 116 | task_mask = np.repeat(np.arange(0, task.num_tasks), num_steps) 117 | imask_task = torch.LongTensor(task_mask) 118 | states = states.repeat(task.num_tasks, 1) 119 | # actions (for each task) for every state-action pair in trajectory 120 | task_actions, task_log_prob = actor.forward(states, imask_task, log_prob=True) 121 | # Q-values (for each task) for every state and task-action pair in trajectory 122 | critic_input = torch.cat([task_actions.data.float().unsqueeze(1), states], dim=1) 123 | task_q = critic.forward(critic_input, imask_task) 124 | # Actor loss is log-prob weighted sum of Q values (for each task) given states from trajectory 125 | actor_loss = - torch.sum(torch.autograd.Variable(task_q.data, requires_grad=False).squeeze(1) * task_log_prob) 126 | actor_loss /= len(trajectory) # Divide by the trajectory length so longer trajectories are not weighted more heavily 127 | # Calculate discounted cumulative rewards 128 | discounted_cumulative_rewards = torch.zeros_like(rewards) 129 | dcr = torch.zeros((1, task.num_tasks)) 130 | for j in reversed(range(num_steps)): 131 | dcr = gamma * dcr + rewards[j, :].unsqueeze(0) 132 | discounted_cumulative_rewards[j, :] = dcr 133 | discounted_cumulative_rewards = discounted_cumulative_rewards.repeat(task.num_tasks, 1) 134 | # Create intention mask 135 | one_hot_mask = np.zeros((states.shape[0], task.num_tasks)) 136 | one_hot_mask[np.arange(states.shape[0]), imask_task.numpy()] = 1 137 | mask_tensor = torch.FloatTensor(one_hot_mask) 138 | # Multiply by the intention mask and sum in the final dimension to get the right output shape 139 | discounted_cumulative_rewards = torch.autograd.Variable((discounted_cumulative_rewards * mask_tensor).sum(dim=1), 140 | requires_grad=False) 141 | # Use Huber Loss for critic 142 | critic_loss = torch.nn.SmoothL1Loss()(task_q, discounted_cumulative_rewards) 143 | return actor_loss, critic_loss 144 | 145 | 146 | def _loss_retrace(trajectory, task, actor, critic, gamma=0.95): 147 | """ 148 | Calculates actor and critic losses for a given trajectory.
Following equations in [1] 149 | :param trajectory: (list of Steps) trajectory or episode 150 | :param task: (Task) task object 151 | :param actor: (Actor) actor network object 152 | :param critic: (Critic) critic network object 153 | :param gamma: (float) discount factor 154 | :return: (float, float) actor and critic loss 155 | """ 156 | # Extract information out of trajectory 157 | num_steps = len(trajectory) 158 | states = torch.FloatTensor([step.state for step in trajectory]) 159 | rewards = torch.FloatTensor([step.reward for step in trajectory]) 160 | actions = torch.FloatTensor([step.action for step in trajectory]).unsqueeze(1) 161 | log_probs = torch.FloatTensor([step.log_prob for step in trajectory]).unsqueeze(1) 162 | # Create an intention (task) mask for all possible intentions 163 | task_mask = np.repeat(np.arange(0, task.num_tasks), num_steps) 164 | imask_task = torch.LongTensor(task_mask) 165 | states = states.repeat(task.num_tasks, 1) 166 | actions = actions.repeat(task.num_tasks, 1) 167 | # actions (for each task) for every state action pair in trajectory 168 | task_actions, task_log_prob = actor.forward(states, imask_task, log_prob=True) 169 | # Q-values (for each task) for every state and task-action pair in trajectory 170 | critic_input = torch.cat([task_actions.data.float().unsqueeze(1), states], dim=1) 171 | task_q = critic.forward(critic_input, imask_task) 172 | # Q-values (for each task) for every state and action pair in trajectory 173 | critic_input = torch.cat([actions, states], dim=1) 174 | traj_q = critic.predict(critic_input, imask_task) 175 | # Actor loss is log-prob weighted sum of Q values (for each task) given states from trajectory 176 | actor_loss = - torch.sum(torch.autograd.Variable(task_q.data, requires_grad=False).squeeze(1) * task_log_prob) 177 | actor_loss /= len(trajectory) # Divide by the number of runs to prevent trajectory length from mattering 178 | # Calculation of retrace Q 179 | q_ret = torch.zeros_like(task_q.data) 180 | for task_id in range(task.num_tasks): 181 | start = task_id * num_steps 182 | for i in range(num_steps): 183 | q_ret_i = 0 184 | for j in range(i, num_steps): 185 | # Discount factor 186 | discount = gamma ** (j - i) 187 | # Importance weights 188 | cj = 1.0 189 | for k in range(i, j): 190 | ck = min(abs(task_log_prob.data[start + k] / float(log_probs[k])), 1.0) 191 | cj *= ck 192 | # Difference between the two q values 193 | del_q = task_q.data[start + i] - traj_q[start + j] 194 | # Retrace Q value is sum of discounted weighted rewards 195 | q_ret_i += discount * cj * (rewards[j, task_id] + del_q) 196 | # Append retrace Q value to float tensor using index_fill 197 | q_ret.index_fill_(0, torch.LongTensor([start + i]), q_ret_i[0]) 198 | # Critic loss uses retrace Q 199 | # critic_loss = torch.sum((task_q - torch.autograd.Variable(q_ret, requires_grad=False)) ** 2) 200 | # critic_loss /= len(trajectory) # Divide by the number of runs to prevent trajectory length from mattering 201 | # Use Huber Loss for critic 202 | critic_loss = torch.nn.SmoothL1Loss()(task_q, torch.autograd.Variable(q_ret, requires_grad=False)) 203 | return actor_loss, critic_loss 204 | 205 | 206 | def learn(actor, critic, task, B, num_learning_iterations=10, episode_batch_size=10, lr=0.0002, loss='retrace', 207 | writer=None): 208 | """ 209 | Pushes back gradients from the replay buffer, updating the actor and critic. 
210 | This follows Algorithm 2 in [1] 211 | :param actor: (Actor) actor network object 212 | :param critic: (Critic) critic network object 213 | :param task: (Task) task object 214 | :param B: (list) replay buffer containing trajectories 215 | :param num_learning_iterations: (int) number of learning iterations per function call 216 | :param episode_batch_size: (int) number of trajectories in a batch (one gradient push) 217 | :param lr: (float) learning rate 218 | :param writer: (SummaryWriter) writer object for logging 219 | :return: None 220 | """ 221 | global LEARN_STEP 222 | for learn_idx in range(num_learning_iterations): 223 | print('Learning: trajectory %s of %s' % (learn_idx + 1, num_learning_iterations)) 224 | # optimizers for critic and actor 225 | actor_opt = torch.optim.Adam(actor.parameters(), lr) 226 | actor_opt.zero_grad() 227 | actor.train() 228 | if loss not in ['policy_gradient']: 229 | critic_opt = torch.optim.Adam(critic.parameters(), lr) 230 | critic_opt.zero_grad() 231 | critic.train() 232 | for batch_idx in range(episode_batch_size): 233 | # Sample a random trajectory from the replay buffer 234 | trajectory = random.choice(B) 235 | # Compute losses for critic and actor 236 | if loss == 'discounted_rewards': 237 | actor_loss, critic_loss = _loss_discounted_rewards(trajectory, task, actor, critic) 238 | elif loss == 'retrace': 239 | actor_loss, critic_loss = _loss_retrace(trajectory, task, actor, critic) 240 | elif loss == 'policy_gradient': 241 | actor_loss = _loss_policy_gradient(trajectory, task, actor) 242 | else: 243 | print('Loss not found') 244 | return 245 | if writer: 246 | writer.add_scalar('train/loss/actor', actor_loss.data[0], LEARN_STEP) 247 | if loss not in ['policy_gradient']: 248 | writer.add_scalar('train/loss/critic', critic_loss.data[0], LEARN_STEP) 249 | # compute gradients 250 | actor_loss.backward() 251 | if loss not in ['policy_gradient']: 252 | critic_loss.backward() 253 | LEARN_STEP += 1 254 | # Push back the accumulated gradients and update the networks 255 | actor_opt.step() 256 | if loss not in ['policy_gradient']: 257 | critic_opt.step() 258 | -------------------------------------------------------------------------------- /networks.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import numpy as np 3 | 4 | 5 | class IntentionBase(torch.nn.Module): 6 | """Generic class for a single intention head (used within actor/critic networks)""" 7 | 8 | def __init__(self, input_size, hidden_size, output_size, non_linear, final_non_linear, use_gpu=True): 9 | super(IntentionBase, self).__init__() 10 | self.non_linear = non_linear 11 | self.final_non_linear = final_non_linear 12 | self.use_gpu = use_gpu 13 | 14 | # Build the network 15 | self.layer1 = torch.nn.Linear(input_size, hidden_size) 16 | self.final_layer = torch.nn.Linear(hidden_size, output_size) 17 | self.init_weights() 18 | 19 | def init_weights(self): 20 | # Initialize the other layers with xavier (still constant 0 bias) 21 | torch.nn.init.xavier_uniform(self.layer1.weight.data) 22 | torch.nn.init.constant(self.layer1.bias.data, 0) 23 | torch.nn.init.xavier_uniform(self.final_layer.weight.data) 24 | torch.nn.init.constant(self.final_layer.bias.data, 0) 25 | 26 | def forward(self, x): 27 | x = self.non_linear(self.layer1(x)) 28 | x = self.final_non_linear(self.final_layer(x)) 29 | return x 30 | 31 | 32 | class IntentionCritic(IntentionBase): 33 | """Class for a single Intention head within the Q-function (or critic) network""" 34 | 35 | 
def __init__(self, 36 | input_size, 37 | hidden_size, 38 | output_size, 39 | non_linear=torch.nn.ELU(), 40 | use_gpu=True): 41 | final_non_linear = non_linear 42 | super(IntentionCritic, self).__init__(input_size, hidden_size, output_size, non_linear, final_non_linear, 43 | use_gpu) 44 | 45 | 46 | class IntentionActor(IntentionBase): 47 | """Class for a single Intention head within the policy (or actor) network""" 48 | 49 | def __init__(self, 50 | input_size, 51 | hidden_size, 52 | output_size, 53 | non_linear=torch.nn.ELU(), 54 | final_non_linear=torch.nn.Softmax(dim=1), 55 | use_gpu=True): 56 | super(IntentionActor, self).__init__(input_size, hidden_size, output_size, non_linear, final_non_linear, 57 | use_gpu) 58 | 59 | 60 | class SQXNet(torch.nn.Module): 61 | """Generic class for actor and critic networks. The arch is very similar.""" 62 | 63 | def __init__(self, 64 | state_dim, 65 | base_hidden_size, 66 | num_intentions, 67 | head_input_size, 68 | head_hidden_size, 69 | head_output_size, 70 | non_linear, 71 | net_type, 72 | batch_norm=False, 73 | use_gpu=True): 74 | super(SQXNet, self).__init__() 75 | self.non_linear = non_linear 76 | self.batch_norm = batch_norm 77 | self.use_gpu = use_gpu 78 | 79 | # Build the base of the network 80 | self.layer1 = torch.nn.Linear(state_dim, base_hidden_size) 81 | self.layer2 = torch.nn.Linear(base_hidden_size, head_input_size) 82 | if self.batch_norm: 83 | self.bn1 = torch.nn.BatchNorm1d(base_hidden_size) 84 | self.init_weights() 85 | 86 | # Create the many intention nets heads 87 | if net_type == 'actor': 88 | intention_net_type = IntentionActor 89 | elif net_type == 'critic': 90 | intention_net_type = IntentionCritic 91 | else: 92 | raise Exception('Invalid net type for SQXNet') 93 | self.intention_nets = [] 94 | for _ in range(num_intentions): 95 | self.intention_nets.append(intention_net_type(input_size=head_input_size, 96 | hidden_size=head_hidden_size, 97 | output_size=head_output_size, 98 | use_gpu=use_gpu, 99 | non_linear=non_linear)) 100 | 101 | def init_weights(self): 102 | # Initialize the other layers with xavier (still constant 0 bias) 103 | torch.nn.init.xavier_uniform(self.layer1.weight.data) 104 | torch.nn.init.constant(self.layer1.bias.data, 0) 105 | torch.nn.init.xavier_uniform(self.layer2.weight.data) 106 | torch.nn.init.constant(self.layer2.bias.data, 0) 107 | 108 | def forward(self, x, intention): 109 | # Feed the input through the base layers of the model 110 | x = torch.autograd.Variable(torch.Tensor(x)) 111 | x = self.non_linear(self.layer1(x)) 112 | if self.batch_norm: 113 | x = self.bn1(x) 114 | x = self.non_linear(self.layer2(x)) 115 | if isinstance(intention, int): # single intention head 116 | x = self.intention_nets[intention].forward(x) 117 | else: 118 | # Create intention mask 119 | one_hot_mask = np.zeros((x.shape[0], len(self.intention_nets))) 120 | one_hot_mask[np.arange(x.shape[0]), intention.numpy()] = 1 121 | mask_tensor = torch.autograd.Variable(torch.FloatTensor(one_hot_mask).unsqueeze(1), requires_grad=False) 122 | # Feed forward through all the intention heads and concatenate on new dimension 123 | intention_out = torch.cat(list(head.forward(x).unsqueeze(2) for head in self.intention_nets), dim=2) 124 | # Multiply by the intention mask and sum in the final dimension to get the right output shape 125 | x = (intention_out * mask_tensor).sum(dim=2) 126 | return x 127 | 128 | def predict(self, x, intention): 129 | y = self.forward(x, intention).cpu().data 130 | return y 131 | 132 | 133 | class Actor(SQXNet): 
134 | """Class for policy (or actor) network""" 135 | 136 | def __init__(self, 137 | state_dim=8, 138 | base_hidden_size=32, 139 | num_intentions=6, 140 | head_input_size=16, 141 | head_hidden_size=8, 142 | head_output_size=4, 143 | non_linear=torch.nn.ELU(), 144 | net_type='actor', 145 | batch_norm=False, 146 | use_gpu=True): 147 | super(Actor, self).__init__(state_dim, 148 | base_hidden_size, 149 | num_intentions, 150 | head_input_size, 151 | head_hidden_size, 152 | head_output_size, 153 | non_linear, 154 | net_type, 155 | batch_norm, 156 | use_gpu) 157 | 158 | def forward(self, x, intention, log_prob=False): 159 | x = super().forward(x, intention) 160 | # Intention head determines parameters of Categorical distribution 161 | dist = torch.distributions.Categorical(x) 162 | action = dist.sample() 163 | if log_prob: 164 | log_prob = dist.log_prob(action) 165 | return action, log_prob 166 | return action 167 | 168 | def predict(self, x, intention, log_prob=False): 169 | if log_prob: 170 | action, log_prob = self.forward(x, intention, log_prob=True) 171 | return action.cpu().data, log_prob.cpu().data 172 | else: 173 | action = self.forward(x, intention).cpu().data 174 | return action 175 | return None 176 | 177 | 178 | class Critic(SQXNet): 179 | """Class for Q-function (or critic) network""" 180 | 181 | def __init__(self, 182 | num_intentions=6, 183 | state_dim=9, 184 | base_hidden_size=64, 185 | head_input_size=64, 186 | head_hidden_size=32, 187 | head_output_size=1, 188 | non_linear=torch.nn.ELU(), 189 | net_type='critic', 190 | batch_norm=False, 191 | use_gpu=True): 192 | super(Critic, self).__init__(state_dim, 193 | base_hidden_size, 194 | num_intentions, 195 | head_input_size, 196 | head_hidden_size, 197 | head_output_size, 198 | non_linear, 199 | net_type, 200 | batch_norm, 201 | use_gpu) 202 | 203 | 204 | if __name__ == '__main__': 205 | print('Run this file directly to debug') 206 | 207 | actor = Actor() 208 | critic = Critic() 209 | 210 | # Carry out a step on the environment to test out forward functions 211 | import gym 212 | import random 213 | 214 | env = gym.make('LunarLander-v2') 215 | obs = env.reset() 216 | task_idx = random.randint(0, 6) 217 | 218 | # Get the action from current actor policy 219 | action = actor.predict(obs, task_idx) 220 | _, _, _, _ = env.step(action) 221 | 222 | print('Got to end sucessfully! (Though this only means there are no major bugs..)') 223 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | gym==0.10.3 2 | tensorboardX==1.1 3 | box2d-py==2.3.1 -------------------------------------------------------------------------------- /tasks.py: -------------------------------------------------------------------------------- 1 | import random 2 | 3 | 4 | # Observation space, according to source: 5 | # state = [ 6 | # (pos.x - VIEWPORT_W / SCALE / 2) / (VIEWPORT_W / SCALE / 2), 7 | # (pos.y - (self.helipad_y + LEG_DOWN / SCALE)) / (VIEWPORT_W / SCALE / 2), 8 | # vel.x * (VIEWPORT_W / SCALE / 2) / FPS, 9 | # vel.y * (VIEWPORT_H / SCALE / 2) / FPS, 10 | # self.lander.angle, 11 | # 20.0 * self.lander.angularVelocity / FPS, 12 | # 1.0 if self.legs[0].ground_contact else 0.0, 13 | # 1.0 if self.legs[1].ground_contact else 0.0 14 | # ] 15 | 16 | # Auxiliary Rewards: 17 | # Touch. Maximizing number of legs touching ground 18 | # Hover Planar. Minimize the planar movement of the lander craft 19 | # Hover Angular. 
Minimize the rotational movement of ht lander craft 20 | # Upright. Minimize the angle of the lander craft 21 | # Goal Distance. Minimize distance between lander craft and pad 22 | # 23 | # Extrinsic Rewards: 24 | # Success: Did the lander land successfully (1 or 0) 25 | 26 | def touch(state): 27 | """ 28 | Auxiliary reward for touching lander legs on the ground 29 | :param state: (list) state of lunar lander 30 | :return: (float) reward 31 | """ 32 | left_contact = state[6] # 1.0 if self.legs[0].ground_contact else 0.0 33 | right_contact = state[7] # 1.0 if self.legs[1].ground_contact else 0.0 34 | reward = left_contact + right_contact 35 | return reward 36 | 37 | 38 | def hover_planar(state): 39 | """ 40 | Auxiliary reward for hovering the lander (minimal planar movement) 41 | :param state: (list) state of lunar lander 42 | :return: (float) reward 43 | """ 44 | x_vel = state[2] # vel.x * (VIEWPORT_W / SCALE / 2) / FPS 45 | y_vel = state[3] # vel.y * (VIEWPORT_H / SCALE / 2) / FPS 46 | reward = 2.0 - (abs(x_vel) + abs(y_vel)) 47 | return reward 48 | 49 | 50 | def hover_angular(state): 51 | """ 52 | Auxiliary reward for hovering the lander (minimal angular movement) 53 | :param state: (list) state of lunar lander 54 | :return: (float) reward 55 | """ 56 | ang_vel = state[5] # 20.0 * self.lander.angularVelocity / FPS 57 | reward = 2.0 - abs(ang_vel) 58 | return reward 59 | 60 | 61 | def upright(state): 62 | """ 63 | Auxiliary reward for keeping the lander upright 64 | :param state: (list) state of lunar lander 65 | :return: (float) reward 66 | """ 67 | angle = state[4] # self.lander.angle 68 | reward = 2.0 - abs(angle) 69 | return reward 70 | 71 | 72 | def goal_distance(state): 73 | """ 74 | Auxiliary reward for distance from lander to goal 75 | :param state: (list) state of lunar lander 76 | :return: (float) reward 77 | """ 78 | x_pos = state[2] # (pos.x - VIEWPORT_W / SCALE / 2) / (VIEWPORT_W / SCALE / 2) 79 | y_pos = state[3] # (pos.y - (self.helipad_y + LEG_DOWN / SCALE)) / (VIEWPORT_W / SCALE / 2) 80 | reward = 2.0 - (abs(x_pos) + abs(y_pos)) 81 | return reward 82 | 83 | 84 | class TaskScheduler(object): 85 | """Class defines Scheduler for storing and picking tasks""" 86 | 87 | def __init__(self): 88 | self.aux_rewards = [touch, 89 | hover_planar, 90 | hover_angular, 91 | upright, 92 | goal_distance] 93 | 94 | # Number of tasks is number of auxiliary tasks plus the main task 95 | self.num_tasks = len(self.aux_rewards) + 1 96 | 97 | # Internal tracking variable for current task, and set of tasks 98 | self.current_task = 0 99 | self.current_set = set() 100 | 101 | def reset(self): 102 | self.current_set = set() 103 | 104 | def sample(self): 105 | self.current_task = random.randint(0, self.num_tasks-1) 106 | self.current_set.add(self.current_task) 107 | 108 | def reward(self, state, main_reward): 109 | reward_vector = [] 110 | for task_reward in self.aux_rewards: 111 | reward_vector.append(task_reward(state)) 112 | # Append main task reward 113 | reward_vector.append(main_reward) 114 | return reward_vector 115 | -------------------------------------------------------------------------------- /train.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | from pathlib import Path 3 | import sys 4 | import time 5 | import gym 6 | import torch 7 | import numpy as np 8 | from tensorboardX import SummaryWriter 9 | 10 | # Add local files to path 11 | root_dir = Path.cwd() 12 | sys.path.append(str(root_dir)) 13 | from networks import Actor, 
Critic 14 | from tasks import TaskScheduler 15 | from model import act, learn 16 | 17 | # Log and model saving parameters 18 | parser = argparse.ArgumentParser(description='Train Arguments') 19 | parser.add_argument('--log', type=str, default=None, help='Write tensorboard style logs to this folder [default: None]') 20 | parser.add_argument('--saveas', type=str, default=None, help='savename for model (Training) [default: None]') 21 | parser.add_argument('--model', type=str, default=None, help='savename for model (Evaluating) [default: None]') 22 | 23 | # Training parameters 24 | parser.add_argument('--num_train_cycles', type=int, default=1000, help='Number of training cycles [default: 1]') 25 | parser.add_argument('--num_trajectories', type=int, default=5, 26 | help='Number of trajectories collected per acting cycle [default: 5]') 27 | parser.add_argument('--num_learning_iterations', type=int, default=1, 28 | help='Number of learning iterations per learn cycle [default: 1]') 29 | parser.add_argument('--episode_batch_size', type=int, default=2, 30 | help='Number of trajectories per batch (gradient push) [default: 2]') 31 | parser.add_argument('--buffer_size', type=int, default=200, 32 | help='Number of trajectories in replay buffer [default: 200]') 33 | 34 | # Model parameters 35 | parser.add_argument('--non_linear', type=str, default='relu', help='Non-linearity in the nets [default: ReLU]') 36 | parser.add_argument('--batch_norm', dest='batch_norm', default=False, action='store_true', 37 | help='Batch norm applied to input layers [default: False]') 38 | parser.add_argument('--loss', type=str, default='retrace', help='Type of loss used when training [default: retrace]') 39 | 40 | # Global step counters 41 | TEST_STEP = 0 42 | 43 | 44 | def run(actor, env, min_rate=None, writer=None, render=False): 45 | """ 46 | Runs the actor policy on the environment, rendering it. This does not store anything 47 | and is only used for visualization. 48 | :param actor: (Actor) actor network object 49 | :param env: (Environment) OpenAI Gym Environment object 50 | :param min_rate: (float) minimum framerate 51 | :param writer: (SummaryWriter) writer object for logging 52 | :param render: (Bool) toggle for rendering to window 53 | :return: None 54 | """ 55 | global TEST_STEP 56 | obs = env.reset() 57 | done = False 58 | # Counter variables for number of steps and total episode time 59 | epoch_tic = time.clock() 60 | num_steps = 0 61 | reward = 0 62 | while not done: 63 | step_tic = time.clock() 64 | if render: 65 | env.render() 66 | # Use the previous observation to get an action from policy 67 | actor.eval() 68 | action = actor.predict(np.expand_dims(obs, axis=0), -1) # Last intention is main task 69 | # Step the environment and push outputs to policy 70 | obs, reward, done, _ = env.step(action[0]) 71 | if writer: 72 | writer.add_scalar('test/reward', reward, TEST_STEP) 73 | step_toc = time.clock() 74 | step_time = step_toc - step_tic 75 | if min_rate and step_time < min_rate: # Sleep to ensure minimum rate 76 | time.sleep(min_rate - step_time) 77 | num_steps += 1 78 | TEST_STEP += 1 79 | # Total elapsed time in epoch 80 | epoch_toc = time.clock() 81 | epoch_time = epoch_toc - epoch_tic 82 | print('Episode complete (%s steps in %.2fsec), final reward %s ' % (num_steps, epoch_time, reward)) 83 | 84 | 85 | if __name__ == '__main__': 86 | 87 | # Parse and print out parameters 88 | args = parser.parse_args() 89 | print('Running Trainer. 
Parameters:') 90 | for attr, value in args.__dict__.items(): 91 | print('%s : %s' % (attr.upper(), value)) 92 | 93 | # Make sure we can use gpu 94 | use_gpu = torch.cuda.is_available() 95 | print('Gpu is enabled: %s' % use_gpu) 96 | 97 | # Replay buffer stores collected trajectories 98 | B = [] 99 | 100 | # Environment is the lunar lander from OpenAI gym 101 | env = gym.make('LunarLander-v2') 102 | 103 | # task scheduler is defined in tasks.py 104 | task = TaskScheduler() 105 | 106 | # Write tensorboard logs to local logs folder 107 | writer = None 108 | if args.log: 109 | log_dir = root_dir / 'local' / 'logs' / args.log 110 | writer = SummaryWriter(log_dir=str(log_dir)) 111 | 112 | if args.model: # TEST MODE 113 | model_path = str(root_dir / 'local' / 'models' / args.model) 114 | print('Loading models from %s' % model_path) 115 | actor = torch.load(model_path + '_actor.pt') 116 | critic = torch.load(model_path + '_critic.pt') 117 | print('...done') 118 | 119 | run(actor, env, min_rate=0.05, writer=writer, render=True) 120 | 121 | else: # TRAIN MODE 122 | # Non-linearity is an argument 123 | non_linear = None 124 | if args.non_linear == 'relu': 125 | non_linear = torch.nn.ReLU() 126 | elif args.non_linear == 'elu': 127 | non_linear = torch.nn.ELU() 128 | 129 | # New actor and critic policies 130 | actor = Actor(use_gpu=use_gpu, non_linear=non_linear, batch_norm=args.batch_norm) 131 | critic = Critic(use_gpu=use_gpu, non_linear=non_linear, batch_norm=args.batch_norm) 132 | 133 | for i in range(args.num_train_cycles): 134 | print('Training cycle %s of %s' % (i, args.num_train_cycles)) 135 | act(actor, env, task, B, 136 | num_trajectories=args.num_trajectories, 137 | task_period=30, writer=writer) 138 | learn(actor, critic, task, B, 139 | num_learning_iterations=args.num_learning_iterations, 140 | episode_batch_size=args.episode_batch_size, 141 | lr=0.0002, writer=writer, loss=args.loss) 142 | run(actor, env, min_rate=0.05, writer=writer) 143 | # Remove early trajectories when buffer gets too large 144 | B = B[-args.buffer_size:] 145 | 146 | # Save the model to local directory 147 | if args.saveas is not None: 148 | save_path = str(root_dir / 'local' / 'models' / args.saveas) 149 | print('Saving models to %s' % save_path) 150 | torch.save(actor, save_path + '_actor.pt') 151 | torch.save(critic, save_path + '_critic.pt') 152 | print('...done') 153 | 154 | # Close writer 155 | try: 156 | writer.close() 157 | except: 158 | pass 159 | --------------------------------------------------------------------------------
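Usage note: the commands below are a minimal, hypothetical sketch of how the `train.py` flags defined above can be combined for a short training run followed by an evaluation run. It assumes the repository root is the working directory and that the `local/models/` directory used by `--saveas` and `--model` already exists (it is referenced in `.gitignore` but is not created by the scripts).

```bash
# Train for 100 cycles with the retrace loss and save the resulting actor/critic networks
python train.py --log=$(date +%Y-%m-%d)_retrace \
                --num_train_cycles=100 \
                --loss=retrace \
                --non_linear=elu \
                --saveas=retrace_run

# Reload the saved networks and watch the main-task policy fly the lander (test mode)
python train.py --model=retrace_run
```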