├── .gitignore ├── README.md ├── docs ├── critic_net.png └── policy_net.png ├── local └── template.sh ├── model.py ├── networks.py ├── requirements.txt ├── tasks.py └── train.py /.gitignore: -------------------------------------------------------------------------------- 1 | # =========== # 2 | # Local Files # 3 | # =========== # 4 | local/logs/* 5 | local/models/* 6 | local/* 7 | !local/template.sh 8 | 9 | # =========== # 10 | # Notebooks # 11 | # =========== # 12 | .idea* 13 | __pycache__* 14 | .ipynb_checkpoints* 15 | 16 | # ============ # 17 | # OS generated # 18 | # ============ # 19 | .DS_Store* 20 | ._* 21 | .Spotlight-V100 22 | .Trashes 23 | Icon? 24 | ehthumbs.db 25 | [Tt]humbs.db 26 | [Dd]esktop.ini 27 | Corridor/Library/ShaderCache/ 28 | Corridor/Library/metadata/ -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ** NO LONGER MAINTAINED, USE AT YOUR OWN RISK ** 2 | 3 | # PySACX 4 | 5 | This repo contains a PyTorch implementation of the SAC-X RL algorithm [1]. It uses the Lunar Lander v2 6 | environment from OpenAI Gym. The SAC-X algorithm enables learning of complex behaviors from scratch 7 | in the presence of multiple sparse reward signals. 8 | 9 | ## Theory 10 | 11 | In addition to a main task reward, we define a series of auxiliary rewards. An important assumption is that 12 | each auxiliary reward can be evaluated at any state-action pair. The rewards are defined as follows: 13 | 14 | *Auxiliary Tasks/Rewards* 15 | - Touch. Maximize the number of legs touching the ground 16 | - Hover Planar. Minimize the planar movement of the lander craft 17 | - Hover Angular. Minimize the rotational movement of the lander craft 18 | - Upright. Minimize the angle of the lander craft 19 | - Goal Distance. Minimize distance between lander craft and pad 20 | 21 | *Main Task/Reward* 22 | - Did the lander land successfully (sparse reward based on landing success) 23 | 24 | Each of these tasks (intentions in the paper) has a specific model head within the neural nets used 25 | to estimate the actor and critic functions. When executing a trajectory during training, the task (and 26 | subsequently the model head within the actor) is switched between the different available options. 27 | This switching can either be done randomly (SAC-U) or it can be learned (SAC-Q). 28 | 29 | The pictures below show the network architectures for the actor and critic functions. Note the _N_ 30 | possible heads for _N_ possible tasks (intentions in the paper) [1]. 31 | 32 | ![alt text](docs/critic_net.png) ![alt text](docs/policy_net.png) 33 | 34 | *Learning* 35 | 36 | Learning the actor (policy function) is done off-policy using a gradient-based approach. Gradients are 37 | backpropagated through task-specific versions of the actor by using the task-specific versions of the 38 | critic (Q function). Importantly though, the trajectory (collection of state-action pairs) need not 39 | have been collected using the same task-specific actor, allowing learning from data generated by all other actors. 40 | The actor policy gradient is computed using the reparameterization trick (code in `model.py`). 41 | 42 | Learning the critic (Q function) is similarly done off-policy. We sample trajectories from a buffer 43 | collected with target actors (actor policies frozen at a particular learning iteration).
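For reference, the per-task retrace target that each task-specific Q function is regressed toward can be sketched as follows. This is a schematic form following the retrace formulation referenced in [1]; the implementation in `model.py` computes a simplified variant of the importance weights and temporal-difference term:

$$Q^{ret}(s_t, a_t; \mathcal{T}) = Q(s_t, a_t; \mathcal{T}) + \sum_{j \ge t} \gamma^{\,j-t} \Big( \prod_{k=t+1}^{j} c_k \Big) \Big[ r_{\mathcal{T}}(s_j, a_j) + \gamma \, \mathbb{E}_{a' \sim \pi_{\mathcal{T}}} Q(s_{j+1}, a'; \mathcal{T}) - Q(s_j, a_j; \mathcal{T}) \Big], \qquad c_k = \min\!\Big(1, \frac{\pi_{\mathcal{T}}(a_k \mid s_k)}{b(a_k \mid s_k)}\Big)$$

where $\mathcal{T}$ indexes the task (intention), $r_{\mathcal{T}}$ is that task's reward, and $b$ is the behaviour policy that actually generated the trajectory.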
The critic 44 | policy gradient is computed using the retrace method (code in `model.py`) 45 | 46 | ## Instructions 47 | 48 | - Use the `local/template.sh` script to train lots of model variations. Or use `train.py` to train an agent directly. 49 | 50 | ## Requirements 51 | 52 | - Python 3.6 53 | - [PyTorch](http://pytorch.org/) 0.3.0.post4 54 | - [OpenAI Gym](https://gym.openai.com/) 55 | - [tensorboardX](https://github.com/lanpa/tensorboard-pytorch/tree/master/tensorboardX) 56 | 57 | ## Sources 58 | 59 | [1] [Learning by Playing – Solving Sparse Reward Tasks from Scratch](https://arxiv.org/abs/1802.10567). 60 | -------------------------------------------------------------------------------- /docs/critic_net.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hu-po/pySACQ/43ee4f457ddd8e444a01f19fc9658986bda73de0/docs/critic_net.png -------------------------------------------------------------------------------- /docs/policy_net.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hu-po/pySACQ/43ee4f457ddd8e444a01f19fc9658986bda73de0/docs/policy_net.png -------------------------------------------------------------------------------- /local/template.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | 3 | # Navigate to proper directory and activate the conda environment 4 | DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )" 5 | cd $DIR/.. 6 | source activate gym 7 | 8 | # Log directory is based on run date 9 | DATE=$(date +%Y-%m-%d) 10 | 11 | # Run several different training runs, storing in different log locations 12 | NAME="experiment_name" 13 | python train.py --log=$DATE$NAME \ 14 | --num_train_cycles=1 \ 15 | --buffer_size=1 \ 16 | --num_trajectories=1 \ 17 | --num_learning_iterations=1 \ 18 | --episode_batch_size=1 \ 19 | --batch_norm \ 20 | --loss=discounted_rewards \ 21 | --non_linear=relu -------------------------------------------------------------------------------- /model.py: -------------------------------------------------------------------------------- 1 | from collections import namedtuple 2 | import random 3 | import torch 4 | import numpy as np 5 | 6 | # Named tuple for a single step within a trajectory 7 | Step = namedtuple('Step', ['state', 'action', 'log_prob', 'reward']) 8 | 9 | # Global step counters 10 | ACT_STEP = 0 11 | LEARN_STEP = 0 12 | 13 | 14 | def act(actor, env, task, B, num_trajectories=10, task_period=30, writer=None): 15 | """ 16 | Performs actions in the environment collecting reward/experience. 
17 | This follows Algorithm 3 in [1] 18 | :param actor: (Actor) actor network object 19 | :param env: (Environment) OpenAI Gym Environment object 20 | :param task: (Task) task object 21 | :param B: (list) replay buffer containing trajectories 22 | :param num_trajectories: (int) number of trajectories to collect at a time 23 | :param task_period: (int) number of steps in a single task period 24 | :param writer: (SummaryWriter) writer object for logging 25 | :return: None 26 | """ 27 | global ACT_STEP 28 | for trajectory_idx in range(num_trajectories): 29 | print('Acting: trajectory %s of %s' % (trajectory_idx + 1, num_trajectories)) 30 | # Reset environment and trajectory specific parameters 31 | trajectory = [] # collection of state, action, task pairs 32 | task.reset() # h in paper 33 | obs = env.reset() 34 | done = False 35 | num_steps = 0 36 | # Roll out 37 | while not done: 38 | # Sample a new task using the scheduler 39 | if num_steps % task_period == 0: 40 | task.sample() 41 | # Get the action from current actor policy 42 | actor.eval() 43 | action, log_prob = actor.predict(np.expand_dims(obs, axis=0), task.current_task, log_prob=True) 44 | # Execute action and collect rewards for each task 45 | obs, gym_reward, done, _ = env.step(action[0]) 46 | # # Modify the main task reward (the huge -100 and 100 values cause instability) 47 | # gym_reward /= 100.0 48 | # Reward is a vector of the reward for each task 49 | reward = task.reward(obs, gym_reward) 50 | if writer: 51 | for i, r in enumerate(reward): 52 | writer.add_scalar('train/reward/%s' % i, r, ACT_STEP) 53 | # group information into a step and add to current trajectory 54 | trajectory.append(Step(obs, action[0], log_prob[0], reward)) 55 | num_steps += 1 # increment step counter 56 | ACT_STEP += 1 57 | # Add trajectory to replay buffer 58 | B.append(trajectory) 59 | 60 | 61 | def _loss_policy_gradient(trajectory, task, actor, gamma=0.95): 62 | """ 63 | Calculates actor and critic losses for a given trajectory. 
Uses a simple REINFORCE-style policy gradient (log-probabilities weighted by discounted returns), so no critic is needed and only an actor loss is returned. 64 | :param trajectory: (list of Steps) trajectory or episode 65 | :param task: (Task) task object 66 | :param actor: (Actor) actor network object 67 | :param gamma: (float) discount factor 68 | :return: (float) actor loss 69 | """ 70 | # Extract information out of trajectory 71 | num_steps = len(trajectory) 72 | states = torch.FloatTensor([step.state for step in trajectory]) 73 | rewards = torch.FloatTensor([step.reward for step in trajectory]) 74 | # Create an intention (task) mask for all possible intentions 75 | task_mask = np.repeat(np.arange(0, task.num_tasks), num_steps) 76 | imask_task = torch.LongTensor(task_mask) 77 | states = states.repeat(task.num_tasks, 1) 78 | # actions (for each task) for every state-action pair in trajectory 79 | task_actions, task_log_prob = actor.forward(states, imask_task, log_prob=True) 80 | # Calculate discounted cumulative rewards 81 | discounted_cumulative_rewards = torch.zeros_like(rewards) 82 | dcr = torch.zeros((1, task.num_tasks)) 83 | for j in reversed(range(num_steps)): 84 | dcr = gamma * dcr + rewards[j, :].unsqueeze(0) 85 | discounted_cumulative_rewards[j, :] = dcr 86 | discounted_cumulative_rewards = discounted_cumulative_rewards.repeat(task.num_tasks, 1) 87 | # Create intention mask 88 | one_hot_mask = np.zeros((states.shape[0], task.num_tasks)) 89 | one_hot_mask[np.arange(states.shape[0]), imask_task.numpy()] = 1 90 | mask_tensor = torch.FloatTensor(one_hot_mask) 91 | # Multiply by the intention mask and sum in the final dimension to get the right output shape 92 | discounted_cumulative_rewards = torch.autograd.Variable((discounted_cumulative_rewards * mask_tensor).sum(dim=1), 93 | requires_grad=False) 94 | 95 | # Actor loss is log-prob weighted sum of discounted returns (for each task) given states from trajectory 96 | actor_loss = - torch.sum(discounted_cumulative_rewards * task_log_prob) 97 | actor_loss /= len(trajectory) # Divide by the trajectory length so longer trajectories are not weighted more heavily 98 | return actor_loss 99 | 100 |
101 | def _loss_discounted_rewards(trajectory, task, actor, critic, gamma=0.95): 102 | """ 103 | Calculates actor and critic losses for a given trajectory. Uses a simpler discounted cumulative reward target for the critic in place of the retrace target. 104 | :param trajectory: (list of Steps) trajectory or episode 105 | :param task: (Task) task object 106 | :param actor: (Actor) actor network object 107 | :param critic: (Critic) critic network object 108 | :param gamma: (float) discount factor 109 | :return: (float, float) actor and critic loss 110 | """ 111 | # Extract information out of trajectory 112 | num_steps = len(trajectory) 113 | states = torch.FloatTensor([step.state for step in trajectory]) 114 | rewards = torch.FloatTensor([step.reward for step in trajectory]) 115 | # Create an intention (task) mask for all possible intentions 116 | task_mask = np.repeat(np.arange(0, task.num_tasks), num_steps) 117 | imask_task = torch.LongTensor(task_mask) 118 | states = states.repeat(task.num_tasks, 1) 119 | # actions (for each task) for every state-action pair in trajectory 120 | task_actions, task_log_prob = actor.forward(states, imask_task, log_prob=True) 121 | # Q-values (for each task) for every state and task-action pair in trajectory 122 | critic_input = torch.cat([task_actions.data.float().unsqueeze(1), states], dim=1) 123 | task_q = critic.forward(critic_input, imask_task) 124 | # Actor loss is log-prob weighted sum of Q values (for each task) given states from trajectory 125 | actor_loss = - torch.sum(torch.autograd.Variable(task_q.data, requires_grad=False).squeeze(1) * task_log_prob) 126 | actor_loss /= len(trajectory) # Divide by the trajectory length so longer trajectories are not weighted more heavily 127 | # Calculate discounted cumulative rewards 128 | discounted_cumulative_rewards = torch.zeros_like(rewards) 129 | dcr = torch.zeros((1, task.num_tasks)) 130 | for j in reversed(range(num_steps)): 131 | dcr = gamma * dcr + rewards[j, :].unsqueeze(0) 132 | discounted_cumulative_rewards[j, :] = dcr 133 | discounted_cumulative_rewards = discounted_cumulative_rewards.repeat(task.num_tasks, 1) 134 | # Create intention mask 135 | one_hot_mask = np.zeros((states.shape[0], task.num_tasks)) 136 | one_hot_mask[np.arange(states.shape[0]), imask_task.numpy()] = 1 137 | mask_tensor = torch.FloatTensor(one_hot_mask) 138 | # Multiply by the intention mask and sum in the final dimension to get the right output shape 139 | discounted_cumulative_rewards = torch.autograd.Variable((discounted_cumulative_rewards * mask_tensor).sum(dim=1), 140 | requires_grad=False) 141 | # Use Huber Loss for critic 142 | critic_loss = torch.nn.SmoothL1Loss()(task_q, discounted_cumulative_rewards) 143 | return actor_loss, critic_loss 144 | 145 | 146 | def _loss_retrace(trajectory, task, actor, critic, gamma=0.95): 147 | """ 148 | Calculates actor and critic losses for a given trajectory.
Following equations in [1] 149 | :param trajectory: (list of Steps) trajectory or episode 150 | :param task: (Task) task object 151 | :param actor: (Actor) actor network object 152 | :param critic: (Critic) critic network object 153 | :param gamma: (float) discount factor 154 | :return: (float, float) actor and critic loss 155 | """ 156 | # Extract information out of trajectory 157 | num_steps = len(trajectory) 158 | states = torch.FloatTensor([step.state for step in trajectory]) 159 | rewards = torch.FloatTensor([step.reward for step in trajectory]) 160 | actions = torch.FloatTensor([step.action for step in trajectory]).unsqueeze(1) 161 | log_probs = torch.FloatTensor([step.log_prob for step in trajectory]).unsqueeze(1) 162 | # Create an intention (task) mask for all possible intentions 163 | task_mask = np.repeat(np.arange(0, task.num_tasks), num_steps) 164 | imask_task = torch.LongTensor(task_mask) 165 | states = states.repeat(task.num_tasks, 1) 166 | actions = actions.repeat(task.num_tasks, 1) 167 | # actions (for each task) for every state action pair in trajectory 168 | task_actions, task_log_prob = actor.forward(states, imask_task, log_prob=True) 169 | # Q-values (for each task) for every state and task-action pair in trajectory 170 | critic_input = torch.cat([task_actions.data.float().unsqueeze(1), states], dim=1) 171 | task_q = critic.forward(critic_input, imask_task) 172 | # Q-values (for each task) for every state and action pair in trajectory 173 | critic_input = torch.cat([actions, states], dim=1) 174 | traj_q = critic.predict(critic_input, imask_task) 175 | # Actor loss is log-prob weighted sum of Q values (for each task) given states from trajectory 176 | actor_loss = - torch.sum(torch.autograd.Variable(task_q.data, requires_grad=False).squeeze(1) * task_log_prob) 177 | actor_loss /= len(trajectory) # Divide by the number of runs to prevent trajectory length from mattering 178 | # Calculation of retrace Q 179 | q_ret = torch.zeros_like(task_q.data) 180 | for task_id in range(task.num_tasks): 181 | start = task_id * num_steps 182 | for i in range(num_steps): 183 | q_ret_i = 0 184 | for j in range(i, num_steps): 185 | # Discount factor 186 | discount = gamma ** (j - i) 187 | # Importance weights 188 | cj = 1.0 189 | for k in range(i, j): 190 | ck = min(abs(task_log_prob.data[start + k] / float(log_probs[k])), 1.0) 191 | cj *= ck 192 | # Difference between the two q values 193 | del_q = task_q.data[start + i] - traj_q[start + j] 194 | # Retrace Q value is sum of discounted weighted rewards 195 | q_ret_i += discount * cj * (rewards[j, task_id] + del_q) 196 | # Append retrace Q value to float tensor using index_fill 197 | q_ret.index_fill_(0, torch.LongTensor([start + i]), q_ret_i[0]) 198 | # Critic loss uses retrace Q 199 | # critic_loss = torch.sum((task_q - torch.autograd.Variable(q_ret, requires_grad=False)) ** 2) 200 | # critic_loss /= len(trajectory) # Divide by the number of runs to prevent trajectory length from mattering 201 | # Use Huber Loss for critic 202 | critic_loss = torch.nn.SmoothL1Loss()(task_q, torch.autograd.Variable(q_ret, requires_grad=False)) 203 | return actor_loss, critic_loss 204 | 205 | 206 | def learn(actor, critic, task, B, num_learning_iterations=10, episode_batch_size=10, lr=0.0002, loss='retrace', 207 | writer=None): 208 | """ 209 | Pushes back gradients from the replay buffer, updating the actor and critic. 
210 | This follows Algorithm 2 in [1] 211 | :param actor: (Actor) actor network object 212 | :param critic: (Critic) critic network object 213 | :param task: (Task) task object 214 | :param B: (list) replay buffer containing trajectories 215 | :param num_learning_iterations: (int) number of learning iterations per function call 216 | :param episode_batch_size: (int) number of trajectories in a batch (one gradient push) 217 | :param lr: (float) learning rate 218 | :param writer: (SummaryWriter) writer object for logging 219 | :return: None 220 | """ 221 | global LEARN_STEP 222 | for learn_idx in range(num_learning_iterations): 223 | print('Learning: trajectory %s of %s' % (learn_idx + 1, num_learning_iterations)) 224 | # optimizers for critic and actor 225 | actor_opt = torch.optim.Adam(actor.parameters(), lr) 226 | actor_opt.zero_grad() 227 | actor.train() 228 | if loss not in ['policy_gradient']: 229 | critic_opt = torch.optim.Adam(critic.parameters(), lr) 230 | critic_opt.zero_grad() 231 | critic.train() 232 | for batch_idx in range(episode_batch_size): 233 | # Sample a random trajectory from the replay buffer 234 | trajectory = random.choice(B) 235 | # Compute losses for critic and actor 236 | if loss == 'discounted_rewards': 237 | actor_loss, critic_loss = _loss_discounted_rewards(trajectory, task, actor, critic) 238 | elif loss == 'retrace': 239 | actor_loss, critic_loss = _loss_retrace(trajectory, task, actor, critic) 240 | elif loss == 'policy_gradient': 241 | actor_loss = _loss_policy_gradient(trajectory, task, actor) 242 | else: 243 | print('Loss not found') 244 | return 245 | if writer: 246 | writer.add_scalar('train/loss/actor', actor_loss.data[0], LEARN_STEP) 247 | if loss not in ['policy_gradient']: 248 | writer.add_scalar('train/loss/critic', critic_loss.data[0], LEARN_STEP) 249 | # compute gradients 250 | actor_loss.backward() 251 | if loss not in ['policy_gradient']: 252 | critic_loss.backward() 253 | LEARN_STEP += 1 254 | # Push back the accumulated gradients and update the networks 255 | actor_opt.step() 256 | if loss not in ['policy_gradient']: 257 | critic_opt.step() 258 | -------------------------------------------------------------------------------- /networks.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import numpy as np 3 | 4 | 5 | class IntentionBase(torch.nn.Module): 6 | """Generic class for a single intention head (used within actor/critic networks)""" 7 | 8 | def __init__(self, input_size, hidden_size, output_size, non_linear, final_non_linear, use_gpu=True): 9 | super(IntentionBase, self).__init__() 10 | self.non_linear = non_linear 11 | self.final_non_linear = final_non_linear 12 | self.use_gpu = use_gpu 13 | 14 | # Build the network 15 | self.layer1 = torch.nn.Linear(input_size, hidden_size) 16 | self.final_layer = torch.nn.Linear(hidden_size, output_size) 17 | self.init_weights() 18 | 19 | def init_weights(self): 20 | # Initialize the other layers with xavier (still constant 0 bias) 21 | torch.nn.init.xavier_uniform(self.layer1.weight.data) 22 | torch.nn.init.constant(self.layer1.bias.data, 0) 23 | torch.nn.init.xavier_uniform(self.final_layer.weight.data) 24 | torch.nn.init.constant(self.final_layer.bias.data, 0) 25 | 26 | def forward(self, x): 27 | x = self.non_linear(self.layer1(x)) 28 | x = self.final_non_linear(self.final_layer(x)) 29 | return x 30 | 31 | 32 | class IntentionCritic(IntentionBase): 33 | """Class for a single Intention head within the Q-function (or critic) network""" 34 | 35 | 
def __init__(self, 36 | input_size, 37 | hidden_size, 38 | output_size, 39 | non_linear=torch.nn.ELU(), 40 | use_gpu=True): 41 | final_non_linear = non_linear 42 | super(IntentionCritic, self).__init__(input_size, hidden_size, output_size, non_linear, final_non_linear, 43 | use_gpu) 44 | 45 | 46 | class IntentionActor(IntentionBase): 47 | """Class for a single Intention head within the policy (or actor) network""" 48 | 49 | def __init__(self, 50 | input_size, 51 | hidden_size, 52 | output_size, 53 | non_linear=torch.nn.ELU(), 54 | final_non_linear=torch.nn.Softmax(dim=1), 55 | use_gpu=True): 56 | super(IntentionActor, self).__init__(input_size, hidden_size, output_size, non_linear, final_non_linear, 57 | use_gpu) 58 | 59 | 60 | class SQXNet(torch.nn.Module): 61 | """Generic class for actor and critic networks. The arch is very similar.""" 62 | 63 | def __init__(self, 64 | state_dim, 65 | base_hidden_size, 66 | num_intentions, 67 | head_input_size, 68 | head_hidden_size, 69 | head_output_size, 70 | non_linear, 71 | net_type, 72 | batch_norm=False, 73 | use_gpu=True): 74 | super(SQXNet, self).__init__() 75 | self.non_linear = non_linear 76 | self.batch_norm = batch_norm 77 | self.use_gpu = use_gpu 78 | 79 | # Build the base of the network 80 | self.layer1 = torch.nn.Linear(state_dim, base_hidden_size) 81 | self.layer2 = torch.nn.Linear(base_hidden_size, head_input_size) 82 | if self.batch_norm: 83 | self.bn1 = torch.nn.BatchNorm1d(base_hidden_size) 84 | self.init_weights() 85 | 86 | # Create the many intention nets heads 87 | if net_type == 'actor': 88 | intention_net_type = IntentionActor 89 | elif net_type == 'critic': 90 | intention_net_type = IntentionCritic 91 | else: 92 | raise Exception('Invalid net type for SQXNet') 93 | self.intention_nets = [] 94 | for _ in range(num_intentions): 95 | self.intention_nets.append(intention_net_type(input_size=head_input_size, 96 | hidden_size=head_hidden_size, 97 | output_size=head_output_size, 98 | use_gpu=use_gpu, 99 | non_linear=non_linear)) 100 | 101 | def init_weights(self): 102 | # Initialize the other layers with xavier (still constant 0 bias) 103 | torch.nn.init.xavier_uniform(self.layer1.weight.data) 104 | torch.nn.init.constant(self.layer1.bias.data, 0) 105 | torch.nn.init.xavier_uniform(self.layer2.weight.data) 106 | torch.nn.init.constant(self.layer2.bias.data, 0) 107 | 108 | def forward(self, x, intention): 109 | # Feed the input through the base layers of the model 110 | x = torch.autograd.Variable(torch.Tensor(x)) 111 | x = self.non_linear(self.layer1(x)) 112 | if self.batch_norm: 113 | x = self.bn1(x) 114 | x = self.non_linear(self.layer2(x)) 115 | if isinstance(intention, int): # single intention head 116 | x = self.intention_nets[intention].forward(x) 117 | else: 118 | # Create intention mask 119 | one_hot_mask = np.zeros((x.shape[0], len(self.intention_nets))) 120 | one_hot_mask[np.arange(x.shape[0]), intention.numpy()] = 1 121 | mask_tensor = torch.autograd.Variable(torch.FloatTensor(one_hot_mask).unsqueeze(1), requires_grad=False) 122 | # Feed forward through all the intention heads and concatenate on new dimension 123 | intention_out = torch.cat(list(head.forward(x).unsqueeze(2) for head in self.intention_nets), dim=2) 124 | # Multiply by the intention mask and sum in the final dimension to get the right output shape 125 | x = (intention_out * mask_tensor).sum(dim=2) 126 | return x 127 | 128 | def predict(self, x, intention): 129 | y = self.forward(x, intention).cpu().data 130 | return y 131 | 132 | 133 | class Actor(SQXNet): 
134 | """Class for policy (or actor) network""" 135 | 136 | def __init__(self, 137 | state_dim=8, 138 | base_hidden_size=32, 139 | num_intentions=6, 140 | head_input_size=16, 141 | head_hidden_size=8, 142 | head_output_size=4, 143 | non_linear=torch.nn.ELU(), 144 | net_type='actor', 145 | batch_norm=False, 146 | use_gpu=True): 147 | super(Actor, self).__init__(state_dim, 148 | base_hidden_size, 149 | num_intentions, 150 | head_input_size, 151 | head_hidden_size, 152 | head_output_size, 153 | non_linear, 154 | net_type, 155 | batch_norm, 156 | use_gpu) 157 | 158 | def forward(self, x, intention, log_prob=False): 159 | x = super().forward(x, intention) 160 | # Intention head determines parameters of Categorical distribution 161 | dist = torch.distributions.Categorical(x) 162 | action = dist.sample() 163 | if log_prob: 164 | log_prob = dist.log_prob(action) 165 | return action, log_prob 166 | return action 167 | 168 | def predict(self, x, intention, log_prob=False): 169 | if log_prob: 170 | action, log_prob = self.forward(x, intention, log_prob=True) 171 | return action.cpu().data, log_prob.cpu().data 172 | else: 173 | action = self.forward(x, intention).cpu().data 174 | return action 175 | return None 176 | 177 | 178 | class Critic(SQXNet): 179 | """Class for Q-function (or critic) network""" 180 | 181 | def __init__(self, 182 | num_intentions=6, 183 | state_dim=9, 184 | base_hidden_size=64, 185 | head_input_size=64, 186 | head_hidden_size=32, 187 | head_output_size=1, 188 | non_linear=torch.nn.ELU(), 189 | net_type='critic', 190 | batch_norm=False, 191 | use_gpu=True): 192 | super(Critic, self).__init__(state_dim, 193 | base_hidden_size, 194 | num_intentions, 195 | head_input_size, 196 | head_hidden_size, 197 | head_output_size, 198 | non_linear, 199 | net_type, 200 | batch_norm, 201 | use_gpu) 202 | 203 | 204 | if __name__ == '__main__': 205 | print('Run this file directly to debug') 206 | 207 | actor = Actor() 208 | critic = Critic() 209 | 210 | # Carry out a step on the environment to test out forward functions 211 | import gym 212 | import random 213 | 214 | env = gym.make('LunarLander-v2') 215 | obs = env.reset() 216 | task_idx = random.randint(0, 6) 217 | 218 | # Get the action from current actor policy 219 | action = actor.predict(obs, task_idx) 220 | _, _, _, _ = env.step(action) 221 | 222 | print('Got to end sucessfully! (Though this only means there are no major bugs..)') 223 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | gym==0.10.3 2 | tensorboardX==1.1 3 | box2d-py==2.3.1 -------------------------------------------------------------------------------- /tasks.py: -------------------------------------------------------------------------------- 1 | import random 2 | 3 | 4 | # Observation space, according to source: 5 | # state = [ 6 | # (pos.x - VIEWPORT_W / SCALE / 2) / (VIEWPORT_W / SCALE / 2), 7 | # (pos.y - (self.helipad_y + LEG_DOWN / SCALE)) / (VIEWPORT_W / SCALE / 2), 8 | # vel.x * (VIEWPORT_W / SCALE / 2) / FPS, 9 | # vel.y * (VIEWPORT_H / SCALE / 2) / FPS, 10 | # self.lander.angle, 11 | # 20.0 * self.lander.angularVelocity / FPS, 12 | # 1.0 if self.legs[0].ground_contact else 0.0, 13 | # 1.0 if self.legs[1].ground_contact else 0.0 14 | # ] 15 | 16 | # Auxiliary Rewards: 17 | # Touch. Maximizing number of legs touching ground 18 | # Hover Planar. Minimize the planar movement of the lander craft 19 | # Hover Angular. 
Minimize the rotational movement of ht lander craft 20 | # Upright. Minimize the angle of the lander craft 21 | # Goal Distance. Minimize distance between lander craft and pad 22 | # 23 | # Extrinsic Rewards: 24 | # Success: Did the lander land successfully (1 or 0) 25 | 26 | def touch(state): 27 | """ 28 | Auxiliary reward for touching lander legs on the ground 29 | :param state: (list) state of lunar lander 30 | :return: (float) reward 31 | """ 32 | left_contact = state[6] # 1.0 if self.legs[0].ground_contact else 0.0 33 | right_contact = state[7] # 1.0 if self.legs[1].ground_contact else 0.0 34 | reward = left_contact + right_contact 35 | return reward 36 | 37 | 38 | def hover_planar(state): 39 | """ 40 | Auxiliary reward for hovering the lander (minimal planar movement) 41 | :param state: (list) state of lunar lander 42 | :return: (float) reward 43 | """ 44 | x_vel = state[2] # vel.x * (VIEWPORT_W / SCALE / 2) / FPS 45 | y_vel = state[3] # vel.y * (VIEWPORT_H / SCALE / 2) / FPS 46 | reward = 2.0 - (abs(x_vel) + abs(y_vel)) 47 | return reward 48 | 49 | 50 | def hover_angular(state): 51 | """ 52 | Auxiliary reward for hovering the lander (minimal angular movement) 53 | :param state: (list) state of lunar lander 54 | :return: (float) reward 55 | """ 56 | ang_vel = state[5] # 20.0 * self.lander.angularVelocity / FPS 57 | reward = 2.0 - abs(ang_vel) 58 | return reward 59 | 60 | 61 | def upright(state): 62 | """ 63 | Auxiliary reward for keeping the lander upright 64 | :param state: (list) state of lunar lander 65 | :return: (float) reward 66 | """ 67 | angle = state[4] # self.lander.angle 68 | reward = 2.0 - abs(angle) 69 | return reward 70 | 71 | 72 | def goal_distance(state): 73 | """ 74 | Auxiliary reward for distance from lander to goal 75 | :param state: (list) state of lunar lander 76 | :return: (float) reward 77 | """ 78 | x_pos = state[2] # (pos.x - VIEWPORT_W / SCALE / 2) / (VIEWPORT_W / SCALE / 2) 79 | y_pos = state[3] # (pos.y - (self.helipad_y + LEG_DOWN / SCALE)) / (VIEWPORT_W / SCALE / 2) 80 | reward = 2.0 - (abs(x_pos) + abs(y_pos)) 81 | return reward 82 | 83 | 84 | class TaskScheduler(object): 85 | """Class defines Scheduler for storing and picking tasks""" 86 | 87 | def __init__(self): 88 | self.aux_rewards = [touch, 89 | hover_planar, 90 | hover_angular, 91 | upright, 92 | goal_distance] 93 | 94 | # Number of tasks is number of auxiliary tasks plus the main task 95 | self.num_tasks = len(self.aux_rewards) + 1 96 | 97 | # Internal tracking variable for current task, and set of tasks 98 | self.current_task = 0 99 | self.current_set = set() 100 | 101 | def reset(self): 102 | self.current_set = set() 103 | 104 | def sample(self): 105 | self.current_task = random.randint(0, self.num_tasks-1) 106 | self.current_set.add(self.current_task) 107 | 108 | def reward(self, state, main_reward): 109 | reward_vector = [] 110 | for task_reward in self.aux_rewards: 111 | reward_vector.append(task_reward(state)) 112 | # Append main task reward 113 | reward_vector.append(main_reward) 114 | return reward_vector 115 | -------------------------------------------------------------------------------- /train.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | from pathlib import Path 3 | import sys 4 | import time 5 | import gym 6 | import torch 7 | import numpy as np 8 | from tensorboardX import SummaryWriter 9 | 10 | # Add local files to path 11 | root_dir = Path.cwd() 12 | sys.path.append(str(root_dir)) 13 | from networks import Actor, 
Critic 14 | from tasks import TaskScheduler 15 | from model import act, learn 16 | 17 | # Log and model saving parameters 18 | parser = argparse.ArgumentParser(description='Train Arguments') 19 | parser.add_argument('--log', type=str, default=None, help='Write tensorboard style logs to this folder [default: None]') 20 | parser.add_argument('--saveas', type=str, default=None, help='savename for model (Training) [default: None]') 21 | parser.add_argument('--model', type=str, default=None, help='savename for model (Evaluating) [default: None]') 22 | 23 | # Training parameters 24 | parser.add_argument('--num_train_cycles', type=int, default=1000, help='Number of training cycles [default: 1]') 25 | parser.add_argument('--num_trajectories', type=int, default=5, 26 | help='Number of trajectories collected per acting cycle [default: 5]') 27 | parser.add_argument('--num_learning_iterations', type=int, default=1, 28 | help='Number of learning iterations per learn cycle [default: 1]') 29 | parser.add_argument('--episode_batch_size', type=int, default=2, 30 | help='Number of trajectories per batch (gradient push) [default: 2]') 31 | parser.add_argument('--buffer_size', type=int, default=200, 32 | help='Number of trajectories in replay buffer [default: 200]') 33 | 34 | # Model parameters 35 | parser.add_argument('--non_linear', type=str, default='relu', help='Non-linearity in the nets [default: ReLU]') 36 | parser.add_argument('--batch_norm', dest='batch_norm', default=False, action='store_true', 37 | help='Batch norm applied to input layers [default: False]') 38 | parser.add_argument('--loss', type=str, default='retrace', help='Type of loss used when training [default: retrace]') 39 | 40 | # Global step counters 41 | TEST_STEP = 0 42 | 43 | 44 | def run(actor, env, min_rate=None, writer=None, render=False): 45 | """ 46 | Runs the actor policy on the environment, rendering it. This does not store anything 47 | and is only used for visualization. 48 | :param actor: (Actor) actor network object 49 | :param env: (Environment) OpenAI Gym Environment object 50 | :param min_rate: (float) minimum framerate 51 | :param writer: (SummaryWriter) writer object for logging 52 | :param render: (Bool) toggle for rendering to window 53 | :return: None 54 | """ 55 | global TEST_STEP 56 | obs = env.reset() 57 | done = False 58 | # Counter variables for number of steps and total episode time 59 | epoch_tic = time.clock() 60 | num_steps = 0 61 | reward = 0 62 | while not done: 63 | step_tic = time.clock() 64 | if render: 65 | env.render() 66 | # Use the previous observation to get an action from policy 67 | actor.eval() 68 | action = actor.predict(np.expand_dims(obs, axis=0), -1) # Last intention is main task 69 | # Step the environment and push outputs to policy 70 | obs, reward, done, _ = env.step(action[0]) 71 | if writer: 72 | writer.add_scalar('test/reward', reward, TEST_STEP) 73 | step_toc = time.clock() 74 | step_time = step_toc - step_tic 75 | if min_rate and step_time < min_rate: # Sleep to ensure minimum rate 76 | time.sleep(min_rate - step_time) 77 | num_steps += 1 78 | TEST_STEP += 1 79 | # Total elapsed time in epoch 80 | epoch_toc = time.clock() 81 | epoch_time = epoch_toc - epoch_tic 82 | print('Episode complete (%s steps in %.2fsec), final reward %s ' % (num_steps, epoch_time, reward)) 83 | 84 | 85 | if __name__ == '__main__': 86 | 87 | # Parse and print out parameters 88 | args = parser.parse_args() 89 | print('Running Trainer. 
Parameters:') 90 | for attr, value in args.__dict__.items(): 91 | print('%s : %s' % (attr.upper(), value)) 92 | 93 | # Make sure we can use gpu 94 | use_gpu = torch.cuda.is_available() 95 | print('Gpu is enabled: %s' % use_gpu) 96 | 97 | # Replay buffer stores collected trajectories 98 | B = [] 99 | 100 | # Environment is the lunar lander from OpenAI gym 101 | env = gym.make('LunarLander-v2') 102 | 103 | # task scheduler is defined in tasks.py 104 | task = TaskScheduler() 105 | 106 | # Write tensorboard logs to local logs folder 107 | writer = None 108 | if args.log: 109 | log_dir = root_dir / 'local' / 'logs' / args.log 110 | writer = SummaryWriter(log_dir=str(log_dir)) 111 | 112 | if args.model: # TEST MODE 113 | model_path = str(root_dir / 'local' / 'models' / args.model) 114 | print('Loading models from %s' % model_path) 115 | actor = torch.load(model_path + '_actor.pt') 116 | critic = torch.load(model_path + '_critic.pt') 117 | print('...done') 118 | 119 | run(actor, env, min_rate=0.05, writer=writer, render=True) 120 | 121 | else: # TRAIN MODE 122 | # Non-linearity is an argument 123 | non_linear = None 124 | if args.non_linear == 'relu': 125 | non_linear = torch.nn.ReLU() 126 | elif args.non_linear == 'elu': 127 | non_linear = torch.nn.ELU() 128 | 129 | # New actor and critic policies 130 | actor = Actor(use_gpu=use_gpu, non_linear=non_linear, batch_norm=args.batch_norm) 131 | critic = Critic(use_gpu=use_gpu, non_linear=non_linear, batch_norm=args.batch_norm) 132 | 133 | for i in range(args.num_train_cycles): 134 | print('Training cycle %s of %s' % (i, args.num_train_cycles)) 135 | act(actor, env, task, B, 136 | num_trajectories=args.num_trajectories, 137 | task_period=30, writer=writer) 138 | learn(actor, critic, task, B, 139 | num_learning_iterations=args.num_learning_iterations, 140 | episode_batch_size=args.episode_batch_size, 141 | lr=0.0002, writer=writer, loss=args.loss) 142 | run(actor, env, min_rate=0.05, writer=writer) 143 | # Remove early trajectories when buffer gets too large 144 | B = B[-args.buffer_size:] 145 | 146 | # Save the model to local directory 147 | if args.saveas is not None: 148 | save_path = str(root_dir / 'local' / 'models' / args.saveas) 149 | print('Saving models to %s' % save_path) 150 | torch.save(actor, save_path + '_actor.pt') 151 | torch.save(critic, save_path + '_critic.pt') 152 | print('...done') 153 | 154 | # Close writer 155 | try: 156 | writer.close() 157 | except: 158 | pass 159 | --------------------------------------------------------------------------------
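Usage note: the commands below are a minimal, hypothetical sketch of how the `train.py` flags defined above can be combined for a short training run followed by an evaluation run. It assumes the repository root is the working directory and that the `local/models/` directory used by `--saveas` and `--model` already exists (it is referenced in `.gitignore` but is not created by the scripts).

```bash
# Train for 100 cycles with the retrace loss and save the resulting actor/critic networks
python train.py --log=$(date +%Y-%m-%d)_retrace \
                --num_train_cycles=100 \
                --loss=retrace \
                --non_linear=elu \
                --saveas=retrace_run

# Reload the saved networks and watch the main-task policy fly the lander (test mode)
python train.py --model=retrace_run
```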