├── .gitignore ├── LICENSE ├── README.md ├── pytorch-soft-actor-critic ├── LICENSE ├── continous_grids.py ├── exploration_models.py ├── flow_helpers.py ├── flows.py ├── graphics.py ├── main.py ├── model.py ├── normalized_actions.py ├── plots │ └── plot_comet.py ├── replay_memory.py ├── sac.py ├── scripts │ ├── run_contgridworld_exp.sh │ └── run_contgridworld_gauss.sh ├── settings.json └── utils.py └── pytorch-vanilla-reinforce ├── README.md ├── main_reinforce.py └── reinforce_simple.py /.gitignore: -------------------------------------------------------------------------------- 1 | *.json 2 | *.rar 3 | *.tar 4 | *.zip 5 | *.pt 6 | *.swp 7 | *.png 8 | *.eps 9 | __pycache__/** 10 | .idea/** 11 | install/** 12 | *.pdf 13 | *.png 14 | *.xls 15 | .nfs* 16 | pytorch-soft-actor-critic/__pycache__/** 17 | pytorch-soft-actor-critic/ddpg_gridworld/__pycache__/** 18 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 Avishek (Joey) Bose 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Improving Exploration in SAC with Normalizing Flows Policies 2 | 3 | This codebase was used to generate the results documented in the paper "[Improving Exploration in Soft-Actor-Critic with Normalizing Flows Policies](arxiv_url_placeholder)". 4 | Patrick Nadeem Ward*12, Ariella Smofsky*12, Avishek Joey Bose12. INNF Workshop ICML 2019. 
5 | 6 | * * Equal contribution, 1 McGill University, 2 Mila 7 | * Correspondence to: 8 | * Patrick Nadeem Ward <[Github: NadeemWard](https://github.com/NadeemWard), patrick.ward@mail.mcgill.ca> 9 | * Ariella Smofsky <[Github: asmoog](https://github.com/asmoog), ariella.smofsky@mail.mcgill.ca> 10 | 11 | ## Requirements 12 | * [PyTorch](https://pytorch.org/) 13 | * [comet.ml](https://www.comet.ml/) 14 | 15 | ## Run Experiments 16 | Gaussian policy on Dense Gridworld environment with REINFORCE: 17 | ``` 18 | TODO 19 | ``` 20 | 21 | Gaussian policy on Sparse Gridworld environment with REINFORCE: 22 | ``` 23 | TODO 24 | ``` 25 | 26 | Gaussian policy on Dense Gridworld environment with reparametrization: 27 | ``` 28 | python main.py --namestr=G-S-DG-CG --make_cont_grid --batch_size=128 --replay_size=100000 --hidden_size=64 --num_steps=100000 --policy=Gaussian --smol --comet --dense_goals --silent 29 | ``` 30 | 31 | Gaussian policy on Sparse Gridworld environment with reparametrization: 32 | ``` 33 | python main.py --namestr=G-S-CG --make_cont_grid --batch_size=128 --replay_size=100000 --hidden_size=64 --num_steps=100000 --policy=Gaussian --smol --comet --silent 34 | ``` 35 | 36 | Normalizing Flow policy on Dense Gridworld environment: 37 | ``` 38 | TODO 39 | ``` 40 | 41 | Normalizing Flow policy on Sparse Gridworld environment: 42 | ``` 43 | TODO 44 | ``` 45 | 46 | To run an experiment with a different policy distribution, modify the `--policy` flag. 47 | 48 | ## References 49 | * Implementation of SAC based on [PyTorch SAC](https://github.com/pranz24/pytorch-soft-actor-critic). -------------------------------------------------------------------------------- /pytorch-soft-actor-critic/LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 Pranjal Tandon 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
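The GridWorld environment defined in pytorch-soft-actor-critic/continous_grids.py below follows the gym.Env interface (reset/step with a Box action space). A minimal random-rollout sketch, assuming the module and its dependencies (graphics.py, exploration_models.py, gym, matplotlib, seaborn) are importable from that directory; the dense_goals coordinates and the random-action loop here are illustrative only, not values or a driver used by the repo (the actual entry point is main.py, as shown in the README commands above):

```python
# Sketch: drive the GridWorld env defined in continous_grids.py below (not part of the repo).
import numpy as np
from continous_grids import GridWorld  # note: the module name is spelled "continous" in this repo

# dense_goals must be a list of (x, y) waypoints; step() iterates over it, so pass [] if unused.
# silent_mode=True skips the live Tk window from graphics.py.
env = GridWorld(num_rooms=0,
                dense_goals=[(40.0, 40.0), (60.0, 60.0)],  # illustrative waypoints
                silent_mode=True)

obs = env.reset()
done = False
total_reward = 0.0
while not done:
    # uniform random (dx, dy) inside the Box action limits, similar in spirit to env.sample('RandomWalk', ...)
    action = env.action_space.sample().astype(np.float64)
    obs, reward, done, info = env.step(action)
    total_reward += reward
print("episode return:", total_reward)  # terminates at the goal or after max_episode_len steps
```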
22 | -------------------------------------------------------------------------------- /pytorch-soft-actor-critic/continous_grids.py: -------------------------------------------------------------------------------- 1 | """ 2 | code from here: https://github.com/junhyukoh/value-prediction-network/blob/master/maze.py 3 | """ 4 | 5 | import copy 6 | import pandas as pd 7 | 8 | import seaborn as sns 9 | 10 | import gym.spaces 11 | import matplotlib.patches as patches 12 | import matplotlib.path as path 13 | import matplotlib.pyplot as plt 14 | import numpy as np 15 | from exploration_models import * 16 | from graphics import * 17 | from gym import spaces 18 | 19 | 20 | class GridWorld(gym.Env): 21 | """ 22 | empty grid world 23 | """ 24 | 25 | def __init__(self, 26 | num_rooms=0, 27 | start_position=(25.0, 25.0), 28 | goal_position=(75.0, 75.0), 29 | goal_reward=+100.0, 30 | dense_goals=None, 31 | dense_reward=+5, 32 | goal_radius=1.0, 33 | per_step_penalty=-0.01, 34 | max_episode_len=1000, 35 | grid_len=100, 36 | wall_breadth=1, 37 | door_breadth=5, 38 | action_limit_max=1.0, 39 | silent_mode=False): 40 | """ 41 | params: 42 | """ 43 | 44 | # num of rooms 45 | self.num_rooms = num_rooms 46 | self.silent_mode = silent_mode 47 | 48 | # grid size 49 | self.grid_len = float(grid_len) 50 | self.wall_breadth = float(wall_breadth) 51 | self.door_breadth = float(door_breadth) 52 | self.min_position = 0.0 53 | self.max_position = float(grid_len) 54 | 55 | # goal stats 56 | self.goal_position = np.array(goal_position) 57 | self.goal_radius = goal_radius 58 | self.start_position = np.array(start_position) 59 | 60 | # Dense reward stuff: 61 | self.dense_reward = dense_reward 62 | # List of dense goal coordinates 63 | self.dense_goals = dense_goals 64 | 65 | # rewards 66 | self.goal_reward = goal_reward 67 | self.per_step_penalty = per_step_penalty 68 | 69 | self.max_episode_len = max_episode_len 70 | 71 | # observation space 72 | self.low_state = np.array([self.min_position, self.min_position]) 73 | self.high_state = np.array([self.max_position, self.max_position]) 74 | 75 | # how much the agent can move in a step (dx,dy) 76 | self.min_action = np.array([-action_limit_max, -action_limit_max]) 77 | self.max_action = np.array([+action_limit_max, +action_limit_max]) 78 | 79 | self.observation_space = spaces.Box(low=self.low_state, high=self.high_state) 80 | self.action_space = spaces.Box(low=self.min_action, high=self.max_action) 81 | self.nb_actions = self.action_space.shape[-1] 82 | 83 | # add the walls here 84 | self.create_walls() 85 | self.scale = 5 86 | 87 | # This code enables live visualization of trajectories 88 | # Susan added these lines for visual purposes 89 | if not self.silent_mode: 90 | self.win1 = GraphWin("2DGrid", self.max_position * self.scale + 40, self.max_position * self.scale + 40) 91 | rectangle1 = Rectangle(Point(self.min_position * self.scale + 20, self.min_position * self.scale + 20), 92 | Point(self.max_position * self.scale + 20, self.max_position * self.scale + 20)) 93 | rectangle1.setOutline('red') 94 | rectangle1.draw(self.win1) 95 | 96 | if self.num_rooms > 0: 97 | wall1 = Rectangle(Point(self.min_position * self.scale + 20, 98 | self.max_position * self.scale / 2 + 20 - self.wall_breadth * self.scale), 99 | Point(self.max_position * self.scale / 2 + 20, 100 | self.max_position * self.scale / 2 + 20 + self.wall_breadth * self.scale)) 101 | wall1.draw(self.win1) 102 | wall1.setFill('aquamarine') 103 | 104 | wall2 = Rectangle(Point(self.max_position * self.scale / 2 + 20 - 
self.wall_breadth * self.scale, 105 | self.min_position * self.scale + 20), 106 | Point(self.max_position * self.scale / 2 + 20 + self.wall_breadth * self.scale, 107 | self.max_position * self.scale / 4 + 20 - self.door_breadth * self.scale)) 108 | wall2.draw(self.win1) 109 | wall2.setFill('aquamarine') 110 | 111 | wall3 = Rectangle(Point(self.max_position * self.scale / 2 + 20 - self.wall_breadth * self.scale, 112 | self.max_position * self.scale / 4 + 20 + self.door_breadth * self.scale), 113 | Point(self.max_position * self.scale / 2 + 20 + self.wall_breadth * self.scale, 114 | self.max_position * self.scale / 2 + 20 + self.wall_breadth * self.scale)) 115 | wall3.draw(self.win1) 116 | wall3.setFill('aquamarine') 117 | start_point = Circle(Point(start_position[0] * self.scale + 20, start_position[1] * self.scale + 20), 118 | goal_radius * self.scale) 119 | start_point.draw(self.win1) 120 | start_point.setFill('red') 121 | goal_point = Circle(Point(goal_position[0] * self.scale + 20, goal_position[1] * self.scale + 20), 122 | goal_radius * self.scale) 123 | goal_point.draw(self.win1) 124 | goal_point.setFill('green') 125 | 126 | # Drawing the dense goals: 127 | for idx, mini_goal in enumerate(self.dense_goals): 128 | mini_goal_point = Circle(Point(mini_goal[0] * self.scale + 20, mini_goal[1] * self.scale + 20), 129 | goal_radius * self.scale) 130 | mini_goal_point.draw(self.win1) 131 | mini_goal_point.setFill('blue') 132 | 133 | # self.win1.getMouse() 134 | 135 | self.seed() 136 | self.reset() 137 | 138 | def reset(self): 139 | self.state = copy.deepcopy(self.start_position) 140 | self.t = 0 141 | self.done = False 142 | 143 | return self._get_obs() 144 | 145 | def _get_obs(self): 146 | return copy.deepcopy(self.state) 147 | 148 | def step(self, a): 149 | """ 150 | take the action here 151 | """ 152 | 153 | # check if the action is valid 154 | assert self.action_space.contains(a) 155 | assert self.done is False 156 | 157 | # Susan added this line 158 | self.state_temp = copy.deepcopy(self.state) 159 | 160 | self.t += 1 161 | 162 | # check if collides, if it doesn't then update the state 163 | if self.num_rooms == 0 or not self.collides((self.state[0] + a[0], self.state[1] + a[1])): 164 | # move the agent and update the state 165 | self.state[0] += a[0] 166 | self.state[1] += a[1] 167 | 168 | # clip the state if out of bounds 169 | self.state[0] = np.clip(self.state[0], self.min_position, self.max_position) 170 | self.state[1] = np.clip(self.state[1], self.min_position, self.max_position) 171 | 172 | # the reward logic 173 | reward = self.per_step_penalty 174 | 175 | # Adding dense Rewards: 176 | for idx, mini_goal in enumerate(self.dense_goals): 177 | if np.linalg.norm(np.array(self.state) - np.array(mini_goal), 2) <= self.goal_radius: 178 | reward = self.dense_reward 179 | 180 | # if reached goal (within a radius of 1 unit) 181 | if np.linalg.norm(np.array(self.state) - np.array(self.goal_position), 2) <= self.goal_radius: 182 | # episode done 183 | self.done = True 184 | reward = self.goal_reward 185 | 186 | if self.t >= self.max_episode_len: 187 | self.done = True 188 | 189 | line = Line(Point(self.state_temp[0] * self.scale + 20, self.state_temp[1] * self.scale + 20), 190 | Point(self.state[0] * self.scale + 20, self.state[1] * self.scale + 20)) 191 | 192 | if not self.silent_mode: 193 | line.draw(self.win1) 194 | line.setOutline('black') 195 | # self.win1.getMouse() 196 | self.state_temp = self.state 197 | 198 | if self.silent_mode: 199 | return self._get_obs(), reward, self.done, 
None 200 | 201 | # return self.win1,self._get_obs(), reward, self.done, None 202 | return self._get_obs(), reward, self.done, None 203 | 204 | def sample(self, exploration, b_0, l_p, ou_noise, stddev): 205 | """ take a random sample """ 206 | if exploration == 'RandomWalk': 207 | return np.random.uniform(low=self.min_action[0], high=self.max_action[0], size=(2,)) 208 | elif exploration == 'PolyRL': 209 | return PolyNoise(L_p=float(l_p), b_0=float(b_0), action_dim=self.nb_actions, ou_noise=ou_noise, 210 | sigma=float(stddev)) 211 | else: 212 | raise Exception("The exploration method " + self.exploration + " is not defined!") 213 | 214 | def create_walls(self): 215 | """ 216 | create the walls here, the polygons 217 | """ 218 | self.walls = [] 219 | 220 | # codes for drawing the polygons in matplotlib 221 | codes = [path.Path.MOVETO, 222 | path.Path.LINETO, 223 | path.Path.LINETO, 224 | path.Path.LINETO, 225 | path.Path.CLOSEPOLY, 226 | ] 227 | 228 | if self.num_rooms == 0: 229 | # no walls required 230 | return 231 | elif self.num_rooms == 1: 232 | # create one room with one opening 233 | 234 | # a wall parallel to x-axis, at (0,grid_len/2), (grid_len/2,grid_len/2) 235 | self.walls.append(path.Path([(0, self.grid_len / 2.0 + self.wall_breadth), 236 | (0, self.grid_len / 2.0 - self.wall_breadth), 237 | (self.grid_len / 2.0, self.grid_len / 2.0 - self.wall_breadth), 238 | (self.grid_len / 2.0, self.grid_len / 2.0 + self.wall_breadth), 239 | (0, self.grid_len / 2.0 + self.wall_breadth) 240 | ], codes=codes)) 241 | 242 | # the top part of wall on (0,grid_len/2), parallel to y -axis containg 243 | self.walls.append(path.Path([(self.grid_len / 2.0 - self.wall_breadth, self.grid_len / 2.0), 244 | (self.grid_len / 2.0 - self.wall_breadth, 245 | self.grid_len / 4.0 + self.door_breadth), 246 | (self.grid_len / 2.0 + self.wall_breadth, 247 | self.grid_len / 4.0 + self.door_breadth), 248 | (self.grid_len / 2.0 + self.wall_breadth, self.grid_len / 2.0), 249 | (self.grid_len / 2.0 - self.wall_breadth, self.grid_len / 2.0), 250 | ], codes=codes)) 251 | 252 | # the bottom part of wall on (0,grid_len/2), parallel to y -axis containg 253 | self.walls.append( 254 | path.Path([(self.grid_len / 2.0 - self.wall_breadth, self.grid_len / 4.0 - self.door_breadth), 255 | (self.grid_len / 2.0 - self.wall_breadth, 0.), 256 | (self.grid_len / 2.0 + self.wall_breadth, 0.), 257 | (self.grid_len / 2.0 + self.wall_breadth, self.grid_len / 4.0 - self.door_breadth), 258 | (self.grid_len / 2.0 - self.wall_breadth, self.grid_len / 4.0 - self.door_breadth), 259 | ], codes=codes)) 260 | 261 | elif self.num_rooms == 4: 262 | # create 4 rooms 263 | raise Exception("Not implemented yet :(") 264 | else: 265 | raise Exception("Logic for current number of rooms " + 266 | str(self.num_rooms) + " is not implemented yet :(") 267 | 268 | def collides(self, pt): 269 | """ 270 | to check if the point (x,y) is in the area defined by the walls polygon (i.e. 
collides) 271 | """ 272 | wall_edge_low = self.grid_len / 2 - self.wall_breadth 273 | wall_edge_high = self.grid_len / 2 + self.wall_breadth 274 | for w in self.walls: 275 | if w.contains_point(pt): 276 | return True 277 | elif pt[0] <= self.min_position and pt[1] > wall_edge_low and pt[1] < wall_edge_high: 278 | return True 279 | elif pt[1] <= self.min_position and pt[0] > wall_edge_low and pt[0] < wall_edge_high: 280 | return True 281 | return False 282 | 283 | def vis_trajectory(self, traj, name_plot, experiment_id=None, imp_states=None): 284 | """ 285 | creates the trajectory and return the plot 286 | 287 | trj: numpy_array 288 | 289 | 290 | Code taken from: https://discuss.pytorch.org/t/example-code-to-put-matplotlib-graph-to-tensorboard-x/15806 291 | """ 292 | fig = plt.figure(figsize=(10, 10)) 293 | ax = fig.add_subplot(111) 294 | 295 | # convert the environment to the image 296 | ax.set_xlim(0.0, self.max_position) 297 | ax.set_ylim(0.0, self.max_position) 298 | 299 | # add the border here 300 | # for i in ax.spines.itervalues(): 301 | # i.set_linewidth(0.1) 302 | 303 | # plot any walls if any 304 | for w in self.walls: 305 | patch = patches.PathPatch(w, facecolor='gray', lw=2) 306 | ax.add_patch(patch) 307 | 308 | # plot the start and goal points 309 | ax.scatter([self.start_position[0]], [self.start_position[1]], c='g') 310 | ax.scatter([self.goal_position[0]], [self.goal_position[1]], c='y') 311 | 312 | # Plot the dense rewards: 313 | for idx, mini_goal in enumerate(self.dense_goals): 314 | ax.scatter([mini_goal[0]], [mini_goal[1]], c='b') 315 | 316 | # add the trajectory here 317 | # https://stackoverflow.com/questions/36607742/drawing-phase-space-trajectories-with-arrows-in-matplotlib 318 | 319 | ax.quiver(traj[:-1, 0], traj[:-1, 1], 320 | traj[1:, 0] - traj[:-1, 0], traj[1:, 1] - traj[:-1, 1], 321 | scale_units='xy', angles='xy', scale=1, color='black') 322 | 323 | # plot the decision points/states 324 | if imp_states is not None: 325 | ax.scatter(imp_states[:, 0], imp_states[:, 1], c='r') 326 | 327 | # return the image buff 328 | 329 | ax.set_title("grid") 330 | # fig.savefig(buf, format='jpeg') # maybe png 331 | fig.savefig('install/{}_{}'.format(name_plot, experiment_id), dpi=300) # maybe png 332 | 333 | def test_vis_trajectory(self, traj, name_plot, heatmap_title, experiment_id=None, heatmap_normalize=False, 334 | heatmap_vertical_clip_value=2500): 335 | 336 | # Trajectory heatmap 337 | x = np.array([point[0] * self.scale for point in traj]) 338 | y = np.array([point[1] * self.scale for point in traj]) 339 | 340 | # Save heatmap for different bin scales 341 | for num in range(2, 6): 342 | fig, ax = plt.subplots() 343 | 344 | bin_scale = num * 0.1 345 | 346 | h = ax.hist2d(x, y, bins=[np.arange(self.min_position * self.scale, 347 | self.max_position * self.scale, num), 348 | np.arange(self.min_position * self.scale, 349 | self.max_position * self.scale, num)], 350 | cmap='Blues', normed=heatmap_normalize, vmax=heatmap_vertical_clip_value) 351 | image = h[3] 352 | plt.colorbar(image, ax=ax) 353 | 354 | # Build graph barriers and start and goal positions 355 | start_point = (self.start_position[0] * self.scale, self.start_position[1] * self.scale) 356 | radius = self.goal_radius * self.scale / 2 357 | start_circle = patches.Circle(start_point, radius, 358 | facecolor='gold', edgecolor='black', lw=0.5, zorder=10) 359 | 360 | goal_point = (self.goal_position[0] * self.scale, self.goal_position[1] * self.scale) 361 | goal_circle = patches.Circle(goal_point, radius, 362 | 
facecolor='brown', edgecolor='black', lw=0.5, zorder=10) 363 | 364 | for idx, dense_goal in enumerate(self.dense_goals): 365 | dense_goal_point = (dense_goal[0] * self.scale, dense_goal[1] * self.scale) 366 | dense_goal_circle = patches.Circle(dense_goal_point, radius, 367 | facecolor='crimson', edgecolor='black', lw=0.5, zorder=10) 368 | ax.add_patch(dense_goal_circle) 369 | 370 | ax.add_patch(start_circle) 371 | ax.add_patch(goal_circle) 372 | 373 | if self.num_rooms == 1: 374 | wall1_xy = (self.min_position * self.scale, 375 | self.max_position/2 * self.scale - self.wall_breadth * self.scale) 376 | wall1_width = (self.max_position/2 - self.min_position) * self.scale 377 | wall1_height = 2 * self.wall_breadth * self.scale 378 | wall1_rect = patches.Rectangle(xy=wall1_xy, width=wall1_width, height=wall1_height, 379 | facecolor='grey', zorder=10) 380 | 381 | wall2_xy = (self.max_position/2 * self.scale - self.wall_breadth * self.scale, 382 | self.min_position * self.scale) 383 | wall2_width = 2 * self.wall_breadth * self.scale 384 | wall2_height = (self.max_position/4 - self.door_breadth - self.min_position) * self.scale 385 | wall2_rect = patches.Rectangle(xy=wall2_xy, width=wall2_width, height=wall2_height, 386 | facecolor='grey', zorder=10) 387 | 388 | wall3_xy = ((self.max_position/2 - self.wall_breadth) * self.scale, 389 | (self.max_position/4 + self.door_breadth) * self.scale) 390 | wall3_width = 2 * self.wall_breadth * self.scale 391 | wall3_height = (self.max_position/2 + self.wall_breadth - self.max_position/4 - self.door_breadth) * self.scale 392 | wall3_rect = patches.Rectangle(xy=wall3_xy, width=wall3_width, height=wall3_height, 393 | facecolor='grey', zorder=10) 394 | 395 | ax.add_patch(wall1_rect) 396 | ax.add_patch(wall2_rect) 397 | ax.add_patch(wall3_rect) 398 | 399 | ax.set_title(heatmap_title) 400 | plt.savefig('install/{}_{}_{}.pdf'.format(name_plot, experiment_id, num)) 401 | 402 | -------------------------------------------------------------------------------- /pytorch-soft-actor-critic/exploration_models.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import math 3 | import numpy as np 4 | import random 5 | from sklearn import preprocessing 6 | import copy 7 | 8 | 9 | # action noise models 10 | class ActionNoise(object): 11 | def reset(self): 12 | pass 13 | 14 | 15 | class RandomWalkNoise(object): 16 | def __init__(self, action_dim, max_action_limit): 17 | self.action_dim = action_dim 18 | self.max_action_limit = max_action_limit 19 | 20 | def __call__(self): 21 | return np.around(np.random.uniform(-self.max_action_limit,+self.max_action_limit, (self.action_dim,)), decimals = 10) 22 | 23 | 24 | class NormalActionNoise(ActionNoise): 25 | def __init__(self, mu, sigma): 26 | self.mu = mu 27 | self.sigma = sigma 28 | 29 | def __call__(self): 30 | return np.random.normal(self.mu, self.sigma) 31 | 32 | def __repr__(self): 33 | return 'NormalActionNoise(mu={}, sigma={})'.format(self.mu, self.sigma) 34 | 35 | 36 | # Based on http://math.stackexchange.com/questions/1287634/implementing-ornstein-uhlenbeck-in-matlab 37 | class OrnsteinUhlenbeckActionNoise(ActionNoise): 38 | def __init__(self, mu, sigma, theta=.15, dt=1e-2, x0=None): 39 | self.theta = theta 40 | self.mu = mu 41 | self.sigma = sigma 42 | self.dt = dt 43 | self.x0 = x0 44 | self.reset() 45 | 46 | def __call__(self): 47 | x = self.x_prev + self.theta * (self.mu - self.x_prev) * self.dt + self.sigma * np.sqrt(self.dt) * 
np.random.normal(size=self.mu.shape) 48 | self.x_prev = x 49 | return x 50 | 51 | def reset(self): 52 | self.x_prev = self.x0 if self.x0 is not None else np.zeros_like(self.mu) 53 | 54 | def __repr__(self): 55 | return 'OrnsteinUhlenbeckActionNoise(mu={}, sigma={})'.format(self.mu, self.sigma) 56 | 57 | 58 | 59 | 60 | # TODO: Adaptive param noise: https://github.com/openai/baselines/blob/master/baselines/ddpg/noise.py 61 | class AdaptiveParamNoiseSpec(object): 62 | def __init__(self, initial_stddev=0.1, desired_action_stddev=0.1, adoption_coefficient=1.01): 63 | self.initial_stddev = initial_stddev 64 | self.desired_action_stddev = desired_action_stddev 65 | self.adoption_coefficient = adoption_coefficient 66 | 67 | self.current_stddev = initial_stddev 68 | 69 | def adapt(self, distance): 70 | if distance > self.desired_action_stddev: 71 | # Decrease stddev. 72 | self.current_stddev /= self.adoption_coefficient 73 | else: 74 | # Increase stddev. 75 | self.current_stddev *= self.adoption_coefficient 76 | 77 | def get_stats(self): 78 | stats = { 79 | 'param_noise_stddev': self.current_stddev, 80 | } 81 | return stats 82 | 83 | def __repr__(self): 84 | fmt = 'AdaptiveParamNoiseSpec(initial_stddev={}, desired_action_stddev={}, adoption_coefficient={})' 85 | return fmt.format(self.initial_stddev, self.desired_action_stddev, self.adoption_coefficient) 86 | 87 | 88 | 89 | class PolyNoise(object): 90 | def __init__(self, 91 | L_p, 92 | b_0, 93 | action_dim, 94 | ou_noise, 95 | sigma = 0.2,): 96 | """ 97 | params for the L_p formulation: 98 | L_p: the persistance length 99 | b_0: movement distance 100 | signma: correlaion_variance 101 | blind: disregard the current action and only use the previous action 102 | """ 103 | action_noise = NormalActionNoise(mu=np.zeros(action_dim), sigma=float(sigma) * np.ones(action_dim)) 104 | self.L_p = L_p 105 | self.b_0 = b_0 106 | self.sigma = sigma 107 | self.action_dim = action_dim 108 | self.ou_noise = ou_noise 109 | 110 | # calculate the angle here 111 | self.n = int(L_p/b_0) 112 | self.lambda_ = np.arccos(np.exp((-1. 
* b_0)/L_p)) 113 | # initialize and reset traj-specific stats 114 | self.reset() 115 | 116 | 117 | def reset(self): 118 | """ 119 | reset the chain history 120 | """ 121 | self.H = None 122 | self.a_p = None 123 | #self.ou_noise.reset() 124 | 125 | self.i = 0 126 | self.t = 0 127 | self.rand_or_poly = [] 128 | 129 | 130 | def __call__(self, a): 131 | """ 132 | the 133 | s: the current state 134 | a: the current action 135 | t: the time step 136 | """ 137 | new_a = a 138 | 139 | if self.t==0: 140 | # return original a 141 | pass 142 | elif self.t==1: 143 | # create randonm trajectory vector 144 | H = np.random.rand(self.action_dim) 145 | self.H = (H * self.b_0) / np.linalg.norm(H, 2) 146 | # append the new H to the previous actions 147 | new_a = self.a_p + self.H 148 | self.i += 1 149 | else: 150 | # done with polyRL noise 151 | if self.i == self.n: 152 | # intialize 153 | noise = self.ou_noise() 154 | 155 | # add the noise 156 | new_a = a + noise 157 | 158 | #rest i and H 159 | self.i = 0 160 | self.H = new_a - self.a_p 161 | self.rand_or_poly.append(False) 162 | else: 163 | eta = abs(np.random.normal(self.lambda_, self.sigma, 1)) 164 | B = sample_persistent_action(self.action_dim, self.H, self.a_p, eta) 165 | self.rand_or_poly.append(True) 166 | 167 | #update the trajectory 168 | self.H = self.b_0 * B 169 | 170 | new_a = self.a_p + self.H 171 | self.i += 1 172 | 173 | #update the previous a_p 174 | self.a_p = new_a 175 | self.t += 1 176 | 177 | 178 | 179 | return new_a 180 | 181 | 182 | class GyroPolyNoise(object): 183 | def __init__(self, 184 | L_p, 185 | b_0, 186 | action_dim, 187 | state_dim, 188 | ou_noise, 189 | sigma = 0.2,): 190 | """ 191 | params for the L_p formulation: 192 | L_p: the persistance length 193 | b_0: movement distance 194 | signma: correlaion_variance 195 | blind: disregard the current action and only use the previous action 196 | """ 197 | self.L_p = L_p 198 | self.b_0 = b_0 199 | self.sigma = sigma 200 | self.action_dim = action_dim 201 | self.state_dim = state_dim 202 | self.ou_noise = ou_noise 203 | 204 | # calculate the angle here 205 | self.n = int(L_p/b_0) 206 | self.lambda_ = np.arccos(np.exp((-1. 
* b_0)/L_p)) 207 | # initialize and reset traj-specific stats 208 | self.reset() 209 | 210 | 211 | def reset(self): 212 | """ 213 | reset the chain history 214 | """ 215 | self.H = None 216 | self.a_p = None 217 | self.ou_noise.reset() 218 | 219 | self.i = 0 220 | self.t = 0 221 | 222 | # raius of gyration 223 | self.g = 0 224 | self.delta_g = 0 225 | 226 | # centre of mass of gyration 227 | self.C = np.zeros(self.state_dim) 228 | 229 | 230 | self.rand_or_poly = [] 231 | self.g_history = [] 232 | self.avg_delta_g = 0 233 | 234 | 235 | def __call__(self, a, s): 236 | """ 237 | the 238 | s: the current state 239 | a: the current action 240 | t: the time step 241 | """ 242 | new_a = a 243 | 244 | if self.t==0: 245 | # return original a 246 | pass 247 | elif self.t==1: 248 | # create randonm trajectory vector 249 | H = np.random.rand(self.action_dim) 250 | self.H = (H * self.b_0) / np.linalg.norm(H, 2) 251 | 252 | # append the new H to the previous actions 253 | new_a = self.a_p + self.H 254 | self.i += 1 255 | else: 256 | # done with polyRL noise 257 | if self.delta_g < 0: 258 | # intialize 259 | noise = self.ou_noise() 260 | 261 | # add the noise 262 | new_a = a + noise 263 | 264 | #rest i and H 265 | self.i = 0 266 | self.g = 0 267 | self.C = np.zeros(self.state_dim) 268 | self.delta_g = 0 269 | self.H = new_a - self.a_p 270 | self.rand_or_poly.append(False) 271 | else: 272 | eta = abs(np.random.normal(self.lambda_, self.sigma, 1)) 273 | B = sample_persistent_action(self.action_dim, self.H, self.a_p, eta) 274 | self.rand_or_poly.append(True) 275 | 276 | #update the trajectory 277 | self.H = self.b_0 * B 278 | 279 | new_a = self.a_p + self.H 280 | 281 | if self.i == self.n: 282 | self.i = 0 283 | self.g = 0 284 | self.C = np.zeros(self.state_dim) 285 | self.delta_g = 0 286 | else: 287 | self.i += 1 288 | 289 | #update the previous a_p 290 | self.a_p = new_a 291 | 292 | if self.i != 0: 293 | g = np.sqrt(((float(self.i-1.0)/self.i) * self.g**2) + (1.0/(self.i+1.0) * np.linalg.norm(s - self.C, 2)**2) ) 294 | self.delta_g = g - self.g 295 | self.g = g 296 | 297 | # add to history 298 | self.avg_delta_g += self.delta_g 299 | self.g_history.append(self.g) 300 | 301 | 302 | self.C = (self.i * self.C + s)/(self.i + 1.0) 303 | self.t += 1 304 | 305 | 306 | return new_a 307 | class GyroPolyNoiseActionTraj (object): 308 | def __init__(self, 309 | lambd, 310 | action_dim, 311 | state_dim, 312 | ou_noise, 313 | sigma = 0.2, 314 | max_action_limit = 1.0): 315 | self.lambd = lambd 316 | self.action_dim = action_dim 317 | self.state_dim = state_dim 318 | self.ou_noise = ou_noise 319 | self.sigma = sigma 320 | self.max_action_limit = max_action_limit 321 | 322 | 323 | # initialize and reset traj-specific stats 324 | self.reset() 325 | 326 | def reset(self): 327 | 328 | """ 329 | reset the chain history 330 | """ 331 | self.a_p = None 332 | self.ou_noise.reset() 333 | 334 | self.i = 0 335 | self.t = 0 336 | 337 | # raius of gyration 338 | self.g = 0 339 | self.delta_g = 0 340 | 341 | # centre of mass 342 | self.C = np.zeros(self.state_dim) 343 | 344 | 345 | self.rand_or_poly = [] 346 | self.g_history = [] 347 | self.avg_delta_g = 0 348 | 349 | 350 | def __call__(self, a, s): 351 | """ 352 | the 353 | s: the current state 354 | a: the current action 355 | t: the time step 356 | """ 357 | new_a = a 358 | 359 | if self.t==0: 360 | # return original a 361 | pass 362 | else: 363 | # done with polyRL noise 364 | if self.delta_g < 0: 365 | # intialize 366 | noise = self.ou_noise() 367 | 368 | # add the noise 369 | 
new_a = a + noise 370 | 371 | #rest i and H 372 | self.i = 0 373 | self.g = 0 374 | self.C = np.zeros(self.state_dim) 375 | self.delta_g = 0 376 | self.rand_or_poly.append(False) 377 | else: 378 | eta = abs(np.random.normal(self.lambd, self.sigma, 1)) 379 | A = sample_persistent_action_noHvector(self.action_dim, self.a_p, eta, self.max_action_limit) 380 | self.rand_or_poly.append(True) 381 | 382 | #update the trajectory 383 | new_a = A 384 | 385 | self.i +=1 386 | 387 | 388 | #update the previous a_p 389 | self.a_p = new_a 390 | 391 | if self.i != 0: 392 | g = np.sqrt(((float(self.i-1.0)/self.i) * self.g**2) + (1.0/(self.i+1.0) * np.linalg.norm(s - self.C, 2)**2) ) 393 | self.delta_g = g - self.g 394 | self.g = g 395 | 396 | # add to history 397 | self.avg_delta_g += self.delta_g 398 | self.g_history.append(self.g) 399 | 400 | 401 | self.C = (self.i * self.C + s)/(self.i + 1.0) 402 | self.t += 1 403 | return new_a 404 | 405 | -------------------------------------------------------------------------------- /pytorch-soft-actor-critic/flow_helpers.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | import torch.distributions as D 5 | import torchvision.transforms as T 6 | import os 7 | import math 8 | import argparse 9 | import pprint 10 | import copy 11 | 12 | # -------------------- 13 | # Model layers and helpers 14 | # -------------------- 15 | 16 | def create_masks(input_size, hidden_size, n_hidden, input_order='sequential', input_degrees=None): 17 | # MADE paper sec 4: 18 | # degrees of connections between layers -- ensure at most in_degree - 1 connections 19 | degrees = [] 20 | 21 | # set input degrees to what is provided in args (the flipped order of the previous layer in a stack of mades); 22 | # else init input degrees based on strategy in input_order (sequential or random) 23 | if input_order == 'sequential': 24 | degrees += [torch.arange(input_size)] if input_degrees is None else [input_degrees] 25 | for _ in range(n_hidden + 1): 26 | degrees += [torch.arange(hidden_size) % (input_size - 1)] 27 | degrees += [torch.arange(input_size) % input_size - 1] if input_degrees is None else [input_degrees % input_size - 1] 28 | 29 | elif input_order == 'random': 30 | degrees += [torch.randperm(input_size)] if input_degrees is None else [input_degrees] 31 | for _ in range(n_hidden + 1): 32 | min_prev_degree = min(degrees[-1].min().item(), input_size - 1) 33 | degrees += [torch.randint(min_prev_degree, input_size, (hidden_size,))] 34 | min_prev_degree = min(degrees[-1].min().item(), input_size - 1) 35 | degrees += [torch.randint(min_prev_degree, input_size, (input_size,)) - 1] if input_degrees is None else [input_degrees - 1] 36 | 37 | # construct masks 38 | masks = [] 39 | for (d0, d1) in zip(degrees[:-1], degrees[1:]): 40 | masks += [(d1.unsqueeze(-1) >= d0.unsqueeze(0)).float()] 41 | 42 | return masks, degrees[0] 43 | 44 | 45 | class MaskedLinear(nn.Linear): 46 | """ MADE building block layer """ 47 | def __init__(self, input_size, n_outputs, mask, cond_label_size=None): 48 | super().__init__(input_size, n_outputs) 49 | 50 | self.register_buffer('mask', mask) 51 | 52 | self.cond_label_size = cond_label_size 53 | if cond_label_size is not None: 54 | self.cond_weight = nn.Parameter(torch.rand(n_outputs, cond_label_size) / math.sqrt(cond_label_size)) 55 | 56 | def forward(self, x, y=None): 57 | out = F.linear(x, self.weight * self.mask, self.bias) 58 | if y is not None: 59 | out = 
out + F.linear(y, self.cond_weight) 60 | return out 61 | 62 | def extra_repr(self): 63 | return 'in_features={}, out_features={}, bias={}'.format( 64 | self.in_features, self.out_features, self.bias is not None 65 | ) + (self.cond_label_size != None) * ', cond_features={}'.format(self.cond_label_size) 66 | 67 | 68 | class LinearMaskedCoupling(nn.Module): 69 | """ Modified RealNVP Coupling Layers per the MAF paper """ 70 | def __init__(self, input_size, hidden_size, n_hidden, mask, cond_label_size=None): 71 | super().__init__() 72 | 73 | self.register_buffer('mask', mask) 74 | 75 | # scale function 76 | s_net = [nn.Linear(input_size + (cond_label_size if cond_label_size is not None else 0), hidden_size)] 77 | for _ in range(n_hidden): 78 | s_net += [nn.Tanh(), nn.Linear(hidden_size, hidden_size)] 79 | s_net += [nn.Tanh(), nn.Linear(hidden_size, input_size)] 80 | self.s_net = nn.Sequential(*s_net) 81 | 82 | # translation function 83 | self.t_net = copy.deepcopy(self.s_net) 84 | # replace Tanh with ReLU's per MAF paper 85 | for i in range(len(self.t_net)): 86 | if not isinstance(self.t_net[i], nn.Linear): self.t_net[i] = nn.ReLU() 87 | 88 | def forward(self, x, y=None): 89 | # apply mask 90 | mx = x * self.mask 91 | 92 | # run through model 93 | s = self.s_net(mx if y is None else torch.cat([y, mx], dim=1)) 94 | t = self.t_net(mx if y is None else torch.cat([y, mx], dim=1)) 95 | u = mx + (1 - self.mask) * (x - t) * torch.exp(-s) # cf RealNVP eq 8 where u corresponds to x (here we're modeling u) 96 | 97 | log_abs_det_jacobian = - (1 - self.mask) * s # log det du/dx; cf RealNVP 8 and 6; note, sum over input_size done at model log_prob 98 | 99 | return u, log_abs_det_jacobian 100 | 101 | def inverse(self, u, y=None): 102 | # apply mask 103 | mu = u * self.mask 104 | 105 | # run through model 106 | s = self.s_net(mu if y is None else torch.cat([y, mu], dim=1)) 107 | t = self.t_net(mu if y is None else torch.cat([y, mu], dim=1)) 108 | x = mu + (1 - self.mask) * (u * s.exp() + t) # cf RealNVP eq 7 109 | 110 | log_abs_det_jacobian = (1 - self.mask) * s # log det dx/du 111 | 112 | return x, log_abs_det_jacobian 113 | 114 | 115 | class BatchNorm(nn.Module): 116 | """ RealNVP BatchNorm layer """ 117 | def __init__(self, input_size, momentum=0.9, eps=1e-5): 118 | super().__init__() 119 | self.momentum = momentum 120 | self.eps = eps 121 | 122 | self.log_gamma = nn.Parameter(torch.zeros(input_size)) 123 | self.beta = nn.Parameter(torch.zeros(input_size)) 124 | 125 | self.register_buffer('running_mean', torch.zeros(input_size)) 126 | self.register_buffer('running_var', torch.ones(input_size)) 127 | 128 | def forward(self, x, cond_y=None): 129 | if self.training: 130 | self.batch_mean = x.mean(0) 131 | self.batch_var = x.var(0) # note MAF paper uses biased variance estimate; ie x.var(0, unbiased=False) 132 | 133 | # update running mean 134 | self.running_mean.mul_(self.momentum).add_(self.batch_mean.data * (1 - self.momentum)) 135 | self.running_var.mul_(self.momentum).add_(self.batch_var.data * (1 - self.momentum)) 136 | 137 | mean = self.batch_mean 138 | var = self.batch_var 139 | else: 140 | mean = self.running_mean 141 | var = self.running_var 142 | 143 | # compute normalized input (cf original batch norm paper algo 1) 144 | x_hat = (x - mean) / torch.sqrt(var + self.eps) 145 | y = self.log_gamma.exp() * x_hat + self.beta 146 | 147 | # compute log_abs_det_jacobian (cf RealNVP paper) 148 | log_abs_det_jacobian = self.log_gamma - 0.5 * torch.log(var + self.eps) 149 | # print('in sum log var {:6.3f} ; out 
sum log var {:6.3f}; sum log det {:8.3f}; mean log_gamma {:5.3f}; mean beta {:5.3f}'.format( 150 | # (var + self.eps).log().sum().data.numpy(), y.var(0).log().sum().data.numpy(), log_abs_det_jacobian.mean(0).item(), self.log_gamma.mean(), self.beta.mean())) 151 | return y, log_abs_det_jacobian.expand_as(x) 152 | 153 | def inverse(self, y, cond_y=None): 154 | if self.training: 155 | mean = self.batch_mean 156 | var = self.batch_var 157 | else: 158 | mean = self.running_mean 159 | var = self.running_var 160 | 161 | x_hat = (y - self.beta) * torch.exp(-self.log_gamma) 162 | x = x_hat * torch.sqrt(var + self.eps) + mean 163 | 164 | log_abs_det_jacobian = 0.5 * torch.log(var + self.eps) - self.log_gamma 165 | 166 | return x, log_abs_det_jacobian.expand_as(x) 167 | -------------------------------------------------------------------------------- /pytorch-soft-actor-critic/flows.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import numpy as np 4 | import torch.nn.functional as F 5 | from torch.distributions import Normal 6 | from flow_helpers import * 7 | import ipdb 8 | 9 | #Reference: https://github.com/ritheshkumar95/pytorch-normalizing-flows/blob/master/modules.py 10 | # Initialize Policy weights 11 | LOG_SIG_MAX = 2 12 | LOG_SIG_MIN = -20 13 | epsilon = 1e-6 14 | def weights_init_(m): 15 | classname = m.__class__.__name__ 16 | if classname.find('Linear') != -1: 17 | torch.nn.init.xavier_uniform_(m.weight, gain=1) 18 | torch.nn.init.constant_(m.bias, 0) 19 | 20 | class PlanarBase(nn.Module): 21 | def __init__(self, n_blocks, state_size, input_size, hidden_size, n_hidden, device): 22 | super().__init__() 23 | self.l1 = nn.Linear(state_size, hidden_size) 24 | self.l2 = nn.Linear(hidden_size,hidden_size) 25 | self.mu = nn.Linear(hidden_size, input_size) 26 | self.log_std = nn.Linear(hidden_size, input_size) 27 | self.device = device 28 | self.z_size = input_size 29 | self.num_flows = n_blocks 30 | self.flow = Planar 31 | # Amortized flow parameters 32 | self.amor_u = nn.Linear(hidden_size, self.num_flows * input_size) 33 | self.amor_w = nn.Linear(hidden_size, self.num_flows * input_size) 34 | self.amor_b = nn.Linear(hidden_size, self.num_flows) 35 | 36 | # Normalizing flow layers 37 | for k in range(self.num_flows): 38 | flow_k = self.flow() 39 | self.add_module('flow_' + str(k), flow_k) 40 | 41 | self.apply(weights_init_) 42 | 43 | def encode(self, state): 44 | x = F.relu(self.l1(state)) 45 | x = F.relu(self.l2(x)) 46 | mean = self.mu(x) 47 | log_std = self.log_std(x) 48 | log_std = torch.clamp(log_std, min=LOG_SIG_MIN, max=LOG_SIG_MAX) 49 | return mean, log_std, x 50 | 51 | def forward(self, state): 52 | batch_size = state.size(0) 53 | mean, log_std, x = self.encode(state) 54 | std = log_std.exp() 55 | normal = Normal(mean, std) 56 | x_t = normal.rsample() # for reparameterization trick (mean + std * N(0,1)) 57 | action = torch.tanh(x_t) 58 | log_prob = normal.log_prob(x_t) 59 | # Enforcing Action Bound 60 | log_prob -= torch.log(1 - action.pow(2) + epsilon) 61 | log_prob = log_prob.sum(1, keepdim=True) 62 | z = [action] 63 | u = self.amor_u(x).view(batch_size, self.num_flows, self.z_size, 1) 64 | w = self.amor_w(x).view(batch_size, self.num_flows, 1, self.z_size) 65 | b = self.amor_b(x).view(batch_size, self.num_flows, 1, 1) 66 | 67 | self.log_det_j = torch.zeros(batch_size).to(self.device) 68 | 69 | for k in range(self.num_flows): 70 | flow_k = getattr(self, 'flow_' + str(k)) 71 | z_k, log_det_jacobian = 
flow_k(z[k], u[:, k, :, :], 72 | w[:, k, :, :], b[:, k, :, :]) 73 | z.append(z_k) 74 | self.log_det_j += log_det_jacobian 75 | 76 | action = z[-1] 77 | log_prob_final_action = log_prob.squeeze() - self.log_det_j 78 | 79 | probability_final_action = torch.exp(log_prob_final_action) 80 | entropy = (probability_final_action * log_prob_final_action) 81 | normalized_action = torch.tanh(action) 82 | np_action = action.cpu().data.numpy().flatten() 83 | if np.isnan(np_action[0]): 84 | ipdb.set_trace() 85 | return normalized_action, log_prob, action , mean, std 86 | 87 | class PlanarFlow(nn.Module): 88 | def __init__(self, D): 89 | super().__init__() 90 | self.D = D 91 | 92 | def forward(self, z, lamda): 93 | ''' 94 | z - latents from prev layer 95 | lambda - Flow parameters (b, w, u) 96 | b - scalar 97 | w - vector 98 | u - vector 99 | ''' 100 | b = lamda[:, :1] 101 | w, u = lamda[:, 1:].chunk(2, dim=1) 102 | 103 | # Forward 104 | # f(z) = z + u tanh(w^T z + b) 105 | transf = F.tanh( 106 | z.unsqueeze(1).bmm(w.unsqueeze(2))[:, 0] + b 107 | ) 108 | f_z = z + u * transf 109 | 110 | # Inverse 111 | # psi_z = tanh' (w^T z + b) w 112 | psi_z = (1 - transf ** 2) * w 113 | log_abs_det_jacobian = torch.log( 114 | (1 + psi_z.unsqueeze(1).bmm(u.unsqueeze(2))).abs() 115 | ) 116 | 117 | return f_z, log_abs_det_jacobian 118 | 119 | class NormalizingFlow(nn.Module): 120 | def __init__(self, K, D): 121 | super().__init__() 122 | self.flows = nn.ModuleList([PlanarFlow(D) for i in range(K)]) 123 | 124 | def forward(self, z_k, flow_params): 125 | # ladj -> log abs det jacobian 126 | sum_ladj = 0 127 | for i, flow in enumerate(self.flows): 128 | z_k, ladj_k = flow(z_k, flow_params[i]) 129 | sum_ladj += ladj_k 130 | 131 | return z_k, sum_ladj 132 | 133 | class Planar(nn.Module): 134 | """ 135 | PyTorch implementation of planar flows as presented in "Variational Inference with Normalizing Flows" 136 | by Danilo Jimenez Rezende, Shakir Mohamed. Model assumes amortized flow parameters. 137 | """ 138 | 139 | def __init__(self): 140 | 141 | super(Planar, self).__init__() 142 | 143 | self.h = nn.Tanh() 144 | self.softplus = nn.Softplus() 145 | 146 | def der_h(self, x): 147 | """ Derivative of tanh """ 148 | 149 | return 1 - self.h(x) ** 2 150 | 151 | def forward(self, zk, u, w, b): 152 | """ 153 | Forward pass. Assumes amortized u, w and b. Conditions on diagonals of u and w for invertibility 154 | will be be satisfied inside this function. Computes the following transformation: 155 | z' = z + u h( w^T z + b) 156 | or actually 157 | z'^T = z^T + h(z^T w + b)u^T 158 | Assumes the following input shapes: 159 | shape u = (batch_size, z_size, 1) 160 | shape w = (batch_size, 1, z_size) 161 | shape b = (batch_size, 1, 1) 162 | shape z = (batch_size, z_size). 163 | """ 164 | 165 | zk = zk.unsqueeze(2) 166 | 167 | # reparameterize u such that the flow becomes invertible (see appendix paper) 168 | uw = torch.bmm(w, u) 169 | m_uw = -1. 
+ self.softplus(uw) 170 | w_norm_sq = torch.sum(w ** 2, dim=2, keepdim=True) 171 | u_hat = u + ((m_uw - uw) * w.transpose(2, 1) / w_norm_sq) 172 | 173 | # compute flow with u_hat 174 | wzb = torch.bmm(w, zk) + b 175 | z = zk + u_hat * self.h(wzb) 176 | z = z.squeeze(2) 177 | 178 | # compute logdetJ 179 | psi = w * self.der_h(wzb) 180 | log_det_jacobian = torch.log(torch.abs(1 + torch.bmm(psi, u_hat))) 181 | log_det_jacobian = log_det_jacobian.squeeze(2).squeeze(1) 182 | 183 | return z, log_det_jacobian 184 | 185 | 186 | # All code below this line is taken from 187 | # https://github.com/kamenbliznashki/normalizing_flows/blob/master/maf.py 188 | 189 | class FlowSequential(nn.Sequential): 190 | """ Container for layers of a normalizing flow """ 191 | def forward(self, x, y): 192 | sum_log_abs_det_jacobians = 0 193 | for module in self: 194 | x, log_abs_det_jacobian = module(x, y) 195 | sum_log_abs_det_jacobians = sum_log_abs_det_jacobians + log_abs_det_jacobian 196 | return x, sum_log_abs_det_jacobians 197 | 198 | def inverse(self, u, y): 199 | sum_log_abs_det_jacobians = 0 200 | for module in reversed(self): 201 | u, log_abs_det_jacobian = module.inverse(u, y) 202 | sum_log_abs_det_jacobians = sum_log_abs_det_jacobians + log_abs_det_jacobian 203 | return u, sum_log_abs_det_jacobians 204 | 205 | # -------------------- 206 | # Models 207 | # -------------------- 208 | 209 | class MADE(nn.Module): 210 | def __init__(self, input_size, hidden_size, n_hidden, cond_label_size=None, activation='relu', input_order='sequential', input_degrees=None): 211 | """ 212 | Args: 213 | input_size -- scalar; dim of inputs 214 | hidden_size -- scalar; dim of hidden layers 215 | n_hidden -- scalar; number of hidden layers 216 | activation -- str; activation function to use 217 | input_order -- str or tensor; variable order for creating the autoregressive masks (sequential|random) 218 | or the order flipped from the previous layer in a stack of mades 219 | conditional -- bool; whether model is conditional 220 | """ 221 | super().__init__() 222 | # base distribution for calculation of log prob under the model 223 | self.register_buffer('base_dist_mean', torch.zeros(input_size)) 224 | self.register_buffer('base_dist_var', torch.ones(input_size)) 225 | 226 | # create masks 227 | masks, self.input_degrees = create_masks(input_size, hidden_size, n_hidden, input_order, input_degrees) 228 | 229 | # setup activation 230 | if activation == 'relu': 231 | activation_fn = nn.ReLU() 232 | elif activation == 'tanh': 233 | activation_fn = nn.Tanh() 234 | else: 235 | raise ValueError('Check activation function.') 236 | 237 | # construct model 238 | self.net_input = MaskedLinear(input_size, hidden_size, masks[0], cond_label_size) 239 | self.net = [] 240 | for m in masks[1:-1]: 241 | self.net += [activation_fn, MaskedLinear(hidden_size, hidden_size, m)] 242 | self.net += [activation_fn, MaskedLinear(hidden_size, 2 * input_size, masks[-1].repeat(2,1))] 243 | self.net = nn.Sequential(*self.net) 244 | 245 | @property 246 | def base_dist(self): 247 | return D.Normal(self.base_dist_mean, self.base_dist_var) 248 | 249 | def forward(self, x, y=None): 250 | # MAF eq 4 -- return mean and log std 251 | m, loga = self.net(self.net_input(x, y)).chunk(chunks=2, dim=1) 252 | u = (x - m) * torch.exp(-loga) 253 | # MAF eq 5 254 | log_abs_det_jacobian = - loga 255 | return u, log_abs_det_jacobian 256 | 257 | def inverse(self, u, y=None, sum_log_abs_det_jacobians=None): 258 | # MAF eq 3 259 | D = u.shape[1] 260 | x = torch.zeros_like(u) 261 | # run 
through reverse model 262 | for i in self.input_degrees: 263 | m, loga = self.net(self.net_input(x, y)).chunk(chunks=2, dim=1) 264 | x[:,i] = u[:,i] * torch.exp(loga[:,i]) + m[:,i] 265 | log_abs_det_jacobian = -loga 266 | return x, log_abs_det_jacobian 267 | 268 | def log_prob(self, x, y=None): 269 | u, log_abs_det_jacobian = self.forward(x, y) 270 | return torch.sum(self.base_dist.log_prob(u) + log_abs_det_jacobian, dim=1) 271 | 272 | 273 | class MADEMOG(nn.Module): 274 | """ Mixture of Gaussians MADE """ 275 | def __init__(self, n_components, input_size, hidden_size, n_hidden, cond_label_size=None, activation='relu', input_order='sequential', input_degrees=None): 276 | """ 277 | Args: 278 | n_components -- scalar; number of gauassian components in the mixture 279 | input_size -- scalar; dim of inputs 280 | hidden_size -- scalar; dim of hidden layers 281 | n_hidden -- scalar; number of hidden layers 282 | activation -- str; activation function to use 283 | input_order -- str or tensor; variable order for creating the autoregressive masks (sequential|random) 284 | or the order flipped from the previous layer in a stack of mades 285 | conditional -- bool; whether model is conditional 286 | """ 287 | super().__init__() 288 | self.n_components = n_components 289 | 290 | # base distribution for calculation of log prob under the model 291 | self.register_buffer('base_dist_mean', torch.zeros(input_size)) 292 | self.register_buffer('base_dist_var', torch.ones(input_size)) 293 | 294 | # create masks 295 | masks, self.input_degrees = create_masks(input_size, hidden_size, n_hidden, input_order, input_degrees) 296 | 297 | # setup activation 298 | if activation == 'relu': 299 | activation_fn = nn.ReLU() 300 | elif activation == 'tanh': 301 | activation_fn = nn.Tanh() 302 | else: 303 | raise ValueError('Check activation function.') 304 | 305 | # construct model 306 | self.net_input = MaskedLinear(input_size, hidden_size, masks[0], cond_label_size) 307 | self.net = [] 308 | for m in masks[1:-1]: 309 | self.net += [activation_fn, MaskedLinear(hidden_size, hidden_size, m)] 310 | self.net += [activation_fn, MaskedLinear(hidden_size, n_components * 3 * input_size, masks[-1].repeat(n_components * 3,1))] 311 | self.net = nn.Sequential(*self.net) 312 | 313 | @property 314 | def base_dist(self): 315 | return D.Normal(self.base_dist_mean, self.base_dist_var) 316 | 317 | def forward(self, x, y=None): 318 | # shapes 319 | N, L = x.shape 320 | C = self.n_components 321 | # MAF eq 2 -- parameters of Gaussians - mean, logsigma, log unnormalized cluster probabilities 322 | m, loga, logr = self.net(self.net_input(x, y)).view(N, C, 3 * L).chunk(chunks=3, dim=-1) # out 3 x (N, C, L) 323 | # MAF eq 4 324 | x = x.repeat(1, C).view(N, C, L) # out (N, C, L) 325 | u = (x - m) * torch.exp(-loga) # out (N, C, L) 326 | # MAF eq 5 327 | log_abs_det_jacobian = - loga # out (N, C, L) 328 | # normalize cluster responsibilities 329 | self.logr = logr - logr.logsumexp(1, keepdim=True) # out (N, C, L) 330 | return u, log_abs_det_jacobian 331 | 332 | def inverse(self, u, y=None, sum_log_abs_det_jacobians=None): 333 | # shapes 334 | N, C, L = u.shape 335 | # init output 336 | x = torch.zeros(N, L).to(u.device) 337 | # MAF eq 3 338 | # run through reverse model along each L 339 | for i in self.input_degrees: 340 | m, loga, logr = self.net(self.net_input(x, y)).view(N, C, 3 * L).chunk(chunks=3, dim=-1) # out 3 x (N, C, L) 341 | # normalize cluster responsibilities and sample cluster assignments from a categorical dist 342 | logr = logr - 
logr.logsumexp(1, keepdim=True) # out (N, C, L) 343 | z = D.Categorical(logits=logr[:,:,i]).sample().unsqueeze(-1) # out (N, 1) 344 | u_z = torch.gather(u[:,:,i], 1, z).squeeze() # out (N, 1) 345 | m_z = torch.gather(m[:,:,i], 1, z).squeeze() # out (N, 1) 346 | loga_z = torch.gather(loga[:,:,i], 1, z).squeeze() 347 | x[:,i] = u_z * torch.exp(loga_z) + m_z 348 | log_abs_det_jacobian = - loga 349 | return x, log_abs_det_jacobian 350 | 351 | def log_prob(self, x, y=None): 352 | u, log_abs_det_jacobian = self.forward(x, y) # u = (N,C,L); log_abs_det_jacobian = (N,C,L) 353 | # marginalize cluster probs 354 | log_probs = torch.logsumexp(self.logr + self.base_dist.log_prob(u) + log_abs_det_jacobian, dim=1) # sum over C; out (N, L) 355 | return log_probs.sum(1) # sum over L; out (N,) 356 | 357 | 358 | class MAF(nn.Module): 359 | def __init__(self, n_blocks, state_size, input_size, hidden_size, n_hidden, 360 | cond_label_size=None, activation='relu', 361 | input_order='sequential', batch_norm=True): 362 | super().__init__() 363 | # base distribution for calculation of log prob under the model 364 | self.register_buffer('base_dist_mean', torch.zeros(input_size)) 365 | self.register_buffer('base_dist_var', torch.ones(input_size)) 366 | self.linear1 = nn.Linear(state_size, input_size) 367 | 368 | # construct model 369 | modules = [] 370 | self.input_degrees = None 371 | for i in range(n_blocks): 372 | modules += [MADE(input_size, hidden_size, n_hidden, 373 | cond_label_size, activation, input_order, 374 | self.input_degrees)] 375 | self.input_degrees = modules[-1].input_degrees.flip(0) 376 | modules += batch_norm * [BatchNorm(input_size)] 377 | 378 | self.net = FlowSequential(*modules) 379 | 380 | @property 381 | def base_dist(self): 382 | return D.Normal(self.base_dist_mean, self.base_dist_var) 383 | 384 | def forward(self, x, y=None): 385 | ''' Projecting the State to the same dim as actions ''' 386 | action_proj = F.relu(self.linear1(x)) 387 | # action_proj = action_proj.view(1,-1) 388 | if action_proj.size()[0] == 1 and len(action_proj.size()) > 2: 389 | action, sum_log_abs_det_jacobians = self.net(action_proj[0], y) 390 | else: 391 | action, sum_log_abs_det_jacobians = self.net(action_proj, y) 392 | log_prob = torch.sum(self.base_dist.log_prob(action) + sum_log_abs_det_jacobians, dim=1) 393 | normalized_action = torch.tanh(action) 394 | # TODO: Find the mean and log std deviation of a Normalizing Flow 395 | return normalized_action, log_prob, action , action, 0 396 | 397 | def inverse(self, u, y=None): 398 | action_proj = F.relu(self.linear1(u)) 399 | action, sum_log_abs_det_jacobians = self.net.inverse(action_proj, y) 400 | log_prob = torch.sum(self.base_dist.log_prob(action) + sum_log_abs_det_jacobians, dim=1) 401 | normalized_action = torch.tanh(action) 402 | return normalized_action, log_prob, action , action, 0 403 | # return self.net.inverse(action_proj, y) 404 | 405 | def log_prob(self, x, y=None): 406 | u, sum_log_abs_det_jacobians = self.forward(x, y) 407 | return torch.sum(self.base_dist.log_prob(u) + sum_log_abs_det_jacobians, dim=1) 408 | 409 | class MAFMOG(nn.Module): 410 | """ MAF on mixture of gaussian MADE """ 411 | def __init__(self, n_blocks, n_components, input_size, hidden_size, n_hidden, cond_label_size=None, activation='relu', 412 | input_order='sequential', batch_norm=True): 413 | super().__init__() 414 | # base distribution for calculation of log prob under the model 415 | self.register_buffer('base_dist_mean', torch.zeros(input_size)) 416 | 
self.register_buffer('base_dist_var', torch.ones(input_size)) 417 | 418 | self.maf = MAF(n_blocks, input_size, hidden_size, n_hidden, cond_label_size, activation, input_order, batch_norm) 419 | # get reversed input order from the last layer (note in maf model, input_degrees are already flipped in for-loop model constructor 420 | input_degrees = self.maf.input_degrees#.flip(0) 421 | self.mademog = MADEMOG(n_components, input_size, hidden_size, n_hidden, cond_label_size, activation, input_order, input_degrees) 422 | 423 | @property 424 | def base_dist(self): 425 | return D.Normal(self.base_dist_mean, self.base_dist_var) 426 | 427 | def forward(self, x, y=None): 428 | u, maf_log_abs_dets = self.maf(x, y) 429 | u, made_log_abs_dets = self.mademog(u, y) 430 | sum_log_abs_det_jacobians = maf_log_abs_dets.unsqueeze(1) + made_log_abs_dets 431 | return u, sum_log_abs_det_jacobians 432 | 433 | def inverse(self, u, y=None): 434 | x, made_log_abs_dets = self.mademog.inverse(u, y) 435 | x, maf_log_abs_dets = self.maf.inverse(x, y) 436 | sum_log_abs_det_jacobians = maf_log_abs_dets.unsqueeze(1) + made_log_abs_dets 437 | return x, sum_log_abs_det_jacobians 438 | 439 | def log_prob(self, x, y=None): 440 | u, log_abs_det_jacobian = self.forward(x, y) # u = (N,C,L); log_abs_det_jacobian = (N,C,L) 441 | # marginalize cluster probs 442 | log_probs = torch.logsumexp(self.mademog.logr + self.base_dist.log_prob(u) + log_abs_det_jacobian, dim=1) # out (N, L) 443 | return log_probs.sum(1) # out (N,) 444 | 445 | 446 | class RealNVP(nn.Module): 447 | def __init__(self, n_blocks, input_size, hidden_size, n_hidden, cond_label_size=None, batch_norm=True): 448 | super().__init__() 449 | 450 | # base distribution for calculation of log prob under the model 451 | self.register_buffer('base_dist_mean', torch.zeros(input_size)) 452 | self.register_buffer('base_dist_var', torch.ones(input_size)) 453 | 454 | # construct model 455 | modules = [] 456 | mask = torch.arange(input_size).float() % 2 457 | for i in range(n_blocks): 458 | modules += [LinearMaskedCoupling(input_size, hidden_size, n_hidden, mask, cond_label_size)] 459 | mask = 1 - mask 460 | modules += batch_norm * [BatchNorm(input_size)] 461 | 462 | self.net = FlowSequential(*modules) 463 | 464 | @property 465 | def base_dist(self): 466 | return D.Normal(self.base_dist_mean, self.base_dist_var) 467 | 468 | def forward(self, x, y=None): 469 | return self.net(x, y) 470 | 471 | def inverse(self, u, y=None): 472 | return self.net.inverse(u, y) 473 | 474 | def log_prob(self, x, y=None): 475 | u, sum_log_abs_det_jacobians = self.forward(x, y) 476 | return torch.sum(self.base_dist.log_prob(u) + sum_log_abs_det_jacobians, dim=1) 477 | 478 | -------------------------------------------------------------------------------- /pytorch-soft-actor-critic/graphics.py: -------------------------------------------------------------------------------- 1 | # graphics.py 2 | """Simple object oriented graphics library 3 | 4 | The library is designed to make it very easy for novice programmers to 5 | experiment with computer graphics in an object oriented fashion. It is 6 | written by John Zelle for use with the book "Python Programming: An 7 | Introduction to Computer Science" (Franklin, Beedle & Associates). 8 | 9 | LICENSE: This is open-source software released under the terms of the 10 | GPL (http://www.gnu.org/licenses/gpl.html). 11 | 12 | PLATFORMS: The package is a wrapper around Tkinter and should run on 13 | any platform where Tkinter is available. 
14 | 15 | INSTALLATION: Put this file somewhere where Python can see it. 16 | 17 | OVERVIEW: There are two kinds of objects in the library. The GraphWin 18 | class implements a window where drawing can be done and various 19 | GraphicsObjects are provided that can be drawn into a GraphWin. As a 20 | simple example, here is a complete program to draw a circle of radius 21 | 10 centered in a 100x100 window: 22 | 23 | -------------------------------------------------------------------- 24 | from graphics import * 25 | 26 | def main(): 27 | win = GraphWin("My Circle", 100, 100) 28 | c = Circle(Point(50,50), 10) 29 | c.draw(win) 30 | win.getMouse() # Pause to view result 31 | win.close() # Close window when done 32 | 33 | main() 34 | -------------------------------------------------------------------- 35 | GraphWin objects support coordinate transformation through the 36 | setCoords method and mouse and keyboard interaction methods. 37 | 38 | The library provides the following graphical objects: 39 | Point 40 | Line 41 | Circle 42 | Oval 43 | Rectangle 44 | Polygon 45 | Text 46 | Entry (for text-based input) 47 | Image 48 | 49 | Various attributes of graphical objects can be set such as 50 | outline-color, fill-color and line-width. Graphical objects also 51 | support moving and hiding for animation effects. 52 | 53 | The library also provides a very simple class for pixel-based image 54 | manipulation, Pixmap. A pixmap can be loaded from a file and displayed 55 | using an Image object. Both getPixel and setPixel methods are provided 56 | for manipulating the image. 57 | 58 | DOCUMENTATION: For complete documentation, see Chapter 4 of "Python 59 | Programming: An Introduction to Computer Science" by John Zelle, 60 | published by Franklin, Beedle & Associates. Also see 61 | http://mcsp.wartburg.edu/zelle/python for a quick reference""" 62 | 63 | __version__ = "5.0" 64 | 65 | # Version 5 8/26/2016 66 | # * update at bottom to fix MacOS issue causing askopenfile() to hang 67 | # * update takes an optional parameter specifying update rate 68 | # * Entry objects get focus when drawn 69 | # * __repr_ for all objects 70 | # * fixed offset problem in window, made canvas borderless 71 | 72 | # Version 4.3 4/25/2014 73 | # * Fixed Image getPixel to work with Python 3.4, TK 8.6 (tuple type handling) 74 | # * Added interactive keyboard input (getKey and checkKey) to GraphWin 75 | # * Modified setCoords to cause redraw of current objects, thus 76 | # changing the view. This supports scrolling around via setCoords. 77 | # 78 | # Version 4.2 5/26/2011 79 | # * Modified Image to allow multiple undraws like other GraphicsObjects 80 | # Version 4.1 12/29/2009 81 | # * Merged Pixmap and Image class. Old Pixmap removed, use Image. 82 | # Version 4.0.1 10/08/2009 83 | # * Modified the autoflush on GraphWin to default to True 84 | # * Autoflush check on close, setBackground 85 | # * Fixed getMouse to flush pending clicks at entry 86 | # Version 4.0 08/2009 87 | # * Reverted to non-threaded version. The advantages (robustness, 88 | # efficiency, ability to use with other Tk code, etc.) outweigh 89 | # the disadvantage that interactive use with IDLE is slightly more 90 | # cumbersome. 91 | # * Modified to run in either Python 2.x or 3.x (same file). 92 | # * Added Image.getPixmap() 93 | # * Added update() -- stand alone function to cause any pending 94 | # graphics changes to display. 95 | # 96 | # Version 3.4 10/16/07 97 | # Fixed GraphicsError to avoid "exploded" error messages. 
98 | # Version 3.3 8/8/06 99 | # Added checkMouse method to GraphWin 100 | # Version 3.2.3 101 | # Fixed error in Polygon init spotted by Andrew Harrington 102 | # Fixed improper threading in Image constructor 103 | # Version 3.2.2 5/30/05 104 | # Cleaned up handling of exceptions in Tk thread. The graphics package 105 | # now raises an exception if attempt is made to communicate with 106 | # a dead Tk thread. 107 | # Version 3.2.1 5/22/05 108 | # Added shutdown function for tk thread to eliminate race-condition 109 | # error "chatter" when main thread terminates 110 | # Renamed various private globals with _ 111 | # Version 3.2 5/4/05 112 | # Added Pixmap object for simple image manipulation. 113 | # Version 3.1 4/13/05 114 | # Improved the Tk thread communication so that most Tk calls 115 | # do not have to wait for synchonization with the Tk thread. 116 | # (see _tkCall and _tkExec) 117 | # Version 3.0 12/30/04 118 | # Implemented Tk event loop in separate thread. Should now work 119 | # interactively with IDLE. Undocumented autoflush feature is 120 | # no longer necessary. Its default is now False (off). It may 121 | # be removed in a future version. 122 | # Better handling of errors regarding operations on windows that 123 | # have been closed. 124 | # Addition of an isClosed method to GraphWindow class. 125 | 126 | # Version 2.2 8/26/04 127 | # Fixed cloning bug reported by Joseph Oldham. 128 | # Now implements deep copy of config info. 129 | # Version 2.1 1/15/04 130 | # Added autoflush option to GraphWin. When True (default) updates on 131 | # the window are done after each action. This makes some graphics 132 | # intensive programs sluggish. Turning off autoflush causes updates 133 | # to happen during idle periods or when flush is called. 134 | # Version 2.0 135 | # Updated Documentation 136 | # Made Polygon accept a list of Points in constructor 137 | # Made all drawing functions call TK update for easier animations 138 | # and to make the overall package work better with 139 | # Python 2.3 and IDLE 1.0 under Windows (still some issues). 140 | # Removed vestigial turtle graphics. 141 | # Added ability to configure font for Entry objects (analogous to Text) 142 | # Added setTextColor for Text as an alias of setFill 143 | # Changed to class-style exceptions 144 | # Fixed cloning of Text objects 145 | 146 | # Version 1.6 147 | # Fixed Entry so StringVar uses _root as master, solves weird 148 | # interaction with shell in Idle 149 | # Fixed bug in setCoords. X and Y coordinates can increase in 150 | # "non-intuitive" direction. 151 | # Tweaked wm_protocol so window is not resizable and kill box closes. 152 | 153 | # Version 1.5 154 | # Fixed bug in Entry. Can now define entry before creating a 155 | # GraphWin. All GraphWins are now toplevel windows and share 156 | # a fixed root (called _root). 157 | 158 | # Version 1.4 159 | # Fixed Garbage collection of Tkinter images bug. 160 | # Added ability to set text atttributes. 161 | # Added Entry boxes. 162 | 163 | import time, os, sys 164 | 165 | try: # import as appropriate for 2.x vs. 
3.x 166 | import tkinter as tk 167 | except: 168 | import Tkinter as tk 169 | 170 | 171 | ########################################################################## 172 | # Module Exceptions 173 | 174 | class GraphicsError(Exception): 175 | """Generic error class for graphics module exceptions.""" 176 | pass 177 | 178 | OBJ_ALREADY_DRAWN = "Object currently drawn" 179 | UNSUPPORTED_METHOD = "Object doesn't support operation" 180 | BAD_OPTION = "Illegal option value" 181 | 182 | ########################################################################## 183 | # global variables and funtions 184 | 185 | _root = tk.Tk() 186 | _root.withdraw() 187 | 188 | _update_lasttime = time.time() 189 | 190 | def update(rate=None): 191 | global _update_lasttime 192 | if rate: 193 | now = time.time() 194 | pauseLength = 1/rate-(now-_update_lasttime) 195 | if pauseLength > 0: 196 | time.sleep(pauseLength) 197 | _update_lasttime = now + pauseLength 198 | else: 199 | _update_lasttime = now 200 | 201 | _root.update() 202 | 203 | ############################################################################ 204 | # Graphics classes start here 205 | 206 | class GraphWin(tk.Canvas): 207 | 208 | """A GraphWin is a toplevel window for displaying graphics.""" 209 | 210 | def __init__(self, title="Graphics Window", 211 | width=200, height=200, autoflush=True): 212 | assert type(title) == type(""), "Title must be a string" 213 | master = tk.Toplevel(_root) 214 | master.protocol("WM_DELETE_WINDOW", self.close) 215 | tk.Canvas.__init__(self, master, width=width, height=height, 216 | highlightthickness=0, bd=0) 217 | self.master.title(title) 218 | self.pack() 219 | master.resizable(0,0) 220 | self.foreground = "black" 221 | self.items = [] 222 | self.mouseX = None 223 | self.mouseY = None 224 | self.bind("", self._onClick) 225 | self.bind_all("", self._onKey) 226 | self.height = int(height) 227 | self.width = int(width) 228 | self.autoflush = autoflush 229 | self._mouseCallback = None 230 | self.trans = None 231 | self.closed = False 232 | master.lift() 233 | self.lastKey = "" 234 | if autoflush: _root.update() 235 | 236 | def __repr__(self): 237 | if self.isClosed(): 238 | return "" 239 | else: 240 | return "GraphWin('{}', {}, {})".format(self.master.title(), 241 | self.getWidth(), 242 | self.getHeight()) 243 | 244 | def __str__(self): 245 | return repr(self) 246 | 247 | def __checkOpen(self): 248 | if self.closed: 249 | raise GraphicsError("window is closed") 250 | 251 | def _onKey(self, evnt): 252 | self.lastKey = evnt.keysym 253 | 254 | 255 | def setBackground(self, color): 256 | """Set background color of the window""" 257 | self.__checkOpen() 258 | self.config(bg=color) 259 | self.__autoflush() 260 | 261 | def setCoords(self, x1, y1, x2, y2): 262 | """Set coordinates of window to run from (x1,y1) in the 263 | lower-left corner to (x2,y2) in the upper-right corner.""" 264 | self.trans = Transform(self.width, self.height, x1, y1, x2, y2) 265 | self.redraw() 266 | 267 | def close(self): 268 | """Close the window""" 269 | 270 | if self.closed: return 271 | self.closed = True 272 | self.master.destroy() 273 | self.__autoflush() 274 | 275 | 276 | def isClosed(self): 277 | return self.closed 278 | 279 | 280 | def isOpen(self): 281 | return not self.closed 282 | 283 | 284 | def __autoflush(self): 285 | if self.autoflush: 286 | _root.update() 287 | 288 | 289 | def plot(self, x, y, color="black"): 290 | """Set pixel (x,y) to the given color""" 291 | self.__checkOpen() 292 | xs,ys = self.toScreen(x,y) 293 | 
self.create_line(xs,ys,xs+1,ys, fill=color) 294 | self.__autoflush() 295 | 296 | def plotPixel(self, x, y, color="black"): 297 | """Set pixel raw (independent of window coordinates) pixel 298 | (x,y) to color""" 299 | self.__checkOpen() 300 | self.create_line(x,y,x+1,y, fill=color) 301 | self.__autoflush() 302 | 303 | def flush(self): 304 | """Update drawing to the window""" 305 | self.__checkOpen() 306 | self.update_idletasks() 307 | 308 | def getMouse(self): 309 | """Wait for mouse click and return Point object representing 310 | the click""" 311 | self.update() # flush any prior clicks 312 | self.mouseX = None 313 | self.mouseY = None 314 | while self.mouseX == None or self.mouseY == None: 315 | self.update() 316 | if self.isClosed(): raise GraphicsError("getMouse in closed window") 317 | time.sleep(.1) # give up thread 318 | x,y = self.toWorld(self.mouseX, self.mouseY) 319 | self.mouseX = None 320 | self.mouseY = None 321 | return Point(x,y) 322 | 323 | def checkMouse(self): 324 | """Return last mouse click or None if mouse has 325 | not been clicked since last call""" 326 | if self.isClosed(): 327 | raise GraphicsError("checkMouse in closed window") 328 | self.update() 329 | if self.mouseX != None and self.mouseY != None: 330 | x,y = self.toWorld(self.mouseX, self.mouseY) 331 | self.mouseX = None 332 | self.mouseY = None 333 | return Point(x,y) 334 | else: 335 | return None 336 | 337 | def getKey(self): 338 | """Wait for user to press a key and return it as a string.""" 339 | self.lastKey = "" 340 | while self.lastKey == "": 341 | self.update() 342 | if self.isClosed(): raise GraphicsError("getKey in closed window") 343 | time.sleep(.1) # give up thread 344 | 345 | key = self.lastKey 346 | self.lastKey = "" 347 | return key 348 | 349 | def checkKey(self): 350 | """Return last key pressed or None if no key pressed since last call""" 351 | if self.isClosed(): 352 | raise GraphicsError("checkKey in closed window") 353 | self.update() 354 | key = self.lastKey 355 | self.lastKey = "" 356 | return key 357 | 358 | def getHeight(self): 359 | """Return the height of the window""" 360 | return self.height 361 | 362 | def getWidth(self): 363 | """Return the width of the window""" 364 | return self.width 365 | 366 | def toScreen(self, x, y): 367 | trans = self.trans 368 | if trans: 369 | return self.trans.screen(x,y) 370 | else: 371 | return x,y 372 | 373 | def toWorld(self, x, y): 374 | trans = self.trans 375 | if trans: 376 | return self.trans.world(x,y) 377 | else: 378 | return x,y 379 | 380 | def setMouseHandler(self, func): 381 | self._mouseCallback = func 382 | 383 | def _onClick(self, e): 384 | self.mouseX = e.x 385 | self.mouseY = e.y 386 | if self._mouseCallback: 387 | self._mouseCallback(Point(e.x, e.y)) 388 | 389 | def addItem(self, item): 390 | self.items.append(item) 391 | 392 | def delItem(self, item): 393 | self.items.remove(item) 394 | 395 | def redraw(self): 396 | for item in self.items[:]: 397 | item.undraw() 398 | item.draw(self) 399 | self.update() 400 | 401 | 402 | class Transform: 403 | 404 | """Internal class for 2-D coordinate transformations""" 405 | 406 | def __init__(self, w, h, xlow, ylow, xhigh, yhigh): 407 | # w, h are width and height of window 408 | # (xlow,ylow) coordinates of lower-left [raw (0,h-1)] 409 | # (xhigh,yhigh) coordinates of upper-right [raw (w-1,0)] 410 | xspan = (xhigh-xlow) 411 | yspan = (yhigh-ylow) 412 | self.xbase = xlow 413 | self.ybase = yhigh 414 | self.xscale = xspan/float(w-1) 415 | self.yscale = yspan/float(h-1) 416 | 417 | def 
screen(self,x,y): 418 | # Returns x,y in screen (actually window) coordinates 419 | xs = (x-self.xbase) / self.xscale 420 | ys = (self.ybase-y) / self.yscale 421 | return int(xs+0.5),int(ys+0.5) 422 | 423 | def world(self,xs,ys): 424 | # Returns xs,ys in world coordinates 425 | x = xs*self.xscale + self.xbase 426 | y = self.ybase - ys*self.yscale 427 | return x,y 428 | 429 | 430 | # Default values for various item configuration options. Only a subset of 431 | # keys may be present in the configuration dictionary for a given item 432 | DEFAULT_CONFIG = {"fill":"", 433 | "outline":"black", 434 | "width":"1", 435 | "arrow":"none", 436 | "text":"", 437 | "justify":"center", 438 | "font": ("helvetica", 12, "normal")} 439 | 440 | class GraphicsObject: 441 | 442 | """Generic base class for all of the drawable objects""" 443 | # A subclass of GraphicsObject should override _draw and 444 | # and _move methods. 445 | 446 | def __init__(self, options): 447 | # options is a list of strings indicating which options are 448 | # legal for this object. 449 | 450 | # When an object is drawn, canvas is set to the GraphWin(canvas) 451 | # object where it is drawn and id is the TK identifier of the 452 | # drawn shape. 453 | self.canvas = None 454 | self.id = None 455 | 456 | # config is the dictionary of configuration options for the widget. 457 | config = {} 458 | for option in options: 459 | config[option] = DEFAULT_CONFIG[option] 460 | self.config = config 461 | 462 | def setFill(self, color): 463 | """Set interior color to color""" 464 | self._reconfig("fill", color) 465 | 466 | def setOutline(self, color): 467 | """Set outline color to color""" 468 | self._reconfig("outline", color) 469 | 470 | def setWidth(self, width): 471 | """Set line weight to width""" 472 | self._reconfig("width", width) 473 | 474 | def draw(self, graphwin): 475 | 476 | """Draw the object in graphwin, which should be a GraphWin 477 | object. A GraphicsObject may only be drawn into one 478 | window. Raises an error if attempt made to draw an object that 479 | is already visible.""" 480 | 481 | if self.canvas and not self.canvas.isClosed(): raise GraphicsError(OBJ_ALREADY_DRAWN) 482 | if graphwin.isClosed(): raise GraphicsError("Can't draw to closed window") 483 | self.canvas = graphwin 484 | self.id = self._draw(graphwin, self.config) 485 | graphwin.addItem(self) 486 | if graphwin.autoflush: 487 | _root.update() 488 | return self 489 | 490 | 491 | def undraw(self): 492 | 493 | """Undraw the object (i.e. hide it). 
Returns silently if the 494 | object is not currently drawn.""" 495 | 496 | if not self.canvas: return 497 | if not self.canvas.isClosed(): 498 | self.canvas.delete(self.id) 499 | self.canvas.delItem(self) 500 | if self.canvas.autoflush: 501 | _root.update() 502 | self.canvas = None 503 | self.id = None 504 | 505 | 506 | def move(self, dx, dy): 507 | 508 | """move object dx units in x direction and dy units in y 509 | direction""" 510 | 511 | self._move(dx,dy) 512 | canvas = self.canvas 513 | if canvas and not canvas.isClosed(): 514 | trans = canvas.trans 515 | if trans: 516 | x = dx/ trans.xscale 517 | y = -dy / trans.yscale 518 | else: 519 | x = dx 520 | y = dy 521 | self.canvas.move(self.id, x, y) 522 | if canvas.autoflush: 523 | _root.update() 524 | 525 | def _reconfig(self, option, setting): 526 | # Internal method for changing configuration of the object 527 | # Raises an error if the option does not exist in the config 528 | # dictionary for this object 529 | if option not in self.config: 530 | raise GraphicsError(UNSUPPORTED_METHOD) 531 | options = self.config 532 | options[option] = setting 533 | if self.canvas and not self.canvas.isClosed(): 534 | self.canvas.itemconfig(self.id, options) 535 | if self.canvas.autoflush: 536 | _root.update() 537 | 538 | 539 | def _draw(self, canvas, options): 540 | """draws appropriate figure on canvas with options provided 541 | Returns Tk id of item drawn""" 542 | pass # must override in subclass 543 | 544 | 545 | def _move(self, dx, dy): 546 | """updates internal state of object to move it dx,dy units""" 547 | pass # must override in subclass 548 | 549 | 550 | class Point(GraphicsObject): 551 | def __init__(self, x, y): 552 | GraphicsObject.__init__(self, ["outline", "fill"]) 553 | self.setFill = self.setOutline 554 | self.x = float(x) 555 | self.y = float(y) 556 | 557 | def __repr__(self): 558 | return "Point({}, {})".format(self.x, self.y) 559 | 560 | def _draw(self, canvas, options): 561 | x,y = canvas.toScreen(self.x,self.y) 562 | return canvas.create_rectangle(x,y,x+1,y+1,options) 563 | 564 | def _move(self, dx, dy): 565 | self.x = self.x + dx 566 | self.y = self.y + dy 567 | 568 | def clone(self): 569 | other = Point(self.x,self.y) 570 | other.config = self.config.copy() 571 | return other 572 | 573 | def getX(self): return self.x 574 | def getY(self): return self.y 575 | 576 | class _BBox(GraphicsObject): 577 | # Internal base class for objects represented by bounding box 578 | # (opposite corners) Line segment is a degenerate case. 
579 | 580 | def __init__(self, p1, p2, options=["outline","width","fill"]): 581 | GraphicsObject.__init__(self, options) 582 | self.p1 = p1.clone() 583 | self.p2 = p2.clone() 584 | 585 | def _move(self, dx, dy): 586 | self.p1.x = self.p1.x + dx 587 | self.p1.y = self.p1.y + dy 588 | self.p2.x = self.p2.x + dx 589 | self.p2.y = self.p2.y + dy 590 | 591 | def getP1(self): return self.p1.clone() 592 | 593 | def getP2(self): return self.p2.clone() 594 | 595 | def getCenter(self): 596 | p1 = self.p1 597 | p2 = self.p2 598 | return Point((p1.x+p2.x)/2.0, (p1.y+p2.y)/2.0) 599 | 600 | 601 | class Rectangle(_BBox): 602 | 603 | def __init__(self, p1, p2): 604 | _BBox.__init__(self, p1, p2) 605 | 606 | def __repr__(self): 607 | return "Rectangle({}, {})".format(str(self.p1), str(self.p2)) 608 | 609 | def _draw(self, canvas, options): 610 | p1 = self.p1 611 | p2 = self.p2 612 | x1,y1 = canvas.toScreen(p1.x,p1.y) 613 | x2,y2 = canvas.toScreen(p2.x,p2.y) 614 | return canvas.create_rectangle(x1,y1,x2,y2,options) 615 | 616 | def clone(self): 617 | other = Rectangle(self.p1, self.p2) 618 | other.config = self.config.copy() 619 | return other 620 | 621 | 622 | class Oval(_BBox): 623 | 624 | def __init__(self, p1, p2): 625 | _BBox.__init__(self, p1, p2) 626 | 627 | def __repr__(self): 628 | return "Oval({}, {})".format(str(self.p1), str(self.p2)) 629 | 630 | 631 | def clone(self): 632 | other = Oval(self.p1, self.p2) 633 | other.config = self.config.copy() 634 | return other 635 | 636 | def _draw(self, canvas, options): 637 | p1 = self.p1 638 | p2 = self.p2 639 | x1,y1 = canvas.toScreen(p1.x,p1.y) 640 | x2,y2 = canvas.toScreen(p2.x,p2.y) 641 | return canvas.create_oval(x1,y1,x2,y2,options) 642 | 643 | class Circle(Oval): 644 | 645 | def __init__(self, center, radius): 646 | p1 = Point(center.x-radius, center.y-radius) 647 | p2 = Point(center.x+radius, center.y+radius) 648 | Oval.__init__(self, p1, p2) 649 | self.radius = radius 650 | 651 | def __repr__(self): 652 | return "Circle({}, {})".format(str(self.getCenter()), str(self.radius)) 653 | 654 | def clone(self): 655 | other = Circle(self.getCenter(), self.radius) 656 | other.config = self.config.copy() 657 | return other 658 | 659 | def getRadius(self): 660 | return self.radius 661 | 662 | 663 | class Line(_BBox): 664 | 665 | def __init__(self, p1, p2): 666 | _BBox.__init__(self, p1, p2, ["arrow","fill","width"]) 667 | self.setFill(DEFAULT_CONFIG['outline']) 668 | self.setOutline = self.setFill 669 | 670 | def __repr__(self): 671 | return "Line({}, {})".format(str(self.p1), str(self.p2)) 672 | 673 | def clone(self): 674 | other = Line(self.p1, self.p2) 675 | other.config = self.config.copy() 676 | return other 677 | 678 | def _draw(self, canvas, options): 679 | p1 = self.p1 680 | p2 = self.p2 681 | x1,y1 = canvas.toScreen(p1.x,p1.y) 682 | x2,y2 = canvas.toScreen(p2.x,p2.y) 683 | return canvas.create_line(x1,y1,x2,y2,options) 684 | 685 | def setArrow(self, option): 686 | if not option in ["first","last","both","none"]: 687 | raise GraphicsError(BAD_OPTION) 688 | self._reconfig("arrow", option) 689 | 690 | 691 | class Polygon(GraphicsObject): 692 | 693 | def __init__(self, *points): 694 | # if points passed as a list, extract it 695 | if len(points) == 1 and type(points[0]) == type([]): 696 | points = points[0] 697 | self.points = list(map(Point.clone, points)) 698 | GraphicsObject.__init__(self, ["outline", "width", "fill"]) 699 | 700 | def __repr__(self): 701 | return "Polygon"+str(tuple(p for p in self.points)) 702 | 703 | def clone(self): 704 | other = 
Polygon(*self.points) 705 | other.config = self.config.copy() 706 | return other 707 | 708 | def getPoints(self): 709 | return list(map(Point.clone, self.points)) 710 | 711 | def _move(self, dx, dy): 712 | for p in self.points: 713 | p.move(dx,dy) 714 | 715 | def _draw(self, canvas, options): 716 | args = [canvas] 717 | for p in self.points: 718 | x,y = canvas.toScreen(p.x,p.y) 719 | args.append(x) 720 | args.append(y) 721 | args.append(options) 722 | return GraphWin.create_polygon(*args) 723 | 724 | class Text(GraphicsObject): 725 | 726 | def __init__(self, p, text): 727 | GraphicsObject.__init__(self, ["justify","fill","text","font"]) 728 | self.setText(text) 729 | self.anchor = p.clone() 730 | self.setFill(DEFAULT_CONFIG['outline']) 731 | self.setOutline = self.setFill 732 | 733 | def __repr__(self): 734 | return "Text({}, '{}')".format(self.anchor, self.getText()) 735 | 736 | def _draw(self, canvas, options): 737 | p = self.anchor 738 | x,y = canvas.toScreen(p.x,p.y) 739 | return canvas.create_text(x,y,options) 740 | 741 | def _move(self, dx, dy): 742 | self.anchor.move(dx,dy) 743 | 744 | def clone(self): 745 | other = Text(self.anchor, self.config['text']) 746 | other.config = self.config.copy() 747 | return other 748 | 749 | def setText(self,text): 750 | self._reconfig("text", text) 751 | 752 | def getText(self): 753 | return self.config["text"] 754 | 755 | def getAnchor(self): 756 | return self.anchor.clone() 757 | 758 | def setFace(self, face): 759 | if face in ['helvetica','arial','courier','times roman']: 760 | f,s,b = self.config['font'] 761 | self._reconfig("font",(face,s,b)) 762 | else: 763 | raise GraphicsError(BAD_OPTION) 764 | 765 | def setSize(self, size): 766 | if 5 <= size <= 36: 767 | f,s,b = self.config['font'] 768 | self._reconfig("font", (f,size,b)) 769 | else: 770 | raise GraphicsError(BAD_OPTION) 771 | 772 | def setStyle(self, style): 773 | if style in ['bold','normal','italic', 'bold italic']: 774 | f,s,b = self.config['font'] 775 | self._reconfig("font", (f,s,style)) 776 | else: 777 | raise GraphicsError(BAD_OPTION) 778 | 779 | def setTextColor(self, color): 780 | self.setFill(color) 781 | 782 | 783 | class Entry(GraphicsObject): 784 | 785 | def __init__(self, p, width): 786 | GraphicsObject.__init__(self, []) 787 | self.anchor = p.clone() 788 | #print self.anchor 789 | self.width = width 790 | self.text = tk.StringVar(_root) 791 | self.text.set("") 792 | self.fill = "gray" 793 | self.color = "black" 794 | self.font = DEFAULT_CONFIG['font'] 795 | self.entry = None 796 | 797 | def __repr__(self): 798 | return "Entry({}, {})".format(self.anchor, self.width) 799 | 800 | def _draw(self, canvas, options): 801 | p = self.anchor 802 | x,y = canvas.toScreen(p.x,p.y) 803 | frm = tk.Frame(canvas.master) 804 | self.entry = tk.Entry(frm, 805 | width=self.width, 806 | textvariable=self.text, 807 | bg = self.fill, 808 | fg = self.color, 809 | font=self.font) 810 | self.entry.pack() 811 | #self.setFill(self.fill) 812 | self.entry.focus_set() 813 | return canvas.create_window(x,y,window=frm) 814 | 815 | def getText(self): 816 | return self.text.get() 817 | 818 | def _move(self, dx, dy): 819 | self.anchor.move(dx,dy) 820 | 821 | def getAnchor(self): 822 | return self.anchor.clone() 823 | 824 | def clone(self): 825 | other = Entry(self.anchor, self.width) 826 | other.config = self.config.copy() 827 | other.text = tk.StringVar() 828 | other.text.set(self.text.get()) 829 | other.fill = self.fill 830 | return other 831 | 832 | def setText(self, t): 833 | self.text.set(t) 834 | 835 | 
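(Editor's aside.) The drawing classes above (GraphWin, Point, Circle, Rectangle, Text) are the building blocks presumably used to visualize the continuous gridworld; they need a display, which is why main.py exposes a --silent flag for servers. A small illustrative sketch, with the start and goal coordinates borrowed from the --smol configuration in main.py; this is not code from continous_grids.py:

```
from graphics import GraphWin, Point, Circle, Rectangle, Text

win = GraphWin("GridWorld", 400, 400)
win.setCoords(0, 0, 30, 30)                      # work in world coordinates, not pixels

goal = Rectangle(Point(21, 21), Point(23, 23))   # goal region around (22, 22)
goal.setFill("green")
goal.draw(win)

agent = Circle(Point(8, 8), 0.5)                 # agent at the (8, 8) start position
agent.setFill("blue")
agent.draw(win)

Text(Point(15, 28), "start=(8,8), goal=(22,22)").draw(win)
agent.move(1.0, 0.5)                             # animate one environment step
win.getMouse()                                   # wait for a click, then close
win.close()
```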
836 | def setFill(self, color): 837 | self.fill = color 838 | if self.entry: 839 | self.entry.config(bg=color) 840 | 841 | 842 | def _setFontComponent(self, which, value): 843 | font = list(self.font) 844 | font[which] = value 845 | self.font = tuple(font) 846 | if self.entry: 847 | self.entry.config(font=self.font) 848 | 849 | 850 | def setFace(self, face): 851 | if face in ['helvetica','arial','courier','times roman']: 852 | self._setFontComponent(0, face) 853 | else: 854 | raise GraphicsError(BAD_OPTION) 855 | 856 | def setSize(self, size): 857 | if 5 <= size <= 36: 858 | self._setFontComponent(1,size) 859 | else: 860 | raise GraphicsError(BAD_OPTION) 861 | 862 | def setStyle(self, style): 863 | if style in ['bold','normal','italic', 'bold italic']: 864 | self._setFontComponent(2,style) 865 | else: 866 | raise GraphicsError(BAD_OPTION) 867 | 868 | def setTextColor(self, color): 869 | self.color=color 870 | if self.entry: 871 | self.entry.config(fg=color) 872 | 873 | 874 | class Image(GraphicsObject): 875 | 876 | idCount = 0 877 | imageCache = {} # tk photoimages go here to avoid GC while drawn 878 | 879 | def __init__(self, p, *pixmap): 880 | GraphicsObject.__init__(self, []) 881 | self.anchor = p.clone() 882 | self.imageId = Image.idCount 883 | Image.idCount = Image.idCount + 1 884 | if len(pixmap) == 1: # file name provided 885 | self.img = tk.PhotoImage(file=pixmap[0], master=_root) 886 | else: # width and height provided 887 | width, height = pixmap 888 | self.img = tk.PhotoImage(master=_root, width=width, height=height) 889 | 890 | def __repr__(self): 891 | return "Image({}, {}, {})".format(self.anchor, self.getWidth(), self.getHeight()) 892 | 893 | def _draw(self, canvas, options): 894 | p = self.anchor 895 | x,y = canvas.toScreen(p.x,p.y) 896 | self.imageCache[self.imageId] = self.img # save a reference 897 | return canvas.create_image(x,y,image=self.img) 898 | 899 | def _move(self, dx, dy): 900 | self.anchor.move(dx,dy) 901 | 902 | def undraw(self): 903 | try: 904 | del self.imageCache[self.imageId] # allow gc of tk photoimage 905 | except KeyError: 906 | pass 907 | GraphicsObject.undraw(self) 908 | 909 | def getAnchor(self): 910 | return self.anchor.clone() 911 | 912 | def clone(self): 913 | other = Image(Point(0,0), 0, 0) 914 | other.img = self.img.copy() 915 | other.anchor = self.anchor.clone() 916 | other.config = self.config.copy() 917 | return other 918 | 919 | def getWidth(self): 920 | """Returns the width of the image in pixels""" 921 | return self.img.width() 922 | 923 | def getHeight(self): 924 | """Returns the height of the image in pixels""" 925 | return self.img.height() 926 | 927 | def getPixel(self, x, y): 928 | """Returns a list [r,g,b] with the RGB color values for pixel (x,y) 929 | r,g,b are in range(256) 930 | 931 | """ 932 | 933 | value = self.img.get(x,y) 934 | if type(value) == type(0): 935 | return [value, value, value] 936 | elif type(value) == type((0,0,0)): 937 | return list(value) 938 | else: 939 | return list(map(int, value.split())) 940 | 941 | def setPixel(self, x, y, color): 942 | """Sets pixel (x,y) to the given color 943 | 944 | """ 945 | self.img.put("{" + color +"}", (x, y)) 946 | 947 | 948 | def save(self, filename): 949 | """Saves the pixmap image to filename. 950 | The format for the save image is determined from the filname extension. 
951 | 952 | """ 953 | 954 | path, name = os.path.split(filename) 955 | ext = name.split(".")[-1] 956 | self.img.write( filename, format=ext) 957 | 958 | 959 | def color_rgb(r,g,b): 960 | """r,g,b are intensities of red, green, and blue in range(256) 961 | Returns color specifier string for the resulting color""" 962 | return "#%02x%02x%02x" % (r,g,b) 963 | 964 | def test(): 965 | win = GraphWin() 966 | win.setCoords(0,0,10,10) 967 | t = Text(Point(5,5), "Centered Text") 968 | t.draw(win) 969 | p = Polygon(Point(1,1), Point(5,3), Point(2,7)) 970 | p.draw(win) 971 | e = Entry(Point(5,6), 10) 972 | e.draw(win) 973 | win.getMouse() 974 | p.setFill("red") 975 | p.setOutline("blue") 976 | p.setWidth(2) 977 | s = "" 978 | for pt in p.getPoints(): 979 | s = s + "(%0.1f,%0.1f) " % (pt.getX(), pt.getY()) 980 | t.setText(e.getText()) 981 | e.setFill("green") 982 | e.setText("Spam!") 983 | e.move(2,0) 984 | win.getMouse() 985 | p.move(2,3) 986 | s = "" 987 | for pt in p.getPoints(): 988 | s = s + "(%0.1f,%0.1f) " % (pt.getX(), pt.getY()) 989 | t.setText(s) 990 | win.getMouse() 991 | p.undraw() 992 | e.undraw() 993 | t.setStyle("bold") 994 | win.getMouse() 995 | t.setStyle("normal") 996 | win.getMouse() 997 | t.setStyle("italic") 998 | win.getMouse() 999 | t.setStyle("bold italic") 1000 | win.getMouse() 1001 | t.setSize(14) 1002 | win.getMouse() 1003 | t.setFace("arial") 1004 | t.setSize(20) 1005 | win.getMouse() 1006 | win.close() 1007 | 1008 | #MacOS fix 2 1009 | #tk.Toplevel(_root).destroy() 1010 | 1011 | # MacOS fix 1 1012 | update() 1013 | 1014 | if __name__ == "__main__": 1015 | test() 1016 | -------------------------------------------------------------------------------- /pytorch-soft-actor-critic/main.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import os 3 | import pickle 4 | import time 5 | from comet_ml import Experiment 6 | import json 7 | import gym 8 | from tqdm import tqdm 9 | import numpy as np 10 | import itertools 11 | import torch 12 | import ipdb 13 | from sac import SAC 14 | from normalized_actions import NormalizedActions 15 | from replay_memory import ReplayMemory 16 | from continous_grids import GridWorld 17 | 18 | 19 | def main(args): 20 | # Environment 21 | if args.make_cont_grid: 22 | if args.smol: 23 | dense_goals = [] 24 | if args.dense_goals: 25 | dense_goals = [(13.0, 8.0), (18.0, 11.0), (20.0, 15.0), (22.0, 19.0)] 26 | env = GridWorld(max_episode_len=500, num_rooms=1, action_limit_max=1.0, silent_mode=args.silent, 27 | start_position=(8.0, 8.0), goal_position=(22.0, 22.0), goal_reward=+100.0, 28 | dense_goals=dense_goals, dense_reward=+5, 29 | grid_len=30) 30 | env_name = "SmallGridWorld" 31 | elif args.tiny: 32 | env = GridWorld(max_episode_len=500, num_rooms=0, action_limit_max=1.0, silent_mode=args.silent, 33 | start_position=(5.0, 5.0), goal_position=(15.0, 15.0), goal_reward=+100.0, 34 | dense_goals=[], dense_reward=+0, 35 | grid_len=20) 36 | env_name = "TinyGridWorld" 37 | elif args.twotiny: 38 | env = GridWorld(max_episode_len=500, num_rooms=1, action_limit_max=1.0, silent_mode=args.silent, 39 | start_position=(5.0, 5.0), goal_position=(15.0, 15.0), goal_reward=+100.0, 40 | dense_goals=[], dense_reward=+0, 41 | grid_len=20, door_breadth=3) 42 | env_name = "TwoTinyGridWorld" 43 | elif args.threetiny: 44 | env = GridWorld(max_episode_len=500, num_rooms=0, action_limit_max=1.0, silent_mode=args.silent, 45 | start_position=(8.0, 8.0), goal_position=(22.0, 22.0), goal_reward=+100.0, 46 | dense_goals=[], 
dense_reward=+0, 47 | grid_len=30) 48 | env_name = "ThreeGridWorld" 49 | else: 50 | dense_goals = [] 51 | if args.dense_goals: 52 | dense_goals = [(35.0, 25.0), (45.0, 25.0), (55.0, 25.0), (68.0, 33.0), (75.0, 45.0), (75.0, 55.0), 53 | (75.0, 65.0)] 54 | env = GridWorld(max_episode_len=1000, num_rooms=1, action_limit_max=1.0, silent_mode=args.silent, 55 | dense_goals=dense_goals) 56 | env_name = "VeryLargeGridWorld" 57 | else: 58 | env = NormalizedActions(gym.make(args.env_name)) 59 | 60 | env.seed(args.seed) 61 | torch.manual_seed(args.seed) 62 | np.random.seed(args.seed) 63 | 64 | # Agent 65 | agent = SAC(env.observation_space.shape[0], env.action_space, args) 66 | 67 | # Memory 68 | memory = ReplayMemory(args.replay_size) 69 | 70 | # Training Loop 71 | rewards = [] 72 | test_rewards = [] 73 | total_numsteps = 0 74 | updates = 0 75 | 76 | if args.debug: 77 | args.use_logger = False 78 | 79 | # Check if settings file 80 | if os.path.isfile("settings.json"): 81 | with open('settings.json') as f: 82 | data = json.load(f) 83 | args.comet_apikey = data["apikey"] 84 | args.comet_username = data["username"] 85 | else: 86 | raise NotImplementedError 87 | 88 | experiment_id = None 89 | if args.comet: 90 | experiment = Experiment(api_key=args.comet_apikey,\ 91 | project_name="florl",auto_output_logging="None",\ 92 | workspace=args.comet_username,auto_metric_logging=False,\ 93 | auto_param_logging=False) 94 | experiment.set_name(args.namestr) 95 | args.experiment = experiment 96 | experiment_id = experiment.id 97 | 98 | if args.make_cont_grid: 99 | # The following lines are for visual purposes 100 | traj = [] 101 | imp_states = [] 102 | 103 | for i_episode in itertools.count(): 104 | state = env.reset() 105 | if args.make_cont_grid: 106 | traj.append(state) 107 | 108 | episode_reward = 0 109 | while True: 110 | if args.start_steps > total_numsteps: 111 | action = env.action_space.sample() 112 | else: 113 | action = agent.select_action(state) # Sample action from policy 114 | time.sleep(.002) 115 | next_state, reward, done, _ = env.step(action) # Step 116 | #Visual 117 | if args.make_cont_grid: 118 | traj.append(next_state) 119 | if total_numsteps % 10000 == 0 and total_numsteps != 0: 120 | imp_states.append(next_state) 121 | 122 | # Save current trajectories to JSON 123 | filename = 'run_data/{}_{}_{}_{}_{}.txt'.format(args.policy, args.env_name, 124 | experiment_id, args.namestr, total_numsteps) 125 | with open(filename, 'wb') as f: 126 | pickle.dump(traj, f) 127 | 128 | 129 | 130 | mask = not done # 1 for not done and 0 for done 131 | memory.push(state, action, reward, next_state, mask) # Append transition to memory 132 | if len(memory) > args.batch_size: 133 | for i in range(args.updates_per_step): # Number of updates per step in environment 134 | # Sample a batch from memory 135 | state_batch, action_batch, reward_batch, next_state_batch,\ 136 | mask_batch = memory.sample(args.batch_size) 137 | # Update parameters of all the networks 138 | value_loss, critic_1_loss, critic_2_loss, policy_loss,\ 139 | ent_loss, alpha = agent.update_parameters(state_batch,\ 140 | action_batch,reward_batch,next_state_batch,mask_batch, updates) 141 | 142 | if args.comet: 143 | args.experiment.log_metric("Loss Value", value_loss,step=updates) 144 | args.experiment.log_metric("Loss Critic 1",critic_1_loss,step=updates) 145 | args.experiment.log_metric("Loss Critic 2",critic_2_loss,step=updates) 146 | args.experiment.log_metric("Loss Policy",policy_loss,step=updates) 147 | 
args.experiment.log_metric("Entropy",ent_loss,step=updates) 148 | args.experiment.log_metric("Entropy Temperature",alpha,step=updates) 149 | updates += 1 150 | 151 | state = next_state 152 | total_numsteps += 1 153 | episode_reward += reward 154 | 155 | if done: 156 | break 157 | 158 | if total_numsteps > args.num_steps: 159 | break 160 | 161 | rewards.append(episode_reward) 162 | if args.comet: 163 | args.experiment.log_metric("Train Reward",episode_reward,step=i_episode) 164 | args.experiment.log_metric("Average Train Reward",\ 165 | np.round(np.mean(rewards[-100:]),2),step=i_episode) 166 | print("Episode: {}, total numsteps: {}, reward: {}, average reward: {}".format(i_episode,\ 167 | total_numsteps, np.round(rewards[-1],2),\ 168 | np.round(np.mean(rewards[-100:]),2))) 169 | 170 | if i_episode % 10 == 0 and args.eval == True: 171 | state = torch.Tensor([env.reset()]) 172 | episode_reward = 0 173 | while True: 174 | action = agent.select_action(state, eval=True) 175 | next_state, reward, done, _ = env.step(action) 176 | episode_reward += reward 177 | 178 | state = next_state 179 | if done: 180 | break 181 | 182 | if args.comet: 183 | args.experiment.log_metric("Test Reward", episode_reward, step=i_episode) 184 | 185 | test_rewards.append(episode_reward) 186 | print("----------------------------------------") 187 | print("Test Episode: {}, reward: {}".format(i_episode, test_rewards[-1])) 188 | print("----------------------------------------") 189 | if args.make_cont_grid: 190 | #Visual 191 | # env.vis_trajectory(np.asarray(traj), args.namestr, experiment_id, np.asarray(imp_states)) 192 | 193 | # Save final trajectories to JSON 194 | filename = 'run_data/finalrun_{}_{}_{}_{}_{}.txt'.format(args.policy, args.env_name, 195 | experiment_id, args.namestr, total_numsteps) 196 | with open(filename, 'wb') as f: 197 | pickle.dump(traj, f) 198 | env.test_vis_trajectory(np.asarray(traj), args.namestr, args.heatmap_title, experiment_id, 199 | args.heatmap_normalize, args.heatmap_vertical_clip_value) 200 | 201 | env.close() 202 | 203 | if __name__ == '__main__': 204 | """ 205 | Process command-line arguments, then call main() 206 | """ 207 | parser = argparse.ArgumentParser(description='PyTorch REINFORCE example') 208 | parser.add_argument('--env-name', default=None, 209 | help='name of the environment to run') 210 | parser.add_argument('--policy', default="Gaussian", 211 | help='algorithm to use: Gaussian | Deterministic') 212 | parser.add_argument('--eval', type=bool, default=True, 213 | help='Evaluates a policy a policy every 10 episode (default:True)') 214 | parser.add_argument('--gamma', type=float, default=0.99, metavar='G', 215 | help='discount factor for reward (default: 0.99)') 216 | parser.add_argument('--tau', type=float, default=0.005, metavar='G', 217 | help='target smoothing coefficient (default: 0.005)') 218 | parser.add_argument('--lr', type=float, default=0.0003, metavar='G', 219 | help='learning rate (default: 0.0003)') 220 | parser.add_argument('--alpha', type=float, default=0.1, metavar='G', 221 | help='Temperature parameter α determines the relative importance of the entropy term against the reward (default: 0.1)') 222 | parser.add_argument('--automatic_entropy_tuning', type=bool, default=False, metavar='G', 223 | help='Temperature parameter α automaically adjusted.') 224 | parser.add_argument('--seed', type=int, default=456, metavar='N', 225 | help='random seed (default: 456)') 226 | parser.add_argument('--batch_size', type=int, default=256, metavar='N', 227 | help='batch size 
(default: 256)') 228 | parser.add_argument('--clip', type=int, default=1, metavar='N', 229 | help='Clipping for gradient norm') 230 | parser.add_argument('--num_steps', type=int, default=1000000, metavar='N', 231 | help='maximum number of steps (default: 1000000)') 232 | parser.add_argument('--hidden_size', type=int, default=256, metavar='N', 233 | help='hidden size (default: 256)') 234 | parser.add_argument('--updates_per_step', type=int, default=1, metavar='N', 235 | help='model updates per simulator step (default: 1)') 236 | parser.add_argument('--start_steps', type=int, default=10000, metavar='N', 237 | help='Steps sampling random actions (default: 10000)') 238 | parser.add_argument('--target_update_interval', type=int, default=1, metavar='N', 239 | help='Value target update per no. of updates per step (default: 1)') 240 | parser.add_argument('--replay_size', type=int, default=1000000, metavar='N', 241 | help='size of replay buffer (default: 10000000)') 242 | parser.add_argument("--comet", action="store_true", default=False,help='Use comet for logging') 243 | parser.add_argument('--debug', default=False, action='store_true',help='Debug') 244 | parser.add_argument('--namestr', type=str, default='FloRL', \ 245 | help='additional info in output filename to describe experiments') 246 | parser.add_argument('--n_blocks', type=int, default=5,\ 247 | help='Number of blocks to stack in a model (MADE in MAF; Coupling+BN in RealNVP).') 248 | parser.add_argument('--n_components', type=int, default=1,\ 249 | help='Number of Gaussian clusters for mixture of gaussians models.') 250 | parser.add_argument('--flow_hidden_size', type=int, default=100,\ 251 | help='Hidden layer size for MADE (and each MADE block in an MAF).') 252 | parser.add_argument('--n_hidden', type=int, default=1, help='Number of hidden layers in each MADE.') 253 | parser.add_argument('--activation_fn', type=str, default='relu',\ 254 | help='What activation function to use in the MADEs.') 255 | parser.add_argument('--input_order', type=str, default='sequential',\ 256 | help='What input order to use (sequential | random).') 257 | parser.add_argument('--conditional', default=False, action='store_true',\ 258 | help='Whether to use a conditional model.') 259 | parser.add_argument('--no_batch_norm', action='store_true') 260 | parser.add_argument('--flow_model', default='maf', help='Which model to use: made, maf.') 261 | 262 | # flags for using reparameterization trick or not 263 | parser.add_argument('--reparam',dest='reparam',action='store_true') 264 | parser.add_argument('--no-reparam',dest='reparam',action='store_false') 265 | # flags for using a tanh activation or not 266 | parser.add_argument('--tanh', dest='tanh', action='store_true') 267 | parser.add_argument('--no-tanh', dest='tanh', action='store_false') 268 | # For different gridworld environments 269 | parser.add_argument('--make_cont_grid', default=False, action='store_true', help='Make GridWorld') 270 | parser.add_argument('--dense_goals', default=False, action='store_true', help='Create sub-goals') 271 | parser.add_argument("--smol", action="store_true", default=False, help='Change to a smaller sized gridworld') 272 | parser.add_argument("--tiny", action="store_true", default=False, help='Change to the smallest sized gridworld') 273 | parser.add_argument("--twotiny", action="store_true", default=False, 274 | help='Change to 2x the smallest sized gridworld') 275 | parser.add_argument("--threetiny", action="store_true", default=False, 276 | help='Change to 3x the smallest 
sized gridworld') 277 | parser.add_argument("--silent", action="store_true", default=False, 278 | help='Display graphical output. Set to true when running on a server.') 279 | parser.add_argument('--heatmap_title', default='Continuous GridWorld Trajectories') 280 | parser.add_argument('--heatmap_normalize', default=False, action='store_true') 281 | parser.add_argument('--heatmap_vertical_clip_value', type=int, default=2500) 282 | 283 | parser.set_defaults(reparam=True, tanh=True) 284 | 285 | args = parser.parse_args() 286 | args.cond_label_size = None 287 | main(args) 288 | 289 | -------------------------------------------------------------------------------- /pytorch-soft-actor-critic/model.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import os 3 | import torch 4 | import ipdb 5 | import torch.nn as nn 6 | import torch.nn.functional as F 7 | from torch.distributions import Normal, Exponential, LogNormal, Laplace 8 | 9 | LOG_SIG_MAX = 2 10 | LOG_SIG_MIN = -20 11 | epsilon = 1e-6 12 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 13 | 14 | # Initialize Policy weights 15 | def weights_init_(m): 16 | classname = m.__class__.__name__ 17 | if classname.find('Linear') != -1: 18 | torch.nn.init.xavier_uniform_(m.weight, gain=1) 19 | torch.nn.init.constant_(m.bias, 0) 20 | 21 | 22 | class ValueNetwork(nn.Module): 23 | def __init__(self, num_inputs, hidden_dim): 24 | super(ValueNetwork, self).__init__() 25 | 26 | self.linear1 = nn.Linear(num_inputs, hidden_dim) 27 | self.linear2 = nn.Linear(hidden_dim, hidden_dim) 28 | self.linear3 = nn.Linear(hidden_dim, 1) 29 | 30 | self.apply(weights_init_) 31 | 32 | def forward(self, state): 33 | x = F.relu(self.linear1(state)) 34 | x = F.relu(self.linear2(x)) 35 | x = self.linear3(x) 36 | return x 37 | 38 | 39 | class QNetwork(nn.Module): 40 | def __init__(self, num_inputs, num_actions, hidden_dim): 41 | super(QNetwork, self).__init__() 42 | 43 | # Q1 architecture 44 | self.linear1 = nn.Linear(num_inputs + num_actions, hidden_dim) 45 | self.linear2 = nn.Linear(hidden_dim, hidden_dim) 46 | self.linear3 = nn.Linear(hidden_dim, 1) 47 | 48 | # Q2 architecture 49 | self.linear4 = nn.Linear(num_inputs + num_actions, hidden_dim) 50 | self.linear5 = nn.Linear(hidden_dim, hidden_dim) 51 | self.linear6 = nn.Linear(hidden_dim, 1) 52 | 53 | self.apply(weights_init_) 54 | 55 | def forward(self, state, action): 56 | x1 = torch.cat([state, action], 1) 57 | x1 = F.relu(self.linear1(x1)) 58 | x1 = F.relu(self.linear2(x1)) 59 | x1 = self.linear3(x1) 60 | 61 | x2 = torch.cat([state, action], 1) 62 | x2 = F.relu(self.linear4(x2)) 63 | x2 = F.relu(self.linear5(x2)) 64 | x2 = self.linear6(x2) 65 | 66 | return x1, x2 67 | 68 | 69 | class GaussianPolicy(nn.Module): 70 | def __init__(self, num_inputs, num_actions, hidden_dim, args): 71 | super(GaussianPolicy, self).__init__() 72 | 73 | self.linear1 = nn.Linear(num_inputs, hidden_dim) 74 | self.linear2 = nn.Linear(hidden_dim, hidden_dim) 75 | 76 | self.mean_linear = nn.Linear(hidden_dim, num_actions) 77 | self.log_std_linear = nn.Linear(hidden_dim, num_actions) 78 | 79 | self.apply(weights_init_) 80 | 81 | self.tanh = args.tanh 82 | 83 | def encode(self, state): 84 | x = F.relu(self.linear1(state)) 85 | x = F.relu(self.linear2(x)) 86 | mean = self.mean_linear(x) 87 | log_std = self.log_std_linear(x) 88 | log_std = torch.clamp(log_std, min=LOG_SIG_MIN, max=LOG_SIG_MAX) 89 | return mean, log_std 90 | 91 | def forward(self, state, reparam = False): 92 | 
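# Editor's note (not in the original source): with reparam=True the sample below is
# normal.rsample(), i.e. x_t = mean + std * eps with eps ~ N(0, I), so gradients flow
# back through mean and std (the reparameterization trick enabled by the --reparam flag);
# normal.sample() draws from the same distribution but blocks those gradients. When
# args.tanh is set, the action is tanh(x_t) and the log-density is corrected by
# subtracting log(1 - tanh(x_t)^2 + epsilon) per action dimension (change of variables)
# before summing over dimensions.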
mean, log_std = self.encode(state) 93 | std = log_std.exp() 94 | normal = Normal(mean, std) 95 | 96 | if reparam == True: 97 | x_t = normal.rsample() # for reparameterization trick (mean + std * N(0,1)) 98 | else: 99 | x_t = normal.sample() 100 | 101 | if self.tanh: 102 | action = torch.tanh(x_t) 103 | else: 104 | action = x_t 105 | 106 | log_prob = normal.log_prob(x_t) 107 | 108 | if self.tanh: 109 | # Enforcing Action Bound 110 | log_prob -= torch.log(1 - action.pow(2) + epsilon) 111 | log_prob = log_prob.sum(1, keepdim=True) 112 | 113 | return action, log_prob, x_t, mean, log_std 114 | 115 | 116 | class ExponentialPolicy(nn.Module): 117 | def __init__(self, num_inputs, num_actions, hidden_dim, args): 118 | super(ExponentialPolicy, self).__init__() 119 | 120 | self.linear1 = nn.Linear(num_inputs, hidden_dim) 121 | self.linear2 = nn.Linear(hidden_dim, hidden_dim) 122 | 123 | self.rate_linear = nn.Linear(hidden_dim, num_actions) 124 | self.apply(weights_init_) 125 | 126 | self.tanh = args.tanh 127 | 128 | def encode(self, state): 129 | x = F.relu(self.linear1(state)) 130 | x = F.relu(self.linear2(x)) 131 | log_rate = self.rate_linear(x) 132 | return log_rate 133 | 134 | def forward(self, state, reparam = False): 135 | log_rate = self.encode(state) 136 | log_rate = torch.clamp(log_rate, min=LOG_SIG_MIN, max=LOG_SIG_MAX) 137 | rate = torch.exp(log_rate) 138 | exponential = Exponential(rate) 139 | 140 | # whether or not to use reparametrization trick 141 | if reparam == True: 142 | x_t = exponential.rsample() 143 | else: 144 | x_t = exponential.sample() 145 | # whether or not to add tanh 146 | if self.tanh: 147 | action = torch.tanh(x_t) 148 | else: 149 | action = x_t 150 | 151 | log_prob = exponential.log_prob(x_t) 152 | mean = exponential.mean 153 | std = torch.sqrt(exponential.variance) 154 | log_std = torch.log(std) 155 | if self.tanh: 156 | # Enforcing Action Bound 157 | log_prob -= torch.log(1 - action.pow(2) + epsilon) 158 | log_prob = log_prob.sum(1, keepdim=True) 159 | 160 | return action, log_prob, x_t, mean, log_std 161 | 162 | 163 | class LogNormalPolicy(nn.Module): 164 | def __init__(self, num_inputs, num_actions, hidden_dim, args): 165 | super(LogNormalPolicy, self).__init__() 166 | 167 | self.linear1 = nn.Linear(num_inputs, hidden_dim) 168 | self.linear2 = nn.Linear(hidden_dim, hidden_dim) 169 | 170 | self.mean_linear = nn.Linear(hidden_dim, num_actions) 171 | self.std_linear = nn.Linear(hidden_dim, num_actions) 172 | 173 | self.apply(weights_init_) 174 | 175 | self.tanh = args.tanh 176 | 177 | def encode(self, state): 178 | x = F.relu(self.linear1(state)) 179 | x = F.relu(self.linear2(x)) 180 | mean = self.mean_linear(x) 181 | std = self.std_linear(x) 182 | std = torch.clamp(std, min=0 , max=LOG_SIG_MAX) # standard deviation has to be > 0 183 | return mean, std 184 | 185 | def forward(self, state, reparam = False): 186 | mean, std = self.encode(state) 187 | log_normal = LogNormal(mean, std) 188 | 189 | # whether or not to use reparametrization trick 190 | if reparam == True: 191 | x_t = log_normal.rsample() 192 | else: 193 | x_t = log_normal.sample() 194 | 195 | # whether or not to add tanh 196 | if self.tanh: 197 | action = torch.tanh(x_t) 198 | else: 199 | action = x_t 200 | 201 | log_prob = log_normal.log_prob(x_t) 202 | 203 | if self.tanh: 204 | # Enforcing Action Bound 205 | log_prob -= torch.log(1 - action.pow(2) + epsilon) 206 | log_prob = log_prob.sum(1, keepdim=True) 207 | 208 | # get mean and standard deviation of the distr 209 | mean = log_normal.mean 210 | std = 
torch.sqrt(log_normal.variance) 211 | log_std = torch.log(std) 212 | return action, log_prob, x_t, mean, log_std 213 | 214 | 215 | class LaplacePolicy(nn.Module): 216 | def __init__(self, num_inputs, num_actions, hidden_dim, args): 217 | super(LaplacePolicy, self).__init__() 218 | 219 | self.linear1 = nn.Linear(num_inputs, hidden_dim) 220 | self.linear2 = nn.Linear(hidden_dim, hidden_dim) 221 | 222 | self.mean_linear = nn.Linear(hidden_dim, num_actions) 223 | self.log_scale_linear = nn.Linear(hidden_dim, num_actions) 224 | 225 | self.apply(weights_init_) 226 | 227 | self.tanh = args.tanh 228 | 229 | def encode(self, state): 230 | x = F.relu(self.linear1(state)) 231 | x = F.relu(self.linear2(x)) 232 | mean = self.mean_linear(x) 233 | log_scale = self.log_scale_linear(x) 234 | log_scale = torch.clamp(log_scale, min=LOG_SIG_MIN, max=LOG_SIG_MAX) 235 | return mean, log_scale 236 | 237 | def forward(self, state, reparam = False): 238 | mean, log_scale = self.encode(state) 239 | scale = torch.exp(log_scale) 240 | laplace = Laplace(mean, scale) 241 | 242 | if reparam == True: 243 | x_t = laplace.rsample() 244 | else: 245 | x_t = laplace.sample() 246 | 247 | if self.tanh: 248 | action = torch.tanh(x_t) 249 | else: 250 | action = x_t 251 | 252 | log_prob = laplace.log_prob(x_t) 253 | std = torch.sqrt(laplace.variance) 254 | log_std = torch.log(std) 255 | 256 | if self.tanh: 257 | # Enforcing Action Bound 258 | log_prob -= torch.log(1 - action.pow(2) + epsilon) 259 | log_prob = log_prob.sum(1, keepdim=True) 260 | 261 | mean = laplace.mean 262 | std = torch.sqrt(laplace.variance) 263 | log_std = torch.log(std) 264 | return action, log_prob, x_t, mean, log_std 265 | 266 | 267 | class DeterministicPolicy(nn.Module): 268 | def __init__(self, num_inputs, num_actions, hidden_dim): 269 | super(DeterministicPolicy, self).__init__() 270 | self.linear1 = nn.Linear(num_inputs, hidden_dim) 271 | self.linear2 = nn.Linear(hidden_dim, hidden_dim) 272 | 273 | self.mean = nn.Linear(hidden_dim, num_actions) 274 | self.noise = torch.Tensor(num_actions) 275 | 276 | self.apply(weights_init_) 277 | 278 | def encode(self, state): 279 | x = F.relu(self.linear1(state)) 280 | x = F.relu(self.linear2(x)) 281 | mean = torch.tanh(self.mean(x)) 282 | return mean 283 | 284 | def forward(self, state): 285 | mean = self.forward(state) 286 | noise = self.noise.normal_(0., std=0.1) 287 | noise = noise.clamp(-0.25, 0.25) 288 | action = mean + noise 289 | return action, torch.tensor(0.), torch.tensor(0.), mean, torch.tensor(0.) 
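(Editor's aside.) Every policy above returns the same five-tuple (action, log_prob, pre-squash sample, mean, log_std), which is what lets sac.py swap them via the --policy flag. One caveat: DeterministicPolicy.forward calls self.forward(state) on its first line, which would recurse indefinitely; self.encode(state) appears to be what was intended. A minimal sketch of the shared interface using GaussianPolicy, assuming model.py's own imports (including ipdb) resolve; the sizes are illustrative only:

```
import argparse
import torch
from model import GaussianPolicy  # model.py from this repo

args = argparse.Namespace(tanh=True)          # only args.tanh is read by the policy
policy = GaussianPolicy(num_inputs=2, num_actions=2, hidden_dim=64, args=args)

state = torch.randn(8, 2)                     # batch of 8 two-dimensional states
action, log_prob, x_t, mean, log_std = policy(state, reparam=True)
print(action.shape)    # torch.Size([8, 2]), tanh-squashed into (-1, 1)
print(log_prob.shape)  # torch.Size([8, 1]), summed over action dimensions
```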
290 | 291 | -------------------------------------------------------------------------------- /pytorch-soft-actor-critic/normalized_actions.py: -------------------------------------------------------------------------------- 1 | import gym 2 | 3 | 4 | class NormalizedActions(gym.ActionWrapper): 5 | 6 | def action(self, action): 7 | action = (action + 1) / 2 # [-1, 1] => [0, 1] 8 | action *= (self.action_space.high - self.action_space.low) 9 | action += self.action_space.low 10 | return action 11 | 12 | def _reverse_action(self, action): 13 | action -= self.action_space.low 14 | action /= (self.action_space.high - self.action_space.low) 15 | action = action * 2 - 1 16 | return action 17 | -------------------------------------------------------------------------------- /pytorch-soft-actor-critic/plots/plot_comet.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import csv 3 | import json 4 | import os 5 | from statistics import mean 6 | 7 | from comet_ml import API 8 | import matplotlib 9 | import numpy as np 10 | 11 | matplotlib.use('Agg') 12 | import matplotlib.pyplot as plt 13 | import seaborn as sns 14 | 15 | # Set plotting style 16 | sns.set_context('paper', font_scale=1.3) 17 | sns.set_style('whitegrid') 18 | sns.set_palette('colorblind') 19 | plt.rcParams['text.usetex'] = True 20 | 21 | 22 | def extract_excel_data(source_filename): 23 | with open(source_filename, 'r') as csvfile: 24 | csvreader = csv.reader(csvfile) 25 | rows = {row[0]: row[1:] for row in csvreader} 26 | 27 | labels = {} 28 | labels['title'] = rows.get('filename')[0] 29 | labels['x_label'] = rows.get('xlabel')[0] 30 | labels['y_label'] = rows.get('ylabel')[0] 31 | labels['metric'] = rows.get('metric')[0] 32 | 33 | data = {key: value for key, value in rows.items() if 'experiment' in key.lower()} 34 | labels['experiments'] = [key.split(':')[1] for key, value in data.items()] 35 | 36 | return labels, data 37 | 38 | 39 | def connect_to_comet(): 40 | if os.path.isfile("settings.json"): 41 | with open("settings.json") as f: 42 | keys = json.load(f) 43 | comet_apikey = keys.get("apikey") 44 | comet_username = keys.get("username") 45 | comet_restapikey = keys.get("restapikey") 46 | comet_project = keys.get("project") 47 | 48 | print("COMET_REST_API_KEY=%s" %(comet_restapikey)) 49 | with open('.env', 'w') as writer: 50 | writer.write("COMET_API_KEY=%s\n" %(comet_apikey)) 51 | writer.write("COMET_REST_API_KEY=%s\n" %(comet_restapikey)) 52 | 53 | comet_api = API() 54 | return comet_api, comet_username, comet_project 55 | 56 | 57 | def truncate_exp(data_experiments): 58 | last_data_points = [run[-1] for data_run in data_experiments for run in data_run] 59 | run_end_times = [timestep for timestep, value in last_data_points] 60 | earliest_end_time = min(run_end_times) 61 | 62 | clean_data_experiments = [] 63 | for exp in data_experiments: 64 | clean_data_runs = [] 65 | for run in exp: 66 | clean_data_runs.append({x: y for x, y in run if x <= earliest_end_time}) 67 | clean_data_experiments.append(clean_data_runs) 68 | 69 | return clean_data_experiments 70 | 71 | 72 | def get_data(title, x_label, y_label, metric, data): 73 | if not title or not x_label or not y_label or not metric: 74 | print("Error in reading CSV file. Ensure filename, x and y labels, and metric are present.") 75 | exit(1) 76 | 77 | comet_api, comet_username, comet_project = connect_to_comet() 78 | 79 | # Accumulate data for all experiments. 
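(Editor's aside.) connect_to_comet() above, like main.py, expects a settings.json with Comet credentials, and extract_excel_data() expects a plot_source.csv keyed by its first column, where any row whose key contains "experiment" lists Comet experiment keys. A hypothetical pair of inputs, written out from Python to stay self-contained; every value below is a placeholder:

```
import csv
import json

# settings.json: "apikey"/"username" are also read by main.py; "restapikey" and
# "project" are read only by this plotting script. All values are placeholders.
settings = {
    "apikey": "YOUR_COMET_API_KEY",
    "username": "your-comet-workspace",
    "restapikey": "YOUR_COMET_REST_API_KEY",
    "project": "florl",
}
with open("settings.json", "w") as f:
    json.dump(settings, f, indent=2)

# plot_source.csv: the first column is the key; "metric" must name a metric logged
# to Comet (e.g. "Average Train Reward" from main.py); experiment rows list run keys.
rows = [
    ["filename", "SparseGridWorld"],
    ["xlabel", "Steps"],
    ["ylabel", "Average Train Reward"],
    ["metric", "Average Train Reward"],
    ["Experiment:Gaussian", "comet_exp_key_1", "comet_exp_key_2"],
    ["Experiment:MAF", "comet_exp_key_3"],
]
with open("plot_source.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```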
80 | data_experiments = [] 81 | for exp_name, runs in data.items(): 82 | # Accumulate data for all runs of a given experiment. 83 | data_runs = [] 84 | if len(runs) > 0: 85 | for exp_key in runs: 86 | raw_data = comet_api.get("%s/%s/%s" %(comet_username, comet_project, exp_key)) 87 | data_points = raw_data.metrics_raw[metric] 88 | data_runs.append(data_points) 89 | 90 | data_experiments.append(data_runs) 91 | 92 | clean_data_experiments = truncate_exp(data_experiments) 93 | return clean_data_experiments 94 | 95 | 96 | def plot(**kwargs): 97 | labels = kwargs.get('labels') 98 | data = kwargs.get('data') 99 | 100 | # Setup figure 101 | fig = plt.figure(figsize=(12, 8)) 102 | ax = plt.subplot() 103 | for label in (ax.get_xticklabels()): 104 | label.set_fontname('Arial') 105 | label.set_fontsize(28) 106 | for label in (ax.get_yticklabels()): 107 | label.set_fontname('Arial') 108 | label.set_fontsize(28) 109 | plt.ticklabel_format(style='sci', axis='x', scilimits=(0, 0)) 110 | ax.xaxis.get_offset_text().set_fontsize(20) 111 | axis_font = {'fontname': 'Arial', 'size': '32'} 112 | colors = sns.color_palette('colorblind', n_colors=len(data)) 113 | 114 | # Plot data 115 | for runs, label, color in zip(data, labels.get('experiments'), colors): 116 | unique_x_values = set() 117 | for run in runs: 118 | for key in run.keys(): 119 | unique_x_values.add(key) 120 | x_values = sorted(unique_x_values) 121 | 122 | # Plot mean and standard deviation of all runs 123 | y_values_mean = [] 124 | y_values_std = [] 125 | 126 | for x in x_values: 127 | y_values_mean.append(mean([run.get(x) for run in runs if run.get(x)])) 128 | y_values_std.append(np.std([run.get(x) for run in runs if run.get(x)])) 129 | 130 | # Plot std 131 | ax.fill_between(x_values, np.add(np.array(y_values_mean), np.array(y_values_std)), 132 | np.subtract(np.array(y_values_mean), np.array(y_values_std)), 133 | alpha=0.3, 134 | edgecolor=color, facecolor=color) 135 | # Plot mean 136 | plt.plot(x_values, y_values_mean, color=color, linewidth=1.5, label=label) 137 | 138 | # Label figure 139 | ax.legend(loc='lower right', prop={'size': 26}) 140 | ax.set_xlabel(labels.get('x_label'), **axis_font) 141 | ax.set_ylabel(labels.get('y_label'), **axis_font) 142 | fig.subplots_adjust(bottom=0.2) 143 | fig.subplots_adjust(left=0.2) 144 | ax.set_title(labels.get('title'), **axis_font) 145 | 146 | fig.savefig('../install/{}.pdf'.format(labels.get('title'))) 147 | 148 | return 149 | 150 | 151 | def main(args): 152 | source_filename = args.source_filename 153 | 154 | labels, data = extract_excel_data(source_filename) 155 | data_experiments = get_data(labels.get('title'), labels.get('x_label'), labels.get('y_label'), labels.get('metric') 156 | , data) 157 | plot(labels=labels, data=data_experiments) 158 | 159 | 160 | if __name__ == '__main__': 161 | parser = argparse.ArgumentParser() 162 | parser.add_argument('--source_filename', default='plot_source.csv') 163 | args = parser.parse_args() 164 | 165 | main(args) 166 | -------------------------------------------------------------------------------- /pytorch-soft-actor-critic/replay_memory.py: -------------------------------------------------------------------------------- 1 | import random 2 | import numpy as np 3 | from collections import namedtuple 4 | import torch 5 | 6 | class ReplayMemory: 7 | def __init__(self, capacity): 8 | self.capacity = capacity 9 | self.buffer = [] 10 | self.position = 0 11 | 12 | def push(self, state, action, reward, next_state, done): 13 | if len(self.buffer) < self.capacity: 14 | 
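# Editor's note (not in the original source): this is a ring buffer; the list below
# grows until `capacity` is reached, after which `position` wraps around and the
# oldest transitions are overwritten in place.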
self.buffer.append(None) 15 | self.buffer[self.position] = (state, action, reward, next_state, done) 16 | self.position = (self.position + 1) % self.capacity 17 | 18 | def sample(self, batch_size): 19 | batch = random.sample(self.buffer, batch_size) 20 | state, action, reward, next_state, done = map(np.stack, zip(*batch)) 21 | return state, action, reward, next_state, done 22 | 23 | def __len__(self): 24 | return len(self.buffer) 25 | -------------------------------------------------------------------------------- /pytorch-soft-actor-critic/sac.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import os 3 | import numpy as np 4 | import torch 5 | import torch.nn.functional as F 6 | import torch.nn as nn 7 | from torch.optim import Adam 8 | from utils import soft_update, hard_update 9 | from model import GaussianPolicy, ExponentialPolicy, LogNormalPolicy, LaplacePolicy, QNetwork, ValueNetwork, DeterministicPolicy 10 | from flows import * 11 | import ipdb 12 | 13 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 14 | 15 | class SAC(object): 16 | def __init__(self, num_inputs, action_space, args): 17 | 18 | self.num_inputs = num_inputs 19 | self.action_space = action_space.shape[0] 20 | self.gamma = args.gamma 21 | self.tau = args.tau 22 | self.clip = args.clip 23 | 24 | self.policy_type = args.policy 25 | self.target_update_interval = args.target_update_interval 26 | self.automatic_entropy_tuning = args.automatic_entropy_tuning 27 | 28 | self.critic = QNetwork(self.num_inputs, self.action_space,\ 29 | args.hidden_size).to(device) 30 | self.critic_optim = Adam(self.critic.parameters(), lr=args.lr) 31 | self.alpha = args.alpha 32 | self.tanh = args.tanh 33 | self.reparam = args.reparam 34 | 35 | if self.policy_type == "Gaussian" or self.policy_type == "Exponential" or self.policy_type == "LogNormal" or self.policy_type == "Laplace": 36 | # Target Entropy = −dim(A) (e.g. 
-6 for HalfCheetah-v2) as given in the paper
37 |             if self.automatic_entropy_tuning == True:
38 |                 self.target_entropy = -torch.prod(torch.Tensor(action_space.shape)).item()
39 |                 self.log_alpha = torch.zeros(1, requires_grad=True, device=device)  # create on the target device so log_alpha stays a leaf tensor that Adam can update
40 |                 self.alpha_optim = Adam([self.log_alpha], lr=args.lr)
41 |             else:
42 |                 pass
43 | 
44 |             if self.policy_type == "Gaussian":
45 |                 self.policy = GaussianPolicy(self.num_inputs, self.action_space,\
46 |                     args.hidden_size,args).to(device)
47 |             elif self.policy_type == "Exponential":
48 |                 self.policy = ExponentialPolicy(self.num_inputs, self.action_space,\
49 |                     args.hidden_size,args).to(device)
50 |             elif self.policy_type == "LogNormal":
51 |                 self.policy = LogNormalPolicy(self.num_inputs, self.action_space,\
52 |                     args.hidden_size,args).to(device)
53 |             elif self.policy_type == "Laplace":
54 |                 self.policy = LaplacePolicy(self.num_inputs, self.action_space,\
55 |                     args.hidden_size,args).to(device)
56 | 
57 |             self.policy_optim = Adam(self.policy.parameters(), lr=args.lr,weight_decay=1e-6)
58 | 
59 |             self.value = ValueNetwork(self.num_inputs,\
60 |                 args.hidden_size).to(device)
61 |             self.value_target = ValueNetwork(self.num_inputs,\
62 |                 args.hidden_size).to(device)
63 |             self.value_optim = Adam(self.value.parameters(), lr=args.lr)
64 |             hard_update(self.value_target, self.value)
65 |         elif self.policy_type == "Flow":
66 |             if args.flow_model == 'made':
67 |                 self.policy = MADE(self.action_space,self.num_inputs,args.hidden_size,
68 |                     args.n_hidden, args.cond_label_size,
69 |                     args.activation_fn,
70 |                     args.input_order).to(device)
71 |             elif args.flow_model == 'mademog':
72 |                 assert args.n_components > 1, 'Specify more than 1 component for mixture of gaussians models.'
73 |                 self.policy = MADEMOG(args.n_components, self.num_inputs,
74 |                     self.action_space, args.flow_hidden_size,
75 |                     args.n_hidden, args.cond_label_size,
76 |                     args.activation_fn,
77 |                     args.input_order).to(device)
78 |             elif args.flow_model == 'maf':
79 |                 self.policy = MAF(args.n_blocks,self.num_inputs,self.action_space,
80 |                     args.flow_hidden_size, args.n_hidden,
81 |                     args.cond_label_size, args.activation_fn,
82 |                     args.input_order, batch_norm=not
83 |                     args.no_batch_norm).to(device)
84 |             elif args.flow_model == 'mafmog':
85 |                 assert args.n_components > 1, 'Specify more than 1 component for mixture of gaussians models.'
86 | self.policy = MAFMOG(args.n_blocks,self.num_inputs,args.n_components, 87 | self.action_space, args.flow_hidden_size, 88 | args.n_hidden, args.cond_label_size, 89 | args.activation_fn,args.input_order, 90 | batch_norm=not 91 | args.no_batch_norm).to(device) 92 | elif args.flow_model =='realnvp': 93 | self.policy = RealNVP(args.n_blocks,self.num_inputs,self.action_space, 94 | args.flow_hidden_size,args.n_hidden, 95 | args.cond_label_size,batch_norm=not 96 | args.no_batch_norm).to(device) 97 | elif args.flow_model =='planar': 98 | self.policy = PlanarBase(args.n_blocks,self.num_inputs,self.action_space, 99 | args.flow_hidden_size,args.n_hidden,device).to(device) 100 | else: 101 | raise ValueError('Unrecognized model.') 102 | self.policy_optim = Adam(self.policy.parameters(), lr=args.lr, weight_decay=1e-6) 103 | self.value = ValueNetwork(self.num_inputs,\ 104 | args.hidden_size).to(device) 105 | self.value_target = ValueNetwork(self.num_inputs,\ 106 | args.hidden_size).to(device) 107 | self.value_optim = Adam(self.value.parameters(), lr=args.lr) 108 | hard_update(self.value_target, self.value) 109 | else: 110 | self.policy = DeterministicPolicy(self.num_inputs, self.action_space, args.hidden_size) 111 | self.policy_optim = Adam(self.policy.parameters(), lr=args.lr) 112 | 113 | self.critic_target = QNetwork(self.num_inputs, self.action_space,\ 114 | args.hidden_size).to(device) 115 | hard_update(self.critic_target, self.critic) 116 | 117 | def select_action(self, state, eval=False): 118 | state = torch.FloatTensor(state).to(device).unsqueeze(0) 119 | if eval == False: 120 | self.policy.train() 121 | if len(state.size()) > 2: 122 | state = state.view(-1,self.num_inputs) 123 | action, _, _, _, _ = self.policy(state, reparam = self.reparam) 124 | else: 125 | self.policy.eval() 126 | if len(state.size()) > 2: 127 | state = state.view(-1,self.num_inputs) 128 | if self.policy_type != 'Flow': 129 | _, _, _, action, _ = self.policy(state, reparam=self.reparam) 130 | else: 131 | _, _, _, action, _ = self.policy.inverse(state) 132 | if self.policy_type == "Gaussian" or self.policy_type == "Exponential" or self.policy_type == "LogNormal" or self.policy_type == "Laplace": 133 | if self.tanh: 134 | action = torch.tanh(action) 135 | elif self.policy_type == "Flow": 136 | if self.tanh: 137 | action = torch.tanh(action) 138 | else: 139 | pass 140 | action = action.detach().cpu().numpy() 141 | return action[0] 142 | 143 | def update_parameters(self, state_batch, action_batch, reward_batch, next_state_batch, mask_batch, updates): 144 | state_batch = torch.FloatTensor(state_batch).to(device) 145 | next_state_batch = torch.FloatTensor(next_state_batch).to(device) 146 | action_batch = torch.FloatTensor(action_batch).to(device) 147 | reward_batch = torch.FloatTensor(reward_batch).to(device).unsqueeze(1) 148 | mask_batch = torch.FloatTensor(np.float32(mask_batch)).to(device).unsqueeze(1) 149 | 150 | """ 151 | Use two Q-functions to mitigate positive bias in the policy improvement step that is known 152 | to degrade performance of value based methods. Two Q-functions also significantly speed 153 | up training, especially on harder task. 
154 | """ 155 | expected_q1_value, expected_q2_value = self.critic(state_batch, action_batch) 156 | if self.policy_type == 'Flow': 157 | new_action, log_prob, _, mean, log_std = self.policy.inverse(state_batch) 158 | else: 159 | new_action, log_prob, _, mean, log_std = self.policy(state_batch, reparam=self.reparam) 160 | 161 | if self.policy_type == "Gaussian" or self.policy_type == "Exponential" or self.policy_type == "LogNormal" or self.policy_type == "Laplace" or self.policy_type == 'Flow': 162 | if self.automatic_entropy_tuning: 163 | """ 164 | Alpha Loss 165 | """ 166 | alpha_loss = -(self.log_alpha * (log_prob + self.target_entropy).detach()).mean() 167 | self.alpha_optim.zero_grad() 168 | alpha_loss.backward() 169 | self.alpha_optim.step() 170 | self.alpha = self.log_alpha.exp() 171 | alpha_logs = self.alpha.clone() # For TensorboardX logs 172 | else: 173 | alpha_loss = torch.tensor(0.) 174 | alpha_logs = self.alpha # For TensorboardX logs 175 | 176 | 177 | """ 178 | Including a separate function approximator for the soft value can stabilize training. 179 | """ 180 | expected_value = self.value(state_batch) 181 | target_value = self.value_target(next_state_batch) 182 | next_q_value = reward_batch + mask_batch * self.gamma * (target_value).detach() 183 | else: 184 | """ 185 | There is no need in principle to include a separate function approximator for the state value. 186 | We use a target critic network for deterministic policy and eradicate the value value network completely. 187 | """ 188 | alpha_loss = torch.tensor(0.) 189 | alpha_logs = self.alpha # For TensorboardX logs 190 | next_state_action, _, _, _, _, = self.policy(next_state_batch, reparam =self.reparam) 191 | target_critic_1, target_critic_2 = self.critic_target(next_state_batch, next_state_action) 192 | target_critic = torch.min(target_critic_1, target_critic_2) 193 | next_q_value = reward_batch + mask_batch * self.gamma * (target_critic).detach() 194 | 195 | """ 196 | Soft Q-function parameters can be trained to minimize the soft Bellman residual 197 | JQ = 𝔼(st,at)~D[0.5(Q1(st,at) - r(st,at) - γ(𝔼st+1~p[V(st+1)]))^2] 198 | ∇JQ = ∇Q(st,at)(Q(st,at) - r(st,at) - γV(target)(st+1)) 199 | """ 200 | q1_value_loss = F.mse_loss(expected_q1_value, next_q_value) 201 | q2_value_loss = F.mse_loss(expected_q2_value, next_q_value) 202 | q1_new, q2_new = self.critic(state_batch, new_action) 203 | expected_new_q_value = torch.min(q1_new, q2_new) 204 | 205 | if self.policy_type == "Gaussian" or self.policy_type == "Exponential" or self.policy_type == "LogNormal" or self.policy_type == "Laplace" or self.policy_type == 'Flow': 206 | """ 207 | Including a separate function approximator for the soft value can stabilize training and is convenient to 208 | train simultaneously with the other networks 209 | Update the V towards the min of two Q-functions in order to reduce overestimation bias from function approximation error. 
210 | JV = 𝔼st~D[0.5(V(st) - (𝔼at~π[Qmin(st,at) - α * log π(at|st)]))^2] 211 | ∇JV = ∇V(st)(V(st) - Q(st,at) + (α * logπ(at|st))) 212 | """ 213 | next_value = expected_new_q_value - (self.alpha * log_prob) 214 | value_loss = F.mse_loss(expected_value, next_value.detach()) 215 | else: 216 | pass 217 | # whether to use reparameterization trick or not 218 | if self.reparam == True: 219 | """ 220 | Reparameterization trick is used to get a low variance estimator 221 | f(εt;st) = action sampled from the policy 222 | εt is an input noise vector, sampled from some fixed distribution 223 | Jπ = 𝔼st∼D,εt∼N[α * logπ(f(εt;st)|st) − Q(st,f(εt;st))] 224 | ∇Jπ = ∇log π + ([∇at (α * logπ(at|st)) − ∇at Q(st,at)])∇f(εt;st) 225 | """ 226 | policy_loss = ((self.alpha * log_prob) - expected_new_q_value).mean() 227 | else: 228 | log_prob_target = expected_new_q_value - expected_value 229 | policy_loss = (log_prob * ((self.alpha * log_prob) - log_prob_target).detach() ).mean() 230 | 231 | # Regularization Loss 232 | if self.policy_type == "Gaussian" or self.policy_type == "Exponential" or self.policy_type == "LogNormal" or self.policy_type == "Laplace": 233 | mean_loss = 0.001 * mean.pow(2).mean() 234 | std_loss = 0.001 * log_std.pow(2).mean() 235 | policy_loss += mean_loss + std_loss 236 | 237 | self.critic_optim.zero_grad() 238 | q1_value_loss.backward() 239 | self.critic_optim.step() 240 | 241 | self.critic_optim.zero_grad() 242 | q2_value_loss.backward() 243 | self.critic_optim.step() 244 | 245 | if self.policy_type == "Gaussian" or self.policy_type == "Exponential" or self.policy_type == "LogNormal" or self.policy_type == "Laplace": 246 | self.value_optim.zero_grad() 247 | value_loss.backward() 248 | self.value_optim.step() 249 | else: 250 | value_loss = torch.tensor(0.) 
251 | 252 | self.policy_optim.zero_grad() 253 | policy_loss.backward() 254 | if self.policy_type == 'Exponential' or self.policy_type == "LogNormal" or self.policy_type == "Laplace" or self.policy_type == 'Flow': 255 | torch.nn.utils.clip_grad_norm_(self.policy.parameters(),self.clip) 256 | self.policy_optim.step() 257 | 258 | # clip weights of policy network to insure the values don't blow up 259 | for p in self.policy.parameters(): 260 | p.data.clamp_(-10*self.clip, 10*self.clip) 261 | 262 | """ 263 | We update the target weights to match the current value function weights periodically 264 | Update target parameter after every n(args.target_update_interval) updates 265 | """ 266 | if updates % self.target_update_interval == 0 and self.policy_type == "Deterministic": 267 | soft_update(self.critic_target, self.critic, self.tau) 268 | elif updates % self.target_update_interval == 0 and (self.policy_type == "Gaussian" or self.policy_type == "Exponential" or self.policy_type == "LogNormal"): 269 | soft_update(self.value_target, self.value, self.tau) 270 | 271 | # calculate the entropy 272 | with torch.no_grad(): 273 | entropy = -(log_prob.mean()) 274 | 275 | # ipdb.set_trace() #alpha_loss.item() 276 | return value_loss.item(), q1_value_loss.item(), q2_value_loss.item(), policy_loss.item(),entropy, alpha_logs 277 | 278 | # Save model parameters 279 | def save_model(self, env_name, suffix="", actor_path=None, critic_path=None, value_path=None): 280 | if not os.path.exists('models/'): 281 | os.makedirs('models/') 282 | 283 | if actor_path is None: 284 | actor_path = "models/sac_actor_{}_{}".format(env_name, suffix) 285 | if critic_path is None: 286 | critic_path = "models/sac_critic_{}_{}".format(env_name, suffix) 287 | if value_path is None: 288 | value_path = "models/sac_value_{}_{}".format(env_name, suffix) 289 | print('Saving models to {}, {} and {}'.format(actor_path, critic_path, value_path)) 290 | torch.save(self.value.state_dict(), value_path) 291 | torch.save(self.policy.state_dict(), actor_path) 292 | torch.save(self.critic.state_dict(), critic_path) 293 | 294 | # Load model parameters 295 | def load_model(self, actor_path, critic_path, value_path): 296 | print('Loading models from {}, {} and {}'.format(actor_path, critic_path, value_path)) 297 | if actor_path is not None: 298 | self.policy.load_state_dict(torch.load(actor_path)) 299 | if critic_path is not None: 300 | self.critic.load_state_dict(torch.load(critic_path)) 301 | if value_path is not None: 302 | self.value.load_state_dict(torch.load(value_path)) 303 | 304 | -------------------------------------------------------------------------------- /pytorch-soft-actor-critic/scripts/run_contgridworld_exp.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | 3 | set -x 4 | 5 | #echo "Getting into the script" 6 | 7 | # Script to run multiple batches of experiments together. 8 | # Can also be used for different hyperparameter settings. 
9 | 10 | # Run the following scripts in parallel: 11 | 12 | COMET_DISABLE_AUTO_LOGGING=1 python main.py --namestr=E-S-DG-CG --make_cont_grid --batch_size=128 --replay_size=100000 --hidden_size=64 --num_steps=200000 --policy=Exponential --smol --comet --dense_goals --silent --seed=0 & 13 | COMET_DISABLE_AUTO_LOGGING=1 python main.py --namestr=E-S-DG-CG --make_cont_grid --batch_size=128 --replay_size=100000 --hidden_size=64 --num_steps=200000 --policy=Exponential --smol --comet --dense_goals --silent --seed=2 & 14 | COMET_DISABLE_AUTO_LOGGING=1 python main.py --namestr=E-S-DG-CG --make_cont_grid --batch_size=128 --replay_size=100000 --hidden_size=64 --num_steps=200000 --policy=Exponential --smol --comet --dense_goals --silent --seed=4 15 | 16 | #COMET_DISABLE_AUTO_LOGGING=1 python main.py --namestr=E-S-CG --make_cont_grid --batch_size=128 --replay_size=100000 --hidden_size=64 --num_steps=150000 --policy=Exponential --smol --comet --silent --alpha=0.1 & 17 | #COMET_DISABLE_AUTO_LOGGING=1 python main.py --namestr=E-S-CG --make_cont_grid --batch_size=128 --replay_size=100000 --hidden_size=64 --num_steps=150000 --policy=Exponential --smol --comet --silent --alpha=0.2 & 18 | #COMET_DISABLE_AUTO_LOGGING=1 python main.py --namestr=E-S-CG --make_cont_grid --batch_size=128 --replay_size=100000 --hidden_size=64 --num_steps=150000 --policy=Exponential --smol --comet --silent --alpha=0.3 & 19 | #COMET_DISABLE_AUTO_LOGGING=1 python main.py --namestr=E-S-CG --make_cont_grid --batch_size=128 --replay_size=100000 --hidden_size=64 --num_steps=150000 --policy=Exponential --smol --comet --silent --alpha=0.4 & 20 | #COMET_DISABLE_AUTO_LOGGING=1 python main.py --namestr=E-S-CG --make_cont_grid --batch_size=128 --replay_size=100000 --hidden_size=64 --num_steps=150000 --policy=Exponential --smol --comet --silent --alpha=0.5 & -------------------------------------------------------------------------------- /pytorch-soft-actor-critic/scripts/run_contgridworld_gauss.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | 3 | set -x 4 | 5 | #echo "Getting into the script" 6 | 7 | # Script to run multiple batches of experiments together. 8 | # Can also be used for different hyperparameter settings. 
9 | 10 | # Run the following scripts in parallel: 11 | 12 | #COMET_DISABLE_AUTO_LOGGING=1 python main.py --namestr=G-S-DG-CG --make_cont_grid --batch_size=128 --replay_size=100000 --hidden_size=64 --num_steps=100000 --policy=Gaussian --smol --comet --dense_goals --silent & 13 | #COMET_DISABLE_AUTO_LOGGING=1 python main.py --namestr=G-S-DG-CG --make_cont_grid --batch_size=128 --replay_size=100000 --hidden_size=64 --num_steps=100000 --policy=Gaussian --smol --comet --dense_goals --silent 14 | 15 | # Baby parameter sweep 16 | #COMET_DISABLE_AUTO_LOGGING=1 python main.py --namestr=G-S-CG --make_cont_grid --batch_size=128 --replay_size=100000 --hidden_size=64 --num_steps=200000 --policy=Gaussian --smol --comet --alpha=0.1 --silent & 17 | #COMET_DISABLE_AUTO_LOGGING=1 python main.py --namestr=G-S-CG --make_cont_grid --batch_size=128 --replay_size=100000 --hidden_size=64 --num_steps=200000 --policy=Gaussian --smol --comet --alpha=0.2 --silent & 18 | #COMET_DISABLE_AUTO_LOGGING=1 python main.py --namestr=G-S-CG --make_cont_grid --batch_size=128 --replay_size=100000 --hidden_size=64 --num_steps=200000 --policy=Gaussian --smol --comet --alpha=0.3 --silent & 19 | #COMET_DISABLE_AUTO_LOGGING=1 python main.py --namestr=G-S-CG --make_cont_grid --batch_size=128 --replay_size=100000 --hidden_size=64 --num_steps=200000 --policy=Gaussian --smol --comet --alpha=0.4 --silent & 20 | #COMET_DISABLE_AUTO_LOGGING=1 python main.py --namestr=G-S-CG --make_cont_grid --batch_size=128 --replay_size=100000 --hidden_size=64 --num_steps=200000 --policy=Gaussian --smol --comet --alpha=0.5 --silent 21 | 22 | COMET_DISABLE_AUTO_LOGGING=1 python main.py --namestr=G-S-CG --make_cont_grid --batch_size=128 --replay_size=100000 --hidden_size=64 --num_steps=250000 --policy=Gaussian --smol --comet --alpha=0.2 --silent --seed=0 & 23 | COMET_DISABLE_AUTO_LOGGING=1 python main.py --namestr=G-S-CG --make_cont_grid --batch_size=128 --replay_size=100000 --hidden_size=64 --num_steps=250000 --policy=Gaussian --smol --comet --alpha=0.2 --silent --seed=3 & 24 | COMET_DISABLE_AUTO_LOGGING=1 python main.py --namestr=G-S-CG --make_cont_grid --batch_size=128 --replay_size=100000 --hidden_size=64 --num_steps=250000 --policy=Gaussian --smol --comet --alpha=0.2 --silent --seed=4 & -------------------------------------------------------------------------------- /pytorch-soft-actor-critic/settings.json: -------------------------------------------------------------------------------- 1 | {"username": "florl", "apikey": "q1ucYPwQZ5VndrGsUdtcHAv8y", "restapikey":"6KiaAJZv83a9meof3qkrFx15F", "project":"florl"} -------------------------------------------------------------------------------- /pytorch-soft-actor-critic/utils.py: -------------------------------------------------------------------------------- 1 | import math 2 | import torch 3 | 4 | def create_log_gaussian(mean, log_std, t): 5 | quadratic = -((0.5 * (t - mean) / (log_std.exp())).pow(2)) 6 | l = mean.shape 7 | log_z = log_std 8 | z = l[-1] * math.log(2 * math.pi) 9 | log_p = quadratic.sum(dim=-1) - log_z.sum(dim=-1) - 0.5 * z 10 | return log_p 11 | 12 | def logsumexp(inputs, dim=None, keepdim=False): 13 | if dim is None: 14 | inputs = inputs.view(-1) 15 | dim = 0 16 | s, _ = torch.max(inputs, dim=dim, keepdim=True) 17 | outputs = s + (inputs - s).exp().sum(dim=dim, keepdim=True).log() 18 | if not keepdim: 19 | outputs = outputs.squeeze(dim) 20 | return outputs 21 | 22 | def soft_update(target, source, tau): 23 | for target_param, param in zip(target.parameters(), source.parameters()): 24 | 
target_param.data.copy_(target_param.data * (1.0 - tau) + param.data * tau) 25 | 26 | def hard_update(target, source): 27 | for target_param, param in zip(target.parameters(), source.parameters()): 28 | target_param.data.copy_(param.data) 29 | -------------------------------------------------------------------------------- /pytorch-vanilla-reinforce/README.md: -------------------------------------------------------------------------------- 1 | This implements basic reinforce with and without a baseline value network for continuous control using Gaussian policies. 2 | 3 | An example of how to run reinforce: 4 | 5 | ```bash 6 | > python main_reinforce.py --namestr="name of experiment" --env-name --baseline {True/False} --num-episodes 4000 7 | ``` 8 | -------------------------------------------------------------------------------- /pytorch-vanilla-reinforce/main_reinforce.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | from comet_ml import Experiment 3 | import torch 4 | from torch.autograd import Variable 5 | import torch.autograd as autograd 6 | import numpy as np 7 | import torch.nn as nn 8 | import json 9 | from reinforce_simple import REINFORCE 10 | import torch.nn.functional as F 11 | import torch.optim as optim 12 | from torch.distributions import Normal 13 | import mujoco_py 14 | import os 15 | import gym 16 | import ipdb 17 | 18 | def evaluate_policy(policy, eval_episodes = 10): 19 | ''' 20 | function to return the average reward of the policy over 10 runs 21 | ''' 22 | avg_reward = 0.0 23 | for _ in range(eval_episodes): 24 | obs = env.reset() 25 | done = False 26 | while not done: 27 | action, log_prob, mean, std = policy.select_action(np.array(obs) ) 28 | obs, reward, done, _ = env.step(action) 29 | avg_reward += reward 30 | 31 | avg_reward /= eval_episodes 32 | print("the average reward is: {0}".format(avg_reward)) 33 | #return avg_reward 34 | 35 | def render_policy(policy): 36 | ''' 37 | Function to see the policy in action 38 | ''' 39 | obs = env.reset() 40 | done = False 41 | while not done: 42 | env.render() 43 | action,_,_,_ = policy.select_action(np.array(obs)) 44 | obs, reward, done, _ = env.step(action) 45 | 46 | env.close() 47 | 48 | def main(args): 49 | 50 | # create env 51 | env = gym.make(args.env_name) 52 | env.seed(args.seed) 53 | torch.manual_seed(args.seed) 54 | np.random.seed(args.seed) 55 | 56 | # get env info 57 | state_dim = env.observation_space.shape[0] 58 | action_dim = env.action_space 59 | max_action = (env.action_space.high) 60 | min_action = (env.action_space.low) 61 | 62 | print("number of actions:{0}, dim of states: {1},\ 63 | max_action:{2}, min_action: {3}".format(action_dim,state_dim,max_action,min_action)) 64 | 65 | # setup comet_ml to track experiments 66 | if os.path.isfile("settings.json"): 67 | with open('settings.json') as f: 68 | data = json.load(f) 69 | args.comet_apikey = data["apikey"] 70 | args.comet_username = data["username"] 71 | else: 72 | raise NotImplementedError 73 | 74 | experiment = Experiment(api_key=args.comet_apikey,\ 75 | project_name="florl",auto_output_logging="None",\ 76 | workspace=args.comet_username,auto_metric_logging=False,\ 77 | auto_param_logging=False) 78 | experiment.set_name(args.namestr) 79 | args.experiment = experiment 80 | 81 | # construct model 82 | hidden_size = args.hidden_size 83 | policy = REINFORCE(state_dim, hidden_size, action_dim, baseline = args.baseline) 84 | 85 | # start of experiment: Keep looping until desired amount of episodes 
reached 86 | max_episodes = args.num_episodes 87 | total_episodes = 0 # keep track of amount of episodes that we have done 88 | 89 | while total_episodes < max_episodes: 90 | 91 | obs = env.reset() 92 | done = False 93 | trajectory = [] # trajectory info for reinforce update 94 | episode_reward = 0 # keep track of rewards per episode 95 | 96 | while not done: 97 | action, ln_prob, mean, std = policy.select_action(np.array(obs)) 98 | next_state, reward, done, _ = env.step(action) 99 | trajectory.append([np.array(obs), action, ln_prob, reward, next_state, done]) 100 | 101 | obs = next_state 102 | episode_reward += reward 103 | 104 | total_episodes += 1 105 | 106 | if args.baseline: 107 | policy_loss, value_loss = policy.train(trajectory) 108 | experiment.log_metric("value function loss", value_loss, step = total_episodes) 109 | else: 110 | policy_loss = policy.train(trajectory) 111 | 112 | experiment.log_metric("policy loss",policy_loss, step = total_episodes) 113 | experiment.log_metric("episode reward", episode_reward, step =total_episodes) 114 | 115 | 116 | env.close() 117 | 118 | 119 | if __name__ == '__main__': 120 | 121 | """ 122 | Process command-line arguments, then call main() 123 | """ 124 | parser = argparse.ArgumentParser(description='PyTorch REINFORCE example') 125 | parser.add_argument('--env-name', default="HalfCheetah-v1", 126 | help='name of the environment to run') 127 | parser.add_argument('--seed', type=int, default=456, metavar='N', 128 | help='random seed (default: 456)') 129 | parser.add_argument('--baseline', type=bool, default = False, help = 'Whether you want to add a baseline to Reinforce or not') 130 | parser.add_argument('--namestr', type=str, default='FloRL', \ 131 | help='additional info in output filename to describe experiments') 132 | parser.add_argument('--num-episodes', type=int, default=2000, metavar='N', 133 | help='maximum number of episodes (default:2000)') 134 | parser.add_argument('--hidden-size', type=int, default=256, metavar='N', 135 | help='hidden size (default: 256)') 136 | args = parser.parse_args() 137 | 138 | main(args) 139 | 140 | -------------------------------------------------------------------------------- /pytorch-vanilla-reinforce/reinforce_simple.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch.autograd import Variable 3 | import torch.autograd as autograd 4 | import numpy as np 5 | import torch.nn as nn 6 | import torch.nn.functional as F 7 | import torch.optim as optim 8 | from torch.distributions import Normal 9 | import ipdb 10 | 11 | LOG_SIG_MAX = 2 12 | LOG_SIG_MIN = -20 13 | epsilon = 1e-6 14 | 15 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 16 | 17 | class Policy(nn.Module): 18 | ''' 19 | Gaussian policy that consists of a neural network with 1 hidden layer that 20 | outputs mean and log std dev (the params) of a gaussian policy 21 | ''' 22 | 23 | def __init__(self, num_inputs, hidden_size, action_space): 24 | 25 | super(Policy, self).__init__() 26 | 27 | self.action_space = action_space 28 | num_outputs = action_space.shape[0] # the number of output actions 29 | 30 | self.linear = nn.Linear(num_inputs, hidden_size) 31 | self.mean = nn.Linear(hidden_size, num_outputs) 32 | self.log_std = nn.Linear(hidden_size, num_outputs) 33 | 34 | def forward(self, inputs): 35 | 36 | # forward pass of NN 37 | x = inputs 38 | x = F.relu(self.linear(x)) 39 | 40 | mean = self.mean(x) 41 | log_std = self.log_std(x) # if more than one action this will 
give you the diagonal elements of a diagonal covariance matrix
42 |         log_std = torch.clamp(log_std, min=LOG_SIG_MIN, max=LOG_SIG_MAX) # clamp the log std dev to the range [LOG_SIG_MIN, LOG_SIG_MAX] = [-20, 2]
43 |         std = log_std.exp()
44 | 
45 |         return mean, std
46 | 
47 | class ValueNetwork(nn.Module):
48 |     '''
49 |     Value network V(s_t) = E[G_t | s_t] to use as a baseline in the reinforce
50 |     update. This is a neural net with 1 hidden layer
51 |     '''
52 | 
53 |     def __init__(self, num_inputs, hidden_dim):
54 |         super(ValueNetwork, self).__init__()
55 |         self.linear1 = nn.Linear(num_inputs, hidden_dim)
56 |         self.linear2 = nn.Linear(hidden_dim, 1)
57 | 
58 |     def forward(self, state):
59 | 
60 |         x = F.relu(self.linear1(state))
61 |         x = self.linear2(x)
62 | 
63 |         return x
64 | 
65 | class REINFORCE:
66 |     '''
67 |     Implementation of the basic online reinforce algorithm for Gaussian policies.
68 |     '''
69 | 
70 |     def __init__(self, num_inputs, hidden_size, action_space, lr_pi = 3e-4,\
71 |                  lr_vf = 1e-3, baseline = False, gamma = 0.99, train_v_iters = 1):
72 | 
73 |         self.gamma = gamma
74 |         self.action_space = action_space
75 |         self.policy = Policy(num_inputs, hidden_size, action_space)
76 |         self.policy_optimizer = optim.Adam(self.policy.parameters(), lr = lr_pi)
77 |         self.baseline = baseline
78 |         self.train_v_iters = train_v_iters # how many times to run the value-function update loop
79 | 
80 |         # create value network if we want to use baseline
81 |         if self.baseline:
82 |             self.value_function = ValueNetwork(num_inputs, hidden_size)
83 |             self.value_optimizer = optim.Adam(self.value_function.parameters(), lr = lr_vf)
84 | 
85 |     def select_action(self,state):
86 | 
87 |         state = torch.from_numpy(state).float().unsqueeze(0) # just to make it a Tensor obj
88 |         # get mean and std
89 |         mean, std = self.policy(state)
90 | 
91 |         # create normal distribution
92 |         normal = Normal(mean, std)
93 | 
94 |         # sample action
95 |         action = normal.sample()
96 | 
97 |         # get log prob of that action
98 |         ln_prob = normal.log_prob(action)
99 |         ln_prob = ln_prob.sum()
100 |         # squeeze action into [-1,1]
101 |         action = torch.tanh(action)
102 |         # turn actions into numpy array
103 |         action = action.numpy()
104 | 
105 |         return action[0], ln_prob, mean, std
106 | 
107 |     def train(self, trajectory):
108 | 
109 |         '''
110 |         The training is done using the rewards-to-go formulation of the policy gradient update of Reinforce.
111 |         If we are using a baseline, the value network is also trained.
112 | 
113 |         trajectory: a list of the form [( state , action , lnP(a_t|s_t), reward ), ... ]
114 | 
115 |         '''
116 | 
117 |         log_probs = [item[2] for item in trajectory]
118 |         rewards = [item[3] for item in trajectory]
119 |         states = [item[0] for item in trajectory]
120 |         actions = [item[1] for item in trajectory]
121 | 
122 |         # calculate rewards to go
123 |         R = 0
124 |         returns = []
125 |         for r in rewards[::-1]:
126 |             R = r + self.gamma * R  # use the configured discount factor rather than a hard-coded 0.99
127 |             returns.insert(0, R)
128 | 
129 |         returns = torch.tensor(returns).float()  # cast to float32 so it matches the networks' outputs
130 | 
131 |         # train the Value Network and calculate Advantage
132 |         if self.baseline:
133 | 
134 |             # loop over this a couple of times
135 |             for _ in range(self.train_v_iters):
136 |                 # calculate loss of value function using mean squared error
137 |                 value_estimates = []
138 |                 for state in states:
139 |                     state = torch.from_numpy(state).float().unsqueeze(0) # just to make it a Tensor obj
140 |                     value_estimates.append( self.value_function(state) )
141 | 
142 |                 value_estimates = torch.stack(value_estimates).squeeze() # predicted V(s_t) for each step of the env trajectory
143 | 
144 |                 v_loss = F.mse_loss(value_estimates, returns)
145 |                 # update the weights
146 |                 self.value_optimizer.zero_grad()
147 |                 v_loss.backward()
148 |                 self.value_optimizer.step()
149 | 
150 |             # calculate advantage
151 |             advantage = []
152 |             for value, R in zip(value_estimates, returns):
153 |                 advantage.append(R - value)
154 | 
155 |             advantage = torch.Tensor(advantage)
156 | 
157 |             # calculate policy loss
158 |             policy_loss = []
159 |             for log_prob, adv in zip(log_probs, advantage):
160 |                 policy_loss.append( - log_prob * adv)
161 | 
162 | 
163 |         else:
164 |             policy_loss = []
165 |             for log_prob, R in zip(log_probs, returns):
166 |                 policy_loss.append( - log_prob * R)
167 | 
168 | 
169 |         policy_loss = torch.stack( policy_loss ).sum()
170 |         # update policy weights
171 |         self.policy_optimizer.zero_grad()
172 |         policy_loss.backward()
173 |         self.policy_optimizer.step()
174 | 
175 | 
176 |         if self.baseline:
177 |             return policy_loss, v_loss
178 | 
179 |         else:
180 |             return policy_loss
181 | 
--------------------------------------------------------------------------------
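
For quick reference, the sketch below shows one way to drive the `REINFORCE` agent defined in `reinforce_simple.py` on a continuous-control Gym task without the comet.ml logging that `main_reinforce.py` wires in. It is illustrative only and not a file in this repository; the environment name (`Pendulum-v0`), episode count, and hidden size are placeholder choices.

```python
# Illustrative sketch only -- not part of the repository.
# Drives the REINFORCE agent from reinforce_simple.py without comet.ml logging.
import gym
import numpy as np
from reinforce_simple import REINFORCE

env = gym.make("Pendulum-v0")  # placeholder continuous-control task
state_dim = env.observation_space.shape[0]
agent = REINFORCE(state_dim, hidden_size=256,
                  action_space=env.action_space, baseline=True)

for episode in range(100):
    obs, done, trajectory, episode_reward = env.reset(), False, [], 0.0
    while not done:
        # sample an action and record the transition in the format train() expects
        action, ln_prob, mean, std = agent.select_action(np.array(obs))
        next_obs, reward, done, _ = env.step(action)
        trajectory.append([np.array(obs), action, ln_prob, reward, next_obs, done])
        obs, episode_reward = next_obs, episode_reward + reward
    # with baseline=True, train() returns both the policy and value losses
    policy_loss, value_loss = agent.train(trajectory)
    print("episode {}: reward {:.1f}".format(episode, episode_reward))
```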