├── .gitignore ├── LICENSE ├── README.md ├── drl_implementation ├── __init__.py ├── agent │ ├── __init__.py │ ├── agent_base.py │ ├── continuous_action │ │ ├── __init__.py │ │ ├── ddpg.py │ │ ├── ddpg_goal_conditioned.py │ │ ├── distributional_ddpg.py │ │ ├── sac.py │ │ ├── sac_goal_conditioned.py │ │ ├── sac_parameterised_action_goal_conditioned.py │ │ ├── sac_pointnet.py │ │ └── td3.py │ ├── distributed_agent_base.py │ └── utils │ │ ├── __init__.py │ │ ├── env_wrapper.py │ │ ├── exploration_strategy.py │ │ ├── networks_conv.py │ │ ├── networks_mlp.py │ │ ├── networks_pointnet.py │ │ ├── normalizer.py │ │ ├── plot.py │ │ ├── pointnet_2_utils.py │ │ ├── pointnet_utils.py │ │ ├── replay_buffer.py │ │ └── segment_tree.py └── examples │ ├── KukaPushPHER.py │ ├── PendulumDDPG.py │ └── __init__.py ├── requirements.txt ├── setup.py ├── src ├── README.md ├── figs.png ├── pendulum.gif └── push.gif └── tests └── test.py /.gitignore: -------------------------------------------------------------------------------- 1 | # Default ignored files 2 | .idea 3 | __pycache__/ 4 | drl_implementation/agent/__pycache__/ 5 | drl_implementation/agent/utils/__pycache__/ 6 | exp_multi_goal/__pycache__/ 7 | build/ 8 | drl_implementation.egg-info/ -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 XT_Yang 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## DRL_Implementation 2 | ##### Current status: minimal updates 3 | 4 | ### Introduction 5 | - This repository is a pytorch-based implementation of modern DRL algorithms, designed to be reusable for as many 6 | Gym-like training environments as possible 7 | - The package is mainly for my personal usage, however feel free to use it as you like. 
8 | - It is recommended to use the [released version](https://github.com/IanYangChina/DRL_Implementation/tree/v2.0) 9 | - Understand more with the [Wiki!](https://github.com/IanYangChina/DRL_Implementation/wiki) 10 | - Tested environments: Gym, Pybullet-gym, Pybullet-multigoal-gym 11 | - *My priority is on continuous action algorithms as I'm working on robotics* 12 | 13 | #### Installation 14 | ``` 15 | git clone https://github.com/IanYangChina/DRL_Implementation.git 16 | cd DRL_Implementation 17 | python -m pip install -r requirements.txt 18 | python -m pip install . 19 | ``` 20 | [Click here for example codes](https://github.com/IanYangChina/DRL_Implementation/tree/master/drl_implementation/examples). 21 | To run them you will need to install Gym, Pybullet, or pybullet-multigoal-gym; see the env installation links below. 22 | For more use cases, have a look at the [drl_imp_test repo](https://github.com/IanYangChina/drl_imp_test)\ 23 | From the project root, run `python drl_implementation/examples/$SCRIPT_NAME.py` 24 | 25 | ##### State-based 26 | - [X] Distributional DDPG, Continuous 27 | - [X] DDPG - Deterministic, Continuous 28 | - [X] TD3 - Deterministic, Continuous 29 | - [X] SAC (Adaptive Temperature) - Stochastic, Continuous 30 | 31 | ##### Replay buffers 32 | - [X] Hindsight 33 | - [X] Prioritised 34 | 35 | ##### Tested Environments 36 | - [X] [Pybullet Gym (Continuous)](https://github.com/bulletphysics/bullet3) 37 | - [X] [OpenAI Gym Mujoco Robotics Multigoal Environment (Continuous)](https://openai.com/blog/ingredients-for-robotics-research/) 38 | - [X] [Pybullet Multigoal Gym](https://github.com/IanYangChina/pybullet_multigoal_gym) (OpenAI Robotics 39 | Multigoal Pybullet Migration) (Continuous) 40 | 41 | ##### Some result figures 42 | (Result figures and demos: see src/figs.png, src/pendulum.gif and src/push.gif) 43 | 44 | 45 | 46 | #### Reference Papers: Algorithm 47 | * [DQN](https://www.nature.com/articles/nature14236?wm=book_wap_0005) 48 | * [DoubleDQN](https://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/viewPaper/12389) 49 | * [DDPG](https://arxiv.org/abs/1509.02971) 50 | * [TD3](https://arxiv.org/pdf/1802.09477.pdf) 51 | * [SAC (Adaptive Temperature)](https://arxiv.org/pdf/1812.05905.pdf) 52 | * [PER](https://arxiv.org/abs/1511.05952) 53 | * [HER](http://papers.nips.cc/paper/7090-hindsight-experience-replay) 54 | 55 | #### Reference Papers: Implementation Matters 56 | * [Time limit](https://arxiv.org/abs/1712.00378) 57 | * [SOTA PPO Hyperparameters (many applicable to other algorithms)](https://arxiv.org/abs/2006.05990) 58 | * [SAC Temperature Auto-tuning](https://arxiv.org/abs/1812.05905) 59 | -------------------------------------------------------------------------------- /drl_implementation/__init__.py: -------------------------------------------------------------------------------- 1 | from .agent.continuous_action.ddpg import DDPG 2 | from .agent.continuous_action.ddpg_goal_conditioned import GoalConditionedDDPG 3 | from .agent.continuous_action.sac import SAC 4 | from .agent.continuous_action.sac_goal_conditioned import GoalConditionedSAC 5 | from .agent.continuous_action.td3 import TD3 6 | from .agent.continuous_action.distributional_ddpg import DistributionalDDPG 7 | from .agent.continuous_action.sac_parameterised_action_goal_conditioned import GPASAC 8 | 9 | agents = { 10 | 'ddpg': DDPG, 11 | 'ddpg_her': GoalConditionedDDPG, 12 | 'sac': SAC, 13 | 'sac_her': GoalConditionedSAC, 14 | 'td3': TD3, 15 | 'distri_ddpg': DistributionalDDPG, 16 | } 17 | --------------------------------------------------------------------------------
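The `agents` dictionary above maps short names to the agent classes exported by the package. As a quick illustration of how one of these agents is constructed and run (following the pattern of `drl_implementation/examples/PendulumDDPG.py`), a minimal sketch is given below. The hyperparameter values and the save path are illustrative assumptions rather than the repository's reference settings, and the sketch assumes an older Gym release where `Pendulum-v0` and `env.seed()` are available, matching the API the agents in this package are written against (`env.seed()`, 4-tuple `step()` returns).

```
# Minimal, illustrative usage sketch (not one of the repo's example scripts).
import gym
from drl_implementation import DDPG   # equivalently: from drl_implementation import agents; agents['ddpg']

# Hypothetical hyperparameters for Pendulum; the keys match what DDPG and the
# Agent base class read from algo_params, the values are only examples.
algo_params = {
    # replay buffer & input normalisation
    'prioritised': True,
    'memory_capacity': int(1e6),
    'observation_normalization': False,
    # optimisation
    'actor_learning_rate': 0.001,
    'critic_learning_rate': 0.001,
    'Q_weight_decay': 0.0,
    'update_interval': 1,
    'batch_size': 128,
    'optimization_steps': 1,
    'tau': 0.005,
    'discount_factor': 0.99,
    'discard_time_limit': True,
    'warmup_step': 2500,
    # training schedule
    'training_episodes': 101,
    'testing_gap': 10,
    'testing_episodes': 10,
    'saving_gap': 50,
}

env = gym.make('Pendulum-v0')
# 'path' is where checkpoints, statistics and plots are written; 'seed' seeds torch, numpy and the env.
agent = DDPG(algo_params, env, path='./pendulum_ddpg', seed=0)
agent.run(test=False, render=False)
```

For the goal-conditioned variants (`GoalConditionedDDPG`, `GoalConditionedSAC`, `GPASAC`), the environment is instead expected to return dictionary observations with `observation`, `achieved_goal` and `desired_goal` fields and to expose a `distance_threshold`, as read in `agent_base.py`; additional keys such as `training_epochs`, `training_cycles`, `hindsight` and (optionally) `her_sampling_strategy` then apply.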
/drl_implementation/agent/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/IanYangChina/DRL_Implementation/38812c9647e4bec8359908be444dff19b90257d5/drl_implementation/agent/__init__.py -------------------------------------------------------------------------------- /drl_implementation/agent/agent_base.py: -------------------------------------------------------------------------------- 1 | import os 2 | import logging as info_logging 3 | import torch as T 4 | import numpy as np 5 | import json 6 | import subprocess as sp 7 | from torch.utils.tensorboard import SummaryWriter 8 | from .utils.plot import smoothed_plot 9 | from .utils.replay_buffer import make_buffer 10 | from .utils.normalizer import Normalizer 11 | 12 | 13 | def mkdir(paths): 14 | for path in paths: 15 | os.makedirs(path, exist_ok=True) 16 | 17 | 18 | def get_gpu_memory(): 19 | command = "nvidia-smi --query-gpu=memory.free --format=csv" 20 | memory_free_info = sp.check_output(command.split()).decode('ascii').split('\n')[:-1][1:] 21 | memory_free_values = [int(x.split()[0]) for i, x in enumerate(memory_free_info)] 22 | return memory_free_values 23 | 24 | 25 | def reset_logging(logging_to_reset): 26 | loggers = [logging_to_reset.getLogger(name) for name in logging_to_reset.root.manager.loggerDict] 27 | loggers.append(logging_to_reset.getLogger()) 28 | for logger in loggers: 29 | handlers = logger.handlers[:] 30 | for handler in handlers: 31 | logger.removeHandler(handler) 32 | handler.close() 33 | logger.setLevel(logging_to_reset.NOTSET) 34 | logger.propagate = True 35 | 36 | 37 | class Agent(object): 38 | def __init__(self, 39 | algo_params, logging=None, create_logger=False, 40 | transition_tuple=None, 41 | non_flat_obs=False, action_type='continuous', 42 | goal_conditioned=False, store_goal_ind=False, training_mode='episode_based', 43 | path=None, log_dir_suffix=None, seed=-1): 44 | """ 45 | Parameters 46 | ---------- 47 | algo_params : dict 48 | a dictionary of parameters 49 | transition_tuple : collections.namedtuple 50 | a python namedtuple for storing, managing and sampling experiences, see .utils.replay_buffer 51 | non_flat_obs : bool 52 | whether the observations are 1D or nD 53 | action_type : str 54 | either 'discrete' or 'continuous' 55 | goal_conditioned : bool 56 | whether the agent uses a goal-conditioned policy 57 | training_mode : str 58 | either 'episode_based' or 'env_step_based' 59 | path : str 60 | a directory to save files 61 | seed : int 62 | a random seed 63 | """ 64 | # torch device 65 | self.device = T.device("cuda" if T.cuda.is_available() else "cpu") 66 | if 'cuda_device_id' in algo_params.keys(): 67 | self.device = T.device("cuda:%i" % algo_params['cuda_device_id']) 68 | # path & seeding 69 | T.manual_seed(seed) 70 | T.cuda.manual_seed_all(seed) # this has no effect if cuda is not available 71 | 72 | # create a random number generator and seed it 73 | self.rng = np.random.default_rng(seed=seed) 74 | assert path is not None, 'please specify a project path to save files' 75 | self.path = path 76 | # path to save neural network check point 77 | self.ckpt_path = os.path.join(path, 'ckpts') 78 | # path to save statistics 79 | self.data_path = os.path.join(path, 'data') 80 | # create directories if not exist 81 | mkdir([self.path, self.ckpt_path, self.data_path]) 82 | if log_dir_suffix is not None: 83 | comment = '-'+log_dir_suffix 84 | else: 85 | comment = '' 86 | self.create_logger = create_logger 87 | if self.create_logger: 
88 | self.logger = SummaryWriter(log_dir=self.data_path, comment=comment) 89 | self.logging = logging 90 | if self.logging is None: 91 | reset_logging(info_logging) 92 | log_file_name = os.path.join(self.data_path, 'optimisation.log') 93 | if os.path.isfile(log_file_name): 94 | filemode = "a" 95 | else: 96 | filemode = "w" 97 | info_logging.basicConfig(level=info_logging.NOTSET, filemode=filemode, 98 | filename=log_file_name, 99 | format="%(asctime)s %(levelname)s %(message)s") 100 | self.logging = info_logging 101 | 102 | # non-goal-conditioned args 103 | self.non_flat_obs = non_flat_obs 104 | self.action_type = action_type 105 | if self.non_flat_obs: 106 | self.state_dim = 0 107 | self.state_shape = algo_params['state_shape'] 108 | else: 109 | self.state_dim = algo_params['state_dim'] 110 | if self.action_type == 'hybrid': 111 | self.discrete_action_dim = algo_params['discrete_action_dim'] 112 | self.continuous_action_dim = algo_params['continuous_action_dim'] 113 | self.continuous_action_max = algo_params['continuous_action_max'] 114 | self.continuous_action_scaling = algo_params['continuous_action_scaling'] 115 | else: 116 | self.action_dim = algo_params['action_dim'] 117 | if self.action_type == 'continuous': 118 | self.action_max = algo_params['action_max'] 119 | self.action_scaling = algo_params['action_scaling'] 120 | 121 | # goal-conditioned args & buffers 122 | self.goal_conditioned = goal_conditioned 123 | # prioritised replay 124 | self.prioritised = algo_params['prioritised'] 125 | 126 | if self.goal_conditioned: 127 | if self.non_flat_obs: 128 | self.goal_dim = 0 129 | self.goal_shape = algo_params['goal_shape'] 130 | else: 131 | self.goal_dim = algo_params['goal_dim'] 132 | self.hindsight = algo_params['hindsight'] 133 | try: 134 | goal_distance_threshold = self.env.env.distance_threshold 135 | except: 136 | goal_distance_threshold = self.env.distance_threshold 137 | 138 | goal_conditioned_reward_func = None 139 | try: 140 | if self.env.env.goal_conditioned_reward_function is not None: 141 | goal_conditioned_reward_func = self.env.env.goal_conditioned_reward_function 142 | except: 143 | if self.env.goal_conditioned_reward_function is not None: 144 | goal_conditioned_reward_func = self.env.goal_conditioned_reward_function 145 | 146 | try: 147 | her_sample_strategy = algo_params['her_sampling_strategy'] 148 | except: 149 | her_sample_strategy = 'future' 150 | 151 | try: 152 | num_sampled_goal = algo_params['num_sampled_goal'] 153 | except: 154 | num_sampled_goal = 4 155 | 156 | self.buffer = make_buffer(mem_capacity=algo_params['memory_capacity'], 157 | transition_tuple=transition_tuple, prioritised=self.prioritised, 158 | seed=seed, rng=self.rng, 159 | goal_conditioned=True, keep_episode=self.hindsight, 160 | store_goal_ind=store_goal_ind, 161 | sampling_strategy=her_sample_strategy, 162 | num_sampled_goal=num_sampled_goal, 163 | terminal_on_achieved=algo_params['terminate_on_achieve'], 164 | goal_distance_threshold=goal_distance_threshold, 165 | goal_conditioned_reward_func=goal_conditioned_reward_func) 166 | else: 167 | self.goal_dim = 0 168 | self.buffer = make_buffer(mem_capacity=algo_params['memory_capacity'], 169 | transition_tuple=transition_tuple, prioritised=self.prioritised, 170 | seed=seed, rng=self.rng, 171 | goal_conditioned=False) 172 | 173 | # common args 174 | if not self.non_flat_obs: 175 | self.observation_normalization = algo_params['observation_normalization'] 176 | # if not using obs normalization, the normalizer is just a scale multiplier, returns 
inputs*scale 177 | self.normalizer = Normalizer(self.state_dim+self.goal_dim, 178 | algo_params['init_input_means'], algo_params['init_input_vars'], 179 | activated=self.observation_normalization) 180 | try: 181 | self.update_interval = algo_params['update_interval'] 182 | except: 183 | self.update_interval = 1 184 | self.actor_learning_rate = algo_params['actor_learning_rate'] 185 | self.critic_learning_rate = algo_params['critic_learning_rate'] 186 | self.batch_size = algo_params['batch_size'] 187 | self.optimizer_steps = algo_params['optimization_steps'] 188 | self.gamma = algo_params['discount_factor'] 189 | self.discard_time_limit = algo_params['discard_time_limit'] 190 | self.tau = algo_params['tau'] 191 | self.optim_step_count = 0 192 | 193 | assert training_mode in ['episode_based', 'step_based'] 194 | self.training_mode = training_mode 195 | self.total_env_step_count = 0 196 | self.total_env_episode_count = 0 197 | 198 | # network dict is filled in each specific agent 199 | self.network_dict = {} 200 | self.network_keys_to_save = None 201 | 202 | # algorithm-specific statistics are defined in each agent sub-class 203 | self.statistic_dict = { 204 | # use lowercase characters 205 | 'actor_loss': [], 206 | 'critic_loss': [], 207 | } 208 | self.get_gpu_memory = get_gpu_memory 209 | 210 | def run(self, render=False, test=False, load_network_ep=None, sleep=0): 211 | raise NotImplementedError 212 | 213 | def _interact(self, render=False, test=False, sleep=0): 214 | raise NotImplementedError 215 | 216 | def _select_action(self, obs, test=False): 217 | raise NotImplementedError 218 | 219 | def _learn(self, steps=None): 220 | raise NotImplementedError 221 | 222 | def _remember(self, *args, new_episode=False): 223 | if self.goal_conditioned: 224 | self.buffer.new_episode = new_episode 225 | self.buffer.store_experience(*args) 226 | else: 227 | self.buffer.store_experience(*args) 228 | 229 | def _soft_update(self, source, target, tau=None): 230 | if tau is None: 231 | tau = self.tau 232 | 233 | for target_param, param in zip(target.parameters(), source.parameters()): 234 | target_param.data.copy_( 235 | target_param.data * (1.0 - tau) + param.data * tau 236 | ) 237 | 238 | def _save_network(self, keys=None, ep=None, step=None): 239 | if ep is None: 240 | ep = '' 241 | else: 242 | ep = '_ep'+str(ep) 243 | if step is None: 244 | step = '' 245 | else: 246 | step = '_step'+str(step) 247 | if keys is None: 248 | keys = self.network_keys_to_save 249 | assert keys is not None 250 | for key in keys: 251 | T.save(self.network_dict[key].state_dict(), self.ckpt_path+'/ckpt_'+key+ep+step+'.pt') 252 | 253 | def _load_network(self, keys=None, ep=None, step=None): 254 | if (not self.non_flat_obs) and self.observation_normalization: 255 | self.normalizer.history_mean = np.load(os.path.join(self.data_path, 'input_means.npy')) 256 | self.normalizer.history_var = np.load(os.path.join(self.data_path, 'input_vars.npy')) 257 | if ep is None: 258 | ep = '' 259 | else: 260 | ep = '_ep'+str(ep) 261 | if step is None: 262 | step = '' 263 | else: 264 | step = '_step'+str(step) 265 | if keys is None: 266 | keys = self.network_keys_to_save 267 | assert keys is not None 268 | for key in keys: 269 | self.network_dict[key].load_state_dict(T.load(self.ckpt_path+'/ckpt_'+key+ep+step+'.pt', map_location=self.device)) 270 | 271 | def _save_statistics(self, keys=None): 272 | if (not self.non_flat_obs) and self.observation_normalization: 273 | np.save(os.path.join(self.data_path, 'input_means'), self.normalizer.history_mean) 274 
| np.save(os.path.join(self.data_path, 'input_vars'), self.normalizer.history_var) 275 | if keys is None: 276 | keys = self.statistic_dict.keys() 277 | for key in keys: 278 | if len(self.statistic_dict[key]) == 0: 279 | continue 280 | # convert everything to a list before save via json 281 | if T.is_tensor(self.statistic_dict[key][0]): 282 | self.statistic_dict[key] = T.as_tensor(self.statistic_dict[key], device=self.device).cpu().numpy().tolist() 283 | else: 284 | self.statistic_dict[key] = np.array(self.statistic_dict[key]).tolist() 285 | json.dump(self.statistic_dict[key], open(os.path.join(self.data_path, key+'.json'), 'w')) 286 | 287 | def _plot_statistics(self, keys=None, x_labels=None, y_labels=None, window=5, save_to_file=True): 288 | if save_to_file: 289 | self._save_statistics(keys=keys) 290 | if y_labels is None: 291 | y_labels = {} 292 | for key in list(self.statistic_dict.keys()): 293 | if key not in y_labels.keys(): 294 | if 'loss' in key: 295 | label = 'Loss' 296 | elif 'return' in key: 297 | label = 'Return' 298 | elif 'success' in key: 299 | label = 'Success' 300 | else: 301 | label = key 302 | y_labels.update({key: label}) 303 | 304 | if x_labels is None: 305 | x_labels = {} 306 | for key in list(self.statistic_dict.keys()): 307 | if key not in x_labels.keys(): 308 | if ('loss' in key) or ('alpha' in key) or ('entropy' in key) or ('step' in key): 309 | label = 'Optimization step' 310 | elif 'cycle' in key: 311 | label = 'Cycle' 312 | elif 'epoch' in key: 313 | label = 'Epoch' 314 | else: 315 | label = 'Episode' 316 | x_labels.update({key: label}) 317 | 318 | if keys is None: 319 | for key in list(self.statistic_dict.keys()): 320 | if len(self.statistic_dict[key]) == 0: 321 | continue 322 | smoothed_plot(os.path.join(self.path, key+'.png'), self.statistic_dict[key], 323 | x_label=x_labels[key], y_label=y_labels[key], window=window) 324 | else: 325 | for key in keys: 326 | smoothed_plot(os.path.join(self.path, key+'.png'), self.statistic_dict[key], 327 | x_label=x_labels[key], y_label=y_labels[key], window=window) 328 | 329 | -------------------------------------------------------------------------------- /drl_implementation/agent/continuous_action/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/IanYangChina/DRL_Implementation/38812c9647e4bec8359908be444dff19b90257d5/drl_implementation/agent/continuous_action/__init__.py -------------------------------------------------------------------------------- /drl_implementation/agent/continuous_action/ddpg.py: -------------------------------------------------------------------------------- 1 | import time 2 | import numpy as np 3 | import torch as T 4 | import torch.nn.functional as F 5 | from torch.optim.adam import Adam 6 | from ..utils.networks_mlp import Actor, Critic 7 | from ..agent_base import Agent 8 | from ..utils.exploration_strategy import OUNoise, GaussianNoise 9 | 10 | 11 | class DDPG(Agent): 12 | def __init__(self, algo_params, env, transition_tuple=None, path=None, seed=-1): 13 | # environment 14 | self.env = env 15 | self.env.seed(seed) 16 | obs = self.env.reset() 17 | algo_params.update({'state_dim': obs.shape[0], 18 | 'action_dim': self.env.action_space.shape[0], 19 | 'action_max': self.env.action_space.high, 20 | 'action_scaling': self.env.action_space.high[0], 21 | 'init_input_means': None, 22 | 'init_input_vars': None 23 | }) 24 | # training args 25 | self.training_episodes = algo_params['training_episodes'] 26 | self.testing_gap = 
algo_params['testing_gap'] 27 | self.testing_episodes = algo_params['testing_episodes'] 28 | self.saving_gap = algo_params['saving_gap'] 29 | 30 | super(DDPG, self).__init__(algo_params, 31 | transition_tuple=transition_tuple, 32 | goal_conditioned=False, 33 | path=path, 34 | seed=seed) 35 | # torch 36 | self.network_dict.update({ 37 | 'actor': Actor(self.state_dim, self.action_dim).to(self.device), 38 | 'actor_target': Actor(self.state_dim, self.action_dim).to(self.device), 39 | 'critic': Critic(self.state_dim + self.action_dim, 1).to(self.device), 40 | 'critic_target': Critic(self.state_dim + self.action_dim, 1).to(self.device) 41 | }) 42 | self.network_keys_to_save = ['actor_target', 'critic_target'] 43 | self.actor_optimizer = Adam(self.network_dict['actor'].parameters(), lr=self.actor_learning_rate) 44 | self._soft_update(self.network_dict['actor'], self.network_dict['actor_target'], tau=1) 45 | self.critic_optimizer = Adam(self.network_dict['critic'].parameters(), lr=self.critic_learning_rate, weight_decay=algo_params['Q_weight_decay']) 46 | self._soft_update(self.network_dict['critic'], self.network_dict['critic_target'], tau=1) 47 | # behavioural policy args (exploration) 48 | self.exploration_strategy = GaussianNoise(self.action_dim, self.action_max, sigma=0.1) 49 | # training args 50 | self.warmup_step = algo_params['warmup_step'] 51 | # statistic dict 52 | self.statistic_dict.update({ 53 | 'episode_return': [], 54 | 'episode_test_return': [] 55 | }) 56 | 57 | def run(self, test=False, render=False, load_network_ep=None, sleep=0): 58 | if test: 59 | num_episode = self.testing_episodes 60 | if load_network_ep is not None: 61 | print("Loading network parameters...") 62 | self._load_network(ep=load_network_ep) 63 | print("Start testing...") 64 | else: 65 | num_episode = self.training_episodes 66 | print("Start training...") 67 | 68 | for ep in range(num_episode): 69 | ep_return = self._interact(render, test, sleep=sleep) 70 | self.statistic_dict['episode_return'].append(ep_return) 71 | print("Episode %i" % ep, "return %0.1f" % ep_return) 72 | 73 | if (ep % self.testing_gap == 0) and (ep != 0) and (not test): 74 | ep_test_return = [] 75 | for test_ep in range(self.testing_episodes): 76 | ep_test_return.append(self._interact(render, test=True)) 77 | self.statistic_dict['episode_test_return'].append(sum(ep_test_return)/self.testing_episodes) 78 | print("Episode %i" % ep, "test return %0.1f" % (sum(ep_test_return)/self.testing_episodes)) 79 | 80 | if (ep % self.saving_gap == 0) and (ep != 0) and (not test): 81 | self._save_network(ep=ep) 82 | 83 | if not test: 84 | print("Finished training") 85 | print("Saving statistics...") 86 | self._plot_statistics(save_to_file=True) 87 | else: 88 | print("Finished testing") 89 | 90 | def _interact(self, render=False, test=False, sleep=0): 91 | done = False 92 | obs = self.env.reset() 93 | ep_return = 0 94 | # self.exploration_strategy.reset() 95 | # start a new episode 96 | while not done: 97 | if render: 98 | self.env.render() 99 | if self.env_step_count < self.warmup_step: 100 | action = self.env.action_space.sample() 101 | else: 102 | action = self._select_action(obs, test=test) 103 | new_obs, reward, done, info = self.env.step(action) 104 | time.sleep(sleep) 105 | ep_return += reward 106 | if not test: 107 | self._remember(obs, action, new_obs, reward, 1 - int(done)) 108 | if self.observation_normalization: 109 | self.normalizer.store_history(new_obs) 110 | self.normalizer.update_mean() 111 | if (self.env_step_count % self.update_interval == 0) 
and (self.env_step_count > self.warmup_step): 112 | self._learn() 113 | self.env_step_count += 1 114 | obs = new_obs 115 | return ep_return 116 | 117 | def _select_action(self, obs, test=False): 118 | obs = self.normalizer(obs) 119 | with T.no_grad(): 120 | inputs = T.as_tensor(obs, dtype=T.float, device=self.device) 121 | action = self.network_dict['actor_target'](inputs).cpu().detach().numpy() 122 | if test: 123 | # evaluate 124 | return np.clip(action, -self.action_max, self.action_max) 125 | else: 126 | # explore 127 | return self.exploration_strategy(action) 128 | 129 | def _learn(self, steps=None): 130 | if len(self.buffer) < self.batch_size: 131 | return 132 | if steps is None: 133 | steps = self.optimizer_steps 134 | 135 | for i in range(steps): 136 | if self.prioritised: 137 | batch, weights, inds = self.buffer.sample(self.batch_size) 138 | weights = T.as_tensor(weights, device=self.device).view(self.batch_size, 1) 139 | else: 140 | batch = self.buffer.sample(self.batch_size) 141 | weights = T.ones(size=(self.batch_size, 1), device=self.device) 142 | inds = None 143 | 144 | actor_inputs = np.array(self.normalizer(batch.state)) 145 | actor_inputs = T.as_tensor(actor_inputs, dtype=T.float32, device=self.device) 146 | actions = T.as_tensor(np.array(batch.action), dtype=T.float32, device=self.device) 147 | critic_inputs = T.cat((actor_inputs, actions), dim=1) 148 | actor_inputs_ = np.array(self.normalizer(batch.next_state)) 149 | actor_inputs_ = T.as_tensor(np.array(actor_inputs_), dtype=T.float32, device=self.device) 150 | rewards = T.as_tensor(np.array(batch.reward), dtype=T.float32, device=self.device).unsqueeze(1) 151 | done = T.as_tensor(np.array(batch.done), dtype=T.float32, device=self.device).unsqueeze(1) 152 | 153 | if self.discard_time_limit: 154 | done = done * 0 + 1 155 | 156 | with T.no_grad(): 157 | actions_ = self.network_dict['actor_target'](actor_inputs_) 158 | critic_inputs_ = T.cat((actor_inputs_, actions_), dim=1) 159 | value_ = self.network_dict['critic_target'](critic_inputs_) 160 | value_target = rewards + done * self.gamma * value_ 161 | 162 | self.critic_optimizer.zero_grad() 163 | value_estimate = self.network_dict['critic'](critic_inputs) 164 | critic_loss = F.mse_loss(value_estimate, value_target, reduction='none') 165 | (critic_loss * weights).mean().backward() 166 | self.critic_optimizer.step() 167 | 168 | if self.prioritised: 169 | assert inds is not None 170 | self.buffer.update_priority(inds, np.abs(critic_loss.cpu().detach().numpy())) 171 | 172 | self.actor_optimizer.zero_grad() 173 | new_actions = self.network_dict['actor'](actor_inputs) 174 | critic_eval_inputs = T.cat((actor_inputs, new_actions), dim=1).to(self.device) 175 | actor_loss = -self.network_dict['critic'](critic_eval_inputs).mean() 176 | actor_loss.backward() 177 | self.actor_optimizer.step() 178 | 179 | self._soft_update(self.network_dict['actor'], self.network_dict['actor_target']) 180 | self._soft_update(self.network_dict['critic'], self.network_dict['critic_target']) 181 | 182 | self.statistic_dict['critic_loss'].append(critic_loss.detach().mean()) 183 | self.statistic_dict['actor_loss'].append(actor_loss.detach().mean()) 184 | -------------------------------------------------------------------------------- /drl_implementation/agent/continuous_action/ddpg_goal_conditioned.py: -------------------------------------------------------------------------------- 1 | import time 2 | import numpy as np 3 | import torch as T 4 | import torch.nn.functional as F 5 | from torch.optim.adam import 
Adam 6 | from ..utils.networks_mlp import Actor, Critic 7 | from ..agent_base import Agent 8 | from ..utils.exploration_strategy import EGreedyGaussian 9 | 10 | 11 | class GoalConditionedDDPG(Agent): 12 | def __init__(self, algo_params, env, transition_tuple=None, path=None, seed=-1): 13 | # environment 14 | self.env = env 15 | self.env.seed(seed) 16 | obs = self.env.reset() 17 | algo_params.update({'state_dim': obs['observation'].shape[0], 18 | 'goal_dim': obs['desired_goal'].shape[0], 19 | 'action_dim': self.env.action_space.shape[0], 20 | 'action_max': self.env.action_space.high, 21 | 'action_scaling': self.env.action_space.high[0], 22 | 'init_input_means': None, 23 | 'init_input_vars': None 24 | }) 25 | self.curriculum = False 26 | if 'curriculum' in algo_params.keys(): 27 | self.curriculum = algo_params['curriculum'] 28 | # training args 29 | self.training_epochs = algo_params['training_epochs'] 30 | self.training_cycles = algo_params['training_cycles'] 31 | self.training_episodes = algo_params['training_episodes'] 32 | self.testing_gap = algo_params['testing_gap'] 33 | self.testing_episodes = algo_params['testing_episodes'] 34 | self.saving_gap = algo_params['saving_gap'] 35 | 36 | super(GoalConditionedDDPG, self).__init__(algo_params, 37 | transition_tuple=transition_tuple, 38 | goal_conditioned=True, 39 | path=path, 40 | seed=seed) 41 | # torch 42 | self.network_dict.update({ 43 | 'actor': Actor(self.state_dim + self.goal_dim, self.action_dim, action_scaling=self.action_scaling).to( 44 | self.device), 45 | 'actor_target': Actor(self.state_dim + self.goal_dim, self.action_dim, 46 | action_scaling=self.action_scaling).to(self.device), 47 | 'critic_1': Critic(self.state_dim + self.goal_dim + self.action_dim, 1).to(self.device), 48 | 'critic_1_target': Critic(self.state_dim + self.goal_dim + self.action_dim, 1).to(self.device), 49 | 'critic_2': Critic(self.state_dim + self.goal_dim + self.action_dim, 1).to(self.device), 50 | 'critic_2_target': Critic(self.state_dim + self.goal_dim + self.action_dim, 1).to(self.device), 51 | }) 52 | self.network_keys_to_save = ['actor_target', 'critic_1_target', 'critic_2_target'] 53 | self.actor_optimizer = Adam(self.network_dict['actor'].parameters(), lr=self.actor_learning_rate) 54 | self._soft_update(self.network_dict['actor'], self.network_dict['actor_target'], tau=1) 55 | self.critic_1_optimizer = Adam(self.network_dict['critic_1'].parameters(), lr=self.critic_learning_rate, 56 | weight_decay=algo_params['Q_weight_decay']) 57 | self._soft_update(self.network_dict['critic_1'], self.network_dict['critic_1_target'], tau=1) 58 | self.critic_2_optimizer = Adam(self.network_dict['critic_2'].parameters(), lr=self.critic_learning_rate, 59 | weight_decay=algo_params['Q_weight_decay']) 60 | self._soft_update(self.network_dict['critic_2'], self.network_dict['critic_2_target'], tau=1) 61 | # behavioural policy args (exploration) 62 | # different from the original DDPG paper, the HER paper uses another exploration strategy 63 | # paper: https://papers.nips.cc/paper/2017/hash/453fadbd8a1a3af50a9df4df899537b5-Abstract.html 64 | self.exploration_strategy = EGreedyGaussian(action_dim=self.action_dim, 65 | action_max=self.action_max, 66 | chance=algo_params['random_action_chance'], 67 | sigma=algo_params['noise_deviation'], rng=self.rng) 68 | self.noise_deviation = algo_params['noise_deviation'] 69 | # training args 70 | self.clip_value = algo_params['clip_value'] 71 | # statistic dict 72 | self.statistic_dict.update({ 73 | 'cycle_return': [], 74 | 
'cycle_success_rate': [], 75 | 'epoch_test_return': [], 76 | 'epoch_test_success_rate': [] 77 | }) 78 | 79 | def run(self, test=False, render=False, load_network_ep=None, sleep=0): 80 | # training setup uses a hierarchy of Epoch, Cycle and Episode 81 | # following the HER paper: https://papers.nips.cc/paper/2017/hash/453fadbd8a1a3af50a9df4df899537b5-Abstract.html 82 | if test: 83 | if load_network_ep is not None: 84 | print("Loading network parameters...") 85 | self._load_network(ep=load_network_ep) 86 | print("Start testing...") 87 | else: 88 | print("Start training...") 89 | 90 | for epo in range(self.training_epochs): 91 | if self.curriculum: 92 | self.env.activate_curriculum_update() 93 | for cyc in range(self.training_cycles): 94 | cycle_return = 0 95 | cycle_success = 0 96 | for ep in range(self.training_episodes): 97 | ep_return = self._interact(render, test, sleep=sleep) 98 | cycle_return += ep_return 99 | if ep_return > -self.env._max_episode_steps: 100 | cycle_success += 1 101 | 102 | self.statistic_dict['cycle_return'].append(cycle_return / self.training_episodes) 103 | self.statistic_dict['cycle_success_rate'].append(cycle_success / self.training_episodes) 104 | print("Epoch %i" % epo, "Cycle %i" % cyc, 105 | "avg. return %0.1f" % (cycle_return / self.training_episodes), 106 | "success rate %0.1f" % (cycle_success / self.training_episodes)) 107 | 108 | if (epo % self.testing_gap == 0) and (epo != 0) and (not test): 109 | if self.curriculum: 110 | self.env.deactivate_curriculum_update() 111 | # testing during training 112 | test_return = 0 113 | test_success = 0 114 | for test_ep in range(self.testing_episodes): 115 | ep_test_return = self._interact(render, test=True) 116 | test_return += ep_test_return 117 | if ep_test_return > -self.env._max_episode_steps: 118 | test_success += 1 119 | self.statistic_dict['epoch_test_return'].append(test_return / self.testing_episodes) 120 | self.statistic_dict['epoch_test_success_rate'].append(test_success / self.testing_episodes) 121 | print("Epoch %i" % epo, "test avg. 
return %0.1f" % (test_return / self.testing_episodes)) 122 | 123 | if (epo % self.saving_gap == 0) and (epo != 0) and (not test): 124 | self._save_network(ep=epo) 125 | 126 | if not test: 127 | print("Finished training") 128 | print("Saving statistics...") 129 | self._plot_statistics( 130 | x_labels={ 131 | 'critic_loss': 'Optimization epoch (per ' + str(self.optimizer_steps) + ' steps)', 132 | 'actor_loss': 'Optimization epoch (per ' + str(self.optimizer_steps) + ' steps)' 133 | }, 134 | save_to_file=True) 135 | else: 136 | print("Finished testing") 137 | 138 | def _interact(self, render=False, test=False, sleep=0): 139 | done = False 140 | obs = self.env.reset() 141 | if self.curriculum: 142 | self.env._max_episode_steps = self.env.env.curriculum_goal_step 143 | ep_return = 0 144 | new_episode = True 145 | # start a new episode 146 | while not done: 147 | if render: 148 | self.env.render() 149 | action = self._select_action(obs, test=test) 150 | new_obs, reward, done, info = self.env.step(action) 151 | time.sleep(sleep) 152 | ep_return += reward 153 | if not test: 154 | self._remember(obs['observation'], obs['desired_goal'], action, 155 | new_obs['observation'], new_obs['achieved_goal'], reward, 1 - int(done), 156 | new_episode=new_episode) 157 | if self.observation_normalization: 158 | self.normalizer.store_history(np.concatenate((new_obs['observation'], 159 | new_obs['achieved_goal']), axis=0)) 160 | obs = new_obs 161 | new_episode = False 162 | if not test: 163 | self.normalizer.update_mean() 164 | self._learn() 165 | return ep_return 166 | 167 | def _select_action(self, obs, test=False): 168 | inputs = np.concatenate((obs['observation'], obs['desired_goal']), axis=0) 169 | inputs = self.normalizer(inputs) 170 | with T.no_grad(): 171 | inputs = T.as_tensor(inputs, dtype=T.float, device=self.device) 172 | action = self.network_dict['actor_target'](inputs).cpu().detach().numpy() 173 | if test: 174 | # evaluate 175 | return np.clip(action, -self.action_max, self.action_max) 176 | else: 177 | # explore 178 | return self.exploration_strategy(action) 179 | 180 | def _learn(self, steps=None): 181 | if self.hindsight: 182 | self.buffer.modify_episodes() 183 | self.buffer.store_episodes() 184 | if len(self.buffer) < self.batch_size: 185 | return 186 | if steps is None: 187 | steps = self.optimizer_steps 188 | 189 | critic_losses = T.zeros(1, device=self.device) 190 | actor_losses = T.zeros(1, device=self.device) 191 | for i in range(steps): 192 | if self.prioritised: 193 | batch, weights, inds = self.buffer.sample(self.batch_size) 194 | weights = T.as_tensor(weights).view(self.batch_size, 1).to(self.device) 195 | else: 196 | batch = self.buffer.sample(self.batch_size) 197 | weights = T.ones(size=(self.batch_size, 1)).to(self.device) 198 | inds = None 199 | 200 | actor_inputs = np.concatenate((batch.state, batch.desired_goal), axis=1) 201 | actor_inputs = self.normalizer(actor_inputs) 202 | actor_inputs = T.as_tensor(actor_inputs, dtype=T.float32, device=self.device) 203 | actions = T.as_tensor(batch.action, dtype=T.float32, device=self.device) 204 | critic_inputs = T.cat((actor_inputs, actions), dim=1) 205 | actor_inputs_ = np.concatenate((batch.next_state, batch.desired_goal), axis=1) 206 | actor_inputs_ = self.normalizer(actor_inputs_) 207 | actor_inputs_ = T.as_tensor(actor_inputs_, dtype=T.float32, device=self.device) 208 | rewards = T.as_tensor(batch.reward, dtype=T.float32, device=self.device).unsqueeze(1) 209 | done = T.as_tensor(batch.done, dtype=T.float32, 
device=self.device).unsqueeze(1) 210 | 211 | if self.discard_time_limit: 212 | done = done * 0 + 1 213 | 214 | with T.no_grad(): 215 | actions_ = self.network_dict['actor_target'](actor_inputs_) 216 | critic_inputs_ = T.cat((actor_inputs_, actions_), dim=1) 217 | value_1_ = self.network_dict['critic_1_target'](critic_inputs_) 218 | value_2_ = self.network_dict['critic_2_target'](critic_inputs_) 219 | value_ = T.min(value_1_, value_2_) 220 | value_target = rewards + done * self.gamma * value_ 221 | value_target = T.clamp(value_target, -self.clip_value, 0.0) 222 | 223 | self.critic_1_optimizer.zero_grad() 224 | value_estimate_1 = self.network_dict['critic_1'](critic_inputs) 225 | critic_loss_1 = F.mse_loss(value_estimate_1, value_target.detach(), reduction='none') 226 | (critic_loss_1 * weights).mean().backward() 227 | self.critic_1_optimizer.step() 228 | 229 | if self.prioritised: 230 | assert inds is not None 231 | self.buffer.update_priority(inds, np.abs(critic_loss_1.cpu().detach().numpy())) 232 | 233 | self.critic_2_optimizer.zero_grad() 234 | value_estimate_2 = self.network_dict['critic_2'](critic_inputs) 235 | critic_loss_2 = F.mse_loss(value_estimate_2, value_target.detach(), reduction='none') 236 | (critic_loss_2 * weights).mean().backward() 237 | self.critic_2_optimizer.step() 238 | 239 | self.actor_optimizer.zero_grad() 240 | new_actions = self.network_dict['actor'](actor_inputs) 241 | critic_eval_inputs = T.cat((actor_inputs, new_actions), dim=1).to(self.device) 242 | new_values_1 = self.network_dict['critic_1'](critic_eval_inputs) 243 | new_values_2 = self.network_dict['critic_2'](critic_eval_inputs) 244 | actor_loss = -T.min(new_values_1, new_values_2).mean() 245 | actor_loss.backward() 246 | self.actor_optimizer.step() 247 | 248 | critic_losses += critic_loss_1.detach().mean() 249 | actor_losses += actor_loss.detach().mean() 250 | 251 | self._soft_update(self.network_dict['actor'], self.network_dict['actor_target']) 252 | self._soft_update(self.network_dict['critic_1'], self.network_dict['critic_1_target']) 253 | self._soft_update(self.network_dict['critic_2'], self.network_dict['critic_2_target']) 254 | 255 | self.statistic_dict['critic_loss'].append(critic_losses / steps) 256 | self.statistic_dict['actor_loss'].append(actor_losses / steps) 257 | -------------------------------------------------------------------------------- /drl_implementation/agent/continuous_action/distributional_ddpg.py: -------------------------------------------------------------------------------- 1 | import time 2 | import numpy as np 3 | import torch as T 4 | import torch.nn.functional as F 5 | from torch.optim.adam import Adam 6 | from ..utils.networks_mlp import Actor, Critic 7 | from ..agent_base import Agent 8 | from ..utils.exploration_strategy import OUNoise, GaussianNoise 9 | 10 | 11 | class DistributionalDDPG(Agent): 12 | def __init__(self, algo_params, env, transition_tuple=None, path=None, seed=-1): 13 | # environment 14 | self.env = env 15 | self.env.seed(seed) 16 | obs = self.env.reset() 17 | algo_params.update({'state_dim': obs.shape[0], 18 | 'action_dim': self.env.action_space.shape[0], 19 | 'action_max': self.env.action_space.high, 20 | 'action_scaling': self.env.action_space.high[0], 21 | 'init_input_means': None, 22 | 'init_input_vars': None 23 | }) 24 | # training args 25 | self.training_episodes = algo_params['training_episodes'] 26 | self.testing_gap = algo_params['testing_gap'] 27 | self.testing_episodes = algo_params['testing_episodes'] 28 | self.saving_gap = 
algo_params['saving_gap'] 29 | 30 | super(DistributionalDDPG, self).__init__(algo_params, 31 | transition_tuple=transition_tuple, 32 | goal_conditioned=False, 33 | path=path, 34 | seed=seed) 35 | # torch 36 | # categorical distribution atoms 37 | self.num_atoms = algo_params['num_atoms'] 38 | self.value_max = algo_params['value_max'] 39 | self.value_min = algo_params['value_min'] 40 | self.delta_z = (self.value_max - self.value_min) / (self.num_atoms - 1) 41 | self.support = T.linspace(self.value_min, self.value_max, steps=self.num_atoms).to(self.device) 42 | # network 43 | self.network_dict.update({ 44 | 'actor': Actor(self.state_dim, self.action_dim).to(self.device), 45 | 'actor_target': Actor(self.state_dim, self.action_dim).to(self.device), 46 | 'critic': Critic(self.state_dim + self.action_dim, self.num_atoms, softmax=True).to(self.device), 47 | 'critic_target': Critic(self.state_dim + self.action_dim, self.num_atoms, softmax=True).to(self.device) 48 | }) 49 | self.network_keys_to_save = ['actor_target', 'critic_target'] 50 | self.actor_optimizer = Adam(self.network_dict['actor'].parameters(), lr=self.actor_learning_rate) 51 | self._soft_update(self.network_dict['actor'], self.network_dict['actor_target'], tau=1) 52 | self.critic_optimizer = Adam(self.network_dict['critic'].parameters(), lr=self.critic_learning_rate, 53 | weight_decay=algo_params['Q_weight_decay']) 54 | self._soft_update(self.network_dict['critic'], self.network_dict['critic_target'], tau=1) 55 | # behavioural policy args (exploration) 56 | self.exploration_strategy = GaussianNoise(self.action_dim, scale=0.3, sigma=1.0) 57 | # training args 58 | self.warmup_step = algo_params['warmup_step'] 59 | # statistic dict 60 | self.statistic_dict.update({ 61 | 'episode_return': [], 62 | 'episode_test_return': [] 63 | }) 64 | 65 | def run(self, test=False, render=False, load_network_ep=None, sleep=0): 66 | if test: 67 | num_episode = self.testing_episodes 68 | if load_network_ep is not None: 69 | print("Loading network parameters...") 70 | self._load_network(ep=load_network_ep) 71 | print("Start testing...") 72 | else: 73 | num_episode = self.training_episodes 74 | print("Start training...") 75 | 76 | for ep in range(num_episode): 77 | ep_return = self._interact(render, test, sleep=sleep) 78 | self.statistic_dict['episode_return'].append(ep_return) 79 | print("Episode %i" % ep, "return %0.1f" % ep_return) 80 | 81 | if (ep % self.testing_gap == 0) and (ep != 0) and (not test): 82 | ep_test_return = [] 83 | for test_ep in range(self.testing_episodes): 84 | ep_test_return.append(self._interact(render, test=True)) 85 | self.statistic_dict['episode_test_return'].append(sum(ep_test_return) / self.testing_episodes) 86 | print("Episode %i" % ep, "test return %0.1f" % (sum(ep_test_return) / self.testing_episodes)) 87 | 88 | if (ep % self.saving_gap == 0) and (ep != 0) and (not test): 89 | self._save_network(ep=ep) 90 | 91 | if not test: 92 | print("Finished training") 93 | print("Saving statistics...") 94 | self._plot_statistics(save_to_file=True) 95 | else: 96 | print("Finished testing") 97 | 98 | def _interact(self, render=False, test=False, sleep=0): 99 | done = False 100 | obs = self.env.reset() 101 | ep_return = 0 102 | while not done: 103 | if render: 104 | self.env.render() 105 | if self.env_step_count < self.warmup_step: 106 | action = self.env.action_space.sample() 107 | else: 108 | action = self._select_action(obs, test=test) 109 | new_obs, reward, done, info = self.env.step(action) 110 | time.sleep(sleep) 111 | ep_return += 
reward 112 | if not test: 113 | self._remember(obs, action, new_obs, reward, 1 - int(done)) 114 | if self.observation_normalization: 115 | self.normalizer.store_history(new_obs) 116 | self.normalizer.update_mean() 117 | if (self.env_step_count % self.update_interval == 0) and (self.env_step_count > self.warmup_step): 118 | self._learn() 119 | self.env_step_count += 1 120 | obs = new_obs 121 | return ep_return 122 | 123 | def _select_action(self, obs, test=False): 124 | obs = self.normalizer(obs) 125 | with T.no_grad(): 126 | inputs = T.as_tensor(obs, dtype=T.float, device=self.device) 127 | action = self.network_dict['actor_target'](inputs).cpu().detach().numpy() 128 | if test: 129 | # evaluate 130 | return np.clip(action, -self.action_max, self.action_max) 131 | else: 132 | # explore 133 | return self.exploration_strategy(action) 134 | 135 | def _learn(self, steps=None): 136 | if len(self.buffer) < self.batch_size: 137 | return 138 | if steps is None: 139 | steps = self.optimizer_steps 140 | 141 | for i in range(steps): 142 | if self.prioritised: 143 | batch, weights, inds = self.buffer.sample(self.batch_size) 144 | weights = T.as_tensor(weights, device=self.device).view(self.batch_size, 1) 145 | else: 146 | batch = self.buffer.sample(self.batch_size) 147 | weights = T.ones(size=(self.batch_size, 1), device=self.device) 148 | inds = None 149 | 150 | actor_inputs = self.normalizer(batch.state) 151 | actor_inputs = T.as_tensor(actor_inputs, dtype=T.float32, device=self.device) 152 | actions = T.as_tensor(batch.action, dtype=T.float32, device=self.device) 153 | critic_inputs = T.cat((actor_inputs, actions), dim=1) 154 | actor_inputs_ = self.normalizer(batch.next_state) 155 | actor_inputs_ = T.as_tensor(actor_inputs_, dtype=T.float32, device=self.device) 156 | rewards = T.as_tensor(batch.reward, dtype=T.float32, device=self.device) 157 | done = T.as_tensor(batch.done, dtype=T.float32, device=self.device) 158 | 159 | if self.discard_time_limit: 160 | done = done * 0 + 1 161 | 162 | with T.no_grad(): 163 | actions_ = self.network_dict['actor_target'](actor_inputs_) 164 | critic_inputs_ = T.cat((actor_inputs_, actions_), dim=1) 165 | value_dist_ = self.network_dict['critic_target'](critic_inputs_) 166 | value_dist_target = self.project_value_distribution(value_dist_, rewards, done) 167 | value_dist_target = T.as_tensor(value_dist_target, device=self.device) 168 | 169 | self.critic_optimizer.zero_grad() 170 | value_dist_estimate = self.network_dict['critic'](critic_inputs) 171 | critic_loss = F.binary_cross_entropy(value_dist_estimate, value_dist_target, reduction='none').sum(dim=1) 172 | (critic_loss * weights).mean().backward() 173 | self.critic_optimizer.step() 174 | 175 | if self.prioritised: 176 | assert inds is not None 177 | self.buffer.update_priority(inds, np.abs(critic_loss.cpu().detach().numpy())) 178 | 179 | self.actor_optimizer.zero_grad() 180 | new_actions = self.network_dict['actor'](actor_inputs) 181 | critic_eval_inputs = T.cat((actor_inputs, new_actions), dim=1) 182 | # take the expectation of the value distribution as the policy loss 183 | actor_loss = -(self.network_dict['critic'](critic_eval_inputs) * self.support) 184 | actor_loss = actor_loss.sum(dim=1) 185 | actor_loss.mean().backward() 186 | self.actor_optimizer.step() 187 | 188 | self._soft_update(self.network_dict['actor'], self.network_dict['actor_target']) 189 | self._soft_update(self.network_dict['critic'], self.network_dict['critic_target']) 190 | 191 | 
self.statistic_dict['critic_loss'].append(critic_loss.detach().mean()) 192 | self.statistic_dict['actor_loss'].append(actor_loss.detach().mean()) 193 | 194 | def project_value_distribution(self, value_dist, rewards, done): 195 | # refer to https://github.com/schatty/d4pg-pytorch/blob/7dc23096a45bc4036fbb02493e0b052d57cfe4c6/models/d4pg/l2_projection.py#L7 196 | # comments added 197 | copy_value_dist = value_dist.data.cpu().numpy() 198 | copy_rewards = rewards.data.cpu().numpy() 199 | copy_done = (1-done).data.cpu().numpy().astype(np.bool) 200 | batch_size = self.batch_size 201 | n_atoms = self.num_atoms 202 | projected_dist = np.zeros((batch_size, n_atoms), dtype=np.float32) 203 | 204 | # calculate the next state value for each atom in the support set 205 | for atom in range(n_atoms): 206 | atom_ = copy_rewards + (self.value_min + atom * self.delta_z) * self.gamma 207 | tz_j = np.clip(atom_, a_max=self.value_max, a_min=self.value_min) 208 | # compute where the next value is on the indexes axis of the support set 209 | b_j = (tz_j - self.value_min) / self.delta_z 210 | # compute floor and ceiling indexes of the next value on the support set 211 | l = np.floor(b_j).astype(np.int64) 212 | u = np.ceil(b_j).astype(np.int64) 213 | # since l and u are floor and ceiling indexes of the next value on the support set 214 | # their difference is always 0 at the boundary and 1 otherwise 215 | # thus, the predicted probability of the next value is distributed proportional to 216 | # the difference between the projected value index (b_j) and its floor or ceiling 217 | # boundary case, floor == ceiling 218 | eq_mask = (u == l) # this line gives an array of boolean masks 219 | projected_dist[eq_mask, l[eq_mask]] += copy_value_dist[eq_mask, atom] 220 | # otherwise, ceiling - floor == 1, i.e., (u - b_j) + (b_j - l) == 1 221 | ne_mask = (u != l) 222 | projected_dist[ne_mask, l[ne_mask]] += copy_value_dist[ne_mask, atom] * (u - b_j)[ne_mask] 223 | projected_dist[ne_mask, u[ne_mask]] += copy_value_dist[ne_mask, atom] * (b_j - l)[ne_mask] 224 | 225 | # check if a terminal state exists 226 | if copy_done.any(): 227 | projected_dist[copy_done] = 0.0 228 | # value at a terminal state should equal to the immediate reward only 229 | tz_j = np.clip(copy_rewards[copy_done], a_max=self.value_max, a_min=self.value_min) 230 | b_j = (tz_j - self.value_min) / self.delta_z 231 | l = np.floor(b_j).astype(np.int64) 232 | u = np.ceil(b_j).astype(np.int64) 233 | eq_mask = (u == l) 234 | eq_dones = copy_done.copy() 235 | eq_dones[copy_done] = eq_mask 236 | # the value probability is only set to 1.0 237 | # when it is a terminal state and its floor and ceiling indexes are the same 238 | if eq_dones.any(): 239 | projected_dist[eq_dones, l[eq_mask]] = 1.0 240 | ne_mask = (u != l) 241 | ne_dones = copy_done.copy() 242 | ne_dones[copy_done] = ne_mask 243 | # the value probability is only distributed while summed to 1.0 244 | # when it is a terminal state and its floor and ceiling indexes differ by 1 index 245 | if ne_dones.any(): 246 | projected_dist[ne_dones, l[ne_mask]] = (u - b_j)[ne_mask] 247 | projected_dist[ne_dones, u[ne_mask]] = (b_j - l)[ne_mask] 248 | 249 | return projected_dist 250 | -------------------------------------------------------------------------------- /drl_implementation/agent/continuous_action/sac.py: -------------------------------------------------------------------------------- 1 | import time 2 | import numpy as np 3 | import torch as T 4 | import torch.nn.functional as F 5 | from torch.optim.adam import 
Adam 6 | from ..utils.networks_mlp import StochasticActor, Critic 7 | from ..agent_base import Agent 8 | 9 | 10 | class SAC(Agent): 11 | def __init__(self, algo_params, env, transition_tuple=None, path=None, seed=-1): 12 | # environment 13 | self.env = env 14 | self.env.seed(seed) 15 | obs = self.env.reset() 16 | algo_params.update({'state_dim': obs.shape[0], 17 | 'action_dim': self.env.action_space.shape[0], 18 | 'action_max': self.env.action_space.high, 19 | 'action_scaling': self.env.action_space.high[0], 20 | 'init_input_means': None, 21 | 'init_input_vars': None 22 | }) 23 | # training args 24 | self.training_episodes = algo_params['training_episodes'] 25 | self.testing_gap = algo_params['testing_gap'] 26 | self.testing_episodes = algo_params['testing_episodes'] 27 | self.saving_gap = algo_params['saving_gap'] 28 | 29 | super(SAC, self).__init__(algo_params, 30 | transition_tuple=transition_tuple, 31 | goal_conditioned=False, 32 | path=path, 33 | seed=seed) 34 | # torch 35 | self.network_dict.update({ 36 | 'actor': StochasticActor(self.state_dim, self.action_dim, log_std_min=-6, log_std_max=1, action_scaling=self.action_scaling).to(self.device), 37 | 'critic_1': Critic(self.state_dim + self.action_dim, 1).to(self.device), 38 | 'critic_1_target': Critic(self.state_dim + self.action_dim, 1).to(self.device), 39 | 'critic_2': Critic(self.state_dim + self.action_dim, 1).to(self.device), 40 | 'critic_2_target': Critic(self.state_dim + self.action_dim, 1).to(self.device), 41 | 'alpha': algo_params['alpha'], 42 | 'log_alpha': T.tensor(np.log(algo_params['alpha']), requires_grad=True, device=self.device), 43 | }) 44 | self.network_keys_to_save = ['actor', 'critic_1_target'] 45 | self.actor_optimizer = Adam(self.network_dict['actor'].parameters(), lr=self.actor_learning_rate) 46 | self.critic_1_optimizer = Adam(self.network_dict['critic_1'].parameters(), lr=self.critic_learning_rate) 47 | self.critic_2_optimizer = Adam(self.network_dict['critic_2'].parameters(), lr=self.critic_learning_rate) 48 | self._soft_update(self.network_dict['critic_1'], self.network_dict['critic_1_target'], tau=1) 49 | self._soft_update(self.network_dict['critic_2'], self.network_dict['critic_2_target'], tau=1) 50 | self.target_entropy = -self.action_dim 51 | self.alpha_optimizer = Adam([self.network_dict['log_alpha']], lr=self.actor_learning_rate) 52 | # training args 53 | self.warmup_step = algo_params['warmup_step'] 54 | self.actor_update_interval = algo_params['actor_update_interval'] 55 | self.critic_target_update_interval = algo_params['critic_target_update_interval'] 56 | # statistic dict 57 | self.statistic_dict.update({ 58 | 'episode_return': [], 59 | 'episode_test_return': [], 60 | 'alpha': [], 61 | 'policy_entropy': [], 62 | }) 63 | 64 | def run(self, test=False, render=False, load_network_ep=None, sleep=0): 65 | if test: 66 | num_episode = self.testing_episodes 67 | if load_network_ep is not None: 68 | print("Loading network parameters...") 69 | self._load_network(ep=load_network_ep) 70 | print("Start testing...") 71 | else: 72 | num_episode = self.training_episodes 73 | print("Start training...") 74 | 75 | for ep in range(num_episode): 76 | ep_return = self._interact(render, test, sleep=sleep) 77 | self.statistic_dict['episode_return'].append(ep_return) 78 | print("Episode %i" % ep, "return %0.1f" % ep_return) 79 | 80 | if (ep % self.testing_gap == 0) and (ep != 0) and (not test): 81 | ep_test_return = [] 82 | for test_ep in range(self.testing_episodes): 83 | ep_test_return.append(self._interact(render, 
test=True)) 84 | self.statistic_dict['episode_test_return'].append(sum(ep_test_return)/self.testing_episodes) 85 | print("Episode %i" % ep, "test return %0.1f" % (sum(ep_test_return)/self.testing_episodes)) 86 | 87 | if (ep % self.saving_gap == 0) and (ep != 0) and (not test): 88 | self._save_network(ep=ep) 89 | 90 | if not test: 91 | print("Finished training") 92 | print("Saving statistics...") 93 | self._plot_statistics(save_to_file=True) 94 | else: 95 | print("Finished testing") 96 | 97 | def _interact(self, render=False, test=False, sleep=0): 98 | done = False 99 | obs = self.env.reset() 100 | ep_return = 0 101 | # start a new episode 102 | while not done: 103 | if render: 104 | self.env.render() 105 | if self.env_step_count < self.warmup_step: 106 | action = self.env.action_space.sample() 107 | else: 108 | action = self._select_action(obs, test=test) 109 | new_obs, reward, done, info = self.env.step(action) 110 | time.sleep(sleep) 111 | ep_return += reward 112 | if not test: 113 | self._remember(obs, action, new_obs, reward, 1 - int(done)) 114 | if self.observation_normalization: 115 | self.normalizer.store_history(new_obs) 116 | self.normalizer.update_mean() 117 | if (self.env_step_count % self.update_interval == 0) and (self.env_step_count > self.warmup_step): 118 | self._learn() 119 | self.env_step_count += 1 120 | obs = new_obs 121 | return ep_return 122 | 123 | def _select_action(self, obs, test=False): 124 | inputs = self.normalizer(obs) 125 | inputs = T.as_tensor(inputs, dtype=T.float, device=self.device) 126 | return self.network_dict['actor'].get_action(inputs, mean_pi=test).detach().cpu().numpy() 127 | 128 | def _learn(self, steps=None): 129 | if len(self.buffer) < self.batch_size: 130 | return 131 | if steps is None: 132 | steps = self.optimizer_steps 133 | 134 | for i in range(steps): 135 | if self.prioritised: 136 | batch, weights, inds = self.buffer.sample(self.batch_size) 137 | weights = T.tensor(weights).view(self.batch_size, 1).to(self.device) 138 | else: 139 | batch = self.buffer.sample(self.batch_size) 140 | weights = T.ones(size=(self.batch_size, 1)).to(self.device) 141 | inds = None 142 | 143 | actor_inputs = self.normalizer(batch.state) 144 | actor_inputs = T.as_tensor(actor_inputs, dtype=T.float32, device=self.device) 145 | actions = T.as_tensor(batch.action, dtype=T.float32, device=self.device) 146 | critic_inputs = T.cat((actor_inputs, actions), dim=1) 147 | actor_inputs_ = self.normalizer(batch.next_state) 148 | actor_inputs_ = T.as_tensor(actor_inputs_, dtype=T.float32, device=self.device) 149 | rewards = T.as_tensor(batch.reward, dtype=T.float32, device=self.device).unsqueeze(1) 150 | done = T.as_tensor(batch.done, dtype=T.float32, device=self.device).unsqueeze(1) 151 | 152 | if self.discard_time_limit: 153 | done = done * 0 + 1 154 | 155 | with T.no_grad(): 156 | actions_, log_probs_ = self.network_dict['actor'].get_action(actor_inputs_, probs=True) 157 | critic_inputs_ = T.cat((actor_inputs_, actions_), dim=1) 158 | value_1_ = self.network_dict['critic_1_target'](critic_inputs_) 159 | value_2_ = self.network_dict['critic_2_target'](critic_inputs_) 160 | value_ = T.min(value_1_, value_2_) - (self.network_dict['alpha'] * log_probs_) 161 | value_target = rewards + done * self.gamma * value_ 162 | 163 | self.critic_1_optimizer.zero_grad() 164 | value_estimate_1 = self.network_dict['critic_1'](critic_inputs) 165 | critic_loss_1 = F.mse_loss(value_estimate_1, value_target.detach(), reduction='none') 166 | (critic_loss_1 * weights).mean().backward() 167 | 
self.critic_1_optimizer.step() 168 | 169 | if self.prioritised: 170 | assert inds is not None 171 | self.buffer.update_priority(inds, np.abs(critic_loss_1.cpu().detach().numpy())) 172 | 173 | self.critic_2_optimizer.zero_grad() 174 | value_estimate_2 = self.network_dict['critic_2'](critic_inputs) 175 | critic_loss_2 = F.mse_loss(value_estimate_2, value_target.detach(), reduction='none') 176 | (critic_loss_2 * weights).mean().backward() 177 | self.critic_2_optimizer.step() 178 | 179 | self.statistic_dict['critic_loss'].append(critic_loss_1.detach().mean()) 180 | 181 | if self.optim_step_count % self.critic_target_update_interval == 0: 182 | self._soft_update(self.network_dict['critic_1'], self.network_dict['critic_1_target']) 183 | self._soft_update(self.network_dict['critic_2'], self.network_dict['critic_2_target']) 184 | 185 | if self.optim_step_count % self.actor_update_interval == 0: 186 | self.actor_optimizer.zero_grad() 187 | new_actions, new_log_probs = self.network_dict['actor'].get_action(actor_inputs, probs=True) 188 | critic_eval_inputs = T.cat((actor_inputs, new_actions), dim=1) 189 | new_values = T.min(self.network_dict['critic_1'](critic_eval_inputs), 190 | self.network_dict['critic_2'](critic_eval_inputs)) 191 | actor_loss = (self.network_dict['alpha']*new_log_probs - new_values).mean() 192 | actor_loss.backward() 193 | self.actor_optimizer.step() 194 | 195 | self.alpha_optimizer.zero_grad() 196 | alpha_loss = (self.network_dict['log_alpha'] * (-new_log_probs - self.target_entropy).detach()).mean() 197 | alpha_loss.backward() 198 | self.alpha_optimizer.step() 199 | self.network_dict['alpha'] = self.network_dict['log_alpha'].exp() 200 | 201 | self.statistic_dict['actor_loss'].append(actor_loss.detach().mean()) 202 | self.statistic_dict['alpha'].append(self.network_dict['alpha'].detach()) 203 | self.statistic_dict['policy_entropy'].append(-new_log_probs.detach().mean()) 204 | 205 | self.optim_step_count += 1 206 | -------------------------------------------------------------------------------- /drl_implementation/agent/continuous_action/sac_goal_conditioned.py: -------------------------------------------------------------------------------- 1 | import time 2 | import numpy as np 3 | import torch as T 4 | import torch.nn.functional as F 5 | from torch.optim.adam import Adam 6 | from ..utils.networks_mlp import StochasticActor, Critic 7 | from ..agent_base import Agent 8 | 9 | 10 | class GoalConditionedSAC(Agent): 11 | def __init__(self, algo_params, env, transition_tuple=None, path=None, seed=-1): 12 | # environment 13 | self.env = env 14 | self.env.seed(seed) 15 | obs = self.env.reset() 16 | algo_params.update({'state_dim': obs['observation'].shape[0], 17 | 'goal_dim': obs['desired_goal'].shape[0], 18 | 'action_dim': self.env.action_space.shape[0], 19 | 'action_max': self.env.action_space.high, 20 | 'action_scaling': self.env.action_space.high[0], 21 | 'init_input_means': None, 22 | 'init_input_vars': None 23 | }) 24 | # training args 25 | self.training_epochs = algo_params['training_epochs'] 26 | self.training_cycles = algo_params['training_cycles'] 27 | self.training_episodes = algo_params['training_episodes'] 28 | self.testing_gap = algo_params['testing_gap'] 29 | self.testing_episodes = algo_params['testing_episodes'] 30 | self.saving_gap = algo_params['saving_gap'] 31 | 32 | super(GoalConditionedSAC, self).__init__(algo_params, 33 | transition_tuple=transition_tuple, 34 | goal_conditioned=True, 35 | path=path, 36 | seed=seed) 37 | # torch 38 | self.network_dict.update({ 39 
| 'actor': StochasticActor(self.state_dim + self.goal_dim, self.action_dim, log_std_min=-6, log_std_max=1, 40 | action_scaling=self.action_scaling).to(self.device), 41 | 'critic_1': Critic(self.state_dim + self.goal_dim + self.action_dim, 1).to(self.device), 42 | 'critic_1_target': Critic(self.state_dim + self.goal_dim + self.action_dim, 1).to(self.device), 43 | 'critic_2': Critic(self.state_dim + self.goal_dim + self.action_dim, 1).to(self.device), 44 | 'critic_2_target': Critic(self.state_dim + self.goal_dim + self.action_dim, 1).to(self.device), 45 | 'alpha': algo_params['alpha'], 46 | 'log_alpha': T.tensor(np.log(algo_params['alpha']), requires_grad=True, device=self.device), 47 | }) 48 | self.network_keys_to_save = ['actor', 'critic_1_target'] 49 | self.actor_optimizer = Adam(self.network_dict['actor'].parameters(), lr=self.actor_learning_rate) 50 | self.critic_1_optimizer = Adam(self.network_dict['critic_1'].parameters(), lr=self.critic_learning_rate) 51 | self.critic_2_optimizer = Adam(self.network_dict['critic_2'].parameters(), lr=self.critic_learning_rate) 52 | self._soft_update(self.network_dict['critic_1'], self.network_dict['critic_1_target'], tau=1) 53 | self._soft_update(self.network_dict['critic_2'], self.network_dict['critic_2_target'], tau=1) 54 | self.target_entropy = -self.action_dim 55 | self.alpha_optimizer = Adam([self.network_dict['log_alpha']], lr=self.actor_learning_rate) 56 | # training args 57 | self.clip_value = algo_params['clip_value'] 58 | self.actor_update_interval = algo_params['actor_update_interval'] 59 | self.critic_target_update_interval = algo_params['critic_target_update_interval'] 60 | # statistic dict 61 | self.statistic_dict.update({ 62 | 'cycle_return': [], 63 | 'cycle_success_rate': [], 64 | 'epoch_test_return': [], 65 | 'epoch_test_success_rate': [], 66 | 'alpha': [], 67 | 'policy_entropy': [], 68 | }) 69 | 70 | def run(self, test=False, render=False, load_network_ep=None, sleep=0): 71 | # training setup uses a hierarchy of Epoch, Cycle and Episode 72 | # following the HER paper: https://papers.nips.cc/paper/2017/hash/453fadbd8a1a3af50a9df4df899537b5-Abstract.html 73 | if test: 74 | if load_network_ep is not None: 75 | print("Loading network parameters...") 76 | self._load_network(ep=load_network_ep) 77 | print("Start testing...") 78 | else: 79 | print("Start training...") 80 | 81 | for epo in range(self.training_epochs): 82 | for cyc in range(self.training_cycles): 83 | cycle_return = 0 84 | cycle_success = 0 85 | for ep in range(self.training_episodes): 86 | ep_return = self._interact(render, test, sleep=sleep) 87 | cycle_return += ep_return 88 | if ep_return > -50: 89 | cycle_success += 1 90 | 91 | self.statistic_dict['cycle_return'].append(cycle_return / self.training_episodes) 92 | self.statistic_dict['cycle_success_rate'].append(cycle_success / self.training_episodes) 93 | print("Epoch %i" % epo, "Cycle %i" % cyc, 94 | "avg. 
return %0.1f" % (cycle_return / self.training_episodes), 95 | "success rate %0.1f" % (cycle_success / self.training_episodes)) 96 | 97 | if (epo % self.testing_gap == 0) and (epo != 0) and (not test): 98 | test_return = 0 99 | test_success = 0 100 | for test_ep in range(self.testing_episodes): 101 | ep_test_return = self._interact(render, test=True) 102 | test_return += ep_test_return 103 | if ep_test_return > -50: 104 | test_success += 1 105 | self.statistic_dict['epoch_test_return'].append(test_return / self.testing_episodes) 106 | self.statistic_dict['epoch_test_success_rate'].append(test_success / self.testing_episodes) 107 | print("Epoch %i" % epo, "test avg. return %0.1f" % (test_return / self.testing_episodes)) 108 | 109 | if (epo % self.saving_gap == 0) and (epo != 0) and (not test): 110 | self._save_network(ep=epo) 111 | 112 | if not test: 113 | print("Finished training") 114 | print("Saving statistics...") 115 | self._plot_statistics( 116 | x_labels={ 117 | 'critic_loss': 'Optimization epoch (per ' + str(self.optimizer_steps) + ' steps)', 118 | 'actor_loss': 'Optimization epoch (per ' + str(self.optimizer_steps) + ' steps)', 119 | 'alpha': 'Optimization epoch (per ' + str(self.optimizer_steps) + ' steps)', 120 | 'policy_entropy': 'Optimization epoch (per ' + str(self.optimizer_steps) + ' steps)' 121 | }, 122 | save_to_file=True) 123 | else: 124 | print("Finished testing") 125 | 126 | def _interact(self, render=False, test=False, sleep=0): 127 | done = False 128 | obs = self.env.reset() 129 | ep_return = 0 130 | new_episode = True 131 | # start a new episode 132 | while not done: 133 | if render: 134 | self.env.render() 135 | action = self._select_action(obs, test=test) 136 | new_obs, reward, done, info = self.env.step(action) 137 | time.sleep(sleep) 138 | ep_return += reward 139 | if not test: 140 | self._remember(obs['observation'], obs['desired_goal'], action, 141 | new_obs['observation'], new_obs['achieved_goal'], reward, 1 - int(done), 142 | new_episode=new_episode) 143 | if self.observation_normalization: 144 | self.normalizer.store_history(np.concatenate((new_obs['observation'], 145 | new_obs['achieved_goal']), axis=0)) 146 | obs = new_obs 147 | new_episode = False 148 | 149 | if not test: 150 | self.normalizer.update_mean() 151 | self._learn() 152 | return ep_return 153 | 154 | def _select_action(self, obs, test=False): 155 | inputs = np.concatenate((obs['observation'], obs['desired_goal']), axis=0) 156 | inputs = self.normalizer(inputs) 157 | inputs = T.as_tensor(inputs, dtype=T.float).to(self.device) 158 | return self.network_dict['actor'].get_action(inputs, mean_pi=test).detach().cpu().numpy() 159 | 160 | def _learn(self, steps=None): 161 | if self.hindsight: 162 | self.buffer.modify_episodes() 163 | self.buffer.store_episodes() 164 | if len(self.buffer) < self.batch_size: 165 | return 166 | if steps is None: 167 | steps = self.optimizer_steps 168 | 169 | critic_losses = T.zeros(1, device=self.device) 170 | actor_losses = T.zeros(1, device=self.device) 171 | alphas = T.zeros(1, device=self.device) 172 | policy_entropies = T.zeros(1, device=self.device) 173 | for i in range(steps): 174 | if self.prioritised: 175 | batch, weights, inds = self.buffer.sample(self.batch_size) 176 | weights = T.as_tensor(weights, device=self.device).view(self.batch_size, 1) 177 | else: 178 | batch = self.buffer.sample(self.batch_size) 179 | weights = T.ones(size=(self.batch_size, 1), device=self.device) 180 | inds = None 181 | 182 | actor_inputs = np.concatenate((batch.state, 
batch.desired_goal), axis=1) 183 | actor_inputs = self.normalizer(actor_inputs) 184 | actor_inputs = T.as_tensor(actor_inputs, dtype=T.float32, device=self.device) 185 | actions_np = np.array(batch.action) 186 | actions = T.as_tensor(actions_np, dtype=T.float32, device=self.device) 187 | critic_inputs = T.cat((actor_inputs, actions), dim=1) 188 | actor_inputs_ = np.concatenate((batch.next_state, batch.desired_goal), axis=1) 189 | actor_inputs_ = self.normalizer(actor_inputs_) 190 | actor_inputs_ = T.as_tensor(actor_inputs_, dtype=T.float32, device=self.device) 191 | rewards_np = np.array(batch.reward) 192 | rewards = T.as_tensor(rewards_np, dtype=T.float32, device=self.device).unsqueeze(1) 193 | done_np = np.array(batch.done) 194 | done = T.as_tensor(done_np, dtype=T.float32, device=self.device).unsqueeze(1) 195 | 196 | if self.discard_time_limit: 197 | done = done * 0 + 1 198 | 199 | with T.no_grad(): 200 | actions_, log_probs_ = self.network_dict['actor'].get_action(actor_inputs_, probs=True) 201 | critic_inputs_ = T.cat((actor_inputs_, actions_), dim=1) 202 | value_1_ = self.network_dict['critic_1_target'](critic_inputs_) 203 | value_2_ = self.network_dict['critic_2_target'](critic_inputs_) 204 | value_ = T.min(value_1_, value_2_) - (self.network_dict['alpha'] * log_probs_) 205 | value_target = rewards + done * self.gamma * value_ 206 | value_target = T.clamp(value_target, -self.clip_value, 0.0) 207 | 208 | self.critic_1_optimizer.zero_grad() 209 | value_estimate_1 = self.network_dict['critic_1'](critic_inputs) 210 | critic_loss_1 = F.mse_loss(value_estimate_1, value_target.detach(), reduction='none') 211 | (critic_loss_1 * weights).mean().backward() 212 | self.critic_1_optimizer.step() 213 | 214 | if self.prioritised: 215 | assert inds is not None 216 | self.buffer.update_priority(inds, np.abs(critic_loss_1.cpu().detach().numpy())) 217 | 218 | self.critic_2_optimizer.zero_grad() 219 | value_estimate_2 = self.network_dict['critic_2'](critic_inputs) 220 | critic_loss_2 = F.mse_loss(value_estimate_2, value_target.detach(), reduction='none') 221 | (critic_loss_2 * weights).mean().backward() 222 | self.critic_2_optimizer.step() 223 | 224 | critic_losses += critic_loss_1.detach().mean() 225 | 226 | if self.optim_step_count % self.critic_target_update_interval == 0: 227 | self._soft_update(self.network_dict['critic_1'], self.network_dict['critic_1_target']) 228 | self._soft_update(self.network_dict['critic_2'], self.network_dict['critic_2_target']) 229 | 230 | if self.optim_step_count % self.actor_update_interval == 0: 231 | self.actor_optimizer.zero_grad() 232 | new_actions, new_log_probs, entropy = self.network_dict['actor'].get_action(actor_inputs, probs=True, 233 | entropy=True) 234 | critic_eval_inputs = T.cat((actor_inputs, new_actions), dim=1).to(self.device) 235 | new_values = T.min(self.network_dict['critic_1'](critic_eval_inputs), 236 | self.network_dict['critic_2'](critic_eval_inputs)) 237 | actor_loss = (self.network_dict['alpha'] * new_log_probs - new_values).mean() 238 | actor_loss.backward() 239 | self.actor_optimizer.step() 240 | 241 | self.alpha_optimizer.zero_grad() 242 | alpha_loss = (self.network_dict['log_alpha'] * (-new_log_probs - self.target_entropy).detach()).mean() 243 | alpha_loss.backward() 244 | self.alpha_optimizer.step() 245 | self.network_dict['alpha'] = self.network_dict['log_alpha'].exp() 246 | 247 | actor_losses += actor_loss.detach() 248 | alphas += self.network_dict['alpha'].detach() 249 | policy_entropies += entropy.detach().mean() 250 | 251 | 
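The goal-conditioned variant additionally clamps the bootstrapped target to `[-clip_value, 0]`. Assuming the sparse reward convention of the HER-style environments (a reward of -1 per step until the goal is reached, 0 otherwise), the discounted return is bounded below by `-1 / (1 - gamma)` and above by 0, so the clamp keeps value targets inside the feasible range; the `ep_return > -50` success check in `run` relies on the same convention. A small illustrative calculation with assumed values (not taken from this file):

```python
import torch as T

# Assumed sparse reward convention: r = -1 per step before success, 0 afterwards.
gamma = 0.98
clip_value = 1.0 / (1.0 - gamma)          # ~= 50.0, the tightest bound on |discounted return|
raw_targets = T.tensor([[-120.0], [-3.7], [4.2]])
clipped = T.clamp(raw_targets, -clip_value, 0.0)
print(clipped)  # ~= tensor([[-50.0], [-3.7], [0.0]]): targets stay within the feasible range
```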
self.optim_step_count += 1 252 | 253 | self.statistic_dict['critic_loss'].append(critic_losses / steps) 254 | self.statistic_dict['actor_loss'].append(actor_losses / steps) 255 | self.statistic_dict['alpha'].append(alphas / steps) 256 | self.statistic_dict['policy_entropy'].append(policy_entropies / steps) 257 | -------------------------------------------------------------------------------- /drl_implementation/agent/continuous_action/sac_parameterised_action_goal_conditioned.py: -------------------------------------------------------------------------------- 1 | import time 2 | import numpy as np 3 | import torch as T 4 | import torch.nn.functional as F 5 | from torch.optim.adam import Adam 6 | from ..utils.networks_mlp import StochasticActor 7 | from ..utils.networks_pointnet import CriticPointNet, CriticPointNet2 8 | from ..agent_base import Agent 9 | from collections import namedtuple 10 | 11 | 12 | class GPASAC(Agent): 13 | def __init__(self, algo_params, env, transition_tuple=None, path=None, seed=-1): 14 | # environment 15 | self.env = env 16 | self.env.seed(seed) 17 | obs = self.env.reset() 18 | algo_params.update({'state_shape': obs['observation'].shape, 19 | 'goal_shape': obs['desired_goal'].shape, 20 | 'discrete_action_dim': self.env.discrete_action_space.n, 21 | 'continuous_action_dim': self.env.continuous_action_space.shape[0], 22 | 'continuous_action_max': self.env.continuous_action_space.high, 23 | 'continuous_action_scaling': self.env.continuous_action_space.high[0], 24 | }) 25 | # training args 26 | self.cur_ep = 0 27 | self.warmup_step = algo_params['warmup_step'] 28 | self.training_episodes = algo_params['training_episodes'] 29 | self.testing_gap = algo_params['testing_gap'] 30 | self.testing_episodes = algo_params['testing_episodes'] 31 | self.saving_gap = algo_params['saving_gap'] 32 | 33 | self.use_demonstrations = algo_params['use_demonstrations'] 34 | self.demonstrate_percentage = algo_params['demonstrate_percentage'] 35 | assert 0 < self.demonstrate_percentage < 1, "Demonstrate percentage should be between 0 and 1" 36 | self.n_demonstrate_episodes = int(self.demonstrate_percentage * self.training_episodes) 37 | self.planned_skills = algo_params['planned_skills'] 38 | assert not (self.use_demonstrations and self.planned_skills), "Cannot demonstrate and planned skills at the same time" 39 | self.skill_plan = algo_params['skill_plan'] 40 | self.use_planned_skills = False 41 | 42 | if transition_tuple is None: 43 | transition_tuple = namedtuple("transition", 44 | ('state', 'desired_goal', 'action', 45 | 'next_state', 'achieved_goal', 'reward', 'done', 'next_skill')) 46 | super(GPASAC, self).__init__(algo_params, 47 | non_flat_obs=True, 48 | action_type='hybrid', 49 | transition_tuple=transition_tuple, 50 | goal_conditioned=True, 51 | path=path, 52 | seed=seed, 53 | create_logger=True) 54 | # torch 55 | self.network_dict.update({ 56 | 'discrete_actor': StochasticActor(2048, self.discrete_action_dim, continuous=False, 57 | fc1_size=1024, 58 | log_std_min=-6, log_std_max=1).to(self.device), 59 | 'continuous_actor': StochasticActor(2048 + self.discrete_action_dim, self.continuous_action_dim, 60 | fc1_size=1024, 61 | log_std_min=-6, log_std_max=1, 62 | action_scaling=self.continuous_action_scaling).to(self.device), 63 | 'critic_1': CriticPointNet(output_dim=1, action_dim=self.discrete_action_dim+self.continuous_action_dim).to(self.device), 64 | 'critic_1_target': CriticPointNet(output_dim=1, action_dim=self.discrete_action_dim+self.continuous_action_dim).to(self.device), 65 
| 'critic_2': CriticPointNet(output_dim=1, action_dim=self.discrete_action_dim+self.continuous_action_dim).to(self.device), 66 | 'critic_2_target': CriticPointNet(output_dim=1, action_dim=self.discrete_action_dim+self.continuous_action_dim).to(self.device), 67 | 'alpha_discrete': algo_params['alpha'], 68 | 'log_alpha_discrete': T.tensor(np.log(algo_params['alpha']), requires_grad=True, device=self.device), 69 | 'alpha_continuous': algo_params['alpha'], 70 | 'log_alpha_continuous': T.tensor(np.log(algo_params['alpha']), requires_grad=True, device=self.device), 71 | }) 72 | self.network_dict['critic_1_target'].eval() 73 | self.network_dict['critic_2_target'].eval() 74 | self.network_keys_to_save = ['discrete_actor', 'continuous_actor', 'critic_1', 'critic_1_target'] 75 | self.discrete_actor_optimizer = Adam(self.network_dict['discrete_actor'].parameters(), 76 | lr=self.actor_learning_rate) 77 | self.continuous_actor_optimizer = Adam(self.network_dict['continuous_actor'].parameters(), 78 | lr=self.actor_learning_rate) 79 | self.critic_1_optimizer = Adam(self.network_dict['critic_1'].parameters(), lr=self.critic_learning_rate) 80 | self.critic_2_optimizer = Adam(self.network_dict['critic_2'].parameters(), lr=self.critic_learning_rate) 81 | self._soft_update(self.network_dict['critic_1'], self.network_dict['critic_1_target'], tau=1) 82 | self._soft_update(self.network_dict['critic_2'], self.network_dict['critic_2_target'], tau=1) 83 | self.target_discrete_entropy = -self.discrete_action_dim 84 | self.target_continuous_entropy = -self.continuous_action_dim 85 | self.alpha_discrete_optimizer = Adam([self.network_dict['log_alpha_discrete']], lr=self.actor_learning_rate) 86 | self.alpha_continuous_optimizer = Adam([self.network_dict['log_alpha_continuous']], lr=self.actor_learning_rate) 87 | # training args 88 | # self.clip_value = algo_params['clip_value'] 89 | self.actor_update_interval = algo_params['actor_update_interval'] 90 | self.critic_target_update_interval = algo_params['critic_target_update_interval'] 91 | 92 | def run(self, test=False, render=False, load_network_ep=None, sleep=0): 93 | if test: 94 | if load_network_ep is not None: 95 | print("Loading network parameters...") 96 | self._load_network(ep=load_network_ep) 97 | print("Start testing...") 98 | else: 99 | print("Start training...") 100 | 101 | for ep in range(self.training_episodes): 102 | if self.use_demonstrations and (ep < self.n_demonstrate_episodes): 103 | self.use_planned_skills = True 104 | elif self.planned_skills: 105 | self.use_planned_skills = True 106 | else: 107 | self.use_planned_skills = False 108 | self.cur_ep = ep 109 | loss_info = self._interact(render, test, sleep=sleep) 110 | self.logger.add_scalar(tag='Task/return', scalar_value=loss_info['emd_loss'], global_step=self.cur_ep) 111 | self.logger.add_scalar(tag='Task/heightmap_loss', scalar_value=loss_info['height_map_loss'], global_step=ep) 112 | print("Episode %i" % ep, "return %0.1f" % loss_info['emd_loss']) 113 | if not test and self.hindsight: 114 | self.buffer.hindsight() 115 | 116 | if (ep % self.testing_gap == 0) and (ep != 0) and (not test): 117 | if self.planned_skills: 118 | self.use_planned_skills = True 119 | else: 120 | self.use_planned_skills = False 121 | test_return = 0 122 | test_heightmap_loss = 0 123 | for test_ep in range(self.testing_episodes): 124 | loss_info = self._interact(render, test=True) 125 | test_return += loss_info['emd_loss'] 126 | test_heightmap_loss += loss_info['height_map_loss'] 127 | 
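GPASAC acts with a hybrid (parameterised) action: the first entry passed to `env.step` is a discrete skill index and the remaining entries are the continuous parameters, while the critics consume the one-hot encoding of the skill concatenated with the continuous part (see `_select_action` and `_learn` below). A minimal sketch of that encode step, with dummy dimensions and values (not code from this file):

```python
import numpy as np
import torch as T
import torch.nn.functional as F

n_skills, continuous_dim = 4, 3                                  # dummy dimensions
env_action = np.concatenate([[2], np.array([0.1, -0.5, 0.8])])   # skill index + parameters
assert env_action.shape[0] == 1 + continuous_dim

# What the learner does with a stored hybrid action: split it, one-hot the skill,
# and re-concatenate before feeding the critics.
action = T.as_tensor(env_action, dtype=T.float32).unsqueeze(0)   # shape (1, 1 + continuous_dim)
skill = action[:, 0].type(T.long)                                # discrete part as an index
skill_onehot = F.one_hot(skill, n_skills).float()                # shape (1, n_skills)
critic_action = T.cat((skill_onehot, action[:, 1:]), dim=1)
print(critic_action.shape)                                       # torch.Size([1, 7])
```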
self.logger.add_scalar(tag='Task/test_return', 128 | scalar_value=(test_return / self.testing_episodes), global_step=self.cur_ep) 129 | self.logger.add_scalar(tag='Task/test_heightmap_loss', 130 | scalar_value=(test_heightmap_loss / self.testing_episodes), global_step=self.cur_ep) 131 | 132 | print("Episode %i" % ep, "test avg. return %0.1f" % (test_return / self.testing_episodes)) 133 | 134 | if (ep % self.saving_gap == 0) and (ep != 0) and (not test): 135 | self._save_network(ep=ep) 136 | 137 | if not test: 138 | print("Finished training") 139 | print("Saving statistics...") 140 | else: 141 | print("Finished testing") 142 | 143 | def _interact(self, render=False, test=False, sleep=0): 144 | done = False 145 | obs = self.env.reset() 146 | ep_return = 0 147 | new_episode = True 148 | # start a new episode 149 | while not done: 150 | if render: 151 | self.env.render() 152 | if self.total_env_step_count < self.warmup_step: 153 | if self.use_planned_skills: 154 | discrete_action = self.skill_plan[self.env.step_count] 155 | else: 156 | discrete_action = self.env.discrete_action_space.sample() 157 | continuous_action = self.env.continuous_action_space.sample() 158 | action = np.concatenate([[discrete_action], continuous_action], axis=0) 159 | else: 160 | action = self._select_action(obs, test=test) 161 | new_obs, reward, done, info = self.env.step(action) 162 | time.sleep(sleep) 163 | ep_return += reward 164 | 165 | next_skill = 0 166 | if self.planned_skills: 167 | try: 168 | next_skill = self.skill_plan[self.env.step_count] 169 | except: 170 | pass 171 | 172 | if not test: 173 | self._remember(obs['observation'], obs['desired_goal'], action, 174 | new_obs['observation'], new_obs['achieved_goal'], reward, 1 - int(done), next_skill, 175 | new_episode=new_episode) 176 | self.total_env_step_count += 1 177 | self._learn(steps=1) 178 | 179 | obs = new_obs 180 | new_episode = False 181 | 182 | return info 183 | 184 | def _select_action(self, obs, test=False): 185 | obs_points = T.as_tensor([obs['observation']], dtype=T.float).to(self.device) 186 | goal_points = T.as_tensor([obs['desired_goal']], dtype=T.float).to(self.device) 187 | obs_point_features = self.network_dict['critic_1_target'].get_features(obs_points.transpose(2, 1)) 188 | goal_point_features = self.network_dict['critic_1_target'].get_features(goal_points.transpose(2, 1)) 189 | inputs = T.cat((obs_point_features, goal_point_features), dim=1) 190 | if self.use_planned_skills: 191 | discrete_action = T.as_tensor([self.skill_plan[self.env.step_count]], dtype=T.long).to(self.device) 192 | else: 193 | discrete_action, _, _ = self.network_dict['discrete_actor'].get_action(inputs, greedy=test) 194 | discrete_action.type(T.long).flatten() 195 | discrete_action_onehot = F.one_hot(discrete_action, self.discrete_action_dim).float() 196 | inputs = T.cat((inputs, discrete_action_onehot), dim=1) 197 | continuous_action = self.network_dict['continuous_actor'].get_action(inputs, mean_pi=test).detach().cpu().numpy() 198 | return np.concatenate([discrete_action.detach().cpu().numpy(), continuous_action[0]], axis=0) 199 | 200 | def _learn(self, steps=None): 201 | if len(self.buffer) < self.batch_size: 202 | return 203 | if steps is None: 204 | steps = self.optimizer_steps 205 | 206 | avg_critic_1_loss = T.zeros(1, device=self.device) 207 | avg_critic_2_loss = T.zeros(1, device=self.device) 208 | avg_discrete_actor_loss = T.zeros(1, device=self.device) 209 | avg_discrete_alpha = T.zeros(1, device=self.device) 210 | avg_discrete_policy_entropy = T.zeros(1, 
device=self.device) 211 | avg_continuous_actor_loss = T.zeros(1, device=self.device) 212 | avg_continuous_alpha = T.zeros(1, device=self.device) 213 | avg_continuous_policy_entropy = T.zeros(1, device=self.device) 214 | for i in range(steps): 215 | if self.prioritised: 216 | batch, weights, inds = self.buffer.sample(self.batch_size) 217 | weights = T.as_tensor(weights, device=self.device).view(self.batch_size, 1) 218 | else: 219 | batch = self.buffer.sample(self.batch_size) 220 | weights = T.ones(size=(self.batch_size, 1), device=self.device) 221 | inds = None 222 | 223 | obs = T.as_tensor(batch.state, dtype=T.float32, device=self.device).transpose(2, 1) 224 | obs_features = self.network_dict['critic_1_target'].get_features(obs, detach=True) 225 | goal = T.as_tensor(batch.desired_goal, dtype=T.float32, device=self.device).transpose(2, 1) 226 | goal_features = self.network_dict['critic_1_target'].get_features(goal, detach=True) 227 | obs_ = T.as_tensor(batch.next_state, dtype=T.float32, device=self.device).transpose(2, 1) 228 | obs_features_ = self.network_dict['critic_1_target'].get_features(obs_, detach=True) 229 | actor_inputs_ = T.cat((obs_features_, goal_features), dim=1) 230 | actions = T.as_tensor(batch.action, dtype=T.float32, device=self.device) 231 | discrete_actions = actions[:, 0].type(T.long) 232 | discrete_actions_onehot = F.one_hot(discrete_actions, self.discrete_action_dim).float() 233 | actions = T.cat((discrete_actions_onehot, actions[:, 1:]), dim=1) 234 | rewards = T.as_tensor(np.array(batch.reward), dtype=T.float32, device=self.device).unsqueeze(1) 235 | done = T.as_tensor(np.array(batch.done), dtype=T.float32, device=self.device).unsqueeze(1) 236 | 237 | if self.discard_time_limit: 238 | done = done * 0 + 1 239 | 240 | with T.no_grad(): 241 | if not self.planned_skills: 242 | discrete_actions_, discrete_actions_log_probs_, _ = self.network_dict['discrete_actor'].get_action( 243 | actor_inputs_) 244 | discrete_actions_onehot_ = F.one_hot(discrete_actions_.flatten(), self.discrete_action_dim).float() 245 | else: 246 | discrete_actions_planned_ = T.as_tensor(batch.next_skill, dtype=T.long, device=self.device) 247 | discrete_actions_planned_onehot_ = F.one_hot(discrete_actions_planned_, self.discrete_action_dim).float() 248 | discrete_actions_onehot_ = discrete_actions_planned_onehot_ 249 | discrete_actions_log_probs_ = T.ones(size=(self.batch_size, 1), device=self.device, dtype=T.float32) 250 | 251 | actor_inputs_ = T.cat((actor_inputs_, discrete_actions_onehot_), dim=1) 252 | continuous_actions_, continuous_actions_log_probs_ = self.network_dict[ 253 | 'continuous_actor'].get_action(actor_inputs_, probs=True) 254 | actions_ = T.cat((discrete_actions_onehot_, continuous_actions_), dim=1) 255 | 256 | value_1_ = self.network_dict['critic_1_target'](obs_, actions_, goal) 257 | value_2_ = self.network_dict['critic_2_target'](obs_, actions_, goal) 258 | value_ = T.min(value_1_, value_2_) - \ 259 | (self.network_dict['alpha_discrete'] * discrete_actions_log_probs_) - \ 260 | (self.network_dict['alpha_continuous'] * continuous_actions_log_probs_) 261 | value_target = rewards + done * self.gamma * value_ 262 | # value_target = T.clamp(value_target, -self.clip_value, 0.0) 263 | 264 | self.critic_1_optimizer.zero_grad() 265 | value_estimate_1 = self.network_dict['critic_1'](obs, actions, goal) 266 | critic_loss_1 = F.mse_loss(value_estimate_1, value_target.detach(), reduction='none') 267 | (critic_loss_1 * weights).mean().backward() 268 | self.critic_1_optimizer.step() 269 | 270 | 
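The prioritised-replay branch that follows repeats the pattern used throughout this package: the critic loss is computed per sample (`reduction='none'`), scaled by the importance-sampling weights returned by `buffer.sample`, and the absolute per-sample errors are written back via `buffer.update_priority`. A sketch with dummy data (the real weights and indices come from `PrioritisedReplayBuffer`, which is not reproduced here):

```python
import numpy as np
import torch as T
import torch.nn.functional as F

# Dummy per-sample errors and importance-sampling weights standing in for what
# the prioritised buffer and the target computation would provide.
batch_size = 4
value_estimate = T.randn(batch_size, 1, requires_grad=True)
value_target = T.randn(batch_size, 1)
weights = T.rand(batch_size, 1)

per_sample_loss = F.mse_loss(value_estimate, value_target, reduction='none')
weighted_loss = (per_sample_loss * weights).mean()    # the quantity .backward() is called on
weighted_loss.backward()

# New priorities: absolute per-sample errors, as passed to buffer.update_priority(inds, ...)
new_priorities = np.abs(per_sample_loss.detach().numpy())
print(weighted_loss.item(), new_priorities.shape)     # scalar loss, (4, 1)
```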
if self.prioritised: 271 | assert inds is not None 272 | self.buffer.update_priority(inds, np.abs(critic_loss_1.cpu().detach().numpy())) 273 | 274 | self.critic_2_optimizer.zero_grad() 275 | value_estimate_2 = self.network_dict['critic_2'](obs, actions, goal) 276 | critic_loss_2 = F.mse_loss(value_estimate_2, value_target.detach(), reduction='none') 277 | (critic_loss_2 * weights).mean().backward() 278 | self.critic_2_optimizer.step() 279 | 280 | avg_critic_1_loss += critic_loss_1.detach().mean() 281 | avg_critic_2_loss += critic_loss_2.detach().mean() 282 | 283 | if self.optim_step_count % self.critic_target_update_interval == 0: 284 | self._soft_update(self.network_dict['critic_1'], self.network_dict['critic_1_target']) 285 | self._soft_update(self.network_dict['critic_2'], self.network_dict['critic_2_target']) 286 | 287 | if self.optim_step_count % self.actor_update_interval == 0: 288 | self.discrete_actor_optimizer.zero_grad() 289 | self.continuous_actor_optimizer.zero_grad() 290 | actor_inputs = T.cat((obs_features, goal_features), dim=1) 291 | if not self.planned_skills: 292 | new_discrete_actions, new_discrete_action_log_probs, new_discrete_action_entropy = \ 293 | self.network_dict['discrete_actor'].get_action(actor_inputs) 294 | new_discrete_actions_onehot = F.one_hot(new_discrete_actions.flatten(), self.discrete_action_dim).float() 295 | else: 296 | new_discrete_actions_onehot = discrete_actions_onehot 297 | 298 | new_continuous_actions, new_continuous_action_log_probs, new_continuous_action_entropy = \ 299 | self.network_dict['continuous_actor'].get_action( 300 | T.cat((actor_inputs, new_discrete_actions_onehot), dim=1), probs=True, entropy=True) 301 | new_actions = T.cat((new_discrete_actions_onehot, new_continuous_actions), dim=1) 302 | 303 | new_values = T.min(self.network_dict['critic_1'](obs, new_actions, goal), 304 | self.network_dict['critic_2'](obs, new_actions, goal)) 305 | 306 | if not self.planned_skills: 307 | discrete_actor_loss = ( 308 | self.network_dict['alpha_discrete'] * new_discrete_action_log_probs - new_values).mean() 309 | discrete_actor_loss.backward(retain_graph=True) 310 | self.discrete_actor_optimizer.step() 311 | 312 | self.alpha_discrete_optimizer.zero_grad() 313 | discrete_alpha_loss = (self.network_dict['log_alpha_discrete'] * ( 314 | -new_discrete_action_log_probs - self.target_discrete_entropy).detach()).mean() 315 | discrete_alpha_loss.backward() 316 | self.alpha_discrete_optimizer.step() 317 | self.network_dict['alpha_discrete'] = self.network_dict['log_alpha_discrete'].exp() 318 | 319 | avg_discrete_actor_loss += discrete_actor_loss.detach() 320 | avg_discrete_alpha += self.network_dict['alpha_discrete'].detach() 321 | avg_discrete_policy_entropy += new_discrete_action_entropy.detach().mean() 322 | 323 | continuous_actor_loss = ( 324 | self.network_dict['alpha_continuous'] * new_continuous_action_log_probs - new_values).mean() 325 | continuous_actor_loss.backward() 326 | self.continuous_actor_optimizer.step() 327 | 328 | self.alpha_continuous_optimizer.zero_grad() 329 | continuous_alpha_loss = (self.network_dict['log_alpha_continuous'] * ( 330 | -new_continuous_action_log_probs - self.target_continuous_entropy).detach()).mean() 331 | continuous_alpha_loss.backward() 332 | self.alpha_continuous_optimizer.step() 333 | self.network_dict['alpha_continuous'] = self.network_dict['log_alpha_continuous'].exp() 334 | 335 | avg_continuous_actor_loss += continuous_actor_loss.detach() 336 | avg_continuous_alpha += 
self.network_dict['alpha_continuous'].detach() 337 | avg_continuous_policy_entropy += new_continuous_action_entropy.detach().mean() 338 | 339 | self.optim_step_count += 1 340 | 341 | self.logger.add_scalar(tag='Critic/critic_1_loss', scalar_value=avg_critic_1_loss / steps, global_step=self.cur_ep) 342 | self.logger.add_scalar(tag='Critic/critic_2_loss', scalar_value=avg_critic_2_loss / steps, global_step=self.cur_ep) 343 | if not self.planned_skills: 344 | self.logger.add_scalar(tag='Actor/discrete_actor_loss', scalar_value=avg_discrete_actor_loss / steps, global_step=self.cur_ep) 345 | self.logger.add_scalar(tag='Actor/discrete_alpha', scalar_value=avg_discrete_alpha / steps, global_step=self.cur_ep) 346 | self.logger.add_scalar(tag='Actor/discrete_policy_entropy', scalar_value=avg_discrete_policy_entropy / steps, 347 | global_step=self.cur_ep) 348 | self.logger.add_scalar(tag='Actor/continuous_actor_loss', scalar_value=avg_continuous_actor_loss / steps, global_step=self.cur_ep) 349 | self.logger.add_scalar(tag='Actor/continuous_alpha', scalar_value=avg_continuous_alpha / steps, global_step=self.cur_ep) 350 | self.logger.add_scalar(tag='Actor/continuous_policy_entropy', scalar_value=avg_continuous_policy_entropy / steps, 351 | global_step=self.cur_ep) 352 | -------------------------------------------------------------------------------- /drl_implementation/agent/continuous_action/sac_pointnet.py: -------------------------------------------------------------------------------- 1 | import time 2 | import numpy as np 3 | import torch as T 4 | import torch.nn.functional as F 5 | from torch.optim.adam import Adam 6 | from ..utils.networks_mlp import StochasticActor 7 | from ..utils.networks_pointnet import CriticPointNet 8 | from ..agent_base import Agent 9 | from ..utils.exploration_strategy import GaussianNoise 10 | from collections import namedtuple 11 | 12 | 13 | class PointnetSAC(Agent): 14 | def __init__(self, algo_params, env, logging=None, transition_tuple=None, path=None, seed=-1): 15 | # environment 16 | self.env = env 17 | self.env.seed(seed) 18 | obs = self.env.reset() 19 | algo_params.update({'state_shape': obs['observation'].shape, 20 | 'goal_shape': obs['desired_goal'].shape, 21 | 'action_dim': self.env.action_space.shape[0], 22 | 'action_max': self.env.action_space.high, 23 | 'action_scaling': self.env.action_space.high[0], 24 | }) 25 | self.onestep = True if self.env.horizon == 1 else False 26 | # training args 27 | self.cur_ep = 0 28 | self.warmup_step = algo_params['warmup_step'] 29 | self.training_episodes = algo_params['training_episodes'] 30 | self.testing_gap = algo_params['testing_gap'] 31 | self.testing_episodes = algo_params['testing_episodes'] 32 | self.saving_gap = algo_params['saving_gap'] 33 | if transition_tuple is None: 34 | transition_tuple = namedtuple('transition', 35 | ['state', 'desired_goal', 'action', 'achieved_goal', 'reward']) 36 | super(PointnetSAC, self).__init__(algo_params, non_flat_obs=True, 37 | action_type='continuous', 38 | transition_tuple=transition_tuple, 39 | goal_conditioned=True, 40 | path=path, 41 | seed=seed, 42 | logging=logging, 43 | create_logger=True) 44 | # torch 45 | self.network_dict.update({ 46 | 'actor': StochasticActor(2048, self.action_dim, 47 | fc1_size=1024, log_std_min=-6, log_std_max=1, 48 | action_scaling=self.action_scaling).to(self.device), 49 | 'critic_1': CriticPointNet(output_dim=1, action_dim=self.action_dim).to(self.device), 50 | 'critic_2': CriticPointNet(output_dim=1, action_dim=self.action_dim).to(self.device), 
51 | 'critic_target': CriticPointNet(output_dim=1, action_dim=self.action_dim).to(self.device), 52 | 'alpha': algo_params['alpha'], 53 | 'log_alpha': T.tensor(np.log(algo_params['alpha']), requires_grad=True, device=self.device), 54 | }) 55 | self.network_dict['critic_target'].eval() 56 | self._soft_update(self.network_dict['critic_1'], self.network_dict['critic_target'], tau=1) 57 | if not self.onestep: 58 | self.network_dict.update( 59 | {'critic_target_2': CriticPointNet(output_dim=1, action_dim=self.action_dim).to(self.device)}) 60 | self.network_dict['critic_target_2'].eval() 61 | self._soft_update(self.network_dict['critic_2'], self.network_dict['critic_target_2'], tau=1) 62 | 63 | self.network_keys_to_save = ['actor', 'critic_1'] 64 | self.actor_optimizer = Adam(self.network_dict['actor'].parameters(), lr=self.actor_learning_rate) 65 | self.critic_1_optimizer = Adam(self.network_dict['critic_1'].parameters(), lr=self.critic_learning_rate) 66 | self.critic_2_optimizer = Adam(self.network_dict['critic_2'].parameters(), lr=self.critic_learning_rate) 67 | self.target_entropy = -self.action_dim 68 | self.alpha_optimizer = Adam([self.network_dict['log_alpha']], lr=self.actor_learning_rate) 69 | # training args 70 | self.actor_update_interval = algo_params['actor_update_interval'] 71 | self.use_demonstrations = algo_params['use_demonstrations'] 72 | self.demonstrate_percentage = algo_params['demonstrate_percentage'] 73 | assert 0 < self.demonstrate_percentage < 1, "Demonstrate percentage should be between 0 and 1" 74 | self.n_demonstrate_episodes = int(self.demonstrate_percentage * self.training_episodes) 75 | self.demonstration_action = np.asarray(algo_params['demonstration_action'], dtype=np.float32) 76 | self.gaussian_noise = GaussianNoise(action_dim=self.action_dim, action_max=self.action_max, 77 | sigma=0.1, rng=self.rng) 78 | 79 | def run(self, test=False, render=False, load_network_ep=None, sleep=0, get_action=False): 80 | if test: 81 | num_episode = self.testing_episodes 82 | if load_network_ep is not None: 83 | print("Loading network parameters...") 84 | self._load_network(ep=load_network_ep) 85 | print("Start testing...") 86 | if get_action: 87 | obs = self.env.reset() 88 | action = self._select_action(obs, test=True) 89 | return action 90 | else: 91 | num_episode = self.training_episodes 92 | print("Start training...") 93 | self.logging.info("Start training...") 94 | 95 | for ep in range(num_episode): 96 | self.cur_ep = ep 97 | loss_info = self._interact(render, test, sleep=sleep) 98 | print("Episode %i" % ep) 99 | self.logging.info("Episode %i" % ep) 100 | print("emd loss %0.1f" % loss_info['emd_loss']) 101 | self.logging.info("emd loss %0.1f" % loss_info['emd_loss']) 102 | self.logger.add_scalar(tag='Task/emd_loss', scalar_value=loss_info['emd_loss'], global_step=ep) 103 | try: 104 | print("heightmap loss %0.1f" % loss_info['height_map_loss']) 105 | self.logger.add_scalar(tag='Task/heightmap_loss', scalar_value=loss_info['height_map_loss'], global_step=ep) 106 | self.logging.info("heightmap loss %0.1f" % loss_info['height_map_loss']) 107 | except: 108 | pass 109 | GPU_memory = self.get_gpu_memory() 110 | self.logger.add_scalar(tag='System/Free GPU memory', scalar_value=GPU_memory[0], global_step=ep) 111 | try: 112 | self.logger.add_scalar(tag='System/Used GPU memory', scalar_value=GPU_memory[1], global_step=ep) 113 | except: 114 | pass 115 | if not test and self.hindsight: 116 | self.buffer.hindsight() 117 | 118 | if (ep % self.testing_gap == 0) and (ep != 0) and (not test): 
119 | ep_test_emd_loss = [] 120 | ep_test_heightmap_loss = [] 121 | for test_ep in range(self.testing_episodes): 122 | loss_info = self._interact(render, test=True) 123 | self.cur_ep += 1 124 | ep_test_emd_loss.append(loss_info['emd_loss']) 125 | try: 126 | ep_test_heightmap_loss.append(loss_info['height_map_loss']) 127 | except: 128 | pass 129 | self.logger.add_scalar(tag='Task/test_emd_loss', 130 | scalar_value=(sum(ep_test_emd_loss) / self.testing_episodes), global_step=ep) 131 | print("Episode %i" % ep) 132 | print("test emd loss %0.1f" % (sum(ep_test_emd_loss) / self.testing_episodes)) 133 | self.logging.info("Episode %i" % ep) 134 | self.logging.info("test emd loss %0.1f" % (sum(ep_test_emd_loss) / self.testing_episodes)) 135 | 136 | if len(ep_test_heightmap_loss) > 0: 137 | self.logger.add_scalar(tag='Task/test_heightmap_loss', 138 | scalar_value=(sum(ep_test_heightmap_loss) / self.testing_episodes), 139 | global_step=ep) 140 | print("test heightmap loss %0.1f" % (sum(ep_test_heightmap_loss) / self.testing_episodes)) 141 | self.logging.info("test heightmap loss %0.1f" % (sum(ep_test_heightmap_loss) / self.testing_episodes)) 142 | 143 | if (ep % self.saving_gap == 0) and (ep != 0) and (not test): 144 | self._save_network(ep=ep) 145 | 146 | if not test: 147 | print("Finished training") 148 | self.logging.info("Finished training") 149 | else: 150 | print("Finished testing") 151 | 152 | def _interact(self, render=False, test=False, sleep=0): 153 | obs = self.env.reset() 154 | if render: 155 | self.env.render() 156 | if self.onestep: 157 | # An episode has only one step 158 | if self.use_demonstrations and (self.cur_ep < self.n_demonstrate_episodes): 159 | action = self.gaussian_noise(self.demonstration_action) 160 | else: 161 | action = self._select_action(obs, test=test) 162 | obs_, reward, _, info = self.env.step(action) 163 | time.sleep(sleep) 164 | 165 | if not test: 166 | self._remember(obs['observation'], obs['desired_goal'], action, obs_['achieved_goal'], reward, 167 | new_episode=True) 168 | if self.total_env_step_count % self.update_interval == 0: 169 | self._learn() 170 | self.total_env_step_count += 1 171 | else: 172 | n = 0 173 | done = False 174 | new_episode = True 175 | while not done: 176 | if self.use_demonstrations and (self.cur_ep < self.n_demonstrate_episodes): 177 | try: 178 | action, object_out_of_view, demon_info = self.env.get_cur_demonstration() 179 | except: 180 | action = self.gaussian_noise(self.demonstration_action[n]) 181 | else: 182 | action = self._select_action(obs, test=test) 183 | obs_, reward, done, info = self.env.step(action) 184 | time.sleep(sleep) 185 | 186 | if not test: 187 | self._remember(obs['observation'], obs['desired_goal'], action, obs_['achieved_goal'], reward, 188 | new_episode=new_episode) 189 | if self.total_env_step_count % self.update_interval == 0: 190 | self._learn() 191 | self.total_env_step_count += 1 192 | 193 | new_episode = False 194 | 195 | return info 196 | 197 | def _select_action(self, obs, test=False): 198 | obs_points = T.as_tensor([obs['observation']], dtype=T.float).to(self.device) 199 | goal_points = T.as_tensor([obs['desired_goal']], dtype=T.float).to(self.device) 200 | obs_point_features = self.network_dict['critic_target'].get_features(obs_points.transpose(2, 1)) 201 | goal_point_features = self.network_dict['critic_target'].get_features(goal_points.transpose(2, 1)) 202 | inputs = T.cat((obs_point_features, goal_point_features), dim=1) 203 | action = self.network_dict['actor'].get_action(inputs, 
mean_pi=test).detach().cpu().numpy() 204 | return action[0] 205 | 206 | def _learn(self, steps=None): 207 | if len(self.buffer) < self.batch_size: 208 | return 209 | if steps is None: 210 | steps = self.optimizer_steps 211 | 212 | avg_critic_1_loss = T.zeros(1, device=self.device) 213 | avg_critic_2_loss = T.zeros(1, device=self.device) 214 | avg_actor_loss = T.zeros(1, device=self.device) 215 | avg_alpha = T.zeros(1, device=self.device) 216 | avg_policy_entropy = T.zeros(1, device=self.device) 217 | for i in range(steps): 218 | if self.prioritised: 219 | batch, weights, inds = self.buffer.sample(self.batch_size) 220 | weights = T.tensor(weights).view(self.batch_size, 1).to(self.device) 221 | else: 222 | batch = self.buffer.sample(self.batch_size) 223 | weights = T.ones(size=(self.batch_size, 1)).to(self.device) 224 | inds = None 225 | 226 | obs = T.as_tensor(batch.state, dtype=T.float32, device=self.device).transpose(2, 1) 227 | goal = T.as_tensor(batch.desired_goal, dtype=T.float32, device=self.device).transpose(2, 1) 228 | actions = T.as_tensor(batch.action, dtype=T.float32, device=self.device) 229 | rewards = T.as_tensor(batch.reward, dtype=T.float32, device=self.device).unsqueeze(1) 230 | if self.onestep: 231 | values_target = rewards 232 | else: 233 | with T.no_grad(): 234 | obs_ = T.as_tensor(batch.next_state, dtype=T.float32, device=self.device).transpose(2, 1) 235 | obs_features_ = self.network_dict['critic_1'].get_features(obs_, detach=True) 236 | goal_features = self.network_dict['critic_1'].get_features(goal, detach=True) 237 | actor_inputs = T.cat((obs_features_, goal_features), dim=1) 238 | new_actions = self.network_dict['actor'].get_action(actor_inputs) 239 | values_1 = self.network_dict['critic_target'](obs_, new_actions, goal) 240 | values_2 = self.network_dict['critic_target_2'](obs_, new_actions, goal) 241 | values_target = rewards + self.gamma * T.min(values_1, values_2) 242 | 243 | self.critic_1_optimizer.zero_grad() 244 | value_estimate_1 = self.network_dict['critic_1'](obs, actions, goal) 245 | critic_loss_1 = F.mse_loss(value_estimate_1, values_target, reduction='none') 246 | (critic_loss_1 * weights).mean().backward() 247 | self.critic_1_optimizer.step() 248 | 249 | if self.prioritised: 250 | assert inds is not None 251 | self.buffer.update_priority(inds, np.abs(critic_loss_1.cpu().detach().numpy())) 252 | 253 | self.critic_2_optimizer.zero_grad() 254 | value_estimate_2 = self.network_dict['critic_2'](obs, actions, goal) 255 | critic_loss_2 = F.mse_loss(value_estimate_2, values_target, reduction='none') 256 | (critic_loss_2 * weights).mean().backward() 257 | self.critic_2_optimizer.step() 258 | 259 | avg_critic_1_loss += critic_loss_1.detach().mean() 260 | avg_critic_2_loss += critic_loss_2.detach().mean() 261 | 262 | if self.optim_step_count % self.actor_update_interval == 0: 263 | self.actor_optimizer.zero_grad() 264 | obs_features = self.network_dict['critic_1'].get_features(obs, detach=True) 265 | goal_features = self.network_dict['critic_1'].get_features(goal, detach=True) 266 | actor_inputs = T.cat((obs_features, goal_features), dim=1) 267 | new_actions, new_log_probs, new_entropy = self.network_dict['actor'].get_action(actor_inputs, 268 | probs=True, 269 | entropy=True) 270 | new_values = T.min(self.network_dict['critic_1'](obs, new_actions, goal), 271 | self.network_dict['critic_2'](obs, new_actions, goal)) 272 | actor_loss = (self.network_dict['alpha'] * new_log_probs - new_values).mean() 273 | actor_loss.backward() 274 | self.actor_optimizer.step() 275 | 
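The temperature update that follows (shared by all SAC variants in this package) performs gradient descent on `log_alpha` so that `alpha` stays positive, pushing the policy entropy towards `target_entropy = -action_dim`. A tiny self-contained sketch with dummy log-probabilities and a dummy learning rate (illustration only, not code from this file):

```python
import torch as T
from torch.optim.adam import Adam

action_dim = 3
target_entropy = -float(action_dim)            # the heuristic used by these agents
log_alpha = T.tensor(0.0, requires_grad=True)  # alpha starts at exp(0) = 1.0
alpha_optimizer = Adam([log_alpha], lr=3e-4)   # dummy learning rate

log_probs = T.randn(8, 1)                      # dummy log pi(a|s) for a sampled batch
alpha_loss = (log_alpha * (-log_probs - target_entropy).detach()).mean()

alpha_optimizer.zero_grad()
alpha_loss.backward()
alpha_optimizer.step()
alpha = log_alpha.exp()                        # temperature used by the next actor/critic update
print(alpha.item())
```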
276 | self.alpha_optimizer.zero_grad() 277 | alpha_loss = (self.network_dict['log_alpha'] * (-new_log_probs - self.target_entropy).detach()).mean() 278 | alpha_loss.backward() 279 | self.alpha_optimizer.step() 280 | self.network_dict['alpha'] = self.network_dict['log_alpha'].exp() 281 | 282 | avg_actor_loss += actor_loss.detach().mean() 283 | avg_alpha += self.network_dict['alpha'].detach() 284 | avg_policy_entropy += new_entropy.detach().mean() 285 | 286 | self.optim_step_count += 1 287 | 288 | if self.onestep: 289 | self._soft_update(self.network_dict['critic_1'], self.network_dict['critic_target'], tau=1) 290 | else: 291 | self._soft_update(self.network_dict['critic_1'], self.network_dict['critic_target'], tau=self.tau) 292 | self._soft_update(self.network_dict['critic_2'], self.network_dict['critic_target_2'], tau=self.tau) 293 | 294 | self.logger.add_scalar(tag='Critic/critic_1_loss', scalar_value=avg_critic_1_loss / steps, 295 | global_step=self.cur_ep) 296 | self.logger.add_scalar(tag='Critic/critic_2_loss', scalar_value=avg_critic_2_loss / steps, 297 | global_step=self.cur_ep) 298 | self.logger.add_scalar(tag='Actor/actor_loss', scalar_value=avg_actor_loss / steps, global_step=self.cur_ep) 299 | self.logger.add_scalar(tag='Actor/alpha', scalar_value=avg_alpha / steps, global_step=self.cur_ep) 300 | self.logger.add_scalar(tag='Actor/policy_entropy', scalar_value=avg_policy_entropy / steps, 301 | global_step=self.cur_ep) 302 | -------------------------------------------------------------------------------- /drl_implementation/agent/continuous_action/td3.py: -------------------------------------------------------------------------------- 1 | import time 2 | import numpy as np 3 | import torch as T 4 | import torch.nn.functional as F 5 | from torch.optim.adam import Adam 6 | from ..utils.networks_mlp import Actor, Critic 7 | from ..agent_base import Agent 8 | from ..utils.exploration_strategy import GaussianNoise 9 | 10 | 11 | class TD3(Agent): 12 | def __init__(self, algo_params, env, transition_tuple=None, path=None, seed=-1): 13 | # environment 14 | self.env = env 15 | self.env.seed(seed) 16 | obs = self.env.reset() 17 | algo_params.update({'state_dim': obs.shape[0], 18 | 'action_dim': self.env.action_space.shape[0], 19 | 'action_max': self.env.action_space.high, 20 | 'action_scaling': self.env.action_space.high[0], 21 | 'init_input_means': None, 22 | 'init_input_vars': None 23 | }) 24 | # training args 25 | self.training_episodes = algo_params['training_episodes'] 26 | self.testing_gap = algo_params['testing_gap'] 27 | self.testing_episodes = algo_params['testing_episodes'] 28 | self.saving_gap = algo_params['saving_gap'] 29 | 30 | super(TD3, self).__init__(algo_params, 31 | transition_tuple=transition_tuple, 32 | goal_conditioned=False, 33 | path=path, 34 | seed=seed) 35 | # torch 36 | self.network_dict.update({ 37 | 'actor': Actor(self.state_dim, self.action_dim, action_scaling=self.action_scaling).to(self.device), 38 | 'actor_target': Actor(self.state_dim, self.action_dim, action_scaling=self.action_scaling).to(self.device), 39 | 'critic_1': Critic(self.state_dim + self.action_dim, 1).to(self.device), 40 | 'critic_1_target': Critic(self.state_dim + self.action_dim, 1).to(self.device), 41 | 'critic_2': Critic(self.state_dim + self.action_dim, 1).to(self.device), 42 | 'critic_2_target': Critic(self.state_dim + self.action_dim, 1).to(self.device) 43 | }) 44 | self.network_keys_to_save = ['actor_target', 'critic_1_target'] 45 | self.actor_optimizer = 
Adam(self.network_dict['actor'].parameters(), lr=self.actor_learning_rate) 46 | self._soft_update(self.network_dict['actor'], self.network_dict['actor_target'], tau=1) 47 | self.critic_1_optimizer = Adam(self.network_dict['critic_1'].parameters(), lr=self.critic_learning_rate) 48 | self._soft_update(self.network_dict['critic_1'], self.network_dict['critic_1_target'], tau=1) 49 | self.critic_2_optimizer = Adam(self.network_dict['critic_2'].parameters(), lr=self.critic_learning_rate) 50 | self._soft_update(self.network_dict['critic_2'], self.network_dict['critic_2_target'], tau=1) 51 | # behavioural policy args (exploration) 52 | self.exploration_strategy = GaussianNoise(self.action_dim, self.action_max, mu=0, sigma=0.1) 53 | # training args 54 | self.target_noise = algo_params['target_noise'] 55 | self.noise_clip = algo_params['noise_clip'] 56 | self.warmup_step = algo_params['warmup_step'] 57 | self.actor_update_interval = algo_params['actor_update_interval'] 58 | # statistic dict 59 | self.statistic_dict.update({ 60 | 'episode_return': [], 61 | 'episode_test_return': [] 62 | }) 63 | 64 | def run(self, test=False, render=False, load_network_ep=None, sleep=0): 65 | if test: 66 | num_episode = self.testing_episodes 67 | if load_network_ep is not None: 68 | print("Loading network parameters...") 69 | self._load_network(ep=load_network_ep) 70 | print("Start testing...") 71 | else: 72 | num_episode = self.training_episodes 73 | print("Start training...") 74 | 75 | for ep in range(num_episode): 76 | ep_return = self._interact(render, test, sleep=sleep) 77 | self.statistic_dict['episode_return'].append(ep_return) 78 | print("Episode %i" % ep, "return %0.1f" % ep_return) 79 | 80 | if (ep % self.testing_gap == 0) and (ep != 0) and (not test): 81 | ep_test_return = [] 82 | for test_ep in range(self.testing_episodes): 83 | ep_test_return.append(self._interact(render, test=True)) 84 | self.statistic_dict['episode_test_return'].append(sum(ep_test_return)/self.testing_episodes) 85 | print("Episode %i" % ep, "test return %0.1f" % (sum(ep_test_return)/self.testing_episodes)) 86 | 87 | if (ep % self.saving_gap == 0) and (ep != 0) and (not test): 88 | self._save_network(ep=ep) 89 | 90 | if not test: 91 | print("Finished training") 92 | print("Saving statistics...") 93 | self._plot_statistics(save_to_file=True) 94 | else: 95 | print("Finished testing") 96 | 97 | def _interact(self, render=False, test=False, sleep=0): 98 | done = False 99 | obs = self.env.reset() 100 | ep_return = 0 101 | # start a new episode 102 | while not done: 103 | if render: 104 | self.env.render() 105 | if self.env_step_count < self.warmup_step: 106 | action = self.env.action_space.sample() 107 | else: 108 | action = self._select_action(obs, test=test) 109 | new_obs, reward, done, info = self.env.step(action) 110 | time.sleep(sleep) 111 | ep_return += reward 112 | if not test: 113 | self._remember(obs, action, new_obs, reward, 1 - int(done)) 114 | if self.observation_normalization: 115 | self.normalizer.store_history(new_obs) 116 | self.normalizer.update_mean() 117 | if (self.env_step_count % self.update_interval == 0) and (self.env_step_count > self.warmup_step): 118 | self._learn() 119 | obs = new_obs 120 | self.env_step_count += 1 121 | return ep_return 122 | 123 | def _select_action(self, obs, test=False): 124 | obs = self.normalizer(obs) 125 | with T.no_grad(): 126 | inputs = T.as_tensor(obs, dtype=T.float, device=self.device) 127 | action = self.network_dict['actor_target'](inputs).detach().cpu().numpy() 128 | if test: 129 | # 
evaluate 130 | return np.clip(action, -self.action_max, self.action_max) 131 | else: 132 | # explore 133 | return self.exploration_strategy(action) 134 | 135 | def _learn(self, steps=None): 136 | if len(self.buffer) < self.batch_size: 137 | return 138 | if steps is None: 139 | steps = self.optimizer_steps 140 | 141 | for i in range(steps): 142 | if self.prioritised: 143 | batch, weights, inds = self.buffer.sample(self.batch_size) 144 | weights = T.as_tensor(weights, device=self.device).view(self.batch_size, 1) 145 | else: 146 | batch = self.buffer.sample(self.batch_size) 147 | weights = T.ones(size=(self.batch_size, 1), device=self.device) 148 | inds = None 149 | 150 | actor_inputs = self.normalizer(batch.state) 151 | actor_inputs = T.as_tensor(actor_inputs, dtype=T.float32, device=self.device) 152 | actions = T.as_tensor(batch.action, dtype=T.float32, device=self.device) 153 | critic_inputs = T.cat((actor_inputs, actions), dim=1) 154 | actor_inputs_ = self.normalizer(batch.next_state) 155 | actor_inputs_ = T.as_tensor(actor_inputs_, dtype=T.float32, device=self.device) 156 | rewards = T.as_tensor(batch.reward, dtype=T.float32, device=self.device).unsqueeze(1) 157 | done = T.as_tensor(batch.done, dtype=T.float32, device=self.device).unsqueeze(1) 158 | 159 | if self.discard_time_limit: 160 | done = done * 0 + 1 161 | 162 | with T.no_grad(): 163 | actions_ = self.network_dict['actor_target'](actor_inputs_) 164 | # add noise 165 | noise = (T.randn_like(actions_, device=self.device) * self.target_noise) 166 | actions_ += noise.clamp(-self.noise_clip, self.noise_clip) 167 | actions_ = actions_.clamp(-self.action_max[0], self.action_max[0]) 168 | critic_inputs_ = T.cat((actor_inputs_, actions_), dim=1) 169 | value_1_ = self.network_dict['critic_1_target'](critic_inputs_) 170 | value_2_ = self.network_dict['critic_2_target'](critic_inputs_) 171 | value_ = T.min(value_1_, value_2_) 172 | value_target = rewards + done * self.gamma * value_ 173 | 174 | self.critic_1_optimizer.zero_grad() 175 | value_estimate_1 = self.network_dict['critic_1'](critic_inputs) 176 | critic_loss_1 = F.mse_loss(value_estimate_1, value_target.detach(), reduction='none') 177 | (critic_loss_1 * weights).mean().backward() 178 | self.critic_1_optimizer.step() 179 | 180 | if self.prioritised: 181 | assert inds is not None 182 | self.buffer.update_priority(inds, np.abs(critic_loss_1.cpu().detach().numpy())) 183 | 184 | self.critic_2_optimizer.zero_grad() 185 | value_estimate_2 = self.network_dict['critic_2'](critic_inputs) 186 | critic_loss_2 = F.mse_loss(value_estimate_2, value_target.detach(), reduction='none') 187 | (critic_loss_2 * weights).mean().backward() 188 | self.critic_2_optimizer.step() 189 | 190 | self.statistic_dict['critic_loss'].append(critic_loss_1.detach().mean()) 191 | 192 | if self.optim_step_count % self.actor_update_interval == 0: 193 | self.actor_optimizer.zero_grad() 194 | new_actions = self.network_dict['actor'](actor_inputs) 195 | critic_eval_inputs = T.cat((actor_inputs, new_actions), dim=1) 196 | actor_loss = -self.network_dict['critic_1'](critic_eval_inputs).mean() 197 | actor_loss.backward() 198 | self.actor_optimizer.step() 199 | 200 | self._soft_update(self.network_dict['actor'], self.network_dict['actor_target']) 201 | self._soft_update(self.network_dict['critic_1'], self.network_dict['critic_1_target']) 202 | self._soft_update(self.network_dict['critic_2'], self.network_dict['critic_2_target']) 203 | 204 | self.statistic_dict['actor_loss'].append(actor_loss.detach().mean()) 205 | 206 | 
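The `no_grad` block above implements TD3's target-policy smoothing: Gaussian noise scaled by `target_noise` is added to the target actor's action, clipped to `[-noise_clip, noise_clip]`, and the result is clipped again to the action bounds before being evaluated by the target critics. A minimal sketch with assumed values (`target_noise=0.2`, `noise_clip=0.5`, actions in `[-1, 1]`; not the hyperparameters of this repository):

```python
import torch as T

target_noise, noise_clip, action_max = 0.2, 0.5, 1.0    # assumed hyperparameters
target_actions = T.tensor([[0.9, -0.3], [-0.95, 0.1]])  # dummy actor_target(s') outputs

noise = (T.randn_like(target_actions) * target_noise).clamp(-noise_clip, noise_clip)
smoothed = (target_actions + noise).clamp(-action_max, action_max)
print(smoothed)  # noisy target actions, still inside the action bounds
```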
self.optim_step_count += 1 207 | -------------------------------------------------------------------------------- /drl_implementation/agent/distributed_agent_base.py: -------------------------------------------------------------------------------- 1 | import os 2 | import time 3 | import torch as T 4 | import numpy as np 5 | import json 6 | import queue 7 | import importlib 8 | import multiprocessing as mp 9 | from collections import namedtuple 10 | from .utils.plot import smoothed_plot 11 | from .utils.replay_buffer import ReplayBuffer, PrioritisedReplayBuffer 12 | from .utils.normalizer import Normalizer 13 | # T.multiprocessing.set_start_method('spawn') 14 | t = namedtuple("transition", ('state', 'action', 'next_state', 'reward', 'done')) 15 | 16 | 17 | def mkdir(paths): 18 | for path in paths: 19 | os.makedirs(path, exist_ok=True) 20 | 21 | 22 | class Agent(object): 23 | def __init__(self, algo_params, image_obs=False, action_type='continuous', path=None, seed=-1): 24 | # torch device 25 | self.device = T.device("cuda" if T.cuda.is_available() else "cpu") 26 | if 'cuda_device_id' in algo_params.keys(): 27 | self.device = T.device("cuda:%i" % algo_params['cuda_device_id']) 28 | # path & seeding 29 | T.manual_seed(seed) 30 | T.cuda.manual_seed_all(seed) # this has no effect if cuda is not available 31 | 32 | assert path is not None, 'please specify a project path to save files' 33 | self.path = path 34 | # path to save neural network check point 35 | self.ckpt_path = os.path.join(path, 'ckpts') 36 | # path to save statistics 37 | self.data_path = os.path.join(path, 'data') 38 | # create directories if not exist 39 | mkdir([self.path, self.ckpt_path, self.data_path]) 40 | 41 | # non-goal-conditioned args 42 | self.image_obs = image_obs 43 | self.action_type = action_type 44 | if self.image_obs: 45 | self.state_dim = 0 46 | self.state_shape = algo_params['state_shape'] 47 | else: 48 | self.state_dim = algo_params['state_dim'] 49 | self.action_dim = algo_params['action_dim'] 50 | if self.action_type == 'continuous': 51 | self.action_max = algo_params['action_max'] 52 | self.action_scaling = algo_params['action_scaling'] 53 | 54 | # common args 55 | if not self.image_obs: 56 | # todo: observation in distributed training should be synced as well 57 | self.observation_normalization = algo_params['observation_normalization'] 58 | # if not using obs normalization, the normalizer is just a scale multiplier, returns inputs*scale 59 | self.normalizer = Normalizer(self.state_dim, 60 | algo_params['init_input_means'], algo_params['init_input_vars'], 61 | activated=self.observation_normalization) 62 | 63 | self.gamma = algo_params['discount_factor'] 64 | self.tau = algo_params['tau'] 65 | 66 | # network dict is filled in each specific agent 67 | self.network_dict = {} 68 | self.network_keys_to_save = None 69 | 70 | # algorithm-specific statistics are defined in each agent sub-class 71 | self.statistic_dict = { 72 | # use lowercase characters 73 | 'actor_loss': [], 74 | 'critic_loss': [], 75 | } 76 | 77 | def _soft_update(self, source, target, tau=None, from_params=False): 78 | if tau is None: 79 | tau = self.tau 80 | 81 | if not from_params: 82 | for target_param, param in zip(target.parameters(), source.parameters()): 83 | target_param.data.copy_( 84 | target_param.data * (1.0 - tau) + param.data * tau 85 | ) 86 | else: 87 | for target_param, param in zip(target.parameters(), source): 88 | target_param.data.copy_( 89 | target_param.data * (1.0 - tau) + T.tensor(param).float().to(self.device) * tau 90 
| ) 91 | 92 | def _save_network(self, keys=None, ep=None): 93 | if ep is None: 94 | ep = '' 95 | else: 96 | ep = '_ep' + str(ep) 97 | if keys is None: 98 | keys = self.network_keys_to_save 99 | assert keys is not None 100 | for key in keys: 101 | T.save(self.network_dict[key].state_dict(), self.ckpt_path + '/ckpt_' + key + ep + '.pt') 102 | 103 | def _load_network(self, keys=None, ep=None): 104 | if not self.image_obs: 105 | self.normalizer.history_mean = np.load(os.path.join(self.data_path, 'input_means.npy')) 106 | self.normalizer.history_var = np.load(os.path.join(self.data_path, 'input_vars.npy')) 107 | if ep is None: 108 | ep = '' 109 | else: 110 | ep = '_ep' + str(ep) 111 | if keys is None: 112 | keys = self.network_keys_to_save 113 | assert keys is not None 114 | for key in keys: 115 | self.network_dict[key].load_state_dict(T.load(self.ckpt_path + '/ckpt_' + key + ep + '.pt')) 116 | 117 | def _save_statistics(self, keys=None): 118 | if not self.image_obs: 119 | np.save(os.path.join(self.data_path, 'input_means'), self.normalizer.history_mean) 120 | np.save(os.path.join(self.data_path, 'input_vars'), self.normalizer.history_var) 121 | if keys is None: 122 | keys = self.statistic_dict.keys() 123 | for key in keys: 124 | if len(self.statistic_dict[key]) == 0: 125 | continue 126 | # convert everything to a list before save via json 127 | if T.is_tensor(self.statistic_dict[key][0]): 128 | self.statistic_dict[key] = T.as_tensor(self.statistic_dict[key], device=self.device).cpu().numpy().tolist() 129 | else: 130 | self.statistic_dict[key] = np.array(self.statistic_dict[key]).tolist() 131 | json.dump(self.statistic_dict[key], open(os.path.join(self.data_path, key+'.json'), 'w')) 132 | 133 | def _plot_statistics(self, keys=None, x_labels=None, y_labels=None, window=5, save_to_file=True): 134 | if save_to_file: 135 | self._save_statistics(keys=keys) 136 | if y_labels is None: 137 | y_labels = {} 138 | for key in list(self.statistic_dict.keys()): 139 | if key not in y_labels.keys(): 140 | if 'loss' in key: 141 | label = 'Loss' 142 | elif 'return' in key: 143 | label = 'Return' 144 | elif 'success' in key: 145 | label = 'Success' 146 | else: 147 | label = key 148 | y_labels.update({key: label}) 149 | 150 | if x_labels is None: 151 | x_labels = {} 152 | for key in list(self.statistic_dict.keys()): 153 | if key not in x_labels.keys(): 154 | if ('loss' in key) or ('alpha' in key) or ('entropy' in key) or ('step' in key): 155 | label = 'Optimization step' 156 | elif 'cycle' in key: 157 | label = 'Cycle' 158 | elif 'epoch' in key: 159 | label = 'Epoch' 160 | else: 161 | label = 'Episode' 162 | x_labels.update({key: label}) 163 | 164 | if keys is None: 165 | for key in list(self.statistic_dict.keys()): 166 | smoothed_plot(os.path.join(self.path, key + '.png'), self.statistic_dict[key], 167 | x_label=x_labels[key], y_label=y_labels[key], window=window) 168 | else: 169 | for key in keys: 170 | smoothed_plot(os.path.join(self.path, key + '.png'), self.statistic_dict[key], 171 | x_label=x_labels[key], y_label=y_labels[key], window=window) 172 | 173 | 174 | class Worker(Agent): 175 | def __init__(self, algo_params, queues, path=None, seed=0, i=0): 176 | self.queues = queues 177 | self.worker_id = i 178 | self.worker_update_gap = algo_params['worker_update_gap'] # in episodes 179 | self.env_step_count = 0 180 | super(Worker, self).__init__(algo_params, path=path, seed=seed) 181 | 182 | def run(self, render=False, test=False, load_network_ep=None, sleep=0): 183 | raise NotImplementedError 184 | 185 | def 
_interact(self, render=False, test=False, sleep=0): 186 | raise NotImplementedError 187 | 188 | def _select_action(self, obs, test=False): 189 | raise NotImplementedError 190 | 191 | def _remember(self, batch): 192 | try: 193 | self.queues['replay_queue'].put_nowait(batch) 194 | except queue.Full: 195 | pass 196 | 197 | def _download_actor_networks(self, keys, tau=1.0): 198 | try: 199 | source = self.queues['network_queue'].get_nowait() 200 | except queue.Empty: 201 | return False 202 | print("Worker No. %i downloading network" % self.worker_id) 203 | for key in keys: 204 | self._soft_update(source[key], self.network_dict[key], tau=tau, from_params=True) 205 | return True 206 | 207 | 208 | class Learner(Agent): 209 | def __init__(self, algo_params, queues, path=None, seed=0): 210 | self.queues = queues 211 | self.num_workers = algo_params['num_workers'] 212 | self.learner_steps = algo_params['learner_steps'] 213 | self.learner_upload_gap = algo_params['learner_upload_gap'] # in optimization steps 214 | self.actor_learning_rate = algo_params['actor_learning_rate'] 215 | self.critic_learning_rate = algo_params['critic_learning_rate'] 216 | self.discard_time_limit = algo_params['discard_time_limit'] 217 | self.batch_size = algo_params['batch_size'] 218 | self.prioritised = algo_params['prioritised'] 219 | self.optimizer_steps = algo_params['optimization_steps'] 220 | self.optim_step_count = 0 221 | super(Learner, self).__init__(algo_params, path=path, seed=seed) 222 | 223 | def run(self): 224 | raise NotImplementedError 225 | 226 | def _learn(self, steps=None): 227 | raise NotImplementedError 228 | 229 | def _upload_learner_networks(self, keys): 230 | print("Learner uploading network") 231 | params = dict.fromkeys(keys) 232 | for key in keys: 233 | params[key] = [p.data.cpu().detach().numpy() for p in self.network_dict[key].parameters()] 234 | # delete an old net and upload a new one 235 | try: 236 | data = self.queues['network_queue'].get_nowait() 237 | del data 238 | except queue.Empty: 239 | pass 240 | try: 241 | self.queues['network_queue'].put(params) 242 | except queue.Full: 243 | pass 244 | 245 | 246 | class CentralProcessor(object): 247 | def __init__(self, algo_params, env_name, env_source, learner, worker, transition_tuple=None, path=None, 248 | worker_seeds=None, seed=0): 249 | self.algo_params = algo_params.copy() 250 | self.env_name = env_name 251 | assert env_source in ['gym', 'pybullet_envs', 'pybullet_multigoal_gym'], \ 252 | "unsupported env source: {}, " \ 253 | "only 3 env sources are supported: {}, " \ 254 | "for new env sources please modify the original code".format(env_source, 255 | ['gym', 'pybullet_envs', 256 | 'pybullet_multigoal_gym']) 257 | self.env_source = importlib.import_module(env_source) 258 | self.learner = learner 259 | self.worker = worker 260 | self.batch_size = algo_params['batch_size'] 261 | self.num_workers = algo_params['num_workers'] 262 | self.learner_steps = algo_params['learner_steps'] 263 | if worker_seeds is None: 264 | worker_seeds = np.random.randint(10, 1000, size=self.num_workers).tolist() 265 | else: 266 | assert len(worker_seeds) == self.num_workers, 'should assign seeds to every worker' 267 | self.worker_seeds = worker_seeds 268 | assert path is not None, 'please specify a project path to save files' 269 | self.path = path 270 | # create a random number generator and seed it 271 | self.rng = np.random.default_rng(seed=0) 272 | 273 | # multiprocessing queues 274 | self.queues = { 275 | 'replay_queue': 
mp.Queue(maxsize=algo_params['replay_queue_size']), 276 | 'batch_queue': mp.Queue(maxsize=algo_params['batch_queue_size']), 277 | 'network_queue': T.multiprocessing.Queue(maxsize=self.num_workers), 278 | 'learner_step_count': mp.Value('i', 0), 279 | 'global_episode_count': mp.Value('i', 0), 280 | } 281 | 282 | # setup replay buffer 283 | # prioritised replay 284 | self.prioritised = algo_params['prioritised'] 285 | self.store_with_given_priority = algo_params['store_with_given_priority'] 286 | # non-goal-conditioned replay buffer 287 | tr = transition_tuple 288 | if transition_tuple is None: 289 | tr = t 290 | if not self.prioritised: 291 | self.buffer = ReplayBuffer(algo_params['memory_capacity'], tr, seed=seed) 292 | else: 293 | self.queues.update({ 294 | 'priority_queue': mp.Queue(maxsize=algo_params['priority_queue_size']) 295 | }) 296 | self.buffer = PrioritisedReplayBuffer(algo_params['memory_capacity'], tr, rng=self.rng) 297 | 298 | def run(self): 299 | def worker_process(i, seed): 300 | env = self.env_source.make(self.env_name) 301 | path = os.path.join(self.path, "worker_%i" % i) 302 | worker = self.worker(self.algo_params, env, self.queues, path=path, seed=seed, i=i) 303 | worker.run() 304 | self.empty_queue('replay_queue') 305 | 306 | def learner_process(): 307 | env = self.env_source.make(self.env_name) 308 | path = os.path.join(self.path, "learner") 309 | learner = self.learner(self.algo_params, env, self.queues, path=path, seed=0) 310 | learner.run() 311 | if self.prioritised: 312 | self.empty_queue('priority_queue') 313 | self.empty_queue('network_queue') 314 | 315 | def update_buffer(): 316 | while self.queues['learner_step_count'].value < self.learner_steps: 317 | num_transitions_in_queue = self.queues['replay_queue'].qsize() 318 | for n in range(num_transitions_in_queue): 319 | data = self.queues['replay_queue'].get() 320 | if self.prioritised: 321 | if self.store_with_given_priority: 322 | self.buffer.store_experience_with_given_priority(data['priority'], *data['transition']) 323 | else: 324 | self.buffer.store_experience(*data) 325 | else: 326 | self.buffer.store_experience(*data) 327 | if self.batch_size > len(self.buffer): 328 | continue 329 | 330 | if self.prioritised: 331 | try: 332 | inds, priorities = self.queues['priority_queue'].get_nowait() 333 | self.buffer.update_priority(inds, priorities) 334 | except queue.Empty: 335 | pass 336 | try: 337 | batch, weights, inds = self.buffer.sample(batch_size=self.batch_size) 338 | state, action, next_state, reward, done = batch 339 | self.queues['batch_queue'].put_nowait((state, action, next_state, reward, done, weights, inds)) 340 | except queue.Full: 341 | continue 342 | else: 343 | try: 344 | batch = self.buffer.sample(batch_size=self.batch_size) 345 | state, action, next_state, reward, done = batch 346 | self.queues['batch_queue'].put_nowait((state, action, next_state, reward, done)) 347 | except queue.Full: 348 | time.sleep(0.1) 349 | continue 350 | 351 | self.empty_queue('batch_queue') 352 | 353 | processes = [] 354 | p = T.multiprocessing.Process(target=update_buffer) 355 | processes.append(p) 356 | p = T.multiprocessing.Process(target=learner_process) 357 | processes.append(p) 358 | for i in range(self.num_workers): 359 | p = T.multiprocessing.Process(target=worker_process, 360 | args=(i, self.worker_seeds[i])) 361 | processes.append(p) 362 | 363 | for p in processes: 364 | p.start() 365 | for p in processes: 366 | p.join() 367 | 368 | def empty_queue(self, queue_name): 369 | while True: 370 | try: 371 | data = 
self.queues[queue_name].get_nowait() 372 | del data 373 | except queue.Empty: 374 | break 375 | self.queues[queue_name].close() 376 | -------------------------------------------------------------------------------- /drl_implementation/agent/utils/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/IanYangChina/DRL_Implementation/38812c9647e4bec8359908be444dff19b90257d5/drl_implementation/agent/utils/__init__.py -------------------------------------------------------------------------------- /drl_implementation/agent/utils/env_wrapper.py: -------------------------------------------------------------------------------- 1 | import gym 2 | import numpy as np 3 | from skimage.transform import resize 4 | from collections import deque 5 | 6 | 7 | class FrameStack(gym.Wrapper): 8 | def __init__(self, env, k): 9 | gym.Wrapper.__init__(self, env) 10 | self._k = k 11 | self._frames = deque([], maxlen=k) 12 | shp = env.observation_space.shape 13 | self.observation_space = gym.spaces.Box( 14 | low=0, 15 | high=1, 16 | shape=((shp[0] * k,) + shp[1:]), 17 | dtype=env.observation_space.dtype) 18 | self._max_episode_steps = env._max_episode_steps 19 | 20 | def reset(self): 21 | obs = self.env.reset() 22 | for _ in range(self._k): 23 | self._frames.append(obs) 24 | return self._get_obs() 25 | 26 | def step(self, action): 27 | obs, reward, done, info = self.env.step(action) 28 | self._frames.append(obs) 29 | return self._get_obs(), reward, done, info 30 | 31 | def _get_obs(self): 32 | assert len(self._frames) == self._k 33 | return np.concatenate(list(self._frames), axis=0) 34 | 35 | 36 | class PixelPybulletGym(gym.Wrapper): 37 | def __init__(self, env, image_size, crop_size, channel_first=True): 38 | gym.Wrapper.__init__(self, env) 39 | self.image_size = image_size 40 | self.crop_size = crop_size 41 | self.channel_first = channel_first 42 | self.vertical_boundary = int((env.env._render_height - self.crop_size) / 2) 43 | self.horizontal_boundary = int((env.env._render_width - self.crop_size) / 2) 44 | self._max_episode_steps = env._max_episode_steps 45 | 46 | def reset(self): 47 | self.env.reset() 48 | return self._get_obs() 49 | 50 | def step(self, action): 51 | _, reward, done, info = self.env.step(action) 52 | return self._get_obs(), reward, done, info 53 | 54 | def _get_obs(self): 55 | # H, W, C 56 | obs = self.render(mode="rgb_array") 57 | obs = obs[self.vertical_boundary:-self.vertical_boundary, self.horizontal_boundary:-self.horizontal_boundary, :] 58 | obs = resize(obs, (self.image_size, self.image_size)) 59 | if self.channel_first: 60 | obs = obs.transpose((-1, 0, 1)) 61 | return obs 62 | -------------------------------------------------------------------------------- /drl_implementation/agent/utils/exploration_strategy.py: -------------------------------------------------------------------------------- 1 | import math as M 2 | import numpy as np 3 | 4 | 5 | class ExpDecayGreedy(object): 6 | # e-greedy exploration with exponential decay 7 | def __init__(self, start=1, end=0.05, decay=50000, decay_start=None, rng=None): 8 | self.start = start 9 | self.end = end 10 | self.decay = decay 11 | self.decay_start = decay_start 12 | if rng is None: 13 | self.rng = np.random.default_rng(seed=0) 14 | else: 15 | self.rng = rng 16 | 17 | def __call__(self, count): 18 | if self.decay_start is not None: 19 | count -= self.decay_start 20 | if count < 0: 21 | count = 0 22 | epsilon = self.end + (self.start - self.end) * M.exp(-1. 
* count / self.decay) 23 | prob = self.rng.uniform(0, 1) 24 | if prob < epsilon: 25 | return True 26 | else: 27 | return False 28 | 29 | 30 | class LinearDecayGreedy(object): 31 | # e-greedy exploration with linear decay 32 | def __init__(self, start=1.0, end=0.1, decay=1000000, decay_start=None, rng=None): 33 | self.start = start 34 | self.end = end 35 | self.decay = decay 36 | self.decay_start = decay_start 37 | if rng is None: 38 | self.rng = np.random.default_rng(seed=0) 39 | else: 40 | self.rng = rng 41 | 42 | def __call__(self, count): 43 | if self.decay_start is not None: 44 | count -= self.decay_start 45 | if count < 0: 46 | count = 0 47 | if count > self.decay: 48 | count = self.decay 49 | epsilon = self.start - count * (self.start - self.end) / self.decay 50 | prob = self.rng.uniform(0, 1) 51 | if prob < epsilon: 52 | return True 53 | else: 54 | return False 55 | 56 | 57 | class OUNoise(object): 58 | # https://github.com/rll/rllab/blob/master/rllab/exploration_strategies/ou_strategy.py 59 | def __init__(self, action_dim, action_max, mu=0, theta=0.2, sigma=1.0, rng=None): 60 | if rng is None: 61 | self.rng = np.random.default_rng(seed=0) 62 | else: 63 | self.rng = rng 64 | self.action_dim = action_dim 65 | self.action_max = action_max 66 | self.mu = mu 67 | self.theta = theta 68 | self.sigma = sigma 69 | self.state = np.ones(self.action_dim) * self.mu 70 | self.reset() 71 | 72 | def reset(self): 73 | self.state = np.ones(self.action_dim) * self.mu 74 | 75 | def __call__(self, action): 76 | x = self.state 77 | dx = self.theta * (self.mu - x) + self.sigma * self.rng.standard_normal(len(x)) 78 | self.state = x + dx 79 | return np.clip(action + self.state, -self.action_max, self.action_max) 80 | 81 | 82 | class GaussianNoise(object): 83 | # the one used in the TD3 paper: http://proceedings.mlr.press/v80/fujimoto18a/fujimoto18a.pdf 84 | def __init__(self, action_dim, action_max, scale=1, mu=0, sigma=0.1, rng=None): 85 | if rng is None: 86 | self.rng = np.random.default_rng(seed=0) 87 | else: 88 | self.rng = rng 89 | self.scale = scale 90 | self.action_dim = action_dim 91 | self.action_max = action_max 92 | self.mu = mu 93 | self.sigma = sigma 94 | 95 | def __call__(self, action): 96 | noise = self.scale*self.rng.normal(loc=self.mu, scale=self.sigma, size=(self.action_dim,)) 97 | return np.clip(action + noise, -self.action_max, self.action_max) 98 | 99 | 100 | class EGreedyGaussian(object): 101 | # the one used in the HER paper: https://arxiv.org/abs/1707.01495 102 | def __init__(self, action_dim, action_max, chance=0.2, scale=1, mu=0, sigma=0.1, rng=None): 103 | self.chance = chance 104 | self.scale = scale 105 | self.action_dim = action_dim 106 | self.action_max = action_max 107 | self.mu = mu 108 | self.sigma = sigma 109 | if rng is None: 110 | self.rng = np.random.default_rng(seed=0) 111 | else: 112 | self.rng = rng 113 | 114 | def __call__(self, action): 115 | chance = self.rng.uniform(0, 1) 116 | if chance < self.chance: 117 | return self.rng.uniform(-self.action_max, self.action_max, size=(self.action_dim,)) 118 | else: 119 | noise = self.scale*self.rng.normal(loc=self.mu, scale=self.sigma, size=(self.action_dim,)) 120 | return np.clip(action + noise, -self.action_max, self.action_max) 121 | 122 | 123 | class AutoAdjustingEGreedyGaussian(object): 124 | """ 125 | https://ieeexplore.ieee.org/document/9366328 126 | This exploration class is a goal-success-rate-based auto-adjusting exploration strategy. 
127 | It modifies the original constant chance exploration strategy by reducing exploration probabilities and noise deviations 128 | w.r.t. the testing success rate of each goal. 129 | """ 130 | def __init__(self, goal_num, action_dim, action_max, tau=0.05, chance=0.2, scale=1, mu=0, sigma=0.2, rng=None): 131 | if rng is None: 132 | self.rng = np.random.default_rng(seed=0) 133 | else: 134 | self.rng = rng 135 | self.scale = scale 136 | self.action_dim = action_dim 137 | self.action_max = action_max 138 | self.mu = mu 139 | self.base_sigma = sigma 140 | self.sigma = np.ones(goal_num) * sigma 141 | 142 | self.base_chance = chance 143 | self.goal_num = goal_num 144 | self.tau = tau 145 | self.success_rates = np.zeros(self.goal_num) 146 | self.chance = np.ones(self.goal_num) * chance 147 | 148 | def update_success_rates(self, new_tet_suc_rate): 149 | old_tet_suc_rate = self.success_rates.copy() 150 | self.success_rates = (1-self.tau)*old_tet_suc_rate + self.tau*new_tet_suc_rate 151 | self.chance = self.base_chance*(1-self.success_rates) 152 | self.sigma = self.base_sigma*(1-self.success_rates) 153 | 154 | def __call__(self, goal_ind, action): 155 | # return a random action or a noisy action 156 | prob = self.rng.uniform(0, 1) 157 | if prob < self.chance[goal_ind]: 158 | return self.rng.uniform(-self.action_max, self.action_max, size=(self.action_dim,)) 159 | else: 160 | noise = self.scale*self.rng.normal(loc=self.mu, scale=self.sigma[goal_ind], size=(self.action_dim,)) 161 | return action + noise 162 | -------------------------------------------------------------------------------- /drl_implementation/agent/utils/networks_conv.py: -------------------------------------------------------------------------------- 1 | import torch as T 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | from torch.distributions import Normal 5 | 6 | 7 | class DQNetwork(nn.Module): 8 | def __init__(self, input_shape, action_dims, init_w=3e-3): 9 | super(DQNetwork, self).__init__() 10 | self.input_shape = input_shape 11 | # input_shape: tuple(c, h, w) 12 | self.features = nn.Sequential( 13 | nn.Conv2d(input_shape[0], 32, kernel_size=8, stride=4), 14 | nn.ReLU(), 15 | nn.Conv2d(32, 64, kernel_size=4, stride=2), 16 | nn.ReLU(), 17 | nn.Conv2d(64, 64, kernel_size=3, stride=1), 18 | nn.ReLU() 19 | ) 20 | 21 | x = T.randn([32] + list(input_shape)) 22 | self.conv_out_dim = self.features(x).view(x.size(0), -1).size(1) 23 | self.fc = nn.Linear(self.conv_out_dim, 512) 24 | self.v = nn.Linear(512, action_dims) 25 | T.nn.init.uniform_(self.v.weight.data, -init_w, init_w) 26 | T.nn.init.uniform_(self.v.bias.data, -init_w, init_w) 27 | 28 | def forward(self, obs): 29 | if obs.max() > 1.: 30 | obs = obs / 255. 
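        # pixel scaling: values above 1 are treated as raw 0-255 images and rescaled once,
        # so that the convolutional feature extractor below always receives inputs in [0, 1]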
31 | 32 | x = self.features(obs) 33 | x = x.view(x.size(0), -1) 34 | x = F.relu(self.fc(x)) 35 | value = self.v(x) 36 | return value 37 | 38 | def get_action(self, obs): 39 | values = self.forward(obs) 40 | return T.argmax(values).item() 41 | 42 | 43 | class StochasticConvActor(nn.Module): 44 | def __init__(self, action_dim, encoder, hidden_dim=1024, log_std_min=-10, log_std_max=2, action_scaling=1, 45 | detach_obs_encoder=False, 46 | goal_conditioned=False, detach_goal_encoder=True): 47 | super(StochasticConvActor, self).__init__() 48 | 49 | self.action_scaling = action_scaling 50 | self.encoder = encoder 51 | self.detach_obs_encoder = detach_obs_encoder 52 | self.log_std_min = log_std_min 53 | self.log_std_max = log_std_max 54 | trunk_input_dim = self.encoder.feature_dim 55 | self.goal_conditioned = goal_conditioned 56 | self.detach_goal_encoder = detach_goal_encoder 57 | if self.goal_conditioned: 58 | trunk_input_dim *= 2 59 | self.trunk = nn.Sequential( 60 | nn.Linear(trunk_input_dim, hidden_dim), nn.ReLU(), 61 | nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), 62 | nn.Linear(hidden_dim, 2 * action_dim) 63 | ) 64 | 65 | self.apply(orthogonal_init) 66 | 67 | def forward(self, obs, goal=None): 68 | feature = self.encoder(obs, detach=self.detach_obs_encoder) 69 | if self.goal_conditioned: 70 | assert goal is not None, "need a goal image for goal-conditioned network" 71 | goal_feature = self.encoder(goal, detach=self.detach_goal_encoder) 72 | feature = T.cat((feature, goal_feature), dim=1) 73 | 74 | mu, log_std = self.trunk(feature).chunk(2, dim=-1) 75 | log_std = T.clamp(log_std, self.log_std_min, self.log_std_max) 76 | return mu, log_std 77 | 78 | def get_action(self, obs, goal=None, epsilon=1e-6, mean_pi=False, probs=False, entropy=False): 79 | mean, log_std = self(obs, goal) 80 | if mean_pi: 81 | return T.tanh(mean) 82 | std = log_std.exp() 83 | mu = Normal(mean, std) 84 | z = mu.rsample() 85 | action = T.tanh(z) 86 | if not probs: 87 | return action * self.action_scaling 88 | else: 89 | log_probs = (mu.log_prob(z) - T.log(1 - action.pow(2) + epsilon)).sum(1, keepdim=True) 90 | if not entropy: 91 | return action * self.action_scaling, log_probs 92 | else: 93 | entropy = mu.entropy() 94 | return action * self.action_scaling, log_probs, entropy 95 | 96 | 97 | class ConvCritic(nn.Module): 98 | # Modified from https://github.com/PhilipZRH/ferm 99 | def __init__(self, action_dim, encoder, hidden_dim=1024, detach_obs_encoder=False, 100 | goal_conditioned=False, detach_goal_encoder=True): 101 | super(ConvCritic, self).__init__() 102 | 103 | self.encoder = encoder 104 | self.detach_obs_encoder = detach_obs_encoder 105 | trunk_input_dim = self.encoder.feature_dim 106 | self.goal_conditioned = goal_conditioned 107 | self.detach_goal_encoder = detach_goal_encoder 108 | if self.goal_conditioned: 109 | trunk_input_dim *= 2 110 | self.trunk = nn.Sequential( 111 | nn.Linear(trunk_input_dim + action_dim, hidden_dim), 112 | nn.ReLU(), 113 | nn.Linear(hidden_dim, hidden_dim), 114 | nn.ReLU(), 115 | nn.Linear(hidden_dim, 1) 116 | ) 117 | 118 | self.apply(orthogonal_init) 119 | 120 | def forward(self, obs, action, goal=None): 121 | # detach_encoder allows to stop gradient propagation to encoder 122 | feature = self.encoder(obs, detach=self.detach_obs_encoder) 123 | if self.goal_conditioned: 124 | assert goal is not None, "need a goal image for goal-conditioned network" 125 | goal_feature = self.encoder(goal, detach=self.detach_goal_encoder) 126 | feature = T.cat((feature, goal_feature), dim=1) 127 | 
trunk_input = T.cat([feature, action], dim=1) 128 | q = self.trunk(trunk_input) 129 | return q 130 | 131 | 132 | class CURL(nn.Module): 133 | # Modified from https://github.com/PhilipZRH/ferm 134 | def __init__(self, z_dim, encoder, encoder_target): 135 | super(CURL, self).__init__() 136 | self.encoder = encoder 137 | self.encoder_target = encoder_target 138 | assert z_dim == self.encoder.feature_dim == self.encoder_target.feature_dim 139 | self.W = nn.Parameter(T.rand(z_dim, z_dim)) 140 | 141 | def encode(self, x, detach=False, use_target=False): 142 | # if exponential moving average (ema) target is enabled, 143 | # then compute key values using target encoder without gradient, 144 | # else compute key values with the main encoder 145 | # from CURL https://arxiv.org/abs/2004.04136 146 | if use_target: 147 | with T.no_grad(): 148 | z_out = self.encoder_target(x) 149 | else: 150 | z_out = self.encoder(x) 151 | 152 | if detach: 153 | z_out = z_out.detach() 154 | return z_out 155 | 156 | def compute_score(self, z_a, z_pos): 157 | """ 158 | from CURL https://arxiv.org/abs/2004.04136 159 | - compute (B,B) matrix z_a (W z_pos.T) 160 | - positives are all diagonal elements 161 | - negatives are all other elements 162 | - to compute loss use multi-class cross entropy with identity matrix for labels 163 | """ 164 | Wz = T.matmul(self.W, z_pos.T) # (z_dim,B) 165 | score = T.matmul(z_a, Wz) # (B,B) 166 | score = score - T.max(score, 1)[0][:, None] 167 | return score 168 | 169 | 170 | class PixelEncoder(nn.Module): 171 | def __init__(self, obs_shape, feature_dim=50, num_layers=4, num_filters=32): 172 | # the encoder architecture adopted by SAC-AE, DrQ and CURL 173 | super(PixelEncoder, self).__init__() 174 | assert len(obs_shape) == 3 175 | self.obs_shape = obs_shape[-2:] 176 | self.feature_dim = feature_dim 177 | self.num_layers = num_layers 178 | 179 | self.convs = nn.ModuleList( 180 | [nn.Conv2d(obs_shape[0], num_filters, 3, stride=2)] 181 | ) 182 | for i in range(num_layers - 1): 183 | self.convs.append(nn.Conv2d(num_filters, num_filters, 3, stride=1)) 184 | 185 | x = T.randn([32] + list(obs_shape)) 186 | out_dim = self.forward_conv(x, flatten=False).shape[-1] 187 | self.trunk = nn.Sequential( 188 | nn.Linear(num_filters * out_dim * out_dim, self.feature_dim), 189 | nn.LayerNorm(self.feature_dim), 190 | nn.Tanh() 191 | ) 192 | 193 | def forward_conv(self, obs, flatten=True): 194 | if obs.max() > 1.: 195 | obs = obs / 255. 
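        # same 0-255 -> [0, 1] pixel-scaling convention as DQNetwork.forward; the conv stack below
        # produces a feature map that is flattened for the linear trunk when flatten=True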
196 | 197 | conv = T.relu(self.convs[0](obs)) 198 | for i in range(1, self.num_layers): 199 | conv = T.relu(self.convs[i](conv)) 200 | if flatten: 201 | conv = conv.reshape(conv.size(0), -1) 202 | return conv 203 | 204 | def forward(self, obs, detach=False): 205 | h = self.forward_conv(obs) 206 | if detach: 207 | h = h.detach() 208 | h = self.trunk(h) 209 | return h 210 | 211 | def copy_conv_weights_from(self, source): 212 | # only copy conv layers' weights 213 | for i in range(self.num_layers): 214 | self.convs[i].weight = source.convs[i].weight 215 | self.convs[i].bias = source.convs[i].bias 216 | 217 | 218 | class PixelDecoder(nn.Module): 219 | def __init__(self, obs_shape, feature_dim=50, num_layers=4, num_filters=32): 220 | # the encoder architecture adopted by SAC-AE, DrQ and CURL 221 | super(PixelDecoder, self).__init__() 222 | assert len(obs_shape) == 3 223 | self.obs_shape = obs_shape[-2:] 224 | self.feature_dim = feature_dim 225 | self.num_layers = num_layers 226 | 227 | # todo 228 | 229 | 230 | def orthogonal_init(m): 231 | if isinstance(m, nn.Linear): 232 | nn.init.orthogonal_(m.weight.data) 233 | if hasattr(m.bias, 'data'): 234 | m.bias.data.fill_(0.0) 235 | elif isinstance(m, nn.Conv2d) or isinstance(m, nn.ConvTranspose2d): 236 | gain = nn.init.calculate_gain('relu') 237 | nn.init.orthogonal_(m.weight.data, gain) 238 | if hasattr(m.bias, 'data'): 239 | m.bias.data.fill_(0.0) 240 | -------------------------------------------------------------------------------- /drl_implementation/agent/utils/networks_mlp.py: -------------------------------------------------------------------------------- 1 | import torch as T 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | from torch.distributions import Normal, Categorical 5 | 6 | 7 | class Actor(nn.Module): 8 | def __init__(self, input_dim, output_dim, fc1_size=256, fc2_size=256, fc3_size=256, init_w=3e-3, action_scaling=1): 9 | super(Actor, self).__init__() 10 | self.fc1 = nn.Linear(input_dim, fc1_size) 11 | self.fc2 = nn.Linear(fc1_size, fc2_size) 12 | self.fc3 = nn.Linear(fc2_size, fc3_size) 13 | self.pi = nn.Linear(fc3_size, output_dim) 14 | self.apply(orthogonal_init) 15 | self.action_scaling = action_scaling 16 | 17 | def forward(self, inputs): 18 | x = F.relu(self.fc1(inputs)) 19 | x = F.relu(self.fc2(x)) 20 | x = F.relu(self.fc3(x)) 21 | action = T.tanh(self.pi(x)) 22 | return action * self.action_scaling 23 | 24 | 25 | class StochasticActor(nn.Module): 26 | def __init__(self, input_dim, output_dim, log_std_min, log_std_max, continuous=True, agent_state_dim=0, 27 | fc1_size=256, fc2_size=256, fc3_size=256, init_w=3e-3, action_scaling=1): 28 | super(StochasticActor, self).__init__() 29 | self.continuous = continuous 30 | self.action_dim = output_dim 31 | self.agent_state_dim = agent_state_dim 32 | self.fc1 = nn.Linear(input_dim+agent_state_dim, fc1_size) 33 | self.fc2 = nn.Linear(fc1_size, fc2_size) 34 | if self.continuous: 35 | self.fc3 = nn.Linear(fc2_size, fc3_size) 36 | self.mean = nn.Linear(fc3_size, output_dim) 37 | self.log_std = nn.Linear(fc3_size, output_dim) 38 | else: 39 | self.fc3 = nn.Linear(fc2_size, output_dim) 40 | self.apply(orthogonal_init) 41 | self.log_std_min = log_std_min 42 | self.log_std_max = log_std_max 43 | self.action_scaling = action_scaling 44 | 45 | def forward(self, inputs): 46 | x = F.relu(self.fc1(inputs)) 47 | x = F.relu(self.fc2(x)) 48 | x = F.relu(self.fc3(x)) 49 | if self.continuous: 50 | mean = self.mean(x) 51 | log_std = self.log_std(x) 52 | log_std = T.clamp(log_std, 
self.log_std_min, self.log_std_max) 53 | return mean, log_std 54 | else: 55 | return x 56 | 57 | def get_action(self, inputs, std_scale=None, epsilon=1e-6, mean_pi=False, greedy=False, probs=False, entropy=False): 58 | if self.continuous: 59 | mean, log_std = self(inputs) 60 | if mean_pi: 61 | return T.tanh(mean) 62 | std = log_std.exp() 63 | if std_scale is not None: 64 | std *= std_scale 65 | mu = Normal(mean, std) 66 | z = mu.rsample() 67 | action = T.tanh(z) 68 | if not probs: 69 | return action * self.action_scaling 70 | else: 71 | if action.shape == (self.action_dim,): 72 | action = action.reshape((1, self.action_dim)) 73 | log_probs = (mu.log_prob(z) - T.log(1 - action.pow(2) + epsilon)).sum(1, keepdim=True) 74 | if not entropy: 75 | return action * self.action_scaling, log_probs 76 | else: 77 | entropy = mu.entropy() 78 | return action * self.action_scaling, log_probs, entropy 79 | else: 80 | logits = self(inputs) 81 | if greedy: 82 | actions = T.argmax(logits, dim=1, keepdim=True) 83 | return actions, None, None 84 | action_probs = F.softmax(logits, dim=1) 85 | dist = Categorical(action_probs) 86 | actions = dist.sample().view(-1, 1) 87 | log_probs = T.log(action_probs + epsilon).gather(1, actions) 88 | entropy = dist.entropy() 89 | return actions, log_probs, entropy 90 | 91 | def get_log_probs(self, inputs, actions, std_scale=None): 92 | actions /= self.action_scaling 93 | mean, log_std = self(inputs) 94 | std = log_std.exp() 95 | if std_scale is not None: 96 | std *= std_scale 97 | mu = Normal(mean, std) 98 | log_probs = mu.log_prob(actions) 99 | entropy = mu.entropy() 100 | return log_probs, entropy 101 | 102 | 103 | class Critic(nn.Module): 104 | def __init__(self, input_dim, output_dim, fc1_size=256, fc2_size=256, fc3_size=256, init_w=3e-3, softmax=False): 105 | super(Critic, self).__init__() 106 | self.fc1 = nn.Linear(input_dim, fc1_size) 107 | self.fc2 = nn.Linear(fc1_size, fc2_size) 108 | self.fc3 = nn.Linear(fc2_size, fc3_size) 109 | self.v = nn.Linear(fc3_size, output_dim) 110 | self.apply(orthogonal_init) 111 | self.softmax = softmax 112 | 113 | def forward(self, inputs): 114 | x = F.relu(self.fc1(inputs)) 115 | x = F.relu(self.fc2(x)) 116 | x = F.relu(self.fc3(x)) 117 | value = self.v(x) 118 | if not self.softmax: 119 | return value 120 | else: 121 | return F.softmax(value, dim=1) 122 | 123 | def get_action(self, inputs): 124 | values = self.forward(inputs) 125 | return T.argmax(values).item() 126 | 127 | 128 | def orthogonal_init(m): 129 | if isinstance(m, nn.Linear): 130 | nn.init.orthogonal_(m.weight.data) 131 | if hasattr(m.bias, 'data'): 132 | m.bias.data.fill_(0.0) 133 | elif isinstance(m, nn.Conv2d) or isinstance(m, nn.ConvTranspose2d): 134 | gain = nn.init.calculate_gain('relu') 135 | nn.init.orthogonal_(m.weight.data, gain) 136 | if hasattr(m.bias, 'data'): 137 | m.bias.data.fill_(0.0) 138 | -------------------------------------------------------------------------------- /drl_implementation/agent/utils/networks_pointnet.py: -------------------------------------------------------------------------------- 1 | import torch as T 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | from .pointnet_2_utils import PointNetSetAbstraction 5 | from .pointnet_utils import PointNetEncoder, feature_transform_reguliarzer 6 | 7 | 8 | class CriticPointNet(nn.Module): 9 | def __init__(self, output_dim, action_dim, agent_state_dim=0, normal_channel=False, softmax=False, goal_conditioned=False): 10 | super(CriticPointNet, self).__init__() 11 | in_channel = 6 if 
normal_channel else 3 12 | self.action_dim = action_dim 13 | self.agent_state_dim = agent_state_dim 14 | self.feat = PointNetEncoder(global_feat=True, feature_transform=True, channel=in_channel) 15 | self.goal_conditioned = goal_conditioned 16 | if self.goal_conditioned: 17 | self.fc1 = nn.Linear(2048+action_dim+agent_state_dim, 512) 18 | else: 19 | self.fc1 = nn.Linear(1024+action_dim+agent_state_dim, 512) 20 | self.fc2 = nn.Linear(512, 256) 21 | self.fc3 = nn.Linear(256, output_dim) 22 | self.dropout = nn.Dropout(p=0.4) 23 | self.bn1 = nn.BatchNorm1d(512) 24 | self.bn2 = nn.BatchNorm1d(256) 25 | self.softmax = softmax 26 | 27 | def forward(self, obs_xyz, action, goal_xyz=None, agent_state=None): 28 | x, trans, trans_feat = self.feat(obs_xyz) 29 | if self.goal_conditioned and goal_xyz is not None: 30 | goal_x, goal_trans, goal_trans_feat = self.feat(goal_xyz) 31 | x = T.cat([x, goal_x.detach()], dim=1) 32 | if agent_state is not None: 33 | assert agent_state.shape[1] == self.agent_state_dim 34 | x = T.cat([x, agent_state], dim=1) 35 | x = T.cat([x, action], dim=1) 36 | x = F.relu(self.bn1(self.fc1(x))) 37 | x = F.relu(self.bn2(self.dropout(self.fc2(x)))) 38 | value = self.fc3(x) 39 | 40 | if not self.softmax: 41 | return value 42 | else: 43 | return F.softmax(value, dim=1) 44 | 45 | def get_features(self, xyz, detach=False): 46 | x, trans, trans_feat = self.feat(xyz) 47 | if detach: 48 | x = x.detach() 49 | return x 50 | 51 | 52 | class CriticPointNet2(nn.Module): 53 | def __init__(self, output_dim, action_dim, normal_channel=False, softmax=False): 54 | super(CriticPointNet2, self).__init__() 55 | in_channel = 6 if normal_channel else 3 56 | self.normal_channel = normal_channel 57 | self.sa1 = PointNetSetAbstraction(npoint=512, radius=0.2, nsample=32, in_channel=in_channel, mlp=[64, 64, 128], 58 | group_all=False) 59 | self.sa2 = PointNetSetAbstraction(npoint=128, radius=0.4, nsample=64, in_channel=128 + 3, mlp=[128, 128, 256], 60 | group_all=False) 61 | self.sa3 = PointNetSetAbstraction(npoint=None, radius=None, nsample=None, in_channel=256 + 3, 62 | mlp=[256, 512, 1024], group_all=True) 63 | 64 | self.fc1 = nn.Linear(1024+action_dim, 512) 65 | self.bn1 = nn.BatchNorm1d(512) 66 | self.drop1 = nn.Dropout(0.4) 67 | self.fc2 = nn.Linear(512, 256) 68 | self.bn2 = nn.BatchNorm1d(256) 69 | self.drop2 = nn.Dropout(0.4) 70 | self.fc3 = nn.Linear(256, output_dim) 71 | self.softmax = softmax 72 | 73 | def forward(self, xyz): 74 | B, _, _ = xyz.shape 75 | if self.normal_channel: 76 | norm = xyz[:, 3:, :] 77 | xyz = xyz[:, :3, :] 78 | else: 79 | norm = None 80 | l1_xyz, l1_points = self.sa1(xyz, norm) 81 | l2_xyz, l2_points = self.sa2(l1_xyz, l1_points) 82 | l3_xyz, l3_points = self.sa3(l2_xyz, l2_points) 83 | x = l3_points.view(B, 1024) 84 | 85 | x = self.drop1(F.relu(self.bn1(self.fc1(x)))) 86 | x = self.drop2(F.relu(self.bn2(self.fc2(x)))) 87 | value = self.fc3(x) 88 | 89 | if not self.softmax: 90 | return value 91 | else: 92 | return F.softmax(value, dim=1) 93 | 94 | def get_features(self, xyz, detach=False): 95 | B, _, _ = xyz.shape 96 | if self.normal_channel: 97 | norm = xyz[:, 3:, :] 98 | xyz = xyz[:, :3, :] 99 | else: 100 | norm = None 101 | l1_xyz, l1_points = self.sa1(xyz, norm) 102 | l2_xyz, l2_points = self.sa2(l1_xyz, l1_points) 103 | l3_xyz, l3_points = self.sa3(l2_xyz, l2_points) 104 | x = l3_points.view(B, 1024) 105 | if detach: 106 | x = x.detach() 107 | return x 108 | 109 | def get_action(self, inputs): 110 | values = self.forward(inputs) 111 | return T.argmax(values).item() 112 | 
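CriticPointNet above consumes a raw point cloud shaped [batch, 3, num_points] plus an action vector and outputs a Q-value, reusing PointNetEncoder for the per-cloud global feature. The snippet below is a minimal, illustrative smoke test and not part of the repository; it assumes the package is installed with the layout shown in this tree and that PointNetEncoder returns a 1024-dimensional global feature per cloud, which is what the fc1 input size implies.

```python
# Illustrative sketch only: exercise CriticPointNet on random point clouds.
import torch as T
from drl_implementation.agent.utils.networks_pointnet import CriticPointNet

critic = CriticPointNet(output_dim=1, action_dim=4)  # Q(s, a) -> a single value
critic.eval()                                        # put BatchNorm/Dropout in inference mode

obs_xyz = T.randn(8, 3, 128)                         # batch of 8 clouds, 128 xyz points each
actions = T.randn(8, 4)                              # batch of 4-dimensional continuous actions

with T.no_grad():
    q_values = critic(obs_xyz, actions)              # expected shape: [8, 1]
    features = critic.get_features(obs_xyz, detach=True)  # expected shape: [8, 1024]
print(q_values.shape, features.shape)
```

With goal_conditioned=True the feature width doubles (hence the 2048-input fc1) and a goal cloud has to be passed as goal_xyz; its features are detached inside forward(), so gradients only flow through the observation branch.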
-------------------------------------------------------------------------------- /drl_implementation/agent/utils/normalizer.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | 4 | class Normalizer(object): 5 | def __init__(self, input_dims, init_mean, init_var, 6 | scale_factor=1, epsilon=1e-3, clip_range=None, activated=False): 7 | self.activated = activated 8 | self.input_dims = input_dims 9 | self.sample_count = 0 10 | self.history = [] 11 | self.history_mean = init_mean 12 | self.history_var = init_var 13 | if self.history_mean is None: 14 | self.history_mean = np.zeros(self.input_dims) 15 | if self.history_var is None: 16 | self.history_var = np.ones(self.input_dims) 17 | assert self.history_mean.shape == (self.input_dims,) 18 | assert self.history_var.shape == (self.input_dims,) 19 | self.epsilon = epsilon*np.ones(self.input_dims) 20 | if clip_range is None: 21 | clip_range = 1e3 22 | self.input_clip_range = (-clip_range*np.ones(self.input_dims), clip_range*np.ones(self.input_dims)) 23 | self.scale_factor = scale_factor 24 | 25 | def store_history(self, *args): 26 | self.history.append(*args) 27 | 28 | # update mean and var for z-score normalization 29 | def update_mean(self): 30 | if len(self.history) == 0: 31 | return 32 | new_sample_num = len(self.history) 33 | new_history = np.array(self.history, dtype=float) 34 | new_mean = np.mean(new_history, axis=0) 35 | 36 | new_var = np.sum(np.square(new_history - new_mean), axis=0) 37 | new_var = (self.sample_count * np.square(self.history_var) + new_var) 38 | new_var /= (new_sample_num + self.sample_count) 39 | self.history_var = np.sqrt(new_var) 40 | 41 | new_mean = (self.sample_count * self.history_mean + new_sample_num * new_mean) 42 | new_mean /= (new_sample_num + self.sample_count) 43 | self.history_mean = new_mean 44 | 45 | self.sample_count += new_sample_num 46 | self.history.clear() 47 | 48 | # pre-process inputs with z-score normalization and clipping (history_var tracks a running standard deviation, see update_mean) 49 | def __call__(self, inputs): 50 | if self.activated: 51 | inputs = (inputs - self.history_mean) / (self.history_var+self.epsilon) 52 | inputs = np.clip(inputs, self.input_clip_range[0], self.input_clip_range[1]) 53 | return self.scale_factor*inputs 54 | -------------------------------------------------------------------------------- /drl_implementation/agent/utils/plot.py: -------------------------------------------------------------------------------- 1 | import json 2 | import numpy as np 3 | import matplotlib as mpl 4 | 5 | mpl.use('Agg') 6 | import matplotlib.pyplot as plt 7 | from matplotlib.lines import Line2D 8 | from copy import deepcopy as dcp 9 | 10 | 11 | def smoothed_plot(file, data, x_label="Timesteps", y_label="Success rate", window=5): 12 | N = len(data) 13 | running_avg = np.empty(N) 14 | for t in range(N): 15 | running_avg[t] = np.mean(data[max(0, t - window):(t + 1)]) 16 | x = [i for i in range(N)] 17 | plt.ylabel(y_label) 18 | plt.xlabel(x_label) 19 | if x_label == "Epoch": 20 | x_tick_interval = len(data) // 10 21 | plt.xticks([n * x_tick_interval for n in range(11)]) 22 | plt.plot(x, running_avg) 23 | plt.savefig(file, bbox_inches='tight', dpi=500) 24 | plt.close() 25 | 26 | 27 | def smoothed_plot_multi_line(file, data, colors=None, linestyles=None, linewidths=None, alphas=None, 28 | legend=None, legend_loc="upper right", window=5, title=None, 29 | x_label='Timesteps', x_axis_off=False, xticks=None, xticklabels=None, 30 | y_label="Success rate", ylim=(None, None), y_axis_off=False, yticks=None, 
yticklabels=None, 31 | grid=False, 32 | horizontal_lines=None, ho_linestyle='--', ho_linewidth=4, ho_xmin=0.05, ho_xmax=0.95): 33 | if y_axis_off: 34 | plt.ylabel(None) 35 | plt.yticks([]) 36 | else: 37 | plt.ylabel(y_label) 38 | if yticks is not None: 39 | plt.yticks(yticks, yticklabels) 40 | if ylim[0] is not None: 41 | plt.ylim(ylim) 42 | if title is not None: 43 | plt.title(title) 44 | 45 | if x_axis_off: 46 | plt.xlabel(None) 47 | plt.xticks([]) 48 | else: 49 | plt.xlabel(x_label) 50 | if x_label == "Epoch": 51 | x_tick_interval = len(data[0]) // 10 52 | plt.xticks([n * x_tick_interval for n in range(11)]) 53 | if xticks is not None: 54 | plt.xticks(xticks, xticklabels) 55 | 56 | for t in range(len(data)): 57 | N = len(data[t]) 58 | x = [i for i in range(N)] 59 | if window != 0: 60 | running_avg = np.empty(N) 61 | for n in range(N): 62 | running_avg[n] = np.mean(data[t][max(0, n - window):(n + 1)]) 63 | else: 64 | running_avg = data[t] 65 | 66 | if colors is None: 67 | c = None 68 | else: 69 | assert len(colors) >= len(data) 70 | c = colors[t] 71 | 72 | if linestyles is None: 73 | ls = '-' 74 | else: 75 | assert len(linestyles) == len(data) 76 | ls = linestyles[t] 77 | 78 | if linewidths is None: 79 | linewidth = 1 80 | else: 81 | assert len(linewidths) == len(data) 82 | linewidth = linewidths[t] 83 | 84 | if alphas is None: 85 | alpha = 1 86 | else: 87 | assert len(alphas) == len(data) 88 | alpha = alphas[t] 89 | 90 | plt.plot(x, running_avg, c=c, linestyle=ls, linewidth=linewidth, alpha=alpha) 91 | 92 | if horizontal_lines is not None: 93 | for n in range(len(horizontal_lines)): 94 | plt.axhline(y=horizontal_lines[n], color=colors[len(data) + n], 95 | xmin=ho_xmin, xmax=ho_xmax, linestyle=ho_linestyle, linewidth=ho_linewidth) 96 | 97 | if legend is not None: 98 | plt.legend(legend, loc=legend_loc) 99 | 100 | if grid: 101 | plt.grid(True, linewidth=0.2) 102 | 103 | plt.savefig(file, bbox_inches='tight', dpi=500) 104 | plt.close() 105 | 106 | 107 | def smoothed_plot_mean_deviation(file, data_dict_list, title=None, 108 | vertical_lines=None, horizontal_lines=None, linestyle='--', linewidth=4, 109 | x_label='Timesteps', x_axis_off=False, xticks=None, 110 | y_label="Success rate", window=5, ylim=(None, None), y_axis_off=False, yticks=None, 111 | legend=None, legend_only=False, legend_file=None, legend_loc="upper right", 112 | legend_title=None, legend_bbox_to_anchor=None, legend_ncol=4, legend_frame=False, 113 | handlelength=2): 114 | colors = ['tab:blue', 'tab:orange', 'tab:green', 'tab:red', 'tab:purple', 115 | 'tab:brown', 'tab:pink', 'tab:gray', 'tab:olive', 'tab:cyan','k'] 116 | if not isinstance(data_dict_list, list): 117 | data_dict_list = [data_dict_list] 118 | 119 | if y_axis_off: 120 | plt.ylabel(None) 121 | plt.yticks([]) 122 | else: 123 | plt.ylabel(y_label) 124 | if yticks is not None: 125 | plt.yticks(yticks) 126 | if ylim[0] is not None: 127 | plt.ylim(ylim) 128 | if title is not None: 129 | plt.title(title) 130 | 131 | if x_axis_off: 132 | plt.xlabel(None) 133 | plt.xticks([]) 134 | else: 135 | plt.xlabel(x_label) 136 | if x_label == "Epoch": 137 | x_tick_interval = len(data_dict_list[0]["mean"]) // 10 138 | plt.xticks([n * x_tick_interval for n in range(11)]) 139 | if xticks is not None: 140 | plt.xticks(xticks) 141 | 142 | handles = [Line2D([0], [0], color=colors[i], linewidth=linewidth) for i in range(len(data_dict_list))] 143 | if legend is not None: 144 | legend_plot = plt.legend(handles, legend, handlelength=handlelength, 145 | title=legend_title, 
loc=legend_loc, labelspacing=0.15, 146 | bbox_to_anchor=legend_bbox_to_anchor, ncol=legend_ncol, frameon=legend_frame) 147 | if legend_only: 148 | assert legend_file is not None, 'specify legend save path' 149 | fig = legend_plot.figure 150 | fig.canvas.draw() 151 | bbox = legend_plot.get_window_extent().transformed(fig.dpi_scale_trans.inverted()) 152 | fig.savefig(legend_file, dpi=500, bbox_inches=bbox) 153 | plt.close() 154 | return 155 | 156 | N = len(data_dict_list[0]["mean"]) 157 | x = [i for i in range(N)] 158 | for i in range(len(data_dict_list)): 159 | case_data = data_dict_list[i] 160 | for key in case_data: 161 | running_avg = np.empty(N) 162 | for n in range(N): 163 | running_avg[n] = np.mean(case_data[key][max(0, n - window):(n + 1)]) 164 | 165 | case_data[key] = dcp(running_avg) 166 | 167 | plt.fill_between(x, case_data["upper"], case_data["lower"], alpha=0.3, color=colors[i], label='_nolegend_') 168 | plt.plot(x, case_data["mean"], color=colors[i]) 169 | 170 | if horizontal_lines is not None: 171 | for n in range(len(horizontal_lines)): 172 | plt.axhline(y=horizontal_lines[n], color=colors[len(data_dict_list) + n], xmin=0.05, xmax=0.95, 173 | linestyle=linestyle, linewidth=linewidth) 174 | if vertical_lines is not None: 175 | assert horizontal_lines is None 176 | for n in range(len(vertical_lines)): 177 | plt.axvline(x=vertical_lines[n], color=colors[len(data_dict_list) + n], ymin=0.05, ymax=0.95, 178 | linestyle=linestyle, linewidth=linewidth) 179 | 180 | plt.savefig(file, bbox_inches='tight', dpi=500) 181 | plt.close() 182 | 183 | 184 | def get_mean_and_deviation(data, save_data=False, file_name=None): 185 | upper = np.max(data, axis=0) 186 | lower = np.min(data, axis=0) 187 | mean = np.mean(data, axis=0) 188 | statistics = {"mean": mean.tolist(), 189 | "upper": upper.tolist(), 190 | "lower": lower.tolist()} 191 | if save_data: 192 | assert file_name is not None 193 | json.dump(statistics, open(file_name, 'w')) 194 | return statistics 195 | -------------------------------------------------------------------------------- /drl_implementation/agent/utils/pointnet_2_utils.py: -------------------------------------------------------------------------------- 1 | """ 2 | https://github.com/yanx27/Pointnet_Pointnet2_pytorch/blob/master/models/pointnet2_utils.py 3 | @article{Pytorch_Pointnet_Pointnet2, 4 | Author = {Xu Yan}, 5 | Title = {Pointnet/Pointnet++ Pytorch}, 6 | Journal = {https://github.com/yanx27/Pointnet_Pointnet2_pytorch}, 7 | Year = {2019} 8 | } 9 | """ 10 | import torch 11 | import torch.nn as nn 12 | import torch.nn.functional as F 13 | from time import time 14 | import numpy as np 15 | 16 | 17 | def timeit(tag, t): 18 | print("{}: {}s".format(tag, time() - t)) 19 | return time() 20 | 21 | 22 | def pc_normalize(pc): 23 | l = pc.shape[0] 24 | centroid = np.mean(pc, axis=0) 25 | pc = pc - centroid 26 | m = np.max(np.sqrt(np.sum(pc ** 2, axis=1))) 27 | pc = pc / m 28 | return pc 29 | 30 | 31 | def square_distance(src, dst): 32 | """ 33 | Calculate Euclid distance between each two points. 
34 | 35 | src^T * dst = xn * xm + yn * ym + zn * zm; 36 | sum(src^2, dim=-1) = xn*xn + yn*yn + zn*zn; 37 | sum(dst^2, dim=-1) = xm*xm + ym*ym + zm*zm; 38 | dist = (xn - xm)^2 + (yn - ym)^2 + (zn - zm)^2 39 | = sum(src**2,dim=-1)+sum(dst**2,dim=-1)-2*src^T*dst 40 | 41 | Input: 42 | src: source points, [B, N, C] 43 | dst: target points, [B, M, C] 44 | Output: 45 | dist: per-point square distance, [B, N, M] 46 | """ 47 | B, N, _ = src.shape 48 | _, M, _ = dst.shape 49 | dist = -2 * torch.matmul(src, dst.permute(0, 2, 1)) 50 | dist += torch.sum(src ** 2, -1).view(B, N, 1) 51 | dist += torch.sum(dst ** 2, -1).view(B, 1, M) 52 | return dist 53 | 54 | 55 | def index_points(points, idx): 56 | """ 57 | 58 | Input: 59 | points: input points data, [B, N, C] 60 | idx: sample index data, [B, S] 61 | Return: 62 | new_points:, indexed points data, [B, S, C] 63 | """ 64 | device = points.device 65 | B = points.shape[0] 66 | view_shape = list(idx.shape) 67 | view_shape[1:] = [1] * (len(view_shape) - 1) 68 | repeat_shape = list(idx.shape) 69 | repeat_shape[0] = 1 70 | batch_indices = torch.arange(B, dtype=torch.long).to(device).view(view_shape).repeat(repeat_shape) 71 | new_points = points[batch_indices, idx, :] 72 | return new_points 73 | 74 | 75 | def farthest_point_sample(xyz, npoint): 76 | """ 77 | Input: 78 | xyz: pointcloud data, [B, N, 3] 79 | npoint: number of samples 80 | Return: 81 | centroids: sampled pointcloud index, [B, npoint] 82 | """ 83 | device = xyz.device 84 | B, N, C = xyz.shape 85 | centroids = torch.zeros(B, npoint, dtype=torch.long).to(device) 86 | distance = torch.ones(B, N).to(device) * 1e10 87 | farthest = torch.randint(0, N, (B,), dtype=torch.long).to(device) 88 | batch_indices = torch.arange(B, dtype=torch.long).to(device) 89 | for i in range(npoint): 90 | centroids[:, i] = farthest 91 | centroid = xyz[batch_indices, farthest, :].view(B, 1, 3) 92 | dist = torch.sum((xyz - centroid) ** 2, -1) 93 | mask = dist < distance 94 | distance[mask] = dist[mask] 95 | farthest = torch.max(distance, -1)[1] 96 | return centroids 97 | 98 | 99 | def query_ball_point(radius, nsample, xyz, new_xyz): 100 | """ 101 | Input: 102 | radius: local region radius 103 | nsample: max sample number in local region 104 | xyz: all points, [B, N, 3] 105 | new_xyz: query points, [B, S, 3] 106 | Return: 107 | group_idx: grouped points index, [B, S, nsample] 108 | """ 109 | device = xyz.device 110 | B, N, C = xyz.shape 111 | _, S, _ = new_xyz.shape 112 | group_idx = torch.arange(N, dtype=torch.long).to(device).view(1, 1, N).repeat([B, S, 1]) 113 | sqrdists = square_distance(new_xyz, xyz) 114 | group_idx[sqrdists > radius ** 2] = N 115 | group_idx = group_idx.sort(dim=-1)[0][:, :, :nsample] 116 | group_first = group_idx[:, :, 0].view(B, S, 1).repeat([1, 1, nsample]) 117 | mask = group_idx == N 118 | group_idx[mask] = group_first[mask] 119 | return group_idx 120 | 121 | 122 | def sample_and_group(npoint, radius, nsample, xyz, points, returnfps=False): 123 | """ 124 | Input: 125 | npoint: 126 | radius: 127 | nsample: 128 | xyz: input points position data, [B, N, 3] 129 | points: input points data, [B, N, D] 130 | Return: 131 | new_xyz: sampled points position data, [B, npoint, nsample, 3] 132 | new_points: sampled points data, [B, npoint, nsample, 3+D] 133 | """ 134 | B, N, C = xyz.shape 135 | S = npoint 136 | fps_idx = farthest_point_sample(xyz, npoint) # [B, npoint, C] 137 | new_xyz = index_points(xyz, fps_idx) 138 | idx = query_ball_point(radius, nsample, xyz, new_xyz) 139 | grouped_xyz = index_points(xyz, idx) # 
[B, npoint, nsample, C] 140 | grouped_xyz_norm = grouped_xyz - new_xyz.view(B, S, 1, C) 141 | 142 | if points is not None: 143 | grouped_points = index_points(points, idx) 144 | new_points = torch.cat([grouped_xyz_norm, grouped_points], dim=-1) # [B, npoint, nsample, C+D] 145 | else: 146 | new_points = grouped_xyz_norm 147 | if returnfps: 148 | return new_xyz, new_points, grouped_xyz, fps_idx 149 | else: 150 | return new_xyz, new_points 151 | 152 | 153 | def sample_and_group_all(xyz, points): 154 | """ 155 | Input: 156 | xyz: input points position data, [B, N, 3] 157 | points: input points data, [B, N, D] 158 | Return: 159 | new_xyz: sampled points position data, [B, 1, 3] 160 | new_points: sampled points data, [B, 1, N, 3+D] 161 | """ 162 | device = xyz.device 163 | B, N, C = xyz.shape 164 | new_xyz = torch.zeros(B, 1, C).to(device) 165 | grouped_xyz = xyz.view(B, 1, N, C) 166 | if points is not None: 167 | new_points = torch.cat([grouped_xyz, points.view(B, 1, N, -1)], dim=-1) 168 | else: 169 | new_points = grouped_xyz 170 | return new_xyz, new_points 171 | 172 | 173 | class PointNetSetAbstraction(nn.Module): 174 | def __init__(self, npoint, radius, nsample, in_channel, mlp, group_all): 175 | super(PointNetSetAbstraction, self).__init__() 176 | self.npoint = npoint 177 | self.radius = radius 178 | self.nsample = nsample 179 | self.mlp_convs = nn.ModuleList() 180 | self.mlp_bns = nn.ModuleList() 181 | last_channel = in_channel 182 | for out_channel in mlp: 183 | self.mlp_convs.append(nn.Conv2d(last_channel, out_channel, 1)) 184 | self.mlp_bns.append(nn.BatchNorm2d(out_channel)) 185 | last_channel = out_channel 186 | self.group_all = group_all 187 | 188 | def forward(self, xyz, points): 189 | """ 190 | Input: 191 | xyz: input points position data, [B, C, N] 192 | points: input points data, [B, D, N] 193 | Return: 194 | new_xyz: sampled points position data, [B, C, S] 195 | new_points_concat: sample points feature data, [B, D', S] 196 | """ 197 | xyz = xyz.permute(0, 2, 1) 198 | if points is not None: 199 | points = points.permute(0, 2, 1) 200 | 201 | if self.group_all: 202 | new_xyz, new_points = sample_and_group_all(xyz, points) 203 | else: 204 | new_xyz, new_points = sample_and_group(self.npoint, self.radius, self.nsample, xyz, points) 205 | # new_xyz: sampled points position data, [B, npoint, C] 206 | # new_points: sampled points data, [B, npoint, nsample, C+D] 207 | new_points = new_points.permute(0, 3, 2, 1) # [B, C+D, nsample,npoint] 208 | for i, conv in enumerate(self.mlp_convs): 209 | bn = self.mlp_bns[i] 210 | new_points = F.relu(bn(conv(new_points))) 211 | 212 | new_points = torch.max(new_points, 2)[0] 213 | new_xyz = new_xyz.permute(0, 2, 1) 214 | return new_xyz, new_points 215 | 216 | 217 | class PointNetSetAbstractionMsg(nn.Module): 218 | def __init__(self, npoint, radius_list, nsample_list, in_channel, mlp_list): 219 | super(PointNetSetAbstractionMsg, self).__init__() 220 | self.npoint = npoint 221 | self.radius_list = radius_list 222 | self.nsample_list = nsample_list 223 | self.conv_blocks = nn.ModuleList() 224 | self.bn_blocks = nn.ModuleList() 225 | for i in range(len(mlp_list)): 226 | convs = nn.ModuleList() 227 | bns = nn.ModuleList() 228 | last_channel = in_channel + 3 229 | for out_channel in mlp_list[i]: 230 | convs.append(nn.Conv2d(last_channel, out_channel, 1)) 231 | bns.append(nn.BatchNorm2d(out_channel)) 232 | last_channel = out_channel 233 | self.conv_blocks.append(convs) 234 | self.bn_blocks.append(bns) 235 | 236 | def forward(self, xyz, points): 237 | """ 238 
| Input: 239 | xyz: input points position data, [B, C, N] 240 | points: input points data, [B, D, N] 241 | Return: 242 | new_xyz: sampled points position data, [B, C, S] 243 | new_points_concat: sample points feature data, [B, D', S] 244 | """ 245 | xyz = xyz.permute(0, 2, 1) 246 | if points is not None: 247 | points = points.permute(0, 2, 1) 248 | 249 | B, N, C = xyz.shape 250 | S = self.npoint 251 | new_xyz = index_points(xyz, farthest_point_sample(xyz, S)) 252 | new_points_list = [] 253 | for i, radius in enumerate(self.radius_list): 254 | K = self.nsample_list[i] 255 | group_idx = query_ball_point(radius, K, xyz, new_xyz) 256 | grouped_xyz = index_points(xyz, group_idx) 257 | grouped_xyz -= new_xyz.view(B, S, 1, C) 258 | if points is not None: 259 | grouped_points = index_points(points, group_idx) 260 | grouped_points = torch.cat([grouped_points, grouped_xyz], dim=-1) 261 | else: 262 | grouped_points = grouped_xyz 263 | 264 | grouped_points = grouped_points.permute(0, 3, 2, 1) # [B, D, K, S] 265 | for j in range(len(self.conv_blocks[i])): 266 | conv = self.conv_blocks[i][j] 267 | bn = self.bn_blocks[i][j] 268 | grouped_points = F.relu(bn(conv(grouped_points))) 269 | new_points = torch.max(grouped_points, 2)[0] # [B, D', S] 270 | new_points_list.append(new_points) 271 | 272 | new_xyz = new_xyz.permute(0, 2, 1) 273 | new_points_concat = torch.cat(new_points_list, dim=1) 274 | return new_xyz, new_points_concat 275 | 276 | 277 | class PointNetFeaturePropagation(nn.Module): 278 | def __init__(self, in_channel, mlp): 279 | super(PointNetFeaturePropagation, self).__init__() 280 | self.mlp_convs = nn.ModuleList() 281 | self.mlp_bns = nn.ModuleList() 282 | last_channel = in_channel 283 | for out_channel in mlp: 284 | self.mlp_convs.append(nn.Conv1d(last_channel, out_channel, 1)) 285 | self.mlp_bns.append(nn.BatchNorm1d(out_channel)) 286 | last_channel = out_channel 287 | 288 | def forward(self, xyz1, xyz2, points1, points2): 289 | """ 290 | Input: 291 | xyz1: input points position data, [B, C, N] 292 | xyz2: sampled input points position data, [B, C, S] 293 | points1: input points data, [B, D, N] 294 | points2: input points data, [B, D, S] 295 | Return: 296 | new_points: upsampled points data, [B, D', N] 297 | """ 298 | xyz1 = xyz1.permute(0, 2, 1) 299 | xyz2 = xyz2.permute(0, 2, 1) 300 | 301 | points2 = points2.permute(0, 2, 1) 302 | B, N, C = xyz1.shape 303 | _, S, _ = xyz2.shape 304 | 305 | if S == 1: 306 | interpolated_points = points2.repeat(1, N, 1) 307 | else: 308 | dists = square_distance(xyz1, xyz2) 309 | dists, idx = dists.sort(dim=-1) 310 | dists, idx = dists[:, :, :3], idx[:, :, :3] # [B, N, 3] 311 | 312 | dist_recip = 1.0 / (dists + 1e-8) 313 | norm = torch.sum(dist_recip, dim=2, keepdim=True) 314 | weight = dist_recip / norm 315 | interpolated_points = torch.sum(index_points(points2, idx) * weight.view(B, N, 3, 1), dim=2) 316 | 317 | if points1 is not None: 318 | points1 = points1.permute(0, 2, 1) 319 | new_points = torch.cat([points1, interpolated_points], dim=-1) 320 | else: 321 | new_points = interpolated_points 322 | 323 | new_points = new_points.permute(0, 2, 1) 324 | for i, conv in enumerate(self.mlp_convs): 325 | bn = self.mlp_bns[i] 326 | new_points = F.relu(bn(conv(new_points))) 327 | return new_points 328 | -------------------------------------------------------------------------------- /drl_implementation/agent/utils/pointnet_utils.py: -------------------------------------------------------------------------------- 1 | """ 2 | 
https://github.com/yanx27/Pointnet_Pointnet2_pytorch/blob/master/models/pointnet2_utils.py 3 | @article{Pytorch_Pointnet_Pointnet2, 4 | Author = {Xu Yan}, 5 | Title = {Pointnet/Pointnet++ Pytorch}, 6 | Journal = {https://github.com/yanx27/Pointnet_Pointnet2_pytorch}, 7 | Year = {2019} 8 | } 9 | """ 10 | import torch 11 | import torch.nn as nn 12 | import torch.nn.parallel 13 | import torch.utils.data 14 | from torch.autograd import Variable 15 | import numpy as np 16 | import torch.nn.functional as F 17 | 18 | 19 | class STN3d(nn.Module): 20 | def __init__(self, channel): 21 | super(STN3d, self).__init__() 22 | self.conv1 = torch.nn.Conv1d(channel, 64, 1) 23 | self.conv2 = torch.nn.Conv1d(64, 128, 1) 24 | self.conv3 = torch.nn.Conv1d(128, 1024, 1) 25 | self.fc1 = nn.Linear(1024, 512) 26 | self.fc2 = nn.Linear(512, 256) 27 | self.fc3 = nn.Linear(256, 9) 28 | self.relu = nn.ReLU() 29 | 30 | self.bn1 = nn.BatchNorm1d(64) 31 | self.bn2 = nn.BatchNorm1d(128) 32 | self.bn3 = nn.BatchNorm1d(1024) 33 | self.bn4 = nn.BatchNorm1d(512) 34 | self.bn5 = nn.BatchNorm1d(256) 35 | 36 | def forward(self, x): 37 | batchsize = x.size()[0] 38 | x = F.relu(self.bn1(self.conv1(x))) 39 | x = F.relu(self.bn2(self.conv2(x))) 40 | x = F.relu(self.bn3(self.conv3(x))) 41 | x = torch.max(x, 2, keepdim=True)[0] 42 | x = x.view(-1, 1024) 43 | 44 | x = F.relu(self.bn4(self.fc1(x))) 45 | x = F.relu(self.bn5(self.fc2(x))) 46 | x = self.fc3(x) 47 | 48 | iden = Variable(torch.from_numpy(np.array([1, 0, 0, 0, 1, 0, 0, 0, 1]).astype(np.float32))).view(1, 9).repeat( 49 | batchsize, 1) 50 | if x.is_cuda: 51 | iden = iden.cuda() 52 | x = x + iden 53 | x = x.view(-1, 3, 3) 54 | return x 55 | 56 | 57 | class STNkd(nn.Module): 58 | def __init__(self, k=64): 59 | super(STNkd, self).__init__() 60 | self.conv1 = torch.nn.Conv1d(k, 64, 1) 61 | self.conv2 = torch.nn.Conv1d(64, 128, 1) 62 | self.conv3 = torch.nn.Conv1d(128, 1024, 1) 63 | self.fc1 = nn.Linear(1024, 512) 64 | self.fc2 = nn.Linear(512, 256) 65 | self.fc3 = nn.Linear(256, k * k) 66 | self.relu = nn.ReLU() 67 | 68 | self.bn1 = nn.BatchNorm1d(64) 69 | self.bn2 = nn.BatchNorm1d(128) 70 | self.bn3 = nn.BatchNorm1d(1024) 71 | self.bn4 = nn.BatchNorm1d(512) 72 | self.bn5 = nn.BatchNorm1d(256) 73 | 74 | self.k = k 75 | 76 | def forward(self, x): 77 | batchsize = x.size()[0] 78 | x = F.relu(self.bn1(self.conv1(x))) 79 | x = F.relu(self.bn2(self.conv2(x))) 80 | x = F.relu(self.bn3(self.conv3(x))) 81 | x = torch.max(x, 2, keepdim=True)[0] 82 | x = x.view(-1, 1024) 83 | 84 | x = F.relu(self.bn4(self.fc1(x))) 85 | x = F.relu(self.bn5(self.fc2(x))) 86 | x = self.fc3(x) 87 | 88 | iden = Variable(torch.from_numpy(np.eye(self.k).flatten().astype(np.float32))).view(1, self.k * self.k).repeat( 89 | batchsize, 1) 90 | if x.is_cuda: 91 | iden = iden.cuda() 92 | x = x + iden 93 | x = x.view(-1, self.k, self.k) 94 | return x 95 | 96 | 97 | class PointNetEncoder(nn.Module): 98 | def __init__(self, global_feat=True, feature_transform=False, channel=3): 99 | super(PointNetEncoder, self).__init__() 100 | self.stn = STN3d(channel) 101 | self.conv1 = torch.nn.Conv1d(channel, 64, 1) 102 | self.conv2 = torch.nn.Conv1d(64, 128, 1) 103 | self.conv3 = torch.nn.Conv1d(128, 1024, 1) 104 | self.bn1 = nn.BatchNorm1d(64) 105 | self.bn2 = nn.BatchNorm1d(128) 106 | self.bn3 = nn.BatchNorm1d(1024) 107 | self.global_feat = global_feat 108 | self.feature_transform = feature_transform 109 | if self.feature_transform: 110 | self.fstn = STNkd(k=64) 111 | 112 | def forward(self, x): 113 | B, D, N = x.size() 114 | trans = 
self.stn(x) 115 | x = x.transpose(2, 1) 116 | if D > 3: 117 | feature = x[:, :, 3:] 118 | x = x[:, :, :3] 119 | x = torch.bmm(x, trans) 120 | if D > 3: 121 | x = torch.cat([x, feature], dim=2) 122 | x = x.transpose(2, 1) 123 | x = F.relu(self.bn1(self.conv1(x))) 124 | 125 | if self.feature_transform: 126 | trans_feat = self.fstn(x) 127 | x = x.transpose(2, 1) 128 | x = torch.bmm(x, trans_feat) 129 | x = x.transpose(2, 1) 130 | else: 131 | trans_feat = None 132 | 133 | pointfeat = x 134 | x = F.relu(self.bn2(self.conv2(x))) 135 | x = self.bn3(self.conv3(x)) 136 | x = torch.max(x, 2, keepdim=True)[0] 137 | x = x.view(-1, 1024) 138 | if self.global_feat: 139 | return x, trans, trans_feat 140 | else: 141 | x = x.view(-1, 1024, 1).repeat(1, 1, N) 142 | return torch.cat([x, pointfeat], 1), trans, trans_feat 143 | 144 | 145 | def feature_transform_reguliarzer(trans): 146 | d = trans.size()[1] 147 | I = torch.eye(d)[None, :, :] 148 | if trans.is_cuda: 149 | I = I.cuda() 150 | loss = torch.mean(torch.norm(torch.bmm(trans, trans.transpose(2, 1)) - I, dim=(1, 2))) 151 | return loss -------------------------------------------------------------------------------- /drl_implementation/agent/utils/segment_tree.py: -------------------------------------------------------------------------------- 1 | """ 2 | The segment tree implementation from OpenAI baseline GitHub repo: 3 | https://github.com/openai/baselines/blob/ea25b9e8b234e6ee1bca43083f8f3cf974143998/baselines/common/segment_tree.py 4 | This is used in the prioritized replay buffer. 5 | """ 6 | import operator 7 | 8 | 9 | class SegmentTree(object): 10 | def __init__(self, capacity, operation, neutral_element): 11 | """Build a Segment Tree data structure. 12 | https://en.wikipedia.org/wiki/Segment_tree 13 | Can be used as regular array, but with two 14 | important differences: 15 | a) setting item's value is slightly slower. 16 | It is O(lg capacity) instead of O(1). 17 | b) user has access to an efficient ( O(log segment size) ) 18 | `reduce` operation which reduces `operation` over 19 | a contiguous subsequence of items in the array. 20 | Paramters 21 | --------- 22 | capacity: int 23 | Total size of the array - must be a power of two. 24 | operation: lambda obj, obj -> obj 25 | and operation for combining elements (eg. sum, max) 26 | must form a mathematical group together with the set of 27 | possible values for array elements (i.e. be associative) 28 | neutral_element: obj 29 | neutral element for the operation above. eg. float('-inf') 30 | for max and 0 for sum. 31 | """ 32 | assert capacity > 0 and capacity & (capacity - 1) == 0, "capacity must be positive and a power of 2." 33 | self._capacity = capacity 34 | self._value = [neutral_element for _ in range(2 * capacity)] 35 | self._operation = operation 36 | 37 | def _reduce_helper(self, start, end, node, node_start, node_end): 38 | if start == node_start and end == node_end: 39 | return self._value[node] 40 | mid = (node_start + node_end) // 2 41 | if end <= mid: 42 | return self._reduce_helper(start, end, 2 * node, node_start, mid) 43 | else: 44 | if mid + 1 <= start: 45 | return self._reduce_helper(start, end, 2 * node + 1, mid + 1, node_end) 46 | else: 47 | return self._operation( 48 | self._reduce_helper(start, mid, 2 * node, node_start, mid), 49 | self._reduce_helper(mid + 1, end, 2 * node + 1, mid + 1, node_end) 50 | ) 51 | 52 | def reduce(self, start=0, end=None): 53 | """Returns result of applying `self.operation` 54 | to a contiguous subsequence of the array. 
55 | self.operation(arr[start], operation(arr[start+1], operation(... arr[end]))) 56 | Parameters 57 | ---------- 58 | start: int 59 | beginning of the subsequence 60 | end: int 61 | end of the subsequences 62 | Returns 63 | ------- 64 | reduced: obj 65 | result of reducing self.operation over the specified range of array elements. 66 | """ 67 | if end is None: 68 | end = self._capacity 69 | if end < 0: 70 | end += self._capacity 71 | end -= 1 72 | return self._reduce_helper(start, end, 1, 0, self._capacity - 1) 73 | 74 | def __setitem__(self, idx, val): 75 | # index of the leaf 76 | idx += self._capacity 77 | self._value[idx] = val 78 | idx //= 2 79 | while idx >= 1: 80 | self._value[idx] = self._operation( 81 | self._value[2 * idx], 82 | self._value[2 * idx + 1] 83 | ) 84 | idx //= 2 85 | 86 | def __getitem__(self, idx): 87 | assert 0 <= idx < self._capacity 88 | return self._value[self._capacity + idx] 89 | 90 | 91 | class SumSegmentTree(SegmentTree): 92 | def __init__(self, capacity): 93 | super(SumSegmentTree, self).__init__( 94 | capacity=capacity, 95 | operation=operator.add, 96 | neutral_element=0.0 97 | ) 98 | 99 | def sum(self, start=0, end=None): 100 | """Returns arr[start] + ... + arr[end]""" 101 | return super(SumSegmentTree, self).reduce(start, end) 102 | 103 | def find_prefixsum_idx(self, prefixsum): 104 | """Find the highest index `i` in the array such that 105 | sum(arr[0] + arr[1] + ... + arr[i - i]) <= prefixsum 106 | if array values are probabilities, this function 107 | allows to sample indexes according to the discrete 108 | probability efficiently. 109 | Parameters 110 | ---------- 111 | perfixsum: float 112 | upperbound on the sum of array prefix 113 | Returns 114 | ------- 115 | idx: int 116 | highest index satisfying the prefixsum constraint 117 | """ 118 | assert 0 <= prefixsum <= self.sum() + 1e-5 119 | idx = 1 120 | while idx < self._capacity: # while non-leaf 121 | if self._value[2 * idx] > prefixsum: 122 | idx = 2 * idx 123 | else: 124 | prefixsum -= self._value[2 * idx] 125 | idx = 2 * idx + 1 126 | return idx - self._capacity 127 | 128 | 129 | class MinSegmentTree(SegmentTree): 130 | def __init__(self, capacity): 131 | super(MinSegmentTree, self).__init__( 132 | capacity=capacity, 133 | operation=min, 134 | neutral_element=float('inf') 135 | ) 136 | 137 | def min(self, start=0, end=None): 138 | """Returns min(arr[start], ..., arr[end])""" 139 | 140 | return super(MinSegmentTree, self).reduce(start, end) 141 | -------------------------------------------------------------------------------- /drl_implementation/examples/KukaPushPHER.py: -------------------------------------------------------------------------------- 1 | # this example runs a goal-condition soft actor critic with prioritised+hindsight experience replay 2 | # on the Push task from the pybullet-multigoal-gym package 3 | 4 | import os 5 | import gym 6 | import pybullet_multigoal_gym as pmg 7 | from drl_implementation import GoalConditionedSAC, GoalConditionedDDPG 8 | # you can replace the agent instantiation by one of the two above, with the proper params 9 | 10 | ddpg_params = { 11 | 'hindsight': True, 12 | 'her_sampling_strategy': 'future', 13 | 'prioritised': True, 14 | 'memory_capacity': int(1e6), 15 | 'actor_learning_rate': 0.001, 16 | 'critic_learning_rate': 0.001, 17 | 'Q_weight_decay': 0.0, 18 | 'update_interval': 1, 19 | 'batch_size': 128, 20 | 'optimization_steps': 40, 21 | 'tau': 0.05, 22 | 'discount_factor': 0.98, 23 | 'clip_value': 50, 24 | 'discard_time_limit': True, 25 | 
'terminate_on_achieve': False, 26 | 'observation_normalization': True, 27 | 28 | 'random_action_chance': 0.2, 29 | 'noise_deviation': 0.05, 30 | 31 | 'training_epochs': 101, 32 | 'training_cycles': 50, 33 | 'training_episodes': 16, 34 | 'testing_gap': 1, 35 | 'testing_episodes': 30, 36 | 'saving_gap': 25, 37 | } 38 | # sac_params = { 39 | # 'hindsight': True, 40 | # 'her_sampling_strategy': 'future', 41 | # 'prioritised': True, 42 | # 'memory_capacity': int(1e6), 43 | # 'actor_learning_rate': 0.001, 44 | # 'critic_learning_rate': 0.001, 45 | # 'update_interval': 1, 46 | # 'batch_size': 128, 47 | # 'optimization_steps': 40, 48 | # 'tau': 0.005, 49 | # 'clip_value': 50, 50 | # 'discount_factor': 0.98, 51 | # 'discard_time_limit': True, 52 | # 'terminate_on_achieve': False, 53 | # 'observation_normalization': True, 54 | # 55 | # 'alpha': 0.5, 56 | # 'actor_update_interval': 1, 57 | # 'critic_target_update_interval': 1, 58 | # 59 | # 'training_epochs': 101, 60 | # 'training_cycles': 50, 61 | # 'training_episodes': 16, 62 | # 'testing_gap': 1, 63 | # 'testing_episodes': 30, 64 | # 'saving_gap': 25, 65 | # } 66 | seeds = [11, 22, 33, 44] 67 | seed_returns = [] 68 | seed_success_rates = [] 69 | path = os.path.dirname(os.path.realpath(__file__)) 70 | path = os.path.join(path, 'PushPHER') 71 | 72 | for seed in seeds: 73 | 74 | env = pmg.make_env(task='push', 75 | gripper='parallel_jaw', 76 | render=False, 77 | binary_reward=True, 78 | max_episode_steps=50, 79 | image_observation=False, 80 | depth_image=False, 81 | goal_image=False) 82 | # use the render env for visualization 83 | 84 | seed_path = path + '/seed'+str(seed) 85 | 86 | agent = GoalConditionedDDPG(algo_params=ddpg_params, env=env, path=seed_path, seed=seed) 87 | agent.run(test=False) 88 | # the sleep argument pause the rendering for a while every step, useful for slowing down visualization 89 | # agent.run(test=True, load_network_ep=50, sleep=0.05) 90 | seed_returns.append(agent.statistic_dict['epoch_test_return']) 91 | seed_success_rates.append(agent.statistic_dict['epoch_test_success_rate']) 92 | -------------------------------------------------------------------------------- /drl_implementation/examples/PendulumDDPG.py: -------------------------------------------------------------------------------- 1 | # this example runs a ddpg agent on a inverted pendulum swingup task from pybullet gym 2 | 3 | import os 4 | import pybullet_envs 5 | from drl_implementation import DDPG, SAC, TD3 6 | # you can replace the agent instantiation by one of the three above, with the proper params 7 | 8 | # td3_params = { 9 | # 'prioritised': True, 10 | # 'memory_capacity': int(1e6), 11 | # 'actor_learning_rate': 0.0003, 12 | # 'critic_learning_rate': 0.0003, 13 | # 'batch_size': 256, 14 | # 'optimization_steps': 1, 15 | # 'tau': 0.005, 16 | # 'discount_factor': 0.99, 17 | # 'discard_time_limit': True, 18 | # 'warmup_step': 2500, 19 | # 'target_noise': 0.2, 20 | # 'noise_clip': 0.5, 21 | # 'update_interval': 1, 22 | # 'actor_update_interval': 2, 23 | # 'observation_normalization': False, 24 | # 25 | # 'training_episodes': 101, 26 | # 'testing_gap': 10, 27 | # 'testing_episodes': 10, 28 | # 'saving_gap': 50, 29 | # } 30 | # sac_params = { 31 | # 'prioritised': True, 32 | # 'memory_capacity': int(1e6), 33 | # 'actor_learning_rate': 0.0003, 34 | # 'critic_learning_rate': 0.0003, 35 | # 'update_interval': 1, 36 | # 'batch_size': 256, 37 | # 'optimization_steps': 1, 38 | # 'tau': 0.005, 39 | # 'discount_factor': 0.99, 40 | # 'discard_time_limit': True, 41 | # 
'observation_normalization': False, 42 | # 43 | # 'alpha': 0.5, 44 | # 'actor_update_interval': 1, 45 | # 'critic_target_update_interval': 1, 46 | # 'warmup_step': 1000, 47 | # 48 | # 'training_episodes': 101, 49 | # 'testing_gap': 10, 50 | # 'testing_episodes': 10, 51 | # 'saving_gap': 50, 52 | # } 53 | ddpg_params = { 54 | 'prioritised': True, 55 | 'memory_capacity': int(1e6), 56 | 'actor_learning_rate': 0.001, 57 | 'critic_learning_rate': 0.001, 58 | 'Q_weight_decay': 0.0, 59 | 'update_interval': 1, 60 | 'batch_size': 100, 61 | 'optimization_steps': 1, 62 | 'tau': 0.005, 63 | 'discount_factor': 0.99, 64 | 'discard_time_limit': True, 65 | 'warmup_step': 2500, 66 | 'observation_normalization': False, 67 | 68 | 'training_episodes': 101, 69 | 'testing_gap': 10, 70 | 'testing_episodes': 10, 71 | 'saving_gap': 50, 72 | } 73 | seeds = [11, 22, 33, 44, 55, 66] 74 | seed_returns = [] 75 | path = os.path.dirname(os.path.realpath(__file__)) 76 | for seed in seeds: 77 | 78 | env = pybullet_envs.make("InvertedPendulumSwingupBulletEnv-v0") 79 | # call render() before training to visualize (pybullet-gym-specific) 80 | # env.render() 81 | seed_path = path + '/seed'+str(seed) 82 | 83 | agent = DDPG(algo_params=ddpg_params, env=env, path=seed_path, seed=seed) 84 | agent.run(test=False) 85 | # the sleep argument pause the rendering for a while at every env step, useful for slowing down visualization 86 | # agent.run(test=True, load_network_ep=50, sleep=0.05) 87 | seed_returns.append(agent.statistic_dict['episode_return']) 88 | del env, agent 89 | -------------------------------------------------------------------------------- /drl_implementation/examples/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/IanYangChina/DRL_Implementation/38812c9647e4bec8359908be444dff19b90257d5/drl_implementation/examples/__init__.py -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | matplotlib >= 3.3.2 2 | numpy >= 1.18 3 | torch >= 1.3.0 4 | json 5 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup, find_packages 2 | 3 | 4 | packages = find_packages() 5 | # Ensure that we don't pollute the global namespace. 6 | for p in packages: 7 | assert p == 'drl_implementation' or p.startswith('drl_implementation.') 8 | 9 | setup(name='drl-implementation', 10 | version='1.0.0', 11 | description='A collection of deep reinforcement learning algorithms for fast implementation', 12 | url='https://github.com/IanYangChina/DRL_Implementation', 13 | author='XintongYang', 14 | author_email='YangX66@cardiff.ac.uk', 15 | packages=packages, 16 | package_dir={'drl_implementation': 'drl_implementation'}, 17 | package_data={'drl_implementation': [ 18 | 'examples/*.md', 19 | ]}, 20 | classifiers=[ 21 | "Programming Language :: Python :: 3", 22 | "Operating System :: OS Independent", 23 | ]) 24 | -------------------------------------------------------------------------------- /src/README.md: -------------------------------------------------------------------------------- 1 | #### Some tips 2 | 3 | To run the FetchReach-v1 environment, you will need to install gym, mujoco and mujoco-py on your environment. 
4 | Here are the links: 5 | 6 | [OpenAI Gym](https://github.com/openai/gym) 7 | [Get a free Mujoco trial license, here](https://www.roboti.us/license.html) 8 | [Install Mujoco and Mujoco.py by OpenAI, here](https://github.com/openai/mujoco-py) 9 | 10 | Unfortunately, mujoco is difficult to install on Windows systems, but you might find some help on [this 11 | page](https://github.com/openai/mujoco-py/issues/253). I still suggest running these codes on Linux systems. 12 | 13 | The [pybullet-multigoal-gym](https://github.com/IanYangChina/pybullet_multigoal_gym) environment is a migration of the OpenAI Gym multigoal environments, developed by the author 14 | of this repo. It is free to use as it is based on Pybullet. You will need it to run the pybullet experiments. 15 | 16 | #### Some notes I made when I was implementing HER: 17 | * The original paper uses multiple CPUs to collect data; this implementation uses a single CPU. Multi-CPU support might be 18 | added in the future. 19 | * The actor and critic networks have 3 hidden layers, each with 256 units and ReLU activation; the critic output has no activation, 20 | while the actor output uses tanh and is rescaled. 21 | * The observation and the goal are concatenated and fed into both networks. 22 | * The original paper scales observations, goals and actions into [-5, 5] (no rescaling is needed with the Gym environment), 23 | and normalizes them to zero mean and unit standard deviation. The means and standard deviations are computed from encountered data. 24 | * The training process has 200 epochs of 50 cycles, each cycle consisting of 16 episodes followed by 40 optimization steps. The total 25 | number of episodes is 200\*50\*16=160000, each with 50 time steps. In other words, after every 16 episodes, 40 optimization steps are 26 | performed. 27 | * Each optimization step uses a mini-batch of size 128, sampled uniformly from a replay buffer with 10^6 capacity; the 28 | target networks are updated softly with tau=0.05. 29 | * Adam is used for learning with a learning rate of 0.001, the discount factor is 0.98, and the target value is clipped to 30 | [-1/(1-0.98), 0], that is, [-50, 0]. I think this is based on the 50 time steps they set for each episode, over which an 31 | agent can at worst accumulate a return of -50. 32 | * For exploration, they select a uniformly random action with 20% chance; with the remaining 80% chance, they 33 | add Gaussian noise to each action dimension, with a standard deviation equal to 5% of the max action bound (a minimal sketch of this scheme is given after this list). 34 | 35 | * The SAC agent doesn't need a behavioural policy. 36 | * The goal-conditioned **SAC** agent doesn't need value clipping. 37 | * Prioritised replay is supported.
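To make the exploration and value-clipping notes above concrete, here is a minimal numpy sketch. It only illustrates the recipe described in these notes, not the exact code used by the agents in this repo; the names `explore`, `clip_target`, `random_chance` and `noise_std_ratio` are invented for this example.

```
import numpy as np


def explore(policy_action, action_max, random_chance=0.2, noise_std_ratio=0.05, rng=np.random):
    # with 20% chance, take a uniformly random action within the bounds
    if rng.uniform() < random_chance:
        return rng.uniform(-action_max, action_max, size=policy_action.shape)
    # otherwise, perturb the actor's action with Gaussian noise whose standard
    # deviation is 5% of the max action bound, then clip back into the valid range
    noise = rng.normal(0.0, noise_std_ratio * action_max, size=policy_action.shape)
    return np.clip(policy_action + noise, -action_max, action_max)


def clip_target(q_target, gamma=0.98):
    # with per-step rewards in [-1, 0], the discounted return is bounded below by
    # -1/(1 - gamma) = -50, so the bootstrapped target is clipped into [-50, 0]
    return np.clip(q_target, -1.0 / (1.0 - gamma), 0.0)
```

The 0.2 and 0.05 defaults mirror the `random_action_chance` and `noise_deviation` entries in the Kuka push example config shown earlier.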
38 | 39 | #### Results on the task 'Push' 40 | 41 | -------------------------------------------------------------------------------- /src/figs.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/IanYangChina/DRL_Implementation/38812c9647e4bec8359908be444dff19b90257d5/src/figs.png -------------------------------------------------------------------------------- /src/pendulum.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/IanYangChina/DRL_Implementation/38812c9647e4bec8359908be444dff19b90257d5/src/pendulum.gif -------------------------------------------------------------------------------- /src/push.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/IanYangChina/DRL_Implementation/38812c9647e4bec8359908be444dff19b90257d5/src/push.gif -------------------------------------------------------------------------------- /tests/test.py: -------------------------------------------------------------------------------- 1 | from collections import namedtuple 2 | 3 | t = namedtuple('t', ['a', 'b']) 4 | 5 | print('b' in t._fields) --------------------------------------------------------------------------------
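A side note on `tests/test.py` above: it only checks namedtuple field membership via the `_fields` attribute. The snippet below is a purely hypothetical illustration of the same idiom applied to a transition tuple of the kind a replay buffer might store; the `Transition` fields are invented for this example and are not taken from the repo.

```
from collections import namedtuple

# hypothetical transition layout, for illustration only
Transition = namedtuple('Transition', ['state', 'action', 'reward', 'next_state', 'done'])


def has_field(tuple_cls, name):
    # namedtuple classes expose their field names through the _fields attribute,
    # which is what tests/test.py relies on
    return name in tuple_cls._fields


assert has_field(Transition, 'reward')
assert not has_field(Transition, 'weights')
```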