├── .gitignore ├── LICENSE ├── README.md ├── drl_implementation ├── __init__.py ├── agent │ ├── __init__.py │ ├── agent_base.py │ ├── continuous_action │ │ ├── __init__.py │ │ ├── ddpg.py │ │ ├── ddpg_goal_conditioned.py │ │ ├── distributional_ddpg.py │ │ ├── sac.py │ │ ├── sac_goal_conditioned.py │ │ ├── sac_parameterised_action_goal_conditioned.py │ │ ├── sac_pointnet.py │ │ └── td3.py │ ├── distributed_agent_base.py │ └── utils │ │ ├── __init__.py │ │ ├── env_wrapper.py │ │ ├── exploration_strategy.py │ │ ├── networks_conv.py │ │ ├── networks_mlp.py │ │ ├── networks_pointnet.py │ │ ├── normalizer.py │ │ ├── plot.py │ │ ├── pointnet_2_utils.py │ │ ├── pointnet_utils.py │ │ ├── replay_buffer.py │ │ └── segment_tree.py └── examples │ ├── KukaPushPHER.py │ ├── PendulumDDPG.py │ └── __init__.py ├── requirements.txt ├── setup.py ├── src ├── README.md ├── figs.png ├── pendulum.gif └── push.gif └── tests └── test.py /.gitignore: -------------------------------------------------------------------------------- 1 | # Default ignored files 2 | .idea 3 | __pycache__/ 4 | drl_implementation/agent/__pycache__/ 5 | drl_implementation/agent/utils/__pycache__/ 6 | exp_multi_goal/__pycache__/ 7 | build/ 8 | drl_implementation.egg-info/ -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 XT_Yang 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## DRL_Implementation 2 | ##### Current status: minimal updates 3 | 4 | ### Introduction 5 | - This repository is a pytorch-based implementation of modern DRL algorithms, designed to be reusable for as many 6 | Gym-like training environments as possible 7 | - The package is mainly for my personal usage, however feel free to use it as you like. 
8 | - It is recommended to use the [released version](https://github.com/IanYangChina/DRL_Implementation/tree/v2.0) 9 | - Understand more with the [Wiki!](https://github.com/IanYangChina/DRL_Implementation/wiki) 10 | - Tested environments: Gym, Pybullet-gym, Pybullet-multigoal-gym 11 | - *My priority is on continuous action algorithms as I'm working on robotics* 12 | 13 | #### Installation 14 | ``` 15 | git clone https://github.com/IanYangChina/DRL_Implementation.git 16 | cd DRL_Implementation 17 | python -m pip install -r requirements.txt 18 | python -m pip install . 19 | ``` 20 | [Click here for example codes](https://github.com/IanYangChina/DRL_Implementation/tree/master/drl_implementation/examples). 21 | To run them you will need to install Gym, Pybullet, or pybullet-multigoal-gym; see the env installation links below. 22 | For more use cases, have a look at the [drl_imp_test repo](https://github.com/IanYangChina/drl_imp_test)\ 23 | From the project root, run `python drl_implementation/examples/$SCRIPT_NAME.py` 24 | 25 | ##### State-based 26 | - [X] Distributional DDPG, Continuous 27 | - [X] DDPG - Deterministic, Continuous 28 | - [X] TD3 - Deterministic, Continuous 29 | - [X] SAC (Adaptive Temperature) - Stochastic, Continuous 30 | 31 | ##### Replay buffers 32 | - [X] Hindsight 33 | - [X] Prioritised 34 | 35 | ##### Tested Environments 36 | - [X] [Pybullet Gym (Continuous)](https://github.com/bulletphysics/bullet3) 37 | - [X] [OpenAI Gym Mujoco Robotics Multigoal Environment (Continuous)](https://openai.com/blog/ingredients-for-robotics-research/) 38 | - [X] [Pybullet Multigoal Gym](https://github.com/IanYangChina/pybullet_multigoal_gym) (OpenAI Robotics 39 | Multigoal Pybullet Migration) (Continuous) 40 | 41 | ##### Some result figures 42 | (Result figures and demos: see src/figs.png, src/pendulum.gif and src/push.gif) 43 | 44 | 45 | 46 | #### Reference Papers: Algorithm 47 | * [DQN](https://www.nature.com/articles/nature14236?wm=book_wap_0005) 48 | * [DoubleDQN](https://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/viewPaper/12389) 49 | * [DDPG](https://arxiv.org/abs/1509.02971) 50 | * [TD3](https://arxiv.org/pdf/1802.09477.pdf) 51 | * [SAC (Adaptive Temperature)](https://arxiv.org/pdf/1812.05905.pdf) 52 | * [PER](https://arxiv.org/abs/1511.05952) 53 | * [HER](http://papers.nips.cc/paper/7090-hindsight-experience-replay) 54 | 55 | #### Reference Papers: Implementation Matters 56 | * [Time limit](https://arxiv.org/abs/1712.00378) 57 | * [SOTA PPO Hyperparameters (many applicable to other algorithms)](https://arxiv.org/abs/2006.05990) 58 | * [SAC Temperature Auto-tuning](https://arxiv.org/abs/1812.05905) 59 | -------------------------------------------------------------------------------- /drl_implementation/__init__.py: -------------------------------------------------------------------------------- 1 | from .agent.continuous_action.ddpg import DDPG 2 | from .agent.continuous_action.ddpg_goal_conditioned import GoalConditionedDDPG 3 | from .agent.continuous_action.sac import SAC 4 | from .agent.continuous_action.sac_goal_conditioned import GoalConditionedSAC 5 | from .agent.continuous_action.td3 import TD3 6 | from .agent.continuous_action.distributional_ddpg import DistributionalDDPG 7 | from .agent.continuous_action.sac_parameterised_action_goal_conditioned import GPASAC 8 | 9 | agents = { 10 | 'ddpg': DDPG, 11 | 'ddpg_her': GoalConditionedDDPG, 12 | 'sac': SAC, 13 | 'sac_her': GoalConditionedSAC, 14 | 'td3': TD3, 15 | 'distri_ddpg': DistributionalDDPG, 16 | } 17 | --------------------------------------------------------------------------------
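The `agents` dictionary above maps short names to the agent classes exported by the package. As a quick illustration of how one of these agents is constructed and run (following the pattern of `drl_implementation/examples/PendulumDDPG.py`), a minimal sketch is given below. The hyperparameter values and the save path are illustrative assumptions rather than the repository's reference settings, and the sketch assumes an older Gym release where `Pendulum-v0` and `env.seed()` are available, matching the API the agents in this package are written against (`env.seed()`, 4-tuple `step()` returns).

```
# Minimal, illustrative usage sketch (not one of the repo's example scripts).
import gym
from drl_implementation import DDPG   # equivalently: from drl_implementation import agents; agents['ddpg']

# Hypothetical hyperparameters for Pendulum; the keys match what DDPG and the
# Agent base class read from algo_params, the values are only examples.
algo_params = {
    # replay buffer & input normalisation
    'prioritised': True,
    'memory_capacity': int(1e6),
    'observation_normalization': False,
    # optimisation
    'actor_learning_rate': 0.001,
    'critic_learning_rate': 0.001,
    'Q_weight_decay': 0.0,
    'update_interval': 1,
    'batch_size': 128,
    'optimization_steps': 1,
    'tau': 0.005,
    'discount_factor': 0.99,
    'discard_time_limit': True,
    'warmup_step': 2500,
    # training schedule
    'training_episodes': 101,
    'testing_gap': 10,
    'testing_episodes': 10,
    'saving_gap': 50,
}

env = gym.make('Pendulum-v0')
# 'path' is where checkpoints, statistics and plots are written; 'seed' seeds torch, numpy and the env.
agent = DDPG(algo_params, env, path='./pendulum_ddpg', seed=0)
agent.run(test=False, render=False)
```

For the goal-conditioned variants (`GoalConditionedDDPG`, `GoalConditionedSAC`, `GPASAC`), the environment is instead expected to return dictionary observations with `observation`, `achieved_goal` and `desired_goal` fields and to expose a `distance_threshold`, as read in `agent_base.py`; additional keys such as `training_epochs`, `training_cycles`, `hindsight` and (optionally) `her_sampling_strategy` then apply.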
/drl_implementation/agent/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/IanYangChina/DRL_Implementation/38812c9647e4bec8359908be444dff19b90257d5/drl_implementation/agent/__init__.py -------------------------------------------------------------------------------- /drl_implementation/agent/agent_base.py: -------------------------------------------------------------------------------- 1 | import os 2 | import logging as info_logging 3 | import torch as T 4 | import numpy as np 5 | import json 6 | import subprocess as sp 7 | from torch.utils.tensorboard import SummaryWriter 8 | from .utils.plot import smoothed_plot 9 | from .utils.replay_buffer import make_buffer 10 | from .utils.normalizer import Normalizer 11 | 12 | 13 | def mkdir(paths): 14 | for path in paths: 15 | os.makedirs(path, exist_ok=True) 16 | 17 | 18 | def get_gpu_memory(): 19 | command = "nvidia-smi --query-gpu=memory.free --format=csv" 20 | memory_free_info = sp.check_output(command.split()).decode('ascii').split('\n')[:-1][1:] 21 | memory_free_values = [int(x.split()[0]) for i, x in enumerate(memory_free_info)] 22 | return memory_free_values 23 | 24 | 25 | def reset_logging(logging_to_reset): 26 | loggers = [logging_to_reset.getLogger(name) for name in logging_to_reset.root.manager.loggerDict] 27 | loggers.append(logging_to_reset.getLogger()) 28 | for logger in loggers: 29 | handlers = logger.handlers[:] 30 | for handler in handlers: 31 | logger.removeHandler(handler) 32 | handler.close() 33 | logger.setLevel(logging_to_reset.NOTSET) 34 | logger.propagate = True 35 | 36 | 37 | class Agent(object): 38 | def __init__(self, 39 | algo_params, logging=None, create_logger=False, 40 | transition_tuple=None, 41 | non_flat_obs=False, action_type='continuous', 42 | goal_conditioned=False, store_goal_ind=False, training_mode='episode_based', 43 | path=None, log_dir_suffix=None, seed=-1): 44 | """ 45 | Parameters 46 | ---------- 47 | algo_params : dict 48 | a dictionary of parameters 49 | transition_tuple : collections.namedtuple 50 | a python namedtuple for storing, managing and sampling experiences, see .utils.replay_buffer 51 | non_flat_obs : bool 52 | whether the observations are 1D or nD 53 | action_type : str 54 | either 'discrete' or 'continuous' 55 | goal_conditioned : bool 56 | whether the agent uses a goal-conditioned policy 57 | training_mode : str 58 | either 'episode_based' or 'env_step_based' 59 | path : str 60 | a directory to save files 61 | seed : int 62 | a random seed 63 | """ 64 | # torch device 65 | self.device = T.device("cuda" if T.cuda.is_available() else "cpu") 66 | if 'cuda_device_id' in algo_params.keys(): 67 | self.device = T.device("cuda:%i" % algo_params['cuda_device_id']) 68 | # path & seeding 69 | T.manual_seed(seed) 70 | T.cuda.manual_seed_all(seed) # this has no effect if cuda is not available 71 | 72 | # create a random number generator and seed it 73 | self.rng = np.random.default_rng(seed=seed) 74 | assert path is not None, 'please specify a project path to save files' 75 | self.path = path 76 | # path to save neural network check point 77 | self.ckpt_path = os.path.join(path, 'ckpts') 78 | # path to save statistics 79 | self.data_path = os.path.join(path, 'data') 80 | # create directories if not exist 81 | mkdir([self.path, self.ckpt_path, self.data_path]) 82 | if log_dir_suffix is not None: 83 | comment = '-'+log_dir_suffix 84 | else: 85 | comment = '' 86 | self.create_logger = create_logger 87 | if self.create_logger: 
88 | self.logger = SummaryWriter(log_dir=self.data_path, comment=comment) 89 | self.logging = logging 90 | if self.logging is None: 91 | reset_logging(info_logging) 92 | log_file_name = os.path.join(self.data_path, 'optimisation.log') 93 | if os.path.isfile(log_file_name): 94 | filemode = "a" 95 | else: 96 | filemode = "w" 97 | info_logging.basicConfig(level=info_logging.NOTSET, filemode=filemode, 98 | filename=log_file_name, 99 | format="%(asctime)s %(levelname)s %(message)s") 100 | self.logging = info_logging 101 | 102 | # non-goal-conditioned args 103 | self.non_flat_obs = non_flat_obs 104 | self.action_type = action_type 105 | if self.non_flat_obs: 106 | self.state_dim = 0 107 | self.state_shape = algo_params['state_shape'] 108 | else: 109 | self.state_dim = algo_params['state_dim'] 110 | if self.action_type == 'hybrid': 111 | self.discrete_action_dim = algo_params['discrete_action_dim'] 112 | self.continuous_action_dim = algo_params['continuous_action_dim'] 113 | self.continuous_action_max = algo_params['continuous_action_max'] 114 | self.continuous_action_scaling = algo_params['continuous_action_scaling'] 115 | else: 116 | self.action_dim = algo_params['action_dim'] 117 | if self.action_type == 'continuous': 118 | self.action_max = algo_params['action_max'] 119 | self.action_scaling = algo_params['action_scaling'] 120 | 121 | # goal-conditioned args & buffers 122 | self.goal_conditioned = goal_conditioned 123 | # prioritised replay 124 | self.prioritised = algo_params['prioritised'] 125 | 126 | if self.goal_conditioned: 127 | if self.non_flat_obs: 128 | self.goal_dim = 0 129 | self.goal_shape = algo_params['goal_shape'] 130 | else: 131 | self.goal_dim = algo_params['goal_dim'] 132 | self.hindsight = algo_params['hindsight'] 133 | try: 134 | goal_distance_threshold = self.env.env.distance_threshold 135 | except: 136 | goal_distance_threshold = self.env.distance_threshold 137 | 138 | goal_conditioned_reward_func = None 139 | try: 140 | if self.env.env.goal_conditioned_reward_function is not None: 141 | goal_conditioned_reward_func = self.env.env.goal_conditioned_reward_function 142 | except: 143 | if self.env.goal_conditioned_reward_function is not None: 144 | goal_conditioned_reward_func = self.env.goal_conditioned_reward_function 145 | 146 | try: 147 | her_sample_strategy = algo_params['her_sampling_strategy'] 148 | except: 149 | her_sample_strategy = 'future' 150 | 151 | try: 152 | num_sampled_goal = algo_params['num_sampled_goal'] 153 | except: 154 | num_sampled_goal = 4 155 | 156 | self.buffer = make_buffer(mem_capacity=algo_params['memory_capacity'], 157 | transition_tuple=transition_tuple, prioritised=self.prioritised, 158 | seed=seed, rng=self.rng, 159 | goal_conditioned=True, keep_episode=self.hindsight, 160 | store_goal_ind=store_goal_ind, 161 | sampling_strategy=her_sample_strategy, 162 | num_sampled_goal=num_sampled_goal, 163 | terminal_on_achieved=algo_params['terminate_on_achieve'], 164 | goal_distance_threshold=goal_distance_threshold, 165 | goal_conditioned_reward_func=goal_conditioned_reward_func) 166 | else: 167 | self.goal_dim = 0 168 | self.buffer = make_buffer(mem_capacity=algo_params['memory_capacity'], 169 | transition_tuple=transition_tuple, prioritised=self.prioritised, 170 | seed=seed, rng=self.rng, 171 | goal_conditioned=False) 172 | 173 | # common args 174 | if not self.non_flat_obs: 175 | self.observation_normalization = algo_params['observation_normalization'] 176 | # if not using obs normalization, the normalizer is just a scale multiplier, returns 
inputs*scale 177 | self.normalizer = Normalizer(self.state_dim+self.goal_dim, 178 | algo_params['init_input_means'], algo_params['init_input_vars'], 179 | activated=self.observation_normalization) 180 | try: 181 | self.update_interval = algo_params['update_interval'] 182 | except: 183 | self.update_interval = 1 184 | self.actor_learning_rate = algo_params['actor_learning_rate'] 185 | self.critic_learning_rate = algo_params['critic_learning_rate'] 186 | self.batch_size = algo_params['batch_size'] 187 | self.optimizer_steps = algo_params['optimization_steps'] 188 | self.gamma = algo_params['discount_factor'] 189 | self.discard_time_limit = algo_params['discard_time_limit'] 190 | self.tau = algo_params['tau'] 191 | self.optim_step_count = 0 192 | 193 | assert training_mode in ['episode_based', 'step_based'] 194 | self.training_mode = training_mode 195 | self.total_env_step_count = 0 196 | self.total_env_episode_count = 0 197 | 198 | # network dict is filled in each specific agent 199 | self.network_dict = {} 200 | self.network_keys_to_save = None 201 | 202 | # algorithm-specific statistics are defined in each agent sub-class 203 | self.statistic_dict = { 204 | # use lowercase characters 205 | 'actor_loss': [], 206 | 'critic_loss': [], 207 | } 208 | self.get_gpu_memory = get_gpu_memory 209 | 210 | def run(self, render=False, test=False, load_network_ep=None, sleep=0): 211 | raise NotImplementedError 212 | 213 | def _interact(self, render=False, test=False, sleep=0): 214 | raise NotImplementedError 215 | 216 | def _select_action(self, obs, test=False): 217 | raise NotImplementedError 218 | 219 | def _learn(self, steps=None): 220 | raise NotImplementedError 221 | 222 | def _remember(self, *args, new_episode=False): 223 | if self.goal_conditioned: 224 | self.buffer.new_episode = new_episode 225 | self.buffer.store_experience(*args) 226 | else: 227 | self.buffer.store_experience(*args) 228 | 229 | def _soft_update(self, source, target, tau=None): 230 | if tau is None: 231 | tau = self.tau 232 | 233 | for target_param, param in zip(target.parameters(), source.parameters()): 234 | target_param.data.copy_( 235 | target_param.data * (1.0 - tau) + param.data * tau 236 | ) 237 | 238 | def _save_network(self, keys=None, ep=None, step=None): 239 | if ep is None: 240 | ep = '' 241 | else: 242 | ep = '_ep'+str(ep) 243 | if step is None: 244 | step = '' 245 | else: 246 | step = '_step'+str(step) 247 | if keys is None: 248 | keys = self.network_keys_to_save 249 | assert keys is not None 250 | for key in keys: 251 | T.save(self.network_dict[key].state_dict(), self.ckpt_path+'/ckpt_'+key+ep+step+'.pt') 252 | 253 | def _load_network(self, keys=None, ep=None, step=None): 254 | if (not self.non_flat_obs) and self.observation_normalization: 255 | self.normalizer.history_mean = np.load(os.path.join(self.data_path, 'input_means.npy')) 256 | self.normalizer.history_var = np.load(os.path.join(self.data_path, 'input_vars.npy')) 257 | if ep is None: 258 | ep = '' 259 | else: 260 | ep = '_ep'+str(ep) 261 | if step is None: 262 | step = '' 263 | else: 264 | step = '_step'+str(step) 265 | if keys is None: 266 | keys = self.network_keys_to_save 267 | assert keys is not None 268 | for key in keys: 269 | self.network_dict[key].load_state_dict(T.load(self.ckpt_path+'/ckpt_'+key+ep+step+'.pt', map_location=self.device)) 270 | 271 | def _save_statistics(self, keys=None): 272 | if (not self.non_flat_obs) and self.observation_normalization: 273 | np.save(os.path.join(self.data_path, 'input_means'), self.normalizer.history_mean) 274 
| np.save(os.path.join(self.data_path, 'input_vars'), self.normalizer.history_var) 275 | if keys is None: 276 | keys = self.statistic_dict.keys() 277 | for key in keys: 278 | if len(self.statistic_dict[key]) == 0: 279 | continue 280 | # convert everything to a list before save via json 281 | if T.is_tensor(self.statistic_dict[key][0]): 282 | self.statistic_dict[key] = T.as_tensor(self.statistic_dict[key], device=self.device).cpu().numpy().tolist() 283 | else: 284 | self.statistic_dict[key] = np.array(self.statistic_dict[key]).tolist() 285 | json.dump(self.statistic_dict[key], open(os.path.join(self.data_path, key+'.json'), 'w')) 286 | 287 | def _plot_statistics(self, keys=None, x_labels=None, y_labels=None, window=5, save_to_file=True): 288 | if save_to_file: 289 | self._save_statistics(keys=keys) 290 | if y_labels is None: 291 | y_labels = {} 292 | for key in list(self.statistic_dict.keys()): 293 | if key not in y_labels.keys(): 294 | if 'loss' in key: 295 | label = 'Loss' 296 | elif 'return' in key: 297 | label = 'Return' 298 | elif 'success' in key: 299 | label = 'Success' 300 | else: 301 | label = key 302 | y_labels.update({key: label}) 303 | 304 | if x_labels is None: 305 | x_labels = {} 306 | for key in list(self.statistic_dict.keys()): 307 | if key not in x_labels.keys(): 308 | if ('loss' in key) or ('alpha' in key) or ('entropy' in key) or ('step' in key): 309 | label = 'Optimization step' 310 | elif 'cycle' in key: 311 | label = 'Cycle' 312 | elif 'epoch' in key: 313 | label = 'Epoch' 314 | else: 315 | label = 'Episode' 316 | x_labels.update({key: label}) 317 | 318 | if keys is None: 319 | for key in list(self.statistic_dict.keys()): 320 | if len(self.statistic_dict[key]) == 0: 321 | continue 322 | smoothed_plot(os.path.join(self.path, key+'.png'), self.statistic_dict[key], 323 | x_label=x_labels[key], y_label=y_labels[key], window=window) 324 | else: 325 | for key in keys: 326 | smoothed_plot(os.path.join(self.path, key+'.png'), self.statistic_dict[key], 327 | x_label=x_labels[key], y_label=y_labels[key], window=window) 328 | 329 | -------------------------------------------------------------------------------- /drl_implementation/agent/continuous_action/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/IanYangChina/DRL_Implementation/38812c9647e4bec8359908be444dff19b90257d5/drl_implementation/agent/continuous_action/__init__.py -------------------------------------------------------------------------------- /drl_implementation/agent/continuous_action/ddpg.py: -------------------------------------------------------------------------------- 1 | import time 2 | import numpy as np 3 | import torch as T 4 | import torch.nn.functional as F 5 | from torch.optim.adam import Adam 6 | from ..utils.networks_mlp import Actor, Critic 7 | from ..agent_base import Agent 8 | from ..utils.exploration_strategy import OUNoise, GaussianNoise 9 | 10 | 11 | class DDPG(Agent): 12 | def __init__(self, algo_params, env, transition_tuple=None, path=None, seed=-1): 13 | # environment 14 | self.env = env 15 | self.env.seed(seed) 16 | obs = self.env.reset() 17 | algo_params.update({'state_dim': obs.shape[0], 18 | 'action_dim': self.env.action_space.shape[0], 19 | 'action_max': self.env.action_space.high, 20 | 'action_scaling': self.env.action_space.high[0], 21 | 'init_input_means': None, 22 | 'init_input_vars': None 23 | }) 24 | # training args 25 | self.training_episodes = algo_params['training_episodes'] 26 | self.testing_gap = 
algo_params['testing_gap'] 27 | self.testing_episodes = algo_params['testing_episodes'] 28 | self.saving_gap = algo_params['saving_gap'] 29 | 30 | super(DDPG, self).__init__(algo_params, 31 | transition_tuple=transition_tuple, 32 | goal_conditioned=False, 33 | path=path, 34 | seed=seed) 35 | # torch 36 | self.network_dict.update({ 37 | 'actor': Actor(self.state_dim, self.action_dim).to(self.device), 38 | 'actor_target': Actor(self.state_dim, self.action_dim).to(self.device), 39 | 'critic': Critic(self.state_dim + self.action_dim, 1).to(self.device), 40 | 'critic_target': Critic(self.state_dim + self.action_dim, 1).to(self.device) 41 | }) 42 | self.network_keys_to_save = ['actor_target', 'critic_target'] 43 | self.actor_optimizer = Adam(self.network_dict['actor'].parameters(), lr=self.actor_learning_rate) 44 | self._soft_update(self.network_dict['actor'], self.network_dict['actor_target'], tau=1) 45 | self.critic_optimizer = Adam(self.network_dict['critic'].parameters(), lr=self.critic_learning_rate, weight_decay=algo_params['Q_weight_decay']) 46 | self._soft_update(self.network_dict['critic'], self.network_dict['critic_target'], tau=1) 47 | # behavioural policy args (exploration) 48 | self.exploration_strategy = GaussianNoise(self.action_dim, self.action_max, sigma=0.1) 49 | # training args 50 | self.warmup_step = algo_params['warmup_step'] 51 | # statistic dict 52 | self.statistic_dict.update({ 53 | 'episode_return': [], 54 | 'episode_test_return': [] 55 | }) 56 | 57 | def run(self, test=False, render=False, load_network_ep=None, sleep=0): 58 | if test: 59 | num_episode = self.testing_episodes 60 | if load_network_ep is not None: 61 | print("Loading network parameters...") 62 | self._load_network(ep=load_network_ep) 63 | print("Start testing...") 64 | else: 65 | num_episode = self.training_episodes 66 | print("Start training...") 67 | 68 | for ep in range(num_episode): 69 | ep_return = self._interact(render, test, sleep=sleep) 70 | self.statistic_dict['episode_return'].append(ep_return) 71 | print("Episode %i" % ep, "return %0.1f" % ep_return) 72 | 73 | if (ep % self.testing_gap == 0) and (ep != 0) and (not test): 74 | ep_test_return = [] 75 | for test_ep in range(self.testing_episodes): 76 | ep_test_return.append(self._interact(render, test=True)) 77 | self.statistic_dict['episode_test_return'].append(sum(ep_test_return)/self.testing_episodes) 78 | print("Episode %i" % ep, "test return %0.1f" % (sum(ep_test_return)/self.testing_episodes)) 79 | 80 | if (ep % self.saving_gap == 0) and (ep != 0) and (not test): 81 | self._save_network(ep=ep) 82 | 83 | if not test: 84 | print("Finished training") 85 | print("Saving statistics...") 86 | self._plot_statistics(save_to_file=True) 87 | else: 88 | print("Finished testing") 89 | 90 | def _interact(self, render=False, test=False, sleep=0): 91 | done = False 92 | obs = self.env.reset() 93 | ep_return = 0 94 | # self.exploration_strategy.reset() 95 | # start a new episode 96 | while not done: 97 | if render: 98 | self.env.render() 99 | if self.env_step_count < self.warmup_step: 100 | action = self.env.action_space.sample() 101 | else: 102 | action = self._select_action(obs, test=test) 103 | new_obs, reward, done, info = self.env.step(action) 104 | time.sleep(sleep) 105 | ep_return += reward 106 | if not test: 107 | self._remember(obs, action, new_obs, reward, 1 - int(done)) 108 | if self.observation_normalization: 109 | self.normalizer.store_history(new_obs) 110 | self.normalizer.update_mean() 111 | if (self.env_step_count % self.update_interval == 0) 
and (self.env_step_count > self.warmup_step): 112 | self._learn() 113 | self.env_step_count += 1 114 | obs = new_obs 115 | return ep_return 116 | 117 | def _select_action(self, obs, test=False): 118 | obs = self.normalizer(obs) 119 | with T.no_grad(): 120 | inputs = T.as_tensor(obs, dtype=T.float, device=self.device) 121 | action = self.network_dict['actor_target'](inputs).cpu().detach().numpy() 122 | if test: 123 | # evaluate 124 | return np.clip(action, -self.action_max, self.action_max) 125 | else: 126 | # explore 127 | return self.exploration_strategy(action) 128 | 129 | def _learn(self, steps=None): 130 | if len(self.buffer) < self.batch_size: 131 | return 132 | if steps is None: 133 | steps = self.optimizer_steps 134 | 135 | for i in range(steps): 136 | if self.prioritised: 137 | batch, weights, inds = self.buffer.sample(self.batch_size) 138 | weights = T.as_tensor(weights, device=self.device).view(self.batch_size, 1) 139 | else: 140 | batch = self.buffer.sample(self.batch_size) 141 | weights = T.ones(size=(self.batch_size, 1), device=self.device) 142 | inds = None 143 | 144 | actor_inputs = np.array(self.normalizer(batch.state)) 145 | actor_inputs = T.as_tensor(actor_inputs, dtype=T.float32, device=self.device) 146 | actions = T.as_tensor(np.array(batch.action), dtype=T.float32, device=self.device) 147 | critic_inputs = T.cat((actor_inputs, actions), dim=1) 148 | actor_inputs_ = np.array(self.normalizer(batch.next_state)) 149 | actor_inputs_ = T.as_tensor(np.array(actor_inputs_), dtype=T.float32, device=self.device) 150 | rewards = T.as_tensor(np.array(batch.reward), dtype=T.float32, device=self.device).unsqueeze(1) 151 | done = T.as_tensor(np.array(batch.done), dtype=T.float32, device=self.device).unsqueeze(1) 152 | 153 | if self.discard_time_limit: 154 | done = done * 0 + 1 155 | 156 | with T.no_grad(): 157 | actions_ = self.network_dict['actor_target'](actor_inputs_) 158 | critic_inputs_ = T.cat((actor_inputs_, actions_), dim=1) 159 | value_ = self.network_dict['critic_target'](critic_inputs_) 160 | value_target = rewards + done * self.gamma * value_ 161 | 162 | self.critic_optimizer.zero_grad() 163 | value_estimate = self.network_dict['critic'](critic_inputs) 164 | critic_loss = F.mse_loss(value_estimate, value_target, reduction='none') 165 | (critic_loss * weights).mean().backward() 166 | self.critic_optimizer.step() 167 | 168 | if self.prioritised: 169 | assert inds is not None 170 | self.buffer.update_priority(inds, np.abs(critic_loss.cpu().detach().numpy())) 171 | 172 | self.actor_optimizer.zero_grad() 173 | new_actions = self.network_dict['actor'](actor_inputs) 174 | critic_eval_inputs = T.cat((actor_inputs, new_actions), dim=1).to(self.device) 175 | actor_loss = -self.network_dict['critic'](critic_eval_inputs).mean() 176 | actor_loss.backward() 177 | self.actor_optimizer.step() 178 | 179 | self._soft_update(self.network_dict['actor'], self.network_dict['actor_target']) 180 | self._soft_update(self.network_dict['critic'], self.network_dict['critic_target']) 181 | 182 | self.statistic_dict['critic_loss'].append(critic_loss.detach().mean()) 183 | self.statistic_dict['actor_loss'].append(actor_loss.detach().mean()) 184 | -------------------------------------------------------------------------------- /drl_implementation/agent/continuous_action/ddpg_goal_conditioned.py: -------------------------------------------------------------------------------- 1 | import time 2 | import numpy as np 3 | import torch as T 4 | import torch.nn.functional as F 5 | from torch.optim.adam import 
Adam 6 | from ..utils.networks_mlp import Actor, Critic 7 | from ..agent_base import Agent 8 | from ..utils.exploration_strategy import EGreedyGaussian 9 | 10 | 11 | class GoalConditionedDDPG(Agent): 12 | def __init__(self, algo_params, env, transition_tuple=None, path=None, seed=-1): 13 | # environment 14 | self.env = env 15 | self.env.seed(seed) 16 | obs = self.env.reset() 17 | algo_params.update({'state_dim': obs['observation'].shape[0], 18 | 'goal_dim': obs['desired_goal'].shape[0], 19 | 'action_dim': self.env.action_space.shape[0], 20 | 'action_max': self.env.action_space.high, 21 | 'action_scaling': self.env.action_space.high[0], 22 | 'init_input_means': None, 23 | 'init_input_vars': None 24 | }) 25 | self.curriculum = False 26 | if 'curriculum' in algo_params.keys(): 27 | self.curriculum = algo_params['curriculum'] 28 | # training args 29 | self.training_epochs = algo_params['training_epochs'] 30 | self.training_cycles = algo_params['training_cycles'] 31 | self.training_episodes = algo_params['training_episodes'] 32 | self.testing_gap = algo_params['testing_gap'] 33 | self.testing_episodes = algo_params['testing_episodes'] 34 | self.saving_gap = algo_params['saving_gap'] 35 | 36 | super(GoalConditionedDDPG, self).__init__(algo_params, 37 | transition_tuple=transition_tuple, 38 | goal_conditioned=True, 39 | path=path, 40 | seed=seed) 41 | # torch 42 | self.network_dict.update({ 43 | 'actor': Actor(self.state_dim + self.goal_dim, self.action_dim, action_scaling=self.action_scaling).to( 44 | self.device), 45 | 'actor_target': Actor(self.state_dim + self.goal_dim, self.action_dim, 46 | action_scaling=self.action_scaling).to(self.device), 47 | 'critic_1': Critic(self.state_dim + self.goal_dim + self.action_dim, 1).to(self.device), 48 | 'critic_1_target': Critic(self.state_dim + self.goal_dim + self.action_dim, 1).to(self.device), 49 | 'critic_2': Critic(self.state_dim + self.goal_dim + self.action_dim, 1).to(self.device), 50 | 'critic_2_target': Critic(self.state_dim + self.goal_dim + self.action_dim, 1).to(self.device), 51 | }) 52 | self.network_keys_to_save = ['actor_target', 'critic_1_target', 'critic_2_target'] 53 | self.actor_optimizer = Adam(self.network_dict['actor'].parameters(), lr=self.actor_learning_rate) 54 | self._soft_update(self.network_dict['actor'], self.network_dict['actor_target'], tau=1) 55 | self.critic_1_optimizer = Adam(self.network_dict['critic_1'].parameters(), lr=self.critic_learning_rate, 56 | weight_decay=algo_params['Q_weight_decay']) 57 | self._soft_update(self.network_dict['critic_1'], self.network_dict['critic_1_target'], tau=1) 58 | self.critic_2_optimizer = Adam(self.network_dict['critic_2'].parameters(), lr=self.critic_learning_rate, 59 | weight_decay=algo_params['Q_weight_decay']) 60 | self._soft_update(self.network_dict['critic_2'], self.network_dict['critic_2_target'], tau=1) 61 | # behavioural policy args (exploration) 62 | # different from the original DDPG paper, the HER paper uses another exploration strategy 63 | # paper: https://papers.nips.cc/paper/2017/hash/453fadbd8a1a3af50a9df4df899537b5-Abstract.html 64 | self.exploration_strategy = EGreedyGaussian(action_dim=self.action_dim, 65 | action_max=self.action_max, 66 | chance=algo_params['random_action_chance'], 67 | sigma=algo_params['noise_deviation'], rng=self.rng) 68 | self.noise_deviation = algo_params['noise_deviation'] 69 | # training args 70 | self.clip_value = algo_params['clip_value'] 71 | # statistic dict 72 | self.statistic_dict.update({ 73 | 'cycle_return': [], 74 | 
'cycle_success_rate': [], 75 | 'epoch_test_return': [], 76 | 'epoch_test_success_rate': [] 77 | }) 78 | 79 | def run(self, test=False, render=False, load_network_ep=None, sleep=0): 80 | # training setup uses a hierarchy of Epoch, Cycle and Episode 81 | # following the HER paper: https://papers.nips.cc/paper/2017/hash/453fadbd8a1a3af50a9df4df899537b5-Abstract.html 82 | if test: 83 | if load_network_ep is not None: 84 | print("Loading network parameters...") 85 | self._load_network(ep=load_network_ep) 86 | print("Start testing...") 87 | else: 88 | print("Start training...") 89 | 90 | for epo in range(self.training_epochs): 91 | if self.curriculum: 92 | self.env.activate_curriculum_update() 93 | for cyc in range(self.training_cycles): 94 | cycle_return = 0 95 | cycle_success = 0 96 | for ep in range(self.training_episodes): 97 | ep_return = self._interact(render, test, sleep=sleep) 98 | cycle_return += ep_return 99 | if ep_return > -self.env._max_episode_steps: 100 | cycle_success += 1 101 | 102 | self.statistic_dict['cycle_return'].append(cycle_return / self.training_episodes) 103 | self.statistic_dict['cycle_success_rate'].append(cycle_success / self.training_episodes) 104 | print("Epoch %i" % epo, "Cycle %i" % cyc, 105 | "avg. return %0.1f" % (cycle_return / self.training_episodes), 106 | "success rate %0.1f" % (cycle_success / self.training_episodes)) 107 | 108 | if (epo % self.testing_gap == 0) and (epo != 0) and (not test): 109 | if self.curriculum: 110 | self.env.deactivate_curriculum_update() 111 | # testing during training 112 | test_return = 0 113 | test_success = 0 114 | for test_ep in range(self.testing_episodes): 115 | ep_test_return = self._interact(render, test=True) 116 | test_return += ep_test_return 117 | if ep_test_return > -self.env._max_episode_steps: 118 | test_success += 1 119 | self.statistic_dict['epoch_test_return'].append(test_return / self.testing_episodes) 120 | self.statistic_dict['epoch_test_success_rate'].append(test_success / self.testing_episodes) 121 | print("Epoch %i" % epo, "test avg. 
return %0.1f" % (test_return / self.testing_episodes)) 122 | 123 | if (epo % self.saving_gap == 0) and (epo != 0) and (not test): 124 | self._save_network(ep=epo) 125 | 126 | if not test: 127 | print("Finished training") 128 | print("Saving statistics...") 129 | self._plot_statistics( 130 | x_labels={ 131 | 'critic_loss': 'Optimization epoch (per ' + str(self.optimizer_steps) + ' steps)', 132 | 'actor_loss': 'Optimization epoch (per ' + str(self.optimizer_steps) + ' steps)' 133 | }, 134 | save_to_file=True) 135 | else: 136 | print("Finished testing") 137 | 138 | def _interact(self, render=False, test=False, sleep=0): 139 | done = False 140 | obs = self.env.reset() 141 | if self.curriculum: 142 | self.env._max_episode_steps = self.env.env.curriculum_goal_step 143 | ep_return = 0 144 | new_episode = True 145 | # start a new episode 146 | while not done: 147 | if render: 148 | self.env.render() 149 | action = self._select_action(obs, test=test) 150 | new_obs, reward, done, info = self.env.step(action) 151 | time.sleep(sleep) 152 | ep_return += reward 153 | if not test: 154 | self._remember(obs['observation'], obs['desired_goal'], action, 155 | new_obs['observation'], new_obs['achieved_goal'], reward, 1 - int(done), 156 | new_episode=new_episode) 157 | if self.observation_normalization: 158 | self.normalizer.store_history(np.concatenate((new_obs['observation'], 159 | new_obs['achieved_goal']), axis=0)) 160 | obs = new_obs 161 | new_episode = False 162 | if not test: 163 | self.normalizer.update_mean() 164 | self._learn() 165 | return ep_return 166 | 167 | def _select_action(self, obs, test=False): 168 | inputs = np.concatenate((obs['observation'], obs['desired_goal']), axis=0) 169 | inputs = self.normalizer(inputs) 170 | with T.no_grad(): 171 | inputs = T.as_tensor(inputs, dtype=T.float, device=self.device) 172 | action = self.network_dict['actor_target'](inputs).cpu().detach().numpy() 173 | if test: 174 | # evaluate 175 | return np.clip(action, -self.action_max, self.action_max) 176 | else: 177 | # explore 178 | return self.exploration_strategy(action) 179 | 180 | def _learn(self, steps=None): 181 | if self.hindsight: 182 | self.buffer.modify_episodes() 183 | self.buffer.store_episodes() 184 | if len(self.buffer) < self.batch_size: 185 | return 186 | if steps is None: 187 | steps = self.optimizer_steps 188 | 189 | critic_losses = T.zeros(1, device=self.device) 190 | actor_losses = T.zeros(1, device=self.device) 191 | for i in range(steps): 192 | if self.prioritised: 193 | batch, weights, inds = self.buffer.sample(self.batch_size) 194 | weights = T.as_tensor(weights).view(self.batch_size, 1).to(self.device) 195 | else: 196 | batch = self.buffer.sample(self.batch_size) 197 | weights = T.ones(size=(self.batch_size, 1)).to(self.device) 198 | inds = None 199 | 200 | actor_inputs = np.concatenate((batch.state, batch.desired_goal), axis=1) 201 | actor_inputs = self.normalizer(actor_inputs) 202 | actor_inputs = T.as_tensor(actor_inputs, dtype=T.float32, device=self.device) 203 | actions = T.as_tensor(batch.action, dtype=T.float32, device=self.device) 204 | critic_inputs = T.cat((actor_inputs, actions), dim=1) 205 | actor_inputs_ = np.concatenate((batch.next_state, batch.desired_goal), axis=1) 206 | actor_inputs_ = self.normalizer(actor_inputs_) 207 | actor_inputs_ = T.as_tensor(actor_inputs_, dtype=T.float32, device=self.device) 208 | rewards = T.as_tensor(batch.reward, dtype=T.float32, device=self.device).unsqueeze(1) 209 | done = T.as_tensor(batch.done, dtype=T.float32, 
device=self.device).unsqueeze(1) 210 | 211 | if self.discard_time_limit: 212 | done = done * 0 + 1 213 | 214 | with T.no_grad(): 215 | actions_ = self.network_dict['actor_target'](actor_inputs_) 216 | critic_inputs_ = T.cat((actor_inputs_, actions_), dim=1) 217 | value_1_ = self.network_dict['critic_1_target'](critic_inputs_) 218 | value_2_ = self.network_dict['critic_2_target'](critic_inputs_) 219 | value_ = T.min(value_1_, value_2_) 220 | value_target = rewards + done * self.gamma * value_ 221 | value_target = T.clamp(value_target, -self.clip_value, 0.0) 222 | 223 | self.critic_1_optimizer.zero_grad() 224 | value_estimate_1 = self.network_dict['critic_1'](critic_inputs) 225 | critic_loss_1 = F.mse_loss(value_estimate_1, value_target.detach(), reduction='none') 226 | (critic_loss_1 * weights).mean().backward() 227 | self.critic_1_optimizer.step() 228 | 229 | if self.prioritised: 230 | assert inds is not None 231 | self.buffer.update_priority(inds, np.abs(critic_loss_1.cpu().detach().numpy())) 232 | 233 | self.critic_2_optimizer.zero_grad() 234 | value_estimate_2 = self.network_dict['critic_2'](critic_inputs) 235 | critic_loss_2 = F.mse_loss(value_estimate_2, value_target.detach(), reduction='none') 236 | (critic_loss_2 * weights).mean().backward() 237 | self.critic_2_optimizer.step() 238 | 239 | self.actor_optimizer.zero_grad() 240 | new_actions = self.network_dict['actor'](actor_inputs) 241 | critic_eval_inputs = T.cat((actor_inputs, new_actions), dim=1).to(self.device) 242 | new_values_1 = self.network_dict['critic_1'](critic_eval_inputs) 243 | new_values_2 = self.network_dict['critic_2'](critic_eval_inputs) 244 | actor_loss = -T.min(new_values_1, new_values_2).mean() 245 | actor_loss.backward() 246 | self.actor_optimizer.step() 247 | 248 | critic_losses += critic_loss_1.detach().mean() 249 | actor_losses += actor_loss.detach().mean() 250 | 251 | self._soft_update(self.network_dict['actor'], self.network_dict['actor_target']) 252 | self._soft_update(self.network_dict['critic_1'], self.network_dict['critic_1_target']) 253 | self._soft_update(self.network_dict['critic_2'], self.network_dict['critic_2_target']) 254 | 255 | self.statistic_dict['critic_loss'].append(critic_losses / steps) 256 | self.statistic_dict['actor_loss'].append(actor_losses / steps) 257 | -------------------------------------------------------------------------------- /drl_implementation/agent/continuous_action/distributional_ddpg.py: -------------------------------------------------------------------------------- 1 | import time 2 | import numpy as np 3 | import torch as T 4 | import torch.nn.functional as F 5 | from torch.optim.adam import Adam 6 | from ..utils.networks_mlp import Actor, Critic 7 | from ..agent_base import Agent 8 | from ..utils.exploration_strategy import OUNoise, GaussianNoise 9 | 10 | 11 | class DistributionalDDPG(Agent): 12 | def __init__(self, algo_params, env, transition_tuple=None, path=None, seed=-1): 13 | # environment 14 | self.env = env 15 | self.env.seed(seed) 16 | obs = self.env.reset() 17 | algo_params.update({'state_dim': obs.shape[0], 18 | 'action_dim': self.env.action_space.shape[0], 19 | 'action_max': self.env.action_space.high, 20 | 'action_scaling': self.env.action_space.high[0], 21 | 'init_input_means': None, 22 | 'init_input_vars': None 23 | }) 24 | # training args 25 | self.training_episodes = algo_params['training_episodes'] 26 | self.testing_gap = algo_params['testing_gap'] 27 | self.testing_episodes = algo_params['testing_episodes'] 28 | self.saving_gap = 
algo_params['saving_gap'] 29 | 30 | super(DistributionalDDPG, self).__init__(algo_params, 31 | transition_tuple=transition_tuple, 32 | goal_conditioned=False, 33 | path=path, 34 | seed=seed) 35 | # torch 36 | # categorical distribution atoms 37 | self.num_atoms = algo_params['num_atoms'] 38 | self.value_max = algo_params['value_max'] 39 | self.value_min = algo_params['value_min'] 40 | self.delta_z = (self.value_max - self.value_min) / (self.num_atoms - 1) 41 | self.support = T.linspace(self.value_min, self.value_max, steps=self.num_atoms).to(self.device) 42 | # network 43 | self.network_dict.update({ 44 | 'actor': Actor(self.state_dim, self.action_dim).to(self.device), 45 | 'actor_target': Actor(self.state_dim, self.action_dim).to(self.device), 46 | 'critic': Critic(self.state_dim + self.action_dim, self.num_atoms, softmax=True).to(self.device), 47 | 'critic_target': Critic(self.state_dim + self.action_dim, self.num_atoms, softmax=True).to(self.device) 48 | }) 49 | self.network_keys_to_save = ['actor_target', 'critic_target'] 50 | self.actor_optimizer = Adam(self.network_dict['actor'].parameters(), lr=self.actor_learning_rate) 51 | self._soft_update(self.network_dict['actor'], self.network_dict['actor_target'], tau=1) 52 | self.critic_optimizer = Adam(self.network_dict['critic'].parameters(), lr=self.critic_learning_rate, 53 | weight_decay=algo_params['Q_weight_decay']) 54 | self._soft_update(self.network_dict['critic'], self.network_dict['critic_target'], tau=1) 55 | # behavioural policy args (exploration) 56 | self.exploration_strategy = GaussianNoise(self.action_dim, scale=0.3, sigma=1.0) 57 | # training args 58 | self.warmup_step = algo_params['warmup_step'] 59 | # statistic dict 60 | self.statistic_dict.update({ 61 | 'episode_return': [], 62 | 'episode_test_return': [] 63 | }) 64 | 65 | def run(self, test=False, render=False, load_network_ep=None, sleep=0): 66 | if test: 67 | num_episode = self.testing_episodes 68 | if load_network_ep is not None: 69 | print("Loading network parameters...") 70 | self._load_network(ep=load_network_ep) 71 | print("Start testing...") 72 | else: 73 | num_episode = self.training_episodes 74 | print("Start training...") 75 | 76 | for ep in range(num_episode): 77 | ep_return = self._interact(render, test, sleep=sleep) 78 | self.statistic_dict['episode_return'].append(ep_return) 79 | print("Episode %i" % ep, "return %0.1f" % ep_return) 80 | 81 | if (ep % self.testing_gap == 0) and (ep != 0) and (not test): 82 | ep_test_return = [] 83 | for test_ep in range(self.testing_episodes): 84 | ep_test_return.append(self._interact(render, test=True)) 85 | self.statistic_dict['episode_test_return'].append(sum(ep_test_return) / self.testing_episodes) 86 | print("Episode %i" % ep, "test return %0.1f" % (sum(ep_test_return) / self.testing_episodes)) 87 | 88 | if (ep % self.saving_gap == 0) and (ep != 0) and (not test): 89 | self._save_network(ep=ep) 90 | 91 | if not test: 92 | print("Finished training") 93 | print("Saving statistics...") 94 | self._plot_statistics(save_to_file=True) 95 | else: 96 | print("Finished testing") 97 | 98 | def _interact(self, render=False, test=False, sleep=0): 99 | done = False 100 | obs = self.env.reset() 101 | ep_return = 0 102 | while not done: 103 | if render: 104 | self.env.render() 105 | if self.env_step_count < self.warmup_step: 106 | action = self.env.action_space.sample() 107 | else: 108 | action = self._select_action(obs, test=test) 109 | new_obs, reward, done, info = self.env.step(action) 110 | time.sleep(sleep) 111 | ep_return += 
reward 112 | if not test: 113 | self._remember(obs, action, new_obs, reward, 1 - int(done)) 114 | if self.observation_normalization: 115 | self.normalizer.store_history(new_obs) 116 | self.normalizer.update_mean() 117 | if (self.env_step_count % self.update_interval == 0) and (self.env_step_count > self.warmup_step): 118 | self._learn() 119 | self.env_step_count += 1 120 | obs = new_obs 121 | return ep_return 122 | 123 | def _select_action(self, obs, test=False): 124 | obs = self.normalizer(obs) 125 | with T.no_grad(): 126 | inputs = T.as_tensor(obs, dtype=T.float, device=self.device) 127 | action = self.network_dict['actor_target'](inputs).cpu().detach().numpy() 128 | if test: 129 | # evaluate 130 | return np.clip(action, -self.action_max, self.action_max) 131 | else: 132 | # explore 133 | return self.exploration_strategy(action) 134 | 135 | def _learn(self, steps=None): 136 | if len(self.buffer) < self.batch_size: 137 | return 138 | if steps is None: 139 | steps = self.optimizer_steps 140 | 141 | for i in range(steps): 142 | if self.prioritised: 143 | batch, weights, inds = self.buffer.sample(self.batch_size) 144 | weights = T.as_tensor(weights, device=self.device).view(self.batch_size, 1) 145 | else: 146 | batch = self.buffer.sample(self.batch_size) 147 | weights = T.ones(size=(self.batch_size, 1), device=self.device) 148 | inds = None 149 | 150 | actor_inputs = self.normalizer(batch.state) 151 | actor_inputs = T.as_tensor(actor_inputs, dtype=T.float32, device=self.device) 152 | actions = T.as_tensor(batch.action, dtype=T.float32, device=self.device) 153 | critic_inputs = T.cat((actor_inputs, actions), dim=1) 154 | actor_inputs_ = self.normalizer(batch.next_state) 155 | actor_inputs_ = T.as_tensor(actor_inputs_, dtype=T.float32, device=self.device) 156 | rewards = T.as_tensor(batch.reward, dtype=T.float32, device=self.device) 157 | done = T.as_tensor(batch.done, dtype=T.float32, device=self.device) 158 | 159 | if self.discard_time_limit: 160 | done = done * 0 + 1 161 | 162 | with T.no_grad(): 163 | actions_ = self.network_dict['actor_target'](actor_inputs_) 164 | critic_inputs_ = T.cat((actor_inputs_, actions_), dim=1) 165 | value_dist_ = self.network_dict['critic_target'](critic_inputs_) 166 | value_dist_target = self.project_value_distribution(value_dist_, rewards, done) 167 | value_dist_target = T.as_tensor(value_dist_target, device=self.device) 168 | 169 | self.critic_optimizer.zero_grad() 170 | value_dist_estimate = self.network_dict['critic'](critic_inputs) 171 | critic_loss = F.binary_cross_entropy(value_dist_estimate, value_dist_target, reduction='none').sum(dim=1) 172 | (critic_loss * weights).mean().backward() 173 | self.critic_optimizer.step() 174 | 175 | if self.prioritised: 176 | assert inds is not None 177 | self.buffer.update_priority(inds, np.abs(critic_loss.cpu().detach().numpy())) 178 | 179 | self.actor_optimizer.zero_grad() 180 | new_actions = self.network_dict['actor'](actor_inputs) 181 | critic_eval_inputs = T.cat((actor_inputs, new_actions), dim=1) 182 | # take the expectation of the value distribution as the policy loss 183 | actor_loss = -(self.network_dict['critic'](critic_eval_inputs) * self.support) 184 | actor_loss = actor_loss.sum(dim=1) 185 | actor_loss.mean().backward() 186 | self.actor_optimizer.step() 187 | 188 | self._soft_update(self.network_dict['actor'], self.network_dict['actor_target']) 189 | self._soft_update(self.network_dict['critic'], self.network_dict['critic_target']) 190 | 191 | 
self.statistic_dict['critic_loss'].append(critic_loss.detach().mean()) 192 | self.statistic_dict['actor_loss'].append(actor_loss.detach().mean()) 193 | 194 | def project_value_distribution(self, value_dist, rewards, done): 195 | # refer to https://github.com/schatty/d4pg-pytorch/blob/7dc23096a45bc4036fbb02493e0b052d57cfe4c6/models/d4pg/l2_projection.py#L7 196 | # comments added 197 | copy_value_dist = value_dist.data.cpu().numpy() 198 | copy_rewards = rewards.data.cpu().numpy() 199 | copy_done = (1-done).data.cpu().numpy().astype(np.bool) 200 | batch_size = self.batch_size 201 | n_atoms = self.num_atoms 202 | projected_dist = np.zeros((batch_size, n_atoms), dtype=np.float32) 203 | 204 | # calculate the next state value for each atom in the support set 205 | for atom in range(n_atoms): 206 | atom_ = copy_rewards + (self.value_min + atom * self.delta_z) * self.gamma 207 | tz_j = np.clip(atom_, a_max=self.value_max, a_min=self.value_min) 208 | # compute where the next value is on the indexes axis of the support set 209 | b_j = (tz_j - self.value_min) / self.delta_z 210 | # compute floor and ceiling indexes of the next value on the support set 211 | l = np.floor(b_j).astype(np.int64) 212 | u = np.ceil(b_j).astype(np.int64) 213 | # since l and u are floor and ceiling indexes of the next value on the support set 214 | # their difference is always 0 at the boundary and 1 otherwise 215 | # thus, the predicted probability of the next value is distributed proportional to 216 | # the difference between the projected value index (b_j) and its floor or ceiling 217 | # boundary case, floor == ceiling 218 | eq_mask = (u == l) # this line gives an array of boolean masks 219 | projected_dist[eq_mask, l[eq_mask]] += copy_value_dist[eq_mask, atom] 220 | # otherwise, ceiling - floor == 1, i.e., (u - b_j) + (b_j - l) == 1 221 | ne_mask = (u != l) 222 | projected_dist[ne_mask, l[ne_mask]] += copy_value_dist[ne_mask, atom] * (u - b_j)[ne_mask] 223 | projected_dist[ne_mask, u[ne_mask]] += copy_value_dist[ne_mask, atom] * (b_j - l)[ne_mask] 224 | 225 | # check if a terminal state exists 226 | if copy_done.any(): 227 | projected_dist[copy_done] = 0.0 228 | # value at a terminal state should equal to the immediate reward only 229 | tz_j = np.clip(copy_rewards[copy_done], a_max=self.value_max, a_min=self.value_min) 230 | b_j = (tz_j - self.value_min) / self.delta_z 231 | l = np.floor(b_j).astype(np.int64) 232 | u = np.ceil(b_j).astype(np.int64) 233 | eq_mask = (u == l) 234 | eq_dones = copy_done.copy() 235 | eq_dones[copy_done] = eq_mask 236 | # the value probability is only set to 1.0 237 | # when it is a terminal state and its floor and ceiling indexes are the same 238 | if eq_dones.any(): 239 | projected_dist[eq_dones, l[eq_mask]] = 1.0 240 | ne_mask = (u != l) 241 | ne_dones = copy_done.copy() 242 | ne_dones[copy_done] = ne_mask 243 | # the value probability is only distributed while summed to 1.0 244 | # when it is a terminal state and its floor and ceiling indexes differ by 1 index 245 | if ne_dones.any(): 246 | projected_dist[ne_dones, l[ne_mask]] = (u - b_j)[ne_mask] 247 | projected_dist[ne_dones, u[ne_mask]] = (b_j - l)[ne_mask] 248 | 249 | return projected_dist 250 | -------------------------------------------------------------------------------- /drl_implementation/agent/continuous_action/sac.py: -------------------------------------------------------------------------------- 1 | import time 2 | import numpy as np 3 | import torch as T 4 | import torch.nn.functional as F 5 | from torch.optim.adam import 
Adam 6 | from ..utils.networks_mlp import StochasticActor, Critic 7 | from ..agent_base import Agent 8 | 9 | 10 | class SAC(Agent): 11 | def __init__(self, algo_params, env, transition_tuple=None, path=None, seed=-1): 12 | # environment 13 | self.env = env 14 | self.env.seed(seed) 15 | obs = self.env.reset() 16 | algo_params.update({'state_dim': obs.shape[0], 17 | 'action_dim': self.env.action_space.shape[0], 18 | 'action_max': self.env.action_space.high, 19 | 'action_scaling': self.env.action_space.high[0], 20 | 'init_input_means': None, 21 | 'init_input_vars': None 22 | }) 23 | # training args 24 | self.training_episodes = algo_params['training_episodes'] 25 | self.testing_gap = algo_params['testing_gap'] 26 | self.testing_episodes = algo_params['testing_episodes'] 27 | self.saving_gap = algo_params['saving_gap'] 28 | 29 | super(SAC, self).__init__(algo_params, 30 | transition_tuple=transition_tuple, 31 | goal_conditioned=False, 32 | path=path, 33 | seed=seed) 34 | # torch 35 | self.network_dict.update({ 36 | 'actor': StochasticActor(self.state_dim, self.action_dim, log_std_min=-6, log_std_max=1, action_scaling=self.action_scaling).to(self.device), 37 | 'critic_1': Critic(self.state_dim + self.action_dim, 1).to(self.device), 38 | 'critic_1_target': Critic(self.state_dim + self.action_dim, 1).to(self.device), 39 | 'critic_2': Critic(self.state_dim + self.action_dim, 1).to(self.device), 40 | 'critic_2_target': Critic(self.state_dim + self.action_dim, 1).to(self.device), 41 | 'alpha': algo_params['alpha'], 42 | 'log_alpha': T.tensor(np.log(algo_params['alpha']), requires_grad=True, device=self.device), 43 | }) 44 | self.network_keys_to_save = ['actor', 'critic_1_target'] 45 | self.actor_optimizer = Adam(self.network_dict['actor'].parameters(), lr=self.actor_learning_rate) 46 | self.critic_1_optimizer = Adam(self.network_dict['critic_1'].parameters(), lr=self.critic_learning_rate) 47 | self.critic_2_optimizer = Adam(self.network_dict['critic_2'].parameters(), lr=self.critic_learning_rate) 48 | self._soft_update(self.network_dict['critic_1'], self.network_dict['critic_1_target'], tau=1) 49 | self._soft_update(self.network_dict['critic_2'], self.network_dict['critic_2_target'], tau=1) 50 | self.target_entropy = -self.action_dim 51 | self.alpha_optimizer = Adam([self.network_dict['log_alpha']], lr=self.actor_learning_rate) 52 | # training args 53 | self.warmup_step = algo_params['warmup_step'] 54 | self.actor_update_interval = algo_params['actor_update_interval'] 55 | self.critic_target_update_interval = algo_params['critic_target_update_interval'] 56 | # statistic dict 57 | self.statistic_dict.update({ 58 | 'episode_return': [], 59 | 'episode_test_return': [], 60 | 'alpha': [], 61 | 'policy_entropy': [], 62 | }) 63 | 64 | def run(self, test=False, render=False, load_network_ep=None, sleep=0): 65 | if test: 66 | num_episode = self.testing_episodes 67 | if load_network_ep is not None: 68 | print("Loading network parameters...") 69 | self._load_network(ep=load_network_ep) 70 | print("Start testing...") 71 | else: 72 | num_episode = self.training_episodes 73 | print("Start training...") 74 | 75 | for ep in range(num_episode): 76 | ep_return = self._interact(render, test, sleep=sleep) 77 | self.statistic_dict['episode_return'].append(ep_return) 78 | print("Episode %i" % ep, "return %0.1f" % ep_return) 79 | 80 | if (ep % self.testing_gap == 0) and (ep != 0) and (not test): 81 | ep_test_return = [] 82 | for test_ep in range(self.testing_episodes): 83 | ep_test_return.append(self._interact(render, 
test=True)) 84 | self.statistic_dict['episode_test_return'].append(sum(ep_test_return)/self.testing_episodes) 85 | print("Episode %i" % ep, "test return %0.1f" % (sum(ep_test_return)/self.testing_episodes)) 86 | 87 | if (ep % self.saving_gap == 0) and (ep != 0) and (not test): 88 | self._save_network(ep=ep) 89 | 90 | if not test: 91 | print("Finished training") 92 | print("Saving statistics...") 93 | self._plot_statistics(save_to_file=True) 94 | else: 95 | print("Finished testing") 96 | 97 | def _interact(self, render=False, test=False, sleep=0): 98 | done = False 99 | obs = self.env.reset() 100 | ep_return = 0 101 | # start a new episode 102 | while not done: 103 | if render: 104 | self.env.render() 105 | if self.env_step_count < self.warmup_step: 106 | action = self.env.action_space.sample() 107 | else: 108 | action = self._select_action(obs, test=test) 109 | new_obs, reward, done, info = self.env.step(action) 110 | time.sleep(sleep) 111 | ep_return += reward 112 | if not test: 113 | self._remember(obs, action, new_obs, reward, 1 - int(done)) 114 | if self.observation_normalization: 115 | self.normalizer.store_history(new_obs) 116 | self.normalizer.update_mean() 117 | if (self.env_step_count % self.update_interval == 0) and (self.env_step_count > self.warmup_step): 118 | self._learn() 119 | self.env_step_count += 1 120 | obs = new_obs 121 | return ep_return 122 | 123 | def _select_action(self, obs, test=False): 124 | inputs = self.normalizer(obs) 125 | inputs = T.as_tensor(inputs, dtype=T.float, device=self.device) 126 | return self.network_dict['actor'].get_action(inputs, mean_pi=test).detach().cpu().numpy() 127 | 128 | def _learn(self, steps=None): 129 | if len(self.buffer) < self.batch_size: 130 | return 131 | if steps is None: 132 | steps = self.optimizer_steps 133 | 134 | for i in range(steps): 135 | if self.prioritised: 136 | batch, weights, inds = self.buffer.sample(self.batch_size) 137 | weights = T.tensor(weights).view(self.batch_size, 1).to(self.device) 138 | else: 139 | batch = self.buffer.sample(self.batch_size) 140 | weights = T.ones(size=(self.batch_size, 1)).to(self.device) 141 | inds = None 142 | 143 | actor_inputs = self.normalizer(batch.state) 144 | actor_inputs = T.as_tensor(actor_inputs, dtype=T.float32, device=self.device) 145 | actions = T.as_tensor(batch.action, dtype=T.float32, device=self.device) 146 | critic_inputs = T.cat((actor_inputs, actions), dim=1) 147 | actor_inputs_ = self.normalizer(batch.next_state) 148 | actor_inputs_ = T.as_tensor(actor_inputs_, dtype=T.float32, device=self.device) 149 | rewards = T.as_tensor(batch.reward, dtype=T.float32, device=self.device).unsqueeze(1) 150 | done = T.as_tensor(batch.done, dtype=T.float32, device=self.device).unsqueeze(1) 151 | 152 | if self.discard_time_limit: 153 | done = done * 0 + 1 154 | 155 | with T.no_grad(): 156 | actions_, log_probs_ = self.network_dict['actor'].get_action(actor_inputs_, probs=True) 157 | critic_inputs_ = T.cat((actor_inputs_, actions_), dim=1) 158 | value_1_ = self.network_dict['critic_1_target'](critic_inputs_) 159 | value_2_ = self.network_dict['critic_2_target'](critic_inputs_) 160 | value_ = T.min(value_1_, value_2_) - (self.network_dict['alpha'] * log_probs_) 161 | value_target = rewards + done * self.gamma * value_ 162 | 163 | self.critic_1_optimizer.zero_grad() 164 | value_estimate_1 = self.network_dict['critic_1'](critic_inputs) 165 | critic_loss_1 = F.mse_loss(value_estimate_1, value_target.detach(), reduction='none') 166 | (critic_loss_1 * weights).mean().backward() 167 | 
self.critic_1_optimizer.step() 168 | 169 | if self.prioritised: 170 | assert inds is not None 171 | self.buffer.update_priority(inds, np.abs(critic_loss_1.cpu().detach().numpy())) 172 | 173 | self.critic_2_optimizer.zero_grad() 174 | value_estimate_2 = self.network_dict['critic_2'](critic_inputs) 175 | critic_loss_2 = F.mse_loss(value_estimate_2, value_target.detach(), reduction='none') 176 | (critic_loss_2 * weights).mean().backward() 177 | self.critic_2_optimizer.step() 178 | 179 | self.statistic_dict['critic_loss'].append(critic_loss_1.detach().mean()) 180 | 181 | if self.optim_step_count % self.critic_target_update_interval == 0: 182 | self._soft_update(self.network_dict['critic_1'], self.network_dict['critic_1_target']) 183 | self._soft_update(self.network_dict['critic_2'], self.network_dict['critic_2_target']) 184 | 185 | if self.optim_step_count % self.actor_update_interval == 0: 186 | self.actor_optimizer.zero_grad() 187 | new_actions, new_log_probs = self.network_dict['actor'].get_action(actor_inputs, probs=True) 188 | critic_eval_inputs = T.cat((actor_inputs, new_actions), dim=1) 189 | new_values = T.min(self.network_dict['critic_1'](critic_eval_inputs), 190 | self.network_dict['critic_2'](critic_eval_inputs)) 191 | actor_loss = (self.network_dict['alpha']*new_log_probs - new_values).mean() 192 | actor_loss.backward() 193 | self.actor_optimizer.step() 194 | 195 | self.alpha_optimizer.zero_grad() 196 | alpha_loss = (self.network_dict['log_alpha'] * (-new_log_probs - self.target_entropy).detach()).mean() 197 | alpha_loss.backward() 198 | self.alpha_optimizer.step() 199 | self.network_dict['alpha'] = self.network_dict['log_alpha'].exp() 200 | 201 | self.statistic_dict['actor_loss'].append(actor_loss.detach().mean()) 202 | self.statistic_dict['alpha'].append(self.network_dict['alpha'].detach()) 203 | self.statistic_dict['policy_entropy'].append(-new_log_probs.detach().mean()) 204 | 205 | self.optim_step_count += 1 206 | -------------------------------------------------------------------------------- /drl_implementation/agent/continuous_action/sac_goal_conditioned.py: -------------------------------------------------------------------------------- 1 | import time 2 | import numpy as np 3 | import torch as T 4 | import torch.nn.functional as F 5 | from torch.optim.adam import Adam 6 | from ..utils.networks_mlp import StochasticActor, Critic 7 | from ..agent_base import Agent 8 | 9 | 10 | class GoalConditionedSAC(Agent): 11 | def __init__(self, algo_params, env, transition_tuple=None, path=None, seed=-1): 12 | # environment 13 | self.env = env 14 | self.env.seed(seed) 15 | obs = self.env.reset() 16 | algo_params.update({'state_dim': obs['observation'].shape[0], 17 | 'goal_dim': obs['desired_goal'].shape[0], 18 | 'action_dim': self.env.action_space.shape[0], 19 | 'action_max': self.env.action_space.high, 20 | 'action_scaling': self.env.action_space.high[0], 21 | 'init_input_means': None, 22 | 'init_input_vars': None 23 | }) 24 | # training args 25 | self.training_epochs = algo_params['training_epochs'] 26 | self.training_cycles = algo_params['training_cycles'] 27 | self.training_episodes = algo_params['training_episodes'] 28 | self.testing_gap = algo_params['testing_gap'] 29 | self.testing_episodes = algo_params['testing_episodes'] 30 | self.saving_gap = algo_params['saving_gap'] 31 | 32 | super(GoalConditionedSAC, self).__init__(algo_params, 33 | transition_tuple=transition_tuple, 34 | goal_conditioned=True, 35 | path=path, 36 | seed=seed) 37 | # torch 38 | self.network_dict.update({ 39 
| 'actor': StochasticActor(self.state_dim + self.goal_dim, self.action_dim, log_std_min=-6, log_std_max=1, 40 | action_scaling=self.action_scaling).to(self.device), 41 | 'critic_1': Critic(self.state_dim + self.goal_dim + self.action_dim, 1).to(self.device), 42 | 'critic_1_target': Critic(self.state_dim + self.goal_dim + self.action_dim, 1).to(self.device), 43 | 'critic_2': Critic(self.state_dim + self.goal_dim + self.action_dim, 1).to(self.device), 44 | 'critic_2_target': Critic(self.state_dim + self.goal_dim + self.action_dim, 1).to(self.device), 45 | 'alpha': algo_params['alpha'], 46 | 'log_alpha': T.tensor(np.log(algo_params['alpha']), requires_grad=True, device=self.device), 47 | }) 48 | self.network_keys_to_save = ['actor', 'critic_1_target'] 49 | self.actor_optimizer = Adam(self.network_dict['actor'].parameters(), lr=self.actor_learning_rate) 50 | self.critic_1_optimizer = Adam(self.network_dict['critic_1'].parameters(), lr=self.critic_learning_rate) 51 | self.critic_2_optimizer = Adam(self.network_dict['critic_2'].parameters(), lr=self.critic_learning_rate) 52 | self._soft_update(self.network_dict['critic_1'], self.network_dict['critic_1_target'], tau=1) 53 | self._soft_update(self.network_dict['critic_2'], self.network_dict['critic_2_target'], tau=1) 54 | self.target_entropy = -self.action_dim 55 | self.alpha_optimizer = Adam([self.network_dict['log_alpha']], lr=self.actor_learning_rate) 56 | # training args 57 | self.clip_value = algo_params['clip_value'] 58 | self.actor_update_interval = algo_params['actor_update_interval'] 59 | self.critic_target_update_interval = algo_params['critic_target_update_interval'] 60 | # statistic dict 61 | self.statistic_dict.update({ 62 | 'cycle_return': [], 63 | 'cycle_success_rate': [], 64 | 'epoch_test_return': [], 65 | 'epoch_test_success_rate': [], 66 | 'alpha': [], 67 | 'policy_entropy': [], 68 | }) 69 | 70 | def run(self, test=False, render=False, load_network_ep=None, sleep=0): 71 | # training setup uses a hierarchy of Epoch, Cycle and Episode 72 | # following the HER paper: https://papers.nips.cc/paper/2017/hash/453fadbd8a1a3af50a9df4df899537b5-Abstract.html 73 | if test: 74 | if load_network_ep is not None: 75 | print("Loading network parameters...") 76 | self._load_network(ep=load_network_ep) 77 | print("Start testing...") 78 | else: 79 | print("Start training...") 80 | 81 | for epo in range(self.training_epochs): 82 | for cyc in range(self.training_cycles): 83 | cycle_return = 0 84 | cycle_success = 0 85 | for ep in range(self.training_episodes): 86 | ep_return = self._interact(render, test, sleep=sleep) 87 | cycle_return += ep_return 88 | if ep_return > -50: 89 | cycle_success += 1 90 | 91 | self.statistic_dict['cycle_return'].append(cycle_return / self.training_episodes) 92 | self.statistic_dict['cycle_success_rate'].append(cycle_success / self.training_episodes) 93 | print("Epoch %i" % epo, "Cycle %i" % cyc, 94 | "avg. 
return %0.1f" % (cycle_return / self.training_episodes), 95 | "success rate %0.1f" % (cycle_success / self.training_episodes)) 96 | 97 | if (epo % self.testing_gap == 0) and (epo != 0) and (not test): 98 | test_return = 0 99 | test_success = 0 100 | for test_ep in range(self.testing_episodes): 101 | ep_test_return = self._interact(render, test=True) 102 | test_return += ep_test_return 103 | if ep_test_return > -50: 104 | test_success += 1 105 | self.statistic_dict['epoch_test_return'].append(test_return / self.testing_episodes) 106 | self.statistic_dict['epoch_test_success_rate'].append(test_success / self.testing_episodes) 107 | print("Epoch %i" % epo, "test avg. return %0.1f" % (test_return / self.testing_episodes)) 108 | 109 | if (epo % self.saving_gap == 0) and (epo != 0) and (not test): 110 | self._save_network(ep=epo) 111 | 112 | if not test: 113 | print("Finished training") 114 | print("Saving statistics...") 115 | self._plot_statistics( 116 | x_labels={ 117 | 'critic_loss': 'Optimization epoch (per ' + str(self.optimizer_steps) + ' steps)', 118 | 'actor_loss': 'Optimization epoch (per ' + str(self.optimizer_steps) + ' steps)', 119 | 'alpha': 'Optimization epoch (per ' + str(self.optimizer_steps) + ' steps)', 120 | 'policy_entropy': 'Optimization epoch (per ' + str(self.optimizer_steps) + ' steps)' 121 | }, 122 | save_to_file=True) 123 | else: 124 | print("Finished testing") 125 | 126 | def _interact(self, render=False, test=False, sleep=0): 127 | done = False 128 | obs = self.env.reset() 129 | ep_return = 0 130 | new_episode = True 131 | # start a new episode 132 | while not done: 133 | if render: 134 | self.env.render() 135 | action = self._select_action(obs, test=test) 136 | new_obs, reward, done, info = self.env.step(action) 137 | time.sleep(sleep) 138 | ep_return += reward 139 | if not test: 140 | self._remember(obs['observation'], obs['desired_goal'], action, 141 | new_obs['observation'], new_obs['achieved_goal'], reward, 1 - int(done), 142 | new_episode=new_episode) 143 | if self.observation_normalization: 144 | self.normalizer.store_history(np.concatenate((new_obs['observation'], 145 | new_obs['achieved_goal']), axis=0)) 146 | obs = new_obs 147 | new_episode = False 148 | 149 | if not test: 150 | self.normalizer.update_mean() 151 | self._learn() 152 | return ep_return 153 | 154 | def _select_action(self, obs, test=False): 155 | inputs = np.concatenate((obs['observation'], obs['desired_goal']), axis=0) 156 | inputs = self.normalizer(inputs) 157 | inputs = T.as_tensor(inputs, dtype=T.float).to(self.device) 158 | return self.network_dict['actor'].get_action(inputs, mean_pi=test).detach().cpu().numpy() 159 | 160 | def _learn(self, steps=None): 161 | if self.hindsight: 162 | self.buffer.modify_episodes() 163 | self.buffer.store_episodes() 164 | if len(self.buffer) < self.batch_size: 165 | return 166 | if steps is None: 167 | steps = self.optimizer_steps 168 | 169 | critic_losses = T.zeros(1, device=self.device) 170 | actor_losses = T.zeros(1, device=self.device) 171 | alphas = T.zeros(1, device=self.device) 172 | policy_entropies = T.zeros(1, device=self.device) 173 | for i in range(steps): 174 | if self.prioritised: 175 | batch, weights, inds = self.buffer.sample(self.batch_size) 176 | weights = T.as_tensor(weights, device=self.device).view(self.batch_size, 1) 177 | else: 178 | batch = self.buffer.sample(self.batch_size) 179 | weights = T.ones(size=(self.batch_size, 1), device=self.device) 180 | inds = None 181 | 182 | actor_inputs = np.concatenate((batch.state, 
batch.desired_goal), axis=1) 183 | actor_inputs = self.normalizer(actor_inputs) 184 | actor_inputs = T.as_tensor(actor_inputs, dtype=T.float32, device=self.device) 185 | actions_np = np.array(batch.action) 186 | actions = T.as_tensor(actions_np, dtype=T.float32, device=self.device) 187 | critic_inputs = T.cat((actor_inputs, actions), dim=1) 188 | actor_inputs_ = np.concatenate((batch.next_state, batch.desired_goal), axis=1) 189 | actor_inputs_ = self.normalizer(actor_inputs_) 190 | actor_inputs_ = T.as_tensor(actor_inputs_, dtype=T.float32, device=self.device) 191 | rewards_np = np.array(batch.reward) 192 | rewards = T.as_tensor(rewards_np, dtype=T.float32, device=self.device).unsqueeze(1) 193 | done_np = np.array(batch.done) 194 | done = T.as_tensor(done_np, dtype=T.float32, device=self.device).unsqueeze(1) 195 | 196 | if self.discard_time_limit: 197 | done = done * 0 + 1 198 | 199 | with T.no_grad(): 200 | actions_, log_probs_ = self.network_dict['actor'].get_action(actor_inputs_, probs=True) 201 | critic_inputs_ = T.cat((actor_inputs_, actions_), dim=1) 202 | value_1_ = self.network_dict['critic_1_target'](critic_inputs_) 203 | value_2_ = self.network_dict['critic_2_target'](critic_inputs_) 204 | value_ = T.min(value_1_, value_2_) - (self.network_dict['alpha'] * log_probs_) 205 | value_target = rewards + done * self.gamma * value_ 206 | value_target = T.clamp(value_target, -self.clip_value, 0.0) 207 | 208 | self.critic_1_optimizer.zero_grad() 209 | value_estimate_1 = self.network_dict['critic_1'](critic_inputs) 210 | critic_loss_1 = F.mse_loss(value_estimate_1, value_target.detach(), reduction='none') 211 | (critic_loss_1 * weights).mean().backward() 212 | self.critic_1_optimizer.step() 213 | 214 | if self.prioritised: 215 | assert inds is not None 216 | self.buffer.update_priority(inds, np.abs(critic_loss_1.cpu().detach().numpy())) 217 | 218 | self.critic_2_optimizer.zero_grad() 219 | value_estimate_2 = self.network_dict['critic_2'](critic_inputs) 220 | critic_loss_2 = F.mse_loss(value_estimate_2, value_target.detach(), reduction='none') 221 | (critic_loss_2 * weights).mean().backward() 222 | self.critic_2_optimizer.step() 223 | 224 | critic_losses += critic_loss_1.detach().mean() 225 | 226 | if self.optim_step_count % self.critic_target_update_interval == 0: 227 | self._soft_update(self.network_dict['critic_1'], self.network_dict['critic_1_target']) 228 | self._soft_update(self.network_dict['critic_2'], self.network_dict['critic_2_target']) 229 | 230 | if self.optim_step_count % self.actor_update_interval == 0: 231 | self.actor_optimizer.zero_grad() 232 | new_actions, new_log_probs, entropy = self.network_dict['actor'].get_action(actor_inputs, probs=True, 233 | entropy=True) 234 | critic_eval_inputs = T.cat((actor_inputs, new_actions), dim=1).to(self.device) 235 | new_values = T.min(self.network_dict['critic_1'](critic_eval_inputs), 236 | self.network_dict['critic_2'](critic_eval_inputs)) 237 | actor_loss = (self.network_dict['alpha'] * new_log_probs - new_values).mean() 238 | actor_loss.backward() 239 | self.actor_optimizer.step() 240 | 241 | self.alpha_optimizer.zero_grad() 242 | alpha_loss = (self.network_dict['log_alpha'] * (-new_log_probs - self.target_entropy).detach()).mean() 243 | alpha_loss.backward() 244 | self.alpha_optimizer.step() 245 | self.network_dict['alpha'] = self.network_dict['log_alpha'].exp() 246 | 247 | actor_losses += actor_loss.detach() 248 | alphas += self.network_dict['alpha'].detach() 249 | policy_entropies += entropy.detach().mean() 250 | 251 | 
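The goal-conditioned variant additionally clamps the bootstrapped target to `[-clip_value, 0]`. Assuming the sparse reward convention of the HER-style environments (a reward of -1 per step until the goal is reached, 0 otherwise), the discounted return is bounded below by `-1 / (1 - gamma)` and above by 0, so the clamp keeps value targets inside the feasible range; the `ep_return > -50` success check in `run` relies on the same convention. A small illustrative calculation with assumed values (not taken from this file):

```python
import torch as T

# Assumed sparse reward convention: r = -1 per step before success, 0 afterwards.
gamma = 0.98
clip_value = 1.0 / (1.0 - gamma)          # ~= 50.0, the tightest bound on |discounted return|
raw_targets = T.tensor([[-120.0], [-3.7], [4.2]])
clipped = T.clamp(raw_targets, -clip_value, 0.0)
print(clipped)  # ~= tensor([[-50.0], [-3.7], [0.0]]): targets stay within the feasible range
```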
self.optim_step_count += 1 252 | 253 | self.statistic_dict['critic_loss'].append(critic_losses / steps) 254 | self.statistic_dict['actor_loss'].append(actor_losses / steps) 255 | self.statistic_dict['alpha'].append(alphas / steps) 256 | self.statistic_dict['policy_entropy'].append(policy_entropies / steps) 257 | -------------------------------------------------------------------------------- /drl_implementation/agent/continuous_action/sac_parameterised_action_goal_conditioned.py: -------------------------------------------------------------------------------- 1 | import time 2 | import numpy as np 3 | import torch as T 4 | import torch.nn.functional as F 5 | from torch.optim.adam import Adam 6 | from ..utils.networks_mlp import StochasticActor 7 | from ..utils.networks_pointnet import CriticPointNet, CriticPointNet2 8 | from ..agent_base import Agent 9 | from collections import namedtuple 10 | 11 | 12 | class GPASAC(Agent): 13 | def __init__(self, algo_params, env, transition_tuple=None, path=None, seed=-1): 14 | # environment 15 | self.env = env 16 | self.env.seed(seed) 17 | obs = self.env.reset() 18 | algo_params.update({'state_shape': obs['observation'].shape, 19 | 'goal_shape': obs['desired_goal'].shape, 20 | 'discrete_action_dim': self.env.discrete_action_space.n, 21 | 'continuous_action_dim': self.env.continuous_action_space.shape[0], 22 | 'continuous_action_max': self.env.continuous_action_space.high, 23 | 'continuous_action_scaling': self.env.continuous_action_space.high[0], 24 | }) 25 | # training args 26 | self.cur_ep = 0 27 | self.warmup_step = algo_params['warmup_step'] 28 | self.training_episodes = algo_params['training_episodes'] 29 | self.testing_gap = algo_params['testing_gap'] 30 | self.testing_episodes = algo_params['testing_episodes'] 31 | self.saving_gap = algo_params['saving_gap'] 32 | 33 | self.use_demonstrations = algo_params['use_demonstrations'] 34 | self.demonstrate_percentage = algo_params['demonstrate_percentage'] 35 | assert 0 < self.demonstrate_percentage < 1, "Demonstrate percentage should be between 0 and 1" 36 | self.n_demonstrate_episodes = int(self.demonstrate_percentage * self.training_episodes) 37 | self.planned_skills = algo_params['planned_skills'] 38 | assert not (self.use_demonstrations and self.planned_skills), "Cannot demonstrate and planned skills at the same time" 39 | self.skill_plan = algo_params['skill_plan'] 40 | self.use_planned_skills = False 41 | 42 | if transition_tuple is None: 43 | transition_tuple = namedtuple("transition", 44 | ('state', 'desired_goal', 'action', 45 | 'next_state', 'achieved_goal', 'reward', 'done', 'next_skill')) 46 | super(GPASAC, self).__init__(algo_params, 47 | non_flat_obs=True, 48 | action_type='hybrid', 49 | transition_tuple=transition_tuple, 50 | goal_conditioned=True, 51 | path=path, 52 | seed=seed, 53 | create_logger=True) 54 | # torch 55 | self.network_dict.update({ 56 | 'discrete_actor': StochasticActor(2048, self.discrete_action_dim, continuous=False, 57 | fc1_size=1024, 58 | log_std_min=-6, log_std_max=1).to(self.device), 59 | 'continuous_actor': StochasticActor(2048 + self.discrete_action_dim, self.continuous_action_dim, 60 | fc1_size=1024, 61 | log_std_min=-6, log_std_max=1, 62 | action_scaling=self.continuous_action_scaling).to(self.device), 63 | 'critic_1': CriticPointNet(output_dim=1, action_dim=self.discrete_action_dim+self.continuous_action_dim).to(self.device), 64 | 'critic_1_target': CriticPointNet(output_dim=1, action_dim=self.discrete_action_dim+self.continuous_action_dim).to(self.device), 65 
| 'critic_2': CriticPointNet(output_dim=1, action_dim=self.discrete_action_dim+self.continuous_action_dim).to(self.device), 66 | 'critic_2_target': CriticPointNet(output_dim=1, action_dim=self.discrete_action_dim+self.continuous_action_dim).to(self.device), 67 | 'alpha_discrete': algo_params['alpha'], 68 | 'log_alpha_discrete': T.tensor(np.log(algo_params['alpha']), requires_grad=True, device=self.device), 69 | 'alpha_continuous': algo_params['alpha'], 70 | 'log_alpha_continuous': T.tensor(np.log(algo_params['alpha']), requires_grad=True, device=self.device), 71 | }) 72 | self.network_dict['critic_1_target'].eval() 73 | self.network_dict['critic_2_target'].eval() 74 | self.network_keys_to_save = ['discrete_actor', 'continuous_actor', 'critic_1', 'critic_1_target'] 75 | self.discrete_actor_optimizer = Adam(self.network_dict['discrete_actor'].parameters(), 76 | lr=self.actor_learning_rate) 77 | self.continuous_actor_optimizer = Adam(self.network_dict['continuous_actor'].parameters(), 78 | lr=self.actor_learning_rate) 79 | self.critic_1_optimizer = Adam(self.network_dict['critic_1'].parameters(), lr=self.critic_learning_rate) 80 | self.critic_2_optimizer = Adam(self.network_dict['critic_2'].parameters(), lr=self.critic_learning_rate) 81 | self._soft_update(self.network_dict['critic_1'], self.network_dict['critic_1_target'], tau=1) 82 | self._soft_update(self.network_dict['critic_2'], self.network_dict['critic_2_target'], tau=1) 83 | self.target_discrete_entropy = -self.discrete_action_dim 84 | self.target_continuous_entropy = -self.continuous_action_dim 85 | self.alpha_discrete_optimizer = Adam([self.network_dict['log_alpha_discrete']], lr=self.actor_learning_rate) 86 | self.alpha_continuous_optimizer = Adam([self.network_dict['log_alpha_continuous']], lr=self.actor_learning_rate) 87 | # training args 88 | # self.clip_value = algo_params['clip_value'] 89 | self.actor_update_interval = algo_params['actor_update_interval'] 90 | self.critic_target_update_interval = algo_params['critic_target_update_interval'] 91 | 92 | def run(self, test=False, render=False, load_network_ep=None, sleep=0): 93 | if test: 94 | if load_network_ep is not None: 95 | print("Loading network parameters...") 96 | self._load_network(ep=load_network_ep) 97 | print("Start testing...") 98 | else: 99 | print("Start training...") 100 | 101 | for ep in range(self.training_episodes): 102 | if self.use_demonstrations and (ep < self.n_demonstrate_episodes): 103 | self.use_planned_skills = True 104 | elif self.planned_skills: 105 | self.use_planned_skills = True 106 | else: 107 | self.use_planned_skills = False 108 | self.cur_ep = ep 109 | loss_info = self._interact(render, test, sleep=sleep) 110 | self.logger.add_scalar(tag='Task/return', scalar_value=loss_info['emd_loss'], global_step=self.cur_ep) 111 | self.logger.add_scalar(tag='Task/heightmap_loss', scalar_value=loss_info['height_map_loss'], global_step=ep) 112 | print("Episode %i" % ep, "return %0.1f" % loss_info['emd_loss']) 113 | if not test and self.hindsight: 114 | self.buffer.hindsight() 115 | 116 | if (ep % self.testing_gap == 0) and (ep != 0) and (not test): 117 | if self.planned_skills: 118 | self.use_planned_skills = True 119 | else: 120 | self.use_planned_skills = False 121 | test_return = 0 122 | test_heightmap_loss = 0 123 | for test_ep in range(self.testing_episodes): 124 | loss_info = self._interact(render, test=True) 125 | test_return += loss_info['emd_loss'] 126 | test_heightmap_loss += loss_info['height_map_loss'] 127 | 
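GPASAC acts with a hybrid (parameterised) action: the first entry passed to `env.step` is a discrete skill index and the remaining entries are the continuous parameters, while the critics consume the one-hot encoding of the skill concatenated with the continuous part (see `_select_action` and `_learn` below). A minimal sketch of that encode step, with dummy dimensions and values (not code from this file):

```python
import numpy as np
import torch as T
import torch.nn.functional as F

n_skills, continuous_dim = 4, 3                                  # dummy dimensions
env_action = np.concatenate([[2], np.array([0.1, -0.5, 0.8])])   # skill index + parameters
assert env_action.shape[0] == 1 + continuous_dim

# What the learner does with a stored hybrid action: split it, one-hot the skill,
# and re-concatenate before feeding the critics.
action = T.as_tensor(env_action, dtype=T.float32).unsqueeze(0)   # shape (1, 1 + continuous_dim)
skill = action[:, 0].type(T.long)                                # discrete part as an index
skill_onehot = F.one_hot(skill, n_skills).float()                # shape (1, n_skills)
critic_action = T.cat((skill_onehot, action[:, 1:]), dim=1)
print(critic_action.shape)                                       # torch.Size([1, 7])
```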
self.logger.add_scalar(tag='Task/test_return', 128 | scalar_value=(test_return / self.testing_episodes), global_step=self.cur_ep) 129 | self.logger.add_scalar(tag='Task/test_heightmap_loss', 130 | scalar_value=(test_heightmap_loss / self.testing_episodes), global_step=self.cur_ep) 131 | 132 | print("Episode %i" % ep, "test avg. return %0.1f" % (test_return / self.testing_episodes)) 133 | 134 | if (ep % self.saving_gap == 0) and (ep != 0) and (not test): 135 | self._save_network(ep=ep) 136 | 137 | if not test: 138 | print("Finished training") 139 | print("Saving statistics...") 140 | else: 141 | print("Finished testing") 142 | 143 | def _interact(self, render=False, test=False, sleep=0): 144 | done = False 145 | obs = self.env.reset() 146 | ep_return = 0 147 | new_episode = True 148 | # start a new episode 149 | while not done: 150 | if render: 151 | self.env.render() 152 | if self.total_env_step_count < self.warmup_step: 153 | if self.use_planned_skills: 154 | discrete_action = self.skill_plan[self.env.step_count] 155 | else: 156 | discrete_action = self.env.discrete_action_space.sample() 157 | continuous_action = self.env.continuous_action_space.sample() 158 | action = np.concatenate([[discrete_action], continuous_action], axis=0) 159 | else: 160 | action = self._select_action(obs, test=test) 161 | new_obs, reward, done, info = self.env.step(action) 162 | time.sleep(sleep) 163 | ep_return += reward 164 | 165 | next_skill = 0 166 | if self.planned_skills: 167 | try: 168 | next_skill = self.skill_plan[self.env.step_count] 169 | except: 170 | pass 171 | 172 | if not test: 173 | self._remember(obs['observation'], obs['desired_goal'], action, 174 | new_obs['observation'], new_obs['achieved_goal'], reward, 1 - int(done), next_skill, 175 | new_episode=new_episode) 176 | self.total_env_step_count += 1 177 | self._learn(steps=1) 178 | 179 | obs = new_obs 180 | new_episode = False 181 | 182 | return info 183 | 184 | def _select_action(self, obs, test=False): 185 | obs_points = T.as_tensor([obs['observation']], dtype=T.float).to(self.device) 186 | goal_points = T.as_tensor([obs['desired_goal']], dtype=T.float).to(self.device) 187 | obs_point_features = self.network_dict['critic_1_target'].get_features(obs_points.transpose(2, 1)) 188 | goal_point_features = self.network_dict['critic_1_target'].get_features(goal_points.transpose(2, 1)) 189 | inputs = T.cat((obs_point_features, goal_point_features), dim=1) 190 | if self.use_planned_skills: 191 | discrete_action = T.as_tensor([self.skill_plan[self.env.step_count]], dtype=T.long).to(self.device) 192 | else: 193 | discrete_action, _, _ = self.network_dict['discrete_actor'].get_action(inputs, greedy=test) 194 | discrete_action.type(T.long).flatten() 195 | discrete_action_onehot = F.one_hot(discrete_action, self.discrete_action_dim).float() 196 | inputs = T.cat((inputs, discrete_action_onehot), dim=1) 197 | continuous_action = self.network_dict['continuous_actor'].get_action(inputs, mean_pi=test).detach().cpu().numpy() 198 | return np.concatenate([discrete_action.detach().cpu().numpy(), continuous_action[0]], axis=0) 199 | 200 | def _learn(self, steps=None): 201 | if len(self.buffer) < self.batch_size: 202 | return 203 | if steps is None: 204 | steps = self.optimizer_steps 205 | 206 | avg_critic_1_loss = T.zeros(1, device=self.device) 207 | avg_critic_2_loss = T.zeros(1, device=self.device) 208 | avg_discrete_actor_loss = T.zeros(1, device=self.device) 209 | avg_discrete_alpha = T.zeros(1, device=self.device) 210 | avg_discrete_policy_entropy = T.zeros(1, 
device=self.device) 211 | avg_continuous_actor_loss = T.zeros(1, device=self.device) 212 | avg_continuous_alpha = T.zeros(1, device=self.device) 213 | avg_continuous_policy_entropy = T.zeros(1, device=self.device) 214 | for i in range(steps): 215 | if self.prioritised: 216 | batch, weights, inds = self.buffer.sample(self.batch_size) 217 | weights = T.as_tensor(weights, device=self.device).view(self.batch_size, 1) 218 | else: 219 | batch = self.buffer.sample(self.batch_size) 220 | weights = T.ones(size=(self.batch_size, 1), device=self.device) 221 | inds = None 222 | 223 | obs = T.as_tensor(batch.state, dtype=T.float32, device=self.device).transpose(2, 1) 224 | obs_features = self.network_dict['critic_1_target'].get_features(obs, detach=True) 225 | goal = T.as_tensor(batch.desired_goal, dtype=T.float32, device=self.device).transpose(2, 1) 226 | goal_features = self.network_dict['critic_1_target'].get_features(goal, detach=True) 227 | obs_ = T.as_tensor(batch.next_state, dtype=T.float32, device=self.device).transpose(2, 1) 228 | obs_features_ = self.network_dict['critic_1_target'].get_features(obs_, detach=True) 229 | actor_inputs_ = T.cat((obs_features_, goal_features), dim=1) 230 | actions = T.as_tensor(batch.action, dtype=T.float32, device=self.device) 231 | discrete_actions = actions[:, 0].type(T.long) 232 | discrete_actions_onehot = F.one_hot(discrete_actions, self.discrete_action_dim).float() 233 | actions = T.cat((discrete_actions_onehot, actions[:, 1:]), dim=1) 234 | rewards = T.as_tensor(np.array(batch.reward), dtype=T.float32, device=self.device).unsqueeze(1) 235 | done = T.as_tensor(np.array(batch.done), dtype=T.float32, device=self.device).unsqueeze(1) 236 | 237 | if self.discard_time_limit: 238 | done = done * 0 + 1 239 | 240 | with T.no_grad(): 241 | if not self.planned_skills: 242 | discrete_actions_, discrete_actions_log_probs_, _ = self.network_dict['discrete_actor'].get_action( 243 | actor_inputs_) 244 | discrete_actions_onehot_ = F.one_hot(discrete_actions_.flatten(), self.discrete_action_dim).float() 245 | else: 246 | discrete_actions_planned_ = T.as_tensor(batch.next_skill, dtype=T.long, device=self.device) 247 | discrete_actions_planned_onehot_ = F.one_hot(discrete_actions_planned_, self.discrete_action_dim).float() 248 | discrete_actions_onehot_ = discrete_actions_planned_onehot_ 249 | discrete_actions_log_probs_ = T.ones(size=(self.batch_size, 1), device=self.device, dtype=T.float32) 250 | 251 | actor_inputs_ = T.cat((actor_inputs_, discrete_actions_onehot_), dim=1) 252 | continuous_actions_, continuous_actions_log_probs_ = self.network_dict[ 253 | 'continuous_actor'].get_action(actor_inputs_, probs=True) 254 | actions_ = T.cat((discrete_actions_onehot_, continuous_actions_), dim=1) 255 | 256 | value_1_ = self.network_dict['critic_1_target'](obs_, actions_, goal) 257 | value_2_ = self.network_dict['critic_2_target'](obs_, actions_, goal) 258 | value_ = T.min(value_1_, value_2_) - \ 259 | (self.network_dict['alpha_discrete'] * discrete_actions_log_probs_) - \ 260 | (self.network_dict['alpha_continuous'] * continuous_actions_log_probs_) 261 | value_target = rewards + done * self.gamma * value_ 262 | # value_target = T.clamp(value_target, -self.clip_value, 0.0) 263 | 264 | self.critic_1_optimizer.zero_grad() 265 | value_estimate_1 = self.network_dict['critic_1'](obs, actions, goal) 266 | critic_loss_1 = F.mse_loss(value_estimate_1, value_target.detach(), reduction='none') 267 | (critic_loss_1 * weights).mean().backward() 268 | self.critic_1_optimizer.step() 269 | 270 | 
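The prioritised-replay branch that follows repeats the pattern used throughout this package: the critic loss is computed per sample (`reduction='none'`), scaled by the importance-sampling weights returned by `buffer.sample`, and the absolute per-sample errors are written back via `buffer.update_priority`. A sketch with dummy data (the real weights and indices come from `PrioritisedReplayBuffer`, which is not reproduced here):

```python
import numpy as np
import torch as T
import torch.nn.functional as F

# Dummy per-sample errors and importance-sampling weights standing in for what
# the prioritised buffer and the target computation would provide.
batch_size = 4
value_estimate = T.randn(batch_size, 1, requires_grad=True)
value_target = T.randn(batch_size, 1)
weights = T.rand(batch_size, 1)

per_sample_loss = F.mse_loss(value_estimate, value_target, reduction='none')
weighted_loss = (per_sample_loss * weights).mean()    # the quantity .backward() is called on
weighted_loss.backward()

# New priorities: absolute per-sample errors, as passed to buffer.update_priority(inds, ...)
new_priorities = np.abs(per_sample_loss.detach().numpy())
print(weighted_loss.item(), new_priorities.shape)     # scalar loss, (4, 1)
```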
if self.prioritised: 271 | assert inds is not None 272 | self.buffer.update_priority(inds, np.abs(critic_loss_1.cpu().detach().numpy())) 273 | 274 | self.critic_2_optimizer.zero_grad() 275 | value_estimate_2 = self.network_dict['critic_2'](obs, actions, goal) 276 | critic_loss_2 = F.mse_loss(value_estimate_2, value_target.detach(), reduction='none') 277 | (critic_loss_2 * weights).mean().backward() 278 | self.critic_2_optimizer.step() 279 | 280 | avg_critic_1_loss += critic_loss_1.detach().mean() 281 | avg_critic_2_loss += critic_loss_2.detach().mean() 282 | 283 | if self.optim_step_count % self.critic_target_update_interval == 0: 284 | self._soft_update(self.network_dict['critic_1'], self.network_dict['critic_1_target']) 285 | self._soft_update(self.network_dict['critic_2'], self.network_dict['critic_2_target']) 286 | 287 | if self.optim_step_count % self.actor_update_interval == 0: 288 | self.discrete_actor_optimizer.zero_grad() 289 | self.continuous_actor_optimizer.zero_grad() 290 | actor_inputs = T.cat((obs_features, goal_features), dim=1) 291 | if not self.planned_skills: 292 | new_discrete_actions, new_discrete_action_log_probs, new_discrete_action_entropy = \ 293 | self.network_dict['discrete_actor'].get_action(actor_inputs) 294 | new_discrete_actions_onehot = F.one_hot(new_discrete_actions.flatten(), self.discrete_action_dim).float() 295 | else: 296 | new_discrete_actions_onehot = discrete_actions_onehot 297 | 298 | new_continuous_actions, new_continuous_action_log_probs, new_continuous_action_entropy = \ 299 | self.network_dict['continuous_actor'].get_action( 300 | T.cat((actor_inputs, new_discrete_actions_onehot), dim=1), probs=True, entropy=True) 301 | new_actions = T.cat((new_discrete_actions_onehot, new_continuous_actions), dim=1) 302 | 303 | new_values = T.min(self.network_dict['critic_1'](obs, new_actions, goal), 304 | self.network_dict['critic_2'](obs, new_actions, goal)) 305 | 306 | if not self.planned_skills: 307 | discrete_actor_loss = ( 308 | self.network_dict['alpha_discrete'] * new_discrete_action_log_probs - new_values).mean() 309 | discrete_actor_loss.backward(retain_graph=True) 310 | self.discrete_actor_optimizer.step() 311 | 312 | self.alpha_discrete_optimizer.zero_grad() 313 | discrete_alpha_loss = (self.network_dict['log_alpha_discrete'] * ( 314 | -new_discrete_action_log_probs - self.target_discrete_entropy).detach()).mean() 315 | discrete_alpha_loss.backward() 316 | self.alpha_discrete_optimizer.step() 317 | self.network_dict['alpha_discrete'] = self.network_dict['log_alpha_discrete'].exp() 318 | 319 | avg_discrete_actor_loss += discrete_actor_loss.detach() 320 | avg_discrete_alpha += self.network_dict['alpha_discrete'].detach() 321 | avg_discrete_policy_entropy += new_discrete_action_entropy.detach().mean() 322 | 323 | continuous_actor_loss = ( 324 | self.network_dict['alpha_continuous'] * new_continuous_action_log_probs - new_values).mean() 325 | continuous_actor_loss.backward() 326 | self.continuous_actor_optimizer.step() 327 | 328 | self.alpha_continuous_optimizer.zero_grad() 329 | continuous_alpha_loss = (self.network_dict['log_alpha_continuous'] * ( 330 | -new_continuous_action_log_probs - self.target_continuous_entropy).detach()).mean() 331 | continuous_alpha_loss.backward() 332 | self.alpha_continuous_optimizer.step() 333 | self.network_dict['alpha_continuous'] = self.network_dict['log_alpha_continuous'].exp() 334 | 335 | avg_continuous_actor_loss += continuous_actor_loss.detach() 336 | avg_continuous_alpha += 
self.network_dict['alpha_continuous'].detach() 337 | avg_continuous_policy_entropy += new_continuous_action_entropy.detach().mean() 338 | 339 | self.optim_step_count += 1 340 | 341 | self.logger.add_scalar(tag='Critic/critic_1_loss', scalar_value=avg_critic_1_loss / steps, global_step=self.cur_ep) 342 | self.logger.add_scalar(tag='Critic/critic_2_loss', scalar_value=avg_critic_2_loss / steps, global_step=self.cur_ep) 343 | if not self.planned_skills: 344 | self.logger.add_scalar(tag='Actor/discrete_actor_loss', scalar_value=avg_discrete_actor_loss / steps, global_step=self.cur_ep) 345 | self.logger.add_scalar(tag='Actor/discrete_alpha', scalar_value=avg_discrete_alpha / steps, global_step=self.cur_ep) 346 | self.logger.add_scalar(tag='Actor/discrete_policy_entropy', scalar_value=avg_discrete_policy_entropy / steps, 347 | global_step=self.cur_ep) 348 | self.logger.add_scalar(tag='Actor/continuous_actor_loss', scalar_value=avg_continuous_actor_loss / steps, global_step=self.cur_ep) 349 | self.logger.add_scalar(tag='Actor/continuous_alpha', scalar_value=avg_continuous_alpha / steps, global_step=self.cur_ep) 350 | self.logger.add_scalar(tag='Actor/continuous_policy_entropy', scalar_value=avg_continuous_policy_entropy / steps, 351 | global_step=self.cur_ep) 352 | -------------------------------------------------------------------------------- /drl_implementation/agent/continuous_action/sac_pointnet.py: -------------------------------------------------------------------------------- 1 | import time 2 | import numpy as np 3 | import torch as T 4 | import torch.nn.functional as F 5 | from torch.optim.adam import Adam 6 | from ..utils.networks_mlp import StochasticActor 7 | from ..utils.networks_pointnet import CriticPointNet 8 | from ..agent_base import Agent 9 | from ..utils.exploration_strategy import GaussianNoise 10 | from collections import namedtuple 11 | 12 | 13 | class PointnetSAC(Agent): 14 | def __init__(self, algo_params, env, logging=None, transition_tuple=None, path=None, seed=-1): 15 | # environment 16 | self.env = env 17 | self.env.seed(seed) 18 | obs = self.env.reset() 19 | algo_params.update({'state_shape': obs['observation'].shape, 20 | 'goal_shape': obs['desired_goal'].shape, 21 | 'action_dim': self.env.action_space.shape[0], 22 | 'action_max': self.env.action_space.high, 23 | 'action_scaling': self.env.action_space.high[0], 24 | }) 25 | self.onestep = True if self.env.horizon == 1 else False 26 | # training args 27 | self.cur_ep = 0 28 | self.warmup_step = algo_params['warmup_step'] 29 | self.training_episodes = algo_params['training_episodes'] 30 | self.testing_gap = algo_params['testing_gap'] 31 | self.testing_episodes = algo_params['testing_episodes'] 32 | self.saving_gap = algo_params['saving_gap'] 33 | if transition_tuple is None: 34 | transition_tuple = namedtuple('transition', 35 | ['state', 'desired_goal', 'action', 'achieved_goal', 'reward']) 36 | super(PointnetSAC, self).__init__(algo_params, non_flat_obs=True, 37 | action_type='continuous', 38 | transition_tuple=transition_tuple, 39 | goal_conditioned=True, 40 | path=path, 41 | seed=seed, 42 | logging=logging, 43 | create_logger=True) 44 | # torch 45 | self.network_dict.update({ 46 | 'actor': StochasticActor(2048, self.action_dim, 47 | fc1_size=1024, log_std_min=-6, log_std_max=1, 48 | action_scaling=self.action_scaling).to(self.device), 49 | 'critic_1': CriticPointNet(output_dim=1, action_dim=self.action_dim).to(self.device), 50 | 'critic_2': CriticPointNet(output_dim=1, action_dim=self.action_dim).to(self.device), 
51 | 'critic_target': CriticPointNet(output_dim=1, action_dim=self.action_dim).to(self.device), 52 | 'alpha': algo_params['alpha'], 53 | 'log_alpha': T.tensor(np.log(algo_params['alpha']), requires_grad=True, device=self.device), 54 | }) 55 | self.network_dict['critic_target'].eval() 56 | self._soft_update(self.network_dict['critic_1'], self.network_dict['critic_target'], tau=1) 57 | if not self.onestep: 58 | self.network_dict.update( 59 | {'critic_target_2': CriticPointNet(output_dim=1, action_dim=self.action_dim).to(self.device)}) 60 | self.network_dict['critic_target_2'].eval() 61 | self._soft_update(self.network_dict['critic_2'], self.network_dict['critic_target_2'], tau=1) 62 | 63 | self.network_keys_to_save = ['actor', 'critic_1'] 64 | self.actor_optimizer = Adam(self.network_dict['actor'].parameters(), lr=self.actor_learning_rate) 65 | self.critic_1_optimizer = Adam(self.network_dict['critic_1'].parameters(), lr=self.critic_learning_rate) 66 | self.critic_2_optimizer = Adam(self.network_dict['critic_2'].parameters(), lr=self.critic_learning_rate) 67 | self.target_entropy = -self.action_dim 68 | self.alpha_optimizer = Adam([self.network_dict['log_alpha']], lr=self.actor_learning_rate) 69 | # training args 70 | self.actor_update_interval = algo_params['actor_update_interval'] 71 | self.use_demonstrations = algo_params['use_demonstrations'] 72 | self.demonstrate_percentage = algo_params['demonstrate_percentage'] 73 | assert 0 < self.demonstrate_percentage < 1, "Demonstrate percentage should be between 0 and 1" 74 | self.n_demonstrate_episodes = int(self.demonstrate_percentage * self.training_episodes) 75 | self.demonstration_action = np.asarray(algo_params['demonstration_action'], dtype=np.float32) 76 | self.gaussian_noise = GaussianNoise(action_dim=self.action_dim, action_max=self.action_max, 77 | sigma=0.1, rng=self.rng) 78 | 79 | def run(self, test=False, render=False, load_network_ep=None, sleep=0, get_action=False): 80 | if test: 81 | num_episode = self.testing_episodes 82 | if load_network_ep is not None: 83 | print("Loading network parameters...") 84 | self._load_network(ep=load_network_ep) 85 | print("Start testing...") 86 | if get_action: 87 | obs = self.env.reset() 88 | action = self._select_action(obs, test=True) 89 | return action 90 | else: 91 | num_episode = self.training_episodes 92 | print("Start training...") 93 | self.logging.info("Start training...") 94 | 95 | for ep in range(num_episode): 96 | self.cur_ep = ep 97 | loss_info = self._interact(render, test, sleep=sleep) 98 | print("Episode %i" % ep) 99 | self.logging.info("Episode %i" % ep) 100 | print("emd loss %0.1f" % loss_info['emd_loss']) 101 | self.logging.info("emd loss %0.1f" % loss_info['emd_loss']) 102 | self.logger.add_scalar(tag='Task/emd_loss', scalar_value=loss_info['emd_loss'], global_step=ep) 103 | try: 104 | print("heightmap loss %0.1f" % loss_info['height_map_loss']) 105 | self.logger.add_scalar(tag='Task/heightmap_loss', scalar_value=loss_info['height_map_loss'], global_step=ep) 106 | self.logging.info("heightmap loss %0.1f" % loss_info['height_map_loss']) 107 | except: 108 | pass 109 | GPU_memory = self.get_gpu_memory() 110 | self.logger.add_scalar(tag='System/Free GPU memory', scalar_value=GPU_memory[0], global_step=ep) 111 | try: 112 | self.logger.add_scalar(tag='System/Used GPU memory', scalar_value=GPU_memory[1], global_step=ep) 113 | except: 114 | pass 115 | if not test and self.hindsight: 116 | self.buffer.hindsight() 117 | 118 | if (ep % self.testing_gap == 0) and (ep != 0) and (not test): 
119 | ep_test_emd_loss = [] 120 | ep_test_heightmap_loss = [] 121 | for test_ep in range(self.testing_episodes): 122 | loss_info = self._interact(render, test=True) 123 | self.cur_ep += 1 124 | ep_test_emd_loss.append(loss_info['emd_loss']) 125 | try: 126 | ep_test_heightmap_loss.append(loss_info['height_map_loss']) 127 | except: 128 | pass 129 | self.logger.add_scalar(tag='Task/test_emd_loss', 130 | scalar_value=(sum(ep_test_emd_loss) / self.testing_episodes), global_step=ep) 131 | print("Episode %i" % ep) 132 | print("test emd loss %0.1f" % (sum(ep_test_emd_loss) / self.testing_episodes)) 133 | self.logging.info("Episode %i" % ep) 134 | self.logging.info("test emd loss %0.1f" % (sum(ep_test_emd_loss) / self.testing_episodes)) 135 | 136 | if len(ep_test_heightmap_loss) > 0: 137 | self.logger.add_scalar(tag='Task/test_heightmap_loss', 138 | scalar_value=(sum(ep_test_heightmap_loss) / self.testing_episodes), 139 | global_step=ep) 140 | print("test heightmap loss %0.1f" % (sum(ep_test_heightmap_loss) / self.testing_episodes)) 141 | self.logging.info("test heightmap loss %0.1f" % (sum(ep_test_heightmap_loss) / self.testing_episodes)) 142 | 143 | if (ep % self.saving_gap == 0) and (ep != 0) and (not test): 144 | self._save_network(ep=ep) 145 | 146 | if not test: 147 | print("Finished training") 148 | self.logging.info("Finished training") 149 | else: 150 | print("Finished testing") 151 | 152 | def _interact(self, render=False, test=False, sleep=0): 153 | obs = self.env.reset() 154 | if render: 155 | self.env.render() 156 | if self.onestep: 157 | # An episode has only one step 158 | if self.use_demonstrations and (self.cur_ep < self.n_demonstrate_episodes): 159 | action = self.gaussian_noise(self.demonstration_action) 160 | else: 161 | action = self._select_action(obs, test=test) 162 | obs_, reward, _, info = self.env.step(action) 163 | time.sleep(sleep) 164 | 165 | if not test: 166 | self._remember(obs['observation'], obs['desired_goal'], action, obs_['achieved_goal'], reward, 167 | new_episode=True) 168 | if self.total_env_step_count % self.update_interval == 0: 169 | self._learn() 170 | self.total_env_step_count += 1 171 | else: 172 | n = 0 173 | done = False 174 | new_episode = True 175 | while not done: 176 | if self.use_demonstrations and (self.cur_ep < self.n_demonstrate_episodes): 177 | try: 178 | action, object_out_of_view, demon_info = self.env.get_cur_demonstration() 179 | except: 180 | action = self.gaussian_noise(self.demonstration_action[n]) 181 | else: 182 | action = self._select_action(obs, test=test) 183 | obs_, reward, done, info = self.env.step(action) 184 | time.sleep(sleep) 185 | 186 | if not test: 187 | self._remember(obs['observation'], obs['desired_goal'], action, obs_['achieved_goal'], reward, 188 | new_episode=new_episode) 189 | if self.total_env_step_count % self.update_interval == 0: 190 | self._learn() 191 | self.total_env_step_count += 1 192 | 193 | new_episode = False 194 | 195 | return info 196 | 197 | def _select_action(self, obs, test=False): 198 | obs_points = T.as_tensor([obs['observation']], dtype=T.float).to(self.device) 199 | goal_points = T.as_tensor([obs['desired_goal']], dtype=T.float).to(self.device) 200 | obs_point_features = self.network_dict['critic_target'].get_features(obs_points.transpose(2, 1)) 201 | goal_point_features = self.network_dict['critic_target'].get_features(goal_points.transpose(2, 1)) 202 | inputs = T.cat((obs_point_features, goal_point_features), dim=1) 203 | action = self.network_dict['actor'].get_action(inputs, 
mean_pi=test).detach().cpu().numpy() 204 | return action[0] 205 | 206 | def _learn(self, steps=None): 207 | if len(self.buffer) < self.batch_size: 208 | return 209 | if steps is None: 210 | steps = self.optimizer_steps 211 | 212 | avg_critic_1_loss = T.zeros(1, device=self.device) 213 | avg_critic_2_loss = T.zeros(1, device=self.device) 214 | avg_actor_loss = T.zeros(1, device=self.device) 215 | avg_alpha = T.zeros(1, device=self.device) 216 | avg_policy_entropy = T.zeros(1, device=self.device) 217 | for i in range(steps): 218 | if self.prioritised: 219 | batch, weights, inds = self.buffer.sample(self.batch_size) 220 | weights = T.tensor(weights).view(self.batch_size, 1).to(self.device) 221 | else: 222 | batch = self.buffer.sample(self.batch_size) 223 | weights = T.ones(size=(self.batch_size, 1)).to(self.device) 224 | inds = None 225 | 226 | obs = T.as_tensor(batch.state, dtype=T.float32, device=self.device).transpose(2, 1) 227 | goal = T.as_tensor(batch.desired_goal, dtype=T.float32, device=self.device).transpose(2, 1) 228 | actions = T.as_tensor(batch.action, dtype=T.float32, device=self.device) 229 | rewards = T.as_tensor(batch.reward, dtype=T.float32, device=self.device).unsqueeze(1) 230 | if self.onestep: 231 | values_target = rewards 232 | else: 233 | with T.no_grad(): 234 | obs_ = T.as_tensor(batch.next_state, dtype=T.float32, device=self.device).transpose(2, 1) 235 | obs_features_ = self.network_dict['critic_1'].get_features(obs_, detach=True) 236 | goal_features = self.network_dict['critic_1'].get_features(goal, detach=True) 237 | actor_inputs = T.cat((obs_features_, goal_features), dim=1) 238 | new_actions = self.network_dict['actor'].get_action(actor_inputs) 239 | values_1 = self.network_dict['critic_target'](obs_, new_actions, goal) 240 | values_2 = self.network_dict['critic_target_2'](obs_, new_actions, goal) 241 | values_target = rewards + self.gamma * T.min(values_1, values_2) 242 | 243 | self.critic_1_optimizer.zero_grad() 244 | value_estimate_1 = self.network_dict['critic_1'](obs, actions, goal) 245 | critic_loss_1 = F.mse_loss(value_estimate_1, values_target, reduction='none') 246 | (critic_loss_1 * weights).mean().backward() 247 | self.critic_1_optimizer.step() 248 | 249 | if self.prioritised: 250 | assert inds is not None 251 | self.buffer.update_priority(inds, np.abs(critic_loss_1.cpu().detach().numpy())) 252 | 253 | self.critic_2_optimizer.zero_grad() 254 | value_estimate_2 = self.network_dict['critic_2'](obs, actions, goal) 255 | critic_loss_2 = F.mse_loss(value_estimate_2, values_target, reduction='none') 256 | (critic_loss_2 * weights).mean().backward() 257 | self.critic_2_optimizer.step() 258 | 259 | avg_critic_1_loss += critic_loss_1.detach().mean() 260 | avg_critic_2_loss += critic_loss_2.detach().mean() 261 | 262 | if self.optim_step_count % self.actor_update_interval == 0: 263 | self.actor_optimizer.zero_grad() 264 | obs_features = self.network_dict['critic_1'].get_features(obs, detach=True) 265 | goal_features = self.network_dict['critic_1'].get_features(goal, detach=True) 266 | actor_inputs = T.cat((obs_features, goal_features), dim=1) 267 | new_actions, new_log_probs, new_entropy = self.network_dict['actor'].get_action(actor_inputs, 268 | probs=True, 269 | entropy=True) 270 | new_values = T.min(self.network_dict['critic_1'](obs, new_actions, goal), 271 | self.network_dict['critic_2'](obs, new_actions, goal)) 272 | actor_loss = (self.network_dict['alpha'] * new_log_probs - new_values).mean() 273 | actor_loss.backward() 274 | self.actor_optimizer.step() 275 | 
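The temperature update that follows (shared by all SAC variants in this package) performs gradient descent on `log_alpha` so that `alpha` stays positive, pushing the policy entropy towards `target_entropy = -action_dim`. A tiny self-contained sketch with dummy log-probabilities and a dummy learning rate (illustration only, not code from this file):

```python
import torch as T
from torch.optim.adam import Adam

action_dim = 3
target_entropy = -float(action_dim)            # the heuristic used by these agents
log_alpha = T.tensor(0.0, requires_grad=True)  # alpha starts at exp(0) = 1.0
alpha_optimizer = Adam([log_alpha], lr=3e-4)   # dummy learning rate

log_probs = T.randn(8, 1)                      # dummy log pi(a|s) for a sampled batch
alpha_loss = (log_alpha * (-log_probs - target_entropy).detach()).mean()

alpha_optimizer.zero_grad()
alpha_loss.backward()
alpha_optimizer.step()
alpha = log_alpha.exp()                        # temperature used by the next actor/critic update
print(alpha.item())
```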
276 | self.alpha_optimizer.zero_grad() 277 | alpha_loss = (self.network_dict['log_alpha'] * (-new_log_probs - self.target_entropy).detach()).mean() 278 | alpha_loss.backward() 279 | self.alpha_optimizer.step() 280 | self.network_dict['alpha'] = self.network_dict['log_alpha'].exp() 281 | 282 | avg_actor_loss += actor_loss.detach().mean() 283 | avg_alpha += self.network_dict['alpha'].detach() 284 | avg_policy_entropy += new_entropy.detach().mean() 285 | 286 | self.optim_step_count += 1 287 | 288 | if self.onestep: 289 | self._soft_update(self.network_dict['critic_1'], self.network_dict['critic_target'], tau=1) 290 | else: 291 | self._soft_update(self.network_dict['critic_1'], self.network_dict['critic_target'], tau=self.tau) 292 | self._soft_update(self.network_dict['critic_2'], self.network_dict['critic_target_2'], tau=self.tau) 293 | 294 | self.logger.add_scalar(tag='Critic/critic_1_loss', scalar_value=avg_critic_1_loss / steps, 295 | global_step=self.cur_ep) 296 | self.logger.add_scalar(tag='Critic/critic_2_loss', scalar_value=avg_critic_2_loss / steps, 297 | global_step=self.cur_ep) 298 | self.logger.add_scalar(tag='Actor/actor_loss', scalar_value=avg_actor_loss / steps, global_step=self.cur_ep) 299 | self.logger.add_scalar(tag='Actor/alpha', scalar_value=avg_alpha / steps, global_step=self.cur_ep) 300 | self.logger.add_scalar(tag='Actor/policy_entropy', scalar_value=avg_policy_entropy / steps, 301 | global_step=self.cur_ep) 302 | -------------------------------------------------------------------------------- /drl_implementation/agent/continuous_action/td3.py: -------------------------------------------------------------------------------- 1 | import time 2 | import numpy as np 3 | import torch as T 4 | import torch.nn.functional as F 5 | from torch.optim.adam import Adam 6 | from ..utils.networks_mlp import Actor, Critic 7 | from ..agent_base import Agent 8 | from ..utils.exploration_strategy import GaussianNoise 9 | 10 | 11 | class TD3(Agent): 12 | def __init__(self, algo_params, env, transition_tuple=None, path=None, seed=-1): 13 | # environment 14 | self.env = env 15 | self.env.seed(seed) 16 | obs = self.env.reset() 17 | algo_params.update({'state_dim': obs.shape[0], 18 | 'action_dim': self.env.action_space.shape[0], 19 | 'action_max': self.env.action_space.high, 20 | 'action_scaling': self.env.action_space.high[0], 21 | 'init_input_means': None, 22 | 'init_input_vars': None 23 | }) 24 | # training args 25 | self.training_episodes = algo_params['training_episodes'] 26 | self.testing_gap = algo_params['testing_gap'] 27 | self.testing_episodes = algo_params['testing_episodes'] 28 | self.saving_gap = algo_params['saving_gap'] 29 | 30 | super(TD3, self).__init__(algo_params, 31 | transition_tuple=transition_tuple, 32 | goal_conditioned=False, 33 | path=path, 34 | seed=seed) 35 | # torch 36 | self.network_dict.update({ 37 | 'actor': Actor(self.state_dim, self.action_dim, action_scaling=self.action_scaling).to(self.device), 38 | 'actor_target': Actor(self.state_dim, self.action_dim, action_scaling=self.action_scaling).to(self.device), 39 | 'critic_1': Critic(self.state_dim + self.action_dim, 1).to(self.device), 40 | 'critic_1_target': Critic(self.state_dim + self.action_dim, 1).to(self.device), 41 | 'critic_2': Critic(self.state_dim + self.action_dim, 1).to(self.device), 42 | 'critic_2_target': Critic(self.state_dim + self.action_dim, 1).to(self.device) 43 | }) 44 | self.network_keys_to_save = ['actor_target', 'critic_1_target'] 45 | self.actor_optimizer = 
Adam(self.network_dict['actor'].parameters(), lr=self.actor_learning_rate) 46 | self._soft_update(self.network_dict['actor'], self.network_dict['actor_target'], tau=1) 47 | self.critic_1_optimizer = Adam(self.network_dict['critic_1'].parameters(), lr=self.critic_learning_rate) 48 | self._soft_update(self.network_dict['critic_1'], self.network_dict['critic_1_target'], tau=1) 49 | self.critic_2_optimizer = Adam(self.network_dict['critic_2'].parameters(), lr=self.critic_learning_rate) 50 | self._soft_update(self.network_dict['critic_2'], self.network_dict['critic_2_target'], tau=1) 51 | # behavioural policy args (exploration) 52 | self.exploration_strategy = GaussianNoise(self.action_dim, self.action_max, mu=0, sigma=0.1) 53 | # training args 54 | self.target_noise = algo_params['target_noise'] 55 | self.noise_clip = algo_params['noise_clip'] 56 | self.warmup_step = algo_params['warmup_step'] 57 | self.actor_update_interval = algo_params['actor_update_interval'] 58 | # statistic dict 59 | self.statistic_dict.update({ 60 | 'episode_return': [], 61 | 'episode_test_return': [] 62 | }) 63 | 64 | def run(self, test=False, render=False, load_network_ep=None, sleep=0): 65 | if test: 66 | num_episode = self.testing_episodes 67 | if load_network_ep is not None: 68 | print("Loading network parameters...") 69 | self._load_network(ep=load_network_ep) 70 | print("Start testing...") 71 | else: 72 | num_episode = self.training_episodes 73 | print("Start training...") 74 | 75 | for ep in range(num_episode): 76 | ep_return = self._interact(render, test, sleep=sleep) 77 | self.statistic_dict['episode_return'].append(ep_return) 78 | print("Episode %i" % ep, "return %0.1f" % ep_return) 79 | 80 | if (ep % self.testing_gap == 0) and (ep != 0) and (not test): 81 | ep_test_return = [] 82 | for test_ep in range(self.testing_episodes): 83 | ep_test_return.append(self._interact(render, test=True)) 84 | self.statistic_dict['episode_test_return'].append(sum(ep_test_return)/self.testing_episodes) 85 | print("Episode %i" % ep, "test return %0.1f" % (sum(ep_test_return)/self.testing_episodes)) 86 | 87 | if (ep % self.saving_gap == 0) and (ep != 0) and (not test): 88 | self._save_network(ep=ep) 89 | 90 | if not test: 91 | print("Finished training") 92 | print("Saving statistics...") 93 | self._plot_statistics(save_to_file=True) 94 | else: 95 | print("Finished testing") 96 | 97 | def _interact(self, render=False, test=False, sleep=0): 98 | done = False 99 | obs = self.env.reset() 100 | ep_return = 0 101 | # start a new episode 102 | while not done: 103 | if render: 104 | self.env.render() 105 | if self.env_step_count < self.warmup_step: 106 | action = self.env.action_space.sample() 107 | else: 108 | action = self._select_action(obs, test=test) 109 | new_obs, reward, done, info = self.env.step(action) 110 | time.sleep(sleep) 111 | ep_return += reward 112 | if not test: 113 | self._remember(obs, action, new_obs, reward, 1 - int(done)) 114 | if self.observation_normalization: 115 | self.normalizer.store_history(new_obs) 116 | self.normalizer.update_mean() 117 | if (self.env_step_count % self.update_interval == 0) and (self.env_step_count > self.warmup_step): 118 | self._learn() 119 | obs = new_obs 120 | self.env_step_count += 1 121 | return ep_return 122 | 123 | def _select_action(self, obs, test=False): 124 | obs = self.normalizer(obs) 125 | with T.no_grad(): 126 | inputs = T.as_tensor(obs, dtype=T.float, device=self.device) 127 | action = self.network_dict['actor_target'](inputs).detach().cpu().numpy() 128 | if test: 129 | # 
evaluate 130 | return np.clip(action, -self.action_max, self.action_max) 131 | else: 132 | # explore 133 | return self.exploration_strategy(action) 134 | 135 | def _learn(self, steps=None): 136 | if len(self.buffer) < self.batch_size: 137 | return 138 | if steps is None: 139 | steps = self.optimizer_steps 140 | 141 | for i in range(steps): 142 | if self.prioritised: 143 | batch, weights, inds = self.buffer.sample(self.batch_size) 144 | weights = T.as_tensor(weights, device=self.device).view(self.batch_size, 1) 145 | else: 146 | batch = self.buffer.sample(self.batch_size) 147 | weights = T.ones(size=(self.batch_size, 1), device=self.device) 148 | inds = None 149 | 150 | actor_inputs = self.normalizer(batch.state) 151 | actor_inputs = T.as_tensor(actor_inputs, dtype=T.float32, device=self.device) 152 | actions = T.as_tensor(batch.action, dtype=T.float32, device=self.device) 153 | critic_inputs = T.cat((actor_inputs, actions), dim=1) 154 | actor_inputs_ = self.normalizer(batch.next_state) 155 | actor_inputs_ = T.as_tensor(actor_inputs_, dtype=T.float32, device=self.device) 156 | rewards = T.as_tensor(batch.reward, dtype=T.float32, device=self.device).unsqueeze(1) 157 | done = T.as_tensor(batch.done, dtype=T.float32, device=self.device).unsqueeze(1) 158 | 159 | if self.discard_time_limit: 160 | done = done * 0 + 1 161 | 162 | with T.no_grad(): 163 | actions_ = self.network_dict['actor_target'](actor_inputs_) 164 | # add noise 165 | noise = (T.randn_like(actions_, device=self.device) * self.target_noise) 166 | actions_ += noise.clamp(-self.noise_clip, self.noise_clip) 167 | actions_ = actions_.clamp(-self.action_max[0], self.action_max[0]) 168 | critic_inputs_ = T.cat((actor_inputs_, actions_), dim=1) 169 | value_1_ = self.network_dict['critic_1_target'](critic_inputs_) 170 | value_2_ = self.network_dict['critic_2_target'](critic_inputs_) 171 | value_ = T.min(value_1_, value_2_) 172 | value_target = rewards + done * self.gamma * value_ 173 | 174 | self.critic_1_optimizer.zero_grad() 175 | value_estimate_1 = self.network_dict['critic_1'](critic_inputs) 176 | critic_loss_1 = F.mse_loss(value_estimate_1, value_target.detach(), reduction='none') 177 | (critic_loss_1 * weights).mean().backward() 178 | self.critic_1_optimizer.step() 179 | 180 | if self.prioritised: 181 | assert inds is not None 182 | self.buffer.update_priority(inds, np.abs(critic_loss_1.cpu().detach().numpy())) 183 | 184 | self.critic_2_optimizer.zero_grad() 185 | value_estimate_2 = self.network_dict['critic_2'](critic_inputs) 186 | critic_loss_2 = F.mse_loss(value_estimate_2, value_target.detach(), reduction='none') 187 | (critic_loss_2 * weights).mean().backward() 188 | self.critic_2_optimizer.step() 189 | 190 | self.statistic_dict['critic_loss'].append(critic_loss_1.detach().mean()) 191 | 192 | if self.optim_step_count % self.actor_update_interval == 0: 193 | self.actor_optimizer.zero_grad() 194 | new_actions = self.network_dict['actor'](actor_inputs) 195 | critic_eval_inputs = T.cat((actor_inputs, new_actions), dim=1) 196 | actor_loss = -self.network_dict['critic_1'](critic_eval_inputs).mean() 197 | actor_loss.backward() 198 | self.actor_optimizer.step() 199 | 200 | self._soft_update(self.network_dict['actor'], self.network_dict['actor_target']) 201 | self._soft_update(self.network_dict['critic_1'], self.network_dict['critic_1_target']) 202 | self._soft_update(self.network_dict['critic_2'], self.network_dict['critic_2_target']) 203 | 204 | self.statistic_dict['actor_loss'].append(actor_loss.detach().mean()) 205 | 206 | 
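The `no_grad` block above implements TD3's target-policy smoothing: Gaussian noise scaled by `target_noise` is added to the target actor's action, clipped to `[-noise_clip, noise_clip]`, and the result is clipped again to the action bounds before being evaluated by the target critics. A minimal sketch with assumed values (`target_noise=0.2`, `noise_clip=0.5`, actions in `[-1, 1]`; not the hyperparameters of this repository):

```python
import torch as T

target_noise, noise_clip, action_max = 0.2, 0.5, 1.0    # assumed hyperparameters
target_actions = T.tensor([[0.9, -0.3], [-0.95, 0.1]])  # dummy actor_target(s') outputs

noise = (T.randn_like(target_actions) * target_noise).clamp(-noise_clip, noise_clip)
smoothed = (target_actions + noise).clamp(-action_max, action_max)
print(smoothed)  # noisy target actions, still inside the action bounds
```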
self.optim_step_count += 1 207 | -------------------------------------------------------------------------------- /drl_implementation/agent/distributed_agent_base.py: -------------------------------------------------------------------------------- 1 | import os 2 | import time 3 | import torch as T 4 | import numpy as np 5 | import json 6 | import queue 7 | import importlib 8 | import multiprocessing as mp 9 | from collections import namedtuple 10 | from .utils.plot import smoothed_plot 11 | from .utils.replay_buffer import ReplayBuffer, PrioritisedReplayBuffer 12 | from .utils.normalizer import Normalizer 13 | # T.multiprocessing.set_start_method('spawn') 14 | t = namedtuple("transition", ('state', 'action', 'next_state', 'reward', 'done')) 15 | 16 | 17 | def mkdir(paths): 18 | for path in paths: 19 | os.makedirs(path, exist_ok=True) 20 | 21 | 22 | class Agent(object): 23 | def __init__(self, algo_params, image_obs=False, action_type='continuous', path=None, seed=-1): 24 | # torch device 25 | self.device = T.device("cuda" if T.cuda.is_available() else "cpu") 26 | if 'cuda_device_id' in algo_params.keys(): 27 | self.device = T.device("cuda:%i" % algo_params['cuda_device_id']) 28 | # path & seeding 29 | T.manual_seed(seed) 30 | T.cuda.manual_seed_all(seed) # this has no effect if cuda is not available 31 | 32 | assert path is not None, 'please specify a project path to save files' 33 | self.path = path 34 | # path to save neural network check point 35 | self.ckpt_path = os.path.join(path, 'ckpts') 36 | # path to save statistics 37 | self.data_path = os.path.join(path, 'data') 38 | # create directories if not exist 39 | mkdir([self.path, self.ckpt_path, self.data_path]) 40 | 41 | # non-goal-conditioned args 42 | self.image_obs = image_obs 43 | self.action_type = action_type 44 | if self.image_obs: 45 | self.state_dim = 0 46 | self.state_shape = algo_params['state_shape'] 47 | else: 48 | self.state_dim = algo_params['state_dim'] 49 | self.action_dim = algo_params['action_dim'] 50 | if self.action_type == 'continuous': 51 | self.action_max = algo_params['action_max'] 52 | self.action_scaling = algo_params['action_scaling'] 53 | 54 | # common args 55 | if not self.image_obs: 56 | # todo: observation in distributed training should be synced as well 57 | self.observation_normalization = algo_params['observation_normalization'] 58 | # if not using obs normalization, the normalizer is just a scale multiplier, returns inputs*scale 59 | self.normalizer = Normalizer(self.state_dim, 60 | algo_params['init_input_means'], algo_params['init_input_vars'], 61 | activated=self.observation_normalization) 62 | 63 | self.gamma = algo_params['discount_factor'] 64 | self.tau = algo_params['tau'] 65 | 66 | # network dict is filled in each specific agent 67 | self.network_dict = {} 68 | self.network_keys_to_save = None 69 | 70 | # algorithm-specific statistics are defined in each agent sub-class 71 | self.statistic_dict = { 72 | # use lowercase characters 73 | 'actor_loss': [], 74 | 'critic_loss': [], 75 | } 76 | 77 | def _soft_update(self, source, target, tau=None, from_params=False): 78 | if tau is None: 79 | tau = self.tau 80 | 81 | if not from_params: 82 | for target_param, param in zip(target.parameters(), source.parameters()): 83 | target_param.data.copy_( 84 | target_param.data * (1.0 - tau) + param.data * tau 85 | ) 86 | else: 87 | for target_param, param in zip(target.parameters(), source): 88 | target_param.data.copy_( 89 | target_param.data * (1.0 - tau) + T.tensor(param).float().to(self.device) * tau 90 
| ) 91 | 92 | def _save_network(self, keys=None, ep=None): 93 | if ep is None: 94 | ep = '' 95 | else: 96 | ep = '_ep' + str(ep) 97 | if keys is None: 98 | keys = self.network_keys_to_save 99 | assert keys is not None 100 | for key in keys: 101 | T.save(self.network_dict[key].state_dict(), self.ckpt_path + '/ckpt_' + key + ep + '.pt') 102 | 103 | def _load_network(self, keys=None, ep=None): 104 | if not self.image_obs: 105 | self.normalizer.history_mean = np.load(os.path.join(self.data_path, 'input_means.npy')) 106 | self.normalizer.history_var = np.load(os.path.join(self.data_path, 'input_vars.npy')) 107 | if ep is None: 108 | ep = '' 109 | else: 110 | ep = '_ep' + str(ep) 111 | if keys is None: 112 | keys = self.network_keys_to_save 113 | assert keys is not None 114 | for key in keys: 115 | self.network_dict[key].load_state_dict(T.load(self.ckpt_path + '/ckpt_' + key + ep + '.pt')) 116 | 117 | def _save_statistics(self, keys=None): 118 | if not self.image_obs: 119 | np.save(os.path.join(self.data_path, 'input_means'), self.normalizer.history_mean) 120 | np.save(os.path.join(self.data_path, 'input_vars'), self.normalizer.history_var) 121 | if keys is None: 122 | keys = self.statistic_dict.keys() 123 | for key in keys: 124 | if len(self.statistic_dict[key]) == 0: 125 | continue 126 | # convert everything to a list before save via json 127 | if T.is_tensor(self.statistic_dict[key][0]): 128 | self.statistic_dict[key] = T.as_tensor(self.statistic_dict[key], device=self.device).cpu().numpy().tolist() 129 | else: 130 | self.statistic_dict[key] = np.array(self.statistic_dict[key]).tolist() 131 | json.dump(self.statistic_dict[key], open(os.path.join(self.data_path, key+'.json'), 'w')) 132 | 133 | def _plot_statistics(self, keys=None, x_labels=None, y_labels=None, window=5, save_to_file=True): 134 | if save_to_file: 135 | self._save_statistics(keys=keys) 136 | if y_labels is None: 137 | y_labels = {} 138 | for key in list(self.statistic_dict.keys()): 139 | if key not in y_labels.keys(): 140 | if 'loss' in key: 141 | label = 'Loss' 142 | elif 'return' in key: 143 | label = 'Return' 144 | elif 'success' in key: 145 | label = 'Success' 146 | else: 147 | label = key 148 | y_labels.update({key: label}) 149 | 150 | if x_labels is None: 151 | x_labels = {} 152 | for key in list(self.statistic_dict.keys()): 153 | if key not in x_labels.keys(): 154 | if ('loss' in key) or ('alpha' in key) or ('entropy' in key) or ('step' in key): 155 | label = 'Optimization step' 156 | elif 'cycle' in key: 157 | label = 'Cycle' 158 | elif 'epoch' in key: 159 | label = 'Epoch' 160 | else: 161 | label = 'Episode' 162 | x_labels.update({key: label}) 163 | 164 | if keys is None: 165 | for key in list(self.statistic_dict.keys()): 166 | smoothed_plot(os.path.join(self.path, key + '.png'), self.statistic_dict[key], 167 | x_label=x_labels[key], y_label=y_labels[key], window=window) 168 | else: 169 | for key in keys: 170 | smoothed_plot(os.path.join(self.path, key + '.png'), self.statistic_dict[key], 171 | x_label=x_labels[key], y_label=y_labels[key], window=window) 172 | 173 | 174 | class Worker(Agent): 175 | def __init__(self, algo_params, queues, path=None, seed=0, i=0): 176 | self.queues = queues 177 | self.worker_id = i 178 | self.worker_update_gap = algo_params['worker_update_gap'] # in episodes 179 | self.env_step_count = 0 180 | super(Worker, self).__init__(algo_params, path=path, seed=seed) 181 | 182 | def run(self, render=False, test=False, load_network_ep=None, sleep=0): 183 | raise NotImplementedError 184 | 185 | def 
_interact(self, render=False, test=False, sleep=0): 186 | raise NotImplementedError 187 | 188 | def _select_action(self, obs, test=False): 189 | raise NotImplementedError 190 | 191 | def _remember(self, batch): 192 | try: 193 | self.queues['replay_queue'].put_nowait(batch) 194 | except queue.Full: 195 | pass 196 | 197 | def _download_actor_networks(self, keys, tau=1.0): 198 | try: 199 | source = self.queues['network_queue'].get_nowait() 200 | except queue.Empty: 201 | return False 202 | print("Worker No. %i downloading network" % self.worker_id) 203 | for key in keys: 204 | self._soft_update(source[key], self.network_dict[key], tau=tau, from_params=True) 205 | return True 206 | 207 | 208 | class Learner(Agent): 209 | def __init__(self, algo_params, queues, path=None, seed=0): 210 | self.queues = queues 211 | self.num_workers = algo_params['num_workers'] 212 | self.learner_steps = algo_params['learner_steps'] 213 | self.learner_upload_gap = algo_params['learner_upload_gap'] # in optimization steps 214 | self.actor_learning_rate = algo_params['actor_learning_rate'] 215 | self.critic_learning_rate = algo_params['critic_learning_rate'] 216 | self.discard_time_limit = algo_params['discard_time_limit'] 217 | self.batch_size = algo_params['batch_size'] 218 | self.prioritised = algo_params['prioritised'] 219 | self.optimizer_steps = algo_params['optimization_steps'] 220 | self.optim_step_count = 0 221 | super(Learner, self).__init__(algo_params, path=path, seed=seed) 222 | 223 | def run(self): 224 | raise NotImplementedError 225 | 226 | def _learn(self, steps=None): 227 | raise NotImplementedError 228 | 229 | def _upload_learner_networks(self, keys): 230 | print("Learner uploading network") 231 | params = dict.fromkeys(keys) 232 | for key in keys: 233 | params[key] = [p.data.cpu().detach().numpy() for p in self.network_dict[key].parameters()] 234 | # delete an old net and upload a new one 235 | try: 236 | data = self.queues['network_queue'].get_nowait() 237 | del data 238 | except queue.Empty: 239 | pass 240 | try: 241 | self.queues['network_queue'].put(params) 242 | except queue.Full: 243 | pass 244 | 245 | 246 | class CentralProcessor(object): 247 | def __init__(self, algo_params, env_name, env_source, learner, worker, transition_tuple=None, path=None, 248 | worker_seeds=None, seed=0): 249 | self.algo_params = algo_params.copy() 250 | self.env_name = env_name 251 | assert env_source in ['gym', 'pybullet_envs', 'pybullet_multigoal_gym'], \ 252 | "unsupported env source: {}, " \ 253 | "only 3 env sources are supported: {}, " \ 254 | "for new env sources please modify the original code".format(env_source, 255 | ['gym', 'pybullet_envs', 256 | 'pybullet_multigoal_gym']) 257 | self.env_source = importlib.import_module(env_source) 258 | self.learner = learner 259 | self.worker = worker 260 | self.batch_size = algo_params['batch_size'] 261 | self.num_workers = algo_params['num_workers'] 262 | self.learner_steps = algo_params['learner_steps'] 263 | if worker_seeds is None: 264 | worker_seeds = np.random.randint(10, 1000, size=self.num_workers).tolist() 265 | else: 266 | assert len(worker_seeds) == self.num_workers, 'should assign seeds to every worker' 267 | self.worker_seeds = worker_seeds 268 | assert path is not None, 'please specify a project path to save files' 269 | self.path = path 270 | # create a random number generator and seed it 271 | self.rng = np.random.default_rng(seed=0) 272 | 273 | # multiprocessing queues 274 | self.queues = { 275 | 'replay_queue': 
mp.Queue(maxsize=algo_params['replay_queue_size']), 276 | 'batch_queue': mp.Queue(maxsize=algo_params['batch_queue_size']), 277 | 'network_queue': T.multiprocessing.Queue(maxsize=self.num_workers), 278 | 'learner_step_count': mp.Value('i', 0), 279 | 'global_episode_count': mp.Value('i', 0), 280 | } 281 | 282 | # setup replay buffer 283 | # prioritised replay 284 | self.prioritised = algo_params['prioritised'] 285 | self.store_with_given_priority = algo_params['store_with_given_priority'] 286 | # non-goal-conditioned replay buffer 287 | tr = transition_tuple 288 | if transition_tuple is None: 289 | tr = t 290 | if not self.prioritised: 291 | self.buffer = ReplayBuffer(algo_params['memory_capacity'], tr, seed=seed) 292 | else: 293 | self.queues.update({ 294 | 'priority_queue': mp.Queue(maxsize=algo_params['priority_queue_size']) 295 | }) 296 | self.buffer = PrioritisedReplayBuffer(algo_params['memory_capacity'], tr, rng=self.rng) 297 | 298 | def run(self): 299 | def worker_process(i, seed): 300 | env = self.env_source.make(self.env_name) 301 | path = os.path.join(self.path, "worker_%i" % i) 302 | worker = self.worker(self.algo_params, env, self.queues, path=path, seed=seed, i=i) 303 | worker.run() 304 | self.empty_queue('replay_queue') 305 | 306 | def learner_process(): 307 | env = self.env_source.make(self.env_name) 308 | path = os.path.join(self.path, "learner") 309 | learner = self.learner(self.algo_params, env, self.queues, path=path, seed=0) 310 | learner.run() 311 | if self.prioritised: 312 | self.empty_queue('priority_queue') 313 | self.empty_queue('network_queue') 314 | 315 | def update_buffer(): 316 | while self.queues['learner_step_count'].value < self.learner_steps: 317 | num_transitions_in_queue = self.queues['replay_queue'].qsize() 318 | for n in range(num_transitions_in_queue): 319 | data = self.queues['replay_queue'].get() 320 | if self.prioritised: 321 | if self.store_with_given_priority: 322 | self.buffer.store_experience_with_given_priority(data['priority'], *data['transition']) 323 | else: 324 | self.buffer.store_experience(*data) 325 | else: 326 | self.buffer.store_experience(*data) 327 | if self.batch_size > len(self.buffer): 328 | continue 329 | 330 | if self.prioritised: 331 | try: 332 | inds, priorities = self.queues['priority_queue'].get_nowait() 333 | self.buffer.update_priority(inds, priorities) 334 | except queue.Empty: 335 | pass 336 | try: 337 | batch, weights, inds = self.buffer.sample(batch_size=self.batch_size) 338 | state, action, next_state, reward, done = batch 339 | self.queues['batch_queue'].put_nowait((state, action, next_state, reward, done, weights, inds)) 340 | except queue.Full: 341 | continue 342 | else: 343 | try: 344 | batch = self.buffer.sample(batch_size=self.batch_size) 345 | state, action, next_state, reward, done = batch 346 | self.queues['batch_queue'].put_nowait((state, action, next_state, reward, done)) 347 | except queue.Full: 348 | time.sleep(0.1) 349 | continue 350 | 351 | self.empty_queue('batch_queue') 352 | 353 | processes = [] 354 | p = T.multiprocessing.Process(target=update_buffer) 355 | processes.append(p) 356 | p = T.multiprocessing.Process(target=learner_process) 357 | processes.append(p) 358 | for i in range(self.num_workers): 359 | p = T.multiprocessing.Process(target=worker_process, 360 | args=(i, self.worker_seeds[i])) 361 | processes.append(p) 362 | 363 | for p in processes: 364 | p.start() 365 | for p in processes: 366 | p.join() 367 | 368 | def empty_queue(self, queue_name): 369 | while True: 370 | try: 371 | data = 
self.queues[queue_name].get_nowait() 372 | del data 373 | except queue.Empty: 374 | break 375 | self.queues[queue_name].close() 376 | -------------------------------------------------------------------------------- /drl_implementation/agent/utils/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/IanYangChina/DRL_Implementation/38812c9647e4bec8359908be444dff19b90257d5/drl_implementation/agent/utils/__init__.py -------------------------------------------------------------------------------- /drl_implementation/agent/utils/env_wrapper.py: -------------------------------------------------------------------------------- 1 | import gym 2 | import numpy as np 3 | from skimage.transform import resize 4 | from collections import deque 5 | 6 | 7 | class FrameStack(gym.Wrapper): 8 | def __init__(self, env, k): 9 | gym.Wrapper.__init__(self, env) 10 | self._k = k 11 | self._frames = deque([], maxlen=k) 12 | shp = env.observation_space.shape 13 | self.observation_space = gym.spaces.Box( 14 | low=0, 15 | high=1, 16 | shape=((shp[0] * k,) + shp[1:]), 17 | dtype=env.observation_space.dtype) 18 | self._max_episode_steps = env._max_episode_steps 19 | 20 | def reset(self): 21 | obs = self.env.reset() 22 | for _ in range(self._k): 23 | self._frames.append(obs) 24 | return self._get_obs() 25 | 26 | def step(self, action): 27 | obs, reward, done, info = self.env.step(action) 28 | self._frames.append(obs) 29 | return self._get_obs(), reward, done, info 30 | 31 | def _get_obs(self): 32 | assert len(self._frames) == self._k 33 | return np.concatenate(list(self._frames), axis=0) 34 | 35 | 36 | class PixelPybulletGym(gym.Wrapper): 37 | def __init__(self, env, image_size, crop_size, channel_first=True): 38 | gym.Wrapper.__init__(self, env) 39 | self.image_size = image_size 40 | self.crop_size = crop_size 41 | self.channel_first = channel_first 42 | self.vertical_boundary = int((env.env._render_height - self.crop_size) / 2) 43 | self.horizontal_boundary = int((env.env._render_width - self.crop_size) / 2) 44 | self._max_episode_steps = env._max_episode_steps 45 | 46 | def reset(self): 47 | self.env.reset() 48 | return self._get_obs() 49 | 50 | def step(self, action): 51 | _, reward, done, info = self.env.step(action) 52 | return self._get_obs(), reward, done, info 53 | 54 | def _get_obs(self): 55 | # H, W, C 56 | obs = self.render(mode="rgb_array") 57 | obs = obs[self.vertical_boundary:-self.vertical_boundary, self.horizontal_boundary:-self.horizontal_boundary, :] 58 | obs = resize(obs, (self.image_size, self.image_size)) 59 | if self.channel_first: 60 | obs = obs.transpose((-1, 0, 1)) 61 | return obs 62 | -------------------------------------------------------------------------------- /drl_implementation/agent/utils/exploration_strategy.py: -------------------------------------------------------------------------------- 1 | import math as M 2 | import numpy as np 3 | 4 | 5 | class ExpDecayGreedy(object): 6 | # e-greedy exploration with exponential decay 7 | def __init__(self, start=1, end=0.05, decay=50000, decay_start=None, rng=None): 8 | self.start = start 9 | self.end = end 10 | self.decay = decay 11 | self.decay_start = decay_start 12 | if rng is None: 13 | self.rng = np.random.default_rng(seed=0) 14 | else: 15 | self.rng = rng 16 | 17 | def __call__(self, count): 18 | if self.decay_start is not None: 19 | count -= self.decay_start 20 | if count < 0: 21 | count = 0 22 | epsilon = self.end + (self.start - self.end) * M.exp(-1. 
* count / self.decay) 23 | prob = self.rng.uniform(0, 1) 24 | if prob < epsilon: 25 | return True 26 | else: 27 | return False 28 | 29 | 30 | class LinearDecayGreedy(object): 31 | # e-greedy exploration with linear decay 32 | def __init__(self, start=1.0, end=0.1, decay=1000000, decay_start=None, rng=None): 33 | self.start = start 34 | self.end = end 35 | self.decay = decay 36 | self.decay_start = decay_start 37 | if rng is None: 38 | self.rng = np.random.default_rng(seed=0) 39 | else: 40 | self.rng = rng 41 | 42 | def __call__(self, count): 43 | if self.decay_start is not None: 44 | count -= self.decay_start 45 | if count < 0: 46 | count = 0 47 | if count > self.decay: 48 | count = self.decay 49 | epsilon = self.start - count * (self.start - self.end) / self.decay 50 | prob = self.rng.uniform(0, 1) 51 | if prob < epsilon: 52 | return True 53 | else: 54 | return False 55 | 56 | 57 | class OUNoise(object): 58 | # https://github.com/rll/rllab/blob/master/rllab/exploration_strategies/ou_strategy.py 59 | def __init__(self, action_dim, action_max, mu=0, theta=0.2, sigma=1.0, rng=None): 60 | if rng is None: 61 | self.rng = np.random.default_rng(seed=0) 62 | else: 63 | self.rng = rng 64 | self.action_dim = action_dim 65 | self.action_max = action_max 66 | self.mu = mu 67 | self.theta = theta 68 | self.sigma = sigma 69 | self.state = np.ones(self.action_dim) * self.mu 70 | self.reset() 71 | 72 | def reset(self): 73 | self.state = np.ones(self.action_dim) * self.mu 74 | 75 | def __call__(self, action): 76 | x = self.state 77 | dx = self.theta * (self.mu - x) + self.sigma * self.rng.standard_normal(len(x)) 78 | self.state = x + dx 79 | return np.clip(action + self.state, -self.action_max, self.action_max) 80 | 81 | 82 | class GaussianNoise(object): 83 | # the one used in the TD3 paper: http://proceedings.mlr.press/v80/fujimoto18a/fujimoto18a.pdf 84 | def __init__(self, action_dim, action_max, scale=1, mu=0, sigma=0.1, rng=None): 85 | if rng is None: 86 | self.rng = np.random.default_rng(seed=0) 87 | else: 88 | self.rng = rng 89 | self.scale = scale 90 | self.action_dim = action_dim 91 | self.action_max = action_max 92 | self.mu = mu 93 | self.sigma = sigma 94 | 95 | def __call__(self, action): 96 | noise = self.scale*self.rng.normal(loc=self.mu, scale=self.sigma, size=(self.action_dim,)) 97 | return np.clip(action + noise, -self.action_max, self.action_max) 98 | 99 | 100 | class EGreedyGaussian(object): 101 | # the one used in the HER paper: https://arxiv.org/abs/1707.01495 102 | def __init__(self, action_dim, action_max, chance=0.2, scale=1, mu=0, sigma=0.1, rng=None): 103 | self.chance = chance 104 | self.scale = scale 105 | self.action_dim = action_dim 106 | self.action_max = action_max 107 | self.mu = mu 108 | self.sigma = sigma 109 | if rng is None: 110 | self.rng = np.random.default_rng(seed=0) 111 | else: 112 | self.rng = rng 113 | 114 | def __call__(self, action): 115 | chance = self.rng.uniform(0, 1) 116 | if chance < self.chance: 117 | return self.rng.uniform(-self.action_max, self.action_max, size=(self.action_dim,)) 118 | else: 119 | noise = self.scale*self.rng.normal(loc=self.mu, scale=self.sigma, size=(self.action_dim,)) 120 | return np.clip(action + noise, -self.action_max, self.action_max) 121 | 122 | 123 | class AutoAdjustingEGreedyGaussian(object): 124 | """ 125 | https://ieeexplore.ieee.org/document/9366328 126 | This exploration class is a goal-success-rate-based auto-adjusting exploration strategy. 
127 | It modifies the original constant chance exploration strategy by reducing exploration probabilities and noise deviations 128 | w.r.t. the testing success rate of each goal. 129 | """ 130 | def __init__(self, goal_num, action_dim, action_max, tau=0.05, chance=0.2, scale=1, mu=0, sigma=0.2, rng=None): 131 | if rng is None: 132 | self.rng = np.random.default_rng(seed=0) 133 | else: 134 | self.rng = rng 135 | self.scale = scale 136 | self.action_dim = action_dim 137 | self.action_max = action_max 138 | self.mu = mu 139 | self.base_sigma = sigma 140 | self.sigma = np.ones(goal_num) * sigma 141 | 142 | self.base_chance = chance 143 | self.goal_num = goal_num 144 | self.tau = tau 145 | self.success_rates = np.zeros(self.goal_num) 146 | self.chance = np.ones(self.goal_num) * chance 147 | 148 | def update_success_rates(self, new_tet_suc_rate): 149 | old_tet_suc_rate = self.success_rates.copy() 150 | self.success_rates = (1-self.tau)*old_tet_suc_rate + self.tau*new_tet_suc_rate 151 | self.chance = self.base_chance*(1-self.success_rates) 152 | self.sigma = self.base_sigma*(1-self.success_rates) 153 | 154 | def __call__(self, goal_ind, action): 155 | # return a random action or a noisy action 156 | prob = self.rng.uniform(0, 1) 157 | if prob < self.chance[goal_ind]: 158 | return self.rng.uniform(-self.action_max, self.action_max, size=(self.action_dim,)) 159 | else: 160 | noise = self.scale*self.rng.normal(loc=self.mu, scale=self.sigma[goal_ind], size=(self.action_dim,)) 161 | return action + noise 162 | -------------------------------------------------------------------------------- /drl_implementation/agent/utils/networks_conv.py: -------------------------------------------------------------------------------- 1 | import torch as T 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | from torch.distributions import Normal 5 | 6 | 7 | class DQNetwork(nn.Module): 8 | def __init__(self, input_shape, action_dims, init_w=3e-3): 9 | super(DQNetwork, self).__init__() 10 | self.input_shape = input_shape 11 | # input_shape: tuple(c, h, w) 12 | self.features = nn.Sequential( 13 | nn.Conv2d(input_shape[0], 32, kernel_size=8, stride=4), 14 | nn.ReLU(), 15 | nn.Conv2d(32, 64, kernel_size=4, stride=2), 16 | nn.ReLU(), 17 | nn.Conv2d(64, 64, kernel_size=3, stride=1), 18 | nn.ReLU() 19 | ) 20 | 21 | x = T.randn([32] + list(input_shape)) 22 | self.conv_out_dim = self.features(x).view(x.size(0), -1).size(1) 23 | self.fc = nn.Linear(self.conv_out_dim, 512) 24 | self.v = nn.Linear(512, action_dims) 25 | T.nn.init.uniform_(self.v.weight.data, -init_w, init_w) 26 | T.nn.init.uniform_(self.v.bias.data, -init_w, init_w) 27 | 28 | def forward(self, obs): 29 | if obs.max() > 1.: 30 | obs = obs / 255. 
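        # pixel scaling: values above 1 are treated as raw 0-255 images and rescaled once,
        # so that the convolutional feature extractor below always receives inputs in [0, 1]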
31 | 32 | x = self.features(obs) 33 | x = x.view(x.size(0), -1) 34 | x = F.relu(self.fc(x)) 35 | value = self.v(x) 36 | return value 37 | 38 | def get_action(self, obs): 39 | values = self.forward(obs) 40 | return T.argmax(values).item() 41 | 42 | 43 | class StochasticConvActor(nn.Module): 44 | def __init__(self, action_dim, encoder, hidden_dim=1024, log_std_min=-10, log_std_max=2, action_scaling=1, 45 | detach_obs_encoder=False, 46 | goal_conditioned=False, detach_goal_encoder=True): 47 | super(StochasticConvActor, self).__init__() 48 | 49 | self.action_scaling = action_scaling 50 | self.encoder = encoder 51 | self.detach_obs_encoder = detach_obs_encoder 52 | self.log_std_min = log_std_min 53 | self.log_std_max = log_std_max 54 | trunk_input_dim = self.encoder.feature_dim 55 | self.goal_conditioned = goal_conditioned 56 | self.detach_goal_encoder = detach_goal_encoder 57 | if self.goal_conditioned: 58 | trunk_input_dim *= 2 59 | self.trunk = nn.Sequential( 60 | nn.Linear(trunk_input_dim, hidden_dim), nn.ReLU(), 61 | nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), 62 | nn.Linear(hidden_dim, 2 * action_dim) 63 | ) 64 | 65 | self.apply(orthogonal_init) 66 | 67 | def forward(self, obs, goal=None): 68 | feature = self.encoder(obs, detach=self.detach_obs_encoder) 69 | if self.goal_conditioned: 70 | assert goal is not None, "need a goal image for goal-conditioned network" 71 | goal_feature = self.encoder(goal, detach=self.detach_goal_encoder) 72 | feature = T.cat((feature, goal_feature), dim=1) 73 | 74 | mu, log_std = self.trunk(feature).chunk(2, dim=-1) 75 | log_std = T.clamp(log_std, self.log_std_min, self.log_std_max) 76 | return mu, log_std 77 | 78 | def get_action(self, obs, goal=None, epsilon=1e-6, mean_pi=False, probs=False, entropy=False): 79 | mean, log_std = self(obs, goal) 80 | if mean_pi: 81 | return T.tanh(mean) 82 | std = log_std.exp() 83 | mu = Normal(mean, std) 84 | z = mu.rsample() 85 | action = T.tanh(z) 86 | if not probs: 87 | return action * self.action_scaling 88 | else: 89 | log_probs = (mu.log_prob(z) - T.log(1 - action.pow(2) + epsilon)).sum(1, keepdim=True) 90 | if not entropy: 91 | return action * self.action_scaling, log_probs 92 | else: 93 | entropy = mu.entropy() 94 | return action * self.action_scaling, log_probs, entropy 95 | 96 | 97 | class ConvCritic(nn.Module): 98 | # Modified from https://github.com/PhilipZRH/ferm 99 | def __init__(self, action_dim, encoder, hidden_dim=1024, detach_obs_encoder=False, 100 | goal_conditioned=False, detach_goal_encoder=True): 101 | super(ConvCritic, self).__init__() 102 | 103 | self.encoder = encoder 104 | self.detach_obs_encoder = detach_obs_encoder 105 | trunk_input_dim = self.encoder.feature_dim 106 | self.goal_conditioned = goal_conditioned 107 | self.detach_goal_encoder = detach_goal_encoder 108 | if self.goal_conditioned: 109 | trunk_input_dim *= 2 110 | self.trunk = nn.Sequential( 111 | nn.Linear(trunk_input_dim + action_dim, hidden_dim), 112 | nn.ReLU(), 113 | nn.Linear(hidden_dim, hidden_dim), 114 | nn.ReLU(), 115 | nn.Linear(hidden_dim, 1) 116 | ) 117 | 118 | self.apply(orthogonal_init) 119 | 120 | def forward(self, obs, action, goal=None): 121 | # detach_encoder allows to stop gradient propagation to encoder 122 | feature = self.encoder(obs, detach=self.detach_obs_encoder) 123 | if self.goal_conditioned: 124 | assert goal is not None, "need a goal image for goal-conditioned network" 125 | goal_feature = self.encoder(goal, detach=self.detach_goal_encoder) 126 | feature = T.cat((feature, goal_feature), dim=1) 127 | 
trunk_input = T.cat([feature, action], dim=1) 128 | q = self.trunk(trunk_input) 129 | return q 130 | 131 | 132 | class CURL(nn.Module): 133 | # Modified from https://github.com/PhilipZRH/ferm 134 | def __init__(self, z_dim, encoder, encoder_target): 135 | super(CURL, self).__init__() 136 | self.encoder = encoder 137 | self.encoder_target = encoder_target 138 | assert z_dim == self.encoder.feature_dim == self.encoder_target.feature_dim 139 | self.W = nn.Parameter(T.rand(z_dim, z_dim)) 140 | 141 | def encode(self, x, detach=False, use_target=False): 142 | # if exponential moving average (ema) target is enabled, 143 | # then compute key values using target encoder without gradient, 144 | # else compute key values with the main encoder 145 | # from CURL https://arxiv.org/abs/2004.04136 146 | if use_target: 147 | with T.no_grad(): 148 | z_out = self.encoder_target(x) 149 | else: 150 | z_out = self.encoder(x) 151 | 152 | if detach: 153 | z_out = z_out.detach() 154 | return z_out 155 | 156 | def compute_score(self, z_a, z_pos): 157 | """ 158 | from CURL https://arxiv.org/abs/2004.04136 159 | - compute (B,B) matrix z_a (W z_pos.T) 160 | - positives are all diagonal elements 161 | - negatives are all other elements 162 | - to compute loss use multi-class cross entropy with identity matrix for labels 163 | """ 164 | Wz = T.matmul(self.W, z_pos.T) # (z_dim,B) 165 | score = T.matmul(z_a, Wz) # (B,B) 166 | score = score - T.max(score, 1)[0][:, None] 167 | return score 168 | 169 | 170 | class PixelEncoder(nn.Module): 171 | def __init__(self, obs_shape, feature_dim=50, num_layers=4, num_filters=32): 172 | # the encoder architecture adopted by SAC-AE, DrQ and CURL 173 | super(PixelEncoder, self).__init__() 174 | assert len(obs_shape) == 3 175 | self.obs_shape = obs_shape[-2:] 176 | self.feature_dim = feature_dim 177 | self.num_layers = num_layers 178 | 179 | self.convs = nn.ModuleList( 180 | [nn.Conv2d(obs_shape[0], num_filters, 3, stride=2)] 181 | ) 182 | for i in range(num_layers - 1): 183 | self.convs.append(nn.Conv2d(num_filters, num_filters, 3, stride=1)) 184 | 185 | x = T.randn([32] + list(obs_shape)) 186 | out_dim = self.forward_conv(x, flatten=False).shape[-1] 187 | self.trunk = nn.Sequential( 188 | nn.Linear(num_filters * out_dim * out_dim, self.feature_dim), 189 | nn.LayerNorm(self.feature_dim), 190 | nn.Tanh() 191 | ) 192 | 193 | def forward_conv(self, obs, flatten=True): 194 | if obs.max() > 1.: 195 | obs = obs / 255. 
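        # same 0-255 -> [0, 1] pixel-scaling convention as DQNetwork.forward; the conv stack below
        # produces a feature map that is flattened for the linear trunk when flatten=True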
196 | 197 | conv = T.relu(self.convs[0](obs)) 198 | for i in range(1, self.num_layers): 199 | conv = T.relu(self.convs[i](conv)) 200 | if flatten: 201 | conv = conv.reshape(conv.size(0), -1) 202 | return conv 203 | 204 | def forward(self, obs, detach=False): 205 | h = self.forward_conv(obs) 206 | if detach: 207 | h = h.detach() 208 | h = self.trunk(h) 209 | return h 210 | 211 | def copy_conv_weights_from(self, source): 212 | # only copy conv layers' weights 213 | for i in range(self.num_layers): 214 | self.convs[i].weight = source.convs[i].weight 215 | self.convs[i].bias = source.convs[i].bias 216 | 217 | 218 | class PixelDecoder(nn.Module): 219 | def __init__(self, obs_shape, feature_dim=50, num_layers=4, num_filters=32): 220 | # the encoder architecture adopted by SAC-AE, DrQ and CURL 221 | super(PixelDecoder, self).__init__() 222 | assert len(obs_shape) == 3 223 | self.obs_shape = obs_shape[-2:] 224 | self.feature_dim = feature_dim 225 | self.num_layers = num_layers 226 | 227 | # todo 228 | 229 | 230 | def orthogonal_init(m): 231 | if isinstance(m, nn.Linear): 232 | nn.init.orthogonal_(m.weight.data) 233 | if hasattr(m.bias, 'data'): 234 | m.bias.data.fill_(0.0) 235 | elif isinstance(m, nn.Conv2d) or isinstance(m, nn.ConvTranspose2d): 236 | gain = nn.init.calculate_gain('relu') 237 | nn.init.orthogonal_(m.weight.data, gain) 238 | if hasattr(m.bias, 'data'): 239 | m.bias.data.fill_(0.0) 240 | -------------------------------------------------------------------------------- /drl_implementation/agent/utils/networks_mlp.py: -------------------------------------------------------------------------------- 1 | import torch as T 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | from torch.distributions import Normal, Categorical 5 | 6 | 7 | class Actor(nn.Module): 8 | def __init__(self, input_dim, output_dim, fc1_size=256, fc2_size=256, fc3_size=256, init_w=3e-3, action_scaling=1): 9 | super(Actor, self).__init__() 10 | self.fc1 = nn.Linear(input_dim, fc1_size) 11 | self.fc2 = nn.Linear(fc1_size, fc2_size) 12 | self.fc3 = nn.Linear(fc2_size, fc3_size) 13 | self.pi = nn.Linear(fc3_size, output_dim) 14 | self.apply(orthogonal_init) 15 | self.action_scaling = action_scaling 16 | 17 | def forward(self, inputs): 18 | x = F.relu(self.fc1(inputs)) 19 | x = F.relu(self.fc2(x)) 20 | x = F.relu(self.fc3(x)) 21 | action = T.tanh(self.pi(x)) 22 | return action * self.action_scaling 23 | 24 | 25 | class StochasticActor(nn.Module): 26 | def __init__(self, input_dim, output_dim, log_std_min, log_std_max, continuous=True, agent_state_dim=0, 27 | fc1_size=256, fc2_size=256, fc3_size=256, init_w=3e-3, action_scaling=1): 28 | super(StochasticActor, self).__init__() 29 | self.continuous = continuous 30 | self.action_dim = output_dim 31 | self.agent_state_dim = agent_state_dim 32 | self.fc1 = nn.Linear(input_dim+agent_state_dim, fc1_size) 33 | self.fc2 = nn.Linear(fc1_size, fc2_size) 34 | if self.continuous: 35 | self.fc3 = nn.Linear(fc2_size, fc3_size) 36 | self.mean = nn.Linear(fc3_size, output_dim) 37 | self.log_std = nn.Linear(fc3_size, output_dim) 38 | else: 39 | self.fc3 = nn.Linear(fc2_size, output_dim) 40 | self.apply(orthogonal_init) 41 | self.log_std_min = log_std_min 42 | self.log_std_max = log_std_max 43 | self.action_scaling = action_scaling 44 | 45 | def forward(self, inputs): 46 | x = F.relu(self.fc1(inputs)) 47 | x = F.relu(self.fc2(x)) 48 | x = F.relu(self.fc3(x)) 49 | if self.continuous: 50 | mean = self.mean(x) 51 | log_std = self.log_std(x) 52 | log_std = T.clamp(log_std, 
self.log_std_min, self.log_std_max) 53 | return mean, log_std 54 | else: 55 | return x 56 | 57 | def get_action(self, inputs, std_scale=None, epsilon=1e-6, mean_pi=False, greedy=False, probs=False, entropy=False): 58 | if self.continuous: 59 | mean, log_std = self(inputs) 60 | if mean_pi: 61 | return T.tanh(mean) 62 | std = log_std.exp() 63 | if std_scale is not None: 64 | std *= std_scale 65 | mu = Normal(mean, std) 66 | z = mu.rsample() 67 | action = T.tanh(z) 68 | if not probs: 69 | return action * self.action_scaling 70 | else: 71 | if action.shape == (self.action_dim,): 72 | action = action.reshape((1, self.action_dim)) 73 | log_probs = (mu.log_prob(z) - T.log(1 - action.pow(2) + epsilon)).sum(1, keepdim=True) 74 | if not entropy: 75 | return action * self.action_scaling, log_probs 76 | else: 77 | entropy = mu.entropy() 78 | return action * self.action_scaling, log_probs, entropy 79 | else: 80 | logits = self(inputs) 81 | if greedy: 82 | actions = T.argmax(logits, dim=1, keepdim=True) 83 | return actions, None, None 84 | action_probs = F.softmax(logits, dim=1) 85 | dist = Categorical(action_probs) 86 | actions = dist.sample().view(-1, 1) 87 | log_probs = T.log(action_probs + epsilon).gather(1, actions) 88 | entropy = dist.entropy() 89 | return actions, log_probs, entropy 90 | 91 | def get_log_probs(self, inputs, actions, std_scale=None): 92 | actions /= self.action_scaling 93 | mean, log_std = self(inputs) 94 | std = log_std.exp() 95 | if std_scale is not None: 96 | std *= std_scale 97 | mu = Normal(mean, std) 98 | log_probs = mu.log_prob(actions) 99 | entropy = mu.entropy() 100 | return log_probs, entropy 101 | 102 | 103 | class Critic(nn.Module): 104 | def __init__(self, input_dim, output_dim, fc1_size=256, fc2_size=256, fc3_size=256, init_w=3e-3, softmax=False): 105 | super(Critic, self).__init__() 106 | self.fc1 = nn.Linear(input_dim, fc1_size) 107 | self.fc2 = nn.Linear(fc1_size, fc2_size) 108 | self.fc3 = nn.Linear(fc2_size, fc3_size) 109 | self.v = nn.Linear(fc3_size, output_dim) 110 | self.apply(orthogonal_init) 111 | self.softmax = softmax 112 | 113 | def forward(self, inputs): 114 | x = F.relu(self.fc1(inputs)) 115 | x = F.relu(self.fc2(x)) 116 | x = F.relu(self.fc3(x)) 117 | value = self.v(x) 118 | if not self.softmax: 119 | return value 120 | else: 121 | return F.softmax(value, dim=1) 122 | 123 | def get_action(self, inputs): 124 | values = self.forward(inputs) 125 | return T.argmax(values).item() 126 | 127 | 128 | def orthogonal_init(m): 129 | if isinstance(m, nn.Linear): 130 | nn.init.orthogonal_(m.weight.data) 131 | if hasattr(m.bias, 'data'): 132 | m.bias.data.fill_(0.0) 133 | elif isinstance(m, nn.Conv2d) or isinstance(m, nn.ConvTranspose2d): 134 | gain = nn.init.calculate_gain('relu') 135 | nn.init.orthogonal_(m.weight.data, gain) 136 | if hasattr(m.bias, 'data'): 137 | m.bias.data.fill_(0.0) 138 | -------------------------------------------------------------------------------- /drl_implementation/agent/utils/networks_pointnet.py: -------------------------------------------------------------------------------- 1 | import torch as T 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | from .pointnet_2_utils import PointNetSetAbstraction 5 | from .pointnet_utils import PointNetEncoder, feature_transform_reguliarzer 6 | 7 | 8 | class CriticPointNet(nn.Module): 9 | def __init__(self, output_dim, action_dim, agent_state_dim=0, normal_channel=False, softmax=False, goal_conditioned=False): 10 | super(CriticPointNet, self).__init__() 11 | in_channel = 6 if 
normal_channel else 3 12 | self.action_dim = action_dim 13 | self.agent_state_dim = agent_state_dim 14 | self.feat = PointNetEncoder(global_feat=True, feature_transform=True, channel=in_channel) 15 | self.goal_conditioned = goal_conditioned 16 | if self.goal_conditioned: 17 | self.fc1 = nn.Linear(2048+action_dim+agent_state_dim, 512) 18 | else: 19 | self.fc1 = nn.Linear(1024+action_dim+agent_state_dim, 512) 20 | self.fc2 = nn.Linear(512, 256) 21 | self.fc3 = nn.Linear(256, output_dim) 22 | self.dropout = nn.Dropout(p=0.4) 23 | self.bn1 = nn.BatchNorm1d(512) 24 | self.bn2 = nn.BatchNorm1d(256) 25 | self.softmax = softmax 26 | 27 | def forward(self, obs_xyz, action, goal_xyz=None, agent_state=None): 28 | x, trans, trans_feat = self.feat(obs_xyz) 29 | if self.goal_conditioned and goal_xyz is not None: 30 | goal_x, goal_trans, goal_trans_feat = self.feat(goal_xyz) 31 | x = T.cat([x, goal_x.detach()], dim=1) 32 | if agent_state is not None: 33 | assert agent_state.shape[1] == self.agent_state_dim 34 | x = T.cat([x, agent_state], dim=1) 35 | x = T.cat([x, action], dim=1) 36 | x = F.relu(self.bn1(self.fc1(x))) 37 | x = F.relu(self.bn2(self.dropout(self.fc2(x)))) 38 | value = self.fc3(x) 39 | 40 | if not self.softmax: 41 | return value 42 | else: 43 | return F.softmax(value, dim=1) 44 | 45 | def get_features(self, xyz, detach=False): 46 | x, trans, trans_feat = self.feat(xyz) 47 | if detach: 48 | x = x.detach() 49 | return x 50 | 51 | 52 | class CriticPointNet2(nn.Module): 53 | def __init__(self, output_dim, action_dim, normal_channel=False, softmax=False): 54 | super(CriticPointNet2, self).__init__() 55 | in_channel = 6 if normal_channel else 3 56 | self.normal_channel = normal_channel 57 | self.sa1 = PointNetSetAbstraction(npoint=512, radius=0.2, nsample=32, in_channel=in_channel, mlp=[64, 64, 128], 58 | group_all=False) 59 | self.sa2 = PointNetSetAbstraction(npoint=128, radius=0.4, nsample=64, in_channel=128 + 3, mlp=[128, 128, 256], 60 | group_all=False) 61 | self.sa3 = PointNetSetAbstraction(npoint=None, radius=None, nsample=None, in_channel=256 + 3, 62 | mlp=[256, 512, 1024], group_all=True) 63 | 64 | self.fc1 = nn.Linear(1024+action_dim, 512) 65 | self.bn1 = nn.BatchNorm1d(512) 66 | self.drop1 = nn.Dropout(0.4) 67 | self.fc2 = nn.Linear(512, 256) 68 | self.bn2 = nn.BatchNorm1d(256) 69 | self.drop2 = nn.Dropout(0.4) 70 | self.fc3 = nn.Linear(256, output_dim) 71 | self.softmax = softmax 72 | 73 | def forward(self, xyz): 74 | B, _, _ = xyz.shape 75 | if self.normal_channel: 76 | norm = xyz[:, 3:, :] 77 | xyz = xyz[:, :3, :] 78 | else: 79 | norm = None 80 | l1_xyz, l1_points = self.sa1(xyz, norm) 81 | l2_xyz, l2_points = self.sa2(l1_xyz, l1_points) 82 | l3_xyz, l3_points = self.sa3(l2_xyz, l2_points) 83 | x = l3_points.view(B, 1024) 84 | 85 | x = self.drop1(F.relu(self.bn1(self.fc1(x)))) 86 | x = self.drop2(F.relu(self.bn2(self.fc2(x)))) 87 | value = self.fc3(x) 88 | 89 | if not self.softmax: 90 | return value 91 | else: 92 | return F.softmax(value, dim=1) 93 | 94 | def get_features(self, xyz, detach=False): 95 | B, _, _ = xyz.shape 96 | if self.normal_channel: 97 | norm = xyz[:, 3:, :] 98 | xyz = xyz[:, :3, :] 99 | else: 100 | norm = None 101 | l1_xyz, l1_points = self.sa1(xyz, norm) 102 | l2_xyz, l2_points = self.sa2(l1_xyz, l1_points) 103 | l3_xyz, l3_points = self.sa3(l2_xyz, l2_points) 104 | x = l3_points.view(B, 1024) 105 | if detach: 106 | x = x.detach() 107 | return x 108 | 109 | def get_action(self, inputs): 110 | values = self.forward(inputs) 111 | return T.argmax(values).item() 112 | 
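CriticPointNet above consumes a raw point cloud shaped [batch, 3, num_points] plus an action vector and outputs a Q-value, reusing PointNetEncoder for the per-cloud global feature. The snippet below is a minimal, illustrative smoke test and not part of the repository; it assumes the package is installed with the layout shown in this tree and that PointNetEncoder returns a 1024-dimensional global feature per cloud, which is what the fc1 input size implies.

```python
# Illustrative sketch only: exercise CriticPointNet on random point clouds.
import torch as T
from drl_implementation.agent.utils.networks_pointnet import CriticPointNet

critic = CriticPointNet(output_dim=1, action_dim=4)  # Q(s, a) -> a single value
critic.eval()                                        # put BatchNorm/Dropout in inference mode

obs_xyz = T.randn(8, 3, 128)                         # batch of 8 clouds, 128 xyz points each
actions = T.randn(8, 4)                              # batch of 4-dimensional continuous actions

with T.no_grad():
    q_values = critic(obs_xyz, actions)              # expected shape: [8, 1]
    features = critic.get_features(obs_xyz, detach=True)  # expected shape: [8, 1024]
print(q_values.shape, features.shape)
```

With goal_conditioned=True the feature width doubles (hence the 2048-input fc1) and a goal cloud has to be passed as goal_xyz; its features are detached inside forward(), so gradients only flow through the observation branch.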
-------------------------------------------------------------------------------- /drl_implementation/agent/utils/normalizer.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | 4 | class Normalizer(object): 5 | def __init__(self, input_dims, init_mean, init_var, 6 | scale_factor=1, epsilon=1e-3, clip_range=None, activated=False): 7 | self.activated = activated 8 | self.input_dims = input_dims 9 | self.sample_count = 0 10 | self.history = [] 11 | self.history_mean = init_mean 12 | self.history_var = init_var 13 | if self.history_mean is None: 14 | self.history_mean = np.zeros(self.input_dims) 15 | if self.history_var is None: 16 | self.history_var = np.ones(self.input_dims) 17 | assert self.history_mean.shape == (self.input_dims,) 18 | assert self.history_var.shape == (self.input_dims,) 19 | self.epsilon = epsilon*np.ones(self.input_dims) 20 | if clip_range is None: 21 | clip_range = 1e3 22 | self.input_clip_range = (-clip_range*np.ones(self.input_dims), clip_range*np.ones(self.input_dims)) 23 | self.scale_factor = scale_factor 24 | 25 | def store_history(self, *args): 26 | self.history.append(*args) 27 | 28 | # update mean and var for z-score normalization 29 | def update_mean(self): 30 | if len(self.history) == 0: 31 | return 32 | new_sample_num = len(self.history) 33 | new_history = np.array(self.history, dtype=float) 34 | new_mean = np.mean(new_history, axis=0) 35 | 36 | new_var = np.sum(np.square(new_history - new_mean), axis=0) 37 | new_var = (self.sample_count * np.square(self.history_var) + new_var) 38 | new_var /= (new_sample_num + self.sample_count) 39 | self.history_var = np.sqrt(new_var) 40 | 41 | new_mean = (self.sample_count * self.history_mean + new_sample_num * new_mean) 42 | new_mean /= (new_sample_num + self.sample_count) 43 | self.history_mean = new_mean 44 | 45 | self.sample_count += new_sample_num 46 | self.history.clear() 47 | 48 | # pre-process inputs with z-score normalization and clipping (history_var tracks a running standard deviation, see update_mean) 49 | def __call__(self, inputs): 50 | if self.activated: 51 | inputs = (inputs - self.history_mean) / (self.history_var+self.epsilon) 52 | inputs = np.clip(inputs, self.input_clip_range[0], self.input_clip_range[1]) 53 | return self.scale_factor*inputs 54 | -------------------------------------------------------------------------------- /drl_implementation/agent/utils/plot.py: -------------------------------------------------------------------------------- 1 | import json 2 | import numpy as np 3 | import matplotlib as mpl 4 | 5 | mpl.use('Agg') 6 | import matplotlib.pyplot as plt 7 | from matplotlib.lines import Line2D 8 | from copy import deepcopy as dcp 9 | 10 | 11 | def smoothed_plot(file, data, x_label="Timesteps", y_label="Success rate", window=5): 12 | N = len(data) 13 | running_avg = np.empty(N) 14 | for t in range(N): 15 | running_avg[t] = np.mean(data[max(0, t - window):(t + 1)]) 16 | x = [i for i in range(N)] 17 | plt.ylabel(y_label) 18 | plt.xlabel(x_label) 19 | if x_label == "Epoch": 20 | x_tick_interval = len(data) // 10 21 | plt.xticks([n * x_tick_interval for n in range(11)]) 22 | plt.plot(x, running_avg) 23 | plt.savefig(file, bbox_inches='tight', dpi=500) 24 | plt.close() 25 | 26 | 27 | def smoothed_plot_multi_line(file, data, colors=None, linestyles=None, linewidths=None, alphas=None, 28 | legend=None, legend_loc="upper right", window=5, title=None, 29 | x_label='Timesteps', x_axis_off=False, xticks=None, xticklabels=None, 30 | y_label="Success rate", ylim=(None, None), y_axis_off=False, yticks=None, 
yticklabels=None, 31 | grid=False, 32 | horizontal_lines=None, ho_linestyle='--', ho_linewidth=4, ho_xmin=0.05, ho_xmax=0.95): 33 | if y_axis_off: 34 | plt.ylabel(None) 35 | plt.yticks([]) 36 | else: 37 | plt.ylabel(y_label) 38 | if yticks is not None: 39 | plt.yticks(yticks, yticklabels) 40 | if ylim[0] is not None: 41 | plt.ylim(ylim) 42 | if title is not None: 43 | plt.title(title) 44 | 45 | if x_axis_off: 46 | plt.xlabel(None) 47 | plt.xticks([]) 48 | else: 49 | plt.xlabel(x_label) 50 | if x_label == "Epoch": 51 | x_tick_interval = len(data[0]) // 10 52 | plt.xticks([n * x_tick_interval for n in range(11)]) 53 | if xticks is not None: 54 | plt.xticks(xticks, xticklabels) 55 | 56 | for t in range(len(data)): 57 | N = len(data[t]) 58 | x = [i for i in range(N)] 59 | if window != 0: 60 | running_avg = np.empty(N) 61 | for n in range(N): 62 | running_avg[n] = np.mean(data[t][max(0, n - window):(n + 1)]) 63 | else: 64 | running_avg = data[t] 65 | 66 | if colors is None: 67 | c = None 68 | else: 69 | assert len(colors) >= len(data) 70 | c = colors[t] 71 | 72 | if linestyles is None: 73 | ls = '-' 74 | else: 75 | assert len(linestyles) == len(data) 76 | ls = linestyles[t] 77 | 78 | if linewidths is None: 79 | linewidth = 1 80 | else: 81 | assert len(linewidths) == len(data) 82 | linewidth = linewidths[t] 83 | 84 | if alphas is None: 85 | alpha = 1 86 | else: 87 | assert len(alphas) == len(data) 88 | alpha = alphas[t] 89 | 90 | plt.plot(x, running_avg, c=c, linestyle=ls, linewidth=linewidth, alpha=alpha) 91 | 92 | if horizontal_lines is not None: 93 | for n in range(len(horizontal_lines)): 94 | plt.axhline(y=horizontal_lines[n], color=colors[len(data) + n], 95 | xmin=ho_xmin, xmax=ho_xmax, linestyle=ho_linestyle, linewidth=ho_linewidth) 96 | 97 | if legend is not None: 98 | plt.legend(legend, loc=legend_loc) 99 | 100 | if grid: 101 | plt.grid(True, linewidth=0.2) 102 | 103 | plt.savefig(file, bbox_inches='tight', dpi=500) 104 | plt.close() 105 | 106 | 107 | def smoothed_plot_mean_deviation(file, data_dict_list, title=None, 108 | vertical_lines=None, horizontal_lines=None, linestyle='--', linewidth=4, 109 | x_label='Timesteps', x_axis_off=False, xticks=None, 110 | y_label="Success rate", window=5, ylim=(None, None), y_axis_off=False, yticks=None, 111 | legend=None, legend_only=False, legend_file=None, legend_loc="upper right", 112 | legend_title=None, legend_bbox_to_anchor=None, legend_ncol=4, legend_frame=False, 113 | handlelength=2): 114 | colors = ['tab:blue', 'tab:orange', 'tab:green', 'tab:red', 'tab:purple', 115 | 'tab:brown', 'tab:pink', 'tab:gray', 'tab:olive', 'tab:cyan','k'] 116 | if not isinstance(data_dict_list, list): 117 | data_dict_list = [data_dict_list] 118 | 119 | if y_axis_off: 120 | plt.ylabel(None) 121 | plt.yticks([]) 122 | else: 123 | plt.ylabel(y_label) 124 | if yticks is not None: 125 | plt.yticks(yticks) 126 | if ylim[0] is not None: 127 | plt.ylim(ylim) 128 | if title is not None: 129 | plt.title(title) 130 | 131 | if x_axis_off: 132 | plt.xlabel(None) 133 | plt.xticks([]) 134 | else: 135 | plt.xlabel(x_label) 136 | if x_label == "Epoch": 137 | x_tick_interval = len(data_dict_list[0]["mean"]) // 10 138 | plt.xticks([n * x_tick_interval for n in range(11)]) 139 | if xticks is not None: 140 | plt.xticks(xticks) 141 | 142 | handles = [Line2D([0], [0], color=colors[i], linewidth=linewidth) for i in range(len(data_dict_list))] 143 | if legend is not None: 144 | legend_plot = plt.legend(handles, legend, handlelength=handlelength, 145 | title=legend_title, 
loc=legend_loc, labelspacing=0.15, 146 | bbox_to_anchor=legend_bbox_to_anchor, ncol=legend_ncol, frameon=legend_frame) 147 | if legend_only: 148 | assert legend_file is not None, 'specify legend save path' 149 | fig = legend_plot.figure 150 | fig.canvas.draw() 151 | bbox = legend_plot.get_window_extent().transformed(fig.dpi_scale_trans.inverted()) 152 | fig.savefig(legend_file, dpi=500, bbox_inches=bbox) 153 | plt.close() 154 | return 155 | 156 | N = len(data_dict_list[0]["mean"]) 157 | x = [i for i in range(N)] 158 | for i in range(len(data_dict_list)): 159 | case_data = data_dict_list[i] 160 | for key in case_data: 161 | running_avg = np.empty(N) 162 | for n in range(N): 163 | running_avg[n] = np.mean(case_data[key][max(0, n - window):(n + 1)]) 164 | 165 | case_data[key] = dcp(running_avg) 166 | 167 | plt.fill_between(x, case_data["upper"], case_data["lower"], alpha=0.3, color=colors[i], label='_nolegend_') 168 | plt.plot(x, case_data["mean"], color=colors[i]) 169 | 170 | if horizontal_lines is not None: 171 | for n in range(len(horizontal_lines)): 172 | plt.axhline(y=horizontal_lines[n], color=colors[len(data_dict_list) + n], xmin=0.05, xmax=0.95, 173 | linestyle=linestyle, linewidth=linewidth) 174 | if vertical_lines is not None: 175 | assert horizontal_lines is None 176 | for n in range(len(vertical_lines)): 177 | plt.axvline(x=vertical_lines[n], color=colors[len(data_dict_list) + n], ymin=0.05, ymax=0.95, 178 | linestyle=linestyle, linewidth=linewidth) 179 | 180 | plt.savefig(file, bbox_inches='tight', dpi=500) 181 | plt.close() 182 | 183 | 184 | def get_mean_and_deviation(data, save_data=False, file_name=None): 185 | upper = np.max(data, axis=0) 186 | lower = np.min(data, axis=0) 187 | mean = np.mean(data, axis=0) 188 | statistics = {"mean": mean.tolist(), 189 | "upper": upper.tolist(), 190 | "lower": lower.tolist()} 191 | if save_data: 192 | assert file_name is not None 193 | json.dump(statistics, open(file_name, 'w')) 194 | return statistics 195 | -------------------------------------------------------------------------------- /drl_implementation/agent/utils/pointnet_2_utils.py: -------------------------------------------------------------------------------- 1 | """ 2 | https://github.com/yanx27/Pointnet_Pointnet2_pytorch/blob/master/models/pointnet2_utils.py 3 | @article{Pytorch_Pointnet_Pointnet2, 4 | Author = {Xu Yan}, 5 | Title = {Pointnet/Pointnet++ Pytorch}, 6 | Journal = {https://github.com/yanx27/Pointnet_Pointnet2_pytorch}, 7 | Year = {2019} 8 | } 9 | """ 10 | import torch 11 | import torch.nn as nn 12 | import torch.nn.functional as F 13 | from time import time 14 | import numpy as np 15 | 16 | 17 | def timeit(tag, t): 18 | print("{}: {}s".format(tag, time() - t)) 19 | return time() 20 | 21 | 22 | def pc_normalize(pc): 23 | l = pc.shape[0] 24 | centroid = np.mean(pc, axis=0) 25 | pc = pc - centroid 26 | m = np.max(np.sqrt(np.sum(pc ** 2, axis=1))) 27 | pc = pc / m 28 | return pc 29 | 30 | 31 | def square_distance(src, dst): 32 | """ 33 | Calculate Euclid distance between each two points. 
34 | 35 | src^T * dst = xn * xm + yn * ym + zn * zm; 36 | sum(src^2, dim=-1) = xn*xn + yn*yn + zn*zn; 37 | sum(dst^2, dim=-1) = xm*xm + ym*ym + zm*zm; 38 | dist = (xn - xm)^2 + (yn - ym)^2 + (zn - zm)^2 39 | = sum(src**2,dim=-1)+sum(dst**2,dim=-1)-2*src^T*dst 40 | 41 | Input: 42 | src: source points, [B, N, C] 43 | dst: target points, [B, M, C] 44 | Output: 45 | dist: per-point square distance, [B, N, M] 46 | """ 47 | B, N, _ = src.shape 48 | _, M, _ = dst.shape 49 | dist = -2 * torch.matmul(src, dst.permute(0, 2, 1)) 50 | dist += torch.sum(src ** 2, -1).view(B, N, 1) 51 | dist += torch.sum(dst ** 2, -1).view(B, 1, M) 52 | return dist 53 | 54 | 55 | def index_points(points, idx): 56 | """ 57 | 58 | Input: 59 | points: input points data, [B, N, C] 60 | idx: sample index data, [B, S] 61 | Return: 62 | new_points:, indexed points data, [B, S, C] 63 | """ 64 | device = points.device 65 | B = points.shape[0] 66 | view_shape = list(idx.shape) 67 | view_shape[1:] = [1] * (len(view_shape) - 1) 68 | repeat_shape = list(idx.shape) 69 | repeat_shape[0] = 1 70 | batch_indices = torch.arange(B, dtype=torch.long).to(device).view(view_shape).repeat(repeat_shape) 71 | new_points = points[batch_indices, idx, :] 72 | return new_points 73 | 74 | 75 | def farthest_point_sample(xyz, npoint): 76 | """ 77 | Input: 78 | xyz: pointcloud data, [B, N, 3] 79 | npoint: number of samples 80 | Return: 81 | centroids: sampled pointcloud index, [B, npoint] 82 | """ 83 | device = xyz.device 84 | B, N, C = xyz.shape 85 | centroids = torch.zeros(B, npoint, dtype=torch.long).to(device) 86 | distance = torch.ones(B, N).to(device) * 1e10 87 | farthest = torch.randint(0, N, (B,), dtype=torch.long).to(device) 88 | batch_indices = torch.arange(B, dtype=torch.long).to(device) 89 | for i in range(npoint): 90 | centroids[:, i] = farthest 91 | centroid = xyz[batch_indices, farthest, :].view(B, 1, 3) 92 | dist = torch.sum((xyz - centroid) ** 2, -1) 93 | mask = dist < distance 94 | distance[mask] = dist[mask] 95 | farthest = torch.max(distance, -1)[1] 96 | return centroids 97 | 98 | 99 | def query_ball_point(radius, nsample, xyz, new_xyz): 100 | """ 101 | Input: 102 | radius: local region radius 103 | nsample: max sample number in local region 104 | xyz: all points, [B, N, 3] 105 | new_xyz: query points, [B, S, 3] 106 | Return: 107 | group_idx: grouped points index, [B, S, nsample] 108 | """ 109 | device = xyz.device 110 | B, N, C = xyz.shape 111 | _, S, _ = new_xyz.shape 112 | group_idx = torch.arange(N, dtype=torch.long).to(device).view(1, 1, N).repeat([B, S, 1]) 113 | sqrdists = square_distance(new_xyz, xyz) 114 | group_idx[sqrdists > radius ** 2] = N 115 | group_idx = group_idx.sort(dim=-1)[0][:, :, :nsample] 116 | group_first = group_idx[:, :, 0].view(B, S, 1).repeat([1, 1, nsample]) 117 | mask = group_idx == N 118 | group_idx[mask] = group_first[mask] 119 | return group_idx 120 | 121 | 122 | def sample_and_group(npoint, radius, nsample, xyz, points, returnfps=False): 123 | """ 124 | Input: 125 | npoint: 126 | radius: 127 | nsample: 128 | xyz: input points position data, [B, N, 3] 129 | points: input points data, [B, N, D] 130 | Return: 131 | new_xyz: sampled points position data, [B, npoint, nsample, 3] 132 | new_points: sampled points data, [B, npoint, nsample, 3+D] 133 | """ 134 | B, N, C = xyz.shape 135 | S = npoint 136 | fps_idx = farthest_point_sample(xyz, npoint) # [B, npoint, C] 137 | new_xyz = index_points(xyz, fps_idx) 138 | idx = query_ball_point(radius, nsample, xyz, new_xyz) 139 | grouped_xyz = index_points(xyz, idx) # 
[B, npoint, nsample, C] 140 | grouped_xyz_norm = grouped_xyz - new_xyz.view(B, S, 1, C) 141 | 142 | if points is not None: 143 | grouped_points = index_points(points, idx) 144 | new_points = torch.cat([grouped_xyz_norm, grouped_points], dim=-1) # [B, npoint, nsample, C+D] 145 | else: 146 | new_points = grouped_xyz_norm 147 | if returnfps: 148 | return new_xyz, new_points, grouped_xyz, fps_idx 149 | else: 150 | return new_xyz, new_points 151 | 152 | 153 | def sample_and_group_all(xyz, points): 154 | """ 155 | Input: 156 | xyz: input points position data, [B, N, 3] 157 | points: input points data, [B, N, D] 158 | Return: 159 | new_xyz: sampled points position data, [B, 1, 3] 160 | new_points: sampled points data, [B, 1, N, 3+D] 161 | """ 162 | device = xyz.device 163 | B, N, C = xyz.shape 164 | new_xyz = torch.zeros(B, 1, C).to(device) 165 | grouped_xyz = xyz.view(B, 1, N, C) 166 | if points is not None: 167 | new_points = torch.cat([grouped_xyz, points.view(B, 1, N, -1)], dim=-1) 168 | else: 169 | new_points = grouped_xyz 170 | return new_xyz, new_points 171 | 172 | 173 | class PointNetSetAbstraction(nn.Module): 174 | def __init__(self, npoint, radius, nsample, in_channel, mlp, group_all): 175 | super(PointNetSetAbstraction, self).__init__() 176 | self.npoint = npoint 177 | self.radius = radius 178 | self.nsample = nsample 179 | self.mlp_convs = nn.ModuleList() 180 | self.mlp_bns = nn.ModuleList() 181 | last_channel = in_channel 182 | for out_channel in mlp: 183 | self.mlp_convs.append(nn.Conv2d(last_channel, out_channel, 1)) 184 | self.mlp_bns.append(nn.BatchNorm2d(out_channel)) 185 | last_channel = out_channel 186 | self.group_all = group_all 187 | 188 | def forward(self, xyz, points): 189 | """ 190 | Input: 191 | xyz: input points position data, [B, C, N] 192 | points: input points data, [B, D, N] 193 | Return: 194 | new_xyz: sampled points position data, [B, C, S] 195 | new_points_concat: sample points feature data, [B, D', S] 196 | """ 197 | xyz = xyz.permute(0, 2, 1) 198 | if points is not None: 199 | points = points.permute(0, 2, 1) 200 | 201 | if self.group_all: 202 | new_xyz, new_points = sample_and_group_all(xyz, points) 203 | else: 204 | new_xyz, new_points = sample_and_group(self.npoint, self.radius, self.nsample, xyz, points) 205 | # new_xyz: sampled points position data, [B, npoint, C] 206 | # new_points: sampled points data, [B, npoint, nsample, C+D] 207 | new_points = new_points.permute(0, 3, 2, 1) # [B, C+D, nsample,npoint] 208 | for i, conv in enumerate(self.mlp_convs): 209 | bn = self.mlp_bns[i] 210 | new_points = F.relu(bn(conv(new_points))) 211 | 212 | new_points = torch.max(new_points, 2)[0] 213 | new_xyz = new_xyz.permute(0, 2, 1) 214 | return new_xyz, new_points 215 | 216 | 217 | class PointNetSetAbstractionMsg(nn.Module): 218 | def __init__(self, npoint, radius_list, nsample_list, in_channel, mlp_list): 219 | super(PointNetSetAbstractionMsg, self).__init__() 220 | self.npoint = npoint 221 | self.radius_list = radius_list 222 | self.nsample_list = nsample_list 223 | self.conv_blocks = nn.ModuleList() 224 | self.bn_blocks = nn.ModuleList() 225 | for i in range(len(mlp_list)): 226 | convs = nn.ModuleList() 227 | bns = nn.ModuleList() 228 | last_channel = in_channel + 3 229 | for out_channel in mlp_list[i]: 230 | convs.append(nn.Conv2d(last_channel, out_channel, 1)) 231 | bns.append(nn.BatchNorm2d(out_channel)) 232 | last_channel = out_channel 233 | self.conv_blocks.append(convs) 234 | self.bn_blocks.append(bns) 235 | 236 | def forward(self, xyz, points): 237 | """ 238 
| Input: 239 | xyz: input points position data, [B, C, N] 240 | points: input points data, [B, D, N] 241 | Return: 242 | new_xyz: sampled points position data, [B, C, S] 243 | new_points_concat: sample points feature data, [B, D', S] 244 | """ 245 | xyz = xyz.permute(0, 2, 1) 246 | if points is not None: 247 | points = points.permute(0, 2, 1) 248 | 249 | B, N, C = xyz.shape 250 | S = self.npoint 251 | new_xyz = index_points(xyz, farthest_point_sample(xyz, S)) 252 | new_points_list = [] 253 | for i, radius in enumerate(self.radius_list): 254 | K = self.nsample_list[i] 255 | group_idx = query_ball_point(radius, K, xyz, new_xyz) 256 | grouped_xyz = index_points(xyz, group_idx) 257 | grouped_xyz -= new_xyz.view(B, S, 1, C) 258 | if points is not None: 259 | grouped_points = index_points(points, group_idx) 260 | grouped_points = torch.cat([grouped_points, grouped_xyz], dim=-1) 261 | else: 262 | grouped_points = grouped_xyz 263 | 264 | grouped_points = grouped_points.permute(0, 3, 2, 1) # [B, D, K, S] 265 | for j in range(len(self.conv_blocks[i])): 266 | conv = self.conv_blocks[i][j] 267 | bn = self.bn_blocks[i][j] 268 | grouped_points = F.relu(bn(conv(grouped_points))) 269 | new_points = torch.max(grouped_points, 2)[0] # [B, D', S] 270 | new_points_list.append(new_points) 271 | 272 | new_xyz = new_xyz.permute(0, 2, 1) 273 | new_points_concat = torch.cat(new_points_list, dim=1) 274 | return new_xyz, new_points_concat 275 | 276 | 277 | class PointNetFeaturePropagation(nn.Module): 278 | def __init__(self, in_channel, mlp): 279 | super(PointNetFeaturePropagation, self).__init__() 280 | self.mlp_convs = nn.ModuleList() 281 | self.mlp_bns = nn.ModuleList() 282 | last_channel = in_channel 283 | for out_channel in mlp: 284 | self.mlp_convs.append(nn.Conv1d(last_channel, out_channel, 1)) 285 | self.mlp_bns.append(nn.BatchNorm1d(out_channel)) 286 | last_channel = out_channel 287 | 288 | def forward(self, xyz1, xyz2, points1, points2): 289 | """ 290 | Input: 291 | xyz1: input points position data, [B, C, N] 292 | xyz2: sampled input points position data, [B, C, S] 293 | points1: input points data, [B, D, N] 294 | points2: input points data, [B, D, S] 295 | Return: 296 | new_points: upsampled points data, [B, D', N] 297 | """ 298 | xyz1 = xyz1.permute(0, 2, 1) 299 | xyz2 = xyz2.permute(0, 2, 1) 300 | 301 | points2 = points2.permute(0, 2, 1) 302 | B, N, C = xyz1.shape 303 | _, S, _ = xyz2.shape 304 | 305 | if S == 1: 306 | interpolated_points = points2.repeat(1, N, 1) 307 | else: 308 | dists = square_distance(xyz1, xyz2) 309 | dists, idx = dists.sort(dim=-1) 310 | dists, idx = dists[:, :, :3], idx[:, :, :3] # [B, N, 3] 311 | 312 | dist_recip = 1.0 / (dists + 1e-8) 313 | norm = torch.sum(dist_recip, dim=2, keepdim=True) 314 | weight = dist_recip / norm 315 | interpolated_points = torch.sum(index_points(points2, idx) * weight.view(B, N, 3, 1), dim=2) 316 | 317 | if points1 is not None: 318 | points1 = points1.permute(0, 2, 1) 319 | new_points = torch.cat([points1, interpolated_points], dim=-1) 320 | else: 321 | new_points = interpolated_points 322 | 323 | new_points = new_points.permute(0, 2, 1) 324 | for i, conv in enumerate(self.mlp_convs): 325 | bn = self.mlp_bns[i] 326 | new_points = F.relu(bn(conv(new_points))) 327 | return new_points 328 | -------------------------------------------------------------------------------- /drl_implementation/agent/utils/pointnet_utils.py: -------------------------------------------------------------------------------- 1 | """ 2 | 
https://github.com/yanx27/Pointnet_Pointnet2_pytorch/blob/master/models/pointnet2_utils.py 3 | @article{Pytorch_Pointnet_Pointnet2, 4 | Author = {Xu Yan}, 5 | Title = {Pointnet/Pointnet++ Pytorch}, 6 | Journal = {https://github.com/yanx27/Pointnet_Pointnet2_pytorch}, 7 | Year = {2019} 8 | } 9 | """ 10 | import torch 11 | import torch.nn as nn 12 | import torch.nn.parallel 13 | import torch.utils.data 14 | from torch.autograd import Variable 15 | import numpy as np 16 | import torch.nn.functional as F 17 | 18 | 19 | class STN3d(nn.Module): 20 | def __init__(self, channel): 21 | super(STN3d, self).__init__() 22 | self.conv1 = torch.nn.Conv1d(channel, 64, 1) 23 | self.conv2 = torch.nn.Conv1d(64, 128, 1) 24 | self.conv3 = torch.nn.Conv1d(128, 1024, 1) 25 | self.fc1 = nn.Linear(1024, 512) 26 | self.fc2 = nn.Linear(512, 256) 27 | self.fc3 = nn.Linear(256, 9) 28 | self.relu = nn.ReLU() 29 | 30 | self.bn1 = nn.BatchNorm1d(64) 31 | self.bn2 = nn.BatchNorm1d(128) 32 | self.bn3 = nn.BatchNorm1d(1024) 33 | self.bn4 = nn.BatchNorm1d(512) 34 | self.bn5 = nn.BatchNorm1d(256) 35 | 36 | def forward(self, x): 37 | batchsize = x.size()[0] 38 | x = F.relu(self.bn1(self.conv1(x))) 39 | x = F.relu(self.bn2(self.conv2(x))) 40 | x = F.relu(self.bn3(self.conv3(x))) 41 | x = torch.max(x, 2, keepdim=True)[0] 42 | x = x.view(-1, 1024) 43 | 44 | x = F.relu(self.bn4(self.fc1(x))) 45 | x = F.relu(self.bn5(self.fc2(x))) 46 | x = self.fc3(x) 47 | 48 | iden = Variable(torch.from_numpy(np.array([1, 0, 0, 0, 1, 0, 0, 0, 1]).astype(np.float32))).view(1, 9).repeat( 49 | batchsize, 1) 50 | if x.is_cuda: 51 | iden = iden.cuda() 52 | x = x + iden 53 | x = x.view(-1, 3, 3) 54 | return x 55 | 56 | 57 | class STNkd(nn.Module): 58 | def __init__(self, k=64): 59 | super(STNkd, self).__init__() 60 | self.conv1 = torch.nn.Conv1d(k, 64, 1) 61 | self.conv2 = torch.nn.Conv1d(64, 128, 1) 62 | self.conv3 = torch.nn.Conv1d(128, 1024, 1) 63 | self.fc1 = nn.Linear(1024, 512) 64 | self.fc2 = nn.Linear(512, 256) 65 | self.fc3 = nn.Linear(256, k * k) 66 | self.relu = nn.ReLU() 67 | 68 | self.bn1 = nn.BatchNorm1d(64) 69 | self.bn2 = nn.BatchNorm1d(128) 70 | self.bn3 = nn.BatchNorm1d(1024) 71 | self.bn4 = nn.BatchNorm1d(512) 72 | self.bn5 = nn.BatchNorm1d(256) 73 | 74 | self.k = k 75 | 76 | def forward(self, x): 77 | batchsize = x.size()[0] 78 | x = F.relu(self.bn1(self.conv1(x))) 79 | x = F.relu(self.bn2(self.conv2(x))) 80 | x = F.relu(self.bn3(self.conv3(x))) 81 | x = torch.max(x, 2, keepdim=True)[0] 82 | x = x.view(-1, 1024) 83 | 84 | x = F.relu(self.bn4(self.fc1(x))) 85 | x = F.relu(self.bn5(self.fc2(x))) 86 | x = self.fc3(x) 87 | 88 | iden = Variable(torch.from_numpy(np.eye(self.k).flatten().astype(np.float32))).view(1, self.k * self.k).repeat( 89 | batchsize, 1) 90 | if x.is_cuda: 91 | iden = iden.cuda() 92 | x = x + iden 93 | x = x.view(-1, self.k, self.k) 94 | return x 95 | 96 | 97 | class PointNetEncoder(nn.Module): 98 | def __init__(self, global_feat=True, feature_transform=False, channel=3): 99 | super(PointNetEncoder, self).__init__() 100 | self.stn = STN3d(channel) 101 | self.conv1 = torch.nn.Conv1d(channel, 64, 1) 102 | self.conv2 = torch.nn.Conv1d(64, 128, 1) 103 | self.conv3 = torch.nn.Conv1d(128, 1024, 1) 104 | self.bn1 = nn.BatchNorm1d(64) 105 | self.bn2 = nn.BatchNorm1d(128) 106 | self.bn3 = nn.BatchNorm1d(1024) 107 | self.global_feat = global_feat 108 | self.feature_transform = feature_transform 109 | if self.feature_transform: 110 | self.fstn = STNkd(k=64) 111 | 112 | def forward(self, x): 113 | B, D, N = x.size() 114 | trans = 
self.stn(x) 115 | x = x.transpose(2, 1) 116 | if D > 3: 117 | feature = x[:, :, 3:] 118 | x = x[:, :, :3] 119 | x = torch.bmm(x, trans) 120 | if D > 3: 121 | x = torch.cat([x, feature], dim=2) 122 | x = x.transpose(2, 1) 123 | x = F.relu(self.bn1(self.conv1(x))) 124 | 125 | if self.feature_transform: 126 | trans_feat = self.fstn(x) 127 | x = x.transpose(2, 1) 128 | x = torch.bmm(x, trans_feat) 129 | x = x.transpose(2, 1) 130 | else: 131 | trans_feat = None 132 | 133 | pointfeat = x 134 | x = F.relu(self.bn2(self.conv2(x))) 135 | x = self.bn3(self.conv3(x)) 136 | x = torch.max(x, 2, keepdim=True)[0] 137 | x = x.view(-1, 1024) 138 | if self.global_feat: 139 | return x, trans, trans_feat 140 | else: 141 | x = x.view(-1, 1024, 1).repeat(1, 1, N) 142 | return torch.cat([x, pointfeat], 1), trans, trans_feat 143 | 144 | 145 | def feature_transform_reguliarzer(trans): 146 | d = trans.size()[1] 147 | I = torch.eye(d)[None, :, :] 148 | if trans.is_cuda: 149 | I = I.cuda() 150 | loss = torch.mean(torch.norm(torch.bmm(trans, trans.transpose(2, 1)) - I, dim=(1, 2))) 151 | return loss -------------------------------------------------------------------------------- /drl_implementation/agent/utils/segment_tree.py: -------------------------------------------------------------------------------- 1 | """ 2 | The segment tree implementation from OpenAI baseline GitHub repo: 3 | https://github.com/openai/baselines/blob/ea25b9e8b234e6ee1bca43083f8f3cf974143998/baselines/common/segment_tree.py 4 | This is used in the prioritized replay buffer. 5 | """ 6 | import operator 7 | 8 | 9 | class SegmentTree(object): 10 | def __init__(self, capacity, operation, neutral_element): 11 | """Build a Segment Tree data structure. 12 | https://en.wikipedia.org/wiki/Segment_tree 13 | Can be used as regular array, but with two 14 | important differences: 15 | a) setting item's value is slightly slower. 16 | It is O(lg capacity) instead of O(1). 17 | b) user has access to an efficient ( O(log segment size) ) 18 | `reduce` operation which reduces `operation` over 19 | a contiguous subsequence of items in the array. 20 | Paramters 21 | --------- 22 | capacity: int 23 | Total size of the array - must be a power of two. 24 | operation: lambda obj, obj -> obj 25 | and operation for combining elements (eg. sum, max) 26 | must form a mathematical group together with the set of 27 | possible values for array elements (i.e. be associative) 28 | neutral_element: obj 29 | neutral element for the operation above. eg. float('-inf') 30 | for max and 0 for sum. 31 | """ 32 | assert capacity > 0 and capacity & (capacity - 1) == 0, "capacity must be positive and a power of 2." 33 | self._capacity = capacity 34 | self._value = [neutral_element for _ in range(2 * capacity)] 35 | self._operation = operation 36 | 37 | def _reduce_helper(self, start, end, node, node_start, node_end): 38 | if start == node_start and end == node_end: 39 | return self._value[node] 40 | mid = (node_start + node_end) // 2 41 | if end <= mid: 42 | return self._reduce_helper(start, end, 2 * node, node_start, mid) 43 | else: 44 | if mid + 1 <= start: 45 | return self._reduce_helper(start, end, 2 * node + 1, mid + 1, node_end) 46 | else: 47 | return self._operation( 48 | self._reduce_helper(start, mid, 2 * node, node_start, mid), 49 | self._reduce_helper(mid + 1, end, 2 * node + 1, mid + 1, node_end) 50 | ) 51 | 52 | def reduce(self, start=0, end=None): 53 | """Returns result of applying `self.operation` 54 | to a contiguous subsequence of the array. 
55 | self.operation(arr[start], operation(arr[start+1], operation(... arr[end]))) 56 | Parameters 57 | ---------- 58 | start: int 59 | beginning of the subsequence 60 | end: int 61 | end of the subsequences 62 | Returns 63 | ------- 64 | reduced: obj 65 | result of reducing self.operation over the specified range of array elements. 66 | """ 67 | if end is None: 68 | end = self._capacity 69 | if end < 0: 70 | end += self._capacity 71 | end -= 1 72 | return self._reduce_helper(start, end, 1, 0, self._capacity - 1) 73 | 74 | def __setitem__(self, idx, val): 75 | # index of the leaf 76 | idx += self._capacity 77 | self._value[idx] = val 78 | idx //= 2 79 | while idx >= 1: 80 | self._value[idx] = self._operation( 81 | self._value[2 * idx], 82 | self._value[2 * idx + 1] 83 | ) 84 | idx //= 2 85 | 86 | def __getitem__(self, idx): 87 | assert 0 <= idx < self._capacity 88 | return self._value[self._capacity + idx] 89 | 90 | 91 | class SumSegmentTree(SegmentTree): 92 | def __init__(self, capacity): 93 | super(SumSegmentTree, self).__init__( 94 | capacity=capacity, 95 | operation=operator.add, 96 | neutral_element=0.0 97 | ) 98 | 99 | def sum(self, start=0, end=None): 100 | """Returns arr[start] + ... + arr[end]""" 101 | return super(SumSegmentTree, self).reduce(start, end) 102 | 103 | def find_prefixsum_idx(self, prefixsum): 104 | """Find the highest index `i` in the array such that 105 | sum(arr[0] + arr[1] + ... + arr[i - i]) <= prefixsum 106 | if array values are probabilities, this function 107 | allows to sample indexes according to the discrete 108 | probability efficiently. 109 | Parameters 110 | ---------- 111 | perfixsum: float 112 | upperbound on the sum of array prefix 113 | Returns 114 | ------- 115 | idx: int 116 | highest index satisfying the prefixsum constraint 117 | """ 118 | assert 0 <= prefixsum <= self.sum() + 1e-5 119 | idx = 1 120 | while idx < self._capacity: # while non-leaf 121 | if self._value[2 * idx] > prefixsum: 122 | idx = 2 * idx 123 | else: 124 | prefixsum -= self._value[2 * idx] 125 | idx = 2 * idx + 1 126 | return idx - self._capacity 127 | 128 | 129 | class MinSegmentTree(SegmentTree): 130 | def __init__(self, capacity): 131 | super(MinSegmentTree, self).__init__( 132 | capacity=capacity, 133 | operation=min, 134 | neutral_element=float('inf') 135 | ) 136 | 137 | def min(self, start=0, end=None): 138 | """Returns min(arr[start], ..., arr[end])""" 139 | 140 | return super(MinSegmentTree, self).reduce(start, end) 141 | -------------------------------------------------------------------------------- /drl_implementation/examples/KukaPushPHER.py: -------------------------------------------------------------------------------- 1 | # this example runs a goal-condition soft actor critic with prioritised+hindsight experience replay 2 | # on the Push task from the pybullet-multigoal-gym package 3 | 4 | import os 5 | import gym 6 | import pybullet_multigoal_gym as pmg 7 | from drl_implementation import GoalConditionedSAC, GoalConditionedDDPG 8 | # you can replace the agent instantiation by one of the two above, with the proper params 9 | 10 | ddpg_params = { 11 | 'hindsight': True, 12 | 'her_sampling_strategy': 'future', 13 | 'prioritised': True, 14 | 'memory_capacity': int(1e6), 15 | 'actor_learning_rate': 0.001, 16 | 'critic_learning_rate': 0.001, 17 | 'Q_weight_decay': 0.0, 18 | 'update_interval': 1, 19 | 'batch_size': 128, 20 | 'optimization_steps': 40, 21 | 'tau': 0.05, 22 | 'discount_factor': 0.98, 23 | 'clip_value': 50, 24 | 'discard_time_limit': True, 25 | 
'terminate_on_achieve': False, 26 | 'observation_normalization': True, 27 | 28 | 'random_action_chance': 0.2, 29 | 'noise_deviation': 0.05, 30 | 31 | 'training_epochs': 101, 32 | 'training_cycles': 50, 33 | 'training_episodes': 16, 34 | 'testing_gap': 1, 35 | 'testing_episodes': 30, 36 | 'saving_gap': 25, 37 | } 38 | # sac_params = { 39 | # 'hindsight': True, 40 | # 'her_sampling_strategy': 'future', 41 | # 'prioritised': True, 42 | # 'memory_capacity': int(1e6), 43 | # 'actor_learning_rate': 0.001, 44 | # 'critic_learning_rate': 0.001, 45 | # 'update_interval': 1, 46 | # 'batch_size': 128, 47 | # 'optimization_steps': 40, 48 | # 'tau': 0.005, 49 | # 'clip_value': 50, 50 | # 'discount_factor': 0.98, 51 | # 'discard_time_limit': True, 52 | # 'terminate_on_achieve': False, 53 | # 'observation_normalization': True, 54 | # 55 | # 'alpha': 0.5, 56 | # 'actor_update_interval': 1, 57 | # 'critic_target_update_interval': 1, 58 | # 59 | # 'training_epochs': 101, 60 | # 'training_cycles': 50, 61 | # 'training_episodes': 16, 62 | # 'testing_gap': 1, 63 | # 'testing_episodes': 30, 64 | # 'saving_gap': 25, 65 | # } 66 | seeds = [11, 22, 33, 44] 67 | seed_returns = [] 68 | seed_success_rates = [] 69 | path = os.path.dirname(os.path.realpath(__file__)) 70 | path = os.path.join(path, 'PushPHER') 71 | 72 | for seed in seeds: 73 | 74 | env = pmg.make_env(task='push', 75 | gripper='parallel_jaw', 76 | render=False, 77 | binary_reward=True, 78 | max_episode_steps=50, 79 | image_observation=False, 80 | depth_image=False, 81 | goal_image=False) 82 | # use the render env for visualization 83 | 84 | seed_path = path + '/seed'+str(seed) 85 | 86 | agent = GoalConditionedDDPG(algo_params=ddpg_params, env=env, path=seed_path, seed=seed) 87 | agent.run(test=False) 88 | # the sleep argument pause the rendering for a while every step, useful for slowing down visualization 89 | # agent.run(test=True, load_network_ep=50, sleep=0.05) 90 | seed_returns.append(agent.statistic_dict['epoch_test_return']) 91 | seed_success_rates.append(agent.statistic_dict['epoch_test_success_rate']) 92 | -------------------------------------------------------------------------------- /drl_implementation/examples/PendulumDDPG.py: -------------------------------------------------------------------------------- 1 | # this example runs a ddpg agent on a inverted pendulum swingup task from pybullet gym 2 | 3 | import os 4 | import pybullet_envs 5 | from drl_implementation import DDPG, SAC, TD3 6 | # you can replace the agent instantiation by one of the three above, with the proper params 7 | 8 | # td3_params = { 9 | # 'prioritised': True, 10 | # 'memory_capacity': int(1e6), 11 | # 'actor_learning_rate': 0.0003, 12 | # 'critic_learning_rate': 0.0003, 13 | # 'batch_size': 256, 14 | # 'optimization_steps': 1, 15 | # 'tau': 0.005, 16 | # 'discount_factor': 0.99, 17 | # 'discard_time_limit': True, 18 | # 'warmup_step': 2500, 19 | # 'target_noise': 0.2, 20 | # 'noise_clip': 0.5, 21 | # 'update_interval': 1, 22 | # 'actor_update_interval': 2, 23 | # 'observation_normalization': False, 24 | # 25 | # 'training_episodes': 101, 26 | # 'testing_gap': 10, 27 | # 'testing_episodes': 10, 28 | # 'saving_gap': 50, 29 | # } 30 | # sac_params = { 31 | # 'prioritised': True, 32 | # 'memory_capacity': int(1e6), 33 | # 'actor_learning_rate': 0.0003, 34 | # 'critic_learning_rate': 0.0003, 35 | # 'update_interval': 1, 36 | # 'batch_size': 256, 37 | # 'optimization_steps': 1, 38 | # 'tau': 0.005, 39 | # 'discount_factor': 0.99, 40 | # 'discard_time_limit': True, 41 | # 
'observation_normalization': False, 42 | # 43 | # 'alpha': 0.5, 44 | # 'actor_update_interval': 1, 45 | # 'critic_target_update_interval': 1, 46 | # 'warmup_step': 1000, 47 | # 48 | # 'training_episodes': 101, 49 | # 'testing_gap': 10, 50 | # 'testing_episodes': 10, 51 | # 'saving_gap': 50, 52 | # } 53 | ddpg_params = { 54 | 'prioritised': True, 55 | 'memory_capacity': int(1e6), 56 | 'actor_learning_rate': 0.001, 57 | 'critic_learning_rate': 0.001, 58 | 'Q_weight_decay': 0.0, 59 | 'update_interval': 1, 60 | 'batch_size': 100, 61 | 'optimization_steps': 1, 62 | 'tau': 0.005, 63 | 'discount_factor': 0.99, 64 | 'discard_time_limit': True, 65 | 'warmup_step': 2500, 66 | 'observation_normalization': False, 67 | 68 | 'training_episodes': 101, 69 | 'testing_gap': 10, 70 | 'testing_episodes': 10, 71 | 'saving_gap': 50, 72 | } 73 | seeds = [11, 22, 33, 44, 55, 66] 74 | seed_returns = [] 75 | path = os.path.dirname(os.path.realpath(__file__)) 76 | for seed in seeds: 77 | 78 | env = pybullet_envs.make("InvertedPendulumSwingupBulletEnv-v0") 79 | # call render() before training to visualize (pybullet-gym-specific) 80 | # env.render() 81 | seed_path = path + '/seed'+str(seed) 82 | 83 | agent = DDPG(algo_params=ddpg_params, env=env, path=seed_path, seed=seed) 84 | agent.run(test=False) 85 | # the sleep argument pause the rendering for a while at every env step, useful for slowing down visualization 86 | # agent.run(test=True, load_network_ep=50, sleep=0.05) 87 | seed_returns.append(agent.statistic_dict['episode_return']) 88 | del env, agent 89 | -------------------------------------------------------------------------------- /drl_implementation/examples/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/IanYangChina/DRL_Implementation/38812c9647e4bec8359908be444dff19b90257d5/drl_implementation/examples/__init__.py -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | matplotlib >= 3.3.2 2 | numpy >= 1.18 3 | torch >= 1.3.0 4 | json 5 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup, find_packages 2 | 3 | 4 | packages = find_packages() 5 | # Ensure that we don't pollute the global namespace. 6 | for p in packages: 7 | assert p == 'drl_implementation' or p.startswith('drl_implementation.') 8 | 9 | setup(name='drl-implementation', 10 | version='1.0.0', 11 | description='A collection of deep reinforcement learning algorithms for fast implementation', 12 | url='https://github.com/IanYangChina/DRL_Implementation', 13 | author='XintongYang', 14 | author_email='YangX66@cardiff.ac.uk', 15 | packages=packages, 16 | package_dir={'drl_implementation': 'drl_implementation'}, 17 | package_data={'drl_implementation': [ 18 | 'examples/*.md', 19 | ]}, 20 | classifiers=[ 21 | "Programming Language :: Python :: 3", 22 | "Operating System :: OS Independent", 23 | ]) 24 | -------------------------------------------------------------------------------- /src/README.md: -------------------------------------------------------------------------------- 1 | #### Some tips 2 | 3 | To run the FetchReach-v1 environment, you will need to install gym, mujoco and mujoco-py on your environment. 
4 | Here are the links: 5 | 6 | [OpenAI Gym](https://github.com/openai/gym) 7 | [Get a free Mujoco trial license, here](https://www.roboti.us/license.html) 8 | [Install Mujoco and Mujoco.py by OpenAI, here](https://github.com/openai/mujoco-py) 9 | 10 | Unfortunately, mujoco is difficult to install on Windows systems, but you might find some help on [this 11 | page](https://github.com/openai/mujoco-py/issues/253). I still suggest running these codes on Linux systems. 12 | 13 | The [pybullet-multigoal-gym](https://github.com/IanYangChina/pybullet_multigoal_gym) environment is a migration of the OpenAI Gym multigoal environments, developed by the author 14 | of this repo. It is free to use as it is based on Pybullet. You will need it to run the pybullet experiments. 15 | 16 | #### Some notes I made when I was implementing HER: 17 | * The original paper uses multiple CPUs to collect data; this implementation uses a single CPU. Multi-CPU support might be 18 | added in the future. 19 | * The actor and critic networks have 3 hidden layers, each with 256 units and ReLU activation; the critic output has no activation, 20 | while the actor output uses tanh and is rescaled. 21 | * The observation and the goal are concatenated and fed into both networks. 22 | * The original paper scales observations, goals and actions into [-5, 5] (no rescaling is needed with the Gym environment), 23 | and normalizes them to zero mean and unit standard deviation. The means and standard deviations are computed from encountered data. 24 | * The training process has 200 epochs of 50 cycles, each cycle consisting of 16 episodes followed by 40 optimization steps. The total 25 | number of episodes is 200\*50\*16=160000, each with 50 time steps. In other words, after every 16 episodes, 40 optimization steps are 26 | performed. 27 | * Each optimization step uses a mini-batch of size 128, sampled uniformly from a replay buffer with 10^6 capacity; the 28 | target networks are updated softly with tau=0.05. 29 | * Adam is used for learning with a learning rate of 0.001, the discount factor is 0.98, and the target value is clipped to 30 | [-1/(1-0.98), 0], that is, [-50, 0]. I think this is based on the 50 time steps they set for each episode, over which an 31 | agent can at worst accumulate a return of -50. 32 | * For exploration, they select a uniformly random action with 20% chance; with the remaining 80% chance, they 33 | add Gaussian noise to each action dimension, with a standard deviation equal to 5% of the max action bound (a minimal sketch of this scheme is given after this list). 34 | 35 | * The SAC agent doesn't need a behavioural policy. 36 | * The goal-conditioned **SAC** agent doesn't need value clipping. 37 | * Prioritised replay is supported.
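To make the exploration and value-clipping notes above concrete, here is a minimal numpy sketch. It only illustrates the recipe described in these notes, not the exact code used by the agents in this repo; the names `explore`, `clip_target`, `random_chance` and `noise_std_ratio` are invented for this example.

```
import numpy as np


def explore(policy_action, action_max, random_chance=0.2, noise_std_ratio=0.05, rng=np.random):
    # with 20% chance, take a uniformly random action within the bounds
    if rng.uniform() < random_chance:
        return rng.uniform(-action_max, action_max, size=policy_action.shape)
    # otherwise, perturb the actor's action with Gaussian noise whose standard
    # deviation is 5% of the max action bound, then clip back into the valid range
    noise = rng.normal(0.0, noise_std_ratio * action_max, size=policy_action.shape)
    return np.clip(policy_action + noise, -action_max, action_max)


def clip_target(q_target, gamma=0.98):
    # with per-step rewards in [-1, 0], the discounted return is bounded below by
    # -1/(1 - gamma) = -50, so the bootstrapped target is clipped into [-50, 0]
    return np.clip(q_target, -1.0 / (1.0 - gamma), 0.0)
```

The 0.2 and 0.05 defaults mirror the `random_action_chance` and `noise_deviation` entries in the Kuka push example config shown earlier.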
38 | 39 | #### Results on the task 'Push' 40 | 41 | -------------------------------------------------------------------------------- /src/figs.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/IanYangChina/DRL_Implementation/38812c9647e4bec8359908be444dff19b90257d5/src/figs.png -------------------------------------------------------------------------------- /src/pendulum.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/IanYangChina/DRL_Implementation/38812c9647e4bec8359908be444dff19b90257d5/src/pendulum.gif -------------------------------------------------------------------------------- /src/push.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/IanYangChina/DRL_Implementation/38812c9647e4bec8359908be444dff19b90257d5/src/push.gif -------------------------------------------------------------------------------- /tests/test.py: -------------------------------------------------------------------------------- 1 | from collections import namedtuple 2 | 3 | t = namedtuple('t', ['a', 'b']) 4 | 5 | print('b' in t._fields) --------------------------------------------------------------------------------
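A side note on `tests/test.py` above: it only checks namedtuple field membership via the `_fields` attribute. The snippet below is a purely hypothetical illustration of the same idiom applied to a transition tuple of the kind a replay buffer might store; the `Transition` fields are invented for this example and are not taken from the repo.

```
from collections import namedtuple

# hypothetical transition layout, for illustration only
Transition = namedtuple('Transition', ['state', 'action', 'reward', 'next_state', 'done'])


def has_field(tuple_cls, name):
    # namedtuple classes expose their field names through the _fields attribute,
    # which is what tests/test.py relies on
    return name in tuple_cls._fields


assert has_field(Transition, 'reward')
assert not has_field(Transition, 'weights')
```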